# CS152 – Computer Architecture and Engineering: Lecture 11 – Pipeline Control

2003-09-29

Dave Patterson (www.cs.berkeley.edu/~patterson)

www-inst.eecs.berkeley.edu/~cs152/

#### Review: Pipelining

- Reduce CPI by overlapping many instructions
  - Average throughput of approximately 1 CPI with fast clock
- Utilize capabilities of the Datapath
  - start next instruction while working on the current one
  - limited by length of longest stage (plus fill/flush)
  - detect and resolve hazards
- What makes it easy
  - all instructions are the same length
  - just a few instruction formats
  - memory operands appear only in loads and stores
- What makes it hard?
  - structural hazards: suppose we had only one memory
  - control hazards: need to worry about branch instructions
  - data hazards: an instruction depends on a previous instruction

Patterson Fall 2003 © UCB

#### Administrivia

- Lab 4: project document Thursday 9 PM, paper or email
- Reading: Chapter 6, sections 6.1 to 6.5
- Midterm Wed Oct 8, 5:30-8:30 in 1 LeConte
  - Midterm review Sunday Oct 4, 5 PM in 306 Soda
  - Bring 1 page, handwritten notes, both sides
  - Meet at LaVal's Northside afterwards for Pizza
- Office hours
  - Mon 4-5:30 Jack, Tue 3:30-5 Kurt, Wed 3-4:30 John, Thu 3:30-5 Ben
  - Dave's office hours Tue 3:30-5

#### Is CPI = 1 for our pipeline?

- Remember that CPI is an "average # cycles/inst"
  - CPI here is 1, since the average throughput is 1 instruction every cycle.
- What if there are stalls or multi-cycle execution?
  - Usually CPI > 1. How close can we get to 1?

#### Recall: Compute CPI?
- Start with Base CPI
- Add stalls

$$\begin{split} \text{CPI} &= \text{CPI}_{\text{base}} + \text{CPI}_{\text{stall}} \\ \text{CPI}_{\text{stall}} &= \text{STALL}_{\text{type-1}} \times \text{freq}_{\text{type-1}} + \text{STALL}_{\text{type-2}} \times \text{freq}_{\text{type-2}} \end{split}$$

- Suppose:
  - $\text{CPI}_{\text{base}} = 1$
  - $\text{freq}_{\text{branch}} = 20\%$, $\text{freq}_{\text{load}} = 30\%$
  - Suppose branches always cause a 1 cycle stall
  - Loads cause a 100 cycle stall 1% of the time
- Then: $\text{CPI} = 1 + (1 \times 0.20) + (100 \times 0.30 \times 0.01) = 1.5$
- Multicycle? Could treat as: $\text{CPI}_{\text{stall}} = (\text{CYCLES} - \text{CPI}_{\text{base}}) \times \text{freq}_{\text{inst}}$

#### Case Study: MIPS R4000 (200 MHz)

- 8 Stage Pipeline:
  - IF: first half of fetching of instruction; PC selection happens here as well as initiation of instruction cache access.
  - IS: second half of access to instruction cache.
  - RF: instruction decode and register fetch, hazard checking and also instruction cache hit detection.
  - EX: execution, which includes effective address calculation, ALU operation, and branch target computation and condition evaluation.
  - DF: data fetch, first half of access to data cache.
  - DS: second half of access to data cache.
  - TC: tag check, determine whether the data cache access hit.
  - WB: write back for loads and register-register operations.
- 8 Stages: What is impact on Load delay? Branch delay? Why?
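One way to approach the load-delay and branch-delay question is to count the stages between where a value is produced and where the next instruction needs it. The sketch below is illustrative only. `delay_cycles` is a hypothetical helper, and it assumes that load data is forwarded at the end of DS (before the TC tag check completes) and that branches are resolved in EX, where the stage list places condition evaluation. The 5-stage comparison assumes the classic IF/ID/EX/MEM/WB pipeline with branches resolved in ID.

```python
# Stage order of the MIPS R4000 integer pipeline, as listed above.
R4000_STAGES = ["IF", "IS", "RF", "EX", "DF", "DS", "TC", "WB"]

# Assumed classic 5-stage pipeline, for comparison only.
FIVE_STAGES = ["IF", "ID", "EX", "MEM", "WB"]


def delay_cycles(stages, produced_in, needed_in):
    """Cycles a back-to-back dependent instruction must wait.

    Hypothetical helper: assumes the value is ready at the END of
    `produced_in`, is consumed at the START of `needed_in`, and that
    forwarding exists between the two stages.
    """
    return stages.index(produced_in) - stages.index(needed_in)


# Load: data assumed ready after DS, needed by the next instruction in EX.
r4000_load_delay = delay_cycles(R4000_STAGES, "DS", "EX")
# Branch: outcome known after EX, needed to steer the next fetch in IF.
r4000_branch_delay = delay_cycles(R4000_STAGES, "EX", "IF")

# Same reasoning on the assumed 5-stage pipeline.
five_load_delay = delay_cycles(FIVE_STAGES, "MEM", "EX")
five_branch_delay = delay_cycles(FIVE_STAGES, "ID", "IF")

print("R4000 load delay:", r4000_load_delay)      # 2
print("R4000 branch delay:", r4000_branch_delay)  # 3
print("5-stage load delay:", five_load_delay)     # 1
print("5-stage branch delay:", five_branch_delay) # 1
```

Under these assumptions the deeper pipeline costs more per hazard because the producing and consuming stages sit further apart, even though each individual stage is shorter.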
#### Case Study: MIPS R4000

TWO cycle load latency:

| Instruction | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
|-------------|----|----|----|----|----|----|----|----|
| i | IF | IS | RF | EX | DF | DS | TC | WB |
| i+1 | | IF | IS | RF | EX | DF | DS | TC |
| i+2 | | | IF | IS | RF | EX | DF | DS |
| i+3 | | | | IF | IS | RF | EX | DF |
| i+4 | | | | | IF | IS | RF | EX |
| i+5 | | | | | | IF | IS | RF |
| i+6 | | | | | | | IF | IS |
| i+7 | | | | | | | | IF |

THREE cycle branch latency (conditions evaluated during EX phase):

- Delay slot plus two stalls
- Branch likely cancels delay slot if not taken

| Instruction | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
|-------------|----|----|----|----|----|----|----|----|
| i | IF | IS | RF | EX | DF | DS | TC | WB |
| i+1 | | IF | IS | RF | EX | DF | DS | TC |
| i+2 | | | IF | IS | RF | EX | DF | DS |
| i+3 | | | | IF | IS | RF | EX | DF |
| i+4 | | | | | IF | IS | RF | EX |
| i+5 | | | | | | IF | IS | RF |
| i+6 | | | | | | | IF | IS |
| i+7 | | | | | | | | IF |

#### MIPS R4000 Floating Point

- FP Adder, FP Multiplier, FP Divider
- Last step of FP Multiplier/Divider uses FP Adder HW
- 8 kinds of stages in FP units:

| Stage | Functional unit | Description |
|-------|-----------------|----------------------------|
| A | FP adder | Mantissa ADD stage |
| D | FP divider | Divide pipeline stage |
| E | FP multiplier | Exception test stage |
| M | FP multiplier | First stage of multiplier |
| N | FP multiplier | Second stage of multiplier |
| R | FP adder | Rounding stage |
| S | FP adder | Operand shift stage |
| U | | Unpack FP numbers |

#### MIPS FP Pipe Stages

| FP Instr | Pipe stages (one entry per cycle) |
|----------------|-----------------------------------|
| Add, Subtract | U, S+A, A+R, R+S |
| Multiply | U, E+M, M, M, M, N, N+A, R |
| Divide | U, A, R, D<sup>28</sup>, ..., D+A, D+R, D+A, D+R, A, R |
| Square root | U, E, (A+R)<sup>108</sup>, ..., A, R |
| Negate | U, S |
| Absolute value | U, S |
| FP compare | U, A, R |

Stages:

- A: Mantissa ADD stage
- D: Divide pipeline stage
- E: Exception test stage
- M: First stage of multiplier
- N: Second stage of multiplier
- R: Rounding stage
- S: Operand shift stage
- U: Unpack FP numbers

#### Summary

- What makes it easy
  - all instructions are the same length
  - just a few instruction formats
  - memory operands appear only in loads and stores
- Hazards limit performance
  - Structural: need more HW resources
  - Data: need forwarding, compiler scheduling
  - Control: early evaluation & PC, delayed branch, prediction
- Data hazards must be handled carefully:
  - RAW data hazards handled by forwarding
  - WAW and WAR hazards don't exist in 5-stage pipeline
- MIPS I instruction set architecture made pipeline visible (delayed branch, delayed load)
- More performance from deeper pipelines, parallelism
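To put a number on "hazards limit performance", the stall formula from the "Recall: Compute CPI?" slide can be evaluated directly. This is a minimal sketch. The function names `cpi_with_stalls` and `multicycle_stall` and the (stall cycles, frequency) pair layout are illustrative choices, not part of the lecture. The inputs are the lecture's own numbers: base CPI of 1, 20% branches with a 1-cycle stall, and 30% loads that stall 100 cycles 1% of the time.

```python
def cpi_with_stalls(cpi_base, stall_sources):
    """CPI = CPI_base + sum of (stall cycles * frequency) over all sources.

    `stall_sources` is a list of (stall_cycles, frequency) pairs, where
    frequency is the fraction of all instructions that incur the stall.
    """
    return cpi_base + sum(cycles * freq for cycles, freq in stall_sources)


def multicycle_stall(cycles, cpi_base, freq_inst):
    """Treat a multicycle instruction as a stall: (CYCLES - CPI_base) * freq."""
    return (cycles - cpi_base) * freq_inst


# Lecture example: every branch stalls 1 cycle; 1% of loads stall 100 cycles.
branch_stall = (1, 0.20)
load_miss_stall = (100, 0.30 * 0.01)

cpi = cpi_with_stalls(1.0, [branch_stall, load_miss_stall])
print(f"CPI = {cpi:.2f}")  # CPI = 1.50
```

The branch term (0.2) and the rare but expensive load term (0.3) contribute comparable amounts, which is why both control hazards and memory stalls matter when trying to keep CPI close to 1.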