# CS152 – Computer Architecture and Engineering

Lecture 11 – Pipeline Control

2003-09-29

**Dave Patterson** (www.cs.berkeley.edu/~patterson)

www-inst.eecs.berkeley.edu/~cs152/

Patterson Fall 2003 © UCB

## Review: Pipelining

- Reduce CPI by overlapping many instructions
  - Average throughput of approximately 1 CPI with fast clock
- Utilize capabilities of the Datapath
  - start next instruction while working on the current one
  - limited by length of longest stage (plus fill/flush)
  - detect and resolve hazards
- What makes it easy
  - all instructions are the same length
  - just a few instruction formats
  - memory operands appear only in loads and stores
- What makes it hard?
  - structural hazards: suppose we had only one memory
  - control hazards: need to worry about branch instructions
  - data hazards: an instruction depends on a previous instruction

CS 152 L11 Pipeline 2 (2) Patterson Fall 2003 © UCB

## Recap: Ideal Pipelining

**Assume instructions are completely independent!**

Maximum Speedup ≤ Number of stages

$$\text{Speedup} \le \frac{\text{Time for unpipelined operation}}{\text{Time for longest stage}}$$

Example: 40 ns data path, 5 stages, longest stage is 10 ns, so Speedup ≤ 40/10 = 4

## FYI: MIPS R3000 clocking discipline

- 2-phase non-overlapping clocks
- Pipeline stage is two (level sensitive) latches

## MIPS R3000 Instruction Pipeline

- Stages: Inst Fetch, Decode / Reg. Read, ALU / E.A., Memory, Write Reg
- Activities, in pipeline order: TLB, I-Cache, RF, Operation, E.A. TLB, D-Cache, WB
- Resource usage (figure): TLB, I-cache, RF, ALU, TLB, D-Cache, WB
- Write in phase 1, read in phase 2 => eliminates bypass from WB

## Recall: Data Hazard on r1

With MIPS R3000 pipeline, no need to forward from WB stage

## Clarification about clock edges in lab4!

- Since registers have edge-triggered writes:
  - Must have everything set up at end of memory stage
  - This means that "M" register here is not necessary!
- Also, memories will be synchronous
  - Need to set up addresses and values in advance

## MIPS R3000 Multicycle Operations

Ex: Multiply, Divide, Cache Miss

- Use control word of local stage to step through multicycle operation
- Stall all stages above multicycle operation in the pipeline
- Drain (bubble) stages below it
- Alternatively, launch multiply/divide to autonomous unit, only stall pipe if attempt to get result before ready
  - This means stall mflo/mfhi in decode stage if multiply/divide still executing
  - Extra credit in Lab 5 does this

## Recall: Single cycle control!

## Data Stationary Control

- The Main Control generates the control signals during Reg/Dec
- Control signals for Exec (ExtOp, ALUSrc, ...)
are used 1 cycle later
- Control signals for Mem (MemWr, Branch) are used 2 cycles later
- Control signals for Wr (MemtoReg, RegWr) are used 3 cycles later

## Datapath + Data Stationary Control

## Administrivia

- Lab 4 Project document Thursday 9 PM, paper or email
- Reading Chapter 6, sections 6.1 to 6.5
- Midterm Wed Oct 8, 5:30-8:30 in 1 LeConte
- Midterm review Sunday Oct 4, 5 PM in 306 Soda
- Bring 1 page, handwritten notes, both sides
- Meet at LaVal's Northside afterwards for Pizza
- Office hours
  - Mon 4-5:30 Jack, Tue 3:30-5 Kurt, Wed 3-4:30 John, Thu 3:30-5 Ben
  - Dave's office hours Tue 3:30-5

## Let's Try it Out

These addresses are octal.

| Address | Instruction |
|---------|-------------|
| 10 | lw r1, r2(35) |
| 14 | addI r2, r2, 3 |
| 20 | sub r3, r4, r5 |
| 24 | beq r6, r7, 100 |
| 30 | ori r8, r9, 17 |
| 34 | add r10, r11, r12 |
| 100 | and r13, r14, 15 |

Pipeline contents on successive slides, one row per cycle, starting with "Start: Fetch 10":

| Fetch | Decode | Exec | Mem | WB |
|-------|--------|------|-----|----|
| 10 | | | | |
| 14 | 10 | | | |
| 20 | 14 | 10 | | |
| 24 | 20 | 14 | 10 | |
| 30 | 24 | 20 | 14 | 10 |
| 100 | 30 | 24 | 20 | 14 |
| 104 | 100 | 30 | 24 | 20 |
| 110 | 104 | 100 | 30 | 24 |
| 114 | 110 | 104 | 100 | 30 |

## Pipelined Processor

- Stalls propagate backwards to freeze previous stages
- Bubbles are introduced into the pipeline by placing "Noops" into the local stage while stalling previous stages.
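The stall and bubble rules above, together with the octal-address trace from "Let's Try it Out", can be sketched as a small occupancy model. This is only an illustration under stated assumptions, not the lab datapath: the `step` and `run` functions, the `stall_at` parameter, and the use of `None` as a bubble are all hypothetical names chosen for this sketch.

```python
# Illustrative model only: step, run, and stall_at are assumed names, not lab code.
STAGES = ["IF", "ID", "EX", "MEM", "WB"]

def step(pipe, next_instr, stall_at=None):
    """Advance a 5-slot pipeline by one cycle.

    pipe[0] is the fetch stage and pipe[4] is write-back; None is a bubble.
    If stall_at is given, that stage and every stage before it hold their
    instructions, a bubble enters the stage after it, and later stages drain.
    Returns (new_pipe, fetched); fetched is False when fetch was frozen.
    """
    if stall_at is None:
        return [next_instr] + pipe[:-1], True
    if not 0 <= stall_at < len(pipe) - 1:
        raise ValueError("stall_at must name a stage before write-back")
    frozen = pipe[:stall_at + 1]       # stalled stage and everything behind it
    drained = pipe[stall_at + 1:-1]    # stages ahead of the stall keep moving
    return frozen + [None] + drained, False

def run(fetch_order, cycles):
    """Feed instructions in fetch order with no stalls; return the final pipe."""
    pipe = [None] * len(STAGES)
    for addr in fetch_order[:cycles]:
        pipe, _ = step(pipe, addr)
    return pipe

# Fetch order taken from the slide titles (octal addresses). Address 34 never
# appears because fetch is redirected to 100 after 30 has been fetched.
FETCH_ORDER = [0o10, 0o14, 0o20, 0o24, 0o30, 0o100, 0o104, 0o110, 0o114]

if __name__ == "__main__":
    pipe = run(FETCH_ORDER, 6)
    print({s: (oct(a) if a is not None else "bubble") for s, a in zip(STAGES, pipe)})
    stalled, fetched = step(pipe, 0o104, stall_at=1)
    print(stalled, fetched)
```

Running six unstalled cycles reproduces the "Fetch 100, Dcd 30, Ex 24, Mem 20, WB 14" slide. Calling `step` with `stall_at=1` then shows the rule from the bullets: fetch and decode freeze, a bubble enters execute, and the instructions already past decode continue to drain.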
## Recap: Data Hazards

- Avoid some "by design"
  - eliminate WAR by always fetching operands early (DCD) in pipe
  - eliminate WAW by doing all WBs in order (last stage, static)
- Detect and resolve remaining ones
  - stall or forward (if possible)

## Hazard Detection

Suppose instruction i is about to be issued and a predecessor instruction j is in the instruction pipeline.

- A RAW hazard exists on register ρ if ρ ∈ Rregs(i) ∩ Wregs(j)
  - Keep a record of pending writes (for inst's in the pipe) and compare with operand regs of current instruction.
  - When an instruction issues, reserve its result register.
  - When an operation completes, remove its write reservation.
- A WAW hazard exists on register ρ if ρ ∈ Wregs(i) ∩ Wregs(j)
- A WAR hazard exists on register ρ if ρ ∈ Wregs(i) ∩ Rregs(j)

## Record of Pending Writes In Pipeline Registers

## Resolve RAW by "forwarding" (or bypassing)

- Detect nearest valid write op operand register and forward into op latches, bypassing remainder of the pipe
- Increase muxes to add paths from pipeline registers
- Data Forwarding = Data Bypassing

## What about memory operations?

- If instructions are initiated in order and operations always occur in the same stage, there can be no hazards between memory operations!
- What does delaying WB on arithmetic operations cost?
  - cycles?
  - hardware?
- What about data dependence on loads?
  - `R1 <- R4 + R5`
  - `R2 <- Mem[R2 + I]`
  - `R3 <- R2 + R1`
  - ⇒ "Delayed Loads"
- Can recognize this in decode stage and introduce bubble while stalling fetch stage (hint for lab 5!)
- Tricky situation:
  - `R1 <- Mem[R2 + I]`
  - `Mem[R3+34] <- R1`
  - Handle with bypass in memory stage!

## Compiler Avoiding Load Stalls:

## Question: Critical Path???

- Bypass path is invariably trouble
- Options?
  - Make logic really fast
  - Move forwarding after muxes
    - Problem: screws up branches that require forwarding!
    - Use same tricks as "carry-skip" adder to fix this?
    - This option may just push delay around...!

## Is CPI = 1 for our pipeline?

- Remember that CPI is an "average # cycles/inst"
  - CPI here is 1, since the average throughput is 1 instruction every cycle.
- What if there are stalls or multi-cycle execution?
  - Usually CPI > 1. How close can we get to 1??

## Recall: Compute CPI?

- Start with Base CPI
- Add stalls

$$\begin{aligned} CPI &= CPI_{base} + CPI_{stall} \\ CPI_{stall} &= STALL_{type-1} \times freq_{type-1} + STALL_{type-2} \times freq_{type-2} \end{aligned}$$

- Suppose:
  - CPI<sub>base</sub> = 1
  - freq<sub>branch</sub> = 20%, freq<sub>load</sub> = 30%
  - Suppose branches always cause 1 cycle stall
  - Loads cause a 100 cycle stall 1% of time
- Then: CPI = 1 + (1 × 0.20) + (100 × 0.30 × 0.01) = 1.5
- Multicycle? Could treat as:

$$CPI_{stall} = (CYCLES - CPI_{base}) \times freq_{inst}$$

## Case Study: MIPS R4000 (200 MHz)

8 Stage Pipeline:

- IF: first half of fetching of instruction; PC selection happens here as well as initiation of instruction cache access.
- IS: second half of access to instruction cache.
- RF: instruction decode and register fetch, hazard checking and also instruction cache hit detection.
- EX: execution, which includes effective address calculation, ALU operation, and branch target computation and condition evaluation.
- DF: data fetch, first half of access to data cache.
- DS: second half of access to data cache.
- TC: tag check, determine whether the data cache access hit.
- WB: write back for loads and register-register operations.

8 Stages: What is impact on Load delay? Branch delay? Why?
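One way to approach the question above is to count stage distances. The sketch below rests on simplifying assumptions that are not stated on the slide: a load's data can be forwarded at the end of DS (before the TC tag check finishes), a dependent instruction needs its operand at the start of EX, and a branch outcome known at the end of EX can redirect IF in the following cycle. The function name `delay_cycles` and the 5-stage stage names are hypothetical.

```python
# Stage lists; the 5-stage names are an assumed labelling for comparison.
R4000_STAGES = ["IF", "IS", "RF", "EX", "DF", "DS", "TC", "WB"]
FIVE_STAGES = ["IF", "ID", "EX", "MEM", "WB"]

def delay_cycles(stages, produced_in, needed_in):
    """Number of following instruction slots that cannot see a result.

    Assumes the result exists at the end of `produced_in` and that a
    dependent instruction needs it at the start of `needed_in`, with
    forwarding between those two points.
    """
    return stages.index(produced_in) - stages.index(needed_in)

if __name__ == "__main__":
    print("R4000 load delay:    ", delay_cycles(R4000_STAGES, "DS", "EX"))
    print("R4000 branch delay:  ", delay_cycles(R4000_STAGES, "EX", "IF"))
    print("5-stage load delay:  ", delay_cycles(FIVE_STAGES, "MEM", "EX"))
    print("5-stage branch delay:", delay_cycles(FIVE_STAGES, "ID", "IF"))
```

Under these assumptions the R4000 comes out at 2 cycles for loads and 3 for branches, which agrees with the two-cycle load latency and three-cycle branch latency stated in the R4000 case study. The 5-stage numbers (1 and 1) assume the branch is resolved in decode, as the earlier octal trace suggests.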
## Case Study: MIPS R4000

TWO Cycle Load Latency (columns are successive clock cycles, rows are successive instructions):

| 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
|----|----|----|----|----|----|----|----|
| IF | IS | RF | EX | DF | DS | TC | WB |
| | IF | IS | RF | EX | DF | DS | TC |
| | | IF | IS | RF | EX | DF | DS |
| | | | IF | IS | RF | EX | DF |
| | | | | IF | IS | RF | EX |
| | | | | | IF | IS | RF |
| | | | | | | IF | IS |

THREE Cycle Branch Latency (conditions evaluated during EX phase):

- Delay slot plus two stalls
- Branch likely cancels delay slot if not taken
- (Figure: the same staggered IF, IS, RF, EX, DF, DS, TC, WB pipeline diagram.)

## MIPS R4000 Floating Point

- FP Adder, FP Multiplier, FP Divider
- Last step of FP Multiplier/Divider uses FP Adder HW
- 8 kinds of stages in FP units:

| Stage | Functional unit | Description |
|-------|-----------------|----------------------------|
| A | FP adder | Mantissa ADD stage |
| D | FP divider | Divide pipeline stage |
| E | FP multiplier | Exception test stage |
| M | FP multiplier | First stage of multiplier |
| N | FP multiplier | Second stage of multiplier |
| R | FP adder | Rounding stage |
| S | FP adder | Operand shift stage |
| U | | Unpack FP numbers |

## MIPS FP Pipe Stages

| FP Instr | Pipe stages, in cycle order |
|----------|-----------------------------|
| Add, Subtract | U, S+A, A+R, R+S |
| Multiply | U, E+M, M, M, M, N, N+A, R |
| Divide | U, A, R, $D^{28}$, D+A, D+R, D+R, D+A, D+R, A, R |
| Square root | U, E, $(A+R)^{108}$, A, R |
| Negate | U, S |
| Absolute value | U, S |
| FP compare | U, A, R |

Stages:

- M: First stage of multiplier
- N: Second stage of multiplier
- R: Rounding stage
- S: Operand shift stage
- U: Unpack FP numbers
- A: Mantissa ADD stage
- D: Divide pipeline
stage
- E: Exception test stage

## R4000 Performance

- Not ideal CPI of 1:
  - FP structural stalls: Not enough FP hardware (parallelism)
  - FP result stalls: RAW data hazard (latency)
  - Branch stalls (2 cycles + unfilled slots)

## Summary

- What makes it easy
  - all instructions are the same length
  - just a few instruction formats
  - memory operands appear only in loads and stores
- Hazards limit performance
  - Structural: need more HW resources
  - Data: need forwarding, compiler scheduling
  - Control: early evaluation & PC, delayed branch, prediction
- Data hazards must be handled carefully:
  - RAW data hazards handled by forwarding
  - WAW and WAR hazards don't exist in 5-stage pipeline
- MIPS I instruction set architecture made pipeline visible (delayed branch, delayed load)
- More performance from deeper pipelines, parallelism
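The hazard classes named in the summary follow the set definitions on the Hazard Detection slide, so they can be sketched directly as set intersections. This is a minimal illustration, assuming a hypothetical `Instr` record with explicit read and write register sets and a hypothetical `PendingWrites` helper; it is not how the lab datapath encodes instructions, and it assumes at most one pending writer per register.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Instr:
    """Hypothetical instruction record: a name plus register read/write sets."""
    name: str
    reads: frozenset = field(default_factory=frozenset)
    writes: frozenset = field(default_factory=frozenset)

def hazards(i, j):
    """Registers involved in each hazard between issuing i and earlier j."""
    return {
        "RAW": i.reads & j.writes,   # i reads what j has yet to write
        "WAW": i.writes & j.writes,  # both write the same register
        "WAR": i.writes & j.reads,   # i writes what j has yet to read
    }

class PendingWrites:
    """Record of result registers reserved by instructions still in the pipe."""

    def __init__(self):
        self._regs = set()

    def reserve(self, instr):
        # On issue, reserve the instruction's result register(s).
        self._regs |= instr.writes

    def complete(self, instr):
        # On completion, remove the write reservation.
        self._regs -= instr.writes

    def must_stall(self, instr):
        # RAW check: any operand register with a pending write.
        return bool(instr.reads & self._regs)

if __name__ == "__main__":
    # The load-use sequence from the "What about memory operations?" slide.
    add1 = Instr("R1 <- R4 + R5", frozenset({"R4", "R5"}), frozenset({"R1"}))
    load = Instr("R2 <- Mem[R2 + I]", frozenset({"R2"}), frozenset({"R2"}))
    add2 = Instr("R3 <- R2 + R1", frozenset({"R2", "R1"}), frozenset({"R3"}))
    print(hazards(add2, load))
    print(hazards(add2, add1))
```

For the slide's load-use sequence, the final add has a RAW hazard on R2 against the load and on R1 against the first add, with no WAW or WAR hazards. A real pipeline would resolve the R1 case by forwarding and the R2 case with a bubble, as described earlier in the lecture.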