# CS61C: Machine Structures

Lecture 5.2.2: Pipelining I

2004-07-22

Kurt Meinz

inst.eecs.berkeley.edu/~cs61c

## Review Datapath (1/3)

- Datapath is the hardware that performs the operations necessary to execute programs.
- Control instructs the datapath on what to do next.
- Datapath needs:
  - access to storage (general-purpose registers and memory)
  - computational ability (ALU)
  - helper hardware (local registers and PC)

CS 61C L5.2.2 Pipelining I, K. Meinz, Summer 2004 © UCB

## Review Datapath (2/3)

- Five stages of the datapath (executing an instruction):
  1. Instruction Fetch (Increment PC)
  2. Instruction Decode (Read Registers)
  3. ALU (Computation)
  4. Memory Access
  5. Write to Registers
- ALL instructions must go through ALL five stages.

Stage by stage:

1. IFetch: Fetch Instruction, Increment PC
2. Decode Instruction, Read Registers
3. Execute:
   - Mem-ref: Calculate Address
   - Arith-log: Perform Operation
4. Memory:
   - Load: Read Data from Memory
   - Store: Write Data to Memory
5. Write Back: Write Data to Register

## Example

- Suppose 2 ns for memory access, 2 ns for ALU operation, and 1 ns for register file read or write; compute instruction throughput.
- Nonpipelined execution:
  - lw: IF + Read Reg + ALU + Memory + Write Reg = 2 + 1 + 2 + 2 + 1 = 8 ns
  - add: IF + Read Reg + ALU + Write Reg = 2 + 1 + 2 + 1 = 6 ns
- Pipelined execution:
  - Max(IF, Read Reg, ALU, Memory, Write Reg) = 2 ns, so one instruction completes every 2 ns once the pipeline is full

## Example

- Suppose 2 ns for memory access, 2 ns for ALU operation, and 1 ns for register file read or write; compute instruction latency.
- Nonpipelined execution:
  - lw: IF + Read Reg + ALU + Memory + Write Reg = 2 + 1 + 2 + 2 + 1 = 8 ns
  - add: IF + Read Reg + ALU + Write Reg = 2 + 1 + 2 + 1 = 6 ns
- Pipelined execution:
  - SUM(IF, Read Reg, ALU, Memory, Write Reg), with every stage stretched to the 2 ns clock cycle: 5 × 2 ns = 10 ns

## Things to Remember

- Optimal Pipeline
  - Each stage is executing part of an instruction each clock cycle.
  - One instruction finishes during each clock cycle.
  - On average, executes far more quickly.
- What makes this work?
  - Similarities between instructions allow us to use the same stages for all instructions (generally).
  - Each stage takes about the same amount of time as all others: little wasted time.

## Pipeline Summary

- Pipelining is a BIG IDEA
  - widely used concept
- What makes it less than perfect? ...

## Pipeline Hazard: Matching socks in later load

- A depends on D; stall since folder tied up

## Problems for Computers

- Limits to pipelining: **Hazards** prevent the next instruction from executing during its designated clock cycle
  - **Structural hazards**: HW cannot support this combination of instructions (single person to fold and put clothes away)
  - **Control hazards**: pipelining of branches and other instructions stalls the pipeline until the hazard is resolved, leaving "bubbles" in the pipeline
  - **Data hazards**: instruction depends on the result of a prior instruction still in the pipeline (missing sock)

## Structural Hazard #1: Single Memory (2/2)

- Solution:
  - infeasible and inefficient to create a second memory
    - (We'll learn about this more next week)
  - so simulate this by having two Level 1 **Caches** (a cache is a temporary, smaller copy of memory, usually holding the most recently used contents)
  - have both an L1 **Instruction Cache** and an L1 **Data Cache**
  - requires complex hardware to control when both caches miss!

## Structural Hazard #2: Registers (2/2)

- Fact: Register access is VERY fast: it takes less than half the time of the ALU stage
- Solution: introduce a convention
  - always Write to Registers during the first half of each clock cycle
  - always Read from Registers during the second half of each clock cycle (easy when async)
- Result: can perform Read and Write during the same clock cycle

## Control Hazard: Branching (2/7)

- We put the branch decision-making hardware in the ALU stage
  - therefore two more instructions after the branch will always be fetched, whether or not the branch is taken
- Desired functionality of a branch
  - if we do not take the branch, don't waste any time and continue executing normally
  - if we take the branch, don't execute any instructions after the branch, just go to the desired label

## Control Hazard: Branching (3/7)

- Initial Solution: Stall until the decision is made
  - insert "no-op" instructions: those that accomplish nothing, just take time
  - Drawback: branches take 3 clock cycles each (assuming the comparator is put in the ALU stage)
  - Drawback: will still fetch the instruction at branch+4. Must either decode the branch in IF or squash the fetched branch+4 instruction.

## Control Hazard: Branching (4/7)

- Optimization #1:
  - move the asynchronous comparator up to Stage 2
  - as soon as the instruction is decoded (the opcode identifies it as a branch), immediately make a decision and set the value of the PC (if necessary)
  - Benefit: since the branch is complete in Stage 2, only one unnecessary instruction is fetched, so only one no-op is needed
  - Side Note: This means that branches are idle in Stages 3, 4 and 5.

## Control Hazard: Branching (5/7)

- Insert a single no-op (bubble)

[Figure: pipeline diagram of add, beq, and lw; horizontal axis: time (clock cycles); vertical axis: instruction order]

- Impact: 2 clock cycles per branch instruction ⇒ slow

## Control Hazard: Branching (6/7)

- Optimization #2: Redefine branches
  - Old definition: if we take the branch, none of the instructions after the branch get executed by accident
  - New definition: whether or not we take the branch, the single instruction immediately following the branch gets executed (called the branch-delay slot)

## Control Hazard: Branching (7/7)

- Notes on Branch-Delay Slot
  - Worst-Case Scenario: can always put a no-op in the branch-delay slot
  - Better Case: can find an instruction preceding the branch which can be placed in the branch-delay slot without affecting the flow of the program
    - re-ordering instructions is a common method of speeding up programs
    - compiler must be very smart in order to find instructions to do this
    - usually can find such an instruction at least 50% of the time
- Jumps also have a delay slot...

## Example: Nondelayed vs. Delayed Branch

| Nondelayed Branch | Delayed Branch |
| --- | --- |
| `or $8, $9, $10` | `add $1, $2, $3` |
| `add $1, $2, $3` | `sub $4, $5, $6` |
| `sub $4, $5, $6` | `beq $1, $4, Exit` |
| `beq $1, $4, Exit` | `or $8, $9, $10` |
| `xor $10, $1, $11` | `xor $10, $1, $11` |
| `Exit:` | `Exit:` |

## Data Hazards (1/2)

- Consider the following sequence of instructions:
  - `add $t0, $t1, $t2`
  - `sub $t4, $t0, $t3`
  - `and $t5, $t0, $t6`
  - `or $t7, $t0, $t8`
  - `xor $t9, $t0, $t10`

## Data Hazard: Loads (3/4)

- The instruction slot after a load is called the "load delay slot"
- If that instruction uses the result of the load, then the hardware interlock will stall it for one cycle.
- If the compiler puts an unrelated instruction in that slot, then there is no stall
- Letting the hardware stall the instruction in the delay slot is equivalent to putting a nop in the slot (except the latter uses more code space)

## Cf. Branch Delay vs. Load Delay

- Load Delay occurs only if necessary (dependent instructions).
- Branch Delay always happens (part of the ISA).
- Why not have Branch Delay interlocked?
  - Answer: Interlocks only work if you can detect the hazard ahead of time. By the time we detect a branch, we already need its value ... hence no interlock is possible!

## Historical Trivia

- First MIPS design did not interlock and stall on load-use data hazard
- Real reason behind the name MIPS: Microprocessor without Interlocked Pipeline Stages
  - Word play on the acronym for Millions of Instructions Per Second, also called MIPS
- Load/Use → Wrong Answer!

## "And in Conclusion.."

- Pipeline challenge is hazards
  - Forwarding helps with many data hazards
  - Delayed branch helps with control hazard in 5 stage pipeline
- Monday: Pipelined Datapath and Control in Detail!
- Please finish phase 1 of proj 3 by Monday.
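
As a recap, the timing from the Example slides and the one-cycle load delay can be tied together in a short calculation. The following is a minimal Python sketch and is not part of the original slides. It assumes the slide's stage times (2 ns memory, 2 ns ALU, 1 ns register access), assumes forwarding removes every data hazard except load-use, and ignores branches entirely. The names `STAGE_NS`, `load_use_stalls`, `pipelined_ns`, and `nonpipelined_ns`, and the tuple encoding of instructions, are hypothetical and chosen only for illustration.

```python
# Stage latencies (ns) from the Example slides. All names in this sketch are
# illustrative assumptions, not part of the lecture.
STAGE_NS = {"IF": 2, "ID": 1, "EX": 2, "MEM": 2, "WB": 1}
CYCLE_NS = max(STAGE_NS.values())  # pipelined clock period = slowest stage
NUM_STAGES = len(STAGE_NS)


def nonpipelined_ns(program):
    """Total time when each instruction runs to completion by itself."""
    total = 0
    for op, _dest, _srcs in program:
        # As on the Example slides, only lw pays for the Memory stage.
        skipped = () if op == "lw" else ("MEM",)
        total += sum(ns for stage, ns in STAGE_NS.items() if stage not in skipped)
    return total


def load_use_stalls(program):
    """Count one-cycle interlock stalls: the instruction in the load delay
    slot reads the register written by the lw just before it."""
    stalls = 0
    for (op, dest, _srcs), (_op, _dest, next_srcs) in zip(program, program[1:]):
        if op == "lw" and dest in next_srcs:
            stalls += 1
    return stalls


def pipelined_ns(program):
    """Total time on the 5-stage pipeline, assuming forwarding handles every
    data hazard except load-use and that there are no branches."""
    if not program:
        return 0
    # Fill the pipeline once, then finish one instruction per cycle,
    # plus one bubble per load-use stall.
    cycles = NUM_STAGES + (len(program) - 1) + load_use_stalls(program)
    return cycles * CYCLE_NS


# Each instruction is (opcode, destination register, source registers).
program = [
    ("lw", "$t0", ("$t1",)),          # lw  $t0, 0($t1)
    ("add", "$t2", ("$t0", "$t3")),   # add $t2, $t0, $t3  (uses the load)
    ("sub", "$t4", ("$t5", "$t6")),   # sub $t4, $t5, $t6  (unrelated)
]
# Move the unrelated sub into the load delay slot, as a compiler might.
reordered = [program[0], program[2], program[1]]

print(nonpipelined_ns(program))   # 20
print(pipelined_ns(program))      # 16
print(pipelined_ns(reordered))    # 14
```

Under these assumptions the three-instruction sequence takes 20 ns without pipelining (8 + 6 + 6) and 16 ns pipelined (8 cycles of 2 ns, including one stall). Filling the load delay slot with the unrelated `sub` removes the stall and brings it down to 14 ns, even though each individual instruction still has a 10 ns latency.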