# CS152 - Exam 2 Review

2003-11-18, Jack Kang and Kurt Meinz

www-inst.eecs.berkeley.edu/~cs152/

#### Question 1a

Problem 1a: Assume that we have a 32-bit processor (with 32-bit words) and that this processor is byte-addressed (i.e. addresses specify bytes). Suppose that it has a 512-byte cache that is two-way set-associative, has 4-word cache lines, and uses LRU replacement. Split the 32-bit address into "tag", "index", and "cache-line offset" pieces. Which address bits comprise each piece?

- Tag: 24 bits total: bits 31-8
- Index: 4 bits total: bits 7-4
- Block Offset: 4 bits total: bits 3-0

Each line holds 4 words × 4 bytes = 16 bytes, so the offset needs 4 bits. The cache holds 512 / 16 = 32 lines; with 2 ways that is 16 sets, so the index needs 4 bits. The remaining 32 - 8 = 24 bits form the tag.

#### Question 1b

Problem 1b: How many sets does this cache have? Explain.

4 bits in the index field → 2^4 possible values → 16 sets

#### Question 1c

Problem 1c: Draw a block diagram for this cache. Show a 32-bit address coming into the diagram and a 32-bit data result and "Hit" signal coming out. Include all of the comparators in the system and any muxes as well. Include the data storage memories (indexed by the "Index"), the tag matching logic, and any muxes. You can indicate RAM with a simple block, but make sure to label address widths and data widths. Make sure to label the function of various blocks and the width of any buses.

#### Question 1d

Problem 1d: Below is a series of memory read references sent to the cache from part (a). Assume that the cache is initially empty and classify each memory reference as a hit or a miss. Identify each miss as either compulsory, conflict, or capacity. One example is shown. Hint: start by splitting the address into components. Show your work.

The references are issued down the left column first, then down the right column.

| Address | Hit/Miss? | Miss Type? | Address | Hit/Miss? | Miss Type? |
|---------|-----------|------------|---------|-----------|------------|
| 0x300 | Miss | Compulsory | 0x3B2 | Miss | Compulsory |
| 0x1BC | Miss | Compulsory | 0x10C | Hit | _ |
| 0x206 | Miss | Compulsory | 0x205 | Miss | Conflict |
| 0x109 | Miss | Compulsory | 0x301 | Miss | Conflict |
| 0x308 | Miss | Conflict | 0x3AE | Miss | Compulsory |
| 0x1A1 | Miss | Compulsory | 0x1A8 | Miss | Conflict |
| 0x1B1 | Hit | _ | 0x3A1 | Hit | _ |
| 0x2AE | Miss | Compulsory | 0x1BA | Hit | _ |

There are no capacity misses: the 16 references touch only 8 distinct 16-byte blocks, far fewer than the cache's 32 lines, so every non-compulsory miss is caused by two-way set conflicts (sets 0x0 and 0xA).

#### Question 1e

Problem 1e: Calculate the miss rate and hit rate.

$Hit\ Rate = \frac{4}{16} = 0.25$

$Miss\ Rate = 1 - Hit\ Rate = \frac{12}{16} = 0.75$
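To check the table, here is a small Python sketch (not part of the original review) that simulates this cache and applies the usual 3C rule: a miss is compulsory on the first touch of a block, capacity if a fully associative LRU cache of the same total size would also miss, and conflict otherwise. The field widths are derived from the Problem 1a parameters; the helper names (`split`, `classify`, `TRACE`) are illustrative.

```python
# Sketch: simulate the 512-byte, 2-way, 16-byte-line LRU cache from Problem 1
# and classify each miss as compulsory, capacity, or conflict (3C model).
from collections import OrderedDict

LINE_BYTES = 16                              # 4 words x 4 bytes
CACHE_BYTES = 512
WAYS = 2
NUM_LINES = CACHE_BYTES // LINE_BYTES        # 32 lines
NUM_SETS = NUM_LINES // WAYS                 # 16 sets
OFFSET_BITS = LINE_BYTES.bit_length() - 1    # 4
INDEX_BITS = NUM_SETS.bit_length() - 1       # 4
TAG_BITS = 32 - INDEX_BITS - OFFSET_BITS     # 24


def split(addr):
    """Return (tag, index, offset) for a 32-bit byte address."""
    offset = addr & (LINE_BYTES - 1)
    index = (addr >> OFFSET_BITS) & (NUM_SETS - 1)
    tag = addr >> (OFFSET_BITS + INDEX_BITS)
    return tag, index, offset


def classify(trace):
    sets = [OrderedDict() for _ in range(NUM_SETS)]  # per-set tags, LRU first
    fully_assoc = OrderedDict()   # same-size fully associative LRU reference
    seen = set()                  # blocks referenced so far
    results = []
    for addr in trace:
        tag, index, _ = split(addr)
        block = addr >> OFFSET_BITS
        # Update the fully associative model (separates capacity from conflict).
        fa_hit = block in fully_assoc
        if fa_hit:
            fully_assoc.move_to_end(block)
        else:
            if len(fully_assoc) == NUM_LINES:
                fully_assoc.popitem(last=False)
            fully_assoc[block] = True
        ways = sets[index]
        if tag in ways:
            ways.move_to_end(tag)             # mark as most recently used
            results.append((addr, "Hit", "_"))
        else:
            if len(ways) == WAYS:
                ways.popitem(last=False)      # evict the LRU way
            ways[tag] = True
            if block not in seen:
                kind = "Compulsory"
            elif not fa_hit:
                kind = "Capacity"
            else:
                kind = "Conflict"
            results.append((addr, "Miss", kind))
        seen.add(block)
    return results


# Left column of the table first, then the right column.
TRACE = [0x300, 0x1BC, 0x206, 0x109, 0x308, 0x1A1, 0x1B1, 0x2AE,
         0x3B2, 0x10C, 0x205, 0x301, 0x3AE, 0x1A8, 0x3A1, 0x1BA]

if __name__ == "__main__":
    rows = classify(TRACE)
    for addr, hit_miss, kind in rows:
        print(f"{addr:#05x} {hit_miss:4} {kind}")
    hits = sum(1 for _, hit_miss, _ in rows if hit_miss == "Hit")
    print("hit rate", hits / len(rows), "miss rate", 1 - hits / len(rows))
```

The same constants give the RAM widths for the Problem 1c diagram: each way indexes 16 entries with a 4-bit address and stores a 24-bit tag alongside a 128-bit (4-word) line.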
#### Question 1f

Problem 1f: You have a 500 MHz processor with 2 levels of cache, 1 level of DRAM, and a DISK for virtual memory. Assume that it has a Harvard architecture (separate instruction and data caches at level 1). Assume that the memory system has the following parameters:

| Component | Hit Time | Miss Rate | Block Size |
|-----------------------|-------------------------------|----------------------------|------------|
| First-Level<br>Cache | 1 cycle | 4% Data<br>1% Instructions | 64 bytes |
| Second-Level<br>Cache | 20 cycles +<br>1 cycle/64 bits | 2% | 128 bytes |
| DRAM | 100ns +<br>25ns/8 bytes | 1% | 16K bytes |
| DISK | 50ms +<br>20ns/byte | 0% | 16K bytes |

Finally, assume that there is a TLB that misses 0.1% of the time on data (doesn't miss on instructions) and which has a fill penalty of 40 cycles. What is the average memory access time (AMAT) for instructions? For data (assume all reads)?

At 500 MHz, one clock is 2 ns. Work from the bottom of the hierarchy up: each level's miss penalty is the AMAT of the level below it, and the transfer size is the block size of the level being filled.

$$
\begin{aligned}
AMAT_{Disk} &= AccessTime + MissRate \times AMAT_{Miss} + TransferRate \times TransferSize \\
&= 50\times10^{6}\,\text{ns} + 0 + (20\,\text{ns/byte} \times 16\text{K bytes}) \\
&\approx 5\times10^{7}\,\text{ns} = \frac{5\times10^{7}\,\text{ns}}{2\,\text{ns/clock}} = 2.5\times10^{7}\ \text{clocks}
\end{aligned}
$$
The DRAM transfer fills one 128-byte L2 block:

$$
\begin{aligned}
AMAT_{DRAM} &= AccessTime + MissRate \times AMAT_{Disk} + TransferRate \times TransferSize \\
&= 100\,\text{ns} + 0.01 \times 5\times10^{7}\,\text{ns} + (25\,\text{ns}/8\,\text{bytes} \times 128\,\text{bytes}) \\
&\approx 5\times10^{5}\,\text{ns} = \frac{5\times10^{5}\,\text{ns}}{2\,\text{ns/clock}} = 2.5\times10^{5}\ \text{clocks}
\end{aligned}
$$

The L2 transfer rate of 1 cycle per 64 bits is 2 ns per 8 bytes, and the transfer fills one 64-byte L1 block:

$$
\begin{aligned}
AMAT_{L2} &= AccessTime + MissRate \times AMAT_{DRAM} + TransferRate \times TransferSize \\
&= (20\,\text{cycles} \times 2\,\text{ns/cycle}) + 0.02 \times 5\times10^{5}\,\text{ns} + (2\,\text{ns}/8\,\text{bytes} \times 64\,\text{bytes}) \\
&\approx 1\times10^{4}\,\text{ns} = \frac{1\times10^{4}\,\text{ns}}{2\,\text{ns/clock}} = 5\times10^{3}\ \text{clocks}
\end{aligned}
$$
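The review stops at the L2 figure. The sketch below carries the same bottom-up chain through level 1, under stated assumptions: a 2 ns clock, a disk transfer rate of 20 ns/byte (the rate the worked solution uses), transfer size equal to the block size of the level being filled, and the TLB fill penalty (0.1% × 40 cycles) charged on data accesses only. The function name `amat_ns` is illustrative.

```python
# Sketch: Problem 1f AMAT chain, computed bottom-up without intermediate rounding.
NS_PER_CYCLE = 2.0  # 500 MHz clock


def amat_ns(access_ns, miss_rate, next_level_ns, ns_per_byte, transfer_bytes):
    """Access time + miss rate x next level's AMAT + time to transfer a block."""
    return access_ns + miss_rate * next_level_ns + ns_per_byte * transfer_bytes


# Each level's miss penalty is the AMAT of the level below it.
disk_ns = amat_ns(50e6, 0.0, 0.0, 20.0, 16 * 1024)       # fills a 16 KB DRAM block
dram_ns = amat_ns(100.0, 0.01, disk_ns, 25.0 / 8, 128)   # fills a 128-byte L2 block
l2_ns = amat_ns(20 * NS_PER_CYCLE, 0.02, dram_ns, NS_PER_CYCLE / 8, 64)  # 64-byte L1 block
l2_cycles = l2_ns / NS_PER_CYCLE

# Level 1: 1-cycle hit plus miss rate x L2 AMAT; TLB misses add 0.1% x 40 cycles on data.
amat_inst_cycles = 1 + 0.01 * l2_cycles
amat_data_cycles = 1 + 0.04 * l2_cycles + 0.001 * 40

if __name__ == "__main__":
    print(f"disk {disk_ns / NS_PER_CYCLE:.3g} clocks, "
          f"DRAM {dram_ns / NS_PER_CYCLE:.3g} clocks, L2 {l2_cycles:.4g} clocks")
    print(f"AMAT: instructions {amat_inst_cycles:.1f} cycles, data {amat_data_cycles:.1f} cycles")
```

Under these assumptions the chain gives roughly 51.7 cycles per instruction fetch and 203.7 cycles per data read; starting instead from the slides' rounded 5E3-clock L2 figure gives about 51 and 201 cycles.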
#### Question 1g

Problem 1g: Suppose that we measure the following instruction mix for benchmark "X": Loads: 20%, Stores: 15%, Integer: 30%, Floating-Point: 15%, Branches: 20%. Assume that we have a single-issue processor with a minimum CPI of 1.0.
Assume that we have a branch predictor that is correct 95% of the time, and that an incorrect prediction costs 3 cycles. Finally, assume that data hazards cause an average penalty of 0.7 cycles for floating-point operations. Integer operations run at maximum throughput. What is the average CPI of Benchmark X, including memory misses (from part f)?
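A numeric sketch of the decomposition CPI = MinCPI + hazard stalls + memory stalls for benchmark X. Assumptions not stated on the slides: every instruction makes one instruction fetch, each load and store (35% of instructions) makes one data access treated as a read, and the level-1 AMATs come from the slides' rounded 5E3-clock L2 figure. The `l1_hit_in_base` flag is a hypothetical option: the slide formula adds the full AMAT, but if the 1-cycle L1 hit is already counted in the minimum CPI, only the cycles beyond it are stalls.

```python
# Sketch: Problem 1g CPI = MinCPI + hazard stalls + memory stalls.
MIX = {"load": 0.20, "store": 0.15, "int": 0.30, "fp": 0.15, "branch": 0.20}


def cpi(amat_inst, amat_data, l1_hit_in_base=False, min_cpi=1.0):
    fp_stall = 0.7                     # average FP data-hazard penalty (cycles)
    branch_stall = (1 - 0.95) * 3      # misprediction rate x 3-cycle penalty
    hazard = MIX["fp"] * fp_stall + MIX["branch"] * branch_stall  # integer ops: no stalls
    if l1_hit_in_base:
        # Count only the cycles beyond the 1-cycle L1 hit as memory stalls.
        amat_inst, amat_data = amat_inst - 1, amat_data - 1
    data_freq = MIX["load"] + MIX["store"]
    memory = 1.0 * amat_inst + data_freq * amat_data
    return min_cpi + hazard + memory


# Level-1 AMATs (cycles) built from the slides' rounded L2 value of 5E3 clocks.
L2_CYCLES = 5e3
AMAT_INST = 1 + 0.01 * L2_CYCLES
AMAT_DATA = 1 + 0.04 * L2_CYCLES + 0.001 * 40   # includes TLB fills on data

if __name__ == "__main__":
    print(round(cpi(AMAT_INST, AMAT_DATA), 2))
    print(round(cpi(AMAT_INST, AMAT_DATA, l1_hit_in_base=True), 2))
```

Either way the memory term dominates: roughly 120 cycles of memory stalls per instruction against 0.135 cycles of hazard stalls.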
$$
\begin{aligned}
CPI &= MinCPI + \sum \left[ \text{CPI of exceptional events} \right] \\
&= MinCPI + CPI_{HazardStalls} + CPI_{MemoryStalls} \\
&= MinCPI + \sum \left( InstTypeFreq \times CPI_{InstType} \right) + \sum \left( MemAccessFreq \times AMAT_{Access} \right) \\
&= 1 + \left[ (FPFreq \times FPCPI) + (BranchFreq \times BranchCPI) \right] \\
&\quad + \left[ \left( MemInstFreq \times \frac{AMAT_{L1,Inst}}{2\,\text{ns/clock}} \right) + \left( DataInstFreq \times \frac{AMAT_{L1,Data}}{2\,\text{ns/clock}} \right) \right]
\end{aligned}
$$

Here FPCPI = 0.7 cycles, BranchCPI = 0.05 × 3 cycles (misprediction rate times its cost), and integer operations contribute no stalls.

#### Question 2a

2a) Explain why we would be unable to pick a single optimum number of branch delay slots for the above processor.

Because branch delay slots affect correctness (they define functional behavior: the instructions in them are always executed when a branch is executed), we have to pick a single number. That number wouldn't be optimal under all circumstances, since we may issue 0, 1, or 2 instructions per cycle after the branch.

#### Question 2b

This depends on whether or not the two memory stages are separable.
A WAR hazard would occur if it were possible for a later store to change the value seen by an earlier load. If stores go to memory early but loads take two cycles, this might be a problem. The way to fix this (if it can happen) is to make sure that stores take two cycles, just like loads. Note that the answer to this question is NO unless you do something weird with your memory system.

#### Question 2c

1) Finish the diagram. Stages are boxes with letters inside: use "F" for a fetch stage, "D" for a decode stage, EX<sub>1</sub> through EX<sub>n</sub> for the execution stages of each of the pipelines (including memory accesses), and "W" for a writeback stage. Clearly label which is the even pipeline. Include arrows for forward information flow if this is not obvious.
2) Next, describe what is being computed in each EX stage (including partial results).
3) Show all forwarding paths (as arrows). Your pipeline should never stall unless a value is not ready. Label each bypass arrow with the types of instructions that will forward their results along that path (i.e. "M" for multiply, "D" for divide, "A" for add, "I" for integer operations, and "Ld" for load results). [Hint: think carefully about inputs to store instructions!]

[Figure: completed pipeline diagram showing the EVEN and ODD pipelines with their MEM stages between D and W]

EX stages:

- EX<sub>1</sub>: Integer ops, branches, memory address computation, first stage of A, M, D
- EX<sub>2</sub>: First stage of load/store, finish A, second stage of M, D
- EX<sub>3</sub> onward: second stage of load/store and the remaining stages of M and D; the last EX stage holds the final stage of D

#### Question 2d

2d) Note that we assume that a load is not completed until the end of EX<sub>3</sub> and that a store must have its value by the beginning of EX<sub>2</sub>.
Consider the following common sequence for a memory copy:

```
loop: ld   r1, 0(r2)
      st   r1, 0(r3)
      add  r2, r2, #4
      subi r4, r4, #1
      add  r3, r3, #4
      bne  r4, r0, loop
      nop
```

Why can't the load and store be dispatched in the same cycle? What is the minimum number of instructions that must be placed between them to avoid stalling?

They cannot be dispatched in the same cycle because of the dependency through r1. In this pipeline, the store must execute 2 cycles later than the load (because loads take 2 cycles). In the best case (load in the odd pipeline, store in the even pipeline), there must be 1 bubble cycle, i.e. 2 instructions. So, the answer is 2 instructions.

The easiest way to understand this is to imagine that the load is in the EX<sub>3</sub> stage of the odd pipeline while the store is in the EX<sub>1</sub> stage of the even pipeline. Look at the answer for the previous problem: there is a special store arc (bypass) to handle this circumstance. The load is 2 cycles ahead of the store, so we need to fill the two different EX<sub>2</sub> stages (one in each pipeline) with instructions.

#### Question 2e

2e) What can you change about the pipeline to reduce your answer to (2d)? Assume that you are not allowed to change the latencies of any instructions.

By shifting the memory stages in the even pipeline one cycle later, we can get 0 instructions.
What this means is that the two memory stages for the even pipeline are in EX<sub>3</sub> and EX<sub>4</sub>. Then, if the load is in the odd pipeline and the store is in the even pipeline (next cycle), we have no stalls.

#### Question 2g

2g) [Extra Credit: 5pts] Briefly describe the logic that would be required in the decode stage of this pipeline. In five (5) sentences or less (and possibly a small figure), describe a mechanism that would permit the decode stage to decide which of two instructions presented to it could be dispatched.

- We have to check whether the 2nd instruction depends on the first one.
- We have to check the operands of the two instructions against any instructions still in the pipeline, and see whether each can issue. This step is slightly complex because different instructions in the pipeline finish at different times.

#### Question 3

Extra Credit (Problem 3X): Assume that you have a Tomasulo architecture with functional units of the same execution latency (number of cycles) as our deeply pipelined processor (be careful to adjust use latencies to get number of execution cycles). Assume that it issues one instruction per cycle and has an unpipelined divider with a small number of reservation stations. Suppose the other functional units are duplicated with many reservation stations and that there are many CDBs. What is the minimum number of divide reservation stations needed to achieve one instruction per cycle with the optimized code of (3b)? Show your work. [Hint: assume that the maximum issue rate is sustained and look at the scheduling of a single iteration.]

Load: 3 cycles, Add: 2 cycles, Multiply: 4 cycles, Divide: 9 cycles (careful here!)
```
loop: ldf   F20, 0(R10)
      ldf   F10, 8(R10)
      multf F6, F20, F1
      addf  F12, F6, F2
      addi  R10, R10, ...
      divf  F13, F12, F10
      addi  R20, R20, ...
      subi  R1, R1, ...
      bne   R1, ...
      stf   ...(R20), F13
```

Keys to Problem:

1) # of reservation station entries needed = # of divide instructions in flight at the same time

The review builds the trace up in snapshots (CC 5, 11, 15, 17, 21, and 23) while following the first, second, and third divf; the table below is the last snapshot shown. N = instruction, I = issue cycle, E1/EF = first/final execution cycle, WB = write-back cycle; * marks a cycle not yet reached when the snapshot was taken.

| N | rd | rs | rt | I | E1 | EF | WB |
|-------|-----|-----|-----|----|----|-----|----|
| ldf | f20 | r10 | | 1 | 2 | 4 | 5 |
| ldf | f10 | r10 | | 2 | 3 | 5 | 6 |
| multf | f6 | f20 | f1 | 3 | 6 | 9 | 10 |
| addf | f12 | f6 | f2 | 4 | 11 | 12 | 13 |
| addi | r10 | r10 | | 5 | 6 | 6 | 7 |
| divf | f13 | f12 | f10 | 6 | 14 | 22 | 23 |
| addi | r20 | r20 | | 7 | 8 | 8 | 9 |
| subi | r1 | r1 | | 8 | 9 | 9 | 10 |
| bne | | r1 | | 9 | | | |
| stf | | f13 | r20 | 10 | 24 | 26* | |
| ldf | f20 | r10 | | 11 | 12 | 14 | 15 |
| ldf | f10 | r10 | | 12 | 13 | 15 | 16 |
| multf | f6 | f20 | f1 | 13 | 16 | 19 | 20 |
| addf | f12 | f6 | f2 | 14 | 21 | 22 | 23 |
| addi | r10 | r10 | | 15 | 16 | 16 | 17 |
| divf | f13 | f12 | f10 | 16 | 24 | 32* | |
| addi | r20 | r20 | | 17 | 18 | 18 | 19 |
| subi | r1 | r1 | | 18 | 19 | 19 | 20 |
| bne | | r1 | | 19 | | | |
| stf | | f13 | r20 | 20 | | | |
| ldf | f20 | r10 | | 21 | 22 | 24 | |
| ldf | f10 | r10 | | 22 | 23 | 25* | |
| multf | f6 | f20 | f1 | 23 | | | |
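The key observation (station entries needed = divides in flight at once) can be checked with a small sweep over the divides' issue and finish cycles from this trace (6→23, 16→33, 26→43). This Python sketch assumes a reservation station entry is held from issue through write-back and freed for reuse the following cycle; `max_in_flight` is a hypothetical helper. The steady-state list further assumes every 10-instruction iteration repeats the same timing, which holds because the 9-cycle unpipelined divider is free again before the next divide is ready.

```python
# Sketch: count divide instructions holding reservation station entries at once.

def max_in_flight(intervals):
    """intervals: (issue_cycle, writeback_cycle) pairs, inclusive."""
    events = []
    for issue, writeback in intervals:
        events.append((issue, +1))           # entry allocated at issue
        events.append((writeback + 1, -1))   # entry free the cycle after write-back
    events.sort()                            # at equal cycles, frees (-1) sort first
    live = best = 0
    for _, delta in events:
        live += delta
        best = max(best, live)
    return best


DIVS = [(6, 23), (16, 33), (26, 43)]                     # from the trace
STEADY = [(6 + 10 * k, 23 + 10 * k) for k in range(20)]  # assumed 10-cycle iterations

if __name__ == "__main__":
    print(max_in_flight(DIVS), max_in_flight(STEADY))
```

Each divide holds its entry for 18 cycles, which spans parts of two 10-cycle iterations but never three, so two entries suffice.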
Divide timing from the trace:

- Divf1: issued 6, finished 23
- Divf2: issued 16, finished 33
- Divf3: issued 26, finished 43

The second divf issues before the first finishes, so we need at least 2 entries. The first finishes before the third issues, and every iteration repeats the same 10-cycle pattern, so no more than 2 divides are ever in flight at once. Therefore, we need 2 entries.

#### Question 4

| # | TLB | Page table | Cache | Possible? If so, under what circumstance |
|---|-----|------------|-------|-------------------------------------------|
| 1 | miss | miss | miss | TLB misses and is followed by a page fault; after retry, data must miss in cache. |
| 2 | miss | miss | hit | Impossible: data cannot be allowed in cache if the page is not in memory. |
| 3 | miss | hit | miss | TLB misses, but entry found in page table; after retry, data misses in cache. |
| 4 | miss | hit | hit | TLB misses, but entry found in page table; after retry, data is found in cache. |
| 5 | hit | miss | miss | Impossible: cannot have a translation in TLB if page is not present in memory. |
| 6 | hit | miss | hit | Impossible: cannot have a translation in TLB if page is not present in memory. |
| 7 | hit | hit | miss | Possible, although the page table is never really checked if TLB hits. |
| 8 | hit | hit | hit | Possible, although the page table is never really checked if TLB hits. |
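The table's reasoning reduces to two constraints: a TLB hit implies the page is resident (so the page table would hit), and a cache hit also implies the page is resident. This Python sketch encodes just those two rules and enumerates the eight outcome combinations in the same order as the table; `possible` and `outcome_table` are illustrative names.

```python
# Sketch: apply the two residency constraints from the Question 4 table.
from itertools import product


def possible(tlb_hit, pt_hit, cache_hit):
    if tlb_hit and not pt_hit:
        return False   # a TLB translation for a page that is not in memory
    if cache_hit and not pt_hit:
        return False   # cached data from a page that is not in memory
    return True


def outcome_table():
    # product order (miss, miss, miss) ... (hit, hit, hit) matches rows 1-8.
    rows = []
    for n, (tlb, pt, cache) in enumerate(product((False, True), repeat=3), start=1):
        rows.append((n, tlb, pt, cache, possible(tlb, pt, cache)))
    return rows


def word(hit):
    return "hit" if hit else "miss"


if __name__ == "__main__":
    for n, tlb, pt, cache, ok in outcome_table():
        print(n, word(tlb), word(pt), word(cache), "possible" if ok else "impossible")
```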