Lab #7
Final Project

CS152 - Computer Architecture
Fall 2003, Prof Dave Patterson


Problem 0: (Team Evaluation) due Thursday, 11/6 by 9 PM via E-mail to your TA.
Your lab organizational e-mail is due to your TA by 9 PM on Thursday, 11/6.  ALSO, you should enter your project on the projects page by this time (more on this below).
Your modules from groups B-D will be due on Monday, 11/17. At this time, you must demonstrate to your TA that your modules work (they do not necessarily need to be connected to the rest of your processor, but you must demonstrate module-level testing).
The unfair benchmark from group E is due on Monday, 11/17 by midnight.
You will need to demo your final project on Monday, 12/1.
30 minute oral presentations will take place on Tuesday, 12/2. Signups and more information will follow.
We will all gather at 6 pm (time subject to change) on Wednesday, 12/3 in the lab for the contest.


Problem 0 (Team Evaluation) for labs 5 and 6 is due Thursday, 11/6 by midnight.  Same description as before.


Problem 0.5: Memory-mapped I/O changes.

For the final project we will need a 64-bit cycle counter in order to time some of the longer-running programs. You must therefore implement a new 64-bit cycle counter, memory-mapped as shown below.

Address      Reads                                Writes
0xFFFFFEF0   Lower 32 Bits of Big Cycle Counter   Nothing
0xFFFFFEF4   Upper 32 Bits of Big Cycle Counter   Nothing
0xFFFFFEF8   Timer State                          Timer Command
0xFFFFFEFC   Reserved for Future Use              Reserved for Future Use

Your 64-bit cycle counter will have two states, running and stopped. A load from the Timer State address should return 1 if the counter is running and 0 if it is stopped. You may issue three different commands to your cycle counter via store words to the Timer Command location.

Value        Action
0x00000001   Reset the Counter
0x00000002   Start the Counter
0x00000004   Stop the Counter
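
A minimal Verilog sketch of such a counter appears below. The bus signal names (mm_addr, mm_wdata, mm_we, mm_rdata) are placeholders for however your memory-mapped I/O decode is actually wired, and letting a reset command take effect while the counter keeps running is just one possible design choice.

module big_cycle_counter (
    input  wire        clk, reset,
    input  wire [31:0] mm_addr,      // memory-mapped address (placeholder)
    input  wire [31:0] mm_wdata,     // store-word data (placeholder)
    input  wire        mm_we,        // store-word strobe (placeholder)
    output reg  [31:0] mm_rdata      // load-word result (placeholder)
);
    reg [63:0] count;
    reg        running;

    always @(posedge clk) begin
        if (reset) begin
            count   <= 64'b0;
            running <= 1'b0;
        end else begin
            if (running)
                count <= count + 64'd1;
            if (mm_we && mm_addr == 32'hFFFFFEF8) begin
                if (mm_wdata == 32'h00000001) count   <= 64'b0;  // Reset the Counter
                if (mm_wdata == 32'h00000002) running <= 1'b1;   // Start the Counter
                if (mm_wdata == 32'h00000004) running <= 1'b0;   // Stop the Counter
            end
        end
    end

    // reads: lower word, upper word, and timer state (1 = running)
    always @(*) begin
        case (mm_addr)
            32'hFFFFFEF0: mm_rdata = count[31:0];
            32'hFFFFFEF4: mm_rdata = count[63:32];
            32'hFFFFFEF8: mm_rdata = {31'b0, running};
            default:      mm_rdata = 32'b0;
        endcase
    end
endmodule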


Problem 1: The final project.

For this project, your group is to implement several sub-projects selected from the list included below. You must choose at least one option from each of groups A, B, and E. Each "option" has an attached number of points.  You must implement a total of 7.5 points per person in your group.  Thus, for instance, a 4-person group must implement 30 points worth of work from the list below.  You will not be awarded more than the number of points possible for your group (except through the special extra-credit section); however, you may choose to implement additional sub-projects to increase your chances of extra credit or to guard against mistakes.  All projects must maintain compatibility with the standard MIPS instruction set.

In light of the pressures of this class, we are offering a reduced point scale in an effort to help everybody focus on completing a working project. Completing the 7.5 points per person, as listed above, makes your final project eligible for full points. You may instead choose to complete only 6 points per person, but the maximum score for your group will then be an A-. The third option is to complete only 4.5 points per person, which will result in a maximum score of B+. It may be beneficial for some of you to target a B+ or an A-, and then add on additional modules after meeting those initial point requirements.

If you have ideas for other options, talk to one of the TAs or the professor for approval, as well as an assignment of points for your option. For additional information on these options, please consult the graduate textbook, websites, or your TAs.

Near the end of classes, your group must give a final presentation. Your presentation should be 20 minutes long, with an additional 10 minutes for questions from the professor, the TAs, and your fellow classmates. Everybody in your group must present; your individual grade will include your presentation. Good presentations (and write-ups, for that matter) will cover the specific sub-projects you chose to implement, and how they affected your processor's performance. What design decisions were made and why? How successful were you? How did you measure your performance? Detailed descriptions of your project datapath are not appropriate for a 20-minute presentation.  However, high-level datapaths and explanations of specific implementations might be appropriate.

Your final lab write-up is due Friday, 12/5, after the contest.   Please make sure that your write-up is clear, concise, and complete. To receive full credit for a sub-project, you must include a discussion of how it affects the performance of your project. You will not get a chance to discuss your final project with your TA, so make sure that you effectively communicate the key points of your project. One good technique is to analyze the performance of select benchmarks with and without your new features.

This document contains only a brief summary of the projects. For more information about any of the specific sub-projects, please consult with your TA. All projects are weighted in accordance with their estimated difficulty. You should not implement a project unless it is relevant to your design (i.e., branch prediction is useless for the single-issue MIPS pipeline).

Important note: many of these options can get quite complicated.  Thus, plan out a testing philosophy along with your architecture. One thing that you learned this term is how to make interesting tracer modules in Verilog.  Use this technique freely, since ASCII traces (possibly written to files) of pipeline state are likely to assist in debugging your design.

As you will see below, at minimum, we require you to maintain a disassembly-tracer as in previous labs.  This required  instruction tracer should only show committed instructions, i.e. show the instructions that actually finish execution.  Note that in this lab, we now require that you include a cycle count value in the tracer output (at the beginning of each output line).  This will show exactly which cycle instructions finish on.  Since it is part of the debugging code, all of the logic for the cycle counter can be included in the Verilog monitor code; it should be reset to zero on reset, and increment with each clock cycle (since CPI is not necessarily 1, you will actually see the delay in your instruction traces).
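
For instance, the monitor-side counter and trace output might look something like the fragment below (this lives inside your monitor module; commit_valid, commit_pc, and commit_instr are placeholder signals from your writeback logic, and the file name is arbitrary):

    reg [31:0] trace_cycles;   // lives only in the monitor, not the datapath
    integer    trace_file;

    initial trace_file = $fopen("commit_trace.txt");

    always @(posedge clk) begin
        if (reset) trace_cycles <= 32'd0;
        else       trace_cycles <= trace_cycles + 32'd1;
    end

    // one line per committed instruction, prefixed with the cycle count
    always @(posedge clk) begin
        if (commit_valid)
            $fdisplay(trace_file, "%0d: pc=%h instr=%h",
                      trace_cycles, commit_pc, commit_instr);
    end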

Should you decide to use other tracer modules (say ones that watch individual reservation stations, etc), then you should also include cycle counts in those trace outputs as well.  Each tracer module can write its output to an individual file, with the cycle counts helping you to figure out what order things happened in.

Subprojects [#points]

Group A:

Super-Scalar [18]:
To increase performance, convert your pipeline into a 2-way super-scalar pipeline. Because super-scalar implementations increase the required instruction bandwidth, this project may entail changes throughout the entire memory system.

The IF stage needs to fetch two instructions, and then determine if both instructions can be issued in parallel (hazards?). Additional complexity is added because data must be forwarded between the pipes. Note that unless you choose to implement an enhanced superscalar pipeline (see group D), only one of your pipes will have a real memory stage, so two lw/sw instructions cannot be issued in parallel.
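
As a starting point, the issue check itself can be a small block of combinational logic. The sketch below assumes instruction A is the older of the fetched pair and that the decoded fields shown are produced elsewhere; all the names are placeholders:

module dual_issue_check (
    input  wire [4:0] a_rd,                  // A's destination register
    input  wire [4:0] b_rs, b_rt, b_rd,      // B's sources and destination
    input  wire       a_writes, b_writes,    // instruction writes a register
    input  wire       b_uses_rs, b_uses_rt,  // B actually reads rs/rt
    input  wire       a_is_mem, b_is_mem,    // lw/sw in A or B
    output wire       issue_both
);
    // B reads a register that A writes (RAW within the pair)
    wire raw = a_writes && ((b_uses_rs && (b_rs == a_rd)) ||
                            (b_uses_rt && (b_rt == a_rd)));
    // both write the same register (WAW); treating $0 as a hazard
    // is harmless -- it only costs an issue slot
    wire waw = a_writes && b_writes && (a_rd == b_rd);
    // only one pipe has a real memory stage
    wire mem = a_is_mem && b_is_mem;

    assign issue_both = !raw && !waw && !mem;
endmodule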

Make sure to update your instruction tracer unit from previous labs to give useful information about the instruction stream.  Among other things, make sure to include the current cycle count in your trace.

Out-of-order Execution [22]:
Implement a processor which uses the Tomasulo algorithm to perform out-of-order execution.  Include at least 5 different units: instruction fetch/branch unit, load unit, store unit, integer ALU, and multiply unit.  Note that the instruction fetch unit is tricky, since you need to deal with branches whose values may be coming from the broadcast results bus.  Also, see if you can be more clever than we were in class about the broadcast results bus, so that a chain of instructions that depend on each other can have an execution throughput of one per cycle.  For more information, look at the end of chapter 6, which talks about the PowerPC 604 and shows a possible organization.  Incidentally, if done properly, the store unit could serve as a write buffer....
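
The state held by each reservation station is essentially the set of textbook Tomasulo fields; a storage-only sketch (widths and names are placeholders, ports elided):

module rs_sketch;            // one reservation station's state (ports elided)
    reg        busy;         // station currently holds an instruction
    reg [5:0]  op;           // operation to perform
    reg [31:0] vj, vk;       // operand values (valid once qj/qk are clear)
    reg [3:0]  qj, qk;       // tags of producing stations; 0 = value ready
    reg [3:0]  dest_tag;     // tag driven onto the broadcast results bus
endmodule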

Make sure to update your instruction tracer unit from previous labs to give useful information about the instruction stream.  Among other things, make sure to include the current cycle count in your trace.  Suggestion: you might consider outputting two separate traces (to files): (1) a trace of dispatched instructions and (2) a trace of committed instructions, possibly triggered from the broadcast results bus.  Only (2) is required, however.

Deep pipelining [18]:
Deep pipelining breaks the pipeline up into more than five stages, decreasing the cycle time of your processor. Lengthening the pipeline, however, will increase the difficulties that hazards cause in the system. Some components, such as the ALU, will have to be broken up into two stages (you will have to implement the ALU using individual gates).  To get full credit for this part, you need to implement at least 8 stages: 2 IF, 1 DEC, 2 EXE, 2 MEM, and 1 WB (see also 'not-so-deep' pipelining below).

Further, the cycle time of your processor must be 20 ns or lower (50 MHz or higher).

Please note that this project will involve HEAVY USE of the CAD tools in order to achieve the 50 MHz. If you feel that heavily exploring the Xilinx CAD tools should not be part of a Computer Architecture course, do not do this option.

Make sure to update your instruction tracer unit from previous labs to give useful information about the instruction stream.  Among other things, make sure to include the current cycle count in your trace.

Group B: Predictors

Branch Prediction [8]:
Branch prediction attempts to reduce the branch penalty for super-scalar, out-of-order, or deep-pipelining designs. When a branch is mis-predicted, instructions in the pipeline will have to be squashed and correct execution resumed.  Typically, branches are predicted based on the address of the branch instruction.

Note that you need to think carefully about what it means to have a predicted instruction stream and how you will recover from a mis-prediction.
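
For example, a minimal branch history table of 2-bit saturating counters might look like the sketch below; the table size, index bits, and port names are all assumptions:

module bht (
    input  wire        clk, reset,
    input  wire [31:0] fetch_pc,         // PC of the branch being predicted
    output wire        predict_taken,
    input  wire        update_en,        // a branch resolved this cycle
    input  wire [31:0] update_pc,
    input  wire        actual_taken
);
    reg [1:0] counters [0:63];           // 64 entries, direct-mapped
    integer i;

    wire [5:0] rd_idx = fetch_pc[7:2];   // word-aligned PCs
    wire [5:0] wr_idx = update_pc[7:2];
    wire [1:0] rd_ctr = counters[rd_idx];

    assign predict_taken = rd_ctr[1];    // counter MSB = predict taken

    always @(posedge clk) begin
        if (reset) begin
            for (i = 0; i < 64; i = i + 1)
                counters[i] <= 2'b01;    // start weakly not-taken
        end else if (update_en) begin
            if (actual_taken && counters[wr_idx] != 2'b11)
                counters[wr_idx] <= counters[wr_idx] + 2'b01;
            if (!actual_taken && counters[wr_idx] != 2'b00)
                counters[wr_idx] <= counters[wr_idx] - 2'b01;
        end
    end
endmodule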

Make sure to update your instruction tracer unit from previous labs to give useful information about the instruction stream.  Among other things, make sure to include the current cycle count in your trace. Your instruction tracer should only show committed instructions (i.e. not the squashed ones).  A useful debugging option would be to make another tracer which showed instructions as they were dispatched -- this would show more instructions than your primary instruction tracer.

Note: Implementing branch prediction is highly recommended.

Load-Value Predictor [8]:
Load-value predictors attempt to predict the result of a LW instruction (based on the instruction address). This prediction allows execution to continue for a short time using the predicted value. Meanwhile, the load value is fetched from memory and compared to the predicted value. On a mis-prediction, instructions in the pipeline must be squashed and execution restarted with the correct value.

The simplest type of load-value predictor is a "last-value" predictor, which keeps a table of instruction addresses for loads as well as the previous value actually loaded.  Further, a small "confidence" counter (2 bits) is kept for each entry.  To predict, we use the instruction address of the load to index into the table and predict the previous value as long as the value of the confidence counter is high enough (say at least "2").  When the load is actually resolved, we check to see if (1) it is in the table and (2) whether we predicted the right thing or not.  If it is in the table and we predicted the right thing, we increment the confidence (without overflowing).  If in the table and we didn't predict the right thing, then we decrement the confidence (without going below zero).

Note that you need to think carefully about what it means to have a predicted value in your pipeline and how you will recover from problems.  Also, include some sort of tracer output which indicates which loads are predicted, whether the predictions were correct, etc.
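
The sketch below shows one possible last-value predictor with the 2-bit confidence scheme described above; the table size, the allocate-on-miss policy, and the choice to re-train the stored value after a wrong prediction are assumptions, and all port names are placeholders:

module lv_predictor (
    input  wire        clk, reset,
    // prediction port: a load is in decode/issue
    input  wire [31:0] load_pc,
    output wire        pred_valid,       // confidence high enough to predict
    output wire [31:0] pred_value,
    // resolution port: the load's data returned from memory
    input  wire        resolve_en,
    input  wire [31:0] resolve_pc,
    input  wire [31:0] actual_value
);
    reg        vld   [0:15];             // entry holds a load
    reg [31:0] tag   [0:15];             // instruction address of the load
    reg [31:0] value [0:15];             // last value actually loaded
    reg [1:0]  conf  [0:15];             // 2-bit confidence counter
    integer i;

    wire [3:0] p_idx  = load_pc[5:2];
    wire [3:0] r_idx  = resolve_pc[5:2];
    wire [1:0] p_conf = conf[p_idx];
    wire       r_hit  = vld[r_idx] && (tag[r_idx] == resolve_pc);

    // predict only on a tag match with confidence of at least 2
    assign pred_valid = vld[p_idx] && (tag[p_idx] == load_pc) && p_conf[1];
    assign pred_value = value[p_idx];

    always @(posedge clk) begin
        if (reset) begin
            for (i = 0; i < 16; i = i + 1) begin
                vld[i]  <= 1'b0;
                conf[i] <= 2'b00;
            end
        end else if (resolve_en) begin
            if (r_hit && (value[r_idx] == actual_value)) begin
                if (conf[r_idx] != 2'b11)             // saturate upward
                    conf[r_idx] <= conf[r_idx] + 2'b01;
            end else if (r_hit) begin
                if (conf[r_idx] != 2'b00)             // saturate at zero
                    conf[r_idx] <= conf[r_idx] - 2'b01;
                value[r_idx] <= actual_value;         // re-train the value
            end else begin
                vld[r_idx]   <= 1'b1;                 // allocate on miss
                tag[r_idx]   <= resolve_pc;
                value[r_idx] <= actual_value;
                conf[r_idx]  <= 2'b00;
            end
        end
    end
endmodule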

Jump Target Predictor [8]:
Similar to a branch predictor and load-value predictor, a jump target predictor predicts the destination of JR instructions based on the address of the JR instruction itself. If the destination of the jump is mis-predicted, then instructions in the pipeline must be squashed (similar to branch prediction).

Group C: Caches

Non-Blocking Loads [7]:
Non-blocking loads allow the pipeline to continue execution during a data-cache miss until the fetched data is actually needed.  This can be accomplished by adding a full/empty bit for every register in the register file, and keeping a small table (2-4 entries) for outstanding loads.   The entries in this table are called Miss-Status Handling Registers (MSHRs).  Each MSHR is used to keep information about one outstanding cache miss (for a load).  An MSHR contains (1) a valid bit indicating the MSHR is in use, (2) the address that has missed in the cache, and (3) the register that the result must go back to.  Initially the full/empty bits for all registers are set to "full" (this is a reset condition) and all of the MSHRs are set to invalid.

When a load reaches the memory stage, you check to see if it misses in the cache.  If so, you see if you have a free MSHR and stall if all of them are in use.  If you have a free MSHR, you enter the info for the load into the MSHR, and set the F/E bit for the destination register to empty.  Now, the simplest thing to do is flush all following instructions and restart fetching at the instruction after the load.  The reason is as follows.  In the decode stage, you need to check to see if you are decoding an instruction that has its F/E bit set to empty; if so, you stall until the F/E bit becomes full again.  Make sure you understand WAR and WAW hazards for loads.  Data returning from memory checks the MSHR and updates the appropriate F/E bit, register, and MSHR (making it invalid).
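
One possible shape for the MSHR file and the F/E interlock is sketched below. It assumes two entries, that every refill matches exactly one valid MSHR, and that your memory stage and decode stage provide the placeholder signals shown:

module mshr_file (
    input  wire        clk, reset,
    // allocation: a load missed in the D-cache this cycle
    input  wire        alloc_en,
    input  wire [31:0] alloc_addr,
    input  wire [4:0]  alloc_rd,        // load's destination register
    output wire        mshr_full,       // stall the load if asserted
    // refill: data returned from DRAM for one outstanding miss
    input  wire        fill_en,
    input  wire [31:0] fill_addr,
    output wire [4:0]  fill_rd,         // register to write the data to
    // decode-stage interlock on the full/empty bits
    input  wire [4:0]  src_reg,
    output wire        src_empty        // stall decode if asserted
);
    reg        v    [0:1];
    reg [31:0] addr [0:1];
    reg [4:0]  rd   [0:1];
    reg [31:0] fe;                      // 1 = register empty (load pending);
                                        // reset leaves every register "full"

    assign mshr_full = v[0] && v[1];
    assign src_empty = fe[src_reg];

    wire fill_hit0 = fill_en && v[0] && (addr[0] == fill_addr);
    wire slot      = v[0];              // allocate entry 1 if entry 0 is busy

    assign fill_rd = fill_hit0 ? rd[0] : rd[1];

    always @(posedge clk) begin
        if (reset) begin
            v[0] <= 1'b0;
            v[1] <= 1'b0;
            fe   <= 32'b0;
        end else begin
            if (alloc_en && !mshr_full) begin
                v[slot]      <= 1'b1;
                addr[slot]   <= alloc_addr;
                rd[slot]     <= alloc_rd;
                fe[alloc_rd] <= 1'b1;   // mark the destination empty
            end
            if (fill_en) begin          // retire the matching entry
                if (fill_hit0) begin v[0] <= 1'b0; fe[rd[0]] <= 1'b0; end
                else           begin v[1] <= 1'b0; fe[rd[1]] <= 1'b0; end
            end
        end
    end
endmodule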

Note that there are more clever ways to implement the stall check (without the F/E bits): hint - the only registers that could possibly be empty are those named in the MSHRs.

Write Back Cache [7]:
Implement a write-back scheme as discussed in class. Note that this may interact in a strange way with your write buffer.  In this new write policy, your write buffer will only be used to hold writes until they make it into the cache.  This means that a "write miss" will now load the complete cache line, then allow the write to go forward.

Further, both write and read cache misses need to read a complete 8-word cache line.  Where this gets complicated is that you may need to kick out a dirty cache line in the process.  Make sure that you get this correct!  Remember that writing to memory is a long, painful penalty, and a sloppy write policy can make things much worse.

Also note that you need to enable write bursts as well (for write-back of dirty lines).  To do that, set bit M9 of the mode register to 0, which makes write bursts the same length as read bursts.

Victim Cache [4]:
Implement a Victim Cache which contains 4 cache-lines worth of information, and which sits between the data cache and the DRAM.

Second-level Cache[4]:
Implement a single second-level cache with 32K bytes of data and 64-byte cache lines. If you don't have enough BlockRAMs, you may implement a 24K-byte L2 cache.

Mutual Exclusion[2]:
As discussed in class, one idea that AMD used was mutual exclusion: cache data lives either in the L1 or the L2, but not in both. As an extension to the second-level cache, implement mutual exclusion.

Group D: Miscellaneous

Reorder buffer [12]:
Implement a reorder buffer of at least 8 instructions.  This option makes sense to be combined with a number of other options here.  It must be the case that you have precise exceptions as a result (i.e. that you could stop the system and easily recover good state).  It would certainly help with the out-of-order execution model or the non-blocking-load option. It might also help with branch prediction, load-value prediction, etc.  Even in the non-out-of-order pipeline, it could give you a longer time to let your prediction take effect, i.e. you might predict a load on a cache miss and let the execution continue.  The number of entries in your reorder buffer would directly affect the number of instructions after the load that you could let execute before you absolutely had to know whether the prediction was correct or not...
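
One plausible shape for the buffer's storage is sketched below (storage only; field names and widths are placeholders, and branch/exception state would need more fields). Dispatch allocates at the tail; commit retires the oldest entry at the head once its result has arrived:

module rob_sketch;            // 8-entry reorder buffer storage (ports elided)
    reg        valid [0:7];   // entry allocated at dispatch
    reg        done  [0:7];   // result arrived; oldest && done => commit
    reg [31:0] pc    [0:7];   // handy for the committed-instruction trace
    reg [4:0]  rd    [0:7];   // architectural destination register
    reg [31:0] value [0:7];   // result to write back at commit
    reg [2:0]  head;          // oldest entry (next to commit)
    reg [2:0]  tail;          // next free entry (next to allocate)
endmodule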

Make sure to update your instruction tracer unit from previous labs to give useful information about the instruction stream.  Among other things, make sure to include the current cycle count in your trace. The trace should show committed instructions.  One useful debugging option would be to make another tracer which shows dispatched instructions (as they enter the reorder buffer).

Explicit Register Renaming [12]:
Expand your physical register file to 64 registers, and implement register renaming to eliminate WAR and WAW hazards. You will need to implement a translation table, which maps the ISA registers to your physical registers. Make sure you free a register when it's no longer being used, or you may risk deadlocking your processor because it can't issue. You must be able to repair speculation errors (mispredictions) by restoring your translation table and free list.
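
The renaming structures themselves can be small; a storage-only sketch (names are placeholders, and the single shadow copy shown supports only one outstanding branch):

module rename_sketch;             // renaming state only (ports elided)
    reg [5:0]  map_table [0:31];  // ISA register -> physical register
    reg [5:0]  shadow    [0:31];  // checkpoint copied at a predicted branch
    reg [63:0] free_list;         // bit i set = physical register i is free
endmodule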

Enhanced Superscalar Pipeline [8]:
Modify your instruction and data caches to support simultaneous memory operations. Your caches should be able to perform two simultaneous accesses (to either the same cache line or different ones). Simultaneous hits must not stall the processor. You must be able to handle simultaneous misses.
The only exception to this rule is interdependent data-access instructions, such as:
lw $t0, 0($t9)
lw $t1, 0($t0)
or
lw $t1, 0($t9)
sw $t1, 0($t9)
Note: You may (and probably should) use dual-ported block rams for this purpose.

Stream Buffer (Prefetch) [4]:
Implement a small FIFO of 2 cache lines to prefetch instructions from memory.  On an instruction cache miss, you should go ahead and fill the missing line in the cache, then proceed to fetch additional lines, which you place into the stream buffer.  Then, on subsequent misses, you check the stream buffer to potentially fetch data much more quickly.  Note that one possible implementation would take advantage of burst-mode DRAM operation to fill the stream buffer more quickly...

Multiplier/Divider[4]:
Implement a combined multiplier/divider unit that handles both signed and unsigned multiplies/divides.  Once again, this should be implemented as a coprocessor so that the actual instructions don't block the pipeline.  Only attempts to execute mflo/mfhi instructions should block if the operation is not yet ready. You must improve your multiplier to use Booth's 2-bit (radix-4) encoding.
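
For the multiplier, a radix-4 (2-bit) Booth core for the signed case might look like the sketch below: each step recodes three bits of the multiplier into one of 0, +/-M, or +/-2M, adds it to the upper half of a shifting accumulator, and arithmetic-shifts right by two, finishing in 16 cycles. The port names are placeholders, and the unsigned variants, divide, and the coprocessor/mflo/mfhi interface are left out:

module booth2_mult (
    input  wire        clk, reset, start,
    input  wire [31:0] a,                    // multiplicand (signed)
    input  wire [31:0] b,                    // multiplier (signed)
    output reg         done,
    output wire [63:0] product
);
    // acc = {34-bit partial sum, 32-bit multiplier, 1 booth bit}
    reg  [66:0] acc;
    reg  [4:0]  count;                       // 16 radix-4 steps
    reg         busy;

    wire [33:0] m  = {{2{a[31]}}, a};        // sign-extended multiplicand
    wire [33:0] m2 = {a[31], a, 1'b0};       // 2 * multiplicand

    // recode the low three bits into the multiple to add this step
    reg [33:0] addend;
    always @(*) begin
        case (acc[2:0])
            3'b001, 3'b010: addend = m;
            3'b011:         addend = m2;
            3'b100:         addend = -m2;
            3'b101, 3'b110: addend = -m;
            default:        addend = 34'b0;  // 000 and 111
        endcase
    end

    wire [33:0] sum = acc[66:33] + addend;

    assign product = acc[64:1];              // valid once done is high

    always @(posedge clk) begin
        if (reset) begin
            busy <= 1'b0;
            done <= 1'b0;
        end else if (start) begin
            acc   <= {34'b0, b, 1'b0};
            count <= 5'd0;
            busy  <= 1'b1;
            done  <= 1'b0;
        end else if (busy) begin
            // add the recoded multiple, then arithmetic-shift right by 2
            acc <= {{2{sum[33]}}, sum, acc[32:2]};
            if (count == 5'd15) begin
                busy <= 1'b0;
                done <= 1'b1;
            end else
                count <= count + 5'd1;
        end
    end
endmodule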

Group E: Unfair Benchmarks

Hopefully, in this class, you have learned how benchmarks can be manipulated to misrepresent the performance of different processors.  As part of your final project, each group needs to write a benchmark. This benchmark should be designed to make your processor perform as well as possible, while making other processors perform poorly. You may want to look at the project description website to see what options other groups are implementing, and target those options. Of course, the best benchmark will be one that breaks other processors (just be careful that it doesn't break yours!).

Your benchmark must consist of fewer than 200 instructions, and conform to the instruction set as defined in lab 6. This benchmark must run on the board and display output to the I/O switches, such that one can determine whether the processor successfully finished the benchmark. The end of the test must also display the cycle count (these outputs can be at different breaks). Remember that you need to subtract off the cycle counts to get an accurate count of how many cycles the program took.

We will incorporate some of the better benchmarks in a student test suite, and use them in our final testing / competition. Thus, the benchmarks will be due Monday, 11/17.

Extra-Credit

Extra credit will be assigned for this lab based on a friendly competition between groups for performance. Your design does not need to be 'the best' to receive extra credit, but obviously the best design will receive the most extra points. We will have competitions for: best CPI, highest clock rate, best performance per hardware resource usage (LUT count), and best overall performance, based on total execution time on the benchmarks.

One test program will be released early on, to allow you to target your optimizations. The other benchmark program, a mystery program, will not be released until a few days before the project deadline. Finally, the final benchmark suite will include student-submitted benchmarks from above. A good understanding of the different project options will help you decide in advance which solutions will have the greatest impact on performance.

Increasing the amount of first-level cache in your system is not allowed: the total amount of first-level cache in your system may not exceed 4096 words. The 4096-word figure comes from labs 5/6, which had 2048 words each for the I- and D-caches; although you may not exceed 4096 words total, you may redistribute the memory if you wish (e.g., 1024/3072 words).