University of California at Berkeley

College of Engineering

Department of Electrical Engineering and Computer Science

EECS 61C, Summer 2004

Lab 6-2: Caching

Purpose

To determine the parameters of a computer's cache by observing the amount of time it takes to access the elements of an array.

Reading

P&H, 7.1-3 and 7.5

Description

The program ~cs61c/labs/lab6-2/cache.c (updated at 1 pm today) produces timing data for accesses of elements in an array. Here is its main loop:

for (index = 0; index < limit; index = index + stride) {
     x[index] = x[index] + 1;
}

The rest of the code sets the variables limit and stride, and determines the time it takes to execute the loop.

The program runs a series of tests (varying the values of limit and stride from test to test). Each test is repeated many times, and the results (the time it takes to execute the loop) are averaged. The same loops are run without memory accesses to measure the loop overhead. The program subtracts the overhead time from the loop time, so the final result should be an estimate of the time spent on memory accesses only. Unfortunately, some of our compilers don't compile the real loop and the dummy loop in the same way, so your times may be exaggerated. You don't have to fix this. (A rough sketch of the overhead-subtraction idea appears after the following list.) The program reports three values for each test:

* The size value refers to the size, in bytes, of the array you are testing with, not the size of the cache. (One of your goals is to figure out the cache size for yourself.)

* The stride value says how many bytes are being skipped for each memory access. For example, if the stride is 4, then bytes 0, 4, 8, 12, ... in the array are being accessed while bytes 1, 2, 3, 5, 6, 7, 9, 10, 11, ... are being skipped.

* The read+write value is the amount of time needed to access memory, in nanoseconds. You'll have to figure out for yourself whether these times represent hits or misses. As explained above, the calculated times may be exaggerated if the compiler doesn't generate equivalent machine code for the real loop and the dummy loop.
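
To make the overhead-subtraction idea concrete, here is a rough sketch of the technique in C. This is not the actual code in cache.c; the function name, the use of clock(), and the other details are assumptions made only for illustration.

#include <time.h>

/* Illustration only: time the real loop and an access-free dummy loop with
   the same control flow, then report the difference per array access. */
double ns_per_access(char x[], int limit, int stride, int repetitions)
{
    volatile char sink = 0;    /* keeps the dummy loop from being optimized away */
    clock_t start, real_ticks, dummy_ticks;
    int index, rep;
    int accesses = (limit + stride - 1) / stride;    /* accesses per pass */

    start = clock();
    for (rep = 0; rep < repetitions; rep++)
        for (index = 0; index < limit; index = index + stride)
            x[index] = x[index] + 1;                 /* real loop: touches memory */
    real_ticks = clock() - start;

    start = clock();
    for (rep = 0; rep < repetitions; rep++)
        for (index = 0; index < limit; index = index + stride)
            sink = sink + 1;                         /* dummy loop: no array access */
    dummy_ticks = clock() - start;

    /* average nanoseconds per array access, loop overhead removed */
    return (double)(real_ticks - dummy_ticks) * 1e9
           / CLOCKS_PER_SEC / repetitions / accesses;
}

The sketch declares x as a char array so that index and stride are measured in bytes, matching the description of the stride value above.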

Exercise 1: Gather Timing Data

Start by gathering some data. Log in to a machine of your choice. Note the machine name and any information you can find about its processor; uname -a or uname -X may work on some of the instructional machines. Copy ~cs61c/labs/lab6-2/cache.c and compile it without any optimizations, using gcc -o cache cache.c, then run the program. The program tests reading and writing with arrays from 4 Kbytes to 1 Mbyte at strides varying from 4 to 512 bytes. You will want to capture the output into a file, so use the command ./cache > out (which puts the result in a file named out). This may take a while, as each test in cache.c takes a second or two to complete (because each test is repeated multiple times).

Take a look at the file out. Then graph the data by running gnuplot. If it doesn't work on the machine you connected to, run it from your local lab machine. Give gnuplot the command:

plot "out" using 0:8 with linespoints

That command means "plot the values in the file out as connected points, where each line of data moves one step along the x-axis and uses the read+write time as its y-coordinate". gnuplot breaks the graph line where there's a blank line in the input, so you should see the results of each array-size experiment as separate graph segments. Each segment shows various strides, increasing from 4 bytes at its leftmost point to 512 bytes at its rightmost. (To learn more about gnuplot, try its excellent help system: help plot, help plot with, help set, help, etc.)

Before you start number-crunching, try to understand the qualitative features of the graph. You should be able to explain what each of the jumps in access time means, and why some are much bigger than others. Do all your graph segments start at the same access time? Do they all end at the same access time?

Exercise 2: Determining Cache Parameters

From what you see in the graph and the numbers in your table, you should be able to determine most of the parameters of the test computer's cache. Here are the questions you should answer:

* Approximately how long does a cache hit take? Why?
* Approximately how long does a cache miss take? Why?
* What is the size in bytes of the first-level cache? Why?
* What is the block size in bytes of the first-level cache? Why?
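
While you work out these answers, it may help to keep an idealized model of the measured time per access in mind. The sketch below is not something cache.c computes; the function and every one of its parameters are hypothetical, and a real machine (write buffers, prefetching, additional cache levels) will deviate from it somewhat.

/* Idealized model of the average time per access for a single-level cache.
   All parameters are hypothetical; use this only as a guide for reading the graph. */
double expected_ns(int array_size, int stride, int cache_size, int block_size,
                   double hit_ns, double miss_ns)
{
    double miss_rate;

    if (array_size <= cache_size)
        return hit_ns;          /* the whole array fits: essentially every access hits     */
    if (stride >= block_size)
        return miss_ns;         /* every access starts a new block: essentially all misses */

    /* otherwise only the first access to each block misses */
    miss_rate = (double)stride / block_size;
    return hit_ns + miss_rate * (miss_ns - hit_ns);
}

Comparing your measurements against this kind of model (by eye is enough) is one way to justify the parameter values you report.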

Exercise 3: Determining Associativity

Answer one more question:

* What is the set associativity on the cache you're analyzing? Why?

You may not be able to figure this out from the data given by the original cache.c. Changing the stride_min, stride_max, cache_min, and cache_max settings at the top of the file will help; a sketch of one possible adjustment appears below. When you're calculating the set associativity, you'll want a large stride and large array sizes to eliminate the effects of the block size and cache size. With a sufficiently large stride, and arrays big enough to contain 1, 2, 4, ... elements to access, you should be able to observe the effects of putting several elements into the same cache set (row). Normally, repeatedly accessing elements that map to the same set would cause repeated misses and no benefit from the cache. But if the cache is associative at all, you'll see benefits even when multiple blocks are loaded into the same set. cache.c's style of testing ever-larger array sizes should give you results that show how many blocks can be loaded into one set at once, and that number is the cache's associativity.
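
For example, the adjustment might look something like the sketch below. The exact way cache.c declares these settings (macros versus variables, bytes versus array elements) may differ, and the values are only a guess at a starting point; adapt them to the machine you are measuring.

/* Hypothetical settings for an associativity test: one very large stride so
   that successive elements map to the same cache set, with array sizes swept
   so that 1, 2, 4, ... elements compete for that set. */
#define cache_min  (256 * 1024)        /* smallest array tested: 1 element accessed per pass */
#define cache_max  (16 * 1024 * 1024)  /* largest array tested: 64 elements, all in one set  */
#define stride_min (256 * 1024)        /* a single huge stride ...                           */
#define stride_max (256 * 1024)        /* ... well beyond any plausible block size           */

The key point is that the stride should be a power of two at least as large as the first-level cache, so that (for a power-of-two cache size) every element lands in the same set; the array sizes then control how many such elements compete for it.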

A piece of information not required for this lab but worth exploring is the following:

* Does the machine have more than one level of cache? Why or why not? If so, what are the parameters of the non-primary cache(s)?

You will probably not be able to conclude very much about multiple levels of cache because the smaller caches tend to hide the effects of the larger ones. Do not expect to be able to calculate all the parameters of non-primary caches.

What You Should Have Learned =)

As a result of this lab, you ought to be able to specify what data and features of the cache program let you determine a given cache parameter, and to answer questions like the following (drawn from an old exam):

Running the cache program on the Wazcog Mark IV computer, you get the following results:

stride ->    4    8   16   32   64  128  256

size
64         109  126  122  121  101
128        120  118  116  121  117  105
256        119  116  102  117  113  113  110
512        466  966  949  949  983  308  311
1024       458  949  983  966  949  983  307
2048       449  992  954  966  979  979  979

(Stride and size are in bytes, times in nanoseconds. Yes, this is an artificially small example, to make the numbers fit easily.)

* What is the cache size in bytes?
* What is the block size in bytes?
* What is the allocation policy? (Direct mapped, fully associative, or N-way set associative? If N-way set associative, what is N?)