Our project is to build a cache simulator for a shared-address-space multiprocessor that implements the MESI and MOESI coherence protocols, and then to measure how adding a cache coherence extension known as adaptive sequential prefetching affects performance.
In class we covered snooping-based cache coherence protocols such as MSI, MESI, MOESI, and MESIF.
We read up about adaptive sequential prefetching, which is an extension that can be added onto a cache coherency protocol intended to decrease the cache miss rate. From the paper “Performance Evaluation and Cost Analysis of Cache Protocol Extensions for Shared-Memory Multiprocessors” by Fredrik Dahlgren and Michael Dubois, “Adaptive sequential prefetching cuts the number of read misses by fetching a number of consecutive blocks into the cache in anticipation of future misses. The number of prefetched blocks is adapted according to a dynamic measure of prefetching effectiveness which reflects the amount of spatial locality at different times … Moreover, as opposed to simply adopting a larger block size, it does not affect the false sharing miss rate.”
The main challenge in building a parallel cache simulator is maintaining cache coherence across transactions. Implementing the coherence protocols is nontrivial: each protocol involves many messages and state transitions, so we need to ensure that our simulator implements them correctly. We will also need to design a communication system between the simulated “processors” and make sure they send the correct messages to each other.
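To make the state transitions concrete, here is a minimal sketch of textbook MESI written as two pure functions, one for local accesses and one for snooped bus requests. It is not the CADSS interface: the names `State`, `BusOp`, `on_processor_access`, and `on_snoop` are hypothetical, and the sketch assumes an atomic bus with no transient states.

```python
from enum import Enum


class State(Enum):
    M = "Modified"
    E = "Exclusive"
    S = "Shared"
    I = "Invalid"


class BusOp(Enum):
    BUS_RD = "BusRd"      # read request for a block
    BUS_RDX = "BusRdX"    # read with intent to write
    BUS_UPGR = "BusUpgr"  # upgrade S -> M, no data transfer needed


def on_processor_access(state, is_write, others_have_copy):
    """Return (next_state, bus_op or None) for a local load or store."""
    if state is State.I:
        if is_write:
            return State.M, BusOp.BUS_RDX
        # A read miss lands in E only if no other cache holds the block.
        return (State.S if others_have_copy else State.E), BusOp.BUS_RD
    if state is State.S:
        return (State.M, BusOp.BUS_UPGR) if is_write else (State.S, None)
    if state is State.E:
        # E -> M is silent: no other cache has a copy to invalidate.
        return (State.M if is_write else State.E), None
    return State.M, None  # hits in M need no bus traffic


def on_snoop(state, bus_op):
    """Return (next_state, must_flush) when another cache's request is seen."""
    if state is State.I:
        return State.I, False
    if bus_op is BusOp.BUS_RD:
        # Another reader: drop to S; a dirty (M) copy must be flushed.
        return State.S, state is State.M
    # BusRdX or BusUpgr: another cache wants to write, so invalidate.
    return State.I, state is State.M
```

Keeping the transitions as pure functions makes them easy to check against a hand-written table before wiring them into message passing. MOESI would add an Owned state so that a dirty block can be shared without first being written back.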
Additionally, we need to figure out how to add adaptive sequential prefetching to plain MESI and MOESI while preserving their correctness. Implementing adaptive sequential prefetching requires extra bits for each cache line and additional counters for each cache.
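The sketch below illustrates the general idea of a per-line prefetch bit and per-cache counters driving the prefetch degree. It is a simplified model under stated assumptions, not the exact scheme from the paper: the class and field names, the evaluation window of 16 prefetches, the high and low marks, and the degree cap are all illustrative choices. It also ignores evictions and coherence, and it keeps the degree at 1 or above, which sidesteps the need for any mechanism that turns prefetching back on.

```python
from dataclasses import dataclass, field


@dataclass
class Line:
    # Set when the block arrived via prefetch and has not been referenced yet.
    prefetched: bool = False


@dataclass
class AdaptivePrefetcher:
    degree: int = 1       # number of consecutive blocks prefetched per miss
    max_degree: int = 8   # assumed cap
    window: int = 16      # assumed: re-evaluate after this many prefetches
    high_mark: int = 12   # assumed threshold for raising the degree
    low_mark: int = 4     # assumed threshold for lowering the degree
    issued: int = 0       # prefetches issued in the current window
    useful: int = 0       # prefetched blocks later referenced in this window
    lines: dict = field(default_factory=dict)  # block number -> Line

    def access(self, block):
        """Simulate a read of `block`; return the blocks fetched from memory."""
        line = self.lines.get(block)
        if line is not None:
            if line.prefetched:        # first touch of a prefetched block
                line.prefetched = False
                self.useful += 1
            return []
        fetched = [block]              # demand miss
        self.lines[block] = Line()
        for b in range(block + 1, block + 1 + self.degree):
            if b not in self.lines:
                self.lines[b] = Line(prefetched=True)
                fetched.append(b)
                self._count_prefetch()
        return fetched

    def _count_prefetch(self):
        self.issued += 1
        if self.issued == self.window:
            if self.useful > self.high_mark:
                self.degree = min(self.degree + 1, self.max_degree)
            elif self.useful < self.low_mark:
                self.degree = max(self.degree - 1, 1)
            self.issued = 0
            self.useful = 0
```

With these settings, a purely sequential read stream raises the degree after the first window, while a stream with a large stride lowers it, which is the qualitative behavior we want to measure in the simulator.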
We also need a way to test our simulator and a set of metrics for measuring performance. Our plan is to use the framework from Professor Railing's Computer Architecture Design Simulator for Students (CADSS). We will need to implement the protocols within this framework, which will require learning how the framework works and how to use it. We also need programs to run our cache simulators on, and we plan to find traces of example memory accesses similar to those we used in 15213’s Cache Lab. We should also pick programs that show when adaptive sequential prefetching is most and least useful so that we can best analyze its performance.
We are using CADSS for the cache simulator and testing, as suggested by Professor Mowry. We are referring to the Dahlgren and Dubois paper mentioned above (link) for adding adaptive sequential prefetching. We have not decided what we will use for traces, but we are considering trace-generation software to produce memory traces.
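As a reference point for what such traces look like, here is a small sketch that parses the Valgrind-style format used in Cache Lab, where each line has the form `op address,size` with a hexadecimal address. This is an assumption about the input: it is not the CADSS trace format, and a multiprocessor trace would also need a processor or thread identifier on each record. The names `Access` and `parse_trace` are hypothetical.

```python
from typing import Iterator, NamedTuple


class Access(NamedTuple):
    op: str        # "L" load, "S" store, "M" modify (a load then a store)
    address: int   # byte address
    size: int      # number of bytes accessed


def parse_trace(lines) -> Iterator[Access]:
    """Yield data accesses from Cache Lab style trace lines."""
    for raw in lines:
        line = raw.strip()
        # Skip blank lines and instruction fetches ("I" records).
        if not line or line[0] == "I":
            continue
        op, rest = line.split(maxsplit=1)
        addr, size = rest.split(",")
        yield Access(op, int(addr, 16), int(size))
```

For example, `list(parse_trace(open("example.trace")))` would load every data access in a file, where `example.trace` is a placeholder name.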
PLAN TO ACHIEVE:
HOPE TO ACHIEVE:
We will use the Gates machines. Our system does not require very complicated hardware since it is just a simulator of a cache, but we will require a multiprocessor machine like the ones in Gates since we need to simulate a multiprocessor cache.
Work Completed:
In the five days we have worked on the project since it was approved, we have successfully implemented MSI, MESI, and MOESI protocols on top of the already implemented MI protocol. We cloned Professor Railing’s CADSS repository and read through the relevant modules/starter code for our project before modifying the files in the coherence folders. This involved adding three pairs of cache/snoop functions to facilitate transitions between the different states based on the bus traffic. We have begun validating our implementations by testing on a small set of traces but are still looking into large-scale trace generators. The two options we are researching at the moment are Intel Pin and ScalaMemTrace.
Revised Schedule:
We are still on track to produce all deliverables for the project. We are also optimistic that we will be able to meet the “Hope to Achieve” goals.
We intend to present graphs of miss rate, hit rate, bus traffic, and similar metrics across a variety of traces. These traces will be designed to vary as much as possible so that we can present findings for different machine configurations.
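A minimal sketch of how these metrics could be derived from raw counters is shown below. The class `CacheStats` and its field names are hypothetical and are not taken from CADSS; the sketch assumes the simulator can report hits, misses, and bus transactions per cache.

```python
from dataclasses import dataclass


@dataclass
class CacheStats:
    hits: int = 0
    misses: int = 0
    bus_transactions: int = 0

    @property
    def accesses(self) -> int:
        return self.hits + self.misses

    @property
    def miss_rate(self) -> float:
        # Guard against an empty trace to avoid dividing by zero.
        return self.misses / self.accesses if self.accesses else 0.0

    @property
    def hit_rate(self) -> float:
        return self.hits / self.accesses if self.accesses else 0.0

    @property
    def bus_traffic_per_access(self) -> float:
        return self.bus_transactions / self.accesses if self.accesses else 0.0
```

Normalizing bus traffic by the number of accesses makes traces of different lengths comparable on the same graph.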
We do not yet have any data to report other than successful validation of small traces.
Right now we are most concerned about trace generation for validation. We have spent more than half our working time so far on researching the tools mentioned earlier and are still struggling to get results with them. This is not that big of a problem right now because we can use existing traces and generate medium-sized ones on our own, but we would like to figure out large-scale generation sooner rather than later. For the actual implementation, it seems like it’s a matter of just doing the work. We didn’t find it too hard to implement the cache coherence protocols and we think we have a good grasp on adaptive prefetching.
This section will present the final results of the project, including performance comparisons, conclusions on the effectiveness of adaptive sequential prefetching, and potential future work. (Content coming soon.)