
As the demand for fast computing grows, CPU cores run at ever higher clock rates, and the latency of main memory accesses must be hidden from them. Multi-level caches are a practical solution for providing faster access to data that resides in main memory. The memory hierarchy, which otherwise acts as a bottleneck on CPU core performance, can be relaxed by a hierarchical cache structure that reduces the effective access latency and lets the core make better use of its clock.

Multi-Level Cache
Accessing main memory for every instruction execution would result in very slow processing, with the effective speed limited by the time taken to find the data in main memory and fetch it. Data caching is used to hide this memory latency from the processor. Whenever data is required by the processor, it is fetched from memory and stored in a small structure called a cache. Any further reference to that data searches the cache before going to main memory. In terms of the time taken to search and fetch data, this structure resides much closer to the processor than main memory does. The advantage of using a cache can be shown by calculating the Average Access Time (AAT) of the memory hierarchy with and without a cache.

Average Access Time (AAT)
Because a cache is small, it may miss frequently, forcing the processor to go to main memory to fetch the data. The AAT therefore depends on the miss rate of every structure that is searched on the way to the data.

$$AAT = HitTime + (MissRate \times MissPenalty)$$

The AAT for main memory alone is simply the main-memory hit time. The AAT for a cache is the cache hit time plus the product of the cache miss rate and the miss penalty (the time taken to go to main memory after missing in the cache). Since the cache hit time is far smaller than the main-memory access time, adding a cache to the memory hierarchy improves the resulting AAT significantly.
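Applying the formula above recursively, level by level, the miss penalty of each cache is itself the AAT of the next level down. For a two-level hierarchy this expands to:

$$AAT = HitTime_{L1} + MissRate_{L1} \times \left( HitTime_{L2} + MissRate_{L2} \times MissPenalty_{mem} \right)$$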

Trade Offs
Using a cache does not always yield the desired improvement in data-fetch time, because of the way caches are organized and searched. For example, a direct-mapped cache usually has a higher miss rate than a fully associative cache of the same size, although this also depends on the benchmark the processor is tested on and the pattern of its accesses. On the other hand, always using a fully associative cache costs more power, since every lookup must search the whole cache. The trade-off between power consumption and cache size and associativity therefore becomes critical in cache design.
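The miss-rate difference between the two organizations can be illustrated with a minimal simulation. This is a sketch with hypothetical parameters (8 cache lines, a two-address conflict pattern), not a model of any real cache:

```python
from collections import OrderedDict

def direct_mapped_misses(addresses, num_lines):
    """Each address maps to exactly one line (index = addr % num_lines)."""
    cache = {}  # index -> tag currently stored in that line
    misses = 0
    for addr in addresses:
        index, tag = addr % num_lines, addr // num_lines
        if cache.get(index) != tag:
            misses += 1
            cache[index] = tag  # evict whatever shared this index
    return misses

def fully_associative_lru_misses(addresses, num_lines):
    """Any address may occupy any line; least recently used is evicted."""
    cache = OrderedDict()  # address -> None, ordered by recency
    misses = 0
    for addr in addresses:
        if addr in cache:
            cache.move_to_end(addr)  # mark as most recently used
        else:
            misses += 1
            if len(cache) >= num_lines:
                cache.popitem(last=False)  # evict LRU entry
            cache[addr] = None
    return misses

# Addresses 0 and 8 map to the same line of an 8-line direct-mapped cache,
# so they keep evicting each other; the fully associative cache holds both.
pattern = [0, 8] * 100
print(direct_mapped_misses(pattern, 8))          # 200: every access misses
print(fully_associative_lru_misses(pattern, 8))  # 2: only the first pair misses
```

The same total capacity gives a 100% miss rate in one organization and 1% in the other on this adversarial pattern, which is the conflict-miss effect described above.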

Evolution
On a miss in the cache, the purpose of such a structure is defeated and we must ultimately go to main memory to fetch the required data. This is where the idea of multiple levels of cache comes into the picture: if we miss in the cache closest to the processor, we search the next closest level of cache, and keep doing so until we run out of levels of cache and finally search main memory. The general trend is to keep the L1 cache small, at a distance of 1-2 CPU cycles from the processor, with the lower levels of cache increasing in size so that they store more data than L1 and hence miss less often. This in turn results in a better AAT. Architects can choose the number of cache levels according to their requirements, after checking the trade-offs between cost, AAT, and size.
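The lookup order described above can be sketched as a simple search chain. The level names and data here are made up for illustration:

```python
def find_data(address, levels, main_memory):
    """Search each cache level closest-first; fall back to main memory.

    levels: list of (name, cache_dict) ordered from L1 outward.
    Returns (where the data was found, the data).
    """
    for name, cache in levels:
        if address in cache:
            return name, cache[address]
    return "main memory", main_memory[address]

# Hypothetical contents: the block is absent from L1 but present in L2.
main_memory = {0x1000: 42}
l1, l2 = {}, {0x1000: 42}
print(find_data(0x1000, [("L1", l1), ("L2", l2)], main_memory))  # ('L2', 42)
```

A real hierarchy would also fill the missed block into the closer levels on the way back; that refill step is omitted here for brevity.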

Multi-Level Caches
With technology scaling making memory systems small enough to accommodate on a single chip, most modern processors use up to 3 or 4 levels of cache. The reduction in AAT can be understood from the following example, which computes the AAT for configurations of up to three cache levels. Assume: main memory = 50ns; L1 = 1ns (10% miss rate); L2 = 5ns (1% miss rate); L3 = 10ns (0.2% miss rate).

AAT (No cache) = 50ns

AAT (L1 cache+ Main Memory) = 1ns + (0.1*50ns) = 6ns

AAT (L1 cache + L2 cache + Main Memory) = 1ns + 0.1*(5ns + 0.01*50ns) = 1.55ns

AAT (L1 cache + L2 cache + L3 cache + Main Memory) = 1ns + 0.1*(5ns + 0.01*(10ns + 0.002*50ns)) = 1.5101ns
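The figures above can be checked with a small helper that applies the AAT recursion from the innermost level outward (times in ns, miss rates as fractions):

```python
def aat(levels, memory_time):
    """levels: list of (hit_time, miss_rate) ordered from L1 outward.

    AAT = t1 + m1*(t2 + m2*(t3 + ... + mN*memory_time)),
    so we fold from the outermost level back toward L1.
    """
    total = memory_time
    for hit_time, miss_rate in reversed(levels):
        total = hit_time + miss_rate * total
    return total

print(aat([], 50))                                             # 50: no cache
print(round(aat([(1, 0.10)], 50), 4))                          # 6.0
print(round(aat([(1, 0.10), (5, 0.01)], 50), 4))               # 1.55
print(round(aat([(1, 0.10), (5, 0.01), (10, 0.002)], 50), 4))  # 1.5101
```

Note how the second and third levels each shave off most of the remaining miss penalty, but with rapidly diminishing returns (6ns to 1.55ns to 1.5101ns).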

Intel Broadwell Microarchitecture (2014)
 * L1 Cache - 64kB per core
 * L2 Cache - 256kB per core
 * L3 Cache - 2MB to 6MB shared
 * L4 Cache - 128MB of eDRAM (Iris Pro Models only)

Intel Kaby Lake Microarchitecture (2016)
 * L1 Cache - 64kB per core
 * L2 Cache - 256kB per core
 * L3 Cache - 8192kB shared

IBM Power 7
 * L1 Cache (Instruction + Data) - 32kB each, 8-way, 128B block; each cache 64-banked, each bank with 2 read + 1 write ports; write-through (1-16 bytes)
 * L2 Cache - 256kB, 8-way, 128B block; write-back; inclusive of L1; 2ns access latency
 * L3 Cache - 32MB total, organized as 8 regions of 4MB, each region 8-way; 6ns local-region access, 30ns remote; DRAM data array, SRAM tag array

Disadvantages

 * Increased cost of the memory system.
 * Increased area consumed by the memory system.
 * For large programs with poor temporal locality, even multi-level caches cannot improve performance, and main memory must eventually be reached to fetch the data.