Cache performance measurement and metric

A CPU cache is a piece of hardware that reduces the access time to data in memory by keeping some part of the frequently used data of the main memory in itself. It is smaller and faster than the main memory.

The performance of a computer system depends on the performance of all its individual units, which include execution units like integer, branch and floating point, I/O units, the bus, caches and the memory system. The gap between the speed of the processor and the speed of the main memory has grown exponentially. Until 2001–2005, CPU speed, as measured by clock frequency, grew annually by 55%, whereas memory speed grew by only 7% annually. This problem is known as the memory wall. The motivation behind creating the cache and its hierarchy was to bridge this speed gap and overcome the memory wall.

The cache is a critical component of most high-performance computers. Since the cache was created to bridge the speed gap, its performance measurement and metrics play an important role in designing and choosing various parameters like cache size, associativity, replacement policy, etc. Cache performance depends on cache hits and cache misses, which are the factors that constrain the performance of the system. Cache hits are accesses that actually find the requested data in the cache, whereas cache misses are those that do not find the block in the cache. These cache hits and misses contribute to the average memory access time (AMAT), also known as average access time (AAT), which, as the name suggests, is the average time it takes to access memory. This is a major metric of cache performance, and it becomes highly significant as the speed of the processor increases.

Introduction to types of cache misses
The performance of the processor under a cache hierarchy depends on the number of accesses that find the blocks in the cache (cache hits) versus those that do not. When an attempt to read or write data in the cache is unsuccessful, the access falls through to a lower cache level or to main memory and incurs a much longer latency; this event is known as a cache miss. There are three basic types of cache misses, known as the 3Cs, along with some other, less common kinds.

Compulsory misses
Each memory block, when first referenced, causes a compulsory miss. This implies that the number of compulsory misses equals the number of distinct memory blocks ever referenced. They are also called cold misses. Cold misses cannot be avoided unless the block is prefetched.
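
As a minimal illustration (the block-address trace below is hypothetical), the compulsory miss count is simply the number of distinct blocks referenced:

```python
# Each distinct block address incurs exactly one compulsory (cold) miss,
# since its first reference can never find the block already in the cache.
trace = [0x10, 0x20, 0x10, 0x30, 0x20, 0x10]  # hypothetical block addresses
print(len(set(trace)))  # 3 distinct blocks -> 3 cold misses
```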

It has been observed that increasing the block size to a certain extent, in order to exploit spatial locality, leads to a decrease in cold misses: a larger block effectively prefetches nearby words and prevents future cold misses to them. Increasing the block size too much, however, leads to prefetching of useless data, thus increasing the overall number of misses.

Conflict misses
Conflict misses occur when the required data was in the cache previously but was evicted. These evictions occur because another request was mapped to the same cache line. Generally, conflict misses are measured by subtracting the number of misses of a fully associative cache from the number of misses of a cache with limited associativity, for the same cache size and cache block size.
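
Following that definition, the measurement can be stated as the relation below (notation introduced here for illustration):

$$ Miss_{conflict} = Miss_{\text{set-assoc}} - Miss_{\text{fully-assoc}} $$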

Since conflict misses can be attributed to the lack of sufficient associativity, increasing the associativity to a certain extent (8-way associativity is almost as effective as full associativity) decreases the number of conflict misses. However, increasing associativity increases the cache access time, and a fully associative cache consumes much more power than a set-associative one.

Capacity misses
A capacity miss occurs due to the limited size of a cache and not the cache's mapping function. When the working set, i.e., the data that is currently important to the program, is bigger than the cache size, capacity misses occur frequently. Of the 3Cs, capacity misses are the hardest to identify; they can be thought of as the non-compulsory misses of a fully associative cache. In a single-processor system, the misses that remain after subtracting the number of compulsory misses and conflict misses can be categorized as capacity misses.
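
In equation form (notation introduced here, matching the subtraction just described):

$$ Miss_{capacity} = Miss_{total} - Miss_{compulsory} - Miss_{conflict} $$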

Since capacity misses can be attributed to the limited size of a cache, a simple way to reduce the number of such misses is to increase the cache size. Although this method is very intuitive, it leads to a longer access time and an increase in cache area and power consumption.

The above three kinds of misses apply only to uniprocessor systems.

Coherence misses
The 3Cs group of cache misses can be extended to 4Cs when a multiprocessor system with caches is involved, the fourth C being coherence misses. The coherence miss count is the number of memory accesses that miss because a cache line that would otherwise be present in the thread's cache has been invalidated by a write from another thread. Coherence in a multiprocessor system is maintained if only one copy of a memory block is present or if all the copies have the same value. Even if all the copies of a memory block do not have the same value, it does not necessarily lead to a coherence miss. A coherence miss occurs when threads execute loads such that they observe different values of the memory block.

The coherence problem is complex and affects the scalability of parallel programs. A global order of all memory accesses to the same location must exist across the system to tackle this problem.

System-related misses
System activities such as interrupts, context switches and system calls cause a process to be suspended and its cache state to be altered. When process execution resumes, the process suffers cache misses to restore the cache state that was altered. These misses are called system-related misses.

Average memory access time
These cache misses directly correlate with an increase in cycles per instruction (CPI). However, how much effect the cache misses have on the CPI also depends on how much of the cache miss latency can be overlapped with computation due to instruction-level parallelism (ILP) and how much of it can be overlapped with other cache misses due to memory-level parallelism (MLP). If we ignore both these effects, the average memory access time (AMAT) becomes an important metric. It is used to measure the performance of memory systems and hierarchies, and refers to the average time it takes to perform a memory access. It is the sum of the execution time for the memory instructions and the memory stall cycles: the execution time is the time taken for a cache access, and the memory stall cycles include the time taken to service a cache miss by accessing the lower levels of memory. If the access latency, miss rate and miss penalty are known, the average memory access time can be calculated with the following relation:

$$ AMAT = T_{L1} + MR_{L1} \cdot MP_{L1} $$

where $$ T_{L1} $$ is the access latency of the level one cache, $$ MR_{L1} $$ is the miss rate of the level one cache, and $$ MP_{L1} $$ is the miss penalty, i.e., the additional cycles a miss at level one takes to be serviced compared to a hit, calculated using the following relation:

$$ MP_{L1} = T_{L2} + MR_{L2} \cdot MP_{L2} $$

This formula can be expanded further and applied recursively to all further levels in the memory hierarchy to obtain the $$AMAT$$.
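
As a minimal sketch (not from the source) of this recursive evaluation, with hypothetical latencies and miss rates:

```python
def amat(latencies, miss_rates, memory_latency):
    """Recursively compute average memory access time (AMAT).

    latencies[i]  -- hit latency of cache level i (cycles)
    miss_rates[i] -- miss rate of cache level i
    memory_latency -- main memory latency, the final miss penalty
    """
    if not latencies:
        # Past the last cache level, a miss is serviced by main memory.
        return memory_latency
    # AMAT(Li) = T(Li) + MR(Li) * MP(Li), where MP(Li) = AMAT(Li+1).
    return latencies[0] + miss_rates[0] * amat(
        latencies[1:], miss_rates[1:], memory_latency)

# Hypothetical two-level hierarchy: L1 hits in 1 cycle with a 10% miss
# rate, L2 hits in 10 cycles with a 5% miss rate, memory takes 100 cycles.
print(amat([1, 10], [0.1, 0.05], 100))  # 1 + 0.1*(10 + 0.05*100) = 2.5
```

The recursion terminates at main memory, whose access latency serves as the final miss penalty.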

Power law of cache misses
The power law of cache misses shows a trend in the capacity misses of a particular application as affected by the cache size. This empirical observation led to the mathematical form of the power law, which relates the miss rate to the cache size. It can be stated as

$$ M = M_0 \cdot C^{-\alpha} $$

where $$M$$ is the miss rate for a cache of size $$C$$ and $$M_0$$ is the miss rate of a baseline cache. The exponent $$\alpha$$ is workload-specific and typically ranges from 0.3 to 0.7, with an average of 0.5. The power law has been validated on quite a few real-world benchmarks.

This relation clearly shows that only a small fraction of cache misses can be eliminated by a constant increase in cache size. Note also that the law holds true only for a certain finite range of cache sizes, up to the point where the miss rate flattens out. The miss rate eventually stagnates, approaching zero at a sufficiently large cache size, and beyond that point the relation no longer gives correct estimates.
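
For illustration, a small sketch (hypothetical baseline miss rate; $$C$$ interpreted as cache size relative to the baseline) showing how slowly the predicted miss rate falls as the cache grows with $$\alpha = 0.5$$:

```python
def power_law_miss_rate(m0, size_ratio, alpha=0.5):
    """Power-law estimate M = M0 * C^(-alpha), with C expressed as the
    cache size relative to the baseline cache whose miss rate is m0."""
    return m0 * size_ratio ** (-alpha)

# Hypothetical baseline miss rate of 5%: each doubling of cache size
# removes only ~29% of the remaining misses (a factor of 2**-0.5).
for ratio in (1, 2, 4, 8):
    print(f"{ratio}x cache: miss rate {power_law_miss_rate(0.05, ratio):.4f}")
```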

Stack distance profile
The stack distance profile is a better representation of how cache misses are affected by cache size; the power law of cache misses gives only a rough approximation of the same. A stack distance profile captures the temporal reuse behavior of an application in a fully associative or set-associative cache.

Applications that tend to exhibit more temporal reuse generally access data that was more recently used. Let us assume the associativity of a cache to be $$A$$. To collect the stack distance profile information of this cache, assuming it has an LRU replacement policy, $$A+1$$ counters are used: $$C_1$$ through $$C_A$$, plus one additional counter $$C_{>A}$$, which keeps the count of the misses. The counter $$C_i$$ is incremented when there is a hit in the $$i^{th}$$ most recently used way, and the counter $$C_{>A}$$ is incremented on every miss. The stack distance profile shows how the number of hits decreases from the most recently used data to the least recently used. Using this stack distance profile information, the cache miss count for a cache with associativity $$A'$$ and LRU replacement policy, where $$A' < A$$, can be computed as

$$ miss = C_{>A} + \sum_{i=A'+1}^{A} C_i $$

This information is limited in that it can only capture temporal reuse across different associativities. For other purposes, temporal reuse has to be studied in greater detail.
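
As an illustrative sketch (not from the source), the following models a single fully associative LRU set of depth $$A$$ and collects the counters described above; the trace and associativity are hypothetical:

```python
def stack_distance_profile(trace, assoc):
    """Simulate one LRU set of associativity `assoc`, collecting the
    stack distance counters C_1..C_A and the miss counter C_{>A}."""
    stack = []              # index 0 holds the most recently used block
    hits = [0] * assoc      # hits[i] is the counter C_{i+1}
    misses = 0              # the counter C_{>A}
    for block in trace:
        if block in stack:
            hits[stack.index(block)] += 1  # hit at stack distance index+1
            stack.remove(block)
        else:
            misses += 1
            if len(stack) == assoc:
                stack.pop()                # evict the least recently used
        stack.insert(0, block)             # block becomes most recently used
    return hits, misses

# Misses of a smaller cache with associativity A' < A, per the formula:
# miss = C_{>A} + sum of C_i for i = A'+1 .. A
hits, misses = stack_distance_profile([1, 2, 3, 1, 2, 4, 1], assoc=4)
a_prime = 2
print(misses + sum(hits[a_prime:]))  # predicted misses at 2-way: 7
```

Because LRU obeys the stack (inclusion) property, the counters collected once at associativity $$A$$ predict the miss count for every smaller associativity $$A' < A$$ without re-running the simulation.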