Value cache encoding

Power consumption is becoming increasingly important for embedded, mobile, and high-performance computing systems alike. The off-chip data bus consumes a significant part of system power: it has been observed to account for between 9.8% and 23.2% of the total power consumed, depending on the system. Reducing the power consumption of the off-chip data bus therefore reduces overall power consumption.

Introduction
Off-chip buses have higher capacitance than internal nodes. Techniques that minimize switching on the external address and data buses are therefore useful, even at the expense of a slight increase in switching at internal capacitances. Off-chip power consumption can be reduced in at least two ways: by reducing the number of bus lines activated during a data transfer, and by reducing the number of bit transitions on the active bus lines. Combining the two techniques gives the best results. Value cache encoding is a scheme for reducing power consumption on the off-chip data bus. It places a small cache at each end of the data bus to reduce dynamic power dissipation, and the two caches are maintained so that their contents are identical at all times.
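The second technique, reducing bit transitions, can be made concrete by counting how many bus lines toggle between consecutive words. The following is a minimal sketch; the function names and the sample trace are illustrative assumptions, not part of the scheme itself.

```python
def transitions(prev_word: int, next_word: int) -> int:
    """Number of bus lines that toggle between two consecutive words."""
    return bin(prev_word ^ next_word).count("1")


def total_transitions(words, initial=0):
    """Sum of line toggles over a sequence of words driven on the bus."""
    total = 0
    prev = initial
    for word in words:
        total += transitions(prev, word)
        prev = word
    return total


# Hypothetical trace on a 32-bit bus that starts with all lines low.
trace = [0x0000FFFF, 0xFFFF0000, 0x0000FFFF]
print(total_transitions(trace))  # 16 + 32 + 32 toggles
```

Dynamic bus power grows with this toggle count, which is why an encoding that sends a short index instead of a full word can help.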

Scheme Details
In this protocol, we employ a small cache (called a value cache, or VC for short) at each side of the off-chip data bus. These value caches keep track of the data values that have recently been transmitted over the bus, and their entries are managed so that the contents of both value caches are the same at all times. When a data value needs to be transmitted over the bus, we first check whether it is in the value cache of the sender (which may be either the memory or the cache). If it is, we transmit only the index of the data (i.e., its value cache address) instead of the actual data value, and the receiver recovers the data value by looking up this index in its own value cache. To transmit a cached value with only one bit of switching activity, the size of the value cache is limited to the width of the data bus. That is, with a 32-bit bus, the VC can have only 32 entries. Since the value caches employed by our power protocol are very small, the width of the index is much smaller than the width of the actual data value. Consequently, fewer off-chip bus lines need to be activated for the transmission. Our approach thus targets the first option, reducing the number of active bus lines, by exploiting the locality of the data values communicated over the off-chip data bus. Once the width of the transmitted data has been reduced, we can also expect, in general, a reduction in the average bit switching activity per transfer. In addition, this switching activity can be further reduced by using well-known bus encoding schemes in conjunction with our strategy.
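The lookup-and-encode step can be sketched as follows. This assumes one plausible reading of the "one bit of switching activity" remark, namely that a hit on entry i is signalled by driving only bus line i (a one-hot code), which is why the VC cannot have more entries than the bus has lines. The names `encode`, `decode`, and `BUS_WIDTH` are hypothetical, and updating the VC on a miss is deliberately left out.

```python
BUS_WIDTH = 32  # assumed bus width; the VC then has at most 32 entries


def encode(value, vc):
    """Return (is_index, word) for one transfer.

    vc is the sender's list of cached values; is_index plays the role
    of a flag telling the receiver how to interpret the word.
    """
    if value in vc:
        return True, 1 << vc.index(value)  # one-hot: only line i is high
    return False, value                    # miss: send the value verbatim


def decode(is_index, word, vc):
    """Recover the data value on the receiving side."""
    if is_index:
        return vc[word.bit_length() - 1]   # position of the single set bit
    return word


vc = [0, 1, 0xFFFFFFFF, 100]               # identical contents on both sides
assert len(vc) <= BUS_WIDTH
flag, word = encode(100, vc)
print(flag, bin(word), decode(flag, word, vc))
```

A binary index would need fewer lines but could toggle several of them at once; the one-hot form trades lines for a single toggle per hit.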

Cache Consistency
The receiver side runs the same placement and replacement policy for the VC as the sender. Thus, a data value sent over the bus is copied into the receiver VC at the same index location as in the sender VC. One extra control bit indicates whether the word being sent over the bus is the verbatim data or an index into the VC. A memory write is handled in the same fashion.
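A minimal sketch of how the two ends stay consistent is shown below. It assumes a simple round-robin placement as a stand-in for the real policy (the point is only that both ends apply the same deterministic update), and the class and method names are hypothetical.

```python
class ValueCache:
    """One end of the bus. Both ends must apply identical updates."""

    def __init__(self, size=4):
        self.entries = [None] * size
        self.next_slot = 0  # round-robin victim pointer (placeholder policy)

    def _insert(self, value):
        self.entries[self.next_slot] = value
        self.next_slot = (self.next_slot + 1) % len(self.entries)

    def send(self, value):
        """Return (control_bit, payload) and update the local VC."""
        if value in self.entries:
            return 1, self.entries.index(value)  # hit: payload is an index
        self._insert(value)
        return 0, value                          # miss: payload is the data

    def receive(self, control_bit, payload):
        """Decode a transfer and mirror the sender's update."""
        if control_bit == 1:
            return self.entries[payload]
        self._insert(payload)
        return payload


memory_side, cache_side = ValueCache(), ValueCache()
for v in [7, 9, 7, 3, 5, 11, 9]:
    assert cache_side.receive(*memory_side.send(v)) == v
    assert memory_side.entries == cache_side.entries
```

Because the receiver sees exactly the information the sender acted on (hit or miss, and the value on a miss), no extra synchronization traffic is needed in this sketch.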

Example
We assume that initially the values 100 and 200 are not present in the VC. During Transaction #1, data item A is sent from the memory to the cache. The requested data item, stored at some address (say address X of the memory), has the value 100. The memory controller searches the VC for the value 100 and detects a miss. Therefore, the value 100 is sent over the off-chip data bus. Also, following our power protocol, the value 100 is stored at the same location (say 5) in the VCs of both the source and the destination. For Transaction #2, the memory controller searches for the value 200, cannot find it in the VC, and repeats the steps described above. At this point the value caches at both ends contain the data values 100 and 200. In Transaction #3, the memory controller finds that the value 100 has to be sent to serve a read request (to the same memory location as before, or to a different memory location that holds the same value). The value 100 is already present in location 5 of the sender's VC as a result of Transaction #1. Therefore, instead of sending the value 100, the memory controller just sends the index value 5. The receiver, in turn, fetches the actual data value (100 in this case) from location 5 of its VC. Finally, in Transaction #4, we want to send the data item D, which has the value 200, to the memory (i.e., a write request). The value 200 is already cached in both VCs as a result of Transaction #2 from memory to cache. Consequently, the index of the cached copy of the value 200 is used to complete Transaction #4, this time in the reverse direction. This last transaction shows that data placed in the VCs during a transaction in one direction can be reused during a transaction in the reverse direction.
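The four transactions can be replayed with a small simulation. This is a sketch only: the `Bus` class is hypothetical, and the placement rule (fill consecutive slots starting at 5) is contrived so that the value 100 lands in location 5 as in the example; the location chosen for 200 is an assumption.

```python
class Bus:
    """A memory-side VC and a cache-side VC kept in lockstep."""

    def __init__(self, size=32, first_slot=5):
        self.vc = {"memory": [None] * size, "cache": [None] * size}
        self.next_free = first_slot  # contrived so that 100 lands in location 5

    def transfer(self, value, sender, receiver):
        """Return what is driven on the bus for this transaction."""
        src, dst = self.vc[sender], self.vc[receiver]
        if value in src:
            index = src.index(value)
            assert dst[index] == value  # receiver finds it in the same slot
            return ("index", index)
        src[self.next_free] = value     # miss: both sides store the value
        dst[self.next_free] = value     # at the same location
        self.next_free = (self.next_free + 1) % len(src)
        return ("data", value)


bus = Bus()
log = [
    bus.transfer(100, "memory", "cache"),  # Transaction #1: read, miss
    bus.transfer(200, "memory", "cache"),  # Transaction #2: read, miss
    bus.transfer(100, "memory", "cache"),  # Transaction #3: read, hit
    bus.transfer(200, "cache", "memory"),  # Transaction #4: write, hit
]
print(log)
```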

Replacement Policy
LRU is used as the replacement policy in both caches. It is implemented using a reference bit and an n-bit timestamp for each value stored in the cache. When a value appears on the input, its reference bit is set. At regular intervals, the reference bit is shifted right into the high-order bit position of the n-bit timestamp, so that all bits in the timestamp are also shifted right and the lowest-order bit is discarded. This operation is performed on all entries of both caches, and all reference bits are then reset. Thus, the timestamp keeps the history of value occurrences for the last n time periods. For example, with n = 3, timestamp 000 means the value did not appear during the last three time intervals, timestamp 100 means it was seen in the most recent completed interval, and timestamp 000 with the reference bit set means it has been encountered in the current interval. When an entry is required and a value must be evicted, the entry selected is the one with the smallest timestamp and a clear reference bit. The new value is placed in the selected entry with a fresh reference bit and an all-zero timestamp.
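The reference-bit and timestamp mechanism can be sketched as below. Several details are assumptions made for illustration: a new entry starts with its reference bit set (it has just been seen), ties between equal timestamps go to the first entry found, and if every entry has its reference bit set the smallest timestamp is still chosen. The names `AgingEntry` and `pick_victim` are hypothetical.

```python
class AgingEntry:
    """One VC entry with a reference bit and an n-bit timestamp."""

    def __init__(self, value, n_bits=3):
        self.value = value
        self.ref = 1    # assumed: a newly inserted value counts as just seen
        self.stamp = 0  # n-bit history, newest interval in the high-order bit
        self.n_bits = n_bits

    def touch(self):
        """The value appeared on the input during the current interval."""
        self.ref = 1

    def tick(self):
        """End of interval: shift ref into the high bit, drop the low bit."""
        self.stamp = (self.ref << (self.n_bits - 1)) | (self.stamp >> 1)
        self.ref = 0


def pick_victim(entries):
    """Prefer a clear reference bit, then the smallest timestamp."""
    return min(entries, key=lambda e: (e.ref, e.stamp))


a, b = AgingEntry(100), AgingEntry(200)
for entry in (a, b):
    entry.tick()          # interval 1: both were just inserted
a.touch()                 # interval 2: only the value 100 is seen again
for entry in (a, b):
    entry.tick()
for entry in (a, b):
    entry.tick()          # interval 3: neither value is seen
print(format(a.stamp, "03b"), format(b.stamp, "03b"), pick_victim([a, b]).value)
```

Running `tick` on every entry at both ends at the same moments keeps the two caches choosing the same victim.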

Type of Value Cache
Having described the protocol, we now consider two approaches for maintaining the value caches: loading them with a fixed set of values, or identifying the frequent values on the fly.
 * 1) Both caches can be initialized with a fixed set of values chosen according to how frequently values appeared in a previous run.
 * 2) A changing set of frequent values can be maintained as the program runs. Thus, the contents of the frequent value tables adapt to changes in the frequent values for different parts of the execution.
The advantage of filling the cache with fixed values is that the contents of the table do not have to be changed dynamically, which reduces the runtime overhead. However, it requires that the values be known beforehand, and different programs need different values. The second method, on the other hand, does not need a priori information about data values and does not distinguish among different programs. With these features, we pay the price of
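As an illustration of the first approach, a fixed table could be derived from a trace of values recorded in a previous run. This is a sketch under assumptions: the trace format, the table size, and the helper names `fixed_table` and `hit_rate` are all hypothetical, and the adaptive alternative is not shown.

```python
from collections import Counter


def fixed_table(previous_trace, size):
    """Approach 1: pick the most frequent values seen in an earlier run."""
    return [value for value, _ in Counter(previous_trace).most_common(size)]


def hit_rate(trace, table):
    """Fraction of transfers whose value is found in a fixed table."""
    return sum(v in table for v in trace) / len(trace)


# Hypothetical values observed on the bus during a profiling run.
profile_run = [0, 0, 0, 1, 1, 0xFF, 0xFF, 0xFF, 0xFF, 42]
table = fixed_table(profile_run, size=2)
print(table, hit_rate([0, 0xFF, 7, 0, 1], table))
```

The hit rate on a later trace shows how well the profiled values carry over, which is the main risk of a fixed table.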

Other Application
The protocol was described for a bus with an on-chip cache at one end and off-chip memory at the other, but our strategy can also be adapted to work with an off-chip L2 cache. In addition, the power protocol can be used to reduce the switching activity between an on-chip L1 cache and an on-chip L2 cache (although the results would not be as good as those with the off-chip bus). In fact, our strategy can be used between any two communicating devices in the system (with VC support). Further, we are not restricted to point-to-point configurations: our approach can be made to work in an environment where multiple devices communicate over a shared (power-hungry) data bus. In this case we would need, among other things, a coherence mechanism (the discussion of which is beyond the scope of this paper). A drawback of our strategy is the extra space needed by the two value caches (one on-chip and the other off-chip). In this paper, we do not present a detailed study of the circuit space implications of our approach. As will be presented in the experimental results section, even a small VC (128 entries) yields reasonably good energy behavior, so we can expect that the space overhead of our optimization will not be excessive.