
= Power Optimization and Management Techniques for Memory and Storage =

Memory and storage devices are integral parts of all computing systems. Both are used to store data, with memory typically being volatile and storage typically being non-volatile. The characteristics of each device lend themselves to different applications within a computer system. In 2006 the EPA released a study on power consumption in datacenters showing that 1.5% of US power was consumed by datacenters. Storage devices account for 5.2% of datacenter power consumption. In a typical server, memory devices consume three times as much power as storage devices, so it stands to reason that about 15.6% of datacenter power is consumed by memory devices. By these numbers, datacenter memory and storage device power consumption accounted for $936 million in 2006, and that figure is expected to grow year over year for the foreseeable future. New technologies and methodologies are emerging that allow power to be reduced in meaningful ways. Exploiting these solutions will be of great concern to system designers who wish to benefit from increased power savings. Both private and public sectors will want to decrease the expanding IT costs associated with power consumption. Environmentalists also take interest in reducing this power consumption because it would lessen the environmental impact associated with generating more power.

== Overview ==
Lowering power consumption in memory and storage can be tackled in several general ways. Reducing access to these devices minimizes the use of dynamic power. Scaling the voltage and frequency down shrinks both dynamic and static power. The physical design of memory and storage technologies can also be altered to reduce dynamic and static power.

Reducing device access generally involves a cache or a deferred access technique. By lowering device access, device technologies are better able to take advantage of idle periods by entering a lower power state. A slight variant is to group the device accesses. Because there is usually a cost in both time and power for transitioning to or from a lower power state, grouping device accesses minimizes power state transitions.

Scaling down the voltage and frequency of a device reduces its supply voltage and switching rate, which decreases the current drawn by the device. At the cost of performance degradation, voltage and frequency scaling reduces power consumption in all operating modes. Likewise, reducing the frequency of certain device operations can also yield savings in power consumption.

Each storage or memory technology has differing characteristics regarding performance, power consumption, cost and size. As expected, there is no panacea when it comes to memory and storage technologies, so these characteristics must be weighed for a particular application. In addition, well-known technologies can be modified to further supplement power savings.

Software control is vital to some of these power conservation tactics. Some power management techniques are simply exposed to programmers and users, while others require algorithms to be developed to efficiently manage the hardware.

== Caching ==
The CPU cache is the last level of on-chip memory before the I/O bus, where the system memory device is connected. A hit in this cache means that the memory device will not need to be accessed, allowing the memory device to remain idle. To efficiently manage power consumption in the memory device, it is prudent to maximize the hit rate in the CPU cache. One study of efficient power techniques shows that increasing the idle periods of memory technologies is the best strategy for reducing power consumption. Though the most typical application of a cache is near the CPU, caches can be applied anywhere along the I/O path in memory and storage systems. PA-CDRAM is an example of integrating a cache into the memory module. Having a cache within the memory module overcomes I/O bus size limitations and allows more aggressive caching techniques because the cache miss penalty is not as pronounced as it is in a CPU cache. Storage devices also show potential to be more power efficient with the integration of a cache. NAND flash was used to implement a cache in a hard disk drive for the purpose of reducing power consumption. So-called hybrid drives are now commercially available.

=== Maximizing the Hit Rate ===
Increasing the size of the CPU cache can improve the likelihood of a cache hit. However, the memory technology used for a cache is typically more expensive per bit than the memory or storage device behind it, so this approach may not be feasible from a cost perspective in some systems. Additionally, the physical size increase may not be desirable in some applications. Cache associativity also has an impact on the hit rate. In general, reducing associativity produces faster hit times, while a fully associative cache produces the highest hit rate. It is for this reason that the researchers of PA-CDRAM chose a fully associative cache. The performance degradation from a fully associative cache may be undesirable, so an n-way set associative cache can be a compromise.
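The effect of cache capacity on the hit rate, and thus on how often the backing device must wake, can be illustrated with a small simulation. The following sketch (not taken from any cited study; the trace and sizes are illustrative) models a fully associative LRU cache:

```python
from collections import OrderedDict

def lru_hit_rate(trace, capacity):
    """Fully associative LRU cache: return the fraction of hits on a trace."""
    cache = OrderedDict()
    hits = 0
    for addr in trace:
        if addr in cache:
            hits += 1
            cache.move_to_end(addr)        # mark as most recently used
        else:
            if len(cache) >= capacity:
                cache.popitem(last=False)  # evict least recently used
            cache[addr] = True
    return hits / len(trace)

# A looping trace over 8 distinct blocks: a 4-entry LRU cache thrashes and
# never hits, while an 8-entry cache misses only on the first pass.
trace = list(range(8)) * 100
small = lru_hit_rate(trace, 4)   # 0.0
large = lru_hit_rate(trace, 8)   # 0.99
```

Every miss avoided is an access the backing memory or storage device does not have to service, lengthening its idle periods.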

=== Power Aware Caching Policy ===
Caches typically employ one of two methods for writing dirty data back to the higher level device. Write through caches immediately write data back to the higher level device as soon as a cache line is dirtied. This means that for every write to the cache, there is an accompanying write to the higher level device. Write back caches only write data back to the higher level device when the cache line is evicted. This means that multiple cache writes can be performed without requiring higher level device access (write coalescing). Because write back caches increase the idleness of higher level devices, they are a natural candidate in systems where power consumption is a concern. In the analysis of PA-LRU, write back caching was shown to conserve 20% more energy than write through caching. In special applications it may be appropriate to use a timeout for cache eviction. Under these circumstances, longer timeouts allow for more write coalescing, which decreases higher level device access. However, longer timeouts degrade performance and can also increase the idle time in the cache, which may be undesirable if idle power in the cache is greater than idle power in the higher level device, so a balance must be struck.
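The difference in device traffic between the two policies can be sketched by counting device writes for each. This is a simplified model (an LRU cache of dirty lines only, with a final flush), not the PA-LRU algorithm itself:

```python
from collections import OrderedDict

def device_writes_write_through(writes):
    """Write-through: every cache write is immediately sent to the device."""
    return len(writes)

def device_writes_write_back(writes, cache_lines):
    """Write-back: a dirty line reaches the device only when it is evicted,
    so repeated writes to the same line coalesce into one device write."""
    dirty = OrderedDict()
    device_writes = 0
    for line in writes:
        if line in dirty:
            dirty.move_to_end(line)        # coalesced: rewritten in the cache
        else:
            if len(dirty) >= cache_lines:
                dirty.popitem(last=False)  # LRU eviction forces a device write
                device_writes += 1
            dirty[line] = True
    return device_writes + len(dirty)      # remaining dirty lines flush at the end

# 100 alternating writes to two lines, with a two-line cache:
wt = device_writes_write_through([0, 1] * 50)   # 100 device writes
wb = device_writes_write_back([0, 1] * 50, 2)   # 2 device writes (final flush)
```

The write-back device stays idle for the entire run, servicing only the final flush, which is exactly the idleness a power-managed device can exploit.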

=== Deferring Writes ===
Distributed systems can contain many storage devices. To conserve power, some of these storage devices can be placed in a low power mode. One trick for maintaining data coverage while some storage devices are in low power mode is to replicate data across multiple nodes. To maximize the amount of time these storage devices are in low power mode, writes to these devices can be deferred by making note of the write in a log file on one of the active storage nodes. This log file then serves as a cache with dirty data. Once a storage node comes out of low power mode, the pending writes can be flushed to the storage device, which ensures data coherency across all nodes. This method of deferring writes can be extended to any system that has memory or storage devices that are periodically offline. The analysis of PA-LRU noted that this technique could produce 55% savings in power consumption.
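A minimal single-node sketch of this deferral scheme follows. It is illustrative only: in a real distributed system the log would be kept on an active replica node rather than alongside the sleeping device, and replication would preserve read coverage while the node sleeps.

```python
class StorageNode:
    """A storage node that defers writes while in a low power mode and
    flushes them, in order, when it wakes (preserving coherency)."""

    def __init__(self):
        self.data = {}
        self.asleep = False
        self.log = []                      # pending (key, value) writes

    def sleep(self):
        self.asleep = True

    def write(self, key, value):
        if self.asleep:
            self.log.append((key, value))  # defer: record instead of waking
        else:
            self.data[key] = value

    def wake(self):
        self.asleep = False
        for key, value in self.log:        # flush deferred writes in order
            self.data[key] = value
        self.log.clear()

node = StorageNode()
node.sleep()
node.write("a", 1)   # deferred: the device is not touched
node.wake()          # pending write flushed; data is coherent again
```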

== Varying Frequency and/or Voltage ==
A popular method in clocked circuits, which can be applied to memory and storage technologies, is Dynamic Voltage and Frequency Scaling (DVFS). DVFS allows the voltage and frequency of a circuit to be changed on the fly, effectively trading performance against peak power consumption. The tradeoff between performance and power consumption has some interesting consequences. Another application of varying frequency applies to DRAM circuits. DRAM cells are typically constructed of a transistor and a capacitor, which holds the charge for that cell's state. Capacitors leak current at a rapid rate; it is for this reason that the JEDEC standard specifies a DRAM refresh every 64 ms for DDR2. A natural approach to reducing the cost of the DRAM refresh is to increase the period between refreshes.

=== Dynamic Voltage and Frequency Scaling ===
It is important to note that DVFS decreases peak power consumption. This distinction matters because overall energy consumption can actually increase due to the energy-delay product. PA-CDRAM researchers observed lower peak power consumption when using a distributed cache controller versus a centralized cache controller because there is less activity on the I/O bus. However, because the distributed cache controller introduces additional delay, total energy consumption is actually lower with the centralized cache controller: the faster method increases idle time. DVFS techniques usually encompass the entire system, but it may be beneficial to apply DVFS to individual components within the circuit. Memory channels are often configured for parallel access, though access distributions across those channels are usually uneven. A DVFS approach can manage the power consumption of individual memory channels according to their access patterns. Typically, devices only support a handful of voltages and frequencies. This leads DVFS approaches to treat each setting as a state and develop a state machine. In the interest of conserving power, algorithms can be deployed to determine the optimum state given the current load. The performance of the state selection algorithm must be considered, as evaluating more states can lead to higher latency and power consumption. Additionally, the state transitions themselves carry power and latency costs. It is typically more efficient to minimize these transitions, so this must also be factored into the state selection algorithm.
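The state machine idea above can be sketched as follows. The operating points and the transition-cost heuristic are hypothetical stand-ins, not figures from any real device:

```python
# Hypothetical device states as (frequency_MHz, active_power_mW) pairs;
# real parts expose only a handful of such operating points.
STATES = [(200, 50), (400, 120), (800, 300)]
TRANSITION_COST_MW = 5   # assumed overhead charged against any state change

def select_state(load_mips, current):
    """Return the index of the state to run in for the given load.
    Scales up when the current state cannot cover the load, and scales
    down only when the saving outweighs the transition cost."""
    freq, power = STATES[current]
    if freq < load_mips:
        # scale up to the slowest state that can still handle the load
        for i, (f, _) in enumerate(STATES):
            if f >= load_mips:
                return i
        return len(STATES) - 1
    # current state suffices; find the cheapest adequate state
    best = min((i for i, (f, _) in enumerate(STATES) if f >= load_mips),
               key=lambda i: STATES[i][1])
    if STATES[best][1] + TRANSITION_COST_MW < power:
        return best              # saving justifies a transition
    return current               # otherwise stay put, avoiding the transition
```

Charging a cost against every transition biases the policy toward staying in the current state, reflecting the point that minimizing state transitions is itself a power optimization.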

=== Reducing DRAM Refresh Rate ===
Not all capacitors in a DRAM module leak current at the same rate. The leakage rate can be attributed to environmental factors, location in the DRAM module and manufacturing variability. This means that when DRAM cells are allowed to decay past specification, some cells may be in an incorrect state while others maintain the correct state. This behavior lends itself to using error correction when the refresh rate is lowered below the specification. RS-ECC is shown to be effective in increasing the period between DRAM refreshes while maintaining data integrity. This idea has also been used by Elpida as part of its Super Self-Refresh. Another technique for reducing DRAM refresh power can be used in systems where DRAM serves as a cache. Because the refresh of individual cache lines in a DRAM can be controlled, a cache line eviction can simply consist of no longer refreshing that line. This conserves refresh power until the line is filled again, which can offer significant savings when cache evictions are based on a timeout.
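The eviction-by-non-refresh idea can be sketched as a per-line refresh controller. This is a model of the concept only, not of any particular DRAM controller:

```python
class DramCacheRefresher:
    """Per-line refresh control for a DRAM used as a cache: an evicted line
    is simply no longer refreshed, saving refresh power until it is
    filled again."""

    def __init__(self, n_lines):
        self.valid = [False] * n_lines
        self.refreshes = 0   # count of per-line refresh operations issued

    def fill(self, line):
        self.valid[line] = True

    def evict(self, line):
        self.valid[line] = False   # stop refreshing; the data may decay

    def refresh_cycle(self):
        for v in self.valid:
            if v:
                self.refreshes += 1   # only valid lines consume refresh power

r = DramCacheRefresher(4)
r.fill(0)
r.fill(1)
r.refresh_cycle()   # refreshes lines 0 and 1
r.evict(1)
r.refresh_cycle()   # refreshes only line 0
```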

== Heterogeneous Systems ==
Memory and storage device technologies both have dominant strategies in the marketplace. The most typical storage device is the HDD. The most typical memory device is DRAM. These technologies have been selected for their relatively low cost and high density. That being said, there are more costly storage and memory technologies that offer different characteristics. While a complete switch to these new technologies may not be feasible from a cost perspective, coupling these technologies with the well-established DRAM and HDD can aggregate the strengths of the technologies and mitigate the weaknesses.

=== Heterogeneous Storage ===
As mentioned previously, a popular marriage of technologies in storage devices is integrating NAND flash into a HDD. HDDs contain spinning platters and a moving actuator arm, which are used during device access; these components consume most of the power in a HDD. Integrating NAND allows data requests to be serviced by the lower power NAND flash, which allows the HDD to remain in a lower power state. HDDs have higher density and are cheaper than NAND, so the ratio of NAND to HDD capacity is still fairly small (8 MB of NAND in a 750 GB HDD for a Seagate hybrid drive). As that ratio evens out, power consumption will decrease due to the increased hit rate of the cache. Though NAND in HDDs is common, the concept of integrating a higher performance, lower power technology with a lower performance, higher power technology will remain relevant as long as the two technologies maintain this relationship. For storage systems, non-volatile technologies are important to guarantee data preservation, though volatile technologies can be utilized with a write through cache policy. Using volatile cache memory can be acceptable if read accesses greatly outnumber write accesses.

=== Heterogeneous Memory ===
SRAM is the predominant technology for CPU caches because of its speed. DRAM is typically chosen for main memory because it is cheaper than SRAM. SRAM can be integrated into the DRAM module to increase the access speed of DRAM, as in PA-CDRAM. This cache based approach allows larger cache block sizes than a CPU cache because the data interconnection width within the DRAM module is greater than that of an I/O bus. The result is an SRAM cache that can be very efficient when reference locality is high, along with longer idle periods in the DRAM circuits. Some emerging non-volatile technologies are becoming fast enough to take the place of conventional volatile memories, which is beneficial because the static power consumption of non-volatile memories is much lower than that of volatile memories. PRAM has been shown to act sufficiently as main memory with a DRAM-based cache. This technique is effective because infrequently accessed data resides in PRAM, which consumes little power when inactive. Additionally, the amount of DRAM in the system is greatly decreased, which, due to refresh power, further reduces power consumption.

== Taking Advantage of Idle Devices ==
Power consumption during idle periods is typically much lower than in active periods. For this reason, many power management strategies aim to maximize the idle time. There are usually several power states for devices with varying latencies and performance penalties involved in going into a deeper sleep or waking up. The deepest sleep modes may not be feasible in smaller applications, but large-scale distributed systems (i.e. datacenters) can afford to power down entire memory or storage modules for extended periods of time.

=== Powering Down Memory Devices ===
Turning off volatile memory devices sometimes requires the state of the memory to be preserved for when the device is woken up. This is usually accomplished by transferring the volatile memory state to a non-volatile memory device. The volatile memory can be saved to the main storage device, as is done for hibernation in popular operating systems like Windows and Linux. Systems that use non-volatile memory for the main memory do not need to worry about transferring data and just move to a low power mode. Most DRAM devices support a self-refresh mode for reaching the minimal power state. In this state, the DRAM clock is disabled and an internal refresh counter is used to periodically refresh cells. This low power state will maintain DRAM state, but does not make that data available to the system until that device is woken up.

=== Dynamic Active Device Population Control ===
One approach to controlling the active memory device population is to degrade some of the performance enhancements that lead to wider device usage. One such strategy is to reduce memory channel parallelism. Data is often interleaved across multiple memory channels, so migrating that data to a smaller subset of channels can leave entire memory channels idle. A strategy like this weighs performance goals against current memory usage patterns to determine the optimum number of active channels. Controlling the storage device population can be complicated because data coverage is usually a concern. Replication is used to maintain data coverage while some storage devices are offline. Though performance will degrade, this may be acceptable at times of low usage. The redundant storage devices can then be placed into a low power mode, effectively removing them from the storage network.
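The channel-consolidation decision reduces to choosing the smallest adequate channel count. A sketch under assumed figures (the per-channel bandwidth and channel count here are illustrative, not from any cited system):

```python
import math

CHANNEL_BW_GBPS = 12.8   # assumed per-channel bandwidth; illustrative only
NUM_CHANNELS = 4

def active_channels_needed(demand_gbps):
    """Smallest number of channels whose combined bandwidth covers the
    demand; the remaining channels can be idled into a low power state
    once their data has been migrated off."""
    if demand_gbps <= 0:
        return 1   # keep at least one channel active
    needed = math.ceil(demand_gbps / CHANNEL_BW_GBPS)
    return min(needed, NUM_CHANNELS)
```

A real policy would also hysteresis-filter the demand signal so that brief load spikes do not trigger costly migrations back and forth.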

== Power Management Software ==
Many of the hardware architectures discussed in this article are tools that present opportunities for savings in power consumption. In these cases, software algorithms are required to achieve the full potential of these technologies. The most common software approach is to define power states for the entire system; devices can then be selectively placed into lower power states at the user's convenience. High spatial locality is a common characteristic of storage and memory device access, and the degree of this locality can be maximized by intelligently grouping data on the device or devices. To efficiently manage power consumption in systems with memory or storage devices, it is important to develop a model of these devices so that power management algorithms know the penalties or gains of using particular devices. With such a model, the choice of which device to use can be made quantitatively.

=== OS Power Management ===
Most modern operating systems offer several ways to conserve power while the computer is not in use. Linux abstracts ACPI or APM so that device drivers can register with the power management subsystem to perform device power management when in the Run, Standby or Suspend state. Standby will place the system memory into a lower power state where data is still retained, but other devices are powered off. Suspend will save the contents of system memory to non-volatile storage and power off the system memory as well. Though these states can be entered manually by a user, activity timers can be used to enter these states automatically.

=== Increasing Locality ===
By having some forethought as to the access patterns of data structures, power can be saved by increasing locality. These savings come in the form of reduced address bus activity, because data closer together in address space typically requires fewer transitions on the bus. Consider an iterative loop that accesses two arrays sequentially; the access pattern will be a[0], b[0], a[1], b[1], and so on. A more efficient approach is to organize the two arrays in memory as one interleaved array so that a[0] and b[0] are sequential in address. Locality in HDDs is important because moving the actuator arm during disk seeks consumes power. Most applications with HDDs use some sort of file system, and the choice of file system can have an impact on the power consumption of the disk. Journaling effectively spreads a file between a journal and the on-disk file location for a short period, so a journaling file system like ext3 will perform more writes than a non-journaling file system like ext2, which consumes more power. Another approach is to localize commonly accessed files along the same tracks. A further adaptation is to physically shrink the size of files on the disk through data compression. Reducing the length or number of writes is beneficial to any non-volatile storage device, not just HDDs.
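The interleaving transformation above can be sketched as follows. Note the caveat that Python lists do not guarantee the contiguous physical layout that C arrays do, so this only illustrates the access-pattern change; the address-bus benefit applies in languages with contiguous arrays:

```python
# Two arrays accessed in lockstep: a[0], b[0], a[1], b[1], ...
a = [1, 2, 3, 4]
b = [10, 20, 30, 40]

# Interleave them so each a[i] sits next to its b[i]; with contiguous
# arrays (as in C), consecutive accesses then touch adjacent addresses
# and the address bus toggles fewer bits per access.
interleaved = [x for pair in zip(a, b) for x in pair]

total = 0
for i in range(0, len(interleaved), 2):
    total += interleaved[i] + interleaved[i + 1]   # a[i] + b[i], adjacent reads
```

This is the same idea as converting a struct-of-arrays layout into an array-of-structs when the fields are always accessed together.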

=== Power Aware Device Modeling ===
Different devices consume different amounts of power for each type of operation. Knowing these power consumption figures better equips software algorithms to decide which devices to power down or which devices should service requests under different situations. LBM does such modeling for NAND and HDD. With PA-LRU and LBM, the cache is manipulated to retain blocks with the highest energy consumption cost (i.e. data from a disk that is in low power mode). This sort of technique can also be used in simple systems, where something frequently accessed (i.e. the operating system) is placed on the lower power SSD, while data that is not used as frequently (i.e. pictures) resides on the higher power HDD.
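A minimal sketch of cost-aware eviction follows. The per-device energy figures are illustrative stand-ins, not measurements from LBM or PA-LRU:

```python
# Assumed per-access energy costs in joules for each backing device;
# the figures are hypothetical, chosen only to rank the devices.
DEVICE_ENERGY_J = {"ssd": 0.05, "hdd_active": 0.8, "hdd_spun_down": 6.0}

def eviction_victim(cached_blocks):
    """Evict the cached block that would be cheapest to fetch again,
    keeping blocks whose re-fetch would have to spin up a powered-down
    disk. Each block is a (name, backing_device) pair."""
    return min(cached_blocks, key=lambda blk: DEVICE_ENERGY_J[blk[1]])

blocks = [("os_page", "ssd"),
          ("photo", "hdd_active"),
          ("archive", "hdd_spun_down")]
victim = eviction_victim(blocks)   # the SSD-backed block: cheapest to re-fetch
```

A full policy would combine this energy cost with recency information, as PA-LRU does, rather than using cost alone.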