User:Hooiwai/Power Optimization and Management Techniques for Memory and Storage

Power optimization and management techniques for memory and storage are an important aspect of modern computing system design at every scale. The goal of power optimization and management is to consume just enough power to perform a particular task at a specific performance level. System performance is not confined to the speed of computation (the most common and conventional metric of performance). It is also defined by the financial cost of operating a system and by how long a power supply lasts without wall AC power (essentially, in cordless devices).

Importance
Power optimization and management in general is an ongoing and active area of computer science. Irrespective of size or function, every computing system consumes power, and the consumption of power always carries an economic cost. The economic cost manifests differently in different systems. On the surface, it would seem that the cost bears more weight on some computing devices (such as smartphones) while other computing systems (such as data centers) are seemingly immune.

For example, it is easy to see that in a battery-operated, cordless device such as a smartphone, higher power consumption means that either the battery needs a higher capacity (increasing the direct cost of manufacturing and the sale price) or the device suffers a shorter usage time and potentially loses business to competing vendors on the market (an opportunity cost). On the other end of the physical size spectrum, a data center that is attached to the AC power grid seems to have an unlimited supply of power and does not suffer the same power constraint as a smartphone. However, for the operator of a data center with a fixed budget, every dollar spent on power (both for running the computing systems and for cooling them) means one less dollar to spend on increasing the computing capacity of the center, and it is the computing capacity that generates revenue for the owner.

To illustrate the impact of memory and storage power consumption as described in cases above, real world studies are presented below with actual power values collected.

Profiling power consumption of memory and storage in smartphone
During the 2010 USENIX Annual Technical Conference, Aaron Carroll and Gernot Heiser of National Information and Communications Technology Australia (NICTA) presented an article titled "An Analysis of Power Consumption in a Smartphone". In the article, the authors explored the power consumption of various components of a smartphone, including RAM and storage, under different usage profiles. They found that RAM uses 4% of daily total power consumption on average, and more in specific profiling sessions. An audio playback session consumes more than 6% of the total power excluding backlight; a video playback session consumes more than 8% of the total power excluding backlight; an SMS session consumes between 6% and 7% of the total power excluding backlight. Storage accounts for less than 2% of power consumption. As expected, the display, CPU, and cellular radio dominated the power consumption numbers, while the combined power consumption of memory and storage is at most 10% of the total. It is worth noting, however, that this number was measured after power reduction efforts had already been incorporated into the design of the smartphone, and the pre-optimization number would be higher.

Real world example of the effect of power on operation in data center
In an online editorial, Tim Nufire, VP of Engineering at Backblaze (an online backup service), stated that one-third of the operating cost of the company's data center is spent paying for electricity. According to a 2011 study by the National Renewable Energy Laboratory (NREL), the power consumption of a typical data center can be divided into 3 groups: Computing (44%), Cooling (28%), and Lighting/loss/misc (28%). When that power consumption model is applied to Backblaze's data center, roughly 15% of the operating cost goes toward powering the computing system, and roughly 9% of the operating cost goes toward removing the heat generated by computer processing.
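The two percentages above come from multiplying the electricity share of operating cost by the NREL split. The following is a minimal sketch of that arithmetic; the function and variable names are illustrative and do not come from either source.

```python
# Share of total operating cost spent on electricity (one-third, per the editorial).
ELECTRICITY_SHARE = 1 / 3

# NREL (2011) breakdown of typical data center power consumption.
POWER_SPLIT = {"computing": 0.44, "cooling": 0.28, "lighting_loss_misc": 0.28}


def cost_share_by_group(electricity_share, power_split):
    """Return each power group's share of the total operating cost."""
    return {
        group: electricity_share * fraction
        for group, fraction in power_split.items()
    }


shares = cost_share_by_group(ELECTRICITY_SHARE, POWER_SPLIT)
for group, share in shares.items():
    print(f"{group}: {share:.1%} of operating cost")
```

This assumes the NREL split applies unchanged to the Backblaze facility, which is the same simplification the text makes.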

All components of a computing system, including memory and storage, contribute to power consumption. By applying power optimization and management techniques to these major subsystems, designers can tailor how best to attain a performance level appropriate to the type of computing system under consideration. In addition, the physical packaging of memory and storage components can have a large effect on cooling requirements, which is another power reduction opportunity from the viewpoint of the owner or administrator. While this last point is mechanical in nature rather than part of the EE/CS design discipline, it nonetheless fits the holistic approach to power management and optimization.

Layered approach
For useful work to be done, the designer of a complex computing system cannot simply specify the lowest-power components available, or else other performance requirements may not be met. It is important to allow the system to perform optimally under a constantly changing usage model. For that to happen, the system must be designed with enough flexibility and granularity that individual performance and feature aspects can be fine-tuned on the fly, while maintaining adequate performance to satisfy operational requirements. This can be accomplished with a layered approach.

Memory and storage components have built-in power-saving hardware features (the lowest layer) that can be activated when peak performance is not required. The existence of power-saving features at the component level allows application software (the upper layer) and the operating system (the middle layer) to make independent choices about when and which subsystems get throttled, without affecting other portions of the system or compromising performance requirements. A layered approach allows for optimal operation under different profiles, instead of all-or-nothing power on/off profiles.

For example, when a user is streaming music from a network source to a smartphone (such as the Pandora music service), it is possible to power down the storage subsystem (most likely flash memory or an SD card in this case) without affecting cellular network access, RAM buffering, or audio decoding in the CPU. When the user decides to use the same smartphone to send an SMS message, RAM can be put to sleep, since the CPU has more than enough local cache to store the short ASCII message and send it directly to the cellular radio chip for transmission. In both use cases, the lowest-layer hardware has power optimizing features that enable the operating system to make intelligent choices based on the application profile. Both usage profiles are switched dynamically without direct user intervention and, most importantly to the user, performance is not affected by the power-saving steps.

Hardware level design
This section explores the power-saving potential of several technical features found in hardware components.

Low Power DDR SDRAM vs DDR SDRAM
LPDDR SDRAM is a close cousin of regular DDR SDRAM. The major advantage of using LPDDR memory over regular DDR memory is simple to understand: LPDDR operates at a lower voltage than regular DDR SDRAM. Across all generations of SDRAM technology, the Low Power version has always maintained a voltage reduction over the regular version. Without going into ASIC design geometry shrinkage and its effect on static power consumption, a simple reduction in operating voltage saves a significant amount of dynamic power in SDRAM devices. In a CMOS circuit, dynamic power is given by this formula:
 * P = α•C•V²•f

Here P is the power consumed, α is the activity factor (from 0 to 100%, reflecting the statistical distribution of circuit switching), C is the switched capacitance, V is the supply voltage, and f is the clock frequency of the circuit. With all other terms in the equation held the same, the power saving from a drop in voltage is a squared term. If we express the dynamic power consumed by regular SDRAM in any of the DDR standards as P, then the dynamic power of the corresponding LPDDR device is P multiplied by the square of the ratio of the LPDDR supply voltage to the regular DDR supply voltage. Memory manufacturers such as Micron Technology claim that LPDDR memory does not give up any computational performance relative to the regular version.
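As a rough illustration of the squared voltage term, the sketch below evaluates the dynamic power formula and the resulting LPDDR-to-DDR power fraction. The supply voltages in the table are commonly quoted nominal values and are an assumption here, not figures from this article; real parts have several supply rails, so the actual data sheet should be consulted.

```python
def dynamic_power(alpha, capacitance, voltage, frequency):
    """CMOS dynamic power: P = alpha * C * V^2 * f."""
    return alpha * capacitance * voltage ** 2 * frequency


def lp_fraction(v_regular, v_low_power):
    """Fraction of the regular part's dynamic power used at the lower voltage,
    with alpha, C and f held equal."""
    return (v_low_power / v_regular) ** 2


# Assumed nominal supply voltages in volts: (regular, low power).
NOMINAL_VOLTS = {
    "DDR vs LPDDR": (2.5, 1.8),
    "DDR2 vs LPDDR2": (1.8, 1.2),
    "DDR3 vs LPDDR3": (1.5, 1.2),
}

for pair, (v_reg, v_lp) in NOMINAL_VOLTS.items():
    print(f"{pair}: {lp_fraction(v_reg, v_lp):.2f} x P")
```

Because the other terms cancel, only the voltage ratio matters for the comparison; `dynamic_power` is included to show where the ratio comes from.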

In addition to the dynamic power saving, LPDDR has 3 more hardware features that are not found in regular DDR memory:
 * 1) Temperature Compensated Self Refresh (TCSR)
 * 2) Partial Array Self Refresh (PASR)
 * 3) Deep Power-Down (DPD)


 * Temperature Compensated Self Refresh (TCSR)

When the ambient temperature is low, SDRAM content does not decay and corrupt as quickly, so less frequent refresh is required. To take advantage of this device physics phenomenon, Low Power class memories can use either an internal temperature sensor or external programming instructions to adjust the length of time between refreshes. Lengthening the interval allows more work to be done between refresh cycles, which reduces refresh overhead and yields a net power saving.
 * Partial Array Self Refresh (PASR)

Historically, all SDRAM memory blocks are refreshed periodically. If the memory is not fully utilized, refreshing areas that are not in use is a waste of time and power. LPDDR allows partial array self refresh, so unused areas are left alone. When an application requires only a fraction of the available memory, an operating system that is aware of this hardware feature can choose to write only to memory areas that will be refreshed.
 * Deep Power-Down (DPD)

When a computing system is idle, there are times when it is not important to maintain the data content inside the SDRAM. However, it would be impractical to shut off power to the SDRAM entirely, since SDRAM requires an initialization sequence every time power is reapplied, causing delay and a generally sluggish response from the user's point of view. LPDDR provides a deep power-down mode in which current draw is reduced significantly and data content is no longer maintained, but enough power remains that initialization and I/O calibration data are not lost.
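The partial array self refresh decision described above can be sketched as picking the smallest refreshed portion of the array that still holds the data in use. The set of selectable fractions below is an assumption for illustration; the options a given device supports are listed in its data sheet.

```python
# Assumed selectable fractions of the array that remain refreshed.
PASR_FRACTIONS = (1 / 16, 1 / 8, 1 / 4, 1 / 2, 1)


def smallest_pasr_fraction(used_bytes, total_bytes):
    """Return the smallest refreshed fraction that covers the memory in use.

    Assumes the operating system has already packed the live data into the
    low end of the array, so that the rest can be left unrefreshed.
    """
    if total_bytes <= 0 or not 0 <= used_bytes <= total_bytes:
        raise ValueError("used_bytes must lie between 0 and total_bytes")
    needed = used_bytes / total_bytes
    for fraction in PASR_FRACTIONS:
        if needed <= fraction:
            return fraction
    return 1


# Example: 100 MB in use out of 1024 MB fits in one eighth of the array.
print(smallest_pasr_fraction(100, 1024))
```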

Adaptive Page Management in SDRAM
In many ways, accessing and using SDRAM is similar to opening a printed dictionary. To find something in a dictionary, the user must open it to the page where the term in question is located. With a long list of terms to look up, the user must flip to a new page (essentially closing the old page and opening a new one) to find the next term on the list. This page turning takes time and effort (power), and it is unavoidable unless the next term on the list happens to be on the same page as the previous one.

In SDRAM, if two back-to-back memory accesses are on the same page (the best case scenario), the time it takes is simply tCAS (the CAS latency of the memory device). If the accesses are on different pages (the worst case scenario), the currently open page must be closed before the new page can be opened. The difference in time between the two scenarios varies between types of SDRAM technology (DDR vs non-DDR, DDR2 vs DDR3, etc.), so one must consult the manufacturer's data sheet to determine it. In an online article on AnandTech, author Rajinder Gill looked at a use case with DDR3 SDRAM. The time it takes to perform a worst case access is tRP + tRCD + tCAS. Plugging in actual clock cycle numbers, the best case scenario needed 6 clock cycles to perform 2 back-to-back memory accesses, while the worst case scenario needed 18 clock cycles to perform 2 back-to-back memory accesses on different pages. The worst case takes three times as long.
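The best case and worst case figures can be reproduced with a short calculation. The 6-6-6 timing values below are an assumption chosen to match the 6 and 18 cycle figures quoted above; the parameter names follow the usual tCAS, tRCD, and tRP convention.

```python
def access_cycles(same_page, t_cas, t_rcd, t_rp):
    """Clock cycles for an access that follows another access.

    Same page (page hit): only the CAS latency is paid.
    Different page (page miss): precharge, then activate, then CAS.
    """
    if same_page:
        return t_cas
    return t_rp + t_rcd + t_cas


# Assumed timings in clock cycles (a 6-6-6 part).
T_CAS, T_RCD, T_RP = 6, 6, 6

hit = access_cycles(True, T_CAS, T_RCD, T_RP)
miss = access_cycles(False, T_CAS, T_RCD, T_RP)
print(f"page hit: {hit} cycles, page miss: {miss} cycles, ratio: {miss / hit}")
```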

Adaptive Page Management by Intel
Intel filed a patent with the US Patent Office that addresses the inefficiency of the worst case scenario. The concept is simple: if the memory controller maintains a list of memory access requests, it can easily determine whether it would be advantageous to reorder the request list so that page closings and openings are minimized. The access reordering hardware in the memory controller can group together any memory accesses that go to the same page, improving overall efficiency and reducing power at the same time. The process does not require user intervention and is transparent to the user.
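The general idea of grouping same-page accesses can be sketched as follows. This is an illustrative toy, not the method claimed in the Intel patent: the page size and addresses are hypothetical, and a real controller must also preserve ordering between dependent reads and writes, which this sketch ignores.

```python
def count_page_switches(requests, page_size):
    """Count how many times consecutive requests land on different pages."""
    pages = [addr // page_size for addr in requests]
    return sum(1 for prev, cur in zip(pages, pages[1:]) if prev != cur)


def group_by_page(requests, page_size):
    """Reorder queued requests so accesses to the same page are adjacent.

    Pages are served in order of first appearance, and the original order
    is kept within each page.
    """
    buckets = {}
    for addr in requests:
        buckets.setdefault(addr // page_size, []).append(addr)
    return [addr for bucket in buckets.values() for addr in bucket]


PAGE_SIZE = 1024  # hypothetical page size in bytes
queue = [0, 2048, 16, 2064, 32, 4096, 2080]
reordered = group_by_page(queue, PAGE_SIZE)

print("before:", count_page_switches(queue, PAGE_SIZE), "page switches")
print("after: ", count_page_switches(reordered, PAGE_SIZE), "page switches")
```

Each avoided page switch saves the precharge and activate steps, which is where both the time and the power savings come from.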

Transaction Scheduler in ARM processors
Intel is not the only CPU vendor with a memory access reordering feature. ARM Holdings also includes a less comprehensive version in its memory controller. As part of the IP accessory portfolio released with the ARM Cortex-A9 CPU core, the DDR SDRAM controller shipped with a transaction scheduler that reorders same-page accesses in its read and write queues.

With Intel dominating PC and server class CPU sockets, and the ARM Cortex-A9 dominating smartphone and tablet CPU sockets in the 2010-2012 market, almost all computing devices built in those years are covered by these access reordering technologies.

Self-managed power management in hard disk drive
In hard disk drives, most if not all vendors offer power management features, such as various levels of idle modes between the fully active mode and sleep mode. A datasheet for a mobile hard disk drive in Hitachi's Travelstar series (TS5K1000) lists several such modes together with the power consumption of each. The operating system can choose to actively manage the hard disk drive using the different power-down modes to conserve power. To transition from the sleep/idle states (0.1W - 1.5W) to the active state (1.6W - 1.8W), there is a spin-up power penalty (a start-up state of 4.5W at the maximum) that depends on which idle state the drive is in. It should be apparent that if the drive constantly bounces between active, sleep, and wake cycles, it will end up spending more time and power on transitions than on actual work, due to the high power consumption of the start-up mode.

Back in 1999, when IBM still had a hard drive product line (the hard disk business was sold to Hitachi GST, which is now owned by Western Digital), it did research on adaptive self-management in mobile hard drives. The focus of the research was how best for the disk to manage its own power without outside control. The research found that while there is no oracle solution for predicting usage, hard disk access does follow a bursty pattern, and that there are characteristic usage profiles for typical software applications.

Due to the bursty nature of the commands generated by typical software applications, the disk should not enter any power saving mode until a reasonable amount of time has passed (burst completion), and should then move aggressively to the lowest power state to take advantage of the lull in activity. This yields better power savings than stepping through different power modes at fixed intervals. The research was implemented in the hard disk as follows:
 * Adaptive Battery Life Extender achieves power savings by exploiting the command burst patterns. The command frequency distribution is measured to characterize the activity. This information is maintained in a history buffer, and is used to judge how long the current burst of commands will continue. The algorithm calculates the probability that the current burst of commands is complete based on the statistics of the distribution. This information is coupled with the energy characteristics of the modes to determine which mode, if any, is best suited to the current conditions. -IBM Almaden Research Center (1999)
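A much simplified sketch of a burst-aware idle policy is shown below. It is loosely inspired by the description above and is not IBM's algorithm: the class name, the quantile rule, and the default values are all assumptions. It keeps a history buffer of recent gaps between commands and judges the current burst complete once the drive has been idle longer than most of those gaps.

```python
from collections import deque


class BurstAwareIdlePolicy:
    """Toy policy: stay active during a burst, power down once it ends."""

    def __init__(self, history_size=32, quantile=0.9, default_threshold=2.0):
        self.gaps = deque(maxlen=history_size)  # recent inter-command gaps (s)
        self.quantile = quantile
        self.default_threshold = default_threshold
        self.last_command_time = None

    def record_command(self, now):
        """Log the gap since the previous command into the history buffer."""
        if self.last_command_time is not None:
            self.gaps.append(now - self.last_command_time)
        self.last_command_time = now

    def burst_threshold(self):
        """Idle time beyond which the current burst is judged complete."""
        if not self.gaps:
            return self.default_threshold
        ordered = sorted(self.gaps)
        index = min(len(ordered) - 1, int(self.quantile * len(ordered)))
        return ordered[index]

    def should_power_down(self, now):
        """True once the idle time exceeds most of the recorded gaps."""
        if self.last_command_time is None:
            return False
        return (now - self.last_command_time) > self.burst_threshold()


policy = BurstAwareIdlePolicy()
for t in [0.0, 0.1, 0.2, 0.3, 0.4]:  # a burst of commands 100 ms apart
    policy.record_command(t)

print(policy.should_power_down(0.45))  # still inside the burst
print(policy.should_power_down(1.0))   # long lull after the burst
```

A fuller design would also weigh the energy cost of spinning back up against the expected length of the lull, as the quoted description indicates, and would keep long between-burst gaps from inflating the threshold.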

Testing was performed across 3 different models of IBM Travelstar mobile hard disk (4GN, 10GT, 14GS), with this power saving design switched on and off. Using the Ziff-Davis Battery Mark 2 benchmarking software as the test platform, the tests showed significant power savings.

Power Management Standards
The general goal of standardizing power management specifications is to enable all computing systems to share a common language for managing power-saving features, to remove redundancy among the various pieces of software and firmware vying for power management control, and to centralize all the hardware and software capabilities so that a cohesive power management plan can be made. There are two popular power management specifications: Advanced Power Management and Advanced Configuration and Power Interface.

Advanced Power Management in Linux machine
Advanced Power Management (APM) is commonly supported in Linux based systems. Developed by Intel and Microsoft, it consists of one or more layers of software that support power management in computers with power manageable hardware. APM creates a hardware-independent software interface that sits between the actual power management software, which is aware of specific hardware features, and the power management design of the operating system. The goal is to abstract away the mechanics of using hardware power management features that may have vendor-specific implementations, allowing higher level software to use those features without needing to know the specific instructions.

An APM-enabled computer system has 5 power states: Full On, APM Enabled, APM Standby, APM Suspend, and Off. When the system is in Suspend mode, both memory and storage are put in low power modes. For memory, low power mode could correspond to using the LPDDR SDRAM features of TCSR, PASR, and DPD described in the LPDDR section above. For storage, entering low power mode can translate to moving a hard disk from Performance Idle mode to Low Power Idle mode, dropping from 1.5W to 0.5W of power consumption.

Advanced Configuration and Power Interface
Advanced Configuration and Power Interface (ACPI) was designed jointly by Hewlett-Packard, Intel, Microsoft, Phoenix Technologies, and Toshiba to replace the older APM standard. The goals pursued have not changed, but the emphasis has shifted from the BIOS toward the operating system. It is still an interface specification that has both hardware and software components. ACPI improved upon APM in how it breaks down power states. Instead of having only 5 states for the entire system, with all the system resources lumped together in those states, ACPI separates general system power management, CPU power management, and device power management. Each device is also managed independently with its own states. Because of this fine control granularity, a system that uses ACPI can independently power down sub-sections without affecting other parts. The different smartphone usage models alluded to at the beginning of the article are possible under this scheme.

Operating System consideration for Solid State Disk
As the cost of flash memory goes down, it is becoming more common to see solid state drives replacing hard disk drives in laptops sold in stores. On a traditional hard drive, usage can be improved by defragmenting the drive. As the power figures for hard disks in the previous section show, a hard disk consumes power moving the magnetic read head and turning the disk platter while seeking data. This is due to the mechanical nature of the spinning disk and the fact that it takes time for the magnetic read head to traverse the disk surface. Defragmentation places related data close together to minimize the "seek" penalty, and is basically done by copying related data sectors together. A solid state disk, however, does not have this mechanical aspect and thus does not benefit from defragmentation. A file system that is aware of this change in technology will refrain from performing the background defragmentation maintenance function, saving power in the process.

More importantly, mobile platforms such as smartphones and tablets use flash memory exclusively as their primary storage device, so a mobile platform operating system needs to understand how to use flash memory technology optimally. Flash memory is not bit or byte erasable; rather, only an entire block of data can be erased at one time. After a while, the only way to reclaim unused memory sectors for future use is to erase that data block. This reclamation process takes up computing resources and, of course, burns power. In a mobile environment where run time between charges is an important factor, this needs to be managed carefully so that power is not burned needlessly.

The Android operating system performs this garbage collection as soon as an application is "destroyed". In a common smartphone usage model, certain applications are called up constantly over the course of a day, such as the SMS service, email, and the voice calling application. Android allows an application to go into "pause" mode, without completely killing the program, when the user is finished with it. The application can then be resurrected repeatedly without going through the time and power consuming cycle of flash memory reclamation during "create and destroy".

Portions of this page are reproduced from work created and shared by the Android Open Source Project and used according to terms described in the Creative Commons 2.5 Attribution License.

Application software level design (Flikker)
The power optimization techniques discussed so far have all preserved the data integrity of the computer system along with the user experience. However, depending on the situation, a loss of data integrity might not affect the user experience. For example, a smartphone user viewing a picture, movie, or other multimedia content on the device would not notice if the content had degraded in quality compared with viewing it on a full size computer with a large screen, because the smartphone screen is small. At the 2011 Architectural Support for Programming Languages and Operating Systems conference, a paper was presented on this exact idea.

The authors of that paper suggested that power can be saved by reducing the refresh rate of SDRAM in a mobile device below the manufacturer's recommendation. In SDRAM, when cells are not refreshed in time, the data content slowly degrades and becomes corrupted. If the memory is storing a picture, the slow corruption of the memory content means that picture quality degrades over time. On a small-screened device, the picture degradation may not be noticeable. Based on that theory, the researchers built software called Flikker and tested it on real hardware, and found evidence that strongly supports their theory: the image is still of acceptable quality, and DRAM power consumption was reduced by 5% to 25%.
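The trade-off between bit errors and perceived quality can be explored with a small simulation. This is not the Flikker software: it is a toy model that flips each stored bit independently at a chosen error rate, whereas real DRAM cells decay in a less uniform way. The function names and error rates are hypothetical.

```python
import random


def decay_bits(pixels, bit_error_rate, seed=0):
    """Flip each bit of each 8-bit pixel independently with the given probability."""
    rng = random.Random(seed)
    corrupted = []
    for value in pixels:
        for bit in range(8):
            if rng.random() < bit_error_rate:
                value ^= 1 << bit
        corrupted.append(value)
    return corrupted


def mean_abs_error(original, corrupted):
    """Average absolute difference between original and corrupted pixel values."""
    return sum(abs(a - b) for a, b in zip(original, corrupted)) / len(original)


# A flat mid-grey test image of 10,000 pixels.
image = [128] * 10_000
for rate in (0.0, 1e-4, 1e-2):
    damaged = decay_bits(image, rate)
    print(f"bit error rate {rate}: mean abs error {mean_abs_error(image, damaged):.3f}")
```

At low error rates only a small fraction of pixels change at all, which is the intuition behind tolerating a reduced refresh rate for image data.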

Effects of physical packaging
Removing heat from storage elements is another area where power can be saved. In a conventional hard disk cloud storage pod pictured at the right, 135TB of data is stored on 45 3-TB hard disk drives in a 4U enclosure. The drives are packed close together (15 across, 3 rows deep), with small gaps between drives through which air is forced to cool the storage array. Because it is a standard 19-inch rack mounted chassis, the internal width of the chassis is about 430mm. Combining that with the drive height of 147mm gives a total frontal area of 63210mm². Based on the dimensions of the hard disk drives and the way they are oriented, 147mm x 26.1mm x 15 = 57550.5mm² of the frontal area is blocked. Since the amount of cooling airflow is defined by cross-sectional area, this layout leaves only 5659.5mm² of air channel for air to move through and cool the disk array.

Viking Technology has a slightly different approach to the physical packaging of mass storage. The company took the flash memory chips that normally go inside a solid-state disk and packaged them in a DDR3 DIMM package. The result is the functional equivalent of a hard disk in a DIMM module form factor. 64 of these DIMM modules are installed in a 1U chassis, giving 32TB of storage per 1U of rack height. An image can be found here: http://www.vikingtechnology.com/sites/default/files/productimage/SATADIMM_JBOD.jpg. With a thickness of 6.75mm, a height of 30mm, and 32 modules across the chassis, the total frontal area that is blocked is 6.75mm x 30mm x 32 = 6480mm². In the picture, the SATADIMM modules line up in front of 8 sets of 35mm cooling fans, giving a total of 280mm x 35mm of cooling area. A conservative calculation of 280mm x 30mm yields a frontal area of 8400mm², which leaves 1920mm² of cooling area. If 4 of these 1U chassis are stacked together, there is 7680mm² of cooling area for 128TB of data.

For the conventional disk storage pod, there is 5659.5mm² of air channel to cool 135TB of data, or just under 42mm² of air channel per terabyte. For SATADIMM, there is 7680mm² of cooling area for 128TB of data, or 60mm² of air channel per terabyte. That is roughly a 43% increase in air channel for cooling (60 / 41.9 ≈ 1.43). A chassis that cools more efficiently requires less power to cool, since its fans do not need as high a flow specification as they would in a congested chassis. Because more air can pass through the chassis for a given amount of data storage, each unit of air has to carry away less heat, potentially reducing the load on the data center cooling system.
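The frontal area comparison can be checked with a short calculation. All dimensions and capacities are taken from the text above; the variable names are illustrative.

```python
# Conventional 4U storage pod: 15 drives across, each 26.1mm thick and 147mm tall.
pod_total_mm2 = 430 * 147          # internal width x drive height
pod_blocked_mm2 = 147 * 26.1 * 15  # frontal area covered by drives
pod_open_mm2 = pod_total_mm2 - pod_blocked_mm2
pod_open_per_tb = pod_open_mm2 / 135

# SATADIMM 1U chassis: 32 modules across, each 6.75mm thick and 30mm tall.
dimm_total_mm2 = 280 * 30          # conservative fan-facing area
dimm_blocked_mm2 = 6.75 * 30 * 32  # frontal area covered by modules
dimm_open_mm2 = 4 * (dimm_total_mm2 - dimm_blocked_mm2)  # four chassis stacked
dimm_open_per_tb = dimm_open_mm2 / 128

ratio = dimm_open_per_tb / pod_open_per_tb

print(f"pod:      {pod_open_mm2:.1f} mm2 open, {pod_open_per_tb:.1f} mm2 per TB")
print(f"SATADIMM: {dimm_open_mm2:.1f} mm2 open, {dimm_open_per_tb:.1f} mm2 per TB")
print(f"ratio of open area per TB: {ratio:.2f}")
```

The comparison treats open frontal area as a proxy for airflow and ignores fan characteristics and heat output per terabyte, which differ between spinning disks and flash.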

Improvement in near future
Outside-the-box thinking in hardware engineering, such as the SATADIMM, will be important in the near future, since intelligent packaging does not require as much capital investment as fundamental breakthrough research in this poor economic environment. If an existing production line can be reused to make a product that performs better in terms of power, the market will be receptive to it. After all, it is cost that drives technology.

With current CMOS technology, further increases in density and reductions in lithographic geometry will increase static and leakage current consumption. The pursuit of higher density silicon devices (for both memory and storage) will soon make the power saving improvement expected of each new device generation impossible. With Intel's 3D transistor, and with other silicon vendors using multilayer chip packaging in which RAM, flash, and processor are stacked on top of each other in the same package, the industry is moving in the 3D direction to increase density. For any given amount of 2D real estate on the circuit board, the thirst for higher memory and storage density can be satisfied without sacrificing power consumption to shrinking device physics. Hybrid Memory Cube (HMC) is one such emerging technology. With a large number of influential technology companies as developers, this does not seem like an idea that will go away soon. Once again, HMC technology is not really a new breakthrough in device physics, but another refinement of the manufacturing and packaging process. Multichip packaging has been on the market for years, so stacking memory cells (with a higher speed interconnect than before) does not require brand new factories, only retooling of existing ones. Time will tell whether HMC can deliver on the claim of a 70% power reduction from current DDR3 SDRAM.

The same outside-the-box thinking is also important on the software side, as illustrated by the Flikker application discussed above. For a long time, DRAM management focused on correctly partitioning data into a small subset of the memory module so that the rest of the module could be ignored, while maintaining perfect integrity in the active portion of the memory. What Flikker showed is that the world does not have to be perfect for the user to be happy, and that exploiting the user's tolerance of imperfection can lead to power savings. In some ways it is not a new idea, as JPEG, GIF, and various other image compression formats have shown that humans can tolerate a lossy media presentation, but the genius is in connecting those dots.