User:Snowbell19/sandbox

Dark Silicon
Modern processors are becoming increasingly power dense as Dennard scaling (the continuous decrease of transistor supply and threshold voltages with each new manufacturing technology) slows down. Because of the physical limits of heat dissipation and power delivery, not all the transistors on a chip can run at the same time. The percentage of transistors that a chip can switch at full frequency is limited by a utilization wall, and as a result a large portion of the processor has to be turned off. This powered-off area has been termed dark silicon, and it has become one of the most significant constraints on achieving speedup in modern chips.
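The utilization wall can be illustrated with a first-order model of dynamic power, P ~ N * C * V^2 * f. The sketch below is an illustration only: the scale factor `s`, the function names, and the decision to ignore leakage power are assumptions made here, not figures from any cited paper.

```python
def relative_power(s: float, dennard: bool) -> float:
    """Full-chip dynamic power after one process generation, relative to the
    previous generation, at constant die area. `s` is the linear scale factor."""
    transistors = s ** 2                   # more devices fit in the same area
    frequency = s                          # each device can switch faster
    capacitance = 1 / s                    # per-device capacitance shrinks
    voltage = 1 / s if dennard else 1.0    # voltage shrinks only under Dennard scaling
    # Dynamic power ~ N * C * V^2 * f
    return transistors * capacitance * voltage ** 2 * frequency


def utilization(s: float, generations: int) -> float:
    """Fraction of the chip that can switch at full frequency under a fixed
    power budget once voltage has stopped scaling."""
    return 1 / relative_power(s, dennard=False) ** generations
```

With `s = 2`, `relative_power(2, dennard=True)` is 1.0 (power stays flat), while `relative_power(2, dennard=False)` is 4.0, so `utilization(2, 1)` is 0.25: in this simplified model, three quarters of the chip would have to stay dark or run slower.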

History
"Dark Silicon and the End of Multicore Scaling," a paper from ISCA '11 investigates the limit of multi-core scaling and estimate the fraction of dark silicon on a chip. It predicts that the speedup of multi-cores will slow down and dark silicon "dominates" starting 2021 for CPUs and 2015 for GPUs, which is an optimistic view. Taking a conservative view the year would be 2016 for CPUs and 2012 for GPUs. This brings up the problem of dark silicon and inspires research on techniques that could mitigate this dark silicon.

There are several approaches to the dark silicon problem, which can be roughly divided into categories. First, a heterogeneous architecture, such as the ARM big.LITTLE system consisting of a set of Cortex-A15s and Cortex-A7s, can switch between performance cores and energy-efficient cores as needed. Second, novel architectures use accelerators that are powered on only when needed to improve performance. Third, power gating techniques and cooling methods bring down heat and power consumption.

Device Heterogeneity
A DAC '13 paper suggests exploiting heterogeneity in the devices used to manufacture processors to address the dark silicon problem. The authors explore two newer devices, High-K dielectrics and nano-electro-mechanical switches (NEMS). A High-K dielectric replaces the silicon dioxide and can decrease the leakage current significantly (to less than 1% of that of silicon dioxide). Some processor manufacturers are already adopting this device. NEMS devices are based on a physical switch and can decrease the leakage current by orders of magnitude. However, they have a longer switching delay than CMOS, so research is being conducted on hybrid NEMS-CMOS devices. Simulations show that a chip mixing NEMS-CMOS cores with High-K cores is preferable to a chip built from cores of a single device type. A configuration of 1 High-K core and 7 NEMS-CMOS cores would reduce energy by 20.8% with no performance loss, compared to a configuration of 7 High-K cores and 0 NEMS-CMOS cores.

Memory Level Parallelism Using Architectural Heterogeneity
One proposed design is an asymmetric multicore processor with a core specialized for memory-level parallelism (MLP) and a core specialized for instruction-level parallelism (ILP). The authors propose a hardware-level mechanism that monitors the L2 miss rate of the application to detect MLP phases, and uses this information to assign the application to the better-suited core. It can detect both the start and the end of an MLP phase, exploiting the phase behavior of applications while powering off the unused core. It achieves a 21.1% energy-delay reduction on SPEC2006, with a 6.6% performance improvement.
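The phase-detection idea described above can be sketched in software as a small state machine. This is a minimal illustration, not the authors' hardware mechanism: the class name, the threshold values, and the use of two thresholds (hysteresis) to avoid rapid switching are all assumptions made here.

```python
from dataclasses import dataclass


@dataclass
class PhaseDetector:
    # Hypothetical thresholds on the L2 miss rate (misses per access).
    enter_threshold: float = 0.20   # at or above this, a memory-level parallelism phase starts
    exit_threshold: float = 0.10    # at or below this, the phase ends
    in_mlp_phase: bool = False

    def update(self, l2_misses: int, l2_accesses: int) -> str:
        """Consume one sampling interval and return the core to run on."""
        miss_rate = l2_misses / l2_accesses if l2_accesses else 0.0
        if not self.in_mlp_phase and miss_rate >= self.enter_threshold:
            self.in_mlp_phase = True      # start of a memory-bound phase
        elif self.in_mlp_phase and miss_rate <= self.exit_threshold:
            self.in_mlp_phase = False     # end of the memory-bound phase
        # The core that is not returned could be powered off.
        return "mlp_core" if self.in_mlp_phase else "ilp_core"
```

Feeding the detector one `(misses, accesses)` pair per sampling interval yields the target core for the next interval; a miss rate between the two thresholds leaves the current assignment unchanged.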

GreenDroid
In an IEEE Micro paper from 2011, the authors explore a new architecture for mobile application processors. It uses 100 or so automatically generated, specialized conservation cores placed in the dark silicon area of a chip. These cores can run parts of an application more efficiently than a general-purpose processor and are turned on only when the program needs them. The cores are reconfigurable, and the authors claim an 11x reduction in energy on general-purpose mobile programs. Although this mitigates the dark silicon problem, it has the shortcoming that the cores cannot communicate with each other directly. Instead, they share operands and communicate results through the cache, so performance may be limited by data cache bandwidth.

Approximate Programming
Disciplined approximate computing, which pairs unreliable hardware components with error-tolerant software, is an approach to improving energy efficiency and thereby easing the power constraints behind dark silicon. Its proponents suggest giving programmers safe ways to write programs that tolerate a small amount of inaccuracy. A language extension can statically confirm that error-sensitive parts are isolated from approximated components. They also provide an architecture that is energy efficient when running these adjusted programs, along with accelerators for hot code, and achieve up to 43% energy savings.
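The isolation rule described above can be approximated with a small runtime check. This is only a sketch: the language extension in the text performs the check statically, whereas Python checks at run time, and the names `Approx`, `endorse`, and `precise_index` are hypothetical.

```python
class Approx:
    """Marks a value that may be computed or stored on unreliable hardware."""

    def __init__(self, value: float):
        self.value = value

    def __add__(self, other):
        # Anything combined with approximate data stays approximate.
        other_value = other.value if isinstance(other, Approx) else other
        return Approx(self.value + other_value)

    __radd__ = __add__


def endorse(x: Approx) -> float:
    """Explicit, programmer-visible crossing from approximate to precise data."""
    return x.value


def precise_index(i) -> int:
    """An error-sensitive use: rejects approximate data that was not endorsed."""
    if isinstance(i, Approx):
        raise TypeError("approximate value used where a precise one is required")
    return int(i)
```

Here `precise_index(Approx(1.5) + 2)` raises `TypeError`, while `precise_index(endorse(Approx(1.5) + 2))` is accepted, so every flow from approximate to error-sensitive code is visible in the source.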

Power Gating
Power gating in mobile SoCs can cut the leakage power of inactive blocks, and techniques such as super cut-off can be added to the baseline power gates. This improves the ratio of turn-off leakage to on-current. Advanced state retention and standby voltage scaling can improve the energy and latency of these leakage mitigation schemes.

Cooling Techniques
While many techniques focus on reducing power for its own sake, heat is also becoming a problem. Temperature sensors are now integrated into the system, and if the CPU or GPU gets too hot, its frequency and voltage are adjusted to bring down the temperature. Superlattice-based thermoelectric cooling (TEC) is a technology that can target the hot spots of each core separately while offering large heat-pumping capability. Because of their small form factor, TECs can be integrated between the processor die and the heat spreader. One paper proposes a power/thermal model for estimating the impact of DVFS, TEC power, and the number of threads on performance and power. Running SPEC06 on an Intel quad-core Core i7 940 processor, the authors found that using DVFS together with TECs reduced the number of dark cores in the system by an average of two, compared to using DVFS alone.
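The sensor-driven throttling described above can be sketched as a simple step controller. The operating points, the temperature thresholds, and the function name below are hypothetical and do not describe any particular processor's DVFS policy.

```python
# Hypothetical operating points, lowest to highest frequency.
FREQ_STEPS_MHZ = [800, 1600, 2400, 3200]


def next_freq_index(current: int, temp_c: float,
                    hot_c: float = 90.0, cool_c: float = 75.0) -> int:
    """Pick the next operating point from one temperature sensor reading."""
    if temp_c >= hot_c:
        return max(current - 1, 0)                        # too hot: step down
    if temp_c <= cool_c:
        return min(current + 1, len(FREQ_STEPS_MHZ) - 1)  # cool: step back up
    return current                                        # in between: hold
```

Calling this once per sensor sample walks the core down the frequency table while it is hot and back up once it has cooled; an active cooler such as a TEC would let the core stay at higher operating points for longer.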

Patents
In 2013, ARM licensed a bundle of 138 properties from Sonics Inc, a company that develops system IP for cloud-scale SoCs. ARM is supporting Sonics in research aimed at large power savings through tighter integration of system processors, on-chip networks, IP subsystems and cores. With a Network-on-Chip (NoC), power management can be automated: when a message is to be delivered to a shut-down component, the NoC detects this, powers up the component and delivers the message. Their recently announced SonicGN technology controls domain partitioning, clock gating, and turning domains on and off. By reducing CPU overhead, they estimate that this NoC could save up to half of the total SoC power consumption. Sonics' patents are known to have been embedded into more than two billion chips worldwide.
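The wake-on-message behavior described above can be modeled in a few lines. This is a toy model under stated assumptions: the `Component` and `NoC` classes and their methods are hypothetical and do not represent any Sonics product or API.

```python
class Component:
    """A block on the chip that can be powered down when idle."""

    def __init__(self, name: str):
        self.name = name
        self.powered = False
        self.inbox = []


class NoC:
    """Toy network-on-chip that manages power without CPU involvement."""

    def __init__(self, components):
        self.components = {c.name: c for c in components}
        self.wakeups = 0

    def power_down(self, name: str) -> None:
        self.components[name].powered = False

    def send(self, dest: str, message) -> None:
        target = self.components[dest]
        if not target.powered:
            # The network notices the destination is off and wakes it first.
            target.powered = True
            self.wakeups += 1
        target.inbox.append(message)
```

In this model a sender never checks power state itself; calling `noc.send("dsp", "frame")` on a powered-down `dsp` component wakes it and then delivers the message, which is the step that would otherwise cost CPU overhead.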

Future Directions
As transistors get smaller and cheaper to pack onto a chip, power consumption and heat dissipation become the limits on the speedup we can achieve. It then makes sense to put many additional accelerator-type parts into the system, turning them on only when they are needed and leaving them dark the rest of the time. For example, the ARMv8 architecture has a group of instructions dedicated to AES encryption. For certain applications it is also possible to trade accuracy for performance. Other techniques that approach the problem from the materials side or that cool down the cores would also be helpful.