User:Tailinchu/sandbox

= Software-based energy/power meter = Instead of directly using both current and voltage, a software-based energy/power meter uses indirect and limited metrics to infer the energy/power usage of a system. Current and voltage sensors are sometimes not both readily available because of additional component cost and space; other metrics, such as cache misses, are therefore used for estimation. These metrics feed a trained model that gives software accurate energy or power figures.

Another goal of a software-based power meter is to divide a coarse power measurement into subcategories that are difficult to measure precisely. For example, a power meter can report whole-system consumption but cannot give per-process or per-thread power/energy.

Usage
A software-based energy/power meter typically enables detailed analysis. Common usages fall into the following categories:

Subdivision
A coarse energy/power reading is available, but the quantity of interest is not easy to measure directly, even though it adds up to the coarse total. For example, a laptop can report whole-system consumption from the battery but cannot give per-process or per-thread power/energy, even though the power consumption of all processes and threads sums to the whole-system consumption. The problem is then to divide the total power precisely using metrics from the processes and threads, such as system calls or memory accesses.
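A minimal sketch of such a subdivision, assuming attribution proportional to a single per-process activity metric (the process names, metric, and power reading below are all invented for illustration):

```python
# Sketch: attribute a measured whole-system power number to processes
# in proportion to a per-process activity metric (here, a hypothetical
# memory-access count). All names and numbers are illustrative.

def attribute_power(total_power_w, metric_per_process):
    """Split total_power_w among processes proportionally to their metric."""
    total_metric = sum(metric_per_process.values())
    if total_metric == 0:
        return {pid: 0.0 for pid in metric_per_process}
    return {pid: total_power_w * m / total_metric
            for pid, m in metric_per_process.items()}

# Whole-system reading from the battery: 12 W, split by memory accesses.
shares = attribute_power(12.0, {"browser": 3000, "editor": 1000})
# "browser" is attributed 9 W, "editor" 3 W; the shares sum to the total.
```

Real systems weight several metrics at once, but the invariant is the same: the per-process estimates must sum back to the measured whole-system power.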

t+1 Estimation
This is related to training: after the model is learned, the metrics can be related back to power/energy consumption. When the metrics change at the next time instance, the model can predict the power/energy at t+1 from the previous training results even when the actual energy/power is unknown.
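The idea can be sketched as follows, assuming a previously trained linear model (the coefficient and metric below are invented, not measured training results):

```python
# Sketch: once coefficients are learned (here assumed, not measured),
# power at t+1 is predicted from the metrics observed at t+1 even
# though no power reading is available for that instant.

def predict_power(coeffs, intercept, metrics):
    """Linear model: power = intercept + sum(coef * metric)."""
    return intercept + sum(coeffs[name] * value
                           for name, value in metrics.items())

# Assumed training result: 0.002 W per cache miss, 0.5 W baseline.
coeffs = {"cache_misses": 0.002}
p_next = predict_power(coeffs, 0.5, {"cache_misses": 1500})  # metrics at t+1
# p_next == 3.5 W
```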

Billing
As more web applications are hosted on VMs, research shows how to bill users based on energy usage. In addition to compute hours, cloud service companies like Amazon can charge based on energy consumption, which closely reflects the actual cost of running a server farm.

Optimization
Knowing how the power consumption of functional units changes under different loads is especially helpful for optimization.

For example, modern laptops often include a GPU for gaming or highly parallelizable computing. Unlike other components in a PC, the GPU can be viewed as another subsystem with groups of cores and levels of memory and cache. Understanding core performance bottlenecks, such as memory bandwidth, suggests when cores can be disabled to save energy.

Requirement
A software-based energy/power meter has the following requirements. A practical estimation model will try to fulfill as many as possible while coping with the conflicts between them.

Accuracy
The model should estimate energy accurately. Cycle-level full-system simulation can achieve high accuracy; however, this conflicts with the requirements of speed and low overhead.

Run-time estimation
Online, on-the-fly estimates can be consumed by other system modules. For example, energy/power readings are useful to a scheduler that tries to lower overall system consumption. This requirement matters most for battery-powered devices such as laptops and smartphones. However, an intrusive model will disturb normal system execution.

Non-intrusive and low overhead
The speed requirement conflicts with accuracy: a more accurate model often requires longer training time.

Metric availability
Because not all metrics are available on every system, the model should rely only on a few metrics that have high availability.

Blackbox
A blackbox in a model is an entity whose internal components one has no knowledge of and cannot break down.

Some models require a deep understanding of the relationship between processor functional-unit tasks and power consumption, for example by counting functional-unit activities with built-in hardware. A blackbox model, by contrast, requires no such knowledge: it assumes that the relationship between performance counters and power consumption is linear, e.g. that more executed instructions or more TLB misses contribute more power. The blackbox approach is certainly simpler and faster; however, it is sometimes too simple to be accurate.

As mobile devices with many distinct components emerge, a pure blackbox is no longer suitable for estimation. More models are becoming component-based, correlating visible system state variables with power. Learning the power consumption in different system states can be accurate but requires training; to increase accuracy, developers have to train on all relevant system states.

Metric
These metrics are often proportional to power, so after learning how they correspond to power, they can be used for estimation. Some metrics are linear in either current or voltage, enabling linear-regression techniques to learn the parameters.

Performance counter
In computers, hardware performance counters are a set of special-purpose registers built into modern microprocessors to store counts of hardware-related activities. Work using them typically falls into two categories: the first estimates power consumption by monitoring the usage of functional units; the second derives a mathematical correlation between performance counters and power consumption, independent of the functional units.

Hardware performance counter vs. software profiler
Compared to software profilers, hardware counters provide low-overhead, detailed performance information related to the processor, caches, main memory, etc. Newer CPU models typically have more performance counters. Because the processor manufacturer or operating system provides a higher-level API, another benefit is that little source-code modification is generally needed. However, the types and meanings of hardware counters vary from one architecture to another due to variations in hardware organization. For example, VMware infrastructure exposes a different set of performance counters than Intel hardware does.

One drawback is that correlating the low-level performance metrics back to source code can be difficult. In addition, the limited number of registers for storing counter values often forces users to conduct multiple measurements to collect all desired metrics. A general workaround is to poll m out of n performance counters at a time to build up the full profile; however, the total time to probe all performance counters increases, which also increases the error rate.
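The m-out-of-n rotation can be sketched as a simple round-robin schedule (the event names below are illustrative, not any particular CPU's counter set):

```python
# Sketch: with only m hardware counter registers, poll the n desired
# events in rotating groups of m. Covering all events takes ceil(n/m)
# rounds, which lengthens the profile and adds sampling error.

def rotation_schedule(events, m):
    """Return the list of counter groups polled on successive rounds."""
    return [events[i:i + m] for i in range(0, len(events), m)]

events = ["cycles", "instructions", "llc_misses", "tlb_misses", "branches"]
rounds = rotation_schedule(events, 2)
# 3 rounds are needed to cover all 5 events with 2 registers.
```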

Data-dependent registers and frequency
Data-dependent registers and frequency contribute errors to performance-counter estimation. Activities on double-ended bitlines are not data dependent, so they can be addressed by a performance counter based on a utilization estimate. In contrast, single-ended bitlines can depend on the data: for example, if the circuit is precharged to logic one, then only zeros dissipate power. This is one of the main reasons some activity cannot be estimated well from performance counters. Another source of error is register access frequency, which can skew the estimation; a more sophisticated scheme might weight register samples.

Temperature effect
Ideally, power consumption would not increase over time; however, higher temperature increases leakage and adds to the overall power consumption. Power and temperature are known to affect each other. Because temperature sensors are not available per core, some models choose not to include temperature, which increases the error rate; on the other hand, recent models use on-die sensors to approximate it.

Instruction
Instruction-level power modeling has been proposed to evaluate the power dissipation of a given piece of software. The basic idea is to explicitly associate the consumed power with the execution of individual instructions.

$$Power=BasePower + InterInstructionPower + PowerDueToOtherInterInstructionEffect$$
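The sum above can be sketched over an instruction trace, with invented per-instruction base costs and inter-instruction pair costs (none of the numbers or opcode names are real measurements):

```python
# Sketch of the instruction-level model: total energy is the sum of each
# instruction's base cost plus an inter-instruction overhead looked up
# for every adjacent pair. All costs below are invented for illustration.

BASE_COST = {"mov": 1.0, "add": 1.2, "mul": 3.0}          # nJ per instruction
PAIR_COST = {("mov", "add"): 0.1, ("add", "mul"): 0.4}    # nJ per transition

def program_energy(trace):
    energy = sum(BASE_COST[op] for op in trace)
    for prev, cur in zip(trace, trace[1:]):
        energy += PAIR_COST.get((prev, cur), 0.0)
    return energy

e = program_energy(["mov", "add", "mul"])
# 1.0 + 1.2 + 3.0 base + 0.1 + 0.4 inter-instruction = 5.7 nJ
```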

Inter-instruction effect measurement
However, measuring the inter-instruction power between two instructions is non-trivial because an exhaustive power characterization of every pair has to be conducted. For example, for the Intel IA-32 ISA with 331 unique instructions, the number of possible instruction pairs that need to be measured is 109,561 (331^2). This estimation technique is therefore intrusive and computationally intensive.

Difficulties for on-the-fly training
Fitting the measured power consumption for every instruction into a small hardware table is infeasible. To hold a table this large, software table lookup is required: the system generates dedicated software traps and then performs the lookup. Because of this instruction-level granularity, the overhead is high.

Data dependent instruction
The inter-instruction effect is deeply affected by the data distribution. One analysis observed a strong relation between power and data distribution, but noted that the dependency was largest in the smallest processor, the DSP. The average-instruction approach only works when no strong relation between data distribution and power is observed in the processor.

Functions, threads, or processes
Instead of evaluating power at the instruction level, software macro-modeling techniques treat application functions, threads, and processes as "black boxes" and construct macro-models that correlate power with a set of characteristics of interest. Such power characteristics can be collected with a low-level energy simulation framework or an operating system API. As a result, the power of a software function, thread, or process can be represented as a sum over other metrics. Windows supports per-process performance counters, which give access to a wide range of per-process information about page faults, thread state, process state, and priority.

However, storing the complete control-flow graph for each software function and counting each correlated path whenever that function is invoked is computationally intensive, so offline analysis is often used. Per-thread and per-process estimation is easier when operating system APIs can be used; online estimation requires simpler metrics.

Power data bank
A power data bank consists of energy consumption and execution time information for each library function. The data are obtained from a range of power simulators, from transistor level to RT level. A lower-level simulator provides more accurate data but has a longer simulation time; the user can choose the simulator that fits the requirements.

Consequently, a function might have a list of values from different simulators. This concept builds up a common interface for function-level power information.

Cycle-accurate Architectural Level Simulation
Circuit- and gate-level simulations are infeasible for evaluating the power consumption of large software executing on complex computing systems. One alternative is cycle-accurate architectural-level power simulators. Architectural-level power simulation can model a modern superscalar processor with deep pipelines and out-of-order, speculative execution.

Early work developed an architectural-level system power analysis tool for a 16-bit DSP and an in-order 32-bit RISC microprocessor. It evaluates short sequences of low-level assembly benchmarks but does not model out-of-order hardware.

Another framework for architectural-level power analysis, Wattch, is much faster than the layout-level tools existing at the time while maintaining accuracy within 10% of their estimates.

However, the long simulation time of cycle-accurate simulation hinders efficient design-space search, especially when simulating large applications with detailed processor models. Because of that, simulation-based power models cannot be used to support run-time software power estimation.

OS routine
OS-routine-based power characterization and estimation thus avoids the computationally expensive full-system simulation for each estimated application. Combined with existing performance estimation, an OS-routine-level power model can lead to highly efficient and accurate runtime power modeling for the OS.

$$E_{OS} = \sum_i Power_{service_i} \times Time_{service_i}$$

Compared to subroutines, the existence of OS-routine-level energy locality implies that smaller tables suffice to store the power-model parameters of the most frequently invoked OS routines.
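A minimal sketch of the OS-routine sum, assuming per-routine power parameters have already been characterized (the routine names and wattages below are invented):

```python
# Sketch: a small table keyed by frequently invoked OS routines holds
# assumed power parameters; energy is power times time, accumulated
# per invocation, matching the E_OS sum above.

ROUTINE_POWER_W = {"read": 0.8, "write": 1.1, "sched_yield": 0.3}  # assumed

def os_energy(invocations):
    """invocations: list of (routine_name, duration_seconds) pairs."""
    return sum(ROUTINE_POWER_W[name] * dt for name, dt in invocations)

e = os_energy([("read", 0.002), ("write", 0.001), ("read", 0.002)])
# 0.8*0.002 + 1.1*0.001 + 0.8*0.002 = 0.0043 J
```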

Hardware component
Divide-and-conquer suggests that the power consumption of a system can be modeled as the sum of the power consumption of its subsystems. After subdividing, power estimation becomes a recursive problem. To increase accuracy, one can experiment on individual hardware components; the resulting insights can open up the blackbox.

For example, a smartphone can have the following components. However, these components must not be assumed to consume power independently; sometimes inter- and intra-component effects need to be considered.
 * Processor
 * LCD display
 * Wifi-interface
 * GPS
 * Bluetooth
 * Audio
 * Camera
 * Battery
 * Storage

Power state of hardware components
The most accurate models suggest that each hardware component has power states. For example, the ACPI specification defines four Global "Gx" states and six Sleep "Sx" states for an ACPI-compliant computer system. While each state represents a level of power consumption, if the system has a large number of states it is problematic to classify the current system state from fluctuating measurements.
 * G0: on
 * G1: sleep, which internally divides into 4 sleeping states
 * G2: soft off
 * G3: mechanical off

In addition, inter-component effects might cause a state transition in one component to trigger a transition in another.

Tail Effect
The tail effect appears in smartphone network modules, including 3G, GPS, and WiFi. In principle, the power state of a component should correspond to the throughput of its work; for example, WiFi at 11 Mbps consumes more power than the lower power state at 5.5 Mbps. This is called the "utilization-power-state correlation" assumption. However, the tail effect, in which the device remains in a high power state after active I/O finishes, breaks this assumption. Any time the 3G network is used, there is a "tail" after usage. For example, the Angry Birds ad module uses 3G for uploading and downloading, and its tail effect shortens battery life.

State of discharge (SOD)
State of discharge represents the percentage of battery energy consumed. By creating a lookup table from voltage to state of discharge, one can derive energy consumption from two voltage measurements taken at two different time instances.

$$P \times (t_1 - t_2) = E \times (SOD(V_1) - SOD(V_2))$$

Because the relation between SOD and voltage differs between batteries, training often requires first charging the battery to full capacity and then discharging it at a constant current rate.
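The lookup-table approach can be sketched as follows; the voltage-to-SOD table below is invented, standing in for one built during a constant-current discharge:

```python
# Sketch: a voltage-to-SOD lookup table (values invented) with
# piecewise-linear interpolation; average power between two voltage
# readings follows the P*(t2-t1) = E*(SOD(V2)-SOD(V1)) relation.

SOD_TABLE = [(4.2, 0.0), (3.9, 0.25), (3.7, 0.5), (3.5, 0.75), (3.2, 1.0)]

def sod(voltage):
    """Interpolate SOD from a measured voltage."""
    for (v_hi, s_hi), (v_lo, s_lo) in zip(SOD_TABLE, SOD_TABLE[1:]):
        if v_lo <= voltage <= v_hi:
            frac = (v_hi - voltage) / (v_hi - v_lo)
            return s_hi + frac * (s_lo - s_hi)
    raise ValueError("voltage outside table range")

def average_power(e_full_joules, v1, t1, v2, t2):
    return e_full_joules * (sod(v2) - sod(v1)) / (t2 - t1)

# 40 kJ battery, voltage drops 3.9 V -> 3.7 V over 1000 s: 10 W average.
p = average_power(40_000, 3.9, 0.0, 3.7, 1000.0)
```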

Ah counting and Kalman filtering method
Besides finding the relation between SOD and voltage, Ah counting and Kalman filtering are also commonly used. Ah counting uses the integral of the load current to estimate SOD. If the current can be measured accurately, Ah counting is accurate enough for estimation; however, it is suitable only for stable loads. A severe workload, which in turn triggers strong battery electrochemical effects, will accumulate errors over time.
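Ah (coulomb) counting reduces to a discrete integral of current over time; a minimal sketch, assuming perfectly accurate current samples (real sensors drift, and that drift accumulates directly into the SOD estimate):

```python
# Sketch of Ah counting: integrate the measured load current over time
# and divide by the rated capacity to get the change in SOD.

def ah_counting(samples, capacity_coulombs, sod0=0.0):
    """samples: list of (current_amps, duration_seconds). Returns final SOD."""
    charge = sum(i * dt for i, dt in samples)  # coulombs drawn from battery
    return sod0 + charge / capacity_coulombs

# 3600 C (1 Ah) battery discharged at 0.5 A for 1800 s: SOD rises by 0.25.
s = ah_counting([(0.5, 1800.0)], 3600.0)
```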

For Kalman filtering, a battery is typically considered a power system, and SOD is the internal state of this system.

$$State_{k+1} = A_k \times State_k + B_k \times u_k + w_k$$

Kalman filtering does not require the initial SOD to be correct because the estimate converges to the right value. It is also suitable for modeling batteries with self-discharge.
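A minimal scalar sketch of the state equation above, with A_k = 1 and B_k*u_k the per-step discharge increment; the noise variances are illustrative, and the example deliberately starts from a wrong SOD to show the convergence just described:

```python
# Minimal scalar Kalman filter sketch for SOD. Process noise q and
# measurement noise r are assumed values, not characterized ones.

def kalman_step(x, p, u, z, q=1e-4, r=1e-2):
    """One predict/update cycle; returns the new estimate and variance."""
    x_pred = x + u          # predict: SOD grows by the discharge increment
    p_pred = p + q          # prediction variance grows by process noise
    k = p_pred / (p_pred + r)          # Kalman gain
    x_new = x_pred + k * (z - x_pred)  # correct with voltage-derived SOD z
    p_new = (1 - k) * p_pred
    return x_new, p_new

x, p = 0.9, 1.0   # deliberately wrong initial SOD, high uncertainty
for _ in range(50):
    x, p = kalman_step(x, p, u=0.0, z=0.3)  # true SOD held at 0.3
# x converges toward 0.3 despite the bad starting guess
```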

Processor utilization
Processor utilization is used to estimate processor energy specifically. Intuitively, higher processor utilization consumes more energy.

Therefore, the model is:

$$E_{processor} = a \times Utilization_{processor} + b$$

Given measurements pairing a processor's energy consumption with its utilization, linear regression can determine the fitting variables a and b.
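A sketch of that fit using ordinary least squares; the samples below are synthetic, generated from a = 10 and b = 2, so the fit should recover those values:

```python
# Sketch: fit E_processor = a * utilization + b by least squares
# on (utilization, power) samples. Data are synthetic.

def fit_line(xs, ys):
    """Ordinary least-squares fit of y = a*x + b."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    return a, my - a * mx

utilization = [0.1, 0.3, 0.5, 0.7, 0.9]
power = [10 * u + 2 for u in utilization]  # noiseless synthetic readings
a, b = fit_line(utilization, power)
# a ≈ 10.0 (W per unit utilization), b ≈ 2.0 (idle power)
```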

Memory Access and Cache Miss
A more accurate estimate of memory energy may use a cycle-accurate simulation of the hardware design. However, the key driver of memory power is read/write throughput, so the last-level cache (LLC) miss counter, which is widely available in most processors, can be used.

$$E_{mem} = a \times LLCMissCount + b$$

GPU memory access
The GPU is a complex sub-component that can be further divided into more components to model. In particular, a GPU has several memory spaces: global, shared, local, texture, and constant. Due to proximity, the global and local memory use the same physical GDDR memory and are modeled separately; the other three memory spaces (shared, constant, and texture) are modeled as components of the streaming multiprocessor (SM).

Disk IO
Most research in disk power management focuses on disk behavior. Disk energy consumption can be divided into the following stages, where idle/standby consumes the least power and write consumes the most. While models based on these stages can be accurate, the stages are not visible in a virtual machine environment. A simpler model can therefore track the bytes read from or written to the logical disk. Bytes read and written can be merged into one variable during training if the difference between disk read and write energy is negligible. A constant model has an error rate as high as 35%, while a simple linear model cuts the error rate to 15%; without cache modeling (aggressive readahead), the error rate could rise to 112%.
 * Spin up/down
 * seeking
 * rotation
 * read/write
 * Idle/standby
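The simpler VM-friendly model described above can be sketched as a linear function of bytes transferred; the per-byte cost and baseline power below are assumed values, not measurements:

```python
# Sketch: disk energy as linear in total bytes transferred, with read
# and write merged into one variable because their per-byte costs are
# assumed to be close. Coefficients are invented for illustration.

JOULES_PER_BYTE = 2e-9   # assumed combined read/write cost
IDLE_POWER_W = 0.5       # assumed constant baseline

def disk_energy(bytes_read, bytes_written, interval_s):
    transferred = bytes_read + bytes_written  # merged variable
    return IDLE_POWER_W * interval_s + JOULES_PER_BYTE * transferred

e = disk_energy(1_000_000, 500_000, 10.0)
# 0.5*10 + 2e-9 * 1.5e6 = 5.003 J
```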

Model Fitting
After data are collected, they are used to fit a model. Because online estimation is highly favored, most studies use a linear model when only proportionality between the metrics and power/energy is known. A piecewise-linear or quadratic model is used when linearity does not hold, and the reasons are usually discussed in full.

Linear
The linear model is widely used for its speed and simplicity. In component-based estimation where only a few metrics are non-linear, a linear model can still be used because those metrics do not contribute much error. The linear model also captures the idea of subsystems, where each subsystem's power consumption adds up to the total. For non-linear models, except in a few special cases, it is not possible to directly derive an equation that computes the best-fit values from the data; nonlinear regression instead requires a computationally intensive, iterative approach.

Piecewise Linear
Piecewise linear fitting is an alternative when the power consumption data is clearly non-linear. In this case, a maximum fitting error is chosen. To approximate a curve, one first generates a linear fit; if the error is higher than the threshold, a new breakpoint is added. The process iterates until the maximum-error criterion is satisfied. This model comprises several linear models underneath. For example, state of discharge can be modeled as piecewise linear. With this approach, training can still be relatively fast.
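The breakpoint-insertion loop can be sketched recursively: fit one segment, and wherever the error exceeds the threshold, split at the worst point (the V-shaped test curve below is synthetic):

```python
# Sketch: start with one linear segment over the whole curve; if the
# maximum fitting error exceeds the threshold, add a breakpoint at the
# worst point and refit each half, until every segment satisfies it.

def piecewise_fit(points, max_err):
    """points: list of (x, y) sorted by x. Returns breakpoint indices."""
    def segment_error(i, j):
        (x0, y0), (x1, y1) = points[i], points[j]
        errs = [abs(y - (y0 + (y1 - y0) * (x - x0) / (x1 - x0)))
                for x, y in points[i:j + 1]]
        return max(errs), i + errs.index(max(errs))

    def split(i, j):
        err, worst = segment_error(i, j)
        if err <= max_err or j - i < 2:
            return [i, j]
        left, right = split(i, worst), split(worst, j)
        return left[:-1] + right  # merge, dropping the duplicated index

    return split(0, len(points) - 1)

# A V-shaped curve needs a breakpoint at its kink (index 2) to fit exactly.
pts = [(0, 0), (1, 1), (2, 2), (3, 1), (4, 0)]
bps = piecewise_fit(pts, 0.1)
# → [0, 2, 4]
```

This connects endpoints rather than least-squares fitting each segment, which keeps the sketch short; the iterate-until-under-threshold structure is the part described in the text.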

Finite State Machine (FSM)
Each state of the FSM denotes one of a fixed number of power states. The edges denote the corresponding system calls.

To create the FSM, power consumption and system-call traces are measured. These data points fluctuate and show bursty behavior. A modeled power consumption is then created by identifying the timing of system calls and averaging the fluctuating data points. Finally, the FSM is created from this modeled power consumption.

Unlike previous work that used system calls as a measurement of activity, here a system call is a trigger that allows state transitions.

Invoking a system call that turns a component on or off immediately triggers a power-state change, avoiding the delay of periodic counter sampling. This system-call approach naturally suggests using a finite state machine to describe the state transitions. In addition, a system call can be mapped back to the original function call or thread/process, allowing fine-grained measurement of power.
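A table-driven sketch of such an FSM; the state names, hypothetical system calls, and per-state wattages below are all invented for illustration:

```python
# Sketch: system calls act as transition triggers between power states;
# calls with no matching edge leave the state unchanged.

POWER_W = {"idle": 0.1, "wifi_on": 0.9}   # assumed per-state draw
TRANSITIONS = {
    ("idle", "wifi_enable"): "wifi_on",
    ("wifi_on", "wifi_disable"): "idle",
}

def run_fsm(state, syscall_trace):
    """Advance the power-state FSM on each observed system call."""
    for call in syscall_trace:
        state = TRANSITIONS.get((state, call), state)
    return state

end = run_fsm("idle", ["wifi_enable", "read", "wifi_disable", "wifi_enable"])
# ends in "wifi_on", drawing 0.9 W
```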

Offline vs. Online Estimation
Estimating power consumption is critical for hardware and software developers, and among the latter, particularly for OS programmers writing process schedulers. This information can be used to flatten power usage, preventing workload spikes during peak periods. Obtaining online estimates of power is non-trivial: although more intrusive hardware instrumentation exists, it is deployed for designing the system rather than running it. For this reason, online estimation is valuable.

Due to performance concerns, typical online estimation has the following characteristics. Offline estimation does not have these requirements and is therefore more flexible.
 * Less variables
 * Linear (simpler) model (linear least square fit)
 * Divide into multiple components to cut down training matrix size if possible
 * Fixed number of power states
 * Make use of highly-available performance counter
 * Put trained data into lookup table