Simultaneous and heterogeneous multithreading

Simultaneous and heterogeneous multithreading (SHMT) is a software framework that takes advantage of heterogeneous computing systems that contain a mixture of central processing units (CPUs), graphics processing units (GPUs), and special-purpose machine learning hardware such as Tensor Processing Units (TPUs).

Each component processes information differently. Often data has to move among processors, which can create bottlenecks, with one processor starving while waiting on another to finish.

Architecture
The system defines virtual processors and virtual operations (VOPs). Each VOP decomposes into one or more high-level operations (HLOPs), which the system distributes across the processors. The runtime system dynamically maps virtual processors to physical processors, assessing resource availability in order to keep all the processors busy. The scheduler employs a lightweight, quality-aware work-stealing (QAWS) policy.
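The decomposition and work-stealing described above can be illustrated with a toy model. This is a minimal sketch, not the SHMT runtime: the `HLOP` and `Device` classes, the integer quality tiers, the even split in `decompose`, and the lockstep `run` loop are all assumptions made for illustration. An idle device steals from the tail of the busiest other queue, but only an HLOP whose (hypothetical) quality requirement it can meet.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class HLOP:
    name: str
    min_quality: int  # hypothetical: lowest precision tier this HLOP tolerates


@dataclass
class Device:
    name: str
    quality: int  # hypothetical precision tier the device delivers
    queue: deque = field(default_factory=deque)
    done: list = field(default_factory=list)


def decompose(vop_name, parts, min_quality):
    """Split one VOP into `parts` HLOPs (an even split is assumed here)."""
    return [HLOP(f"{vop_name}[{i}]", min_quality) for i in range(parts)]


def steal(thief, devices):
    """Take one HLOP the thief may run from the tail of the busiest other queue."""
    victims = sorted(
        (d for d in devices if d is not thief),
        key=lambda d: len(d.queue),
        reverse=True,
    )
    for victim in victims:
        # Scan from the tail so the victim keeps the work it will reach first.
        for i in range(len(victim.queue) - 1, -1, -1):
            if victim.queue[i].min_quality <= thief.quality:
                hlop = victim.queue[i]
                del victim.queue[i]
                return hlop
    return None


def run(devices):
    """Advance in lockstep: each device runs at most one HLOP per step."""
    while any(d.queue for d in devices):
        for d in devices:
            hlop = d.queue.popleft() if d.queue else steal(d, devices)
            if hlop is not None:
                d.done.append(hlop.name)


# Illustrative setup: the TPU is modelled as a lower-precision device.
cpu, gpu, tpu = Device("cpu", 2), Device("gpu", 2), Device("tpu", 1)
cpu.queue.extend(decompose("fft", 2, min_quality=2))
gpu.queue.extend(decompose("sobel", 6, min_quality=1))
run([cpu, gpu, tpu])
for d in (cpu, gpu, tpu):
    print(d.name, d.done)
```

In this setup the TPU starts with an empty queue and picks up Sobel HLOPs from the GPU, while the FFT HLOPs, which are marked as needing the higher tier, never reach it.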

Conventional runtimes assign one processor (or one set of processors of the same type) to each subtask, leaving the other processor types idle. In other words, the CPU(s) run the first subtask (possibly in parallel); when it completes, the next subtask is handed to the GPU(s); and when they finish, the following subtask is handed to the TPU(s).

Adding software pipelining allows the second subtask to run using partial results from the first subtask, which improves resource utilization.
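The effect of pipelining on completion time can be shown with a small timing model. This is a sketch under stated assumptions: the work is split into equal chunks, the first subtask takes `t_first` per chunk on one processor, the second takes `t_second` per chunk on another, and data transfer is free. The function names and the numbers are hypothetical.

```python
def serial_time(chunks, t_first, t_second):
    """Second subtask starts only after the first has finished every chunk."""
    return chunks * t_first + chunks * t_second


def pipelined_time(chunks, t_first, t_second):
    """Second subtask starts on each chunk as soon as the first has produced it."""
    first_done = 0.0   # time at which the first stage finishes the current chunk
    second_done = 0.0  # time at which the second stage finishes the current chunk
    for _ in range(chunks):
        first_done += t_first
        # The second stage needs both the chunk and a free processor.
        second_done = max(second_done, first_done) + t_second
    return second_done


print(serial_time(4, 2.0, 3.0))     # 20.0
print(pipelined_time(4, 2.0, 3.0))  # 14.0
```

With these illustrative values, overlapping the two stages removes most of the time in which the second processor would otherwise wait.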

SHMT goes a step further: it identifies subtasks that can run independently of one another and dispatches each to an appropriate processor type, allowing even greater parallelism. Some subtasks can be performed on multiple processor types, and SHMT can divide a single such subtask across those types. The key gain is that more processors work simultaneously, reducing time and energy costs.
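Dividing one subtask across several processor types can be sketched as a proportional split. The throughput figures below are invented for illustration, and the model ignores data movement and result quality, both of which a real scheduler has to weigh; `split_by_throughput` and `makespan` are hypothetical helper names.

```python
def split_by_throughput(total_items, throughput):
    """Give each processor a share of the items proportional to its rate."""
    total_rate = sum(throughput.values())
    shares = {
        name: total_items * rate // total_rate
        for name, rate in throughput.items()
    }
    # Integer division can leave a remainder; hand it to the fastest processor.
    fastest = max(throughput, key=throughput.get)
    shares[fastest] += total_items - sum(shares.values())
    return shares


def makespan(shares, throughput):
    """Time until the slowest processor finishes its share."""
    return max(shares[name] / throughput[name] for name in shares)


# Hypothetical rates in items per time unit.
rates = {"cpu": 1, "gpu": 4, "tpu": 3}
shares = split_by_throughput(800, rates)
print(shares)                   # {'cpu': 100, 'gpu': 400, 'tpu': 300}
print(makespan(shares, rates))  # 100.0
print(800 / rates["gpu"])       # 200.0 when the GPU works alone
```

Under these assumed rates, all three processors finish at the same time and the subtask takes half as long as it would on the GPU alone.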

Benchmark
Researchers tested the concept using a typical smartphone configuration tweaked so that it resembled a data center server.

The hardware was Nvidia's Jetson Nano module, containing a quad-core ARM Cortex-A57 CPU and a GPU with 128 Maxwell-architecture cores. A Google Edge TPU was connected via an M.2 Key E slot. The processors communicated via an onboard PCI Express (PCIe) interface. Shared data was held in 4 GB of 64-bit LPDDR4 memory. The Edge TPU adds 8 MB of device memory. The operating system was Ubuntu Linux 18.04.

Compared to a conventional system, performance increased by a factor of 1.95 and energy consumption fell by 51% across a range of benchmarks: Black–Scholes, DCT8X8, DWT, FFT, Histogram, Hotspot, Laplacian, MF, Sobel, and SRAD. GMEAN, listed alongside them, denotes the geometric mean across the benchmarks rather than a separate benchmark.
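The two headline figures can be related through the identity energy = average power × time. The short calculation below is a sketch that assumes both figures describe the same aggregate workload; the derived ratios are implied by that assumption and are not reported results.

```python
speedup = 1.95           # reported performance gain
energy_reduction = 0.51  # reported energy saving

time_ratio = 1.0 / speedup             # run time relative to the baseline
energy_ratio = 1.0 - energy_reduction  # energy relative to the baseline

# energy = average power * time, so power ratio = energy ratio / time ratio.
power_ratio = energy_ratio / time_ratio
# Energy-delay product relative to the baseline.
edp_ratio = energy_ratio * time_ratio

print(f"time ratio:   {time_ratio:.3f}")
print(f"power ratio:  {power_ratio:.3f}")
print(f"EDP ratio:    {edp_ratio:.3f}")
```

Under that assumption the implied average power draw is close to the baseline's (a ratio of about 0.96), which suggests that most of the energy saving would come from finishing sooner rather than from drawing less power.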