QPACE

QPACE (QCD Parallel Computing on the Cell Broadband Engine) is a massively parallel and scalable supercomputer designed for applications in lattice quantum chromodynamics.

Overview
The QPACE supercomputer is a research project carried out by several academic institutions in collaboration with the IBM Research and Development Laboratory in Böblingen, Germany, and other industrial partners including Eurotech, Knürr, and Xilinx. The academic design team of about 20 junior and senior scientists, mostly physicists, came from the University of Regensburg (project lead), the University of Wuppertal, DESY Zeuthen, Jülich Research Centre, and the University of Ferrara. The main goal was to design an application-optimized, scalable architecture that outperforms industrial products in compute performance, price-performance ratio, and energy efficiency. The project officially started in 2008. Two installations were deployed in the summer of 2009. The final design was completed in early 2010. Since then, QPACE has been used for lattice QCD calculations. The system architecture is also suitable for other applications that rely mainly on nearest-neighbor communication, e.g., lattice Boltzmann methods.

In November 2009, QPACE was the leading architecture on the Green500 list of the most energy-efficient supercomputers in the world. The title was defended in June 2010, when the architecture achieved an energy efficiency of 773 MFLOPS per watt in the Linpack benchmark. In the Top500 list of the most powerful supercomputers, QPACE ranked #110-#112 in November 2009 and #131-#133 in June 2010.

QPACE was funded by the German Research Foundation (DFG) in the framework of SFB/TRR-55 and by IBM. Additional contributions were made by Eurotech, Knürr, and Xilinx.

Architecture
In 2008, IBM released the PowerXCell 8i multi-core processor, an enhanced version of the IBM Cell Broadband Engine used, e.g., in the PlayStation 3. The processor received much attention in the scientific community due to its outstanding floating-point performance. It is one of the building blocks of the IBM Roadrunner cluster, which was the first supercomputer architecture to break the PFLOPS barrier. Cluster architectures based on the PowerXCell 8i typically rely on IBM BladeCenter blade servers interconnected by industry-standard networks such as InfiniBand. For QPACE, an entirely different approach was chosen: a custom-designed network co-processor implemented on Xilinx Virtex-5 FPGAs connects the compute nodes. FPGAs are re-programmable semiconductor devices whose functional behavior can be specified by the designer. The QPACE network processor is tightly coupled to the PowerXCell 8i via a Rambus-proprietary I/O interface.

The smallest building block of QPACE is the node card, which hosts the PowerXCell 8i and the FPGA. Node cards are mounted on backplanes, each of which can host up to 32 node cards. One QPACE rack houses up to eight backplanes, four mounted on the front and four on the back. The maximum number of node cards per rack is therefore 256. QPACE relies on a water-cooling solution to achieve this packaging density.

Sixteen node cards are monitored and controlled by a separate administration card, called the root card. One more administration card per rack, called the superroot card, is used to monitor and control the power supplies. The root cards and superroot cards are also used for synchronization of the compute nodes.
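The packaging figures above can be checked with a short calculation. The following is a minimal sketch: the function name `rack_inventory` is hypothetical, and the number of root cards per rack is derived from the stated ratio of one root card per sixteen node cards rather than quoted from the text.

```python
# Packaging figures quoted in the text.
NODE_CARDS_PER_BACKPLANE = 32
BACKPLANES_PER_RACK = 8        # four on the front, four on the back
NODE_CARDS_PER_ROOT_CARD = 16
SUPERROOT_CARDS_PER_RACK = 1


def rack_inventory(backplanes: int = BACKPLANES_PER_RACK) -> dict:
    """Count the cards in one rack holding `backplanes` fully populated backplanes."""
    if not 0 <= backplanes <= BACKPLANES_PER_RACK:
        raise ValueError("a rack houses between 0 and 8 backplanes")
    node_cards = backplanes * NODE_CARDS_PER_BACKPLANE
    # Derived value (assumption): one root card for every 16 node cards.
    root_cards = node_cards // NODE_CARDS_PER_ROOT_CARD
    return {
        "node_cards": node_cards,
        "root_cards": root_cards,
        "superroot_cards": SUPERROOT_CARDS_PER_RACK,
    }


if __name__ == "__main__":
    print(rack_inventory())
```

For a fully populated rack this gives 256 node cards, matching the maximum stated above, and 16 root cards under the derived ratio.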

Node card
The heart of QPACE is the IBM PowerXCell 8i multi-core processor. Each node card hosts one PowerXCell 8i, 4 GB of DDR2 SDRAM with ECC, one Xilinx Virtex-5 FPGA and seven network transceivers. A single 1 Gigabit Ethernet transceiver connects the node card to the I/O network. Six 10 Gigabit transceivers are used for passing messages between neighboring nodes in a three-dimensional toroidal mesh.

The QPACE network co-processor is implemented on a Xilinx Virtex-5 FPGA, which is directly connected to the I/O interface of the PowerXCell 8i. The functional behavior of the FPGA is defined in a hardware description language and can be changed at any time at the cost of rebooting the node card. Most entities of the QPACE network co-processor are coded in VHDL.

Networks
The QPACE network co-processor connects the PowerXCell 8i to three communications networks:


 * The torus network is a high-speed communication path that allows for nearest-neighbor communication in a three-dimensional toroidal mesh. The torus network relies on the physical layer of 10 Gigabit Ethernet, while a custom-designed communications protocol optimized for small message sizes is used for message passing. A unique feature of the torus network design is the support for zero-copy communication between the private memory areas, called the Local Stores, of the Synergistic Processing Elements (SPEs) by direct memory access. The latency for communication between two SPEs on neighboring nodes is 3 μs. The peak bandwidth per link and direction is about 1 GB/s.
 * Switched 1 Gigabit Ethernet is used for file I/O and maintenance.
 * The global signals network is a simple 2-wire system arranged as a tree network. This network is used for evaluation of global conditions and synchronization of the nodes.
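The nearest-neighbor topology of the torus network can be illustrated with a small sketch. It is not taken from the QPACE software: the function names, the link labels, and the example torus dimensions are hypothetical, and the transfer-time estimate uses a simple latency-plus-bandwidth model as an assumption, fed with the 3 μs and 1 GB/s figures quoted above.

```python
from typing import Dict, Tuple

Coord = Tuple[int, int, int]


def torus_neighbors(node: Coord, dims: Coord) -> Dict[str, Coord]:
    """Return the six nearest neighbors of `node` in a 3D torus of size `dims`.

    Coordinates wrap around periodically, so every node has exactly six
    links (+x, -x, +y, -y, +z, -z), one per 10 Gigabit transceiver.
    """
    if any(d < 1 for d in dims):
        raise ValueError("every torus dimension must be at least 1")
    if any(not 0 <= c < d for c, d in zip(node, dims)):
        raise ValueError("node coordinates must lie inside the torus")
    neighbors: Dict[str, Coord] = {}
    for axis, name in enumerate("xyz"):
        for step, sign in ((1, "+"), (-1, "-")):
            coord = list(node)
            # Modular arithmetic implements the periodic wrap-around.
            coord[axis] = (coord[axis] + step) % dims[axis]
            neighbors[sign + name] = (coord[0], coord[1], coord[2])
    return neighbors


def transfer_time_us(message_bytes: int,
                     latency_us: float = 3.0,
                     bandwidth_gb_per_s: float = 1.0) -> float:
    """Estimate a link transfer time in microseconds (assumed simple model)."""
    bytes_per_us = bandwidth_gb_per_s * 1e3  # 1 GB/s equals 1000 bytes per microsecond
    return latency_us + message_bytes / bytes_per_us


if __name__ == "__main__":
    # Hypothetical 8 x 16 x 2 torus; the real partition sizes are not given here.
    for link, coord in torus_neighbors((0, 0, 0), (8, 16, 2)).items():
        print(link, coord)
    print(transfer_time_us(3000), "microseconds for a 3000-byte message")
```

Under this model the fixed latency dominates for messages well below about 3 kB, which is why a protocol tuned for small messages matters for this kind of workload.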

Cooling
The compute nodes of the QPACE supercomputer are cooled by water. Roughly 115 watts have to be dissipated from each node card. The cooling solution is based on a two-component design. Each node card is mounted to a thermal box, which acts as a large heat sink for heat-critical components. The thermal box interfaces to a coldplate, which is connected to the water-cooling circuit. A single coldplate can remove the heat from up to 32 nodes. The node cards are mounted on both sides of the coldplate, i.e., 16 nodes each on the top and bottom. The cooling solution is efficient enough that the compute nodes can be cooled with warm water. The QPACE cooling solution also influenced other supercomputer designs such as SuperMUC.
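A rough estimate of the resulting heat load follows from the figures above. This is a back-of-the-envelope sketch: the function name `heat_load_watts` is hypothetical, and the totals count only the roughly 115 W per node card, ignoring power supplies, administration cards, and other components.

```python
WATTS_PER_NODE_CARD = 115   # "roughly 115" watts per node card, from the text
NODES_PER_COLDPLATE = 32    # 16 on the top and 16 on the bottom
NODES_PER_RACK = 256        # maximum number of node cards per rack


def heat_load_watts(nodes: int) -> int:
    """Approximate heat to be removed from `nodes` node cards, in watts."""
    if nodes < 0:
        raise ValueError("the number of nodes cannot be negative")
    return nodes * WATTS_PER_NODE_CARD


if __name__ == "__main__":
    print("per coldplate:", heat_load_watts(NODES_PER_COLDPLATE), "W")
    print("per full rack:", heat_load_watts(NODES_PER_RACK), "W")
```

With these inputs a fully loaded coldplate handles about 3.7 kW, and the node cards of a full rack dissipate about 29 kW.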

Installations
Two identical installations of QPACE with four racks each have been operating since 2009:
 * Jülich Research Centre
 * University of Wuppertal

The aggregate peak performance is about 200 TFLOPS in double precision and 400 TFLOPS in single precision. The installations are operated by the University of Regensburg, Jülich Research Centre, and the University of Wuppertal.
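The aggregate peak figures can be related to a single node with a short calculation. This is a sketch under stated assumptions: it assumes that each installation has four fully populated racks of 256 node cards and that the aggregate figures cover both installations together. The function name `peak_per_node_gflops` is hypothetical, and the per-node values are derived, not quoted.

```python
def peak_per_node_gflops(aggregate_tflops: float,
                         installations: int = 2,
                         racks_per_installation: int = 4,
                         nodes_per_rack: int = 256) -> float:
    """Divide an aggregate peak (TFLOPS) evenly over all node cards (GFLOPS each)."""
    nodes = installations * racks_per_installation * nodes_per_rack
    if nodes <= 0:
        raise ValueError("the node count must be positive")
    return aggregate_tflops * 1000.0 / nodes


if __name__ == "__main__":
    print("double precision:", peak_per_node_gflops(200.0), "GFLOPS per node")
    print("single precision:", peak_per_node_gflops(400.0), "GFLOPS per node")
```

Under these assumptions the 2048 node cards each contribute roughly 98 GFLOPS in double precision and roughly 195 GFLOPS in single precision.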