User:Henriok/Cell (microprocessor)

Cell is a microprocessor architecture jointly developed by Sony Computer Entertainment, Toshiba, and IBM, an alliance known as "STI". Cell is shorthand for Cell Broadband Engine Architecture, commonly abbreviated CBEA in full or Cell BE in part. Cell combines a general-purpose Power Architecture core of modest performance with streamlined coprocessing elements which greatly accelerate multimedia and vector processing applications, as well as many other forms of dedicated computation. The architectural design and first implementation were carried out at the STI Design Center in Austin, Texas over a four-year period beginning March 2001 on a budget reported by IBM as approaching US$400 million. The first major commercial application of Cell was in Sony's PlayStation 3 game console.

History


In 2000, Sony Computer Entertainment, Toshiba Corporation, and IBM formed an alliance ("STI") to design and manufacture the processor.

The STI Design Center opened in March 2001. The Cell was designed over a period of four years, using enhanced versions of the design tools for the POWER4 processor. Over 400 engineers from the three companies worked together in Austin, with critical support from eleven of IBM's design centers.

During this period, IBM filed many patents pertaining to the Cell architecture, manufacturing process, and software environment. An early patent version of the Broadband Engine was shown to be a chip package comprising four "Processing Elements," which was the patent's description for what is now known as the Power Processing Element. Each Processing Element contained 8 APUs, which are now referred to as SPEs on the current Broadband Engine chip. Said chip package was widely regarded to run at a clock speed of 4 GHz, and with 32 APUs providing 32 GFLOPS each, the Broadband Engine was shown to have 1 teraflop of raw computing power. This design was fabricated using a 90 nm SOI process.

In November 2006, David A. Bader at Georgia Tech was selected by Sony, Toshiba, and IBM from more than a dozen universities to direct the first STI Center of Competence for the Cell Processor. This partnership is designed to build a community of programmers and broaden industry support for the Cell processor. There is a Cell Programming tutorial video available.

In March 2007, IBM announced that the 65 nm version of Cell BE was in production at its plant in East Fishkill, New York.

In February 2008, IBM announced that it would begin to fabricate Cell processors on a 45 nm process.

Commercialization
On May 17, 2005, Sony Computer Entertainment confirmed some specifications of the Cell processor that would be shipping in the forthcoming PlayStation 3 console. This Cell configuration has one Power Processor Element (PPE) on the core, with eight physical Synergistic Processing Elements (SPEs) in silicon. In the PlayStation 3, one SPE is locked out during the test process, a practice which helps to improve manufacturing yields, and another one is reserved for the OS, leaving six free SPEs to be used by games' code. The clock frequency is 3.2 GHz. The introductory design is fabricated using a 90-nanometre SOI process, with initial volume production slated for IBM's facility in East Fishkill, New York.

On June 28, 2005, IBM and Mercury Computer Systems announced a partnership agreement to build Cell-based computer systems for embedded applications such as medical imaging, industrial inspection, aerospace and defense, seismic processing, and telecommunications. Mercury has since then released blades, conventional rack servers and PCI Express accelerator boards with Cell processors.

In the fall of 2006, IBM released the QS20 blade module, which uses two Cell BE processors and reaches a peak of 410 gigaFLOPS per module. IBM has since released the updated QS21 and QS22 versions with improved designs.

Features such as the XDR memory subsystem, the coherent Element Interconnect Bus (EIB), and the many SPEs position Cell for deployment in supercomputing, where its floating point throughput can be exploited. IBM's Cell-based QS22 is part of the IBM Roadrunner supercomputer, which became operational in 2008.

Toshiba has announced plans to incorporate Cell in high-definition television sets. It has also built the SpursEngine, an accelerator processor for laptops and workstations that employs SPEs.

IBM has announced plans to incorporate Cell processors as add-on cards into IBM System z mainframes, to enable them to be used as servers for MMORPGs.

Architecture
The Cell architecture includes a novel memory coherence architecture for which IBM received many patents. The architecture emphasizes efficiency/watt, prioritizes bandwidth over latency, and favors peak computational throughput over simplicity of program code. For these reasons, Cell is widely regarded as a challenging environment for software development. IBM provides a comprehensive Linux-based Cell development platform to assist developers in confronting these challenges. Despite those challenges, research has indicated that Cell excels at several types of scientific computation.

The architecture is designed to bridge the gap between conventional desktop processors (such as the Athlon, Pentium, and PowerPC families) and more specialized high-performance processors, such as the NVIDIA and ATI graphics-processors (GPUs). The name indicates its intended use, namely as a component in digital distribution systems; as such it may be utilized in high-definition displays and recording equipment, as well as computer entertainment systems for the HDTV era. Additionally the processor may be suited to digital imaging systems (medical, scientific, etc.) as well as physical simulation (e.g., scientific and structural engineering modeling).

While a Cell chip can have a number of different configurations, the basic configuration is that the chip consists of four components: A "Power Processor Element" ("PPE"), multiple "Synergistic Processing Elements" ("SPE"), an internal "Element Interconnect Bus" ("EIB") and devices for I/O and memory control.

Modern graphics cards have multiple elements very similar to the SPEs, known as shader units, with an attached high speed memory. Programs, known as shaders, are loaded onto the units to process the input data streams fed from the previous stages (possibly the CPU), according to the required operations. The main differences are that the Cell's SPEs are much more general purpose than shader units, and the ability to chain the SPEs under program control offers considerably more flexibility, allowing the Cell to handle graphics, sound, or any other workload.

Power Processor Element (PPE)
The PPE is a general-purpose processor composed of a "Power Processor Unit" (PPU) and a 512 KiB L2 cache operating at half core speed. Because of its similarity to other PowerPC processors, the PPE is capable of running a conventional operating system. It also has control over the SPEs and can start, stop, interrupt, and schedule processes running on them. To this end the PPE has additional instructions relating to control of the SPEs.

The PPU is a 64-bit, two-way multithreaded, in-order core based on the Power Architecture and compliant with Power ISA v.2.03. It contains a 32 KiB instruction L1 cache and a 32 KiB data L1 cache. The PPU has a 64-bit general purpose register set (GPR), a 64-bit floating point register set (FPR), and a 128-bit AltiVec register set. The AltiVec unit is fully pipelined for single precision floating point. The PPU can complete two double precision operations per clock cycle using a scalar fused multiply-add instruction, which translates to 6.4 GFLOPS at 3.2 GHz; or eight single precision operations per clock cycle with a vector fused multiply-add instruction, which translates to 25.6 GFLOPS at 3.2 GHz.
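The peak figures quoted above follow from multiplying operations per cycle by the clock rate. A minimal sketch of that arithmetic, in which the function and variable names are illustrative and counting one fused multiply-add as two floating point operations is the usual convention rather than something the text states:

```python
def peak_gflops(ops_per_cycle: int, clock_ghz: float) -> float:
    """Peak throughput in GFLOPS: operations per cycle times clock in GHz."""
    return ops_per_cycle * clock_ghz


CLOCK_GHZ = 3.2

# Scalar fused multiply-add: one instruction per cycle, counted as two operations.
ppu_double = peak_gflops(2, CLOCK_GHZ)

# Vector fused multiply-add on a 128-bit register:
# four single precision lanes, two operations per lane.
ppu_single = peak_gflops(4 * 2, CLOCK_GHZ)

print(f"double precision: {ppu_double:.1f} GFLOPS")
print(f"single precision: {ppu_single:.1f} GFLOPS")
```

These are theoretical peaks; sustained throughput depends on how well the instruction stream keeps the pipelines full.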

The PPE and bus architecture includes various modes of operation giving different levels of memory protection, allowing areas of memory to be protected from access by specific processes running on the SPEs or the PPE.

Xenon in Xbox 360
The PPE was designed specifically for the Cell processor by IBM, but during development, Microsoft approached IBM wanting a high performance processor core for its Xbox 360. IBM complied and made the tri-core Xenon processor, based on a slightly modified version of the PPE. IBM was allowed to do this since the agreement with Sony and Toshiba stated that each company was free to use the technology in Cell however it wanted.

Synergistic Processing Elements (SPE)
Each SPE is composed of a "Synergistic Processing Unit" (SPU) and a "Memory Flow Controller" (MFC), which comprises the DMA, MMU, and bus interface. An SPE is a RISC processor with 128-bit SIMD organization for single and double precision instructions. The SPE contains 128-bit registers only, in a 128-entry register file. These can be used for scalar data types ranging from 8 bits to 128 bits in size or for SIMD computations on a variety of integer and floating point formats. Each SPE contains a 256 KiB embedded SRAM for instructions and data, called "Local Storage", which is visible to the PPE and can be addressed directly by software. Each SPE can support up to 4 GiB of local store memory. The local store does not operate like a conventional CPU cache, since it is neither transparent to software nor does it contain hardware structures that predict which data to load. An SPE can operate on 16 8-bit integers, 8 16-bit integers, 4 32-bit integers, or 4 single precision floating-point numbers in a single clock cycle, as well as perform a memory operation. Note that the SPU cannot directly access system memory; the 64-bit virtual memory addresses formed by the SPU must be passed from the SPU to the SPE memory flow controller (MFC) to set up a DMA operation within the system address space.
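The element counts listed above are simply the 128-bit register width divided by the element width. The following sketch models that relationship and a lane-wise integer add. It is an illustration only: the function names are hypothetical, and the wrap-around (modular) behaviour on overflow is an assumption made for the example, not a description of any particular SPU instruction.

```python
REGISTER_BITS = 128  # width of every SPE register, per the text


def simd_lanes(element_bits: int) -> int:
    """Number of elements of the given width that fit in one register."""
    if element_bits <= 0 or REGISTER_BITS % element_bits != 0:
        raise ValueError("element width must evenly divide the register width")
    return REGISTER_BITS // element_bits


def simd_add(a, b, element_bits: int):
    """Lane-wise unsigned add; each lane wraps independently (assumed)."""
    lanes = simd_lanes(element_bits)
    if len(a) != lanes or len(b) != lanes:
        raise ValueError(f"expected {lanes} elements per operand")
    mask = (1 << element_bits) - 1
    return [(x + y) & mask for x, y in zip(a, b)]


for width in (8, 16, 32):
    print(f"{width}-bit elements per register: {simd_lanes(width)}")
```

A single call such as `simd_add([1] * 16, [2] * 16, 8)` stands in for one SIMD instruction acting on sixteen 8-bit lanes at once.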

At 3.2 GHz, each SPE gives a theoretical 25.6 GFLOPS of single precision performance. The SPEs are also capable of double precision calculations, as often used in personal computers, albeit with an order of magnitude performance penalty; Cell as a whole still reaches 14 GFLOPS in double precision. This penalty can be partly circumvented in software using iterative refinement, in which values are calculated in double precision only when necessary. Jack Dongarra and his team demonstrated a 3.2 GHz Cell with 8 SPEs delivering a performance equal to 100 GFLOPS on an average double precision Linpack 4096x4096 matrix. Tests by IBM show that the SPEs can reach 98% of their theoretical peak performance using optimized parallel matrix multiplication.
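The idea behind iterative refinement can be sketched in a few lines: the expensive solve is done in single precision, and only the residual and the accumulated solution are kept in double precision. The code below is a generic textbook illustration, not the implementation used by Dongarra's team. Single precision is emulated by rounding through `struct`, all function names are hypothetical, and for brevity the elimination is repeated on every pass, whereas a practical solver would factor the matrix once and reuse the factors.

```python
import struct


def f32(x: float) -> float:
    """Round a Python float (double precision) to single precision."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def solve_single(a, b):
    """Gaussian elimination with partial pivoting, rounded to single precision."""
    n = len(b)
    m = [[f32(v) for v in row] + [f32(rhs)] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = f32(m[r][col] / m[col][col])
            for c in range(col, n + 1):
                m[r][c] = f32(m[r][c] - f32(factor * m[col][c]))
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = m[r][n]
        for c in range(r + 1, n):
            s = f32(s - f32(m[r][c] * x[c]))
        x[r] = f32(s / m[r][r])
    return x


def residual(a, x, b):
    """r = b - A x, computed in double precision."""
    return [bi - sum(aij * xj for aij, xj in zip(row, x))
            for row, bi in zip(a, b)]


def refine(a, b, iterations=5):
    """Single precision solves, corrected by double precision residuals."""
    x = solve_single(a, b)
    for _ in range(iterations):
        correction = solve_single(a, residual(a, x, b))
        x = [xi + ci for xi, ci in zip(x, correction)]
    return x


A = [[4.0, 1.0, 2.0], [1.0, 5.0, 1.0], [2.0, 1.0, 6.0]]
B = [12.0, 14.0, 22.0]  # chosen so that the exact solution is [1, 2, 3]
print(refine(A, B))
```

For a well-conditioned matrix such as this one, each pass shrinks the error by roughly the single precision rounding level, so a handful of passes is enough to reach double precision accuracy while most of the arithmetic stays in the faster format.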

Toshiba has developed a co-processor powered by four SPEs, but no PPE, called the SpursEngine designed to accelerate 3D and movie effects in consumer electronics.

Element Interconnect Bus (EIB)
The EIB is a communication bus internal to the Cell processor which connects the various on-chip system elements: the PPE processor, the memory controller (MIC), the eight SPE coprocessors, and two off-chip I/O interfaces, for a total of 12 participants. The EIB also includes an arbitration unit which functions as a set of traffic lights. In some documents IBM refers to EIB bus participants as 'units'. The EIB is presently implemented as a circular ring composed of four 16-byte-wide unidirectional channels which counter-rotate in pairs. The channels run at half the system clock rate, so the effective channel rate is 16 bytes every two system clocks. An EIB channel is not permitted to convey data requiring more than six steps; such data must take the shorter route around the circle in the other direction. At maximum concurrency, with three active transactions on each of the four rings, the peak instantaneous EIB bandwidth is 96 bytes per clock (12 concurrent transactions × 16 bytes ÷ 2 system clocks). While this figure is often quoted in IBM literature, it is unrealistic to simply scale this number by processor clock speed.
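The six-step limit and the peak bandwidth figure can both be reproduced with a little modular arithmetic. In this sketch the twelve participants are numbered 0 to 11 around the ring; that numbering, and the function names, are assumptions for illustration and do not reflect the physical ordering of units on the chip.

```python
PARTICIPANTS = 12               # PPE, MIC, eight SPEs, two I/O interfaces
MAX_STEPS = PARTICIPANTS // 2   # no transfer travels more than six steps


def ring_route(src: int, dst: int):
    """Return (direction, steps) for the shorter way around the ring."""
    clockwise = (dst - src) % PARTICIPANTS
    counterclockwise = (src - dst) % PARTICIPANTS
    if clockwise <= counterclockwise:
        return ("clockwise", clockwise)
    return ("counterclockwise", counterclockwise)


def peak_bytes_per_clock(rings=4, transfers_per_ring=3,
                         width_bytes=16, clocks_per_transfer=2):
    """Peak instantaneous bandwidth in bytes per system clock."""
    return rings * transfers_per_ring * width_bytes // clocks_per_transfer


print(ring_route(0, 7))
print(peak_bytes_per_clock())
```

Because half of the channels rotate each way, a transfer that would need seven steps in one direction is instead carried five steps in the other.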

Note that each SPU processor contains a dedicated DMA management queue capable of scheduling long sequences of transactions to various endpoints without interfering with the SPU's ongoing computations; these DMA queues can be managed locally or remotely as well, providing additional flexibility in the control model.

Despite IBM's original desire to implement the EIB as a more powerful crossbar, the circular configuration they adopted to spare resources rarely represents a limiting factor on the performance of the Cell chip as a whole. In the worst case, the programmer must take extra care to schedule communication patterns where the EIB is able to function at high concurrency levels. The implementation of the EIB allows a future replacement with a more efficient crossbar switch if one is willing to trade floor space for performance.

Memory controller and I/O
Cell contains a dual channel next-generation Rambus XIO macro which interfaces to Rambus XDR memory. The memory interface controller (MIC) is separate from the XIO and is designed by IBM. The XIO-XDR link runs at 3.2 Gbit/s per pin. Two 32-bit channels can provide a theoretical maximum of 25.6 GB/s.
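The 25.6 GB/s figure is the per-pin signalling rate multiplied by the total number of data pins and converted from bits to bytes. A short sketch of that calculation, with a hypothetical function name:

```python
def xdr_bandwidth_gbytes(gbit_per_pin: float = 3.2,
                         channels: int = 2,
                         bits_per_channel: int = 32) -> float:
    """Theoretical peak memory bandwidth in GB/s."""
    data_pins = channels * bits_per_channel   # 64 data pins in total
    return gbit_per_pin * data_pins / 8       # 8 bits per byte


print(f"{xdr_bandwidth_gbytes():.1f} GB/s")
```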

Attached to the Bus Interface Controller (BIC) is the system interface, also a Rambus design, known as FlexIO. The FlexIO interface is organized into 12 lanes, each lane being a unidirectional 8-bit wide point-to-point path. Five of the lanes are inbound to Cell, while the remaining seven are outbound. This provides a theoretical peak bandwidth of 62.4 GB/s (36.4 GB/s outbound, 26 GB/s inbound) at 2.6 GHz. The FlexIO interface can be clocked independently, typically at 3.2 GHz. Four inbound and four outbound lanes support memory coherency.
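The quoted FlexIO figures work out to 5.2 GB/s per lane at 2.6 GHz. The sketch below reproduces them under one loudly labelled assumption: that each 8-bit lane moves two bytes per clock, for example by transferring on both clock edges. The text does not state this; the factor is inferred only because it makes the per-lane numbers agree with the totals given.

```python
def flexio_gbytes(lanes: int, clock_ghz: float = 2.6,
                  bytes_per_lane: int = 1,
                  transfers_per_clock: int = 2) -> float:
    """Peak bandwidth in GB/s for a group of FlexIO lanes.

    transfers_per_clock=2 is an assumption inferred from the quoted totals.
    """
    return lanes * bytes_per_lane * transfers_per_clock * clock_ghz


outbound = flexio_gbytes(7)   # seven outbound lanes
inbound = flexio_gbytes(5)    # five inbound lanes
print(f"outbound {outbound:.1f} GB/s, inbound {inbound:.1f} GB/s, "
      f"total {outbound + inbound:.1f} GB/s")
```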

Two Cell processors can be connected via the BIC to form one coherent Cell domain. This allows future implementations to pair two chips in one multi-chip module, or to pair two processors on the same die, making a combined Cell processor with 2 PPEs and 16 SPUs on the same chip.

Implementations
The Cell architecture has spawned a few implementations, and there are some on the drawing board.

Cell/B.E.
The original design, comprising 1 PPE and 8 SPUs. Manufactured by IBM and Toshiba on a 90 nm SOI process, 241 million transistors, 235 mm² large (PPE: 27 mm²; SPE: 15 mm²), >200 GFLOPS (SP) and >20 GFLOPS (DP). 3.2-4.6 GHz. This processor is used in the PlayStation 3, with one SPU disabled to increase production yield.

Cell/B.E. (65 nm)
Updated version, similar to the original design, but manufactured on a 65 nm fabrication process, reducing its size to 175 mm² (PPE: 20 mm²; SPE: 11 mm²). Minor revisions to the floorplan, such as moving a few components and eFUSEs, reduce the transistor count to 239 million. This processor is used in late models of PlayStation 3, but still with one SPU disabled.

Cell/B.E. (45 nm)
Yet another update, similar to the Cell/BE+, slated for 2009. It will be manufactured on a 45 nm process, reducing its size to 115 mm² (PPE: 11 mm²; SPE: 6.5 mm²).

PowerXCell 8i
IBM enhanced the Cell/B.E.+ design by improving its floating point capabilities. This processor is used in the BladeCenter QS22 module powering the Roadrunner supercomputer.

New eDP (enhanced double precision) SPEs add fully pipelined double precision functionality, effectively multiplying performance by a factor of four and reaching over 100 GFLOPS of double precision performance for the whole chip at 3.2 GHz. Instead of the expensive Rambus XDR memory system, the cheaper and more common DDR2 is used, and the chip has two 800 MHz 144-bit DDR2 controllers on die, supporting up to 32 GB RAM.

The die is 212 mm², manufactured at a 65 nm SOI fabrication process, 250 million transistors, dissipating up to 92 W. The processor is packaged on a 47.5×47.5 mm plastic ball grid array (PBGA) with 1827 balls.

PowerXCell 32iv
This is a proposed future product targeted at a 2010 timeframe. It will use four enhanced PPE cores, 32 enhanced SPEs and a next generation memory technology. Estimated performance will exceed 1 TFLOPS per chip. It has been suggested that the PPE will be based on POWER7.

mini Cell
Toshiba has shown on roadmaps its intention to build a cheaper, lower-performance mid-class chip with one PPE and only four SPEs. This "mini Cell" will be fabricated on a cheap bulk 65 nm CMOS process in a 2008 time frame. It will eventually be shrunk to 45 nm, reducing size and power requirements further.

micro Cell
Toshiba even envisions a "micro Cell" with just one SPU for extremely power-constrained applications like mobile devices. Micro Cell might be manufactured on a 45 nm bulk CMOS process in a 2010 timeframe.

SpursEngine
Toshiba has developed a co-processor powered by four SPEs, but no PPE, called the SpursEngine designed to accelerate 3D and movie effects in consumer electronics.

Blade Server

 * IBM Blade Center QS20 – 2x 3.2 GHz Cell/B.E. processors, 1 GB XDRAM. 40 GB hard drive. 2 units wide.
 * IBM Blade Center QS21 – 2x 3.2 GHz Cell/B.E. processors, 2 GB XDRAM. 1 unit wide.
 * IBM Blade Center QS22 – 2x 3.2 GHz PowerXCell 8i, up to 32 GB DDR2 RAM, x16 PCIe. 1 unit wide.


 * Mercury Dual Cell-Based Blade 2 Systems – 2x 3.2 GHz Cell/B.E. processors, 2 GB XDRAM. 1 unit wide.

Video game consoles
Sony's PlayStation 3 video game console contains the first production application of the Cell processor, clocked at 3.2 GHz and containing seven out of eight operational SPEs, to allow Sony to increase the yield on the processor manufacture. Only six of the seven SPEs are accessible to developers as one is reserved by the OS. In late 2007 Sony upgraded the CPU to the 65 nm version to reduce cost and power requirements. They still have one SPU disabled though.

Accelerators

 * Mercury Cell Accelerator Board is an accelerator card connected to a workstation by a double wide x16 PCIe slot providing 180 Gflops single precision floating point performance. It contains a 90 nm 2.8 GHz Cell/B.E. processor and 1 GB XDRAM and 4 GB DDR2 local store. It communicates with the host system and network through a Gbit Ethernet port. It was released in 2006.
 * Fixstars GigaAccel 180 is an accelerator card connected to a workstation by a double wide x16 PCIe slot providing 180 Gflops single precision floating point performance. It contains a 65 nm 2.8 GHz PowerXCell 8i processor and 4 GB XDRAM. It runs Linux and communicates with the host system through double Gbit Ethernet ports. It's manufactured by IBM in Japan but will be released to OEMs. GigaAccel 180 will be available in fall of 2008.
 * Toshiba's SpursEngine is a co-processor powered by four SPEs, but no PPE, designed to accelerate 3D and movie effects in consumer electronics. The SpursEngine can be fitted to PCIe accelerator cards or mounted directly on motherboards. It became available in 2008.

Home cinema
Reportedly, Toshiba is considering producing HDTVs using Cell. They have already presented a system to decode 48 standard definition MPEG-2 streams simultaneously on a 1920×1080 screen. This can enable a viewer to choose a channel based on dozens of thumbnail videos displayed simultaneously on the screen.

Supercomputing
IBM's supercomputer, IBM Roadrunner, is a hybrid of general-purpose CISC (AMD Opteron) processors and Cell processors. In 2008, it was the first computer to run at petaflops speeds. It uses the PowerXCell 8i variant, manufactured using 65 nm technology, with enhanced SPUs that can handle double precision calculations, reaching 100 Gflops in double precision.

Cluster computing
Clusters of PlayStation 3 consoles are an attractive and cheap alternative to high-end systems based on Cell blades. The Innovative Computing Laboratory, a group led by Jack Dongarra in the Computer Science Department at the University of Tennessee, investigated such an application in depth. Terra Soft Solutions is selling 8-node and 32-node PS3 clusters with Yellow Dog Linux pre-installed, an implementation of Dongarra's research. As reported by Wired Magazine on October 17, 2007, astrophysicist Dr. Gaurav Khanna replaced time used on supercomputers with a cluster of eight PlayStation 3s. In 2007, the Computational Biochemistry and Biophysics lab at the Universitat Pompeu Fabra in Barcelona deployed a BOINC system called PS3GRID for collaborative computing based on the CellMD software, the first one designed specifically for the Cell processor.

Mainframes
IBM announced on April 25, 2007 that it would begin integrating its Cell Broadband Engine Architecture microprocessors into the company's line of mainframes, coining the term "Gameframe".

Software engineering
Due to the flexible nature of the Cell, there are several possibilities for the utilization of its resources, not limited to just different computing paradigms:

Job queue
The PPE maintains a job queue, schedules jobs in SPEs, and monitors progress. Each SPE runs a "mini kernel" whose role is to fetch a job, execute it, and synchronize with the PPE.
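The job queue model can be imitated on any machine with threads standing in for SPEs. The following is a minimal sketch under that assumption: it uses Python's standard `queue` and `threading` modules rather than any Cell SDK interface, and the names `spe_mini_kernel`, `run_jobs` and `NUM_SPES` are hypothetical.

```python
import queue
import threading

NUM_SPES = 6  # for example, the SPEs available to applications on PlayStation 3


def spe_mini_kernel(spe_id, jobs, results):
    """Fetch a job, execute it, report the result; stop on a None sentinel."""
    while True:
        job = jobs.get()
        if job is None:
            jobs.task_done()
            break
        job_id, func, arg = job
        results.put((job_id, spe_id, func(arg)))
        jobs.task_done()


def run_jobs(func, args):
    """Play the role of the PPE: queue jobs, start workers, collect results."""
    jobs = queue.Queue()
    results = queue.Queue()
    workers = [threading.Thread(target=spe_mini_kernel, args=(i, jobs, results))
               for i in range(NUM_SPES)]
    for worker in workers:
        worker.start()
    for job_id, arg in enumerate(args):
        jobs.put((job_id, func, arg))
    for _ in workers:
        jobs.put(None)           # one stop signal per worker
    jobs.join()                  # wait until every job has been processed
    for worker in workers:
        worker.join()
    ordered = [None] * len(args)
    while not results.empty():
        job_id, _spe_id, value = results.get()
        ordered[job_id] = value
    return ordered


print(run_jobs(lambda v: v * v, list(range(10))))
```

The controlling thread never executes a job itself; it only distributes work and monitors completion, which mirrors the division of labour described above.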

Self-multitasking of SPEs
The kernel and scheduling are distributed across the SPEs. Tasks are synchronized using mutexes or semaphores as in a conventional operating system. Ready-to-run tasks wait in a queue for an SPE to execute them. The SPEs use shared memory for all tasks in this configuration.
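In contrast to the job queue model, here no single controller hands out work: every worker runs the same scheduling loop against a shared, mutex-protected ready queue, and a running task may itself make further tasks ready. The sketch below illustrates this with threads in place of SPEs. The class and function names are hypothetical, and the polling loop is a simplification chosen for brevity.

```python
import threading
import time
from collections import deque


class SharedScheduler:
    """Ready queue shared by all workers; there is no central dispatcher."""

    def __init__(self, tasks):
        self.lock = threading.Lock()      # mutex guarding the shared state
        self.ready = deque(tasks)         # ready-to-run tasks
        self.pending = len(self.ready)    # tasks queued or still running

    def take(self):
        """Pop the next ready task, or return None if none is ready."""
        with self.lock:
            return self.ready.popleft() if self.ready else None

    def spawn(self, task):
        """Called by a running task to make another task ready."""
        with self.lock:
            self.ready.append(task)
            self.pending += 1

    def finish(self):
        """Record that one task has run to completion."""
        with self.lock:
            self.pending -= 1

    def idle(self):
        with self.lock:
            return self.pending == 0


def spe_worker(sched):
    """Every simulated SPE runs the same scheduling loop."""
    while not sched.idle():
        task = sched.take()
        if task is None:
            time.sleep(0)  # nothing ready yet; yield and retry
            continue
        task(sched)
        sched.finish()


def run_countdowns(starts, num_spes=4):
    """Each task adds its value to a shared total, then spawns value - 1."""
    total = [0]
    total_lock = threading.Lock()

    def countdown(value):
        def task(sched):
            with total_lock:
                total[0] += value
            if value > 1:
                sched.spawn(countdown(value - 1))
        return task

    sched = SharedScheduler([countdown(s) for s in starts])
    workers = [threading.Thread(target=spe_worker, args=(sched,))
               for _ in range(num_spes)]
    for worker in workers:
        worker.start()
    for worker in workers:
        worker.join()
    return total[0]


print(run_countdowns([3, 2]))
```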

Stream processing
Each SPE runs a distinct program. Data comes from an input stream, and is sent to SPEs. When an SPE has terminated the processing, the output data is sent to an output stream.

This provides a flexible and powerful architecture for stream processing, and allows explicit scheduling for each SPE separately. Other processors are also able to perform streaming tasks, but are limited by the kernel loaded.
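A stream processing arrangement of this kind can be sketched as a chain of stages, each running its own program and passing its output to the next. As before, threads and queues from the Python standard library stand in for SPEs and their data transfers; `spe_stage`, `run_pipeline` and the `END` marker are hypothetical names introduced for the example.

```python
import queue
import threading

END = object()  # marks the end of the stream


def spe_stage(program, inbox, outbox):
    """One stage: apply this stage's program to each item and pass it on."""
    while True:
        item = inbox.get()
        if item is END:
            outbox.put(END)   # propagate end-of-stream downstream
            break
        outbox.put(program(item))


def run_pipeline(programs, stream):
    """Chain one stage per program and push the input stream through."""
    links = [queue.Queue() for _ in range(len(programs) + 1)]
    stages = [threading.Thread(target=spe_stage,
                               args=(program, links[i], links[i + 1]))
              for i, program in enumerate(programs)]
    for stage in stages:
        stage.start()
    for item in stream:
        links[0].put(item)
    links[0].put(END)
    output = []
    while True:
        item = links[-1].get()
        if item is END:
            break
        output.append(item)
    for stage in stages:
        stage.join()
    return output


# Two stages with distinct programs: add one, then double.
print(run_pipeline([lambda v: v + 1, lambda v: v * 2], [1, 2, 3]))
```

Because each stage has exactly one worker and the links are first-in, first-out, items leave the pipeline in the order they entered it.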

Distributed computing
There is a BOINC distributed computing application, PS3GRID (http://www.ps3grid.net/PS3GRID/), that is entirely devoted to the kinds of biological computations that can only be successfully completed with microprocessors running in parallel.

Open source software development
An open source software-based strategy was adopted to accelerate the development of a Cell/BE ecosystem and to provide an environment to develop Cell applications. In 2005, patches enabling Cell support in the Linux kernel were submitted for inclusion by IBM developers. Arnd Bergmann (one of the developers of the aforementioned patches) also described the Linux-based Cell architecture at LinuxTag 2005.

Both PPE and SPEs are programmable in C/C++ using a common API provided by libraries.

Terra Soft Solutions provides Yellow Dog Linux for IBM and Mercury Cell-based systems, as well as for the PlayStation 3. Terra Soft strategically partnered with Mercury to provide a Linux Board Support Package for Cell, and support and development of software applications on various other Cell platforms, including the IBM BladeCenter JS21 and Cell QS20, and Mercury Cell-based solutions. Terra Soft also maintains the Y-HPC (High Performance Computing) Cluster Construction and Management Suite and Y-Bio gene sequencing tools. Y-Bio is built upon the RPM Linux standard for package management, and offers tools which help bioinformatics researchers conduct their work with greater efficiency. IBM has developed a pseudo-filesystem for Linux, named "Spufs", that simplifies access to and use of the SPE resources. IBM is currently maintaining a Linux kernel and GDB ports, while Sony maintains the GNU toolchain (GCC, binutils).

In November 2005, IBM released a "Cell Broadband Engine (CBE) Software Development Kit Version 1.0", consisting of a simulator and assorted tools, to its web site. Development versions of the latest kernel and tools for Fedora Core 4 are maintained at the Barcelona Supercomputing Center website.

In August 2007, Mercury Computer Systems released a Software Development Kit for PlayStation 3 for High-Performance Computing.

With the release of kernel version 2.6.16 on March 20, 2006, the Linux kernel officially supports the Cell processor.