User talk:Pedroguaraldi

 Cell Architecture Explained by Nicholas Blachford 



Initially designed for the PlayStation 3, Sony, Toshiba and IBM's new Cell processor promises seemingly obscene computing capabilities for what will rapidly become a very low price.

6 months ago I wrote an article describing this new processor based on the original Cell processor patent application from 2002. Since that original document was written the Cell design has evolved considerably in both hardware and software. The Cell was unveiled at ISSCC in February 2005 and since then a great deal of information has been revealed about the final architecture in various articles, papers and interviews.

This new version has been almost completely rewritten to cover the Cell as it is today. New sections have also been added to cover Cell software development and the reasons behind the choice of a relatively simple architecture.

In part 1 I look at how the Cell came about and its main components. Part 2 looks at the infrastructure components and the concept of stream processing. Part 3 covers the options for programming the Cell and the issues likely to be encountered. In part 4 I look at the design decisions in the Cell and at why the architecture is so simple compared to other contemporary microprocessors.

Background
The Cell concept was originally thought up by Sony Computer Entertainment Inc. of Japan, for the PlayStation 3. The genesis of the idea was in 1999 when Sony’s Ken Kutaragi [Kutaragi], “Father of the PlayStation”, was thinking about a computer which acted like cells in a biological system. A patent was applied for in 2002, listing Masakazu Suzuoki and Takeshi Yamazaki as the inventors (the first version of this article covered this patent [Patent]).

The architecture as it exists today was the work of three companies: Sony, Toshiba and IBM. Sony and Toshiba previously co-operated on the PlayStation 2 but this time the plan was more ambitious and went beyond chips for video games consoles. The aim was to build a new general purpose processor for a computer. With that in mind IBM was brought in as their expertise is in computer design.

“ Though sold as a game console, what will in fact enter the home is a Cell-based computer. ” - Ken Kutaragi

IBM also brought its chip design expertise and in this case used a very aggressive approach by producing a fully custom design - the chip’s circuitry was designed by hand instead of with automated tools, an approach very few other companies use. IBM also has the industry's leading silicon process which will be used in the manufacturing. Sony and Toshiba bring mass manufacturing capabilities and knowledge. Each of the three companies produces different products and these have different needs of a CPU. Consumer electronics requires very power efficient systems, reliability and predictability. Computer systems on the other hand (sometimes) have multiple processors, and need to be compatible across different generations. The final Cell design incorporates features to satisfy all these needs.

To turn the ideas into a real product the companies officially partnered in 2000 and set up a design centre in Austin, Texas in March 2001 with engineers from each of the three companies. Development was done in 10 centres around the globe by some 400 people.

The amount of money subsequently spent on this project is vast: two 65nm chip fabrication facilities are being built at billions each, and Sony has paid IBM hundreds of millions to set up a production line in East Fishkill, New York. Then there's a few hundred million on development - all before a single chip rolls off the production lines.

Although it’s been primarily touted as the technology for the PlayStation 3, Cell is designed for much more. Sony and Toshiba, both being major electronics manufacturers, buy in all manner of different components. One of the reasons for Cell's development is that they want to save costs by building their own components. Next generation consumer technologies such as Blu-ray, HDTV, HD Camcorders and of course the PS3 will all require a very high level of computing power and they are going to need the chips to provide it. Cell will be used for all of these and more, and IBM will also be using the chips in servers. The partners can also sell the chips to 3rd party manufacturers [3rd party].

The Cell architecture is like nothing we have ever seen in commodity microprocessors; it is closer in design to multiprocessor vector supercomputers. The Cell developers have taken this kind of technology and for the first time are bringing it to your home. The aim is to produce a low cost system with a massive increase in compute performance over existing systems. Putting such an architecture on a single chip is a huge, complex project, and no other manufacturer appears to have even attempted anything this ambitious to date.

So, what is Cell Architecture?
Cell is an architecture for high performance distributed computing. It is composed of hardware and software Cells. Software Cells consist of data and programs (known as jobs or apulets); these are sent out to the hardware Cells where they are computed, and the results are then returned.

This architecture is not fixed: if you have a computer, PS3 and HDTV which have Cell processors they can co-operate on problems. They've been talking about this sort of thing for years of course but the Cell is actually designed to do it. I for one quite like the idea of watching "Contact" on my TV while a PS3 sits in the background churning through SETI@home. Actually this is rather unlikely as I’m not a gamer, but you get the picture...

According to IBM the Cell performs 10x faster than existing CPUs on many applications. This may sound ludicrous but GPUs (Graphics Processing Units) already deliver similar or even higher sustained performance in many non-graphical applications [GPU10]. The technology in the Cell is similar to that in GPUs so such high performance is certainly well within the realm of possibility. The big difference though is that Cell is a lot more general purpose so can be used for a wider variety of tasks.

The Cell architecture can go further though, there's no reason why your system can't distribute software Cells over a network or even all over the world. The Cell is designed to fit into everything from (eventually) PDAs up to servers so you can make an ad-hoc Cell computer out of completely different systems.

Scaling is just one capability of the Cell architecture but the individual systems are going to be potent enough on their own. An individual Cell is one hell of a powerful processor, with a theoretical computing capability of 256 GFLOPS (Billion Floating Point Operations per Second) [GFLOPS] at 4GHz. In the computing world quoted figures (bandwidth, processing, throughput) are often theoretical maximums and rarely if ever met in real life. Cell may be unusual in that given the right type of problem it may actually be able to get close to its maximum computational figure.

This isn’t by luck or fluke, it’s by design. The Cell’s hardware has been specifically designed to provide sufficient data to the computational elements to enable such performance. This is a rather different approach from the usual way which is to hide the slower parts of the system. All systems are limited by their slowest components [Amdahl's law], Cell was designed not to have any slow components!

Specifications
An individual hardware Cell is made up of a number of elements:


 * 1 Power Processor Element (PPE).
 * 8 Synergistic Processor Elements (SPEs).
 * Element Interconnect Bus (EIB).
 * Direct Memory Access Controller (DMAC).
 * 2 Rambus XDR memory controllers.
 * Rambus FlexIO (Input / Output) interface.



The final specifications haven't been given out yet but this is what we know so far:


 * Capable of running at speeds beyond 4 GHz.
 * Memory bandwidth: 25.6 GBytes per second.
 * I/O bandwidth: 76.8 GBytes per second.
 * 256 GFLOPS (Single precision at 4 GHz).
 * 256 GOPS (Integer at 4 GHz).
 * 25 GFLOPS (Double precision at 4 GHz).
 * 235 square mm.
 * 235 million transistors.

Power consumption has been estimated at 60 - 80 Watts at 4 GHz for the prototype but this could change in the production version.

Chip manufacturing is a complex process and the chips that appear at the end of the production line vary in capabilities and some have errors. While they can go higher, because of the vagaries of manufacturing, economics and heat dissipation the Cell which will be used in the PS3 is clocked at 3.2 GHz and will have only 7 SPEs. Cells with 6 SPEs will be used in consumer electronics.

The Power Processor Element (PPE)
The PPE is a conventional microprocessor core which sets up tasks for the SPEs to do. In a Cell based system the PPE will run the operating system and most of the applications but compute intensive parts of the OS and applications will be offloaded to the SPEs.

As an example let's say I was running an audio synthesiser application. The OS and most of the application would run on the PPE but the highly intensive audio generation and processing would be off-loaded to the SPEs.

The PPE is a 64 bit, "Power Architecture" processor with 512K cache. Power Architecture is a catch all term IBM have been using for a while to describe both PowerPC and POWER processors. This type of microprocessor is not used in PCs but compatible processors are found in Apple Macintosh systems. The PPE is capable of running POWER or PowerPC binaries.

While the PPE uses the PowerPC instruction set, it is not based on an existing design on the market today. That is to say, it is NOT based on the existing 970 / G5 or POWER processors. It is a completely different architecture so clock speed comparisons are completely meaningless.

The PPE is a dual issue, dual threaded, in-order processor. Unlike many modern processors the hardware architecture is an “old style” RISC design, i.e. the PPE has a relatively simple architecture. Most modern microprocessors devote a large amount of silicon to executing as many instructions as possible at once by executing them "out-of-order" (OOO). This type of design is widely used but it requires hefty amounts of additional circuitry and consumes large amounts of power. With the PPE, IBM have not done this and have instead gone with a much simpler design which uses considerably less power than other PowerPC devices - even at higher clock rates.

This design will however have the downside of potentially having rather erratic performance on branch laden applications. Such a simple CPU needs the compiler to do a lot of the scheduling work that hardware usually does so a good compiler will be essential. That said, the Cell's high bandwidth memory and I/O subsystems and the PPE's high clock speed and dual threading capability may well make up for these potential performance deficiencies.

Some of the technology in the PPE has been derived from IBM's high end POWER series of CPUs. Like POWER5 the PPE has the ability to run 2 threads simultaneously. When one thread is stalled and is waiting for data the second thread can issue instructions, keeping the instruction units busy. IBM's hypervisor technology [Hyper] is also used, allowing the Cell to run multiple operating systems simultaneously. According to IBM the Cell can run a normal OS alongside a real time OS with both functioning correctly.

Another interesting point about the PPE is that it includes support for the VMX vector instructions (also known as "AltiVec" or "Velocity Engine"). VMX can speed up anything from financial calculations to operating system functions though it (or its PC equivalents) doesn't appear to be that widely used currently. One company which does use VMX extensively is Apple, who use it to accelerate functions in OS X; it would not have been a huge job for Apple to utilise the PPE in the Cell.

A lesser known feature which appears to be present is memory tags required by some of IBM's "big iron" operating systems. I don’t know the purpose of these tags (they are optional in Power Architecture) but the Cell is said to be capable of running OS/400 [The400] and this requires them. It is not confirmed these are present but if so it looks like IBM could have some interesting plans for the Cell which involve rather more than gaming... IBM’s Unix variant, AIX, is also said to be running on Cell. (Note: none of this is confirmed).

The PPE is an interesting processor and it looks likely that similar cores will turn up in systems other than the Cell. The CPU cores used in the XBox360 while different, appear to be derived from the same original design.

As mentioned above the PPE has been simplified compared to other desktop processors, I discuss the reasons behind this and their implications in part 4.

A 4GHz PowerPC sounds like a pretty potent processor until you realise that the PPEs are really just used as controllers in the Cell - the real action is in the SPEs:

Synergistic Processor Elements (SPEs)
Each Cell contains 8 SPEs.

An SPE is a self contained vector processor which acts as an independent processor. Each contains 128 x 128 bit registers, along with 4 (single precision) floating point units capable of 32 GigaFLOPS* and 4 integer units capable of 32 GOPS (Billions of integer Operations per Second) at 4GHz. The SPEs also include a small 256 Kilobyte local store instead of a cache. According to IBM a single SPE (which is just 15 square millimetres and consumes less than 5 Watts at 4GHz) can perform as well as a top end (single core) desktop CPU given the right task.

* This is counting Multiply-Adds, which count as 2 operations, hence 4GHz x 4 x 2 = 32 GFLOPS.

32 X 8 SPEs = 256 GFLOPS
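The arithmetic above can be written out as a small sketch. The function names are made up for illustration; the numbers are the ones quoted in this article (4 floating point units per SPE, a multiply-add counted as 2 operations, 8 SPEs at 4GHz, and the 7 SPE / 3.2 GHz part planned for the PS3).

```python
def spe_peak_gflops(clock_ghz, fp_units=4, ops_per_multiply_add=2):
    """Theoretical single precision peak of one SPE, in GFLOPS."""
    return clock_ghz * fp_units * ops_per_multiply_add


def cell_peak_gflops(clock_ghz, spes=8):
    """Theoretical single precision peak of all SPEs in one Cell, in GFLOPS."""
    return spes * spe_peak_gflops(clock_ghz)


full_cell = cell_peak_gflops(4.0)          # 8 SPEs at 4 GHz
ps3_cell = cell_peak_gflops(3.2, spes=7)   # the PS3 configuration from the text

print(f"One SPE at 4 GHz:   {spe_peak_gflops(4.0):.1f} GFLOPS")
print(f"8 SPEs at 4 GHz:    {full_cell:.1f} GFLOPS")
print(f"7 SPEs at 3.2 GHz:  {ps3_cell:.1f} GFLOPS")
```

These are theoretical peaks only; they say nothing about what a real program will sustain.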

Like the PPE the SPEs are in-order processors and have no Out-Of-Order capabilities. This means that as with the PPE the compiler is very important. The SPEs do however have 128 registers and this gives plenty of room for the compiler to unroll loops and use other techniques which largely negate the need for OOO hardware.



Vector Processing
The SPEs are vector (or SIMD) [Vector] processors. That is, they do multiple operations simultaneously with a single instruction. Vector computing has been used in supercomputers since the 1970s (the Cray 1 was one of the first to use the technique) and modern CPUs have media accelerators (e.g. MMX, SSE, VMX / AltiVec) which work on the same principle. Each SPE is capable of 4 X 32 bit operations per cycle (8 if you count multiply-adds). In order to take full advantage of the SPEs, the programs running will need to be "vectorised". This can be done in many application areas such as video, audio, 3D graphics and scientific calculations, and can be used at least partially in many other areas. Some compilers can “autovectorise” code; this involves analysing code for sections which can utilise a vector processor and needs no involvement from the developer. This can deliver considerable performance improvements and as such is an area of active research and development; v4.0 of the open source GCC compiler includes some of this functionality [GCC].
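To make "vectorised" a little more concrete, here is a toy model in plain Python. It is only an illustration of the idea, not SPE code: `vector_multiply_add` stands in for a single 4 lane multiply-add instruction, and all the names are invented for this sketch.

```python
LANES = 4  # four 32 bit values fit in one 128 bit register


def vector_multiply_add(a, b, c):
    """Model of one SIMD instruction: a * b + c on all four lanes at once."""
    assert len(a) == len(b) == len(c) == LANES
    return [x * y + z for x, y, z in zip(a, b, c)]


def scale_and_offset(samples, gain, offset):
    """Vectorised loop: handle four samples per step instead of one."""
    if len(samples) % LANES:
        raise ValueError("pad the input to a multiple of 4 samples")
    gains = [gain] * LANES
    offsets = [offset] * LANES
    out = []
    for i in range(0, len(samples), LANES):
        out.extend(vector_multiply_add(samples[i:i + LANES], gains, offsets))
    return out


print(scale_and_offset([1, 2, 3, 4, 5, 6, 7, 8], gain=2, offset=1))
```

A scalar version of the same loop would take eight steps for eight samples; the vectorised version takes two. Rewriting loops into this shape is what an autovectorising compiler tries to do automatically.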

AltiVec?
The SPE's instruction set is similar to VMX/AltiVec but not identical. Some instructions have been removed and others added, the availability of 128 registers also makes a considerable difference in what is possible. Some changes are the addition of 64 bit floating point operations and program flow control operations as well as the removal of integer saturation rounding.

Another one of the differences is between the double and single precision capabilities. The double precision calculations are IEEE standard whereas the single precision are not. By not using IEEE standards the single precision calculations can be calculated faster, this feature appears to have been derived from the PS2 which did the same.

Despite these differences, according to IBM, by making some relatively minor changes and taking into account the SPE's local stores, software should compile to either an SPE or PowerPC (+ VMX) target. That said the binary code used is different so existing AltiVec binaries will not work.

Double Precision FLOPS
Double precision (64 bit) floating point data types are used when dealing with very large or small numbers or when you need to be very accurate. The first version of Cell supports these but the implementation shares the computation area of the single precision floating point units. Sharing these means the designers have saved a lot of room but there is a performance penalty: the first generation Cell can "only" do around 25 double precision GFLOPS at 4 GHz. The first generation however is designed for the PS3 where high double precision performance is not necessary. IBM have alluded to the possibility that a later generation will include full speed double precision floating point units, from which you can expect a very sizeable performance boost.

SPE Local Stores
One way in which SPEs operate differently from conventional CPUs is that they lack a cache and instead use a “Local Store”. This potentially makes them (slightly) harder to program but they have been designed this way to reduce hardware complexity and increase performance. That said, if you use vector units and take account of the cache when you program a conventional CPU, developing optimised code for the SPE may actually be easier as you don’t need to worry about cache behaviour.

Conventional Cache
Conventional CPUs perform all their operations in registers which are directly read from or written to main memory. Operating directly on main memory is hundreds of times slower than using registers so caches (a fast on chip memory of sorts) are used to hide the effects of going to or from main memory. Caches work by storing part of the memory the processor is working on. If you are working on a 1/2 MB piece of data it is likely only a small fraction of this (perhaps a couple of thousand bytes) will be present in cache. There are kinds of cache design which can store more or even all the data but these are not used as they are too expensive, too slow or both.

If data being worked on is not present in the cache, the CPU stalls and has to wait for this data to be fetched. This essentially halts the processor for hundreds of cycles. According to the manufacturers, it is estimated that even high end server CPUs such as POWER, Itanium and PA-RISC (all with very large, very fast caches) spend anything up to 80% of their time waiting for memory.

Dual-core CPUs will become common soon and desktop versions have a cache per core. If either of the cores or other system components try to access the same memory address, the data in the cache may become out of date and thus needs to be updated (made coherent). Supporting this requires logic and takes time, and this limits the speed at which a conventional system can access memory and cache. The more processors there are in a system the more complex this problem becomes. Cache design in conventional CPUs speeds up memory access but compromises are required to make it work.

SPE Local Stores - No Cache?
To solve the complexity associated with cache design and to increase performance the Cell designers took the radical approach of not including any. Instead they used a series of 256 Kbyte “local stores”; there are 8 of these, 1 per SPE. Local stores are like cache in that they are an on-chip memory but the way they are constructed and act is completely different. They are in effect a second-level register file.

The SPEs operate on registers which are read from or written to the local stores. The local stores can access main memory in blocks of 1Kb (1024 bits, i.e. 128 bytes) minimum (16Kb maximum) but the SPEs cannot act directly on main memory (they can only move data to or from the local stores).

By not using a caching mechanism the designers have removed the need for a lot of the complexity which goes along with a cache and made it faster in the process. There is also no coherency mechanism directly connected to the local store and this simplifies things further.

This may sound like an inflexible system which will be complex to program but it’ll most likely be handled by a compiler with manual control used if you need to optimise.

This system will deliver data to the SPE registers at a phenomenal rate. 16 bytes (128 bits) can be moved per cycle to or from the local store, giving 64 Gigabytes per second at 4GHz; interestingly this is precisely one register’s worth per cycle. Caches can deliver similar or even faster data rates but only in very short bursts (a couple of hundred cycles at best), whereas the local stores can each deliver data at this rate continually for over ten thousand cycles without going to RAM.

One potential problem is that of “contention”. Data needs to be moved to and from memory while data is also being transferred to or from the SPE’s registers, and this leads to contention where both systems will fight over access, slowing each other down. To get around this the external data transfers access the local memory 1024 bits at a time, in one cycle (equivalent to a transfer rate of 0.5 Terabytes per second!).

This is just moving data to and from buffers but moving so much in one go means that contention is kept to a minimum.
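The local store figures quoted above follow from a few multiplications, sketched here. The constant and function names are invented for this example; the inputs (4GHz clock, 256 Kilobyte local store, 16 bytes per cycle to the registers, 1024 bits per cycle for external transfers) are the article's.

```python
CLOCK_HZ = 4_000_000_000        # 4 GHz, the clock rate used throughout the text
LOCAL_STORE_BYTES = 256 * 1024  # one SPE local store


def gigabytes_per_second(bytes_per_cycle, clock_hz=CLOCK_HZ):
    """Sustained rate if the given number of bytes moves every cycle."""
    return bytes_per_cycle * clock_hz / 1e9


register_rate = gigabytes_per_second(16)         # one 128 bit register per cycle
external_rate = gigabytes_per_second(1024 // 8)  # 1024 bit external transfers

# Cycles needed to stream an entire local store through the registers.
cycles_per_full_store = LOCAL_STORE_BYTES // 16

print(f"Local store <-> registers: {register_rate:.0f} GB/s")
print(f"External <-> local store:  {external_rate:.0f} GB/s")
print(f"Cycles to read a full local store: {cycles_per_full_store}")
```

The last figure is where "over ten thousand cycles without going to RAM" comes from: a full local store holds 16384 register-sized pieces.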

In order to operate anything close to their peak rate the SPEs need to be fed with data and by using a local store based design the Cell designers have ensured there is plenty of it close by and it can be read quickly. By not requiring coherency in the Local Stores, the number of SPEs can be increased easily. Scaling will be much easier than in systems with conventional caches.

Local Store vs. Cache
To go back to the example of an audio processing application, audio is processed in small blocks so as to reduce any delay, as the human auditory system is highly sensitive to this. If the block of audio, the algorithm used and temporary blocks can fit into an SPE’s local store the block can be processed very, very fast as there are no memory accesses involved during processing and thus nothing to slow it down. Getting all the data into the cache in a conventional CPU will be difficult if not impossible due to the way caches work.

It is in applications like these that the Cell will perform at its best. The use of a local store architecture instead of a conventional cache ensures the data blocks can be hundreds or thousands of bytes long and they can all be guaranteed to be in the local store. This makes the Cell’s management of data fundamentally different from other CPUs.

The Cell has massive potential computing power. Other processors also have high potential processing capabilities but rarely achieve them. It is the ability of local stores to hold relatively large blocks of data that may allow Cells to get close to their maximum potential.

Local stores are not a new invention: the AT&T DSP that Commodore were planning on using in a never released Amiga in the early 1990s included a “visible cache”. They go back further though; a local store type arrangement was used in the 1985 Cray 2 supercomputer.

Locking Cache
The PPE as a more conventional design does not have a local store but does include a feature called a “locking cache”. This stops data in parts of the cache being overwritten allowing them to act like a series of small local stores. This is used for streaming data into and out of the PPE or holding data close that is regularly needed. If the locked part acted like a normal cache the data being held could get flushed out to main memory forcing the PPE to stall while it was being retrieved, this could cause performance to plummet in some instances (a memory read can take hundreds of cycles).
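The effect of locking can be illustrated with a toy cache model. This is a sketch of the general idea only, assuming a simple least-recently-used replacement policy; it does not describe how the PPE's cache is actually organised, and the class and method names are hypothetical.

```python
from collections import OrderedDict


class LockingCache:
    """Toy LRU cache in which some lines can be locked against eviction."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.lines = OrderedDict()  # address -> data, least recently used first
        self.locked = set()

    def lock(self, address, data):
        """Place data in the cache and protect it from being overwritten."""
        if address not in self.locked and len(self.locked) >= self.capacity - 1:
            raise ValueError("at least one line must stay unlocked")
        self._insert(address, data)
        self.locked.add(address)

    def access(self, address, load):
        """Return (data, hit). On a miss, call load(address) to fetch the data."""
        if address in self.lines:
            self.lines.move_to_end(address)
            return self.lines[address], True
        data = load(address)  # stands in for a slow trip to main memory
        self._insert(address, data)
        return data, False

    def _insert(self, address, data):
        if address not in self.lines and len(self.lines) >= self.capacity:
            # Evict the least recently used line that is not locked.
            victim = next(a for a in self.lines if a not in self.locked)
            del self.lines[victim]
        self.lines[address] = data
        self.lines.move_to_end(address)


cache = LockingCache(capacity=2)
cache.lock("table", "coefficients")
for addr in range(10):               # stream other data through the cache
    cache.access(addr, load=lambda a: a)
print(cache.access("table", load=lambda a: None))
```

Without the `lock` call the streamed data would push "table" out of this two line cache and the final access would be a miss; with it, the access is a hit however much data streams past.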

Locking caches are common in embedded processors, Intel’s XScale and the XBox360’s [Xbox360] CPUs include them as do modern G3 and G4s. They are not generally included in desktop processors due to their more general purpose nature. Using cache locking in a desktop environment could prove catastrophic for performance as applications working on data close to that locked would not be able to use the cache at all. It is possible to achieve similar results with clever programming tricks and this is a much better idea in a multitasking environment.

In Part 2...
SPEs can also be chained, that is they can be set up to process data in a stream using multiple SPEs in parallel. In this mode a Cell may approach its theoretical maximum processing speed of 256 GigaFlops.

In part 2 I shall look at this, the rest of the internals of the Cell and other aspects of the architecture.

Stream Processing
A big difference in Cells from normal CPUs is the ability of the SPEs in a Cell to be chained together to act as a stream processor [Stream]. A stream processor takes data and processes it in a series of steps.

A Cell processor can be set up to perform streaming operations in a sequence with one or more SPEs working on each step. In order to do stream processing an SPE reads data from an input into its local store, performs the processing step, then stores the result into its local store. The second SPE reads the output from the first SPE's local store, processes it and stores it in its own output area.

This sequence can use many SPEs and SPEs can access different blocks of memory depending on the application. If the computing power is not enough the SPEs in other Cells can also be used to form an even longer chain.
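The chaining described above can be modelled in a few lines. This is a conceptual sketch, not Cell code: the `SPE` class, the 32 bit sample size and the rule that a block plus its result must fit in the local store are all assumptions made for the example.

```python
LOCAL_STORE_BYTES = 256 * 1024  # size of one SPE local store, from the text
BYTES_PER_VALUE = 4             # assumption: 32 bit samples


class SPE:
    """Hypothetical stand-in for one SPE running a single processing step."""

    def __init__(self, name, step):
        self.name = name
        self.step = step        # function applied to every value in a block
        self.output_area = []   # result kept in this SPE's local store

    def process(self, block):
        # The input block and its result must both fit in the local store.
        if 2 * len(block) * BYTES_PER_VALUE > LOCAL_STORE_BYTES:
            raise ValueError(f"block too large for {self.name}'s local store")
        self.output_area = [self.step(value) for value in block]
        return self.output_area


def run_chain(spes, blocks):
    """Pass each block through every SPE in order and yield the final result.

    This loop is sequential; on real hardware each SPE would be busy with a
    different block at the same moment.
    """
    for block in blocks:
        data = block
        for spe in spes:
            data = spe.process(data)  # next SPE reads the previous output area
        yield list(data)


chain = [SPE("spe0", lambda x: x * 2), SPE("spe1", lambda x: x + 1)]
results = list(run_chain(chain, [[1, 2], [3]]))
print(results)
```

Adding a step to the stream is just a matter of appending another `SPE` to `chain`, which mirrors the way a longer chain can be built from more SPEs (or more Cells).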

Stream processing does not generally require large memory bandwidth but Cell has it anyway and on top of this the internal interconnect system allows multiple communication streams between SPEs simultaneously so they don’t hold each other up.



So you think your PC is fast...
It is when the SPEs are working on compute heavy streaming applications that the Cell will be working hardest. It's in these applications that the Cell may get close to its theoretical maximum performance and perform an order of magnitude more calculations per second than any desktop processor currently available.

On the other hand if the stream uses large amounts of bandwidth and the data blocks can fit into the local stores the performance difference might actually be bigger. Even if conventional CPUs are capable of processing the data at the same rate, the transfers between the CPUs will be held up while they wait for chip to chip transfers. The Cell’s internal interconnect system allows transfers running into hundreds of Gigabytes per second, whereas chip to chip interconnects allow transfers in the low tens of Gigabytes per second.

While conventional processors have vector units on board (SSE or VMX / AltiVec) they are not dedicated vector processors. The vector processing capability is an add-on to the existing instruction sets and has to share the CPU's resources. The SPEs are dedicated high speed vector processors and with their own memory don't need to share anything other than the main memory (and not even this much if the data can fit in the local stores). Add to this the fact there are 8 of them and you can see why their potential computational capacity is so large.

Such a large performance difference may sound completely ludicrous but it's not without precedent, in fact if you own a reasonably modern graphics card your existing system is already capable of similar processing feats: "For example, the Nvidia GeForce 6800 Ultra, recently released, has been observed to reach 40 GFlops in fragment processing. In comparison, the theoretical peak performance of the Intel 3GHz Pentium4 using SSE instructions is only 6GFlops." [GPU]

“GPUs are >10x faster than CPU for appropriate problems” [GPU10]

The 3D Graphics chips in computers have long been capable of very much higher performance than general purpose CPUs. Previously they were restricted to 3D graphics processing but since the addition of vertex and pixel shaders people have been using them for more general purpose tasks [GPGPU], this has not been without some difficulties but Shader 4.0 parts are expected to be even more general purpose than before.

Existing GPUs can already provide massive processing power when programmed properly but this is not exactly an easy task. The difference with the Cell is it will be cheaper, considerably easier to program and will be useable for a much wider class of problems.

The EIB and DMAC
The original patent application describes a DMAC (Direct Memory Access Controller) which controlled memory access for the SPEs/PPE (then known as the APUs and the PU) and connected everything with a 1024 bit bus. The original DMAC design also included the memory protection system.

The DMAC has changed considerably as the system evolved from the original design into the final product. The final version is quite different from the original but many parts still exist in one way or another. A DMAC still exists and controls memory access but it no longer controls memory protection. The 1024 bit interconnect bus was also replaced with a series of smaller ring busses called the EIB (Element Interconnect Bus).

The EIB consists of 4 x 16 byte rings which run at half the CPU clock speed and can each allow up to 3 simultaneous transfers. The theoretical peak of the EIB is 96 bytes per cycle (384 Gigabytes per second at 4GHz); however, according to IBM only about two thirds of this is likely to be achieved in practice.
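The EIB figures can be reproduced as follows. The sketch reads the "3 simultaneous transfers" as a per-ring figure, since that is the reading which gives the quoted 96 bytes per cycle; the function name and parameters are invented for illustration.

```python
def eib_peak_bytes_per_cpu_cycle(rings=4, ring_width_bytes=16,
                                 transfers_per_ring=3, bus_clock_divisor=2):
    """Peak bytes moved per CPU cycle, with the bus at half the CPU clock."""
    per_bus_cycle = rings * ring_width_bytes * transfers_per_ring
    return per_bus_cycle // bus_clock_divisor


CLOCK_GHZ = 4  # CPU clock used for the headline figures in the text

peak_bytes_per_cycle = eib_peak_bytes_per_cpu_cycle()
peak_gb_per_second = peak_bytes_per_cycle * CLOCK_GHZ
practical_gb_per_second = peak_gb_per_second * 2 / 3  # IBM's "about two thirds"

print(f"Peak: {peak_bytes_per_cycle} bytes/cycle = {peak_gb_per_second} GB/s")
print(f"Roughly achievable: {practical_gb_per_second:.0f} GB/s")
```

Even the two thirds figure works out to around 256 Gigabytes per second, which is why the interconnect is described here as running into hundreds of Gigabytes per second.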

The original 1024 bit bus mentioned in the patent no longer connects the system together but it still exists in between the I/O buffer and the local stores. The memory protection was replaced completely by MMUs and moved into the SPEs.

It's clear to me that the DMAC and EIB combination is one of the most important parts of the Cell design, it doesn't do processing itself but has to contend with potentially hundreds of Gigabytes of data flowing through it at any one time to many different destinations as well as handling a queue of some 128 memory requests.

Memory and I/O
All the internal processing units need to be fed so a high speed memory and I/O system is an absolute necessity. For this purpose Sony and Toshiba licensed the high speed "Yellowstone" and "Redwood" technologies from Rambus [Rambus]; these are used in the XDR RAM and FlexIO. Both of these are interesting technologies not only for their raw speed but also because they have been designed to simplify board layouts. Engineers spend a lot of time making sure wires on motherboards are all exactly the same length so signals are synchronised. FlexIO and XDR RAM both have a technology called "FlexPhase" [FlexPhase] which allows signals to come in at different times, reducing the need for the wires to be exactly the same length. This will make life considerably easier for board designers working with the Cell.

As with everything else in the Cell architecture the memory system is designed for raw speed, it will have both low latency and very high bandwidth. As mentioned previously the SPEs access memory in blocks of 128 bytes. It’s not clear how the PPE accesses memory but 128 bytes happens to be a common cache line size so it may do the same.

The Cell will use high speed XDR RAM for memory. A Cell has a memory bandwidth of 25.6 Gigabytes per second which is considerably higher than any PC but necessary as the SPEs will eat as much memory bandwidth as they can get. Even given this the buses are not large (72 data pins in total); this is important as it keeps chip manufacturing costs down. The Cell runs its memory interface at 3.2 Gigabits per second per pin though memory in production now is already capable of higher speeds than this. XDR is designed to scale to 6.4 Gigabits per second so memory bandwidth has the potential to double.

Memory Capacity
The total memory that can be attached is variable as the XDR interface is configurable; Rambus’ site shows how 1GB can be connected [Xdimm]. Theoretically an individual Cell can be attached to many Gigabytes of memory depending on the density of the RAM chips in use, and this may involve using one pin per physical memory chip which reduces bandwidth.

SPEs may need to access memory from different Cells especially if a long stream is set up, thus the Cells also include a high speed interconnect. This consists of a set of 12 x 8 bit busses which run at 6.4 Gigabit/second per wire (76.8 Gigabytes per second total). The busses are directional with 7 going out and 5 going in.
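The interconnect figures follow directly from the numbers quoted above. This is a minimal sketch of that arithmetic; the constant and function names are invented for the example.

```python
BUSSES = 12             # number of byte-wide busses
BITS_PER_BUS = 8
GBITS_PER_WIRE = 6.4    # signalling rate per wire
OUT_BUSSES, IN_BUSSES = 7, 5

def busses_to_gbytes(busses):
    """Aggregate bandwidth in Gigabytes per second for a number of busses."""
    return busses * BITS_PER_BUS * GBITS_PER_WIRE / 8

total = busses_to_gbytes(BUSSES)
outbound = busses_to_gbytes(OUT_BUSSES)
inbound = busses_to_gbytes(IN_BUSSES)

print(f"total {total:.1f} GB/s, out {outbound:.1f} GB/s, in {inbound:.1f} GB/s")
```

The total agrees with the 76.8 Gigabytes per second figure, split unevenly between the outgoing and incoming directions.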

The current system allows 2 Cells to be connected glue-less (i.e. without additional chips). Connecting more Cells requires an additional chip. This is different from the patent as it specified 4 Cells could be directly connected and a further 4 could be connected via a switch.

IBM have announced a blade system made up of a series of dual Cell “workstations”. The system is rated at up to 16 TeraFlops, which will require 64 Cells.

Memory Management Units
Memory management is used to stop programs interfering with each other and to move data to disc when it is not in use.

The original Cell patent had two mechanisms for memory protection one of which was very simple and very fast while the second was very complex and slow. These have been replaced by a more conventional system used when accessing main memory.

While there are no protection or coherency mechanisms used within the local stores (again for simplicity and speed), the PPE and SPEs do contain Memory Management Units (MMUs) used when accessing memory or other SPEs’ local stores.

These act just like a conventional multiprocessor system and as such are much more sophisticated than the method described in the original patent. Multiprocessor systems are an area in which IBM are long experienced so this change appears to have come from them. While this may be slower than the fast system in the patent the difference in reality is likely to be insignificant.

Processing Concrete
The Cell architecture goes against the grain in many areas but in one area it has gone in the complete opposite direction to the rest of the technology industry.

Operating systems started as a rudimentary way for programs to talk to hardware without developers having to write their own drivers every time. As time went on operating systems evolved and took on a wide variety of complex tasks; one way they have done this is by abstracting more and more away from the hardware.

Object oriented programming goes further and abstracts individual parts of programs away from each other. This has evolved into Java like technologies which provide their own environment thus abstracting the application away from the individual operating system, many other languages including C#, Perl and Python do the same. Web technologies do the same thing, the platform which is serving you with this page is completely irrelevant, as is the platform viewing it. When writing this I did not have to make a Windows or Mac specific version of the HTML, the underlying hardware, OSs and web browsers are completely abstracted away.

If there is a law in computing, abstraction is it, it is an essential piece of today's computing technology, much of what we do would not be possible without it.

Cell however, has gone against the grain and actually removed a level of abstraction. The programming model for the Cell will be concrete, when you program an SPE you will be programming what is in the SPE itself, not some abstraction. You will be "hitting the hardware" so to speak. The SPE's programming model will include 256K of local store and 128 registers, the SPE itself will include 128 registers and 256K of local store, no less, no more.

In modern x86 or PowerPC/POWER designs the number of registers present is different from the programming model, in the Transmeta CPUs it’s not only the registers, the entire internal architecture is different from what you are programming.

While this may sound like sacrilege and there are reasons why it is a bad idea in general there is one big advantage: Performance. Every abstraction layer you add adds computations and not by some small measure, an abstraction can decrease performance by a factor of ten. Consider that in any modern system there are multiple abstraction layers on top of one another and you'll begin to see why a 50MHz 486 may have seemed fast years ago but runs like a small dog these days, you need a more modern processor to deal with the subsequently added abstractions.

The big disadvantage of removing abstractions is that it will add complexity for the developer and it limits how much the hardware designers can change the system. The latter has always been important and is essentially THE reason for abstraction but, as you may have noticed, modern processors haven't really changed much in years. AMD64 is a big improvement over the Athlon but the overall architecture is actually very similar, and Intel’s Pentium-M traces its lineage right the way back to the Pentium Pro. The new dual core devices from AMD, Intel and IBM have not changed the internal architecture at all.

The Cell designers obviously don't expect their architecture to change much either so have chosen to set it in stone from the beginning. That said there is some flexibility in the system so it can still evolve over time.

The Cell approach does give some of the benefits of abstraction though. Java has achieved cross platform compatibility by abstracting the OS and hardware away, it provides a "virtual machine" which is the same across all platforms, the underlying hardware and OS can change but the virtual machine does not.

Cell does this but in a completely different way. Java provides a software based "virtual machine" which is the same on all platforms, Cell provides a machine as well - but it does it in hardware, the equivalent of Java's virtual machine is the Cell's physical hardware. If I were to write code for SPEs on OS X the exact same Cell code would run on Windows, Linux or Zeta because in all cases it is the hardware Cells which execute it.

This does not however mean you have to program the Cells in assembly, Cells have compilers just like everything else. Java provides a “machine” but you don't program it directly either.

By actually providing 128 real registers (32 for the PPE) and 256K of local store they have made life more difficult for compiler writers but in some respects they’ve also made it easier, it’ll be a lot easier to figure out what’s going on and thus optimise for. There’s no need to try and figure out what will or will not be put in rename registers since these don’t exist.

Hard Real Time Processing
Some stream processing needs to be timed exactly and this has also been considered in the design to allow "hard" real time data processing. An "absolute timer" is used to ensure a processing operation falls within a specified time limit. This is useful on its own but also ensures compatibility with faster next generation Cells since the timer is independent of the processing itself.

Hard real time processing is usually controlled by specialist operating systems such as QNX which are specially designed for it. Cell's hardware support for it means pretty much any OS will be able to support it to some degree or another. This will not however magically provide everything an RT OS provides (they are designed for bomb proof reliability) so things like QNX won’t be going away anytime soon from safety critical areas.

An interesting aspect of the Cell is that it includes IBM's virtualisation technology [Hyper], which allows multiple operating systems to run simultaneously. This ability can be combined with the real time capability allowing a real time OS to run alongside a non-real time OS. Such an ability could be used in industrial equipment allowing a real time OS to perform the critical functions while a secondary non real time OS can act as a monitoring, display and control system. This could lead to all sorts of interesting possibilities of hybrid operating systems - how about Linux doing the GUI and I/O while QNX handles the critical stuff?

To DRM or not to DRM?
Some will no doubt be turned off by the fact that DRM (Digital Rights Management) is said to be built into the Cell hardware. Sony is a media company and like the rest of the industry that arm of the company are no doubt pushing for DRM type solutions. It must also be noted that the Cell is destined for HDTV and BluRay / HD-DVD systems. Like it or not, all high definition recorded content is going to be very strictly controlled by DRM so Sony have to add this capability otherwise they would be effectively locking themselves out of a large chunk of their target market. Hardware DRM is no magic bullet however, hardware systems have been broken before - including Set Top Boxes and even IBM's crypto hardware for their mainframes.

But is it really DRM?
While the system is taken to mean DRM I don't believe there is a specific scheme designed in, rather I think it may be more accurate to say the system has "hardware security" facilities. Essentially there’s a protection mechanism which can be used for DRM, but you could equally use it to protect your on-line banking.

The system enables an SPE to lock most of its local store for its own use only. When this is done nothing else can access that memory. In this case nothing really means nothing: neither the OS nor even the Hypervisor can access this memory while it is locked. Only a small portion of the store remains unlocked so the SPE can still communicate with the rest of the system.

To give an example of how this could be useful, it could be used to address a perceived security weakness of the old WAP (Wireless Application Protocol) system - think internet for mobile phones (note: I don't know current versions of the system so I'm referring to what they did in early versions). Originally the phones didn't have enough computing power to use the web's encryption system so a different system was used. Unfortunately this meant that the data would have to be decrypted from the web system then re-encrypted into the new WAP system, which poses an immediate security problem. If a hacker could get into the system and gain root access they may be able to monitor the data when it was decrypted.

Cell could solve such a problem as the decryption and re-encryption could be done in a protected memory segment. Even if a hacker could get into the system getting root access would not help, even root cannot access that memory. Even if they were to get into hypervisor space they still couldn’t access the data, nothing gets in when it’s protected.

I doubt this system is still used by WAP gateways but if it is I’d expect WAP gateway manufacturers to suddenly become interested in Cell processors...

No details of how the system works have been released, I doubt if any system can be made un-hackable, but it doesn't seem like a direct attack would work in this case.

Other Options And The Future
There are plans for future technology in the Cell architecture. Optical interconnects appear to be planned, it's doubtful that this will appear anytime in the near future but clearly the designers are planning for the day when copper wires hit their limit (thought to be around 10GHz). Optical connections are quite rare and expensive at the moment but they look like they will become more common soon [Opti].

The design of Cells is not entirely set in stone, there can be variable numbers of SPEs and the SPEs themselves can include more floating point or integer calculation units. In some cases SPEs can be removed and other things such as I/O units or graphics processors put in their place. Nvidia are providing the graphics hardware for the PS3 so this may be done within a modified Cell at some point.

As Moore's law moves forward and we get yet more transistors per chip I've no doubt the designers will take advantage of this. The idea of having 4 Cells per chip is mentioned in the patent but this isn’t likely to happen for 3-4 years yet as they are too big to fit on a single reasonably priced chip right now.

The first version of the Cell has high double precision floating point performance but it could be higher and I expect a future version to boost it significantly, possibly by as much as 5x.

In Part 3...
Just how difficult is it going to be to code for this thing? Not as bad as it appears.

In part 3 I look at the different options for programming the Cell.

Developing for the Cell
While developing for the Cell may sound like it could be an exquisite form of torture, fortunately this is not going to be the case. If you can understand multithreading, cache management and SSE / VMX / AltiVec [AltiVec/VMX] development it looks like you will have no problems with the Cell.

The primary language for developing on the Cell is expected to be C with normal thread synchronisation techniques used for controlling the execution on the different cores. C++ is also supported to a degree and other languages are also in development (including apparently, Fortran).

Various systems are in development for controlling execution on the Cell so developers should have plenty of options, this compares well with PS2 development which was primarily done in assembly and was highly restrictive.

Task distribution to the SPEs can be handled by the OS, middleware, compiled into applications or if you're brave (mad?) enough you can drop into assembly and roll your own system.

One method described by Sony uses the SPEs to do programmer defined Jobs [Jobs]. Each job is put into a queue; when an SPE becomes free the next job in line is assigned to that SPE for execution. In this scenario job assignment is controlled by the PPE but other schemes have the SPEs running a tiny OS which allows them to assign themselves jobs. The SPEs are then completely autonomous and operate with no guidance from the PPE at all. The PPE just puts jobs into the job queue. Jobs are self contained mini-programs: the SPE loads up the job, DMAs in the data and gets computing.

In all the above cases the SPEs are dynamically assigned, the developer does not need to worry about which SPE is doing what.
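The self-assigning job queue described above can be sketched with ordinary threads standing in for SPEs. This is only an illustration of the scheduling idea, not Sony's implementation; `spe_worker`, `run_jobs` and the job tuple layout are hypothetical names invented for the example.

```python
import queue
import threading

def spe_worker(jobs, results, lock):
    """Stand-in for one SPE: take the next job whenever it is free."""
    while True:
        job = jobs.get()
        if job is None:              # sentinel: no more work for this worker
            break
        name, func, data = job
        output = func(data)          # a job is a self-contained mini-program
        with lock:
            results[name] = output

def run_jobs(job_list, num_workers=4):
    """Push jobs onto a shared queue and let idle workers pull them off."""
    jobs = queue.Queue()
    results = {}
    lock = threading.Lock()
    workers = [threading.Thread(target=spe_worker, args=(jobs, results, lock))
               for _ in range(num_workers)]
    for worker in workers:
        worker.start()
    for job in job_list:             # the "PPE" side only fills the queue
        jobs.put(job)
    for _ in workers:                # one sentinel per worker
        jobs.put(None)
    for worker in workers:
        worker.join()
    return results

demo_results = run_jobs([
    ("sum", sum, [1, 2, 3, 4]),
    ("max", max, [7, 3, 9]),
    ("sorted", sorted, [3, 1, 2]),
])
print(demo_results)
```

Note that the caller never says which worker runs which job, which is the point being made: assignment is dynamic and the number of workers can change without touching the jobs.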

SPEs can multitask like normal CPUs and have running tasks switch in and out but this is not an entirely good idea as the context switch time is likely to be pretty horrendous - the entire state of the SPE needs to be saved, this includes not just the contents of the registers but also the entire local store.
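A rough lower bound on that cost can be worked out from figures quoted in this article (128 registers of 128 bits, 256K of local store, 25.6 Gigabytes per second of memory bandwidth). The sketch assumes the whole state is streamed at peak bandwidth and ignores any other state an SPE may hold, so real numbers would be worse.

```python
REGISTERS = 128
REGISTER_BITS = 128
LOCAL_STORE_BYTES = 256 * 1024
MEMORY_BANDWIDTH = 25.6e9          # bytes per second, peak

register_bytes = REGISTERS * REGISTER_BITS // 8       # register file size
context_bytes = register_bytes + LOCAL_STORE_BYTES    # state to save per SPE

# Assumption: the transfer runs at full peak bandwidth with no contention.
save_seconds = context_bytes / MEMORY_BANDWIDTH
switch_seconds = 2 * save_seconds                     # save one task, load another

print(f"context size: {context_bytes} bytes")
print(f"save: {save_seconds * 1e6:.2f} us, full switch: {switch_seconds * 1e6:.2f} us")
```

Even under these generous assumptions a switch costs tens of microseconds, almost all of it due to the local store rather than the registers.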

Hello Tosh, gotta Toshiba?
Toshiba are developing software to run on Cells in their consumer goods. They have talked about a “custom” OS (actually Linux) in which tasks are divided into SPE and PPE "modules". Each SPE module is a sub-task which can operate using one or more SPEs depending on the compute power required, modules can also stream data to one another.

A complete task may consist of a number of modules which each do some processing then pass the results to the next module. Toshiba have talked about a digital TV system which uses a set of SPE modules to decode digital TV signals. To demonstrate they showed a single Cell (clock speed unknown) decoding 48 standard definition MPEG2 streams and scaling the results to a single HDTV screen. The demo is said to run without frame drops and one of the SPEs was left idle while all this was going on.
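The idea of modules which each do some processing and then stream results to the next module can be illustrated with a chain of generators. This is a toy sketch only: the stage names and the arithmetic inside them are placeholders and have nothing to do with real MPEG2 decoding or Toshiba's software.

```python
def decode(frames):
    """First module: placeholder 'decode' step applied to each frame."""
    for frame in frames:
        yield [value * 2 for value in frame]

def scale(frames, factor):
    """Second module: keep every Nth sample as a stand-in for downscaling."""
    for frame in frames:
        yield frame[::factor]

def compose(frames):
    """Final module: gather all scaled frames into one output 'screen'."""
    return [sample for frame in frames for sample in frame]

source = [[1, 2, 3, 4], [5, 6, 7, 8]]
screen = compose(scale(decode(source), 2))
print(screen)
```

Each stage only sees the output of the previous one, so stages could in principle be given more or fewer processing units without the others knowing.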

Such a demo sounds fairly useless to most people but TV stations and media megalomaniacs usually pay a lot of money to do exactly this sort of thing. That said, one of the demos at the PS3 launch show showed how this type of capability could be used on a TV to select which channel to watch.

Celling Penguins: Linux on Cell
Linux is already used in house by IBM as a “bring-up” platform for new hardware so unsurprisingly Linux was one of (if not the) first OS to get up and running on the Cell, in fact Linux support was added so early it was running on a simulated Cell before real hardware Cells had even been produced.

The fact Linux already runs on 64 bit PowerPC ISA based systems (PowerPC 970, POWER 4/5) is a big help as the PPE is compatible with these.

A release for Linux is expected at some point and full ISA details should be released then. Experimental patches have already been posted for the Linux kernel under the name BPA (Broadband Processor Architecture) [BPA]. Cell support is also being added to the GCC compiler and GDB debugger [Dev]. In the meantime if you are lucky enough to have an NDA with one of the companies you can do a programming course and get access to Cell simulators until the real thing starts shipping.

The system currently being implemented into Linux treats the SPEs as a virtual file system [CellLinux] and you write data and programs into them via standard “write” operations. There is also a Library interface which abstracts the interactions partially and this looks likely to be extended in the future.

Cell doesn’t appear to have any limitations regarding which Operating Systems can run, indeed the library abstraction has been designed to be portable. Other Operating Systems than Linux are said to be running already, pretty much nothing has been said about what these others are though.

If you want to run different OSs or want to run multiple copies of one (useful for stability and OS development) Cell also supports a hypervisor allowing you to run several Operating Systems at the same time. IBM has an open source hypervisor [Hyper] which was used in the validation of the Cell.

Converting Applications for Cell
As with any new architecture an application specifically designed for it will run best. However this is not going to be an option in many cases, especially existing large applications.

The first step in getting an application running on the Cell is to port it to the PowerPC ISA. Depending on the application this can be anything from pressing recompile to rewriting a whole heap of code. Once the application is on PowerPC it should run on the Cell’s PPE without problem.

The next stage is to find out what should run on the SPEs.

SPEs are best suited to small repetitive tasks so pretty much any application should be able to make use of them. However, profiling the code is important to find out what exactly gets used the most and this needs to be analysed to see if it is suitable for running on an SPE.

Large algorithms (>100KB compiled) or algorithms which jump randomly around memory accessing little pieces of data will likely be ill suited to running on the SPEs. Pseudo-random accesses are not a problem and can gain a large boost. Vectorisable and/or parallelisable algorithms are good bets for targeting the SPE.

Once the code to be moved has been identified, it needs to be partitioned away from the rest of the code so it is self-contained. Only once it is fully self contained can it be moved over to the SPE. If this code is already accessed as a plug-in or similar architecture this should be relatively easy. If not, making the code self contained could require rather more work.

The initial port to the SPE is generally done as scalar (i.e. non-vector) code in order to get it up and running. This will involve getting the synchronisation and communication code working.

Once it’s up and running the code can then be vectorised and the SIMD units used properly. This isn’t the final stage however as the computations and data flow need to be balanced to make the most efficient use of the SPEs. After this, other optimisations can then be investigated for inclusion.

There are numerous methodologies in development for Cell; one such development flow was presented at Power.org in Barcelona [CellDev].

Targeting SPEs
The SPEs will be considerably better at some things than others. You probably don’t want to run the OS on one, but it appears that if you were sufficiently persistent you could get them to do pretty much anything you want. I think the SPEs might be better described as “algorithm accelerators”.

The reason I say this is because while most code may be big and branchy the stuff which actually does the majority of the work is small repetitive loops, exactly the sort of thing that SPEs should be good at.

Don’t believe me? I’ll use my own desktop as an example:

I’m using a PowerBook running OS X Tiger, it’s currently running the Pages word processor, Safari (browser), Mail, Preview (Image/PDF viewer), Terminal and Desktop Manager. I also regularly run various video players, iPhoto, Photoshop, SETI, Skype, OmniGraffle (diagramming) and Camino (browser). E-UAE (Amiga emulator), GarageBand and even Xcode get fired up once in a while. iTunes is pretty much permanently on (currently playing Garbage’s “Bleed like me”). In addition to this lot the OS itself is doing umpteen jobs at any one time, I won’t bother listing them as it’ll just sound like a dodgy Apple ad...

It might surprise you to learn that almost everything in that list can be accelerated to some degree by the SPEs - even the OS itself can benefit in several areas. Anything which uses graphics, video or audio is a good target as these work on chunks of data and are in many cases parallelisable and vectorisable. Text rendering and anti-aliasing can be done in an SPE, as can searching. Even if operations are not vectorisable, scalar operations run across different SPEs will still be of use. Sorting and encryption are also potential targets.

There aren’t many applications in that list which will drive the system at 100% CPU usage for more than a few seconds at a time. For the most part the types of applications which actually need power are the very ones which will benefit the most from running on SPEs. Some of these kinds of applications can be accelerated to a very high degree (Photoshop, SETI, GarageBand).

I predicted SETI would be a good target in the first version of this article as it is based on FFTs. Cells have been shown to run FFTs at ludicrously high speeds, a single SPE managed 19 GFLOPS at 3.2GHz [CellDev], so I think my prediction of high SETI performance will hold! If you are not interested in looking for aliens though you will find FFTs and similar algorithms are very widely used in compute intensive applications.

Note: I said in the first version of this article that I thought Apple would be stark raving mad if they didn’t use the Cell. I am still of that opinion. However, if Apple won’t go to the Cell, you can be pretty sure the Cell will go to Apple, the aptly named Cell-industries [CI] is planning an add-on.

SPE Instruction Set
The exact specification of the SPE’s ISA hasn’t been released yet but it appears to be a cross between VMX and the PS2’s Emotion Engine. It doesn’t contain the full VMX instruction set as some stuff was removed and other stuff added. According to IBM using a subset of the VMX “intrinsics” you can compile to both the SPEs and standard VMX, the only difference is the local stores. The actual code is different at the binary level so existing AltiVec/VMX code needs to be recompiled before it will run [CellLinux].

Since the SPEs can act as independent processors they also include instructions used when running programs such as branches, loads and stores. They can also operate as scalar processors so non-vectorisable code can also run.

SPE programs cannot directly access anything other than local store though data can be DMA’d to and from the local stores from other system addresses. Exceptions can be generated by the SPEs but are handled by the PPE, even then the types of exceptions generated are limited.

Here is a roundup of what is currently known of the SPE’s Instruction set (this should not be considered an exhaustive list):


 * Based on VMX/AltiVec - some instructions added, some removed.
 * Includes some (all?) of the PS2’s Emotion Engine ISA.
 * Supports vector or scalar operations.
 * Includes loads, stores, branches and branch hints.
 * 8, 16, 32 and 64 bit integer operations.
 * Single and double precision floating point.
 * Saturation arithmetic for FP (not integer).
 * Simplified rounding modes for single precision FP.
 * IEEE 754 support for double precision FP (not precise mode).
 * Logical operations.
 * Byte operations: Shuffle, Permute, Shift and Rotate (Shift / Rotate per Qword or slot).
 * 128 x 128 bit Registers.
 * Local Store DMA I/O (to / from any address in system).
 * Commands for mailbox access, interrupts etc.

SPE Simulation
Until simulators or hardware become available it appears the best way of understanding SPE development would be to learn parallel programming techniques and VMX. The local stores could probably be simulated by doing everything in a 256K block of RAM and only allowing access to the rest of RAM via aligned 128 byte transfers to / from it. This will at least give you some idea of the issues you have to face. You do however need to note the 256K will include your code so don’t use all of it for data. This will not simulate the speed of the local store or the impact of the additional registers but should be useful nonetheless. The impact of a smaller number of registers and CPU caching will mean performance of the final code will be very difficult to estimate until hardware or real simulators are available.
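The do-it-yourself simulation suggested above might look something like the following sketch. It models only the restriction (a 256K block reachable from main memory solely through aligned transfers in multiples of 128 bytes), not timing or any real SPE interface; the class and method names are invented for the example.

```python
LOCAL_STORE_SIZE = 256 * 1024   # bytes, shared by code and data on a real SPE
TRANSFER_SIZE = 128             # transfers are aligned multiples of this

class SimulatedSPE:
    """Toy model: all work happens in local_store, reached only via 'DMA'."""

    def __init__(self, main_memory):
        self.main_memory = main_memory              # a bytearray
        self.local_store = bytearray(LOCAL_STORE_SIZE)

    def _check(self, local_addr, main_addr, size):
        if size <= 0 or size % TRANSFER_SIZE:
            raise ValueError("size must be a positive multiple of 128 bytes")
        if local_addr % TRANSFER_SIZE or main_addr % TRANSFER_SIZE:
            raise ValueError("addresses must be 128-byte aligned")
        if local_addr < 0 or local_addr + size > LOCAL_STORE_SIZE:
            raise ValueError("transfer falls outside the local store")
        if main_addr < 0 or main_addr + size > len(self.main_memory):
            raise ValueError("transfer falls outside main memory")

    def dma_get(self, local_addr, main_addr, size):
        """Copy from main memory into the local store."""
        self._check(local_addr, main_addr, size)
        self.local_store[local_addr:local_addr + size] = \
            self.main_memory[main_addr:main_addr + size]

    def dma_put(self, local_addr, main_addr, size):
        """Copy from the local store back out to main memory."""
        self._check(local_addr, main_addr, size)
        self.main_memory[main_addr:main_addr + size] = \
            self.local_store[local_addr:local_addr + size]

# Usage: pull one block in, work on it locally, push the result back out.
memory = bytearray(range(256)) * 4                  # 1024 bytes of test data
spe = SimulatedSPE(memory)
spe.dma_get(0, 128, 128)
for i in range(128):
    spe.local_store[i] = (spe.local_store[i] + 1) % 256
spe.dma_put(0, 256, 128)
```

Writing even small routines against an interface like this quickly shows how much of the work is arranging data so that it arrives in suitably sized, aligned blocks.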

Cell Programming Issues
One article suggested that some developers had found the performance of both the Cell and the XBox360’s processors to be low. This article was highly controversial and the original was removed but that didn’t stop it spreading [Comment]. It appears the developers took some decidedly single threaded game code and ran it on pre-production hardware with a processor optimised for multi-threading and stream processing (most likely with an immature compiler). That the performance wasn’t what they expected is not so much news as blindingly obvious. Of course the more cynical might suggest that’s exactly what these developers expected... Getting the full potential from a Cell will be more difficult than programming a single threaded PC application. Multiple execution threads will be necessary as will careful choice of algorithms and data flow control.

These are not exactly new problems and solutions to them have long existed. Multiprocessor or even uni-processor servers have been doing this sort of thing for donkey’s years. It’s all old hat to BeOS programmers and many other programmers for that matter. You will likely see technologies appearing from other areas (e.g. application servers) which take the pain out of thread management.

The same will also happen to the PC in time so they’re not going to get off lightly. The problems that need to be solved for programming a Cell are exactly the same problems that need to be solved for programming a multi-core PC processor.

The types of algorithms which a Cell will be bad at are the very same algorithms that a PC processor is bad at. Any algorithm which reads data from memory in a non-linear manner will cause your CPU to twiddle its thumbs as it sits waiting for the relevant memory to get pulled in, such algorithms are also likely to cause cache thrashing so it won’t help much. Branch prediction is not much use either as it works on instructions, not data.

Cell will suffer here (possibly more so) but smart programmers can do tricks like splitting data into blocks and reading all the relevant data into the SPEs’ local memories in one go. This will make algorithms at which the Cell is supposedly bad vastly quicker. The SPEs can communicate internally and read from each other’s local stores; if all SPEs are used for this nearly 2 MB can conceivably be used at once. This will not be possible on a PC as cache works in a completely different way.
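One way to picture the block-splitting trick is to group scattered accesses by the block they fall in, so that each block is fetched once in bulk and then read locally. The sketch below is an illustration under that assumption; `gather_blocked` is a hypothetical helper, and slicing a Python list stands in for a bulk transfer into a local store.

```python
def gather_blocked(table, indices, block_len):
    """Return [table[i] for i in indices], touching the table one block at a time."""
    # Group the requests by the block each index falls into.
    by_block = {}
    for position, index in enumerate(indices):
        by_block.setdefault(index // block_len, []).append((position, index))

    out = [None] * len(indices)
    for block_id in sorted(by_block):
        start = block_id * block_len
        local = table[start:start + block_len]    # one bulk fetch per block
        for position, index in by_block[block_id]:
            out[position] = local[index - start]  # cheap "local store" reads
    return out

table = [i * i for i in range(1000)]
indices = [999, 0, 512, 3, 511]
gathered = gather_blocked(table, indices, 256)
print(gathered)
```

The result is the same as a naive gather, but main memory is read in a handful of large sequential chunks rather than many small scattered ones.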

The Cell also supports stream processing via the local stores, this can drastically reduce the need to go to memory and they will perform best in this configuration. The Xenon has special cache modifications (locking cache sets) to allow streaming but PC CPUs appear to have no direct way to support this currently.

SPE Performance Issues
For the PPE to get round its issues it just needs software to be compiled for it with a decent compiler. With the SPEs the issues are more complex as they are more optimised for specific types of code. As with the PPE there is no OOO hardware so the compiler is again important, but with 128 registers there’s plenty of room to unroll loops. Scalar (i.e. single operation) processing can work, but the SPEs were really designed for vector processing and will perform best (and considerably faster) when doing vector operations.

To get full use of an SPE the algorithm in use and at least some of the data needs to fit in a local store. Anything which deals with chunks of data which fits entirely into the local stores should pretty much go like a bat out of hell.

All CPUs are held back by memory accesses, these can take hundreds of cycles leaving the CPU sitting there doing nothing for long periods. A processor will only run as fast as it can get data to process. If the data is held in a low latency local memory getting data is not going to be a problem. It is in conditions like these that the individual SPEs may approach their theoretical maximum processing speed.

Programs which need to access external memory can move data to and from the local stores but there are restrictions in that transfers need to be properly aligned and should be in chunks of 128 Bytes (transfers can be smaller but there’s no point as it’s designed to handle 128 Bytes). Additionally, due to there being multiple processors in a Cell the memory access is shared so access requests need to be put in a queue. Scheduling memory access early will be important for maximising memory throughput.
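In practice the alignment rule means a request for an arbitrary byte range has to be widened to whole 128 byte chunks. A minimal sketch of that calculation follows; `aligned_span` is a hypothetical helper, not part of any Cell toolchain.

```python
ALIGN = 128  # transfer granularity in bytes

def aligned_span(address, length):
    """Return (start, size) of the smallest aligned region covering the range."""
    if address < 0 or length <= 0:
        raise ValueError("address must be >= 0 and length must be > 0")
    start = address - (address % ALIGN)          # round the start down
    end = address + length
    aligned_end = -(-end // ALIGN) * ALIGN       # round the end up
    return start, aligned_end - start

print(aligned_span(300, 100))   # bytes 300..399 need the region 256..511
print(aligned_span(128, 128))   # already aligned, nothing extra is moved
```

The gap between the requested length and the returned size is bandwidth spent on bytes that were not asked for, which is one reason data layout matters so much here.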

While conventional CPUs try to hide memory access with caches, the SPEs put it under the control of the programmer/compiler. This adds complexity, but not all view compiler controlled memory access as a hindrance:

“An argument has been growing that processor instructions sets are just broken because they hide ‘memory’ (now in truth L2 cache) from ‘I/O’ (better known as memory) and the speed differences are now so huge that software can manage this better than hardware guesswork.” - Alan Cox [AC]

The SPEs are dual issue but only a single calculation instruction can be issued per cycle. Clever compilers may make use of vector processing as it should be possible to schedule multiple scalar operations in a single vector operation if they are the same.
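To show what scheduling several identical scalar operations into one vector operation means, here is a toy model in which a "vector add" handles four values at once. It assumes four 32 bit lanes per 128 bit register; the function names are invented, and a real compiler would of course do this at the instruction level rather than in Python.

```python
LANES = 4  # assumption: four 32-bit values per 128-bit register

def vector_add(a, b):
    """One 'vector instruction': add LANES pairs of values at once."""
    if len(a) != LANES or len(b) != LANES:
        raise ValueError("operands must have exactly LANES elements")
    return [x + y for x, y in zip(a, b)]

def pack_scalar_adds(pairs):
    """Run independent scalar additions LANES at a time.

    Returns the results and the number of vector operations issued.
    """
    results = []
    vector_ops = 0
    for i in range(0, len(pairs), LANES):
        group = pairs[i:i + LANES]
        padding = LANES - len(group)                  # fill unused lanes with zeros
        a = [p[0] for p in group] + [0] * padding
        b = [p[1] for p in group] + [0] * padding
        results.extend(vector_add(a, b)[:len(group)])  # drop the padded lanes
        vector_ops += 1
    return results, vector_ops

sums, ops = pack_scalar_adds([(1, 2), (3, 4), (5, 6), (7, 8), (9, 10)])
print(sums, ops)
```

Five scalar additions are covered by two vector operations here; the saving only appears when the operations are the same and independent of one another.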

The PPE does have some branch prediction hardware but the SPE has none. To get around this the SPE includes a “branch hint” instruction which the compiler can use. In addition to this in some cases instructions can be used which remove the need for branches altogether. Developers using GPUs for general purpose programming have more constraints than the SPEs and have developed techniques for reducing the costs of branches. It’s quite possible that at least some of these can be applied to SPE programs.
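A common way of removing branches is to compute a mask with a comparison and then use bitwise operations to select between two values. The sketch below shows the idea on lists of unsigned 32 bit values; whether the SPE offers exactly these operations is an assumption here, and the function names are illustrative.

```python
MASK32 = 0xFFFFFFFF

def compare_greater(a, b):
    """Per lane: all ones where a > b, otherwise zero (no if/else needed)."""
    return [-(x > y) & MASK32 for x, y in zip(a, b)]

def select(mask, if_set, if_clear):
    """Per lane: take bits from if_set where mask is 1, else from if_clear."""
    return [(t & m) | (f & ~m & MASK32)
            for m, t, f in zip(mask, if_set, if_clear)]

def vector_max(a, b):
    """Branch-free maximum of two vectors of unsigned 32-bit values."""
    return select(compare_greater(a, b), a, b)

result = vector_max([1, 9, 3, 7], [5, 2, 3, 8])
print(result)
```

Both candidate values are always computed and the mask decides which one survives, trading a little extra arithmetic for the absence of a hard to predict branch.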

In the future, instead of having multiple discrete computers you'll have multiple computers acting as a single system. Upgrading will not mean replacing an old system anymore, it'll mean enhancing it. What's more your "computer" may in reality also include your PDA, TV, printer and Camcorder all co-operating and acting as one. The network will quite literally be the computer.

The Future: Multi-Cell'd Animals
One of the main points of the entire Cell architecture is parallel processing. The original idea for Cells working across networks as mentioned in the patent appears to still be in development but probably won’t be in wide use for some time yet. The idea is that “software cells” can be sent pretty much anywhere and don't depend on a specific transport means.

Want more computing power? Plug in a few more Cells and there you have it. If you have a few cells sitting around talking to each other via WiFi connections the system can use it to distribute software cells for processing. The idea is similar to the above mentioned job queues but rather than jobs being assigned locally they are assigned across a network to any Cell with spare processing capability.



The mechanism present in the software cells makes use of whatever networking technology is in use, which allows ad-hoc arrangements of Cells to be made. This system essentially takes a lot of complexity which would normally be handled by hardware and moves it into the system software. This usually slows things down but the benefit is flexibility: you give the system a set of software cells to compute and it figures out how to distribute them itself. If your system changes (Cells added or removed) the OS should take care of this without the programmer needing to worry about it.

Writing software for parallel processing is usually highly difficult and this helps get around the problem. You still of course have to parallelise the program into software cells/jobs but once that's done you don't have to worry if you have one Cell or ten (unless you’ve optimised for a specific number).

It's not clear how this system will operate in practice but it would have to be adaptive to allow resending of jobs when Cells appear and disappear on a network. That said such systems already exist and are in very wide use.
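To make the idea of resending jobs concrete, here is a toy simulation, assuming a very simple model that is not taken from the patent or any STI software: in each step every live Cell takes one job from a queue, and if a Cell drops off the network during that step its job goes back on the queue. The names run_jobs and vanish_at are hypothetical.

```python
from collections import deque


def run_jobs(jobs, cells, vanish_at=None):
    """Hand jobs to Cells step by step, resending any job whose Cell vanished.

    vanish_at maps a step number to the name of a Cell that disappears
    during that step. Returns a dict of job -> Cell that completed it.
    """
    vanish_at = vanish_at or {}
    pending = deque(jobs)
    live = list(cells)
    done = {}
    step = 0
    while pending:
        if not live:
            raise RuntimeError("no Cells left to run the remaining jobs")
        # Give one job to each live Cell (fewer if the queue runs dry).
        in_flight = {}
        for cell in live:
            if not pending:
                break
            in_flight[cell] = pending.popleft()
        # A Cell that vanishes mid-step loses its job, so the job is requeued.
        gone = vanish_at.get(step)
        if gone in in_flight:
            pending.append(in_flight.pop(gone))
        if gone in live:
            live.remove(gone)
        # Every other job handed out in this step completes.
        for cell, job in in_flight.items():
            done[job] = cell
        step += 1
    return done


jobs = ["j0", "j1", "j2", "j3", "j4", "j5"]
print(run_jobs(jobs, ["A", "B", "C"], vanish_at={0: "B"}))
```

The program submitting the jobs never needs to know that Cell B went away; the job it was holding simply ends up on another Cell.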

This system was not designed to act like a “big iron” machine, that is, it is not arranged around a single shared or closely coupled set of memories. All the memory may be addressable but each Cell has its own memory and they will work most efficiently in it. It appears the IBM workstation/blade will have 2 Cells but it can act as an SMP system (i.e. the Cells can share each other's memory). The patent specified a way to connect 8 Cells but given the size and likely cost of the first generation of Cells I doubt anything like this will appear soon.

Programming The Cell: Conclusion

From reading various articles and conference papers / presentations it’s clear that there are many different Cell development models under active development, some of these are complex, others designed to make things easy. These developments will take considerable time to become mature so won’t show their full potential for some time to come.

The patent described a highly complex system which would be very difficult to program; a lot of work is going on to make sure this is not the case. Tools and middleware will be developed to make things easier still.

The Cell was designed as a general purpose system but tuned for specific classes of algorithms. These kinds of algorithm are very common and are utilised in a major chunk of the kinds of software which require high performance. PC vendors have nothing to worry about, it’s the workstation vendors that should be worried.

Part 4...
Why does the Cell work the way it does? How and why did they come up with this design?

In part 4 I look into the reasons for the Cell’s design decisions and the impact they are likely to make.

Branching Orders: Is The Cell General Purpose?
There has been a lot of debate about how the Cell will perform on general purpose code with many saying it will not do well as it is a “specialised processor”. This is not correct, the Cell was designed as a general purpose processor, but optimised for high compute tasks. The PPE is a conventional processor and will act like one. The big difference will be in the SPEs as they were designed to accelerate specific types of code and will be notably better in some areas than others, however even the SPEs are general purpose. The Cell has in essence traded running everything at moderate speed for the ability to run certain types of code at high speed.

The PPE is a normal general purpose core and should have no problems on most code. That said, the PPE has a simplified architecture compared to other desktop processors and this seems to be taken in some quarters as a reason for not just low performance but very low performance on general purpose code. Few care to explain why this is or even what this “general purpose” code is. Some do make vague references to the lack of Out-Of-Order hardware or the limited branch prediction hardware.

While I am of the opinion that the relative simplicity may be a disadvantage in some areas I doubt it’s going to make that much of a difference as what the PPE may lose in Instructions Per Cycle (IPC) it will make up for with a higher clock speed. However the PPE does not exist in isolation, the heavy lifting work will be handed off to the SPEs. Even if there is a performance difference removing the heavy duty work will make a bigger difference than any loss due to simplification.

As for the SPEs, not everything can be vectorised but that doesn’t mean SPEs suddenly become useless. Just because a vector instruction can perform 4 calculations at once doesn’t mean it has to. There’s no reason why you can’t issue an instruction which performs one calculation and three nothings. In this case the SPE will in effect become a dual issue, in-order RISC. The SPEs should be quite capable of acting as general purpose CPUs even on non-vectorisable code. How well they will perform will largely be down to the programmer, compiler and memory usage pattern: if the memory usage is predictable they should do well, if not they will do badly. They will be somewhat inefficient running some types of general purpose code but there are plenty of small intensive loops in general purpose applications and much processing is concentrated in them. It is in these areas where the SPEs will be fast and efficient.

RISC Strikes Back
The trend in CPUs over the last 15 years has been to increase performance by not only increasing the clock rate but also by increasing the Instructions Per Cycle. The designers have used the additional transistors each process shrink has brought to create increasingly sophisticated machines which can execute more and more instructions in one go by executing them Out-Of-Order (OOO). Many modern desktop CPUs do this; exceptions are Transmeta, VIA and Intel’s Itaniums. OOO is something of a rarity in the embedded world as power consumption considerations don’t allow it.

The PPE is completely different however. It uses a very simple design and contains no OOO hardware, this is the complete opposite approach to IBM's last PowerPC core, the 970FX (aka G5).

The complete change in design philosophy is due to the physical limitations CPU designers are now hitting. OOO CPUs are highly complex and use large numbers of transistors to achieve their goals; these all require power and need to be cooled. Cooling is becoming an increasingly difficult problem as transistors are beginning to leak electrons, making them consume power even when they are not in active use. The problem has got to the point where all the desktop CPU manufacturers have pretty much given up trying to gain performance by boosting clock speeds and have taken the multi-core approach instead. They haven’t taken to simplification yet but almost certainly will have to in the future.

While this may appear to be a new problem it’s not, it has been predicted and investigated for many years.

The following quote is from a research paper by IBM “master inventor” Philip Emma in 1996:

“If the goal of a microarchitecture is low power... only those features that pervasively provide low CPI* are included. Features that only help CPI sometimes (and that hurt cycle time all of the time) are eliminated if low power is a goal. Those elements should also be eliminated if high performance is a goal.

As the industry pushes processor design into the GHz range and beyond, there will be a resurgence of the RISC approach. While superscalar design is very fashionable, it remains so largely because its impact on cycle time is not well understood. Complex superscalar design stands in the path of the highest performance; he who achieves the highest MHz runs the fastest.” [Emma]

* CPI: Cycles per instruction, a large part of the referenced paper argues this is a much better measure than IPC.

IBM, like any large technology company, does research. In the following year (1997), long before GHz or 64 bit CPUs arrived on the desktop, IBM developed an experimental 64 bit PowerPC which ran at 1GHz. Its snappy title was guTS (GigaHertz unit Test Site) [guTS].

The guTS and a later successor were designed to test circuit design techniques for high frequency, not low power. However since it was only for research the architecture of the CPU was very simple, unlike other modern processors it is in-order and can only issue a single instruction at a time. The first version only implemented part of the PowerPC instruction set, a later version in 2000 implemented it all.

It turns out that the power consumption problem has become so bad that if you want high clock frequency you now have to simplify, there is simply no choice in the matter. If you don’t simplify the CPU will consume so much power it will become very difficult to cool and thus the clock speed will be limited. Both IBM and Intel have discovered this rather publicly: try buying a 3GHz G5 or a 4GHz P4.

When a low power, high clocked general purpose core was required for the Cell, this simple experimental CPU designed without power constraints in mind turned out to be perfect. The architecture has since been considerably modified, the now dual issue, dual-threaded PPE is a descendant of the guTS.

The XBox360’s “Xenon” [Xbox360] processor cores also appear to be derived from the guTS processor although they are not quite the same as the PPE. In the Cell the PPE uses the PowerPC instruction set and acts as a controller for the more specialised SPEs. The Xenon cores use a modified version of the PowerPC instruction set with additional instructions and a beefed up 128 register VMX unit.

Discrete PowerPC parts based on this technology have been long rumoured, some rumours even suggest POWER6 might use similar technology to get the high frequency boost it’s promising. While rising clock speeds seem to have been declared dead by the rest of the industry, evidently somebody forgot to tell IBM... [Ultra]

There is an unwritten assumption that the OOO hardware and large branch predictors (the PPE has relatively simple branch predictors) play a major part in CPU performance. This is not the case: just throwing transistors at a CPU doesn’t massively raise its speed. Gelsinger’s law states: “Doubling the number of transistors increases performance by 40%”. However, most of this 40% is not coming from OOO hardware, most can be traced directly to increasing cache sizes.
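The arithmetic behind the quoted law can be worked through in a few lines. The sketch below assumes the 40% gain compounds with every doubling of the transistor count, which is my reading of the quote rather than anything stated in it; the function name is made up for illustration.

```python
import math


def relative_performance(transistor_ratio, gain_per_doubling=0.40):
    """Performance multiplier for a given increase in transistor count,
    assuming each doubling multiplies performance by (1 + gain)."""
    doublings = math.log2(transistor_ratio)
    return (1 + gain_per_doubling) ** doublings


for ratio in (2, 4, 8):
    print(ratio, round(relative_performance(ratio), 3))
```

Under that assumption 4 times the transistors buys a little under twice the performance, and 8 times the transistors well under three times.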

While OOO CPUs allow more instructions to run concurrently than the PPE, they can't sustain this higher issue rate for any length of time. IPC is the average number of instructions per cycle that a CPU performs. Measured IPC varies wildly, going above 3 in some instances and well below 1 in others; the average IPC figure is below 2 [LowIPC]. In accordance with this both the PPE and SPEs are designed to issue 2 instructions per cycle.

OOO CPUs are specifically designed to increase IPC but they can only do this in some types of code. Code which contains dependencies has to be executed in order, OOO is no help on this sort of code. OOO may actually decrease performance in some areas since CPUs with aggressive OOO hardware cannot run at as high clock rates as in-order CPUs. Algorithms which scale with clock speed are consequently held back. OOO takes a lot of room and a lot of power but only increases performance in specific areas, increasing the clock speed increases the performance of everything.
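The trade-off can be put as a one line formula: instructions per second is IPC multiplied by clock rate. The sketch below uses purely hypothetical figures (they are not measurements of any real CPU) for a stretch of dependent code where OOO hardware cannot raise the IPC.

```python
def instructions_per_second(ipc, clock_hz):
    """Throughput is simply instructions per cycle times cycles per second."""
    return ipc * clock_hz


# Hypothetical figures: on dependent code both cores manage an IPC of 1,
# but the simpler in-order core is assumed to clock higher.
ooo_core = instructions_per_second(ipc=1.0, clock_hz=2.5e9)
in_order_core = instructions_per_second(ipc=1.0, clock_hz=3.2e9)

print(in_order_core / ooo_core)
```

When the IPC is pinned by dependencies the only term left to improve is the clock, which is the point being made above.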

While it may be possible to increase the IPC of the PPE with OOO* hardware the performance gain would be at best limited. The PPE could be replaced by something like a 970FX but this has a larger core and either the die size would have to grow to around 250 square mm or a pair of SPEs would have to be removed. The 970 also consumes hefty amounts of power at high clock speeds - more than the entire Cell. In order to fit a 970 and keep it cool enough to run in a PS3 the frequency would probably be reduced to well under 3GHz, probably under 2.5GHz. The end result would be a CPU which runs hotter, slower and costs more to build.

Hope you’re keeping up with all these TLAs (Three Letter Acronyms).

OOO was a good way to boost performance in the past but as power consumption has become a limiting factor the entire industry has been looking for new ways of boosting performance. x86 vendors have taken to using 2 cores instead of attempting to boost clock speed or IPC further. Future generations are expected to have 4 then 8 cores. Even then I expect x86 vendors to simplify their OOO capabilities for the same reason. I doubt OOO will be removed completely though, as x86 CPUs gain more from OOO hardware than PowerPCs due to the smaller number of architectural* registers [x86vsPPC].

Architectural registers are internal registers the programmer can use directly. OOO CPUs have many more “rename” registers but the programmer cannot directly use them.

Predicting Branches
Advanced branch prediction hardware is like OOO hardware, it only boosts performance sometimes.

As an example, let's say the CPU is processing a 100 iteration loop and it takes 10 cycles to complete an iteration. Execution at the end of the loop can be assumed to branch to the beginning of the loop so the loop continues without stalling the pipeline. On the final iteration of the loop the guessed branch will be incorrect, and the PPE will incur a 7 cycle penalty. A large branch predictor might avoid this but in doing so saves just 7 cycles out of 1000. In this case the branch predictor delivers a performance boost of less than 1%.
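The numbers in this example are easy to check. The helper below is a hypothetical name used only for this illustration; it divides the cycles lost to mispredicted branches by the cycles of useful work.

```python
def mispredict_overhead(iterations, cycles_per_iteration, penalty, mispredicts=1):
    """Fraction of extra time spent on branch mispredictions for a loop."""
    useful_cycles = iterations * cycles_per_iteration
    wasted_cycles = mispredicts * penalty
    return wasted_cycles / useful_cycles


overhead = mispredict_overhead(iterations=100, cycles_per_iteration=10, penalty=7)
print(f"{overhead:.1%}")  # 7 cycles lost against 1000 cycles of work
```

Shorter loops change the picture: the same 7 cycle penalty on a loop that only runs for 70 cycles in total would cost 10%, which is why the usefulness of a predictor depends so much on the code.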

The usefulness of the branch predictor is dependent on the type of code being run but paradoxically the OOO hardware can act against it. The branch predictor can only work efficiently if it remains ahead of the execution hardware in the instruction stream. If the OOO hardware works effectively the gap closes and the CPU effectively runs out of instructions [Emma], in that case the branch predictor is sitting doing nothing.

Branch predictors are not useless (they wouldn’t be used otherwise) so the PPE does include one. However, it’s not as large as those found in some desktop processors.

Dual Threading
Unlike the SPEs the PPE is dual threaded. Both threads issue alternately but if one gets held up the other takes over. Two instructions can be issued per cycle (four are fetched every other cycle). Hardware multithreading is similar to Intel’s “Hyperthreading”, it is already present in POWER5 and you can be sure it will feature in many other processors in the future.

While the lack of OOO and a large branch predictor will have some impact, the ability to run a second thread will make up for it at least partially as a second thread can utilise the full execution resources while the first thread is waiting.
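The effect of a second thread filling in for a stalled one can be seen in a toy model. The sketch below is a deliberate simplification and an assumption on my part: it issues only one instruction per cycle (the PPE can issue two), prefers the threads alternately and lets the other thread issue whenever the preferred one is stalled. Each thread is a list giving the stall, in cycles, that follows each instruction.

```python
def cycles_to_issue(threads):
    """Count cycles until every instruction of every thread has issued.

    threads is a list of lists; entry k of a thread is the number of stall
    cycles (e.g. a cache miss) that follow its k-th instruction.
    """
    n = len(threads)
    pos = [0] * n        # next instruction to issue in each thread
    ready_at = [0] * n   # first cycle at which each thread may issue again
    cycle = 0
    prefer = 0
    while any(pos[t] < len(threads[t]) for t in range(n)):
        # Try the preferred thread first, then the others in turn.
        for k in range(n):
            t = (prefer + k) % n
            if pos[t] < len(threads[t]) and ready_at[t] <= cycle:
                ready_at[t] = cycle + 1 + threads[t][pos[t]]
                pos[t] += 1
                break
        prefer = (prefer + 1) % n
        cycle += 1
    return cycle


stalling = [0, 3, 0]   # second instruction is followed by a 3 cycle stall
smooth = [0, 0, 0]

print(cycles_to_issue([stalling]) + cycles_to_issue([smooth]))  # one after the other
print(cycles_to_issue([stalling, smooth]))                      # sharing the core
```

Run one after the other the two threads need more cycles than when they share the core, because the second thread does its work while the first is waiting.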

The Law of Diminishing Returns
Single chip CPUs were first produced in 1971 with Intel’s 4004. In the 34 years since then various features have been added to enhance performance and functionality.

Single chip CPUs started at just 4 bits but rapidly went upwards through 8, 16 and 1979’s Motorola 68000, a 32 bit processor. The first 64 bit processors did not appear until 1992.

Cache was next to be added, first in very small quantities (a few bytes in the 68010) but this has been rising ever since then.

Soon after external devices such as Memory Management Units and Floating Point Units were also integrated.

The 68040 and 80486 delivered pipelining, this was the beginning of the integration of RISC technologies into CISC chips. Pipelining allows the CPU to operate on different stages of different operations simultaneously - e.g. it can be reading one instruction while executing another.

Superscalar execution was next, this gave the processors the ability to execute multiple instructions simultaneously. This arrived in the 80586 (aka Pentium) and 68060.

OOO (Out of Order) execution appeared in the Pentium Pro. Along with it came things like speculative execution and pre-fetching.

x86 got SIMD/Vector capabilities in increments from MMX onwards. PowerPC got it in one go with the introduction of AltiVec.

Intel introduced Hyperthreading in a version of the Pentium 4.

Most recently AMD have led the way with point to point busses, 64 bits, on-die memory controllers and, latest of all, dual cores.

While the above list focuses on desktop processors, most of these technologies were previously used in high end RISC processors, which had in turn taken the technologies from the mainframe and supercomputer worlds. Many of these technologies were developed or implemented in the 1960s.

As these technologies have been added the performance of CPUs has gone up. However the vast majority of that performance has come not from architecture enhancements but from clock speed.

Some enhancements have made a lot more difference than others. Cache in particular has made a very big difference due to the disparity between CPU clock speeds and memory speeds, this gap is only growing larger so cache is becoming more important.

Going superscalar allows more than one instruction to be executed at once and this potentially doubles performance, however this will not happen all the time as dependencies mean instructions often have to be processed one at a time.
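A small model shows how dependencies eat into the potential doubling. The sketch assumes an in-order core that can issue up to two consecutive instructions per cycle, with every result available one cycle later; it is an illustration of the principle and not a model of any particular CPU. The name cycles_in_order is hypothetical.

```python
def cycles_in_order(deps, width=2):
    """Cycles needed to issue a program in order on a core of given width.

    deps[i] is the set of earlier instruction indices that instruction i
    depends on. An instruction cannot issue in the same cycle as one it
    depends on, and nothing may issue ahead of an earlier instruction.
    """
    cycles = 0
    i = 0
    n = len(deps)
    while i < n:
        issued = [i]
        i += 1
        # Pair up further instructions only while they are independent
        # of everything already issued in this cycle.
        while i < n and len(issued) < width and not (deps[i] & set(issued)):
            issued.append(i)
            i += 1
        cycles += 1
    return cycles


independent = [set() for _ in range(8)]
chain = [set()] + [{k - 1} for k in range(1, 8)]

print(cycles_in_order(independent))  # pairs issue together
print(cycles_in_order(chain))        # each instruction waits for the previous one
```

Eight independent instructions go through two at a time, while a chain where each instruction needs the previous result gains nothing from the second issue slot.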

OOO hardware attempts to raise the IPC to higher levels but fundamental constraints in software mean that is not consistently possible, average IPC still remains below 2.

This is the law of diminishing returns, each new feature adds performance but the improvement each time gets smaller and smaller. Unfortunately the opposite happens with power consumption as each improvement also seems to get progressively more complex. Going superscalar increases performance and power consumption but OOO hardware increases performance less and power consumption more.

In the Cell the choice was between an OOO design with large branch predictors or a simpler, smaller and higher clock speed design. Complex features like OOO only give a relatively small performance boost and increase power consumption disproportionately to the performance they add. In order to attain the performance and power consumption aims these sorts of features could not be used in the Cell design.

A reasonably sized register set along with a good compiler allows instructions to be scheduled so dependencies have less of an impact. The compiler can in effect do the job of OOO hardware; Intel’s Itanium has clearly shown just how effective a compiler can be at doing this.

A General Purpose Conclusion
Cell will not magically accelerate a general purpose system, it will require considerable work to get the best out of it. It’s not even clear if anyone actually intends to build a desktop system using the Cell after Apple went off and got some “Intel inside”.

The Cell designers have deliberately produced a simple design in order to get around the heat problems complex designs now suffer [HPCA]. This has allowed them to produce a processor with 9 cores while the rest of the industry is shipping just 2. On top of that they have also been able to boost the clock speed.

In order to create this kind of system and get around the rising power consumption problems the industry is facing, the Cell designers have produced cores with a relatively simple architecture. In order to do this they have removed some common features which while useful are not critical to performance.

The resulting simplicity will impact performance in some areas but a combination of higher clock speeds, new compiler technology and smart algorithm selection should largely get around these problems.

The SPEs will be particularly sensitive to the type of code running as they are more “tuned” than conventional processors. In these cases it will be important to allocate the correct type of code to the right processor. This is why the PPE and SPEs are different.

Short Overview
The Cell architecture consists of a number of elements:

The Cell Processor
This is a 9 core processor, one of these cores is a PowerPC and acts as a controller. The remaining 8 cores are called SPEs and these are very high performance vector processors. Each SPE contains its own block of high speed RAM and is capable of 32 GigaFlops (32 bit). The SPEs are independent processors and can act alone or can be set up to process a stream of data with different SPEs working on different stages. This ability to act as a "stream processor" gives access to the full processing power of a Cell which is claimed to be more than 10 times higher than even the fastest desktop processors.

In addition to the raw processing power the Cell includes a high performance multi-channel memory subsystem and a number of high speed interconnects for connecting to other Cells or I/O devices.

Distributing Processing
A software infrastructure under development will allow Cells to work together. While they can be directly connected via the high speed interconnects they can also be connected in other ways or distributed over a network. The Cells are not gaming or computer specific, they can be in anything from PDAs to TVs but they can still be used to act as a single system.

Parallel programming is usually complex but in this case the OS will look at the resources it has and distribute tasks accordingly, this process does not involve any more programming than the initial parallelisation. If you want more processing power you simply add more Cells, you do not need to replace the existing ones as the new Cells will augment the existing ones.

Overall the Cell architecture is an architecture for distributed, parallel processing using very powerful computational engines developed using a highly aggressive design strategy. These devices shall be produced in vast numbers so they will provide vast processing resources at a low cost.

Conclusion
The rule in the PC world has always been “evolution not revolution”, often simply incorporating features from other platforms. The changes are incremental and produce incremental performance boosts.

Cell is a revolution, a completely new microprocessor architecture which, while it may take some time to get used to, promises a vast performance boost over today’s systems. GPUs can already run 10 times faster than desktop CPUs, Cell will not only bring similar performance but will do so for more applications and it’ll be easier to program.

Being produced in large volumes also means the Cell will be cheap. They will likely see widespread use not just in living rooms but in the realm of industry and science as well. The embedded world is much, much larger than the PC world and often imposes stringent constraints on the components used, the same sort of constraints the Cell has been designed for.

Some have suggested that STI should have gone for a more conventional design such as three PowerPC 970s on a single chip. Such a design would not have addressed the power issues and would, as a result, have had to be driven at a relatively low clock rate. Instead, by using simpler designs which use vectors the Cell designers have managed to fit 9 cores on a single chip at a higher clock speed, the potential performance is consequently considerably higher.

The Cell is a new architecture and will seem strange and alien to many used to rather more conventional desktop designs. In order to utilise it properly programmers will have to face new problems and devise new ways of solving them. It remains to be seen how much of the Cell’s potential can be achieved and how difficult it is to extract it, but it’s clear that STI are trying to make this as painless as possible.

Many people do not like change, to them Cell represents a threat. For others it represents an opportunity.

Let's see how many take the opportunity, and what other opportunities the other CPU vendors come up with in response.

Acknowledgements
Many thanks to the people who took the time to review this article and the people who put me in touch with them.

I’d also like to thank the people who responded to the previous version, I wasn’t able to respond to many of the e-mails but I did read them all.

Miscellaneous Notes
Fanboi hyp3.

Contrary to some opinions the first version was not an attempt to hype the Cell processor.

The original version of this was written after I read a debate on the Cell in late 2004. The Cell was clearly not widely understood at the time and I felt I could provide an explanation.

I am a technology enthusiast and when I see some very advanced technology heading in our direction I’m interested and I’m excited. Some of this emotion evidently made it into the first version of the article.

Neither this nor the previous version was requested or paid for by Sony, Toshiba, IBM or anyone else.

About the Author

Nicholas Blachford lives in Paris.

He is currently (slowly) learning French and designing / writing a softsynth for OS X.

He is “talking to” Cell-Industries.

You might find him hanging about at whyzzat.

'''This article was extracted from http://www.blachford.info/computer/Cell/Cell0_v2.html. It is copyrighted and it shall not be used with commercial intentions'''