User talk:Csedukit

PowerPC Definition: The PowerPC architecture is a collaborative effort of Apple, IBM, and Motorola to create a new generation of high-performance microprocessors that can be used in everything from personal computers, workstations, servers, and multiprocessor systems to embedded microcontrollers. It is a RISC-based computer architecture; the name is derived from IBM's name for the architecture, Performance Optimization With Enhanced RISC.

Description: The PowerPC is based on IBM's highly successful POWER architecture. The POWER architecture was designed for scientific workstations and has been optimized for both integer and floating-point math operations. It also incorporates a "branch processor," which attempts to minimize the impact of branch instructions on the processor's performance. When the Apple-IBM-Motorola consortium set out to design the PowerPC, the members modified the POWER architecture to reduce manufacturing costs and make the design more suitable for desktop computers. They eliminated parts of the POWER instruction set that made the architecture more difficult to implement but had minimal impact on performance. While the architects were modifying the instruction set, they also removed dependencies between instructions and added features that simplify building multiprocessor systems.

The PowerPC Family of Microprocessors

601 - The 601 is a fusion of the POWER architecture and the PowerPC architecture. It is designed to drive mainstream desktop systems. A Macintosh with a 601 will deliver integer performance three to five times that of today's high-end 68040-based Macintosh systems, and floating-point performance around ten times that of those systems.

603 - The 603 is the first PowerPC-only implementation of the PowerPC architecture. It is designed for low cost and low power consumption, and will be used in portable and low-cost desktop PowerPC-based Macintosh systems. In many ways, the 603 could over time become Apple's replacement for the 68030.

604 - The 604 is designed for mainstream desktop personal computers. It will cost about as much as the 601 but will deliver higher performance.

620 - The 620, currently still in the design phase, is a high-performance microprocessor that Motorola and IBM believe will be well suited to very high-end personal computers, workstations, servers, and multiprocessor systems.

The Machine: The main theme characterizing RISC computing is keeping the CPU as busy as possible so that cycles are not wasted. This is achieved in two principal ways: superscalar design and pipelining. The term superscalar refers to the CPU being a collection of semi-independent execution units operating in parallel, so that instructions can be issued to these units in parallel (and possibly out of order, as long as they are not interdependent). The figure illustrates a simplified view of the communication paths among the execution units (IU, BPU, FPU) and the supporting memory managers and cache. We'll take a quick look at the operation of each, concentrating particularly on features contributing to machine performance.

Instruction and Branch Units: Instructions march through the instruction queue (IQ) from Q7 toward Q0 as vacancies are created. New instructions are requested as soon as possible. If the cache is hit, as many as eight instructions (the whole IQ, or one cache sector) can be prefetched in one cycle. Otherwise further bus cycles are needed, but these normally overlap with currently executing instructions. Instructions can be issued from any of the lower four elements of the IQ to either the branch processor (BPU) or floating-point unit (FPU), as long as the decode stage of the target unit is vacant. The integer unit (IU) is fed only through Q0, which doubles as the IU decode stage. Instruction fetch is normally sequential, unless the BPU decides on a change of execution path.
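The march of instructions through the IQ can be sketched with a toy model. This is an illustrative Python sketch only; the class and method names are my own, not part of any PowerPC documentation.

```python
# Toy model of the eight-entry instruction queue (IQ) described above.
# Names and structure are illustrative, not the actual hardware design.

class InstructionQueue:
    SIZE = 8  # Q7 down to Q0

    def __init__(self):
        self.slots = []  # index 0 plays the role of Q0 (feeds the IU)

    def issue(self):
        """Issue the instruction at Q0; later entries march toward Q0."""
        return self.slots.pop(0) if self.slots else None

    def refill(self, stream):
        """Prefetch into vacancies. On a cache hit, up to a whole
        cache sector (eight instructions) can arrive in one cycle."""
        while len(self.slots) < self.SIZE and stream:
            self.slots.append(stream.pop(0))

iq = InstructionQueue()
program = [f"insn{i}" for i in range(12)]
iq.refill(program)   # the first eight instructions fill the IQ
first = iq.issue()   # Q0 issues; remaining entries shift toward Q0
iq.refill(program)   # the vacancy at the tail is refilled
```

A real 601 also dispatches to the BPU and FPU from the lower four slots; this sketch shows only the Q0 path to the IU.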

The BPU "owns" two registers holding branch target addresses, the link register (LR) and the count register (CTR), giving it relative independence from the other units. The LR also receives the return address following a branch, if any. The condition register (CR) provides the information necessary to resolve conditional branches; one thing a good compiler should do is schedule the instruction that updates the CR well ahead of the dependent branch instruction, to allow resolution as early as possible. The BPU can examine up to one branch instruction at a time. Unconditional branches are simply removed from the instruction stream, with fetching directed along the new path. Conditional branches carry a predictor bit indicating the more likely outcome of branch taken/not taken. When a conditional branch is encountered, instructions continue to execute along the predicted path, but not as far as the writeback stage, where registers are updated. When the condition is resolved, if the prediction was correct, writeback is enabled and execution continues as if no branch had occurred. If it was incorrect, the instruction unit backs up by flushing everything since the branch and fetching a new cache sector of instructions. This process of effectively removing branches from code is termed branch folding.
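The commit-or-flush behavior for predicted branches can be sketched as below. This is a hedged illustration of the general scheme, not the 601's actual control logic; the function and variable names are my own.

```python
# Illustrative sketch of static branch prediction with speculative
# execution: predicted-path instructions run but are held back from
# writeback until the branch resolves.

def run_branches(trace):
    """trace: list of (predicted_taken, actually_taken) pairs.
    Returns (commits, flushes): a correct prediction lets the
    speculative work commit; a misprediction flushes it and forces
    a refetch along the other path."""
    commits = flushes = 0
    for predicted, actual in trace:
        # Fetch has already continued down the predicted path...
        if predicted == actual:
            commits += 1   # prediction correct: enable writeback
        else:
            flushes += 1   # mispredict: discard speculative work, refetch
    return commits, flushes

commits, flushes = run_branches([(True, True), (True, False), (False, False)])
```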

IU: The integer unit does what you expect: arithmetic, logical, and bit-field operations on integers. It contains the general-purpose register file (GPR) and the integer exception register (XER). There are thirty-two GPRs, each 32 bits wide on the 601. Each handles either data manipulation or address calculation. They are dual-ported (as are the FPRs) to allow two independent accesses at once. The XER holds result flags such as carry and overflow from arithmetic operations. The IU handles address calculation for all execution units. All load and store requests (even for floating-point operands) are processed by the IU and passed from there to the memory management unit (MMU). The IU implements feed-forwarding, simultaneously making available the result of an integer execute stage to both the register writeback bus and the execute stage of any follow-on integer instruction waiting for that result.
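The payoff of feed-forwarding is one fewer stall cycle between dependent integer instructions. A minimal sketch, under the simplifying assumption that reading a result back from the register file costs one extra cycle:

```python
# Toy comparison (illustrative only) of two dependent integer
# instructions with and without feed-forwarding. Without forwarding,
# the second instruction must wait for the first result to reach the
# register file; with forwarding, the execute-stage result is passed
# straight to the dependent instruction's execute stage.

WRITEBACK_LATENCY = 1  # assumed: one extra cycle before the GPR is readable

def dependent_pair_cycles(forwarding):
    cycles = 1  # first instruction's execute stage
    if not forwarding:
        cycles += WRITEBACK_LATENCY  # stall until writeback completes
    cycles += 1  # second instruction's execute stage
    return cycles

without_fwd = dependent_pair_cycles(forwarding=False)  # 3 cycles
with_fwd = dependent_pair_cycles(forwarding=True)      # 2 cycles
```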

FPU: The floating-point unit contains the floating-point register file (FPR) and the status and control register (FPSCR). The thirty-two 64-bit registers handle either single- or double-precision operations. Only a subset of operations is handled in hardware; as with the 68040, the others must be emulated in software. Of interest, though, is the combined multiply-add instruction, which directly supports the vector and matrix algebra needed in common graphics transformations. It is also well suited to series expansions, speeding software emulation of transcendental functions. The FPSCR holds calculation result flags, such as overflow, NaN, and INF, and environment controls, such as rounding direction. At this time, the FPU does not support feed-forwarding.

MMU: The memory management unit handles the translation of logical to physical addresses, access privileges, memory protection, and virtual-memory paging. Performance is enhanced in this unit by several on-chip tables of recently used addresses, so that full translation can be bypassed whenever possible. Of course, load and store requests look to the cache first; misses are queued in the memory unit for servicing. The MMU can address up to 4 gigabytes of physical memory and 4 petabytes of virtual memory. Where to store all that is a separate issue (4 petabytes is roughly 6.2 million CD-ROMs).
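The multiply-add instruction mentioned above maps naturally onto series expansions: evaluating a polynomial by Horner's rule takes exactly one multiply-add per term. A hedged Python sketch (note that the hardware fmadd rounds once, while the `a * b + c` model here rounds twice, which is close enough for illustration):

```python
import math

def madd(a, b, c):
    """Models one multiply-add step (hardware would round once,
    this Python expression rounds twice)."""
    return a * b + c

def horner(coeffs, x):
    """Evaluate a polynomial via repeated multiply-adds; coeffs run
    from the highest-degree term down to the constant term."""
    acc = coeffs[0]
    for c in coeffs[1:]:
        acc = madd(acc, x, c)
    return acc

# Degree-10 Taylor series for e^x: one multiply-add per term.
coeffs = [1.0 / math.factorial(k) for k in range(10, -1, -1)]
approx_e = horner(coeffs, 1.0)
```

This is the pattern that speeds software emulation of transcendental functions: the whole approximation is a chain of multiply-adds with no separate multiply and add steps.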

Memory Unit: This unit buffers data transfers between the cache and memory. It contains a two-entry read queue and a three-entry write queue. Each entry is actually capable of holding eight consecutive words (a cache sector). Writing to memory is performed primarily to make room in the cache for new entries: the least recently used (LRU) entry in the cache is moved to the write queue, where it waits its turn for use of the system bus. Reads are performed mainly to load the cache. If not interdependent, waiting reads and writes may be performed out of order, according to priority. However, special instructions are available to strictly enforce program order of reads and writes when needed.

Cache: A 32-kbyte (write-back) cache is provided to minimize time spent waiting for off-chip accesses. In the 601 it is a unified cache, holding both instructions and data. In the future it will likely have a Harvard architecture, keeping instructions and data separate; that would simplify and speed up cache searching, and allow concurrent data and instruction accesses. An advantage of the unified design is that those nasty programs that modify their own code are given a chance of running successfully - making the PPC look even safer sometimes than a 68040. Of course, nobody you or I know writes self-modifying code, right? The cache is subdivided into 8 pages; each page contains 64 cache lines; each line is subdivided into 2 sectors (or blocks) of 8 32-bit words each. The sector is the smallest cacheable unit. Cache sectors are read and written in a special burst mode. To further reduce processing delays when a load or fetch misses in the cache, the specifically requested words are feed-forwarded to the waiting execution unit before the remainder of the sector read completes.
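The stated cache subdivisions can be checked with a few lines of arithmetic; this sketch just multiplies out the geometry given above.

```python
# The 601's cache geometry as described above; the arithmetic confirms
# that the stated subdivisions add up to the 32-kbyte total.

PAGES = 8
LINES_PER_PAGE = 64
SECTORS_PER_LINE = 2
WORDS_PER_SECTOR = 8
BYTES_PER_WORD = 4

sector_bytes = WORDS_PER_SECTOR * BYTES_PER_WORD   # 32 bytes: smallest cacheable unit
line_bytes = SECTORS_PER_LINE * sector_bytes       # 64 bytes per line
cache_bytes = PAGES * LINES_PER_PAGE * line_bytes  # total capacity: 32 kbytes
```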
Pipelining: The second major factor in keeping the CPU busy is pipelining, which refers to dividing the processing of an instruction into several independent serial stages - similar to a factory assembly line. Each execution unit is pipelined, but with a different number of stages; more stages allow a process to be broken into simpler steps. The BPU, whose task is already simple, has a combined decode/execute stage. The FPU, which has the most complicated operations to perform, has two execute stages. The term superpipelined is often used to characterize pipelines having more than four or five stages. For simplicity, we focus our discussion on the IU, whose four stages typify basic pipeline design. These stages are:

* fetch/dispatch - get the instruction from the stream into the execution unit's decoder
* decode - figure out what the instruction does and initiate requests for any needed operands
* execute - apply the indicated operation
* writeback - update the target register(s) with the result
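The assembly-line payoff of these four stages is easy to quantify: once the pipeline is full, one instruction completes per cycle, so n instructions take stages + n - 1 cycles rather than stages * n. A minimal sketch:

```python
# Throughput of an ideal four-stage pipeline (no stalls or branch
# mispredictions assumed) versus fully serial execution.

STAGES = ["fetch/dispatch", "decode", "execute", "writeback"]

def pipelined_cycles(n, stages=len(STAGES)):
    """Fill the pipeline once, then retire one instruction per cycle."""
    return stages + n - 1 if n else 0

def unpipelined_cycles(n, stages=len(STAGES)):
    """Each instruction runs all stages before the next one starts."""
    return stages * n

n = 100
fast = pipelined_cycles(n)    # 103 cycles
slow = unpipelined_cycles(n)  # 400 cycles
```

Real code falls short of this ideal whenever a stall or a mispredicted branch drains the pipeline, which is why the branch-folding and feed-forwarding machinery described earlier matters.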

TIP: A reduced instruction set computer (RISC) is a type of microprocessor that recognizes a relatively limited number of instructions. One advantage of reduced instruction set computers is that they can execute their instructions very quickly because the instructions are so simple.