Alpha 21464

The Alpha 21464 is an unfinished microprocessor that implements the Alpha instruction set architecture (ISA) developed by Digital Equipment Corporation and later by Compaq after it acquired Digital. The microprocessor was also known as EV8 (codenamed Araña). Slated for a 2004 release, it was canceled on 25 June 2001 when Compaq announced that Alpha would be phased out in favor of Itanium by 2004. When it was canceled, the Alpha 21464 was at a late stage of development but had not been taped out.

The 21464 originated in the mid-1990s, when computer scientist Joel Emer was inspired by Dean Tullsen's research into simultaneous multithreading (SMT) at the University of Washington. Emer had researched the technology in the late 1990s and began to promote it once he was convinced of its value. Compaq announced that the next Alpha microprocessor would use SMT at Microprocessor Forum in October 1999. At that time, systems using the Alpha 21464 were expected to ship in 2003.

Description
The microprocessor was an eight-issue superscalar design with out-of-order execution, four-way SMT and a deep pipeline. It fetched 16 instructions from a 64 KB two-way set-associative instruction cache. The branch predictor then selected the "good" instructions and entered them into a collapsing buffer. (This allowed for a fetch bandwidth of up to 16 instructions per cycle, depending on the taken branch density.) The front-end had significantly more stages than previous Alpha implementations, and as a result the 21464 had a minimum branch misprediction penalty of 14 cycles. The microprocessor used an advanced branch prediction algorithm to minimize these costly penalties.
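
The effect of taken branch density on fetch bandwidth can be illustrated with a toy model. The sketch below is not the 21464's actual fetch logic; the Slot type, the select_and_collapse function, and the forward-only handling of branch targets are all assumptions made for illustration. It keeps only the instructions on the predicted path of one 16-instruction fetch block and packs them contiguously, as a collapsing buffer would.

```python
from dataclasses import dataclass
from typing import List, Optional

FETCH_WIDTH = 16   # instructions fetched per cycle, per the description above
INSTR_BYTES = 4    # Alpha instructions are 32 bits wide


@dataclass
class Slot:
    """One fetched instruction (hypothetical type, for illustration only)."""
    pc: int
    predicted_taken: bool = False   # branch predicted taken by the predictor
    target: Optional[int] = None    # predicted target address, if any


def select_and_collapse(slots: List[Slot], start_pc: int) -> List[Slot]:
    """Return the predicted-path ("good") instructions, packed contiguously.

    Simplifying assumption: only forward targets inside the fetched block
    are followed; any other predicted-taken branch ends the group.
    """
    by_pc = {s.pc: s for s in slots}
    good: List[Slot] = []
    pc = start_pc
    while pc in by_pc and len(good) < FETCH_WIDTH:
        slot = by_pc[pc]
        good.append(slot)
        if slot.predicted_taken:
            if slot.target is None or slot.target <= pc:
                break               # backward or unknown target: stop here
            pc = slot.target        # skip the instructions that are jumped over
        else:
            pc += INSTR_BYTES
    return good


def straight_line_block() -> List[Slot]:
    """Sixteen sequential instructions starting at address 0."""
    return [Slot(pc) for pc in range(0, FETCH_WIDTH * INSTR_BYTES, INSTR_BYTES)]
```

In this model a block with no taken branches delivers all 16 instructions, while a predicted-taken branch early in the block reduces the number of useful instructions delivered in that cycle.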

Implementing SMT required the replication of certain resources such as the program counter: instead of one program counter, there were four, one for each thread. However, very little logic after the front-end needed to be expanded for SMT support. The register file contained 512 entries, but its size was determined by the maximum number of in-flight instructions, not SMT. Access to the register file required three pipeline stages due to the physical size of the circuit. Up to eight instructions from four threads could be dispatched to eight integer and four floating-point execution units every cycle. The 21464 had a 64 KB data cache (Dcache), organized as eight banks to support dual-porting. This was backed by an on-die 3 MB, six-way set-associative unified secondary cache (Scache).
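
The replication of the program counter can be pictured with a minimal sketch. The FrontEnd class and its round-robin thread selection are hypothetical: the text above does not say how the 21464 chose which thread to fetch from, so the policy here is only an assumption that keeps the example short.

```python
NUM_THREADS = 4    # four-way SMT, per the description above
INSTR_BYTES = 4    # Alpha instructions are 32 bits wide


class FrontEnd:
    """Toy front-end with one program counter per hardware thread."""

    def __init__(self, entry_points):
        if len(entry_points) != NUM_THREADS:
            raise ValueError("one entry point per thread is required")
        self.pcs = list(entry_points)   # replicated state: one PC per thread
        self.next_thread = 0            # assumed round-robin selection

    def fetch(self, count):
        """Pick a thread, return (thread id, fetch address), advance its PC."""
        tid = self.next_thread
        start = self.pcs[tid]
        self.pcs[tid] = start + count * INSTR_BYTES
        self.next_thread = (tid + 1) % NUM_THREADS
        return tid, start
```

Only the per-thread array of program counters is replicated in this sketch; everything downstream of fetch would be shared, which mirrors the point that little logic after the front-end needed to grow for SMT.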

The integer execution unit made use of a new structure: the register cache. The register cache was not meant to mitigate the three-cycle register file latency (as some reports have claimed), but to reduce the complexity of operand bypass management. The register cache held all the results produced by the ALU and load pipes for the previous N cycles (N was approximately 8). The register cache structure was an architectural relabeling of what previous processors had implemented as a distributed mux.
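
The behavior described above, a structure that holds every result from the last N cycles, can be sketched as a sliding window. The RegisterCache class below is a hypothetical software model, not the hardware design: the dictionary-per-cycle layout and the default window of 8 are assumptions taken from the approximate figure in the text.

```python
from collections import deque


class RegisterCache:
    """Toy model: results produced during the most recent `window` cycles."""

    def __init__(self, window=8):
        # One dict per cycle, mapping register number to the value produced.
        # deque(maxlen=...) discards the oldest cycle automatically.
        self.cycles = deque(maxlen=window)

    def tick(self, results):
        """Record the results produced in one cycle (may be empty)."""
        self.cycles.append(dict(results))

    def lookup(self, reg):
        """Return the most recent value for `reg`, or None if it has aged out."""
        for produced in reversed(self.cycles):
            if reg in produced:
                return produced[reg]
        return None   # a consumer would read the main register file instead
```

A consumer that issues within the window finds its operand here; anything older has to come from the main register file, which is the same job a bypass network performs.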

The system interface was similar to that of the Alpha 21364. There were integrated memory controllers that provided ten RDRAM channels. Multiprocessing was facilitated by a router that provided links to other 21464s, and it architecturally supported 512-way multiprocessing without glue logic.

It was to be implemented in a 0.125 μm (sometimes referred to as 0.13 μm) complementary metal–oxide–semiconductor (CMOS) process with seven layers of copper interconnect, partially depleted silicon-on-insulator (PD-SOI), and low-κ dielectric. The transistor count was estimated to be 250 million and die size was estimated to be 420 mm².

Tarantula
Tarantula was the codename for an extension of the Alpha architecture under consideration, and for a derivative of the Alpha 21464 that implemented it. It was canceled while still in development, before any implementation work had started, and before the 21464 was finished. The extension was to provide Alpha with a vector processing capability. It specified thirty-two 64 by 128-bit (8,192-bit or 1 KB) vector registers, approximately 50 vector instructions, and an unspecified number of instructions for moving data to and from the vector registers. Other EV8 follow-up candidates included a multicore design with two EV8 cores and a 4.0 GHz operating frequency.
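
The register sizes quoted above can be checked with a few lines of arithmetic. The constant names are illustrative, and the total vector state is a figure derived here from the quoted numbers rather than one stated in the text.

```python
NUM_VECTOR_REGISTERS = 32
DIM_A = 64     # first dimension of "64 by 128-bit"
DIM_B = 128    # second dimension of "64 by 128-bit"

bits_per_register = DIM_A * DIM_B            # 8,192 bits
bytes_per_register = bits_per_register // 8  # 1,024 bytes, i.e. 1 KB
total_vector_state_bytes = NUM_VECTOR_REGISTERS * bytes_per_register

print(bits_per_register, bytes_per_register, total_vector_state_bytes)
```

Under these figures the full vector register state comes to 32 KB, half the capacity of the 21464's 64 KB data cache.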