User:Vivrax/Cray MTA

Cray MTA (formerly known as Tera MTA, for Multithreaded Architecture) is a highly multithreaded scalar shared-memory multiprocessor supercomputer architecture developed by Tera Computer Company and later expanded by Cray. It features a barrel-processor arrangement with 128 hardware threads per CPU, fine-grained communication and synchronization between threads, and high latency tolerance for irregular computations. Five generations of supercomputer systems have been built upon the MTA architecture, all bearing the same name as the underlying architecture. Its predecessors are the HEP and Horizon supercomputer architectures. The principal architect of all generations was Burton J. Smith.

History
The MTA design cost Tera $6.7 million (1995), $10.5 million (1996), $13.2 million (1997), $13.7 million (1998) and $15.2 million (1999), with 50 full-time engineers in 1996 and 70 in 1999. DARPA invested $19.3 million into the MTA design from 1990 to 1999. As of 1998, Tera had been granted one software patent for its compiler optimizations and had 15 patents pending for software and hardware inventions.

The core CPU architecture remained the same from 1990 throughout all generations, the only changes being a manufacturing process switch (GaAs to CMOS), support for various custom and off-the-shelf memory modules, faster clock speeds, and minor changes in the interconnect topology (a modified Cayley graph before reverting to the 3D torus). Thus, software compiled for the initial Tera should run on any later system.

In 1998, Tera revealed plans to deploy the Tera MTA to European partners, though that never materialized. In 2014, YarcData, a Cray division, announced that its uRiKA line of systems (GD, based upon Cray XMT2, and XA, based on Intel Xeon CPUs) was being used by "CSCS, the Mayo Clinic, Noblis, ORNL, QinetiQ, Pittsburgh Supercomputing Center and Sandia National Laboratories, as well as leading government and intelligence organizations, financial services firms, life sciences companies, and telecommunications providers", though only the first three and PSC actually had an XMT2 system deployed, while ORNL and Sandia had older systems (Cray XMT) predating YarcData and uRiKA. The software used in uRiKA-GD later became the Cray Graph Engine.

Development on Threadstorm4, as well as the whole MTA architecture, ended silently after XMT2, even though Cray never officially discontinued either XMT or XMT2. Probable reasons include Burton J. Smith's departure for Microsoft in 2005, as well as competition from commodity processors such as Intel's Xeon and Xeon Phi. A 2014 research paper from the University of Washington suggested that the Cray XMT was not overcoming its narrow range of applicability, demonstrating that a system of commodity x86 processors and specialized software could achieve performance of the same order of magnitude as Threadstorm CPUs (between 2.6x faster and 4.4x slower than the XMT) for graph-related problems. In the CUG 2015 proceedings, a presentation showed a successful port of the uRiKA-GD system from the MTA architecture to XC30/40 with good preliminary performance, implying the redundancy of the Cray XMT2 infrastructure by 2015. In late 2016, the uRiKA-GD (the last user of the Cray XMT2 infrastructure and Threadstorm4 CPUs) and uRiKA-XA (based upon Xeon E5 v3 CPUs) appliances merged into uRiKA-GX, which utilized Xeon E5 v4 processors. As of 2020, Cray has removed all customer documentation on the XMT, XMT2 and uRiKA-GD from its online catalogue. Most XMT- and XMT2-based systems had been decommissioned by 2020.

Tera Computer System
Tera Computer System is the first generation of the MTA architecture, as described in a 1990 paper. It was never commercially produced as the system described in the paper, though prototypes were probably made during its development. The architectural design choices the Tera Computer System set forth were retained by all subsequent generations (except for the number of additional state bits and the hashing granularity), thus ensuring backwards compatibility.


 * 64-bit VLIW ISA, 3 RISC-like instructions per VLIW instruction
 * load-store architecture
 * 64-bit single-core barrel processor with 128 streams (each mapping 1 thread)
 * 3 functional units (Memory, Adder, Control)
 * 128 separate register sets (32 GP, 8 target, status word with PC)
 * 16 protection domains
 * fair, full context-switching at each cycle (zero-cost switching)
 * pipeline of 21 cycles
 * no data caches
 * latency-tolerant, targeted at problems with a lot of unpredictable memory accesses (graph problems, semantic databases, genome sequencing, other big data)
 * "CPU shall never stall, memory lanes should be saturated"
 * big unified shared content-addressable memory with scrambling at memory word granularity
 * 6 additional state bits on each 64-bit memory word for tagged memory
 * support for IEEE 754 single- and double-precision floating-point numbers, 128-bit quad-precision floating-point numbers, 64-bit complex numbers and arbitrary-length unsigned numbers

Tera MTA
Tera MTA is a scalable multithreaded shared-memory supercomputer architecture by Tera Computer Company, based on the second generation of the MTA architecture and its first commercial implementation.

Each MTA processor (CPU) has a high-performance ALU with many independent register sets, each running an independent thread. For example, the Cray MTA-2 uses 128 register sets and thus 128 threads per CPU/ALU. All MTAs to date use a barrel processor arrangement, with a thread switch on every cycle, with blocked (stalled) threads skipped to avoid wasting ALU cycles. When a thread performs a memory read, execution blocks until data returns; meanwhile, other threads continue executing. With enough threads (concurrency), there are nearly always runnable threads to "cover" for blocked threads, and the ALUs stay busy. The memory system uses full/empty bits to ensure correct ordering. For example, an array A is initially written with "empty" bits, and any thread reading a value from A blocks until another thread writes a value. This ensures correct ordering, but allows fine-grained interleaving and provides a simple programming model. The memory system is also "randomized", with adjacent physical addresses going to different memory banks. Thus, when two threads access memory simultaneously, they rarely conflict unless they are accessing the same location.
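The full/empty discipline described above can be modelled in software. The sketch below is purely illustrative (the MTA implements this per word in hardware, with no software locks); class and method names are ours:

```python
import threading

class FEWord:
    """Toy model of a memory word with a full/empty tag bit."""
    def __init__(self):
        self._cv = threading.Condition()
        self._full = False      # the word starts "empty"
        self._value = None

    def write_full(self, value):
        # Producer: store the value and set the tag to "full".
        with self._cv:
            self._value = value
            self._full = True
            self._cv.notify_all()

    def read_when_full(self):
        # Consumer: block until the tag is "full", then read.
        with self._cv:
            self._cv.wait_for(lambda: self._full)
            return self._value

# Readers of A start first and simply block until each word is written,
# giving correct ordering without explicit barriers.
A = [FEWord() for _ in range(4)]
results = []
readers = [threading.Thread(target=lambda i=i: results.append(A[i].read_when_full()))
           for i in range(4)]
for t in readers:
    t.start()
for i, t in enumerate(readers):
    A[i].write_full(i * 10)
for t in readers:
    t.join()
```

In the real hardware the tag bit lives alongside every 64-bit word, so this producer/consumer pattern needs no separate lock objects and works at single-word granularity.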

A goal of the MTA is that porting codes from other machines is straightforward while still giving good performance. A parallelizing FORTRAN compiler can produce high performance for some codes with little manual intervention. Where manual porting is required, the simple and fine-grained synchronization model often allows programmers to write code the "obvious" way yet achieve good performance. A further goal is that programs for the MTA will be scalable – that is, when run on an MTA with twice as many CPUs, the same program will have nearly twice the performance. Both of these are challenges for many other high-performance computer systems.

An uncommon feature of the MTA is that several workloads can be interleaved with good performance. Typically, supercomputers are dedicated to one task at a time. The MTA allows idle threads to be allocated to other tasks with very little effect on the main calculations.

The MTA uses many register sets, so each register access is slow. Although concurrency (running other threads) typically hides latency, slow register-file access limits performance when there are few runnable threads. In existing MTA implementations, single-thread performance is 21 cycles per instruction, so performance suffers when there are fewer than 21 runnable threads per CPU.
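The relationship between pipeline depth and thread count can be expressed as a toy throughput model (illustrative only; the function and parameter names are ours):

```python
# With a 21-stage pipeline, a given stream can issue at most once every
# 21 cycles, so >= 21 runnable streams are needed to issue an
# instruction on every cycle. The model ignores memory stalls.
PIPELINE_DEPTH = 21

def issue_rate(clock_mhz, runnable_streams):
    """Return (fraction of cycles the ALU issues, per-stream MHz)."""
    utilization = min(runnable_streams / PIPELINE_DEPTH, 1.0)
    # A stream issues every PIPELINE_DEPTH cycles at best, and less
    # often once more than PIPELINE_DEPTH streams compete.
    per_stream_mhz = clock_mhz / max(runnable_streams, PIPELINE_DEPTH)
    return utilization, per_stream_mhz
```

For example, with a hypothetical 210 MHz clock, 7 runnable streams keep the ALU only a third busy, while 21 or more streams keep it fully busy at 10 MHz per stream.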

The MTA-1, -2, and -3 use no data caches. This reduces CPU complexity and avoids cache-coherency problems. However, the absence of data caching introduces two performance problems. First, the memory system must support the full data-access bandwidth of all threads, even for unshared (and thus cacheable) data, so good system performance requires very high memory bandwidth. Second, memory references take 150-170 cycles, a much higher latency than even a slow cache, increasing the number of runnable threads required to keep the ALU busy.

Full/empty status changes use polling, with a timeout for threads that poll too long. A timed-out thread may be descheduled and the hardware context used to run another thread; the OS scheduler sets a "trap on write" bit so the waited-for write will trap and put the descheduled thread back in the run queue. Where the descheduled thread is on the critical path, performance may suffer substantially.
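The retry-then-deschedule scheme can be sketched as follows. This is a simplified software model of the mechanism described above; the data layout, names, and poll limit are all illustrative:

```python
# A word is modelled as a dict with 'full', 'value', 'trap_on_write'
# and 'waiters' fields; streams are plain strings on a run queue.
POLL_LIMIT = 8   # illustrative timeout; real hardware bounds differ

def access(word, run_queue, stream):
    """Try to read `word`; on timeout, deschedule `stream`."""
    for _ in range(POLL_LIMIT):          # hardware retry loop
        if word["full"]:
            return word["value"]
    word["trap_on_write"] = True         # timed out: arm trap-on-write...
    word["waiters"].append(stream)       # ...record the waiter...
    run_queue.remove(stream)             # ...and free the hardware context
    return None

def write_full(word, run_queue, value):
    """The waited-for write; the trap re-enqueues descheduled waiters."""
    word["full"], word["value"] = True, value
    if word["trap_on_write"]:
        run_queue.extend(word["waiters"])   # OS puts waiters back
        word["waiters"].clear()
        word["trap_on_write"] = False

word = {"full": False, "value": None, "trap_on_write": False, "waiters": []}
run_queue = ["s0", "s1"]
access(word, run_queue, "s0")        # times out; s0 is descheduled
write_full(word, run_queue, 42)      # trap fires; s0 returns to the queue
```

The key point the model captures is that a short wait costs only polling, while a long wait trades the hardware context to another thread at the price of a scheduling round trip.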

The MTA is latency-tolerant, including irregular latency, giving good performance on irregular computations if there is enough concurrency to "cover" delays. The latency-tolerance hardware may be wasted on regular calculations, including those with latency that is high but which can be scheduled easily.

The operating system of the Tera MTA was Tera MTX, a fully distributed and symmetric implementation of UNIX. The system ran stand-alone and did not have dedicated service nodes or front ends, as was common in later generations.

Unisys Corporation was contracted to provide semiconductor test and packaging services (1996-1998), while Vitesse (1996-1997, 2000) and TriQuint (1996, 1998-2000) were contracted to supply GaAs wafers.

MTA and MTA-2 systems were constructed from resource modules. Each resource module measures approximately 5 by 7 by 32 inches and contains:
 * a computational processor (CP)
 * an I/O processor (IOP) connected to an I/O device via HIPPI
 * memory units.

In addition to the SDSC researchers, early users of the MTA system included Boeing, Caltech, Los Alamos National Laboratory, the National Energy Research Scientific Computing Center and Sanders, a Lockheed Martin company. The only deployed Tera MTA system in existence, at the San Diego Supercomputer Center, was retired in September 2001, roughly a year after its last upgrade to 16 CPUs.

MTA CPU
The MTA CPU comprises 24 GaAs chips, based on Vitesse's FX350K gate arrays, with a very high thermal dissipation of between 20 and 48 W per chip, resulting in about 1,000 W for the whole CPU package. The custom FX350K chips were produced using Vitesse's 0.6 μm "H-GaAs III" MESFET process. The CPU is packaged in a custom 524-ball BGA. The MTA processor design was a joint effort between Tera and the Design Services group of Cadence Design Systems.


 * 8 KB level-1 and 2 MB level-2 instruction caches
 * Up to 8 concurrent memory references per thread

Memory subsystem

 * custom memory modules
 * 4 additional state bits per 64-bit memory word (reduced from 6 bits in the Tera Computer System, removed 2 trap bits)


 * 4-way associative, 512-entry TLB
 * 47-bit memory address space, though only 42 bits are used
 * segment sizes from 8 KB to 256 MB
 * word-granular address hashing

Interconnect system
Each node has five ports, four to neighboring nodes and one to a resource board. Each port simultaneously transmits and receives an entire 164-bit packet every 3 ns clock cycle. Of the 164 bits, 64 are data, so the data bandwidth per port is 2.67 GB/s in each direction, for a peak communication bandwidth of 13.3 GB/s.
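The per-port and per-node bandwidth figures follow directly from the packet format and clock (a back-of-the-envelope check of the numbers above):

```python
# 64 data bits per 164-bit packet, one packet per 3 ns clock cycle.
bytes_per_cycle = 64 / 8                      # 8 bytes of payload
cycle_s = 3e-9                                # 3 ns port clock
port_gbs = bytes_per_cycle / cycle_s / 1e9    # ~2.67 GB/s per direction
node_gbs = 5 * port_gbs                       # 5 ports: ~13.3 GB/s peak
```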

Cray MTA-2
Cray MTA-2 is a scalable multithreaded shared-memory supercomputer architecture by Cray, based on the third generation of the Tera MTA architecture. Presented in 2001, MTA-2 was an attempt to produce a cheaper and more reliable implementation of the MTA architecture. The manufacturing process switched from GaAs to CMOS, dramatically lowering both the cost of production and maintenance and the thermal dissipation of the processor package, while essentially retaining the same CPU architecture; the network design regressed from a 4-D torus topology to a less efficient but more scalable Cayley graph topology. Despite addressing major flaws of the Tera MTA, the Cray MTA-2 was not a commercial success, selling only one medium-sized system of 40 CPUs.

Taiwan Semiconductor Manufacturing Company was contracted in 1999 to produce CMOS wafers. Kyocera America, Inc. was contracted for MTA-2 packaging, Samsung Semiconductor for MTA-2 custom memory, Sun Microsystems, Inc. for I/O systems of the MTA-2 and InterCon Systems, Inc. for custom interconnects. The operating system of the Cray MTA-2 was Cray MTX, Tera MTX's successor.

In February 2001, Cray announced a $5.4 million contract for a 28-CPU MTA-2 for the U.S. Naval Research Laboratory, to be delivered in Q4 2001. In February 2002, Cray delivered a 16-CPU MTA-2 and upgraded it in stages to a 40-CPU MTA-2 in late 2002. The largest known prototype was a 44-CPU system at Cray. A presentation from the CUG 2006 proceedings suggested that "10 MTA processors [are] likely as fast as 32,000 BlueGene/L processors" for graph-related problems.

Torrent CPU
Torrent has a TDP of 50 W. Eight trap registers were added per register set (retained through XMT2). The CMOS process was not disclosed, but considering TSMC's technology at the time (between the signing of the contract in 1999 and the first delivered system in 2002), the chips could have been produced using a 180 nm (1999) or 130 nm (2001) process.

Cray XMT
Cray XMT (Cray eXtreme MultiThreading, codenamed Eldorado) is a scalable multithreaded shared-memory supercomputer architecture by Cray, based on the fourth generation of the Tera MTA architecture and targeted at large graph problems (e.g. semantic databases, big data, pattern matching). Presented in 2005, it supersedes the earlier, unsuccessful Cray MTA-2. It uses Threadstorm3 CPUs inside Cray XT3 blades. Designed to make use of commodity parts and existing subsystems from other commercial systems, it alleviated the shortcomings of the Cray MTA-2's high cost of fully custom manufacture and support. It brought various substantial improvements over the Cray MTA-2, most notably more than doubling the clock speed, nearly tripling the peak performance, and vastly increasing the maximum CPU count to 8,192 and the maximum memory to 64 TB, with a maximum data TLB reach of 128 TB. To provide compatibility with existing XT3 infrastructure, commodity memory and higher network bandwidth, engineers had to sacrifice the higher network and memory speeds of the MTA-2.

Cray XMT uses a scrambled content-addressable memory model with 64-byte (8-word) granularity on DDR1 ECC modules to implicitly load-balance memory accesses across the whole shared global address space of the system. There are no hardware interrupts, and hardware threads are allocated by an instruction, not by the OS. The front end (login, I/O and other service nodes, utilizing AMD Opteron processors and running SLES Linux) and back end (compute nodes, utilizing Threadstorm3 processors and running MTK, a simple BSD Unix-based microkernel) communicate through the LUC (Lightweight User Communication) C++ interface, an RPC-style bidirectional client/server interface.

Research and development of the Cray XMT was co-funded by the U.S. government. Though presented in 2005, the first early version of the system was shipped in September 2007, with smaller shipments in 2008 and 2009. It was declared a current product in Cray's 10-K filing of 2010. The largest known prototype was a 512-CPU system at Cray.

Threadstorm3
Threadstorm3 (referred to as "MT processor" and Threadstorm before XMT2) is a 64-bit single-core VLIW barrel processor (compatible with the 940-pin Socket 940 used by AMD Opteron processors) with 128 hardware streams, onto each of which a software thread can be mapped (effectively creating 128 hardware threads per CPU), running at 500 MHz and using the MTA instruction set or a superset of it. It has a 128 KB, 4-way associative data buffer. Each Threadstorm3 has 128 separate register sets and program counters (one per stream), which are fairly and fully context-switched at each cycle. Its estimated peak performance is 1.5 GFLOPS. It has 3 functional units (memory, fused multiply-add and control), which receive operations from the same MTA instruction and operate within the same cycle. Each stream has 32 general-purpose registers, 8 target registers and a status word containing the program counter. High-level control of job allocation across threads is not possible. Due to the MTA's pipeline length of 21, each stream is selected to execute instructions again no sooner than 21 cycles later, effectively capping sequential single-thread speed at 23.8 MHz. The TDP of the processor package is 30 W.
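The quoted single-thread cap follows directly from the clock and pipeline depth; the peak-performance figure is consistent with three operations per VLIW instruction per cycle (a back-of-the-envelope check, not an official derivation):

```python
clock_hz = 500e6        # Threadstorm3 clock
pipeline = 21           # MTA pipeline depth
ops_per_instr = 3       # 3 functional units per VLIW instruction

single_thread_mhz = clock_hz / pipeline / 1e6   # ~23.8 MHz per stream
peak_gops = clock_hz * ops_per_instr / 1e9      # 1.5 G ops/s, matching
                                                # the quoted 1.5 GFLOPS
```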

Due to their per-cycle thread-level context switch, the performance of Threadstorm CPUs is not constrained by memory access time. In a simplified model, at each clock cycle an instruction from one of the threads is executed and another memory request is queued, with the understanding that by the time the next round of execution is ready, the requested data has arrived. This is in contrast to many conventional architectures, which stall on memory access. The architecture excels at data-walking schemes where subsequent memory accesses cannot easily be predicted and which are thus poorly suited to a conventional cache model.

TSMC was contracted in 2006 to produce Threadstorm3 chips and later the SeaStar2 router chip. Again, the manufacturing process for the chips was not disclosed, but given TSMC's technology at the time, a 90 nm (2004) or 65 nm (2006) process could have been used for Threadstorm3.

Memory subsystem
Cray XMT uses a shared byte-addressable memory system with 64-bit memory words, each of which has 4 additional "Extended Memory Semantics" bits (full/empty, forwarding and 2 trap bits) associated with it, enabling lightweight, fine-grained synchronization on all memory between all CPUs. Memory is implemented using commodity DDR components (with modules from 4 GB to 16 GB) and is protected against single-bit failures using ECC. Each module has a single access port and is 128 bits wide. Logical addresses are hashed to physical addresses in 8-word blocks, rather than the word-granularity hashing used in the MTA-2's custom memory system. Threadstorm3 also introduced a local, non-coherent 128 KB, 4-way associative data buffer, previously absent from the architecture.
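Block-granular address hashing can be sketched as follows. Cray's actual scrambling function is not public, so the mixing function below is a stand-in; only the granularity (8-word, 64-byte blocks sharing a home bank) reflects the text:

```python
WORD_BYTES = 8
BLOCK_WORDS = 8     # XMT hashes in 8-word blocks (MTA-2: single words)

def home_bank(logical_addr, n_banks):
    """Map a logical byte address to a memory bank (illustrative)."""
    block = logical_addr // (WORD_BYTES * BLOCK_WORDS)
    # Stand-in mixing function (splitmix-style); the real one differs.
    h = (block * 0x9E3779B97F4A7C15) & 0xFFFFFFFFFFFFFFFF
    h ^= h >> 31
    return h % n_banks

# Words inside one 64-byte block share a bank; consecutive blocks are
# scattered across banks, spreading contiguous accesses system-wide.
banks = [home_bank(a, 16) for a in range(0, 4096, 8)]
```

The design trade-off this illustrates: hashing at 8-word rather than 1-word granularity lets each DDR access fetch a full block from one module, at the cost of slightly coarser load balancing than the MTA-2's word-level scrambling.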

Scorpio
After launching XMT, Cray researched a possible multi-core variant of the Threadstorm3, dubbed Scorpio. Most of Threadstorm3's features would be retained, including the multiplexing of many hardware streams onto an execution pipeline and the implementation of additional state bits for every 64-bit memory word. Cray later abandoned Scorpio, and the project yielded no manufactured chip.

Cray XMT2
Cray XMT2 (also "next-generation XMT" or simply XMT) is a scalable multithreaded shared-memory supercomputer architecture by Cray, based on the fifth generation of the Tera MTA architecture. Presented in 2011, it supersedes Cray XMT, which had issues with memory hotspots. It uses Threadstorm4 CPUs inside Cray XT5 blades and, compared to XMT, increases memory capacity eightfold to 512 TB and memory bandwidth threefold, by running memory at 300 MHz instead of 200 MHz, using twice the memory modules per node, and moving to DDR2. It introduces the Node Pair Link inter-Threadstorm connect, as well as memory-only nodes, whose Threadstorm4 packages have their CPU and HyperTransport 1.x components disabled. The underlying scrambled content-addressable memory model is inherited from XMT. XMT2 is the first iteration of the MTA architecture to reduce the additional state bits from 4 to 2 (full/empty and extended, removing the trap bits).

Cray's 10-K filings and product websites never mention XMT2 or "next-generation XMT" as offered products, instead listing uRiKA-GD utilizing Threadstorm4 CPUs (marketed as "Graph Accelerators") until late 2016 (when uRiKA-GD merged with uRiKA-XA into uRiKA-GX, replacing Threadstorm4 with Intel Xeon E5 v4), while making no direct mention of their memory subsystems or the XT5 infrastructure. Pricing of uRiKA configurations has not been made public, but according to Arvind Parthasarathi, then-YarcData general manager, a low-end setup (probably the smallest, one-cabinet XMT2 configuration of 64 CPUs) cost in the low hundreds of thousands of dollars. According to publicly available data, Cray sold two uRiKA-64 systems and one uRiKA-128 system, as well as two more uRiKA configurations of unknown size.

Threadstorm4
Threadstorm4 (also "Threadstorm IV" and "Threadstorm 4.0") is a 64-bit single-core VLIW barrel processor (compatible with the 1207-pin Socket F used by AMD Opteron processors) with 128 hardware streams, very similar to its predecessor, Threadstorm3. It features an improved, DDR2-capable memory controller.

TSMC was contracted to produce Threadstorm4 chips. As with previous generations, the manufacturing process for the chips was not disclosed, but given TSMC's available technology for CPU ICs in 2011, a 40 nm, 28 nm or 20 nm process could have been used for Threadstorm4. The TDP was not publicly revealed, though it is probably the same 30 W as its predecessor's.

Memory subsystem
Cray intentionally decided against a DDR3 controller, citing the reuse of existing Cray XT5 infrastructure and DDR2's shorter burst length. Though DDR3's longer burst length could be compensated for by its higher speeds, that would also require more power, which Cray's engineers wanted to avoid.