University of Illinois Center for Supercomputing Research and Development

The Center for Supercomputing Research and Development (CSRD) at the University of Illinois (UIUC) was a research center funded from 1984 to 1993. It built the shared memory Cedar computer system, which included four hardware multiprocessor clusters, as well as parallel system and applications software. It was distinguished from the four earlier UIUC Illiac systems by starting with commercial shared memory subsystems that were based on an earlier paper published by the CSRD founders. Thus CSRD was able to avoid many of the hardware design issues that slowed the Illiac series work. Over its nine years of major funding, plus follow-on work by many of its participants, CSRD pioneered many of the shared memory architectural and software technologies on which much of 21st-century computing is based.

History
UIUC began computer research in the 1950s, initially for civil engineering problems. This was eventually succeeded by cooperative activities among the Math, Physics, and Electrical Engineering Departments to build the Illiac computer series, which led to the founding of the Computer Science Department in 1965.

By the early 1980s, a period of worldwide HPC expansion had arrived, including the race with the Japanese 5th generation system targeting innovative parallel applications in AI. HPC/supercomputing had emerged as a field, commercial supercomputers were in use by industry and labs (but little by academia), and academic architecture and compiler research were expanding. This led to the formation of the Lax committee to study the academic needs for focused HPC research, and to provide commercial HPC systems for university research. When HPC practitioner Ken Wilson won the Nobel Prize in Physics in 1982, he expanded his already strong advocacy of both, and soon several government agencies introduced HPC R&D programs.

As a result, the UIUC Center for Supercomputing R&D (CSRD) was formed in 1984 (with funding from DOE, NSF, and UIUC, as well as DoD's DARPA and AFOSR), under the leadership of three CS professors who had worked together since the Illiac 4 project: David Kuck (Director), Duncan Lawrie (Assoc. Dir. for SW) and Ahmed Sameh (Assoc. Dir. for applications), plus Ed Davidson (Assoc. Dir. for hardware/architecture), who joined from ECE. Many graduate students and post-docs were already contributing to constituent efforts; full-time academic professionals were hired, and other faculty cooperated. Up to 125 people were involved at the peak of the nine years of full CSRD operation.

The UIUC administration responded to the computing and scientific times. CSRD was set up as a Graduate College unit, with space in Talbot Lab. UIUC President Stanley Ikenberry arranged to have Governor James Thompson directly endow CSRD with $1 million per year to guarantee personnel continuity. CSRD management helped write proposals that led to a gift from Arnold Beckman of a $50 million building, the establishment of NCSA, and a new CSRD building (now CSL).

The CSRD plan for success took a major departure from earlier Illiac machines by integrating four commercially built parallel machines using an innovative interconnection network and global shared memory. Cedar was based on designing and building a limited amount of innovative hardware, driven by SW that was built on top of emerging parallel applications and compiler technology. By breaking the tradition of building hardware first and then dealing with SW details later, this codesign approach led to the name Cedar instead of Illiac 5.

Earlier work by the CSRD founders had intensively studied a variety of new high-radix interconnection networks, built tools to measure the parallelism in sequential programs, designed and built a restructuring compiler (Parafrase) to transform sequential programs into parallel forms, and invented parallel numerical algorithms. During the Parafrase development of the 1970s, several papers were published proposing ideas for expressing and automatically optimizing parallelism. These ideas influenced later compiler work at IBM, Rice University and elsewhere. Parafrase had been donated to Fran Allen's IBM PTRAN group in the late 1970s, Ken Kennedy had gone there on sabbatical and obtained a Parafrase copy, and Ron Cytron joined the IBM group from UIUC. Also, KAI was founded in 1979 by three Parafrase veterans (Kuck, Bruce Leasure, and Mike Wolfe), who wrote KAP, a new source-to-source restructurer.

The key Cedar idea was to exploit feasible-scale parallelism by linking together a number of shared memory nodes through an interconnection network and memory hierarchy. Alliant Computer Systems had obtained venture capital funding (in Boston), based on an earlier architecture paper by the CSRD team, and was then shipping systems. The Cedar team was thus immediately able to focus on designing hardware to link 4 Alliant systems and add a global shared memory to the Alliant 8-processor shared memory nodes. By contrast, other academic teams of the era pursued massively parallel systems (CalTech, later in cooperation with Intel), fetch-and-add combining networks (NYU), innovative caching (Stanford), dataflow systems (MIT), etc.

In sharp contrast, two decades earlier, the Illiac 4 team required years of work with state-of-the-art industry hardware technology leaders to get the system designed and built. The 1966 industrial hardware proposals for Illiac 4 hardware technology even included a GE Josephson junction proposal, which John Bardeen, co-developer of the theory of superconductivity for which he later shared a Nobel Prize, helped evaluate. After contracting with Burroughs Corp to build and integrate an all-transistor hardware system, lengthy discussions ensued about the semiconductor memory design (and schedule slips) with subcontractor Texas Instruments' Jack Kilby (IC inventor and later Nobelist), Morris Chang (later TSMC founder) and others. Earlier Illiac teams had pushed contemporary technologies, with similar implementation problems and delays.

Many attempts at parallel computing startups arose in the decades following Illiac 4, but nothing achieved success until adequate languages and software were developed in the 1970s and 80s. Parafrase veteran Steve Chen joined Cray and led development of the parallel/vector Cray X-MP, released in 1982. The 1990s were a turning point, with many 1980s startups failing, the end of bipolar technology cost-effectiveness, and the general end of academic computer building. By the 2000s, with Intel and others manufacturing massive numbers of systems, shared memory parallelism had become ubiquitous.

CSRD and the Cedar system played key roles in advancing shared memory system effectiveness. Many CSRD innovations of the late 80s (Cedar and beyond) are in common use today, including hierarchical shared memory hardware. Cedar also had parallel Fortran extensions, a vectorizing and parallelizing compiler, and a custom Unix-based OS, all of which were used to develop advanced parallel algorithms and applications. These are detailed below.

Cedar design and construction
One unusually productive aspect of the Cedar design effort was the ongoing cooperation among the R&D efforts of architects, compiler writers, and application developers. Another was the substantial legacy of ideas and people from the Parafrase project in the 1970s. These enabled the team to focus on several design topics quickly:


 * Interconnection network and shared memory hierarchy
 * Compiler algorithms, OS, and SW tools
 * Applications and performance analysis

The architecture group had a decade of parallel interconnect and memory experience and had already chosen a high-radix shuffle network, so after selecting Alliant as the node manufacturer, custom interfacing hardware was designed in conjunction with Alliant engineers. The compiler team started by designing Cedar Fortran for this architecture, and by modifying the Kuck & Assoc. (KAI) source-to-source translator with Cedar-specific transformations for the Alliant compiler. Having nearly two decades of parallel algorithm experience (starting from Illiac 4), the applications group chose several applications to study, based on emerging parallel algorithms. This was later extended to include some widely used applications that shared the need for the chosen algorithms. Designing, building and integrating the system was then a multi-year effort, including architecture, hardware, compiler, OS and algorithm work.

System Architecture & Hardware
The hardware design led to 3 different types of 24” printed circuit boards, with the network board using CSRD-designed crossbar gate array chips. The boards were assembled into three custom racks in a machine room in Talbot Lab using water-cooled heat exchangers. Cedar’s key architectural innovations and features included:


 * A hierarchical/cluster-based shared-memory multiprocessor (SMP) design using the Alliant FX as the building block. This approach is still followed in today’s parallel machines, where the Alliant 8-processor systems have been shrunk to single-chip multi-core nodes containing 16, 32 or more cores, depending on power and thermal limitations.
 * The first SMP use of a scalable high-radix multi-stage shuffle-exchange interconnection network, i.e. a 2-stage Omega network using 8x8 crossbar switches as building blocks. In 2005, the Cray BlackWidow (Cray X2) used a variant of such networks in the form of a high-radix 3-stage Clos network with 32x32 crossbar switches. Subsequently many other systems have adopted the idea.
 * The first hardware data prefetcher, using a “next-array-element” prefetching scheme (instead of “next-cache-line” used in some later machines) to load array data from the shared global memory. Data prefetching is a critical technology on today’s multicores.
 * The first “processor-in-memory” (PIM) in its shared global memory to perform long-latency synchronization operations. Today, using PIM to carry out various operations in shared global memory is still an active architectural research area.
 * Software-combining techniques for scalable synchronization operations
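The Omega network item above can be illustrated with textbook destination-tag routing. The following is a minimal sketch, assuming 64 ports (two stages of 8x8 switches) and the standard shuffle-then-switch stage structure; the function name `omega_route` and the port numbering are illustrative assumptions, not a description of Cedar's actual wiring.

```python
def omega_route(src, dst, radix=8, stages=2):
    """Return the line positions a message visits on its way from src to dst.

    Hypothetical helper: models a generic Omega network, not Cedar's hardware.
    """
    n_ports = radix ** stages
    if not (0 <= src < n_ports and 0 <= dst < n_ports):
        raise ValueError("port number out of range")
    pos = src
    path = [pos]
    for stage in range(stages):
        # Perfect shuffle: rotate the base-`radix` digits of the position left by one.
        pos = (pos * radix) % n_ports + (pos * radix) // n_ports
        # Crossbar switch: choose the output given by the destination digit
        # for this stage (most significant digit first).
        digit = (dst // radix ** (stages - 1 - stage)) % radix
        pos = pos - (pos % radix) + digit
        path.append(pos)
    return path
```

For example, `omega_route(13, 42)` returns `[13, 45, 42]`: each stage fixes one base-8 digit of the destination, so any source reaches any destination in exactly two switch traversals.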

Language and compiler
In 1984, Fortran was still the standard language of HPC programming, but no standard existed for parallel programming. Building on the ideas of Parafrase and emerging commercial programming methods, Cedar Fortran was designed and implemented for programming Cedar and to serve as the target of the Cedar autoparallelizer.

Cedar Fortran contained a two-level parallel loop hierarchy that reflected the Cedar architecture. Each iteration of outer parallel loops made use of one cluster and a second level parallel loop made use of one of the eight processors of a cluster for each of its iterations. Cedar Fortran also contained primitives for doacross synchronization and control of critical sections. Outer-level parallel loops were initiated, scheduled and synchronized using a runtime library while inner loops relied on Alliant hardware instructions to initiate the loops, schedule and synchronize their iterations.
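The mapping of the two loop levels onto the machine can be sketched as follows. This is only an illustration: it assumes a static round-robin assignment, whereas the text describes scheduling by a runtime library (outer loops) and Alliant hardware (inner loops), and the function name `two_level_schedule` is hypothetical.

```python
def two_level_schedule(n_outer, n_inner, clusters=4, procs_per_cluster=8):
    """Map each (outer, inner) iteration pair to a (cluster, processor) pair.

    Assumption: round-robin placement, chosen only to keep the sketch simple.
    """
    placement = {}
    for i in range(n_outer):
        cluster = i % clusters                # each outer iteration uses one cluster
        for j in range(n_inner):
            proc = j % procs_per_cluster      # each inner iteration uses one processor
            placement[(i, j)] = (cluster, proc)
    return placement
```

With 8 outer and 16 inner iterations, all 32 (cluster, processor) pairs of a 4-cluster, 8-processor-per-cluster machine receive work.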

Global variables and arrays were allocated in global memory while those declared local to iterations of outer parallel loops were allocated within clusters. There were no caches between clusters and main memory and therefore, programmers had to explicitly copy from global memory to local memory to attain faster memory accesses. These mechanisms worked well in all cases tested and gave programmers control over processor assignment and memory allocation. As discussed in the next section, numerous applications were implemented in Cedar Fortran.

Cedar compiler work started with the development of a Fortran parallelizer for Cedar, built by extending KAP, a vectorizer that was contributed by KAI to CSRD. Because it was built on a vectorizer, the first modified version of KAP developed at CSRD lacked some important capabilities necessary for an effective translation for multiprocessors, such as array privatization and parallelization of outer loops. Unlike Parafrase (written in PL/1), which ran only on IBM machines, KAP (written in C) ran on many machines (the KAI customer base). To identify the missing capabilities and develop the necessary translation algorithms, a collection of Fortran programs from the Perfect Benchmarks was parallelized by hand. Only techniques that were considered implementable were used in the manual parallelization study. The techniques were later used for a second-generation parallelizer that proved effective on collections of programs not used in the manual parallelization study.

Applications and benchmarking
Meanwhile the algorithms/applications group was able to use Cedar Fortran to implement and test algorithms and run them on the four quadrants independently before system integration. The group was focused on developing a library of parallel algorithms and their associated kernels that mainly govern the performance of large-scale computational science and engineering (CSE) applications. Some of the CSE applications that were considered during the Cedar project included: electronic circuit and device simulation, structural mechanics and dynamics, computational fluid dynamics, and the adjustment of very large geodetic networks.

A systematic plan for performance evaluation of many CSE applications on the Cedar platform was outlined. In almost all of the above-mentioned CSE applications, dense and sparse matrix computations proved to largely govern overall performance on the Cedar architecture. Parallel algorithms that realize high performance on the Cedar architecture were developed for:


 * solving dense, and large sparse (structured as well as unstructured) linear systems,
 * computing a few eigenpairs of large symmetric tridiagonal matrices,
 * computing all the eigenpairs of dense symmetric standard eigenvalue problems, and all the singular triplets of dense non-symmetric real matrices,
 * computing a few of the smallest eigenpairs of large sparse standard and generalized symmetric eigenvalue problems, and
 * computing a few of the largest or smallest singular triplets of large sparse nonsymmetric real matrices.

In preparing to evaluate candidate hardware building blocks and the final Cedar system, CSRD managers began to assemble a collection of test algorithms, which later evolved into the Perfect Club. Before that, there were only kernels and focused algorithm approaches (Linpack, NAS benchmarks). In the following decade the idea became popular, especially as many manufacturers introduced high performance workstations, which buyers wanted to compare; SPEC became the workhorse of the field and was followed by many others. SPEC was incorporated in 1988, released its first benchmark suite in 1989 (SPEC89), and formed its High-Performance Group in 1994. (David Kuck and George Cybenko were early advisors, Kuck served on the BoD in the early 90s, and Rudolf Eigenmann drove the SPEC HPG effort, leading to the release of a first high performance benchmark in 1996.)

In a joint effort between the CSRD groups, the Parafrase memory hierarchy loop blocking work of Abu Sufah was exploited for the Cedar cache hierarchy. Several papers were published demonstrating performance enhancement for basic linear algebra algorithms on the Alliant quadrants and Cedar. During a sabbatical spent at CSRD at the time, Jack Dongarra and Danny Sorensen helped carry this work over into BLAS 3 (extending the simpler BLAS 1 and BLAS 2), a standard that is now widely used.

Cedar conclusion
CSRD had many alumni who went on to important careers in computing; some left early and others joined late. Among the leaders was UIUC faculty member Dan Gajski, who was affiliated with the CSRD directors in formulating plans and proposals, but left UIUC just before CSRD actually commenced. Another was Mike Farmwald, who joined as Associate Director for hardware/architecture when Ed Davidson left. Immediately after leaving, Farmwald co-founded Rambus, which continues as a memory design leader. David Padua became Assoc. Director for SW after Duncan Lawrie left, and continued many CSRD projects as a UIUC CS professor. Over time, CSRD researchers became CS and ECE department heads at 5 Big Ten universities.

By 1990, the Cedar system had been completed. The CSRD team was able to scale applications from single clusters to the full 4-cluster system and begin performance measurements. Despite these successes, there was no follow-up machine construction project. After the end of the Cedar project, the Stanford DASH/FLASH projects, and the MIT Alewife project around 1995, the era of large, multi-faculty academic machine designs had come to an end. Cedar was a preeminent part of the last wave of such projects. ISCA’s 25th Anniversary Proceedings contain several retrospective papers describing some of the machines in that last wave, including one on Cedar.

About 50 remaining CSRD students, academic professionals and faculty became a research group within the Coordinated Science Laboratory by 1994. For several years, they continued the work initiated in the 1980s, including experimental evaluations of Cedar and continuation of several lines of CSRD compiler research.

Other CSRD contributions
Beyond the core CSRD work of designing, building and using Cedar, many related topics arose. Some were directly motivated by the Cedar project. Many of these had value well beyond Cedar, were pursued well beyond the official end of CSRD, and were taken up by many academic and industrial groups. The most important of these topics are discussed next.

Guided Self Scheduling
In the mid 1980s, C. Polychronopoulos developed one of the most influential strategies for the scheduling of parallel loop iterations. The strategy, called Guided Self-Scheduling, schedules the execution of a group of loop iterations each time a processor becomes available. The number of iterations in these groups decreases as the execution of the loop progresses in such a way that the load imbalance is reduced relative to the static or dynamic scheduling techniques used at the time. Guided Self-Scheduling influenced research and practice with numerous citations of the paper introducing the technique and the adoption of the strategy by OpenMP as one of its standard loop scheduling techniques.
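The shrinking group sizes can be made concrete with a small sketch. It assumes the commonly stated form of the rule, in which a processor that becomes free takes the ceiling of (remaining iterations / number of processors); the function name `gss_chunks` is illustrative, and real runtimes (for example OpenMP's guided schedule) add details such as a minimum chunk size.

```python
def gss_chunks(n_iterations, n_processors):
    """Return the successive chunk sizes handed out under guided self-scheduling.

    Assumed rule: each request gets ceil(remaining / n_processors) iterations.
    """
    remaining = n_iterations
    chunks = []
    while remaining > 0:
        size = (remaining + n_processors - 1) // n_processors  # integer ceiling
        chunks.append(size)
        remaining -= size
    return chunks
```

For 100 iterations on 4 processors, `gss_chunks(100, 4)` yields `[25, 19, 14, 11, 8, 6, 5, 3, 3, 2, 1, 1, 1, 1]`: large chunks early keep scheduling overhead low, and small chunks at the end even out the finishing times.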

Approximation by superpositions of a sigmoidal function
In the mid to late 1980s, the so-called “Parallel Distributed Processing” (PDP) effort recast earlier generations of neural computation by demonstrating effective machine learning algorithms and neural architectures. The computing paradigm, far removed from traditional von Neumann computer architecture, demonstrated that PDP approaches and algorithms could address a variety of application problems in novel ways. However, it was not known what kinds of problems could be solved using such massively parallel neural network architectures. In 1989, CSRD researcher George Cybenko demonstrated that even the simplest nontrivial neural network had the representational power to approximate a wide variety of functions, including categorical classifiers and continuous real-valued functions. That work was seminal in that it showed that, in principle, neural machines based on biological nervous systems could effectively emulate any input-output relationship that was computable by traditional machines. As a result, Cybenko’s result has often been called the “Universal Approximation Theorem” in the literature. The proof of that result relied on advanced functional analysis techniques and was not constructive. Even so, it gave rigorous justification for generations of neural network architectures, including deep learning and the large language models in wide use in the 2020s. While Cybenko’s Universal Approximation Theorem addressed the capabilities of neural-based computing machines, it was silent on the ability of such architectures to effectively learn their parameter values from data. Cybenko and CSRD colleagues Sirpa Saarinen and Randall Bramley subsequently studied the numerical properties of neural networks, which are typically trained using stochastic gradient descent and its variants. They observed that neurons saturate when network parameters are very negative or very positive, leading to arbitrarily small gradients, which in turn result in optimization problems that are numerically poorly conditioned. This property has been called the “vanishing gradient” problem in machine learning.
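The saturation effect is easy to see numerically. The sketch below assumes the logistic sigmoid as the activation function (one common choice of sigmoidal function) and simply evaluates its derivative; it is an illustration of the effect, not a reproduction of the analysis by Saarinen, Bramley, and Cybenko.

```python
import math


def sigmoid(x):
    """Logistic sigmoid, assumed here as the neuron's activation function."""
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_grad(x):
    """Derivative of the sigmoid: s(x) * (1 - s(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)


def saturation_table(inputs):
    """Pair each pre-activation value with the gradient a neuron passes back."""
    return [(x, sigmoid_grad(x)) for x in inputs]
```

The derivative peaks at 0.25 for an input of 0 and falls below 1e-4 once the input magnitude reaches 10, so a saturated neuron contributes almost nothing to a gradient-based update.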

BLAS 3
The Basic Linear Algebra Subprograms (BLAS) are among the most important mathematical software achievements. They are essential components of LINPACK, and versions are used by every major vendor of computer hardware. The BLAS library was developed in three phases. BLAS 1 provided optimized implementations of basic vector operations; BLAS 2 contributed matrix-vector capabilities to the library; BLAS 3 provides optimized matrix-matrix operations. The multi-cluster shared memory architecture of Cedar inspired a great deal of library optimization research involving cache locality and data reuse for matrix operations of this type. The official BLAS 3 standard was published in 1990 and was inspired, in part, by this research. Additional CSRD research on data management for complex memory hierarchies followed; some of the more theoretical work was published separately, as was the performance impact of these algorithms when running on Cedar.
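The data-reuse idea behind matrix-matrix kernels can be sketched with a blocked (tiled) multiply. This is a plain-Python illustration with an assumed block size and a hypothetical function name; production BLAS 3 routines such as GEMM are heavily tuned and are not written this way.

```python
def blocked_matmul(a, b, block=2):
    """Multiply matrices a (n x k) and b (k x m), given as lists of rows.

    The three outer loops walk over tiles; the three inner loops reuse the
    elements of one tile of a and one tile of b many times before moving on,
    which is what lets a fast local memory or cache pay off.
    """
    n, k, m = len(a), len(b), len(b[0])
    c = [[0.0] * m for _ in range(n)]
    for ii in range(0, n, block):
        for kk in range(0, k, block):
            for jj in range(0, m, block):
                for i in range(ii, min(ii + block, n)):
                    for p in range(kk, min(kk + block, k)):
                        a_ip = a[i][p]
                        for j in range(jj, min(jj + block, m)):
                            c[i][j] += a_ip * b[p][j]
    return c
```

The result is independent of the block size for exactly representable inputs; only the order of memory accesses changes, which is the property that blocking exploits.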

OpenMP
Beyond CSRD, the many parallel startup companies of the 1980s created a profusion of ad hoc parallel programming styles, based on various process and thread models. Subsequently, many parallel language and compiler ideas were proposed, including compilers for Cray Fortran, KAI-based source-to-source optimizers, etc. Some of these tried to create product differentiation advantages, but largely went contrary to user desires for performance portability. By the late 1980s, KAI started a standardization effort that led to the ANSI X3H5 draft standard, which was widely adopted.

In the 1990s, after CSRD, these ideas influenced KAI in auto-parallelization, and soon another round of standardization was begun. By 1996 KAI had SGI as a customer, and they joined the effort to form the OpenMP consortium; the OpenMP Architecture Review Board was incorporated in 1997 with a growing collection of manufacturers. KAI also developed parallel performance and thread checking tools, which Intel acquired with its purchase of KAI in 2000. Many KAI staff members remain, and the Intel development continues, directly inherited from Parafrase and CSRD. Today, OpenMP is the industry standard shared memory programming API for C/C++ and Fortran.

Speculative parallelization
For his PhD thesis, Rauchwerger introduced an important paradigm shift in the analysis of program loops for parallelization. Instead of first validating the transformation into parallel form through a priori analysis, either statically by the compiler or dynamically at runtime, the new paradigm speculatively parallelized the loop and then checked its validity. This technique, named “speculative parallelization”, executes a loop in parallel and subsequently tests whether any data dependences could have occurred. If this validation test fails, the loop is re-executed in a safe manner, starting from a safe state, e.g., sequentially from a previous checkpoint. This approach is known as the LRPD Test (Lazy Reduction and Privatization Doall Test). Briefly, the LRPD test instruments the shared memory references of the loop in some “shadow” structures and then, after loop execution, analyzes them for dependence patterns. This pioneering contribution has been quite influential and has been applied over the years by many researchers from CSRD and elsewhere.
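The speculate, check, and fall back pattern can be sketched as follows. This is a greatly simplified illustration and not the LRPD test itself: it omits privatization and reduction handling, records accesses in per-element sets rather than the shadow arrays of the original work, and stands in for parallel execution by running iterations in reverse order. All names are hypothetical.

```python
class ShadowedArray:
    """Array wrapper that records which iteration reads or writes each element."""

    def __init__(self, values):
        self.values = list(values)
        self.writers = {}    # index -> set of iterations that wrote the element
        self.readers = {}    # index -> set of iterations that read the element
        self.current = None  # iteration currently executing

    def __getitem__(self, idx):
        self.readers.setdefault(idx, set()).add(self.current)
        return self.values[idx]

    def __setitem__(self, idx, value):
        self.writers.setdefault(idx, set()).add(self.current)
        self.values[idx] = value

    def has_cross_iteration_dependence(self):
        # A written element touched by more than one iteration may carry a dependence.
        for idx, ws in self.writers.items():
            if len(ws | self.readers.get(idx, set())) > 1:
                return True
        return False


def run_speculatively(data, n_iters, body):
    """Run body(i, array) for all i; return (result, speculation_succeeded)."""
    checkpoint = list(data)
    shadow = ShadowedArray(data)
    for i in reversed(range(n_iters)):  # arbitrary order models parallel execution
        shadow.current = i
        body(i, shadow)
    if not shadow.has_cross_iteration_dependence():
        return shadow.values, True
    result = list(checkpoint)           # restore the safe state
    for i in range(n_iters):            # safe sequential re-execution
        body(i, result)
    return result, False


def double(i, a):
    a[i] = a[i] * 2                     # iterations are independent


def prefix_sum(i, a):
    if i > 0:
        a[i] = a[i] + a[i - 1]          # iteration i depends on iteration i - 1
```

Here `run_speculatively([1, 2, 3, 4], 4, double)` succeeds speculatively, while `run_speculatively([1, 1, 1, 1], 4, prefix_sum)` detects the dependence and falls back to sequential execution.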

Race detection
In 1987, Allen pioneered the use of memory traces for the detection of race conditions in parallel programs. Race conditions are defects of parallel programs that manifest as different outcomes for different executions of the same program with the same input data. Because of their dynamic nature, race conditions are difficult to detect, and the techniques introduced by Allen and later expanded are the best strategy known to cope with this problem. The strategy has been highly influential, with numerous researchers working on the topic during the last decades. The technique has been incorporated into numerous experimental and commercial tools, including Intel's Inspector.
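A trace-based check of this kind can be sketched in a few lines. The sketch assumes a hypothetical trace format of (epoch, thread, operation, address) tuples and assumes that barriers are the only synchronization, so accesses in the same epoch by different threads are concurrent; it is not the algorithm of the original work or of any particular tool.

```python
from collections import defaultdict


def find_races(trace):
    """Return the sorted (epoch, address) pairs with conflicting concurrent accesses.

    trace: iterable of (epoch, thread, op, addr) with op equal to "r" or "w".
    Two accesses conflict if they come from different threads in the same
    epoch, touch the same address, and at least one of them is a write.
    """
    by_location = defaultdict(list)
    for epoch, thread, op, addr in trace:
        by_location[(epoch, addr)].append((thread, op))
    races = set()
    for location, accesses in by_location.items():
        for i, (t1, op1) in enumerate(accesses):
            for t2, op2 in accesses[i + 1:]:
                if t1 != t2 and "w" in (op1, op2):
                    races.add(location)
    return sorted(races)
```

A write and a read of the same address by two threads in one epoch is reported; two reads, or accesses separated by a barrier (different epochs), are not.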

Contributions to Benchmarking – SPEC
One of CSRD’s thrusts was to develop metrics able to  evaluate both hardware and software systems using real applications. To this end, the Perfect Benchmarks provided a set of computational applications, collected from various science domains, which were used to evaluate and drive the study of the Cedar system and its compilers. In 1994, members of CSRD and the Standard Performance Evaluation Corporation (SPEC) expanded on this thrust, forming the SPEC High-Performance Group. This group released a first real-application SPEC benchmark suite, SPEC HPC 96. SPEC has been continuing the development of benchmarks for high-performance computing to this date, a recent suite being SPEChpc 2021. With CSRD’s influence, the SPEC High-Performance Group also prompted a close collaboration of industrial and academic participants. A joint workshop in 2001 on Real-Application Benchmarking founded a workshop series, eventually leading to the formation of the SPEC Research Group, which in turn co-initiated the now annual ACM/SPEC International Conference on Performance Engineering.

Parallel Programming Tools
Funded by DARPA, the HPC++ project was led by Dennis Gannon and Allen Malony, with postdocs Francois Bodin, from William Jalby’s group in Rennes, and Peter Beckman, now at Argonne National Lab. This work grew out of a collaboration between Malony, Gannon and Jalby that began at CSRD. HPC++ is based on extensions to the C++ standard template library that support a number of parallel programming scenarios, including single-program-multiple-data (SPMD) and Bulk Synchronous Parallel, on both shared memory and distributed memory parallel systems. The most significant outcome of this collaboration was the development of the TAU Parallel Performance System. Originally developed for HPC++, it has become a standard for measuring, visualizing and optimizing parallel programs for nearly all programming languages and is available for all parallel computing platforms. It supports various programming interfaces such as OpenCL, DPC++/SYCL, OpenACC, and OpenMP. It can also gather performance information on GPU computations from different vendors such as Intel and NVIDIA. TAU has been used for many HPC applications and projects.

Applications
The Cedar project has strongly influenced the research activities of many of CSRD’s faculty members long after the end of the project. After the termination of the Cedar project, the first task undertaken by three members of Cedar’s Algorithm and Application group (A. Sameh, E. Gallopoulos, and B. Philippe) was documenting the parallel algorithms developed, and published in a variety of journals and conference proceedings, during the lifetime of the project. The result was a graduate textbook: “Parallelism in Matrix Computations” by E. Gallopoulos, B. Philippe, and A. Sameh, published by Springer, 2016. The parallel algorithm development experience gained by one of the members of the Cedar project (A. Sameh) proved to be of great value in his research activities after leaving UIUC. He used many of these parallel algorithms in joint research projects:

 * fluid-particle interaction with the late Daniel Joseph (a member of the National Academy of Sciences and faculty member in Aerospace Engineering at the University of Minnesota, Twin Cities),
 * fluid-structure interaction with Tayfun Tezduyar (Mechanical Engineering at Rice University),
 * computational nanoelectronics with Mark Lundstrom (Electrical & Computer Engineering at Purdue University).

These activities were followed, in 2020, by a Birkhäuser volume (edited by A. Grama and A. Sameh) in two parts: Part I covers recent advances in high performance algorithms, and Part II covers selected challenging computational science and engineering applications.

Compiler assisted cache coherence
Cache coherence is a key problem in building shared memory multiprocessors. It was traditionally implemented in hardware via coherence protocols. However, the advent of systems like Cedar allowed one to consider a compiler-assisted implementation of cache coherence for parallel programs, with minimal and completely local hardware support. Where a hardware coherence protocol like MESI relies on remote invalidation of cache lines, a compiler-assisted protocol performs a local self-invalidation as directed by the compiler. CSRD researchers developed several different approaches to compiler-assisted coherence, including a scheme with directory assistance. All these schemes performed a post-invalidation at the end of a parallel region. This work has influenced research, with numerous citations across the decades up to today.
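The self-invalidation idea can be sketched with a toy private cache. The sketch assumes write-through caching and assumes the compiler supplies, at the end of each parallel region, the set of addresses that other processors may have written; the class and method names are hypothetical and do not correspond to any specific CSRD scheme.

```python
class SelfInvalidatingCache:
    """Private cache in front of a shared memory (modelled as a dict)."""

    def __init__(self, memory):
        self.memory = memory   # shared global memory
        self.lines = {}        # privately cached copies

    def read(self, addr):
        if addr not in self.lines:
            self.lines[addr] = self.memory[addr]   # miss: fetch from memory
        return self.lines[addr]

    def write(self, addr, value):
        self.lines[addr] = value
        self.memory[addr] = value                  # write-through (assumption)

    def end_parallel_region(self, possibly_stale):
        # Local self-invalidation: drop only the lines the compiler flagged.
        # No messages are sent to, or received from, other caches.
        for addr in possibly_stale:
            self.lines.pop(addr, None)
```

If one processor caches `x` and another then writes it, the first keeps reading its stale copy until `end_parallel_region({"x"})` discards the line, after which the next read fetches the new value from memory.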

Compilers for GPUs
Early CSRD work on program optimization for classical parallel computers also spurred the development of languages and compilers for more specialized accelerators, such as Graphics Processing Units (GPUs). For example, in the late 2000s, CSRD researcher Rudolf Eigenmann developed translation methods for compilers that enabled programs written in the standard OpenMP programming model to be executed efficiently on GPUs. Until then, GPUs had been programmed primarily in the specialized CUDA language. The new methods showed that high-level programming of GPUs was feasible not only for classical computational applications, but also for certain types of problems that exhibited irregular program patterns. This work incentivized further initiatives toward high-level programming models for GPUs and accelerators in general, such as OpenACC and OpenMP for accelerators. In turn, these initiatives contributed to the use of GPUs for a wide range of computational problems, including neural networks for deep learning, whose mathematical foundation was studied by Cybenko as discussed above.