User:Su-steve/SmartMemReduxPaper

Title: Does reconfigurability improve compute efficiency?
Previous versions: User:Su-steve/tmp-paper

 Alternative titles: 
 * Smart Memories Update
 * Revisiting reconfigurable computing
 * ISCA 2000 paper, updated

Plain English abstract
An earlier paper (ISCA2000) made certain claims about the abilities of a proposed machine called Smart Memories. Here it is eight years later. How well does the final machine perform with regard to the original claims?

Perhaps more importantly, what can the Smart Memories experience tell us about reconfigurable computing? What open questions might be answered? For the larger context, see Reconfigurable computing.

Pointer to the original ISCA 2000 paper
K. Mai, T. Paaske, N. Jayasena, R. Ho, W. Dally, M. Horowitz, Smart Memories: A Modular Reconfigurable Architecture, International Symposium on Computer Architecture, June 2000.

Smart Memories combines the memory flexibility of machines such as Tri-media [2,3], Equator [4], Mpact [5] and IRAM [6] with the high-ILP / multi-CPU capabilities of RAW.

Claims from the original ISCA 2000 paper
Assumption from the original 2000 paper: Consider three different compute systems:
 * CCCM, a dedicated conventional cache-coherent machine;
 * STRM, a dedicated stream machine; and
 * TLSM, a dedicated TLS machine.

Sometimes you want CCCM; sometimes you want STRM; sometimes you want TLSM; and sometimes you might discover that you want a whole different sort of machine. (Are these assumptions valid? Why, exactly, do you want each of these machines?)

Thesis of the original 2000 paper: We can build a single reconfigurable machine SM (Smart Memories) that reasonably approaches the performance of dedicated machines CCCM, STRM and TLSM and, by extension, other as-yet-unnamed machines.

Is STRM really better than CCCM, or can CCCM reasonably suffice in most cases? See Leverich paper?

Is TLSM really better than CCCM? Another paper makes this case (whose?)

Given that TLSM is better than CCCM, might some other idea arise that's yet better than TLSM? (Yes, see TCCM. Whose paper will make this claim?)

Can redo the original experiments with the new hardware; should get similar results; will this be interesting?

Can show that machine supports TCC, the TLS follow-on. A good claim for flexibility to adapt to unforeseen circumstance!

General idea for the new paper
Assumption, from original 2000 paper:

- some programs run better as cache-coherent programs; others run better as stream programs; still others run better as TLS.

- SM will run all three (and more!) types of program with little performance loss vs. a dedicated machine.

Put together a mix of cache-coherent, streaming and TCC benchmarks.

Run the entire mix in at least three modes, get performance numbers for each mode:

1. Tailored for cache-coherent
2. Tailored for streaming
3. Tailored for TCC

Then, run the entire mix with each benchmark running in its best mode.

Quantify the difference between 1, 2 and 3.
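The comparison described above could be tabulated along the following lines. This is only a sketch: the mode names, the `Runtimes` layout and the helper functions are hypothetical, and it assumes each benchmark yields a single runtime (in seconds) per mode, with total runtime over the mix as the aggregate measure.

```python
from typing import Dict

# benchmark name -> mode name -> runtime in seconds (layout is an assumption)
Runtimes = Dict[str, Dict[str, float]]

# Hypothetical labels for the three tailored configurations
MODES = ("cache_coherent", "streaming", "tcc")


def total_per_mode(runtimes: Runtimes) -> Dict[str, float]:
    """Total runtime of the whole mix when every benchmark runs in one fixed mode."""
    return {mode: sum(times[mode] for times in runtimes.values()) for mode in MODES}


def best_mode_per_benchmark(runtimes: Runtimes) -> Dict[str, str]:
    """For each benchmark, the mode with the lowest runtime."""
    return {name: min(MODES, key=lambda m: times[m]) for name, times in runtimes.items()}


def total_best(runtimes: Runtimes) -> float:
    """Total runtime of the mix when each benchmark runs in its own best mode."""
    return sum(min(times[m] for m in MODES) for times in runtimes.values())


def penalty_vs_best(runtimes: Runtimes) -> Dict[str, float]:
    """Slowdown factor of each fixed mode relative to the best-mode-per-benchmark run."""
    best = total_best(runtimes)
    return {mode: total / best for mode, total in total_per_mode(runtimes).items()}
```

Feeding measured runtimes into `penalty_vs_best` would give one number per fixed mode, which is one possible way to quantify the difference between modes 1, 2 and 3 and the per-benchmark best case.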

Problem: can the same code run in each mode, or does it have to be modified?

Meta-question: how do we compare the same program written in two different styles for two different machines?

Assertion: Application FOO runs better as a streaming application on a streaming machine than it will ever run as a CC app on a CC machine.

Difficult question: CC vs. stream vs. TCC/TLS, where the coding of the app changes with the mode.

Simpler question: just change the cache, don't change the code.

Meta-question: are users willing to recode for higher performance?

Meta-question: is programming language A easier or more efficient etc. than programming language B?

Ongoing open questions: how do we find the optimal configuration for a given app? And/or is this an interactive process?

Open questions
1. Dusty deck: are users willing to recode their app for better performance?

Notes from the ISCA 2000 paper
The ISCA 2000 paper indicated that RC would be able to provide multiple programming models in a single RC chip, at only a slight cost to performance versus a rival chip tuned for a given model.

In other words, TLS apps running on RC would run approximately as well as TLS apps running on Hydra. Streaming apps running on RC would run approximately as well as streaming apps running on Imagine. The implied (?) upside for RC was that a single server with an RC chip would be able to run a mix of TLS and streaming apps better than a server with either a streaming or TLS chip alone.

Unproven(?) assumptions include the following:
 1) apps coded for and running on a TLS machine run better than apps coded for and running on a GP CMP;
 2) apps coded for and running on a streaming machine run better than apps coded for and running on a GP CMP;
 3) people are willing to rewrite their apps for a special-purpose processor so as to gain some advantage.

From ISCA 2000:

The 8-cluster Imagine is mapped to a 4-tile Smart Memories quad.

...

The kernels simulated - a 1024-point FFT, a 13-tap FIR filter, a 7x7 convolution, and an 8x8 DCT - were optimized for Imagine and were not re-optimized for the Smart Memories architecture.

...

The Hydra speculative multiprocessor enables code from a sequential machine to be run on a parallel machine without requiring the code to be re-written [34][35]. A pre-processing script finds and marks loops in the original code. At run-time, different loop iterations from the marked loops are then speculatively distributed across all processors.

[34] L. Hammond, et al. Data Speculation Support for a Chip Multiprocessor. In Proceedings of Eighth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS VIII), pages 58-69, Oct. 1998.

[35] K. Olukotun, et al. Improving the Performance of Speculatively Parallel Applications on the Hydra CMP. In Proceedings of the 1999 ACM International Conference on Supercomputing, June 1999.

Point of the original ISCA 2000 paper: SM can run Imagine-specific and Hydra-specific benchmarks at speeds similar to those of Imagine and Hydra (i.e., within roughly 50% of their performance). The implication was that Imagine would do poorly on the Hydra benchmarks and/or Hydra would do poorly on the Imagine benchmarks, but that SM would do reasonably well in either mode. Missing was the overall performance comparison for a combined Hydra/Imagine suite as measured on 1) Imagine, 2) Hydra and 3) Smart Memories. Results would presumably look something like:

                 Imagine    Hydra    Smart Memories
                 -------    -----    --------------
 Imagine apps:   Good       Bad      Okay
 Hydra apps:     Bad        Good     Okay
 Combined apps:  Bad        Bad      Good

Assumptions:

Certain applications run better when re-coded for, and run on, a different style of architecture. For instance, applications suited for streaming will run better on an Imagine processor than on a general-purpose processor. Applications suited to multithreading will run better on a Hydra-like processor.

Imagine is a reasonable target for applications that stream data.

Hydra is a reasonable target for applications that can take good advantage of multithreading.

Imagine, Hydra and Smart Memories simulators are sufficiently accurate, individually and in tandem, such that results are valid. I.e., not only must the characteristics and flaws of an Imagine simulator map to those of an actual Imagine processor, it must also reasonably match the characteristics and flaws of the Smart Memories simulator to which it is being compared, and so on.

Open questions:

How hard is it to recode my applications for streaming, or for multithreading?

How does actual performance of Smart Memories, Imagine, Hydra compare to a known latest-and-greatest real processor, such as PowerPC, SPARC or x86?

New paper 1:
Concentrate on SPECthroughput. This requires no recoding of applications. Uses unaltered industry-standard applications of known characteristics.

Compare Smart Memories only to itself. This removes all the open variables associated with Imagine, Hydra, TCC or other theorized processor or system.

Leaves only two assumptions/open questions:

How well does Smart Memories simulator match the actual chip?

How does Smart Memories performance compare to that of a latest-and-greatest real processor?

Proving the web-page claim
The web page, http://www-vlsi.stanford.edu/smart_memories/, says: ``It [Smart Memories] is a single chip multi processor system with coarse grain reconfiguration capabilities, for supporting diverse computing models, like speculative multi-threading and streaming architectures. These features allow the system to run a broad range of applications efficiently.'' Do we still believe this? Have we shown it to be true?

The broader implication is that SM will run this wide range of applications more efficiently than an equivalent general-purpose CMP, presumably because of its reconfigurable memory system.

To prove this, we will need
 * multi-thread application(s) of interest;
 * streaming applications of interest;
 * TCC applications of interest.

Experiments we can do now:
- Configure SM as a general-purpose multi-thread machine, run all applications, and note performance A. Tailor SM to each individual app and note performance B. Compare individual and aggregate performance A to individual and aggregate performance B.
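One possible way to make the A-versus-B comparison concrete is sketched below. The notes do not say how performance is measured or aggregated, so the symbols and the choice of runtime and geometric mean are assumptions: $t_i^A$ is the runtime of application $i$ on the general-purpose configuration, $t_i^B$ its runtime on the configuration tailored to it, and $n$ the number of applications.

```latex
% Per-application gain from tailoring (greater than 1 means tailoring helps)
s_i = \frac{t_i^A}{t_i^B}, \qquad i = 1, \dots, n

% Aggregate gain, option 1: geometric mean of the per-application gains
S_{\mathrm{geo}} = \Bigl( \prod_{i=1}^{n} s_i \Bigr)^{1/n}

% Aggregate gain, option 2: ratio of total runtimes over the whole suite
S_{\mathrm{tot}} = \frac{\sum_{i=1}^{n} t_i^A}{\sum_{i=1}^{n} t_i^B}
```

The geometric mean weights every application equally, while the total-runtime ratio is dominated by the longest-running applications; which of the two is more appropriate here is an open choice.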

Experiments we have done
Experiments we have already done, based on the list of papers on the web site (http://www-vlsi.stanford.edu/smart_memories/papers.html):

- Compare stream performance on SVM to actual machine hardware, show that SVM is a good predictor of actual performance. Hardware: ATI, Nvidia, Imagine. Applications: Matrix Vector-Multiply, 2D FFT, Image Segmentation. Paper: F. Labonte, P. Mattson, I. Buck, C. Kozyrakis and M. Horowitz, "The Stream Virtual Machine," PACT, September 2004.

- Paper: K. Mai, T. Paaske, N. Jayasena, R. Ho, W. Dally, M. Horowitz, Smart Memories: A Modular Reconfigurable Architecture, ISCA, June 2000. "To show the applicability of this design, two very different machines at opposite ends of the architectural spectrum, the Imagine stream processor and the Hydra speculative multiprocessor, are mapped onto the Smart Memories computing substrate. Simulations of the mappings show that the Smart Memories architecture can successfully map these architectures with only modest performance degradation."

Experiments: 1) Imagine vs. Imagine-on-SM; 2) Hydra vs. Hydra-on-SM.

Applications, Imagine: fft, fir, convolve, dct. Applications, Hydra: compress, grep, m88ksim, wc, ijpeg, mpeg, alvin, simplex.

Conclusions: "The overheads of the coarse-grain configuration that Smart Memories uses, although modest, are not negligible; and as the mapping studies show, building a machine optimized for a specific application will always be faster than configuring a general machine for that task. Yet the results are promising, since the overheads and resulting difference in performance are not large. So if an application or set of applications needs more than one computing or memory model, our reconfigurable architecture can exceed the efficiency and performance of existing separate solutions."

Or, more concisely: SM's performance is comparable (?) to that of non-reconfigurable hardware for two very different (?) architectures.

Missing: *wc* did well on Imagine, poorly on Hydra. How would suite-wide performance compare for Hydra vs. Imagine vs. tuned-per-app SM?

- Paper: R. Ho, K. Mai, and M. Horowitz, The Future of Wires. Proceedings of the IEEE, April 2001, pp. 490-504. "...increased delays for global communication will drive architectures toward modular designs with explicit global latency mechanisms."

- Paper: J. Leverich, H. Arakida, A. Solomatnikov, A. Firoozshahian, M. Horowitz, C. Kozyrakis, "Comparing Memory Systems for Chip Multiprocessors," International Symposium on Computer Architecture, June 2007. "...our results indicate that there is not sufficient advantage in building streaming memory systems where all on-chip memory structures are explicitly managed. On the other hand, we show that streaming at the programming model level is particularly beneficial, even with the cache-based model, as it enhances locality and creates opportunities for bandwidth optimizations. Moreover, we observe that stream programming is actually easier with the cache-based model because the hardware guarantees correct, best-effort execution even when the programmer cannot fully regularize an application's code."

MIT RAW paper, ISCA 2004
http://cag.csail.mit.edu/raw/documents/raw_isca_2004.pdf

"Our evaluation attempts to determine the extent to which Raw succeeds in meeting its goal of serving as a more versatile, general-purpose processor. Central to achieving this goal is Raw’s ability to exploit all forms of parallelism, including ILP, DLP, TLP, and Stream parallelism. Specifically, we evaluate the performance of Raw on a diverse set of codes including traditional sequential programs, streaming applications, server workloads and bit-level embedded computation. Our experimental methodology makes use of a cycle-accurate simulator validated against our real hardware. Compared to a 180 nm Pentium-III, using commodity PC memory system components, Raw performs within a factor of 2x for sequential applications with a very low degree of ILP, about 2x to 9x better for higher levels of ILP, and 10x-100x better when highly parallel applications are coded in a stream language or optimized by hand. The paper also proposes a new versatility metric and uses it to discuss the generality of Raw."

"an operation of the form c = a + b in a load-store RISC architecture will require a minimum of 4 operations – two loads, one add, and one store. Stream architectures such as Raw can accomplish the operation in a single operation (for a speedup of 4x) because the processor can issue bulk data stream requests and then process data directly from the network without going through the cache."
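The arithmetic behind the quoted 4x figure can be spelled out as an idealized operation count. This sketch only counts issued operations for an element-wise `c = a + b` over `n` elements, using the per-element costs stated in the quote; the function names are hypothetical, and it says nothing about latency, bandwidth or measured performance.

```python
def risc_ops(n: int) -> int:
    """Load-store RISC: two loads, one add and one store per element."""
    return 4 * n


def stream_ops(n: int) -> int:
    """Stream model as described in the quote: one operation per element,
    with operands arriving directly from the network."""
    return n


def ideal_speedup(n: int) -> float:
    """Ratio of operation counts; n must be positive."""
    return risc_ops(n) / stream_ops(n)
```

Because both counts scale linearly with `n`, the ratio is 4 for any positive `n`, which matches the speedup claimed in the quote.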

"The evaluation for this paper makes use of a validated cycle-accurate simulator of the Raw chip. Using the validated simulator as opposed to actual hardware allows us to better normalize differences with a reference system, e.g., DRAM memory latency, and instruction cache configuration."

"For fairness, this comparison system must be implemented in a process that uses the same lithography generation, 180 nm."

"Much like a VLIW architecture, Raw is designed to rely on the compiler to find and exploit ILP. We have developed Rawcc [5, 24, 25] to explore these compilation issues. Rawcc takes sequential C or Fortran programs and orchestrates them across the Raw tiles in two steps. First, Rawcc distributes the data and code across the tiles to attempt to balance the tradeoff between locality and parallelism. Then, it schedules the computation and communication to maximize parallelism and minimize communication stalls."

"unmodified Spec applications stretch [the rawcc compiler's] robustness. We are working on improving the robustness of Rawcc."

"The speedups attained in Table 8 shows the potential of automatic parallelization and ILP exploitation on Raw. Of the benchmarks compiled by Rawcc, Raw is able to outperform the P3 for all the scientific benchmarks and several irregular applications."

"We present performance of stream computations for Raw... We present two sets of results. First we show the performance of programs written in StreamIt, a high level stream language, and automatically compiled to Raw. Then, we show the performance of some hand written applications."