User:Amirkavyan/sandbox

Multi2Sim is a simulation framework for CPU-GPU heterogeneous computing, written in C. It includes models for superscalar, multithreaded, and multicore CPUs, as well as GPU architectures. The simulator is open source and publicly available at http://www.multi2sim.org/.

Overview
Developing a model for a new microprocessor architecture in Multi2Sim consists of four phases, each of which involves the design and implementation of an independent software module: a disassembler, a functional simulator, a detailed simulator, and a visual tool. These four modules communicate with each other through clearly defined interfaces, but each can also be used as a stand-alone tool. To work as a stand-alone tool, however, each module requires the modules from the preceding phases.

Disassembler
Given a bit stream representing machine instructions for a specific instruction set architecture (ISA), the goal of a disassembler is to decode these instructions into an alternative representation that allows for a straightforward interpretation of the instruction fields, such as operation code, input/output operands, or immediate constants. Multi2Sim’s disassembler for a given microprocessor architecture can operate autonomously or serve later simulation stages. In the first case, the disassembler reads directly from a program binary generated by a compiler, such as an x86 application binary, and dumps a text-based listing of all fragments of ISA code found in the file. In the second case, the disassembler reads from a binary buffer in memory and outputs machine instructions one by one as organized data structures that split each instruction into its constituent fields.
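The second mode of operation described above, decoding a binary buffer into one data structure per instruction, can be illustrated with a minimal sketch. The 32-bit instruction layout, the opcode table, and the names Instruction, decode, and disassemble are all invented for this example; they do not correspond to any ISA or data structure that Multi2Sim actually implements.

```python
from dataclasses import dataclass

# Hypothetical 32-bit instruction layout, invented for illustration only:
#   bits 31-26: opcode, bits 25-21: destination register,
#   bits 20-16: source register, bits 15-0: immediate constant.
OPCODE_NAMES = {0: "add", 1: "sub", 2: "load"}


@dataclass
class Instruction:
    opcode: int
    dest: int
    src: int
    imm: int

    def to_text(self) -> str:
        """Text form, as a stand-alone disassembler might dump it."""
        name = OPCODE_NAMES.get(self.opcode, "unknown")
        return f"{name} r{self.dest}, r{self.src}, {self.imm}"


def decode(word: int) -> Instruction:
    """Split one 32-bit instruction word into its fields."""
    return Instruction(
        opcode=(word >> 26) & 0x3F,
        dest=(word >> 21) & 0x1F,
        src=(word >> 16) & 0x1F,
        imm=word & 0xFFFF,
    )


def disassemble(buffer: bytes) -> list[Instruction]:
    """Decode a little-endian byte buffer one 32-bit word at a time."""
    return [
        decode(int.from_bytes(buffer[i:i + 4], "little"))
        for i in range(0, len(buffer) - 3, 4)
    ]
```

In this sketch, to_text plays the role of the stand-alone text dump, while the Instruction objects returned by disassemble are what a later simulation stage would consume.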

Functional Simulator
The purpose of the functional simulator, also called the emulator, is to reproduce the original behavior of a guest program, providing the illusion that it is running natively on a given micro-architecture. For example, an ARM program binary can run on top of Multi2Sim’s ARM emulator. Even though Multi2Sim runs on an x86 architecture, the ARM guest program provides the same output as if it ran on a real ARM processor.
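The idea of reproducing a guest program's behavior can be sketched with a toy interpreter. The three-instruction guest ISA below ("li", "add", "bnez") and the function name emulate are hypothetical and chosen only to keep the example short; a real emulator such as the ones in Multi2Sim must also model memory, system calls, and the full instruction set.

```python
def emulate(program, num_registers=4):
    """Run a toy guest program and return the final register values.

    Each instruction is a tuple (op, dest, src, imm) of a hypothetical ISA:
      "li"   : regs[dest] = imm
      "add"  : regs[dest] = regs[dest] + regs[src]
      "bnez" : jump to instruction index imm if regs[src] != 0
    """
    regs = [0] * num_registers
    pc = 0  # index of the next guest instruction to execute
    while pc < len(program):
        op, dest, src, imm = program[pc]
        if op == "li":
            regs[dest] = imm
        elif op == "add":
            regs[dest] = regs[dest] + regs[src]
        elif op == "bnez":
            if regs[src] != 0:
                pc = imm
                continue
        else:
            raise ValueError(f"unknown guest instruction: {op}")
        pc += 1
    return regs
```

The host only updates the guest's architectural state (registers and program counter); no timing is modeled, which is the distinction from the detailed simulator.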

Detailed Simulator
The detailed simulator, also referred to as the timing or architectural simulator, is the software component that models hardware structures and keeps track of their access times. The modeled hardware includes pipeline stages, pipe registers, instruction queues, functional units, cache memories, and others, all provided in Multi2Sim.

Visual Tool
The last software component involved in a microarchitecture model is the graphic visualization tool. As opposed to the runtime interaction scheme observed previously, the visual tool does not communicate with the detailed simulator during its execution. Instead, the detailed simulator generates a compressed text-based trace in an output file, which is consumed by the visual tool in a second execution of Multi2Sim.

The visual tool provides the user with cycle-based interactive navigation. For each simulation cycle, one can observe the state of the processor pipelines, instructions in flight, memory accesses traversing the cache hierarchy, and so on. This level of detail complements the global statistics provided by the timing simulator: not only can one observe final performance results, but also the causes of access contention on a specific hardware resource, as well as other performance bottlenecks.

Superscalar Pipelining
Multi2Sim supports cycle-based simulation of superscalar pipelines, modeling the instruction fetch, decode, issue, write-back, and commit stages. The model features hardware structures such as the reorder buffer, load/store queues, register file, and trace cache. The pipeline front-end supports different types of branch prediction and micro-instruction decoding, while the back-end implements out-of-order and speculative execution.
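As an illustration of what a front-end branch predictor does, the sketch below implements the classic two-bit saturating counter scheme. It is a textbook technique chosen for brevity, not a description of the predictors Multi2Sim provides; the class name, table size, and indexing function are assumptions made for this example.

```python
class TwoBitPredictor:
    """Table of 2-bit saturating counters indexed by branch address.

    Counter values 0-1 predict "not taken", values 2-3 predict "taken".
    """

    def __init__(self, table_size=1024):
        self.table_size = table_size
        # Start every counter at 1 (weakly not taken).
        self.counters = [1] * table_size

    def _index(self, pc):
        # Simplistic indexing: low-order part of the branch address.
        return pc % self.table_size

    def predict(self, pc):
        """Return True if the branch at address pc is predicted taken."""
        return self.counters[self._index(pc)] >= 2

    def update(self, pc, taken):
        """Train the counter with the actual branch outcome."""
        i = self._index(pc)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)
```

Because each counter saturates, a single mispredicted iteration (for example, a loop exit) does not immediately flip the prediction for a branch that is usually taken.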

Multi-Threading
A multithreaded pipeline model supports execution of multiple programs or one parallel application spawning child threads. The multithreaded processor shares a common pool of functional units among hardware threads. The rest of the pipeline resources can be configured as private or shared among hardware threads. Multi2Sim supports models for coarse-grain, fine-grain, and simultaneous multithreading.

Multicore
Superscalar and multithreaded pipelines are replicated a configurable number of times, forming models of multicore processors. Cores communicate through the memory hierarchy by means of transactions triggered by the memory coherence protocol.

Graphics Processing Units
A cycle-based simulation model is provided for state-of-the-art AMD and NVIDIA Graphics Processing Units (GPUs). Multi2Sim can run unmodified OpenCL programs, intercepting OpenCL function calls, transferring control to a custom runtime, and launching simulation of OpenCL device kernels. Original host and kernel binaries can run on Multi2Sim without an actual GPU being installed on the system.

Memory Hierarchy
The memory hierarchy is modeled with an event-driven simulation of cache memories, organized with a configurable number of cache levels, geometries, and latencies. An implementation of the MOESI protocol handles coherence between caches from different cores. The model also features directories for caches and main memory.
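How cache geometry and latency determine the cost of an access can be sketched as follows. The example models set-associative caches with LRU replacement and sums the latencies of the levels visited by one access. It deliberately omits coherence (no MOESI states or directories) and is not event-driven; the class name, the function name, and all parameter values used with them are assumptions for illustration.

```python
from collections import OrderedDict


class Cache:
    """Set-associative cache with LRU replacement (no coherence modeled)."""

    def __init__(self, num_sets, assoc, block_size, latency):
        self.num_sets = num_sets
        self.assoc = assoc
        self.block_size = block_size
        self.latency = latency
        # One ordered dict per set: keys are tags, oldest entry first.
        self.sets = [OrderedDict() for _ in range(num_sets)]

    def access(self, address):
        """Look up an address; allocate the block on a miss. Return True on hit."""
        block = address // self.block_size
        index = block % self.num_sets
        tag = block // self.num_sets
        lines = self.sets[index]
        if tag in lines:
            lines.move_to_end(tag)  # mark as most recently used
            return True
        if len(lines) >= self.assoc:
            lines.popitem(last=False)  # evict the least recently used block
        lines[tag] = True
        return False


def access_hierarchy(levels, memory_latency, address):
    """Return the cycles needed to serve an access through the cache levels."""
    cycles = 0
    for cache in levels:
        cycles += cache.latency
        if cache.access(address):
            return cycles
    return cycles + memory_latency
```

A two-level hierarchy would be built as a list such as [Cache(2, 1, 64, 2), Cache(4, 2, 64, 10)], where each tuple of arguments (sets, associativity, block size in bytes, latency in cycles) is one level's geometry.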

Interconnection Networks
Components in the memory hierarchy communicate through interconnection networks with configurable topologies, link bandwidths, routing algorithms, and virtual channels. An automatic cycle detection mechanism warns about possible deadlock conditions in networks.
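The text does not say which graph Multi2Sim inspects or which algorithm it uses. As a sketch of the general idea, the example below assumes the routing configuration has been reduced to a directed dependency graph (each node waits on its successors) and uses a depth-first search to report one cycle, if any exists. The function name and graph encoding are hypothetical.

```python
def find_cycle(graph):
    """Return a list of nodes forming a cycle in a directed graph, or None.

    graph maps each node to an iterable of successor nodes.
    """
    WHITE, GRAY, BLACK = 0, 1, 2  # unvisited, on current path, finished
    color = {}
    path = []

    def visit(node):
        color[node] = GRAY
        path.append(node)
        for succ in graph.get(node, ()):
            state = color.get(succ, WHITE)
            if state == GRAY:
                # succ is already on the current path: a cycle closes here.
                return path[path.index(succ):] + [succ]
            if state == WHITE:
                found = visit(succ)
                if found:
                    return found
        path.pop()
        color[node] = BLACK
        return None

    for node in graph:
        if color.get(node, WHITE) == WHITE:
            found = visit(node)
            if found:
                return found
    return None
```

A cycle in such a dependency graph indicates only that deadlock is possible, not that it will occur, which matches the wording of a warning rather than an error.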

Heterogeneous Computing
Multi2Sim integrates models for different CPU and GPU architectures, all of them simulated at the ISA level for high accuracy. This integration allows researchers to evaluate configurations of state-of-the-art commercial processors, in which heterogeneous processing devices are integrated on the same die.

Benchmark Support
Multi2Sim can run standard program executables generated by the user's compiler. In addition, the simulator ships with a complete set of pre-compiled benchmark suites that are commonly used by the research community for architectural exploration and performance evaluation.

Sequential CPU Benchmarks
A wide set of sequential benchmarks is supported, including implementations of algorithms for compression, compilation, and numerical computation. Custom code generated with gcc and other common compilers is also supported. Multi2Sim reads both static and dynamic ELF binaries, and launches program simulation transparently to the programmer or user.

Parallel CPU Benchmarks
Parallel applications using OpenMP or pthreads are fully supported on Multi2Sim. The simulator intercepts system calls spawning child threads and manages thread scheduling. The application binaries can be run unmodified, as compiled with the Linux implementations of these parallel libraries.

Massively Parallel GPU Benchmarks
Execution of OpenCL applications works transparently to the programmer and user. Multi2Sim reads two standard ELF binaries (one CPU host program and one GPU device kernel) and intercepts the OpenCL API calls that set up the GPU execution. Custom OpenCL code can be compiled with the public AMD software development kit, and then run on Multi2Sim.

Additional Tools
Additional tools are available to enhance the usability of Multi2Sim. With these tools, the user can easily launch simulations, carry out architectural explorations, debug simulation stages, or observe benchmark executions graphically. M2S-Cluster launches simulations in distributed environments, which significantly eases the architectural exploration process. M2S-Visual displays the state of the CPU and GPU pipelines, as well as the memory hierarchy, on a cycle-by-cycle basis. GPU-Calc measures the occupancy of a GPU compute unit, depending on the resource requirements of a kernel execution.

M2S-Cluster
M2S-Cluster is a system for automatically launching simulations of a set of benchmarks on top of Multi2Sim. The tool works on an infrastructure composed of a Linux-based client machine and a server formed of several compute nodes, using the Condor framework as the task scheduling mechanism. M2S-Cluster simplifies the routine task of launching sets of simulations by providing a single command-line tool that handles communication between the client and the server.

M2S-Visual


M2S-Visual is a visualization tool based on GTK that provides visual representations for the analysis of architectural simulations. Its primary function is to show the state of the CPU and GPU pipelines and of the memory hierarchy, with features such as pausing the simulation, stepping through cycles, and viewing the properties of in-flight instructions and memory accesses.

GPU-Calc: The GPU Occupancy Calculator


A compute unit of an AMD GPU imposes several limits on the number of OpenCL software elements that can run on it at a time; this number is referred to here as the compute unit occupancy. Multi2Sim optionally dumps occupancy plots based on static and run-time characteristics of the executed OpenCL kernels. In the example plots, the red dots indicate the values for the simulation used (a 64x64 matrix multiplication).
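The kind of calculation behind an occupancy figure can be sketched as taking the minimum over several per-resource limits. The resource categories (wavefront slots, registers, local memory, work-group slots), the function name, and every numeric limit in the example are assumptions for illustration; they are not taken from GPU-Calc or from any specific AMD device.

```python
def max_work_groups(work_group_size, registers_per_item, local_mem_per_group, limits):
    """Return how many work-groups fit on one compute unit, per limiting resource.

    limits is a dict of hypothetical hardware limits for one compute unit:
      wavefront_size, max_wavefronts, max_work_groups, registers, local_mem_bytes.
    The result maps each resource to its own bound, plus the overall minimum
    under the key "work_groups".
    """
    # Wavefronts needed by one work-group (ceiling division).
    wavefronts_per_group = -(-work_group_size // limits["wavefront_size"])

    bounds = {
        "work_group_slots": limits["max_work_groups"],
        "wavefront_slots": limits["max_wavefronts"] // wavefronts_per_group,
    }
    # A resource the kernel does not use imposes no bound.
    registers_per_group = registers_per_item * work_group_size
    if registers_per_group > 0:
        bounds["registers"] = limits["registers"] // registers_per_group
    if local_mem_per_group > 0:
        bounds["local_memory"] = limits["local_mem_bytes"] // local_mem_per_group

    result = dict(bounds)
    result["work_groups"] = min(bounds.values())
    return result
```

Reporting each bound separately, rather than only the minimum, shows which resource limits occupancy for a given kernel, which is the information an occupancy plot conveys.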

Website
Multi2Sim website: http://www.multi2sim.org/