User:A5b/Comparison of MPI, OpenMP, and Stream Processing

MPI
MPI (Message Passing Interface) is a language-independent communications protocol used to program parallel computers. It supports both point-to-point and collective communication. MPI "is a message-passing application programmer interface, together with protocol and semantic specifications for how its features must behave in any implementation." MPI is therefore a specification, not an implementation.

MPI is not sanctioned by any major standards body; nevertheless, it has become the de facto standard for communication among processes that model a parallel program running on a distributed memory system. Actual distributed memory supercomputers such as computer clusters often run these programs. The principal MPI-1 model has no shared memory concept, and MPI-2 has only a limited distributed shared memory concept. Nonetheless, MPI programs are regularly run on shared memory computers.

Designing programs around the MPI model (as opposed to explicit shared memory models) has advantages on NUMA (non-uniform memory access) architectures, because programming for MPI encourages memory locality.

Most MPI implementations consist of a specific set of routines (API) callable from Fortran, C, or C++ and from any language capable of interfacing with such routine libraries. The advantages of MPI over older message passing libraries are portability (because MPI has been implemented for almost every distributed memory architecture) and speed (because each implementation is in principle optimized for the hardware on which it runs).

MPI is often compared with PVM (Parallel Virtual Machine), a popular distributed environment and message passing system developed in 1989, which was one of the systems that motivated the need for a standard parallel message passing system.

Threaded shared memory programming models (such as Pthreads and OpenMP) and message passing programming (MPI/PVM) can be considered as complementary programming approaches.

OpenMP
OpenMP is an implementation of multithreading, a method of parallelization whereby the master "thread" (a series of instructions executed consecutively) "forks" a specified number of slave "threads" and a task is divided among them. The threads then run concurrently, with the runtime environment allocating threads to different processors.

The section of code that is meant to run in parallel is marked accordingly with a compiler directive (a #pragma in C/C++) that causes the threads to form before the section is executed. Each thread has an "id" attached to it, which can be obtained using a function (called omp_get_thread_num() in C/C++ and OMP_GET_THREAD_NUM() in Fortran). The thread id is an integer, and the master thread has an id of "0". After the parallelized code has executed, the threads "join" back into the master thread, which continues onward to the end of the program. The number of threads can be set either statically (by environment variables) or dynamically (by a function call).
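The fork-join behaviour described above can be sketched in C as follows. This is a minimal illustration, not a reference implementation: the function name fork_join_demo and its parameters are hypothetical, while omp_get_thread_num() and omp_get_num_threads() are standard OpenMP library routines. The header is guarded with the _OPENMP macro so the same file still compiles when OpenMP is not enabled.

```c
#ifdef _OPENMP
#include <omp.h>   /* only present when the compiler enables OpenMP */
#endif

/*
 * Hypothetical helper: forks a team of threads, lets every thread mark
 * its own id in seen[], and returns the team size after the join.
 * Built without OpenMP, the pragma is ignored and a single thread
 * with id 0 executes the block.
 */
int fork_join_demo(int *seen, int capacity)
{
    int team_size = 1;                    /* shared; written by thread 0 only */

    #pragma omp parallel
    {
        int id = 0;                       /* private: declared inside the region */
#ifdef _OPENMP
        id = omp_get_thread_num();        /* the master thread reports 0 */
        if (id == 0)
            team_size = omp_get_num_threads();
#endif
        if (id < capacity)
            seen[id] = 1;                 /* each thread touches its own slot */
    }   /* implicit join: only the master thread continues past this point */

    return team_size;
}
```

With GCC or Clang, OpenMP is typically enabled with the -fopenmp flag; the team size can then be influenced through the OMP_NUM_THREADS environment variable.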

By default, each thread executes the parallelized section of code independently. "Work-sharing constructs" can be used to divide a task among the threads so that each thread executes only its allocated part of the code. Both task parallelism and data parallelism can be achieved with OpenMP in this way.
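As a rough sketch of the two styles, the C fragment below uses the standard "parallel for" work-sharing construct for data parallelism and the "parallel sections" construct for task parallelism. The function names dot_product and min_and_max are hypothetical and chosen only for illustration.

```c
/*
 * Data parallelism: loop iterations are divided among the threads, and the
 * reduction clause combines the per-thread partial sums into one result.
 */
double dot_product(const double *a, const double *b, int n)
{
    double sum = 0.0;

    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < n; i++)
        sum += a[i] * b[i];

    return sum;
}

/*
 * Task parallelism: two independent tasks, each run by one thread.
 * Assumes n >= 1. Each section writes a different variable, so the
 * sections do not interfere with each other.
 */
void min_and_max(const int *x, int n, int *min_out, int *max_out)
{
    int lo = x[0];
    int hi = x[0];

    #pragma omp parallel sections
    {
        #pragma omp section
        {
            for (int i = 1; i < n; i++)
                if (x[i] < lo) lo = x[i];
        }
        #pragma omp section
        {
            for (int i = 1; i < n; i++)
                if (x[i] > hi) hi = x[i];
        }
    }

    *min_out = lo;
    *max_out = hi;
}
```

A compiler without OpenMP support ignores the pragmas, so both functions then run serially and produce the same results.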

Stream Processing
Stream processing is a computer programming paradigm, related to SIMD, that allows some applications to more easily exploit a limited form of parallel processing. Such applications can use multiple computational units, such as the floating point units on a GPU, without explicitly managing allocation, synchronization, or communication among those units.

The stream processing paradigm simplifies parallel software and hardware by restricting the parallel computation that can be performed. Given a set of data (a stream), a series of operations (kernel functions) is applied to each element in the stream. Uniform streaming, where one kernel function is applied to all elements in the stream, is typical. Kernel functions are usually pipelined, and local on-chip memory is reused to minimize external memory bandwidth. Since the kernel and stream abstractions expose data dependencies, compiler tools can fully automate and optimize on-chip management tasks. Stream processing hardware can use scoreboarding, for example, to launch DMAs at runtime, when dependencies become known. The elimination of manual DMA management reduces software complexity, and the elimination of hardware caches reduces the amount of die area not dedicated to computational units such as ALUs.
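The kernel-and-stream abstraction can be illustrated with a small C sketch. It only models the programming interface: the loop below runs sequentially on a CPU, whereas stream hardware would process many elements at once. All names here (kernel_fn, run_stream, scale_by_two, add_one) are hypothetical.

```c
#include <stddef.h>

/* A kernel maps one stream element to one output element. */
typedef float (*kernel_fn)(float);

/* Two example kernels. */
float scale_by_two(float x) { return 2.0f * x; }
float add_one(float x)      { return x + 1.0f; }

/*
 * Applies a pipeline of kernels to every element of the input stream.
 * No element depends on any other element, which is the property that
 * lets real stream hardware execute the elements in parallel.
 */
void run_stream(const float *in, float *out, size_t n,
                const kernel_fn *kernels, size_t num_kernels)
{
    for (size_t i = 0; i < n; i++) {
        float value = in[i];
        for (size_t k = 0; k < num_kernels; k++)
            value = kernels[k](value);    /* intermediate result stays local */
        out[i] = value;
    }
}
```

Passing a single kernel corresponds to the uniform streaming case; passing several models a pipeline in which intermediate values never leave local storage.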

Pros and Cons of MPI

 * Pros of MPI
 * does not require a shared memory architecture, which is more expensive than a distributed memory architecture
 * applies to a wider range of problems, since it can exploit both task parallelism and data parallelism
 * highly portable, with implementations optimized for most hardware


 * Cons of MPI
 * requires more code changes to go from a serial to a parallel version
 * can be harder to debug
 * performance is limited by the communication network between the nodes
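One reason for the extra code changes is that an MPI program has to decompose its data explicitly. The sketch below shows the kind of bookkeeping involved, under the assumption that an array of n elements is split into contiguous blocks across the processes. The helper block_range is hypothetical and makes no MPI calls; in a real program, rank and size would typically be obtained from MPI_Comm_rank and MPI_Comm_size.

```c
/*
 * Hypothetical helper: computes which contiguous slice of an n-element
 * array belongs to process `rank` out of `size` processes.
 * The first (n % size) ranks receive one extra element, so slice sizes
 * differ by at most one.
 */
void block_range(int rank, int size, int n, int *start, int *count)
{
    int base = n / size;      /* minimum number of elements per process */
    int rem  = n % size;      /* leftover elements to hand out */

    *count = base + (rank < rem ? 1 : 0);
    *start = rank * base + (rank < rem ? rank : rem);
}
```

For example, 10 elements over 4 processes yields slices of 3, 3, 2, and 2 elements starting at indices 0, 3, 6, and 8. Each process would then work only on its own slice and exchange boundary or result data through messages.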

Pros and Cons of OpenMP

 * Pros of OpenMP
 * easier to program and debug (compared to MPI)
 * data layout and decomposition are handled automatically by directives
 * gradual parallelism: directives can be added incrementally, so the program can be parallelized one portion at a time and no dramatic change to the code is needed
 * unified code for both serial and parallel applications: when a compiler without OpenMP support is used, OpenMP constructs are treated as comments (Fortran) or as ignored pragmas (C/C++)
 * original (serial) code statements need not, in general, be modified when parallelized with OpenMP, which reduces the chance of inadvertently introducing bugs and eases maintenance
 * both coarse-grained and fine-grained parallelism are possible


 * Cons of OpenMP
 * currently runs efficiently only on shared-memory multiprocessor platforms
 * requires a compiler that supports OpenMP
 * scalability is limited by the memory architecture
 * reliable error handling is missing
 * lacks fine-grained mechanisms to control thread-processor mapping
 * synchronization between subsets of threads is not allowed
 * mostly used for loop parallelization