
Data parallelism is a form of parallelization across multiple processors in parallel computing environments. It focuses on distributing the data across different nodes, which operate on the data in parallel. It contrasts with task parallelism, another form of parallelism.

Description
In a multiprocessor system executing a single set of instructions (SIMD), data parallelism is achieved when each processor performs the same task on different pieces of distributed data. In some situations, a single execution thread controls operations on all pieces of data. In others, different threads control the operation, but they execute the same code.

For instance, consider a 2-processor system (CPUs A and B) in a parallel environment, and a task we wish to perform on some data. We can tell CPU A to perform the task on one part of the data and CPU B on another part simultaneously, thereby reducing the duration of the execution. The data can be assigned using conditional statements, as described below. As a specific example, consider adding two matrices: in a data-parallel implementation, CPU A could add all elements from the top half of the matrices, while CPU B adds all elements from the bottom half. Since the two processors work in parallel, the matrix addition takes half the time of performing the same operation serially on one CPU alone.
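A minimal Python sketch of this split (the matrices, the row-wise partition, and the two-worker pool are illustrative; the draft does not preserve the original code):

```python
from concurrent.futures import ThreadPoolExecutor

def add_rows(a, b, rows):
    # Each worker adds its assigned rows of the two matrices.
    return [[a[i][j] + b[i][j] for j in range(len(a[i]))] for i in rows]

def parallel_matrix_add(a, b):
    n = len(a)
    top = range(0, n // 2)       # rows handled by "CPU A"
    bottom = range(n // 2, n)    # rows handled by "CPU B"
    with ThreadPoolExecutor(max_workers=2) as pool:
        fa = pool.submit(add_rows, a, b, top)
        fb = pool.submit(add_rows, a, b, bottom)
    return fa.result() + fb.result()

a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]
print(parallel_matrix_add(a, b))  # [[6, 8], [10, 12]]
```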

Data parallelism emphasizes the distributed (parallelized) nature of the data, as opposed to the processing (task parallelism). Most real programs fall somewhere on a continuum between task parallelism and data parallelism.

Example
The program sketched below, which applies some arbitrary operation to every element of an array, illustrates data parallelism:
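A Python sketch of such a program, including how a 2-CPU runtime might split the loop using conditional limits (the names `d`, `op`, `lower_limit`, and `upper_limit` are illustrative):

```python
import threading

def op(x):
    return x * 2           # some arbitrary operation

def run_on_cpu(cpu, d):
    # Each "CPU" derives its own loop limits from its identity and then
    # applies op to its own half of the shared array d, in place.
    n = len(d)
    if cpu == "a":
        lower_limit, upper_limit = 0, n // 2
    else:                  # cpu == "b"
        lower_limit, upper_limit = n // 2, n
    for i in range(lower_limit, upper_limit):
        d[i] = op(d[i])

d = [1, 2, 3, 4]
ta = threading.Thread(target=run_on_cpu, args=("a", d))
tb = threading.Thread(target=run_on_cpu, args=("b", d))
ta.start(); tb.start()
ta.join(); tb.join()
print(d)  # [2, 4, 6, 8]
```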

If the above example program is executed on a 2-processor system, the runtime environment may execute it as follows:


 * In an SPMD system, both CPUs will execute the code.
 * In a parallel environment, both will have access to the array.
 * A mechanism is presumed to be in place whereby each CPU creates its own copy of the loop limits, independent of the other.
 * A conditional clause differentiates between the CPUs: the branch for CPU A evaluates to true only on CPU A, and likewise for CPU B, so each CPU obtains its own values for the loop limits.
 * Both CPUs then execute the loop, but since each has different limit values, they operate on different parts of the array simultaneously, thereby distributing the task between themselves. This is faster than doing it on a single CPU.

This concept can be generalized to any number of processors. However, as the number of processors increases, it may be helpful to restructure the program so that each CPU operates on the entries assigned to its identifier, an integer between 1 and the number of CPUs that is unique to every CPU:
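A sketch of the restructured program, assuming each CPU strides over the array by its identifier (the stride scheme and variable names are illustrative):

```python
import threading

def worker(cpu_id, num_cpus, d):
    # cpu_id is a unique identifier between 1 and num_cpus; each CPU
    # handles the entries at positions cpu_id, cpu_id + num_cpus, ...
    # (1-based), i.e. indices cpu_id - 1, cpu_id - 1 + num_cpus, ...
    for i in range(cpu_id - 1, len(d), num_cpus):
        d[i] = d[i] + 10    # some arbitrary operation

d = [1, 2, 3, 4, 5, 6]
num_cpus = 2
threads = [threading.Thread(target=worker, args=(k, num_cpus, d))
           for k in range(1, num_cpus + 1)]
for t in threads: t.start()
for t in threads: t.join()
print(d)  # [11, 12, 13, 14, 15, 16]
```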

For example, on a 2-processor system, CPU A (identifier 1) will operate on odd entries and CPU B (identifier 2) on even entries.

Dependencies in Parallel Code
Transforming a sequential program into a parallel version requires that dependencies within the code be preserved. There are three types of dependencies: true (flow) dependence, anti-dependence, and output dependence.

Anti-dependence and output dependence can be dealt with by giving each process or thread its own copy of the variables involved (known as privatization). True dependence, however, must be preserved, requiring process synchronization.

Example of true dependence: S2 →T S3, meaning that S2 carries a true dependence to S3 because S2 writes to a variable that S3 subsequently reads.

Example of anti-dependence: S2 →A S3, meaning that S2 carries an anti-dependence to S3 because S2 reads from a variable before S3 writes to it.

Example of output dependence: S2 →O S3, meaning that S2 carries an output dependence to S3 because both statements write to the same variable.
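The three dependence types can be illustrated with straight-line statements; the labels S2 and S3 follow the text, while the concrete variables are illustrative:

```python
# True (flow) dependence: S2 writes x, S3 reads x, so S3 must run after S2.
x = 5          # S2
y = x + 1      # S3: reads the value S2 wrote

# Anti-dependence: S2 reads a before S3 writes it, so S2 must run first
# (or S3 must write a private copy of a).
a = 2
u = a * 3      # S2: reads a
a = 7          # S3: overwrites a afterwards

# Output dependence: S2 and S3 both write b, so the final write must stay last.
b = 1          # S2
b = 9          # S3

print(y, u, b)  # 6 6 9
```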

Loop-Carried vs Loop-Independent Dependence
Loops can have two types of dependence:


 * Loop-Carried Dependence
 * Loop-Independent Dependence

In loop-independent dependence, dependences exist within a single iteration of a loop but not between iterations. Each iteration may therefore be treated as a block and executed in parallel without other synchronization efforts.

In the following example code, used for swapping the values of two arrays of length n, there is a loop-independent dependence within each iteration. In loop-carried dependence, by contrast, statements in one iteration of a loop depend on statements in another iteration. Loop-carried dependence uses a modified version of the dependence notation seen earlier to denote that a statement carries a true dependence to a later iteration, meaning that one iteration of the loop writes to a location read by a subsequent iteration.

An example of loop-carried dependence is a loop in which each iteration reads a value written by the previous iteration.
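Both cases can be sketched in Python; the swap loop matches the description above, while the running-sum loop is an assumed example of a carried dependence:

```python
def swap(a, b):
    # Loop-independent dependence: within one iteration, the statements
    # depend on each other (tmp is written, then read), but no iteration
    # depends on any other, so iterations could run in parallel.
    for i in range(len(a)):
        tmp = a[i]       # S1
        a[i] = b[i]      # S2
        b[i] = tmp       # S3: reads tmp written by S1

def running_sum(a):
    # Loop-carried dependence: iteration i reads a[i-1], which was
    # written by the previous iteration, so iterations cannot simply
    # run in parallel.
    for i in range(1, len(a)):
        a[i] = a[i - 1] + a[i]

a, b = [1, 2, 3], [4, 5, 6]
swap(a, b)
print(a, b)   # [4, 5, 6] [1, 2, 3]
c = [1, 1, 1]
running_sum(c)
print(c)      # [1, 2, 3]
```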

Loop Carried Dependence Graph
A Loop-Carried Dependence Graph graphically shows the loop-carried dependencies between iterations. Each iteration is listed as a node on the graph, and directed edges show the true, anti, and output dependencies between each iteration.

Loop-Level Parallelism
Loops can exhibit four types of parallelism:


 * DOALL Parallelism
 * DOACROSS Parallelism
 * DOPIPE Parallelism
 * DISTRIBUTED Loop

DOALL Parallelism
DOALL Parallelism exists when statements within a loop can be executed independently, i.e. when there is no loop-carried dependence. In the following code, each iteration writes only its own element of the result array and reads only from arrays that no iteration modifies, so no iteration has a dependence on any other iteration.

Let TS1 be the time of one execution of S1; the execution time of the sequential form of the above code is then n × TS1. Because DOALL Parallelism exists when all iterations are independent, speedup may be achieved by executing all iterations in parallel, giving an execution time of TS1, the time taken for one iteration in sequential execution.

The following example, using a simplified pseudocode, shows how a loop might be parallelized to execute each iteration independently.
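A Python sketch of such a parallelization, assuming the loop body is an element-wise addition (the original pseudocode is not preserved in this draft):

```python
from concurrent.futures import ThreadPoolExecutor

def doall_add(a, b):
    # S1: c[i] = a[i] + b[i]. Each iteration touches only index i and
    # reads arrays that are never written, so there is no loop-carried
    # dependence and every iteration may execute in parallel.
    with ThreadPoolExecutor() as pool:
        return list(pool.map(lambda pair: pair[0] + pair[1], zip(a, b)))

print(doall_add([1, 2, 3], [10, 20, 30]))  # [11, 22, 33]
```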

DOACROSS Parallelism
DOACROSS Parallelism exists where iterations of a loop are parallelized by extracting calculations that can be performed independently and running them simultaneously. Synchronization exists to enforce loop-carried dependence.

Consider the following loop, in which each iteration depends on a value produced by the previous iteration.

Each loop iteration performs two actions:


 * Calculate an intermediate value that does not depend on other iterations
 * Use that value in the assignment that carries the dependence

Calculating the value and then performing the assignment can be decomposed into two lines:

The first line has no loop-carried dependence. The loop can then be parallelized by computing the intermediate value for every iteration in parallel and synchronizing only the dependent assignments.
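A sketch of this decomposition, assuming the loop body has the form `a[i] = a[i-1] + b[i] * c[i]` (an illustrative loop consistent with the description; the events serialize only the carried dependence):

```python
import threading

def doacross(a, b, c):
    # Original loop: a[i] = a[i-1] + b[i] * c[i]   (carried dependence on a)
    # Decomposed:    temp[i] = b[i] * c[i]         (independent, parallel)
    #                a[i] = a[i-1] + temp[i]       (serialized via events)
    n = len(a)
    temp = [0] * n
    done = [threading.Event() for _ in range(n)]
    done[0].set()                    # a[0] is already available

    def iteration(i):
        temp[i] = b[i] * c[i]        # independent part, runs in parallel
        done[i - 1].wait()           # wait until a[i-1] has been written
        a[i] = a[i - 1] + temp[i]    # enforce the loop-carried dependence
        done[i].set()

    threads = [threading.Thread(target=iteration, args=(i,)) for i in range(1, n)]
    for t in threads: t.start()
    for t in threads: t.join()

a = [1, 0, 0, 0]
doacross(a, [0, 2, 3, 4], [0, 1, 1, 1])
print(a)  # [1, 3, 6, 10]
```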

DOPIPE Parallelism
DOPIPE Parallelism implements pipelined parallelism for loop-carried dependence where a loop iteration is distributed over multiple, synchronized loops.

Consider the following code, in which the first statement carries a dependence between iterations.

S1 must be executed sequentially, but S2 has no loop-carried dependence. S2 could be executed in parallel using DOALL Parallelism after performing all calculations needed by S1 in series; however, the resulting speedup is limited. A better approach is to pipeline the loop so that the S2 corresponding to each S1 executes as soon as that S1 has finished.

Implementing pipelined parallelism results in the following set of loops, where the second loop may execute for an index as soon as the first loop has finished its corresponding index.
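A sketch of the pipelined loops, assuming S1 is `a[i] = a[i-1] + b[i]` and S2 is `c[i] = a[i] * 2` (illustrative statements consistent with the description; a queue signals when each index is ready):

```python
import threading
import queue

def dopipe(a, b):
    # Loop 1 runs S1 sequentially; loop 2 consumes each index as soon
    # as loop 1 produces it, overlapping S2 with later S1 work.
    n = len(a)
    c = [0] * n
    ready = queue.Queue()

    def loop1():
        for i in range(1, n):
            a[i] = a[i - 1] + b[i]   # S1: carried dependence on a
            ready.put(i)             # signal that index i is ready
        ready.put(None)              # sentinel: no more work

    def loop2():
        while True:
            i = ready.get()
            if i is None:
                break
            c[i] = a[i] * 2          # S2: no loop-carried dependence

    t1 = threading.Thread(target=loop1)
    t2 = threading.Thread(target=loop2)
    t1.start(); t2.start()
    t1.join(); t2.join()
    return c

print(dopipe([1, 0, 0], [0, 2, 3]))  # [0, 6, 12]
```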

DISTRIBUTED Loop
When a loop has a loop-carried dependence, another way to parallelize it is to distribute the loop into several different loops, separating statements that are not dependent on each other so that the distributed loops can be executed in parallel. For example, consider a loop whose first statement carries a loop-carried dependence but whose second statement does not depend on the first. The code can then be rewritten as two loops, loop1 and loop2, which can be executed in parallel. Note that instead of a single instruction being performed in parallel on different data, as in data-level parallelism, here different parallel functions perform different tasks on different data. This type of parallelism is known as function, or task, parallelism.
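A sketch of loop distribution, assuming the original body contains S1: `a[i] = a[i-1] + b[i]` (carried dependence) and S2: `d[i] = b[i] * 2` (independent of S1); the two distributed loops then run concurrently:

```python
import threading

def distributed(a, b, d):
    # loop1 keeps the statement with the loop-carried dependence;
    # loop2 holds the independent statement. The two loops share no
    # written data, so they can execute in parallel.
    def loop1():
        for i in range(1, len(a)):
            a[i] = a[i - 1] + b[i]   # S1

    def loop2():
        for i in range(len(d)):
            d[i] = b[i] * 2          # S2

    t1 = threading.Thread(target=loop1)
    t2 = threading.Thread(target=loop2)
    t1.start(); t2.start()
    t1.join(); t2.join()

a, b, d = [1, 0, 0], [0, 2, 3], [0, 0, 0]
distributed(a, b, d)
print(a, d)  # [1, 3, 6] [0, 4, 6]
```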

Steps to Parallelization
The process of parallelizing a sequential program can be broken down into four discrete steps: decomposition of the computation into tasks, assignment of tasks to processes, orchestration of data access, communication, and synchronization among the processes, and mapping of processes to processors.

Programming Model
The choice of programming model affects how a program will be parallelized. The two commonly used models are shared address space and message passing.

Aspects of Shared Address Space vs Message Passing, as described by Yan Solihin.

Shared Address Space
In a shared address space model, each processor has access to all memory. This simplifies inter-process communication, because communication occurs implicitly by reading or writing to shared variables. However, such processes need explicit synchronization to prevent race conditions from occurring in regions of shared memory. Further, it typically requires a cache coherency protocol that causes additional overhead.
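A minimal sketch of the shared address space model using threads (the counter and lock are illustrative): both threads read and write the same variable, so communication is implicit, but a lock is required to prevent a race.

```python
import threading

counter = 0
lock = threading.Lock()

def worker(n):
    global counter
    for _ in range(n):
        with lock:            # explicit synchronization on shared memory
            counter += 1

threads = [threading.Thread(target=worker, args=(1000,)) for _ in range(2)]
for t in threads: t.start()
for t in threads: t.join()
print(counter)  # 2000
```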

Message Passing
With a message passing model, each process lives in its own address space, distinct from those of the other processes. Communication occurs when a process explicitly sends a message to, or receives a message from, another process. Passing messages and copying data between address spaces adds overhead, but also acts to synchronize processes and prevents memory race conditions by default.
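A sketch of the message passing style, simulated here with threads that keep their data local and interact only through channels (the queue-based channels stand in for true inter-process messages):

```python
import threading
import queue

# Each "process" touches only its own local data and communicates solely
# by sending and receiving messages, which also synchronizes the two sides.
to_worker, to_main = queue.Queue(), queue.Queue()

def worker():
    local = to_worker.get()      # receive a message (data is copied in)
    to_main.put(sum(local))      # send the result back

t = threading.Thread(target=worker)
t.start()
to_worker.put([1, 2, 3, 4])      # explicit send
result = to_main.get()           # explicit receive (blocks until ready)
t.join()
print(result)  # 10
```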