Loop Parallelization
Loop parallelization is the process of distributing loop iterations across multiple processing elements (or SMT cores) to reduce execution time. This can be done manually (by marking which iterations can execute in parallel using compiler directives) or automatically by algorithms that detect parallelism within loop bodies. Scientific applications generally consist of computationally intensive loops that perform a repeated set of operations on data, and detecting and extracting parallelism from these loops can yield significant speedup. Beyond the parallelization technique itself, the maximum achievable speedup is bounded by other factors, mainly Amdahl's Law, granularity, data distribution and the latency of the memory system.

Dependencies
Parallelization transformations change the execution of a program from one thread to multiple threads without affecting its determinism. This holds only if all the dependencies of the single-threaded variant are respected.

Consider two statements P and Q within a loop where the variable i stores the iteration value,

where P(i) denotes the i-th instance of statement P.

Then instance Q(j) depends on P(i) if there exists a memory location M such that:


 * 1) Both P(i) and Q(j) access memory location M, and at least one of them writes to it.

 * 2) P(i) executes before Q(j) in sequential execution.

 * 3) No other instance overwrites the value in M between P(i) and Q(j).

The dependence is called a loop-carried dependence (inter-iteration) if i != j, and a loop-independent dependence (intra-iteration) if i = j.
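
The distinction can be illustrated with a small, hypothetical loop (the array names and values are assumptions, chosen only to make the two kinds of dependence visible):

```python
n = 8
a = [0] * n
b = [0] * n

for i in range(1, n):
    a[i] = a[i - 1] + 1   # P: reads a[i-1], written in the previous
                          # iteration -> loop-carried dependence (i != j)
    b[i] = a[i] * 2       # Q: reads a[i], written by P in the same
                          # iteration -> loop-independent dependence (i == j)
```

Only the loop-carried dependence constrains which iterations may run concurrently; the loop-independent one merely orders statements within an iteration.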

Bernstein's conditions can be used to ascertain whether two program segments are independent and can be executed in parallel.

Types of dependencies
Depending on the sequence in which the memory location is accessed by different instances, the dependence can be classified as:


 * 1) Flow dependence (also known as true dependence) - a write followed by a read of the same location.

 * 2) Anti dependence - a read followed by a write.

 * 3) Output dependence - a write followed by a write.

 * 4) Input dependence - a read followed by a read.
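
All four can appear in one loop. The following sketch is a constructed example (statement bodies are assumptions, labeled S1-S5 for reference):

```python
a = [1, 2, 3, 4]
x = 0

for i in range(1, 3):
    x = a[i] + 1      # S1: writes x
    a[i] = x * 2      # S2: reads x written by S1      -> flow (true) dependence
    y = a[i + 1]      # S3: reads a[i+1], which the next iteration's
                      # S2 will write                  -> anti dependence
    x = y - 1         # S4: writes x after S1 wrote it -> output dependence
    z = a[0] + y      # S5: re-reads a[0] every iteration -> input dependence
```

Flow dependences must be honored; anti and output dependences can be removed by renaming or privatization, and input dependences impose no ordering at all.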

Types of parallelization techniques
Depending on the method used to preserve the original dependencies, different communication patterns exist between threads. Based on this, the techniques can be broadly classified into:


 * 1) Independent Multi-threading (IMT) - no communication between the threads (e.g. DOALL).

 * 2) Cyclic Multi-threading (CMT) - communication exists between threads within the loop body; the graph with threads as nodes and communication as directed edges is cyclic (e.g. DOACROSS and TLS).

 * 3) Pipelined Multi-threading (PMT) - communication exists between threads within the loop body; the graph with threads as nodes and communication as directed edges is acyclic (e.g. DOPIPE).

Loop parallelization can be achieved by implementing the following techniques:


 * 1) Parallelization between iterations

 * 2) Loop distribution

 * 3) DOALL parallelization

 * 4) DOACROSS parallelization

 * 5) DOPIPE parallelization

Parallelization between iterations
Parallelism can be identified by analyzing which loop iterations can run in parallel, which requires analyzing the loop-level dependencies. The most critical are the true dependencies, as anti and output dependencies can be eliminated using privatization. Consider the following code:
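
The loop itself is absent from this draft; a minimal sketch consistent with the distance-2 dependence discussed next (array name, bounds and initial values are assumptions) is:

```python
n = 10
a = list(range(n + 2))    # assumed initial values

# S1: each iteration reads the value written two iterations earlier,
# giving the loop-carried dependence S1[i] ->T S1[i+2].
for i in range(2, n):
    a[i] = a[i - 2] + 1
```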

Here, there is a loop-carried dependence S1[i] ->T S1[i+2]. But the odd and the even iterations are independent of each other, so it is possible to execute them in two separate loops. The execution can be shown as follows:
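
A sketch of the split loops (same assumed array as the distance-2 loop above); each loop keeps its own dependence chain, so the two loops could run on separate threads:

```python
n = 10
a = list(range(n + 2))

# Even iterations: the chain a[2] -> a[4] -> ... stays inside this loop.
for i in range(2, n, 2):
    a[i] = a[i - 2] + 1

# Odd iterations: the chain a[3] -> a[5] -> ... stays inside this loop.
# In a parallel implementation this loop runs concurrently on a second thread.
for i in range(3, n, 2):
    a[i] = a[i - 2] + 1
```

The final array is identical to the one produced by the original sequential loop.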

Here the even and odd loops can run simultaneously; note, however, that each of the two loops is itself executed sequentially. If the execution time of statement S1 is Ts1, then the execution time for N iterations is N*Ts1. With the parallelization shown above, the execution time reduces to (N/2)*Ts1.

Loop distribution
Some statements in a loop may be independent of the other statements, and such statements can be executed in parallel. Consider the following code:
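
The loop is missing from this draft; a sketch matching the dependences described next (S1 carries a dependence from one iteration to the next, S2 is independent of S1; array names and values are assumptions):

```python
n = 8
a = [1] * (n + 1)
b = list(range(n))
c = [0] * n

for i in range(1, n):
    a[i] = a[i - 1] + 1   # S1: loop-carried dependence S1[i] ->T S1[i+1]
    c[i] = b[i] * 2       # S2: independent of S1, no loop-carried dependence
```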

Here statements S1 and S2 are independent of each other (there is neither a loop-independent nor a loop-carried dependence between them), so they can be executed separately. There is a loop-carried dependence in S1 (S1[i] ->T S1[i+1]), so S1 has to be executed sequentially. All the iterations of S2 can run in parallel, as S2 has no loop-carried dependence (DOALL parallelism). The loop can therefore be distributed into two loops. If the execution times of S1 and S2 are Ts1 and Ts2 respectively, the execution time without parallelization for N iterations is N*(Ts1 + Ts2). After distribution and parallelization, the execution time is max(N*Ts1, Ts2).
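
A sketch of the distributed form (same assumed arrays as above); the first loop stays sequential while the second is a DOALL loop:

```python
n = 8
a = [1] * (n + 1)
b = list(range(n))
c = [0] * n

# S1 keeps its loop-carried dependence and must run sequentially.
for i in range(1, n):
    a[i] = a[i - 1] + 1

# S2 has no loop-carried dependence: every iteration could run in
# parallel (DOALL); shown sequentially here for clarity.
for i in range(1, n):
    c[i] = b[i] * 2
```

The two distributed loops can themselves run concurrently, since S2 does not depend on S1.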

DOALL parallelization
In the best-case scenario, a loop may have no loop-carried dependences at all, so that every iteration is independent of every other. In that case all the iterations can be executed in parallel; this is called DOALL parallelization. Consider a loop whose only statement S1 reads and writes values belonging to its own iteration: since there are no loop-carried dependences, all the iterations can run in parallel. If the execution time of S1 is Ts1, then without parallelization the execution time for N iterations would be N*Ts1. After parallelization, the execution time is Ts1.
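
A sketch of such a DOALL loop (array names and the statement body are assumptions); each iteration touches only its own elements, so the iterations can be handed to a thread pool in any order:

```python
import concurrent.futures

n = 8
b = list(range(n))
c = list(range(n))
a = [0] * n

def body(i):
    # S1: a[i] depends only on values belonging to iteration i,
    # so there is no loop-carried dependence.
    a[i] = b[i] + c[i]

# Every iteration executes in parallel (DOALL).
with concurrent.futures.ThreadPoolExecutor() as pool:
    list(pool.map(body, range(n)))
```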

For DOALL parallelization, the speedup increases linearly with the number of processors, provided the number of processors does not exceed the number of iterations. Provable DOALL parallelism generally does not exist to any significant level in programs; however, a fair amount of statistical DOALL parallelism does exist in general-purpose applications. Statistical DOALL loops show no cross-iteration dependences during profiling, even though the compiler cannot prove their independence. These loops can be parallelized speculatively, using lightweight hardware support to detect mis-speculation and roll back the execution if needed.

DOACROSS parallelization
DOACROSS parallelism allows us to extract parallelism from loops that have loop-carried dependences. Consider a loop whose statement S has the loop-carried dependence S[i] ->T S[i+1], but in which part of the computation, say f[i]/d[i], can be executed independently. One option is to split the independent part into a separate DOALL loop that stores its results in an array temp[i]; however, this creates additional storage overhead for temp[i]. DOACROSS parallelism implements the same partial loop-carried dependence differently: each iteration is still a parallel task, but synchronisation is used to ensure that a consumer iteration only reads data that has already been produced by the producer iteration. This can be achieved using wait and post synchronisation primitives. In the resulting loop, S1 (the independent part) has no loop-carried dependence and S2 carries the dependence. The temporary variable is now a private scalar rather than a shared array, thus reducing the storage overhead. post(i) indicates that the value of a[i] has been produced and is ready for consumption; wait(i-1) makes the consumer wait until a[i-1] has been produced.
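
A runnable sketch of the DOACROSS form, assuming the loop body a[i] = a[i-1] + f[i]/d[i] (array names taken from the discussion above; one thread per iteration is used only for clarity). One threading.Event per iteration plays the role of post/wait:

```python
import threading

n = 8
f = [float(2 * i) for i in range(n)]
d = [2.0] * n
a = [0.0] * n

produced = [threading.Event() for _ in range(n)]
produced[0].set()               # a[0] is available from the start

def iteration(i):
    temp = f[i] / d[i]          # S1: independent part, runs in parallel
    produced[i - 1].wait()      # wait(i-1): block until a[i-1] is ready
    a[i] = a[i - 1] + temp      # S2: serialized loop-carried part
    produced[i].set()           # post(i): a[i] is now ready

threads = [threading.Thread(target=iteration, args=(i,)) for i in range(1, n)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

temp is a private scalar in each task, so no shared temp[i] array is needed.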

If Ts1 is the execution time of S1 and Ts2 is the execution time of S2, then without DOACROSS the execution time for N iterations would be N*(Ts1 + Ts2). After parallelization, S1 runs in parallel with the statements of other iterations, whereas S2 is serialized. Hence, the execution time is now Ts1 + N*Ts2. DOACROSS exploits iteration-level parallelism.

In DOACROSS parallelism, data is synchronized, bundled and sent between cores when it is needed, so there is a communication overhead associated with DOACROSS. The communication latency between processors is one of the major factors determining the effectiveness of DOACROSS parallelism.

DOPIPE parallelization
Consider a loop containing a statement S1 that produces e[i] and a statement S2 that consumes it. This loop can be parallelized by distributing it and introducing pipelined parallelism: right after e[i] is generated by statement S1 in the first loop, the second loop executes statement S2, which reads the just-produced value of e[i]. This type of parallelism is called DOPIPE parallelism. If we assume that the execution times of S1 (Ts1) and S2 (Ts2) are both equal to T, then the execution time using DOPIPE becomes approximately N*T for N iterations, whereas sequential execution takes N*2T.
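
A runnable sketch of DOPIPE, assuming a producer statement e[i] = d[i] / 2 feeding a consumer f[i] = e[i] + 1 (both statement bodies are assumptions; the per-element producer-to-consumer synchronisation is the point):

```python
import threading

n = 8
d = [float(2 * i) for i in range(n)]
e = [0.0] * n
f = [0.0] * n

ready = [threading.Event() for _ in range(n)]

def producer():
    # First loop: S1 produces e[i] and signals the consumer.
    for i in range(n):
        e[i] = d[i] / 2
        ready[i].set()          # post: e[i] is ready

def consumer():
    # Second loop: S2 waits for e[i], then consumes it.
    for i in range(n):
        ready[i].wait()         # wait: block until e[i] is produced
        f[i] = e[i] + 1

t1 = threading.Thread(target=producer)
t2 = threading.Thread(target=consumer)
t1.start(); t2.start()
t1.join(); t2.join()
```

Unlike DOACROSS, the communication here flows in one direction only, from the producer thread to the consumer thread, which is why the thread graph is acyclic.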

DOPIPE can perform better than DOACROSS in applications that exhibit regular dependences.