DOACROSS parallelism

DOACROSS parallelism is a parallelization technique used to perform Loop-level parallelism by utilizing synchronisation primitives between statements in a loop. This technique is used when a loop cannot be fully parallelized by DOALL parallelism due to data dependencies between loop iterations, typically loop-carried dependencies. The sections of the loop which contain loop-carried dependence are synchronized, while treating each section as a parallel task on its own. Therefore, DOACROSS parallelism can be used to complement DOALL parallelism to reduce loop execution times.

Description
DOACROSS parallelism is particularly useful when one statement depends on the values generated by another statement. In such a loop, DOALL parallelism can not be implemented in a straightforward manner. If the first statement blocks the execution of the second statement until the required value has been produced, then the two statements would be able to execute independent of each other (i.e.), each of the aforementioned statements would be parallelized for simultaneous execution using DOALL parallelism.

The following pseudocode illustrates the operation of DOACROSS parallelism in such a situation.

Example
In this example, each iteration of the loop requires the value written into a by the previous iteration. However, the entire statement is not dependent on the previous iteration, but only a portion of it. The statement is split into two blocks to illustrate this. The first statement has no loop carried dependence now, and the result of this statement is stored in the variable temp. The post command is used to signal that the required result has been produced for utilization by other threads. The wait (i-2) command waits for the value a[i-2] before unblocking. The execution time of DOACROSS parallelism largely depends on what fraction of the program suffers from loop-carried dependence. Larger gains are observed when a sizable portion of the loop is affected by loop-carried dependence.

Drawbacks
DOACROSS parallelism suffers from significant space and granularity overheads due to the synchronization primitives used. Modern day compilers often overlook this method because of this major disadvantage. The overheads may be reduced by reducing the frequency of synchronization across the loop, by applying the primitives for groups of statements at a time.