
Algorithm
The naive quicksort algorithm is also called sequential quicksort. It uses the divide-and-conquer method to do the sorting and consists of three main phases. Given an array A[0, ..., N-1] with N elements, A can be sorted with sequential quicksort in the following three steps.


 * 1) Select a value c, which we call the pivot.
 * 2) Partition the array A into two subarrays: the left subarray $$A_L$$ contains all A[i] < c, and the right subarray $$A_R$$ contains all A[i] $$\ge$$ c.
 * 3) Apply steps 1 and 2 recursively to the subarrays $$A_L$$ and $$A_R$$ until the length of the subarray is less than 2.

Pseudocode
algorithm Sequential_Quicksort(A, length)
    if length < 2 then return
    select a pivot c
    k := partition(A, c)   // rearrange A so that A[0, ..., k-1] < c and c ends up at index k
    Sequential_Quicksort(A[0, ..., k-1], k)
    Sequential_Quicksort(A[k+1, ..., length-1], length-1-k)

For more details, see Quicksort.
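The three steps above can be sketched as a short Python function (the list-based partition is for illustration; an in-place partition as in the pseudocode is more common in practice):

```python
def sequential_quicksort(a):
    """Naive sequential quicksort via divide and conquer."""
    if len(a) < 2:                      # step 3: stop when length is less than 2
        return a
    c = a[len(a) // 2]                  # step 1: select a pivot c
    left = [x for x in a if x < c]      # step 2: A_L holds all elements < c
    equal = [x for x in a if x == c]    # elements equal to the pivot
    right = [x for x in a if x > c]     # step 2: A_R holds all elements > c
    return sequential_quicksort(left) + equal + sequential_quicksort(right)
```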

Parallel quicksort with distributed memory
Peter Sanders has in his work compared the performance of parallel quicksort with grid sort, shear sort, bitonic sort, radix sort and sample sort. Quicksort is generally the fastest sorting algorithm in most circumstances, so it is worthwhile to parallelize it efficiently. The sequential quicksort algorithm is simple to implement and efficient as a sequential sorting algorithm. Moreover, it can be parallelized thanks to its recursive partitioning and the independence of the two partitioned arrays $$A_L$$ and $$A_R$$. In this section, all computations are assumed to be executed on a distributed-memory MIMD machine with $$P$$ processors. The input consists of $$n$$ elements with $$n=NP$$ ($$N$$ is the maximal number of elements per processor). The $$P$$ processors are numbered $$PE_0, PE_1, ..., PE_{P-1}$$. After the parallel quicksort, the elements on each $$PE_i$$ are locally sorted, i.e. $$PE_i[j] \le PE_i[j+1]$$ for all $$0\le j<N-1$$, and all $$n$$ elements are globally sorted, i.e. every element on $$PE_i$$ is smaller than or equal to any element on $$PE_k$$ whenever $$i < k$$. The following pseudocode describes how parallel quicksort works with distributed memory.
algorithm ParQSort_Dist(s: input array of n elements, i: start index, j: end index)
    length := j-i+1
    if length = 1 then   // only one processor left: sort locally and return
        QuickSort(s)
        return s
    c := pickPivot(s, i, j)   // select a pivot
    $$S_l := \{e \in s: e \le c\}$$   // all elements less than or equal to c
    $$S_r := \{e \in s: e > c\}$$   // all elements larger than c
    $$N_l := |S_l|$$   // the number of elements in the left array
    $$N_r := |S_r|$$   // the number of elements in the right array
    // partition the processors
    $$K' := length \cdot \frac{N_l}{N_l+N_r}$$
    $$K := \lfloor K' \rfloor$$ or $$\lceil K' \rceil$$, whichever minimizes $$\max\{\lceil \frac{N_l}{K} \rceil, \lceil \frac{N_r}{length-K} \rceil\}$$
    // redistribute the elements
    send the elements of $$S_l$$ to $$PE_i, ..., PE_{i+K-1}$$ (each processor receives at most $$\lceil \frac{N_l}{K} \rceil$$ elements)
    send the elements of $$S_r$$ to $$PE_{i+K}, ..., PE_j$$ (each processor receives at most $$\lceil \frac{N_r}{length-K} \rceil$$ elements)
    if $$i_{PE} < i+K$$ then
        ParQSort_Dist(s, i, i+K-1)
    else
        ParQSort_Dist(s, i+K, j)

A roundoff error may occur since some processors can get more data than others, which causes imbalance. Stopping the recursion early and using another sorting algorithm, e.g. merge-splitting sort, for processor groups of size between 2 and 4 makes the algorithm more efficient and less imbalanced.
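The processor-partitioning step (choosing $$K$$) can be illustrated with a small Python helper; the function name and interface are illustrative, not taken from Sanders' code:

```python
import math

def choose_K(N_l, N_r, length):
    """Pick K, the number of processors assigned to the left part, as floor or
    ceil of K' = length * N_l / (N_l + N_r), whichever minimizes the worst
    per-processor load max(ceil(N_l/K), ceil(N_r/(length-K)))."""
    K_prime = length * N_l / (N_l + N_r)
    candidates = [K for K in {math.floor(K_prime), math.ceil(K_prime)}
                  if 1 <= K <= length - 1]   # both sides need at least one processor
    return min(candidates,
               key=lambda K: max(math.ceil(N_l / K),
                                 math.ceil(N_r / (length - K))))
```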

Local sorting
For large $$n$$, imbalance grows with the depth of the recursion and increases the computation time for local sorting. Sorting locally before any communication (broadcast, reduction, prefix sum) mitigates this issue. However, this measure requires extra computation for merging data during redistribution in order to keep the local sequences sorted.
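Because every processor keeps its local data sorted, redistributed elements arrive as sorted runs, so a multi-way merge (rather than a full re-sort) keeps the local sequence sorted. A minimal sketch using the Python standard library (the function name is illustrative):

```python
import heapq

def merge_incoming(local_sorted, incoming_runs):
    """Merge the already-sorted local data with the sorted runs received
    during redistribution; heapq.merge consumes each run lazily."""
    return list(heapq.merge(local_sorted, *incoming_runs))
```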

If $$P$$ is a power of 2, explicit indexing schemes can further improve efficiency, e.g. row- and column-major order, snake-like order, Hilbert curve, or shuffle order.

Pivot selection
Pivot selection has a great impact on the quicksort algorithm. Choosing an appropriate pivot reduces the number of comparisons between the elements and the pivot.

The simplest pivot selection method is the local median of $$PE_i[0]$$, $$PE_i[N/2]$$ and $$PE_i[N-1]$$ ($$PE_i$$ is the i-th processor). Another median-based method is the global median: the median of the local medians of all processors $$PE_0$$, $$PE_1$$, ..., $$PE_{P-1}$$. For small $$n$$, the local median is appropriate, but for large $$n$$, the global median is much more efficient.
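Both median strategies can be sketched in a few lines of Python (the function names are illustrative):

```python
import statistics

def local_median_pivot(pe):
    """Median of the first, middle, and last element of one processor's data."""
    N = len(pe)
    return statistics.median([pe[0], pe[N // 2], pe[N - 1]])

def global_median_pivot(processors):
    """Median of the local medians of all processors PE_0, ..., PE_{P-1}."""
    return statistics.median(local_median_pivot(pe) for pe in processors)
```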

In the basic case with one pivot, the input array is partitioned into two subarrays. With $$k$$ pivots, the input array is split into $$k+1$$ parts. If $$k=P$$, the partition works the same way as Samplesort; generally, $$k$$ is much smaller than $$P$$. As $$k$$ increases, the recursion depth decreases. This trick saves cost on recursion, but it also causes more work for data transfer: with larger $$k$$, the data travel along longer paths in smaller packets. According to the work of Peter Sanders and Thomas Hansch, the best performance is achieved with $$k=4$$ combined with the shuffle indexing referred to before.
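Partitioning with $$k$$ pivots into $$k+1$$ parts can be sketched as follows (the handling of elements equal to a pivot is one of several possible conventions):

```python
import bisect

def multi_pivot_partition(a, pivots):
    """Split a into k+1 parts using k pivots: an element e goes to part i,
    where i is the number of pivots that are <= e."""
    pivots = sorted(pivots)
    parts = [[] for _ in range(len(pivots) + 1)]
    for e in a:
        parts[bisect.bisect_right(pivots, e)].append(e)
    return parts
```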

Parallel quicksort with shared memory
All parallel quicksort operations in this part are based on the assumption of a shared-memory machine with cache coherence.

3+1 phase algorithm
Philippas Tsigas and Yi Zhang describe the 3+1 phase algorithm, a parallel quicksort algorithm, in the paper "Fast Parallel Implementation of Quicksort And Its Performance Evaluation on SUN Enterprise 10000". The algorithm consists of four phases: parallel partition of the data, sequential partition of the data, process partition, and sequential sorting with helping.

Phase 1: Parallel partition of the data
This is the first phase of the 3+1 phase algorithm. The algorithm deals with data in consecutive blocks of size $$B$$. $$B$$ is determined by the size of the cache, since two blocks should fit into one cache. Given $$n$$ elements as the input of the algorithm and $$P$$ processors indexed from $$P_0$$ to $$P_{P-1}$$, the number of blocks is $$\frac{n}{B}$$. After this phase, all $$neutralized$$ left blocks placed in $$[0, LN-1)$$ and all $$neutralized$$ right blocks placed in $$[RN, N-1)$$ are correctly placed; $$LN$$ and $$RN$$ are the right end of the left blocks and the left start of the right blocks, respectively. However, $$unneutralized$$ blocks can be anywhere in the array, and the blocks between $$LN$$ and $$RN$$ need further processing. This phase can be summarized in the following steps.
 * 1) Pick a $$pivot$$ at processor $$P_0$$.
 * 2) For each processor, assign two blocks, $$leftblock$$ and $$rightblock$$.
 * 3) Apply the function $$neutralize$$ to $$leftblock$$, $$rightblock$$ and $$pivot$$. It returns which side, or both sides, got $$neutralized$$ (if all elements in $$leftblock$$ are smaller than or equal to the $$pivot$$, the $$leftblock$$ is $$neutralized$$; if all elements in $$rightblock$$ are larger than or equal to the $$pivot$$, the $$rightblock$$ is $$neutralized$$).
 * 4) For each processor, if the $$leftblock$$ is $$neutralized$$, it gets a new block from the left side of the array; if the $$rightblock$$ is $$neutralized$$, it gets a new block from the right side of the array.
 * 5) Keep executing steps 2-4 until no new $$leftblock$$ or $$rightblock$$ can be fetched.
 * 6) Record the remaining $$unneutralized$$ blocks in the shared memory, together with the numbers of $$neutralized$$ left and right blocks.

SIDE neutralize(Data *leftblock, Data *rightblock, Data pivot)
{
    int i = 0, j = 0;
    do {
        for (; i < BlockSize; ++i)            // scan left block for an element > pivot
            if (leftblock[i] > pivot) break;
        for (; j < BlockSize; ++j)            // scan right block for an element < pivot
            if (rightblock[j] < pivot) break;
        if (i == BlockSize || j == BlockSize) break;
        SWAP(leftblock[i], rightblock[j]);    // both elements are misplaced: swap them
        i++; j++;
    } while (i < BlockSize && j < BlockSize);
    if (i == BlockSize && j == BlockSize) return BOTH;
    if (i == BlockSize) return LEFT;
    return RIGHT;
}
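The same logic can be rendered as runnable Python for experimentation (BlockSize and the string return values stand in for the paper's SIDE constants):

```python
BlockSize = 4
LEFT, RIGHT, BOTH = "LEFT", "RIGHT", "BOTH"

def neutralize(leftblock, rightblock, pivot):
    """Swap misplaced elements between the two blocks until at least one
    block is neutralized (leftblock all <= pivot or rightblock all >= pivot)."""
    i = j = 0
    while i < BlockSize and j < BlockSize:
        while i < BlockSize and leftblock[i] <= pivot:   # find an element > pivot
            i += 1
        while j < BlockSize and rightblock[j] >= pivot:  # find an element < pivot
            j += 1
        if i == BlockSize or j == BlockSize:
            break
        leftblock[i], rightblock[j] = rightblock[j], leftblock[i]
        i += 1
        j += 1
    if i == BlockSize and j == BlockSize:
        return BOTH
    return LEFT if i == BlockSize else RIGHT
```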

Phase 2: Sequential partition of the data
In phase two, the goal is to place the remaining blocks from phase one correctly in the array by comparing their elements with the $$pivot$$ selected in phase one. After this phase, the input array is partitioned into two subarrays: the left subarray contains all elements smaller than or equal to the $$pivot$$, and all elements larger than the $$pivot$$ are in the right subarray. The steps of this phase are:
 * 1) Sort the remaining blocks.
 * 2) Pick two blocks from the left and right ends on processor $$P_0$$ and apply the $$neutralize$$ function to these two blocks and the $$pivot$$.
 * 3) Swap the remaining blocks so that all $$unneutralized$$ blocks lie between $$[LN, RN]$$.
 * 4) Use sequential quicksort to partition the blocks between $$[LN, RN]$$.

Phase 3: Process partition
Give each processor a subarray from phase two and apply the same parallel partition method. Recursively execute phases 1-3 until only one processor is left.

Phase 4: Sequential sorting with helping
Each processor runs sequential quicksort locally. Lock-free stacks are used in this phase as a helping scheme: after finishing its own work, a processor can help with other processors' work, since all processors have access to the shared stacks.

Time complexity analysis
Sequential quicksort has a worst-case time complexity of $$O(n^2)$$ and a best-case and average-case complexity of $$O(n \log n)$$. Since the 3+1 phase algorithm also has a divide-and-conquer structure, it performs the same number of comparison and swap operations as the sequential algorithm, but the time for the partitioning is greatly reduced. Phase 1 takes $$O(\frac{n}{P})$$ time, and phases 2 and 3 take $$O(B(P+1))$$ time. So the whole time complexity of the partition is $$O(\frac{n}{P} + B(P+1))$$.

Parallel partition using fetch-and-add
The work of Clyde P. Kruskal has shown that the parallel partition operation can be done using fetch-and-add. This part introduces a parallel partition technique using fetch-and-add from the work of P. Heidelberger. To understand parallel partition, we first introduce a partition function G. The partition function G allocates the elements of the source array A to the corresponding target arrays; for example, $$G(a[i]) = j$$ means that the element $$a[i]$$ is assigned to target array j. In the case of quicksort, there are only two target arrays: the left subarray $$A_L$$ and the right subarray $$A_R$$. A partition function executed by fetch-and-add can then be written as:


 * $$G(a[i]) = 1, \ \ \text{if} \ \ a[i] < c$$


 * $$G(a[i]) = 2, \ \ \text{if} \ \ a[i] \ge c$$

$$G(a[i]) = 1$$ means that $$a[i]$$ is assigned to the left subarray $$A_L$$; by $$G(a[i]) = 2$$, $$a[i]$$ is assigned to the right subarray $$A_R$$, which corresponds to the basic idea of quicksort.

Dual copy partition algorithm
This is the simplest parallel partition algorithm using fetch-and-add. It is called "dual copy" because it partitions into a second array b, a copy of the input elements, rather than in place; each iteration also copies s[i] into a local variable t. The algorithm is described in the following pseudocode, where do_parallel means that the iterations are executed simultaneously.

algorithm Dual_copy(s, n)
    do_parallel i = 1 to n
        t := s[i]   // copy s[i] into a local variable
        if t < c then
            b[fetch_and_add(L, 1)] := t   // put t into the left subarray
        else
            b[fetch_and_add(R, -1)] := t   // put t into the right subarray
    end do_parallel
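The pseudocode can be simulated sequentially in Python; the Counter class is a stand-in for a hardware fetch-and-add register, and the loop iterations play the role of the parallel bodies:

```python
class Counter:
    """Sequential stand-in for a fetch-and-add register."""
    def __init__(self, start):
        self.value = start
    def fetch_and_add(self, delta):
        old = self.value        # atomically return the old value ...
        self.value += delta     # ... and add delta (atomicity is only simulated here)
        return old

def dual_copy_partition(s, c):
    """Partition s around pivot c into a copy b, claiming slots with
    fetch_and_add from the left (L) and right (R) ends."""
    n = len(s)
    b = [None] * n
    L, R = Counter(0), Counter(n - 1)
    for t in s:                 # in the real algorithm these iterations run in parallel
        if t < c:
            b[L.fetch_and_add(1)] = t
        else:
            b[R.fetch_and_add(-1)] = t
    return b, L.value           # L.value is the boundary between A_L and A_R
```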

In-place partition algorithm
The in-place partition algorithm is an extension of the dual copy partition algorithm. It uses swapping to perform the parallel partition. Two indices $$L$$ and $$R$$ point to the left and right ends of the input array, so if the input array s has $$n$$ elements, $$L$$ is initialized to 0 and $$R$$ to n-1. By comparing the elements s[$$L$$] and s[$$R$$] with the pivot c, s[$$L$$] and s[$$R$$] are swapped if s[$$L$$] > c and at the same time s[$$R$$] $$\le$$ c. At the end of this parallel swapping, $$L$$ and $$R$$ should have met. However, a clean-up process needs to be executed, since an s[$$L$$] may not find an s[$$R$$] to swap with. The number of fetch-and-adds in the clean-up process is up to 7$$P$$ ($$P$$ is the number of processors).

Parallel partition algorithm with a reduced number of fetch-and-adds
In the dual copy partition algorithm and the in-place partition algorithm, fetch_and_add(L, 1) and fetch_and_add(R, -1) are used for partitioning and swapping. However, this costs one fetch_and_add operation per element. To reduce the number of fetch_and_adds, this algorithm uses fetch_and_add(L, K) and fetch_and_add(R, -K), claiming K slots at a time, instead of fetch_and_add(L, 1) and fetch_and_add(R, -1). Like the in-place partition algorithm, this algorithm also needs clean-up operations, but with far fewer fetch_and_adds.
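The idea can be sketched in Python: buffer K elements per side, claim K slots with a single fetch_and_add, and flush partially filled buffers at the end as the clean-up step (Counter again only simulates the atomic register):

```python
class Counter:
    """Sequential stand-in for a fetch-and-add register."""
    def __init__(self, start):
        self.value = start
    def fetch_and_add(self, delta):
        old = self.value
        self.value += delta
        return old

def blocked_partition(s, c, K):
    """Partition s around pivot c into a copy b, paying one fetch_and_add
    per K elements instead of one per element."""
    n = len(s)
    b = [None] * n
    L, R = Counter(0), Counter(n - 1)
    left_buf, right_buf = [], []

    def flush(buf, counter, step):
        base = counter.fetch_and_add(step * len(buf))  # one atomic per batch
        for off, x in enumerate(buf):
            b[base + step * off] = x
        buf.clear()

    for t in s:
        (left_buf if t < c else right_buf).append(t)
        if len(left_buf) == K:
            flush(left_buf, L, 1)
        if len(right_buf) == K:
            flush(right_buf, R, -1)
    flush(left_buf, L, 1)    # clean-up: flush the remainders with few atomics
    flush(right_buf, R, -1)
    return b, L.value
```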