Parallel external memory

In computer science, a parallel external memory (PEM) model is a cache-aware, external-memory abstract machine. It is the parallel-computing analogy to the single-processor external memory (EM) model. In a similar way, it is the cache-aware analogy to the parallel random-access machine (PRAM). The PEM model consists of a number of processors, together with their respective private caches and a shared main memory.

Definition
The PEM model is a combination of the EM model and the PRAM model. The PEM model is a computation model which consists of $$P$$ processors and a two-level memory hierarchy. This memory hierarchy consists of a large external memory (main memory) of size $$N$$ and $$P$$ small  internal memories (caches). The processors share the main memory. Each cache is exclusive to a single processor. A processor can't access another’s cache. The caches have a size $$M$$ which is partitioned in blocks of size $$B$$. The processors can only perform operations on data which are in their cache. The data can be transferred between the main memory and the cache in blocks of size $$B$$.

I/O complexity
The complexity measure of the PEM model is the I/O complexity, which determines the number of parallel blocks transfers between the main memory and the cache. During a parallel block transfer each processor can transfer a block. So if $$P$$ processors load parallelly a data block of size $$B$$ form the main memory into their caches, it is considered as an I/O complexity of $$O(1)$$ not $$O(P)$$. A program in the PEM model should minimize the data transfer between main memory and caches and operate as much as possible on the data in the caches.

Read/write conflicts
In the PEM model, there is no direct communication network between the P processors. The processors have to communicate indirectly over the main memory. If multiple processors try to access the same block in main memory concurrently read/write conflicts occur. Like in the PRAM model, three different variations of this problem are considered: The following two algorithms solve the CREW and EREW problem if $$P \leq B$$ processors write to the same block simultaneously. A first approach is to serialize the write operations. Only one processor after the other writes to the block. This results in a total of $$P$$ parallel block transfers. A second approach needs $$O(\log(P))$$ parallel block transfers and an additional block for each processor. The main idea is to schedule the write operations in a binary tree fashion and gradually combine the data into a single block. In the first round $$P$$ processors combine their blocks into $$P/2$$ blocks. Then $$P/2$$ processors combine the $$P/2$$ blocks into $$P/4$$. This procedure is continued until all the data is combined in one block.
 * Concurrent Read Concurrent Write (CRCW): The same block in main memory can be read and written by multiple processors concurrently.
 * Concurrent Read Exclusive Write (CREW): The same block in main memory can be read by multiple processors concurrently. Only one processor can write to a block at a time.
 * Exclusive Read Exclusive Write (EREW): The same block in main memory cannot be read or written by multiple processors concurrently. Only one processor can access a block at a time.

Multiway partitioning
Let $$M=\{m_1,...,m_{d-1}\}$$ be a vector of d-1 pivots sorted in increasing order. Let $A$ be an unordered set of N elements. A d-way partition of $A$ is a set $$\Pi=\{A_1,...,A_d\}$$, where $$\cup_{i=1}^d A_i = A$$ and $$A_i\cap A_j=\emptyset$$ for $$1\leq i<j\leq d$$. $$A_i$$ is called the i-th bucket. The number of elements in $$A_i$$ is greater than $$m_{i-1}$$ and smaller than $$m_{i}^2$$. In the following algorithm the input is partitioned into N/P-sized contiguous segments $$S_1,...,S_P$$ in main memory. The processor i primarily works on the segment $$S_i$$. The multiway partitioning algorithm uses a PEM prefix sum algorithm to calculate the prefix sum with the optimal $$O\left(\frac{N}{PB} + \log P\right)$$ I/O complexity. This algorithm simulates an optimal PRAM prefix sum algorithm. // Compute parallelly a d-way partition on the data segments $$S_i$$ for each processor i in parallel do Read the vector of pivots $M$ into the cache. Partition $$S_i$$ into d buckets and let vector $$M_i=\{j_1^i,...,j_d^i\}$$ be the number of items in each bucket. end for Run PEM prefix sum on the set of vectors $$\{M_1,...,M_P\}$$ simultaneously. // Use the prefix sum vector to compute the final partition for each processor i in parallel do Write elements $$S_i$$ into memory locations offset appropriately by $$M_{i-1}$$ and $$M_{i}$$. end for Using the prefix sums stored in $$M_P$$ the last processor P calculates the vector $B$ of bucket sizes and returns it.

If the vector of $$d=O\left(\frac{M}{B}\right)$$ pivots M and the input set A are located in contiguous memory, then the d-way partitioning problem can be solved in the PEM model with $$O\left(\frac{N}{PB} + \left\lceil \frac{d}{B} \right\rceil>\log(P)+d\log(B)\right)$$ I/O complexity. The content of the final buckets have to be located in contiguous memory.

Selection
The selection problem is about finding the k-th smallest item in an unordered list $A$ of size $N$. The following code makes use of  which is a PRAM optimal sorting algorithm which runs in $$O(\log N)$$, and , which is a cache optimal single-processor selection algorithm. if $$N \leq P$$ then $$\texttt{PRAMSORT}(A,P)$$ return $$A[k]$$ end if //Find median of each $$S_i$$ for each processor $i$ in parallel do $$m_i = \texttt{SELECT}(S_i, \frac{N}{2P}) $$ end for // Sort medians $$\texttt{PRAMSORT}(\lbrace m_1, \dots, m_2 \rbrace, P)$$ // Partition around median of medians $$t = \texttt{PEMPARTITION}(A, m_{P/2},P)$$ if $$k \leq t$$ then return $$\texttt{PEMSELECT}(A[1:t], P, k)$$ else return $$\texttt{PEMSELECT}(A[t+1:N], P, k-t)$$ end if

Under the assumption that the input is stored in contiguous memory,  has an I/O complexity of:


 * $$O\left(\frac{N}{PB} + \log (PB) \cdot \log(\frac{N}{P})\right)$$

Distribution sort
Distribution sort partitions an input list $A$ of size $N$ into $d$ disjoint buckets of similar size. Every bucket is then sorted recursively and the results are combined into a fully sorted list.

If $$P = 1$$ the task is delegated to a cache-optimal single-processor sorting algorithm.

Otherwise the following algorithm is used: // Sample $$\tfrac{4N}{\sqrt{d}}$$ elements from $A$ for each processor $i$ in parallel do if $$M < |S_i|$$ then $$d = M/B$$ Load $$S_i$$ in $M$-sized pages and sort pages individually else $$d = |S_i|$$ Load and sort $$S_i$$ as single page end if Pick every $$\sqrt{d}/4$$'th element from each sorted memory page into contiguous vector $$R^i$$ of samples end for in parallel do Combine vectors $$R^1 \dots R^P$$ into a single contiguous vector $$\mathcal{R}$$ Make $$\sqrt{d}$$ copies of $$\mathcal{R}$$: $$\mathcal{R}_1 \dots \mathcal{R}_{\sqrt{d}}$$ end do // Find $$\sqrt{d}$$ pivots $$\mathcal{M}[j]$$ for $$j = 1$$ to $$\sqrt{d}$$ in parallel do $$\mathcal{M}[j] = \texttt{PEMSELECT}(\mathcal{R}_i, \tfrac{P}{\sqrt{d}}, \tfrac{j \cdot 4N}{d})$$ end for Pack pivots in contiguous array $$\mathcal{M}$$ // Partition $A$around pivots into buckets $$\mathcal{B}$$ $$\mathcal{B} = \texttt{PEMMULTIPARTITION}(A[1:N],\mathcal{M},\sqrt{d},P)$$ // Recursively sort buckets for $$j = 1$$ to $$\sqrt{d} + 1$$ in parallel do recursively call $$\texttt{PEMDISTSORT}$$ on bucket $j$of size $$\mathcal{B}[j]$$ using $$O \left( \left \lceil \tfrac{\mathcal{B}[j]}{N / P} \right \rceil \right)$$ processors responsible for elements in bucket $j$ end for

The I/O complexity of  is:


 * $$O \left( \left \lceil \frac{N}{PB} \right \rceil \left ( \log_d P + \log_{M/B} \frac{N}{PB} \right ) + f(N,P,d) \cdot \log_d P \right)$$

where


 * $$f(N,P,d) = O \left ( \log \frac{PB}{\sqrt{d}} \log \frac{N}{P} + \left \lceil \frac{\sqrt{d}}{B} \log P + \sqrt{d} \log B \right \rceil \right )$$

If the number of processors is chosen that $$f(N,P,d) = O\left ( \left \lceil \tfrac{N}{PB} \right \rceil \right )$$and $$M < B^{O(1)}$$ the I/O complexity is then:

$$O \left ( \frac{N}{PB} \log_{M/B} \frac{N}{B} \right )$$

Other PEM algorithms
Where $$\textrm{sort}_P(N)$$ is the time it takes to sort $N$ items with $P$ processors in the PEM model.