
Parallel Graph Representations
The parallelization of graph problems faces significant challenges: data-driven computation, unstructured problem instances, poor locality, and a high ratio of data access to computation. The graph representation used on a parallel architecture plays a significant role in meeting these challenges. A poorly chosen representation may unnecessarily drive up the communication cost of the algorithm, which decreases its scalability. In the following, shared- and distributed-memory architectures are considered.

Shared memory
In the case of a shared memory model, the graph representations used for parallel processing are the same as in the sequential case, since parallel read-only access to the graph representation (e.g. an adjacency list) is efficient in shared memory.

Distributed memory
In the distributed memory model, the usual approach is to partition the vertex set $$V$$ of the graph into $$p$$ sets $$V_0, \dots, V_{p-1}$$, where $$p$$ is the number of available processing elements (PEs). Each vertex set $$V_i$$ is then assigned to the PE with matching index, together with the corresponding edges. Every PE stores a representation of its own subgraph, where edges with an endpoint in another partition require special attention: for standard communication interfaces like MPI, the ID of the PE owning the other endpoint has to be identifiable. During the computation of a distributed graph algorithm, passing information along these edges implies communication.
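For a balanced block distribution, the owner of an endpoint can be computed locally from its global ID, without any lookup table. The following minimal sketch (the helper name `owner` and the block scheme are illustrative assumptions, not part of any standard interface) shows this:

```python
# Hypothetical sketch: n vertices distributed over p PEs in contiguous blocks,
# so the owning PE of any global vertex ID is computable without communication.
def owner(v: int, n: int, p: int) -> int:
    """Return the PE that owns global vertex v under a balanced block partition."""
    block = (n + p - 1) // p  # ceil(n / p) vertices per PE
    return v // block

# Example: n = 10 vertices on p = 4 PEs (blocks of size 3)
assert owner(0, 10, 4) == 0
assert owner(5, 10, 4) == 1
assert owner(9, 10, 4) == 3
```

With irregular partitions produced by a graph partitioner, such a closed-form mapping does not exist and each PE has to store owner information for the endpoints of its cut edges explicitly.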

Partitioning the graph needs to be done carefully: there is a trade-off between low communication volume and evenly sized partitions. But partitioning a graph is an NP-hard problem, so it is not feasible to compute optimal partitions. Instead, the following heuristics are used.

1D partitioning: Every processor gets $$n/p$$ vertices and the corresponding outgoing edges. This can be understood as a row-wise or column-wise decomposition of the adjacency matrix. For algorithms operating on this representation, an all-to-all communication step is required, as well as $$\mathcal{O}(m)$$ message buffer sizes, since each PE potentially has outgoing edges to every other PE.
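The row-wise decomposition can be sketched as follows; the function name `partition_1d` and the adjacency-list-as-dict representation are illustrative assumptions:

```python
def partition_1d(adj, p):
    """Split an adjacency list (dict: vertex -> successor list) into p blocks.

    PE i receives the vertices of block i together with all their outgoing
    edges, i.e. p row blocks of the adjacency matrix.
    """
    n = len(adj)
    block = (n + p - 1) // p  # ceil(n / p) vertices per PE
    return [{v: adj[v] for v in range(i * block, min((i + 1) * block, n))}
            for i in range(p)]

# Example: a small DAG on 4 vertices, split over 2 PEs
adj = {0: [1, 2], 1: [3], 2: [3], 3: []}
parts = partition_1d(adj, 2)
# parts[0] holds vertices 0 and 1; parts[1] holds vertices 2 and 3.
# The edges (0, 2), (1, 3) cross the partition boundary and imply communication.
```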

2D partitioning: Every processor gets a submatrix of the adjacency matrix. Assume the processors are aligned in a rectangle $$p = p_r \times p_c$$, where $$p_r$$ and $$p_c$$ are the number of processing elements in each row and column, respectively. Then each processor gets a submatrix of the adjacency matrix of dimension $$(n/p_r) \times (n/p_c)$$. This can be visualized as a checkerboard pattern in a matrix. Each processing element can therefore only have outgoing edges to PEs in the same row and column, which bounds the number of communication partners for each PE to $$p_r + p_c - 1$$ out of $$p = p_r \times p_c$$ possible ones.
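The checkerboard decomposition can be sketched as below; the function name `partition_2d` is an illustrative assumption, and for simplicity $$n$$ is assumed to be divisible by both $$p_r$$ and $$p_c$$:

```python
def partition_2d(A, pr, pc):
    """Split an n x n adjacency matrix (list of lists) into a pr x pc grid
    of submatrices, each of dimension (n/pr) x (n/pc).

    Assumes n is divisible by pr and by pc.
    """
    n = len(A)
    rb, cb = n // pr, n // pc  # row-block and column-block sizes
    return [[[row[c * cb:(c + 1) * cb] for row in A[r * rb:(r + 1) * rb]]
             for c in range(pc)]
            for r in range(pr)]

# Example: 4x4 adjacency matrix on a 2x2 processor grid -> four 2x2 blocks
A = [[0, 1, 1, 0],
     [0, 0, 0, 1],
     [0, 0, 0, 1],
     [0, 0, 0, 0]]
blocks = partition_2d(A, 2, 2)
# blocks[0][1] is the top-right submatrix: [[1, 0], [0, 1]]
```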

Parallel Topological Sorting
An algorithm for parallel topological sorting on distributed memory machines parallelizes Kahn's algorithm for a DAG $$G = (V, E)$$. On a high level, Kahn's algorithm repeatedly removes the vertices of indegree 0 and adds them to the topological sorting in the order in which they were removed. Since the outgoing edges of the removed vertices are also removed, there will be a new set of vertices of indegree 0, on which the procedure is repeated until no vertices are left. This algorithm performs $$D+1$$ iterations, where $$D$$ is the length of the longest path in $$G$$. Each iteration can be parallelized, which is the idea of the following algorithm.
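For reference, the sequential version of Kahn's algorithm described above can be sketched as follows (the function name and the dict-based graph representation are illustrative choices):

```python
from collections import deque

def kahn_topological_sort(adj):
    """Kahn's algorithm: repeatedly remove indegree-0 vertices.

    adj: dict mapping each vertex to the list of its successors.
    Returns a topological order; raises ValueError if the graph has a cycle.
    """
    indeg = {v: 0 for v in adj}
    for v in adj:
        for w in adj[v]:
            indeg[w] += 1
    queue = deque(v for v in adj if indeg[v] == 0)  # current indegree-0 set
    order = []
    while queue:
        u = queue.popleft()
        order.append(u)
        for w in adj[u]:        # removing u also removes its outgoing edges
            indeg[w] -= 1
            if indeg[w] == 0:   # w becomes a candidate for the next round
                queue.append(w)
    if len(order) != len(adj):
        raise ValueError("graph contains a cycle")
    return order

adj = {0: [1, 2], 1: [3], 2: [3], 3: []}
# kahn_topological_sort(adj) yields [0, 1, 2, 3]
```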

In the following it is assumed that the graph is partitioned over $$p$$ processing elements (PEs), labeled $$0, \dots, p-1$$. Each PE $$i$$ initializes the set $$Q_i^1$$ of its local vertices with indegree 0, where the upper index denotes the current iteration. Since all vertices in the local sets $$Q_0^1, \dots, Q_{p-1}^1$$ have indegree 0, i.e. none of them is adjacent to another, they can appear in an arbitrary order in a valid topological sorting. To assign a global index to each vertex, a prefix sum is calculated over the sizes of $$Q_0^1, \dots, Q_{p-1}^1$$; in each step, $$\sum_{i=0}^{p-1} |Q_i|$$ vertices are added to the topological sorting. In the first step, PE $$j$$ assigns the indices $$\sum_{i=0}^{j-1} |Q_i^1|, \dots, \left(\sum_{i=0}^{j} |Q_i^1|\right) - 1$$ to its local vertices in $$Q_j^1$$. These vertices are then removed, together with their outgoing edges. For each outgoing edge $$(u, v)$$ whose endpoint $$v$$ lies on another PE $$l, l \neq j$$, the message $$(u, v)$$ is posted to PE $$l$$. After all vertices in $$Q_j^1$$ have been removed, the posted messages are sent to their corresponding PEs. Each received message $$(u, v)$$ decrements the indegree of the local vertex $$v$$; if the indegree drops to zero, $$v$$ is added to $$Q_j^2$$. Then the next iteration starts.

In step $$k$$, PE $$j$$ assigns the indices $$a_{k-1} + \sum_{i=0}^{j-1} |Q_i^k|, \dots, a_{k-1} + \left(\sum_{i=0}^{j} |Q_i^k|\right) - 1$$, where $$a_{k-1}$$ is the total number of vertices processed after step $$k-1$$. This procedure repeats until there are no vertices left to process, i.e. until $$\sum_{i=0}^{p-1} |Q_i^{D+1}| = 0$$. Below is a high-level, single program, multiple data (SPMD) pseudocode overview of this algorithm.

Note that the prefix sum for the local offsets $$a_{k-1} + \sum_{i=0}^{j-1} |Q_i^k|, \dots, a_{k-1} + \left(\sum_{i=0}^{j} |Q_i^k|\right) - 1$$ can be calculated efficiently in parallel.

p processing elements with IDs from 0 to p - 1
Input: DAG G = (V, E), distributed to PEs, PE index $$j \in \{0, \dots, p - 1\}$$
Output: topological sorting of G

function traverseDAGDistributed
    δ incoming degree of local vertices V
    Q = {v ∈ V | δ[v] = 0}                      // All vertices with indegree 0
    nrOfVerticesProcessed = 0
    do
        global build prefix sum over size of Q  // get offsets and total number of vertices in this step
        offset = nrOfVerticesProcessed + $$\sum_{i=0}^{j - 1} |Q_i|$$   // j is the processor index
        foreach u ∈ Q
            localOrder[u] = offset++
        foreach (u, v) ∈ E
            post message (u, v) to PE owning vertex v
        nrOfVerticesProcessed += $$\sum_{i=0}^{p - 1} |Q_i|$$
        deliver all messages to neighbors of vertices in Q
        receive messages for local vertices V
        remove all vertices in Q
        foreach message (u, v) received
            if --δ[v] = 0
                add v to Q
    while global size of Q > 0
    return localOrder
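The supersteps of the pseudocode can be simulated sequentially on a single machine; the following minimal Python sketch replaces real message passing with per-PE message lists and assumes a caller-supplied `owner` function mapping a global vertex ID to its PE (both the function name `distributed_toposort` and this setup are illustrative assumptions):

```python
def distributed_toposort(parts, owner):
    """Sequential simulation of the distributed topological sort.

    parts: list of per-PE adjacency dicts (global vertex -> successor list)
    owner: function mapping a global vertex ID to the index of its PE
    Returns localOrder as one dict: vertex -> global topological index.
    """
    p = len(parts)
    # Compute local indegrees (delta in the pseudocode).
    indeg = [{v: 0 for v in part} for part in parts]
    for part in parts:
        for v in part:
            for w in part[v]:
                indeg[owner(w)][w] += 1
    # Per-PE sets Q of current indegree-0 vertices.
    Q = [[v for v in part if indeg[j][v] == 0] for j, part in enumerate(parts)]
    local_order = {}
    processed = 0                                  # nrOfVerticesProcessed
    while sum(len(q) for q in Q):                  # global size of Q > 0
        sizes = [len(q) for q in Q]
        offsets = [processed + sum(sizes[:j]) for j in range(p)]  # prefix sum
        messages = [[] for _ in range(p)]
        for j in range(p):
            for idx, u in enumerate(Q[j]):
                local_order[u] = offsets[j] + idx
                for v in parts[j][u]:              # post (u, v) to owner of v
                    messages[owner(v)].append(v)
        processed += sum(sizes)
        Q = [[] for _ in range(p)]
        for l in range(p):                         # receive, update indegrees
            for v in messages[l]:
                indeg[l][v] -= 1
                if indeg[l][v] == 0:
                    Q[l].append(v)
    return local_order

# Example: 4-vertex DAG on 2 simulated PEs with a block owner function
parts = [{0: [1, 2], 1: [3]}, {2: [3], 3: []}]
# distributed_toposort(parts, lambda v: v // 2) yields {0: 0, 1: 1, 2: 2, 3: 3}
```

In a real distributed implementation, the prefix sum and the termination check would each be one collective operation (e.g. MPI exscan and allreduce), and the per-PE message lists would become actual sends and receives.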

The communication cost depends heavily on the given graph partition. As for runtime, on a CRCW-PRAM model that allows fetch-and-decrement in constant time, this algorithm runs in $$\mathcal{O} \bigl(\frac{m + n}{p} + D (\Delta + \log n)\bigr)$$, where $$D$$ is again the length of the longest path in $$G$$ and $$\Delta$$ the maximum degree.