Cache-oblivious distribution sort

The cache-oblivious distribution sort is a comparison-based sorting algorithm. It is similar to quicksort, but it is a cache-oblivious algorithm, designed for settings in which the array to be sorted is too large to fit in the cache where operations are performed, so that the cost of memory transfers between cache and main memory dominates. In the external memory model, the number of memory transfers required to sort $$N$$ items on a machine with a cache of size $$Z$$ and cache lines of length $$L$$ is $$ O(\frac{N}{L} \log_Z N)$$, under the tall-cache assumption that $$Z = \Omega(L^2)$$. This number of memory transfers has been shown to be asymptotically optimal for comparison sorts. The distribution sort also achieves the asymptotically optimal runtime complexity of $$\Theta(N \log N)$$.
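As a sketch of where this bound comes from (following the standard analysis of the algorithm): the distribution step incurs $$\Theta(N/L)$$ memory transfers, and a problem of size $$N$$ recurses on $$2\sqrt{N}$$ subproblems of size $$\sqrt{N}$$ (the subarrays and the buckets), giving the recurrence

$$Q(N) = 2\sqrt{N}\, Q(\sqrt{N}) + \Theta\!\left(\frac{N}{L}\right), \qquad Q(Z) = \Theta\!\left(\frac{Z}{L}\right)$$

with the base case applying once a subproblem fits in cache. Unrolling, the subproblems at recursion depth $$k$$ contribute $$\Theta(2^k N/L)$$ transfers in total, and cache-sized subproblems are reached after roughly $$\log_2 \log_Z N$$ levels; summing the geometric series yields $$Q(N) = \Theta(\frac{N}{L} \log_Z N)$$.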

Overview
Distribution sort operates on a contiguous array of $$N$$ elements. To sort the elements, it performs the following:


 1. Partition the array into $$\sqrt{N}$$ contiguous subarrays of size $$\sqrt{N}$$, and recursively sort each subarray.
 2. Distribute the elements of the sorted subarrays into $$q \le \sqrt{N}$$ buckets $$B_1, B_2, \ldots, B_q$$, each of size at most $$2 \sqrt{N}$$, such that for every $$i$$ from 1 to $$q-1$$, no element of bucket $$B_i$$ is larger than any element in $$B_{i+1}.$$ This distribution step is the main step of the algorithm, and is covered in more detail below.
 3. Recursively sort each bucket.
 4. Output the concatenation of the buckets.

Distribution step
As mentioned in step 2 above, the goal of the distribution step is to distribute the sorted subarrays into q buckets $$B_1, B_2, \ldots, B_q.$$ The distribution step maintains two invariants. First, each bucket has size at most $$ 2 \sqrt{N}$$ at any time, and any element in bucket $$B_i$$ is no larger than any element in bucket $$B_{i+1}.$$ Second, every bucket has an associated pivot: a value greater than or equal to every element in the bucket.

Initially, the algorithm starts with one empty bucket with pivot $$\infty$$. As it fills buckets, it creates new buckets by splitting a bucket in two whenever it would otherwise become overfull (by having at least $$(2\sqrt{N}+1)$$ elements placed into it). The split uses a linear-time median-finding algorithm and partitions the bucket's elements around the median. The pivot of the lower bucket is set to the median found, and the pivot of the higher bucket is set to the original bucket's pivot from before the split. At the end of the distribution step, all elements are in the buckets, and the two invariants still hold.

To accomplish this, each subarray and each bucket has a state associated with it. The state of a subarray consists of an index next of the next element to be read from the subarray, and a bucket number bnum indicating the index of the bucket the element should be copied to. By convention, $$bnum = \infty$$ if all elements in the subarray have been distributed. (Note that when a bucket is split, the bnum value of every subarray whose bnum is greater than the index of the split bucket must be incremented.) The state of a bucket consists of the value of the bucket's pivot and the number of elements currently in the bucket.
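These per-subarray and per-bucket records are small; as an illustrative sketch (the class and field names here are our own, not part of the original description):

```python
import math
from dataclasses import dataclass

@dataclass
class SubarrayState:
    next: int = 0    # index of the next element to be read from this subarray
    bnum: float = 0  # index of the destination bucket; math.inf once exhausted

@dataclass
class BucketState:
    pivot: float = math.inf  # upper bound on every element in the bucket
    size: int = 0            # number of elements currently in the bucket
```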

Consider the following basic strategy: iterate through each subarray, attempting to copy over its element at position next. If the element is smaller than the pivot of bucket bnum, place it in that bucket, possibly incurring a bucket split. Otherwise, increment bnum until a bucket whose pivot is large enough is found. Though this correctly distributes all elements, it exhibits poor cache performance.
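A concrete sketch of that basic strategy in Python (the name distribute_basic is ours, statistics.median_low stands in for a linear-time selection routine, and distinct keys are assumed):

```python
import math
import statistics

def distribute_basic(subarrays, n):
    """Element-by-element distribution: correct, but poor cache behaviour."""
    cap = 2 * math.isqrt(n)            # bucket capacity 2*sqrt(N)
    buckets = [[]]                     # start with a single empty bucket
    pivots = [math.inf]                # whose pivot is infinity
    bnum = [0] * len(subarrays)        # next destination bucket per subarray
    for r, sub in enumerate(subarrays):
        for x in sub:                  # subarray elements arrive in sorted order
            while x > pivots[bnum[r]]: # advance until the pivot is large enough
                bnum[r] += 1
            b = bnum[r]
            buckets[b].append(x)
            if len(buckets[b]) > cap:  # bucket became overfull: split at median
                med = statistics.median_low(buckets[b])
                lo = [y for y in buckets[b] if y <= med]
                hi = [y for y in buckets[b] if y > med]
                buckets[b:b + 1] = [lo, hi]
                pivots[b:b] = [med]    # lower half gets pivot med; upper half
                                       # inherits the old pivot
                for t in range(len(bnum)):
                    if bnum[t] > b:    # shift pointers past the split bucket
                        bnum[t] += 1
    return buckets
```

Walking every subarray's pointer across the whole bucket array like this touches memory with little locality, which is why the algorithm replaces it with the recursive scheme described next.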

Instead, the distribution step is performed in a recursive divide-and-conquer fashion. The step is implemented as a call to the function distribute, which takes three parameters i, j, and m. distribute(i, j, m) distributes the elements of the i-th through (i+m-1)-th subarrays into buckets, starting from $$B_j$$. It requires as a precondition that each subarray r in the range $$i, \ldots, i+m-1$$ has $$bnum[r] \ge j$$, and it guarantees on completion that each such subarray has $$bnum[r] \ge j+m$$. The whole distribution step is the call distribute$$(1,1,\sqrt{N})$$.
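The recursion can be sketched as follows (a reconstruction from the description above, using 0-based indices and assuming m is a power of two; copy_elems is stubbed out to record which (subarray, bucket) pairs are visited rather than actually moving elements):

```python
calls = []  # records (subarray index, bucket index) pairs in visit order

def copy_elems(i, j):
    # stub: the real subroutine moves elements from subarray i into bucket j
    calls.append((i, j))

def distribute(i, j, m):
    """Distribute subarrays i .. i+m-1 into buckets starting from bucket j."""
    if m == 1:
        copy_elems(i, j)
    else:
        h = m // 2
        distribute(i, j, h)          # lower subarrays into lower buckets
        distribute(i + h, j, h)      # upper subarrays into lower buckets
        distribute(i, j + h, h)      # lower subarrays into upper buckets
        distribute(i + h, j + h, h)  # upper subarrays into upper buckets

distribute(0, 0, 4)  # visits every (subarray, bucket) pair exactly once
```

Both halves of the subarray range are fully distributed into the lower half of the bucket range before either is distributed into the upper half, which preserves the precondition $$bnum[r] \ge j$$ at every recursive call.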

The base case, m = 1, calls the subroutine copy_elems. In this base case, all elements from subarray i that belong to bucket j are added at once. If this causes bucket j to have too many elements, the bucket is split using the procedure described above.
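A sketch of copy_elems acting on the state described earlier (the parameter names are ours, and the overfull-bucket split is omitted for brevity):

```python
import math

def copy_elems(i, j, subarrays, nxt, bnum, buckets, pivots):
    """Move every not-yet-distributed element of subarray i that belongs in
    bucket j, i.e. is at most the bucket's pivot. Splitting an overfull
    bucket is omitted here for brevity."""
    sub = subarrays[i]
    while nxt[i] < len(sub) and sub[nxt[i]] <= pivots[j]:
        buckets[j].append(sub[nxt[i]])
        nxt[i] += 1
    # either the subarray is exhausted, or its next element belongs in a
    # later bucket, so bnum advances past j as the postcondition requires
    bnum[i] = math.inf if nxt[i] == len(sub) else j + 1
```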