Draft:String sorting algorithms

In computer science, string sorting algorithms are a special case of sorting algorithms in which the input is an array $$S = \{ s_0,\dots,s_{n-1}\}$$ of $$n$$ strings whose characters are drawn from an alphabet $$\Sigma$$.

Unlike traditional sorting algorithms that deal with atomic keys, string sorting poses unique challenges. Sorting strings with conventional atomic sorting algorithms, which treat keys as indivisible objects, is inefficient because comparing entire strings is costly and must be performed many times. Efficient string sorting algorithms, in contrast, inspect most characters of the input only once during the entire sorting process, and they examine only those characters that are necessary to establish the correct order.

Another challenge is that strings are typically represented as arrays of pointers. This representation results in indirect access to string characters, leading to cache misses even when merely scanning an array of strings. This is in contrast to sorting atomic keys, where scanning is notably cache efficient.

The efficiency of string sorting algorithms depends on multiple factors: the size of the dataset ($$n$$); the distinguishing prefix size of $$S$$ ($$D$$), which is the minimal number of characters that must be examined to sort the strings; the number of subproblems ($$\sigma$$) into which the algorithm breaks down the problem; and the underlying hardware. Consequently, no single algorithm is universally optimal.
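To make the distinguishing prefix size concrete, the following is a minimal Python sketch (illustrative, not a reference implementation) that computes $$D$$. It relies on the observation that, after sorting, the longest prefix a string shares with any other string is attained at one of its sorted neighbours:

```python
def lcp(a: str, b: str) -> int:
    """Length of the longest common prefix of a and b."""
    i = 0
    while i < len(a) and i < len(b) and a[i] == b[i]:
        i += 1
    return i

def distinguishing_prefix_size(strings: list[str]) -> int:
    """D: total number of characters that must be examined
    to distinguish every string from all the others."""
    s = sorted(strings)
    total = 0
    for i, cur in enumerate(s):
        longest = 0
        if i > 0:
            longest = max(longest, lcp(cur, s[i - 1]))
        if i + 1 < len(s):
            longest = max(longest, lcp(cur, s[i + 1]))
        # examine one character past the longest shared prefix,
        # but never more than the string's own length
        total += min(len(cur), longest + 1)
    return total
```

For example, for `["car", "cat", "dog"]` the distinguishing prefixes are `car`, `cat` and `d`, so $$D = 7$$.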

Multikey quicksort
Developed by Bentley and Sedgewick in 1997, this algorithm adapts traditional quicksort to string sorting. For a set of strings sharing a common prefix of length $$h$$, it picks the character $$x = s[h]$$ of some string $$s$$ as a splitter and partitions the strings into three distinct arrays $$S_<$$, $$S_=$$ and $$S_>$$, according to whether each string's $$(h+1)$$th character is smaller than, equal to, or greater than $$x$$. The algorithm recurses on $$S_<$$ and $$S_>$$ with the same prefix length, and on $$S_=$$ with prefix length $$h+1$$; if $$x = 0$$ (the end-of-string character), $$S_=$$ is already sorted and that branch terminates. With insertion sort as a base case sorter for constant input sizes, multikey quicksort has a complexity of $$O(D+n\log n)$$.
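The three-way partitioning step can be sketched in Python as follows. The helper `char_at` and the list-comprehension partitioning are illustrative choices, not Bentley and Sedgewick's in-place implementation; a sentinel of $$-1$$ stands in for the end-of-string character:

```python
def multikey_quicksort(strings: list[str]) -> list[str]:
    """Sketch of multikey quicksort (Bentley-Sedgewick style)."""
    def char_at(s: str, h: int) -> int:
        # -1 acts as the end-of-string sentinel, smaller than any character
        return ord(s[h]) if h < len(s) else -1

    def sort(ss: list[str], h: int) -> list[str]:
        if len(ss) <= 1:
            return ss
        pivot = char_at(ss[len(ss) // 2], h)
        lt = [s for s in ss if char_at(s, h) < pivot]
        eq = [s for s in ss if char_at(s, h) == pivot]
        gt = [s for s in ss if char_at(s, h) > pivot]
        # strings in eq share a common prefix of length h + 1; recurse on
        # the next character unless the pivot is the end-of-string sentinel,
        # in which case eq is already sorted
        eq = eq if pivot == -1 else sort(eq, h + 1)
        return sort(lt, h) + eq + sort(gt, h)

    return sort(strings, 0)
```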

Most significant digit (MSD) radix sort
Most significant digit (MSD) radix sort is especially efficient for sorting large datasets, particularly when the alphabet is small. Given a set of strings with a common prefix of length $$h$$, the algorithm examines the $$(h+1)$$th character of each string and divides the set into $$\sigma$$ distinct subproblems, each of which is then sorted recursively with common prefix length $$h + 1$$. This strategy, a natural approach to string sorting, has seen numerous refinements and improvements across various studies in the literature. The time complexity is $$O(D)$$ plus the time required for sorting the base cases. For example, with multikey quicksort as the base case sorter, MSD radix sort has a complexity of $$O(D + n\log\sigma)$$.
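A minimal Python sketch of the recursion, assuming plain dictionaries as buckets (an optimized implementation would use counting passes and an array of $$\sigma$$ buckets instead):

```python
from collections import defaultdict

def msd_radix_sort(strings: list[str]) -> list[str]:
    """Sketch of MSD radix sort: bucket by the (h+1)-th character,
    then recurse into each bucket with common prefix length h + 1."""
    def sort(ss: list[str], h: int) -> list[str]:
        if len(ss) <= 1:
            return ss
        # strings exhausted at depth h precede all longer strings
        done = [s for s in ss if len(s) == h]
        buckets = defaultdict(list)
        for s in ss:
            if len(s) > h:
                buckets[s[h]].append(s)
        for c in sorted(buckets):          # one subproblem per character
            done.extend(sort(buckets[c], h + 1))
        return done

    return sort(strings, 0)
```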

Burstsort
Burstsort uses a trie-based structure with containers at the leaves for sorting the strings. Upon reaching a predefined threshold, these containers "burst", redistributing their strings into new containers based on the next character. These new containers are then attached to the appropriate child nodes of the trie. The sorting process involves traversing the trie and individually sorting the small containers. Key factors influencing the runtime of burstsort include the trie implementation, the design of the containers, the burst threshold, and the base algorithm chosen for sorting the containers. Sinha and Zobel used an array for each trie node and unordered dynamic arrays of string pointers for the leaf containers, with a bursting threshold of $$8192$$. With this configuration and multikey quicksort for sorting the leaves, burstsort has a complexity of $$O(D + n\log\sigma)$$.
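A toy Python sketch of the bursting mechanism; the threshold of 4 and Python's built-in sort as the container sorter are illustrative stand-ins for the configuration described above:

```python
class BurstNode:
    """Trie node: holds a flat container of strings until it bursts,
    after which it routes strings to children by their next character."""
    def __init__(self):
        self.container = []   # list of strings; set to None once burst
        self.children = {}    # character -> BurstNode
        self.exhausted = []   # strings that end exactly at this depth

THRESHOLD = 4  # toy value; the configuration above uses 8192

def insert(node: BurstNode, s: str, h: int) -> None:
    if node.container is None:            # internal node: descend
        if len(s) == h:
            node.exhausted.append(s)
        else:
            child = node.children.setdefault(s[h], BurstNode())
            insert(child, s, h + 1)
    else:
        node.container.append(s)
        if len(node.container) > THRESHOLD:    # burst: redistribute
            pending, node.container = node.container, None
            for t in pending:
                insert(node, t, h)

def traverse(node: BurstNode, h: int, out: list[str]) -> None:
    out.extend(node.exhausted)             # shortest strings first
    if node.container is not None:
        # small container: only the suffixes past the shared prefix
        # need to be compared (stand-in for multikey quicksort)
        out.extend(sorted(node.container, key=lambda t: t[h:]))
        return
    for c in sorted(node.children):
        traverse(node.children[c], h + 1, out)

def burstsort(strings: list[str]) -> list[str]:
    root = BurstNode()
    for s in strings:
        insert(root, s, 0)
    out = []
    traverse(root, 0, out)
    return out
```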

LCP-mergesort
LCP-mergesort is an adaptation of the traditional merge sort algorithm that stores and reuses the longest common prefixes (LCPs) of consecutive strings in the sorted subproblems. This strategy enhances the efficiency of string comparisons. In the conventional method, the strings $$s_a$$ and $$s_b$$ must be compared character by character. However, knowing the LCPs of $$s_a$$ and $$s_b$$ with another string $$p$$ that precedes both often decides the comparison without examining any characters. If the LCP between $$p$$ and $$s_a$$ is longer than that between $$p$$ and $$s_b$$, it follows that $$s_a$$ precedes $$s_b$$ in lexicographic order, because $$s_a$$ still agrees with $$p$$ at the position where $$s_b$$ first differs from it. The symmetric argument applies when the LCP with $$s_b$$ is longer; only when the two LCPs are equal must characters be compared, and the comparison can skip the known common prefix. LCP-mergesort has a worst-case time complexity of $$O(D + n\log n)$$.
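The comparison rule can be sketched in Python. `lcp_compare` is a hypothetical helper name; it assumes both strings are known to be greater than or equal to $$p$$, as is the case for the two head strings during a merge:

```python
def lcp(a: str, b: str) -> int:
    """Length of the longest common prefix of a and b."""
    i = 0
    while i < len(a) and i < len(b) and a[i] == b[i]:
        i += 1
    return i

def lcp_compare(a: str, lcp_pa: int, b: str, lcp_pb: int):
    """Order a and b, both >= some string p, given lcp(p, a) and
    lcp(p, b). Returns (smaller, larger, lcp of the two).
    Characters are examined only when the two LCPs are equal."""
    if lcp_pa > lcp_pb:
        # a agrees with p beyond the point where b diverges => a < b,
        # and a and b first differ exactly where b diverged from p
        return a, b, lcp_pb
    if lcp_pa < lcp_pb:
        return b, a, lcp_pa
    # equal LCPs: compare characters, skipping the known common prefix
    h = lcp_pa + lcp(a[lcp_pa:], b[lcp_pa:])
    return (a, b, h) if a[h:] <= b[h:] else (b, a, h)
```

For instance, with $$p$$ = "ab", $$s_a$$ = "abc" (LCP 2) and $$s_b$$ = "ad" (LCP 1), the rule yields "abc" < "ad" without any character comparison.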

Insertion sort
Insertion sort is frequently used as the base case sorter for small sets of strings. The algorithm maintains an ordered array and inserts each unsorted item into its appropriate position by linear scanning. It treats strings as atomic units, so full string comparisons are required during the scan to establish the correct order. Its worst-case time complexity is $$O(nD)$$, which makes it well suited to small $$n$$ and $$D$$, particularly since the strings are scanned in a cache-efficient manner.
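A straightforward Python sketch, using whole-string comparisons as described:

```python
def string_insertion_sort(strings: list[str]) -> list[str]:
    """Insertion sort treating strings as atomic keys: each item is
    placed by scanning the sorted prefix with full string comparisons."""
    out = strings[:]
    for i in range(1, len(out)):
        key = out[i]
        j = i - 1
        while j >= 0 and out[j] > key:   # full string comparison
            out[j + 1] = out[j]          # shift larger strings right
            j -= 1
        out[j + 1] = key
    return out
```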

Parallel methods
The exploration of parallel string sorting algorithms remains limited, yet parallelism has become essential for further performance gains, since growth in single-core speed has stalled. The scalability of an algorithm in a parallel computing environment depends on various factors, similar to those affecting sequential methods. Many of the algorithms discussed in the sequential context can be adapted for parallel execution.