User:Algotime/Sandbox

The compressed suffix array is a compressed data structure for pattern matching. Given a text T of n characters from an alphabet Σ, the compressed suffix array supports searching for arbitrary patterns in T. For an input pattern P of m characters, the search time is typically $$O(m)$$ or $$O(m + \log n)$$, depending on the variant. The space used, in bits, is n times the higher-order entropy of the text T, plus some extra bits to store the empirical statistical model, plus o(n).
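To make the search operation concrete, the following is a minimal sketch of pattern matching on a plain, uncompressed suffix array, which is the functionality a compressed suffix array provides in less space. The function names `build_suffix_array` and `find_occurrences` are illustrative, and the naive construction by sorting suffixes is an assumption made for brevity, not an efficient method.

```python
def build_suffix_array(text):
    """Return the start positions (0-based) of all suffixes in sorted order.

    Naive construction by sorting the suffixes directly; fine for a demo,
    far too slow for large texts.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text, sa, pattern):
    """Return the sorted start positions of pattern in text.

    Uses two binary searches over the suffix array sa: all suffixes that
    begin with pattern form one contiguous range of sa.
    """
    m = len(pattern)

    # Lower bound: first suffix whose m-character prefix is >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo

    # Upper bound: first suffix whose m-character prefix is > pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid

    return sorted(sa[start:lo])


text = "banana"
sa = build_suffix_array(text)            # [5, 3, 1, 0, 4, 2]
print(find_occurrences(text, sa, "ana"))  # [1, 3]
print(find_occurrences(text, sa, "na"))   # [2, 4]
```

Each comparison in this sketch costs up to m character comparisons, so the search takes O(m log n) time on the plain array. The array itself stores n integers of about log n bits each.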

The original instantiation of the compressed suffix array solved a long-standing open problem by showing that fast pattern matching was possible using only a linear-space data structure, namely, one proportional to the size of the text T, which takes $$O(n \, {\log |\Sigma|})$$ bits. The conventional suffix array and suffix tree use $$\Omega(n \, {\log n})$$ bits, which is substantially larger. The basis for the data structure is a recursive decomposition using the "neighbor function," which allows a suffix array to be represented by one of half its length. The construction is repeated until the resulting suffix array uses a linear number of bits. Subsequent work showed that the actual storage space was related to the zeroth-order entropy and that the index can serve as a self-index. The space bound was further improved using adaptive coding with longer contexts to achieve the ultimate goal of higher-order entropy. In practice, the space usage is competitive with other state-of-the-art compressors, and the structure also supports fast pattern matching.
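One level of the recursive decomposition can be sketched as follows. This is an illustrative sketch under stated assumptions rather than the construction from any particular paper: positions are 1-based, the text length is assumed to be even, the names `decompose` and `lookup` are hypothetical, and the rank query is a linear scan where a real implementation would use a succinct rank structure and store the neighbor function in compressed form.

```python
def suffix_array_1based(text):
    """Naive suffix array whose entries are 1-based start positions."""
    return [i + 1 for i in sorted(range(len(text)), key=lambda i: text[i:])]


def decompose(sa):
    """Split sa into (bits, psi, half_sa) for one level of the recursion.

    bits[i]  is 1 if sa[i] is an even position, else 0.
    psi[i]   is the neighbor function: for an odd position sa[i] it is the
             index j with sa[j] == sa[i] + 1; for an even position it is i.
    half_sa  lists the even positions divided by 2, in suffix-array order,
             so it has half the length of sa.
    """
    n = len(sa)
    assert n % 2 == 0, "this sketch assumes an even number of positions"
    index_of = {pos: i for i, pos in enumerate(sa)}
    bits = [1 if pos % 2 == 0 else 0 for pos in sa]
    psi = [i if bits[i] else index_of[sa[i] + 1] for i in range(n)]
    half_sa = [pos // 2 for pos in sa if pos % 2 == 0]
    return bits, psi, half_sa


def lookup(i, bits, psi, half_sa):
    """Recover sa[i] from the decomposition without storing sa itself."""
    j = psi[i]
    rank = sum(bits[: j + 1])  # number of 1s in bits[0..j]; linear for clarity
    # Even entries come back as 2 * half; odd entries are one less than
    # their even neighbor, hence the (bits[i] - 1) correction.
    return 2 * half_sa[rank - 1] + (bits[i] - 1)


sa = suffix_array_1based("banana")      # [6, 4, 2, 1, 5, 3]
bits, psi, half_sa = decompose(sa)
print(half_sa)                          # [3, 2, 1]
print([lookup(i, bits, psi, half_sa) for i in range(len(sa))] == sa)  # True
```

Applying `decompose` again to `half_sa` (when its length is even) gives the next level. The space saving comes from storing only the bit vectors, the neighbor functions, and the final short array, with the neighbor functions kept in a compressed encoding that this sketch does not attempt.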

The memory accesses made by compressed suffix arrays and other compressed data structures for pattern matching are typically not localized, and thus these data structures have been notoriously hard to design efficiently for use in external memory. Recent progress using geometric duality takes advantage of the block access provided by disks to speed up the I/O time significantly.