User:C-processor/Polar Tree

In the field of information theory, Polar tree coding is an entropy encoding scheme that utilizes variable-length prefix codes to achieve data compression. Other binary coding methods include Shannon-Fano coding and Huffman coding. All three methods of constructing codes are called prefix coding because none of the individual codes start with the same bit pattern as another, which allows for the unambiguous restoration of the original sequence of symbols from an encoded binary stream. Historically, Shannon-Fano coding was introduced earlier than Huffman coding. It was later proven that Huffman's codes produced the shortest possible message for a given set of symbol frequencies, and so Shannon-Fano coding was eventually abandoned.

Overview
Polar trees are typically used in adaptive encoders; the tree is not output into the encoded stream but updated, lockstep fashion, in both the encoder and decoder. The advantage over traditional adaptive Huffman coding lies in the efficiency with which the table is updated; the fact that Polar trees can be created or updated in a very small number of operations makes them attractive for use with various adaptive algorithms, as these employ trees that require frequent updates. While a small percentage of compression loss may result from using Polar codes, in most cases the amount is negligible; since the algorithm works with numbers rounded to powers of two, the worst case is still within the probabilistic optimum for that particular frequency of symbols (within the normal limits of Huffman coding, that is). As with Shannon-Fano coding, a given set of Polar codes may or may not map directly to a set of Huffman codes. The table below compares the codes generated by the various methods.

The algorithm of constructing of Polar codes goes over code lengths generation as it is shown in the next table.

Initially, the tree is created in a default state, the symbols are assigned a certain weight, which can be either biased or equalized. After a number of symbols (also known by both to the encoder and decoder) have been processed, the tree is updated to reflect the revised statistics of the input. The process of building the tree is shown in above table. All collected frequencies (shown in the second column) are rounded down to powers of two but the sum of all frequencies is rounded up to a next power 2 number, which is 64 in this example. On the next step we move top to bottom and double every frequency, which can be doubled without exceeding the total of 64, which we call target total. The result 16,8,8,8,8 is shown in the fourth column. Now we keep repeating previous step until sum of all frequencies becomes equal to target total. As we can see we need to repeat this step only once and obtain final list 32,8,8,8,8 = 64. The difference in bit lengths between modified frequencies (32,8,8,8,8) and target frequency (64) is the code length for each symbol in the tree. Having code lengths we can build actual codes by adding one and then shifting to the left of each coding fragment to the correspondent length. This last operation of building actual codes from the code lengths is the same used to build Huffman trees. If everything is done correctly the last code value in the tree has all positive bits.

History
The algorithm was developed by Andrew Polar and announced in July 2010.

Independent implementations
The method has recently been incorporated into the LZHAM data compression library.

Polar codes are used by more than one compressor cataloged in the Large Text Compression Benchmark, maintained by Matt Mahoney.

Lite weight data compression utility in 'c', near 600 lines for all coder, decoder and command line parser.

Additional optimization by adaptive sorting
The advantage of the algorithm is quick generation of code lengths. The trees, however, are replaced periodically after each relatively large fragment. To improve compression in original DEMO program the author suggests concept of floating leafs on the tree. This floating leafs concept is adaptive sorting on every step according to accumulated frequency. Presume in the middle of adaptation interval symbol B became more frequent than A. In this case they simply switch while tree stays the same and most frequent symbol encoded by shortest code already in the middle of adaptation interval. The same scenario is reproduced in decoding thus making decoding unmistakeably reversible. The additional advantage of this adaptive sorting or floating leafs is, that at the point of recreation of the tree, symbols are already sorted and time is not spent on sorting alphabet according to frequency.

Limitations of the method
Polar tree is subject to the same limitations in data compression as other two binary prefix coding. It can not possibly compress anything to less than 1 bit per symbol. And it might be far from statistical limit established by entropy for the data that could be compressed to low entropy. In the same way as Huffman coding and Shannon-Fano coding application of Polar coding is justified to data with large alphabets, which entropy is at least over 4 bits/symbol. According to description of LZHAM Polar coding was applied to literals left after application of LZ algorithm. As it is known literals are not very well compressed and may represent good ground for application of either of binary coding technique.

Implementations
A sample application published at the author's site.