Talk:Entropy coding

Isn't "entropy coding" the same thing as "lossless compression"?
Remark: This article contains some problems that appear not worth correcting because the article seems approximately fully-redundant with the article on lossless compression. Also "entropy coding" would be better a better subject title than "entropy encoding". -- comments by 4.65.146.242 moved from article


 * Entropy encoding is only a subset of lossless compression. LZ77 compression, for instance, is an important compression technique that isn't any form of entropy encoding. -- Antaeus Feldspar 20:43, 4 Jun 2005 (UTC)


 * Actually, I don't know where Feldspar gets his definitions, but I mostly agree with the prior comment by 4.65.146.242. In the usage I have experienced, the terms "entropy coding" and "lossless compression" are synonymous, and both terms apply to such things as Lempel-Ziv coding.  I have never previously heard anyone assert that LZ coding or other such things are not entropy coding.  The content of this page also seems rather overly simplistic.  For example, I don't think what it describes is a very accurate description of what happens in arithmetic coding.  I also agree that "entropy coding" is better than "entropy encoding".  Pawnbroker 05:28, 5 Jun 2005 (UTC)


 * Then I don't know where you get your definitions, Pawn. People might use "entropy coding" as if it was a synonym of "lossless compression" (just as there are some major RFCs that incorrectly use "Huffman code" as if it was a synonym of "prefix code", instead of denoting only to those prefix codes created by a Huffman algorithm) but that's definitely not correct usage.  Entropy encoding is encoding where each symbol is assigned a pattern whose length/cost corresponds to its entropy (hence the name).  While entropy encoding is quite often used with LZ77 compression, as the two techniques complement each other, LZ77 is not an example of entropy encoding. -- Antaeus Feldspar 17:03, 5 Jun 2005 (UTC)


 * Can someone point to any textbook on information theory or any similar such authoritative source that draws a distinction between the two terms in this manner? Pawnbroker 05:46, 12 Jun 2005 (UTC)


 * Well, the JPEG standard defines "entropy encoding" as "A lossless procedure which converts a sequence of input symbols into a sequence of bits such that the average number of bits per symbol approaches the entropy of the input symbols", a description which does not apply to dictionary coding. However, if you weren't prepared before to believe that entropy coding might have something to do with entropy, I have no idea what sort of source you're prepared to accept as "authoritative". -- Antaeus Feldspar 16:08, 12 Jun 2005 (UTC)


 * Two responses: 1) Actually, Lempel-Ziv coding does fit into that definition pretty well, 2) the JPEG standard is a document that defines terms for its own purposes - it is not meant to be a textbook on information theory (it has a status perhaps roughly similar to those major RFCs that confuse Huffman coding with prefix coding - note for example that JPEG defines lossless coding in a way that includes only the kinds of lossless coding described in the JPEG document - clearly there are others). If you could point to a single textbook on information theory that draws such a distinction, that would suffice to satisfy me.  Even a single paper in the IEEE Transactions on Information Theory would be something (or a digital communication textbook, or one or more papers in some other IEEE Transactions, like the Transactions on Signal Processing or the Transactions on Communications).  Right now the assertion that there's a difference between the two terms is completely unsupported by published sources.  Pawnbroker 04:48, 14 Jun 2005 (UTC)


 * Further elaborating on point 1 above - the quoted JPEG definition of "entropy encoding" is actually not bad. (It is certainly better than the one found here at Wikipedia today.)  Note, in particular, that it does not refer to a symbol-by-symbol mapping of each input symbol to a particular codeword length.  It only says that the average number of bits per symbol approaches the entropy of the input symbols.  There are many ways to make the average approach the entropy (or at least be substantially less than the average number of bits used to represent the input to the entropy encoding process).  Those ways include techniques like LZW coding and other dictionary-based methods.  Many of those ways do not involve a mapping of individual input symbols to specific coded lengths.  In fact, when the input symbols are not probabilistically independent of each other (e.g., when there is some statistical dependency between input symbol values, as there is with text, for example) any simple one-to-one mapping like a Huffman code will be inferior to methods that take into account the inter-symbol dependencies.  The JPEG document is therefore evidence against the Feldsparian interpretation, not for it.  Pawnbroker 23:29, 14 Jun 2005 (UTC)


 * Note also that entropy of a symbol is not the same thing as the negative log of the probability of the symbol value. Entropy includes the concept of averaging over all possible values of the input symbol.  Entropy is the expected value of the negative log of the probability, not the actual value of the negative log of the probability associated with a particular symbol's value.  The negative log of the probability of the sample value (assuming independent source symbol statistics) is the "information" conveyed by that sample value.  The average amount of information per symbol is the entropy.  Thus the above Feldspar quote that "each symbol is assigned a pattern whose length/cost corresponds to its entropy" is not correct in that sense.  In a Huffman-style scheme, each symbol is assigned a pattern whose length corresponds (approximately) to the amount of information conveyed by its value (not the entropy conveyed by its value).  Entropy coding therefore is conceptually focused on good average behavior on typical data, not at what happens from the micro perspective of individual symbols and individual short message lengths.  Pawnbroker 23:52, 14 Jun 2005 (UTC)


 * A little clear thinking will make it clear that Lempel-Ziv does not fit into the quoted definition, since it is entirely the existence of repeated multi-symbol patterns that enables compression under this technique, which has absolutely no connection with the entropy of the input symbols save as the existence of repeated multi-symbol patterns requires the repetition of the component symbols. A simple example:  Create two texts, each of which uses the characters A-Z twice.  The first text is the sequence A-Z, followed by A-Z again.  The second text is the sequence A-Z followed by the reverse of that sequence, Z-A.  Assuming a fairly normal LZ77 compression scheme, the first text will probably be compressed to just a little over 50% of its size, and the second text will not be compressed at all -- results very different, even though the entropy of the input symbols is the same for the two texts.


 * Right now the assertion that there's no difference between the two terms is completely unsupported by published sources. -- Antaeus Feldspar 00:22, 15 Jun 2005 (UTC)


 * Those are artifically-constructed example sequences - they are not generated by a probabilistic source model, so it is impossible to talk about their entropy. 52 letters is also way too short a message to try to invoke probabilistic behaviour arguments.  If you assume a probabilistic source model that, with equal probability, generates either the sequence A to Z or the sequence Z to A in each stage of its operation, then it can easily be shown that either LZ coding or any other good entropy coder will approach the underlying entropy after a long run of input data (much more than two 26-letter sequences), which is a rate of one bit per 26 letters.  Actually, the existence of repeated patterns is very much coupled with the concept of entropy.  A probabilistic source will tend to create repeated patterns of output.  Certain patterns will be "typical" and others will be highly atypical.  Some proofs of Shannon's lossless coding theorem involve segmenting the source in this fashion (something known as the "asymptotic equi-partition property").  Much of Lempel and Ziv's original paper is devoted to proofs of how their encoding method approaches the source entropy (under specific assumptions).  Pawnbroker 01:28, 15 Jun 2005 (UTC)

Those are artifically-constructed example sequences - they are not generated by a probabilistic source model, so it is impossible to talk about their entropy. Yes, but it *is* possible to run those 52-letter files through a data compression algorithm, such as a "entropy coder" and/or LZ77. Also, it is theoretically possible (although highly unlikely) that exactly these sequences might be generated by a base-26 memoryless source.

In the papers I've read, "entropy coding" usually refers to assuming a simple 0-order entropy model. This leades to "entropy coders" such as Huffman or arithmetic coding. They generate compressed files roughly equal in length to the "0-order entropy" of the input file, even when consecutive letters in the input file are highly correlated (not at all memoryless). On such input files, coders such as LZ77 and bzip can get much smaller output files by exploiting the correlation. Mathematically, I suppose this has something to do with "first-order entropy" or "nth-order entropy" ...

2 examples of published sources:

"Length-limited variable-to-variable length codes for High–Performance Entropy Coding." Joshua Senecal. Mark Duchaineau. IEEE Transactions on Information Theory, 47(4):1533–1537, May 2001. http://graphics.cs.ucdavis.edu/~jgseneca/Papers/DCC_2004.pdf "Given a binary memoryless source, each bit b_i in the source has a probability of being either a 0 or a 1, denoted respectively by the pair (p_i, q_i). ... For a binary memoryless source with probabilities (p, q), the entropy H of the source is defined as $$H(p,q) = -( p\log_2 (p) + q\log_2 (q) )$$ ..."

IEEE Signal Processing Magazine 2001-09 http://www.rle.mit.edu/stir/documents/Goyal_SigProcMag2001.pdf "Entropy codes are used for lossless coding of discrete random variables. Consider the discrete random variable z with alphabet I. An entropy code y assigns a unique binary string, called a codeword, to each $$i \in I$$. ... The entropy code y is called optimal if it is a prefix code that minimizes L(y). Huffman codes ... are examples of optimal codes. The performance of an optimal code is bounded by H(z) <= L(y) < H(z)+1 where $$H(z) = -\sum_{i \in I} p_z (i) \log_2 p_z (i)$$ is the entropy of z."

Both of these give definitions of "entropy" that give memoryless (uncorrelated) symbol entropy, also known as "0-order entropy". The second one (identical to one of the equations in the entropy article) can be applied to any file, including the 52 byte files mentioned above.

--DavidCary 05:05, 15 Jun 2005 (UTC)


 * Those seem like good citations. It's also nice to see someone else enter the conversation. To me, the Goyal reference speaks the most directly on the subject of our discussion. (The other one doesn't really supply a definition of "entropy coding" -- it just discusses a design problem for a source that is assumed to have memoryless and stationary statistics during the development of the method.) So it appears from Goyal that the phrase "entropy coding" (or at least "entropy code") is sometimes used in appropriate literature in a manner that restricts the scope of the concept to the handling of "zero-order" statistics. I think this justifies a different approach to this page than what I previously advocated, although I think there should be some acknowledgement that not everyone who uses the term "entropy coding" (e.g., as defined in JPEG) would include that assumption.


 * I would express some caution about the arithmetic coding example for two reasons: 1) in real applications, arithmetic coding is (almost) always used in a way that is both adaptive and context-based - these things don't quite fit into the zero-order model; 2) in arithmetic coding, the output bits are not just dependent on the current input symbol -- they are also affected by the entire sequence of preceding input symbols and by some subsequent input symbols (due to rounding effects).


 * Note that the Goyal definition of "entropy code" does not include arithmetic coding, as it is restricted to methods that assign "a unique binary string" to each input symbol value.


 * -Pawnbroker 23:07, 20 Jun 2005 (UTC)

As I understand it, entropy coding per se is just encoding symbols according to their frequency (really, some model of their probability, depending on whatever you think affects their probability), trying to give short codes to likely things and longer codes to less likely things. That is all.

If you're doing something like LZ77 (or PPM or any normal compression algorithm), you're mostly doing *modeling*, which is conceptually distinct from entropy coding---in principle, a model notices patterns and assigns probabilities to symbols, based on some assumptions about the source, and the entropy coder just assigns more probable symbols shorter codes.

This is pretty clear in the case of a simple PPM model. PPM predicts the probability of the very next symbol based on how often it's seen that symbol occur in the same contexts (preceding 1 character, or 2 character sequence, or 3..) Then it hands that probability to the arithmetic coder.

The same thing is going on in bizarre twisted ways with LZ77, where the distinction between modeling and encoding is much less clear---LZ77 has a weird model that effectively predicts that sequences it's seen before are more likely to be seen again than sequences it hasn't, and ends up assigning shorter codes (match distances & lengths) to them.

LZ77 is definitely not just doing entropy coding, because it is grouping sequences of symbols together and assigning higher probabilities to those particular sequences than it would to the same symbols in some other order.

Another way modeling gets mixed with entropy coding is that many "entropy coders" operate in a blockwise fashion, counting the frequencies of symbols within a reasonable-sized block of output symbols from the main model, assigning probabilities that correspond to the frequencies in the obvious way, and generating codes accordingly. This isn't just entropy coding, really---it's also a kind of model that implicitly assumes that the block size is a reasonable one, and that there will be relatively stable statistics within the block, but maybe not between blocks.

It is an adaptive model because if your source is "nonstationary" (frequencies of different symbols vary), but within a block they're reasonably stable, it adapts in a certain crude way---it will assign different frequencies to the same symbols within different blocks. Varying the block size changes the model---if the blocks are smaller, it adapts faster (looking only at recent input frequencies to assign probabilities), and if they're bigger, it adapts more slowly (looking at input going further back).

A blockwise "entropy coder" is thus *not* really just an entropy coder---it's a crude "recency model" plus an entropy coder.

Some other adaptive (so-called) "entropy coders" try to do something better than chunking things into blocks, and adapt more quickly and consistently. (E.g., not losing their memory across block boundaries, weighting more recent stuff more than less recent stuff.) But if they do, they are not really just entropy coders, either---they have a special model that they're using to assign probabilities before doing the actual entropy coding, based on certain assumptions about the source. (E.g., that what you've seen lately is a better predictor of what you'll see soon than stuff you've seen less lately.)

An entropy coder that's *just* an entropy coder doesn't do any of that. It doesn't do any pattern recognition. You hand it a probability distribution, and a particular symbol, and it hands you back a code for that symbol, given the probability distribution. That's all.

Consider a typical bytewise Huffman coder, which goes through a file counting the occurrences of each symbol, computing their (observed) probabilities, and assigning short codes to the more frequent symbols, and replacing each symbol occurrence with the appropriate code. Even that is not *just* an entropy coder---it's building a probability model first, then using that to construct a code, then using that to actually entropy code the symbol occurrences in the file. If it were really a pure entropy coder, it would not count symbol occurrences to get a probability distribution---you'd have to hand it the probability distribution and the sequence of symbols to compress according to that distribution.

An important idea there is that entropy is always relative to a model of the source. If you just Huffman code a text file by counting the frequencies of the individual bytes and doing the usual Huffman thing, the entropy will be high. That's because you're assuming an "order 0 model" where there are no interesting patterns. But actual text is full of patterns that such a scheme doesn't notice, e.g., that if you just saw "t", you're likely to see "h", and if you do, you're likely to see "e". So you want a model that gives you a different probability distribution depending on whether you just saw "t", or "th", so that the entropy coder will give you a shorter code for "e" after "th" than it would in other "contexts".

When people say "the entropy of", without explicit qualification, that's usually a shorthand for the "order 0 entropy of", where there's assumed to be a fixed probability distribution, rather than a constantly shifting one. — Preceding unsigned comment added by 107.77.218.83 (talk) 06:49, 2 May 2016 (UTC)

Repetition of content
Much of this content seems to be saying that the length of the codes are assigned proportionally to the inverse logarithm of the probability. Couldn't this just be said once? --Berrinam 1 Jun 2005


 * Quite possibly this repeated material shouldn't even be there once, per the above conversation. Pawnbroker 00:42, 15 Jun 2005 (UTC)

Range coding and arithmetic coding
I've deleted range encoding from the list of most common techniques, as it's actually the same as arithmetic coding. See those articles (and their talk pages) for more on the issue (if you're interested).

--Simon G Best 19:44, 21 July 2006 (UTC)

Entropy can not be a measure of similarity
The messages AAAAABBBBB and ABABABABAB and BBABABAAAB have same entropy. Obviously they are not similar. —Preceding unsigned comment added by 72.16.213.30 (talk • contribs)

I remember hearing about data compression algorithms being used on text files to study language. It's probably part of what is apparently called Computational linguistics. Maybe adding information about this would make the paragraph about measuring similarity more accessible to the lay reader? 84.198.246.199 (talk) 21:19, 23 May 2009 (UTC)


 * You are correct that a symbol-based entropy coder would not be able to distinguish strings with identical symbol frequencies, and hence these would be necessarily classified identically - however, it remains the case that it is a classifier that is useful in some situations (like identifying the language of a text). No classifier is a perfect measure of perceived similarity.
 * However, I do think this section would be more appropriately incorporated into a higher-level article, since it applies to any compression system where a model is constructed and not just entropy coding. Dcoetzee 06:56, 24 May 2009 (UTC)

This section is just talking about KL-divergence... it doesn't have to do with encoding particularly. — Preceding unsigned comment added by 171.66.208.145 (talk) 20:41, 17 April 2017 (UTC)

Entropy coding is radically different from other lossless compression
The entropy coding does not consider the order of symbols, statistics only. The messages AAAAABBBBB and BBAAABBAAB will have same bit length after correct entropy encoding, because they both have 5 symbols A and 5 symbols B. Other algorithms such as LZ77, RLE use the order of symbols in compression. When we say that it is impossible to compress data below entropy we mean that least wanted and least convenient symbol combination may be compressed to a particular size computed as entropy multiplied by number of symbols. If message can be compressed by RLE or LZ77 or LZW it should not be passed to entropy coder. Entropy coding is last resort used after all advantages of positions of symbols are exhausted.

My definition (not very smooth, but correct): The entropy encoder is some software or hardware device that compresses multiple data sets with same distribution and different order in symbols to the same size determined exclusively by statistics. —Preceding unsigned comment added by 72.16.213.30 (talk • contribs)
 * Actually, entropy coders are typically used as backends to most of the position-dependent algorithms you describe (dictionary coding usually serves to prepare data for better entropy coding, rather than to supplant entropy coding). I can scarcely think of a single compression method that doesn't use entropy coding at some point. Moreover, depending on what you use as your symbol set, entropy encoding can use finer-granularity position information. For example, a very simple and pretty effective compression technique for English text is to treat each word as a symbol and use an entropy code; this uses position information within words but not between words. In the most extreme theoretical case, we could treat each file of a given length as a single symbol, which uses all the position information, but is impractical, since it requires dealing with a symbol set whose size is exponential in the file size. Dcoetzee 06:18, 24 May 2009 (UTC)


 * There's a confusion here, I think. The anon is thinking about static codes, where the probability for each symbol is considered by estimating the total frequency that that symbol occurs.   is quite right, that even then the entropy encoder may encode large symbols each representing a run of letters at a time, rather than single letters.  But entropy encoders in the wild are more sophisticated than this, I think, and rather than using a static probability, they code according to the conditional probability of the next symbol (as predicted by some model), given what has already been seen.  In this way the length of their coded message does approach the entropy of the whole message, rather than the (rather less efficient) sum of the entropies for each letter.

Yes. Part of the confusion is that estimating probabilities from observed frequencies isn't part of entropy coding per se. It's actually a simple kind of modeling in itself. An actually static model would be one that you hand the entropy coder, not one it computes by counting symbol occurrences.


 * And it is this, re the previous question, that then leads to a "poor man's Kullback-Leibler distance", if one deliberately uses a model made for one string to see how well (or rather, how much worse than predicted) a second string is encoded using it. Jheald (talk) 08:35, 24 May 2009 (UTC)


 * I'd like to add that you can turn any scheme for estimating symbol probabilities into an entropy coding, as it depends only on a probabilistic model, which may or may not take positional information into account. If we are working on utf-8 encoded text at the byte level, for example, we know that certain combinations of symbols never occur(bytes that start with 10 never follow bytes that start with 0). The coder can take advantage of this in addition to the frequency data in order to lower the information content of the symbols. Another (trivial) example would be a perfect predictor. Let's say we can always predict the next symbol with 100% accuracy (obviously not a reasonable assumption). Then none of the symbols carry any information, and an entropy coder will be able to store it in 0 bytes. Dictionary codes do use statistical information, but in a fundamentally different way. The size of a dictionary code isn't decided by its likelyhood. One can obviously combine the two, as Dcoetzee points out.--2003:69:CD0B:C001:2E81:58FF:FEFF:8F4B (talk) 12:19, 25 November 2013 (UTC)

I think you're abusing terminology here and missing a key distinction when you say that "you can turn any scheme for estimating symbol probabilities into an entropy coding". What you mean is that you can turn any scheme for estimating probabilities, PLUS an entropy coder, into a _compressor_. "Compression" and "entropy coding" are not synonyms.

statistical modeling + entropy coding = compression (to quote the PPM people)

Have a look at PPM to see this. LZ77 is doing the same basic thing, but it's mixing up the modeling and the entropy coding, doing both crudely with a single simplistic algorithm. — Preceding unsigned comment added by 107.77.218.83 (talk) 07:09, 2 May 2016 (UTC)