Talk:Entropy (information theory)

Wiki Education Foundation-supported course assignment
This article was the subject of a Wiki Education Foundation-supported course assignment, between 27 August 2021 and 19 December 2021. Further details are available on the course page. Student editor(s): Joycecs. Peer reviewers: J47211, Happyboi2489, Pinkfrog22.

Above undated message substituted from Template:Dashboard.wikiedu.org assignment by PrimeBOT (talk) 20:40, 16 January 2022 (UTC)

Independent and identically distributed?
I would like to remove the statement in the introduction that states that entropy is only defined for messages with an independent and identically distributed alphabet. This is not true: it is defined for any message, each letter of which may have different probabilities, and for which there may be correlations among the letters. For example, it is stated that in the English language, the pair "qu" occurs very frequently compared to other pairs beginning with "q". In other words, for the letter "q", the letter following it is highly correlated to it, not independent. Also, in English, the characters are not identically distributed: "e" is more probable than "x". For a message which is N letters long, with an alphabet of m letters, there are m^N possible messages, each with their own probability pi, with no restrictions on these probabilities other than that they are non-negative and their sum is equal to unity. The entropy of this set of messages is the sum of -pi log(pi) over all m^N possible messages. All sorts of different letter frequencies and correlations may result from this set of m^N probabilities. PAR (talk) 11:50, 30 April 2011 (UTC)
 * It is a long time since I thought about this topic, but I think that text in the lead is talking about the best possible system (the one which conveys most information with each "character" that is transmitted). It looks out of place, or perhaps poorly expressed. Johnuniq (talk) 03:56, 1 May 2011 (UTC)

To say Shannon's entropy is only defined for independent and identically distributed spaces is perhaps a bit misleading. Perhaps the better way of saying this is that Shannon's entropy is a quantification of information in an i.i.d. source. You can apply it to English, but then it no longer represents the amount of information transmitted (see Kolmogorov Complexity). Note that you have also completely misunderstood what i.i.d. means in terms of English. It does NOT mean that the characters all have the same probability. It means that the probability of each character at any given position is unaffected by knowledge of characters at other positions (independent), and that the probability of each character is the same in every position (identically distributed). Naturally English doesn't abide by this, which means that in theory Shannon's entropy is probably not a lower bound for compressing English (again, see Kolmogorov Complexity). Elemental Magician (talk) 08:55, 11 April 2013 (UTC)
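The two positions above can be compared numerically. The following is a minimal sketch, assuming a hypothetical two-letter alphabet and made-up probabilities for two-letter messages; it evaluates the sum of -pi log2(pi) over all m^N messages once for a correlated source and once for an i.i.d. source. The names `entropy_bits`, `correlated` and `iid` are illustrative only.

```python
import math
from itertools import product

def entropy_bits(probs):
    """Shannon entropy in bits of a probability distribution (zero terms skipped)."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Hypothetical alphabet of m = 2 letters, messages of length N = 2.
alphabet = "qu"

# Correlated source (made-up numbers): "q" is usually followed by "u".
correlated = {"qu": 0.45, "qq": 0.05, "uq": 0.25, "uu": 0.25}

# i.i.d. source: every position is drawn independently from the same distribution.
marginal = {"q": 0.5, "u": 0.5}
iid = {a + b: marginal[a] * marginal[b] for a, b in product(alphabet, repeat=2)}

h_correlated = entropy_bits(correlated.values())
h_iid = entropy_bits(iid.values())

print(f"correlated source: {h_correlated:.4f} bits per message")
print(f"i.i.d. source:     {h_iid:.4f} bits per message")
```

Both sums are well defined; in this sketch the correlated source simply comes out with fewer bits per message than the i.i.d. one.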

Entropy as an apparent effect of conservation of information.
If entropy is considered an equilibrium property as in energy physics, then it conflicts with the conservation of information. But the second law of thermodynamics may simply be an apparent effect of the conservation of information, that is, entropy is really the amount of information it takes to describe a system, each reaction creates new information but information cannot be destroyed. That means the second law of thermodynamics is not an independent law of physics at all, but just an apparent effect of the fact that information can be created but not destroyed. The arrow of time is thus not about destruction, but about the continuous creation of information. This explains how the same laws of physics can cause self-organization. Organized systems are not anyhow less chaotic than non-organized systems at all, and the spill heat life produces can, in an information-physical sense, be considered beings eliminated by evolution rather than a step towards equilibrium. It is possible that overload of information will cause the arrow of time to extract vacuum energy into usefulness rather than heat death. 217.28.207.226 (talk) 10:51, 23 August 2011 (UTC)Martin J Sallberg
 * This sounds interesting, but the details are not explained clearly. For example, information can be destroyed--we do it whenever we delete a file. And there is no reason given that "The arrow of time is...about the continuous creation of information." There is no basis to edit the article until these claims are explained more clearly (and with references). David Spector (talk) 11:31, 17 February 2012 (UTC)

In response to the above response, you are wrong about information being destroyed when you delete a file. The information is dispersed and thereafter unretrievable, but it is not destroyed. Information is never destroyed. — Preceding unsigned comment added by 69.143.174.166 (talk • contribs) 06:21, 17 October 2013

Ambiguous intro
The 2nd paragraph ends: "As another example, the entropy rate of English text is between 1.0 and 1.5 bits per letter,[6] or as low as 0.6 to 1.3 bits per letter, according to estimates by Shannon based on experiments where humans were asked to predict the next letter in a sample of English text.[7]"

The first range cited means even (50/50) or worse, the second means better than even or worse than even. If neither of these are typos wouldn't it be better to just say "0.6 to 1.5"?

robotwisdom (talk) 21:09, 27 March 2013 (UTC)


 * As far as I understand it, these were experimental studies with humans and therefore there's a high variance in the results. The two studies found slightly different ranges and you can't combine them like this. What really should be mentioned here is a study that analyzes English text electronically and computes the entropy rate in some way.
 * Btw, 50/50 means that, when given an initial sequence of text such as "The fox w", people were able to exclude, on average, half of all possible 26 letters as the next letter. ylloh (talk) 18:15, 28 March 2013 (UTC)
 * I think rather it means they can guess the next letter half the time-- much much harder. (I'm trying to gauge the entropy of Joyce's invented language in Finnegans Wake.) robotwisdom (talk) 00:54, 29 March 2013 (UTC)
 * Then it would not be 1 bit per letter but $$\log_2(26) / 2 \approx 2.35$$ bits per letter, I think. ylloh (talk) 23:56, 29 March 2013 (UTC)
 * (Is there somewhere else on the Web we should move this discussion?) If you arrange the 26 letters from commonest down (etaoinshrdlu, we used to say) couldn't you just blindly guess the top 13 and be right far more than half the time? There's a Web app somewhere that gave me values between 4 and 5 for both FW and plain English. I'm looking for a metric that conveys how much harder it is to proofread FW. robotwisdom (talk) 08:40, 30 March 2013 (UTC)
 * They weren't giving 13 guesses for the next letter, they were giving one -- and were right (more than) half the time.
 * It's an illustration of the high redundancy of English. If the letters were completely random, you would need to transmit log2 26 = 4.70 bits of information to identify each one.  But actually English is not so random, and is quite compressible: you can often have a quite good guess as to what the next letter is.  So if you build a good predictive model for the possibilities for the next letter, on average you only need to send one bit per symbol to identify it.  See eg videos of Dasher in action. Jheald (talk) 10:00, 30 March 2013 (UTC)
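The figures quoted in this thread can be checked with a few lines of arithmetic. This is only a sketch: the variable names are invented here, and the 1 bit per letter rate is taken from the article's quoted estimate rather than computed.

```python
import math

# Cost per letter if all 26 letters were equally likely.
bits_uniform = math.log2(26)

# The log2(26) / 2 figure quoted earlier in the thread.
bits_half = bits_uniform / 2

# Fraction of the uniform cost saved if the true rate is 1 bit per letter.
redundancy = 1 - 1.0 / bits_uniform

print(f"log2(26)     = {bits_uniform:.2f} bits per letter")
print(f"log2(26) / 2 = {bits_half:.2f} bits per letter")
print(f"redundancy at 1 bit per letter = {redundancy:.1%}")
```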

Now I'm imagining a gigantic decision-tree including every imaginable message that might be sent and every variation in how it might be expressed. This would allow some sort of optimal compression... and FW strains this via puns and riddles and allusions. robotwisdom (talk) 18:04, 30 March 2013 (UTC)

It is worth bearing in mind that the state of the art for compression of English text from ASCII is typically somewhere around 9:1 which is well under 1 bit per ASCII byte. See https://www.maximumcompression.com/data/text.php Elroch (talk) 22:20, 17 May 2018 (UTC)

Absurdly based and highly misleading estimate of entropy of English
From the article: "As another example, the entropy rate of English text is between 1.0 and 1.5 bits per letter,[6] or as low as 0.6 to 1.3 bits per letter, according to estimates by Shannon based on experiments where humans were asked to predict the next letter in a sample of English text."

Since information content is only equal to Shannon's entropy given an independent and identically distributed language, which English is not, asking people to predict the next letter in a sample of English text would underestimate the Shannon entropy of English if people could see the parts of the text that they were not predicting. If people were not given any information, but were merely asked to give a probability of each character (or something like that) this should be included.

Judging by the fact that Shannon derived a rather lower value of entropy with this experiment, my guess is that this wasn't actually measuring the Shannon entropy (but something like the Algorithmic probability).Elemental Magician (talk) 09:03, 11 April 2013 (UTC)


 * This seems to be an extension of the discussion further up-page of whether Shannon entropy is only defined for processes that are i.i.d., which came to the view that it was not so restricted.


 * In the entropy of English case, it seems not unreasonable to consider the average entropy per character to be the average of $$\scriptstyle{-\sum p_i \log p_i}$$, where the probabilities pi give the conditional probability distribution for a character at position n given all that have been seen previously. This is in fact the very example that PAR gave above for application of Shannon entropy to a non-IID process, where for example the probabilities pi for a letter following a 'q' will not be the same as those for a letter following a 't'. Jheald (talk) 16:57, 11 April 2013 (UTC)
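A rough sketch of the conditional average described above, simplified so that each character is conditioned only on the single preceding character rather than on everything seen previously. The function names and the sample string are assumptions made for illustration, and estimates from such a short text are not reliable values for English.

```python
import math
from collections import Counter

def conditional_entropy_bits(text):
    """Estimate H(X_n | X_{n-1}) in bits from bigram counts in `text`."""
    pairs = Counter(zip(text, text[1:]))
    contexts = Counter(text[:-1])
    total = sum(pairs.values())
    h = 0.0
    for (prev, cur), n in pairs.items():
        p_pair = n / total           # joint probability of (prev, cur)
        p_cond = n / contexts[prev]  # probability of cur given prev
        h -= p_pair * math.log2(p_cond)
    return h

def unigram_entropy_bits(text):
    """Entropy in bits of the single-character frequencies in `text`."""
    counts = Counter(text)
    n = len(text)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

# Hypothetical sample text, chosen only because it contains many "qu" pairs.
sample = "the quick queen quietly quit the quiz"
h_unigram = unigram_entropy_bits(sample)
h_conditional = conditional_entropy_bits(sample)

print(f"ignoring context:           {h_unigram:.3f} bits per character")
print(f"given the previous letter:  {h_conditional:.3f} bits per character")
```

Conditioning on longer contexts follows the same pattern, with the key of `contexts` extended from one character to several.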

We can be sure that the entropy of typical English text is less than 1 bit per letter, since state of the art compression algorithms achieve significantly under 1 bit per ASCII character. See https://www.maximumcompression.com/data/text.php Elroch (talk) 22:24, 17 May 2018 (UTC)
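The reasoning here is that a lossless compressor's output size is an upper bound on the information in the text, so compressed bits per character bound the entropy rate from above. A minimal sketch using Python's standard `zlib` module follows; `zlib` is far from state of the art, and the sample paragraph and the function name `compressed_bits_per_char` are assumptions made for illustration, so the number it prints is much higher than the figures on the linked page.

```python
import zlib

def compressed_bits_per_char(text):
    """Compressed size in bits divided by the number of ASCII characters."""
    data = text.encode("ascii")
    return 8 * len(zlib.compress(data, 9)) / len(data)

# Hypothetical short sample; real benchmarks use far longer texts.
sample = (
    "It is an illustration of the high redundancy of English. If the letters "
    "were completely random, you would need to transmit about 4.7 bits of "
    "information to identify each one. But English is not so random, and is "
    "quite compressible: you can often have a good guess as to what the next "
    "letter is. So if you build a good predictive model for the possibilities "
    "for the next letter, on average you need far fewer bits per symbol."
)

rate = compressed_bits_per_char(sample)
print(f"zlib upper bound: {rate:.2f} bits per character (raw ASCII uses 8)")

# A 9:1 ratio, as quoted for the best compressors, corresponds to 8 / 9 bits.
print(f"9:1 ratio: {8 / 9:.2f} bits per character")
```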

Characterization
The characterization section can probably be rewritten by a professional. If I am correct "A Characterization of Entropy in Terms of Information Loss" by Baez et al. only require functoriality, convex linearity, and continuity for the definition of Shannon's entropy. The current statements are mutually overlapping and probably not entirely true. Andy (talk) 14:10, 2 July 2013 (UTC)

It seems characterization is central to understanding the motivation for the definition of entropy. Perhaps it could be mentioned or referred to somewhere in the introduction? 130.243.214.198 (talk) 13:17, 7 March 2014 (UTC)

Definition
Two vaguenesses in the Definition section.

1. "When taken from a finite sample, the entropy can explicitly be written as...": here it sounds like the author means that the entropy is taken from a finite sample, which is certainly not the case.

2. The term n_i in the expanded form for H(X) is never defined. Scorwin (talk) 17:03, 3 October 2013 (UTC)

Units of entropy
The introduction states bits, nats and bans as common units of Shannon's entropy, while the definition section uses dits, rather than bans. — Preceding unsigned comment added by 129.177.143.236 (talk) 10:56, 18 November 2015 (UTC)


 * Shannon entropy units are bits/symbol. For example, each of the following 6 "messages" below have a Shannon entropy of log2(2) = 1 because each symbol is used with the same probability and there are 2 symbols.  An equal probability distribution is the max entropy for that number of symbols:
 * ab, 12, ababababab, 1212121212, aaaaabbbbbccccc, 221122112211221122112211.
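What the online "entropy calculators" do can be sketched as below, assuming the observed symbol frequencies of each string are treated as its probabilities (the assumption disputed further down this thread). The function name `empirical_entropy_bits` is invented here. Under that assumption the three-symbol string aaaaabbbbbccccc evaluates to log2(3) rather than 1.

```python
import math
from collections import Counter

def empirical_entropy_bits(message):
    """Entropy in bits per symbol, using observed frequencies as probabilities."""
    counts = Counter(message)
    n = len(message)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

messages = [
    "ab", "12", "ababababab", "1212121212",
    "aaaaabbbbbccccc", "221122112211221122112211",
]
results = {m: empirical_entropy_bits(m) for m in messages}

for m, h in results.items():
    print(f"{m:26s} {len(set(m))} symbols  H = {h:.3f} bits/symbol")
```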


 * But physical entropy units are nats because it increases as the "length of the message" (number of particles or volume) increases. Intensive entropy is more like Shannon entropy, being nats per mole, kg, volume, or particle.  The Joules/Kelvins units of kb are unitless because Kelvins are a measure of kinetic energy (joules) so the units cancel.  It's merely the slope needed to make sure there is (classically) zero kinetic energy at zero temperature, which is a truism.  If Joules and Kelvins were fundamentally different units there would not be this strict requirement that they are both zero at the same time when kb is used. Quantum mechanically kb is not even needed to get S and it becomes more obvious the units are "nats". Ywaz (talk) 23:02, 30 November 2015 (UTC)

Why do you say that information entropy is in bits per symbol? If each of 2^10 possibilities is equally probable, for a string of zeros and ones that is ten symbols long, then any one of those strings represents 10 bits, not 1 bit, yes? 𝕃eegrc (talk) 14:56, 1 December 2015 (UTC)


 * Shannon entropy H measures a disorder rate aka an information rate in a set of symbols. A string with all 1's or all 0's has H = 0 bits/symbol. At the other extreme, half 1's and half 0's is H=1 bit/symbol.  Search "entropy calculator" to play with it.  If H = 0.5 for a particular set of 10 bits, then it is carrying only H*10 = 5 bits of data, which shows compressibility only in a very restricted sense.  H*N like this is total entropy for a set of N symbols which is analogous to physical entropy's N microstates. H is analogous to specific entropy So (usually in units of nats per mole or per kg).  If each bit of 10 is randomly selected, then all 1's is not equally likely as half 0's and half 1's, but it is as likely as any other particular sequence.  Shannon entropy and physical entropy do not look at the sequence order. But a random selection of 10 bits is likely to have H close to 1, and H*N=10 for the total string. Ywaz (talk) 18:16, 1 December 2015 (UTC)

You may want to try for consensus on that before editing the article. My experience is that information entropy is more general than the per symbol definition I think you mean. 𝕃eegrc (talk) 20:10, 2 December 2015 (UTC)


 * The above discussion may make some sort of sense, but I cannot see it. The most glaring problem is that a message *instance* is talked about as if it had entropy. The entropy of "ab" is not defined, the entropy of "ababababab" is not defined. Entropy is not something which a particular message possesses. "A string with all 1's or all 0's has H = 0 bits/symbol" is a nonsense statement. "each of the following 6 "messages" below have a Shannon entropy of log2(2) = 1" is a nonsense statement. Please rewrite the argument in a form that makes sense. PAR (talk) 15:31, 3 December 2015 (UTC)

Shannon's entropy equation H is the definition of information entropy. You can look up "entropy calculator" and find many implementations of it online and you will get the same answers I provided. Both physical and informational entropy are an instantaneous statistical measure ("picture") of the data. The information most people intuitively think about is "extensive" entropy called shannons in bits as explained in the article. The "shannons" of ABABAB is H*N = 6. H is "intensive" entropy and is 1 bit/symbol in this case. The first image in the article with coins describes "extensive" (absolute) informational entropy and the second image and all the equations describe "intensive" entropy (bits per symbol, aka Shannon's H function).

Concerning the correct units, see Shannon's book, chapter 1, section 7 which says "H or H' measures the amount of information generated by the source per symbol [H] or per second [H']." A "source" can be any data file being read from memory, an FM radio station, or the position and energies of particles in a gas. See Landauer's principle for this direct connection between physical entropy and information entropy, with experimental observations. Normal physical entropy is an absolute value (extensive).


 * $$S = k_\mathrm{B} \ln(2) N H$$

Where S is physical extensive entropy, kb is Boltzmann's constant, N is number of particles, ln(2) is the conversion from bits to nats, and H is Shannon's (intensive) entropy in bits per particle. You assign a symbol to each possible microstate a particle can occupy (100 possible internal energies in 100 possible locations would use 10,000 symbols as the alphabet for the 10,000 possible microstates), then calculate H from the probabilities of the distinct symbols by looking at a large number of particles, or by already knowing the physics of the particles. You could take a picture of the particles, write down the symbol for every one of their energy-volume (phase space, microstate) locations, and consider this a message the system of particles has sent you. The "particles" could be in mole, kg, or m^3 and the symbols would represent the phase space of that block. N would then be in mole, kg, or m^3. Ywaz (talk) 17:55, 3 December 2015 (UTC)
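As a worked instance of the formula, assuming purely for illustration that all 10,000 microstates in the example are equally probable (the comment does not state this):

$$H = \log_2(10^4) \approx 13.29 \text{ bits per particle}, \qquad S = k_\mathrm{B} \ln(2)\, N H = N k_\mathrm{B} \ln(10^4) \approx 9.21\, N k_\mathrm{B}$$

With unequal microstate probabilities H would be smaller, and S would shrink in proportion.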


 * I have the feeling we are arguing over language, rather than concepts. Just to be sure, what would you say is the entropy (extensive and intensive) of the message "A5B"? PAR (talk) 01:32, 4 December 2015 (UTC)

The Shannon entropy of "A5B" and "AAA555BBB" and "AB5AB5AB5" are all the same: H = log2(3) = 1.58 bits/symbol. This is an intensive type of entropy. The shannons (extensive entropy) of these messages are H*N: 1.58*3 for the first one and 1.58*9 for the other two. H is the sum of pi*log2(1/pi) where pi is the observed frequency of each symbol in the message.

How can you revert the article claiming Shannon entropy is extensive when Shannon himself said it is bits/symbol (intensive)? I have a reference from the man himself and you have nothing but opinion. A reference from the original source of information theory overrides 2 votes based on opinion. Admit your error and let the article be corrected to reflect the facts. Ywaz (talk) 19:34, 4 December 2015 (UTC)


 * Is my understanding of your statements correct: You are saying that "A5B" is independent draws of "A", "5", and "B" from a distribution and thus our best estimate of that distribution is that each of "A", "5", and "B" has a 1/3 probability. I agree that the entropy of such a (1/3, 1/3, 1/3) distribution is indeed log(3).  I hesitate to call this bits per symbol, though, because it is the counts of occurrences that is being divided by the number of symbols, not some measure of bits that is being divided by the number of symbols.  𝕃eegrc (talk) 21:13, 4 December 2015 (UTC)


 * Ok, "A5B" does not possess an entropy. Entropy is defined for the alphabet and their associated probabilities that produced the instance "A5B". What you have done is used the instance to assume an alphabet ("A","5","B") and you have assumed equal probabilities for each letter of the alphabet. These are not assumptions that you can generally make. Because of this, it is simply wrong to say that "A5B" has a particular value of entropy. The symbols imply nothing about the alphabet, nor the probabilities of the letters of that alphabet. I think you know this, but you talk as if "A5B" has entropy, and that is what is bothering me. PAR (talk) 21:51, 4 December 2015 (UTC)

Leegrc, "bits" are the invented name for when log base 2 is used. There is, like you say, no "thing" in the DATA itself you can point to. Pointing to the equation itself to declare a unit is, like you are thinking, suspicious. But physical entropy itself is in "nats" for natural units for the same reason (they use base "e"). The only way to take out this "arbitrary unit" is to make the base of the logarithm equal to the number of symbols. The base would be just another variable to plug a number in. Then the range of the H function would stay between 0 and 1. Then it is a true measure of randomness of the message per symbol. But by sticking with base two, I can look at any set of symbols and know how many bits (in my computing system that can only talk in bits) would be required to convey the same amount of information. If I see a long file of 26 letters having equal probability, then I need H = log2(26) = 4.7 bits to re-code each letter in 1's and 0's. There are H=4.7 bits per letter.

PAR, as far as I know, H should be used blind without knowledge of prior symbol probabilities, especially if looking for a definition of entropy. You are talking about watching a transmitter for a long time to determine probabilities, then looking at a short message and using the H function with the prior probabilities. Let's say experience shows a "1" occurs 90% of the time and 0 occurs 10%. A sequence then comes in: 0011. "H" = 2*(-0.1*log2(0.1) - 0.9*log2(0.9) ) = 6.917. [edit: this 6.917 result is wrong as PAR points out below, so the rest of this paragraph should be ignored] This would be the amount of information in bits conveyed when only 4 bits were sent. Two out of 4 symbols being 0 was a surprise, so it carried more information than usual. I don't know how to incorporate this into a general idea of entropy. It's not Shannon's H because Shannon's H requires the sum of probabilities to be one. To force it in this case would be H = "H" / 4 = 1.72 bits per symbol (bits/bit). So 0.72 bits/bit is the excess information the short message carried than normally expected. Backing up, I can say the H of this source with 90% 1's is 0.496 bits/bit which is less than the ideal 1 bit/bit that random bits carry from a random source. All that's interesting but I don't see a definition of entropy coming out of it. It seems to be just applying H to discover a certain quantity desired.

Let me give an example of why a blind and simple H can be extremely useful. Let's say there is a file that has 8 bytes in it. One moment it says AAAABBBB and the next moment it says ABCDABCD. I apply H blindly not knowing what the symbols represent. H=1 in the first case and H=2 in the second. H*N went from 8 to 16. Now someone reveals the bytes were representing microstates of 8 gas particles. I know nothing else. Not the size of the box they were in, not if the temperature had been raised, not if a partition had been lifted, and not even if these were the only possible microstates (symbols). But there was a physical entropy change everyone agrees upon from S1=kb*ln(2)*8 to S2=kb*ln(2)*16. So I believe entropy H*N as I've described it is as fundamental in information theory as it is in physics. Prior probabilities and such are useful but need to be defined how they are used. H on a per message basis will be the fundamental input to those other ideas, not to be brushed aside or detracted from.
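The arithmetic in this example can be reproduced with the sketch below. It assumes, as the comment does, that the observed symbol frequencies stand in for microstate probabilities, which is the point contested in the replies. The function names are invented for illustration, and `K_B` is the SI value of Boltzmann's constant.

```python
import math
from collections import Counter

K_B = 1.380649e-23  # Boltzmann constant in J/K (exact SI value)

def empirical_entropy_bits(symbols):
    """H in bits per symbol, from observed symbol frequencies."""
    counts = Counter(symbols)
    n = len(symbols)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def total_bits(symbols):
    """N*H: the 'extensive' quantity described in the comment."""
    return len(symbols) * empirical_entropy_bits(symbols)

def physical_entropy(symbols):
    """S = kb * ln(2) * N * H in J/K, under the comment's assumptions."""
    return K_B * math.log(2) * total_bits(symbols)

before, after = "AAAABBBB", "ABCDABCD"
for label, s in (("before", before), ("after", after)):
    print(f"{label}: H = {empirical_entropy_bits(s):.1f} bits/symbol, "
          f"N*H = {total_bits(s):.1f} bits, S = {physical_entropy(s):.3e} J/K")
```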

edit: PAR I found an example of what you're probably thinking about: http://xkcd.com/936/ The little blocks in this comic are accurate, representing the number of bits needed to represent all the possibilities, which uses a prior knowledge. This is just N*H shannons as I've described for each grouping of his blocks, and to get a total number of shannons (entropy as he calls it, extensive entropy as I call it) you just add them up like he has done. Actually, he has made a mistake if the words we choose are not evenly distributed. In that case, we calculate an H which will come out lower than the 11 bits per word-symbol he has indicated, which means if our hacker starts with the most common words, he is more likely to finish sooner, which is like saying there are fewer than 2^44 things we have to search. Ywaz (talk) 01:48, 5 December 2015 (UTC)
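The point about unevenly chosen words can be illustrated as follows. This is a sketch under stated assumptions: the list size of 2048 is inferred from the 11 bits per word figure, and the skewed distribution (weight of the k-th word proportional to 1/k) is entirely hypothetical.

```python
import math

def entropy_bits(probs):
    """Shannon entropy in bits of a probability distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

WORDS = 2048   # list size implied by 11 bits per word (2**11)
N_WORDS = 4    # words per passphrase

# Every word equally likely: 11 bits per word, 44 bits in total.
uniform = [1 / WORDS] * WORDS
bits_uniform = N_WORDS * entropy_bits(uniform)

# Hypothetical skewed choice: weight of the k-th word proportional to 1/k.
weights = [1 / k for k in range(1, WORDS + 1)]
total = sum(weights)
skewed = [w / total for w in weights]
bits_skewed = N_WORDS * entropy_bits(skewed)

print(f"uniform choice: {bits_uniform:.1f} bits")
print(f"skewed choice:  {bits_skewed:.1f} bits")
```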


 * Ywaz- you say:

"as far as I know, H should be used blind without knowledge of prior symbol probabilities, especially if looking for a definition of entropy. You are talking about watching a transmitter for a long time to determine probabilities, then looking at a short message and using the H function with the prior probabilities."


 * NO. Watching a transmitter will allow us to estimate probabilities, the longer we watch, the better the estimation. Or, we may have a model which gives us the probabilities directly, as in statistical mechanics where each microstate (message) is assumed to have equal probability. Once we have these probabilities, we can calculate the entropy, and only then, not before. In "message" terms, the entropy is the average information carried by a message. This requires knowing the set of all messages and their probabilities, which sum to unity. In micro/macrostate terms, the macrostate is a set of microstates, each with their own probability, the sum of which is unity. The entropy is only defined for the macro state. A microstate has no entropy. It does carry information, however, and I think you are confusing the two. The entropy is the average information carried by a message or microstate. It is averaged over all possible microstates which constitute the macrostate, or alternatively, it is averaged over the set of all possible messages.

"Let's say experience shows a '1' occurs 90% of the time and 0 occurs 10%. A sequence then comes in: 0011. 'H' = 2*(-0.1*log2(0.1) - 0.9*log2(0.9) ) = 6.917. This would be the amount of information in bits conveyed when only 4 bits were sent. Two out of 4 symbols being 0 was a surprise, so it carried more information than usual. I don't know to incorporate this into a general idea of entropy. It's not Shannon's H because Shannon's H requires the sum of probabilities to be one. To force it in this case would be H = 'H' / 4 = 1.72 bits per symbol (bits/bit). So 0.72 bits/bit is the excess information the short message carried than normally expected. Backing up, I can say the H of this source with 90% 1's is 0.496 bits/bit which is less than the ideal 1 bit/bit that random bits carry from a random source. All that's interesting but I don't see a definition of entropy coming out of it. It seems to be just applying H to discover a certain quantity desired."


 * First of all, 2*(-0.1*log2(0.1) - 0.9*log2(0.9) ) = 0.937991 bits. This is NOT the amount of information contained in 4 bits, it is the amount of information contained in the specific 4 bit message 0011. The entropy of a 4-bit message is -4(p log2(p)+q log2(q)) where p=0.1 and q=1-p=0.9. That comes out to 1.87598 bits, which is the entropy of the set of four bit messages, given the above probabilities. It is the average amount of information carried by 4 bits, and clearly the information in 0011 (0.937991 bits) is much less than average (1.87598 bits). In macro/microstate terms, the macrostate is represented by any of the 2^4=16 possible microstates, 0011 being one of those microstates. The entropy of the macrostate is 1.87598 bits. To ask for the entropy of a microstate is an improper question.
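The numbers in this exchange can be recomputed with the short sketch below, which assumes the four symbols are drawn independently with P("0") = 0.1 and P("1") = 0.9. The variable names are invented here. For comparison it also evaluates -log2 P("0011") for that one string under the same independence assumption, which is the usual definition of the information content of a single outcome.

```python
import math

def entropy_bits(probs):
    """Shannon entropy in bits of a probability distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

p0, p1 = 0.1, 0.9  # P("0") and P("1") from the example

h_symbol = entropy_bits([p0, p1])  # entropy per symbol
h_two = 2 * h_symbol               # 2*(-0.1*log2(0.1) - 0.9*log2(0.9))
h_four = 4 * h_symbol              # entropy of four independent symbols

# Surprisal -log2 P("0011") of that one string, assuming independent symbols.
surprisal_0011 = -math.log2(p0 * p0 * p1 * p1)

print(f"per symbol:           {h_symbol:.6f} bits")
print(f"expression with 2*:   {h_two:.6f} bits")
print(f"four symbols:         {h_four:.5f} bits")
print(f"-log2 P('0011'):      {surprisal_0011:.4f} bits")
```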


 * "Let me give an example of why a blind and simple H can be extremely useful. Let's say there is a file that has 8 bytes in it. One moment it say AAAABBBB and the next moment it says ABCDABCD. I apply H blindly not knowing what the symbols represent. H=1 in the first case and H=2 in the second. H*N went from 8 to 16. Now someone reveals the bytes were representing microstates of 8 gas particles. I know nothing else. Not the size of the box they were in, not if the temperature had been raised, not if a partition had been lifted, and not even if these were the only possible microstates (symbols)."


 * You say that a microstate is given by AAAABBBB, but you have no knowledge of the macrostate. Only a macrostate has entropy, you have no macrostate, you cannot calculate entropy. You then presume to know the macrostate by looking at AAAABBBB and saying the macrostate is the set of all possible arrangements of 8 particles in two equally probable energy levels. A totally unwarranted assumption. Given this unwarranted assumption, you then correctly calculate the information in AAAABBBB to be 8 bits. THIS IS NOT THE ENTROPY, it is the amount of information in AAAABBBB after making an unwarranted assumption. Only if your unwarranted assumption happens to be correct will it constitute the entropy.


 * In the second case, ABCDABCD, you make another unwarranted assumption; that there are 8 particles, each with four equally probable states. Given this unwarranted assumption, you then correctly calculate the information in ABCDABCD to be 16 bits. THIS IS NOT THE ENTROPY, it is the amount of information in ABCDABCD after making yet another unwarranted assumption. Only if your unwarranted assumption happens to be correct will it constitute the entropy. If that unwarranted assumption is correct then the amount of information in AAAABBBB will also be 16 bits.


 * "But there was a physical entropy change everyone agrees upon from S1=kb*ln(2)*8 to S2=kb*ln(2)*16. So I believe entropy H*N as I've described it is as fundamental in information theory as it is in physics. Prior probabilities and such are useful but need to be defined how they are used. H on a per message basis will be the fundamental input to those other ideas, not to be brushed aside or detracted from."


 * Everyone does NOT agree on your statement of physical entropy change. You are confusing the amount of information in a message (or microstate) with the entropy of the set of all possible messages (the macrostate).


 * Again, The bottom line is that a particular message (microstate) may carry varying amounts of information. Entropy can only be defined for a macrostate which is a set of microstates whose individual probabilities are known or estimated, and add to unity. Entropy on a per message basis is nonsense, the information carried by a message is not. Prior probabilities are not just useful, they are mandatory for the calculation of entropy, whether you estimate them from a large number of individual messages (microstates), or you assume them, as is done with statistical mechanics entropy (each microstate being equally probable). PAR (talk) 05:33, 5 December 2015 (UTC)

I agree you can shorten up the H equation by entering the p's directly by theory or by experience. But you're doing the same thing as me when I calculate H for large N, but I do not make any assumption about the symbol probabilities. You and I will get the same entropy H and "extensive" entropy N*H for a SOURCE. Your N*H extensive entropy is -N*sum(p*log(p)). The online entropy calculators and I use N*H = -N*sum[ count/N*log(count/N) ] (they usually give H without the N). These are equal for large N if the source and channel do not change. "My" H can immediately detect if a source has deviated from its historical average. "My" H will fluctuate around the historical or theoretical average H for small N. You should see this method is more objective and more general than your declaration it can't be applied to a file or message without knowing prior p's. For example, let a partition be removed to allow particles in a box to get to the other side. You would immediately calculate the N*H entropy for this box from theory. "My" N*H will increase until it reaches your N*H as the particles reach maximum entropy. This is how thermodynamic entropy is calculated and measured. A message or file can have a H entropy that deviates from the expected H value of the source.
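The claim that the count-based H fluctuates for small N and approaches the model value for large N can be tried out with the simulation sketched here. The source (independent symbols, 90% "1" and 10% "0"), the sample sizes and the fixed random seed are all assumptions chosen for illustration.

```python
import math
import random
from collections import Counter

def empirical_entropy_bits(symbols):
    """H in bits per symbol from observed counts (count/N used as probability)."""
    counts = Counter(symbols)
    n = len(symbols)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def model_entropy_bits(probs):
    """H in bits per symbol from known probabilities."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

rng = random.Random(0)  # fixed seed so the run is repeatable
probs = {"1": 0.9, "0": 0.1}
true_h = model_entropy_bits(probs.values())

estimates = {}
for n in (10, 100, 10_000, 100_000):
    sample = rng.choices(list(probs), weights=list(probs.values()), k=n)
    estimates[n] = empirical_entropy_bits(sample)

print(f"model H = {true_h:.4f} bits/symbol")
for n, h in estimates.items():
    print(f"N = {n:>7}: count-based H = {h:.4f}")
```

The sketch only shows convergence for a fixed, memoryless source; it does not settle whether the small-N value should be called the entropy of the message.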

The distinct symbols A, B, C, D are distinct microstates at the lowest level. The "byte" POSITION determines WHICH particle (or microvolume if you want) has that microstate: that is the level to which this applies. The entropy of any one of them is "0" by the H function, or "meaningless" as you stated. A sequence of these "bytes" tells the EXACT state of each particle and system, not a particular microstate (because microstate does not care about the order unless it is relevant to its probability). A single MACROstate would be combinations of these distinct states. One example macrostate of this is when the gas might be in any one of these 6 distinct states: AABB, ABAB, BBAA, BABA, ABBA, or BAAB. You can "go to a higher level" than using A and B as microstates, and claim AA, BB, AB, and BA are individual microstates with certain probabilities. But the H*N entropy will come out the same. There was not an error in my AAAABBBB example and I did not make an assumption. It was observed data that "just happened" to be equally likely probabilities (so that my math was simple). I just blindly calculated the standard N*H entropy, and showed how it gives the same result physics gets when a partition is removed and the macrostate H*N entropy went from 8 to 16 as the volume of the box doubled. The normal S increases S2-S1=kb*N*ln(2) as it always should when a mid-box partition is removed.

I can derive your entropy from the way the online calculators and I use Shannon's entropy, but you can't go the opposite way.

Now is the time to think carefully, check the math, and realize I am correct. There are a lot of problems in the article because it does not distinguish between intensive Shannon entropy H in bits/symbol and extensive entropy N*H in bits (or "shannons", to be more precise, to distinguish it from the "bit" count of a file which may not have 1's and 0's of equal probability).

BTW, the entropy of an ideal gas is S~N*log2(u*v) where u and v are internal energy and volume per particle. u*v gives the number of microstates per particle. Quantum mechanics determines that u can take on a very large number of values and v is the requirement that the particles are not occupying the same spot, roughly 1000 different places per particle at standard conditions. The energy levels will have different probabilities. By merely observing large N and counting, H will automatically include the probabilities.

In summary, there are only 3 simple equations I am saying. They precisely lay the foundation of all further information entropy considerations. These equations should replace 70% of the existing article. These are not new equations, but defining them and how to use them is hard to come across since there is so much clutter and confusion regarding entropy as a result of people not understanding these statements.

1) Shannon's entropy is "intensive" bits/symbol = H = - sum[ count/N*log2(count/N) ] =   sum [count/N*log2(N/count) ] where N is the length of a message and count is for each distinct symbol.

2) Absolute ("extensive") information entropy is in units of bits or shannons = N*H.

3) S = kb*ln(2)*N*H where each N has a distinct microstate which is represented by a symbol. H is calculated directly from these symbols for all N. This works from the macro down to the quantum level.

In homogeneous solids, 3) is not formally correct on a "per atom" or "per molecule" basis because phonons are interacting across the "lattice". The "symbols" would be interacting if they are on the scale of atoms, and are therefore not a random variable with a probability distribution that H can use. Each N needs to be on a larger scale like per kg, per mole, or per cm^3. However, once that is obtained, a simple division by molecules/kg or whatever will allow a false "per molecule" probability distribution and H that would be OK to use as long as it is not used below the bulk level. N can be each molecule in a gas. They bump, but that is not an interaction that messes up the probability distribution.

Ywaz (talk) 17:12, 5 December 2015 (UTC)
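
For anyone who wants to check the arithmetic, here is a minimal Python sketch of equations 1) to 3) as stated above. It only illustrates the counting approach described in this comment, where count/N stands in for the probabilities; it does not settle whether that substitution is justified for short messages. The function names are made up for this illustration.

```python
import math
from collections import Counter

K_B = 1.380649e-23  # Boltzmann constant in J/K

def empirical_entropy(message):
    """Equation 1: H in bits/symbol, using count/N in place of the probabilities."""
    n = len(message)
    counts = Counter(message)
    return sum((c / n) * math.log2(n / c) for c in counts.values())

def extensive_entropy(message):
    """Equation 2: N*H in bits for the same message."""
    return len(message) * empirical_entropy(message)

def thermo_entropy(message):
    """Equation 3: S = kb*ln(2)*N*H in J/K, treating each symbol as one particle's state."""
    return K_B * math.log(2) * extensive_entropy(message)

print(empirical_entropy("AABB"))   # two symbols, equal counts
print(extensive_entropy("AABB"))   # N*H with N = 4
print(empirical_entropy("AAAA"))   # one distinct symbol only
print(thermo_entropy("AABB"))      # same quantity converted to J/K
```

With equal counts of two symbols the per-symbol value is 1 bit, and a message made of a single repeated symbol gives 0, which matches the "H for 1 symbol" remark in this thread.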


 * Ywaz - You say:


 * "But you're doing the same thing as me when I calculate H for large N, but I do not make any assumption about the symbol probabilities."


 * You DO make an assumption - that the frequencies of a letter in a particular message represent the probabilities of all messages. Totally unwarranted.


 * "A message or file can have a H entropy that deviates from the expected H value of the source."


 * Again, a message does not have entropy, it carries a certain amount of information. The amount of information may deviate from the entropy (average amount of information carried by a message), but that amount of information is not called entropy.


 * "There are a lot of problems in the article because it does not distinguish between intensive Shannon entropy H in bits/symbol and extensive entropy N*H in bits (or 'shannons', to be more precise, to distinguish it from the 'bit' count of a file which may not have 1's and 0's of equal probability)."


 * Shannon's "bit" is not a 0 or a 1. It is a measure of information. It is a generalization of the old concept of a bit. A single message carries information such that if the probabilities of each of the characters in the message are equal, the amount of information it carries is the length of the message. Shannon generalized the concept of a bit to include cases where the probabilities were not equal. A previous discussion here has determined that the word "bit" as a measure of information, rather than the "Shannon", is to be preferred, because that is the most common convention in the literature. As long as we specify units as bits (extensive) or "bits/symbol" there will be no confusion. Can you point out an example in the article where there is confusion by failing to make this distinction?


 * "1) Shannon's entropy is 'intensive' bits/symbol = H = - sum[ count/N*log2(count/N) ] where N is the length of a message and count is for each distinct symbol."


 * NO - "count" is not the count found in a single message, it is the count averaged over an infinite number of messages. In other words it is the expected value of the count.


 * 2) Absolute ("extensive") information entropy is in units of bits or shannons = N*H.


 * Fine, no problem.


 * "3) S = kb*ln(2)*N*H where each N has a distinct microstate which is represented by a symbol. H is calculated directly from these symbols for all N. This works from the macro down to the quantum level."


 * It's not clear what you are saying. If there are N microstates (or messages) in the macrostate, how is H "calculated directly from these symbols"? You speak of an "entropy" for each microstate, how do you arrive at the total intensive H ? PAR (talk) 06:30, 6 December 2015 (UTC)

As N gets larger, the message more closely exhibits the properties of all future messages from the source, unless the source changes its identity. Changing its identity means it stops acting like it did in the past. You're saying "NO" to this "count" method, but it is the method used by many others and is compatible with Shannon's text. Shannon even DEFINES entropy H for large N. He says "....a long message of N symbols. It will contain with high probability piN occurrences [of each symbol]". For you to say getting pi's from large N is "unwarranted" is so, so wrong. It is the opposite of the truth. There is no other way to get pi's except by theory without observation which is called metaphysics or philosophy instead of science. Entropy is an observation and a measurement. No one can observe an infinite number of symbols from any source. People were measuring entropy in thermodynamics many decades before they had any theory to explain where it came from and had no idea about the p's or the quantum theory from which they came. They were even using it before the molecular theory of gases.

"Information" is the N*H (where I've defined H for all messages of all lengths) if no other definition is provided. This is pretty much standard. You even used this definition of information in the subsequent paragraph. Or are you saying no one can measure the information content of a file without first knowing the p's based on a detailed knowledge of the source that generated the file? "Bit" in the bit/symbol unit is correct, but it is not precise in terms of the measure of information because "bit" is normally simply a count of the "bits". Calling the information content unit "shannons" may not be formally needed, but it is precise. It describes how the bit count was adjusted and is the "minimal bits needed to encode this message".

Using "intensive" and "extensive" information entropy may be new. So if it is used, it must be in quotes, with a reference to the thermo article on them to indicate why they are in quotes. But something must be said explicitly to show clearly to a wider audience that H and N*H are both called "entropy" but they have an important difference. Most people can't see the connection to physical entropy because they are looking at H instead of N*H. The other problem is that they do not know kb is merely a units conversion factor from kinetic energy per molecule as measured by Kelvins to joules per molecule (Joules/joules, unitless). In this way physical entropy really is just a statistical measure exactly like Shannon entropy, and they are not even separate concepts in systems operating at the Landauer limit.

The problem throughout the article is that information entropy is not defined, so every time the word "entropy" is used, it is not clear if it is N*H entropy or H entropy. And half the time the wording is false. Example: "Shannon's entropy [H, by definition] measures the information contained in a message". No. N*H is the information content, not H. Finally, half way down, it warns "Often it is only clear from context which one is meant" and even complains about Shannon "confusing the matter" in a way that shows the writer does not know H itself is REQUIRED to be in bits/symbol: "Shannon himself used the term in this way". Shannon's entropy is not a term. It's an equation. So the article should not say "Shannon entropy" so many times when it is really referring to N*H.

Where did I speak of entropy of a microstate? I think I said Macrostate. 1 microstate = 1 symbol. H for 1 symbol = 1*log2(1) = 0. This is the "ideal" lowest entropy state, zero information.

Ywaz (talk) 13:18, 6 December 2015 (UTC)

Ywaz: you say:


 * "As N gets larger, the message more closely exhibits the properties of all future messages from the source, unless the source changes its identity. Changing its identity means it stops acting like it did in the past. You're saying 'NO' to this 'count' method, but it is the method used by many others and is compatible with Shannon's text. Shannon even DEFINES entropy H for large N. He says '....a long message of N symbols. It will contain with high probability piN occurrences [of each symbol]'. For you to say getting pi's from large N is 'unwarranted' is so, so wrong. It is the opposite of the truth. There is no other way to get pi's except by theory without observation which is called metaphysics or philosophy instead of science. Entropy is an observation and a measurement. No one can observe an infinite number of symbols from any source. People were measuring entropy in thermodynamics many decades before they had any theory to explain where it came from and had no idea about the p's or the quantum theory from which they came. They were even using it before the molecular theory of gases."


 * There is a set of probabilities for each symbol in the alphabet (pi for the i-th symbol). You call these pi's "metaphysical" or whatever, but they are not. Yes, their values are something we can estimate by looking at a long message, the longer the message, the more precise the estimation. We can express this by saying that "the error in the estimate of the pi's tends to zero as N approaches infinity". I do not say getting pi's from large N messages is unwarranted. It's a good way, but not the only way. I am saying that getting them from SMALL N messages is unwarranted. Saying that the entropy of AB is 2 bits is not correct. Only when you know the probability of message "AB" can you calculate the information content of "AB" and you cannot get that probability from the message "AB". You cannot say that a message "AB" shows that there are two symbols, each with 50% probability. Therefore you cannot calculate entropy of anything, because you don't have the pi's. Furthermore, in the case of statistical mechanics, we do not look at microstates and calculate their probabilities by counting. Rather we make the assumption that their probabilities are all equal and then we see that this assumption "gives the right answer" when calculating the entropy. This is not a counting method, but it is not "metaphysical" or "philosophical", it is science.


 * "'Information' is the N*H (where I've defined H for all messages of all lengths) if no other definition is provided. This is pretty much standard. You even used this definition of information in the subsequent paragraph."


 * This is not standard, this is not the definition of information. The information content of message "AB" is I=-log2(pab) where pab is the probability of the message "AB". Looking at that message "AB" and saying pab=(1/2)(1/2)=1/4 and therefore the entropy is 2 bits is totally unwarranted. You have to get pab from somewhere else, either by estimating them from a long message or many short messages, or by knowing how the messages were generated. The entropy of a 2-symbol message is then the average information in a 2-symbol message.


 * "Or are you saying no one can measure the information content of a file without first knowing the p's based on a detailed knowledge of the source that generated the file?"


 * I am saying you need the pi's in order to calculate the information content of a file. If the file is large, then yes, you may estimate the pi's from the symbol frequencies (assuming they are independent). If the file is small, you may not, and then you have to determine the pi's in some other way. If you have some outside information, fine, use it. In statistical mechanics, we have huge "files" (microstates) and a huge number of microstates, but we cannot count the frequencies. However, we make the assumption that the probabilities of the microstates are equal and we get results that match reality. So we say we know the pi's by yet another method.


 * Don't rely on online entropy calculators for your definition of entropy. For example, the first part of [] correctly calculates the entropy per symbol of a message GIVEN THE PROBABILITIES OF A SYMBOL BEFOREHAND. Good. The second part is crap - it presumes to know the probabilities from the frequencies in a short message, the same mistake you make, and it is generally wrong. [] is likewise crap.


 * "Using 'intensive' and 'extensive' information entropy may be new. So if it is used, it must be in quotes, with a reference to the thermo article on them to indicate why they are in quotes. But something must be said explicitly to show clearly to a wider audience that H and N*H are both called 'entropy' but they have an important difference. Most people can't see the connection to physical entropy because they are looking at H instead of N*H. The other problem is that they do not know kb is merely a units conversion factor from kinetic energy per molecule as measured by Kelvins to joules per molecule (Joules/joules, unitless). In this way physical entropy really is just a statistical measure exactly like Shannon entropy, and they are not even separate concepts in systems operating at the Landauer limit."


 * If there are confusions between bits and bits/symbol in the article, then I agree, that needs to be made clear.

"Where did I speak of entropy of a microstate? I think I said Macrostate. 1 microstate = 1 symbol. H for 1 symbol = 1*log2(1) = 0. This is the 'ideal' lowest entropy state, zero information."


 * No - a microstate corresponds to a particular message, e.g. "A5C". A macrostate is a set of microstates, for example, the set of all 3-symbol messages. Entropy applies to a macrostate, not a microstate. "A5C" carries some information equal to -log2(pA5C) where pA5C is the probability of message "A5C". The entropy of the set of 3-symbol messages is the weighted average of the amount of information in a 3-symbol message, averaged over all possible 3-symbol messages. Their probabilities will sum to unity. You cannot count on the short message "A5C" to give you the weights for the average, nor can you count on it to give you the full alphabet of 3-symbol messages. PAR (talk) 17:17, 6 December 2015 (UTC)

PAR writes:
 * You cannot say that a message "AB" shows that there are two symbols, each with 50% probability.

When I come across a file whose contents are "AB" and I know nothing about what generated the file or anything about its past or future, I have no choice but to view the data as the complete life of the source. So the online entropy calculators and I apply the statistical measure H to it.


 * In the case of statistical mechanics, we do not look at microstates and calculate their probabilities by counting.

Look up Einstein solid and Debye model. They count oscillators and phonons.


 * This is not standard, this is not the definition of information. The information content of message "AB" is I=-log2(pab) where pab is the probability of the message "AB".

If a source sends me ABBABBAABAAB....random stuff for a while, I will apply H to any large portion of it and see H=1. It then sends me "AB" or "BB" and you ask me the information content of it. I say H*N=2, which is the same answer you get from the equation above because you had also observed p(AB)=1/4 in the long sequence. So again you see, my "crappy" 1) and 2) can derive what you consider correct.


 * No - a microstate corresponds to a particular message, e.g. "A5C". A macrostate is a set of microstates, for example, the set of all 3-symbol messages. Entropy applies to a macrostate, not a microstate. "A5C" carries some information equal to -log2(pA5C) where pA5C is the probability of message "A5C". The entropy of the set of 3-symbol messages is the weighted average of the amount of information in a 3-symbol message, averaged over all possible 3-symbol messages. Their probabilities will sum to unity. You cannot count on the short message "A5C" to give you the weights for the average, nor can you count on it to give you the full alphabet of 3-symbol messages.

If your implied 36 alphanumerics are equally probable, causing your 3-symbol microstates to be equally probable, then by treating each alphanumeric as a microstate, my entropy calculation for your Macrostate is N*H = 3*log2(36). This is the same entropy you will calculate by -log2(36^-3). There is a tradeoff between the two methods. My microstate lookup table for the probabilities has 36 entries but yours has 36^3. So you have to carry a larger memory bank. But you will recognize any higher level patterns. This is the basis of most compression schemes. For example, if A5CA5CA5C.... was all the source ever sent, I would happily calculate H=log2(3)=1.585 and 3*H as the entropy for the macrostate. You would calculate H=0 and entropy 1*H=0 (your N = 3 * my N). The higher level microstates can show more "intelligence" (which requires the ability to compress data).

So I do not disagree with your example, but I wanted to show you that by remembering 36^3 probabilities in your lookup table for each microstate, you are actually assigning a SYMBOL (a row number in the lookup table) to each microstate. Again, I can derive your methods from mine.

My 1) above should be simplified to H= - sum( log2(count/N) ) because the other count/N sums to 1. Ywaz (talk) 00:53, 7 December 2015 (UTC)
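
As a quick numerical check of the 36-alphanumeric comparison in this comment, the following Python sketch computes both routes. It assumes, as the comment does, that the 36 symbols are equally likely and independent; the constant names and the `empirical_entropy` helper are illustrative only. It also reproduces the A5CA5CA5C case, once with single characters as the symbols and once with 3-character blocks as the symbols.

```python
import math
from collections import Counter

def empirical_entropy(symbols):
    """H in bits/symbol, using count/N of the observed symbols as probabilities."""
    n = len(symbols)
    return sum((c / n) * math.log2(n / c) for c in Counter(symbols).values())

# Assumption taken from the discussion: 36 equally likely, independent alphanumerics.
ALPHABET_SIZE = 36
MESSAGE_LENGTH = 3

# Route 1: N*H with each character treated as a symbol.
per_character_route = MESSAGE_LENGTH * math.log2(ALPHABET_SIZE)
# Route 2: -log2 of the probability of one whole 3-character message.
per_message_route = -math.log2((1 / ALPHABET_SIZE) ** MESSAGE_LENGTH)
print(per_character_route, per_message_route)

# A source that only ever repeats "A5C".
stream = "A5C" * 100
h_characters = empirical_entropy(stream)                      # characters as symbols
blocks = [stream[i:i + 3] for i in range(0, len(stream), 3)]
h_blocks = empirical_entropy(blocks)                          # 3-character blocks as symbols
print(h_characters, h_blocks)
```

The two routes agree only because of the equal-probability and independence assumption. The second half shows how the choice of symbol size changes the result for the same data.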


 * Ywaz: You write:


 * "When I come across a file whose contents are 'AB' and I know nothing about what generated the file or anything about its past or future, I have no choice but to view the data as the complete life of the source. So the online entropy calculators and I apply the statistical measure H to it."


 * Ok, I see why you call a single symbol a microstate. To my mind you are calling each symbol a microstate, and the entire file (say the file consists of "A5C"), a collection of symbols, the macrostate. You then calculate the frequencies (i.e. probabilities) of the microstates, make the assumption that they are independent, and with these pi's, you calculate entropy. So what do you do with this entropy. Of what use is it?


 * "Look up Einstein solid and Debye model. They count oscillators and phonons."


 * Ok, a semantic problem. They count them conceptually, not experimentally, I thought you were insisting they be counted experimentally. No problem.


 * "If a source sends me ABBABBAABAAB....random stuff for a while, I will apply H to any large portion of it and see H=1. It then sends me 'AB' or 'BB' and you ask me the information content of it. I say H*N=2, which is the same answer you get from the equation above because you had also observed p(AB)=1/4 in the long sequence. So again you see, my 'crappy' 1) and 2) can derive what you consider correct."


 * Ok, you have used ABBABBAABAAB... to estimate prior pi's before applying them to "AB". But if the source sent you "AAAAAAAAAAB", your H would not give the correct answer. That's my point. Without that prior knowledge of the source, the previously estimated pi's, you cannot calculate the information content of "AB".


 * "If your implied 36 alphanumerics are equally probable, causing your 3-symbol microstates to be equally probable, then by treating each alphanumeric as a microstate, my entropy calculation for your Macrostate is N*H = 3*log2(36). This is the same entropy you will calculate by -log2(36^3)."


 * Yes, yes! If my 36 alphanumerics are equally probable (and independent!), you are right. But if they are not, you are wrong. And that's my point. My point is that you have no reason, no justification, to assume that they are equally probable if all you have is "A5C". Noting that A, 5, and C are equally likely in A5C does not justify this assumption.


 * "Again, I can derive your methods from mine."


 * No, you cannot. Not without making some assumptions, basically pulling them out of thin air. You have to make the assumption that the symbol frequencies (probabilities) in the short message correspond to the frequencies of a long message, and that they are independent (i.e. the probability of finding "AB" in a message is pA pB). You can say that, without the long message, you have nothing else to go on, but what, then, is the value of your calculation? PAR (talk) 03:46, 7 December 2015 (UTC)
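
The disagreement above about short versus long messages can be illustrated with a small simulation. This is only a sketch under stated assumptions: the source is hypothetical, taken to be independent and identically distributed with pA=11/12 and pB=1/12, and the helper names are invented for this example. It compares the entropy computed from the known probabilities with the value obtained by plugging in symbol frequencies from samples of different lengths.

```python
import math
import random
from collections import Counter

def entropy_bits(probs):
    """Shannon entropy in bits/symbol of a known probability distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def empirical_entropy(message):
    """Plug-in estimate: symbol frequencies in the message stand in for the pi's."""
    n = len(message)
    return sum((c / n) * math.log2(n / c) for c in Counter(message).values())

# Hypothetical source, assumed i.i.d. with these symbol probabilities.
SOURCE = {"A": 11 / 12, "B": 1 / 12}

def estimate_from_sample(n, seed=0):
    """Draw n symbols from SOURCE with a fixed seed and return the plug-in estimate."""
    rng = random.Random(seed)
    symbols = list(SOURCE)
    weights = [SOURCE[s] for s in symbols]
    return empirical_entropy(rng.choices(symbols, weights=weights, k=n))

true_h = entropy_bits(SOURCE.values())
print(f"entropy from the known pi's: {true_h:.4f} bits/symbol")
print(f"plug-in value for the message 'AB': {empirical_entropy('AB'):.4f} bits/symbol")
for n in (12, 100, 10_000, 100_000):
    print(n, f"{estimate_from_sample(n):.4f}")
```

For this assumed source the two-symbol message "AB" on its own gives 1 bit/symbol, well above the value computed from the known probabilities, while the estimate from a long sample lands close to it.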

Continuing in new heading below. Ywaz (talk) 06:56, 7 December 2015 (UTC)

Applying H to short messages
This is a continuation of the above to determine if the online entropy calculators are correct in how they use Shannon's H to calculate entropy for short messages.

PAR wrote:
 * ..... make the assumption that they are independent, and with these pi's, you calculate entropy. So what do you do with this entropy. Of what use is it?

That's how molar and specific entropies are calculated. They are useful when they are independent because S=N*So or S=N*H in information theory. If I calculate H on a short file and it is < 1, then I know I can compress it. If I have 13 balls with 1 either over or under weight, how do I use a balance in 3 weighings to determine which ball is over or underweight? I look for solutions that give me the highest N*H which represents 3 weighings*log(3 symbols) where symbols represent which position the balance is in when weighing (Left tilt, Right tilt, no tilt). These are the microstates. Highest number for macrostate gives highest entropy. So I want weighing methods that have an equal probability of triggering 1 of the 3 states of the balance, independently. I do not weigh 2 balls on 1st try because it is unlikely to tilt. 6 balls on each side may or may not tilt, and even keeps the chance of no tilt open, but that last symbol is not likely to occur, so 6 on each side may not be the right starting point. In any event, I am looking to get my highest N*H value on a very short message.


 * They count them conceptually, not experimentally, I thought you were insisting they be counted experimentally.

The phonons in the Debye model are real, so indirectly (by solid theory based on experiment) they are by experiment. But Carnot and Clausius did not need these theories and got the same results in bulk because in bulk there are microstates that are independent even though you have no clue about their source.


 * Ok, you have used ABBABBAABAAB... to estimate prior pi's before applying them to "AB". But if the source sent you "AAAAAAAAAAB", your H would not give the correct answer.

OK, yeah, my example was not right and just accidentally came out right. You want the information content of AB given that the source normally says AAAAAAAAAAB. You said you would calculate this 2 symbol information content as -log2(p(AB)). I get p(AB) = 1/12 by looking at the sequence. Is this information measure related to entropy?


 * Yes, yes! If my 36 alphanumerics are equally probable (and independent!), you are right. But if they are not, you are wrong.

I have been talking about mutually-independent symbols, even if they are NOT mutually independent, because you can never know to what EXTENT they are not, unless it is physical entropy with a solid theory or experimental history describing the source. Physics is already the best set of compression algorithms for the data we observe (Occam's razor). But in information theory for stuff running around on our computers, no matter what we think we know about the source, there can always be a more efficient way to assign the symbols for the microstates as more data comes in (a better compression algorithm). What algorithm are you going to use to determine the most efficient number of symbols to call your microstates? In practice, something can be tried then N*H is effectively checked to see if it got smaller. If it did, then you found a better definition of the microstates. In physics, it would constitute a new and improved law.


 * You have to make the assumption that the symbol frequencies (probabilities) in the short message correspond to the frequencies of a long message,

No, no, I clarified earlier that the short messages were to be taken "as is" without any assumption about there being a source that could possibly generate anything different, especially since there usually is no other observable data to go on than the short message itself. Let's say I send you a GIF image. How are you going to calculate its "information" or entropy content?


 * what, then, is the value of your calculation?

What are you calling "my" calculation? Applying H to a short message? Who decides which messages are short? Who decides if a file on my computer is or is not the life-sum total of some "source" that is never to be seen again? Ywaz (talk) 06:56, 7 December 2015 (UTC)
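
The first-weighing reasoning in the 13-ball example can be put into numbers with a short Python sketch. It assumes that each ball is equally likely to be the odd one and that the odd ball is equally likely to be heavy or light; these assumptions and the function names are illustrative additions. The sketch covers only the first weighing with k balls in each pan, not a full three-weighing strategy.

```python
import math

def entropy_bits(probs):
    """Shannon entropy in bits of a probability distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def first_weighing_entropy(k, balls=13):
    """Entropy of the balance outcome when k balls are placed in each pan.

    Assumes one odd ball, each ball equally likely to be it, heavy or light
    with probability 1/2 each.
    """
    p_left_down = k / balls        # odd ball heavy on the left, or light on the right
    p_right_down = k / balls       # odd ball heavy on the right, or light on the left
    p_balanced = (balls - 2 * k) / balls   # odd ball is not on the balance
    return entropy_bits([p_left_down, p_right_down, p_balanced])

for k in range(1, 7):
    print(k, round(first_weighing_entropy(k), 4))

best_k = max(range(1, 7), key=first_weighing_entropy)
print("k with the most informative first weighing:", best_k)
print("upper bound per weighing:", round(math.log2(3), 4))
```

Under these assumptions the three outcomes are closest to equally likely with 4 balls per pan, and 6 per pan scores lower because the balanced outcome becomes unlikely, in line with the remark above.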


 * Ywaz - you wrote:


 * "If I calculate H on a short file and it is < 1, then I know I can compress it."


 * I know what you are saying, but it sounds like you think this compression method has some special significance, and I don't want to lose track of the fact that it doesn't. If I have 8 GIF files of 1Mb each, I can number them 1 thru 8 and compress them as follows: 1,2,3,4,5,6,7,8. That's fine if all I ever deal with is duplicates of one of those 8 files. That's an assumption I have made. If I make another assumption, I get another compression method. If I make your assumption, that all the pixels are equally likely and independent, I get your compression method. My point is that an assumption has to be made, which effectively declares the pi's beforehand, and you cannot completely justify any assumption by just looking at a finite set of data. You can come close with large amounts of data, but for smaller amounts, you lose more and more justification. By saying "I can compress 'AAB' because H=(-2/3)log2(2/3)+(-1/3)log2(1/3)=0.918 is less than unity", you are implicitly assuming that "AAB" is not the only file you expect to deal with. If it were the only one, you could just compress it to "1", H=0, and be done with it. You are implicitly assuming that the files you expect to deal with have a 2-symbol alphabet, each of which is equally likely and independent. That assumption, if true, is what makes your compression method useful. If it's not true, then why bother?


 * "If I have 13 balls with 1 either over or under weight, how do I use a balance in 3 weighings to determine which ball is over or underweight? I look for solutions that give me the highest N*H which represents 3 weighings*log(3 symbols) where symbols represent which position the balance is in when weighing (Left tilt, Right tilt, no tilt). These are the microstates. Highest number for macrostate gives highest entropy. So I want weigh methods that have an equal probability of triggering 1 of the 3 states of the balance, independently. I do not weigh 2 balls on 1st try because it is unlikely to tilt. 6 balls on each side may or may not tilt, and even keeps the chance of no tilt open, but that last symbol is not likely to occur, so 6 on each side may not be the right starting point. In any event, I am looking to get my highest N*H value on a very short message."


 * Ha. That's an interesting thought experiment. Of course we ASSUME independence: the weight of a ball is not affected by the weight or proximity of any other ball. This is "outside information" - prior constraints on our pi's, which is not justified by any data we may gather by weighing them. That's part of my point. We assume that the measurements are not independent and not order dependent. If our first measurement tilts right, and our second measurement is of the same balls in the same cups, it will certainly tilt right. More outside information constraining our pi's, not justified by any measurement we make. Another part of my point. The probability that the odd ball is heavy is 1/2, equal to the probability that the odd ball is light. More outside information, not justified by any measurement. The probability that any given ball is odd is 1/13, same as any other ball. Again, more outside information.


 * The balls are labelled (A,B,C...M). Our microstates are every possible set of three measurements, where a measurement is done by putting n distinct balls in each cup, n=1 thru 6. You can't put a ball in both cups, or two of the same ball in one cup. More outside information. For example, { {ABCD}{DEFG}, {AGE,MBE}, {F,G} } is a microstate. Macrostates are triplets of the three symbols R, L, and E. Each microstate yields a macrostate, but not vice versa. Now, and only now, can we calculate our pi's, none of which are justified by the data. We search for the microstate(s) that give maximum information, or, actually, any microstate with an information content less than or equal to three bits. If one or more exist, they are our solutions. If none have an information content less than or equal to two bits, then three measurements is the best we can do.


 * Note that the above "outside information" has been implicitly incorporated into your method. My point is that your method, by making these implicit assumptions, is not in any way fundamental. If we come across a situation in which your assumptions are invalid, your method is wrong. Furthermore, the data (the measured microstates) do not fully support your assumptions. The assumptions are "a priori" knowledge that you bring to the problem.


 * "OK, yeah, my example was not right and just accidentally came out right. You want the information content of AB given that the source normally says AAAAAAAAAAB. You said you would calculate this 2 symbol information content as -log2(p(AB)). I get p(AB) = 1/12 by looking at the sequence. Is this information measure related to entropy?"


 * If you ASSUME that AAAAAAAAAAB represents exactly the full alphabet (A,B) and their respective frequencies in any message (not necessarily true), then pA=11/12 and pB=1/12. If the symbol occurrences are ASSUMED to be independent, then the probability of AB is pAB = pA pB = 11/144, and the information content of AB is -log2(pAB)=3.710... bits. The entropy of the set of two-symbol messages is the weighted average of the information in all two-symbol messages: pAA=121/144, pAB=pBA=11/144, pBB=1/144, (extensive) Entropy = H = -[pAA log2(pAA)+etc.] = 0.8276... bits.


 * I have been talking about mutually-independent symbols, even if they are NOT mutually independent, because you can never know to what EXTENT they are not, unless it is physical entropy with a solid theory or experimental history describing the source.


 * Or outside knowledge you bring to the problem, like the 13-ball problem, in which any two measurements are assumed (or realized to be) not independent. (make a measurement using the same balls but switching cups and if your first measurement is R, then your second will certainly be L).


 * "Physics is already the best set of compression algorithms for the data we observe (Occam's razor). But in information theory for stuff running around on our computers, no matter what we think we know about the source, there can always be a more efficient way to assign the symbols for the microstates as more data comes in (a better compression algorithm). What algorithm are you going to use to determine the most efficient number of symbols to call your microstates? In practice, something can be tried then N*H is effectively checked to see if it got smaller. If it did, then you found a better definition of the microstates. In physics, it would constitute a new and improved law."


 * Yes.


 * "Let's say I send you a GIF image. How are you going to calculate its 'information' or entropy content?"


 * The information content of a GIF image will be -log2(pGIF) where pGIF is the probability of that GIF image occurring. My first impulse would be to ASSUME equal probability for each pixel value, so if there are M possible pixel values and N pixels in the image, p=1/M, pGIF=(1/M)^N and extensive H = - N log2(1/M). Then I might think that most GIF images are not random noise, but areas of constant color, so given that one pixel is green, the pixel just after is more than randomly likely to be green. That changes my estimate of pGIF.


 * For example, if I look at a large number of GIF files, and note that GG occurs with probability p(G,G) which is significantly larger than p^2=(1/M)^2 that I would calculate if the pixel frequencies were independent, and this holds true for any color of pixel, then if I have a 3-pixel image BRG, I would say that the probability of such an image is pGIF = the probability of B given no previous pixel times probability of R given previous was B times probability of G given previous was R. I expect information content that I calculate will be smaller and my compression will be more efficient, as long as the GIFs that I used to estimate these probabilities are rather representative of all GIFs. If I consider my "macrostate" to be all GIF images of the same size, I expect my entropy will be smaller than that which I calculate assuming all pixels are independent.


 * As always, I have to pre-estimate pGIF, either by assuming p's are independent and equal to 1/M or looking at some other data set and using conditional probabilities.


 * "What are you calling 'my' calculation? Applying H to a short message? Who decides which messages are short? Who decides if a file on my computer is or is not the life-sum total of some 'source' that is never to be seen again?"


 * If you are making the compression algorithm, you do. If you don't know, then fall back on the p=1/M independent pixel idea, but if you have some outside information, use it to improve the algorithm. PAR (talk) 21:41, 7 December 2015 (UTC)
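The two estimates of pGIF described above (independent, equally likely pixels versus probabilities conditioned on the previous pixel) can be compared with a minimal sketch. The three-colour palette, the 0.8/0.1 transition probabilities, and the function names are illustrative assumptions, not measurements from real GIF files.

```python
import math

def info_independent(pixels, num_colors):
    """Information (bits) assuming independent, equally likely pixels: -N log2(1/M)."""
    return len(pixels) * math.log2(num_colors)

def info_markov(pixels, p_first, p_next):
    """Information (bits) when p(image) = p(first pixel) * product of p(pixel | previous pixel)."""
    p = p_first[pixels[0]]
    for prev, cur in zip(pixels, pixels[1:]):
        p *= p_next[prev][cur]
    return -math.log2(p)

# Hypothetical model: a pixel repeats the previous colour with probability 0.8,
# and switches to each of the two other colours with probability 0.1.
colors = "RGB"
p_first = {c: 1 / 3 for c in colors}
p_next = {a: {b: (0.8 if a == b else 0.1) for b in colors} for a in colors}

flat = "GGG"   # an area of constant colour
mixed = "BRG"  # every pixel differs from its neighbour

bits_flat_independent = info_independent(flat, len(colors))
bits_flat_markov = info_markov(flat, p_first, p_next)
bits_mixed_markov = info_markov(mixed, p_first, p_next)
```

Under these assumed probabilities the constant-colour image carries fewer bits than the independent-pixel estimate, while the noise-like image carries more; the saving only materialises if the training images are representative, as noted above.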

PAR writes:
 * If you ASSUME that AAAAAAAAAAB represents exactly the full alphabet (A,B) and their respective frequencies in any message (not necessarily true), then pA=11/12 and pB=1/12. If the symbol occurrences are ASSUMED to be independent, then the probability of AB is pAB = pA pB = 11/144, and the information content of AB is -log2(pAB)=3.710... bits. The entropy of the set of two-symbol messages is the weighted average of the information in all two-symbol messages: pAA=121/144, pAB=pBA=11/144, pBB=1/144, (extensive) Entropy = H = -[pAA log2(pAA)+etc.] = 0.8276... bits.

I count 10 A's where I think you've counted 11, so I'll use 11. You created a new symbol set by using pAA etc... in order to calculate an H and you claim this is in "bits". But it is really 0.8276 bits/symbol where your new symbol set has 1 symbol like "AA" where the old set had 2 symbols (the same "AA"). H uses the p's of single symbols, not a sequence of symbols like you've done, unless you allow that it is a new symbol set. If you had stuck with the old set and calculated H bits/symbol and then multiplied by 2 like I've been saying, you would have gotten the same result, 0.8276. We're still getting the same results and I'm still using shorter math equations. I agree 3.7 is the entropy content of pAB in bits, so you've given an exact measure of how much more information it carries than the expected.

I do not see that you have a different measure of entropy than the 1) and 2) I described above. I maintain that it can be (and very often is) applied to short messages without any problem, with the understanding it has been divorced from any possible source and is pretty ignorant about compressibility and completely ignorant about interdependencies. These are NP hard tasks, and since there is no ultimate or clear universally accepted benchmark for measuring compressibility or symbol interdependences, especially across the board on all data, a blind statistic is a great starting point. Indeed, it has always been the de facto starting point and reference. I maintain 1) and 2) are statistical measures like an average and can be used just as blindly without worries. Ywaz (talk) 00:10, 8 December 2015 (UTC)


 * Yes, sorry about that, I meant 11.


 * I think we understand each other, and we both come up with the same answers. Our disagreement is about who is coming up with new symbols. We disagree on what is fundamental. I am saying that it is fundamental that each microstate (or message) carries information and has a certain probability of occurrence and the entropy is the expected value of the information per microstate averaged over all microstates (i.e. averaged over the macrostate). This makes no assumptions about the nature of the microstates, whether they are a string of independent symbols, a string of symbols conditionally dependent on each other, or the energy levels of atoms or molecules, or whatever. You say that if a microstate can be broken up into independent pieces (e.g. individual symbols with independent probabilities pA and pB), those pieces are fundamental; if not, then a microstate is represented by one two-letter symbol with probabilities (e.g. pAA, pAB, pBA, pBB). I'm OK for us to agree to disagree, since I don't think we will disagree on our solutions to problems. Or maybe I did not summarize your position correctly?


 * More explicitly, if we are dealing with 2-symbol messages with an alphabet of 2 symbols, then there are 4 microstates, AA, AB, BA, BB. The (extensive) entropy, let's call it He, is defined only in terms of the probability of those microstates (pAA, pAB, pBA, pBB). Then He = -[pAA log2(pAA) + pAB log2(pAB)+etc.]. It is only when you ASSUME that they are independent (i.e. there exist pA and pB such that pAA=pA pA, pAB=pA pB, pBA=pB pA and pBB=pB pB) that you can then say


 * He = -[pA pA log2(pA pA)+pA pB log2(pA pB) +pB pA log2(pB pA)+pB pB log2(pB pB)] = -2 ( pA log2(pA) + pB log2(pB) )


 * which is, as you say, twice the intensive entropy you calculate. I say I have not introduced new symbols, but rather it is you who have introduced new symbols pA and pB by assuming that they exist such that pAA=pA pA, pAB=pA pB, etc. But what if pAA=0.1, pAB=0.1, pBA=0.1 and pBB=0.7? Then the micro states will not consist of two independent symbols, there will be no pA and pB that fit the criterion of independence, yet there will still be an amount of information carried by each symbol pair, and an entropy of the set of symbol pairs equal to the expected value of that information.


 * I think I can modify my position on short messages. If I am faced with a short message "A5C" and no prior information and I don't expect any more messages, then my first impulse is to say that there is no point in worrying about the information content, since I cannot use whatever number I come up with in any meaningful way. I have no past or future messages to ponder. If I expect more messages to come, then I will say that assuming the message has an alphabet of 3 symbols, A, 5, and C, each equally likely and independent is one of the simpler initial assumptions I can make. As you say, "a blind statistic is a great starting point". I will wait for more messages and modify my assumptions and maybe eventually I will be able to more accurately estimate the information content of new messages, and the extensive entropy per message, or, if the symbols exhibit independence, the intensive entropy per symbol, of messages yet to be received. PAR (talk) 03:37, 8 December 2015 (UTC)
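The two cases discussed above can be checked numerically. This is a minimal sketch: `entropy_bits` is a helper name introduced here for illustration, and pA=11/12, pB=1/12 are taken from the earlier example. It computes the extensive entropy of the four two-symbol microstates, first with independent symbols and then with the dependent probabilities 0.1, 0.1, 0.1, 0.7.

```python
import math

def entropy_bits(probs):
    """Shannon entropy in bits: the expected value of -log2(p) over the microstates."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Independent case: pair probabilities factor into single-symbol probabilities.
pA, pB = 11 / 12, 1 / 12
pairs_indep = [pA * pA, pA * pB, pB * pA, pB * pB]  # pAA, pAB, pBA, pBB
H_symbol = entropy_bits([pA, pB])   # intensive entropy, bits per symbol
He_indep = entropy_bits(pairs_indep)  # extensive entropy, bits per two-symbol message

# Dependent case from the discussion: no pA, pB reproduce these pair probabilities.
pAA, pAB, pBA, pBB = 0.1, 0.1, 0.1, 0.7
He_dep = entropy_bits([pAA, pAB, pBA, pBB])
pA_first = pAA + pAB    # marginal probability that the first symbol is A
pA_second = pAA + pBA   # marginal probability that the second symbol is A
factorizes = math.isclose(pAA, pA_first * pA_second)
```

In the independent case the extensive entropy equals twice the per-symbol entropy. In the dependent case the marginals do not multiply back to pAA, yet the entropy of the set of pairs is still well defined.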

Note: If it hasn't already been apparent, I sometimes throw out the -1 from H, letting H=p*log(1/p) instead of H=-p*log(p).

Example: 3 interacting particles with sum total energy 2 and possible individual energies 0,1,2 may have possible energy distributions 011, 110, 101, 200, 020, or 002. I believe the order is not relevant to what is called a microstate, so you have only 2 symbols for 2 microstates, and get the probability for each is 50-50. Maybe there is usually something that skews this towards low energies. I would simply call each one of the 6 "sub-micro states" a microstate and let the count be included in H. Assuming equal p's again, the first case gives log(2)=1 and the 2nd log(6)=2.58. I believe the first one is the physically correct entropy (the approach, that is, not the exact number I gave). If I had let 0,1,2 be the symbols, then it would have 3*1.46 = 4.38 which is wrong.

Physically, because of the above, when saying S=k*ln(2)*NH, it requires that you look at specific entropy So and make it = k*ln(2)*H, so you'll have the correct H. This back-calculates the correct H. This assumes you are like me and can't derive Boltzmann's thermodynamic H from first (quantum or not) principles. I may be able to do it for an ideal gas. I tried to apply H to Einstein's oscillators (he was not aware of Shannon's entropy at the time) for solids, and I was 25% lower than his multiplicity, which is 25% lower than the more accurate Debye model. So a VERY simplistic approach to entropy with information theory was only 40% lower than experiment and good theory, for the one set of conditions I tried. I assumed the oscillators had only 4 energy states and got S=1.1*kT where Debye via Einstein said S=1.7*kT

My point is this: looking at a source of data and choosing how we group the data into symbols can result in different values for H and NH, [edit: if not independent]. Using no grouping on the original data is no compression and is the only one that does not use an algorithm plus lookup table. Higher grouping on independent data means more memory is required with no benefit to understanding (better understanding=lower NH). People with bad memories are forced to develop better compression methods (lower NH), which is why smart people can sometimes be so clueless about the big picture, reading too much with high NH in their brains and thinking too little, never needing to reduce the NH because they are so smart. Looking for a lower NH by grouping the symbols is the simplest compression algorithm. The next step up is run-length encoding, a variable symbol length. All compression and pattern recognition create some sort of "lookup table" (symbols = weighting factors) to run through an algorithm that may combine symbols to create on-the-fly higher-order symbols in order to find the lowest NH to explain higher original NH. The natural, default non-compressed starting point should be to take the data as it is and apply the H and NH statistics, letting each symbol be a microstate. Perfect compression for generalized data is not a solvable problem, so we can't start from the other direction with an obvious standard.

This lowering of NH is important because compression is 1 of 3 requirements for intelligence. Intelligence is the ability to acquire highest profit divided by noise*log(memory*computation) in the largest number of environments. Memory on a computing device has a potential energy cost and computation has a kinetic energy cost. The combination is internal energy U. Specifically, for devices with a fixed volume, in both production machines and computational machines, profit = Work output/[k*Temp*N*ln(N/U)] = Work/(kTNH). This is Carnot efficiency W/Q, except the work output includes acquisition of energy from the environment so that the ratio can be larger than 1. The thinking machine must power itself from its own work production, so I should write (W-Q)/Q instead. W-Q feeds back to change Q to improve the ratio. The denominator represents a thinking machine plus its body (environment manipulator) that moves particles, ions (in brains), or electrons (in computers) to model much larger objects in the external world to try different scenarios before deciding where to invest W-Q. "Efficient body" means trying to lower k for a given NH. NH is the thinking machine's algorithmic efficiency for a given k. NH has physical structure with U losses, but that should be a conversion factor moved out to be part of the kT so that NH could be a theoretical information construct. The ultimate body is bringing kT down to kb at 0 C. The goal of life and a more complete definition of intelligence is to feed Work back to supply the internal energy U and to build physical structures that hold more and more N operating at lower and lower k*T. A Buddhist might say we only need to stop being greedy and stop trying to raise N (copies of physical self, kT, aka the number of computations) and we could leave k, T, and U alone. This assumes constant volume, otherwise replace N/U with NN/UV. Including volume would mean searching for higher V per N which means more space exploration per "thought".
The universe itself increases V/N (Hubble expansion) but it cancels in determining Q because it causes U/N to decrease at the same rate. This keeps entropy and energy CONSTANT on a universal COMOVING basis (ref: Weinberg's 1977 famous book "First 3 Minutes"), which causes entropy to be emitted (not universally "increased" as the laymen's books still say) from gravitational systems like Earth and galaxies. The least action principle (the most general form of Newton's law, better than Hamiltonian & Lagrangian for developing new theories, see Feynman's red books) appears to me to have an inherent bias against entropy, preferring PE over KE over all time scales, and thereby tries to lower temp and raise the P.E. part of U for each N on Earth. This appears to be the source of evolution and why machines are replacing biology, killing off species 50,000 times faster than the historical rate. The legal requirement of all public companies is to dis-employ workers because they are expensive and to extract as much wealth from society as possible so that the machine can grow. Technology is even replacing the need for shareholders and skill (2 guys started MS, Apple, google, youtube, facebook, and snapchat and you can see the trend in decreasing intelligence and age and increasing random luck needed to get your first $billion). Silicon, carbon-carbon, and metals are higher energy bonds (which least action prefers over kinetic energy) enabling lower N/U and k, and even capturing 20 times more Work energy per m^2 than photosynthesis. Ions that brains have to model objects with still weigh 100,000 times more than the electrons computers use.

In the case of the balance and 13 balls, we applied the balance like asking a question and organized things to get the most data out of the test. We may seek more NH answers from people or nature than we give in order to profit, but in producing W, we want to spend as little NH as possible.

[edit: I originally backtracked on dependency but corrected it, and I made a lot of errors with my ratios from not letting k be positive for the ln.] Ywaz (talk) 23:09, 8 December 2015 (UTC)


 * Ywaz - you wrote:


 * "Example: 3 interacting particles with sum total energy 2 and possible individual energies 0,1,2 may have possible energy distributions 011, 110, 101, 200, 020, or 002. I believe the order is not relevant to what is called a microstate, so you have only 2 symbols for 2 microstates, and get the probability for each is 50-50. Maybe there is usually something that skews this towards low energies. I would simply call each one of the 6 'sub-micro states' a microstate and let the count be included in H. Assuming equal p's again, the first case gives log(2)=1 and the 2nd log(6)=2.58. I believe the first one is the physically correct entropy (the approach, that is, not the exact number I gave). If I had let 0,1,2 be the symbols, then it would have 3*1.46 = 4.38 which is wrong."


 * If the particles are indistinguishable, 2 microstates, entropy 1. If the particles are distinguishable, 6 microstates, entropy 2.58. (See Identical particles.)


 * "My point is this: looking at a source of data and choosing how we group the data into symbols can result in different values for H and NH"


 * Yes. The fact remains that if we define the macrostate, and the microstates with their probabilities, we have specified entropy. Different groupings of symbols is essentially defining different macrostates, so yes, different entropies. If we have a box with a partition and oxygen at the same temperature and pressure on either side and remove the partition - no change in entropy. If we have one isotope of oxygen on one side, another on the other, but cannot experimentally distinguish the two, if we remove the partition, again, no change in entropy. If we can experimentally distinguish the two, we have a different definition of macrostate: we remove the partition, entropy increases. Altering the definition of macrostate for the same situation, we get different entropies.


 * If we have a GIF file and no other information, we assume independent pixels, we calculate entropy. If we have two types of GIF files, one a photo, the other noise, but we cannot know beforehand which is which, then we assume independent pixels, get an entropy for each. If we can distinguish a photo from noise beforehand, then we can assume some dependence between pixels for the photo, get one entropy, no dependence for the noise and get another entropy. Again, altering the definition of macrostate for the same situation, we get different entropies.


 * I read the rest of your post, it's interesting, but off-topic, I think. PAR (talk) 04:53, 10 December 2015 (UTC)
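The microstate counts in the three-particle example can be reproduced by brute-force enumeration. This is a minimal sketch that assumes, as in the discussion, that all microstates are equally probable, so the entropy in bits is log2 of the count; the variable names are introduced here for illustration.

```python
import math
from itertools import product

levels = (0, 1, 2)   # possible energy of a single particle
total_energy = 2     # the macrostate: measured total energy
n_particles = 3

# Distinguishable particles: the order matters, so 011, 101 and 110 are different microstates.
distinguishable = [s for s in product(levels, repeat=n_particles)
                   if sum(s) == total_energy]

# Indistinguishable particles: only the multiset of energies matters.
indistinguishable = {tuple(sorted(s)) for s in distinguishable}

S_distinguishable = math.log2(len(distinguishable))      # entropy in bits
S_indistinguishable = math.log2(len(indistinguishable))  # entropy in bits
```

The same physical situation gives 6 microstates (about 2.58 bits) or 2 microstates (1 bit) depending only on whether the particles are treated as distinguishable.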

Thanks for the link to indistinguishable particles. The clearest explanation seems to be here, the mixing paradox. The idea is this: if we need to know the kTNH energy required (think NH for a given kT) to return to the initial state at the level 010, 100, 001 with correct sequence from a certain final sequence, then we need to do the microstates at that low level. Going the other way, "my" method should be mathematically the same as "yours" if it is required to NOT specify the exact initial and final sequences, since those were implicitly not measured. Measuring the initial state sequences without the final state sequences would be changing the base of the logarithm mid-stream. H is in units of true entropy per symbol when the base of logarithm is equal to the number of distinct symbols. In this way H always varies from 0 to 1 for all objects and data packets, giving a true disorder (entropy) per symbol (particle). You multiply by ln(2)/ln(n) to change base from 2 to n symbols. Therefore the ultimate objective entropy (disorder or information) in all systems, physical or information, when applied to data that accurately represents the source should be
 * $$ Entropy = N*(-H) = \sum_i count_i \log_n (N/count_i) $$

where i=1 to n distinct symbols in data N symbols long. Shannon did not specify which base H uses, so it is a valid H. To convert it to nats of normal physical entropy or entropy in bits, multiply by ln(n) or log2(n). The count/N is inverted to make H positive. In this equation, with the ln(2) conversion factor, this entropy of "data" is physically same as the entropy of "physics" if the symbols are indistinguishable, and we use energy to change the state of our system E=kT*NH where our computer system has a k larger than kb due to inefficiency. Notice that changes in entropy will be the same without regard to k, which seems to explain why ultimately distinguishable states get away with using higher-level microstates definitions that are different with different absolute entropy. For thermo, kb is what appears to have fixed not caring about the deeper states that were ultimately distinguishable.
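The quantity in the formula above can be computed directly from symbol counts. This is a minimal sketch: the function name is an assumption introduced here, and the value is treated simply as the sum of count_i * log_n(N/count_i), without taking a side on whether it should be called entropy or an estimate of it.

```python
import math
from collections import Counter

def count_weighted_log_base_n(data):
    """Sum over distinct symbols of count_i * log_n(N / count_i),
    where n is the number of distinct symbols and N the length of the data."""
    counts = Counter(data)
    N = len(data)
    n = len(counts)
    if n < 2:
        return 0.0  # a single repeated symbol: every term is log(1) = 0
    return sum(c * math.log(N / c, n) for c in counts.values())

message = "A" * 11 + "B"                   # 11 A's and 1 B, as in the earlier example
total = count_weighted_log_base_n(message)  # in units of log base n (here n = 2)
per_symbol = total / len(message)           # lies between 0 and 1 by construction

uniform = count_weighted_log_base_n("ABCD")  # 4 equally frequent symbols
uniform_bits = uniform * math.log2(4)        # convert to bits by multiplying by log2(n)
```

For the two-symbol message the base-n and base-2 values coincide; for four equally frequent symbols the per-symbol value is 1 in base 4 and 2 in bits.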

The best wiki articles are going to be like this: you derive what is the simplest but perfectly accurate view, then find the sources using that conclusion to justify its inclusion.

So if particles (symbols) are distinguishable and we use that level of distinguishability, the count at the 010 level has to be used. Knowing the sequence means knowing EACH particle's energy. The "byte-position" in a sequence of bits represents WHICH particle. This is not mere symbolism because the byte positions on a computer have a physical location in a volume, so that memory core and CPU entropy changes are exactly the physical entropy changes if they are at 100% efficiency (Landauer's principle). (BTW the isotope method won't work better than different molecules because it has more mass. This does not affect temperature, but it affects pressure, which means the count has to be different so that pressure is the same. So if you do not do anything that changes P/n in PV=nRT, using different gases will have no effect to your measured CHANGE in entropy, and you will not know if they mixed or not. )

By using indistinguishable states, physics seems to be using a non-fundamental set of symbols, which allows it to define states that work in terms of energy and volume as long as kb is used. The ultimate, as far as physicists might know, might be phase space (momentum and position) as well as spin, charge, potential energy and whatever else. Momentum and position per particle are 9 more variables because unlike energy momentum is a 6D vector (including angular), and a precise description of the "state" of a system would mean which particle has the quantities matters, not just the total. Thermo gets away with just assigning states based on internal energy and volume, each per particle. I do not see kb in the ultimate quantum description of entropy unless they are trying to bring it back out in terms of thermo. If charge, spin, and particles are made up of even smaller distinguishable things, it might be turtles all the way down, in which case, defining physical entropy as well as information entropy in the base of the number of symbols used (our available knowledge) might be best. Ywaz (talk) 11:48, 10 December 2015 (UTC)


 * Ywaz - you wrote:


 * The clearest explanation seems to be here, the mixing paradox. The idea is this: if we need to know the kTNH energy required (think NH for a given kT) to return to the initial state at the level 010, 100, 001 with correct sequence from a certain final sequence, then we need do the microstates at that low level. Going the other way, "my" method should be mathematically the same as "yours" if it is required to NOT specify the exact initial and final sequences, since those were implicitly not measured. Measuring the initial state sequences without the final state sequences would be changing the base of the logarithm mid-stream. H is in units of true entropy per symbol when the base of logarithm is equal to the number of distinct symbols. In this way H always varies from 0 to 1 for all objects and data packets, giving a true disorder (entropy) per symbol (particle). You multiply by ln(2)/ln(n) to change base from 2 to n symbols. Therefore the ultimate objective entropy (disorder or information) in all systems, physical or information, when applied to data that accurately represents the source should be


 * "$$ Entropy = N*(-H) = \sum_i count_i \log_n (N/count_i) $$"


 * "where i=1 to n distinct symbols in data N symbols long."


 * I don't understand what you are saying. What is the meaning of "sequence"? If "sequence" 001 is distinct from 100, then "sequence" means microstate and statements like "Measuring the initial state sequences" are improper, microstates are not measured in thermo, only macrostates are measured. I assume the macrostate is total energy, so that microstates 010, 100, 001 form a macrostate: energy measured to be 1.


 * Again, what you call "Objective Entropy" is not entropy, it is the amount of information in a particular message (microstate), assuming independence, and assuming that the frequencies of the symbols in the message are a perfect indication of the frequencies of the population (macrostate) from which it is drawn. It's OK to say it's a "best estimate" of the entropy, just as the best estimate of the mean of a normal distribution given one value is that value. But please don't call it entropy. Likewise, what you call entropy/symbol is an estimate of the information per symbol.


 * "The best wiki articles are going to be like this: you derive what is the simplest but perfectly accurate view, then find the sources using that conclusion to justify its inclusion."


 * Fine, but please don't invent new quantities and call them by names which are universally accepted to be something else. It just creates massive confusion and interferes with communication.


 * "By using indistinguishable states, physics seems to be using a non-fundamental set of symbols, which allows it to define states that work in terms of energy and volume as long as kb is used. The ultimate, as far as physicists might know, might be phase space (momentum and position) as well as spin, charge, potential energy and whatever else. Momentum and position per particle are 9 more variables because unlike energy momentum is a 6D vector (including angular), and a precise description of the 'state' of a system would mean which particle has the quantities matters, not just the total."


 * Physics is not using a non-fundamental set of symbols. With indistinguishable particles, the "ultimate" you describe does not exist. For distinguishable particles 011, 101, 110 are distinct microstates (alphabet of two, message of three); for indistinguishable particles, they are not (alphabet of one, message of one: 011, 101, 110 are the same symbol). You cannot use the phrase "which indistinguishable particle"; it's nonsense, there is no "which" when it comes to indistinguishable particles.


 * "Thermo gets away with just assigning states based on internal energy and volume, each per particle. I do not see kb in the ultimate quantum description of entropy unless they are trying to bring it back out in terms of thermo."


 * Thermo doesn't "get away" with anything. It only works with observable, measurable quantities and assigns macrostates. Different observable quantities, different macrostates. Thermo knows nothing about microstates, and doesn't need to in order to calculate thermodynamic entropy (to within a constant). In thermal physics, microstates are unmeasurable. If they were measurable, they would be macrostates, and their entropy would be zero. Please note that there are, for our purposes, two kinds of entropy, thermodynamic entropy and information entropy. The two have nothing to do with each other, until you introduce the statistical mechanics model of entropy. Then the two are basically related by the statistical mechanics constant kB (Boltzmann's S=kB ln(W)). The physical information entropy is proportional to the amount of information you are missing about the microstate by simply knowing the macrostate (internal energy, volume, or whatever). PAR (talk) 06:42, 11 December 2015 (UTC)

PAR: " what you call "Objective Entropy" is not entropy, it is the amount of information in a particular message (microstate), "

Don't you mean entropy = information in a Macrostate? That is what you should have said.

I didn't make it up. It's normally called normalized entropy, although they normally refer to this H with logn "as normalized entropy" when according to Shannon they should say "per symbol" and use NH to call it an entropy. I'm saying there's a serious objectivity to it that I did not realize until reading about indistinguishable states.

I hope you agree "entropy/symbol" is a number that should describe a certain variation in a probability distribution, and that if a set of n symbols were made of continuous p's, then a set of m symbols should have the same continuous distribution. But you can't do that (get the same entropy number) for the exact same "extrapolated" probability distributions if they use a differing number of symbols. You have to let the log base equal the number of symbols. I'll get back to the issue of more symbols having a "higher resolution". The point is that any set of symbols can have the same H and have the same continuous distribution if extrapolated.

If you pick a base like 2, you are throwing in an arbitrary element, and then have to call it (by Shannon's own words) "bits/symbol" instead of "entropy/symbol". Normalized entropy makes sense because of the following

entropy in bits/symbol = log2(2^("avg" entropy variation/symbol))

entropy per symbol = logn(n^("avg" entropy variation/symbol))

The equation I gave above is the normalized entropy that gives this 2nd result.

Previously we showed for a message of N bits, NH=MH' if the bits are converted to bytes and you calculate H' based on the byte symbols using the same log base as the bits, and if the bits were independent. M = number of byte symbols = N/8. This is fine for digital systems that have to use a certain amount of energy per bit. But what if energy is per symbol? We would want NH = M/8*H' because the byte system used 8 fewer symbols. By using log base n, H=H' for any set of symbols with the same probability distribution, and N*H=M/8*H.

Bytes can take on an infinite number of different p distributions for the same H value, whereas bits are restricted to a certain pair of values for p0 and p1 (or p1 and p0) for a certain H, since p0=1-p1. So bytes have more specificity, which could allow for higher compression or describing things like 6-vector momentum instead of just a single scalar for energy, using the same number of SYMBOLS. The normalized entropy allows them to have the same H to get the same kTNH energy without going through contortions. So for N particles let's say bits are being used to describe each one's energy with entropy/particle H, and bytes are used to describe their momentums with entropy/particle H'. Momentums uniquely describe the energy (but not vice versa). NH=NH'. And our independent property does not appear to be needed: H' can take on specific values of p's that satisfy H=H', not some sort of average of those sets. Our previous method of NH=MH' is not as nice, violating Occam's razor. Ywaz (talk) 14:57, 11 December 2015 (UTC)


 * Ywaz - you said:

PAR: " what you call "Objective Entropy" is not entropy, it is the amount of information in a particular message (microstate), " Don't you mean entropy = information in a Macrostate? That is what you should have said.


 * What I said was "what you call "Objective Entropy" (given in your equation above) is not entropy, it is the amount of information in a particular message (microstate), assuming independence, and assuming that the frequencies of the symbols in the message are a perfect indication of the frequencies of the population (macrostate) from which it is drawn."


 * Entropy is the AVERAGE amount of MISSING information in a macrostate. It is the AVERAGE amount of information SUPPLIED by knowing a microstate.


 * Assuming the probability of the i-th microstate is pi, and assuming they are independent, the information supplied by knowing the i-th microstate is -log2(pi) bits. The entropy is the weighted average: It is the sum over all i of -pi log2(pi) (bits). Note that the sum over all i of pi is unity. PAR (talk) 02:56, 14 December 2015 (UTC)

Kullback–Leibler divergence
The definition given here looks different from that on http://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence Please reconcile 108.6.15.34 (talk) 21:04, 15 June 2014 (UTC)


 * Note that f, in the article here, equals p divided by m. Substituting p / m instead of f into the definition used in the article gives
 * $$D_{\mathrm{KL}}(p\|m) = \int \ln\left(\frac{p(x)}{m(x)}\right) p(x) \, {\rm d}x, \!$$
 * the form used in the Kullback–Leibler article. Jheald (talk) 07:09, 16 June 2014 (UTC)

Estimating entropy from a sample
If you know about estimating entropy from a finite sample that is randomly drawn with replacement according to a probability distribution on a (possibly infinite) population, would you please add it to the article? For example, drawing a sample of size 1 gives p=1 for the drawn value and p=0 for all other values, yielding an estimated entropy of 0 regardless of the actual probability distribution on the population. More generally, the expected value of the entropy computed from a finite sample will be less than the actual entropy of the population. And so on. Leegrc (talk) 15:43, 9 March 2015 (UTC)
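The downward bias described above can be made concrete for a small case. This is a minimal sketch: the fair-coin population and the function names are assumptions introduced here. It computes the exact expected value of the plug-in (sample-frequency) entropy estimate by enumerating every possible ordered sample.

```python
import math
from collections import Counter
from itertools import product

def plugin_entropy(sample):
    """Entropy in bits computed from the sample's own symbol frequencies."""
    N = len(sample)
    return -sum((c / N) * math.log2(c / N) for c in Counter(sample).values())

def expected_plugin_entropy(dist, size):
    """Exact expectation of plugin_entropy over all ordered samples of the given
    size, drawn independently with replacement from the distribution `dist`."""
    total = 0.0
    for sample in product(dist, repeat=size):
        prob = math.prod(dist[x] for x in sample)
        total += prob * plugin_entropy(sample)
    return total

fair_coin = {"H": 0.5, "T": 0.5}  # population entropy is exactly 1 bit
expected_by_size = {n: expected_plugin_entropy(fair_coin, n) for n in (1, 2, 4, 8)}
```

For a sample of size 1 the estimate is always 0, and for each size tried here the expected estimate stays below the true 1 bit while moving closer to it as the sample grows.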

Grammar, please?
Would someone familiar with the terms and practices of this subject AND of the English language go through the article and fix it, please?

Bits vs. shannons
While it may be true that there is a unit called the "shannon" that measures Shannon entropy using logarithms base 2, it is my experience that it is never used. Always the identical unit called the "bit" is used instead. Shouldn't the article reflect this common usage? Leegrc (talk) 17:20, 18 May 2015 (UTC)


 * The lead does mention the common usage, which should prevent problems even for the lay reader. It would be non-encyclopaedic to use a term that perpetuates an ambiguity and hence a misconception, though, with the only motivation that common usage dominates.  The concept of bits used to represent information and bits as units of information are so close as to engender confusion (especially since they become numerically equal under the illustrative assumption of uniform probability); many people cannot distinguish the two.  The alternative is to go to lengths to belabour the distinction between the two so that it is clear that they are different quantities and different units, even though they are both called "information" measured in "bits".  My feeling is that this encyclopaedic onus is discharged more naturally by using the unambiguous, albeit uncommon, use of the "correct" units.  However, more opinions and debate on this issue would be useful. —Quondum 18:08, 18 May 2015 (UTC)


 * My strong recommendation would be to use "bits" throughout. In my opinion, the use of "shannons" is as wrongheaded as having as a unit of capacity the litre, and using it to measure the capacity of empty vessels, but then introducing a different unit -- the "pemberton" perhaps -- for talking about the amount of fluid that one is going to put into those vessels.


 * As User:Leegrc says, the shannon is simply not used in the real world; and IMO its use here is positively confusing, by suggesting to people there is a difference between shannons and bits, when actually there isn't. If you want to send the information to identify one stream of data out of a probability distribution of possibilities, you need to transmit so many bits per second.  It's as simple as that.  Jheald (talk) 19:10, 18 May 2015 (UTC)


 * Did you even read my sentence "The concept of bits used to represent information and bits as units of information are so close as to engender confusion (especially since they become numerically equal under the illustrative assumption of uniform probability); many people cannot distinguish the two."? Let's stick to an encyclopaedic approach, should we? Not much of what you say here is even correct: it is akin to saying "683 candela = 1 watt/sr, simple as that." Are you saying that we should replace the article Shannon (unit) with a redirect to Bit?  Was Alan Turing daft to propose a definition of a unit of information (the ban) distinct from the decimal digit? Is IEC 80000-13 to be ignored? —Quondum 19:56, 18 May 2015 (UTC)

The edit using another WP article as a citation violates WP guidelines. Besides, that article is largely written from the perspective of a computer scientist who has little knowledge of or understanding about entropy and information theory. —Quondum 20:22, 18 May 2015 (UTC)


 * I appreciate the distinction between the bit as a unit of storage (for RAM, disks, etc.) and the bit as a unit of information. It is perhaps unfortunate that the same word is used in both contexts, but it is nonetheless fact.  Using "shannon" instead of the latter use of "bit" would make the distinction, but unfortunately the use of the "shannon" unit is pretty darn rare, and thus is problematic.  Leegrc (talk) 20:32, 18 May 2015 (UTC)
 * Okay, cool. Perhaps we can consider using the bit instead of the shannon, with suitable explanation of the distinction between "bit of information or entropy" and "bit of data".  But we still cannot equate the two.  In the context of entropy, I would prefer using the nat as the dominant unit of reference for the article though; what are the feelings on that?  —Quondum 21:59, 18 May 2015 (UTC)


 * The overwhelming majority of textbooks on information theory, coding theory, quantum computing etc use bits, not shannons.  In fact, there's not a single textbook on my shelves that uses shannon.  So yes, in this case I would ignore IEC 80000-13, and follow the overwhelming usage by reliable sources instead.  There is previous form for this -- for example at Gibbs free energy we follow the traditional usage, rather than IUPAC's "Gibbs energy".  (And no, I don't want to know about Gibibits either).


 * The situation is different from the candela, because unlike the raw power intensity, the candela incorporates a biological factor specific to human physiology. In contrast, if a source has an entropy rate of n bits per second, that corresponds to an information rate of n bits per second, which can be encoded (asymptotically) in n storage bits per second.  This is the content of Shannon's source coding theorem, and introducing anything other than bits is simply to bring in unnecessary confusion -- there is no advantage to be had in trying to make the distinction you're trying to make.


 * The "ban" (more specifically the "deciban") was introduced as a bit of a joke -- a play on "Banbury sheet" and "decibel", being a logarithmic scale for Bayes factors that would otherwise multiply. But it has the advantage of being shorter and punchier than "hartley" or "decimal digit", which is why I personally prefer it. In this regard it fits in well with the rest of the family: "bit", "byte" and "nat".  It may also be worth remembering that Good and Turing were working in a specific area, before the publication of Shannon's communication theory, which is when the equivalence between measures of uncertainty and measures of information capacity really got put on the front page.


 * Finally, should we merge Shannon (unit) into Bit ? I'd prefer not to, because as far as possible I'd prefer not to introduce the confusion of the Shannon at all.  Better IMO to leave it like Gibibit in its own ghetto of an article, so with luck most people will be able to get along without ever needing to be troubled by it. (To which a good start would be not troubling them with it here).  Jheald (talk) 21:47, 18 May 2015 (UTC)


 * You seem to be firmly of the mindset that data and information are the same quantity. Your argument above is merely one of using "bit" to mean the same as "shannon", but not to merge the concepts.  Whether we adopt the unit "bit" or not (I'd prefer the nat), it must be made clear that data and information are distinct quantities, and bits as units of information are distinct as units from bits of data.  And, no, the candela example applies: just like there is a function modifying the quantity, the function of probability is present in information, but not in data. —Quondum 21:54, 18 May 2015 (UTC)


 * It's more like saying that wood is weighed in kilograms and metal is weighed in kilograms. That doesn't mean wood = metal.  It means they're weighed on the same scale, so it makes sense to use the same unit.  Jheald (talk) 22:14, 18 May 2015 (UTC)


 * Then how would you express that 8 bits of data might have 2.37 bits of entropy in your analogy (how can the same object have a mass of 3 kilograms and 7 kilograms simultaneously)? The candela example seems apt: the wattage per unit area of radiation from a lamp provides an upper bound on its luminous intensity, just as the data capacity of a register gives an upper bound on its entropy.  In the luminous intensity example, the spectrum connects the two, and in the information example it is the probability density function that connects the two.  —Quondum 23:20, 18 May 2015 (UTC)


 * It's quite straightforward. It means you can compress that 8 bits of data and store it in (on average) 2.37 bits. Jheald (talk) 01:11, 19 May 2015 (UTC)


 * I know what it means. I don't particularly see the point of this discussion.  —Quondum 01:18, 19 May 2015 (UTC)


 * The point is that you're compressing bits into bits -- just like you crush down a ball of silver foil from a larger volume to a smaller volume. You don't need a separate unit for the volume of crushed-down silver foil, and it's a bad idea to introduce one because it hides what's going on.


 * It's a really important idea that Shannon entropy measures the number of bits that you can compress something into -- that those are the same bits that you measure your storage in, and that there is a formula (Shannon's entropy formula) that tells you how many of them you will need. It's all about bits.  Introducing a spurious new unit makes that harder to see, and makes things harder to understand than they should be -- things like Kullback-Leibler divergence, or Minimum message length/Minimum description length approaches to inference with arguments like "bits back".  Better just to get used to the idea from the very start that Shannon entropy is measured in bits. Jheald (talk) 01:58, 19 May 2015 (UTC)


 * And what of the remainder of the family of Rényi entropy? —Quondum 02:10, 19 May 2015 (UTC)


 * To be honest, I've never really thought about using any unit for something like the collision entropy H2. Can you give any examples of such use?
 * Measurement of Shannon entropy in bits has an operational meaning anchored by Shannon's source coding theorem, ie how many storage bits one would need (on average) to express a maximally compressed message. But the data compression theorem is specific to Shannon entropy.
 * Rényi entropy can be regarded as the logarithm of an "effective number of types" (see diversity index), where different orders of the entropy correspond to whether more probable species should be given a more dominant or a less dominant weighting than they are in the Shannon entropy. But I don't think I've ever seen a Rényi entropy quoted in bits, or indeed in shannons -- I think these specifically relate to Shannon entropy.  Jheald (talk) 09:11, 19 May 2015 (UTC)


 * On further investigation, it seems that there are people who do give units to Rényi entropies. The words "in units of shannons" under the graph in the Rényi entropy article were added by you (diff), so I don't think they can be taken as much of a straw either way.  But here's a Masters thesis which measures the Rényi entropy in bits (with a somewhat better graph); and I guess the use of bits isn't so inappropriate if they stand for the number of binary digits, this time in the "effective number of types".  Jheald (talk) 09:45, 19 May 2015 (UTC)


 * Your dismissiveness is not helping. You really do not seem open to considering the merits of other ideas before you write them off. Besides, this really does not belong on this talk page. —Quondum 15:55, 19 May 2015 (UTC)


 * Fine, let's just agree to rip out all instances of the unit "shannon" from the page, as Leegrc originally suggested, and then I will let it drop.
 * I have given it consideration, and the result of that consideration is that the more I have considered it the more certain I am that the use of "shannon", rather than "bit" in the context Shannon originally introduced the word, with its equivalence to storage bits, is a stumbling-block that we help nobody by putting in their path. Jheald (talk) 16:51, 19 May 2015 (UTC)


 * I agree with JHeald:


 * "Then how would you express that 8 bits of data might have 2.37 bits of entropy in your analogy (how can the same object have a mass of 3 kilograms and 7 kilograms simultaneously)?"


 * First of all, an instance of data does not have entropy. Data is measured in bits, entropy is measured in bits. They are *operationally* connected by the idea that any data instance may be compressed to at least the entropy of the process. Simplistically, entropy is the number of yes/no questions you have to ask to determine an instance of the data, and knowing a data instance answers that many questions. Same units ("number of questions"), entropy asks, data answers. The number of bits in the data minus the number of questions the data instance answers (answer=yes/no, a bit) is the redundancy of the data (in bits). You cannot have an equation like that with mixed units, so they are the same, and finagling the equation with unit conversion constants is counterproductive, in my mind.


 * In thermodynamics, the (thermodynamic) entropy is measured in Joules/Kelvin, which is the same units as the heat capacity. This does not mean that the two are equal for a given system. They can be related to each other, however.


 * I agree that there is a distinction, but I have never had any conceptual difficulty dealing with bits of data and bits of entropy. The connection between the two is more important than their difference. I have also never seen the "Shannon" used in the literature. Two reasons not to use it.


 * I prefer the bit when it comes to information theoretic discussions. It is intuitively more obvious, particularly when probabilities are equal. Entropy can then be viewed as "how many questions do I have to ask to determine a particular data instance?" Even in thermodynamics, the thermodynamic entropy in bits can be expressed as the number of questions I have to ask to determine the microstate, given the macrostate. PAR (talk) 06:04, 21 May 2015 (UTC)

Page Clarity
This is one of the worst-written pages I've ever encountered on Wikipedia; it sounds as though it was copied directly from a poorly-written upper-level textbook on the subject. The major problem is clarity - it uses longwinded, unnecessary phrases where simple ones would be both synonymous and much clearer to the reader, and has an inflated vocabulary. I don't understand the subject matter myself well enough to do a complete revision of it, though I am going to go through and do a cleanup of anything I am sure I know the meaning of.

If someone with more knowledge of the subject could do more in-depth work on it, that'd be very helpful. This is supposed to be an *encyclopedia* entry - that is, it's supposed to be a reasonably easily-understood explanation of a complex topic. Currently the "complex topic" part is more than covered, but there's not nearly enough of the "easily understood" part. In particular, the intro paragraph needs *heavy* revision, as it's the main thing non-technical readers will look at if they encounter this topic.

To forestall complaints of "It's a complex topic so it NEEDS complex language!": Yes, that's true, but there's a difference between technical terminology used to explain something and incomprehensible masses of unnecessarily-elevated vocabulary and tortured phrasing.

Lead text
I removed a paragraph in the lead which explained entropy as an amount of randomness. Indeed, entropy is greater for distributions which are more "random." For example, a uniform distribution has the maximum entropy over all possible discrete distributions. Gaussian distributions take that role for continuous distributions. However, when it comes to messages having the alphabet of say, the English alphabet, then it is not clear whether "aaaaaaaa" has more entropy than the message "alphabet." If it is assumed that all letters of the English language are equally likely then, in fact, these two messages are equally likely and thus have the same entropy. However, in reality, "a"'s are probably more likely to occur than other letters. In which case "aaaaaaaa" would have less entropy. However, this can be terribly misleading to relay to people who may not know the subject well to give them this general rule-of-thumb based off of perceived randomness. Bo ur ke M  Converse! 05:50, 17 October 2015 (UTC)

Rationale
The last formula gives 1 bit as desired. The value of $\log_{2}(2)$ is 1 bit for a fair coin, where $n = 2$. The summation adds half of one bit to half of one bit to get one bit. 𝕃eegrc (talk) 14:49, 11 November 2015 (UTC)


 * Why is summation used for a single toss? 95.132.143.157 (talk) 04:18, 15 November 2015 (UTC)

The summation is over the possible results, not just the observed results. 𝕃eegrc (talk) 13:30, 16 November 2015 (UTC)
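Written out term by term for the fair coin (a worked instance, taking $$p_1 = p_2 = \tfrac{1}{2}$$ for heads and tails), the sum is:

 * $$\Eta(X) = -\sum_{i=1}^{2} p_i \log_2 p_i = -\tfrac{1}{2}\log_2\tfrac{1}{2} - \tfrac{1}{2}\log_2\tfrac{1}{2} = \tfrac{1}{2} + \tfrac{1}{2} = 1 \text{ bit}.$$

There are two terms because a single toss has two possible results, even though only one of them is observed.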

The Rationale section reads, in part:
 * "I(p) is monotonic – increases and decreases in the probability of an event produces increases and decreases in information, respectively."

Shouldn't that be "increases and decreases in the probability of an event produces decreases and increases in information, respectively," since less probable events convey more information? In other words, I(p) is monotonic, but isn't it monotonic decreasing, rather than monotonic increasing, as the article implies? 207.165.235.61 (talk) 16:46, 14 September 2016 (UTC) Gabriel Burns

Posturing and Silliness
The glorification of the field of information entropy as conceptually distinct and more fundamental, compared to statistical mechanics and thermostatistics, is quite oblivious of the pre-existing vastness and sophistication of this basic field (statistical mechanics) from whence the Shannon paper and subsequent efforts are merely offshoots. The statements under this "comparison" are rife with patently false statements about what "entropy" was limited to previously in statistical mechanics (which, contrary to this silly paragraph, was defined by probability distributions, non-observably small fluctuations, and other such concepts that fundamentally define entropy!!). "Information entropy" is a small conceptual contribution to a vast, existing body of development on entropy theory, from which this paragraph falsely distinguishes it. The author of this piece should be entirely better educated on the subject, rather than making this display of ignorance and presuming it corresponds to the pre-existing field. 2602:306:CF87:A200:6D19:A65A:57A0:F49D (talk) 18:44, 7 January 2016 (UTC)


 * If your general objections are valid, it's difficult to make corrections without a specific example or examples of what you object to. Can you state specifically the text that you object to? PAR (talk) 01:01, 8 January 2016 (UTC)

Assessment comment
Substituted at 14:34, 29 April 2016 (UTC)

Entropy definition using frequency distributions
I propose an addition to the Definition section of the article:

Using a frequency distribution
A frequency distribution $$\mathrm{F}(X)$$ is related to its probability mass function $$\mathrm{P}(X)$$ by the equation:


 * $$\frac{\mathrm{F}(x_i)}{\sum_{i=1}^n {\mathrm{F}(x_i)}} = \mathrm{P}(x_i)$$.

A definition of entropy, in base $$b$$, of random variable $$X$$, with frequencies $$F(x_i)$$, that uses frequency distributions is:


 * $$\Eta(X) = \log_b{\sum_{i=1}^n {\mathrm{F}(x_i)}} - \frac{1}{\sum_{i=1}^n {\mathrm{F}(x_i)}} \sum_{i=1}^n {\mathrm{F}(x_i) \log_b \mathrm{F}(x_i)}.$$

People that need to calculate entropy for a large number of outcomes $$x_i$$ will appreciate that $$\mathrm{F}(x_i) \log_b \mathrm{F}(x_i)$$ can be pre-calculated and that this definition does not require you to know the total number of observations before starting calculation. The definition using probability mass function requires that you know all observations (to calculate $$\mathrm{P}(x_i)$$) before you start calculation.
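As a quick check of the proposed formula, here is a hedged Python sketch comparing it with the usual probability-based definition. The function names and the example counts are assumptions made for the illustration; the frequency form is the one given in the equation above.

```python
import math

def entropy_from_probs(probs, base=2):
    """Standard definition: -sum p log_b p."""
    return -sum(p * math.log(p, base) for p in probs if p > 0)

def entropy_from_freqs(freqs, base=2):
    """Frequency form: log_b(N) - (1/N) * sum F log_b F, with N = sum F."""
    total = 0
    f_log_f = 0.0
    for f in freqs:  # single pass; N does not need to be known in advance
        if f > 0:
            total += f
            f_log_f += f * math.log(f, base)
    return math.log(total, base) - f_log_f / total

freqs = [5, 3, 2]  # hypothetical observation counts
probs = [f / sum(freqs) for f in freqs]
print(entropy_from_freqs(freqs), entropy_from_probs(probs))
```

The two values agree up to floating-point rounding, since substituting P = F/N into -Σ P log P gives log N - (1/N) Σ F log F.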

Disclaimer: I am the author of this research and the supporting link is to a website I control. Additionally, I would move the cross-entropy definition under a subheading. Would any interested parties please share if you believe this is a useful contribution to the article? --Full Decent (talk) 05:02, 12 March 2017 (UTC)


 * I don't see a problem with it as long as you include something like "According to a 2015/new/ongoing study (reference), entropy may be defined in terms of frequency distributions... "
 * 177.68.225.247 (talk) 21:44, 24 March 2017 (UTC)

Clarifying jargon in lead section
A couple options: Use wikilinking (to offer readers a ready path to clarification and expanded vocabulary) or instead use a more common/less specialized term (one more readily understandable to a general readership, though perhaps not the first choice of those with a particular topical interest).

Regarding stochastic as used in the first sentence, it seems  might suffice for the former, and   might work for the latter.

Gonna' go with the wikilink for now. Hopefully this will serve as a broadly accommodating 'middle path' (between bare tech jargon and plain speech). I'd offer no objection to someone just substituting random instead though.

A fellow editor, --75.188.199.98 (talk) 14:05, 11 November 2017 (UTC)
 * As written now (possibly before your revision), I would think that "probabilistic stochastic" is redundant. "Stochastic" (or similar) should be sufficient. Attic Salt (talk) 14:09, 11 November 2017 (UTC)


 * Thanks for offering input, I took your suggestion and reduced "probabilistic stochastic" to wikilinked "stochastic". Also started to consider adding some more wikilinks ... but then the more I looked at the lead (and indeed the article as a whole) and gave it consideration I started to feel there were general concerns beyond a few terms and added a tag to the article. Much of it seems pretty opaque to me. Overall seems quite long as well, but hard to judge the value of that without better being able to follow the material in the first place.


 * Perhaps someone might eventually be found who both follows the topic and has skill at translating such into plain speech. --75.188.199.98 (talk) 14:32, 11 November 2017 (UTC)
 * The lede is a mess. Attic Salt (talk) 14:34, 11 November 2017 (UTC)

(intro sentence) "average amount" vs "average rate"
Shouldn't the intro sentence read "the average **rate** of information produced", or something like that, rather than "the average **amount** of information produced"? I'm new to this, but I think information entropy is basically the expected self-information of each event (or outcome? measurement?). So it seems that it should be described as either the expected *amount per event/outcome/measurement*, or as the "rate" of information produced by the stochastic source. 203.63.183.112 (talk) 05:39, 25 January 2018 (UTC)
 * Done. Loraof (talk) 16:08, 19 June 2018 (UTC)

Important Mistake
The statement: "I(p) is monotonically decreasing in p – an increase in the probability of an event decreases the information from an observed event, and vice versa" is wrong. This is obviously wrong, otherwise you could not have a maximum. The correct statement is that I(p) is a continuous function.

If there is agreement I will fix this to: "I(p) is a continuous function of p"

--Skater00 (talk) 14:08, 10 April 2018 (UTC)

It might be better to say it was a unimodal function of p. https://en.wikipedia.org/wiki/Unimodality#Unimodal_function

David Malone (talk) —Preceding undated comment added 07:15, 11 April 2018 (UTC)


 * But reference 7, p. 18, says “I(p) is monotonic and continuous in p”. Loraof (talk) 16:07, 19 June 2018 (UTC)
 * It’s the entropy H(p), not the information I(p), that is non-monotonic in p, as shown in the graph in the Examples section. Loraof (talk) 18:28, 19 June 2018 (UTC)
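The distinction between I(p) and H(p) can be checked numerically. The sketch below assumes I(p) = -log2(p) for the information of a single event and takes H(p) to be the entropy of a two-outcome variable with probabilities p and 1 - p; the function names and sample points are chosen only for illustration.

```python
import math

def self_information_bits(p):
    """I(p): information from observing an event of probability p."""
    return -math.log2(p)

def binary_entropy_bits(p):
    """H(p): entropy of a two-outcome variable with probabilities p, 1 - p."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

ps = [0.1, 0.3, 0.5, 0.7, 0.9]
info = [self_information_bits(p) for p in ps]
ent = [binary_entropy_bits(p) for p in ps]

for p, i, h in zip(ps, info, ent):
    print(p, round(i, 3), round(h, 3))
```

Over these points I(p) falls steadily as p grows, while H(p) rises to its maximum of 1 bit at p = 0.5 and then falls again.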

Removed paragraph
I was surprised to find a paragraph encapsulating a crucial misunderstanding in the section "limitations of entropy as information measure" (itself not a topic that makes a lot of sense) "Although entropy is often used as a characterization of the information content of a data source, this information content is not absolute: it depends crucially on the probabilistic model. A source that always generates the same symbol has an entropy rate of 0, but the definition of what a symbol is depends on the alphabet. Consider a source that produces the string ABABABABAB… in which A is always followed by B and vice versa. If the probabilistic model considers individual letters as independent, the entropy rate of the sequence is 1 bit per character. But if the sequence is considered as "AB AB AB AB AB …" with symbols as two-character blocks, then the entropy rate is 0 bits per character."

The only way in which it makes sense to define the entropy of a single sample from a probability distribution is as a measure of the surprise. When A and B are generated, it makes no sense to pretend that A has not been seen when B appears: A provided information and the probability of B depends on A having occurred. Given this, the probability of A and B is the same as the probability of A multiplied by the conditional probability of B given that the first letter was A. There is no difference in the information looking at it one way or the other. It is not true that the information in a sequence AB AB AB ... is zero bits per block of two bits unless there is no surprise in this (i.e. no other sequences were possible).

The surprise per bit of some sample from an assumed probability distribution of such samples does depend on the assumed distribution but it cannot depend on how someone chooses to group the letters. Elroch (talk) 15:06, 17 May 2018 (UTC)


 * I agree - such a discussion is important. The entropy of a string of letters or bits or whatever is not defined until a probability distribution is defined. Different probability distribution different entropy. Very important point.


 * A source that produces the string ABABABABAB… in which A is always followed by B and vice versa, has entropy zero, no surprises. Same for AB AB AB... so yes, the paragraph is wrong.


 * Consider 6-letter strings like ABABAB - The general probability distribution for 6 letters is P(L1,L2,L3,L4,L5,L6) where L1 is the first letter, etc. The entropy is then the sum of -P log P over every possible combination of letters in each of the six arguments (i.e. over the alphabet of each argument, different arguments may generally have different alphabets).


 * The probability P(L1,L2,L3,L4,L5,L6) is equal to P2(L1L2,L3L4,L5L6), the probability of three two-letter words, and you can calculate the entropy of three two-letter words as the sum of -P2 log P2 over the alphabet of L1L2, L3L4, etc. and it will be the same.


 * So I agree, it doesn't depend on how someone chooses to group the letters, but one CAN group the letters any way one chooses, as long as the probabilities (e.g. P and P2) remain consistent. PAR (talk) 04:55, 18 May 2018 (UTC)
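The grouping argument above can be illustrated with a small Python sketch. The joint distribution over six-letter strings is hypothetical, and `regroup` is an illustrative helper that rereads each string as a sequence of two-letter words while carrying the same probabilities across.

```python
import math

def entropy_bits(dist):
    """Entropy (bits) of a distribution given as {outcome: probability}."""
    return sum(-p * math.log2(p) for p in dist.values() if p > 0)

def regroup(dist, block):
    """Re-express each string as a tuple of `block`-letter words."""
    out = {}
    for s, p in dist.items():
        words = tuple(s[i:i + block] for i in range(0, len(s), block))
        out[words] = out.get(words, 0.0) + p
    return out

# Hypothetical joint distribution P(L1,...,L6) over six-letter strings.
P = {"ABABAB": 0.5, "BABABA": 0.25, "AAAAAA": 0.25}
P2 = regroup(P, 2)  # same probabilities, read as three two-letter words

print(entropy_bits(P), entropy_bits(P2))

# A source that can only ever emit ABABAB has no surprise at all.
print(entropy_bits({"ABABAB": 1.0}))
```

Because regrouping only relabels the outcomes without changing their probabilities, the two entropies are identical, and a source with a single possible output has zero entropy however its letters are grouped.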

Introduction formula inconsistent
The introduction reads

"The measure of information entropy associated with each possible data value is the negative logarithm of the probability mass function for the value:"

But then displays a formula showing the *average* information content, i.e. the weighted sum of each possible data value.

Either the formula should be updated to

h(x=a_i) = - log(p(x=a_i))

OR the statement should be updated to

"The measure of information entropy associated with a random variable X is a weighted sum of the negative logarithm of the probability mass function p(x):"

Maitchison (talk) 06:15, 11 September 2019 (UTC)

Relationship to Shannon information?
OK, so the page on Shannon information redirects to this article. But someone who searches for "Shannon information" can have no idea why they have been served instead this article on entropy. And recourse to searching for "Shannon information" within this article can only render such readers even more confused. After all, the first of the string's two occurrences informs readers that
 * "the thermodynamic entropy [emphasis added] is interpreted as being proportional to the amount of further Shannon information needed to define the detailed microscopic state of the system,"

but that makes it sound as though the thing that this article is describing—entropy—is being explained in terms of the very thing that our poor baffled readers had been wanting to learn about in the first place. This experience comes awfully close to circularity.

Anyway, to paraphrase Thomas Jefferson in the Declaration of Independence, when a redirect is created, a decent respect to the opinions of mankind requires that the redirector should declare the relationship which impels the redirection (See WP:R).

And that's not even the worst of it, because the term at the heart of this as-yet-unexplained redirection may not even be well defined. To quote one source, "the concept of Shannon information is still a focus of much debate."

I would happily clean up all this mess in this article if only I had the necessary expertise. Would someone who's qualified please take the task on?—PaulTanenbaum (talk) 20:02, 7 November 2019 (UTC)

Entropy game in external links is defunct?
It seems like the website directed by the "A java applet representing Shannon's Experiment to Calculate the Entropy of English" no longer has a working applet. Am I missing something? If not, how can this be fixed?

Eulslick (talk) 22:02, 26 April 2020 (UTC)

Using entropy vs using variance for measuring uncertainty in the results of a random sample
There is a nice question here: What does entropy capture that variance does not? The link doesn't provide a good answer, but I thought that the comparison of entropy with variance could be a good section to add to this article.

For example (from here): If you have a bimodal distribution with two peaks and allow the spacing between them to vary, the standard deviation would increase as the distance between the peaks increases. However, the entropy doesn't care about where the peaks are, so the entropy would be the same.
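A discrete toy version of the bimodal example can be sketched as follows. The assumptions are loud ones: the two "peaks" are reduced to two equally likely point values, the spacings are arbitrary, and the helper names are invented for the illustration.

```python
import math

def entropy_bits(probs):
    """Discrete Shannon entropy in bits; depends only on the probabilities."""
    return sum(-p * math.log2(p) for p in probs if p > 0)

def std_dev(values, probs):
    """Standard deviation; depends on where the values sit."""
    mean = sum(v * p for v, p in zip(values, probs))
    return math.sqrt(sum(p * (v - mean) ** 2 for v, p in zip(values, probs)))

probs = [0.5, 0.5]  # two equally weighted peaks
for spacing in (1, 10, 100):
    values = [-spacing / 2, spacing / 2]
    print(spacing, std_dev(values, probs), entropy_bits(probs))
```

The standard deviation grows in proportion to the spacing, while the entropy stays at 1 bit because the values themselves never enter the entropy calculation.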

Another insight is that variance is easiest to interpret when we have a near normal distribution (in the sense of how much of data is X SD jumps from the mean). Whereas with entropy the interpretation is that if it goes up in one data set vs another it means that the one with the higher entropy is "closer" to a uniform distribution.

I'm sure there are references for these statements.

What do others think - would adding such a section to this article make sense?

Tal Galili (talk) 19:55, 21 May 2020 (UTC)

Efficiency
I'm not sure what to make of this, but the efficiency section seems off: "A source alphabet with non-uniform distribution will have less entropy than if those symbols had uniform distribution (i.e. the "optimized alphabet")." Isn't that the point of entropy? Also, dividing by log(n) is a normalization by alphabet SIZE (since 0 <= H(x) <= log(n)), not anything about its distribution. Could somebody more knowledgeable than me have a look at this? — Preceding unsigned comment added by 2001:16B8:2E50:3200:DDEA:47B9:21D0:B732 (talk) 08:51, 5 August 2020 (UTC)
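To make the normalization concrete, here is a minimal sketch that assumes efficiency is defined as H(X) / log(n) for an alphabet of n symbols; the function names and the example distributions are hypothetical.

```python
import math

def entropy_bits(probs):
    """Shannon entropy in bits."""
    return sum(-p * math.log2(p) for p in probs if p > 0)

def efficiency(probs):
    """Entropy divided by its maximum, log2(n), for an n-symbol alphabet."""
    n = len(probs)
    return entropy_bits(probs) / math.log2(n)

uniform = [0.25, 0.25, 0.25, 0.25]
skewed = [0.7, 0.1, 0.1, 0.1]

print(efficiency(uniform))  # 1.0
print(efficiency(skewed))   # below 1
```

Dividing by log(n) rescales by alphabet size only; the distribution enters through H(X), which reaches log(n) exactly when the symbols are equally likely.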

Checksums are not part of coding algorithms
@Bilorv In version https://en.wikipedia.org/w/index.php?title=Entropy_(information_theory)&oldid=968833507 you added "In practice, compression algorithms deliberately include some judicious redundancy in the form of checksums to protect against errors." I don't think that is correct, and none of the mentioned algorithms says anything about checksums (by searching for "checksum" in the equivalent wiki article). If that is really correct I think it deserves some references or at least some examples. For compression algorithms there is no point in adding checksums. I suggest deleting that sentence.

--Jarl (talk) 06:06, 3 September 2020 (UTC)
 * Hi. I didn't add the sentence—I just moved it from another part of the article when doing some reorganisation. I'm only familiar with the maths of information theory and not any of the details of compression algorithms commonly used today so I cannot verify whether the statement is true or false (either is plausible to me). As the statement is unsourced, you are free to remove it if you believe it should be removed. — Bilorv ( talk ) 07:07, 3 September 2020 (UTC)

Error in first sentence of "Further properties" section
"The Shannon entropy satisfies the following properties, for some of which it is useful to interpret entropy as the amount of information learned (or uncertainty eliminated) by revealing the value of a random variable X" is incorrect, but can be fixed by adding the word "expected": The Shannon entropy satisfies the following properties, for some of which it is useful to interpret entropy as the expected amount of information learned (or uncertainty eliminated) by revealing the value of a random variable X.

For example, consider the joint distribution of X and Y, where X has two values A and B with probabilities .99 and .01 respectively. Suppose that when X = A there are 2 equally likely values for Y, and when X = B there are 32 equally likely values for Y. Then H(X, Y) = 1.12 bits. If X is revealed to be A, the uncertainty decreases to 1 bit, but if X is revealed to be B, the uncertainty increases to 5 bits. Thus the expected value of the remaining uncertainty, aka H(Y|X), is 1.04 bits, so the expected amount of information learned is 1.12 - 1.04 = 0.08, which equals H(X).

It might be counterintuitive that the uncertainty can increase when X is revealed, but I think this is quite common in real life. p(X = A) = 0.99 implies that we've lived through case A many times before, and we have a good understanding of what can happen. When X = B happens, we're not familiar with this situation, and we don't know what to expect. 2600:1700:6EC0:35CF:E8F8:B2A6:C652:5C13 (talk) 00:24, 3 April 2022 (UTC)
 * An interesting point: if I have understood correctly, it is that uncertainty can increase conditional on particular r.v. values (in some cases, $$\Eta (Y|X=x)>\Eta (X,Y)$$), but the average uncertainty will never increase based on revealing a value of an r.v. (in all cases, $$\Eta(X,Y)=\Eta(X)+\Eta(Y|X)$$ and each term is non-negative). I have added the word "expected" where you suggest. In future, you can boldly make such changes yourself, and use the talk page if a justification will not fit in the edit summary. — Bilorv ( talk ) 11:14, 4 April 2022 (UTC)
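The figures in the example above can be reproduced with a short script. This is a minimal sketch: the variable names and the `entropy_bits` helper are illustrative assumptions, and only the probabilities (0.99 and 0.01) and the counts of equally likely Y values (2 and 32) come from the comment.

```python
import math

def entropy_bits(probs):
    """Shannon entropy in bits of a list of probabilities."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# X = A with probability 0.99 (2 equally likely Y values),
# X = B with probability 0.01 (32 equally likely Y values).
p_x = {"A": 0.99, "B": 0.01}
n_y = {"A": 2, "B": 32}

h_x = entropy_bits(list(p_x.values()))

# Uncertainty about Y once a particular value of X is revealed.
h_y_given_value = {x: math.log2(n) for x, n in n_y.items()}

# Expected remaining uncertainty H(Y|X), averaged over X.
h_y_given_x = sum(p_x[x] * h_y_given_value[x] for x in p_x)

# Joint entropy H(X, Y), computed directly from the 34 joint outcomes.
joint = [p_x[x] / n_y[x] for x in p_x for _ in range(n_y[x])]
h_xy = entropy_bits(joint)

print(round(h_x, 2), round(h_y_given_x, 2), round(h_xy, 2))
print(h_y_given_value["B"] > h_xy)  # revealing X = B raises the uncertainty
```

Rounded to two decimals, the script gives the 0.08, 1.04 and 1.12 bits quoted in the comment, and the chain rule H(X,Y) = H(X) + H(Y|X) holds up to floating-point error.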

Second law of information dynamics
It is the opposite of the second law of thermodynamics: while entropy can only stay the same or increase, the information entropy of a system tends to decrease. (source: scitechdaily). — Preceding unsigned comment added by 151.18.144.86 (talk) 16:37, 7 September 2022 (UTC)

"S" preferable to "X"?
The usual notation for measures of entropy is a capital "S". It would be more appropriate than the "X"es, wouldn't it? 2601:140:9182:4E50:6436:D263:C72E:89DA (talk) 01:28, 19 January 2023 (UTC)


 * H is the usual symbol for Information Entropy. The X used in this article is for the random variable that the entropy is being calculated for. S seems to be used in the section on Thermodynamic Entropy, as you suggest. David Malone (talk) 08:37, 19 January 2023 (UTC)
 * X is typical for a random variable, and it's what I saw in a mathematical treatment of information theory; I'm not aware of where S might be preferred but I think X is fine. (Slight pedantry—it's actually a capital eta, $$\Eta$$, not the letter 'H'.) — Bilorv ( talk ) 11:41, 22 January 2023 (UTC)

Mistake in “Further properties”?
I think the proof in the second point of the section "Further properties" is incorrect. -log(x) (negative log) is a convex function, so Jensen's inequality should have the opposite sign. This can be checked with a very simple example, e.g. p_0=3/4 and p_1=1/4. Moreover, note that the right-hand side of the first inequality is the Renyi-2 entropy and the left-hand side is the Shannon entropy, and indeed the latter must be greater than or equal to the former, not less than or equal to it as stated in the article. Finally, I believe H(p) < log(n) is proved in a different way in the reference given (Cover and Thomas). If what I am saying is correct, I suggest replacing this line with a proper proof, e.g., from Cover and Thomas. Qlodo (talk) 09:11, 20 September 2023 (UTC)
 * Checking your example, I agree the text looks wrong. I've rephrased to omit a proof (and be slightly clearer about what n is). Are you able to add a correct proof, from Cover and Thomas or another text you can access? When you find a mistake on Wikipedia the etiquette is to fix it yourself. — Bilorv ( talk ) 21:14, 24 September 2023 (UTC)
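The example p_0=3/4, p_1=1/4 can be checked numerically. This is a minimal sketch with illustrative variable names, using base-2 logarithms as an assumption (the direction of the inequalities does not depend on the base). It compares the Shannon entropy with the Renyi-2 entropy, -log of the sum of p_i squared, and with log(n).

```python
import math

p = [3 / 4, 1 / 4]

# Shannon entropy: the expectation of -log2 p(X).
shannon = -sum(q * math.log2(q) for q in p)

# Renyi-2 (collision) entropy: -log2 of the expectation of p(X).
renyi2 = -math.log2(sum(q * q for q in p))

# Upper bound log2(n) for an alphabet of n symbols.
log_n = math.log2(len(p))

print(shannon, renyi2, log_n)
print(renyi2 <= shannon <= log_n)
```

For this distribution the Shannon entropy sits between the Renyi-2 entropy and log(n), which is the direction Jensen's inequality gives for the convex function -log.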