Talk:Information theory/Archive 2

Pending tasks (what do we want?)
Here are how I interpret the pending tasks:


 * Make sure the intro and overview are understandable, with daily-life examples. Making sure the reader can subjectively tell what a "bit" is, for example, is a good start.  Discussion of elephant trumpets, whale songs, and black holes does not contribute in this manner, although their mere mention (with links, say, to the no hair theorem) may be useful to the curious.


 * Move applications up before mathematics (just after Overview) and make it independent. I'm not  sure this is for the best, as Overview states some applications and, for the rest, it's useful to know some theory.  Also, we need to be careful not to conflate security and information theory, since information knowledge is necessary for, yet not sufficient for, breaking ciphers.  (For example, a Diffie-Hellman key exchange assumes that an adversary will have full knowledge, but the form of that knowledge assures that the adversary will not be able to use it within a reasonable amount of time.  Thus, although secure in practice, Diffie-Hellman key exchange is information-theoretically insecure).


 * A separate section on the theory already exists (although it could use a mention of a few more concepts, e.g., the asymptotic equipartition property).


 * Move history down to end (just before References). Again, I'm not sure why this is best, and some "good" articles, like Thermodynamics, list history first.  If moved, we need to check that this is just as readable, but the reader doesn't need to know scientific history to understand scientific phenomena.  Clearly more post-1948 history is needed.  Most applications of information theory, even elementary ones, are post-1948 (Huffman codes, arithmetic codes, LDPC codes, Turbo codes), and much of the theory is too, e.g., information divergence, information inequalities, Fisher sufficient statistics, Kolmogorov complexity, network information theory, etc.  However, it might be best to move the history to its own article.  The current section is far too in-depth.

There's still a lot to do and a lot to add, and we might reasonably ask whether certain topics should be shortened or omitted, e.g., Kullback–Leibler divergence (which, although important, can generally be omitted from an elementary explanation), differential entropy (which is most important for transinformation, which can be interpreted as a limit of discrete transinformation), and gambling (which most information theorists I know love, but which is a somewhat fringe topic). Again, the thermodynamics article is shorter and far less mathematical. Do we want this article to be long with full mathematical explanations, or medium-sized with general explanations and links to the math? I don't know the right answer, but it might be nice to get some ideas of what people would like out of this.

A few final things to keep in mind: Due to these factors, it's good to discuss significant changes on the talk page. I find my "peer edited" sections are a lot better than my first attempts. And too much unilateral change can result in an unstable, unreadable, and flat-out wrong article. Calbaer 01:44, 15 June 2006 (UTC)
 * Diversions or examples absent of explanation confuse rather than clarify.
 * Sentences should be as concise as possible, but no more so, for easy reading.
 * Proofread before (or at least immediately after) changes.
 * Others might not read the same way as you.
 * I say keep the K–L divergence. Not only is it important, but it is really useful in understanding the mutual information.  The treatment of other stuff like gambling, intelligence, and measure theory could be cut.  The measure theory section shows a useful mnemonic device, but the analogy is incomplete due to the failure of the transinformation to remain non-negative in the case of three or more random variables.  It could probably be cut out entirely.  History of information theory could indeed have its own article, but keep a short summary of history in this article here.  Also some mention of Gibbs' inequality would be nice as it shows the K–L divergence (and thus the bivariate mutual information) to be non-negative.  The AEP could be mentioned in the source theory section, since it is a property of certain sources.  Don't hesitate to add what you think should be added. --130.94.162.64 17:17, 20 June 2006 (UTC)
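Gibbs' inequality, mentioned above, states that the K–L divergence is non-negative, with equality only when the two distributions coincide. A quick numeric check (toy distributions of my own, not from the discussion):

```python
import math

def kl_divergence(p, q):
    """Kullback-Leibler divergence D(p||q) in bits.

    Assumes q[i] > 0 wherever p[i] > 0; terms with p[i] == 0 contribute 0.
    """
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.25, 0.25]
q = [1/3, 1/3, 1/3]

# Gibbs' inequality: D(p||q) >= 0, with equality iff p == q
print(kl_divergence(p, q))  # positive, since p differs from q
print(kl_divergence(p, p))  # exactly 0
```

The same inequality is what makes the bivariate mutual information non-negative, since it is the divergence between the joint distribution and the product of the marginals.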
 * K-L divergence is certainly fundamental, so it should stay. Measure theory is also fundamental, but it is not necessary for a good understanding of information theory, so it could go.  History could be summarized and moved.  I'll get to this when I can, though if someone else wants to do so first, go ahead. Calbaer 18:36, 20 June 2006 (UTC)


 * the section that mentions measure theory should be kept, or moved somewhere else, rather than deleted. also, the following sentence in that section is misleading/incorrect: "it justifies, in a certain formal sense, the practice of calling Shannon's entropy a "measure" of information." entropy is not a measure: given a random variable f, one integrates f ln f against a suitable measure (the Lebesgue measure, the counting measure, etc.) to get the entropy. Mct mht 18:52, 20 June 2006 (UTC)
 * I think that the section confuses more than it helps; most readers will not have a working knowledge of measure theory, and, as you say, the section needs at least a little reworking. I'll delete it and add a link to Information theory and measure theory. Calbaer 19:47, 20 June 2006 (UTC)
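Mct mht's point, that one obtains an entropy by integrating against a suitable measure, can be illustrated numerically: the counting measure yields Shannon entropy, while the Lebesgue measure yields differential entropy. A sketch (toy uniform examples of my own):

```python
import math

# Shannon entropy: integrate -p*log p against the counting measure (a sum)
p = [0.25, 0.25, 0.25, 0.25]
H_discrete = -sum(pi * math.log2(pi) for pi in p)  # 2 bits

# Differential entropy: integrate -f*log f against the Lebesgue measure,
# here by a crude Riemann sum for the uniform density f = 1/4 on [0, 4]
a, n = 4.0, 10000
dx = a / n
f = 1.0 / a
H_diff = sum(-f * math.log2(f) * dx for _ in range(n))  # ~ log2(4) = 2 bits

print(H_discrete, H_diff)
```

Both come out to 2 bits here, but only because the examples were rigged to match; in general the two quantities behave quite differently (differential entropy can even be negative).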

Seems to me a big priority should be to break out all the details into other articles. Information Theory is very much a science. If you look at the "Mathematics" Wikipedia article, you'll see that there are few actual mathematics concepts discussed. It's more a history and categorization of the sub-sciences of mathematics, with links. I picture "Information Theory" as a fairly small article (only because the science is young) giving only a bit more than a dictionary definition of "information theory", and then a ton of links with brief descriptions to categories, related theories, practical applications, algorithms, coding methods, etc. I think this article should be more of a starting point. Details should be elsewhere. Also, until this can be done, at least a set of links might be nice. I guess I'm saying here I'm surprised that a search of the "Information Theory" article in Wikipedia doesn't find the phrase "forward error correction"! qz27, 22 June 2006


 * Good point, and one that argues for the elimination of the following as all but links: Self-information, Kullback–Leibler divergence, differential entropy. We shouldn't be scared to have a little math in the article, but regular, joint, and conditional entropy can be defined together, and mutual information in terms of them.  That's enough for the idealizations of source and channel coding.  If people want to know why the math is as it is, there could be a "definitions in information theory" article or something like that.  Two things though: 1) "error correction" is mentioned three times in the article (and I don't think it's that important to specifically state FEC in a "starting point" article) and 2) are you volunteering to do this? Calbaer 06:13, 23 July 2006 (UTC)


 * I went through much of the article and tried to make it more approachable with links to unnecessary math. Even now it may have too much math, and someone should probably reorganize it accordingly. Calbaer 22:45, 24 July 2006 (UTC)

The "Intelligence and Secrecy" section looks like someone's weird pet theory disguised as encyclopedic text. What does the Shannon-Hartley theorem have to do with keeping secrets? Even if this connection exists, it is far from clear here. kraemer 15:01, 11 June 2007 (UTC)


 * Information theory is everything in crypto -- entropy is its fundamental concept. The real questions are whether cryptography merits the mention (probably) and whether the subsection is written well enough to keep (probably not). CRGreathouse (t | c) 15:09, 11 June 2007 (UTC)


 * The text has been there since December 2005 and was likely written by User:130.94.162.64, who wrote the bulk of the first draft of this article.  I've criticized him (or her) for a mistake or two, but the truth is that, without that user, this article would be far less rich.  Unfortunately, there's been no sign of that IP since June 21, 2006, but still, I'd say that the text should be modified to something a bit better rather than simply removed, especially since it has, in some sense, "stood the test of time."  But I agree it should be improved.  It's important to express the fundamental idea of information theory in cryptography: that methods with keys shorter than their plaintext are theoretically breakable, and that modern crypto pretty much counts on the hypothesis &mdash; neither proven nor disproven &mdash; that such methods require enough steps to break as to make them secure for any reasonable amount of time (e.g., the timespan of the universe). Calbaer 16:40, 11 June 2007 (UTC)
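Calbaer's "keys shorter than their plaintext" point is exactly what Shannon's unicity distance quantifies. A back-of-envelope sketch (the English-redundancy figures here are rough textbook approximations of my own choosing, not from this thread):

```python
import math

# Illustrative unicity-distance estimate for a simple substitution cipher.
# U = H(K) / D, where D is the plaintext redundancy per character.
key_entropy = math.log2(math.factorial(26))           # H(K) ~ 88.4 bits
redundancy_per_char = math.log2(26) - 1.5             # ~ 3.2 bits/char,
# assuming roughly 1.5 bits/char of entropy for English text
unicity_distance = key_entropy / redundancy_per_char  # ~ 28 characters
print(round(unicity_distance))
```

So on the order of 28 ciphertext characters suffice, in principle, to pin down the key of a simple substitution cipher uniquely; the point is how small that number is compared to the key space.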


 * The text in question is the material surrounding the assertion, "It is extremely hard to contain the flow of information that has high redundancy." This has nothing to do with the Shannon-Hartley Theorem, which is about the relationship between channel noise and channel capacity.  The difficulty in keeping a secret that is known to multiple people is not a function of the signal-to-noise ratio of any informational vector in any obvious way.  If this is clear to someone else then the connection should be made explicit.  If these are, as I suspect, unrelated ideas, then this section should be rewritten.  The relationship between cryptography and information theory is indeed important -- far too important to be represented by some vanished author's unsupported ranting.   kraemer 18:48, 18 June 2007 (UTC)


 * To be fair, the text has degraded, so we shouldn't call it a "vanished author's unsupported ranting." Anyway, here's a replacement that I'll do if people like it:
 * Information theoretic concepts apply to making and breaking cryptographic systems. Such concepts were used in breaking the German Enigma machine and hastening the end of WWII in Europe.  Shannon himself defined an important concept now called the unicity distance. Based on the redundancy of the plaintext, it attempts to give a minimum amount of ciphertext necessary to ensure unique decipherability.


 * Information theory leads us to believe it is much more difficult to keep secrets than it might first appear. A brute force attack can break systems based on public-key cryptography or on most commonly used methods of private-key cryptography, such as block ciphers.  The security of such methods comes from the assumption that no known attack can break them in a practical amount of time, e.g., before the universe meets its ultimate fate.  Information theoretic security refers to methods such as the one-time pad which are not vulnerable to such brute force attacks.  However, as in any other cryptographic system, one must be careful to correctly apply even information-theoretically secure methods; the Venona project was able to crack the one-time pads of the Soviet Union due to their improper reuse.
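The Venona example above can be made concrete: reusing a one-time pad cancels the key out entirely, leaving the XOR of the two plaintexts for an attacker to analyze. A toy sketch (made-up messages and a stand-in pad, purely illustrative):

```python
# Why one-time pad *reuse* is fatal, as with Venona: XORing two ciphertexts
# encrypted under the same pad cancels the key, leaking p1 XOR p2.
p1 = b"ATTACK AT DAWN"
p2 = b"RETREAT AT ONE"  # same length, for this sketch
key = bytes([0x5A ^ i for i in range(len(p1))])  # stand-in pad, NOT truly random

c1 = bytes(a ^ k for a, k in zip(p1, key))
c2 = bytes(a ^ k for a, k in zip(p2, key))

xor_ct = bytes(a ^ b for a, b in zip(c1, c2))
xor_pt = bytes(a ^ b for a, b in zip(p1, p2))
print(xor_ct == xor_pt)  # the pad has vanished from the combination
```

With the key gone, the redundancy of natural-language plaintext is what lets the cryptanalyst pry the two messages apart, which is an information-theoretic attack in exactly the sense discussed above.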


 * Calbaer 21:00, 18 June 2007 (UTC)


 * This is all true, though the connection between cryptographic attack and information theory could be drawn more explicitly. Big improvement though!  kraemer 21:22, 18 June 2007 (UTC)


 * Well, if you have any modifications, you can make them and add them. Also, the PRNG subsection should be modified; for such generators, e.g., extractors, min-entropy, not the more common and fundamental information theoretic quantity of entropy, is the correct measurement. Calbaer 16:14, 19 June 2007 (UTC)
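The distinction Calbaer draws can be shown numerically: min-entropy depends only on the single most likely outcome and never exceeds Shannon entropy, which is why it is the right yardstick for extractors. A toy example (my own numbers):

```python
import math

# Min-entropy vs. Shannon entropy for a biased distribution. Randomness
# extractors are judged by H_min = -log2(max_i p_i), never larger than H.
p = [0.5, 0.25, 0.125, 0.125]

H = -sum(pi * math.log2(pi) for pi in p)  # Shannon entropy: 1.75 bits
H_min = -math.log2(max(p))                # min-entropy: 1 bit
print(H, H_min)
```

The gap (1.75 vs. 1 bit here) is exactly why a source can look "random on average" while still being too predictable, in the worst case, for cryptographic use.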

Game theory as a template?
Game theory was recently a featured article. Perhaps information theory should look more like that? Calbaer 03:52, 2 August 2006 (UTC)

Removed tag, added poached images, reworked header
The article has been largely dormant and those who have tagged it &mdash; not to mention have made many contributions &mdash; are no longer in the Wikipedia community with the same aliases. (One was kicked off in September; the other stopped contributing and/or changed IP address in June.) It's not a perfect article, but it's better, so I removed the tag. Any advice for how to improve it &mdash; by novice or by expert &mdash; would be appreciated, especially on this talk page. I feel some of the article is a bit too redundant, but it's better to be redundant than to omit or underplay basic facts. I added some images from other articles to visually illustrate the concepts, which will hopefully get people more interested in the subject and more quickly make clear what it's all about. Finally, I made the disambiguation header shorter, updated it (to reflect the library/information science split), and added informatics. By the way, my apologies for misstating that I thought "K-L divergence should stay." I meant that mutual information should stay. K-L divergence may be a bit too confusing for those who don't want to explore it as a separate topic. Calbaer 21:59, 25 October 2006 (UTC)


 * The illustration for Information theory is missing labels for X and Y, which are referred to in the text. 198.145.196.71 23:28, 13 September 2007 (UTC)


 * I changed it to one that has x and y marked (these are elements of the spaces X and Y). Dicklyon 00:33, 14 September 2007 (UTC)


 * Perhaps the "Channel capacity" and "Source theory" sections could be swapped, since source theory talks about rate, which is a prerequisite for channel capacity. 198.145.196.71 23:31, 14 September 2007 (UTC)


 * That's fine, though "coding theory" shouldn't come after the two as a hierarchical sibling to them (a change made here). It should come first and/or be under something different, preferably both.  The idea for this was to explain source and channel coding, then mention their applications under "coding theory."  (In fact, calling these "applications" is rather loose; maybe I'll change it around.) Calbaer 00:22, 15 September 2007 (UTC)


 * Coding theory is really important. It is the main application of information theory, its raison d'être, so to speak.  It really shouldn't look like just another "application" among several others.  Just go ahead and make your changes, and let's look at it from there.  Let's make sure all the section and subsection titles are accurate, too. 198.145.196.71 15:28, 15 September 2007 (UTC)


 * You could make similar arguments for channel capacity and source theory. Maybe that section needs to be split into two.  Think up a good heading and go for it. Dicklyon 16:11, 15 September 2007 (UTC)


 * Coding theory, source theory, and channel capacity have been rearranged now. A little blurb on source coding could go into the source theory section, and something about channel coding could be said in the channel capacity section, possibly with another diagram somewhat along this line:
 * Source --> Encoder --> Noisy Channel --> Decoder --> Receiver.
 * 198.145.196.71 01:32, 21 September 2007 (UTC)

Modern information definitions
In 2006, Deng Yu et al. used the standard logic definition mode of "genus plus specific difference" (also called a connotation definition). This definition mode is expressed by the following formula:

A defined term = nearest genus + specific difference.

They changed the original negative Shannon and Wiener definitions of information into positive definitions.

Reversal of Wiener's information definition
Information is information; information is matter, energy, and the indication of attributes. This is the opposite of Wiener's information definition.

Affirmative reversal of Shannon's information definition (2006)
Reversed Shannon information definition: information is an increase in certainty. (Shannon: information is a measure of one's freedom of choice when one selects a message.)

Corresponding formula

Ir = -log Pi + 1

or

Ir' = log((N-ni)/N) = log(nq/N) = log Pq

Shannon information, in the form of negative entropy (uncertainty), is formally transformed into positive entropy (degree of certainty). Compare the original negative form of Shannon's information formula:

I = -log Pi = -log(ni/N) = -(log ni - log N) = log N - log ni = -log((N-nq)/N) = 1 - 1 - log Pi = 1 - (1 + log Pi) = (1 - log Pi) - 1

Deng's information definition
In 2002 and 2004, Deng Yu et al.: information is the set of markings (indications) of things, phenomena, and their attributes.


 * Deng Yu et al. Journal of Mathematical Medicine, 2004 (5). In Chinese.
 * Deng Yu et al. Journal of Mathematical Medicine, 2004 (6). In Chinese.
 * Deng Yu et al. Standardization of Information Definition. Medical Information, 2006 (7).

—Preceding unsigned comment added by Jingluolaodao (talk • contribs) 06:47, 21 October 2008 (UTC)

Law of conservation of information
Deng's "law of information conservation": Deng Yu et al., Standardization of Information Definition, Medical Information, 2006 (7), in Chinese; and Deng Yu et al., Journal of Mathematical Medicine, 2000 (1). Information conservation and transformation law (basic information equation), definition 1: the total information flowing into a system equals the total information flowing out of the system plus the change in the system's internal information. Information can transform from one state to another; information can be created, and it can be lost. Expressed as a formula: NQ = NW + ΔNU

Definition (law definition) 2: the information conservation relation states that the increase of stored information in the system equals the information entering the system minus the information leaving the system: ΔNU = NQ - NW

change in the system's stored information = information entering the system - information leaving the system = newly created information - lost ("vanished") information

The change of information in the system equals the information newly created in the system minus the information the system loses (through leaving, loss of storage, vanishing, or elimination). Change in the system's stored information = newly created information - lost information

ΔNU = Ncre - Nlos

The overall flow of information into the system must equal the total outflow from the system plus the change in the system's internal information; information can be converted from one state into another; information can be created, and it can be lost. The formula used is NQ = NW + ΔNU


 * Nonsense. — Arthur Rubin  (talk) 15:25, 2 November 2008 (UTC)

Entropy
In the entropy section the article says:

If $$\mathbb{M}\,$$ is the set of all messages $$m$$ that $$M$$ could be, and $$p(m)=Pr(M=m)$$, then $$M$$ has


 * $$ H(M) = \mathbb{E}_{M} [-\log p(m)] = -\sum_{m \in \mathbb{M}} p(m) \log p(m)$$

''bits of entropy. An important property of entropy is that it is maximized when all the messages in the message space are equiprobable &mdash; i.e., most unpredictable &mdash; in which case $$H(M) = \log |M|.$$''

but if all states are equiprobable then $$p(m)=\frac{1}{|\mathbb{M}|}$$ so


 * $$H(M)=-\sum_{m \in \mathbb{M}} p(m) \log p(m)=-\sum_{m \in \mathbb{M}} \frac{1}{|\mathbb{M}|} \log \frac{1}{|\mathbb{M}|}=\log|\mathbb{M}|$$

not $$\log |M|$$. Thesm 08:52, 6 December 2006 (UTC)
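Thesm's computation is easy to check numerically; a quick sketch (toy distributions of my own) confirming that the uniform distribution maximizes entropy at the log of the message-space size:

```python
import math

def entropy(p):
    """Shannon entropy in bits."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

uniform = [0.25] * 4
skewed = [0.7, 0.1, 0.1, 0.1]

H_uniform = entropy(uniform)  # log2 of the message-space size: 2 bits
H_skewed = entropy(skewed)    # strictly less than 2 bits
print(H_uniform, H_skewed)
```

Any deviation from equiprobability lowers the entropy, and the uniform value is exactly log2 of the number of messages, as the derivation above shows.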


 * Changed; it was a slight abuse of notation. Though for something like this, you can Be bold.  Worst thing that can happen is that you'd be reverted. Calbaer 17:26, 6 December 2006 (UTC)

Boltzmann's entropy and von Neumann anecdote
No history of information theory would be complete without this famous anecdote:  Claude Shannon asked  John von Neumann which name he should give to this cool new concept he discovered: $$- \sum p_i \log_2 p_i \!$$. Von Neumann replied: "Call it H." Shannon: "H? Why H?" Von Neumann: "Because that's what  Boltzmann called it."

Ludwig Boltzmann introduced the concept in 1870. Compare: Boltzmann, Ludwig (1896, 1898). Vorlesungen über Gastheorie : 2 Volumes - Leipzig 1895/98 UB: O 5262-6. English version: Lectures on gas theory. Translated by Stephen G. Brush (1964) Berkeley: University of California Press; (1995) New York: Dover ISBN 0-486-68455-5

Algorithms 19:51, 7 June 2007 (UTC)
 * If no history would be complete without it, add it to the History of information theory article, not a section of an article focusing on the idea, not the history. You might also want to find a source for it, since I've been in information theory for ten years and have never heard the "famous anecdote."  Googling, I only encounter one webpage with the story, told secondhand without attribution.  (The reference you give is for "H", not your "H story," I'm assuming.)  For both these reasons, I don't think it should be in this article. Calbaer 00:13, 8 June 2007 (UTC)


 * Some of these books support the idea that von Neumann suggested Shannon use entropy after Boltzmann. This one has it as a quote from Shannon, not quite in that form. It's already quoted in History of information theory. Dicklyon 00:29, 8 June 2007 (UTC)


 * Calbaer reversed and wrote: "Anecdote doesn't belong; order is by relevance, not strictly chronological." But what could possibly be more relevant than the original entropy formulation of Boltzmann, providing the very foundations of information theory? Surprisingly, Boltzmann's work is even omitted from the references. This history seems a bit one-sided. Algorithms 19:47, 9 June 2007 (UTC)


 * That's because Boltzmann wasn't an information theorist. You would no more expect to find him on this page than Newton on the aerospace engineering page or Euler on the cryptography page.  Yes, they did the math and statistics that made those fields possible &mdash; and Boltzmann merits a mention on the history page, but to give him a good portion of the short history section overstates his role in information theory.  And random anecdotes also don't belong.  I wouldn't want the history to get too dry, but overstating the relations between information theory and thermodynamics &mdash; which, although many, are largely unimportant &mdash; would not be beneficial to the article, especially in trying to make it accessible to those who may already be daunted by information theory being a mix of engineering, CS, stat, and math.  Does anyone else feel the Boltzmann connection merits this much attention? Calbaer 00:39, 10 June 2007 (UTC)


 * I'm not so sure, either, how much of the history section should be devoted to Boltzmann, but I disagree on the importance of the relations between information theory and thermodynamics. The connection between the two is fundamental and deep, so much so, in fact, that I can say that one bit of (information-theoretic) entropy is equal to Boltzmann's constant times the natural logarithm of 2.  I'm not sure, though, how much of that belongs in this article, which is already quite long.  (My old IP address was 130.94.162.64, by the way.) -- 198.145.196.71 18:27, 25 August 2007 (UTC)
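The conversion the IP editor states can be computed directly; a one-line sketch (using the SI value of Boltzmann's constant):

```python
import math

# One bit of information-theoretic entropy expressed in thermodynamic units:
# S = k_B * ln(2), with Boltzmann's constant k_B in joules per kelvin.
k_B = 1.380649e-23  # J/K (exact by the 2019 SI redefinition)
one_bit_in_JK = k_B * math.log(2)
print(one_bit_in_JK)  # about 9.57e-24 J/K
```

The tiny size of that number is one informal way to see why the thermodynamic connection, however deep, rarely matters in engineering applications of information theory.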


 * Welcome back. I guess I meant "unimportant" in the context of the article or your average textbook or paper on information theory.  I didn't meant to trivialize the connection, which is indeed deep, but is usually not important for someone looking for either a reference or tutorial on information theory. Calbaer 20:09, 25 August 2007 (UTC)

Source theory and rate
I still don't like the expression


 * $$r = \lim_{n \to \infty} \mathbb E H(X_n|X_{n-1},X_{n-2},X_{n-3}, \ldots);$$

because it isn't clear what probability distribution the expectation is taken over. There is already a limit in this expression on n, and an expectation is implicit in the conditional entropy. 198.145.196.71 22:08, 15 September 2007 (UTC)


 * I don't care for it much either, though I wrote it. The formula previously didn't have the limit, which made sense only if the conditional entropy was independent of n (previously called t here); but it did have the expectation, which seems to me is needed if the conditional entropy varies with the identity of the previous symbols.  I think the distribution is over those previous values.  Maybe we can find a source about this.  Dicklyon 22:44, 15 September 2007 (UTC)


 * OK, you're right, according to this book, eq. 2.34. Dicklyon 22:47, 15 September 2007 (UTC)


 * Actually the expression (without the expectation) is correct for a strictly stationary process, as you noted in the article. For a non-stationary process, the more general expression I previously added to the article (which I stole from Entropy rate) is needed (as you also noted, apparently). 198.145.196.71 23:28, 15 September 2007 (UTC)
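For a strictly stationary Markov source, the limit in question reduces to the conditional entropy of one symbol given the previous one, averaged over the stationary distribution. A sketch with an assumed two-state transition matrix (my own toy numbers, not from the article):

```python
import math

# Entropy rate of a stationary two-state Markov source: the conditional
# entropy H(X_n | X_{n-1}), averaged over the stationary distribution.
P = [[0.9, 0.1],
     [0.5, 0.5]]  # transition probabilities (an assumed toy source)

# Stationary distribution pi solves pi = pi P; closed form for 2 states:
pi0 = P[1][0] / (P[0][1] + P[1][0])
pi = [pi0, 1 - pi0]

rate = sum(pi[i] * -sum(p * math.log2(p) for p in P[i] if p > 0)
           for i in range(2))
print(rate)  # bits per symbol, strictly less than 1 bit for this chain
```

The expectation that the thread is discussing is exactly the average over the previous-symbol distribution: each row of P contributes its conditional entropy, weighted by how often the chain sits in that state.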

Help request
How about some help over on the article Information theory and measure theory? I wrote most of that article over a year ago, and it's still complete mess. 198.145.196.71 20:32, 25 August 2007 (UTC)


 * See these books for some good ideas. The online ref mentioned in that article doesn't even mention measure theory, so no wonder it's a mess. Dicklyon 05:51, 28 August 2007 (UTC)


 * I meant to be bold and help edit the article, and maybe take the expert tag off if that scares people from editing. I don't mean to claim ownership of it.  More at Talk:Information theory and measure theory. 198.145.196.71 17:06, 7 September 2007 (UTC)

Variety
I had never heard of Ashby's concept of variety (cybernetics) until it was added to, and removed from, the see-also list in this article. So I looked in Google Books, and found all kinds of stuff about it. Some books on information theory, such as Kullback's Information Theory and Statistics, refer to Ashby's 1956 book Introduction to Cybernetics as part of the literature on information theory, and a big chunk of this book is about "variety". And his "requisite variety" paper is referenced in Uncertainty-Based Information: Elements of Generalized Information Theory. I don't see any reason to exclude this concept from the see-also list here on the basis of it being not information theory. Someone is being too picky, methinks. Dicklyon 03:41, 17 September 2007 (UTC)


 * I think it's helpful and just fine to mention something like this in the see-also list. 198.145.196.71 21:18, 22 September 2007 (UTC)

My main problem is that the article, as it stands, doesn't seem to have much to do with information theory. While variety itself may be relevant, the article reads more like one on social sciences. ⇌Elektron 12:27, 24 September 2007 (UTC)


 * These guys are doing a lot of "cybernetics" articles, which are topics that sort of bridge information theory's very mathematical style to the style of the social sciences. It's not what you're used to, but shouldn't be excluded from a see-also list on that basis.  It may be more like what some readers are looking for. Dicklyon 14:34, 24 September 2007 (UTC)

Page is too long
I suggest removing all technical and specialized material and instead placing it on the appropriate separate pages. I do not see why coding theory needs more than a paragraph or two on this page. Coding theory is a rich field in its own right...it is connected to information theory but I don't think we need to have a full exposition of it duplicated on this page. The same goes for other sections. I think this page should remain general-audience. As such, I think more discussion of information theory's applications and usefulness would be a good addition to the page. Cazort (talk) 20:37, 26 November 2007 (UTC)


 * I am the one originally responsible for much of this article's length that you consider excessive. It is admittedly a long article, but on the other hand I like having a collection of the basic formulas on this page.  This page has by no means a full exposition of coding theory, only its very basic rudiments.  I do agree that more discussion of applications and usefulness would be helpful, although that might make the article even longer.  I would rather have an article that is a little bit too long than one that is too short and scanty on details. 198.145.196.71 (talk) 02:38, 20 December 2007 (UTC)
 * I think Cazort's objection is apposite. This is a field with multitudes of connections elsewhere, and any article which addresses these even in passing will not be very small. In addition, the suggestion that all technical content be removed to other articles is ill advised. This is a tricky and technical field. The proper remedy for unclarity caused by technical material is better writing or a better introduction to technical sections.
 * Our Gentle Readers should not be written down to by hiding the complexity of complex subjects. An article on a technical subject can never be made sufficiently innocuous that every Reader will not be put off. We should, instead, write very clearly, distinguishing between necessary technical material and supplemental technical material which could indeed be more fully explained in some other article(s). And banishing technical content has the very considerable disadvantage of requiring our Gentle Readers to chase links and assemble an adequate picture of a subject from them. This is a skill held by few, and by very few of our presumed Gentle Readers.
 * There are numerous reasons why Cazort's suggestion should not be followed. Concur, mostly, with 198.145. ww (talk) 21:41, 2 June 2008 (UTC)

Minor suggestion
The mathematical definitions in "Ways of measuring information" are introduced by useful, brief, intuitive definitions. Perhaps these should be italicized to draw attention to them for those who might be scared off by the mathematics. Alternatively, they could be indented in their own paragraphs. On a related note, I don't like the title of this section; it implies that these are different ways to measure the same thing, rather than different methods for measuring different things. It would be as though a description of wattage, amperage, voltage with respect to ground, and potential difference were titled "ways of measuring electricity." Calbaer (talk) 00:33, 19 July 2008 (UTC)
 * I've changed my mind on this issue. As far as I am concerned, please feel free to cut almost all the math out of "Ways of measuring information" and leave nothing but practical, intuitive, "English-language" definitions.   Maybe leave a formula or two, but nothing much more complicated than $$I=-p \log p.$$  Most of the mathematical details from this article have already been copied over to the article Quantities of information.  (It's nice to have those formulas listed in one place, but it doesn't have to be this article.) The math is/can be explained in even more detail in self-information, information entropy, conditional entropy, mutual information, and so forth.  According to stats.grok.se, this article gets several hundred to a thousand hits every day.  Having all the math here on the front page is just going to scare most of them away, and the article has been tagged as too technical for several years now. Deepmath (talk) 04:11, 19 July 2008 (UTC)
 * I've thought along these lines for some time. And even made a few edits to the intro, and some comments here, a while ago. The underlying ideas of information theory are important in many fields, and intrinsically. There is no reason, save the tradition in mathematical circles, to entangle those ideas with the machinery of their manipulation and proofs of their exact form. Shannon and Weaver managed to do so in their book 6 decades ago. We here should not allow ourselves to settle for something less now.
 * I'm with Deepmath here in favor of including plain English accounts in this article, which is one of first resort for those inquiring into the field. Certainly in the introduction. I'm less enthusiastic about removing all the technical material to another article (here Quantities of information).
 * There is a small but respected tradition of such high-quality writing in mathematical topics. Consider that Bertrand Russell, world class in several highly technical, state of the art fields, won the Nobel Prize for --- Literature. Not research into the foundations of mathematical logic, nor philosophy... Even we technical sorts should hie thee and do likewise. The tradition of obscurity and passive-voice fog in science should be avoided here in the Wikipedia. ww (talk) 21:48, 19 July 2008 (UTC)

log probability
Why does log probability redirect here? —Preceding unsigned comment added by 199.46.198.230 (talk) 14:12, 18 March 2009 (UTC)


 * because -ln(p) = I (I representing amount of information) Kevin Baastalk 16:10, 13 January 2011 (UTC)
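A minimal numerical sketch of that identity (the function name is my own, not from the discussion): the "log probability" of an outcome and its information content are the same quantity up to sign.

```python
import math

def self_information(p, base=2):
    """Self-information (surprisal) of an outcome with probability p.

    Base 2 gives bits; use base=math.e for nats.
    """
    return -math.log(p, base)

# A fair coin flip (p = 1/2) carries -log2(1/2) = 1 bit of information;
# rarer outcomes carry more.
print(self_information(0.5))    # 1.0
print(self_information(0.25))   # 2.0
```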

thermodynamic entropy
I'd like to add a section on the relationship of Shannon's information to thermodynamic entropy in the section on applications to other fields. It would draw heavily from Ben-Naim's book A Farewell to Entropy: Statistical Thermodynamics Based on Information, but would also mention the impact of information theory on the resolution of Maxwell's Demon. It would probably just be 1-2 paragraphs, but I might develop it into a separate article. Thoughts? Maniacmagee (talk) 22:01, 21 August 2009 (UTC)

technical note about log
I am following through the math, and there is a seeming discrepancy that should be corrected (if it is an error) or explained. The entropy of the random variable X involves the log function. For the general definition, just "log" is written, which usually means natural log in most fields. Later, log2 is written. I presume log2 is meant in both cases and that two notational conventions are represented here. I think the article should be fixed to use "log2" consistently. I don't want to change it before getting a response because everything I know about this entropy function I know from this article. Thoughts? Tbonepower07 (talk) 04:21, 16 February 2010 (UTC)


 * It is said in "Quantities of information" that "The choice of logarithmic base in the following formulae determines the unit of information entropy that is used." In computer science, log2 is the most natural logarithm. Similarly, in the case of the binary entropy function, you are most commonly interested in log2 since the function is the entropy of a value that can take only two values. However, the article could and probably should use the natural log all over. Nageh (talk) 14:40, 23 March 2010 (UTC)
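The point about the logarithmic base can be illustrated with a short sketch (the `entropy` helper and the example distribution are my own choices): switching between log2 and the natural log only rescales the result by ln 2, i.e. converts bits to nats.

```python
import math

def entropy(probs, base=2):
    """Shannon entropy of a discrete distribution; the log base fixes the
    unit of information (base 2 -> bits, base e -> nats)."""
    return -sum(p * math.log(p, base) for p in probs if p > 0)

p = [0.5, 0.25, 0.25]
h_bits = entropy(p, base=2)         # 1.5 bits
h_nats = entropy(p, base=math.e)    # the same uncertainty, measured in nats
# Changing the base only rescales the value: H_nats = H_bits * ln 2.
assert abs(h_nats - h_bits * math.log(2)) < 1e-12
```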

Overview
Having practically no experience as a contributor I hesitate to rewrite a section. On the other hand, having had a course in "Information Theory" at MIT in the 1950's, I believe that the Overview section has some serious errors. Should I undertake to rewrite large portions of that section?

"The main concepts of information theory can be grasped by considering the most widespread means of human communication: language." True enough, but the rest of the paragraph describes the main concepts of coding theory, not of information theory. The main concept of information theory is that a platitude such as "Thank you; come again" conveys less information the urgent plea, "Call an ambulance!" not because it is less important, but because it is less unexpected. In context, however, either of these messages might convey very little information, the former because it is not unexpected at the end of a transaction, the latter because if someone is obviously injured, "it goes without saying" that one should call an ambulance. The next main concept of information theory is that the speech channel has a capacity limit. If you talk faster, you convey information at a higher rate, unless you talk so fast that your hearer can't understand you. Symbols are being transmitted at a faster rate, but they are being received with errors; the channel is noisy.

Again "The central paradigm of classical information theory" is not "the engineering problem of the transmission of information over a noisy channel." That is the central paradigm of coding theory. The central paradigm of classical information theory is the quantification of information and of the capacity of an information carrying channel (which may be noisy or noiseless).

It might also be appropriate in the Overview section to introduce some simple quantitative concepts, for example, to define bits in terms of binary digits, to point out that capacity is defined for noiseless as well as noisy channels, or even to mention, by way of illustration, that when a fair and balanced coin is flipped, a message revealing which side came up conveys exactly one bit of information. Marty39 (talk) 21:06, 30 September 2011 (UTC)
 * Welcome! It is quite some time since I looked at this article, but I believe my thoughts were much the same as what you have noted. Please edit as you see fit—don't worry about mistakes and formatting issues as they can be fixed. I see that you know citations are needed; if you like, just put the info after the text in brackets and someone will tweak it. Johnuniq (talk) 00:10, 1 October 2011 (UTC)
 * As already mentioned, you're encouraged to contribute. Please also feel free to make contributions to the coding theory article, which presently needs to be rewritten completely. Isheden (talk) 11:17, 1 October 2011 (UTC)

The latest Nobel prize in economics went to economists who used information theory to build a model of how people respond to government policies. Perhaps this means economics should be listed in the opening section as a field in which information theory has been applied. — Preceding unsigned comment added by 68.16.142.186 (talk) 23:00, 2 January 2012 (UTC)

Black holes vs conservation of information
In this astrophysics series on the Discovery Channel they said something about black holes "...breaking the fundamental law of conservation of information" and some quarrelling between theoretical physicists over black holes. I had never before heard of such a fundamental law and I thought it sounded very strange. So I went looking for it and what I found on Wikipedia had little to do with physics as far as I could see. Can someone point me to an article better describing what the TV series means with conservation of information, or is the series just misinformed? Thanks! Eddi (Talk) 00:27, 9 May 2012 (UTC)
 * I think I've found the answer here and here. The TV show either overstates a "common assumption" as a "fundamental law" to make it more sensational, or it uses "information" as a synonym for "entropy" to make it easier to understand.  This may not be completely unreasonable, but I found it a bit confusing... Eddi (Talk) 09:47, 14 May 2012 (UTC)
 * See also quite a lot on the talk page of the first of those articles, for discussion on this. Jheald (talk) 21:29, 14 May 2012 (UTC)

Conformity Measurements
Information Theory (at least in the modern research arena) also includes the concept of Conformity, a measure that describes whether any unit bit of information conforms to the wider sea of information within which the bit of information rests.

What constitutes conformity is rather like the old PBS television shows and other educational efforts designed for children which show a set of objects which all conform to some set or sets of grouping, one object of which is mildly or wildly different than the rest or, applied to Information Theory, one of which is wildly out of Conformity with the rest of the information within the arena of bits of information.

Information Theory shows vectors of Conformity that can be graphed. When there are 2 information units being considered, Conformity is zero, and when there are an infinite number of information bits, conformity is also zero. For information bits > 3 yet less than infinity, the measure of Conformity behaves much the way that photons behave when they interact with each other.

Anyway, the article here does not touch upon Conformity as a measurement within the wider realm of Information theory -- which is a shame since it's currently being used as a way of detecting corrupted data within streams of data. If I were better informed about the concepts involved I would add a section. Maybe someone who has formal training could be sought to add some. Damotclese (talk) 20:19, 15 May 2012 (UTC)

Comment about assessment of the article
This article is also assessed within the mathematics field Probability and statistics. — Preceding unsigned comment added by Geometry guy (talk • contribs) 11:03, 28 May 2007‎ (UTC)

Problem in Section on Entropy
Where it says that entropy is maximized when an r.v. is uniformly distributed, there appears to be a minus sign missing from the H(X)=log(n). Also from the definition it appears that in this case it should read H(X)=-log(1/n), but perhaps I'm mistaken. — Preceding unsigned comment added by 129.67.121.230 (talk) 12:45, 18 March 2013 (UTC)


 * I think we all agree that using the definition of entropy in the article, when there are n uniformly distributed (equiprobable) messages, the entropy (in bits) of the next message X is
 * H(X) = -log2(1/n). That can be simplified using a logarithmic identity to
 * H(X) = log2(n). For example, if we have 16 possible messages, n=16, then the entropy (in bits) of the next message is
 * H(X) = -log2(1/n) = -log2(1/16) = +4 bits = log2(n) = log2(16) = +4 bits.
 * --DavidCary (talk) 16:05, 30 January 2015 (UTC)
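The identity in the bullets above is easy to verify numerically (n = 16 is the example from the comment):

```python
import math

n = 16  # number of equiprobable messages
# Directly from the definition H(X) = -sum p*log2(p) with every p = 1/n:
h_from_definition = -sum((1 / n) * math.log2(1 / n) for _ in range(n))
# The identity -log2(1/n) = log2(n) shows no minus sign is missing:
h_identity = math.log2(n)
assert abs(h_from_definition - h_identity) < 1e-12   # both equal 4.0 bits
```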

clarification of text
the following text which is quoted from the article, "specifying the outcome from a roll of a die (six equally likely outcomes)", assumes a six-sided die. Dice are made with numbers of sides other than six. Please edit this to make that assumption explicit. Thanks in advance. — Preceding unsigned comment added by 76.196.10.224 (talk) 18:23, 23 March 2013 (UTC)

Older foundations than Shannon
I have read "The Book of the Sign (Sefer HaOTh)", by the Spanish Abraham Abulafia, written around 1279, one of the founding texts of medieval Kabbalah, and I realised it is a work on mathematical information theory. First, every Hebrew character is assigned a number (see Hebrew numerals); then he performs basic operations such as permutations or additions, and the resulting numbers are translated back, given a meaning. His work on permutations of the Tetragrammaton, which gives 144 possible words, is clearly a work of information theory. Purposes of this theory were decrypting some message in sacred Jewish texts and achieving a more direct spiritual communication through the finding of the lost sign. Abulafia wrote that a letter had been lost in the process of the creation of writing, which explained the imperfections of World and Words. I am finding the exact references in the book. — Preceding unsigned comment added by Francisco.j.gonzalez (talk • contribs) 11:14, 11 January 2014‎ (UTC)

Shannon's H = entropy/symbol
From A Mathematical Theory of Communication, By C. E. SHANNON

Reprinted with corrections from The Bell System Technical Journal, Vol. 27, pp. 379–423, 623–656, July, October, 1948.

The choice of a logarithmic base corresponds to the choice of a unit for measuring information. If the base 2 is used the resulting units may be called binary digits, or more briefly bits, a word suggested by J. W. Tukey. ...

Suppose we have a set of possible events whose probabilities of occurrence are p_1, p_2, ..., p_n. These probabilities are known but that is all we know concerning which event will occur. Can we find a measure of how much “choice” is involved in the selection of the event or of how uncertain we are of the outcome? If there is such a measure, say H(p_1, p_2, ..., p_n), it is reasonable to require of it the following properties:
 * H should be continuous in the p_i.
 * If all the p_i are equal, p_i = 1/n, then H should be a monotonic increasing function of n. With equally likely events there is more choice, or uncertainty, when there are more possible events. (emphasis mine)
 * If a choice be broken down into two successive choices, the original H should be the weighted sum of the individual values of H. ...

Theorem 2: The only H satisfying the three above assumptions is of the form: $$H=-K \sum_{i=1}^n p_i \log p_i$$ where K is a positive constant.

Quite simply, if we have an m-symbol alphabet and n equally likely independent symbols in a message, the probability of a given n-symbol message is $$p_i=(1/m)^n$$. There are $$m^n$$ such messages $$\left(\sum_{all} p_i=1\right)$$. Shannon's H defined above (K=1) is then H=n log(m). For a 2-symbol alphabet, log base 2, H=n bits, the number of symbols in a message, clearly NOT an entropy per symbol. PAR (talk) 05:50, 10 December 2015 (UTC)
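PAR's computation can be checked by brute force for a small alphabet (m = 2, n = 4 are my own illustrative values): enumerating all m^n equally likely messages and applying Shannon's H with K = 1 gives n log2(m).

```python
import math
from itertools import product

m, n = 2, 4                       # alphabet size and message length (small for brute force)
messages = list(product(range(m), repeat=n))
p = (1 / m) ** n                  # every n-symbol message is equally likely
assert len(messages) == m ** n    # m^n messages, with probabilities summing to 1
H = -sum(p * math.log2(p) for _ in messages)
assert abs(H - n * math.log2(m)) < 1e-12   # H = n*log2(m) = 4 bits here
```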


 * I have removed statements to the contrary - The removed material provides a double definition of H, the definition by Shannon is the only one we should be using. The removed material confuses the information content of a message composed of independently distributed symbols, with probabilities estimated from a single message, with the entropy - the expected value of the information content of a message averaged over all messages, whether or not the symbols are independent. Furthermore, when a file or message is viewed as all the data a source will ever generate, the Shannon entropy H in bits per symbol is zero. The probability of the message is 1 and 1 log(1)=0. Strictly speaking, entropy has no meaning for a single message drawn from a population of multiple possible messages. It is a number associated with the population, not with a message drawn from that population.  It cannot be defined without an a priori estimate or knowledge of those message probabilities. PAR (talk) 07:11, 11 December 2015 (UTC)

No, "per symbol" is per symbol of data received by a source, just as the article stated before you followed me here. It is not "per unique symbol" as you're interpreting it the units. You did not quote Shannon where he stated the units of H. To actually quote Shannon in section 1.7: "This (H) is the entropy per symbol of text" and "H or H' measures the amount of information generated by the source per symbol or per second." Where you reference his saying the p's should sum to 1 in section 1.6 is where he says "H is the entropy of a set of probabilities" and since they sum to 1 in this context, this is on the basis of 1 complete symbol from a source, that is, entropy per 1 symbol. This was clearly explained with quotes from Shannon in what you reverted. Also, by your reasoning a 10 MB of data from a source would have and entropy of no more than 1. My contributions survived 5 days without complaint, so please let someone else who's got a correct point of view edit it. Ywaz (talk) 18:46, 11 December 2015 (UTC)


 * Agreed.


 * I think we are using two different sources. The source I used is listed above and the quotes are correct. Could you please give the "Shannon" source you are using? PAR (talk) 20:14, 11 December 2015 (UTC)

Your quotes are fine. I'm quoting your source, top of page 11, literally 2 sentences before your quote, and page 13. Ywaz (talk) 00:19, 12 December 2015 (UTC)


 * Well this Shannon paper is not as straightforward as I thought. He does use two symbols for H, one for the general definition which is tied only to a probability mass function associated with an "event", no mention of symbols, etc, and another which is an entropy per symbol of a source of symbols behaving as a Markoff process. For the Markoff process, an entropy Hi is defined for a state according to the general definition, and then a new H, the H of the Markoff process is defined in terms of those Hi. That's an unfortunate duplication, but the first, general, definition is preferred as a definition of entropy. The second is an application of the concept of entropy to a specific situation.


 * I did quote Shannon where he stated the units - the quote involving Tukey is on page 1. Also, Shannon, in various places, makes it clear that the frequencies of symbol occurrence in a finite message are NOT the frequencies associated with the source, only an approximation that gets better and better with longer messages, being exact only in the limit of infinite length. If you think I am wrong on this, please point out where Shannon says otherwise.


 * Also, yes, by my reasoning a 10 MB .gif file from a source would have an entropy of zero, IF IT WERE THE ONLY MESSAGE THE SOURCE WOULD AND COULD PROVIDE. Entropy is a measure of MISSING information, and once you have that 10 MB, there is no missing information. If a library contains 1024 books, and every day I send someone in to pick one out at random, the entropy of the process is log2(1024)=10 bits. It's the amount of information given to me by knowing which book is brought out, or the amount of missing information if I don't. It has nothing to do with a completely different process, the information IN the book, or in your case, the .gif file. If there were only one book in the library (or one .gif file), the entropy would be log2(1)=0, i.e. complete certainty.


 * I have to admit that if it is the only book in the library, then the letter frequencies in the book are, by definition, the exact frequencies of the process which created the book. But the process created no other books. You can use these frequencies to compare the entropy or entropy/symbol of the first half of the book to the last half, etc. etc., but the minute you apply them to another book, you have changed the process, and they become only approximations. You cannot talk about the exact entropy of a lone message. If it's truly alone, the entropy of the message is zero. If you divide it in half, then those halves are not alone, and you can compare their entropies.


 * In statistical mechanics, this corresponds to knowing the microstate at a particular instant. If we just think classically, then the future microstate history of the system can be predicted from that microstate, from the positions and velocities of each particle. There is no need for thermodynamics; it's a pure mechanics problem. Entropy is a flat zero, complete certainty.


 * I will continue to read the article, since, as I said, it's not as straightforward as I thought. PAR (talk) 06:55, 13 December 2015 (UTC)

If there is a distribution on finitely many objects then there is an entropy. A non-standard way (by current standards) to specify a sample from a distribution of symbols is to juxtapose the symbols from the sample. The entropy of this sample can then be computed, and it is an approximation to the entropy of the true distribution. If you want to write up a short example of this, that is fine, but please use modern notation for a set (or tuple), and please do not call it "entropy per symbol", which is a complete bastardization of the current meaning of that term. Please give the answer in "bits", "nats", or similar, not in the bastardized "bits/symbol", "nats/symbol", etc.

On the other hand, if there is a stream of data or a collection of infinite strings, such as described in the coding theory section, then the entropy as just described for the case of a finite sample or distribution can be problematic. In this case, there could be an unbounded number of possibilities with negligible probability each, and thus the computed entropy could be unboundedly large. In this case it can make sense to talk about how the entropy grows with the length of the string. This is explained in the Rate subsection of the Coding Theory section. In this case one gets as the answer an amount of entropy per (additional) symbol. This is an entropy rate (or information rate) and has units of "bits/symbol", "nats/symbol", etc. In this case it is common to juxtapose the symbols into what is commonly known as a "string", because each string has its own probability that is not necessarily directly computable from the characters that comprise it. If you think the Coding Theory or Rate sections need improvement, please feel free to make improvements there.

The language in the last few days misuses (by current standards) the meaning of entropy per symbol. Unfortunately it is therefore a significant negative for the reader. I am removing it. 𝕃eegrc (talk) 19:17, 14 December 2015 (UTC)
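A minimal numeric sketch of the distinction described above (the distribution and helper are my own, assuming an i.i.d. source): the entropy of a finite distribution is a number in bits, while the entropy rate of a symbol stream is that same number read as bits per additional symbol.

```python
import math

def entropy_bits(probs):
    """Entropy of a finite distribution: a single number in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

h = entropy_bits([0.5, 0.25, 0.25])   # 1.5 bits for one draw

# For an i.i.d. source the entropy of an n-symbol string is n*h, so the
# entropy *rate* -- bits per additional symbol -- is h, in bits/symbol.
n = 10
string_entropy = n * h                # 15 bits for the whole 10-symbol string
rate = string_entropy / n             # 1.5 bits/symbol
assert abs(rate - h) < 1e-12
```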


 * Leegrc, do you have a reference to show that Shannon's definition of entropy is not the accepted definition? Honestly, your position is not tenable and PAR does not agree with you. Please stop undoing my edits on the entropy pages in your proclamation that Shannon's classic book should be ignored. Let someone else undo my edits.  Ywaz (talk) 21:57, 17 January 2016 (UTC)

It appears that this conversation is being continued in the following section "Please provide a reference or remove this material". That is where you will find my response. 𝕃eegrc (talk) 17:31, 19 January 2016 (UTC)

Please provide a reference or remove this material
Ywaz - This is the same material that was previously removed. Please provide a clear reference for the statement "When a file or message is viewed as all the data a source will ever generate, the Shannon entropy H in bits per symbol is...". Either that or remove the statement, and the conclusions drawn from it. PAR (talk) 03:06, 18 January 2016 (UTC)


 * Leegrc removed it before you and I had reached agreement that H is in "bits/symbol". I believe I reposted about half as much as the previous text that I had. How about "When a set of symbols is big enough to accurately characterize the probabilities of each symbol, the Shannon entropy H in bits per symbol is..."?  The primary clarification I want to get across to readers is that when they see "Shannon entropy" they have to determine if the writer is talking about specific entropy H (per symbol) or total entropy N*H in bits (or shannons). Ywaz (talk) 16:28, 18 January 2016 (UTC)

I don't have Shannon's original paper. Surely, if his terminology as you see it is still in use, you can find a modern textbook that uses "bits/symbol" in the way you believe to be correct. Would you cite it? I would cite a textbook that says that "bits/symbol" is an incorrect description for entropy (other than in the case of strings of unbounded length), if publishing negative results were commonplace; but such is not commonplace. :-( 𝕃eegrc (talk) 17:31, 19 January 2016 (UTC)


 * You can see in the discussion directly above where I showed PAR the quotes from Shannon himself, so I just removed the "disputed" tag you just added. PAR provided a link to Shannon's text at the beginning of the previous topic above. In addition to the quotes I showed PAR on pages 13 and 14, Shannon says on pages 16, 17, 18, 19, and 20 that H is "bits per symbol". Do a "CTRL-F" on that link and search for "per symbol". Ywaz (talk) 18:37, 19 January 2016 (UTC)

I have moved the paragraphs in question to the historical section of the article. I trust that you are faithfully reporting what you see in Shannon's paper and thus I do not dispute that it is a part of history. Until you can cite a modern textbook that uses the same terminology, please leave these paragraphs in the historical section. 𝕃eegrc (talk) 12:52, 20 January 2016 (UTC)


 * There is not a historical difference. I do not know where you think I am disagreeing with anyone. Shannon called H "entropy" like everyone else does today. I would like to make sure people understand that Shannon's entropy H, the whole world over in all times, is in units of entropy per symbol (e.g., shannons per symbol), as modern researchers say, which means it is an intensive entropy S0 and not the regular physical entropy S that you can get from S=N*H. As far as a modern reference, I believe this well-known NIH researcher's home page is clear enough:  https://schneider.ncifcrf.gov/


 * PAR, I think I've got the connection between physical and information entropy worked out for an ideal gas on the [Sackur-Tetrode talk page]. In short, they are the same if you send messages by taking a finite number of identical marbles out of a bag to use as "symbols" to be placed in a much larger number of slots in space (or time) to represent phase space.  In other words, there is a bigger difference than I hoped for in the simplest case. It may turn out that Gibbs and QM entropy is closer to Shannon's H, but it may require an equally strained view of information to get it there. At least Landauer's principle implies a direct and deep connection. Ywaz (talk) 23:37, 20 January 2016 (UTC)

Thank you for the link to Tom Schneider's work. That helps me to understand why you have been advocating for "bits/symbol" where I thought "bits" was more appropriate. I now see it as a tomAYto vs. tomAHto thing; both right, depending upon how exactly things are framed. In particular, with a few copy edits I think I have managed to preserve your "bits/symbol" with enough context that makes it meaningful to us tomAYto folks too. 𝕃eegrc (talk) 18:51, 21 January 2016 (UTC)


 * No. "Bits" for H is wrong in every way: logically, mathematically, historically, currently, and factually. Can you cite any authority in information theory that says H is in bits? The core problem is that Shannon called H "entropy" instead of "specific entropy".  You can see his error blatantly in section 1.7 of his booklet where he says H is in units of "entropy/symbol".  That's exactly like calling some function X "meters" and then saying it has units of "meters/second".  Anyone saying it is in "bits" is doing a disservice to others and they are unable to do correct calculations. For example, by Landauer's principle, the minimal physical entropy generated by erasing 1 bit is S=kT*ln(2). So if you want to build a super-efficient supercomputer in the future and need to know the minimal heat it releases and you calculate H and think it is in "bits", then you will get the wrong answer in Joules of the minimal energy needed to operate it and how much cooling it will need. Ywaz (talk) 10:46, 22 January 2016 (UTC)

I suspect that we are 99% in agreement; let me explain my thoughts further. It does not have to be a distribution over symbols in order to have an entropy, right? For my first meal tomorrow I can have eggs, cereal, stir fry, etc., each with some probability, and that distribution has an entropy. I can report that entropy in bits (or nats, etc.). If it is the case that the distributions for subsequent breakfasts are independent and identically distributed with tomorrow's breakfast, then I can report the same numerical quantity and give it units of bits/day. Furthermore, if it turns out that your breakfasts are chosen from a distribution that is independent and identically distributed to mine, then I can report the same numerical quantity but give it units of bits/day/person. How many of "per day", "per person", etc. are appropriate depends upon the precise wording of the problem. The edits I made to the text are meant to make the wording yield "bits/symbol" as the natural units regardless of whether the reader is you or me. 𝕃eegrc (talk) 14:10, 22 January 2016 (UTC)

As a side note, I was not the one who removed your examples and formulae that used "counts", etc., though I was tempted to at least make revisions there. I suspect that we still have some disagreements to resolve for that part. 𝕃eegrc (talk) 14:10, 22 January 2016 (UTC)


 * No, H is always a statistical measure applied to symbols. "Eggs" is a symbol as far as entropy is concerned, not something you eat. But this seems beside the point. A distribution of symbols has an "entropy" only because Shannon made the horrendous mistake of not clarifying more clearly that it is a specific entropy, not a total entropy. No, you can't report H "entropy" in bits according to Shannon or any other good authority. N*H is the entropy in bits; H is in bits/symbol. Getting units right is not subjective or optional if you want to be correct. You just made a mistake by not keeping the units straight: you can't say bits/day/person, because then you would calculate our combined entropy in bits/day as 2 people * H bits/day/person = 2*H bits/day. But the H calculated for us individually or together gives the same H.


 * How are you and Barrelproof going to calculate the probabilities of symbols in data if you do not count them? It seems to me that he should not be reverting edits simply because it "seems" to him (his wording) that the edit was not correct. But I suppose that means I have to prove with references that is how it is done if someone interested in the article is skeptical. I've seen programs that calculate the entropy of strings and usually they use the variable "count" just like I did. Ywaz (talk) 23:12, 22 January 2016 (UTC)
 * Probabilities are often estimated by counting, but that is a matter of practical application and approximation – not theory. The article is about information theory. The relationship between the results of counting and true probabilities can be complex to analyze properly (e.g., issues relating to sampling and the questions of ergodicity and stationarity must be considered), and those details are unnecessary to consider in this article and are generally not part of an introduction to information theory. For example, the results from counting will only be reasonably valid if the number of samples measured by the counts is very large. Shannon's paper and other well-regarded sources on information theory do not equate counting with true probability. Probability is not the same thing as Statistics, which is why there are separate articles on those two subjects. Much of the study of information theory is conducted using idealized source models, such as by studying what happens with Bernoulli trials or with a Markov process or Gaussian random variables. No counting is used in such analysis. I am personally very skeptical of anything written in an introduction to information theory that is based on the idea that Shannon made some "horrendous mistake". I have seen no evidence that people who work on information theory have found any significant mistakes in Shannon's work. His work is universally held in the very highest regard among mainstream theoreticians. —BarrelProof (talk) 21:33, 23 January 2016 (UTC)

BarrelProof, I had stated the count method was for when the data was all the source ever generated, which negates your ergodicity, sampling, and stationarity complaints. Shannon's section 1.7 relies on a count method for large N. Shannon may not have mentioned small N, but I do not know why "respected" texts, however you define that, would not use it on small N in the same way the Fourier transform is applied to non-repeating signals: you simply treat the signal as if it were repeated. Also, physical entropy does not depend on an infinite N, or even a large N, and Landauer's limit shows how deep the connection is. I was able to derive the entropy equation for a monoatomic ideal gas by using the sum of the surprisals, which is a form of H. The surprisal form was needed to maintain the random-variable requirement of H. You are using an appeal to authority to justify an obviously terrible mistake in Shannon's nomenclature that makes it harder for newcomers to "catch on". It is a horrendous mistake for Shannon to call "entropy/symbol" an "entropy" instead of "specific entropy". Look how hard it was to convince Leegrc and PAR that H is entropy/symbol even after quoting Shannon himself. If you disagree, then please explain to me why it would be OK to call a speed in meters/second a distance. Ywaz (talk) 14:39, 24 January 2016 (UTC)
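For what it's worth, the count-based plug-in estimate being debated in this thread can be sketched in a few lines (the function name and the example string are my own; as BarrelProof notes, for short messages the observed frequencies only approximate the source's probabilities):

```python
import math
from collections import Counter

def entropy_per_symbol(data):
    """Plug-in entropy in bits/symbol, treating the observed symbol
    frequencies as the source's true probabilities (exact only if this
    data is all the source will ever generate)."""
    counts = Counter(data)
    total = len(data)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

msg = "abab"                     # each symbol appears with frequency 1/2
h = entropy_per_symbol(msg)      # 1.0 bit/symbol
total_bits = len(msg) * h        # N*H = 4 bits for the whole message
assert abs(h - 1.0) < 1e-12
```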


 * I am not convinced that Shannon made a mistake, and I am not convinced he didn't. It's pointless to argue about what he wrote, when it is what it is. Can we agree that this source is what we are talking about?
 * After all of the above, my request has still not been answered: Please provide a clear reference for the statement "When a file or message is viewed as all the data a source will ever generate, the Shannon entropy H in bits per symbol is...". PAR (talk) 00:35, 25 January 2016 (UTC)


 * That phrase needs to be specified more clearly, and I do not have a book reference (I have not looked other than Shannon) to justify programmers who are doing it this way (and probably unknowingly making the required assumptions, like it being a random variable), so I have not complained about it being deleted. We already agreed on that as the source, and he did not mention short strings. I am not currently arguing for going back to the things I had written, simply because you guys seem to want references to at least show "it's done that way" even if the math and logic are correct. My last few comments are to show the other excuses being made for deleting what I wrote are factually wrong, as evidenced by BarrelProof and Leegrc not (so far) negating my rebuttals. I am not saying Shannon had a factual error, but that he made a horrendous nomenclature error in the sense that it makes entropy harder to understand for all the less-brilliant people who do not understand he meant his H is specific entropy. If he had not formulated his definitions to require the p's in H to sum to 1, then his Boltzmann-type H could have been a real Gibbs and QM entropy S. Ywaz (talk) 02:11, 25 January 2016 (UTC)

I've been reading Shannon & Weaver and what I have read makes sense to me. I have no problem with the present "entropy per symbol" as long as it is clear that this is a special case of a streaming information source, as contrasted with the more general case of an entropy being associated with a set of probabilities p_i.

The "error" that Shannon & Weaver make is the assignment of the letter H to a set of probabilities p_i (page 50) as:

$$ H = -\sum_i p_i \log(p_i) $$ and also to the entropy of an information source delivering a stream of symbols, each symbol having an INDEPENDENT probability p_i, in which case the above sum yields the entropy per symbol (page 53). Both definitions are purely mathematical, devoid of any interpretation per se; H only acquires meaning in the real world through the association of the probabilities p_i with some process or situation in the real world. The second is more restrictive in the assumption that each symbol in the message is INDEPENDENT.

As a physicist, I am naturally interested in information entropy as it relates to Boltzmann's equation H=k log(W). This uses the first definition of entropy. Every macrostate (given by e.g. temperature and pressure and volume for an ideal gas) consists of a set of microstates - in classical stat. mech., the specification of the position and velocity of each particle in the gas. It is ASSUMED that each microstate has a probability p_i = 1/W, where W is the number of microstates which yield the given macrostate. Using these probabilities, the Shannon entropy is just log(W), (in bits if the log is base 2, in nats if base e) and Boltzmann's entropy is the Boltzmann constant (k) times the Shannon entropy in nats.
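The uniform-microstate case described above can be sketched numerically (a toy illustration; W is chosen arbitrarily and is not tied to any physical system):

```python
import math

# For W equally likely microstates, p_i = 1/W and the Shannon
# entropy reduces to log(W), as the paragraph above states.
W = 1024
p = [1.0 / W] * W

H_bits = -sum(pi * math.log2(pi) for pi in p)  # Shannon entropy in bits
H_nats = -sum(pi * math.log(pi) for pi in p)   # same quantity in nats

k_B = 1.380649e-23   # Boltzmann constant, J/K
S = k_B * H_nats     # Boltzmann entropy S = k ln(W) for this uniform case

print(H_bits)  # 10.0, since log2(1024) = 10
```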

Each microstate can be considered a single "symbol" out of W possible symbols. If we have a collection of, say, 5 consecutive microstates of the gas taken at 1-second intervals, all having the same macrostate (the gas is in equilibrium), then we have a "message" consisting of 5 "symbols". According to the second definition of H, the entropy is log(W) for each microstate, the entropy of the 5-microstate "message" is 5 log(W), and we might say that Boltzmann's H is the "entropy per symbol", or "entropy per microstate". I believe this was the source of confusion. It's a semantic problem, clearly illustrated by this example. Saying that H is "entropy per microstate" DOES NOT MEAN that each microstate has an entropy. Furthermore, we cannot specify H without the p_i, which were ASSUMED known.

The idea that a single symbol (microstate) has an entropy is dealt with in Shannon & Weaver, and it is clearly stated that a single message has entropy zero, in the absence of noise. To quote Shannon from page 62:

"If a source can produce only one particular message its entropy is zero, and no channel is required."

Also Weaver, page 20 (parentheses mine):

"The discussion of the last few paragraphs centers around the quantity 'the average uncertainty in the message source when the received signal (symbol) is known.' It can equally well be phrased in terms of the similar quantity 'the average uncertainty concerning the received signal when the message sent is known.' This latter uncertainty would, of course, also be zero if there were no noise."

See page 40, where Shannon makes it very clear that the entropy of a message (e.g. AABBC) is a function of the symbol probabilities AND THEIR CORRELATIONS, all of which must be known before entropy can be calculated. Shannon draws a clear distinction (pages 60-61) between an estimate of entropy for a given message and "true" entropy (parentheses mine):

The average number H′ of binary digits used per symbol of original message is easily estimated.... (by counting frequencies of symbols in the "original message") We see from this that the inefficiency in coding, when only a finite delay of N symbols is used, need not be greater than 1/N plus the difference between the true entropy H and the entropy G_N calculated for sequences of length N.

In other words, if you are going to estimate entropy from messages, "true" entropy is only found in the limit of infinitely long messages. PAR (talk) 06:48, 25 January 2016 (UTC)
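The convergence described above can be illustrated with a small sketch (a hypothetical 3-symbol source; the plug-in frequency estimate is my illustration of the idea, not Shannon's exact construction):

```python
import math, random

# Estimate entropy from observed symbol frequencies in messages of
# growing length N, and compare with the "true" source entropy.
random.seed(0)
true_p = {"A": 0.5, "B": 0.25, "C": 0.25}
H_true = -sum(p * math.log2(p) for p in true_p.values())  # 1.5 bits/symbol

def estimated_entropy(message):
    """Plug-in estimate: count frequencies, then apply Shannon's formula."""
    counts = {}
    for s in message:
        counts[s] = counts.get(s, 0) + 1
    n = len(message)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

symbols, weights = zip(*true_p.items())
for N in (10, 1000, 100000):
    msg = random.choices(symbols, weights=weights, k=N)
    print(N, round(estimated_entropy(msg), 3))
# The estimates approach H_true = 1.5 as N grows.
```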


 * H in Shannon's words (on page 10 and then again on page 11) is simply the entropy of the set of p's ONLY if the p's sum to 1, which means it is "per symbol", as it is in the rest of the book. I mentioned this before. On both pages he says "H is the entropy of the set of probabilities" and "If the p's are equal, p=1/n". This is completely different from Gibbs and QM entropy even though Shannon's H looks the same.[edit: PAR corrected this and the next statement] They are not the same because of the above requirement Shannon placed on H, that the p's sum to 1.   Boltzmann's equation is not H=k*ln(states). It's S, not H. Boltzmann's H comes from S = k*N*H where N*H = ln(states).  This is what makes Shannon's H the same as Boltzmann's H, and why he mentioned Boltzmann's H-theorem as the "entropy" and not Boltzmann's S=k*ln(states), and not Gibbs or QM entropy.  H works fine as a specific entropy where the bulk material is known to be independent of other bulks. The only problem in converting H to physical entropy on a microscopic level is that even in the case of an ideal gas the atoms are not independent. When there are N atoms in a gas, all but 1 of them have the possibility of being motionless; a single atom carrying all the energy is a possibility. That is one of the states physical entropy has that is missed if you err in treating the N atoms as independent.  I came across this problem when trying to derive the entropy of an Einstein solid from information theory: physical entropy always comes out higher than information theory predicts if you do not carefully calculate the information content in a way that removes the atoms' mutual dependency on energy.  So H is hard to use directly.  The definition of "symbol" would get ugly if you try to apply H directly. You would end up just using Boltzmann's methods, i.e. the partition function.  Ywaz (talk) 15:42, 25 January 2016 (UTC)

First of all, yes, I should have said S=k log(W), not H=k log(W). Then S=k H. W is the number of microstates in a macrostate, (the number of "W"ays a macrostate can be realized). Each microstate is assumed to have equal probability pi = 1/W and then the sum of -(1/W)log(1/W) over all W microstates is just H=log(W). This is not completely different from Gibbs entropy, it is the same, identical. Multiplying by N gives a false result, S != k N H.

This is the core of the misunderstanding. We are dealing with two different processes, two different "H"s. Case 1 is when you have a probability distribution p_i (all of which are positive and which sum to 1, by definition). There is an entropy associated with those p_i given by the usual formula. There are no "symbols" per se. Case 2 is when you have a stream of symbols. If those symbols are independent, then you can assign a probability p_i to the occurrence of a symbol. If they are not, you cannot; you have to have a more complicated set of probabilities. Case 2 is dealt with using the more general concepts of case 1. If the symbols are independent, the p_i represent the simple probability that a symbol will occur in a message, and the entropy associated with THAT set of probabilities is the "entropy per symbol". In other words, it is the entropy of THE FULL SET OF N-symbol messages divided by the number of symbols in a message (N). THIS DOES NOT MEAN THAT A SYMBOL HAS AN ENTROPY, IT DOES NOT. THIS DOES NOT MEAN THAT A MESSAGE HAS AN ENTROPY, IT DOES NOT. Entropy is a number associated with the full set of N-symbol messages, NOT with any particular symbol or message. As Shannon said on pages 60-61 of the 1964 Shannon-Weaver book, you can estimate probabilities from a message to get a "tentative entropy", and this tentative entropy will approach the "true" entropy as the number of symbols increases toward infinity. But strictly speaking, as Shannon said on page 62 of the book, the entropy of a lone message, in which the "full set" consists of a single message, is zero. You cannot speak about entropy without speaking about the "full set" of messages.
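The "entropy per symbol" claim for independent symbols can be checked numerically (an illustrative 3-symbol distribution, chosen arbitrarily):

```python
import math
from itertools import product

# For INDEPENDENT symbols, the entropy of the full set of N-symbol
# messages is N times the per-symbol quantity H, so H is "entropy per
# symbol" of the ensemble -- not a property of any one message.
p = {"A": 0.7, "B": 0.2, "C": 0.1}   # per-symbol probabilities (sum to 1)
H = -sum(q * math.log2(q) for q in p.values())

N = 3
# The probability of each N-symbol message is the product of its
# symbol probabilities; sum -q log q over all 3^N messages.
H_messages = 0.0
for msg in product(p, repeat=N):
    q = math.prod(p[s] for s in msg)
    H_messages -= q * math.log2(q)

print(math.isclose(H_messages, N * H))  # True: joint entropy = N * H
```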

PLEASE - I am repeating myself over and over, yet your objections are simply to state your argument without analyzing mine. I have tried to analyze yours, and point out falsehoods, backed up by statements from Shannon and Weaver. If there is something that you don't understand about the above, say so, don't just ignore it. If there is something you disagree with, explain why. This will be a fruitless conversation if you just repeat your understanding of the situation without trying to analyze mine.

You say "They are not the same because of the above requirement Shannon placed on H, that the p's sum to 1." and also "If he had not formulated his definitions to require the p's in H to sum to 1, then his Boltzmann-type H could have been a real Gibbs and QM entropy S". This is WRONG. The p's ALWAYS sum to one, otherwise they are not probabilities. You are not considering the case where the p_i's are not independent. In other words, the statistical situation cannot be represented by a probability p_i associated with each symbol. In this case you don't have p_i, you have something more complicated, p_ij for example, where the probability of getting symbol j is a function of the previous symbol i. Those p_ij always sum to 1 by definition, and the entropy is the sum of -p_ij log p_ij over all i and j.

You say "the only problem in converting H to physical entropy on a microscopic level is that even in the case of an ideal gas the atoms are not independent." No, that's a problem only if you insist on your narrow definition of entropy, in which the "symbols" are all independent. Shannon dealt with this on page 12 of the Shannon-Weaver book. Suppose you have a bag of 100 coins, 50 are trick coins with two heads, 50 are normal coins with a head and a tail. You pick a coin out of the bag, and without looking at it, flip it twice. You will have a two-symbol message. You will have HH occur with probability p_HH = 1/2+1/8 = 5/8, while HT, TH and TT will occur with probability p_HT = p_TH = p_TT = 1/8. There is no p_i here. There is no way to assign a probability to H or T such that the probability of XY = p_X p_Y, where X and Y are H and/or T. There is no p_i associated with each symbol that lets you calculate the sum over H and T of -p_i log(p_i). You are stuck with the four probabilities p_HH, p_HT, p_TH, p_TT. They sum to 1. The entropy of the process is the sum of -p_ij log(p_ij) over all four possible i,j pairs.
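The coin example above works out numerically like this (a direct transcription of the probabilities given; nothing is assumed beyond them):

```python
import math

# 50 two-headed coins and 50 fair coins; pick one at random, flip twice.
# Joint probabilities of the four two-symbol messages:
p = {"HH": 0.5 * 1.0 + 0.5 * 0.25,   # 5/8
     "HT": 0.5 * 0.25,               # 1/8
     "TH": 0.5 * 0.25,               # 1/8
     "TT": 0.5 * 0.25}               # 1/8
assert abs(sum(p.values()) - 1.0) < 1e-12

# No per-symbol probability reproduces this: the marginal P(first = H)
# is 3/4, and independence would predict P(HH) = (3/4)^2 = 9/16 != 5/8.
pH_first = p["HH"] + p["HT"]         # 3/4
print(pH_first ** 2, p["HH"])        # 0.5625 vs 0.625

# The entropy of the process comes from the four joint probabilities:
H = -sum(q * math.log2(q) for q in p.values())
print(round(H, 4))
```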

Things only get "ugly" because you are not doing it right. When you do it right, things are beautiful. PAR (talk) 19:56, 25 January 2016 (UTC)


 * I didn't realize the p's summed to 1 in Gibbs and QM. Why does the Wikipedia H-theorem page use S=k*N*H?  As I said, the entropy of a message is only valid if it is considered all the data the source will ever generate, so it is the full set of messages, not just a single message. Specific entropy is useful. Of course I do not expect a single symbol by itself to carry H, but H is the entropy the source sends per symbol on average in cases 1 and 2. Although case 1 is an "event", I do not see why that can't be assigned a symbol. If an event is a thing that can be defined, then a symbol can be assigned to it.  H is discrete, so I do not know how symbols can't be assigned to whatever is determining the probability.  I do not see any difference between your case 1 and case 2 if it isn't merely a matter of dependence and independence. As far as a message carrying information, it seems clear a message of length M of independent symbols from a source known to have H will have H*M entropy (information) in it, if M displays the anticipated distribution of p's, and the degree to which it does not is something I believe you described to me before. Concerning your coin example, you gave me the symbols: HH, HT, TH, and TT are the symbols for that example. The symbols are the things to which you assign the p's.  Ywaz (talk) 20:52, 25 January 2016 (UTC)


 * Regarding the H-Theorem page, I think it's wrong. It has a strange definition of H as &Sigma; P log(P) rather than -&Sigma; P log(P), but then there is a negative sign in the entropy equation, which is written S = -NkH. Each microstate has an equal probability, so P = 1/W, and H *should be* H = -&Sigma; P log(P) = -&Sigma; (1/W)log(1/W) = log(W). Boltzmann's entropy equation is S = k log(W), and I don't know where the N comes from. There is no reference for the statement.


 * Ok, I see what you are saying, I agree, you can say that each p_i in Case 1 refers to a symbol, and Case 1 is a one-symbol message. In the case of thermodynamics and stat. mech., the full set is all W microstates, so there are W symbols, each with the same probability 1/W. If you consider, say, a gas at a particular pressure, temperature, volume, it exists in a particular macrostate, and one of the W microstates; it's a one-symbol message. That microstate (symbol) is not known, however. We cannot measure the microstate of a large system. We can quantify our lack of knowledge by calculating the information entropy (H) of the macrostate, and it is log(W). log(W) in bits is the average number of yes/no questions we would have to ask about the microstate of the gas in order to determine what that microstate is. It's a huge number. The thermodynamic entropy S is then k log(W), and k is a small number, so we get a reasonable number for S.


 * Regarding the HH, HT, TH, TT example, we are on the same page here. H and T are not the symbols; HH, HT, TH, and TT are. PAR (talk) 07:30, 26 January 2016 (UTC)

S=k N H is only one of the equations they gave. They say what you said a month or two ago: it's only valid for independent particles, so I do not think it is in error. Someone on the talk page also complains about them not keeping the -1 with H. In the 2nd paragraph you sum it up well. Ywaz (talk) 10:53, 26 January 2016 (UTC)


 * Well, I guess I don't know what I meant when I said that. I don't know of a case where H is not log(W). That's the same as saying I don't know of a case where the probabilities of a microstate are not p_i = 1/W for every microstate. The devil is in the details of what constitutes a microstate, what with quantum versus classical,  identical particles, Bose statistics, Fermi statistics, etc. etc. PAR (talk) 22:37, 26 January 2016 (UTC)

I do not follow your 3rd sentence, and I'll look into the possibility that it's not a valid formula under that condition. To me it is the same specific entropy: take two boxes of gas that are exactly the same. The entropy of both is 2 times the entropy of 1. So S=kH is the entropy of 1, and S=kHN is the entropy of 2.

From Shannon's H to ideal gas S
This is a way I can go from information theory to the Sackur-Tetrode equation by simply using the sum of the surprisals or in a more complicated way by using Shannon's H. It gives the same result and has 0.16% difference from ST for neon at standard conditions.

I came up with the following while trying to view it as a sum of surprisals. When fewer atoms are carrying the same total kinetic energy E, E/i each, they will each have a larger momentum, which increases the number of possible states they can have inside the volume in accordance with the uncertainty principle. A complication is that momentum increases as the square root of energy and can go in 3 different directions (it's a vector, not just a magnitude), so there is a 3/2 power involved.

$$ S = k_B \sum_{i=1}^{N} \ln\left(\frac{\Omega_i}{i} \right)$$ where $$  \Omega_i = \Omega_N \left(\frac{N}{i} \right)^{3/2}$$

You can make the following substitutions to get the Sackur-Tetrode equation:

$$\Omega_N = \left(\frac{xp}{\frac{\hbar}{2\sigma^2}}\right)^3, x=V^\frac{1}{3}, p=\left(2mU/N\right)^\frac{1}{2}, \sigma = 0.341, U/N=\frac{3}{2} k_B T$$

The probability of encountering an atom with a certain momentum depends (through the total energy constraint) on the momentum of the other atoms. So the probability of the states of the individual atoms is not a random variable with regard to the other atoms, so I can't easily write H as a function of the independent atoms (I can't use S=N*H directly). But by looking at the math, it seems valid to consider it as S=&Omega;H. The H below inside the parentheses is the entropy of each possible message where the energy is divided evenly among the atoms. The 1/i makes it entropy per moving atom, then the sum over N gives total entropy. Is it summing N messages or N atoms? Both? The sum for i over N was for messages, but then the 1/i re-interpreted it to be per atom. Notice my sum for j does not actually use j, as it is just adding the same p_i up for all states.

$$S = k_B \sum_{i=1}^{N}\left[ \frac{1}{i} \left( -\sum_{j=1}^{\Omega_i} \frac{i}{\Omega_i}\ln \frac{i}{\Omega_i}\right)\right] = k_B \sum_{i=1}^{N} \ln\left(\frac{\Omega_i}{i} \right)$$

This is the same as the Sackur-Tetrode equation, although I used the standard deviation to count states through uncertainty principle instead of the 4 &pi; / 3 constants that were in the ST equation which appears to have a small error (0.20%).

Here's how I used Shannon's H in a more direct but complicated way but got the same result:

There will be N symbols to count in order to measure the Shannon entropy: empty states (there are &Omega;-N of them), states with a moving atom (i of them), and states with a still atom (N-i). Total energy determines what messages the physical system can send, not the length of the message. It can send many different messages. This is why it's hard to connect information entropy to physical entropy: physical entropy has more freedom than a normal message source. So in this approximation, N messages will have their Shannon H calculated and averaged. Total energy is evenly split among "i" moving atoms, where "i" will vary from 1 to N, giving N messages. The number of phase states (the length of each message) increases as "i" (moving atoms) decreases, because each atom has to have more momentum to carry the total energy. The &Omega;_i states (message length) for a given "i" of moving atoms is a 3/2 power of energy because the uncertainty principle determining number of states is 1D and volume is 3D, and momentum is a square root of energy. $$S_2 = \frac{1}{N} \sum_{i=1}^{N} \left[ H_i \Omega_i \right]$$

Use $$S = k_B \ln(2)\, S_2$$ to convert to physical entropy. Shannon's entropy H_i for the 3 symbols is the sum over the probabilities of encountering an empty state, a moving-atom state, and a "still"-atom state. I am employing a cheat beyond the above reasoning by counting only 1/2 the entropy of the empty states. Maybe that's a QM effect.

$$H_i= -0.5*\frac{\Omega_i - N}{\Omega_i} \log_2\left(\frac{\Omega_i-N}{\Omega_i}\right) - \frac{i}{\Omega_i} \log_2\left(\frac{i}{\Omega_i}\right) -  \frac{N-i}{\Omega_i} \log_2\left(\frac{N-i}{\Omega_i}\right)$$

Notice H_i*&Omega;_i simplifies to the count equation programmers use. E = number of empty states, M = number of states with a moving atom, S = number of states with a still atom.

$$S_i=H_i \Omega_i= 0.5 E \log_2\left(\frac{\Omega_i}{E}\right) + M\log_2\left(\frac{\Omega_i}{M}\right) + S \log_2\left(\frac{\Omega_i}{S}\right)$$
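The step above can be spot-checked numerically (the values of &Omega;, N, and i below are arbitrary test numbers, chosen only so all three counts are positive; they are not physical):

```python
import math

# Multiplying the three-symbol Shannon entropy H_i by the number of
# states Omega_i reproduces the "count" form in the equation above.
Omega, N, i = 10_000_000.0, 1000, 400
E, M, S = Omega - N, i, N - i           # empty, moving, still counts

H_i = (-0.5 * (E / Omega) * math.log2(E / Omega)
       - (M / Omega) * math.log2(M / Omega)
       - (S / Omega) * math.log2(S / Omega))

count_form = (0.5 * E * math.log2(Omega / E)
              + M * math.log2(Omega / M)
              + S * math.log2(Omega / S))

print(math.isclose(H_i * Omega, count_form))  # True: same quantity
```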

By some miracle the above simplifies to the previous equation. I couldn't do it by hand, so I wrote a Perl program to calculate it directly and compare it to the ST equation for neon gas at ambient conditions in a 0.2-micron cube (that small to keep the number of loops to N < 1 million). &Omega;/N = 3.8 million. I confirmed the ST equation is correct against the official standard molar entropy S_0 for neon: 146.22 entropy/mole / 6.022E23 * N. It was within 0.20%. I changed P, T, or N by 1/100 and 100x and the difference ranged from 0.12% to 0.24%.

 #!/usr/bin/perl
 # neon gas entropy by Sackur-Tetrode (ST), sum of surprisals (SS), and Shannon's H (SH)
 $T=298; $V=8E-21; $kB=1.381E-23; $m=20.8*1.66E-27; $h=6.6262E-34; $P=101325; # neon, 1 atm, 0.2 micron sides cube
 $N = int($P*$V/$kB/$T+0.5);
 $U = $N*3/2*$kB*$T;
 $ST = $kB*$N*(log($V/$N*(4*3.142/3*$m*$U/$N/$h**2)**1.5)+5/2)/log(2.718);
 $x = $V**0.33333;
 $p = (2*$m*$U/$N)**0.5;
 $O = ($x*$p/($h/(4*3.142*0.341**2)))**3;
 for ($i=1; $i<$N; $i++) {
     $Oi = $O*($N/$i)**1.5;
     $SH += 0.5*($Oi-$N)*log($Oi/($Oi-$N)) + $i*log($Oi/$i) + ($N-$i)*log($Oi/($N-$i));
     $SS += log($Oi/$i);
 }
 $SH += 0.5*($O-$N)*log($O/($O-$N)) + $N*log($O/$N); # for $i=$N
 $SH = $kB*$SH/log(2.718)/$N;
 $SS = $kB*$SS/log(2.718);
 print "SH=$SH, SS=$SS, ST=$ST, SH/ST=".$SH/$ST.", N=".int($N).", Omega=".int($O).", O/N=".int($O/$N);
 exit;

Ywaz (talk) 19:54, 27 January 2016 (UTC)


 * Hm - I'm still trying to understand the above. What Ben-Naim did was to say the info entropy S/k_B is the sum of four entropies: N(h_pos + h_mom + h_quant + h_ex).


 * h_pos is the classical entropic uncertainty in position of a particle. Basically the probability is evenly distributed in the box, so the probability it's in a {dx,dy,dz} box at {x,y,z} is some constant times dx dy dz, and that constant is 1/V where V is the volume, so when you integrate over the box you get 1. So h_pos = log(V).


 * h_mom is the classical entropic uncertainty in momentum of the particle. The probability is given by the Maxwell-Boltzmann distribution, and when you calculate it out, it's h_mom = (3/2)(1 + log(2 &pi; m k T)).


 * h_quant is the uncertainty due to Heisenberg uncertainty. I don't have Ben-Naim's book with me, so I don't remember exactly how it's derived, but it's h_quant = -3 log(h).


 * h_ex is the uncertainty (reduction) due to the exchange degeneracy; it amounts to the correction for "correct Boltzmann counting", because a state with particle 1 in state 1 and particle 2 in state 2 is the same microstate as the state with particle 2 in state 1 and particle 1 in state 2. Even deeper, you cannot label particles, so there is no such thing as "particle 1" and "particle 2", just two particles. It's the usual division by N!, so h_ex = -log(N!)/N per particle (so that N h_ex = -log(N!)).


 * Add them all up, multiply by N and you get Sackur-Tetrode. PAR (talk) 22:59, 28 January 2016 (UTC)
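The four-term decomposition can be sketched as follows (my sketch of the description above, not Ben-Naim's own code; the per-particle form h_ex = -log(N!)/N and the neon numbers are my reading of this thread's assumptions):

```python
import math

# Four-term entropy decomposition, evaluated for neon at roughly
# ambient conditions, against the Sackur-Tetrode formula. ST uses
# Stirling's approximation (ln N! ~ N ln N - N), so they agree closely.
k_B = 1.380649e-23
h = 6.62607015e-34
T = 298.0
V = 8e-21                        # m^3 (the 0.2-micron cube used above)
m = 20.18 * 1.66053906660e-27    # neon atomic mass, kg
N = int(101325 * V / (k_B * T))  # particle count at 1 atm

h_pos = math.log(V)
h_mom = 1.5 * (1 + math.log(2 * math.pi * m * k_B * T))
h_quant = -3 * math.log(h)
h_ex = -math.lgamma(N + 1) / N   # -ln(N!)/N per particle, exact ln N!

S_over_k = N * (h_pos + h_mom + h_quant + h_ex)

# Sackur-Tetrode directly:
ST_over_k = N * (math.log(V / N * (2 * math.pi * m * k_B * T / h**2) ** 1.5) + 2.5)

print(S_over_k / ST_over_k)   # ~1.0; the residual is only Stirling's error
```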


 * Looking at the above, I don't think it's good. If you set your equation equal to Sackur-Tetrode, the &sigma; correction factor is (8(6&pi;)^{3/2})^{-1/6} = 0.339..., which is coincidentally close to your (1/2) Erf(Sqrt(1/2)) = 0.341..., but the 0.341 is not good; it's a hack from Heisenberg's uncertainty principle assuming a normal distribution. We shouldn't be mixing variances and entropies as measures of uncertainty. Second, in your program, I can make no connection between the long drawn-out expression for the sum in the program and the simple sum you listed at the top. I also worry about the introduction of particles standing still. This never happens (i.e. happens with measure zero) in the Maxwell-Boltzmann distribution, which is what is assumed in this derivation of the STE. PAR (talk) 01:30, 29 January 2016 (UTC)


 * Concerning the empty states and particles standing still, I was following information theory without regard to physics. Any excess should cancel, as it is not throwing extra information in, and apparently it did.  QM distributions are actually normal distributions, as I heard Feynman point out, so I do not think it is a hack.  I do not think it was an accident that my view of the physics came out so close. If I plug in the ST constants instead, the error increases to 0.39%, which is due to my sum being more accurate by not using the Stirling approximation. My hack is more accurate than the Stirling approximation's error, giving my method more accuracy than the Stirling-adjusted ST.  The program is long because it calculates all 3; "SS" is the variable that does the first calculation.  I think you're looking for this snippet, which computes the first equation. I should put it in javascript.

 $O = ($x*$p/($h/(4*3.142*0.341**2)))**3;
 for ($i=1; $i<$N; $i++) { $Oi = $O*($N/$i)**1.5; $SS += log($Oi/$i); }

An interesting simplification occurs if I divide out the N^{3/2} in the 1st equation and thereby let &Omega; be based on the momentum as if only 1 atom is carrying all the energy:


 * $$S = k_B \left[ N \ln(\Omega_1) - \frac{5}{2} \sum_{i=1}^{N} \ln(i) \right] $$
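The simplification above is an exact algebraic identity, which a short sketch can confirm (&Omega;_N and N below are arbitrary test values, not physical ones):

```python
import math

# With Omega_i = Omega_N * (N/i)^(3/2) and Omega_1 = Omega_N * N^(3/2),
# the sum of ln(Omega_i / i) equals N*ln(Omega_1) - (5/2)*sum(ln i),
# since ln(Omega_i/i) = ln(Omega_N) + (3/2)ln(N) - (5/2)ln(i).
N = 500
Omega_N = 3.8e6 * N              # arbitrary magnitude in the spirit of the text
Omega_1 = Omega_N * N ** 1.5

lhs = sum(math.log(Omega_N * (N / i) ** 1.5 / i) for i in range(1, N + 1))
rhs = N * math.log(Omega_1) - 2.5 * sum(math.log(i) for i in range(1, N + 1))

print(math.isclose(lhs, rhs))    # True: the two forms are the same sum
```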

I changed some of my explanations above, and mostly took out the first explanation. It's really hard to see how it came out that simple in terms of Shannon entropy without the longer equation.

Let energy = society's resources = total happiness, let the p vector be the different ways people (atoms) can direct the energy (happiness) (a buying-choice vector: services, products, and leisure, say), and let volume determine the distance between people changing each other's p vectors (raising or lowering happiness), i.e., market transactions. High entropy results from longer distance between transactions, fewer people, and more energy. High entropy would be median happiness. Low entropy is wealth concentration. Just a thought.

Ywaz (talk) 12:43, 29 January 2016 (UTC)


 * Have you checked out http://www.eoht.info/page/Human+thermodynamics ? PAR (talk) 20:55, 26 February 2016 (UTC)

First sentence second paragraph needs editing.
Currently the sentence reads "A key measure in information theory is 'entropy'", but this sentence is inconsistent with the definition of "measure" in measure theory. For consistency with measure theory, I recommend that the "measure" of information theory be called "Shannon's measure." The "entropy" is then "Shannon's measure of the set difference between two state spaces", while the "mutual information" is "Shannon's measure of the intersection of the same two state spaces". 199.73.112.65 (talk) 01:56, 7 November 2016 (UTC)


 * Yes, the word "measure" should be used carefully. There is colloquial measure, Lebesgue measure, and perhaps "Shannon measure" (not sure if that is an accepted term). Colloquially, "this" is a measure of "that" if knowing a particular "this" allows you to calculate or recover a unique "that". I don't think the last sentence of the above comment belongs in an introduction, too technical. Shannon entropy or information entropy is, I think, a (not "the") characterization of the uncertainty associated with a probability distribution, or a (not "the") measure of uncertainty, with measure being interpreted in the colloquial sense. PAR (talk) 02:39, 7 November 2016 (UTC)

External links modified
Hello fellow Wikipedians,

I have just modified one external link on Information theory. Please take a moment to review my edit. If you have any questions, or need the bot to ignore the links, or the page altogether, please visit this simple FaQ for additional information. I made the following changes:
 * Added archive https://web.archive.org/web/20110723045720/http://aicanderson2.home.comcast.net/~aicanderson2/home.pdf to http://aicanderson2.home.comcast.net/~aicanderson2/home.pdf

When you have finished reviewing my changes, you may follow the instructions on the template below to fix any issues with the URLs.

Cheers.— InternetArchiveBot  (Report bug) 21:57, 13 November 2017 (UTC)

It's all about information
According to the theory of information, the Universe has constant information per volume, not necessarily constant mass and energy. For example, the Universe can pack a lot more energy and mass if they are arranged in repeatable or simple patterns. Also, as the Universe ages, its components become more convoluted and require more information to be fully described. Thus fewer but more convoluted particles and quanta of energy occupy the same information space as simple arrays of more matter and energy. It's information that generates the Big Bang, dark energy, and dark matter.

"Die" should be "Dice"
There is a probable typo, needs correction IMO 2A00:1110:134:A326:0:3F:CFE6:5301 (talk) 14:17, 21 June 2021 (UTC)
 * No, the term die is correct in singular form. It is not a typo. Mind  matrix  16:46, 21 June 2021 (UTC)