Talk:Maximum entropy probability distribution

History
Who was the first to show that the normal has maximum entropy among all distributions on the real line with specified mean μ and standard deviation σ? A reference for this would be a nice addition to the article. Btyner 21:21, 13 May 2006 (UTC)

It was Shannon who showed that if all second moments are known, the multivariate n-dimensional Gaussian maximizes the entropy over all distributions. See A Mathematical Theory of Communication, pp. 88-89 —Preceding unsigned comment added by Lovewarcoffee (talk • contribs) 05:04, 17 September 2008 (UTC)

Merging
I am against merging these articles. The principle of maximum entropy and maximum entropy probability distributions are completely different topics; there will not be enough room on the maximum entropy page to treat this topic well. It certainly needs its own article. I am removing the tag. Agalmic (talk) 01:19, 4 July 2008 (UTC)

Maximum entropy in other coordinate systems
I am wondering what the maximum entropy distribution would be for a variable in angular space. Note that for such a variable with support [-180°,180°], the points -180° and 180° are the same. So saying that the mean is μ=0°, μ=90° or any other value would be equivalent. I suspect that if no standard deviation σ is specified, the maxent distribution would be continuous uniform in angular space, and not a Gaussian. Do you have any references on this? It would be good to write something about the coordinate-system dependence in this article. Jorgenumata (talk) 10:04, 29 March 2009 (UTC)


 * OK, so I found out the answer. For a random variable with a circular distribution, given the mean and the (analog of the) variance, the von Mises distribution maximizes entropy. Jorgenumata (talk) 10:27, 28 April 2009 (UTC)

Wrong information measure
The formula given for the Entropy of continuous probability distributions,
 * $$H(X) = - \int_{-\infty}^\infty p(x)\log p(x) dx$$

does not follow from Shannon's theory. It lacks invariance under a change of variables $$ x \to y(x) $$. I.e., if I look at the distribution of radii of circles, and someone else at the distribution of areas, then we will get different entropies.
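As a quick numerical illustration of the non-invariance (my own sketch, not from the article): take X uniform on [0,1] and the change of variables Y = X². Both parametrizations describe the same objects, yet the differential entropies differ.

```python
import math, random

random.seed(0)
N = 200_000
xs = [random.random() for _ in range(N)]   # X ~ Uniform(0, 1), density p_X = 1
ys = [x * x for x in xs]                   # Y = X^2, density p_Y(y) = 1 / (2 sqrt(y))

# Differential entropy h = E[-log p]; h(X) = 0 exactly, since p_X = 1.
h_x = 0.0

# Monte Carlo estimate of h(Y) = E[-log p_Y(Y)] = E[log(2 sqrt(Y))].
h_y = sum(math.log(2.0 * math.sqrt(y)) for y in ys) / N

# Analytically h(Y) = log 2 - 1 ≈ -0.31: same collection of objects,
# different entropy depending on the parametrization.
print(h_x, h_y)
```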

In my opinion, this article should either be rewritten to describe only discrete-valued probability distributions, or a proper continuous information measure should be introduced, e.g., something like the Kullback-Leibler divergence,
 * $$H^c(p(x)\|m(x)) = -\int p(x)\log\frac{p(x)}{m(x)}\,dx.$$

See e.g. E. T. Jaynes, Probability Theory: The Logic of Science, Chapter 12.3, Cambridge University Press, 2003; or Entropy_(information_theory)
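To see concretely why the relative form fixes the invariance problem, here is a small sketch (my own choice of p and m, not from any source): with p uniform on (0,1) and a reference density m(x) = 2x, the quantity $$H^c$$ comes out the same whether computed in x or in y = x² coordinates, because the Jacobians cancel in the ratio p/m.

```python
import math, random

random.seed(1)
N = 200_000
xs = [random.random() for _ in range(N)]   # samples from p(x) = 1 on (0, 1)

# Reference density m(x) = 2x on (0, 1).  Under y = x^2 both densities pick up
# the same Jacobian: p_Y(y) = 1 / (2 sqrt(y)) and m_Y(y) = 1.
# H^c(p || m) = -E_p[log(p(X) / m(X))], estimated by Monte Carlo:
hc_x = -sum(math.log(1.0 / (2.0 * x)) for x in xs) / N

# The same functional evaluated in the y = x^2 coordinates:
ys = [x * x for x in xs]
hc_y = -sum(math.log((1.0 / (2.0 * math.sqrt(y))) / 1.0) for y in ys) / N

# The two estimates agree (analytically, both equal log 2 - 1),
# unlike the plain differential entropy.
print(hc_x, hc_y)
```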

Please tell me your opinion of what should be done! If I hear nothing, I'll just do something at some stage. Hanspi (talk) 12:23, 24 August 2009 (UTC)


 * I think most applications would not need the general underlying measure m. So it would probably be best to just note the assumption being made, with a link to the wiki article/section you have noted above. If there are any important applications that need something more complicated the situation might be different, but even then it may still be best to allow the simpler form to stand as an initial version as this is how it is likely to be expressed in literature most people will come across. Are there any known results for mixed discrete/continuous distributions? Melcombe (talk) 10:49, 25 August 2009 (UTC)


 * The "assumption being made" is precisely the problem; m(x) is the result of starting from a discrete probability distribution with n points and letting n go to infinity through a well-defined limit process. m=constant can only happen if x is limited to a finite range. m(x) is, apart from a constant factor, the prior probability distribution expressing complete ignorance of x; e.g., for scale parameters, it would be the Jeffreys prior, proportional to 1/x. So the underlying assumption contradicts all of the examples on the page, which are probability distributions with infinite range of x.


 * I see that the general reader would not be interested in this and would never even try to evaluate the entropy of a continuous probability distribution. You say that "this is how it is likely to be expressed in literature most people will come across"; I think this is so, but is it then OK to repeat the error here when we know about it? I feel this is arguing along the lines of "I'll tell you a lie, but it is a lie you can understand." Hanspi (talk) 05:59, 27 August 2009 (UTC)


 * This isn't the place to decide what is "right". Wikipedia is meant to reflect how terms and concepts are actually being used ... hence the need for citations (but this is too often flouted in the maths/stats articles). It would be OK to put in something cautionary either if you can find a citation that makes this point, or if the wikipedia article/section you mentioned already makes the point clearly enough. However, if you know of a source that does treat all the cases covered in a unified framework, and that you find more satisfactory, then perhaps you should revise the whole article to reflect that approach. After all, there are no in-line citations to support any of the stuff presently in the article. Melcombe (talk) 09:26, 27 August 2009 (UTC)


 * Oh, it is not necessary to decide here what is right, I can easily cite (and have done so above, Jaynes 2003) a place where this is done. The continuous entropy measure on that page is very clearly wrong.  Imagine the following: you have a selection of squares of side lengths k that are integer multiples of 1 mm.  You know the distribution $$p_k$$.  Now your colleague looks at the same set of squares, but he looks at the areas $$a=k^2$$ and gets a $$p_a$$.  Both of you calculate the entropy; both of you get the same result, as it should be, because the two of you are looking at the same thing.  Now you simply let go of the restriction that the k are multiples of 1 mm, so you permit k to be any real number.  Same thing; one of you will get $$p(k)$$, the other $$p(a)$$, still you are looking at the same collection of objects, but now you will get different entropies.  Does this not worry you?  It worries me!


 * Anyway, I saw that the page Principle_of_maximum_entropy that we cite already explains all this and refers to our page as an example page. We may just want to remove the theory from this page! Hanspi (talk) 06:17, 28 August 2009 (UTC)

(Unindenting for convenience) I have found two recent sources that both explicitly define "maximum entropy distributions" to be those with maximum entropy defined for m=1. Both sources explicitly recognise that this is something that is not invariant, but go on with it anyway. See (1) Williams, D. (2001) Weighing the Odds (pages 197-199) Cambridge UP ISBN 0-521-00618-x ; (2) Bernardo, J.M., Smith, A.F.M. (2000) Bayesian Theory (pages 209, 366) Wiley. ISBN 0-471-49464-x. (And Bernardo&Smith reference Jaynes's work (earlier, but on the same topic), so did know of it.) The article must at least recognise that this is how some/many sources define "maximum entropy distributions". However the article could go on to give the fuller definition but, if it does, each of the examples should be modified to state m explicitly (possibly with a justification). I note that the present exponential distribution example uses m=1, rather than something like 1/x. If there are some examples with a different m, then they should be included as otherwise even discussing the more general case will look odd.

I see that the Principle_of_maximum_entropy article suggests that maximum entropy solves the problem of specifying a prior distribution. Your comments suggest a background in Bayesian stuff, so you might like to consider this in relation to Bernardo&Smith's assertion that it does not (page 366, point (iv)). However, I don't think these articles should become too Bayes-specific.

Melcombe (talk) 11:35, 28 August 2009 (UTC)


 * What do you mean by "too Bayes-specific"? Yes, of course I know about Bayes's rule, I have an information theory background, and I have read several articles by Jaynes from the time when the orthodox vs. Bayes war seems to have been going on ferociously, but I thought this was over! A colleague who does information theory research assured me just a few weeks ago that at the research front, nobody doubts the Bayesian way anymore. Was this not correct?


 * Now to the problem at hand: Principle_of_maximum_entropy and this article together are inconsistent. One must be changed. Which one? I would definitely change this one. I don't have your literature, but together we could do something beautiful: we could omit the theory and refer the reader to Principle_of_maximum_entropy. If we then listed all distributions for both m=1 and m=1/x, we could make two sections: m=1, "use these if the unknown quantity is a location parameter", and m=1/x, "use these if the unknown quantity is a scale parameter". I could explain this intuitively, and then we would have an article that is useful in practice, answering "what distribution shall I use in which situation?" What do you think? Hanspi (talk) 14:51, 31 August 2009 (UTC)


 * P.S. Melcombe, I have ordered the books you cited from the Swiss network of libraries and information centres. I'll get them next week and can then read what you cited.   Hanspi (talk) 18:20, 1 September 2009 (UTC)


 * By "not too Bayes-specific" I meant two things, neither related to Bayes/frequentist controversy. First, there are some long-established results ... for example that the normal distribution is the maximum entropy distribution under given conditions ... that exist without a Bayesian background. Secondly, there is a need to cover the physics-based context already in the article for which giving a meaning to the reference measure may be problematic enough on its own, without having to find a Bayesian connection.
 * To move forward, I have added the more general definition to the article, keeping also the original with some citations (one of which is new). I have not added the Jaynes reference as I don't know whether that was in the context of "maximum entropy" or just "entropy". Do edit it as you think fit, but remember that this is meant to reflect what is in published sources, not to decide that Jaynes is either right or wrong and so to exclude anything different. Melcombe (talk) 10:45, 7 September 2009 (UTC)

Caveats
I added the 'citation needed' because I'd really like some clarification on that point. Is it that in general there is no maximum entropy distribution given the first three moments, or is it that there's something particular about this combination (seems unlikely)?

This is also unclear (to me): "bounded above but there is no distribution which attains the maximal entropy"

Is it asymptotic? Then can't we just take the limit as you approach the bound to get the answer?

If not, then can't we just look at all distributions that do exist and meet these conditions, and choose the one with the highest entropy (or the limit as you approach the highest entropy)? Because (there's probably something I misunderstand) if it's not asymptotic to the maximum entropy, then to me it sounds like the article says:

which is obviously wrong; it's okay if the maximum turns out not to be right on the theoretical limit.

(of course my example is iterating over the real numbers... so who am I to complain about things seeming wrong)

Sukisuki (talk) 13:37, 10 April 2010 (UTC)


 * I think I fixed this. PAR (talk) 04:57, 7 January 2022 (UTC)

Examples of maximum entropy distributions
Please include the Gamma distribution. — Preceding unsigned comment added by 187.64.42.156 (talk) 00:30, 19 April 2012 (UTC)
 * The initially empirical Zipf distribution and its derived form, the Mandelbrot distribution (of which Zipf's is a special case), happen to be maximum entropy distributions too, if the constraint imposed is the mean value of the log of the rank (for Zipf), or the mean value of a more complicated expression involving usage cost and storage cost (for Mandelbrot). I was shown that in 1974. I do not remember the demonstrations very well, but they seemed pretty obvious at the time. Perhaps somebody can find a link to them? 212.198.148.24 (talk) 19:06, 10 May 2013 (UTC)

Alternative way of writing entropy
Is it significant that $$H(X) = E[-\log p(X)]$$ (or $$E[-\log(p(X)/m(X))]$$ for the version with a measure)? It seems like an easy way of remembering this formula to me.
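The identity is easy to confirm numerically (a sketch of my own; the σ value is arbitrary): for a centered normal, the Monte Carlo average of $$-\log p(X)$$ should reproduce the closed-form entropy $$\tfrac{1}{2}\log(2\pi e\sigma^2)$$.

```python
import math, random

random.seed(2)
sigma = 3.0
N = 200_000

def log_pdf(x, s):
    # log of the N(0, s^2) density
    return -0.5 * math.log(2.0 * math.pi * s * s) - x * x / (2.0 * s * s)

# Monte Carlo estimate of E[-log p(X)] for X ~ N(0, sigma^2)
est = -sum(log_pdf(random.gauss(0.0, sigma), sigma) for _ in range(N)) / N

closed_form = 0.5 * math.log(2.0 * math.pi * math.e * sigma * sigma)
print(est, closed_form)  # the two should agree closely
```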

conditions of example don't apply
I removed this:

In physics, this occurs when gravity acts on a gas that is kept at constant pressure and temperature: if X describes the distance of a molecule from the bottom, then the variable X is exponentially distributed (which also means that the density of the gas depends on height proportionally to the exponential distribution). The reason: X is clearly positive and its mean, which corresponds to the average potential energy, is fixed. Over time, the system will attain its maximum entropy configuration, according to the second law of thermodynamics.

You can't just conclude this from the purely mathematical example (given in the article above where I removed this) of a maximum-entropy distribution constrained to have a given mean; you have to do the physics ! It can be viewed as a coincidence that the two examples both produce an exponential distribution. 178.38.142.81 (talk) 00:11, 2 February 2015 (UTC)
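Whatever one thinks of the physics, the purely mathematical statement (fixed mean on [0,∞) gives the exponential) can at least be spot-checked. The closed-form entropies below are standard; the particular comparison distributions, all chosen to have mean 2, are my own picks, not from the discussion.

```python
import math

mean = 2.0
euler_gamma = 0.5772156649015329  # Euler-Mascheroni constant

# Differential entropies (standard closed forms) of three distributions
# supported on [0, inf), all with mean 2:
h_exponential = 1.0 + math.log(mean)       # Exp with mean 2
h_gamma = 2.0 - (1.0 - euler_gamma)        # Gamma(k=2, theta=1); h = 2 - psi(2), psi(2) = 1 - gamma
h_half_normal = math.log(math.pi) + 0.5    # half-normal with sigma = sqrt(2*pi), mean 2

# The exponential should win: it is the maximum entropy distribution
# on [0, inf) with fixed mean.
print(h_exponential, h_gamma, h_half_normal)
```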

Examples motivating ME + constraint = classic distributions would be appreciated
I find ME interesting, but I find it puzzling as to when, in the real world, it should be invoked. My concern is that the utility of ME is exaggerated, though I would be extremely happy to have this concern allayed. Let me elaborate on my thinking: Suppose I have a set of data. From these I might measure their variance, and with that number in hand, I could maximize entropy subject to the constraint that the result match the variance I have measured. I would then obtain the normal distribution. But, on this basis, am I really supposed to believe that the normal distribution is an appropriate description of my data? What if I chose to measure some other quality of the data, I don't know, maybe kurtosis or mean absolute deviation, or something, then invoking ME would yield a different distribution. Same data, same effect generating the data, but now, all of a sudden, based on what I have chosen to do, a different distribution seems to be implied. To me, this is not how my analyses usually work. If I have data, I can measure the variance, sure, but I would probably want to invoke the central limit theorem to justify use of the normal, not ME. So I'm left wondering, in what circumstances would one actually invoke the very attractive mathematics of Maximum entropy probability distribution? If a bit of discussion on this were added to the article by someone with more insight than I apparently have, then that would be very much appreciated. Isambard Kingdom (talk) 17:37, 30 May 2015 (UTC)
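For what it's worth, the variance-constrained claim itself is easy to check against a couple of competitors (closed-form entropies are standard; the choice of competitors is mine): among distributions on the real line scaled to unit variance, the normal should come out with the largest differential entropy.

```python
import math

# Differential entropies (standard closed forms) of three distributions,
# all scaled to unit variance:
h_normal = 0.5 * math.log(2.0 * math.pi * math.e)   # N(0, 1)
h_laplace = 1.0 + math.log(math.sqrt(2.0))          # Laplace with b = 1/sqrt(2), var = 2 b^2 = 1
h_uniform = 0.5 * math.log(12.0)                    # Uniform of width sqrt(12), var = 1

# The normal should top the list: it is the maximum entropy
# distribution on the real line with fixed variance.
print(h_normal, h_laplace, h_uniform)
```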

My main comment is that there is a maximum entropy solution for the Boltzmann distribution which uses an undefined parameter (gamma) — Preceding unsigned comment added by 139.216.149.229 (talk) 02:18, 11 January 2019 (UTC)

The central limit theorem is just a special case of the maximum entropy principle. Averaging two independent variables (that are not drawn from a stable distribution) increases their entropy, but preserves the variance and mean (up to an irrelevant scaling factor). Thus, repeatedly averaging variables results in a distribution which tends towards the normal distribution. The CLT can be seen as a good example of why maximum-entropy distributions are common. The advantage of the maximum entropy principle over just using the CLT is it generalizes to cases where processes other than repeated sums are allowed. Note also that most people are happy to assume the likelihood is approximately normal, even though they know the variable of interest does not satisfy the conditions of the CLT. (I, personally, have never dealt with a variable that is a sum of infinitely many uncorrelated variables.) Maxent explains why this is valid: Choosing a normal distribution will minimize your expected error even if the likelihood is not normal. Closed Limelike Curves (talk) 01:43, 20 December 2020 (UTC)
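The averaging argument above can be illustrated with closed forms (a sketch under my own choice of Uniform(0,1) inputs): one variance-preserving averaging step already raises the entropy toward the Gaussian ceiling for that variance.

```python
import math

# Variance-preserving average of two iid Uniform(0, 1) variables:
# T = (X1 + X2) / sqrt(2), so Var(T) = Var(X1) = 1/12.
h_uniform = 0.0                                    # h of Uniform(0, 1)
h_avg = 0.5 - 0.5 * math.log(2.0)                  # triangular on [0, 2] has h = 1/2; scaling by 1/sqrt(2) subtracts log sqrt(2)
h_gauss_bound = 0.5 * math.log(2.0 * math.pi * math.e / 12.0)  # normal with var 1/12

# Entropy strictly increases toward, but stays below, the Gaussian bound.
print(h_uniform, h_avg, h_gauss_bound)
```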

"Choosing a normal distribution will minimize your expected error even if the likelihood is not normal." This sounds dubious. I thought the point of ME was to minimize assumptions, not to minimize expected error against every possible distribution that can give you a (large) sample whose mean and variance match what you have found. I take it you may mean it is the best guess when only mu and sigma are known, but in reality, one has a whole sample and can easily find statistics like skew that may rule out a normal distribution. One should not encourage people to turn a blind eye. Elias (talk) 12:42, 27 April 2022 (UTC)

Truncated Gaussian maximum entropy for positive support and fixed mean and variance
The truncated Gaussian fails to be the maximum entropy probability distribution for at least some cases: for example if mean = standard deviation, the maximum entropy distribution is the exponential (it already is when you fix only the mean; if you then also set the variance to match that of the exponential, the solution stays the same), and the truncated Gaussian never has the shape of an exponential regardless of the choice of the parameters.

Numerical counterexample: with mean 2 and standard deviation 2, the exponential has differential entropy around 1.7 while the truncated Gaussian has differential entropy roughly 1.6. — Preceding unsigned comment added by Bdn96 (talk • contribs) 10:36, 3 December 2020 (UTC)

Nonfull support and nonuniqueness of maximum entropy distribution
It seems to me there is a mistake in the section Maximum entropy probability distribution. One can read there « It follows that a distribution satisfying the expectation-constraints and maximising entropy must necessarily have full support », but there are examples of pdf with support on an interval, of maximal entropy among several natural sets of pdf, in particular among $$\{q\text{ pdf with finite entropy}:-\int q(x)f(x)dx=H(p)\}$$ -using the same notation as in the text. For instance the Beta distribution or the uniform distribution on $$[0,1]$$. I believe there is also a mistake in the constraint equality « $$\int p(x)f(x)dx=-H$$ » which should read $$\int q(x)f(x)dx=-H=-H(p)$$ as we are introducing a new pdf, here denoted by the variable $$q$$, in general different from $$p$$. Note that $$-\int q(x)f(x)dx=H(q,p)$$ is the cross-entropy of $$q$$ relative to $$p$$, and the Gibbs inequality implies that $$H(p)=H(q,p)\geq H(q)$$, from which it follows that $$p$$ is a maximal entropy distribution subject to the cross-entropy equality constraint -though equality of $$p$$ and $$q$$, thus uniqueness of the maximal entropy function, in the case of equality of entropies, seems not to be a part of the usual statement of the Gibbs inequality.
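The Gibbs-inequality step used above, $$H(q,p)\geq H(q)$$, is easy to verify in closed form for, say, centered normals (my own parameter choices; the formulas for the normal entropy and cross-entropy are standard):

```python
import math

def h(s):
    # differential entropy of N(0, s^2)
    return 0.5 * math.log(2.0 * math.pi * math.e * s * s)

def cross_h(s1, s2):
    # cross-entropy H(q, p) of q = N(0, s1^2) relative to p = N(0, s2^2)
    return 0.5 * math.log(2.0 * math.pi * s2 * s2) + (s1 * s1) / (2.0 * s2 * s2)

# Gibbs inequality: H(q, p) >= H(q), with equality iff q = p.
for s1, s2 in [(1.0, 1.0), (1.0, 2.0), (3.0, 1.5)]:
    print(s1, s2, cross_h(s1, s2) - h(s1))  # the gap is the KL divergence, >= 0
```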

Also, I believe uniqueness of maximum entropy distribution does not hold either for equal supports: for instance the normal distribution with standard deviation $$\sigma$$ and the Cauchy distribution with $$\gamma=\frac{\sigma\sqrt{2\pi e}}{4\pi}=\frac{\sigma}{2\sqrt{2\pi /e }}$$ have the same entropy $$\ln\left(\sigma\sqrt{2\pi e}\right)$$ and are supported on the whole real line but are different. I think the text would be correct if instead of "full support" it asserted "supported Lebesgue-almost surely on supp(p)".
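The normal/Cauchy equal-entropy claim checks out with the standard closed forms (a quick sketch; the particular σ is arbitrary):

```python
import math

sigma = 1.7  # any sigma works; this value is arbitrary
gamma = sigma * math.sqrt(2.0 * math.pi * math.e) / (4.0 * math.pi)

h_normal = math.log(sigma * math.sqrt(2.0 * math.pi * math.e))  # N(mu, sigma^2)
h_cauchy = math.log(4.0 * math.pi * gamma)                      # Cauchy with scale gamma

# Two different full-support distributions with identical differential entropy.
print(h_normal, h_cauchy)
```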

There is also a uniqueness criterion for the maximum entropy distribution without assuming a priori equality of support but only an inequality of what we may call an "entropy integral constraint"; see Lemma 4.2 and Theorem 4.3 in https://kconrad.math.uconn.edu/blurbs/analysis/entropypost.pdf. Actually, in this reference, Theorem 7.7 gives uniqueness, but only within a convex set of pdfs with finite entropy.

The statement « Every probability distribution is trivially a maximum entropy probability distribution under the constraint that the distribution has its own entropy. » is also false, I believe, even clarifying "its own entropy" to "the same entropy": the constraint is an equality of a cross-entropy with an entropy. As the examples above show, there are pdfs $$p\neq q$$ with $$H(p)=H(q)$$ and both full support; for those we can deduce that they do not belong to a common convex set of pdfs with finite entropy at most $$H(p)$$.

Please, anyone correct me if I am wrong. Plm203 (talk) 06:47, 25 October 2023 (UTC)