Talk:Kullback–Leibler divergence

possible clarification in opening paragraph
It says: A simple interpretation of the KL divergence of P from Q is the expected excess surprise from using Q as a model when the actual distribution is P.

That's a great sentence, but I sort of think it might be worth saying what we mean by "excess". We could add "By 'excess' we mean above the expected surprise when using P, since using the true distribution always minimizes expected surprise."

Or maybe that's obvious to most people. I actually missed the word "excess" on my first reading and thought the sentence was wrong. Maybe if we just bolded or italicized "excess" it would let people know it's an important word. Maybe "excess expected surprise" would read better than "expected excess surprise", since "expected surprise" is a familiar phrase, being equal to entropy. — Preceding unsigned comment added by 38.105.200.57 (talk) 21:23, 25 April 2024 (UTC)

real world EXAMPLE needed immediately following motivation
This article needs a clear real world example: http://en.wikipedia.org/wiki/Hypergeometric_distribution#Application_and_example

Suggestion: The average person can understand word relationships, and KLD has been applied to countless NLP problems. KL distance for images or other high-dimensional feature spaces could quickly turn off the reader.

To minus or not to minus
I think it is not equivalent to turn around the P and Q distributions inside the logarithm, and then get rid of the minus. If the measure-based definition is taken as the definition, then that should consistently induce all of the definitions for the special cases, such as the discrete and continuous cases. In particular, because absolute continuity is only required in one direction, the switch is not legal. Therefore, I am being bold and changing the definition of the discrete KL-divergence to be consistent with the measure-based definition. --Kaba3 (talk) 22:19, 8 October 2012 (UTC)

Ok, let's not be bold:) I was too quick to look: the current version is consistent. Instead, the formula on the Radon-Nikodym theorem page suffers from not using the minus-form. I'll go fix that instead. --Kaba3 (talk) 22:46, 8 October 2012 (UTC)

Hmm.. Something is still wrong. According to Sergio Verdu, to whose video lecture there is a link on the article, the definition of Kullback-Leibler divergence is given by


 * $$ D_{\mathrm{KL}}(P\|Q) = \int_X \ln \frac{{\rm d}P}{{\rm d}Q} \,{\rm d}P,$$

if P is absolutely continuous with respect to Q (P << Q), and infinite otherwise. This is not equivalent to appending a minus, and switching to dQ/dP, since that would require Q << P. Fixing this makes everything consistent, which I'll now go do. --Kaba3 (talk) 23:13, 8 October 2012 (UTC)

2004-2006 discussions
To whom it may concern. Recently the statement "KL(p,q) = 0 iff p=q" was added. I suspect that's not quite the case; maybe we want "KL(p,q) = 0 iff p=q (except for a set of measure 0 wrt p)" ?? Happy editing, Wile E. Heresiarch 01:41, 21 Oct 2004 (UTC)


 * I added "KL(p,q) = 0 iff p=q", which is a stronger claim than "KL(p,p)=0" in an earlier revision. My own understanding of measure theory is pretty limited; moreover, the article does not explicitly mention measures in connection with the integrals.  However, Kullback and Leibler in their 1951 paper (lemma 3.1) did consider this and say that divergence is equal to zero if and only if the measurable functions p and q are equivalent with respect to the measure (i.e. p and q are equal except on a null set).  That would include the case you mentioned, wouldn't it? --MarkSweep 03:29, 21 Oct 2004 (UTC)


 * Yes, that's what I'm getting at. What's not clear to me is that when we say p and q are equal except for a set of $FOO-measure 0, which measure $FOO are we talking about? I guessed the measure induced by p; but K & L must have specified which one in their paper. Wile E. Heresiarch 00:59, 22 Oct 2004 (UTC)


 * I checked the K&L paper again, but they simply define a compound measure λ as the pair of the measures associated with the functions p and q. The lemma is stated in terms of the measure λ. --MarkSweep 09:28, 24 Oct 2004 (UTC)


 * You could say "KL(p,q) = 0 iff p=q almost everywhere" which is concise and says what I think you're trying to say. - grubber 01:57, 17 October 2005 (UTC)


 * The K-L divergence is between two probability measures on the same space. The support of one measure must be contained in the support of the other measure for the K-L divergence to be defined:  For $$ D_{KL}(\mathbb P \| \mathbb Q) $$ to be defined, it is necessary that $$ \mathrm{supp}\, \mathbb P \subseteq \mathrm{supp}\, \mathbb Q .$$  Otherwise there is an unavoidable zero in the denominator in the integral.  Here the "support" is the intersection of all closed sets of measure 1, and has itself measure 1 because the space of real numbers is second-countable.  The integral should actually be taken over $$ \mathrm{supp}\, \mathbb P $$ rather than over all the reals.  The "$$ \log N$$" that appears in some of the equations in the article should be the logarithm of the cardinality of the support of some probability measure on a discrete space. -- 130.94.162.61 20:35, 25 February 2006 (UTC)


 * Isn't the necessary condition $$ \mathbb P << \mathbb Q $$ and isn't $$D_{KL}(\mathbb P \| \mathbb Q) = \int_{\mathrm{supp}\,\mathbb P} \log\frac{d\mathbb P}{d\mathbb Q}\,d\mathbb P = \int_{\mathrm{supp}\,\mathbb P} \frac{d\mathbb P}{d\mathbb Q}\log\frac{d\mathbb P}{d\mathbb Q}\,d\mathbb Q$$? -- 130.94.162.61 19:15, 27 February 2006 (UTC)


 * Re reversion: Why have a whole article on "cross entropy", then, if it's not significant? -- 130.94.162.61 04:18, 2 March 2006 (UTC)


 * Excellent article, by the way. Has the context, clear explanation of technical details, related concepts, everything that is needed in a technical article. -- 130.94.162.61 16:37, 11 March 2006 (UTC)


 * I agree that the article is excellent. I have several comments and a question.
 * Regarding the comment of 25 February 2006 and the follow-on, this has special importance when analyzing natural data (as opposed to data composed of man-made alphabets). If the P distribution is based on a corpus of pre-existing data, then, as soon as you discover a new "letter" in an observed sample (upon which the Q distribution is based), you can no longer use Dkl, because then there will be a zero in the denominator.
 * To give a concrete example, suppose you are looking at amino-acid distributions in proteins and are using Dkl as a measure of how different the composition of a certain class of proteins (Q) is from that of a broad sample (P) comprising many classes. If the Q set lacks some amino acids that the P set contains, you can still compute a Dkl.  But suppose that all of a sudden you discover a new amino acid in the Q set;  this isn't as far-fetched as it sounds, if you admit somatically modified species.  Then, no matter how infrequent that new species is, Dkl goes to infinity (or, if you prefer, hell in a handbasket).
 * This may be one of the reasons why K&L's symmetric measure has not met with favor: as the above example shows, you could easily come across a message that lacks some of the characters in the alphabet that P was computed over.  In this case, the reverse -- and therefore the symmetric -- measure cannot be computed, even though the forward one can.
 * For example, imagine trying to create a Dkl for the composition of a single protein, compared to the composition of a broad set of proteins. It would be quite possible that some amino acid present in the large sample might not be represented in the small sample. But the one-way Dkl can still be computed and is useful.
 * Question: Is there a literature on expected error bounds on a Dkl estimate due to finite sample size (as there is for the Shannon entropy)? --Shenkin 03:05, 4 July 2006 (UTC)
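
The amino-acid point above can be made concrete with a small sketch. The counts and the names `reference_counts`, `sample_counts` and `kl_from_counts` are made up for illustration, and the quantity computed is the plug-in sum over the sample with the reference in the denominator, which is how the comments above appear to use Dkl.

```python
import math
from collections import Counter


def kl_from_counts(sample_counts, reference_counts):
    """Plug-in estimate of sum_x s(x) * log(s(x) / r(x)) in nats.

    s and r are the relative frequencies of the two count tables.
    Returns math.inf when the sample contains a symbol the reference lacks.
    """
    n_sample = sum(sample_counts.values())
    n_reference = sum(reference_counts.values())
    total = 0.0
    for symbol, count in sample_counts.items():
        if count == 0:
            continue  # convention: 0 * log 0 = 0
        ref_count = reference_counts.get(symbol, 0)
        if ref_count == 0:
            return math.inf  # zero in the denominator
        s = count / n_sample
        r = ref_count / n_reference
        total += s * math.log(s / r)
    return total


# Hypothetical counts: a broad reference set and a small sample missing "W".
reference_counts = Counter({"A": 50, "G": 30, "L": 15, "W": 5})
sample_counts = Counter({"A": 6, "G": 3, "L": 1})

forward = kl_from_counts(sample_counts, reference_counts)  # finite
reverse = kl_from_counts(reference_counts, sample_counts)  # inf: "W" never seen in the sample

# A single novel symbol in the sample makes the forward direction infinite too.
sample_with_new = sample_counts + Counter({"X": 1})
forward_new = kl_from_counts(sample_with_new, reference_counts)  # inf
```

The sample lacking a symbol is harmless in the forward direction, while a symbol missing from the denominator distribution sends the value to infinity, however rare that symbol is.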

regarding recent revert
Regarding Jheald's recent revert: I spent a lot of time reorganizing the article and wanted to discuss the changes. Maybe some of them can be reimplemented? Here's a list:
 * you cannot have two probability distributions on the same random variable -- it's nonsense! What you have are two random variables -- but why even talk about that.  Just say given two discrete probability distributions.
 * Absolutely you can, if they are conditioned on different information, or reflect different individuals' different knowledge, or different degrees of belief; or if one distribution is based on deliberate approximation. P(X|I) is different from P(X|D,I), but they are both distributions on the random variable X.
 * I understand what you mean. How does this mesh with the definition at random variable though?  It says that every random variable follows a (single?) distribution.  Probability theory and measurable function say that random variables are functions that map outcomes to real numbers (in the discrete case at least).
 * Strictly speaking, the random variable is the mapping X: Ω → R. That is a function that can be applied to many probability spaces, distinguished by different measures: (Ω,Σ,P), (Ω,Σ,Q), (Ω,Σ,R), (Ω,Σ,S) etc.  More loosely, we tend to talk of the random variable X as a quantity to which we can assign probability distribution(s), where these are the distributions induced from the measures P,Q,R,S by applying the mapping X to Ω.  Either way, it is entirely conventional to talk about "the probability distribution of the random variable X". Jheald 21:59, 3 March 2007 (UTC)


 * use distinguish template instead of clouding the text
 * Don't use ugly hat notes for mere asides (and that template is particularly ugly). Nobody is going to type in Kullback-Leibler if they want an article on vector calculus.
 * You're probably right :) I was just trying to get rid of the note "(not to be confused with..." from the text.


 * list of alternative names all together instead of mixed into the text at the author's whim
 * Lists which are too long are hard to read, and break up the flow. Better to stress only information divergence, information gain, and relative entropy first.  Information gain and relative entropy are particularly important because they are different ways to think about the K-L divergence.  They should stand out.  On the other hand K-L distance is just an obvious loose abbreviation.
 * Okay... :)


 * Generalizing the two examples is a new paragraph
 * Unnecessary, and visually less appealing. Better rhythm without the break.
 * Yeah... but poor grammar :(


 * Gibbs inequality is the most basic property and comes first
 * No, the most important property is that this functional means something. The anchor for that meaning is the Kraft-McMillan theorem.  And that meaning informs what the other properties mean.
 * Hmmm. that's a really good point. I didn't see it that way before.


 * the motivation, properties and terminology section is split up into three sections
 * Two-line long sections should suggest over-division. Besides, the whole point of calling it a "directed" divergence was about the non-symmetry of D(P||Q) and D(Q||P).


 * the note about the KL divergence being well-defined for continuous distributions is superfluous given that it is defined for continuous distributions in the introduction.
 * Not superfluous. It is hugely important in this context, in that the Shannon entropy does not have a reliably interpretable meaning for continuous distributions.  The K-L divergence (a.k.a. relative entropy) does.
 * Oh! :) We could make that clear: "Unlike Shannon entropy, the KL divergence remains..."

--MisterSheik 13:51, 3 March 2007 (UTC)
 * Sheik, backing revisions out wholesale is not something I do lightly. I can see from the log that you spent time thinking about them.  But in this case, as with Beta distribution, I thought that not one of your edits was positive for the article. That contrasts with Probability distribution, Quantities of information, Probability theory and Information theory, where although I have issues with some of the changes you made, I thought some of the steps were in the right direction. Jheald 15:37, 3 March 2007 (UTC)


 * Thanks for getting back to me Jheald. I'm glad that backing out revisions isn't something you take lightly :)  I'm going to take a break from this article so that I can think it over some more.  I'll add any ideas to the talk page so that we can discuss them.


 * Also, I'm glad you were okay with (most) of my changes to those other articles. I didn't actually add any information to probability theory; I just brought together information that was spread over many pages and mostly reduplicated.  I think it's unfortunate that some pages (like the information entropy pages) seem disorganized (individual pages are organized, but the group as a whole is hard to read--you end up reading the same thing on many pages.)  Do you think that these pages could be organized?  Do you have any idea how we could start?

MisterSheik 18:28, 3 March 2007 (UTC)

motivation
I read the motivation, but I am not really sure what it means. The first two sentences have nothing to do with the rest of the article. Could someone make this clearer? —Preceding unsigned comment added by Forwardmeasure (talk • contribs) 03:38, 25 March 2007


 * In fact, there is no "motivation" in there at all. That section really should be renamed.  MisterSheik 03:54, 25 March 2007 (UTC)

f-divergence
The link to f-divergence I placed in the opening section was removed and not placed anywhere else in the article. Is there a reason for this? This family of divergences is a fairly important generalisation and may lead readers who find the KL-divergence unsuitable to something more appropriate. Should I put it back in somewhere else? If not, a reason would be helpful as I can't understand the motivation to remove it completely. MDReid (talk) 11:26, 15 January 2008 (UTC)

Is there any relation between Kullback-Leibler divergence and information value? (Nav75 (talk) 14:54, 12 November 2008 (UTC)) -> I would say that the Bayesian approach with entropy gives a pretty good idea of this measure in terms of information

KL divergence and Bayesian updating
In this section the demonstration is not sufficient. Some further approximations are needed to get the results and are not addressed. The implicit assumption here is that p(x|i) ≈ p(y|i), which is true if knowing y is close in terms of likelihood and entropy to x, which is not always true... —Preceding unsigned comment added by 81.194.28.5 (talk) 16:16, 25 February 2009 (UTC)


 * The formula given is exact. It comes straight from the definition of DKL.  But it assumes you have observed and now know the exact value of y.


 * If you haven't observed y, you can calculate the expected information gain about x from y by summing over the possible values of y weighted by the various probabilities P(y|I). That gives the result that the expected (i.e. average) information gain is the mutual information.  Jheald (talk) 17:30, 25 February 2009 (UTC)
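
The averaging step described above can be checked numerically. The following is only a sketch: the joint distribution `joint` and the helper `information_gain` are invented for illustration and are not taken from the article.

```python
import math

# Hypothetical joint distribution P(x, y | I) over two binary variables.
joint = {
    ("x0", "y0"): 0.4,
    ("x0", "y1"): 0.1,
    ("x1", "y0"): 0.2,
    ("x1", "y1"): 0.3,
}

xs = sorted({x for x, _ in joint})
ys = sorted({y for _, y in joint})
p_x = {x: sum(joint[(x, y)] for y in ys) for x in xs}  # prior P(x | I)
p_y = {y: sum(joint[(x, y)] for x in xs) for y in ys}  # P(y | I)


def information_gain(y):
    """D_KL( P(x | y, I) || P(x | I) ) in bits, for one observed value y."""
    gain = 0.0
    for x in xs:
        posterior = joint[(x, y)] / p_y[y]
        if posterior > 0:
            gain += posterior * math.log2(posterior / p_x[x])
    return gain


# Average the gain over the possible values of y, weighted by P(y | I).
expected_gain = sum(p_y[y] * information_gain(y) for y in ys)

# Mutual information computed directly from the joint distribution.
mutual_information = sum(
    p * math.log2(p / (p_x[x] * p_y[y])) for (x, y), p in joint.items() if p > 0
)
```

For any joint distribution the two quantities agree up to rounding, since the weighted sum of the per-observation divergences expands term by term into the mutual information sum.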

Reorganization
The "Motivations, properties, etc" is a hodgepodge of ideas. I actually added to it, to complement the standard "it's not a metric" boilerplate, but I don't think I made it any more confusing than it already was. The way ideas are ordered makes it almost trivial to add subtitles with minor reorganization, I just ran out of time today. —Preceding unsigned comment added by Dnavarro (talk • contribs) 15:04, 29 April 2009 (UTC)

Renyi reference
It would be nice to have a full citation for the Renyi (1961) reference - to which paper does it refer? 94.192.230.5 (talk) 09:59, 14 June 2009 (UTC)


 * Full citation now added, with link to online copy.
 * In the paper (p. 554) he characterises his generalised alpha divergence as "the information of order α obtained if the distribution P is replaced with the distribution Q" (note he is using a P and Q defined the other way round to how we're defining them in the article).
 * This isn't quite the snappier term "information gain", but it is very close, and the thought behind it is exactly the same: the information gained by doing an experiment which allows you to improve your probability distribution. The actual term "information gain" may come from his book Probability -- I'll have to check. Jheald (talk) 11:28, 14 June 2009 (UTC)

Simplifying needed
The sentence "KL measures the expected number of extra bits required to code samples from P when using a code based on Q" doesn't mean much to a guy like me. It should be simplified. 217.132.229.177 (talk) 07:39, 28 August 2009 (UTC)
 * Umm, that's fairly clear actually.
 * Maybe it wouldn't seem so clear to you if you were using a different code. Not that I think it is a bad explanation: maybe change "based on" for "designed for encoding" though. —Preceding unsigned comment added by 139.184.30.134 (talk) 21:24, 20 October 2010 (UTC)
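
A small worked example may help with the "extra bits" reading. It is a sketch under assumptions: the two distributions are invented and chosen to be dyadic so that the ideal code lengths are whole bits, and `code_lengths` and `expected_length` are illustrative helper names.

```python
import math

# Hypothetical dyadic distributions, so ideal code lengths are whole bits.
p = {"a": 0.5, "b": 0.25, "c": 0.25}  # true source distribution P
q = {"a": 0.25, "b": 0.25, "c": 0.5}  # distribution Q the code was designed for


def code_lengths(dist):
    """Ideal code length -log2(prob) for each symbol, in bits."""
    return {s: -math.log2(prob) for s, prob in dist.items()}


def expected_length(source, lengths):
    """Average bits per symbol when symbols are drawn from `source`."""
    return sum(source[s] * lengths[s] for s in source)


bits_with_p_code = expected_length(p, code_lengths(p))  # 1.5, the entropy of P
bits_with_q_code = expected_length(p, code_lengths(q))  # 1.75
extra_bits = bits_with_q_code - bits_with_p_code        # 0.25

# The same number straight from the definition of D(P || Q), in bits.
kl_bits = sum(p[s] * math.log2(p[s] / q[s]) for s in p)  # 0.25
```

Here the code designed for Q gives the most frequent symbol "a" a two-bit codeword instead of a one-bit one, and the average cost of that mismatch is a quarter of a bit per symbol.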

Book recommendations
Any good book recommendations on this topic? Thanks —Preceding unsigned comment added by 129.133.94.143 (talk) 22:05, 10 January 2010 (UTC)

P to Q or Q to P?
Is it standard terminology to refer to $$D_{\mathrm{KL}}(P\|Q)$$ as the Kullback-Leibler divergence from P to Q, as this article does? Cover and Thomas, for instance, refer to it only as the divergence between P and Q.

I ask mainly because it seems the wrong way round. In the Bayesian interpretation of the KL-divergence, Q is the prior and P is the posterior, and it seems very strange to be talking about something being "from" the posterior "to" the prior. If it's completely accepted terminology then fair enough I guess, but it does seem confusing, and a quick google revealed only this page that uses the from/to terminology at all. Can someone supply a reference for it?

Cover and Thomas' terminology isn't perfect either, since it seems to imply symmetry - but to my mind this is less confusing than putting the "from" and the "to" the way round that this article does.

Nathaniel Virgo (talk) 14:38, 6 April 2010 (UTC)


 * My view is that if you identify P with "the truth" (or, at least, our current best estimate of it), then P is distinguished: there is one P, but there may be many Qs. That is the reason I think it makes sense to speak of the KL divergence of Q, from P.


 * Also, I think it is useful to emphasise the asymmetry.


 * Bayes' theorem finds the new P which minimises the divergence of the previous Q from the new P. I don't see a problem with that.


 * But I accept it's not cut and dried. Looking at some of the other synonyms, information gain does strongly suggest information gained by moving from Q to P; and it is customary to speak of the relative entropy of P relative to Q (though I'm not sure, when you think about it, that that conveys at all the right idea).


 * So I stand on the idea that, at least mentally, you are fixing P and then thinking of the divergence of Q from it, hence "from" and "to" as per the article. Jheald (talk) 17:26, 6 April 2010 (UTC)


 * Using the word "and" as well as the word "between" in "as the divergence between P and Q" implies that KL(P||Q) = KL(Q||P). This is incorrect.  In a sense you are applying the projective mapping q onto the "real" probability manifold p, and kl-div is giving you the average divergence of that q from the "true" distribution, p (in bits or nats), which is not equal to the div of p from q.  So the words "onto" and "from" would be appropriate, but "between" is not, nor is "and", as neither implies the directionality of the relationship, which is not reversible (/symmetric). And I hope I didn't get my p's and q's flipped.  (Mind your p's and q's! (What the hell are my p's and q's?))Kevin Baastalk 00:07, 7 April 2010 (UTC)


 * I guess I can sort-of see your argument, Jheald, although in variational Bayes it's the prior Q that's fixed and the posterior P that's varied. Personally, I see $$D_{\mathrm{KL}}(P\|Q)$$ as representing the information gained in going from the prior Q to the posterior P, and hence calling it the divergence from P to Q seems the wrong way around.   To me the Bayesian-style interpretation is far more fundamental than the interpretation to do with correct and incorrect codings, since coding theory is just one application of information theory and the KL divergence has applications in variational Bayes and hypothesis testing that have nothing to do with it.  I find the terminology in this article makes discussions about the subject awkward because I'm always having to explain that I'm updating from a prior to a posterior but calculating the divergence from the posterior to the prior.


 * However, what matters is not what any of us thinks makes sense but what is standard in the literature. Otherwise it would be "original research", and hence not suitable for a Wikipedia article.  I agree that the use of "between...and" implies symmetry but I repeat that it's the only terminology I've seen used in published literature.


 * If someone can point to some use or justification of this terminology in the established literature on the subject then fair enough I guess, but otherwise I think the article should be changed to use the more standard (though sadly not ideal) "between...and" terminology. Nathaniel Virgo (talk) 01:15, 8 April 2010 (UTC)

Ok, so I've just found a definite published example of the "from P to Q" usage (i.e. the opposite from the article): http://www.kent.ac.uk/secl/philosophy/jw/2009/deFinetti.pdf (a book review in Philosophia Mathematica). I will put a note in the article to the effect that the terminology is not standard and, if nobody can come up with a published example of the "from Q to P" usage I will purge it from the article in a couple of weeks' time (when I'm less busy). Nathaniel Virgo (talk) 21:38, 17 May 2010 (UTC)

Could we standardize this and put the evaluation probability measure last? I mean changing all the notation to be consistent with the article F-divergence. Does anyone disagree? Jackzhp (talk) 14:23, 23 February 2011 (UTC)


 * Using "between" or something like that is just wrong, as it implies symmetry and this is not symmetric. KL(P||Q) is not equal to  KL(Q||P).  Though it is a distance measure, particularly from the distribution p to the distribution q. You can think of it roughly as what level of magnification you need to read language Q' if you see digitally and the pixels in your eyes are arranged according to P'.  In this sense it's quite literally from your eye, P', to the script, Q'.  Whatever semantics you must use, the wording must maintain the asymmetry. Ideally it would also be visually intuitive, otherwise, it's not really communicating and thus isn't really language.  If you find a textbook that uses anything like "between p and q", then that textbook is wrong, and I suggest you not use it as it might contain other such errors.  Kevin Baastalk 15:24, 23 February 2011 (UTC)
 * Agree, so let's remove the word "between", and explicitly state that we should not say the word "between". Jackzhp (talk) 18:09, 23 February 2011 (UTC)

This discussion is intensely frustrating. Jheald has it in his head backwards, and simply reverts changes that anyone else makes (as he does with other edits he doesn't like), and no one finds it worth fighting with him. I first tried changing this in 2009 with a reference to Wolfram MathWorld (http://mathworld.wolfram.com/RelativeEntropy.html) which was insufficient of a source for Jheald—search the edit logs for "Mathworld is wrong" to see what I mean. That's when I gave up trying to fix it—I didn't have the energy for a fight. I left it alone, hoping Jheald would go away and let someone who knew what they were doing fix it. I come back now, and the article is still wrong, and Jheald is still arguing it backwards.

Rather than try to persuade each other as to what sounds right or what sounds wrong, why not use the asymmetric nature of the metric to compute the answer? Consider two distributions:


 * A = {X:25%, Y:25%, Z:25%, Q:25%}, and


 * B = {X:33%, Y:33%, Z:33%, Q:0%},

where "33%" is shorthand for "one third". Now one of these has a higher KL divergence with respect to the other than the other does with respect to the one. Which is which? If you assume A is truth, then B has some divergence from that truth. If you assume B is truth, then A has infinite divergence because there should be no Q under any circumstances. So, does this not mean that the relative entropy/divergence of A wrt B is infinity, while the relative entropy/divergence of B wrt A is finite? I think it does. Imagine a large bucket of balls, evenly distributed across four colors: that's A. B is a distribution you could get from selecting 3 balls from that bucket (or 6 balls, or 9 balls, though it obviously becomes less likely as the number goes up.) B (the sample) thus diverges a bit from A (the truth). Now consider a different bucket of balls of three colors, evenly distributed: that's B. If you pull out 4 balls and get 4 different colors, you couldn't have been reaching into bucket B: A (the sample) has diverged infinitely from the distribution of B (the truth).

If you do not think this describes "the divergence of A wrt B" and "the divergence of B wrt A", then argue why it's the other way around, please.

Computing D(A||B) and D(B||A) is straightforward and unambiguous. Using ".33" for "1/3" to avoid too many fraction slashes:


 * D(B||A) = .33*log(.33/.25) + .33*log(.33/.25) + .33*log(.33/.25) + 0*log(0/.25); By convention 0*log0 = 0; so we have D(B||A) = log(.33/.25) = .1249... with a base-10 log.


 * D(A||B) = .25*log(.25/.33) + .25*log(.25/.33) + .25*log(.25/.33) + .25*log(.25/0); log(.25/0) goes to infinity, so the total sum goes to infinity.

If you really don't like division by 0, set the probability of Q in distribution B to be ε = 1/1E+50. That doesn't change the value of D(B||A) in the first 4 significant digits, but D(A||B) goes to 12.2590... Now set ε = 1/1E+500, or ε = 1/1E+50_000_000.

My conclusion: D(B||A) is the divergence of B wrt A. D(A||B) is the divergence of A wrt B. MathWorld got it right—big shock! Wikipedia has it wrong. This is why I quit editing anything other than typos in Wikipedia. 108.28.163.61 (talk) 20:24, 15 June 2011 (UTC)
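
The arithmetic above is easy to reproduce. This is a sketch rather than a reference implementation: `kl_divergence` is an illustrative helper written for this example, using base-10 logs and the 0*log0 = 0 convention as in the comment, with exact thirds in place of ".33".

```python
import math


def kl_divergence(p, q, log=math.log10):
    """D(p || q) = sum_x p(x) * log(p(x) / q(x)), with 0 * log 0 = 0.

    Returns math.inf if p puts mass on an outcome where q has none.
    """
    total = 0.0
    for outcome, p_x in p.items():
        if p_x == 0:
            continue
        q_x = q.get(outcome, 0.0)
        if q_x == 0:
            return math.inf
        total += p_x * log(p_x / q_x)
    return total


A = {"X": 0.25, "Y": 0.25, "Z": 0.25, "Q": 0.25}
B = {"X": 1 / 3, "Y": 1 / 3, "Z": 1 / 3, "Q": 0.0}

d_b_a = kl_divergence(B, A)  # log10(4/3), about 0.1249
d_a_b = kl_divergence(A, B)  # inf

# Replace the zero with a tiny epsilon, as suggested above.
eps = 1e-50
B_eps = {"X": (1 - eps) / 3, "Y": (1 - eps) / 3, "Z": (1 - eps) / 3, "Q": eps}
d_b_a_eps = kl_divergence(B_eps, A)  # unchanged to many digits
d_a_b_eps = kl_divergence(A, B_eps)  # about 12.256 with exact thirds
```

With exact thirds the ε = 1E-50 value of D(A||B) comes out near 12.256; the 12.2590 quoted above is what the same sum gives when the literal .33 is used. Much smaller values such as 1E-500 underflow to zero in double precision, so this helper would return infinity for them.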

Is it possible that my earlier argument convinced everyone? Can we switch it around now? 108.28.163.61 (talk) 00:46, 9 July 2011 (UTC)


 * I think you're right, and I was wrong. I have made the change you suggest, and will now go through the articles that link here, making sure they're consistent with the (now revised) usage here. If there are any that I've missed, please do fix them. Jheald (talk) 19:37, 7 November 2012 (UTC)


 * I have made the changes now; but I am having some qualms of second (third?) thoughts, having now looked through the articles that link here, wondering if perhaps I was right the first time after all.


 * I suspect that where books do specify a direction, they talk about the KL divergence of the posterior from the prior. And ultimately, the balance of the usage in the real world has to be what ought to most strongly guide us here.  But it might be worth surveying.  The first book I've just happened to look at (Gelman et al 1995 Bayesian Data Analysis 1st ed, p.485) writes of "the Kullback information of the model relative to the true distribution" -- ie DKL of q relative to p; although admittedly this is in a context where they are holding p fixed and varying q, which may make language this way round more natural.  But if that's a first data point, it could well be worth surveying further.


 * On an aesthetic basis, I suppose it comes down to weighing whether it's from p (and/or with respect to p), because it's the distribution p we're taking as fundamental and weighting all our expectations in proportion to -- or whether it's from q, because it is zero values in the distribution of q which have such a determinative impact.


 * One article I noticed was cross-entropy, H(P,Q) = H(P) + D(P||Q) -- the average number of bits needed to code a datum if coded according to q rather than p. It does seem to make sense to consider this as the entropy of P plus an 'extra' number of bits for going from P to Q.


 * I was also struck by a number of articles that I've now changed to describe the "divergence of a distribution p from a reference prior m" using the new language -- but I can't help having the feeling that if m is a reference baseline that things are being calculated from, then doesn't that sort of suggest that it's m that the expectations should be being calculated with respect to? -- a sense that goes away if we write "divergence from the distribution to a reference prior m".


 * Looking at some of the synonyms for the divergence, "information gain" does strongly suggest moving from Q to P. But perhaps Akaike Information Criterion is nearer the mark when it suggests that what the KL divergence is really about is the amount of information loss in giving up P and moving back to Q (while still using P as the measuring stick to assess expectations).


 * Admittedly "discrimination information" goes the other way. If we consider P having a narrower support than Q, it may be hard to discriminate P samples from a priorly assumed Q (small D(P||Q) away from Q), but easier to discriminate Q samples from a priorly assumed P (large D(Q||P)).  So here it does make sense to talk of discriminating a true distribution P from a prior distribution Q.


 * And then what about "relative information"/"relative entropy". Having got used to talking about H(p) as the Shannon entropy of p, then when we want to bring in a reference measure m, it then feels sort of appropriate to be talking about D(p||m) as the entropy of p "relative to" the measure m.  But is this really what we mean?


 * ... to be given more thought. Jheald (talk) 23:42, 7 November 2012 (UTC)


 * I have now done what I should have done to start with, namely look at a straw poll from Google books, to see what form the world really is using. Searching for "Kullback-Leibler divergence of", "Kullback-Leibler distance of", and "Kullback-Leibler divergence from"  I'm getting 26 - 8 for "q from p", i.e. "approximation from truth".  That may not be particularly scientific, since (i) Google fights are considered dodgy anyway  (ii) it's only quite a small sample, (iii) many are of very little authority (iv) I had to ignore entries where Google only gave me snippets and I couldn't see the context, and (v) I got tired with the last lot and didn't get to the end.  But I hope it's a good enough straw poll to decide that it should be the "from p to q" form that we go with.  (And so now I have to go back and undo all the changes I made yesterday!)


 * I liked the explicit gloss given by Burnham and Anderson, as adapted in the paper in Kurdila: "The Kullback-Leibler distance of q from p is a measure of the information lost when q is used to approximate p". I think that would be well worth incorporating into the lead. Jheald (talk) 11:41, 8 November 2012 (UTC)

Hits from Google books
Details of the Google hits below, put into a show/hide box for convenience. Jheald (talk) 11:41, 8 November 2012 (UTC)

 Details of Google Books results

Ambivalent
Kenneth P. Burnham, David R. Anderson (2002),  ''Model Selection and Multi-Model Inference: A Practical Information-Theoretic Approach''
 * p. 51: Kullback-Leibler information between models f and g. The notation I(f, g) denotes the “information lost when g is used to approximate f.” As a heuristic interpretation, I(f,g) is the distance from g to f
 * but p. 296: For a set of R models the Kullback-Leibler "distance" of model g_r from truth is denoted I(f, g_r)...

From p to q (from "truth" to "approximation")
CS Wallace (2005), Statistical and Inductive Inference by Minimum Message Length 
 * ...the Kullback-Leibler distance of b from a is defined as...

Almeida (2006), Nonlinear Source Separation 
 * The Kullback-Leibler divergence of a density q relative to another density p is defined as... KLD(p,q) is often interpreted as measuring the deviation of an approximate density q from a true density p.

Oded Maimon, Lior Rokach (2005), Data Mining and Knowledge Discovery Handbook 
 * The Kullback-Leibler divergence of a parametric model p&theta; with respect to an unknown density f

Khosrowpour (2007), Dictionary of Information Science and Technology 
 * The Kullback-Leibler divergence of a parametric model p with respect to an unknown density f...

Claude Sammut, Geoffrey I. Webb (2011), Encyclopedia of Machine Learning 
 * The Kullback-Leibler divergence from N(mu1, sigma1) to N(mu2, sigma2) is KL( N(mu1, sigma1) || N(mu2, sigma2) )

paper in Mark A. Fulk, John Case (1990), Proceedings of the Third Annual Workshop on Computational Learning 
 * The 'Kullback-Leibler divergence' of Q with respect to P, d_kl(P;Q) is defined as follows

Paper in Touretzky et al (1995), NIPS 1995 
 * D(w0 || w) is the Kullback-Leibler divergence from probability distribution p(x,y;wo) to p(x,y;w)

paper in Wolpert (1995), The mathematics of generalization: the proceedings of the SFI/CNLS Workshop (page 42)
 * The Kullback-Leibler divergence from P to P_hat, denoted I(P || P_hat)

paper in Bernhard Schölkopf, Manfred K. Warmuth (2003), Learning Theory and Kernel Machines: 16th Annual Conference 
 * let KL(p||q) denote the Kullback-Leibler divergence from a Bernoulli variable with bias p to a Bernoulli variable with bias q.

paper in Leondes (1998), Neural Network Systems, Techniques, and Applications 
 * KL divergence of p for q ((from the integral, it is q that they are finding the expectation w.r.t.))

Abdel H. El-Shaarawi, Walter W. Piegorsch (2001), Encyclopedia of Environmetrics - Volume 2 - Page 644 
 * ... the Kullback-Leibler divergence of k_theta1 from k_theta0

Chong Gu (2002), Smoothing Spline ANOVA Models 
 * The Kullback-Leibler distance of exp(&eta;lambda;) from exp(&eta;lambda;) should be modified as KL(&eta;,&eta;lambda) -- ((compare p. 152))

Tony Jebara (2004), Machine Learning: Discriminative and Generative 
 * To generalize, we cast negative entropy as a Kullback-Leibler divergence from P(0) to a target uniform distribution as follows

paper in Rainer Fischer, Roland Preuss, Udo von Toussaint (2004), Bayesian inference and maximum entropy methods in science and engineering (page 525)
 * The Kullback-Leibler divergence from the true distribution to a predictive distribution is adopted as a loss function

Stan Uryasev and A. Alexandre Trindade, paper in Kurdila et al (2005), Robust Optimization-Directed Design 
 * The Kullback-Leibler distance of distribution g from distribution f is a measure of the information lost when g is used to approximate f 

paper in Henderson et al (2006), Handbooks in Operations Research And Management Science: Simulation 
 * The Kullback-Leibler distance of distribution P1 from distribution P2 equals...

paper in Frank Emmert-Streib, Matthias Dehmer (2008), Information Theory and Statistical Learning 
 * Mutual information is the Kullback–Leibler divergence of the product P(X)P(Y) of two marginal probability distributions from the joint probability distribution P(X,Y),
 * ((possibly reflecting our own article on mutual information))

paper in Frank S. de Boer et al (eds) (2008), Formal Methods for Components and Objects: 5th International Symposium 
 * the Kullback- Leibler divergence from p to q is defined by DKL(p || q)

paper in Thanaruk Theeramunkong, Boonserm Kijsirikul, Nick Cercone (2009), 
 * the average Kullback-Leibler divergence from the true distribution to the estimated distribution

paper in Zhou et al (2009), Advances in Machine Learning: First Asian Conference on Machine Learning 
 * ((I think -- since they appear to be taking the expectation w.r.t. their q))

Lingxin Hao, Daniel Q. Naiman (2010), Assessing Inequality 
 * ((going by the integrals: but they write D(F0;F) instead of D(F;F0) ))

Weijs, Steven V et al (2010), Monthly Weather Review Sep 2010, Vol. 138 Issue 9, p3387
 * the Kullback–Leibler divergence of the forecast distribution from the observation distribution

paper in Nils Lid Hjort, Chris Holmes, Peter Müller (2010), Bayesian Nonparametrics 
 * the density within the [incomplete] model minimizing the Kullback– Leibler divergence from the true density.

paper in Lorenz Biegler, George Biros, Omar Ghattas (2011), Large-Scale Inverse Problems and Quantification of Uncertainty 
 * Kullback-Leibler divergence from the exact posterior to the approximate posterior

a paper in Thomas Hamelryck et al (eds) (2012), Bayesian Methods in Structural Bioinformatics 
 * the minimal Kullback_Leibler divergence from g ... KL[g || h]

From q to p (from "approximation" to "truth")
Alan J. Izenman (2008), ''Modern Multivariate Statistical Techniques: Regression, Classification, and Manifold Learning'' 
 * The relative entropy or Kullback-Leibler divergence of a multivariate probability density p with respect to another multivariate probability density q

Thomas M. Chen, Erchin Serpedin, Dinesh Rajan (2011), Mathematical Foundations for Signal Processing, Communications, and Networking 
 * The relative entropy or Kullback-Leibler distance of a probability distribution P relative to another probability distribution Q is...

Ganesh (1995), Stationary tail probabilities in exponential server tandems with renewal arrivals (page 25)
 * we define the Kullback-Leibler distance of q from p as...
 * ((uses q and p the other way round))

paper in Avellaneda (1999), ''Quantitative Analysis in Financial Markets: Collected Papers of the New York University Mathematical Finance Seminar'' 
 * ... the relative entropy, or Kullback-Leibler distance, of Q with respect to P is...
 * ((P and Q being used the other way round & discussed in terms of Radon-Nikodym derivative))

Meir & Zhang in NIPS 15 (2003), 
 * ... we set g(q) to be the Kullback-Leibler divergence of q from nu...

Ayman El Baz et al (2008), in Dit-Yan Yeung et al, Structural, Syntactic, and Statistical Pattern Recognition: Joint IAPR 
 * The similarity between two images is measured by the Kullback-Leibler divergence of a joint empirical distribution of the corresponding signals from the joint distribution of the independent signals.

paper in Qing Li et al (2009), Advances in Data and Web Management: Joint International Conferences, APWeb/WAIM 2009 
 * ... the Kullback-Leibler divergence of P from Q is defined as...
 * ((lifts several phrases verbatim from this article, but definition flipped!))

paper in Qiyuan Peng (2009) Proceedings of the second international conference on Transportation Engineering 
 * D( p||q ) denotes the relative entropy (or Kullback-Leibler divergence) from q to p

Inconsistency within the Article
At the very beginning of the section "Definition", D_KL(P||Q) is referred to as "the KL divergence of Q from P".

On the other hand, in the last paragraph of that same section, the wording "the divergence of P from Q" is given for D_KL(P||Q).

This is inconsistent and should be fixed. I guess, one of the places was missed when changing the wording as a result of the above discussion. As I myself am not sure which is right and which is wrong I won't change it myself. Please somebody of the contributors involved more deeply here do it!

Ernstkl (talk) 12:31, 7 January 2020 (UTC)

To make matters worse, in the section "Interpretations" it is first stated that in ML the KL-divergence D_KL(P||Q) gives the information gained by going from P to Q and 2 lines below it is stated that in Bayesian inference the KL-divergence gives the information gained by going from Q to P. I'm pretty sure the second one is the more generally accepted interpretation. Nmdwolf (talk) 08:23, 12 June 2020 (UTC)

Question
These two statements seem to contradict each other:

(1) The K-L divergence is only defined when $$P>0$$ and $$Q>0$$ for all values of i, and when P and Q both sum to 1.

(2) The self-information ... is the KL divergence of the probability distribution P(i) from a Kronecker delta...

They appear contradictory because a Kronecker delta puts probability zero on all events except one. So if (1) is correct, a KL divergence from a Kronecker delta should be undefined. Could someone knowledgeable please correct the page if necessary? (Or help me understand why no correction is necessary...) Thanks! Rinconsoleao (talk) 16:27, 15 August 2010 (UTC)


 * The first requirement is unnecessarily strict. It should be that $$Q(i)>0$$ for all i for which $$P(i)>0$$.  One then makes the usual information theory limit, that we can take $$0 \; \log \; 0$$ to be zero. Jheald (talk) 17:41, 15 August 2010 (UTC)


 * Thanks! I just edited the page for consistency with your comment. I will look for an appropriate page-specific reference, but if you can provide one that would be great. Rinconsoleao (talk) 12:41, 16 August 2010 (UTC)
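
The relaxed condition above can be illustrated with a minimal sketch, assuming distributions are given as plain Python lists over the same outcomes. The function name `kl_divergence` and the example numbers are illustrative only and do not come from the article. Terms with $$P(i)=0$$ are skipped (the $$0 \log 0 = 0$$ convention), and the result is infinite if $$Q(i)=0$$ somewhere that $$P(i)>0$$.

```python
import math

def kl_divergence(p, q):
    """Discrete D_KL(P || Q) in nats, using the 0 * log 0 = 0 convention."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0:
            continue  # a zero-probability outcome under P contributes nothing
        if qi == 0:
            return math.inf  # P puts mass where Q has none
        total += pi * math.log(pi / qi)
    return total

# Hypothetical example: a Kronecker delta on outcome 1 versus a distribution P.
delta = [0.0, 1.0, 0.0]
p = [0.2, 0.5, 0.3]

# Self-information of outcome 1 under P, for comparison.
self_information = -math.log(p[1])
print(kl_divergence(delta, p), self_information)
```

Under these assumptions the divergence taken from the delta is finite and matches the self-information of the selected outcome, which is consistent with statement (2) once requirement (1) is relaxed.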

Another question: in the properties section, it is written that the K-L divergence is non-negative, but the plots shown at the beginning have negative values. What is wrong? —Preceding unsigned comment added by 148.247.183.97 (talk) 22:25, 7 September 2010 (UTC)


 * Nothing is wrong. I assume you're referring to the image captioned "Illustration of the Kullback–Leibler (KL) divergence for two normal Gaussian distributions". The KL divergence is the total signed area under the curve shown. Though parts of the curve may be negative, the total area is never negative. As mentioned in the article, this is guaranteed by Gibbs' inequality. Kevin Baastalk 15:44, 8 September 2010 (UTC)
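
A small discrete sketch of the same point, using made-up distributions `p` and `q` (not taken from the article or its figure): individual terms $$P(i)\log(P(i)/Q(i))$$ can be negative while their sum stays non-negative.

```python
import math

# Hypothetical distributions, chosen only for illustration.
p = [0.1, 0.4, 0.5]
q = [0.3, 0.3, 0.4]

# Pointwise contributions P(i) * log(P(i) / Q(i)); negative wherever P(i) < Q(i).
terms = [pi * math.log(pi / qi) for pi, qi in zip(p, q)]
total = sum(terms)

print(terms)
print(total)
```

Here the first term is negative because $$P(1) < Q(1)$$, but the total is still positive.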

What About P > 0 and Q = 0? What About P = 0 Everywhere Q > 0
It seems to me that KL-divergence has the nice property that when P = Q, we end up with a "distance" of 0. But when P = 0 everywhere that Q > 0, we also end up with a "distance" of 0, even though the distributions are obviously, in some sense, about as "far apart" as possible.

Furthermore, it seems that the KL-divergence would be of limited use to the practitioner who, in computer implementations for practical problems, will often end up having P > 0 and Q = 0 at some points. I'm sure this is a standard "problem", but how does one address this? If KL-divergence can't be "patched" for these cases, it seems useless for the LARGE number of applications where at least one sample will have P > 0 and Q = 0. — Preceding unsigned comment added by 99.69.50.190 (talk) 04:14, 1 November 2011 (UTC)


 * If P = 0 somewhere (or everywhere) that Q > 0, then somewhere else you must have P > Q, so that the total probability for each probability distribution can still both be 1. So the KL divergence will not be zero.


 * As to your second question: the moral is, only set Q(x)=0 if you are really really sure the case is impossible. Compare Cromwell's Rule for discussion. Jheald (talk) 10:10, 1 November 2011 (UTC)
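
Both replies can be checked numerically. The sketch below is only an illustration: the helper `kl_divergence`, the example distributions, and the smoothing weight `eps` are all assumptions made here, and mixing Q with a uniform distribution is just one simple way of following the "never assign exactly zero" advice, not a method endorsed by the article.

```python
import math

def kl_divergence(p, q):
    """Discrete D_KL(P || Q) in nats, using the 0 * log 0 = 0 convention."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0:
            continue
        if qi == 0:
            return math.inf  # P > 0 where Q = 0
        total += pi * math.log(pi / qi)
    return total

p = [0.5, 0.5, 0.0, 0.0]

# P = 0 somewhere that Q > 0: the divergence is positive, not zero.
q_wide = [0.25, 0.25, 0.25, 0.25]
d_wide = kl_divergence(p, q_wide)

# P = 0 everywhere that Q > 0 (disjoint supports): the divergence is infinite.
q_disjoint = [0.0, 0.0, 0.5, 0.5]
d_disjoint = kl_divergence(p, q_disjoint)

# Assumed workaround: mix Q with a uniform distribution so that no entry is zero.
eps = 1e-3
q_smoothed = [(1 - eps) * qi + eps / len(q_disjoint) for qi in q_disjoint]
d_smoothed = kl_divergence(p, q_smoothed)

print(d_wide, d_disjoint, d_smoothed)
```

The smoothed value is finite but depends heavily on `eps`, so in practice the choice of smoothing is a modelling decision rather than a neutral patch.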

Symmetrised divergence
Any objections if I change "Kullback and Leibler themselves actually defined the divergence as" to "Kullback and Leibler cite Jeffreys' divergence as"? (Source: Kullback & Leibler p81 (1951)) Wrecktaste (talk) 17:59, 1 November 2011 (UTC)

Symmetrised divergence: lambda-divergence
The lambda-divergence in the wikipedia article is given as an obvious generalisation of the KL. However, I struggled to find it anywhere else than in this particular document: https://camo.ici.ro/journal/vol11/v11d15.pdf which does not seem 100% worthy of full credibility (additionally that document doesn't cite anything, or prove anything when discussing this generalisation). The sentence below the definition in the wikipedia article is copied almost verbatim from that PDF too: "this signifies the gaining expectation of info about that X is obtained from P or Q, with respective probabilities p and q" (cf document, p 575). I don't find this either clear or convincing. Even though there may not in fact be an issue with using that as a divergence, I would feel more comfortable if there were more sources (ideally, more credible ones) discussing this generalisation, where it comes from and how it can be interpreted. In the mean time I'll flag it as citation needed. Thibaut Lienart (talk) 06:21, 15 May 2018 (UTC)

Definitions
In my opinion the mentioning of a random variable in the definition is superfluous and in some way misleading. The definition is just about two discrete or two absolute continuous probability measures. Nijdam (talk) 17:51, 29 April 2012 (UTC)

Correct Accent Marks?
I am just wondering if there are any accent marks on Kullback or Leibler. I would suspect, but do not know, that there is supposed to be an umlaut over the 'u' in Kullback. Maybe there are no accent marks at all? I know this is trivial, but I would assume there are others like me who respect accent marks when possible. — Preceding unsigned comment added by 68.228.41.185 (talk) 21:18, 5 April 2013 (UTC)
 * There should be no accent marks. Solomon Kullback was American. Jheald (talk) 08:21, 6 April 2013 (UTC)

Entropy formula
Just a very basic question, I was confused by the equation after "This distribution has a new entropy", isn't it missing a minus? — Preceding unsigned comment added by 130.88.91.196 (talk) 09:46, 14 May 2014 (UTC)

cdot Notation in Bayesian section
In the Bayesian updating section, a notation involving \cdot is used without any introduction, explanation, or link. Can someone that knows what's going on there fix that? Thanks. — Preceding unsigned comment added by 174.101.176.129 (talk) 10:42, 11 March 2015 (UTC)

Data differencing
A clearer description of relative entropy wrt data differencing could be written. For absolute entropy the p(i) Log [p(i)] formula works pretty well in approximating a theoretical limit on file compression. However I cannot see how the relative entropy formula places a theoretical limit on a patch size between two files.

I've taken two web cam images of the same scene under identical lighting conditions. 1.jpg is 64139 bytes long. 2.jpg is 64555 bytes, differing in what I assume to be image noise only. The relative entropy formula then gives 330 bits as a limiting patch / delta size. It seems low. I've also created a patch using the xdelta open source utility. It comes out as 60939 bytes. This intuitively seems realistic. This is several orders of magnitude different from the 330 bits.

Saying that relative entropy forms a background for determining minimum patch size causes confusion if the numbers cannot be made to support it. Perhaps alternative language / a numerical example needs to be used?

(I'm another user: Carlos Pinzón)

I think that the whole section is misleading because the diff rate corresponds to the conditional entropy H(Y|X) instead of the KL-divergence.

Without entering into formulas, you can see that KL-divergence does not correspond to the average length of the patch: the diff size depends strongly on the relationship (joint distribution) between the source (X) and the target (Y), but the KL-divergence is independent of the joint distribution as it only depends on the marginals P and Q.

Concretely, let the source code X be a random bit and let the target code Y be also a random bit. Then the diff length is H(Y) if they were independent, and is zero if X=Y or X=1-Y. This is inconsistent with KL-divergence because in both cases P and Q are the same, but the diff lengths are different. — Preceding unsigned comment added by Caph1993 (talk • contribs) 00:20, 24 December 2021 (UTC)
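
The two-bit example above can be worked through numerically. This is a minimal sketch with hypothetical helper names (`marginals`, `conditional_entropy`, `kl_bits`); the joint distributions are written as `joint[x][y]`, and `kl_bits` assumes Q is positive wherever P is.

```python
import math

def marginals(joint):
    """Return the marginal distributions of X and Y for joint[x][y]."""
    px = [sum(row) for row in joint]
    py = [sum(col) for col in zip(*joint)]
    return px, py

def conditional_entropy(joint):
    """H(Y | X) in bits for a joint distribution joint[x][y]."""
    px, _ = marginals(joint)
    h = 0.0
    for x, row in enumerate(joint):
        for pxy in row:
            if pxy > 0:
                h -= pxy * math.log2(pxy / px[x])
    return h

def kl_bits(p, q):
    """D_KL(P || Q) in bits; assumes q > 0 wherever p > 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

independent = [[0.25, 0.25], [0.25, 0.25]]  # X and Y are independent fair bits
identical = [[0.5, 0.0], [0.0, 0.5]]        # Y = X

results = {}
for name, joint in [("independent", independent), ("identical", identical)]:
    px, py = marginals(joint)
    results[name] = (conditional_entropy(joint), kl_bits(px, py))
    print(name, results[name])
```

In both cases the marginals are the same fair-bit distribution, so the KL divergence between them is zero, while the conditional entropy is one bit in the independent case and zero when Y = X.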

Incomprehensible section needs reworking
The section Kullback–Leibler_divergence is completely incomprehensible and needs major reworking (or should be deleted). The jump from relatively simple mathematical definitions and statement to a physical application (using different symbols, terminology, and never explained termini like "available work") is too big to be comprehensible. — Preceding unsigned comment added by 134.96.90.27 (talk) 08:25, 7 July 2016 (UTC)

Asymmetry in the graphic illustration
The figure caption says "Note the typical asymmetry for the Kullback–Leibler divergence is clearly visible." I'm not sure that I see that in the current figure. While the area to be integrated there is visually asymmetric, still for that example, D(P || Q) = D(Q || P).  Two Gaussians that differ only in their mean do not demonstrate the asymmetry.

I'm not sure how to generate graphic examples that will show well on wiki pages, but the following is an example with two Gaussians that differ also in standard deviation, so that the D(P || Q) is interestingly different from D(Q || P):

--Dancing figures (talk) 17:10, 6 August 2016 (UTC)
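
The point about Gaussians can be checked with the standard closed form for the KL divergence between two univariate normal distributions. The sketch below is illustrative only: the function name `kl_normal` and the parameter values are chosen here and are not the ones used in the figure or in the example mentioned above.

```python
import math

def kl_normal(mu1, sigma1, mu2, sigma2):
    """D_KL( N(mu1, sigma1^2) || N(mu2, sigma2^2) ) in nats (closed form)."""
    return (math.log(sigma2 / sigma1)
            + (sigma1 ** 2 + (mu1 - mu2) ** 2) / (2 * sigma2 ** 2)
            - 0.5)

# Same standard deviation, different means: the two directions agree.
same_forward = kl_normal(0.0, 1.0, 1.0, 1.0)
same_reverse = kl_normal(1.0, 1.0, 0.0, 1.0)

# Different standard deviations as well: the two directions differ.
diff_forward = kl_normal(0.0, 1.0, 1.0, 2.0)
diff_reverse = kl_normal(1.0, 2.0, 0.0, 1.0)

print(same_forward, same_reverse)
print(diff_forward, diff_reverse)
```

With equal standard deviations the expression reduces to $$(\mu_1-\mu_2)^2 / (2\sigma^2)$$, which is symmetric in the two means, so a figure meant to show the asymmetry would need distributions that differ in more than location.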

New Introduction
Please have a look at my non-expert attempt at revising the introduction. Ostracon (talk) 18:37, 8 May 2017 (UTC)

Incorrect upper limit in introduction?
The upper limit of the KL divergence is mentioned as 1 in the introduction. The following example gives a KL div > 1: p(1) = 0.99, p(2) = 0.01, p(3) = 0, q(1) = 0.001, q(2) = 0.01, q(3) = 0.989. Should that be rectified? — Preceding unsigned comment added by 169.234.233.68 (talk) 03:14, 20 February 2018 (UTC)
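
The example above can be verified directly. This is a minimal sketch that uses the numbers given in the comment; the function name `kl_divergence` and its `base` parameter are assumptions made for illustration.

```python
import math

def kl_divergence(p, q, base=math.e):
    """Discrete D_KL(P || Q), using the 0 * log 0 = 0 convention."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0:
            continue  # zero-probability outcomes under P contribute nothing
        if qi == 0:
            return math.inf
        total += pi * math.log(pi / qi, base)
    return total

p = [0.99, 0.01, 0.0]
q = [0.001, 0.01, 0.989]

d_nats = kl_divergence(p, q)
d_bits = kl_divergence(p, q, base=2)
print(d_nats, d_bits)
```

The result is roughly 6.8 nats (about 9.9 bits), so the divergence exceeds 1 in either unit. It has no finite upper bound in general, since it grows without limit as any Q(i) with P(i) > 0 approaches zero.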

Expectation unclear.
The Sentence in the Section Definition:

In other words, it is the expectation of the logarithmic difference between the probabilities $$P$$ and $$Q$$, where the expectation is taken using the probabilities $$P$$.

is not clear to an unadvanced reader. Can anyone give an explanation or a citation, why this is the expectation? The regular definition of the expected value looks different.

2001:16B8:2833:B900:C954:23B5:7E2C:F310 (talk) 04:10, 12 January 2020 (UTC)
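
One way to read the quoted sentence, sketched for the discrete case only (the symbols $$X$$ and $$f$$ are introduced here for illustration): for a random variable $$X$$ with distribution $$P$$, the expected value of any function $$f$$ of $$X$$ is

$$\operatorname{E}_P[f(X)] = \sum_x P(x)\, f(x).$$

Taking $$f(x) = \log P(x) - \log Q(x)$$, the "logarithmic difference", gives

$$\operatorname{E}_P[\log P(X) - \log Q(X)] = \sum_x P(x)\,\bigl(\log P(x) - \log Q(x)\bigr) = \sum_x P(x)\,\log\frac{P(x)}{Q(x)} = D_\text{KL}(P \parallel Q).$$

So it is the regular definition of the expected value, applied to the function $$\log(P(x)/Q(x))$$ rather than to $$x$$ itself.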

Requested move 28 November 2020
Kullback–Leibler divergence → Relative entropy – Move to non-eponym (page was recently copy/pasted to new location by User:Bionhoward1337, and I agree with the move but there is a need to keep revision history) — Twassman &#91;Talk·Contribs&#93; 19:09, 28 November 2020 (UTC)


 * Noted. I made a new account and can't do the move for a bit. Put the discussion at the top of the talk page and will revisit this later. Thank you! - Bion
 * This is a contested technical request (permalink). Bionhoward1337 (talk) 19:14, 28 November 2020 (UTC)


 * Removed requested move; I will CSD Relative entropy to make way for move. — Twassman &#91;Talk·Contribs&#93; 19:33, 28 November 2020 (UTC)


 * Oppose KL divergence is by far the more common name. Jheald (talk) 11:38, 29 November 2020 (UTC)
 * Oppose changing the name! I agree with Jheald that KL divergence is the more common name. In google checking ""Kullback–Leibler divergence"|"KL divergence"" returns 650k results while "Relative entropy" returns 420k. Tal Galili (talk) 19:41, 13 December 2020 (UTC)

Move / Rename to "Relative Entropy"
Let's move and rename this article to the name of "Relative Entropy" because descriptive titles are more clear than eponyms, which exist for historical reasons.

Mathematical concepts ought be named to provide information about their purpose and usage because this helps learners and practitioners associate nomenclature with related knowledge. We can definitely keep the 'KL' term for the maths as an homage to the inventors.

Entropy is a confusing topic for beginners, and such an important topic as relative entropy ought to be named in as clear a manner as possible.

— Preceding unsigned comment added by Bionhoward1337 (talk • contribs)

Discussion
Greetings!

When we name math concepts after people instead of meanings, we make math less inclusive and harder for new generations to learn.

[Why Mathematicians Should Stop Naming Things After Each Other](http://m.nautil.us/issue/89/the-dark-side/why-mathematicians-should-stop-naming-things-after-each-other)

An eponym like “KL” doesn’t tell us anything about the underlying concept, whereas relative entropy tells us the concept regards a relationship between two quantities (relative) of hidden information (entropy).

The Wikipedia style guide reads, “Some topics are intrinsically technical, but editors should try to make them understandable to as many readers as possible. Minimize jargon...” — it also says to prefer more common usage forms. [A quick look at Google Trends shows there are actually more recent searches for the descriptive form, Relative Entropy, than for the eponymous form, Kullback-Leibler Divergence.](https://trends.google.com/trends/explore?geo=US&q=Relative%20Entropy,Kullback-Leibler%20Divergence)

The preferred term ought to be “relative entropy” because this presents a golden opportunity to make a Wikipedia article on a key, and confusing, technical topic much more clear for learners and make the mathematics of information more inclusive to newcomers.

Thank you for taking the time to read this,

Bionhoward1337 (talk) 21:31, 13 December 2020 (UTC)


 * Hey Bionhoward1337,
 * I'd like us to agree on some things upfront:
 * 1) I think your reasoning is generally good. I agree that relative entropy is a more informative title than kl-divergence.
 * 2) That said, I believe that the decision on what is "the main name" of a concept should not come from the Wikipedia community, but from the communities that use that name. So while I might think it's a better name, if most researchers don't use it, then I don't think I should be the one deciding on the name.
 * 3) Furthermore, even if there is a "culture war" around the name, I don't think Wikipedia should play a part in it, unless there is an official preference in Wikipedia as a whole to lean toward more progressive naming. So, for example, I'm looking at Article_titles, and I see that there is "Recognizability" and "Naturalness", which pushes us towards kl-divergence over "relative entropy", since this is still the most common name. For some reason the trends plot you linked is not loading. But to double check, I looked now at Google Scholar for the number of results for 2020, and I see "relative entropy" gets 4120, while "Kullback–Leibler divergence"|"KL divergence" gets 8540.
 * In summary, as long as there are double the papers with one name over the other in 2020, this tells me that the name "relative entropy" (even though I agree it is better) is half as common as the other name. Hence, I'd prefer us to keep kl-divergence. If this changes in the future (in the next couple of years), I'll be fine changing it as you suggest.
 * On which of the above 3 points can we agree?
 * WDYT? Tal Galili (talk) 12:35, 14 December 2020 (UTC)


 * If the crowd wills it to be called Kullback-Leibler Divergence, then that is what it is and I'll not contest a change back. However, IMHO, as long as Kullback-Leibler Divergence redirects to Relative Entropy, and it is clear in the article both apply to the same concept, and clearly both names are widely used, then I propose the title stay Relative Entropy, because it's more precise, concise, more consistent with names for related topics in Information Theory, and more inclusive to newcomers to the field (for whom Wikipedia is a starting point) because it's easier to tie-in to related knowledge in one's 'semantic tree.' Didn't mean to start a culture war or anything like that. Thank you for taking the time to work on this! 2603:6080:A303:DD00:B9B5:A22C:AD5A:A0C9 (talk) 18:14, 14 December 2020 (UTC) Bionhoward1337 (talk) 18:20, 14 December 2020 (UTC)


 * I entirely agree with your comment. — Twassman &#91;Talk·Contribs&#93; 19:22, 14 December 2020 (UTC)


 * Followup, I looked at the trends data of kl divergence, it seems to both have many more searches than relative entropy and have a trend increase (!). See here. Given these results, I think we should revert the name change back to kl divergence... (even though I agree that relative entropy is a lovely name). Tal Galili (talk) 10:06, 16 December 2020 (UTC)


 * Update: made a move-back request here: https://en.wikipedia.org/wiki/Wikipedia:Requested_moves/Technical_requests#Uncontroversial_technical_requests


 * A couple of further points:
 * We (and pretty much everybody else) use the symbol $$D_{KL}$$ for the quantity. If we're going to do that, it's best to prefer the name Kullback-Leibler divergence.
 * Secondly, above Bionhoward1337 said that we should use a name that's more meaningful, making the concept easier to acquire. But is that true of "relative entropy" ?  "Entropy" is straightforward enough: the amount of information you don't have.  The modifier "Relative" suggests we're applying that idea to P relative to some background measure Q.  But is this what's going on?  No, not really.  Usually a better way to think about $$D_{KL}$$ is "how far is Q from P, looking back from P".  And the name "divergence" actually clues in that picture much better.  I suppose you can think about $$D_{KL}$$ as somehow being related to some continuous version of the Shannon entropy  you would get for P after a transformation that made Q flat, but it doesn't really work, because Shannon entropy doesn't really work for continuous variables.
 * As for the claim one shouldn't use mathematicians' names in the names of concepts, I don't think that's sound either. Attaching a person's name to a concept acts like a hash code, identifying the concept as something specific and distinct from others of its kind, and actually makes for a more memorable name for a specific thing, compared to a more generic adjective. Unless the generic phrase encapsulates the concept really transparently and directly -- and as I've said above, I don't think that is true of "relative entropy" -- then I think it can be better to establish the concept as a distinct personally-named thing.  "KL divergence" establishes that we're talking about a mathematical divergence; and that it is a divergence of a very specific form.  If you want something more directly cognitively accessible, to communicate more of what the concept relates to then IMO "information gain", the term introduced by Renyi, may be the best of the bunch. But that does frame the quantity in one rather particular way.  Both for its generality and its wide usage, "Kullback-Leibler divergence" remains what I think this article is best called.  Jheald (talk) 20:47, 16 December 2020 (UTC)


 * Fair points. I agree. Bionhoward1337 (talk) 02:24, 17 December 2020 (UTC)


 * Great points, thanks. Tal Galili (talk) 08:08, 17 December 2020 (UTC)

Requested move 28 November 2020 : Discussion part 2
User:Anthony Appleyard - I think this move is agreed upon by all the people in the discussion. Could you please help with making the move? Thanks. Tal Galili (talk) 08:17, 17 December 2020 (UTC)
 * This discussion on page Talk:Relative entropy is about moving page Kullback–Leibler divergence to Relative entropy, but the page is at Relative entropy already. Anthony Appleyard (talk) 09:13, 17 December 2020 (UTC)
 * User:Anthony Appleyard - there was a short discussion on moving Kullback–Leibler divergence to Relative entropy. This move was done quickly, and after the fact, a bunch of other editors suggested this move should not have taken place, and asked to move Relative entropy back to Kullback–Leibler divergence. You can see that in the discussion above we've reached a consensus for making this switch back. Tal Galili (talk) 13:44, 17 December 2020 (UTC)


 * ✅ Anthony Appleyard (talk) 14:14, 17 December 2020 (UTC)

Weird sentence in definition section
The definition section has the following text, concerning the measure theoretic definition:


 * More generally, if $$P$$ and $$Q$$ are probability measures on a measurable space $${\mathcal{X}}$$, and $$P$$ is absolutely continuous with respect to $$Q$$, then the relative entropy from $$Q$$ to $$P$$ is defined as


 * $$D_\text{KL}(P \parallel Q) = \int_{\mathcal{X}} \log\left(\frac{P(dx)}{Q(dx)}\right)\, P(dx),$$
 * where $$\frac{P(dx)}{Q(dx)}$$ is the Radon–Nikodym derivative of $$P$$ with respect to $$Q$$, i.e. the unique $$Q$$-almost everywhere defined function $$r$$ on $$\mathcal{X}$$ such that $$P(dx) = r(x)Q(dx)$$, which exists because $$P$$ is absolutely continuous with respect to $$Q$$. Also we assume the expression on the right-hand side exists.

I can't work out the meaning of that last sentence and I suspect it's superfluous. All the text before it seems to be about showing that the expression on the right-hand side of the equation does in fact exist, so stating it as an additional assumption seems a bit weird.

Or maybe I'm missing something and an additional assumption is needed beyond the one stated about absolute continuity? If that's the case, could it be reworded to make the extra assumption more explicit? If not, I suggest the extra sentence be removed.

Nathaniel Virgo (talk) 01:35, 13 January 2023 (UTC)

Metric and distance are synonyms in mathematics, contrary to what the page states
The article states « While it is a distance, it is not a metric, the most familiar type of distance: it is not symmetric in the two distributions (in contrast to variation of information), and does not satisfy the triangle inequality. » But this does not correspond to common mathematical usage, and Wikipedia itself contradicts the above claims: Distance and Metric space provide exactly the same definition for "metric" and "distance", which requires symmetry and the triangle inequality. I am of course, like the articles above, not referring to a "Riemannian metric", which is a positive definite symmetric (0,2)-tensor, or equivalently a real inner product on a tangent bundle, with some differentiability requirement. Thus I recommend deleting the above statement, which is false according to common mathematical usage, and instead noting that KL divergence satisfies some properties of a distance but not all. Plm203 (talk) 02:34, 30 October 2023 (UTC)


 * I made edits to the article to address your point. Please feel free to edit further or leave additional comments here.  — Quantling (talk &#124; contribs) 13:34, 30 October 2023 (UTC)
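
The two failing properties named in this thread, symmetry and the triangle inequality, can be seen in a small numerical sketch. The helper `kl_bernoulli` and the biases 0.5, 0.3 and 0.1 are chosen here purely for illustration and do not come from the article.

```python
import math

def kl_bernoulli(a, b):
    """D_KL( Bernoulli(a) || Bernoulli(b) ) in nats, for 0 < a, b < 1."""
    return a * math.log(a / b) + (1 - a) * math.log((1 - a) / (1 - b))

# Symmetry fails: swapping the arguments changes the value.
forward = kl_bernoulli(0.5, 0.1)
reverse = kl_bernoulli(0.1, 0.5)

# Triangle inequality fails: going via Bernoulli(0.3) is "shorter" than going directly.
direct = kl_bernoulli(0.5, 0.1)
via_middle = kl_bernoulli(0.5, 0.3) + kl_bernoulli(0.3, 0.1)

print(forward, reverse)
print(direct, via_middle)
```

For these values the direct divergence is larger than the sum of the two intermediate ones, which a metric would not allow. Non-negativity and vanishing only when the two distributions coincide still hold.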