Talk:Divergence (statistics)

positive definiteness in the definition
The third property of a divergence is given in the text as:
 * The matrix g(D) (see definition in the “geometrical properties” section) is strictly positive-definite everywhere on S.

I haven't read the reference yet, but it doesn't seem necessary to put it in the definition. At least, a good explanation is required here. --Memming (talk) 18:13, 29 January 2010 (UTC)


 * The positive-definiteness is actually the crucial property that connects divergences to information geometry! See discussion at Special:Permalink/1056280310. However, I agree that the earlier article didn't sufficiently explain this. I've had a shot at explaining it both more intuitively and more formally, in . This is hopefully clearer!
 * —Nils von Barth (nbarth) (talk) 04:41, 21 November 2021 (UTC)

Notation
The notation used, while appreciably terse, is also cryptic and unapproachable. It should either be revised to improve readability, or the article should link to a page that explains the meaning of the notation. — Preceding unsigned comment added by 132.3.57.68 (talk) 22:52, 9 March 2012 (UTC)

Merger proposal
I propose that Statistical distance be merged into Divergence (statistics). It seems to me that a statistical distance, when it refers to a comparison of two distributions, merely satisfies a somewhat tighter definition than what is required of a divergence. Currently the two articles give conflicting definitions of a distance, and this could be clarified by discussing them in a common framework. I think it would be okay to also explain the comparison of a point to a distribution within the article Divergence (statistics), leaving no need for a Statistical distance article, except as a redirect. Olli Niemitalo (talk) 10:33, 4 December 2014 (UTC)
 * Closing, as no support over 2.5 years. Klbrain (talk) 20:05, 15 August 2017 (UTC)
 * To follow up: divergences are a special kind of statistical distance (notably inducing a positive definite metric), with important geometric interpretations and a central role in information geometry. By contrast, statistical distances are a grab-bag with various properties. It is thus valuable to distinguish them, so that one article covers the grab-bag of functions used as "distances", and the other discusses the ones with geometric properties. This is especially important because these two concepts are frequently conflated (as the history of this article and its discussions show), so having two articles helps distinguish the concepts.
 * —Nils von Barth (nbarth) (talk) 21:29, 6 November 2021 (UTC)

|| Notation
What is the origin of the || notation used in this article? Divergences and distance measures in statistics are traditionally written D(P,Q) or similar rather than D(P||Q). While the || notation is not completely unknown in the statistics research literature, it is very rare. The || notation is not used in any of the cited references, nor in the Wikipedia page on statistical distance, nor does it agree with the definition of || in the Wikipedia page List of mathematical symbols. I suggest reverting to historical notation, which is simpler and more widely understood. GKSmyth (talk) 10:18, 28 May 2015 (UTC)


 * The double-bar notation seems to be common for Kullback–Leibler divergence, particularly in information theory, to emphasize the asymmetry. It's not used in Kullback & Leibler (1951), but is now common, and is the notation used on Wikipedia for Kullback–Leibler divergence. There's a discussion at: Origin of the notation for statistical divergence, but it's inconclusive.
 * Notation is inconsistent in the literature; more formal math often just uses $$D(x, y)$$, while information theory literature tends to use $$D_\text{KL}(P \parallel Q)$$, and one also sees e.g. $$D[x : y]$$ (Amari 2016). I'll add a section discussing notation.
 * —Nils von Barth (nbarth) (talk) 01:34, 31 October 2021 (UTC)


 * I've added a discussion of notation in Special:Permalink/1052789591; hopefully this helps.
 * I don't have strong feelings about the choice of notation (so long as different notations are mentioned and explained!).
 * I agree that just $$D(P, Q)$$ is simpler, and probably more appropriate for this level of article, so I wouldn't object if someone changes it (and may do so myself).
 * —Nils von Barth (nbarth) (talk) 02:49, 31 October 2021 (UTC)


 * I've changed the notation to consistently use commas in ; thanks for raising this!
 * —Nils von Barth (nbarth) (talk) 04:50, 21 November 2021 (UTC)

Definitions Incompatible with Source Material
I have reviewed both the translated monograph Methods of Information Geometry (ISBN 0-8218-0531-2) and the paper (doi:10.1007/978-3-642-10677-4_21), which are cited to substantiate the definition of divergence provided in this article. (Both of these are by a single, Japanese-speaking author, Amari.) Only the much older translated monograph uses the word "divergence" for this kind of function, and it notes in a footnote on the same page (p. 54) that this term is not used in the original Japanese text. The additional requirement in both the monograph and the ICONIP paper is the positive definiteness condition described by Memming.

Neither of the sources cited for the definition at the core of this article backs up the article's content. I submit that this article either needs new sources, or else needs to be heavily modified or removed.

Amari, Shun'ichi (2009). Leung, C.S.; Lee, M.; Chan, J.H. (eds.). Divergence, Optimization, Geometry. The 16th International Conference on Neural Information Processing (ICONIP 2009), Bangkok, Thailand, 1–5 December 2009. Lecture Notes in Computer Science, vol 5863. Berlin, Heidelberg: Springer. pp. 185–193. doi:10.1007/978-3-642-10677-4_21.

Amari, Shun-ichi; Nagaoka, Hiroshi (2000). Methods of information geometry. Oxford University Press. ISBN 0-8218-0531-2.


 * Thank you for the careful reading of references and clarification!
 * This error was introduced in, which removed the positive definiteness condition that was included in the initial burst of revisions (Special:Permalink/339269004); skepticism about the necessity of positive definiteness was raised in Special:Permalink/340755871, as you note.
 * The term "divergence" has been used loosely historically (as I've outlined in Special:Permalink/889324520), often just as a term for statistical distance, and continues to be used loosely.
 * However, in the context of information geometry, which is the topic of this article, the positive definiteness is essential to the geometry (essentially it means that infinitesimally, the divergence looks like squared Euclidean distance, and thus generalizes its properties; more formally, they generalize Hessian manifolds, which are affine and have local potential functions, not just infinitesimal positive definiteness).
 * I'll have a shot at rewriting the article to define the term correctly, and hopefully avoid further confusion by clarifying both the loose use (which should be discussed at statistical distance) and the information geometry sense covered in this article.
 * —Nils von Barth (nbarth) (talk) 21:23, 6 November 2021 (UTC)


 * I've restored the positive definiteness condition, and explained it both more intuitively and more formally, in . This should correct the article, and also address the original confusion.
 * —Nils von Barth (nbarth) (talk) 04:36, 21 November 2021 (UTC)
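
As a rough numerical illustration of the "looks like squared Euclidean distance" remark above, the sketch below compares the Kullback–Leibler divergence between a discrete distribution p and a nearby p + dp with the quadratic form $$\tfrac{1}{2}\sum_i dp_i^2 / p_i$$ given by the Fisher information metric. The choice of KL as the example, the function names, and the particular p and perturbation direction are assumptions made for illustration; none of them come from the article or the cited sources.

```python
import numpy as np


def kl(p, q):
    """Kullback-Leibler divergence for strictly positive discrete distributions."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return float(np.sum(p * np.log(p / q)))


def fisher_quadratic(p, dp):
    """Second-order term 0.5 * sum(dp_i^2 / p_i) induced by the Fisher metric."""
    p = np.asarray(p, dtype=float)
    dp = np.asarray(dp, dtype=float)
    return float(0.5 * np.sum(dp ** 2 / p))


# Illustrative values only (hypothetical, not from any source).
p = np.array([0.2, 0.3, 0.5])
direction = np.array([1.0, -2.0, 1.0])  # sums to zero, so p + dp stays normalized

for eps in (1e-1, 1e-2, 1e-3):
    dp = eps * direction
    exact = kl(p, p + dp)
    approx = fisher_quadratic(p, dp)
    # The ratio should approach 1 as eps shrinks if the quadratic term dominates.
    print(f"eps={eps:g}  D(p, p+dp)={exact:.3e}  quadratic={approx:.3e}  ratio={exact / approx:.4f}")
```

The quadratic form is positive for every nonzero dp because each p_i is positive, which is the positive-definiteness condition in this one example; the sketch says nothing about other divergences.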

proposal to rename the article to "Divergence (information geometry)"
As it currently stands, the article is fully committed to divergence as used in information geometry. For example, it even excludes the total variation distance.

I also have serious misgivings about the article's quality. I point out several problems:

> today there is a commonly used definition

Commonly used by whom? It seems to be mostly the galaxy of information geometers swirling around Amari. A third-party review article would help. My personal impression is that "divergence" is used for any positive-definite function.

Again, where are you going to put total variation distance?

A general f-divergence does not allow a quadratic expansion for D(p, p + dp). The most general theorem I can find is Theorem 7.11 in, which requires $$f\in C^2(0, \infty)$$ and $$\limsup_{x\to\infty} f''(x)<\infty$$.

The examples given for f-divergence are quite unprincipled. Some are even downright wrong. I have no idea where the name "Chernoff alpha-divergence" came from. And the "exponential divergence" is simply wrong: The generating function $$f(x) = (\ln x)^2$$ is not even convex!

pony in a strange land (talk) 23:43, 25 May 2022 (UTC)
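
The non-convexity claim above can be checked directly, assuming the generator is exactly $$f(x) = (\ln x)^2$$ as quoted: then $$f''(x) = 2(1 - \ln x)/x^2$$, which is negative for $$x > e$$. A minimal sketch of that check follows; the function names and the test points 5 and 20 are arbitrary choices for illustration.

```python
import math


def f(x):
    """Candidate generator f(x) = (ln x)^2, as quoted in the comment above."""
    return math.log(x) ** 2


def f_second(x):
    """Second derivative: f'(x) = 2 ln(x) / x, so f''(x) = 2 (1 - ln x) / x^2."""
    return 2.0 * (1.0 - math.log(x)) / x ** 2


# Convexity would require f(midpoint) <= average of the endpoint values.
a, b = 5.0, 20.0  # arbitrary points to the right of e
mid = 0.5 * (a + b)
violates_convexity = f(mid) > 0.5 * (f(a) + f(b))

print("f''(1)  =", f_second(1.0))   # positive: locally convex near x = 1
print("f''(10) =", f_second(10.0))  # negative: concave for x > e
print("midpoint inequality violated:", violates_convexity)
```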


 * Please see WP:RM. –– FormalDude  talk  20:26, 30 May 2022 (UTC)

Kullback–Leibler divergence is not an example of a divergence
Kullback–Leibler divergence sometimes takes the value $$+\infty$$ and so cannot be a divergence (as defined in the article). ??? Thatsme314 (talk) 10:29, 5 June 2023 (UTC)


 * What do you mean by infinity here? Is it that you have some division by zero or as the limit of a sequence? The definition of a divergence only requires positive (semi)-definiteness, which has no "upper limit" on the real line. AntonyRichardLee (talk) 13:57, 25 July 2023 (UTC)
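
To make the question concrete, here is a minimal sketch for finite discrete distributions, assuming the usual conventions that a term with $$p_i = 0$$ contributes 0 and that the divergence is $$+\infty$$ when $$p_i > 0$$ but $$q_i = 0$$. The function name and the example distributions are hypothetical. It only shows how the value $$+\infty$$ arises when the supports differ; it does not settle whether the article's definition is meant to be restricted to distributions with common support.

```python
import math


def kl_divergence(p, q):
    """D(P, Q) = sum_i p_i * ln(p_i / q_i) for finite discrete distributions.

    Conventions assumed here: 0 * ln(0 / q) = 0, and the result is +inf
    whenever P puts mass on an outcome that Q assigns probability zero.
    """
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0.0:
            continue  # zero-mass outcomes of P contribute nothing
        if qi == 0.0:
            return math.inf  # P has mass where Q has none
        total += pi * math.log(pi / qi)
    return total


p = [0.5, 0.5]
q = [1.0, 0.0]

print(kl_divergence(p, q))  # infinite: p has mass on the second outcome, q does not
print(kl_divergence(q, p))  # finite in the other direction
```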