Talk:Kernel density estimation

(Particularly the section about the risk function.)

Misusage of the word variance
In the Example section, the text claimed to use a variance of 2.25, when in fact it was referring to the standard deviation (I also suspect it should have been 2.5 instead of 2.25, since 2.25 creates a barely noticeable deviation in the graph, but only the figure author can confirm). I've fixed the term (see Image for justification); however, in accordance with the formula in the Definition section, it should also be an average, as stated below.

— Preceding unsigned comment added by Rafael Siqueira Telles Vieira (talk • contribs) 16:37, 25 October 2019 (UTC)

Incorrect caption
Note that the figure shows $$\sum_{i=1}^N W(x-x_i)$$ rather than $$\frac{1}{N}\sum_{i=1}^N W(x-x_i)$$ as the caption says. --anon


 * How do you know, as there is no y-axis in the picture? Oleg Alexandrov (talk) 03:31, 1 March 2006 (UTC)


 * $$\frac{1}{N}\sum_{i=1}^N$$ is an average. An average is never greater than the largest component.  If you look at the graph, the blue curve is clearly the sum of the component curves. Zik 03:40, 5 March 2006 (UTC)
 * You are right, I fixed the caption. I have no idea how I had missed that. :) Oleg Alexandrov (talk) 01:09, 6 March 2006 (UTC)
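The argument above (an average of bumps can never rise above the tallest bump, whereas a sum can) can be checked numerically. This is a minimal sketch: the sample points and the Gaussian bump shape W are assumptions chosen for illustration, not the data behind the figure.

```python
import numpy as np

def gaussian_kernel(u):
    """Standard normal density, used here as the bump shape W."""
    return np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)

# Hypothetical sample points, chosen only for illustration.
samples = np.array([-1.0, 0.0, 0.5, 2.0])
grid = np.linspace(-5.0, 6.0, 1101)

# One bump per sample, evaluated on the grid: shape (n_samples, n_grid).
bumps = gaussian_kernel(grid[None, :] - samples[:, None])

summed = bumps.sum(axis=0)     # sum of W(x - x_i), without the 1/N factor
averaged = bumps.mean(axis=0)  # (1/N) * sum of W(x - x_i)

# At every x the average stays at or below the tallest individual bump.
print(np.all(averaged <= bumps.max(axis=0) + 1e-12))
# The sum rises above the tallest bump wherever bumps overlap.
print(summed.max() > bumps.max())
```

Only the averaged curve integrates to one, so a plotted curve that towers over its component bumps is the un-normalised sum.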

name
In my experience calling the technique Parzen windowing is limited specifically to time-series analysis, and mainly in engineering fields. In general statistics (and in statistical machine learning), the term kernel density estimation is much more common. Therefore I'd propose it be moved there. As an aside, the attribution to Parzen is also historically problematic, since Rosenblatt introduced the technique into the statistics literature in 1956, and it had been used in several more obscure papers as early as the 1870s, and again in the early 1950s. --Delirium 22:59, 26 August 2006 (UTC)

x
What is x in the equation? --11:06, 5 October 2006 (UTC)
 * It is a real number, I guess. Oleg Alexandrov (talk) 02:55, 6 October 2006 (UTC)

Changing the name of this page
The technique called here Parzen window is called kernel density estimation in nonparametric statistics. It seems to me to be a much more general term and much clearer for people searching for it. The comment above states the same problem. I also agree that the article should refer to the Parzen-Rosenblatt notion of a kernel, and not just that of Parzen. The definition of a Parzen-Rosenblatt kernel should later be added on the kernel (statistics) page. —The preceding unsigned comment was added by Gpeilon (talk • contribs).
 * That's fine with me. If you move the page, you should also fix the double redirects. That is, after the move, while viewing the article at the new name, click on "what links here" on the left, and any redirects which point to redirects need to be made to point to the new name. Cheers, Oleg Alexandrov (talk) 03:18, 9 January 2007 (UTC)

Formula for optimal bandwidth
Hi, I just noticed that the optimal global bandwidth in Rosenblatt, M. The Annals of Mathematical Statistics, Vol. 42, No. 6. (Dec., 1971), pp. 1815-1842. has an additional factor of $$2^{\frac{2}{5}}$$. Just an oversight, or is there a reason for the difference that I'm missing? Best, Yeteez 18:34, 24 May 2007 (UTC)

In addition, what is the lower-case n in the optimal bandwidth? It is undefined. CnlPepper (talk) 17:18, 13 December 2007 (UTC)
 * n is the same as N, and is the number of data samples. I changed it in the Definition section from N to n to avoid confusion. 78.21.160.201 (talk) 12:50, 27 August 2009 (UTC)

Scaling factor
Shouldn't the $$\sigma$$ in the formula for K(x) be dropped, on the grounds that it is already there in the form of h in the formula for $$\hat{f}_h(x)$$?

--Santaclaus 15:45, 7 June 2007 (UTC)
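A short worked identity may help with the question above. It assumes the estimator has the usual form with a $$\frac{1}{nh}$$ prefactor and that K is a zero-mean Gaussian with standard deviation $$\sigma$$; both are assumptions about the article's formulas at the time.

$$\hat{f}_h(x) = \frac{1}{nh}\sum_{i=1}^n K\!\left(\frac{x-x_i}{h}\right), \qquad K(u) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-u^2/(2\sigma^2)} \;\Longrightarrow\; \frac{1}{h}K\!\left(\frac{u}{h}\right) = \frac{1}{h\sigma\sqrt{2\pi}}\, e^{-u^2/(2h^2\sigma^2)}$$

Under these assumptions the estimate depends on $$h$$ and $$\sigma$$ only through the product $$h\sigma$$, so fixing $$\sigma = 1$$ loses no generality and $$h$$ alone plays the role of the scale.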

Stata
Though I'm not sure whether it violates the guidelines of what Wikipedia is, I like the example section. But I would like to see the commands in some non-proprietary language, e.g. R. --Ben T/C 14:41, 2 July 2007 (UTC)

Practical Use
Can somebody please add a paragraph on what the practical use of Kernel density estimation is? Provide an example from statistics or econometrics? Thanks!

Kernel?
Isn't a Gaussian with variance of 1 totally arbitrary? On the other hand, using the PDF of your measurement tool as a kernel seems quite meaningful. For example, if you are measuring people's heights and know you can measure to a std. dev of 1/4", then convolving the set of measured heights by a Gaussian with std. dev of 1/4" seems like it captures everything you know about the data set. For example, in the limit of one sample, the estimation would reflect our best guess of the distribution for that one person. 155.212.242.34 22:07, 6 November 2007 (UTC)


 * Anybody? —Ben FrantzDale (talk) 16:00, 26 August 2008 (UTC)

--> I agree with the above poster that a standard gaussian is arbitrary. True, gaussians are often used as the kernel, but the variance of the gaussian is usually selected based on the "coarseness" of the desired result, and therefore not necessarily 1. —Preceding unsigned comment added by Zarellam (talk • contribs) 07:08, 17 April 2009 (UTC)

The variance then is the parameter h and can still be chosen as desired. I fixed this on the page.170.223.0.55 (talk) 14:57, 27 April 2009 (UTC)

Properties section - more on $$c_1, c_2, c_3$$
It appears that the section Properties tells us how to select $$h$$. However, I found several things confusing here, and would like to see these described more clearly.

First, if I'm interpreting correctly, $$c_1$$ and $$c_2$$ would be constants for the standard normal kernel that was earlier stated to be the common choice, e.g. $$c_1=1$$ and $$c_2\approx0.28$$. The fact that these constants for the standard normal were not given confused me and left me thinking that maybe there was a notational inconsistency or something, or that I wasn't interpreting something right. So please, mention what these constants are for the standard kernel choice.

Next and more serious, I'm still confused about $$c_3$$. It appears that we're going to find $$c_3$$ in order to find $$h^*$$ But $$c_3$$ apparently must be estimated as a function of $$h$$. I mean if $$f$$ is the underlying true distribution, which we don't know, then we don't know $$f(x)$$, so the implication is that we'd need to use $$\hat f$$, which is defined in terms of $$h$$. So it seems like $$h^*$$ has a circular definition. —Preceding unsigned comment added by 98.207.54.162 (talk) 19:12, 7 February 2009 (UTC)


 * You are correct. I added an internal link about $$c_1$$ and $$c_2$$ to the relevant page. For $$c_3$$ somebody with more knowledge of the estimation algorithms (cross-validation, plug-in etcetera) should have a look, as none of those algorithms are presently discussed on wikipedia. In any case the parameter $$c_3$$ must be estimated from the input data set, and is usually derived from its variance $$\sigma^2$$. Probably something should be said about the derivation of the $$R(f,\hat f(x))$$ function as well, which is (I think) the AMISE form. 78.21.160.201 (talk) 12:44, 27 August 2009 (UTC)
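One common way out of the circularity is a normal-reference rule: evaluate $$c_3$$ as if $$f$$ were a normal density with the sample's standard deviation, so that no pilot estimate of $$f$$ is needed. The sketch below assumes the article's constants are $$c_1=\int x^2K(x)\,dx$$, $$c_2=\int K(x)^2\,dx$$ and $$c_3=\int f''(x)^2\,dx$$, with $$h^* = c_1^{-2/5}c_2^{1/5}c_3^{-1/5}n^{-1/5}$$. The function name is hypothetical, and this is only one of several selectors (it is neither the cross-validation nor the plug-in method).

```python
import math
import numpy as np

def normal_reference_bandwidth(data):
    """h* = c1^(-2/5) * c2^(1/5) * c3^(-1/5) * n^(-1/5), with c3 evaluated
    as if f were normal (an assumption about f, not knowledge of it)."""
    data = np.asarray(data, dtype=float)
    n = data.size
    sigma = data.std(ddof=1)                 # sample standard deviation
    c1 = 1.0                                 # integral of x^2 K(x), standard normal kernel
    c2 = 1.0 / (2.0 * math.sqrt(math.pi))    # integral of K(x)^2, about 0.282
    # integral of f''(x)^2 when f is the N(mu, sigma^2) density
    c3 = 3.0 / (8.0 * math.sqrt(math.pi) * sigma**5)
    return c1**(-0.4) * c2**0.2 * c3**(-0.2) * n**(-0.2)

# Synthetic data, for illustration only.
sample = np.random.default_rng(0).normal(size=200)
print(normal_reference_bandwidth(sample))
```

With these constants the expression simplifies to $$(4/3)^{1/5}\,\hat\sigma\,n^{-1/5}$$, which shows how $$c_3$$ ends up being derived from the sample variance. The result is only as good as the normality assumption.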

Comparison to histogram
The description of a histogram as KDE with a boxcar kernel is not entirely accurate. In a histogram the bin centers are fixed, whereas in KDE the kernel is centered on each data point. See this page for more explanation. --Nubicles (talk) 04:19, 20 February 2009 (UTC)

Hence I would suggest deleting the sentence about the histogram---I would say it is significantly misleading. -- Spireguy (talk) 02:25, 19 May 2010 (UTC)
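The distinction can be made concrete with a small sketch. The data, bin edges and function names below are hypothetical, chosen only to show that boxes fixed by bin edges and boxes centred on the data points give different estimates at the same x.

```python
import numpy as np

# Hypothetical data, for illustration only.
data = np.array([0.9, 1.1, 2.9, 3.1])
h = 1.0                            # boxcar half-width (bandwidth)
edges = np.array([0.0, 2.0, 4.0])  # histogram bins of width 2h

def boxcar_kde(x, data, h):
    """KDE with the uniform kernel K(u) = 1/2 on [-1, 1]: one box centred on each data point."""
    u = (x - data) / h
    return np.sum(np.abs(u) <= 1.0) / (2.0 * h * data.size)

def histogram_density(x, data, edges):
    """Histogram density estimate: boxes fixed by the bin edges, not by the data."""
    counts, _ = np.histogram(data, bins=edges)
    idx = np.searchsorted(edges, x, side="right") - 1
    if idx < 0 or idx >= counts.size:
        return 0.0                 # x lies outside the binned range
    width = edges[idx + 1] - edges[idx]
    return counts[idx] / (data.size * width)

for x in (-0.05, 0.0, 1.0):
    print(x, boxcar_kde(x, data, h), histogram_density(x, data, edges))
```

In this sketch the two estimates agree at x = 1.0 but differ near the bin edge at 0, because the histogram's value depends on where the edges were placed while the boxcar KDE's does not.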

Merge from Multivariate kernel density estimation
I suggest that the article Multivariate kernel density estimation be merged into this one. They cover essentially the same topic, with all formulas being essentially the same. The multivariate case is arguably more complicated, but it fits better as a subsection of this article than as a standalone topic. Note that currently this page already mentions the multivariate estimation, at least in the examples section. //  st pasha  » 17:30, 24 September 2010 (UTC)


 * There is a certain logic to this suggestion. However, the multivariate article is much better written than the more general one - B-class in my opinion.  It would be a shame to take a B-class and a start-class article and turn them into one start-class article.  Perhaps we should just create a section in this article, on the multivariate case, where we summarise that article and link to it.
 * N.B. I didn't write either article.
 * Yaris678 (talk) 07:43, 25 September 2010 (UTC)


 * I agree that the multivariate article is written better than the generic one. But is this a reason for keeping those articles separate? Think about the future: as both articles progress, they will have to cover more and more of the same topics, and the necessity to merge them will become more and more obvious. However, the more the two articles grow, the more difficult it will be to merge them, especially if they start developing different notation.
 * As to the rating, I'm not sure if B class is appropriate. Sure, the author put considerable effort into the article, not to mention all those beautiful pictures. However the breadth of coverage is lacking. Not having my book on density estimation at hand, I can name at least several major topics which should certainly be included: (1) the intuitive explanation why this estimator “works”; (2) asymptotic properties of the estimator: bias and variance, asymptotic behavior of the bandwidth parameter; (3) construction of approximate confidence bands; (4) construction of uniform confidence band; (5) other methods (besides MISE) for choosing the bandwidth; (6) connection between kernel density estimation and characteristic function estimation. //  st pasha  » 11:38, 25 September 2010 (UTC)


 * In starting the article on multivariate kernel density estimation, I wanted to write it as a tutorial which started with some data with the end result being a kernel density estimate, by highlighting the principal steps required, rather than a list of mathematical results. For example, discussing alternative optimality criteria at the same time the MISE is currently introduced would add detail to the principle of optimal bandwidth selection without adding to conceptual understanding. I have the source material for all these added sections, but they would require careful placement.
 * In my experience, univariate and multivariate data analysis are usually kept separate because the latter is conceptually different, e.g. separate pages for the univariate and multivariate normal distributions (matrix analysis versus real analysis). For multivariate kernel density estimation, the bandwidth parameter is a matrix which controls both the magnitude and orientation of the kernel, and where orientation has no analogue in the univariate case. I could add a section on this important idea of kernel orientation which I think would cement its conceptual difference to univariate kernel estimation (it was left out at the moment as it can be viewed as an overly technical question).
 * Drleft (talk) 22:37, 26 September 2010 (UTC)
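For reference, the role of the bandwidth matrix mentioned above can be written out in the standard textbook form. The symbols $$\mathbf{H}$$ and $$K_{\mathbf{H}}$$ are assumed notation here, not quoted from either article.

$$\hat{f}_{\mathbf{H}}(\mathbf{x}) = \frac{1}{n}\sum_{i=1}^n K_{\mathbf{H}}(\mathbf{x} - \mathbf{X}_i), \qquad K_{\mathbf{H}}(\mathbf{x}) = |\mathbf{H}|^{-1/2}\, K\!\left(\mathbf{H}^{-1/2}\mathbf{x}\right)$$

Here $$\mathbf{H}$$ is a symmetric positive definite $$d \times d$$ matrix. Its eigenvalues set the amount of smoothing and its eigenvectors set the orientation of the kernel. For $$d = 1$$ it reduces to $$\mathbf{H} = h^2$$, where there is no orientation left to choose.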


 * Well, this is the problem however. According to one of the Wikipedia policies, WP:NOTHOWTO, tutorial-style articles are frowned upon. Thus, discussing the alternatives to MISE selection is in fact a must (even if it does not add to conceptual understanding), because some people might be interested in learning about those alternatives, other people may be using a software package that gives them a choice for which bandwidth selector to use, and they’ll come here to understand what different alternatives actually mean, etc.
 * Whether or not to keep the articles separate is a judgment call. Some topics discuss univariate/multivariate cases together (such as probability density function, characteristic function, ordinary least squares), others keep them separate (e.g. normal distribution / multivariate normal distribution). One of the advantages of keeping things together is that it is easier to explain the concept to a reader if you start from the univariate case, and then move on to the multivariate. Also consider this: if you merge these articles, then it will automatically improve the quality of kernel density estimation as well. It is like killing two birds with one stone. //  st pasha  » 00:57, 27 September 2010 (UTC)

One consideration in the merge/not-merge question is that of article length, and there may be some guidelines on this. I think that articles with worthwhile content as separate articles would be too long to merge. In addition to the topics mentioned in the discussion above, there are some others that are yet to be mentioned: density estimation with known bounds on the range of values, and density estimation for circular data. In any case, it seems that the next move should be to remove the "multivariate" material from the present article and merge it into the multivariate article so that it has a logical structure. Things could be left there, but this preliminary step would presumably make it easier to do a full merge if that were decided on. Melcombe (talk) 08:55, 27 September 2010 (UTC)


 * @Stpasha. I've added sections which address points (1), (3) and (5). I'm not sure about (4) and (5) since confidence bands don't have a widely accepted unambiguous definition for surfaces (partially because visualisation would be difficult), and for (6), estimating a characteristic function per se is not particularly useful for multivariate exploratory data analysis. Its most common application would be density estimation for contaminated data i.e. the deconvolution density estimation problem, but this would deserve a separate wikipage.
 * @Melcombe. This appears to be a good compromise to separate out the multivariate material. I didn't do this initially since I wasn't sure how much of the univariate KDE page I could edit. Drleft (talk) 17:28, 27 September 2010 (UTC)


 * Most multivariate references transferred to multivariate kernel density estimation, multivariate example code and figures replaced by univariate ones, notation (esp. bandwidth selectors) made more consistent. Drleft (talk) 21:58, 29 September 2010 (UTC)

Technical details in definition
Several changes have been made recently about technical mathematical details in the definition which are not entirely correct.
 * A kernel function does not have to be symmetric or non-negative, but symmetric, non-negative kernels are the most commonly used. Exceptions include asymmetric boundary kernels for estimation near a sharp boundary or non-positive kernels for density derivative estimation. However, for pedagogical reasons too, kernel density estimation is almost always introduced with symmetric, non-negative kernels since they are the simplest to analyse mathematically. Furthermore, the subsequent AMISE formula is true only for symmetric, non-negative kernels.
 * Certainly, one can contemplate the use of asymmetric kernels. However, the subsequent discussion in the asymptotics section makes implicit use of the fact that the kernel is symmetric, to ensure that the first-order bias vanishes. It is easier to define kernels as being symmetric, and then mention the asymmetric ones as an exception, than the other way round. As for positivity, this property of a kernel is not needed anywhere, so there is no need to impose it. There are kernels that take negative values (such as the sinc function) which are sometimes used. Lastly, we want the definition of a kernel to match that in the Kernel article.
 * According to the definition in the cited wikipage a kernel is defined to be non-negative function. Drleft (talk) 21:18, 6 October 2010 (UTC)


 * The choice between upper and lower case letters to represent random samples is subject to various conventions. One convention adopted by many statistics research journals is that random variables as a function are capitalised (X) whereas a particular value from its sample space is lower case (x). Following this convention, e.g. Var X is the population variance of the r.v. X  whereas Var x = 0 since x is a single point and thus has no variance. So the subsequent MISE and AMISE (as they are integrated variances) will be affected by the choice of notation.
 * It is in fact customary to denote the sample using lowercase letters, and change them to uppercase only when there is a need to explicate that they should be treated as random variables. So we write $$\scriptstyle \hat{f}(x) = \frac1n\sum K_h(x-x_j)$$, however when computing the expected value of this quantity, it changes to $$\scriptstyle \operatorname{E}[\hat{f}(x)] = \operatorname{E}[ K_h(x-X_j) ]$$.
 * I don't agree that is customary. I have read hundreds of articles in kernel density estimation and the KDE is defined as $$\scriptstyle \hat{f}(x) = \frac1n\sum K_h(x-X_j)$$ (i.e. with capital Xj) and with no switching between upper and lower case, in the overwhelming majority of cases e.g. 12/15 of references in the wikipage, except Epanechnikov (1969), Wahba (1975) and Bowman (1984). Drleft (talk) 21:18, 6 October 2010 (UTC)


 * Stating that the characteristic function estimator coincides with the kernel density estimator is somewhat vague. It could be interpreted as $$\hat{f} (x) = \hat{\varphi}(t)$$ which is not true since the left hand side belongs to the spatial domain and the right hand side to the frequency domain. It is clearer and more rigorous to say that there is a one-to-one correspondence (or bijection) between the two estimators. Drleft (talk) 22:47, 5 October 2010 (UTC)
 * I added the word “density” there, so it’s “characteristic function density estimator”. The idea is that it is an estimator of a density, but via the c.f. //  st pasha  » 04:26, 6 October 2010 (UTC)


 * Is it always true, that the $$\int_{\mathbf{R}^d} K(x) dx=1$$? Is it not true only for certain K? Mullins CZ (talk) 18:23, 30 August 2016 (UTC)
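On the question above: the normalisation is a condition imposed on K as part of the definition, not something that holds for an arbitrary function. Any non-negative integrable function with positive total mass can be rescaled to satisfy it. A short derivation shows why the condition is wanted, writing the scaled kernel as $$K_h(u) = h^{-d}K(u/h)$$ with a scalar bandwidth (assumed notation):

$$\int_{\mathbf{R}^d} \hat{f}(x)\,dx = \frac{1}{n}\sum_{i=1}^n \int_{\mathbf{R}^d} h^{-d} K\!\left(\frac{x - X_i}{h}\right) dx = \frac{1}{n}\sum_{i=1}^n \int_{\mathbf{R}^d} K(u)\,du = 1$$

The substitution $$u = (x - X_i)/h$$ absorbs the factor $$h^{-d}$$, so the estimate integrates to one exactly when K does.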

Old Faithful example with data is flawed
The example using the old faithful data is flawed and has to be removed. The faithful data is not continuous: out of the 272 observations there are only 51 unique data values. This implies that the data could not have come from a continuous density. The function kde assumes that the data is continuous, not discrete. The theory presented in the article does not apply to such discrete data. Someone please add an appropriate example. —Preceding unsigned comment added by 64.235.198.242 (talk) 01:24, 12 October 2010 (UTC)


 * All measurements are discretisations due to the limited accuracy of measuring devices, but this does not prevent kernel density estimation being applied to these discretised measurements since the aim is to estimate the underlying continuous density function. For the Old Faithful data, the waiting time has been rounded to the nearest minute, but this doesn't change its underlying continuous nature, so a continuous density estimate is appropriate. Drleft (talk) 12:10, 15 October 2010 (UTC)

I understand all measurements are discretizations of continuous data; this is precisely the problem!!! If I have $$U$$ being uniform on $$[0,1]$$ and discretize it to the binary $$B=I_{U<0.5}$$, then is $$B$$ still a continuous random variable? It is not, regardless of $$U$$. Thus your comment does not change the fact that the data fed to the estimator is discrete, but the estimator smooths it as if it were continuous. The data fed to the kernel estimator should have been continuous (up to floating point accuracy) in the first place. This example shows that this particular kernel estimator cannot distinguish between discrete and continuous data. A valid estimator will smooth continuous data and NOT smooth discrete data. In other words, a properly working kernel estimator will give bandwidth = 0 when it is given discrete data. The true distribution of a discrete set of data, say, (2,2,2,3,3,3,3,4,5,6) should not be smoothed. In conclusion, the example is logically flawed: two mistakes cancel each other to give the appearance of a correct example. The mistakes are: discrete data is fed to the estimator instead of continuous (mistake 1), the effect of which is then canceled out by an improperly working kernel estimator (mistake 2). Two different errors have by chance canceled each other out, so that the correct answer was obtained by happenstance. —Preceding unsigned comment added by 132.204.251.179 (talk) 15:45, 15 October 2010 (UTC)
 * (a) I don't agree that discrete data should lead only to discrete estimates. Continuous estimators from discrete data have their advantages e.g. normal approximations to the binomial and Poisson distributions allow for simplified calculation of quantiles.


 * You are very wrong! Of course discrete data should lead to discrete estimates; saying the opposite is a logical contradiction!!! Discrete data can be smoothed (and I have published a paper on the topic), but the estimator uses a DISCRETE kernel, not a continuous one. If we cannot agree on something so elementary as saying that discrete data should remain discrete, there is no point in arguing.


 * Side note: using CAPITALIZED words and numerous exclamation signs (!!!) is a written equivalent of shouting at your opponent. Use those a lot (especially in a paper), and you will prove your point much faster.
 * This is not true. Many people like me use capitalization or (!) to emphasize a word or sentence which is important. You are free to have your own interpretation, but you are not free to impose  it on me. Also using condescending sarcasm like yours will not actually help you prove your point faster, but using capitalization to emphasize things and make your point clearer might help you.
 * As for the actual topic of the discussion, you do realize that any data is actually discrete?
 * Yes. Discrete up to floating point accuracy, which on modern computers means that any two points X and Y are never going to be equal, unless you have so much data that you have exhausted the 16 decimal place accuracy on a typical computer. However, none of this proves your point. Why? Because nothing in the world is continuous per se, continuity is just a mathematical abstraction!

It is our judgement whether we want to assume that the underlying distribution is actually continuous or discrete (or maybe a mixture?).
 * Yes. So once you assume that the data is going to be discrete, you cannot then proceed to do the estimation as if it were continuous. You have to understand that you are free to choose to model the data as either discrete or continuous, but once you assume either one of these, you have to be consistent for the rest of the analysis.

In the case of the Old Faithful eruptions it is natural to suppose that the eruption interval is a continuous r.v., since more or less every physical variable is continuous (except for those that by construction take values in a discrete set -- such as the number of particles in a certain process, or certain quantum numbers, etc). In the dataset that we have, the eruption intervals were rounded to the nearest integer, which really cannot stop us from using the kernel estimator to recover the density. //  st pasha  » 02:47, 19 October 2010 (UTC)
 * No, you are wrong. If your data is discrete (after the rounding discretization), it is WRONG to use a continuous kde. You can use a discrete one if you want to put probability mass on other integer values, but you have to be clear from the beginning whether the data is assumed to be discrete or continuous. I am afraid professional statisticians get confused on this issue. Perhaps it is because they know little about smoothing of discrete data. The fact that many people like you think the example is OK does not make it OK. Popular opinion does not justify incorrect mathematics. Unfortunately, I do not have time to argue with the prevailing view, no matter how wrong it is, so I will let this one go and hope people will see reason at some future point in time.


 * (b) For this data set, following your reasoning, since 44 minutes is not observed, then we should not assign any probability that a future waiting time would be 44 minutes. This seems to be artificially restrictive, because there is no a priori reason to believe that 44 minutes is an impossible outcome. So in this case, it is sensible to smooth the discretised data to estimate densities at unobserved values, including non-integral values. Drleft (talk) 23:54, 15 October 2010 (UTC)


 * You are not following my reasoning. You mix the concepts of discrete and continuous data. There is an important difference between the assumption that the data is discrete or continuous. One uses different kernel estimators for these VERY different settings. The fact is that if the data you provide is (3,3,3,3,4,4,4,5,6), then BY DEFINITION this data is not continuous, but discrete. See, for example, Cinlair's probability textbook. You cannot arbitrarily change the definition of continuous and discrete data.

This is all a moot point if the bandwidth is much larger than the discretization error. In the other limit, where the bandwidth is much smaller than the discretization error, the flaw in applying KDE will become very apparent. ---(192.161.76.82 (talk)) —Preceding undated comment added 23:48, 18 October 2017 (UTC)
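The two limits can be illustrated with a rough numerical sketch. It uses synthetic normal data rounded to whole units as a stand-in (an assumption; this is not the Old Faithful data set) and a hand-written Gaussian KDE rather than any particular library's estimator.

```python
import numpy as np

def gaussian_kde_at(x, data, h):
    """Gaussian-kernel density estimate at the points x with bandwidth h."""
    x = np.asarray(x, dtype=float)
    data = np.asarray(data, dtype=float)
    u = (x[:, None] - data[None, :]) / h
    return np.exp(-0.5 * u**2).sum(axis=1) / (data.size * h * np.sqrt(2.0 * np.pi))

rng = np.random.default_rng(0)
raw = rng.normal(loc=70.0, scale=10.0, size=272)  # synthetic stand-in data
rounded = np.round(raw)                           # discretised to whole units

grid = np.linspace(40.0, 100.0, 601)
gaps = {}
for h in (4.0, 0.05):
    # Largest pointwise difference between the estimates from raw and rounded data.
    gaps[h] = np.max(np.abs(gaussian_kde_at(grid, raw, h) - gaussian_kde_at(grid, rounded, h)))
    print(f"h = {h}: largest raw-vs-rounded difference = {gaps[h]:.4f}")
```

With a bandwidth several times the rounding step, the two estimates are nearly indistinguishable. With a bandwidth far below it, the estimate from rounded data collapses into spikes at the integers.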

Things that are missing
a) Discussion of boundary effects. These effects are important in many practical applications. Statements in the article about optimal AMISE rates are incorrect in the presence of boundary bias.

b) High-order kernels. If you are willing to assume that the kernel is not necessarily positive, these should be mentioned. Again, statements in the article about optimal AMISE rates are incorrect if high-order kernels are allowed (these kernels lead to better asymptotic convergence rates). — Preceding unsigned comment added by 95.32.237.124 (talk) 22:23, 19 August 2011 (UTC)

Error in R-code fragment
Could someone check the R code fragment? In h <- hpi(x=waiting), the function hpi is not accessible (maybe removed); in plot(fhat, drawpoints=TRUE), drawpoints is no longer a parameter. — Preceding unsigned comment added by 143.169.52.102 (talk) 09:45, 14 August 2012 (UTC)

Error in reference about origin
This article mentions: “after [Emanuel Parzen] and [Murray Rosenblatt], who are usually credited with independently creating it in its current form”. This is not correct. Parzen refers to Rosenblatt (so clearly NOT independent, and btw both papers are in the same journal), and the paper of Parzen does not provide the simple equations that we find in this paper.

If there is no objection, I'd like to correct this. Phdb (talk) 22:08, 25 September 2011 (UTC)


 * Parzen wrote: "Despite the obvious importance of these problems, we are aware of only two previous papers on the subject of estimation of the probability density function (Rosenblatt [5] and Whittle [6]). In this paper we show how one may construct a family of estimates of f(x), and of the mode, which are consistent and asymptotically normal." It's still not clear to me exactly who did what, or who is credited with what, but I'd rather go by a secondary source on that.  Dicklyon (talk) 05:49, 20 June 2014 (UTC)

Parametric or non-parametric?
The first sentence says "In statistics, kernel density estimation (KDE) is a **non-parametric** way to estimate the probability density function of a random variable." Then further down is: "The bandwidth of the kernel is a **free parameter** which exhibits a strong influence on the resulting estimate." So which is it? — Preceding unsigned comment added by 115.187.246.104 (talk) 10:05, 3 June 2014 (UTC)


 * I think the point is that the method is not estimating a parameter that describes a distribution. There is a free parameter in the choice of window, but given a window, the density estimate is nonparametric, determined directly by the data without parameters limiting the shape.  Dicklyon (talk) 05:53, 20 June 2014 (UTC)

It is sort of parametric though in that it relies on an assumption about the smoothness of the distribution... which is equivalent to parameterizing the distribution by the amplitudes of the kernels. Maybe a clarification would be helpful? 2601:4A:857F:4880:F8EC:DAAF:2844:296F (talk) 23:57, 12 December 2020 (UTC)

Definition - symmetric but not necessarily positive
"K(•) is the kernel — a symmetric but not necessarily positive function that integrates to one"

Shouldn't it be the exact opposite: "K(•) is the kernel — a positive but not necessarily symmetric function that integrates to one"?

--Lucas Gallindo (talk) 21:40, 18 October 2014 (UTC)

Yes, I agree. I'll change it. DWeissman (talk) 21:20, 25 November 2014 (UTC)

Dr. Guillen's comment on this article
Dr. Guillen has reviewed this Wikipedia page, and provided us with the following comments to improve its quality:

"I would insert a note at the end of the Bandwidth selection section.

Bandwidth selection for kernel density estimation in heavy-tailed distributions is difficult.

I would insert this reference Bolance, C., Buch-Larsen, T., Guillén, M. and Nielsen, J.P. (2005) “Kernel density estimation for heavy-tailed distributions using the Champernowne transformation” Statistics, 39, 6, 503-518."

We hope Wikipedians on this talk page can take advantage of these comments and improve the quality of the article accordingly.

Dr. Guillen has published scholarly research which seems to be relevant to this Wikipedia article:


 * Reference : Ramon Alemany & Catalina Bolance & Montserrat Guillen, 2012. "Nonparametric estimation of Value-at-Risk," Working Papers XREAP2012-19, Xarxa de Referencia en Economia Aplicada (XREAP), revised Oct 2012.

ExpertIdeasBot (talk) 06:47, 8 July 2015 (UTC)

Done! - User:Harish victory 09:17, 15 November 2015 (UTC)

Alternative methods
This section starts off with an incorrect claim:

> Kernel density estimation may have certain limitations because it is based on Gaussian distribution statistics

There are KDE methods that use the Gaussian distribution as a starting point (pilot estimator). There are KDE methods that use Gaussian kernels. These are not necessary or exhaustive choices. There are also KDE methods for heavy-tailed distributions, such as variable bandwidth KDE and data transformations.

Then there is the unsupported claim:

> In this connection, head/tail breaks[27] and its deduced TIN-based density estimation[28] can better characterize the heavy-tailed data.

Looking at the two references on the ArXiv, how can you say that the TIN method better characterizes a certain kind of data when you (1) never compared method TIN to KDE (neither paper mentions KDE) and (2) have not worked out the statistical properties of method TIN (not published in a stats journal btw)?

---(192.161.76.82 (talk)) —Preceding undated comment added 23:06, 18 October 2017 (UTC)

Minor correction of wording
I guess it must be "Intuitively one wants to choose h as large as the data will allow" in the Definition section, because you can let h go to zero, achieving a perfect fit between data and probability density (a sum of delta functions). Am I right or wrong? Syspedia (talk) 10:12, 16 November 2017 (UTC)

Histogram example errors
Histogram bin edges need to be defined: -4, -2, 0, ..., 8.

Also, if there are 6 data points, why are the bars 1/12 high? — Preceding unsigned comment added by 2600:1700:EB40:86D0:707F:873F:613E:BAB8 (talk) 00:01, 18 December 2021 (UTC)
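A bar height of 1/12 is what a density-normalised histogram gives for a bin holding one of 6 points when the bin width is 2: height = count / (n × width) = 1 / (6 × 2). The sketch below uses the edges proposed above; the six data values are assumed for illustration and should be replaced with the article's actual sample.

```python
import numpy as np

# Six points assumed for illustration; substitute the article's actual sample.
data = np.array([-2.1, -1.3, -0.4, 1.9, 5.1, 6.2])
edges = np.arange(-4.0, 10.0, 2.0)   # bin edges -4, -2, 0, 2, 4, 6, 8

counts, _ = np.histogram(data, bins=edges)
heights = counts / (data.size * np.diff(edges))   # count / (n * bin width)

print(counts)                              # points per bin
print(heights * 12)                        # bar heights in units of 1/12
print((heights * np.diff(edges)).sum())    # total area under the bars
```

With this normalisation the bar areas sum to one, which is what makes the histogram directly comparable with a kernel density estimate.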

Tree-structured Parzen estimators
Tree-structured Parzen estimators link here, but shouldn't they have their own article? There is more to say, see here: https://towardsdatascience.com/a-conceptual-explanation-of-bayesian-model-based-hyperparameter-optimization-for-machine-learning-b8172278050f Biggerj1 (talk) 21:04, 12 October 2022 (UTC)