User:Peteymills/variable kernel estimation

Adaptive or "variable-bandwidth" kernel density estimation is a form of kernel density estimation in which the size of the kernels used in the estimate is varied depending upon either the location of the samples or the location of the test point. It is a particularly effective technique when the sample space is multi-dimensional.

Rationale
Given a set of samples, $$\lbrace \vec x_i \rbrace$$, we wish to find the density, $$P(\vec x)$$, at a test point, $$\vec x$$:

$$ P(\vec x) \approx \frac{W}{n h^D} $$

$$ W = \sum_{i=1}^n w_i $$

$$ w_i = K \left ( \frac{\vec x - \vec x_i}{h} \right ) $$

where $$n$$ is the number of samples, $$K$$ is the "kernel", $$h$$ is its width and $$D$$ is the number of dimensions in $$\vec x$$. The kernel can be thought of as a simple, linear filter.
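As a concrete illustration, the estimate above can be evaluated directly. The following sketch assumes a Gaussian kernel and plain NumPy arrays; the function name fixed_kde is chosen here only for illustration.

    import numpy as np

    def fixed_kde(x, samples, h):
        """Fixed-bandwidth estimate P(x) ~ W / (n h^D) with a Gaussian kernel."""
        n, D = samples.shape                       # samples has shape (n, D)
        u = (x - samples) / h                      # (x - x_i) / h for every sample
        w = np.exp(-0.5 * np.sum(u**2, axis=1)) / (2 * np.pi)**(D / 2)
        return np.sum(w) / (n * h**D)

Here the kernel is normalized so that the estimate integrates to one.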

Using a fixed filter width may mean that in regions of low density, all samples will fall in the tails of the filter with very low weighting, while regions of high density will find an excessive number of samples in the central region with weighting close to unity. To fix this problem, we vary the width of the kernel in different regions of the sample space. There are two methods of doing this: balloon and pointwise estimation. In a balloon estimator, the kernel width is varied depending on the location of the test point. In a pointwise estimator, the kernel width is varied depending on the location of the sample.
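The structural difference between the two approaches can be made explicit in code. In the sketch below the bandwidth rules h_test and h_sample are unspecified placeholders supplied by the caller; everything else follows the formulas above.

    import numpy as np

    def balloon_kde(x, samples, h_test):
        """Balloon estimator: a single width chosen from the test point x."""
        n, D = samples.shape
        h = h_test(x)                              # width depends on the test point
        u = (x - samples) / h
        w = np.exp(-0.5 * np.sum(u**2, axis=1)) / (2 * np.pi)**(D / 2)
        return np.sum(w) / (n * h**D)

    def pointwise_kde(x, samples, h_sample):
        """Pointwise estimator: one width per sample."""
        n, D = samples.shape
        h = h_sample(samples)                      # shape (n,): width depends on each sample
        u = (x - samples) / h[:, None]
        w = np.exp(-0.5 * np.sum(u**2, axis=1)) / (2 * np.pi)**(D / 2)
        return np.sum(w / h**D) / n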

For multivariate estimators, the parameter, $$h$$, can be generalized to vary not just the size, but also the shape of the kernel. This more complicated approach will not be covered here.

Balloon estimators
A common method of varying the kernel width is to make it vary inversely with the density at the test point:

$$ h = \frac{k}{\left [ n P(\vec x) \right ]^{1/D}} $$

where $$k$$ is a constant. If we back-substitute the estimated PDF, and assume a Gaussian kernel function, it is easy to show that $$W$$ is a constant:

$$ W = k^D (2 \pi)^{D/2} $$

This produces a generalization of the k-nearest neighbour algorithm. That is, a uniform kernel function will return the KNN technique.
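Because the width depends on the very density being estimated, one common (though not the only) way to proceed is a two-pass scheme in which a pilot estimate with a fixed bandwidth sets $$h$$ at each test point. The sketch below reuses fixed_kde from the earlier sketch; the pilot bandwidth h0 and the constant k are user-chosen parameters.

    def adaptive_kde(x, samples, k, h0):
        """Balloon estimate with h = k / (n * P_pilot(x))**(1/D)."""
        n, D = samples.shape
        p_pilot = fixed_kde(x, samples, h0)        # pilot fixed-bandwidth estimate of P(x)
        h = k / (n * p_pilot)**(1.0 / D)           # kernel shrinks where the density is high
        return fixed_kde(x, samples, h)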

There are two components to the error: a variance term and a bias term. The variance term is given as:

$$ e_1 = \frac{P}{n h^D} $$

where $$D$$ is the number of dimensions.

The bias term is found by evaluating the approximated function in the limit as the kernel width becomes much larger than the sample spacing. Using a Taylor expansion of the true density, the bias term falls out as:

$$ e_2 = \frac{\sigma^2}{n} \triangle P $$

Using these error estimates, it is possible to derive an optimal kernel width for each estimate.
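To illustrate how such an optimum arises, suppose the width parameter $$\sigma$$ in the bias term is identified with the kernel width $$h$$ (an assumption, since $$\sigma$$ is not specified above). Setting the derivative of the total error with respect to $$h$$ to zero then yields a closed-form width:

$$ \frac{\partial}{\partial h} \left ( \frac{P}{n h^D} + \frac{h^2}{n} | \triangle P | \right ) = 0 \quad \Rightarrow \quad h = \left ( \frac{D \, P}{2 | \triangle P |} \right )^{1/(D+2)} $$

Since both $$P$$ and $$\triangle P$$ are unknown, in practice they would themselves be approximated, for instance from a pilot estimate.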

Use for statistical classification
The method is particularly useful for statistical classification. There are two ways we can proceed: the first is to compute the PDF of each class separately and then compare the estimates. Alternatively, we can divide up the sum based on the class of each sample:

$$ P(j, \vec{x}) = \frac{1}{n h^D}\sum_{i=1, c_i=j}^n w_i $$

where $$c_i$$ is the class of the $$i$$th sample. The class of the test point may be estimated through maximum likelihood.
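A minimal sketch of such a classifier, assuming a single (balloon-style) bandwidth at the test point shared by all classes; the function name and the decision rule over the per-class sums are illustrative only.

    import numpy as np

    def classify(x, samples, classes, h):
        """Split the kernel sum by class and return the most likely class."""
        n, D = samples.shape
        u = (x - samples) / h
        w = np.exp(-0.5 * np.sum(u**2, axis=1)) / (2 * np.pi)**(D / 2)
        labels = np.unique(classes)
        joint = np.array([np.sum(w[classes == j]) for j in labels]) / (n * h**D)
        return labels[np.argmax(joint)], joint     # maximum-likelihood class and P(j, x)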

Many kernels, for instance a Gaussian, are smooth. Therefore, estimates of joint or conditional probabilities are both continuous and differentiable. This makes it easy to search for a border between two classes by zeroing the difference between the conditional probabilities:

$$ R(\vec x) = P(2 | \vec x) - P(1 | \vec x) = \frac{P(2, \vec x) - P(1, \vec x)}{P(1, \vec x) + P(2, \vec x)} $$

For instance, we could find a line that straddles the border and then use a numerical algorithm to find the root of $$R$$. This could be done as many times as necessary to sample the class border. We can use the border samples along with estimates of the gradients of $$R$$ to find the class of a test point:
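As an example of the root-finding step, the sketch below bisects along the segment between a point x1 from one class and a point x2 from the other (both NumPy arrays); R_of is assumed to be a function returning the conditional-probability difference $$R$$ defined above, and it must change sign between the two end points.

    def sample_border(R_of, x1, x2, tol=1e-6):
        """Locate a point on the class border along the segment from x1 to x2."""
        t_lo, t_hi = 0.0, 1.0
        r_lo = R_of(x1)
        while t_hi - t_lo > tol:
            t_mid = 0.5 * (t_lo + t_hi)
            x_mid = x1 + t_mid * (x2 - x1)
            if R_of(x_mid) * r_lo > 0:             # same sign as the lower end: move the bracket up
                t_lo, r_lo = t_mid, R_of(x_mid)
            else:
                t_hi = t_mid
        return x1 + 0.5 * (t_lo + t_hi) * (x2 - x1)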

$$ j = \arg \underset{i}{\min} | \vec{b_i} - \vec x | $$

$$ p = (\vec x - \vec{b_j}) \cdot \nabla_{\vec x} R (\vec{b_j}) $$

$$ c = ( 3 + p/|p| ) / 2 $$

where $$\lbrace \vec{b_i} \rbrace$$ sample the class border and $$c$$ is the estimated class. The conditional probability may be extrapolated to the test point:

$$ R(\vec x) \approx \tanh p $$
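Putting the last three equations together: given a set of precomputed border samples and the gradient of $$R$$ at each of them, the class follows from the sign of $$p$$ and $$R$$ is extrapolated with the hyperbolic tangent. In the sketch below, borders and gradients are assumed to be arrays of shape (m, D) holding $$\vec{b_i}$$ and $$\nabla_{\vec x} R(\vec{b_i})$$ respectively.

    import numpy as np

    def border_classify(x, borders, gradients):
        """Estimated class (1 or 2) and approximate R(x) from pre-sampled border points."""
        j = np.argmin(np.linalg.norm(borders - x, axis=1))   # nearest border sample b_j
        p = np.dot(x - borders[j], gradients[j])              # projection onto the gradient of R
        c = int((3 + np.sign(p)) / 2)                          # class 1 if p < 0, class 2 if p > 0
        return c, np.tanh(p)                                   # R(x) ~ tanh(p)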

The two-class classification scheme is easily generalized to multiple classes.