User:Igny/empirical measure

In probability theory, an empirical measure is a measure arising from a particular realization of a (usually finite) sequence of random variables. The precise definition is found below. Empirical measures are relevant to mathematical statistics.

The motivation for studying empirical measures is that the true underlying probability measure $$P$$ is often unknown. We collect observations $$X_1, X_2, \dots, X_n$$ and compute relative frequencies. We can then estimate $$P$$, or a related distribution function $$F$$, by means of the empirical measure or the empirical distribution function, respectively. Under certain conditions these estimates converge to their targets uniformly over a class of sets or functions, and theorems in the area of empirical processes provide rates for this convergence.

Definition
Let $$X_1, X_2, \dots$$ be a sequence of independent identically distributed random variables taking values in the state space S, with common probability distribution P.

Definition
 * The empirical measure $$P_n$$ is defined for measurable subsets of S and given by
 * $$P_n(A) = {1 \over n} \sum_{i=1}^n I_A(X_i)=\frac{1}{n}\sum_{i=1}^n \delta_{X_i}(A)$$
 * where $$I_A$$ is the indicator function and $$\delta_X$$ is the Dirac measure.
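As an illustration of the definition, $$P_n(A)$$ is simply the fraction of observations that fall in A. The following is a minimal sketch in Python; the function name `empirical_measure`, the sample values and the choice A = [0, 1] are assumptions made for the example, not part of the definition.

```python
def empirical_measure(sample, indicator):
    """Return P_n(A) = (1/n) * sum_i I_A(X_i), with A given by its indicator function."""
    n = len(sample)
    if n == 0:
        raise ValueError("sample must be non-empty")
    return sum(1 for x in sample if indicator(x)) / n


# Hypothetical observations X_1, ..., X_5 and the set A = [0, 1]
sample = [0.2, 1.5, -0.3, 0.9, 0.4]
p_n = empirical_measure(sample, lambda x: 0.0 <= x <= 1.0)
print(p_n)  # 3 of the 5 observations lie in [0, 1]
```

Representing A by its indicator function keeps the sketch independent of the state space S: any predicate on observations defines a set.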

Definition
 * $$\bigl(P_n(c)\bigr)_{c\in\mathcal{C}}$$ is the empirical measure indexed by $$\mathcal{C}$$, a collection of measurable subsets of S.

For a fixed measurable set A, $$nP_n(A)$$ is a binomial random variable with mean nP(A) and variance nP(A)(1-P(A)).
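The binomial mean and variance can be checked by simulation. The sketch below assumes, purely for illustration, that P is the uniform distribution on [0, 1) and A = [0, 0.3), so that P(A) = 0.3; the seed, the sample size n and the number of replications are arbitrary choices.

```python
import random


def count_in_set(n, indicator, draw):
    """Return n * P_n(A): the number of n fresh draws that land in A."""
    return sum(1 for _ in range(n) if indicator(draw()))


rng = random.Random(0)         # fixed seed, chosen arbitrarily
n, p, reps = 50, 0.3, 10_000   # A = [0, 0.3) under Uniform[0, 1), so P(A) = 0.3

# Each entry is one realization of n * P_n(A)
counts = [count_in_set(n, lambda x: x < p, rng.random) for _ in range(reps)]

sim_mean = sum(counts) / reps
sim_var = sum((c - sim_mean) ** 2 for c in counts) / (reps - 1)

print(sim_mean, n * p)            # simulated vs. theoretical mean nP(A)
print(sim_var, n * p * (1 - p))   # simulated vs. theoretical variance nP(A)(1 - P(A))
```

The simulated values should be close to 15 and 10.5 respectively, up to Monte Carlo error.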

By the strong law of large numbers, $$P_n(A)$$ converges to P(A) almost surely for fixed A. The problem of uniform convergence of $$P_n$$ to P was open until Vapnik and Chervonenkis solved it in 1968. If the class $$\mathcal{C}$$ is Glivenko-Cantelli with respect to P then $$P_n$$ converges to P uniformly over $$c\in\mathcal{C}.$$ In other words, with probability 1 we have


 * $$\|P_n-P\|_\mathcal{C}=\sup_{c\in\mathcal{C}}|P_n(c)-P(c)|\to 0$$

Empirical mean
To generalize this notion further, observe that the empirical measure $$P_n$$ maps measurable functions $$f:S\to \mathbb{R}$$ to their empirical mean,


 * $$f\mapsto P_n f=\int_S f\,dP_n=\frac{1}{n}\sum_{i=1}^n f(X_i)$$

In particular, the empirical measure of A is simply the empirical mean of the indicator function, $$P_n(A)=P_n I_A$$.
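A short Python sketch of the map $$f\mapsto P_nf$$ may help here. The name `empirical_mean` and the sample values are hypothetical; the second call recovers $$P_n(A)$$ as the empirical mean of an indicator, with A = [0, 1] chosen for illustration.

```python
def empirical_mean(sample, f):
    """Return P_n f = (1/n) * sum_i f(X_i)."""
    n = len(sample)
    if n == 0:
        raise ValueError("sample must be non-empty")
    return sum(f(x) for x in sample) / n


sample = [0.2, 1.5, -0.3, 0.9, 0.4]  # hypothetical observations

# Empirical mean of f(x) = x^2, i.e. the empirical second moment
second_moment = empirical_mean(sample, lambda x: x * x)

# Empirical measure of A = [0, 1] as the empirical mean of the indicator I_A
p_n_A = empirical_mean(sample, lambda x: 1.0 if 0.0 <= x <= 1.0 else 0.0)

print(second_moment, p_n_A)
```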

Here $$\mathbb{E}f$$ denotes the expectation $$\mathbb{E}f(X_1)=\int_S f\,dP$$. For a fixed measurable function f with $$\mathbb{E}f^2<\infty$$, $$P_nf$$ is a random variable with mean $$\mathbb{E}f$$ and variance $$\frac{1}{n}\mathbb{E}(f -\mathbb{E} f)^2$$. By the strong law of large numbers, $$P_nf$$ converges to $$\mathbb{E} f$$ almost surely for a fixed P-integrable function f. If the class $$\mathcal{F}$$ is Glivenko-Cantelli then with probability 1 we have
 * $$\|P_n-P\|_\mathcal{F}=\sup_{f\in\mathcal{F}}|P_nf-\mathbb{E}f|\to 0.$$

Empirical distribution function
The empirical distribution function is an example of an empirical measure. For real-valued iid random variables $$X_1,\dots,X_n$$ it is given by


 * $$F_n(x)=P_n((-\infty,x])=P_nI_{(-\infty,x]}.$$

In this case, the empirical measure is indexed by the class $$\mathcal{C}=\{(-\infty,x]:x\in\mathbb{R}\}.$$ It has been shown that $$\mathcal{C}$$ is a uniform Glivenko-Cantelli class; in particular,


 * $$\sup_F\|F_n(x)-F(x)\|_\infty\to 0$$

with probability 1.
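The empirical distribution function and its sup-norm distance to a known F can be computed directly. The following Python sketch assumes, for illustration only, that the true distribution is uniform on [0, 1] (so F(x) = x there); the function names, seed and sample sizes are arbitrary choices. The distance formula used is valid for a continuous F.

```python
import bisect
import random


def make_edf(sample):
    """Return the function F_n(x) = P_n((-inf, x]) for the given sample."""
    xs = sorted(sample)
    n = len(xs)
    if n == 0:
        raise ValueError("sample must be non-empty")

    def F_n(x):
        # bisect_right counts the observations that are <= x
        return bisect.bisect_right(xs, x) / n

    return F_n


def sup_distance(sample, cdf):
    """Return sup_x |F_n(x) - F(x)| for a continuous cdf F.

    F_n is a step function, so the supremum is reached at an observation
    or just to the left of one; both candidates are checked.
    """
    xs = sorted(sample)
    n = len(xs)
    return max(max((i + 1) / n - cdf(x), cdf(x) - i / n) for i, x in enumerate(xs))


def uniform_cdf(x):
    """Distribution function of the uniform distribution on [0, 1]."""
    return min(max(x, 0.0), 1.0)


F_n = make_edf([0.2, 1.5, -0.3, 0.9, 0.4])  # hypothetical observations
print(F_n(0.5))  # 3 of the 5 observations are <= 0.5

rng = random.Random(0)  # fixed seed, chosen arbitrarily
for n in (100, 10_000):
    sample = [rng.random() for _ in range(n)]
    print(n, sup_distance(sample, uniform_cdf))
```

The printed distances typically shrink as n grows, in line with the uniform convergence stated above.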

Kernel estimation
For a metric space S, the Dirac measure $$\delta_{X_i}$$ can be replaced by an arbitrary measure $$\mu_{X_i}$$ centered at $$X_i$$:
 * $$P_{n,\mu}(A)=\frac{1}{n}\sum_{i=1}^n \mu_{X_i}(A)$$

This corresponds to the following map of measurable functions
 * $$f\mapsto P_{n,\mu}f=\frac{1}{n}\sum_{i=1}^n\int_S f\,d\mu_{X_i}$$

For $$S=\mathbb{R}^d$$ (the dimension is written d to avoid a clash with the sample size n) the measures are usually chosen to have a density with respect to Lebesgue measure, that is,
 * $$f\mapsto P_{n,K_h}f=\frac{1}{n}\sum_{i=1}^n\int_{\mathbb{R}^d} K_h(x-X_i) f(x)\,dx$$

where $$K_h$$ is a kernel with bandwidth h. See kernel density estimation for more details.
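A minimal Python sketch of the kernel-smoothed measure in the one-dimensional case $$S=\mathbb{R}$$ follows. It assumes a Gaussian kernel, $$K_h(u)=\frac{1}{h}\varphi(u/h)$$ with $$\varphi$$ the standard normal density; this is one common choice and is not prescribed by the text. The function names, sample values and bandwidths are hypothetical. Taking f to be the indicator of $$(-\infty,a]$$ in the formula above gives $$P_{n,K_h}((-\infty,a])=\frac{1}{n}\sum_{i=1}^n\Phi((a-X_i)/h)$$, where $$\Phi$$ is the standard normal distribution function.

```python
import math


def gaussian_kernel(u, h):
    """K_h(u) = (1/h) * phi(u/h), with phi the standard normal density."""
    return math.exp(-0.5 * (u / h) ** 2) / (h * math.sqrt(2.0 * math.pi))


def smoothed_density(sample, x, h):
    """Density of P_{n,K_h} at x: (1/n) * sum_i K_h(x - X_i)."""
    return sum(gaussian_kernel(x - xi, h) for xi in sample) / len(sample)


def smoothed_measure_halfline(sample, a, h):
    """P_{n,K_h}((-inf, a]) = (1/n) * sum_i Phi((a - X_i) / h)."""

    def std_normal_cdf(z):
        return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

    return sum(std_normal_cdf((a - xi) / h) for xi in sample) / len(sample)


sample = [0.2, 1.5, -0.3, 0.9, 0.4]  # hypothetical observations

print(smoothed_density(sample, 0.5, h=0.5))
print(smoothed_measure_halfline(sample, 0.5, h=0.5))

# With a very small bandwidth the smoothed measure of (-inf, 0.5] is close to
# the unsmoothed empirical measure, which is 3/5 for this sample.
print(smoothed_measure_halfline(sample, 0.5, h=1e-3))
```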