Sanov's theorem

In mathematics and information theory, Sanov's theorem gives a bound on the probability of observing an atypical sequence of samples from a given probability distribution. In the language of large deviations theory, Sanov's theorem identifies the rate function for large deviations of the empirical measure of a sequence of i.i.d. random variables.

Let A be a set of probability distributions over an alphabet X, and let q be an arbitrary distribution over X (where q may or may not be in A). Suppose we draw n i.i.d. samples from q, represented by the vector $$x^n = x_1, x_2, \ldots, x_n$$. Then, we have the following bound on the probability that the empirical measure $$\hat{p}_{x^n}$$ of the samples falls within the set A:


 * $$q^n(\hat{p}_{x^n}\in A) \le (n+1)^{|X|} 2^{-nD_{\mathrm{KL}}(p^*||q)}$$,

where
 * $$q^n$$ is the joint probability distribution on $$X^n$$, and
 * $$p^*$$ is the information projection of q onto A.

In words, the probability of drawing an atypical distribution is bounded by a function of the KL divergence from the true distribution to the atypical one; in the case that we consider a set of possible atypical distributions, there is a dominant atypical distribution, given by the information projection.

Furthermore, if A is a closed set, then


 * $$\lim_{n\to\infty}\frac{1}{n}\log q^n(\hat{p}_{x^n}\in A) = -D_{\mathrm{KL}}(p^*||q).$$

Technical statement
Define:

$$\{(a_1/n, \dots, a_{|\Sigma|}/n): \sum_i a_i = n, a_i \in \N\}$$Then, Sanov's theorem states:
 * $\Sigma$ is a finite set with size $\geq 2$ . Understood as “alphabet”.
 * $\Delta(\Sigma)$ is the simplex spanned by the alphabet. It is a subset of $\R^\Sigma$.
 * $L_n$ is a random variable taking values in $\Delta(\Sigma)$ . Take $n$  samples from the distribution $\mu$, then $L_n$  is the frequency probability vector for the sample.
 * $\mathcal L_n$ is the space of values that $L_n$  can take. In other words, it is

Here, $$int(S)$$ means the interior, and $$cl(S)$$ means the closure.
 * For every measurable subset $S \in \Delta(\Sigma)$ ,$$-\inf_{\nu \in int(S)} D(\nu\|\mu) \leq \liminf_n \frac 1n \ln P_\mu(L_n \in S) \leq \limsup_n \frac 1n \ln P_\mu(L_n \in S) \leq -\inf_{\nu \in cl(S)} D(\nu \| \mu)$$
 * For every open subset $U \in \Delta(\Sigma)$ ,$$-\lim_n \lim_{\nu \in U \cap \mathcal L_n} D(\nu\|\mu) = \lim_n \frac 1n \ln P_\mu(L_n \in S) = -\inf_{\nu \in U} D(\nu\|\mu)  $$