Kneser–Ney smoothing

Kneser–Ney smoothing, also known as Kneser–Essen–Ney smoothing, is a method primarily used to calculate the probability distribution of n-grams in a document based on their histories. It is widely considered the most effective method of smoothing due to its use of absolute discounting: a fixed value is subtracted from the count of each observed n-gram, and the probability mass freed in this way is redistributed to lower-order distributions. The approach has been considered equally effective for both higher- and lower-order n-grams. The method was proposed in a 1994 paper by Reinhard Kneser, Ute Essen and Hermann Ney.

A common example that illustrates the concept behind this method is the frequency of the bigram "San Francisco". If it appears several times in a training corpus, the frequency of the unigram "Francisco" will also be high. Relying only on the unigram frequency to predict the frequencies of n-grams leads to skewed results; Kneser–Ney smoothing corrects this by considering how many distinct words can precede the unigram. Because "Francisco" rarely follows anything other than "San", it receives a low continuation probability despite its high raw count.
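The distinction can be made concrete with a short sketch. The toy corpus and variable names below are purely illustrative; the point is only that the raw count of "Francisco" is high while its number of distinct left contexts is one.

```python
from collections import Counter

# Hypothetical toy corpus in which "san francisco" occurs often.
corpus = ("san francisco is foggy . san francisco is hilly . "
          "new york is big . san jose is near").split()
bigram_types = set(zip(corpus, corpus[1:]))

word = "francisco"
raw_count = corpus.count(word)                                    # raw unigram frequency: 2
left_contexts = {w1 for (w1, w2) in bigram_types if w2 == word}   # distinct preceding words

print(raw_count, len(left_contexts))  # high raw count, but only "san" ever precedes it
```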

Method
Let $$c(w,w')$$ be the number of occurrences of the word $$w$$ followed by the word $$w'$$ in the corpus.

The equation for bigram probabilities is as follows:

$$ p_{KN}(w_i|w_{i-1}) = \frac{\max(c(w_{i-1},w_{i}) - \delta,0)}{\sum_{w'} c(w_{i-1},w')} + \lambda_{w_{i-1}} p_{KN}(w_i) $$

where the unigram probability $$p_{KN}(w_i)$$ measures how likely the word $$w_i$$ is to appear in an unfamiliar context. It is estimated as the number of distinct words after which $$w_i$$ appears, divided by the number of distinct pairs of consecutive words in the corpus:

$$ p_{KN}(w_i) = \frac { |\{ w' : 0 < c(w',w_i)    \} | } { |\{ (w',w) : 0 < c(w',w) \} | } $$

Note that $$p_{KN}$$ is a proper distribution, as the values defined in the above way are non-negative and sum to one.
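As a sketch, this continuation probability can be computed directly from the observed bigram types; the function name and toy counts below are illustrative:

```python
from collections import Counter

def continuation_prob(bigram_counts):
    """p_KN(w): number of distinct words preceding w, divided by the
    total number of distinct bigram types in the corpus."""
    total_types = len(bigram_counts)                            # |{(w', w) : c(w', w) > 0}|
    preceder_types = Counter(w2 for (w1, w2) in bigram_counts)  # per word w: |{w' : c(w', w) > 0}|
    return {w: n / total_types for w, n in preceder_types.items()}

# Toy counts: the resulting values are non-negative and sum to one.
counts = Counter({("san", "francisco"): 5, ("new", "york"): 2, ("in", "york"): 1})
p = continuation_prob(counts)
assert abs(sum(p.values()) - 1.0) < 1e-12
```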

The parameter $$\delta$$ is a constant which denotes the discount value subtracted from the count of each n-gram, usually between 0 and 1.

The value of the normalizing constant $$\lambda_{w_{i-1}}$$ is calculated to make the sum of conditional probabilities $$p_{KN}(w_i|w_{i-1})$$ over all $$w_i$$ equal to one. Observe that (provided $$\delta < 1 $$) for each $$w_i$$ which occurs at least once in the context of $$w_{i-1}$$ in the corpus we discount the probability by exactly the same constant amount $${\delta} / \left(\sum_{w'} c(w_{i-1},w')\right)$$, so the total discount depends linearly on the number of unique words $$w_i$$ that can occur after $$w_{i-1}$$. This total discount is a budget we can spread over all $$p_{KN}(w_i|w_{i-1})$$ proportionally to $$p_{KN}(w_i)$$. As the values of $$p_{KN}(w_i)$$ sum to one, we can simply define $$\lambda_{w_{i-1}}$$ to be equal to this total discount:

$$ \lambda_{w_{i-1}} = \frac {\delta} {\sum_{w'} c(w_{i-1},w')} |\{ w' : 0 < c(w_{i-1},w') \} | $$
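Putting the pieces together, the bigram formula can be sketched as follows, assuming the corpus has been reduced to a Counter of bigram counts (function and variable names are illustrative):

```python
from collections import Counter

def kn_bigram_prob(w_prev, w, bigram_counts, delta=0.75):
    """Interpolated Kneser-Ney estimate p_KN(w | w_prev); assumes w_prev
    occurs at least once as the first word of a bigram."""
    # Denominator sum_{w'} c(w_prev, w').
    prev_total = sum(c for (w1, _), c in bigram_counts.items() if w1 == w_prev)
    # Discounted higher-order term max(c(w_prev, w) - delta, 0) / prev_total.
    higher = max(bigram_counts.get((w_prev, w), 0) - delta, 0) / prev_total
    # lambda_{w_prev}: the total discount, spread over the lower-order terms.
    distinct_followers = len({w2 for (w1, w2) in bigram_counts if w1 == w_prev})
    lam = delta * distinct_followers / prev_total
    # Lower-order continuation probability p_KN(w).
    cont = len({w1 for (w1, w2) in bigram_counts if w2 == w}) / len(bigram_counts)
    return higher + lam * cont

# With this choice of lambda, the conditional distribution sums to one over the
# words observed as bigram continuations.
counts = Counter({("san", "francisco"): 3, ("san", "jose"): 1, ("new", "york"): 2})
vocab = {w2 for (_, w2) in counts}
assert abs(sum(kn_bigram_prob("san", w, counts) for w in vocab) - 1.0) < 1e-12
```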

This equation can be extended to n-grams. Let $$w_{i-n+1}^{i-1}$$ be the $$n-1$$ words before $$w_i$$:

$$ p_{KN}(w_i|w_{i-n+1}^{i-1}) = \frac{\max(c(w_{i-n+1}^{i-1},w_{i}) - \delta,0)}{\sum_{w'} c(w_{i-n+1}^{i-1},w')} + \delta \frac{| \{ w' : 0 < c(w_{i-n+1}^{i-1},w') \} |}{\sum_{w'} c(w_{i-n+1}^{i-1},w')} p_{KN}(w_i|w_{i-n+2}^{i-1}) $$

This model uses the concept of absolute-discounting interpolation, which incorporates information from both higher- and lower-order language models. The lower-order term supplies most of the probability when the count of the higher-order n-gram is zero; conversely, the higher-order term dominates when that count is large.
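The recursion can be sketched directly from the equation above. The sketch assumes counts are stored per order in a dictionary keyed by n-gram length, with tuples of words as keys, and it uses raw counts at every level except the unigram base case, exactly as the formula is written; the fallback for unseen contexts is a practical choice, not part of the equation, and all names are illustrative.

```python
def kn_prob(context, w, ngram_counts, delta=0.75):
    """Recursive Kneser-Ney estimate p_KN(w | context).
    `context` is a tuple of the preceding n-1 words; `ngram_counts[k]` maps
    length-k tuples to counts (k = 2 .. n)."""
    if not context:
        # Base case: unigram continuation probability from bigram types.
        bigrams = ngram_counts[2]
        return len({g for g in bigrams if g[-1] == w}) / len(bigrams)
    counts = ngram_counts[len(context) + 1]
    prev_total = sum(c for g, c in counts.items() if g[:-1] == context)
    if prev_total == 0:
        # Context never observed: fall back entirely to the shorter context.
        return kn_prob(context[1:], w, ngram_counts, delta)
    higher = max(counts.get(context + (w,), 0) - delta, 0) / prev_total
    distinct_followers = len({g[-1] for g in counts if g[:-1] == context})
    lam = delta * distinct_followers / prev_total
    return higher + lam * kn_prob(context[1:], w, ngram_counts, delta)
```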

Modified Kneser–Ney smoothing
Modifications of this method also exist. Chen and Goodman's 1998 paper lists and benchmarks several such modifications; the best known, modified Kneser–Ney smoothing, replaces the single discount $$\delta$$ with separate discounts for n-grams seen once, twice, and three or more times. The method was once used for Google Translate under a MapReduce implementation. KenLM is a performant open-source implementation.
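As a sketch of that modification, the three discounts are typically estimated from counts of counts; the formulas below follow Chen and Goodman's estimates and assume the corpus contains n-grams with counts one through four (function and variable names are illustrative):

```python
from collections import Counter

def modified_kn_discounts(ngram_counts):
    """Discounts D1, D2, D3+ for n-grams seen once, twice, and three or
    more times, estimated from counts of counts (Chen & Goodman, 1998)."""
    n = Counter(ngram_counts.values())   # n[k] = number of n-gram types with count k
    y = n[1] / (n[1] + 2 * n[2])
    d1 = 1 - 2 * y * n[2] / n[1]
    d2 = 2 - 3 * y * n[3] / n[2]
    d3_plus = 3 - 4 * y * n[4] / n[3]
    return d1, d2, d3_plus
```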