Bayes classifier

In statistical classification, the Bayes classifier is the classifier having the smallest probability of misclassification of all classifiers using the same set of features.

Definition
Suppose a pair $$(X,Y)$$ takes values in $$\mathbb{R}^d \times \{1,2,\dots,K\}$$, where $$Y$$ is the class label of an element whose features are given by $$X$$. Assume that the conditional distribution of X, given that the label Y takes the value r, is given by $$(X\mid Y=r) \sim P_r \quad \text{for} \quad r=1,2,\dots,K,$$ where "$$\sim$$" means "is distributed as", and where $$P_r$$ denotes a probability distribution.

A classifier is a rule that assigns to an observation X=x a guess or estimate of what the unobserved label Y=r actually was. In theoretical terms, a classifier is a measurable function $$C: \mathbb{R}^d \to \{1,2,\dots,K\}$$, with the interpretation that C classifies the point x to the class C(x). The probability of misclassification, or risk, of a classifier C is defined as $$\mathcal{R}(C) = \operatorname{P}\{C(X) \neq Y\}.$$
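
As a concrete illustration, the risk of any given classifier can be approximated by Monte Carlo simulation when the joint distribution of $$(X,Y)$$ is known. The sketch below assumes a two-class model with Gaussian class-conditional distributions; the prior, the means, and the threshold rule are illustrative choices, not part of the definition.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative model (an assumption for this sketch):
# P(Y=1) = 0.3, X | Y=0 ~ N(0,1), X | Y=1 ~ N(2,1).
prior1 = 0.3
n = 200_000
y = rng.random(n) < prior1                 # class labels in {0, 1}
x = rng.normal(loc=np.where(y, 2.0, 0.0))  # features drawn from P_r

# An arbitrary classifier C: assign class 1 whenever x > 1.5.
c = x > 1.5

# Monte Carlo estimate of the risk R(C) = P(C(X) != Y).
print("estimated risk:", np.mean(c != y))
```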

The Bayes classifier is $$C^\text{Bayes}(x) = \underset{r \in \{1,2,\dots, K\}}{\operatorname{argmax}} \operatorname{P}(Y=r \mid X=x).$$
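
In the same illustrative two-Gaussian model, the posterior $$\operatorname{P}(Y=r \mid X=x)$$ follows from Bayes' theorem, and the Bayes classifier simply takes the argmax over classes. A minimal sketch (the priors and means are again assumptions made for illustration):

```python
import numpy as np

def gauss_pdf(x, mu, sigma=1.0):
    """Density of N(mu, sigma^2) evaluated at x."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def bayes_classifier(x, priors, means):
    # P(Y=r | X=x) is proportional to P(Y=r) * p_r(x); the normalising
    # constant P(X=x) is shared by all classes and cannot change the argmax.
    scores = np.stack([p * gauss_pdf(x, m) for p, m in zip(priors, means)])
    return np.argmax(scores, axis=0)

x = np.linspace(-2.0, 4.0, 7)
print(bayes_classifier(x, priors=[0.7, 0.3], means=[0.0, 2.0]))
```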

In practice, as in most of statistics, the difficulties and subtleties are associated with modeling the probability distributions effectively—in this case, $$\operatorname{P}(Y=r \mid X=x)$$. The Bayes classifier is a useful benchmark in statistical classification.

The excess risk of a general classifier $$C$$ (possibly depending on some training data) is defined as $$\mathcal{R}(C) - \mathcal{R}(C^\text{Bayes}).$$ This non-negative quantity measures how far $$C$$ falls short of the optimum, and is therefore important for assessing the performance of different classification techniques. A classifier is said to be consistent if the excess risk converges to zero as the size of the training data set tends to infinity.
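
To see consistency at work, one can track the excess risk of a simple plug-in rule that estimates $$\eta(x) = \operatorname{P}(Y=1 \mid X=x)$$ from training data and classifies by thresholding the estimate at 0.5. The sketch below uses a crude histogram estimate of $$\eta$$ in the two-Gaussian model from the earlier sketches; it is meant only to show the excess risk shrinking as the training set grows, not to suggest a good estimator.

```python
import numpy as np

rng = np.random.default_rng(1)
prior1, bins = 0.3, np.linspace(-4, 6, 41)

def sample(n):
    y = rng.random(n) < prior1
    return rng.normal(np.where(y, 2.0, 0.0)), y

def plugin_classifier(x_train, y_train):
    # Histogram estimate of eta(x) = P(Y=1 | X=x) on each bin;
    # empty bins fall back to 0.5 (an arbitrary tie value).
    idx = np.digitize(x_train, bins)
    eta_hat = np.array([y_train[idx == b].mean() if np.any(idx == b) else 0.5
                        for b in range(len(bins) + 1)])
    return lambda x: eta_hat[np.digitize(x, bins)] >= 0.5

x_test, y_test = sample(500_000)
# Exact Bayes rule for this model: classify 1 iff x > 1 + log(0.7/0.3)/2.
bayes_risk = np.mean((x_test > 1.0 + np.log(0.7 / 0.3) / 2) != y_test)
for n in (100, 1_000, 10_000, 100_000):
    h = plugin_classifier(*sample(n))
    print(n, "excess risk ~", np.mean(h(x_test) != y_test) - bayes_risk)
```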

Considering the components $$x_i$$ of $$x$$ to be mutually independent given the class, we get the naive Bayes classifier, where $$C^\text{Bayes}(x) = \underset{r \in \{1,2,\dots, K\}}{\operatorname{argmax}} \operatorname{P}(Y=r)\prod_{i=1}^{d}P_r(x_i),$$ with $$P_r(x_i)$$ denoting the class-conditional distribution of the $$i$$-th component.
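
A minimal Gaussian naive Bayes sketch along these lines follows: each factor $$P_r(x_i)$$ is modelled, as an illustrative assumption, by a univariate Gaussian whose mean and variance are estimated per class from training data, and classification takes the argmax of the log-posterior.

```python
import numpy as np

def fit_gaussian_nb(X, y):
    # Per-class priors, feature means and variances under the
    # conditional-independence (naive) assumption.
    classes = np.unique(y)
    priors = np.array([np.mean(y == c) for c in classes])
    means = np.array([X[y == c].mean(axis=0) for c in classes])
    variances = np.array([X[y == c].var(axis=0) for c in classes])
    return classes, priors, means, variances

def predict_gaussian_nb(model, X):
    classes, priors, means, variances = model
    # log P(Y=r) + sum_i log p_r(x_i), maximised over classes r.
    log_post = (np.log(priors)
                - 0.5 * np.sum(np.log(2 * np.pi * variances)
                               + (X[:, None, :] - means) ** 2 / variances, axis=2))
    return classes[np.argmax(log_post, axis=1)]
```

Working in log space here avoids numerical underflow when the product over many features becomes very small.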

Properties
The proof that the Bayes classifier is optimal, and hence that the Bayes error rate is the minimum achievable, proceeds as follows.

Define the variables: risk $$R(h)$$, Bayes risk $$R^*$$, and the set of possible classes $$\{0,1\}$$ (the binary case). Let the posterior probability of a point belonging to class 1 be $$\eta(x)=\operatorname{P}(Y=1\mid X=x)$$. Define the classifier $$h^*$$ as $$h^*(x)=\begin{cases}1&\text{if }\eta(x)\geqslant 0.5,\\ 0&\text{otherwise.}\end{cases}$$

Then we have the following results:

(a) $$h^*$$ is a Bayes classifier, that is, $$R(h^*)=R^*$$;
(b) for any classifier $$h$$, the excess risk satisfies $$R(h)-R^* = 2\mathbb{E}_X\left[\left|\eta(X)-0.5\right| \cdot \mathbb{I}_{\left\{h(X)\ne h^*(X)\right\}}\right]$$;
(c) $$R^* = \mathbb{E}_X\left[\min(\eta(X),1-\eta(X))\right]$$;
(d) $$R^* = \frac{1}{2} - \frac{1}{2}\mathbb{E}\left[\left|2\eta(X) - 1\right|\right]$$.

Proof of (a): For any classifier $$h$$, we have $$\begin{align} R(h) &= \mathbb{E}_{XY}\left[ \mathbb{I}_{ \left\{h(X)\ne Y \right\}} \right] \\ &=\mathbb{E}_X\mathbb{E}_{Y|X}[\mathbb{I}_{ \left\{h(X)\ne Y \right\}} ] \\ &= \mathbb{E}_X[\eta(X)\mathbb{I}_{ \left\{h(X)=0\right\}} +(1-\eta(X))\mathbb{I}_{ \left\{h(X)=1 \right\}} ], \end{align} $$ where the second line follows from the tower property of conditional expectation (Fubini's theorem), and the third from evaluating the inner conditional expectation given $$X$$.

Notice that $$R(h)$$ is minimised by choosing, for every $$x$$, $$h(x) = \begin{cases} 1&\text{if }\eta(x)\geqslant 1-\eta(x),\\ 0&\text{otherwise.} \end{cases}$$

Since this pointwise minimiser coincides with $$h^*$$, the minimum possible risk is the Bayes risk, $$R^*= R(h^*)$$.

Proof of (b): $$\begin{aligned} R(h)-R^* &= R(h)-R(h^*)\\ &= \mathbb{E}_X[\eta(X)\mathbb{I}_{\left\{h(X)=0\right\}}+(1-\eta(X))\mathbb{I}_{\left\{h(X)=1\right\}}-\eta(X)\mathbb{I}_{\left\{h^*(X)=0\right\}}-(1-\eta(X))\mathbb{I}_{\left\{h^*(X)=1\right\}}]\\ &=\mathbb{E}_X[|2\eta(X)-1|\mathbb{I}_{\left\{h(X)\ne h^*(X)\right\}}]\\ &= 2\mathbb{E}_X[|\eta(X)-0.5|\mathbb{I}_{\left\{h(X)\ne h^*(X)\right\}}] \end{aligned}$$ The third line follows by checking the two cases on the event $$\{h(X)\ne h^*(X)\}$$: if $$h^*(X)=1$$ then $$\eta(X)\geqslant 0.5$$ and the integrand equals $$2\eta(X)-1$$; if $$h^*(X)=0$$ then $$\eta(X)<0.5$$ and it equals $$1-2\eta(X)$$; in both cases it is $$|2\eta(X)-1|$$.

Proof of (c): $$\begin{aligned} R(h^*) &= \mathbb{E}_X[\eta(X)\mathbb{I}_{\left\{h^*(X)=0\right\}}+(1-\eta(X))\mathbb{I}_{\left\{h^*(X)=1\right\}}]\\ &= \mathbb{E}_X[\min(\eta(X),1-\eta(X))] \end{aligned}$$

Proof of (d): $$ \begin{aligned} R(h^*) &= \mathbb{E}_X[\min(\eta(X),1-\eta(X))] \\ &= \frac{1}{2} - \mathbb{E}_X[\max(\eta(X) - 1/2,1/2-\eta(X))]\\ &=\frac{1}{2} - \frac{1}{2} \mathbb E [|2\eta(X) - 1|] \end{aligned}$$
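
Identities (c) and (d) lend themselves to a quick numerical sanity check: in any model where $$\eta$$ is available in closed form, both expressions should agree with a direct Monte Carlo estimate of $$R(h^*)$$. A sketch in the illustrative two-Gaussian model used earlier:

```python
import numpy as np

rng = np.random.default_rng(2)
n, prior1 = 1_000_000, 0.3

y = rng.random(n) < prior1
x = rng.normal(np.where(y, 2.0, 0.0))

phi = lambda x, mu: np.exp(-0.5 * (x - mu) ** 2)  # unnormalised N(mu, 1) density
# eta(x) = P(Y=1 | X=x) by Bayes' theorem; normalising constants cancel.
eta = prior1 * phi(x, 2.0) / (prior1 * phi(x, 2.0) + (1 - prior1) * phi(x, 0.0))

h_star = eta >= 0.5
print("direct R(h*):", np.mean(h_star != y))             # definition of risk
print("identity (c):", np.mean(np.minimum(eta, 1 - eta)))
print("identity (d):", 0.5 - 0.5 * np.mean(np.abs(2 * eta - 1)))
```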

General case
In the general case, where each element can belong to any of $$n$$ categories, the proof that the Bayes classifier minimises the classification error again proceeds by the tower property of expectation, as follows. $$\begin{align} \mathbb{E}\left(\mathbb{I}_{\{Y\ne \hat{y}(X)\}}\right) &= \mathbb{E}_X\,\mathbb{E}_{Y|X}\left(\mathbb{I}_{\{Y\ne \hat{y}(X)\}}\mid X=x\right)\\ &= \mathbb{E}_X \left[\Pr(Y=1\mid X=x)\mathbb{I}_{\{\hat{y}(x)\ne 1\}} + \Pr(Y=2\mid X=x)\mathbb{I}_{\{\hat{y}(x)\ne 2\}} + \dots + \Pr(Y=n\mid X=x) \mathbb{I}_{\{\hat{y}(x)\ne n\}}\right] \end{align}$$

This is minimised by minimising every term of the inner expectation simultaneously, which is achieved by the classifier $$h(x) = \underset{k\in\{1,2,\dots,n\}}{\operatorname{argmax}} \Pr(Y=k\mid X=x)$$ for each observation $$x$$: assigning $$x$$ to the class with the largest posterior probability leaves the smallest possible posterior mass, $$1-\Pr(Y=h(x)\mid X=x)$$, on the competing classes.
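
A short sketch of this multi-class rule, assuming (for illustration) known priors and unit-variance Gaussian class-conditional densities for $$n=3$$ classes:

```python
import numpy as np

def multiclass_bayes(x, priors, means, sigma=1.0):
    # h(x) = argmax_k P(Y=k | X=x); the shared normalising constant
    # P(X=x) is dropped since it does not affect the argmax.
    x = np.atleast_1d(x)
    log_scores = (np.log(priors)[None, :]
                  - 0.5 * ((x[:, None] - np.asarray(means)[None, :]) / sigma) ** 2)
    return np.argmax(log_scores, axis=1) + 1   # classes labelled 1..n

print(multiclass_bayes([-1.0, 1.2, 3.5],
                       priors=[0.5, 0.3, 0.2],
                       means=[0.0, 2.0, 4.0]))
```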