
= Uniform Stability and Generalization in Learning Theory =

Statistical learning theory considers the design of algorithms which produce estimators from example data. Although the error of the resulting estimator can only be evaluated on sampled data, it is desirable that the estimator perform well in expectation over the true data distribution. This is achieved if the algorithm generalizes, meaning that the empirical error of the resulting estimator on a sample set is close to its expected error.

The idea of stability can be used to show whether an algorithm generalizes. The study of stability in learning theory is concerned with how the output estimator of a given algorithm changes under changes to the input data. Intuitively, if the algorithm is stable, and thus produces estimators which do not differ much when given different sampled sets as input, then the empirical errors observed on different sampled sets are similar. In such cases, the empirical error observed on any sampled set and the expected error should also be similar.

Of particular interest is the notion of uniform stability, as strong bounds can be placed on the difference between the expected error and the empirical error of a uniformly stable algorithm. In other words, if the algorithm is resistant to small perturbations of the sampled set, in the sense of uniform stability, then the performance of the resulting estimator on the sampled set is a good approximation of its performance in expectation.

= Preliminary Definitions =

This article uses notation as defined in the related Wikipedia article on stability in learning theory.

Modified Training Sets
Given a training set $$S = \{z_1,\ \ldots,\ z_n\}$$ of $$n$$ examples drawn independently from a distribution over $$Z$$, of particular importance is the modified training set $$S^i$$ obtained from $$S$$ by replacing the i-th element with a new example $$z_i'$$: $$S^i = \{z_1 ,\ldots,\ z_{i-1},\ z_i',\ z_{i+1},\ldots,\ z_n\}$$ When the i-th element is replaced by a specific example $$z$$, we write $$S^{(i,z)}$$.

Empirical and Expected Error

We further require notions of empirical error and expected error, given some loss function $$V$$. The empirical error of an estimator $$f$$ given a sampled set of data $$S$$ is:

$$I_S[f] = \frac{1}{n}\sum_{i=1}^{n} V(f,z_i)$$

The expected error of $$f$$ is:

$$I[f] = \mathbb{E}_z V(f,z)$$
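
As a concrete illustration (not part of the formal development), the sketch below compares the empirical error of a fixed estimator on a small sample with a Monte Carlo estimate of its expected error. The squared loss, the linear estimator, and the synthetic Gaussian data distribution are arbitrary choices made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def V(f, z):
    """Squared loss V(f, z) for an example z = (x, y); an illustrative choice."""
    x, y = z
    return (f(x) - y) ** 2

def sample(n):
    """Draw n examples z = (x, y) with y = 2x + Gaussian noise (illustrative distribution)."""
    x = rng.normal(size=n)
    y = 2.0 * x + 0.1 * rng.normal(size=n)
    return list(zip(x, y))

f = lambda x: 1.8 * x                                # a fixed estimator, chosen arbitrarily

S = sample(20)                                       # a small sampled set
I_S = np.mean([V(f, z) for z in S])                  # empirical error I_S[f]
I_hat = np.mean([V(f, z) for z in sample(100_000)])  # Monte Carlo estimate of I[f]

print(f"empirical error I_S[f] = {I_S:.4f}, expected error I[f] (estimated) = {I_hat:.4f}")
```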

Uniform Stability
In addition, an algorithm $$L$$ has uniform stability $$\beta$$ with respect to the loss function $$V$$ if:

$$\forall S\in Z^n,\ \forall i\in\{1,...,n\},\ \forall z_i'\in Z,\quad \sup_{z\in Z}|V(f_S,z)-V(f_{S^i},z)|\leq\beta$$

An algorithm is thus uniformly stable if, given two training sets which differ in only one element, the difference in loss between the two resulting estimators, evaluated at any point $$z$$, is upper bounded by the constant $$\beta$$.
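
The definition can also be probed numerically. The sketch below uses one-dimensional ridge regression (an arbitrary choice of learner) and measures the loss difference between the estimators trained on $$S$$ and on a perturbed set $$S^i$$ over a finite set of evaluation points. This yields only an empirical lower estimate of $$\beta$$: uniform stability requires a supremum over all training sets, indices, and evaluation points, which a simulation cannot certify.

```python
import numpy as np

rng = np.random.default_rng(1)
lam = 1.0  # regularization strength (illustrative value)

def ridge_fit(S):
    """1-D ridge regression minimizing (1/n) sum (w*x - y)^2 + lam*w^2,
    whose solution is w = sum(x*y) / (sum(x^2) + lam*n)."""
    x, y = S[:, 0], S[:, 1]
    return np.sum(x * y) / (np.sum(x ** 2) + lam * len(S))

def loss(w, z):
    x, y = z
    return (w * x - y) ** 2

n = 50
S = np.column_stack([rng.normal(size=n), rng.normal(size=n)])  # training set of pairs (x, y)

# S^i: replace the i-th example with a fresh draw.
i = 0
S_i = S.copy()
S_i[i] = rng.normal(size=2)

w_S, w_Si = ridge_fit(S), ridge_fit(S_i)

# The sup over z is approximated by a finite set of evaluation points.
eval_points = rng.normal(size=(500, 2))
beta_hat = max(abs(loss(w_S, z) - loss(w_Si, z)) for z in eval_points)
print(f"observed loss difference (empirical lower estimate of beta): {beta_hat:.2e}")
```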

As the estimators are constructed from sampled sets of data, we also state the probabilistic version of uniform stability $$\beta$$:

$$\forall i\in\{1,...,n\},\ \forall z_i'\in Z,\quad \mathbb{P}_S\left\{\sup_{z\in Z}|V(f_S,z)-V(f_{S^i},z)|\leq\beta\right\}\geq1-\delta$$

The probabilistic version of uniform stability requires that, given two training sets which differ in only one element, the difference in loss between the two resulting estimators is upper bounded by the constant $$\beta$$ with probability at least $$1-\delta$$ over the draw of the training set.

= Generalization Bound Given Uniform Stability =

In this section we state the generalization bound for a uniformly stable algorithm and remark on its implications.

Statement
If a learning algorithm $$L$$ has uniform stability $$\beta$$ with respect to a loss function bounded above by $$M$$, then, with probability at least $$1 - \delta$$ over the draw of a training set $$S$$ of size $$n$$, the expected error is bounded in terms of the empirical error as follows:

$$I[f_S]\leq I_S[f_S]+ \beta + (2n\beta+M)\sqrt{\frac{\ln(\frac{2}{\delta})}{2n}} $$
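
The bound can be evaluated numerically. The helper below (an illustrative function, not part of the statement above) computes the right-hand side from an observed empirical error, a stability constant $$\beta$$, a loss bound $$M$$, a sample size $$n$$, and a confidence parameter $$\delta$$.

```python
import math

def generalization_bound(empirical_error, beta, M, n, delta):
    """Upper bound on the expected error I[f_S], valid with probability at least 1 - delta."""
    return empirical_error + beta + (2 * n * beta + M) * math.sqrt(math.log(2 / delta) / (2 * n))

# Arbitrary illustrative numbers: n samples, stability beta = 1/n, loss bounded by M = 1.
n = 10_000
print(generalization_bound(empirical_error=0.10, beta=1.0 / n, M=1.0, n=n, delta=0.05))
```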

Remarks
The bound is instructive: it allows us to analyze whether the empirical error of an algorithm converges to the expected error as more samples are taken. Specifically, if one has a uniformly stable learning algorithm whose stability constant for training sets of size $$n$$ scales as $$ \beta_n = O \left( \frac{1}{ n } \right) $$, then the upper bound becomes:

$$ I[f_S] - I_S[f_S] \leq O\left( \frac{ 1 }{ \sqrt{n} } \right)$$

which converges to zero as we increase the number of samples $$n$$. The empirical risk is a good proxy for the generalization error with large sample sets.
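
Concretely, substituting $$\beta_n = \frac{c}{n}$$ for some constant $$c$$ into the bound of the previous subsection makes the rate explicit:

$$ I[f_S] - I_S[f_S] \leq \frac{c}{n} + \left(2c + M\right)\sqrt{\frac{\ln(\frac{2}{\delta})}{2n}} = O\left(\frac{1}{\sqrt{n}}\right) $$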

In such cases, a predictor that minimizes the empirical risk will also have low expected error, given large training sets. Therefore, for a uniformly stable learning algorithm, minimizing the empirical risk is a good procedure for minimizing expected error (since the two are approximately equal for sufficiently large n).

One weakness of this result is the requirement of a bounded loss function. For example, the commonly used squared loss is unbounded over the real line. In many applications, however, the loss is naturally bounded: the 0-1 loss used in classification takes values in $$[0,1]$$. In other cases, a bounded loss can be obtained by restricting the set of possible output estimators (and the data) to an appropriately bounded range.
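
As a small illustration of one way to meet the boundedness requirement, the squared loss can be truncated at a chosen level $$M$$ (the cut-off below is an arbitrary example):

```python
def truncated_squared_loss(prediction, y, M=1.0):
    """Squared loss clipped at M, so that the loss takes values in [0, M]."""
    return min((prediction - y) ** 2, M)

print(truncated_squared_loss(0.2, 0.5))  # 0.09, below the cut-off
print(truncated_squared_loss(5.0, 0.0))  # 1.0, clipped at M
```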

= Examples of Uniformly Stable Algorithms =

Tikhonov Regularization
Recall that Tikhonov regularization produces the estimator:

$$ f^{\lambda}_{S} = \arg \min_{f \in \mathcal{H}} \left(      \frac{1}{n} \sum^n_{i=1} V(f(x_i), y_i) + \lambda \|f\|^2_{K}     \right) $$ where $$\|\cdot\|_{K}$$ denotes the norm in the hypothesis space $$\mathcal{H}$$.
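
For the special case of the square loss, the minimizer above has a well-known closed form (kernel ridge regression): by the representer theorem, $$f^{\lambda}_{S} = \sum_j c_j K(x_j, \cdot)$$ with $$c = (K + \lambda n I)^{-1}y$$, where $$K$$ is the kernel (Gram) matrix on the training inputs. The sketch below implements this; the Gaussian kernel, its bandwidth, and the synthetic data are illustrative choices rather than part of the discussion here.

```python
import numpy as np

def gaussian_kernel(A, B, gamma=1.0):
    """Gram matrix K[i, j] = exp(-gamma * ||A[i] - B[j]||^2); an illustrative kernel."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def tikhonov_fit(X, y, lam):
    """Kernel ridge regression: coefficients c = (K + lam * n * I)^{-1} y."""
    n = len(X)
    K = gaussian_kernel(X, X)
    c = np.linalg.solve(K + lam * n * np.eye(n), y)
    return lambda X_new: gaussian_kernel(X_new, X) @ c

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(100, 1))                 # synthetic inputs
y = np.sin(3 * X[:, 0]) + 0.1 * rng.normal(size=100)  # synthetic targets

f = tikhonov_fit(X, y, lam=1e-2)
print(f(np.array([[0.0], [0.5]])))                    # predictions at two new points
```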

It can be shown that Tikhonov regularization is uniformly stable; by the bound of the previous section, it therefore generalizes.

To prove that Tikhonov regularization is uniformly stable, we need the following three conditions, which are combined below into a bound on the stability constant $$\beta$$:

1) we assume that the loss is Lipschitz continuous in its first argument: $$ |V(f_1(x), y) - V(f_2(x), y)| \leq L\|f_1 - f_2 \|_{\infty} $$ for all $$f_1, f_2 \in \mathcal{H}$$ and all $$(x, y)$$;

2) we require the hypothesis space $$\mathcal{H}$$ to be a reproducing kernel Hilbert space (RKHS) with a bounded kernel, so that $$ \|f - f'\|_{\infty} \leq \kappa \|f - f'\|_{K} $$ for any $$f,f' \in \mathcal{H} $$;

3) finally, we need the following lemma to hold: $$ \|f^{\lambda}_{S} - f^{\lambda}_{S^{i,z}} \|^2_{K} \leq \frac{L\|f^{\lambda}_S - f^{\lambda}_{S^{i,z}} \|_{\infty} }{\lambda n} $$
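
Writing $$\Delta = f^{\lambda}_{S} - f^{\lambda}_{S^{i,z}}$$, the lemma (3) together with condition (2) bounds the distance between the two solutions in the RKHS norm, and conditions (1) and (2) then turn this into a bound on the loss difference at any point:

$$ \|\Delta\|^2_{K} \leq \frac{L\|\Delta\|_{\infty}}{\lambda n} \leq \frac{L\kappa\|\Delta\|_{K}}{\lambda n} \;\Longrightarrow\; \|\Delta\|_{K} \leq \frac{L\kappa}{\lambda n} \;\Longrightarrow\; \|\Delta\|_{\infty} \leq \kappa\|\Delta\|_{K} \leq \frac{L\kappa^2}{\lambda n} $$

$$ \sup_{z\in Z}\left|V(f^{\lambda}_{S},z) - V(f^{\lambda}_{S^{i,z}},z)\right| \leq L\|\Delta\|_{\infty} \leq \frac{L^2\kappa^2}{\lambda n} $$

so Tikhonov regularization has uniform stability $$\beta = \frac{L^2\kappa^2}{\lambda n}$$.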

Thus, if the loss function is additionally bounded above by $$M$$, substituting this value of $$\beta$$ into the generalization bound of the previous section gives:

$$ |I[f^{\lambda}_{S}] - I_S[f^{\lambda}_S]| \leq \frac{L^2\kappa^2}{\lambda n} + \left(\frac{2L^2\kappa^2}{\lambda} +M\right) \sqrt{\frac{\ln(\frac{2}{\delta} ) }{2n} } $$

Therefore, with confidence $$ 1 - \delta $$, Tikhonov regularization generalizes as $$n$$ goes to infinity.

Remarks on the Bound
Notice that if $$\lambda$$ is kept fixed as $$n$$ increases, the bound tightens at the rate $$ O \left( \frac{1}{ \sqrt{n} } \right) $$. However, fixing $$\lambda$$ effectively keeps the hypothesis space fixed; as we get more data, we typically want $$\lambda$$ to become smaller. If $$\lambda$$ shrinks too quickly (as fast as $$\frac{1}{\sqrt{n}}$$ or faster), the dominant term of the bound no longer vanishes and the bound can become vacuous.
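
This trade-off can be seen by evaluating the deviation term of the Tikhonov bound under different schedules for $$\lambda$$. The script below uses arbitrary illustrative constants ($$L = \kappa = M = 1$$, $$\delta = 0.05$$); the schedule $$\lambda \propto \frac{1}{n}$$ makes the bound grow with $$n$$ rather than shrink.

```python
import numpy as np

L, kappa, M, delta = 1.0, 1.0, 1.0, 0.05   # illustrative constants

def tikhonov_gap_bound(n, lam):
    """Deviation term of the generalization bound for Tikhonov regularization."""
    beta = L ** 2 * kappa ** 2 / (lam * n)  # uniform stability of Tikhonov regularization
    return beta + (2 * n * beta + M) * np.sqrt(np.log(2 / delta) / (2 * n))

for n in [10**2, 10**4, 10**6]:
    print(n,
          round(tikhonov_gap_bound(n, lam=0.1), 4),         # fixed lambda: shrinks as O(1/sqrt(n))
          round(tikhonov_gap_bound(n, lam=n ** -0.25), 4),  # lambda ~ n^(-1/4): still shrinks
          round(tikhonov_gap_bound(n, lam=10.0 / n), 4))    # lambda ~ 1/n: bound blows up
```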

= Proof of Generalization Bound =

We provide a proof of the generalization bound in this section. We begin by bounding the expectation and the deviations of the difference between the expected error and the empirical error of estimators trained on sampled data. We then apply McDiarmid's inequality to arrive at the result. For convenience in the proofs, we define $$D[f_S] = I[f_S] - I_S[f_S]$$.

Bounding the expected difference
By definition, we have

$$ \begin{align} \mathbb{E}_S\left[I[f_S] - I_S[f_S]\right] & =  \mathbb{E}_{(S,z)}\left[V(f_S,z) - \frac{1}{n}\sum_{i=1}^n V(f_S,z_i)\right] \\ &= \mathbb{E}_{(S,z)}\left[\frac{1}{n}\sum_{i=1}^n \left(V(f_S,z) - V(f_{S^{(i,z)}},z)\right)\right] \\ &\leq \beta \end{align} $$

The second equality above is a renaming argument: each $$z_i$$ is sampled from the same distribution as $$z$$, so exchanging their roles leaves the expectation unchanged (made explicit below). The inequality follows from the uniform stability of the algorithm, since each term compares the losses of $$f_S$$ and $$f_{S^{(i,z)}}$$ at the same point $$z$$.
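
Spelled out, the renaming step uses the fact that, for each $$i$$, swapping the roles of the independent, identically distributed examples $$z_i$$ and $$z$$ does not change the joint distribution of the data, which gives:

$$ \mathbb{E}_{(S,z)}\left[V(f_S, z_i)\right] = \mathbb{E}_{(S,z)}\left[V(f_{S^{(i,z)}}, z)\right] $$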

Bounding the deviation of the difference
Consider an upper bound for $$|D[f_S] - D[f_{S^{(i,z)}}]|$$ as follows: $$ \begin{align} \left|D[f_S] - D[f_{S^{(i,z)}}]\right| & = \left|I[f_S] - I_S[f_S] - I[f_{S^{(i,z)}}] + I_{S^{(i,z)}}[f_{S^{(i,z)}}]\right|\\ & \leq |I[f_S] - I[f_{S^{(i,z)}}]| + |I_S[f_S]  - I_{S^{(i,z)}}[f_{S^{(i,z)}}]| &\text{ (by the triangle inequality)}\\ & \leq \beta + \frac{1}{n}|V(f_S,z_i) - V(f_{S^{(i,z)}},z)| + \frac{1}{n}\sum_{j\neq i}|V(f_S,z_j) - V(f_{S^{(i,z)}},z_j)| \end{align} $$

The last inequality makes use of $$|I[f_S] - I[f_{S^{(i,z)}}]| \leq \beta $$, which follows from uniform stability, and of the fact that $$I_S[f_S] $$ and $$I_{S^{(i,z)}}[f_{S^{(i,z)}}]$$ differ only in the term at index i. Applying the bound $$M$$ on the loss to the middle term, and uniform stability to the last term, we have:

$$ \begin{align} \left|D[f_S] - D[f_{S^{i,z}}]\right| & \leq \beta + \frac{M}{n} + \beta\\ & = 2\beta + \frac{M}{n} \end{align} $$

Application of McDiarmid's Inequality
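
McDiarmid's (bounded differences) inequality states that if $$g$$ is a function of $$n$$ independent random variables which changes by at most $$c_i$$ when only its i-th argument is changed, then for every $$\epsilon>0$$:

$$ \mathbb{P}\left(\left|g(z_1,\ldots,z_n) - \mathbb{E}\left[g(z_1,\ldots,z_n)\right]\right| > \epsilon\right) \leq 2\exp\left(-\frac{2\epsilon^2}{\sum_{i=1}^n c_i^2}\right) $$
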
We now apply McDiarmid's inequality to $$D[f_S]$$, viewed as a function of the $$n$$ independent samples $$z_1,...,z_n$$, to find the stated inequality. Applying the bound $$2\beta+\frac{M}{n}$$ on its deviations, for any $$\epsilon>0$$: $$ \begin{align} \mathbb{P}\left(\left|D[f_S] - \mathbb{E}_S\left[D[f_S]\right]\right|>\epsilon\right) & \leq 2\exp\left(-\frac{2\epsilon^2}{\sum_{i=1}^{n}(2\beta+\frac{M}{n})^2}\right)\\ & = 2\exp\left(-\frac{2\epsilon^2}{n(2\beta+\frac{M}{n})^2}\right)\\ & = 2\exp\left(-\frac{2n\epsilon^2}{(2n\beta+M)^2}\right) \end{align} $$

Recall that we want a $$1-\delta$$ confidence bound. Let:

$$\delta = 2\exp\left(-\frac{2n\epsilon^2}{(2n\beta+M)^2}\right)$$

Solving for $$\epsilon$$, we find that:

$$ \epsilon = (2n\beta+M)\sqrt{\frac{\ln(\frac{2}{\delta})}{2n}}$$

Thus, we find that, with $$1-\delta$$ confidence,

$$D[f_S]\leq \mathbb{E}_S\left[D[f_S]\right] + (2n\beta+M)\sqrt{\frac{\ln(\frac{2}{\delta})}{2n}} $$

Recalling that $$\mathbb{E}_S\left[D[f_S]\right]\leq\beta$$ and that $$D[f_S] = I[f_S] - I_S[f_S]$$, we have

$$I[f_S]\leq I_S[f_S]+ \beta + (2n\beta+M)\sqrt{\frac{\ln(\frac{2}{\delta})}{2n}} $$

with probability at least $$1-\delta$$ as required.

= References =