
Uniform Stability
Uniform stability is a property of a learning algorithm which states that the change in loss resulting from replacing a single training point is bounded by a quantity $$\beta$$ that does not depend on the particular training point; in other words, the bound is uniform. Moreover, this bound can be a decreasing function of the number of training points, so that the larger the sample size, the more robust or insensitive we expect the learning algorithm to be to a change in any one training point. For instance, we could have $$\beta = \mathcal{O}(1/m)$$, with $$m$$ the sample size, and this is what will be referred to as uniform stability for the remainder of this section. Uniform stability is a strong property of a learning algorithm because it implies several others, for example, 1) that the change in empirical error from substituting a training point can be controlled, and 2) that the empirical error of the learning algorithm converges to the 'true error' in some sense. This second point is precisely the main subject of this page.

There is a general class of learning algorithms that satisfy the uniform stability property. They are derived from the Tikhonov regularization problem, which is defined as


 * $$ f^{\lambda}_{S} = \arg \min_{f \in \mathcal{H}} \left(      \frac{1}{m} \sum^m_{i=1} V(f(x_i), y_i) + \lambda ||f||^2_{\mathcal H}     \right) $$

that is, algorithms in this class look for a function within a hypothesis space $$\mathcal H$$ that minimizes the empirical error, as determined by a suitable choice of loss function $$V$$, plus a measure of the function's complexity, namely its norm. This general framework encompasses many algorithms, because the choices of hypothesis space, $$\lambda$$, and loss function each determine a particular algorithm. The class includes well-established methods such as regularized least squares, SVM, and logistic regression (in both kernel and non-kernel versions). It can be shown that an algorithm in the Tikhonov regularization framework is uniformly stable provided it satisfies a few additional assumptions; therefore, any learning algorithm satisfying the assumptions discussed here is guaranteed to generalize.
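As a concrete, minimal illustration of this framework (a sketch only; the function names and parameter values below are invented for illustration), the following solves the Tikhonov problem for the square loss in an RKHS with a Gaussian kernel, i.e. kernel ridge regression:

```python
import numpy as np

def gaussian_kernel(A, B, sigma=1.0):
    # K(x, x') = exp(-||x - x'||^2 / (2 sigma^2)); note K(x, x) = 1 for every x
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dists / (2 * sigma ** 2))

def tikhonov_fit(X, y, lam, sigma=1.0):
    """Minimize (1/m) sum_i (f(x_i) - y_i)^2 + lam * ||f||_H^2 over the RKHS.

    By the representer theorem the minimizer has the form f(x) = sum_i c_i K(x, x_i),
    and the coefficients solve (K + m * lam * I) c = y.
    """
    m = len(y)
    K = gaussian_kernel(X, X, sigma)
    return np.linalg.solve(K + m * lam * np.eye(m), y)

def tikhonov_predict(c, X_train, X_new, sigma=1.0):
    return gaussian_kernel(X_new, X_train, sigma) @ c
```

Swapping in a different loss (e.g. the hinge loss, solved by convex optimization) while keeping the $$\lambda ||f||^2_{\mathcal H}$$ penalty yields other members of the class, such as SVM.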

Formal definition of uniform stability
An algorithm $$A$$ has uniform stability $$\beta$$ with respect to the loss function $$V$$ if the following holds, where $$f_S$$ denotes the predictor learned by $$A$$ from training set $$S$$ and $$S^i$$ denotes the training set obtained from $$S$$ by replacing its $$i$$th example with another point $$z'_i \in Z$$:

$$\forall S\in Z^m, \forall i\in\{1,...,m\}, \sup_{z\in Z}|V(f_S,z)-V(f_{S^i},z)|\leq\beta$$

A probabilistic version of uniform stability $$\beta$$ is:

$$\forall S\in Z^m, \forall i\in\{1,...,m\}, \mathbb{P}_S\{\sup_{z\in Z}|V(f_S,z)-V(f_{S^i},z)|\leq\beta\}\geq1-\delta$$

Required assumptions and their consequence

 * 1) The loss function is bounded above by a constant $$M$$.
 * 2) The loss function is $$L$$-Lipschitz, as discussed below.
 * 3) The hypothesis space $$\mathcal H $$ is an RKHS, a reproducing kernel Hilbert space, whose kernel is bounded along the diagonal (more discussion to follow).

If the above assumptions hold, then the generalization bound has the following form: with probability at least $$ 1 - \delta $$ over the draw of the training set,

$$ |I[f^{\lambda}_{S}] - I_S[f^{\lambda}_S]| \leq \frac{L^2\kappa^2}{\lambda m} + \left(\frac{2L^2\kappa^2}{\lambda m} + M\right) \sqrt{\frac{2\ln(\frac{2}{\delta} ) }{m} } $$

where $$I[f^{\lambda}_{S}]$$ denotes the expected (true) error of $$f^{\lambda}_{S}$$ and $$I_S[f^{\lambda}_S]$$ its empirical error on $$S$$. Therefore, the learning algorithm in question generalizes as $$m$$ goes to infinity. As discussed, popular algorithms that satisfy these properties are kernel and non-kernel SVM, logistic regression, and regularized least squares where the relevant quantities (e.g. $$f(x)$$ and $$y$$) are bounded.
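To get a feel for the numbers, the short snippet below (a sketch; the particular parameter values are invented for illustration) evaluates the right-hand side of the bound:

```python
import math

def stability_bound(L, kappa, M, lam, m, delta):
    # Right-hand side of the bound:
    # L^2 k^2 / (lam m) + (2 L^2 k^2 / (lam m) + M) * sqrt(2 ln(2/delta) / m)
    a = (L ** 2) * (kappa ** 2) / (lam * m)
    return a + (2 * a + M) * math.sqrt(2 * math.log(2 / delta) / m)

# e.g. hinge loss (L = 1), Gaussian kernel (kappa = 1), M = 1:
print(stability_bound(L=1, kappa=1, M=1, lam=0.1, m=10_000, delta=0.01))  # roughly 0.03
```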

Remarks on the bound
Notice that if $$\lambda$$ is kept fixed as $$m$$ increases, the generalization bound tightens at rate $$ O \left( \frac{1}{ \sqrt{m} } \right) $$. However, a fixed $$\lambda$$ effectively keeps the complexity of the hypotheses we can learn fixed, whereas with more data we typically want $$\lambda$$ to decrease. If $$\lambda$$ decreases too quickly, however, the bound has the potential to become vacuous.
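For instance, letting $$\lambda$$ shrink with the sample size as $$\lambda = \lambda_0 m^{-\alpha}$$ for some $$\alpha > 0$$ (an illustrative schedule, not one prescribed by the bound), the first term becomes

$$ \frac{L^2\kappa^2}{\lambda m} = \frac{L^2\kappa^2}{\lambda_0} \, m^{\alpha - 1}, $$

so the whole bound still tends to zero whenever $$\alpha < 1$$, but for $$\alpha = 1$$ (i.e. $$\lambda \propto 1/m$$) it no longer vanishes, and for $$\alpha > 1$$ it diverges.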

Proof of stability of Tikhonov regularization
We want to show that, for any training set $$S = (x_1,y_1),\ldots,(x_m,y_m) = z_1,\ldots,z_m$$, if we perturb one training point, say the $$i$$th one, then the change in loss is bounded by some $$\beta$$, where $$\beta$$ hopefully decreases with $$m$$, the size of the training set, at some favorable rate. Recall that $$S^i$$ is the training set obtained by replacing the $$i$$th element of $$S$$ with a new (random) training point.

To prove that Tikhonov regularization is stable, recall that we need to make some assumptions. The first of these is:

1) We assume that the loss is L-Lipschitz. That is, it satisfies the following property: For any functions $$f_1$$ and $$f_2$$ and any training point $$(x,y) \in \mathcal{Z}$$

$$ |V(f_1(x), y) - V(f_2(x), y)| \leq L||f_1 - f_2 ||_{\infty} $$

where $$L$$ is a given constant and $$ ||f_1 - f_2 ||_{\infty} = \sup_{x \in X} | f_1(x) - f_2(x) | $$ is the maximum possible difference in the functions' values. Intuitively, this says that the change in loss incurred by swapping the predictor $$f_1$$ for $$f_2$$ is bounded, uniformly over all data points, by a constant multiple of the infinity norm of their difference.

This property may or may not hold for common choices of loss function in machine learning. For example, the square loss, $$V(f(x),y) = (f(x) - y)^2$$, is not $$L$$-Lipschitz in general. To see this, take $$f(x) = x$$ and $$ f_\epsilon(x) = x+\epsilon$$, so that $$||f - f_\epsilon||_\infty = \sup_{x \in X} | f(x) - f_\epsilon(x) | = \epsilon$$. Then, for $$y = 0$$, $$\frac{|V(f(x),y) - V(f_\epsilon(x),y)|}{||f - f_\epsilon||_\infty} = \frac{|x^2 - (x+\epsilon)^2|}{\epsilon} \approx 2|x|$$ for small $$\epsilon$$, and this ratio is unbounded when $$x$$ ranges over an unbounded domain, so no finite constant $$L$$ can work. However, if the domain is bounded (or we can bound $$f(x)$$ and $$y$$), then the square loss is $$L$$-Lipschitz. We see from this example that the $$L$$-Lipschitz property is essentially the requirement that the loss have a bounded derivative in its first argument. It can also be verified that the hinge loss is $$L$$-Lipschitz, while the zero-one loss is not.
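As a short worked check of the claim about the hinge loss (using labels $$y \in \{-1, +1\}$$ and the fact that $$t \mapsto \max(t, 0)$$ is 1-Lipschitz):

$$ |(1 - y f_1(x))^+ - (1 - y f_2(x))^+| \le |y|\,|f_1(x) - f_2(x)| \le ||f_1 - f_2||_\infty, $$

so the hinge loss is $$L$$-Lipschitz with $$L = 1$$, a fact used in the SVM example below.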

2) We also assume that the hypothesis space of predictors $$\mathcal{H}$$ is a reproducing kernel Hilbert space (RKHS) and, moreover, that the associated kernel $$K$$ is bounded along the diagonal, namely,

$$ \sup_{x \in X} K(x,x) \le \kappa^2$$

for some finite constant $$\kappa$$. For example, the Gaussian RBF kernel satisfies $$K(x,x) = 1$$ for every $$x$$, so it is bounded along the diagonal with $$\kappa = 1$$. This assumption readily implies the following:

$$ ||f - f'||_{\infty} \leq \kappa ||f - f'||_{\mathcal{H}} $$ for any $$f,f' \in \mathcal{H} $$.

The above is a consequence of the reproducing property $$f(x) = \langle f, K_x \rangle_\mathcal{H}$$ and the Cauchy–Schwarz inequality: for every $$x \in X$$,

$$ |f(x) - f'(x)| = |\langle f - f', K_x \rangle_\mathcal{H}| \le ||f - f'||_\mathcal{H} \sqrt{K(x,x)} \le \kappa ||f - f'||_\mathcal{H}. $$

All that remains is to prove the following lemma (*).

$$ ||f^{\lambda}_{S} - f^{\lambda}_{S^{i}} ||^2_{\mathcal{H}} \leq \frac{L||f^{\lambda}_S - f^{\lambda}_{S^{i}} ||_{\infty} }{\lambda m} $$

Once it is proved, we can conclude that



$$\begin{align} |V(f^\lambda_S,z)-V(f^\lambda_{S^i},z)| & \le L || f^\lambda_S - f^\lambda_{S^i}||_\infty \\ & \le L \kappa || f^\lambda_S - f^\lambda_{S^i}||_\mathcal{H} \\ & = L \kappa \frac{|| f^\lambda_S - f^\lambda_{S^i}||^2_\mathcal{H} }{||f^\lambda_S - f^\lambda_{S^i}||_\mathcal{H} }\\ & \le \frac{L^2 \kappa  ||f^{\lambda}_S - f^{\lambda}_{S^{i}} ||_{\infty} }{\lambda m||f^\lambda_S - f^\lambda_{S^i}||_\mathcal{H} }\\ & \le \frac{L^2 \kappa^2 }{\lambda m} \end{align}$$

The first inequality follows from the Lipschitz property of the loss function, the second from the kernel property discussed earlier, the third from the lemma (*) proved in the next section, and the last again from the kernel property. Since this holds for every $$z \in Z$$, every $$S$$, and every $$i$$, it establishes uniform stability with $$\beta = \frac{L^2 \kappa^2}{\lambda m}$$.

Proof of the lemma (*)
To prove the lemma, the concept of Bregman divergence is needed. For a convex, differentiable functional $$F$$ on $$\mathcal H$$, the Bregman divergence of $$g$$ from $$f$$ is $$d_F(g, f) = F(g) - F(f) - \langle \nabla F(f), g - f \rangle$$, which is always nonnegative. Let us define the following functionals

$$\begin{align} & I_S(f) =  \frac{1}{m} \sum^m_{i=1} V(f(x_i), y_i) \\ & N(f) =  || f ||^2_\mathcal{H} \\ & T_S(f) = I_S(f) + \lambda N(f) \end{align}$$

where the last one is the objective functional of Tikhonov regularization, and the first and second are the empirical risk and the regularizer, respectively. The above three functionals are all convex, and differentiable whenever the loss is (as it is, for example, for the square loss). Now, because $$N$$ is a squared norm, its gradient is $$\nabla N(f) = 2f$$, and it follows that

$$\begin{align} d_N(f^\lambda_S, f^\lambda_{S^i}) & = N(f^\lambda_S) - N( f^\lambda_{S^i}) -\langle 2 f^\lambda_{S^i}, f^\lambda_S -  f^\lambda_{S^i} \rangle \\ & = ||f^\lambda_S||_\mathcal{H}^2 - ||f^\lambda_{S^i}||^2_{\mathcal{H}} -\langle 2 f^\lambda_{S^i}, f^\lambda_S -  f^\lambda_{S^i} \rangle \\ & = || f^\lambda_S - f^\lambda_{S^i} ||^2_\mathcal{H} \end{align}$$

and, by symmetry, we conclude that

$$ d_N(f^\lambda_{S^i}, f^\lambda_{S}) + d_N(f^\lambda_S, f^\lambda_{S^i}) = 2|| f^\lambda_S - f^\lambda_{S^i} ||^2_\mathcal{H} $$

Now, using the linearity and nonnegativity of the Bregman divergence, and the fact that $$f^\lambda_S$$ and $$f^\lambda_{S^i}$$ minimize the respective Tikhonov objectives, we obtain



$$\begin{align} \lambda\left( d_N(f^\lambda_{S^i}, f^\lambda_{S}) + d_N(f^\lambda_S, f^\lambda_{S^i}) \right)& \le d_{T_S}(f^\lambda_{S^i}, f^\lambda_{S}) + d_{T_{S^i}}(f^\lambda_S, f^\lambda_{S^i}) \\ & = T_{S}(f^\lambda_{S^i}) - T_{S}(f^\lambda_{S}) + T_{S^i}(f^\lambda_{S})- T_{S^i}(f^\lambda_{S^i})\\ & = I_{S}(f^\lambda_{S^i}) + \lambda N(f^\lambda_{S^i}) - I_{S}(f^\lambda_{S}) - \lambda N(f^\lambda_{S}) \\ & \qquad + I_{S^i}(f^\lambda_{S}) + \lambda N(f^\lambda_{S}) - I_{S^i}(f^\lambda_{S^i}) - \lambda N(f^\lambda_{S^i}) \\ & = I_{S}(f^\lambda_{S^i}) - I_{S}(f^\lambda_{S}) + I_{S^i}(f^\lambda_{S})- I_{S^i}(f^\lambda_{S^i}) \\ & = \frac{1}{m} \big( V(f^\lambda_{S^i}(x_i),y_i) - V(f_{S}^\lambda(x_i),y_i) \\ & \qquad + V(f_{S}^\lambda(x'_i),y'_i) - V(f^\lambda_{S^i}(x'_i),y'_i)\big) \\ & \le \frac{2 L ||f^\lambda_{S^i} - f^\lambda_{S}||_\infty }{m} \end{align}$$

The first inequality follows from the linearity of the Bregman divergence ($$d_{T_S} = d_{I_S} + \lambda d_N$$) together with its nonnegativity for the convex functionals $$I_S$$ and $$I_{S^i}$$. The first equality holds because $$f^\lambda_S$$ and $$f^\lambda_{S^i}$$ are the minimizers of the corresponding functionals $$T_S$$ and $$T_{S^i}$$, respectively, so the gradient terms in the Bregman divergences vanish. The last equality uses the fact that $$S$$ and $$S^i$$ differ only in their $$i$$th point, $$(x_i, y_i)$$ versus $$(x'_i, y'_i)$$, so all other terms of the empirical risks cancel. Finally, the last inequality is the Lipschitz property.

We can now conclude that

$$ 2 \lambda || f^\lambda_{S^i} - f^\lambda_{S} ||^2_\mathcal{H} \le \frac{2 L ||f^\lambda_{S^i} - f^\lambda_{S}||_\infty }{m} $$

which, after dividing by $$2\lambda$$, is exactly the lemma (*). Thus the uniform stability property holds with $$\beta = \mathcal{O}(1/m)$$.

Examples of uniformly stable learning algorithms
There are a number of different learning algorithms which satisfy the sufficient conditions for uniform stability.

SVM classification with hinge loss and Gaussian kernel
Here the output space is $$\mathcal Y = \{-1,1\}$$ and the loss function is

$$ V(f(x),y) = (1 - yf(x))^+ = \begin{cases} 1 - y f(x) & \text{ if } y f(x) \le 1 \\ 0 & \text{otherwise} \end{cases} $$

The loss is $$L$$-Lipschitz with $$L = 1$$ (as verified above) and, since the Gaussian RBF kernel satisfies $$K(x,x) = 1$$, we can take $$\kappa = 1$$. Therefore in this case $$ \beta = \frac{1}{m\lambda}$$.

Regularized least squares with boundedness properties
Suppose that the square loss is only ever evaluated in a regime where $$|f(x) - y| \le B$$ for every predictor under consideration and every $$(x,y) \in \mathcal Z$$ (for example, because both $$f(x)$$ and $$y$$ are bounded). Then the loss function has a Lipschitz constant of $$2B$$. If a Gaussian kernel is used, so that $$\kappa = 1$$, the bound above gives $$\beta = \frac{4B^2}{\lambda m}$$.
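As an informal illustration (not a verification of the exact constants above, since the boundedness condition is only roughly enforced), one can empirically probe stability by refitting kernel ridge regression after replacing a single training point and comparing the losses; all names and parameter values below are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def rbf(A, B, sigma=1.0):
    return np.exp(-((A[:, None, :] - B[None, :, :]) ** 2).sum(-1) / (2 * sigma ** 2))

def fit_predict(X, y, X_new, lam):
    # kernel ridge regression: c = (K + m*lam*I)^{-1} y,  f(x) = sum_i c_i K(x, x_i)
    c = np.linalg.solve(rbf(X, X) + len(y) * lam * np.eye(len(y)), y)
    return rbf(X_new, X) @ c

m, lam, B = 200, 0.1, 1.0
X = rng.uniform(-1, 1, size=(m, 1))
y = np.clip(np.sin(3 * X[:, 0]) + 0.1 * rng.standard_normal(m), -B, B)

# S^i: replace the i-th (here the 0th) training point with a fresh draw
Xi, yi = X.copy(), y.copy()
Xi[0] = rng.uniform(-1, 1, size=1)
yi[0] = np.clip(np.sin(3 * Xi[0, 0]) + 0.1 * rng.standard_normal(), -B, B)

# compare the square loss of f_S and f_{S^i} over a grid of test points z = (x, y)
X_test = np.linspace(-1, 1, 500)[:, None]
y_test = np.sin(3 * X_test[:, 0])
f_S = fit_predict(X, y, X_test, lam)
f_Si = fit_predict(Xi, yi, X_test, lam)

observed = np.max(np.abs((f_S - y_test) ** 2 - (f_Si - y_test) ** 2))
print(observed, 4 * B ** 2 / (lam * m))  # observed change vs. the 4B^2/(lam m) scale
```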