
Regularization Perspectives on SVM
Support vector machines (SVM), like regularized least squares, are a special case of Tikhonov regularization. In the case of SVM, the loss function is the hinge loss.

Background
In the supervised learning framework, an algorithm is a strategy for choosing a function $$ f:\mathbf X \to \mathbf Y $$ given a training set $$ S = \{(x_1,y_1),\ldots, (x_n,y_n)\}$$ of inputs and their labels (the labels are usually $$\pm1$$). Regularization strategies avoid overfitting by choosing a function that fits the data, but is not too complex. Specifically:

$$f = \text{arg}\min_{f\in\mathcal{H}}\left\{\frac{1}{n}\sum_{i=1}^n V(y_i,f(x_i))+\lambda||f||^2_\mathcal{H}\right\} $$,

where $$\mathcal{H}$$ is a hypothesis space of functions, $$V:\mathbf Y \times \mathbf Y \to \mathbb R$$ is the loss function, $$||\cdot||_\mathcal H$$ is the norm on the hypothesis space of functions, and $$\lambda>0$$ is the regularization parameter.
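For example, with the square loss $$V(y_i,f(x_i)) = (y_i-f(x_i))^2$$ this scheme is regularized least squares, while the hinge loss introduced below yields the SVM.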

When $$\mathcal{H}$$ is a reproducing kernel Hilbert space, there exists a kernel function $$K: \mathbf X \times \mathbf X \to \mathbb R$$ whose evaluations on the training inputs form an $$n\times n$$ symmetric positive semi-definite Gram matrix $$\mathbf K$$ with $$\mathbf K_{ij} = K(x_i,x_j)$$. By the representer theorem, the minimizer can be written as $$f(x) = \sum_{j=1}^n c_j K(x_j,x)$$, so that $$f(x_i) = \sum_{j=1}^n c_j \mathbf K_{ij}$$, and $$ ||f||^2_{\mathcal H} = \langle f,f\rangle_\mathcal H = \sum_{i=1}^n\sum_{j=1}^n c_ic_jK(x_i,x_j) = c^T\mathbf K c. $$
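For intuition, the quantities above can be computed directly from data. The following is a minimal sketch (not part of the original derivation), assuming NumPy and a Gaussian kernel; the toy inputs and coefficients are illustrative:

```python
import numpy as np

def rbf_kernel(x, z, gamma=1.0):
    """Gaussian (RBF) kernel K(x, z) = exp(-gamma * ||x - z||^2); an illustrative choice."""
    return np.exp(-gamma * np.sum((x - z) ** 2))

def gram_matrix(X, kernel):
    """n x n symmetric positive semi-definite matrix K with K[i, j] = K(x_i, x_j)."""
    n = len(X)
    return np.array([[kernel(X[i], X[j]) for j in range(n)] for i in range(n)])

def f(x, X, c, kernel):
    """Representer-theorem expansion: f(x) = sum_j c_j K(x_j, x)."""
    return sum(c_j * kernel(x_j, x) for c_j, x_j in zip(c, X))

X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])  # toy inputs x_i
c = np.array([0.5, -0.2, 0.3])                      # illustrative expansion coefficients c_j
K = gram_matrix(X, rbf_kernel)
rkhs_norm_sq = c @ K @ c                            # ||f||_H^2 = c^T K c
print(f(np.array([0.5, 0.5]), X, c, rbf_kernel), rkhs_norm_sq)
```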

Hinge loss


The simplest and most intuitive loss function for categorization is the misclassification loss, or 0-1 loss, which is 0 if $$f(x_i)=y_i$$ and 1 if $$f(x_i) \neq y_i$$, i.e. the Heaviside step function on $$-y_if(x_i)$$. However, this loss function is not convex, which makes the regularization problem very difficult to minimize computationally. Therefore, we look for convex substitutes for the 0-1 loss. The hinge loss, $$ V(y_i,f(x_i)) = (1-y_if(x_i))_+$$, where $$(s)_+ = \max(s,0)$$, provides such a convex relaxation. In fact, the hinge loss is the tightest convex upper bound to the 0-1 misclassification loss, and, with infinite data, minimizing it returns the Bayes-optimal solution:

$$f_b(x) = \begin{cases}1,&p(1|x)>p(-1|x)\\-1,&p(1|x)<p(-1|x)\end{cases}$$
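As a quick numerical illustration (a sketch, not from the article), the 0-1 loss and the hinge loss can be compared on a few label/score pairs; treating $$f(x_i)=0$$ as a misclassification is a convention assumed by the sketch:

```python
def zero_one_loss(y, fx):
    """0-1 misclassification loss: 1 if y*f(x) <= 0 (Heaviside step of -y*f(x)), else 0."""
    return 1.0 if y * fx <= 0 else 0.0

def hinge_loss(y, fx):
    """Hinge loss (1 - y*f(x))_+: a convex upper bound on the 0-1 loss."""
    return max(0.0, 1.0 - y * fx)

# For every (y, f(x)) pair, hinge_loss >= zero_one_loss, and both vanish when y*f(x) >= 1.
for y, fx in [(1, 2.0), (1, 0.5), (1, -0.5), (-1, 0.3), (-1, -1.5)]:
    print(y, fx, zero_one_loss(y, fx), hinge_loss(y, fx))
```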


Derivation

With the hinge loss, $$ V(y_i,f(x_i)) = (1-y_if(x_i))_+$$, where $$(s)_+ = \max(s,0)$$, the regularization problem becomes:

$$f = \text{arg}\min_{f\in\mathcal{H}}\left\{\frac{1}{n}\sum_{i=1}^n (1-y_if(x_i))_+ +\lambda||f||^2_\mathcal{H}\right\} $$.

In most of the SVM literature, this is written equivalently $$\left(\text{take }C = \frac{1}{2\lambda n}\right)$$ as:

$$f = \text{arg}\min_{f\in\mathcal{H}}\left\{C\sum_{i=1}^n (1-y_if(x_i))_+ +\frac{1}{2}||f||^2_\mathcal{H}\right\} $$.
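To check the equivalence, multiply the first objective by the positive constant $$\frac{1}{2\lambda}$$, which does not change the minimizer:

$$\frac{1}{2\lambda}\left[\frac{1}{n}\sum_{i=1}^n (1-y_if(x_i))_+ +\lambda||f||^2_\mathcal{H}\right] = \frac{1}{2\lambda n}\sum_{i=1}^n (1-y_if(x_i))_+ +\frac{1}{2}||f||^2_\mathcal{H} = C\sum_{i=1}^n (1-y_if(x_i))_+ +\frac{1}{2}||f||^2_\mathcal{H}.$$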

This problem is non-differentiable because of the "kink" in the loss function. However, we can rewrite it as a smooth constrained problem by introducing slack variables $$\xi_i$$, which at the optimum equal the hinge losses $$(1-y_if(x_i))_+$$:

$$f = \text{arg}\min_{f\in\mathcal{H}}\left\{C\sum_{i=1}^n \xi_i +\frac{1}{2}||f||^2_\mathcal{H}\right\} $$ subject to: $$\begin{align}\xi_i\geq 1-y_if(x_i):\ \ \ &i = 1, \ldots, n \\ \xi_i\geq 0:\ \ \ & i = 1,\ldots,n \end{align}$$

Next we apply the representer theorem to get:

$$f = \text{arg}\min_{f\in\mathcal{H}}\left\{C\sum_{i=1}^n \xi_i +\frac{1}{2}c^T\mathbf K c\right\} $$ subject to: $$\begin{align}\xi_i\geq 1-y_i\sum_{j=1}^n c_j K(x_i,x_j):\ \ \ &i = 1, \ldots, n \\ \xi_i\geq 0:\ \ \ & i = 1,\ldots,n \end{align}$$

This is a constrained optimization problem, which we will solve using the Lagrangian to derive the dual problem. The Lagrangian is:

$$L(c,\xi,\alpha, \zeta) = C\sum_{i=1}^n \xi_i +\frac{1}{2} c^T \mathbf K c - \sum_{i=1}^n\alpha_i\left(y_i\left\{\sum_{j=1}^n c_j K(x_i,x_j)\right\}-1+\xi_i\right)-\sum_{i=1}^n\zeta_i\xi_i$$

The dual problem is:

$$\max_{\alpha\geq0,\,\zeta\geq0}\ \inf_{c,\xi} L(c,\xi, \alpha, \zeta)$$

Minimizing $$L$$ with respect to $$c$$: $$\frac{\partial L}{\partial c_i} = 0\Rightarrow c_i = \alpha_i y_i.$$

Minimizing $$L$$ with respect to $$\xi_i$$: $$\frac{\partial L}{\partial \xi_i} = 0\Rightarrow C -\alpha_i-\zeta_i = 0$$, which together with $$\alpha_i\geq0$$ and $$\zeta_i\geq0$$ gives $$0\leq \alpha_i\leq C$$.
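Writing the first condition out in coordinates makes the substitution explicit:

$$\frac{\partial L}{\partial c_k} = \sum_{j=1}^n K(x_k,x_j)c_j - \sum_{i=1}^n \alpha_i y_i K(x_i,x_k) = 0 \ \text{ for all } k \quad\Longleftrightarrow\quad \mathbf K c = \mathbf K(\text{diag}\,\mathbf Y)\alpha,$$

so $$c_i = \alpha_i y_i$$ is a minimizer (the unique one when $$\mathbf K$$ is invertible).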

Then, plugging $$\zeta_i = C-\alpha_i$$ into the Lagrangian (the terms involving $$\xi_i$$ cancel), we can write the dual problem as: $$\text{arg}\max_{0\leq\alpha_i\leq C}\,\inf_{c}\left\{\frac{1}{2} c^T\mathbf K c + \sum_{i=1}^n \alpha_i\left(1-y_i\sum_{j=1}^n K(x_i,x_j)c_j\right)\right\}$$

Then, plugging in $$c_i = \alpha_i y_i$$, we get: $$\text{arg}\max_{\alpha\in \mathbb R^n} L(\alpha) = \text{arg}\max_{\alpha\in \mathbb R^n}\sum_{i=1}^n \alpha_i - \frac{1}{2} \sum_{i,j=1}^n\alpha_iy_iK(x_i,x_j)\alpha_jy_j =\text{arg}\max_{\alpha\in \mathbb R^n} \sum_{i=1}^n \alpha_i - \frac{1}{2} \alpha^T(\text{diag}\mathbf Y) \mathbf K(\text{diag}\mathbf Y)\alpha$$

Subject to $$0\leq \alpha_i\leq C\ \ \ i = 1,\ldots,n$$

Note that this dual problem is easier to solve than the original problem because it is box constrained (the $$\alpha_i$$ are bounded). Also notice that the slack variables have disappeared in the dual problem.
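Because the dual is a concave quadratic maximization over a box, even a simple projected gradient ascent solves it. The following is a minimal sketch (not from the article), assuming NumPy, a linear kernel, and toy separable data; the step size, iteration count, and data are illustrative:

```python
import numpy as np

def svm_dual_pga(K, y, C=1.0, steps=5000, lr=None):
    """Maximize sum(alpha) - 0.5 * alpha^T Q alpha with Q = diag(y) K diag(y),
    subject to 0 <= alpha_i <= C, by projected gradient ascent."""
    Q = (y[:, None] * K) * y[None, :]
    if lr is None:
        lr = 1.0 / (np.linalg.norm(Q, 2) + 1e-12)   # step size from the curvature of Q
    alpha = np.zeros(len(y))
    for _ in range(steps):
        grad = 1.0 - Q @ alpha                      # gradient of the dual objective
        alpha = np.clip(alpha + lr * grad, 0.0, C)  # ascent step, then project onto the box
    return alpha

# Toy linearly separable data with a linear kernel K(x, z) = <x, z>.
X = np.array([[2.0, 2.0], [1.5, 2.5], [-2.0, -1.0], [-1.0, -2.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
K = X @ X.T
alpha = svm_dual_pga(K, y, C=10.0)
c = alpha * y                                       # expansion coefficients c_i = alpha_i y_i
print(np.sign(K @ c))                               # f(x_i) = sum_j c_j K(x_i, x_j) on the training points
```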

Consequences and interpretations
The Karush-Kuhn-Tucker (KKT) conditions require that any optimal solution satisfies, for $$i = 1,\ldots,n$$:

$$\sum_{j=1}^n c_j K(x_i,x_j) - \sum_{j=1}^n y_j\alpha_jK(x_i,x_j) = 0$$

$$C-\alpha_i-\zeta_i = 0$$

$$y_i\left(\sum_{j=1}^ny_j\alpha_jK(x_i,x_j)\right)-1+\xi_i\geq0$$

$$\alpha_i\left[y_i\left(\sum_{j=1}^ny_j\alpha_jK(x_i,x_j)\right)-1+\xi_i\right]=0$$

$$\zeta_i\xi_i = 0$$

$$\xi_i,\alpha_i,\zeta_i\geq0$$

From the above conditions, and recalling that $$f(x) = \sum_{i=1}^ny_i\alpha_iK(x,x_i)$$ (since $$c_i=\alpha_iy_i$$), we can derive conditions relating the $$\alpha_i$$ to $$y_if(x_i)$$:

$$\begin{align}y_if(x_i)>1&\Rightarrow(1-y_if(x_i))<0\leq\xi_i\\&\Rightarrow y_if(x_i)-1+\xi_i>0\\&\Rightarrow\alpha_i=0\end{align}$$

$$\begin{align}y_if(x_i)<1&\Rightarrow (1-y_if(x_i))>0\\&\Rightarrow \xi_i>0\\&\Rightarrow\zeta_i=0\\&\Rightarrow\alpha_i = C\end{align}$$

$$\begin{align}\alpha_i = C&\Rightarrow \xi_i = 1-y_if(x_i)\\&\Rightarrow y_if(x_i)\leq1\end{align}$$

$$\begin{align}\alpha_i = 0&\Rightarrow \zeta_i = C>0\\&\Rightarrow \xi_i = 0\\&\Rightarrow y_if(x_i)\geq1\end{align}$$

$$\begin{align}0<\alpha_i<C&\Rightarrow \zeta_i = C-\alpha_i>0\\&\Rightarrow\xi_i=0\\&\Rightarrow y_if(x_i)=1\end{align}$$

In particular, whenever $$y_if(x_i)>1$$, we have $$\alpha_i=0$$. In SVM, the input points with non-zero coefficients $$\alpha_i$$ are called support vectors. Given the above conditions, the support vectors are precisely the input points where $$y_if(x_i)\leq1$$.
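As a sketch (not from the article), the support vectors can be read off numerically from a dual solution: they are the training points whose coefficients $$\alpha_i$$ exceed a small tolerance, and by the KKT conditions their margins $$y_if(x_i)$$ are at most 1. The helper below is hypothetical and assumes a precomputed Gram matrix and dual variables:

```python
import numpy as np

def support_vectors(alpha, K, y, tol=1e-6):
    """Return indices of the support vectors (alpha_i > 0, up to a tolerance)
    and their margins y_i f(x_i), which the KKT conditions bound by 1."""
    c = alpha * y                # c_i = alpha_i y_i
    margins = y * (K @ c)        # y_i f(x_i) = y_i sum_j c_j K(x_i, x_j)
    idx = np.flatnonzero(alpha > tol)
    return idx, margins[idx]
```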