
DeepBoost is a machine learning algorithm for boosting, formulated by Corinna Cortes, Mehryar Mohri, and Umar Syed in 2014. It is an ensemble learning algorithm that can achieve high accuracy without overfitting. Deep decision trees, or members of other rich families, can be used as base classifiers. DeepBoost selects hypotheses using a capacity-conscious criterion, which is the key to achieving higher accuracy than other ensemble-based learning techniques. The main purpose of DeepBoost is to minimize the corresponding learning bound. DeepBoost has been observed to outperform AdaBoost, logistic regression, and their $$L_1$$-regularized variants in most cases. Its analysis also yields similar theoretical guarantees for multi-class problems, and these guarantees can be used to derive a family of new multi-class deep boosting algorithms.

The theoretical analysis and the design of the DeepBoost algorithm can be extended to other loss functions and to ranking. The analysis also generalizes existing algorithms and their use of a complex hypothesis set, namely a union of families of different complexity, measured by their Rademacher complexity. The extension can also accommodate losses that are convex, surrogate, and non-differentiable, e.g. the hinge loss. The analysis of DeepBoost can help answer questions about AdaBoost's underlying theory. AdaBoost is primarily based on a margin guarantee; however, it does not exactly maximize the minimum margin, while algorithms such as arc-gv that are designed to achieve this do not perform better than AdaBoost. Two main reasons could explain such an observation: (1) to obtain a better margin, algorithms such as arc-gv may have a propensity to select deeper decision trees, or more complex hypotheses in general, which affects their generalization; (2) though these algorithms achieve a better minimum margin, the margin distribution may not improve overall. DeepBoost theory may help in understanding and evaluating the consequences of factor (1), since its learning bounds depend on the mixture weights and on the contribution of each hypothesis set $$H_k$$ to the definition of the ensemble function. Moreover, the guarantees themselves suggest a better algorithm, namely DeepBoost.

Overview
Ensemble-based methods combine several weak predictors or learners to create a stronger, more accurate classifier. The most popular techniques are bagging, boosting, stacking, error-correction techniques, Bayesian averaging, and other averaging schemes. Ensemble methods have been observed to deliver better test accuracy and to come with strong learning guarantees. The performance guarantees of algorithms like AdaBoost and its variants are expressed in terms of the margins on the training data, and the algorithms themselves are supported by a rich theoretical analysis.

Popular ensemble-based algorithms like AdaBoost use a combination of functions selected from a hypothesis set $$H$$ of weak classifiers, which together form the base classifiers. In typical AdaBoost applications, the hypothesis set $$H$$ is reduced to boosting stumps, i.e. decision trees of depth one. For difficult problems such as speech recognition or image processing, simple stumps do not suffice to achieve high accuracy. One may then wish to resort to a more complex hypothesis set, e.g. decision trees of much larger depth. However, existing learning guarantees depend not only on the number of training samples and the margin, but also on the complexity of $$H$$, measured in terms of its VC-dimension or its Rademacher complexity. For a very complex $$H$$, the learning bounds become looser. These bounds point to a risk of overfitting, which has indeed been observed in experiments with AdaBoost.

The main goal behind DeepBoost is the design of alternative ensemble algorithms using a hypothesis set $$H$$ that may contain members of rich families such as deep decision trees, thereby achieving higher accuracy. Let the set of base classifiers $$H$$ be decomposed as the union of $$p$$ disjoint families $$H_1,...,H_p$$ ordered by increasing complexity, where $$H_k$$, $$k \in [1, p]$$, can be, e.g., the set of decision trees of depth $$k$$, or a set of functions based on monomials of degree $$k$$.
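As an illustration, such a decomposition can be sketched in a few lines of Python. The following stdlib-only sketch is a made-up example, not the paper's construction: depth-1 stumps stand in for $$H_1$$, simple depth-2 conjunction rules for $$H_2$$, and the values $$\gamma_k$$ are assumed stand-ins for the Rademacher complexities of the two families.

```python
def make_stump(feature, threshold):
    # H_1 member: a depth-1 decision tree returning +1/-1
    return lambda x: 1 if x[feature] <= threshold else -1

def make_depth2(f1, t1, f2, t2):
    # H_2 member: a depth-2 rule combining two threshold tests
    return lambda x: 1 if (x[f1] <= t1 and x[f2] <= t2) else -1

# toy 2-feature points (illustrative)
X = [(0.1, 0.9), (0.4, 0.2), (0.8, 0.7), (0.6, 0.3)]

H1 = [make_stump(f, t) for f in (0, 1) for t in (0.25, 0.5, 0.75)]
H2 = [make_depth2(0, t1, 1, t2) for t1 in (0.25, 0.75) for t2 in (0.25, 0.75)]

# families ordered by increasing complexity, each paired with an
# assumed complexity proxy gamma_k (larger for the deeper family)
families = [(H1, 0.05), (H2, 0.15)]
hypotheses = [h for H, _ in families for h in H]
gammas = [g for H, g in families for _ in H]
```

Keeping, for each pooled hypothesis, the complexity value of the family it came from is exactly the bookkeeping the capacity-conscious criterion below relies on.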

The goal behind the design of the DeepBoost algorithm is to achieve higher accuracy by drawing hypotheses from $$H_1,...,H_p$$ in such a manner that more weight is allocated to hypotheses from the $$H_k$$ with smaller $$k$$. This goal is similar to that of model selection via Structural Risk Minimization, but it differs in that the set of base classifiers is not limited to some optimal $$\mathcal{H}_q = \cup_{k=1}^q H_k$$. Deep decision trees with large $$k$$ can still be used, only less frequently, or with smaller weights assigned to them. Thus, there is room for flexible learning using deep hypotheses.

Learning Guarantees
Boosting, bagging, and other such non-negative linear ensemble methods assume that all the weak learners are chosen from the same hypothesis set $$H$$. Early margin-based generalization bounds for ensembles of base functions with range {-1,+1} were expressed in terms of the VC-dimension of $$H$$. Later, tighter margin bounds with simpler proofs were given in terms of the Rademacher complexity of $$H$$, in particular for families $$H$$ of functions with range in $$\mathbb{R}$$. The complexity term of these bounds depends directly on the mixture coefficients defining the ensembles. Let the family of ensembles be $$\mathcal{F} = conv(\cup_{k=1}^p H_k)$$, i.e. the family of functions of the form $$f = \sum_{t=1}^T \alpha_{t}h_t$$, where $$\alpha = (\alpha_1,...,\alpha_T)$$ and, for each $$t \in [1,T]$$, $$h_t$$ is in $$H_{k_t}$$ for some $$k_t \in [1,p]$$.

Let $$\mathcal{X}$$ denote the input space, and let $$H_1,...,H_p$$ be families of functions mapping from $$\mathcal{X}$$ to $$\mathbb{R}$$. Training and test points are drawn independently and identically distributed according to a distribution $$D$$ over $$\mathcal{X} \times \{-1,+1\}$$. Let $$S = ((x_1,y_1),...,(x_m,y_m))$$ be a training sample of size $$m$$ drawn according to $$D^m$$. For $$\rho > 0$$ and a function $$f$$ with range in $$\mathbb{R}$$, let $$R(f)$$ be the binary classification error, $$R_{\rho}(f)$$ the $$\rho$$-margin error, and $$\widehat{R}_{S,\rho}(f)$$ the empirical margin error:
 * $$R(f) = \underset{(x,y)\sim D}{E}[1_{yf(x)\leq0}]$$
 * $$R_{\rho}(f) = \underset{(x,y)\sim D}{E}[1_{yf(x)\leq\rho}]$$
 * $$\widehat{R}_{S,\rho}(f) = \underset{(x,y)\sim S}{E}[1_{yf(x)\leq\rho}]$$

where $$(x,y)\sim S$$ indicates that $$(x,y)$$ is drawn according to the empirical distribution defined by $$S$$. The margin-based Rademacher complexity learning bound is given by
 * $$R(f) \leq \widehat{R}_{S,\rho}(f) + {\frac{4}{\rho}}{\sum_{t=1}^T\alpha_t\Re_m(H_{k_t})} + C(m,p)$$

with $$C(m,p) = O\left(\sqrt{\frac{\log p}{\rho^2m}\log\left[\frac{\rho^2m}{\log p}\right]}\right)$$. Thus, the bound depends on the mixture coefficients $$\alpha_t$$. This implies that, even if the Rademacher complexity of some $$H_k$$ is large, generalization is not hampered so long as the total mixture weight assigned to hypotheses from that $$H_k$$ is relatively small.
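The two data-dependent terms of the bound can be computed directly for a given ensemble. The following stdlib-only Python sketch is illustrative: the sample, the two threshold "hypotheses", the mixture weights, and the Rademacher complexity values are all assumed inputs.

```python
def margin_error(S, hypotheses, alpha, rho):
    # empirical rho-margin error: fraction of points with y*f(x) <= rho
    bad = 0
    for x, y in S:
        f_x = sum(a * h(x) for a, h in zip(alpha, hypotheses))
        if y * f_x <= rho:
            bad += 1
    return bad / len(S)

def complexity_penalty(alpha, rademacher, rho):
    # (4/rho) * sum_t alpha_t * R_m(H_{k_t}),
    # where rademacher[t] is the complexity of the family h_t comes from
    return (4 / rho) * sum(a * r for a, r in zip(alpha, rademacher))

# toy ensemble of two threshold classifiers (assumed)
h1 = lambda x: 1 if x >= 0.0 else -1
h2 = lambda x: 1 if x >= 0.5 else -1
S = [(-1.0, -1), (0.2, 1), (0.7, 1), (-0.3, -1)]
alpha = [0.6, 0.4]
rademacher = [0.05, 0.15]  # assumed complexities of the two families

err = margin_error(S, [h1, h2], alpha, rho=0.2)
pen = complexity_penalty(alpha, rademacher, rho=0.2)
```

Note how the penalty weights each coefficient by the complexity of its own family, which is precisely what lets a small amount of weight on a complex family remain harmless.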

Algorithm
The capacity-conscious algorithm is derived by applying coordinate descent to an objective function obtained from the learning bound: minimizing this objective via coordinate descent yields the DeepBoost algorithm.

Optimization Problem
Let $$H_1,...,H_p$$ be $$p$$ disjoint families of functions taking values in $$[-1,+1]$$ with increasing Rademacher complexities $$\Re_m(H_k)$$, $$k \in [1,p]$$. The families $$H_k$$ are assumed to be symmetric, which means that for any $$h \in H_k$$, $$(-h)$$ is also in $$H_k$$. For any hypothesis $$h \in \cup_{k=1}^pH_k$$, $$d(h)$$ denotes the index of the family it belongs to, i.e. $$h \in H_{d(h)}$$. The learning bound holds uniformly for all $$\rho > 0$$ and all functions $$f \in conv(\cup_{k=1}^pH_k)$$. Since the last term of the bound does not depend on $$\alpha$$, $$\alpha$$ should be chosen to minimize
 * $$G(\alpha) = \frac{1}{m}\overset{m}{\underset{i=1}{\sum}} 1_{y_i\sum_{t=1}^T \alpha_th_t(x_i)\leq\rho} + \frac{4}{\rho}\overset{T}{\underset{t=1}{\sum}} \alpha_t\gamma_t$$

where $$\gamma_t = \Re_m(H_{d(h_t)})$$. Since, for any $$\rho > 0$$, $$f$$ and $$f/\rho$$ have the same generalization error, the search can instead be performed over $$\alpha \geq 0$$ with $$\sum_{t=1}^T \alpha_t \leq 1/\rho$$, which gives
 * $$\underset{\alpha\geq0}{min}\frac{1}{m}\overset{m}{\underset{i=1}{\sum}} 1_{y_i\sum_{t=1}^T \alpha_th_t(x_i)\leq1} + 4\overset{T}{\underset{t=1}{\sum}} \alpha_t\gamma_t$$ such that $$\overset{T}{\underset{t=1}{\sum}} \alpha_t \leq 1/\rho$$.

To simplify the minimization of this objective function, which is not convex, consider a convex upper bound. Let $$u \mapsto \Phi(-u)$$ be a non-increasing convex function upper-bounding $$u \mapsto 1_{u\leq0}$$, with $$\Phi$$ differentiable over $$\mathbb{R}$$ and $$\Phi'(u) \neq 0$$ for all $$u$$. $$\Phi$$ can be the exponential function as in AdaBoost or the logistic function. This upper bound gives the following convex optimization problem:
 * $$\underset{\alpha\geq0}{min}\frac{1}{m}\overset{m}{\underset{i=1}{\sum}} \Phi\left(1 - y_i\overset{T}{\underset{t=1}{\sum}} \alpha_th_t(x_i)\right) + \lambda\overset{T}{\underset{t=1}{\sum}} \alpha_t\gamma_t$$

such that $$\overset{T}{\underset{t=1}{\sum}} \alpha_t \leq 1/\rho$$, where the parameter $$\lambda \geq 0$$ controls the balance between the magnitude of the values taken by $$\Phi$$ and the second (complexity) term. Introducing a Lagrange variable $$\beta \geq 0$$ associated with this constraint gives the equivalent problem
 * $$\underset{\alpha\geq0}{min}\frac{1}{m}\overset{m}{\underset{i=1}{\sum}} \Phi\left(1 - y_i\overset{T}{\underset{t=1}{\sum}} \alpha_th_t(x_i)\right) + \overset{T}{\underset{t=1}{\sum}}(\lambda\gamma_t + \beta)\alpha_t$$,

where $$\beta$$ can be freely chosen by the algorithm, since any choice of $$\beta$$ corresponds to a choice of $$\rho$$. Let $$\{h_1,...,h_N\}$$ be the set of distinct base functions. The objective function $$G$$ can then be written as
 * $$G(\alpha) = \frac{1}{m}\overset{m}{\underset{i=1}{\sum}} \Phi\left(1 - y_i\overset{N}{\underset{j=1}{\sum}} \alpha_jh_j(x_i)\right) + \overset{N}{\underset{j=1}{\sum}}(\lambda\gamma_j + \beta)|\alpha_j|$$

with $$\alpha = (\alpha_1,...,\alpha_N) \in {\mathbb{R}}^N$$. The condition $$\alpha \geq 0$$ can be dropped thanks to the symmetry of the hypothesis sets, since $$\alpha_jh_j = (-\alpha_j)(-h_j)$$, and coordinate descent can then be applied to the objective function.
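The objective above can be evaluated directly. The following stdlib-only Python sketch uses the exponential surrogate $$\Phi(u) = e^u$$ (so $$\Phi(-u) = e^{-u}$$ upper bounds the 0/1 indicator); the hypotheses, sample, and complexity values $$\gamma_j$$ are illustrative assumptions.

```python
import math

def deepboost_objective(S, hypotheses, alpha, gammas, lam, beta):
    # G(alpha) = (1/m) sum_i Phi(1 - y_i f(x_i))
    #          + sum_j (lam*gamma_j + beta) * |alpha_j|
    # with the exponential surrogate Phi(u) = exp(u) (assumed choice)
    m = len(S)
    loss = sum(math.exp(1 - y * sum(a * h(x) for a, h in zip(alpha, hypotheses)))
               for x, y in S) / m
    penalty = sum((lam * g + beta) * abs(a) for a, g in zip(alpha, gammas))
    return loss + penalty

h1 = lambda x: 1 if x >= 0.0 else -1   # assumed stump from H_1
h2 = lambda x: 1 if x >= 0.5 else -1   # assumed stump from H_2
S = [(-1.0, -1), (0.7, 1)]

g0 = deepboost_objective(S, [h1, h2], [0.0, 0.0], [0.05, 0.15], 0.1, 0.01)
g1 = deepboost_objective(S, [h1, h2], [0.5, 0.5], [0.05, 0.15], 0.1, 0.01)
```

On this toy sample, the zero ensemble pays the full surrogate loss $$\Phi(1) = e$$ per point, and adding weight on accurate hypotheses lowers the objective despite the regularization term.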

Let $$\alpha_t = (\alpha_{t,1},...,\alpha_{t,N})^\top$$ denote the vector obtained after $$t \geq 1$$ iterations, with $$\alpha_0 = 0$$. Let $$e_k$$ denote the $$k$$th unit vector in $$\mathbb{R}^N$$, $$k \in [1,N]$$. The direction $$e_k$$ and the step $$\eta$$ selected at the $$t$$th round are those minimizing $$G(\alpha_{t-1} + {\eta}e_k)$$, that is
 * $$G(\alpha_{t-1} + {\eta}e_k) = \frac{1}{m}\overset{m}{\underset{i=1}{\sum}} \Phi(1 - y_if_{t-1}(x_i) - {\eta}y_ih_k(x_i)) + \underset{j{\neq}k}{\sum}(\lambda\gamma_j + \beta)|\alpha_{t-1,j}| + (\lambda\gamma_k + \beta)|\alpha_{t-1,k} + \eta|$$,

where $$f_{t-1} = \sum_{j=1}^N \alpha_{t-1,j}h_j$$. For any $$t \in [1,T]$$, $$D_t$$ is the distribution defined by
 * $$D_t(i) = \frac{\Phi'(1 - y_if_{t-1}(x_i))}{S_t}$$,

where $$S_t$$ is a normalization factor, $$S_t = \sum_{i=1}^m \Phi'(1 - y_if_{t-1}(x_i))$$. For any $$s \in [1,T]$$ and $$j \in [1,N]$$, let the weighted error of hypothesis $$h_j$$ with respect to the distribution $$D_s$$ be given by
 * $$\epsilon_{s,j} = \frac{1}{2}\left[1 - \underset{i{\sim}D_s}{E}[y_ih_j(x_i)]\right]$$.
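With the exponential surrogate (so that $$\Phi'(u) = e^u$$), the distribution $$D_t$$ and the weighted errors can be sketched as follows. This is a stdlib-only illustration: the sample, the previous ensemble $$f_{t-1}$$, and the stump are assumed inputs.

```python
import math

def distribution(S, f_prev):
    # D_t(i) = Phi'(1 - y_i f_{t-1}(x_i)) / S_t, with Phi'(u) = exp(u)
    w = [math.exp(1 - y * f_prev(x)) for x, y in S]
    s_t = sum(w)  # normalization factor S_t
    return [wi / s_t for wi in w], s_t

def weighted_error(S, D, h):
    # eps_{s,j} = (1/2) * [1 - E_{i ~ D_s}[y_i h_j(x_i)]]
    corr = sum(d * y * h(x) for d, (x, y) in zip(D, S))
    return 0.5 * (1 - corr)

S = [(-1.0, -1), (0.2, 1), (0.7, 1)]
D1, S1 = distribution(S, lambda x: 0.0)  # f_0 = 0, so D_1 is uniform
stump = lambda x: 1 if x >= 0.5 else -1
eps = weighted_error(S, D1, stump)
```

Here the stump misclassifies one of three equally weighted points, so its weighted error comes out to 1/3; a perfectly correlated hypothesis would give 0 and a perfectly anti-correlated one would give 1.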

DeepBoost
DEEPBOOST($$S = ((x_1,y_1),...,(x_m,y_m))$$)
 for $$i \leftarrow 1$$ to $$m$$ do
  $$D_1(i) \leftarrow \frac{1}{m}$$
 for $$t \leftarrow 1$$ to $$T$$ do
  for $$j \leftarrow 1$$ to $$N$$ do
   if $$\alpha_{t-1,j} \neq 0$$ then
    $$d_j \leftarrow \left(\epsilon_{t,j} - \frac{1}{2}\right) + sgn(\alpha_{t-1,j})\frac{(\lambda\gamma_j + \beta)m}{2S_t}$$
   elseif $$\left|\epsilon_{t,j} - \frac{1}{2}\right| \leq \frac{(\lambda\gamma_j + \beta)m}{2S_t}$$ then
    $$d_j \leftarrow 0$$
   else $$d_j \leftarrow \left(\epsilon_{t,j} - \frac{1}{2}\right) - sgn\left(\epsilon_{t,j} - \frac{1}{2}\right)\frac{(\lambda\gamma_j + \beta)m}{2S_t}$$
  $$k \leftarrow \underset{j\in[1,N]}{argmax}|d_j|$$
  $$\epsilon_t \leftarrow \epsilon_{t,k}$$
  if $$\left|(1-\epsilon_t)e^{\alpha_{t-1,k}} - \epsilon_te^{-\alpha_{t-1,k}}\right| \leq \frac{(\lambda\gamma_k + \beta)m}{2S_t}$$ then
   $$\eta_t \leftarrow -\alpha_{t-1,k}$$
  elseif $$(1-\epsilon_t)e^{\alpha_{t-1,k}} - \epsilon_te^{-\alpha_{t-1,k}} > \frac{(\lambda\gamma_k + \beta)m}{2S_t}$$ then
   $$\eta_t \leftarrow \log\left[-\frac{(\lambda\gamma_k + \beta)m}{2\epsilon_tS_t} + \sqrt{\left[\frac{(\lambda\gamma_k + \beta)m}{2\epsilon_tS_t}\right]^2 + \frac{1-\epsilon_t}{\epsilon_t}}\right]$$
  else $$\eta_t \leftarrow \log\left[\frac{(\lambda\gamma_k + \beta)m}{2\epsilon_tS_t} + \sqrt{\left[\frac{(\lambda\gamma_k + \beta)m}{2\epsilon_tS_t}\right]^2 + \frac{1-\epsilon_t}{\epsilon_t}}\right]$$
  $$\alpha_t \leftarrow \alpha_{t-1} + \eta_te_k$$
  $$S_{t+1} \leftarrow \sum_{i=1}^m \Phi'(1 - y_i\sum_{j=1}^N \alpha_{t,j}h_j(x_i))$$
  for $$i \leftarrow 1$$ to $$m$$ do
   $$D_{t+1}(i) \leftarrow \frac{\Phi'(1-y_i\sum_{j=1}^N \alpha_{t,j}h_j(x_i))}{S_{t+1}}$$
 $$f \leftarrow \sum_{j=1}^N \alpha_{T,j}h_j$$
 return $$f$$
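The pseudocode can be turned into a compact implementation. The stdlib-only Python sketch below uses the exponential surrogate ($$\Phi(u) = e^u$$, so $$\Phi'(u) = e^u$$); the toy sample, the stump hypotheses, the complexity values in `gammas`, and the parameter defaults are all illustrative assumptions, and the clamping of $$\epsilon_t$$ is a numerical guard not present in the pseudocode. It is a sketch of the coordinate-descent structure, not the authors' reference implementation.

```python
import math

def sgn(v):
    return 1.0 if v > 0 else (-1.0 if v < 0 else 0.0)

def deepboost(S, hypotheses, gammas, lam=0.1, beta=0.01, T=20):
    m, N = len(S), len(hypotheses)
    preds = [[h(x) for x, _ in S] for h in hypotheses]
    ys = [y for _, y in S]
    alpha = [0.0] * N
    for _ in range(T):
        # distribution D_t and normalization factor S_t
        margins = [sum(alpha[j] * preds[j][i] for j in range(N))
                   for i in range(m)]
        w = [math.exp(1 - ys[i] * margins[i]) for i in range(m)]
        s_t = sum(w)
        D = [wi / s_t for wi in w]
        # weighted errors eps_{t,j}
        eps = [0.5 * (1 - sum(D[i] * ys[i] * preds[j][i] for i in range(m)))
               for j in range(N)]
        # regularized gradients d_j (the three cases of the pseudocode)
        d = [0.0] * N
        for j in range(N):
            reg = (lam * gammas[j] + beta) * m / (2 * s_t)
            if alpha[j] != 0:
                d[j] = (eps[j] - 0.5) + sgn(alpha[j]) * reg
            elif abs(eps[j] - 0.5) > reg:
                d[j] = (eps[j] - 0.5) - sgn(eps[j] - 0.5) * reg
        k = max(range(N), key=lambda j: abs(d[j]))
        e = min(max(eps[k], 1e-10), 1 - 1e-10)  # numerical guard (assumed)
        reg = (lam * gammas[k] + beta) * m / (2 * s_t)
        crit = (1 - e) * math.exp(alpha[k]) - e * math.exp(-alpha[k])
        # closed-form step eta_t (the three cases of the pseudocode)
        if abs(crit) <= reg:
            eta = -alpha[k]
        else:
            q = reg / e
            root = math.sqrt(q * q + (1 - e) / e)
            eta = math.log(root - q) if crit > reg else math.log(root + q)
        alpha[k] += eta
    return lambda x: sum(a * h(x) for a, h in zip(alpha, hypotheses))

# toy 1-D sample; three stumps treated as one family with gamma = 0.05
S = [(-1.0, -1), (-0.5, -1), (0.5, 1), (1.0, 1)]
stumps = [lambda x, t=t: 1 if x >= t else -1 for t in (-0.75, 0.0, 0.75)]
f = deepboost(S, stumps, gammas=[0.05, 0.05, 0.05])
train_err = sum(1 for x, y in S if y * f(x) <= 0) / len(S)
```

On this separable toy sample, the coordinate descent concentrates its weight on the stump at threshold 0 and drives the training error to zero; with several families, the per-hypothesis `gammas` entries would differ, penalizing steps on hypotheses from more complex families.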