User:A.n.campero/sandbox

Structured sparsity regularization is a class of methods, and an area of research in statistical learning theory, that extend and generalize sparsity regularization learning methods. Both sparsity and structured sparsity regularization methods seek to exploit the assumption that the output variable $$ Y $$ (i.e., response, or dependent variable) to be learned can be described by a reduced number of variables in the input space $$ X $$ (i.e., the domain, space of features or explanatory variables). Sparsity regularization methods focus on selecting the input variables that best describe the output. Structured sparsity regularization methods generalize and extend sparsity regularization methods, by allowing for optimal selection over structures like groups or networks of input variables in $$ X $$.

Common motivation for the use of structured sparsity methods are model interpretability, high-dimensional learning (where dimensionality of $$ X $$ may be higher than the number of observations $$ n $$), and reduction of computational complexity. Moreover, structured sparsity methods allow to incorporate prior assumptions on the structure of the input variables, such as overlapping groups, non-overlapping groups, and acyclic graphs. Examples of uses of structured sparsity methods include face recognition, magnetic resonance image (MRI) processing , socio-linguistic analysis in natural language processing , and analysis of genetic expression in breast cancer.

Sparsity regularization
Consider the linear kernel regularized empirical risk minimization problem with a loss function $$ V(y_i, f(x)) $$ and the $$\ell_0$$ "norm" as the regularization penalty:


 * $$\min_{w\in\mathbb{R}^d} \frac{1}{n}\sum_{i=1}^n V(y_i, \langle w,x_i\rangle) + \lambda \|w\|_0, $$

where $$ x, w \in \mathbb{R^d} $$, and $$\|w\|_0$$ denotes the $$\ell_0$$ "norm", defined as the number of nonzero entries of the vector $$w$$. We say that $$f(x) = \langle w,x_i\rangle$$ is sparse if  $$\|w\|_0 = s < d$$. Which means that the output $$Y$$ can be described by a small subset of input variables.

More generally, assume a dictionary $$ \phi_j : X \rightarrow \mathbb{R} $$ whith $$j = 1,...,p $$ is given, such that the target function $$f(x)$$ of a learning problem can be written as:
 * $$f(x) = \sum_{j=1}^p \phi_j(x) w_j$$, $$ \forall x \in X $$

And define the $$\ell_0$$ norm $$\|f\|_0 = \|w\|_0$$ as the number of non-zero components of $$w$$
 * $$\|w\|_0 = | { j | w_j \neq 0, j = 1,...,p } |$$, where $$|A|$$ is the cardinality of set $$A$$.

We say that $$f$$ is sparse if $$\|f\|_0 = \|w\|_0 = s < d$$.

Structured sparsity regularization
Structured sparsity regularization extends and generalizes the variable selection problem that characterizes sparsity regularization. Consider the above regularized empirical risk minimization problem with a general kernel and associated feature map $$ \phi_j : X \rightarrow \mathbb{R} $$ with $$j = 1,...,p $$.


 * $$\min_{w\in\mathbb{R}^d} \frac{1}{n}\sum_{i=1}^n V(y_i, \langle w,\Phi(x_i)\rangle) + \lambda \|w\|_0, $$

The regularization term $$\lambda \|w\|_0 $$ penalizes each $$w_j$$ component independently, which means that the algorithm will suppress input variables independently from each other.

In several situations we may want to impose more structure in the regularization process, so that, for example, input variables are suppressed according to predefined groups. Structured sparsity regularization methods allow to impose such structure by adding structure to the norms defining the regularization term.

Non-overlapping groups: group Lasso
The non-overlapping group case is the most basic instance of structured sparsity. In it, we assume an a priori partition of the coefficient vector $$w$$ in $$G$$ non-overlapping groups. Let $$w_g$$ be the vector of coefficients in group $$g$$, we can define a regularization term and its group norm as


 * $$\lambda R(w)=\lambda\sum _^{G}\|w_{g}\|_{g} $$,

where $$ \|w_{g}\|_{g}$$ is the group $$\ell_2$$ norm $$ \|w_{g}\|_{g}= \sqrt{ \sum _^(w_{g}^{j})^{2}} $$,   $$G_g$$ is group $$g$$, and $$w_{g}^{j}$$ is the j-th component of group $$G_g$$.

The above norm is also referred to as group Lasso. This regularizer will force entire coefficient groups towards zero, rather than individual coefficients. As the groups are non-overlapping, the set of non-zero coefficients can be obtained as the union of the groups that were not set to zero, and conversely for the set of zero coefficients.

Overlapping groups
Overlapping groups is the structure sparsity case where a variable can belong to more than one group $$g$$. This case is often of interest as it can represent a more general class of relationships among variables than non-overlapping groups can, such as tree structures or other type of graphs.

There are two types of overlapping group sparsity regularization approaches, which are used to model different types of input variable relationships:

Intersection of compliments: group Lasso
The intersection of compliments approach is used in cases when we want to select only those input variables that have positive coefficients in all groups they belong to. Consider again the group Lasso for a regularized empirical risk minimization problem:


 * $$\lambda R(w)=\lambda\sum _^{G}\|w_{g}\|_{g} $$,

where $$ \|w_{g}\|_{g}$$ is the group $$\ell_2$$ norm,   $$G_g$$ is group $$g$$, and $$w_{g}^{j}$$ is the j-th component of group $$G_g$$.

As in the non-overlapping groups case, the group Lasso regularizer will potentially set entire groups of coefficients to zero. Selected variables are those with coefficients $$w_j > 0$$. However, as in this case groups may overlap, we take the intersection of the compliments of those groups that are not set to zero.

This intersection of complements selection criteria implies the modeling choice that we allow some coefficients within a particular group $$g$$ to be set to zero, while others within the same group $$g$$ may remain positive. In other words, coefficients within a group may differ depending on the several group memberships that each variable within the group may have.

Union of groups: latent group Lasso
A different approach is to consider union of groups for variable selection. This approach captures the modeling situation where variables can be selected as long as they belong at least to one group with positive coefficients. This modeling perspective implies that we want to preserve group structure.

The formulation of the union of groups approach is also referred to as latent group Lasso, and requires to modify the group $$\ell_2$$ norm considered above and introduce the following regularizer <ref name =


 * $$R(w)=inf\left\{\sum _{g}\|w_\|_:w=\sum _^{G}{\bar {w}}_{g}\right\}$$

where $$w\in {\mathbb {R^{d}}}$$,  $$w_\in G_{g}$$ is the vector of coefficients of group g, and $${\bar  {w}}_{g}\in {\mathbb  {R^{d}}}$$ is a vector with coefficients $$w_{g}^{j}$$ for all variables  $$j$$  in group  $$g$$, and  $$0$$  in all others, i.e, $${\bar  w}_{g}^{j}=w_{g}^{j}$$ if  $$j$$  in group  $$g$$  and $${\bar  w}_{g}^{j}=0$$ otherwise.

This regularizer can be interpreted as effectively replicating variables that belong to more than one group, therefore conserving group structure. As intended by the union of groups approach, requiring $$w=\sum _^{G}{\bar {w}}_{g}$$ produces a vector of weights w that effectively sums up the weights of all variables across all groups they belong to.

Best subset selection problem
The problem of choosing the best subset of input variables can be naturally formulated under a penalization framework as:


 * $$\min_{w\in\mathbb{R}^d} \frac{1}{n}\sum_{i=1}^n V(y_i, w, x_i) + \lambda \|w\|_0, $$

Where $$\|w\|_0$$ denotes the $$\ell_0$$ "norm", defined as the number of nonzero entries of the vector $$w$$.

Although this formulation makes sense from a modeling perspective, it is computationally unfeasible, as it is equivalent to an exhaustive search evaluating all possible subsets of variables.

Two main approaches for solving the optimization problem are: 1) greedy methods, such as step-wise regression in statistics, or matching pursuit in signal processing; and 2) convex relaxation formulation approaches and proximal gradient optimization methods.

Convex relaxation
A natural approximation for the best subset selection problem is the $$\ell_1$$ norm regularization :


 * $$ \min_{w\in\mathbb{R}^d} \frac{1}{n}\sum_{i=1}^n V(y_i, w, x_i) + \lambda \|w\|_1$$

Such as scheme is called basis pursuit or the Lasso, which substitutes the $$\ell_0$$ "norm" for the convex, non-differentiable $$\ell_1$$ norm.

Proximal gradient methods
Proximal gradient methods, also called forward-backward splitting, are optimization methods useful for minimizing functions with a convex and differentiable component, and a convex potentially non-differentiable component.

As such, proximal gradient methods are useful for solving sparsity and structured sparsity regularization problems of the following form:


 * $$ \min_{w\in\mathbb{R}^d} \frac{1}{n}\sum_{i=1}^n V(y_i, w, x_i) + R(w)$$

Where $$V(y_i, w, x_i) $$ is a convex and differetiable loss function like the quadratic loss, and $$R(w) $$ is a convex potentially non-differentiable regularizer such as the $$\ell_1$$ norm.

Uses and applications
Structured sparsity regularization methods have been used in a number of settings where it is desired to impose an a priori input variable structure to the regularization process. Some such applications are:


 * Compressive sensing in magnetic resonance imaging (MRI), reconstructing MR images from a small number of measurements, potentially yielding significant reductions in MR scanning time.
 * Robust face recognition in the presence of misalignment, occlusion and illumination variation.
 * Uncovering socio-linguistic associations between lexical frequencies used by tweeter authors, and the socio-demographic variables of their geographic communities.
 * Gene selection analysis of breast cancer data using priors of overlapping groups, e.g., biologically meaningful gene sets.