
Matrix learning is a subfield of machine learning consisting of problems for which the parameter being learned is a matrix.

Many common matrix learning problems can be put into the following general format. The training set is $$ S = \{ (X_i^t, y_i^t) \} $$ for $$ t = 1, \ldots, T $$ and $$ i = 1, \ldots, n_t $$, where $$ X_i^t \in \mathbb{R}^{d \times T} $$ and $$ y_i^t \in \mathbb{R} $$. We assume a regression model


 * $$ y_i^t = \langle W, X_i^t \rangle_F + \epsilon_i^t $$

where $$ W $$ is the $$ d \times T $$ matrix we would like to learn, $$ \epsilon_i^t $$ is a noise term, and $$ \langle A, B \rangle_F $$ is the Frobenius inner product


 * $$ \langle A, B \rangle_F = \sum_{i,j} A_{i,j}B_{i,j}. $$
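
As a quick numerical check of this definition, here is a minimal numpy sketch (the matrices are arbitrary illustrations):

```python
import numpy as np

# Frobenius inner product: <A, B>_F = sum_{i,j} A_{i,j} B_{i,j}.
def frobenius_inner(A, B):
    return np.sum(A * B)

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 4))
B = rng.standard_normal((3, 4))

# Equivalent trace form: <A, B>_F = tr(A^T B).
assert np.isclose(frobenius_inner(A, B), np.trace(A.T @ B))
```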

Our prior knowledge about $$ W $$ often reflects the matrix structure of $$ W $$; for example, we might believe $$ W $$ to be low rank. One approach to this problem is therefore to use penalized regression, where the penalization term reflects the matrix structure of $$ W $$.

Examples of matrix learning problems
In the examples below, we will use tensor notation: for vectors $$ a $$ and $$ b $$, we will let $$ a \otimes b$$ denote a matrix whose $$ (i,j) $$-th entry is $$ a_j \cdot b_i $$. Then for a vector $$ c $$,
 * $$ (a \otimes b) c = \langle a, c \rangle b $$.

We will also let $$ e_t $$ denote the $$ t $$-th standard basis vector.
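
A minimal numpy sketch of this convention; note that with the $$ (i,j) $$-th entry defined as $$ a_j \cdot b_i $$, the matrix $$ a \otimes b $$ is $$ b a^T $$:

```python
import numpy as np

# Under the convention above, (a ⊗ b) has (i, j) entry a_j * b_i,
# i.e. it is the matrix b a^T, which np.outer gives with arguments (b, a).
def tensor(a, b):
    return np.outer(b, a)

rng = np.random.default_rng(1)
a = rng.standard_normal(4)
b = rng.standard_normal(3)
c = rng.standard_normal(4)

# Check the identity (a ⊗ b) c = <a, c> b.
assert np.allclose(tensor(a, b) @ c, np.dot(a, c) * b)
```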

Linear multi-task learning
Suppose that $$ X_i^t = e_t \otimes x_i^t $$ for vectors $$ x_i^t \in \mathbb{R}^d $$. Then we have $$ \langle W, X_i^t \rangle_F = \langle W_t, x_i^t \rangle, $$ where $$ W_t $$ is the $$ t $$-th column of $$ W $$. So our model is equivalent to


 * $$ y_i^t = \langle W_t, x_i^t \rangle+ \epsilon_i^t $$.

For linear multi-task learning, we might have prior knowledge that $$ W_t $$ and $$ W_{t'} $$ are similar.
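
As a quick check of the reduction above, here is a minimal numpy sketch (the dimensions are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)
d, T, t = 5, 3, 1

W = rng.standard_normal((d, T))
x = rng.standard_normal(d)

e_t = np.zeros(T)
e_t[t] = 1.0
X = np.outer(x, e_t)  # e_t ⊗ x under the convention above; a d x T matrix

# <W, X>_F picks out the inner product with the t-th column of W.
assert np.isclose(np.sum(W * X), np.dot(W[:, t], x))
```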

Multivariate regression
Multivariate regression is linear multi-task learning specialized to the case that $$ x_i^t $$ does not depend on $$ t $$.

Matrix completion
Suppose that $$ X_i^t = e_t \otimes e_i $$. Then we have $$ \langle W, X_i^t \rangle_F = W_{i,t}. $$ In this case, our model is equivalent to


 * $$ y_i^t =W_{i,t} + \epsilon_i^t $$.

In other words, we observe noisy versions of the entries of $$ W $$ and want to recover $$ W $$; in matrix completion, typically only a subset of the entries is observed. Here, for example, we might believe that $$ W $$ is low rank.
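
A minimal numpy sketch of this observation model, simulating noisy entries of a low-rank ground truth (the dimensions, rank, observation probability, and noise level are all arbitrary choices of this sketch):

```python
import numpy as np

rng = np.random.default_rng(3)
d, T, r = 8, 6, 2

# A rank-r ground truth W (the low-rank structure we hope to exploit).
W = rng.standard_normal((d, r)) @ rng.standard_normal((r, T))

# Observe each entry with probability 1/2, corrupted by Gaussian noise:
# y_i^t = W_{i,t} + eps_i^t on the observed entries.
mask = rng.random((d, T)) < 0.5
Y = np.where(mask, W + 0.1 * rng.standard_normal((d, T)), np.nan)
```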

Penalized regression for matrix learning
As with other machine learning problems, matrix learning problems of the form described above can be approached using penalized regression. For this approach, we choose a loss function $$ V: \mathbb{R} \times \mathbb{R} \rightarrow [0,\infty) $$ and define the empirical error to be


 * $$ \hat{\mathcal{E}}(W) := \sum_{i,t} V(y_i^t, \langle W, X_i^t \rangle_F) $$.

For example, a standard choice is to let $$ V(a,b) = (a-b)^2 $$. We also choose a regularization function $$ R(W) $$. We then define the optimal $$ W $$ to be
 * $$ W^* := \mbox{argmin}_W \{ \hat{\mathcal{E}}(W) + R(W) \} . $$

The regularization function is usually chosen to be $$ R(W) = \lambda N(W), $$ where $$ N(W) $$ is a norm of $$ W $$. There are several choices of norm, including:


 * The nuclear norm:
 * $$ ||W||_* := \sum_j |\sigma_j|, $$

where $$\sigma_j $$ are the singular values of $$ W $$;


 * more generally, $$ p $$-Schatten norms:
 * $$ ||W||_p := \left( \sum_j |\sigma_j|^p \right)^{1/p}; $$


 * and entrywise norms:
 * $$ |||W|||_p := \left( \sum_{j,t} |W_{j,t}|^p \right)^{1/p} $$.

Note that the Frobenius norm is equal to both $$ ||W||_2 $$ and $$ |||W|||_2 $$.
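
All three norms are easy to compute from the singular values; a minimal numpy sketch, including a check of the equality just noted:

```python
import numpy as np

rng = np.random.default_rng(4)
W = rng.standard_normal((5, 3))
sigma = np.linalg.svd(W, compute_uv=False)  # singular values of W

nuclear = sigma.sum()                                      # ||W||_*
schatten = lambda p: (sigma ** p).sum() ** (1.0 / p)       # ||W||_p
entrywise = lambda p: (np.abs(W) ** p).sum() ** (1.0 / p)  # |||W|||_p

# The Frobenius norm coincides with both p = 2 norms.
fro = np.linalg.norm(W, 'fro')
assert np.isclose(schatten(2), fro) and np.isclose(entrywise(2), fro)
```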

Proximal methods for solving penalized regression
If the loss function $$ V $$ is convex and differentiable with a Lipschitz continuous gradient, and if the regularization function $$ R $$ is proper, convex, and lower semi-continuous, then we can minimize $$ \hat{\mathcal{E}}(W) + R(W) $$ using proximal gradient descent. To use proximal gradient descent, though, we need to be able to compute the proximal operator


 * $$ prox_R(W) = \mbox{argmin}_{Z \in \mathbb{R}^{d \times T}} \{ R(Z) + \frac{1}{2} ||Z - W ||_2^2 \}. $$
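
The resulting iteration is $$ W_{k+1} = prox_{\eta R}(W_k - \eta \nabla \hat{\mathcal{E}}(W_k)) $$ for a step size $$ \eta $$. A minimal Python skeleton of this loop (the callables, the fixed step size, and the iteration count are assumptions of this sketch, not a prescribed implementation):

```python
def proximal_gradient(grad_E, prox, W0, step, n_iters=500):
    """Minimize E(W) + R(W) by iterating W <- prox_{step*R}(W - step * grad_E(W)).

    grad_E : callable, gradient of the smooth empirical error at W
    prox   : callable (W, s) -> proximal operator of s * R evaluated at W
    """
    W = W0.copy()
    for _ in range(n_iters):
        W = prox(W - step * grad_E(W), step)
    return W
```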

The proximal operator for the nuclear norm
When $$ R(W) = \lambda ||W||_* $$, we can compute $$ prox_R(W) $$ by first doing a singular value decomposition, $$ W = U\Sigma V^T $$. It can then be shown that


 * $$ prox_R(W) = U \, \mbox{diag}\!\left( S_{\lambda || \cdot ||_1}(\sigma(W)) \right) V^T $$

where $$ \sigma(W) $$ is the vector of singular values of $$ W $$, $$ \mbox{diag}(\cdot) $$ forms the corresponding diagonal matrix, and $$ S_{\lambda || \cdot ||_1} $$ is the proximal operator of $$ \lambda $$ times the $$ \ell_1 $$ norm, i.e., entrywise soft-thresholding. Since singular values are nonnegative, this maps each $$ \sigma_j $$ to $$ \max(\sigma_j - \lambda, 0) $$.
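
This operator is often called singular value thresholding. A minimal numpy sketch (the function name is ours):

```python
import numpy as np

def prox_nuclear(W, lam):
    """Proximal operator of lam * ||.||_*: soft-threshold the singular values."""
    U, sigma, Vt = np.linalg.svd(W, full_matrices=False)
    sigma_shrunk = np.maximum(sigma - lam, 0.0)  # prox of lam * l1 on nonnegative values
    return U @ np.diag(sigma_shrunk) @ Vt
```

Passed as the `prox` argument of the `proximal_gradient` skeleton above, e.g. as `prox=lambda W, s: prox_nuclear(W, s * lam)`, this gives an iterative solver for nuclear-norm-penalized problems.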