
Least Squares Sparse PCA (LS SPCA)
Sparse PCA (SPCA) is an extension of Principal Component Analysis (PCA), in which the components are computed as combinations of fewer variables than are available.

Unlike conventional SPCA methods, LS SPCA aims to create sparse principal components (SPCs) with the same optimality as the principal components (PCs) originally defined by Karl Pearson. Hence, the LS SPCs are orthogonal and sequentially best approximate the data matrix in a least-squares sense.

LS SPCA Variants
In optimal LS SPCA (USPCA, uncorrelated SPCA) the orthogonality constraints require that the cardinality of the solutions is not smaller than the order of the component. These constraints may also create numerical problems when computing components of order larger than two.

Correlated SPCA (CSPCA) is a variant of LS SPCA in which the orthogonality constraints are relaxed and the solutions are obtained iteratively by minimizing the norm of the approximation error from residuals orthogonal to the previously computed SPCs. Even though the resulting components are correlated (usually very mildly), they have lower cardinality and in many cases explain more variance than the corresponding USPCA solutions.

The computation of the USPCA and CSPCA solutions is demanding when the data matrix is large. With Projection SPCA (PSPCA), approximate CSPCA solutions can be computed much more efficiently by simply projecting the current first PC onto subsets of the variables. This means that the solutions can be computed with efficient linear regression routines. PSPCA is fast and can be shown to explain a proportion of the variance of the dataset comparable with that explained by CSPCA.
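The projection idea can be sketched in a few lines of NumPy. This is only an illustration of the single projection step, not the full PSPCA method: the helper name `pspca_step` and the choice of subset are hypothetical, and the sparse loadings simply come from an ordinary least-squares regression of the first PC on the chosen columns.

```python
import numpy as np

def pspca_step(X, subset):
    """Sketch of one PSPCA projection step (hypothetical helper):
    regress the first PC of X onto the columns listed in `subset`,
    returning a sparse p-vector of loadings (zeros outside `subset`)."""
    # First principal component scores via the SVD of the centered data.
    U, s, _ = np.linalg.svd(X, full_matrices=False)
    pc1 = U[:, 0] * s[0]

    # Ordinary least-squares regression of the PC on the chosen variables.
    coef, *_ = np.linalg.lstsq(X[:, subset], pc1, rcond=None)

    a = np.zeros(X.shape[1])
    a[subset] = coef
    return a

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 8))
X -= X.mean(axis=0)            # center the data, as the method assumes
a = pspca_step(X, [0, 2, 5])   # loadings with cardinality 3
```

Because the step is just a linear regression, any standard least-squares routine can be used, which is what makes PSPCA inexpensive on large matrices.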

Determination of the SPCs' cardinality
Like all sparse methods, LS SPCA requires that the cardinality (number of nonzero values) of the solutions is determined by some optimization criterion. In LS SPCA, a reasonable approach is to choose the lowest cardinality for which an LS SPC explains a given percentage of the cumulative variance explained by the corresponding standard PC.

Merola proposed forward and backward elimination algorithms to search for the optimal subsets of variables, which can be slow with large datasets. With PSPCA these optimal subsets can be computed efficiently using standard algorithms for regression feature selection. One possible strategy for computing the components is to use PSPCA to identify the candidate variables and then compute the SPCs with CSPCA.
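As a concrete illustration of the cardinality criterion, the greedy forward-selection sketch below grows a subset until the least-squares fit of the first PC on the chosen columns explains a given fraction of its variance. This is a generic feature-selection loop written for this example, not Merola's exact algorithm; the function name and threshold are hypothetical.

```python
import numpy as np

def forward_cardinality(X, pc, threshold=0.96):
    """Greedy forward selection (illustrative sketch): add variables one at
    a time until the regression of `pc` on the selected columns explains at
    least `threshold` of the PC's variance."""
    n, p = X.shape
    subset, remaining = [], list(range(p))
    total = pc @ pc
    while remaining:
        # Pick the variable that most improves the fit of the PC.
        best_j, best_r2 = None, -np.inf
        for j in remaining:
            cols = subset + [j]
            coef, *_ = np.linalg.lstsq(X[:, cols], pc, rcond=None)
            fitted = X[:, cols] @ coef
            r2 = (fitted @ fitted) / total   # explained share of the PC
            if r2 > best_r2:
                best_j, best_r2 = j, r2
        subset.append(best_j)
        remaining.remove(best_j)
        if best_r2 >= threshold:
            break
    return subset

rng = np.random.default_rng(1)
X = rng.standard_normal((40, 6))
X -= X.mean(axis=0)
U, s, _ = np.linalg.svd(X, full_matrices=False)
pc1 = U[:, 0] * s[0]
subset = forward_cardinality(X, pc1, threshold=0.9)
```

Since the first PC always lies in the column space of $$X$$, the loop is guaranteed to terminate with a subset no larger than the rank of the data matrix.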

Mathematical formulation
Assume that the $$n$$ independent observations on the $$p$$ variables of interest, centered to mean zero, $$x_j$$, are the columns of the data matrix, $$X$$. A set of $$d\leq p$$ components is defined as $$t_j = Xa_j,\, j = 1, \ldots, d$$, where $$a_j$$ denotes the $$j$$th $$p$$-vector of loadings.

Then, by denoting $$A$$ the matrix with columns $$a_j$$ and $$T = XA$$, the USPCs are derived as the solutions to the minimization problem:
 * $$\begin{align}
A = \text{arg }\min\quad & \vert\vert X - \Pi_{T}X \vert\vert^2 = \text{arg }\max\quad \vert\vert \Pi_{T}X\vert\vert ^2\\
\text{subject to}\quad & \text{card}\{a_j\}\leq p\\
& T'T = \text{diag}\{s_j\},
\end{align}$$

where $$\Pi_{T}$$ is the orthogonal projector onto the column space of $$T$$ and the maximization problem follows from Pythagoras's theorem.

Because of the orthogonality constraints, this problem can be solved sequentially for each $$a_j,\, j=1,\ldots, d$$ as:
 * $$\begin{align}
a_j = \text{arg }\min\quad & \vert\vert X - \Pi_{t_j}X \vert\vert^2 = \text{arg }\max\quad \vert\vert \Pi_{t_j}X\vert\vert ^2\\
\text{subject to}\quad & \text{card}\{a_j\}\leq p\\
& T_{[j-1]}'t_j = 0,
\end{align}$$

where $$T_{[j-1]}$$ denotes the first $$j-1$$ columns of $$T$$. Closed-form solutions to this problem for given cardinality are given in Merola.

In CSPCA the loadings are obtained iteratively by minimizing the norm of the approximation error from the residuals orthogonal to the previously computed SPCs. That is, if we let $$X^\perp_{j-1} = X - \Pi_{T_{[j-1]}}X$$, the $$j^\text{th}$$ vector of loadings is obtained as the solution to:
 * $$\begin{align}
a_j = \text{arg }\min\quad & \vert\vert X^\perp_{j-1} - \Pi_{t_j}X \vert\vert^2\\
\text{subject to}\quad & \text{card}\{a_j\} \leq p.
\end{align}$$

Without the sparsity constraints, the problems above are equivalent to Pearson's definition of PCA.

The PSPCA sparse loadings are obtained iteratively by projecting the first PC of the residual matrix $$X^\perp_{j-1}$$ onto subsets of the variables, that is, by solving the problems:
 * $$\begin{align}
a_j = \text{arg }\min\quad & \vert\vert r_{j} - \Pi_{t_j}r_{j} \vert\vert^2\\
\text{subject to}\quad & \text{card}\{a_j\} \leq p,
\end{align}$$

where $$r_{j}$$ is the first principal component of the residual matrix $$X^\perp_{j-1}$$.

Examples of LS SPCA results
As an example, Table 1 shows the results of USPCA, CSPCA and PSPCA applied to four row-rank-deficient matrices. For each of the first three SPCs, the table shows the cardinality and the cumulative variance they explain as a percentage of the cumulative variance explained by the corresponding PCs. The reduction in cardinality is striking when considering that over 96% of the variance is explained, and in some cases 100% of it. The results for the three variants of LS SPCA are quite similar.

Comparison of LS SPCA with conventional SPCA
Conventional SPCA methods are derived from Harold Hotelling's definition of PCA, by which the PCs are the linear combinations of the variables with unit $$L_2$$ norm and orthogonal loadings which, sequentially, have the largest variance (the $$L_2$$ norm of the SPCs). While this and Pearson's definitions of PCA lead to the same solution, when sparsity constraints are added this is no longer true.
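The equivalence of the two definitions in the unconstrained case can be checked numerically. The sketch below uses NumPy and random data: the top right singular vector of $$X$$ (Hotelling's variance-maximizing direction) also gives a smaller rank-one reconstruction error (Pearson's criterion) than any other unit direction, here compared against one arbitrary alternative.

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.standard_normal((100, 4))
X -= X.mean(axis=0)

# Hotelling: the unit-norm loadings maximizing the variance of t = X a
# are the first right singular vector of X.
_, _, Vt = np.linalg.svd(X, full_matrices=False)
a = Vt[0]

# Pearson: the same direction minimizes the reconstruction error
# ||X - t a'||^2, by the best rank-one approximation property of the SVD.
t = X @ a
err_pca = np.linalg.norm(X - np.outer(t, a))**2

# An arbitrary alternative unit direction gives a larger error.
b = np.ones(4) / 2.0          # unit L2 norm
tb = X @ b
err_other = np.linalg.norm(X - np.outer(tb, b))**2
```

Once a cardinality constraint is imposed, this equivalence breaks down, which is why conventional SPCA and LS SPCA produce different components.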

Because of the way they are derived, conventional SPCs are the PCs of subsets of highly correlated variables. Instead, LS SPCs are combinations of variables that are unlikely to be highly correlated. Furthermore, conventional methods fail to identify the lowest-cardinality representation of the PCs of column-rank-deficient matrices, the very task for which they were created.

Figure 1 compares the norm, the relative cumulative variance explained, and the correlation with the first PC for the first SPCs obtained with conventional SPCA and PSPCA on four datasets. Details on the datasets can be found in Merola. It is easy to see that the PSPCs explain more variance and reach a correlation close to one with the PC at a much smaller cardinality, despite the much larger norm of the conventional SPCA components.

As shown in Table 1, the datasets Khanh and Rama have far fewer rows than columns. For this reason, the PSPCs are identical to the PCs when their cardinality is equal to the rank of the matrices. Instead, the conventional SPCs are computed with a cardinality much larger than the matrices' rank (hence including perfectly correlated variables) without reaching the PCs' performance.