
PCA and Information Theory
The claim that PCA, as used for dimensionality reduction, preserves most of the information in the data is misleading. Indeed, without any assumption on the signal model, PCA cannot be shown to reduce the amount of information lost during dimensionality reduction, where information is measured using Shannon entropy.

Under the assumption that


 * $$\mathbf{x}=\mathbf{s}+\mathbf{n}$$

i.e., that the data vector $$\mathbf{x}$$ is the sum of the desired information-bearing signal $$\mathbf{s}$$ and a noise signal $$\mathbf{n}$$, one can show that PCA can be optimal for dimensionality reduction also from an information-theoretic point of view.

In particular, Linsker showed that if $$\mathbf{s}$$ is Gaussian and $$\mathbf{n}$$ is Gaussian noise with a covariance matrix proportional to the identity matrix, then PCA maximizes the mutual information $$I(\mathbf{y};\mathbf{s})$$ between the desired information $$\mathbf{s}$$ and the dimensionality-reduced output $$\mathbf{y}=\mathbf{W}_L^T\mathbf{x}$$.
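
A brief sketch of the underlying computation (with notation not used elsewhere in this section: $$\boldsymbol{\Sigma}_s$$ denotes the covariance of $$\mathbf{s}$$, $$\sigma^2$$ the noise variance, and $$\lambda_i$$ the eigenvalues of $$\boldsymbol{\Sigma}_s$$): assuming $$\mathbf{s}$$ and $$\mathbf{n}$$ are independent, $$(\mathbf{y},\mathbf{s})$$ is jointly Gaussian and the mutual information takes the closed form


 * $$I(\mathbf{y};\mathbf{s})=\tfrac{1}{2}\log\det\left(\mathbf{W}_L^T(\boldsymbol{\Sigma}_s+\sigma^2\mathbf{I})\mathbf{W}_L\right)-\tfrac{1}{2}\log\det\left(\sigma^2\,\mathbf{W}_L^T\mathbf{W}_L\right).$$

Since the mutual information is unchanged by an invertible recombination of the components of $$\mathbf{y}$$, one may restrict $$\mathbf{W}_L$$ to orthonormal columns; the expression is then maximized when the columns span the leading eigenvectors of $$\boldsymbol{\Sigma}_s+\sigma^2\mathbf{I}$$, which coincide with the leading eigenvectors of $$\boldsymbol{\Sigma}_s$$, i.e., by the PCA projection, yielding


 * $$I(\mathbf{y};\mathbf{s})=\tfrac{1}{2}\sum_{i=1}^{L}\log\left(1+\frac{\lambda_i}{\sigma^2}\right)$$

with $$\lambda_1\ge\dots\ge\lambda_L$$ the largest eigenvalues of $$\boldsymbol{\Sigma}_s$$.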

If the noise is still Gaussian and has a covariance matrix proportional to the identity matrix (i.e., the components of the vector $$\mathbf{n}$$ are iid), but the information-bearing signal $$\mathbf{s}$$ is non-Gaussian (which is a common scenario), PCA at least minimizes an upper bound on the information loss, which is defined as


 * $$I(\mathbf{x};\mathbf{s})-I(\mathbf{y};\mathbf{s}).$$
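
In the Gaussian case, minimizing this information loss is equivalent to maximizing $$I(\mathbf{y};\mathbf{s})$$, since $$I(\mathbf{x};\mathbf{s})$$ does not depend on the projection. The following minimal Python sketch illustrates that case numerically, comparing the closed-form Gaussian mutual information achieved by the PCA projection with that of a random projection under white noise; the dimensions, noise variance, and variable names are illustrative choices, not taken from the text.

```python
import numpy as np

rng = np.random.default_rng(0)

d, L, sigma2 = 8, 2, 0.5          # ambient dimension, reduced dimension, noise variance

# Random signal covariance (positive semi-definite) and white-noise covariance.
A = rng.standard_normal((d, d))
Sigma_s = A @ A.T                 # covariance of the information-bearing signal s
Sigma_n = sigma2 * np.eye(d)      # iid Gaussian noise, covariance proportional to identity
Sigma_x = Sigma_s + Sigma_n       # covariance of the observed data x = s + n

def mutual_information(W):
    """I(y; s) for y = W^T x, valid when s and n are independent and jointly Gaussian:
    I = 1/2 log det(W^T Sigma_x W) - 1/2 log det(W^T Sigma_n W)."""
    _, logdet_y = np.linalg.slogdet(W.T @ Sigma_x @ W)
    _, logdet_y_given_s = np.linalg.slogdet(W.T @ Sigma_n @ W)
    return 0.5 * (logdet_y - logdet_y_given_s)

# PCA projection: leading L eigenvectors of the data covariance.
eigvals, eigvecs = np.linalg.eigh(Sigma_x)
W_pca = eigvecs[:, -L:]

# A random orthonormal projection for comparison.
W_rand, _ = np.linalg.qr(rng.standard_normal((d, L)))

print("I(y;s) with PCA projection:   ", mutual_information(W_pca))
print("I(y;s) with random projection:", mutual_information(W_rand))
# With white Gaussian noise, the PCA projection attains the larger value,
# i.e., the smaller information loss I(x;s) - I(y;s).
```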

The optimality of PCA is also preserved if the noise $$\mathbf{n}$$ is iid and at least more Gaussian (in terms of the Kullback-Leibler divergence) than the information-bearing signal $$\mathbf{s}$$. In general, even if the above signal model holds, PCA loses its information-theoretic optimality as soon as the components of the noise $$\mathbf{n}$$ become dependent.
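
To see the effect of dependent noise, one can repeat the numerical sketch above with a correlated noise covariance and compare plain PCA against a noise-whitened projection (the leading generalized eigenvectors of the data and noise covariances). This is only an illustrative comparison under a jointly Gaussian model; all dimensions and variable names are again chosen for the example.

```python
import numpy as np
from scipy.linalg import eigh as generalized_eigh  # solves  A v = lambda B v

rng = np.random.default_rng(1)
d, L = 8, 2                                   # illustrative ambient and reduced dimensions

def gaussian_mi(W, Sigma_x, Sigma_n):
    """I(y;s) for y = W^T x when s and n are independent and jointly Gaussian."""
    _, ld_y = np.linalg.slogdet(W.T @ Sigma_x @ W)
    _, ld_y_given_s = np.linalg.slogdet(W.T @ Sigma_n @ W)
    return 0.5 * (ld_y - ld_y_given_s)

# Signal covariance and a *correlated* (dependent) noise covariance.
A = rng.standard_normal((d, d))
Sigma_s = A @ A.T
B = rng.standard_normal((d, d))
Sigma_n = 0.1 * np.eye(d) + 0.5 * (B @ B.T)   # noise components are no longer independent
Sigma_x = Sigma_s + Sigma_n

# Plain PCA: leading eigenvectors of the data covariance.
_, V = np.linalg.eigh(Sigma_x)
W_pca = V[:, -L:]

# Noise-whitened alternative: leading generalized eigenvectors of (Sigma_x, Sigma_n).
_, U = generalized_eigh(Sigma_x, Sigma_n)
W_whitened = U[:, -L:]

print("I(y;s), plain PCA projection:     ", gaussian_mi(W_pca, Sigma_x, Sigma_n))
print("I(y;s), noise-whitened projection:", gaussian_mi(W_whitened, Sigma_x, Sigma_n))
# With dependent noise the whitened projection attains at least as large a value,
# typically strictly larger, so plain PCA is no longer optimal in this sense.
```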