User:Pakoch/sandbox

In applied mathematics, K-SVD is a dictionary learning algorithm for creating a dictionary for sparse representations, via a singular value decomposition approach. K-SVD is a generalization of the k-means clustering method, and it works by iteratively alternating between sparse coding the input data based on the current dictionary, and updating the atoms in the dictionary to better fit the data. K-SVD can be found widely in use in applications such as image processing, audio processing, biology, and document analysis.

Sparse representation with fixed dictionaries
Many applications, such as the JPEG 2000 image standard, leverage the fact that many signals (in the case of JPEG 2000, natural images) can be represented sparsely as linear combinations of elements from known dictionaries, such as wavelets, curvelets, or contourlets. Given such a dictionary of K signal atoms, usually represented as the columns of a matrix $$D \in \mathbb{R}^{n \times K}$$, and a target vector $$y \in \mathbb{R}^n$$, the goal of sparse approximation is to find a sparse coefficient vector $$x \in \mathbb{R}^K$$ that reconstructs y well. This goal is often formulated in two similar ways:



\min \limits _x \|y - Dx\|_2 \quad \text{subject to } \|x\|_0 < T_0 $$ or

\min \limits _x \|x\|_0 \quad \text{subject to } \|y - Dx\|_2 < \epsilon $$

where $$\|.\|_2$$ indicates the L2 norm and $$\|.\|_0$$ indicates the L0 norm, which counts the number of nonzero elements in a vector. Here, T0 and ε are fixed tolerances on the sparsity of the coefficient vector and the reconstruction error, respectively. Finding the truly optimal x for either of these equations, though, is NP-hard, and many algorithms exist for approximating the optimal solution. The sparse coding step of the K-SVD algorithm requires any algorithm that solves the first equation for a fixed T0.

Dictionary learning
Though using predefined dictionaries can simplify the computation of sparse representations, tailoring the set of signal atoms to a specific application can outperform general-purpose dictionaries in some contexts. Given a set of N training vectors y1, y2, ... yN, the goal of dictionary learning is to find a dictionary D that allows for the best sparse representation. As before, this can be achieved by constraining sparsity and minimizing reconstruction error, or vice-versa:

or

\min \limits _{D,X} \sum \limits _i \|x_i\|_0 \quad \text{subject to } \|Y - DX\|_F^2 \le \epsilon $$

where xi is the ith row of matrix X and $$\|.\|_F$$ indicates the Frobenius norm.

Relation to vector quantization
Vector quantization can be considered as extreme form of this dictionary learning problem, wherein each yi must be represented by exactly one signal atom; equivalently, in the first equation above, each xi must have a single nonzero entry, and that entry must be 1. K-means clustering, the inspiration for the K-SVD algorithm, is a method for solving this problem, and as shown below, K-SVD is a direct generalization of K-means; enforcing this additional constraint on X makes K-SVD identical to the K-means algorithm.

K-SVD algorithm
The K-SVD algorithm alternates between two steps: finding the best representation X for the input data Y (using any sparse approximation algorithm that can find such a representation for a fixed T0, such as orthogonal matching pursuit (OMP)), and updating the dictionary D to better fit the data. To accelerate convergence, the coefficient matrix is also updated in a limited fashion during the second step.

Sparse coding step
Any algorithm that approximately solves equation ($$) can be used in this step. Orthogonal matching pursuit (OMP) is readily made for this task, though other methods, such as basis pursuit and FOCUSS can be modified trivially to satisfy the sparsity condition.

Dictionary update step
The K-SVD algorithm then turns to the more difficult task of updating the dictionary elements to further reduce the reconstruction error. This is done by considering only one column dk of D at a time and using the singular value decomposition (SVD) to find a replacement for it that removes the most overall error, while holding the other columns constant. In contrast to the k-means algorithm, wherein the coefficients remain fixed while the dictionary is updated, the kth row of X (denoted x(k)) is also modified. This provides the algorithm a more current X for use in updating the subsequent columns, resulting in further error reduction and faster convergence at the cost of prohibiting parallelism in computing the column updates.

To update the kth column of D, write the product of D and X as a sum of outer products and define Ek, the error contributed by all atoms except the kth one, as



\begin{align} \|Y - DX\|_F^2 & = \|Y - \sum \limits _{j=1}^K d_jx_{(j)} \|_F^2 \\ & = \|(Y - \sum \limits _{j \ne k} d_jx_{(j)}) - d_kx_{(k)} \|_F^2 \\ & = \|E_k - d_kx_{(k)} \|_F^2 \end{align} $$

Because all terms in Ek are being held constant for the update of this column, the search for the dk and x(k) that minimize the above expression is equivalent to finding the rank-1 matrix that is closest to Ek in the Frobenius norm sense. This is easily found using the SVD, by letting dk be the first left singular vector of Ek and x(k) be the first right singular vector scaled by the first singular value. However, this will not preserve the sparsity of X, and so, measures must be taken to continue meeting the sparsity constraint.

To this end, the K-SVD algorithm updates only the nonzero entries of x(k). This is done by ignoring columns of X that do not "use" the kth signal atom; define ωk as the set of indices pointing to training vectors that use x(k):



\omega_k = \{ i \mid 1 \le i \le N, x_{(k)}(i) \ne 0 \} $$

Furthermore, define Ωk as an $$N \times \left\vert \omega_k \right\vert$$ matrix, with ones on the (i, ωk(i))th entries and zeros elsewhere. The product $$x_{(k)}^R = x_{(k)}\Omega_k$$ is then a reduced version of x(k), containing only the nonzero entries. Similarly, $$Y_k^R = Y\Omega_k$$ is the set of training vectors that currently use the kth atom, and $$E_k^R = E_k\Omega_k$$ selects the error columns associated with those training vectors.

This provides a way to selectively update only the nonzero entries of x(k) by performing the update on $$x_{(k)}^R$$ and mapping the new values back onto x(k). Now, the column update minimization problem is



\|E_k\Omega_k - d_kx_{(k)}\Omega_k\|_F^2 = \|E_k^R - d_kx_{(k)}^R\|_F^2 $$

which can be solved straightforwardly using the SVD. Decomposing $$E_k^R = U\Delta V$$, the updated dk is chosen to be the first column of U and $$x_{(k)}^R$$ to be the first column of V multiplied by Δ(1,1). The new entries in $$x_{(k)}^R$$ are then mapped back to the nonzero entries of x(k) from which they came. This update preserves both the normalization of the columns of D and the sparsity of the coefficient matrix X.

Initialization and stopping rule
In order to begin the algorithm, D must be initialized. As in the k-means algorithm, since the number of training vectors is much larger than the number of dictionary atoms, D can be set to a random selection of K unique columns of Y, then scaled to each have unit L2 norm. The algorithm can be terminated after a fixed number of iterations, or once the change in overall error between successive iterations becomes smaller than some tolerance ε.

Convergence
The overall error, $$\|Y - DX\|_F^2$$, is guaranteed to decrease with every application of the dictionary update procedure as a byproduct of the properties of the SVD. However, nothing guarantees that the new coefficient matrix produced in each sparse coding step will reduce the error over the previous one. This can be remedied with external interference: if the new X computed does not reduce the overall error, retain the previous X and continue with the dictionary update step. With this modification, the error is guaranteed to converge (though in practice, this is seldom needed, as many modern sparse coding algorithms perform very well). Note, though, that the dictionary update step is not suited to perform the dictionary learning process on its own, as it cannot change the support of X, and allowing training vectors to switch which signal atoms to use is important to both the K-SVD and k-means algorithms.

Example MATLAB implementation
Here,  can be an implementation of any function that approximately solves equation ($$), as noted above.

Limitations
Choosing an appropriate "dictionary" for a dataset is a non-convex problem, and K-SVD operates by an iterative update which does not guarantee to find the global optimum. However, this is common to other algorithms for this purpose, and K-SVD works fairly well in practice.