Convolutional sparse coding

The convolutional sparse coding paradigm is an extension of the global sparse coding model, in which a redundant dictionary is modeled as a concatenation of circulant matrices. While the global sparsity constraint describes signal $\mathbf{x}\in \mathbb{R}^{N}$ as a linear combination of a few atoms in the redundant dictionary $\mathbf{D}\in\mathbb{R}^{N\times M}, M\gg N$, usually expressed as $\mathbf{x}=\mathbf{D}\mathbf{\Gamma}$  for a sparse vector $\mathbf{\Gamma}\in \mathbb{R}^{M}$ , the alternative dictionary structure adopted by the convolutional sparse coding model allows the sparsity prior to be applied locally instead of globally: independent patches of $\mathbf{x}$  are generated by "local" dictionaries operating over stripes of $\mathbf{\Gamma}$.

The local sparsity constraint allows stronger uniqueness and stability conditions than the global sparsity prior, and has been shown to be a versatile tool for inverse problems in fields such as image understanding and computer vision. In addition, a recently proposed multi-layer extension of the model has shown conceptual benefits for more complex signal decompositions, as well as a tight connection to the convolutional neural network model, allowing a deeper understanding of how the latter operates.

Overview
Given a signal of interest $\mathbf{x}\in \mathbb{R}^{N}$ and a redundant dictionary $\mathbf{D}\in\mathbb{R}^{N\times M}, M\gg N$, the sparse coding problem consists of retrieving a sparse vector $\mathbf{\Gamma}\in \mathbb{R}^{M}$, denominated the sparse representation of $\mathbf{x}$, such that $\mathbf{x}= \mathbf{D}\mathbf{\Gamma}$. Intuitively, this implies that $\mathbf{x}$ is expressed as a linear combination of a small number of elements in $\mathbf{D}$. The global sparsity prior has been shown to be useful in many ill-posed inverse problems such as image inpainting, super-resolution, and coding. It has been of particular interest for image understanding and computer vision tasks involving natural images, allowing redundant dictionaries to be efficiently inferred.

As an extension to the global sparsity constraint, recent work in the literature has revisited the model to reach a deeper understanding of its uniqueness and stability conditions. Interestingly, by imposing a local sparsity prior on $\mathbf{\Gamma}$, meaning that its independent patches can be interpreted as sparse vectors themselves, the structure in $\mathbf{D}$ can be understood as a "local" dictionary operating over each independent patch. This model extension is denominated convolutional sparse coding (CSC) and drastically reduces the burden of estimating signal representations while being characterized by stronger uniqueness and stability conditions. Furthermore, it allows $\mathbf{\Gamma}$ to be efficiently estimated via pursuit algorithms such as orthogonal matching pursuit (OMP) and basis pursuit (BP), while operating in a local fashion.

Besides its versatility in inverse problems, recent efforts have focused on the multi-layer version of the model and provided evidence of its reliability for recovering multiple underlying representations. Moreover, a tight connection between such a model and the well-established convolutional neural network (CNN) model was revealed, providing a new tool for a more rigorous understanding of its theoretical conditions.

The convolutional sparse coding model provides an efficient set of tools for solving a wide range of inverse problems, including image denoising, image inpainting, and image super-resolution. By imposing local sparsity constraints, it makes it possible to tackle the global coding problem efficiently by iteratively estimating disjoint patches and assembling them into a global signal. Furthermore, by adopting a multi-layer sparse model, which results from imposing the sparsity constraint on the signal's inherent representations themselves, the resulting "layered" pursuit algorithm keeps the strong uniqueness and stability conditions of the single-layer model. This extension also sheds light on the relation between its sparsity prior and the forward pass of the convolutional neural network, showing how the theoretical benefits of the CSC model can give the CNN structure a rigorous mathematical interpretation.

Sparse coding paradigm
Basic concepts and models are presented to explain the convolutional sparse representation framework in detail. Because the sparsity constraint has been proposed under different models, a short description of each is presented to show its evolution up to the model of interest. Also included are the concepts of mutual coherence and the restricted isometry property, which are used to establish uniqueness and stability guarantees.

Global sparse coding model
Let the signal $\mathbf{x}\in \mathbb{R}^N$ be expressed as a linear combination of a small number of atoms from a given dictionary $\mathbf{D}\in \mathbb{R}^{N \times M}, M>N$. Equivalently, the signal can be written as $\mathbf{x}=\mathbf{D}\mathbf{\Gamma}$, where $\mathbf{\Gamma}\in \mathbb{R}^M$ is the sparse representation of $\mathbf{x}$, which selects the atoms to combine and their weights. Given $\mathbf{D}$, the task of recovering $\mathbf{\Gamma}$ from either the noise-free signal itself or an observation is denominated sparse coding. In the noise-free scenario, the coding problem is formulated as follows: $$\begin{aligned} \hat{\mathbf{\Gamma}}_{\text{ideal}}&= \underset{\mathbf{\Gamma}}{\text{argmin}}\; \| \mathbf{\Gamma}\|_{0}\; \text{s.t.}\; \mathbf{D}\mathbf{\Gamma}=\mathbf{x}.\end{aligned}$$ The effect of the $\ell_{0}$ pseudo-norm is to favor solutions with as many zero elements as possible. Furthermore, given an observation affected by bounded-energy noise, $\mathbf{Y}= \mathbf{D}\mathbf{\Gamma}+ \mathbf{E},\|\mathbf{E}\|_{2}<\varepsilon$, the pursuit problem is reformulated as: $$\begin{aligned} \hat{\mathbf{\Gamma}}_{\text{noise}}&= \underset{\mathbf{\Gamma}}{\text{argmin}}\; \| \mathbf{\Gamma}\|_{0}\; \text{ s.t. } \|\mathbf{D}\mathbf{\Gamma}-\mathbf{Y}\|_{2}<\varepsilon.\end{aligned}$$

Stability and uniqueness guarantees for the global sparse model
Let the spark of $\mathbf{D}$ be defined as the minimum number of linearly dependent columns: $$\begin{aligned} \sigma(\mathbf{D})=\underset{\mathbf{\Gamma}}{\text{min}} \quad \|\mathbf{\Gamma}\|_{0} \quad \text{s.t.}\quad \mathbf{D \Gamma}=0, \quad \mathbf{\Gamma}\neq 0.\end{aligned}$$

Then, from the triangle inequality, a solution $\mathbf{\Gamma}$ satisfying $\|\mathbf{\Gamma}\|_{0}<\frac{\sigma(\mathbf{D})}{2}$ is necessarily the sparsest one. Although the spark provides an upper bound, it is infeasible to compute in practical scenarios. Instead, let the mutual coherence be a measure of similarity between atoms in $\mathbf{D}$. Assuming atoms with unit $\ell_{2}$ norm, the mutual coherence of $\mathbf{D}$ is defined as: $\mu(\mathbf{D})= \max_{i\neq j} |\mathbf{d}_{i}^{T}\mathbf{d}_{j}|$, where $\mathbf{d}_{i}$ are atoms. Based on this metric, it can be proven that the true sparse representation $\mathbf{\Gamma}^{*}$ can be recovered if $\|\mathbf{\Gamma}^{*}\|_0 < \frac{1}{2}\big(1+\frac{1}{\mu(\mathbf{D})} \big)$.
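As a rough numerical illustration of the two quantities above, the following NumPy sketch computes the spark by exhaustive search (feasible only for very small dictionaries, which is why the coherence is preferred in practice) and the mutual coherence from the Gram matrix. The function names `spark` and `mutual_coherence`, the rank tolerance, and the random test dictionary are assumptions made for this example only.

```python
import itertools
import numpy as np

def spark(D, tol=1e-10):
    """Smallest number of linearly dependent columns (exhaustive search)."""
    n_cols = D.shape[1]
    for k in range(1, n_cols + 1):
        for cols in itertools.combinations(range(n_cols), k):
            # k columns are dependent when their rank is below k
            if np.linalg.matrix_rank(D[:, list(cols)], tol=tol) < k:
                return k
    return np.inf  # no dependent subset: all columns are independent

def mutual_coherence(D):
    """Largest absolute inner product between distinct normalized atoms."""
    Dn = D / np.linalg.norm(D, axis=0, keepdims=True)
    gram = np.abs(Dn.T @ Dn)
    np.fill_diagonal(gram, 0.0)  # ignore each atom's product with itself
    return gram.max()

rng = np.random.default_rng(0)
D = rng.standard_normal((64, 128))       # hypothetical redundant dictionary
mu = mutual_coherence(D)
sparsity_bound = 0.5 * (1.0 + 1.0 / mu)  # recovery is guaranteed below this value
print(f"mu = {mu:.3f}, sparsity bound = {sparsity_bound:.2f}")
```

The exhaustive search visits a combinatorial number of column subsets, whereas the coherence only needs one matrix product, which is the practical reason for working with the coherence-based bound.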

Similarly, in the presence of noise, an upper bound for the distance between the true sparse representation $\mathbf{\Gamma}^{*}$ and its estimate $\hat{\mathbf{\Gamma}}$ can be established via the restricted isometry property (RIP). A matrix $\mathbf{D}$ satisfies the k-RIP with constant $\delta_{k}$ if $(1-\delta_k)\|\mathbf{\Gamma}\|_2^2 \leq \|\mathbf{D\Gamma}\|_2^2 \leq (1+\delta_k)\|\mathbf{\Gamma}\|_2^2$, where $\delta_k$ is the smallest number that satisfies the inequality for every $\mathbf{\Gamma}$ with $\|\mathbf{\Gamma}\|_{0}=k$. Then, assuming $\|\mathbf{\Gamma}\|_0<\frac{1}{2}\big(1+\frac{1}{\mu(\mathbf{D})} \big)$, it is guaranteed that $\|\hat{\mathbf{\Gamma}}-\mathbf{\Gamma}^{*}\|_{2}^{2}\leq \frac{4\varepsilon^2}{1-\mu(\mathbf{D})(2\|\mathbf{\Gamma}\|_0-1)}$.

Solving such a general pursuit problem is a hard task if no structure is imposed on dictionary $\mathbf{D}$. This implies learning large, highly overcomplete representations, which is extremely expensive. Assuming such a burden has been met and a representative dictionary has been obtained for a given signal $\mathbf{x}$, typically based on prior information, $\mathbf{\Gamma}^{*}$ can be estimated via several pursuit algorithms.

Pursuit algorithms for the global sparse model
Two basic methods for solving the global sparse coding problem are orthogonal matching pursuit (OMP) and basis pursuit (BP). OMP is a greedy algorithm that iteratively selects the atom best correlated with the residual between $\mathbf{x}$ and the current estimate, followed by a projection onto the subset of atoms selected so far. Basis pursuit, on the other hand, is a more sophisticated approach that replaces the original coding problem with a linear programming problem. Based on these algorithms, the global sparse coding model provides considerably loose bounds for the uniqueness and stability of $\hat{\mathbf{\Gamma}}$. To overcome this, additional priors are imposed on $\mathbf{D}$ to guarantee tighter bounds and uniqueness conditions. The reader is referred to (, section 2) for details regarding these properties.
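The greedy selection and projection steps of OMP described above can be sketched as follows. This is a minimal illustration under stated assumptions, not a reference implementation: the function name `omp`, the fixed number of iterations `k` as stopping rule, and the test dictionary (identity concatenated with a normalized Hadamard basis, whose mutual coherence is $1/4$) are all choices made for this example.

```python
import numpy as np

def omp(D, x, k):
    """Orthogonal matching pursuit with a fixed number k of selected atoms."""
    support, coef = [], np.zeros(0)
    residual = x.copy()
    for _ in range(k):
        correlations = np.abs(D.T @ residual)
        if support:
            correlations[support] = 0.0      # never reselect an atom
        support.append(int(np.argmax(correlations)))
        # Project x onto the span of the atoms selected so far
        coef, *_ = np.linalg.lstsq(D[:, support], x, rcond=None)
        residual = x - D[:, support] @ coef
    gamma = np.zeros(D.shape[1])
    gamma[support] = coef
    return gamma

# Hypothetical test dictionary: identity plus normalized Hadamard basis (N = 16).
# Its coherence is 1/4, so 2-sparse codes satisfy 2 < (1 + 4) / 2.
H2 = np.array([[1.0, 1.0], [1.0, -1.0]])
H16 = np.kron(np.kron(H2, H2), np.kron(H2, H2)) / 4.0
D = np.hstack([np.eye(16), H16])
gamma_true = np.zeros(32)
gamma_true[[3, 20]] = [1.5, -2.0]
gamma_hat = omp(D, D @ gamma_true, k=2)
print(np.allclose(gamma_hat, gamma_true))
```

The least-squares re-fit over the whole selected support is what distinguishes OMP from plain matching pursuit, which would only update the coefficient of the newly chosen atom.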

Convolutional sparse coding model
A local prior is adopted such that each overlapping section of $\mathbf{\Gamma}$ is sparse. Let $\mathbf{D}\in \mathbb{R}^{N \times Nm}$ be constructed from shifted versions of a local dictionary $\mathbf{D_{L}}\in\mathbb{R}^{n \times m}, m\ll M$. Then, $\mathbf{x}$ is formed by products between $\mathbf{D_{L}}$  and local patches of $\mathbf{\Gamma}\in\mathbb{R}^{mN}$.



From the latter, $\mathbf{\Gamma}$ can be re-expressed as the concatenation of $N$ disjoint sparse vectors $\alpha_{i}\in \mathbb{R}^{m}$: $\mathbf{\Gamma}= (\alpha_{1}^{T},\alpha_{2}^{T},\dots, \alpha_{N}^{T})^{T}$. Similarly, let $\gamma$ be a set of $(2n-1)$ consecutive vectors $\alpha_{i}$. Then, each patch of $\mathbf{x}$ is expressed as: $\mathbf{x}_{i}=\mathbf{R}_{i}\mathbf{D}\mathbf{\Gamma}$, where the operator $\mathbf{R}_{i}\in \mathbb{R}^{n\times N}$ extracts the (overlapping) patch of size $n$ starting at index $i$. Thus, $\mathbf{R}_{i}\mathbf{D}$ contains only $(2n-1)m$ nonzero columns. Hence, by introducing the operator $\mathbf{S}_{i}\in \mathbb{R}^{(2n-1)m \times Nm}$, which exclusively preserves them: $$\begin{aligned} \mathbf{x}_{i}&= \underset{\Omega}{\underbrace{\mathbf{R}_{i}\mathbf{D}\mathbf{S}_{i}^{T}}}\underset{\gamma_{i}}{\underbrace{(\mathbf{S}_{i}\mathbf{\Gamma})}},\end{aligned}$$ where $\Omega$ is known as the stripe dictionary, which is independent of $i$, and $\gamma_{i}$ is denominated the i-th stripe. So, $\mathbf{x}$ admits a patch aggregation or convolutional interpretation: $$\begin{aligned} \mathbf{x}&= \sum_{i=1}^{N}\mathbf{R}_{i}^{T}\mathbf{D}_{L}\alpha_{i}= \sum_{i=1}^{m}\mathbf{d}_{i}\ast \mathbf{z}_{i},\end{aligned}$$ where $\mathbf{d}_{i}$ corresponds to the i-th atom from the local dictionary $\mathbf{D}_{L}$ and $\mathbf{z}_{i}$ is constructed from elements of the patches $\alpha$: $\mathbf{z}_{i}\triangleq (\alpha_{1,i}, \alpha_{2,i},\dots, \alpha_{N,i})^{T}$. Given the new dictionary structure, let the $\ell_{0,\infty}$ pseudo-norm be defined as: $\|\mathbf{\Gamma}\|_{0,\infty}\triangleq \max_{i}\; \|\gamma_{i}\|_{0}$.
Then, for the noise-free and noise-corrupted scenarios, the problem can be respectively reformulated as: $$\begin{aligned} \hat{\mathbf{\Gamma}}_{\text{ideal}}&= \underset{\mathbf{\Gamma}}{\text{argmin}}\; \| \mathbf{\Gamma}\|_{0,\infty}\; \text{s.t.}\; \mathbf{D}\mathbf{\Gamma}=\mathbf{x},\\ \hat{\mathbf{\Gamma}}_{\text{noise}}&= \underset{\mathbf{\Gamma}}{\text{argmin}}\; \| \mathbf{\Gamma}\|_{0,\infty}\; \text{s.t.}\; \|\mathbf{Y}-\mathbf{D}\mathbf{\Gamma}\|_{2}<\varepsilon.\end{aligned}$$
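The structure of the convolutional dictionary described in this section can be checked numerically. The sketch below builds $\mathbf{D}$ from cyclically shifted copies of a small $\mathbf{D}_{L}$, then verifies that the matrix form $\mathbf{D}\mathbf{\Gamma}$ agrees with the sum of circular convolutions, and that a patch of $\mathbf{D}$ has $(2n-1)m$ nonzero columns. Cyclic boundary handling, zero-based indexing, the helper name `conv_dictionary` and the sizes are assumptions of this example.

```python
import numpy as np

def conv_dictionary(D_L, N):
    """Global dictionary whose i-th block of m columns is R_i^T D_L (cyclic shifts)."""
    n, m = D_L.shape
    D = np.zeros((N, N * m))
    for i in range(N):
        rows = (i + np.arange(n)) % N            # samples covered by patch i
        D[rows, i * m:(i + 1) * m] = D_L
    return D

rng = np.random.default_rng(0)
N, n, m = 12, 3, 2                               # hypothetical sizes
D_L = rng.standard_normal((n, m))
D = conv_dictionary(D_L, N)

alphas = rng.standard_normal((N, m))             # row i holds alpha_i
Gamma = alphas.reshape(-1)                       # concatenation of the alpha_i
x_matrix = D @ Gamma

# Convolutional view: sum over atoms of d_j circularly convolved with z_j
x_conv = np.zeros(N)
for j in range(m):
    d_j = np.zeros(N)
    d_j[:n] = D_L[:, j]                          # zero-padded atom
    z_j = alphas[:, j]                           # j-th entry of every alpha_i
    x_conv += np.real(np.fft.ifft(np.fft.fft(d_j) * np.fft.fft(z_j)))

# Columns of R_i D that are not identically zero, for the patch starting at 0
patch_rows = np.arange(n)
nonzero_cols = int(np.any(D[patch_rows, :] != 0, axis=0).sum())
print(np.allclose(x_matrix, x_conv), nonzero_cols, (2 * n - 1) * m)
```

In practice the global matrix is never formed explicitly; it is used here only to make the equivalence between the two views visible on a small example.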

Stability and uniqueness guarantees for the convolutional sparse model
For the local approach, the mutual coherence of $\mathbf{D}$ satisfies: $\mu(\mathbf{D})\geq \big(\frac{m-1}{m(2n-1)-1}\big)^{1/2}.$ So, if a solution obeys $\|\mathbf{\Gamma}\|_{0,\infty}< \frac{1}{2}\big(1+\frac{1}{\mu(\mathbf{D})}\big)$, then it is the sparsest solution to the $\ell_{0,\infty}$ problem. Thus, under the local formulation, the same number of non-zeros is permitted for each stripe instead of for the full vector.
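To give a feel for the numbers involved, the two expressions above can be evaluated directly. The following sketch computes the coherence lower bound for a given patch size $n$ and number of local atoms $m$, and the stripe sparsity level permitted at that coherence. Since the actual coherence of a dictionary can only be larger than the lower bound, the resulting value is the most favorable case. The function names and the sizes $n=8$, $m=4$ are hypothetical.

```python
import math

def coherence_lower_bound(n, m):
    """Lower bound on the mutual coherence of a convolutional dictionary."""
    return math.sqrt((m - 1) / (m * (2 * n - 1) - 1))

def stripe_sparsity_bound(mu):
    """Largest l_{0,inf} value (exclusive) for which uniqueness is guaranteed."""
    return 0.5 * (1.0 + 1.0 / mu)

n, m = 8, 4                        # hypothetical patch size and number of atoms
mu_min = coherence_lower_bound(n, m)
print(mu_min, stripe_sparsity_bound(mu_min))
```

Neither quantity depends on the signal length $N$, which is the sense in which the local guarantee does not degrade as the signal grows.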

Similar to the global model, the CSC problem is solved via OMP and BP methods, the latter contemplating the use of the iterative shrinkage thresholding algorithm (ISTA) for splitting the pursuit into smaller problems. Based on the $\ell_{0,\infty}$ pseudo-norm, if a solution $\mathbf{\Gamma}$ exists satisfying $\|\mathbf{\Gamma}\|_{0,\infty}<\frac{1}{2}\big(1+\frac{1}{\mu(\mathbf{D})} \big)$, then both methods are guaranteed to recover it. Moreover, the local model guarantees recovery independently of the signal dimension, as opposed to the $\ell_{0}$ prior. Stability conditions for OMP and BP are also guaranteed if the exact recovery condition (ERC) is met for a support $\mathcal{T}$ with a constant $\theta$. The ERC is defined as: $\theta= 1-\underset{i\notin \mathcal{T}}{\text{max}} \|\mathbf{D}_{\mathcal{T}}^{\dagger}\mathbf{d}_{i}\|_{1}>0$, where $\dagger$ denotes the pseudo-inverse. Algorithm 1 shows the global pursuit method based on ISTA.

Algorithm 1: 1D CSC via local iterative soft-thresholding.

Input:

$\mathbf{D}_{L}$ : local dictionary,

$\mathbf{y}$ : observation,

$\lambda$ : regularization parameter,

$c$ : step size.

 * $\{\boldsymbol{\alpha}_{i}\}^{(0)}\gets \{\mathbf{0}_{m\times 1}\}$ (Initialize disjoint patches.)

 * $\{\mathbf{r}_{i}\}^{(0)}\gets \{\mathbf{R}_{i}\mathbf{y}\}$ (Initialize residual patches.)

 * $k\gets 0$

Repeat

 * $\{\boldsymbol{\alpha}_i\}^{(k+1)}\gets \mathcal{S}_{\frac{\lambda}{c}}\big( \{\boldsymbol{\alpha}_i\}^{(k)}+\frac{1}{c}\{\mathbf{D}_{L}^{T}\mathbf{r}_i\}^{(k)} \big)$ (Coding along disjoint patches.)

 * $\hat{\mathbf{x}}^{(k+1)}\gets \sum_{i}\mathbf{R}_{i}^{T}\mathbf{D}_{L}\boldsymbol{\alpha}_{i}^{(k+1)}$ (Patch aggregation.)

 * $\{\mathbf{r}_{i}\}^{(k+1)}\gets \{\mathbf{R}_{i}\big( \mathbf{y}-\hat{\mathbf{x}}^{(k+1)} \big)\}$ (Update residuals.)

 * $k \gets k+ 1$

Until $\|\hat{\mathbf{x}}^{(k)}- \hat{\mathbf{x}}^{(k-1)}\|_{2}$ falls below a chosen tolerance or $k$ exceeds a maximum number of iterations.
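The steps of Algorithm 1 can be sketched in NumPy as follows. This is an illustrative version under several assumptions: patches wrap around cyclically, the stopping parameters are given the hypothetical names `tol` and `max_iters`, and the step size in the usage lines is set to $c = n\,\|\mathbf{D}_{L}\|_{2}^{2}$, a conservative upper bound on $\|\mathbf{D}\|_{2}^{2}$ chosen here so that the objective does not increase.

```python
import numpy as np

def soft(z, t):
    """Soft-thresholding operator S_t(z)."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def csc_ista(D_L, y, lam, c, tol=1e-8, max_iters=500):
    """Local ISTA for 1D CSC; returns the patches alpha_i (rows) and x_hat."""
    n, m = D_L.shape
    N = y.size
    # idx[i] lists the samples of the patch starting at i (cyclic boundary)
    idx = (np.arange(N)[:, None] + np.arange(n)[None, :]) % N
    alpha = np.zeros((N, m))
    x_hat = np.zeros(N)
    for _ in range(max_iters):
        r = (y - x_hat)[idx]                            # residual patches R_i (y - x_hat)
        alpha = soft(alpha + (r @ D_L) / c, lam / c)    # row i uses D_L^T r_i
        x_new = np.zeros(N)
        np.add.at(x_new, idx, alpha @ D_L.T)            # sum_i R_i^T D_L alpha_i
        converged = np.linalg.norm(x_new - x_hat) < tol
        x_hat = x_new
        if converged:
            break
    return alpha, x_hat

rng = np.random.default_rng(0)
D_L = rng.standard_normal((4, 3))                       # hypothetical local dictionary
y = rng.standard_normal(32)                             # hypothetical observation
c = D_L.shape[0] * np.linalg.norm(D_L, 2) ** 2          # safe step-size constant
alpha, x_hat = csc_ista(D_L, y, lam=0.1, c=c)
print(np.count_nonzero(alpha), np.linalg.norm(y - x_hat))
```

Only the small matrix $\mathbf{D}_{L}$ is ever multiplied, so the cost per iteration scales with $Nnm$ rather than with the size of the global dictionary.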

Multi-layered convolutional sparse coding model
By imposing the sparsity prior on the inherent structure of $\mathbf{x}$, strong conditions for a unique representation and feasible methods for estimating it are granted. Similarly, such a constraint can be applied to its representation itself, generating a cascade of sparse representations: each code is defined by a few atoms of a given set of convolutional dictionaries.

Based on these criteria, yet another extension denominated multi-layer convolutional sparse coding (ML-CSC) is proposed. A set of analytical dictionaries $\{\mathbf{D}_{k}\}_{k=1}^{K}$ can be efficiently designed, where sparse representations at each layer $\{\mathbf{\Gamma}_{k}\}_{k=1}^{K}$ are guaranteed by imposing the sparsity prior over the dictionaries themselves. In other words, by considering dictionaries to be strided convolutional matrices, i.e. atoms of the local dictionaries shift by $m$ elements instead of a single one, where $m$ corresponds to the number of channels in the previous layer, it is guaranteed that the $\ell_{0,\infty}$ pseudo-norm of the representations along layers is bounded.

For example, given the dictionaries $\mathbf{D}_{1} \in \mathbb{R}^{N\times Nm_{1}}, \mathbf{D}_{2} \in \mathbb{R}^{Nm_{1}\times Nm_{2}}$, the signal is modeled as $\mathbf{D}_{1}\mathbf{\Gamma}_{1}= \mathbf{D}_{1}(\mathbf{D}_{2}\mathbf{\Gamma}_{2})$ , where $\mathbf{\Gamma}_{1}$ is its sparse code, and $\mathbf{\Gamma}_{2}$  is the sparse code of $\mathbf{\Gamma}_{1}$. Then, the estimation of each representation is formulated as an optimization problem for both noise-free and noise-corrupted scenarios, respectively. Assuming $\mathbf{\Gamma}_{0}=\mathbf{x}$ : $$\begin{aligned} \text{Find}\; \{\mathbf{\Gamma}_{i}\}_{i=1}^{K}\;\text{s.t.}&\; \mathbf{\Gamma}_{i-1}=\mathbf{D}_{i}\mathbf{\Gamma}_{i},\; \|\mathbf{\Gamma}_{i}\|_{0,\infty}\leq \lambda_{i}\\ \text{Find}\; \{\mathbf{\Gamma}_{i}\}_{i=1}^{K}\; \text{s.t.} &\;\|\mathbf{\Gamma}_{i-1}-\mathbf{D}_{i}\mathbf{\Gamma}_{i}\|_{2}\leq \varepsilon_{i},\; \|\mathbf{\Gamma}_{i}\|_{0,\infty}\leq \lambda_{i}\end{aligned}$$

In what follows, theoretical guarantees for the uniqueness and stability of this extended model are described.

Theorem 1: (Uniqueness of sparse representations) Suppose a signal $\mathbf{x}$ satisfies the ML-CSC model for a set of convolutional dictionaries $\{\mathbf{D}_{i}\}_{i=1}^{K}$ with mutual coherences $\{\mu(\mathbf{D}_{i})\}_{i=1}^{K}$. If the true sparse representations satisfy $\|\mathbf{\Gamma}_{i}\|_{0,\infty}<\frac{1}{2}\big(1+\frac{1}{\mu(\mathbf{D}_{i})}\big)$, then a solution $\{\hat{\mathbf{\Gamma}}_{i}\}_{i=1}^{K}$ to the problem is its unique solution, provided the thresholds are chosen to satisfy: $\lambda_{i}<\frac{1}{2}\big(1+\frac{1}{\mu(\mathbf{D}_{i})} \big)$.

Theorem 2: (Global stability of the noise-corrupted scenario) Suppose a signal $\mathbf{x}$ satisfying the ML-CSC model for a set of convolutional dictionaries $\{\mathbf{D}_{i}\}_{i=1}^{K}$ is contaminated with noise $\mathbf{E}$, where $\|\mathbf{E}\|_{2}\leq \varepsilon_{0}$, resulting in $\mathbf{Y}=\mathbf{X}+\mathbf{E}$. If $\lambda_{i}<\frac{1}{2}\big(1+\frac{1}{\mu(\mathbf{D}_{i})}\big)$ and $\varepsilon_{i}^{2}=\frac{4\varepsilon_{i-1}^{2}}{1-(2\|\mathbf{\Gamma}_{i}\|_{0,\infty}-1)\mu(\mathbf{D}_{i})}$, then the estimated representations $\{\hat{\mathbf{\Gamma}}_{i}\}_{i=1}^{K}$ satisfy the following: $\|\mathbf{\Gamma}_{i}-\hat{\mathbf{\Gamma}}_{i}\|_{2}^{2}\leq \varepsilon_{i}^{2}$.

Projection-based algorithms
A simple approach for solving the ML-CSC problem, via either the $\ell_{0}$ or the $\ell_{1}$ norm, is to compute inner products between $\mathbf{x}$ and the dictionary atoms to identify the most representative ones. Such a projection is described as: $$\begin{aligned} \hat{\mathbf{\Gamma}}_{\ell_p}&= \underset{\mathbf{\Gamma}}{\operatorname{argmin}} \frac{1}{2}\|\mathbf{\Gamma}-\mathbf{D}^{T}\mathbf{x}\|_2^2 +\beta\|\mathbf{\Gamma}\|_p & p\in\{0,1\},\end{aligned}$$

with closed-form solutions given by the hard-thresholding $\mathcal{H}_{\beta}(\mathbf{D}^{T}\mathbf{x})$ and soft-thresholding $\mathcal{S}_{\beta}(\mathbf{D}^{T}\mathbf{x})$ operators, respectively. If a nonnegativity constraint is also contemplated, the problem can be expressed via the $\ell_{1}$ norm as: $$\begin{aligned} \hat{\mathbf{\Gamma}}&= \underset{\mathbf{\Gamma}}{\text{argmin}}\; \frac{1}{2}\|\mathbf{\Gamma}-\mathbf{D}^T\mathbf{x}\|_2^2+\beta\|\mathbf{\Gamma}\|_1,\; \text{ s.t. } \mathbf{\Gamma}\geq 0,\end{aligned}$$ whose closed-form solution corresponds to the soft nonnegative thresholding operator $\mathcal{S}_{\beta}^{+}(\mathbf{D}^{T}\mathbf{x})$, where $\mathcal{S}_{\beta}^{+}(z)\triangleq \max(z-\beta,0)$. Guarantees for the layered soft-thresholding approach are included in the Appendix (Section 6.2).
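The three thresholding operators mentioned above are simple element-wise functions. The sketch below is one possible NumPy rendering. The function names are hypothetical, and for the hard-thresholding operator $\beta$ is taken directly as the cut-off applied to $|z|$, a convention assumed for this example.

```python
import numpy as np

def hard_threshold(z, beta):
    """H_beta: keep entries whose magnitude exceeds beta, zero the rest."""
    return np.where(np.abs(z) > beta, z, 0.0)

def soft_threshold(z, beta):
    """S_beta: shrink every entry toward zero by beta."""
    return np.sign(z) * np.maximum(np.abs(z) - beta, 0.0)

def soft_nonneg_threshold(z, beta):
    """S_beta^+: max(z - beta, 0), the nonnegative soft threshold."""
    return np.maximum(z - beta, 0.0)

z = np.array([-2.0, -0.5, 0.3, 1.5])   # stands in for D^T x
print(hard_threshold(z, 1.0))
print(soft_threshold(z, 1.0))
print(soft_nonneg_threshold(z, 1.0))
```

Hard thresholding leaves the surviving entries untouched, whereas both soft variants also reduce their magnitude by $\beta$; the nonnegative variant additionally discards every negative entry.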

Theorem 3: (Stable recovery of the multi-layered soft-thresholding algorithm) Suppose a signal $\mathbf{x}$ satisfying the ML-CSC model for a set of convolutional dictionaries $\{\mathbf{D}_i\}_{i=1}^K$ with mutual coherences $\{\mu(\mathbf{D}_i)\}_{i=1}^K$ is contaminated with noise $\mathbf{E}$, where $\|\mathbf{E}\|_2\leq \varepsilon_0$, resulting in $\mathbf{Y}=\mathbf{X}+\mathbf{E}$. Denote by $|\mathbf{\Gamma}_i^{\min}|$ and $|\mathbf{\Gamma}_i^{\max}|$ the lowest and highest entries, in absolute value, of $\mathbf{\Gamma}_i$. Let $\{\hat{\mathbf{\Gamma}}_i\}_{i=1}^K$ be the estimated sparse representations obtained with thresholds $\{\beta_i\}_{i=1}^K$. If the representations satisfy $$\begin{aligned} \|\mathbf{\Gamma}_i\|_{0,\infty}<\frac{1}{2}\Big( 1+\frac{1}{\mu(\mathbf{D}_i)} \frac{|\mathbf{\Gamma}_i^{\min}|}{|\mathbf{\Gamma}_i^{\max}|} \Big)-\frac{1}{\mu(\mathbf{D}_i)}\frac{\varepsilon_{i-1}}{|\mathbf{\Gamma}_i^{\max}|} \end{aligned}$$ and the thresholds $\beta_i$ are chosen accordingly, then $\hat{\mathbf{\Gamma}}_{i}$ has the same support as $\mathbf{\Gamma}_{i}$, and $\|\mathbf{\Gamma}_{i}-\hat{\mathbf{\Gamma}}_i\|_{2,\infty}\leq \varepsilon_i$, for $\varepsilon_i=\sqrt{\|\mathbf{\Gamma}_i\|_{0,\infty}}\big(\varepsilon_{i-1}+\mu(\mathbf{D}_i)(\|\mathbf{\Gamma}_i\|_{0,\infty}-1)|\mathbf{\Gamma}_i^{\max}|+\beta_{i}\big)$.

Connections to convolutional neural networks
Recall the forward pass of the convolutional neural network model, used in both training and inference steps. Let $\mathbf{x}\in \mathbb{R}^{Mm_{1}}$ be its input and $\mathbf{W}_{k}\in\mathbb{R}^{N\times m_{1}}$ the filters at layer $k$, which are followed by the rectified linear unit (ReLU) $\text{ReLU}(\mathbf{x})= \max(0, \mathbf{x})$, for bias $\mathbf{b}\in \mathbb{R}^{Mm_{1}}$. Based on this elementary block, taking $K=2$ as an example, the CNN output can be expressed as: $$\begin{aligned} \mathbf{Z}_{2}&= \text{ReLU}\big(\mathbf{W}_{2}^{T}\; \text{ReLU}(\mathbf{W}_{1}^{T}\mathbf{x}+\mathbf{b}_{1})+\mathbf{b}_{2}\big).\end{aligned}$$ Finally, comparing the CNN algorithm and the layered thresholding approach for the nonnegative constraint, it is straightforward to show that both are equivalent: $$\begin{aligned}     \hat{\mathbf{\Gamma}}&= \mathcal{S}^{+}_{\beta_{2}}\big(\mathbf{D}_{2}^{T}\mathcal{S}^{+}_{\beta_{1}}(\mathbf{D}_{1}^{T}\mathbf{x}) \big)\\      &= \text{ReLU}\big(\mathbf{W}_{2}^{T} \text{ReLU}(\mathbf{W}_{1}^{T}\mathbf{x}-\beta_{1})-\beta_{2}\big),\end{aligned}$$ where the filters are identified with the dictionaries, $\mathbf{W}_{i}=\mathbf{D}_{i}$, and, since $\mathcal{S}^{+}_{\beta}(z)=\max(z-\beta,0)$, the biases with the negated thresholds, $\mathbf{b}_{i}=-\beta_{i}$.
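The equivalence can be checked numerically on a small example. The sketch below uses dense random matrices in place of convolutional dictionaries (an assumption made only to keep the example short) and identifies $\mathbf{W}_{i}=\mathbf{D}_{i}$ and $\mathbf{b}_{i}=-\beta_{i}$, which follows from the definition $\mathcal{S}^{+}_{\beta}(z)=\max(z-\beta,0)$. The sizes and threshold values are hypothetical.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def soft_nonneg(z, beta):
    """Soft nonnegative thresholding S_beta^+(z) = max(z - beta, 0)."""
    return np.maximum(z - beta, 0.0)

rng = np.random.default_rng(0)
D1 = rng.standard_normal((16, 32))     # hypothetical first-layer dictionary
D2 = rng.standard_normal((32, 48))     # hypothetical second-layer dictionary
x = rng.standard_normal(16)
beta1, beta2 = 0.5, 0.2

# Layered soft nonnegative thresholding
gamma_hat = soft_nonneg(D2.T @ soft_nonneg(D1.T @ x, beta1), beta2)

# Two-layer CNN forward pass with W_i = D_i and b_i = -beta_i
W1, W2, b1, b2 = D1, D2, -beta1, -beta2
z2 = relu(W2.T @ relu(W1.T @ x + b1) + b2)

print(np.allclose(gamma_hat, z2))
```

Under this identification a larger threshold corresponds to a more negative bias, so the bias plays the role of the sparsity-controlling parameter.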





As explained in what follows, this naive approach to solving the coding problem is a particular case of a more stable projected gradient descent algorithm for the ML-CSC model. Equipped with the stability conditions of both approaches, one gains a clearer understanding of the class of signals a CNN can recover, the noise conditions under which an estimate can be accurately attained, and how its structure can be modified to improve its theoretical conditions. The reader is referred to (, section 5) for details regarding their connection.

Pursuit algorithms for the multi-layer CSC model
A crucial limitation of the forward pass is that it is unable to recover the unique solution of the deep coding problem (DCP), whose existence has been demonstrated. So, instead of using a thresholding approach at each layer, a full pursuit method is adopted, denominated layered basis pursuit (LBP). Considering the projection onto the $\ell_{1}$ ball, the following problem is proposed: $$\begin{aligned} \hat{\mathbf{\Gamma}}_i & =\underset{\mathbf{\Gamma}_{i}}{\text{argmin}}\; \frac{1}{2}\|\mathbf{D}_{i}\mathbf{\Gamma}_{i}-\hat{\mathbf{\Gamma}}_{i-1}\|_{2}^{2}+\; \xi_{i}\|\mathbf{\Gamma}_{i}\|_{1},\end{aligned}$$ where each layer is solved as an independent CSC problem that takes the estimate of the previous layer as its input (with $\hat{\mathbf{\Gamma}}_{0}$ denoting the input signal), and $\xi_{i}$ is proportional to the noise level at each layer. Among the methods for solving the layered coding problem, ISTA is an efficient decoupling alternative. In what follows, a short summary of the guarantees for the LBP is given.
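A layered pursuit of this kind can be sketched by running ISTA on each layer in turn, feeding the estimate of one layer to the next. The example below is a simplified illustration: it uses generic dense dictionaries rather than convolutional ones, a fixed number of iterations, and the hypothetical names `ista` and `layered_basis_pursuit`. The step-size constant is set to the squared spectral norm of each dictionary.

```python
import numpy as np

def soft(z, t):
    """Soft-thresholding operator S_t(z)."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def ista(D, y, xi, n_iters=200):
    """ISTA for min_g 0.5 * ||D g - y||_2^2 + xi * ||g||_1."""
    c = np.linalg.norm(D, 2) ** 2          # Lipschitz constant of the gradient
    gamma = np.zeros(D.shape[1])
    for _ in range(n_iters):
        gamma = soft(gamma + D.T @ (y - D @ gamma) / c, xi / c)
    return gamma

def layered_basis_pursuit(x, dictionaries, xis, n_iters=200):
    """Solve one l1 problem per layer; each layer's input is the previous estimate."""
    estimates, current = [], x
    for D, xi in zip(dictionaries, xis):
        current = ista(D, current, xi, n_iters)
        estimates.append(current)
    return estimates

rng = np.random.default_rng(0)
D1 = rng.standard_normal((16, 32)) / 4.0   # hypothetical dictionaries
D2 = rng.standard_normal((32, 48)) / 4.0
x = rng.standard_normal(16)
gamma1, gamma2 = layered_basis_pursuit(x, [D1, D2], xis=[0.1, 0.05])
print(np.count_nonzero(gamma1), np.count_nonzero(gamma2))
```

Compared with the single thresholding step per layer of the forward pass, each layer here iterates until the $\ell_{1}$ problem is approximately solved, at a correspondingly higher computational cost.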

Theorem 4: (Recovery guarantee) Consider a signal $\mathbf{x}$ characterized by a set of sparse vectors $\{\mathbf{\Gamma}_{i}\}_{i=1}^{K}$, convolutional dictionaries $\{\mathbf{D}_{i}\}_{i=1}^{K}$  and their corresponding mutual coherences $\{\mu\big(\mathbf{D}_{i}\big)\}_{i=1}^{K}$. If $\|\mathbf{\Gamma}_{i}\|_{0,\infty}<\frac{1}{2}\big(1+\frac{1}{\mu(\mathbf{D}_{i})}\big)$, then the LBP algorithm is guaranteed to recover the sparse representations.

Theorem 5: (Stability in the presence of noise) Consider the contaminated signal $\mathbf{Y}=\mathbf{X}+\mathbf{E}$, where $\|\mathbf{E}\|_{2,\infty}\leq \varepsilon_{0}$ and $\mathbf{x}$ is characterized by a set of sparse vectors $\{\mathbf{\Gamma}_{i}\}_{i=1}^{K}$ and convolutional dictionaries $\{\mathbf{D}_{i}\}_{i=1}^{K}$. Let $\{\hat{\mathbf{\Gamma}}_{i}\}_{i=1}^{K}$ be solutions obtained via the LBP algorithm with parameters $\{\xi_{i}\}_{i=1}^{K}$. If $\|\mathbf{\Gamma}_{i}\|_{0,\infty}<\frac{1}{3}\big(1+\frac{1}{\mu(\mathbf{D}_{i})}\big)$ and $\xi_{i}=4\varepsilon_{i-1}$, then: (i) the support of the solution $\hat{\mathbf{\Gamma}}_i$ is contained in that of $\mathbf{\Gamma}_{i}$, (ii) $\|\mathbf{\Gamma}_{i}-\hat{\mathbf{\Gamma}}_i\|_{2,\infty}\leq \varepsilon_{i}$, and (iii) any entry greater in absolute value than $\frac{\varepsilon_{i}}{\sqrt{\|\mathbf{\Gamma}_{i}\|_{0,\infty}}}$ is guaranteed to be recovered.

Applications of the convolutional sparse coding model: image inpainting
As a practical example, an efficient image inpainting method for color images via the CSC model is shown. Consider the three-channel dictionary $\mathbf{D} \in \mathbb{R}^{N \times M \times 3}$, where $\mathbf{d}_{c,m}$ denotes the $m$-th atom at channel $c$, which represents the signal $\mathbf{x}$ by a single cross-channel sparse representation $\mathbf{\Gamma}$, with stripes denoted as $\mathbf{z}_{i}$. Given an observation $\mathbf{y}=\{\mathbf{y}_{r}, \mathbf{y}_{g}, \mathbf{y}_{b}\}$, where randomly chosen channels at unknown pixel locations are fixed to zero, in a similar way to impulse noise, the problem is formulated as: $$\begin{aligned} \{\mathbf{\hat{z}}_{i}\}&=\underset{\{\mathbf{z}_{i}\}}{\text{argmin}}\frac{1}{2}\sum_{c}\bigg\|\sum_{i}\mathbf{d}_{c,i}\ast \mathbf{z}_{i} -\mathbf{y}_{c}\bigg\|_{2}^{2}+\lambda \sum_{i}\|\mathbf{z}_{i}\|_{1}.\end{aligned}$$ By means of ADMM, the cost function is decoupled into simpler sub-problems, allowing an efficient estimation of $\mathbf{\Gamma}$. Algorithm 2 describes the procedure, where $\hat{\mathbf{D}}_{c,m}$ is the DFT representation of $\mathbf{D}_{c,m}$, the convolutional matrix for the term $\mathbf{d}_{c,i}\ast \mathbf{z}_{i}$. Likewise, $\hat{\mathbf{x}}_{m}$ and $\hat{\mathbf{z}}_{m}$ correspond to the DFT representations of $\mathbf{x}_{m}$ and $\mathbf{z}_{m}$, respectively, $\mathcal{S}_{\beta}(\cdot)$ corresponds to the soft-thresholding function with threshold $\beta$, and the $\ell_{1,2}$ norm is defined as the $\ell_{2}$ norm along the channel dimension $c$ followed by the $\ell_{1}$ norm along the spatial dimension $m$. The reader is referred to (, Section II) for details on the ADMM implementation and the dictionary learning procedure.

Algorithm 2: Color image inpainting via the convolutional sparse coding model.

Input:

$\hat{\mathbf{D}}_{c,m}$ : DFT of convolutional matrices $\mathbf{D}_{c,m}$ ,

$\mathbf{y}=\{\mathbf{y}_{r},\mathbf{y}_{g},\mathbf{y}_{b}\}$ : Color observation,

$\lambda$ : Regularization parameter,

$\{\mu, \rho\}$ : step sizes.






 * $k\gets 0$

Repeat


 * $\{\hat{\mathbf{z}}_{m}\}^{(k+1)}\gets\underset{\{\hat{\mathbf{z}}_{m}\}}{\text{argmin}}\;\frac{1}{2}\sum_{c}\big\|\sum_{m}\hat{\mathbf{D}}_{c,m} \hat{\mathbf{z}}_{m}-\hat{\mathbf{y}}_{c} \big\|_{2}^{2}+\frac{\rho}{2}\sum_{m}\|\hat{\mathbf{z}}_{m}- (\hat{\mathbf{y}}_{m}+\hat{\mathbf{u}}_{m}^{(k)})\|_{2}^{2}.$


 * $\{\mathbf{y}_{c,m}\}^{(k+1)}\gets \underset{\{\mathbf{y}_{c,m}\}}{\text{argmin}}\;\lambda \sum_{c}\sum_{m}\|\mathbf{y}_{c,m}\|_{1}+\mu\|\{\mathbf{x}_{c,m}^{(k+1)}\}\|_{2,1}+\frac{\rho}{2}\sum_{m}\|\mathbf{z}_{m}^{(k+1)}- (\mathbf{y}+\mathbf{u}_{m}^{(k)})\|_{2}^{2}.$


 * $\mathbf{y}_{m}^{(k+1)}=\mathcal{S}_{\lambda/\rho}\big( \mathbf{x}_{m}^{(k+1)}+\mathbf{u}_{m}^{(k)} \big).$


 * $k \gets k+1$

Until $\|\{\mathbf{z}_{m}\}^{(k+1)}-\{\mathbf{z}_{m}\}^{(k)}\|_{2}$ falls below a chosen tolerance or $k$ exceeds a maximum number of iterations.