# On Multi-Layer Basis Pursuit, Efficient Algorithms and Convolutional Neural Networks

Parsimonious representations in data modeling are ubiquitous and central for processing information. Motivated by the recent Multi-Layer Convolutional Sparse Coding (ML-CSC) model, we herein generalize the traditional Basis Pursuit regression problem to a multi-layer setting, introducing similar sparse enforcing penalties at different representation layers in a symbiotic relation between synthesis and analysis sparse priors. We propose and analyze different iterative algorithms to solve this new problem in practice. We prove that the presented multi-layer Iterative Soft Thresholding (ML-ISTA) and multi-layer Fast ISTA (ML-FISTA) converge to the global optimum of our multi-layer formulation at a rate of O(1/k) and O(1/k^2), respectively. We further show how these algorithms effectively implement particular recurrent neural networks that generalize feed-forward architectures without any increase in the number of parameters. We demonstrate the different architectures resulting from unfolding the iterations of the proposed multi-layer pursuit algorithms, providing a principled way to construct deep recurrent CNNs from feed-forward ones. We demonstrate the emerging constructions by training them in an end-to-end manner, consistently improving the performance of classical networks without introducing extra filters or parameters.

## Authors

• 17 publications
• 4 publications
• 52 publications
08/29/2017

### Multi-Layer Convolutional Sparse Modeling: Pursuit and Dictionary Learning

The recently proposed Multi-Layer Convolutional Sparse Coding (ML-CSC) m...
04/25/2018

### Multi Layer Sparse Coding: the Holistic Way

The recently proposed multi-layer sparse model has raised insightful con...
11/29/2020

### Architectural Adversarial Robustness: The Case for Deep Pursuit

Despite their unmatched performance, deep neural networks remain suscept...
12/05/2019

### Towards Understanding Residual and Dilated Dense Neural Networks via Convolutional Sparse Coding

Convolutional neural network (CNN) and its variants have led to many sta...
04/24/2018

### Genesis of Basic and Multi-Layer Echo State Network Recurrent Autoencoders for Efficient Data Representations

It is a widely accepted fact that data representations intervene noticea...
02/09/2021

### Consensus Based Multi-Layer Perceptrons for Edge Computing

In recent years, storing large volumes of data on distributed devices ha...
10/27/2020

### Deep Networks from the Principle of Rate Reduction

This work attempts to interpret modern deep (convolutional) networks fro...
##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction

Sparsity has been shown to be a driving force in a myriad of applications in computer vision

[42, 43, 30], statistics [38, 39][26, 23, 24]. Most often, sparsity is often enforced not on a particular signal but rather on its representation in a transform domain. Formally, a signal admits a sparse representation in terms of a dictionary if , and is sparse. In its simplest form, the problem of seeking for a sparse representation for a signal, possibly contaminated with noise as , can be posed in terms of the following pursuit problem:

 minγ ∥γ∥0  s.t. ∥y−Dγ∥22≤ε, (1)

where the pseudo-norm counts the number of non-zero elements in . The choice of the (typically overcomplete) dictionary is far from trivial, and has motivated the design of different analytic transforms [9, 16, 27] and the development of dictionary learning methods [2, 36, 30]. The above problem, which is NP-hard in general, is often relaxed by employing the penalty as a surrogate for the non-convex measure, resulting in the celebrated Basis Pursuit De-Noising (BPDN) problem111This problem is also known in the statistical learning community as Least Absolute Shrinkage and Selection Operator (LASSO) [38], where the matrix is given by a set of measurements or descriptors, in the context of a sparse regression problem.:

 minγ λ∥γ∥1+12∥y−Dγ∥22. (2)

The transition from the to the relaxed case is by now well understood, and the solutions to both problems do coincide under sparse assumptions on the underlying representation (in the noiseless case), or have shown to be close enough in more general settings [17, 41].

This traditional model was recently extended to a multi-layer setting [34, 1], where a signal is expressed as , for a sparse and (possibly convolutional) matrix , while also assuming that this representation satisfies , for yet another dictionary and sparse . Such a construction can be cascaded for a number of layers222In the convolutional setting [35, 37], the notion of sparsity is better characterized by the pseudo-norm, which quantifies the density of non-zeros in the convolutional representations in a local sense. Importantly, however, the BPDN formulation (i.e., employing an penalty), still serves as a proxy for this norm. We refer the reader to [35] for a thorough analysis of convolutional sparse representations.. Under this framework, given the measurement , this multi-layer pursuit problem (or Deep Coding Problem, as first coined in [34]), can be expressed as

 min{γi} ∥y−D1γ1∥22%s.t. {γi−1=Diγi, ∥γi∥0≤si}Li=1, (3)

with . In this manner, one searches for the closest signal to while satisfying the model assumptions. This can be understood and analyzed as a projection problem [37]

, providing an estimate such that

, while forcing all intermediate representations to be sparse. Note the notation for brevity. Remarkably, the forward pass of neural networks (whose weights at each layer, , are set as the transpose of each dictionary ) yields stable estimations for the intermediate features or representations provided these are sparse enough [34]. Other generative models have also been recently proposed (as the closely related probabilistic framework in [patel2016probabilistic, nguyen2016semi]

), but the multi-layer sparse model provides a convenient way to study deep learning architectures in terms of pursuit algorithms

[Papyan18].

As an alternative to the forward pass, one can address the problem in Equation (3) by adopting a projection interpretation and develop an algorithm based on a global pursuit, as in [37]. More recently, the work in [1] showed that this problem can be cast as imposing an analysis prior on the signal’s deepest sparse representation. Indeed, the problem in (3) can be written concisely as:

 min{γi}∥y−D(1,L)γL∥22 (4) s.t.∥γL∥0≤sL, {∥D(i,L)γL∥0≤si−1}Li=1. (5)

This formulation explicitly shows that the intermediate dictionaries play the role of analysis operators, resulting in a representation which should be orthogonal to as many rows from as possible – so as to produce zeros in . Interestingly, this analysis also allows for less sparse representations in shallower layers while still being consistent with the multi-layer sparse model. While a pursuit algorithm addressing (4) was presented in [1], it is greedy in nature and does not scale to high dimensional signals. In other words, there are currently no efficient pursuit algorithms for signals in this multi-layer model that leverage this symbiotic analysis-synthesis priors. More importantly, it is still unclear how the dictionaries could be trained from real data under this scheme. These questions are fundamental if one is to bridge the theoretical benefits of the multi-layer sparse model with practical deep learning algorithms.

In this work we propose a relaxation of the problem in Equation (4), turning this seemingly complex pursuit into a convex multi-layer generalization of the Basis Pursuit (BP) problem333In an abuse of terminology, and for the sake of simplicity, we will refer to the BPDN problem in Equation (2) as BP.. Such a formulation, to the best of our knowledge, has never before been proposed nor studied, though we will comment on a few particular and related cases that have been of interest to the image processing and compressed sensing communities. We explore different algorithms to solve this multi-layer problem, such as variable splitting and the Alternating Directions Method of Multipliers (ADMM) [6, 8] and the Smooth-FISTA from [5]

, and we will present and analyze two new generalizations of Iterative Soft Thresholding Algorithms (ISTA). We will further show that these algorithms generalize feed-forward neural networks (NNs), both fully-connected and convolutional (CNNs), in a natural way. More precisely: the first iteration of such algorithms implements a traditional CNN, while a new recurrent architecture emerges with subsequent iterations. In this manner, the proposed algorithms provide a principled framework for the design of recurrent architectures. While other works have indeed explored the unrolling of iterative algorithms in terms of CNNs (e.g.

[44, 33]), we are not aware of any work that has attempted nor studied the unrolling of a global pursuit with convergence guarantees. Lastly, we demonstrate the performance of these networks in practice by training our models for image classification, consistently improving on the classical feed-forward architectures without introducing filters nor any other extra parameters in the model.

## 2 Multi-Layer Basis Pursuit

In this work we propose a convex relaxation of the problem in Equation (4), resulting in a multi-layer BP problem. For the sake of clarity, we will limit our formulations to two layers, but these can be naturally extended to multiple layers – as we will effectively do in the experimental section. This work is centered around the following problem:

 (P):minγ12∥y−D1D2γ∥22+λ1∥D2γ∥1+λ2∥γ∥1. (6)

This formulation imposes a particular mixture of synthesis and analysis priors. Indeed, if >0 and , one recovers a traditional Basis Pursuit formulation with a factorized global dictionary. If , however, an analysis prior is enforced on the representation by means of , resulting in a more regularized solution. Note that if , and is not empty, the problem above becomes ill-posed without a unique solution444It is true that also in Basis Pursuit one can potentially obtain infinite solutions, as the problem is not strongly convex. since . In addition, unlike previous interpretations of the multi-layer sparse model ([34, 37, 1]), our formulation stresses the fact that there is one unknown variable, , with different priors enforced on it. Clearly, one may also define and introduce , but this should be interpreted merely as the introduction of auxiliary variables to aid the derivation and interpretation of the respective algorithms. We will expand on this point in later sections.

Other optimization problems similar to have indeed been proposed, such as the Analysis-LASSO [10, 28], though their observation matrix and the analysis operator ( and , in our case) must be independent, and the latter is further required to be a tight frame [10, 28]. The Generalized Lasso problem [40] is also related to our multi-layer BP formulation, as we will see in the following section. On the other hand, and in the context of image restoration, the work in [7] imposes a Total Variation and a sparse prior on the unknown image, as does the work in [21], thus being closely related to the general expression in (6).

### 2.1 Algorithms

From an optimization perspective, our multi-layer BP problem can be expressed more generally as

 minγ F(γ)=f(D2γ)+g1(D2γ)+g2(γ), (7)

where is convex and smooth, and and are convex but non-smooth. For the specific problem in (6), , and

. Since this problem is convex, the choice of available algorithms is extensive. We are interested in high-dimensional settings, however, where interior-point methods and other solvers depending on second-order information might have a prohibitive computational complexity. In this context, the Iterative Soft Thresholding Algorithm (ISTA), and its Fast version (FISTA), are appealing as they only require matrix-vector multiplications and entry-wise operations. The former, originally introduced in

[15], provides convergence (in function value) of order , while the latter provides an improved convergence rate with order of [4].

Iterative shrinkage algorithms decompose the total loss into two terms: , convex and smooth (with Lipschitz constant ), and , convex and possibly non smooth. The central idea of ISTA, as a proximal gradient method for finding a minimizer of , is to iterate the updates given by the proximal operator of at the forward-step point:

 γk+1=prox1Lg(γk−1L∇f(γk)). (8)

Clearly, the appeal of ISTA depends on how effectively the proximal operator can be computed. When (as in the original BP formulation), such a proximal mapping becomes separable, resulting in the element-wise shrinkage or soft-thresholding operator. However, this family of methods cannot be readily applied to where . Indeed, computing when is a sum of composite terms is no longer directly separable, and one must resort to iterative approaches, making ISTA lose its appeal.

The problem is also related to the Generalized Lasso formulation [40] in the compressed sensing community, which reads

 minγ12∥y−Xγ∥22+ν∥Aγ∥1. (9)

Certainly, the multi-layer BP problem we study can be seen as a particular case of this formulation555One can rewrite problem (6) as in (9) by making and .. With this insight, one might consider solving through the solution of the generalized Lasso [40]. However, such an approach also becomes computationally demanding as it boils down to an iterative algorithm that includes the inversion of linear operators. Other possible solvers might rely on re-weighted approaches [11], but these also require iterative matrix inversions.

A simple way of tackling problem is the popular Alternating Directions Method of Multipliers (ADMM), which provides a natural way to address these kind of problems through variable splitting and auxiliary variables. For a two layer model, one can rewrite the multi-layer BP as a constrained minimization problem:

 minγ1,γ2 12∥y−D1D2γ2∥22+λ1∥γ1∥1+λ2∥γ2∥1 (10) s.t. γ1=D2γ2. (11)

ADMM minimizes this constrained loss by constructing an augmented Lagrangian (in normalized form) as

 minγ1,γ2,u 12∥y−D1D2γ2∥22+λ1∥γ1∥1+λ2∥γ2∥1+ρ2∥γ1−D2γ2+u∥22, (12)

which can be minimized iteratively by repeating the updates in Algorithm 1. This way, and after merging both terms, the pursuit of the inner-most representation ( in this case) is carried out in terms of a regular BP formulation that can be tackled with a variety of convex methods, including ISTA or FISTA. The algorithm then updates the intermediate representations () by a simple shrinkage operation, followed by the update of the dual variable, . Note that this algorithm is guaranteed to converge (at least in the sequence sense) to a global optimum of due to the convexity of the function being minimized [6, 8].

A third alternative, which does not incur in an additional inner iteration nor inversions, is the Smooth-FISTA approach from [5]. S-FISTA addresses cost functions of the same form as problem by replacing one of the non-smooth functions, in Eq. (7), by a smoothed version in terms of its Moreau envelope. In this way, S-FISTA converges with order to an estimate that is -away from the solution of the original problem in terms of function value. We will revisit this method further in the following sections.

Before moving on, we make a short intermission to note here that other Basis Pursuit schemes have been proposed in the context of multi-layer sparse models and deep learning. Already in [34] the authors proposed the Layered Basis Pursuit, which addresses the sequence of pursuits given by

 ^γi←argminγi ∥^γi−1−Diγi∥22+λi∥γi∥1, (13)

from to , where . Clearly, each of these can be solved with any BP solver just as well. A related idea was also recently proposed in [sun2018supervised], showing that cascading basis pursuit problems can lead to competitive deep learning constructions. However, the Layered Basis Pursuit formulation, or other similar variations that attempt to unfold neural network architectures [33, 44], do not minimize

and thus their solutions only represent sub-optimal and heuristic approximations to the minimizer of the multi-layer BP. More clearly, such a series of steps never provide estimates

that can generate a signal according to the multi-layer sparse model. As a result, one cannot reconstruct , because each representation is required to explain the next layer only approximately, so that .

### 2.2 Towards Multi-Layer ISTA

We now move to derive the proposed approach to efficiently tackle while relying on the concept of the gradient mapping (see for example [3, Chapter 10]), which we briefly review next. Given a function , where is convex and smooth with Lipschitz constant and is convex, the gradient mapping is the operator given by

 Gf,gL(γ)=L[γ−prox1Lg(γ−1L∇f(γ))]. (14)

Naturally, the ISTA update step in Equation (8) can be seen as a “gradient-mapping descent” step, since it can be rewritten as . Moreover, provides a sort of generalization of the gradient of , since

1. if ,

2. if and only if is a minimizer of .

We refer the reader to [3, Chapter 10] for further details on gradient mapping operators.

Returning to the problem in (7), our first attempt to minimize is a proximal gradient-mapping method, and it takes an update of the following form:

 γk+12=proxtg2(γk2−t Gf(⋅),g1(D2⋅)1/μ(γk2)), (15)

for constants and that will be specified shortly. This expression, however, requires the computation of , which is problematic as it involves a composite term666

The proximal of a composition with an affine map is only available for unitary linear transformations. See

[3, Chapter 10] and [13] for further details.. To circumvent this difficulty, we propose the following approximation in terms of

. In the spirit of the chain rule

777The step taken to arrive at Equation (16) is not actually the chain rule, as the gradient mapping is not necessarily a gradient of a smooth function., we modify the previous update to

 γk+12=proxtg2(γk2−t DT2 Gf,g11/μ(γk1)). (16)

Importantly, the above update step now involves the prox of as opposed to that of . This way, in the case of , the proximal mapping of becomes the soft-thresholding operator with parameter , i.e. . An analogous operator is obtained for just as well. Therefore, the proposed Multi-Layer ISTA update can be concisely written as

 \resizebox469.755pt$γk+12=Ttλ2(γk2−tμDT2(γk1−Tμλ1(γk1−μDT1(D1γk1−y))))$. (17)

A few comments are in place. First, this algorithm results in a nested series of shrinkage operators, involving only matrix-vector multiplications and entry-wise non linear operations. Note that if , i.e. in the case of a traditional Basis Pursuit problem, the update above reduces to the update of ISTA. Second, though seemingly complicated at first sight, the resulting operator in (17) can be decomposed into simple recursive layer-wise operations, as presented in Algorithm 2. Lastly, because the above update provides a multi-layer extension to ISTA, one can naturally suggest a “fast version” of it by including a momentum term, just as done by FISTA. In other words, ML-FISTA will be given by the iterations

 γk+12 =proxtg2(zk−tDT2Gf,g11/μ(D2zk)), (18) zk+1 =γk+12+ρk(γk+12−γk2), (19)

where , and the parameter is updated according to . Clearly, one can also write this algorithm in terms of layer-wise operations, as described in Algorithm 3.

### 2.3 Convergence Analysis of ML-ISTA

One can then inquire – does the update in Equation (17

) provide convergent algorithm? Does the successive iterations minimize the original loss function? Though these questions have been extensively studied for proximal gradient methods

[4, 13], algorithms based on a proximal gradient mapping have never been proposed – let alone analyzed. Herein we intend to provide a first theoretical analysis of the resulting multi-layer thresholding approaches.

Let us formalize the problem assumptions, recalling that we are interested in

 (P):minγ F(γ)=f(D2γ)+g1(D2γ)+g2(γ), (20)

where is a quadratic convex function, is a convex and Lipschitz continuous function with constant and is a proper closed and convex function that is -Lipschitz continuous over its domain. Naturally, we will assume that both and are proximable, in the sense that and can be efficiently computed for any and . We will further require that has a bounded domain888This can be easily accommodated by adding to a norm bound constraint in the form of an indicator function , for some large enough .. More precisely, . We denote , so that satisfies .

Note also that the convex and smooth function can be expressed as , for a positive semi-definite matrix . The gradient of can then easily be bounded by

 ∥∇f(γ1)∥2≤M≡∥Q∥2R1+∥b∥2, (21)

for any with .

It is easy to see that the algorithm in Equation (16) does not converge to the minimizer of the problem by studying its fixed point. The point is a fixed point of ML-ISTA if

 ∃ w2∈∂g2(γ⋆2),w1∈∂g1(^γ1) (22) such that DT2∇f(D2γ⋆2)+DT2w1+w2=0, (23)

where . We extend on the derivation of this condition in Section 5.1. This is clearly different from the optimality conditions for problem , which is

 ∃ w2∈∂g2(γ⋆2),w1∈∂g1(D2γ⋆2) (24) such that DT2∇f(D2γ⋆2)+DT2w1+w2=0. (25)

Nonetheless, we will show that the parameter controls the proximity of “near”-fixed points to the optimal solution in the following sense: as gets smaller, the objective function of a fixed-point of the ML-ISTA method gets closer to , the minimal value of . In addition, for points that satisfy the fixed point equation up to some tolerance , the distance to optimality in terms of the objective function is controlled by both and .

We first present the following lemma, stating that the norm of the gradient mapping operator is bounded. For clarity, we defer the proof of this lemma, as well as that of the following theorem, to Section 5.

###### Lemma 2.1.

For any and ,

 ∥∥Gf,g11/μ(γ1)∥∥2≤M+ℓg1. (26)

We now move to our main convergence result. In a nutshell, it states that the distance from optimality in terms of the objective function value of -fixed points is bounded by constants multiplying and .

###### Theorem 2.2.

Let999For a matrix , denotes the spectral norm of : , where

stands for the maximal eigenvalue of its argument.

, , and assume and . If

 1t∥∥~γ2−proxtg2(~γ2−t DT2Gf,g11/μ(~γ1))∥∥2≤ε, (27)

then

 F(α)−Fopt≤ηε+(β+κt)μ, (28)

where ,

 α=proxtg2(~γ2−tDT2Gf,g11/μ(~γ1)), (29)

and

 η =2R, (30) β =2R∥D2∥2∥Q∥2(M+ℓg1)+∥Q∥22R21 (31) +2∥b∥2∥Q∥2R1+ℓ2g1+2ℓg1M, (32) κ =∥D2∥2(∥D2∥2(M+ℓg1)+ℓg2)∥Q∥2(M+ℓg1). (33)

A consequence of this result is the following.

###### Corollary 2.2.1.

Suppose is the sequence generated by ML-ISTA with and . If , then

 F(γk+12)−Fopt≤ηε+(β+κt)μ, (34)

where , and are those given in (30).

An additional consequence of Theorem 2.2 is an analogous result for ML-FISTA. Recall that ML-FISTA introduces a momentum term to the update provided by ML-ISTA, and can be written as

 γk+12 =proxtg2(zk−tDT2Gf,g11/μ(D2zk)), (35) zk+1 =γk+12+ρk(γk+12−γk2). (36)

We have the following result.

###### Corollary 2.2.2.

Let and , and assume that and are the sequences generated by ML-FISTA according to Equation (35). If

 ∥zk−γk+12∥2≤tε, (37)

then

 F(γk+12)−Fopt≤ηε+(β+κt)μ, (38)

where the constants , and are defined in (30).

Before moving on, let us comment on the significance of these results. On the one hand, we are unaware of any results for proximal gradient-mapping algorithms, and in this sense, the analysis above presents a first result of its kind. On the other hand, the analysis does not provide a convergence rate, and so they do not reflect any benefits of ML-FISTA over ML-ISTA. As we will see shortly, the empirical convergence of these methods significantly differ in practice.

### 2.4 Synthetic Experiments

We now carry a series of synthetic experiments to demonstrate the effectiveness of the multi-layer Basis Pursuit problem, as well as the proposed iterative shrinkage methods.

First, we would like to illustrate the benefit of the proposed multi-layer BP formulation when compared to the traditional sparse regression problem. In other words, exploring the benefit of having . To this end, we construct a two layer model with Gaussian matrices and , (, ). We construct our signals by obtaining representation with and , following the procedure described in [1], so as to provide representations consistent with the multi-layer sparse model. Lastly, we contaminate the signals with Gaussian i.i.d. noise creating the measurements with SNR=10. We compare minimizing with (which accounts to solving a classical BP problem) with the case when , as a function of when solved with ISTA and ML-ISTA. As can be seen from the results in Figure 1, enforcing the additional analysis penalty on the intermediate representation can indeed prove advantageous and provide a lower recovery error in both and . Clearly, this is not true for any value of , as the larger this parameter becomes, the larger the bias will be in the resulting estimate. For the sake of this demonstration we have set as the optimal value for each (with grid search). The theoretical study of the conditions (in terms of the model parameters) under which provides a better recovery, and how to determine this parameter in practice, are interesting questions that we defer to future work.

We also employ this synthetic setup to illustrate the convergence properties of the main algorithms presented above: ADMM (employing either ISTA or FISTA for the inner BP problem), the Smooth-FISTA [5] and the proposed Multi-Layer ISTA and Multi-Layer FISTA. Once again, we illustrate these algorithms for the optimal choice of and from the previous experiment, and present the results in Figure 2. We measure the convergence in terms of function value for all methods, as well as the convergence to the solution found with ADMM run until convergence.

ADMM converges in relatively few iterations, though these take a relatively long time due to the inner BP solver. This time is reduced when using FISTA rather than ISTA (as the inner solver converges faster), but it is still significantly slower than any of the other alternatives - even while using warm-start at every iteration, which we employ in these experiments.

Both S-FISTA and our multi layer solvers depend on a parameter that controls the accuracy of their solution and affect their convergence speed – the for the former, and the for the latter approaches. We set these parameters so as to obtain roughly the same accuracy in terms of recovery error, and compare their convergence behavior. We can see that ML-FISTA is clearly faster than ML-ISTA, and slightly slightly faster than S-FISTA. Lastly, in order to demonstrate the effect of in ML-ISTA and ML-FISTA, we run the same algorithms for different values of this parameter (in decreasing order and equispaced in logarithmic scale between and 1) for the same setting, and present the results in Figure 3 for ML-ISTA (top) and ML-FISTA (bottom). These numerical results illustrate the theoretical analysis provided by Theorem 2.2 in that the smaller , the more accurate the solution becomes, albeit requiring more iterations to converge. These results also reflect the limitation of our current theoretical analysis, which is incapable of providing insights into the convergence rate.

## 3 Principled Recurrent Neural Networks

As seen above, the ML-ISTA and ML-FISTA schemes provide efficient solvers for problem . Interestingly, if one considers the first iteration of either of the algorithms (with ), the update of the inner most representation results in

 γ2←tμTtλ2(DT2Tμλ1(μDT1(y))), (39)

for a two-layer model, for instance. If one further imposes a non-negativity assumption on the representation coefficients, the thresholding operators become non-negative projections shifted by a bias of . Therefore, the above soft-thresholding operation can be equivalently written as

 γ2←ReLU(DT2ReLU(DT1y+b1)+b2) (40)

where the biases vectors

and account for the corresponding thresholds101010Note that this expression is more general in that it allows for different thresholds per atom, as opposed the expression in (39). The latter can be recovered by setting every entry in the bias vector to be .. Just as pointed out in [34], this is simply the forward pass in a neural network. Moreover, all the analysis presented above holds also in the case of convolutional dictionaries, where the dictionary atoms are nothing but convolutional filters (transposed) in a convolutional neural network. Could we benefit from this observation to improve on the performance of CNNs?

In this section, we intend to demonstrate how, by interpreting neural networks as approximation algorithms of the solution to a Multi-Layer BP problem, one can boost the performance of typical CNNs without introducing any parameters in the model. To this end, we will first impose a generative model on the features in terms of multi-layer sparse convolutional representations; i.e., we assume that , for convolutional dictionaries . Furthermore, we will adopt a supervised learning setting in which we attempt to minimize an empirical risk over training samples of signals with labels

. A classifier

, with parameters , will be trained on estimates of said features obtained as the solution of the ML-BP problem; i.e.

 minθ,{Di,λi}1NN∑1=1L(hi,ζθ(γ∗))  % s.t. γ∗=argminγ∥y−D(1,L)γ∥22+L−1∑i=1λi∥D(i+1,L)γ∥1+λL∥γ∥1. (41)

The function is a loss or cost function to be minimized during training, such as the cross entropy which we employ for the classification case. Our approach to address this bi-level optimization problem is to approximate the solution of the lower-level problem by iterations of the ML-ISTA approaches – effectively implemented as layers of unfolded recurrent neural networks. This way, becomes a straight-forward function of and the model parameters ( and ), which can be plugged into the loss function . A similar approach is employed by Task Driven Dictionary Learning [29] in which the constraint is a single layer BP (i.e. ) that is solved with LARS [18] until convergence, resulting in a more involved algorithm.

Importantly, if only one iteration is employed for the ML-ISTA, and a linear classifier111111Or, in fact, any other neural-network-based classifier acting on the obtained features. is chosen for , the problem in (41) boils down exactly to training a CNN to minimize the classification loss . Naturally, when considering further iterations of the multi-layer pursuit, one is effectively implementing a recurrent neural network with “skip connections”, as depicted in Figure 4 for a two-layer model. These extended networks, which can become very deep, have exactly as many parameters as their traditional forward-pass counterparts – namely, the dictionaries , biases and classifier parameters . Notably, and unlike other popular constructions in the deep learning community (e.g., Residual Neural Networks [22], DenseNet [25], and other similar constructions), these recurrent components and connections follow a precise optimization justification.

The concept of unfolding an iterative sparse coding algorithm is clearly not new. The first instance of such an idea was formalized by the Learned ISTA (LISTA) approach [20]. LISTA decomposes the linear operator of ISTA in terms of 2 matrices, replacing the computation of by

 Tλ(Wγ+By), (42)

following the equivalences and . Then, it adaptively learns these new operators instead of the initial dictionary in order to provide estimates that approximate the solution of ISTA. Interestingly, such a decomposition allows for the acceleration of ISTA [32], providing an accurate estimate in very few iterations. A natural question is, then, could we propose an analogous multi-layer Learned ISTA?

There are two main issues that need to be resolved if one is to propose a LISTA-like decomposition in the framework of our multi-layer pursuits. The first one is that the decomposition in (42) has been proposed and analyzed for general matrices (i.e., fully-connected layers in a CNN context), but not for convolutional dictionaries. If one was to naively propose to learn such an (unconstrained) operator , this would result in an enormous amount of added parameters. To resolve this point, in the case where is a convolutional dictionary (as in CNNs) we propose a decomposition of the form

 Tλ((I−WTW)γ+By), (43)

where is also constrained to be convolutional, thus controlling the number of parameters121212For completeness, we have also tested the traditional decomposition proposed in Equation (42), resulting in worse performance than that of ML-ISTA – likely due to the significant increase in the number of parameters discussed above.. In fact, the number of parameters in a layer of this ML-ISTA is simply twice as many parameters as the conventional case, since the number of convolutional filters in and (and their dimensions) are equal to those in .

The second issue is concerned with the fact that LISTA was proposed as a relaxation of ISTA – a pursuit tackling a single layer pursuit problem. To accommodate a similar decomposition in our multi-layer setting, we naturally extend the update to:

 ^γ1 ←Tλ1((I−WT1W1)γk1+B1y), (44) γk+12 ←Tλ2((I−WT2W2)γk2+B2^γ1), (45)

for a two-layer model for simplicity. In the context of the supervised classification setting, the learning of the dictionaries is replaced by learning the operators and . Note that this decomposition prevents us from obtaining the dictionaries , and so we use131313An alternative is to employ , but this choice was shown to perform slightly worse in practice. in Equation (44).

## 4 Experiments

In this final section, we show how the presented algorithms can be used for image classification on three common datasets: MNIST, SVHN and CIFAR10, while improving the performance of CNNs without introducing any extra parameters in the model. Recalling the learning formulation in Equation (41), we will compare different architectures resulting from different solvers for the features

. As employing only one iteration of the proposed algorithms recovers a traditional feed-forward network, we will employ such a basic architecture as our baseline and compare it with the Multi Layer ISTA and FISTA, for different number of iterations or unfoldings. Also for this reason, we deliberately avoid using training “tricks” popular in the deep learning community, such as batch normalization, drop-out, etc., so as to provide clear experimental setups that facilitate the understanding and demonstration of the presented ideas.

For the MNIST case, we construct a standard (LeNet-style) CNN with 3 convolutional layers (i.e., dictionaries) with 32, 64 and 512 filters, respectively141414Kernel sizes of , and

, respectively, with stride of 2 in the first two layers.

, and a final fully-connected layer as the classifier . We also enforce non-negativity constraints on the representations, resulting in the application of ReLUs and biases as shrinkage operators. For SVHN we use an analogous model, though with three input channels and slightly larger filters to accommodate the larger input size. For CIFAR, we define a ML-CSC model with 3 convolutional layers, and the classifier function

as a three-layer CNN. This effectively results in a 6 layers architecture, out of which the first three are unfolded in the context of the multi-layer pursuits. All models are trained with SGD with momentum, decreasing the learning rate every so many iterations. In particular, we make use of a PyTorch implementation, and training code is made available

151515Available through the first author’s website. online.

In order to demonstrate the effect of the ML-ISTA iterations (or unfoldings), we first depict the test error as a function of the training epochs for different number of such iterations in Figure

5. Recall that the case of 0 unfoldings corresponds to the typical feed-forward CNN, while the case with 6 unfoldings effectively implements a 18-layers-deep architecture, alas having the same number of parameters. As can be seen, further unfoldings improve on the resulting performance.

Moving to a more complete comparison, we demonstrate the ML-ISTA and ML-FISTA architectures when compared to some of the models mentioned above; namely:

• ML-LISTA: replacing the learning of the convolutional dictionaries (or filters) by the learning of the (convolutional) factors and , as indicated in Equation (44).

• Layered Basis Pursuit: the approach proposed in [34], which unrolls the iteration of ISTA for a single-layer BP problem at each layer. In contrast, the proposed ML-ISTA/FISTA unrolls the iterations of the entire Multi-Layer BP problem.

• An “All-Free” model: What if one ignores the generative model (and the corresponding pursuit interpretation) and simply frees all the filters to be adaptively learned? In order to study this question, we train a model with the same depth and an analogous recurrent architecture as the unfolded ML-ISTA/FISTA networks, but where all the filters of the different layers are free to be learned and to provide the best possible performance.

It is worth stressing that the ML-ISTA, ML-FISTA and Layered BP have all the same number of parameters as the feed-forward CNN. The ML-LISTA version has twice as many parameters, while the All-Free version has order more parameters, where is the number of layers and is the number of unfoldings.

The accuracy as a function of the iterations for all models are presented in Figure 6, and the final results are detailed in Table I. A first observation is that most “unrolled” networks provide an improvement over the baseline feed-forward architecture. Second, while the Layered BP performs very well on MNIST, it falls behind on the other two more challenging datasets. Recall that while this approach unfolds the iterations of a pursuit, it does so one layer at a time, and does not address a global pursuit problem as the one we explore in this work.

Third, the performances of the ML-ISTA, ML-FISTA and ML-LISTA are comparable. This is interesting, as the LISTA-type decomposition does not seem to provide an importance advantage over the unrolled multi-layer pursuits. Forth, and most important of all, freeing all the parameters in the architecture does not provide important improvements over the ML-ISTA networks. Limited training data is not likely to be the cause, as ML-ISTA/FISTA outperforms the larger model even for CIFAR, which enjoys a rich variability in the data and while using data-augmentation. This is noteworthy, and this result seems to indicate that the consideration of the multi-layer sparse model, and the resulting pursuit, does indeed provide an (approximate) solution to the problem behind CNNs.

## 5 Proofs of Main Theorems

### 5.1 Fixed Point Analysis

A vector is a fixed point of the ML-ISTA update from Equation (16) iff

 γ⋆2=proxtg2(γ⋆2−t DT2 Gf,g11/μ(D2γ⋆2)). (46)

By the second prox theorem [3, Theorem 6.39], we have that

 −t DT2 Gf,g11/μ(D2γ⋆2)∈t∂g2(γ⋆2), (47)

or, equivalently, there exists so that

 DT2 Gf,g11/μ(D2γ⋆2)+w2=0. (48)

Employing the definition of ,

 DT2 1μ(D2γ⋆2−proxμg1(D2γ⋆2−μ∇f(D2γ⋆2)))+w2=0. (49)

Next, denote

 ^γ1=proxμg1(D2γ⋆2−μ∇f(D2γ⋆2)). (50)

Employing the second prox theorem on (50), we have that the above is equivalent to the existence of for which . Thus, (49) amounts to

 DT2 1μ(D2γ⋆2−D2γ⋆2+μ∇(D2γ⋆2)+μw1)+w2=0, (51)

for some and