1 Introduction
Deep learning has been one of the most popular areas of research over the past few years, due in large part to the ability of deep neural networks to outperform humans at a number of cognition tasks, such as object and speech recognition. Despite the mystique that has surrounded their success, recent work has started to provide answers to questions pertaining, on the one hand, to basic assumptions behind deep networks–when do they work?–and, on the other hand, to interpretability–why do they work? In [2], Patel explains deep learning from the perspective of inference in a hierarchical probabilistic graphical model. This leads to new inference algorithms based on belief propagation and its variants. Papyan et al [1] consider deep convolutional networks through the lens of a multilayer convolutional sparse coding model. The authors show a correspondence between the sparse approximation step in this multilayer model and the encoding step (forward pass) in a related deep convolutional network. More recently, building on the work of Papyan et al, Ye et al [3] have shown that some of the key operations that arise in deep learning (e.g. pooling, ReLu) can be understood from the classical theory of filter banks in signal processing. In a separate line of work, Tishby [4] uses the information bottle neck principle from information theory to characterize the limits of a deep network from an informationtheoretic perspective.
Here, we take a more expansive approach than in [1, 2, 3] that connects deep networks to the theory of dictionary learning, to answer questions pertaining, not to basic assumptions and interpretability, but to the sample complexity of learning a deep network–how much data do you need to learn a deep network?.
Classical dictionary learning theory [5] tackles the problem of estimating a single unknown transformation from data obtained through a sparse coding model. The theory gives bounds for the sample complexity of learning a dictionary as a function of the parameters of the sparse coding model. Two key features unite the works from [1], [2] and [3]. The first is (a) sparsity, and the second (b) the use of a hierarchy of transformations/representations as a proxy for the different layers in a deep neural networks. Classical dictionary learning theory does not, however, provide a framework for assessing the complexity of learning a hierarchy, or sequence, of transformations from data.
We formulate a deep version of the classical sparsecoding generative model from dictionary learning [5]
: starting with a sparse code, a composition of linear transformations are applied to generate an observation. We constraint all the transformations in the composition, except for the last, to have sparse columns, so that their composition yields sparse representations at every step. We solve the deep dictionary learning–learning all of the transformation in the composition– problems by sequential alternating minimization, starting from the last transformation in the composition up to the first. Each alternatingminimization step involves a sparse approximation step, i.e. a search for a sparse input to each of the transformations in the composition. That’s why, we constraint the intermediate matrices in the composition to be sparse. As we detail in the main text our notion of depth refers to the number of transformations in the composition.
We begin the rest of our treatment by briefly introducing our notation (Section 2). Our main contributions are threefold. First, in Section 3 we develop the connection between classical dictionary learning and deep recurrent sparse autoencoders [6, 7, 8]. Second, we use this connection in Section 4 to introduce the deep generative model for dictionary learning, and prove that, under regularity assumptions, the sequence of dictionaries can be learnt and give a bound on the computational complexity of this learning as a function of the model parameters. Let the transformations in the composition be labeled through . Further letting be the dimension of the input of the transformation and the sparsity of this input, the computational complexity is . As in [5], the term seems to be an overestimate from the proof techniques used. This bound can be interpreted as a statement regarding the complexity of learning deep versions of the recurrent autoencoders from Section 3. Indeed, in neural networks terminology, is the size of the embedding at layer and the number active neurons at that layer. Third, our proof relies on a certain type of sparse random matrix satisfying the RIP property. We prove this in Section 5 using results from nonasymptotic random matrix theory [9]. We demonstrate the deep dictionary learning algorithm via simulation in Section 6. The simulations suggest that the term above is an artifact of the proof techniques we rely on to prove our main result. That is, the learning complexity depends on the maximum, across layers, of the product of the number of active neurons and the embedding dimension. We give concluding remarks in Section 7.
2 Notation
We use bold font for matrices and vectors, capital letters for matrices, and lowercase letters for vectors. For a matrix
, denotes its column vector, and its element at the row and the column. For a vector , denotes its element. and refer, respectively, to the transpose of the matrix and that of the vector . We use to denote the norm of the vector . We use andto refer, respectively, to the minimum and maximum singular values of the matrix
. We will also useto denote the spectral norm (maximum singular value of a matrix). We will make it clear from context whether a quantity is a random variable/vector. We use
to refer the identity matrix. Its dimension we will be clear from the context. Let
. For a vector , refers to set of indices corresponding to its nonzero entries.3 Shallow Neural Networks and Sparse Estimation
The rectifierlinear unit–ReLu–is a popular nonlinearity in the neuralnetworks literature. Let
, the ReLu nonlinearity is the scalarvalued function defined as ReLu(. In this section, we build a parallel between sparse approximation/regularized regression and the ReLU() nonlinearity. This leads to a parallel between dictionary learning [5] and autoencoder networks [7, 10]. In turn, this motivates an interpretation of learning a deep neural networks as a hierarchical, i.e. sequential, dictionary learning problem in cascadeofsparsecoding models [1].3.1 Unconstrained regularization in one dimension
We begin by a derivation of the softthresholding operator from the perspective of sparsity approximation. Let , , and consider the inverse problem
(1) 
It is wellknown that the solution to Equation 1 is given by the softthresholding operator, i.e.
(2) 
For completeness, we give the derivation of this result in the appendix.
We show next that, subject to a nonnegativity constraint, the solution to Equation 1 is a translated version of ReLu [11] applied to , a form that is more familiar to researchers from the neuralnetworks community. That is, the ReLu() nonlinearity arises in the solution to a simple constrained regularized regression problem in one dimension.
3.2 Constrained regularization in one dimension
For , the solution to Equation 3 is equivalent to that of Equation 1. For , the solution must be . Suppose, for a contradiction, that , then the value of the objective function is , which is strictly greater that , i.e. the objective function evaluated at .
The above result generalizes easily to the case when the observations and the optimization variable both live in higher dimensions and are related through a unitary transform.
3.3 Unconstrained regularization in more that one dimension
Let and , , and
a unitary matrix. Consider the problem
(4) 
Since is unitary, i.e. an isometry, Equation 4 is equivalent to
(5) 
where , . Equation 5 is separable in . For each , the optimization is equivalent to Equation 3 with as the input, . Therefore,
(6) 
Equation 6 states that, for unitary, the solution to the regularized leastsquares problem with nonnegativity constraints (Equation 4) is obtained componentwise, by projecting the vector onto the vector and passing it through the ReLu(
). nonlinearity. Stated otherwise, a simple feedforward neural network solves the inverse problem of Equation
4. Equation 6 also suggests that plays the role of the bias in neural networks. Allowing for different biases is akin to using a different regularization parameter for each of the components of . Applying the transformation to he vector yields an approximate reconstruction . We depict this twostage process as a twolayer feedforward neural network in Figure 1. The architecture depicted in the figure is called an autoencoder [7, 10]. Given training examples, the weights of the network, which depend on, can be tuned by backpropagation. This suggests a connection between dictionary learning and autoencoder architectures, which we elaborate upon below.
Remark 1: The literature suggests that the parallel between the ReLu and sparse approximation dates to the work of [12]. Prior to this, while they do not explicitly make this connection,the authors from [11] discuss in detail the sparsitypromoting properties of the ReLU compared to other nonlinearities in neural networks.
3.4 Sparse coding, dictionary learning, and autoencoders
We use the relationship between the ReLu() and regularized regression to draw a parallel between dictionary learning [5] and a specific autoencoder [7, 10] neuralnetwork architecture.
Shallow sparse generative model.
Let be an realvalued matrix generated as follows
(7) 
Each column of is a sparse vector, i.e. only a few of its elements are nonzero, and these represent the coordinates or code for the corresponding column of in the dictionary [5].
Remark 2: We call this model “shallow” because there is only one transformation to learn. In Section 4, we will contrast this with a “deep” generative model where we will learn each of the transformations that comprise the composition of multiple linear transformations applied to a sparse code.
Sparse coding and dictionary learning.
Given , the goal is to estimate and . Alternating minimization [5] is a popular algorithm to find and as the solution to
(8) 
The algorithms solves Equation 8 by alternating between a sparse coding step, which updates the sparse codes given an estimate of the dictionary, and a dictionary update step, which updates the dictionary given estimates of the sparse codes.
Suppose that instead of requiring equality in Equation 7, our goal where instead to solve the following problem
(9) 
If were a unitary matrix, the sparsecoding step could be solved exactly using Equation 6. The goal of the dictionarylearning step is to minimize the reconstruction error between applied to the sparse codes, and the observations. In the neuralnetwork literature, this twostage process describes socalled autoencoder architectures [7, 10].
Shallow, constrained, recurrent, spare autoencoders.
We introduce an autoencoder architecture for learning the model from Equation 7. This autoencoder has an implicit connection with the alternatingminimization algorithm applied to the same model. Given , the encoder produces a sparse code using a finite (large) number of iterations of the ISTA algorithm [13]. The decoder applies to the output of the decoder to reconstruct . We call this architecture a constrained recurrent sparse autoencoder (CRsAE) [8]. The constraint comes from the fact that the operations used by the encoder and the decoder are tied to each other through . Hence, the encoder and decoder are not independent, unlike in [7]. The auto encoder is called recurrent because of the ISTA algorithm, which is an iterative procedure. Figure 1 depicts this architecture.
There are two basic takeaways from the previous discussion

Constrained autoencoders with ReLu nonlinearities capture the essence of the alternatingminimization algorithm for dictionary learning.

Therefore, the sample complexity of dictionary learning can give us insights on the hardness of learning neural networks.
How to use dictionary learning to assess the sample complexity of learning deep networks?
The “depth” of a neural network refers to the number of its hidden layers, excluding the output layer. A “shallow” network is one with two or three hidden layers [14]. A network with more than three hidden layers is typically called “deep”. Using this definition, the architecture from Figure 2 would be called deep. This is because of iterations of ISTA which, when unrolled [6, 7, 8] would constitute separate layers. This definition, however, does not reflect the fact that the only unknown in the network is . Therefore, the number of parameters of the network is the same as that in a onelayer, fullyconnected, feedforward network.
A popular interpretation of deep neural networks is that they learn a hierarchy, or sequence, of transformations of data. Motivated by this interpretation, we define the “depth” of a network, not in relationship to its number of layers, but as the number of underlying distinct transformations/mappings to be learnt.
Classic dictionary learning tackles the problem of estimating a single transformation from data [5]. Dictionarylearning theory characterizes the sample complexity of learning the model of Equation 7 under various assumptions. We can use these results to get insights on the complexity of learning the parameters of the autoencoder from Figure 2. Classical dictionary learning theory does not, however, provide a framework for assessing the complexity of learning a hierarchy, or sequence, of transformations from data.
4 Deep Sparse Signal Representations
Our goal is to build a “deep” (in the sense defined previously) version of the model from Equation 7, i.e. a generative model in which, starting with a sparse code, a composition of linear transformations are applied to generate an observation. What properties should such a model obey? In the previous section, we used the sparse generative model of Equation 7 to motivate the autoencoder architecture of Figure 2. The goal of the encoder is to produce sparse codes [11]. We will construct a “deep” version of the autoencoder and use it to infer desirable properties of the “deep” generative model.
4.1 Deep, constrained, recurrent, sparse autoencoders
For simplicity, let us consider the case of two linear transformations and . is applied to a sparse code, and to its output to generate an observation. Applied to an observation, the goal of the ISTA encoder is to produce sparse codes. This is only reasonable if applied to the sparse code produces sparse/approximately sparse observations, i.e. the image of must be sparse/approximately sparse.
For the composition of more than two transformations, the requirement that the encoders applied in cascade produce sparse codes suggests that, starting with a sparse code, the output of each of the transformations, expect for the very last which gives the observations, must be approximately sparsely.
We specify our deep sparse generative model below, along with the assumptions that accompany the model.
4.2 Deep sparse generative model and coding
Let be the realvalued matrix obtained by applying the composition of linear transformations to a matrix of sparse codes
(10)  
uniformly bounded. 
If we further assume that each column of is sparse, i.e. at most of the entries of each column are nonzero, the image of each of the successive transformations will also be sparse. Finally, we apply the transformation to obtain the observations .
Given , we would like solve the following problem
(11) 
Remark 4: If , Equation 10 reduces to the “shallow” sparse generative model from Equation 7, a problem that is wellstudied in dictionarylearning literature [5], and for which the authors propose an alternatingminimization procedure whose theoretical properties they study in detail.
In what follows, it will be useful to define the matrix , namely the output of the operator in Equation 10, . At depth , the columns of , are sparse representations of the signal , i.e. they are deeply sparse.
Reduction to a sequence of “shallow” problems: the case .
To solve Equation 11, we will reduce to the problem to a sequence of “shallow” problems of the form studied in [5]. To gain some intuition, let us consider the case when . We will proceed as follows
Step 1. Find and : We first solve the following problem
(12) 
We solve Equation 12 using the alternatingminimization algorithm from [5]–Algorithm 1 below–which under regularity assumptions, guarantees that, with high probability, .
Step 2. Find and . We can now solve
(13) 
Appealing once again to Algorithm 1 from [5], we can conclude that, with high probability, we have solved for , and .
Remark 5: At this point, the reader would be justified in asking the following question: is a matrix with sparse columns that should satisfy RIP, do such matrices exist? In Section 5, we will answer this question in the affirmative for a certain class of matrices for which the nonzero entries of each column are chosen at random. We will appeal to standard results from random matrix theory [9].
We now state explicitly our assumptions on the “deep” generative model of Equation 10. These assumptions will let us give guarantees and samplecomplexity estimates for the success, for arbitrary , of the sequential alternatingminimization algorithm described above for . The reader can compare these assumptions to assumptions A1–A7 from [5]. As in [5], we assume, without any loss in generality that the columns of all have unit norm, i.e. , , .
Assumptions:
Let , , and , , .

Dictionary Matrices satisfying RIP: For each , the dictionary matrix has RIP constant of .

Spectral Condition of Dictionary Elements: For each , the dictionary matrix has bounded spectral norm, for some constant , .

Nonzero Entries in Coefficient Matrix: The nonzero entries of are drawn i.i.d. from a distribution such that , and satisfy the following a.s.: .

Sparse Coefficient Matrix: The columns of the coefficient matrix have nonzero entries which are selected uniformly at random from the set of all sized subsets of . It is required that , for some universal constant . We further require that, for , .

Sample Complexity: For some universal constant , and given failure parameters , the number of samples needs to satisfy,
(14) Here , , .

Initial dictionary with guaranteed error bound: It is assumed that, , we have access to an initial dictionary estimate such that
(15) 
Choice of Parameters for Alternating Minimization: For all , AltMinDict() uses a sequence of accuracy parameters and
(16)
We are now in a position to state our main result regarding the ability to learn the “deep” generative model of Equation 10, i.e. recover under assumptions A1–A7.
4.3 Learning the “deep” sparse coding model by sequential alternating minimization
Algorithm 2 describes the “deep” dictionary learning algorithm. The Algorithm requires the the specification of a variable . Given , Algorithm 2 solves for .
Theorem 1 (Exact recover of the “deep” generative model)
Let us denote by the event , . Let , then
(17) 
The Theorem states that, with the given probability, we can learn all of the transformations in the deep sparse generative model. Assumption A5 is a statement about the complexity of this learning: the computational complexity is . This can be interpreted as a statement regarding the complexity of learning deep versions of the recurrent autoencoders from Section 3. Indeed, in neural networks terminology, is the size of the embedding at layer and the number active neurons at that layer. The simulations (Section 6) suggest that the term above is an artifact of the proof techniques we rely on to arrive at our main result. That is, the learning complexity depends on the maximum, across layers, of the product of the number of active neurons and the embedding dimension.
We will prove the result by induction on . Before proceeding with the proof, let us discuss in detail the case when in Equation 10 and . We focus on exact recovery of and and defer computation of the probability in Equation 17 to the proof that will follow.
Intuition behind the proof: the case .
Algorithm 2 begins by solving for . If we can show that the algorithm succeeds for this pair, in particular that , then it follows that in the following iteration of the algorithm. This is because, if the first iteration were to succeed, then , which is the very model of Equation 7, which was treated in detail [5]. If we can show that the sparse matrix follows RIP–topic of the the next section Section 5–then we can apply Theorem 1 from [5] to guarantee recovery of .
Focusing on , the key point we would like to explain is that, in Equation 7, the properties of that allow the authors from [5] to prove their main result also apply to . This is not directly obvious because is the product of a sparse matrix and the matrix of codes. We address these points one at a time in the following remark
Remark 6:

We first note that, since the columns of are i.i.d., so are those of . Moreover, since both the entries of and are bounded by assumptions, so are those of .

By construction, is a sparse matrix: each of its columns is at most sparse. It is not trivial, however, to compute . Luckily, we do not need this probability explicitly, as long as we can either bound it, or bound the singular values of the matrix and the matrix of indicators values of its nonzero entries. It is not hard to show that

It is not hard to show that Lemma 3.2 from [5] applies to . Lemma 3.2 relies on Lemmas A1 and A2, which give bounds for the matrix of indicator values for the nonzero entries of . For , we can replace, in the proof of Lemma 3.2, the matrix of indicators of its nonzero entries with the product of the matrix of indicators of the nonzero entries for and respectively. This yields a bound that now depends on the , and the sparsity level .

Applying Lemma A1 and A2 from [5] to and , respectively, yields bounds for their lowest and largest singular values. Using standard singular value relationships, this gives a version of Lemma A2 for —, i.e. bounds for its lowest and largest singular values. A version of Lemmas A3 and A4 for directly follows.

Since is sparse, , we can prove the Bernstein inequalities to obtain the upper bounds from Lemma A5. These upper bounds are all that are necessary to prove Lemma A6 for .

The version of Lemma 3.3–the center piece of [5]–for them follows.
The interested reader can verify all of the above for herself. A detailed technical exposition of these points would lead to a tedious and unnecessary digression, without adding much intuition. Using induction, it can be shown that this remark applied to for all .
We can now to apply Theorem 1 from [5] to , guaranteeing recovery of .
Proof 1
We proceed by induction on .
Base case: . In this case, . Following the remark above, obeys the properties of from Theorem 1 in [5]. Under A1–A7, this theorem guarantees that , the limit as of converges to with probability at least . Therefore, , proving the base case.
Induction: Suppose the Theorem is true for , we will show that is true for .
Conditioned on the event , . Therefore, under A1–A7, the limit as of converges to with probability at least . Therefore
(18) 
This completes the proof.
4.4 Alternate algorithm for learning the “deep” generative model
Algorithm 2 learns the model of Equation 10 sequentially, starting with and ending with . In this section, we sketch out a learning procedure that proceeds in the opposite way. We begin by giving the intuition for this procedure for the case .
Alternate learning algorithm: the case .
As in the case of Algorithm 2, the procedure relies on the sequential application of Algorithm 1. We first learn the product . Having learnt this product, we then use it to learn the product , which automatically yields . Finally, we use to learn and .
The sequential procedure described above poses, however, one technical difficulty. To learn the product , a sufficient condition [5] is that it must satisfy RIP of order . Assumptions A1 only requires that the matrices , and satisfy RIP separately. We now show that assumption A1 has implications on the RIP constant of a certain order of the product matrix.
Before stating the result, we introduce some notation and present the alternate algorithm. We let and
(19)  
(20)  
(21) 
Theorem 2 (RIPlike property of )
Suppose is sparse, then
(22) 
Proof 2
We proceed by induction on .
Base case: . The theorem is true for this case by assumption A1.
Induction: Suppose the theorem is true for , we will show that it holds true for . Let be a sparse vector
(23) 
is a sparse vector, allowing us to apply our inductive hypothesis
(24) 
The result follows by assumption A1 since satisfies the RIP of order .
A direct consequence of the theorem is that , the RIP constant of must be smaller than or equal to . As long as this quantity is less than , we can expect Algorithm 3 with to succeed in recovering all dictionaries.
5 Concentration of eigenvalues of columnsparse random matrices with i.i.d. subGaussian entries
The proof of our main result, Theorem 1, relies on random sparse matrices satisfying RIP. Here we show that a class of random sparse matrices indeed satisfies RIP.
5.1 Sparse random subGaussian matrix model
Let be a matrix with columns . Let be a binary random matrix with columns that are i.i.d. sparse binary random vectors each obtained by selecting entries from without replacement, and letting
be the indicator random variable of whether a given entry
was selected. Let be a random matrix with i.i.d. entries distributed according to a zeromean subGaussian random variablewith variance
, almost surely, and subGaussian norm –we adopt the notation from [9] to denote the subGaussian norm of a random variable. We consider the following generative model for the entries of :(25) 
It is not hard to verify that the random matrix thus obtained is such that . To see this we note the following properties of the generative model for

.

Let , .

.

Let , .

a.s, .
Ultimately, we would like to understand the concentration behavior of the singular values of 1) , and 2) submatrices of that consist of a sparse subset of columns (RIPlike results). We fist recall the following result from nonasymptotic random matrix theory [9], and apply it obtain a concentration result on the singular values of the matrix .
Theorem 3 (Restatement of Theorem 5.39 from [9] (SubGaussian rows))
Let matrix whose rows ( is the column of ) are independent subGaussian isotropic random vectors in . Then for every , with probability at least one has
(26) 
Here, , depend only on the subGaussian norm of the rows.
Before we can apply the above result to , we need to demonstrate that the columns of are subGaussian random vectors, defined as follows
Definition 4 (Definition 5.22 from [9] (SubGaussian random vectors))
We say that a random vector in is subGaussian if the onedimensional marginals are subGaussian random variables for all in . The subGaussian norm of is defined as
(27) 
Theorem 5 (Columns of are subGaussian random vectors)
For every , is a subGaussian random vector. Moreover,
(28) 
where is a universal constant.
Proof 3
We show this by bounding :
(29)  
(30)  
(31)  
Comments
There are no comments yet.