Deep learning has been one of the most popular areas of research over the past few years, due in large part to the ability of deep neural networks to outperform humans at a number of cognition tasks, such as object and speech recognition. Despite the mystique that has surrounded their success, recent work has started to provide answers to questions pertaining, on the one hand, to the basic assumptions behind deep networks (when do they work?) and, on the other hand, to interpretability (why do they work?). In , Patel explains deep learning from the perspective of inference in a hierarchical probabilistic graphical model. This leads to new inference algorithms based on belief propagation and its variants. Papyan et al. consider deep convolutional networks through the lens of a multi-layer convolutional sparse coding model. The authors show a correspondence between the sparse approximation step in this multi-layer model and the encoding step (forward pass) in a related deep convolutional network. More recently, building on the work of Papyan et al., Ye et al. have shown that some of the key operations that arise in deep learning (e.g. pooling, ReLu) can be understood from the classical theory of filter banks in signal processing. In a separate line of work, Tishby uses the information bottleneck principle from information theory to characterize the limits of a deep network from an information-theoretic perspective.
Here, we take a more expansive approach than in [1, 2, 3], one that connects deep networks to the theory of dictionary learning, to answer questions pertaining not to basic assumptions and interpretability, but to the sample complexity of learning a deep network: how much data do you need to learn a deep network?
Classical dictionary learning theory  tackles the problem of estimating a single unknown transformation from data obtained through a sparse coding model. The theory gives bounds for the sample complexity of learning a dictionary as a function of the parameters of the sparse coding model. Two key features unite the works from ,  and . These are (a) sparsity, and (b) the use of a hierarchy of transformations/representations as a proxy for the different layers in a deep neural network. Classical dictionary learning theory does not, however, provide a framework for assessing the complexity of learning a hierarchy, or sequence, of transformations from data.
We formulate a deep version of the classical sparse-coding generative model from dictionary learning : starting with a sparse code, a composition of linear transformations is applied to generate an observation. We constrain all the transformations in the composition, except for the last, to have sparse columns, so that their composition yields sparse representations at every step. We solve the deep dictionary learning problem, i.e. learning all of the transformations in the composition, by sequential alternating minimization, starting from the last transformation in the composition up to the first. Each alternating-minimization step involves a sparse approximation step, i.e. a search for a sparse input to each of the transformations in the composition. This is why we constrain the intermediate matrices in the composition to be sparse. As we detail in the main text, our notion of depth refers to the number of transformations in the composition.
We begin the rest of our treatment by briefly introducing our notation (Section 2). Our main contributions are three-fold. First, in Section 3 we develop the connection between classical dictionary learning and deep recurrent sparse auto-encoders [6, 7, 8]. Second, we use this connection in Section 4 to introduce the deep generative model for dictionary learning, and prove that, under regularity assumptions, the sequence of dictionaries can be learnt, and we give a bound on the computational complexity of this learning as a function of the model parameters. Let the transformations in the composition be labeled through . Further letting be the dimension of the input of the transformation and the sparsity of this input, the computational complexity is . As in , the term seems to be an over-estimate from the proof techniques used. This bound can be interpreted as a statement regarding the complexity of learning deep versions of the recurrent auto-encoders from Section 3. Indeed, in neural-network terminology, is the size of the embedding at layer and the number of active neurons at that layer. Third, our proof relies on a certain type of sparse random matrix satisfying the restricted isometry property (RIP). We prove this in Section 5 using results from non-asymptotic random matrix theory . We demonstrate the deep dictionary learning algorithm via simulation in Section 6. The simulations suggest that the term above is an artifact of the proof techniques we rely on to prove our main result. That is, the learning complexity depends on the maximum, across layers, of the product of the number of active neurons and the embedding dimension. We give concluding remarks in Section 7.
2 Notation
We use bold font for matrices and vectors, capital letters for matrices, and lower-case letters for vectors. For a matrix, denotes its column vector, and its element at the row and the column. For a vector , denotes its element. and refer, respectively, to the transpose of the matrix and that of the vector . We use to denote the norm of the vector . We use and to refer, respectively, to the minimum and maximum singular values of the matrix. We will also use to denote the spectral norm (maximum singular value) of a matrix. We will make it clear from context whether a quantity is a random variable/vector. We use to refer to the identity matrix. Its dimension will be clear from the context. Let. For a vector , refers to the set of indices corresponding to its nonzero entries.
3 Shallow Neural Networks and Sparse Estimation
The rectified linear unit (ReLu) is a popular nonlinearity in the neural-networks literature. For $x \in \mathbb{R}$, the ReLu nonlinearity is the scalar-valued function defined as $\mathrm{ReLu}(x) = \max(x, 0)$. In this section, we build a parallel between sparse approximation/$\ell_1$-regularized regression and the ReLU() nonlinearity. This leads to a parallel between dictionary learning  and auto-encoder networks [7, 10]. In turn, this motivates an interpretation of learning a deep neural network as a hierarchical, i.e. sequential, dictionary learning problem in cascade-of-sparse-coding models .
3.1 Unconstrained $\ell_1$-regularization in one dimension
We begin with a derivation of the soft-thresholding operator from the perspective of sparse approximation. Let , , and consider the inverse problem
It is well-known that the solution to Equation 1 is given by the soft-thresholding operator, i.e.
For completeness, we give the derivation of this result in the appendix.
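As a numerical sanity check on the soft-thresholding solution, here is a minimal Python sketch. It assumes that Equation 1 has the standard form of minimizing 0.5*(y - x)^2 + lam*|x| over real x with lam > 0. That form, the symbol `lam`, and the helper names `soft_threshold`, `objective`, and `grid_minimizer` are assumptions made for illustration, not notation taken from the text. The brute-force grid search only serves to confirm the closed form.

```python
def soft_threshold(y: float, lam: float) -> float:
    """Closed-form minimizer of 0.5*(y - x)**2 + lam*|x| (assumed form of Eq. 1)."""
    if y > lam:
        return y - lam
    if y < -lam:
        return y + lam
    return 0.0


def objective(x: float, y: float, lam: float) -> float:
    """Assumed one-dimensional l1-regularized least-squares objective."""
    return 0.5 * (y - x) ** 2 + lam * abs(x)


def grid_minimizer(y: float, lam: float, lo: float = -5.0, hi: float = 5.0,
                   steps: int = 100001) -> float:
    """Brute-force minimizer on a uniform grid; used only as a numerical check."""
    best_x, best_val = lo, objective(lo, y, lam)
    for i in range(1, steps):
        x = lo + (hi - lo) * i / (steps - 1)
        val = objective(x, y, lam)
        if val < best_val:
            best_x, best_val = x, val
    return best_x


if __name__ == "__main__":
    lam = 1.0
    for y in (-3.0, -0.5, 0.2, 2.0):
        # The two values should agree up to the grid resolution.
        print(y, soft_threshold(y, lam), round(grid_minimizer(y, lam), 3))
```

Under this assumed form, observations with magnitude at most `lam` are mapped to zero, and larger observations are shrunk toward zero by `lam`.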
We show next that, subject to a non-negativity constraint, the solution to Equation 1 is a translated version of ReLu  applied to , a form that is more familiar to researchers from the neural-networks community. That is, the ReLu() nonlinearity arises in the solution to a simple constrained $\ell_1$-regularized regression problem in one dimension.
3.2 Constrained $\ell_1$-regularization in one dimension
Consider the inverse problem
The solution to Equation 3 is .
For , the solution to Equation 3 is equivalent to that of Equation 1. For , the solution must be . Suppose, for a contradiction, that . Then the value of the objective function is , which is strictly greater than , i.e. the objective function evaluated at .
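The argument above can be checked numerically with a short Python sketch. It assumes that Equation 3 is the non-negatively constrained problem of minimizing 0.5*(y - x)^2 + lam*x over x >= 0. This form and the helper names are assumptions for illustration. The sketch compares the shifted ReLu with a brute-force search over the feasible set, and `zero_beats_positive` mirrors the contradiction step: when y <= lam, any x > 0 gives a larger objective than x = 0.

```python
def relu(z: float) -> float:
    return max(z, 0.0)


def nonneg_solution(y: float, lam: float) -> float:
    """Candidate closed form: a ReLU shifted by the regularization weight."""
    return relu(y - lam)


def objective(x: float, y: float, lam: float) -> float:
    """Assumed objective; |x| equals x on the feasible set x >= 0."""
    return 0.5 * (y - x) ** 2 + lam * x


def grid_minimizer_nonneg(y: float, lam: float, hi: float = 5.0,
                          steps: int = 50001) -> float:
    """Brute-force search restricted to the feasible interval [0, hi]."""
    candidates = (hi * i / (steps - 1) for i in range(steps))
    return min(candidates, key=lambda x: objective(x, y, lam))


def zero_beats_positive(y: float, lam: float, xs=(0.01, 0.5, 2.0)) -> bool:
    """For y <= lam, check that x = 0 has a smaller objective than each x > 0 tried."""
    return all(objective(x, y, lam) > objective(0.0, y, lam) for x in xs)
```

Expanding the assumed objective gives 0.5*y^2 + x*(lam - y) + 0.5*x^2, which exceeds 0.5*y^2 for every x > 0 whenever y <= lam. This is the inequality that `zero_beats_positive` samples.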
The above result generalizes easily to the case when the observations and the optimization variable both live in higher dimensions and are related through a unitary transform.
3.3 Constrained $\ell_1$-regularization in more than one dimension
Let and , , and a unitary matrix. Consider the problem
Since is unitary, i.e. an isometry, Equation 4 is equivalent to
Equation 6 states that, for unitary, the solution to the $\ell_1$-regularized least-squares problem with non-negativity constraints (Equation 4) is obtained component-wise, by projecting the vector onto the vector and passing it through the ReLu() nonlinearity. Stated otherwise, a simple feed-forward neural network solves the inverse problem of Equation 4. Equation 6 also suggests that plays the role of the bias in neural networks. Allowing for different biases is akin to using a different regularization parameter for each of the components of . Applying the transformation to the vector yields an approximate reconstruction . We depict this two-stage process as a two-layer feed-forward neural network in Figure 1. The architecture depicted in the figure is called an auto-encoder [7, 10]. Given training examples, the weights of the network, which depend on , can be tuned by backpropagation. This suggests a connection between dictionary learning and auto-encoder architectures, which we elaborate upon below.
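The two-stage process just described can be sketched in a few lines of Python. The sketch assumes the encoder computes ReLu(A^T y - lam) component-wise and the decoder applies A to the resulting code. The names `A`, `lam`, `encode`, and `decode`, and the choice of a 2x2 rotation as the unitary matrix, are illustrative assumptions.

```python
import math


def relu(z: float) -> float:
    return max(z, 0.0)


def rotation(theta: float):
    """2x2 rotation matrix, a simple example of a unitary (orthogonal) matrix."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s], [s, c]]


def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def transpose(M):
    return [list(r) for r in zip(*M)]


def encode(A, y, lam):
    """Encoder layer: project y onto the columns of A, subtract the bias, apply ReLU."""
    return [relu(p - lam) for p in matvec(transpose(A), y)]


def decode(A, x):
    """Decoder layer: map the code back to the observation space."""
    return matvec(A, x)


A = rotation(0.3)
x_true = [2.0, 0.0]      # non-negative sparse code
y = matvec(A, x_true)    # noiseless observation
code = encode(A, y, lam=0.5)
y_hat = decode(A, code)
```

In this toy example the active coefficient is recovered up to a shrinkage by `lam`, so the reconstruction `y_hat` is a scaled-down copy of `y`. The inactive coefficient stays at zero.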
Remark 1: The literature suggests that the parallel between the ReLu and sparse approximation dates to the work of . Prior to this, while they do not explicitly make this connection, the authors from  discuss in detail the sparsity-promoting properties of the ReLU compared to other nonlinearities in neural networks.
3.4 Sparse coding, dictionary learning, and auto-encoders
Shallow sparse generative model.
Let be an real-valued matrix generated as follows
Each column of is a sparse vector, i.e. only a few of its elements are non-zero, and these represent the coordinates or code for the corresponding column of in the dictionary .
Remark 2: We call this model “shallow” because there is only one transformation to learn. In Section 4, we will contrast this with a “deep” generative model where we will learn each of the transformations that comprise the composition of multiple linear transformations applied to a sparse code.
Sparse coding and dictionary learning.
Given , the goal is to estimate and . Alternating minimization  is a popular algorithm to find and as the solution to
The algorithm solves Equation 8 by alternating between a sparse coding step, which updates the sparse codes given an estimate of the dictionary, and a dictionary update step, which updates the dictionary given estimates of the sparse codes.
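The alternation between a sparse coding step and a dictionary update step can be illustrated with the following pure-Python sketch. It is not the algorithm analyzed in the cited work. As a stand-in, it applies one proximal-gradient (ISTA) step to the codes and one gradient step to the dictionary, both on the penalized objective 0.5*||Y - A X||_F^2 + lam*||X||_1. The objective, the step sizes, and all names are assumptions chosen so that each sweep cannot increase the objective.

```python
import random


def matmul(P, Q):
    cols = list(zip(*Q))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in P]


def transpose(M):
    return [list(r) for r in zip(*M)]


def sub(P, Q):
    return [[a - b for a, b in zip(rp, rq)] for rp, rq in zip(P, Q)]


def frob2(M):
    """Squared Frobenius norm; an upper bound on the squared spectral norm."""
    return sum(v * v for row in M for v in row)


def soft(v, t):
    """Soft-thresholding of a scalar v at level t >= 0."""
    return max(v - t, 0.0) - max(-v - t, 0.0)


def objective(Y, A, X, lam):
    l1 = sum(abs(v) for row in X for v in row)
    return 0.5 * frob2(sub(Y, matmul(A, X))) + lam * l1


def sparse_coding_step(Y, A, X, lam):
    """One proximal-gradient (ISTA) step on the codes X with the dictionary A fixed."""
    step = 1.0 / max(frob2(A), 1e-12)
    grad = matmul(transpose(A), sub(matmul(A, X), Y))
    return [[soft(x - step * g, step * lam) for x, g in zip(rx, rg)]
            for rx, rg in zip(X, grad)]


def dictionary_step(Y, A, X):
    """One gradient step on the dictionary A with the codes X fixed."""
    step = 1.0 / max(frob2(X), 1e-12)
    grad = matmul(sub(matmul(A, X), Y), transpose(X))
    return [[a - step * g for a, g in zip(ra, rg)] for ra, rg in zip(A, grad)]


def alternate(Y, A, X, lam, sweeps=30):
    """Alternate the two steps and record the objective after every sweep."""
    history = [objective(Y, A, X, lam)]
    for _ in range(sweeps):
        X = sparse_coding_step(Y, A, X, lam)
        A = dictionary_step(Y, A, X)
        history.append(objective(Y, A, X, lam))
    return A, X, history


# Synthetic data: 4-dimensional observations, 6 atoms, 20 samples, 2-sparse codes.
rng = random.Random(0)
m, n, N, s = 4, 6, 20, 2
A_true = [[rng.gauss(0, 1) for _ in range(n)] for _ in range(m)]
X_true = [[0.0] * N for _ in range(n)]
for j in range(N):
    for i in rng.sample(range(n), s):
        X_true[i][j] = rng.choice((-1.0, 1.0)) * rng.uniform(0.5, 1.5)
Y = matmul(A_true, X_true)

A0 = [[rng.gauss(0, 1) for _ in range(n)] for _ in range(m)]
X0 = [[0.0] * N for _ in range(n)]
A_hat, X_hat, history = alternate(Y, A0, X0, lam=0.05)
```

The conservative step sizes (inverse squared Frobenius norms) trade speed for a simple monotonicity guarantee. The sketch makes no claim about recovering `A_true`, which in the theory requires a good initialization and the assumptions on the model.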
Suppose that, instead of requiring equality in Equation 7, our goal were to solve the following problem
If were a unitary matrix, the sparse-coding step could be solved exactly using Equation 6. The goal of the dictionary-learning step is to minimize the reconstruction error between applied to the sparse codes, and the observations. In the neural-network literature, this two-stage process describes so-called auto-encoder architectures [7, 10].
Shallow, constrained, recurrent, sparse auto-encoders.
We introduce an auto-encoder architecture for learning the model from Equation 7. This auto-encoder has an implicit connection with the alternating-minimization algorithm applied to the same model. Given , the encoder produces a sparse code using a finite (large) number of iterations of the ISTA algorithm . The decoder applies to the output of the encoder to reconstruct . We call this architecture a constrained recurrent sparse auto-encoder (CRsAE) . The constraint comes from the fact that the operations used by the encoder and the decoder are tied to each other through . Hence, the encoder and decoder are not independent, unlike in . The auto-encoder is called recurrent because of the ISTA algorithm, which is an iterative procedure. Figure 2 depicts this architecture.
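A forward pass through such a tied-weight auto-encoder can be sketched as follows. This is a minimal illustration, not the CRsAE implementation from the cited work. It assumes non-negative codes, so each unrolled ISTA iteration ends with a ReLu, and it uses the squared Frobenius norm of the dictionary as a conservative Lipschitz constant. The names `ista_encoder`, `decoder`, `crsae_forward`, `lam`, and `num_iters` are hypothetical.

```python
import math


def relu(z: float) -> float:
    return max(z, 0.0)


def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def transpose(M):
    return [list(r) for r in zip(*M)]


def ista_encoder(A, y, lam, num_iters=60):
    """Unrolled ISTA with a ReLU nonlinearity; every iteration reuses the same A."""
    At = transpose(A)
    L = sum(a * a for row in A for a in row)   # ||A||_F^2 bounds the top eigenvalue of A^T A
    x = [0.0] * len(At)
    for _ in range(num_iters):
        residual = [yi - ri for yi, ri in zip(y, matvec(A, x))]
        corr = matvec(At, residual)
        # Gradient step of size 1/L followed by a shifted ReLU (bias lam / L).
        x = [relu(xi + (ci - lam) / L) for xi, ci in zip(x, corr)]
    return x


def decoder(A, x):
    """Decoder tied to the encoder through the same matrix A."""
    return matvec(A, x)


def crsae_forward(A, y, lam, num_iters=60):
    code = ista_encoder(A, y, lam, num_iters)
    return code, decoder(A, code)


c, s = math.cos(0.3), math.sin(0.3)
A = [[c, -s], [s, c]]               # unitary example dictionary
y = matvec(A, [2.0, 0.0])           # observation generated by a 1-sparse code
code, y_hat = crsae_forward(A, y, lam=0.5)
```

For this unitary example the iterations converge to the one-shot solution ReLu(A^T y - lam), which ties the recurrent encoder back to the feed-forward case discussed earlier in this section. The only learnable quantity in the sketch is `A`, regardless of `num_iters`.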
There are two basic take-aways from the previous discussion:
Constrained auto-encoders with ReLu nonlinearities capture the essence of the alternating-minimization algorithm for dictionary learning.
Therefore, the sample complexity of dictionary learning can give us insights on the hardness of learning neural networks.
How to use dictionary learning to assess the sample complexity of learning deep networks?
The “depth” of a neural network refers to the number of its hidden layers, excluding the output layer. A “shallow” network is one with two or three hidden layers . A network with more than three hidden layers is typically called “deep”. Using this definition, the architecture from Figure 2 would be called deep. This is because of iterations of ISTA which, when unrolled [6, 7, 8] would constitute separate layers. This definition, however, does not reflect the fact that the only unknown in the network is . Therefore, the number of parameters of the network is the same as that in a one-layer, fully-connected, feed-forward network.
A popular interpretation of deep neural networks is that they learn a hierarchy, or sequence, of transformations of data. Motivated by this interpretation, we define the “depth” of a network, not in relationship to its number of layers, but as the number of underlying distinct transformations/mappings to be learnt.
Classic dictionary learning tackles the problem of estimating a single transformation from data . Dictionary-learning theory characterizes the sample complexity of learning the model of Equation 7 under various assumptions. We can use these results to get insights on the complexity of learning the parameters of the auto-encoder from Figure 2. Classical dictionary learning theory does not, however, provide a framework for assessing the complexity of learning a hierarchy, or sequence, of transformations from data.
4 Deep Sparse Signal Representations
Our goal is to build a “deep” (in the sense defined previously) version of the model from Equation 7, i.e. a generative model in which, starting with a sparse code, a composition of linear transformations is applied to generate an observation. What properties should such a model obey? In the previous section, we used the sparse generative model of Equation 7 to motivate the auto-encoder architecture of Figure 2. The goal of the encoder is to produce sparse codes . We will construct a “deep” version of the auto-encoder and use it to infer desirable properties of the “deep” generative model.
4.1 Deep, constrained, recurrent, sparse auto-encoders
For simplicity, let us consider the case of two linear transformations and . is applied to a sparse code, and to its output to generate an observation. Applied to an observation, the goal of the ISTA encoder is to produce sparse codes. This is only reasonable if applied to the sparse code produces sparse/approximately sparse observations, i.e. the image of must be sparse/approximately sparse.
For the composition of more than two transformations, the requirement that the encoders applied in cascade produce sparse codes suggests that, starting with a sparse code, the output of each of the transformations, except for the very last, which gives the observations, must be approximately sparse.
We specify our deep sparse generative model below, along with the assumptions that accompany the model.
4.2 Deep sparse generative model and coding
Let be the real-valued matrix obtained by applying the composition of linear transformations to a matrix of sparse codes
If we further assume that each column of is -sparse, i.e. at most of the entries of each column are nonzero, the image of each of the successive transformations will also be sparse. Finally, we apply the transformation to obtain the observations .
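The way column sparsity propagates through the composition can be illustrated with a small Python sketch. The dimensions, the sparsity levels, the entry distributions, and the names `inner`, `outer`, `col_sparsity`, and `code_sparsity` are all hypothetical. The only property the sketch relies on is that multiplying a vector with k nonzeros by a matrix whose columns each have at most c nonzeros yields a vector with at most k*c nonzeros.

```python
import random


def sparse_column_matrix(rows, cols, col_sparsity, rng):
    """Random matrix whose columns each have exactly `col_sparsity` nonzero entries."""
    M = [[0.0] * cols for _ in range(rows)]
    for j in range(cols):
        for i in rng.sample(range(rows), col_sparsity):
            M[i][j] = rng.choice((-1.0, 1.0)) * rng.uniform(0.5, 1.5)
    return M


def dense_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def num_nonzero(v):
    return sum(1 for x in v if x != 0.0)


def generate(inner, outer, code):
    """Apply the sparse-column matrices in order, then the final dense matrix."""
    reps = [code]
    for M in inner:
        reps.append(matvec(M, reps[-1]))
    return reps, matvec(outer, reps[-1])


rng = random.Random(0)
dims = [40, 60, 80]      # code dimension, then the two intermediate dimensions
col_sparsity = 3
code_sparsity = 2
inner = [sparse_column_matrix(dims[k + 1], dims[k], col_sparsity, rng)
         for k in range(2)]
outer = dense_matrix(30, dims[-1], rng)

code = [0.0] * dims[0]
for i in rng.sample(range(dims[0]), code_sparsity):
    code[i] = rng.uniform(0.5, 1.5)

reps, y = generate(inner, outer, code)
# Worst-case number of nonzeros after each sparse-column transformation.
bounds = [code_sparsity * col_sparsity ** k for k in range(len(reps))]
```

Here `reps` holds the sparse code followed by the intermediate representations, and `bounds` gives 2, 6, and 18 as the respective worst cases. Only the final dense transformation produces a generically non-sparse observation `y`.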
Given , we would like to solve the following problem
Remark 4: If , Equation 10 reduces to the “shallow” sparse generative model from Equation 7, a problem that is well-studied in the dictionary-learning literature , and for which the authors propose an alternating-minimization procedure whose theoretical properties they study in detail.
In what follows, it will be useful to define the matrix , namely the output of the operator in Equation 10, . At depth , the columns of , are sparse representations of the signal , i.e. they are deeply sparse.
Reduction to a sequence of “shallow” problems: the case .
Step 1. Find and : We first solve the following problem
Step 2. Find and . We can now solve
Remark 5: At this point, the reader would be justified in asking the following question: is a matrix with sparse columns that should satisfy RIP, do such matrices exist? In Section 5, we will answer this question in the affirmative for a certain class of matrices for which the nonzero entries of each column are chosen at random. We will appeal to standard results from random matrix theory .
We now state explicitly our assumptions on the “deep” generative model of Equation 10. These assumptions will let us give guarantees and sample-complexity estimates for the success, for arbitrary , of the sequential alternating-minimization algorithm described above for . The reader can compare these assumptions to assumptions A1–A7 from . As in , we assume, without loss of generality, that the columns of all have unit norm, i.e. , , .
Let , , and , , .
(A1) Dictionary Matrices satisfying RIP: For each , the dictionary matrix has -RIP constant of .
(A2) Spectral Condition of Dictionary Elements: For each , the dictionary matrix has bounded spectral norm, for some constant , .
(A3) Non-zero Entries in Coefficient Matrix: The non-zero entries of are drawn i.i.d. from a distribution such that , and satisfy the following a.s.: .
(A4) Sparse Coefficient Matrix: The columns of the coefficient matrix have non-zero entries which are selected uniformly at random from the set of all -sized subsets of . It is required that , for some universal constant . We further require that, for , .
(A5) Sample Complexity: For some universal constant , and given failure parameters , the number of samples needs to satisfy,
Here , , .
(A6) Initial dictionary with guaranteed error bound: It is assumed that, , we have access to an initial dictionary estimate such that
(A7) Choice of Parameters for Alternating Minimization: For all , AltMinDict() uses a sequence of accuracy parameters and
We are now in a position to state our main result regarding the ability to learn the “deep” generative model of Equation 10, i.e. recover under assumptions A1–A7.
4.3 Learning the “deep” sparse coding model by sequential alternating minimization
Theorem 1 (Exact recovery of the “deep” generative model)
Let us denote by the event , . Let , then
The theorem states that, with the given probability, we can learn all of the transformations in the deep sparse generative model. Assumption A5 is a statement about the complexity of this learning: the computational complexity is . This can be interpreted as a statement regarding the complexity of learning deep versions of the recurrent auto-encoders from Section 3. Indeed, in neural-network terminology, is the size of the embedding at layer and the number of active neurons at that layer. The simulations (Section 6) suggest that the term above is an artifact of the proof techniques we rely on to arrive at our main result. That is, the learning complexity depends on the maximum, across layers, of the product of the number of active neurons and the embedding dimension.
We will prove the result by induction on . Before proceeding with the proof, let us discuss in detail the case when in Equation 10 and . We focus on exact recovery of and and defer computation of the probability in Equation 17 to the proof that will follow.
Intuition behind the proof: the case .
Algorithm 2 begins by solving for . If we can show that the algorithm succeeds for this pair, in particular that , then it follows that in the following iteration of the algorithm. This is because, if the first iteration were to succeed, then , which is the very model of Equation 7, treated in detail in . If we can show that the sparse matrix satisfies RIP (the topic of Section 5), then we can apply Theorem 1 from  to guarantee recovery of .
Focusing on , the key point we would like to explain is that, in Equation 7, the properties of that allow the authors from  to prove their main result also apply to . This is not immediately obvious because is the product of a sparse matrix and the matrix of codes. We address these points one at a time in the following remark.
We first note that, since the columns of are i.i.d., so are those of . Moreover, since both the entries of and are bounded by assumptions, so are those of .
By construction, is a sparse matrix: each of its columns is at most sparse. It is not trivial, however, to compute . Luckily, we do not need this probability explicitly, as long as we can either bound it, or bound the singular values of the matrix and the matrix of indicator values of its nonzero entries. It is not hard to show that
It is not hard to show that Lemma 3.2 from  applies to . Lemma 3.2 relies on Lemmas A1 and A2, which give bounds for the matrix of indicator values for the nonzero entries of . For , we can replace, in the proof of Lemma 3.2, the matrix of indicators of its nonzero entries with the product of the matrix of indicators of the nonzero entries for and respectively. This yields a bound that now depends on the , and the sparsity level .
Applying Lemmas A1 and A2 from  to and , respectively, yields bounds for their smallest and largest singular values. Using standard singular value relationships, this gives a version of Lemma A2 for , i.e. bounds for its smallest and largest singular values. A version of Lemmas A3 and A4 for directly follows.
Since is -sparse, , we can prove the Bernstein inequalities to obtain the upper bounds from Lemma A5. These upper bounds are all that are necessary to prove Lemma A6 for .
The version of Lemma 3.3–the center piece of –for them follows.
The interested reader can verify all of the above for herself. A detailed technical exposition of these points would lead to a tedious and unnecessary digression, without adding much intuition. Using induction, it can be shown that this remark applies to for all .
We can now apply Theorem 1 from  to , guaranteeing recovery of .
We proceed by induction on .
Base case: . In this case, . Following the remark above, obeys the properties of from Theorem 1 in . Under A1–A7, this theorem guarantees that , the limit as of converges to with probability at least . Therefore, , proving the base case.
Induction: Suppose the theorem is true for . We will show that it is true for .
Conditioned on the event , . Therefore, under A1–A7, the limit as of converges to with probability at least . Therefore
This completes the proof.
4.4 Alternate algorithm for learning the “deep” generative model
Algorithm 2 learns the model of Equation 10 sequentially, starting with and ending with . In this section, we sketch out a learning procedure that proceeds in the opposite way. We begin by giving the intuition for this procedure for the case .
Alternate learning algorithm: the case .
As in the case of Algorithm 2, the procedure relies on the sequential application of Algorithm 1. We first learn the product . Having learnt this product, we then use it to learn the product , which automatically yields . Finally, we use to learn and .
The sequential procedure described above poses, however, one technical difficulty. To learn the product , a sufficient condition  is that it must satisfy RIP of order . Assumption A1 only requires that the matrices , and satisfy RIP separately. We now show that assumption A1 has implications for the RIP constant of a certain order of the product matrix.
Before stating the result, we introduce some notation and present the alternate algorithm. We let and
Theorem 2 (RIP-like property of )
Suppose is sparse, then
We proceed by induction on .
Base case: . The theorem is true for this case by assumption A1.
Induction: Suppose the theorem is true for , we will show that it holds true for . Let be a -sparse vector
is a -sparse vector, allowing us to apply our inductive hypothesis
The result follows by assumption A1 since satisfies the RIP of order .
A direct consequence of the theorem is that , the RIP constant of must be smaller than or equal to . As long as this quantity is less than , we can expect Algorithm 3 with to succeed in recovering all dictionaries.
5 Concentration of eigenvalues of column-sparse random matrices with i.i.d. sub-Gaussian entries
The proof of our main result, Theorem 1, relies on random sparse matrices satisfying RIP. Here we show that a class of random sparse matrices indeed satisfies RIP.
5.1 Sparse random sub-Gaussian matrix model
Let be a matrix with columns . Let be a binary random matrix with columns that are i.i.d. -sparse binary random vectors each obtained by selecting entries from without replacement, and letting be the indicator random variable of whether a given entry was selected. Let be a random matrix with i.i.d. entries distributed according to a zero-mean sub-Gaussian random variable with variance, almost surely, and sub-Gaussian norm (we adopt the notation from  to denote the sub-Gaussian norm of a random variable). We consider the following generative model for the entries of :
It is not hard to verify that the random matrix thus obtained is such that . To see this we note the following properties of the generative model for
Let , .
Let , .
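To make the matrix model concrete, here is a hedged Python sketch of one instance of it. The generative equation itself is not reproduced in this text, so the sketch assumes that each column has its support chosen uniformly at random without replacement, and that the nonzero values are Rademacher signs scaled by 1/sqrt(s). Rademacher variables are one example of bounded, zero-mean sub-Gaussian entries. The scaling is an assumption chosen so that every column has unit norm, in line with the unit-norm convention for dictionary columns. All function and variable names are hypothetical.

```python
import math
import random


def sparse_random_matrix(rows, cols, s, rng):
    """Columns have s nonzeros at uniformly random positions (no replacement);
    nonzero values are random signs scaled by 1/sqrt(s), giving unit-norm columns."""
    A = [[0.0] * cols for _ in range(rows)]
    for j in range(cols):
        for i in rng.sample(range(rows), s):
            A[i][j] = rng.choice((-1.0, 1.0)) / math.sqrt(s)
    return A


def column(A, j):
    return [row[j] for row in A]


def inner(u, v):
    return sum(a * b for a, b in zip(u, v))


rng = random.Random(1)
rows, cols, s = 200, 50, 5
A = sparse_random_matrix(rows, cols, s, rng)
cols_list = [column(A, j) for j in range(cols)]

# Column norms (all equal to 1 under the assumed scaling).
norms = [math.sqrt(inner(c, c)) for c in cols_list]

# Largest absolute inner product between distinct columns (mutual coherence),
# a crude proxy for how close the Gram matrix is to the identity.
coherence = max(abs(inner(cols_list[i], cols_list[j]))
                for i in range(cols) for j in range(i + 1, cols))
```

The coherence is only an easily computed indicator of near-orthogonality. It does not substitute for the singular-value concentration results that follow.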
Ultimately, we would like to understand the concentration behavior of the singular values of 1) , and 2) sub-matrices of that consist of a sparse subset of columns (RIP-like results). We first recall the following result from non-asymptotic random matrix theory , and apply it to obtain a concentration result on the singular values of the matrix .
Theorem 3 (Restatement of Theorem 5.39 from  (Sub-Gaussian rows))
Let matrix whose rows ( is the column of ) are independent sub-Gaussian isotropic random vectors in . Then for every , with probability at least one has
Here, , depend only on the sub-Gaussian norm of the rows.
Before we can apply the above result to , we need to demonstrate that the columns of are sub-Gaussian random vectors, defined as follows
Definition 4 (Definition 5.22 from  (Sub-Gaussian random vectors))
We say that a random vector in is sub-Gaussian if the one-dimensional marginals are sub-Gaussian random variables for all in . The sub-Gaussian norm of is defined as
Theorem 5 (Columns of are sub-Gaussian random vectors)
For every , is a sub-Gaussian random vector. Moreover,
where is a universal constant.
We show this by bounding :