1 Introduction
Deep learning has achieved remarkable success in many technological areas, including automatic speech recognition [Hinton et al.2012, Hannun et al.2014], natural language processing [Collobert et al.2011, Mikolov et al.2013, Cho et al.2014], and computer vision, in particular with deep Convolutional Neural Networks (CNNs) [LeCun et al.1989, Krizhevsky et al.2012, Simonyan and Zisserman2015, Szegedy et al.2015]. Following the unprecedented success of deep networks, several theoretical works [Arora et al.2014, Arora et al.2015, Paul and Venkatasubramanian2014] have suggested mathematical models for different deep learning architectures. However, theoretical analysis and understanding lag behind the very rapid evolution and empirical success of deep architectures, and more theoretical analysis is needed to better understand state-of-the-art deep architectures, and possibly to improve them further.
In this paper, we address the gap between the empirical success and the theoretical understanding of CNNs, in particular their invertibility (i.e., reconstructing the input from the hidden activations), by analyzing a simplified mathematical model that uses random weights (see Sections 2.1 and 4.1 for the practical relevance of this assumption).
This property is intriguing because CNNs are typically trained with discriminative objectives (i.e., unrelated to reconstruction) on large amounts of labeled data, such as the ImageNet dataset [Deng et al.2009]. Bruna et al. [Bruna et al.2014] studied signal recovery from generalized pooling operators using image patches on non-convolutional, small-scale networks and datasets. The authors of [invertcnn] used upsampling-deconvolutional architectures to invert the hidden activations of feedforward CNNs back to the input domain. In another related work, Zhao et al. [Zhao et al.2016] proposed a stacked what-where autoencoder network and demonstrated its promise in unsupervised and semi-supervised settings. The authors of [deconvrecon] showed that CNNs discriminatively trained for image classification (e.g., VGGNet [Simonyan and Zisserman2015]) are almost fully invertible using pooling switches. Despite these interesting results, there is as yet no clear theoretical explanation of why CNNs are invertible. We introduce three new concepts that, coupled with the accepted notion that images have sparse representations, guide our understanding of CNNs:

we provide a particular model of sparse linear combinations of the learned filters that are consistent with natural images; also, this model of sparsity is itself consistent with the feedforward network;

we show that the effective matrices that explicitly capture the convolution of multiple filters exhibit a model-Restricted Isometry Property (model-RIP) [Baraniuk et al.2010]; and

our model can explain each layer of the feedforward CNN algorithm as one iteration of Iterative Hard Thresholding (IHT) [Blumensath and Davies2009] for model-based compressive sensing and, hence, we can reconstruct the input simply and accurately.
In other words, we give a theoretical connection between a particular version of model-based compressive sensing (and its recovery algorithms) and CNNs. Using this connection, we give a reconstruction bound for a single layer in CNNs, which can possibly be extended to multiple layers. In the experimental sections, we show empirically that large-scale CNNs are consistent with our mathematical analysis. This paper explores these properties and elucidates specific empirical aspects that further mathematical models might need to take into account.
2 Preliminaries
In this section, we begin with a discussion of the effectiveness of random weights in CNNs, and then introduce notation for CNNs, compressive sensing, and sparse signal recovery.
2.1 Effectiveness of Gaussian Random Filters
CNNs with Gaussian random filters have been shown to be surprisingly effective in unsupervised and supervised deep learning tasks. Jarrett et al. [jarrett2009best] showed that random filters in 2-layer CNNs work well for image classification. Also, Saxe et al. [saxe2011random] observed that a convolutional layer followed by a pooling layer is frequency selective and translation invariant, even with random filters, and that these properties lead to good performance in object recognition tasks. On the other hand, Giryes et al. [giryes2015deep] proved that CNNs with random Gaussian filters have a metric preservation property, and they argued that the role of training is to select better hyperplanes discriminating between classes by distorting boundary points among classes. According to their observation, random filters are in fact a good choice if the training data are initially well-separated. Also, He et al. [he2016powerful] empirically showed that random-weight CNNs can perform image reconstruction well.
To better demonstrate the effectiveness of Gaussian random CNNs, we evaluate their classification performance on CIFAR-10 [Krizhevsky2009] in Section 4.1. Although the performance is not state-of-the-art, it is surprisingly good considering that the networks are almost untrained. Our theoretical results may provide a new perspective for explaining these phenomena.
2.2 Convolutional Neural Nets
For simplicity, we vectorize input signals to 1-d signals; any operations we would ordinarily carry out on images, we perform on vectors with the appropriate modifications. We define a single layer of our CNN as follows. We assume that the input signal consists of channels, each of length , and we write . For each of the input channels, , let , denote one of filters, each of length . Let be the stride length, the number of indices by which we shift each filter. Note that the stride can be larger than 1. We assume that the number of shifts, , is an integer. Let be a vector of length that consists of the th filter shifted by (i.e., it has at most nonzero entries). We concatenate each of these vectors (as row vectors) over the channels to form a large matrix , which is made up of blocks of the shifts of each filter in each of the channels. We assume that the row vectors span and that we have normalized the rows so that they have unit norm. The hidden units of the feedforward CNN are computed by multiplying an input signal by this matrix (i.e., convolving, in each channel, by a filter bank of size , and summing over the channels to obtain outputs). (Footnote 1: Convolution can be computed more efficiently than matrix multiplication, but the two are mathematically equivalent.) We write for the hidden activation computed by a single-layer CNN without pooling. Figure 1 illustrates the architecture. As a nonlinearity, we apply the ReLU function to the outputs and then select the value with maximum absolute value in each of the blocks; i.e., we perform max pooling over each of the convolved filters.
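As a concrete sketch of this construction, the following code builds the filter matrix from all shifts of a set of 1-d filters and applies one layer (ReLU, then max pooling over the shifts of each filter). It uses a single input channel and hypothetical small dimensions; the multi-channel case of the text concatenates such blocks over channels.

```python
import numpy as np

def filter_matrix(filters, n, stride=1):
    """Rows are all shifts of each filter, zero-padded to length n and
    normalized to unit l2 norm (single input channel for simplicity)."""
    num_filters, flen = filters.shape
    shifts = (n - flen) // stride + 1
    rows = []
    for f in filters:
        for s in range(shifts):
            row = np.zeros(n)
            row[s * stride:s * stride + flen] = f
            rows.append(row / np.linalg.norm(row))
    return np.array(rows)              # shape: (num_filters * shifts, n)

def cnn_layer(W, x, num_filters):
    """ReLU, then max pooling over the block of shifts of each filter."""
    h = np.maximum(W @ x, 0.0)         # convolution as matrix multiply + ReLU
    return h.reshape(num_filters, -1).max(axis=1)

rng = np.random.default_rng(0)
filters = rng.standard_normal((4, 3))  # 4 random filters of length 3
W = filter_matrix(filters, n=12)       # 10 shifts each, so W is 40 x 12
pooled = cnn_layer(W, rng.standard_normal(12), num_filters=4)
print(W.shape, pooled.shape)
```

Each row of `W` has unit norm, as assumed in the text, and each entry of `pooled` corresponds to one filter's block of shifts.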
2.3 Compressive Sensing
In compressive sensing, we assume that there is a latent sparse code that generates the visible signal. We say that a matrix A satisfies the Restricted Isometry Property (RIP) if there is a distortion factor δ such that, for all z with exactly k nonzero entries, (1 − δ)‖z‖₂² ≤ ‖Az‖₂² ≤ (1 + δ)‖z‖₂². If A satisfies the RIP with sufficiently small δ and if z is sparse, then given the vector Az, we can efficiently recover z (see Candés [Candés2008] for more details). (Footnote 2: We note that this is a sufficient condition and that there are other, less restrictive sufficient conditions, as well as more complicated necessary conditions.) There are many efficient algorithms, including sparse coding (e.g., ℓ1-regularized minimization) and greedy and iterative algorithms, such as Iterative Hard Thresholding (IHT) [Blumensath and Davies2009].
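A minimal sketch of plain IHT for k-sparse recovery may help fix ideas (generic Gaussian measurements; the dimensions, step size, and iteration count are illustrative choices, not the paper's):

```python
import numpy as np

def hard_threshold(z, k):
    """Keep the k largest-magnitude entries of z; zero out the rest."""
    out = np.zeros_like(z)
    keep = np.argsort(np.abs(z))[-k:]
    out[keep] = z[keep]
    return out

def iht(A, y, k, iters=200):
    """Iterative Hard Thresholding: gradient step on ||y - A z||^2,
    then projection onto the set of k-sparse vectors."""
    z = np.zeros(A.shape[1])
    for _ in range(iters):
        z = hard_threshold(z + A.T @ (y - A @ z), k)
    return z

rng = np.random.default_rng(1)
m, n, k = 50, 100, 3
A = rng.standard_normal((m, n)) / np.sqrt(m)   # satisfies RIP w.h.p.
z_true = np.zeros(n)
z_true[[5, 40, 77]] = [1.0, -2.0, 1.5]
z_hat = iht(A, A @ z_true, k)
print(np.linalg.norm(z_hat - z_true))          # small if recovery succeeds
```

The model-based variant used later in the paper replaces `hard_threshold` with a structured sparsifier that respects the pooling block structure.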
Model-based compressive sensing. While sparse signals are a natural model for some applications, they are less realistic for CNNs. We consider a vector z as the true sparse code generating the CNN input with a particular model of sparsity. Rather than permitting nonzero entries anywhere in the vector, we divide its support into contiguous blocks of size and stipulate that each block contains at most one nonzero entry, with a total of nonzero entries. We call a vector with this sparsity model model-sparse and denote the union of all sparse subspaces with this structure . It is clear that contains subspaces. In our analysis, we consider linear combinations of two model-sparse signals. To be precise, suppose that is the linear combination of two elements in . Then, we say that lies in the linear subspace that consists of all linear combinations of vectors from . (Footnote 3: Intuitively, this is the subspace in which the error signal lies; it is used to derive the reconstruction bound; see Appendix A.) We say that a matrix Φ satisfies the model-RIP if there is a distortion factor δ such that, for all model-sparse z,
(1 − δ)‖z‖₂² ≤ ‖Φz‖₂² ≤ (1 + δ)‖z‖₂².  (1)
See Baraniuk et al. [Baraniuk et al.2010] for the definitions of model sparsity and the model-RIP, as well as the necessary modifications to account for signal noise and compressible (as opposed to exactly sparse) signals, which we do not consider in this paper in order to keep our analysis simple. Intuitively, a matrix satisfying the model-RIP is nearly orthonormal on the set of sparse vectors with a particular sparsity model or pattern.
For our analysis, we also need matrices that satisfy the model-RIP for vectors . We denote the distortion factor for such matrices ; note that .
Many efficient algorithms have been proposed for sparse coding and compressive sensing [Olshausen and others1996, Mallat and Zhang1993, Beck and Teboulle2009]. As with traditional compressive sensing, there are efficient algorithms for recovering model-sparse signals from measurements [Baraniuk et al.2010], assuming the existence of an efficient structured sparse approximation algorithm that, given an input vector and a sparsity parameter, returns the vector closest to the input with the specified sparsity structure.
In CNNs, the max pooling operator finds the downsampled activations that are closest to the activations of the original size by retaining the most significant values. Max pooling can be viewed as two steps: 1) zeroing out the locally non-maximum values; 2) downsampling the activations, retaining the locally maximum values. To study the pooled activations with sparsity structures, we can undo the dimension loss of the downsampling step with an upsampling operator. This procedure defines our structured sparse approximation algorithm , where is the original (unpooled) response, is the sparsity parameter for block-sparsification, and is the sparsified response after pooling but without shrinking the length (i.e., the locally non-maximum values are zeroed out so that it has the same dimension as the input). Note that the result is a model-sparse signal by construction. On the other hand, without considering the block-sparsification, we actually apply the following max pooling and upsampling operations:
ẑ = U_s(p),  p = maxpool(h),  (2)
where p is the pooled response, h is the filter response of the CNN given the input before max pooling (see Section 2.2), and U_s denotes the upsampling switches that indicate where to place the nonzero values in the upsampled activations. Since our theoretical analysis does not depend on the particular switches, any type of valid upsampling switches will be consistent with the block-sparsification (model-sparse) assumption; thus we use (2) to denote the structured sparse approximation algorithm without worrying about the choice of switches.
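The two-step view of pooling above can be sketched as follows (the block size and 1-d layout are illustrative): in each contiguous block, only the locally maximum-magnitude value survives, so the output has the same length as the input and is model-sparse by construction.

```python
import numpy as np

def structured_approx(z, block_size):
    """Max pooling followed by upsampling with the true switches:
    in each contiguous block, zero out all but the largest-magnitude
    entry.  The output has the same length as the input."""
    out = np.zeros_like(z)
    for start in range(0, len(z), block_size):
        block = z[start:start + block_size]
        j = start + int(np.argmax(np.abs(block)))
        out[j] = z[j]
    return out

z = np.array([0.1, -0.9, 0.3, 0.0, 0.2, -0.1])
print(structured_approx(z, block_size=3))  # keeps -0.9 and 0.2 only
```

Different upsampling switches would place the surviving values differently, but, as noted above, the analysis only relies on the block-sparse structure of the result.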
We use the model-sparse version of IHT [Blumensath and Davies2009] as our recovery algorithm, as one iteration of IHT for our model of sparsity captures exactly a feedforward CNN. (Footnote 4: Multiple iterations of IHT can improve the quality of signal recovery; however, that is rather equivalent to a recurrent version of CNNs and does not fit the scope of this work.) Algorithm 1 describes the model-based IHT algorithm. In particular, the sequence of steps 4–6 in the middle of IHT is exactly one layer of a feedforward CNN. As a result, the theoretical analysis of IHT for model-based sparse signal recovery serves as a guide for how to analyze the approximate activations of a CNN.
3 Analysis
Following the compressive sensing viewpoint of Section 2.3, we assume that the input is generated from a latent model-sparse signal with basis vectors , which turn out to be given by Theorem 3.1 (i.e., ). Therefore, our analysis views the output of the CNN (with pooling) as a reconstruction of (i.e., ), and it can be used to reconstruct the input: that is, .
3.1 CNN Filters with Positive and Negative Pairs
Here we assume that all of the entries in the vectors are real numbers rather than only nonnegative, as they would be when using ReLU. This setup is equivalent to using Concatenated ReLU (CReLU) [Shang et al.2016] as an activation function (i.e., keeping the positive and negative activations as separate hidden units) with tied decoding weights. The CReLU activation scheme is justified by the fact that trained CNN filters come in positive and negative pairs, and it achieves superior classification performance on several benchmarks. This setting makes a CNN much easier to analyze within the model-based compressed sensing framework.
To motivate the setting, we begin with a simple example. Suppose that the matrix is an orthonormal basis for and define .
Proposition 1.
A onelayer CNN using the matrix , with no pooling, gives perfect reconstruction (with the matrix ) for any input vector .
Proof.
Because we have both the positive and the negative dot products of the signal with the basis vectors in , we have positive and negative versions of the hidden units and where we decompose into the difference of two nonnegative vectors, the positive and the negative entries of . From this decomposition, we can easily reconstruct the original signal via
∎
In the example above, we have pairs of vectors in our matrix . Now suppose that we have a vector where its positive and negative components can be split into , and that we synthesize a signal from using the matrix . Then, we have
Next, we multiply by a concatenation of the positive and negative , and we get ; if we apply ReLU to this vector, we get , which is a vector split into its positive and negative components. The structure of the product is crucial to the reconstruction quality of the vector . In addition, this calculation shows that if we have both positive and negative pairs of filters or vectors, then ReLU applied to both the positive and negative dot products simply splits the vector into its positive and negative components. These components are then reassembled in the next computation. For this reason, in the analysis in the following sections, it is sufficient to assume that all of the entries in the vectors are real numbers, rather than only nonnegative.
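This positive/negative splitting is easy to verify numerically. With an orthonormal basis W and the doubled matrix [W; −W] (a hypothetical small example), ReLU splits the coefficients and the transpose reassembles them exactly:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 8
W, _ = np.linalg.qr(rng.standard_normal((n, n)))  # orthonormal basis
D = np.vstack([W, -W])                            # positive/negative pairs
x = rng.standard_normal(n)

# ReLU splits W x into its positive and negative parts; D^T reassembles:
# D^T relu(D x) = W^T (W x)_+ - W^T (W x)_- = W^T W x = x.
x_hat = D.T @ np.maximum(D @ x, 0.0)
print(np.allclose(x_hat, x))  # True: perfect reconstruction
```

This is exactly the content of Proposition 1: with both signs of every filter present, ReLU loses no information.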
3.2 Model-RIP and Random Filters
Our first main result shows that if we use Gaussian random filters in our CNN then, with high probability, the transpose of the large matrix formed by the convolution filters satisfies the model-RIP. In other words, Gaussian random filters generate a matrix whose transpose is almost an orthonormal transform for sparse signals with a particular sparsity pattern (one that is consistent with our pooling procedure). The bounds in the theorem tell us that we must balance the size of the filters and the number of channels against the sparsity of the hidden units, the number of filter banks, the number of shifts, the distortion parameter, and the failure probability. The proof is in Appendix A.
Theorem 3.1.
Assume that we have vectors of length in which each entry is a scaled i.i.d. (sub)Gaussian random variable with zero mean and unit variance (the scaling factor is ). Let be the stride length (where ) and let be the structured random matrix that is the weight matrix of a single-layer CNN with channels and input length . If for a positive constant , then with probability the matrix satisfies the model-RIP for model with parameter .
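A quick numerical check of Theorem 3.1 under the model assumptions (single channel, stride 1, illustrative sizes): build the structured matrix from Gaussian random filters and measure ‖Wᵀz‖²/‖z‖² over random model-sparse codes. The ratios should concentrate near 1.

```python
import numpy as np

rng = np.random.default_rng(4)
n, q, F = 256, 64, 8                  # input length, filter length, filters
shifts = n - q + 1                    # stride 1

rows = []
for _ in range(F):
    f = rng.standard_normal(q)
    for s in range(shifts):
        r = np.zeros(n)
        r[s:s + q] = f
        rows.append(r / np.linalg.norm(r))
W = np.array(rows)                    # (F * shifts) x n structured matrix

ratios = []
for _ in range(200):
    z = np.zeros(F * shifts)          # model-sparse: one entry per block
    for b in range(F):
        z[b * shifts + rng.integers(shifts)] = rng.choice([-1.0, 1.0])
    ratios.append(np.linalg.norm(W.T @ z) ** 2 / np.linalg.norm(z) ** 2)

print(np.mean(ratios), np.min(ratios), np.max(ratios))  # all close to 1
```

The deviation of these ratios from 1 is an empirical estimate of the distortion parameter in (1); Section 4.2 carries out the analogous experiment with histograms.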
We also note that the same analysis can be applied to the sum of two modelsparse signals, with changes in the constants (that we do not track here).
Corollary 3.2.
Random matrices with the CNN structure satisfy, with high probability, the model-RIP for .
Other examples of matrices that satisfy the modelRIP include wavelets and localized Fourier bases; both examples can be easily and efficiently implemented via convolutions.
3.3 Reconstruction Bounds
Suppose that satisfies the model-RIP and that is the reconstruction of the true sparse code through a CNN layer followed by pooling, i.e., . Then, Theorem 3.3 shows that is an approximate reconstruction of the input signal, and the relative error is bounded by a function of the distortion parameters of the model-RIP.
Theorem 3.3.
We assume that satisfies the model-RIP with constant . If we use in a single-layer CNN both to compute the hidden units and to reconstruct the input from these hidden units so that , then the error in our reconstruction is
See Appendix B for the detailed proofs. Part of our analysis also shows that the hidden units approximate the putative coefficient vector in the sparse linear representation of the input signal. Recall that the structured sparsity approximation algorithm includes the downsampling caused by pooling and an upsampling operator. Theorem 3.3 is applicable to any type of upsampling switches, so our reconstruction bound is generic with respect to the particular design choice of how to recover the activation size in a decoding neural network. We can extend the analysis from a single-layer CNN to a multi-layer CNN by using the output of one layer as the input to another, following the proof in Appendix B. We leave further investigation of this idea as future work.
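To illustrate the bound, the following sketch synthesizes an input from a model-sparse code, runs one convolution-plus-pooling pass, reconstructs with the transpose of the same filters, and reports the relative error (single channel, Gaussian filters, illustrative sizes; not the paper's exact setup):

```python
import numpy as np

rng = np.random.default_rng(5)
n, q, F = 256, 64, 4
shifts = n - q + 1

rows = []
for _ in range(F):
    f = rng.standard_normal(q)
    for s in range(shifts):
        r = np.zeros(n)
        r[s:s + q] = f
        rows.append(r / np.linalg.norm(r))
W = np.array(rows)

z_true = np.zeros(F * shifts)         # model-sparse: one +-1 per block
for b in range(F):
    z_true[b * shifts + rng.integers(shifts)] = rng.choice([-1.0, 1.0])

x = W.T @ z_true                      # synthesize the input signal
h = W @ x                             # single convolutional layer
z_hat = np.zeros_like(h)              # block-wise max pooling + switches
for b in range(F):
    block = h[b * shifts:(b + 1) * shifts]
    j = b * shifts + int(np.argmax(np.abs(block)))
    z_hat[j] = h[j]
x_hat = W.T @ z_hat                   # reconstruct with the same filters

print(np.linalg.norm(x_hat - x) / np.linalg.norm(x))  # small relative error
```

The observed relative error is small because the structured Gaussian matrix has a small empirical model-RIP distortion, matching the spirit of Theorem 3.3.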
4 Experimental Evidence and Analysis
In this section, we provide experimental validation of our theoretical model and analysis. We first validate the practical relevance of our assumption by examining the effectiveness of random-filter CNNs, and then provide results on more realistic scenarios. In particular, we study popular deep CNNs trained for image classification on the ILSVRC 2012 dataset [Deng et al.2009]. We calculate empirical model-RIP bounds, showing that they are consistent with our theory. Our results are also consistent with a long line of research showing that it is reasonable to model real and natural images as sparse linear combinations over overcomplete dictionaries [Boureau et al.2008, Le et al.2013, Lee et al.2008, Olshausen and others1996, Ranzato et al.2007, Yang et al.2010]. In addition, we verify our theoretical bounds for the reconstruction error on real images. We investigate both randomly sampled filters and empirically learned filters in these experiments. Our implementation is based on Caffe [Jia et al.2014] and MatConvNet [Vedaldi and Lenc2015].
Recall that our theoretical analysis is generic to any upsampling switches in (2) for reconstruction. In the experiments, we specifically use naive upsampling to reverse max-pooled activations to their original size, where only the first element in a pooling region is assigned the pooled activation and the remaining elements are all zero. Thus, no extra information other than the pooled activation values is taken into account.
4.1 Gaussian Random CNNs on CIFAR10
To show the practical relevance of our theoretical assumption of random filters in CNNs, as stated in Section 2.1, we evaluate simple CNNs whose Gaussian random filters have i.i.d. zero-mean, unit-variance entries on CIFAR-10 [Krizhevsky2009]. Note that the goal of this experiment is not to achieve state-of-the-art results, but to examine the practical relevance of our assumption of random-filter CNNs. Once the CNN weights are initialized (randomly), they are fixed during the training of the classifiers. (Footnote 5: Implementation detail: we add a batch normalization layer together with a learnable scale and bias before the activation so that we do not need to tune the scale of the filters. See Appendix C.1 for more details.)
Specifically, we test random CNNs with 1, 2, and 3 convolutional layers, each followed by a ReLU activation and a max-pooling layer. We tested different filter sizes () and numbers of channels () and report the best classification accuracy by cross-validation in Table 1. We also report the best performance using learnable filters for comparison. More details about the architectures can be found in Appendix C.1. We observe that CNNs with Gaussian random filters achieve good classification performance (implying that they serve as a reasonable representation of the input data), which is not too far off that of the learned filters. Our experimental results are also consistent with the observations made by Jarrett et al. [jarrett2009best] and Saxe et al. [saxe2011random]. In conclusion, these results suggest that CNNs with Gaussian random filters are a reasonable setup that is amenable to mathematical analysis while not being too far off in terms of practical relevance.
Method  1 layer  2 layers  3 layers

Random filters  66.5%  74.6%  74.8% 
Learned filters  68.1%  83.3%  89.3% 
4.2 1d Model-RIP
We use 1-d synthetic data to empirically demonstrate the basic validity of our theory in terms of the model-RIP in (1) and the reconstruction bound in Theorem 3.3. We plot the histograms of the empirical model-RIP values of 1-d Gaussian random filters (scaled by ) of size on 1-d sparse signals of size and sparsity , whose nonzero elements are drawn from a uniform distribution on . The histograms in Figure 2 (a)–(b) are tightly centered around 1, suggesting that the matrix satisfies the model-RIP in (1) and its corollary from Lemma B.1, respectively. We also empirically verify the reconstruction bound of Theorem 3.3 on synthetic vectors (Figure 2 (c)). The reconstruction error is concentrated around – and bounded under . The results in Figure 2 suggest the practical validity of our theory when the model assumptions hold.
4.3 Architectures for 2d Model-RIP
We conduct the rest of our experimental evaluations on the 16-layer VGGNet (Model D in Simonyan and Zisserman [Simonyan and Zisserman2015]), where the computation is carried out on images, e.g., convolution with a 2d filter bank and pooling over square regions. In contrast to the theory, the realistic network does not pool activations over all possible shifts of each filter, but rather over non-overlapping patches. The networks are trained for the large-scale image classification task, which is important for extending to other supervised tasks in vision. The main findings on VGGNet are presented in the rest of this section; we also provide some analysis of AlexNet [Krizhevsky et al.2012] in Appendix C.2.
VGGNet contains five macro layers of convolution and pooling layers, and each macro layer has 2 or 3 convolutional layers followed by a pooling layer. We denote the th convolutional layer in the th macro layer “conv,” and the pooling layer “pool.” The activations/features from th macro layer are the output of pool. Our analysis is for single convolutional layers.
4.4 2d Model-RIP
The key to our reconstruction bound is Theorem 3.3. We empirically evaluate the model-RIP, i.e., the near-isometry in (1), for real CNN filters of the pretrained VGGNet. We use two-dimensional coefficients (each block of coefficients is of size ), filters of size , and pool the coefficients over smaller pooling regions (i.e., not over all possible shifts of each filter). The following experimental evidence suggests that the sparsity model and the model-RIP of the filters are consistent with our mathematical analysis of the simpler one-dimensional case.
To check the significance of the model-RIP (i.e., how close the empirical ratio is to 1) in controlled settings, we first synthesize the hidden activations with sparse uniform random variables, which fully agree with our model assumptions.
layer  c(1,1)  c(1,2)  p(1)  c(2,1)  c(2,2)  p(2) 

% of nonzeros  49.1  69.7  80.8  67.4  49.7  70.7 
layer  c(3,1)  c(3,2)  c(3,3)  p(3)  c(4,1)  c(4,2) 
% of nonzeros  53.4  51.9  28.7  45.9  35.6  29.6 
layer  c(4,3)  p(4)  c(5,1)  c(5,2)  c(5,3)  p(5) 
% of nonzeros  12.6  23.1  23.9  20.6  7.3  13.1 
layer  (1,1)  (1,2)  (2,1)  (2,2)  (3,1)  (3,2)  (3,3) 

learned  0.943  0.734  0.644  0.747  0.584  0.484  0.519 
random  0.670  0.122  0.155  0.105  0.110  0.090  0.080 
layer  (4,1)  (4,2)  (4,3)  (5,1)  (5,2)  (5,3)  
learned  0.460  0.457  0.404  0.410  0.410  0.405  
random  0.092  0.062  0.062  0.070  0.067  0.067 
The sparsity of is constrained to the average level of the real CNN activations, which is reported in Table 2. Given the filters of a certain convolutional layer, we use the synthetic activations (in place of this layer's output activations) to gather statistics for the model-RIP. To be consistent with the succeeding experiments, we choose conv, while other layers show similar results. Figure 3 (a) summarizes the distribution of empirical model-RIP values, which is clearly centered around 1 and satisfies (1) with a short tail roughly bounded by . For more details of the algorithm: we normalize the filters from the conv layer, which are (). All filters with input channels are used. (Footnote 6: We do not remove any filters, including those in approximate positive/negative pairs; see Section 3.) We set , which is the same as the output activations of conv, and use pooling regions (Footnote 7: No pooling layer follows conv in VGGNet; however, we use it in this way to analyze the convolution-pooling pair per the theory.), which are commonly used in recent CNNs. We generate 1000 randomly sampled sparse activation maps by first sampling their nonzero supports and then filling the elements on the supports uniformly from . The sparsity is the same as that of the conv activations.
More realistically, we observe that the actual conv activations from VGGNet are not necessarily drawn from a model-sparse uniform distribution. This motivates us to evaluate the empirical model-RIP on the hidden activations that reconstruct the actual input activations from conv by . Per the theory, is given by a max pooling layer, so we constrain the sparsity (i.e., the size of the support set is no more than in a pooling region for a single channel). We use a simple and efficient algorithm, Algorithm 2, to recover from . The algorithm is inspired by a “heuristic” method that is commonly used in practice (e.g., Boyd [Boyd2015]). As shown in Algorithm 2, we first perform regularized least squares without constraining the support set. Max pooling is then applied to determine the support set for each pooling region. In particular, we use max pooling and unpooling with known switches (line 2) to zero out the locally non-maximum values without disturbing the support structure. We perform regularized least squares again on the fixed support set to recover the hidden activations satisfying the model sparsity. As shown in Figures 3 (b)–(c), the empirical model-RIP values for visual activations from conv, with and without , are both close to 1. The center offset from 1 is less than and the range bound is roughly less than , which agrees with the theoretical bound in (1). To gain more insight, we summarize the learned filter coherence in Table 3 for all convolutional layers in VGGNet. (Footnote 8: The coherence is defined as the maximum (in absolute value) dot product between distinct pairs of columns of the matrix, i.e., , where denotes the th row of the matrix.) This measures the correlation or similarity between the columns of the matrix and is a proxy for the value of the model-RIP parameter (which we can only estimate computationally). The smaller the coherence, the smaller is, and the better the reconstruction. The coherence of the learned filters is not low, which is inconsistent with our theoretical assumptions. However, the model-RIP turns out to be robust to this mismatch, which demonstrates the strong practical invertibility of CNNs.
4.5 Reconstruction Bounds
With the model-RIP as a sufficient condition, Theorem 3.3 provides a theoretical bound for layer-wise reconstruction via , which consists of the projection and reconstruction of one IHT iteration. With a slight abuse of terminology, we refer to it simply as IHT for notational convenience. We investigate the practical reconstruction errors on pool to of VGGNet.
To encode and reconstruct intermediate activations of CNNs, we employ IHT with sparsity estimated from the real CNN activations on the ILSVRC 2012 validation set (see Table 2). We also reconstruct input images, since CNN inversion is not limited to a single layer, and images are easier to visualize than hidden activations. To implement image reconstruction, we project the reconstructed activations into the image space via a pretrained decoding network as in [deconvrecon], which extends an autoencoder architecture similar to that of [invertcnn] to a stacked “what-where” autoencoder [Zhao et al.2016]. The reconstructed activations are scaled to have the same norm as the original activations so that we can feed them into the decoding network.
As an example, Figure 4 illustrates the image reconstruction results for the hidden activations of pool. Interestingly, the decoding network itself is quite powerful, since it can reconstruct a rough (although very noisy) glimpse of the images even from Gaussian random input, as shown in Figure 4 (e). Object shapes are recovered to some extent by using the pooling switches alone in the “what-where” autoencoder. This result suggests that it is important to determine which pooling units are active and then to estimate their values accurately. These steps are consistent with the steps in the inner loop of any iterative sparse signal reconstruction algorithm.
In Figure 4 (c), we take the pretrained conv filters for IHT. The images recovered from the IHT-reconstructed pool activations are reasonable, and the reconstruction quality is significantly better than the random-input baseline. We also try Gaussian random filters (Figure 4 (d)), which agree more closely with the model assumptions (e.g., lower coherence; see Table 3). The learned filters from VGGNet perform equally well visually. IHT ties the encoder and decoder weights (no filter learning for the decoder), so it does not perform as well as the decoding network trained with a huge batch of data (Figure 4 (b)). Nevertheless, we show both theoretically and experimentally that these simple reconstruction methods achieve decent reconstruction bounds on real CNNs. Visualization results for more layers are in Appendix C.3.
layer  image space relative error  activation space relative error
  learned filters  random filters  random activations  learned filters  random filters  random activations
1  0.423  0.380  0.610  0.895  0.872  1.414 
2  0.692  0.438  0.864  0.961  0.926  1.414 
3  0.326  0.345  0.652  0.912  0.862  1.414 
4  0.379  0.357  0.436  1.051  0.992  1.414 

In Table 4, we summarize the reconstruction performance for all four macro layers. With random filters, the model assumptions hold and the IHT reconstruction is quantitatively the best. IHT with real CNN filters performs comparably to the best case and much better than the baseline established by the randomly sampled activations.
5 Conclusion
We introduce three concepts that tie together a particular model of compressive sensing (and the associated recovery algorithms), the properties of learned filters, and the empirical observation that CNNs are (approximately) invertible. Our experiments show that filters in trained CNNs are consistent with the mathematical properties we present while the hidden units exhibit a much richer structure than mathematical analysis suggests. Perhaps simply moving towards a compressive, rather than exactly sparse, model for the hidden units will capture the sophisticated structure in these layers of a CNN or, perhaps, we need a more sophisticated model. Our experiments also demonstrate that there is considerable information captured in the switch units (or the identities of the nonzeros in the hidden units after pooling) that no mathematical model has yet expressed or explored thoroughly. We leave such explorations as future work.
Acknowledgments
This work was supported in part by ONR N000141612928, NSF CAREER IIS1453651, and Sloan Research Fellowship. We would like to thank Michael Wakin for helpful discussions about concentration of measure for structured random matrices.
References
 [Arora et al.2014] Sanjeev Arora, Aditya Bhaskara, Rong Ge, and Tengyu Ma. Provable Bounds for Learning Some Deep Representations. In ICML, 2014.
 [Arora et al.2015] Sanjeev Arora, Yingyu Liang, and Tengyu Ma. Why are deep nets reversible: A simple theory, with implications for training. arXiv:1511.05653, 2015.
 [Baraniuk et al.2010] R. G. Baraniuk, V. Cevher, M. F. Duarte, and C. Hegde. Model-Based Compressive Sensing. IEEE Transactions on Information Theory, 56(4):1982–2001, 2010.
 [Beck and Teboulle2009] Amir Beck and Marc Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183–202, 2009.
 [Blumensath and Davies2009] Thomas Blumensath and Mike E Davies. Iterative hard thresholding for compressed sensing. Applied and Computational Harmonic Analysis, 27(3):265–274, 2009.

 [Boureau et al.2008] Y-Lan Boureau, Yann LeCun, et al. Sparse feature learning for deep belief networks. In NIPS, 2008.
 [Boyd2015] Stephen Boyd. ℓ1-norm methods for convex-cardinality problems. EE364b: Convex Optimization II lecture notes, 2014–2015 spring, 2015.
 [Bruna et al.2014] Joan Bruna, Arthur Szlam, and Yann LeCun. Signal recovery from pooling representations. In ICML, 2014.
 [Candès2008] Emmanuel J. Candès. The restricted isometry property and its implications for compressed sensing. Comptes Rendus Mathematique, 346(9):589–592, 2008.
 [Cho et al.2014] Kyunghyun Cho, Bart Van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv:1406.1078, 2014.

 [Collobert et al.2011] Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa. Natural language processing (almost) from scratch. Journal of Machine Learning Research, 12(Aug):2493–2537, 2011.
 [Deng et al.2009] Jia Deng, Wei Dong, R. Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In CVPR, 2009.
 [Dosovitskiy and Brox2016] Alexey Dosovitskiy and Thomas Brox. Inverting visual representations with convolutional networks. In CVPR, 2016.
 [Giryes et al.2016] Raja Giryes, Guillermo Sapiro, and Alex M Bronstein. Deep neural networks with random gaussian weights: A universal classification strategy? IEEE Transactions on Signal Processing, 64(13):3444–3457, 2016.
 [Hannun et al.2014] Awni Hannun, Carl Case, Jared Casper, Bryan Catanzaro, Greg Diamos, Erich Elsen, Ryan Prenger, Sanjeev Satheesh, Shubho Sengupta, Adam Coates, et al. Deep speech: Scaling up end-to-end speech recognition. arXiv:1412.5567, 2014.
 [He et al.2016] Kun He, Yan Wang, and John Hopcroft. A powerful generative model using random weights for the deep image representation. arXiv:1606.04801, 2016.
 [Hinton et al.2012] Geoffrey Hinton, Li Deng, Dong Yu, George E Dahl, Abdel-rahman Mohamed, Navdeep Jaitly, Andrew Senior, Vincent Vanhoucke, Patrick Nguyen, Tara N Sainath, et al. Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. IEEE Signal Processing Magazine, 29(6):82–97, 2012.
 [Jarrett et al.2009] Kevin Jarrett, Koray Kavukcuoglu, Marc’Aurelio Ranzato, and Yann LeCun. What is the best multi-stage architecture for object recognition? In ICCV, 2009.
 [Jia et al.2014] Yangqing Jia, Evan Shelhamer, Jeff Donahue, Sergey Karayev, Jonathan Long, Ross Girshick, Sergio Guadarrama, and Trevor Darrell. Caffe: Convolutional architecture for fast feature embedding. arXiv:1408.5093, 2014.
 [Krizhevsky et al.2012] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, 2012.
 [Krizhevsky2009] Alex Krizhevsky. Learning multiple layers of features from tiny images. Technical report, University of Toronto, 2009.

[Le et al.2013]
Q. V. Le, M. Ranzato, R. Monga, M. Devin, K. Chen, G. S. Corrado,
J. Dean, and A. Y. Ng.
Building highlevel features using large scale unsupervised learning.
In ICML, 2013.  [LeCun et al.1989] Yann LeCun, Bernhard Boser, John S Denker, Donnie Henderson, Richard E Howard, Wayne Hubbard, and Lawrence D Jackel. Backpropagation applied to handwritten zip code recognition. Neural computation, 1(4):541–551, 1989.
 [Lee et al.2008] Honglak Lee, Chaitanya Ekanadham, and Andrew Y Ng. Sparse deep belief net model for visual area v2. In NIPS, 2008.
 [Mallat and Zhang1993] Stephane Mallat and Zhifeng Zhang. Matching pursuits with time-frequency dictionaries. IEEE Transactions on Signal Processing, 41:3397–3415, 1993.
 [Mikolov et al.2013] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. Distributed representations of words and phrases and their compositionality. In NIPS, pages 3111–3119, 2013.
 [Olshausen and others1996] Bruno A Olshausen et al. Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature, 381(6583):607–609, 1996.
 [Park et al.2011] Jae Young Park, Han Lun Yap, C.J. Rozell, and M. B. Wakin. Concentration of Measure for Block Diagonal Matrices With Applications to Compressive Signal Processing. IEEE Transactions on Signal Processing, 59(12):5859–5875, 2011.
 [Paul and Venkatasubramanian2014] Arnab Paul and Suresh Venkatasubramanian. Why does Deep Learning work?  A perspective from Group Theory. arXiv.org, December 2014.
 [Ranzato et al.2007] Marc’Aurelio Ranzato, Fu Jie Huang, Y-Lan Boureau, and Yann LeCun. Unsupervised learning of invariant feature hierarchies with applications to object recognition. In CVPR, 2007.
 [Saxe et al.2011] Andrew Saxe, Pang W Koh, Zhenghao Chen, Maneesh Bhand, Bipin Suresh, and Andrew Y Ng. On random weights and unsupervised feature learning. In ICML, 2011.

[Shang et al.2016]
Wenling Shang, Kihyuk Sohn, Diogo Almeida, and Honglak Lee.
Understanding and improving convolutional neural networks via concatenated rectified linear units.
In ICML, 2016.  [Simonyan and Zisserman2015] K. Simonyan and A. Zisserman. Very deep convolutional networks for largescale image recognition. In ICLR, 2015.
 [Szegedy et al.2015] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In CVPR, 2015.
 [Vedaldi and Lenc2015] A. Vedaldi and K. Lenc. MatConvNet – convolutional neural networks for MATLAB. In Proceedings of the ACM International Conference on Multimedia, 2015.
 [Vershynin2010] Roman Vershynin. Introduction to the non-asymptotic analysis of random matrices. arXiv:1011.3027, November 2010.

[Yang et al.2010] Jianchao Yang, John Wright, Thomas S Huang, and Yi Ma. Image super-resolution via sparse representation. IEEE Transactions on Image Processing, 19(11):2861–2873, 2010.
 [Zhang et al.2016] Yuting Zhang, Kibok Lee, and Honglak Lee. Augmenting neural networks with reconstructive decoding pathways for large-scale image classification. In ICML, 2016.
 [Zhao et al.2016] Junbo Zhao, Michael Mathieu, Ross Goroshin, and Yann LeCun. Stacked what-where autoencoders. arXiv:1506.02351, 2016.
Appendix A Mathematical Analysis: Model-RIP and Random Filters
Theorem 3.1 (Restated). Assume that we have vectors of length in which each entry is a scaled i.i.d. (sub-)Gaussian random variable with zero mean and unit variance (the scaling factor is ). Let be the stride length (where ) and be a structured random matrix, which is the weight matrix of a single-layer CNN with channels and input length . If
for a positive constant , then with probability , the matrix satisfies the model-RIP for model with parameter .
Proof.
We note that the proof follows the same structure as those in other papers, such as [Park et al.2011] and [Vershynin2010], though we make minor tweaks to account for the particular structure of .
Suppose that , i.e., consists of at most nonzero entries, each of which appears in a distinct block of size (there are a total of blocks). First, Lemma A.1 shows that the norm of is preserved in expectation.
Lemma A.1.
Proof.
Note that each entry of is either zero or a Gaussian random variable before scaling. The claim follows since each row of satisfies if or for any , and we normalized the random variables so that for all . Finally, we have
∎
Let . We aim to show that the squared norm of the random variable concentrates tightly about its mean, i.e., deviates from it with exceedingly low probability.
To do so, we need several properties of sub-Gaussian and sub-exponential random variables. A mean-zero sub-Gaussian random variable $X$ has a moment generating function that satisfies
$$\mathbb{E}\big[\exp(tX)\big] \le \exp(Ct^2)$$
for all $t \in \mathbb{R}$ and some constant $C$. The sub-Gaussian norm of $X$, denoted $\|X\|_{\psi_2}$, is
$$\|X\|_{\psi_2} = \sup_{p \ge 1} \, p^{-1/2}\big(\mathbb{E}|X|^p\big)^{1/p}.$$
If $\|X\|_{\psi_2} \le K$, then $\mathbb{E}[\exp(tX)] \le \exp(Ct^2K^2)$ where $C$ is a positive constant (following Definition 5.7 in [Vershynin2010]). A sub-exponential random variable $X$ satisfies
$$\mathbb{P}(|X| \ge t) \le \exp(1 - t/K_1)$$
for all $t \ge 0$, where $K_1$ is a constant (there are two other equivalent properties; see [Vershynin2010] for details).
Let denote the th entry of the vector . We can write
and observe that is a linear combination of i.i.d. sub-Gaussian random variables (or it is identically equal to 0) and, as such, is itself a sub-Gaussian random variable with zero mean and sub-Gaussian norm (see [Vershynin2010], Lemma 5.9).
The structure of the random matrix and the number of nonzero entries in row of do enter the more refined bound on the sub-Gaussian norm of (again, see [Vershynin2010], Lemma 5.9 for details), but we ignore such details here as they are not necessary for the next estimate.
To obtain a concentration bound for , we recall from [Park et al.2011] and [Vershynin2010] that sums of squares of sub-Gaussian random variables concentrate tightly.
Theorem A.2.
Let be independent sub-Gaussian random variables with sub-Gaussian norms for all . Let . For every and every and a positive constant ,
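The displayed inequality of Theorem A.2 is not shown above; a standard Bernstein-type form consistent with this statement (via Lemma 5.14 and Proposition 5.16 of [Vershynin2010], written here with generic symbols $X_i$, $a_i$, $K$, $c$, which are assumptions standing in for the original notation) is: for independent sub-Gaussian $X_i$ with $\|X_i\|_{\psi_2} \le K$ and weights $a_i$,

```latex
\mathbb{P}\!\left( \left| \sum_{i} a_i \left( X_i^2 - \mathbb{E}[X_i^2] \right) \right| \ge t \right)
\;\le\;
2 \exp\!\left( -c \, \min\!\left( \frac{t^2}{K^4 \|a\|_2^2},\; \frac{t}{K^2 \|a\|_\infty} \right) \right)
\qquad \text{for all } t \ge 0,
```

which follows because each $X_i^2 - \mathbb{E}[X_i^2]$ is sub-exponential with $\psi_1$-norm bounded by a constant multiple of $K^2$.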
We note that although some entries may be identically zero, depending on the sparsity pattern of , not all entries are. Let us define so that .
From Lemma A.1 and the relation , we have
See Proposition 5.16 in [Vershynin2010] for the proof of Theorem A.2. We apply Theorem A.2 to the sub-Gaussian random variables with the weights . We have
If we set , , and use the above estimates for the norms of , we have
(3) 
Finally, we use the concentration of measure result in a crude union bound to bound the failure probability over all vectors . We take and for a desired constant failure probability. Using the smaller term in (3) (note that , , and ), we have
which implies
Therefore, if we design our matrix as described, with the parameter relationships as above, the matrix satisfies the model-RIP for and parameter with probability . ∎
Let us discuss the relationship among the parameters in our result. First, suppose we have only one channel and the filter length ; namely,
If (i.e., the filters are much shorter than the length of the input signal, as in a CNN), then we can compensate by adding more channels; i.e., the filter length needs to be larger than , or, if we add more channels, .
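To make these parameter trade-offs concrete, the model-RIP property can be probed numerically. The sketch below builds the structured matrix of a single-layer 1-D CNN from Gaussian random filters and measures how far ‖Φx‖² strays from ‖x‖² on random sparse inputs; all names and sizes are illustrative assumptions, circular boundary handling is a simplification, and the sparsity model is a relaxation of the one-nonzero-per-block model in the text:

```python
import numpy as np

def conv_matrix(filters, n, stride):
    """Structured matrix of a single-layer 1-D CNN: one column per
    (channel, stride offset) pair, each a shifted copy of a filter
    (circular shifts for simplicity)."""
    C, l = filters.shape
    cols = []
    for c in range(C):
        f = np.zeros(n)
        f[:l] = filters[c]
        for off in range(0, n, stride):
            cols.append(np.roll(f, off))
    return np.stack(cols, axis=1)            # shape (n, C * n // stride)

def rip_deviation(Phi, k, n_trials=500, seed=1):
    """Monte-Carlo probe of the RIP-type deviation |‖Phi x‖² − ‖x‖²|
    over random k-sparse unit vectors."""
    rng = np.random.default_rng(seed)
    N = Phi.shape[1]
    worst, energies = 0.0, []
    for _ in range(n_trials):
        x = np.zeros(N)
        idx = rng.choice(N, k, replace=False)
        x[idx] = rng.standard_normal(k)
        x /= np.linalg.norm(x)
        e = np.linalg.norm(Phi @ x) ** 2
        energies.append(e)
        worst = max(worst, abs(e - 1.0))
    return worst, float(np.mean(energies))

rng = np.random.default_rng(0)
C, l, n, stride, k = 8, 32, 256, 8, 3
filters = rng.standard_normal((C, l)) / np.sqrt(l)   # near-unit-norm columns
Phi = conv_matrix(filters, n, stride)
worst, mean_energy = rng and rip_deviation(Phi, k)
```

Longer filters or more channels (larger l or C relative to the sparsity k) shrink the observed deviation, in line with the parameter discussion above.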
Appendix B Mathematical Analysis: Reconstruction Bounds
The consequences of having the model-RIP are twofold. The first is that if we assume that an input image is a structured sparse linear combination of filters, where and satisfies the model-RIP, then we have upper and lower bounds on the norm of in terms of the norm of its sparse coefficients . Additionally,
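The bound referred to here is the usual RIP sandwich; writing $\Phi$ for the filter matrix, $z$ for the structured sparse coefficient vector, $x = \Phi z$, and $\delta$ for the model-RIP constant (generic symbols assumed for illustration), it reads:

```latex
(1 - \delta)\, \|z\|_2^2 \;\le\; \|x\|_2^2 \;=\; \|\Phi z\|_2^2 \;\le\; (1 + \delta)\, \|z\|_2^2 .
```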
More importantly, when we calculate the hidden units of ,
then we can see that the computation of is nothing other than the first step of a reconstruction algorithm analogous to that of model-based compressed sensing. As a result, we have a bound on the error between and , and we see that we can analyze the approximation properties of a feedforward CNN and its linear reconstruction algorithm. In particular, we can conclude that a feedforward CNN together with a linear reconstruction algorithm provides a good approximation to the original input image.
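This one-step encode/decode pipeline can be sketched as follows, with a dense Gaussian matrix standing in for the CNN filters and top-k selection standing in for ReLU plus max pooling; all parameters are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, k = 256, 512, 4

# Phi plays the role of the filter matrix, used BOTH to compute the
# hidden units (Phi.T) and to reconstruct the input (Phi), i.e. the
# tied-weight, one-step linear decoding described above.
Phi = rng.standard_normal((m, n)) / np.sqrt(m)

# A k-sparse nonnegative code z and the input it synthesizes.
z = np.zeros(n)
support = rng.choice(n, k, replace=False)
z[support] = 1.0 + rng.random(k)       # coefficients bounded away from 0
x = Phi @ z

# Forward pass: h = Phi^T x, then keep the k largest entries --
# a stand-in for ReLU + max pooling recording the switch units.
h = Phi.T @ x
z_hat = np.zeros(n)
top = np.argsort(np.abs(h))[-k:]
z_hat[top] = h[top]

# One-step linear reconstruction from the hidden units.
x_hat = Phi @ z_hat
rel_err = np.linalg.norm(x_hat - x) / np.linalg.norm(x)
```

The reconstruction error here stays well below the random baseline, mirroring the qualitative claim of the theorem; iterating the same two steps yields IHT.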
Theorem 3.3 (Restated). We assume that satisfies the RIP with constant . If we use in a single-layer CNN both to compute the hidden units and to reconstruct the input from these hidden units as , so that , then the error in our reconstruction is
Proof.
To show this result, we recall the following lemmas from [Baraniuk et al.2010] and rephrase them in the setting of a feedforward CNN. Note that Lemmas B.1 and B.2 are the same as Lemmas 1 and 2 in [Baraniuk et al.2010], respectively.
Lemma B.1.
Suppose $\Phi$ has model-RIP with constant $\delta$. Let $T$ be a support corresponding to a subspace in the model. Then we have the following bounds:
$$\|\Phi_T^\top y\|_2 \le \sqrt{1+\delta}\,\|y\|_2, \qquad (4)$$
$$(1-\delta)\|x\|_2 \le \|\Phi_T^\top \Phi_T x\|_2 \le (1+\delta)\|x\|_2, \qquad (5)$$
$$\frac{1}{1+\delta}\,\|x\|_2 \le \|(\Phi_T^\top \Phi_T)^{-1} x\|_2 \le \frac{1}{1-\delta}\,\|x\|_2. \qquad (6)$$
Lemma B.2.
Suppose that has RIP with constant . Let be a support corresponding to a subspace of and suppose that (not necessarily supported on ). Then
Let denote the support of the sparse vector . Set and set to be the result of max pooling applied to the vector , or the best fit (with respect to the norm) to in the model . Let denote the support set of . For simplicity, we assume .
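The model projection invoked here, i.e., max pooling as the best fit in the structured sparse model, can be sketched as follows (assuming one retained entry per pooling block; the function name is hypothetical):

```python
import numpy as np

def model_project(v, block_size):
    """Project v onto the block-structured sparse model: within each
    block, keep only the largest-magnitude entry.  This is exactly
    what max pooling does when it selects the switch units, and it is
    the best fit to v (in the l2 sense) with one nonzero per block."""
    v = np.asarray(v, dtype=float)
    out = np.zeros_like(v)
    for start in range(0, len(v), block_size):
        block = v[start:start + block_size]
        j = np.argmax(np.abs(block))      # switch unit for this block
        out[start + j] = block[j]
    return out

v = np.array([0.2, -1.5, 0.3, 0.9, 0.1, -0.4])
p = model_project(v, block_size=3)        # keeps -1.5 and 0.9
```

Keeping the largest-magnitude entry per block maximizes the retained energy, which is why this projection is the best fit in the model.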
Lemma B.3 (Identification).
The support set, , of the switch units captures a significant fraction of the total energy in the coefficient vector
Proof.
Let and be the vector restricted to the support sets and , respectively. Since both are support sets for and since is the best support set for ,
and, after several calculations identical to those in the proof of Lemma 3 in [Baraniuk et al.2010], we have
Using Lemma B.2 and the size , we have
Since the support of is the set , for , we can conclude that
and with some rearrangement, we have
∎
To set the value of on its support set , we simply set and . Then
Lemma B.4 (Estimation).
Proof.
Finally, if we use the autoencoder formulation to reconstruct the original image by setting , we can estimate the reconstruction error. We note that is sparse by construction, and we remind the reader that satisfies the model-RIP with constants . Then, using Lemma B.4 as well as the sparsity of ,