Towards Understanding the Invertibility of Convolutional Neural Networks

05/24/2017, by Anna C. Gilbert et al., University of Michigan

Several recent works have empirically observed that Convolutional Neural Nets (CNNs) are (approximately) invertible. To understand this approximate invertibility phenomenon and how to leverage it more effectively, we focus on a theoretical explanation and develop a mathematical model of sparse signal recovery that is consistent with CNNs with random weights. We give an exact connection between a particular model of model-based compressive sensing (and its recovery algorithms) and random-weight CNNs. We show empirically that several learned networks are consistent with our mathematical analysis and then demonstrate that, with such a simple theoretical framework, we can obtain reasonable reconstruction results on real images. We also discuss gaps between our model assumptions and CNNs trained for classification in practical scenarios.


1 Introduction

Deep learning has achieved remarkable success in many technological areas, including automatic speech recognition [Hinton et al.2012, Hannun et al.2014], natural language processing [Collobert et al.2011, Mikolov et al.2013, Cho et al.2014], and computer vision, in particular with deep Convolutional Neural Networks (CNNs) [LeCun et al.1989, Krizhevsky et al.2012, Simonyan and Zisserman2015, Szegedy et al.2015].

Following the unprecedented success of deep networks, there have been some theoretical works [Arora et al.2014, Arora et al.2015, Paul and Venkatasubramanian2014] that suggest mathematical models for different deep learning architectures. However, theoretical understanding lags behind the very rapid evolution and empirical success of deep architectures, and more analysis is needed to better understand state-of-the-art architectures and possibly to improve them further.

In this paper, we address the gap between the empirical success and theoretical understanding of CNNs, in particular their invertibility (i.e., reconstructing the input from the hidden activations), by analyzing a simplified mathematical model using random weights (see Sections 2.1 and 4.1 for the practical relevance of this assumption).

This property is intriguing because CNNs are typically trained with discriminative objectives (i.e., unrelated to reconstruction) on large amounts of labeled data, such as the ImageNet dataset [Deng et al.2009]. [Bruna et al.2014] studied signal recovery from generalized pooling operators using image patches on non-convolutional, small-scale networks and datasets. [Dosovitskiy and Brox2016] used upsampling-deconvolutional architectures to invert the hidden activations of feedforward CNNs back to the input domain. In another related work, [Zhao et al.2016] proposed a stacked what-where autoencoder network and demonstrated its promise in unsupervised and semi-supervised settings. [Zhang et al.2016] showed that CNNs discriminatively trained for image classification (e.g., VGGNet [Simonyan and Zisserman2015]) are almost fully invertible using pooling switches. Despite these interesting results, there is not yet a clear theoretical explanation as to why CNNs are invertible.

We introduce three new concepts that, coupled with the accepted notion that images have sparse representations, guide our understanding of CNNs:

  1. we provide a particular model of sparse linear combinations of the learned filters that are consistent with natural images; also, this model of sparsity is itself consistent with the feedforward network;

  2. we show that the effective matrices that capture explicitly the convolution of multiple filters exhibit a model-Restricted Isometry Property (model-RIP) [Baraniuk et al.2010]; and

  3. our model can explain each layer of the feedforward CNN algorithm as one iteration of Iterative Hard Thresholding (IHT) [Blumensath and Davies2009] for model-based compressive sensing and, hence, we can reconstruct the input simply and accurately.

In other words, we give a theoretical connection between a particular version of model-based compressive sensing (and its recovery algorithms) and CNNs. Using this connection, we give a reconstruction bound for a single layer in CNNs, which can possibly be extended to multiple layers. In the experimental sections, we show empirically that large-scale CNNs are consistent with our mathematical analysis. This paper explores these properties and elucidates specific empirical aspects that further mathematical models might need to take into account.

2 Preliminaries

In this section, we begin with a discussion of the effectiveness of random weights in CNNs, and then provide notation for CNNs, compressive sensing, and sparse signal recovery.

2.1 Effectiveness of Gaussian Random Filters

CNNs with Gaussian random filters have been shown to be surprisingly effective in unsupervised and supervised deep learning tasks. [Jarrett et al.2009] showed that random filters in 2-layer CNNs work well for image classification. Also, [Saxe et al.2011] observed that a convolutional layer followed by a pooling layer is frequency selective and translation invariant, even with random filters, and that these properties lead to good performance on object recognition tasks. On the other hand, [Giryes et al.2016] proved that CNNs with random Gaussian filters have a metric preservation property, and they argued that the role of training is to select better hyperplanes discriminating classes by distorting boundary points among classes. According to their observation, random filters are in fact a good choice if the training data are initially well-separated. Also, [He et al.2016] empirically showed that random-weight CNNs can perform image reconstruction well.

To better demonstrate the effectiveness of Gaussian random CNNs, we evaluate their classification performance on CIFAR-10 [Krizhevsky2009] in Section 4.1. Although the performance is not the state-of-the-art, it is surprisingly good considering that the networks are almost untrained. Our theoretical results may provide a new perspective on explaining these phenomena.

2.2 Convolutional Neural Nets

Figure 1: One-dimensional CNN architecture where is the matrix instantiation of convolution over channels with a filter bank consisting of different filters. Note that a filter bank has K filters of size , such that there are parameters in this architecture.

For simplicity, we vectorize input signals to 1-d signals; any operations we would ordinarily carry out on images, we carry out on vectors with the appropriate modifications. We define a single layer of our CNN as follows. We assume that the input signal consists of channels, each of length , and we write . For each of the input channels, , let , denote one of filters, each of length . Let be the stride length, the number of indices by which we shift each filter. Note that can be larger than 1. We assume that the number of shifts, , is an integer. Let be a vector of length that consists of the -th filter shifted by (i.e., it has at most non-zero entries). We concatenate each of these vectors (as row vectors) over the channels to form a large matrix , which is made up of blocks of the shifts of each filter in each of the channels. We assume that , that the row vectors of span , and that we have normalized the rows so that they have unit norm. The hidden units of the feed-forward CNN are computed by multiplying an input signal by the matrix (i.e., convolving, in each channel, by a filter bank of size , and summing over the channels to obtain outputs); convolution can be computed more efficiently than matrix multiplication, but the two are mathematically equivalent. We use for the hidden activation computed by a single-layer CNN without pooling. Figure 1 illustrates the architecture. As a nonlinear activation, we apply the function to the outputs, and then select the value with maximum absolute value in each of the blocks; i.e., we perform max pooling over each of the convolved filters.
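As a concrete, purely illustrative example of this construction, the following NumPy sketch builds a small 1-d instance: each row of the matrix is a shifted, per-channel copy of a random filter, the rows are concatenated over channels and normalized to unit norm, and the pre-pooling activations of one layer are the matrix-vector product followed by ReLU. The function name, shapes, and parameter values below are assumptions for illustration, not the paper's notation.

import numpy as np

def build_conv_matrix(filters, n, stride):
    # rows are shifted, per-channel copies of the filters, normalized to unit norm
    k, C, ell = filters.shape            # k filters, C channels, filter length ell
    n_shifts = n // stride               # number of shifts per filter
    W = np.zeros((k * n_shifts, C * n))
    for j in range(k):                   # which filter
        for t in range(n_shifts):        # which shift
            row = np.zeros((C, n))
            start = t * stride
            end = min(start + ell, n)
            row[:, start:end] = filters[j, :, :end - start]
            W[j * n_shifts + t] = row.reshape(-1)
    return W / np.linalg.norm(W, axis=1, keepdims=True)

rng = np.random.default_rng(0)
C, n, k, ell, stride = 3, 32, 8, 5, 1
W = build_conv_matrix(rng.standard_normal((k, C, ell)), n, stride)
x = rng.standard_normal(C * n)           # C channels of length n, vectorized
hidden = np.maximum(W @ x, 0.0)          # one convolutional layer with ReLU, before pooling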

2.3 Compressive Sensing

In compressive sensing, we assume that there is a latent sparse code that generates the visible signal . We say that a matrix with satisfies the Restricted Isometry Property (RIP) if there is a distortion factor such that, for all with exactly non-zero entries, . If satisfies the RIP with sufficiently small and if is -sparse, then given the vector , we can efficiently recover (see [Candés2008] for more details; we note that this is a sufficient condition and that there are other, less restrictive sufficient conditions, as well as more complicated necessary conditions). There are many efficient recovery algorithms, including sparse coding (e.g., minimization with regularization) and greedy and iterative algorithms, such as Iterative Hard Thresholding (IHT) [Blumensath and Davies2009].

Model-based compressive sensing. While sparse signals are a natural model for some applications, they are less realistic for CNNs. We consider a vector as the true sparse code generating the CNN input with a particular model of sparsity. Rather than permitting non-zero entries anywhere in the vector , we divide the support of into contiguous blocks of size and stipulate that each block contains at most one non-zero entry, with a total of non-zero entries. We call a vector with this sparsity model model--sparse and denote the union of all -sparse subspaces with this structure . It is clear that contains subspaces. In our analysis, we consider linear combinations of two model--sparse signals. To be precise, suppose that is the linear combination of two elements in . Then, we say that lies in the linear subspace that consists of all linear combinations of vectors from . (Intuitively, this is the subspace in which the error signal lies; it is used to derive the reconstruction bound in Appendix A.) We say that a matrix satisfies the model-RIP if there is a distortion factor such that, for all ,

(1 − δ) ‖z‖₂² ≤ ‖Wᵀz‖₂² ≤ (1 + δ) ‖z‖₂²   (1)

See [Baraniuk et al.2010] for the definitions of model sparsity and the model-RIP, as well as the necessary modifications to account for signal noise and compressible (as opposed to exactly sparse) signals, which we do not consider in this paper in order to keep our analysis simple. Intuitively, a matrix satisfying the model-RIP acts as a nearly orthonormal transform on a particular set of sparse vectors with a particular sparsity model or pattern.

For our analysis, we also need matrices that satisfy the model-RIP for vectors . We denote the distortion factor for such matrices; note that .

0:  model-RIP matrix Wᵀ, measurement x, structured sparse approximation algorithm M
0:  model-K-sparse approximation ẑ of the true code
1:  Initialize ẑ ← 0, residual r ← x, iteration count i ← 0
2:  while stopping criteria not met do
3:     i ← i + 1
4:     b ← ẑ + W r
5:     ẑ ← M(b, K)
6:     r ← x − Wᵀ ẑ
7:  end while
8:  return ẑ
Algorithm 1 Model-based IHT

Many efficient algorithms have been proposed for sparse coding and compressive sensing [Olshausen and others1996, Mallat and Zhang1993, Beck and Teboulle2009]. As with traditional compressive sensing, there are efficient algorithms for recovering model--sparse signals from measurements [Baraniuk et al.2010], assuming the existence of an efficient structured sparse approximation algorithm that, given an input vector and the sparsity parameter, returns the vector closest to the input with the specified sparsity structure.

In CNNs, the max pooling operator finds the downsampled activations that are closest to the activations of the original size by retaining the most significant values. Max pooling can be viewed as two steps: 1) zeroing out the locally non-maximum values; 2) downsampling the activations while retaining the locally maximum values. To study the pooled activations with sparsity structures, we can recover the dimension lost in the downsampling step with an upsampling operator. This procedure defines our structured sparse approximation algorithm , where is the original (unpooled) response, is the sparsity parameter for block-sparsification, and is the sparsified response after pooling but without shrinking the length (i.e., the locally non-maximum values are zeroed out so that has the same dimension as ). Note that is a model--sparse signal by construction. On the other hand, without considering the block-sparsification, we actually apply the following max pooling and upsampling operations:

(2)

where is the pooled response, is the filter response of the CNN given the input before max pooling (see Section 2.2), and denotes the upsampling switches that indicate where to place the non-zero values in the upsampled activations. Since our theoretical analysis does not depend on but depends on , any type of valid upsampling switches will be consistent with the block-sparsification (model--sparse) assumption; thus, we will use to denote the structured sparse approximation algorithm (2) without worrying about .
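As a concrete 1-d illustration (with assumed names and shapes), the sketch below implements this block-sparsification: within each contiguous pooling block, only the largest-magnitude entry is kept and the rest are zeroed, i.e., max pooling immediately followed by upsampling back to the original length.

import numpy as np

def structured_sparse_approx(z, block_size):
    # keep only the largest-magnitude entry in each contiguous block of z
    blocks = z.reshape(-1, block_size)
    keep = np.argmax(np.abs(blocks), axis=1)
    out = np.zeros_like(blocks)
    rows = np.arange(blocks.shape[0])
    out[rows, keep] = blocks[rows, keep]
    return out.reshape(-1)

z = np.array([0.1, -2.0, 0.3, 1.5, 0.2, -0.4])
print(structured_sparse_approx(z, block_size=3))   # [ 0.  -2.   0.   1.5  0.   0. ]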

We use the model-sparse version of IHT [Blumensath and Davies2009] as our recovery algorithm, since one iteration of IHT for our model of sparsity captures exactly a feedforward CNN. (Multiple iterations of IHT can improve the quality of signal recovery; however, that corresponds to a recurrent version of CNNs and is outside the scope of this work.) Algorithm 1 describes the model-based IHT algorithm. In particular, the sequence of steps 4–6 in the middle of IHT is exactly one layer of a feedforward CNN. As a result, the theoretical analysis of IHT for model-based sparse signal recovery serves as a guide for how to analyze the approximate activations of a CNN.
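To make the correspondence concrete, here is a short NumPy sketch of Algorithm 1; the structured approximation is redefined so that the snippet is self-contained, the stand-in matrix has orthonormal columns only so that a unit step size is stable, and all names and sizes are illustrative assumptions. With a single iteration, the update is exactly a convolution followed by the structured sparse approximation, i.e., one feedforward layer; additional iterations typically reduce the error when the model-RIP constant is small.

import numpy as np

def M(b, block):
    # structured sparse approximation: keep the largest-magnitude entry per block
    B = b.reshape(-1, block)
    i = np.argmax(np.abs(B), axis=1)
    out = np.zeros_like(B)
    out[np.arange(B.shape[0]), i] = B[np.arange(B.shape[0]), i]
    return out.reshape(-1)

def model_iht(W, x, block, n_iter=1):
    # model-based IHT for x ~ W.T z; with n_iter=1 and z initialized to zero,
    # the update M(W x) is one feedforward layer: convolution then pooling
    z = np.zeros(W.shape[0])
    for _ in range(n_iter):
        r = x - W.T @ z                 # residual in the input domain
        z = M(z + W @ r, block)         # gradient step + structured approximation
    return z

rng = np.random.default_rng(1)
blocks, block, m = 16, 8, 96            # code length 128, one non-zero per block
W = np.linalg.qr(rng.standard_normal((blocks * block, m)))[0]  # stand-in filter matrix
z_true = M(rng.standard_normal(blocks * block), block)
x = W.T @ z_true                        # synthesize the input from the sparse code
for t in (1, 10):
    z_hat = model_iht(W, x, block, n_iter=t)
    print(t, np.linalg.norm(z_hat - z_true) / np.linalg.norm(z_true))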

3 Analysis

Following the idea of compressive sensing in Section 2.3, we assume that the input is generated from a latent model--sparse signal with basis vectors , which turns out to be by Theorem 3.1 (i.e., ). Therefore, our analysis views the output of the CNN (with pooling) as a reconstruction of (i.e., ), and can be used to reconstruct from : that is, .

3.1 CNN Filters with Positive and Negative Pairs

Here we assume that all of the entries in the vectors are real numbers rather than only non-negative, as when using . This setup is equivalent to using Concatenated ReLU (CReLU) [Shang et al.2016] as an activation function (i.e., keeping the positive and negative activations as separate hidden units) with tied decoding weights. The CReLU activation scheme is justified by the fact that trained CNN filters come in positive and negative pairs, and it achieves superior classification performance on several benchmarks. This setting makes a CNN much easier to analyze within the model-based compressed sensing framework.

To motivate the setting, we begin with a simple example. Suppose that the matrix is an orthonormal basis for and define .

Proposition 1.

A one-layer CNN using the matrix , with no pooling, gives perfect reconstruction (with the matrix ) for any input vector .

Proof.

Because we have both the positive and the negative dot products of the signal with the basis vectors in , we have positive and negative versions of the hidden units and where we decompose into the difference of two non-negative vectors, the positive and the negative entries of . From this decomposition, we can easily reconstruct the original signal via

In the example above, we have pairs of vectors in our matrix . Now suppose that we have a vector where its positive and negative components can be split into , and that we synthesize a signal from using the matrix . Then, we have

Next, we multiply by a concatenation of positive and negative ; then we get , and if we apply to this vector, we get , which is a vector split into its positive and negative components. The structure of the product is crucial to the reconstruction quality of the vector . In addition, this calculation shows that if we have both positive and negative pairs of filters or vectors, then the function applied to both the positive and negative dot products simply splits the vector into its positive and negative components. These components are then reassembled in the next computation. For this reason, in the analysis in the following sections, it is sufficient to assume that all of the entries in the vectors are real numbers, rather than only non-negative.
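A small numerical sketch of this argument (with an assumed dimension and a randomly drawn orthonormal basis): stacking the basis and its negation as filters, applying ReLU, and decoding with the transposed (tied) weights recovers the input exactly.

import numpy as np

rng = np.random.default_rng(2)
n = 8
B, _ = np.linalg.qr(rng.standard_normal((n, n)))   # orthonormal basis of R^n
W = np.vstack([B, -B])                             # filters in positive/negative pairs
x = rng.standard_normal(n)

h = np.maximum(W @ x, 0.0)        # ReLU splits B x into its positive and negative parts
x_rec = W.T @ h                   # decode with the same (tied) weights
print(np.allclose(x_rec, x))      # True: perfect reconstruction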

3.2 Model-RIP and Random Filters

Our first main result shows that if we use Gaussian random filters in our CNN, then, with high probability, , the transpose of the large matrix formed by the convolution filters, satisfies the model-RIP. In other words, Gaussian random filters generate a matrix whose transpose is almost an orthonormal transform for sparse signals with a particular sparsity pattern (one that is consistent with our pooling procedure). The bounds in the theorem tell us that we must balance the size of the filters and the number of channels against the sparsity of the hidden units , the number of filter banks , the number of shifts , the distortion parameter , and the failure probability . The proof is in Appendix A.

Theorem 3.1.

Assume that we have vectors of length in which each entry is a scaled i.i.d. (sub-)Gaussian random variable with zero mean and unit variance (the scaling factor is ). Let be the stride length (where ) and let be a structured random matrix, which is the weight matrix of a single-layer CNN with channels and input length . If for a positive constant , then with probability , the matrix satisfies the model-RIP for model with parameter .

We also note that the same analysis can be applied to the sum of two model--sparse signals, with changes in the constants (that we do not track here).

Corollary 3.2.

Random matrices with the CNN structure satisfy, with high probability, the model-RIP for .

Other examples of matrices that satisfy the model-RIP include wavelets and localized Fourier bases; both examples can be easily and efficiently implemented via convolutions.
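As a quick sanity check of the wavelet example (with illustrative sizes), the sketch below assembles shifted copies of the two Haar filters with stride 2; the resulting matrix is exactly orthonormal, so it satisfies the (model-)RIP with zero distortion.

import numpy as np

n, stride = 8, 2
h = np.array([1.0, 1.0]) / np.sqrt(2.0)   # Haar low-pass filter
g = np.array([1.0, -1.0]) / np.sqrt(2.0)  # Haar high-pass filter

rows = []
for f in (h, g):
    for shift in range(0, n, stride):
        row = np.zeros(n)
        row[shift:shift + 2] = f
        rows.append(row)
W = np.stack(rows)                        # 8 x 8 matrix of shifted filter copies

print(np.allclose(W @ W.T, np.eye(n)))    # True: an orthonormal convolutional transform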

3.3 Reconstruction Bounds

Suppose that satisfies the model-RIP and that is the reconstruction of the true sparse code through a CNN layer followed by pooling, i.e., . Then Theorem 3.3 shows that is an approximate reconstruction of the input signal, with the relative error bounded by a function of the distortion parameters of the model-RIP.

Theorem 3.3.

We assume that satisfies the -RIP with constant . If we use in a single layer CNN both to compute the hidden units and to reconstruct the input from these hidden units as so that , the error in our reconstruction is

See Appendix B for the detailed proofs. Part of our analysis also shows that the hidden units are approximately the putative coefficient vector in the sparse linear representation of the input signal. Recall that the structured sparse approximation algorithm includes the downsampling caused by pooling and an upsampling operator. Theorem 3.3 is applicable to any type of upsampling switches, so our reconstruction bound is agnostic to the particular design choice of how to recover the activation size in a decoding neural network. We can extend the analysis from a single-layer CNN to a multi-layer CNN by using the output of one layer as the input to another, following the proof in Appendix B. We leave further investigation of this idea as future work.

4 Experimental Evidence and Analysis

In this section, we provide experimental validation of our theoretical model and analysis. We first validate the practical relevance of our assumption by examining the effectiveness of random-filter CNNs, and then provide results in more realistic scenarios. In particular, we study popular deep CNNs trained for image classification on the ILSVRC 2012 dataset [Deng et al.2009]. We calculate empirical model-RIP bounds for , showing that they are consistent with our theory. Our results are also consistent with a long line of research showing that it is reasonable to model real and natural images as sparse linear combinations over overcomplete dictionaries [Boureau et al.2008, Le et al.2013, Lee et al.2008, Olshausen and others1996, Ranzato et al.2007, Yang et al.2010]. In addition, we verify our theoretical bounds for the reconstruction error on real images. We investigate both randomly sampled filters and empirically learned filters in these experiments. Our implementation is based on Caffe [Jia et al.2014] and MatConvNet [Vedaldi and Lenc2015].

Recall that our theoretical analysis is generic to any upsampling switches in (2) for reconstruction. In the experiments, we specifically use naive upsampling to restore max-pooled activations to their original size: only the first element in each pooling region is assigned the pooled activation, and the remaining elements are zero. Thus, no extra information other than the pooled activation values is taken into account.
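A minimal sketch of this naive upsampling in 1-d (names and shapes are assumptions): each pooled value is written to the first position of its pooling region, and the remaining positions are zero.

import numpy as np

def naive_unpool(pooled, block):
    # place each pooled value at the first index of its region, zeros elsewhere
    out = np.zeros(pooled.size * block)
    out[::block] = pooled
    return out

print(naive_unpool(np.array([3.0, -1.0]), block=2))   # [ 3.  0. -1.  0.]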

4.1 Gaussian Random CNNs on CIFAR-10

To show the practical relevance of our theoretical assumption of random filters for CNNs, as stated in Section 2.1, we evaluate simple CNNs with Gaussian random filters with i.i.d. zero-mean, unit-variance entries on CIFAR-10 [Krizhevsky2009]. Note that the goal of this experiment is not to achieve state-of-the-art results, but to examine the practical relevance of our assumption of random-filter CNNs. Once the CNN weights are initialized (randomly), they are fixed during the training of the classifiers. (Implementation detail: we add a batch normalization layer together with a learnable scale and bias before the activation so that we do not need to tune the scale of the filters. See Appendix C.1 for more details.)

Specifically, we test random CNNs with 1, 2, and 3 convolutional layers, each followed by a ReLU activation and a max pooling layer. We tested different filter sizes () and numbers of channels () and report the best classification accuracy by cross-validation in Table 1. We also report the best performance using learnable filters for comparison. More details about the architectures can be found in Appendix C.1. We observe that CNNs with Gaussian random filters achieve good classification performance (implying that they serve as a reasonable representation of the input data), which is not too far off that of the learned filters. Our experimental results are also consistent with the observations made by [Jarrett et al.2009] and [Saxe et al.2011]. In conclusion, these results suggest that CNNs with Gaussian random filters are a reasonable setup that is amenable to mathematical analysis while not being too far off in terms of practical relevance.
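For concreteness, here is a rough PyTorch sketch in the spirit of this setup; the layer sizes, filter counts, and use of LazyLinear are illustrative assumptions rather than the architectures of Appendix C.1. The convolutional filters are drawn i.i.d. Gaussian and frozen, and only the batch-norm scale/bias and the final classifier are trained.

import torch
import torch.nn as nn

def random_filter_cnn(num_layers=2, channels=64, filter_size=5, num_classes=10):
    layers, in_ch = [], 3
    for _ in range(num_layers):
        conv = nn.Conv2d(in_ch, channels, filter_size, padding=filter_size // 2)
        nn.init.normal_(conv.weight)        # i.i.d. Gaussian random filters
        nn.init.zeros_(conv.bias)
        conv.weight.requires_grad_(False)   # filters stay fixed during training
        conv.bias.requires_grad_(False)
        # batch norm with learnable scale/bias before the activation, then max pooling
        layers += [conv, nn.BatchNorm2d(channels), nn.ReLU(), nn.MaxPool2d(2)]
        in_ch = channels
    layers += [nn.Flatten(), nn.LazyLinear(num_classes)]  # only BN + classifier train
    return nn.Sequential(*layers)

model = random_filter_cnn()
logits = model(torch.randn(8, 3, 32, 32))   # a batch of CIFAR-10-sized inputs
print(logits.shape)                          # torch.Size([8, 10])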

Method 1 layer 2 layers 3 layers
Random filters 66.5% 74.6% 74.8%
Learned filters 68.1% 83.3% 89.3%
Table 1: Classification accuracy of CNNs with random and learnable filters on CIFAR-10. A typical layer consists of four operators: convolution, ReLU, batch normalization and max pooling. Networks with optimal filter size and numbers of output channels are used. (See Appendix C.1 for more details about the architectures). The random filters, assumed in our theoretical analysis, perform reasonably well, not far off the learned filters.
Figure 2: For 1-d scaled Gaussian random filters , we plot the histogram of ratios (a) (model-RIP in (1); supposed to be concentrated at ), (b) (ratio between the norm of the reconstructed code and that of the original code ; supposed to be concentrated at ), and (c) (reconstruction bound in Theorem 3.3; supposed to be small), where is a sparse signal that generates the vector and is the reconstruction of . We use naive upsampling to recover the dimension reduced by pooling: we place recovered values in the top-left corner of each upsampled block. (See Section 2.3.)

4.2 1-d Model-RIP

We use 1-d synthetic data to empirically show the basic validity of our theory in terms of the model-RIP in (1) and the reconstruction bound in Theorem 3.3. We plot the histograms of the empirical model-RIP values of 1-d Gaussian random filters ( scaled by ) with size on 1-d sparse signals with size and sparsity , whose non-zero elements are drawn from a uniform distribution on . The histograms in Figure 2 (a)–(b) are tightly centered around , suggesting that satisfies the model-RIP in (1) and its corollary from Lemma B.1, respectively. We also empirically show the reconstruction bound in Theorem 3.3 on synthetic vectors (Figure 2 (c)). The reconstruction error is concentrated around and bounded under . The results in Figure 2 suggest the practical validity of our theory when the model assumptions hold.
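A rough sketch of this kind of check, using a generic Gaussian matrix with unit-norm rows as a stand-in for the structured filter matrix and assumed sizes and value ranges: it samples model-sparse codes, measures the empirical model-RIP ratio in (1), and measures the relative error of the one-step encode/decode considered in Theorem 3.3.

import numpy as np

rng = np.random.default_rng(3)
n_rows, n_cols, block = 256, 1024, 8           # 32 blocks of size 8 in the code domain
W = rng.standard_normal((n_rows, n_cols))
W /= np.linalg.norm(W, axis=1, keepdims=True)  # unit-norm rows, as in Section 2.2

def sample_model_sparse():
    # one non-zero per block, magnitudes uniform in [0.5, 1) with random signs
    z = np.zeros(n_rows)
    idx = np.arange(0, n_rows, block) + rng.integers(0, block, n_rows // block)
    z[idx] = rng.uniform(0.5, 1.0, idx.size) * rng.choice([-1.0, 1.0], idx.size)
    return z

def keep_max_per_block(b):
    # structured sparse approximation, as in the earlier sketches
    blocks = b.reshape(-1, block)
    i = np.argmax(np.abs(blocks), axis=1)
    out = np.zeros_like(blocks)
    out[np.arange(blocks.shape[0]), i] = blocks[np.arange(blocks.shape[0]), i]
    return out.reshape(-1)

rip_ratios, rec_errors = [], []
for _ in range(1000):
    z = sample_model_sparse()
    x = W.T @ z                                # synthesize the input from the code
    z_hat = keep_max_per_block(W @ x)          # one layer: filter responses + pooling
    rip_ratios.append(np.linalg.norm(x) / np.linalg.norm(z))
    rec_errors.append(np.linalg.norm(W.T @ z_hat - x) / np.linalg.norm(x))

print(np.mean(rip_ratios), np.std(rip_ratios))   # ratios concentrate near 1
print(np.mean(rec_errors))                       # relative error stays well below 1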

4.3 Architectures for 2-d Model-RIP

We conduct the rest of our experimental evaluations on the 16-layer VGGNet (Model D in [Simonyan and Zisserman2015]), where the computation is carried out on images; e.g., convolution with a 2-d filter bank and pooling over square regions. In contrast to the theory, the realistic network does not pool activations over all possible shifts of each filter, but rather over non-overlapping patches. The networks are trained for the large-scale image classification task, which is important for extending to other supervised tasks in vision. The main findings on VGGNet are presented in the rest of this section; we also provide some analysis of AlexNet [Krizhevsky et al.2012] in Appendix C.2.

VGGNet contains five macro layers of convolution and pooling, and each macro layer has 2 or 3 convolutional layers followed by a pooling layer. We denote the -th convolutional layer in the -th macro layer “conv,” and the pooling layer “pool.” The activations/features from the -th macro layer are the output of pool. Our analysis is for single convolutional layers.

4.4 2-d Model-RIP

The model-RIP is the key to our reconstruction bound in Theorem 3.3. We empirically evaluate the model-RIP, i.e., , for real CNN filters of the pretrained VGGNet. We use two-dimensional coefficients (each block of coefficients is of size ), filters of size , and pool the coefficients over smaller pooling regions (i.e., not over all possible shifts of each filter). The following experimental evidence suggests that the sparsity model and the model-RIP of the filters are consistent with our mathematical analysis of the simpler one-dimensional case.

To check the significance of the model-RIP (i.e., how close is to ) in controlled settings, we first synthesize the hidden activations with sparse uniform random variables, which fully agree with our model assumptions.

layer c(1,1) c(1,2) p(1) c(2,1) c(2,2) p(2)
% of non-zeros 49.1 69.7 80.8 67.4 49.7 70.7
layer c(3,1) c(3,2) c(3,3) p(3) c(4,1) c(4,2)
% of non-zeros 53.4 51.9 28.7 45.9 35.6 29.6
layer c(4,3) p(4) c(5,1) c(5,2) c(5,3) p(5)
% of non-zeros 12.6 23.1 23.9 20.6 7.3 13.1
Table 2: Layer-wise sparsity of VGGNet on ILSVRC 2012 validation set. “c” stands for convolutional layers and “p” represents pooling layers. CNN with random filters in Section 4.4 can be simulated with the same sparsity.
layer (1,1) (1,2) (2,1) (2,2) (3,1) (3,2) (3,3)
learned 0.943 0.734 0.644 0.747 0.584 0.484 0.519
random 0.670 0.122 0.155 0.105 0.110 0.090 0.080
layer (4,1) (4,2) (4,3) (5,1) (5,2) (5,3)
learned 0.460 0.457 0.404 0.410 0.410 0.405
random 0.092 0.062 0.062 0.070 0.067 0.067
Table 3: Comparison of coherence between learned filters in each convolutional layer of VGGNet and Gaussian random filters with corresponding sizes.

The sparsity of is constrained to the average level of the real CNN activations, which is reported in Table 2. Given the filters of a certain convolutional layer, we use the synthetic (in equal position to this layer’s output activations) to gather statistics for the model-RIP. To be consistent with succeeding experiments, we choose conv; other layers show similar results. Figure 3 (a) summarizes the distribution of empirical model-RIP values, which is clearly centered around and satisfies (1) with a short tail roughly bounded by . As for the details of the setup, we normalize the filters from the conv layer, which are (). All filters with input channels are used (we do not remove any filters, including those in approximate positive/negative pairs; see Section 3). We set , which is the same as the output activations of conv, and use pooling regions, which are commonly used in recent CNNs (no pooling layer follows conv in VGGNet; however, we use it in this way to analyze the convolution-pooling pair per the theory). We generate 1000 randomly sampled sparse activation maps by first sampling their non-zero supports and then filling the elements on the supports uniformly from . The sparsity is the same as that of the conv activations.

Figure 3: (a) Random; (b) After; (c) Before. For VGGNet’s conv filters , we plot the histogram of ratios , which is expected to be concentrated at according to (1), where is a sparse signal. In (a), is randomly generated with the same sparsity as the conv activations, with the non-zero magnitudes drawn from a uniform distribution. In (b) and (c), is recovered by Algorithm 2 from the conv(5,1) activations before and after applying , respectively. The learned filters admit model-RIP value distributions similar to those of the random filters, except for a somewhat larger bandwidth, which means that the model-RIP in (1) can empirically hold even when the filters are not necessarily subject to the i.i.d. Gaussian random assumption.
0:  convolution matrix W, input activation/image x
0:  hidden code ẑ, satisfying our model-RIP sparsity assumption and reconstructing x with Wᵀẑ
1:  z₀ ← argmin over z of ‖x − Wᵀz‖₂² + λ‖z‖₂²
2:  z₁ ← unpool(pool(z₀)), where the unpooling uses the pooling switches of z₀, so that supp(z₁) has at most one index per pooling region
3:  return ẑ ← argmin over z supported on supp(z₁) of ‖x − Wᵀz‖₂² + λ‖z‖₂²
Algorithm 2 Sparse hidden activation recovery

More realistically, we observe that the actual conv activations from VGGNet are not necessarily drawn from a model-sparse uniform distribution. This motivates us to evaluate the empirical model-RIP on the hidden activations that reconstruct the actual input activations from conv by . Per the theory, is given by a max pooling layer, so we constrain the sparsity (i.e., the support set contains at most one index per pooling region for a single channel). We use a simple and efficient algorithm, Algorithm 2, to recover from . The algorithm is inspired by “heuristic” methods that are commonly used in practice (e.g., [Boyd2015]). As shown in Algorithm 2, we first perform -regularized least squares without constraining the support set. Max pooling is then applied to determine the support set for each pooling region. In particular, we use max pooling and unpooling with known switches (line 2) to zero out the locally non-maximum values without disturbing the support structure. We then perform -regularized least squares again on the fixed support set to recover hidden activations satisfying the model sparsity. As shown in Figures 3 (b)–(c), the empirical model-RIP values for visual activations from conv with/without are both close to . The offset of the center from is less than and the range is roughly bounded by , which agrees with the theoretical bound in (1). To gain more insight, we summarize the coherence of the learned filters in Table 3 for all convolutional layers in VGGNet. (The coherence is defined as the maximum absolute dot product between distinct pairs of columns of the matrix , i.e., , where denotes the -th row of the matrix . This measures the correlation or similarity between the columns of and is a proxy for the value of the model-RIP parameter , which we can only estimate computationally.) The smaller the coherence, the smaller is, and the better the reconstruction. The coherence of the learned filters is not low, which is inconsistent with our theoretical assumptions. However, the model-RIP turns out to be robust to this mismatch, which demonstrates the strong practical invertibility of CNNs.
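A minimal NumPy sketch of Algorithm 2 under the assumptions above (a 1-d layout, a ridge parameter lam, and names chosen for illustration): an l2-regularized least-squares fit, support selection by max pooling within each block, a second regularized fit restricted to that support, plus the coherence computation used for Table 3.

import numpy as np

def recover_sparse_code(W, x, block, lam=1e-3):
    m = W.shape[0]
    # step 1: l2-regularized least squares for x ~ W.T z, no support constraint
    z0 = np.linalg.solve(W @ W.T + lam * np.eye(m), W @ x)
    # step 2: max pooling picks one support index per pooling block (the switches)
    support = np.array([s + int(np.argmax(np.abs(z0[s:s + block])))
                        for s in range(0, m, block)])
    # step 3: l2-regularized least squares again, restricted to the chosen support
    Ws = W[support]
    zs = np.linalg.solve(Ws @ Ws.T + lam * np.eye(len(support)), Ws @ x)
    z_hat = np.zeros(m)
    z_hat[support] = zs
    return z_hat

def coherence(W):
    # max |<w_i, w_j>| over distinct unit-normalized filters (rows of W here)
    Wn = W / np.linalg.norm(W, axis=1, keepdims=True)
    G = np.abs(Wn @ Wn.T)
    return np.max(G - np.eye(W.shape[0]))

rng = np.random.default_rng(4)
W = rng.standard_normal((64, 256))
W /= np.linalg.norm(W, axis=1, keepdims=True)
x = W.T @ rng.standard_normal(64)              # an arbitrary input in the row span
z_hat = recover_sparse_code(W, x, block=8)
print(coherence(W), np.count_nonzero(z_hat))   # filter coherence; 8 non-zeros, one per block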

4.5 Reconstruction Bounds

With the model-RIP as a sufficient condition, Theorem 3.3 provides a theoretical bound for layer-wise reconstruction via , which consists of the projection and reconstruction of one IHT iteration. Since there is no risk of confusion, we refer to it simply as IHT for notational convenience. We investigate the practical reconstruction errors on pool to of VGGNet.

Figure 4: Visualization of images reconstructed by a pretrained decoding network with VGGNet’s pool activation reconstructed using different methods: (a) original image, (b) output of the -layer decoding network with original activation, (c) output of the decoding net with reconstructed activation by IHT with learned filters, (d) output of the decoding net with reconstructed activation by IHT with Gaussian random filters, and (e) output of the decoding net with Gaussian random activation.

To encode and reconstruct intermediate activations of CNNs, we employ IHT with the sparsity estimated from the real CNN activations on the ILSVRC 2012 validation set (see Table 2). We also reconstruct input images, since CNN inversion is not limited to a single layer, and images are easier to visualize than hidden activations. To implement image reconstruction, we project the reconstructed activations into the image space via a pretrained decoding network as in [Zhang et al.2016], which extends an autoencoder architecture similar to that of [Dosovitskiy and Brox2016] to a stacked “what-where” autoencoder [Zhao et al.2016]. The reconstructed activations are scaled to have the same norm as the original activations so that we can feed them into the decoding network.

As an example, Figure 4 illustrates the image reconstruction results for the hidden activations of pool. Interestingly, the decoding network itself is quite powerful, since it can reconstruct a rough (although very noisy) glimpse of the images even from Gaussian random input, as shown in Figure 4 (e). Object shapes are recovered to some extent by using the pooling switches alone in the “what-where” autoencoder. This result suggests that it is important to determine which pooling units are active and then to estimate their values accurately. These steps are consistent with the steps in the inner loop of any iterative sparse signal reconstruction algorithm.

In Figure 4 (c), we take the pretrained conv filters for IHT. The images recovered from the IHT-reconstructed pool activations are reasonable, and the reconstruction quality is significantly better than the random-input baseline. We also try Gaussian random filters (Figure 4 (d)), which agree more with the model assumptions (e.g., lower coherence; see Table 3). The learned filters from VGGNet perform equally well visually. IHT ties the encoder and decoder weights (no filter learning for the decoder), so it does not perform as well as the decoding network trained on a large amount of data (Figure 4 (b)). Nevertheless, we show both theoretically and experimentally decent reconstruction bounds for these simple reconstruction methods on real CNNs. More visualization results for more layers are in Appendix C.3.

layer | image-space relative error                         | activation-space relative error
      | learned filters | random filters | random activ.    | learned filters | random filters | random activ.
1     | 0.423           | 0.380          | 0.610            | 0.895           | 0.872          | 1.414
2     | 0.692           | 0.438          | 0.864            | 0.961           | 0.926          | 1.414
3     | 0.326           | 0.345          | 0.652            | 0.912           | 0.862          | 1.414
4     | 0.379           | 0.357          | 0.436            | 1.051           | 0.992          | 1.414

Table 4: Layer-wise relative reconstruction errors, in image space and activation space, between reconstructed and original activations. For macro layer , we take its post-pooling activation, reconstruct it with different methods (using learned filters from the layer above or scaled Gaussian random filters), and feed the reconstructed activation to a pretrained corresponding decoding network. (The values in the last column are identical () for all layers because on average for Gaussian random provided .)

In Table 4, we summarize the reconstruction performance for all 4 macro layers. With random filters, the model assumptions hold and the IHT reconstruction is quantitatively the best. IHT with real CNN filters performs comparably to this best case and much better than the baseline established by randomly sampled activations.

5 Conclusion

We introduce three concepts that tie together a particular model of compressive sensing (and the associated recovery algorithms), the properties of learned filters, and the empirical observation that CNNs are (approximately) invertible. Our experiments show that filters in trained CNNs are consistent with the mathematical properties we present while the hidden units exhibit a much richer structure than mathematical analysis suggests. Perhaps simply moving towards a compressive, rather than exactly sparse, model for the hidden units will capture the sophisticated structure in these layers of a CNN or, perhaps, we need a more sophisticated model. Our experiments also demonstrate that there is considerable information captured in the switch units (or the identities of the non-zeros in the hidden units after pooling) that no mathematical model has yet expressed or explored thoroughly. We leave such explorations as future work.

Acknowledgments

This work was supported in part by ONR N00014-16-1-2928, NSF CAREER IIS-1453651, and Sloan Research Fellowship. We would like to thank Michael Wakin for helpful discussions about concentration of measure for structured random matrices.


References

  • [Arora et al.2014] Sanjeev Arora, Aditya Bhaskara, Rong Ge, and Tengyu Ma. Provable Bounds for Learning Some Deep Representations. In ICML, 2014.
  • [Arora et al.2015] Sanjeev Arora, Yingyu Liang, and Tengyu Ma. Why are deep nets reversible: A simple theory, with implications for training. arXiv:1511.05653, 2015.
  • [Baraniuk et al.2010] R. G. Baraniuk, V. Cevher, M. F. Duarte, and C. Hegde. Model-Based Compressive Sensing. IEEE Transactions on Information Theory, 56(4):1982–2001, 2010.
  • [Beck and Teboulle2009] Amir Beck and Marc Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal of Imaging Science, 2:183–202, 2009.
  • [Blumensath and Davies2009] Thomas Blumensath and Mike E Davies. Iterative hard thresholding for compressed sensing. Applied and Computational Harmonic Analysis, 27(3):265–274, 2009.
  • [Boureau et al.2008] Y-lan Boureau, Yann LeCun, et al. Sparse feature learning for deep belief networks. In NIPS, 2008.
  • [Boyd2015] Stephen Boyd. -norm methods for convex-cardinality problems, ee364b: Convex optimization II lecture notes, 2014-2015 spring. 2015.
  • [Bruna et al.2014] Joan Bruna, Arthur Szlam, and Yann LeCun. Signal recovery from pooling representations. In ICML, 2014.
  • [Candés2008] Emmanuel J. Candés. The restricted isometry property and its implications for compressed sensing. Comptes Rendus Mathematique, 346(9):589–592, 2008.
  • [Cho et al.2014] Kyunghyun Cho, Bart Van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representations using rnn encoder-decoder for statistical machine translation. arXiv:1406.1078, 2014.
  • [Collobert et al.2011] Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa. Natural language processing (almost) from scratch. Journal of Machine Learning Research, 12(Aug):2493–2537, 2011.
  • [Deng et al.2009] Jia Deng, Wei Dong, R. Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In CVPR, 2009.
  • [Dosovitskiy and Brox2016] Alexey Dosovitskiy and Thomas Brox. Inverting visual representations with convolutional networks. In CVPR, 2016.
  • [Giryes et al.2016] Raja Giryes, Guillermo Sapiro, and Alex M Bronstein. Deep neural networks with random gaussian weights: A universal classification strategy? IEEE Transactions on Signal Processing, 64(13):3444–3457, 2016.
  • [Hannun et al.2014] Awni Hannun, Carl Case, Jared Casper, Bryan Catanzaro, Greg Diamos, Erich Elsen, Ryan Prenger, Sanjeev Satheesh, Shubho Sengupta, Adam Coates, et al. Deep speech: Scaling up end-to-end speech recognition. arXiv:1412.5567, 2014.
  • [He et al.2016] Kun He, Yan Wang, and John Hopcroft. A powerful generative model using random weights for the deep image representation. arXiv:1606.04801, 2016.
  • [Hinton et al.2012] Geoffrey Hinton, Li Deng, Dong Yu, George E Dahl, Abdel-rahman Mohamed, Navdeep Jaitly, Andrew Senior, Vincent Vanhoucke, Patrick Nguyen, Tara N Sainath, et al. Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. IEEE Signal Processing Magazine, 29(6):82–97, 2012.
  • [Jarrett et al.2009] Kevin Jarrett, Koray Kavukcuoglu, Marc’Aurelio Ranzato, and Yann LeCun. What is the best multi-stage architecture for object recognition? In ICCV, 2009.
  • [Jia et al.2014] Yangqing Jia, Evan Shelhamer, Jeff Donahue, Sergey Karayev, Jonathan Long, Ross Girshick, Sergio Guadarrama, and Trevor Darrell. Caffe: Convolutional architecture for fast feature embedding. arXiv:1408.5093, 2014.
  • [Krizhevsky et al.2012] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In NIPS, 2012.
  • [Krizhevsky2009] Alex Krizhevsky. Learning multiple layers of features from tiny images. Technical report, University of Toronto, 2009.
  • [Le et al.2013] Q. V. Le, M. Ranzato, R. Monga, M. Devin, K. Chen, G. S. Corrado, J. Dean, and A. Y. Ng. Building high-level features using large scale unsupervised learning. In ICML, 2013.
  • [LeCun et al.1989] Yann LeCun, Bernhard Boser, John S Denker, Donnie Henderson, Richard E Howard, Wayne Hubbard, and Lawrence D Jackel. Backpropagation applied to handwritten zip code recognition. Neural computation, 1(4):541–551, 1989.
  • [Lee et al.2008] Honglak Lee, Chaitanya Ekanadham, and Andrew Y Ng. Sparse deep belief net model for visual area v2. In NIPS, 2008.
  • [Mallat and Zhang1993] Stephane Mallat and Zhifeng Zhang. Matching pursuits with time-frequency dictionaries. IEEE Transactions on Signal Processing, 41:3397 – 3415, 1993.
  • [Mikolov et al.2013] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. Distributed representations of words and phrases and their compositionality. In NIPS, pages 3111–3119, 2013.
  • [Olshausen and others1996] Bruno A Olshausen et al. Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature, 381(6583):607–609, 1996.
  • [Park et al.2011] Jae Young Park, Han Lun Yap, C.J. Rozell, and M. B. Wakin. Concentration of Measure for Block Diagonal Matrices With Applications to Compressive Signal Processing. IEEE Transactions on Signal Processing, 59(12):5859–5875, 2011.
  • [Paul and Venkatasubramanian2014] Arnab Paul and Suresh Venkatasubramanian. Why does Deep Learning work? - A perspective from Group Theory. arXiv.org, December 2014.
  • [Ranzato et al.2007] Marc Aurelio Ranzato, Fu Jie Huang, Y-Lan Boureau, and Yann LeCun. Unsupervised learning of invariant feature hierarchies with applications to object recognition. In CVPR, 2007.
  • [Saxe et al.2011] Andrew Saxe, Pang W Koh, Zhenghao Chen, Maneesh Bhand, Bipin Suresh, and Andrew Y Ng. On random weights and unsupervised feature learning. In ICML, 2011.
  • [Shang et al.2016] Wenling Shang, Kihyuk Sohn, Diogo Almeida, and Honglak Lee. Understanding and improving convolutional neural networks via concatenated rectified linear units. In ICML, 2016.
  • [Simonyan and Zisserman2015] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015.
  • [Szegedy et al.2015] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In CVPR, 2015.
  • [Vedaldi and Lenc2015] A. Vedaldi and K. Lenc. Matconvnet – convolutional neural networks for matlab. In Proceeding of the ACM International Conference on Multimedia, 2015.
  • [Vershynin2010] Roman Vershynin. Introduction to the non-asymptotic analysis of random matrices. arXiv:1011.3027, November 2010.
  • [Yang et al.2010] Jianchao Yang, John Wright, Thomas S Huang, and Yi Ma. Image super-resolution via sparse representation. IEEE Transactions on Image Processing, 19(11):2861–2873, 2010.
  • [Zhang et al.2016] Yuting Zhang, Kibok Lee, and Honglak Lee. Augmenting neural networks with reconstructive decoding pathways for large-scale image classification. In ICML, 2016.
  • [Zhao et al.2016] Junbo Zhao, Michael Mathieu, Ross Goroshin, and Yann Lecun. Stacked what-where auto-encoders. arXiv:1506.02351, 2016.

Appendix A Mathematical Analysis: Model-RIP and Random Filters

Theorem 3.1 (Restated) Assume that we have vectors of length in which each entry is a scaled i.i.d. (sub-)Gaussian random variable with zero mean and unit variance (the scaling factor is ). Let be the stride length (where ) and let be a structured random matrix, which is the weight matrix of a single-layer CNN with channels and input length . If for a positive constant , then with probability , the matrix satisfies the model-RIP for model with parameter .

Proof.

We note that the proof follows the same structure as those in other papers, such as [Park et al.2011] and [Vershynin2010], though we make minor tweaks to account for the particular structure of .

Suppose that , i.e., consists of at most non-zero entries, each of which appears in a distinct block of size (there are a total of blocks). First, Lemma A.1 shows that the expectation of the norm of is preserved.

Lemma A.1.
Proof.

Note that each entry of is either zero or a Gaussian random variable before scaling. Therefore, it is obvious that , since each row of satisfies if or for any , and we normalized the random variables so that for all ’s. Finally, we have

Let . We aim to show that the square norm of the random variable concentrates tightly about its mean; i.e., with exceedingly low probability

To do so, we need several properties of sub-Gaussian and sub-exponential random variables. A mean-zero sub-Gaussian random variable

has a moment generating function that satisfies

for all and some constant . The sub-Gaussian norm of , denoted is

If , then , where is a positive constant (following Definition 5.7 in [Vershynin2010]).

A sub-exponential random variable satisfies for all . (There are two other equivalent properties; see [Vershynin2010] for details.)

Let denote the -th entry of the vector . We can write and observe that is a linear combination of i.i.d. sub-Gaussian random variables (or it is identically equal to 0) and, as such, is itself a sub-Gaussian random variable with zero mean and sub-Gaussian norm (see [Vershynin2010], Lemma 5.9).

The structure of the random matrix and the number of non-zero entries in row of do enter the more refined bound on the sub-Gaussian norm of (again, see [Vershynin2010], Lemma 5.9 for details), but we ignore such details for this estimate as they are not necessary for the next estimate.

To obtain a concentration bound for , we recall from [Park et al.2011] and [Vershynin2010] that sums of squares of sub-Gaussian random variables concentrate tightly.

Theorem A.2.

Let be independent sub-Gaussian random variables with sub-Gaussian norms for all . Let . For every and every and a positive constant ,

We note that although some entries may be identically zero, depending on the sparsity pattern of , not all entries are. Let us define so that .

From Lemma A.1 and the relation , we have

See Proposition 5.16 in [Vershynin2010] for the proof of Theorem A.2. We apply Theorem A.2 to the sub-Gaussian random variables with the weights . We have

If we set , , and use the above estimates for the norms of , we have

(3)

Finally, we use the concentration of measure result in a crude union bound to bound the failure probability over all vectors . We take and for a desired constant failure probability. Using the smaller term in (3), (note that , , and ) we have

which implies

Therefore, if we design our matrix as described and with the parameter relationships as above, the matrix satisfies the model-RIP for and parameter with probability . ∎

Let us discuss the relationship amongst the parameters in our result. First, if we have only one channel and the filter length ; namely,

If (i.e., the filters are much shorter than the length of the input signal, as in a CNN), then we can compensate by adding more channels; i.e., the filter length needs to be larger than , or, if we add more channels, .

Appendix B Mathematical Analysis: Reconstruction Bounds

The consequences of having the model-RIP are two-fold. The first is that if we assume that an input image is the structured sparse linear combination of filters, where and satisfies the model-RIP, then we know an upper and lower bound on the norm of in terms of the norm of its sparse coefficients, . Additionally,

More importantly, when we calculate the hidden units of ,

then we can see that the computation of is nothing other than the first step of a reconstruction algorithm analogous to that of model-based compressed sensing. As a result, we have a bound on the error between and , and we see that we can analyze the approximation properties of a feedforward CNN and its linear reconstruction algorithm. In particular, we can conclude that a feedforward CNN and a linear reconstruction algorithm provide a good approximation to the original input image.

Theorem 3.3(Restated) We assume that satisfies the -RIP with constant . If we use in a single layer CNN both to compute the hidden units and to reconstruct the input from these hidden units as so that , the error in our reconstruction is

Proof.

To show this result, we recall the following lemmas from [Baraniuk et al.2010] and rephrase them in the setting of a feedforward CNN. Note that Lemmas B.1 and B.2 are the same as Lemmas 1 and 2 in [Baraniuk et al.2010], respectively.

Lemma B.1.

Suppose has -RIP with constant . Let be a support corresponding to a subspace in . Then we have the following bounds:

(4)
(5)
(6)
Lemma B.2.

Suppose that has -RIP with constant . Let be a support corresponding to a subspace of and suppose that (not necessarily supported on ). Then

Let denote the support of the sparse vector . Set and set to be the result of max pooling applied to the vector , or the best fit (with respect to the norm) to in the model . Let denote the support set of . For simplicity, we assume .

Lemma B.3 (Identification).

The support set, , of the switch units captures a significant fraction of the total energy in the coefficient vector

Proof.

Let and be the vector restricted to the support sets and , respectively. Since both are support sets for and since is the best support set for ,

and, after several calculations identical to those in the proof of Lemma 3 in [Baraniuk et al.2010], we have

Using Lemma B.2 and the size , we have

Using (6) and Lemma B.2, we can bound the other side of the inequality as

Since the support of is the set , for , we can conclude that

and with some rearrangement, we have

To set the value of on its support set , we simply set and . Then

Lemma B.4 (Estimation).
Proof.

First, note that since , where is the maximum singular value. Therefore,

Finally, if we use the autoencoder formulation to reconstruct the original image by setting , we can estimate the reconstruction error. We note that is -sparse by construction and remind the reader that satisfies -model-RIP with constants . Then, using Lemma B.4 as well as the -sparse properties of ,