1 Introduction
Compressive or compressed sensing is the problem of reconstructing an unknown vector $x^* \in \mathbb{R}^n$ after observing $m < n$ linear measurements of its entries, possibly with added noise:
$$y = Ax^* + \eta,$$
where $A \in \mathbb{R}^{m \times n}$ is called the measurement matrix and $\eta \in \mathbb{R}^m$ is noise. Even without noise, this is an underdetermined system of linear equations, so recovery is impossible unless we make an assumption on the structure of the unknown vector $x^*$. We need to assume that the unknown vector is “natural,” or “simple,” in some application-dependent way.
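As a minimal sketch of the measurement model above, assuming NumPy: the helper name `measure`, the sizes, and the $N(0, 1/m)$ scaling of the entries are illustrative choices, not taken from this section. The example also exhibits a second vector with identical measurements, which is why some structural assumption is needed.

```python
import numpy as np

def measure(x_star, m, noise_std=0.0, seed=0):
    """Return (A, y) with y = A @ x_star + eta for a random Gaussian A (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    n = x_star.shape[0]
    # Assumed scaling: entries drawn from N(0, 1/m).
    A = rng.normal(0.0, 1.0 / np.sqrt(m), size=(m, n))
    eta = noise_std * rng.normal(size=m)
    return A, A @ x_star + eta

x_star = np.arange(10, dtype=float)       # unknown vector, n = 10
A, y = measure(x_star, m=4)               # only m = 4 noiseless measurements

# With m < n, A has a nontrivial null space: the last right singular
# vector is mapped to (numerically) zero.
_, _, Vt = np.linalg.svd(A)
null_vec = Vt[-1]
x_other = x_star + null_vec               # a different vector with the same measurements
```

Both `x_star` and `x_other` are consistent with `y`, so the measurements alone cannot distinguish them.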
The most common structural assumption is that the vector $x^*$ is sparse in some known basis (or approximately sparse). Finding the sparsest solution to an underdetermined system of linear equations is NP-hard, but convex optimization can still provably recover the true sparse vector $x^*$ if the matrix $A$ satisfies conditions such as the Restricted Isometry Property (RIP) or the related Restricted Eigenvalue Condition (REC) [35, 7, 14, 6]. The problem is also called high-dimensional sparse linear regression, and there is a vast literature on establishing conditions for different recovery algorithms, different assumptions on the design of $A$, and generalizations of RIP and REC for other structures; see e.g. [6, 33, 1, 30, 3].

This significant interest is justified since a large number of applications can be expressed as recovering an unknown vector from noisy linear measurements. For example, many tomography problems can be expressed in this framework: $x^*$ is the unknown true tomographic image and the linear measurements are obtained by x-ray or another physical sensing system that produces sums or more general linear projections of the unknown pixels. Compressed sensing has been studied extensively for medical applications including computed tomography (CT) [8], rapid MRI [31] and neuronal spike train recovery [21]. Another impressive application is the “single pixel camera” [15], where digital micro-mirrors provide linear combinations to a single pixel sensor that then uses compressed sensing reconstruction algorithms to reconstruct an image. These results have been extended by combining sparsity with additional structural assumptions [4, 22], and by generalizations such as translating sparse vectors into low-rank matrices [33, 3, 17]. These results can improve performance when the structural assumptions fit the sensed signals. Other works perform “dictionary learning,” seeking overcomplete bases where the data is more sparse (see [9] and references therein).

In this paper, instead of relying on sparsity, we use structure from a generative model. Recently, several neural network based generative models such as variational auto-encoders (VAEs) [26] and generative adversarial networks (GANs) [19] have found success at modeling data distributions. In these models, the generative part learns a mapping from a low dimensional representation space $z \in \mathbb{R}^k$ to the high dimensional sample space $G(z) \in \mathbb{R}^n$. While training, this mapping is encouraged to produce vectors that resemble the vectors in the training dataset. We can therefore use any pre-trained generator to approximately capture the notion of a vector being “natural” in our domain: the generator defines a probability distribution over vectors in sample space and tries to assign higher probability to more likely vectors, for the dataset it has been trained on. We expect that vectors “natural” to our domain will be close to some point in the support of this distribution, i.e., in the range of $G$.

Our Contributions: We present an algorithm that uses generative models for compressed sensing. Our algorithm simply uses gradient descent to optimize the representation $z \in \mathbb{R}^k$ such that the corresponding image $G(z)$ has small measurement error $\|AG(z) - y\|_2^2$. While this is a nonconvex objective to optimize, we empirically find that gradient descent works well, and the results can significantly outperform Lasso with relatively few measurements.
We obtain theoretical results showing that, as long as gradient descent finds a good approximate solution to our objective, our output $G(z)$ will be almost as close to the true $x^*$ as the closest possible point in the range of $G$.

The proof is based on a generalization of the Restricted Eigenvalue Condition (REC) that we call the Set-Restricted Eigenvalue Condition (S-REC). Our main theorem is that if a measurement matrix satisfies the S-REC for the range of a given generator $G$, then the measurement error minimization optimum is close to the true $x^*$. Furthermore, we show that random Gaussian measurement matrices satisfy the S-REC condition with high probability for large classes of generators. Specifically, for $d$-layer neural networks such as VAEs and GANs, we show that $O(kd \log n)$ Gaussian measurements suffice to guarantee good reconstruction with high probability. One result, for ReLU-based networks, is the following:
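Theorem 1.1.
Let $G: \mathbb{R}^k \to \mathbb{R}^n$ be a generative model from a $d$-layer neural network using ReLU activations. Let $A \in \mathbb{R}^{m \times n}$ be a random Gaussian matrix for $m = O(kd \log n)$, scaled so $A_{i,j} \sim N(0, 1/m)$. For any $x^* \in \mathbb{R}^n$ and any observation $y = Ax^* + \eta$, let $\hat{z}$ minimize $\|y - AG(z)\|_2$ to within additive $\epsilon$ of the optimum. Then with $1 - e^{-\Omega(m)}$ probability,
$$\|G(\hat{z}) - x^*\|_2 \le 6 \min_{z^* \in \mathbb{R}^k} \|G(z^*) - x^*\|_2 + 3\|\eta\|_2 + 2\epsilon.$$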
Let us examine the terms in our error bound in more detail. The first two are the minimum possible error of any vector in the range of the generator and the norm of the noise; these are necessary for such a technique, and have direct analogs in standard compressed sensing guarantees. The third term $\epsilon$ comes from gradient descent not necessarily converging to the global optimum; empirically, $\epsilon$ does seem to converge to zero, and one can check post-observation that this is small by computing the upper bound $\|y - AG(\hat{z})\|_2$.
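While the above is restricted to ReLU-based neural networks, we also show similar results for arbitrary $L$-Lipschitz generative models, for $m \approx O(k \log L)$. Typical neural networks have $\mathrm{poly}(n)$-bounded weights in each layer, so $L \le n^{O(d)}$, giving for all activation functions the same $O(kd \log n)$ sample complexity as for ReLU networks.

Theorem 1.2.
Let $G: \mathbb{R}^k \to \mathbb{R}^n$ be an $L$-Lipschitz function. Let $A \in \mathbb{R}^{m \times n}$ be a random Gaussian matrix for $m = O(k \log \frac{Lr}{\delta})$, scaled so $A_{i,j} \sim N(0, 1/m)$. For any $x^* \in \mathbb{R}^n$ and any observation $y = Ax^* + \eta$, let $\hat{z}$ minimize $\|y - AG(z)\|_2$ to within additive $\epsilon$ of the optimum over vectors with $\|\hat{z}\|_2 \le r$. Then with $1 - e^{-\Omega(m)}$ probability,
$$\|G(\hat{z}) - x^*\|_2 \le 6 \min_{z^* \in \mathbb{R}^k,\ \|z^*\|_2 \le r} \|G(z^*) - x^*\|_2 + 3\|\eta\|_2 + 2\epsilon + 2\delta.$$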
The downside is two minor technical conditions: we only optimize over representations $z$ with $\|z\|$ bounded by $r$, and our error gains an additive $\delta$ term. Since the dependence on these parameters is $\log(rL/\delta)$, and $L$ is something like $n^{O(d)}$, we may set $r = n^{O(d)}$ and $\delta = 1/n^{O(d)}$ while only losing constant factors, making these conditions very mild. In fact, generative models normally have the coordinates of $z$ be independent uniform or Gaussian, so $\|z\| \approx \sqrt{k} \ll n^{d}$, and a constant signal-to-noise ratio would have $\|\eta\|_2 \approx \|x^*\| \approx \sqrt{n} \gg 1/n^{d}$.
We remark that, while these theorems are stated in terms of Gaussian matrices, the proofs only involve the distributional Johnson-Lindenstrauss property of such matrices. Hence the same results hold for matrices with subgaussian entries or fast-JL matrices [2].
2 Our Algorithm
All norms are 2-norms unless specified otherwise.
Let $x^* \in \mathbb{R}^n$ be the vector we wish to sense. Let $A \in \mathbb{R}^{m \times n}$ be the measurement matrix and $\eta \in \mathbb{R}^m$ be the noise vector. We observe the measurements $y = Ax^* + \eta$. Given $y$ and $A$, our task is to find a reconstruction $\hat{x}$ close to $x^*$.
A generative model is given by a deterministic function $G: \mathbb{R}^k \to \mathbb{R}^n$, and a distribution $P_Z$ over $z \in \mathbb{R}^k$. To generate a sample from the generator, we can draw $z \sim P_Z$ and the sample then is $G(z)$. Typically, we have $k \ll n$, i.e. the generative model maps from a low dimensional representation space to a high dimensional sample space.
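The sampling procedure just described can be sketched as follows, assuming NumPy. The function `make_generator`, its random weights, and the layer sizes are hypothetical stand-ins for a trained generator, used here only to make the roles of $z$, $G$, $k$ and $n$ concrete.

```python
import numpy as np

def make_generator(k, n, hidden=16, seed=0):
    """Build a toy two-layer ReLU generator G: R^k -> R^n with fixed random weights.

    This is an illustrative placeholder, not a trained VAE or GAN.
    """
    rng = np.random.default_rng(seed)
    W1 = rng.normal(size=(hidden, k)) / np.sqrt(k)
    W2 = rng.normal(size=(n, hidden)) / np.sqrt(hidden)

    def G(z):
        return W2 @ np.maximum(W1 @ z, 0.0)

    return G

k, n = 5, 100                      # k << n: low-dimensional representation
G = make_generator(k, n)
rng = np.random.default_rng(1)
z = rng.normal(size=k)             # draw z from a standard normal P_Z
x = G(z)                           # the corresponding sample in R^n
```

Because `G` is deterministic, all randomness in the sample comes from the draw of `z`.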
Our approach is to find a vector in representation space such that the corresponding vector in the sample space matches the observed measurements. We thus define the objective to be
$$\mathrm{loss}(z) = \|AG(z) - y\|^2. \qquad (1)$$
By using any optimization procedure, we can minimize $\mathrm{loss}(z)$ with respect to $z$. In particular, if the generative model $G$ is differentiable, we can evaluate the gradients of the loss with respect to $z$ using backpropagation and use standard gradient based optimizers. If the optimization procedure terminates at $\hat{z}$, our reconstruction for $x^*$ is $G(\hat{z})$. We define the measurement error to be $\|AG(\hat{z}) - y\|^2$ and the reconstruction error to be $\|G(\hat{z}) - x^*\|^2$.

3 Related Work
Several recent lines of work explore generative models for reconstruction. The first line of work attempts to project an image onto the representation space of the generator. These works assume full knowledge of the image, and are special cases of the linear measurements framework where the measurement matrix is the identity. Excellent reconstruction results with SGD in the representation space to find an image in the generator range have been reported by [28] with stochastic clipping and [11] with logistic measurement loss. A different approach is introduced in [16] and [12]. In their method, a recognition network that maps from the sample space vector $x$ to the representation space vector $z$ is learned jointly with the generator in an adversarial setting.
A second line of work explores reconstruction with structured partial observations. The inpainting problem consists of predicting the values of missing pixels given a part of the image. This is a special case of linear measurements where each measurement corresponds to an observed pixel. The use of generative models for this task has been studied in [38], where the objective is taken to be a combination of error in measurements and a perceptual loss term given by the discriminator. Super-resolution is a related task that attempts to increase the resolution of an image. We can view this problem as observing local spatial averages of the unknown higher resolution image and hence cast this as another special case of linear measurements. For prior work on super-resolution see e.g. [37, 13, 23] and references therein.

We also take note of the related work of [18] that connects model-based compressed sensing with the invertibility of Convolutional Neural Networks.
A related result appears in [5], which studies the measurement complexity of an RIP condition for smooth manifolds. This is analogous to our S-REC for the range of $G$, but the range of $G$ is neither smooth (because of ReLUs) nor a manifold (because of self-intersection). Their recovery result was extended in [20] to unions of two manifolds.
4 Theoretical Results
We begin with a brief review of the Restricted Eigenvalue Condition (REC) in standard compressed sensing. The REC is a sufficient condition on $A$ for robust recovery to be possible. The REC essentially requires that all “approximately sparse” vectors are far from the nullspace of the matrix $A$. More specifically, $A$ satisfies REC for a constant $\gamma > 0$ if for all approximately sparse vectors $x$,
$$\|Ax\| \ge \gamma \|x\|. \qquad (2)$$
It can be shown that this condition is sufficient for recovery of sparse vectors using Lasso. If one examines the structure of Lasso recovery proofs, a key property that is used is that the difference of any two $k$-sparse vectors is also approximately sparse (for sparsity up to $2k$). This is a coincidence that is particular to sparsity. By contrast, the difference of two vectors “natural” to our domain may not itself be natural. The condition we need is that the difference of any two natural vectors is far from the nullspace of $A$.
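We propose a generalized version of the REC for a set $S \subseteq \mathbb{R}^n$ of vectors, the Set-Restricted Eigenvalue Condition (S-REC):
Definition 1.
Let $S \subseteq \mathbb{R}^n$. For some parameters $\gamma > 0$, $\delta \ge 0$, a matrix $A \in \mathbb{R}^{m \times n}$ is said to satisfy the S-REC$(S, \gamma, \delta)$ if $\forall\, x_1, x_2 \in S$,
$$\|A(x_1 - x_2)\| \ge \gamma \|x_1 - x_2\| - \delta.$$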
There are two main differences between the S-REC and the standard REC in compressed sensing. First, the condition applies to differences of vectors in an arbitrary set $S$ of “natural” vectors, rather than just the set of approximately sparse vectors in some basis. This will let us apply the definition to $S$ being the range of a generative model.
Second, we allow an additive slack term $\delta$. This is necessary for us to achieve the S-REC when $S$ is the output of general Lipschitz functions. Without it, the S-REC depends on the behavior of $S$ at arbitrarily small scales. Since there are arbitrarily many such local regions, one cannot guarantee the existence of an $A$ that works for all these local regions. Fortunately, as we shall see, poor behavior at a small scale will only increase our error by $O(\delta)$.
The S-REC definition requires that for any two vectors in $S$, if they are significantly different (so the right hand side is large), then the corresponding measurements should also be significantly different (left hand side). Hence we can hope to approximate the unknown vector from the measurements, if the measurement matrix satisfies the S-REC.
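To make the condition concrete, the sketch below (assuming NumPy) computes the smallest ratio $\|A(x_1 - x_2)\| / \|x_1 - x_2\|$ over pairs of sampled points. The function name `empirical_gamma`, the one-layer ReLU set, and all sizes are hypothetical. Because only finitely many pairs are checked, the result is an optimistic estimate of $\gamma$ with $\delta = 0$, not a certificate that the S-REC holds.

```python
import numpy as np

def empirical_gamma(A, points):
    """Smallest ratio ||A(x1 - x2)|| / ||x1 - x2|| over all pairs of sampled points.

    This only upper-bounds the true constant (with delta = 0), because it
    checks finitely many pairs rather than every pair in the set S.
    """
    worst = np.inf
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            diff = points[i] - points[j]
            norm = np.linalg.norm(diff)
            if norm > 0:
                worst = min(worst, np.linalg.norm(A @ diff) / norm)
    return worst

rng = np.random.default_rng(0)
k, n, m = 3, 200, 60
W = rng.normal(size=(n, k))
# Sampled points from the range of a toy one-layer ReLU map z -> relu(W z).
points = [np.maximum(W @ rng.normal(size=k), 0.0) for _ in range(30)]
A = rng.normal(0.0, 1.0 / np.sqrt(m), size=(m, n))
gamma_hat = empirical_gamma(A, points)
```

A value of `gamma_hat` well above zero indicates that, on the sampled pairs, distinct points in the set produce clearly distinct measurements.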
But how can we find such a matrix? To answer this, we present two lemmas showing that random Gaussian matrices of relatively few measurements satisfy the S-REC for the outputs of large and practically useful classes of generative models $G: \mathbb{R}^k \to \mathbb{R}^n$.
In the first lemma, we assume that the generative model $G(\cdot)$ is $L$-Lipschitz, i.e., $\forall\, z_1, z_2 \in \mathbb{R}^k$, we have
$$\|G(z_1) - G(z_2)\| \le L \|z_1 - z_2\|.$$
Note that state of the art neural network architectures with linear layers, (transposed) convolutions, max-pooling, residual connections, and all popular non-linearities satisfy this assumption. In Lemma 8.5 in the Appendix we give a simple bound on $L$ in terms of parameters of the network; for typical networks this is $n^{O(d)}$. We also require the input $z$ to the generator to have bounded norm. Since generative models such as VAEs and GANs typically assume their input $z$ is drawn with independent uniform or Gaussian inputs, this only prunes an exponentially unlikely fraction of the possible outputs.

Lemma 4.1.
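Let $G: \mathbb{R}^k \to \mathbb{R}^n$ be $L$-Lipschitz. Let
$$B^k(r) = \{z \mid z \in \mathbb{R}^k, \|z\| \le r\}$$
be an $L_2$-norm ball in $\mathbb{R}^k$. For $\alpha < 1$, if
$$m = \Omega\left(\frac{k}{\alpha^2} \log \frac{Lr}{\delta}\right),$$
then a random matrix $A \in \mathbb{R}^{m \times n}$ with IID entries such that $A_{ij} \sim N(0, 1/m)$ satisfies the S-REC$(G(B^k(r)), 1 - \alpha, \delta)$ with $1 - e^{-\Omega(\alpha^2 m)}$ probability.

All proofs, including this one, are deferred to Appendix A.
Note that even though we proved the lemma for an $L_2$ ball, the same technique works for any compact set.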
For our second lemma, we assume that the generative model is a neural network with $d$ layers such that each layer is a composition of a linear transformation followed by a pointwise non-linearity. Many common generative models have such architectures. We also assume that all non-linearities are piecewise linear with at most two pieces. The popular ReLU or LeakyReLU non-linearities satisfy this assumption. We do not make any other assumption, and in particular, the magnitude of the weights in the network does not affect our guarantee.
Lemma 4.2.
Let $G: \mathbb{R}^k \to \mathbb{R}^n$ be a $d$-layer neural network, where each layer is a linear transformation followed by a pointwise non-linearity. Suppose there are at most $c$ nodes per layer, and the non-linearities are piecewise linear with at most two pieces, and let
$$m = \Omega\left(\frac{1}{\alpha^2} kd \log c\right)$$
for some $\alpha < 1$. Then a random matrix $A \in \mathbb{R}^{m \times n}$ with IID entries $A_{ij} \sim N(0, 1/m)$ satisfies the S-REC$(G(\mathbb{R}^k), 1 - \alpha, 0)$ with $1 - e^{-\Omega(\alpha^2 m)}$ probability.
To show Theorems 1.1 and 1.2, we just need to show that the S-REC implies good recovery. In order to make our error guarantee relative to $L_2$ error in the image space $\mathbb{R}^n$, rather than in the measurement space $\mathbb{R}^m$, we also need that $A$ preserves norms with high probability [10]. Fortunately, Gaussian matrices (or other distributional JL matrices) satisfy this property.
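The norm-preservation property can be checked numerically. The sketch below, assuming NumPy and illustrative sizes, fixes one vector `x` and draws many independent Gaussian matrices with $N(0, 1/m)$ entries (an assumed scaling), recording the ratio $\|Ax\| / \|x\|$ for each draw.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, trials = 500, 200, 100
x = rng.normal(size=n)             # one fixed vector

ratios = []
for _ in range(trials):
    # Fresh Gaussian matrix with N(0, 1/m) entries, so E||Ax||^2 = ||x||^2.
    A = rng.normal(0.0, 1.0 / np.sqrt(m), size=(m, n))
    ratios.append(np.linalg.norm(A @ x) / np.linalg.norm(x))
ratios = np.array(ratios)
# The ratios concentrate around 1, with spread shrinking as m grows.
```

In this setting every ratio stays far below 2, which is the kind of per-vector bound the recovery argument relies on.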
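Lemma 4.3.
Let $A \in \mathbb{R}^{m \times n}$ be drawn from a distribution that (1) satisfies the S-REC$(S, \gamma, \delta)$ with probability $1 - p$ and (2) has $\|Ax\| \le 2\|x\|$ for every fixed $x \in \mathbb{R}^n$ with probability $1 - p$.
For any $x^* \in \mathbb{R}^n$ and noise $\eta$, let $y = Ax^* + \eta$. Let $\hat{x}$ approximately minimize $\|y - Ax\|$ over $x \in S$, i.e.,
$$\|y - A\hat{x}\| \le \min_{x \in S} \|y - Ax\| + \epsilon.$$
Then,
$$\|\hat{x} - x^*\| \le \left(\frac{4}{\gamma} + 1\right) \min_{x \in S} \|x^* - x\| + \frac{1}{\gamma}\left(2\|\eta\| + \epsilon + \delta\right)$$
with probability $1 - 2p$.
Combining Lemma 4.1, Lemma 4.2, and Lemma 4.3 gives Theorems 1.1 and 1.2. In our setting, $S = G(\mathbb{R}^k)$ is the range of the generator, and $\hat{x}$ in the lemma above is the reconstruction $G(\hat{z})$ returned by our algorithm.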
Figure 1: We compare the performance of our algorithm with baselines. The plots show per-pixel reconstruction error as we vary the number of measurements. The vertical bars indicate 95% confidence intervals.
5 Models
In this section we describe the generative models used in our experiments. We used two image datasets and two different generative model types (a VAE and a GAN). This provides some evidence that our approach can work with many types of models and datasets.
In our experiments, we found that it was helpful to add a regularization term $L(z)$ to the objective to encourage the optimization to explore more in the regions that are preferred by the respective generative models (see comparison to unregularized versions in Fig. 1). Thus the objective function we use for minimization is
$$\|AG(z) - y\|^2 + L(z).$$
Both VAEs and GANs typically impose an isotropic Gaussian prior on $z$. Thus $\|z\|^2$ is proportional to the negative log-likelihood under this prior. Accordingly, we use the following regularizer:
$$L(z) = \lambda \|z\|^2, \qquad (3)$$
where $\lambda$ measures the relative importance of the prior as compared to the measurement error.
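A minimal sketch of minimizing this regularized objective over $z$, assuming NumPy, is given below. The toy `tanh` generator with random weights, the function names, and the value of `lam` are all hypothetical. The optimizer is plain gradient descent with step halving, a simple substitute chosen so the example is self-contained; the experiments in this paper use Adam instead.

```python
import numpy as np

def make_generator(k, n, hidden=16, seed=0):
    """Toy generator G(z) = W2 tanh(W1 z); a stand-in for a trained VAE or GAN."""
    rng = np.random.default_rng(seed)
    W1 = rng.normal(size=(hidden, k)) / np.sqrt(k)
    W2 = rng.normal(size=(n, hidden)) / np.sqrt(hidden)
    return W1, W2

def G(params, z):
    W1, W2 = params
    return W2 @ np.tanh(W1 @ z)

def loss_and_grad(params, A, y, z, lam):
    """Value and gradient in z of ||A G(z) - y||^2 + lam * ||z||^2."""
    W1, W2 = params
    h = np.tanh(W1 @ z)
    residual = A @ (W2 @ h) - y
    loss = residual @ residual + lam * (z @ z)
    grad_h = W2.T @ (2.0 * A.T @ residual)       # backprop through W2
    grad_z = W1.T @ ((1.0 - h ** 2) * grad_h)    # backprop through tanh and W1
    return loss, grad_z + 2.0 * lam * z

def reconstruct(params, A, y, k, lam=0.01, steps=300, seed=0):
    """Gradient descent on z with step halving so the loss never increases."""
    z = np.random.default_rng(seed).normal(size=k)
    loss, grad = loss_and_grad(params, A, y, z, lam)
    history = [loss]
    step = 1.0
    for _ in range(steps):
        while step > 1e-12:
            z_new = z - step * grad
            new_loss, new_grad = loss_and_grad(params, A, y, z_new, lam)
            if new_loss <= loss:
                break
            step *= 0.5
        else:
            break                                  # no descent step found; stop
        z, loss, grad = z_new, new_loss, new_grad
        history.append(loss)
        step *= 2.0                                # let the step grow again
    return z, history

k, n, m = 4, 50, 20
params = make_generator(k, n)
rng = np.random.default_rng(1)
x_star = G(params, rng.normal(size=k))            # signal in the generator's range
A = rng.normal(0.0, 1.0 / np.sqrt(m), size=(m, n))
y = A @ x_star
z_hat, history = reconstruct(params, A, y, k)
x_hat = G(params, z_hat)                          # reconstruction G(z_hat)
```

Since the objective is nonconvex, a single run may stop at a local optimum; `history` records the objective value after each accepted step.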
5.1 MNIST with VAE
The MNIST dataset consists of about 60,000 images of handwritten digits, where each image is of size 28 × 28 [27]. Each pixel value is either 0 (background) or 1 (foreground). No pre-processing was performed. We trained a VAE on this dataset. The input to the VAE is a vectorized binary image of input dimension 784. We set the size of the representation space . The recognition network is a fully connected network. The generator is also fully connected with the architecture . We train the VAE using the Adam optimizer [25] with a mini-batch size and a learning rate of .
We found that using in Eqn. (3) gave the best performance, and we use this value in our experiments.
The digit images are reasonably sparse in the pixel space. Thus, as a baseline, we use the pixel values directly for sparse recovery using Lasso. We set shrinkage parameter to be for all the experiments.
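For reference, the Lasso baseline on pixel values can be sketched as below, assuming NumPy. The solver (iterative soft thresholding, ISTA), the scaling of the objective, the value of `lam`, and the synthetic sparse vector are all assumptions made for illustration; the text does not say which Lasso solver was used.

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of t * ||.||_1: shrink each entry toward zero by t."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def lasso_ista(A, y, lam, steps=500):
    """Minimize 0.5 * ||A x - y||^2 + lam * ||x||_1 by iterative soft thresholding."""
    L = np.linalg.norm(A, 2) ** 2        # Lipschitz constant of the smooth part
    x = np.zeros(A.shape[1])
    for _ in range(steps):
        grad = A.T @ (A @ x - y)
        x = soft_threshold(x - grad / L, lam / L)
    return x

def objective(A, y, x, lam):
    r = A @ x - y
    return 0.5 * (r @ r) + lam * np.abs(x).sum()

rng = np.random.default_rng(0)
n, m, s = 100, 40, 3
x_star = np.zeros(n)
x_star[rng.choice(n, size=s, replace=False)] = 1.0   # sparse "pixel" vector
A = rng.normal(0.0, 1.0 / np.sqrt(m), size=(m, n))
y = A @ x_star
x_hat = lasso_ista(A, y, lam=0.01)
```

The `lam` argument plays the role of the shrinkage parameter: larger values push more entries of `x_hat` to exactly zero.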
5.2 CelebA with DCGAN
CelebA is a dataset of more than 200,000 face images of celebrities [29]. The input images were cropped to a RGB image, giving inputs per image. Each pixel value was scaled so that all values are between . We trained a DCGAN [34, 24] on this dataset (code reused from https://github.com/carpedm20/DCGANtensorflow). We set the input dimension and use a standard normal distribution. The architecture follows that of [34]. The model was trained by one update to the discriminator and two updates to the generator per cycle. Each update used the Adam optimizer [25] with minibatch size , learning rate and .

We found that using in Eqn. (3) gave the best results and thus, we use this value in our experiments.
For baselines, we perform sparse recovery using Lasso on the images in two domains: (a) 2D Discrete Cosine Transform (2D-DCT) and (b) 2D Daubechies-1 Wavelet Transform (2D-DB1). While we provide Gaussian measurements of the original pixel values, the $\ell_1$ penalty is on either the DCT coefficients or the DB1 coefficients of each color channel of an image. For all experiments, we set the shrinkage parameter to be and respectively for 2D-DCT and 2D-DB1.
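The setup of measuring pixels while penalizing transform coefficients can be illustrated in one dimension, assuming NumPy. The helper `dct_matrix` and the sizes are hypothetical, and a 1D orthonormal DCT-II is used for brevity in place of the 2D transforms applied per color channel.

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II matrix D, so c = D @ x and x = D.T @ c."""
    i = np.arange(n)
    D = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * i[None, :] + 1) * i[:, None] / (2 * n))
    D[0, :] /= np.sqrt(2.0)
    return D

n, m = 64, 16
D = dct_matrix(n)
rng = np.random.default_rng(0)
A = rng.normal(0.0, 1.0 / np.sqrt(m), size=(m, n))

# A smooth signal made of a single DCT atom: dense in pixels, sparse in DCT.
x = np.cos(np.pi * (2 * np.arange(n) + 1) * 3 / (2 * n))
c = D @ x                                  # its DCT coefficients

# Measurements are taken on pixels (y = A @ x), but since x = D.T @ c,
# a sparse solver can work on c with the effective matrix A @ D.T.
A_eff = A @ D.T
y = A @ x
```

A Lasso solver would then be applied to `(A_eff, y)` to estimate `c`, and the image estimate recovered as `D.T @ c`.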
6 Experiments and Results
6.1 Reconstruction from Gaussian measurements
We take $A$ to be a random matrix with IID Gaussian entries with zero mean and standard deviation of . Each entry of the noise vector $\eta$ is also an IID Gaussian random variable. We compare performance of different sensing algorithms qualitatively and quantitatively. For quantitative comparison, we use the reconstruction error = $\|\hat{x} - x^*\|^2$, where $\hat{x}$ is an estimate of $x^*$ returned by the algorithm. In all cases, we report the results on a held out test set, unseen by the generative model at training time.

6.1.1 MNIST
The standard deviation of the noise vector is set such that . We use Adam optimizer [25], with a learning rate of . We do random restarts with steps per restart and pick the reconstruction with best measurement error.
In Fig. 1(a), we show the reconstruction error as we change the number of measurements both for Lasso and our algorithm. We observe that our algorithm is able to get low errors with far fewer measurements. For example, our algorithm’s performance with measurements matches Lasso’s performance with measurements. Fig. 2(a) shows sample reconstructions by Lasso and our algorithm.
However, our algorithm is limited since its output is constrained to be in the range of the generator. After measurements, our algorithm’s performance saturates, and additional measurements give no additional performance. Since Lasso has no such limitation, it eventually surpasses our algorithm, but this takes more than measurements of the 784-dimensional vector. We expect that a more powerful generative model with representation dimension can make better use of additional measurements.
6.1.2 celebA
The standard deviation of entries in the noise vector is set such that . We optimize using the Adam optimizer [25], with a learning rate of . We do random restarts with update steps per restart and pick the reconstruction with best measurement error.
In Fig. 1(b), we show the reconstruction error as we change the number of measurements both for Lasso and our algorithm. In Fig. 3 we show sample reconstructions by Lasso and our algorithm. We observe that our algorithm is able to produce reasonable reconstructions with as few as measurements, while the output of the baseline algorithms is quite blurry. Similar to the results on MNIST, if we continue to give more measurements, our algorithm saturates, and for more than measurements, Lasso gets a better reconstruction. We again expect that a more powerful generative model with would perform better in the high-measurement regime.
6.2 Super-resolution
Super-resolution is the task of constructing a high resolution image from a low resolution version of the same image. This problem can be thought of as a special case of our general framework of linear measurements, where the measurements correspond to local spatial averages of the pixel values. Thus, we try to use our recovery algorithm to perform this task with a measurement matrix $A$ tailored to give only the relevant observations. We note that this measurement matrix may not satisfy the S-REC condition (with good constants $\gamma$ and $\delta$), and consequently, our theorems may not be applicable.
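The following sketch, assuming NumPy, shows one way to write local spatial averaging as an explicit measurement matrix. The helper `pooling_matrix`, the 4 × 4 image, and the pooling factor 2 are hypothetical examples, not the sizes used in the experiments.

```python
import numpy as np

def pooling_matrix(height, width, f):
    """Matrix A with A @ image.ravel() equal to f-by-f average pooling with stride f."""
    assert height % f == 0 and width % f == 0
    out_h, out_w = height // f, width // f
    A = np.zeros((out_h * out_w, height * width))
    for r in range(out_h):
        for c in range(out_w):
            row = r * out_w + c                     # index of the low-res pixel
            for dr in range(f):
                for dc in range(f):
                    col = (r * f + dr) * width + (c * f + dc)
                    A[row, col] = 1.0 / (f * f)     # equal weight within the block
    return A

image = np.arange(16, dtype=float).reshape(4, 4)   # toy high-resolution image
A = pooling_matrix(4, 4, 2)
low_res = (A @ image.ravel()).reshape(2, 2)        # the observed measurements
```

Each row of `A` has only `f * f` nonzero entries, so this matrix is far more structured than a dense Gaussian one.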
6.2.1 MNIST
We construct a low resolution image by spatial pooling with a stride of to produce a image. These measurements are used to reconstruct the original image. Fig. 2(b) shows reconstructions produced by our algorithm on images from a held out test set. We observe sharp reconstructions which closely match the fine structure in the ground truth.

6.2.2 celebA
We construct a low resolution image by spatial pooling with a stride of to produce a image. These measurements are used to reconstruct the original image. In Fig. 4 we show results on images from a held out test set. We see that our algorithm is able to fill in the details to match the original image.
6.3 Understanding sources of error
Although better than baselines, our reconstructions still admit some error. There are three sources of this error: (a) representation error: the image being sensed is far from the range of the generator; (b) measurement error: the finite set of random measurements does not contain all the information about the unknown image; (c) optimization error: the optimization procedure did not find the best $z$.
In this section we present some experiments that suggest that the representation error is the dominant term. In our first experiment, we ensure that the representation error is zero, and try to minimize the sum of other two errors. In the second experiment, we ensure that the measurement error is zero, and try to minimize the sum of other two.
6.3.1 Sensing images from the range of the generator
Our first approach is to sense an image that is in the range of the generator. More concretely, we sample a $z^*$ from $P_Z$. Then we pass it through the generator to get $x^* = G(z^*)$. Now, we pretend that this is a real image and try to sense that. This method eliminates the representation error and allows us to check if our gradient based optimization procedure is able to find $z^*$ by minimizing the objective.
In Fig. 6(a) and Fig. 6(b), we show the reconstruction error for images in the range of the generators trained on MNIST and celebA datasets respectively. We see that we get almost perfect reconstruction with very few measurements. This suggests that the objective is being properly minimized and we indeed get $\hat{z}$ close to $z^*$, i.e., the sum of optimization error and the measurement error is not very large, in the absence of the representation error.
6.3.2 Quantifying representation error
We saw that in the absence of the representation error, the overall error is small. However, from Fig. 1, we know that the overall error is still non-zero. So, in this experiment, we seek to quantify the representation error, i.e., how far are the real images from the range of the generator?
From the previous experiment, we know that the $\hat{z}$ recovered by our algorithm is close to $z^*$, the best possible value, if the image being sensed is in the range of the generator. Based on this, we make an assumption that this property is also true for real images. With this assumption, we get an estimate of the representation error as follows: We sample real images from the test set. Then we use the full image in our algorithm, i.e., our measurement matrix $A$ is the identity. This eliminates the measurement error. Using these measurements, we get the reconstructed image $G(\hat{z})$ through our algorithm. The estimated representation error is then $\|G(\hat{z}) - x^*\|^2$. We repeat this procedure several times over randomly sampled images from our dataset and report average representation error values. The task of finding the closest image in the range of the generator has been studied in prior work [11, 16, 12].
On the MNIST dataset, we get average per pixel representation error of . The recovered images are shown in Fig. 7. In contrast with only Gaussian measurements, we are able to get a per pixel reconstruction error of about .
On the celebA dataset, we get average per pixel representation error of . The recovered images are shown in Fig. 5. On the other hand, with only Gaussian measurements, we get a per pixel reconstruction error of about .
These experiments suggest that the representation error is the major component of the total error. Thus, a more flexible generative model can help to decrease the overall error on both datasets.
7 Conclusion
We demonstrate how to perform compressed sensing using generative models from neural nets. These models can represent data distributions more concisely than standard sparsity models, while their differentiability allows for fast signal reconstruction. This will allow compressed sensing applications to make significantly fewer measurements.
Our theorems and experiments both suggest that, after relatively few measurements, the signal reconstruction gets close to the optimal within the range of the generator. To reach the full potential of this technique, one should use larger generative models as the number of measurements increases. Whether this can be expressed more concisely than by training multiple independent generative models of different sizes is an open question.
Generative models are an active area of research with ongoing rapid improvements. Because our framework applies to general generative models, this improvement will immediately yield better reconstructions with fewer measurements. We also believe that one could use the performance of generative models for our task as one benchmark for the quality of different models.
Acknowledgements
We would like to thank Philipp Krähenbühl for helpful discussions.
References

[1]
Alekh Agarwal, Sahand Negahban, and Martin J Wainwright.
Fast global convergence rates of gradient methods for highdimensional statistical recovery.
In Advances in Neural Information Processing Systems, pages 37–45, 2010.  [2] Nir Ailon and Bernard Chazelle. The fast johnson–lindenstrauss transform and approximate nearest neighbors. SIAM Journal on Computing, 39(1):302–322, 2009.

[3]
Francis Bach, Rodolphe Jenatton, Julien Mairal, Guillaume Obozinski, et al.
Optimization with sparsityinducing penalties.
Foundations and Trends® in Machine Learning
, 4(1):1–106, 2012.  [4] Richard G Baraniuk, Volkan Cevher, Marco F Duarte, and Chinmay Hegde. Modelbased compressive sensing. IEEE Transactions on Information Theory, 56(4):1982–2001, 2010.
 [5] Richard G Baraniuk and Michael B Wakin. Random projections of smooth manifolds. Foundations of computational mathematics, 9(1):51–77, 2009.
 [6] Peter J Bickel, Ya’acov Ritov, and Alexandre B Tsybakov. Simultaneous analysis of lasso and dantzig selector. The Annals of Statistics, pages 1705–1732, 2009.
 [7] Emmanuel J Candes, Justin K Romberg, and Terence Tao. Stable signal recovery from incomplete and inaccurate measurements. Communications on pure and applied mathematics, 59(8):1207–1223, 2006.
 [8] GuangHong Chen, Jie Tang, and Shuai Leng. Prior image constrained compressed sensing (piccs): a method to accurately reconstruct dynamic ct images from highly undersampled projection data sets. Medical physics, 35(2):660–663, 2008.
 [9] Guangliang Chen and Deanna Needell. Compressed sensing and dictionary learning. Proceedings of Symposia in Applied Mathematics, 73, 2016.
 [10] A. Cohen, W. Dahmen, and R. DeVore. Compressed sensing and best kterm approximation. J. Amer. Math. Soc, 22(1):211–231, 2009.
 [11] Antonia Creswell and Anil Anthony Bharath. Inverting the generator of a generative adversarial network. arXiv preprint arXiv:1611.05644, 2016.
 [12] Jeff Donahue, Philipp Krähenbühl, and Trevor Darrell. Adversarial feature learning. arXiv preprint arXiv:1605.09782, 2016.
 [13] Chao Dong, Chen Change Loy, Kaiming He, and Xiaoou Tang. Image superresolution using deep convolutional networks. IEEE transactions on pattern analysis and machine intelligence, 38(2):295–307, 2016.
 [14] David L Donoho. Compressed sensing. IEEE Transactions on information theory, 52(4):1289–1306, 2006.
 [15] Marco F Duarte, Mark A Davenport, Dharmpal Takbar, Jason N Laska, Ting Sun, Kevin F Kelly, and Richard G Baraniuk. Singlepixel imaging via compressive sampling. IEEE signal processing magazine, 25(2):83–91, 2008.
 [16] Vincent Dumoulin, Ishmael Belghazi, Ben Poole, Alex Lamb, Martin Arjovsky, Olivier Mastropietro, and Aaron Courville. Adversarially learned inference. arXiv preprint arXiv:1606.00704, 2016.
 [17] Rina Foygel and Lester Mackey. Corrupted sensing: Novel guarantees for separating structured signals. IEEE Transactions on Information Theory, 60(2):1223–1247, 2014.
 [18] Anna C. Gilbert, Yi Zhang, Kibok Lee, Yuting Zhang, and Honglak Lee. Towards understanding the invertibility of convolutional neural networks. 2017.
 [19] Ian Goodfellow, Jean PougetAbadie, Mehdi Mirza, Bing Xu, David WardeFarley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in neural information processing systems, pages 2672–2680, 2014.
 [20] Chinmay Hegde and Richard G Baraniuk. Signal recovery on incoherent manifolds. IEEE Transactions on Information Theory, 58(12):7204–7214, 2012.
 [21] Chinmay Hegde, Marco F Duarte, and Volkan Cevher. Compressive sensing recovery of spike trains using a structured sparsity model. In SPARS’09Signal Processing with Adaptive Sparse Structured Representations, 2009.
 [22] Chinmay Hegde, Piotr Indyk, and Ludwig Schmidt. A nearlylinear time framework for graphstructured sparsity. In Proceedings of the 32nd International Conference on Machine Learning (ICML15), pages 928–937, 2015.

[23]
Jiwon Kim, Jung Kwon Lee, and Kyoung Mu Lee.
Accurate image superresolution using very deep convolutional
networks.
In
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition
, pages 1646–1654, 2016. 
[24]
Taehoon Kim.
A tensorflow implementation of “deep convolutional generative adversarial networks”.
https://github.com/carpedm20/DCGANtensorflow, 2017.  [25] Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
 [26] Diederik P Kingma and Max Welling. Autoencoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.
 [27] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradientbased learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
 [28] Zachary C Lipton and Subarna Tripathi. Precise recovery of latent vectors from generative adversarial networks. arXiv preprint arXiv:1702.04782, 2017.
 [29] Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep learning face attributes in the wild. In Proceedings of the IEEE International Conference on Computer Vision, pages 3730–3738, 2015.
 [30] Po-Ling Loh and Martin J Wainwright. High-dimensional regression with noisy and missing data: Provable guarantees with nonconvexity. In Advances in Neural Information Processing Systems, pages 2726–2734, 2011.
 [31] Michael Lustig, David Donoho, and John M Pauly. Sparse MRI: The application of compressed sensing for rapid MR imaging. Magnetic Resonance in Medicine, 58(6):1182–1195, 2007.
 [32] Jiří Matoušek. Lectures on discrete geometry, volume 212. Springer Science & Business Media, 2002.
 [33] Sahand Negahban, Bin Yu, Martin J Wainwright, and Pradeep K Ravikumar. A unified framework for high-dimensional analysis of M-estimators with decomposable regularizers. In Advances in Neural Information Processing Systems, pages 1348–1356, 2009.
 [34] Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015.
 [35] Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B (Methodological), pages 267–288, 1996.
 [36] Roman Vershynin. Introduction to the non-asymptotic analysis of random matrices. arXiv preprint arXiv:1011.3027, 2010.
 [37] Jianchao Yang, John Wright, Thomas S Huang, and Yi Ma. Image super-resolution via sparse representation. IEEE Transactions on Image Processing, 19(11):2861–2873, 2010.
 [38] Raymond Yeh, Chen Chen, Teck Yian Lim, Mark Hasegawa-Johnson, and Minh N Do. Semantic image inpainting with perceptual and contextual losses. arXiv preprint arXiv:1607.07539, 2016.
8 Appendix A
Lemma 8.1.
Given , , , and , if matrix satisfies the , then for any two , such that and , we have
Proof.
∎
8.1 Proof of Lemma 4.1
Definition 2.
A random variable is said to be if , we have
Lemma 8.2.
Let be an Lipschitz function. Let be the ball in with radius , , and be a net on such that . Let be a random matrix with IID Gaussian entries with zero mean and variance . If then for any , if , we have with probability .
Note that for any given point in , if we find its nearest neighbor in an net on , then the difference between the two is at most the . In words, this lemma says that even if we consider measurements made on these points, i.e. a linear projection using a random matrix , then as long as there are enough measurements, the difference between the measurements is of the same order . If the point were in the net, this would follow directly from the Johnson-Lindenstrauss Lemma. But to argue that this is true for all in , which can be an uncountably large set, we construct a chain of nets on . We now present the formal proof.
Proof.
Observe that is . Thus, for any ,
is sufficient to ensure that
Now, let be a chain of epsilon nets of such that is a net and , with . We know that there exist nets such that
Let . Then due to Lipschitzness of , ’s form a chain of epsilon nets such that is a net of , with .
For , let
Thus,
Now assume ,
and
By choice of and , we have ,
Thus by union bound, we have
Now,
Observe that for any , we can write
where and .
Since each , with probability at least , we have
Now, , and due to properties of epsilon-nets. We know that with probability at least (Corollary 5.35 [36]). By setting , we get that, with probability .
Combining these two results, and noting that it is possible to choose , we get that with probability ,
∎
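The concentration behind Lemma 8.2 can be probed numerically for a single fixed pair of points. The sketch below is illustrative only: the helper names (`gaussian_matrix`, `norm_ratio`) are hypothetical, the entry variance is assumed to be 1/m (the value is not legible in the text above), and it tests one difference vector rather than the whole set handled by the chaining argument.

```python
import math
import random


def gaussian_matrix(m, n, rng):
    """Return an m x n matrix with IID N(0, 1/m) entries.

    The variance 1/m is an assumption made for this sketch.
    """
    scale = 1.0 / math.sqrt(m)
    return [[rng.gauss(0.0, scale) for _ in range(n)] for _ in range(m)]


def matvec(A, x):
    """Plain matrix-vector product."""
    return [sum(a * b for a, b in zip(row, x)) for row in A]


def norm(x):
    """Euclidean norm."""
    return math.sqrt(sum(v * v for v in x))


def norm_ratio(m, n, seed=0):
    """Return ||A (x1 - x2)|| / ||x1 - x2|| for one random pair x1, x2."""
    rng = random.Random(seed)
    A = gaussian_matrix(m, n, rng)
    x1 = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    x2 = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    diff = [a - b for a, b in zip(x1, x2)]
    return norm(matvec(A, diff)) / norm(diff)


if __name__ == "__main__":
    # More measurements should pull the ratio closer to 1.
    for m in (10, 100, 400):
        print(m, round(norm_ratio(m, n=50, seed=1), 3))
```

For a fixed pair the ratio should approach 1 as m grows. What the lemma adds is the extension from one fixed pair to every point of the set simultaneously.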
Lemma.
Let be Lipschitz. Let
be an norm ball in . For , if
then a random matrix with IID entries such that satisfies the with probability.
Proof.
We construct a net, , on . There exists a net such that
Since is a cover of , due to the Lipschitz property of , we get that is a cover of .
Let denote the pairwise differences between the elements in , i.e.,
Then,
For any , , such that are close to and respectively. Thus, by triangle inequality,
Again by triangle inequality,
Now, by Lemma 8.2, with probability , , and . Thus,
By the Johnson-Lindenstrauss Lemma, for a fixed , . Therefore, we can take a union bound over all vectors in to get
Since , and , , we have
Combining the three results above we get that with probability ,
Thus, satisfies with probability .
∎
8.2 Proof of Lemma 4.2
Lemma 8.3.
Consider $c$ different $(k-1)$-dimensional hyperplanes in $\mathbb{R}^k$. Consider the $k$-dimensional faces (hereafter called faces) generated by the hyperplanes, i.e. the elements in the partition of $\mathbb{R}^k$ such that relative to each hyperplane, all points inside a partition are on the same side. Then, the number of faces is $O(c^k)$.
Proof.
Proof is by induction, and follows [32].
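Let $f(c, k)$ denote the number of faces generated in $\mathbb{R}^k$ by $c$ different $(k-1)$-dimensional hyperplanes. As a base case, let $k = 1$. Then $(k-1)$-dimensional hyperplanes are just points on a line. $c$ points partition $\mathbb{R}$ into $c + 1$ pieces. This gives $f(c, 1) = O(c)$.
Now, assuming that $f(c, k-1) = O(c^{k-1})$ is true, we need to show $f(c, k) = O(c^k)$. Assume we have $c - 1$ different hyperplanes $H = \{h_1, \dots, h_{c-1}\}$ in $\mathbb{R}^k$, and a new hyperplane $h_c$ is added. $h_c$ intersects $H$ at $c - 1$ different $(k-2)$-dimensional faces given by $F = \{f_j \mid f_j = h_j \cap h_c\}$. The $c - 1$ faces in $F$ partition $h_c$ into $f(c-1, k-1)$ different faces. Additionally, each face in $h_c$ divides an existing face into two. Hence the number of new faces introduced by the addition of $h_c$ is $f(c-1, k-1)$. This gives the recursion $f(c, k) = f(c-1, k) + f(c-1, k-1) = O(c^k)$.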
∎
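The induction in this proof can be evaluated directly. The following is a minimal sketch, assuming hyperplanes in general position so that the recursion holds with equality. The names `face_count`, `c` (number of hyperplanes), and `k` (ambient dimension) are notation chosen for this example.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def face_count(c, k):
    """Number of full-dimensional faces cut out of k-dimensional space
    by c hyperplanes in general position (k >= 1)."""
    if c == 0:
        return 1          # no hyperplanes: the whole space is one face
    if k == 1:
        return c + 1      # c points split a line into c + 1 pieces
    # Adding one hyperplane to c - 1 existing ones creates as many new
    # faces as the existing hyperplanes carve out of the new hyperplane.
    return face_count(c - 1, k) + face_count(c - 1, k - 1)


if __name__ == "__main__":
    for c in (1, 2, 3, 4, 10):
        print(c, face_count(c, 2), face_count(c, 3))
```

For fixed k the values grow polynomially in c and stay below (c + 1)^k, consistent with the polynomial bound claimed in the lemma.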
Lemma.
Let be a layer neural network, where each layer is a linear transformation followed by a pointwise nonlinearity. Suppose there are at most nodes per layer, and the nonlinearities are piecewise linear with at most two pieces, and let
for some . Then a random matrix with IID entries satisfies the with probability.
Proof.
Consider the first layer of . Each node in this layer can be represented as a hyperplane in , where the points on the hyperplane are those where the input to the node switches from one linear piece to the other. Since there are at most nodes in this layer, by Lemma 8.3, the input space is partitioned by at most different hyperplanes, into faces. Applying this over the layers of , we get that the input space is partitioned into at most sets.
Recall that the nonlinearities are piecewise linear, and the partition boundaries were made precisely at those points where the nonlinearities change from one piece to another. This means that within each set of the input partition, the output is a linear function of the inputs. Thus is a union of different faces in .
We now use an oblivious subspace embedding to bound the number of measurements required to embed the range of . For a single face , a random matrix with IID entries such that satisfies with probability if .
Since the range of is a union of different faces, we can union bound over all of them, such that satisfies the with probability . Thus, we get that satisfies the with probability if
∎
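The partition argument in this proof can be illustrated empirically by counting the distinct on/off patterns of a small ReLU network over sampled inputs. This is a sketch under stated assumptions, not the construction used in the proof: the network is random, the function names (`relu_pattern`, `count_linear_pieces`, `face_bound`) are hypothetical, and sampling only gives a lower estimate of the true number of linear pieces.

```python
import random
from math import comb


def relu_pattern(weights, biases, z):
    """Forward pass through ReLU layers; return the on/off state of every node."""
    pattern = []
    h = z
    for W, b in zip(weights, biases):
        pre = [sum(w * v for w, v in zip(row, h)) + bi for row, bi in zip(W, b)]
        pattern.extend(p > 0 for p in pre)
        h = [max(p, 0.0) for p in pre]
    return tuple(pattern)


def count_linear_pieces(k=2, c=4, d=2, samples=20000, seed=0):
    """Count distinct activation patterns seen over random inputs in [-5, 5]^k."""
    rng = random.Random(seed)
    sizes = [k] + [c] * d
    weights = [
        [[rng.gauss(0.0, 1.0) for _ in range(sizes[i])] for _ in range(sizes[i + 1])]
        for i in range(d)
    ]
    biases = [[rng.gauss(0.0, 1.0) for _ in range(sizes[i + 1])] for i in range(d)]
    seen = set()
    for _ in range(samples):
        z = [rng.uniform(-5.0, 5.0) for _ in range(k)]
        seen.add(relu_pattern(weights, biases, z))
    return len(seen)


def face_bound(c, k):
    """Maximum number of faces cut out of k-dimensional space by c hyperplanes."""
    return sum(comb(c, i) for i in range(k + 1))


if __name__ == "__main__":
    k, c, d = 2, 4, 2
    print("patterns seen:", count_linear_pieces(k, c, d))
    print("upper bound  :", face_bound(c, k) ** d)
```

Within each observed pattern the network acts as a single affine map of its input, which is why the range is a union of low-dimensional pieces. The observed count can never exceed the per-layer face bound raised to the number of layers.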
8.3 Proof of Lemma 4.3
Lemma.
Let be drawn from a distribution that (1) satisfies the with probability and (2) has for every fixed , with probability . For any and noise , let . Let approximately minimize over , i.e.,
Then
with probability .
Proof.
Let . Then we have by Lemma 8.1 and the hypothesis on that