1 Introduction
We consider the well-studied compressed sensing problem of recovering an unknown signal $x^*$ by observing a set of noisy measurements $y$ of the form

(1) $y = Ax^* + \eta.$
Here $A$ is a known measurement matrix, typically generated with random independent Gaussian entries. Since the number of measurements $m$ is smaller than the dimension $n$ of the unknown vector $x^*$, this is an underdetermined system of noisy linear equations and hence ill-posed. There are many solutions, and some structure must be assumed on $x^*$ to have any hope of recovery. Pioneering research [donoho2006compressed; candes2006robust; candes2005decoding] established that if $x^*$ is assumed to be sparse in a known basis, a small number of measurements will be provably sufficient to recover the unknown vector in polynomial time using methods such as Lasso [tibshirani1996regression]. Sparsity approaches have proven successful, but more complex models with additional structure have recently been proposed, such as model-based compressive sensing [baraniuk2010model] and manifold models [hegde2008random; hegde2012signal; eftekhari2015new]. Bora et al. [bora2017compressed]
showed that deep generative models can serve as excellent priors for images. They also showed that backpropagation can be used to solve the signal recovery problem by performing gradient descent in the generative latent space. This method enabled image reconstruction with significantly fewer measurements than Lasso for a given reconstruction error. Compressed sensing using deep generative models was further improved in very recent work [tripathi2018correction; grover2018amortized; kabkab2018task; shah2018solving; fletcher2017inference; DBLP:journals/corr/abs-1802-04073]. Additionally, a theoretical analysis of the non-convex gradient descent algorithm of [bora2017compressed] was given by Hand et al. [hand2017global] under some assumptions on the generative model.

Inspired by these impressive benefits of deep generative models, we chose to investigate the application of such methods to medical imaging, a canonical application of compressive sensing. A significant problem, however, is that all these previous methods require pretrained models. While this has been achieved for various types of images, e.g. human faces of CelebA [liu2015faceattributes] via DCGAN [radford2015unsupervised], it remains significantly more challenging for medical images [wolterink2017generative; schlegl2017unsupervised; nie2017medical; schlemper2017deep]. Instead of addressing this problem in generative models, we found an easier way to circumvent it.
Surprising recent work by Ulyanov et al. [ulyanov2017deep] proposed Deep Image Prior (DIP), which uses untrained convolutional neural networks. In DIP-based schemes, a convolutional neural network generator (e.g. DCGAN) is initialized with random weights; these weights are subsequently optimized to make the network produce an output as close to the target image as possible. This procedure is unlearned, using no prior information from other images. The prior is enforced only by the fixed convolutional structure of the generator network.
Generators used for DIP are typically overparameterized, i.e. the number of network weights is much larger than the output dimension. For this reason DIP has empirically been found to overfit to noise if run for too many iterations [ulyanov2017deep]. In this paper we theoretically prove that this phenomenon occurs with gradient descent and justify the use of early stopping and other regularization methods.
Our Contributions:

In Section 3 we propose DIP for compressed sensing (CS-DIP). Our basic method is as follows: initialize a DCGAN generator with random weights, then use gradient descent to optimize these weights so that the network produces an output which agrees with the observed measurements as much as possible. This unlearned method can be improved with a novel learned regularization technique, which regularizes the DCGAN weights throughout the optimization process.

In Section 4 we theoretically prove that DIP will fit any signal to zero error with gradient descent. Our result is established for a network with a single hidden layer and sufficient constant-fraction overparameterization. While it is expected that overparameterized neural networks can fit any signal, the fact that gradient descent can provably solve this non-convex problem is interesting and provides theoretical justification for early stopping.

In Section 5 we empirically show that CS-DIP outperforms previous unlearned methods in many cases. While pretrained or “learned” methods will likely perform better [bora2017compressed], we have the advantage of not requiring a generative model trained on large datasets. As such, we can apply our method to various medical imaging datasets for which data acquisition is expensive and generative models are difficult to train.
2 Background
2.1 Compressed Sensing: Classical and Unlearned Approaches
A classical assumption made in compressed sensing is that the vector $x^*$ is sparse in some basis, such as wavelet or discrete cosine transform (DCT). Finding the sparsest solution to an underdetermined linear system of equations is NP-hard in general; however, if the matrix $A$ satisfies conditions such as the Restricted Eigenvalue Condition (REC) or Restricted Isometry Property (RIP) [candes2006stable; bickel2009simultaneous; donoho2006compressed; tibshirani1996regression], then $x^*$ can be recovered in polynomial time via convex relaxations [tropp2006just] or iterative methods. There is extensive compressed sensing literature regarding assumptions on $A$, numerous recovery algorithms, and variations of RIP and REC [bickel2009simultaneous; negahban2009unified; agarwal2010fast; bach2012optimization; loh2011high].

Compressed sensing methods have found many applications in imaging, for example the single-pixel camera (SPC) [duarte2008single]. Medical tomographic applications include x-ray radiography, microwave imaging, and magnetic resonance imaging (MRI) [winters2010sparsity; chen2008prior; lustig2007sparse]. Obtaining measurements for medical imaging can be costly, time-consuming, and in some cases dangerous to the patient [qaisar2013compressive]. As such, an important goal is to reduce the number of measurements while maintaining good reconstruction quality.
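As a concrete illustration of the classical sparse-recovery pipeline described above, the sketch below recovers a sparse vector from underdetermined Gaussian measurements using ISTA, a simple iterative soft-thresholding method for the Lasso objective. The dimensions, step size, and penalty weight are illustrative choices, not values from this paper.

```python
import numpy as np

def soft_threshold(v, t):
    # Proximal operator of the l1 norm: shrink each entry toward zero by t.
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def ista(A, y, lam=0.05, n_iters=500):
    # Iterative soft-thresholding for the Lasso objective
    #   min_x  0.5 * ||y - A x||^2 + lam * ||x||_1
    step = 1.0 / np.linalg.norm(A, 2) ** 2   # 1/L, L = largest eigenvalue of A^T A
    x = np.zeros(A.shape[1])
    for _ in range(n_iters):
        x = soft_threshold(x + step * A.T @ (y - A @ x), step * lam)
    return x

rng = np.random.default_rng(0)
n, m, k = 100, 40, 3                              # dimension, measurements, sparsity
support = rng.choice(n, k, replace=False)
x_true = np.zeros(n)
x_true[support] = (1.0 + rng.random(k)) * rng.choice([-1.0, 1.0], k)
A = rng.standard_normal((m, n)) / np.sqrt(m)      # Gaussian i.i.d. measurement matrix
y = A @ x_true                                    # noiseless measurements

x_hat = ista(A, y)
rel_err = np.linalg.norm(x_hat - x_true) / np.linalg.norm(x_true)
print(rel_err)
```

Even though the system has more unknowns than equations ($n = 100$, $m = 40$), the sparsity prior makes recovery possible, which is exactly the classical phenomenon the cited results formalize.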
Aside from the classical use of sparsity, recent work has used other priors to solve linear inverse problems. Plug-and-play priors [venkatakrishnan2013plug; chan2017plug] and Regularization by Denoising [romano2017little] have shown how image denoisers can be used to solve general linear inverse problems. A key example is BM3D-AMP, which applies a Block-Matching and 3D filtering (BM3D) denoiser within a Denoising-based Approximate Message Passing (D-AMP) algorithm [metzler2016denoising; metzler2015bm3d]. AMP has also been applied to linear models in other contexts, e.g. [schniter2016vector]. Another related algorithm is TVAL3 [zhang2013improved; li2009user], which leverages augmented Lagrangian multipliers to achieve impressive performance on compressed sensing problems. In many different settings, we compare our algorithm to these prior methods: BM3D-AMP, TVAL3, and Lasso.
2.2 Compressed Sensing: Learned Approaches
While sparsity in some chosen basis is well-established, recent work has shown better empirical performance when neural networks are used [bora2017compressed]. This success is attributed to the fact that neural networks are capable of learning image priors from very large datasets [goodfellow2014generative; kingma2013auto]. There is significant recent work on solving linear inverse problems using various learned techniques, e.g. recurrent generative models [mardani2017recurrent] and autoregressive models [dave2018solving]. Additionally, approximate message passing (AMP) has been extended to a learned setting by Metzler et al. [metzler2017learned].
Bora et al. [bora2017compressed] is the closest to our setup. In this work the authors assume that the unknown signal is in the range of a pretrained generative model such as a generative adversarial network (GAN) [goodfellow2014generative] or variational autoencoder (VAE) [kingma2013auto]. The recovery of the unknown signal is obtained via gradient descent in the latent space, by searching for a signal that satisfies the measurements. This can be directly applied to linear inverse problems and more generally to any differentiable measurement process. Recent work has built upon these methods using new optimization techniques [Chang17], uncertainty autoencoders [grover2018uncertainty], and other approaches [dhar2018modeling; kabkab2018task; mixon2018sunlayer; pandit2019asymptotics; rusu2018meta]. The key point is that all this prior work requires pretrained generative models, in contrast to CS-DIP. Finally, there is significant ongoing work to understand DIP and develop related approaches, see e.g. [heckel2018deep; dittmer2018regularization].

3 Proposed Algorithm
Let $x^* \in \mathbb{R}^n$ be the signal that we are trying to reconstruct, $A \in \mathbb{R}^{m \times n}$ be the measurement matrix, and $\eta \in \mathbb{R}^m$ be independent noise. Given the measurement matrix $A$ and the observations $y = Ax^* + \eta$, we wish to reconstruct an $\hat{x}$ that is close to $x^*$.
A generative model is a deterministic function $G(w; z)$ which takes as input a seed $z \in \mathbb{R}^k$ and is parameterized by “weights” $w$, producing an output $G(w; z) \in \mathbb{R}^n$. These models have shown excellent performance generating real-life signals such as images [goodfellow2014generative; kingma2013auto] and audio [wavenet]. We investigate deep convolutional generative models, a special case in which the model architecture has multiple cascaded layers of convolutional filters [krizhevsky2012imagenet]. In this paper we apply a DCGAN [radford2015unsupervised] model and restrict the signals to be images.
3.1 Compressed Sensing with Deep Image Prior (CS-DIP)
Our approach is to find a set of weights for the convolutional network such that the measurement matrix applied to the network output, i.e. $AG(w; z)$, matches the measurements we are given. Hence we initialize an untrained network with some fixed seed $z$ and solve the following optimization problem:

(2) $w^* = \arg\min_w \|y - AG(w; z)\|^2.$
This is, of course, a non-convex problem because $G(w; z)$ is a complex feed-forward neural network. Still, we can use gradient-based optimizers for any generative model and measurement process that is differentiable. Generator networks such as DCGAN are biased toward smooth, natural images due to their convolutional structure; thus the network structure alone provides a good prior for reconstructing images in problems such as inpainting and denoising [ulyanov2017deep]. Our finding is that this applies to general linear measurement processes. We restrict our solution to lie in the span of a convolutional neural network. If a sufficient number of measurements is given, we obtain an output $G(w^*; z)$ such that $G(w^*; z) \approx x^*$.

Note that this method uses an untrained generative model and optimizes over the network weights $w$. In contrast, previous methods, such as that of Bora et al. [bora2017compressed], use a trained model and optimize over the latent space, solving $z^* = \arg\min_z \|y - AG(w; z)\|^2$. We instead initialize a random seed $z$ with Gaussian i.i.d. entries and keep it fixed throughout the optimization process.
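The optimization loop described above can be sketched end to end. The paper optimizes a DCGAN generator in PyTorch; as a minimal dependency-free stand-in we use a one-hidden-layer ReLU generator $G(w; z) = V\,\mathrm{relu}(Wz)$ (the same form analyzed in Section 4) with hand-coded gradients. All dimensions, the learning rate, and the iteration count are illustrative assumptions, not the paper's settings.

```python
import numpy as np

def cs_dip_fit(A, y, z, n_hidden=128, lr=0.03, n_iters=2000, seed=0):
    """Minimal CS-DIP sketch: starting from randomly initialized weights,
    run gradient descent on ||y - A G(w; z)||^2 for the toy generator
    G(w; z) = V relu(W z), with the seed z held fixed throughout."""
    rng = np.random.default_rng(seed)
    k, n = z.shape[0], A.shape[1]
    W = rng.standard_normal((n_hidden, k)) / np.sqrt(k)
    V = rng.standard_normal((n, n_hidden)) / np.sqrt(n_hidden)
    for _ in range(n_iters):
        h = np.maximum(W @ z, 0.0)                 # hidden ReLU activations
        e = y - A @ (V @ h)                        # residual in measurement space
        grad_x = -A.T @ e                          # dLoss / d(output)
        grad_V = np.outer(grad_x, h)               # dLoss / dV
        grad_W = np.outer((V.T @ grad_x) * (W @ z > 0), z)  # dLoss / dW
        V -= lr * grad_V
        W -= lr * grad_W
    return V @ np.maximum(W @ z, 0.0)

rng = np.random.default_rng(1)
n, m, k = 64, 32, 16                               # signal size, measurements, seed size
x_true = rng.standard_normal(n)
A = rng.normal(0.0, 1.0 / np.sqrt(m), (m, n))      # Gaussian i.i.d. measurement matrix
y = A @ x_true                                     # noiseless measurements
z = rng.standard_normal(k) / np.sqrt(k)            # fixed random seed

x_hat = cs_dip_fit(A, y, z)
print(np.linalg.norm(y - A @ x_hat) / np.linalg.norm(y))  # measurement residual
```

Note that only the measurement residual is driven to zero; with $m < n$ the reconstruction itself is constrained to agree with $x^*$ only through the measurements and the structural prior of the network.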
In our algorithm we leverage the well-established total variation regularization [rudin1992nonlinear; wang2008new; liu2018image], denoted as $TV(G(w; z))$. We also propose an additional learned regularization technique, $LR(w)$; note that without this technique our method is completely unlearned. Lastly, we use early stopping, a phenomenon that will be analyzed theoretically in Section 4.
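For concreteness, one common form of the total variation term, the anisotropic variant, can be computed as below; whether the paper uses the isotropic or anisotropic form is not specified here, so this is an illustrative sketch.

```python
import numpy as np

def total_variation(img):
    # Anisotropic total variation: sum of absolute differences between
    # vertically and horizontally adjacent pixels.
    return (np.abs(np.diff(img, axis=0)).sum()
            + np.abs(np.diff(img, axis=1)).sum())

flat = np.ones((8, 8))                             # constant image: TV is zero
rng = np.random.default_rng(0)
noisy = flat + 0.5 * rng.standard_normal((8, 8))   # noise raises TV
print(total_variation(flat), total_variation(noisy))
```

Penalizing this quantity pushes the generator output toward piecewise-smooth images, complementing the convolutional prior.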
Thus the final optimization problem becomes
(3) $w^* = \arg\min_w \|y - AG(w; z)\|^2 + R(w).$
The regularization term contains hyperparameters $\lambda_T$ and $\lambda_L$ for total variation and learned regularization: $R(w) = \lambda_T\, TV(G(w; z)) + \lambda_L\, LR(w)$. We now discuss this term.

3.2 Learned Regularization
Without learned regularization, CS-DIP relies only on linear measurements taken from one unknown image. We now introduce a novel method which leverages a small amount of training data to optimize the regularization. In this case training data refers to measurements from additional ground-truth images of a similar type, e.g. other x-ray images.
To leverage this additional information, we pose Eqn. 3 as a Maximum a Posteriori (MAP) estimation problem and propose a novel prior on the weights of the generative model. This prior then acts as a regularization term, penalizing the model toward an optimal set of weights.

For a set of weights $w$, we model the likelihood of the measurements $p(y \mid w)$ and the prior on the weights $p(w)$ as Gaussian distributions, with $p(w) = \mathcal{N}(\mu, \Sigma)$ for some mean $\mu$ and covariance $\Sigma$.
In this setting we want to find a set of weights $w^*$ that maximizes the posterior on $w$ given $y$, i.e.,

(4) $w^* = \arg\max_w \; \log p(y \mid w) + \lambda \log p(w).$
This gives us the learned regularization term

(5) $LR(w) = (w - \mu)^T \Sigma^{-1} (w - \mu),$

where the coefficient $\lambda$ in Eqn. 4 controls the strength of the prior.
Notice that when $\mu = 0$ and $\Sigma = I$, this regularization term is equivalent to $\ell_2$ regularization. Thus this method can be thought of as a more strategic version of standard weight decay.
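With a diagonal, layer-wise covariance the penalty of Eqn. 5 reduces to a weighted sum of squared deviations per layer. A minimal sketch, where the list-of-arrays representation and the function name are our own:

```python
import numpy as np

def learned_reg(weights, mu, sigma2):
    """Eqn. 5 with a diagonal, layer-wise covariance: layer l contributes
    ||w_l - mu_l||^2 / sigma2_l. `weights` is a list of per-layer weight
    arrays; `mu` and `sigma2` give one scalar mean and variance per layer."""
    return sum(np.sum((w - m) ** 2) / s2
               for w, m, s2 in zip(weights, mu, sigma2))

# With zero means and unit variances this reduces to l2 weight decay:
ws = [np.array([1.0, -1.0]), np.array([2.0])]
print(learned_reg(ws, [0.0, 0.0], [1.0, 1.0]))  # sum of ||w_l||^2
```

A layer with small learned variance is penalized strongly for straying from its learned mean, while a high-variance layer is left nearly free.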
3.2.1 Learning the Prior Parameters
In the previous section, we introduced the learned regularization term defined in Eqn. 5. However, we have not yet learned values for the parameters $(\mu, \Sigma)$ that incorporate prior knowledge of the network weights. We now propose a way to estimate these parameters.
Assume we have a set of measurements $y_1, \dots, y_N$ from $N$ different images, each obtained with a different measurement matrix $A_i$. For each measurement $y_i$ we run CS-DIP to solve the optimization problem in Eqn. 3 and obtain an optimal set of weights $w_i^*$. Note that when optimizing for the weights we only have access to the measurements $y_i$, not the ground truth $x_i^*$.
The number of weights in deep networks tends to be very large. As such, learning a distribution over each individual weight, i.e. estimating a separate mean and variance per weight, becomes intractable. We instead use a layer-wise approach: with $L$ network layers, we have $\mu = \{\mu_1, \dots, \mu_L\}$ and $\Sigma = \mathrm{diag}(\sigma_1^2, \dots, \sigma_L^2)$. Thus each weight within layer $l$ is modeled according to the same distribution $\mathcal{N}(\mu_l, \sigma_l^2)$. For simplicity we assume the network weights are independent across layers. The process of estimating these statistics from the optimized weights is described in Algorithm 1 of the appendix.
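Algorithm 1 in the appendix gives the authors' exact estimation procedure; the following is a plausible simplified sketch that pools each layer's entries across runs and fits one Gaussian per layer. The function name and data layout are our own.

```python
import numpy as np

def estimate_layer_stats(runs):
    """Estimate layer-wise (mu_l, sigma_l^2) from several optimized weight
    sets. `runs[i][l]` is the weight array of layer l obtained by running
    CS-DIP on the i-th set of measurements; all entries of layer l, pooled
    across runs, are fit by a single Gaussian."""
    n_layers = len(runs[0])
    mu, sigma2 = [], []
    for l in range(n_layers):
        pooled = np.concatenate([r[l].ravel() for r in runs])
        mu.append(float(pooled.mean()))
        sigma2.append(float(pooled.var()))
    return mu, sigma2

# Two runs, one layer: pooled entries {2, 2, 2, 2, 4, 4, 4, 4}.
runs = [[np.full(4, 2.0)], [np.full(4, 4.0)]]
mu, sigma2 = estimate_layer_stats(runs)
print(mu, sigma2)
```

The resulting per-layer statistics plug directly into the learned regularization term of Eqn. 5.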
We use this learned $(\mu, \Sigma)$ in the regularization term from Eqn. 5 when reconstructing measurements of new images. We refer to this technique as learned regularization. While this may seem analogous to batch normalization [ioffe2015batch], note that we only use $(\mu, \Sigma)$ to penalize the norm of the weights and do not normalize the layer outputs themselves.

3.2.2 Discussion of Learned Regularization
The proposed CS-DIP does not require training if no learned regularization is used, i.e. if $\lambda_L = 0$ in Eqn. 3. This means that CS-DIP can be applied with only the measurements from a single image and no prior information from similar images in a dataset.
Our next idea, learned regularization, utilizes a small amount of prior information, requiring access to measurements from a small number of similar images. In contrast, other pretrained models such as that of Bora et al. [bora2017compressed] require access to ground truth from a massive number of similar images (tens of thousands for CelebA). If such a large dataset is available, and if a good generative model can be trained on that dataset, we expect that pretrained models [bora2017compressed; grover2018amortized; kabkab2018task; mardani2017recurrent] would outperform our method. Our approach is instead more suitable for reconstruction problems where large amounts of data or good generative models are not readily available.
4 Theoretical Results
In this section we provide theoretical evidence to highlight the importance of early stopping for DIP-based approaches. Here we focus on denoising a noisy signal $y$ via DIP. The optimization problem in this case takes the form

(6) $w^* = \arg\min_w \|y - G(w; z)\|^2.$

This is a special instance of Eqn. 2 with the measurement matrix $A = I$ corresponding to denoising. We focus on generators consisting of a single hidden-layer ReLU network with $k$ inputs, $d$ hidden units, and $n$ outputs. In this case the generator model is given by

(7) $G(w; z) = V\,\mathrm{relu}(Wz),$

where $z \in \mathbb{R}^k$ is the input, $W \in \mathbb{R}^{d \times k}$ the input-to-hidden weights, and $V \in \mathbb{R}^{n \times d}$ the hidden-to-output weights. We assume $V$ is fixed at random and train over $W$ using gradient descent. With these formulations in place, we are now ready to state our theoretical result.
Theorem 4.1.
Consider fitting a generator of the form (7) to a signal $y \in \mathbb{R}^n$, with input $z \in \mathbb{R}^k$, hidden-layer width $d$, and output dimension $n$. Furthermore, assume $V$ is a random matrix with i.i.d. zero-mean Gaussian entries. Starting from an initial weight matrix $W_0$ selected at random with i.i.d. Gaussian entries, we run gradient descent updates of the form $W_{t+1} = W_t - \alpha \nabla \mathcal{L}(W_t)$ on the loss $\mathcal{L}(W) = \tfrac{1}{2}\|V\,\mathrm{relu}(Wz) - y\|^2$ with an appropriately chosen step size $\alpha$. Assuming that $d \ge C\,n$ with $C$ a fixed numerical constant, then $\|V\,\mathrm{relu}(W_t z) - y\| \le (1 - \delta)^t \|y\|$ holds for all iterations $t$, for some rate $0 < \delta < 1$, with high probability over the random initialization.

Our theoretical result shows that after sufficiently many iterative updates, gradient descent will solve this non-convex optimization problem and fit any signal $y$, provided the generator network is sufficiently wide. This occurs as soon as the number of hidden units exceeds the signal size by a constant factor. While our proof is for the case $A = I$, a similar result can be shown for other measurement matrices, since the resulting matrix $AV$ is essentially a Gaussian i.i.d. matrix of different output dimension. This result demonstrates that early stopping is necessary for DIP-based methods to be successful; otherwise the network can fit any signal, including one that is noisy or corrupted.
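A small numerical experiment in the spirit of Theorem 4.1 (toy dimensions, step size, and iteration count are our own choices, not the theorem's constants): gradient descent on a wide one-hidden-layer ReLU generator drives the loss against a noisy target essentially to zero, i.e. the network fits the noise as well as the signal, which is exactly why early stopping is needed.

```python
import numpy as np

rng = np.random.default_rng(1)
n, k, n_hidden = 32, 8, 256                       # signal size, seed size, hidden units
x_clean = np.sin(np.linspace(0, 3 * np.pi, n))    # smooth underlying signal
x_noisy = x_clean + 0.3 * rng.standard_normal(n)  # noisy observation to denoise

z = rng.standard_normal(k) / np.sqrt(k)           # fixed random seed
W = rng.standard_normal((n_hidden, k)) / np.sqrt(k)
V = rng.standard_normal((n, n_hidden)) / np.sqrt(n_hidden)

losses = []
for _ in range(1000):
    h = np.maximum(W @ z, 0.0)
    e = x_noisy - V @ h                           # residual against the NOISY target
    losses.append(float(e @ e))
    grad_V = -np.outer(e, h)
    grad_W = -np.outer((V.T @ e) * (W @ z > 0), z)
    V -= 0.05 * grad_V
    W -= 0.05 * grad_W

# The overparameterized network fits the noisy target essentially exactly,
# while its distance to the clean signal stays at roughly the noise energy.
x_fit = V @ np.maximum(W @ z, 0.0)
print(losses[0], losses[-1], float(np.sum((x_fit - x_clean) ** 2)))
```

Here $d/n = 8$, so the width exceeds the signal size by a constant factor, matching the regime of the theorem; the loss against the noisy target collapses while the error against the clean signal does not.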
Our proof builds on theoretical ideas from Oymak et al. [oymak2019towards], which provide a general framework for establishing global convergence guarantees for overparameterized nonlinear learning problems based on various properties of the Jacobian mapping along the gradient descent trajectory. See also [du2018gradient; Oymak:2018aa] and references therein for other related literature. We combine delicate tools from empirical process theory, random matrix theory, and matrix algebra to show that, starting from a random initialization, the Jacobian mapping across all iterates has favorable properties with high probability, hence facilitating convergence to a global optimum.
5 Experiments
To replicate these experiments or run new experiments using this method, please see our GitHub repository at github.com/davevanveen/compsensing_dip.
Table 1: Improvement from learned regularization (LR) for varying noise levels and numbers of measurements $m$.

Noise variance | $m$ = 500 | 1000 | 2000 | 4000 | 8000
0 | 9.9% | 2.9% | 0.2% | 2.0% | 0.6%
10 | 11.6% | 4.6% | 4.5% | 2.4% | 1.0%
100 | 14.9% | 19.2% | 5.0% | 3.9% | 2.8%
1000 | 37.4% | 30.6% | 19.8% | 3.0% | 6.2%

The left column corresponds to the variance of the noise vector $\eta$ in Eqn. 1, i.e. each entry of $\eta$ is drawn independently from a zero-mean Gaussian with that variance. These results indicate that LR tends to provide greater benefit with noisy signals and with fewer measurements.

5.1 Experimental Setup
Measurements: We evaluate our algorithm using two different measurement processes, i.e. two choices of the matrix $A$. First we set the entries of $A$ to be Gaussian i.i.d. Recall that $m$ is the number of measurements and $n$ is the number of pixels in the ground-truth image. This measurement process is standard practice in the compressed sensing literature, and hence we use it on each dataset. Additionally, we use a Fourier measurement process common in MRI applications [mardani2018neural; mardani2017deep; hammernik2018learning; lehtinen2018noise2noise; lustig2008compressed] and evaluate it on the x-ray dataset.
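The Gaussian measurement process can be sketched as follows. The $\mathcal{N}(0, 1/m)$ entry variance is a common convention we assume here (chosen so that $\mathbb{E}\|Ax\|^2 = \|x\|^2$); the paper's exact scaling may differ.

```python
import numpy as np

def gaussian_measurements(x, m, noise_std=0.0, seed=0):
    """Form y = A x + eta with A having i.i.d. N(0, 1/m) entries
    (assumed scaling, so that E||Ax||^2 = ||x||^2)."""
    rng = np.random.default_rng(seed)
    A = rng.normal(0.0, 1.0 / np.sqrt(m), size=(m, x.size))
    eta = noise_std * rng.standard_normal(m)
    return A, A @ x + eta

x = np.ones(784)                     # e.g. a flattened 28x28 MNIST image
A, y = gaussian_measurements(x, m=75)
print(A.shape, y.shape)
```

Each image is flattened to a vector before measurement, so $m = 75$ here corresponds to the low-measurement MNIST regime discussed below.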
Datasets: We use our algorithm to reconstruct both grayscale and RGB images. For grayscale we use the first 100 images in the test set of MNIST [lecun1998gradient] and also 60 random images from the Shenzhen Chest X-Ray Dataset [jaeger2014two], downsampling a 512x512 crop to 256x256 pixels. For RGB we use retinopathy images from the STARE dataset [hoover2000locating] with 512x512 crops downsized to 128x128 pixels.
Baselines: We compare our algorithm to state-of-the-art unlearned methods such as BM3D-AMP [metzler2016denoising; metzler2015bm3d], TVAL3 [li2011compressive; li2009user; zhang2013improved], and Lasso in a DCT basis [ahmed1974discrete]. We also evaluated the performance of Lasso in a Daubechies wavelet basis [daubechies1988orthonormal; wasilewski2010pywavelets] but found that it performed worse than Lasso-DCT on all datasets. Thus for simplicity we refer to Lasso-DCT as “Lasso” and do not include results for Lasso-Wavelet. To reconstruct RGB retinopathy images, we must use the colored version, CBM3D-AMP. Unfortunately an RGB version of TVAL3 does not currently exist, although related TV algorithms such as FTVd perform similar tasks such as denoising RGB images [wang2008new].
Metrics: To quantitatively evaluate the performance of our algorithm, we use per-pixel mean-squared error (MSE) between the reconstruction $\hat{x}$ and the true image $x^*$, i.e. $\frac{1}{n}\|\hat{x} - x^*\|^2$. Note that depending on the range of the pixel values, it is possible for the MSE to be greater than one.
Implementation: To find a set of weights that minimizes Eqn. 3, we use PyTorch [paszke2017automatic] with a DCGAN architecture. For the baselines BM3D-AMP and TVAL3, we use the repositories provided by the authors, Metzler et al. [metzler2018] and Li et al. [li2013], respectively. For Lasso reconstructions, we use scikit-learn [scikitlearn]. Section A in the appendix provides further details on our experimental procedures, e.g. choosing hyperparameters.

5.2 Experimental Results
Results: Learned Regularization

[Figure: Per-pixel reconstruction error (MSE) vs. number of measurements. Vertical bars indicate 95% confidence intervals. BM3D-AMP frequently fails to converge at low measurement counts on x-ray images, as denoted by error values far above the vertical axis.]

We first evaluate the benefits of learned regularization by comparing our algorithm with and without it, i.e. with $\lambda_L > 0$ and $\lambda_L = 0$, respectively. The latter setting is an unlearned method, as we are not leveraging $(\mu, \Sigma)$ from a specific dataset. In the former setting we first learn $(\mu, \Sigma)$ from a particular set of x-ray images; we then evaluate on a different set of x-ray images. We compare these two settings with varying noise and across different numbers of measurements.
Our results in Table 1 show that learned regularization does indeed provide a benefit. This benefit tends to increase with more noise or fewer measurements. Thus we can infer that assuming a learned Gaussian distribution over the weights is useful, especially when the original signal is noisy or significantly compressed.
Results: Unlearned CS-DIP
For the remainder of this section, we evaluate our algorithm in the noiseless case without learned regularization, i.e. when $\eta = 0$ in Eqn. 1 and $\lambda_L = 0$ in Eqn. 3. Hence CS-DIP is completely unlearned; as such, we compare it to other state-of-the-art unlearned algorithms on various datasets and with different measurement matrices.
MNIST: In Figure 0(b) we plot reconstruction error for a varying number of measurements $m$ of $n$ = 784 pixels. This demonstrates that our algorithm outperforms the baselines in almost all cases. Figure 1(b) shows reconstructions for $m$ = 75 measurements, while the remaining reconstructions are in the appendix.
Chest X-Rays: In Figure 0(a) we plot reconstruction error for a varying number of measurements $m$ of $n$ = 65536 pixels. Figure 1(a) shows reconstructions for $m$ = 2000 measurements, while the remaining reconstructions are in the appendix. On this dataset we outperform all baselines except BM3D-AMP at larger $m$. However, for smaller $m$, i.e. at low sampling ratios $m/n$, BM3D-AMP often does not converge. This finding seems to support the work of Metzler et al. [metzler2015bm3d]: BM3D-AMP performs impressively at higher sampling ratios, but recovery at lower sampling rates is not demonstrated.
Retinopathy: We plot reconstruction error for a varying number of measurements $m$ of $n$ = 49152 pixels in Figure 2(a) of the appendix. On this RGB dataset we quantitatively outperform all baselines except BM3D-AMP at larger $m$; however, even at these larger $m$, patches of green and purple pixels corrupt the image reconstructions, as seen in Figure 9. As with the x-ray images at smaller $m$, BM3D-AMP often fails to produce anything sensible. All retinopathy reconstructions are located in the appendix.
Fourier Measurement Process: All previous experiments used a measurement matrix $A$ containing Gaussian i.i.d. entries. We now consider the case where the measurement matrix is a subsampled Fourier matrix. That is, for a 2D image $x$ and a set of frequency indices $\Omega$, the measurements we receive are given by $y = \{\mathcal{F}(x)_{ij} : (i, j) \in \Omega\}$, where $\mathcal{F}$ is the 2D Fourier transform. We choose $\Omega$ to be the indices along radial lines, as shown in Figure 12 of the appendix; this choice of $\Omega$ is common in the literature [candes2006robust] and in MRI applications [mardani2017deep; lustig2008compressed; eksioglu2018denoising]. We compare our algorithm to the baselines on the x-ray dataset for varying numbers of radial lines in the Fourier domain, corresponding to varying numbers of Fourier coefficients. We plot reconstruction error against the number of Fourier coefficients in Figure 2(b) of the appendix, outperforming the baselines BM3D-AMP and TVAL3. Reconstructions can also be found in the appendix.

Runtime: In Table 2 we show the runtimes of CS-DIP on the x-ray dataset. While runtime is not the focus of our work, because our algorithm can utilize a GPU, it is competitive with or faster than the baseline algorithms. The baselines are implemented in MATLAB or scikit-learn [scikitlearn] and only leverage the CPU, while we run our experiments on an NVIDIA GTX 1080Ti.
Table 2: Runtime of each algorithm on the x-ray dataset for varying numbers of measurements $m$.

Algorithm | $m$ = 1000 | 2000 | 4000 | 8000
CS-DIP | 15.6 | 17.1 | 20.4 | 29.9
BM3D-AMP | 51.1 | 54.0 | 67.8 | 71.2
TVAL3 | 13.8 | 22.1 | 31.9 | 56.7
Lasso-DCT | 27.1 | 33.0 | 52.2 | 96.4
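The radial-line Fourier subsampling used in the experiments above can be sketched as follows; the exact line placement used in the paper (Figure 12 of the appendix) may differ from this simple construction.

```python
import numpy as np

def radial_mask(size, n_lines):
    """Binary mask selecting Fourier coefficients along radial lines
    through the center of a size x size grid (an MRI-style subsampling
    pattern; the paper's exact line placement may differ)."""
    mask = np.zeros((size, size), dtype=bool)
    center = (size - 1) / 2.0
    radius = size / np.sqrt(2.0)
    for angle in np.linspace(0, np.pi, n_lines, endpoint=False):
        for r in np.linspace(-radius, radius, 4 * size):
            i = int(round(center + r * np.sin(angle)))
            j = int(round(center + r * np.cos(angle)))
            if 0 <= i < size and 0 <= j < size:
                mask[i, j] = True
    return mask

def fourier_measurements(img, mask):
    # Keep only the centered 2D DFT coefficients indexed by the mask.
    return np.fft.fftshift(np.fft.fft2(img))[mask]

mask = radial_mask(64, n_lines=10)
img = np.random.default_rng(0).random((64, 64))
y = fourier_measurements(img, mask)
print(mask.sum(), y.shape)
```

The number of radial lines controls the number of retained Fourier coefficients, i.e. the effective number of measurements.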
6 Conclusion
We demonstrate compressed sensing recovery using untrained, randomly initialized convolutional neural networks. Our method outperforms previous state-of-the-art unlearned methods in most cases, especially when the number of obtained measurements is small. Additionally, we propose a learned regularization method, which enforces a learned Gaussian prior on the network weights. This prior reduces reconstruction error, particularly for noisy or compressed measurements. Finally, we show that a sufficiently wide single hidden-layer network can fit any signal, thus motivating regularization by early stopping.
References

(1) Alekh Agarwal, Sahand Negahban, and Martin J Wainwright. Fast global convergence rates of gradient methods for high-dimensional statistical recovery. In Advances in Neural Information Processing Systems, pages 37–45, 2010.
(2) Nasir Ahmed, T. Natarajan, and Kamisetty R Rao. Discrete cosine transform. IEEE Transactions on Computers, 100(1):90–93, 1974.
(3) Muhammad Asim, Fahad Shamshad, and Ali Ahmed. Solving bilinear inverse problems using deep generative priors. CoRR, abs/1802.04073, 2018.
(4) Francis Bach, Rodolphe Jenatton, Julien Mairal, Guillaume Obozinski, et al. Optimization with sparsity-inducing penalties. Foundations and Trends® in Machine Learning, 4(1):1–106, 2012.
(5) Richard G Baraniuk, Volkan Cevher, Marco F Duarte, and Chinmay Hegde. Model-based compressive sensing. IEEE Transactions on Information Theory, 56(4):1982–2001, 2010.
(6) Peter J Bickel, Ya'acov Ritov, Alexandre B Tsybakov, et al. Simultaneous analysis of lasso and Dantzig selector. The Annals of Statistics, 37(4):1705–1732, 2009.
(7) Ashish Bora, Ajil Jalal, Eric Price, and Alexandros G Dimakis. Compressed sensing using generative models. arXiv preprint arXiv:1703.03208, 2017.
(8) Emmanuel J Candès, Justin Romberg, and Terence Tao. Robust uncertainty principles: Exact signal reconstruction from highly incomplete frequency information. IEEE Transactions on Information Theory, 52(2):489–509, 2006.
(9) Emmanuel J Candès, Justin K Romberg, and Terence Tao. Stable signal recovery from incomplete and inaccurate measurements. Communications on Pure and Applied Mathematics, 59(8):1207–1223, 2006.

(10) Emmanuel J Candès and Terence Tao. Decoding by linear programming. IEEE Transactions on Information Theory, 51(12):4203–4215, 2005.
(11) Stanley H Chan, Xiran Wang, and Omar A Elgendy. Plug-and-play ADMM for image restoration: Fixed-point convergence and applications. IEEE Transactions on Computational Imaging, 3(1):84–98, 2017.
(12) Jen-Hao Rick Chang, Chun-Liang Li, Barnabás Póczos, B. V. K. Vijaya Kumar, and Aswin C. Sankaranarayanan. One network to solve them all – solving linear inverse problems using deep projection models. CoRR, abs/1703.09912, 2017.
(13) Guang-Hong Chen, Jie Tang, and Shuai Leng. Prior image constrained compressed sensing (PICCS): A method to accurately reconstruct dynamic CT images from highly undersampled projection data sets. Medical Physics, 35(2):660–663, 2008.
 (14) Ingrid Daubechies. Orthonormal bases of compactly supported wavelets. Communications on pure and applied mathematics, 41(7):909–996, 1988.
(15) Akshat Dave, Anil Kumar Vadathya, Ramana Subramanyam, Rahul Baburajan, and Kaushik Mitra. Solving inverse computational imaging problems using deep pixel-level prior. arXiv preprint arXiv:1802.09850, 2018.
 (16) Manik Dhar, Aditya Grover, and Stefano Ermon. Modeling sparse deviations for compressed sensing using generative models. arXiv preprint arXiv:1807.01442, 2018.
 (17) Soren Dittmer, Tobias Kluth, Peter Maass, and Daniel Otero Baguer. Regularization by architecture: A deep prior approach for inverse problems. arXiv preprint arXiv:1812.03889, 2018.
(18) David L Donoho. Compressed sensing. IEEE Transactions on Information Theory, 52(4):1289–1306, 2006.
 (19) Simon S Du, Xiyu Zhai, Barnabas Poczos, and Aarti Singh. Gradient descent provably optimizes overparameterized neural networks. arXiv preprint arXiv:1810.02054, 2018.
(20) Marco F Duarte, Mark A Davenport, Dharmpal Takhar, Jason N Laska, Ting Sun, Kevin F Kelly, and Richard G Baraniuk. Single-pixel imaging via compressive sampling. IEEE Signal Processing Magazine, 25(2):83–91, 2008.
 (21) Armin Eftekhari and Michael B Wakin. New analysis of manifold embeddings and signal recovery from compressive measurements. Applied and Computational Harmonic Analysis, 39(1):67–109, 2015.
(22) Ender M Eksioglu and A Korhan Tanc. Denoising AMP for MRI reconstruction: BM3D-AMP-MRI. SIAM Journal on Imaging Sciences, 11(3):2090–2109, 2018.
 (23) Alyson K Fletcher and Sundeep Rangan. Inference in deep networks in high dimensions. arXiv preprint arXiv:1706.06549, 2017.
(24) Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In NeurIPS, pages 2672–2680, 2014.
 (25) Aditya Grover and Stefano Ermon. Amortized variational compressive sensing. ICLR Workshop, 2018.
 (26) Aditya Grover and Stefano Ermon. Uncertainty autoencoders: Learning compressed representations via variational information maximization. arXiv preprint arXiv:1812.10539, 2018.
 (27) Kerstin Hammernik, Teresa Klatzer, Erich Kobler, Michael P Recht, Daniel K Sodickson, Thomas Pock, and Florian Knoll. Learning a variational network for reconstruction of accelerated mri data. Magnetic resonance in medicine, 79(6):3055–3071, 2018.
 (28) Paul Hand and Vladislav Voroninski. Global guarantees for enforcing deep generative priors by empirical risk. arXiv preprint arXiv:1705.07576, 2017.
(29) Reinhard Heckel, Wen Huang, Paul Hand, and Vladislav Voroninski. Deep denoising: Rate-optimal recovery of structured signals with a deep prior. arXiv preprint arXiv:1805.08855, 2018.
 (30) Chinmay Hegde and Richard G Baraniuk. Signal recovery on incoherent manifolds. IEEE Transactions on Information Theory, 58(12):7204–7214, 2012.
 (31) Chinmay Hegde, Michael Wakin, and Richard Baraniuk. Random projections for manifold learning. In Advances in neural information processing systems, pages 641–648, 2008.
 (32) AD Hoover, Valentina Kouznetsova, and Michael Goldbaum. Locating blood vessels in retinal images by piecewise threshold probing of a matched filter response. IEEE Transactions on Medical imaging, 19(3):203–210, 2000.
 (33) Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.
(34) Stefan Jaeger, Sema Candemir, Sameer Antani, Yì-Xiáng J Wáng, Pu-Xuan Lu, and George Thoma. Two public chest x-ray datasets for computer-aided screening of pulmonary diseases. Quantitative Imaging in Medicine and Surgery, 4(6):475, 2014.
(35) Maya Kabkab, Pouya Samangouei, and Rama Chellappa. Task-aware compressed sensing with generative adversarial networks. arXiv preprint arXiv:1802.01284, 2018.
(36) Diederik P Kingma and Max Welling. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.
 (37) Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012.
 (38) Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
 (39) Jaakko Lehtinen, Jacob Munkberg, Jon Hasselgren, Samuli Laine, Tero Karras, Miika Aittala, and Timo Aila. Noise2Noise: Learning image restoration without clean data. arXiv preprint arXiv:1803.04189, 2018.
 (40) Chengbo Li. Compressive sensing for 3d data processing tasks: applications, models and algorithms. Technical report, Rice University, 2011.
 (41) Chengbo Li, Wotao Yin, and Yin Zhang. TVAL3: TV minimization by augmented Lagrangian and alternating direction algorithms. https://www.caam.rice.edu/~optimization/L1/TVAL3/.
 (42) Chengbo Li, Wotao Yin, and Yin Zhang. User’s guide for TVAL3: TV minimization by augmented Lagrangian and alternating direction algorithms. CAAM report, 20(4647):4, 2009.
 (43) Jiaming Liu, Yu Sun, Xiaojian Xu, and Ulugbek S Kamilov. Image restoration using total variation regularized deep image prior. arXiv preprint arXiv:1810.12864, 2018.
 (44) Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep learning face attributes in the wild. In Proceedings of International Conference on Computer Vision (ICCV), 2015.
 (45) Po-Ling Loh and Martin J Wainwright. High-dimensional regression with noisy and missing data: Provable guarantees with nonconvexity. In NeurIPS, pages 2726–2734, 2011.
 (46) Michael Lustig, David Donoho, and John M Pauly. Sparse MRI: The application of compressed sensing for rapid MR imaging. Magnetic resonance in medicine, 58(6):1182–1195, 2007.
 (47) Michael Lustig, David L Donoho, Juan M Santos, and John M Pauly. Compressed sensing MRI. IEEE signal processing magazine, 25(2):72–82, 2008.
 (48) Morteza Mardani, Enhao Gong, Joseph Y Cheng, Shreyas Vasanawala, Greg Zaharchuk, Marcus Alley, Neil Thakur, Song Han, William Dally, John M Pauly, et al. Deep generative adversarial networks for compressed sensing automates MRI. arXiv preprint arXiv:1706.00051, 2017.
 (49) Morteza Mardani, Hatef Monajemi, Vardan Papyan, Shreyas Vasanawala, David Donoho, and John Pauly. Recurrent generative adversarial networks for proximal learning and automated compressive image recovery. arXiv preprint arXiv:1711.10046, 2017.
 (50) Morteza Mardani, Qingyun Sun, Shreyas Vasawanala, Vardan Papyan, Hatef Monajemi, John Pauly, and David Donoho. Neural proximal gradient descent for compressive imaging. arXiv preprint arXiv:1806.03963, 2018.
 (51) Chris Metzler et al. D-AMP toolbox. https://github.com/ricedsp/DAMP_Toolbox, 2018.
 (52) Chris Metzler, Ali Mousavi, and Richard Baraniuk. Learned D-AMP: Principled neural network based compressive image recovery. In NeurIPS, pages 1772–1783, 2017.
 (53) Christopher A Metzler, Arian Maleki, and Richard G Baraniuk. BM3D-AMP: A new image recovery algorithm based on BM3D denoising. In Image Processing (ICIP), 2015 IEEE International Conference on, pages 3116–3120. IEEE, 2015.
 (54) Christopher A Metzler, Arian Maleki, and Richard G Baraniuk. From denoising to compressed sensing. IEEE Transactions on Information Theory, 62(9):5117–5144, 2016.
 (55) Dustin G Mixon and Soledad Villar. Sunlayer: Stable denoising with generative networks. arXiv preprint arXiv:1803.09319, 2018.
 (56) Sahand Negahban, Bin Yu, Martin J Wainwright, and Pradeep K Ravikumar. A unified framework for high-dimensional analysis of estimators with decomposable regularizers. In Advances in Neural Information Processing Systems, pages 1348–1356, 2009.
 (57) Dong Nie, Roger Trullo, Jun Lian, Caroline Petitjean, Su Ruan, Qian Wang, and Dinggang Shen. Medical image synthesis with contextaware generative adversarial networks. In International Conference on Medical Image Computing and ComputerAssisted Intervention, pages 417–425. Springer, 2017.
 (58) Roberto Imbuzeiro Oliveira. The lower tail of random quadratic forms, with applications to ordinary least squares and restricted eigenvalue properties. arXiv preprint arXiv:1312.2903, 2013.
 (59) Samet Oymak and Mahdi Soltanolkotabi. Overparameterized nonlinear learning: Gradient descent takes the shortest path? arXiv preprint, December 2018.
 (60) Samet Oymak and Mahdi Soltanolkotabi. Towards moderate overparameterization: global convergence guarantees for training shallow neural networks. arXiv preprint arXiv:1902.04674, 2019.
 (61) Parthe Pandit, Mojtaba Sahraee, Sundeep Rangan, and Alyson K Fletcher. Asymptotics of MAP inference in deep networks. arXiv preprint arXiv:1903.01293, 2019.
 (62) Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in pytorch. Open Review, 2017.
 (63) F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikitlearn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.
 (64) Saad Qaisar, Rana Muhammad Bilal, Wafa Iqbal, Muqaddas Naureen, and Sungyoung Lee. Compressive sensing: From theory to applications. Journal of Communications and networks, 15(5):443–456, 2013.
 (65) Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015.
 (66) Yaniv Romano, Michael Elad, and Peyman Milanfar. The little engine that could: Regularization by denoising (red). SIAM Journal on Imaging Sciences, 10(4):1804–1844, 2017.
 (67) Leonid I Rudin, Stanley Osher, and Emad Fatemi. Nonlinear total variation based noise removal algorithms. Physica D: nonlinear phenomena, 60(1–4):259–268, 1992.
 (68) Andrei A Rusu, Dushyant Rao, Jakub Sygnowski, Oriol Vinyals, Razvan Pascanu, Simon Osindero, and Raia Hadsell. Meta-learning with latent embedding optimization. arXiv preprint arXiv:1807.05960, 2018.
 (69) Thomas Schlegl, Philipp Seeböck, Sebastian M Waldstein, Ursula Schmidt-Erfurth, and Georg Langs. Unsupervised anomaly detection with generative adversarial networks to guide marker discovery. In International Conference on Information Processing in Medical Imaging, pages 146–157. Springer, 2017.
 (70) Jo Schlemper, Jose Caballero, Joseph V Hajnal, Anthony N Price, and Daniel Rueckert. A deep cascade of convolutional neural networks for dynamic MR image reconstruction. IEEE transactions on Medical Imaging, 37(2):491–503, 2017.
 (71) Philip Schniter, Sundeep Rangan, and Alyson K Fletcher. Vector approximate message passing for the generalized linear model. In ACSSC, pages 1525–1529. IEEE, 2016.
 (72) Viraj Shah and Chinmay Hegde. Solving linear inverse problems using gan priors: An algorithm with provable guarantees. arXiv preprint arXiv:1802.08406, 2018.
 (73) Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B (Methodological), pages 267–288, 1996.
 (74) Tijmen Tieleman and Geoffrey Hinton. Lecture 6.5 - RMSProp: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural networks for machine learning, 4(2):26–31, 2012.
 (75) Subarna Tripathi, Zachary C Lipton, and Truong Q Nguyen. Correction by projection: Denoising images with generative adversarial networks. arXiv preprint arXiv:1803.04477, 2018.
 (76) Joel A Tropp. Just relax: Convex programming methods for identifying sparse signals in noise. IEEE transactions on information theory, 52(3):1030–1051, 2006.
 (77) Dmitry Ulyanov, Andrea Vedaldi, and Victor Lempitsky. Deep image prior. arXiv preprint arXiv:1711.10925, 2017.
 (78) Aaron Van Den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew Senior, and Koray Kavukcuoglu. WaveNet: A generative model for raw audio. arXiv preprint arXiv:1609.03499, 2016.
 (79) Singanallur V Venkatakrishnan, Charles A Bouman, and Brendt Wohlberg. Plug-and-play priors for model based reconstruction. In GlobalSIP, 2013 IEEE, pages 945–948. IEEE, 2013.
 (80) Yilun Wang, Junfeng Yang, Wotao Yin, and Yin Zhang. A new alternating minimization algorithm for total variation image reconstruction. SIAM Journal on Imaging Sciences, 1(3):248–272, 2008.
 (81) F. Wasilewski. PyWavelets: Discrete wavelet transform in Python, 2010.
 (82) David W Winters, Barry D Van Veen, and Susan C Hagness. A sparsity regularization approach to the electromagnetic inverse scattering problem. IEEE transactions on antennas and propagation, 58(1):145–154, 2010.
 (83) Jelmer M Wolterink, Tim Leiner, Max A Viergever, and Ivana Išgum. Generative adversarial networks for noise reduction in low-dose CT. IEEE transactions on medical imaging, 36(12):2536–2545, 2017.
 (84) Jian Zhang, Shaohui Liu, Ruiqin Xiong, Siwei Ma, and Debin Zhao. Improved total variation based image compressive sensing recovery by nonlocal regularization. In Circuits and Systems (ISCAS), 2013 IEEE International Symposium on, pages 2836–2839. IEEE, 2013.
Appendix A Experimentation Details and Insights
Our algorithm CS-DIP is implemented in PyTorch using the RMSProp optimizer [74]. The learning rate, momentum, and number of update steps per set of measurements are held fixed across all datasets.
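For concreteness, the RMSProp update rule behind this optimizer can be sketched as follows. This is a minimal NumPy illustration of momentum-style RMSProp (matching the form of PyTorch's built-in implementation); the toy one-dimensional objective, step sizes, and iteration count are illustrative only and are not the settings used in our experiments.

```python
import numpy as np

def rmsprop_step(w, grad, sq_avg, mom, lr=1e-2, alpha=0.99, momentum=0.9, eps=1e-8):
    """One RMSProp-with-momentum step: scale the gradient by a running
    root-mean-square of past gradients, then apply a momentum buffer."""
    sq_avg = alpha * sq_avg + (1 - alpha) * grad ** 2
    mom = momentum * mom + grad / (np.sqrt(sq_avg) + eps)
    w = w - lr * mom
    return w, sq_avg, mom

# Toy run: minimize f(w) = (w - 3)^2 in one dimension.
w, sq_avg, mom = np.array([0.0]), np.zeros(1), np.zeros(1)
for _ in range(3000):
    grad = 2.0 * (w - 3.0)
    w, sq_avg, mom = rmsprop_step(w, grad, sq_avg, mom)
```

In the actual experiments the same update is applied to the network weights, with the gradient of the measurement loss supplied by automatic differentiation.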
We also made some dataset-specific design choices. On larger images such as x-ray and retinopathy scans, we found no difference when using random restarts of the initial seed. However, for smaller vectors such as MNIST digits, restarts did provide some benefit. As such, our experiments use 5 random restarts for MNIST and a single initial seed (no restarts) for x-ray and retinopathy images. To choose the hyperparameters in Eqn. 3, we ran a standard grid search and selected the best-performing setting. We used a similar grid-search procedure to choose dataset-specific hyperparameters for the baseline algorithms BM3D-AMP, TVAL3, and Lasso.
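The grid-search procedure can be sketched as follows. The hyperparameter names, grid values, and the stand-in `validation_loss` function below are illustrative placeholders rather than the actual values or code used in our experiments.

```python
import itertools
import numpy as np

def validation_loss(lr, lam):
    # Stand-in for "run the recovery algorithm with these hyperparameters
    # and report the reconstruction error"; this toy surface happens to be
    # minimized at lr = 1e-3, lam = 0.1.
    return (np.log10(lr) + 3.0) ** 2 + (np.log10(lam) + 1.0) ** 2

# Sweep a small grid and keep the best-performing setting.
grid = list(itertools.product([1e-4, 1e-3, 1e-2], [0.01, 0.1, 1.0]))
best_lr, best_lam = min(grid, key=lambda p: validation_loss(*p))
```

In practice `validation_loss` would invoke a full reconstruction run per grid point, so the grid is kept deliberately small.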
The network’s initial seed in Eqn. 3 is drawn with i.i.d. random Gaussian entries and then held fixed as we optimize over the network weights. We found a negligible difference when varying the dimension of the seed (within reason), as this only affects the number of channels in the network’s first layer. As such we set its dimension to a standard choice for DCGAN architectures.
We further note that the “Error vs. Iterations” curve of CS-DIP with RMSProp did not decrease monotonically for some learning rates, even though the error gradually decreased in all cases. We therefore implemented a stopping condition that chooses the reconstruction with the least error over the last 20 iterations. Note that this reconstruction is chosen based on the measurement loss; we never look at the ground-truth image.
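This stopping condition can be sketched as follows; function and variable names are illustrative, not taken from our released code.

```python
def best_recent(iterates, losses, window=20):
    """Return the iterate with the smallest measurement loss among the
    last `window` iterations; the ground-truth image is never consulted."""
    tail_losses = losses[-window:]
    tail_iterates = iterates[-window:]
    best_idx = min(range(len(tail_losses)), key=tail_losses.__getitem__)
    return tail_iterates[best_idx]
```

Here `losses` would hold the per-iteration measurement loss and `iterates` the corresponding reconstructions, so the selection uses only observed quantities.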
Appendix B Proofs for Section 4: Theoretical Justification for Early Stopping
In this section we prove our theoretical result, Theorem 4.1. We begin in Section B.1 with a summary of the notation used throughout. Next, we state some preliminary calculations in Section B.2. Then, we state a few key lemmas in Section B.3, with the proofs deferred to Appendix C. Finally, we complete the proof of Theorem 4.1 in Section B.4.
B.1 Notation
In this section we gather some notation used throughout the proofs. We write $\mathrm{relu}(z)=\max(z,0)$ for the ReLU activation, applied entrywise to vectors and matrices. For two matrices/vectors $A$ and $B$ of the same size we use $A\odot B$ to denote their entrywise Hadamard product. We also use $A\otimes B$ to denote their Kronecker product. For two matrices $A$ and $B$ with the same number of rows, we use $A\bullet B$ to denote the Khatri–Rao product, i.e., the matrix whose $i$th row is $a_i\otimes b_i$, with $a_i$ and $b_i$ the $i$th rows of $A$ and $B$. For a matrix $X$ we use $\mathrm{vect}(X)$ to denote the vector obtained by aggregating the rows of $X$ into a single vector. For a matrix $M$ we use $\sigma_{\min}(M)$ and $\|M\|$, which denote the minimum singular value and spectral norm of $M$
. Similarly, for a symmetric matrix $M$ we use $\lambda_{\min}(M)$ to denote its smallest eigenvalue.
B.2 Preliminaries
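Before proceeding, the row-wise Khatri–Rao product defined in the notation above can be sanity-checked with a small NumPy sketch (illustrative only, not part of the analysis).

```python
import numpy as np

def khatri_rao_rows(A, B):
    """Row-wise Khatri-Rao product: the i-th row of the output is the
    Kronecker product of the i-th rows of A and B."""
    assert A.shape[0] == B.shape[0], "A and B need the same number of rows"
    return np.stack([np.kron(a, b) for a, b in zip(A, B)])

A = np.arange(6.0).reshape(2, 3)   # 2 x 3
B = np.arange(4.0).reshape(2, 2)   # 2 x 2
C = khatri_rao_rows(A, B)          # 2 x 6: each row is a Kronecker product
```

Note this row-wise convention differs from the column-wise Khatri–Rao product implemented in some libraries (e.g. `scipy.linalg.khatri_rao`).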
In this section we carry out some simple calculations yielding formulas for the gradient and Jacobian mappings. We begin by noting that we can rewrite the gradient descent iterations in the form $W_{\tau+1} = W_\tau - \mu\,\nabla\mathcal{L}(W_\tau)$. Here, $\nabla\mathcal{L}(W) = \mathcal{J}^T(W)\,r(W)$, where $\mathcal{J}(W)$ is the Jacobian mapping associated to the network and $r(W) = f(W) - y$ is the misfit or residual vector. Expanding the Jacobian of the network and simplifying yields the expression in (8).
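The identity $\nabla\mathcal{L}(W)=\mathcal{J}^T(W)\,r(W)$ can be verified numerically on a toy example. The sketch below uses a small smooth tanh map in place of the ReLU network analyzed in the lemmas, purely to avoid nondifferentiability at the kink; all dimensions and names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 3))      # fixed matrix defining a toy map
y = rng.normal(size=4)           # target / measurement vector
w = rng.normal(size=3)           # parameters being optimized

def f(w):
    return np.tanh(X @ w)        # smooth nonlinear map R^3 -> R^4

def jacobian(w):
    # d f_i / d w_j = (1 - tanh(x_i . w)^2) * X_ij
    return (1.0 - np.tanh(X @ w) ** 2)[:, None] * X

def loss(w):
    return 0.5 * np.sum((f(w) - y) ** 2)

# Gradient via the identity grad L(w) = J(w)^T r(w), with r(w) = f(w) - y.
analytic_grad = jacobian(w).T @ (f(w) - y)

# Central finite differences as an independent check.
eps = 1e-6
fd_grad = np.array([(loss(w + eps * e) - loss(w - eps * e)) / (2 * eps)
                    for e in np.eye(3)])
```

The two gradients agree to finite-difference accuracy, confirming the chain-rule identity used in the derivation above.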
B.3 Lemmas for controlling the spectrum of the Jacobian and initial misfit
In this section we state a few lemmas concerning the spectral properties of the Jacobian mapping, its perturbations, and the initial misfit of the model, with the proofs deferred to Appendix C.
Lemma B.1 (Minimum singular value of the Jacobian at initialization).
Let and be random matrices with i.i.d. and entries and define the Jacobian mapping . Then as long as ,
holds with probability at least .
Lemma B.2 (Perturbation lemma).
Let be a matrix with i.i.d. entries, , and define the Jacobian mapping . Also let be a matrix with i.i.d. entries. Then,
holds for all obeying with probability at least .
Lemma B.3 (Spectral norm of the Jacobian).
Let be a matrix with i.i.d. entries, , and define the Jacobian mapping . Then,
holds for all with probability at least .
Lemma B.4 (Initial misfit).
Let be a matrix with i.i.d. entries with . Also let be a matrix with i.i.d. entries. Then
holds with probability at least .
B.4 Proof of Theorem 4.1
Consider a nonlinear leastsquares optimization problem of the form
with and . Suppose the Jacobian mapping associated with obeys the following three assumptions.
Assumption 1.
Fix a point . We have that .
Assumption 2.
Let denote a norm that is dominated by the Euclidean norm, i.e., holds for all . Fix a point and a number . For any satisfying , we have that .
Assumption 3.
For all , we have that .
Under these assumptions we can state the following theorem from [60].
Theorem B.5 (Nonsmooth Overparameterized Optimization).
We shall apply this theorem to our setting, with the parameter and nonlinear mapping as above. All that is needed to apply the theorem is to check that its assumptions hold. Per the assumptions of the theorem we use
To this aim note that using Lemma B.1 Assumption 1 holds with
with probability at least . Furthermore, by Lemma B.3 Assumption 3 holds with
with probability at least . All that remains for applying the theorem above is to verify that Assumption 2 holds with high probability.
In the above we have used Lemma B.4 to conclude that holds with probability at least . Thus, using Lemma B.2 all that remains is to show that
holds with and with probability at least . The latter is equivalent to
which can be rewritten in the form
which holds as long as the stated condition is satisfied. Thus Assumptions 1, 2, and 3 hold with high probability, so Theorem B.5 holds with high probability, and applying Theorem B.5 completes the proof.
Appendix C Proof of Lemmas for the Spectral Properties of the Jacobian
C.1 Proof of Lemma B.1
We prove the result in a normalized setting; the general result follows from a simple rescaling. Define the vectors
with the th column of . Using (B.2) we have
(11) 
To bound the minimum eigenvalue we state a result from [58].
Theorem C.1.
Assume
are i.i.d. random positive semidefinite matrices whose coordinates have bounded second moments. Define
(this is an entrywise expectation) and let be such that for all . Then for any we have
We shall apply this theorem with . To do this we need to calculate the various parameters in the theorem. We begin with and note that for ReLU we have
To calculate we have
Thus we can take . Therefore, using Theorem C.1 with we can conclude that
holds with probability at least as long as
Plugging this into (C.1) we conclude that with probability at least
C.2 Proof of Lemma B.2
We prove the result in a normalized setting; the general result follows from a simple rescaling. Based on (B.2) we have
Thus
(12)  
(13) 
where is the set of indices on which and have different signs, i.e., , and is the submatrix obtained by picking the columns corresponding to .
To continue further note that by Gordon’s lemma we have
with probability at least . In particular using we conclude that
(14) 
with probability at least . To continue further we state a lemma controlling the size of based on the size of the radius .
Lemma C.2 (sign changes in local neighborhood).
Let be a matrix with i.i.d. entries. Also for a matrix define . Then for any obeying
holds with probability at least .
C.3 Proof of Lemma C.2
To prove this result we utilize two lemmas from [60]. In these lemmas we use to denote the th smallest entry of after sorting its entries in terms of absolute value.
Lemma C.3.
Lemma C.4.
[60, Lemma C.3] Let . Also let be a matrix with i.i.d. entries. Then, with probability at least ,
Combining the latter two lemmas with we conclude that when
then with probability at least we have