We consider the well-studied compressed sensing problem of recovering an unknown signal $x \in \mathbb{R}^n$ by observing a set of noisy measurements $y \in \mathbb{R}^m$ of the form
$$y = Ax + \eta. \qquad (1)$$
Here $A \in \mathbb{R}^{m \times n}$ is a known measurement matrix, typically generated with random independent Gaussian entries. Since the number of measurements $m$ is smaller than the dimension $n$ of the unknown vector $x$, this is an under-determined system of noisy linear equations and hence ill-posed. There are many solutions, and some structure must be assumed on $x$ to have any hope of recovery. Pioneering research donoho2006compressed ; candes2006robust ; candes2005decoding established that if $x$ is assumed to be sparse in a known basis, a small number of measurements will be provably sufficient to recover the unknown vector in polynomial time using methods such as Lasso tibshirani1996regression .
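As a minimal sketch of the measurement model described above, the following NumPy snippet draws a random Gaussian measurement matrix, forms noisy measurements of a sparse signal, and shows that fitting the measurements alone does not identify the signal when there are fewer measurements than unknowns. The notation y = A x + eta, the function name `simulate_measurements`, and the 1/sqrt(m) scaling of the matrix entries are assumptions made for illustration.

```python
import numpy as np

def simulate_measurements(x, m, noise_std=0.0, seed=0):
    """Return (A, y) with y = A @ x + eta for a random Gaussian A of shape (m, n)."""
    rng = np.random.default_rng(seed)
    n = x.shape[0]
    # Scaling entries by 1/sqrt(m) is a common convention (assumed here).
    A = rng.standard_normal((m, n)) / np.sqrt(m)
    eta = noise_std * rng.standard_normal(m)
    return A, A @ x + eta

n, m = 100, 30
x = np.zeros(n)
x[[3, 41, 77]] = [1.0, -2.0, 0.5]          # a 3-sparse signal
A, y = simulate_measurements(x, m)

# With m < n the system is under-determined: the minimum-norm solution fits
# the measurements exactly but is generally not the true signal.
x_min_norm = np.linalg.pinv(A) @ y
print(np.allclose(A @ x_min_norm, y), np.linalg.norm(x_min_norm - x))
```

The minimum-norm solution is only one of infinitely many vectors consistent with the measurements, which is why a prior such as sparsity is needed.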
Sparsity approaches have proven successful, but more complex models with additional structure have been recently proposed such as model-based compressive sensing baraniuk2010model and manifold models hegde2008random ; hegde2012signal ; eftekhari2015new . Bora et al. bora2017compressed showed that deep generative models can be used as excellent priors for images. They also showed that backpropagation can be used to solve the signal recovery problem by performing gradient descent in the generative latent space. This method enabled image recovery with significantly fewer measurements compared to Lasso for a given reconstruction error. Compressed sensing using deep generative models was further improved in very recent work tripathi2018correction ; grover2018amortized ; kabkab2018task ; shah2018solving ; fletcher2017inference ; DBLP:journals/corr/abs-1802-04073 . Additionally, a theoretical analysis of the nonconvex gradient descent algorithm of bora2017compressed was proposed by Hand et al. hand2017global under some assumptions on the generative model.
Inspired by these impressive benefits of deep generative models, we chose to investigate the application of such methods for medical imaging, a canonical application of compressive sensing. A significant problem, however, is that all these previous methods require the existence of pre-trained models. While this has been achieved for various types of images, e.g. human faces of CelebA liu2015faceattributes via DCGAN radford2015unsupervised , it remains significantly more challenging for medical images wolterink2017generative ; schlegl2017unsupervised ; nie2017medical ; schlemper2017deep . Instead of addressing this problem in generative models, we found an easier way to circumvent it.
Surprising recent work by Ulyanov et al. ulyanov2017deep proposed Deep Image Prior (DIP), which uses untrained convolutional neural networks. In DIP-based schemes, a convolutional neural network generator (e.g. DCGAN) is initialized with random weights; these weights are subsequently optimized to make the network produce an output as close to the target image as possible. This procedure is unlearned, using no prior information from other images. The prior is enforced only by the fixed convolutional structure of the generator network.
Generators used for DIP are typically over-parameterized, i.e. the number of network weights is much larger than the output dimension. For this reason DIP has empirically been found to overfit to noise if run for too many iterations ulyanov2017deep . In this paper we theoretically prove that this phenomenon occurs with gradient descent and justify the use of early stopping and other regularization methods.
In Section 3 we propose DIP for compressed sensing (CS-DIP). Our basic method is as follows. Initialize a DCGAN generator with random weights; use gradient descent to optimize these weights such that the network produces an output which agrees with the observed measurements as much as possible. This unlearned method can be improved with a novel learned regularization technique, which regularizes the DCGAN weights throughout the optimization process.
In Section 4 we theoretically prove that DIP will fit any signal to zero error with gradient descent. Our result is established for a network with a single hidden layer and sufficient constant fraction over-parametrization. While it is expected that over-parametrized neural networks can fit any signal, the fact that gradient descent can provably solve this non-convex problem is interesting and provides theoretical justification for early stopping.
In Section 5 we empirically show that CS-DIP outperforms previous unlearned methods in many cases. While pre-trained or “learned” methods will likely perform better bora2017compressed , we have the advantage of not requiring a generative model trained over large datasets. As such, we can apply our method to various medical imaging datasets for which data acquisition is expensive and generative models are difficult to train.
2.1 Compressed Sensing: Classical and Unlearned Approaches
A classical assumption made in compressed sensing is that the vector $x$ is $k$-sparse in some basis such as wavelet or discrete cosine transform (DCT). Finding the sparsest solution to an underdetermined linear system of equations is NP-hard in general; however, if the matrix $A$ satisfies conditions such as the Restricted Eigenvalue Condition (REC) or Restricted Isometry Property (RIP) candes2006stable ; bickel2009simultaneous ; donoho2006compressed ; tibshirani1996regression , then $x$ can be recovered in polynomial time via convex relaxations tropp2006just or iterative methods. There is extensive compressed sensing literature regarding assumptions on $x$, numerous recovery algorithms, and variations of RIP and REC bickel2009simultaneous ; negahban2009unified ; agarwal2010fast ; bach2012optimization ; loh2011high .
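To make the "convex relaxation or iterative method" route concrete, here is a small sketch of iterative soft thresholding (ISTA) applied to the Lasso objective. It is a textbook solver chosen for illustration, not the scikit-learn implementation used in the experiments; the function names, problem sizes, and regularization weight `lam` are assumptions.

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of t * ||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def ista(A, y, lam, n_iter=500):
    """Minimize 0.5 * ||y - A x||^2 + lam * ||x||_1 by iterative soft thresholding."""
    L = np.linalg.norm(A, 2) ** 2          # Lipschitz constant of the smooth part
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        grad = A.T @ (A @ x - y)           # gradient of the least-squares term
        x = soft_threshold(x - grad / L, lam / L)
    return x

rng = np.random.default_rng(0)
n, m, k = 200, 80, 5
x_true = np.zeros(n)
support = rng.choice(n, size=k, replace=False)
x_true[support] = rng.choice([-1.0, 1.0], size=k) * (1.0 + rng.random(k))
A = rng.standard_normal((m, n)) / np.sqrt(m)
y = A @ x_true                              # noiseless measurements, m < n

x_hat = ista(A, y, lam=0.01, n_iter=2000)
rel_err = np.linalg.norm(x_hat - x_true) / np.linalg.norm(x_true)
print(rel_err)
```

Here the signal is sparse in the canonical basis; for images one would apply the same solver to coefficients in a DCT or wavelet basis.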
Compressed sensing methods have found many applications in imaging, for example the single-pixel camera (SPC) duarte2008single . Medical tomographic applications include x-ray radiography, microwave imaging, and magnetic resonance imaging (MRI) winters2010sparsity ; chen2008prior ; lustig2007sparse . Obtaining measurements for medical imaging can be costly, time-consuming, and in some cases dangerous to the patient qaisar2013compressive . As such, an important goal is to reduce the number of measurements while maintaining good reconstruction quality.
Aside from the classical use of sparsity, recent work has used other priors to solve linear inverse problems. Plug-and-play priors venkatakrishnan2013plug ; chan2017plug and Regularization by Denoising romano2017little have shown how image denoisers can be used to solve general linear inverse problems. A key example of this is BM3D-AMP, which applies a Block-Matching and 3D filtering (BM3D) denoiser to an Approximate Message Passing (D-AMP) algorithm metzler2016denoising ; metzler2015bm3d . AMP has also been applied to linear models in other contexts, e.g. schniter2016vector . Another related algorithm is TVAL3 zhang2013improved ; li2009user which leverages augmented Lagrangian multipliers to achieve impressive performance on compressed sensing problems. In many different settings, we compare our algorithm to these prior methods: BM3D-AMP, TVAL3, and Lasso.
2.2 Compressed Sensing: Learned Approaches
While sparsity in some chosen basis is well-established, recent work has shown better empirical performance when neural networks are used bora2017compressed . This success is attributed to the fact that neural networks are capable of learning image priors from very large datasets goodfellow2014generative ; kingma2013auto . There is significant recent work on solving linear inverse problems using various learned techniques, e.g. recurrent generative models mardani2017recurrent and auto-regressive models dave2018solving . Additionally approximate message passing (AMP) has been extended to a learned setting by Metzler et al. metzler2017learned .
Bora et al. bora2017compressed is the closest to our set-up. In this work the authors assume that the unknown signal is in the range of a pre-trained generative model such as a generative adversarial network (GAN) goodfellow2014generative or variational autoencoder (VAE) kingma2013auto . The recovery of the unknown signal is obtained via gradient descent in the latent space by searching for a signal that satisfies the measurements. This can be directly applied for linear inverse problems and more generally to any differentiable measurement process. Recent work has built upon these methods using new optimization techniques Chang17 , uncertainty autoencoders grover2018uncertainty , and other approaches dhar2018modeling ; kabkab2018task ; mixon2018sunlayer ; pandit2019asymptotics ; rusu2018meta . The key point is that all this prior work requires pre-trained generative models, in contrast to CS-DIP. Finally, there is significant ongoing work to understand DIP and develop related approaches, see e.g. heckel2018deep ; dittmer2018regularization .
3 Proposed Algorithm
Let $x \in \mathbb{R}^n$ be the signal that we are trying to reconstruct, $A \in \mathbb{R}^{m \times n}$ be the measurement matrix, and $\eta \in \mathbb{R}^m$ be independent noise. Given the measurement matrix $A$ and the observations $y = Ax + \eta$, we wish to reconstruct an $\hat{x}$ that is close to $x$.
A generative model is a deterministic function $G(z; w)$ which takes as input a seed $z$ and is parameterized by “weights” $w$, producing an output $G(z; w) \in \mathbb{R}^n$. These models have shown excellent performance generating real-life signals such as images goodfellow2014generative ; kingma2013auto and audio wavenet . We investigate deep convolutional generative models, a special case in which the model architecture has multiple cascaded layers of convolutional filters krizhevsky2012imagenet . In this paper we apply a DCGAN radford2015unsupervised model and restrict the signals to be images.
3.1 Compressed Sensing with Deep Image Prior (CS-DIP)
Our approach is to find a set of weights for the convolutional network such that the measurement matrix applied to the network output, i.e. $AG(z; w)$, matches the measurements $y$ we are given. Hence we initialize an untrained network $G(z; w)$ with some fixed $z$ and solve the following optimization problem:
$$w^* = \arg\min_w \|y - AG(z; w)\|^2. \qquad (2)$$
This is, of course, a non-convex problem because $G(z; w)$ is a complex feed-forward neural network. Still we can use gradient-based optimizers for any generative model and measurement process that is differentiable. Generator networks such as DCGAN are biased toward smooth, natural images due to their convolutional structure; thus the network structure alone provides a good prior for reconstructing images in problems such as inpainting and denoising ulyanov2017deep . Our finding is that this applies to general linear measurement processes. We restrict our solution to lie in the range of a convolutional neural network. If a sufficient number of measurements $m$ is given, we obtain an output such that $G(z; w^*) \approx x$.
Note that this method uses an untrained generative model and optimizes over the network weights $w$. In contrast, previous methods, such as that of Bora et al. bora2017compressed , use a trained model and optimize over the latent $z$-space, solving $z^* = \arg\min_z \|y - AG(z)\|^2$. We instead initialize a random $z$ with Gaussian i.i.d. entries and keep this fixed throughout the optimization process.
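The optimization over weights with a fixed seed can be sketched in a few lines. The snippet below is a toy stand-in, not the DCGAN and PyTorch setup used in the paper: it assumes a fully connected generator G(z; W) = V relu(W z) with a fixed output layer V, hand-written gradients, and a conservative step size. All names and sizes are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, k, d = 32, 8, 16, 512     # signal size, measurements, seed size, hidden units

# Ground-truth signal and noiseless Gaussian measurements y = A x.
x_true = np.sin(np.linspace(0.0, 2.0 * np.pi, n))
A = rng.standard_normal((m, n)) / np.sqrt(m)
y = A @ x_true

# Toy generator G(z; W) = V relu(W z): V and the seed z stay fixed, W is trained.
z = rng.standard_normal(k)
V = rng.standard_normal((n, d)) / np.sqrt(d)
W = rng.standard_normal((d, k))

def generator(W):
    return V @ np.maximum(W @ z, 0.0)

M = A @ V
step = 1.0 / (np.linalg.norm(M, 2) ** 2 * (z @ z))   # conservative step size

losses = []
for _ in range(2000):
    h = W @ z
    residual = A @ generator(W) - y          # A G(z; W) - y
    losses.append(float(residual @ residual))
    grad_h = (h > 0) * (M.T @ residual)      # backpropagate through the ReLU
    W -= step * np.outer(grad_h, z)          # gradient of 0.5 * ||residual||^2

x_hat = generator(W)
print(losses[0], losses[-1])
```

This toy network has no convolutional structure, so it only illustrates that the weights can be driven to agree with the measurements; the image prior discussed in the text comes from the convolutional architecture, which is not modeled here.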
In our algorithm we leverage the well-established total variation regularization rudin1992nonlinear ; wang2008new ; liu2018image , denoted as $TV(G(z; w))$. We also propose an additional learned regularization technique, $LR(w)$; note that without this technique, i.e. when $\lambda_L = 0$, our method is completely unlearned. Lastly we use early stopping, a phenomenon that will be analyzed theoretically in Section 4.
Thus the final optimization problem becomes
$$\hat{w} = \arg\min_w \|y - AG(z; w)\|^2 + R(w; \lambda_T, \lambda_L). \qquad (3)$$
The regularization term contains hyperparameters $\lambda_T$ and $\lambda_L$ for total variation and learned regularization:
$$R(w; \lambda_T, \lambda_L) = \lambda_T \, TV(G(z; w)) + \lambda_L \, LR(w). \qquad (4)$$
We now discuss this term.
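For illustration, the regularized objective can be evaluated as follows. This sketch assumes the anisotropic form of total variation (sum of absolute differences between neighbouring pixels); the text does not specify which variant is used, and the names `total_variation`, `objective`, `lam_tv`, and `lam_lr` are hypothetical.

```python
import numpy as np

def total_variation(img):
    """Anisotropic total variation: sum of absolute differences between neighbours."""
    dh = np.abs(np.diff(img, axis=1)).sum()   # horizontal differences
    dv = np.abs(np.diff(img, axis=0)).sum()   # vertical differences
    return float(dh + dv)

def objective(y, A, img, lam_tv, lam_lr=0.0, lr_value=0.0):
    """Measurement error plus weighted TV and (precomputed) learned-regularization terms."""
    residual = y - A @ img.ravel()
    data_term = float(residual @ residual)
    return data_term + lam_tv * total_variation(img) + lam_lr * lr_value

flat = np.ones((4, 4))                              # constant image: no variation
checker = np.indices((4, 4)).sum(axis=0) % 2        # checkerboard: maximal variation
print(total_variation(flat), total_variation(checker))
```

A constant image incurs no penalty while a checkerboard is penalized at every pixel boundary, which is why the TV term favors piecewise-smooth reconstructions.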
3.2 Learned Regularization
Without learned regularization CS-DIP relies only on linear measurements taken from one unknown image. We now introduce a novel method which leverages a small amount of training data to optimize regularization. In this case training data refers to measurements from additional ground truth of a similar type, e.g. other x-ray images.
To leverage this additional information, we pose Eqn. 3 as a Maximum a Posteriori (MAP) estimation problem and propose a novel prior on the weights of the generative model. This prior then acts as a regularization term, penalizing the model toward an optimal set of weights.
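For a set of weights $w$, we model the likelihood of the measurements $y$ and the prior on the weights $w$ as Gaussian distributions given by
$$p(y \mid w) \propto \exp\left(-\frac{\|y - AG(z; w)\|^2}{2\lambda_L}\right), \qquad p(w) \propto \exp\left(-\frac{1}{2}(w - \mu)^T \Sigma^{-1} (w - \mu)\right),$$
where $\mu$ is the mean and $\Sigma$ the covariance of the prior on the weights.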
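In this setting we want to find a set of weights $w^*$ that maximizes the posterior on $w$ given $y$, i.e.,
$$w^* = \arg\max_w \, p(w \mid y) = \arg\min_w \, \|y - AG(z; w)\|^2 + \lambda_L (w - \mu)^T \Sigma^{-1} (w - \mu).$$
This gives us the learned regularization term
$$LR(w) = (w - \mu)^T \Sigma^{-1} (w - \mu), \qquad (5)$$
where the coefficient $\lambda_L$ in Eqn. 4 controls the strength of the prior.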
Notice that when $\mu = 0$ and $\Sigma = I$ this regularization term is equivalent to $\ell_2$-regularization. Thus this method can be thought of as a more strategic version of standard weight decay.
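The quadratic penalty induced by a Gaussian prior with mean mu and covariance Sigma, and its reduction to plain weight decay, can be checked numerically. The function name `learned_regularization` and the small example vectors are illustrative assumptions.

```python
import numpy as np

def learned_regularization(w, mu, sigma):
    """Quadratic penalty (w - mu)^T Sigma^{-1} (w - mu) from a Gaussian prior N(mu, Sigma)."""
    diff = w - mu
    # Solve Sigma u = diff instead of forming the inverse explicitly.
    return float(diff @ np.linalg.solve(sigma, diff))

w = np.array([1.0, -2.0, 0.5])

# With mu = 0 and Sigma = I the penalty is the squared l2 norm (weight decay).
print(learned_regularization(w, np.zeros(3), np.eye(3)))   # squared l2 norm: 5.25

# A non-trivial prior rewards weights near mu and rescales each direction.
print(learned_regularization(w, w, np.eye(3)))
print(learned_regularization(w, np.zeros(3), 2.0 * np.eye(3)))
```

Weights equal to the prior mean incur no penalty, and a larger prior variance weakens the penalty in the corresponding direction.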
3.2.1 Learning the Prior Parameters
In the previous section, we introduced the learned regularization term defined in Eqn. 5. However we have not yet learned values for the parameters $(\mu, \Sigma)$ that incorporate prior knowledge of the network weights. We now propose a way to estimate these parameters.
Assume we have a set of measurements from different images , each obtained with a different measurement matrix . For each measurement we run CS-DIP to solve the optimization problem in Eqn. 3 and obtain an optimal set of weights . Note that when optimizing for the weights we only have access to the measurements , not the ground truth .
The number of weights in deep networks tends to be very large. As such, learning a distribution over each weight, i.e. estimating and , becomes intractable. We instead use a layer-wise approach: with network layers, we have and . Thus each weight within layer is modeled according to the same distribution. For simplicity we assume , i.e. that network weights are independent across layers. The process of estimating statistics from is described in Algorithm 1 of the appendix.
We use this learned $(\mu, \Sigma)$ in the regularization term from Eqn. 5 when reconstructing images from their measurements. We refer to this technique as learned regularization (LR). While this may seem analogous to batch normalization ioffe2015batch , note that we only use $(\mu, \Sigma)$ to penalize the $\ell_2$-norm of the weights and do not normalize the layer outputs themselves.
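The layer-wise idea can be sketched as follows. This is one plausible reading of the description above and is not a transcription of Algorithm 1 in the appendix, which remains the authoritative procedure: it assumes that each layer gets a single mean and a single variance, pooled over the weights obtained from several CS-DIP runs, and that the covariance is diagonal across layers. The function names and data layout are hypothetical.

```python
import numpy as np

def estimate_layerwise_prior(weight_sets):
    """Estimate one mean and one variance per layer.

    weight_sets[q][l] is a 1-D array holding the weights of layer l from run q.
    Returns (mu, sigma) where mu[l] is the pooled mean of layer l and sigma is a
    diagonal matrix whose entry (l, l) is the pooled variance of layer l.
    """
    num_layers = len(weight_sets[0])
    mu = np.zeros(num_layers)
    var = np.zeros(num_layers)
    for l in range(num_layers):
        pooled = np.concatenate([ws[l] for ws in weight_sets])
        mu[l] = pooled.mean()
        var[l] = pooled.var()
    return mu, np.diag(var)

def layerwise_penalty(layer_weights, mu, sigma):
    """Sum over layers of ||w_l - mu_l||^2 / sigma_ll (diagonal Gaussian prior)."""
    return float(sum(np.sum((w - mu[l]) ** 2) / sigma[l, l]
                     for l, w in enumerate(layer_weights)))

# Two hypothetical runs of a two-layer network.
runs = [
    [np.array([0.0, 2.0]), np.array([1.0, 1.0, 1.0])],
    [np.array([2.0, 4.0]), np.array([3.0, 3.0, 3.0])],
]
mu, sigma = estimate_layerwise_prior(runs)
print(mu, np.diag(sigma))
```

Because only two numbers per layer are estimated, a handful of runs is enough, which matches the motivation of avoiding a separate distribution for every individual weight.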
3.2.2 Discussion of Learned Regularization
The proposed CS-DIP does not require training if no learned regularization is used, i.e. if $\lambda_L = 0$ in Eqn. 3. This means that CS-DIP can be applied with measurements from only a single image and no prior information from similar images in a dataset.
Our next idea, learned regularization, utilizes a small amount of prior information, requiring access to measurements from a small number of similar images (roughly ). In contrast, other pre-trained models such as that of Bora et al. bora2017compressed require access to ground truth from a massive number of similar images (tens of thousands for CelebA). If such a large dataset is available, and if a good generative model can be trained on that dataset, we expect that pre-trained models bora2017compressed ; grover2018amortized ; kabkab2018task ; mardani2017recurrent would outperform our method. Our approach is instead more suitable for reconstruction problems where large amounts of data or good generative models are not readily available.
4 Theoretical Results
In this section we provide theoretical evidence to highlight the importance of early stopping for DIP-based approaches. Here we focus on denoising a noisy signal $y$ via DIP. The optimization problem in this case takes the form
$$w^* = \arg\min_w \|y - G(z; w)\|^2.$$
This is a special instance of Eqn. 2 with the measurement matrix $A = I$ corresponding to denoising. We focus on generators consisting of a single hidden-layer ReLU network with inputs, hidden units, and outputs. Using the generator model in this case is given by
where is the input, the input-to-hidden weights, and the hidden-to-output weights. We assume is fixed at random and train over using gradient descent. With these formulations in place, we are now ready to state our theoretical result.
Consider fitting a generator of the form to a signal with , , , and . Furthermore, assume is a random matrix with i.i.d. entries with . Starting from an initial weight matrix selected at random with i.i.d. entries, we run gradient descent updates of the form on the loss
with step size where . Assuming that
with a fixed numerical constant, then
holds for all with probability at least .
Our theoretical result shows that after many iterative updates, gradient descent will solve this non-convex optimization problem and fit any signal $y$, if the generator network is sufficiently wide. This occurs as soon as the number of hidden units exceeds the signal size by a constant factor. While our proof is for the case of $A = I$, a similar result can be shown for other measurement matrices, since the resulting is essentially a Gaussian i.i.d. measurement matrix of different output dimension. This result demonstrates that early stopping is necessary for DIP-based methods to be successful; otherwise the network can fit any signal, including one that is noisy or corrupted.
Our proof builds on theoretical ideas from Oymak et al. oymak2019towards which provide a general framework for establishing global convergence guarantees for overparameterized nonlinear learning problems based on various properties of the Jacobian mapping along the gradient descent trajectory. See also du2018gradient ; Oymak:2018aa and references therein for other related literature. We combine delicate tools from empirical process theory, random matrix theory, and matrix algebra to show that, starting from a random initialization, the Jacobian mapping across all iterates has favorable properties with high probability, hence facilitating convergence to a global optimum.
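The qualitative message of this result, that a sufficiently wide single hidden-layer generator can be driven by gradient descent to fit even pure noise while a narrow one cannot, can be observed in a small numerical experiment. The sketch below assumes a generator of the form V relu(W z) with a fixed random output layer V, an illustrative 1/sqrt(d) scaling, and a conservative step size; it does not reproduce the exact scalings, step size, or constants of the theorem, and the function name `fit_signal` is hypothetical.

```python
import numpy as np

def fit_signal(y, d, k=16, n_iter=1000, seed=0):
    """Gradient descent on 0.5 * ||V relu(W z) - y||^2 over W; returns the final squared error."""
    rng = np.random.default_rng(seed)
    n = y.shape[0]
    z = rng.standard_normal(k)                     # fixed input seed
    V = rng.standard_normal((n, d)) / np.sqrt(d)   # fixed hidden-to-output weights
    W = rng.standard_normal((d, k))                # trained input-to-hidden weights
    step = 1.0 / (np.linalg.norm(V, 2) ** 2 * (z @ z))
    for _ in range(n_iter):
        h = W @ z
        residual = V @ np.maximum(h, 0.0) - y
        W -= step * np.outer((h > 0) * (V.T @ residual), z)
    residual = V @ np.maximum(W @ z, 0.0) - y
    return float(residual @ residual)

rng = np.random.default_rng(1)
noise = rng.standard_normal(32)        # a "signal" that is pure noise
energy = float(noise @ noise)

err_narrow = fit_signal(noise, d=8)    # fewer hidden units than outputs
err_wide = fit_signal(noise, d=512)    # many more hidden units than outputs
print(err_narrow / energy, err_wide / energy)
```

With 8 hidden units the output is confined to an 8-dimensional subspace, so most of the noise energy cannot be fit; with 512 hidden units the error is driven essentially to zero, which is the overfitting behavior that early stopping is meant to prevent.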
5 Experiments
To replicate these experiments or run new experiments using this method, please see our GitHub repository at github.com/davevanveen/compsensing_dip.
The noise level in Table 1 corresponds to the variance of the noise vector $\eta$ in Eqn. 1, with each entry of $\eta$ drawn independently. These results indicate that LR tends to provide greater benefit with noisy signals and with fewer measurements.
5.1 Experimental Setup
Measurements: We evaluate our algorithm using two different measurement processes, i.e. matrices $A$. First we set the entries of $A$ to be Gaussian i.i.d. such that . Recall $m$ is the number of measurements, and $n$ is the number of pixels in the ground truth image. This measurement process is standard practice in compressed sensing literature, and hence we use it on each dataset. Additionally we use a Fourier measurement process common in MRI applications mardani2018neural ; mardani2017deep ; hammernik2018learning ; lehtinen2018noise2noise ; lustig2008compressed and evaluate it on the x-ray dataset.
Datasets: We use our algorithm to reconstruct both grayscale and RGB images. For grayscale we use the first 100 images in the test set of MNIST lecun1998gradient and also 60 random images from the Shenzhen Chest X-Ray Dataset jaeger2014two , downsampling a 512x512 crop to 256x256 pixels. For RGB we use retinopathy images from the STARE dataset hoover2000locating with 512x512 crops downsized to 128x128 pixels.
Baselines: We compare our algorithm to state-of-the-art unlearned methods such as BM3D-AMP metzler2016denoising ; metzler2015bm3d , TVAL3 li2011compressive ; li2009user ; zhang2013improved , and Lasso in a DCT basis ahmed1974discrete . We also evaluated the performance of Lasso in a Daubechies wavelet basis daubechies1988orthonormal ; wasilewski2010pywavelets but found this performed worse than Lasso - DCT on all datasets. Thus for simplicity we refer to Lasso - DCT as “Lasso” and do not include results of Lasso - Wavelet. To reconstruct RGB retinopathy images, we must use the colored version CBM3D-AMP. Unfortunately an RGB version of TVAL3 does not currently exist, although related TV algorithms such as FTVd perform similar tasks such as denoising RGB images wang2008new .
Metrics: To quantitatively evaluate the performance of our algorithm, we use per-pixel mean-squared error (MSE) between the reconstruction and true image , i.e. . Note that because these pixels are over the range , it’s possible for MSE to be greater than .
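The metric itself is a one-liner. The helper below averages the squared difference over all pixels and colour channels; the example range of [-1, 1] is an assumption used purely to illustrate how a per-pixel MSE above 1 can arise, and the name `per_pixel_mse` is hypothetical.

```python
import numpy as np

def per_pixel_mse(x_hat, x):
    """Mean of squared differences over all pixels (and channels, if any)."""
    x_hat = np.asarray(x_hat, dtype=float)
    x = np.asarray(x, dtype=float)
    return float(np.mean((x_hat - x) ** 2))

# Worst case for pixel values assumed to lie in [-1, 1]: every pixel is off by 2.
truth = -np.ones((2, 2))
recon = np.ones((2, 2))
print(per_pixel_mse(recon, truth))
```

Dividing by the number of pixels makes errors comparable across datasets of different resolution.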
Implementation: To find a set of weights that minimize Eqn. 3, we use PyTorch paszke2017automatic with a DCGAN architecture. For baselines BM3D-AMP and TVAL3, we use the repositories provided by the authors Metzler et al. metzler2018 and Li et al. li2013 , respectively. For the Lasso baseline reconstructions, we use scikit-learn scikit-learn . Section A in the appendix provides further details on our experimental procedures, e.g. choosing hyperparameters.
5.2 Experimental Results
Results: Learned Regularization
Figure: Per-pixel reconstruction error (MSE) vs. number of measurements. Vertical bars indicate 95% confidence intervals. BM3D-AMP frequently fails to converge for fewer than measurements on x-ray images, as denoted by error values far above the vertical axis.
We first evaluate the benefits of learned regularization by comparing our algorithm with and without learned regularization, i.e. and , respectively. The latter setting is an unlearned method, as we are not leveraging ($\mu, \Sigma$) from a specific dataset. In the former setting we first learn ($\mu, \Sigma$) from a particular set of x-ray images; we then evaluate on a different set of x-ray images. We compare these two settings with varying noise and across different numbers of measurements.
Our results in Table 1 show that learned regularization does indeed provide benefit. This benefit tends to increase with more noise or fewer measurements. Thus we can infer that assuming a learned Gaussian distribution over weights is useful, especially when the original signal is noisy or significantly compressed.
Results: Unlearned CS-DIP
For the remainder of this section, we evaluate our algorithm in the noiseless case without learned regularization, i.e. when $\eta = 0$ in Eqn. 1 and $\lambda_L = 0$ in Eqn. 3. Hence CS-DIP is completely unlearned; as such, we compare it to other state-of-the-art unlearned algorithms on various datasets and with different measurement matrices.
MNIST: In Figure 0(b) we plot reconstruction error with varying number of measurements $m$ of $n$ = 784. This demonstrates that our algorithm outperforms baselines in almost all cases. Figure 1(b) shows reconstructions for 75 measurements, while remaining reconstructions are in the appendix.
Chest X-Rays: In Figure 0(a) we plot reconstruction error with varying number of measurements $m$ of $n$ = 65536. Figure 1(a) shows reconstructions for 2000 measurements, while the remaining reconstructions are in the appendix. On this dataset we outperform all baselines except BM3D-AMP for higher $m$. However for lower $m$, e.g. when the ratio , BM3D-AMP often doesn’t converge. This finding seems to support the work of Metzler et al. metzler2015bm3d : BM3D-AMP performs impressively on higher $m$, e.g. , but recovery at lower sampling rates is not demonstrated.
Retinopathy: We plot reconstruction error with varying number of measurements $m$ of $n$ = 49152 in Figure 2(a) of the appendix. On this RGB dataset we quantitatively outperform all baselines except BM3D-AMP on higher $m$; however, even at these higher $m$, patches of green and purple pixels corrupt the image reconstructions as seen in Figure 9. Similar to x-ray for lower $m$, BM3D-AMP often fails to produce anything sensible. All retinopathy reconstructions are located in the appendix.
Fourier Measurement Process: All previous experiments used a measurement matrix containing Gaussian i.i.d. entries. We now consider the case where the measurement matrix is a subsampled Fourier matrix. That is, for a 2D image and a set of indices , the measurements we receive are given by , where is the 2D Fourier transform. We choose to be indices along radial lines, as shown in Figure 12 of the appendix; this choice of is common in literature candes2006robust and MRI applications mardani2017deep ; lustig2008compressed ; eksioglu2018denoising . We compare our algorithm to baselines on the x-ray dataset for radial lines in the Fourier domain, which corresponds to Fourier coefficients, respectively. We plot reconstruction error with varying number of Fourier coefficients in Figure 2(b) of the appendix, outperforming baselines BM3D-AMP and TVAL3. Reconstructions can also be found in the appendix.
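A subsampled Fourier measurement process along radial lines can be sketched with NumPy's FFT routines. The mask construction below is an illustrative assumption and does not reproduce the exact sampling pattern shown in the appendix; the function names are hypothetical, and the zero-filled inverse is included only as a sanity check, not as a reconstruction method from the paper.

```python
import numpy as np

def radial_mask(size, num_lines):
    """Boolean mask selecting frequencies along `num_lines` lines through the centre."""
    mask = np.zeros((size, size), dtype=bool)
    centre = size // 2
    radii = np.arange(-centre, size - centre)
    for angle in np.linspace(0.0, np.pi, num_lines, endpoint=False):
        rows = np.round(centre + radii * np.sin(angle)).astype(int)
        cols = np.round(centre + radii * np.cos(angle)).astype(int)
        keep = (rows >= 0) & (rows < size) & (cols >= 0) & (cols < size)
        mask[rows[keep], cols[keep]] = True
    return mask

def fourier_measurements(image, mask):
    """Centred 2D FFT of `image`, restricted to the frequencies selected by `mask`."""
    spectrum = np.fft.fftshift(np.fft.fft2(image, norm="ortho"))
    return spectrum[mask]

def zero_filled_reconstruction(measurements, mask):
    """Inverse FFT with all unobserved frequencies set to zero."""
    spectrum = np.zeros(mask.shape, dtype=complex)
    spectrum[mask] = measurements
    return np.real(np.fft.ifft2(np.fft.ifftshift(spectrum), norm="ortho"))

mask = radial_mask(16, 4)
image = np.ones((16, 16))
y = fourier_measurements(image, mask)
print(int(mask.sum()), y.shape)
```

Because the measurement operator is a masked FFT, it is differentiable and can replace the Gaussian matrix in the data-fidelity term without changing the rest of the optimization.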
Runtime: In Table 2 we show the runtimes of CS-DIP on the x-ray dataset. While runtime is not the focus of our work, because our algorithm can utilize a GPU, it is competitive with or faster than baseline algorithms. The baselines are implemented in MATLAB or scikit-learn scikit-learn and only leverage CPU, while we run our experiments on an NVIDIA GTX 1080-Ti.
6 Conclusion
We demonstrate compressed sensing recovery using untrained, randomly initialized convolutional neural networks. Our method outperforms previous state-of-the-art unlearned methods in most cases, especially when the number of obtained measurements is small. Additionally we propose a learned regularization method, which enforces a learned Gaussian prior on the network weights. This prior reduces reconstruction error, particularly for noisy or compressed measurements. Finally we show that a sufficiently wide single hidden-layer network can fit any signal, thus motivating regularization by early stopping.
References
- (1) Alekh Agarwal, Sahand Negahban, and Martin J Wainwright. Fast global convergence rates of gradient methods for high-dimensional statistical recovery. In Advances in Neural Information Processing Systems, pages 37–45, 2010.
- (2) Nasir Ahmed, T Natarajan, and Kamisetty R Rao. Discrete cosine transform. IEEE transactions on Computers, 100(1):90–93, 1974.
- (3) Muhammad Asim, Fahad Shamshad, and Ali Ahmed. Solving bilinear inverse problems using deep generative priors. CoRR, abs/1802.04073, 2018.
- (4) Francis Bach, Rodolphe Jenatton, Julien Mairal, Guillaume Obozinski, et al. Optimization with sparsity-inducing penalties. Foundations and Trends® in Machine Learning, 4(1):1–106, 2012.
- (5) Richard G Baraniuk, Volkan Cevher, Marco F Duarte, and Chinmay Hegde. Model-based compressive sensing. IEEE Transactions on Information Theory, 56(4):1982–2001, 2010.
- (6) Peter J Bickel, Ya’acov Ritov, Alexandre B Tsybakov, et al. Simultaneous analysis of lasso and dantzig selector. The Annals of Statistics, 37(4):1705–1732, 2009.
- (7) Ashish Bora, Ajil Jalal, Eric Price, and Alexandros G Dimakis. Compressed sensing using generative models. arXiv preprint arXiv:1703.03208, 2017.
- (8) Emmanuel J Candès, Justin Romberg, and Terence Tao. Robust uncertainty principles: Exact signal reconstruction from highly incomplete frequency information. IEEE Transactions on information theory, 52(2):489–509, 2006.
- (9) Emmanuel J Candes, Justin K Romberg, and Terence Tao. Stable signal recovery from incomplete and inaccurate measurements. Communications on pure and applied mathematics, 59(8):1207–1223, 2006.
- (10) Emmanuel J Candes and Terence Tao. Decoding by linear programming. IEEE transactions on information theory, 51(12):4203–4215, 2005.
- (11) Stanley H Chan, Xiran Wang, and Omar A Elgendy. Plug-and-play admm for image restoration: Fixed-point convergence and applications. IEEE Transactions on Computational Imaging, 3(1):84–98, 2017.
- (12) Jen-Hao Rick Chang, Chun-Liang Li, Barnabás Póczos, B. V. K. Vijaya Kumar, and Aswin C. Sankaranarayanan. One network to solve them all - solving linear inverse problems using deep projection models. CoRR, abs/1703.09912, 2017.
- (13) Guang-Hong Chen, Jie Tang, and Shuai Leng. Prior image constrained compressed sensing (piccs): a method to accurately reconstruct dynamic ct images from highly undersampled projection data sets. Medical physics, 35(2):660–663, 2008.
- (14) Ingrid Daubechies. Orthonormal bases of compactly supported wavelets. Communications on pure and applied mathematics, 41(7):909–996, 1988.
- (15) Akshat Dave, Anil Kumar Vadathya, Ramana Subramanyam, Rahul Baburajan, and Kaushik Mitra. Solving inverse computational imaging problems using deep pixel-level prior. arXiv preprint arXiv:1802.09850, 2018.
- (16) Manik Dhar, Aditya Grover, and Stefano Ermon. Modeling sparse deviations for compressed sensing using generative models. arXiv preprint arXiv:1807.01442, 2018.
- (17) Soren Dittmer, Tobias Kluth, Peter Maass, and Daniel Otero Baguer. Regularization by architecture: A deep prior approach for inverse problems. arXiv preprint arXiv:1812.03889, 2018.
- (18) David L Donoho. Compressed sensing. IEEE Transactions on info theory, 52(4):1289–1306, 2006.
- (19) Simon S Du, Xiyu Zhai, Barnabas Poczos, and Aarti Singh. Gradient descent provably optimizes over-parameterized neural networks. arXiv preprint arXiv:1810.02054, 2018.
- (20) Marco F Duarte, Mark A Davenport, Dharmpal Takhar, Jason N Laska, Ting Sun, Kevin F Kelly, and Richard G Baraniuk. Single-pixel imaging via compressive sampling. IEEE signal processing magazine, 25(2):83–91, 2008.
- (21) Armin Eftekhari and Michael B Wakin. New analysis of manifold embeddings and signal recovery from compressive measurements. Applied and Computational Harmonic Analysis, 39(1):67–109, 2015.
- (22) Ender M Eksioglu and A Korhan Tanc. Denoising amp for mri reconstruction: Bm3d-amp-mri. SIAM Journal on Imaging Sciences, 11(3):2090–2109, 2018.
- (23) Alyson K Fletcher and Sundeep Rangan. Inference in deep networks in high dimensions. arXiv preprint arXiv:1706.06549, 2017.
- (24) Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In NeurIPS, pages 2672–2680, 2014.
- (25) Aditya Grover and Stefano Ermon. Amortized variational compressive sensing. ICLR Workshop, 2018.
- (26) Aditya Grover and Stefano Ermon. Uncertainty autoencoders: Learning compressed representations via variational information maximization. arXiv preprint arXiv:1812.10539, 2018.
- (27) Kerstin Hammernik, Teresa Klatzer, Erich Kobler, Michael P Recht, Daniel K Sodickson, Thomas Pock, and Florian Knoll. Learning a variational network for reconstruction of accelerated mri data. Magnetic resonance in medicine, 79(6):3055–3071, 2018.
- (28) Paul Hand and Vladislav Voroninski. Global guarantees for enforcing deep generative priors by empirical risk. arXiv preprint arXiv:1705.07576, 2017.
- (29) Reinhard Heckel, Wen Huang, Paul Hand, and Vladislav Voroninski. Deep denoising: Rate-optimal recovery of structured signals with a deep prior. arXiv preprint arXiv:1805.08855, 2018.
- (30) Chinmay Hegde and Richard G Baraniuk. Signal recovery on incoherent manifolds. IEEE Transactions on Information Theory, 58(12):7204–7214, 2012.
- (31) Chinmay Hegde, Michael Wakin, and Richard Baraniuk. Random projections for manifold learning. In Advances in neural information processing systems, pages 641–648, 2008.
- (32) AD Hoover, Valentina Kouznetsova, and Michael Goldbaum. Locating blood vessels in retinal images by piecewise threshold probing of a matched filter response. IEEE Transactions on Medical imaging, 19(3):203–210, 2000.
- (33) Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.
- (34) Stefan Jaeger, Sema Candemir, Sameer Antani, Yì-Xiáng J Wáng, Pu-Xuan Lu, and George Thoma. Two public chest x-ray datasets for computer-aided screening of pulmonary diseases. Quantitative imaging in medicine and surgery, 4(6):475, 2014.
- (35) Maya Kabkab, Pouya Samangouei, and Rama Chellappa. Task-aware compressed sensing with generative adversarial networks. arXiv preprint arXiv:1802.01284, 2018.
- (36) Diederik P Kingma and Max Welling. Auto-encoding variational bayes. preprint arXiv:1312.6114, 2013.
- (37) Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012.
- (38) Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
- (39) Jaakko Lehtinen, Jacob Munkberg, Jon Hasselgren, Samuli Laine, Tero Karras, Miika Aittala, and Timo Aila. Noise2noise: Learning image restoration without clean data. arXiv preprint arXiv:1803.04189, 2018.
- (40) Chengbo Li. Compressive sensing for 3d data processing tasks: applications, models and algorithms. Technical report, Rice University, 2011.
- (41) Chengbo Li, Wotao Yin, and Yin Zhang. Tval3: Tv minimization by augmented lagrangian and alternating direction algorithms. https://www.caam.rice.edu/~optimization/L1/TVAL3/.
- (42) Chengbo Li, Wotao Yin, and Yin Zhang. User’s guide for tval3: Tv minimization by augmented lagrangian and alternating direction algorithms. CAAM report, 20(46-47):4, 2009.
- (43) Jiaming Liu, Yu Sun, Xiaojian Xu, and Ulugbek S Kamilov. Image restoration using total variation regularized deep image prior. arXiv preprint arXiv:1810.12864, 2018.
- (44) Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep learning face attributes in the wild. In Proceedings of International Conference on Computer Vision (ICCV), 2015.
- (45) Po-Ling Loh and Martin J Wainwright. High-dimensional regression with noisy and missing data: Provable guarantees with non-convexity. In NeurIPS, pages 2726–2734, 2011.
- (46) Michael Lustig, David Donoho, and John M Pauly. Sparse mri: The application of compressed sensing for rapid mr imaging. Magnetic resonance in medicine, 58(6):1182–1195, 2007.
- (47) Michael Lustig, David L Donoho, Juan M Santos, and John M Pauly. Compressed sensing mri. IEEE signal processing magazine, 25(2):72–82, 2008.
- (48) Morteza Mardani, Enhao Gong, Joseph Y Cheng, Shreyas Vasanawala, Greg Zaharchuk, Marcus Alley, Neil Thakur, Song Han, William Dally, John M Pauly, et al. Deep generative adversarial networks for compressed sensing automates mri. arXiv preprint arXiv:1706.00051, 2017.
- (49) Morteza Mardani, Hatef Monajemi, Vardan Papyan, Shreyas Vasanawala, David Donoho, and John Pauly. Recurrent generative adversarial networks for proximal learning and automated compressive image recovery. arXiv preprint arXiv:1711.10046, 2017.
- (50) Morteza Mardani, Qingyun Sun, Shreyas Vasanawala, Vardan Papyan, Hatef Monajemi, John Pauly, and David Donoho. Neural proximal gradient descent for compressive imaging. arXiv preprint arXiv:1806.03963, 2018.
- (51) Chris Metzler et al. D-amp toolbox. https://github.com/ricedsp/D-AMP_Toolbox, 2018.
- (52) Chris Metzler, Ali Mousavi, and Richard Baraniuk. Learned d-amp: Principled neural network based compressive image recovery. In NeurIPS, pages 1772–1783, 2017.
- (53) Christopher A Metzler, Arian Maleki, and Richard G Baraniuk. Bm3d-amp: A new image recovery algorithm based on bm3d denoising. In Image Processing (ICIP), 2015 IEEE International Conference on, pages 3116–3120. IEEE, 2015.
- (54) Christopher A Metzler, Arian Maleki, and Richard G Baraniuk. From denoising to compressed sensing. IEEE Transactions on Information Theory, 62(9):5117–5144, 2016.
- (55) Dustin G Mixon and Soledad Villar. Sunlayer: Stable denoising with generative networks. arXiv preprint arXiv:1803.09319, 2018.
- (56) Sahand Negahban, Bin Yu, Martin J Wainwright, and Pradeep K Ravikumar. A unified framework for high-dimensional analysis of M-estimators with decomposable regularizers. In Advances in Neural Information Processing Systems, pages 1348–1356, 2009.
- (57) Dong Nie, Roger Trullo, Jun Lian, Caroline Petitjean, Su Ruan, Qian Wang, and Dinggang Shen. Medical image synthesis with context-aware generative adversarial networks. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 417–425. Springer, 2017.
- (58) Roberto Imbuzeiro Oliveira. The lower tail of random quadratic forms, with applications to ordinary least squares and restricted eigenvalue properties. arXiv preprint arXiv:1312.2903, 2013.
- (59) Samet Oymak and Mahdi Soltanolkotabi. Overparameterized nonlinear learning: Gradient descent takes the shortest path? Preprint, December 2018.
- (60) Samet Oymak and Mahdi Soltanolkotabi. Towards moderate overparameterization: global convergence guarantees for training shallow neural networks. arXiv preprint arXiv:1902.04674, 2019.
- (61) Parthe Pandit, Mojtaba Sahraee, Sundeep Rangan, and Alyson K Fletcher. Asymptotics of map inference in deep networks. arXiv preprint arXiv:1903.01293, 2019.
- (62) Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in pytorch. Open Review, 2017.
- (63) F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.
- (64) Saad Qaisar, Rana Muhammad Bilal, Wafa Iqbal, Muqaddas Naureen, and Sungyoung Lee. Compressive sensing: From theory to applications. Journal of Communications and networks, 15(5):443–456, 2013.
- (65) Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015.
- (66) Yaniv Romano, Michael Elad, and Peyman Milanfar. The little engine that could: Regularization by denoising (red). SIAM Journal on Imaging Sciences, 10(4):1804–1844, 2017.
- (67) Leonid I Rudin, Stanley Osher, and Emad Fatemi. Nonlinear total variation based noise removal algorithms. Physica D: nonlinear phenomena, 60(1-4):259–268, 1992.
- (68) Andrei A Rusu, Dushyant Rao, Jakub Sygnowski, Oriol Vinyals, Razvan Pascanu, Simon Osindero, and Raia Hadsell. Meta-learning with latent embedding optimization. arXiv preprint arXiv:1807.05960, 2018.
- (69) Thomas Schlegl, Philipp Seeböck, Sebastian M Waldstein, Ursula Schmidt-Erfurth, and Georg Langs. Unsupervised anomaly detection with generative adversarial networks to guide marker discovery. In International Conference on Information Processing in Medical Imaging, pages 146–157. Springer, 2017.
- (70) Jo Schlemper, Jose Caballero, Joseph V Hajnal, Anthony N Price, and Daniel Rueckert. A deep cascade of convolutional neural networks for dynamic mr image reconstruction. IEEE transactions on Medical Imaging, 37(2):491–503, 2017.
- (71) Philip Schniter, Sundeep Rangan, and Alyson K Fletcher. Vector approximate message passing for the generalized linear model. In ACSSC, pages 1525–1529. IEEE, 2016.
- (72) Viraj Shah and Chinmay Hegde. Solving linear inverse problems using gan priors: An algorithm with provable guarantees. arXiv preprint arXiv:1802.08406, 2018.
- (73) Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B (Methodological), pages 267–288, 1996.
- (74) Tijmen Tieleman and Geoffrey Hinton. Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural networks for machine learning, 4(2):26–31, 2012.
- (75) Subarna Tripathi, Zachary C Lipton, and Truong Q Nguyen. Correction by projection: Denoising images with generative adversarial networks. arXiv preprint arXiv:1803.04477, 2018.
- (76) Joel A Tropp. Just relax: Convex programming methods for identifying sparse signals in noise. IEEE transactions on information theory, 52(3):1030–1051, 2006.
- (77) Dmitry Ulyanov, Andrea Vedaldi, and Victor Lempitsky. Deep image prior. arXiv preprint arXiv:1711.10925, 2017.
- (78) Aaron Van Den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew Senior, and Koray Kavukcuoglu. Wavenet: A generative model for raw audio. arXiv preprint arXiv:1609.03499, 2016.
- (79) Singanallur V Venkatakrishnan, Charles A Bouman, and Brendt Wohlberg. Plug-and-play priors for model based reconstruction. In GlobalSIP, 2013 IEEE, pages 945–948. IEEE, 2013.
- (80) Yilun Wang, Junfeng Yang, Wotao Yin, and Yin Zhang. A new alternating minimization algorithm for total variation image reconstruction. SIAM Journal on Imaging Sciences, 1(3):248–272, 2008.
- (81) F Wasilewski. Pywavelets: Discrete wavelet transform in python, 2010.
- (82) David W Winters, Barry D Van Veen, and Susan C Hagness. A sparsity regularization approach to the electromagnetic inverse scattering problem. IEEE transactions on antennas and propagation, 58(1):145–154, 2010.
- (83) Jelmer M Wolterink, Tim Leiner, Max A Viergever, and Ivana Išgum. Generative adversarial networks for noise reduction in low-dose ct. IEEE transactions on medical imaging, 36(12):2536–2545, 2017.
- (84) Jian Zhang, Shaohui Liu, Ruiqin Xiong, Siwei Ma, and Debin Zhao. Improved total variation based image compressive sensing recovery by nonlocal regularization. In Circuits and Systems (ISCAS), 2013 IEEE International Symposium on, pages 2836–2839. IEEE, 2013.
Appendix A Experimentation Details and Insights
Our algorithm CS-DIP is implemented in PyTorch using the RMSProp optimizer [74] with learning rate , momentum , and update steps for every set of measurements. These parameters are the same across all datasets.
We also made some dataset-specific design choices. On larger images such as x-ray () and retinopathy (), we found no difference using random restarts of the initial seed . However, for smaller vectors such as MNIST (), restarts did provide some benefit. As such, our experiments utilize 5 random restarts for MNIST and one initial seed (no restarts) for x-ray and retinopathy images. For choosing hyperparameters and in Eqn. 3, we used a standard grid search and selected the best-performing values. We used a similar grid search procedure for choosing dataset-specific hyperparameters in baseline algorithms BM3D-AMP, TVAL3, and Lasso.
The network’s initial seed in Eqn. 3 is initialized with random Gaussian i.i.d. entries and then held fixed as we optimize over network weights . We found negligible difference when varying the dimension of (within reason), as this only affects the number of channels in the network’s first layer. As such we set the dimension of to be , a standard choice for DCGAN architectures.
We further note that the “Error vs. Iterations” curve of CS-DIP with RMSProp did not monotonically decrease for some learning rates, even though error gradually decreased in all cases. As such, we implemented a stopping condition which chooses the reconstruction with the least error over the last 20 iterations. Note that we choose this reconstruction based on measurement loss and do not look at the ground truth image.
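The stopping condition described above can be sketched roughly as follows. This is a minimal NumPy illustration, not the CS-DIP code itself: the names `measurement_loss`, `select_reconstruction`, `run_with_stopping_rule`, and the callback `step_fn` (standing in for one optimizer update that returns the current reconstruction) are all hypothetical, and the squared-error form of the measurement loss is an assumption.

```python
from collections import deque

import numpy as np


def measurement_loss(A, y, x_hat):
    """Squared measurement error ||y - A x_hat||^2 (assumed form); uses no ground truth."""
    residual = y - A @ x_hat
    return float(residual @ residual)


def select_reconstruction(history, window=20):
    """Pick the (loss, x_hat) pair with the smallest loss among the last `window` entries."""
    recent = list(history)[-window:]
    return min(recent, key=lambda pair: pair[0])


def run_with_stopping_rule(step_fn, A, y, num_steps, window=20):
    """Call the hypothetical `step_fn()` num_steps times and apply the windowed rule."""
    history = deque(maxlen=window)  # only the last `window` iterates are kept
    for _ in range(num_steps):
        x_hat = np.array(step_fn(), dtype=float)  # copy so later updates cannot alias it
        history.append((measurement_loss(A, y, x_hat), x_hat))
    return select_reconstruction(history, window)
```

Because the selection depends only on the measurements and the measurement matrix, the rule can be applied at test time without access to the true image.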
Appendix B Proof of Section 4: Theoretical Justification for Early Stopping
In this section we prove our theoretical result in Theorem 4.1. We begin with a summary of the notation we use throughout in Section B.1. Next, we state some preliminary calculations in Section B.2. Then, we state a few key lemmas in Section B.3 with the proofs deferred to Appendix C. Finally, we complete the proof of Theorem 4.1 in Section B.4.
B.1 Notation
In this section we gather some notation used throughout the proofs. We use $\phi(z)=\mathrm{ReLU}(z)=\max(0,z)$ with $\phi'(z)=\mathbb{I}_{\{z\ge 0\}}$. For two matrices/vectors $x$ and $y$ of the same size we use $x\odot y$ to denote the entrywise Hadamard product of these two matrices/vectors. We also use $x\otimes y$ to denote their Kronecker product. For two matrices $B$ and $C$ with the same number of rows, we use the Khatri-Rao product as the matrix $A=B*C$ with rows $A_i$ given by $A_i=B_i\otimes C_i$. For a matrix $M$ with $m$ rows we use $\mathrm{vect}(M)$ to denote a vector obtained by aggregating the rows of the matrix $M$ into a vector, i.e. $\mathrm{vect}(M)=[M_1\ M_2\ \ldots\ M_m]^T$. For a matrix $X$ we use $\sigma_{\min}(X)$ and $\|X\|$ to denote the minimum singular value and spectral norm of $X$. Similarly, for a symmetric matrix $M$ we use $\lambda_{\min}(M)$ to denote its smallest eigenvalue.
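For readers who prefer code, the operations named in this notation section (Hadamard, Kronecker, and row-wise Khatri-Rao products, the row-stacking vect operator, minimum singular value, spectral norm, and smallest eigenvalue) can be sketched in NumPy. The helper names below are illustrative assumptions and do not come from the paper.

```python
import numpy as np


def khatri_rao_rows(B, C):
    """Row-wise Khatri-Rao product: row i of the result is kron(B[i], C[i])."""
    if B.shape[0] != C.shape[0]:
        raise ValueError("B and C must have the same number of rows")
    return np.einsum("ij,ik->ijk", B, C).reshape(B.shape[0], -1)


def vect(M):
    """Stack the rows of M into one long vector (row-major order)."""
    return np.asarray(M).reshape(-1)


def sigma_min(M):
    """Minimum singular value of M."""
    return np.linalg.svd(M, compute_uv=False).min()


def spectral_norm(M):
    """Largest singular value of M."""
    return np.linalg.norm(M, 2)


def lambda_min(S):
    """Smallest eigenvalue of a symmetric matrix S."""
    return np.linalg.eigvalsh(S).min()


x = np.array([1.0, 2.0])
y = np.array([3.0, 4.0])
hadamard = x * y            # entrywise (Hadamard) product
kronecker = np.kron(x, y)   # Kronecker product
```

The Hadamard product maps to plain elementwise multiplication and the Kronecker product to `np.kron`, so only the Khatri-Rao product and vect need small helpers.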
B.2 Preliminaries
In this section we carry out some simple calculations yielding simple formulas for the gradient and Jacobian mappings. We begin by noting that we can rewrite the gradient descent iterations in the form
is the Jacobian mapping associated to the network and
is the misfit or residual vector. Note that
This in turn yields
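As a hedged illustration of writing gradient descent in terms of a Jacobian and a residual, the NumPy sketch below uses a toy one-hidden-layer model f(W) = V ReLU(W z) with V and z held fixed. The model, the function names `forward`, `jacobian`, and `gradient_descent`, and the loss 0.5 * ||f(W) - y||^2 are assumptions made for illustration; they are not claimed to match the exact setup of Theorem 4.1.

```python
import numpy as np


def relu(t):
    return np.maximum(t, 0.0)


def forward(W, V, z):
    """Toy network output f(W) = V relu(W z) (assumed model)."""
    return V @ relu(W @ z)


def jacobian(W, V, z):
    """Jacobian of forward with respect to vect(W) (rows stacked), shape (n, k*d)."""
    active = (W @ z > 0).astype(float)       # ReLU derivative at the pre-activations
    return np.kron(V * active, z[None, :])   # block j equals V[:, j] * active[j] * z^T


def gradient_descent(W0, V, z, y, step, num_steps):
    """Iterate vect(W) <- vect(W) - step * J(W)^T r(W), with residual r(W) = f(W) - y."""
    W = np.array(W0, dtype=float)
    losses = []
    for _ in range(num_steps):
        r = forward(W, V, z) - y             # misfit / residual vector
        losses.append(0.5 * float(r @ r))
        grad = jacobian(W, V, z).T @ r       # gradient of 0.5 * ||r||^2
        W -= step * grad.reshape(W.shape)
    return W, losses
```

The update touches the weights only through the product of the transposed Jacobian and the residual, which is why the spectrum of the Jacobian governs how quickly the residual can shrink.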
B.3 Lemmas for controlling the spectrum of the Jacobian and initial misfit
In this section we state a few lemmas concerning the spectral properties of the Jacobian mapping, its perturbation and initial misfit of the model with the proofs deferred to Appendix C.
Lemma B.1 (Minimum singular value of the Jacobian at initialization).
Let and be random matrices with i.i.d. and entries and define the Jacobian mapping . Then as long as ,
holds with probability at least .
Lemma B.2 (Perturbation lemma).
Let be a matrix with i.i.d. entries, , and define the Jacobian mapping . Also let be a matrix with i.i.d. entries. Then,
holds for all obeying with probability at least .
Lemma B.3 (Spectral norm of the Jacobian).
Let be a matrix with i.i.d. entries, , and define the Jacobian mapping . Then,
holds for all with probability at least .
Lemma B.4 (Initial misfit).
Let be a matrix with i.i.d. entries with . Also let be a matrix with i.i.d. entries. Then
holds with probability at least .
B.4 Proof of Theorem 4.1
Consider a nonlinear least-squares optimization problem of the form
with and . Suppose the Jacobian mapping associated with obeys the following three assumptions.
Assumption 1. Fix a point . We have that .
Assumption 2. Let denote a norm that is dominated by the Euclidean norm i.e. holds for all . Fix a point and a number . For any satisfying , we have that .
Assumption 3. For all , we have that .
Under these assumptions we can state the following theorem from [59].
Theorem B.5 (Non-smooth Overparameterized Optimization).
We shall apply this theorem to the case where the parameter is and the nonlinear mapping is given by and . All that is needed to apply this theorem is to check that the assumptions hold. Per the assumptions of the theorem we use
with probability at least . All that remains for applying the theorem above is to verify Assumption 2 holds with high probability
holds with and with probability at least . The latter is equivalent to
which can be rewritten in the form
Appendix C Proof of Lemmas for the Spectral Properties of the Jacobian
C.1 Proof of Lemma B.1
We prove the result for ; the general result follows from a simple rescaling. Define the vectors
with the th column of . Using (B.2) we have
To bound the minimum eigenvalue we state a result from [58].
Theorem C.1.
Assume are i.i.d. random positive semidefinite matrices whose coordinates have bounded second moments. Define (this is an entry-wise expectation) and
Let be such that for all . Then for any we have
We shall apply this theorem with . To do this we need to calculate the various parameters in the theorem. We begin with and note that for ReLU we have
To calculate we have
Thus we can take . Therefore, using Theorem C.1 with we can conclude that
holds with probability at least as long as
Plugging this into (C.1) we conclude that with probability at least
C.2 Proof of Lemma B.2
We prove the result for ; the general result follows from a simple rescaling. Based on (B.2) we have
where is the set of indices where and have different signs i.e. and is a submatrix obtained by picking the columns corresponding to .
To continue further note that by Gordon’s lemma we have
with probability at least . In particular using we conclude that
with probability at least . To continue further we state a lemma controlling the size of based on the size of the radius .
Lemma C.2 (sign changes in local neighborhood).
Let be a matrix with i.i.d. entries. Also for a matrix define . Then for any obeying
holds with probability at least .
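To make the quantity in this lemma concrete, the following NumPy sketch counts how many hidden units change their ReLU activation when a Gaussian weight matrix is perturbed within a neighborhood of a given radius. It is only an empirical illustration under stated assumptions: the function names, the dimensions, the unit-norm input `z`, and the use of the Frobenius norm to measure the radius are all hypothetical choices, and the code does not reproduce the bound stated in the lemma.

```python
import numpy as np


def activation_pattern(W, z):
    """Boolean vector marking which hidden units are active (W z > 0)."""
    return (W @ z) > 0


def count_sign_changes(W, W0, z):
    """Number of hidden units whose activation differs between W and W0."""
    return int(np.sum(activation_pattern(W, z) != activation_pattern(W0, z)))


rng = np.random.default_rng(0)
k, d = 500, 20                               # assumed sizes, for illustration only
W0 = rng.standard_normal((k, d))             # i.i.d. Gaussian reference weights
z = rng.standard_normal(d)
z /= np.linalg.norm(z)                       # unit-norm input (assumption)

direction = rng.standard_normal((k, d))
direction /= np.linalg.norm(direction)       # unit Frobenius norm

for radius in (0.0, 0.1, 1.0, 10.0):
    flips = count_sign_changes(W0 + radius * direction, W0, z)
    print(f"radius={radius:5.1f}  sign changes={flips} of {k}")
```

Only units whose pre-activation under the reference weights is small in absolute value can flip under a small perturbation, which is the intuition behind bounding the number of sign changes by the size of the neighborhood.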
C.3 Proof of Lemma C.2
To prove this result we utilize two lemmas from [60]. In these lemmas we use to denote the th smallest entry of after sorting its entries in terms of absolute value.
[60, Lemma C.2] Given an integer , suppose
[60, Lemma C.3] Let . Also let be a matrix with i.i.d. entries. Then, with probability at least ,
Combining the latter two lemmas with we conclude that when
then with probability at least we have