1 Introduction
Encoding and decoding are central problems in communication (MacKay, 2003). Compressed sensing (CS) provides a framework that separates encoding and decoding into independent measurement and reconstruction processes (Candes et al., 2006; Donoho, 2006). Unlike commonly used autoencoding models (Bourlard & Kamp, 1988; Kingma & Welling, 2013; Rezende et al., 2014), which feature end-to-end trained encoder and decoder pairs, CS reconstructs signals from low-dimensional measurements via online optimisation. This model architecture is highly flexible and sample efficient: high-dimensional signals can be reconstructed from a few random measurements with little or no training at all. CS has been successfully applied in scenarios where measurements are noisy and expensive to take, such as in MRI (Lustig et al., 2007). Its sample efficiency enables the development of, for example, the “single pixel camera”, which reconstructs a full resolution image from a single light sensor (Duarte et al., 2008).
However, the wide application of CS, especially in processing large-scale data where modern deep learning approaches thrive, is hindered by its assumption of sparse signals and the slow optimisation process for reconstruction. Recently, Bora et al. (2017) combined CS with separately trained neural network generators. Although these pre-trained neural networks were not optimised for CS, they demonstrated reconstruction performance superior to existing methods such as the Lasso (Tibshirani, 1996). Here we propose the deep compressed sensing (DCS) framework in which neural networks can be trained from scratch for both measuring and online reconstruction. We show that this framework leads naturally to a family of models, including GANs (Goodfellow et al., 2014), which can be derived by training the measurement functions with different objectives. In summary, this work contributes the following:
We demonstrate how to train deep neural networks within the CS framework.

We show that a meta-learned reconstruction process leads to a more accurate and orders of magnitude faster method compared with previous models.

We develop a new GAN training algorithm based on latent optimisation, which improves GAN performance. The non-saturating generator loss emerges as a measurement error.

We extend our framework to training semi-supervised GANs, and show that latent optimisation results in semantically meaningful latent spaces.
Notations
We use bold letters for vectors and matrices and normal letters for scalars. $\mathbb{E}_{p(\mathbf{x})}[f(\mathbf{x})]$ indicates taking the expectation of $f(\mathbf{x})$ over the distribution $p(\mathbf{x})$. We use subscripts of Greek letters to indicate function parameters. For example, $G_\theta$ is a function parametrised by $\theta$.
2 Background
2.1 Compressed Sensing
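Compressed sensing aims to recover a signal $\mathbf{x}$ from a linear measurement $\mathbf{y}$:
$$\mathbf{y} = \mathbf{F}\,\mathbf{x} + \boldsymbol{\epsilon} \tag{1}$$
where $\mathbf{F}$ is the $C \times D$ measurement matrix, and $\boldsymbol{\epsilon}$ is the measurement noise, which is usually assumed to be Gaussian distributed.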
$\mathbf{F}$ is typically a “wide” matrix, such that $C \ll D$. As a result, the measurement $\mathbf{y}$ has much lower dimensionality than the original signal, and solving for $\mathbf{x}$ is generally impossible for such underdetermined problems. The elegant CS theory shows that one can nearly perfectly recover $\mathbf{x}$ with high probability given a random matrix $\mathbf{F}$ and sparse $\mathbf{x}$ (Donoho, 2006; Candes et al., 2006). In practice, the requirement that $\mathbf{x}$ be sparse can be replaced by sparsity in a set of basis $\boldsymbol{\Phi}$, such as the Fourier basis or wavelets, so that $\boldsymbol{\Phi}\mathbf{x}$ can be non-sparse signals such as natural images. Here we omit the basis $\boldsymbol{\Phi}$ for brevity; the linear transform from $\boldsymbol{\Phi}$ does not affect our following discussion.
At the centre of CS theory is the Restricted Isometry Property (RIP), which is defined for $\mathbf{F}$ and the difference between signals $\mathbf{x}_1 - \mathbf{x}_2$ as
$$(1 - \delta)\,\|\mathbf{x}_1 - \mathbf{x}_2\|_2^2 \;\le\; \|\mathbf{F}(\mathbf{x}_1 - \mathbf{x}_2)\|_2^2 \;\le\; (1 + \delta)\,\|\mathbf{x}_1 - \mathbf{x}_2\|_2^2 \tag{2}$$
where $\delta \in (0, 1)$ is a small constant. (The theory can also be proved from the closely related and more general Restricted Eigenvalue condition (Bora et al., 2017). We focus on RIP in this form for its more straightforward connection with the training loss; see section 3.1.) The RIP states that the projection from $\mathbf{F}$ preserves the distance between two signals bounded by factors of $1 - \delta$ and $1 + \delta$. This property holds with high probability for various random matrices $\mathbf{F}$ and sparse signals $\mathbf{x}$. It guarantees that minimising the measurement error
$$\hat{\mathbf{x}} = \arg\min_{\mathbf{x}} \|\mathbf{y} - \mathbf{F}\,\mathbf{x}\|_2^2 \tag{3}$$
under the constraint that $\mathbf{x}$ is sparse leads to accurate reconstruction $\hat{\mathbf{x}} \approx \mathbf{x}$ with high probability (Donoho, 2006; Candes et al., 2006). This constrained optimisation problem is computationally intensive, which is the price paid for a measuring process that only requires sparse random projections of the signals (Baraniuk, 2007).
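As a rough numerical illustration of the distance-preserving property in eq. 2, the following sketch draws a random Gaussian measurement matrix and checks how far the ratio between measured and original squared distances strays from 1 for pairs of sparse signals. The sizes `D`, `C`, `k`, the number of pairs and the helper `sparse_signal` are arbitrary choices made for this example, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
D, C, k = 400, 100, 5      # signal size, number of measurements, sparsity (illustrative values)

# Random Gaussian measurement matrix, scaled so that E[||F x||^2] = ||x||^2.
F = rng.normal(size=(C, D)) / np.sqrt(C)

def sparse_signal():
    """Return a D-dimensional vector with exactly k non-zero entries (hypothetical helper)."""
    x = np.zeros(D)
    support = rng.choice(D, size=k, replace=False)
    x[support] = rng.normal(size=k)
    return x

# Ratio ||F (x1 - x2)||^2 / ||x1 - x2||^2 for many random pairs of sparse signals.
ratios = []
for _ in range(200):
    d = sparse_signal() - sparse_signal()
    ratios.append(np.sum((F @ d) ** 2) / np.sum(d ** 2))
ratios = np.array(ratios)

# Empirical analogue of delta in eq. 2: the largest deviation of the ratio from 1.
delta_hat = np.max(np.abs(ratios - 1.0))
print(f"ratio range: [{ratios.min():.2f}, {ratios.max():.2f}], empirical delta: {delta_hat:.2f}")
```

This only probes a finite sample of pairs, so it is a sanity check rather than a verification of the RIP, which is a statement about all sparse signals.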
2.2 Compressed Sensing using Generative Models
The requirement of sparsity poses a strong restriction on CS. Sparse bases, such as the Fourier basis or wavelets, only partially relieve this constraint, since they are restricted to domains known to be sparse in these bases and cannot adapt to data distributions. Recently, Bora et al. (2017) proposed compressed sensing using generative models (CSGM) to relax this requirement. This model uses a pre-trained deep neural network $G_\theta$ (from a VAE or GAN) as the structural constraint in the place of sparsity. This generator maps a latent representation $\mathbf{z}$ to the signal space:
$$\mathbf{x} = G_\theta(\mathbf{z}) \tag{4}$$
Instead of requiring sparse signals, $G_\theta$ implicitly constrains the output $\mathbf{x}$ to a low-dimensional manifold via its architecture and the weights adapted from data. This constraint is sufficient to provide a generalised Set-Restricted Eigenvalue Condition (S-REC) with random matrices, under which low reconstruction error can be achieved with high probability. A minimisation process similar to that in CS is used for reconstruction:
$$\hat{\mathbf{z}} = \arg\min_{\mathbf{z}} E_\theta(\mathbf{y}, \mathbf{z}) \tag{5}$$
$$E_\theta(\mathbf{y}, \mathbf{z}) = \|\mathbf{y} - \mathbf{F}\,G_\theta(\mathbf{z})\|_2^2 \tag{6}$$
such that $\hat{\mathbf{x}} = G_\theta(\hat{\mathbf{z}})$ is the reconstructed signal. In contrast to directly optimising the signal $\mathbf{x}$ in CS (eq. 3), here optimisation is in the space of the latent representation $\mathbf{z}$.
The $\arg\min$ operator in eq. 5 is intractable since $E_\theta(\mathbf{y}, \mathbf{z})$ is highly non-convex. It is therefore approximated using gradient descent starting from a randomly sampled point $\hat{\mathbf{z}} \sim p_{\mathbf{z}}(\mathbf{z})$:
$$\hat{\mathbf{z}} \leftarrow \hat{\mathbf{z}} - \alpha \left.\frac{\partial E_\theta(\mathbf{y}, \mathbf{z})}{\partial \mathbf{z}}\right|_{\mathbf{z} = \hat{\mathbf{z}}} \tag{7}$$
where $\alpha$ is a learning rate. One can take a specified number $T$ of gradient descent steps. Typically, hundreds or thousands of gradient descent steps and several restarts from the initial step are needed to obtain a sufficiently good $\hat{\mathbf{z}}$ (Bora et al., 2017; Bojanowski et al., 2018). This process is illustrated in Figure 1.
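The latent-space reconstruction described above can be sketched on a toy problem. The code below assumes a hypothetical one-layer tanh generator with fixed random weights `W` standing in for a pre-trained network, a random measurement matrix `F`, and a hand-derived gradient of the squared measurement error; the sizes, step size and step count are illustrative choices, not the settings of Bora et al. (2017).

```python
import numpy as np

rng = np.random.default_rng(0)
k, D, C = 4, 50, 10                        # latent, signal and measurement sizes (illustrative)

W = rng.normal(size=(D, k)) / np.sqrt(D)   # weights of a toy "pre-trained" generator (hypothetical)
F = rng.normal(size=(C, D)) / np.sqrt(C)   # random measurement matrix

def G(z):
    """Toy generator: one tanh layer mapping a latent z to a signal x."""
    return np.tanh(W @ z)

def measurement_error(y, z):
    """Squared measurement error ||y - F G(z)||^2 (the quantity in eq. 6)."""
    r = y - F @ G(z)
    return float(r @ r)

def grad_z(y, z):
    """Analytic gradient of the measurement error with respect to z."""
    x = G(z)
    r = y - F @ x
    return -2.0 * W.T @ ((1.0 - x ** 2) * (F.T @ r))

def reconstruct(y, steps=300, alpha=0.03):
    """Gradient descent in latent space (eq. 7) from a random starting point."""
    z = rng.normal(size=k)
    e_start = measurement_error(y, z)
    for _ in range(steps):
        z = z - alpha * grad_z(y, z)
    return z, e_start, measurement_error(y, z)

z_true = rng.normal(size=k)
y = F @ G(z_true)                          # noiseless measurement of a signal in the generator's range
z_hat, e_start, e_end = reconstruct(y)
print(f"measurement error: {e_start:.4f} -> {e_end:.4f}")
```

Even on this small problem the loop runs for hundreds of steps, which is the cost that the meta-learned procedure in section 3.1 aims to reduce.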
This work established the connection between compressed sensing and deep neural networks, and demonstrated performance superior to the Lasso (Tibshirani, 1996), especially when the number of measurements is small. The theoretical properties of CSGM have been more closely examined by Hand & Voroninski (2017), who also proved stronger convergence guarantees. More recently, Dhar et al. (2018) proposed additional constraints to allow sparse deviation from the generative model’s support set, thus improving generalisation. However, CSGM still suffers from two restrictions:

The optimisation for reconstruction is still slow, as it requires thousands of gradient descent steps.

It relies on random measurement matrices, which are known to be suboptimal for highly structured signals such as natural images. Learned measurements can perform significantly better (Weiss et al., 2007).
2.3 Model-Agnostic Meta Learning
Meta-learning, or learning to learn, allows a model to adapt to new tasks by self-improving (Schmidhuber, 1987). Model-Agnostic Meta Learning (MAML) provides a general method to adapt parameters for a number of tasks (Finn et al., 2017). Given a differentiable loss function $\mathcal{L}(\mathcal{T}_i; \phi)$ for a task $\mathcal{T}_i$ sampled from the task distribution $p_{\text{task}}(\mathcal{T})$, the task-specific parameters $\phi_i$ are adapted by gradient descent from the initial parameters $\phi$:
$$\phi_i = \phi - \alpha\,\nabla_\phi \mathcal{L}(\mathcal{T}_i; \phi) \tag{8}$$
The initial parameters $\phi$ are trained to minimise the loss across all tasks:
$$\min_\phi \; \mathbb{E}_{\mathcal{T}_i \sim p_{\text{task}}(\mathcal{T})}\left[\mathcal{L}(\mathcal{T}_i; \phi_i)\right] \tag{9}$$
Multiple steps and more sophisticated optimisation algorithms can be used in the place of eq. 8. Although $\mathcal{L}$ is usually a highly non-convex function, backpropagating through the gradient descent process makes only a few gradient steps sufficient to adapt to new tasks.
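The inner and outer updates of MAML can be made concrete with a deliberately simple example in which every gradient is available in closed form. The sketch assumes quadratic task losses of the form 0.5 * ||phi - c||^2, where each task is identified by its optimum `c`; the task family, step sizes and function names are assumptions made for illustration and are not taken from Finn et al. (2017).

```python
import numpy as np

rng = np.random.default_rng(0)
alpha, outer_lr = 0.1, 0.05            # inner and outer step sizes (illustrative)

def loss(phi, c):
    """Task loss for a task identified by its optimum c."""
    return 0.5 * np.sum((phi - c) ** 2)

def adapt(phi, c):
    """One inner gradient step (eq. 8): phi_i = phi - alpha * dL/dphi."""
    return phi - alpha * (phi - c)

def meta_grad(phi, c):
    """Gradient of loss(adapt(phi, c), c) with respect to the initial phi.

    Backpropagating through the inner step contributes the factor
    d(phi_i)/d(phi) = (1 - alpha); the remaining factor is dL/dphi_i.
    """
    return (1.0 - alpha) * (adapt(phi, c) - c)

task_mean = np.array([2.0, -1.0])

def sample_task():
    """Sample a task optimum near task_mean (hypothetical task distribution)."""
    return task_mean + 0.1 * rng.normal(size=2)

phi = np.zeros(2)
for _ in range(2000):
    tasks = [sample_task() for _ in range(8)]
    # Outer update (eq. 9): average the meta-gradient over a batch of tasks.
    phi = phi - outer_lr * np.mean([meta_grad(phi, c) for c in tasks], axis=0)

print("meta-learned initialisation:", np.round(phi, 2))
```

For this quadratic family the meta-learned initialisation settles near the centre of the task distribution, from which a single inner step moves towards any individual task's optimum.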
2.4 Generative Adversarial Networks
A Generative Adversarial Network (GAN) trains a parametrised generator $G_\theta$ to fool a discriminator $D_\phi$ that tries to distinguish real data from fake data sampled from the generator (Goodfellow et al., 2014). The generator $G_\theta$ is a deterministic function that transforms samples from a source $\mathbf{z} \sim p_{\mathbf{z}}(\mathbf{z})$ to the same space as the data $\mathbf{x}$, which has the distribution $p_{\text{data}}(\mathbf{x})$. This adversarial game can be summarised by the following min-max problem with the value function $V(D_\phi, G_\theta)$:
$$\min_{G_\theta} \max_{D_\phi} V(D_\phi, G_\theta) = \mathbb{E}_{\mathbf{x} \sim p_{\text{data}}(\mathbf{x})}\left[\ln D_\phi(\mathbf{x})\right] + \mathbb{E}_{\mathbf{z} \sim p_{\mathbf{z}}(\mathbf{z})}\left[\ln\left(1 - D_\phi(G_\theta(\mathbf{z}))\right)\right] \tag{10}$$
GANs are usually difficult to train due to this adversarial game (Balduzzi et al., 2018). Training may either diverge or converge to bad equilibria with, for example, collapsed modes, unless extra care is taken in designing and training the model (Radford et al., 2015; Salimans et al., 2016).
A widely adopted trick is using $-\ln\left(D_\phi(G_\theta(\mathbf{z}))\right)$ as the objective for the generator (Goodfellow et al., 2014). Compared with eq. 10, this alternative objective avoids saturating the discriminator in the early stage of training when the generator is too weak. However, this objective voids most theoretical analyses (Hu et al., 2018), since the new adversarial objective is no longer a zero-sum game (eq. 10).
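The practical difference between the two generator objectives can be seen from their gradients with respect to the discriminator's logit on a generated sample. The sketch below assumes a sigmoid output, so that D(G(z)) = sigmoid(a) for a logit `a`; the function names are introduced here for illustration only.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def grad_saturating(a):
    """d/da of ln(1 - sigmoid(a)), the generator term in eq. 10; equals -sigmoid(a)."""
    return -sigmoid(a)

def grad_non_saturating(a):
    """d/da of -ln(sigmoid(a)), the alternative generator loss; equals -(1 - sigmoid(a))."""
    return -(1.0 - sigmoid(a))

# A very negative logit means the discriminator confidently rejects the sample,
# which is typical early in training when the generator is weak.
for a in (-6.0, 0.0, 6.0):
    print(f"logit={a:+.0f}  D(G(z))={sigmoid(a):.4f}  "
          f"saturating={grad_saturating(a):+.4f}  "
          f"non-saturating={grad_non_saturating(a):+.4f}")
```

When the discriminator confidently rejects a sample, the gradient of the original objective is close to zero, whereas the alternative objective still provides a gradient of magnitude close to one.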
In most GAN models, discriminators become useless after training. Recently, Tao et al. (2018) and Azadi et al. (2019) proposed methods using the discriminator for importance sampling. Our work provides an alternative: our model moves latent representations to areas more likely to generate realistic images as deemed by the discriminator.
3 Deep Compressed Sensing
We start by showing the benefit of combining meta-learning with the model in Bora et al. (2017). We then generalise measurement matrices to parametrised measurement functions, including deep neural networks. While previous work relies on random projections as measurement functions, our approach learns measurement functions by imposing the RIP as a training objective. We then derive two novel models by imposing properties other than the RIP on the measurements, including a GAN model with discriminator-guided latent optimisation, which leads to more stable training dynamics and better results.
3.1 Compressed Sensing with Meta-Learning
We hypothesise that the runtime efficiency and performance of CSGM (Bora et al., 2017; section 2.2) can be improved by training the latent optimisation procedure using meta-learning, by backpropagating through the gradient descent steps (Finn et al., 2017). The latent optimisation procedure for CS models can take hundreds or thousands of steps. By employing meta-learning to optimise this optimisation procedure we aim to achieve similar results with far fewer updates.
To this end, the model parameters, as well as the latent optimisation procedure, are trained to minimise the expected measurement error:
$$\min_\theta \mathcal{L}_G, \qquad \mathcal{L}_G = \mathbb{E}_{\mathbf{x}_i \sim p_{\text{data}}(\mathbf{x})}\left[E_\theta(\mathbf{y}_i, \hat{\mathbf{z}}_i)\right] \tag{11}$$
where $\hat{\mathbf{z}}_i$ is obtained from gradient descent (eq. 7). The gradient descent in eq. 7 and the loss function in eq. 11 mirror their counterparts in MAML (eq. 8 and 9), except that:

Instead of the stochastic gradient computed in the outside loop, here each measurement error $E_\theta$ only depends on a single sample $\mathbf{z}$, so eq. 7 computes the exact gradient of $E_\theta$.

The online optimisation is over latent variables rather than parameters. There are usually far fewer latent variables than parameters, so the update is quicker.
As in MAML, we implicitly perform second-order optimisation, by backpropagating through the latent optimisation steps which compute $\hat{\mathbf{z}}$ when optimising eq. 11. We empirically observed that this dramatically improves the efficiency of latent optimisation, with only 3 to 5 gradient descent steps being sufficient to improve upon baseline methods.
Unlike Bora et al. (2017), we also train the generator $G_\theta$. Merely minimising eq. 11 would fail: the generator can exploit $\mathbf{F}$ by mapping all $G_\theta(\mathbf{z})$ into the null space of $\mathbf{F}$. This trivial solution always gives zero measurement error, but may contain no useful information. Our solution is to enforce the RIP (eq. 2) via training, by minimising the measurement loss:
$$\mathcal{L}_F = \mathbb{E}_{\mathbf{x}_1, \mathbf{x}_2}\left[\left(\|\mathbf{F}(\mathbf{x}_1 - \mathbf{x}_2)\|_2 - \|\mathbf{x}_1 - \mathbf{x}_2\|_2\right)^2\right] \tag{12}$$
$\mathbf{x}_1$ and $\mathbf{x}_2$ can be sampled in various ways. While the choice is not unique, it is important to sample from both the data distribution $p_{\text{data}}(\mathbf{x})$ and generated samples $G_\theta(\mathbf{z})$, so that the trained RIP holds for both real and generated data. In our experiments, we randomly sampled one image from the data and two generated images at the beginning and end of latent optimisation, then computed the average of the 3 pairwise losses between these 3 points as a form of “triplet loss”.
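The triplet form of the measurement loss described above can be sketched as follows. The code assumes one plausible reading of eq. 12, namely the squared difference between the measured and the original distance for each pair, and uses random vectors as stand-ins for the data sample and the two generated samples; all names and sizes are hypothetical.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
D, C = 64, 16                              # signal and measurement sizes (illustrative)
F = rng.normal(size=(C, D)) / np.sqrt(C)   # linear measurement matrix

def pair_loss(F, x1, x2):
    """(||F (x1 - x2)|| - ||x1 - x2||)^2 for one pair of signals."""
    d = x1 - x2
    return (np.linalg.norm(F @ d) - np.linalg.norm(d)) ** 2

def triplet_measurement_loss(F, x_data, x_gen_start, x_gen_end):
    """Average the pair loss over the 3 pairs formed by the 3 points."""
    points = (x_data, x_gen_start, x_gen_end)
    pairs = itertools.combinations(points, 2)
    return float(np.mean([pair_loss(F, a, b) for a, b in pairs]))

x_data = rng.normal(size=D)        # stand-in for a data sample
x_gen_start = rng.normal(size=D)   # stand-in for a generated sample before latent optimisation
x_gen_end = rng.normal(size=D)     # stand-in for a generated sample after latent optimisation
loss = triplet_measurement_loss(F, x_data, x_gen_start, x_gen_end)
print(f"triplet measurement loss: {loss:.4f}")
```

A measurement that preserves all pairwise distances exactly, such as the identity map, gives a loss of zero, which is the behaviour the loss is meant to reward.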
3.2 Deep Compressed Sensing with Learned Measurement Function
In Algorithm 1, we use the RIP to train the generator. We can use the same approach and enforce the RIP to learn the measurement function itself, rather than using a random projection.
3.2.1 Learning Measurement Function
We start by generalising the measurement matrix $\mathbf{F}$ (eq. 1), and define a parametrised measurement function $\mathbf{y} \leftarrow F_\phi(\mathbf{x})$. The model introduced in the previous section corresponds to a linear function $F_\phi(\mathbf{x}) = \mathbf{F}\,\mathbf{x}$; now both $F_\phi$ and $G_\theta$ can be deep neural networks. Similar to CS, the central problem in this generalised setting is inverting the measurement function to recover the signal $\mathbf{x} \leftarrow F_\phi^{-1}(\mathbf{y})$ via minimising the measurement error similar to eq. 6:
$$E_\theta(\mathbf{y}, \mathbf{z}) = \|\mathbf{y} - F_\phi(G_\theta(\mathbf{z}))\|_2^2 \tag{13}$$
The distance-preserving property as a counterpart of the RIP can be enforced by minimising a loss similar to eq. 12:
$$\mathcal{L}_F = \mathbb{E}_{\mathbf{x}_1, \mathbf{x}_2}\left[\left(\|F_\phi(\mathbf{x}_1) - F_\phi(\mathbf{x}_2)\|_2 - \|\mathbf{x}_1 - \mathbf{x}_2\|_2\right)^2\right] \tag{14}$$
Minimising $\mathcal{L}_F$ provides a relaxation of the constraint specified by the RIP (eq. 2). When $\mathcal{L}_F$ is small, the projection from $F_\phi$ better preserves the distance between $\mathbf{x}_1$ and $\mathbf{x}_2$. This relaxation enables us to transform the RIP into a training objective for the measurements, which can then be integrated into training other model components. Empirically, we found this relaxation leads to high quality reconstruction.
The rest of the algorithm is identical to Algorithm 1, except that we also update the measurement function’s parameters $\phi$. Consequently, different schemes can be employed to coordinate updating $\theta$ and $\phi$, which will be discussed more in section 3.3. This extended algorithm is summarised in Algorithm 2. We call it Deep Compressed Sensing (DCS) to emphasise that both the measurement and reconstruction can be deep neural networks. Next, we turn to generalising the measurements to properties other than the RIP.
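To illustrate how a distance-preserving loss can serve as a training objective for the measurements, the sketch below trains a linear measurement matrix by stochastic gradient descent on a squared difference-of-norms loss. It assumes a linear measurement function (so that the difference of measurements equals the measurement of the difference), toy signals confined to a 4-dimensional subspace, and a hand-derived gradient; none of these choices come from the paper's experiments.

```python
import numpy as np

rng = np.random.default_rng(0)
D, C, lr = 32, 8, 0.05                     # signal size, measurements, step size (illustrative)

def rip_loss_and_grad(F, diffs):
    """Mean of (||F d|| - ||d||)^2 over the rows d of `diffs`, and its gradient w.r.t. F."""
    Fd = diffs @ F.T                               # measured differences, shape (N, C)
    n_Fd = np.linalg.norm(Fd, axis=1)
    n_d = np.linalg.norm(diffs, axis=1)
    err = n_Fd - n_d
    # Per pair: d/dF (||F d|| - ||d||)^2 = 2 * err * (F d) d^T / ||F d||
    coef = (2.0 * err / n_Fd)[:, None]
    grad = (coef * Fd).T @ diffs / len(diffs)
    return float(np.mean(err ** 2)), grad

# Toy "structured" signals: they live in a 4-dimensional subspace of R^D.
basis = rng.normal(size=(4, D)) / np.sqrt(D)

def sample_diffs(n):
    """Differences x1 - x2 between n random pairs of structured signals."""
    x1 = rng.normal(size=(n, 4)) @ basis
    x2 = rng.normal(size=(n, 4)) @ basis
    return x1 - x2

F = rng.normal(size=(C, D)) / np.sqrt(C)   # start from a random projection
eval_diffs = sample_diffs(256)
loss_before, _ = rip_loss_and_grad(F, eval_diffs)
for _ in range(500):
    _, g = rip_loss_and_grad(F, sample_diffs(64))
    F -= lr * g                            # update the measurement parameters
loss_after, _ = rip_loss_and_grad(F, eval_diffs)
print(f"distance-preservation loss: {loss_before:.4f} -> {loss_after:.4f}")
```

Because the toy signals occupy a low-dimensional subspace, a learned matrix can preserve their pairwise distances better than the random starting point, which mirrors the motivation for learned measurements on structured data.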
3.2.2 Generalised CS 1: CSGAN
Here we consider an extreme case: a one-dimensional measurement that only encodes how likely an input is to be a real data point rather than a fake one sampled from the generator. One way to formulate this is to train the measurement function $F_\phi$ using the following loss instead of eq. 14:
$$\mathcal{L}_F = \mathbb{E}_{\mathbf{x} \sim p_{\text{data}}(\mathbf{x})}\left[\|F_\phi(\mathbf{x}) - 1\|_2^2\right] + \mathbb{E}_{\mathbf{z} \sim p_{\mathbf{z}}(\mathbf{z})}\left[\|F_\phi(G_\theta(\hat{\mathbf{z}}))\|_2^2\right] \tag{15}$$
Algorithm 2 then becomes the Least Squares Generative Adversarial Nets (LSGAN, Mao et al., 2017) with latent optimisation; they are exactly equivalent when the latent optimisation is disabled ($T = 0$, zero steps). LSGAN is an alternative to the original GAN (Goodfellow et al., 2014) that can be motivated from the Pearson divergence. To demonstrate a closer connection with original GANs (Goodfellow et al., 2014), we instead focus on another formulation whose measurement function is a binary classifier (the discriminator).
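This is realised by using a binary classifier $D_\phi$ as the measurement function, where we can interpret $D_\phi(\mathbf{x})$ as the probability that $\mathbf{x}$ comes from the dataset. In this case, the measurement function is equivalent to the discriminator in GANs. Consequently, we change the squared loss in eq. 13 to the cross-entropy loss as the matching measurement loss function (Bishop, 2006) (ignoring the expectation over $\mathbf{x}$ for brevity):
$$\mathcal{L}_F = -\left[t(\mathbf{x}) \ln D_\phi(\mathbf{x}) + \left(1 - t(\mathbf{x})\right) \ln\left(1 - D_\phi(\mathbf{x})\right)\right] \tag{16}$$
where the binary scalar $t(\mathbf{x})$ is an indicator function that identifies whether $\mathbf{x}$ is a real data point:
$$t(\mathbf{x}) = \begin{cases} 1 & \mathbf{x} \sim p_{\text{data}}(\mathbf{x}) \\ 0 & \mathbf{x} \sim G_\theta(\hat{\mathbf{z}}) \end{cases} \tag{17}$$
Similarly, a cross-entropy measurement error is employed to quantify the discrepancy between $D_\phi(G_\theta(\mathbf{z}))$ and the scalar measurement $y = D_\phi(\mathbf{x})$:
$$E_\theta(y, \mathbf{z}) = -\left[y \ln D_\phi(G_\theta(\mathbf{z})) + (1 - y) \ln\left(1 - D_\phi(G_\theta(\mathbf{z}))\right)\right] \tag{18}$$
At the minimum of $\mathcal{L}_F$ (eq. 16), the optimal measurement function is achieved by the perfect classifier:
$$D_\phi(\mathbf{x}) = \begin{cases} 1 & \mathbf{x} \sim p_{\text{data}}(\mathbf{x}) \\ 0 & \mathbf{x} \sim G_\theta(\hat{\mathbf{z}}) \end{cases} \tag{19}$$
We can therefore simplify eq. 18 by replacing $y$ with its target value $1$ as in teacher-forcing (Williams & Zipser, 1989):
$$E_\theta(y, \mathbf{z}) = -\ln D_\phi(G_\theta(\mathbf{z})) \tag{20}$$
This objective recovers the vanilla GAN formulation with the commonly used alternative loss (Goodfellow et al., 2014), which we derived as a measurement error. When latent optimisation is disabled ($T = 0$), Algorithm 2 is identical to a vanilla GAN.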
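The role of the discriminator as a measurement function can be illustrated with a toy latent optimisation loop. The sketch assumes a hypothetical linear generator `A`, a logistic discriminator with weights `w` and bias `b`, a fixed step size and 3 steps; it only shows the latent update, not the training of either network.

```python
import numpy as np

rng = np.random.default_rng(0)
k, D = 8, 32                                # latent and signal sizes (illustrative)

A = rng.normal(size=(D, k)) / np.sqrt(k)    # toy linear generator G(z) = A z (hypothetical)
w = rng.normal(size=D) / np.sqrt(D)         # toy logistic discriminator D(x) = sigmoid(w.x + b)
b = 0.0

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def D_of_G(z):
    """Discriminator output on the generated sample G(z)."""
    return sigmoid(w @ (A @ z) + b)

def measurement_error(z, y=1.0):
    """Cross-entropy between the target measurement y and D(G(z)).

    With the target y = 1 this reduces to -ln D(G(z)).
    """
    p = D_of_G(z)
    return -(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))

def latent_step(z, alpha=0.1):
    """One gradient descent step on -ln D(G(z)) with respect to z."""
    grad = -(1.0 - D_of_G(z)) * (A.T @ w)
    return z - alpha * grad

z = rng.normal(size=k)
z_hat = z
for _ in range(3):                          # a few latent optimisation steps
    z_hat = latent_step(z_hat)
print(f"D(G(z)): {D_of_G(z):.3f} -> {D_of_G(z_hat):.3f}")
```

Each step moves the latent point in the direction that the discriminator considers more realistic, while the small number of steps keeps the move local.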
In our experiments (section 4.2), we observed that the additional latent optimisation steps introduced from the CS perspective significantly improved GAN training. We reckon this is because latent optimisation moves the representation to areas more likely to generate realistic images as deemed by the discriminator. Since the gradient descent process remains local, the latent representations are still spread broadly in latent space, which avoids mode collapse. Although a sufficiently powerful generator $G_\theta$ can transform the source into an arbitrarily complex distribution, a more informative source, as implicitly manifested from the optimised $\hat{\mathbf{z}}$, may significantly reduce the complexity required for $G_\theta$, thus striking a better trade-off in terms of the overall computation.
3.2.3 Generalised CS 2: Semi-supervised GANs
So far, we have shown two extreme cases of Deep Compressed Sensing: in one case, the distance-preserving measurements (section 3.2.1) essentially encode all information for recovering the original signals; on the other hand, the CSGAN (section 3.2.2) has one-dimensional measurements that only indicate whether signals are real or fake. We now seek a middle ground, by using measurements that preserve class information for labelled data.
We generalise CSGAN by replacing the binary classifier (discriminator) $D_\phi$ with a multi-class classifier $C_\phi$. For data with $K$ classes, this classifier outputs $K + 1$ classes with the $(K+1)$’th class reserved for “fake” data that comes from the generator. This specification is the same as the classifier used in semi-supervised GANs (SGANs, Salimans et al. (2016)). Consequently, we extend the binary indicator function $t(\mathbf{x})$ in eq. 17 to a multi-class indicator, so that its $k$’th element $t_k(\mathbf{x}) = 1$ when $\mathbf{x}$ is in class $k$. The $k$’th output of the classifier, $C_\phi(\mathbf{x})_k$, indicates the predicted probability that $\mathbf{x}$ is in the $k$’th class, and the multi-class cross-entropy loss is used for the measurement loss and measurement error:
$$\mathcal{L}_F = -\sum_{k=1}^{K+1} t_k(\mathbf{x}) \ln C_\phi(\mathbf{x})_k \tag{21}$$
$$E_\theta(\mathbf{y}, \mathbf{z}) = -\sum_{k=1}^{K+1} y_k \ln C_\phi(G_\theta(\mathbf{z}))_k \tag{22}$$
When latent optimisation is disabled ($T = 0$), the model is similar to other semi-supervised GANs (Salimans et al., 2016; Odena et al., 2017). However, when $T > 0$ the online optimisation moves latent representations towards regions representing particular classes. This provides a novel way of training conditional GANs.
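The class-directed latent optimisation described above can be sketched with a toy linear generator and a linear softmax classifier. The matrices `A` and `Wc`, the target class, the step size and the number of steps are all hypothetical choices; the classifier has K + 1 outputs, with the last one playing the role of the "fake" class.

```python
import numpy as np

rng = np.random.default_rng(0)
k, D, K = 8, 32, 10                            # latent size, signal size, number of real classes

A = rng.normal(size=(D, k)) / np.sqrt(k)       # toy linear generator G(z) = A z (hypothetical)
Wc = rng.normal(size=(K + 1, D)) / np.sqrt(D)  # toy linear classifier with K + 1 outputs

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

def class_probs(z):
    """Predicted class probabilities for the generated sample G(z)."""
    return softmax(Wc @ (A @ z))

def latent_step(z, y, alpha=0.1):
    """One gradient step on the cross-entropy -sum_k y_k ln C(G(z))_k with respect to z.

    For a softmax classifier the gradient with respect to the logits is (p - y),
    which is then propagated back through the two linear maps.
    """
    grad = A.T @ (Wc.T @ (class_probs(z) - y))
    return z - alpha * grad

target = 3                                     # class whose region the latent should move towards
y = np.zeros(K + 1)
y[target] = 1.0

z = rng.normal(size=k)
z_hat = z
for _ in range(3):                             # a few latent optimisation steps
    z_hat = latent_step(z_hat, y)
print(f"p(target class): {class_probs(z)[target]:.3f} -> {class_probs(z_hat)[target]:.3f}")
```

The target measurement `y` carries the conditioning information, so the same generator can be steered towards different classes purely through the latent update.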
Compared with conditional GANs which concatenate labels to latent variables (Mirza & Osindero, 2014), optimising latent variables is more adaptive and uses information from the entire model. Compared with BatchNorm based methods (Miyato & Koyama, 2018), the information for conditioning is presented in the target measurements, and does not need to be trained as BatchNorm statistics (Ioffe & Szegedy, 2015). Since both of these methods use separate sources (label inputs or batch statistics) to provide the condition, their latent variables tend to retain no information about the condition. Our model, on the other hand, distils the condition information into the latent representation, which results in semantically meaningful latent space (Figure 5).
Model  Property  Loss 

CS  RIP  N/A 
DCS  trained RIP  eq. 14 
CSGAN  validity preserving  eq. 16 
CSSGAN  class preserving  eq. 21 
3.3 Optimising Models
The three models we derived as examples in the DCS framework are summarised in Table 1 alongside CS. The main difference between them lies in the training objective $\mathcal{L}_F$ used for the measurement functions. Once $\mathcal{L}_F$ is specified, the generator objective $\mathcal{L}_G$, in the form of measurement error, can be derived accordingly. When $\mathcal{L}_F$ and $\mathcal{L}_G$ are adversarial, such as in the CSGAN, $F_\phi$ and $G_\theta$ need to be optimised separately as in GANs. This is implemented as the alternating update option in Algorithm 2. In optimising the latent variables (eq. 7), we normalise $\hat{\mathbf{z}}$ after each gradient descent step, as in Bojanowski et al. (2018). We treat the step size $\alpha$ in latent optimisation as a parameter and backpropagate through it in optimising the model loss function. An additional technique we found useful in stabilising CSGAN training is to penalise the distance $\hat{\mathbf{z}}$ moves as an optimisation cost and add it to $\mathcal{L}_G$:
$$\mathcal{L}_G \leftarrow \mathcal{L}_G + \beta\,\|\hat{\mathbf{z}} - \mathbf{z}\|_2^2 \tag{23}$$
where $\mathbf{z}$ is the initial latent sample and $\beta$ is a scalar controlling the strength of this regulariser. This regulariser encourages small moves of $\mathbf{z}$ in optimisation, and can be interpreted as approximating an optimal transport cost (Villani, 2008). We found a range of $\beta$ from to made little difference in training, and used in our experiments with CSGAN.
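The two latent-space techniques mentioned above, normalising after each step and penalising the distance moved, can be sketched as follows. The code assumes that normalisation means projection onto the unit sphere (one reading of Bojanowski et al., 2018), uses a fixed rather than learned step size, and picks an arbitrary value for `beta`; the quadratic toy error and all function names are hypothetical.

```python
import numpy as np

def normalise(z):
    """Project z onto the unit sphere (assumed form of the normalisation)."""
    return z / np.linalg.norm(z)

def optimise_latent(z0, grad_fn, alpha=0.1, steps=3):
    """Run `steps` gradient descent steps from z0, normalising after each one."""
    z = z0
    for _ in range(steps):
        z = normalise(z - alpha * grad_fn(z))
    return z

def move_penalty(z0, z_hat, beta):
    """Optimisation cost beta * ||z_hat - z0||^2, to be added to the generator loss."""
    return beta * float(np.sum((z_hat - z0) ** 2))

# Toy usage: the latent error is ||z - c||^2 for a hypothetical target direction c.
rng = np.random.default_rng(0)
c = normalise(rng.normal(size=16))
z0 = normalise(rng.normal(size=16))
z_hat = optimise_latent(z0, grad_fn=lambda z: 2.0 * (z - c))
penalty = move_penalty(z0, z_hat, beta=1.0)    # beta = 1.0 is an arbitrary illustrative value
print(f"cosine to target: {z0 @ c:.3f} -> {z_hat @ c:.3f}, penalty: {penalty:.3f}")
```

In the full model the step size would itself receive gradients from the model loss; a differentiable programming framework would be needed for that, which this NumPy sketch does not attempt.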
4 Experiments
4.1 Deep Compressed Sensing for Reconstruction
We first evaluate the DCS model using the MNIST (Yann et al., 1998) and CelebA (Liu et al., 2015) datasets. To compare with the approach in Bora et al. (2017), we used the same generators as in their model. For the measurement functions, we considered both linear projections and neural networks. We considered both random projections and trained measurement functions, while the generator was always trained jointly with the latent optimisation process. Unless otherwise specified, we use 3 gradient descent steps for latent optimisation. More details, including hyperparameter values, are reported in the Appendix. Our code will be available at https://github.com/deepmind/deepcompressedsensing.
Tables 2 and 3 summarise the results from our models as well as the baseline model from Bora et al. (2017). The reconstruction loss for the baseline model is estimated from Figure 1 in Bora et al. (2017). DCS performs significantly better than the baseline. In addition, while the baseline model used hundreds or thousands of gradient descent steps with several restarts, we only used 3 steps without any restarting, achieving orders of magnitude higher efficiency. Interestingly, for fixed $F_\phi$, random linear projections outperformed neural networks as the measurement functions in both datasets across different neural network structures (rows 2 and 3 of Tables 2 and 3). This empirical result is consistent with the optimality of random projections described in the compressed sensing literature and the more general Johnson-Lindenstrauss lemma (Donoho, 2006; Candes et al., 2006; Johnson & Lindenstrauss, 1984).
The advantage of neural networks manifested when $F_\phi$ was optimised; this variant reached the best performance in all scenarios. As argued in Weiss et al. (2007), we observed that random projections are suboptimal for highly structured signals such as images, as seen in the improved performance when optimising the measurement matrices (row 4 of Tables 2 and 3). The reconstruction performance was further improved when the linear measurement projections were replaced by neural networks (row 5 of Tables 2 and 3). Examples of reconstructed MNIST images from different models are shown in Figure 2.
Unlike autoencoder-based methods, our models were not trained with any pixel reconstruction loss, which we only use for testing. Despite this, our results are comparable with the recently proposed “Uncertainty Autoencoders” (Grover & Ermon, 2018). We have worse MNIST reconstructions: with 10 and 25 measurements, our best model achieved 5.3 and 3.4 per-image reconstruction errors compared with their 3.8 and 2.5 (estimated from their Figure 4). However, we achieved better CelebA results: with 20 and 50 measurements, we have errors of 23.4 and 18.5 compared with their 27 and 22 (estimated from their Figure 6).
Model  10 measurements  25 measurements  steps 

Baseline  
Linear  3  
NN  3  
Linear(L)  3  
NN(L)  3 
“±” shows the standard deviation across test samples. (L) indicates learned measurement functions. Lower is better.
Model  20 measurements  50 measurements  steps 

Baseline  
Linear  3  
NN  3  
Linear(L)  3  
NN(L)  3 
4.2 CSGANs
To evaluate our proposed CSGANs, we first trained a small model on MNIST to demonstrate intuitively the advantage of latent optimisation. For quantitative evaluation, we trained larger and more standard models on CIFAR10 (Krizhevsky & Hinton, 2009), and evaluated them using the Inception Score (IS) (Salimans et al., 2016) and Fréchet Inception Distance (FID) (Heusel et al., 2017). To our knowledge, latent optimisation has not been previously used to improve GANs, so our approach is orthogonal to existing methods such as Arjovsky et al. (2017); Miyato et al. (2018). We first compare our model with vanilla GANs, which are a special case of the CSGAN (see section 3.2.2).
We use the same MNIST architectures as in section 4.1, but changed the measurement function to a GAN discriminator (section 3.2.2). We use the alternating update option in Algorithm 2 in this setting. All other hyperparameters are the same as in previous experiments. We use this relatively weak model to reveal failure modes as well as advantages of the CSGAN. Figure 3 shows samples from models with the same setting but different latent optimisation iterations. The three panels show samples from models using 0, 3 and 5 gradient descent steps respectively. The model using 0 iterations was equivalent to a vanilla GAN. Models that optimise latent variables exhibit no mode collapse, one of the common failure modes of GAN training.
To confirm this advantage, we more systematically evaluate our method across a range of 144 hyperparameter configurations (similar to Kurach et al. (2018)). We use the CIFAR dataset which contains various categories of natural images, whose features from an Inception Network (Ioffe & Szegedy, 2015) are meaningful for evaluating the IS and FID. Other than the number of gradient descent steps (0 vs. 3), the model architectures and training procedures were identical. The changes of IS and FID during training are plotted in Figure 4. CSGANs achieved better performance in both IS and FID, and had less variance across the range of hyperparameters. The blue horizontal lines at the bottom of Fig. 4 (left) and the top of Fig. 4 (right) show failed vanilla GANs, but none of the CSGANs diverged in training.
SNGAN  SNGAN (ours)  CS+SNGAN  

IS  
FID 
We also applied our latent optimisation method to Spectral-Normalised GANs (SNGANs) (Miyato et al., 2018), which use Batch Normalisation (Ioffe & Szegedy, 2015) for the generator and Spectral Normalisation for the discriminator. We compared our model with SNGAN in Table 4: the SNGAN column reproduces the numbers from Miyato et al. (2018), and the next column shows numbers from our replication of the same baseline. Our results demonstrate that deeper architectures, Batch Normalisation and Spectral Normalisation can further improve CSGAN and that CSGAN can improve upon a competitive baseline, SNGAN.
4.3 CSSGANs
We now experimentally assess our approach of using latent optimisation in semi-supervised GANs, CSSGAN. We illustrate this extension with the MNIST dataset, and leave it to future work to study other applications. We keep all the hyperparameters the same as in section 4.2, except changing the number of measurements to 11, for the 10 MNIST classes and 1 class reserved for generated samples. Samples from CSSGAN can be seen in Figure 5 (left). Figure 5 (right) illustrates the latent space with t-SNE (Maaten & Hinton, 2008) computed from 2000 random samples, where class labels are colour-coded. The latent space formed separated regions representing different digits. It is impossible to obtain such a clustered latent space in typical conditional GANs (Mirza & Osindero, 2014; Miyato & Koyama, 2018), where labels are supplied as separate inputs while the random source only provides label-independent variations. In contrast, in our model the labels are distilled into the latent representation via optimisation, leading to a more interpretable latent space.
5 Discussion
We present a novel framework for combining compressed sensing and deep neural networks. In this framework we trained both the measurement and generation functions, as well as the latent optimisation (i.e., reconstruction) procedure itself via meta-learning. Inspired by Bora et al. (2017), our approach significantly improves upon the performance and speed of reconstruction obtained in that work. In addition, we derived a family of models, including a novel GAN model, by expanding the set of properties we consider for the measurement function (Table 1).
Our method differs from existing algorithms that aim to combine compressed sensing with deep networks in that our approach preserves the online minimisation of measurement errors in generic neural networks. Previous attempts that combine CS and deep learning generally fall into two categories. One category of methods interprets taking compressed measurements and reconstructing from these measurements as an encoding-decoding problem and formulates the model as an autoencoder (Mousavi et al., 2015; Kulkarni et al., 2016; Mousavi et al., 2017; Grover & Ermon, 2018; Mousavi et al., 2018; Lu et al., 2018). Another category of methods is designed to mimic principled iterative CS algorithms using specialised network architectures (Metzler et al., 2017; Sun et al., 2016). In contrast, our framework maintains the separation of measurements from generation but still uses generic neural networks. Therefore, both the measurements and latent representation of the generator can be flexibly optimised for different, even adversarial, objectives, while taking advantage of powerful neural network architectures.
Moreover, we can train measurement functions with properties that are difficult or impossible to obtain from random or hand-crafted projections, thus broadening the range of problems that can be solved by minimising measurement errors online. In other words, learning the measurement can be used as a stepping stone for learning complex tasks where the cost function is difficult to design directly. Our approach can also be interpreted as training implicit generative models, where explicit minimisation of divergences is replaced by statistical tests (Mohamed & Lakshminarayanan, 2016). We have illustrated this idea in the context of relatively simple tasks, but anticipate that complex tasks such as style transfer (Zhu et al., 2017), as well as areas that have already seen applications of CS, including MRI (Lustig et al., 2007) and unsupervised anomaly detection (Schlegl et al., 2017), may further benefit from our approach.
Acknowledgments
We thank Shakir Mohamed and Jonathan Hunt for insightful discussions. We also appreciate the feedback from the anonymous reviewers.
Appendix
Appendix A Experiments Detail
Unless otherwise specified, we used the following default configuration for all experiments. We used the Adam optimiser (Kingma & Ba, 2014) with the learning rate and the parameters , . We trained all the models for steps with the batch size of . 100-dimensional latent representations were used for generators. We use 3 gradient descent steps for latent optimisation, and the initial step size of .
A.1 Reconstruction Experiments
Following Bora et al. (2017), we used a 2-layer multi-layer perceptron (MLP), with 500 units in each hidden layer and leaky ReLU nonlinearity, as the generator for MNIST images; for CelebA, we used the DCGAN generator (Radford et al., 2015). In addition to random linear projections, we tested the following neural networks as the measurement functions: a 2-layer MLP with 500 units in each layer and leaky ReLU nonlinearity for MNIST, and the DCGAN discriminator for CelebA.
A.2 GAN experiments
We used the same MLP generator and discriminator (i.e., measurement function) as described in the previous section for MNIST experiments. We also use the same architecture for the semisupervised GAN.
For the CIFAR dataset, we used the DCGAN architecture with its recommended Adam parameters , (Radford et al., 2015). We tested a number of hyperparameters as the cross product of the following: generator learning rates , discriminator learning rates , latent variable sizes , minibatch sizes . Additionally, 2 replicas for each combination were trained to account for the effect of random seeds.
To reproduce the Spectral Normalised GANs, we used the same discriminator as in Miyato et al. (2018), which is deeper than the DCGAN discriminator. A grid search over optimisation parameters found that the learning rate of and Adam’s of most stably achieved the best results.
References
 Arjovsky et al. (2017) Arjovsky, M., Chintala, S., and Bottou, L. Wasserstein gan. arXiv preprint arXiv:1701.07875, 2017.
 Azadi et al. (2019) Azadi, S., Olsson, C., Darrell, T., Goodfellow, I., and Odena, A. Discriminator rejection sampling. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=S1GkToR5tm.
 Balduzzi et al. (2018) Balduzzi, D., Racaniere, S., Martens, J., Foerster, J., Tuyls, K., and Graepel, T. The mechanics of n-player differentiable games. arXiv preprint arXiv:1802.05642, 2018.
 Baraniuk (2007) Baraniuk, R. G. Compressive sensing [lecture notes]. IEEE signal processing magazine, 24(4):118–121, 2007.
 Bishop (2006) Bishop, C. M. Pattern recognition and machine learning (information science and statistics). Springer-Verlag New York, Inc., Secaucus, NJ, USA, 2006.
 Bojanowski et al. (2018) Bojanowski, P., Joulin, A., Lopez-Pas, D., and Szlam, A. Optimizing the latent space of generative networks. In Dy, J. and Krause, A. (eds.), Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pp. 600–609, Stockholmsmässan, Stockholm Sweden, 10–15 Jul 2018. PMLR. URL http://proceedings.mlr.press/v80/bojanowski18a.html.
 Bora et al. (2017) Bora, A., Jalal, A., Price, E., and Dimakis, A. G. Compressed sensing using generative models. arXiv preprint arXiv:1703.03208, 2017.

 Bourlard & Kamp (1988) Bourlard, H. and Kamp, Y. Auto-association by multilayer perceptrons and singular value decomposition. Biological cybernetics, 59(4–5):291–294, 1988.
 Candes et al. (2006) Candes, E. J., Romberg, J. K., and Tao, T. Stable signal recovery from incomplete and inaccurate measurements. Communications on Pure and Applied Mathematics: A Journal Issued by the Courant Institute of Mathematical Sciences, 59(8):1207–1223, 2006.
 Dhar et al. (2018) Dhar, M., Grover, A., and Ermon, S. Modeling sparse deviations for compressed sensing using generative models. In International Conference on Machine Learning, pp. 1222–1231, 2018.
 Donoho (2006) Donoho, D. L. Compressed sensing. IEEE Transactions on information theory, 52(4):1289–1306, 2006.
 Duarte et al. (2008) Duarte, M. F., Davenport, M. A., Takhar, D., Laska, J. N., Sun, T., Kelly, K. F., and Baraniuk, R. G. Single-pixel imaging via compressive sampling. IEEE signal processing magazine, 25(2):83–91, 2008.
 Finn et al. (2017) Finn, C., Abbeel, P., and Levine, S. Model-agnostic meta-learning for fast adaptation of deep networks. arXiv preprint arXiv:1703.03400, 2017.
 Goodfellow et al. (2014) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. Generative adversarial nets. In Advances in neural information processing systems, pp. 2672–2680, 2014.
 Grover & Ermon (2018) Grover, A. and Ermon, S. Uncertainty autoencoders: Learning compressed representations via variational information maximization. arXiv preprint arXiv:1812.10539, 2018.
 Hand & Voroninski (2017) Hand, P. and Voroninski, V. Global guarantees for enforcing deep generative priors by empirical risk. arXiv preprint arXiv:1705.07576, 2017.
 Heusel et al. (2017) Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., and Hochreiter, S. Gans trained by a two timescale update rule converge to a local nash equilibrium. In Advances in Neural Information Processing Systems, pp. 6626–6637, 2017.
 Hu et al. (2018) Hu, Z., Yang, Z., Salakhutdinov, R., and Xing, E. P. On unifying deep generative models. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=rylSzlR.
 Ioffe & Szegedy (2015) Ioffe, S. and Szegedy, C. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.
 Johnson & Lindenstrauss (1984) Johnson, W. B. and Lindenstrauss, J. Extensions of lipschitz mappings into a hilbert space. Contemporary mathematics, 26(189–206):1, 1984.
 Kingma & Ba (2014) Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
 Kingma & Welling (2013) Kingma, D. P. and Welling, M. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.
 Krizhevsky & Hinton (2009) Krizhevsky, A. and Hinton, G. Learning multiple layers of features from tiny images. Technical report, Citeseer, 2009.

 Kulkarni et al. (2016) Kulkarni, K., Lohit, S., Turaga, P., Kerviche, R., and Ashok, A. Reconnet: Non-iterative reconstruction of images from compressively sensed measurements. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 449–458, 2016.
 Kurach et al. (2018) Kurach, K., Lucic, M., Zhai, X., Michalski, M., and Gelly, S. The gan landscape: Losses, architectures, regularization, and normalization. arXiv preprint arXiv:1807.04720, 2018.
 Liu et al. (2015) Liu, Z., Luo, P., Wang, X., and Tang, X. Deep learning face attributes in the wild. In Proceedings of International Conference on Computer Vision (ICCV), 2015.
 Lu et al. (2018) Lu, X., Dong, W., Wang, P., Shi, G., and Xie, X. Convcsnet: A convolutional compressive sensing framework based on deep learning. arXiv preprint arXiv:1801.10342, 2018.
 Lustig et al. (2007) Lustig, M., Donoho, D., and Pauly, J. M. Sparse mri: The application of compressed sensing for rapid mr imaging. Magnetic Resonance in Medicine: An Official Journal of the International Society for Magnetic Resonance in Medicine, 58(6):1182–1195, 2007.
 Maaten & Hinton (2008) Maaten, L. v. d. and Hinton, G. Visualizing data using t-sne. Journal of machine learning research, 9(Nov):2579–2605, 2008.
 MacKay (2003) MacKay, D. J. Information theory, inference and learning algorithms. Cambridge university press, 2003.
 Mao et al. (2017) Mao, X., Li, Q., Xie, H., Lau, R. Y., Wang, Z., and Paul Smolley, S. Least squares generative adversarial networks. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2794–2802, 2017.
 Metzler et al. (2017) Metzler, C., Mousavi, A., and Baraniuk, R. Learned d-amp: Principled neural network based compressive image recovery. In Advances in Neural Information Processing Systems, pp. 1772–1783, 2017.
 Mirza & Osindero (2014) Mirza, M. and Osindero, S. Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784, 2014.
 Miyato & Koyama (2018) Miyato, T. and Koyama, M. cgans with projection discriminator. arXiv preprint arXiv:1802.05637, 2018.
 Miyato et al. (2018) Miyato, T., Kataoka, T., Koyama, M., and Yoshida, Y. Spectral normalization for generative adversarial networks. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=B1QRgziT.
 Mohamed & Lakshminarayanan (2016) Mohamed, S. and Lakshminarayanan, B. Learning in implicit generative models. arXiv preprint arXiv:1610.03483, 2016.
 Mousavi et al. (2015) Mousavi, A., Patel, A. B., and Baraniuk, R. G. A deep learning approach to structured signal recovery. In 2015 53rd Annual Allerton Conference on Communication, Control, and Computing (Allerton), pp. 1336–1343. IEEE, 2015.
 Mousavi et al. (2017) Mousavi, A., Dasarathy, G., and Baraniuk, R. G. Deepcodec: Adaptive sensing and recovery via deep convolutional neural networks. arXiv preprint arXiv:1707.03386, 2017.
 Mousavi et al. (2018) Mousavi, A., Dasarathy, G., and Baraniuk, R. G. A data-driven and distributed approach to sparse signal representation and recovery. In International Conference on Learning Representations, 2018.
 Odena et al. (2017) Odena, A., Olah, C., and Shlens, J. Conditional image synthesis with auxiliary classifier gans. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pp. 2642–2651. JMLR.org, 2017.
 Radford et al. (2015) Radford, A., Metz, L., and Chintala, S. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015.
 Rezende et al. (2014) Rezende, D. J., Mohamed, S., and Wierstra, D. Stochastic backpropagation and approximate inference in deep generative models. arXiv preprint arXiv:1401.4082, 2014.
 Salimans et al. (2016) Salimans, T., Goodfellow, I., Zaremba, W., Cheung, V., Radford, A., and Chen, X. Improved techniques for training gans. In Advances in Neural Information Processing Systems, pp. 2234–2242, 2016.
 Schlegl et al. (2017) Schlegl, T., Seeböck, P., Waldstein, S. M., Schmidt-Erfurth, U., and Langs, G. Unsupervised anomaly detection with generative adversarial networks to guide marker discovery. In International Conference on Information Processing in Medical Imaging, pp. 146–157. Springer, 2017.
 Schmidhuber (1987) Schmidhuber, J. Evolutionary principles in self-referential learning, or on learning how to learn: the meta-meta-… hook. PhD thesis, Technische Universität München, 1987.
 Sun et al. (2016) Sun, J., Li, H., Xu, Z., et al. Deep admm-net for compressive sensing mri. In Advances in neural information processing systems, pp. 10–18, 2016.
 Tao et al. (2018) Tao, C., Chen, L., Henao, R., Feng, J., and Duke, L. C. Chi-square generative adversarial network. In Dy, J. and Krause, A. (eds.), Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pp. 4887–4896, Stockholmsmässan, Stockholm Sweden, 10–15 Jul 2018. PMLR. URL http://proceedings.mlr.press/v80/tao18b.html.
 Tibshirani (1996) Tibshirani, R. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B (Methodological), pp. 267–288, 1996.
 Villani (2008) Villani, C. Optimal transport: old and new, volume 338. Springer Science & Business Media, 2008.
 Weiss et al. (2007) Weiss, Y., Chang, H. S., and Freeman, W. T. Learning compressed sensing. In Snowbird Learning Workshop, Allerton, CA. Citeseer, 2007.

 Williams & Zipser (1989) Williams, R. J. and Zipser, D. A learning algorithm for continually running fully recurrent neural networks. Neural computation, 1(2):270–280, 1989.
 Yann et al. (1998) Yann, L., Corinna, C., and Burges, C. The mnist database of handwritten digits. URL http://yhann.lecun.com/exdb/mnist, 1998.
 Zhu et al. (2017) Zhu, J.-Y., Park, T., Isola, P., and Efros, A. A. Unpaired image-to-image translation using cycle-consistent adversarial networks. arXiv preprint, 2017.