1 Introduction
Efficient acquisition and recovery of high-dimensional signals is important in many applications [39, 28]. In compressive sensing, we formalize this problem as mapping an $n$-dimensional signal $x \in \mathbb{R}^n$ to $m$ linear measurements $y \in \mathbb{R}^m$. Mathematically, our goal is to solve the following system of linear equations:
(1) $y = Wx + \epsilon$
where the linear map $W \in \mathbb{R}^{m \times n}$ is referred to as the measurement matrix and $\epsilon \in \mathbb{R}^m$ is the measurement noise. If $m < n$, then the system is underdetermined and additional assumptions are required to guarantee a unique recovery. The celebrated results in compressive sensing posit that a $k$-sparse signal can be recovered with high probability using only $m = O(k \log(n/k))$ measurements acquired via a certain class of random measurement matrices using LASSO-based recovery methods [13, 22, 12, 47, 8]. The assumptions of sparsity are fairly general and can be applied "out-of-the-box" for many types of signals. For instance, image and audio signals are typically sparse in the wavelet and Fourier basis respectively. Moreover, the measurement matrix can simply be a random Gaussian matrix. However, these approaches fail to account for the statistical nature of the problem. In particular, if we are given a dataset of training signals from the underlying domain (or a related domain), then we can learn acquisition and recovery procedures that are better suited for out-of-sample generalization than sparsity-based priors which decode random encodings of the data using a LASSO-based recovery procedure.
We propose Uncertainty Autoencoders (UAE), a framework that learns both the acquisition (i.e., encoding) and recovery (i.e., decoding) procedures for statistical compressive sensing, given a dataset of training signals, using an information maximization objective. A UAE directly optimizes a variational lower bound on the mutual information between the measurements and the signal [5]. In doing so, we sidestep strong modeling assumptions such as sparsity [4]. Instead, we show that a UAE learns a novel implicit generative model [40] over high-dimensional signals via variational information maximization, i.e., a UAE permits sampling from the learned distribution but does not specify an explicit likelihood function. While the measurements can be seen as analogous to the latent variables of a variational autoencoder (VAE) [31], our approach does not require specifying a prior over the lower-dimensional measurements in learning the generative model.
By amortizing the recovery procedure using expressive neural networks, UAE decoding can share statistical strength across signals and scale to massive datasets, unlike prior frameworks for statistical compressive sensing that solve an optimization problem for each new data signal at test time
[9, 20]. As a final theoretical contribution, we show that a UAE, under suitable assumptions, is a generalization of principal component analysis. While earlier results showing connections of standard autoencoders with PCA assume linear encodings and decodings
[10, 3, 30], our result surprisingly holds even for nonlinear decodings and is derived from variational principles.

Empirically, we evaluate UAEs for compressive sensing and dimensionality reduction for downstream classification. On the MNIST, Omniglot, and CelebA datasets, we observe average improvements in recovery over the closest benchmark across all measurements considered. For dimensionality reduction and subsequent classification on the MNIST dataset, we observed an average improvement over principal component analysis across all dimensions considered and eight standard classifiers.
2 Preliminaries
We use uppercase symbols to denote probability distributions and assume they admit absolutely continuous densities on a suitable reference measure, denoted by the corresponding lowercase notation. Further, we use uppercase symbols for random variables and lower case for their realizations.
Let the data signal and measurements be denoted by the multivariate random variables $X \in \mathbb{R}^n$ and $Y \in \mathbb{R}^m$ respectively. For the purpose of compressive sensing, we assume $m < n$ and relate these variables through a measurement matrix $W \in \mathbb{R}^{m \times k}$ and a parameterized acquisition function $\psi_\phi: \mathbb{R}^n \to \mathbb{R}^k$ (for any integer $k$) such that:
(2) $Y = W\psi_\phi(X) + \epsilon$
where $\epsilon$ is the measurement noise. If we let $\psi_\phi$ be the identity function (i.e., $\psi_\phi(x) = x$ for all $x$), then we recover the standard system of underdetermined linear equations in Eq. (1), where the measurements are linear combinations of the signal corrupted by noise. In all other cases, the acquisition function transforms the signal such that $\psi_\phi(x)$ is potentially more amenable for compressive sensing. For instance, $\psi_\phi$ could specify a change of basis that encourages sparsity, e.g., a Fourier basis for audio signals. Note that our formulation allows the codomain of the mapping to be defined on a higher or lower dimensional space (i.e., $k \neq n$ in general).
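As a concrete illustration, the generalized measurement model of Eq. (2) can be sketched in a few lines. The orthogonal change of basis standing in for the acquisition function here is a hypothetical choice for illustration, not one prescribed by the framework:

```python
import numpy as np

rng = np.random.default_rng(0)
n, k, m = 64, 64, 10      # signal dim, acquisition output dim, measurements
sigma = 0.1               # noise standard deviation

# Hypothetical acquisition function psi: a fixed orthogonal change of basis.
# Setting psi to the identity recovers the linear model of Eq. (1).
Q, _ = np.linalg.qr(rng.normal(size=(k, n)))

def psi(x):
    return Q @ x

W = rng.normal(size=(m, k)) / np.sqrt(m)      # measurement matrix
x = rng.normal(size=n)                        # a signal
y = W @ psi(x) + sigma * rng.normal(size=m)   # Eq. (2): y = W psi(x) + noise
```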
Sparse Compressive Sensing.
In compressive sensing, we wish to recover an $n$-dimensional signal $x$ given the measurements $y$ by solving an underdetermined system of linear equations. To obtain nontrivial solutions, the true (but unknown) underlying signal is assumed to be sparse in some basis $B$. We are not given any additional information about the signal. The measurement matrix is a random Gaussian matrix and the recovery is done via LASSO, which solves a convex minimization problem such that the reconstruction for any datapoint is given as $\hat{x} = B \cdot \arg\min_{z} \|y - WBz\|_2^2 + \lambda \|z\|_1$, where $\lambda > 0$ is a tunable hyperparameter.
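For intuition, sparse recovery under the LASSO objective above can be sketched with a simple iterative soft-thresholding (ISTA) solver, assuming the sparsity basis is the identity ($B = I$); the problem sizes and the value of $\lambda$ below are arbitrary illustrative choices, and ISTA stands in for the proximal solvers typically used in practice:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, k = 100, 40, 5                 # signal dim, measurements, sparsity

x = np.zeros(n)
idx = rng.choice(n, k, replace=False)
x[idx] = rng.normal(size=k)          # a k-sparse signal in the canonical basis
W = rng.normal(size=(m, n)) / np.sqrt(m)   # random Gaussian measurement matrix
y = W @ x + 0.01 * rng.normal(size=m)      # noisy linear measurements

def ista(W, y, lam=0.01, n_iter=2000):
    """Iterative soft-thresholding for min_z ||y - W z||^2 + lam * ||z||_1."""
    L = np.linalg.norm(W, 2) ** 2    # Lipschitz constant of the quadratic term
    z = np.zeros(W.shape[1])
    for _ in range(n_iter):
        g = z - (W.T @ (W @ z - y)) / L               # gradient step
        z = np.sign(g) * np.maximum(np.abs(g) - lam / L, 0.0)  # soft threshold
    return z

x_hat = ista(W, y)
rel_err = np.linalg.norm(x - x_hat) / np.linalg.norm(x)
```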
Statistical Compressive Sensing.
The motivating use case of statistical compressive sensing [53] is to learn a sensing pipeline for the acquisition of signals from an underlying domain and their accurate recovery, given additional training signals from the domain being sensed. During training, we are given access to a set of signals $\mathcal{D}$, where each $x \in \mathcal{D}$ is assumed to be sampled i.i.d. from a data distribution $Q(X)$. Using this dataset, we learn the measurement matrix $W$ and the acquisition function $\psi_\phi$ in Eq. (2).
Particularly relevant to this work, we can optionally learn a recovery function $g$ to reconstruct a data signal given its measurements. This amortized approach is in contrast to standard decoding mechanisms such as LASSO, which solve an optimization problem for every new data point at test time. At test time, we directly observe the measurements $y$ that are assumed to satisfy Eq. (2) for a target signal $x$, and the task is to provide an accurate reconstruction $\hat{x}$. Evaluation is based on the reconstruction error between $x$ and $\hat{x}$. If we learned the recovery function $g$ during training, then $\hat{x} = g(y)$ and the error is given by $\|x - g(y)\|_2^2$.
Autoencoders.
An autoencoder is a pair of functions designed to encode and decode data signals respectively. For a standard autoencoder, let $e: \mathbb{R}^n \to \mathbb{R}^m$ and $d: \mathbb{R}^m \to \mathbb{R}^n$ denote the encoding and decoding functions for an $n$-dimensional signal and an $m$-dimensional latent space. The learning objective minimizes the reconstruction error over a dataset $\mathcal{D}$:
(3) $\min_{e, d} \sum_{x \in \mathcal{D}} \|x - d(e(x))\|_2^2$
where the encoding and decoding functions are typically parameterized using neural networks.
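A minimal numpy sketch of the objective in Eq. (3), using linear encoding and decoding maps trained by gradient descent; the dimensions, learning rate, and iteration count are arbitrary choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
N, n, m = 500, 20, 5
X = rng.normal(size=(N, n)) @ rng.normal(size=(n, n)) * 0.1  # correlated data

# Linear encoder e(x) = E x and decoder d(y) = D y, trained on the
# reconstruction objective of Eq. (3) by gradient descent.
E = rng.normal(size=(m, n)) * 0.1
D = rng.normal(size=(n, m)) * 0.1

def recon_mse(E, D):
    return float(np.mean((X @ E.T @ D.T - X) ** 2))

mse0 = recon_mse(E, D)               # error before training
lr = 0.01
for _ in range(1000):
    Y = X @ E.T                      # encodings
    R = Y @ D.T - X                  # reconstruction residuals
    gD = 2 * R.T @ Y / N             # gradient w.r.t. decoder weights
    gE = 2 * (R @ D).T @ X / N       # gradient w.r.t. encoder weights
    D -= lr * gD
    E -= lr * gE

mse = recon_mse(E, D)                # error after training
```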
3 Uncertainty Autoencoders
Consider a joint distribution between the signals $X$ and the measurements $Y$, which factorizes as $Q_\phi(X, Y) = Q(X)\, Q_\phi(Y \mid X)$. Here, $Q(X)$ is a fixed data distribution and $Q_\phi(Y \mid X)$ is a parameterized observation model that depends on the measurement noise $\epsilon$, as given by Eq. (2). In particular, $\phi$ corresponds collectively to the measurement matrix $W$ and the parameters of the acquisition function $\psi$. For instance, for isotropic Gaussian noise with a fixed variance $\sigma^2$, we have $Q_\phi(Y \mid X) = \mathcal{N}(W\psi_\phi(X), \sigma^2 I_m)$.

In an uncertainty autoencoder, we wish to learn the parameters $\phi$ that permit efficient and accurate recovery of a signal $x$ using the measurements $y$. In order to do so, we propose to maximize the mutual information between $X$ and $Y$:
(4) $\max_\phi I(X; Y) = H(X) - H(X \mid Y)$
where $H(\cdot)$ denotes differential entropy. The intuition is simple: if the measurements preserve maximum information about the signal, we can hope that recovery will have low reconstruction error. We formalize this intuition by noting that this objective is equivalent to maximizing the average log-posterior probability of $X$ given $Y$. In fact, in Eq. (4), we can omit the data entropy term $H(X)$ (since it is independent of $\phi$) to get the following equivalent objective:

(5) $\max_\phi \mathbb{E}_{Q_\phi(X, Y)}[\log q_\phi(x \mid y)]$
Even though the mutual information is maximized and equals the data entropy when $H(X \mid Y) = 0$, the dimensionality constraints on $Y$, the parametric assumptions on $\psi_\phi$, and the noise model given by Eq. (2) prohibit learning an identity mapping.
Estimating mutual information between arbitrary high-dimensional random variables can be challenging. However, we can lower bound the mutual information by introducing a variational approximation to the model posterior [5]. Denoting this approximation as $P_\theta(X \mid Y)$, we get the following lower bound:
(6) $I(X; Y) \geq H(X) + \mathbb{E}_{Q_\phi(X, Y)}[\log p_\theta(x \mid y)]$
Comparing Eqs. (4), (5), and (6), we can see that the second term in Eq. (6) approximates the intractable negative conditional entropy with a variational lower bound. Optimizing this bound leads to a decoding distribution $P_\theta(X \mid Y)$ with variational parameters $\theta$. The bound is tight when there is no distortion during recovery, or equivalently when the decoding distribution matches the true posterior (i.e., the Bayes optimal decoder).
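The bound can be checked numerically on a toy discrete channel, where the mutual information and the variational term are computed exactly; as expected, the bound is tight for the true posterior (Bayes optimal decoder) and loose for any other decoding distribution. The channel probabilities below are arbitrary illustrative choices:

```python
import numpy as np

# Toy discrete channel to verify the variational bound
# I(X;Y) >= H(X) + E_{q(x,y)}[log p(x|y)], tight at the true posterior.
p_x = np.array([0.5, 0.5])
p_y_given_x = np.array([[0.9, 0.1],        # noisy "measurement" channel
                        [0.2, 0.8]])
p_xy = p_x[:, None] * p_y_given_x
p_y = p_xy.sum(axis=0)
p_x_given_y = p_xy / p_y                   # true posterior over x given y

H_x = -np.sum(p_x * np.log(p_x))
I = np.sum(p_xy * np.log(p_xy / (p_x[:, None] * p_y)))  # exact MI

def bound(q_x_given_y):
    """H(X) + E_{p(x,y)}[log q(x|y)] for a candidate decoder q."""
    return H_x + np.sum(p_xy * np.log(q_x_given_y))

exact = bound(p_x_given_y)                 # equals I: the bound is tight
loose = bound(np.full((2, 2), 0.5))        # a poor decoder gives a looser bound
```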
Stochastic Optimization.
Formally, the uncertainty autoencoder (UAE) objective is given by:
(7) $\max_{\phi, \theta} \mathbb{E}_{Q_\phi(X, Y)}[\log p_\theta(x \mid y)]$
In practice, the data distribution $Q(X)$ is unknown and accessible only via a finite dataset $\mathcal{D}$. Hence, expectations with respect to $Q(X)$ and their gradients can be estimated using Monte Carlo methods. This allows us to express the UAE objective as:
(8) $\max_{\phi, \theta} \sum_{x \in \mathcal{D}} \mathbb{E}_{Q_\phi(Y \mid x)}[\log p_\theta(x \mid y)]$
Tractable evaluation of the above objective is closely tied to the distributional assumptions on the noise model. For the typical case of an isotropic Gaussian noise model, we know that $Q_\phi(Y \mid x) = \mathcal{N}(W\psi_\phi(x), \sigma^2 I_m)$, which is easy to sample from.
While Monte Carlo gradient estimates with respect to $\theta$ can be efficiently obtained via linearity of expectation, gradient estimation with respect to $\phi$ is challenging since these parameters specify the sampling distribution $Q_\phi(Y \mid X)$. One solution is to evaluate score function gradient estimates along with control variates [23, 26, 52]. Alternatively, many continuous distributions (e.g., the isotropic Gaussian and Laplace distributions) can be reparameterized such that it is possible to obtain samples by applying a deterministic transformation to samples from a fixed distribution [31, 44]. Reparameterized gradient estimates typically have lower variance than score function estimates [25, 46].
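As a small numerical illustration of this point (a sketch, not the paper's implementation), we can compare the reparameterized (pathwise) estimator against the score function estimator for a one-dimensional Gaussian, where the true gradient is known in closed form:

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 1.5, 0.1
S = 100000

# Pathwise (reparameterized) estimator: write y = mu + sigma * eps with
# eps ~ N(0, 1), so d/dmu E[f(y)] = E[f'(y)]. For f(y) = y^2 the true
# gradient is 2 * mu = 3.0.
eps = rng.normal(size=S)
y = mu + sigma * eps
grad_pathwise = np.mean(2 * y)

# Score-function (REINFORCE) estimator of the same gradient:
# d/dmu E[f(y)] = E[f(y) * (y - mu) / sigma^2]
grad_score = np.mean(y**2 * (y - mu) / sigma**2)

# Per-sample variances: the pathwise estimator is far lower variance here.
var_pathwise = np.var(2 * y)
var_score = np.var(y**2 * (y - mu) / sigma**2)
```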
4 Theoretical Analysis
In this section, we derive connections of uncertainty autoencoders with generative modeling and Principal Component Analysis (PCA). The proofs of all theoretical results in this section are in Appendix A.
4.1 Implicit Generative Modeling
Starting from an arbitrary point $x^{(0)} \in \mathbb{R}^n$, define a Markov chain over $(X, Y)$ with the following transitions:

(9) $y^{(t)} \sim Q_\phi(Y \mid x = x^{(t)})$
(10) $x^{(t+1)} \sim P_\theta(X \mid y = y^{(t)})$
Theorem 1. Let $(\phi^*, \theta^*)$ denote an optimal solution to the UAE objective in Eq. (7), and suppose the decoder is expressive enough that $P_{\theta^*}(X \mid Y)$ matches the true posterior $Q_{\phi^*}(X \mid Y)$. If the Markov chain defined by Eqs. (9, 10) is ergodic, then its stationary distribution is given by the joint distribution $Q_{\phi^*}(X, Y)$.
The above theorem suggests an interesting insight into the behavior of UAEs. Under idealized conditions, the learned model specifies an implicit generative model for the data distribution. Further, ergodicity can be shown to hold for the isotropic Gaussian noise model.
Corollary 1.
For any fixed value of $\sigma > 0$, let us assume that the UAE objective in Eq. (7) is globally maximized for some choice of $(\phi^*, \theta^*)$. For a Gaussian noise model, the stationary distribution of the chain for the parameters $\phi^*$ and $\theta^*$ is given by $Q_{\phi^*}(X, Y)$.
The marginal of this joint distribution with respect to $X$ corresponds to the data distribution. A UAE hence seeks to learn an implicit generative model of the data distribution [21, 40], i.e., even though we do not have a tractable estimate for the likelihood of the model, we can generate samples using the Markov chain transitions defined in Eqs. (9, 10).
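For a scalar linear-Gaussian toy model, the chain of Eqs. (9, 10) can be simulated directly. Assuming the decoder equals the Bayes optimal posterior under a standard normal data distribution (the idealized condition above), the chain's $X$-marginal should converge to that data distribution regardless of the starting point; all constants below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
w, sigma = 1.0, 1.0        # linear encoder: y = w*x + eps, eps ~ N(0, sigma^2)

# Assumed data distribution x ~ N(0, 1); the Bayes optimal decoder is then
# Gaussian with posterior variance 1 / (1 + w^2 / sigma^2).
post_var = 1.0 / (1.0 + w**2 / sigma**2)

x = 10.0                   # arbitrary starting point, far from the data
samples = []
for t in range(30000):
    y = rng.normal(w * x, sigma)                 # Eq. (9): y ~ Q(Y | x)
    x = rng.normal(post_var * w * y / sigma**2,  # Eq. (10): x ~ P(X | y)
                   np.sqrt(post_var))
    if t >= 1000:                                # discard burn-in
        samples.append(x)
samples = np.array(samples)
# The empirical X-marginal should approach N(0, 1).
```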
4.2 Optimal Encodings
A UAE can additionally be viewed as a dimensionality reduction technique for the dataset $\mathcal{D}$. While in general the encoding performing this reduction can be nonlinear, in the case of a linear encoding (as typically done in standard compressive sensing) the projection vectors are given as the rows of the measurement matrix $W$. The result below characterizes the optimal encoding of the dataset with respect to the UAE objective for an isotropic Gaussian noise model.

Theorem 2.
Assume a uniform data distribution over a finite dataset $\mathcal{D}$. Further, we assume that the expectations in the UAE objective exist, and that the signals and measurement matrices are bounded in $\ell_2$/Frobenius norms, i.e., $\|x\|_2 \leq B_x$ for all $x \in \mathcal{D}$ and $\|W\|_F \leq B_W$, for some positive constants $B_x, B_W$. For a linear encoder and isotropic Gaussian noise $\epsilon \sim \mathcal{N}(0, \sigma^2 I_m)$, the optimal measurement matrix $W^*$ that maximizes the mutual information in the limit $\sigma \to \infty$ is given as $W^* \propto U_m^\top$, where $U_m$ denotes the top-$m$ eigenvectors of the matrix $\sum_{x \in \mathcal{D}} x x^\top$ with the largest eigenvalues (specified up to a positive scaling constant).
Under the stated assumptions, the above result suggests an interesting connection between uncertainty autoencoding and PCA. PCA is one of the most widely used techniques for dimensionality reduction and seeks to find the directions that explain the most variance in the data. Theorem 2 suggests that when the noise in the projected signal is very high, the optimal projection directions (i.e., the rows of $W^*$) correspond to the principal components of the data signals. We note that this observation comes with a caveat; when the noise variance is high, it will dominate the contribution to the measurements in Eq. (2), as one would expect. Hence, the measurements and the signal will have low mutual information even under the optimal measurement matrix $W^*$.
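The high-noise limit of Theorem 2 can be sanity-checked numerically: the top eigenvectors of the uncentered second-moment matrix span (for approximately zero-mean data) the same subspace as the top principal components. The sizes below are arbitrary, and this is an illustration rather than the paper's experiment:

```python
import numpy as np

rng = np.random.default_rng(0)
N, n, m = 500, 10, 2
X = rng.normal(size=(N, n)) @ np.diag(np.linspace(3, 0.5, n))  # anisotropic data

# Theorem 2 (high-noise limit): rows of the optimal W span the top-m
# eigenvectors of sum_i x_i x_i^T (the uncentered second-moment matrix).
S = X.T @ X
eigvals, eigvecs = np.linalg.eigh(S)   # ascending eigenvalue order
W_opt = eigvecs[:, -m:].T              # top-m eigenvectors as projection rows

# For (near) zero-mean data this coincides with the top-m principal components.
_, _, Vt = np.linalg.svd(X - X.mean(axis=0), full_matrices=False)

# Principal angles between the two m-dimensional subspaces: singular values
# near 1 mean the subspaces coincide.
overlap = np.linalg.svd(W_opt @ Vt[:m].T, compute_uv=False)
```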
Our assumptions are notably different from prior results drawing connections with autoencoders and PCA. Prior results show that linear encoding and decoding in a standard autoencoder recovers the principal components of the data (Eq. (3) in [10], Eq. (1) in [3]). In contrast, the result in Theorem 2 is derived from variational principles and does not assume linear decoding.
In general, the behaviors of UAE and PCA can be vastly different. As noted in prior work [5, 51], the principal components may not be the most informative low-dimensional projections for recovering the original high-dimensional data from its projections. A UAE, on the other hand, is explicitly designed to preserve as much information as possible (see Eq. (5)). We illustrate the differences in a synthetic experiment in Figure 1. The true data distribution is an equi-weighted mixture of two Gaussians stretched along orthogonal directions. We sample points (black) from this mixture and consider two dimensionality reductions. In the first case, we project the data on the first principal component (blue points on magenta line). This axis captures a large fraction of the variance in the data but collapses data sampled from the bottom-right Gaussian into a narrow region. The projections of the data on the UAE axis (red points on green line) are more spread out. This suggests that recovery is easier, even if doing so changes the total variance in the projected space compared to PCA.

5 Experiments
We performed three kinds of experiments: the first two relate to the application of UAEs for compressive sensing and the third experiment contrasts UAEs and PCA with regards to dimensionality reduction for downstream classification. Additional experimental details beyond those stated below are provided in Appendix B.
5.1 Statistical Compressive Sensing
We perform compressive sensing on three high-dimensional continuous datasets: MNIST [34], Omniglot [33], and CelebA [37], with an extremely low number of measurements $m$. We discuss the MNIST and Omniglot datasets here since they have a similar setup. Due to lack of space, results on the CelebA dataset are deferred to Appendix B.3. Every image in MNIST and Omniglot has a dimensionality of $n = 28 \times 28 = 784$, and we considered varying numbers of measurements $m$. In all our experiments, we assume a Gaussian noise model with fixed variance. We evaluated UAE against:


LASSO decoding with random Gaussian matrices. The MNIST and Omniglot datasets are reasonably sparse in the canonical pixel basis, and hence we did not observe any gains after applying the Discrete Cosine Transform or the Daubechies-1 Wavelet Transform.

VAE decoding with random Gaussian matrices [9]. This approach learns a latent variable generative model over the observed variables $x$ and the latent variables $z$. Such a model defines a mapping $G$ from $z$ to $x$, which is given by either the mean function of the observation model for a VAE or the forward deterministic mapping used to generate samples for a GAN. We use VAEs in our experiments. Thereafter, using a classic acquisition matrix satisfying a generalized Restricted Eigenvalue Condition (say $W$), the reconstruction for any datapoint is given as $\hat{x} = G(\hat{z})$ with $\hat{z} = \arg\min_z \|y - WG(z)\|_2$. Intuitively, this procedure seeks the latent vector $\hat{z}$ such that the corresponding point on the range of $G$ can best approximate the measurements under the mapping. We used the default parameter settings and architectures proposed in [9].

RP+UAE decoding. To independently evaluate the effect of variational decoding, this ablation baseline encodes the data using Gaussian random projections (RP) and trains the decoder based on the UAE objective. Since LASSO and VAE both use an RP encoding, the differences in performance would arise only due to the decoding procedures.
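To make the generative model-based decoding baseline concrete, here is a sketch with a hypothetical fixed generator $G$ standing in for a trained decoder; recovery searches the latent space by gradient descent on the measurement residual. The architecture, step size, and sizes are illustrative assumptions, not the settings of [9]:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, d, h = 50, 20, 5, 25   # signal dim, measurements, latent dim, hidden dim

# Hypothetical generator G(z) = U tanh(V z), standing in for a trained decoder.
U = rng.normal(size=(n, h)) / 5.0
V = rng.normal(size=(h, d)) / np.sqrt(d)

def G(z):
    return U @ np.tanh(V @ z)

W = rng.normal(size=(m, n)) / np.sqrt(m)   # random measurement matrix
x_true = G(rng.normal(size=d))             # a signal on the range of G
y = W @ x_true + 0.01 * rng.normal(size=m)

def loss(z):
    r = W @ G(z) - y
    return float(r @ r)

# Decode: gradient descent on ||W G(z) - y||^2 over the latent code z.
z = np.zeros(d)
loss0 = loss(z)
lr = 1e-3
for _ in range(5000):
    t = np.tanh(V @ z)
    r = W @ (U @ t) - y
    grad_z = V.T @ ((U.T @ (W.T @ (2 * r))) * (1 - t**2))  # chain rule
    z -= lr * grad_z

x_hat = G(z)
loss_final = loss(z)
```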
The UAE decoder and the VAE encoder/decoder are multilayer perceptrons consisting of two hidden layers. For a fair comparison with random Gaussian matrices, the UAE encoder is linear. Further, we perform $L_2$ regularization on the norm of $W$. This has two benefits. First, it helps in generalization to test signals outside the train set. Second, it is equivalent to solving the Lagrangian of the UAE objective constrained so that the norm of $W$ does not exceed a bound $B$. The Lagrangian parameter is chosen by line search on the above objective. The constraint ensures that the UAE does not learn encodings that trivially scale the measurement matrix to overcome noise. For each $m$, we choose $B$ to be the expected norm of a random Gaussian matrix of the same dimensions, for a fair comparison with the other baselines. In practice, the norm of the learned $W$ for a UAE is much smaller than those of random Gaussian matrices, suggesting that the observed performance improvements are nontrivial.
Results. The reconstruction errors on the standard test sets are shown in Figure 2. For both datasets, we observe that UAE drastically outperforms both LASSO and VAE for all values of $m$ considered. LASSO (blue curves) is unable to reconstruct with so few measurements. The VAE (red) error decays much more slowly than that of the UAE as $m$ grows. Even the RP+UAE baseline (yellow), which trains the decoder while keeping the encoding fixed to a random projection, outperforms VAE. Jointly training the encoder and the decoder using the UAE objective (green) exhibits the best performance. These results are also reflected qualitatively in the reconstructed test signals shown in Figure 3.
5.2 Transfer Compressive Sensing
A key motivation for compressive sensing is to directly acquire signals using few measurements. On the contrary, learning-based methods such as the VAE and UAE require access to large amounts of training data. For many critical applications, even acquiring the training data might not be feasible. Hence, we test generative model-based recovery on the transfer compressed sensing task introduced in [20].
Experimental setup. We train the models on a source domain, which is assumed to be datarich and related to a datahungry target domain. Since the dimensions of MNIST and Omniglot images match, transferring from one domain to another requires no additional processing. For UAE, we retrain the decoder on the target domain with the encodings learned for the source domain. This retraining is not possible for a VAE, which would require the target domain signals directly (and not the compressed measurements). Retraining is computationally efficient due to amortized decoding in UAE, unlike LASSO or VAE which solve an optimization problem for every test signal.
Results. The reconstruction errors are shown in Figure 4. LASSO does not involve any learning, and hence its performance is the same as in Figure 2. The VAE performance degrades significantly in comparison, even performing worse than LASSO in some cases. Again, the UAE-based methods outperform competing approaches. Qualitative differences are highlighted in Figure 5.
Dimensions | Method | kNN    | DT     | RF     | MLP    | AdaB   | NB     | QDA    | SVM
2          | PCA    | 0.4078 | 0.4283 | 0.4484 | 0.4695 | 0.4002 | 0.4455 | 0.4576 | 0.4503
           | UAE    | 0.4644 | 0.5085 | 0.5341 | 0.5437 | 0.4248 | 0.5226 | 0.5316 | 0.5256
5          | PCA    | 0.7291 | 0.5640 | 0.6257 | 0.7475 | 0.5570 | 0.6587 | 0.7321 | 0.7102
           | UAE    | 0.8115 | 0.6331 | 0.7094 | 0.8262 | 0.6164 | 0.7286 | 0.7961 | 0.7873
10         | PCA    | 0.9257 | 0.6354 | 0.6956 | 0.9006 | 0.7025 | 0.7789 | 0.8918 | 0.8440
           | UAE    | 0.9323 | 0.5583 | 0.7362 | 0.9258 | 0.7165 | 0.7895 | 0.9098 | 0.8753
25         | PCA    | 0.9734 | 0.6382 | 0.6889 | 0.9521 | 0.7234 | 0.8635 | 0.9572 | 0.9194
           | UAE    | 0.9730 | 0.5407 | 0.7022 | 0.9614 | 0.7398 | 0.8306 | 0.9580 | 0.9218
50         | PCA    | 0.9751 | 0.6381 | 0.6059 | 0.9580 | 0.7390 | 0.8786 | 0.9632 | 0.9376
           | UAE    | 0.9754 | 0.5424 | 0.6765 | 0.9597 | 0.7330 | 0.8579 | 0.9638 | 0.9384
100        | PCA    | 0.9734 | 0.6380 | 0.4040 | 0.9584 | 0.7136 | 0.8763 | 0.9570 | 0.9428
           | UAE    | 0.9731 | 0.6446 | 0.6241 | 0.9597 | 0.7170 | 0.8809 | 0.9595 | 0.9431
5.3 Dimensionality Reduction
Dimensionality reduction is a common preprocessing technique for specifying features for classification. We compare PCA and UAE on this task. While Theorem 2 posits that the two techniques are equivalent in the regime of high noise given optimal decodings for the UAE framework, we consider the case where the noise is set as a hyperparameter based on a validation set, enabling out-of-sample generalization.
Setup. We learn the principal components and UAE projections on the MNIST training set for varying numbers of dimensions. We then learn classifiers based on these projections. Again, we use a linear encoder for the UAE for a fair evaluation. Since inductive biases vary across classifiers, we considered eight commonly used classifiers: k-Nearest Neighbors (kNN), Decision Trees (DT), Random Forests (RF), Multilayer Perceptron (MLP), AdaBoost (AdaB), Gaussian Naive Bayes (NB), Quadratic Discriminant Analysis (QDA), and Support Vector Machines (SVM) with a linear kernel.
Results. The performance of the PCA and UAE feature representations for different numbers of dimensions is shown in Table 1. We find that UAE outperforms PCA in a majority of the cases. Further, this trend is largely consistent across classifiers. The improvements are especially high when the number of dimensions is low, suggesting the benefits of UAE as a dimensionality reduction technique for classification.
6 Discussion & Related Work
Our work provides a holistic treatment of several lines of research on mutual information maximization, autoencoding, compressive sensing, and dimensionality reduction using principal component analysis.
Mutual Information Maximization.
The principle of mutual information maximization, often referred to as InfoMax in prior work, was first proposed for learning encodings for communication over a noisy channel [35]. The InfoMax objective has also been applied to statistical compressive sensing for learning both linear and nonlinear encodings [51, 14, 50]. Our work differs from these existing frameworks in two fundamental ways. First, we optimize a tractable variational lower bound to the MI, which allows our method to scale to high-dimensional signals and measurements. Second, we learn an amortized decoder in addition to the encoder, which sidesteps expensive per-example optimization for the test signals being sensed.
Further, we improve upon the IM algorithm originally proposed for variational information maximization [5]. While the IM algorithm optimizes the lower bound on the mutual information in alternating "wake-sleep" phases for the encoder ("wake") and decoder ("sleep"), analogous to the expectation-maximization procedure used in [51], we optimize the encoder and decoder jointly using a single consistent objective, leveraging recent advances in gradient-based variational stochastic optimization.

Autoencoders. To contrast uncertainty autoencoders with other commonly used autoencoding schemes, consider a UAE with a Gaussian observation model with fixed isotropic covariance for the decoder in all the autoencoding objectives we discuss subsequently. Letting $g_\theta$ denote the decoder mean function, the UAE objective can be simplified as:

$\min_{\phi, \theta} \sum_{x \in \mathcal{D}} \mathbb{E}_{Q_\phi(Y \mid x)}\left[\|x - g_\theta(y)\|_2^2\right]$
Standard Autoencoder. If we assume no measurement noise (i.e., $\epsilon = 0$) and assume the observation model to be a Gaussian with mean $g_\theta(y)$ and a fixed isotropic covariance, then the UAE objective reduces to minimizing the mean squared error between the true and recovered signal:

$\min_{\phi, \theta} \sum_{x \in \mathcal{D}} \|x - g_\theta(W\psi_\phi(x))\|_2^2$
This special case of a UAE corresponds to a standard autoencoder [6], where the measurements $y$ signify a hidden representation for the observed data $x$. However, this case lacks the interpretation of an implicit generative model since the assumptions of Theorem 1 do not hold.

Denoising Autoencoders. A DAE [49] adds noise at the level of the input signal to learn robust representations. For a UAE, the noise model is defined at the level of the compressed measurements. Again, under the assumption of a Gaussian decoder, the DAE objective can be expressed as:

$\min_{\phi, \theta} \sum_{x \in \mathcal{D}} \mathbb{E}_{\tilde{x} \sim C(\tilde{x} \mid x)}\left[\|x - g_\theta(W\psi_\phi(\tilde{x}))\|_2^2\right]$
where $C(\tilde{x} \mid x)$ is some predefined noise corruption model. Similar to Theorem 1, a DAE also learns an implicit model of the data distribution [7, 1].
Variational Autoencoders. A VAE [31, 44] explicitly learns a latent variable model for the dataset. The learning objective is a variational lower bound to the marginal log-likelihood assigned by the model to the data $x$, which notationally corresponds to $\log p_\theta(x)$. The variational objective that maximizes this quantity can be simplified as:

$\max_{\phi, \theta} \sum_{x \in \mathcal{D}} \mathbb{E}_{Q_\phi(Y \mid x)}[\log p_\theta(x \mid y)] - D_{\mathrm{KL}}(Q_\phi(Y \mid x) \,\|\, P(Y))$
The learning objective includes a reconstruction error term, akin to the UAE objective. Crucially, it also includes a regularization term to minimize the divergence of the variational posterior over $Y$ from a prior distribution $P(Y)$. A key difference is that a UAE does not need to explicitly model a prior distribution over $Y$. On the downside, a VAE can perform efficient ancestral sampling while a UAE requires running relatively expensive Markov chains to obtain samples.
Recent works have attempted to unify the variants of variational autoencoders through the lens of mutual information [2, 54, 18]. These works also highlight scenarios where the VAE can learn to ignore the latent code in the presence of a strong decoder, thereby affecting the reconstructions to attain a lower KL loss. One particular variant, the $\beta$-VAE, weighs the additional regularization term with a positive factor $\beta$ and can effectively learn disentangled representations [29]. Although [29] does not consider this case, the UAE can be seen as a $\beta$-VAE with $\beta = 0$.
Generative Modeling and Compressive Sensing. The closely related works of [9, 20] also use generative models for compressive sensing. As highlighted in Section 5, their approach is radically different from a UAE's. Similar to [9], a UAE learns a data distribution. However, in doing so, it additionally learns an acquisition/encoding function and a recovery/decoding function, unlike [9, 20], which rely on generic random matrix encodings and unamortized decoding. The cost of implicit learning in a UAE is that some of its inference capabilities, such as likelihood evaluation and sampling, are intractable or require running Markov chains. However, these inference queries are orthogonal to compressive sensing. Finally, our decoding is amortized and scales to large datasets, unlike [9, 20], which solve an independent optimization problem for each test signal.
To summarize, our uncertainty autoencoding formulation provides a combination of unique desirable properties for representation learning that are absent in prior autoencoders. As discussed, a UAE defines an implicit generative model without specifying a prior (Theorem 1) even under realistic conditions (Corollary 1; unlike DAEs) and has rich connections with PCA even for nonlinear decoders (Theorem 2; unlike any kind of existing autoencoder).
7 Conclusion & Future Work
In this work, we presented uncertainty autoencoders (UAE), a framework for representation learning via variational maximization of mutual information between an input signal and hidden representation. We showed that UAEs are a natural candidate for statistical compressive sensing, wherein we can learn the acquisition and recovery functions jointly as well as sidestep making any strong assumptions based on sparsity. We presented connections of our framework with many related threads of research, especially with respect to implicit generative modeling and principal component analysis.
Our framework suffers from limitations that also serve as directions for future work. We assumed a setting with a fixed number of measurements. Applications involving a stream of measurements would naively require retraining the UAE for every additional measurement. Extending sequential encoding schemes, e.g., [11], to the UAE framework is a promising direction. Further, it would be interesting to incorporate advancements in compressive sensing based on complex neural network architectures [42, 32, 15, 38, 48] within the UAE framework for real-world applications, e.g., medical imaging.
Unlike the rich theory surrounding the compressive sensing of sparse signals, a similar theory for generative model-based priors on the signal distribution is lacking. Existing results due to [19] provide upper bounds on the number of measurements required for the recovery of sparse signals using only linear encodings that satisfy the restricted isometry property. These bounds naturally hold for UAEs when only the decoder is learned (as in the RP+UAE baseline considered in our experiments). However, these results only guarantee the existence of an optimal decoder and do not prescribe an algorithm for decoding (such as LASSO). Recent works have made promising progress in developing a theory of SGD-based recovery methods for nonconvex inverse problems, which continues to be an exciting direction for future work [9, 27, 20, 36].
Acknowledgements
We are thankful to Kristy Choi, Neal Jean, Daniel Levy, Ben Poole, and Yang Song for helpful discussions and comments on early drafts.
References

[1] G. Alain and Y. Bengio. What regularized autoencoders learn from the data-generating distribution. Journal of Machine Learning Research, 15(1):3563–3593, 2014.
[2] A. A. Alemi, B. Poole, I. Fischer, J. V. Dillon, R. A. Saurous, and K. Murphy. Fixing a broken ELBO. In International Conference on Machine Learning, 2018.
[3] P. Baldi and K. Hornik. Neural networks and principal component analysis: Learning from examples without local minima. Neural Networks, 2(1):53–58, 1989.
[4] R. G. Baraniuk, V. Cevher, M. F. Duarte, and C. Hegde. Model-based compressive sensing. IEEE Transactions on Information Theory, 56(4):1982–2001, 2010.
[5] D. Barber and F. Agakov. The IM algorithm: A variational approach to information maximization. In Advances in Neural Information Processing Systems, 2003.
[6] Y. Bengio et al. Learning deep architectures for AI. Foundations and Trends in Machine Learning, 2(1):1–127, 2009.
[7] Y. Bengio, L. Yao, G. Alain, and P. Vincent. Generalized denoising autoencoders as generative models. In Advances in Neural Information Processing Systems, 2013.
[8] P. J. Bickel, Y. Ritov, and A. B. Tsybakov. Simultaneous analysis of lasso and Dantzig selector. The Annals of Statistics, pages 1705–1732, 2009.
[9] A. Bora, A. Jalal, E. Price, and A. G. Dimakis. Compressed sensing using generative models. In International Conference on Machine Learning, 2017.
[10] H. Bourlard and Y. Kamp. Auto-association by multilayer perceptrons and singular value decomposition. Biological Cybernetics, 59(4-5):291–294, 1988.
[11] G. Braun, S. Pokutta, and Y. Xie. Info-greedy sequential adaptive compressed sensing. In Allerton Conference on Communication, Control, and Computing. IEEE, 2014.
[12] E. J. Candès, J. Romberg, and T. Tao. Robust uncertainty principles: Exact signal reconstruction from highly incomplete frequency information. IEEE Transactions on Information Theory, 52(2):489–509, 2006.
[13] E. J. Candès and T. Tao. Decoding by linear programming. IEEE Transactions on Information Theory, 51(12):4203–4215, 2005.
[14] W. R. Carson, M. Chen, M. R. Rodrigues, R. Calderbank, and L. Carin. Communications-inspired projection design with application to compressive sensing. SIAM Journal on Imaging Sciences, 5(4):1185–1212, 2012.
[15] J. R. Chang, C.-L. Li, B. Poczos, B. V. Kumar, and A. C. Sankaranarayanan. One network to solve them all—solving linear inverse problems using deep projection models. arXiv preprint, 2017.
[16] G. Chen and D. Needell. Compressed sensing and dictionary learning. Preprint, 106, 2015.
[17] X. Chen, Y. Duan, R. Houthooft, J. Schulman, I. Sutskever, and P. Abbeel. InfoGAN: Interpretable representation learning by information maximizing generative adversarial nets. In Advances in Neural Information Processing Systems, 2016.
[18] X. Chen, D. P. Kingma, T. Salimans, Y. Duan, P. Dhariwal, J. Schulman, I. Sutskever, and P. Abbeel. Variational lossy autoencoder. In International Conference on Learning Representations, 2017.
[19] A. Cohen, W. Dahmen, and R. DeVore. Compressed sensing and best k-term approximation. Journal of the American Mathematical Society, 22(1):211–231, 2009.
[20] M. Dhar, A. Grover, and S. Ermon. Modeling sparse deviations for compressed sensing using generative models. In International Conference on Machine Learning, 2018.
[21] P. J. Diggle and R. J. Gratton. Monte Carlo methods of inference for implicit statistical models. Journal of the Royal Statistical Society, Series B (Methodological), pages 193–227, 1984.
[22] D. L. Donoho. Compressed sensing. IEEE Transactions on Information Theory, 52(4):1289–1306, 2006.
[23] M. C. Fu. Gradient estimation. Handbooks in Operations Research and Management Science, 13:575–616, 2006.
[24] S. Gao, G. Ver Steeg, and A. Galstyan. Variational information maximization for feature selection. In Advances in Neural Information Processing Systems, 2016.
[25] P. Glasserman. Monte Carlo Methods in Financial Engineering, volume 53. Springer Science & Business Media, 2013.
[26] P. W. Glynn. Likelihood ratio gradient estimation for stochastic systems. Communications of the ACM, 33(10):75–84, 1990.
[27] P. Hand and V. Voroninski. Global guarantees for enforcing deep generative priors by empirical risk. arXiv preprint arXiv:1705.07576, 2017.
 [28] M. A. Herman and T. Strohmer. Highresolution radar via compressed sensing. IEEE Transactions on Signal Processing, 57(6):2275–2284, 2009.
 [29] I. Higgins, L. Matthey, A. Pal, C. Burgess, X. Glorot, M. Botvinick, S. Mohamed, and A. Lerchner. betaVAE: Learning basic visual concepts with a constrained variational framework. In International Conference on Learning Representations, 2016.
 [30] G. E. Hinton and R. R. Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, 313(5786):504–507, 2006.
 [31] D. Kingma and M. Welling. Autoencoding variational Bayes. In International Conference on Learning Representations, 2014.

[32]
K. Kulkarni, S. Lohit, P. Turaga, R. Kerviche, and A. Ashok.
Reconnet: Noniterative reconstruction of images from compressively
sensed measurements.
In
IEEE Conference on Computer Vision and Pattern Recognition
, 2016.  [33] B. M. Lake, R. Salakhutdinov, and J. B. Tenenbaum. Humanlevel concept learning through probabilistic program induction. Science, 350(6266):1332–1338, 2015.
 [34] Y. LeCun, C. Cortes, and C. J. Burges. MNIST handwritten digit database. http://yann. lecun. com/exdb/mnist, 2010.
 [35] R. Linsker. How to generate ordered maps by maximizing the mutual information between input and output signals. Neural computation, 1(3):402–411, 1989.
 [36] R. Liu, S. Cheng, Y. He, X. Fan, Z. Lin, and Z. Luo. On the convergence of learningbased iterative methods for nonconvex inverse problems. arXiv preprint arXiv:1808.05331, 2018.
 [37] Z. Liu, P. Luo, X. Wang, and X. Tang. Deep learning face attributes in the wild. In International Conference on Computer Vision, 2015.
 [38] X. Lu, W. Dong, P. Wang, G. Shi, and X. Xie. Convcsnet: A convolutional compressive sensing framework based on deep learning. arXiv preprint arXiv:1801.10342, 2018.
 [39] M. Lustig, D. Donoho, and J. M. Pauly. Sparse MRI: The application of compressed sensing for rapid MR imaging. Magnetic resonance in medicine, 58(6):1182–1195, 2007.
 [40] S. Mohamed and B. Lakshminarayanan. Learning in implicit generative models. arXiv preprint arXiv:1610.03483, 2016.

[41]
S. Mohamed and D. J. Rezende.
Variational information maximisation for intrinsically motivated reinforcement learning.
In Advances in Neural Information Processing Systems, 2015.  [42] A. Mousavi, A. B. Patel, and R. G. Baraniuk. A deep learning approach to structured signal recovery. In Annual Allerton Conference on Communication, Control, and Computing, 2015.
 [43] A. Radford, L. Metz, and S. Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015.

[44]
D. J. Rezende, S. Mohamed, and D. Wierstra.
Stochastic backpropagation and approximate inference in deep generative models.
In International Conference on Machine Learning, 2014.  [45] G. O. Roberts and J. S. Rosenthal. Harris recurrence of metropoliswithingibbs and transdimensional markov chains. The Annals of Applied Probability, pages 2123–2139, 2006.
 [46] J. Schulman, N. Heess, T. Weber, and P. Abbeel. Gradient estimation using stochastic computation graphs. In Advances in Neural Information Processing Systems, 2015.
 [47] R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B (Methodological), pages 267–288, 1996.
 [48] D. Van Veen, A. Jalal, E. Price, S. Vishwanath, and A. G. Dimakis. Compressed sensing with deep image prior and learned regularization. arXiv preprint arXiv:1806.06438, 2018.
 [49] P. Vincent, H. Larochelle, Y. Bengio, and P.A. Manzagol. Extracting and composing robust features with denoising autoencoders. In International Conference on Machine Learning, 2008.
 [50] L. Wang, A. Razi, M. Rodrigues, R. Calderbank, and L. Carin. Nonlinear informationtheoretic compressive measurement design. In International Conference on Machine Learning, 2014.
 [51] Y. Weiss, H. S. Chang, and W. T. Freeman. Learning compressed sensing. In Snowbird Learning Workshop, Allerton, CA. Citeseer, 2007.
 [52] R. J. Williams. Simple statistical gradientfollowing algorithms for connectionist reinforcement learning. Machine learning, 8(34):229–256, 1992.

[53]
G. Yu and G. Sapiro.
Statistical compressed sensing of Gaussian mixture models.
IEEE Transactions on Signal Processing, 59(12):5842–5858, 2011. 
[54]
S. Zhao, J. Song, and S. Ermon.
The information autoencoding family: A lagrangian perspective on
latent variable generative models.
In
Conference on Uncertainty in Artificial Intelligence
, 2018.
Appendix A Proofs of Theoretical Results
A.1 Proof of Theorem 1
Proof.
We can rewrite the objective in Eq. (6) as:
(11) \mathbb{E}_{q_\phi(x, y)}\left[\log p_\theta(x \mid y)\right] = \mathbb{E}_{q_\phi(y)} \mathbb{E}_{q_\phi(x \mid y)}\left[\log p_\theta(x \mid y)\right]
(12) = -H_\phi(X \mid Y) - \mathbb{E}_{q_\phi(y)}\left[D_{KL}\left(q_\phi(x \mid y) \,\|\, p_\theta(x \mid y)\right)\right].
The KL divergence is non-negative and minimized when its argument distributions are identical. Hence, for any value of \phi, the optimal value of \theta satisfies the following for all y such that q_\phi(y) > 0:
(13) p_{\theta^\ast}(x \mid y) = q_\phi(x \mid y).
For any value of \phi, we know the following Gibbs chain converges to q_\phi(x, y) if the chain is ergodic:
(14) y^{(t)} \sim q_\phi(y \mid x^{(t)})
(15) x^{(t+1)} \sim q_\phi(x \mid y^{(t)}).
Substituting the result from Eq. (13) into the above Markov chain transitions finishes the proof. ∎
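The Gibbs chain above can be sanity-checked numerically in a toy scalar case where the optimal decoder is available in closed form. This is only an illustrative sketch: the encoder weight w and noise level sigma are arbitrary choices, and the closed-form posterior assumes a standard normal data distribution rather than the paper's learned models.

```python
import numpy as np

# Toy scalar instance: data x ~ N(0, 1), encoder y = w*x + eps with eps ~ N(0, sigma^2).
# For this model the optimal decoder is the exact posterior
#   x | y ~ N(w*y / (w^2 + sigma^2), sigma^2 / (w^2 + sigma^2)).
rng = np.random.default_rng(0)
w, sigma = 2.0, 0.5
post_var = sigma**2 / (w**2 + sigma**2)

x, samples = 0.0, []
for _ in range(20000):
    y = w * x + sigma * rng.normal()  # encode: sample a noisy measurement
    x = w * y / (w**2 + sigma**2) + np.sqrt(post_var) * rng.normal()  # decode: posterior sample
    samples.append(x)

# If the chain is ergodic, the stationary distribution over x matches the data N(0, 1).
burned = np.array(samples[1000:])
print(burned.mean(), burned.var())
```

Running the chain and discarding a burn-in, the empirical mean and variance of the x samples should be close to 0 and 1 respectively, matching the data distribution as the theorem predicts.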
A.2 Proof of Corollary 1
Proof.
By earlier results (Proposition 2 in [45]), it suffices to show that the Markov chain defined in Eqs. (9)-(10) is irreducible under a Gaussian noise model.^{†}^{†}Note that the symbol used here for the measure is distinct from the similarly-denoted parameters in the rest of the paper. That is, there must exist a measure such that there is a nonzero probability of transitioning from every set of nonzero measure to every other such set defined on the same measure using this Markov chain.
Consider the Lebesgue measure. Formally, given any pair of signals x and x' with nonzero density under the data distribution, we need to show that the probability density of transitioning from x to x' is nonzero.
(1) Since the noise model is Gaussian, the encoding distribution assigns nonzero density to every measurement y, so we can use Eq. (9) to transition from x to any y with nonzero probability.
(2) Next, we claim that the transition density from any y to x' is nonzero. By Bayes' rule, the decoding posterior is proportional to the product of the density of x' and the likelihood of y given x'. The density of x' is positive by assumption, and the likelihood of y given x' is positive by the Gaussian noise model assumption; hence the marginal density of y, and in turn the posterior density of x' given y, are both positive. Finally, using the optimality assumption that the decoder matches this posterior for all y, we can use Eq. (10) to transition from y to x' with nonzero probability.
From (1) and (2), there is a nonzero probability of transitioning from x to x'. Hence, under the assumptions of the corollary, the Markov chain in Eqs. (9)-(10) is ergodic.
∎
A.3 Proof of Theorem 2
Proof.
The UAE objective reduces to information maximization^{†}^{†}We note that a similar claim was made earlier in [51]. However, the stated proof seems to be incorrect, since it simultaneously assumes both a noise-free and a Gaussian (noisy) observation model in its arguments. when the decoder matches the true posterior, allowing us to precisely characterize the optimal encodings that maximize the mutual information between the signals and the measurements. Consider the following simplification of the UAE objective under optimal decodings.
(16) \mathbb{E}_{q_\phi(x, y)}\left[\log q_\phi(x \mid y)\right] = \mathbb{E}_{q(x)}\left[\log q(x)\right] + \mathbb{E}_{q_\phi(x, y)}\left[\log q_\phi(y \mid x)\right] - \mathbb{E}_{q_\phi(y)}\left[\log q_\phi(y)\right]
The first term is independent of the encoding. For the second term, note that the measurement given the signal is a normally distributed random variable, and hence its entropy is a constant. Only the third term depends on the encoding. Hence, the optimal encodings maximizing the mutual information can be specified as:
(17)
where we have retained the term due to the entropy of the noise since we are interested in characterizing the solutions^{†}^{†}For a linear encoder, note that (and = ), and hence, these can be used interchangeably. to Eq. (17) in the limit of vanishing noise.
(18) 
where we have rewritten the objective in terms of the log-marginal plus a shift term due to the noise entropy.
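For reference, the constant entropy contribution of the Gaussian noise discussed above is a standard fact; writing m for the number of measurements and \sigma^2 for the isotropic noise variance (notation assumed here), the differential entropy is:

```latex
H(y \mid x) = \frac{m}{2} \log\left(2 \pi e \sigma^2\right),
```

which is independent of the encoding and therefore only shifts the objective by a constant.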
Next, we lower-bound the log-marginal using Jensen's inequality:
(19) 
where we have used the fact that the data distribution is uniform over the entire dataset (by assumption).
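Concretely, if the data distribution is uniform over N training signals x_1, ..., x_N (notation assumed here), applying Jensen's inequality to the logarithm of the average gives:

```latex
\log q_\phi(y) = \log \frac{1}{N} \sum_{i=1}^{N} q_\phi(y \mid x_i) \;\ge\; \frac{1}{N} \sum_{i=1}^{N} \log q_\phi(y \mid x_i).
```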
Finally, we introduce the nonnegative slack term for the above inequality, defined such that:
(20) 
We now consider simplifications of the lower bound and the slack term.
Lower bound:
(21) 
Slack:
(22) 
We can simplify the posteriors as:
(23) 
where we have used the fact that the data distribution is uniform and the decoder is isotropic Gaussian.
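Under these two assumptions, the posterior over the training set takes a softmax form. A sketch in assumed notation, with W the linear encoding, \sigma^2 the isotropic noise variance, and x_1, ..., x_N the training signals:

```latex
q_\phi(x_i \mid y) = \frac{q_\phi(y \mid x_i)}{\sum_{j=1}^{N} q_\phi(y \mid x_j)} = \frac{\exp\left(-\lVert y - W x_i \rVert^2 / 2\sigma^2\right)}{\sum_{j=1}^{N} \exp\left(-\lVert y - W x_j \rVert^2 / 2\sigma^2\right)}.
```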
Substituting the above expression for the slack term:
(24) 
We will now show that dominated convergence holds for the slack term (viewed as a sequence of functions indexed by the noise level, for any fixed encoding). By the Cauchy-Schwarz inequality, the first subterm in Eq. (A.3) is bounded given the assumptions in the theorem (the signals and encodings are bounded in norm by some positive constants). The next two subterms are constants. For the last subterm, we can derive upper and lower bounds on the integrand that are independent of the noise level.
For the upper bound, we note that:
This gives an upper bound of zero on the integrand.
For the lower bound, we note that:
.
Hence, we have the following lower bound:
where the terms after the inequality follow from standard inequalities. Since both the upper and lower bounds are independent of the noise level, dominated convergence holds for the slack term.
Consequently, we can evaluate limits to obtain a limiting ratio between the slack term and the lower bound:
(25) 
using the expressions derived above for the slack term and the lower bound, along with L'Hôpital's rule.
We can now rewrite Eq. (20) as:
(26) 
By the definition of a limit, for every tolerance there exists a threshold such that, for all noise levels below the threshold, we have:
(27) 
Substituting the above expression in Eq. (18), we can conclude that for every tolerance there exists a threshold such that, for all noise levels below the threshold, we have:
(28) 
which finishes the proof. ∎
Appendix B Experimental Details
For MNIST, we use the train/valid/test split of images. For Omniglot, we use train/valid/test split of images. For CelebA, we used the splits as provided by [37] on the dataset website. All images were scaled such that pixel values are between and . We used the Adam optimizer with a learning rate of for all the learned models. For MNIST and Omniglot, we used a batch size of . For CelebA, we used a batch size of . Further, we implemented early stopping based on the best validation bounds after epochs for MNIST, epochs for Omniglot, and epochs for CelebA.
B.1 Hyperparameters for Compressive Sensing on MNIST and Omniglot
For both datasets, the UAE decoder used hidden layers of units each with ReLU activations. The encoder was a single linear layer with only weight parameters and no bias parameters. The encoder and decoder architectures for the VAE baseline are symmetric, with hidden layers of units each and latent units. We used the LASSO baseline implementation from sklearn and tuned the Lagrange parameter on the validation sets. For the baselines, we do random restarts with steps per restart and pick the reconstruction with the best measurement error, as prescribed in [9]. Refer to [9] for further details of the baseline implementations.

m   | Random Gaussian Matrices | MNIST-UAE | Omniglot-UAE
2   | 39.57                    | 6.42      | 2.17
5   | 63.15                    | 5.98      | 2.66
10  | 88.98                    | 7.24      | 3.50
25  | 139.56                   | 8.53      | 4.71
50  | 198.28                   | 9.44      | 5.45
100 | 280.25                   | 10.62     | 6.02
Table 2 shows the average norms of the random Gaussian matrices used in the baselines and of the learned UAE encodings. The lower norms for the UAE encodings suggest that the UAE is not trivially overcoming noise by increasing the norm of the measurement matrix.
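The LASSO-with-random-Gaussian-measurements baseline can be sketched as follows. This is a minimal illustration: the dimensions, sparsity, noise level, and regularization strength are placeholder choices, not the tuned values used in the experiments.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, m, k = 100, 40, 5  # signal dimension, number of measurements, sparsity (placeholders)

# k-sparse ground-truth signal and a random Gaussian measurement matrix.
x = np.zeros(n)
x[rng.choice(n, size=k, replace=False)] = rng.normal(size=k)
A = rng.normal(scale=1.0 / np.sqrt(m), size=(m, n))
y = A @ x + 0.01 * rng.normal(size=m)  # noisy linear measurements

# sklearn's Lasso solves min_w (1 / (2 * n_samples)) * ||y - A w||^2 + alpha * ||w||_1.
lasso = Lasso(alpha=1e-3, max_iter=10000)
lasso.fit(A, y)
x_hat = lasso.coef_

rel_err = np.linalg.norm(x_hat - x) / np.linalg.norm(x)
print(rel_err)  # typically small for a sufficiently sparse signal
```

Note that sklearn scales the squared-error term by the number of samples, so the alpha that works well depends on m; this is why the Lagrange parameter is tuned on a validation set.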
B.2 Hyperparameters for Dimensionality Reduction
For PCA and each of the classifiers, we used the standard implementations in sklearn with default parameters and the following exceptions:

KNN: n_neighbors = 3

DT: max_depth = 5

RF: max_depth = 5, n_estimators = 10, max_features = 1

MLP: alpha=1

SVC: kernel=linear, C=0.025
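The settings above map directly onto sklearn constructors; a minimal sketch follows. The synthetic blob data is only a sanity check introduced here, not the projected features from the experiments.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

# Classifiers with the non-default hyperparameters listed above.
classifiers = {
    "KNN": KNeighborsClassifier(n_neighbors=3),
    "DT": DecisionTreeClassifier(max_depth=5),
    "RF": RandomForestClassifier(max_depth=5, n_estimators=10, max_features=1),
    "MLP": MLPClassifier(alpha=1),
    "SVC": SVC(kernel="linear", C=0.025),
}

# Tiny synthetic sanity check: two well-separated Gaussian blobs.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 8)), rng.normal(4, 1, (50, 8))])
y = np.array([0] * 50 + [1] * 50)

for name, clf in classifiers.items():
    clf.fit(X, y)
    print(name, clf.score(X, y))  # training accuracy
```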
B.3 Statistical Compressive Sensing on CelebA Dataset
For the CelebA dataset, the dimensions of the images are and . The naive pixel basis does not augur well for compressive sensing on such highdimensional RGB datasets. Following [9], we experimented with the Discrete Cosine Transform (DCT) and Wavelet basis for the LASSO baseline. Further, we used the DCGAN architecture [43] as in [9] as our main baseline. For the UAE approach, we used additional convolutional layers in the encoder to learn a dimensional feature space for the image before projecting it down to dimensions.
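As a quick illustration of why the DCT basis helps here, the following sketch checks that a smooth image is approximately sparse in the 2D DCT basis; the synthetic image, the 5% coefficient budget, and the sizes are arbitrary choices for illustration only.

```python
import numpy as np
from scipy.fft import dctn, idctn

# Synthetic smooth "image": low-frequency structure plus mild noise.
rng = np.random.default_rng(0)
xx, yy = np.meshgrid(np.linspace(0, 1, 64), np.linspace(0, 1, 64))
img = np.sin(3 * xx) * np.cos(2 * yy) + 0.01 * rng.normal(size=(64, 64))

# Orthonormal 2D DCT; keep only the 5% largest-magnitude coefficients.
coeffs = dctn(img, norm="ortho")
thresh = np.quantile(np.abs(coeffs), 0.95)
approx = idctn(np.where(np.abs(coeffs) >= thresh, coeffs, 0), norm="ortho")

rel_err = np.linalg.norm(approx - img) / np.linalg.norm(img)
print(rel_err)  # small: the image is compressible in the DCT basis
```

Approximate sparsity of natural images in such a fixed basis is what makes LASSO recovery in the DCT or wavelet basis a sensible baseline, even though the pixel basis itself is not sparse.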
Encoder architecture:
Signal  
Decoder architecture:
Measurements  