Uncertainty Autoencoders: Learning Compressed Representations via Variational Information Maximization

12/26/2018 ∙ by Aditya Grover, et al.

The goal of statistical compressive sensing is to efficiently acquire and reconstruct high-dimensional signals with far fewer measurements than the data dimensionality, given access to a finite set of training signals. Current approaches do not learn the acquisition and recovery procedures end-to-end and are typically hand-crafted for sparsity-based priors. We propose Uncertainty Autoencoders, a framework that jointly learns the acquisition (i.e., encoding) and recovery (i.e., decoding) procedures while implicitly modeling domain structure. Our learning objective optimizes for a variational lower bound to the mutual information between the signal and the measurements. We show how our framework provides a unified treatment to several lines of research in dimensionality reduction, compressive sensing, and generative modeling. Empirically, we demonstrate improvements of 32% in recovery over competing approaches for statistical compressive sensing of high-dimensional datasets.




1 Introduction

Efficient acquisition and recovery of high-dimensional signals is important in many applications [39, 28]. In compressive sensing, we formalize this problem as mapping an n-dimensional signal x ∈ R^n to m linear measurements y ∈ R^m. Mathematically, our goal is to solve the following system of linear equations:

y = Wx + ε    (1)

where the linear map W ∈ R^{m×n} is referred to as the measurement matrix and ε is the measurement noise. If m < n, then the system is underdetermined and additional assumptions are required to guarantee a unique recovery. The celebrated results in compressive sensing posit that a k-sparse signal can be recovered with high probability using only O(k log(n/k)) measurements acquired via a certain class of random measurement matrices using LASSO-based recovery methods [13, 22, 12, 47, 8].

The assumptions of sparsity are fairly general and can be applied "out-of-the-box" for many types of signals. For instance, image and audio signals are typically sparse in the wavelet and Fourier bases respectively. Moreover, the measurement matrix can simply be a random Gaussian matrix. However, these approaches fail to account for the statistical nature of the problem. In particular, if we are given a dataset of training signals from the underlying domain (or a related domain), then we can learn acquisition and recovery procedures that are better suited for out-of-sample generalization than sparsity-based priors, which decode random encodings of the data using a LASSO-based recovery procedure.

We propose Uncertainty Autoencoders (UAE), a framework that learns both the acquisition (i.e., encoding) and recovery (i.e., decoding) procedures for statistical compressive sensing, given a dataset of training signals using an information maximization objective. A UAE directly optimizes for variational lower bounds on the mutual information between measurements and signal [5]. In doing so, we sidestep strong modeling assumptions such as sparsity [4]. Instead, we show that a UAE learns a novel implicit generative model [40] over high-dimensional signals via variational information maximization, i.e., a UAE permits sampling from the learned distribution but does not specify an explicit likelihood function. While the measurements can be seen as analogous to latent variables of a variational autoencoder (VAE) [31], our approach does not require specifying a prior over the lower dimensional measurements in learning the generative model.

By amortizing the recovery procedure using expressive neural networks, UAE decoding can share statistical strength across signals and scale to massive datasets, unlike prior frameworks for statistical compressive sensing that solve an optimization problem for each new data signal at test time [9, 20]. As a final theoretical contribution, we show that a UAE, under suitable assumptions, is a generalization of principal component analysis (PCA). While earlier results showing connections of standard autoencoders with PCA assume linear encodings and decodings [10, 3, 30], our result surprisingly holds even for non-linear decodings and is derived from variational principles.

Empirically, we evaluate UAEs for compressive sensing and for dimensionality reduction for downstream classification. On the MNIST, Omniglot, and CelebA datasets, we observe consistent average improvements in recovery over the closest benchmark across all measurements considered. For dimensionality reduction and subsequent classification on the MNIST dataset, we observe an average improvement over principal component analysis across all dimensions considered and eight standard classifiers.

2 Preliminaries

We use uppercase symbols to denote probability distributions and assume they admit absolutely continuous densities on a suitable reference measure, denoted by the corresponding lowercase notation. Further, we use uppercase symbols for random variables and lower case for their realizations.

Let the data signal and measurements be denoted with multivariate random variables X ∈ R^n and Y ∈ R^m respectively. For the purpose of compressive sensing, we assume m < n and relate these variables through a measurement matrix W and a parameterized acquisition function f_ψ : R^n → R^k (for any integer k) such that:

Y = W f_ψ(X) + ε    (2)

where ε is the measurement noise. If we let f_ψ be the identity function (i.e., f_ψ(x) = x for all x, with k = n), then we recover the standard system of underdetermined linear equations in Eq. (1) where measurements are linear combinations of the signal corrupted by noise. In all other cases, the acquisition function transforms the signal such that f_ψ(X) is potentially more amenable for compressive sensing. For instance, f_ψ could specify a change of basis that encourages sparsity, e.g., a Fourier basis for audio signals. Note that our formulation allows the codomain of the mapping f_ψ to be defined on a higher- or lower-dimensional space (i.e., k ≠ n in general).
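As a concrete illustration, the acquisition process of Eq. (2) can be sketched in a few lines. This is a minimal numpy sketch: the dimensions, the noise level, and the choice of identity acquisition function are illustrative assumptions, not the paper's experimental settings.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 784, 25        # signal and measurement dimensions (illustrative)
sigma = 0.1           # noise standard deviation (assumed value)

W = rng.normal(size=(m, n)) / np.sqrt(m)   # measurement matrix

def f_psi(x):
    # Identity acquisition function: recovers plain linear sensing, Eq. (1).
    return x

def acquire(x):
    # y = W f_psi(x) + eps, with isotropic Gaussian noise eps ~ N(0, sigma^2 I).
    eps = sigma * rng.normal(size=m)
    return W @ f_psi(x) + eps

x = rng.normal(size=n)   # a stand-in signal
y = acquire(x)
```

Replacing `f_psi` with a learned nonlinear transformation (e.g., a neural network) gives the general acquisition model considered in the paper.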

Sparse Compressive Sensing.

In compressive sensing, we wish to recover an n-dimensional signal x given the measurements y by solving an underdetermined system of linear equations. To obtain nontrivial solutions, the true (but unknown) underlying signal is assumed to be sparse in some basis B. We are not given any additional information about the signal. The measurement matrix W is a random Gaussian matrix and the recovery is done via LASSO. LASSO solves a convex ℓ1-minimization problem such that the reconstruction for any datapoint is given as x̂ = B ẑ with:

ẑ = argmin_z ‖y − WBz‖²₂ + λ‖z‖₁

where λ > 0 is a tunable hyperparameter.
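To make the recovery step concrete, the ℓ1-minimization above can be solved with iterative soft-thresholding (ISTA), a standard proximal-gradient method for the LASSO objective. The sketch below assumes the canonical basis (B = I); the problem sizes, sparsity level, and λ are chosen only for illustration.

```python
import numpy as np

def soft_threshold(v, t):
    # Proximal operator of t * ||.||_1 (elementwise shrinkage).
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def ista(W, y, lam, n_iters=3000):
    # Minimizes ||y - W z||_2^2 + lam * ||z||_1 by proximal gradient steps.
    L = 2 * np.linalg.norm(W, 2) ** 2   # Lipschitz constant of the smooth part
    z = np.zeros(W.shape[1])
    for _ in range(n_iters):
        z = soft_threshold(z + 2 * W.T @ (y - W @ z) / L, lam / L)
    return z

rng = np.random.default_rng(0)
n, m, k = 100, 40, 5
x = np.zeros(n)
x[rng.choice(n, size=k, replace=False)] = rng.normal(size=k)  # k-sparse signal
W = rng.normal(size=(m, n)) / np.sqrt(m)                      # random Gaussian matrix
x_hat = ista(W, W @ x, lam=1e-3)   # noiseless measurements y = W x
```

With m = 40 measurements of a 5-sparse, 100-dimensional signal, the recovery is near-exact, illustrating the sparse recovery guarantees discussed above.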

Statistical Compressive Sensing.

The motivating use case of statistical compressive sensing [53] is to learn a sensing pipeline for the acquisition of signals from an underlying domain and their accurate recovery, given additional training signals from the domain being sensed. During training, we are given access to a set of signals D, where each x ∈ D is assumed to be sampled i.i.d. from a data distribution q_data(x). Using this dataset, we learn the measurement matrix W and the acquisition function f_ψ in Eq. (2).

Particularly relevant to this work, we can optionally learn a recovery function g to reconstruct a data signal given its measurements. This amortized approach is in contrast to standard decoding mechanisms such as LASSO, which solve an optimization problem for every new datapoint at test time. At test time, we directly observe the measurements y that are assumed to satisfy Eq. (2) for a target signal x, and the task is to provide an accurate reconstruction x̂. Evaluation is based on the reconstruction error between x and x̂, e.g., ‖x − x̂‖²₂. If we learned the recovery function g during training, then x̂ = g(y).


An autoencoder is a pair of functions (e, d) designed to encode and decode data signals respectively. For a standard autoencoder, let e : R^n → R^m and d : R^m → R^n denote the encoding and decoding functions for an n-dimensional signal and an m-dimensional latent space. The learning objective minimizes the reconstruction error over a dataset D:

min_{e,d} Σ_{x∈D} ‖x − d(e(x))‖²₂

where the encoding and decoding functions are typically parameterized using neural networks.

3 Uncertainty Autoencoders

Consider a joint distribution between the signals X and the measurements Y, which factorizes as q_θ(x, y) = q_data(x) q_θ(y|x). Here, q_data(x) is a fixed data distribution and q_θ(y|x) is a parameterized observation model that depends on the measurement noise ε, as given by Eq. (2). In particular, θ corresponds collectively to the set of measurement matrix parameters W and the acquisition function parameters ψ. For instance, for isotropic Gaussian noise ε ∼ N(0, σ²I) with a fixed variance σ², we have q_θ(y|x) = N(W f_ψ(x), σ²I).

In an uncertainty autoencoder, we wish to learn the parameters θ that permit efficient and accurate recovery of a signal X using the measurements Y. In order to do so, we propose to maximize the mutual information between X and Y:

max_θ I(X; Y) = H(X) − H(X|Y)    (3)

where H(·) denotes differential entropy. The intuition is simple: if the measurements preserve maximum information about the signal, we can hope that recovery will have low reconstruction error. We formalize this intuition by noting that this objective is equivalent to maximizing the average log-posterior probability of X given Y. In fact, in Eq. (3), we can omit the term corresponding to the data entropy H(X) (since it is independent of θ) to get the following equivalent objective:

max_θ −H(X|Y) = E_{q_θ(x,y)}[log q_θ(x|y)]    (4)
Even though the mutual information is maximized and equals the data entropy when X = Y, the dimensionality constraints on Y (since m < n), the parametric assumptions on f_ψ, and the noise model given by Eq. (2) prohibit learning an identity mapping.

Estimating mutual information between arbitrary high-dimensional random variables can be challenging. However, we can lower bound the mutual information by introducing a variational approximation to the model posterior q_θ(x|y) [5]. Denoting this approximation as q_φ(x|y), we get the following lower bound:

I(X; Y) = H(X) + E_{q_θ(x,y)}[log q_θ(x|y)]    (5)
        ≥ H(X) + E_{q_θ(x,y)}[log q_φ(x|y)]    (6)

Comparing Eqs. (3, 5, 6), we can see that the second term in Eq. (6) approximates the intractable negative conditional entropy −H(X|Y) with a variational lower bound. Optimizing this bound leads to a decoding distribution given by q_φ(x|y) with variational parameters φ. The bound is tight when there is no distortion during recovery, or equivalently when the decoding distribution q_φ(x|y) matches the true posterior q_θ(x|y) (i.e., the Bayes optimal decoder).

Stochastic Optimization.

Formally, the uncertainty autoencoder (UAE) objective is given by:

max_{θ,φ} E_{q_θ(x,y)}[log q_φ(x|y)]    (7)

In practice, the data distribution q_data(x) is unknown and accessible only via a finite dataset D. Hence, expectations with respect to q_data(x) and its gradients can be estimated using Monte Carlo methods. This allows us to express the UAE objective as:

max_{θ,φ} Σ_{x∈D} E_{q_θ(y|x)}[log q_φ(x|y)]    (8)

Tractable evaluation of the above objective is closely tied to the distributional assumptions on the noise model. For the typical case of an isotropic Gaussian noise model, we know that q_θ(y|x) = N(W f_ψ(x), σ²I), which is easy to sample.

While Monte Carlo gradient estimates with respect to φ can be efficiently obtained via linearity of expectation, gradient estimation with respect to θ is challenging since these parameters specify the sampling distribution q_θ(y|x). One solution is to evaluate score function gradient estimates along with control variates [23, 26, 52]. Alternatively, many continuous distributions (e.g., the isotropic Gaussian and Laplace distributions) can be reparameterized such that it is possible to obtain samples by applying a deterministic transformation to samples from a fixed distribution [31, 44]. Reparameterized gradient estimates typically have lower variance than score function estimates [25, 46].
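The reparameterized Monte Carlo estimate of Eq. (8) can be sketched in numpy. This is a minimal sketch under simplifying assumptions: a linear encoder, a linear-Gaussian decoder with unit variance standing in for d_φ, and toy dimensions; in practice both maps would be neural networks trained by gradient ascent on this estimate.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, sigma = 20, 5, 0.1
W = rng.normal(size=(m, n)) / np.sqrt(m)   # linear encoder (measurement matrix)
D = rng.normal(size=(n, m)) / np.sqrt(m)   # linear decoder mean, a stand-in for d_phi
X = rng.normal(size=(100, n))              # stand-in training signals

def uae_objective(W, D, X, n_samples=10):
    # Monte Carlo estimate of Eq. (8) for a unit-variance Gaussian decoder,
    # where log q_phi(x|y) = -0.5 * ||x - D y||^2 up to an additive constant.
    # Reparameterization: y = W x + sigma * eps with eps ~ N(0, I), so samples
    # of y are a deterministic (and differentiable) function of W.
    total = 0.0
    for x in X:
        for _ in range(n_samples):
            eps = rng.normal(size=m)
            y = W @ x + sigma * eps      # reparameterized measurement
            total += -0.5 * np.sum((x - D @ y) ** 2)
    return total / (len(X) * n_samples)

obj = uae_objective(W, D, X)
```

Because the noise enters through a fixed distribution, gradients of `obj` with respect to `W` can be taken through the sampling step, which is exactly the low-variance reparameterized estimator described above.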

4 Theoretical Analysis

In this section, we derive connections of uncertainty autoencoders with generative modeling and Principal Component Analysis (PCA). The proofs of all theoretical results in this section are in Appendix A.

4.1 Implicit Generative Modeling

Starting from an arbitrary point x₀ ∈ R^n, define a Markov chain over (X, Y) with the following transitions:

y_t ∼ q_θ(y|x_t)    (9)
x_{t+1} ∼ q_φ(x|y_t)    (10)

Theorem 1.

For any fixed value of θ, let us assume that the UAE objective in Eq. (7) can be globally maximized for some choice of φ = φ*. If the Markov chain defined in Eqs. (9, 10) is ergodic, then the stationary distribution of the chain for the parameters θ and φ* is given by q_θ(x, y).

The above theorem suggests an interesting insight into the behavior of UAEs. Under idealized conditions, the learned model specifies an implicit generative model for q_θ(x, y). Further, ergodicity can be shown to hold for the isotropic Gaussian noise model.

Corollary 1.

For any fixed value of θ, let us assume that the UAE objective in Eq. (7) is globally maximized for some choice of φ = φ*. For a Gaussian noise model, the stationary distribution of the chain for the parameters θ and φ* is given by q_θ(x, y).

The marginal of the joint distribution q_θ(x, y) with respect to X corresponds to the data distribution. A UAE hence seeks to learn an implicit generative model of the data distribution [21, 40], i.e., even though we do not have a tractable estimate for the likelihood of the model, we can generate samples using the Markov chain transitions defined in Eqs. (9, 10).
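The sampling chain of Eqs. (9, 10) can be sketched directly. This is a toy sketch under stated assumptions: a linear-Gaussian encoder and decoder, with a damped pseudo-inverse standing in for the trained decoder mean; in a trained UAE, q_φ*(x|y) would be the (Bayes optimal) maximizer of Eq. (7), and the noise scales are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, sigma = 2, 1, 0.1
W = rng.normal(size=(m, n))

# Stand-in decoder mean: a damped pseudo-inverse of W, which keeps the chain
# stable. A trained decoder would be learned by maximizing Eq. (7).
D = 0.9 * np.linalg.pinv(W)
tau = 0.1   # assumed decoder noise scale

def sample_chain(x0, n_steps=500):
    # Alternate the transitions of Eqs. (9, 10):
    #   y_t ~ q_theta(y | x_t),  x_{t+1} ~ q_phi(x | y_t).
    x, xs = x0, []
    for _ in range(n_steps):
        y = W @ x + sigma * rng.normal(size=m)   # noisy encoding
        x = D @ y + tau * rng.normal(size=n)     # stochastic decoding
        xs.append(x)
    return np.array(xs)

samples = sample_chain(np.zeros(n))
```

Under the conditions of Theorem 1, iterates of such a chain would be distributed according to the data marginal of q_θ(x, y) after burn-in.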

4.2 Optimal Encodings

A UAE can additionally be viewed as a dimensionality reduction technique for the dataset D. While in general the encoding performing this reduction can be nonlinear, in the case of a linear encoding (as typically done in standard compressive sensing) the projection vectors are given as the rows of the measurement matrix W. The result below characterizes the optimal encoding of the dataset with respect to the UAE objective for an isotropic Gaussian noise model.

Theorem 2.

Assume a uniform data distribution over a finite dataset D. Further, we assume that the expectations in the UAE objective exist, and that the signals and measurement matrices are bounded in ℓ2/Frobenius norms, i.e., ‖x‖₂ ≤ B_x for all x ∈ D and ‖W‖_F ≤ B_W, for some positive constants B_x, B_W. For a linear encoder and isotropic Gaussian noise N(0, σ²I), the optimal measurement matrix W* that maximizes the mutual information in the limit σ → ∞ is given by the top-m eigenvectors of the matrix

Σ_{x∈D} x xᵀ

as its rows, where the top-m eigenvectors denote the m eigenvectors with the largest eigenvalues (specified up to a positive scaling constant).

Figure 1: Dimensionality reduction using PCA vs. UAE. Projections of the data (black points) on the UAE direction (green line) maximize the likelihood of decoding unlike the PCA projection axis (magenta line) which collapses many points in a narrow region.

Under the stated assumptions, the above result suggests an interesting connection between uncertainty autoencoding and PCA. PCA is one of the most widely used techniques for dimensionality reduction and seeks to find the directions that explain the most variance in the data. Theorem 2 suggests that when the noise in the projected signal is very high, the optimal projection directions (i.e., the rows of W*) correspond to the principal components of the data signals. We note that this observation comes with a caveat: when the noise variance is high, it will dominate the contribution to the measurements in Eq. (2), as one would expect. Hence, the measurements and the signal will have low mutual information even under the optimal measurement matrix W*.
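The high-noise characterization of Theorem 2 is easy to compute numerically. The sketch below uses synthetic anisotropic data with illustrative sizes and scales; the encoder rows are taken as the top-m eigenvectors of the empirical second-moment matrix, as the theorem prescribes.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 10, 3
# Anisotropic synthetic data: coordinates with decreasing scales.
X = rng.normal(size=(500, n)) * np.linspace(3.0, 0.5, n)

# Theorem 2 (sigma -> infinity): the optimal linear encoder W* is given, up to
# positive scaling, by the top-m eigenvectors of sum_i x_i x_i^T as its rows.
S = X.T @ X                            # empirical second-moment matrix
eigvals, eigvecs = np.linalg.eigh(S)   # eigenvalues in ascending order
W_star = eigvecs[:, -m:].T             # rows = top-m eigenvectors

# Fraction of the total second moment captured by the m projection directions.
explained = eigvals[-m:].sum() / eigvals.sum()
```

For zero-mean data this coincides with the principal components, matching the PCA connection discussed above; for finite noise, the learned UAE encoder can deviate from these directions, as Figure 1 illustrates.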

Our assumptions are notably different from prior results drawing connections with autoencoders and PCA. Prior results show that linear encoding and decoding in a standard autoencoder recovers the principal components of the data (Eq. (3) in [10], Eq. (1) in [3]). In contrast, the result in Theorem 2 is derived from variational principles and does not assume linear decoding.

In general, the behaviors of UAE and PCA can be vastly different. As noted in prior work [5, 51], the principal components may not be the most informative low-dimensional projections for recovering the original high-dimensional data from its projections. A UAE, on the other hand, is explicitly designed to preserve as much information as possible (see Eq. (5)). We illustrate the differences in a synthetic experiment in Figure 1. The true data distribution is an equiweighted mixture of two Gaussians stretched along orthogonal directions. We sample points (black) from this mixture and consider two dimensionality reductions. In the first case, we project the data on the first principal component (blue points on magenta line). This axis captures a large fraction of the variance in the data but collapses data sampled from the bottom-right Gaussian in a narrow region. The projections of the data on the UAE axis (red points on green line) are more spread out. This suggests that recovery is easier, even if doing so increases the total variance in the projected space compared to PCA.

5 Experiments

We performed three kinds of experiments: the first two relate to the application of UAEs for compressive sensing and the third experiment contrasts UAEs and PCA with regards to dimensionality reduction for downstream classification. Additional experimental details beyond those stated below are provided in Appendix B.

5.1 Statistical Compressive Sensing

We perform compressive sensing on three high-dimensional continuous datasets: MNIST [34], Omniglot [33], and CelebA [37], with an extremely low number of measurements m. We discuss the MNIST and Omniglot datasets here since they have a similar setup. Due to lack of space, results on the CelebA dataset are deferred to Appendix B.3. Every image in MNIST and Omniglot has a dimensionality of n = 784 (28×28 pixels), and we considered a range of measurement sizes m. In all our experiments, we assume a Gaussian noise model with fixed variance. We evaluated UAE against:

Figure 2: Test reconstruction error (per image) for compressive sensing on (a) MNIST and (b) Omniglot.

Figure 3: Reconstructions on MNIST and Omniglot. Top: Original. Second: LASSO. Third: VAE. Last: UAE. Low-dimensional projections of the data are sufficient for UAE to reconstruct the original images with high accuracy.

Figure 4: Test reconstruction error (per image) for transfer compressive sensing. (a) Source: MNIST, Target: Omniglot. (b) Source: Omniglot, Target: MNIST.

Figure 5: Reconstructions for transfer compressive sensing. Top: Target. Second: LASSO. Third: VAE. Last: UAE. (a) Source: MNIST, Target: Omniglot. (b) Source: Omniglot, Target: MNIST.

  • LASSO decoding with random Gaussian matrices. The MNIST and Omniglot datasets are reasonably sparse in the canonical pixel basis, and hence we did not observe any gains after applying the Discrete Cosine Transform or the Daubechies-1 Wavelet Transform.

  • VAE decoding with random Gaussian matrices [9]. This approach learns a latent variable generative model over the observed variables x and the latent variables z. Such a model defines a mapping G from z to x, which is given by either the mean function of the observation model for a VAE or the forward deterministic mapping used to generate samples for a GAN. We use VAEs in our experiments. Thereafter, using a classic acquisition matrix A satisfying a generalized Restricted Eigenvalue Condition, the reconstruction for any datapoint is given as x̂ = G(ẑ) with ẑ = argmin_z ‖y − AG(z)‖₂. Intuitively, this procedure seeks the latent vector ẑ such that the corresponding point on the range of G can best approximate the measurements y under the mapping A. We used the default parameter settings and architectures proposed in [9].

  • RP+UAE decoding. To independently evaluate the effect of variational decoding, this ablation baseline encodes the data using Gaussian random projections (RP) and trains the decoder based on the UAE objective. Since LASSO and VAE both use an RP encoding, the differences in performance would arise only due to the decoding procedures.

The UAE decoder and the VAE encoder/decoder are multi-layer perceptrons consisting of two hidden layers. For a fair comparison with random Gaussian matrices, the UAE encoder is linear. Further, we regularize the Frobenius norm of W. This has two benefits. First, it helps generalization to test signals outside the train set. Second, it is equivalent to solving the Lagrangian of a constrained UAE objective:

max_{θ,φ} Σ_{x∈D} E_{q_θ(y|x)}[log q_φ(x|y)]  subject to  ‖W‖_F ≤ c

The Lagrangian parameter is chosen by line search on the above objective. The constraint ensures that the UAE does not learn encodings that trivially scale the measurement matrix to overcome noise. For each m, we choose the constraint value c to be the expected Frobenius norm of a random Gaussian matrix of dimensions m × n, for fair comparison with the other baselines. In practice, the norm of the learned W for a UAE is much smaller than those of random Gaussian matrices, suggesting that the observed performance improvements are non-trivial.
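The constraint value referenced above has a closed form: for an m × n matrix with i.i.d. standard normal entries, E[‖W‖²_F] = mn, so the Frobenius norm concentrates around √(mn). A small numerical check under these assumptions (the sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 25, 784

# For i.i.d. N(0, 1) entries, E[||W||_F^2] = m * n, and at this size the
# Frobenius norm concentrates tightly around sqrt(m * n).
norms = [np.linalg.norm(rng.normal(size=(m, n))) for _ in range(20)]
avg = float(np.mean(norms))   # should be close to sqrt(25 * 784) = 140
```

This gives a simple way to set the constraint for any m without simulation, by taking c ≈ √(mn).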

Results. The reconstruction errors on the standard test sets are shown in Figure 2. For both datasets, we observe that UAE drastically outperforms both LASSO and VAE for all values of m considered. LASSO (blue curves) is unable to reconstruct with such few measurements. The VAE (red) error decays much more slowly than the UAE error as m grows. Even the RP+UAE baseline (yellow), which trains the decoder while keeping the encoding fixed to a random projection, outperforms VAE. Jointly training the encoder and the decoder using the UAE objective (green) exhibits the best performance. These results are also reflected qualitatively in the reconstructed test signals shown in Figure 3.

5.2 Transfer Compressive Sensing

A key motivation for compressive sensing is to directly acquire the signals using few measurements. On the contrary, learning-based methods such as VAE and UAE require access to large amounts of training data. For many critical applications, even acquiring the training data might not be feasible. Hence, we test generative model-based recovery on the transfer compressive sensing task introduced in [20].

Experimental setup. We train the models on a source domain, which is assumed to be data-rich and related to a data-hungry target domain. Since the dimensions of MNIST and Omniglot images match, transferring from one domain to another requires no additional processing. For UAE, we retrain the decoder on the target domain with the encodings learned for the source domain. This retraining is not possible for a VAE, which would require the target domain signals directly (and not the compressed measurements). Retraining is computationally efficient due to amortized decoding in UAE, unlike LASSO or VAE which solve an optimization problem for every test signal.

Results. The reconstruction errors are shown in Figure 4. LASSO does not involve any learning, and hence its performance is the same as in Figure 2. The VAE performance degrades significantly in comparison, even performing worse than LASSO in some cases. Again, the UAE-based methods outperform competing approaches. Qualitative differences are highlighted in Figure 5.

Dimensions Method kNN DT RF MLP AdaB NB QDA SVM
2 PCA 0.4078 0.4283 0.4484 0.4695 0.4002 0.4455 0.4576 0.4503
UAE 0.4644 0.5085 0.5341 0.5437 0.4248 0.5226 0.5316 0.5256
5 PCA 0.7291 0.5640 0.6257 0.7475 0.5570 0.6587 0.7321 0.7102
UAE 0.8115 0.6331 0.7094 0.8262 0.6164 0.7286 0.7961 0.7873
10 PCA 0.9257 0.6354 0.6956 0.9006 0.7025 0.7789 0.8918 0.8440
UAE 0.9323 0.5583 0.7362 0.9258 0.7165 0.7895 0.9098 0.8753
25 PCA 0.9734 0.6382 0.6889 0.9521 0.7234 0.8635 0.9572 0.9194
UAE 0.9730 0.5407 0.7022 0.9614 0.7398 0.8306 0.9580 0.9218
50 PCA 0.9751 0.6381 0.6059 0.9580 0.7390 0.8786 0.9632 0.9376
UAE 0.9754 0.5424 0.6765 0.9597 0.7330 0.8579 0.9638 0.9384
100 PCA 0.9734 0.6380 0.4040 0.9584 0.7136 0.8763 0.9570 0.9428
UAE 0.9731 0.6446 0.6241 0.9597 0.7170 0.8809 0.9595 0.9431
Table 1: PCA vs. UAE. Average test classification accuracy for the MNIST dataset.

5.3 Dimensionality Reduction

Dimensionality reduction is a common preprocessing technique for specifying features for classification. We compare PCA and UAE on this task. While Theorem 2 posits that the two techniques are equivalent in the regime of high noise given optimal decodings for the UAE framework, we consider the case where the noise is set as a hyperparameter based on a validation set enabling out-of-sample generalization.

Setup. We learn the principal components and UAE projections on the MNIST training set for a varying number of dimensions. We then learn classifiers based on these projections. Again, we use a linear encoder for the UAE for a fair evaluation. Since the inductive biases vary across different classifiers, we considered eight commonly used classifiers: k-Nearest Neighbors (kNN), Decision Trees (DT), Random Forests (RF), Multilayer Perceptron (MLP), AdaBoost (AdaB), Gaussian Naive Bayes (NB), Quadratic Discriminant Analysis (QDA), and Support Vector Machines (SVM) with a linear kernel.

Results. The performance of the PCA and UAE feature representations for different numbers of dimensions is shown in Table 1. We find that UAE outperforms PCA in a majority of the cases. Further, this trend is largely consistent across classifiers. The improvements are especially high when the number of dimensions is low, suggesting the benefits of UAE as a dimensionality reduction technique for classification.

6 Discussion & Related Work

Our work provides a holistic treatment to several lines of research in mutual information maximization, autoencoding, compressive sensing, and dimensionality reduction using principal component analysis.

Mutual Information Maximization.

The principle of mutual information maximization, often referred to as InfoMax in prior work, was first proposed for learning encodings for communication over a noisy channel [35]. The InfoMax objective has also been applied to statistical compressive sensing for learning both linear and non-linear encodings [51, 14, 50]. Our work differs from these existing frameworks in two fundamental ways. First, we optimize a tractable variational lower bound to the mutual information, which allows our method to scale to high-dimensional signals and measurements. Second, we learn an amortized decoder in addition to the encoder, which sidesteps expensive, per-example optimization for the test signals being sensed.

Further, we improve upon the IM algorithm proposed originally for variational information maximization [5]. While the IM algorithm optimizes the lower bound on the mutual information in alternating "wake-sleep" phases for the encoder ("wake") and decoder ("sleep"), analogous to the expectation-maximization procedure used in [51], we optimize the encoder and decoder jointly using a single consistent objective, leveraging recent advances in gradient-based variational stochastic optimization.

Autoencoders. To contrast uncertainty autoencoders with other commonly used autoencoding schemes, consider a UAE whose decoder q_φ(x|y) is a Gaussian with mean d_φ(y) and fixed isotropic covariance; we use the same decoder assumption for all the autoencoding objectives discussed subsequently. Up to additive constants and positive scaling, the UAE objective can then be simplified as:

max_{θ,φ} Σ_{x∈D} E_{q_θ(y|x)}[−‖x − d_φ(y)‖²₂]
Standard Autoencoder. If we assume no measurement noise (i.e., σ → 0), the expectation over q_θ(y|x) concentrates on y = W f_ψ(x), and the UAE objective reduces to minimizing the mean squared error between the true and recovered signal:

min_{θ,φ} Σ_{x∈D} ‖x − d_φ(W f_ψ(x))‖²₂

This special case of a UAE corresponds to a standard autoencoder [6], where the measurements y signify a hidden representation for the observed data x. However, this case lacks the interpretation of an implicit generative model since the assumptions of Theorem 1 do not hold.

Denoising Autoencoders. A DAE [49] adds noise at the level of the input signal to learn robust representations. For a UAE, in contrast, the noise model is defined at the level of the compressed measurements. Again, under the assumption of a Gaussian decoder, the DAE objective can be expressed as:

min_{θ,φ} Σ_{x∈D} E_{C(x̃|x)}[‖x − d_φ(e_θ(x̃))‖²₂]

where C(x̃|x) is some predefined noise corruption model. Similar to Theorem 1, a DAE also learns an implicit model of the data distribution [7, 1].

Variational Autoencoders. A VAE [31, 44] explicitly learns a latent variable model p_θ(x, y) for the dataset. The learning objective is a variational lower bound to the marginal log-likelihood assigned by the model to the data, which notationally corresponds to E_{q_data(x)}[log p_θ(x)]. The variational objective that maximizes this quantity can be simplified as:

max_{θ,φ} Σ_{x∈D} E_{q_φ(y|x)}[log p_θ(x|y)] − KL(q_φ(y|x) ‖ p(y))

The learning objective includes a reconstruction error term, akin to the UAE objective. Crucially, it also includes a regularization term that minimizes the divergence of the variational posterior over y from a prior distribution p(y). A key difference is that a UAE does not explicitly need to model the prior distribution over y. On the downside, a VAE permits efficient ancestral sampling, while a UAE requires running relatively expensive Markov chains to obtain samples.

Recent works have attempted to unify the variants of variational autoencoders through the lens of mutual information [2, 54, 18]. These works also highlight scenarios where the VAE can learn to ignore the latent code in the presence of a strong decoder, thereby degrading the reconstructions to attain a lower KL loss. One particular variant, the β-VAE, weighs the additional regularization term with a positive factor β and can effectively learn disentangled representations [29]. Although [29] does not consider this case, the UAE can be seen as a β-VAE with β = 0.
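Written out under the notation above (a sketch; the Gaussian-decoder assumption is carried over, and in a UAE the encoding distribution is the fixed noisy measurement channel rather than a learned variational posterior), the β-VAE objective and its β = 0 limit are:

```latex
\mathcal{L}_{\beta}(\theta, \phi)
  = \mathbb{E}_{q_\phi(y \mid x)}\!\left[\log p_\theta(x \mid y)\right]
    - \beta \, D_{\mathrm{KL}}\!\left(q_\phi(y \mid x) \,\|\, p(y)\right),
\qquad
\mathcal{L}_{0}(\theta, \phi)
  = \mathbb{E}_{q_\phi(y \mid x)}\!\left[\log p_\theta(x \mid y)\right].
```

Setting β = 1 recovers the standard VAE, while β = 0 drops the prior-matching term entirely, leaving only the reconstruction term, which mirrors the UAE objective in Eq. (7).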

Generative Modeling and Compressive Sensing. The closely motivated works of [9, 20] also use generative models for compressive sensing. As highlighted in Section 5, their approach differs substantially from UAE. Similar to [9], a UAE learns a data distribution. However, in doing so, it additionally learns an acquisition/encoding function and a recovery/decoding function, unlike [9, 20], which rely on generic random matrices for encoding and per-example optimization for decoding. The cost of implicit learning in a UAE is that some of its inference capabilities, such as likelihood evaluation and sampling, are intractable or require running Markov chains. However, these inference queries are orthogonal to compressive sensing. Finally, our decoding is amortized and scales to large datasets, unlike [9, 20], which solve an independent optimization problem for each test signal.

To summarize, our uncertainty autoencoding formulation provides a combination of unique desirable properties for representation learning that are absent in prior autoencoders. As discussed, a UAE defines an implicit generative model without specifying a prior (Theorem 1) even under realistic conditions (Corollary 1; unlike DAEs) and has rich connections with PCA even for non-linear decoders (Theorem 2; unlike any kind of existing autoencoder).

7 Conclusion & Future Work

In this work, we presented uncertainty autoencoders (UAE), a framework for representation learning via variational maximization of mutual information between an input signal and hidden representation. We showed that UAEs are a natural candidate for statistical compressive sensing, wherein we can learn the acquisition and recovery functions jointly as well as sidestep making any strong assumptions based on sparsity. We presented connections of our framework with many related threads of research, especially with respect to implicit generative modeling and principal component analysis.

Our framework suffers from limitations that also serve as directions for future work. We assumed a setting with a fixed number of measurements. Applications involving a stream of measurements would naively require retraining UAEs for every additional measurement. Extending sequential encoding schemes, e.g., [11], to the UAE framework is a promising direction. Further, it would be interesting to incorporate advances in compressive sensing based on complex neural network architectures [42, 32, 15, 38, 48] within the UAE framework for real-world applications, e.g., medical imaging.

Unlike the rich theory surrounding the compressive sensing of sparse signals, a similar theory surrounding generative model-based priors on the signal distribution is lacking. Existing results due to [19] provide upper bounds on the number of measurements required for recovery of sparse signals using only linear encodings that satisfy the restricted isometry property. These bounds naturally hold even for UAEs when only the decoder is learned (as in the RP+UAE baseline considered in our experiments). However, these results only guarantee the existence of an optimal decoder and do not prescribe an algorithm for decoding (such as LASSO). Recent works have made promising progress in developing a theory of SGD-based recovery methods for nonconvex inverse problems, which continues to be an exciting direction for future work [9, 27, 20, 36].


Acknowledgments

We are thankful to Kristy Choi, Neal Jean, Daniel Levy, Ben Poole, and Yang Song for helpful discussions and comments on early drafts.


References

  • [1] G. Alain and Y. Bengio. What regularized auto-encoders learn from the data-generating distribution. Journal of Machine Learning Research, 15(1):3563–3593, 2014.
  • [2] A. A. Alemi, B. Poole, I. Fischer, J. V. Dillon, R. A. Saurous, and K. Murphy. Fixing a broken ELBO. In International Conference on Machine Learning, 2018.
  • [3] P. Baldi and K. Hornik. Neural networks and principal component analysis: Learning from examples without local minima. Neural networks, 2(1):53–58, 1989.
  • [4] R. G. Baraniuk, V. Cevher, M. F. Duarte, and C. Hegde. Model-based compressive sensing. IEEE Transactions on Information Theory, 56(4):1982–2001, 2010.
  • [5] D. Barber and F. Agakov. The IM algorithm: A variational approach to information maximization. In Advances in Neural Information Processing Systems, 2003.
  • [6] Y. Bengio et al. Learning deep architectures for AI. Foundations and Trends in Machine Learning, 2(1):1–127, 2009.
  • [7] Y. Bengio, L. Yao, G. Alain, and P. Vincent. Generalized denoising auto-encoders as generative models. In Advances in Neural Information Processing Systems, 2013.
  • [8] P. J. Bickel, Y. Ritov, and A. B. Tsybakov. Simultaneous analysis of lasso and Dantzig selector. The Annals of Statistics, pages 1705–1732, 2009.
  • [9] A. Bora, A. Jalal, E. Price, and A. G. Dimakis. Compressed sensing using generative models. In International Conference on Machine Learning, 2017.
  • [10] H. Bourlard and Y. Kamp. Auto-association by multilayer perceptrons and singular value decomposition. Biological Cybernetics, 59(4-5):291–294, 1988.
  • [11] G. Braun, S. Pokutta, and Y. Xie. Info-greedy sequential adaptive compressed sensing. In Allerton Conference on Communication, Control, and Computing. IEEE, 2014.
  • [12] E. J. Candès, J. Romberg, and T. Tao. Robust uncertainty principles: Exact signal reconstruction from highly incomplete frequency information. IEEE Transactions on Information Theory, 52(2):489–509, 2006.
  • [13] E. J. Candès and T. Tao. Decoding by linear programming. IEEE Transactions on Information Theory, 51(12):4203–4215, 2005.
  • [14] W. R. Carson, M. Chen, M. R. Rodrigues, R. Calderbank, and L. Carin. Communications-inspired projection design with application to compressive sensing. SIAM Journal on Imaging Sciences, 5(4):1185–1212, 2012.
  • [15] J. R. Chang, C.-L. Li, B. Poczos, B. V. Kumar, and A. C. Sankaranarayanan. One network to solve them all—solving linear inverse problems using deep projection models. arXiv preprint, 2017.
  • [16] G. Chen and D. Needell. Compressed sensing and dictionary learning. Preprint, 106, 2015.
  • [17] X. Chen, Y. Duan, R. Houthooft, J. Schulman, I. Sutskever, and P. Abbeel. InfoGAN: Interpretable representation learning by information maximizing generative adversarial nets. In Advances in Neural Information Processing Systems, 2016.
  • [18] X. Chen, D. P. Kingma, T. Salimans, Y. Duan, P. Dhariwal, J. Schulman, I. Sutskever, and P. Abbeel. Variational lossy autoencoder. In International Conference on Learning Representations, 2017.
  • [19] A. Cohen, W. Dahmen, and R. DeVore. Compressed sensing and best k-term approximation. Journal of the American mathematical society, 22(1):211–231, 2009.
  • [20] M. Dhar, A. Grover, and S. Ermon. Modeling sparse deviations for compressed sensing using generative models. In International Conference on Machine Learning, 2018.
  • [21] P. J. Diggle and R. J. Gratton. Monte Carlo methods of inference for implicit statistical models. Journal of the Royal Statistical Society. Series B (Methodological), pages 193–227, 1984.
  • [22] D. L. Donoho. Compressed sensing. IEEE Transactions on Information Theory, 52(4):1289–1306, 2006.
  • [23] M. C. Fu. Gradient estimation. Handbooks in operations research and management science, 13:575–616, 2006.
  • [24] S. Gao, G. Ver Steeg, and A. Galstyan. Variational information maximization for feature selection. In Advances in Neural Information Processing Systems, 2016.
  • [25] P. Glasserman. Monte Carlo methods in financial engineering, volume 53. Springer Science & Business Media, 2013.
  • [26] P. W. Glynn. Likelihood ratio gradient estimation for stochastic systems. Communications of the ACM, 33(10):75–84, 1990.
  • [27] P. Hand and V. Voroninski. Global guarantees for enforcing deep generative priors by empirical risk. arXiv preprint arXiv:1705.07576, 2017.
  • [28] M. A. Herman and T. Strohmer. High-resolution radar via compressed sensing. IEEE Transactions on Signal Processing, 57(6):2275–2284, 2009.
  • [29] I. Higgins, L. Matthey, A. Pal, C. Burgess, X. Glorot, M. Botvinick, S. Mohamed, and A. Lerchner. beta-VAE: Learning basic visual concepts with a constrained variational framework. In International Conference on Learning Representations, 2016.
  • [30] G. E. Hinton and R. R. Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, 313(5786):504–507, 2006.
  • [31] D. Kingma and M. Welling. Auto-encoding variational Bayes. In International Conference on Learning Representations, 2014.
  • [32] K. Kulkarni, S. Lohit, P. Turaga, R. Kerviche, and A. Ashok. ReconNet: Non-iterative reconstruction of images from compressively sensed measurements. In IEEE Conference on Computer Vision and Pattern Recognition, 2016.
  • [33] B. M. Lake, R. Salakhutdinov, and J. B. Tenenbaum. Human-level concept learning through probabilistic program induction. Science, 350(6266):1332–1338, 2015.
  • [34] Y. LeCun, C. Cortes, and C. J. Burges. MNIST handwritten digit database. http://yann.lecun.com/exdb/mnist, 2010.
  • [35] R. Linsker. How to generate ordered maps by maximizing the mutual information between input and output signals. Neural computation, 1(3):402–411, 1989.
  • [36] R. Liu, S. Cheng, Y. He, X. Fan, Z. Lin, and Z. Luo. On the convergence of learning-based iterative methods for nonconvex inverse problems. arXiv preprint arXiv:1808.05331, 2018.
  • [37] Z. Liu, P. Luo, X. Wang, and X. Tang. Deep learning face attributes in the wild. In International Conference on Computer Vision, 2015.
  • [38] X. Lu, W. Dong, P. Wang, G. Shi, and X. Xie. Convcsnet: A convolutional compressive sensing framework based on deep learning. arXiv preprint arXiv:1801.10342, 2018.
  • [39] M. Lustig, D. Donoho, and J. M. Pauly. Sparse MRI: The application of compressed sensing for rapid MR imaging. Magnetic resonance in medicine, 58(6):1182–1195, 2007.
  • [40] S. Mohamed and B. Lakshminarayanan. Learning in implicit generative models. arXiv preprint arXiv:1610.03483, 2016.
  • [41] S. Mohamed and D. J. Rezende. Variational information maximisation for intrinsically motivated reinforcement learning. In Advances in Neural Information Processing Systems, 2015.
  • [42] A. Mousavi, A. B. Patel, and R. G. Baraniuk. A deep learning approach to structured signal recovery. In Annual Allerton Conference on Communication, Control, and Computing, 2015.
  • [43] A. Radford, L. Metz, and S. Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015.
  • [44] D. J. Rezende, S. Mohamed, and D. Wierstra. Stochastic backpropagation and approximate inference in deep generative models. In International Conference on Machine Learning, 2014.
  • [45] G. O. Roberts and J. S. Rosenthal. Harris recurrence of Metropolis-within-Gibbs and trans-dimensional Markov chains. The Annals of Applied Probability, pages 2123–2139, 2006.
  • [46] J. Schulman, N. Heess, T. Weber, and P. Abbeel. Gradient estimation using stochastic computation graphs. In Advances in Neural Information Processing Systems, 2015.
  • [47] R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B (Methodological), pages 267–288, 1996.
  • [48] D. Van Veen, A. Jalal, E. Price, S. Vishwanath, and A. G. Dimakis. Compressed sensing with deep image prior and learned regularization. arXiv preprint arXiv:1806.06438, 2018.
  • [49] P. Vincent, H. Larochelle, Y. Bengio, and P.-A. Manzagol. Extracting and composing robust features with denoising autoencoders. In International Conference on Machine Learning, 2008.
  • [50] L. Wang, A. Razi, M. Rodrigues, R. Calderbank, and L. Carin. Nonlinear information-theoretic compressive measurement design. In International Conference on Machine Learning, 2014.
  • [51] Y. Weiss, H. S. Chang, and W. T. Freeman. Learning compressed sensing. In Snowbird Learning Workshop, Allerton, CA. Citeseer, 2007.
  • [52] R. J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning, 8(3-4):229–256, 1992.
  • [53] G. Yu and G. Sapiro. Statistical compressed sensing of Gaussian mixture models. IEEE Transactions on Signal Processing, 59(12):5842–5858, 2011.
  • [54] S. Zhao, J. Song, and S. Ermon. The information autoencoding family: A Lagrangian perspective on latent variable generative models. In Conference on Uncertainty in Artificial Intelligence, 2018.

Appendix A Proofs of Theoretical Results

A.1 Proof of Theorem 1


We can rewrite the objective in Eq. (6) as:


The KL divergence is non-negative and is minimized when its argument distributions are identical. Hence, for any value of the encoding parameters, the optimal decoder satisfies the following for all measurements that occur with non-zero probability:


For any value of , we know the following Gibbs chain converges to if the chain is ergodic:


Substituting the result from Eq. (13) in the above Markov chain transitions finishes the proof. ∎
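To make the chain concrete, here is a toy numpy simulation of the alternating transitions, with a random linear encoder plus Gaussian noise as in Eq. (9) and a least-squares pseudo-inverse standing in for the learned decoder (both are illustrative stand-ins, not the learned maps from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, sigma = 8, 4, 0.1            # signal dim, measurement dim, noise level

W = rng.standard_normal((m, n))    # linear encoding (measurement) matrix

def encode(x):
    # Forward transition: noisy linear measurements y = Wx + noise
    return W @ x + sigma * rng.standard_normal(m)

def decode(y):
    # Toy stand-in for the learned decoder: minimum-norm least squares
    return np.linalg.pinv(W) @ y

# Run the Gibbs chain x -> y -> x -> ...
x = rng.standard_normal(n)
for _ in range(100):
    y = encode(x)
    x = decode(y)
```

With the pseudo-inverse decoder the iterates remain bounded and confined to the row space of W; with a learned decoder matching the true posterior, the chain's stationary distribution is the one characterized in the theorem.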

A.2 Proof of Corollary 1


By earlier results (Proposition 2 in [45]), it suffices to show that the Markov chain defined in Eqs. (9)-(10) is φ-irreducible under a Gaussian noise model. (Note that the symbol φ here denotes a measure and is distinct from the variational parameters denoted by φ in the rest of the paper.) That is, there exists a measure φ such that there is a non-zero probability of transitioning from every set of non-zero measure to every other such set defined with respect to the same measure using this Markov chain.

Consider the Lebesgue measure. Formally, given any pair of signals that each have non-zero density under the data distribution (with respect to the Lebesgue measure), we need to show that the probability density of transitioning from one to the other under the chain is non-zero.

(1) Since the Gaussian noise model assigns non-zero density to every measurement given a signal, we can use Eq. (9) to transition from the first signal to any measurement with non-zero probability.

(2) Next, we claim that the reverse transition probability, i.e., the posterior density of the second signal given the measurement, is positive for all measurements. By Bayes rule, we have:

Since both signals have non-zero density under the data distribution, the marginal density of the measurement is positive. Again, the likelihood of the measurement given the second signal is positive by the Gaussian noise model assumption. Hence, the posterior density is positive. Finally, using the optimality assumption that the decoder matches this posterior for all measurements, we can use Eq. (10) to transition from the measurement to the second signal with non-zero probability.

From (1) and (2), we see that there is a non-zero probability of transitioning between any two signals with non-zero density. Hence, under the assumptions of the corollary, the Markov chain in Eqs. (9)-(10) is ergodic. ∎

A.3 Proof of Theorem 2


The UAE objective reduces to information maximization when the variational decoder matches the true posterior, allowing us to precisely characterize the optimal encodings that maximize the mutual information between the signal and the measurements. (We note that a similar claim was made earlier in [51]; however, the stated proof appears incorrect in simultaneously assuming both a noise-free and a Gaussian (noisy) observation model in its arguments.) Consider the following simplification of the UAE objective under optimal decodings.


The first term is independent of the encoding. For the second term, note that the measurement noise is a normally distributed random variable, and hence its entropy is a fixed constant. Only the third term depends on the encoding. Hence, the optimal encodings maximizing the mutual information can be specified as:


where we have retained the entropy term since we are interested in characterizing the solutions to Eq. (17) in the limit. (For a linear encoder, the two quantities agree up to a constant and can hence be used interchangeably.)


where we have denoted the log-marginal by plus a shift term due to .

Next, we lower-bound the log-marginal using Jensen’s inequality:


where we have used the fact that the data distribution is uniform over the entire dataset (by assumption).
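Concretely, for a uniform data distribution over a training set of $N$ signals $x_1, \dots, x_N$, the Jensen step takes the following form (a sketch with assumed notation: $p(y \mid x)$ denotes the measurement model and $p(y)$ the induced marginal):

```latex
\log p(y)
  = \log \frac{1}{N} \sum_{i=1}^{N} p(y \mid x_i)
  \;\geq\; \frac{1}{N} \sum_{i=1}^{N} \log p(y \mid x_i),
```

which follows from the concavity of the logarithm.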

Finally, we denote the non-negative slack term for the above inequality as such that:


We now consider simplifications of the lower bound and the slack term.

Lower bound:




We can simplify the posteriors as:


where we have used the fact that the data distribution is uniform and the decoder is isotropic Gaussian.

Substituting the above expression for the slack term:


We will now show that dominated convergence holds for the slack term (viewed as a sequence of functions indexed by the noise level, for any fixed measurement). By the Cauchy-Schwarz inequality, the first subterm in Eq. (A.3) is bounded given the assumptions in the theorem (boundedness for all signals by some positive constants). The next two subterms are constants. For the last subterm, we can derive upper and lower bounds on the integrand that are independent of the noise level.

For the upper bound, we note that:

This gives an upper bound on the integrand (equals 0).

For the lower bound, we note that:


Hence, we have the following lower bound:

where the terms after the inequality follow from elementary inequalities. Since both the upper and lower bounds are independent of the noise level, dominated convergence holds for the slack term.

Consequently, we can evaluate limits to obtain a limiting ratio between the slack term and the lower bound:


using the expressions derived in Eq. (A.3) and Eq. (A.3) along with L’Hôpital’s rule.

We can now rewrite Eq. (20) as:


By the definition of limit, we know that , there exists a such that for all and for all such that and for all , we have:


Substituting the above expression in Eq. (18), we can conclude that , there exists a , such that for all and for all such that and for all , we have:


which finishes the proof. ∎

Appendix B Experimental Details

For MNIST, we used the train/valid/test split of images. For Omniglot, we used a train/valid/test split of images. For CelebA, we used the splits provided by [37] on the dataset website. All images were scaled so that pixel values lie between and . We used the Adam optimizer with a learning rate of for all learned models. For MNIST and Omniglot, we used a batch size of . For CelebA, we used a batch size of . Further, we implemented early stopping based on the best validation bounds, after epochs for MNIST, epochs for Omniglot, and epochs for CelebA.
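The checkpoint selection described above ("best validation bound") can be sketched as a small tracker; the toy validation curve and the `params` payload below are illustrative placeholders, not values from the experiments:

```python
class BestCheckpointTracker:
    """Keeps the parameters that achieved the best validation bound so far."""

    def __init__(self):
        self.best_bound = float("-inf")
        self.best_params = None
        self.best_epoch = None

    def update(self, epoch, bound, params):
        # Higher variational lower bound is better.
        if bound > self.best_bound:
            self.best_bound = bound
            self.best_params = params
            self.best_epoch = epoch

tracker = BestCheckpointTracker()
# Toy per-epoch validation bounds standing in for the real validation curve.
for epoch, bound in enumerate([-10.0, -7.5, -6.9, -7.2, -8.0]):
    tracker.update(epoch, bound, params={"epoch": epoch})
```

After training, `tracker.best_params` holds the checkpoint used for evaluation.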

B.1 Hyperparameters for Compressive Sensing on MNIST and Omniglot

For both datasets, the UAE decoder used hidden layers of units each with ReLU activations. The encoder was a single linear layer with only weight parameters and no bias parameters. The encoder and decoder architectures for the VAE baseline are symmetric, with hidden layers of units each and latent units. We used the LASSO baseline implementation from sklearn and tuned the Lagrange parameter on the validation sets. For the baselines, we do random restarts with steps per restart and pick the reconstruction with the best measurement error, as prescribed in [9]. Refer to [9] for further details of the baseline implementations.

  m    Random Gaussian Matrices   MNIST-UAE   Omniglot-UAE
  2    39.57                      6.42        2.17
  5    63.15                      5.98        2.66
  10   88.98                      7.24        3.50
  25   139.56                     8.53        4.71
  50   198.28                     9.44        5.45
  100  280.25                     10.62       6.02

Table 2: Frobenius norms of the UAE encodings and random Gaussian projections for the MNIST and Omniglot datasets.

Table 2 reports the average Frobenius norms of the random Gaussian matrices used in the baselines and of the learned UAE encodings. The lower norms of the UAE encodings suggest that UAE is not trivially overcoming noise by inflating the norm of the measurement matrix.
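The random-projection column of Table 2 is consistent with measurement matrices drawn with i.i.d. standard normal entries: for an m × n matrix with N(0, 1) entries, E‖A‖²_F = mn, so ‖A‖_F concentrates around √(mn) (for example, √(25 · 784) ≈ 140 for m = 25 on flattened 28 × 28 MNIST, close to the 139.56 reported). A quick numerical check, assuming that sampling scheme:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 784                                  # flattened 28 x 28 MNIST image

norms = {}
for m in [2, 5, 10, 25, 50, 100]:
    A = rng.standard_normal((m, n))      # i.i.d. N(0, 1) entries
    norms[m] = np.linalg.norm(A, "fro")  # concentrates around sqrt(m * n)
```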

B.2 Hyperparameters for Dimensionality Reduction

For PCA and each of the classifiers, we used the standard implementations in sklearn with default parameters and the following exceptions:

  • KNN: n_neighbors = 3

  • DT: max_depth = 5

  • RF: max_depth = 5, n_estimators = 10, max_features = 1

  • MLP: alpha=1

  • SVC: kernel=linear, C=0.025
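For reference, these settings correspond to the following sklearn constructors (a sketch; the synthetic two-class data below merely stands in for the reduced-dimensionality features):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

# Classifiers with the non-default hyperparameters listed above.
classifiers = {
    "KNN": KNeighborsClassifier(n_neighbors=3),
    "DT": DecisionTreeClassifier(max_depth=5),
    "RF": RandomForestClassifier(max_depth=5, n_estimators=10, max_features=1),
    "MLP": MLPClassifier(alpha=1),
    "SVC": SVC(kernel="linear", C=0.025),
}

# Tiny synthetic two-class problem standing in for PCA/UAE features.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1, 1, (50, 10)), rng.normal(1, 1, (50, 10))])
y = np.array([0] * 50 + [1] * 50)

scores = {name: clf.fit(X, y).score(X, y) for name, clf in classifiers.items()}
```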

B.3 Statistical Compressive Sensing on CelebA dataset

Figure 6: Test reconstruction error (per image) for compressive sensing on CelebA.
Figure 7: Reconstructions for on the CelebA dataset. Top: Target. Second: LASSO-DCT. Third: LASSO-Wavelet. Fourth: DCGAN. Last: UAE.

For the CelebA dataset, the dimensions of the images are and . The naive pixel basis is ill-suited for compressive sensing of such high-dimensional RGB datasets. Following [9], we experimented with the Discrete Cosine Transform (DCT) and wavelet bases for the LASSO baseline. Further, we used the DCGAN architecture [43], as in [9], as our main baseline. For the UAE approach, we used additional convolutional layers in the encoder to learn a dimensional feature space for the image before projecting it down to dimensions.

Encoder architecture:


Decoder architecture: