From Variational to Deterministic Autoencoders

03/29/2019 ∙ by Partha Ghosh, et al. ∙ Max Planck Society

Variational Autoencoders (VAEs) provide a theoretically-backed framework for deep generative models. However, they often produce "blurry" images, which is linked to their training objective. Sampling in the most popular implementation, the Gaussian VAE, can be interpreted as simply injecting noise to the input of a deterministic decoder. In practice, this simply enforces a smooth latent space structure. We challenge the adoption of the full VAE framework on this specific point in favor of a simpler, deterministic one. Specifically, we investigate how substituting stochasticity with other explicit and implicit regularization schemes can lead to a meaningful latent space without having to force it to conform to an arbitrarily chosen prior. To retrieve a generative mechanism for sampling new data points, we propose to employ an efficient ex-post density estimation step that can be readily adopted both for the proposed deterministic autoencoders as well as to improve sample quality of existing VAEs. We show in a rigorous empirical study that regularized deterministic autoencoding achieves state-of-the-art sample quality on the common MNIST, CIFAR-10 and CelebA datasets.




1 Introduction

Generative modeling lies at the core of machine learning and computer vision. By capturing the mechanisms behind the data generation process, one can reason about data probabilistically, access and traverse the low-dimensional manifold the data is assumed to live on, and ultimately generate new data. It is therefore not surprising that learning generative models has gained momentum in applications like chemistry [25, 16], NLP [46, 8] and computer vision [9, 48].

Variational Autoencoders (VAEs) [27, 38] allow for a principled probabilistic way to model high-dimensional distributions. They do so by casting learning representations as a variational inference problem. Learning a VAE amounts to the optimization of an objective balancing the quality of autoencoded samples through a stochastic encoder–decoder pair while encouraging the latent space to follow a fixed prior distribution.

Since their introduction, VAEs have become one of the frameworks of choice for generative modeling, promising theoretically well-founded and more stable training than Generative Adversarial Networks (GANs) [17] and more efficient sampling mechanisms than autoregressive models [30]. Much of the recent literature has focused on applying VAEs on image generation tasks [33, 19, 23] and devising new encoder–decoder architectures [56, 25].

Despite this attention from the community, the VAE framework is still far from delivering the promised generative mechanism in many real-world scenarios. In fact, VAEs tend to generate blurry samples, a condition which has been attributed to using overly simplistic distributions for the prior [54]; restrictiveness of the Gaussian assumption for the stochastic architecture [13]; or over-regularization induced by the KL divergence term in the VAE objective [53] (see Fig. 2). Moreover, the VAE objective itself poses several challenges as it admits trivial solutions that decouple the latent space from the input [11, 60], leading to the posterior collapse phenomenon in conjunction with powerful decoders [56]. Training a VAE requires approximating expectations by sampling at the cost of increased variance in gradients [55, 10], making initialization, validation and annealing of hyperparameters fundamental in practice [8, 5, 21]. Lastly, even after a satisfactory convergence of the objective, the learned aggregated posterior distribution rarely matches the assumed latent prior in practice [26, 5, 13], ultimately hurting the quality of generated samples.

In this work, we tackle these shortcomings by reformulating the VAE into a generative modeling scheme that scales better, is simpler to optimize, and most importantly, produces higher-quality samples. We do so based on the observation that under common distributional assumptions made for VAEs, training a stochastic encoder–decoder pair does not differ in practice from training a deterministic architecture where noise is added to the decoder’s input to enforce a smooth latent space. We investigate how to substitute this noise injection mechanism with other regularization schemes in our deterministic Regularized Autoencoders (RAEs), and we analyze how we can learn a meaningful latent space without forcing it to conform to a given prior distribution. We equip RAEs with a generative mechanism through a simple and efficient density estimation step on the learned latent space, which leads to improved image quality that surpasses VAEs and stronger alternatives such as Wasserstein Autoencoders (WAEs) [53].

In summary, our contributions are as follows:

  1. we introduce the RAE framework for generative modeling,

  2. we propose an ex-post density estimation scheme that greatly improves sample quality for the VAE, WAE and RAE without the need for additional training,

  3. we conduct a rigorous empirical evaluation on several common image datasets (MNIST, CIFAR-10, CelebA), assaying reconstruction, random samples and interpolation quality for VAE, WAE and RAE,

  4. we achieve state-of-the-art FID scores for the above datasets in a non-adversarial setting.

The paper is organized as follows. In Section 2 we introduce the VAE framework and discuss assumptions, practical implementations and limitations, leading to the introduction of our simplified deterministic and regularized framework (Section 3), interpreting explicit regularization as constrained optimization under certain parametric assumptions (Section 4). After discussing density estimation and related works in Sections 5 and 6, we present experiments in Section 7 before we close with our final conclusions.

2 Variational Autoencoders

For a general discussion, we consider a collection of high-dimensional samples $\mathcal{X} = \{x_i\}_{i=1}^{N}$ drawn from the true data distribution $p_{\text{data}}(x)$ over a random variable $X$ taking values in the input space. The aim of generative modeling is to learn from $\mathcal{X}$ a mechanism to draw new samples $x_{\text{new}} \sim p_{\text{data}}$.

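Variational Autoencoders provide a powerful latent variable framework to infer such a mechanism. The generative process of the VAE is defined as

$$z_{\text{new}} \sim p(Z), \qquad x_{\text{new}} \sim p_\theta(X \mid Z = z_{\text{new}}) \tag{1}$$

where $p(Z)$ is a fixed prior distribution over a low-dimensional latent variable $Z$. A stochastic decoder

$$D_\theta(z) = x \sim p_\theta(x \mid z) = p(X \mid g_\theta(z)) \tag{2}$$

links the latent space to the input space through the likelihood $p_\theta$, where $g_\theta$ is an expressive non-linear function parameterized by $\theta$.¹ As a result, a VAE estimates $p_{\text{data}}(x)$ as the infinite mixture model $p_\theta(x) = \int p_\theta(x \mid z)\, p(z)\, dz$. At the same time, the input space is mapped to the latent space via a stochastic encoder

$$E_\phi(x) = z \sim q_\phi(z \mid x) = q(Z \mid f_\phi(x)) \tag{3}$$

where $q_\phi(z \mid x)$ is the posterior distribution given by a second function $f_\phi$ parameterized by $\phi$.

¹ With slight abuse of notation, we use lowercase letters for both random variables and their realizations, e.g., $p_\theta(x \mid z)$ instead of $p(X \mid Z = z)$, when it is clear to discriminate between the two.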

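Computing the marginal log-likelihood $\log p_\theta(x)$ is generally intractable. We therefore follow a variational approach, maximizing the evidence lower bound (ELBO) for a sample $x$:

$$\log p_\theta(x) \geq \text{ELBO}(\phi, \theta, x) = \mathbb{E}_{z \sim q_\phi(z \mid x)} \log p_\theta(x \mid z) - \text{KL}\big(q_\phi(z \mid x) \,\|\, p(z)\big) \tag{4}$$

Maximizing Eq. 4 over data $\mathcal{X}$ with respect to the model parameters $\phi$, $\theta$ corresponds to minimizing the loss

$$\arg\min_{\phi, \theta}\; \mathbb{E}_{x \sim p_{\text{data}}} \mathcal{L}_{\text{ELBO}} = \mathbb{E}_{x \sim p_{\text{data}}} \left[ \mathcal{L}_{\text{REC}} + \mathcal{L}_{\text{KL}} \right] \tag{5}$$

where $\mathcal{L}_{\text{REC}}$ and $\mathcal{L}_{\text{KL}}$ are defined for a sample $x$ as follows:

$$\mathcal{L}_{\text{REC}} = -\mathbb{E}_{z \sim q_\phi(z \mid x)} \log p_\theta(x \mid z) \tag{6}$$

$$\mathcal{L}_{\text{KL}} = \text{KL}\big(q_\phi(z \mid x) \,\|\, p(z)\big) \tag{7}$$

Intuitively, the reconstruction loss $\mathcal{L}_{\text{REC}}$ takes into account the quality of autoencoded samples $x$ through $D_\theta(E_\phi(x))$, while the KL-divergence term $\mathcal{L}_{\text{KL}}$ encourages $q_\phi(z \mid x)$ to match the prior $p(z)$ for each $z$, which acts as a regularizer during training [22]. For the purpose of generating high-quality samples, a balance between these two loss terms must be found during training, see Fig. 2.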

2.1 Practice and shortcomings of VAEs

To fit a VAE to data through Eq. 5 one has to specify the parametric forms for $p(z)$, $q_\phi(z \mid x)$, $p_\theta(x \mid z)$, and hence the deterministic mappings $f_\phi$ and $g_\theta$. In practice, the choice for the above distributions is guided by trading off computational complexity with model expressiveness.

Figure 1: Test reconstruction quality for a VAE trained on MNIST with different numbers of samples $k$ in the latent space as in Eq. 8, measured by FID (lower is better). Larger numbers of samples clearly improve training; however, the increased accuracy comes with larger requirements for memory and computation. In practice, the most common choice is therefore $k = 1$.

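In the most common formulation of the VAE, $q_\phi(z \mid x)$ and $p_\theta(x \mid z)$ are assumed to be Gaussian

$$E_\phi(x) \sim \mathcal{N}\big(Z \mid \mu_\phi(x), \text{diag}(\sigma_\phi(x)^2)\big) \tag{8}$$

$$D_\theta(E_\phi(x)) \sim \mathcal{N}\big(X \mid \mu_\theta(z), \text{diag}(\sigma_\theta(z)^2)\big) \tag{9}$$

with means $\mu_\phi$, $\mu_\theta$ and covariance parameters $\sigma_\phi$, $\sigma_\theta$ given by $f_\phi$ and $g_\theta$. In practical implementations, the covariance of the decoder is set to the identity matrix for all $z$, i.e., $\sigma_\theta(z) = 1$ [13]. The expectation of $\mathcal{L}_{\text{REC}}$ in Eq. 6 is then approximated via $k$ Monte Carlo point estimates. We find clear evidence that larger values of $k$ lead to improvements in training, as shown in Fig. 1. Nevertheless, only a 1-sample approximation is carried out in practice [27] since memory and computation requirements scale linearly with $k$. With this approximation, the computation of $\mathcal{L}_{\text{REC}}$ is given by the mean squared error between input samples $x$ and their mean reconstructions $\mu_\theta$ through the deterministic decoder:

$$\mathcal{L}_{\text{REC}} = \| x - \mu_\theta(E_\phi(x)) \|_2^2 \tag{10}$$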

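Gradients with respect to the encoder parameters $\phi$ are computed through the expectation of $\mathcal{L}_{\text{REC}}$ in Eq. 6 via the reparametrization trick [27], where the stochasticity of $E_\phi$ is relegated to an auxiliary random variable $\epsilon$ which does not depend on $\phi$:

$$E_\phi(x) = \mu_\phi(x) + \sigma_\phi(x) \odot \epsilon, \qquad \epsilon \sim \mathcal{N}(0, I) \tag{11}$$

where $\odot$ denotes the Hadamard product. An additional simplifying assumption involves fixing the prior $p(z)$ to be a $d$-dimensional isotropic Gaussian $\mathcal{N}(Z \mid 0, I)$. For this choice, the KL-divergence for a sample $x$ is given in closed form:

$$2\mathcal{L}_{\text{KL}} = \|\mu_\phi(x)\|_2^2 - d + \sum_{i=1}^{d} \big( \sigma_\phi(x)_i^2 - \log \sigma_\phi(x)_i^2 \big) \tag{12}$$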

While the chosen assumptions make VAEs easy to implement, the stochasticity in the encoder and decoder has been deemed to be responsible for the “blurriness” in VAE samples [34, 53, 13]. Furthermore, the optimization problem as shown in Eq. 5 presents some further challenges. Imposing a strong weight on the $\mathcal{L}_{\text{KL}}$ term during optimization can dominate $\mathcal{L}_{\text{REC}}$, having the effect of over-regularization which leads to blurred samples, see Fig. 2. Heuristics to avoid this include gradually annealing the importance of $\mathcal{L}_{\text{KL}}$ during training [8, 5] and manually fine-tuning the balance between the losses.
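As a rough illustration of the quantities discussed in this section, the per-sample Gaussian VAE loss (reparametrized sampling, squared-error reconstruction, closed-form KL term) might be sketched as follows. This is a NumPy sketch under stated assumptions, not the authors' implementation: the function name `gaussian_vae_loss`, the `decode` callable, the argument `kl_weight` (standing in for the manually tuned balance between the two losses), and the convention that `sigma` holds standard deviations are all choices made for this example.

```python
import numpy as np

def gaussian_vae_loss(x, mu, sigma, decode, kl_weight=1.0, k=1, rng=None):
    """Per-sample Gaussian VAE loss with a k-sample Monte Carlo reconstruction term.

    x:      (d_x,) input sample
    mu:     (d_z,) encoder mean
    sigma:  (d_z,) encoder standard deviations (assumed positive)
    decode: hypothetical callable mapping a (d_z,) code to a (d_x,) mean reconstruction
    """
    rng = np.random.default_rng() if rng is None else rng
    rec = 0.0
    for _ in range(k):
        eps = rng.standard_normal(mu.shape)   # auxiliary noise, independent of the encoder
        z = mu + sigma * eps                  # reparametrization trick
        rec += np.sum((x - decode(z)) ** 2)   # squared error to the mean reconstruction
    rec /= k
    # Closed-form KL between N(mu, diag(sigma^2)) and the standard normal prior N(0, I).
    kl = 0.5 * np.sum(mu ** 2 + sigma ** 2 - 1.0 - np.log(sigma ** 2))
    return rec + kl_weight * kl, rec, kl
```

Setting `k=1` gives the 1-sample approximation that is common in practice; larger `k` averages more reconstructions at proportionally higher cost.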

Figure 2: Reconstruction and random sample quality (FID, y-axis, lower is better) of a VAE on MNIST for different tradeoffs between $\mathcal{L}_{\text{REC}}$ and $\mathcal{L}_{\text{KL}}$ (x-axis, see Eq. 5). Higher weights for $\mathcal{L}_{\text{KL}}$ improve random samples but hurt reconstruction. Enforcing structure in the VAE latent space leads to a penalty in quality.

Even after employing the full array of approximations and “tricks” to reach convergence of Eq. 5 for a satisfactory set of parameters, there is no guarantee that the learned latent space is distributed according to the assumed prior distribution. In other words, the aggregated posterior distribution $q_\phi(z) = \mathbb{E}_{x \sim p_{\text{data}}}\, q_\phi(z \mid x)$ has been shown not to conform well to $p(z)$ after training [53, 5, 13]. This critical issue severely hinders the generative mechanism of a VAE (Eq. 1) since latent codes sampled from $p(z)$ (instead of $q_\phi(z)$) might lead to regions of the latent space that were previously unseen by $D_\theta$ during training. The result is blurry out-of-distribution samples. We analyze solutions to this problem in Section 5.

2.2 Constant-Variance Encoders

Analogously to what is generally done for decoders, we also investigate fixing the variance of $q_\phi(z \mid x)$ to be constant for all $x$. This simplifies the computation of $E_\phi$ from Eq. 11 to

$$E_\phi^{\text{CV}}(x) = \mu_\phi(x) + \epsilon, \qquad \epsilon \sim \mathcal{N}(0, \sigma^2 I) \tag{13}$$

where $\sigma$ is a fixed scalar. At the same time, the KL loss term in Eq. 12 simplifies (up to constants) to

$$\mathcal{L}_{\text{KL}}^{\text{CV}} = \|\mu_\phi(x)\|_2^2 \tag{14}$$

Constant-Variance VAEs (CV-VAEs) have been previously applied in adversarial robustness [15] and variational image compression [4], but to the best of our knowledge, there is no systematic study of CV-VAEs in the literature. As noted in [15], treating $\sigma$ as a constant impairs the assumption of $p(z)$ to be an isotropic Gaussian, which demands a more complex prior structure over $Z$. We address this mismatch in Sec. 5.
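The simplification of the KL term for a constant-variance encoder can be checked with a short worked step. The following is a sketch, assuming that the encoder covariance is $\sigma^2 I$ with a fixed scalar $\sigma$ and that the prior is the $d$-dimensional standard normal (both the $\sigma^2 I$ parameterization and the symbol $d$ are conventions adopted for this example):

```latex
\begin{aligned}
2\,\mathrm{KL}\big(\mathcal{N}(\mu_\phi(x), \sigma^2 I)\,\|\,\mathcal{N}(0, I)\big)
  &= \|\mu_\phi(x)\|_2^2 + \sum_{i=1}^{d}\big(\sigma^2 - \log \sigma^2 - 1\big) \\
  &= \|\mu_\phi(x)\|_2^2 + \underbrace{d\,\big(\sigma^2 - \log \sigma^2 - 1\big)}_{\text{constant in } \phi}
\end{aligned}
```

Only the squared norm of the mean depends on the encoder parameters, so minimizing the KL term amounts to penalizing $\|\mu_\phi(x)\|_2^2$ up to scale and an additive constant.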

3 Deterministic Regularized Autoencoders

As described in Section 2, autoencoding in VAEs is defined in a probabilistic fashion: $E_\phi$ and $D_\theta$ map data points not to a single point, but rather to parameterized distributions as shown in Equations 8 and 9. However, the practical implementation of the VAE admits a deterministic view for this probabilistic mechanism.

A glance at the autoencoding mechanism of the VAE is revealing. The encoder maps a data point $x$ to a mean $\mu_\phi(x)$ and a variance parameter $\sigma_\phi(x)$ in the latent space, from which a latent code is drawn via the reparametrization trick given in Eq. 11. The input to the decoder is then simply the mean $\mu_\phi(x)$ augmented with random Gaussian noise scaled by $\sigma_\phi(x)$. In the CV-VAE, this relationship is even more obvious, as the magnitude of the noise is fixed for all data points (Eq. 13). In this light, a VAE can be seen as a deterministic autoencoder where Gaussian noise is added to the decoder’s input.

Using random noise injection to regularize neural networks during training is a well-known technique that dates back several decades [47, 2]. The addition of noise implicitly smooths the function learned by the network. Since this procedure also adds noise to the gradients, we propose to substitute noise injection with an explicit regularization scheme for the decoder network. Note that from a generative perspective, this is motivated by the goal to learn a smooth latent space where similar data points $x$ are mapped to similar latent codes $z$, and small variations in $z$ lead to reconstructions by $D_\theta$ that vary only slightly.

By removing the noise injection mechanism from the CV-VAE, we are effectively left with a deterministic Regularized Autoencoder (RAE) that can be coupled with any type of explicit regularization for the decoder to enforce a smooth latent space. Training a RAE thus involves minimizing the simplified loss

$$\mathcal{L}_{\text{RAE}} = \mathcal{L}_{\text{REC}} + \beta \mathcal{L}_{z}^{\text{RAE}} + \lambda \mathcal{L}_{\text{REG}} \tag{15}$$

where $\mathcal{L}_{\text{REG}}$ represents the explicit regularizer for $D_\theta$ (see Sec. 3.1) and $\mathcal{L}_{z}^{\text{RAE}} = \mathcal{L}_{\text{KL}}^{\text{CV}}$ from Eq. 14 is equivalent to constraining the size of the learned latent space. Note that for RAEs, no sampling approximation of $\mathcal{L}_{\text{REC}}$ is required, thus relieving the need for more samples from $q_\phi(z \mid x)$ to achieve better image quality (see Fig. 1).
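The structure of this simplified loss (a reconstruction term, a penalty on the size of the latent code, and an explicit decoder regularizer) can be sketched in a few lines. This is an illustrative NumPy sketch, not the authors' code: `rae_loss`, the `encode`, `decode` and `reg_term` callables, and the default weights `beta` and `lam` are all hypothetical names and values chosen for the example.

```python
import numpy as np

def rae_loss(x, encode, decode, reg_term, beta=1e-4, lam=1e-4):
    """Deterministic RAE loss for one sample: reconstruction + latent penalty + regularizer.

    encode, decode: hypothetical deterministic mappings (no sampling involved)
    reg_term:       hypothetical callable returning the explicit decoder regularizer at z
    beta, lam:      illustrative weights, not values taken from the paper
    """
    z = encode(x)                        # deterministic latent code
    rec = np.sum((x - decode(z)) ** 2)   # squared reconstruction error
    z_pen = np.sum(z ** 2)               # penalty on the size of the latent code
    return rec + beta * z_pen + lam * reg_term(z)

# Toy usage with linear maps and a weight-decay style regularizer on the decoder matrix.
A = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])   # toy encoder weights
B = A.T                                            # toy decoder weights
loss = rae_loss(np.array([1.0, 2.0, 3.0]),
                encode=lambda x: A @ x,
                decode=lambda z: B @ z,
                reg_term=lambda z: np.sum(B ** 2),
                beta=0.1, lam=0.5)
```

Because the encoder is deterministic, the loss is evaluated with a single forward pass per sample; there is no Monte Carlo average over latent draws.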

3.1 Regularization Schemes for RAEs

Among possible choices for a mechanism to use for $\mathcal{L}_{\text{REG}}$, a first obvious candidate is Tikhonov regularization [52] since it is known to be related to the addition of low-magnitude input noise [7]. Training a RAE within this framework thus amounts to adopting

$$\mathcal{L}_{\text{REG}} = \mathcal{L}_{L_2} = \|\theta\|_2^2 \tag{16}$$

where $\mathcal{L}_{L_2}$ effectively applies weight decay on the decoder parameters $\theta$.

Another avenue comes from the recent GAN literature where regularization is a hot topic [29] and where injecting noise to the input of the adversarial discriminator has led to improved performance in a technique called instance noise [49]. To enforce Lipschitz continuity on adversarial discriminators, weight clipping has been proposed [3], which is however known to significantly slow down training. More successfully, a gradient penalty on the discriminator can be used similar to [18, 35], yielding the objective

$$\mathcal{L}_{\text{REG}} = \mathcal{L}_{\text{GP}} = \|\nabla D_\theta(E_\phi(x))\|_2^2 \tag{17}$$

which encourages a small norm of the gradient of the decoder with respect to its input.
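To make the gradient penalty concrete, the sketch below evaluates it for a toy decoder whose Jacobian with respect to the latent input is available in closed form. Everything here is an assumption made for illustration: the one-hidden-layer tanh decoder, the function names, and the choice of the squared Frobenius norm of the Jacobian as the penalty. In a deep-learning framework the same quantity would normally be obtained through automatic differentiation.

```python
import numpy as np

def decoder(z, W1, b1, W2, b2):
    """Toy decoder (an assumption of this example): x = W2 tanh(W1 z + b1) + b2."""
    return W2 @ np.tanh(W1 @ z + b1) + b2

def decoder_jacobian(z, W1, b1, W2, b2):
    """Analytic Jacobian dx/dz of the toy decoder, shape (d_x, d_z)."""
    h = np.tanh(W1 @ z + b1)
    return W2 @ (np.diag(1.0 - h ** 2) @ W1)   # chain rule through tanh

def gradient_penalty(z, W1, b1, W2, b2):
    """Squared Frobenius norm of the decoder Jacobian at the latent code z."""
    J = decoder_jacobian(z, W1, b1, W2, b2)
    return np.sum(J ** 2)
```

A small penalty value means that the decoder output changes little under small perturbations of `z`, which is the smoothness property the regularizer is meant to encourage.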

Additionally, spectral normalization (SN) has been proposed as an alternative way to bound the Lipschitz norm of an adversarial discriminator, showing very promising results for GAN training [36]. SN normalizes the weight matrix $\theta_\ell$ for each layer $\ell$ in the decoder by dividing it by an estimate of its largest singular value:

$$\theta_\ell^{\text{SN}} = \theta_\ell / s(\theta_\ell) \tag{18}$$

where $s(\theta_\ell)$ is the current estimate obtained through the power method.
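The power-method estimate of the largest singular value can be sketched as follows. This is a minimal NumPy illustration under stated assumptions rather than the implementation used in the paper: the function name `spectral_normalize`, the persistent vector `u`, and the default of a single iteration per call are choices made for this example.

```python
import numpy as np

def spectral_normalize(W, u=None, n_iters=1, rng=None):
    """Divide W by a power-method estimate of its largest singular value.

    u: optional left-singular-vector estimate carried over from a previous call
       (hypothetical convention so that repeated calls refine the estimate).
    Returns the normalized matrix and the updated u.
    """
    rng = np.random.default_rng() if rng is None else rng
    if u is None:
        u = rng.standard_normal(W.shape[0])
    u = u / np.linalg.norm(u)
    for _ in range(n_iters):
        v = W.T @ u
        v = v / np.linalg.norm(v)   # right singular vector estimate
        u = W @ v
        u = u / np.linalg.norm(u)   # left singular vector estimate
    s = u @ W @ v                   # estimate of the largest singular value
    return W / s, u
```

With enough iterations the largest singular value of the returned matrix approaches 1; with a single iteration per call, the estimate improves gradually as `u` is reused across calls.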

Lastly, in light of recent success stories of deep neural networks without explicit regularization achieving state-of-the-art results [59, 58], it is intriguing to question the need to explicitly regularize the decoder in order to obtain a meaningful latent space. The assumption here is that techniques such as dropout [50], batch normalization [24], adding noise during training [2], or early stopping in conjunction with novel architectural developments implicitly regularize the networks enough to smooth the latent space. Therefore, as a natural baseline to the objectives introduced above, we also consider the RAE framework without $\mathcal{L}_{\text{REG}}$ and $\mathcal{L}_{z}^{\text{RAE}}$, a standard deterministic autoencoder optimizing $\mathcal{L}_{\text{REC}}$ only.

4 A Probabilistic Derivation of Smoothing

In this section, we propose an alternative view on enforcing smoothness on the output of $D_\theta$ by augmenting the ELBO optimization problem for VAEs with an explicit constraint. While we keep the Gaussianity assumptions over a stochastic $D_\theta$ and $p(z)$ for convenience, we are not fixing a parametric form for $q_\phi(z \mid x)$ yet. We will then discuss how some parametric restrictions over $q_\phi(z \mid x)$ indeed lead to a variation of the RAE framework in Eq. 15, specifically as the introduction of $\mathcal{L}_{\text{GP}}$ as a regularizer of a deterministic version of the CV-VAE.

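To start, we can augment the minimization in Eq. 5 as:

$$\arg\min_{\phi,\theta}\; \mathbb{E}_{x \sim p_{\text{data}}} \left[\mathcal{L}_{\text{REC}} + \mathcal{L}_{\text{KL}}\right] \quad \text{s.t.} \quad \|D_\theta(z_1) - D_\theta(z_2)\|_p < \epsilon \quad \forall\, z_1, z_2 \sim q_\phi(z \mid x),\ \forall\, x \sim p_{\text{data}} \tag{19}$$

where $D_\theta(z) = \mu_\theta(g_\theta(z))$ and the constraint on the decoder encodes that the output has to vary, in the sense of an $L_p$ norm, only by a small amount $\epsilon$ for any two possible draws from the encoder. Using the mean value theorem, there exists a $\tilde{z}$ such that the left term in the constraint can be bounded as:

$$\|D_\theta(z_1) - D_\theta(z_2)\|_p \leq \|\nabla D_\theta(\tilde{z})\|_p \cdot \|z_1 - z_2\|_p \leq \sup\{\|\nabla D_\theta(z)\|_p\} \cdot \sup\{\|z_1 - z_2\|_p\} \tag{20}$$

where we take the supremum of possible gradients of $D_\theta$ as well as the supremum of a measure of the support of $q_\phi(z \mid x)$. From this form of the smoothness constraint, it is apparent why the choice of a parametric form for $q_\phi(z \mid x)$ can be impactful during training. For a compactly supported isotropic PDF $q_\phi(z \mid x)$, the extension of the support $\sup\{\|z_1 - z_2\|_p\}$ would be dependent on $\mathbb{H}(q_\phi(z \mid x))$, the entropy of $q_\phi(z \mid x)$, through some functional $r$. For instance, a uniform posterior over a hypersphere in $z$ would ascertain $r(\mathbb{H}(q_\phi(z \mid x))) \cong e^{\mathbb{H}(q_\phi(z \mid x))/n} / O(1)$, where $n$ is the dimensionality of the latent space.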

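Intuitively, one would look for parametric distributions that do not favor overfitting by degenerating into Dirac deltas (minimal entropy and support) along any dimension. To this end, an isotropic nature of $q_\phi(z \mid x)$ would favor such a robustness against decoder over-fitting. We can now rewrite the constraint as

$$r(\mathbb{H}(q_\phi(z \mid x))) \cdot \sup\{\|\nabla D_\theta(z)\|_p\} < \epsilon \tag{21}$$

The $\mathcal{L}_{\text{KL}}$ term can be expressed in terms of $\mathbb{H}(q_\phi(z \mid x))$, by decomposing it as $\mathcal{L}_{\text{KL}} = \mathcal{L}_{\text{CE}} - \mathcal{L}_{\mathbb{H}}$, where $\mathcal{L}_{\mathbb{H}} = \mathbb{H}(q_\phi(z \mid x))$ and $\mathcal{L}_{\text{CE}} = \mathbb{H}(q_\phi(z \mid x), p(z))$ represents a cross-entropy term. Therefore, the constrained problem in Eq. 19 can be written in a Lagrangian formulation by including Eq. 21:

$$\arg\min_{\phi,\theta}\; \mathbb{E}_{x \sim p_{\text{data}}} \left[\mathcal{L}_{\text{REC}} + \mathcal{L}_{\text{CE}} - \mathcal{L}_{\mathbb{H}} + \lambda \mathcal{L}_{\text{LANG}}\right] \tag{22}$$

where $\mathcal{L}_{\text{LANG}} = r(\mathbb{H}(q_\phi(z \mid x))) \cdot \|\nabla D_\theta(z)\|_p$. We argue that a reasonable simplifying assumption for $q_\phi(z \mid x)$ is to fix $\mathbb{H}(q_\phi(z \mid x))$ to a single constant for all samples $x$. Intuitively, this can be understood as fixing the variance in $q_\phi(z \mid x)$ as we did for the CV-VAE in Sec. 2.2. With this simplification, Eq. 22 further reduces to

$$\arg\min_{\phi,\theta}\; \mathbb{E}_{x \sim p_{\text{data}}} \left[\mathcal{L}_{\text{REC}} + \|\nabla D_\theta(z)\|_p + \|z\|_2^2\right] \tag{23}$$

We can see that $\|\nabla D_\theta(z)\|_p$ turns out to be the gradient penalty $\mathcal{L}_{\text{GP}}$ and $\mathcal{L}_{\text{CE}} = \|z\|_2^2$ corresponds to $\mathcal{L}_{z}^{\text{RAE}}$, thus recovering our RAE framework as presented in Eq. 15.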
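To see why the cross-entropy term reduces to a squared norm once the encoder variance is fixed, the expectation can be written out. This is a short sketch that assumes a Gaussian encoder $\mathcal{N}(\mu_\phi(x), \sigma^2 I)$ with fixed $\sigma$ and the standard normal prior over $d$ latent dimensions (the symbol $d$ and this parametric choice are assumptions of the example):

```latex
\begin{aligned}
\mathbb{H}\big(q_\phi(z \mid x),\, p(z)\big)
  &= -\mathbb{E}_{z \sim q_\phi(z \mid x)} \log \mathcal{N}(z \mid 0, I) \\
  &= \tfrac{d}{2}\log(2\pi) + \tfrac{1}{2}\,\mathbb{E}_{z \sim q_\phi(z \mid x)} \|z\|_2^2 \\
  &= \tfrac{d}{2}\log(2\pi) + \tfrac{1}{2}\big(\|\mu_\phi(x)\|_2^2 + d\,\sigma^2\big)
\end{aligned}
```

Only $\|\mu_\phi(x)\|_2^2$ depends on the encoder parameters. For a deterministic encoder with $z = \mu_\phi(x)$ this is the $\|z\|_2^2$ penalty, up to scale and additive constants.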

5 Ex-Post Density Estimation

By removing stochasticity and, ultimately, the KL divergence term from RAEs, we have simplified the original VAE objective at the cost of detaching the encoder from the prior $p(z)$ over the latent space. This implies that i) we cannot ensure that the latent space $Z$ is distributed according to a simple distribution anymore (e.g., an isotropic Gaussian) and consequently, ii) we lose the simple mechanism provided by $p(z)$ to sample from $Z$ as in Eq. 1.

As discussed in Section 2.1, issue i) is compromising the VAE framework in any case in practice as reported in several works [14, 42, 22]. To fix this, some works extend the VAE objective by encouraging the aggregated posterior to match $p(z)$ [53] or by utilizing more complex priors [26, 5, 54].

To overcome both i) and ii), we instead propose to employ ex-post density estimation over $Z$. We fit a density estimator denoted as $q_\delta(z)$ to $\{z = E_\phi(x) \mid x \in \mathcal{X}\}$. This simple approach not only fits our RAE framework well, but it can also be readily adopted for any VAE or variants thereof such as the WAE as a practical remedy to the aggregated posterior mismatch without adding any computational overhead to the costly training phase.

The choice of $q_\delta(z)$ needs to trade off expressiveness (to provide a good fit of an arbitrary space for $Z$) with simplicity (to improve generalization). Indeed, placing a Dirac distribution on each latent point $z = E_\phi(x)$ would allow the decoder to output only training sample reconstructions. Striving for simplicity and in order to show the effectiveness of the proposed ex-post density estimation scheme, we compare a full covariance multivariate Gaussian with a 10-component Gaussian mixture model (GMM) in our experiments.
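The simpler of the two estimators, a single full-covariance Gaussian fit to the latent codes of the training set, might be sketched as below. This is a NumPy illustration under stated assumptions, not the authors' code: `fit_gaussian` and `sample_latents` are hypothetical helper names, and `latents` is assumed to be an array of encoder outputs of shape `(n_samples, d_z)` computed after training.

```python
import numpy as np

def fit_gaussian(latents):
    """Fit a full-covariance Gaussian to latent codes of shape (n_samples, d_z)."""
    mean = latents.mean(axis=0)
    cov = np.cov(latents, rowvar=False)   # (d_z, d_z) sample covariance
    return mean, cov

def sample_latents(mean, cov, n, rng=None):
    """Draw n new latent codes from the fitted Gaussian."""
    rng = np.random.default_rng() if rng is None else rng
    return rng.multivariate_normal(mean, cov, size=n)
```

New data points would then be obtained by passing the sampled codes through the trained decoder. A mixture model follows the same fit-then-sample pattern with a more expressive density.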

6 Related works

Many works have focused on diagnosing the VAE framework, the terms in its objective [1, 22, 60], and ultimately augmenting it to solve optimization issues [39, 13]. With RAE, we argue that a simpler deterministic framework can be competitive for generative modeling.

Deterministic denoising [57] and contractive autoencoders (CAEs) [41] have received attention in the past for their ability to capture a smooth data manifold. Heuristic attempts to equip them with a generative mechanism include MCMC schemes [40, 6]. However, they are hard to diagnose for convergence, require a considerable effort in tuning [12], and have not scaled beyond MNIST, leading to them being superseded by VAEs. While the proposed RAE is similar in spirit, it requires much less computational effort than computing the Jacobian for CAEs [41].

Approaches to cope with the aggregated posterior mismatch involve fixing a more expressive form for $p(z)$ [26, 5], therefore altering the VAE objective and requiring considerable additional computational efforts. Estimating the latent space of a VAE with a second VAE [13] reintroduces many of the optimization shortcomings discussed for VAEs and is much more expensive in practice compared to fitting a simple $q_\delta(z)$ after training.

GANs [17] have received widespread attention for their ability to produce sharp samples. Despite theoretical and practical advances [3], the training procedure of GANs is still unstable, sensitive to hyperparameters, and prone to the mode collapse problem [33, 43, 44].

Adversarial Autoencoders (AAE) [34] add a discriminator to a deterministic encoder–decoder pair, leading to sharper samples at the expense of higher computational overhead and the introduction of instabilities caused by the adversarial nature of the training process. Wasserstein Autoencoders (WAE) [53] have been introduced as a generalization of AAEs by casting autoencoding as an optimal transport (OT) problem. Both stochastic and deterministic models can be trained by minimizing a relaxed OT cost function employing either an adversarial loss term or the maximum mean discrepancy score between $p(z)$ and $q_\phi(z)$ as a regularizer in place of $\mathcal{L}_{\text{KL}}$.

Within the RAE framework, we look at this problem from a different perspective: instead of explicitly imposing a simple structure on $q_\phi(z)$ that might impair the ability to fit high-dimensional data during training, we propose to model the latent space by an ex-post density estimation step.

Figure 3: Qualitative evaluation of sample quality for VAEs, WAEs and RAEs on CelebA. Left: reconstructed samples (top row is ground truth). Middle: randomly generated samples. Right: interpolations in the latent space between a pair of test images (first and last column). RAE models provide overall sharper samples and reconstructions while interpolating smoothly in the latent space. Corresponding qualitative overviews for MNIST and CIFAR-10 are provided in Appendix D.

7 Experiments

In this section, we investigate the performance of several VAE and RAE variants on MNIST [31], CIFAR-10 [28] and CelebA [32]. We measure three qualities for each model: held-out sample reconstruction quality, random sample quality, and interpolation quality. While reconstructions give us a lower bound on the best quality achievable by the generative model, random sample quality indicates how well the model generalizes. Finally, interpolation quality sheds light on the structure of the learned latent space.

        | MNIST                           | CIFAR-10                          | CelebA
Model   | Rec.    N       GMM     Interp. | Rec.    N        GMM      Interp. | Rec.    N        GMM     Interp.
VAE     | 18.26   19.21*  17.66   18.21   | 57.94   106.37   103.78   88.62   | 39.12   48.12    45.52   44.49
CV-VAE  | 15.15   33.79   17.87   25.12   | 37.74   94.75    86.64    69.71   | 40.41   48.87    49.30   44.96
WAE     | 10.03*  20.42   9.39    14.34*  | 35.97   117.44   93.53    76.89   | 34.81*  53.67    42.73   40.93
RAE-GP  | 14.04   22.21   11.54   15.32   | 32.17   83.05    76.33    64.08   | 39.71   116.30   45.63   47.00
RAE-L2  | 10.53   22.22   8.69*   14.54   | 32.24   80.80*   74.16*   62.54   | 43.52   51.13    47.97   45.98
RAE-SN  | 15.65   19.67   11.74   15.15   | 27.61*  84.25    75.30    63.62   | 36.01   44.74*   40.95*  39.53*
RAE     | 11.67   23.92   9.81    14.67   | 29.05   83.87    76.28    63.27   | 40.18   48.20    44.68   43.67
AE      | 12.95   58.73   10.66   17.12   | 30.52   84.74    76.47    61.57*  | 40.79   127.85   45.10   50.94

Table 1: Evaluation of all models by FID (lower numbers are better; the best value in each column is marked with *). We evaluate each model by Rec.: test sample reconstruction; N: random samples generated according to the prior distribution $p(z)$ (for VAE / WAE) or by fitting a Gaussian to $q_\delta(z)$ (for the remaining models); GMM: random samples generated by fitting a mixture of 10 Gaussians in the latent space; Interp.: mid-point interpolation between random pairs of test reconstructions. Note that our (less constrained) RAE models are competitive with or outperform the VAE and WAE throughout the evaluation. Surprisingly, interpolations do not suffer from the lack of explicit prior on the latent space in our models. Furthermore, the unregularized AE achieves very good FID scores when combined with the proposed ex-post density estimation technique.

7.1 Models

We compare the proposed RAE model with the gradient penalty (RAE-GP), with weight decay (RAE-L2), and with spectral normalization (RAE-SN). Additionally, we consider two models for which we either add only the latent code regularizer $\mathcal{L}_{z}^{\text{RAE}}$ to $\mathcal{L}_{\text{REC}}$ (RAE), or no explicit regularization at all (AE). As baselines, we further employ a regular VAE, the constant-variance VAE (CV-VAE) for comparison, and finally, a Wasserstein Autoencoder (WAE) with the MMD loss as a state-of-the-art alternative.

Aiming for a fair comparison, we employ the same network architecture for all models. We largely follow the models adopted in [53] with the difference that we consistently apply batch normalization [24] for all models as we found it to improve performance across the range. The latent space dimension is 16 for MNIST, 128 for CIFAR-10 and 64 for CelebA. Further details about the network architecture and training procedure are given in Appendix A.

7.2 Evaluation

The evaluation of generative models is a nontrivial research question [51, 45, 33]. Since we are interested in the quality of samples, the ubiquitous Fréchet Inception Distance (FID) [20] is a reasonable choice for comparing different models. More recently, a notion of precision and recall for distributions (PRD) has been proposed [43], separating sample quality from diversity. We report the popular FID scores in this section and we provide PRD scores in Appendix C.
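For orientation, the Fréchet distance underlying FID compares the means and covariances of two sets of feature vectors. The sketch below shows only this final step and rests on assumptions: the inputs `feats_a` and `feats_b` are taken to be precomputed feature arrays of shape `(n_samples, d)` (for FID these come from an Inception network, which is not reproduced here), and the function name is hypothetical.

```python
import numpy as np

def frechet_distance(feats_a, feats_b):
    """Frechet distance between Gaussians fitted to two feature sets of shape (n, d)."""
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    cov_a = np.cov(feats_a, rowvar=False)
    cov_b = np.cov(feats_b, rowvar=False)
    # Tr((cov_a cov_b)^{1/2}) equals the sum of the square roots of the
    # eigenvalues of cov_a @ cov_b, which are real and non-negative in exact arithmetic.
    eigvals = np.linalg.eigvals(cov_a @ cov_b)
    tr_sqrt = np.sum(np.sqrt(np.clip(eigvals.real, 0.0, None)))
    return np.sum((mu_a - mu_b) ** 2) + np.trace(cov_a) + np.trace(cov_b) - 2.0 * tr_sqrt
```

Identical feature sets give a distance of (numerically) zero, and shifting one set by a constant vector increases the distance by the squared length of that shift.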

We compute the FID of the reconstructions of random validation samples against the test set to evaluate reconstruction quality. For evaluating generative modeling capabilities, we compute the FID between the test data and randomly drawn samples from a single Gaussian that is either the isotropic $p(z)$ available for VAEs and WAEs, or a single Gaussian fit to $q_\delta(z)$ for CV-VAEs and RAEs. For all models, we also evaluate random samples from a 10-component Gaussian Mixture model (GMM) fit to $q_\delta(z)$. Using only 10 components prevents us from overfitting (which would indeed give good FIDs when compared with the test set). We note that fitting GMMs with up to 100 components only improved results marginally. Additionally, we provide nearest-neighbours from the training set in Appendix E to show that the models are not overfitting. For interpolations, we report the FID for the furthest interpolation points resulting from applying spherical interpolation to randomly selected validation reconstruction pairs.
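Spherical interpolation between two latent codes can be sketched as follows. This is a generic formulation given for illustration; the function name `slerp` and the fallback to linear interpolation for nearly parallel codes are choices made for this example, and the exact variant used in the experiments is not specified in the text.

```python
import numpy as np

def slerp(z0, z1, t):
    """Spherical interpolation between latent codes z0 and z1 at fraction t in [0, 1]."""
    z0n = z0 / np.linalg.norm(z0)
    z1n = z1 / np.linalg.norm(z1)
    # Angle between the two codes, clipped for numerical safety.
    omega = np.arccos(np.clip(np.dot(z0n, z1n), -1.0, 1.0))
    if np.isclose(omega, 0.0):
        return (1.0 - t) * z0 + t * z1   # nearly parallel: fall back to linear interpolation
    return (np.sin((1.0 - t) * omega) * z0 + np.sin(t * omega) * z1) / np.sin(omega)
```

The mid-point `t = 0.5` is the interpolation point furthest from both endpoints; decoding it gives the images whose FID is reported for interpolations. Exactly antiparallel codes are not handled by this sketch.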

7.3 Results

Table 1 summarizes our main results. All of the proposed RAE variants are competitive with the VAE and WAE in generated image quality in all settings. Sampling from RAEs achieves the best FIDs across all datasets when a modest 10-component GMM is employed for ex-post density estimation. Furthermore, even when $\mathcal{N}$ is considered as $q_\delta(z)$, RAEs rank first with the exception of MNIST, though the best FID achieved there by the VAE is very close to the FID of RAE-SN.

Moreover, our best RAE FIDs are lower than the best results reported for VAEs in the large scale comparison of [33], challenging even the best scores reported for GANs. While we are employing a slightly different architecture than theirs, our models underwent only modest finetuning instead of an extensive hyperparameter search. By looking at the differently regularized RAEs, there is no clear winner across all settings as all perform equally well. For practical reasons of implementation simplicity, one may prefer RAE-L2 over the GP and SN variants.

Surprisingly, the implicitly regularized RAE and AE models are shown to be able to score impressive FIDs when $q_\delta(z)$ is fit through GMMs. FIDs for AEs decrease from 58.73 to 10.66 on MNIST and from 127.85 to 45.10 on CelebA, a value close to the state of the art. This is a remarkable result that follows a long series of recent confirmations that neural networks are surprisingly smooth by design [37]. It is also surprising that the lack of an explicitly fixed structure on the latent space of the RAE does not impede interpolation quality. This is further confirmed by the qualitative evaluation on CelebA as reported in Fig. 3 and for the other datasets in Appendix D, where RAE interpolated samples seem sharper than those of competitors and transitions smoother.

We would like to note that our extensive study confirms and quantifies the effect of the aggregated posterior mismatch as well as the effectiveness of our proposed solution to it. Indeed, if we consider the effect of applying ex-post density estimation to each model in Table 1, we see that it consistently and considerably improves sample quality across all settings. A 10-component GMM trained to fit $q_\delta(z)$ seems to be enough to halve the FID scores from $\sim$20 to $\sim$10 for WAE and RAE models on MNIST and from 116 to 46 on CelebA. This is striking since this very cheap additional step can be applied to any VAE-like generative model to boost the quality of generated samples.

All in all, the results strongly support our conjecture that the simple deterministic RAE framework can challenge the VAE and WAE.

8 Conclusion

While the theoretical derivation of the VAE has helped popularize the framework for generative modeling, recent works have started to expose some discrepancies between theory and practice. In this work, viewing sampling in VAEs as noise injection to enforce smoothness has enabled us to distill a deterministic autoencoding framework that is compatible with several regularization techniques to learn a meaningful latent space. We have demonstrated that such a deterministic autoencoding framework can generate comparable or better samples than VAEs, while getting around the practical drawbacks tied to a stochastic framework. Furthermore, we have shown that our solution to fit a simple density estimator such as a Gaussian Mixture Model on the learned latent space is able to consistently improve sample quality both for the proposed RAE framework as well as for the VAE and WAE, acting as a solution for the known mismatch between the prior and the aggregated posterior. The RAE framework opens interesting future research avenues such as learning the density estimator in an end-to-end fashion with the autoencoding network and devising more sophisticated autoencoders that can access the full range of recent structural neural network advancements to scale generative modeling without being bound to the VAE’s restrictions.


We would like to thank Anant Raj, Matthias Bauer, Paul Rubenstein and Soubhik Sanyal for fruitful discussions.


  • [1] A. Alemi, B. Poole, I. Fischer, J. Dillon, R. A. Saurous, and K. Murphy. Fixing a Broken ELBO. In ICML, 2018.
  • [2] G. An. The effects of adding noise during backpropagation training on a generalization performance. In Neural computation, 1996.
  • [3] M. Arjovsky, S. Chintala, and L. Bottou. Wasserstein Generative Adversarial Networks. In ICML, 2017.
  • [4] J. Ballé, V. Laparra, and E. P. Simoncelli. End-to-end optimized image compression. In ICLR, 2017.
  • [5] M. Bauer and A. Mnih. Resampled Priors for Variational Autoencoders. In AISTATS, 2019.
  • [6] Y. Bengio, L. Yao, G. Alain, and P. Vincent. Generalized denoising auto-encoders as generative models. In NeurIPS, 2013.
  • [7] C. M. Bishop. Pattern recognition and machine learning. Springer, 2006.
  • [8] S. R. Bowman, L. Vilnis, O. Vinyals, A. M. Dai, R. Jozefowicz, and S. Bengio. Generating sentences from a continuous space. In CoNLL, 2016.
  • [9] A. Brock, J. Donahue, and K. Simonyan. Large scale gan training for high fidelity natural image synthesis. In ICLR, 2019.
  • [10] Y. Burda, R. Grosse, and R. Salakhutdinov. Importance weighted autoencoders. arXiv preprint arXiv:1509.00519, 2015.
  • [11] X. Chen, D. P. Kingma, T. Salimans, Y. Duan, P. Dhariwal, J. Schulman, I. Sutskever, and P. Abbeel. Variational lossy autoencoder. In ICLR, 2017.
  • [12] M. K. Cowles and B. P. Carlin. Markov chain Monte Carlo convergence diagnostics: a comparative review. In Journal of the American Statistical Association, 1996.
  • [13] B. Dai and D. Wipf. Diagnosing and Enhancing VAE Models. In ICLR, 2019.
  • [14] B. Dai and D. Wipf. Diagnosing and Enhancing VAE Models. In ICLR, 2019.
  • [15] P. Ghosh, A. Losalka, and M. J. Black. Resisting Adversarial Attacks using Gaussian Mixture Variational Autoencoders. In AAAI, 2019.
  • [16] R. Gómez-Bombarelli, J. N. Wei, D. Duvenaud, J. M. Hernández-Lobato, B. Sánchez-Lengeling, D. Sheberla, J. Aguilera-Iparraguirre, T. D. Hirzel, R. P. Adams, and A. Aspuru-Guzik. Automatic chemical design using a data-driven continuous representation of molecules. In ACS central science, 2018.
  • [17] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In NeurIPS, 2014.
  • [18] I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. C. Courville. Improved training of wasserstein gans. In NeurIPS, 2017.
  • [19] C. Ham, A. Raj, V. Cartillier, and I. Essa. Variational Image Inpainting. In Bayesian Deep Learning Workshop, NeurIPS, 2018.
  • [20] M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, G. Klambauer, and S. Hochreiter. GANs Trained by a Two Time-Scale Update Rule Converge to a Nash Equilibrium. In NeurIPS, 2017.
  • [21] I. Higgins, L. Matthey, A. Pal, C. Burgess, X. Glorot, M. Botvinick, S. Mohamed, and A. Lerchner. Beta-VAE: Learning basic visual concepts with a constrained variational framework. In ICLR, 2017.
  • [22] M. D. Hoffman and M. J. Johnson. Elbo surgery: yet another way to carve up the variational evidence lower bound. In Workshop in Advances in Approximate Bayesian Inference, NeurIPS, 2016.
  • [23] H. Huang, R. He, Z. Sun, T. Tan, et al. Introvae: Introspective Variational Autoencoders for Photographic Image Synthesis. In NeurIPS, 2018.
  • [24] S. Ioffe and C. Szegedy. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. In ICML, 2015.
  • [25] W. Jin, R. Barzilay, and T. Jaakkola. Junction tree variational autoencoder for molecular graph generation. arXiv preprint arXiv:1802.04364, 2018.
  • [26] D. P. Kingma, T. Salimans, R. Jozefowicz, X. Chen, I. Sutskever, and M. Welling. Improving variational inference with inverse autoregressive flow. In NeurIPS, 2016.
  • [27] D. P. Kingma and M. Welling. Auto-Encoding Variational Bayes. In ICLR, 2014.
  • [28] A. Krizhevsky and G. Hinton. Learning Multiple Layers of Features from Tiny Images, 2009.
  • [29] K. Kurach, M. Lucic, X. Zhai, M. Michalski, and S. Gelly. The GAN landscape: Losses, architectures, regularization, and normalization. arXiv preprint arXiv:1807.04720, 2018.
  • [30] H. Larochelle and I. Murray. The neural autoregressive distribution estimator. In AISTATS, 2011.
  • [31] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. In IEEE, 1998.
  • [32] Z. Liu, P. Luo, X. Wang, and X. Tang. Deep Learning Face Attributes in the Wild. In ICCV, 2015.
  • [33] M. Lucic, K. Kurach, M. Michalski, S. Gelly, and O. Bousquet. Are GANs Created Equal? A Large-Scale Study. In NeurIPS, 2018.
  • [34] A. Makhzani, J. Shlens, N. Jaitly, I. Goodfellow, and B. Frey. Adversarial autoencoders. In ICLR, 2016.
  • [35] L. Mescheder, A. Geiger, and S. Nowozin. Which Training Methods for GANs do actually Converge? In ICML, 2018.
  • [36] T. Miyato, T. Kataoka, M. Koyama, and Y. Yoshida. Spectral normalization for generative adversarial networks. In ICLR, 2018.
  • [37] B. Neyshabur, R. Tomioka, R. Salakhutdinov, and N. Srebro. Geometry of optimization and implicit regularization in deep learning. arXiv preprint arXiv:1705.03071, 2017.
  • [38] D. J. Rezende, S. Mohamed, and D. Wierstra. Stochastic backpropagation and approximate inference in deep generative models. In ICML, 2014.
  • [39] D. J. Rezende and F. Viola. Taming VAEs. arXiv preprint arXiv:1810.00597, 2018.
  • [40] S. Rifai, Y. Bengio, Y. Dauphin, and P. Vincent. A generative process for sampling contractive auto-encoders. In ICML, 2012.
  • [41] S. Rifai, P. Vincent, X. Muller, X. Glorot, and Y. Bengio. Contractive auto-encoders: Explicit invariance during feature extraction. In ICML, 2011.
  • [42] M. Rosca, B. Lakshminarayanan, and S. Mohamed. Distribution Matching in Variational Inference. arXiv preprint arXiv:1802.06847, 2018.
  • [43] M. S. M. Sajjadi, O. Bachem, M. Lucic, O. Bousquet, and S. Gelly. Assessing Generative Models via Precision and Recall. In NeurIPS, 2018.
  • [44] M. S. M. Sajjadi, G. Parascandolo, A. Mehrjou, and B. Schölkopf. Tempered Adversarial Networks. In ICML, 2018.
  • [45] M. S. M. Sajjadi, B. Schölkopf, and M. Hirsch. EnhanceNet: Single Image Super-Resolution Through Automated Texture Synthesis. In ICCV, 2017.
  • [46] A. Severyn, E. Barth, and S. Semeniuta. A Hybrid Convolutional Variational Autoencoder for Text Generation.
  • [47] J. Sietsma and R. J. Dow. Creating artificial neural networks that generalize. In Neural networks. Elsevier, 1991.
  • [48] K. Sohn, H. Lee, and X. Yan. Learning structured output representation using deep conditional generative models. In NeurIPS, 2015.
  • [49] C. K. Sønderby, J. Caballero, L. Theis, W. Shi, and F. Huszár. Amortised MAP Inference for Image Super-resolution. In ICLR, 2017.
  • [50] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. In JMLR, 2014.
  • [51] L. Theis, A. v. d. Oord, and M. Bethge. A note on the evaluation of generative models. In ICLR, 2016.
  • [52] A. N. Tikhonov and V. I. Arsenin. Solutions of Ill Posed Problems. Vh Winston, 1977.
  • [53] I. Tolstikhin, O. Bousquet, S. Gelly, and B. Schoelkopf. Wasserstein Auto-Encoders. In ICLR, 2017.
  • [54] J. Tomczak and M. Welling. VAE with a VampPrior. In AISTATS, 2018.
  • [55] G. Tucker, A. Mnih, C. J. Maddison, J. Lawson, and J. Sohl-Dickstein. REBAR: Low-variance, unbiased gradient estimates for discrete latent variable models. In NeurIPS, 2017.
  • [56] A. van den Oord, O. Vinyals, et al. Neural discrete representation learning. In NeurIPS, 2017.
  • [57] P. Vincent, H. Larochelle, Y. Bengio, and P.-A. Manzagol. Extracting and composing robust features with denoising autoencoders. In ICML, 2008.
  • [58] S. Zagoruyko and N. Komodakis. Wide residual networks. In BMVC, 2016.
  • [59] C. Zhang, S. Bengio, M. Hardt, B. Recht, and O. Vinyals. Understanding deep learning requires rethinking generalization. In ICLR, 2017.
  • [60] S. Zhao, J. Song, and S. Ermon. Towards deeper understanding of variational autoencoding models. arXiv preprint arXiv:1702.08658, 2017.


Appendix A Network architecture and Training Details

For all experiments, we use the Adam optimizer with a starting learning rate of which is cut in half every time the validation loss plateaus. All models are trained for a maximum of epochs on MNIST and CIFAR and epochs on CelebA. We use mini-batch size of

and pad MNIST digits with zeros to make the size
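The learning-rate rule described above (halve the rate whenever the validation loss plateaus) can be sketched as follows. This is only an illustration under stated assumptions: the class name `HalveOnPlateau`, the `patience` and `min_delta` parameters, and the starting rate used in the usage note are hypothetical, since the text does not specify how a plateau is detected.

```python
class HalveOnPlateau:
    """Halve the learning rate when the validation loss stops improving.

    Assumption: a "plateau" means `patience` consecutive epochs without an
    improvement of at least `min_delta`. The paper does not define this.
    """

    def __init__(self, lr, patience=1, min_delta=0.0):
        self.lr = lr
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        """Record one epoch's validation loss and return the current rate."""
        if val_loss < self.best - self.min_delta:
            # Improvement: remember it and reset the plateau counter.
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
            if self.bad_epochs >= self.patience:
                # Plateau detected: cut the learning rate in half.
                self.lr *= 0.5
                self.bad_epochs = 0
        return self.lr
```

For example, `HalveOnPlateau(lr=1.0, patience=2)` fed the losses `1.0, 0.9, 0.9, 0.9` returns `0.5` on the fourth call, because the loss has then failed to improve for two epochs in a row.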


We use the official train, validation and test splits of CelebA. For MNIST and CIFAR, we set aside 10k train samples for validation. For random sample evaluation, we draw samples from for VAE and WAE-MMD and for all remaining models, samples are drawn from a multivariate Gaussian whose mean and covariance are estimated using training set embeddings. For the GMM density estimation, we also use the training set embeddings for fitting and the validation set embeddings to verify that the GMMs are not overfitting to the training embeddings. However, due to the very low number of mixture components, we did not encounter overfitting at this step. The GMM parameters are estimated by running EM for at most iterations.
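The two ex-post sampling schemes in this paragraph can be sketched as below, assuming NumPy and scikit-learn are available. The function names are hypothetical, `n_components=10` follows the caption of Figure 6, and `max_iter=100` is a placeholder because the iteration limit is not given here. The embeddings are assumed to be arrays of shape `(num_samples, latent_dim)` produced by a trained encoder, which is not shown.

```python
import numpy as np
from sklearn.mixture import GaussianMixture


def sample_from_fitted_gaussian(z_train, n_samples, seed=0):
    """Fit a single multivariate Gaussian to training embeddings and sample."""
    rng = np.random.default_rng(seed)
    mean = z_train.mean(axis=0)
    cov = np.cov(z_train, rowvar=False)  # rows are samples, columns are dims
    return rng.multivariate_normal(mean, cov, size=n_samples)


def fit_gmm(z_train, z_val, n_components=10, max_iter=100, seed=0):
    """Fit a GMM with EM on training embeddings.

    Returns the model plus the average per-sample log-likelihood on the
    training and validation embeddings. A validation value far below the
    training value would hint at overfitting.
    """
    gmm = GaussianMixture(
        n_components=n_components,
        covariance_type="full",
        max_iter=max_iter,  # placeholder; the source value is not given
        random_state=seed,
    )
    gmm.fit(z_train)
    return gmm, gmm.score(z_train), gmm.score(z_val)
```

New latent codes can then be drawn with `z_new, _ = gmm.sample(n)` and passed through the decoder to obtain images.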

For Figures 1 and 2, we used smaller networks due to computational limitations and since only relative performance is of interest in these experiments. It should be noted that as a result of this, the absolute values of the reported FID scores in these figures cannot be directly compared with the numbers reported in Section 7.



represents a convolutional layer with filters. All convolutions and transposed convolutions have a filter size of for MNIST and CIFAR-10 and

for CELEBA. They all have a stride of size 2 except for the last convolutional layer in the decoder. Finally,

for all models except for the VAE which has as the encoder has to produce both mean and variance for each input.

Appendix B Evaluation Setup

We use 10k samples for all FID and PRD evaluations. Scores for random samples are evaluated against the test set. Reconstruction scores are computed from validation set reconstructions against the respective test set. Interpolation scores are computed by interpolating the latent codes of pairs of randomly chosen validation embeddings and evaluating the decoded results against test set samples. The visualized interpolation samples are interpolations between two randomly chosen test set images.
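For readers unfamiliar with FID, the following is a minimal NumPy sketch of the Fréchet distance between two sets of feature vectors. It assumes the features have already been extracted (FID conventionally uses activations of a pretrained Inception network, which is not included here), and the function name is hypothetical. This is not the evaluation code used for the reported numbers.

```python
import numpy as np


def frechet_distance(feats_a, feats_b):
    """Frechet distance between Gaussians fitted to two feature sets.

    d^2 = ||mu_a - mu_b||^2 + Tr(C_a) + Tr(C_b) - 2 Tr((C_a C_b)^{1/2})

    Both inputs have shape (num_samples, feature_dim).
    """
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    cov_a = np.cov(feats_a, rowvar=False)
    cov_b = np.cov(feats_b, rowvar=False)

    # The trace of the matrix square root equals the sum of the square roots
    # of the eigenvalues of C_a C_b, which are real and non-negative for
    # positive semi-definite covariances. Clip small numerical negatives.
    eigvals = np.linalg.eigvals(cov_a @ cov_b)
    trace_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()

    diff = mu_a - mu_b
    return float(diff @ diff + np.trace(cov_a) + np.trace(cov_b) - 2.0 * trace_sqrt)
```

In the setup described above, `feats_a` would hold features of 10k generated, reconstructed or interpolated images and `feats_b` features of test set images.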

Appendix C Evaluation by Precision and Recall

Model     MNIST                        CIFAR-10                     CelebA
          Gaussian      GMM            Gaussian      GMM            Gaussian      GMM
VAE       0.96 / 0.92   0.95 / 0.96    0.25 / 0.55   0.37 / 0.56    0.54 / 0.66   0.50 / 0.66
CV-VAE    0.84 / 0.73   0.96 / 0.89    0.31 / 0.64   0.42 / 0.68    0.25 / 0.43   0.32 / 0.55
WAE       0.93 / 0.88   0.98 / 0.95    0.38 / 0.68   0.51 / 0.81    0.59 / 0.68   0.69 / 0.77
RAE-GP    0.93 / 0.87   0.97 / 0.98    0.36 / 0.70   0.46 / 0.77    0.38 / 0.55   0.44 / 0.67
RAE-L2    0.92 / 0.87   0.98 / 0.98    0.41 / 0.77   0.57 / 0.81    0.36 / 0.64   0.44 / 0.65
RAE-SN    0.89 / 0.95   0.98 / 0.97    0.36 / 0.73   0.52 / 0.81    0.54 / 0.68   0.55 / 0.74
RAE       0.92 / 0.85   0.98 / 0.98    0.45 / 0.73   0.53 / 0.80    0.46 / 0.59   0.52 / 0.69
AE        0.90 / 0.90   0.98 / 0.97    0.37 / 0.73   0.50 / 0.80    0.45 / 0.66   0.47 / 0.71
Table 2: Evaluation of random sample quality by precision / recall [43] (higher numbers are better, best value for each dataset in bold). It is notable that the proposed ex-post density estimation improves not only precision, but also recall throughout the experiment. For example, WAE seems to have a comparably low recall of only 0.88 on MNIST, which is raised considerably to 0.95 by fitting a GMM. In all cases, GMM gives the best results. Another interesting point is the low precision but high recall of all models on CIFAR-10; this is also visible upon inspection of the samples in Fig. 5.

Appendix D More Qualitative Results

Figure 4: Qualitative evaluation for sample quality for VAEs, WAEs and RAEs on MNIST. Left: reconstructed samples (top row is ground truth). Middle: randomly generated samples. Right: spherical interpolations between two images (first and last column).
Figure 5: Qualitative evaluation for sample quality for VAEs, WAEs and RAEs on CIFAR-10. Left: reconstructed samples (top row is ground truth). Middle: randomly generated samples. Right: spherical interpolations between two images (first and last column).
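The spherical interpolations mentioned in the captions of Figures 4 and 5 can be illustrated with the standard slerp formula. The sketch below is an assumption about how such interpolation is typically done, not the authors' code. It expects two non-zero latent vectors and falls back to linear interpolation when they are nearly collinear.

```python
import numpy as np


def slerp(z0, z1, t):
    """Spherical linear interpolation between latent codes z0 and z1.

    t = 0 returns z0 and t = 1 returns z1. Both vectors must be non-zero.
    """
    z0 = np.asarray(z0, dtype=np.float64)
    z1 = np.asarray(z1, dtype=np.float64)

    # Angle between the two vectors.
    cos_omega = np.dot(z0, z1) / (np.linalg.norm(z0) * np.linalg.norm(z1))
    omega = np.arccos(np.clip(cos_omega, -1.0, 1.0))
    sin_omega = np.sin(omega)

    if sin_omega < 1e-8:
        # Nearly collinear vectors: slerp is ill-conditioned, use lerp.
        return (1.0 - t) * z0 + t * z1

    return (np.sin((1.0 - t) * omega) * z0 + np.sin(t * omega) * z1) / sin_omega
```

A row of interpolated images would then come from decoding `slerp(z0, z1, t)` for evenly spaced `t` in `[0, 1]`, where `z0` and `z1` are the encodings of the first and last image.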

Appendix E Investigating Overfitting









Figure 6: Nearest neighbors to generated samples (leftmost image, red box) from training set. It seems that the models have generalized well and fitting only 10 Gaussians to the latent space prevents overfitting.
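A nearest-neighbor check of this kind can be sketched with NumPy as follows. The distance used for Figure 6 is not stated, so the Euclidean distance in pixel space used here is an assumption, and the function name is hypothetical.

```python
import numpy as np


def nearest_training_neighbors(generated, training, k=1):
    """Indices of the k nearest training images for each generated image.

    Assumption: images are compared by squared Euclidean distance on
    flattened pixels. Returns an array of shape (num_generated, k).
    """
    g = np.asarray(generated, dtype=np.float64).reshape(len(generated), -1)
    x = np.asarray(training, dtype=np.float64).reshape(len(training), -1)

    # Pairwise squared distances via ||g||^2 - 2 g.x + ||x||^2.
    d2 = (g ** 2).sum(axis=1)[:, None] - 2.0 * (g @ x.T) + (x ** 2).sum(axis=1)[None, :]

    # Sort each row and keep the k closest training indices.
    return np.argsort(d2, axis=1)[:, :k]
```

If a generated sample were a near copy of a training image, its closest neighbor would be visually almost identical to it. Clearly different neighbors suggest the model has not memorized the training set.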