1 Introduction
Generative modeling lies at the core of machine learning and computer vision. By capturing the mechanisms behind the data generation process, one can reason about data probabilistically, access and traverse the low-dimensional manifold the data is assumed to live on, and ultimately generate new data. It is therefore not surprising that learning generative models has gained momentum in applications like chemistry [25, 16], NLP [46, 8] and computer vision [9, 48].
Variational Autoencoders (VAEs) [27, 38] allow for a principled probabilistic way to model high-dimensional distributions. They do so by casting representation learning as a variational inference problem. Learning a VAE amounts to optimizing an objective that balances the quality of autoencoded samples through a stochastic encoder–decoder pair while encouraging the latent space to follow a fixed prior distribution.
Since their introduction, VAEs have become one of the frameworks of choice for generative modeling, promising theoretically well-founded and more stable training than Generative Adversarial Networks (GANs) [17] and more efficient sampling mechanisms than autoregressive models [30]. Much of the recent literature has focused on applying VAEs to image generation tasks [33, 19, 23] and devising new encoder–decoder architectures [56, 25].
Despite this attention from the community, the VAE framework is still far from delivering the promised generative mechanism in many real-world scenarios. In fact, VAEs tend to generate blurry samples, a condition which has been attributed to using overly simplistic distributions for the prior [54]; restrictiveness of the Gaussian assumption for the stochastic architecture [13]; or over-regularization induced by the KL divergence term in the VAE objective [53] (see Fig. 2). Moreover, the VAE objective itself poses several challenges as it admits trivial solutions that decouple the latent space from the input [11, 60], leading to the posterior collapse phenomenon in conjunction with powerful decoders [56]. Training a VAE requires approximating expectations by sampling at the cost of increased variance in gradients [55, 10], making initialization, validation and annealing of hyperparameters fundamental in practice [8, 5, 21]. Lastly, even after a satisfactory convergence of the objective, the learned aggregated posterior distribution rarely matches the assumed latent prior in practice [26, 5, 13], ultimately hurting the quality of generated samples.
In this work, we tackle these shortcomings by reformulating the VAE into a generative modeling scheme that scales better, is simpler to optimize, and most importantly, produces higher-quality samples. We do so based on the observation that under common distributional assumptions made for VAEs, training a stochastic encoder–decoder pair does not differ in practice from training a deterministic architecture where noise is added to the decoder’s input to enforce a smooth latent space. We investigate how to substitute this noise injection mechanism with other regularization schemes in our deterministic Regularized Autoencoders (RAEs), and we analyze how we can learn a meaningful latent space without forcing it to conform to a given prior distribution. We equip RAEs with a generative mechanism through a simple and efficient density estimation step on the learned latent space, which leads to improved image quality that surpasses VAEs and stronger alternatives such as Wasserstein Autoencoders (WAEs) [53].
In summary, our contributions are as follows:

we introduce the RAE framework for generative modeling,

we propose an ex-post density estimation scheme that greatly improves sample quality for the VAE, WAE and RAE without the need for additional training,

we conduct a rigorous empirical evaluation on several common image datasets (MNIST, CIFAR-10, CelebA), assessing reconstruction, random sample and interpolation quality for VAE, WAE and RAE,

we achieve state-of-the-art FID scores for the above datasets in a non-adversarial setting.
The paper is organized as follows. In Section 2 we introduce the VAE framework and discuss assumptions, practical implementations and limitations, leading to the introduction of our simplified deterministic and regularized framework (Section 3), interpreting explicit regularization as constrained optimization under certain parametric assumptions (Section 4). After discussing density estimation and related works in Sections 5 and 6, we present experiments in Section 7 before we close with our final conclusions.
2 Variational Autoencoders
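For a general discussion, we consider a collection of high-dimensional samples $\mathcal{X} = \{x_i\}_{i=1}^{N}$ drawn from the true data distribution $p_{\text{data}}(x)$ over a random variable $X$ taking values in the input space. The aim of generative modeling is to learn from $\mathcal{X}$ a mechanism to draw new samples $x_{\text{new}} \sim p_{\text{data}}$.
Variational Autoencoders provide a powerful latent variable framework to infer such a mechanism. The generative process of the VAE is defined as
$$z_{\text{new}} \sim p(z), \qquad x_{\text{new}} \sim p_\theta(x \mid z_{\text{new}}) \tag{1}$$
where $p(z)$ is a fixed prior distribution over a low-dimensional latent variable $Z$. A stochastic decoder
$$D_\theta(z) = x \sim p_\theta(x \mid z) = p(X \mid g_\theta(z)) \tag{2}$$
links the latent space to the input space through the likelihood $p_\theta$, where $g_\theta$ is an expressive non-linear function parameterized by $\theta$ (footnote 1). As a result, a VAE estimates $p_{\text{data}}(x)$ as the infinite mixture model $p_\theta(x) = \int p_\theta(x \mid z)\, p(z)\, dz$. At the same time, the input space is mapped to the latent space via a stochastic encoder
$$E_\phi(x) = z \sim q_\phi(z \mid x) = q(Z \mid f_\phi(x)) \tag{3}$$
where $q_\phi(z \mid x)$ is the posterior distribution given by a second function $f_\phi$ parameterized by $\phi$.
Footnote 1: With slight abuse of notation, we use lowercase letters for both random variables and their realizations, e.g., $p_\theta(x \mid z)$ instead of $p(X \mid Z = z)$, when it is clear to discriminate between the two.
Computing the marginal log-likelihood $\log p_\theta(x)$ is generally intractable. We therefore follow a variational approach, maximizing the evidence lower bound (ELBO) for a sample $x$:
$$\log p_\theta(x) \geq \text{ELBO}(\phi, \theta, x) = \mathbb{E}_{z \sim q_\phi(z \mid x)} \log p_\theta(x \mid z) - \mathrm{KL}\big(q_\phi(z \mid x)\,\|\,p(z)\big) \tag{4}$$
Maximizing Eq. 4 over data $\mathcal{X}$ w.r.t. model parameters $\phi$, $\theta$ corresponds to minimizing the loss
$$\arg\min_{\phi, \theta}\; \mathbb{E}_{x \sim p_{\text{data}}}\, \mathcal{L}_{\text{ELBO}} = \mathbb{E}_{x \sim p_{\text{data}}}\, \mathcal{L}_{\text{REC}} + \mathcal{L}_{\text{KL}} \tag{5}$$
where $\mathcal{L}_{\text{REC}}$ and $\mathcal{L}_{\text{KL}}$ are defined for a sample $x$ as follows:
$$\mathcal{L}_{\text{REC}} = -\mathbb{E}_{z \sim q_\phi(z \mid x)} \log p_\theta(x \mid z) \tag{6}$$
$$\mathcal{L}_{\text{KL}} = \mathrm{KL}\big(q_\phi(z \mid x)\,\|\,p(z)\big) \tag{7}$$
Intuitively, the reconstruction loss $\mathcal{L}_{\text{REC}}$ takes into account the quality of autoencoded samples through $D_\theta(E_\phi(x))$, while the KL-divergence term $\mathcal{L}_{\text{KL}}$ encourages $q_\phi(z \mid x)$ to match the prior $p(z)$ for each $x$, which acts as a regularizer during training [22]. For the purpose of generating high-quality samples, a balance between these two loss terms must be found during training, see Fig. 2.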
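As a brief worked step (standard variational inference, not specific to this paper), the lower bound follows from the identity below. The symbols are assumptions chosen to match common VAE notation: $q_\phi(z \mid x)$ for the encoder distribution, $p_\theta(x \mid z)$ for the decoder likelihood, $p(z)$ for the prior and $p_\theta(z \mid x)$ for the true posterior.

```latex
\begin{align*}
\log p_\theta(x)
  &= \mathbb{E}_{z \sim q_\phi(z \mid x)}\!\left[\log \frac{p_\theta(x \mid z)\, p(z)}{q_\phi(z \mid x)}\right]
   + \mathrm{KL}\!\left(q_\phi(z \mid x)\,\|\,p_\theta(z \mid x)\right) \\
  &= \underbrace{\mathbb{E}_{z \sim q_\phi(z \mid x)}\!\left[\log p_\theta(x \mid z)\right]
   - \mathrm{KL}\!\left(q_\phi(z \mid x)\,\|\,p(z)\right)}_{\text{ELBO}}
   + \underbrace{\mathrm{KL}\!\left(q_\phi(z \mid x)\,\|\,p_\theta(z \mid x)\right)}_{\geq\, 0}
\end{align*}
```

Since the last term is non-negative, the ELBO is a lower bound on $\log p_\theta(x)$, and it is tight exactly when the encoder distribution equals the true posterior.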
2.1 Practice and shortcomings of VAEs
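To fit a VAE to data through Eq. 5 one has to specify the parametric forms for $p(z)$, $q_\phi(z \mid x)$, $p_\theta(x \mid z)$, and hence the deterministic mappings $f_\phi$ and $g_\theta$. In practice, the choice for the above distributions is guided by trading off computational complexity with model expressiveness.
In the most common formulation of the VAE, $q_\phi(z \mid x)$ and $p_\theta(x \mid z)$ are assumed to be Gaussian
$$E_\phi(x) \sim \mathcal{N}\big(Z \mid \mu_\phi(x), \operatorname{diag}(\sigma_\phi(x)^2)\big) \tag{8}$$
$$D_\theta(E_\phi(x)) \sim \mathcal{N}\big(X \mid \mu_\theta(z), \operatorname{diag}(\sigma_\theta(z)^2)\big) \tag{9}$$
with means $\mu_\phi$, $\mu_\theta$ and covariance parameters $\sigma_\phi$, $\sigma_\theta$ given by $f_\phi$ and $g_\theta$. In practical implementations, the covariance of the decoder is set to the identity matrix for all $z$, i.e., $\sigma_\theta(z) = 1$ [13]. The expectation of $\mathcal{L}_{\text{REC}}$ in Eq. 6 is then approximated via $k$ Monte Carlo point estimates. We find clear evidence that larger values of $k$ lead to improvements in training as shown in Fig. 1. Nevertheless, only a 1-sample approximation is carried out in practice [27] since memory and computation requirements scale linearly with $k$. With this approximation, the computation of $\mathcal{L}_{\text{REC}}$ is given by the mean squared error between input samples and their mean reconstructions $\mu_\theta$ through the deterministic decoder:
$$\mathcal{L}_{\text{REC}} = \|x - \mu_\theta(E_\phi(x))\|_2^2 \tag{10}$$
Gradients w.r.t. the encoder parameters $\phi$ are computed through the expectation of $\mathcal{L}_{\text{REC}}$ in Eq. 6 via the reparametrization trick [27] where the stochasticity of $E_\phi$ is relegated to an auxiliary random variable $\epsilon$ which does not depend on $\phi$:
$$E_\phi(x) = \mu_\phi(x) + \sigma_\phi(x) \odot \epsilon, \qquad \epsilon \sim \mathcal{N}(0, I) \tag{11}$$
where $\odot$ denotes the Hadamard product. An additional simplifying assumption involves fixing the prior $p(z)$ to be a $d$-dimensional isotropic Gaussian $\mathcal{N}(Z \mid 0, I)$. For this choice, the KL-divergence for a sample $x$ is given in closed form:
$$2\,\mathcal{L}_{\text{KL}} = \|\mu_\phi(x)\|_2^2 - d + \sum_{i=1}^{d} \big(\sigma_\phi(x)_i^2 - \log \sigma_\phi(x)_i^2\big) \tag{12}$$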
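The reparametrization trick and the closed-form Gaussian KL term can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the names `reparametrize` and `kl_to_standard_normal` are hypothetical, and `sigma` is assumed to hold per-dimension standard deviations of a diagonal Gaussian encoder.

```python
import numpy as np

def reparametrize(mu, sigma, rng):
    """Draw z = mu + sigma * eps with eps ~ N(0, I); sigma holds standard deviations."""
    eps = rng.standard_normal(mu.shape)
    return mu + sigma * eps

def kl_to_standard_normal(mu, sigma):
    """Closed-form KL( N(mu, diag(sigma^2)) || N(0, I) ), one value per sample."""
    var = sigma ** 2
    return 0.5 * np.sum(mu ** 2 + var - np.log(var) - 1.0, axis=-1)

rng = np.random.default_rng(0)
# Two toy "encoder outputs" in a 2-dimensional latent space (assumed values).
mu = np.array([[0.0, 0.0], [1.0, -2.0]])
sigma = np.array([[1.0, 1.0], [0.5, 2.0]])

z = reparametrize(mu, sigma, rng)
kl = kl_to_standard_normal(mu, sigma)
print(z.shape, kl)
```

The first row already equals the prior, so its KL term vanishes; the second row is penalized for both its shifted mean and its non-unit variances.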
While the chosen assumptions make VAEs easy to implement, the stochasticity in the encoder and decoder has been deemed responsible for the “blurriness” in VAE samples [34, 53, 13]. Furthermore, the optimization problem in Eq. 5 presents some further challenges. Imposing a strong weight on the $\mathcal{L}_{\text{KL}}$ term during optimization can dominate $\mathcal{L}_{\text{REC}}$, having the effect of over-regularization which leads to blurred samples, see Fig. 2. Heuristics to avoid this include gradually annealing the importance of $\mathcal{L}_{\text{KL}}$ during training [8, 5] and manually fine-tuning the balance between the losses.
Even after employing the full array of approximations and “tricks” to reach convergence of Eq. 5 for a satisfactory set of parameters, there is no guarantee that the learned latent space is distributed according to the assumed prior distribution. In other words, the aggregated posterior distribution $q_\phi(z) = \mathbb{E}_{x \sim p_{\text{data}}}\, q_\phi(z \mid x)$ has been shown not to conform well to $p(z)$ after training [53, 5, 13]. This critical issue severely hinders the generative mechanism of a VAE (Eq. 1) since latent codes sampled from $p(z)$ (instead of $q_\phi(z)$) might lead to regions of the latent space that are previously unseen to $D_\theta$ during training. The result is blurry out-of-distribution samples. We analyze solutions to this problem in Section 5.
2.2 Constant-Variance Encoders
Analogously to what is generally done for decoders, we also investigate fixing the variance of $q_\phi(z \mid x)$ to be constant for all $x$. This simplifies the computation of $E_\phi$ from Eq. 11 to
$$E_\phi^{\text{CV}}(x) = \mu_\phi(x) + \epsilon, \qquad \epsilon \sim \mathcal{N}(0, \sigma^2 I) \tag{13}$$
where $\sigma$ is a fixed scalar. At the same time, the KL loss term in Eq. 12 simplifies (up to constants) to
$$\mathcal{L}_{\text{KL}}^{\text{CV}} = \|\mu_\phi(x)\|_2^2 \tag{14}$$
Constant-Variance VAEs (CV-VAEs) have been previously applied in adversarial robustness [15] and variational image compression [4] but, to the best of our knowledge, there is no systematic study of CV-VAEs in the literature. As noted in [15], treating $\sigma$ as a constant impairs the assumption of $p(z)$ to be an isotropic Gaussian, which demands a more complex prior structure over $Z$. We address this mismatch in Sec. 5.
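To make the "up to constants" step explicit, the short derivation below starts from the standard closed-form KL divergence between a diagonal Gaussian and an isotropic Gaussian prior and sets every standard deviation to the same fixed scalar $\sigma$. The notation ($\mu_\phi$, $\sigma$, latent dimension $d$) is an assumption following common VAE conventions.

```latex
\begin{align*}
2\,\mathrm{KL}\!\left(\mathcal{N}(\mu_\phi(x), \sigma^2 I)\,\|\,\mathcal{N}(0, I)\right)
  &= \|\mu_\phi(x)\|_2^2 - d + \sum_{i=1}^{d}\left(\sigma^2 - \log \sigma^2\right) \\
  &= \|\mu_\phi(x)\|_2^2 + \underbrace{d\left(\sigma^2 - \log \sigma^2 - 1\right)}_{\text{independent of } x,\ \phi,\ \theta}
\end{align*}
```

The second term does not depend on the data or on any trainable parameter, so only the squared norm of the encoder mean influences the gradients.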
3 Deterministic Regularized Autoencoders
As described in Section 2, autoencoding in VAEs is defined in a probabilistic fashion: $E_\phi$ and $D_\theta$ map data points not to a single point, but rather to parameterized distributions as shown in Equations 8 and 9. However, the practical implementation of the VAE admits a deterministic view for this probabilistic mechanism.
A glance at the autoencoding mechanism of the VAE is revealing. The encoder maps a data point $x$ to a mean $\mu_\phi(x)$ and variance parameter $\sigma_\phi(x)$ in the latent space via the reparametrization trick given in Eq. 11. The input to the decoder is then simply the mean $\mu_\phi(x)$ augmented with random Gaussian noise scaled by $\sigma_\phi(x)$. In the CV-VAE, this relationship is even more obvious, as the magnitude of the noise is fixed for all data points (Eq. 13). In this light, a VAE can be seen as a deterministic autoencoder where Gaussian noise is added to the decoder’s input.
Using random noise injection to regularize neural networks during training is a well-known technique that dates back several decades [47, 2]. The addition of noise implicitly smooths the function learned by the network. Since this procedure also adds noise to the gradients, we propose to substitute noise injection with an explicit regularization scheme for the decoder network. Note that from a generative perspective, this is motivated by the goal to learn a smooth latent space where similar data points $x$ are mapped to similar latent codes $z$, and small variations in $Z$ lead to reconstructions by $D_\theta$ that vary only slightly.
By removing the noise injection mechanism from the CV-VAE, we are effectively left with a deterministic Regularized Autoencoder (RAE) that can be coupled with any type of explicit regularization for the decoder to enforce a smooth latent space. Training a RAE thus involves minimizing the simplified loss
$$\mathcal{L}_{\text{RAE}} = \mathcal{L}_{\text{REC}} + \beta\, \mathcal{L}_{Z}^{\text{RAE}} + \lambda\, \mathcal{L}_{\text{REG}} \tag{15}$$
where $\mathcal{L}_{\text{REG}}$ represents the explicit regularizer for $D_\theta$ (see Sec. 3.1) and $\mathcal{L}_{Z}^{\text{RAE}} = \|z\|_2^2$ from Eq. 14 is equivalent to constraining the size of the learned latent space. Note that for RAEs, no sampling approximation of $\mathcal{L}_{\text{REC}}$ is required, thus relieving the need for more samples from $q_\phi(z \mid x)$ to achieve better image quality (see Fig. 1).
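The structure of such a loss (reconstruction error, a penalty on the size of the latent code, and an explicit decoder regularizer) can be sketched as follows. This is a toy illustration under stated assumptions: the function name `rae_loss`, the linear encoder and decoder, and the weights `beta` and `lam` are all hypothetical and not taken from the paper.

```python
import numpy as np

def rae_loss(x, encode, decode, reg_value, beta=1e-4, lam=1e-3):
    """Reconstruction + beta * ||z||^2 + lam * regularizer, averaged over the batch.

    beta and lam are placeholder weights chosen only for illustration.
    """
    z = encode(x)            # deterministic code, no sampling involved
    x_hat = decode(z)
    l_rec = np.mean(np.sum((x - x_hat) ** 2, axis=-1))
    l_z = np.mean(np.sum(z ** 2, axis=-1))
    total = l_rec + beta * l_z + lam * reg_value
    return total, l_rec, l_z

rng = np.random.default_rng(0)
# Toy linear encoder/decoder standing in for deep networks (assumption).
W_enc = rng.standard_normal((8, 2)) * 0.1
W_dec = rng.standard_normal((2, 8)) * 0.1
x = rng.standard_normal((16, 8))

# Example regularizer: squared L2 norm of the decoder weights (weight decay).
l2_reg = np.sum(W_dec ** 2)
total, l_rec, l_z = rae_loss(x, lambda a: a @ W_enc, lambda z: z @ W_dec, l2_reg)
print(total, l_rec, l_z)
```

Because the code `z` is computed deterministically, a single forward pass per sample suffices; no Monte Carlo estimate of the reconstruction term is needed.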
3.1 Regularization Schemes for RAEs
Among possible choices for a mechanism to use for $\mathcal{L}_{\text{REG}}$, a first obvious candidate is Tikhonov regularization [52] since it is known to be related to the addition of low-magnitude input noise [7]. Training a RAE within this framework thus amounts to adopting
$$\mathcal{L}_{\text{REG}} = \mathcal{L}_{L_2} = \|\theta\|_2^2 \tag{16}$$
which effectively applies weight decay on the decoder parameters $\theta$.
Another option comes from the recent GAN literature, where regularization is a hot topic [29] and where injecting noise into the input of the adversarial discriminator has led to improved performance in a technique called instance noise [49]. To enforce Lipschitz continuity on adversarial discriminators, weight clipping has been proposed [3], which is however known to significantly slow down training. More successfully, a gradient penalty on the discriminator can be used similar to [18, 35], yielding the objective
$$\mathcal{L}_{\text{REG}} = \mathcal{L}_{\text{GP}} = \|\nabla D_\theta(E_\phi(x))\|_2^2 \tag{17}$$
which encourages a small norm of the gradient of the decoder w.r.t. its input.
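A gradient penalty of this kind can be illustrated on a one-layer decoder whose Jacobian is available analytically. The sketch below is an assumption-laden toy: the tanh decoder and the function names are hypothetical, and real implementations would typically obtain the gradient through automatic differentiation rather than by hand.

```python
import numpy as np

def decoder(z, W, b):
    """Toy one-layer decoder (assumption): D(z) = tanh(W z + b)."""
    return np.tanh(W @ z + b)

def decoder_jacobian(z, W, b):
    """Analytic Jacobian dD/dz of the tanh layer above."""
    h = np.tanh(W @ z + b)
    return (1.0 - h ** 2)[:, None] * W

def gradient_penalty(z, W, b):
    """Squared Frobenius norm of the decoder Jacobian at z."""
    J = decoder_jacobian(z, W, b)
    return np.sum(J ** 2)

def numerical_jacobian(z, W, b, h=1e-6):
    """Central finite differences, used only to sanity-check the analytic form."""
    J = np.zeros((W.shape[0], z.shape[0]))
    for j in range(z.shape[0]):
        dz = np.zeros_like(z)
        dz[j] = h
        J[:, j] = (decoder(z + dz, W, b) - decoder(z - dz, W, b)) / (2.0 * h)
    return J

rng = np.random.default_rng(0)
W = rng.standard_normal((5, 3))
b = rng.standard_normal(5)
z = rng.standard_normal(3)
penalty = gradient_penalty(z, W, b)
print(penalty)
```

Minimizing this quantity at the encoded training points discourages the decoder output from changing sharply under small perturbations of the latent code.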
Additionally, spectral normalization (SN) has been proposed as an alternative way to bound the Lipschitz norm of an adversarial discriminator, showing very promising results for GAN training [36]. SN normalizes the weight matrix $\theta_\ell$ for each layer $\ell$ in the decoder by dividing it by an estimate of its largest singular value:
$$\theta_\ell^{\text{SN}} = \theta_\ell / s(\theta_\ell) \tag{18}$$
where $s(\theta_\ell)$ is the current estimate obtained through the power method.
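The power method mentioned above can be sketched with NumPy as follows. The function names and the fixed iteration count are assumptions for illustration; practical spectral normalization typically keeps a running estimate and performs only a few iterations per training step.

```python
import numpy as np

def largest_singular_value(W, n_iters=50, seed=0):
    """Estimate the largest singular value of W by power iteration."""
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(W.shape[1])
    v /= np.linalg.norm(v)
    for _ in range(n_iters):
        u = W @ v
        u /= np.linalg.norm(u)
        v = W.T @ u
        v /= np.linalg.norm(v)
    return float(u @ W @ v)

def spectral_normalize(W, n_iters=50):
    """Divide W by the estimate of its largest singular value."""
    return W / largest_singular_value(W, n_iters)

# Small example weight matrix (assumed values).
W = np.array([[3.0, 1.0], [1.0, 2.0], [0.0, 1.0]])
s_est = largest_singular_value(W)
W_sn = spectral_normalize(W)
print(s_est)
```

After normalization the layer's largest singular value is approximately one, which bounds the Lipschitz constant of the linear map by one.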
Lastly, in light of recent success stories of deep neural networks without explicit regularization achieving state-of-the-art results [59, 58], it is intriguing to question the need to explicitly regularize the decoder in order to obtain a meaningful latent space. The assumption here is that techniques such as dropout [50], batch normalization [24], adding noise during training [2], or early stopping, in conjunction with novel architectural developments, implicitly regularize the networks enough to smoothen the latent space. Therefore, as a natural baseline to the objectives introduced above, we also consider the RAE framework without $\mathcal{L}_{\text{REG}}$ and $\mathcal{L}_{Z}^{\text{RAE}}$, i.e., a standard deterministic autoencoder optimizing $\mathcal{L}_{\text{REC}}$ only.
4 A Probabilistic Derivation of Smoothing
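In this section, we propose an alternative view on enforcing smoothness on the output of $D_\theta$ by augmenting the ELBO optimization problem for VAEs with an explicit constraint. While we keep the Gaussianity assumptions over a stochastic $D_\theta$ and $p(z)$ for convenience, we are not fixing a parametric form for $q_\phi(z \mid x)$ yet. We will then discuss how some parametric restrictions over $q_\phi(z \mid x)$ indeed lead to a variation of the RAE framework in Eq. 15, specifically as the introduction of $\mathcal{L}_{\text{GP}}$ as a regularizer of a deterministic version of the CV-VAE.
To start, we can augment the minimization in Eq. 5 as:
$$\arg\min_{\phi, \theta}\; \mathbb{E}_{x \sim p_{\text{data}}}\, \mathcal{L}_{\text{REC}} + \mathcal{L}_{\text{KL}} \quad \text{s.t.} \quad \|D_\theta(z_1) - D_\theta(z_2)\|_p < \epsilon \quad \forall\, z_1, z_2 \sim q_\phi(z \mid x),\; \forall\, x \sim p_{\text{data}} \tag{19}$$
where $D_\theta(z) = \mu_\theta(E_\phi(x))$ and the constraint on the decoder encodes that the output has to vary, in the sense of an $L_p$ norm, only by a small amount $\epsilon$ for any two possible draws from the encoder. Using the mean value theorem, there exists a $\tilde{z} \in Z$ such that the left term in the constraint can be bounded as:
$$\|D_\theta(z_1) - D_\theta(z_2)\|_p = \|\nabla D_\theta(\tilde{z}) \cdot (z_1 - z_2)\|_p \leq \sup_{z}\{\|\nabla D_\theta(z)\|_p\} \cdot \sup\{\|z_1 - z_2\|_p\} \tag{20}$$
where we take the supremum of possible gradients of $D_\theta$ as well as the supremum of a measure of the support of $q_\phi(z \mid x)$. From this form of the smoothness constraint, it is apparent why the choice of a parametric form for $q_\phi(z \mid x)$ can be impactful during training. For a compactly supported isotropic PDF $q_\phi(z \mid x)$, the extension of the support $\sup\{\|z_1 - z_2\|_p\}$ would be dependent on $\mathbb{H}(q_\phi(z \mid x))$, the entropy of $q_\phi(z \mid x)$, through some functional $r$. For instance, a uniform posterior over a hypersphere in $\mathbb{R}^d$ would ascertain $r(\mathbb{H}(q_\phi(z \mid x))) \cong e^{\mathbb{H}(q_\phi(z \mid x))/d}$ where $d$ is the dimensionality of the latent space.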
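The hypersphere example can be worked out explicitly. The derivation below is a sketch under the assumption that the posterior is uniform over a ball of radius $R$ in $\mathbb{R}^d$ and that distances are measured in the Euclidean norm; the symbols $R$, $c_d$ and $\mathbb{H}$ are introduced here for illustration.

```latex
\begin{align*}
\mathrm{Vol}(B_R) &= c_d\, R^d, \qquad c_d = \frac{\pi^{d/2}}{\Gamma\!\left(\tfrac{d}{2} + 1\right)} \\
\mathbb{H} &= \log \mathrm{Vol}(B_R) = \log c_d + d \log R
  \quad\Longrightarrow\quad R = c_d^{-1/d}\, e^{\mathbb{H}/d} \\
\sup_{z_1, z_2 \in B_R} \|z_1 - z_2\|_2 &= 2R = 2\, c_d^{-1/d}\, e^{\mathbb{H}/d} \;\propto\; e^{\mathbb{H}/d}
\end{align*}
```

The diameter of the support therefore grows exponentially with the entropy divided by the latent dimension, which is the dependence on entropy that the smoothness bound exploits.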
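Intuitively, one would look for parametric distributions that do not favor overfitting, e.g., degenerating into Dirac deltas (minimal entropy and support) along any dimension. To this end, an isotropic $q_\phi(z \mid x)$ would favor such robustness against decoder overfitting. We can now rewrite the constraint as
$$\|\nabla D_\theta(z)\|_p \cdot r(\mathbb{H}(q_\phi(z \mid x))) < \epsilon \tag{21}$$
The $\mathcal{L}_{\text{KL}}$ term can be expressed in terms of $\mathbb{H}(q_\phi(z \mid x))$ by decomposing it as $\mathcal{L}_{\text{KL}} = \mathcal{L}_{\text{CE}} - \mathcal{L}_{\text{H}}$, where $\mathcal{L}_{\text{H}} = \mathbb{H}(q_\phi(z \mid x))$ and $\mathcal{L}_{\text{CE}} = \mathbb{H}(q_\phi(z \mid x), p(z))$ represents a cross-entropy term. Therefore, the constrained problem in Eq. 19 can be written in a Lagrangian formulation by including Eq. 21:
$$\arg\min_{\phi, \theta}\; \mathbb{E}_{x \sim p_{\text{data}}}\, \mathcal{L}_{\text{REC}} + \mathcal{L}_{\text{CE}} - \mathcal{L}_{\text{H}} + \lambda\, \mathcal{L}_{\text{LANG}} \tag{22}$$
where $\mathcal{L}_{\text{LANG}} = \|\nabla D_\theta(z)\|_p \cdot r(\mathbb{H}(q_\phi(z \mid x)))$. We argue that a reasonable simplifying assumption for $q_\phi(z \mid x)$ is to fix $\mathbb{H}(q_\phi(z \mid x))$ to a single constant for all samples $x$. Intuitively, this can be understood as fixing the variance in $q_\phi(z \mid x)$ as we did for the CV-VAE in Sec. 2.2. With this simplification, Eq. 22 further reduces to
$$\arg\min_{\phi, \theta}\; \mathbb{E}_{x \sim p_{\text{data}}}\, \mathcal{L}_{\text{REC}} + \mathcal{L}_{\text{CE}} + \lambda\, \|\nabla D_\theta(z)\|_p \tag{23}$$
We can see that $\|\nabla D_\theta(z)\|_p$ turns out to be the gradient penalty $\mathcal{L}_{\text{GP}}$ and $\mathcal{L}_{\text{CE}} = \|z\|_2^2$ corresponds to $\mathcal{L}_{Z}^{\text{RAE}}$, thus recovering our RAE framework as presented in Eq. 15.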
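The decomposition of the KL term into cross-entropy minus entropy, and the reason the cross-entropy reduces to a squared norm for an isotropic Gaussian prior, can be spelled out as below. This is a standard identity; the symbols $q$ for the encoder distribution and $p(z) = \mathcal{N}(0, I)$ in $d$ dimensions are assumptions made for this sketch.

```latex
\begin{align*}
\mathrm{KL}(q \,\|\, p)
  &= \int q(z) \log \frac{q(z)}{p(z)}\, dz
   = \underbrace{-\int q(z) \log p(z)\, dz}_{\mathbb{H}(q,\, p)}
   \;-\; \underbrace{\left(-\int q(z) \log q(z)\, dz\right)}_{\mathbb{H}(q)} \\
\mathbb{H}(q, p)
  &= \mathbb{E}_{z \sim q}\!\left[\tfrac{1}{2}\|z\|_2^2\right] + \tfrac{d}{2}\log(2\pi)
  \qquad \text{for } p(z) = \mathcal{N}(0, I)
\end{align*}
```

Up to a constant offset and scale, the cross-entropy against a standard normal prior is thus the expected squared norm of the latent code, which is why it plays the role of a latent-size penalty once the entropy is held fixed.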
5 Ex-Post Density Estimation
By removing stochasticity and, ultimately, the KL divergence term from RAEs, we have simplified the original VAE objective at the cost of detaching the encoder from the prior $p(z)$ over the latent space. This implies i) we cannot ensure that the latent space $Z$ is distributed according to a simple distribution anymore (e.g., isotropic Gaussian) and consequently, ii) we lose the simple mechanism provided by $p(z)$ to sample from $Z$ as in Eq. 1.
As discussed in Section 2.1, issue i) is compromising the VAE framework in any case in practice as reported in several works [14, 42, 22]. To fix this, some works extend the VAE objective by encouraging the aggregated posterior to match $p(z)$ [53] or by utilizing more complex priors [26, 5, 54].
To overcome both i) and ii), we instead propose to employ ex-post density estimation over $Z$. We fit a density estimator denoted as $q_\delta(z)$ to $\{z = E_\phi(x) \mid x \in \mathcal{X}\}$. This simple approach not only fits our RAE framework well, but it can also be readily adopted for any VAE or variants thereof such as the WAE as a practical remedy to the aggregated posterior mismatch without adding any computational overhead to the costly training phase.
The choice of $q_\delta(z)$ needs to trade off expressiveness – to provide a good fit of an arbitrary space for $Z$ – with simplicity – to improve generalization. Indeed, placing a Dirac distribution on each latent point $z = E_\phi(x)$ would allow the decoder to output only training sample reconstructions. Striving for simplicity and in order to show the effectiveness of the proposed ex-post density estimation scheme, we compare a full-covariance multivariate Gaussian with a 10-component Gaussian mixture model (GMM) in our experiments.
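The simplest of the two estimators, a full-covariance Gaussian fit to the latent codes after training, can be sketched as below. This is a hedged illustration: the function names are hypothetical and the array `codes` is synthetic stand-in data rather than real encoder outputs.

```python
import numpy as np

def fit_gaussian(codes):
    """Fit a full-covariance Gaussian to latent codes of shape (n_samples, dim)."""
    mean = codes.mean(axis=0)
    cov = np.cov(codes, rowvar=False)
    return mean, cov

def sample_gaussian(mean, cov, n, rng):
    """Draw n new latent codes from the fitted Gaussian."""
    return rng.multivariate_normal(mean, cov, size=n)

rng = np.random.default_rng(0)
# Synthetic stand-in for encoder outputs on the training set (assumption).
mixing = np.array([[2.0, 0.0, 0.0], [0.5, 1.0, 0.0], [0.0, 0.3, 0.2]])
codes = rng.standard_normal((2000, 3)) @ mixing + np.array([1.0, -1.0, 0.5])

mean, cov = fit_gaussian(codes)
z_new = sample_gaussian(mean, cov, 5, rng)
print(z_new.shape)
```

New images would then be produced by passing `z_new` through the trained decoder. A mixture model could be fit to the same codes in the same post-hoc fashion, for instance with scikit-learn's `GaussianMixture`, without retraining the autoencoder.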
6 Related works
Many works have focused on diagnosing the VAE framework, the terms in its objective [1, 22, 60], and ultimately augmenting it to solve optimization issues [39, 13]. With RAE, we argue that a simpler deterministic framework can be competitive for generative modeling.
Deterministic denoising [57] and contractive autoencoders (CAEs) [41] have received attention in the past for their ability to capture a smooth data manifold. Heuristic attempts to equip them with a generative mechanism include MCMC schemes [40, 6]. However, they are hard to diagnose for convergence, require a considerable effort in tuning [12], and have not scaled beyond MNIST, leading to them being superseded by VAEs. While the proposed RAE is similar in spirit, its regularization requires much less computational effort than computing the Jacobian for CAEs [41].
Approaches to cope with the aggregated posterior mismatch involve fixing a more expressive form for $p(z)$ [26, 5], therefore altering the VAE objective and requiring considerable additional computational effort. Estimating the latent space of a VAE with a second VAE [13] reintroduces many of the optimization shortcomings discussed for VAEs and is much more expensive in practice compared to fitting a simple $q_\delta(z)$ after training.
GANs [17] have received widespread attention for their ability to produce sharp samples. Despite theoretical and practical advances [3], the training procedure of GANs is still unstable, sensitive to hyperparameters, and prone to the mode collapse problem [33, 43, 44].
Adversarial Autoencoders (AAE) [34] add a discriminator to a deterministic encoder–decoder pair, leading to sharper samples at the expense of higher computational overhead and the introduction of instabilities caused by the adversarial nature of the training process. Wasserstein Autoencoders (WAE) [53] have been introduced as a generalization of AAEs by casting autoencoding as an optimal transport (OT) problem. Both stochastic and deterministic models can be trained by minimizing a relaxed OT cost function employing either an adversarial loss term or the maximum mean discrepancy score between $p(z)$ and $q_\phi(z)$ as a regularizer in place of $\mathcal{L}_{\text{KL}}$.
Within the RAE framework, we look at this problem from a different perspective: instead of explicitly imposing a simple structure on $Z$ that might impair the ability to fit high-dimensional data during training, we propose to model the latent space by an ex-post density estimation step.
[Figure: qualitative comparison. Columns: Reconstructions, Random Samples, Interpolations. Rows: GT, VAE, CV-VAE, WAE, RAE-GP, RAE-L2, RAE-SN, RAE, AE.]
7 Experiments
In this section, we investigate the performance of several VAE and RAE variants on MNIST [31], CIFAR-10 [28] and CelebA [32]. We measure three qualities for each model: held-out sample reconstruction quality, random sample quality, and interpolation quality. While reconstructions give us a lower bound on the best quality achievable by the generative model, random sample quality indicates how well the model generalizes. Finally, interpolation quality sheds light on the structure of the learned latent space.
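Table 1: FID scores for reconstructions (Rec.), random samples drawn from a single Gaussian (N) or from a 10-component GMM (GMM), and interpolations (Interp.).
Model  |  MNIST Rec.  |  MNIST N  |  MNIST GMM  |  MNIST Interp.  |  CIFAR Rec.  |  CIFAR N  |  CIFAR GMM  |  CIFAR Interp.  |  CelebA Rec.  |  CelebA N  |  CelebA GMM  |  CelebA Interp.
VAE  |  18.26  |  19.21  |  17.66  |  18.21  |  57.94  |  106.37  |  103.78  |  88.62  |  39.12  |  48.12  |  45.52  |  44.49
CV-VAE  |  15.15  |  33.79  |  17.87  |  25.12  |  37.74  |  94.75  |  86.64  |  69.71  |  40.41  |  48.87  |  49.30  |  44.96
WAE  |  10.03  |  20.42  |  9.39  |  14.34  |  35.97  |  117.44  |  93.53  |  76.89  |  34.81  |  53.67  |  42.73  |  40.93
RAE-GP  |  14.04  |  22.21  |  11.54  |  15.32  |  32.17  |  83.05  |  76.33  |  64.08  |  39.71  |  116.30  |  45.63  |  47.00
RAE-L2  |  10.53  |  22.22  |  8.69  |  14.54  |  32.24  |  80.80  |  74.16  |  62.54  |  43.52  |  51.13  |  47.97  |  45.98
RAE-SN  |  15.65  |  19.67  |  11.74  |  15.15  |  27.61  |  84.25  |  75.30  |  63.62  |  36.01  |  44.74  |  40.95  |  39.53
RAE  |  11.67  |  23.92  |  9.81  |  14.67  |  29.05  |  83.87  |  76.28  |  63.27  |  40.18  |  48.20  |  44.68  |  43.67
AE  |  12.95  |  58.73  |  10.66  |  17.12  |  30.52  |  84.74  |  76.47  |  61.57  |  40.79  |  127.85  |  45.10  |  50.94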
7.1 Models
We compare the proposed RAE model with the gradient penalty (RAE-GP), with weight decay (RAE-L2), and with spectral normalization (RAE-SN). Additionally, we consider two models for which we either add only the latent code regularizer $\mathcal{L}_{Z}^{\text{RAE}}$ to $\mathcal{L}_{\text{REC}}$ (RAE), or no explicit regularization at all (AE). As baselines, we further employ a regular VAE, the constant-variance VAE (CV-VAE) for comparison, and finally, a Wasserstein Autoencoder (WAE) with the MMD loss as a state-of-the-art alternative.
Aiming for a fair comparison, we employ the same network architecture for all models. We largely follow the models adopted in [53] with the difference that we consistently apply batch normalization [24] for all models as we found it to improve performance across the range. The latent space dimension is 16 for MNIST, 128 for CIFAR10 and 64 for CelebA. Further details about the network architecture and training procedure are given in Appendix A.
7.2 Evaluation
The evaluation of generative models is a non-trivial research question [51, 45, 33]. Since we are interested in the quality of samples, the ubiquitous Fréchet Inception Distance (FID) [20] is a reasonable choice for comparing different models. More recently, a notion of precision and recall for distributions (PRD) has been proposed [43], separating sample quality from diversity. We report the popular FID scores in this section and we provide PRD scores in Appendix C.
We compute the FID of the reconstructions of random validation samples against the test set to evaluate reconstruction quality. For evaluating generative modeling capabilities, we compute the FID between the test data and randomly drawn samples from a single Gaussian that is either the isotropic $p(z)$ available for VAEs and WAEs, or a single Gaussian fit to the latent codes for CV-VAEs and RAEs. For all models, we also evaluate random samples from a 10-component Gaussian mixture model (GMM) fit to the latent codes. Using only 10 components prevents us from overfitting (which would indeed give good FIDs when compared with the test set). We note that fitting GMMs with up to 100 components only improved results marginally. Additionally, we provide nearest neighbours from the training set in Appendix E to show that the models are not overfitting. For interpolations, we report the FID for the furthest interpolation points obtained by applying spherical interpolation to randomly selected validation reconstruction pairs.
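Two ingredients of this evaluation protocol can be sketched compactly: the Fréchet distance between two Gaussians, which underlies the FID, and spherical interpolation between latent codes. The sketch below is an assumption-based illustration with hypothetical function names; an actual FID computation would first extract Inception network features from real and generated images and estimate their means and covariances.

```python
import numpy as np

def frechet_distance(mu1, cov1, mu2, cov2):
    """||mu1 - mu2||^2 + Tr(cov1 + cov2 - 2 (cov1 cov2)^(1/2)) for two Gaussians."""
    diff = mu1 - mu2
    # For PSD inputs the eigenvalues of cov1 @ cov2 are real and non-negative;
    # clip small negative or imaginary parts caused by round-off.
    eigvals = np.linalg.eigvals(cov1 @ cov2)
    tr_sqrt = np.sum(np.sqrt(np.clip(eigvals.real, 0.0, None)))
    return float(diff @ diff + np.trace(cov1) + np.trace(cov2) - 2.0 * tr_sqrt)

def slerp(z0, z1, t):
    """Spherical interpolation between two latent codes for t in [0, 1]."""
    cos_omega = np.dot(z0, z1) / (np.linalg.norm(z0) * np.linalg.norm(z1))
    omega = np.arccos(np.clip(cos_omega, -1.0, 1.0))
    if np.isclose(np.sin(omega), 0.0):
        return (1.0 - t) * z0 + t * z1  # fall back to linear interpolation
    return (np.sin((1.0 - t) * omega) * z0 + np.sin(t * omega) * z1) / np.sin(omega)

# Toy feature statistics (assumed values, not Inception features).
mu_a, mu_b = np.zeros(2), np.array([3.0, 4.0])
cov = np.eye(2)
d_same = frechet_distance(mu_a, cov, mu_a, cov)
d_shift = frechet_distance(mu_a, cov, mu_b, cov)

z_mid = slerp(np.array([1.0, 0.0]), np.array([0.0, 1.0]), 0.5)
print(d_same, d_shift, z_mid)
```

With identical covariances the distance reduces to the squared distance between the means, and the midpoint of the spherical interpolation stays on the unit circle rather than cutting through its interior.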
7.3 Results
Table 1 summarizes our main results. All of the proposed RAE variants are competitive with the VAE and WAE in terms of generated image quality in all settings. Samples from RAEs achieve the best FIDs across all datasets when a modest 10-component GMM is employed for ex-post density estimation. Furthermore, even when a single Gaussian $\mathcal{N}$ is used as $q_\delta(z)$, RAEs rank first with the exception of MNIST, though the best FID achieved there by the VAE is very close to the FID of RAE-SN.
Moreover, our best RAE FIDs are lower than the best results reported for VAEs in the large-scale comparison of [33], challenging even the best scores reported for GANs. While we are employing a slightly different architecture than theirs, our models underwent only modest fine-tuning instead of an extensive hyperparameter search. Looking at the differently regularized RAEs, there is no clear winner across all settings as all perform comparably well. For practical reasons of implementation simplicity, one may prefer RAE-L2 over the GP and SN variants.
Surprisingly, the implicitly regularized RAE and AE models are shown to be able to score impressive FIDs when $q_\delta(z)$ is fit through GMMs. FIDs for AEs decrease from 58.73 to 10.66 on MNIST and from 127.85 to 45.10 on CelebA – a value close to the state of the art. This is a remarkable result that follows a long series of recent confirmations that neural networks are surprisingly smooth by design [37]. It is also surprising that the lack of an explicitly fixed structure on the latent space of the RAE does not impede interpolation quality. This is further confirmed by the qualitative evaluation on CelebA as reported in Fig. 3 and for the other datasets in Appendix D, where RAE interpolated samples seem sharper than those of competitors and transitions are smoother.
We would like to note that our extensive study confirms and quantifies the effect of the aggregated posterior mismatch as well as the effectiveness of our proposed solution to it. Indeed, if we consider the effect of applying ex-post density estimation to each model in Table 1, we see that it consistently and considerably improves sample quality across all settings. A 10-component GMM fit to the latent codes seems to be enough to halve the FID scores from about 20 to about 10 for WAE and RAE models on MNIST and from 116 to 46 on CelebA. This is striking since this very cheap additional step can be employed with any VAE-like generative model to boost the quality of generated samples.
All in all, the results strongly support our conjecture that the simple deterministic RAE framework can challenge the VAE and WAE.
8 Conclusion
While the theoretical derivation of the VAE has helped popularize the framework for generative modeling, recent works have started to expose some discrepancies between theory and practice. In this work, viewing sampling in VAEs as noise injection to enforce smoothness has enabled us to distill a deterministic autoencoding framework that is compatible with several regularization techniques to learn a meaningful latent space. We have demonstrated that such a deterministic autoencoding framework can generate comparable or better samples than VAEs, while getting around the practical drawbacks tied to a stochastic framework. Furthermore, we have shown that fitting a simple density estimator such as a Gaussian mixture model on the learned latent space consistently improves sample quality both for the proposed RAE framework and for the VAE and WAE, acting as a remedy for the known mismatch between the prior and the aggregated posterior. The RAE framework opens interesting future research avenues such as learning the density estimator in an end-to-end fashion with the autoencoding network and devising more sophisticated autoencoders that can access the full range of recent structural neural network advancements to scale generative modeling without being bound to the VAE’s restrictions.
Acknowledgements
We would like to thank Anant Raj, Matthias Bauer, Paul Rubenstein and Soubhik Sanyal for fruitful discussions.
References
 [1] A. Alemi, B. Poole, I. Fischer, J. Dillon, R. A. Saurous, and K. Murphy. Fixing a Broken ELBO. In ICML, 2018.
 [2] G. An. The effects of adding noise during backpropagation training on a generalization performance. In Neural computation, 1996.
 [3] M. Arjovsky, S. Chintala, and L. Bottou. Wasserstein Generative Adversarial Networks. In ICML, 2017.
 [4] J. Ballé, V. Laparra, and E. P. Simoncelli. End-to-end optimized image compression. In ICLR, 2017.
 [5] M. Bauer and A. Mnih. Resampled Priors for Variational Autoencoders. In AISTATS, 2019.
 [6] Y. Bengio, L. Yao, G. Alain, and P. Vincent. Generalized denoising autoencoders as generative models. In NeurIPS, 2013.
 [7] C. M. Bishop. Pattern recognition and machine learning. Springer, 2006.
 [8] S. R. Bowman, L. Vilnis, O. Vinyals, A. M. Dai, R. Jozefowicz, and S. Bengio. Generating sentences from a continuous space. In CoNLL, 2016.
 [9] A. Brock, J. Donahue, and K. Simonyan. Large scale gan training for high fidelity natural image synthesis. In ICLR, 2019.
 [10] Y. Burda, R. Grosse, and R. Salakhutdinov. Importance weighted autoencoders. arXiv preprint arXiv:1509.00519, 2015.
 [11] X. Chen, D. P. Kingma, T. Salimans, Y. Duan, P. Dhariwal, J. Schulman, I. Sutskever, and P. Abbeel. Variational lossy autoencoder. In ICLR, 2017.
 [12] M. K. Cowles and B. P. Carlin. Markov chain Monte Carlo convergence diagnostics: a comparative review. In Journal of the American Statistical Association, 1996.
 [13] B. Dai and D. Wipf. Diagnosing and Enhancing VAE Models. In ICLR, 2019.
 [14] B. Dai and D. Wipf. Diagnosing and Enhancing VAE Models. In ICLR, 2019.
 [15] P. Ghosh, A. Losalka, and M. J. Black. Resisting Adversarial Attacks using Gaussian Mixture Variational Autoencoders. In AAAI, 2019.
 [16] R. Gómez-Bombarelli, J. N. Wei, D. Duvenaud, J. M. Hernández-Lobato, B. Sánchez-Lengeling, D. Sheberla, J. Aguilera-Iparraguirre, T. D. Hirzel, R. P. Adams, and A. Aspuru-Guzik. Automatic chemical design using a data-driven continuous representation of molecules. In ACS central science, 2018.
 [17] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In NeurIPS, 2014.
 [18] I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. C. Courville. Improved training of wasserstein gans. In NeurIPS, 2017.
 [19] C. Ham, A. Raj, V. Cartillier, and I. Essa. Variational Image Inpainting. In Bayesian Deep Learning Workshop, NeurIPS, 2018.
 [20] M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, G. Klambauer, and S. Hochreiter. GANs Trained by a Two Time-Scale Update Rule Converge to a Nash Equilibrium. In NeurIPS, 2017.
 [21] I. Higgins, L. Matthey, A. Pal, C. Burgess, X. Glorot, M. Botvinick, S. Mohamed, and A. Lerchner. Beta-VAE: Learning basic visual concepts with a constrained variational framework. In ICLR, 2017.
 [22] M. D. Hoffman and M. J. Johnson. Elbo surgery: yet another way to carve up the variational evidence lower bound. In Workshop in Advances in Approximate Bayesian Inference, NeurIPS, 2016.
 [23] H. Huang, R. He, Z. Sun, T. Tan, et al. Introvae: Introspective Variational Autoencoders for Photographic Image Synthesis. In NeurIPS, 2018.
 [24] S. Ioffe and C. Szegedy. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. In ICML, 2015.
 [25] W. Jin, R. Barzilay, and T. Jaakkola. Junction tree variational autoencoder for molecular graph generation. arXiv preprint arXiv:1802.04364, 2018.
 [26] D. P. Kingma, T. Salimans, R. Jozefowicz, X. Chen, I. Sutskever, and M. Welling. Improving variational inference with inverse autoregressive flow. In NeurIPS, 2016.
 [27] D. P. Kingma and M. Welling. Auto-Encoding Variational Bayes. In ICLR, 2014.
 [28] A. Krizhevsky and G. Hinton. Learning Multiple Layers of Features from Tiny Images, 2009.
 [29] K. Kurach, M. Lucic, X. Zhai, M. Michalski, and S. Gelly. The GAN landscape: Losses, architectures, regularization, and normalization. arXiv preprint arXiv:1807.04720, 2018.
 [30] H. Larochelle and I. Murray. The neural autoregressive distribution estimator. In AISTATS, 2011.
 [31] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. In Proceedings of the IEEE, 1998.
 [32] Z. Liu, P. Luo, X. Wang, and X. Tang. Deep Learning Face Attributes in the Wild. In ICCV, 2015.
 [33] M. Lucic, K. Kurach, M. Michalski, S. Gelly, and O. Bousquet. Are GANs Created Equal? A Large-Scale Study. In NeurIPS, 2018.
 [34] A. Makhzani, J. Shlens, N. Jaitly, I. Goodfellow, and B. Frey. Adversarial autoencoders. In ICLR, 2016.
 [35] L. Mescheder, A. Geiger, and S. Nowozin. Which Training Methods for GANs do actually Converge? In ICML, 2018.
 [36] T. Miyato, T. Kataoka, M. Koyama, and Y. Yoshida. Spectral normalization for generative adversarial networks. In ICLR, 2018.
 [37] B. Neyshabur, R. Tomioka, R. Salakhutdinov, and N. Srebro. Geometry of optimization and implicit regularization in deep learning. arXiv preprint arXiv:1705.03071, 2017.
 [38] D. J. Rezende, S. Mohamed, and D. Wierstra. Stochastic backpropagation and approximate inference in deep generative models. In ICML, 2014.
 [39] D. J. Rezende and F. Viola. Taming VAEs. arXiv preprint arXiv:1810.00597, 2018.
 [40] S. Rifai, Y. Bengio, Y. Dauphin, and P. Vincent. A generative process for sampling contractive autoencoders. In ICML, 2012.

 [41] S. Rifai, P. Vincent, X. Muller, X. Glorot, and Y. Bengio. Contractive autoencoders: Explicit invariance during feature extraction. In ICML, 2011.
 [42] M. Rosca, B. Lakshminarayanan, and S. Mohamed. Distribution Matching in Variational Inference. arXiv preprint arXiv:1802.06847, 2018.
 [43] M. S. M. Sajjadi, O. Bachem, M. Lucic, O. Bousquet, and S. Gelly. Assessing Generative Models via Precision and Recall. In NeurIPS, 2018.
 [44] M. S. M. Sajjadi, G. Parascandolo, A. Mehrjou, and B. Schölkopf. Tempered Adversarial Networks. In ICML, 2018.

 [45] M. S. M. Sajjadi, B. Schölkopf, and M. Hirsch. EnhanceNet: Single Image Super-Resolution Through Automated Texture Synthesis. In ICCV, 2017.
 [46] A. Severyn, E. Barth, and S. Semeniuta. A Hybrid Convolutional Variational Autoencoder for Text Generation. 2017.
 [47] J. Sietsma and R. J. Dow. Creating artificial neural networks that generalize. In Neural Networks. Elsevier, 1991.
 [48] K. Sohn, H. Lee, and X. Yan. Learning structured output representation using deep conditional generative models. In NeurIPS, 2015.
 [49] C. K. Sønderby, J. Caballero, L. Theis, W. Shi, and F. Huszár. Amortised MAP Inference for Image Super-resolution. In ICLR, 2017.
 [50] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. In JMLR, 2014.
 [51] L. Theis, A. v. d. Oord, and M. Bethge. A note on the evaluation of generative models. In ICLR, 2016.
 [52] A. N. Tikhonov and V. I. Arsenin. Solutions of Ill-Posed Problems. V. H. Winston, 1977.
 [53] I. Tolstikhin, O. Bousquet, S. Gelly, and B. Schölkopf. Wasserstein Auto-Encoders. In ICLR, 2017.
 [54] J. Tomczak and M. Welling. VAE with a VampPrior. In AISTATS, 2018.
 [55] G. Tucker, A. Mnih, C. J. Maddison, J. Lawson, and J. Sohl-Dickstein. REBAR: Low-variance, unbiased gradient estimates for discrete latent variable models. In NeurIPS, 2017.
 [56] A. van den Oord, O. Vinyals, et al. Neural discrete representation learning. In NeurIPS, 2017.

 [57] P. Vincent, H. Larochelle, Y. Bengio, and P.-A. Manzagol. Extracting and composing robust features with denoising autoencoders. In ICML, 2008.
 [58] S. Zagoruyko and N. Komodakis. Wide residual networks. In BMVC, 2016.
 [59] C. Zhang, S. Bengio, M. Hardt, B. Recht, and O. Vinyals. Understanding deep learning requires rethinking generalization. In ICLR, 2017.
 [60] S. Zhao, J. Song, and S. Ermon. Towards deeper understanding of variational autoencoding models. arXiv preprint arXiv:1702.08658, 2017.
Appendix
Appendix A Network architecture and Training Details
For all experiments, we use the Adam optimizer with a starting learning rate of which is cut in half every time the validation loss plateaus. All models are trained for a maximum of epochs on MNIST and CIFAR and epochs on CelebA. We use a mini-batch size of and pad MNIST digits with zeros to make the size .
We use the official train, validation and test splits of CelebA. For MNIST and CIFAR, we set aside 10k train samples for validation. For random sample evaluation, we draw samples from for VAE and WAE-MMD; for all remaining models, samples are drawn from a multivariate Gaussian whose mean and covariance are estimated using training set embeddings. For the GMM density estimation, we also use the training set embeddings for fitting, and the validation set embeddings to verify that the GMM models are not overfitting to the training embeddings. However, due to the very low number of mixture components, we did not encounter overfitting at this step. The GMM parameters are estimated by running EM for at most iterations.
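The Gaussian sampling step described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function names `fit_gaussian` and `sample_latents`, the 16-dimensional latent space, and the synthetic `train_embeddings` array (standing in for encoder outputs on the training set) are all assumptions made for the example.

```python
import numpy as np


def fit_gaussian(embeddings):
    """Estimate the mean and full covariance of latent codes (rows = samples)."""
    mean = embeddings.mean(axis=0)
    cov = np.cov(embeddings, rowvar=False)
    return mean, cov


def sample_latents(mean, cov, n_samples, rng):
    """Draw latent codes from the fitted multivariate Gaussian."""
    return rng.multivariate_normal(mean, cov, size=n_samples)


# Hypothetical stand-in for encoder outputs on the training set.
rng = np.random.default_rng(0)
train_embeddings = rng.normal(loc=2.0, scale=0.5, size=(5000, 16))

mean, cov = fit_gaussian(train_embeddings)
z = sample_latents(mean, cov, n_samples=1000, rng=rng)
# z would then be passed through the trained decoder to obtain samples.
```

A GMM could be fitted to the same embeddings in an analogous way, for instance with scikit-learn's `GaussianMixture`, whose `max_iter` argument caps the number of EM iterations; whether the authors used that library is not stated here.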
For Figures 1 and 2, we used smaller networks due to computational limitations and since only relative performance is of interest in these experiments. It should be noted that as a result of this, the absolute values of the reported FID scores in these figures cannot be directly compared with the numbers reported in Section 7.
[Table: encoder and decoder architectures for MNIST, CIFAR-10 and CelebA.]
represents a convolutional layer with filters. All convolutions and transposed convolutions have a filter size of for MNIST and CIFAR-10 and for CelebA. They all have a stride of size 2 except for the last convolutional layer in the decoder. Finally, for all models except for the VAE which has as the encoder has to produce both mean and variance for each input.
Appendix B Evaluation Setup
We use 10k samples for all FID and PRD evaluations. Scores for random samples are evaluated against the test set. Reconstruction scores are computed from validation set reconstructions against the respective test set. Interpolation scores are computed by interpolating latent codes of a pair of randomly chosen validation embeddings vs test set samples. The visualized interpolation samples are interpolations between two randomly chosen test set images.
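As an illustration of the interpolation step, the sketch below interpolates between the latent codes of a randomly chosen pair of validation embeddings. The text above does not say which interpolation scheme is used, so linear interpolation is an assumption here (spherical interpolation is another common choice); the helper name `interpolate_latents`, the latent dimension and the synthetic `val_embeddings` are likewise hypothetical.

```python
import numpy as np


def interpolate_latents(z_a, z_b, n_steps):
    """Linearly interpolate between two latent codes, endpoints included."""
    weights = np.linspace(0.0, 1.0, n_steps)[:, None]
    return (1.0 - weights) * z_a + weights * z_b


rng = np.random.default_rng(0)
# Hypothetical stand-in for encoder outputs on the validation set.
val_embeddings = rng.normal(size=(100, 16))

# Pick a random pair of distinct validation embeddings.
i, j = rng.choice(len(val_embeddings), size=2, replace=False)
path = interpolate_latents(val_embeddings[i], val_embeddings[j], n_steps=8)
# Each row of `path` would be decoded and compared against test set samples.
```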
Appendix C Evaluation by Precision and Recall
        MNIST                       CIFAR-10                    CelebA
VAE     0.96 / 0.92   0.95 / 0.96   0.25 / 0.55   0.37 / 0.56   0.54 / 0.66   0.50 / 0.66
CV-VAE  0.84 / 0.73   0.96 / 0.89   0.31 / 0.64   0.42 / 0.68   0.25 / 0.43   0.32 / 0.55
WAE     0.93 / 0.88   0.98 / 0.95   0.38 / 0.68   0.51 / 0.81   0.59 / 0.68   0.69 / 0.77
RAE-GP  0.93 / 0.87   0.97 / 0.98   0.36 / 0.70   0.46 / 0.77   0.38 / 0.55   0.44 / 0.67
RAE-L2  0.92 / 0.87   0.98 / 0.98   0.41 / 0.77   0.57 / 0.81   0.36 / 0.64   0.44 / 0.65
RAE-SN  0.89 / 0.95   0.98 / 0.97   0.36 / 0.73   0.52 / 0.81   0.54 / 0.68   0.55 / 0.74
RAE     0.92 / 0.85   0.98 / 0.98   0.45 / 0.73   0.53 / 0.80   0.46 / 0.59   0.52 / 0.69
AE      0.90 / 0.90   0.98 / 0.97   0.37 / 0.73   0.50 / 0.80   0.45 / 0.66   0.47 / 0.71
Appendix D More Qualitative Results
[Two figures, each with two panels. Columns: Reconstructions, Random Samples, Interpolations. Rows in each panel: GT, VAE, CV-VAE, WAE, RAE-GP, RAE-L2, RAE-SN, RAE, AE.]
Appendix E Investigating Overfitting
[Figure. Columns: MNIST, CIFAR-10, CelebA. Rows: VAE, CV-VAE, WAE, RAE-GP, RAE-L2, RAE-SN, RAE, AE.]