1 Introduction
Invertible flow-based generative models (Dinh et al., 2016; Kingma & Dhariwal, 2018) have recently gained traction due to several desirable properties: (i) exact log-likelihood calculation (unlike VAEs, which maximize a lower bound (Kingma & Welling, 2013; Rezende et al., 2014)), (ii) exact inference of latent variables, and (iii) good sample quality relative to VAEs.
However, a limitation of invertible flow models is that they require invertibility on the full ambient space, resulting in a latent space with the same dimensionality as the input data. This requirement leads to larger models with higher memory and computational costs that are more difficult to scale than VAE and GAN counterparts that have lower-dimensional latent spaces (Goodfellow et al., 2014; Kingma & Dhariwal, 2018). This lack of dimensionality reduction also makes it difficult to capture high-level generative factors directly in individual latent dimensions, a property that is often argued to be desirable for generative models (Higgins et al., 2017; Narayanaswamy et al., 2017; Kumar et al., 2018; Kim & Mnih, 2018; Chen et al., 2018).
In this work we propose a generative model of data based on probability flows that relaxes the bijectivity requirement. The model maps low-dimensional latents $z \in \mathcal{Z}$ to samples in the image of a generator $g$, residing in the much higher-dimensional ambient space $\mathcal{X}$. A probability distribution in the latent space (e.g., standard normal) is pushed forward by the mapping $g$ to induce a distribution on the image, $\mathcal{M} = g(\mathcal{Z})$. By taking the mapping to be one-to-one (injective) and differentiable, we can use a change of variables theorem to obtain a closed form for the distribution over $\mathcal{M}$. While the resulting log-likelihood in ambient space is intractable, we can form tractable lower bounds using stochastic approximations to obtain objectives amenable to stochastic first-order optimization. Relaxing the bijectivity requirement loses the ability to provide exact likelihoods for data points that lie off the image $\mathcal{M}$. In this work, we limit ourselves to using the derived flow-based objective for learning a sampling mechanism that always generates samples from the image $\mathcal{M}$. This is in contrast to VAEs, where generated samples lie off the image due to the presence of an additional distribution at the output of the decoder. (Footnote 1: One can augment the decoder with an ambient noise distribution (e.g., Gaussian), either as part of the training objective or post hoc after training (Wu et al., 2017), for estimating log-likelihoods, but we do not consider that here.)
Our final objective, although derived from the probability flow perspective, resembles a regularized autoencoder with an additional prior log-probability term and an annealing of the weight on the reconstruction loss that increases over time. This flow perspective motivates several commonly used autoencoder regularization strategies, e.g., those in Ghosh et al. (2019). We evaluate the relaxed injective flow models on MNIST, CIFAR-10, and CelebA, where we observe better FID scores than VAEs and the regularized autoencoders of Ghosh et al. (2019). Our results demonstrate that these models provide an efficient and tractable mechanism for training neural samplers with compressed latent spaces.
2 Formulation
2.1 Invertible Flows
Let $g: \mathcal{Z} \to \mathcal{X}$ be a generator mapping from latents to data, assumed to be differentiable everywhere. If $g$ is a bijection with $x = g(z)$, then $\mathcal{Z}$ and $\mathcal{X}$ must have the same dimensionality and we can write the distribution induced over $\mathcal{X}$ in terms of the distribution over $\mathcal{Z}$ using the change of variables formula:
$$p_X(x) = p_Z(z)\,\big|\det J_g(z)\big|^{-1} = p_Z\big(g^{-1}(x)\big)\,\big|\det J_{g^{-1}}(x)\big| \tag{1}$$
where $J_g(z)$ and $J_{g^{-1}}(x)$ are the Jacobians of $g$ and $g^{-1}$ at $z$ and $x$, respectively. Invertible flow models (Dinh et al., 2016; Papamakarios et al., 2017; Kingma & Dhariwal, 2018) optimize (1) to learn a generative model of the data. They provide tractable objectives by structuring the generator so that the inverse and the log-det-Jacobian terms are tractable. Recent work on invertible residual nets (Behrmann et al., 2018) makes use of certain approximations to get a tractable objective for invertible flows with ResNets having Lipschitz-constrained residual blocks. More recently, Behrmann et al. (2020) studied the numerical stability of invertible flow models, finding that, in practice, numerical issues may prevent the models from being invertible in certain regions even slightly off the data manifold.
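As a concrete check of the change of variables formula (1), the following minimal sketch uses a one-dimensional affine bijection with a standard normal latent. The map, the names `alpha` and `mu`, and the helper functions are illustrative assumptions and not part of the models discussed here. The density obtained through (1) should coincide with the closed-form Gaussian density with mean `mu` and standard deviation `|alpha|`.

```python
import math

def normal_pdf(t, mean=0.0, std=1.0):
    """Density of a univariate Gaussian N(mean, std^2) at t."""
    u = (t - mean) / std
    return math.exp(-0.5 * u * u) / (std * math.sqrt(2.0 * math.pi))

# Hypothetical bijective generator g(z) = alpha * z + mu (requires alpha != 0).
alpha, mu = 2.5, -1.0

def g(z):
    return alpha * z + mu

def g_inv(x):
    return (x - mu) / alpha

def density_via_change_of_variables(x):
    # p_X(x) = p_Z(g^{-1}(x)) * |det J_g(z)|^{-1}; here J_g is the 1x1 matrix [alpha].
    z = g_inv(x)
    return normal_pdf(z) / abs(alpha)

x = 0.7
p_flow = density_via_change_of_variables(x)
p_closed_form = normal_pdf(x, mean=mu, std=abs(alpha))
print(p_flow, p_closed_form)
```

In higher dimensions the same computation needs the determinant of a full Jacobian, which is why flow architectures are structured to keep that term cheap.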
2.2 Injective Flows
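We are interested in developing probability flow-based models for the setting where the dimensionality of the latent space is much lower than the data dimensionality, i.e., $\mathcal{Z} \subseteq \mathbb{R}^d$ and $\mathcal{X} \subseteq \mathbb{R}^D$, where $d \ll D$. We can obtain a change of variables formula for this setting by looking at how an infinitesimal volume element $dz$ at $z$ is transformed by the mapping $g$. The theory of integration on manifolds tells us that the mapping $g$ transforms an infinitesimal volume element $dz$ at $z$ to a corresponding volume $\sqrt{\det\big(J_g(z)^\top J_g(z)\big)}\,dz$ on $\mathcal{M}$ (Boothby, 1986), where $\mathcal{M} = g(\mathcal{Z})$ is the image of $\mathcal{Z}$ under $g$. If we assume $g$ is an injective function and thus invertible when seen as a mapping $g: \mathcal{Z} \to \mathcal{M}$, we can write the probability flow from $\mathcal{Z}$ to $\mathcal{M}$ as:
$$p_X(x) = p_Z(z)\,\det\big(J_g(z)^\top J_g(z)\big)^{-1/2}, \qquad z = g^{-1}(x),\ x \in \mathcal{M}. \tag{2}$$
Note that $\det\big(J_g^\top J_g\big) > 0$ as $J_g^\top J_g$ is a positive definite matrix. Figure 1 presents a schematic of invertible and injective functions transforming an infinitesimal volume element. The more familiar change of variables formula in (1) can be derived as the special case when $J_g$ is square and thus $\det\big(J_g^\top J_g\big) = (\det J_g)^2$. In order to avoid solving an inverse problem for every $x$ in our data (i.e., finding $z$ for every $x$ s.t. $g(z) = x$), we assume the existence of an encoder $h: \mathcal{X} \to \mathcal{Z}$ such that $g(h(x)) = x$ for every $x \in \mathcal{M}$. This lets us write
$$\log p_X(x) = \log p_Z(h(x)) - \frac{1}{2}\log\det\big(J_g(h(x))^\top J_g(h(x))\big). \tag{3}$$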
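The injective change of variables in (3) can be illustrated on a toy case. The sketch below assumes a linear generator `g(z) = A z` from two latent dimensions to three ambient dimensions, for which the Jacobian is the constant matrix `A` and an exact encoder on the image is the least-squares left inverse. The matrix `A`, the helper names, and the standard normal prior are assumptions made only for this example.

```python
import math

# Hypothetical linear generator g(z) = A z with D = 3, d = 2 (full column rank).
A = [[1.0, 0.0],
     [0.0, 2.0],
     [1.0, 1.0]]

def matvec(M, v):
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in M]

def gram(M):
    """Return M^T M for a D x d matrix M (a d x d matrix)."""
    d = len(M[0])
    return [[sum(row[i] * row[j] for row in M) for j in range(d)] for i in range(d)]

def det2(G):
    return G[0][0] * G[1][1] - G[0][1] * G[1][0]

def encoder(x):
    """Left inverse h(x) = (A^T A)^{-1} A^T x, exact for x on the image of g."""
    G = gram(A)
    det = det2(G)
    At_x = [sum(A[k][i] * x[k] for k in range(len(A))) for i in range(2)]
    return [(G[1][1] * At_x[0] - G[0][1] * At_x[1]) / det,
            (-G[1][0] * At_x[0] + G[0][0] * At_x[1]) / det]

def log_standard_normal(z):
    return -0.5 * sum(z_i * z_i for z_i in z) - 0.5 * len(z) * math.log(2.0 * math.pi)

def log_density_on_image(x):
    # log p_X(x) = log p_Z(h(x)) - 0.5 * log det(J^T J), with J = A for a linear g.
    z = encoder(x)
    return log_standard_normal(z) - 0.5 * math.log(det2(gram(A)))

z_true = [0.3, -1.2]
x = matvec(A, z_true)  # a point on the image of g
print(encoder(x), log_density_on_image(x))
```

For a nonlinear generator the Jacobian depends on the latent point and has to be recomputed per example, which is the cost the bounds in the text are designed to avoid.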
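Optimizing the $\log\det$ term exactly may be computationally challenging for large models as it requires computing a $D \times d$ Jacobian matrix for every data point (however, it could still be tractable for smaller models where the latent dimensionality is small). We propose two ways to lower bound the log likelihood in (3) in order to obtain a tractable objective we can maximize. Let the singular values of $J_g(h(x))$ be given by $s_1, \ldots, s_d$. Using the inequality $\log t \le t/a + \log a - 1$, for all $t > 0$ and $a > 0$ (based on concavity of the log), we have
$$\log p_X(x) \ge \log p_Z(h(x)) - \frac{1}{2a}\,\|J_g(h(x))\|_F^2 - \frac{d}{2}\log a + \frac{d}{2}. \tag{4}$$
This lower bound is maximized for $a = \|J_g(h(x))\|_F^2 / d$. Substituting it into (4), we get
$$\log p_X(x) \ge \log p_Z(h(x)) - \frac{d}{2}\log\|J_g(h(x))\|_F^2 + \frac{d}{2}\log d. \tag{5}$$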
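The step from (4) to (5) can be spelled out as follows. The notation is chosen here for illustration: $J$ is the generator Jacobian at the encoded point, $s_1, \ldots, s_d$ are its singular values, and $a > 0$ is the free parameter of the tangent-line bound on the logarithm.

```latex
\begin{aligned}
\log\det(J^\top J) &= \sum_{i=1}^{d} \log s_i^2
  \;\le\; \sum_{i=1}^{d} \Big( \frac{s_i^2}{a} + \log a - 1 \Big)
  = \frac{\|J\|_F^2}{a} + d \log a - d, \\
\frac{\partial}{\partial a} \Big( \frac{\|J\|_F^2}{a} + d \log a \Big)
  &= -\frac{\|J\|_F^2}{a^2} + \frac{d}{a} = 0
  \;\Longrightarrow\; a^\star = \frac{\|J\|_F^2}{d}, \\
\log\det(J^\top J) &\le d \, \log \frac{\|J\|_F^2}{d}.
\end{aligned}
```

The last line is the arithmetic-geometric mean inequality applied to the squared singular values; multiplying by $-\tfrac{1}{2}$ gives the Jacobian term of the lower bound.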
The bounds in (4) and (5) are also computationally expensive but we will show how to form efficient stochastic approximations in the next section.
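The ordering of the two bounds can be checked numerically. The sketch below works directly with a made-up list of Jacobian singular values, so the exact term and both bounds have closed forms. The function names and the parameter `a` are labels chosen for this example, under the assumption that the bounds take the tangent-line form described above.

```python
import math

def exact_logdet_term(s):
    """-0.5 * log det(J^T J), written with the singular values s of J."""
    return -sum(math.log(s_i) for s_i in s)

def bound_fixed_a(s, a):
    """Bound of the form (4): tangent-line bound on the log with a fixed a > 0."""
    d = len(s)
    frob_sq = sum(s_i * s_i for s_i in s)  # ||J||_F^2 = sum of squared singular values
    return -frob_sq / (2.0 * a) - 0.5 * d * math.log(a) + 0.5 * d

def bound_optimal_a(s):
    """Bound of the form (5): the fixed-a bound evaluated at a = ||J||_F^2 / d."""
    d = len(s)
    frob_sq = sum(s_i * s_i for s_i in s)
    return -0.5 * d * math.log(frob_sq / d)

singular_values = [0.5, 1.0, 3.0]  # hypothetical Jacobian spectrum
exact = exact_logdet_term(singular_values)
b4 = bound_fixed_a(singular_values, a=1.0)
b5 = bound_optimal_a(singular_values)
print(exact, b5, b4)
```

With all singular values equal, both bounds coincide with the exact term once `a` is set to the common squared singular value, which is the tightness condition.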
Tightness of the bounds. The inequality in (4) is tight when all singular values are equal to $\sqrt{a}$, i.e., $s_i^2 = a$ for all $i$. Note that it is also possible to use a separate $a_i$ corresponding to each $s_i$ and tune these as hyperparameters to improve upon the tightness of the bound; however, we do not explore this for the sake of simplicity. We will still tune the hyperparameter $a$ in (4) to see how it performs against the objective in (5).

The objectives in (5) and (4) are constrained optimization problems, with the reconstruction constraint $g(h(x)) = x$ on the data, that can be solved with a variety of approaches. Recent work on VAEs has used the augmented Lagrangian method to enforce reconstruction constraints (Rezende & Viola, 2018), but here we use the penalty method for its simplicity (Bertsekas, 2016). Applying the penalty method to (5), we get:
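$$\max_{g,\,h}\ \mathbb{E}_{x}\Big[\log p_Z(h(x)) - \frac{d}{2}\log\|J_g(h(x))\|_F^2 + \frac{d}{2}\log d - \lambda\,\|x - g(h(x))\|_2^2\Big] \tag{6}$$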
where $\lambda$ is a positive real that is increased as the optimization progresses (Bertsekas, 2016). Optimizing (6) can still be computationally demanding as it involves computing the full Jacobian of the generator. We can use Hutchinson's trace estimator (Hutchinson, 1990) to avoid explicitly materializing the full Jacobian. Hutchinson's trace estimator is based on the fact that $\mathrm{tr}(A) = \mathbb{E}_v\big[v^\top A v\big]$ for any random vector $v$ s.t. $\mathbb{E}\big[v v^\top\big] = I$. We write the squared Frobenius norm of the Jacobian as $\|J_g\|_F^2 = \mathrm{tr}\big(J_g^\top J_g\big) = \mathbb{E}_v\big[\|J_g v\|_2^2\big]$ for $v \sim \mathcal{N}(0, I_d)$. We further employ the unbiased Monte Carlo estimate $\mathbb{E}_v\big[\|J_g v\|_2^2\big] \approx \frac{1}{k}\sum_{i=1}^{k}\|J_g v_i\|_2^2$, and use one Monte Carlo sample per example ($k = 1$) in a minibatch. This leads to an unbiased estimator of the bound when used with objective (4), and the expectation of the Monte Carlo approximation remains a lower bound on the log likelihood.

When used with the objective in (5), Hutchinson's estimator leads to a biased estimator of $\log\|J_g\|_F^2$, as by Jensen's inequality the expectation of this estimator is smaller than $\log\|J_g\|_F^2$. This results in an estimate whose expectation may no longer be a bound on the log likelihood. A similar issue arises in earlier works that perform Monte Carlo estimation of the logarithm of an expectation (Li & Turner, 2016; Rhodes & Gutmann, 2018). In spite of no longer bounding the log-likelihood, we find that this approximation is still effective in practice for training neural samplers. Using this Monte Carlo approximation yields
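$$\max_{g,\,h}\ \mathbb{E}_{x}\,\mathbb{E}_{v}\Big[\log p_Z(h(x)) - d\,\log\|J_g(h(x))\,v\|_2 - \lambda\,\|x - g(h(x))\|_2^2\Big] \tag{7}$$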
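with $v \sim \mathcal{N}(0, I_d)$ and ignoring the constant terms (the multiplicative factor on the log term can also be absorbed into the remaining coefficients). We use automatic differentiation to optimize the term containing the Jacobian-vector product. (Footnote 2: If the automatic differentiation framework only allows for reverse-mode AD, one can use $\|J_g^\top u\|_2^2$ with $u \sim \mathcal{N}(0, I_D)$ to estimate $\|J_g\|_F^2$, instead of $\|J_g v\|_2^2$ with $v \sim \mathcal{N}(0, I_d)$.) However, we observe numerical instabilities while training models for some CIFAR-10 configurations. In these cases, we use the finite difference approximation:
$$J_g(z)\,v \approx \frac{g(z + \epsilon v) - g(z)}{\epsilon} \tag{8}$$
with small $\epsilon > 0$ and $z = h(x)$.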
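The two ingredients above, a Hutchinson-style estimate of the squared Frobenius norm and a finite-difference Jacobian-vector product, can be combined in a few lines. The sketch below is a toy illustration under stated assumptions: the generator `g`, the step size `eps`, the sample count, and all function names are made up for this example, and the Jacobian-vector product is approximated by finite differences rather than automatic differentiation.

```python
import random

def g(z):
    """Hypothetical smooth generator from R^2 to R^3 (illustration only)."""
    return [z[0], z[1], z[0] * z[1]]

def jvp_finite_difference(fn, z, v, eps=1e-4):
    """Approximate J(z) v by (fn(z + eps * v) - fn(z)) / eps."""
    z_shift = [z_i + eps * v_i for z_i, v_i in zip(z, v)]
    return [(a - b) / eps for a, b in zip(fn(z_shift), fn(z))]

def hutchinson_frobenius_sq(fn, z, num_samples, rng):
    """Monte Carlo estimate of ||J(z)||_F^2 = E_v ||J(z) v||^2 with v ~ N(0, I)."""
    total = 0.0
    for _ in range(num_samples):
        v = [rng.gauss(0.0, 1.0) for _ in z]
        jv = jvp_finite_difference(fn, z, v)
        total += sum(c * c for c in jv)
    return total / num_samples

z = [0.5, -1.0]
# The analytic Jacobian of g is [[1, 0], [0, 1], [z1, z0]], so ||J||_F^2 = 2 + z0^2 + z1^2.
exact = 2.0 + z[0] ** 2 + z[1] ** 2
estimate = hutchinson_frobenius_sq(g, z, num_samples=20000, rng=random.Random(0))
print(exact, estimate)
```

Many probe vectors are averaged here only to make the agreement with the analytic value visible; the training procedure described in the text uses a single probe per example and relies on averaging over minibatches.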
For to be injective, a necessary condition is to constrain all singular values of to be bounded away from zero. Instead of directly enforcing this which can be computationally challenging, we simply enforce that is greater than a threshold for all with . A similar approach was used by Odena et al. (2018). There are scenarios where positivity of singular values does not ensure global injectivity, i.e., there may exist s.t. (see selfintersections in Lagrange et al. (2007)) . Suppose in this case, then the lower bounds in (5) and (4) are still valid since .
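A small example makes the gap between local and global injectivity concrete. The map below is an assumption chosen for illustration: it sends a one-dimensional latent to the unit circle, so its Jacobian has a single singular value equal to one everywhere, yet two distinct latents land on the same point.

```python
import math

def g(t):
    """Hypothetical generator from R to R^2 tracing the unit circle."""
    return (math.cos(t), math.sin(t))

def jacobian_singular_value(t):
    """The 2x1 Jacobian is (-sin t, cos t); its single singular value is its norm."""
    return math.hypot(-math.sin(t), math.cos(t))

# The singular value equals 1 on the whole grid, so g is locally injective there ...
min_sv = min(jacobian_singular_value(0.1 * k) for k in range(100))

# ... yet two distinct latents map to the same point, so g is not globally injective.
z1, z2 = 0.0, 2.0 * math.pi
gap = math.dist(g(z1), g(z2))
print(min_sv, gap)
```

A constraint on Jacobian singular values therefore rules out local collapse of the latent space but cannot by itself exclude such self-intersections.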
While training, we take the latent space distribution $p_Z$ to be an isotropic Gaussian distribution $\mathcal{N}(0, \sigma^2 I_d)$, which reduces the first term in bounds (4) and (5) to $-\frac{1}{2\sigma^2}\|h(x)\|_2^2$ up to an additive constant. Our final minimization objective corresponding to the lower bound of (5) is given by
(9) 
where , , and is a positive penalty on the constraint enforcing the local injectivity of the generator. Both and are increased over the course of optimization (Bertsekas, 2016).
Following similar steps of forming an unconstrained objective using the penalty method and using MonteCarlo estimation for , we obtain the following minimization objective corresponding to the lower bound of (4):
(10) 
where is a fixed hyperparameter (which is not optimized over but can be tuned as discussed earlier). We optimize the objectives (9) and (10) with respect to parameters of both the generator and the encoder .
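The difference between the two objectives with respect to the stochastic Jacobian estimate can be seen exactly on a small example. The sketch below uses a made-up 3 x 2 Jacobian and Rademacher probe vectors, which satisfy the identity-covariance condition required by Hutchinson's estimator and can be enumerated exhaustively for two latent dimensions. The choice of probe distribution and all names are assumptions for illustration.

```python
import itertools
import math

# Hypothetical 3 x 2 Jacobian, given by its rows.
J = [[1.0, 0.5],
     [0.0, 2.0],
     [1.0, 1.0]]
d = len(J[0])

def sq_norm_Jv(v):
    """Squared Euclidean norm of the Jacobian-vector product J v."""
    return sum(sum(row[j] * v[j] for j in range(d)) ** 2 for row in J)

frob_sq = sum(entry * entry for row in J for entry in row)

# Rademacher probes (entries +/-1) satisfy E[v v^T] = I; enumerate all 2^d of them.
probes = list(itertools.product([-1.0, 1.0], repeat=d))
mean_of_estimates = sum(sq_norm_Jv(v) for v in probes) / len(probes)
mean_of_log_estimates = sum(math.log(sq_norm_Jv(v)) for v in probes) / len(probes)
print(mean_of_estimates, frob_sq)
print(mean_of_log_estimates, math.log(frob_sq))
```

The first pair of numbers agrees, which is the unbiasedness used by the objective with the plain Frobenius norm term. The second pair shows the average of the logs falling below the log of the squared Frobenius norm, which is the Jensen gap affecting the objective with the log term.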
2.2.1 Sampling from the model
Although the injective flow model transforms an isotropic Gaussian prior to the data distribution, in practice we observe that the distribution of encoded data points (the "aggregate posterior") deviates from the prior distribution, which is also reflected in the poor quality of generated samples. Note that this is not linked to invertibility and can happen even when the network is perfectly invertible. This issue is not specific to our model and is present even in VAEs and bijective flow-based models. For VAEs, recent work highlighted this issue in the case of modeling a data distribution that lies along a low-dimensional manifold (Sec. 4 in Dai & Wipf (2019)) and proposed fitting another distribution on the encoded latents after training the VAE. For invertible flow models, the Euclidean norm of the latent codes is often different from the typical set of the prior, indicating a systematic aggregate posterior-prior mismatch (see Choi et al. (2018), and Fig. 8 in Kingma & Dhariwal (2018)).
Hence, for sampling from the model, we fit a distribution over the encoded training data in the latent space after the model has been trained, an approach taken by several recent works (van den Oord et al., 2017; Dai & Wipf, 2019; Ghosh et al., 2019). Dai & Wipf (2019) train another VAE on the encoded training data to get a complex post-fit prior. However, in this paper we experiment with fitting a Gaussian prior and a mixture of 10 Gaussians, similar to Ghosh et al. (2019).
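The post-fit Gaussian step can be sketched as follows: estimate a mean and a full covariance from the encoded training points, then draw new latents to be passed through the generator. The two-dimensional latent codes below are made-up stand-ins for encoder outputs, and the helper names are hypothetical. A mixture of Gaussians would replace the single fit with an EM procedure, which is not shown.

```python
import math
import random

def fit_gaussian(codes):
    """Return the sample mean and full (1/n) covariance of 2-D latent codes."""
    n = len(codes)
    mean = [sum(c[i] for c in codes) / n for i in range(2)]
    cov = [[sum((c[i] - mean[i]) * (c[j] - mean[j]) for c in codes) / n
            for j in range(2)] for i in range(2)]
    return mean, cov

def cholesky_2x2(cov):
    """Lower-triangular L with L L^T = cov, for a positive definite 2x2 matrix."""
    l00 = math.sqrt(cov[0][0])
    l10 = cov[1][0] / l00
    l11 = math.sqrt(cov[1][1] - l10 * l10)
    return [[l00, 0.0], [l10, l11]]

def sample_latent(mean, chol, rng):
    """Draw z = mean + L eps with eps ~ N(0, I)."""
    eps = [rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)]
    return [mean[0] + chol[0][0] * eps[0],
            mean[1] + chol[1][0] * eps[0] + chol[1][1] * eps[1]]

# Stand-in for encoded training data; the values are made up.
codes = [[0.0, 1.0], [2.0, 3.0], [4.0, 2.0], [2.0, 0.0]]
mean, cov = fit_gaussian(codes)
chol = cholesky_2x2(cov)
rng = random.Random(0)
z_new = sample_latent(mean, chol, rng)  # would then be decoded by the generator
print(mean, cov, z_new)
```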
3 Related Work
Our work is similar in spirit to the recent work of Ghosh et al. (2019), van den Oord et al. (2017), and Dai & Wipf (2019), who find that regularized autoencoders paired with a learned prior produce high-quality samples. Our work provides another perspective on the regularized autoencoder (RAE) objective of Ghosh et al. (2019), wherein the regularization terms arise naturally from approximating the log-likelihood objective of the injective probability flow. Ghosh et al. (2019) motivate the RAE objective by considering constant posterior-variance VAEs and connecting stochasticity at the decoder's input (arising from sampling from the approximate posterior $q(z \mid x)$) to smoothness of the decoder. Recently, Kumar & Poole (2020) analyzed the implicit regularization in VAEs, deriving a regularizer that also depends on the Jacobian of the decoder but has a different form.

Regularized autoencoders have been widely studied in earlier works as well (Rifai et al., 2011b; Alain & Bengio, 2014; Poole et al., 2014). Rifai et al. (2011b) also penalize the Frobenius norm of the Jacobian; however, the penalty is on the encoder Jacobian, which is different from our penalty on the decoder Jacobian. Most of this prior work on RAEs has focused on improving the quality of the encoder for downstream tasks, whereas we are primarily interested in the quality of the generator for producing samples. Recent work has turned to regularizing autoencoders for sample quality as well, for example improving interpolation quality using an adversarial training objective (Berthelot et al., 2018).

Krusinga et al. (2019) recently used Eq. (2) to get density estimates for trained GANs. However, as we noted earlier, these density estimates are by nature undefined for unseen real examples, which may lie off the manifold.

Several earlier works have also used spectral regularizers in training generative models. Miyato et al. (2018) encourage Lipschitz smoothness of the GAN discriminator by normalizing the spectral norm of each layer. Odena et al. (2018) study the spectral properties of the Jacobian of the generator and its correlation with the quality of generated samples. They empirically observe that regularizing the condition number of the Jacobian leads to more stable training and an improved generative model.
4 Experiments
Datasets. Our experimental framework is based on Ghosh et al. (2019). We evaluate our proposed model and baselines on three publicly available datasets: CelebA (Liu et al., 2015), CIFAR-10 (Krizhevsky & Hinton, 2009), and MNIST (Lecun et al., 1998).
We use cropped images for CelebA faces as used in several prior works. Image size for CIFAR10 and MNIST is and , respectively.
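| Model | MNIST Rec. | MNIST Samples ($\mathcal{N}$) | MNIST Samples (GMM) | CIFAR-10 Rec. | CIFAR-10 Samples ($\mathcal{N}$) | CIFAR-10 Samples (GMM) | CelebA Rec. | CelebA Samples ($\mathcal{N}$) | CelebA Samples (GMM) |
|---|---|---|---|---|---|---|---|---|---|
| VAE | 65.10 | 57.04 | 62.08 | 176.5 | 169.1 | 184.3 | 62.36 | 72.48 | 67.82 |
| $\beta$-VAE | 7.91 | 24.31 | 8.12 | 43.86 | 83.59 | 71.56 | 30.06 | 50.66 | 42.77 |
| AE | 8.69 | 43.40 | 12.14 | 41.45 | 81.13 | 70.97 | 30.16 | 51.48 | 43.49 |
| CAE | 10.51 | 45.18 | 12.90 | 41.13 | 81.53 | 70.11 | 31.12 | 48.13 | 40.67 |
| AE+L2 | 7.76 | 34.27 | 9.69 | 43.02 | 81.28 | 70.13 | 29.97 | 50.02 | 42.09 |
| AE+SN | 8.07 | 37.19 | 11.84 | 41.34 | 81.35 | 70.94 | 31.21 | 51.13 | 43.33 |
| InjFlow, objective (9) | 7.40 | 35.96 | 9.93 | 40.11 | 78.78 | 68.26 | 27.93 | 47.70 | 40.23 |
| InjFlow, objective (10) | 6.0 | 42.65 | 11.43 | 40.86 | 79.67 | 68.37 | 28.51 | 49.01 | 40.57 |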
Baseline models. Our final objective, although obtained by developing an injective flow and lower bounding its log likelihood, resembles recently proposed regularized autoencoders (Ghosh et al., 2019), which are therefore natural models for comparison.
We consider several smoothness regularizers in our evaluations, some of which have also been used by Ghosh et al. (2019):
(i) AE: Vanilla autoencoder trained with reconstruction loss. (ii) AE+L2: Autoencoder with an additional $L_2$ norm penalty on the decoder parameters (weight decay). (iii) AE+SN: Autoencoder with an additional spectral normalization on each individual layer of the decoder (i.e., normalizing the top singular value to be 1), motivated by Miyato et al. (2018). (iv) CAE: Contractive autoencoder (Rifai et al., 2011a), which penalizes the Frobenius norm of the encoder's Jacobian. We use a similar Hutchinson trace stochastic approximation (as used for our objectives) to optimize the Frobenius norm term in the CAE objective. Ghosh et al. (2019) also consider a gradient-penalty regularized AE which penalizes the Frobenius norm of the decoder's Jacobian, a term which is also present in our objective (4). We also compare with (v) VAE (Kingma & Welling, 2013) and (vi) $\beta$-VAE (Higgins et al., 2017), both with a Gaussian observation model $\mathcal{N}(g(z), \sigma^2 I)$ at the decoder's output. For VAE, $\sigma$ is taken to be 1, while for $\beta$-VAE, varying $\sigma$ directly controls $\beta$. We do not report a comparison with Wasserstein Autoencoders (WAE), as Ghosh et al. (2019) have shown that the tractable WAE-MMD version is outperformed by regularized autoencoders.
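The spectral normalization used by the AE+SN baseline divides a layer's weight matrix by its largest singular value. The sketch below estimates that value by power iteration on a small made-up matrix. It is a simplification for illustration: the matrix, the iteration count, and the function names are assumptions, and practical implementations typically run a single power-iteration step per training update on a persistent vector rather than iterating to convergence.

```python
import math

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def transpose(M):
    return [list(col) for col in zip(*M)]

def normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def top_singular_value(W, num_iters=100):
    """Estimate the largest singular value of W by power iteration on W^T W."""
    Wt = transpose(W)
    v = normalize([1.0] * len(W[0]))
    for _ in range(num_iters):
        v = normalize(matvec(Wt, matvec(W, v)))
    return math.sqrt(sum(x * x for x in matvec(W, v)))

# Hypothetical weight matrix of one decoder layer.
W = [[3.0, 0.0],
     [0.0, 1.0],
     [0.0, 0.0]]
sigma = top_singular_value(W)
W_sn = [[w / sigma for w in row] for row in W]  # spectrally normalized weights
print(sigma, top_singular_value(W_sn))
```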
Architectures. We use a convolutional neural network architecture for both the encoder and the decoder, each having five layers of convolutions or transposed convolutions, respectively. Strides and kernel sizes in the convolutional filters differ across datasets, but stay the same for all models on a given dataset. We use a slightly larger network than Ghosh et al. (2019) (5 vs. 4 layers), and thus rerun all baseline methods so that the results are comparable. This also results in improved scores for the baselines over those reported in Ghosh et al. (2019). We use ELU activations in both encoder and decoder, and also use batch normalization. Latent dimensionality is taken to be
for CIFAR10 and CelebA, and for MNIST. More details on the architectures used are provided in the supplementary material.Hyperparameters and training. Our Frobenius norm objective (9) (referred as InjFlow) has four hyperparameters: (i) variance of the isotropic Gaussian distribution on latent space, which determines the weight on the term penalizing the norm of the encodings , (ii) penalty coefficient on the reconstruction loss, (iii) penalty coefficient on the injectivity loss term, and (iv) singular value threshold used in the injectivity term . We use in all our experiments. Both penalty coefficients and are initialized to be at the beginning of optimization and are increased with each minibatch iteration as , where is searched over . The weight on the prior term is searched over . Our objective (10) (referred as InjFlow) has an additional hyperparameter that determines the weight on Frobenius norm regularization term, which we fix to in all our experiments. As discussed earlier, this will result in a tight bound only for the case when all Jacobian singular values are one.
For AE+L2, the hyperparameter for the regularization term is searched over the set . For
VAE, we search over the standard deviation
of the decoder’s distribution (which is related to as for the Gaussian observation model) over the set (for VAE, ). For CAE (Rifai et al., 2011b), the hyperparameter penalizing the encoder’s Jacobian norm is searched over . We train all models using Adam optimizer (Kingma & Ba, 2014) with batch size of and a fixed learning rate of . All models are trained for k minibatch iterations.Evaluation. Evaluation of sample quality is a challenging task (Theis et al., 2015), and several metrics have been proposed in literature for this (Salimans et al., 2016; Heusel et al., 2017; Sajjadi et al., 2018). We use the FID score (Heusel et al., 2017) as the quantitative metric for our evaluations, which is one of the most popular metrics and has been used in several recent works (Tolstikhin et al., 2017; Dai & Wipf, 2019; Ghosh et al., 2019) for evaluating generative models. As discussed earlier, we fit a Gaussian and a mixture of 10 Gaussians on the encoded training data, and use these as prior latent distributions to sample from the model. The covariance matrices for the Gaussian as well as for all mixture components in the GMM are taken to be full matrices. We also report FID scores on the test reconstructions apart from qualitative visualization of reconstructions and samples.
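FID compares Gaussian fits to feature statistics of real and generated images through the Fréchet distance, $\|\mu_1 - \mu_2\|^2 + \mathrm{Tr}\big(\Sigma_1 + \Sigma_2 - 2(\Sigma_1 \Sigma_2)^{1/2}\big)$. The sketch below evaluates this distance in the special case of diagonal covariances, where the matrix square root reduces to an elementwise one. The diagonal restriction, the function name, and the feature statistics are assumptions made for illustration; the actual metric uses full covariances of Inception network features.

```python
import math

def frechet_distance_diag(mean1, var1, mean2, var2):
    """Frechet distance between N(mean1, diag(var1)) and N(mean2, diag(var2))."""
    mean_term = sum((m1 - m2) ** 2 for m1, m2 in zip(mean1, mean2))
    # For diagonal covariances, Tr(S1 + S2 - 2 (S1 S2)^{1/2}) is a sum over coordinates.
    cov_term = sum(v1 + v2 - 2.0 * math.sqrt(v1 * v2) for v1, v2 in zip(var1, var2))
    return mean_term + cov_term

# Made-up feature statistics for "real" and "generated" samples.
real_mean, real_var = [0.0, 1.0], [1.0, 4.0]
fake_mean, fake_var = [1.0, 1.0], [4.0, 4.0]
distance = frechet_distance_diag(real_mean, real_var, fake_mean, fake_var)
print(distance)
```

The distance is zero when both sets of statistics coincide and grows with mismatches in either the means or the variances, which is why lower FID indicates samples closer to the data.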
For all models, we report the best FID score obtained using decoder sampling from a post-fit GMM in the latent space. We then report all the other scores (i.e., scores for samples from the post-fit Gaussian and for test reconstructions) for the same model that gives the best GMM-sample FID score. This enables us to assess models in terms of their best possible sample generation ability. The proposed injective flow models yield better FID scores than all the baseline models for CelebA and CIFAR-10, and are competitive on MNIST, where they are outperformed by $\beta$-VAE. In most cases, FID scores for samples with the post-fit GMM are better than for samples with the post-fit Gaussian, except for VAE, which we suspect could be due to a convergence issue with GMM fitting. In most cases, the InjFlow model trained with objective (9) yields better FID scores than the one trained with objective (10), which is expected as the former uses the optimal value of the bound parameter $a$ (inequality (5)) as opposed to the fixed and likely suboptimal value of $a$ used for the latter in our experiments.
Randomly generated CelebA samples for autoencoder, VAE, and the proposed model (InjFlow) are visualized in Fig. 3. While VAE samples are blurry and tend to lose fine details, they retain global coherence. On the other hand, samples from InjFlow and autoencoder are sharper with more fine details but also have undesired visual artifacts in some cases. Fig. 2 shows reconstructions of randomly sampled test examples using InjFlow, autoencoder and VAE. InjFlow reconstructions preserve more fine details than both autoencoder and VAE (e.g., hair strand for image in the third column), as also reflected by improved FID scores. More generated samples are shown in the supplementary material. It should be noted that better sample quality can be achieved by fitting a more expressive prior such as a GMM with more components, a VAE or a flow prior (Dinh et al., 2016; Papamakarios et al., 2017), however care must be taken to not overfit the latent encodings of the training points. In principle, a model that can produce good quality test reconstructions has the ability to generate good quality novel samples and the challenge lies in fitting a prior distribution that generalizes well.
5 Discussion
We proposed a probability flow based generative model that leverages an injective generator mapping, relaxing the bijectivity requirement. We use a change of variables formula to derive an optimization objective for learning the generator and encoder, where a smoothness regularizer on the generator naturally arises from the probability flow, along with some additional penalty terms. This nicely motivates several autoencoder regularizers that have been used in the past, such as in Ghosh et al. (2019). The proposed model also improves over several regularizers studied in Ghosh et al. (2019) in terms of FID scores.
Relaxing the bijectivity constraint loses many nice properties of invertible flow-based generative models, such as tractable likelihood and inference for unseen data. A possible approach to recover these aspects could be to define a background probability model over the full ambient space and work with a mixture of a foreground distribution over the image $\mathcal{M}$, coming from the probability flow, and the background distribution. Investigating this is an interesting future direction.
To enable tractable and efficient training of Injective Flow models, we relied on lower bounds and a stochastic approximation for the Jacobian term, and an amortized encoder trained with the penalty method. Future work should investigate the degree to which these approximations are accurate, and whether there are better and more efficient approaches for ensuring invertibility ($g(h(x)) = x$) on training points (e.g., augmented Lagrangian methods (Bertsekas, 2016), which have been used in Rezende & Viola (2018)). A benefit of Injective Flow models is the ability to scale to larger input dimensions. In future work, we plan to improve sample quality and cater to higher-resolution images by scaling the models and fitting more expressive priors.
References

Alain & Bengio (2014) Alain, G. and Bengio, Y. What regularized autoencoders learn from the data-generating distribution. Journal of Machine Learning Research, 15(1):3563–3593, 2014.
Behrmann et al. (2018) Behrmann, J., Grathwohl, W., Chen, R. T. Q., Duvenaud, D., and Jacobsen, J.-H. Invertible residual networks, 2018.

Behrmann et al. (2020) Behrmann, J., Vicol, P., Wang, K.-C., Grosse, R. B., and Jacobsen, J.-H. On the invertibility of invertible neural networks, 2020. URL https://openreview.net/forum?id=BJlVeyHFwH.
Berthelot et al. (2018) Berthelot, D., Raffel, C., Roy, A., and Goodfellow, I. Understanding and improving interpolation in autoencoders via an adversarial regularizer. arXiv preprint arXiv:1807.07543, 2018.
 Bertsekas (2016) Bertsekas, D. P. Nonlinear programming: 3rd edition. 2016.
 Boothby (1986) Boothby, W. M. An introduction to differentiable manifolds and Riemannian geometry, volume 120. Academic press, 1986.
 Chen et al. (2018) Chen, T. Q., Li, X., Grosse, R. B., and Duvenaud, D. K. Isolating sources of disentanglement in variational autoencoders. In Advances in Neural Information Processing Systems, pp. 2610–2620, 2018.
 Choi et al. (2018) Choi et al. WAIC, but Why? Generative Ensembles for Robust Anomaly Detection. arXiv eprints, art. arXiv:1810.01392, 2018.
 Dai & Wipf (2019) Dai, B. and Wipf, D. Diagnosing and enhancing vae models. arXiv preprint arXiv:1903.05789, 2019.
Dinh et al. (2016) Dinh, L., Sohl-Dickstein, J., and Bengio, S. Density estimation using Real NVP. arXiv preprint arXiv:1605.08803, 2016.
Ghosh et al. (2019) Ghosh, P., Sajjadi, M. S. M., Vergari, A., Black, M., and Schölkopf, B. From variational to deterministic autoencoders, 2019.
Goodfellow et al. (2014) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. Generative adversarial nets. In Advances in Neural Information Processing Systems, pp. 2672–2680, 2014.
Heusel et al. (2017) Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., and Hochreiter, S. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In Advances in Neural Information Processing Systems, pp. 6626–6637, 2017.
Higgins et al. (2017) Higgins, I., Matthey, L., Pal, A., Burgess, C., Glorot, X., Botvinick, M., Mohamed, S., and Lerchner, A. beta-VAE: Learning basic visual concepts with a constrained variational framework. In International Conference on Learning Representations, 2017.
Hutchinson (1990) Hutchinson, M. F. A stochastic estimator of the trace of the influence matrix for Laplacian smoothing splines. Communications in Statistics - Simulation and Computation, 19(2):433–450, 1990.
 Kim & Mnih (2018) Kim, H. and Mnih, A. Disentangling by factorising. arXiv preprint arXiv:1802.05983, 2018.
 Kingma & Ba (2014) Kingma, D. and Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
 Kingma & Dhariwal (2018) Kingma, D. P. and Dhariwal, P. Glow: Generative flow with invertible 1x1 convolutions. In Advances in Neural Information Processing Systems, pp. 10215–10224, 2018.
Kingma & Welling (2013) Kingma, D. P. and Welling, M. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.
 Krizhevsky & Hinton (2009) Krizhevsky, A. and Hinton, G. Learning multiple layers of features from tiny images. Technical report, Citeseer, 2009.
 Krusinga et al. (2019) Krusinga, R., Shah, S., Zwicker, M., Goldstein, T., and Jacobs, D. Understanding the (un) interpretability of natural image distributions using generative models. arXiv preprint arXiv:1901.01499, 2019.
 Kumar & Poole (2020) Kumar, A. and Poole, B. On implicit regularization in vaes. arXiv preprint arXiv:2002.00041, 2020.
 Kumar et al. (2018) Kumar, A., Sattigeri, P., and Balakrishnan, A. Variational inference of disentangled latent concepts from unlabeled observations. In ICLR, 2018.
 Lagrange et al. (2007) Lagrange, S., Delanoue, N., and Jaulin, L. On sufficient conditions of the injectivity: development of a numerical test algorithm via interval analysis. Reliable computing, 13(5):409–421, 2007.
Lecun et al. (1998) Lecun, Y., Bottou, L., Bengio, Y., and Haffner, P. Gradient-based learning applied to document recognition. In Proceedings of the IEEE, pp. 2278–2324, 1998.
 Li & Turner (2016) Li, Y. and Turner, R. E. Rényi divergence variational inference. In Advances in Neural Information Processing Systems, pp. 1073–1081, 2016.

Liu et al. (2015) Liu, Z., Luo, P., Wang, X., and Tang, X. Deep learning face attributes in the wild. In Proceedings of the IEEE International Conference on Computer Vision, pp. 3730–3738, 2015.
Miyato et al. (2018) Miyato, T., Kataoka, T., Koyama, M., and Yoshida, Y. Spectral normalization for generative adversarial networks. arXiv preprint arXiv:1802.05957, 2018.
Narayanaswamy et al. (2017) Narayanaswamy, S., Paige, T. B., Van de Meent, J.-W., Desmaison, A., Goodman, N., Kohli, P., Wood, F., and Torr, P. Learning disentangled representations with semi-supervised deep generative models. In Advances in Neural Information Processing Systems, pp. 5925–5935, 2017.
Odena et al. (2018) Odena, A., Buckman, J., Olsson, C., Brown, T. B., Olah, C., Raffel, C., and Goodfellow, I. Is generator conditioning causally related to GAN performance? arXiv preprint arXiv:1802.08768, 2018.
Papamakarios et al. (2017) Papamakarios, G., Pavlakou, T., and Murray, I. Masked autoregressive flow for density estimation. In Advances in Neural Information Processing Systems, pp. 2338–2347, 2017.
Poole et al. (2014) Poole, B., Sohl-Dickstein, J., and Ganguli, S. Analyzing noise in autoencoders and deep networks, 2014.
Rezende & Viola (2018) Rezende, D. J. and Viola, F. Taming VAEs. arXiv preprint arXiv:1810.00597, 2018.
Rezende et al. (2014) Rezende, D. J., Mohamed, S., and Wierstra, D. Stochastic backpropagation and approximate inference in deep generative models. arXiv preprint arXiv:1401.4082, 2014.
Rhodes & Gutmann (2018) Rhodes, B. and Gutmann, M. Variational noise-contrastive estimation. arXiv preprint arXiv:1810.08010, 2018.

Rifai et al. (2011a) Rifai, S., Dauphin, Y., Vincent, P., Bengio, Y., and Muller, X. The manifold tangent classifier. In NIPS, 2011a.
Rifai et al. (2011b) Rifai, S., Vincent, P., Muller, X., Glorot, X., and Bengio, Y. Contractive autoencoders: Explicit invariance during feature extraction. In Proceedings of the 28th International Conference on International Conference on Machine Learning, pp. 833–840. Omnipress, 2011b.
Sajjadi et al. (2018) Sajjadi, M. S., Bachem, O., Lucic, M., Bousquet, O., and Gelly, S. Assessing generative models via precision and recall. In Advances in Neural Information Processing Systems, pp. 5228–5237, 2018.
Salimans et al. (2016) Salimans, T., Goodfellow, I., Zaremba, W., Cheung, V., Radford, A., and Chen, X. Improved techniques for training GANs. In Advances in Neural Information Processing Systems, 2016.
 Theis et al. (2015) Theis, L., Oord, A. v. d., and Bethge, M. A note on the evaluation of generative models. arXiv preprint arXiv:1511.01844, 2015.
 Tolstikhin et al. (2017) Tolstikhin, I., Bousquet, O., Gelly, S., and Schoelkopf, B. Wasserstein autoencoders. arXiv preprint arXiv:1711.01558, 2017.
 van den Oord et al. (2017) van den Oord, A., Vinyals, O., et al. Neural discrete representation learning. In Advances in Neural Information Processing Systems, pp. 6306–6315, 2017.
 Wu et al. (2017) Wu, Y., Burda, Y., Salakhutdinov, R., and Grosse, R. On the quantitative analysis of decoderbased generative models. In ICLR, 2017.
References

Alain & Bengio (2014)
Alain, G. and Bengio, Y.
What regularized autoencoders learn from the datagenerating
distribution.
Journal of Machine Learning Research
, 15(1):3563–3593, 2014.  Behrmann et al. (2018) Behrmann, J., Grathwohl, W., Chen, R. T. Q., Duvenaud, D., and Jacobsen, J.H. Invertible residual networks, 2018.

Behrmann et al. (2020)
Behrmann, J., Vicol, P., Wang, K.C., Grosse, R. B., and Jacobsen, J.H.
On the invertibility of invertible neural networks, 2020.
URL https://openreview.net/forum?id=BJlVeyHFwH.  Berthelot et al. (2018) Berthelot, D., Raffel, C., Roy, A., and Goodfellow, I. Understanding and improving interpolation in autoencoders via an adversarial regularizer. arXiv preprint arXiv:1807.07543, 2018.
 Bertsekas (2016) Bertsekas, D. P. Nonlinear programming: 3rd edition. 2016.
 Boothby (1986) Boothby, W. M. An introduction to differentiable manifolds and Riemannian geometry, volume 120. Academic press, 1986.
 Chen et al. (2018) Chen, T. Q., Li, X., Grosse, R. B., and Duvenaud, D. K. Isolating sources of disentanglement in variational autoencoders. In Advances in Neural Information Processing Systems, pp. 2610–2620, 2018.
 Choi et al. (2018) Choi et al. WAIC, but Why? Generative Ensembles for Robust Anomaly Detection. arXiv e-prints, art. arXiv:1810.01392, 2018.
 Dai & Wipf (2019) Dai, B. and Wipf, D. Diagnosing and enhancing vae models. arXiv preprint arXiv:1903.05789, 2019.
 Dinh et al. (2016) Dinh, L., Sohl-Dickstein, J., and Bengio, S. Density estimation using real nvp. arXiv preprint arXiv:1605.08803, 2016.
 Ghosh et al. (2019) Ghosh, P., Sajjadi, M. S. M., Vergari, A., Black, M., and Schölkopf, B. From variational to deterministic autoencoders, 2019.
 Goodfellow et al. (2014) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. Generative adversarial nets. In Advances in neural information processing systems, pp. 2672–2680, 2014.
 Heusel et al. (2017) Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., and Hochreiter, S. Gans trained by a two timescale update rule converge to a local nash equilibrium. In Advances in Neural Information Processing Systems, pp. 6626–6637, 2017.
 Higgins et al. (2017) Higgins, I., Matthey, L., Pal, A., Burgess, C., Glorot, X., Botvinick, M., Mohamed, S., and Lerchner, A. beta-VAE: Learning basic visual concepts with a constrained variational framework. In International Conference on Learning Representations, 2017.
 Hutchinson (1990) Hutchinson, M. F. A stochastic estimator of the trace of the influence matrix for laplacian smoothing splines. Communications in Statistics-Simulation and Computation, 19(2):433–450, 1990.
 Kim & Mnih (2018) Kim, H. and Mnih, A. Disentangling by factorising. arXiv preprint arXiv:1802.05983, 2018.
 Kingma & Ba (2014) Kingma, D. and Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
 Kingma & Dhariwal (2018) Kingma, D. P. and Dhariwal, P. Glow: Generative flow with invertible 1x1 convolutions. In Advances in Neural Information Processing Systems, pp. 10215–10224, 2018.
 Kingma & Welling (2013) Kingma, D. P. and Welling, M. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.
 Krizhevsky & Hinton (2009) Krizhevsky, A. and Hinton, G. Learning multiple layers of features from tiny images. Technical report, Citeseer, 2009.
 Krusinga et al. (2019) Krusinga, R., Shah, S., Zwicker, M., Goldstein, T., and Jacobs, D. Understanding the (un) interpretability of natural image distributions using generative models. arXiv preprint arXiv:1901.01499, 2019.
 Kumar & Poole (2020) Kumar, A. and Poole, B. On implicit regularization in vaes. arXiv preprint arXiv:2002.00041, 2020.
 Kumar et al. (2018) Kumar, A., Sattigeri, P., and Balakrishnan, A. Variational inference of disentangled latent concepts from unlabeled observations. In ICLR, 2018.
 Lagrange et al. (2007) Lagrange, S., Delanoue, N., and Jaulin, L. On sufficient conditions of the injectivity: development of a numerical test algorithm via interval analysis. Reliable computing, 13(5):409–421, 2007.
 Lecun et al. (1998) Lecun, Y., Bottou, L., Bengio, Y., and Haffner, P. Gradient-based learning applied to document recognition. In Proceedings of the IEEE, pp. 2278–2324, 1998.
 Li & Turner (2016) Li, Y. and Turner, R. E. Rényi divergence variational inference. In Advances in Neural Information Processing Systems, pp. 1073–1081, 2016.
 Liu et al. (2015) Liu, Z., Luo, P., Wang, X., and Tang, X. Deep learning face attributes in the wild. In Proceedings of the IEEE International Conference on Computer Vision, pp. 3730–3738, 2015.
 Miyato et al. (2018) Miyato, T., Kataoka, T., Koyama, M., and Yoshida, Y. Spectral normalization for generative adversarial networks. arXiv preprint arXiv:1802.05957, 2018.
 Narayanaswamy et al. (2017) Narayanaswamy, S., Paige, T. B., Van de Meent, J.-W., Desmaison, A., Goodman, N., Kohli, P., Wood, F., and Torr, P. Learning disentangled representations with semi-supervised deep generative models. In Advances in Neural Information Processing Systems, pp. 5925–5935, 2017.
 Odena et al. (2018) Odena, A., Buckman, J., Olsson, C., Brown, T. B., Olah, C., Raffel, C., and Goodfellow, I. Is generator conditioning causally related to gan performance? arXiv preprint arXiv:1802.08768, 2018.
 Papamakarios et al. (2017) Papamakarios, G., Pavlakou, T., and Murray, I. Masked autoregressive flow for density estimation. In Advances in Neural Information Processing Systems, pp. 2338–2347, 2017.
 Poole et al. (2014) Poole, B., Sohl-Dickstein, J., and Ganguli, S. Analyzing noise in autoencoders and deep networks, 2014.
 Rezende & Viola (2018) Rezende, D. J. and Viola, F. Taming vaes. arXiv preprint arXiv:1810.00597, 2018.
 Rezende et al. (2014) Rezende, D. J., Mohamed, S., and Wierstra, D. Stochastic backpropagation and approximate inference in deep generative models. arXiv preprint arXiv:1401.4082, 2014.
 Rhodes & Gutmann (2018) Rhodes, B. and Gutmann, M. Variational noise-contrastive estimation. arXiv preprint arXiv:1810.08010, 2018.
 Rifai et al. (2011a) Rifai, S., Dauphin, Y., Vincent, P., Bengio, Y., and Muller, X. The manifold tangent classifier. In NIPS, 2011a.
 Rifai et al. (2011b) Rifai, S., Vincent, P., Muller, X., Glorot, X., and Bengio, Y. Contractive autoencoders: Explicit invariance during feature extraction. In Proceedings of the 28th International Conference on International Conference on Machine Learning, pp. 833–840. Omnipress, 2011b.
 Sajjadi et al. (2018) Sajjadi, M. S., Bachem, O., Lucic, M., Bousquet, O., and Gelly, S. Assessing generative models via precision and recall. In Advances in Neural Information Processing Systems, pp. 5228–5237, 2018.
 Salimans et al. (2016) Salimans, T., Goodfellow, I., Zaremba, W., Cheung, V., Radford, A., and Chen, X. Improved techniques for training gans. In Advances in Neural Information Processing Systems, 2016.
 Theis et al. (2015) Theis, L., Oord, A. v. d., and Bethge, M. A note on the evaluation of generative models. arXiv preprint arXiv:1511.01844, 2015.
 Tolstikhin et al. (2017) Tolstikhin, I., Bousquet, O., Gelly, S., and Schoelkopf, B. Wasserstein autoencoders. arXiv preprint arXiv:1711.01558, 2017.
 van den Oord et al. (2017) van den Oord, A., Vinyals, O., et al. Neural discrete representation learning. In Advances in Neural Information Processing Systems, pp. 6306–6315, 2017.
 Wu et al. (2017) Wu, Y., Burda, Y., Salakhutdinov, R., and Grosse, R. On the quantitative analysis of decoder-based generative models. In ICLR, 2017.
Appendix A Architectures
We used a similar architecture for all datasets, with 5 convolution layers followed by a dense layer projecting to a mean embedding.
Our architecture resembles that of Ghosh et al. (2019) but with an additional layer, ELU instead of ReLU nonlinearities, and larger latent dimensions. We list Conv (convolutional) and ConvT (transposed convolution) layers with their number of filters, kernel size, and stride.
Table: Encoder and Decoder layer listings for MNIST, CIFAR-10, and CelebA.
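To illustrate how kernel size and stride fix the feature-map sizes in such a stack, the sketch below tracks the spatial resolution through five strided convolutions. The layer tuples (filter counts, kernel 4, stride 2, padding 1) and the helper names are hypothetical placeholders chosen for the example, not the values used in our experiments. The 32x32 input corresponds to CIFAR-10.

```python
# Sketch only: every number in HYPOTHETICAL_ENCODER is an assumption made
# for illustration and is NOT taken from the paper's architecture table.

def conv_out(size, kernel, stride, padding):
    """Spatial output size of a strided convolution (floor convention)."""
    return (size + 2 * padding - kernel) // stride + 1


def conv_transpose_out(size, kernel, stride, padding, output_padding=0):
    """Spatial output size of a transposed convolution."""
    return (size - 1) * stride - 2 * padding + kernel + output_padding


# (filters, kernel, stride, padding) for five Conv layers.
HYPOTHETICAL_ENCODER = [
    (32, 4, 2, 1),
    (64, 4, 2, 1),
    (128, 4, 2, 1),
    (256, 4, 2, 1),
    (512, 4, 2, 1),
]


def encoder_shapes(input_size, layers):
    """Return (channels, height, width) after each Conv layer."""
    shapes = []
    size = input_size
    for filters, kernel, stride, padding in layers:
        size = conv_out(size, kernel, stride, padding)
        shapes.append((filters, size, size))
    return shapes


shapes = encoder_shapes(32, HYPOTHETICAL_ENCODER)
channels, height, width = shapes[-1]
# Number of features the final dense layer would project to the mean embedding.
flat_features = channels * height * width
```

With these assumed settings each Conv layer halves the resolution, and a ConvT layer with the same kernel, stride, and padding doubles it again, which is why a decoder can mirror the encoder layer by layer.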
Appendix B Additional Samples
We visualize additional reconstructed test examples and samples from a post-fit GMM with 10 mixture components on the latents.
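The post-fit sampling step can be sketched as follows: encode the training set, fit a 10-component Gaussian mixture to the resulting latents, then draw latents from the mixture and decode them. The code below is a minimal standard-library sketch under several assumptions: it uses a diagonal-covariance mixture fit with plain EM, synthetic 2-D points stand in for encoder outputs, and the function names `fit_diag_gmm` and `sample_gmm` are hypothetical. The covariance structure and fitting routine used for the figures may differ.

```python
import math
import random


def fit_diag_gmm(points, k, iters=30, seed=0, min_var=1e-6):
    """Fit a diagonal-covariance GMM with EM; returns (weights, means, variances)."""
    rng = random.Random(seed)
    n, d = len(points), len(points[0])
    means = [list(p) for p in rng.sample(points, k)]
    gmean = [sum(p[j] for p in points) / n for j in range(d)]
    gvar = [max(sum((p[j] - gmean[j]) ** 2 for p in points) / n, min_var)
            for j in range(d)]
    variances = [list(gvar) for _ in range(k)]
    weights = [1.0 / k] * k
    for _ in range(iters):
        # E-step: responsibilities, computed in log space for stability.
        resp = []
        for p in points:
            logs = []
            for c in range(k):
                lp = math.log(weights[c])
                for j in range(d):
                    v = variances[c][j]
                    lp -= 0.5 * (math.log(2 * math.pi * v)
                                 + (p[j] - means[c][j]) ** 2 / v)
                logs.append(lp)
            top = max(logs)
            exps = [math.exp(l - top) for l in logs]
            total = sum(exps)
            resp.append([e / total for e in exps])
        # M-step: re-estimate weights, means, and per-dimension variances.
        for c in range(k):
            nc = sum(r[c] for r in resp) + 1e-12  # guard against empty components
            weights[c] = nc / n
            for j in range(d):
                mu = sum(r[c] * p[j] for r, p in zip(resp, points)) / nc
                var = sum(r[c] * (p[j] - mu) ** 2 for r, p in zip(resp, points)) / nc
                means[c][j] = mu
                variances[c][j] = max(var, min_var)
    total_w = sum(weights)
    weights = [w / total_w for w in weights]
    return weights, means, variances


def sample_gmm(weights, means, variances, n, seed=0):
    """Ancestral sampling: pick a component, then draw from its Gaussian."""
    rng = random.Random(seed)
    samples = []
    for _ in range(n):
        c = rng.choices(range(len(weights)), weights=weights)[0]
        samples.append([rng.gauss(m, math.sqrt(v))
                        for m, v in zip(means[c], variances[c])])
    return samples


# Synthetic stand-in for encoded training latents (assumption: 2-D, two clusters).
_rng = random.Random(0)
latents = [[_rng.gauss(cx, 0.5), _rng.gauss(cy, 0.5)]
           for cx, cy in [(-2.0, 0.0), (2.0, 1.0)] for _ in range(100)]

weights, means, variances = fit_diag_gmm(latents, k=10)
samples = sample_gmm(weights, means, variances, n=5)
# Each entry of `samples` would then be passed through the trained decoder.
```

Because the mixture is fit only after training, it leaves the learned encoder and decoder untouched and changes only how latents are drawn at sampling time.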