1 Introduction
Deep Gaussian Latent Models (Rezende et al., 2014), also known as Variational Autoencoders (VAEs) (Kingma & Welling, 2014), fall within the paradigm of Maximum Likelihood Estimation (MLE) and are often applied to computer vision problems. However, training with MLE usually leads to overestimation of the entropy of the data distribution (Minka, 2005). This is an undesirable property, as natural images are usually assumed to lie on a lower-dimensional manifold. The additional entropy (together with other simplifying modeling assumptions made for the purpose of explicit density estimation) often leads to a marginal likelihood whose probability mass is spread over regions of the data space that have no support in the training data, which causes blurry samples. These observations motivate the design of more flexible, complex families of model densities.
Since a continuous latent variable is introduced to the model, VAEs can be interpreted as infinite mixture models in which the parameters of the class-conditional distribution are functions of the latent variable (which is thought of as the class here), and there are infinitely many classes. Such models should in theory have enough flexibility to capture highly complex distributions such as image manifolds, but in practice they are overshadowed, in terms of sample generation quality, by tractable density models such as autoregressive models (Van Den Oord et al., 2016) and by Generative Adversarial Networks (GANs) (Goodfellow et al., 2014).
It is believed that the relatively poor sample quality stems from the fact that introducing a latent representation requires approximate inference, and the model distribution is biased by the simplified posterior densities (Buntine & Jakulin, 2004); i.e. training is achieved by maximizing the variational lower bound on the marginal log likelihood:
$$\log p_\theta(x) \;\geq\; \mathbb{E}_{q_\phi(z|x)}\left[\log p_\theta(x|z)\right] - \mathrm{KL}\left(q_\phi(z|x)\,\|\,p(z)\right) \;=:\; \mathcal{L}(\theta,\phi;x), \qquad (1)$$
where subscripts denote the parameters of the associated distributions.
We discuss two aspects of training with the bound. First, maximizing (1) with respect to $\phi$ amounts to minimizing $\mathrm{KL}(q_\phi(z|x)\,\|\,p_\theta(z|x))$; the variational distribution, $q_\phi(z|x)$, can thus be viewed as an approximation to the true posterior, $p_\theta(z|x)$. Simplifying $q_\phi(z|x)$ (e.g. by using a factorial Gaussian, as is common practice) is problematic, as the marginal log likelihood of interest can only be optimized to the extent that we are able to approximate the true posterior with the variational distribution. This motivates a direct improvement of variational inference (Rezende & Mohamed, 2015; Ranganath et al., 2015; Kingma et al., 2016).
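The first point can be made explicit with the standard decomposition of the marginal log likelihood, given here as a short worked identity. The symbols $\theta$, $\phi$ and $\mathcal{L}$ follow common VAE notation and are an assumption wherever the text does not spell them out.

```latex
\begin{aligned}
\log p_\theta(x)
  &= \mathbb{E}_{q_\phi(z|x)}\left[\log \frac{p_\theta(x,z)}{q_\phi(z|x)}\right]
   + \mathbb{E}_{q_\phi(z|x)}\left[\log \frac{q_\phi(z|x)}{p_\theta(z|x)}\right] \\
  &= \mathcal{L}(\theta,\phi;x)
   + \mathrm{KL}\left(q_\phi(z|x)\,\|\,p_\theta(z|x)\right).
\end{aligned}
```

Since the left-hand side does not depend on $\phi$, any increase of the bound with respect to $\phi$ lowers the KL term by the same amount, and the bound is tight only when the variational distribution equals the true posterior.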
Second, during training of the VAE, only a part of the latent space is explored. When marginalizing out the input vector $x$, we recover the marginal $q(z) = \int q_\phi(z|x)\,q(x)\,dx$, where $q(x)$ indicates the true data distribution. When the marginal approximate posterior fails to fill up the prior as the prior-contractive term requires, one risks sampling from untrained regions of the latent space. A direct and nonparametric way to avoid sampling from such regions of the prior would be to take $q(z)$ as the prior, but the integral is intractable and the data distribution is only partially specified by limited training data. Even if we take the empirical distribution of $x$, we would have a mixture model of up to $N$ components, where $N$ is the number of training data points, which would be impractical given the scale of modern machine learning tasks. A workaround is to take a random subset of the training data, or to introduce a learnable set of pseudodata $\{u_k\}_{k=1}^{K}$ of size $K$, and set the prior to be $p(z) = \frac{1}{K}\sum_{k=1}^{K} q_\phi(z|u_k)$, which was shown to be promising in the recent work of Tomczak & Welling (2017). Another approach is to directly regularize the autoencoder by matching the aggregated posterior with the prior, as in Makhzani et al. (2015).
In this paper, we make two main contributions. First, we analyze the effect of making the prior learnable. We show that training with the variational lower bound under some limit conditions matches the marginal approximate posterior with the prior, which is desirable from the generative modeling point of view. We then decompose the lower bound and show that updating the prior alone brings the prior closer to the marginal approximate posterior, suggesting that a trainable prior is beneficial to both sample generation and inference. Our second contribution is to prove that, by using the family of inverse AF (Kingma et al., 2016), one can universally approximate any posterior. This theoretically justifies the use of inverse AF to improve variational inference. We unify the two aspects and propose to use invertible functions (Dinh et al., 2016; Kingma et al., 2016) to parameterize explicit densities for both the prior and the approximate posterior.
2 Marginal Matching Prior
We claim that maximizing the variational lower bound explicitly matches the marginal $q(z)$ with the prior $p(z)$. By decomposing the lower bound, we then suggest using a learnable prior to improve sampling, i.e. to have a prior that matches the marginal instead.
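Let us define the encoding and decoding distributions as $q_\phi(z|x)$ and $p_\theta(x|z)$ respectively, the prior as $p(z)$ and the data distribution as $q(x)$. Our goal is to train an autoencoder as a generative model by keeping the marginal $q(z) = \int q_\phi(z|x)\,q(x)\,dx$ close to the prior. This can be achieved at the limits of the following two conditions (Hoffman & Johnson, 2016):
$$q_\phi(z|x) \to p_\theta(z|x) \quad \text{and} \quad p_\theta(x) \to q(x).$$
In words, given a perfect approximate posterior of $p_\theta(z|x)$ and a perfect marginal likelihood of $q(x)$, the marginal $q(z)$ converges to the prior, i.e.
$$q(z) = \int q_\phi(z|x)\,q(x)\,dx \;\to\; \int p_\theta(z|x)\,p_\theta(x)\,dx = p(z). \qquad (2)$$
That is, to have $q(z) \to p(z)$, we need to ensure that the two conditions are satisfied. We can cast this as an optimization problem by minimizing the KL-divergences:
$$\mathbb{E}_{q(x)}\left[\mathrm{KL}\left(q_\phi(z|x)\,\|\,p_\theta(z|x)\right)\right] + \mathrm{KL}\left(q(x)\,\|\,p_\theta(x)\right) = -\mathbb{E}_{q(x)}\left[\mathcal{L}(\theta,\phi;x)\right] - \mathcal{H}[q(x)], \qquad (3)$$
where $\mathcal{L}(\theta,\phi;x)$ denotes the variational lower bound in (1) and $\mathcal{H}[q(x)]$ is the entropy of the data distribution, which does not depend on the model.
The equality is a direct result of rearranging terms. What (3) implies is that maximizing the variational lower bound brings us to the limit conditions under which the marginal approximate posterior should match the prior, given enough flexibility in the assumed form of the densities.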
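The rearrangement behind (3) can be spelled out term by term. This is a sketch under assumed notation: $q(x)$ for the data distribution, $\mathcal{L}(\theta,\phi;x)$ for the lower bound in (1), and $\mathcal{H}$ for (differential) entropy.

```latex
\begin{aligned}
\mathbb{E}_{q(x)}\left[\mathrm{KL}\left(q_\phi(z|x)\,\|\,p_\theta(z|x)\right)\right]
  &= \mathbb{E}_{q(x)}\left[\log p_\theta(x) - \mathcal{L}(\theta,\phi;x)\right], \\
\mathrm{KL}\left(q(x)\,\|\,p_\theta(x)\right)
  &= -\mathcal{H}[q(x)] - \mathbb{E}_{q(x)}\left[\log p_\theta(x)\right], \\
\text{sum of the two lines}
  &= -\mathbb{E}_{q(x)}\left[\mathcal{L}(\theta,\phi;x)\right] - \mathcal{H}[q(x)].
\end{aligned}
```

The $\log p_\theta(x)$ terms cancel, so minimizing the two divergences jointly is the same as maximizing the expected lower bound, up to a constant that depends only on the data.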
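Now if we maximize the lower bound in (3) with respect to the prior $p(z)$ while holding $q_\phi(z|x)$ and $p_\theta(x|z)$ fixed, as in coordinate ascent, the samples drawn from the doubly stochastic process $x \sim q(x)$, $z \sim q_\phi(z|x)$ can be thought of as a projected data distribution that we want to model using the prior distribution:
$$\arg\max_{p(z)} \; \mathbb{E}_{q(x)}\,\mathbb{E}_{q_\phi(z|x)}\left[\log p(z)\right] \;=\; \arg\min_{p(z)} \; \mathrm{KL}\left(q(z)\,\|\,p(z)\right). \qquad (4)$$
As a result, having a learnable prior allows us to sample from the marginal approximate posterior if the above divergence goes to zero.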
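The link between updating the prior and matching the marginal approximate posterior can be checked numerically on a toy problem. The sketch below uses a discrete latent variable so that every quantity is an exact sum; the sizes, the random encoder table and all variable names are illustrative assumptions, not part of the method.

```python
import numpy as np

def kl(q, p):
    """KL(q || p) for discrete distributions with full support."""
    return float(np.sum(q * (np.log(q) - np.log(p))))

def entropy(q):
    """Shannon entropy of a discrete distribution with full support."""
    return float(-np.sum(q * np.log(q)))

rng = np.random.default_rng(0)

# Toy setup (hypothetical): 5 data points, 4 discrete latent states.
n_data, n_latent = 5, 4
q_x = np.full(n_data, 1.0 / n_data)                     # empirical data distribution
q_z_given_x = rng.dirichlet(np.ones(n_latent), n_data)  # encoder table, one row per x
q_z = q_x @ q_z_given_x                                  # marginal approximate posterior

def prior_objective(p_z):
    """E_{q(x)} E_{q(z|x)} [log p(z)]: the only lower-bound term involving the prior."""
    return float(np.sum(q_x[:, None] * q_z_given_x * np.log(p_z)[None, :]))

p_fixed = np.full(n_latent, 1.0 / n_latent)  # a fixed uniform prior
p_learned = q_z.copy()                       # prior set to the marginal q(z)

# The prior term equals -KL(q(z) || p(z)) - H[q(z)].
lhs = prior_objective(p_fixed)
rhs = -kl(q_z, p_fixed) - entropy(q_z)
gap_fixed = kl(q_z, p_fixed)
gap_learned = kl(q_z, p_learned)
print(lhs, rhs, gap_fixed, gap_learned)
```

Because the entropy of the marginal does not depend on the prior, raising the prior term can only happen by lowering the divergence, which is why the prior that equals the marginal attains the maximum in this toy case.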
Another advantage of a learnable prior can be visualized by the cartoon plot in Figure 1. When we fix the approximate posterior and update the prior such that it becomes closer to the marginal approximate posterior, the prior concentrates its probability mass in such a way that the true posterior becomes closer to the approximate posterior, as $p_\theta(z|x) \propto p_\theta(x|z)\,p(z)$. In other words, the region of high posterior density not covered by the approximate posterior is reduced, which effectively means that our proposal, the variational distribution, can be improved by a better prior that simplifies the true posterior.
3 Inverse Autoregressive Flows as Universal Posterior Approximator
In Kingma et al. (2016), a powerful family of invertible functions called Inverse Autoregressive Flows (inverse AF or IAF) was introduced to improve variational inference. It is thus of practical and fundamental importance to understand the benefits of using inverse AF and how to improve them.
In this section, we show that normalizing flows from a base distribution (such as a uniform distribution) under autoregressive assumptions are universal approximators of any density (as suggested in Goodfellow (2017)), given enough capacity when a neural network is used to parameterize the nonlinear dependencies.
Lemma 1.
Existence of a solution to the nonlinear independent component analysis problem.
Given a random vector $x$, there always exists a mapping $f$ from $\mathbb{R}^n$ to $\mathbb{R}^n$ such that the components of the random vector $y = f(x)$ are statistically independent.
Proof.
See Hyvärinen & Pajunen (1998) for the full proof. Here we point out that the transformation used in the proof falls within the family of autoregressive functions: $y_i = F_i(x_i \mid x_{1:i-1})$ for $i = 1, \dots, n$, where $F_i$ is the conditional CDF of $x_i$ given $x_{1:i-1}$. Then any distribution of a random variable $x$ can be warped into an independent distribution via the CDFs, specifically by a kind of Gram-Schmidt-process-like construction. ∎
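The conditional-CDF construction in the proof of Lemma 1 can be illustrated on a case where the conditional CDFs are available in closed form. The sketch below assumes a bivariate Gaussian with unit variances and correlation `rho` (an illustrative choice, not taken from the paper) and maps it to two approximately independent uniform variables.

```python
import math
import numpy as np

# Standard normal CDF, vectorized over numpy arrays.
phi = np.vectorize(lambda t: 0.5 * (1.0 + math.erf(t / math.sqrt(2.0))))

def conditional_cdf_map(x, rho):
    """Autoregressive map y1 = F(x1), y2 = F(x2 | x1) for a bivariate Gaussian.

    Uses the closed-form conditional x2 | x1 ~ N(rho * x1, 1 - rho**2).
    """
    y1 = phi(x[:, 0])
    y2 = phi((x[:, 1] - rho * x[:, 0]) / math.sqrt(1.0 - rho ** 2))
    return np.stack([y1, y2], axis=1)

rng = np.random.default_rng(0)
rho = 0.8  # illustrative correlation
cov = np.array([[1.0, rho], [rho, 1.0]])
x = rng.multivariate_normal(np.zeros(2), cov, size=20000)
y = conditional_cdf_map(x, rho)

corr_x = np.corrcoef(x.T)[0, 1]  # close to rho
corr_y = np.corrcoef(y.T)[0, 1]  # close to zero after the warp
print(corr_x, corr_y, y.mean(axis=0), y.var(axis=0))
```

Each output coordinate should look uniform on (0, 1), with mean near 1/2 and variance near 1/12. A vanishing correlation is only a necessary check for independence, not a proof of it; the lemma itself provides the guarantee.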
Proposition 1.
Inverse autoregressive transformations as universal approximators of any density. Let $X$ be a random vector in an open set $U \subset \mathbb{R}^n$. We assume that $X$ has a positive and continuous probability density. There exists a sequence of mappings $(G_k)_{k \geq 1}$ from $(0,1)^n$ to $U$, parametrized by autoregressive neural networks, such that the sequence $X_k = G_k(Y)$, where $Y \sim \mathrm{Unif}((0,1)^n)$, converges in distribution to $X$.
Proof.
We consider the mapping $f$ defined in the proof of Lemma 1. As $f$ is autoregressive, the Jacobian of $f$ is a triangular matrix whose diagonal entries are equal to the conditional densities, which are positive by assumption. The determinant of the Jacobian, which is equal to the product of the diagonal entries, is therefore positive. By the inverse function theorem, $f$ is locally invertible. As $f$ is also injective (which follows from the bijectivity of the CDFs), $f$ is globally invertible; let $g$ denote its inverse. $g$ is an autoregressive function, and by the universal approximation theorem (Cybenko, 1989) there exists a sequence of mappings $(G_k)_{k \geq 1}$ from $(0,1)^n$ to $U$, parametrized by autoregressive neural networks, that converges uniformly to $g$. Let $X_k = G_k(Y)$ where $Y \sim \mathrm{Unif}((0,1)^n)$. Let $h$ be a real-valued bounded continuous function on $U$. Uniform convergence implies that $G_k$ converges pointwise to $g$, so by continuity of $h$, $h(G_k(Y))$ converges pointwise to $h(g(Y))$. As $h$ is bounded, the dominated convergence theorem gives that $\mathbb{E}[h(X_k)]$ converges to $\mathbb{E}[h(g(Y))] = \mathbb{E}[h(X)]$. As this holds for every bounded continuous function $h$, $X_k$ converges to $X$ in distribution. ∎
Note that the transformation is usually parameterized as an invertible function, at the expense of flexibility, in order to have a tractable Jacobian. Special designs of such a function, other than the affine transformation (Kingma et al., 2016), could be made to improve the flow; otherwise one would need to compose multiple layers of transformations to obtain a richer distribution family. Our proof shows that, with carefully designed approximate posteriors, VAEs could be asymptotically consistent.
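To make the tractable-Jacobian point concrete, the following is a minimal sketch of a single affine autoregressive step of the kind used in inverse AF. It assumes one masked linear map for the shift and one for the log-scale, a deliberately small stand-in for a MADE-style network; the function names are hypothetical. The closed-form log-determinant is checked against a finite-difference Jacobian.

```python
import numpy as np

def affine_autoregressive_step(eps, W_mu, W_s):
    """One affine autoregressive transform z = mu(eps) + sigma(eps) * eps.

    mu_i and sigma_i depend only on eps_{1:i-1} because the weight matrices
    are masked to be strictly lower triangular.
    """
    mask = np.tril(np.ones_like(W_mu), k=-1)
    mu = (W_mu * mask) @ eps
    log_sigma = np.tanh((W_s * mask) @ eps)  # bounded log-scale for stability
    z = mu + np.exp(log_sigma) * eps
    log_det = np.sum(log_sigma)              # log|det dz/d eps| of a triangular Jacobian
    return z, log_det

def numerical_jacobian(fn, x, h=1e-6):
    """Central finite-difference Jacobian, used only to check the log-determinant."""
    n = x.size
    jac = np.zeros((n, n))
    for j in range(n):
        step = np.zeros(n)
        step[j] = h
        jac[:, j] = (fn(x + step) - fn(x - step)) / (2.0 * h)
    return jac

rng = np.random.default_rng(0)
dim = 4
W_mu, W_s = rng.normal(size=(dim, dim)), rng.normal(size=(dim, dim))
eps = rng.normal(size=dim)

z, log_det = affine_autoregressive_step(eps, W_mu, W_s)
jac = numerical_jacobian(lambda e: affine_autoregressive_step(e, W_mu, W_s)[0], eps)
sign, numeric_log_det = np.linalg.slogdet(jac)
print(log_det, numeric_log_det)
```

The determinant costs only a sum over the log-scales, which is what makes the affine form attractive; the price is that each step can only shift and rescale a coordinate given the preceding ones.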
4 Proposed Method
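As suggested in Sections 2 and 3, we propose to use one-to-one correspondences to define a learnable explicit density (LED) model for both inference and sample generation. First, inspired by (4), we observe that updating the prior alone is reminiscent of MLE. One can think of the data points projected onto the latent space via Monte Carlo sampling as a data distribution in the latent space $\mathcal{Z}$. A unimodal prior tends to overestimate the entropy of $q(z)$. A powerful family of real-valued non-volume preserving (Real NVP) transformations (Dinh et al., 2016) can be applied to real variables. It is thus natural to incorporate Real NVP into VAEs to jointly train an explicit density model as the prior. We define the prior (and also the approximate posterior) with the change of variables formula: $p(z) = p_0(\epsilon)\left|\det \frac{\partial g(\epsilon)}{\partial \epsilon}\right|^{-1}$, where $z = g(\epsilon)$ and $g$ is invertible. To compute the density of the projected data distribution, we inversely ($\epsilon = g^{-1}(z)$) transform the samples into the base variable with tractable density $p_0$ (Dinh et al., 2014). We define the posterior likewise, as in Rezende & Mohamed (2015), with $z = f(z_0)$ and $z_0 \sim q_0(z_0|x)$. Objective (1) can thus be modified as
$$\mathcal{L}(x) = \mathbb{E}_{q_0(z_0|x)}\left[\log p_\theta(x|z) + \log p_0\left(g^{-1}(z)\right) + \log\left|\det\frac{\partial g^{-1}(z)}{\partial z}\right| - \log q_0(z_0|x) + \log\left|\det\frac{\partial f(z_0)}{\partial z_0}\right|\right], \quad z = f(z_0). \qquad (5)$$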
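A flow-based prior of this kind can be sketched with affine coupling layers in the style of Real NVP. The code below is a minimal NumPy illustration under stated assumptions: the scale and shift functions are single linear maps rather than the hidden-layer networks used in practice, the base density is a standard normal, and the class and function names are hypothetical.

```python
import numpy as np

class AffineCoupling:
    """Real NVP-style affine coupling layer with a fixed binary mask.

    The scale and shift functions are single linear maps here, a hypothetical
    stand-in for the hidden-layer networks used in practice.
    """

    def __init__(self, mask, rng):
        self.mask = mask
        dim = mask.size
        self.W_s = 0.1 * rng.normal(size=(dim, dim))
        self.W_t = 0.1 * rng.normal(size=(dim, dim))

    def _scale_shift(self, kept):
        # Only the dimensions with mask == 0 are transformed.
        s = np.tanh(kept @ self.W_s) * (1.0 - self.mask)
        t = (kept @ self.W_t) * (1.0 - self.mask)
        return s, t

    def forward(self, eps):
        """eps -> z (sampling direction); returns z and log|det dz/d eps|."""
        s, t = self._scale_shift(eps * self.mask)
        z = eps * self.mask + (1.0 - self.mask) * (eps * np.exp(s) + t)
        return z, s.sum(axis=-1)

    def inverse(self, z):
        """z -> eps (density evaluation direction); returns eps and log|det d eps/dz|."""
        s, t = self._scale_shift(z * self.mask)
        eps = z * self.mask + (1.0 - self.mask) * ((z - t) * np.exp(-s))
        return eps, -s.sum(axis=-1)

def flow_log_prob(z, layers):
    """log p(z) under a standard-normal base density pushed through the layers."""
    log_det_total = np.zeros(z.shape[0])
    for layer in reversed(layers):
        z, log_det = layer.inverse(z)
        log_det_total += log_det
    log_base = -0.5 * np.sum(z ** 2, axis=-1) - 0.5 * z.shape[-1] * np.log(2.0 * np.pi)
    return log_base + log_det_total

rng = np.random.default_rng(0)
dim = 4
masks = [np.array([1.0, 0.0, 1.0, 0.0]), np.array([0.0, 1.0, 0.0, 1.0])]
layers = [AffineCoupling(m, rng) for m in masks]

eps = rng.normal(size=(3, dim))  # base samples
z = eps
for layer in layers:             # sampling: base variable -> latent variable
    z, _ = layer.forward(z)
log_p = flow_log_prob(z, layers)  # density evaluation: latent -> base variable
```

Sampling runs the layers forward from the base variable, while evaluating the density of an encoder sample runs them in reverse and accumulates the log-determinants, which is the direction needed when the prior is trained on the projected data.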
For permutation-invariant latent variables, the transformation is implemented with random masks. For latent variables that preserve spatial correlation when a convolutional network is used, we choose to use a checkerboard-style mask (Dinh et al., 2016; Agrawal & Dukkipati, 2016). Interestingly, sampling from such models is similar to block Gibbs sampling for energy-based models (e.g. Ising models) that define the correlation between adjacent pixels.
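The two masking schemes can be generated in a few lines. This is an illustrative sketch only; the helper names and the choice to keep exactly half of the dimensions in the random mask are assumptions.

```python
import numpy as np

def checkerboard_mask(height, width, flip=False):
    """Binary mask whose entries alternate like a checkerboard."""
    rows = np.arange(height)[:, None]
    cols = np.arange(width)[None, :]
    mask = (rows + cols) % 2
    return (1 - mask if flip else mask).astype(np.float32)

def random_mask(dim, rng):
    """Random binary mask that keeps half of the dimensions (rounded down)."""
    mask = np.zeros(dim, dtype=np.float32)
    mask[rng.permutation(dim)[: dim // 2]] = 1.0
    return mask

rng = np.random.default_rng(0)
cb = checkerboard_mask(4, 4)
cb_flipped = checkerboard_mask(4, 4, flip=True)
rm = random_mask(50, rng)
print(cb)
print(rm)
```

Alternating a mask with its flipped version across layers lets every latent dimension be transformed, with each half conditioned on the other, which is the structure the block Gibbs analogy refers to.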
Second, for the posterior distribution, we construct the transformation by inverse AF, which is parallelizable when combined with MADE (Germain et al., 2015) or PixelCNN (Van Den Oord et al., 2016). In fact, inverse AF can be thought of as a generalization of Real NVP, as the Jacobian of the masked operation used in Real NVP is triangular.
| MLP: $L_{post}$ | MLP: NLL | MLP: $L_{prior}$ | MLP: NLL | ResConv: $L_{prior}$ | ResConv: NLL |
|---|---|---|---|---|---|
| 0 | 90.78 | 0 | 90.78 | 0 | 83.11 |
| 4 | 88.89 | 4 | 88.07 | 4 | 81.87 |
| 8 | 88.71 | 8 | 87.47 | 8 | 81.70 |
| 12 | 88.70 | 12 | 86.59 | 12 | 81.44 |

Table 1: $L_{post}$: number of NVP layers used for the posterior. $L_{prior}$: number of NVP layers used for the prior. One hidden layer of 100 nodes was used for each layer of transformation. For the multilayer perceptron (MLP), two hidden layers with 200 nodes were used and the dimension of the latent variable is 50. Rectifiers are used as the nonlinear activation. For the Residual ConvNet (ResConv), we have 3 layers of residual strided convolution (He et al., 2015) with [16, 32, 32] feature maps, using filters of size 3×3. Before the stochastic layer, a hidden layer of 450 nodes is used. The dimension of the latent variable is 32. We use exponential linear units (Clevert et al., 2015) as the nonlinearity.

| $L_{prior}$ | $L_{post}$ | NLL |
|---|---|---|
| 4 NVP | 4 NVP | 81.81 |
| 8 NVP | 8 NVP | 81.55 |
| 8 NVP | 8 MADE | 80.81 |
| 16 NVP | 16 MADE | 80.60 |

Table 2: ResConv.
5 Experiments
Mixture of Bivariate Gaussians. We experiment on a Gaussian mixture toy example and visualize the effect of having a learnable prior in Figure 2. During training, we observe that models with a flexible prior are easier to train than models with a flexible posterior. Our first conjecture is that, to refine the posterior density, we only draw one sample of $z$ for each data point $x$, whereas refining the prior density can be viewed as modeling the projected data distribution and thus depends on as many samples as there are in the training set. Second, it might be due to the kind of transformation and the divergences that are used. To learn the posterior, we implicitly minimize $\mathrm{KL}(q_\phi(z|x)\,\|\,p_\theta(z|x))$, which is zero-forcing, since samples in regions of low target density are heavily penalized. If $q_\phi(z|x)$ begins with a sharper shape, it pays a high penalty when expanding to move to another mode. It is thus easy for the distribution to get stuck in local minima if the true posterior is multimodal, while learning the prior does not have this mode-seeking problem, since the forward KL in (4) is zero-avoiding.
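The difference between the zero-forcing and the zero-avoiding divergence can be seen on a one-dimensional toy problem. The sketch below is an illustration under assumed settings, not the experiment of Figure 2: a single Gaussian is fitted to a bimodal target by brute-force search over its mean and standard deviation, once for each direction of the KL divergence evaluated on a grid.

```python
import numpy as np

def gaussian_pdf(z, mean, std):
    return np.exp(-0.5 * ((z - mean) / std) ** 2) / (std * np.sqrt(2.0 * np.pi))

# Bimodal target on a grid: equal mixture of N(-3, 0.5^2) and N(3, 0.5^2).
grid = np.linspace(-10.0, 10.0, 2001)
dz = grid[1] - grid[0]
target = 0.5 * gaussian_pdf(grid, -3.0, 0.5) + 0.5 * gaussian_pdf(grid, 3.0, 0.5)

def kl_on_grid(a, b, floor=1e-300):
    """Riemann-sum approximation of KL(a || b) for densities tabulated on the grid."""
    a = np.maximum(a, floor)
    b = np.maximum(b, floor)
    return float(np.sum(a * (np.log(a) - np.log(b))) * dz)

# Brute-force search over single-Gaussian fits (mean, std).
means = np.linspace(-4.0, 4.0, 81)
stds = np.linspace(0.3, 4.0, 75)
best = {"reverse": (np.inf, None), "forward": (np.inf, None)}
for m in means:
    for s in stds:
        fit = gaussian_pdf(grid, m, s)
        reverse = kl_on_grid(fit, target)  # KL(fit || target): zero-forcing
        forward = kl_on_grid(target, fit)  # KL(target || fit): zero-avoiding
        if reverse < best["reverse"][0]:
            best["reverse"] = (reverse, (m, s))
        if forward < best["forward"][0]:
            best["forward"] = (forward, (m, s))

reverse_mean, reverse_std = best["reverse"][1]
forward_mean, forward_std = best["forward"][1]
print(reverse_mean, reverse_std, forward_mean, forward_std)
```

In this toy setting the zero-forcing fit locks onto one of the two modes with a narrow width, while the zero-avoiding fit centres between the modes and spreads to cover both, which mirrors the mode-seeking behaviour discussed above.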
MNIST. We also tested our proposed method on binarized MNIST (Larochelle & Murray, 2011), and report the estimated negative log likelihood (NLL) as the evaluation metric.
We compare the effects of adding more invertible transformation layers to either the prior or the posterior (see Table 1), or to both (Table 2). From Table 1, we see that models with a flexible prior easily outperform models with a flexible posterior. The likelihood of a model with a flexible prior can be further improved by using an expressive posterior (Table 2), such as Real NVP (81.70 → 81.55), or MADE, which introduces more autoregressive dependencies (81.55 → 80.81).
6 Discussion and Future Work
In this paper, we first reinterpret training with the variational lower bound as layer-wise density estimation. Treating the Monte Carlo samples from the approximate posterior distributions as a projected data distribution suggests using a flexible prior to avoid overestimating the entropy. We leave experiments on larger datasets and on sample generation as future work. Second, we showed that parameterizing inverse AF with neural networks allows us to universally approximate any posterior, which theoretically justifies the use of inverse AF. Our proof also implies that using an affine coupling law to autoregressively warp the distribution is limiting. It is thus worthwhile to consider designs of more flexible invertible functions to improve the approximate posterior.
Acknowledgements
We thank NVIDIA for donating a DGX-1 computer used in this work.
References
Agrawal, Siddharth and Dukkipati, Ambedkar. Deep Variational Inference Without Pixel-Wise Reconstruction. ArXiv e-prints, November 2016.
Buntine, Wray and Jakulin, Aleks. Applying discrete PCA in data analysis. In Proceedings of the 20th Conference on Uncertainty in Artificial Intelligence, UAI '04, pp. 59–66, Arlington, Virginia, United States, 2004. AUAI Press. ISBN 0974903906.
Clevert, Djork-Arné, Unterthiner, Thomas, and Hochreiter, Sepp. Fast and accurate deep network learning by exponential linear units (ELUs). CoRR, abs/1511.07289, 2015.
Cybenko, G. Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals, and Systems, 2:303–314, 1989.
Dinh, Laurent, Krueger, David, and Bengio, Yoshua. NICE: Nonlinear Independent Components Estimation. International Conference on Learning Representation, 1(2):1–12, 2014.
Dinh, Laurent, Sohl-Dickstein, Jascha, and Bengio, Samy. Density estimation using real NVP. International Conference on Learning Representation, abs/1605.08803, 2016.
Germain, Mathieu, Gregor, Karol, Murray, Iain, and Larochelle, Hugo. MADE: Masked autoencoder for distribution estimation. In Proceedings of the 32nd International Conference on Machine Learning, 37, ICML'15, pp. 881–889, 2015.
Goodfellow, Ian, Pouget-Abadie, Jean, Mirza, Mehdi, Xu, Bing, Warde-Farley, David, Ozair, Sherjil, Courville, Aaron, and Bengio, Yoshua. Generative adversarial nets. In Advances in Neural Information Processing Systems 27, pp. 2672–2680. Curran Associates, Inc., 2014.
Goodfellow, Ian J. NIPS 2016 tutorial: Generative adversarial networks. CoRR, abs/1701.00160, 2017.
He, Kaiming, Zhang, Xiangyu, Ren, Shaoqing, and Sun, Jian. Deep residual learning for image recognition. CoRR, abs/1512.03385, 2015.
Hoffman, Matthew D. and Johnson, Matthew J. ELBO surgery: yet another way to carve up the variational evidence lower bound. 2016.
Hyvärinen, Aapo and Pajunen, Petteri. Nonlinear Independent Component Analysis: Existence and Uniqueness Results. Neural Networks, 1998.
Kingma, Diederik P and Welling, Max. Auto-Encoding Variational Bayes. International Conference on Learning Representation, (Ml):1–14, 2014.
Kingma, Diederik P, Salimans, Tim, and Welling, Max. Improving Variational Inference with Inverse Autoregressive Flow. Advances in Neural Information Processing Systems 27, (2011):1–8, 2016.
Larochelle, Hugo and Murray, Iain. The neural autoregressive distribution estimator. In The Proceedings of the 14th International Conference on Artificial Intelligence and Statistics, volume 15 of JMLR: W&CP, pp. 29–37, 2011.
Makhzani, Alireza, Shlens, Jonathon, Jaitly, Navdeep, and Goodfellow, Ian J. Adversarial autoencoders. CoRR, abs/1511.05644, 2015.
Minka, Tom. Divergence measures and message passing. Technical report, January 2005.
Ranganath, Rajesh, Tran, Dustin, and Blei, David M. Hierarchical variational models. arXiv preprint arXiv:1511.02386, 2015.
Rezende, Danilo Jimenez and Mohamed, Shakir. Variational Inference with Normalizing Flows. Proceedings of the 32nd International Conference on Machine Learning, 37:1530–1538, 2015.
Rezende, Danilo Jimenez, Mohamed, Shakir, and Wierstra, Daan. Stochastic backpropagation and approximate inference in deep generative models. In Proceedings of the 31st International Conference on Machine Learning, Volume 32, ICML'14. JMLR.org, 2014.
Tomczak, Jakub M. and Welling, Max. VAE with a VampPrior. CoRR, abs/1705.07120, 2017.
Van Den Oord, Aäron, Kalchbrenner, Nal, and Kavukcuoglu, Koray. Pixel Recurrent Neural Networks. 2016.