1 Introduction
Deep generative latent-variable models are important building blocks in current approaches to a host of challenging high-dimensional problems, including density estimation [1, 2, 3, 4, 5] and representation learning for downstream tasks [6, 7, 8, 9]. To train these models, the principle of maximum likelihood is often employed. However, maximum likelihood is frequently intractable due to the difficulty of marginalizing the latent variables. Variational Bayes addresses this by instead providing a tractable lower bound of the log-likelihood, which serves as a surrogate target for maximization. Variational Bayes, however, introduces a per-sample optimization subroutine to find the variational proposal distribution that best matches the true posterior distribution (of the latent variable given an input observation). To amortize the cost of this optimization subroutine, the variational autoencoder introduces an amortized inference model that learns to predict the best proposal distribution given an input observation [1, 10, 11, 12].
Although the computational efficiency of amortized inference has enabled latent-variable models to be trained at scale on large datasets [13, 14], amortization introduces an additional source of error in the approximation of the posterior distributions if the amortized inference model fails to predict the optimal proposal distribution. This additional source of error, referred to as the amortization gap [15], causes variational autoencoder training to further deviate from maximum likelihood training [15, 16].
To improve training, numerous methods have been developed to reduce the amortization gap. In this paper, we focus on a class of methods [17, 18, 19] that takes an initial proposal distribution predicted by the amortized inference model and refines this initial distribution with the application of Stochastic Variational Inference (SVI) [20]. Since SVI applies gradient ascent to iteratively update the proposal distribution, a byproduct of this procedure is a trajectory of proposal distributions and their corresponding importance weights. The intermediate distributions are discarded, and only the last distribution is retained for updating the generative model. Our key insight is that the intermediate importance weights can be repurposed to further improve training. Our contributions are as follows:

- We propose a new method, Buffered Stochastic Variational Inference (BSVI), that takes advantage of the intermediate importance weights and constructs a new lower bound (the BSVI bound).
- We show that the BSVI bound is a special instance of a family of generalized importance-weighted lower bounds.
- We show that training variational autoencoders with BSVI consistently outperforms SVI, demonstrating the effectiveness of leveraging the intermediate weights.
Our paper shows that BSVI is an attractive replacement for SVI with minimal development and computational overhead.
2 Background and Notation
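We consider a latent-variable generative model $p_\theta(x, z)$ where $x \in \mathcal{X}$ is observed, $z \in \mathcal{Z}$ is latent, and $\theta$ are the model's parameters. The marginal likelihood $p_\theta(x)$ is intractable but can be lower bounded by the evidence lower bound (ELBO)
$$\ln p_\theta(x) \ge \mathbb{E}_{q(z)}\left[\ln \frac{p_\theta(x, z)}{q(z)}\right] = \mathbb{E}_{q(z)}\left[\ln w(z)\right], \tag{1}$$
which holds for any distribution $q(z)$. Since the gap of this bound is exactly the Kullback-Leibler divergence $D(q(z) \,\|\, p_\theta(z \mid x))$, $q(z)$ is thus the variational approximation of the posterior. Furthermore, by viewing $q$ as a proposal distribution in an importance sampler, we refer to $w(z) = p_\theta(x, z) / q(z)$ as an unnormalized importance weight. Since $w(z)$ is a random variable, the variance can be reduced by averaging the importance weights derived from $k$ i.i.d. samples from $q(z)$. This yields the Importance-Weighted Autoencoder (IWAE) bound [21],
$$\ln p_\theta(x) \ge \mathbb{E}_{z_1, \ldots, z_k \sim q(z)}\left[\ln \frac{1}{k} \sum_{i=1}^{k} w(z_i)\right]. \tag{2}$$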
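As a numerical illustration of the ELBO and the IWAE bound, the following sketch estimates both by Monte Carlo. The toy model is an assumption made for illustration and is not taken from the paper: prior N(0, 1) and likelihood N(z, 1), for which the exact marginal N(0, 2) is known. The proposal parameters `mu` and `sigma` and all helper names are hypothetical.

```python
import numpy as np

def log_normal(v, mean, std):
    """Log-density of N(mean, std^2) evaluated at v."""
    return -0.5 * np.log(2 * np.pi) - np.log(std) - 0.5 * ((v - mean) / std) ** 2

def log_weight(x, z, mu, sigma):
    """ln w(z) = ln p(x, z) - ln q(z) for the toy model (an assumption):
    p(z) = N(0, 1), p(x | z) = N(z, 1), q(z) = N(mu, sigma^2)."""
    return log_normal(z, 0.0, 1.0) + log_normal(x, z, 1.0) - log_normal(z, mu, sigma)

def log_mean_exp(a, axis):
    """Numerically stable ln(mean(exp(a))) along an axis."""
    m = a.max(axis=axis, keepdims=True)
    return np.squeeze(m, axis=axis) + np.log(np.mean(np.exp(a - m), axis=axis))

rng = np.random.default_rng(0)
x, mu, sigma, k = 1.0, 0.0, 1.0, 10

# 20000 independent groups of k i.i.d. samples from the proposal q.
z = mu + sigma * rng.standard_normal((20000, k))
log_w = log_weight(x, z, mu, sigma)

elbo = log_w.mean()                          # estimate of E_q[ln w(z)]
iwae = log_mean_exp(log_w, axis=1).mean()    # estimate of E[ln (1/k) sum_i w(z_i)]
log_px = log_normal(x, 0.0, np.sqrt(2.0))    # exact ln p(x) for this toy model

print(f"ELBO ~ {elbo:.3f}, IWAE-{k} ~ {iwae:.3f}, ln p(x) = {log_px:.3f}")
```

With a deliberately mismatched proposal, the IWAE estimate should sit between the ELBO and the exact log-likelihood, which is the variance-reduction effect described above.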
2.1 Stochastic Variational Inference
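The generative model can be trained by jointly optimizing $q$ and $\theta$ to maximize the lower bound over the data distribution $\hat{p}(x)$. Supposing the variational family $\{q_\lambda\}_{\lambda \in \Lambda}$ is parametric and indexed by the parameter space $\Lambda$ (e.g., a Gaussian variational family indexed by mean and covariance parameters), the optimization problem becomes
$$\max_\theta \; \mathbb{E}_{\hat{p}(x)}\left[\max_{\lambda \in \Lambda} \mathbb{E}_{q_\lambda(z)}\left[\ln w_{\theta,\lambda}(x, z)\right]\right], \tag{3}$$
where the importance weight is now
$$w_{\theta,\lambda}(x, z) = \frac{p_\theta(x, z)}{q_\lambda(z)}. \tag{4}$$
For notational simplicity, we omit the dependency on $x$. For a fixed choice of $\theta$ and $x$, [17] proposed to optimize $\lambda$ via gradient ascent, where one initializes with $\lambda_0$ and takes successive steps of
$$\lambda_{i+1} \leftarrow \lambda_i + \eta \nabla_{\lambda_i} \mathrm{ELBO}(\lambda_i), \tag{5}$$
for which the ELBO gradient with respect to $\lambda$ can be approximated via Monte Carlo sampling as
$$\nabla_{\lambda_i} \mathrm{ELBO}(\lambda_i) \approx \nabla_{\lambda_i} \ln w_{\theta,\lambda_i}(z_i), \tag{6}$$
where $z_i \sim q_{\lambda_i}(z)$ is reparameterized as a function of $\lambda_i$ and a base distribution $p(\epsilon)$, and $\eta$ denotes the step size. We note that $k$ applications of gradient ascent generate a trajectory of variational parameters $\lambda_{0:k}$, where we use the final parameter $\lambda_k$ for the approximation. Following the convention in [20], we refer to this procedure as Stochastic Variational Inference (SVI).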
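The gradient-ascent loop described above can be sketched as follows. This is a minimal illustration under assumptions that are not from the paper: a toy model with prior N(0, 1) and likelihood N(z, 1), a Gaussian proposal parameterized by `(mu, log_sigma)`, a hand-derived pathwise gradient, and a hypothetical function name `svi_trajectory`.

```python
import numpy as np

def svi_trajectory(x, mu0, log_sigma0, k, lr, rng):
    """Run k steps of single-sample reparameterized gradient ascent on the ELBO.

    Toy model (an assumption): p(z) = N(0, 1), p(x | z) = N(z, 1),
    q_lambda(z) = N(mu, sigma^2) with lambda = (mu, log sigma).
    Returns an array of shape (k + 1, 2) holding lambda_0, ..., lambda_k.
    """
    mu, log_sigma = float(mu0), float(log_sigma0)
    trajectory = [(mu, log_sigma)]
    for _ in range(k):
        sigma = np.exp(log_sigma)
        eps = rng.standard_normal()
        z = mu + sigma * eps                 # reparameterized sample from q_lambda
        dlogp_dz = x - 2.0 * z               # d/dz [ln p(z) + ln p(x | z)]
        # Pathwise gradient of ln w = ln p(x, z) - ln q_lambda(z); with
        # z = mu + sigma * eps, ln q_lambda(z) = const - log_sigma - eps^2 / 2.
        grad_mu = dlogp_dz
        grad_log_sigma = dlogp_dz * sigma * eps + 1.0
        mu += lr * grad_mu
        log_sigma += lr * grad_log_sigma
        trajectory.append((mu, log_sigma))
    return np.array(trajectory)

rng = np.random.default_rng(0)
traj = svi_trajectory(x=1.0, mu0=0.0, log_sigma0=0.0, k=5000, lr=0.002, rng=rng)
mu_k, log_sigma_k = traj[-1]
print(f"final mu = {mu_k:.3f}, final sigma^2 = {np.exp(2 * log_sigma_k):.3f}")
```

For this toy model the exact posterior is N(x/2, 1/2), so the iterates should hover around mu = 0.5 and sigma^2 = 0.5. The long run here is only a convergence check; per-example SVI in training would typically take far fewer steps.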
2.2 Amortized Inference Suboptimality
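The SVI procedure introduces an inference subroutine that optimizes the proposal distribution $q_\lambda(z)$ per sample, which is computationally costly. [1, 10] observed that the computational cost of inference can be amortized by introducing an inference model $f_\phi : \mathcal{X} \to \Lambda$, parameterized by $\phi$, that directly seeks to learn the mapping from each sample $x$ to an optimal $\lambda^*$ that solves the maximization problem
$$\lambda^* = \arg\max_{\lambda \in \Lambda} \mathbb{E}_{q_\lambda(z)}\left[\ln \frac{p_\theta(x, z)}{q_\lambda(z)}\right]. \tag{7}$$
This yields the amortized ELBO optimization problem
$$\max_{\theta, \phi} \; \mathbb{E}_{\hat{p}(x)} \mathbb{E}_{q_{f_\phi(x)}(z)}\left[\ln \frac{p_\theta(x, z)}{q_{f_\phi(x)}(z)}\right], \tag{8}$$
where $q_{f_\phi(x)}(z)$ can be concisely rewritten (with a slight abuse of notation) as $q_\phi(z \mid x)$ to yield the standard variational autoencoder objective [1].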
While computationally efficient, the influence of the amortized inference model on the training dynamics of the generative model has recently come under scrutiny [15, 16, 17, 18]. A notable consequence of amortization is the amortization gap
$$D\left(q_{f_\phi(x)}(z) \,\|\, p_\theta(z \mid x)\right) - D\left(q_{\lambda^*}(z) \,\|\, p_\theta(z \mid x)\right), \tag{9}$$
which measures the additional error incurred when the amortized inference model $f_\phi(x)$ is used instead of the optimal $\lambda^*$ for approximating the posterior [15]. A large amortization gap is a potential source of concern since it introduces further deviation from the maximum likelihood objective [16].
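To make the amortization gap concrete, the sketch below evaluates it in closed form for a toy model where everything is Gaussian. The model (prior N(0, 1), likelihood N(z, 1)) and the linear "inference model" that outputs N(a * x, sigma^2) are assumptions chosen for illustration; the gap is computed as a difference of ELBOs, which equals the corresponding difference of KL divergences to the posterior.

```python
import numpy as np

def elbo(x, mu, sigma):
    """Closed-form ELBO for the toy model p(z) = N(0, 1), p(x | z) = N(z, 1)
    with proposal q(z) = N(mu, sigma^2)."""
    expected_log_joint = (-np.log(2 * np.pi)
                          - 0.5 * (mu ** 2 + sigma ** 2)
                          - 0.5 * ((x - mu) ** 2 + sigma ** 2))
    entropy = 0.5 * np.log(2 * np.pi * np.e) + np.log(sigma)
    return expected_log_joint + entropy

def amortization_gap(x, a, sigma):
    """Gap between the best per-sample ELBO and the ELBO of a hypothetical
    amortized model that outputs q(z | x) = N(a * x, sigma^2)."""
    best = elbo(x, 0.5 * x, np.sqrt(0.5))    # optimal proposal: the exact posterior
    return best - elbo(x, a * x, sigma)

xs = np.array([-2.0, 0.5, 3.0])
print(amortization_gap(xs, a=0.3, sigma=1.0))            # suboptimal amortized model
print(amortization_gap(xs, a=0.5, sigma=np.sqrt(0.5)))   # amortized model matches the posterior
```

In this toy setting the gap grows with |x| for the mismatched inference model and vanishes when the amortized output coincides with the exact posterior parameters.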
2.3 Amortization-SVI Hybrids
To close the amortization gap, [17] proposed to blend amortized inference with SVI. Since SVI requires one to initialize $\lambda_0$, a natural solution is to set $\lambda_0 = f_\phi(x)$. Thus, SVI is allowed to fine-tune the initial proposal distribution found by the amortized inference model and reduce the amortization gap. Rather than optimizing $(\theta, \phi)$ jointly with the amortized ELBO objective Eq. 8, the training of the inference and generative models is now decoupled; $\phi$ is trained to optimize the amortized ELBO objective, but $\theta$ is trained to approximately optimize Eq. 3, where $\lambda^*$ is approximated via SVI. To enable end-to-end training of the inference and generative models, [18] proposed to backpropagate through the SVI steps via a finite-difference estimation of the necessary Hessian-vector products. Alternatively, [19] adopts a learning-to-learn framework where an inference model iteratively outputs $\lambda_{i+1}$ as a function of $\lambda_i$ and the ELBO gradient.
3 Buffered Stochastic Variational Inference
In this paper, we focus on the simpler, decoupled training procedure described by [17] and identify a new way of improving the SVI training procedure (orthogonal to the end-to-end approaches in [18, 19]). Our key observation is that, as part of the gradient ascent estimation in Eq. 6, the SVI procedure necessarily generates a sequence of importance weights $w_{0:k}$, where $w_i = w_{\theta,\lambda_i}(z_i)$. Since $w_k$ likely achieves the highest ELBO, the intermediate weights $w_{0:k-1}$ are subsequently discarded in the SVI training procedure, and only $w_k$ is retained for updating the generative model parameters. However, if the preceding proposal distributions $q_{\lambda_{0:k-1}}$ are also reasonable approximations of the posterior, then it is potentially wasteful to discard their corresponding importance weights. A natural question to ask then is whether the full trajectory of weights $w_{0:k}$ can be leveraged to further improve the training of the generative model.
Taking inspiration from IWAE's weight-averaging mechanism, we propose a modification to the SVI procedure where we simply keep a buffer of the entire importance weight trajectory and use an average of the importance weights $\sum_{i=0}^{k} \pi_i w_i$ as the objective in training the generative model. (For simplicity, we use the uniform weighting $\pi_i = 1/(k+1)$ in our base implementation of BSVI. In Section 4.1, we discuss how to optimize $\pi$ during training.) The generative model is then updated with the gradient $\nabla_\theta \ln \sum_{i} \pi_i w_i$. We call this procedure Buffered Stochastic Variational Inference (BSVI) and denote $\ln \sum_{i} \pi_i w_i$ as the BSVI objective. We describe the BSVI training procedure in Algorithm 1 and contrast it with SVI training. For notational simplicity, we shall always imply initialization with an amortized inference model when referring to SVI and BSVI.
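The buffering idea can be sketched in a few lines: run SVI, keep every log importance weight along the way, and combine them with a log-sum-exp. The code below is an illustration under stated assumptions, not the authors' implementation: it uses a toy model (prior N(0, 1), likelihood N(z, 1)), a fixed initial proposal standing in for the amortized inference model's output, and hypothetical helper names.

```python
import numpy as np

def log_normal(v, mean, std):
    """Log-density of N(mean, std^2) evaluated at v."""
    return -0.5 * np.log(2 * np.pi) - np.log(std) - 0.5 * ((v - mean) / std) ** 2

def svi_log_weight_buffer(x, mu0, log_sigma0, k, lr, n_chains, rng):
    """Run n_chains independent SVI trajectories of k steps each and buffer
    ln w_0, ..., ln w_k for every chain. Returns an array (n_chains, k + 1).

    Toy model (an assumption): p(z) = N(0, 1), p(x | z) = N(z, 1),
    q_lambda(z) = N(mu, sigma^2)."""
    mu = np.full(n_chains, float(mu0))
    log_sigma = np.full(n_chains, float(log_sigma0))
    log_w = np.empty((n_chains, k + 1))
    for i in range(k + 1):
        sigma = np.exp(log_sigma)
        eps = rng.standard_normal(n_chains)
        z = mu + sigma * eps
        log_w[:, i] = (log_normal(z, 0.0, 1.0) + log_normal(x, z, 1.0)
                       - log_normal(z, mu, sigma))
        if i < k:  # the same sample z_i drives the next SVI update
            dlogp_dz = x - 2.0 * z
            mu = mu + lr * dlogp_dz
            log_sigma = log_sigma + lr * (dlogp_dz * sigma * eps + 1.0)
    return log_w

def bsvi_objective(log_w, pi):
    """ln sum_i pi_i w_i per chain, computed stably in log space."""
    a = log_w + np.log(pi)
    m = a.max(axis=1, keepdims=True)
    return m[:, 0] + np.log(np.exp(a - m).sum(axis=1))

rng = np.random.default_rng(0)
k = 5
log_w = svi_log_weight_buffer(x=1.0, mu0=0.0, log_sigma0=0.0, k=k, lr=0.02,
                              n_chains=20000, rng=rng)
pi = np.full(k + 1, 1.0 / (k + 1))           # uniform buffer weights
bsvi = bsvi_objective(log_w, pi).mean()      # BSVI objective, averaged over chains
svi = log_w[:, -1].mean()                    # SVI objective uses only ln w_k
log_px = log_normal(1.0, 0.0, np.sqrt(2.0))  # exact ln p(x) for the toy model
print(f"SVI ~ {svi:.3f}, BSVI ~ {bsvi:.3f}, ln p(x) = {log_px:.3f}")
```

Note that the buffer is a by-product of the SVI loop: no extra samples are drawn, and the only additional work is the weighted log-sum-exp over the stored values.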
Figure 1: When the variational parameters $\lambda_{0:k}$ are marginalized, the result is a joint distribution of dependent samples $z_{0:k}$. For notational simplicity, the dependency on $x$ is omitted.
4 Theoretical Analysis
An important consideration is whether the BSVI objective serves as a valid lower bound to the log-likelihood $\ln p_\theta(x)$. A critical challenge in the analysis of the BSVI objective is that the trajectory of variational parameters $\lambda_{0:k}$ is actually a sequence of statistically dependent random variables. This statistical dependency is a consequence of SVI's stochastic gradient approximation in Eq. 6. We capture this dependency structure in Figure 1(a), which shows that each $\lambda_{i+1}$ is only deterministically generated after $z_i$ is sampled. When the proposal distribution parameters $\lambda_{0:k}$ are marginalized, the resulting graphical model is a joint distribution $q(z_{0:k} \mid x)$ over $z_{0:k}$. To reason about such a joint distribution, we introduce the following generalization of the IWAE bound.
Theorem 1. Let $p(x, z)$ be a distribution where $z \in \mathcal{Z}$. Consider a joint proposal distribution $q(z_{0:k})$ over $\mathcal{Z}^{k+1}$. Let $v(i) \subseteq \{0, \ldots, k\} \setminus \{i\}$ for all $i$, and $\pi$ be a categorical distribution over $\{0, \ldots, k\}$. The following construction, which we denote the Generalized IWAE Bound, is a valid lower bound of the log-marginal-likelihood
$$\mathbb{E}_{q(z_{0:k})}\left[\ln \sum_{i=0}^{k} \pi_i \frac{p(x, z_i)}{q(z_i \mid z_{v(i)})}\right] \le \ln p(x). \tag{10}$$
The proof follows directly from the linearity of expectation when using $q(z_{0:k})$ for importance sampling to construct an unbiased estimate of $p(x)$, followed by application of Jensen's inequality. A detailed proof is provided in Appendix A.
Notably, if $q(z_{0:k}) = \prod_{i} q(z_i)$, then Theorem 1 reduces to the IWAE bound. Theorem 1 thus provides a generalization of IWAE, where the samples drawn are potentially non-independently and non-identically distributed. It therefore offers a way to construct new lower bounds on the log-likelihood whenever one has access to a set of non-independent samples.
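The unbiasedness argument behind the Generalized IWAE Bound can be checked numerically. The sketch below is an assumption-laden illustration: it uses a toy model (prior N(0, 1), likelihood N(z, 1)), a hypothetical Markov-chain proposal in which each sample depends on the previous one, and an arbitrary non-uniform categorical distribution `pi`.

```python
import numpy as np

def log_normal(v, mean, std):
    """Log-density of N(mean, std^2) evaluated at v."""
    return -0.5 * np.log(2 * np.pi) - np.log(std) - 0.5 * ((v - mean) / std) ** 2

rng = np.random.default_rng(0)
x, k, n = 1.0, 3, 200000
pi = np.array([0.1, 0.2, 0.3, 0.4])          # any categorical distribution works

# Dependent proposal (hypothetical): z_0 ~ N(0, 1), z_i | z_{i-1} ~ N(0.5 z_{i-1}, 1).
# Each weight conditions only on the previous sample, i.e. v(i) = {i - 1}.
weights = np.empty((n, k + 1))
prev_mean = np.zeros(n)
for i in range(k + 1):
    z = prev_mean + rng.standard_normal(n)
    log_joint = log_normal(z, 0.0, 1.0) + log_normal(x, z, 1.0)   # ln p(x, z_i)
    weights[:, i] = np.exp(log_joint - log_normal(z, prev_mean, 1.0))
    prev_mean = 0.5 * z

mixture = weights @ pi                        # sum_i pi_i p(x, z_i) / q(z_i | z_{i-1})
p_x = np.exp(log_normal(x, 0.0, np.sqrt(2.0)))
print(f"mean estimate = {mixture.mean():.4f}, exact p(x) = {p_x:.4f}")
print(f"bound = {np.log(mixture).mean():.4f}, ln p(x) = {np.log(p_x):.4f}")
```

Even though the samples are neither independent nor identically distributed, the weighted average remains an unbiased estimate of the marginal likelihood, so its expected logarithm stays below the log-likelihood by Jensen's inequality.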
In this paper, we focus on a special instance where a chain of samples is constructed from the SVI trajectory. We note that the BSVI objective can be expressed as
$$\mathbb{E}_{q(z_{0:k} \mid x)}\left[\ln \sum_{i=0}^{k} \pi_i \frac{p_\theta(x, z_i)}{q(z_i \mid z_{<i}, x)}\right]. \tag{11}$$
Note that since $\lambda_i$ can be deterministically computed given $(x, z_{<i})$, it is therefore admissible to interchange the distributions $q(z_i \mid z_{<i}, x) = q_{\lambda_i}(z_i)$. The BSVI objective is thus a special case of the Generalized IWAE bound, where $v(i) = \{0, \ldots, i-1\}$ with auxiliary conditioning on $x$. Hence, the BSVI objective is a valid lower bound of $\ln p_\theta(x)$; we now refer to it as the BSVI bound where appropriate.
In the following two subsections, we address two additional aspects of the BSVI bound. First, we propose a method for ensuring that the BSVI bound is tighter than the evidence lower bound achievable via SVI. Second, we provide an initial characterization of BSVI's implicit sampling-importance-resampling distribution.
4.1 Buffer Weight Optimization
Stochastic variational inference uses a series of gradient ascent steps to generate a final proposal distribution $q_{\lambda_k}(z)$. As evident from Figure 1(a), the parameter $\lambda_k$ is in fact a random variable. The ELBO achieved via SVI, in expectation, is thus
$$\mathbb{E}_{q(\lambda_k \mid x)} \mathbb{E}_{q_{\lambda_k}(z)}\left[\ln \frac{p_\theta(x, z)}{q_{\lambda_k}(z)}\right] = \mathbb{E}_{q(z_{0:k} \mid x)}\left[\ln \frac{p_\theta(x, z_k)}{q(z_k \mid z_{<k}, x)}\right], \tag{12}$$
where the RHS re-expresses it in notation consistent with Eq. 11. We denote Eq. 12 as the SVI bound. In general, the BSVI bound with uniform weighting is not necessarily tighter than the SVI bound. For example, if SVI's last proposal distribution exactly matches the posterior $p_\theta(z \mid x)$, then assigning equal weighting across $w_{0:k}$ would make the BSVI bound looser.
In practice, we observe the BSVI bound with uniform weighting to consistently achieve a tighter lower bound than SVI's last proposal distribution. We attribute this phenomenon to the effectiveness of variance reduction from averaging multiple importance weights, even when these importance weights are generated from dependent and non-identical proposal distributions.
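To guarantee that the BSVI bound is at least as tight as the SVI bound, we propose to optimize the buffer weight $\pi$. This guarantees the ordering
$$\max_{\pi} \mathbb{E}_{q(z_{0:k} \mid x)}\left[\ln \sum_{i=0}^{k} \pi_i w_i\right] \ge \mathbb{E}_{q(z_{0:k} \mid x)}\left[\ln w_k\right], \tag{13}$$
since the SVI bound is itself a special case of the BSVI bound when $\pi = (0, \ldots, 0, 1)$. It is worth noting that the objective in Eq. 13 is concave with respect to $\pi$, allowing for easy optimization of $\pi$.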
Although $\pi$ is a local variational parameter, we shall, for simplicity, optimize only a single global $\pi$ that we update with gradient ascent throughout the course of training. As such, $\pi$ is jointly optimized with $\theta$ and $\phi$.
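Buffer weight optimization can be sketched as gradient ascent on a batch of buffered log-weights. The example below makes several assumptions that are not from the paper: the buffer weights are parameterized through softmax logits (a convenience that keeps them on the simplex, at the cost of the concavity that holds in the weights themselves), the log-weights are synthetic, and the function names are hypothetical.

```python
import numpy as np

def logsumexp(a, axis):
    """Numerically stable ln(sum(exp(a))) along an axis."""
    m = a.max(axis=axis, keepdims=True)
    return np.squeeze(m, axis=axis) + np.log(np.exp(a - m).sum(axis=axis))

def bsvi_bound(log_w, logits):
    """Batch average of ln sum_i pi_i w_i, with pi = softmax(logits)."""
    log_pi = logits - logsumexp(logits, axis=0)
    return logsumexp(log_w + log_pi, axis=1).mean()

def optimize_buffer_weights(log_w, steps=500, lr=0.5):
    """Gradient ascent on the softmax logits of pi, starting from uniform pi."""
    logits = np.zeros(log_w.shape[1])
    for _ in range(steps):
        log_pi = logits - logsumexp(logits, axis=0)
        a = log_w + log_pi
        # r[n, i] = pi_i w_i / sum_j pi_j w_j for example n
        r = np.exp(a - logsumexp(a, axis=1)[:, None])
        # Gradient of ln sum_j pi_j w_j with respect to logit i is r_i - pi_i.
        logits += lr * (r.mean(axis=0) - np.exp(log_pi))
    return logits

# Synthetic buffer (an assumption): later SVI steps have higher mean log-weight.
rng = np.random.default_rng(0)
k = 4
log_w = -2.0 + 0.1 * np.arange(k + 1) + 0.5 * rng.standard_normal((5000, k + 1))

uniform = bsvi_bound(log_w, np.zeros(k + 1))
logits = optimize_buffer_weights(log_w)
pi = np.exp(logits - logsumexp(logits, axis=0))
optimized = bsvi_bound(log_w, logits)
svi = log_w[:, -1].mean()
print(f"uniform = {uniform:.3f}, optimized = {optimized:.3f}, SVI = {svi:.3f}")
print("pi =", np.round(pi, 3))
```

On data like this, where the last proposal is clearly better than the first, the learned weights should shift toward the later buffer entries. A single global weight vector is used here, mirroring the simplification described above.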
4.2 Dependence-Breaking via Double-Sampling
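As observed in [20], taking the gradient of the log-likelihood with respect to $\theta$ results in the expression
$$\nabla_\theta \ln p_\theta(x) = \mathbb{E}_{p_\theta(z \mid x)}\left[\nabla_\theta \ln p_\theta(x, z)\right]. \tag{14}$$
We note that the gradient of the ELBO with respect to $\theta$ results in a similar expression
$$\nabla_\theta \mathrm{ELBO} = \mathbb{E}_{q(z)}\left[\nabla_\theta \ln p_\theta(x, z)\right]. \tag{15}$$
As such, the ELBO gradient differs from the log-likelihood gradient only in terms of the distribution applied by the expectation operator. To approximate the log-likelihood gradient, we wish to set $q(z)$ close to $p_\theta(z \mid x)$ under some divergence.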
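For completeness, the two gradient expressions discussed above can be derived in a few lines. This is a standard derivation written out as a sketch, assuming that differentiation and integration can be interchanged and that the proposal $q(z)$ does not depend on $\theta$:

```latex
\begin{align*}
\nabla_\theta \ln p_\theta(x)
  &= \frac{1}{p_\theta(x)} \int \nabla_\theta \, p_\theta(x, z) \, dz
   = \int \frac{p_\theta(x, z)}{p_\theta(x)} \, \nabla_\theta \ln p_\theta(x, z) \, dz \\
  &= \mathbb{E}_{p_\theta(z \mid x)}\left[ \nabla_\theta \ln p_\theta(x, z) \right], \\[4pt]
\nabla_\theta \, \mathbb{E}_{q(z)}\left[ \ln \frac{p_\theta(x, z)}{q(z)} \right]
  &= \mathbb{E}_{q(z)}\left[ \nabla_\theta \ln p_\theta(x, z) \right].
\end{align*}
```

The only difference between the two results is the distribution under which the score of the joint is averaged, which is why the quality of the proposal directly controls the quality of the gradient.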
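We now show what results from computing the gradient of the BSVI objective.
Lemma 1. The BSVI gradient with respect to $\theta$ is
$$\nabla_\theta \, \mathbb{E}_{q(z_{0:k} \mid x)}\left[\ln \sum_{i=0}^{k} \pi_i w_i\right] = \mathbb{E}_{q_{\mathrm{sir}}(z \mid x)}\left[\nabla_\theta \ln p_\theta(x, z)\right], \tag{16}$$
where $q_{\mathrm{sir}}$ is a sampling-importance-resampling procedure defined by the generative process
$$z_{0:k} \sim q(z_{0:k} \mid x), \tag{17}$$
$$i \sim r(i \mid z_{0:k}), \tag{18}$$
$$z \leftarrow z_i, \tag{19}$$
and $r(i \mid z_{0:k}) = \pi_i w_i / \sum_{j} \pi_j w_j$ is a probability mass function over $\{0, \ldots, k\}$. A detailed proof is provided in Appendix A.
A natural question to ask is whether BSVI's $q_{\mathrm{sir}}$ is closer to the posterior than $q_{\lambda_k}$ in expectation. To assist in this analysis, we first characterize a particular instance of the Generalized IWAE bound when $z_{0:k}$ are independent but non-identically distributed.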
Theorem 2. When $q(z_{0:k}) = \prod_{i} q_i(z_i)$, the implicit distribution $q_{\mathrm{sir}}(z)$ admits the inequality
(20)  
(21) 
Theorem 2 extends the analysis by [23] from the i.i.d. case (i.e., the standard IWAE bound) to the non-identical case (proof in Appendix A). It remains an open question whether the inequality holds for the non-independent case.
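Since the BSVI objective employs dependent samples, it does not fulfill the conditions for Theorem 2. To address this issue, we propose a variant, BSVI with double-sampling (BSVI-DS), that breaks dependency by drawing two samples at each SVI step: $\hat{z}_i$ for computing the SVI gradient update and $z_i$ for computing the BSVI importance weight $w_i$. The BSVI-DS bound is thus
$$\mathbb{E}_{q(\hat{z}_{<k} \mid x)} \, \mathbb{E}_{q(z_{0:k} \mid \hat{z}_{<k}, x)}\left[\ln \sum_{i=0}^{k} \pi_i \frac{p_\theta(x, z_i)}{q(z_i \mid \hat{z}_{<i}, x)}\right], \tag{22}$$
where $q(z_{0:k} \mid \hat{z}_{<k}, x) = \prod_{i=0}^{k} q(z_i \mid \hat{z}_{<i}, x)$ is a product of independent but non-identical distributions when conditioned on $(\hat{z}_{<k}, x)$. Double-sampling now allows us to make the following comparison.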
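Corollary 1. Let $q_{\lambda_k}$ denote the proposal distribution found by SVI. For any choice of $\lambda_{0:k}$, the distribution $q_{\mathrm{sir}}$ implied by BSVI-DS (with optimal weighting $\pi^*$) is at least as close to $p_\theta(z \mid x)$ as $q_{\lambda_k}$,
$$D\left(q_{\mathrm{sir}}(z) \,\|\, p_\theta(z \mid x)\right) \le D\left(q_{\lambda_k}(z) \,\|\, p_\theta(z \mid x)\right), \tag{23}$$
as measured by the Kullback-Leibler divergence.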
5 Computational Considerations
Another important consideration is the speed of training the generative model with BSVI versus SVI. Since BSVI reuses the trajectory of weights generated by SVI, the forward pass incurs the same cost. The backward pass for BSVI, however, is $O(k)$ for $k$ SVI steps, in contrast to SVI's $O(1)$ cost. To make the cost of BSVI's backward pass $O(1)$, we highlight a similar observation from the original IWAE study [21] that the gradient can be approximated via Monte Carlo sampling
$$\nabla_\theta \, \mathbb{E}_{q(z_{0:k} \mid x)}\left[\ln \sum_{i=0}^{k} \pi_i w_i\right] \approx \frac{1}{m} \sum_{j=1}^{m} \nabla_\theta \ln p_\theta(x, z^{(j)}), \tag{24}$$
where $z^{(j)}$ is sampled from BSVI's implicit distribution $q_{\mathrm{sir}}(z \mid x)$. We denote this as training BSVI with sampling-importance-resampling (BSVI-SIR). Setting $m = 1$ allows variational autoencoder training with BSVI-SIR to have the same wall-clock speed as training with SVI.
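The resampling trick can be sketched as follows: given one buffered trajectory, draw an index with probability proportional to its weighted importance weight and differentiate the joint only at the selected sample. Everything concrete below is an assumption for illustration: the toy likelihood N(theta * z, 1) with its hand-derived gradient, the made-up buffer values, and the function names.

```python
import numpy as np

def sir_gradient(x, z, log_w, pi, theta, m, rng):
    """Approximate the BSVI gradient with m resampled latents.

    z, log_w: buffers of shape (k + 1,) from one SVI trajectory.
    Toy likelihood (an assumption): p_theta(x | z) = N(theta * z, 1), so
    d/d theta ln p_theta(x, z) = (x - theta * z) * z."""
    a = log_w + np.log(pi)
    r = np.exp(a - a.max())
    r /= r.sum()                              # r(i | z_{0:k}) = pi_i w_i / sum_j pi_j w_j
    idx = rng.choice(len(z), size=m, p=r)     # resample buffer indices from r
    grads = (x - theta * z[idx]) * z[idx]
    return grads.mean(), r

def full_gradient(x, z, r, theta):
    """Exact expectation under r: touches all k + 1 buffered samples."""
    return np.sum(r * (x - theta * z) * z)

rng = np.random.default_rng(0)
x, theta = 1.0, 1.0
z = np.array([-0.3, 0.2, 0.6, 0.4])           # hypothetical buffered samples
log_w = np.array([-2.5, -1.9, -1.7, -1.6])    # hypothetical buffered log-weights
pi = np.full(4, 0.25)

g_one, r = sir_gradient(x, z, log_w, pi, theta, m=1, rng=rng)        # one backward pass
g_many, _ = sir_gradient(x, z, log_w, pi, theta, m=200000, rng=rng)  # sanity check only
g_full = full_gradient(x, z, r, theta)
print(f"m=1: {g_one:.3f}, m=200000: {g_many:.4f}, exact: {g_full:.4f}")
```

With m = 1 only a single buffered sample needs a backward pass through the generative model, which is where the constant cost comes from; the large-m run is included only to show that the estimator averages to the full weighted gradient.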
6 Experiments
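Table 1: Omniglot test set performance (mean ± standard deviation over four runs).
| Model | Log-likelihood | ELBO* | KL* | Reconstruction* |
|---|---|---|---|---|
| VAE | 89.83 ± 0.03 | 89.88 ± 0.02 | 0.97 ± 0.13 | 88.91 ± 0.15 |
| IWAE | 89.02 ± 0.05 | 89.89 ± 0.06 | 4.02 ± 0.18 | 85.87 ± 0.15 |
| SVI | 89.65 ± 0.06 | 89.73 ± 0.05 | 1.37 ± 0.15 | 88.36 ± 0.20 |
| BSVI-SIR | 88.80 ± 0.03 | 90.24 ± 0.06 | 7.52 ± 0.21 | 82.72 ± 0.22 |

Table 2: SVHN test set performance (mean ± standard deviation over four runs).
| Model | Log-likelihood | ELBO* | KL* | Reconstruction* |
|---|---|---|---|---|
| VAE | 2202.90 ± 14.95 | 2203.01 ± 14.96 | 0.40 ± 0.07 | 2202.62 ± 14.96 |
| IWAE | 2148.67 ± 10.11 | 2153.69 ± 10.94 | 2.03 ± 0.08 | 2151.66 ± 10.86 |
| SVI | 2074.43 ± 10.46 | 2079.26 ± 9.99 | 45.28 ± 5.01 | 2033.98 ± 13.38 |
| BSVI-SIR | 2059.62 ± 3.54 | 2066.12 ± 3.63 | 51.24 ± 5.03 | 2014.88 ± 5.30 |

Table 3: FashionMNIST test set performance (mean ± standard deviation over four runs).
| Model | Log-likelihood | ELBO* | KL* | Reconstruction* |
|---|---|---|---|---|
| VAE | 1733.86 ± 0.84 | 1736.49 ± 0.73 | 11.62 ± 1.01 | 1724.87 ± 1.70 |
| IWAE | 1705.28 ± 0.66 | 1710.11 ± 0.72 | 33.04 ± 0.36 | 1677.08 ± 0.70 |
| SVI | 1710.15 ± 2.51 | 1718.39 ± 2.13 | 26.05 ± 1.90 | 1692.34 ± 4.03 |
| BSVI-SIR | 1699.44 ± 0.45 | 1707.00 ± 0.49 | 41.48 ± 0.12 | 1665.52 ± 0.41 |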
6.1 Setup
We evaluated the performance of our method by training variational autoencoders with BSVI-SIR with buffer weight optimization (BSVI-SIR-$\pi$) on the dynamically binarized Omniglot, grayscale SVHN, and FashionMNIST datasets (a complete evaluation of all BSVI variants is available in Appendix B). Our main comparison is against the SVI training procedure (as described in Algorithm 1). We also show the performance of the standard VAE and IWAE training procedures. Importantly, we note that we have chosen to compare SVI-$k$ and IWAE-$k$ against BSVI-SIR trained with $k-1$ SVI steps. This is because BSVI-SIR with $k-1$ SVI steps generates $k$ importance weights.
For all our experiments, we use the same architecture as [18] (where the decoder is a PixelCNN) and train with the AMSGrad optimizer [24]. For grayscale SVHN, we followed [25] and replaced [18]'s Bernoulli observation model with a discretized logistic distribution model with a global scale parameter. Each model was trained for up to 200k steps with early stopping based on validation set performance. For the Omniglot experiment, we followed the training procedure in [18] and annealed the KL term multiplier [2, 26] during the initial training iterations. We replicated all experiments four times and report the mean and standard deviation of all relevant metrics. For additional details, refer to Appendix D.
6.2 Log-Likelihood Performance
For all models, we report the log-likelihood (as measured by BSVI). We additionally report the SVI (ELBO*) bound along with its decomposition into rate (KL*) and distortion (Reconstruction*) components [27]. We highlight that KL* provides a fair comparison of the rate achieved by each model without concern of misrepresentation caused by amortized inference suboptimality.
Omniglot. Table 1 shows that BSVI-SIR outperforms SVI on the test set log-likelihood. BSVI-SIR also makes greater use of the latent space (as measured by the lower Reconstruction*). Interestingly, BSVI-SIR's log-likelihoods are noticeably higher than its corresponding ELBO*, suggesting that BSVI-SIR has learned posterior distributions not easily approximated by the Gaussian variational family when trained on Omniglot.
SVHN. Table 2 shows that BSVI-SIR outperforms SVI on test set log-likelihood. We observe that both BSVI-SIR and SVI significantly outperform both VAE and IWAE on log-likelihood, ELBO*, and Reconstruction*, demonstrating the efficacy of iteratively refining the proposal distributions found by the amortized inference model during training.
FashionMNIST. Table 3 similarly shows that BSVI-SIR outperforms SVI on test set log-likelihood. Here, BSVI achieves significantly better Reconstruction* as well as higher ELBO* compared to VAE, IWAE, and SVI.
In Tables 4, 5, and 6 (Appendix B), we also observe that the use of double sampling and buffer weight optimization does not make an appreciable difference compared with their respective counterparts, demonstrating the efficacy of BSVI even when the samples are statistically dependent and the buffer weight is simply uniform.
6.3 Stochastic Gradient as Regularizer
Interestingly, Table 4 shows that BSVI-SIR can outperform BSVI on the test set despite having a higher-variance gradient. We show in Figure 4 that this is the result of BSVI overfitting the training set. The results demonstrate the regularizing effect of having noisier gradients and thus provide informative empirical evidence for the ongoing discussion about the relationship between generalization and the gradient signal-to-noise ratio in variational autoencoders [28, 16].
6.4 Latent Space Visualization
Table 1 shows that the model learned by BSVI-SIR training has better Reconstruction* than SVI, indicating greater usage of the latent variable for encoding information about the input image. We provide a visualization of the difference in latent space usage in Figure 5. Here, we sample multiple images conditioned on a fixed $z$. Since BSVI encoded more information into $z$ than SVI on the Omniglot dataset, we see that the conditional distribution $p_\theta(x \mid z)$ of the model learned by BSVI has lower entropy (i.e., is less diverse) than that of SVI.
6.5 Analysis of Training Metrics
Recall that the BSVI training procedure runs SVI as a subroutine, and therefore generates the trajectory of importance weights $w_{0:k}$. Note that $\ln w_0$ and $\ln w_k$ are unbiased estimates of the ELBO achieved by the proposal distributions $q_{\lambda_0}$ (SVI-$0$ bound) and $q_{\lambda_k}$ (SVI-$k$ bound), respectively. It is thus possible to monitor the health of the BSVI training procedure by checking whether the bounds adhere to the ordering
$$\text{BSVI-}k \ge \text{SVI-}k \ge \text{SVI-}0 \tag{25}$$
in expectation. Figures 5(a) and 5(b) show that this is indeed the case. Since Omniglot was trained with KL-annealing [18], we see in Figure 5(a) that SVI plays a negligible role once the warm-up phase (the initial iterations) is over. In contrast, SVI plays an increasingly large role when training on the more complex SVHN and FashionMNIST datasets, demonstrating that the amortization gap is a significantly bigger issue in the generative modeling of SVHN and FashionMNIST. Figure 5(b) further shows that BSVI-$k$ consistently achieves a better bound than SVI-$k$. When the buffer weight is also optimized, we see in Figure 5(c) that $\pi$ learns to upweight the later proposal distributions in $q_{\lambda_{0:k}}$, as measured by the buffer weight average. For SVHN, the significant improvement of SVI-$k$ over SVI-$0$ results in $\pi$ being biased significantly toward the later proposal distributions. Interestingly, although Figure 5(c) suggests that the optimal buffer weight can differ significantly from naive uniform weighting, we see from Tables 1 and 2 that buffer weight optimization has a negligible effect on the overall model performance.
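The health check described above amounts to comparing three batch averages computed from the same buffer of log importance weights. The sketch below shows one way to compute them; the function name, the synthetic buffer, and the uniform buffer weights are assumptions made for illustration.

```python
import numpy as np

def bound_ordering(log_w, pi):
    """Estimate the first-step SVI bound, the last-step SVI bound, and the
    BSVI bound from a (batch, k + 1) buffer of log importance weights, and
    report whether BSVI >= last-step SVI >= first-step SVI holds."""
    svi_first = log_w[:, 0].mean()
    svi_last = log_w[:, -1].mean()
    a = log_w + np.log(pi)
    m = a.max(axis=1, keepdims=True)
    bsvi = (m[:, 0] + np.log(np.exp(a - m).sum(axis=1))).mean()
    return svi_first, svi_last, bsvi, bool(bsvi >= svi_last >= svi_first)

# Synthetic buffer (an assumption): log-weights improve slightly at each SVI step.
rng = np.random.default_rng(0)
k = 4
log_w = -2.0 + 0.02 * np.arange(k + 1) + rng.standard_normal((50000, k + 1))
pi = np.full(k + 1, 1.0 / (k + 1))

svi_first, svi_last, bsvi, ok = bound_ordering(log_w, pi)
print(f"SVI (first) = {svi_first:.3f}, SVI (last) = {svi_last:.3f}, "
      f"BSVI = {bsvi:.3f}, ordering holds: {ok}")
```

Because all three quantities are noisy batch estimates, a violation on a single small batch is not necessarily alarming; the ordering is only expected to hold on average.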
7 Conclusion
In this paper, we proposed Buffered Stochastic Variational Inference (BSVI), a novel way to leverage the intermediate importance weights generated by stochastic variational inference. We showed that BSVI is effective at alleviating inference suboptimality and that training variational autoencoders with BSVI consistently outperforms its SVI counterpart, making BSVI an attractive and simple drop-in replacement for models that employ SVI. One promising line of future work is to extend the BSVI training procedure with the end-to-end learning approaches in [18, 19]. Additionally, we showed that the BSVI objective is a valid lower bound and belongs to a general class of importance-weighted (Generalized IWAE) bounds where the importance weights are statistically dependent. Thus, it would be of interest to study the implications of this bound for certain MCMC procedures such as Annealed Importance Sampling [29] and others.
Acknowledgements
We would like to thank Matthew D. Hoffman for his insightful comments and discussions during this project. This research was supported by NSF (#1651565, #1522054, #1733686), ONR (N000141912145), AFOSR (FA95501910024), and FLI.
References
[1] Diederik P Kingma and Max Welling. Auto-Encoding Variational Bayes. arXiv preprint arXiv:1312.6114, 2013.
[2] Casper Kaae Sønderby, Tapani Raiko, Lars Maaløe, Søren Kaae Sønderby, and Ole Winther. Ladder Variational Autoencoders. In Advances in Neural Information Processing Systems, pages 3738–3746, 2016.
[3] Rui Shu, Hung H Bui, and Mohammad Ghavamzadeh. Bottleneck conditional density estimation. International Conference on Machine Learning, 2017.
[4] Diederik P Kingma, Shakir Mohamed, Danilo Jimenez Rezende, and Max Welling. Semi-Supervised Learning With Deep Generative Models. In Advances in Neural Information Processing Systems, pages 3581–3589, 2014.
[5] Volodymyr Kuleshov and Stefano Ermon. Deep hybrid models: Bridging discriminative and generative approaches. Conference on Uncertainty in Artificial Intelligence, 2017.
[6] Tian Qi Chen, Xuechen Li, Roger Grosse, and David Duvenaud. Isolating Sources Of Disentanglement In Variational Autoencoders. arXiv preprint arXiv:1802.04942, 2018.
[7] Manuel Watter, Jost Springenberg, Joschka Boedecker, and Martin Riedmiller. Embed to control: A locally linear latent dynamics model for control from raw images. In Advances in Neural Information Processing Systems, pages 2746–2754, 2015.
[8] Ershad Banijamali, Rui Shu, Mohammad Ghavamzadeh, Hung Bui, and Ali Ghodsi. Robust locally-linear controllable embedding. Artificial Intelligence and Statistics, 2018.
[9] Yunzhu Li, Jiaming Song, and Stefano Ermon. Infogail: Interpretable imitation learning from visual demonstrations. In Advances in Neural Information Processing Systems, pages 3812–3822, 2017.
[10] Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic Backpropagation And Approximate Inference In Deep Generative Models. arXiv preprint arXiv:1401.4082, 2014.
[11] Samuel Gershman and Noah Goodman. Amortized inference in probabilistic reasoning. Proceedings of the Annual Meeting of the Cognitive Science Society, 2014.
[12] Shengjia Zhao, Jiaming Song, and Stefano Ermon. A lagrangian perspective on latent variable generative models. In Proc. 34th Conference on Uncertainty in Artificial Intelligence, 2018.
[13] Yunchen Pu, Zhe Gan, Ricardo Henao, Xin Yuan, Chunyuan Li, Andrew Stevens, and Lawrence Carin. Variational autoencoder for deep learning of images, labels and captions. In Advances in Neural Information Processing Systems, pages 2352–2360, 2016.
[14] Ishaan Gulrajani, Kundan Kumar, Faruk Ahmed, Adrien Ali Taiga, Francesco Visin, David Vazquez, and Aaron Courville. Pixelvae: A latent variable model for natural images. arXiv preprint arXiv:1611.05013, 2016.
[15] Chris Cremer, Xuechen Li, and David Duvenaud. Inference Suboptimality In Variational Autoencoders. arXiv preprint arXiv:1801.03558, 2018.
[16] Rui Shu, Hung H Bui, Shengjia Zhao, Mykel J Kochenderfer, and Stefano Ermon. Amortized inference regularization. Advances in Neural Information Processing Systems, 2018.
[17] Rahul G Krishnan, Dawen Liang, and Matthew Hoffman. On the challenges of learning with inference networks on sparse, high-dimensional data. arXiv preprint arXiv:1710.06085, 2017.
[18] Yoon Kim, Sam Wiseman, Andrew C Miller, David Sontag, and Alexander M Rush. Semi-Amortized Variational Autoencoders. arXiv preprint arXiv:1802.02550, 2018.
[19] Joseph Marino, Yisong Yue, and Stephan Mandt. Iterative amortized inference. arXiv preprint arXiv:1807.09356, 2018.
[20] Matthew D Hoffman, David M Blei, Chong Wang, and John Paisley. Stochastic Variational Inference. The Journal of Machine Learning Research, 14(1):1303–1347, 2013.
[21] Yuri Burda, Roger Grosse, and Ruslan Salakhutdinov. Importance Weighted Autoencoders. arXiv preprint arXiv:1509.00519, 2015.
[22] Justin Domke and Daniel R Sheldon. Importance weighting and variational inference. In Advances in Neural Information Processing Systems, pages 4475–4484, 2018.
[23] Chris Cremer, Quaid Morris, and David Duvenaud. Reinterpreting Importance-Weighted Autoencoders. arXiv preprint arXiv:1704.02916, 2017.
[24] Sashank J. Reddi, Satyen Kale, and Sanjiv Kumar. On the convergence of adam and beyond. In International Conference on Learning Representations, 2018.
[25] Diederik P Kingma, Tim Salimans, Rafal Jozefowicz, Xi Chen, Ilya Sutskever, and Max Welling. Improved Variational Inference With Inverse Autoregressive Flow. In Advances in Neural Information Processing Systems, pages 4743–4751, 2016.
[26] Samuel R Bowman, Luke Vilnis, Oriol Vinyals, Andrew M Dai, Rafal Jozefowicz, and Samy Bengio. Generating sentences from a continuous space. arXiv preprint arXiv:1511.06349, 2015.
[27] Alexander Alemi, Ben Poole, Ian Fischer, Joshua Dillon, Rif A Saurous, and Kevin Murphy. Fixing a broken elbo. In International Conference on Machine Learning, pages 159–168, 2018.
[28] Tom Rainforth, Adam R Kosiorek, Tuan Anh Le, Chris J Maddison, Maximilian Igl, Frank Wood, and Yee Whye Teh. Tighter Variational Bounds Are Not Necessarily Better. arXiv preprint arXiv:1802.04537, 2018.
[29] Radford M Neal. Annealed importance sampling. Statistics and Computing, 11(2):125–139, 2001.
[30] Tim Salimans, Andrej Karpathy, Xi Chen, and Diederik P. Kingma. Pixelcnn++: Improving the pixelcnn with discretized logistic mixture likelihood and other modifications. CoRR, abs/1701.05517, 2017.
[31] Jakub M Tomczak and Max Welling. VAE With A Vampprior. arXiv preprint arXiv:1705.07120, 2017.
References
 [1] Diederik P Kingma and Max Welling. AutoEncoding Variational Bayes. arXiv preprint arXiv:1312.6114, 2013.
 [2] Casper Kaae Sønderby, Tapani Raiko, Lars Maaløe, Søren Kaae Sønderby, and Ole Winther. Ladder Variational Autoencoders. In Advances In Neural Information Processing Systems, pages 3738–3746, 2016.

[3]
Rui Shu, Hung H Bui, and Mohammad Ghavamzadeh.
Bottleneck conditional density estimation.
International Conference on Machine Learning
, 2017.  [4] Diederik P Kingma, Shakir Mohamed, Danilo Jimenez Rezende, and Max Welling. SemiSupervised Learning With Deep Generative Models. In Advances In Neural Information Processing Systems, pages 3581–3589, 2014.

[5]
Volodymyr Kuleshov and Stefano Ermon.
Deep hybrid models: Bridging discriminative and generative
approaches.
Conference on Uncertainty in Artificial Intelligence
, 2017.  [6] Tian Qi Chen, Xuechen Li, Roger Grosse, and David Duvenaud. Isolating Sources Of Disentanglement In Variational Autoencoders. arXiv preprint arXiv:1802.04942, 2018.
 [7] Manuel Watter, Jost Springenberg, Joschka Boedecker, and Martin Riedmiller. Embed to control: A locally linear latent dynamics model for control from raw images. In Advances in neural information processing systems, pages 2746–2754, 2015.
 [8] Ershad Banijamali, Rui Shu, Mohammad Ghavamzadeh, Hung Bui, and Ali Ghodsi. Robust locallylinear controllable embedding. Artificial Intelligence And Statistics, 2018.

[9]
Yunzhu Li, Jiaming Song, and Stefano Ermon.
Infogail: Interpretable imitation learning from visual demonstrations.
In Advances in Neural Information Processing Systems, pages 3812–3822, 2017.  [10] Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic Backpropagation And Approximate Inference In Deep Generative Models. arXiv preprint arXiv:1401.4082, 2014.
 [11] Samuel Gershman and Noah Goodman. Amortized inference in probabilistic reasoning. Proceedings of the Annual Meeting of the Cognitive Science Society, 2014.
 [12] Shengjia Zhao, Jiaming Song, and Stefano Ermon. A lagrangian perspective on latent variable generative models. In Proc. 34th Conference on Uncertainty in Artificial Intelligence, 2018.

[13]
Yunchen Pu, Zhe Gan, Ricardo Henao, Xin Yuan, Chunyuan Li, Andrew Stevens, and
Lawrence Carin.
Variational autoencoder for deep learning of images, labels and captions.
In Advances in neural information processing systems, pages 2352–2360, 2016.  [14] Ishaan Gulrajani, Kundan Kumar, Faruk Ahmed, Adrien Ali Taiga, Francesco Visin, David Vazquez, and Aaron Courville. Pixelvae: A latent variable model for natural images. arXiv preprint arXiv:1611.05013, 2016.
 [15] Chris Cremer, Xuechen Li, and David Duvenaud. Inference Suboptimality In Variational Autoencoders. arXiv preprint arXiv:1801.03558, 2018.
 [16] Rui Shu, Hung H Bui, Shengjia Zhao, Mykel J Kochenderfer, and Stefano Ermon. Amortized inference regularization. Advances in Neural Information Processing Systems, 2018.
 [17] Rahul G Krishnan, Dawen Liang, and Matthew Hoffman. On the challenges of learning with inference networks on sparse, highdimensional data. arXiv preprint arXiv:1710.06085, 2017.
 [18] Yoon Kim, Sam Wiseman, Andrew C Miller, David Sontag, and Alexander M Rush. SemiAmortized Variational Autoencoders. arXiv preprint arXiv:1802.02550, 2018.
 [19] Joseph Marino, Yisong Yue, and Stephan Mandt. Iterative amortized inference. arXiv preprint arXiv:1807.09356, 2018.
 [20] Matthew D Hoffman, David M Blei, Chong Wang, and John Paisley. Stochastic Variational Inference. The Journal of Machine Learning Research, 14(1):1303–1347, 2013.
 [21] Yuri Burda, Roger Grosse, and Ruslan Salakhutdinov. Importance Weighted Autoencoders. arXiv preprint arXiv:1509.00519, 2015.
 [22] Justin Domke and Daniel R Sheldon. Importance weighting and variational inference. In Advances in Neural Information Processing Systems, pages 4475–4484, 2018.
 [23] Chris Cremer, Quaid Morris, and David Duvenaud. Reinterpreting Importance-Weighted Autoencoders. arXiv preprint arXiv:1704.02916, 2017.
 [24] Sashank J. Reddi, Satyen Kale, and Sanjiv Kumar. On the convergence of Adam and beyond. In International Conference on Learning Representations, 2018.
 [25] Diederik P Kingma, Tim Salimans, Rafal Jozefowicz, Xi Chen, Ilya Sutskever, and Max Welling. Improved Variational Inference With Inverse Autoregressive Flow. In Advances In Neural Information Processing Systems, pages 4743–4751, 2016.
 [26] Samuel R Bowman, Luke Vilnis, Oriol Vinyals, Andrew M Dai, Rafal Jozefowicz, and Samy Bengio. Generating sentences from a continuous space. arXiv preprint arXiv:1511.06349, 2015.
 [27] Alexander Alemi, Ben Poole, Ian Fischer, Joshua Dillon, Rif A Saurous, and Kevin Murphy. Fixing a broken ELBO. In International Conference on Machine Learning, pages 159–168, 2018.
 [28] Tom Rainforth, Adam R Kosiorek, Tuan Anh Le, Chris J Maddison, Maximilian Igl, Frank Wood, and Yee Whye Teh. Tighter Variational Bounds Are Not Necessarily Better. arXiv preprint arXiv:1802.04537, 2018.
 [29] Radford M Neal. Annealed importance sampling. Statistics and computing, 11(2):125–139, 2001.
 [30] Tim Salimans, Andrej Karpathy, Xi Chen, and Diederik P. Kingma. PixelCNN++: Improving the PixelCNN with discretized logistic mixture likelihood and other modifications. CoRR, abs/1701.05517, 2017.
 [31] Jakub M Tomczak and Max Welling. VAE With A VampPrior. arXiv preprint arXiv:1705.07120, 2017.
Appendix A Proofs
See 4
Proof.
To show the validity of this lower bound, note that
(26)  
(27)  
(28)  
(29) 
Applying Jensen’s inequality shows that the lower bound in the theorem is valid. ∎
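The Jensen step used in this kind of argument can be checked numerically. The sketch below is a generic illustration and not the paper's model or bound: it assumes a toy Gaussian model (z ~ N(0, 1), x | z ~ N(z, 1), so the marginal is x ~ N(0, 2)), takes the proposal to be the prior, and draws the importance weights i.i.d. rather than from an SVI trajectory. All function names are hypothetical. It compares the Monte Carlo average of log((1/k) Σ w_i) with the exact log p(x).

```python
import math
import random


def log_normal_pdf(value, mean, var):
    """Log density of a univariate Gaussian N(mean, var) at `value`."""
    return -0.5 * math.log(2.0 * math.pi * var) - 0.5 * (value - mean) ** 2 / var


def log_mean_exp(log_values):
    """Numerically stable log of the arithmetic mean of exp(log_values)."""
    m = max(log_values)
    return m + math.log(sum(math.exp(v - m) for v in log_values) / len(log_values))


def importance_weighted_bound(x, k, num_trials, rng):
    """Monte Carlo estimate of E[log (1/k) sum_i w_i] for the toy model."""
    total = 0.0
    for _ in range(num_trials):
        log_w = []
        for _ in range(k):
            z = rng.gauss(0.0, 1.0)  # assumed proposal q(z) = N(0, 1), equal to the prior
            # log w = log p(x, z) - log q(z) = log p(x | z), since q equals the prior
            log_w.append(log_normal_pdf(x, z, 1.0))
        total += log_mean_exp(log_w)
    return total / num_trials


rng = random.Random(0)
x = 1.0
log_px = log_normal_pdf(x, 0.0, 2.0)  # exact log marginal of the toy model
bound_1 = importance_weighted_bound(x, 1, 2000, rng)    # ordinary ELBO (k = 1)
bound_50 = importance_weighted_bound(x, 50, 2000, rng)  # averaged weights (k = 50)
print(f"log p(x) = {log_px:.4f}, k=1 bound = {bound_1:.4f}, k=50 bound = {bound_50:.4f}")
```

Because the average of the weights is an unbiased estimate of p(x) and the logarithm is concave, the expected log-average cannot exceed log p(x); in this toy setting the k = 1 estimate sits clearly below log p(x) and the k = 50 estimate lies much closer to it.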
See 3
Proof.
(30)  
(31) 
The double expectation can now be reinterpreted as the sampling-importance-resampling distribution . ∎
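For readers unfamiliar with sampling-importance-resampling (SIR), the following is a minimal sketch of the generic procedure, under stated assumptions: the same toy Gaussian model as a stand-in (z ~ N(0, 1), x | z ~ N(z, 1), exact posterior N(x/2, 1/2)), a single fixed proposal equal to the prior, and i.i.d. candidates. The paper's construction instead draws candidates from the trajectory of SVI proposals; the function name `sir_sample` is hypothetical.

```python
import math
import random


def sir_sample(x, k, rng):
    """Draw one sample by sampling-importance-resampling with k candidates."""
    # Step 1: draw k candidates from the assumed proposal q(z) = N(0, 1).
    candidates = [rng.gauss(0.0, 1.0) for _ in range(k)]
    # Step 2: compute log importance weights; with q equal to the prior,
    # log w = log p(x | z) up to an additive constant that cancels on normalization.
    log_w = [-0.5 * (x - z) ** 2 for z in candidates]
    # Step 3: resample one candidate with probability proportional to its weight.
    m = max(log_w)
    weights = [math.exp(lw - m) for lw in log_w]
    return rng.choices(candidates, weights=weights, k=1)[0]


rng = random.Random(0)
x = 1.0
draws = [sir_sample(x, 50, rng) for _ in range(2000)]
sir_mean = sum(draws) / len(draws)
print(f"SIR sample mean = {sir_mean:.3f} (proposal mean 0.0, exact posterior mean {x / 2:.1f})")
```

The resampled draws concentrate around the exact posterior mean rather than the proposal mean, which is the sense in which the SIR distribution is a better posterior approximation than the proposal it is built from.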
See 3
Proof.
Recall that the is defined by the following sampling procedure
(32)  
(33)  
(34) 
where
(35) 
We first note that, for any distribution
(36) 
This provides an intuitive way of constructing the probability density function by reframing it as a sampling process (the expectation w.r.t. ) paired with a filtering procedure (the Dirac delta ). Thus, the density under is
(37) 
Additionally, we shall introduce the following terms
(38)  
(39)  
(40)  
(41) 
for notational simplicity. Note that the density function can be re-expressed as
(42)  
(43)  
(44)  
(45) 
We now begin with the ELBO under and proceed from there via
(46)  
(47)  
(48) 
where we use Jensen’s inequality to exploit the convexity of the unnormalized Kullback-Leibler divergence . To keep the notation simple, we make a small change of notation when rewriting the unnormalized KL as an integral:
(49)  
(50)  
(51)  
(52)  
(53) 
If are independent, then it follows that . Thus,
(55)  
(56)  
(57) 
∎
Appendix B Model Performance on Test and Training Data
Here we report various performance metrics for each type of model trained on the training set for both Omniglot and SVHN. As stated earlier, log-likelihood is estimated using BSVI-500, and ELBO* refers to the lower bound achieved by SVI-500 (i.e. ). KL* and Reconstruction* are the rate and distortion terms for ELBO*, respectively.
Log-likelihood  (58)  
ELBO*  (59) 
Model  Log-likelihood  ELBO*  KL*  Reconstruction* 

VAE  89.83 ± 0.03  89.88 ± 0.02  0.97 ± 0.13  88.91 ± 0.15 
IWAE  89.02 ± 0.05  89.89 ± 0.06  4.02 ± 0.18  85.87 ± 0.15 
SVI  89.65 ± 0.06  89.73 ± 0.05  1.37 ± 0.15  88.36 ± 0.20 
BSVI-DS  88.93 ± 0.02  90.13 ± 0.04  8.13 ± 0.17  81.99 ± 0.14 
BSVI  88.98 ± 0.03  90.19 ± 0.06  8.29 ± 0.25  81.89 ± 0.20 
BSVI  88.95 ± 0.02  90.18 ± 0.05  8.48 ± 0.22  81.70 ± 0.18 
BSVI-SIR  88.80 ± 0.03  90.24 ± 0.06  7.52 ± 0.21  82.72 ± 0.22 
BSVI-SIR  88.84 ± 0.05  90.22 ± 0.02  7.44 ± 0.04  82.78 ± 0.05 
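As a quick sanity check on how the columns relate, the reported ELBO* magnitude should equal the sum of the rate (KL*) and distortion (Reconstruction*) magnitudes, up to rounding. The sketch below verifies this for the first three rows of the table above. It assumes the reported numbers are magnitudes under a negated-ELBO convention, since the signs are not recoverable from the table itself; the variable names are illustrative.

```python
# (ELBO*, KL*, Reconstruction*) means copied from the first three rows of the table above.
rows = {
    "VAE": (89.88, 0.97, 88.91),
    "IWAE": (89.89, 4.02, 85.87),
    "SVI": (89.73, 1.37, 88.36),
}

# Assumed convention: |ELBO*| = KL* + Reconstruction*, up to rounding of the reported means.
gaps = {name: abs(elbo - (kl + rec)) for name, (elbo, kl, rec) in rows.items()}
max_gap = max(gaps.values())

for name, gap in gaps.items():
    print(f"{name}: |ELBO* - (KL* + Reconstruction*)| = {gap:.3f}")
```

The decomposition makes the trade-off in the table easier to read: a larger KL* paired with a smaller Reconstruction* indicates that the model makes more use of the latent variable.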

Model  Log-likelihood  ELBO*  KL*  Reconstruction* 

VAE  2202.90 ± 14.95  2203.01 ± 14.96  0.40 ± 0.07  2202.62 ± 14.96 
IWAE  2148.67 ± 10.11  2153.69 ± 10.94  2.03 ± 0.08  2151.66 ± 10.86 
SVI  2074.43 ± 10.46  2079.26 ± 9.99  45.28 ± 5.01  2033.98 ± 13.38 
BSVI-DS  2054.48 ± 7.78  2060.21 ± 7.89  48.82 ± 4.66  2011.39 ± 9.35 
BSVI  2054.75 ± 8.22  2061.11 ± 8.33  51.12 ± 3.80  2009.99 ± 8.52 
BSVI  2060.01 ± 5.00  2065.45 ± 5.88  47.24 ± 4.62  2018.21 ± 1.64 
BSVI-SIR  2059.62 ± 3.54  2066.12 ± 3.63  51.24 ± 5.03  2014.88 ± 5.30 
BSVI-SIR  2057.53 ± 4.91  2063.45 ± 4.34  49.14 ± 5.62  2014.31 ± 8.25 

Model  Log-likelihood  ELBO*  KL*  Reconstruction* 

VAE  1733.86 ± 0.84  1736.49 ± 0.73  11.62 ± 1.01  1724.87 ± 1.70 
IWAE  1705.28 ± 0.66  1710.11 ± 0.72  33.04 ± 0.36  1677.08 ± 0.70 
SVI  1710.15 ± 2.51  1718.39 ± 2.13  26.05 ± 1.90  1692.34 ± 4.03 
BSVI-DS  1699.14 ± 0.18  1706.92 ± 0.11  41.73 ± 0.18  1665.19 ± 0.26 
BSVI  1699.01 ± 0.33  1706.62 ± 0.35  41.48 ± 0.16  1665.14 ± 0.39 
BSVI  1699.24 ± 0.36  1706.92 ± 0.37  41.60 ± 0.49  1665.32 ± 0.31 
BSVI-SIR  1699.44 ± 0.45  1707.00 ± 0.49  41.48 ± 0.12  1665.52 ± 0.41 
BSVI-SIR  1699.09 ± 0.28  1706.68 ± 0.26  41.18 ± 0.19  1665.50 ± 0.31 

Model  Log-likelihood  ELBO*  KL*  Reconstruction* 

VAE  88.60 ± 0.18  88.66 ± 0.18  1.00 ± 0.13  87.66 ± 0.19 
IWAE 