1 Introduction
Variational Inference (VI) has played an important role in Bayesian model uncertainty calibration and in unsupervised representation learning. Unlike Markov Chain Monte Carlo (MCMC) methods, which rely on a Markov chain reaching equilibrium, VI makes it easy to draw i.i.d. samples from the variational distribution and enjoys lower variance in inference. On the other hand, VI is subject to bias introduced by the approximating variational distribution.
As pointed out by Turner and Sahani (2011), variational approximations tend not to propagate uncertainty well. This inaccuracy and overconfidence in inference can result in bias in statistics of certain features of the unobserved variable, such as marginal likelihood of data or the predictive posterior in the context of a Bayesian model. We argue that this is especially true in the amortized VI setups (Kingma and Welling, 2014; Rezende et al., 2014), where one seeks to learn the representations of the data in an efficient manner. We note that this sacrifices the chance of exploring different configurations of representation in inference, and can bias and hurt the training of the model.
This bias is believed to be caused by the variational family that is used, such as the factorial Gaussian chosen for computational tractability, which limits the expressiveness of the approximate posterior. In principle, this can be alleviated by using a richer family of distributions, but in practice optimization remains a challenging issue. To see this, we write the training objective of VI as the KL divergence, also known as the variational free energy $\mathcal{F}$, between the proposal $q$ and the target distribution $f$:
$$\mathcal{F}(q) = \mathbb{E}_{q}\left[\log q(z) - \log f(z)\right] = D_{\mathrm{KL}}(q \,\|\, f)$$
Due to the KL divergence, $q$ gets penalized for allocating probability mass in regions where $f$ has low density. This behaviour of the objective can result in a cascade of consequences. The variational distribution becomes biased towards being excessively confident, which, in turn, can inhibit the variational approximation from escaping poor local minima, even when it has sufficient representational capacity to accurately represent the posterior. This usually happens when the target (posterior) distribution has a complicated structure, such as multimodality, where the variational distribution might get stuck fitting only a subset of modes.
In what follows, we discuss two annealing techniques that are designed with two diverging goals in mind. On the one hand, alpha-annealing is used to encourage exploration of significant modes and is designed for learning a good inference model. On the other, beta-annealing facilitates the learning of a good generative model by reducing noise and regularization during training.
Alpha-annealing:
The optimization difficulty in inference is due to the non-convex nature of the objective, and can be mitigated via energy tempering (Katahira et al., 2008; Abrol et al., 2014; Mandt et al., 2016):
$$\mathbb{E}_{q}\left[\log q(z) - \alpha \log f(z)\right] \tag{1}$$
where $\alpha \in [0, 1]$ and $1/\alpha$ is analogous to the temperature in statistical mechanics. The temperature is usually set high initially and gradually annealed to $1$, i.e. $\alpha$ goes from a small value to $1$. The intuition is that when $\alpha$ is small the energy landscape is smoothed out, since $\nabla_z\, \alpha \log f(z)$ goes to zero everywhere as $\alpha \to 0$ if $f$ is continuous and has bounded gradient.
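The smoothing effect of tempering can be checked numerically on a toy problem. The sketch below is illustrative only: it assumes the tempered free energy takes the usual form $\mathbb{E}_q[\log q(z) - \alpha \log f(z)]$, and the bimodal target, the two candidate Gaussians, and the function names are all hypothetical choices made for this example. It compares a narrow Gaussian sitting on one mode with a broad Gaussian covering both modes.

```python
import numpy as np

def log_normal(z, mean, std):
    """Log density of N(mean, std^2) evaluated at z."""
    return -0.5 * ((z - mean) / std) ** 2 - np.log(std) - 0.5 * np.log(2.0 * np.pi)

def log_target(z):
    """Log density of an assumed bimodal target: equal mixture of N(-3, 0.5^2) and N(3, 0.5^2)."""
    return np.logaddexp(log_normal(z, -3.0, 0.5), log_normal(z, 3.0, 0.5)) + np.log(0.5)

def tempered_free_energy(mean, std, alpha, grid):
    """E_q[log q(z) - alpha * log f(z)] for q = N(mean, std^2), via a Riemann sum on the grid."""
    log_q = log_normal(grid, mean, std)
    integrand = np.exp(log_q) * (log_q - alpha * log_target(grid))
    return float(np.sum(integrand) * (grid[1] - grid[0]))

grid = np.linspace(-20.0, 20.0, 40001)
narrow = (3.0, 0.5)             # sits on a single mode
broad = (0.0, np.sqrt(9.25))    # moment-matched to the whole mixture

for alpha in (1.0, 0.1):
    f_narrow = tempered_free_energy(*narrow, alpha, grid)
    f_broad = tempered_free_energy(*broad, alpha, grid)
    print(f"alpha={alpha}: narrow={f_narrow:.3f}, broad={f_broad:.3f}")
```

At $\alpha = 1$ the single-mode Gaussian attains the lower free energy (about $\log 2$), which is the overconfident solution described above; at $\alpha = 0.1$ the ordering flips and the broad, exploratory Gaussian is preferred.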
However, alpha-annealing might not be ideal in practice. Tempering schemes are typically suitable for one-time inference, e.g. inferring the posterior distribution of a Bayesian model, but they can be time consuming for latent variable models or hierarchical Bayesian models where multiple inferences are required for maximizing the marginal likelihood. Examples include deep latent variable models (Kingma and Welling, 2014; Rezende et al., 2014), filtering and smoothing in deep state space models (Chung et al., 2015; Fraccaro et al., 2016; Karl et al., 2016; Shabanian et al., 2017), and hierarchical Bayes with an inference network (Edwards and Storkey, 2017). Energy tempering in these setups would harm training because of the excess noise injected into the gradient estimate of the model.
Beta-annealing:
Deterministic warmup (Raiko et al., 2007) is applied to improve the training of a generative model, and is especially important for hierarchical models (Sønderby et al., 2016) and latent variable models with flexible likelihoods (Gulrajani et al., 2017; Bowman et al., 2016). Let the joint likelihood of observed data $x$ and latent variable $z$ be $p(x, z) = p(x|z)\,p(z)$, which is equal to the true posterior $p(z|x)$ up to a normalizing constant (the marginal $p(x)$). The annealed variational objective (negative Evidence Lower Bound, or ELBO) is
$$\mathbb{E}_{q}\left[\beta\left(\log q(z) - \log p(z)\right) - \log p(x|z)\right] \tag{2}$$
where $\beta$ is annealed from $0$ to $1$. The rationale behind this annealing is that the first term in the parentheses over-regularizes the model by forcing the approximate posterior $q$ to be like the prior $p(z)$; by reducing this prior-contrastive coefficient early in training, we allow the decoder to make better use of the latent code to represent the underlying structure of the data. However, the entropy term has less importance when the coefficient is smaller, making the model behave more like a deterministic autoencoder and biasing the encoding distribution to be sharper and more unimodal. This approach is clearly in conflict with the principle of energy tempering, where one allows the approximate posterior to "explore" the energy landscape for significant modes.
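A minimal sketch of beta-annealing follows, assuming a diagonal Gaussian approximate posterior, a standard normal prior, and a linear warm-up schedule. These choices, the function names, and the placeholder reconstruction value are assumptions made for illustration and are not taken from the experiments.

```python
import numpy as np

def linear_beta(step, warmup_steps):
    """Linear warm-up: beta rises from 0 to 1 over warmup_steps updates, then stays at 1."""
    if warmup_steps <= 0:
        return 1.0
    return min(1.0, step / warmup_steps)

def gaussian_kl_to_standard_normal(mean, log_var):
    """KL( N(mean, diag(exp(log_var))) || N(0, I) ), summed over latent dimensions."""
    return 0.5 * np.sum(np.exp(log_var) + mean ** 2 - 1.0 - log_var, axis=-1)

def annealed_negative_elbo(recon_nll, mean, log_var, beta):
    """beta * KL(q || prior) + E_q[-log p(x|z)]; the reconstruction term is supplied by the caller."""
    return beta * gaussian_kl_to_standard_normal(mean, log_var) + recon_nll

mean = np.array([[0.5, -1.0]])
log_var = np.array([[0.0, -2.0]])
recon_nll = np.array([10.0])   # placeholder value standing in for E_q[-log p(x|z)]
for step in (0, 50, 100):
    beta = linear_beta(step, warmup_steps=100)
    print(step, beta, annealed_negative_elbo(recon_nll, mean, log_var, beta))
```

With $\beta = 0$ the loss reduces to the reconstruction term alone, which is the deterministic-autoencoder regime mentioned above; the prior-contrastive term is restored in full only once the warm-up ends.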
This work aims to clarify the implications of these two kinds of annealing, in terms of the inductive biases that the learning objective and algorithm have. We first review a few techniques in the VI and MCMC literature that tackle the expressiveness problem (i.e. limited expressiveness of the approximate posterior) and the optimization problem (i.e. the mode-seeking problem) in inference. We focus on the latter. These are summarized in Section 3. We introduce Annealed Variational Objectives (AVO) in Section 4, which aim to satisfy the criteria that the alpha- and beta-annealing schemes seek to achieve separately. Finally, in Section 5 we demonstrate the biasing effect of VI with beta-annealing, and demonstrate the relative robustness and effectiveness of the proposed method.
2 Related work
Naturally, recent works in VI have focused on reducing representational bias, especially in the setting of amortized VI, also known as the Variational AutoEncoder (VAE) (Kingma and Welling, 2014; Rezende et al., 2014), by (1) finding more expressive families of variational distributions without losing computational tractability. Explicit density methods include Rezende and Mohamed (2015); Ranganath et al. (2016); Kingma et al. (2016); Tomczak and Welling (2016, 2017); Berg et al. (2018); Huang et al. (2018). Other methods include implicit density methods (Huszár, 2017; Mescheder et al., 2017; Shi et al., 2018). A second line of research focuses on (2) reducing the amortization error introduced by the use of a conditional network (Cremer et al., 2018; Krishnan et al., 2017; Marino et al., 2018; Kim et al., 2018).
In terms of non-parametric methods, the Importance Weighted AutoEncoder (IWAE) developed by Burda et al. (2016) uses several samples for evaluating the loss to reduce the variational gap, which can be expensive in scenarios where the decoder is much more complex. Nowozin (2018) further reduces the bias in inference via Jackknife Variational Inference. However, Rainforth et al. (2018) notice that the signal-to-noise ratio of the encoder's gradient vanishes with an increasing number of importance samples (reduced bias), rendering the encoder an ineffective representation model in the limit case. Salimans et al. (2015) combine MCMC with VI, but the inference process requires multiple passes through the decoder as well.
Our method is orthogonal to all of these: assuming we have a rich family of approximate posteriors at hand, it smooths out the objective landscape using annealed objectives specifically for hierarchical VI methods. We also note that it is possible to consider alternative losses to induce different optimization behavior, such as in Li and Turner (2016); Dieng et al. (2017).
3 Background
In what follows, we consider a latent variable model with a joint probability $p_\theta(x, z)$, where $x$ and $z$ are observed and real-valued unobserved variables, and $\theta$ denotes the set of parameters to be learned. Due to the non-conjugate prior-likelihood pair or the nonlinearity of the conditioning, exact inference is often intractable. Direct maximization of the marginal likelihood is generally infeasible because the marginalization involves integration: $p_\theta(x) = \int p_\theta(x, z)\, dz$. Thus, training is usually conducted by maximizing the expected complete data log likelihood (ECLL) over an auxiliary distribution $q$ (when we write the conditional $q(z|x)$, we emphasize the use of a recognition network in amortized VI (Kingma and Welling, 2014; Rezende et al., 2014)):
$$\max_\theta\; \mathbb{E}_{q(z)}\left[\log p_\theta(x, z)\right]$$
When the exact posterior is tractable, one could choose $q(z) = p_\theta(z|x)$, which leads to the well-known Expectation-Maximization (EM) algorithm. When exact inference is not possible, we need to approximate the true posterior, usually by sampling from a Markov chain in MCMC or from a variational distribution in VI. The approximation induces bias in learning the model parameters $\theta$, as
$$\mathbb{E}_{q(z)}\left[\log p_\theta(x, z)\right] + H(q(z)) = \log p_\theta(x) - D_{\mathrm{KL}}\left(q(z) \,\|\, p_\theta(z|x)\right)$$
That is to say, maximizing ECLL increases the marginal likelihood of the data while biasing the true posterior to be more like the auxiliary distribution. The second effect vanishes when $q(z)$ approximates $p_\theta(z|x)$ better.
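For readers who want the intermediate steps, the two effects can be read off from a standard decomposition of the ECLL. The notation ($q(z)$ for the auxiliary distribution, $p_\theta$ for the model, $H$ for entropy) is assumed here for illustration:

```latex
\begin{align}
\mathbb{E}_{q(z)}\left[\log p_\theta(x, z)\right]
  &= \mathbb{E}_{q(z)}\left[\log p_\theta(x) + \log p_\theta(z|x)\right] \\
  &= \log p_\theta(x)
     + \mathbb{E}_{q(z)}\left[\log \frac{p_\theta(z|x)}{q(z)}\right]
     + \mathbb{E}_{q(z)}\left[\log q(z)\right] \\
  &= \log p_\theta(x)
     - D_{\mathrm{KL}}\left(q(z) \,\|\, p_\theta(z|x)\right)
     - H(q).
\end{align}
```

Since $H(q)$ does not depend on $\theta$, maximizing the ECLL over $\theta$ with $q$ held fixed raises $\log p_\theta(x)$ and simultaneously lowers the KL term, i.e. it pulls the true posterior towards $q$.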
In this paper we focus on the case of VI, where the auxiliary distribution, i.e. the variational distribution, is learned by maximizing the ELBO, which is equivalent to minimizing $D_{\mathrm{KL}}(q(z) \,\|\, p_\theta(z|x))$. Due to the zero-forcing property of the KL, $q$ tends to be unimodal and more concentrated. We emphasize that a highly flexible parametric form of $q$ can potentially alleviate this problem but is orthogonal to the loss function that is used. Before going into the technical details of the choice of $q$ and loss function, we discuss a few implications and effects that approximate inference has in practice:
1. At initialization, the true posterior is likely to be multimodal. Having a unimodal $q$ helps to regularize the latent space, biasing the true posterior towards being unimodal as well. Depending on the type of data being modeled, this may be a good property to have. A unimodal approximate posterior also means less noise when sampling from $q$. This facilitates learning by providing lower-variance gradient estimates.
2. In cases where the true posterior has to be multimodal, biasing the posterior to be unimodal inhibits the model from learning the true generative process of the data. Allowing sufficient uncertainty in the approximate posterior encourages the learning process to explore the spurious modes in the true posterior.
Beta-annealing facilitates point 1 by lowering the penalty of the prior-contrastive term, allowing the estimate to be sharp. Alpha-annealing encourages exploration (point 2) by lowering the penalty of the cross-entropy term, which increases the relative importance of the entropy term and therefore the tendency to explore the latent space.
3.1 Assumption of the variational family
Recent works have shown promising results from using more expressive parametric forms of the variational distribution. This allows the variational family to cover a wider range of distributions, ideally including the true posterior that we seek to approximate. For instance, the Hierarchical Variational Inference (HVI) methods (Ranganath et al., 2016) are a generic family of methods that subsume discrete mixture proposals (e.g. mixtures of Gaussians), auxiliary variable methods (Agakov and Barber, 2004; Maaløe et al., 2016), and normalizing flows (Rezende and Mohamed, 2015) as subfamilies.
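In HVI, we use a latent variable model $q(z_T) = \int q(z_T, z_{t<T})\, dz_{t<T}$ as the approximate posterior, with the index $t \in [0, T]$ indicating the latent variable (here we consider a sequence of latent variables of the same dimension as $z_T$, but in principle the dimensionality can vary). Thus the entropy term $-\mathbb{E}_{q(z_T)}[\log q(z_T)]$ is intractable, and needs to be approximated. One can lower bound this term by introducing a reverse network $r(z_{t<T}|z_T)$:
$$-\mathbb{E}_{q(z_T)}\left[\log q(z_T)\right] \geq -\mathbb{E}_{q(z_T)}\left[\log q(z_T) + D_{\mathrm{KL}}\left(q(z_{t<T}|z_T) \,\|\, r(z_{t<T}|z_T)\right)\right] = -\mathbb{E}_{q(z_T, z_{t<T})}\left[\log \frac{q(z_T|z_{t<T})\, q(z_{t<T})}{r(z_{t<T}|z_T)}\right]$$
The variational lower bound then becomes:
$$\mathcal{L}(x) = \mathbb{E}_{q(z_T, z_{t<T})}\left[\log \frac{p(x, z_T)\, r(z_{t<T}|z_T)}{q(z_T|z_{t<T})\, q(z_{t<T})}\right] \tag{3}$$
Since $q(z_T)$ can be seen as an infinite mixture model, in principle it has the capability to represent any posterior distribution given enough capacity in the conditional. In the special case where $q_t(z_t|z_{t-1})$ is deterministic and invertible, we can choose $r_t$ to be its inverse function. The KL term would vanish, and the entropy can be computed recursively via the change-of-variable formula: $q_t(z_t) = q_{t-1}(z_{t-1}) \left|\det \frac{\partial z_t}{\partial z_{t-1}}\right|^{-1}$. Universal approximation results for certain parameterizations of invertible functions have also recently been shown in Huang et al. (2018).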
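The recursive change-of-variable computation for the deterministic, invertible case can be sketched with the simplest possible flow: a chain of scalar affine maps. The affine form, the parameter values, and the function names are assumptions chosen so that the result can be checked against a closed form; real normalizing flows use richer transformations.

```python
import numpy as np

def log_standard_normal(z):
    """Log density of N(0, 1)."""
    return -0.5 * z ** 2 - 0.5 * np.log(2.0 * np.pi)

def affine_flow_log_density(z0, scales, shifts):
    """Push z0 ~ N(0, 1) through z_t = a_t * z_{t-1} + b_t, tracking log q_t(z_t).

    Each step applies log q_t(z_t) = log q_{t-1}(z_{t-1}) - log|a_t|,
    which is the change-of-variable formula for a scalar affine map.
    """
    z = np.asarray(z0, dtype=float)
    log_q = log_standard_normal(z)
    for a, b in zip(scales, shifts):
        z = a * z + b
        log_q = log_q - np.log(np.abs(a))
    return z, log_q

rng = np.random.default_rng(0)
z0 = rng.standard_normal(100_000)
z_T, log_q_T = affine_flow_log_density(z0, scales=[2.0, -0.5, 3.0], shifts=[1.0, 0.0, -2.0])
entropy_estimate = -np.mean(log_q_T)   # Monte Carlo estimate of the entropy of q_T
print(entropy_estimate)
```

Here the composed map is $z_T = -3 z_0 - 3.5$, so $q_T$ is a Gaussian with mean $-3.5$ and standard deviation $3$, and the tracked log density can be compared with that closed form.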
3.2 Loss function tempering: annealed importance sampling
As explained in the introduction, the purpose of alpha-annealing is to let the variational distribution be more exploratory early on during training. Annealed importance sampling (AIS) is an MCMC method that encapsulates this concept (Neal, 2001). In AIS, we consider an extended state space with $z_0, \ldots, z_T$, where $z_0$ is sampled from a simple distribution (such as the prior distribution $p(z)$), and the subsequent particles are sampled from the transition operators $q_t(z_t|z_{t-1})$. We design a set of intermediate target densities as $\tilde{f}_t = \tilde{f}_T^{\alpha_t} \tilde{f}_0^{1-\alpha_t}$, for $t \in [0, T]$, where $0 = \alpha_0 < \alpha_1 < \cdots < \alpha_T = 1$, $\tilde{f}_T$ is the (unnormalized) true posterior we want to approximate, and $\tilde{f}_0$ is the initial distribution $z_0$ is sampled from. These (Markov chain) transitions are constructed to leave the intermediate targets invariant, and the importance weight is calculated as
$$w = \prod_{t=1}^{T} \frac{\tilde{f}_t(z_{t-1})}{\tilde{f}_{t-1}(z_{t-1})}$$
A downside of AIS is that it requires constructing a long sequence of transitions in order for the intermediate targets to be close enough to one another, and thus for the estimate to be accurate.
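To make the procedure concrete, here is a small AIS sketch on a one-dimensional problem with a known answer. Everything specific in it is an assumption for illustration: the Gaussian initial and target densities, the geometric interpolation with a linear schedule, and the random-walk Metropolis transitions (one of many possible invariant kernels).

```python
import numpy as np

def log_f0(z):
    """Normalised initial density N(0, 3^2), in log scale."""
    return -0.5 * (z / 3.0) ** 2 - np.log(3.0) - 0.5 * np.log(2.0 * np.pi)

def log_fT(z):
    """Unnormalised target exp(-(z - 1)^2 / (2 * 0.5^2)); its true log normaliser is log(0.5 * sqrt(2 pi))."""
    return -0.5 * ((z - 1.0) / 0.5) ** 2

def log_intermediate(z, alpha):
    """Geometric interpolation f_0^(1 - alpha) * f_T^alpha, in log space."""
    return (1.0 - alpha) * log_f0(z) + alpha * log_fT(z)

def ais_log_z(num_chains=5000, num_temps=100, mh_steps=5, step_size=0.5, seed=0):
    rng = np.random.default_rng(seed)
    alphas = np.linspace(0.0, 1.0, num_temps + 1)
    z = 3.0 * rng.standard_normal(num_chains)        # z_0 ~ f_0
    log_w = np.zeros(num_chains)
    for a_prev, a_next in zip(alphas[:-1], alphas[1:]):
        # Weight update: current particle evaluated under consecutive targets.
        log_w += log_intermediate(z, a_next) - log_intermediate(z, a_prev)
        # Random-walk Metropolis steps that leave the new target invariant.
        for _ in range(mh_steps):
            proposal = z + step_size * rng.standard_normal(num_chains)
            log_accept = log_intermediate(proposal, a_next) - log_intermediate(z, a_next)
            accept = np.log(rng.uniform(size=num_chains)) < log_accept
            z = np.where(accept, proposal, z)
    # Log of the average importance weight, computed stably.
    return float(np.logaddexp.reduce(log_w) - np.log(num_chains))

estimate = ais_log_z()
exact = np.log(0.5 * np.sqrt(2.0 * np.pi))
print(f"AIS log Z estimate: {estimate:.3f}, exact: {exact:.3f}")
```

The number of temperatures controls how close consecutive targets are; shrinking `num_temps` makes the weights noisier, which is the cost mentioned above.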
4 Annealed variational objectives
Inspired by AIS and alpha-annealing, we propose to integrate energy tempering into the optimization objective of the variational distribution. As in AIS, we consider an extended state space with random variables $z_0, \ldots, z_T$, and we take the marginal $q_T(z_T)$ as the approximate posterior. We construct $T$ forward transition operators and $T$ backward operators, and assign an intermediate target $\tilde{f}_t$ to each transition pair, defined as an interpolation between the true (unnormalized) posterior and the initial distribution: $\tilde{f}_t = \tilde{f}_T^{\alpha_t} \tilde{f}_0^{1-\alpha_t}$, where $\alpha_t \in [0, 1]$ (in the experiments, we use a linear annealing schedule for all models trained with AVO). Different from AIS, we propose to learn the parametric transition operators. More formally, we define:
$$q_t(z_t, z_{t-1}) = q_{t-1}(z_{t-1})\, q_t(z_t|z_{t-1}), \qquad q_t(z_t) = \int q_t(z_t, z_{t-1})\, dz_{t-1}$$
Set $q_0(z_0) = \tilde{f}_0(z_0)$ (we learn the initial distribution in the case of amortized VI), which is easy to sample from. We define the backward transitions in the reverse ordering, i.e. $r_t(z_{t-1}|z_t)$. We consider maximizing the following objective, which we refer to as the Annealed Variational Objectives (AVO):
$$\max_{q_t(z_t|z_{t-1}),\, r_t(z_{t-1}|z_t)} \mathbb{E}_{q_t(z_t|z_{t-1})\, q_{t-1}(z_{t-1})}\left[\log \frac{\tilde{f}_t(z_t)\, r_t(z_{t-1}|z_t)}{q_t(z_t|z_{t-1})\, q_{t-1}(z_{t-1})}\right] \tag{4}$$
for $t \in [1, T]$, by taking the partial derivative with respect to the parameters of the $t$'th transition (fixing the intermediate marginal $q_{t-1}(z_{t-1})$; in practice, this is achieved by disconnecting the gradient).
Intuitively, the goal of each forward transition is to stochastically transform the samples drawn from the previous step into the intermediate target distribution assigned to it. Since each intermediate target differs only slightly from the previous one, each transition only needs to correct the marginal samples from the previous step incrementally. Also, in the case of amortized VI, we set $\tilde{f}_0$ to be the prior distribution of the VAE, so each intermediate target has a more widely distributed density function, which facilitates exploration by the transition operators.
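One way to picture a single per-transition objective is as a Monte Carlo average of a log-ratio between "target times backward transition" and "previous marginal times forward transition". The sketch below assumes that form, assumes Gaussian transitions with fixed standard deviation, and uses hypothetical function names; it evaluates the objective only and does not perform the gradient updates (in a real implementation `log_q_prev` would be treated as a constant by the autodiff framework).

```python
import numpy as np

def log_normal(z, mean, std):
    """Log density of N(mean, std^2) evaluated at z."""
    return -0.5 * ((z - mean) / std) ** 2 - np.log(std) - 0.5 * np.log(2.0 * np.pi)

def avo_step_estimate(z_prev, log_q_prev, log_f_t, forward, backward, rng):
    """Monte Carlo estimate of one per-transition objective (assumed form).

    z_prev     : samples from the previous marginal q_{t-1}
    log_q_prev : log q_{t-1}(z_prev), held fixed
    log_f_t    : callable returning the log of the intermediate target
    forward    : (mean_fn, std) of the Gaussian transition q_t(z_t | z_{t-1})
    backward   : (mean_fn, std) of the Gaussian transition r_t(z_{t-1} | z_t)
    """
    f_mean, f_std = forward
    b_mean, b_std = backward
    z_t = f_mean(z_prev) + f_std * rng.standard_normal(z_prev.shape)
    log_q_fwd = log_normal(z_t, f_mean(z_prev), f_std)
    log_r_bwd = log_normal(z_prev, b_mean(z_t), b_std)
    log_ratio = log_f_t(z_t) + log_r_bwd - log_q_fwd - log_q_prev
    return float(np.mean(log_ratio)), z_t

# Sanity check: a reversible AR(1) pair on a standard normal target is optimal,
# so the forward joint equals the backward joint and every log-ratio vanishes.
rng = np.random.default_rng(0)
rho = 0.8
std = np.sqrt(1.0 - rho ** 2)
z_prev = rng.standard_normal(10_000)
value, z_t = avo_step_estimate(
    z_prev,
    log_normal(z_prev, 0.0, 1.0),
    lambda z: log_normal(z, 0.0, 1.0),
    forward=(lambda z: rho * z, std),
    backward=(lambda z: rho * z, std),
    rng=rng,
)
print(value)
```

With a normalized target the objective is a negative KL divergence between the two joints, so it is at most zero; a mismatched pair of transitions (for example an unscaled random walk) yields a strictly negative value.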
4.1 Analysis on the optimality of transition operators
The following proposition shows that if each forward transition $q_t(z_t|z_{t-1})$ and backward transition $r_t(z_{t-1}|z_t)$ is locally optimal, the result is globally consistent, which means the marginals induced by the locally optimal transitions match the intermediate targets $f_t$. The corollary below shows that the optimal transitions in Equation 4 bridge the variational gap.
Proposition 1.
Let $q_t^*(z_t|z_{t-1})$ and $r_t^*(z_{t-1}|z_t)$ be the functional solutions to Equation 4. Then the marginal density functions induced by $q_t^*$ and $r_t^*$ are equal to the intermediate targets $f_t$.
Corollary 1.
Take $f_T$ as the true posterior distribution $p(z|x)$, or simply choose $\tilde{f}_T(z) = p(x, z)$. The variational lower bound in Equation 3 is exact with respect to the marginal likelihood if the optimal transitions $q_t^*$ and $r_t^*$ are used.
4.2 Specification of the transitions
When $\tilde{f}_t$ is close enough to $\tilde{f}_{t-1}$, assuming the marginal $q_{t-1}$ approximates $\tilde{f}_{t-1}$ accurately, the transition only needs to modestly refine the samples from $q_{t-1}$. In light of this, we design the following stochastic refinement operator for both $q_t$ and $r_t$:
where the mean of the conditional normal distribution is defined as follows
The activation functions are chosen to be softplus, sigmoid and ReLU (Nair and Hinton, 2010) (or ELU (Clevert et al., 2016) in the case of amortized VI). In the case of amortized VI, we replace the dot product with the conditional weight normalization (Salimans and Kingma, 2016) operator proposed in Krueger et al. (2017); Huang et al. (2018).
4.3 Loss calibrated AVO
Since the correctness of the marginal trained with AVO depends on the optimality of each transition operator, when used for amortized VI each update will not necessarily improve the marginal as an approximate posterior. Hence, in amortized VI, we consider the following loss calibrated version of AVO (note that in amortized VI the transitions and the intermediate targets all depend on the input $x$, but we omit the notation for simplicity):
$$\max\; a\, \mathcal{L}^{\mathrm{AVO}}_t + (1 - a)\, \mathcal{L}^{\mathrm{ELBO}}$$
for all $t$, where $\mathcal{L}^{\mathrm{AVO}}_t$ denotes the objective in Equation 4, $\mathcal{L}^{\mathrm{ELBO}}$ denotes the lower bound in Equation 3, and $a \in [0, 1]$ is a hyperparameter that determines the weight of AVO used in training. In practice, a naive implementation of this objective can be computationally expensive, so in the amortized VI experiments we select the loss function stochastically (with probability $a$ maximize AVO, otherwise maximize ELBO), which we found helps to make progress in improving the model.
5 Experiments
5.1 Biased noise model
To demonstrate the biasing effect of the approximate posterior in amortized VI, we utilize a toy example in the form of the following data distribution:
(5)  
(6) 
where . The result is the data distribution depicted in Figure 1(a). The data distribution is constructed so that has a bimodal true posterior with a mode at the tail ends of the Gaussian. This is analogous to some real data distribution in which two data points close in the input space can come from two well separated modes in latent space (e.g. someone wearing glasses vs. not wearing glasses).
We train three different models which differ only in the way their approximate posterior is parameterized: (1) a model trained using the IWAE loss with 500 samples, (2) a VAE with a Gaussian approximate posterior, and finally (3) a VAE trained with the AVO loss. We use a decoder as in Equation 6 for the mean of the conditional Gaussian likelihood, with standard deviation of and parameterized as an MLP conditioned on . The densities learned by each model are depicted in Figure 2. We estimate the posterior of the learned generative model in each case, and compare them in Figure 2. At the point , the true posterior is a bimodal distribution as discussed before. Shifting slightly to the left or right in space towards the higher density regions gives a sharper prediction in the latent space at their respective ends. In the unambiguous regions of the data, the posterior is unimodal.
We observe that the VAE encoder predicts the posterior to be centered around . From Figure 2 we can clearly see the zero-forcing effect mentioned before, where the encoder matches only one of the modes of the true posterior. As a consequence of that behavior, the variance of the decoder's prediction at that point is high, resulting in the plot seen in Figure 1(c). The IWAE model, on the other hand, matches the distribution better. With 500 samples per data point in this case, we are effectively training the generative model with a non-parametric posterior that can perfectly fit the true posterior. The price of doing so, however, is that we require many samples in order to train the model. For a simple task such as this one, this approach works, and it serves to demonstrate the benefits of having sufficient uncertainty in the proposal distribution. The proposed AVO method () also approximates the distribution well, but without incurring the same computational cost as IWAE.
5.2 Toy energy fitting
In this experiment, we take a four-mode mixture of Gaussians as the true target energy, and compare HVI trained with ELBO and HVI trained with AVO (with ). In Figure 3 we see that HVI-ELBO overfits two modes out of four, whereas HVI-AVO captures all four modes. Also, the first few layers of HVI-ELBO do not contribute much to the resulting marginal $q_T(z_T)$, whereas each layer of HVI-AVO has to fit its assigned target.
5.3 Quantitative analysis on robustness to beta annealing
We repeat the experiments of Section 5.2 on six different energy functions (summarized in Appendix B) with different linear schedules of beta-annealing. We run 10 trials for each energy function, and perform 2000 stochastic updates with a batch size of 64 and a learning rate of 0.001. We evaluate the resulting marginal according to its negative KL divergence from the true energy function. Since the entropy of $q_T$ is intractable, we estimate it using importance sampling; we elaborate on this in Appendix C. We see that HVI-AVO consistently outperforms HVI-ELBO. While the performance of HVI-ELBO mostly degrades as the beta-annealing schedule is prolonged, HVI-AVO's performance is relatively invariant. Visualization of the learned marginals can be found in Appendix D.
5.4 Amortized inference
We train VAEs using a standard VAE (with a Gaussian approximate posterior), HVI, and HVI with AVO on the Binarized MNIST dataset from Larochelle and Murray (2011), and the Omniglot dataset as used in Burda et al. (2016). In these experiments, we used an MLP for both the encoder and decoder, with gating activation as in Tomczak and Welling (2016). Both the encoder and decoder have 2 hidden layers, each with 300 hidden units. For MNIST, we used a dimension of 40 for the latent space, and 200 dimensions for Omniglot. We used hyperparameter search to determine the batch size, learning rate, and beta-annealing schedule.
Table 1 lists the results from our experiments on MNIST, alongside results from Salimans et al. (2015) combining MCMC in HVI for comparison (see Appendix E). Table 1 lists the results for Omniglot. We find that HVI's approximate posterior trained with AVO tends to have a smaller variational gap (estimated by the difference between test likelihood and ELBO), and also a better test likelihood. It is also worth noting that in the MNIST case, a smaller variational gap translates into a smaller generalization gap and also a better test likelihood, while the same is not true in the Omniglot example. This corroborates our finding in Section 5.1 that the true posterior can be biased towards the approximate posterior, resulting in a smaller variational gap (8.58) but a worse density model (6.70).
6 Conclusion
We find that despite the representational capacity of the chosen family of approximate distributions in VI, the density that can be represented is still limited by the optimization process. We resolve this by incorporating annealed objectives into the training of hierarchical variational methods. Experimentally, we demonstrate (1) our method's robustness to deterministic warmup, (2) the benefits of encouraging exploration and (3) the downside of biasing the true posterior to be unimodal. Our method is orthogonal to finding a richer family of variational distributions, and sheds light on an important optimization issue that has thus far been neglected in the amortized VI literature.
7 Acknowledgements
We would like to thank Massimo Cassia and Aristide Baratin for their useful feedback and discussion.
References
 Abrol et al. (2014) Abrol, F., Mandt, S., Ranganath, R., and Blei, D. (2014). Deterministic annealing for stochastic variational inference. stat, 1050:7.
 Agakov and Barber (2004) Agakov, F. V. and Barber, D. (2004). An auxiliary variational method. In Neural Information Processing.
 Berg et al. (2018) Berg, R. v. d., Hasenclever, L., Tomczak, J. M., and Welling, M. (2018). Sylvester normalizing flows for variational inference. arXiv preprint arXiv:1803.05649.
 Bowman et al. (2016) Bowman, S. R., Vilnis, L., Vinyals, O., Dai, A. M., Jozefowicz, R., and Bengio, S. (2016). Generating sentences from a continuous space. In SIGNLL Conference on Computational Natural Language Learning (CONLL).
 Burda et al. (2016) Burda, Y., Grosse, R., and Salakhutdinov, R. (2016). Importance weighted autoencoders. In International Conference on Learning Representations.
 Chung et al. (2015) Chung, J., Kastner, K., Dinh, L., Goel, K., Courville, A. C., and Bengio, Y. (2015). A recurrent latent variable model for sequential data. In Advances in neural information processing systems.
 Clevert et al. (2016) Clevert, D.-A., Unterthiner, T., and Hochreiter, S. (2016). Fast and accurate deep network learning by exponential linear units (ELUs). In International Conference on Learning Representations.
 Cremer et al. (2018) Cremer, C., Li, X., and Duvenaud, D. (2018). Inference suboptimality in variational autoencoders. In International Conference on Machine Learning.
 Dieng et al. (2017) Dieng, A. B., Tran, D., Ranganath, R., Paisley, J., and Blei, D. (2017). Variational inference via upper bound minimization. In Advances in Neural Information Processing Systems.
 Edwards and Storkey (2017) Edwards, H. and Storkey, A. (2017). Towards a neural statistician. In International Conference on Learning Representations.
 Fraccaro et al. (2016) Fraccaro, M., Sønderby, S. K., Paquet, U., and Winther, O. (2016). Sequential neural models with stochastic layers. In Advances in neural information processing systems, pages 2199–2207.
 Gulrajani et al. (2017) Gulrajani, I., Kumar, K., Ahmed, F., Taiga, A. A., Visin, F., Vazquez, D., and Courville, A. (2017). Pixelvae: A latent variable model for natural images. In International Conference on Learning Representations.
 Huang et al. (2018) Huang, C.-W., Krueger, D., Lacoste, A., and Courville, A. (2018). Neural autoregressive flows. In International Conference on Machine Learning.
 Huszár (2017) Huszár, F. (2017). Variational inference using implicit distributions. arXiv preprint arXiv:1702.08235.
 Karl et al. (2016) Karl, M., Soelch, M., Bayer, J., and van der Smagt, P. (2016). Deep variational Bayes filters: Unsupervised learning of state space models from raw data. In International Conference on Learning Representations.
 Katahira et al. (2008) Katahira, K., Watanabe, K., and Okada, M. (2008). Deterministic annealing variant of variational Bayes method. In Journal of Physics: Conference Series, volume 95, page 012015. IOP Publishing.
 Kim et al. (2018) Kim, Y., Wiseman, S., Miller, A. C., Sontag, D., and Rush, A. M. (2018). Semi-amortized variational autoencoders. In International Conference on Machine Learning.
 Kingma et al. (2016) Kingma, D. P., Salimans, T., Jozefowicz, R., Chen, X., Sutskever, I., and Welling, M. (2016). Improved variational inference with inverse autoregressive flow. In Advances in Neural Information Processing Systems.
 Kingma and Welling (2014) Kingma, D. P. and Welling, M. (2014). Auto-encoding variational Bayes. In International Conference on Learning Representations.
 Krishnan et al. (2017) Krishnan, R. G., Liang, D., and Hoffman, M. (2017). On the challenges of learning with inference networks on sparse, high-dimensional data. arXiv preprint arXiv:1710.06085.
 Krueger et al. (2017) Krueger, D., Huang, C.-W., Islam, R., Turner, R., Lacoste, A., and Courville, A. (2017). Bayesian hypernetworks. arXiv preprint arXiv:1710.04759.
 Larochelle and Murray (2011) Larochelle, H. and Murray, I. (2011). The neural autoregressive distribution estimator. In International Conference on Artificial Intelligence and Statistics.
 Li and Turner (2016) Li, Y. and Turner, R. E. (2016). Rényi divergence variational inference. In Advances in Neural Information Processing Systems.
 Maaløe et al. (2016) Maaløe, L., Sønderby, C. K., Sønderby, S. K., and Winther, O. (2016). Auxiliary deep generative models. In International Conference on Machine Learning.
 Mandt et al. (2016) Mandt, S., McInerney, J., Abrol, F., Ranganath, R., and Blei, D. (2016). Variational tempering. In International Conference on Artificial Intelligence and Statistics.
 Marino et al. (2018) Marino, J., Yue, Y., and Mandt, S. (2018). Iterative amortized inference. In International Conference on Machine Learning.
 Mescheder et al. (2017) Mescheder, L., Nowozin, S., and Geiger, A. (2017). Adversarial variational bayes: Unifying variational autoencoders and generative adversarial networks. In International Conference on Machine Learning.
 Nair and Hinton (2010) Nair, V. and Hinton, G. E. (2010). Rectified linear units improve restricted boltzmann machines. In International Conference on Machine Learning.
 Neal (2001) Neal, R. M. (2001). Annealed importance sampling. Statistics and computing, 11(2).
 Nowozin (2018) Nowozin, S. (2018). Debiasing evidence approximations: On importance-weighted autoencoders and jackknife variational inference. In International Conference on Learning Representations.
 Raiko et al. (2007) Raiko, T., Valpola, H., Harva, M., and Karhunen, J. (2007). Building blocks for variational bayesian learning of latent variable models. Journal of Machine Learning Research.
 Rainforth et al. (2018) Rainforth, T., Kosiorek, A. R., Le, T. A., Maddison, C. J., Igl, M., Wood, F., and Teh, Y. W. (2018). Tighter variational bounds are not necessarily better. In International Conference on Machine Learning.
 Ranganath et al. (2016) Ranganath, R., Tran, D., and Blei, D. (2016). Hierarchical variational models. In International Conference on Machine Learning.
 Rezende and Mohamed (2015) Rezende, D. J. and Mohamed, S. (2015). Variational inference with normalizing flows. In International Conference on Machine Learning.
 Rezende et al. (2014) Rezende, D. J., Mohamed, S., and Wierstra, D. (2014). Stochastic backpropagation and approximate inference in deep generative models. In International Conference on Machine Learning.
 Salimans et al. (2015) Salimans, T., Kingma, D., and Welling, M. (2015). Markov chain Monte Carlo and variational inference: Bridging the gap. In International Conference on Machine Learning.
 Salimans and Kingma (2016) Salimans, T. and Kingma, D. P. (2016). Weight normalization: A simple reparameterization to accelerate training of deep neural networks. In Advances in Neural Information Processing Systems.
 Shabanian et al. (2017) Shabanian, S., Arpit, D., Trischler, A., and Bengio, Y. (2017). Variational Bi-LSTMs. arXiv preprint arXiv:1711.05717.
 Shi et al. (2018) Shi, J., Sun, S., and Zhu, J. (2018). Kernel implicit variational inference. In International Conference on Learning Representations.
 Sønderby et al. (2016) Sønderby, C. K., Raiko, T., Maaløe, L., Sønderby, S. K., and Winther, O. (2016). Ladder variational autoencoders. In Advances in neural information processing systems, pages 3738–3746.
 Tomczak and Welling (2016) Tomczak, J. M. and Welling, M. (2016). Improving variational autoencoders using householder flow. arXiv preprint arXiv:1611.09630.
 Tomczak and Welling (2017) Tomczak, J. M. and Welling, M. (2017). Improving variational autoencoders using convex combination linear inverse autoregressive flow. In Benelearn.
 Turner and Sahani (2011) Turner, R. E. and Sahani, M. (2011). Two problems with variational expectation maximisation for time-series models. Bayesian Time series models, 1(3.1):3–1.
Appendix A Proof of optimal transitions
This is the proof of Proposition 1.
Proof.
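The base case $t = 0$ is trivially true by the choice of initial distribution. We now look at the case $t > 0$. Taking the functional derivative of the objective defined by Equation 4, along with a Lagrange multiplier to satisfy the second axiom of probability (both $q_t(\cdot|z_{t-1})$ and $r_t(\cdot|z_t)$ integrate to one), with respect to the conditionals $q_t(z_t|z_{t-1})$ and $r_t(z_{t-1}|z_t)$ yields:
$$q_t^*(z_t|z_{t-1}) = \frac{f_t(z_t)\, r_t^*(z_{t-1}|z_t)}{Z_q(z_{t-1})}, \qquad r_t^*(z_{t-1}|z_t) = \frac{q_{t-1}(z_{t-1})\, q_t^*(z_t|z_{t-1})}{Z_r(z_t)} \tag{7}$$
where we take $q_{t-1} = f_{t-1}$, assuming it is optimal, and the $Z$'s are normalizing constants given their arguments; they are proper distributions since the numerators above are both proper joint probabilities. Taking the product of the two equations yields
$$Z_q(z_{t-1})\, Z_r(z_t) = f_{t-1}(z_{t-1})\, f_t(z_t)$$
Thus $Z_q(z_{t-1}) = f_{t-1}(z_{t-1})$ and $Z_r(z_t) = f_t(z_t)$ by marginalization. By definition and the induction assumption, the resulting marginal of the optimal forward transition is:
$$q_t^*(z_t) = \int q_{t-1}(z_{t-1})\, q_t^*(z_t|z_{t-1})\, dz_{t-1} = \int f_t(z_t)\, r_t^*(z_{t-1}|z_t)\, dz_{t-1} = f_t(z_t)$$
Similarly, the resulting marginal of the optimal backward transition is:
$$\int f_t(z_t)\, r_t^*(z_{t-1}|z_t)\, dz_t = \int f_{t-1}(z_{t-1})\, q_t^*(z_t|z_{t-1})\, dz_t = f_{t-1}(z_{t-1})$$
Since both the base case and the inductive step are true, by mathematical induction, $q_t^* = f_t$ for all $t$. ∎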
This is the proof of Corollary 1.
Appendix B Specification of the toy energy functions
Visualization of the toy energy functions, , are presented in Figure 5. Below are the corresponding formulas.
(a)  

(b)  
(c)  
(d)  
(e)  
(f)  
Appendix C Estimating
Similar to Burda et al. (2016) we upper bound the negative KL:
Moreover, if is bounded, then by the strong law of large numbers, converges almost surely to . That is, the gap is closed by using a large number of samples of generated by . In our case, or , so both and further factorize as a product of conditional transitions. The bound is tight with around 100 samples in our examples. We use 2000 samples simply to reduce the variance of the KL estimate used to evaluate the learned .
Appendix D Visualization of robustness to beta-annealing
We further simulate the scenario of VAE training in which beta-annealing of the objective is applied to encourage the approximate posterior to be more deterministic. Since there is no concept of a prior distribution in the case of an energy function, we only decrease the weight of all entropy and cross-entropy terms of the transition operators. The annealing coefficient is initially set to be and annealed back to linearly throughout of the training time (2500 updates). We see in Figure 7 that HVI-ELBO collapses to only one mode, and HVI-AVO remains robust even in the face of an annealing scheme that discourages exploration.
The same can be observed with deterministic transitions, i.e. normalizing flows. We experimented with Deep Sigmoidal Flows, a universal change-of-variable model proposed by Huang et al. (2018), and found that the lack of exploration was worsened (we hypothesize due to the lack of noise injection in the transition operators) and can be remedied by AVO as well.
Appendix E Discussion on Table 1
It is important to note that the results in Salimans et al. (2015) are not directly comparable to ours due to different architectures. The architecture we used was from Tomczak and Welling (2016), who provide a baseline for using Householder flow, with an estimate of 87.68 on negative log likelihood (NLL) using 5000 samples for the importance weighted lower bound (whereas we use 1000). Tomczak and Welling (2017) further experimented with IAF, with an estimate of 86.70 on NLL, and a convex combination of IAFs yields an NLL of 86.10. We showed that our AVO can improve the performance of HVI from 87.62, similar to 21 Householder flow, to 86.06, which is better than all the above-mentioned flow-based methods applied to the same encoder and decoder architecture.