1 Introduction
Variational AutoEncoders (VAEs), introduced by Kingma and Welling [2014] and Rezende et al. [2014], are popular techniques to carry out inference and learning in complex latent variable models. However, the standard meanfield parametrization of the approximate posterior distribution can limit its flexibility. Recent work has sought to augment the VAE approach by sampling from the VAE posterior approximation and transforming these samples through mappings with additional trainable parameters to achieve richer posterior approximations. The most popular application of this idea is the Normalizing Flows (NFs) approach Rezende and Mohamed [2015] in which the samples are deterministically evolved through a series of parameterized invertible transformations called a flow. NFs have demonstrated success in various domains Berg et al. [2018], Kingma et al. [2016], but the flows do not explicitly use information about the target posterior. Therefore, it is unclear whether the improved performance is caused by an accurate posterior approximation or simply a result of overparametrization. The related Hamiltonian Variational Inference (HVI) Salimans et al. [2015] instead stochastically evolves the base samples according to Hamiltonian Monte Carlo (HMC) Neal et al. [2011] and thus uses target information, but relies on defining reverse dynamics in the flow, which, as we will see, turns out to be unnecessary and suboptimal.
One of the key components in the formulation of VAEs is the maximization of the evidence lower bound (ELBO). The main idea put forward in Salimans et al. [2015] is that it is possible to use
MCMC iterations to obtain an unbiased estimator of the ELBO and its gradients. This estimator is obtained using an importance sampling argument on an augmented space, with the importance distribution being the joint distribution of the
states of the ‘forward’ Markov chain, while the augmented target distribution is constructed using a sequence of ‘reverse’ Markov kernels such that it admits the original posterior distribution as a marginal. The performance of this estimator is strongly dependent on the selection of these forward and reverse kernels, but no clear guideline for selection has been provided therein. By linking this approach to earlier work by Del Moral et al. [2006], we show how to select these components. We focus, in particular, on the use of timeinhomogeneous Hamiltonian dynamics, proposed originally in Neal [2005]. This method uses reverse kernels which are optimal for reducing variance of the likelihood estimators and allows for simple calculation of the approximate posteriors of the latent variables. Additionally, we can easily use the reparameterization trick to calculate unbiased gradients of the ELBO with respect to the parameters of interest. The resulting method, which we refer to as the Hamiltonian Variational AutoEncoder (HVAE), can be thought of as a Normalizing Flow scheme in which the flow depends explicitly on the target distribution. This combines the best properties of HVI and NFs, resulting in targetinformed and inhomogeneous deterministic Hamiltonian dynamics, while being scalable to large datasets and high dimensions.2 Evidence Lower Bounds, MCMC and Hamiltonian Importance Sampling
2.1 Unbiased likelihood and evidence lower bound estimators
For data and parameter , consider the likelihood function
where are some latent variables. If we assume we have access to a strictly positive unbiased estimate of , denoted , then
(1) 
with ,
denoting all the random variables used to compute
. Here, denotes additional parameters of the sampling distribution. We emphasize that depends on both and potentially as this is not done notationally. By applying Jensen’s inequality to (1), we thus obtain, for all and ,(2) 
It can be shown that decreases as the variance of decreases; see, e.g., Burda et al. [2016], Maddison et al. [2017]. The standard variational framework corresponds to and , while the Importance Weighted AutoEncoder (IWAE) Burda et al. [2016] with importance samples corresponds to , and
In the general case, we do not have an analytical expression for . When performing stochastic gradient ascent for variational inference, however, we only require an unbiased estimator of . This is given by if the reparameterization trick Glasserman [1991], Kingma and Welling [2014] is used, i.e. , and is a ‘smooth enough’ function of . As a guiding principle, one should attempt to obtain a lowvariance estimator of , which typically translates into a lowvariance estimator of . We can analogously optimize with respect to through stochastic gradient ascent to obtain tighter bounds.
2.2 Unbiased likelihood estimator using timeinhomogeneous MCMC
Salimans et al. [2015] propose to build an unbiased estimator of by sampling a (potentially timeinhomogeneous) ‘forward’ Markov chain of length using and for . Using artificial ‘reverse’ Markov transition kernels for , it follows easily from an importance sampling argument that
(3) 
is an unbiased estimator of as long as the ratio in (3) is welldefined. In the framework of the previous section, we have and is given by the denominator of (3). Although we did not use measuretheoretic notation, the kernels are typically MCMC kernels which do not admit a density with respect to the Lebesgue measure (e.g. the Metropolis–Hastings kernel). This makes it difficult to define reverse kernels for which (3) is welldefined as evidenced in Salimans et al. [2015, Section 4] or Wolf et al. [2016]. The estimator (3) was originally introduced in Del Moral et al. [2006] where generic recommendations are provided for this estimator to admit a low relative variance: select as MCMC kernels which are invariant, or approximately invariant as in Heng et al. [2017], with respect to , where is a sequence of artificial densities bridging to smoothly using . It is also established in Del Moral et al. [2006] that, given any sequence of kernels , the sequence of reverse kernels minimizing the variance of is given by where denotes the marginal density of under the forward dynamics, yielding
(4) 
For stochastic forward transitions, it is typically not possible to compute and the corresponding estimator (4) as the marginal densities do not admit closedform expressions. However this suggests that should be approximating and various schemes are presented in Del Moral et al. [2006]. As noticed by Del Moral et al. [2006] and Salimans et al. [2015], Annealed Importance Sampling (AIS) Neal [2001] – also known as the JarzynskiCrooks identity (Crooks [1998], Jarzynski [1997]) in physics – is a special case of (3) using, for , a invariant MCMC kernel and the reversal of this kernel as the reverse transition kernel ^{1}^{1}1The reversal of a invariant kernel is given by . If is reversible then .. This choice of reverse kernels is suboptimal but leads to a simple expression for estimator (3
). AIS provides stateoftheart estimators of the marginal likelihood and has been widely used in machine learning. Unfortunately, it typically cannot be used in conjunction with the reparameterization trick. Indeed, although it is very often possible to reparameterize the forward simulation of
in terms of the deterministic transformation of some random variables independent of and , this mapping is not continuous because the MCMC kernels it uses typically include singular components. In this context, although (1) holds, is not an unbiased estimator of ; see, e.g., Glasserman [1991] for a careful discussion of these issues.2.3 Using Hamiltonian dynamics
Given the empirical success of Hamiltonian Monte Carlo (HMC) Hoffman and Gelman [2014], Neal et al. [2011], various contributions have proposed to develop algorithms exploiting Hamiltonian dynamics to obtain unbiased estimates of the ELBO and its gradients when . This was proposed in Salimans et al. [2015]. However, the algorithm suggested therein relies on a timehomogeneous leapfrog where momentum resampling is performed at each step and no Metropolis correction is used. It also relies on learned reverse kernels. To address the limitations of Salimans et al. [2015], Wolf et al. [2016] have proposed to include some Metropolis acceptance steps, but they still limit themselves to homogeneous dynamics and their estimator is not amenable to the reparameterization trick. Finally, in Hoffman [2017], an alternative approach is used where the gradient of the true likelihood, , is directly approximated by using Fisher’s identity and HMC to obtain approximate samples from . However, the MCMC bias can be very significant when one has multimodal latent posteriors and is strongly dependent on both the initial distribution and .
Here, we follow an alternative approach where we use Hamiltonian dynamics that are timeinhomogeneous as in Del Moral et al. [2006] and Neal [2001], and use optimal reverse Markov kernels to compute . This estimator can be used in conjunction with the reparameterization trick to obtain an unbiased estimator of . This method is based on the Hamiltonian Importance Sampling (HIS) scheme proposed in Neal [2005]; one can also find several instances of related ideas in physics Jarzynski [2000], SchöllPaschinger and Dellago [2006]. We work in an extended space , introducing momentum variables to pair with the position variables , with new target . Essentially, the idea is to sample using deterministic transitions so that , where and , define diffeomorphisms corresponding to a timediscretized and inhomogeneous Hamiltonian dynamics. In this case, it is easy to show that
(5) 
It can also be shown that this is nothing but a special case of (3) (on the extended positionmomentum space) using the optimal reverse kernels^{2}^{2}2Since this is a deterministic flow, the density can be evaluated directly. However, direct evaluation corresponds to optimal reverse kernels in the deterministic case. . This setup is similar to the one of Normalizing Flows Rezende and Mohamed [2015], except here we use a flow informed by the target distribution. Salimans et al. [2015] is in fact mentioned in Rezende and Mohamed [2015], but the flow therein is homogeneous and yields a highvariance estimator of the normalizing constants even if is used, as demonstrated in our simulations in section 4.
Under these dynamics, the estimator defined in (5) can be rewritten as
(6) 
Hence, if we can simulate using , where and is a smooth mapping, then we can use the reparameterization trick since are also smooth mappings.
In our case, the deterministic transformation has two components: a leapfrog step, which discretizes the Hamiltonian dynamics, and a tempering step, which adds inhomogeneity to the dynamics and allows us to explore isolated modes of the target Neal [2005]. To describe the leapfrog step, we first define the potential energy of the system as for a single datapoint . Leapfrog then takes the system from into via the following transformations:
(7)  
(8)  
(9) 
where are the individual leapfrog step sizes per dimension, denotes elementwise multiplication, and the gradient of is taken with respect to . The composition of equations (7)  (9) has unit Jacobian since each equation describes a shear transformation. For the tempering portion, we multiply the momentum output of each leapfrog step by for where . We consider two methods for setting the values . First, fixed tempering involves allowing an inverse temperature to vary, and then setting , where each is a deterministic function of and . In the second method, known as free tempering, we allow each of the values to be learned, and then set the initial inverse temperature to . For both methods, the tempering operation has Jacobian . We obtain by composing the leapfrog integrator with the cooling operation, which implies that the Jacobian is given by , which in turns implies
The only remaining component to specify is the initial distribution. We will set , where will be referred to as the variational prior over the latent variables and is the canonical momentum distribution at inverse temperature . The full procedure to generate an unbiased estimate of the ELBO from (2) on the extended space for a single point and fixed tempering is given in Algorithm 1. The set of variational parameters to optimize contains the flow parameters and , along with additional parameters of the variational prior.^{3}^{3}3We avoid reference to a mass matrix throughout this formulation because we can capture the same effect by optimizing individual leapfrog step sizes per dimension as pointed out in [Neal et al., 2011, Section 4.2]. We can see from (6) that we will obtain unbiased gradients with respect to and from our estimate of the ELBO if we write , for and , provided we are not also optimizing with respect to parameters of the variational prior. We will require additional reparameterization when we elect to optimize with respect to the parameters of the variational prior, but this is generally quite easy to implement on a problemspecific basis and is wellknown in the literature; see, e.g. Kingma and Welling [2014], Rezende and Mohamed [2015], Rezende et al. [2014] and section 4.
3 Stochastic Variational Inference
We will now describe how to use Algorithm 1 within a stochastic variational inference procedure, moving to the setting where we have a dataset and for all . In this case, we are interested in finding
(10) 
where is the empirical measure of the data. We must resort to variational methods since cannot generally be calculated exactly and instead maximize the surrogate ELBO objective function
(11) 
for defined as in (2). We can now turn to stochastic gradient ascent (or a variant thereof) to jointly maximize (11) with respect to and by approximating the expectation over using minibatches of observed data.
For our specific problem, we can reduce the variance of the ELBO calculation by analytically evaluating some terms in the expectation (i.e. RaoBlackwellization) as follows:
(12) 
where we write under reparameterization. We can now consider the output of Algorithm 1 as taking a sample from the inner expectation for a given sample from the outer expectation. Algorithm 2 provides a full procedure to stochastically optimize (12). In practice, we take the gradients of (12
) using automatic differentation packages. This is achieved by using TensorFlow
Abadi et al. [2015] in our implementation.4 Experiments
In this section, we discuss the experiments used to validate our method. We first test HVAE on an example with a tractable full log likelihood (where no neural networks are needed), and then perform largerscale tests on the MNIST dataset. Code is available online.^{4}^{4}4https://github.com/anthonycaterini/hvaenips All models were trained using TensorFlow Abadi et al. [2015].
4.1 Gaussian Model
The generative model that we will consider first is a Gaussian likelihood with an offset and a Gaussian prior on the mean, given by
where is constrained to be diagonal. We will again write to denote an observed dataset under this model, where each . In this example, we have . The goal of the problem is to learn the model parameters , where and .
Here, we have only one latent variable generating the entire set of data. Thus, our variational lower bound is now given by
for the variational posteroir approximation . We note that this is not exactly the same as the autoencoder setting, in which an individual latent variable is associated with each observation, however it provides a tractable framework to analyze effectiveness of various variational inference methods. We also note that we can calculate the loglikelihood exactly in this case, but we use variational methods for the sake of comparison.
From the model, we see that the logarithm of the unnormalized target is given by
For this example, we will use a HVAE with variational prior equal to the true prior, i.e. , and fixed tempering. The potential, given by , has gradient
The set of variational parameters here is , where contains the perdimension leapfrog stepsizes and is the initial inverse temperature. We constrain each of the leapfrog step sizes such that for some , for all – this is to prevent the leapfrog discretization from entering unstable regimes. Note that in this example; in particular, we do not optimize any parameters of the variational prior and thus require no further reparameterization.
We will compare HVAE with a basic Variational Bayes (VB) scheme with meanfield approximate posterior , where is diagonal and denotes the set of learned variational parameters. We will also include a planar normalizing flow of the form of equation (10) in Rezende and Mohamed [2015], but with the same flow parameters across iterations to keep the number of variational parameters of the same order as the other methods. The variational prior here is also set to the true prior as in HVAE above. The log variational posterior is given by equation (13) of Rezende and Mohamed [2015], where .
We set our true offset vector to be
, and our scale parameters to range quadratically from , reaching a minimum at , and increasing back to .^{6}^{6}6When is even, does not exist, although we could still considerto be the location of the minimum of the parabola defining the true standard deviations.
All experiments have and all training was done using RMSProp Tieleman and Hinton [2012] with a learning rate of .To compare the results across methods, we train each method ten times on different datasets. For each training run, we calculate , where is the estimated value of given by the variational method on a particular run, and plot the average of this across the 10 runs for various dimensions in (a). We note that, as the dimension increases, HVAE performs best in parameter estimation. The VB method suffers most on prediction of as the dimension increases, whereas the NF method does poorly on predicting .
We also compare HVAE with tempering to HVAE without tempering, i.e. where is fixed to in training. This has the effect of making our Hamiltonian dynamics homogeneous in time. We perform the same comparison as above and present the results in (b). We can see that the tempered methods perform better than their nontempered counterparts; this shows that timeinhomogeneous dynamics are a key ingredient in the effectiveness of the method.
4.2 Generative Model for MNIST
The next experiment that we consider is using HVAE to improve upon a convolutional variational autoencoder (VAE) for the binarized MNIST handwritten digit dataset. Again, our training data is
, where each for . The generative model is as follows:for , where is the component of , is the latent variable associated with , and
is a convolutional neural network (i.e. the
generative network, or encoder) parametrized by the model parameters . This is the standard generative model used in VAEs in which each pixel in the image is conditionally independent given the latent variable. The VAE approximate posterior – and the HVAE variational prior across the latent variables in this case – is given by , where and are separate outputs of the same neural network (the inference network, or encoder) parametrized by , and is constrained to be diagonal.We attempted to match the network structure of Salimans et al. [2015]. The inference network consists of three convolutional layers, each with filters of size
and a stride of 2. The convolutional layers output 16, 32, and 32 feature maps, respectively. The output of the third layer is fed into a fullyconnected layer with hidden dimension
, whose output is then fully connected to the output means and standard deviations each of size. Softplus activation functions are used throughout the network except immediately before the outputted mean. The generative network mirrors this structure in reverse, replacing the stride with upsampling as in
Dosovitskiy et al. [2015] and replicated in Salimans et al. [2015].We apply HVAE on top of the base convolutional VAE. We evolve samples from the variational prior according to Algorithm 1 and optimize the new objective given in (12). We reparameterize as , for and , to generate unbiased gradients of the ELBO with respect to . We select various values for and set . In contrast with normalizing flows, we do not need our flow parameters and to be outputs of the inference network because our flow is guided by the target. This allows our method to have fewer overall parameters than normalizing flow schemes. We use the standard stochastic binarization of MNIST Salakhutdinov and Murray [2008] as training data, and train using Adamax Kingma and Ba [2014] with learning rate
. We also employ early stopping by halting the training procedure if there is no improvement in the loss on validation data over 100 epochs.
To evaluate HVAE after training is complete, we estimate outofsample negative log likelihoods (NLLs) using 1000 importance samples from the HVAE approximate posterior. For each trained model, we estimate NLL three times, noting that the standard deviation over these three estimates is no larger than 0.12 nats. We report the average NLL values over either two or three different initializations (in addition to the three NLL estimates for each trained model) for several choices of tempering and leapfrog steps in Table 1. A full accounting of the tests is given in the supplementary material. We also consider an HVAE scheme in which we allow to vary across layers of the flow and report the results.
From Table 1, we notice that generally increasing the inhomogeneity in the dynamics improves the test NLL values. For example, free tempering is the most successful tempering scheme, and varying the leapfrog step size across layers also improves results. We also notice that increasing the number of leapfrog steps does not always improve the performance, as provides the best results in free tempering schemes. We believe that the improvement in HVAE over the base VAE scheme can be attributed to a more expressive approximate posterior, as we can see that samples from the HVAE approximate posterior exhibit nonnegligible covariance across dimensions. As in Salimans et al. [2015], we are also able to improve upon the base model by adding our timeinhomogeneous Hamiltonian dynamics on top, but in a simplified regime without referring to learned reverse kernels. Rezende and Mohamed [2015] report only lower bounds on the loglikelihood for NFs, which are indeed lower than our loglikelihood estimates, although they use a much larger number of variational parameters.
fixed across layers  varied across layers  

Free  Fixed  None  Free  Fixed  None  
N/A  83.32  83.17  N/A  N/A  N/A  
83.09  83.26  83.68  83.01  82.94  83.35  
82.97  83.26  83.40  82.62  82.87  83.25  
82.78  83.56  83.82  82.62  83.09  82.94  
82.93  83.18  83.33  82.83  82.85  82.93 
5 Conclusion and Discussion
We have proposed a principled way to exploit Hamiltonian dynamics within stochastic variational inference. Contrary to previous methods Salimans et al. [2015], Wolf et al. [2016], our algorithm does not rely on learned reverse Markov kernels and benefits from the use of tempering ideas. Additionally, we can use the reparameterization trick to obtain unbiased estimators of gradients of the ELBO. The resulting HVAE can be interpreted as a targetdriven normalizing flow which requires the evaluation of a few gradients of the loglikelihood associated to a single data point at each stochastic gradient step. However, the Jacobian computations required for the ELBO are trivial. In our experiments, the robustness brought about by the use of targetinformed dynamics can reduce the number of parameters that must be trained and improve generalizability.
We note that, although we have fewer parameters to optimize, the memory cost of using HVAE and targetinformed dynamics could become prohibitively large if the memory required to store evaluations of
is already extremely large. Evaluating these gradients is not a requirement of VAEs or standard normalizing flows. However, we have shown that in the case of a fairly large generative network we are still able to evaluate gradients and backpropagate through the layers of the flow. Further tests explicitly comparing HVAE with VAEs and normalizing flows in various memory regimes are required to determine in what cases one method should be used over the other.
There are numerous possible extensions of this work. Hamiltonian dynamics preserves the Hamiltonian and hence also the corresponding target distribution, but there exist other deterministic dynamics which leave the target distribution invariant but not the Hamiltonian. This includes the NoséHoover thermostat. It is possible to directly use these dynamics instead of the Hamiltonian dynamics within the framework developed in subsection 2.3. In continuoustime, related ideas have appeared in physics Cuendet [2006], Procacci et al. [2006], SchöllPaschinger and Dellago [2006]. This comes at the cost of more complicated Jacobian calculations. The ideas presented here could also be coupled with the methodology proposed in Heng et al. [2017] – we conjecture that this could reduce the variance of the estimator (3) by an order of magnitude.
Acknowledgments
Anthony L. Caterini is a Commonwealth Scholar, funded by the UK government.
References
 Abadi et al. [2015] Martín Abadi et al. TensorFlow: Largescale machine learning on heterogeneous systems, 2015. URL https://www.tensorflow.org/. Software available from tensorflow.org.
 Berg et al. [2018] Rianne van den Berg, Leonard Hasenclever, Jakub M Tomczak, and Max Welling. Sylvester normalizing flows for variational inference. arXiv preprint arXiv:1803.05649, 2018.

Burda et al. [2016]
Yuri Burda, Roger Grosse, and Ruslan Salakhutdinov.
Importance weighted autoencoders.
In The 4th International Conference on Learning Representations (ICLR), 2016.  Crooks [1998] Gavin E Crooks. Nonequilibrium measurements of free energy differences for microscopically reversible Markovian systems. Journal of Statistical Physics, 90(56):1481–1487, 1998.
 Cuendet [2006] Michel A Cuendet. Statistical mechanical derivation of Jarzynski’s identity for thermostated nonhamiltonian dynamics. Physical Review Letters, 96(12):120602, 2006.
 Del Moral et al. [2006] Pierre Del Moral, Arnaud Doucet, and Ajay Jasra. Sequential Monte Carlo samplers. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 68(3):411–436, 2006.
 Dosovitskiy et al. [2015] Alexey Dosovitskiy, Jost Tobias Springenberg, and Thomas Brox. Learning to generate chairs with convolutional neural networks. In Computer Vision and Pattern Recognition (CVPR), 2015 IEEE Conference on, pages 1538–1546. IEEE, 2015.
 Glasserman [1991] Paul Glasserman. Gradient estimation via perturbation analysis, volume 116. Springer Science & Business Media, 1991.
 Heng et al. [2017] Jeremy Heng, Adrian N Bishop, George Deligiannidis, and Arnaud Doucet. Controlled sequential Monte Carlo. arXiv preprint arXiv:1708.08396, 2017.
 Hoffman [2017] Matthew D Hoffman. Learning deep latent Gaussian models with Markov chain Monte Carlo. In International Conference on Machine Learning, pages 1510–1519, 2017.
 Hoffman and Gelman [2014] Matthew D Hoffman and Andrew Gelman. The nouturn sampler: adaptively setting path lengths in hamiltonian monte carlo. Journal of Machine Learning Research, 15(1):1593–1623, 2014.
 Jarzynski [1997] Christopher Jarzynski. Nonequilibrium equality for free energy differences. Physical Review Letters, 78(14):2690, 1997.
 Jarzynski [2000] Christopher Jarzynski. Hamiltonian derivation of a detailed fluctuation theorem. Journal of Statistical Physics, 98(12):77–102, 2000.
 Kingma and Ba [2014] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
 Kingma and Welling [2014] Diederik P Kingma and Max Welling. Autoencoding variational Bayes. In The 2nd International Conference on Learning Representations (ICLR), 2014.
 Kingma et al. [2016] Diederik P Kingma, Tim Salimans, Rafal Jozefowicz, Xi Chen, Ilya Sutskever, and Max Welling. Improved variational inference with inverse autoregressive flow. In Advances in Neural Information Processing Systems, pages 4743–4751, 2016.
 Maddison et al. [2017] Chris J Maddison, John Lawson, George Tucker, Nicolas Heess, Mohammad Norouzi, Andriy Mnih, Arnaud Doucet, and Yee Teh. Filtering variational objectives. In Advances in Neural Information Processing Systems, pages 6576–6586, 2017.
 Neal [2001] Radford M Neal. Annealed importance sampling. Statistics and Computing, 11(2):125–139, 2001.
 Neal [2005] Radford M Neal. Hamiltonian importance sampling. www.cs.toronto.edu/pub/radford/histalk.ps, 2005. Talk presented at the Banff International Research Station (BIRS) workshop on Mathematical Issues in Molecular Dynamics.
 Neal et al. [2011] Radford M Neal et al. MCMC using Hamiltonian dynamics. Handbook of Markov Chain Monte Carlo, 2(11), 2011.
 Procacci et al. [2006] Piero Procacci, Simone Marsili, Alessandro Barducci, Giorgio F Signorini, and Riccardo Chelli. Crooks equation for steered molecular dynamics using a NoséHoover thermostat. The Journal of Chemical Physics, 125(16):164101, 2006.
 Rezende and Mohamed [2015] Danilo Rezende and Shakir Mohamed. Variational inference with normalizing flows. In International Conference on Machine Learning, pages 1530–1538, 2015.
 Rezende et al. [2014] Danilo Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation and approximate inference in deep generative models. In International Conference on Machine Learning, pages 1278–1286, 2014.

Salakhutdinov and Murray [2008]
Ruslan Salakhutdinov and Iain Murray.
On the quantitative analysis of deep belief networks.
In Proceedings of the 25th international conference on Machine learning, pages 872–879. ACM, 2008.  Salimans et al. [2015] Tim Salimans, Diederik P Kingma, and Max Welling. Markov chain Monte Carlo and variational inference: Bridging the gap. In International Conference on Machine Learning, pages 1218–1226, 2015.
 SchöllPaschinger and Dellago [2006] E SchöllPaschinger and Christoph Dellago. A proof of Jarzynski’s nonequilibrium work theorem for dynamical systems that conserve the canonical distribution. The Journal of Chemical Physics, 125(5):054105, 2006.
 Tieleman and Hinton [2012] Tijmen Tieleman and Geoffrey Hinton. Lecture 6.5rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural networks for machine learning, 4(2):26–31, 2012.
 Wolf et al. [2016] Christopher Wolf, Maximilian Karl, and Patrick van der Smagt. Variational inference with Hamiltonian Monte Carlo. arXiv preprint arXiv:1609.08203, 2016.
Appendix A Full List of Tests on MNIST
Table 2 and Table 3 display the list of test runs of HVAE on MNIST. The number of flow steps is denoted by . The total number of epochs varies in training because of early stopping. The ELBO and NLL estimates are generated using 1000 importance samples from the HVAE approximate posterior; this procedure is run 3 times for each seed. The average and standard deviation of these estimates over the three runs is displayed. Table 2 refers to HVAE tests in which was fixed across flow layers, whereas Table 3 refers to HVAE tests in which was allowed to vary across flow layers.
K  Tempering  Seed  Total Epochs  ELBO Estimate  NLL Estimate [head to column names]mnist_fixed_eps.csv 

()  () 
K  Tempering  Seed  Total Epochs  ELBO Estimate  NLL Estimate [head to column names]mnist_varied_eps.csv 

()  () 
Table 4 displays the list of test runs of the base VAE on MNIST. ELBO and NLL estimates are again generated by importance sampling, but this time from the learned VAE approximate posterior.
Seed  Total Epochs  ELBO Estimate  NLL Estimate [head to column names]mnist_cnn.csv 

()  () 