
# MCMC Variational Inference via Uncorrected Hamiltonian Annealing

Given an unnormalized target distribution we want to obtain approximate samples from it and a tight lower bound on its (log) normalization constant log Z. Annealed Importance Sampling (AIS) with Hamiltonian MCMC is a powerful method that can be used to do this. Its main drawback is that it uses non-differentiable transition kernels, which makes tuning its many parameters hard. We propose a framework to use an AIS-like procedure with Uncorrected Hamiltonian MCMC, called Uncorrected Hamiltonian Annealing. Our method leads to tight and differentiable lower bounds on log Z. We show empirically that our method yields better performance than other competing approaches, and that the ability to tune its parameters using reparameterization gradients may lead to large performance improvements.


## 1 Introduction

Variational Inference (VI) blei2017variational; wainwright2008graphical; zhang2017advances is a method to do approximate inference on a target distribution p(z) = ¯p(z)/Z that is only known up to its normalization constant Z. The basic insights are, first, that the evidence lower bound (ELBO) lower-bounds log Z and, second, that maximizing the ELBO is equivalent to minimizing the KL-divergence from q(z) to p(z). The simplest VI method chooses a parameterized family for q(z) and optimizes its parameters to maximize the ELBO.

A recent direction involves combining VI with Markov chain Monte Carlo (MCMC) salimans2015markov; wolf2016variational. These methods can be seen as an instance of the auxiliary VI framework agakov2004auxiliary – they create an augmented variational distribution that represents all intermediate random variables generated during the MCMC procedure. An augmented target distribution that attempts to capture the inverse MCMC dynamics is optimized jointly with this variational distribution. However, it has been observed that capturing inverse dynamics is challenging (wolf2016variational, §5.4) (further discussion in Section 4).

Annealed Importance Sampling (AIS) jarzynski1997equilibrium; neal2001ais is a powerful technique used to build augmented distributions without the need to learn inverse dynamics. While it was originally proposed to estimate expectations using importance sampling, it can easily be used to build lower bounds on normalization constants of intractable densities grosse2015sandwiching; wu2016quantitative. AIS creates a sequence of densities that bridge from a tractable initial approximation q(z) to the target p(z). Then, the augmented variational distribution is given by a sequence of MCMC kernels targeting each bridging density, while the augmented target uses the reversals of those kernels. It turns out that the ratio of these augmented distributions can be computed using only evaluations of the bridging densities. Combining Hamiltonian MCMC kernels with AIS has been observed to produce strong lower bounds sohl2012hamiltonian; wu2016quantitative.

However, these bounds are sensitive to numerous parameters, such as the initial distribution, bridging schedule, and parameters of the MCMC kernels. It would be desirable to optimize these parameters to tighten the bound. Unfortunately, the presence of Metropolis-Hastings acceptance steps means that the final estimator is non-differentiable, and thus reparameterization gradients cannot be used.

In this work, we propose Uncorrected Hamiltonian Annealing (UHA), a differentiable alternative to Hamiltonian AIS. We define an augmented variational distribution using Hamiltonian MCMC kernels, but dropping the accept-reject steps. This is motivated by the fact that Hamiltonian dynamics sometimes have high acceptance rates. Since these uncorrected MCMC kernels do not exactly hold the bridging densities invariant, an augmented target distribution cannot be defined in terms of reversals. Instead, we define our augmented target by deriving an algorithm for the exact reversal of the original (corrected) MCMC kernel and dropping the accept-reject step. Surprisingly, this yields a very simple expression for the resulting lower bound.

We use reparameterization gradients to tune various parameters involved in the lower bound produced by UHA, including the initial approximation q(z), parameters of the uncorrected MCMC kernel, and the bridging densities. Experimentally, tuning all of these leads to large gains. For example, in several inference tasks we observe that tuned UHA gives better results than traditional Hamiltonian AIS run with a much larger number of bridging densities.

Finally, we use UHA to train VAEs vaes_welling; rezende2014stochastic. In this case we observe that using UHA leads to higher ELBOs. In addition, we observe that increasing the number of bridging densities with UHA consistently leads to better results, and that for a large enough number of bridging densities the variational gap (difference between ELBO and true log-likelihood) becomes small, and models with higher log-likelihood are obtained.

## 2 Preliminaries

Variational inference and augmentation. Suppose that p(z) = ¯p(z)/Z is some target density, where ¯p(z) is unnormalized and Z is the corresponding normalizer, and let

 ELBO(q(z), ¯p(z)) = E_{q(z)} log [ ¯p(z) / q(z) ]  (1)

be the "ELBO operator". Variational inference (VI) is based on the fact that for any we have blei2017variational

 log Z = ELBO(q(z), ¯p(z)) + KL(q(z) ∥ p(z)).  (2)

In VI, the parameters of q(z) are tuned to maximize the "evidence lower bound" (ELBO). Since the KL-divergence is non-negative, this is always a lower bound on log Z. Also, maximizing the ELBO is equivalent to minimizing the KL-divergence from q(z) to p(z).
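To make eq. 2 concrete, here is a small sketch (our toy example, not from the paper) that estimates the ELBO by Monte Carlo for a 1-D problem where log Z is known in closed form. When q(z) = p(z) the per-sample log-ratio is constant and the bound is tight; otherwise the estimate falls below log Z by the KL-divergence.

```python
import math, random

def log_p_unnorm(z):
    # unnormalized target: standard Gaussian without its constant, so log Z = 0.5*log(2*pi)
    return -0.5 * z * z

def log_q(z, s):
    # density of q = N(0, s^2)
    return -0.5 * (z / s) ** 2 - math.log(s) - 0.5 * math.log(2 * math.pi)

def mc_elbo(s, n=20000, seed=0):
    # Monte Carlo estimate of ELBO(q, p_bar) = E_q[log p_bar(z) - log q(z)]
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        z = rng.gauss(0.0, s)          # reparameterized sample from q
        total += log_p_unnorm(z) - log_q(z, s)
    return total / n

log_Z = 0.5 * math.log(2 * math.pi)    # true log normalizer, about 0.9189
```

For s = 1 the estimate equals log Z exactly (every sample's log-ratio is the constant log Z); for s ≠ 1 it is strictly smaller.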

To get tighter bounds and better approximations recent work has made use of augmented distributions agakov2004auxiliary; huang2018improving. Let z1:M = (z1, …, zM) and suppose that ¯p(z1:M) augments the original target density while preserving its normalization constant. Then, for any q(z1:M) we have

 log Z = ELBO(q(z1:M), ¯p(z1:M)) + KL(q(z1:M) ∥ p(z1:M)).  (3)

The first term is called the "augmented" ELBO and again lower-bounds log Z. By the chain rule of KL-divergence cover1999elements, the KL-divergence from q to p over z1:M upper-bounds the KL-divergence between the corresponding marginals over zM. This justifies using the marginal of q(z1:M) over zM to approximate the original target distribution.

Annealed Importance Sampling. A successful approach for creating augmented distributions is Annealed Importance Sampling (AIS) neal2001ais. It creates an augmented proposal distribution by applying a sequence of transition densities Tm(zm+1|zm), and an augmented target by defining transition densities Um(zm|zm+1). This gives the augmented densities

 q(z1:M) = q(z1) ∏_{m=1}^{M−1} Tm(zm+1|zm)  and  ¯p(z1:M) = ¯p(zM) ∏_{m=1}^{M−1} Um(zm|zm+1).  (4)

Naively, the ratio of these densities is

 ¯p(z1:M) / q(z1:M) = [¯p(zM) / q(z1)] ∏_{m=1}^{M−1} Um(zm|zm+1) / Tm(zm+1|zm).  (5)

To define the transitions Tm and Um, AIS creates a sequence of unnormalized densities ¯π1, …, ¯πM that “bridge” from a starting distribution q to the target ¯p, meaning that ¯π1 is close to q and ¯πM is close to ¯p. Then, for each intermediate distribution, Tm is chosen to be a Markov kernel that holds πm invariant, and Um to be the reversal of Tm with respect to πm, defined as

 Um(zm|zm+1) = Tm(zm+1|zm) πm(zm) / πm(zm+1).  (6)

This choice produces a simplification so that eq. 5 becomes

 ¯p(z1:M) / q(z1:M) = [¯p(zM) / q(z1)] ∏_{m=1}^{M−1} ¯πm(zm) / ¯πm(zm+1).  (7)

This can be easily evaluated without needing to evaluate the transition densities. The ratio from eq. 7 can be used to get an expression for the lower bound ELBO(q(z1:M), ¯p(z1:M)). Research has shown that the AIS augmentation may lead to extremely tight lower bounds grosse2015sandwiching; sohl2012hamiltonian; wu2016quantitative.
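The whole AIS construction fits in a few lines. The sketch below (our toy example; the target, geometric bridging schedule, and random-walk Metropolis-Hastings kernel are our own choices) anneals from q = N(0,1) to an unnormalized Gaussian target and accumulates the log of the ratio in eq. 7; averaging the log-weights over repetitions gives a stochastic lower bound on log Z.

```python
import math, random

def log_gauss(z, mu, var):
    return -0.5 * (z - mu) ** 2 / var - 0.5 * math.log(2 * math.pi * var)

log_q  = lambda z: log_gauss(z, 0.0, 1.0)   # tractable start q(z)
log_pb = lambda z: -0.5 * (z - 2.0) ** 2    # unnormalized target; true log Z = 0.5*log(2*pi)

def log_pi(z, beta):
    # geometric bridge: log pi_m = (1 - beta) * log q + beta * log p_bar
    return (1.0 - beta) * log_q(z) + beta * log_pb(z)

def ais_log_weight(M=20, step=0.5, seed=0):
    rng = random.Random(seed)
    z = rng.gauss(0.0, 1.0)                  # z_1 ~ q
    logw = -log_q(z)
    for m in range(1, M):
        beta = m / M                         # bridging density targeted by T_m
        logw += log_pi(z, beta)              # + log pi_m(z_m)        (eq. 7)
        zp = z + step * rng.gauss(0.0, 1.0)  # random-walk MH step, pi_m-invariant
        if math.log(rng.random() + 1e-300) < log_pi(zp, beta) - log_pi(z, beta):
            z = zp
        logw -= log_pi(z, beta)              # - log pi_m(z_{m+1})    (eq. 7)
    return logw + log_pb(z)                  # + log p_bar(z_M)
```

Note the only evaluations needed are of the bridging densities themselves, never of the transition densities.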

Hamiltonian Dynamics. Many MCMC methods used to sample from p(z) are based on Hamiltonian dynamics betancourt2017geometric; chen2014stochastic; neal2011mcmc; welling2011bayesian. The idea is to create an augmented distribution π(z, ρ) = π(z) S(ρ), where S(ρ) is a distribution over a momentum variable ρ (e.g. a multivariate Gaussian). Then, one can define numerical integration schemes where z and ρ evolve while nearly holding π(z, ρ) constant. When corrected by a Metropolis-Hastings acceptance step, this can be made to exactly hold π(z, ρ) invariant. This is alternated with a scheme that resamples the momentum ρ while holding S(ρ) invariant. When Hamiltonian dynamics work well, z can quickly move around, suppressing random-walk behavior.

There are a variety of different Hamiltonian MCMC methods, corresponding to different integration schemes, momentum distributions, and ways of resampling the momentum. For instance, HMC and Langevin dynamics use the leapfrog integrator, a Gaussian for the momentum variables, and a full resampling of the momentum variables at each step neal2011mcmc; welling2011bayesian. On the other hand, if the momentum variables are only partially resampled, the under-damped variants of HMC and Langevin dynamics are recovered neal2011mcmc. It has been observed that partial resampling may lead to improved performance cheng2018underdamped.

It is easy to integrate Hamiltonian dynamics into AIS. First, define an augmented target ¯p(z, ρ) = ¯p(z) S(ρ) and an augmented starting distribution q(z, ρ) = q(z) S(ρ). Then, create a series of augmented densities bridging the two as ¯πm(z, ρ) = ¯πm(z) S(ρ). Finally, define the forward transition Tm to be an iteration of a Hamiltonian MCMC method that leaves ¯πm(z, ρ) invariant. We will describe a single transition as a sequence of three steps: (1) resample the momentum; (2) simulate Hamiltonian dynamics and apply an accept-reject step; and (3) negate the momentum. The precise process that defines the transition Tm is shown in Alg. 1. Note that this algorithm is quite general, and compatible with HMC, Langevin dynamics and their underdamped variants (by selecting an appropriate integrator and resampling method).

Representing Tm this way makes it easy to show that it holds the density ¯πm(z, ρ) invariant. The overall strategy is to show that each of the steps 1-3 holds ¯πm(z, ρ) invariant, and so does their composition (neal2011mcmc, §3.2). For steps 1 and 3 this is trivial, provided that S(ρ) = S(−ρ). For step 2, we require that the simulation has unit Jacobian and is self-inverting. Then, the simulation can be interpreted as a symmetric Metropolis-Hastings proposal, meaning the acceptance probability is the standard Metropolis-Hastings one given in Alg. 1. A typical choice that satisfies these requirements is the leapfrog integrator with a momentum reversal at the end. (This reversal then gets "un-reversed" in step 3 for accepted moves.)
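Steps 1-3 can be sketched in a few lines. The code below is our illustrative 1-D implementation, not the paper's Alg. 1; it uses full momentum resampling (the HMC case) with S = N(0,1): leapfrog simulation, a momentum reversal making the proposal symmetric, an accept-reject step, and the final negation.

```python
import math, random

def hmc_transition(z, rho, log_target, grad_log_target, rng, eps=0.25, n_leap=8):
    # Step 1: resample the momentum (here fully, holding S(rho) = N(0,1) invariant).
    rho = rng.gauss(0.0, 1.0)
    # Step 2: leapfrog simulation with a momentum reversal at the end, so the map
    # is self-inverting with unit Jacobian; then Metropolis-Hastings accept-reject.
    z_new, r = z, rho
    r += 0.5 * eps * grad_log_target(z_new)
    for i in range(n_leap):
        z_new += eps * r
        if i < n_leap - 1:
            r += eps * grad_log_target(z_new)
    r += 0.5 * eps * grad_log_target(z_new)
    r = -r                                     # momentum reversal: proposal is symmetric
    log_acc = (log_target(z_new) - 0.5 * r * r) - (log_target(z) - 0.5 * rho * rho)
    if math.log(rng.random() + 1e-300) < log_acc:
        z, rho = z_new, r
    # Step 3: negate the momentum ("un-reverses" it for accepted moves).
    return z, -rho
```

Iterating this kernel on a standard Gaussian target leaves N(0,1) invariant, which is easy to verify from the empirical moments of a long chain.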

Since Tm holds ¯πm(z, ρ) invariant, we can define Um as the reversal of Tm with respect to ¯πm(z, ρ). Then, eq. 7 becomes

 ¯p(z1:M, ρ1:M) / q(z1:M, ρ1:M) = [¯p(zM, ρM) / q(z1, ρ1)] ∏_{m=1}^{M−1} ¯πm(zm, ρm) / ¯πm(zm+1, ρm+1).  (8)

Using this ratio we get an expression for the lower bound obtained with Hamiltonian AIS. While this method has been observed to yield strong lower bounds on log Z sohl2012hamiltonian; wu2016quantitative (see also Section 5.1), its performance depends on many parameters: initial distribution q(z), momentum distribution S(ρ), momentum resampling scheme, dynamics simulator, and bridging densities. We would like to tune these parameters by maximizing the ELBO using reparameterization-based estimators. However, due to the accept-reject step required by the Hamiltonian MCMC transition, the resulting bound is not differentiable, and thus reparameterization gradients are not available.

## 3 Uncorrected Hamiltonian Annealing

The contribution of this paper is the development of Uncorrected Hamiltonian Annealing (UHA). This method is similar to Hamiltonian AIS (eq. 8), but yields a differentiable lower bound. The main idea is simple. For any transitions Tm and Um, by the same logic as in eq. 5, we can define the ratio

 ¯p(z1:M, ρ1:M) / q(z1:M, ρ1:M) = [¯p(zM, ρM) / q(z1, ρ1)] ∏_{m=1}^{M−1} Um(zm, ρm|zm+1, ρm+1) / Tm(zm+1, ρm+1|zm, ρm).  (9)

Hamiltonian AIS defines Tm as a Hamiltonian MCMC kernel that holds ¯πm(z, ρ) invariant, and Um as the reversal of Tm with respect to ¯πm(z, ρ). While this leads to a nice simplification, there is no requirement that these choices be made. We can use any transitions as long as the ratio Um/Tm is tractable.

We propose to use the "uncorrected" versions of the transitions Tm and Um used by Hamiltonian AIS, obtained by dropping the accept-reject steps. To get an expression for the uncorrected Um we first derive the reversal used by Hamiltonian AIS (Alg. 2). These uncorrected transitions are no longer reversible with respect to the bridging densities ¯πm, and thus we cannot use the simplification used by AIS to get eq. 8. Despite this, we show that the ratio for the uncorrected transitions can still be easily computed (Thm. 2). This produces a differentiable estimator, meaning the parameters can be tuned by stochastic gradient methods designed to maximize the ELBO.

We start by deriving the process that defines the transition Um used by Hamiltonian AIS. This is shown in Alg. 2. It can be observed that Um follows the same three steps as Tm (resample momentum, Hamiltonian simulation with accept-reject, momentum negation), but in reverse order.

###### Lemma 1.

The corrected Um (Alg. 2) is the reversal of the corrected Tm (Alg. 1) with respect to ¯πm(z, ρ).

###### (Proof Sketch).

First, we claim the general result that if T1, T2 and T3 have reversals U1, U2 and U3, respectively, then the composition T3 ∘ T2 ∘ T1 has reversal U1 ∘ U2 ∘ U3 (all reversals with respect to the same density π). Then, we apply this to the corrected Tm and Um: Tm is the composition of three steps that hold ¯πm(z, ρ) invariant. Thus, its reversal is given by the composition of the reversals of those steps, applied in reversed order. A full proof is in Appendix E. ∎

We now define the "uncorrected" transitions used by UHA, shown in Algs. 3 and 4. These are just the transitions used by Hamiltonian AIS but without the accept-reject steps. (If Hamiltonian dynamics are simulated exactly, the acceptance rate is one and the uncorrected and corrected transitions are equivalent.) We emphasize that, for the "uncorrected" transitions, Tm does not exactly hold ¯πm(z, ρ) invariant and Um is not the reversal of Tm. Thus, their ratio does not give a simple expression in terms of ¯πm as in eq. 8. Nevertheless, the following result shows that their ratio has a simple form.

###### Theorem 2.

Let Tm and Um be the uncorrected transitions defined in Algs. 3 and 4, and let the dynamics simulator be volume preserving and self-inverting. Then,

 Um(zm, ρm|zm+1, ρm+1) / Tm(zm+1, ρm+1|zm, ρm) = S(ρm) / S(ρ′m),  (10)

where ρ′m is the intermediate momentum appearing in Algs. 3 and 4 (in Tm, the output of the resampling step; in Um, the second component of the simulator's output).

###### (Proof Sketch).

We derive the densities for Tm and Um using the rule for transformation of densities under invertible mappings, using that the simulator is self-inverting and volume preserving. Taking the ratio and simplifying gives the result. A full proof is in Appendix F. ∎

As an immediate corollary of eq. 9 and Theorem 2 we get that for UHA

 ¯p(z1:M, ρ1:M) / q(z1:M, ρ1:M) = [¯p(zM) / q(z1)] ∏_{m=1}^{M−1} S(ρm+1) / S(ρ′m).  (11)

This ratio can be used to get an expression for the lower bound obtained with UHA. As mentioned in Section 2, the parameters of the augmented distributions are tuned to maximize the ELBO, equivalent to minimizing the KL-divergence from q(z1:M, ρ1:M) to p(z1:M, ρ1:M). While computing this ELBO exactly is typically intractable, an unbiased estimate can be obtained using a sample from q(z1:M, ρ1:M), as shown in Alg. 5. If sampling is done using reparameterization, then unbiased reparameterization gradients may be used together with stochastic optimization algorithms to optimize the lower bound. In contrast, the variational lower bound obtained with Hamiltonian AIS (see Alg. 6 in Appendix A) does not allow the computation of unbiased reparameterization gradients.
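An estimator in the spirit of Alg. 5 is easy to sketch for a toy problem. The code below is our illustrative 1-D example (q = N(0,1), an unnormalized Gaussian target, one uncorrected leapfrog step per transition, and partial momentum resampling; all settings are our own). It accumulates the log of the ratio in eq. 11, using the per-transition factors of eq. 10; every sampling operation is reparameterizable, so in an autodiff framework the returned estimate would be differentiable in the step-size and the bridging coefficients.

```python
import math, random

def log_S(r):                                   # momentum distribution S = N(0, 1)
    return -0.5 * r * r - 0.5 * math.log(2 * math.pi)

def uha_elbo_sample(M=32, eps=0.2, eta=0.5, seed=0):
    rng = random.Random(seed)
    log_q  = lambda z: -0.5 * z * z - 0.5 * math.log(2 * math.pi)   # q = N(0, 1)
    log_pb = lambda z: -0.5 * (z - 2.0) ** 2    # unnormalized target, log Z = 0.5*log(2*pi)
    grad_log_pi = lambda z, b: (1 - b) * (-z) + b * (2.0 - z)       # geometric bridge
    z = rng.gauss(0.0, 1.0)                     # z_1 ~ q (reparameterizable)
    r = rng.gauss(0.0, 1.0)                     # rho_1 ~ S
    elbo = -(log_q(z) + log_S(r))
    for m in range(1, M):
        beta = m / M
        elbo += log_S(r)                        # numerator S(rho_m) of eq. 10
        # partial momentum resampling (holds S invariant)
        r = eta * r + math.sqrt(1 - eta ** 2) * rng.gauss(0.0, 1.0)
        elbo -= log_S(r)                        # denominator S(rho'_m) of eq. 10
        # uncorrected dynamics: the simulator's final momentum reversal and the
        # step-3 negation cancel, leaving one plain leapfrog step, no accept-reject
        r += 0.5 * eps * grad_log_pi(z, beta)
        z += eps * r
        r += 0.5 * eps * grad_log_pi(z, beta)
    return elbo + log_pb(z) + log_S(r)
```

Averaging many samples of this estimator gives a stochastic lower bound on log Z; with no accept-reject branch, the sample path is a smooth function of the parameters.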

### 3.1 Algorithm Details

Simulation of dynamics. We use the leapfrog operator with step-size ε to simulate Hamiltonian dynamics. This has unit Jacobian and is self-inverting (if the momentum is negated after the simulation), which are the properties required for eq. 11 to be correct (see Theorem 2).

Momentum distribution and resampling. We set the momentum distribution S(ρ) to be a Gaussian with mean zero and covariance Σ. The resampling distribution s(ρ′|ρ) must hold this distribution invariant. As is common, we use s(ρ′|ρ) = N(ρ′; η ρ, (1 − η²) Σ), where η ∈ [0, 1) is the damping coefficient. If η = 0, the momentum is completely replaced with a new sample from S(ρ) in each iteration (used by HMC and Langevin dynamics neal2011mcmc; welling2011bayesian). For larger η, the momentum becomes correlated between iterations, which may help suppress random-walk behavior and encourage faster mixing cheng2018underdamped (used by the underdamped variants of HMC and Langevin dynamics neal2011mcmc).

Bridging densities. We set ¯πm(z) = q(z)^{1−βm} ¯p(z)^{βm}, where βm ∈ [0, 1] and β1 ≤ β2 ≤ ⋯ ≤ βM.

Computing gradients. We set the initial distribution q(z) to be a Gaussian, and perform all sampling operations in Alg. 5 using reparameterization vaes_welling; rezende2014stochastic; doublystochastic_titsias. Thus, the whole procedure is differentiable and reparameterization-based gradients may be used to tune parameters by maximizing the ELBO. These parameters include the initial distribution q(z), the covariance Σ of the momentum distribution, the step-size ε of the integrator, the damping coefficient η of the momentum resampling distribution, and the parameters of the bridging densities (including the coefficients β1:M), among others. As observed in Section 5.1.1, tuning all of these parameters may lead to considerable performance improvements.

## 4 Related Work

There are three lines of work that produce differentiable variational bounds integrating Monte Carlo methods. One is Hamiltonian VI (HVI) salimans2015markov; wolf2016variational. It uses eq. 9 to build a lower bound on log Z, with Tm set to an uncorrected Hamiltonian transition (like UHA but without bridging densities) and Um set to conditional Gaussians parameterized by learnable functions. Typically, a single transition is used, and the parameters of the transitions are learned by maximizing the resulting ELBO. (The formulation of HVI allows the use of more than one transition. However, this leads to an increased number of reverse models that must be learned, and is thus not typically used in practice. Indeed, experiments by Salimans et al. salimans2015markov use only one HMC step while varying the number of leapfrog integration steps, and results from Wolf et al. wolf2016variational show that increasing the number of transitions may actually yield worse bounds; they conjecture that this is due to the difficulty of learning inverse dynamics.)

A second method is given by Hamiltonian VAE (HVAE) caterini2018hamiltonian, based on Hamiltonian Importance Sampling neal2005hamiltonianis. They augment the variational distribution with momentum variables, and use the leapfrog integrator to simulate Hamiltonian dynamics (a deterministic invertible transformation with unit Jacobian) with a tempering scheme as a target-informed flow rezende2015variational; tabak2013family.

The third method is Importance Weighting (IW) IWVAE; cremer2017reinterpreting; domke2018importance. Here, the idea is that the importance-weighted objective E log [(1/K) Σ_{k=1}^{K} ¯p(zk)/q(zk)], with zk drawn i.i.d. from q, lower-bounds log Z at least as tightly as the standard ELBO, and that this bound can be optimized rather than the traditional ELBO. More generally, other Monte-Carlo estimators can be used domke2019divide.
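For intuition, the IW bound is cheap to sketch. The code below is our toy 1-D example (q = N(0,1) and an unnormalized Gaussian target; the sample counts are our own choices); the estimate is non-decreasing in K in expectation and approaches log Z as K grows.

```python
import math, random

def iw_bound(K, n_outer=2000, seed=0):
    """Monte Carlo estimate of E log[(1/K) sum_k p_bar(z_k)/q(z_k)], z_k ~ q."""
    rng = random.Random(seed)
    log_q  = lambda z: -0.5 * z * z - 0.5 * math.log(2 * math.pi)   # q = N(0, 1)
    log_pb = lambda z: -0.5 * (z - 2.0) ** 2    # unnormalized target, log Z = 0.5*log(2*pi)
    total = 0.0
    for _ in range(n_outer):
        lws = []
        for _ in range(K):
            z = rng.gauss(0.0, 1.0)             # z_k ~ q (reparameterizable)
            lws.append(log_pb(z) - log_q(z))    # log importance weight
        mx = max(lws)                           # log-sum-exp for numerical stability
        total += mx + math.log(sum(math.exp(lw - mx) for lw in lws) / K)
    return total / n_outer
```

With K = 1 this reduces to the standard ELBO estimator.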

Some work defines novel contrastive-divergence-like objectives in terms of the final iteration of an MCMC chain ruiz2019contrastive; li2017approximate. These do not provide an ELBO-like variational bound. While in some cases the initial distribution can be optimized to minimize the objective ruiz2019contrastive, gradients do not flow through the MCMC chains, meaning MCMC parameters cannot be optimized by gradient methods.

For latent variable models, Hoffman hoffman2017MCMCvae suggested to run a few MCMC steps after sampling from the variational distribution before computing gradients with respect to the model parameters, which is expected to "debias" the gradient estimator to be closer to the true likelihood gradient. The variational distribution is simultaneously learned to optimize a standard ELBO. (AIS can also be used ding2019vaeAIS.)

## 5 Experiments and Results

This section presents results using UHA for Bayesian inference problems on several models of varying dimensionality and for VAE training. In all cases, for UHA and Hamiltonian AIS we use under-damped Langevin dynamics; that is, we perform just one leapfrog step per transition and partially resample the momentum. We implement all algorithms using Jax jax2018github.

### 5.1 Bayesian Inference

This section shows results using UHA for Bayesian inference tasks. For this set of experiments, for UHA we tune the initial distribution q(z), the integrator's step-size ε and the damping coefficient η. We include detailed results tuning more parameters in Section 5.1.1.

Models. We consider four models: Brownian motion, which models a Brownian motion process with a Gaussian observation model; Convection Lorenz bridge, which models a nonlinear dynamical system for atmospheric convection; and Logistic regression with the a1a and madelon datasets. The first two are obtained from the “Inference gym” inferencegym2020.

Baselines. We compare UHA against IW, HVAE, a simple variant of HVI, and Hamiltonian AIS (HAIS). For all methods which rely on HMC (i.e. all except IW) we use a single integration step-size common to all dimensions and fix the momentum distribution to a standard Gaussian. For HVI we learn the initial distribution q(z), the integration step-size ε and the reverse dynamics (set to a factorized Gaussian with mean and variance given by affine functions), and for HVAE we learn q(z), ε, and the tempering scheme (we use the quadratic scheme parameterized by a single parameter).

Training details. We set q(z) to be a mean-field Gaussian initialized to a maximizer of the ELBO, and tune the parameters of each method by running Adam. We repeated all simulations for several Adam step-sizes, and selected the best one for each method. Since Hamiltonian AIS' parameters cannot be tuned by gradient descent, we find a good pair (ε, η) by grid search: we consider three values of η and three values of ε corresponding to three different rejection rates, test all 9 possible combinations, and select the best one.

Results are shown in Fig. 1. To simplify comparisons against IW, results are shown as a function of the number of likelihood evaluations required by each method to build the lower bound. (For IW this is the number of samples used to build the lower bound; for UHA and Hamiltonian AIS it is the number of bridging densities plus one; for HVI and HVAE it is the number of leapfrog steps used plus one.) It can be observed that our method yields better lower bounds than all other competing approaches for all models considered, and that increasing the number of bridging densities consistently leads to better results. The next best performing method is Hamiltonian AIS. IW also shows good performance on the lower-dimensional Brownian motion model. However, for models of higher dimensionality, IW leads to bounds that are several nats worse than the ones achieved by UHA. Finally, HVI and HVAE yield bounds that are much worse than those achieved by the other three methods, and do not appear to improve consistently for larger computational budgets. For HVAE, these results are consistent with the ones in the original paper (caterini2018hamiltonian, §4), in that more leapfrog steps may sometimes hurt performance. For HVI, we believe this is related to the use of just one HMC step and suboptimal inverse dynamics.

#### 5.1.1 Tuning More Parameters with UHA

A basic version of UHA involves fitting a variational distribution q(z) using plain VI, and then tuning the integration step-size ε and the damping coefficient η. However, more parameters could be tuned:


• Momentum distribution covariance Σ: We propose to learn a diagonal matrix Σ instead of using the identity.

• Bridging densities’ coefficients β1:M: Typically βm = m/M. We propose to learn the sequence β1:M instead, with the restrictions that it be non-decreasing, contained in [0, 1], and end at βM = 1.

• Initial distribution q(z): Instead of fixing q(z) to be a maximizer of the typical ELBO, we propose to learn it to maximize the augmented ELBO obtained using UHA.

• Integrator’s step-size ε: Instead of learning a unique step-size ε, we propose to learn a step-size that is a function of the transition index m, i.e. ε(m). In our experiments we use an affine function.

• Bridging densities' parameters: Instead of setting the m-th bridging density to be q(z)^{1−βm} ¯p(z)^{βm}, we propose to set it to g(z)^{1−βm} ¯p(z)^{βm}, where g(z) is a mean-field Gaussian with a mean and diagonal covariance specified as affine functions of the coefficient βm.
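The constraints on the coefficients β1:M can be enforced with a simple reparameterization. The sketch below is our own construction (one of several possibilities, not necessarily the paper's): it maps unconstrained parameters to a strictly increasing sequence in (0, 1] ending at 1 via a normalized cumulative sum, so a gradient optimizer can work on the raw parameters directly.

```python
import math

def betas_from_raw(raw):
    # raw: list of unconstrained real parameters, one per bridging density.
    w = [math.exp(r) for r in raw]      # positive increments
    total = sum(w)
    betas, acc = [], 0.0
    for x in w:
        acc += x / total                # normalized cumulative sum
        betas.append(acc)
    return betas                        # increasing, in (0, 1], beta_M = 1
```

For example, all-zero raw parameters recover the uniform schedule βm = m/M.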

We consider the four models described previously and compare three methods: UHA tuning all parameters described above, UHA tuning only the pair (ε, η), and Hamiltonian AIS with parameters obtained by grid-search. We perform the comparison for a number of bridging densities ranging up to 512. (For the largest values we tune UHA's parameters using a smaller number of bridging densities and extrapolate them as explained in Appendix D.)

Results are shown in Fig. 2. It can be observed that tuning all parameters with UHA leads to significantly better lower bounds than those obtained by Hamiltonian AIS (or UHA tuning only ε and η). Indeed, for the Logistic regression models, UHA tuning all parameters leads to results comparable to the ones obtained by Hamiltonian AIS using many times more bridging densities.

To verify which parameters lead to larger performance improvements, we tested UHA tuning different subsets of them. Fig. 3 shows the results. It can be observed that tuning the bridging parameters and the initial approximation leads to the largest gains in performance, and that tuning all parameters always outperforms tuning smaller subsets of parameters. We show a more thorough analysis, including more subsets and more numbers of bridging densities, in Appendix B.

### 5.2 VAE training

Our method can be used to train latent variable models, such as Variational Auto-encoders (VAEs) vaes_welling; rezende2014stochastic. In this case the initial approximation q(z|x) and the model p(x, z) are parameterized by two neural networks (encoder and decoder), whose parameters are trained by maximizing the ELBO. UHA can be used to train VAEs by augmenting these two distributions as described in Section 3.

Datasets. We use three datasets: mnist lecun1998gradient (handwritten digits), emnist-letters cohen2017emnist (letters A-Z), and kmnist clanuwat2018deep (cursive Kuzushiji). All consist of greyscale images of 28×28 pixels. In all cases we use stochastic binarization salakhutdinov2008quantitative and a split into training, validation and test sets. All datasets are available in tensorflow-datasets TFDS.

Baselines. We compare against Importance Weighted Auto-encoders IWVAE and plain VAE training vaes_welling.

Architecture details. We set q(z|x) to a diagonal Gaussian, the prior to a standard Normal, and the likelihood to a Bernoulli. We consider two architectures for the encoder and decoder: (1) feed-forward networks with one hidden layer and ReLU non-linearities; (2) the architecture used by Burda et al. IWVAE, feed-forward networks with two hidden layers and tanh non-linearities.

Training details. In all cases the encoder and decoder are initialized to parameters that maximize the ELBO. For IW we tune the encoder and decoder parameters (using the doubly-reparameterized estimator doublyrep), and for UHA we tune the integration step-size ε, damping coefficient η, bridging parameters β1:M, momentum covariance Σ (diagonal), and the decoder parameters. Following Caterini et al. caterini2018hamiltonian, we constrain the range of ε to avoid unstable behavior of the leapfrog discretization. We train with Adam and use the validation set for early stopping. We repeated all simulations for three different random seeds. In all cases the standard deviation of the results was small (not shown in tables).

All methods achieved better results using the architecture with one hidden layer. These results are shown in Tables 1 and 2. The first one shows the ELBO on the test set achieved for different computational budgets, and the second one the log-likelihood on the test set estimated with AIS wu2016quantitative. It can be observed that UHA leads to higher ELBOs, higher log-likelihoods, and smaller variational gaps (difference between ELBO and log-likelihood) than IW for all datasets, with the difference between both methods' performance increasing with the budget. Notably, for the largest budget considered, the variational gap for UHA becomes quite small on all datasets.

Results for the architecture from Burda et al. IWVAE (two hidden layers) are shown in Tables 3 and 4 (Appendix C). Again, we observe that UHA consistently leads to higher ELBOs, and the best test log-likelihood was consistently achieved by UHA with the largest number of bridging densities. However, for smaller budgets, IW sometimes had better log-likelihoods than UHA (despite worse ELBOs).

## 6 Discussion

Since UHA yields a differentiable lower bound, one could tune other parameters not considered in this work. For instance, a different momentum distribution per bridging density could be used, that is, Sm(ρ) for each m. We believe additions such as this may yield further gains. Also, our method can be used to get tight and differentiable upper bounds on log Z using the reversed AIS procedure described by Grosse et al. grosse2015sandwiching.

Finally, removing accept-reject steps might sometimes lead to instabilities during optimization if the step-size ε becomes large. We observed this effect when training VAEs on some datasets for larger numbers of bridging densities. We solved this by constraining the range of ε (previously done by Caterini et al. caterini2018hamiltonian). While this simple solution works well, we believe that other approaches (e.g. regularization, automatic adaptation) could work even better. We leave the study of such alternatives for future work.

## Appendix B More results tuning more subsets of parameters for UHA

We tested UHA tuning different subsets of the parameters described in Section 5.1.1. Fig. 4 shows the results. The first row shows the results obtained by tuning the pair (ε, η) together with each other parameter individually for different computational budgets, and the second row shows the results obtained by tuning increasingly more parameters. It can be observed that tuning the bridging parameters and the initial approximation leads to the largest gains in performance.

## Appendix C Results using architecture from Burda et al. IWVAE

In this section we show the results achieved for VAE training using the architecture from Burda et al. IWVAE (with one stochastic layer). In this case the encoder and decoder consist of feed-forward neural networks with two hidden layers and tanh non-linearities. All training details are the same, but with a different constraint on the range of ε. Tables 3 and 4 show the results.

## Appendix D Extrapolating optimal parameters for UHA

Some results in Section 5.1.1 (and Appendix B) use a number of bridging densities up to 512. As mentioned previously, for those simulations, when a large number of bridging densities was used, we optimized the parameters with a smaller number of bridging densities and extrapolated them to the larger number. We now explain this procedure.

From the parameters considered, the only ones that need to be "extrapolated" are the step-size ε(m) and the bridging coefficients β1:M. All other parameters are tuned with the smaller number of bridging densities and the values obtained are directly used with the larger number.

For β1:M we use a simple interpolation. Define f to be the piecewise linear function (with M "pieces") that satisfies f(0) = 0 and f(m/M) = βm for m = 1, …, M (this is a bijection from [0, 1] to [0, 1]). Then, when using M′ bridging densities, we simply define β′j = f(j/M′), where j = 1, …, M′.
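A minimal sketch of this interpolation (our implementation, assuming the knot placement f(0) = 0 and f(m/M) = βm; the endpoint conventions are our reading of the garbled original):

```python
def extend_betas(betas, M_new):
    """Extrapolate coefficients tuned with M = len(betas) densities to M_new."""
    M = len(betas)
    xs = [0.0] + [(m + 1) / M for m in range(M)]      # knots 0, 1/M, ..., 1
    ys = [0.0] + list(betas)                          # f(0) = 0, f(m/M) = beta_m
    def f(x):
        # piecewise linear interpolation between consecutive knots
        for i in range(1, len(xs)):
            if x <= xs[i]:
                t = (x - xs[i - 1]) / (xs[i] - xs[i - 1])
                return ys[i - 1] + t * (ys[i] - ys[i - 1])
        return ys[-1]
    return [f((j + 1) / M_new) for j in range(M_new)]
```

Applied to a uniform schedule, f is the identity, so the extrapolated schedule is again uniform; non-uniform learned schedules keep their shape.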

For the step-size ε(m), we use a simple transformation of the schedule learned with the smaller number of bridging densities. While other transformations could be used, we observed this one to work best in practice. (In fact, we obtained this rule by analyzing the dependence of the optimal step-size on the number of bridging densities for several tasks.)

## Appendix E Proof of Lemma 1

We begin with the following result.

###### Lemma 3.

Let T1, T2 and T3 be three transitions that leave some distribution π invariant and satisfy Ti(z′|z) π(z) = Ui(z|z′) π(z′) (i.e. Ui is the reversal of Ti with respect to π). Then the reversal of T = T3 ∘ T2 ∘ T1 with respect to π is given by U = U1 ∘ U2 ∘ U3.

###### Proof.
 T(z′|z) π(z) = ∫ T3(z′|z2) T2(z2|z1) T1(z1|z) π(z) dz1 dz2  (12)
 = ∫ T3(z′|z2) T2(z2|z1) π(z1) U1(z|z1) dz1 dz2  (13)
 = ∫ T3(z′|z2) π(z2) U2(z1|z2) U1(z|z1) dz1 dz2  (14)
 = π(z′) ∫ U3(z2|z′) U2(z1|z2) U1(z|z1) dz1 dz2  (15)
 = π(z′) ∫ U1(z|z1) U2(z1|z2) U3(z2|z′) dz1 dz2  (16)
 = π(z′) U(z|z′).  (17)

The rest of the proof is straightforward. Let the three steps from the corrected version of Tm (Alg. 1) be denoted T1, T2 and T3. The latter two (Hamiltonian simulation with accept-reject step and momentum negation) satisfy detailed balance with respect to ¯πm(z, ρ) [neal2011mcmc, §3.2]. Thus, for these two, the reversal is defined by the same process as the original step. For T1 (momentum resampling), its reversal is given by the reversal of s(ρ′|ρ) with respect to S(ρ). We call this srev, and it satisfies

 srev(ρ|ρ′) = s(ρ′|ρ) S(ρ) / S(ρ′).  (18)

## Appendix F Proof of Theorem 2

To deal with delta functions, whenever the transition states [Set x′ ← x], we use a Gaussian density ga(x′ − x) and take the limit a → 0. We use ga(x) to denote the density of a Gaussian with mean zero and variance a evaluated at x, and γ(z, ρ) = (z, −ρ) to denote the operator that negates the momentum.

We first compute Tm. Its first step resamples the momentum as ρ′m ∼ s(·|ρm) while leaving zm unchanged; denote this step's density by T′m. Thus,

 T′m(z′m, ρ′m|zm, ρm) = s(ρ′m|ρm) ga(z′m − zm).  (19)

Also, the remaining steps of Tm apply the deterministic map γ ∘ T. Since this is an invertible transformation with unit Jacobian and inverse T ∘ γ, we get that

 Tm(zm+1, ρm+1|zm, ρm) = T′m((T ∘ γ)(zm+1, ρm+1)|zm, ρm)  (20)
 = s(Tρ(zm+1, −ρm+1)|ρm) ga(Tz(zm+1, −ρm+1) − zm),  (21)

where Tρ is the operator that applies T and returns the second component of the result (and similarly Tz for the first component).

Now, we compute Um. From Alg. 4, we have (z′m, ρ′m) = T(zm+1, −ρm+1). Thus,

 Um(zm, ρm|zm+1, ρm+1) = Um(zm, ρm|z′m, ρ′m)  (22)
 = srev(ρm|ρ′m) ga(zm − z′m)  (23)
 = srev(ρm|Tρ(zm+1, −ρm+1)) ga(