Variational Inference (VI) blei2017variational; wainwright2008graphical; zhang2017advances is a method for approximate inference on a target distribution p(z) = p̄(z)/Z that is only known up to its normalization constant Z. The basic insights are, first, that the evidence lower bound (ELBO) lower-bounds log Z and, second, that maximizing the ELBO is equivalent to minimizing the KL-divergence from q to p. The simplest VI method chooses a parameterized family for q and optimizes its parameters to maximize the ELBO.
A recent direction involves combining VI with Markov chain Monte Carlo (MCMC) salimans2015markov; wolf2016variational. These methods can be seen as an instance of the auxiliary VI framework agakov2004auxiliary: they create an augmented variational distribution that represents all intermediate random variables generated during the MCMC procedure. An augmented target distribution that attempts to capture the inverse MCMC dynamics is optimized jointly with this variational distribution. However, it has been observed that capturing inverse dynamics is challenging (wolf2016variational, §5.4) (further discussion in Section 4).
Annealed Importance Sampling (AIS) jarzynski1997equilibrium; neal2001ais is a powerful technique for building augmented distributions without the need to learn inverse dynamics. While it was originally proposed to estimate expectations using importance sampling, it can easily be used to build lower bounds on the normalization constants of intractable densities grosse2015sandwiching; wu2016quantitative. AIS creates a sequence of densities that bridge from a tractable initial approximation q to the target p. The augmented variational distribution is then given by a sequence of MCMC kernels targeting each bridging density, while the augmented target uses the reversals of those kernels. It turns out that the ratio of these augmented distributions can be computed using only evaluations of the bridging densities. Combining Hamiltonian MCMC kernels with AIS has been observed to produce strong lower bounds sohl2012hamiltonian; wu2016quantitative.
However, these bounds are sensitive to numerous parameters, such as the initial distribution, bridging schedule, and parameters of the MCMC kernels. It would be desirable to optimize these parameters to tighten the bound. Unfortunately, the presence of Metropolis-Hastings acceptance steps means that the final estimator is non-differentiable, and thus reparameterization gradients cannot be used.
In this work, we propose Uncorrected Hamiltonian Annealing (UHA), a differentiable alternative to Hamiltonian AIS. We define an augmented variational distribution using Hamiltonian MCMC kernels, but dropping the accept-reject steps. This is motivated by the fact that Hamiltonian dynamics sometimes have high acceptance rates. Since these uncorrected MCMC kernels do not exactly hold the bridging densities invariant, an augmented target distribution cannot be defined in terms of reversals. Instead, we define our augmented target by deriving an algorithm for the exact reversal of the original (corrected) MCMC kernel and dropping the accept-reject step. Surprisingly, this yields a very simple expression for the resulting lower bound.
We use reparameterization gradients to tune various parameters involved in the lower bound produced by UHA, including the initial approximation q, parameters of the uncorrected MCMC kernel, and the bridging densities. Experimentally, tuning all these leads to large gains. For example, in several inference tasks we observe that tuning UHA with a given number of bridging densities gives better results than traditional Hamiltonian AIS with many more.
Finally, we use UHA to train VAEs vaes_welling; rezende2014stochastic. In this case we observe that using UHA leads to higher ELBOs. In addition, we observe that increasing the number of bridging densities with UHA consistently leads to better results, and that for a large enough number of bridging densities the variational gap (difference between ELBO and true log-likelihood) becomes small, and models with higher log-likelihood are obtained.
Variational inference and augmentation. Suppose that p(z) = p̄(z)/Z is some target density, where p̄ is unnormalized and Z is the corresponding normalizer, and let ELBO(q ‖ p̄) = E_{q(z)} [log p̄(z) − log q(z)] be the "ELBO operator". Variational inference (VI) is based on the fact that for any q we have blei2017variational
log Z = ELBO(q ‖ p̄) + KL(q(z) ‖ p(z)).
In VI, the parameters of q are tuned to maximize the first term, the "evidence lower bound" (ELBO). Since the KL-divergence is non-negative, the ELBO is always a lower bound on log Z. Also, maximizing the ELBO is equivalent to minimizing the KL-divergence from q to p.
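As a concrete illustration, the following minimal sketch (the densities and values are illustrative, not from the paper) estimates the ELBO by Monte Carlo for a one-dimensional Gaussian target; the bound is tight exactly when q matches p:

```python
import numpy as np

rng = np.random.default_rng(0)
log_p_bar = lambda z: -0.5 * (z - 3.0)**2     # unnormalized N(3, 1), so log Z = 0.5 log(2 pi)

def elbo_estimate(mu, log_sig, n=50000):
    sig = np.exp(log_sig)
    z = mu + sig * rng.normal(size=n)         # reparameterized samples from q = N(mu, sig^2)
    log_q = -0.5 * ((z - mu) / sig) ** 2 - np.log(sig) - 0.5 * np.log(2 * np.pi)
    return np.mean(log_p_bar(z) - log_q)      # Monte Carlo estimate of E_q[log p_bar - log q]

bad = elbo_estimate(0.0, 0.0)                 # mismatched q: ELBO = log Z - KL(q || p)
good = elbo_estimate(3.0, 0.0)                # q equal to p: ELBO = log Z
log_Z = 0.5 * np.log(2 * np.pi)
```

Here `good` recovers log Z exactly, since the integrand log p̄(z) − log q(z) is constant when q = p, while `bad` falls short of log Z by KL(q ‖ p).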
To get tighter bounds and better approximations, recent work has made use of augmented distributions agakov2004auxiliary; huang2018improving. Let y be an auxiliary variable and suppose that p(z, y) augments the original target density p(z) while preserving its normalization constant. Then, for any q(z, y) we have
log Z = ELBO(q(z, y) ‖ p̄(z, y)) + KL(q(z, y) ‖ p(z, y)).
The first term is called the "augmented" ELBO and again lower bounds log Z. By the chain rule of KL-divergence cover1999elements, the KL-divergence from q(z, y) to p(z, y) upper-bounds the KL-divergence from q(z) to p(z). This justifies using the marginal of q(z, y) over z to approximate the original target distribution.
Annealed Importance Sampling. A successful approach for creating augmented distributions is Annealed Importance Sampling (AIS) neal2001ais. It creates an augmented proposal distribution by applying a sequence of transition densities T_1, …, T_M, and an augmented target by defining reverse transition densities U_1, …, U_M. This gives the augmented densities

q(z_0, …, z_M) = q(z_0) ∏_{m=1}^{M} T_m(z_m | z_{m−1}),    p(z_0, …, z_M) = p(z_M) ∏_{m=1}^{M} U_m(z_{m−1} | z_m).

Naively, the ratio of these densities is

p̄(z_0, …, z_M) / q(z_0, …, z_M) = [ p̄(z_M) ∏_{m=1}^{M} U_m(z_{m−1} | z_m) ] / [ q(z_0) ∏_{m=1}^{M} T_m(z_m | z_{m−1}) ].
To define the transitions T_m and U_m, AIS creates a sequence of unnormalized densities π_1, …, π_{M−1} that "bridge" from a starting distribution q to the target p̄, meaning that π_1 is close to q and π_{M−1} is close to p̄ (we write π_0 = q and π_M = p̄). Then, for each intermediate distribution, T_m is chosen to be a Markov kernel that holds π_m invariant, and U_m to be the reversal of T_m with respect to π_m, defined as

U_m(z_{m−1} | z_m) = T_m(z_m | z_{m−1}) π_m(z_{m−1}) / π_m(z_m).
This choice produces a simplification so that eq. 5 becomes

p(z_0, …, z_M) / q(z_0, …, z_M) = ∏_{m=1}^{M} π_m(z_{m−1}) / π_{m−1}(z_{m−1}).

This can be easily evaluated without needing to evaluate the transition densities. The ratio from eq. 7 can be used to get an expression for the lower bound on log Z. Research has shown that the AIS augmentation may lead to extremely tight lower bounds grosse2015sandwiching; sohl2012hamiltonian; wu2016quantitative.
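To make the AIS recipe concrete, the sketch below (densities and parameters are illustrative, not from the paper) estimates the lower bound E[log of the ratio in eq. 7] for a one-dimensional Gaussian target, using random-walk Metropolis-Hastings kernels for the T_m:

```python
import numpy as np

rng = np.random.default_rng(0)

log_q = lambda z: -0.5 * z**2 / 4 - 0.5 * np.log(2 * np.pi * 4)  # q = N(0, 2^2)
log_p_bar = lambda z: -0.5 * (z - 3.0)**2                        # unnormalized N(3, 1)

M = 50
betas = np.linspace(0.0, 1.0, M + 1)
log_pi = lambda beta, z: (1 - beta) * log_q(z) + beta * log_p_bar(z)  # geometric bridge

def ais_log_weight(rng, n_mh=2, step=0.7):
    z = rng.normal(0.0, 2.0)          # z_0 ~ q = pi_0
    log_w = 0.0
    for m in range(1, M + 1):
        # accumulate log pi_m(z_{m-1}) - log pi_{m-1}(z_{m-1}), as in eq. 7
        log_w += log_pi(betas[m], z) - log_pi(betas[m - 1], z)
        # T_m: Metropolis-Hastings steps leaving pi_m invariant
        for _ in range(n_mh):
            prop = z + step * rng.normal()
            if np.log(rng.uniform()) < log_pi(betas[m], prop) - log_pi(betas[m], z):
                z = prop
    return log_w

est = np.mean([ais_log_weight(rng) for _ in range(300)])
log_Z = 0.5 * np.log(2 * np.pi)       # true normalizer of p_bar
```

With enough bridging densities the average log-weight `est` approaches log Z from below, illustrating how tight the AIS bound can become.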
Hamiltonian Dynamics. Many MCMC methods used to sample from a target p(z) are based on Hamiltonian dynamics betancourt2017geometric; chen2014stochastic; neal2011mcmc; welling2011bayesian. The idea is to create an augmented distribution p(z, ρ) = p(z) S(ρ), where S is a distribution over a momentum variable ρ (e.g. a multivariate Gaussian). Then, one can define numerical integration schemes where z and ρ evolve while nearly holding p(z, ρ) constant. When corrected by a Metropolis-Hastings acceptance step, this can be made to hold p(z, ρ) exactly invariant. This is alternated with a scheme that resamples the momentum while holding p(z, ρ) invariant. When Hamiltonian dynamics work well, z can quickly move around, suppressing random-walk behavior.
There are a variety of different Hamiltonian MCMC methods, corresponding to different integration schemes, momentum distributions, and ways of resampling the momentum. For instance, HMC and Langevin dynamics use the leapfrog integrator, a Gaussian for the momentum variables, and a full resampling of the momentum variables at each step neal2011mcmc; welling2011bayesian. On the other hand, if the momentum variables are only partially resampled, the underdamped variants of HMC and Langevin dynamics are recovered neal2011mcmc. It has been observed that partial resampling may lead to improved performance cheng2018underdamped.
It is easy to integrate Hamiltonian dynamics into AIS. First, define an augmented target p(z, ρ) = p(z) S(ρ) and an augmented starting distribution q(z, ρ) = q(z) S(ρ). Then, create a series of augmented densities bridging the two as π_m(z, ρ) = π_m(z) S(ρ). Finally, define the forward transition T_m to be an iteration of a Hamiltonian MCMC method that leaves π_m(z, ρ) invariant. We will describe a single transition as a sequence of three steps: (1) resample the momentum; (2) simulate Hamiltonian dynamics and apply an accept-reject step; and (3) negate the momentum. The precise process that defines the transition is shown in Alg. 1. Note that this algorithm is quite general, and compatible with HMC, Langevin dynamics and their underdamped variants (by selecting an appropriate integrator and resampling method).
Representing T_m this way makes it easy to show that it holds the density π_m(z, ρ) invariant. The overall strategy is to show that each of the steps 1-3 holds π_m invariant, and therefore so does their composition (neal2011mcmc, §3.2). For steps 1 and 3 this is trivial, provided that S(ρ) = S(−ρ). For step 2, we require that the simulation operator T has unit Jacobian and satisfies T = T⁻¹ (i.e. it is self-inverting). Then, T can be interpreted as a symmetric Metropolis-Hastings proposal, meaning the Metropolis-Hastings acceptance probability is as given. A typical choice for T that satisfies these requirements is the leapfrog integrator with a momentum reversal at the end. (This reversal then gets "un-reversed" in step 3 for accepted moves.)
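A single corrected transition can be sketched as follows for a one-dimensional target with a standard Gaussian momentum (a minimal illustration of Alg. 1's three steps, assuming full momentum resampling and one leapfrog step; the names are ours):

```python
import numpy as np

def leapfrog_negate(z, rho, grad_log_pi, eps):
    # One leapfrog step for the Hamiltonian -log pi(z) + rho^2 / 2, then negate
    # the momentum. This operator has unit Jacobian and is self-inverting.
    rho = rho + 0.5 * eps * grad_log_pi(z)
    z = z + eps * rho
    rho = rho + 0.5 * eps * grad_log_pi(z)
    return z, -rho

def corrected_transition(rng, z, log_pi, grad_log_pi, eps=0.1):
    rho = rng.normal()                                 # step 1: resample momentum, S = N(0, 1)
    z_new, rho_new = leapfrog_negate(z, rho, grad_log_pi, eps)
    log_acc = (log_pi(z_new) - 0.5 * rho_new**2) - (log_pi(z) - 0.5 * rho**2)
    if np.log(rng.uniform()) < log_acc:                # step 2: accept-reject
        z, rho = z_new, rho_new
    return z, -rho                                     # step 3: negate momentum
```

Iterating this transition with a fixed target density produces samples from it; the accept-reject step is exactly what Section 3 proposes to drop.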
Since T_m holds π_m invariant, we can define U_m as the reversal of T_m with respect to π_m. Then, eq. 7 becomes
Using this ratio we get an expression for the lower bound obtained with Hamiltonian AIS. While this method has been observed to yield strong lower bounds on log Z sohl2012hamiltonian; wu2016quantitative (see also Section 5.1), its performance depends on many parameters: the initial distribution q, the momentum distribution S, the momentum resampling scheme, the simulator T, and the bridging densities. We would like to tune these parameters by maximizing the ELBO using reparameterization-based estimators. However, due to the accept-reject step required by the Hamiltonian MCMC transition, the resulting bound is not differentiable, and thus reparameterization gradients are not available.
3 Uncorrected Hamiltonian Annealing
The contribution of this paper is the development of Uncorrected Hamiltonian Annealing (UHA). This method is similar to Hamiltonian AIS (eq. 8), but yields a differentiable lower bound. The main idea is simple. For any transitions T_m and U_m, by the same logic as in eq. 5, we can define the ratio
Hamiltonian AIS defines T_m as a Hamiltonian MCMC kernel that holds π_m invariant, and U_m as the reversal of T_m with respect to π_m. While this leads to a nice simplification, there is no requirement that these choices be made. We can use any transitions as long as the ratio is tractable.
We propose to use the "uncorrected" versions of the transitions T_m and U_m used by Hamiltonian AIS, obtained by dropping the accept-reject steps. To get an expression for the uncorrected U_m we first derive the reversal used by Hamiltonian AIS (Alg. 2). These uncorrected transitions are no longer reversible with respect to the bridging densities π_m, and thus we cannot use the simplification used by AIS to get eq. 8. Despite this, we show that the ratio for the uncorrected transitions can still be easily computed (Thm. 2). This produces a differentiable estimator, meaning the parameters can be tuned by stochastic gradient methods designed to maximize the ELBO.
We start by deriving the process that defines the transition U_m used by Hamiltonian AIS. This is shown in Alg. 2. It can be observed that U_m follows the same three steps as T_m (resample momentum, Hamiltonian simulation with accept-reject, momentum negation), but in reverse order.
First, we claim the general result that if T_1, T_2 and T_3 have reversals Ũ_1, Ũ_2 and Ũ_3, respectively, then the composition of T_1, T_2 and T_3 has reversal given by the composition of Ũ_3, Ũ_2 and Ũ_1 (all reversals with respect to the same density π). Then, we apply this to the corrected T_m and U_m: T_m is the composition of three steps that hold π_m invariant. Thus, its reversal is given by the composition of the reversals of those steps, applied in reversed order. A full proof is in Appendix E. ∎
We now define the "uncorrected" transitions used by UHA, shown in Algs. 3 and 4. These are just the transitions used by Hamiltonian AIS but without the accept-reject steps. (If Hamiltonian dynamics are simulated exactly, the acceptance rate is one and the uncorrected and corrected transitions are equivalent.) We emphasize that, for the "uncorrected" transitions, T_m does not exactly hold π_m invariant and U_m is not the reversal of T_m. Thus, their ratio does not yield a simple expression in terms of the bridging densities as in eq. 8. Nevertheless, the following result shows that their ratio has a simple form.
We derive the densities for the uncorrected T_m and U_m using the rule for transformation of densities under invertible mappings, using that the simulation operator T is self-inverting and volume-preserving. Taking the ratio and simplifying gives the result. A full proof is in Appendix F. ∎
This ratio can be used to get an expression for the lower bound obtained with UHA. As mentioned in Section 2, the parameters of the augmented distributions are tuned to maximize the ELBO, which is equivalent to minimizing the KL-divergence from the augmented q to the augmented p. While computing this ELBO exactly is typically intractable, an unbiased estimate can be obtained using a sample from q, as shown in Alg. 5. If sampling is done using reparameterization, then unbiased reparameterization gradients may be used together with stochastic optimization algorithms to optimize the lower bound. In contrast, the variational lower bound obtained with Hamiltonian AIS (see Alg. 6 in Appendix A) does not allow the computation of unbiased reparameterization gradients.
3.1 Algorithm Details
Simulation of dynamics. We use the leapfrog integrator with step-size ε to simulate Hamiltonian dynamics. This has unit Jacobian and satisfies T = T⁻¹ if the momentum is negated after the simulation, which are the properties required for eq. 11 to be correct (see Theorem 2).
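The self-inverting property is easy to check numerically; the sketch below (illustrative gradient and values) applies a leapfrog step followed by momentum negation twice and recovers the starting point:

```python
import numpy as np

def leapfrog_negate(z, rho, grad_log_pi, eps):
    rho = rho + 0.5 * eps * grad_log_pi(z)   # half step on the momentum
    z = z + eps * rho                        # full step on the position
    rho = rho + 0.5 * eps * grad_log_pi(z)   # half step on the momentum
    return z, -rho                           # negation makes the operator self-inverting

grad = lambda z: -(z - 1.0)                  # e.g. pi = N(1, 1)
z0, r0 = 0.3, -0.8
z1, r1 = leapfrog_negate(z0, r0, grad, eps=0.2)
z2, r2 = leapfrog_negate(z1, r1, grad, eps=0.2)   # applying T twice recovers the input
```

The unit-Jacobian property holds because each of the three updates is a shear (it adds a function of one coordinate to the other).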
Momentum distribution and resampling. We set the momentum distribution S to be a Gaussian with mean zero and covariance Σ. The resampling distribution s(ρ' | ρ) must hold this distribution invariant. As is common, we use s(ρ' | ρ) = N(ρ' ; η ρ, (1 − η²) Σ), where η ∈ [0, 1) is the damping coefficient. If η = 0, the momentum is completely replaced with a new sample from S in each iteration (used by HMC and Langevin dynamics neal2011mcmc; welling2011bayesian). For larger η, the momentum becomes correlated between iterations, which may help suppress random-walk behavior and encourage faster mixing cheng2018underdamped (used by the underdamped variants of HMC and Langevin dynamics neal2011mcmc).
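A sketch of this partial resampling (with illustrative values of η and a scalar Σ) confirms that it leaves the momentum distribution N(0, Σ) invariant:

```python
import numpy as np

rng = np.random.default_rng(0)
eta, sigma2 = 0.8, 2.0      # damping coefficient and (scalar) momentum variance

def refresh(rho):
    # s(rho' | rho) = N(eta * rho, (1 - eta^2) * sigma2) leaves N(0, sigma2) invariant:
    # the output is a weighted sum of two Gaussians whose variances add back to sigma2
    return eta * rho + np.sqrt((1 - eta**2) * sigma2) * rng.normal(size=np.shape(rho))

rho = rng.normal(0.0, np.sqrt(sigma2), size=100000)   # start at stationarity
rho = refresh(rho)                                    # marginal distribution is unchanged
```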
Bridging densities. We set π_m(z) = q(z)^{1−β_m} p̄(z)^{β_m}, where β_m = m / M, so that π_0 = q and π_M = p̄.
Computing gradients. We set the initial distribution q to be a Gaussian, and perform all sampling operations in Alg. 5 using reparameterization vaes_welling; rezende2014stochastic; doublystochastic_titsias. Thus, the whole procedure is differentiable, and reparameterization-based gradients may be used to tune parameters by maximizing the ELBO. These parameters include the initial distribution q, the covariance Σ of the momentum distribution, the step-size ε of the integrator, the damping coefficient η of the momentum resampling distribution, and the parameters of the bridging densities (including the coefficients β_m), among others. As observed in Section 5.1.1, tuning all of these parameters may lead to considerable performance improvements.
4 Related Work
There are three lines of work that produce differentiable variational bounds integrating Monte Carlo methods. One is Hamiltonian VI (HVI) salimans2015markov; wolf2016variational. It uses eq. 9 to build a lower bound on log Z, with the forward transitions set to uncorrected Hamiltonian transitions (like UHA but without bridging densities) and the reverse transitions set to conditional Gaussians parameterized by learnable functions. Typically, a single transition is used, and the parameters of the transitions are learned by maximizing the resulting ELBO. (The formulation of HVI allows the use of more than one transition; however, this increases the number of reverse models that must be learned, and so it is not typically done in practice. Indeed, experiments by Salimans et al. salimans2015markov use only one HMC step while varying the number of leapfrog integration steps, and results from Wolf et al. wolf2016variational show that increasing the number of transitions may actually yield worse bounds; they conjecture that this is due to the difficulty of learning inverse dynamics.)
A second method is Hamiltonian VAE (HVAE) caterini2018hamiltonian, based on Hamiltonian Importance Sampling neal2005hamiltonianis. It augments the variational distribution with momentum variables, and uses the leapfrog integrator to simulate Hamiltonian dynamics (a deterministic invertible transformation with unit Jacobian) with a tempering scheme, acting as a target-informed normalizing flow rezende2015variational; tabak2013family.
The third method is Importance Weighting (IW) IWVAE; cremer2017reinterpreting; domke2018importance. Here, the idea is that log Z ≥ E log [ (1/K) ∑_{k=1}^{K} p̄(z_k) / q(z_k) ], where z_1, …, z_K are drawn i.i.d. from q, and that this bound, rather than the traditional ELBO, can be optimized. More generally, other Monte Carlo estimators can be used domke2019divide.
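A minimal sketch of this bound for a one-dimensional Gaussian target (illustrative densities, not the VAE setting of these papers) shows that it tightens as K grows:

```python
import numpy as np

rng = np.random.default_rng(0)

log_q = lambda z: -0.5 * z**2 / 4 - 0.5 * np.log(2 * np.pi * 4)  # q = N(0, 2^2)
log_p_bar = lambda z: -0.5 * (z - 3.0)**2                        # unnormalized N(3, 1)

def iw_bound(K, n_rep=2000):
    z = rng.normal(0.0, 2.0, size=(n_rep, K))       # K i.i.d. samples from q per estimate
    log_w = log_p_bar(z) - log_q(z)
    m = log_w.max(axis=1, keepdims=True)            # numerically stable log-mean-exp
    return np.mean(m[:, 0] + np.log(np.mean(np.exp(log_w - m), axis=1)))

b1, b64 = iw_bound(1), iw_bound(64)                 # K = 1 recovers the plain ELBO
log_Z = 0.5 * np.log(2 * np.pi)
```

For K = 1 this reduces to the standard ELBO, and the bound increases monotonically in expectation toward log Z as K grows.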
Some work defines novel contrastive-divergence-like objectives in terms of the final iteration of an MCMC chain ruiz2019contrastive; li2017approximate. These do not provide an ELBO-like variational bound. While in some cases the initial distribution can be optimized to minimize the objective ruiz2019contrastive, gradients do not flow through the MCMC chains, meaning MCMC parameters cannot be optimized by gradient methods.
For latent variable models, Hoffman hoffman2017MCMCvae suggested running a few MCMC steps after sampling from the variational distribution and before computing gradients with respect to the model parameters, which is expected to "debias" the gradient estimator, bringing it closer to the true likelihood gradient. The variational distribution is simultaneously learned to optimize a standard ELBO. (AIS can also be used ding2019vaeAIS.)
5 Experiments and Results
This section presents results using UHA for Bayesian inference problems on several models of varying dimensionality and for VAE training. In all cases, for UHA and Hamiltonian AIS we use under-damped Langevin dynamics; that is, we perform just one leapfrog step per transition and partially resample the momentum. We implement all algorithms using JAX jax2018github.
5.1 Inference tasks
This section shows results using UHA for Bayesian inference tasks. For this set of experiments, for UHA we tune the initial distribution q, the integrator's step-size ε and the damping coefficient η. We include detailed results tuning more parameters in Section 5.1.1.
Models. We consider four models: Brownian motion, which models a Brownian motion process with a Gaussian observation model; Convection Lorenz bridge, which models a nonlinear dynamical system for atmospheric convection; and logistic regression with the a1a and madelon datasets. The first two are obtained from the "Inference gym" inferencegym2020.
Baselines. We compare UHA against IW, HVAE, a simple variant of HVI, and Hamiltonian AIS (HAIS). For all methods that rely on HMC (i.e. all except IW) we use a single integration step-size common to all dimensions and fix the momentum distribution to a standard Gaussian. For HVI we learn the initial distribution q, the integration step-size ε and the reverse dynamics (set to a factorized Gaussian with mean and variance given by affine functions), and for HVAE we learn q and the tempering scheme (we use the quadratic scheme parameterized by a single parameter).
Training details. We set q to be a mean-field Gaussian initialized to a maximizer of the ELBO, and tune the parameters of each method with Adam. We repeated all simulations for a range of Adam step-sizes and selected the best one for each method. Since Hamiltonian AIS' parameters cannot be tuned by gradient descent, we find a good pair (ε, η) by grid search: we consider three values of η and three values of ε corresponding to three different rejection rates, test all 9 possible combinations, and select the best one.
Results are shown in Fig. 1. To simplify comparisons against IW, results are shown as a function of K, the number of likelihood evaluations required by each method to build the lower bound. (For IW, K is the number of samples used to build the lower bound; for UHA and Hamiltonian AIS, it is the number of bridging densities plus one; for HVI and HVAE, it is the number of leapfrog steps used plus one.) It can be observed that our method yields better lower bounds than all other competing approaches for all models considered, and that increasing the number of bridging densities consistently leads to better results. The next best performing method is Hamiltonian AIS. IW also shows good performance for the lower-dimensional Brownian motion model. However, for models of higher dimensionality, IW leads to bounds that are several nats worse than the ones achieved by UHA. Finally, HVI and HVAE yield bounds that are much worse than those achieved by the other three methods, and do not appear to improve consistently for larger K. For HVAE, these results are consistent with the ones in the original paper (caterini2018hamiltonian, §4), in that higher K may sometimes hurt performance. For HVI, we believe this is related to the use of just one HMC step and suboptimal inverse dynamics.
(Figure 1 caption.) For K = 1, all methods reduce to plain VI. Vertical bars indicate one standard deviation obtained by running simulations with four different random seeds.
5.1.1 Tuning More Parameters with UHA
A basic version of UHA involves fitting a variational distribution q using plain VI, and then tuning the integration step-size ε and the damping coefficient η. However, more parameters could be tuned:
Momentum distribution covariance Σ: We propose to learn a diagonal matrix instead of using the identity.
Bridging densities' coefficients β: Typically β_m = m / M. We propose to learn the sequence β_1, …, β_{M−1}, with the restrictions β_0 = 0, β_M = 1, and β_{m−1} < β_m.
Initial distribution q: Instead of fixing q to be a maximizer of the typical ELBO, we propose to learn it to maximize the augmented ELBO obtained using UHA.
Integrator's step-size ε: Instead of learning a single step-size ε, we propose to learn a step-size ε_m that is a function of the index m. In our experiments we use an affine function.
Bridging densities' parameters: Instead of setting the m-th bridging density to π_m = q^{1−β_m} p̄^{β_m}, we propose to set it to π_m = r_m^{1−β_m} p̄^{β_m}, where r_m is a mean-field Gaussian with mean and diagonal covariance specified as affine functions of β_m.
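The monotonicity constraint on the learned β coefficients can be enforced with, for example, a cumulative sum of softmax increments (this particular construction is our illustration; the paper does not specify its parameterization here):

```python
import numpy as np

def bridging_schedule(w):
    # Map M unconstrained parameters w to 0 = beta_0 < beta_1 < ... < beta_M = 1
    # via a cumulative sum of softmax increments.
    e = np.exp(w - np.max(w))                # numerically stable softmax
    inc = e / e.sum()
    return np.concatenate([[0.0], np.cumsum(inc)])

betas = bridging_schedule(np.zeros(8))       # equal parameters recover beta_m = m / M
```

Since the map is smooth in w, the schedule can be tuned with the same reparameterization gradients as the other parameters.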
We consider the four models described previously and compare three methods: UHA tuning all parameters described above, UHA tuning only the pair (ε, η), and Hamiltonian AIS with parameters obtained by grid search. We perform the comparison for a range of values of K. (For the largest values of K we tune UHA's parameters using a smaller number of bridging densities and extrapolate them as explained in Appendix D.)
Results are shown in Fig. 2. It can be observed that tuning all parameters with UHA leads to significantly better lower bounds than those obtained by Hamiltonian AIS (or by UHA tuning only ε and η). Indeed, for the logistic regression models, UHA tuning all parameters with a given K leads to results comparable to the ones obtained by Hamiltonian AIS with a considerably larger K.
To verify which parameters lead to larger performance improvements, we tested UHA tuning different subsets of them. Fig. 3 shows the results. It can be observed that tuning the bridging parameters β and the initial approximation q leads to the largest gains in performance, and that tuning all parameters always outperforms tuning smaller subsets. We show a more thorough analysis, including more subsets and values of K, in Appendix B.
5.2 VAE training
Our method can be used to train latent variable models, such as Variational Auto-encoders (VAEs) vaes_welling; rezende2014stochastic. In this case the initial approximation q(z | x) and the model p(x, z) are parameterized by two neural networks (encoder and decoder), whose parameters are trained by maximizing the ELBO. UHA can be used to train VAEs by augmenting these two distributions as described in Section 3.
We use three datasets: mnist lecun1998gradient (handwritten digits), emnist-letters cohen2017emnist (letters A-Z), and kmnist clanuwat2018deep (cursive Kuzushiji). All consist of 28×28 greyscale images. In all cases we use stochastic binarization salakhutdinov2008quantitative and split each dataset into training, validation, and test sets. All datasets are available in tensorflow-datasets TFDS.
Baselines. We compare against Importance Weighted Auto-encoders IWVAE and plain VAE training vaes_welling.
Architecture details. We set q(z | x) to a diagonal Gaussian, p(z) to a standard normal, and p(x | z) to a Bernoulli. We consider two architectures for the encoder and decoder: (1) feed-forward networks with one hidden layer and ReLU non-linearities; (2) the architecture used by Burda et al. IWVAE, feed-forward networks with two hidden layers and tanh non-linearities.
Training details. In all cases the encoder and decoder are initialized to parameters that maximize the ELBO. For IW we tune the encoder and decoder parameters (using the doubly-reparameterized estimator doublyrep), and for UHA we tune the integration step-size ε, damping coefficient η, bridging parameters β, momentum covariance Σ (diagonal), and the decoder parameters. Following Caterini et al. caterini2018hamiltonian, we constrain ε to avoid unstable behavior of the leapfrog discretization. We train with Adam and use the validation set for early stopping. We repeated all simulations for three different random seeds. In all cases the standard deviation of the results was negligible (not shown in tables).
All methods achieved better results using the architecture with one hidden layer. These results are shown in Tables 1 and 2. The first shows the ELBO on the test set achieved for different values of K, and the second the log-likelihood on the test set estimated with AIS wu2016quantitative. It can be observed that UHA leads to higher ELBOs, higher log-likelihoods, and smaller variational gaps (difference between ELBO and log-likelihood) than IW for all datasets, with the difference between both methods' performance increasing with K. Notably, for the largest values of K, the variational gap for UHA becomes quite small for all datasets.
Results for the architecture from Burda et al. IWVAE (two hidden layers) are shown in Tables 3 and 4 (Appendix C). Again, we observe that UHA consistently leads to higher ELBOs, and the best test log-likelihood was consistently achieved by UHA with the largest K. However, for smaller K, IW sometimes had better log-likelihoods than UHA (despite worse ELBOs).
Since UHA yields a differentiable lower bound, one could tune other parameters not considered in this work. For instance, a different momentum distribution could be used for each bridging density. We believe additions such as this may yield further gains. Also, our method can be used to get tight and differentiable upper bounds on log Z using the reversed AIS procedure described by Grosse et al. grosse2015sandwiching.
Finally, removing accept-reject steps might sometimes lead to instabilities during optimization if the step-size ε becomes large. We observed this effect when training VAEs on some datasets for the larger values of K. We solved this by constraining the range of ε (as previously done by Caterini et al. caterini2018hamiltonian). While this simple solution works well, we believe that other approaches (e.g. regularization, automatic adaptation) could work even better. We leave the study of such alternatives for future work.
Appendix A Generating the Hamiltonian AIS bound
Appendix B More results tuning more subsets of parameters for UHA
We tested UHA tuning different subsets of the parameters described in Section 5.1.1. Fig. 4 shows the results. The first row shows the results obtained by tuning the pair (ε, η) and each other parameter individually for different values of K, and the second row shows the results obtained by tuning increasingly more parameters. It can be observed that tuning the bridging parameters β and the initial approximation q leads to the largest gains in performance.
Appendix C Results using architecture from Burda et al. [IWVAE]
In this section we show the results achieved for VAE training using the architecture from Burda et al. [IWVAE] (with one stochastic layer). In this case the encoder and decoder consist of feed-forward neural networks with two hidden layers and tanh non-linearities. All training details are the same, but with a different constraint on the step-size ε. Tables 3 and 4 show the results.
Appendix D Extrapolating optimal parameters for UHA
Some results in Section 5.1.1 (and Appendix B) use a number of bridging densities up to 512. As mentioned previously, for those simulations, when a large number of bridging densities was used, we optimized the parameters for a smaller number of bridging densities and extrapolated them to the larger one. We now explain this procedure.
From the parameters considered, the only ones that need to be "extrapolated" are the step-size ε and the bridging parameters β. All other parameters are tuned using the smaller number of bridging densities, and the values obtained are used directly with the larger number.
For the bridging parameters β, we use a simple interpolation. Define f to be the piecewise linear function (with M "pieces") that satisfies f(m / M) = β_m for m = 0, …, M (this is a bijection from [0, 1] to [0, 1]). Then, when using M' bridging densities, we simply define β'_j = f(j / M') for j = 0, …, M'.
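This piecewise-linear interpolation of the schedule can be sketched with np.interp (the schedule values below are illustrative):

```python
import numpy as np

M, M_new = 4, 8
betas = np.array([0.0, 0.4, 0.7, 0.9, 1.0])   # learned schedule for M = 4 (illustrative)

# piecewise-linear f with f(m / M) = beta_m, evaluated on the finer grid j / M_new
grid_old = np.linspace(0.0, 1.0, M + 1)
grid_new = np.linspace(0.0, 1.0, M_new + 1)
betas_new = np.interp(grid_new, grid_old, betas)
```

The interpolated schedule keeps the endpoints 0 and 1, remains strictly increasing, and agrees with the original schedule on the coarse grid.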
For the step-size ε, we use a simple scaling transformation. While other transformations could be used, we observed this one to work best in practice. (In fact, we obtained this rule by analyzing the dependence of the optimal ε on the number of bridging densities for several tasks.)
Appendix E Proof of Lemma 1
We begin with the following result.
Let T_1, T_2 and T_3 be three transitions that leave some distribution π invariant, and let Ũ_1, Ũ_2 and Ũ_3 satisfy Ũ_i(z | z') π(z') = T_i(z' | z) π(z) (i.e. Ũ_i is the reversal of T_i with respect to π). Then the reversal of the composition of T_1, T_2 and T_3 (applied in that order) with respect to π is given by the composition of Ũ_3, Ũ_2 and Ũ_1 (applied in that order).
The rest of the proof is straightforward. Let the three steps from the corrected version of T_m (Alg. 1) be denoted T_1, T_2 and T_3. The latter two (Hamiltonian simulation with accept-reject step and momentum negation) satisfy detailed balance with respect to π_m [neal2011mcmc, §3.2]. Thus, for these two, the reversal is defined by the same process as the original step. For T_1 (momentum resampling), its reversal is given by the reversal of the resampling distribution s with respect to the momentum distribution S. We call this s̃, and it satisfies s̃(ρ | ρ') S(ρ') = s(ρ' | ρ) S(ρ).
Appendix F Proof of Theorem 2
To deal with delta functions, whenever a transition states "Set x ← y", we use a Gaussian density for x centered at y with variance σ², and take the limit σ → 0. We use N(x; 0, σ²) to denote the density of a Gaussian with mean zero and variance σ² evaluated at x, and flip to denote the operator that negates the momentum.
We first compute the density of the uncorrected forward transition T_m (Alg. 3). Its two steps are the momentum resampling ρ' ∼ s(ρ' | ρ_{m−1}) and the deterministic map (z_m, ρ_m) = T(z_{m−1}, ρ'). Thus,
Also, since T is an invertible transformation with unit Jacobian and inverse T⁻¹ = T, we get that
where T_ρ is the operator that applies T and returns the second (momentum) component of the result (and similarly for T_z).
Now, we compute the density of the uncorrected reverse transition U_m (Alg. 4), which applies the deterministic map and then the reversal s̃ of the momentum resampling. Thus,