1 Introduction
Several recent work attempts to build powerful variational distributions using unadjusted Hamiltonian Monte Carlo (HMC) transition kernels salimans2015markov; wolf2016variational; caterini2018hamiltonian; wu2020stochastic; thin2021monte; dais; UHA; jankowiak2021surrogate; chen2022bayesian. In principle, one would like to use the last sample marginal of the HMC chain as variational distribution. Since this marginalization is typically intractable, these methods use auxiliary variables agakov2004auxiliary; they build an augmented variational distribution that includes all samples generated in the chain, an augmented target, and perform variational inference (VI) on these augmented distributions. Training proceeds by maximizing the ELBO using unbiased reparameterization gradients, made possible by using uncorrected transitions.
One such method is Unadjusted Langevin Annealing (ULA) wu2020stochastic; thin2021monte, which can be seen as an approximation of Annealed Importance Sampling neal2001ais; jarzynski1997nonequilibrium. The method builds a sequence of densities that gradually bridge an initial approximation to the target, and augments the variational distribution and target using uncorrected overdamped Langevin kernels targeting each of these bridging densities.
While ULA has shown good performance, it has two limitations: It is based on overdamped Langevin dynamics, which are known to suffer from random walk behavior (neal2011mcmc, §5.2); and it augments the target using an approximation of the Annealed Importance Sampling augmentation, which is known to be suboptimal doucet2006sequential. These two limitations were addressed independently. Uncorrected Hamiltonian Annealing (UHA) UHA; dais extends ULA to use underdamped Langevin transitions, known to improve convergence over the overdamped variant cheng2018underdamped. Meanwhile, Monte Carlo Diffusion (MCD) doucet2022annealed extends ULA to use better augmentations for the target. Both of these enhancements lead to significant performance improvements over ULA, albeit through orthogonal enhancements.
These methods were developed using different approaches: While UHA was developed as a differentiable approximation to Annealed Importance Sampling with underdamped Langevin transitions, MCD was developed by numerically simulating the overdamped Langevin diffusion process and its time reversal, with intractable terms approximated using a score network song2019generative. The fact that these methods have different derivations and are based on different techniques makes it difficult to reason about their benefits, drawbacks, and the connections between them. It also means it is not obvious how to combine both of their benefits.
This paper introduces a formulation for Langevinbased VI that encompasses most previously proposed methods. This formulation can be seen as a generalization of MCD doucet2022annealed that uses underdamped Langevin dynamics, instead of the overdamped variant. Like MCD doucet2022annealed, our approach is based on the analysis of continuous time processes. Its main components are the underdamped Langevin diffusion process and its time reversal, which are numerically simulated to derive the augmentations for the variational approximation and target. We introduce our approach for Langevinbased VI in section 3.
Our method is compatible with multiple numerical simulation schemes, with different choices leading to different algorithms. Section 4 introduces a simulation scheme based on splitting methods bou2010long; melchionna2007design. We show that this specific scheme can be used to recover all of ULA, MCD and UHA, thus providing a unified view for all of them, shedding light on the connections between them, their benefits and limitations.
Additionally, our formulation facilitates the development of new methods. We use it to propose Langevin Diffusion VI (LDVI), a novel method that combines the best of UHA and MCD: it uses powerful and improved augmentations for the target, like MCD, while enjoying the benefits of underdamped Langevin transitions, like UHA. We evaluate LDVI empirically in section 5, showing that it outperforms ULA, UHA and MCD in a range of inference tasks.
Finally, we explore the importance of the numerical simulation scheme. In section 5.2 we observe that one can also develop methods using a EulerMaruyama type discretization scheme. Our experimental results therein show that the simulation method used plays a crucial role in the algorithms’ performance, suggesting a possible direction to explore to further improve these methods.
2 Preliminaries
Variational Inference
VI approximates a target distribution (known up to the normalizing constant ) with a simpler distribution . It works by finding the parameters of that maximize the evidence lower bound
(1) 
Noting that , it can be seen that maximizing the ELBO is equivalent to minimizing the KLdivergence from the approximation to the target .
McmcVi
Many methods have been developed to use MCMC to build powerful variational approximations. Ideally, one would use the last sample marginal of an MCMC chain as the approximating distribution. However, since computing this marginal is typically intractable, most methods are based on augmentations agakov2004auxiliary and variants of Annealed Importance Sampling wu2020stochastic; thin2021monte; UHA; dais; doucet2022annealed. They define a sequence of unnormalized densities , for and , forward transitions that (approximately) leave invariant, backward transitions , and build the augmented target and variational distribution as
(2) 
Then, one attempts to tune the forward and backward transitions to maximize the ELBO between these augmented distributions, equivalent to minimizing the KL divergence between them. The chain rule for the KLdivergence
cover1999elements then guarantees , justifying the use of the marginal of over to approximate the original target distribution.While augmentations bypass intractable marginalizations, they introduce additional looseness in that . For a given set of forward transitions , this inequality can in principle be made tight by using the optimal backward transitions doucet2006sequential
(3) 
In practice, however, the marginal densities are not exactly known, so algorithms must use other choices for . There are two desiderata: the ratio must be tractable (required to get a tractable expression for the ELBO between the augmented distributions), and the transitions should be differentiable (not strictly needed, but desirable, as it allows the use of reparameterization gradients to tune all parameters). Most recent methods were developed with these two properties in mind salimans2015markov; wolf2016variational; wu2020stochastic; thin2021monte; UHA; dais; doucet2022annealed; jankowiak2021surrogate. For instance, ULA uses unadjusted overdamped Langevin kernels for both and , and UHA extends it to use underdamped Langevin kernels. For the latter, the distributions from eq. 2 are further augmented to include momentum variables , leading to
(4) 
and to the augmented ELBO
(5) 
3 Langevin Diffusion Variational Inference
This section introduces our approach for Langevinbased VI. It provides a way to build the augmented distributions from eq. 4. Its main components are (1) the underdamped Langevin diffusion process and its time reversal, (2) a numerical simulation scheme to approximately simulate these processes, and (3) a score network doucet2022annealed; song2019generative used to approximate intractable terms in the timereversed process. Together, these produce the forward and backward transitions and with a tractable ratio. Since our approach is compatible with many simulation schemes, we first introduce it in a general way, and present a specific simulation scheme in section 4.
3.1 Langevin diffusion
This subsection introduces the Langevin diffusion process and its time reversal, which will be used to derive the forward and backward transitions in the following sections. Let be a sequence of densities bridging from the starting distribution for to the target for . That is, and . The Langevin diffusion process is characterized by the following stochastic differential equation (SDE):
(6) 
where , is a standard Wiener process, is a friction coefficient, and . The forward transitions will be derived by simulating this process. The motivation behind the use of this process comes from its good convergence properties. Intuitively, evolving eq. 6 yields values for that tend to be close to . Thus, one may hope that the marginal density of the process at time is close to , meaning the distribution of the final value may be close to the target of interest.
The backward transitions , on the other hand, will be derived by simulating the timereversed SDE corresponding to eq. 6. Defining and , this timereversed process is characterized by (see appendix B)
(7) 
where is the marginal of the forward process at time . Formally, this process is initialized with . However, in what follows, where we use it to define the backward transitions to augment the target, it will be initialized as . The motivation for using the reverse time SDE from eq. 7 is that, under exact simulation, it yields the optimal backward transitions from eq. 3 (i.e. no additional looseness in the augmented ELBO).
3.2 Forward and backward transitions via SDE simulation
The forward and backward transitions will be obtained by simulating the forward and timereversed processes for a fixed period of time . If we could simulate the above SDEs exactly, then

[leftmargin=*]

The forward transition would be obtained by simulating the forward process from time up to time , starting from the initial values ,

The backward transition would be obtained by simulating the reversetime SDE from time up to time , starting from the initial values .
It can be shown that these backward transitions are optimal. That is, if one could simulate eqs. 7, LABEL: and 6 exactly to get the forward and backward transitions defined above, the resulting augmentations would be tight in the sense that the augmented ELBO from eq. 5 would have no additional looseness compared to an ELBO defined between the last sample marginals and .
Unfortunately, these transitions are intractable for two reasons. First, the forward marginal density that appears in the reverse SDE is unknown. Second, it is intractable to exactly simulate or evaluate either of the above SDEs.
Approximating
The first source of intractability of the optimal transitions is the score term , which is typically unavailable. Inspired by the fact that is expected to be close to , we propose to approximate this term as
(8) 
where is some learnable function approximator. Following recent work song2020score; song2019generative; sohl2015deep; ho2020denoising; doucet2022annealed
, we use a neural network, typically referred to as score network, which is trained with the other parameters to maximize the ELBO. The intuition behind our approximation in
eq. 8 comes from considering scenarios where the forward transitions mix fast. In such cases will be close to , and thus the approximation should work well. (In fact, as we show in section 4.2, several wellknown methods are recovered by removing the score network; that is, fixing .)Transitions via numerical simulation
The second source of intractability is that it is rarely possible to simulate the forward and reverse SDEs exactly. Thus, we use a numerical simulation scheme to approximately simulate them. The requirements for the simulation scheme are (1) it must yield transitions with a tractable ratio, and (2) it must be differentiable, in order to allow unbiased reparameterization gradients doublystochastic_titsias; vaes_welling; rezende2014stochastic. Section 4 presents a scheme that satisfies these.
3.3 Framework for Langevinbased VI
Our formulation for Langevinbased VI is based on the transitions described above. To get a specific instance, several choices are required:

[leftmargin=*]

A momentum augmented target that retains original target as marginal, often defined as ,

A momentum augmented initial approximation , often defined as ,

A score network to approximate intractable term involving ,
Given specific choices for each of these components, we can compute
(9) 
which is required to estimate and optimize the ELBO from
eq. 5.4 Numerical Simulation Scheme
This section introduces two numerical simulation schemes, one for the forward SDE and one for the timereversed SDE, which yield transitions with a tractable ratio. We begin by giving explicit algorithmic representations for these transitions and an expression for their ratio (section 4.1). We then explain how our formulation for Langevinbased VI with these transitions can be used to recover several existing methods, including ULA, MCD and UHA (section 4.2), and also how it can be used to derive new methods (section 4.3).
4.1 Forward and backward transitions
The forward transitions used to simulate to forward SDE are shown in algorithm 1. They consist on two steps: (partial) momentum resampling from some distribution (see section 4.1.1), followed by a single leapfrog integrator step typically used to simulate Hamiltonian dynamics neal2011mcmc; betancourt2017conceptual (denoted by in algorithm 1, which consists on sequential deterministic updates to the variables , , and ). As explained in section 4.1.1, these transitions are derived by simulating the forward SDE from eq. 6 using splitting methods bou2010long; melchionna2007design.
The backward transitions used to simulate the timereversed SDE are shown in algorithm 2. They also consist on two steps: the inverse of a single leapfrog integrator step used to simulate Hamiltonian dynamics, followed by a (partial) momentum resampling from some distribution . We include their derivation and details for the momentum resampling distribution in section 4.1.1.
In order to use these transitions for Langevinbased VI, we need an expression for their ratio. This is given in lemma 1, proved in appendix D.
Lemma 1.
Let and be the transitions defined in algorithms 2, LABEL: and 1, and the momentum resampling distributions used in these transitions, the discretization stepsize, and the damping coefficient. Then,
(10) 
where is as defined in algorithms 2, LABEL: and 1, given by .
Using the transitions from algorithms 2, LABEL: and 1 and their ratio given in lemma 1 we can get an exact expression for the augmented ELBO from eq. 5
. While computing this augmented ELBO exactly is typically intractable, an unbiased estimate can be obtained using a sample from
, as shown in algorithm 3.4.1.1 Derivation of forward and backward transitions
We now show the derivation for the forward and backward transitions using splitting methods bou2010long; melchionna2007design, which have been observed to work well for Langevin processes leimkuhler2013rational; monmarche2021high. Simply put, splitting methods split an SDE into multiple simpler components, simulate each component for a timestep of size , and then combine the solutions sequentially to build the sized step for the original SDE.
Forward transitions
These transitions are obtained by simulating the forward SDE using a splitting method. Specifically, we split the SDE in three components, , and ,
each one simpler than the original SDE, and then build the forward transition by sequentially composing the simulations for components . The final forward transition shown in algorithm 1 can be obtained by noting that each of the individual components can be simulated with the following strategies:
 Simulating :

This can be done exactly. Given initial values at time , simulating for a time results in .
 Simulating :

Given initial values at time , and using that for small , simulating for a time results in .
 Simulating :

This can be done exactly, as corresponds to an Ornstein–Uhlenbeck process. Given an initial value of at time , simulating for a time gives , where . However, as we will see next, exact simulation for the corresponding component of the reverse SDE is not possible. Thus, it may be useful to simulate approximately as well, using the EulerMaruyama scheme maruyama1955continuous; bayram2018numerical, which gives . We use to denote generically the momentum resampling distribution used, which could be any of the ones just described.
In summary, simulating yields the momentum resampling step, while composing the simulations for yields the leapfrog integration step (note that since is simulated twice, it is done with a stepsize of .)
Backward transitions
Like for the forward transitions, these are derived by splitting the reverse SDE in three components, , and (using our approximation for the score term),
Then, we construct the backward transition by sequentially composing the simulations for components , where the sequence yields the inverse of the leapfrog integrator step, and yields the momentum resampling step. The derivation follows the one for the forward transitions closely, with one main difference: simulating component has an additional difficulty, due to the presence of the term involving the score network . While in general cannot be simulated exactly (unless we fix ), it can be done approximately using the EulerMaruyama method, which results in the momentum resampling distribution . We give further details in appendix C.
4.2 Recovering known methods
As mentioned previously, ULA, MCD and UHA were originally derived using different techniques and approaches. Some of these methods use overdamped Langevin dynamics, while others use the underdamped variant; some were derived as approximations of Annealed Importance Sampling, while others emerged from an analysis of continuous time diffusion processes. This section’s main purpose is to show that all of these methods can be derived in a unified way using our general formulation for Langevinbased VI with the numerical simulation schemes introduced above. We begin by briefly giving details about ULA, MCD and UHA (including their different derivations), followed by an explanation of how these methods can be recovered with our approach.
ULA wu2020stochastic; thin2021monte works directly with unadjusted overdamped Langevin kernels (i.e. no momentum variables), defining the forward transitions as
(11) 
Then, using the fact that is approximately reversible with respect to when the stepsize is small, it defines the backward transitions as . The ratio between these transitions, and thus the augmented ELBO, are straightforward to compute (see appendix D). Broadly speaking, the method can be seen as a differentiable approximation to Annealed Importance Sampling with overdamped Langevin transitions.
Theorem 4.1.
ULA is recovered by our formulation from section 3.3 with , , , and the transitions from algorithms 2, LABEL: and 1 with exact momentum resampling (possible due to removing the score network) with (high friction limit).
MCD doucet2022annealed was developed by studying the overdamped Langevin diffusion process, given by
(12) 
It uses unadjusted overdamped Langevin kernels for the forward transitions (i.e. simulating eq. 12 with the EulerMaruyama scheme), and uses backward transitions derived by simulating the reversetime diffusion corresponding to eq. 12, also with the EulerMaruyama scheme, using a score network to approximate intractable terms.
Theorem 4.2.
MCD is recovered by our formulation from section 3.3 with , , , the forward transitions from algorithm 1 with exact momentum resampling for (high friction limit), and the backward transition from algorithm 2 using the momentum resampling distribution described in appendix D.
UHA UHA; dais was developed as an approximation to Annealed Importance Sampling using underdamped Langevin dynamics. It uses unadjusted underdamped Langevin kernels for the forward transitions, and the unadjusted reversal of a Metropolis adjusted underdamped Langevin kernel for the backward transitions (simply put, Geffner and Domke UHA and Zhang et al. dais derived an exact expression for the reversal of a Metropolis adjusted underdamped Langevin kernel, and proposed to remove the correction step to define the backward transition).
Theorem 4.3.
UHA is recovered by our formulation from section 3.3 with , , , and the transitions from algorithms 2, LABEL: and 1 with exact momentum resampling (possible due to removing the score network) with a learnable .
We include proofs in appendix D. All three proofs follow similar steps, we get the exact transitions and expression for the ELBO given the specific choices made in each case, and compare to that of the original method, verifying their equivalence.
4.3 A New Method
Apart from recovering many existing methods, new algorithms can be derived using our formulation for Langevinbased VI with the proposed simulation scheme. As an example, we propose Langevin Diffusion VI (LDVI), a novel method that combines the benefits of MCD (powerful backward transitions aided by a score network) with the benefits of UHA (underdamped dynamics). It is obtained by using the formulation from section 3.3 with , , a full score network , and the transitions from algorithms 2, LABEL: and 1 with the momentum resampling distributions given by
(13) 
Simply put, these distributions are obtained by simulating components and using the EulerMaruyama scheme maruyama1955continuous; bayram2018numerical.
5 Experiments
This section presents an empirical evaluation of different methods that follow our framework. We are interested in the effect that different choices have on the final performance. Specifically, we are interested in studying the benefits of using underdamped dynamics (UHA and LDVI) instead of the overdamped variant (ULA and MCD), the benefits of using powerful backward transitions aided by learnable score networks (MCD and LDVI), and the benefits of combining both improvements (LDVI) against each individual one (MCD and UHA). We explore this empirically in section 5.1.
Additionally, we are interested in studying how the numerical simulation scheme used affects the methods’ performance. We explore this in section 5.2, where we propose and evaluate empirically an alternative simulation scheme based on a simpler splitting than the one introduced in section 4.
In all cases, we use the different methods to perform inference on a wide range of tasks for values of
, and report mean ELBO achieved after training, with standard deviations computed over three different random seeds. For all methods we set
to a meanfield Gaussian, initialized to a maximizer of the ELBO, and train all parameters using Adam for 150000 steps. We repeat all simulations for the learning rates , and , and keep the best one for each method and model. For all methods we tune the initial distribution , discretization stepsize , and the bridging densities’ parameters . For LDVI and UHA we also tune the damping coefficient , and for LDVI and MCD we tune the score network, which has two hidden layers with residual connections
he2016deep. We implement all methods using Jax jax2018github.5.1 Benefits of underdamped dynamics and score networks
Logistic Regression
shows results achieved by ULA, MCD, UHA and LDVI on a logistic regression model with two datasets,
ionosphere sigillito1989classification () and sonar gorman1988analysis (). It can be observed that going from overdamped to underdamped dynamics yields significant performance improvements: For the sonar dataset, UHA with bridging densities performs better than ULA with , and LDVI with performs better than MCD with . Similarly, it can be observed that the use of score networks for the backward transitions also yields significant gains: For the sonar dataset, MCD and LDVI with outperform ULA and UHA, respectively, with . Finally, results show that combining both improvements is beneficial as well, as it can be seen that LDVI outperforms all other methods for all datasets and values of .Ionosphere  Sonar  

ULA  MCD  UHA  LDVI  ULA  MCD  UHA  LDVI  
Time series
We consider two time series models obtained by discretizing two different SDEs, one modeling a Brownian motion with a Gaussian observation model (), and other modeling a Lorenz system, a threedimensional nonlinear dynamical system used to model atmospheric convection (). We give details for these models in appendix A. Both were obtained from the “Inference Gym” inferencegym2020. Results are shown in table 2. The conclusions are similar to the ones for the logistic regression models: underdamped dynamics and score networks yield gains in performance, and LDVI, which combines both improvements, performs better than all other methods.
Brownian motion  Lorenz system  

ULA  MCD  UHA  LDVI  ULA  MCD  UHA  LDVI  
Random effect regression (seeds)  
ULA  MCD  UHA  LDVI  
Random effect regression
Table 3 shows results on a random effect regression model with the seeds dataset crowder1978beta () (model details in appendix A). The same conclusions hold, both UHA and MCD perform better than ULA, and LDVI performs better than all other methods.
5.2 Effect of numerical simulation scheme
All the methods studied so far are based on the simulation scheme introduced in section 4. We note that other simulations methods, which yield different transitions, could be used. We are interested in studying how the simulation scheme used affects methods’ performance.
We consider the forward and backward transitions shown in algorithms 5, LABEL: and 4, obtained by simulating the SDEs using the EulerMaruyama scheme as explained in appendix E (where we also give an expression for the ratio of the transitions). We propose two methods using these transitions: One that uses a full score network , and other one that uses no score network (i.e. fixes ). Intuitively, these can be seen as variants of LDVI and UHA that use this new simulation scheme, so we term them LDVI and UHA.^{1}^{1}1Using this simulation it is not directly clear how to take the highfriction limit. Thus, it is unclear whether they can be used to recover methods analogous to ULA and MCD.
Table 4 shows results for the Brownian motion model and the logistic regression model with the sonar dataset (results for all models in appendix E). It can be observed that each of LDVI and UHA performs worse than its counterpart using the simulation scheme from section 4, LDVI and UHA. Interestingly, for the Brownian motion model, UHA is also outperformed by ULA. This sheds light on the importance of the simulation scheme used: for some models, the benefits obtained by using underdamped dynamics (and possibly score networks) may be lost by using a poorly chosen simulation scheme.
Brownian motion  Logistic regression (sonar)  

UHA  UHA  LDVI  LDVI  UHA  UHA  LDVI  LDVI  
References
Appendix A Model details
a.1 Time series models
This models are presented following their descriptions in the Inference Gym documentation [inferencegym2020]. Both are obtained by discretizing an SDE and using a Gaussian observation model.
Brownian motion with Gaussian observation noise
The model is given by
The goal is to do inference over variables , and (), given the observations , for (i.e. the ten middle observations are missing).
Lorenz system
The model is given by
where (determined by the discretization stepsize used for the original SDE). The goal is to do inference over for , given observed values for .
a.2 Random effect regression
This model can be found in the MultiBUGS [goudie2020multibugs] documentation. It is essentially a random effects regression model, given by
The goal is to do inference over the variables and for , given observed values for , and . The data used was obtained from [crowder1978beta], which models the germination proportion of seeds arranged in a factorial layout by seed and type of root.
Appendix B Deriving the timereversed SDE for the underdamped Langevin process
Consider a diffusion process of the form
(14) 
where