# Hamiltonian Monte Carlo with Asymmetrical Momentum Distributions

Existing rigorous convergence guarantees for the Hamiltonian Monte Carlo (HMC) algorithm use Gaussian auxiliary momentum variables, which are crucially symmetrically distributed. We present a novel convergence analysis for HMC utilizing new analytic and probabilistic arguments. The convergence is rigorously established under significantly weaker conditions, which among others allow for general auxiliary distributions. In our framework, we show that plain HMC with asymmetrical momentum distributions breaks a key self-adjointness requirement. We propose a modified version that we call the Alternating Direction HMC (AD-HMC). Sufficient conditions are established under which AD-HMC exhibits geometric convergence in Wasserstein distance. Numerical experiments suggest that AD-HMC can show improved performance over HMC with Gaussian auxiliaries.

## Authors

• 8 publications
• 6 publications
• 3 publications
09/11/2019

### Exploring Discrete Analogs of Hamiltonian Monte Carlo

Hamiltonian Monte Carlo (HMC) has emerged as a powerful way to sample fr...
11/17/2020

04/15/2021

### Hamiltonian zigzag sampler got more momentum than its Markovian counterpart: Equivalence of two zigzags under a momentum refreshment limit

Zigzag and other piecewise deterministic Markov process samplers have at...
02/04/2021

### HMC, an Algorithms in Data Mining, the Functional Analysis approach

The main purpose of this paper is to facilitate the communication betwee...
10/17/2017

### Convergence Rate of Riemannian Hamiltonian Monte Carlo and Faster Polytope Volume Computation

We give the first rigorous proof of the convergence of Riemannian Hamilt...
10/11/2018

### Stochastic Approximation Hamiltonian Monte Carlo

Recently, the Hamilton Monte Carlo (HMC) has become widespread as one of...
##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction

Hamiltonian Monte Carlo (HMC) belongs to the wider class of Markov Chain Monte Carlo (MCMC) algorithms

(Hastings, 1970; Gelfand and Smith, 1990)

that approximate the difficult-to-compute density of a target probability measure by running a Markov chain whose invariant measure coincides with the target distribution. Let

denote the target density of interest. Define its support to be on a metric space , where we will assume is, typically, a Euclidean space for some positive integer and is a metric in .

The is of form , where is easily queried, but to completely characterize we need to know the normalizing constant

, for example to compute expectations. This is a core problem in modern statistics, and forms a fundamental operation in many applications in machine/deep learning. For instance, in frequentist statistics, it relates to estimating coverage of a statistic using likelihood over data space. In Bayesian statistics, it can be generically stated as inferring the posterior distribution

of (unobservable) latent variables given observations from a user-modeled joint density . The inference target is , where the exact marginal is usually hard to compute directly.

This motivates the wide use of MCMC in applications such as statistical inference  (Robert and Casella, 2004), inverse problems  (Stuart, 2010)

(Andrieu et al., 2003) and molecular dynamics  (Lelievre et al., 2010). The early success of the MCMC approach can be attributed to simple, elegant and provably convergent algorithms such as the Metropolis-Hastings method (Hastings, 1970), in which a user chosen auxiliary distribution and an acceptance/rejection mechanism helps pick and accept candidate samples as being from

. But MCMC algorithms suffer from slow convergence in high dimensional statistical computing in fields like molecular dynamics and artificial intelligence.

Sample generation in the standard HMC (Algorithm 1) is driven by a dynamical system. Write the energy of the target distribution as . The user chooses an auxiliary set of momentum variables in space , which is also . The momentum is sampled from a probability measure with density function , referred to as the momentum or auxiliary distribution with its kinetic energy. In each iteration, the HMC Algorithm 1 spreads (or lifts) the current iterate , where is sampled independently from . Hamiltonian dynamics with energy are used to define the (deterministic) transformation with a pre-selected time . It preserves , the total energy or Hamiltonian of the system. The variables are then dropped by projecting to obtain next iterate. The terminology of and being the position and momentum variables, and corresponding to the energy arose in physics applications, where HMC first saw wide-spread use (Duane et al., 1987).

The power of HMC stems from the energy-preserving property when obtaining the candidate solutions even for large Hamiltonian moves, since this also preserves the joint density of

and hence (if the motion is exactly implemented) no new samples are rejected. This leads to its relatively higher effectiveness compared to classical MCMC even in the high dimensional settings used in deep learning with neural networks

(Neal, 1993; Gelman et al., 2013; Jasche and Kitaura, 2010; Betancourt et al., 2017). With the rise of high-performance software implementations such as Stan (Carpenter et al., 2017; Team, 2017), the method has now become a pervasive tool across many scientific, medical, and industrial applications.

Role of Momentum Distribution. The kinetic energy is typically symmetric and quadratic: , and so the momentum

is distributed as a Gaussian distribution. A good choice for

is the Hessian of .  Betancourt et al. (2017) note that if is Gaussian, then the choice of as the inverse of the covariance of under

, leads to the optimal de-skewing and regularization of contours of the Hamiltonian motion. Consequently, this prevents the uncontrolled build up of error in practical discretized approximations of Hamiltonian dynamics implemented on a computer. When the

is non-Gaussian, existing convergence proofs of HMC do not allow the momentum distribution to be asymmetric. As a workaround, the Riemannian-Gaussian (Girolami and Calderhead, 2011) kinetic energy is used, where the the Hessian of . So, the kinetic energy is updated dynamically by continually estimating at each the “local” Hessian directly (or approximating covariance of recent samples) to locally regularize the Hamiltonian motion curvature.

Limitations of Prior Results. A rich literature has emerged on the important theoretical problem of when and how fast the HMC algorithm with (conditional) Gaussian converges. Of particular interest has been the conditions under which geometric convergence can be ensured, which has been achieved through different approaches such as comparison theorems for differential equations (Chen and Vempala, 2019), Harris recurrence techniques (Bou-Rabee and Sanz-Serna, 2017) and coupling (Bou-Rabee et al., 2020). Some of the various conditions identified for geometric ergodicity of HMC are not easy to verify (e.g.  (Bou-Rabee and Sanz-Serna, 2017)), some  (Durmus et al., 2017; Mangoubi and Smith, 2019; Livingstone et al., 2019) cannot lead to explicit expression of the convergence rate on the HMC parameters, while others  (Bou-Rabee et al., 2020) heavily depends on delicate tricks for Gaussian distributions, and require log-concave properties of the target density function.

The existing approaches crucially use the fact that under appropriate conditions on the Hamiltonian function, the Markov chains starting from any two fixed points in the state space would eventually result in iterated distributions that are (under some well-defined distributional metric) closer than the two origination points. Recent advances in the understanding of general Markov operators from the perspective of optimal transportation shed clearer light on this phenomenon, see e.g.  Ollivier (2009) and  Joulin and Ollivier (2010). In this approach, the geometric convergence in Wasserstein metric of the Markov chain is quantified by the coarse Ricci curvature, a geometric concept that provides a second order characterization of movements in the probability measure space. Importantly, the curvature approach lets us produce quantitative characterization of the Markov chain through detailed calculations of the trajectories in the Hamiltonian system.

Contributions. The main thrust of our paper is in the presentation of a new set of analysis techniques for HMC style algorithms. In contrast to most previous work that treats the convergence of iterates of Algorithm 1, we analyse the convergence of the corresponding induced densities on a properly defined function space. This functional convergence approach allows us, in Sec. 2, to identify key properties of Hamiltonian motions that are sufficient for convergence. Seen in this light, the standard HMC operator (defined in Sec. 3) with an asymmetric momentum distribution is not self-adjoint, a characteristic that is essential to the proof of convergence. The modified Alternating Direction HMC (AD-HMC) Algorithm 2 rectifies this by applying the HMC operator and its adjoint in alternating steps, amounting to taking Hamiltonian motion in forward and backward directions. Theorem 1 shows that AD-HMC converges strongly to in the functional space. This significantly expands the class of HMC algorithms for which convergence is rigorously established, among others by dropping the symmetry restriction on momentum distributions. The proofs provide a simple and intuitive understanding of the working of HMC and illustrate why its iterates converge. Moreover, our observations on the functional and probabilistic structures of the algorithms lead to a significantly shortened presentation compared to previous work in the literature.

In Section 4.1, we establish that geometric convergence holds. In Theorem 2 the AD-HMC method with asymmetric enjoys this rate under standard log-concavity conditions that are imposed on target and auxiliary . The analysis technique here switches to using probabilistic arguments over measure spaces endowed with the Wasserstein metric. This shift allows us in Theorem 3 of Section 4.2 to generalize our analysis by relaxing the global conditions to only requiring that log-concavity hold outside a “small” set (defined in (3)). While we do not necessarily provide the tightest convergence rate estimate in comparison to existing results, our results focus on convergence in Wasserstein metric, which is not only a weaker notion hence covering more applications in practice but can also provide more robust bound for high dimensional problems as discovered in recent literature; see e.g.  Hairer et al. (2011) and  Durmus and Moulines (2015). The Hamiltonian dynamics in Algorithm 2 are implemented by discrete numerical approximations such as the leapfrog integrator (Leimkuhler and Reich, 2004); Section 4.3 proves in Theorem 4 that such practical discretization schemes also converge geometrically if the error it produces can be uniformly bounded.

Section 5 describes our initial numerical experiments that suggest the potential benefit of considering asymmetrical momentum distributions. As discussed earlier, a popular technique to speed up standard HMC methods is to dynamically update the covariance matrix of the auxiliary distribution with either the directly computed Hessian or its approximation from the covariance of samples (see ‘adapt-single

heuristic in Sec.

5). We propose an alternate adaptive scheme, ‘adapt-many’, that utilises the AD-HMC algorithm to dynamically construct as a general mixture of Gaussians with components that provide adequate coverage of the target. In a three-dimensional simulated dataset example where the target

is chosen to be full dimensional and significantly asymmetric with wide variance in

, we demonstrate that the AD-HMC based heuristic significantly outperforms standard HMC with adaptively learnt Gaussian distributions. On a public domain Bayesian logistic regression dataset, the AD-HMC ‘

The ‘adapt-many’ heuristic and its examples provided in Section 5 motivate the use of general auxiliary distributions and display the great potential of AD-HMC. An exhaustive study of the practical, versatile uses of AD-HMC is beyond the scope of this paper. We can be certain that additional applications for asymmetric auxiliaries will become apparent as the literature delves into AD-HMC further, and we intend to further pursue deeper exhaustive investigations along these lines.

## 2 Preliminaries

As noted, the HMC iterates are formally defined over a state space , and we assume that the initial state is sampled from an arbitrary distribution that is absolutely continuous with respect to a reference measure on , which in most of the cases will be the Lebesgue measure on a Euclidean space.

Each step of HMC applies a Hamiltonian motion to a joint sample , which is the evaluation of the solution to the following system of differential equations at a fixed time ,

 ˙Q(t)=∂H∂p,˙P(t)=−∂H∂q, (Q(0),P(0))=(q,p). (1)

The solution represents the position and momentum in the Hamiltonian conservation system for time . Since the parameter will remain fixed in this paper, we will omit it when there is no confusion. Furthermore, denote as the map . The measurable motion is assumed to be invertible, which corresponds to being able to go back in time and have two invariance properties: (1) (conservation of the Hamiltonian energy) and (2) for any integrable function we have (conservation of Lebesgue measure by the Hamiltonian motion). Additionally we assume that the motion is irreducible (ergodic), which means there are no nontrivial measurable invariant sets. For simplicity we assume a stronger coverage property: (3) for (almost) every . This compactly represents (using a slight abuse of notation ) the property that every point in can be reached from (almost) any with a lift by an appropriate for the motion in one step.

The evolution of iterates

is naturally modelled as a Markov chain. Next step depends only on the previous one and an independent and randomly generated momentum vector

. Hence, we can equivalently study the evolution of the density functions of the Markov chain. Our HMC analysis combines a dynamical system on a functional space with methods in probability theory.

#### HMC as a Markov chain

Modeling the evolution of Algorithm 1 as a Markov chain defined on the space , its transition probability can be defined as follows. Given the fixed parameter and initial position , the map , with being the usual projection from to , is an onto map. Hence, for any given probability measure on (with density ), induces a push forward , a probability measure on defined as for any measurable set , with denoting its pre-image under . This push forward provides the transition probability for our Markov chain, .

#### HMC as a Dynamical System

From an analytical point of view, the density function of a probability measure, well-defined when the measure is absolutely continuous with respect to the reference (Lebesgue) measure, can also be viewed as a member in a proper functional space. First, let the target measure on the space (which we can assume to be its support) be expressed as a density with respect to the reference measure on . The abstract (or ideal) HMC uses an auxiliary measure with a density on the space with respect to the reference measure there. A step of HMC is a realization of an operator acting on the densities belonging to where the integration is with respect to the reference measure . To simplify notation we shall skip the subscript and write and . The operator

first constructs the joint distribution

on (in the lifting step), then performs the Hamiltonian motion in the product space producing a joint distribution and finally projects this transported density along the direction of on its marginal on space . More specifically,

 Th(q)=∫P(h⋅g)∘R(q,p)dp=∫Ph(Q)g(P)dp. (2)

The progression of HMC algorithm applied to an initial density can be expressed as the sequence of iterations of the operator , where .

### 2.1 Notions of Convergence

With the Markov chain and dynamical system defined above, various notions of convergence need to be discussed for understanding the main results of the paper. For a Markov chain, it is well known that, under proper assumptions, there exists a probability measure that is invariant under the push forward , hence named invariant measure. Furthermore, in typical probabilistic literature, , where for some initial probability , and denotes the total variational distance between two measures, with the supremum taken over all the sets in the -field where the probability measures are defined. In terms of the HMC notation, the measures and can be viewed to have densities and .

The total variational distance has been extensively studied in statistics and probability, and has close connections to other distance metrics of probability measures such as the Kullback-Leibler divergence. See Definition

1 of the family of metrics  (Joulin and Ollivier, 2010) of closeness of probability measures. It is however beneficial and sometimes more convenient to consider convergence under the Wasserstein distance.

###### Definition 1.

For any two measures and on metric space , the Wasserstein distance for is defined as,

 Wp(μ,ν)=infγ∈Γ(μ.ν)[∫Q×Qdp(x,y)γ(dx,dy)]1p, (3)

where denotes the set of measures on that project on and .

A closely related concept is that of the coarse Ricci curvature of a Markov operator, developed in Ollivier (2009) and  Joulin and Ollivier (2010). Ricci curvature of a Riemannian manifold is interpreted as a measure of deformation along parallel transformations.

###### Definition 2.

On the space , with the Markov operator , for each pair , the coarse Ricci curvature in the direction of , is defined as,

 κ(q1,q2):=1−W1(μ(q1),μ(q2))d(q1,q2),

where

denotes the probability distribution of the Markov chain governed by

with initial state .

Denote . It is demonstrated in Ollivier (2009) that implies that the corresponding Markov recursion converges geometrically.

From the analytical viewpoint, the convergence of the HMC is studied in terms of convergence of density functions in a proper functional space. In particular, the space defined previously with respect to the reciprocal of the density function has been found to be very useful in establishing the convergence for a variety of iterative operators, see e.g. Markowich and Villani (2000). The space has a natural scalar product . Hence, the (strong) convergence in for a sequence of functions to a function is defined by , and the weak convergence is defined as , for any . In Chen (2000), it is shown that geometric convergence is equivalent to geometric ergodicity, which is a well studied property in the Markov chain literature.

## 3 The Alternating Direction HMC

In the space , the adjoint operator to is characterized by for any . Note that by the invariance properties described after the definition of the Hamiltonian (1), . A self-adjoint operator satisfies , and for Hamiltonian motion a sufficient condition for self-adjointness is that the auxiliary distribution is symmetric . Self-adjointness is crucial to the convergence result in Theorem 1 below.

In case the operator is not self-adjoint (for example when is asymmetrical) we use a method with in place of . Algorithm 2 presents the proposed Alternating Direction Hamiltonian Monte Carlo (AD-HMC) method. It is a modification of the standard HMC Algorithm 1 where forward motion of size is followed by backward motion (with fresh momentum samples) of size , where signs of the motion equations (1) are reversed. The AD-HMC operator has both invariance properties and is self-adjoint.

###### Theorem 1.

Let the motion that defines the operator have invariance and coverage properties. Then for any the sequence of alternating iterations converges strongly to the fix point , where . If additionally is self-adjoint itself then converges strongly to .

For the proof of Theorem 1, detailed in Sec. A.1, we observe that is in fact an averaging map (Lemma 3), thus by the convexity of the norm decreases under : (Lemma 4), sharply unless (by coverage assumption) . In the space , bounded sequences have weak accumulation points. Using self-adjointness we prove (Proposition 2) that each accumulation point of the sequence of iterations has the same norm and thus (Corollary 3) must be of form , where the value is deduced from integral invariance (Lemma 5). Hence the whole sequence converges. Meanwhile the proof of the convergence of the norms to the norm of the limit provides (Proposition 3) strong convergence.

## 4 Geometric Convergence in W1

### 4.1 Global Condition on the Hessian

Recall that by the Ricci curvature arguments, the geometric convergence in the Wasserstein distance is established if the quantity can be shown to be contracting comparing to the initial positions of the Hamiltonian motion. The precise calculation of requires solving the well-known Monge-Ampére equation, a nonlinear elliptic equation whose solutions are hard to obtain, see e.g. Villani (2008). Here, we identify the conditions under which an upper bound can be derived to this important quantity, and this then leads to a lower bound to curvature , which is still a positive number, hence the geometric convergence of the Markov chain to its stationary distribution. While there exist several similar results for the geometric convergence of HMC, we believe that our proof provided here is among the simplest, and it provides important insights on the related dynamics that of independent interest.

###### Theorem 2.

If , for some , where denotes the operator norm for matrices, then the Markov recursion converges geometrically in the with a rate being at least .

Proof of the theorem can be found in Sec. A.2. The bound is on the sensitivity of the Hamiltonian motion output on the starting point , and applies for any momentum . We will show below that this bound can be obtained if additional conditions are imposed on the form of the densities and .

As before, let have density , and with Hamiltonian form , denote and . Define averages and where are the solutions of (1). When and are symmetric and positive definite, denote and . These roots exist because a unique symmetric positive square root exists for any symmetric positive operator, and a composition of two positive symmetric operators is also positive and has a well defined positive square root. Lemma 6 in the supplement Sec. A.3 studies the general evolution of the four inital-configuration dependence terms including the . Proposition 1 below provides a solution of the evolution equations.

###### Proposition 1.

Assume that both the target and the auxiliary distributions are strictly log-concave over their domains. Then the solution of the evolution equation (6) (see Lemma 6 in Sec. A.3) is:

 ⎛⎜⎝∂Q∂q∂Q∂p∂P∂q∂P∂p⎞⎟⎠(t)=(cos(tA)t¯Vsinc(tB)−t¯Usinc(tA)cos(tB)).

By logarithmic concavity and by continuity for each and both and are positive symmetric operators, and so are their time averages and . The functions and used above are well defined by their power series, which are absolutely convergent for all bounded operators, the latter even when is not well defined.

###### Proof.

The expression follows from standard calculations (see Lemma 7) from the general evolution Lemma 6 in Sec. A.3. Specifically, strict logarithmic concavity on the entire domain implies that the Hessian of can be bounded (for example in terms of their spectra) away from and uniformly. Then and are uniformly bounded, and so are and . ∎

The standard Gaussian satisfies this strict log-concavity assumption, but note that no symmetry requirement is imposed on . From the solution, observe that with a well chosen positive but small such that , we have .

###### Corollary 1.

Suppose both the target and auxiliary are strictly log-concave, and the AD-HMC algorithm implements exact Hamiltonian dynamics for such that with . Then the algorithm converges geometrically to the target .

### 4.2 Relaxed Condition on the Hessian

In this section, we will relax the uniform strictly logarithmic concave conditions over the entire space. Instead, we only require that it holds outside a compact set. To facilitate the analysis, we assume the compact set to be, , in the metric space.

###### Theorem 3.

If in , and is bounded within , for some real number , then there exists an such that the Markov recursion converges geometrically in Wasserstein metric with a rate at least .

Theorem 3 thus relaxes the strict log-concavity on , and in particular covers any auxiliary (symmetric or not) distribution that has a corresponding non-empty . The proof of Theorem 3, details in Sec. A.6, relies on the following definition and two lemmas.

###### Definition 3.

A set in the state space is called small (or -small) if there exists and , and a probability measure on the state space such that for any ,

is a small set.

###### Lemma 2.

Under the conditions of Theorem 3, for , and for all ,

 W1(μ(q1),μ(q2))≤1+β2d(q1,q2). (4)

Proof of Lemmas 1 and 2 can be found in Secs. A.4 and A.5 respectively. Similar to Corollary 1, we have the following geometric convergence result.

###### Corollary 2.

Suppose both the target and auxiliary are strictly log-concave outside for a positive constant , and the AD-HMC algorithm implements exact Hamiltonian dynamics for such that with . Then the algorithm converges geometrically to the target .

Lemmata 1 and 2 are closely related to the important concepts of (pseudo-)small set and Lyapunov function of a Markov chain. In fact, under the same conditions, (4) can be viewed as a bivariate drift condition for for some function , taking the form of the distance function, then, the results in probability theory, see, e.g.  Rosenthal (2002), implies geometric convergence in total variational distance via a coupling method that inspired some of our arguments here, see, e.g.  Rosenthal (2002). For more detailed information on (pseudo-)small set and Lyapunov function techniques, please see systematic treatments in  Roberts and Rosenthal (2001),  Roberts and Rosenthal (2004) and  Meyn et al. (2009).

### 4.3 HMC with Numerical Integrators

In practice, HMC is implemented with numerical solutions to the Hamiltonian system of equations (1), and coupled with the well-known Metropolis-Hasting acceptance/rejection step to ensure the convergence to the target distribution. Various numerical methods are available for this purpose, for example, the symplectic leapfrog integrator (Verlet, 1967) can be commonly seen in many literature and software implementations, and versions of the Runge-Kutta method can achieve high order of approximation. For a detailed discussion on these methods and their analysis, see, e.g. Hairer et al. (2013). More specifically, for an integer , an approximation is of -order if its error is as the step size . Therefore, for any real number , a proper step size can be chosen, such that the numerical method (e.g. Runge-Kutta method for high order approximation) produces satisfying uniformly over all initial conditions that:

 d(Q(t),~Q(t))+d(P(t),~P(t))<δ. (5)
###### Theorem 4.

If outside for some , and bounded within , and condition (5) is satisfied, then the recursion generated by the numerical-approximation AD-HMC procedure converge geometrically in Wasserstein metric.

The proof of the theorem (in Sec. A.7) adapts the same techniques as those in the above section.

## 5 Numerical Experiments

In this section, we study the efficacy of using an asymmetric momentum distribution along with AD-HMC over using Gaussian distributions in standard HMC. Let define a Gaussian with mean and covariance matrix . We provide results for two sets of experiments for moderate dimensional targets . The first is a simulated model where we set the target distribution of our experiments to reside in , while in the second the HMC methods are applied to a Bayesian logistic regression application. Additional details and comparisons (e.g. poor performance of standard HMC with asymmetric distributions) are available in the supplementary material.

HMC methods. We include the basic HMC Algorithm (Alg. 1) with symmetric Gaussians as auxiliaries. The case uses to represent the standard approach. Recall that an adaptive procedure to set the to the most skewed covariance, equivalent to the highest local skewness in the Hessian of the potential , is conjectured to better regularize the resulting Hamiltonian dynamics. We call this dynamically varying symmetric Gaussian auxiliary case the ‘adapt-single’ scheme. After every iterations, the particles in iterate are run through the OPTICS (Ankerst et al., 1999)

clustering algorithm to classify them into distinct classes, if possible. Then, the

‘adapt-single’ case adopts the mean and covariance estimates of the cluster with most skewed as the replacement for the parameters of its Gaussian auxiliary distribution.

We consider a similar adaptive heuristic where the auxiliary is allowed to be a Gaussian mixture. Formally, the ‘adapt-many’ heuristic starts with and at each iteration :

Apply labels to all the particle samples in the current iterate using the OPTICS clustering algorithm, which dynamically determines both and the assignment of labels; the samples in the class denote those that were not classified.

For each , let represent the number of samples in the class, and set . Estimate the sample mean and covariance matrix and set .

Set the auxiliary distribution w.p. .

Continue the AD-HMC algorithm with for another iterations.

The AD-HMC algorithm admits each auxiliary of the form constructed by ‘adapt-many’ and the method converges for any single such choice. While our analysis does not cover this dynamically adapting version of the auxiliary, we are actively pursuing a theory for the convergence of adaptive AD-HMC algorithms.

Simulation Model in . The target constitutes as a mixture of uncorrelated gaussian densities with the means

arranged so that they lie along three linearly independent lines, and standard deviations in the range

. For this model, we add the ‘’ HMC case, which sets the auxiliary’s variance to the lowest (known) value, . As a reference, we also study the convergence of AD-HMC with a mixture-of-Gaussians (call it the ‘simple-target’ case) that is constructed from foreknowledge of the target as its simplification that retains six of the twelve Gaussians. The supplement provides further details on this study.

The top panel in Figure 1 plots the sample paths and averages of the three HMC and two AD-HMC cases of the Wasserstein distance between the (collection of particle samples representing the) iterate distributions and the target distribution, as approximated using the Sinkhorn method Feydy et al. (2019). The top figure plots this over the iteration count, and is seen to establish the geometric convergence of the methods. A clear ordering is observed in both figures between the HMC and AD-HMC cases, with the AD-HMC cases getting closer to the target than HMC at termination. Within the HMC cases, knowledge of the lowest variance does speed convergence as anticipated. The adaptive scheme ‘adaptive-single’ performs even better than the perfect-knowledge case ‘’, which may be because of the very skewed covariance estimates encountered in the earlier iterations. The perfect-knowledge asymmetric case ‘simple-target’ narrowly beats the ‘adaptive-single’ scheme in terms of iteration count. Finally, the asymmetric adaptive Gaussian mixture case ‘adaptive-many’ attains the best performance overall in both figures, and indeed continues to improve while the other algorithms have slowed down.

Figure 1’s bottom panel provides the average CPU wallclock time per iteration for each method. The three HMC methods take the same order of effort, while the full-knowledge asymmetric distribution ’simple-target’ takes the longest time on average since it consists of a mixture of six of the twelve Gaussians that constitute the target. The ’adapt-many’ case takes increasing time over the iterations, reflecting the increasing complexity of the auxiliary; indeed, on average it includes five Gaussians towards the end.

Bayesian logistic regression (BLR). We study a public domain regression dataset (Huq and Cleland, 1990) that seeks to explain contraceptive use by women based on their age, number of children and urban/rural residence. The target Bayesian posterior density over is given by , where is the prior distribution and represents the likelihood of observed data given regression coefficient . The supplement provides full details on this model. Figure 2 plots the sample paths and averages of the two HMC methods ‘’ and ‘adaptive-single’, and the AD-HMC case ‘adaptive-many’. Since the exact target is unavailable, we measure their performance using the Kuhlback-Leibler divergence from the posterior, as is the standard in variational inference algorithms. The two adaptive schemes produce markedly faster convergence over standard HMC, indeed seeming to have converged, while the ‘’ slowly approaches the solution. The iterations tend to concentrate the posterior’s mass around a mean regression parameter value as expected for logistic regression, and this uni-modal nature of the target limits the advantage of ‘adaptive-many’ over ‘adaptive-single’.

Parameter Settings. All algorithms were implemented in python 3.7 and ran on a server with two AMD EPYC 7301 -core processors and 64Gb system memory and two GeForce RTX2070 8Gb GPUs. The iterate distribution is represented by empirical distributions with samples. AD-HMC sample paths are run for and iterations respectively for the simulation and BLR model, and for a fair comparison the HMC paths are allowed double the number of iterations. The adaptive methods set , and the averages are developed from sample paths. The Hamiltonian motion dynamics in both algorithms are approximated by the leapfrog symplectic numerical integration procedure (Verlet, 1967); see Supplement section B. Each iteration takes discrete first-order approximation steps of size to implement Hamiltonian motion of length , with of in simulated dataset and for the BLR model. The OPTICS clustering scheme is run with its default parameters in the Scipy library (Pedregosa et al., 2011). Note that the clustering step limits the use of the adaptive schemes to only moderate dimensional settings.

Summary. The numerical evidence shows that the AD-HMC based ‘adapt-many’, a straightforward heuristic to model the auxiliary distribution on key characteristics of the target distribution, can notably speed up estimation of the target distribution over plain HMC. This shows a promising avenue for further exploration both in practical adaptive auxiliary design algorithms and in rigorously understanding when they may converge faster.

## Societal Impact

This is a theoretical contribution that, nevertheless, has the potential of impacting a wide range of application domains in business, engineering and science. In particular, all of those in which Hamiltonian Monte Carlo methods have been used as a statistical inference and estimation tool (e.g. image analysis and computer vision, signal processing, operations research, and so on). Because our paper provides a step towards generalizing and relaxing the conditions needed on the method’s parameters in order to establish statistical rates of convergence, we believe that we have the potential of speeding up and thus enabling more applications.

## References

• Andrieu et al. (2003) C. Andrieu, N. D. Freitas, A. Doucet, and M. I. Jordan.

An introduction to mcmc for machine learning. machine learning.

Machine Learning, 50:5––43, 2003.
• Ankerst et al. (1999) M. Ankerst, M. M. Breunig, H.-P. Kriegel, and J. Sander. Optics: Ordering points to identify the clustering structure. SIGMOD Rec., 28(2):49–60, June 1999. ISSN 0163-5808. doi: 10.1145/304181.304187. URL https://doi.org/10.1145/304181.304187.
• Betancourt et al. (2017) M. Betancourt, S. Byrne, S. Livingstone, and M. Girolami. The geometric foundations of hamiltonian monte carlo. Bernoulli, 23(4A):2257–2298, 11 2017. doi: 10.3150/16-BEJ810. URL https://doi.org/10.3150/16-BEJ810.
• Bogachev (2007) V. Bogachev. Measure Theory. Number v. 1 in Measure Theory. Springer Berlin Heidelberg, 2007. ISBN 9783540345145. URL https://books.google.com/books?id=CoSIe7h5mTsC.
• Bou-Rabee and Sanz-Serna (2017) N. Bou-Rabee and J. M. Sanz-Serna. Randomized hamiltonian monte carlo. Ann. Appl. Probab., 27(4):2159–2194, 08 2017. doi: 10.1214/16-AAP1255. URL https://doi.org/10.1214/16-AAP1255.
• Bou-Rabee et al. (2020) N. Bou-Rabee, A. Eberle, and R. Zimmer. Coupling and convergence for hamiltonian monte carlo. Ann. Appl. Probab., 30(3):1209–1250, 06 2020. doi: 10.1214/19-AAP1528. URL https://doi.org/10.1214/19-AAP1528.
• Carpenter et al. (2017) B. Carpenter, A. Gelman, M. Hoffman, D. Lee, B. Goodrich, M. Betancourt, M. Brubaker, J. Guo, P. Li, and A. Riddell. Stan: A probabilistic programming language. Journal of Statistical Software, Articles, 76(1):1–32, 2017. ISSN 1548-7660. doi: 10.18637/jss.v076.i01. URL https://www.jstatsoft.org/v076/i01.
• Chen (2000) M.-F. Chen. Equivalence of exponential ergodicity and l2-exponential convergence for markov chains. Stochastic Processes and their Applications, 87(2):281 – 297, 2000. ISSN 0304-4149. doi: https://doi.org/10.1016/S0304-4149(99)00114-3. URL http://www.sciencedirect.com/science/article/pii/S0304414999001143.
• Chen and Vempala (2019) Z. Chen and S. S. Vempala. Optimal convergence rate of hamiltonian monte carlo for strongly logconcave distributions. RANDOM, 2019.
• Duane et al. (1987) S. Duane, A. Kennedy, B. J. Pendleton, and D. Roweth. Hybrid monte carlo. Physics Letters B, 195(2):216 – 222, 1987. ISSN 0370-2693. doi: https://doi.org/10.1016/0370-2693(87)91197-X. URL http://www.sciencedirect.com/science/article/pii/037026938791197X.
• Durmus and Moulines (2015) A. Durmus and É. Moulines. Quantitative bounds of convergence for geometrically ergodic markov chain in the wasserstein distance with application to the metropolis adjusted langevin algorithm. Statistics and Computing, 25(1):5–19, 2015. doi: 10.1007/s11222-014-9511-z. URL https://doi.org/10.1007/s11222-014-9511-z.
• Durmus et al. (2017) A. Durmus, É. Moulines, and E. Saksman. On the convergence of hamiltonian monte carlo. arXiv: Computation, 2017.
• Feydy et al. (2019) J. Feydy, T. Séjourné, F.-X. Vialard, S.-i. Amari, A. Trouve, and G. Peyré. Interpolating between optimal transport and mmd using sinkhorn divergences. In The 22nd International Conference on Artificial Intelligence and Statistics, pages 2681–2690, 2019.
• Gelfand and Smith (1990) A. E. Gelfand and A. F. M. Smith. Sampling-based approaches to calculating marginal densities. Journal of the American Statistical Association, 85(410):398–409, 1990. doi: 10.1080/01621459.1990.10476213.
• Gelman et al. (2013) A. Gelman, J. B. Carlin, H. S. Stern, D. B. Dunson, A. Vehtari, and D. B. Rubin. Bayesian Data Analysis. Chapman and Hall/CRC, 2013.
• Girolami and Calderhead (2011) M. Girolami and B. Calderhead. Riemann manifold langevin and hamiltonian monte carlo methods. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 73(2):123–214, 2011. doi: 10.1111/j.1467-9868.2010.00765.x. URL https://rss.onlinelibrary.wiley.com/doi/abs/10.1111/j.1467-9868.2010.00765.x.
• Hairer et al. (2013) E. Hairer, C. Lubich, and G. Wanner.

Geometric Numerical Integration: Structure-Preserving Algorithms for Ordinary Differential Equations

.
Springer Series in Computational Mathematics. Springer Berlin Heidelberg, 2013. ISBN 9783662050187. URL https://books.google.com/books?id=cPTxCAAAQBAJ.
• Hairer et al. (2011) M. Hairer, J. C. Mattingly, and M. Scheutzow. Asymptotic coupling and a general form of harris’theorem with applications to stochastic delay equations. Probability Theory and Related Fields, 149(1):223–259, 2011. doi: 10.1007/s00440-009-0250-6. URL https://doi.org/10.1007/s00440-009-0250-6.
• Hastings (1970) W. K. Hastings. Monte carlo sampling methods using markov chains and their applications. Biometrika, 57(1):97–109, 1970. ISSN 00063444. URL http://www.jstor.org/stable/2334940.
• Huq and Cleland (1990) N. M. Huq and J. Cleland. Bangladesh fertility survey 1989 (main report). Technical report, Dhaka: National Institute of Population Research and Training, 1990.
• Jasche and Kitaura (2010) J. Jasche and F.-S. Kitaura. Fast hamiltonian sampling for large‐scale structure inference. Monthly Notices of the Royal Astronomical Society, 407:29 – 42, 09 2010. doi: 10.1111/j.1365-2966.2010.16897.x.
• Joulin and Ollivier (2010) A. Joulin and Y. Ollivier. Curvature, concentration and error estimates for markov chain monte carlo. Ann. Probab., 38(6):2418–2442, 11 2010. doi: 10.1214/10-AOP541. URL https://doi.org/10.1214/10-AOP541.
• Leimkuhler and Reich (2004) B. Leimkuhler and S. Reich. Simulating Hamiltonian Dynamics. Cambridge Monographs on Applied and Computational Mathematics. Cambridge University Press, 2004. ISBN 9780521772907. URL https://books.google.com/books?id=tpb-tnsZi5YC.
• Lelievre et al. (2010) T. Lelievre, M. Rousset, and G. Stoltz. Langevin dynamics with constraints and computation of free energy differences, 2010.
• Livingstone et al. (2019) S. Livingstone, M. Betancourt, S. Byrne, and M. Girolami. On the geometric ergodicity of hamiltonian monte carlo. Bernoulli, 25(4A):3109–3138, 11 2019. doi: 10.3150/18-BEJ1083. URL https://doi.org/10.3150/18-BEJ1083.
• Mangoubi and Smith (2019) O. Mangoubi and A. Smith. Rapid mixing of hamiltonian monte carlo on strongly log-concave distributions. Proceedings of Machine Learning Research, 89, 2019.
• Markowich and Villani (2000) P. A. Markowich and C. Villani. On the trend to equilibrium for the fokker-planck equation: An interplay between physics and functional analysis. Matematica Contemporanea (SBM), 19:1–31, 2000.
• Meyn et al. (2009) S. Meyn, R. Tweedie, and P. Glynn. Markov Chains and Stochastic Stability. Cambridge Mathematical Library. Cambridge University Press, 2009. ISBN 9780521731829. URL https://books.google.com/books?id=Md7RnYEPkJwC.
• Neal (1993) R. M. Neal. Bayesian learning via stochastic dynamics. In S. J. Hanson, J. D. Cowan, and C. L. Giles, editors, Advances in Neural Information Processing Systems 5, pages 475–482. Morgan-Kaufmann, 1993.
• Ollivier (2009) Y. Ollivier. Ricci curvature of markov chains on metric spaces. Journal of Functional Analysis, 256(3):810 – 864, 2009. ISSN 0022-1236. doi: https://doi.org/10.1016/j.jfa.2008.11.001. URL http://www.sciencedirect.com/science/article/pii/S002212360800493X.
• Pedregosa et al. (2011) F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.
• Robert and Casella (2004) C. Robert and G. Casella. Monte Carlo Statistical Methods. Springer, 2004.
• Roberts and Rosenthal (2001) G. O. Roberts and J. S. Rosenthal. Small and pseudo-small sets for markov chains. Stochastic Models, 17(2):121–145, 2001. doi: 10.1081/STM-100002060. URL https://doi.org/10.1081/STM-100002060.
• Roberts and Rosenthal (2004) G. O. Roberts and J. S. Rosenthal. General state space markov chains and mcmc algorithms. Probab. Surveys, 1:20–71, 2004. doi: 10.1214/154957804100000024. URL https://doi.org/10.1214/154957804100000024.
• Rosenthal (2002) J. Rosenthal. Quantitative convergence rates of markov chains: A simple account. Electron. Commun. Probab., 7:123–128, 2002. doi: 10.1214/ECP.v7-1054. URL https://doi.org/10.1214/ECP.v7-1054.
• Stuart (2010) A. M. Stuart. Inverse problems: A bayesian perspective. Acta Numerica, 19:451–559, 2010. doi: 10.1017/S0962492910000061.
• Talagrand (1996) M. Talagrand. Transportation cost for gaussian and other product measures. Geometric & Functional Analysis GAFA, 6(3):587–600, May 1996. ISSN 1420-8970. doi: 10.1007/BF02249265. URL https://doi.org/10.1007/BF02249265.
• Team (2017) S. D. Team. Stan modeling language users guide and reference manual, 2017. URL https://mc-stan.org/.
• Verlet (1967) L. Verlet. Computer "experiments" on classical fluids. i. thermodynamical properties of lennard-jones molecules. Phys. Rev., 159:98–103, Jul 1967. doi: 10.1103/PhysRev.159.98. URL https://link.aps.org/doi/10.1103/PhysRev.159.98.
• Villani (2008) C. Villani. Optimal Transport: Old and New. Grundlehren der mathematischen Wissenschaften. Springer Berlin Heidelberg, 2008. ISBN 9783540710509. URL https://books.google.com/books?id=hV8o5R7_5tkC.

## References

• Andrieu et al. (2003) C. Andrieu, N. D. Freitas, A. Doucet, and M. I. Jordan.

An introduction to mcmc for machine learning. machine learning.

Machine Learning, 50:5––43, 2003.
• Ankerst et al. (1999) M. Ankerst, M. M. Breunig, H.-P. Kriegel, and J. Sander. Optics: Ordering points to identify the clustering structure. SIGMOD Rec., 28(2):49–60, June 1999. ISSN 0163-5808. doi: 10.1145/304181.304187. URL https://doi.org/10.1145/304181.304187.
• Betancourt et al. (2017) M. Betancourt, S. Byrne, S. Livingstone, and M. Girolami. The geometric foundations of hamiltonian monte carlo. Bernoulli, 23(4A):2257–2298, 11 2017. doi: 10.3150/16-BEJ810. URL https://doi.org/10.3150/16-BEJ810.
• Bogachev (2007) V. Bogachev. Measure Theory. Number v. 1 in Measure Theory. Springer Berlin Heidelberg, 2007. ISBN 9783540345145. URL https://books.google.com/books?id=CoSIe7h5mTsC.
• Bou-Rabee and Sanz-Serna (2017) N. Bou-Rabee and J. M. Sanz-Serna. Randomized hamiltonian monte carlo. Ann. Appl. Probab., 27(4):2159–2194, 08 2017. doi: 10.1214/16-AAP1255. URL https://doi.org/10.1214/16-AAP1255.
• Bou-Rabee et al. (2020) N. Bou-Rabee, A. Eberle, and R. Zimmer. Coupling and convergence for hamiltonian monte carlo. Ann. Appl. Probab., 30(3):1209–1250, 06 2020. doi: 10.1214/19-AAP1528. URL https://doi.org/10.1214/19-AAP1528.
• Carpenter et al. (2017) B. Carpenter, A. Gelman, M. Hoffman, D. Lee, B. Goodrich, M. Betancourt, M. Brubaker, J. Guo, P. Li, and A. Riddell. Stan: A probabilistic programming language. Journal of Statistical Software, Articles, 76(1):1–32, 2017. ISSN 1548-7660. doi: 10.18637/jss.v076.i01. URL https://www.jstatsoft.org/v076/i01.
• Chen (2000) M.-F. Chen. Equivalence of exponential ergodicity and l2-exponential convergence for markov chains. Stochastic Processes and their Applications, 87(2):281 – 297, 2000. ISSN 0304-4149. doi: https://doi.org/10.1016/S0304-4149(99)00114-3. URL http://www.sciencedirect.com/science/article/pii/S0304414999001143.
• Chen and Vempala (2019) Z. Chen and S. S. Vempala. Optimal convergence rate of hamiltonian monte carlo for strongly logconcave distributions. RANDOM, 2019.
• Duane et al. (1987) S. Duane, A. Kennedy, B. J. Pendleton, and D. Roweth. Hybrid monte carlo. Physics Letters B, 195(2):216 – 222, 1987. ISSN 0370-2693. doi: https://doi.org/10.1016/0370-2693(87)91197-X. URL http://www.sciencedirect.com/science/article/pii/037026938791197X.
• Durmus and Moulines (2015) A. Durmus and É. Moulines. Quantitative bounds of convergence for geometrically ergodic markov chain in the wasserstein distance with application to the metropolis adjusted langevin algorithm. Statistics and Computing, 25(1):5–19, 2015. doi: 10.1007/s11222-014-9511-z. URL https://doi.org/10.1007/s11222-014-9511-z.
• Durmus et al. (2017) A. Durmus, É. Moulines, and E. Saksman. On the convergence of hamiltonian monte carlo. arXiv: Computation, 2017.
• Feydy et al. (2019) J. Feydy, T. Séjourné, F.-X. Vialard, S.-i. Amari, A. Trouve, and G. Peyré. Interpolating between optimal transport and mmd using sinkhorn divergences. In The 22nd International Conference on Artificial Intelligence and Statistics, pages 2681–2690, 2019.
• Gelfand and Smith (1990) A. E. Gelfand and A. F. M. Smith. Sampling-based approaches to calculating marginal densities. Journal of the American Statistical Association, 85(410):398–409, 1990. doi: 10.1080/01621459.1990.10476213.
• Gelman et al. (2013) A. Gelman, J. B. Carlin, H. S. Stern, D. B. Dunson, A. Vehtari, and D. B. Rubin. Bayesian Data Analysis. Chapman and Hall/CRC, 2013.
• Girolami and Calderhead (2011) M. Girolami and B. Calderhead. Riemann manifold langevin and hamiltonian monte carlo methods. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 73(2):123–214, 2011. doi: 10.1111/j.1467-9868.2010.00765.x. URL https://rss.onlinelibrary.wiley.com/doi/abs/10.1111/j.1467-9868.2010.00765.x.
• Hairer et al. (2013) E. Hairer, C. Lubich, and G. Wanner.

Geometric Numerical Integration: Structure-Preserving Algorithms for Ordinary Differential Equations

.
Springer Series in Computational Mathematics. Springer Berlin Heidelberg, 2013. ISBN 9783662050187. URL https://books.google.com/books?id=cPTxCAAAQBAJ.
• Hairer et al. (2011) M. Hairer, J. C. Mattingly, and M. Scheutzow. Asymptotic coupling and a general form of harris’theorem with applications to stochastic delay equations. Probability Theory and Related Fields, 149(1):223–259, 2011. doi: 10.1007/s00440-009-0250-6. URL https://doi.org/10.1007/s00440-009-0250-6.
• Hastings (1970) W. K. Hastings. Monte carlo sampling methods using markov chains and their applications. Biometrika, 57(1):97–109, 1970. ISSN 00063444. URL http://www.jstor.org/stable/2334940.
• Huq and Cleland (1990) N. M. Huq and J. Cleland. Bangladesh fertility survey 1989 (main report). Technical report, Dhaka: National Institute of Population Research and Training, 1990.
• Jasche and Kitaura (2010) J. Jasche and F.-S. Kitaura. Fast hamiltonian sampling for large‐scale structure inference. Monthly Notices of the Royal Astronomical Society, 407:29 – 42, 09 2010. doi: 10.1111/j.1365-2966.2010.16897.x.
• Joulin and Ollivier (2010) A. Joulin and Y. Ollivier. Curvature, concentration and error estimates for markov chain monte carlo. Ann. Probab., 38(6):2418–2442, 11 2010. doi: 10.1214/10-AOP541. URL https://doi.org/10.1214/10-AOP541.
• Leimkuhler and Reich (2004) B. Leimkuhler and S. Reich. Simulating Hamiltonian Dynamics. Cambridge Monographs on Applied and Computational Mathematics. Cambridge University Press, 2004. ISBN 9780521772907. URL https://books.google.com/books?id=tpb-tnsZi5YC.
• Lelievre et al. (2010) T. Lelievre, M. Rousset, and G. Stoltz. Langevin dynamics with constraints and computation of free energy differences, 2010.
• Livingstone et al. (2019) S. Livingstone, M. Betancourt, S. Byrne, and M. Girolami. On the geometric ergodicity of hamiltonian monte carlo. Bernoulli, 25(4A):3109–3138, 11 2019. doi: 10.3150/18-BEJ1083. URL https://doi.org/10.3150/18-BEJ1083.
• Mangoubi and Smith (2019) O. Mangoubi and A. Smith. Rapid mixing of hamiltonian monte carlo on strongly log-concave distributions. Proceedings of Machine Learning Research, 89, 2019.
• Markowich and Villani (2000) P. A. Markowich and C. Villani. On the trend to equilibrium for the fokker-planck equation: An interplay between physics and functional analysis. Matematica Contemporanea (SBM), 19:1–31, 2000.
• Meyn et al. (2009) S. Meyn, R. Tweedie, and P. Glynn. Markov Chains and Stochastic Stability. Cambridge Mathematical Library. Cambridge University Press, 2009. ISBN 9780521731829. URL https://books.google.com/books?id=Md7RnYEPkJwC.
• Neal (1993) R. M. Neal. Bayesian learning via stochastic dynamics. In S. J. Hanson, J. D. Cowan, and C. L. Giles, editors, Advances in Neural Information Processing Systems 5, pages 475–482. Morgan-Kaufmann, 1993.
• Ollivier (2009) Y. Ollivier. Ricci curvature of markov chains on metric spaces. Journal of Functional Analysis, 256(3):810 – 864, 2009. ISSN 0022-1236. doi: https://doi.org/10.1016/j.jfa.2008.11.001. URL http://www.sciencedirect.com/science/article/pii/S002212360800493X.
• Pedregosa et al. (2011) F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.
• Robert and Casella (2004) C. Robert and G. Casella. Monte Carlo Statistical Methods. Springer, 2004.
• Roberts and Rosenthal (2001) G. O. Roberts and J. S. Rosenthal. Small and pseudo-small sets for markov chains. Stochastic Models, 17(2):121–145, 2001. doi: 10.1081/STM-100002060. URL https://doi.org/10.1081/STM-100002060.
• Roberts and Rosenthal (2004) G. O. Roberts and J. S. Rosenthal. General state space markov chains and mcmc algorithms. Probab. Surveys, 1:20–71, 2004. doi: 10.1214/154957804100000024. URL https://doi.org/10.1214/154957804100000024.
• Rosenthal (2002) J. Rosenthal. Quantitative convergence rates of markov chains: A simple account. Electron. Commun. Probab., 7:123–128, 2002. doi: 10.1214/ECP.v7-1054. URL https://doi.org/10.1214/ECP.v7-1054.
• Stuart (2010) A. M. Stuart. Inverse problems: A bayesian perspective. Acta Numerica, 19:451–559, 2010. doi: 10.1017/S0962492910000061.
• Talagrand (1996) M. Talagrand. Transportation cost for gaussian and other product measures. Geometric & Functional Analysis GAFA, 6(3):587–600, May 1996. ISSN 1420-8970. doi: 10.1007/BF02249265. URL https://doi.org/10.1007/BF02249265.
• Team (2017) S. D. Team. Stan modeling language users guide and reference manual, 2017. URL https://mc-stan.org/.
• Verlet (1967) L. Verlet. Computer "experiments" on classical fluids. i. thermodynamical properties of lennard-jones molecules. Phys. Rev., 159:98–103, Jul 1967. doi: 10.1103/PhysRev.159.98. URL https://link.aps.org/doi/10.1103/PhysRev.159.98.
• Villani (2008) C. Villani. Optimal Transport: Old and New. Grundlehren der mathematischen Wissenschaften. Springer Berlin Heidelberg, 2008. ISBN 9783540710509. URL https://books.google.com/books?id=hV8o5R7_5tkC.

## Appendix A Proofs

### a.1 Proof of Theorem 1

We assume the operator is self-adjoint, and has invariance and coverage properties. All the arguments applies also to the operator which is self-adjoint as and inherits the invariance properties as well.

###### Lemma 3.

is an averaging operator, and is a fixed point.

###### Proof.

By the invariance property, . That shows that any proportional to is a fixed point of . ∎

###### Lemma 4.

The operator is well defined on the non-empty . It strictly decreases the norm unless in which case it is a fixed point of the operator.

###### Proof.

We have .

 ||Th||2 =∫Q(f∫P(h/f)∘R⋅g)2f≤∬Q×P(hf∘R)2⋅(g⋅f)=∬Q×P(hf∘R)2⋅(g⋅f)∘R =∬Q×P(hf)2⋅(g⋅f)=(∫Qh2f)⋅(∫Pg)=||h||2⋅1

The equality happens only if for -a.e.  we have , but for that must be constant with respect to -a.e. . The coverage assumption forces the constant to be the same for (almost surely) all . It follows that either with equal norms or otherwise the norm is strictly decreasing, i.e. . ∎

###### Lemma 5.

The operator conserves the integral .

###### Proof.

By the invariance property, . The last statement follows from the definition of the scalar product in space. ∎

Lemma 4 allows us to define . Clearly for any we have . Next, a crucial proposition is established utilizing self-adjointness.

###### Proposition 2.

Any weak accumulation point of the sequence satisfies .

###### Proof.

By the definition of weak convergence for a subsequence , we have for any . In particular , where we use Lemma 5. We can assume that given an we have and that the subsequence contains infinitely many even iterates , otherwise take some and use in place of . We now use self-adjointness: . That proves . The opposite inequality is standard . ∎

###### Corollary 3.

The sequence converges weakly to , with .

###### Proof.

If then and both have the same norm . But it is possible only if , and . Hence all weakly convergent subsequences of have the same limit, and as in every bounded sequence has a weakly converging subsequence, the whole sequence converges weakly to .

###### Proposition 3.

The sequence converges strongly:

 ∥∥Tnh−∫Qh∫Qf⋅f∥∥→0.
###### Proof.

By Corollary 3, . The strong convergence follows from the weak convergence and the convergence of norms to the norm of the limit . ∎

### a.2 Proof of Theorem 2

###### Proof.

Recall that to bound away from zero, we will need to provide a uniform ( and ) upper bound to the quantity for any pair of and . This upper bound will be achieved by two relaxations. First, we identify one member in the set of joint distributions . By definition 1 , the integration, , naturally provides an upper bound to . The second relaxation is on the calculation of the function within the integration .

The basic idea of the first relaxation is the same as that of the common random number generator in simulation literature. M. Talagrand Talagrand (1996), who attributed the idea to M. Frechet, used it to prove some version of the logarithmic Sobolev inequalities, which is closely related to the geometric convergence of Markov chains.

Suppose that is the selected auxiliary distribution on . Define a map with , for any given . thus induced a measure on , ( can be viewed as ). More specifically, for any subset , the measure has the following representation,

 γμ(A×B)=∫~A×Bg(p)dp,

with . it can be easily verified that . Therefore,

 W1(μ(q1),μ(q2))≤∫∫d(x,y)γμ(dx,dy).

Naturally, it can be seen that we are coupling the two motions with the same momentum generated from the common distribution on .

For any subset , the measure is defined as

 Γμ(A×B)=∫~A×Bg(p)dp,

with . From the property of integrability for push-forward measure, see, e.g. Bogachev (2007), we know that,

 ∫∫d(x,y)Γμ(dx,dy)=∫d(Q(q1,p),Q(q2,p))g(p)dp

For the second relaxation, the quantity in the above integration can be further upper bounded by the length of one specific curve that connects and . More specifically, for any , let . Thus, . Hence,

 d(Q(q1,p),Q(q2,p))≤∫10√|˙η(t)|2dt≤∣∣∂Q∂q∣∣|q2−q1|.

### a.3 Evolution of the dependence on the initial configuration

###### Lemma 6 (Evolution of the dependence on the initial configuration).

When the Hamiltonian is given by , the derivative of the motion with respect to the starting configuration satisfy the following time evolution equation:

 ∂∂t⎛⎜⎝∂Q∂q∂Q∂p∂P∂q∂P∂p⎞⎟⎠=(0V′′−U′′0)⋅⎛⎜⎝∂Q∂q∂Q∂p∂P∂q∂