# HMC: avoiding rejections by not using leapfrog and some results on the acceptance rate

We give numerical evidence that the standard leapfrog algorithm may not be the best integrator to use within the Hamiltonian Monte Carlo (HMC) method and its variants. When the dimensionality of the target distribution is high, the number of accepted proposals obtained with a given computational effort may be multiplied by a factor of three or more by switching to alternative integrators very similar to leapfrog. Such integrators have appeared in the numerical analysis and molecular dynamics literature rather than in the statistics literature. In addition, we provide some theoretical results on the acceptance rate of HMC.

## 1 Introduction

In this paper we provide numerical evidence that the standard leapfrog algorithm may not be the best integrator to use within the Hamiltonian Monte Carlo (HMC) method and its variants. If the dimensionality of the target distribution is high, the number of accepted proposals obtained with a given computational effort may be multiplied by a factor of three or more by switching to alternative, more accurate integrators very similar to leapfrog. Such integrators have appeared in the numerical analysis and molecular dynamics literature rather than in the statistics literature. In addition, we provide some theoretical results on the acceptance rate of HMC.

HMC, originally suggested by Duane et al. (1987), is a popular Markov chain Monte Carlo (MCMC) algorithm where proposals are obtained by numerically integrating a Hamiltonian system of differential equations (Neal, 2011; Sanz-Serna, 2014). This integration gives HMC its main potential advantage: the possibility of proposals that are far away from the current state of the chain and, at the same time, may be accepted with high probability. At present the leapfrog/Störmer/Verlet scheme is almost universally used to integrate the Hamiltonian dynamics and obtain a proposal that may be accepted or rejected. While rejections increase the correlation in any MCMC algorithm, they are particularly unwelcome in the case of HMC, because of the computational effort wasted in the numerical integration legs used to generate the rejected proposals (Campos and Sanz-Serna, 2015).

We emphasize that, when comparing different integrators, it is essential to take into account the computational effort. If the numerical integration of the Hamiltonian dynamics were exact, then all proposals would be accepted. By implication, for any reasonable (i.e. convergent) numerical integrator, it is possible to make the acceptance rate arbitrarily close to 1 by reducing the step-length ϵ and increasing the number of time-steps used to simulate the dynamics over a given time-interval. However, such a reduction in ϵ may not be advisable, as it implies a corresponding increase in computational cost. In fact, once an integrator has been chosen, a very high acceptance rate signals that an inappropriately small value of ϵ is being used and that the computational cost of generating each proposal is too high; one would sample better at the same cost, exploring the state space further, by increasing the number of steps in the Markov chain and solving the Hamilton equations less accurately in each proposal. On the other hand, when choosing an integrator one should prefer (all other things being equal) the method that maximizes the number of accepted proposals obtainable with a given computational cost.

Section 2 is introductory and contains a description of HMC and of the properties of the leapfrog algorithm. It also serves to recap a number of considerations that are essential to understand the remainder of the paper. Section 3 describes a two-parameter family of integrators that contains the leapfrog algorithm as a particular case. All members of the family may be implemented as easily as leapfrog and all use the same accept/reject strategy. It is not possible to identify analytically the algorithm of this family that is best for the HMC method; in fact the acceptance rate depends on the so-called global errors of the integrators, and global errors (as distinct from local errors) are extremely difficult to pin down analytically (Bou-Rabee and Sanz-Serna (2018) provide a short introduction to the basic concepts in numerical integration). For this reason we conduct numerical experiments to compare different integrators (Section 4). We employ three test problems: a simple multivariate Gaussian distribution, a Log-Gaussian Cox model and an example from molecular dynamics: the canonical distribution of an alkane molecule. The last distribution possesses many modes (corresponding to alternative equilibrium configurations of the molecule) and has been chosen in view of the complexity of the nonlinear forces. In all test problems the dimensionality d of the target distribution is high, ranging from 15 to 4096; in fact, as d increases so does the computational cost of generating proposals, and it becomes more important to identify efficient integrators. The experiments show that several integrators of the family improve on leapfrog. Especially, an integrator specifically designed by Blanes et al. (2014) to be used in the HMC context may provide, for a given computational effort, three times more accepted proposals than leapfrog.

In all experiments we found that the mean acceptance rate is (approximately) given by the formula a = 2Φ(−√(μ/2)) (Gupta et al., 1990), where μ is the mean energy error and Φ is the standard normal cumulative distribution function. The high-dimensional asymptotic validity of this formula has been rigorously established by Beskos et al. (2013) under the very restrictive assumption that the target is a product of independent identical copies of a given distribution. In Section 5 we prove (Theorem 2) that the formula also holds asymptotically in a general scenario for Gaussian distributions. In addition, we study (Theorem 1) the acceptance rate in the simplest model problem given by a univariate Gaussian target. The theorems have a clear implication for the construction of Hamiltonian integrators tailored to be used within HMC: reducing the expected energy error is the only way to improve the expected acceptance rate, which validates the approach used by Blanes et al. (2014).
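The formula above can be checked numerically under the Gaussian approximation in which it is usually derived, namely ΔH ∼ N(μ, 2μ) at stationarity. The following is a small Monte Carlo sanity check of our own (the value μ = 0.3 is an arbitrary test value, not from the paper):

```python
import math
import numpy as np

# Monte Carlo check of a = 2*Phi(-sqrt(mu/2)) under Delta H ~ N(mu, 2*mu).
rng = np.random.default_rng(0)
mu = 0.3
dH = rng.normal(mu, math.sqrt(2.0 * mu), size=200_000)

empirical = np.minimum(1.0, np.exp(-dH)).mean()           # E min(1, exp(-dH))
Phi = lambda z: 0.5 * (1.0 + math.erf(z / math.sqrt(2)))  # standard normal CDF
predicted = 2.0 * Phi(-math.sqrt(mu / 2.0))
```

The empirical mean acceptance probability and the predicted value agree to Monte Carlo accuracy, reflecting the fact that the identity is exact when ΔH is exactly N(μ, 2μ).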

Our contributions are summarized in the final Section 6.

We conclude this introduction with some caveats. A comparison of the merits of HMC with those of other sampling techniques or of the advantages of the various variants of HMC is completely outside our scope here; a recent contribution devoted to such comparisons is the paper by Radivojević and Akhmatskaya (2019), which also provides an extensive bibliography. Similarly, we are not concerned with discussing issues of HMC not directly related to inaccurate time-integrations. Some of those issues concern the ergodicity of the method (Cances et al., 2007; Livingstone et al., 2019; Bou-Rabee and Sanz-Serna, 2017), the choice of the length of the time-interval for simulating the dynamics (Hoffman and Gelman, 2014), the difficulties in leaving the neighbourhood of a probability mode in multimodal targets (Mangoubi et al., 2018) and the scalability of the method to large data-sets (Chen et al., 2014). Finally, while we limit the exposition to the simplest version of HMC, our study may be applied to other variants (Horowitz, 1991; Akhmatskaya et al., 2017; Radivojević et al., 2018) and even to molecular dynamics simulations with no accept/reject mechanism in place (Fernández-Pendás et al., 2016).

## 2 Hamiltonian Monte Carlo

We denote by θ the d-dimensional random variable whose distribution we wish to sample and by π(θ) ∝ exp(L(θ)) the corresponding density. HMC is based on the introduction of an auxiliary momentum variable p, independent of θ, with Gaussian density N(0, M). The joint density is then ∝ exp(−H(θ, p)), with negative log-density

 H(θ, p) = −L(θ) + (1/2) pᵀM⁻¹p + (1/2) log((2π)ᵈ det(M)),

where d is the dimension of θ. (The constant (1/2) log((2π)ᵈ det(M)) may be suppressed without altering the Hamiltonian dynamics or the acceptance probability.) In a mechanical analogy, the function H is the Hamiltonian (total energy) of a system with potential energy −L(θ) and kinetic energy (1/2) pᵀM⁻¹p; θ are the positions, p the momenta and M is the mass matrix. Hamilton's differential equations for the system are then given by

 dθ/dτ = +∂H/∂p = M⁻¹p,    dp/dτ = −∂H/∂θ = ∇θL(θ), (1)

where τ represents time. These equations have the property that, for each τ, their solution flow

 (θ(0), p(0)) ↦ (θ(τ), p(τ))

preserves the joint density ∝ exp(−H(θ, p)). HMC is a Markov chain Monte Carlo algorithm where at each step of the chain one integrates the system (1). Since, in practical applications, the exact solution flow cannot be obtained analytically, one has to resort to a numerical integration whose error is removed by an accept/reject mechanism. At present the leapfrog/Störmer/Verlet algorithm is the integrator of choice. Over a single time-step of length ϵ, the integrator reads

 p ← p + (ϵ/2) ∇θL(θ), (2)
 θ ← θ + ϵ M⁻¹p, (3)
 p ← p + (ϵ/2) ∇θL(θ). (4)

In molecular dynamics, the substeps (2) and (4) are referred to as kicks; the momentum undergoes an increment while the position stands still. The substep (3) is called a drift; the system changes position without changing momentum.

The bulk of the computational effort of the leapfrog integrator typically lies in the evaluation of the gradient (score) ∇θL(θ). While, in a single time-step, an evaluation is required by (2) and a second evaluation is required by (4), an integration leg consisting of N time-steps only demands N + 1 evaluations, because the gradient in (4) is the same as the gradient in (2) at the next time-step. Thus the computational cost of the leapfrog integrator is essentially one gradient evaluation per time-step.
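The gradient-reuse accounting just described can be sketched in a few lines of Python (our own illustrative code, not from the paper; the function name `leapfrog` and its arguments are invented, and the mass matrix is taken to be the identity):

```python
import numpy as np

def leapfrog(theta, p, grad_logpi, eps, n_steps):
    """An n_steps leapfrog leg, substeps (2)-(4), with unit mass matrix.

    The gradient evaluated in the final half-kick of one step is reused in
    the first half-kick of the next, so the leg costs n_steps + 1 gradient
    evaluations in total.
    """
    g = grad_logpi(theta)            # the single "extra" evaluation
    for _ in range(n_steps):
        p = p + 0.5 * eps * g        # half kick (2)
        theta = theta + eps * p      # drift (3), M = I
        g = grad_logpi(theta)        # the only new gradient of this step
        p = p + 0.5 * eps * g        # half kick (4)
    return theta, p
```

For a standard normal target one would pass `grad_logpi = lambda th: -th`; wrapping a counter around `grad_logpi` confirms the `n_steps + 1` evaluations.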

At each step of the Markov chain the Hamiltonian dynamics over an interval of length T are simulated by taking N time-steps of length ϵ = T/N of the integrator, starting from (θ, p), where θ is the last sample of the chain and p is drawn from the Gaussian distribution N(0, M). The numerical solution (θ⋆, p⋆) at the end of the N time-steps is a (deterministic) proposal, which is accepted with probability

 a=min(1,exp(−[H(θ⋆,p⋆)−H(θ,p)])). (5)

It is hoped that, by choosing T appropriately, θ⋆ will be far from θ, so as to reduce correlation and enhance the exploration of the state space.

In (5), the difference

 ΔH(θ,p)=H(θ⋆,p⋆)−H(θ,p) (6)

(recall that θ⋆ and p⋆ are deterministic functions of (θ, p)) is the increment of the Hamiltonian function over the numerical integration leg. For the true solution, the value of H remains constant as τ varies (conservation of energy), and therefore the initial energy H(θ, p) coincides with the value of the Hamiltonian in the true dynamics at the end point τ = T. Thus, (6) is also the error in H at τ = T introduced by the numerical integrator. Since leapfrog is a second-order method, in the limit where ϵ → 0 and N → ∞ with ϵN = T fixed, it provides numerical approximations to the true values of the Hamiltonian solutions that have O(ϵ²) errors. (This and similar estimates to be found later require some smoothness in the Hamiltonian, but we will not concern ourselves with such issues.)

It follows that, for fixed T, (6) is also O(ϵ²) in that limit. As a consequence, once T has been chosen to generate suitable proposals, the acceptance probability (5) may become arbitrarily close to 1 by decreasing ϵ (but this will imply an increase in the computational cost of generating the proposal, as more time-steps have to be taken to span the interval of length T). The sentence “HMC may generate proposals away from the current state that may be accepted with high probability” has been repeated again and again to promote HMC.
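Putting the pieces together, one step of the resulting Markov chain may be sketched as follows (a minimal illustration of our own, assuming a unit mass matrix; `hmc_step`, `logpi` and `grad_logpi` are invented names):

```python
import numpy as np

def hmc_step(theta, logpi, grad_logpi, eps, n_steps, rng):
    """One HMC transition: momentum draw, leapfrog leg, accept/reject."""
    p = rng.standard_normal(theta.shape)        # p ~ N(0, I), unit mass matrix
    H_init = -logpi(theta) + 0.5 * p @ p        # energy at the current state
    th = theta.copy()
    g = grad_logpi(th)
    for _ in range(n_steps):                    # leapfrog leg (2)-(4)
        p = p + 0.5 * eps * g
        th = th + eps * p
        g = grad_logpi(th)
        p = p + 0.5 * eps * g
    H_final = -logpi(th) + 0.5 * p @ p
    dH = H_final - H_init                       # energy increment (6)
    if rng.random() < np.exp(min(0.0, -dH)):    # acceptance probability (5)
        return th, True
    return theta, False
```

The returned flag allows the empirical acceptance rate to be monitored across the chain.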

For given T, how should ϵ be chosen? If ϵ is too large, many rejections will take place and the chain will be highly correlated. On the other hand, a very high empirical acceptance rate signals that an inappropriately small value of ϵ is being used and that the computational cost of generating each proposal is too high; one would sample better at the same cost by increasing the number of steps in the Markov chain and solving the Hamilton equations less accurately in each proposal. According to the analysis in Beskos et al. (2013), which only applies in a restrictive model scenario where the target is a product of independent copies of a fixed distribution, one should tune ϵ so as to observe an acceptance rate of approximately 65%. The result is asymptotic as d → ∞, and that reference recommends that in practice one aim at acceptance rates larger than that. This discussion will be taken up again below, when commenting on Figures 5 and 6.

We emphasize that in an HMC context numerical integrations do not need to be very accurate. Even if the energy error in an integration leg is as large as, say, 1 (orders of magnitude larger than the accuracy usually sought when integrating differential equations numerically), the proposal will still be accepted with probability exp(−1) ≈ 0.37. In addition, negative energy errors always result in acceptance. Nevertheless, it is important to recall that the Hamiltonian is an extensive quantity whose values add when mechanical systems are juxtaposed. If the vector random variable θ has d independent scalar components θᵢ and M is diagonal, then the Hamiltonian for θ is the sum of the Hamiltonians for the individual components, and the Hamiltonian system of differential equations for θ is the uncoupled juxtaposition of the one-degree-of-freedom Hamiltonian equations for the θᵢ. In that model scenario, an acceptance probability of order one would require that each one-degree-of-freedom Hamiltonian system be integrated with an energy error of size 1/d (on average over the components). Thus, as d increases, the numerical integration has to become more accurate.

The leapfrog integrator has two geometric properties (Sanz-Serna and Calvo, 2018; Blanes and Casas, 2016): (1) it preserves the volume element, and (2) it is time-reversible, i.e., if after N time-steps the integrator maps an initial condition (θ, p) into (θ⋆, p⋆), then N steps from the initial condition (θ⋆, −p⋆) lead to (θ, −p). These two properties are essential for the accept/reject mechanism based on the expression (5) to remove the bias introduced by numerical integration. For integrators that fail to be either volume-preserving or time-reversible, the expression of the acceptance probability is far more complicated (see e.g. Fang et al. (2014)); it is for this reason that explicit Runge–Kutta and other standard methods are not suitable for use in HMC codes.
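Time-reversibility is easy to verify numerically: integrating forward, flipping the sign of the momentum, and integrating again must return the initial condition up to round-off. A minimal sketch of our own, for a standard normal target with unit mass matrix:

```python
import numpy as np

def leapfrog_leg(theta, p, grad_logpi, eps, n):
    # leapfrog substeps (2)-(4), unit mass matrix
    g = grad_logpi(theta)
    for _ in range(n):
        p = p + 0.5 * eps * g
        theta = theta + eps * p
        g = grad_logpi(theta)
        p = p + 0.5 * eps * g
    return theta, p

grad_logpi = lambda th: -th                              # standard normal target
th0, p0 = np.array([0.7]), np.array([-0.3])
th1, p1 = leapfrog_leg(th0, p0, grad_logpi, 0.1, 50)     # forward leg
th2, p2 = leapfrog_leg(th1, -p1, grad_logpi, 0.1, 50)    # flip momentum, repeat
assert np.allclose(th2, th0) and np.allclose(-p2, p0)    # back at the start
```

Volume preservation cannot be checked quite so directly, but it holds because every kick and drift is a shear.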

To conclude this section, we look at the model problem with univariate Gaussian target of standard deviation σ. If, in addition, the mass matrix is taken to be M = 1, then (up to an additive constant) H(θ, p) = θ²/(2σ²) + p²/2. For this problem, it is proved by Blanes et al. (2014) that the expectation of the difference (6), for ϵ < 2σ and an arbitrary number of time-steps, satisfies

 0 ≤ E(ΔH) ≤ (ϵ⁴/σ⁴) / (32 (1 − ϵ²/(4σ²))), (7)

if one assumes that the chain is at stationarity. With ϵ = σ, the upper bound on the expected energy error takes the value 1/24 ≈ 0.042, and halving ϵ brings the bound down to 1/480 ≈ 0.002. This proves that, for this model problem, the leapfrog integrator works well with step-lengths that are not small (as measured in the “natural” unit σ).
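Values of this kind follow directly from evaluating the right-hand side of (7); a small sketch of our own, written in terms of the dimensionless ratio h = ϵ/σ:

```python
def bound(h):
    """Upper bound in (7) as a function of h = eps / sigma (valid for h < 2)."""
    return h**4 / (32.0 * (1.0 - h**2 / 4.0))

print(bound(1.0))   # eps = sigma:   1/24  (about 0.042)
print(bound(0.5))   # eps = sigma/2: 1/480 (about 0.002)
```

The fourth-power dependence on h explains the rapid decay of the expected energy error when the step-length is reduced.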

The bound (7) requires that ϵ < 2σ. In fact, for ϵ > 2σ the leapfrog algorithm is unstable for the corresponding Hamiltonian equations, given by the (scalar) harmonic oscillator dθ/dτ = p, dp/dτ = −θ/σ² (the frequency 1/σ of the oscillator is the inverse of the standard deviation σ). Recall that this means that, if for fixed ϵ more and more time-steps are taken, then the numerical solution grows unboundedly. One says that (0, 2) is the stability interval of the leapfrog algorithm.

As pointed out before, as ϵ → 0 with ϵN = T fixed, the energy increment (6) is O(ϵ²) at each fixed initial condition; since, according to (7), its expectation at stationarity is O(ϵ⁴), there is much cancellation between points where (6) is positive and points where it is negative. Such a cancellation is a consequence of conservation of volume and time-reversibility in leapfrog integration and holds for arbitrary targets (Beskos et al., 2013; Bou-Rabee and Sanz-Serna, 2018). In fact, for general volume-preserving, time-reversible integrators of order of accuracy ν and arbitrary targets, the expectation of ΔH at stationarity is O(ϵ^{2ν}).

## 3 A family of integrators

In this paper we work with a family of splitting integrators (Blanes et al., 2008), depending on two real parameters a and b, that has the potential to replace the standard leapfrog method within HMC algorithms. The use of integrators of this family in different applications has been studied in detail by Campos and Sanz-Serna (2017). The formulas for performing a single time-step are:

 p ← p + (1/2 − b) ϵ ∇θL(θ), (8)
 θ ← θ + a ϵ M⁻¹p, (9)
 p ← p + b ϵ ∇θL(θ), (10)
 θ ← θ + (1 − 2a) ϵ M⁻¹p, (11)
 p ← p + b ϵ ∇θL(θ), (12)
 θ ← θ + a ϵ M⁻¹p, (13)
 p ← p + (1/2 − b) ϵ ∇θL(θ). (14)

We assume that a and b do not take the values 0 or 1/2; when they do, some of the substeps are redundant. An integration based on this algorithm is just a sequence of kicks and drifts and therefore not very different from a standard leapfrog integration. Four evaluations of the gradient are required in a single time-step (8)–(14). However, since the last gradient at a time-step may be reused at the next time-step, an integration leg of N time-steps needs 3N + 1 gradient evaluations. Therefore, a single time-step (8)–(14) is essentially three times more expensive than a leapfrog time-step (2)–(4). To have fair comparisons, an HMC simulation based on the standard leapfrog (2)–(4) should be allowed three times as many time-steps as a simulation based on (8)–(14). We will return to this issue later.

Since each individual kick and drift preserves volume, the integrator (8)–(14) is volume-preserving. In addition, its palindromic structure leads to its being time-reversible. As a consequence, the bias introduced by numerical integration errors may be suppressed by an accept/reject mechanism based on the same formula (5) that is used for leapfrog integrations. In summary, the implementation of HMC based on the new integrator is extremely similar to leapfrog-based implementations.
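For concreteness, a single time-step (8)–(14) may be coded as follows (our own sketch with unit mass matrix; the function name and argument names are invented):

```python
import numpy as np

def three_stage_step(theta, p, grad_logpi, eps, a, b):
    """One time-step of the palindromic splitting scheme (8)-(14), M = I."""
    p = p + (0.5 - b) * eps * grad_logpi(theta)   # kick (8)
    theta = theta + a * eps * p                   # drift (9)
    p = p + b * eps * grad_logpi(theta)           # kick (10)
    theta = theta + (1.0 - 2.0 * a) * eps * p     # drift (11)
    p = p + b * eps * grad_logpi(theta)           # kick (12)
    theta = theta + a * eps * p                   # drift (13)
    p = p + (0.5 - b) * eps * grad_logpi(theta)   # kick (14)
    return theta, p
```

One can check that with a = b = 1/3 a single step reproduces exactly three consecutive leapfrog steps (2)–(4) of length ϵ/3, since the merged inner kicks of length bϵ = ϵ/3 equal two consecutive half-kicks of length ϵ/6.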

Campos and Sanz-Serna (2017) show that, when the parameters a and b do not satisfy the relation

 a+b−6ab=0, (15)

the stability interval of (8)–(14) is relatively short, and as a consequence small values of ϵ are required for stable integration. Therefore schemes for which (15) is violated cannot possibly compete with standard leapfrog. Accordingly, we hereafter assume that the relation holds, and in this way we are left with a family that may be described in terms of the single parameter b. It is perhaps of some interest to point out that by imposing (15) we exclude the unique choice of a and b for which (8)–(14) is fourth-order accurate. That fourth-order accurate integrator has a very short stability interval and performs extremely poorly in the HMC context, as shown by Campos and Sanz-Serna (2017). All integrators considered below are then second-order accurate, just as standard leapfrog.
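Once (15) is imposed, the coefficient a is determined by b; solving the relation as printed above gives a = b/(6b − 1), a small helper of our own used only for illustration:

```python
def a_from_b(b):
    """Solve relation (15), a + b - 6ab = 0, for a (requires b != 1/6)."""
    return b / (6.0 * b - 1.0)

# every member of the resulting one-parameter family satisfies (15) exactly
for b in (1.0 / 3.0, 0.25, 0.4):
    a = a_from_b(b)
    assert abs(a + b - 6.0 * a * b) < 1e-12
```

Note that b = 1/3 gives a = 1/3, the leapfrog-like member discussed below.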

It is virtually impossible to obtain accurate analytical information on the energy increment (6) for a given Hamiltonian system, integrator and value of ϵ, in particular in our situation, where we are interested in relatively large values of ϵ; therefore numerical experimentation seems to be essential to identify good values of b. The numerical experiments in the next section compare the following six values of b:

• b = 1/3. In this case, it is easy to check that a single time-step of (8)–(14) yields the same result as three consecutive time-steps of standard leapfrog (2)–(4) with step-length ϵ/3. Thus, in the experiments below, one may think that, rather than performing N time-steps with (8)–(14), each of length ϵ, one is taking 3N time-steps of length ϵ/3 with the standard leapfrog integrator. Hence, when comparing later an integration that uses some other value of b with a second integration with b = 1/3, we are really comparing the results of the first integration with those standard leapfrog would deliver if it were run with a time-step ϵ/3, so as to equalize computational costs. With b = 1/3 the length of the stability interval of (8)–(14) is 6 (of course three times the corresponding length 2 for (2)–(4)). In what follows we shall refer to the method with b = 1/3 as LF.

• with stability interval of length .

• . The rationale for this choice, suggested by Blanes et al. (2014) in the numerical analysis literature, will be summarized below. This method is called BlCaSa in Campos and Sanz-Serna (2017). The length of the stability interval is .

• . This choice, due to Predescu et al. (2012) in the molecular dynamics literature without reference to HMC, results from imposing that, for the particular case where the gradient ∇θL is linear (so that the target is Gaussian), the error in H after a single time-step is of higher order as ϵ → 0 than it is for other values of b. As a result, the energy increment (6) over an integration leg is correspondingly smaller for Gaussian targets, which, according to the results in Beskos et al. (2013), implies a smaller bound for the expected energy error for such targets. This method is called PrEtAl in Campos and Sanz-Serna (2017).

• , with .

• , with .

Experiments not reported here show that values of smaller than or larger than lead to integrators that are not competitive in the HMC context with the six listed above.

As is the case for PrEtAl, the method BlCaSa was derived assuming a multivariate Gaussian model. However, while PrEtAl was derived so as to reduce the energy error in the limit ϵ → 0, the derivation of BlCaSa takes into account the behaviour of the method over a finite ϵ-interval, in view of the fact that Hamiltonian simulations within HMC are not carried out with small values of the step-length. It was proved by Blanes et al. (2014) that for each integrator, assuming M = I (the result may be adapted to cover other choices of mass matrix (Blanes et al., 2014; Bou-Rabee and Sanz-Serna, 2018)), the expected energy increment at stationarity has, independently of the number of time-steps in each integration leg, a bound

 0 ≤ E(ΔH) ≤ ∑_{i=1}^{d} ρ(ϵ/σᵢ), (16)

where d is the dimension of θ, the σᵢ are the marginal standard deviations (i.e. the square roots of the eigenvalues of the covariance matrix) of the target, and ρ is a function that changes with the integrator. For the standard leapfrog integrator and d = 1, we have seen this bound in (7). Blanes et al. (2014) determined b by minimizing

 ρ∞ = max_{0<ζ<3} ρ(ζ);

this optimizes the performance under the assumption that ϵ is chosen suitably large within the stability range; see Blanes et al. (2014) for further details. (For the standard leapfrog integrator the step-length used in practice is smaller than the maximum allowed by stability (compare with Neal (1993, 2012)); the interval in the minimization here is three times as long because these integrators should be operated with step-lengths three times longer than those of (2)–(4).)

## 4 Numerical experiments

Appropriate choices of the mass matrix M may allow for proposal mechanisms tailored to the given target and to its possibly anisotropic covariance structure. From the point of view of the Hamiltonian dynamics, M may be chosen so as to reduce the spectrum of frequencies present in the system of differential equations, which facilitates the numerical integration; see e.g. Beskos et al. (2011). For simplicity we do not explore here special choices of M, and all numerical experiments use the unit mass matrix M = I. In all of them we have randomized, following Neal (2011), the step-length at the beginning of each step of the Markov chain/integration leg. Once the “basic” step-length ϵ and the number of time-steps N needed to complete a simulation of duration T have been chosen, the integration to find the n-th proposal of the chain is performed with a step-length

 ϵ(n) = (1 + u(n)) ϵ,  u(n) ∼ U(−0.05, 0.05).

As a result, the actual length of the integration interval will differ slightly from T. This randomization suppresses the artifacts that may appear for special choices of the step-length (Bou-Rabee and Sanz-Serna, 2017).
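The randomization is straightforward to implement; a sketch of our own (`randomized_eps` is an invented name):

```python
import numpy as np

def randomized_eps(eps, rng):
    """Step-length for one integration leg: eps * (1 + u), u ~ U(-0.05, 0.05)."""
    return eps * (1.0 + rng.uniform(-0.05, 0.05))

rng = np.random.default_rng(0)
eps_legs = [randomized_eps(0.1, rng) for _ in range(4)]   # one draw per leg
```

All time-steps within one leg share the same randomized value; a fresh value is drawn at the next step of the chain.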

### 4.1 A Gaussian example

We have first considered a simple test problem, used by Blanes et al. (2014), with target density

 π(θ) ∝ exp(−(1/2) ∑_{j=1}^{d} j² θⱼ²), (17)

for two values of the dimension d. Assuming a diagonal covariance matrix is not restrictive, as the application of the numerical integrators commutes with the rotation of axes that diagonalizes the covariance matrix (see Bou-Rabee and Sanz-Serna (2018) for a detailed discussion). We also ran other values of d, and the conclusions as to the merit of the various integrators do not differ from those drawn below.
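For reference, the log-density and score of the target (17), which is all an HMC implementation needs, can be set up as follows (our own illustrative code; the helper name is invented):

```python
import numpy as np

def make_gaussian_target(d):
    """Log-density and score of target (17): pi ∝ exp(-0.5 * sum_j j^2 theta_j^2)."""
    j2 = np.arange(1, d + 1) ** 2            # precisions j^2; std devs are 1/j
    logpi = lambda th: -0.5 * np.sum(j2 * th ** 2)
    grad_logpi = lambda th: -j2 * th         # score, for use in the kicks
    return logpi, grad_logpi
```

Note the strongly anisotropic covariance: the j-th marginal standard deviation is 1/j, so the stiffest component sets the stability restriction on ϵ.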

As in this example the Hamiltonian system to be integrated is a set of uncoupled linear oscillators with angular frequencies j = 1, …, d, the condition for a stable integration is

 d × ϵ ≤ η,

where η is the length of the stability interval of the method. For all six integrators tested, and therefore for the step-length all of them are stable. For the case , and , , we need time-steps in each integration leg. Our experiments had and , respectively; ( then varies between and ). For the methods with longer stability intervals, such as LF, we also used the coarser values , with (respectively) . Turning now to , the choices , , require time-steps in each integration leg. Our experiments used , ( ranges from to ), and in addition , for the integrators with longer stability intervals.

For each integrator and each value of ϵ, we have generated a Markov chain with a number of samples large enough for our purposes. The initial θ is drawn from the target π, and we have monitored the acceptance rate, the effective sample size (ESS) of the first component of θ (the component with largest variance/smallest precision), and the mean, across the integration legs/Markov chain steps, of the energy increments (6).

For the different methods we have plotted in Figures 1 and 2 (one figure for each of the two dimensions considered), as functions of the step-size ϵ, the mean of the changes in H (diamonds), the acceptance percentage (triangles), and the ESS as a percentage of the number of samples in the Markov chain (squares). Note that we use a logarithmic scale for the energy increments (left vertical axis) and a linear scale for the acceptance rate and ESS (right vertical axis).

We look first at the mean of ΔH as a function of ϵ for the different integrators. When comparing Figure 1 with Figure 2, note that both the vertical and horizontal scales are different; the larger dimension (Figure 2) uses smaller values of ϵ and yet shows larger values of the mean energy error.

The differences between BlCaSa, PrEtAl and the third nearby choice of b are remarkable in both figures if we take into account that these three integrators have values of b very close to each other. (When running BlCaSa and PrEtAl it is important not to round the values of b given here and to compute accurately the value of a from (15).) PrEtAl was designed by making ΔH small in the limit ϵ → 0 for Gaussian targets, and we see that indeed the graph of the mean of ΔH as a function of ϵ has a larger slope for this method than for the other five integrators. Even though, for ϵ small, PrEtAl delivers very small energy errors, this does not seem to be particularly advantageous in HMC sampling, because those errors occur where the acceptance rate is close to 100%, i.e. not in the range of values of ϵ that make sense on efficiency grounds. On the other hand, BlCaSa was derived so as to get, for Gaussian targets, small energy errors over a range of values of ϵ; this is borne out in both figures, where for this method the variation of the mean of ΔH with ϵ shows a kind of plateau, not present in any of the other five schemes.

In both figures, for each ϵ, the values of the mean of ΔH for four of the integrators, among them BlCaSa and PrEtAl, lie below the corresponding value for LF. Since, for a given ϵ, all integrators have the same computational effort, this implies that each of those four methods improves on LF, as it gives less error for a given computational cost.

The mean acceptance rate as a function of ϵ has a behaviour that mirrors that of the mean of ΔH: smaller mean energy errors lead to larger mean acceptance rates, as expected. For ϵ small, PrEtAl has the highest acceptance rate, but again this is not advantageous for our sampling purposes. In Figure 1, BlCaSa shows high acceptance rates even for the coarsest ϵ; to get that level of acceptance with LF one needs to reduce ϵ considerably. The advantage of BlCaSa over LF in that respect is even more marked in Figure 2.

To visualize the relation between the mean of ΔH and the mean acceptance rate, we have plotted in the right panel of Figure 3 the pairs of values corresponding to all our runs for the larger dimension (i.e. to all six integrators and all values of ϵ). All points are almost exactly on a curve, so that to a given value of the mean energy error corresponds a well-defined mean acceptance rate, independently of the value of ϵ and of the integrator. The left panel collects similarly all the runs with the smaller dimension; the points are again very approximately on the same curve as the points in the right panel, but the fit is now worse. This behaviour is related to a central limit theorem and will be discussed in Section 5. Figure 4 displays results corresponding to the six integrators and different step-lengths for the target (17) for small values of d, where the central limit theorem cannot be invoked; as distinct from Figure 3, there does not appear to exist a single smooth curve that contains the results of the different integrators.

Returning to Figure 1 and Figure 2, we see that in both of them, and for all six methods, the relative ESS is close to a limiting value when ϵ is very small. This is no doubt due to the fact that, if we used exact integration of the Hamiltonian dynamics, we would see a relative ESS determined by the exact dynamics alone. We therefore should prefer integrators that reach that limiting value of the ESS with the least computational work, i.e. with the largest ϵ. In this respect the advantage of BlCaSa over LF is very noticeable. In each panel of both figures, we have marked with a circle the run with the highest ESS per unit computational work, i.e. the run where the product of ESS and ϵ is highest. The location of the circles reinforces the idea that BlCaSa is the method that may operate with the largest ϵ, i.e. with the least computational effort.

In order to better compare the efficiency of the different integration algorithms, we provide two additional figures, Figures 5 and 6, one for each of the two dimensions considered. For clarity, only the methods LF, BlCaSa and PrEtAl are considered. The horizontal axis in both panels of each figure represents values of ϵ. In the left panel we show values of the acceptance percentage multiplied by ϵ (so as to take into account that the computational effort per integration leg is proportional to 1/ϵ). The right panel shows values of the product of ESS and ϵ.

For all step-lengths, the acceptance percentage per time-step, i.e. per unit computational work, is clearly smaller for LF than for the other two methods. Comparing PrEtAl and BlCaSa, we see that for the largest step-lengths BlCaSa shows larger acceptance rates than PrEtAl, but when the step-length decreases the acceptance rates of PrEtAl are slightly higher. BlCaSa is the most efficient of the three methods, as can be seen by comparing the maxima of the three lines in each panel of both figures.

In the right panels a blue circle highlights the most efficient run for each method; in one of the two cases this corresponds to 360 time-steps per leg for BlCaSa, 480 time-steps per leg for PrEtAl, and 720 time-steps per leg for LF. In the other case, the most efficient run for PrEtAl has ESS = 2454 and AR = 88.36%. Note that the optimal acceptance rates observed are typically higher than the 65.1% resulting from the analysis of Beskos et al. (2013). There is no contradiction: while that analysis applies as d → ∞ when the target is a product of independent, identically distributed factors, our figures correspond to fixed values of d and the targets here are not of that simple form.

We conclude that in this model problem, for a given computational effort, BlCaSa generates roughly three times as many effective samples as LF. In addition, the advantage of BlCaSa becomes more marked as the problem becomes more challenging.

The information on the performance of the different integrators obtained by using ESS is essentially the same as that obtained from the mean acceptance rate. Since ESS is a metric that depends on the specific observable chosen, the experiments below concentrate on the mean acceptance rate.
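For reference, ESS for a scalar observable is usually estimated from the empirical autocorrelations of the chain. The following is a minimal sketch of such an estimator; the simple truncation rule used here (stop summing at the first nonpositive autocorrelation) is an illustrative choice, and the estimator employed in the experiments may differ.

```python
import numpy as np

def ess(x):
    """Crude effective-sample-size estimate for a scalar chain:
    ESS = n / (1 + 2 * sum of autocorrelations), truncating the sum
    at the first nonpositive autocorrelation (illustrative rule)."""
    x = np.asarray(x, dtype=float)
    n = x.size
    xc = x - x.mean()
    acov = np.correlate(xc, xc, mode="full")[n - 1:] / n
    rho = acov / acov[0]
    s = 0.0
    for k in range(1, n):
        if rho[k] <= 0.0:
            break
        s += rho[k]
    return n / (1.0 + 2.0 * s)

rng = np.random.default_rng(0)
n = 5000
iid = rng.standard_normal(n)       # independent samples: ESS close to n
ar = np.empty(n)                   # strongly correlated AR(1) chain: ESS << n
ar[0] = 0.0
for t in range(1, n):
    ar[t] = 0.9 * ar[t - 1] + rng.standard_normal()
```

For the independent chain the estimate is close to the nominal sample size, while for the autocorrelated chain it is an order of magnitude smaller, which is the effect the relative ESS in the figures quantifies.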

### 4.2 Log-Gaussian Cox model

The second test problem (Christensen et al., 2005; Girolami and Calderhead, 2011) is a well-known high-dimensional example of inference in a log-Gaussian Cox process. A 64 × 64 grid is considered in [0, 1]² and the random variables X_{i,j}, 1 ≤ i, j ≤ 64, represent the number of points in cell (i, j). These random variables are conditionally independent and Poisson distributed with means m exp(Y_{i,j}), where m = 1/64² is the area of each cell and the unobserved intensities exp(Y_{i,j}) are driven by variables Y_{i,j}, 1 ≤ i, j ≤ 64, that are multivariate Gaussian with mean μ1 and covariance matrix Σ given by

 Σ_{(i,j),(i′,j′)} = σ² exp( −√((i−i′)² + (j−j′)²) / (64β) ).

Given the data x = {x_{i,j}}, the target density of interest is

 π(y) ∝ ∏_{i,j=1}^{64} exp( x_{i,j} y_{i,j} − m exp(y_{i,j}) ) × exp( −(y − μ1)ᵀ Σ⁻¹ (y − μ1)/2 ).

As in Christensen et al. (2005) we have fixed the parameters σ², β and μ, and used three alternative starting values. The conclusions obtained as to the merits of the different integrators are very similar in the three alternative cases and, for this reason, we only report here results corresponding to one of them (choice (b) in Christensen et al. (2005)). The initial state is randomly constructed for each run in the following way. Given a random vector with independent standard normal components, the starting value is the solution of a nonlinear system involving the Cholesky factor of a matrix that depends on the unknown itself. This nonlinear system has been solved by fixed-point iteration, and the iteration has been stopped once successive iterates are sufficiently close (which happens after 19 iterations).
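For concreteness, the covariance matrix and the unnormalized log-target above can be assembled as follows. This is only a sketch: the grid is shrunk from 64 × 64 to 16 × 16 to keep the demonstration small, and the values assigned to σ², β and μ are placeholders, not the ones fixed in the experiments.

```python
import numpy as np

# Placeholder parameter values for illustration only; the experiments fix
# sigma^2, beta and mu to the values of Christensen et al. (2005).
n, sigma2, beta, mu = 16, 1.9, 0.03, 0.5   # n = 16 shrinks the 64 x 64 grid
m = 1.0 / n**2                             # area of each grid cell

# Covariance between cells (i, j) and (i', j'), cf. the formula for Sigma
coords = np.indices((n, n)).reshape(2, -1).T
dist = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
Sigma = sigma2 * np.exp(-dist / (n * beta))

def log_target(y, x):
    """Unnormalized log pi(y): Poisson log-likelihood plus Gaussian prior."""
    r = y - mu
    return np.sum(x * y - m * np.exp(y)) - 0.5 * r @ np.linalg.solve(Sigma, r)

# Evaluate at an arbitrary state with synthetic counts
rng = np.random.default_rng(0)
y0 = rng.standard_normal(n * n)
x0 = rng.poisson(1.0, n * n).astype(float)
val = log_target(y0, x0)
```

HMC for this model only requires this log-density and its gradient; the dimension of the target grows as the square of the grid size, which is what makes the full problem high dimensional.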

The length of each integration leg has been set equal to 3, and several "basic" step-lengths have been used. After generating burn-in samples, we have run Markov chains with a number of elements that is sufficient for our purposes.

In Figure 7 we have represented the acceptance percentage and the mean of the energy increments ΔH against the step-length for the different cases considered. We have omitted data corresponding to either very small acceptance percentages or values of the mean of ΔH larger than 1.

Comparing the performance of the six methods, we see that for this problem, BlCaSa, in contrast with the other five integrators, operates well for the largest step-lengths considered. PrEtAl gives very high acceptance rates for small values of ϵ, but this is not very relevant for the reason given in the preceding test example. The other four integrators, including LF, require very small step-lengths to generate a reasonable number of accepted proposals; their performance is rather poor.

The energy error in PrEtAl decays by more than five orders of magnitude as the step-length is reduced across the range considered; this is remarkable because the additional order of energy accuracy of this integrator as ϵ → 0, as compared with the other five, only holds analytically for Gaussian targets (Predescu et al., 2012). Similarly, BlCaSa shows here the same kind of plateau in the mean energy error that we observed in the Gaussian model.

In Figure 8 we have plotted the acceptance percentage per step as a function of the step-length for the two methods with the largest acceptance rates, PrEtAl (diamonds) and BlCaSa (squares) in addition to LF (triangles). The blue circles in the figure highlight the most efficient run for each method. When the computational cost is taken into account, BlCaSa is the most efficient of the methods being compared, by a factor of more than three with respect to LF.

### 4.3 Alkane molecule

The third test problem comes from molecular simulations and was chosen so as to consider multimodal targets very far away from Gaussian models. The target is the Boltzmann distribution of the molecule, where the vector θ collects the Cartesian coordinates of the different atoms and the potential energy plays the role of the negative logarithm of the target; accordingly, in (8)–(14), the log-density has to be replaced by minus the potential energy. We have considered the simulation of an alkane molecule described in detail in Cances et al. (2007) and later used by Campos and Sanz-Serna (2015), Campos and Sanz-Serna (2017) and Blanes et al. (2014). The expression for the potential energy is extremely complicated, as it contains contributions related to bond lengths, bond angles, dihedral angles, electrostatic forces and Van der Waals forces. In our numerical experiments molecules with 5, 9 or 15 carbon atoms have been used, leading to Hamiltonian systems with 15, 27 and 45 degrees of freedom (the hydrogen atoms are not simulated).

As in Campos and Sanz-Serna (2017), we fix the length of each integration leg and a "basic" step-length, which together determine the number of time-steps per integration leg. In all runs, after generating samples for burn-in, we consider Markov chains with 1000 samples; the results provided are obtained by averaging over 20 instances of the experiments. In all cases, the initial condition for the chain is the equilibrium with minimum potential energy.

Figure 9 displays results for one of the three molecules considered. We have omitted data corresponding to either very small acceptance percentages or values of the mean of ΔH larger than 1. It is clear that LF is far from being the best choice for this problem: BlCaSa and PrEtAl have substantially larger acceptance rates than LF for each value of ϵ where LF works. The results for the other two molecules are not very different. For one of these molecules, LF, BlCaSa and PrEtAl can all operate with the largest value of ϵ considered, but again, for all values of ϵ, the acceptance rate of LF is well below that provided by BlCaSa.

As in the preceding test problem, for both BlCaSa and PrEtAl the variation of the mean of ΔH with ϵ is as in the Gaussian model: the former method exhibits a plateau and the second a fast decay (a reduction of more than two orders of magnitude when the step-length is halved). This is in spite of the fact that both integrators were derived by considering only the Gaussian case.

## 5 Expected acceptance rate vs. expected energy error

In this section we investigate analytically some properties of the expected acceptance percentage and the expected energy increments.

### 5.1 A formula for the expected acceptance rate

The following result (Neal, 2011) is needed later.

###### Lemma 1.

Let the integration of the Hamiltonian dynamics (1) be carried out with a volume-preserving and time-reversible integrator, such that the transformation Ψ in phase space that maps the initial condition into the numerical solution is continuously differentiable. Assume that the chain is at stationarity, so that (θ, p) is distributed with density proportional to exp(−H(θ, p)), and that the energy increment ΔH in (6) satisfies

 P(ΔH=0)=0. (18)

Then the expected acceptance rate is given by

 E(a)=2P(ΔH<0).

Proof. Since, by (18), almost surely ΔH ≠ 0, we may write

 E(a) = ∫_{ℝᵈ×ℝᵈ} min(1, exp(−ΔH(θ,p))) exp(−H(θ,p)) dθ dp
      = ∫_{ΔH<0} exp(−H(θ,p)) dθ dp + ∫_{ΔH>0} exp(−ΔH(θ,p)) exp(−H(θ,p)) dθ dp,

and it is enough to show that the last two integrals share a common value.

We denote by Ψ̂ the composition S∘Ψ, where S is the momentum flip S(θ, p) = (θ, −p) (so that, by time-reversibility, Ψ̂ is an involution). Then, from the definition of ΔH,

 ΔH(θ,p)+H(θ,p)=H(Ψ(θ,p))

and therefore

 ∫_{ΔH>0} exp(−ΔH(θ,p)) exp(−H(θ,p)) dθ dp = ∫_{ΔH>0} exp(−H(Ψ(θ,p))) dθ dp = ∫_{ΔH>0} exp(−H(Ψ̂(θ,p))) dθ dp,

where in the last equality we have taken into account that H∘S = H. On the other hand,

 ∫_{ΔH<0} exp(−H(θ,p)) dθ dp = ∫_{ΔH<0} exp(−H(θ̃,p̃)) dθ̃ dp̃ = ∫_{ΔH>0} exp(−H(Ψ̂(θ,p))) dθ dp.

Here the first equality is just a change in notation. The second uses the change of variables (θ̃, p̃) = Ψ̂(θ, p), which has unit Jacobian determinant because the integrator preserves volume. Pairs (θ̃, p̃) with ΔH(θ̃, p̃) < 0 correspond to pairs (θ, p) with ΔH(θ, p) > 0 as a consequence of the time-reversibility of the integrator:

 ΔH(θ̃,p̃) = H(Ψ(θ̃,p̃)) − H(θ̃,p̃) = H(S(θ,p)) − H(S(Ψ(θ,p))) = H(θ,p) − H(Ψ(θ,p)) = −ΔH(θ,p). □

Later in the section we shall come across cases where, for simple targets and special choices of ϵ and L, the assumption (18) does not hold. Those cases should be regarded as exceptional and without practical relevance. In addition, reviewing the proof, we see that the lemma could have been reformulated without (18): at stationarity, and without counting the proposals with ΔH = 0, one out of two accepted steps comes from proposals with ΔH > 0.
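Lemma 1 lends itself to a quick numerical sanity check. The sketch below, assuming a standard univariate Gaussian target and the leapfrog integrator (which is volume-preserving and time-reversible), compares Monte Carlo estimates of the two sides of the identity; the parameter values are illustrative choices, not taken from the experiments.

```python
import numpy as np

rng = np.random.default_rng(0)

def leapfrog(theta, p, eps, L):
    """Leapfrog for the harmonic oscillator H = p^2/2 + theta^2/2
    (volume-preserving and time-reversible, as Lemma 1 requires)."""
    for _ in range(L):
        p = p - 0.5 * eps * theta
        theta = theta + eps * p
        p = p - 0.5 * eps * theta
    return theta, p

# Draw (theta, p) from the stationary distribution proportional to exp(-H)
N, eps, L = 200_000, 0.5, 7
theta = rng.standard_normal(N)
p = rng.standard_normal(N)
theta_new, p_new = leapfrog(theta, p, eps, L)
dH = 0.5 * (theta_new**2 + p_new**2 - theta**2 - p**2)

Ea = np.mean(np.minimum(1.0, np.exp(-dH)))  # Monte Carlo estimate of E(a)
twoP = 2 * np.mean(dH < 0)                  # Monte Carlo estimate of 2 P(dH < 0)
```

Both estimates agree to within Monte Carlo error, as the lemma predicts.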

### 5.2 The standard univariate Gaussian target

We now investigate in detail the model situation where the target is the standard univariate normal distribution and the mass matrix is the identity, so that d = 1. The Hamiltonian system is the standard harmonic oscillator: dθ/dt = p, dp/dt = −θ. For all time-reversible, volume-preserving, one-step integrators of practical interest (including implicit Runge-Kutta and splitting integrators), assuming that ϵ is in the stability interval, the transformation associated with an integration leg of L time-steps has an expression (Blanes et al., 2014; Bou-Rabee and Sanz-Serna, 2018):

 θ⋆ = cos(Lα_ϵ) θ + χ_ϵ sin(Lα_ϵ) p, (19)
 p⋆ = −χ_ϵ⁻¹ sin(Lα_ϵ) θ + cos(Lα_ϵ) p, (20)

where χ_ϵ > 0 and α_ϵ are quantities that depend smoothly on the step-length and change with the specific integrator, but are independent of the number L of time-steps. As L varies with fixed ϵ, the end point (θ⋆, p⋆) moves on an ellipse whose eccentricity is governed by χ_ϵ; for a method of order of accuracy ν, χ_ϵ − 1 = O(ϵ^ν) as ϵ → 0. The angle α_ϵ is related to the speed of the rotation of the numerical solution around the origin as L increases; it behaves as ϵ + O(ϵ^{ν+1}).
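For a concrete instance of (19)-(20), the sketch below builds the 2 × 2 matrix of one leapfrog step for the harmonic oscillator, raises it to the power L, and compares with the rotation-like form above. The explicit values cos(α_ϵ) = 1 − ϵ²/2 and χ_ϵ = (1 − ϵ²/4)^(−1/2) used here are the standard leapfrog expressions, not stated in the text.

```python
import math

def leapfrog_step():
    # One leapfrog step for theta'' = -theta, as a 2 x 2 matrix on (theta, p)
    return [[1 - eps**2 / 2, eps],
            [-eps + eps**3 / 4, 1 - eps**2 / 2]]

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

eps, L = 0.3, 11
M = leapfrog_step()
P = [[1.0, 0.0], [0.0, 1.0]]
for _ in range(L):                      # L time-steps = one integration leg
    P = matmul(P, M)

alpha = math.acos(1 - eps**2 / 2)       # leapfrog rotation angle per step
chi = (1 - eps**2 / 4) ** -0.5          # leapfrog chi_eps; chi - 1 = O(eps^2)
c, s = math.cos(L * alpha), math.sin(L * alpha)
target = [[c, chi * s], [-s / chi, c]]  # the form (19)-(20)
```

The matrix power and the closed form agree to machine precision, illustrating that χ_ϵ and α_ϵ do not depend on L.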

From (19)–(20) a short calculation reveals that the energy increment has the following expression

 2ΔH(θ,p) = Aθ² + 2Bθp + Cp², (21)

with

 A = sin²(Lα_ϵ)(χ_ϵ⁻² − 1),
 B = cos(Lα_ϵ) sin(Lα_ϵ)(χ_ϵ − χ_ϵ⁻¹),
 C = sin²(Lα_ϵ)(χ_ϵ² − 1).

For future reference we note that

 B2−AC=A+C. (22)

Taking expectations at stationarity in (21),

 2E(ΔH) = A + C, (23)

i.e.

 E(ΔH) = sin²(Lα_ϵ) ρ(ϵ)

with

 ρ(ϵ) = (1/2)(χ_ϵ² + χ_ϵ⁻² − 2) = (1/2)(χ_ϵ − χ_ϵ⁻¹)² ≥ 0,

so that, regardless of the choice of L,

 0≤E(ΔH)≤ρ(ϵ). (24)
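The relations (22)-(24) can be checked numerically once χ_ϵ and α_ϵ are specified. The sketch below uses the leapfrog values χ_ϵ = (1 − ϵ²/4)^(−1/2) and α_ϵ = arccos(1 − ϵ²/2); this is an illustrative choice, since the identities hold for any integrator of the form (19)-(20).

```python
import math

eps, L = 0.3, 11
chi = (1 - eps**2 / 4) ** -0.5       # leapfrog chi_eps (illustrative choice)
alpha = math.acos(1 - eps**2 / 2)    # leapfrog per-step angle
s, c = math.sin(L * alpha), math.cos(L * alpha)

# Coefficients of the quadratic form 2*dH = A*theta^2 + 2*B*theta*p + C*p^2
A = s**2 * (chi**-2 - 1)
B = c * s * (chi - 1 / chi)
C = s**2 * (chi**2 - 1)

EdH = 0.5 * (A + C)                  # E(dH), cf. (23)
rho = 0.5 * (chi - 1 / chi) ** 2     # rho(eps)
```

The identity B² − AC = A + C, the formula E(ΔH) = sin²(Lα_ϵ) ρ(ϵ) and the bound (24) all hold to rounding error.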

As ϵ → 0, ρ(ϵ) = O(ϵ^{2ν}), and therefore E(ΔH) = O(ϵ^{2ν}), where we remark that the exponent 2ν is doubled with respect to the pointwise estimate ΔH = O(ϵ^ν), valid for fixed (θ, p). If Lα_ϵ happens to be an integer multiple of π then sin(Lα_ϵ) = 0 and ΔH vanishes identically; in this case the transformation is either the identity or the map (θ, p) ↦ (−θ, −p), with no energy error (but then the Markov chain is not ergodic; this is one of the reasons for randomizing the duration of the integration legs). Also note that this gives an example where the hypothesis (18) in Lemma 1 does not hold.

The preceding material has been taken from Blanes et al. (2014). For the acceptance rate we have the following result, which shows that E(a) is of the form Φ(E(ΔH)), where Φ is a monotonically decreasing function that does not depend on the integrator, ϵ or L. Thus, even though the integrator BlCaSa was derived to minimize the expected energy error for the univariate standard normal target, this result shows that it also maximizes the expected acceptance rate.

###### Theorem 1.

When the target is the standard univariate Gaussian distribution, the mass matrix is the identity and the chain is at stationarity, the expected acceptance rate is given by

 E(a) = 1 − (2/π) arctan( √(E(ΔH)/2) ),

regardless of the (volume-preserving, time-reversible) integrator being used, the step-length and the number of time-steps in each integration leg.

Proof. If, in (21), either sin(Lα_ϵ) = 0 or χ_ϵ = 1, then A = B = C = 0, which leads to ΔH ≡ 0. In this case E(a) = 1 and E(ΔH) = 0, and the result holds.

If sin(Lα_ϵ) ≠ 0 and χ_ϵ ≠ 1, then AC < 0, so that B² − AC > 0, and ΔH vanishes on the straight lines

 p = ((−B − √(B² − AC))/C) θ,   p = ((−B + √(B² − AC))/C) θ,

of the (θ, p) plane, with slopes m₁ and m₂ respectively. From the expressions for A and C we see that m₁m₂ = A/C < 0, so that the two slopes have opposite signs. For the sake of clarity, let us assume that C > 0 and m₁ < 0 < m₂ (the proof in the other cases is similar). Choose angles φ₁, φ₂ ∈ (−π/2, π/2), with tan(φ₁) = m₁, tan(φ₂) = m₂.

The energy increment is then negative precisely for the points with polar angles φ ∈ (φ₁, φ₂) ∪ (φ₁+π, φ₂+π). Lemma 1 and the central symmetry of the Gaussian density yield

 E(a) = 2 ∫_{φ∈(φ₁,φ₂)∪(φ₁+π,φ₂+π)} (2π)⁻¹ exp(−(θ²+p²)/2) dθ dp = 4 ∫_{φ∈(φ₁,φ₂)} (2π)⁻¹ exp(−(θ²+p²)/2) dθ dp.

The last integral is easily computed by changing to polar coordinates and then

 E(a) = (2/π)(φ₂ − φ₁).

We now recall the formula

 arctan(x) − arctan(y) = arctan( (x − y)/(1 + xy) )

(if the equality sign is understood modulo π, then the formula holds for all branches of the multivalued function arctan) and, after some algebra, find

 E(a) = (2/π) arctan( 2√(B² − AC) / (A + C) ).

Taking into account (22) and (23), the last display may be rewritten in the form

 E(a) = (2/π) arctan( √(2/E(ΔH)) ); (25)

the theorem follows from well-known properties of the function arctan.

In the proof, the slopes m₁, m₂ of the lines that separate energy growth from energy decay change with the integrator, ϵ and L; however the angle between the lines only depends on E(ΔH).
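Theorem 1 can be tested by simulation. The sketch below, assuming the leapfrog integrator (any volume-preserving, time-reversible integrator should show the same agreement), compares a Monte Carlo estimate of E(a) with the closed-form expression; the parameter values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)

def leapfrog(theta, p, eps, L):
    # Leapfrog for H = p^2/2 + theta^2/2, vectorized over samples
    for _ in range(L):
        p = p - 0.5 * eps * theta
        theta = theta + eps * p
        p = p - 0.5 * eps * theta
    return theta, p

N, eps, L = 400_000, 0.6, 9
theta, p = rng.standard_normal(N), rng.standard_normal(N)   # stationarity
t2, p2 = leapfrog(theta, p, eps, L)
dH = 0.5 * (t2**2 + p2**2 - theta**2 - p**2)

acc = np.mean(np.minimum(1.0, np.exp(-dH)))                  # Monte Carlo E(a)
thm = 1 - (2 / np.pi) * np.arctan(np.sqrt(np.mean(dH) / 2))  # Theorem 1
```

The two quantities agree to within Monte Carlo error, and the estimated E(ΔH) is nonnegative, as (24) requires.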

Note that, as ,

 E(a) = 1 − (√2/π) √E(ΔH) + O( (E(ΔH))^{3/2} ), (26)

and, as a consequence, as ϵ → 0 (with any choice of L),

 E(a) = 1 − O(ϵ^ν)

for an integrator of order of accuracy ν. On the other hand, the formula (25) shows that, as E(ΔH) → ∞, the expected acceptance rate decays slowly, according to the estimate

 E(a) ∼ (2√2/π) (E(ΔH))^{−1/2}. (27)

For instance, if E(ΔH) = 1, then (25) yields E(a) ≈ 0.61; this should perhaps be compared with the data in Figure 4, where the target is low dimensional.

We conclude our analysis of the standard normal by presenting a result that will be required later. It is remarkable that the formulas in this lemma express moments of ΔH as polynomials in μ = E(ΔH) and that, furthermore, those polynomials do not change with the value of ϵ or L, or with the choice of integrator.

###### Lemma 2.

For a univariate, standard Gaussian target with unit mass matrix and a time-reversible, volume-preserving integrator, setting μ = E(ΔH), we have:

 E((ΔH)²) = 2μ + 3μ²,
 E((ΔH)³) = 18μ² + 15μ³,
 E((ΔH)⁴) = 36μ² + 180μ³ + 105μ⁴.

Proof. Raise (21) to the second (third or fourth) power. Compute expectations to write E((ΔH)²) (respectively E((ΔH)³) or E((ΔH)⁴)) as a polynomial in A, B and C. Express B² in terms of A and C by using (22) and a little patience. Finally use (23).

As μ → 0, E((ΔH)²) ∼ 2μ, in agreement with Proposition 3.4 in Beskos et al. (2013). In addition, also E((ΔH)³) = O(μ²), a result that, while not contained in the paper by Beskos et al. (2013), may be proved with the tools used to prove that proposition.
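The moment formulas of Lemma 2 can likewise be tested by simulation; a sketch, again with leapfrog on the standard Gaussian as an illustrative integrator (mu below is the Monte Carlo estimate of μ = E(ΔH)):

```python
import numpy as np

rng = np.random.default_rng(2)

def leapfrog(theta, p, eps, L):
    # Leapfrog for the standard harmonic oscillator, vectorized over samples
    for _ in range(L):
        p = p - 0.5 * eps * theta
        theta = theta + eps * p
        p = p - 0.5 * eps * theta
    return theta, p

N, eps, L = 1_000_000, 0.9, 5
theta, p = rng.standard_normal(N), rng.standard_normal(N)
t2, p2 = leapfrog(theta, p, eps, L)
dH = 0.5 * (t2**2 + p2**2 - theta**2 - p**2)

mu = np.mean(dH)                         # estimate of E(dH)
m2, m3 = np.mean(dH**2), np.mean(dH**3)  # second and third moments
```

Up to Monte Carlo error, m2 matches 2μ + 3μ² and m3 matches 18μ² + 15μ³, with no dependence on the particular ϵ and L chosen.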

### 5.3 Univariate Gaussian targets

For the univariate problem where the target is N(0, σ²) and the mass matrix is the identity, the Hamiltonian is H = p²/2 + θ²/(2σ²) and the differential equations are dθ/dt = p, dp/dt = −θ/σ². This problem may be reduced to the case σ = 1 by scaling the variables, because for all integrators of practical interest the operations of rescaling and numerical integration commute. Specifically, the new variable and step-length are θ/σ and ϵ/σ, with p remaining as it was.

By changing variables we find that the bound (24) should be replaced by

 0≤E(ΔH)≤ρ(ϵ/σ). (28)

(The bound (16) is a direct consequence of this.) On the other hand, for arbitrary σ, Theorem 1 and Lemma 2 hold as they stand.
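The commutation of rescaling and integration invoked above can be verified directly for leapfrog (the same computation works for other splitting integrators); σ, ϵ and L below are illustrative values:

```python
import numpy as np

def leapfrog(theta, p, eps, L, sig2=1.0):
    # Leapfrog for H = p^2/2 + theta^2/(2*sig2), i.e. the N(0, sig2) target
    for _ in range(L):
        p = p - 0.5 * eps * theta / sig2
        theta = theta + eps * p
        p = p - 0.5 * eps * theta / sig2
    return theta, p

rng = np.random.default_rng(3)
sigma, eps, L = 2.5, 0.4, 8
theta, p = sigma * rng.standard_normal(5), rng.standard_normal(5)

# Integrate the N(0, sigma^2) problem directly ...
t1, p1 = leapfrog(theta, p, eps, L, sig2=sigma**2)
# ... and the standard problem in the scaled variables theta/sigma, step eps/sigma
t2, p2 = leapfrog(theta / sigma, p, eps / sigma, L)
```

The two trajectories coincide after undoing the scaling, so the energy increments, and hence E(ΔH) and E(a), are those of the standard problem with step-length ϵ/σ, which is exactly what (28) expresses.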

### 5.4 Multivariate Gaussian targets

We now consider, for each d = 1, 2, …, a centered Gaussian target π_d in ℝᵈ, with a covariance matrix that is assumed to be nonsingular but otherwise allowed to be completely arbitrary. As pointed out before, we may assume that the covariance matrices have been diagonalized. We run HMC for each d and, for simplicity, assume that the unit mass matrix is used (but the result may be adapted to general mass matrices). Then π_d is a product of d univariate Gaussian distributions, the Hamiltonian is a sum of d one-degree-of-freedom Hamiltonians, and the increment ΔH_d is a sum of increments ΔH_{d,ℓ}, ℓ = 1, …, d. The integrator being used for the ℓ-th degree of freedom is allowed to change with ℓ; it is only assumed that, when applied to the standard normal univariate target, it takes the form (19)–(20). Similarly, the step-length and the length of the integration interval are allowed to change with ℓ.

Our next result shows that, even in this extremely general scenario, simple hypotheses on the expectations of the ΔH_{d,ℓ} ensure that, asymptotically, the variables ΔH_d are normal with a variance that is twice as large as the mean (Neal, 2011).

###### Theorem 2.

In the general scenario described above, assume that

 M_d := max_{1≤ℓ≤d} E(ΔH_{d,ℓ}) → 0 (29)

and, for some μ > 0,

 E(ΔH_d) = ∑_{ℓ=1}^d E(ΔH_{d,ℓ}) → μ. (30)

Then, as :

• The distributions of the random variables ΔH_d at stationarity converge to the distribution N(μ, 2μ).

• The acceptance rates a_d for the targets π_d, d = 1, 2, …, satisfy E(a_d) → 2Φ(−√(μ/2)), where Φ denotes the cumulative distribution function of the standard normal.

Proof. The first item is a direct application of the central limit theorem. From Lemma 2, the variance of ΔH_d is, with μ_{d,ℓ} := E(ΔH_{d,ℓ}),

 s_d² = ∑_{ℓ=1}^d ( 2μ_{d,ℓ} + 2μ_{d,ℓ}² ),

which tends to 2μ because, for any integer k ≥ 1,

 ∑_{ℓ=1}^d μ_{d,ℓ}^k ≤ M_d^{k−1} ∑_{ℓ=1}^d μ_{d,ℓ}. (31)

We use the Lyapunov condition based on the centered fourth moment and consider the expression

 (1/s_d⁴) ∑_{ℓ=1}^d E( (ΔH_{d,ℓ} − μ_{d,ℓ})⁴ ).

We expand the fourth power and use Lemma