Implicit Hamiltonian Monte Carlo for Sampling Multiscale Distributions

11/02/2019, by Arya A. Pourzanjani et al., The Regents of the University of California

Hamiltonian Monte Carlo (HMC) has been widely adopted in the statistics community because of its ability to sample high-dimensional distributions much more efficiently than other Metropolis-based methods. Despite this, HMC often performs sub-optimally on distributions with high correlations or marginal variances on multiple scales because the resulting stiffness forces the leapfrog integrator in HMC to take an unreasonably small stepsize. We provide intuition as well as a formal analysis showing how these multiscale distributions limit the stepsize of leapfrog and we show how the implicit midpoint method can be used, together with Newton-Krylov iteration, to circumvent this limitation and achieve major efficiency gains. Furthermore, we offer practical guidelines for when to choose between implicit midpoint and leapfrog and what stepsize to use for each method, depending on the distribution being sampled. Unlike previous modifications to HMC, our method is generally applicable to highly non-Gaussian distributions exhibiting multiple scales. We illustrate how our method can provide a dramatic speedup over leapfrog in the context of the No-U-Turn sampler (NUTS) applied to several examples.


1 Introduction

The Hamiltonian Monte Carlo (HMC) algorithm (Duane et al., 1987) and its recent successor, the No-U-Turn sampler (NUTS) (Hoffman and Gelman, 2014), have seen widespread use in the statistics community because of their proficiency in sampling high-dimensional distributions. In fact, Beskos et al. (2013) showed that as the dimension $d$ of the distribution being sampled tends to infinity, HMC requires only $O(d^{1/4})$ steps to sufficiently explore the distribution, while the classic random-walk Metropolis algorithm (Metropolis et al., 1953) requires $O(d)$. Roughly speaking, HMC and NUTS achieve this efficiency gain because rather than exploring parameter space in a random fashion, they systematically explore level sets of the Hamiltonian energy, using the leapfrog integrator to numerically simulate Hamiltonian dynamics over the potential energy surface defined by $U(\theta) = -\log \pi(\theta)$ (Neal et al., 2011). While the Hamiltonian energy remains roughly constant over these level sets when simulating Hamiltonian dynamics with the leapfrog integrator, the log-probability density and the values of the random variables $\theta$ often vary quite widely in practice, yielding samples that are much closer to independent than samples from the random-walk Metropolis algorithm (Betancourt, 2017).

The key to HMC reaching the correct stationary distribution lies in the reversibility of the leapfrog integrator, which allows the Markov transitions to satisfy detailed balance. Furthermore, the leapfrog integrator is volume-preserving in phase space, which makes the computation of the Metropolis probabilities in HMC trivial (Neal et al., 2011). In fact, the leapfrog integrator belongs to a class of numerical integrators known as symplectic integrators that exhibit both of these properties along with a more general property known as symplecticity. Symplectic integrators have been well studied (Leimkuhler and Reich, 2004; Hairer et al., 2006). However, little work has been done to characterize how the properties of a probability distribution relate to the properties of the associated Hamiltonian system in HMC, and how these properties make certain integrators more advantageous than others, depending on the problem.

To this end, we provide an analysis of how and why the geometry of Bayesian posterior distributions with low variance components can lead to multiscale Hamiltonian systems, i.e. systems with rapidly oscillating components that force leapfrog to take a much smaller stepsize than what would otherwise be possible with an implicit integrator. We describe the implicit midpoint integrator as an alternative to leapfrog for these types of problems, and show how to practically implement it in an HMC context. Furthermore, we offer practical guidelines on the stepsize to choose when using either leapfrog or implicit midpoint in an HMC sampler, as well as heuristics for when to choose one algorithm over the other. Finally, we compare, using practical examples, the efficiency of leapfrog NUTS (lfNUTS) with an implicit NUTS implementation we introduce called iNUTS. Code for these experiments is available as an R package that can be obtained by emailing the authors.

In Section 2 we provide a brief introduction to HMC and then describe the concept of numerical stability of an integrator and how it connects to the geometries of multivariate distributions in HMC. In Section 3 we describe the symplectic implicit midpoint algorithm along with our practical custom implementation specific to HMC. We also derive the mathematical stability limit for the implicit midpoint algorithm, showing how it provides a clear advantage over leapfrog on multiscale systems. In Section 4 we show how in practice, using real examples, our iNUTS implementation leads to less computational work per effective sample than leapfrog-based NUTS on common multiscale systems. We conclude with a summary of our contributions and future directions in Section 5.

Related Work:

While various works have explored modifications of the leapfrog integrator in HMC, the connection between posterior geometry and integrator choice as well as the multiscale problem has been sparsely examined. On the statistics side, both Shahbaba et al. (2014) and Chao et al. (2015) adapted the leapfrog integrator in HMC by splitting the potential energy into a Gaussian term and a non-Gaussian term, which are integrated separately. While these methods can alleviate the multiscale problem for distributions that can be well-approximated by a Gaussian, they fail to offer an efficiency gain over leapfrog for more complicated distributions, as we describe in Section 2. Our implicit midpoint-based method is able to achieve efficiency gains over leapfrog on a more general class of problems.

Okudo and Suzuki (2015) modify the leapfrog integrator in HMC by adding an auxiliary variable that allows for online adaptation of the stepsize. The RMHMC method of Betancourt (2013) uses local Hessian information of the potential energy function to adaptively change the mass matrix of HMC, which is equivalent to adaptively changing leapfrog stepsizes. While both of these methods can effectively adapt the stepsize in leapfrog to the local geometries of non-Gaussian distributions, they are still subject to using a small integrator stepsize in a multiscale problem, just as standard leapfrog is. In contrast, our implicit midpoint-based approach has much less severe stepsize limitations (Ascher and Reich, 1999).

2 Hamiltonian Monte Carlo and Numerical Stability in a Multiscale Problem

We provide a brief overview of the basic HMC algorithm and how it leads to a Hamiltonian system of ODEs. We then provide a short explanation and illustration of numerical stability of the leapfrog integrator on a linear Hamiltonian system and then extend this analysis to show how stability can be a crucial bottleneck in multiscale systems. Finally, using the linearization of arbitrary Hamiltonian systems, we extend the multiscale concept to non-Gaussian distributions and draw the connection between posterior geometry and the mass matrix of HMC. For a more comprehensive review of HMC see Neal et al. (2011), or Betancourt (2017) for a more recent review that includes an exposition on NUTS. For a more thorough review of stability analysis for the numerical solution of Hamiltonian systems see (Leimkuhler and Reich, 2004, Ch. 2.6).

2.1 Hamiltonian Monte Carlo

HMC provides samples from an arbitrary distribution over $\mathbb{R}^d$ with density $\pi(\theta)$ by taking Markov transitions that satisfy detailed balance with respect to $\pi$. In HMC, a potential energy surface $U(\theta) = -\log \pi(\theta)$ is defined, along with a kinetic energy $K(\rho) = \frac{1}{2}\rho^\top M^{-1}\rho$ and a Hamiltonian $H(\theta, \rho) = U(\theta) + K(\rho)$. Given a starting point $\theta$ on this surface, an HMC transition begins by sampling a random momentum $\rho$ from a multivariate normal distribution with mean zero and covariance $M$ (Betancourt, 2017). Given these initial values $\theta$ and $\rho$, the Hamiltonian system

$$\frac{d\theta}{dt} = M^{-1}\rho, \qquad \frac{d\rho}{dt} = -\nabla U(\theta) \tag{1}$$

is solved, resulting in a new set of points $\theta^*$ and $\rho^*$ in phase space, as shown in Figure 1. The point $\theta^*$ is then accepted or rejected as a new sample of the distribution, in typical Metropolis fashion, using the acceptance probability defined by

$$\alpha = \min\left(1, \exp\left(H(\theta, \rho) - H(\theta^*, \rho^*)\right)\right). \tag{2}$$

Figure 1: Given a starting point $\theta$, an HMC proposal starts by drawing a momentum $\rho$, then simulating Hamiltonian dynamics for a fixed time to reach a new proposal, $(\theta^*, \rho^*)$. A Metropolis accept-reject step is then used to accept the new proposal randomly, depending on the difference between the Hamiltonian at the original point in $(\theta, \rho)$-space and the Hamiltonian at the proposed point.

In classical HMC, the dynamics defined in (1) are simulated by applying discrete steps of the leapfrog integrator for $L$ steps. These leapfrog steps are discretized by a stepsize $h$, yielding

$$\begin{aligned}
\rho_{n+1/2} &= \rho_n - \tfrac{h}{2}\,\nabla U(\theta_n),\\
\theta_{n+1} &= \theta_n + h\,M^{-1}\rho_{n+1/2},\\
\rho_{n+1} &= \rho_{n+1/2} - \tfrac{h}{2}\,\nabla U(\theta_{n+1}).
\end{aligned} \tag{3}$$

These update equations crucially provide a reversible and volume-preserving transition due to the symplectic property of the leapfrog integrator (Neal et al., 2011). Moreover, for a satisfactory stepsize the Hamiltonian $H$ remains nearly constant over the numerical simulation of the Hamiltonian dynamics.
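To make the transition concrete, the following R sketch implements one leapfrog-based HMC transition for a user-supplied potential energy and gradient; the function names (`leapfrog`, `hmc_step`) and the identity mass matrix are our own illustrative choices, not the implementation used in our experiments.

```r
# Minimal sketch of a leapfrog-based HMC transition, assuming an identity mass
# matrix; U returns the potential energy and grad_U its gradient.
leapfrog <- function(theta, rho, h, L, grad_U) {
  rho <- rho - 0.5 * h * grad_U(theta)          # initial momentum half-step
  for (l in seq_len(L)) {
    theta <- theta + h * rho                    # full position step
    if (l < L) rho <- rho - h * grad_U(theta)   # full momentum step (all but last)
  }
  rho <- rho - 0.5 * h * grad_U(theta)          # final momentum half-step
  list(theta = theta, rho = rho)
}

hmc_step <- function(theta, h, L, U, grad_U) {
  rho <- rnorm(length(theta))                   # momentum ~ N(0, I)
  prop <- leapfrog(theta, rho, h, L, grad_U)
  H_old <- U(theta) + 0.5 * sum(rho^2)          # Hamiltonian at the current point
  H_new <- U(prop$theta) + 0.5 * sum(prop$rho^2)
  if (log(runif(1)) < H_old - H_new) prop$theta else theta   # Metropolis step (2)
}
```

For a standard normal target, for example, `U <- function(theta) 0.5 * sum(theta^2)` and `grad_U <- function(theta) theta` reproduce the dynamics analyzed in the next subsection.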

2.2 Numerical Stability

In practice, the stepsize of the leapfrog integrator is limited by its numerical stability, which is problem-dependent. This concept is easiest to illustrate using a simple system derived from a univariate Gaussian. Specifically, for a univariate Gaussian distribution with mean zero and variance $\sigma^2$, the potential energy function used in HMC is $U(\theta) = \theta^2/(2\sigma^2)$. For an identity mass matrix, this leads to the following leapfrog update rule:

$$\begin{aligned}
\rho_{n+1/2} &= \rho_n - \tfrac{h}{2}\,\frac{\theta_n}{\sigma^2},\\
\theta_{n+1} &= \theta_n + h\,\rho_{n+1/2},\\
\rho_{n+1} &= \rho_{n+1/2} - \tfrac{h}{2}\,\frac{\theta_{n+1}}{\sigma^2},
\end{aligned} \tag{4}$$

which can be compactly written in matrix form as

$$\begin{pmatrix}\theta_{n+1}\\ \rho_{n+1}\end{pmatrix} =
\begin{pmatrix} 1 - \frac{h^2}{2\sigma^2} & h \\ -\frac{h}{\sigma^2}\left(1 - \frac{h^2}{4\sigma^2}\right) & 1 - \frac{h^2}{2\sigma^2}\end{pmatrix}
\begin{pmatrix}\theta_n\\ \rho_n\end{pmatrix}. \tag{5}$$

The update matrix in (5) describes how leapfrog advances forward one step for a univariate Gaussian system with mass one. Because the leapfrog method is symplectic, the determinant of this matrix is always one. However, for $h < 2\sigma$ the eigenvalues are complex and have modulus equal to one, while for $h > 2\sigma$ they are real, with one of the eigenvalues having modulus greater than one (Leimkuhler and Reich, 2004, Ch. 2.6). The latter case results in numerical instability of the integrator. Intuitively, this means that any small numerical errors that inevitably arise in the numerical solution will be successively “magnified” by each application of the leapfrog update matrix, quickly resulting in a solution with unacceptably large error (Figure 2).
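The stability threshold can be checked numerically; the short R sketch below forms the leapfrog update matrix (5) for an assumed standard deviation `sigma` and reports the largest eigenvalue modulus for a few stepsizes.

```r
# Eigenvalue moduli of the leapfrog update matrix (5) for a unit-mass
# univariate Gaussian with standard deviation sigma: stable iff h < 2 * sigma.
leapfrog_update_matrix <- function(h, sigma) {
  a <- 1 - h^2 / (2 * sigma^2)
  b <- -(h / sigma^2) * (1 - h^2 / (4 * sigma^2))
  matrix(c(a, b, h, a), nrow = 2)               # column-major: rows (a, h) and (b, a)
}
sigma <- 0.1
for (h in c(0.15, 0.19, 0.21)) {
  mods <- Mod(eigen(leapfrog_update_matrix(h, sigma))$values)
  cat(sprintf("h = %.2f   max |eigenvalue| = %.3f\n", h, max(mods)))
}
# Below the threshold 2 * sigma = 0.2 both moduli equal one; above it, one exceeds one.
```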

Figure 2: For stepsizes larger than the stability threshold of the problem, the leapfrog method goes unstable, leading to wild numerical solutions whose errors characteristically grow after each step (green trajectory). This is in contrast to numerical solutions below the stability threshold, for which the error stays bounded after a series of steps (red trajectory).

The univariate analysis can be readily extended to a multivariate Gaussian of dimension $d$ with covariance matrix $\Sigma$. With a mass matrix equal to the identity, this results in the following Hamiltonian system:

$$\frac{d\theta}{dt} = \rho, \qquad \frac{d\rho}{dt} = -\Sigma^{-1}\theta. \tag{6}$$

In general, $\Sigma^{-1}$ may have off-diagonal correlation terms that make this derived system coupled. However, the transformation defined by $\tilde{\theta} = Q^\top\theta$ and $\tilde{\rho} = Q^\top\rho$, where $Q$ is a matrix whose columns consist of the eigenvectors of $\Sigma^{-1}$, essentially uncouples the differential equations, yielding

$$\frac{d\tilde{\theta}}{dt} = \tilde{\rho}, \qquad \frac{d\tilde{\rho}}{dt} = -\Lambda\tilde{\theta}, \tag{7}$$

where $\Lambda$ is a diagonal matrix composed of the eigenvalues $\lambda_1, \dots, \lambda_d$ of $\Sigma^{-1}$ (Leimkuhler and Reich, 2004, Ch. 2.6). This decoupling transformation can be thought of as reparameterizing the problem so that the potential energy surface exhibits no correlation (Figure 3).

Figure 3: A decoupling transformation is equivalent to transforming a correlated distribution (left) to an uncorrelated one (middle). Similarly, a transformation can be used to make an arbitrary Gaussian isotropic (right).

Applying the leapfrog update rule (3) to the uncoupled equations (7) elucidates how the issue of numerical stability arises in the case of a multivariate Gaussian. Specifically, if one or more of the eigenvalues $\lambda_i$ of $\Sigma^{-1}$ do not satisfy the inequality $h\sqrt{\lambda_i} < 2$, then the leapfrog update matrix will have a real eigenvalue with modulus greater than one, and instability in the numerical integration will arise. In practice this will lead to poor acceptance rates in the Markov chain as well as to divergences (Betancourt, 2017). Formally, for leapfrog on a multivariate Gaussian with mass matrix $M$ the stepsize must satisfy the inequality

$$h < \frac{2}{\sqrt{\lambda_{\max}\!\left(M^{-1}\Sigma^{-1}\right)}} \tag{8}$$

to ensure stability (here $\lambda_{\max}(\cdot)$ denotes the spectral radius of $M^{-1}\Sigma^{-1}$, defined as the maximum of the absolute values of its eigenvalues).

This condition can severely limit the timestep of leapfrog. Intuitively, the dimensions of the distribution that exhibit high variance, and thus low curvature in the potential energy surface, will have slowly oscillating Hamiltonian trajectories that are very smooth and can be integrated by leapfrog with a reasonable stepsize. Meanwhile, dimensions with low variance, and thus high curvature, will lead to rapidly oscillating solutions that require a very small stepsize. Thus a system that has even one state with a dramatically smaller variance than the others will force leapfrog to take excessively small steps on the whole system. An ODE system that exhibits this sort of behavior is known as a multiscale system (Ascher and Petzold, 1998). In practice, the telltale sign of a distribution that will lead to a multiscale Hamiltonian system in HMC is an inverse covariance matrix $\Sigma^{-1}$ with a large condition number, defined as the ratio of the largest eigenvalue of $\Sigma^{-1}$ to the smallest. Intuitively, the condition number captures the ratio of the curvatures of the dimensions of the potential energy surface. Practically speaking, a large condition number can arise either when an uncorrelated Gaussian has a large disparity between the variances of its largest and smallest dimensions, or when a multivariate Gaussian contains high correlations between dimensions. In engineering, the multiscale problem is often resolved by using an appropriate implicit integrator, which is typically able to take much larger steps on multiscale problems while preserving the accuracy of desired quantities of the system (Ascher and Reich, 1999; Leimkuhler and Matthews, 2016). We describe this approach and its application to HMC in the following section.
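As an illustration of the bound (8), the sketch below computes the largest stable leapfrog stepsize for a given covariance and mass matrix; the diagonal covariance is a made-up example of a mildly multiscale target.

```r
# Largest stable leapfrog stepsize implied by (8): h < 2 / sqrt(lambda_max(M^{-1} Sigma^{-1})).
max_stable_stepsize <- function(Sigma, M = diag(nrow(Sigma))) {
  A <- solve(M) %*% solve(Sigma)
  2 / sqrt(max(abs(eigen(A)$values)))
}
Sigma <- diag(c(1, 1e-4))        # marginal standard deviations 1 and 0.01
max_stable_stepsize(Sigma)       # ~0.02: the smallest scale dictates the stepsize
```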

2.3 Nonlinear Posterior Geometry

While the stability analysis considered so far has been only for Gaussian distributions, which result in linear Hamiltonian systems, most practical distributions being sampled by HMC are not Gaussian and thus result in nonlinear Hamiltonian systems. For nonlinear systems, one can locally identify a multiscale system by analyzing the condition number of the local Hessian of the potential energy surface, $\nabla^2 U(\theta)$. This local Hessian is equivalent to the inverse covariance matrix of the local Laplace approximation to the posterior (Gelman et al., 2013, Chapter 4). Thus the key difference between the posterior geometry of a Gaussian distribution and that of a more complicated distribution is that the former has a potential energy surface with a constant Hessian, while the latter may contain vastly different local Hessians throughout the surface, which can lead to different multiscale properties of the associated Hamiltonian system (Figure 4). In practice, this means that different leapfrog stepsizes may be required, depending on where on the potential energy surface the sampler is currently located. Roughly speaking, RMHMC uses local Hessian evaluations of the potential energy surface to adaptively change this stepsize based on the local curvature (Betancourt, 2013).

Figure 4: Unlike a Gaussian distribution, arbitrary non-Gaussian distributions have non-constant local Hessians (here characterized by the green eigenvectors of the local Hessian multiplied by their associated eigenvalues) that affect the oscillatory frequency of HMC trajectories and thus an acceptable leapfrog stepsize.

2.4 The Mass Matrix as a Reparameterization

Although a multivariate Gaussian system exhibiting multiple scales can significantly hinder the efficiency of the leapfrog method, this deficiency may be resolved by appropriately selecting the mass matrix $M$ in (1). In fact, utilizing a mass matrix in HMC that is equal to the inverse covariance of the Gaussian is equivalent to reparameterizing the problem to an equivalent isotropic Gaussian (Neal et al., 2011) (Figure 3). From a numerical analysis perspective, $M$ can be viewed as a preconditioner that transforms the problem so as to give a better condition number, i.e., it transforms the problem so that there is less discrepancy between the largest and smallest eigenvalues of $M^{-1}\Sigma^{-1}$.
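The preconditioning effect of the mass matrix can be seen by comparing condition numbers; in the sketch below the correlated covariance is an illustrative example, and setting the mass matrix to the precision matrix drives the condition number of $M^{-1}\Sigma^{-1}$ to one.

```r
# Condition number of M^{-1} Sigma^{-1} with and without preconditioning.
cond_no <- function(A) { ev <- abs(eigen(A)$values); max(ev) / min(ev) }
Sigma <- matrix(c(1, 0.99, 0.99, 1), nrow = 2)   # highly correlated 2-D Gaussian
cond_no(solve(Sigma))                            # identity mass matrix: ~199
M <- solve(Sigma)                                # mass matrix set to the precision
cond_no(solve(M) %*% solve(Sigma))               # preconditioned system: 1
```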

While a suitably selected mass matrix can effectively eliminate the multiscale problem of leapfrog for Gaussian distributions, for distributions with more complicated geometry whose local curvature varies, a constant mass matrix is inherently much less effective at accounting for multiscale geometry. Splitting methods such as those of Shahbaba et al. (2014) and Chao et al. (2015), which handle high curvature and multiple scales by separating out a constant Gaussian approximation of the distribution, unfortunately have the same limitation, as they cannot handle the varying curvature of non-Gaussian posteriors.

3 Implicit HMC

The stability bottleneck placed on leapfrog by a multiscale system is characteristic of explicit integrators like leapfrog. In the ODE community, implicit integrators have been used to essentially “skip” fast oscillations while accurately evolving in time quantities of interest in the system (Ascher and Reich, 1999; Leimkuhler and Matthews, 2016). We describe the implicit midpoint integrator: a symplectic alternative to leapfrog that is of the same order of accuracy as leapfrog, but is implicit, allowing it to take much bigger timesteps on multiscale problems than what would otherwise be possible with leapfrog. Unlike the approaches of Shahbaba et al. (2014) and Chao et al. (2015), the implicit midpoint integrator is applicable to arbitrary multiscale systems, not just Gaussian ones.

We explain how, unlike leapfrog, the implicit midpoint method has no stability limit on a linear system. We then describe a custom Newton-Krylov method for the solution of the nonlinear system that must be solved at each timestep by midpoint. Finally, we point out that while the implicit midpoint method is unconditionally stable on a linear system, in practice it can go unstable on nonlinear problems (Ascher and Reich, 1999), although at a much larger stepsize than leapfrog is able to take. We discuss the practical implications of this and aspects of choosing between leapfrog and implicit midpoint for a specific problem, and we give practical guidelines for selecting a stepsize for both methods.

3.1 Implicit Midpoint and Stability

For general Hamiltonian systems of the form in equation (1), the timesteps of implicit midpoint are defined by

$$\theta_{n+1} = \theta_n + \frac{h}{2}\,M^{-1}\left(\rho_n + \rho_{n+1}\right), \qquad
\rho_{n+1} = \rho_n - h\,\nabla U\!\left(\frac{\theta_n + \theta_{n+1}}{2}\right). \tag{9}$$

Because the point $(\theta_{n+1}, \rho_{n+1})$ is defined implicitly, as opposed to leapfrog where an explicitly computable update equation is given, one must resort to either functional iteration or Newton's method to compute its value. However, in situations where implicit midpoint would be desirable over leapfrog, Newton's method or a more robust modification of it is typically preferred (Ascher and Petzold, 1998, Ch. 3.4.2). We elaborate on this point in the subsequent subsection.

An analysis similar to the one performed for leapfrog in Section 2.2 shows that, unlike leapfrog, the implicit midpoint method does not have a stability limit on the linear systems that arise from HMC sampling of a Gaussian distribution. In fact, like the update matrix for leapfrog, the update matrix for midpoint on a Gaussian distribution always has determinant one; however, it can be shown that its eigenvalues are complex with modulus one for any stepsize $h$ (see Appendix A for a full analysis of the stability of implicit midpoint for linear Hamiltonian systems). In other words, the implicit midpoint method is stable on Gaussian systems for any stepsize, i.e., it has no stability limit (Figure 5).
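A quick numerical confirmation of this claim, following the derivation in Appendix A, is sketched below: the implicit midpoint update matrix for a unit-mass univariate Gaussian has eigenvalue moduli equal to one no matter how large the stepsize.

```r
# Implicit midpoint update matrix for a unit-mass univariate Gaussian (cf. Appendix A).
midpoint_update_matrix <- function(h, sigma) {
  B <- matrix(c(1, h / (2 * sigma^2), -h / 2, 1), nrow = 2)  # left-hand matrix in (19)
  A <- matrix(c(1, -h / (2 * sigma^2), h / 2, 1), nrow = 2)  # right-hand matrix in (19)
  solve(B) %*% A
}
sapply(c(0.1, 1, 10, 100), function(h) {
  max(Mod(eigen(midpoint_update_matrix(h, sigma = 0.1))$values))
})  # all equal to 1, regardless of h
```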

Figure 5: Unlike the leapfrog integrator which goes unstable for large enough stepsizes (green trajectory), the implicit midpoint integrator can remain stable at the same stepsize (red trajectory) and even at stepsizes much larger than the stability threshold for leapfrog (purple trajectory).

3.2 Fast Solution of the Nonlinear System

As previously mentioned, (9) must typically be solved numerically. In practice this can be done by applying Newton's method to equation (9); however, this can be costly for large systems. First, because the first set of equations in (9) is linear, one can substitute it into the second to obtain a nonlinear system with $d$ equations and the $d$ unknowns $\rho_{n+1}$. Once $\rho_{n+1}$ is obtained, $\theta_{n+1}$ can be recovered by simply plugging the result back into the first set of equations. The nonlinear system is

$$\rho_{n+1} = \rho_n - h\,\nabla U\!\left(\theta_n + \frac{h}{4}\,M^{-1}\left(\rho_n + \rho_{n+1}\right)\right). \tag{10}$$

Renaming $\rho_{n+1}$ to $x$ and noting that $\theta_n$ and $\rho_n$ are known from the previous timestep, equation (10) can be written as

$$F(x) := x - \rho_n + h\,\nabla U\!\left(\theta_n + \frac{h}{4}\,M^{-1}\left(\rho_n + x\right)\right) = 0. \tag{11}$$

Newton's method applied to this system consists of starting with an initial guess, $x_0$, and iteratively obtaining approximate solutions using the following steps until a convergence criterion is reached:

  1. Solve the linear system $J(x_k)\,\delta_k = -F(x_k)$.

  2. Set $x_{k+1} = x_k + \alpha_k\,\delta_k$,

where for the standard Newton iteration $\alpha_k = 1$. Here, $J(x_k)$ refers to the Jacobian of $F$ evaluated at $x_k$ and is given by

$$J(x) = I + \frac{h^2}{4}\,\nabla^2 U\!\left(\theta_n + \frac{h}{4}\,M^{-1}\left(\rho_n + x\right)\right)M^{-1}. \tag{12}$$
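A direct (dense) Newton solve of (11) can be sketched as follows, assuming an identity mass matrix and user-supplied gradient and Hessian of the potential energy; the tolerance, iteration cap, and function names are illustrative. Sections 3.2.1-3.2.3 describe the modifications we actually use for large systems.

```r
# Sketch of a plain Newton solve of (11) for one implicit midpoint step (M = I).
midpoint_momentum_solve <- function(theta_n, rho_n, h, grad_U, hess_U,
                                    tol = 1e-10, max_iter = 50) {
  Ffun <- function(x) x - rho_n + h * grad_U(theta_n + (h / 4) * (rho_n + x))
  x <- rho_n                                         # initial guess: previous momentum
  for (k in seq_len(max_iter)) {
    mid <- theta_n + (h / 4) * (rho_n + x)
    J <- diag(length(x)) + (h^2 / 4) * hess_U(mid)   # Jacobian (12) with M = I
    x <- x + solve(J, -Ffun(x))                      # full Newton step (no line search)
    if (sqrt(sum(Ffun(x)^2)) < tol) break
  }
  x                                                  # rho_{n+1}
}
```

Given $\rho_{n+1}$, the position update with an identity mass matrix is then simply $\theta_{n+1} = \theta_n + \frac{h}{2}(\rho_n + \rho_{n+1})$.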

For large systems, exploitation of the structure of the system is critical for both efficiency and robustness. We introduce the Newton-Krylov method, which is particularly well-suited to this class of problems.

3.2.1 Newton-Krylov Methods

In the classic Newton method, the linear solve step requires explicitly computing $J(x_k)$ and carrying out a full linear solve for $\delta_k$ at each Newton iteration. This requires evaluating the Hessian of the potential energy, which scales as $O(d^2)$. A Newton-Krylov method (Knoll and Keyes, 2004) circumvents this expensive linear solve at every Newton iteration by replacing it with an approximate linear solve that is much cheaper to compute. Specifically, the Newton-Krylov method replaces the exact solution with an approximate solution $\delta_k$ such that the relative residual $\|F(x_k) + J(x_k)\,\delta_k\| / \|F(x_k)\|$ is less than some tolerance $\eta_k$, otherwise known as a "forcing term". This forcing term is typically selected using a scheduling criterion that satisfies certain theoretical properties. For our experiments we found the most success with method 2.1 of Eisenstat and Walker (1996), as it did not rely on manually selected "tuning parameters".

In our iNUTS implementation, we use the GMRES solver (Saad and Schultz, 1986) to compute the approximate linear solves within the Newton-Krylov iterations. The GMRES algorithm works by iteratively computing approximate solutions $\delta$ to the linear system that reduce the residual norm $\|F(x_k) + J(x_k)\,\delta\|$, while needing only to evaluate the product of the matrix $J(x_k)$ with an arbitrary vector $v$.

Needing only to evaluate the product of the matrix $J(x_k)$ with an arbitrary vector $v$, rather than computing $J(x_k)$ and then computing its product with $v$, is particularly advantageous in HMC for two reasons. First, the Jacobian-vector product can be evaluated using only a Hessian-vector product of the potential energy, which, in an efficient automatic differentiation implementation such as the one available in Stan (Carpenter et al., 2017), scales as $O(d)$, as opposed to computing a full Hessian, which scales as $O(d^2)$. Second, the iterated solutions in Krylov-based solvers are known to converge faster when the matrix has "clusters" of eigenvalues that are close together (Meyer, 2000), as is the case when the system being solved comes from the posterior of a Bayesian model, particularly a multilevel Bayesian model. Specifically, in the case of implicit midpoint applied to HMC on the posterior of a Bayesian multilevel model, the matrix (12) will have clusters of eigenvalues that are all of similar order and correspond to the units in the multilevel model that are all at the same level. For example, for the famous "eight schools" model in Gelman et al. (2013), there is a variance parameter at the scale of 1 and eight separate school-level parameters at the scale of 10. Although there are nine parameters in total, and thus a nine-by-nine system to solve, in practice the Newton-Krylov method can typically solve the linear system to numerical precision in only two iterations, because there are only two "clusters" of eigenvalues in the Hessian: the cluster at the scale of 1 and the cluster at the scale of 10.
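The matrix-free product that GMRES needs can be sketched as below; here the Hessian-vector product is approximated with a central difference of the gradient, purely for illustration (in an automatic differentiation system such as Stan's it would be computed exactly), and the function names are our own.

```r
# Jacobian-vector product J(x) v = v + (h^2 / 4) * Hessian(U)(theta_mid) v, with M = I,
# using only gradient evaluations via a finite-difference Hessian-vector product.
hess_vec <- function(grad_U, theta, v, eps = 1e-6) {
  (grad_U(theta + eps * v) - grad_U(theta - eps * v)) / (2 * eps)
}
jac_vec <- function(grad_U, theta_mid, h, v) {
  v + (h^2 / 4) * hess_vec(grad_U, theta_mid, v)
}
```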

3.2.2 Using a Line Search

The second modification we make to the classic Newton method is to change the steplength of the Newton iteration to ensure that consecutive iterations of the Newton-Krylov method effectively reduce the residual error in the nonlinear system. In particular, we use a geometric line search that continually halves the steplength $\alpha_k$ until the new iterate satisfies the Armijo-Goldstein condition (Quarteroni et al., 2010).

3.2.3 Choosing an Initial Guess

While not a modification of Newton's method per se, in our iNUTS implementation we also use a particular method of setting the initial guess for the nonlinear solver. In practice, this greatly reduces the number of Newton-Krylov iterations needed for convergence. In particular, we exploit the numerical properties of the implicit midpoint integrator on multiscale systems to obtain a good initial guess. Specifically, when the stepsize of the numerical integrator is small relative to the local period of a particular oscillating momentum coordinate, the previous momentum, $\rho_n$, will serve as a good initial guess for $\rho_{n+1}$ in the system (10), as the numerical solution will not be changing much between consecutive steps. Similarly, the second-to-last momentum, $\rho_{n-1}$, will also serve as a good initial guess, although not quite as good as the last value. On the other hand, when $h$ is much larger than the period of a particular oscillating variable, which will be the case in a multiscale system, the numerical solution will approximately "alternate" about zero, so that $\rho_{n+1} \approx \rho_{n-1}$. Thus in this case, the second-to-last momentum serves as a good initial guess. We therefore use the second-to-last momentum, $\rho_{n-1}$, as the initial guess overall for the nonlinear equation solver.

3.3 Nonlinear Stability Limit and Choosing an Integrator and a Stepsize

While the implicit midpoint method is unconditionally stable for any stepsize on a linear system, it can and will exhibit instability for certain nonlinear systems, although typically at a stepsize that is much larger than the stepsize at which leapfrog would go unstable on the same problem (Ascher and Reich, 1999). In practice, these instabilities can typically be identified by observing large growth in the Hamiltonian of the numerical trajectory, or by Newton-Krylov iterations that fail to converge. To find an appropriate stepsize for implicit midpoint in practice, we recommend running a sampler "warmup" period where the stepsize is reduced by a constant factor until these pathological behaviors are eliminated. This is akin to the current warmup method of Stan, where the stepsize of the integrator is reduced until an acceptable acceptance rate is reached (Carpenter et al., 2017).

To select between implicit midpoint and leapfrog on a specific problem, we suggest starting with implicit midpoint and periodically calculating an approximation to the largest eigenvalue of the local Hessian of the potential energy function during a warmup period. This eigenvalue approximation can be computed efficiently with only a few Hessian-vector products using the power method of numerical linear algebra (Meyer, 2000). It can in turn be used to approximate, via equation (8), the largest stepsize that leapfrog could take while maintaining stability. When the smallest of these approximate leapfrog stepsizes divided by two is comparable to or larger than the implicit midpoint stepsize needed to maintain stability divided by the average number of gradient evaluations plus Hessian-vector products required per implicit midpoint step, then the leapfrog method will be more efficient for the problem. This comparison is sensible because the leapfrog method requires two gradient evaluations per step, and a Hessian-vector product, like a gradient evaluation, scales linearly in the number of inputs in an automatic differentiation system such as the one used in Stan. In practice, we found that computing a Hessian-vector product was 1.2-1.3 times slower than a gradient evaluation in Stan.
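A power-method sketch for the largest local Hessian eigenvalue, built on the Hessian-vector product sketched in Section 3.2.1, is given below; the iteration count is an arbitrary illustrative choice, and the estimate is only meaningful where the dominant eigenvalue is positive.

```r
# Power iteration for the largest eigenvalue of the local Hessian of U, using
# only Hessian-vector products; with (8) this gives an approximate leapfrog
# stability bound h < 2 / sqrt(lambda_max) for an identity mass matrix.
largest_hessian_eigenvalue <- function(grad_U, theta, iters = 30) {
  v <- rnorm(length(theta))
  for (i in seq_len(iters)) {
    w <- hess_vec(grad_U, theta, v)       # hess_vec as sketched in Section 3.2.1
    v <- w / sqrt(sum(w^2))
  }
  sum(v * hess_vec(grad_U, theta, v))     # Rayleigh quotient estimate
}
```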

4 Experiments

We illustrate on several examples how implicit midpoint can achieve superior efficiency over leapfrog on multiscale problems. For all of our examples we compare using a custom implementation of NUTS that either uses leapfrog (lfNUTS) or implicit midpoint (iNUTS). Our code is freely available as an R package, and at the time of writing can be obtained by emailing the authors.

4.1 2-D Gaussian

As discussed in Section 2.2, the leapfrog stability limit for a multivariate Gaussian is directly related to the largest eigenvalue of the inverse covariance matrix. One way this eigenvalue can become large is when the multivariate Gaussian is highly correlated. To illustrate how this affects NUTS in practice, we compared the average tree depth per sample of lfNUTS and iNUTS for varying levels of the correlation $r$ on the two-dimensional Gaussian distribution with mean zero and covariance matrix

$$\Sigma = \begin{pmatrix} 1 & r \\ r & 1 \end{pmatrix}. \tag{13}$$

The empirical scaling of average tree depth over 1,000 NUTS samples versus $r$ is shown in Figure 6. In the NUTS algorithm, the tree depth is selected automatically, when a Hamiltonian trajectory has "U-turned"; two to the power of the tree depth is the number of steps computed in the trajectory. For more correlated distributions, leapfrog is forced to take smaller stepsizes, which results in more steps having to be taken and thus larger tree depths. Note that for a Gaussian system, implicit midpoint has no stepsize limit for stability, regardless of the covariance of the Gaussian. Thus the stepsize can be chosen large enough that a tree depth of only one is required.

Figure 6: For highly correlated Gaussian distributions, the leapfrog integrator in lfNUTS is forced to take tiny stepsizes that result in larger average NUTS tree depths. As the correlation goes to zero, the average tree depth approaches that of iNUTS.

4.2 Banana Distribution

To illustrate how the implicit midpoint integrator is advantageous for distributions with nonlinear correlations that cannot be addressed by a mass matrix, we compared lfNUTS and iNUTS on a highly correlated banana distribution (Long et al., 2013), with fixed shape parameters, whose probability density function (PDF) takes the following form:

(14)

The highly nonlinear correlation structure of this distribution is typical of complicated non-Gaussian distributions. For HMC, the long direction along the length of the banana compared to its narrow width creates a prototypical multiscale problem, forcing leapfrog to take an excessively small stepsize that leads to insufficient exploration of the posterior in lfNUTS compared to iNUTS (Figure 7).

Figure 7: 1,000 samples of the banana distribution from lfNUTS and iNUTS. Because of the multiscale nature of the Hamiltonian system that arises when sampling the banana distribution, lfNUTS is forced to use an unreasonably small stepsize that nearly reduces its behavior to that of a random walk and renders it unable to explore the entire distribution (orange samples). Meanwhile, iNUTS can take a stepsize roughly ten times larger, which allows it to efficiently explore the entire posterior.

4.3 Neal’s Funnel

Neal's funnel is a highly non-Gaussian distribution over $\mathbb{R}^{d+1}$ defined as follows (Neal, 2003):

$$v \sim \mathcal{N}(0, 3^2), \tag{15}$$
$$x_i \mid v \sim \mathcal{N}(0, e^{v}), \qquad i = 1, \dots, d. \tag{16}$$
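For reference, the (unnormalized) potential energy and gradient implied by (15)-(16) can be written down directly; the R sketch below uses our own function names and parameterizes the state as theta = (v, x_1, ..., x_d).

```r
# Potential energy U = -log p (up to a constant) and gradient for Neal's funnel.
funnel_U <- function(theta) {
  v <- theta[1]; x <- theta[-1]; d <- length(x)
  v^2 / 18 + d * v / 2 + exp(-v) * sum(x^2) / 2
}
funnel_grad_U <- function(theta) {
  v <- theta[1]; x <- theta[-1]; d <- length(x)
  c(v / 9 + d / 2 - exp(-v) * sum(x^2) / 2,   # derivative with respect to v
    exp(-v) * x)                              # derivatives with respect to x_i
}
```

The curvature in each $x_i$ direction is $e^{-v}$, so the local Hessian, and with it the stable leapfrog stepsize, changes by orders of magnitude as the sampler moves between the mouth and the neck of the funnel.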

This is a particularly important distribution because it captures the characteristics of typical Bayesian hierarchical models that are known to give any sampler difficulty. In particular, the potential energy surface of the distribution exhibits a multiscale nature that changes drastically as a function of $v$, causing trouble for lfNUTS compared to iNUTS. To quantify this difference, we took 1,000 samples of an 11-dimensional Neal's funnel using both lfNUTS and iNUTS. The scatter plot in Figure 8, which depicts these samples for the first two dimensions, illustrates the widely varying curvature of the distribution. For our experiments we used an identity mass matrix and picked the leapfrog stepsize according to the hand-tuned value reported in Betancourt (2013). For the implicit midpoint stepsize we picked the largest stepsize that did not produce Newton convergence errors during sampling.

Figure 8: 1,000 samples of the 11-dimensional Neal's funnel distribution using lfNUTS and iNUTS. The curvature of the funnel varies drastically, at times being quite multiscale in nature (neck of the funnel) and at other times being more even in nature (mouth of the funnel). The multiscale regime forces an unreasonably small stepsize on lfNUTS, causing the sampler to fail to explore the entire distribution in a reasonable amount of time. Meanwhile, iNUTS is able to maintain a reasonable stepsize and can sample efficiently.

A summary of the results comparing the two methods is presented in Table 1. Because the implicit midpoint steps within iNUTS require a nonlinear solve, there is much more overhead per step compared to leapfrog. However, because implicit midpoint has a much larger stability limit than leapfrog, iNUTS is able to take a much larger step, leading to more efficient samples. In practice, iNUTS ends up needing only about a third of the work of lfNUTS to obtain an effective sample. In terms of computer time, iNUTS took about 28% longer (231 s versus 181 s) to obtain over ten times as many effective samples.

                              lfNUTS    iNUTS
avg. grad evals/step            2.00    11.98
avg. hess-vec evals/step        0.00     9.88
avg. work/step                  2.00    21.86
avg. tree depth                 9.97     5.40
avg. effective samples         42.47   613.26
work/effective sample          47.09    16.11
total computer time (s)       181      231
Table 1: A comparison of the computation involved in the sampling experiment. Average gradient evaluations and Hessian-vector evaluations are per step of the integrator, and work is the sum of these. Average effective samples are over all 11 dimensions being sampled. While iNUTS requires more work per step, because it is able to take a much larger step than leapfrog, it can obtain many more effective samples in a similar time frame.

5 Discussion

We have shown that distributions exhibiting multiple scales limit the stepsize of the leapfrog integrator in HMC, and how this limitation can be effectively circumvented by utilizing the implicit midpoint integrator together with Newton-Krylov iteration. Furthermore, we offered a practical implementation of implicit midpoint that is applicable to Bayesian posterior sampling problems, and provided practical guidelines for choosing which integrator to use, as well as the stepsize to use in the integrator. As illustrated in our examples, using implicit midpoint together with Newton-Krylov instead of leapfrog can provide a practical and significant efficiency boost in the context of NUTS.

Appendix A Eigenvalues of the Update Matrix for Implicit Midpoint on a Linear System

For the simple univariate Gaussian system with mass one defined in Section 2.2, the update equations (9) reduce to

$$\theta_{n+1} = \theta_n + \frac{h}{2}\left(\rho_n + \rho_{n+1}\right), \qquad
\rho_{n+1} = \rho_n - \frac{h}{2\sigma^2}\left(\theta_n + \theta_{n+1}\right). \tag{17}$$

In matrix notation, these update equations can be rewritten as

$$\begin{pmatrix}\theta_{n+1}\\ \rho_{n+1}\end{pmatrix} =
\begin{pmatrix}\theta_n\\ \rho_n\end{pmatrix} +
\frac{h}{2}\begin{pmatrix} 0 & 1 \\ -\frac{1}{\sigma^2} & 0\end{pmatrix}
\begin{pmatrix}\theta_n + \theta_{n+1}\\ \rho_n + \rho_{n+1}\end{pmatrix}. \tag{18}$$

Collecting the terms involving the new point on the left yields

$$\begin{pmatrix} 1 & -\frac{h}{2} \\ \frac{h}{2\sigma^2} & 1\end{pmatrix}
\begin{pmatrix}\theta_{n+1}\\ \rho_{n+1}\end{pmatrix} =
\begin{pmatrix} 1 & \frac{h}{2} \\ -\frac{h}{2\sigma^2} & 1\end{pmatrix}
\begin{pmatrix}\theta_n\\ \rho_n\end{pmatrix}, \tag{19}$$

which yields the update equation

$$\begin{pmatrix}\theta_{n+1}\\ \rho_{n+1}\end{pmatrix} =
\frac{1}{1 + \frac{h^2}{4\sigma^2}}
\begin{pmatrix} 1 - \frac{h^2}{4\sigma^2} & h \\ -\frac{h}{\sigma^2} & 1 - \frac{h^2}{4\sigma^2}\end{pmatrix}
\begin{pmatrix}\theta_n\\ \rho_n\end{pmatrix}. \tag{20}$$

Note that the eigenvalues of the scalar times the matrix in this equation are simply the eigenvalues of the matrix times the scalar in front. Defining $a = 1 - \frac{h^2}{4\sigma^2}$, the eigenvalues of the matrix are the solutions of the quadratic equation

$$\lambda^2 - 2a\lambda + \left(a^2 + \frac{h^2}{\sigma^2}\right) = 0, \tag{21}$$

which by the quadratic formula are

$$\lambda = a \pm i\,\frac{h}{\sigma}. \tag{22}$$

These eigenvalues have modulus $\sqrt{a^2 + h^2/\sigma^2} = 1 + \frac{h^2}{4\sigma^2}$, which exactly cancels the scalar in front of the matrix in (20). Thus the moduli of the eigenvalues of the overall update matrix are one for any stepsize $h$.

References

  • Ascher and Petzold [1998] Uri M Ascher and Linda R Petzold. Computer Methods for Ordinary Differential Equations and Differential-Algebraic Equations, volume 61. SIAM, 1998.
  • Ascher and Reich [1999] Uri M Ascher and Sebastian Reich. On some difficulties in integrating highly oscillatory Hamiltonian systems. In Computational Molecular Dynamics: Challenges, Methods, Ideas, pages 281–296. Springer, 1999.
  • Beskos et al. [2013] Alexandros Beskos, Natesh Pillai, Gareth Roberts, Jesus-Maria Sanz-Serna, Andrew Stuart, et al. Optimal tuning of the hybrid Monte Carlo algorithm. Bernoulli, 19(5A):1501–1534, 2013.
  • Betancourt [2013] Michael Betancourt. A general metric for Riemannian manifold Hamiltonian Monte Carlo. In Geometric science of information, pages 327–334. Springer, 2013.
  • Betancourt [2017] Michael Betancourt. A conceptual introduction to Hamiltonian Monte Carlo. arXiv preprint arXiv:1701.02434, 2017.
  • Carpenter et al. [2017] Bob Carpenter, Andrew Gelman, Matthew D Hoffman, Daniel Lee, Ben Goodrich, Michael Betancourt, Marcus Brubaker, Jiqiang Guo, Peter Li, and Allen Riddell. Stan: A probabilistic programming language. Journal of Statistical Software, 76(1), 2017.
  • Chao et al. [2015] Wei-Lun Chao, Justin Solomon, Dominik Michels, and Fei Sha. Exponential integration for Hamiltonian Monte Carlo. In International Conference on Machine Learning, pages 1142–1151, 2015.
  • Duane et al. [1987] Simon Duane, Anthony D Kennedy, Brian J Pendleton, and Duncan Roweth. Hybrid Monte Carlo. Physics Letters B, 195(2):216–222, 1987.
  • Eisenstat and Walker [1996] Stanley C Eisenstat and Homer F Walker. Choosing the forcing terms in an inexact Newton method. SIAM Journal on Scientific Computing, 17(1):16–32, 1996.
  • Gelman et al. [2013] Andrew Gelman, Hal S Stern, John B Carlin, David B Dunson, Aki Vehtari, and Donald B Rubin. Bayesian Data Analysis. Chapman and Hall/CRC, 2013.
  • Hairer et al. [2006] Ernst Hairer, Christian Lubich, and Gerhard Wanner. Geometric numerical integration: structure-preserving algorithms for ordinary differential equations, volume 31. Springer Science & Business Media, 2006.
  • Hoffman and Gelman [2014] Matthew D Hoffman and Andrew Gelman. The No-U-Turn sampler: Adaptively setting path lengths in Hamiltonian Monte Carlo. Journal of Machine Learning Research, 15(1):1593–1623, 2014.
  • Knoll and Keyes [2004] Dana A Knoll and David E Keyes. Jacobian-free Newton–Krylov methods: a survey of approaches and applications. Journal of Computational Physics, 193(2):357–397, 2004.
  • Leimkuhler and Matthews [2016] Ben Leimkuhler and Charles Matthews. Molecular Dynamics. Springer, 2016.
  • Leimkuhler and Reich [2004] Benedict Leimkuhler and Sebastian Reich. Simulating Hamiltonian Dynamics, volume 14. Cambridge University Press, 2004.
  • Long et al. [2013] Andrew W Long, Kevin C Wolfe, Michael J Mashner, and Gregory S Chirikjian. The banana distribution is Gaussian: A localization study with exponential coordinates. Robotics: Science and Systems VIII; MIT Press: Cambridge, MA, USA, pages 265–272, 2013.
  • Metropolis et al. [1953] Nicholas Metropolis, Arianna W Rosenbluth, Marshall N Rosenbluth, Augusta H Teller, and Edward Teller. Equation of state calculations by fast computing machines. The Journal of Chemical Physics, 21(6):1087–1092, 1953.
  • Meyer [2000] Carl D Meyer. Matrix Analysis and Applied Linear Algebra, volume 71. SIAM, 2000.
  • Neal [2003] Radford M Neal. Slice sampling. Annals of Statistics, pages 705–741, 2003.
  • Neal et al. [2011] Radford M Neal et al. MCMC using Hamiltonian dynamics. Handbook of Markov Chain Monte Carlo, 2(11):2, 2011.
  • Okudo and Suzuki [2015] Michiko Okudo and Hideyuki Suzuki. Hamiltonian Monte Carlo with explicit, reversible, and volume-preserving adaptive step size control. JSIAM Letters, 9:33–36, 2015.
  • Quarteroni et al. [2010] Alfio Quarteroni, Riccardo Sacco, and Fausto Saleri. Numerical Mathematics, volume 37. Springer Science & Business Media, 2010.
  • Saad and Schultz [1986] Youcef Saad and Martin H Schultz. GMRES: A generalized minimal residual algorithm for solving nonsymmetric linear systems. SIAM Journal on Scientific and Statistical Computing, 7(3):856–869, 1986.
  • Shahbaba et al. [2014] Babak Shahbaba, Shiwei Lan, Wesley O Johnson, and Radford M Neal. Split Hamiltonian Monte Carlo. Statistics and Computing, 24(3):339–349, 2014.