Magnetic Hamiltonian Monte Carlo

07/10/2016 ∙ by Nilesh Tripuraneni, et al. ∙ 0

Hamiltonian Monte Carlo (HMC) exploits Hamiltonian dynamics to construct efficient proposals for Markov chain Monte Carlo (MCMC). In this paper, we present a generalization of HMC which exploits non-canonical Hamiltonian dynamics. We refer to this algorithm as magnetic HMC, since in 3 dimensions a subset of the dynamics map onto the mechanics of a charged particle coupled to a magnetic field. We establish a theoretical basis for the use of non-canonical Hamiltonian dynamics in MCMC, and construct a symplectic, leapfrog-like integrator allowing for the implementation of magnetic HMC. Finally, we exhibit several examples where these non-canonical dynamics can lead to improved mixing of magnetic HMC relative to ordinary HMC.

READ FULL TEXT VIEW PDF
POST COMMENT

Comments

There are no comments yet.

Authors

page 1

page 2

page 3

page 4

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

Probabilistic inference in complex models generally requires the evaluation of intractable, high-dimensional integrals. One powerful and generic approach to inference is to use Markov chain Monte Carlo (MCMC) methods to generate asymptotically exact (but correlated) samples from a posterior distribution for inference and learning. Hamiltonian Monte Carlo (HMC) (Duane et al., 1987; Neal, 2011)

is a state-of-the-art MCMC method which uses gradient information from an absolutely continuous target density to encourage efficient sampling and exploration. Crucially, HMC utilizes proposals inspired by Hamiltonian dynamics (corresponding to the classical mechanics of a point particle) which can traverse long distances in parameter space. HMC, and variants like NUTS (which eliminates the need to hand-tune the algorithm’s hyperparameters), have been successfully applied to a large class of probabilistic inference problems where they are often the gold standard for (asymptotically) exact inference

(Neal, 1996; Hoffman & Gelman, 2014; Carpenter et al., 2016).

Figure 1: Example sample paths for standard HMC (left) and MHMC (right) for an isotropic Gaussian target distribution.

In this paper, we first review important properties of Hamiltonian dynamics, namely energy-preservation, symplecticity, and time-reversibility, and derive a more general class of dynamics with these properties which we refer to as non-canonical Hamiltonian dynamics. We then discuss the relationship of non-canonical Hamiltonian dynamics to well-known variants of HMC and propose a novel extension of HMC. We refer to this method as magnetic HMC (see Algorithm 1) since it corresponds to a particular subset of the non-canonical dynamics that in 3 dimensions map onto to the mechanics of a charged particle coupled to a magnetic field – see Figure 1 for an example of these dynamics. Furthermore, we construct an explicit, symplectic, leapfrog-like integrator for magnetic HMC which allows for an efficient numerical integration scheme comparable to that of ordinary HMC. Finally, we evaluate the performance of magnetic HMC on several sampling problems where we show how its non-canonical dynamics can lead to improved mixing. The proofs of all results in this paper are presented in the corresponding sections of the Appendix.

2 Markov chain Monte Carlo

Given an unnormalized target density defined on , an MCMC algorithm constructs an ergodic Markov chain such that the distribution of converges to (e.g. in total variation) (Robert & Casella, 2004). Often, the transition kernel of such a Markov chain is specified by the Metropolis-Hastings (MH) algorithm which (i) given the current state , proposes a new state by sampling from a proposal distribution , and (ii) sets

with probability

and otherwise. The role of the acceptance step is to enforce reversibility (or detailed balance) of the Markov chain with respect to – which implies is a stationary distribution of the transition kernel.

Heuristically, a good MH algorithm should have low inter-sample correlation while maintaining a high acceptance ratio. Hamiltonian Monte Carlo provides an elegant mechanism to do this by simulating a particle moving along the contour lines of a dynamical system, constructed from the target density, to use as a MCMC proposal.

2.1 Hamiltonian Monte Carlo

In Hamiltonian Monte Carlo, the target distribution is augmented with “momentum” variables which are independent of the

variables but of equal dimension. For the remainder of the paper, we take the distribution over the momentum variables to be Gaussian, as is common in the literature (indeed, there is evidence that in many cases, the choice of a Gaussian distribution may be optimal

(Betancourt, 2017)). The joint target distribution is therefore:

(1)

Crucially, this augmentation allows Hamiltonian dynamics to be used as a proposal for an MCMC algorithm over the space , where we interpret (resp., ) as position (resp., momentum) coordinates of a physical particle with total energy , given by the sum of its potential energy and kinetic energy . We briefly review the Markov chain construction below; see (Neal, 2011) or (Duane et al., 1987) for a more detailed description. Given the Markov chain state at time , the new state for time is obtained by first resampling momentum , and then proposing a new state according to the following steps: (i) Simulate the deterministic Hamiltonian flow defined by the differential equation

(2)

for time , with initial condition , to obtain 111Throughout this paper, we use to denote the map that takes a given position-momentum pair as initial conditions for the Hamiltonian flow associated with for time . In addition, denotes the composition of with the momentum flip map .; (ii) Flip the resulting momentum component with the map to obtain ; (iii) Apply a MH-type accept/reject step to enforce detailed balance with respect to the target distribution; (iv) Flip the momentum again with so it points in the original direction.

Note that because the map is time-reversible (in the sense that if the path is a solution to (2) then the path with negated momentum traversed in reverse is also a solution), the map is self-inverse. From this, the acceptance ratio in step (iii) enforcing detailed balance can be shown (see e.g. (Green, 1995)) to have the form:

(3)

Note that the Hamiltonian flow & momentum flip operator is volume-preserving222In fact the Hamiltonian flow satisfies the stronger condition of symplecticity with respect to the matrix () which immediately implies it is volume-preserving by taking determinants of this relation., which immediately yields that the Jacobian term in the acceptance ratio (3) is simply 1. The acceptance probability therefore reduces to . Furthermore, since the Hamiltonian flow defined in (2) is energy-preserving (i.e. 

) – the acceptance ratio is identically 1. Moreover, the momentum resampling in (i) and momentum flip in (iv) both leave the joint distribution invariant.

While the momentum resampling ensures the Markov chain explores the joint space, the proposals inspired by Hamiltonian dynamics can traverse long distances in parameter space , reducing the random-walk behavior of MH that often results in highly correlated samples (Neal, 2011).

2.2 Symplectic Numerical Integration

Unfortunately, it is rarely possible to integrate the flow defined in (2) analytically; instead an efficient numerical integration scheme must be used to generate a proposal for the MH-type accept/reject test. Typically, the leapfrog (Störmer-Verlet) integrator is used since it is an explicit method that is both symplectic and time-reversible (Neal, 2011). One elegant way to motivate this integrator is by decomposing the Hamiltonian into a symmetric splitting:

(4)

and then defining and to be the exactly-integrated flows for the sub-Hamiltonians and , respectively. These updates (which are equivalent to Euler translations) can be written:

(5)

since the Hamilton equations (2) for the sub-Hamiltonians and are linear, and hence analytically integrable. One leapfrog step is then defined as:

(6)

with the overall proposal given by leapfrog steps, followed by the momentum flip operator as before:

As each of the flows , exactly integrates a sub-Hamiltonian, they inherit the symplecticity, volume-preservation, and time-reversibility of the exact dynamics. Moreover, since the composition of symplectic flows is also symplectic and the splitting scheme is symmetric (implying the composition of time-reversible flows is also time-reversible), the Jacobian term in the acceptance probability (3) is exactly 1 as in the case of perfect simulation.

The leapfrog scheme will not exactly preserve the Hamiltonian , so the remaining acceptance ratio must be calculated. However, the leapfrog integrator has error in one leapfrog step (Hairer et al., 2006). This error scaling will lead to good energy conservation properties (and thus high acceptance rates in the MH step), even when simulating over long trajectories.

3 Non-Canonical Hamiltonian Monte Carlo

In Section 2, we noted the role time-reversibility, volume-preservation, and energy conservation of canonical Hamiltonian dynamics play in making them useful candidates for MCMC. In this section, we develop the properties of a general class of flows we refer to as non-canonical Hamiltonian systems that parallel these properties, we use to construct our method magnetic HMC (see Algorithm 1):

Lemma 1.

The map defined by integrating the non-canonical Hamiltonian system

(7)

with initial conditions for time , where is any invertible, antisymmetric matrix induces a flow on the coordinates that is still energy-conserving and symplectic with respect to which also implies volume-preservation of the flow.

Within the formal construction of classical mechanics, it is known that any Hamiltonian flow defined on the cotangent bundle of a configuration manifold, which is equipped with an arbitrary symplectic 2-form, will preserve its symplectic structure and admit the corresponding Hamiltonian as a first integral invariant (Arnold, 1989). The statement of Lemma 1 is simply a restatement of this fact grounded in a coordinate system. Similar arbitrary, antisymmetric terms have also appeared in the study of MCMC algorithms based on diffusion processes; such samplers often do not enforce detailed balance with respect to the target density and are often implemented as discretizations of stochastic differential equations (Rey-Bellet & Spiliopoulos, 2015; Ma et al., 2015), in contrast to the approach taken here.

Our second observation is that the dynamics in (17) are not time-reversible in the traditional sense. Instead, if we consider the parametrization of as:

(8)

where , are antisymmetric and is taken to be general such that is invertible, then the non-canonical dynamics have a (pseudo) time-reversibility symmetry:

Lemma 2.

If is a solution to the non-canonical dynamics:

(9)

then is a solution to the modified non-canonical dynamics:

(10)

if . In particular if then , which reduces to the traditional time-reversal symmetry of canonical Hamiltonian dynamics.

Lemma 1 suggests a generalization of HMC that can utilize an arbitrary invertible antisymmetric matrix in its dynamics; however Lemma 2 indicates the non-canonical dynamics lack a traditional time-reversibility symmetry which poses a potential difficulty to satisfying detailed balance. In particular, we cannot compose with an exact/approximate simulation of to make self-inverse.

Our solution to obtaining a time-reversible proposal is simply to flip the elements of the and matrices just as ordinary HMC flips the auxiliary variable i) at the end of Hamiltonian flow in the proposal and ii) once again after the MH acceptance step to return to its original direction. In this vein, we view the parameters and as auxiliary variables in the state space, and simultaneously flip , , and after having simulated the dynamics, rendering the proposal time-reversible according to Lemma 2 – see Section 2 in the Appendix for full details of this construction. This ensures that detailed balance is satisfied for this entire proposal. To avoid “random walk” behaviour in the resulting Markov chain, we can apply a sign flip to and , in addition to , to return them to their original directions after the MH acceptance step.

The validity of this construction relies on equipping and with symmetric auxiliary distributions. For the remainder of this paper, we further restrict to binary symmetric auxiliary distributions supported on a given antisymmetric matrix and its sign flip – see Appendix 1.1 for full details. This restriction is not necessary, but gives rise to a simple and interpretable class of algorithms, which is in spirit closest to using fixed parameters and , whilst ensuring the proposal satisfies detailed balance. This construction is also reminiscent of lifting constructions prevalent in the discrete Markov chain literature (Chen et al., 1999); heuristically, the signed variables and favour proposals in opposing directions.

3.1 Symplectic Numerical Integration for Non-Canonical Dynamics

As with standard HMC, exactly simulating the flow is rarely tractable, and a numerical integrator is required to approximate the flow. It is not absolutely necessary to use an explicit, symplectic integration scheme; indeed implicit integrators are used in Riemannian HMC to maintain symplecticity of the proposal which comes at a greater complexity and computational cost (Girolami et al., 2009). However explicit, symplectic integrators are simple, have good energy-conservation properties, and are volume-preserving/time-reversible (Hairer et al., 2006), so for the present discussion we restrict our attention to investigating leapfrog-like schemes.

We begin, as in Section 2.2, by considering the symmetric splitting (34), yielding the sub-Hamiltonians , . The corresponding non-canonical dynamics for the sub-Hamiltonians and are:

and:

We denote the corresponding flows by and respectively. The flow is generally not explicitly tractable unless we take – in which case it is solved by an Euler translation as before. Crucially, the flow in is a linear differential equation and hence analytically integrable. If is invertible (and ) then:

(11)

See the Appendix for a detailed derivation which also handles the general case where is not invertible. Thus when , the flows and are analytically tractable and will inherit the generalized symplecticity and (pseudo) time-reversibility of the exact dynamics in (17). Therefore if we use the symmetric splitting (34) to construct a leapfrog-like step:

(12)

we can construct a total proposal that consists of several leapfrog steps, followed by a flip of the momentum and , , which will be a volume-preserving, self-inverse map:

(13)

Henceforth we will always take when we use to generate leapfrog proposals, which interestingly corresponds to a magnetic dynamics as discussed in Lemma 4. A full description of the magnetic HMC algorithm using this numerical integrator is described in Section 5.

4 Special Cases

Here, we describe several tractable subcases of the general formulation of non-canonical Hamiltonian dynamics since these they have interesting physical interpretations.

4.1 Mass Preconditioned Dynamics

One simple variant of HMC is preconditioned HMC where (Neal, 2011), and can be implemented nearly identically to ordinary HMC. We note that preconditioning can be recovered within our framework using a simple form for the non-canonical matrix:

Lemma 3.

i) Preconditioned HMC with momentum variable in the coordinates is exactly equivalent to simulating non-canonical HMC with and the non-canonical matrix , and then transforming back to coordinates using . Here is a Cholesky factor for .

ii) On the other hand if we apply a change of basis (via an invertible matrix

) to our coordinates , simulate HMC in the coordinates, and transform back to the original basis using , this is exactly equivalent to non-canonical HMC with

Lemma 3 illustrates a fact alluded to in (Neal, 2011); using a mass preconditioning matrix and a change of basis given by matrix acting on leaves the HMC algorithm invariant.

4.2 Magnetic Dynamics

The primary focus of this paper is to investigate the subcase of the dynamics where:

(14)

for two important reasons333Note that the effect of a non-identity matrix can be achieved by simply composing these magnetic dynamics with a coordinate-transformation as suggested in Lemma 3.. Firstly for this choice of we can construct an explicit, symplectic, leapfrog-like integration scheme which is important for developing an efficient HMC sampler as discussed in Section 3.1. Secondly, the dynamics have an interesting physical interpretation quite distinct from mass preconditioning and other HMC variants like Riemannian HMC (Girolami et al., 2009):

Lemma 4.

In 3 dimensions the non-canonical Hamiltonian dynamics corresponding to the Hamiltonian and matrix as in (14) are equivalent to the Newtonian mechanics of a charged particle (with unit mass and charge) coupled to a magnetic field (given by a particular function of - see Appendix):

This interpretation is perhaps surprising since Hamiltonian formulations of classical magnetism are uncommon, although the quantum mechanical treatment naturally incorporates a Hamiltonian framework. However, in light of Lemma 3 we might wonder if by a clever rewriting of the Hamiltonian we can reproduce this system of ODEs using the canonical matrix (i.e. , ). This is not the case:

Lemma 5.

The non-canonical Hamiltonian dynamics with magnetic and Hamiltonian cannot be obtained using canonical Hamiltonian dynamics for any choice of smooth Hamiltonian. (See Appendix).

5 The Magnetic HMC (MHMC) Algorithm

Using the results discussed in Section 3 and Section 3.1 we can now propose Magnetic HMC – see Algorithm 1.

0:  , , ,
  Initialize , and set
  for  do
     Resample
     Set with as in (11)
     Flip momentum and set
     if  then
         Set
     else
         Set
     end if
     Flip momentum and flip
  end for
  
Algorithm 1 Magnetic HMC (MHMC)

One further remark is that by construction the integrator for magnetic HMC is expected to have similarly good energy conservation properties to the integrator of standard HMC:

Lemma 6.

The symplectic leapfrog-like integrator for magnetic HMC will have the same local () and global () error scaling (over steps), as the canonical leapfrog integrator of standard HMC if the Hamiltonian is separable. (See Appendix).

It is worthwhile to contrast the algorithmic differences between magnetic HMC and ordinary HMC. Intuitively, the role of the flow – which reduces to the standard Euler translation update of ordinary HMC when – is to introduce a rotation into the momentum space of the flow. In particular, a non-zero element will allow momentum to periodically flow between and . If we regard as an element in the Lie algebra of antisymmetric matrices, which can be thought of as infinitesimal rotations, then the exponential map will project this transformation into the Lie group of real orthogonal linear maps.

With respect to computational cost, although magnetic HMC requires matrix exponentiation/diagonalization to simulate , this only needs to be computed once upfront for and cached; moreover, as is diagonalizable, the exact exponential can be calculated in time. Additionally, there is an

cost for the matrix-vector products needed to implement the flow

as with preconditioning. However, it is possible to design sparsified matrix representations of which will translate into sparsified rotations if we only wish to ”curl” in a specific subspace of dimension – which will translate into a computational cost of and respectively.

An important problem to address is the selection of the matrix, which affords a great deal of flexibility to MHMC relative to HMC; this point is further discussed in the Experiments section, where we argue that in certain cases intuitive heuristics can be used to select the matrix.

6 Experiments

Here we investigate the performance of magnetic HMC against standard HMC in several examples; in each case commenting on our choice of the magnetic field term . Step sizes () and number of leapfrog steps () were tuned to achieve an acceptance rate between , after which the norm of the non-zero elements in was set to which was found to work well.

In the Appendix we also display illustrations of different MHMC proposals across several targets in order to provide more intuition for MHMC’s dynamics. Further experimental details and an additional experiment on a Gaussian funnel target are also provided in the Appendix.

6.1 Multiscale Gaussians

We consider two highly ill-conditioned Gaussians similar to as in (Sohl-Dickstein et al., 2014) to illustrate a heuristic for

matrix selection and demonstrate properties of the magnetic dynamics. In particular we consider a centered, uncorrelated 2D Gaussian with covariance eigenvalues of

and as well as a centered, uncorrelated 10D Gaussian with two large covariance eigenvalues of and remaining eigenvalues of . We denote their coordinates as and respectively.

Figure 2: Averaged Autocorrelation of HMC vs MHMC on a 10D ill-conditioned Gaussian (left) and Averaged Autocorrelation of HMC vs MHMC on a 2D ill-conditioned Gaussian.

HMC will have difficulty exploring the directions of large marginal variance since its exploration will often be limited by the smaller variance directions. Accordingly, in order to induce a periodic momentum flow between the directions of small and large variance, we introduce nonzero components

into the subspaces spanned directly between the large and small eigenvalues. Indeed, we find that magnetic term is encouraging more efficient exploration as we can see from the averaged autocorrelation of samples generated from the HMC/MHMC chains – see Figure 2. Further, by running the 50 parallel chains for

timesteps, we computed both the bias and Monte Carlo standard errors (MCSE) of the estimators of the target moments as shown in Table

1 and Table 2.

algorithm (Bias MCSE) (Bias MCSE)
HMC .947 41.7 .00108 .0247
MHMC .657 22.5 .000280 .00438
Table 1: Absolute Bias MCSE for 2D ill-conditioned Gaussian moments for HMC vs. MHMC. Note that and .
algorithm (Bias MCSE) (Bias MCSE)
HMC 24.7 46.6 0.00376 0.0249
MHMC 9.68 32.7 0.00127 0.0132
Table 2: Absolute Bias MCSE for 10D ill-conditioned Gaussian moments for HMC vs. MHMC. Note that and .

6.2 Mixture of Gaussians

We compare MHMC vs. HMC on a simple, but interesting, 2D density over comprised of an evenly weighted mixture of isotropic Gaussians: for , and . This problem is challenging for HMC because the gradients in canonical Hamiltonian dynamics force it to one of the two modes.

Figure 3: Left: 500 samples from HMC; Middle: 500 samples from MHMC; Right: MMD between HMC/MHMC samples for various magnitudes of the non-zero component of the magnetic field – denoted g. Note corresponds to standard HMC.

We tuned HMC to achieve an acceptance rate of and used the same for MHMC, generating 15000 samples from both HMC and MHMC with these settings. The addition of the magnetic field term

– which has one degree of freedom in this case – introduces an asymmetric “curl” into the dynamics that pushes the sampler across the saddlepoint to the other mode allowing it to efficiently mix around both modes and between them – see Figure

3. The maximum mean discrepancy between exact samples generated from the target density and samples generated from both HMC and MHMC chains was also estimated for various magnitudes of , using a quadratic kernel and averaged over 100 runs of the Markov chains (Borgwardt et al., 2006). Here, we clearly see that for various values of the nonzero component of , denoted , the samples generated by MHMC more faithfully reflect the structure of the posterior. As before, we ran 50 parallel chains for timesteps to compute both the bias and Monte Carlo standard errors (MCSE) of the estimators of the target moments as shown in Table 3.

algorithm (Bias MCSE) (Bias MCSE)
HMC .0132 .0644 0.00264 0.0114
MHMC .00239 .012 0.000596 0.00365
Table 3: Bias MCSE for 2D Mixture of Gaussians for HMC vs. MHMC with . Note that and .

Additional experiments over a range of , (and corresponding acceptance rates) and details are included in the Appendix for this example, demonstrating similar behavior.

6.3 FitzHugh-Nagumo model

Finally, we consider the problem of Bayesian inference over the parameters of the FitzHugh-Nagumo model (a set of nonlinear ordinary differential equations, originally developed to model the behavior of axial spiking potentials in neurons) as in

(Ramsay et al., 2007; Girolami & Calderhead, 2011). The FitzHugh-Nagumo model is a dynamical system defined by the following coupled differential equations:

(15)

We consider the problem where the initial conditions of the system (15) are known, and a set of noise-corrupted observations at discrete time points , are available - note that we illustrate dependence of the trajectories on the model parameters explicitly via subscripts. It is not possible to recover the true parameter values of the model from these observations, but we can obtain a posterior distribution over them by specifying a model for the observation noise and a prior distribution over the model parameters.

Similar to (Ramsay et al., 2007; Girolami & Calderhead, 2011), we assume that the observation noise variables and are iid , and take an independent prior over each parameter , , and . This yields a posterior distribution of the form

(16)

Importantly, the highly non-linear dependence of the trajectory on the parameters , and yields a complex posterior distribution - see Figure 4. Full details of the model set-up can be found in (Ramsay et al., 2007; Girolami & Calderhead, 2011).

For our experiments, we used fixed parameter settings of to generate 200 evenly-spaced noise-corrupted observations over the time interval (as in (Ramsay et al., 2007; Girolami & Calderhead, 2011)). We performed inference over the posterior distribution of parameters with this set of observations using both the HMC and MHMC algorithms, which was perturbed with a magnetic field in each of the 3 axial planes of parameters – along the , , and axes with magnitude . The chains were run to generate 1000 samples over 100 repetitions with settings of , which resulted in an average acceptance rate of . The effective sample size of each of the chains normalized per unit time was then computed for each chain.

Figure 4: Marginal posterior density contour plot over , , with .

Since each query to the posterior log-likelihood or posterior gradient log-likelihood requires solving an augmented set of differential equations as in (15), the computation time () of all the methods was nearly identical.

algorithm ESS ESS(, , )/time (s)
HMC 1000 318 606 4.20, 1.33, 2.55
MHMC ab 1000 349 658 4.20, 1.47, 2.77
MHMC ac 1000 336 628 4.20, 1.42, 2.64
MHMC bc 1000 326 649 4.20, 1.37, 2.73
Table 4: HMC vs. MHMC performance targeting the Fitzhugh-Nagumo posterior parameters

Moreover, note that all methods achieved nearly perfect mixing over the first coordinate so their effective sample size were truncated at 1000 for the coordinate. In this example, we can see that all magnetic perturbations slightly increase the mixing rate of the sampler over each of the coordinates with the perturbation performing best.

7 Discussion and Conclusion

We have investigated a framework for MCMC algorithms based on non-canonical Hamiltonian dynamics and have given a construction for an explicit, symplectic integrator that is used to implement a generalization of HMC we refer to as magnetic HMC. We have also shown several examples where the non-canonical dynamics of MHMC can improve upon the sampling performance of standard HMC. Important directions for further research include finding more automated, adaptive mechanisms to set the matrix , as well as investigating positionally-dependent magnetic field components, similar to how Riemannian HMC corresponds to local preconditioning. We believe that exploiting more general deterministic flows (such as also maintaining a non-zero in the top left-block of a general matrix) could form a fruitful area for further research on MCMC methods.

Acknowledgements

The authors thank John Aston, Adrian Weller, Maria Lomeli, Yarin Gal and the anonymous reviewers for helpful comments. MR acknowledges support by the UK Engineering and Physical Sciences Research Council (EPSRC) grant EP/L016516/1 for the University of Cambridge Centre for Doctoral Training, the Cambridge Centre for Analysis. RET thanks EPSRC grants EP/M0269571 and EP/L000776/1 as well as Google for funding.

References

  • Arnold (1989) Arnold, Vladimir. Mathematical Methods of Classical Mechanics. 1989. ISBN 0387968903.
  • Betancourt (2017) Betancourt, Michael. A Conceptual Introduction to Hamiltonian Monte Carlo. arXiv:1701.02434, 2017.
  • Betancourt & Girolami (2015) Betancourt, Michael and Girolami, Mark. Hamiltonian Monte Carlo for Hierarchical Models. CRC Press, Boca Raton, FL, 2015.
  • Borgwardt et al. (2006) Borgwardt, Karsten M., Gretton, Arthur, Rasch, Malte J., Kriegel, Hans Peter, Scholkopf, Bernhard, and Smola, Alex J. Integrating structured biological data by Kernel Maximum Mean Discrepancy. In Bioinformatics, volume 22, 2006.
  • Carpenter et al. (2016) Carpenter, Bob, Gelman, Andrew, Hoffman, Matt, Lee, Daniel, Goodrich, Ben, Betancourt, Michael, Brubaker, Marcus A, Li, Peter, and Riddell, Allen. Journal of Statistical Software Stan : A Probabilistic Programming Language. Journal of Statistical Software, VV(Ii), 2016.
  • Chen et al. (1999) Chen, Fang, Lovász, László, and Pak, Igor. Lifting markov chains to speed up mixing. In

    Proceedings of the Thirty-first Annual ACM Symposium on Theory of Computing

    , STOC ’99, pp. 275–281, New York, NY, USA, 1999. ACM.
    ISBN 1-58113-067-8. doi: 10.1145/301250.301315.
  • Duane et al. (1987) Duane, Simon, Kennedy, A. D., Pendleton, Brian J., and Roweth, Duncan. Hybrid Monte Carlo. Physics Letters B, 195(2):216 – 222, 1987.
  • Girolami & Calderhead (2011) Girolami, Mark and Calderhead, Ben. Riemann manifold Langevin and Hamiltonian Monte Carlo methods. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 73(2):123–214, March 2011.
  • Girolami et al. (2009) Girolami, Mark, Calderhead, Ben, and Chin, Siu A. Riemannian Manifold Hamiltonian Monte Carlo. Physics, 73(2):35, 2009.
  • Green (1995) Green, Peter J. Reversible Jump Markov Chain Monte Carlo Computation and Bayesian Model Determination. Biometrika, 82(4):711–732, 1995.
  • Haario et al. (1999) Haario, Heikki, Saksman, Eero, and Tamminen, Johanna. Adaptive proposal distribution for random walk metropolis algorithm. Computational Statistics, 14(3), 1999.
  • Hairer et al. (2006) Hairer, Ernst, Hochbruck, Marlis, Iserles, Arieh, and Lubich, Christian. Geometric Numerical Integration. Oberwolfach Reports, pp. 805–882, 2006.
  • Hoffman & Gelman (2014) Hoffman, Matt and Gelman, Andrew. The No-U-Turn sampler: Adaptively setting path lengths in Hamiltonian Monte Carlo. Journal of Machine Learning Research, 15(2008):1–31, 2014.
  • Ma et al. (2015) Ma, Yi-an, Chen, Tianqi, and Fox, Emily B. A Complete Recipe for Stochastic Gradient MCMC. NIPS, pp. 1–19, 2015.
  • Neal (1996) Neal, Radford.

    Bayesian Learning for Neural Networks

    , volume 118.
    1996.
  • Neal (2003) Neal, Radford M. Slice Sampling: Rejoinder, 2003.
  • Neal (2011) Neal, Radford M. MCMC using Hamiltonian Dynamics. Handbook of Markov Chain Monte Carlo, pp. 113–162, 2011.
  • Ramsay et al. (2007) Ramsay, J. O., Hooker, G., Campbell, D., and Cao, J. Parameter estimation for differential equations: A generalized smoothing approach. JOURNAL OF THE ROYAL STATISTICAL SOCIETY, SERIES B, 2007.
  • Rey-Bellet & Spiliopoulos (2015) Rey-Bellet, Luc and Spiliopoulos, Konstantinos. Irreversible Langevin Samplers and Variance Reduction: a Large Deviations Approach. Nonlinearity, 28(7):2081, 2015.
  • Robert & Casella (2004) Robert, Christian P and Casella, George. Monte Carlo Statistical Methods, volume 95. 2004.
  • Sohl-Dickstein et al. (2014) Sohl-Dickstein, J, Mudigonda, Mayur, and DeWeese, M. Hamiltonian Monte Carlo Without Detailed Balance. Proceedings of The 31st International Conference on Machine Learning, 32:9, 2014.
  • Zhang & Sutton (2014) Zhang, Yichuan and Sutton, Charles. Semi-separable Hamiltonian Monte Carlo for Inference in Bayesian Hierarchical Models. In Advances in Neural Information Processing Systems, pp. 10–18, 2014.

Appendix A Section 3 and 4 Proofs

Here we provide proofs for results discussed in Section 3 of the main text regarding non-canonical dynamics.

Lemma 1.

The map defined by integrating the non-canonical Hamiltonian system

(17)

with initial conditions for time , where is any invertible, antisymmetric matrix induces a flow on the coordinates that is still energy-conserving () and symplectic with respect to () which also implies volume-preservation of the flow.

Proof.

The proofs of both results simply uses the antisymmetry of .

Energy-Conservation – Simply, we have that:

(18)
(19)

using the antisymmetry of and symmetry of .

Symplecticness – We must show that the Jacobian of the flow generated by the dynamics preserves the non-canonical structure matrix , which amounts to showing:

(20)

where we define as the time-evolving Jacobian of the flow. First, note that can be equivalently described as the solution to the differential equation:

(21)

with the initial condition (the Jacobian for the identity map at ). Trivially, we have:

(22)

Then note that:

as desired by simply using . ∎

Time-Reversibility – However, crucially it is not the case that the Hamilton equations are time-reversible in the traditional sense of canonical Hamiltonian dynamics.

Lemma 2.

If is a solution to the non-canonical dynamics:

(23)

then is a solution to the modified non-canonical dynamics:

(24)

if . In particular if then , which reduces to the traditional time-reversal symmetry of canonical Hamiltonian dynamics.

Proof.

A direct calculation yields

a.1 Non-Canonical Dynamics Variable Augmentation

As remarked in the main text it is necessary to flip the and matrices at the end of a deterministic simulation of the Hamiltonian dynamics in order to render the proposal time-reversible which is in turn necessary to satisfy detailed balance. This is crucial for the correctness of the algorithm especially when an approximate simulation of the dynamics is used (as is always often the case).

In particular, say that we wish to use as a transition kernel with fixed, non-zero values of and . We first augment the state-space by placing a symmetric, binary distribution independently over and such that and , independently of :

(25)

Importantly, this augmentation leaves the distribution over intact when and are marginalized out. Just as applying the momentum flip operator, , is a deterministic, energy-preserving, volume-preserving transformation, the , flip operators, and , are also deterministic, energy-preserving, volume-preserving transformations that leave (25) invariant for this particular augmentation with and . We can now build a self-inverse operator , composed of simulating the Hamiltonian flow as plus , a flip of , , , as:

(26)

Now we have constructed a deterministic, self-inverse map . can now be used as the proposal for a reversible MCMC algorithm.

An important point to note is that our choice of variable augmentation strategy, namely augmenting with binary distribution, is certainly not unique. However, it is perhaps the most natural and simplest choice which avoids the repetitive computation of matrix exponentials/diagonalizations since the approximate flow detailed in Section B.2 will only need to compute matrix exponentials once upfront for .

a.2 Mass Preconditioning Proofs

A common variation on standard HMC dynamics is to set the kinetic energy term in the Hamiltonian to for some symmetric positive-definite matrix , and sample the initial momentum variable from the corresponding distribution . However, we can contextualize preconditioning using a non-canonical matrix in the following manner:

Lemma 3.

i) Preconditioned HMC with momentum variable in the coordinates, is exactly equivalent to simulating non-canonical HMC with and the non-canonical matrix:

and then transforming back to coordinates using . Here is a Cholesky factor for .

ii) On the other hand if we apply a change of basis (via an invertible matrix ) to our coordinates , simulate HMC in the coordinates, and transform back to the original basis using , this is exactly equivalent to non-canonical HMC with

Proof.

We first prove the equivalence regarding the change of basis in momentum space. Under the mass matrix variant of HMC, is drawn from a distribution, and the dynamics of are then given by

Denoting the upper-triangular Cholesky factor of by , and introducing a new variable , we obtain the following dynamics for the joint variable :

Further, note that if the marginal distribution of is , then under this change of variables has the marginal distribution . Thus, simulating canonical HMC with a non-identity mass matrix is equivalent to making a change of basis in momentum space, simulating non-canonical HMC with a particular choice of non-canonical matrix, and finally reverting back to the original basis.

We now prove the equivalence regarding the change of basis in space, which follows similarly. Consider non-canonical HMC on the state-momentum pair , with the antisymmetric matrix taking the particular form

The states obtained from this algorithm are equal to those obtained by first changing basis to , then simulating standard HMC dynamics for the pair with respect to the Hamiltonian

and then reverting to the original basis as . To see this, first note that if we denote the distribution on the coordinates corresponding to the potential by , then the corresponding distribution on the coordinates is given by , where

The corresponding potential is therefore given by

Running canonical HMC dynamics targeting the Hamiltonian yields the dynamics:

But note then that the dynamics of the original coordinates are then given by:

which are exactly a special case of non-canonical HMC dynamics described above. ∎

Appendix B Magnetic HMC (MHMC)

Here we provide proofs related to the dynamics of magnetic HMC and it’s symplectic integration scheme.

b.1 Non-Canonical Dynamics and Magnetism

We first establish the connection between the particular subcase of non-canonical Hamiltonian dynamics where

and Newton’s law for a charged particle coupled to a magnetic field.

Lemma 4.

In 3-dimensions the non-canonical Hamiltonian dynamics, with Hamiltonian , correspond to simulating the differential equations: