In Bayesian statistics, we often find ourselves with the challenge of performing an integration over a set of model parameters. A common example is when we have a model, defined by the likelihood , with and constituting an input–output data pair. We define a reasonable prior over the model parameters, and our interest lies in inferring the posterior over the parameters . Knowledge of this posterior allows us to make predictions over new input points via the integration,
In Bayesian modelling, the path to this predictive distribution comes from employing Bayes’ theorem:
where the marginal likelihood, , is the final term that needs to be inferred in order to obtain the predictive distribution in Equation (1). This marginal likelihood also requires integration with respect to :
Therefore, if we can infer , we can evaluate the predictive distribution. Depending on the prior and the likelihood, there are sometimes analytic solutions to marginalising over , such as in Gaussian process regression (Williams and Rasmussen, 2006), where the prior is conjugate to the likelihood. However, when this is not the case, one must approximate the integral. One solution is to replace the true posterior with a variational approximation that is either cheaper to sample from (stochastic variational inference) (Hoffman et al., 2013) or conjugate to the likelihood (MacKay, 1992)
. Another option is to use a Markov chain Monte Carlo (MCMC) technique to sample directly from the posterior in Equation (2).
In this paper we focus on the latter option of using an MCMC technique, namely, Riemannian Manifold Hamiltonian Monte Carlo (RMHMC) (Girolami and Calderhead, 2011). As we describe in Sections 3 and 4, RMHMC is a Monte Carlo technique that is well suited to performing inference in highly complex models.
However, RMHMC implementations remain computationally expensive and challenging due to both the higher-order derivatives and quadratic memory requirements. It is for these reasons that RMHMC has not become widely used in large-scale Bayesian modelling, such as in Bayesian neural networks(MacKay, 1992; Neal et al., 2011). To overcome these issues, we present a new RMHMC integration scheme, Explicit RMHMC, that is competitive with standard Hamiltonian Monte Carlo (HMC) in both wall-clock time and performance. Furthermore, we provide a Python package, hamiltorch
, that implements a general-purpose API to run these algorithms on a GPU with PyTorchPaszke et al. (2017a).111Link to the Python package: https://github.com/AdamCobb/hamiltorch. For a tutorial please refer to https://adamcobb.github.io/journal/hamiltorch.html.
2 Related work
Our motivation for introducing a more computationally efficient way of performing inference with RMHMC stems from quotes in the literature, such as “we need improved inference techniques in BNNs", Gal and Smith (2018); and “In small scale problems, Markov chain Monte Carlo algorithms remain the gold standard, but such algorithms face major problems in scaling up to big data.”, Li et al. (2017). Both these papers are referring to the need for improved inference in complex models. Furthermore they also highlight the drawbacks of variational approximations Blei et al. (2017), which can lead to poor uncertainty quantification in safety-critical applications (Cobb et al., 2018) and possibly be more susceptible to adversarial attacks (Gal and Smith, 2018).
2.1 Markov chain Monte Carlo methods
One of the most commonly used and most popularised MCMC algorithms originated from the statistical mechanics literature: the Metropolis–Hastings (MH) algorithm (Metropolis et al., 1953; Hastings, 1970)
includes an acceptance step, whereby proposed samples are accepted according to an acceptance probability that ensures the required MCMC properties are satisfied.222This acceptance step was originally related to potential energy of particles in a system, such that a sample configuration of particles was accepted according to this potential energy, rather than the previous method of simply weighting samples by the energy distribution. Despite its success, the MH algorithm can suffer in high dimensional models due to low acceptance rates and an inability to efficiently explore the space (Bishop (2006, Chapter 11)).
HMC has become a popular choice for performing inference in highly complex models, due to its higher acceptance rates and its use of gradient information in exploring the space. It was first introduced as “Hybrid Monte Carlo” by Duane et al. (1987) for computer simulations in lattice field theory.333We note that earlier work by Alder and Wainwright (1959) utilised Hamiltonian dynamics to simulate molecular dynamics. However the term “Hamiltonian Monte Carlo” was coined from the extensive work by Neal (1995)
, who introduced HMC to the machine learning community with work on Bayesian neural networks. Despite the success of HMC, there are various hyperparameters that need to be tuned, such as the mass matrix,, the step size, , and the number of steps in a given trajectory, (or leapfrog steps, see Section 3
). The reliance on heuristicsNeal (1995), which are heavily dependent on the Bayesian model or the data, often prevent HMC from being exploited to its full potential.
Progress in this area has been made with the introduction of a Hamiltonian-based Monte Carlo scheme that takes into account the geometry of the Bayesian model. This was framed in the work by Girolami and Calderhead (2011), where they introduced RMHMC as an “overarching geometric framework” for MCMC methods that included previous work by Roberts and Stramer (2002) and Duane et al. (1987).
2.2 Adaptations to Hamiltonian-based Monte Carlo samplers
The challenges associated with setting the hyperparameters of Hamiltonian samplers have seen various solutions offered in the literature. Hoffman and Gelman (2014) focussed on solving the problem of automatically setting the trajectory length in HMC by introducing the No-U-Turn Sampler, whilst ensuring detailed balance is preserved. This was extended to Riemannian manifolds in Betancourt (2013a) with the aim of reducing unnecessary computation time. Wang et al. (2013) take a different approach to adapting hyperparameters by employing Bayesian optimisation (Snoek et al., 2012) for adaptive tuning within Hamiltonian-based Monte Carlo samplers.
Rather than focussing on techniques for hyperparameter tuning, others have directly worked on the formulation of the Hamiltonian itself. Shahbaba et al. (2014) show that it is possible to split the Hamiltonian in a way that can reduce the computational cost. They use a MAP approximation that allows the Hamiltonian to be partially analytically integrated, which can reduce the computational burden when exploring much of the state space. Previous work by Lan et al. (2012) introduced some of the benefits of explicit integration techniques to RMHMC. However, unlike the integrator introduced in our work, they derive an integrator that is no longer symplectic and adjust their acceptance step in order to preserve detailed balance. In addition, they introduce two additional matrix inversions that limit efficiency in high-dimensional spaces.
3 Hamiltonian Monte Carlo
Rather than relying on a random walk to explore the parameter space of a model, HMC employs Hamiltonian dynamics to traverse the parameter space, . In order to introduce HMC, we begin with the integrand of interest and augment the space by introducing the momentum variable , such that we now have a joint density . The negative log joint probability is then given by
It is then noted that this negative joint likelihood actually consists of a kinetic energy term and a potential energy term . This insight points to using Hamiltonian dynamics to explore this system, where trajectories in this space move along constant energy levels.
Moving in this space requires derivative information:
The evolution of this system must preserve both volume and total energy, where any transformation that preserves area has the property of symplecticity (Hairer et al., 2006, Page 182). As this Hamiltonian is separable, meaning and are functions of only one of the variables (momentum and parameters, respectively), to traverse the space we can use the Stormer–Verlet or leapfrog integrator (as used in Duane et al. (1987) and Neal (1995)):
where corresponds to a the leapfrog step iteration and is the step size. Therefore given this is a numerical solution that uses discrete steps, we must check whether the Hamiltonian energy is conserved by employing a Metropolis–Hastings step with an acceptance probability after leapfrog steps. Finally, the overall sampling process follows the Gibbs sampling structure, where we draw and then draw a new set of parameters , where we have followed similar nomenclature to Girolami and Calderhead (2011).
Therefore in repeating this Gibbs sampling scheme, including the implementation of the leapfrog algorithm, we can sample our parameter of interest .
4 Moving to a Riemannian manifold
One of the big challenges, when implementing HMC, is how to set the mass matrix . In Neal (1995, Appendix 4), one practical solution is offered, which relates the step size for each parameter to the second-order derivative of the log posterior. Interestingly, this is not too far away from the topic of discussion in this section, where we start to explore ideas of higher-order derivatives.
If our interest lies in measuring distances between distributions, then naively treating the space as if it were Euclidean (while computationally convenient) may prevent rapid sampling of the posterior. A common metric for measuring a distance between distributions is the Kullback–Leibler (KL) divergence:
If we expand the term within the integrand by using a Taylor expansion around such that , we get (see Appendix A for full derivation)
where tells us about the curvature for the joint probability space over and . As used in Girolami and Calderhead (2011), we are using a Bayesian perspective and concern ourselves with the joint likelihood of the data and the parameters and therefore is the Fisher information plus the negative Hessian of the log-prior. Furthermore there is a plethora of literature relating to the appropriate metric for measuring distances on a Riemannian manifold, where we refer to either the RMHMC paper (Girolami and Calderhead, 2011) or the book (Amari and Nagaoka, 2000).
4.1 Riemannian manifold Hamiltonian Monte Carlo
Having introduced the need for a metric situated on a Riemannian manifold, we can now go back to the original Hamiltonian of Equation (4) but write it so that it lies on a Riemannian manifold by replacing with . However, because the mass matrix is now parameterised by , the Hamiltonian equations are no longer separable as they contain a coupling between and :
Since the Hamiltonian is non-separable, the leapfrog integrator defined in Equation 6 is no longer applicable due to the volume conservation requirements. Furthermore, the new dependence of the mass matrix on the parameters means that detailed balance would no longer be satisfied.444At this point, it feels important to highlight that in general, implementations of HMC are often not time-reversible, since that would require an equal chance of being positive or negative. However, as Hoffman and Gelman (2014) mention, this can be omitted in practice.
5 Explicit RMHMC
5.1 Implicit versus explicit integration
The current solution for solving the non-separable Hamiltonian equations in Section 4 is to rely on the implicit generalised leapfrog algorithm (Leimkuhler and Reich, 2004). The term implicit refers to the computationally expensive first-order implicit Euler integrators, which are fixed-point iterations that are run until convergence. In order to understand the computational cost associated with this generalised leapfrog algorithm, we summarise a single leapfrog step in Algorithm 1.
A single step in the implicit generalised leapfrog contains two ‘while’ loops, which both require expensive update steps for each iteration. In Betancourt (2013b), the author uses ‘while’ loops as in Algorithm 1 and sets a threshold for breaking out. As an alternative to a conditional break, Girolami and Calderhead (2011) use a ‘for’ loop and fix the number of iterations to be five or six. Therefore, one step, (at a minimum) requires Hessian calculations along with the additional derivative calculations.
Rather than relying on implicit integration, an explicit integration scheme relies on an explicit evaluation at an already known value (Hairer et al., 2006, Chapter 1, Page 3). This leads to a deterministic number of operations per leapfrog step, which we will show to be advantageous. We now introduce an explicit integration scheme for RMHMC.
5.2 Introducing the new explicit integrator
The challenge in speeding up the symplectic integration for non-separable Hamiltonians has not been applied within the context of Bayesian statistics. However if we delve into the physical sciences, there have been many examples where non-separable Hamiltonians appear Tao (2016a); Luo et al. (2017); Zhang et al. (2018). Therefore it makes sense to draw from the progress made in this area, given the strong relationship between the physical interpretations of HMC within probabilistic modelling applications.
A key advance that moved away from dealing with specific subclasses of non-separable Hamiltonians and moved towards arbitrary non-separable Hamiltonians came in Pihajoki (2015), where the phase space was extended to include an additional momentum term and parameter term . This extension of the phase space was built on by Tao (2016b) who added an additional binding term that resulted in improved long-term performance. This new augmented Hamiltonian can now be defined as:
where the binding term and controls the binding. The two Hamiltonian systems and each follow Equation (4), where and have their parameters and momenta mixed. Interestingly, if we set , we retrieve the augmented Hamiltonian from Pihajoki (2015).
The Hamiltonian in Equation (10) gives the system:
We now introduce the integrators from Tao (2016b), which when combined make up the symplectic flow for the augmented non-separable Hamiltonian:
Finally, the numerical symplectic second-order integrator is given by the function composition
where combining these symmetric mappings is known as Strang splitting (Strang, 1968) and higher-order integrators can be built in a similar manner (Yoshida, 1990). In Appendix B, we show that this second-order integrator is indeed symplectic.
There are two major advantages when comparing the integrator in Equation (13) to previous symplectic integrators for non-separable Hamiltonians. First of all, when comparing to the implicit integrator of Section 5.1, we no longer need to rely on two fixed-point methods that can often diverge. Furthermore we are guaranteed the same number of derivative calculations of eight per equivalent leapfrog step. This compares to the minimum number of for the implicit generalised leapfrog (although it can often be higher). Secondly, when comparing to the previous explicit integrator of Pihajoki (2015), this integration scheme has better long-term error performance. More specifically, Tao (2016b) showed that until the number of steps, , the numerical error is , although empirical results show that larger values of give favourable results in terms of the sampler’s performance.
6.1 Bayesian logistic regression
The purpose of this illustrative example is to demonstrate both the benefits of RMHMC, and that Explicit RMHMC is significantly faster than Implicit RMHMC. We start with the convex problem of Bayesian logistic regression as the metric tensor is positive semi-definitive and hence well behaved. We define our Hamiltonian as in Equation4, where the log joint probability is given by
where the quadratic term in the model parameters corresponds to the prior. We must be careful when defining the precision term , as the influence of the prior on the sampling can be significant as we show in Appendix D.1.1. Also, we allow the data pairs to be augmented such that , which allows to correspond to the bias term. We then generate the data by separately sampling and building a data set through using the logistic .
6.1.1 1D example
In a simple example, where consists of a weight and a bias term , we show how RMHMC is able to take advantage of the geometry of the parameter space, whereas HMC does not. We see this in Figure 1, where we purposely initialise all samplers with and in order to show how each implementation manages when heading towards the known means of and . There is a clear difference between the standard HMC sampler and both Riemannian samplers. The incorporation of geometry in both RMHMC schemes avoids the slower route that the HMC scheme follows. Furthermore HMC is renowned for being sensitive to the hyperparameters, such as the step size and the number of leapfrog steps. Therefore despite the fact that RMHMC still requires optimisation of these hyperparameters, it is less sensitive since the step size is automatically adapted according to the curvature of the space.
However, despite the obvious advantages of RMHMC, it comes at a severe computational cost. On the same hardware, the samples in Figure 1 take 0.5 s, 15.3 s and 9.3 s for HMC, Implict RMHMC and Explicit RMHMC respectively.555CPU: Intel i7-8700k; GPU: NVIDIA TITAN Xp; RAM: 32 GB This is when allowing the fixed-point routine to take a maximum number of six iterations. Therefore although RMHMC is expensive, even this small example shows that Explicit RMHMC can help in reducing the computational cost.
6.1.2 Influence of the prior in a 10D example
Although well known in Bayesian modelling, we use a 10D example to highlight the importance of the prior when designing the sampling scheme in HMC. We find that large values of focus the samples near the zero-mean prior, whereas for a smaller value of , we see the samples moving towards . We display these results in Figure 6 in Appendix D.1.1 in order to highlight how the prior influences results. In addition, when setting the hyperparameters, we initialise all schemes with a long trajectory length and use the autocorrelation to set . We aim to reduce the autocorrelation to result in a larger effective sample size. Appendix D.1 provides further details.
6.1.3 The significance of
We cannot introduce a new parameter without giving some intuition and empirical example of its influence in practice. We are able to show that as we vary the value of , small values below a certain threshold display poor long-term performance, whereas above a threshold, we see a long-term performance consistent with the implicit integration scheme. We also see that when is set too large, the long-term performance once again degrades. These results are consistent with Pihajoki (2015) and Tao (2016b), and we refer to Figure 7 in Appendix D.1.2.
6.2 Inference in a Bayesian hierarchical model
We now introduce the funnel distribution, which is defined as
This was first introduced by Neal et al. (2003)
and is a good example of a model that is challenging for Bayesian inference. Additionally, the marginal distribution ofis , thus allowing us to make comparisons to the known distribution. To demonstrate the advantages of RMHMC over HMC, we use the KL-divergence, , as our metric of performance, as well as comparing the wall-clock time. To define . The KL-divergence is appropriate because in a real application we would only have access to , and it is in our interest to know how much information is lost if we rely on instead of .
We compare four Hamiltonian-based Monte Carlo schemes, where our baseline is the HMC implementation of the No-U-Turn Sampler (Hoffman and Gelman, 2014), which has been set to adapt its acceptance rate to between .666The same parameter settings are followed as in Hoffman and Gelman (2014). The desired acceptance rate is set to , but this results in large step sizes that cause trajectories to traverse into numerically unstable areas of the log probability space. Therefore the adapted acceptance rate is higher than would normally be desired, although results in the same poor performance. We further provide an implementation of HMC that has been tuned with knowledge of the true marginal. Importantly, we compare implicit RMHMC with our new explicit RMHMC
sampler. We further highlight that for both Riemannian schemes the Fisher information matrix is no longer guaranteed to be positive semi-definite as it was for Bayesian logistic regression. Therefore we filter out negative eigenvalues by implementingSoftAbs, as in Betancourt (2013b), the details of which are available in Appendix C.
|Scheme||Wall Time||Acc. Rate|
|HMC - NUTS||min||-|
|Implicit RMHMC||hr min|
|Explicit RMHMC,||hr min|
|HMC (Hand-Tuned for KL)||min||-|
Table 1 shows that both forms of RMHMC outperform HMC in terms of , despite providing the optimiser for HMC with a posteriori information by hand-tuning with knowledge of the true distribution and by allowing it to take two orders of magnitude more samples. The No-U-Turn Sampler, which adapts for an appropriate acceptance rate leads to too large a step size and prevents it from exploring the narrow neck of the funnel (Figure 3). However, the key result is in the wall-clock time, where our new explicit RMHMC achieves comparable to implicit RMHMC in almost a third of the time.777When the max number of fixed point iterations for Implicit RMHMC was set to the acceptance rate for the same initial conditions dropped from to , therefore the max number was set to , with the convergence threshold for breaking out of the loop set to . At an acceptance rate of , the resulting sampler did achieve a similar performance in terms of wall time to Explicit RMHMC but this is clearly not a desirable rate. We further show both the convergence rate of each sampler in and the autocorrelations in Figure 2, where both RMHMC samplers display comparable performance in convergence and autocorrelation. The faster convergence of for both RMHMC methods demonstrates how the samplers are more efficient at building an empirical representation of
, when compared to the other HMC samplers. This is also highlighted by how the autocorrelation between samples plateaus around zero faster for Explicit and Implicit RMHMC. This suggests that the effective sample size of both these samplers is higher than for NUTS and hand-tuned HMC. Finally, the estimated marginal distributions ofare shown in Figure 4, where the better performance of the two Riemannian samplers can be seen from the lower KL divergence.
By introducing an explicit symplectic integration scheme, we have reduced the number of higher-order derivative calculations required per leapfrog step for RMHMC. This results in an almost two times speed up over the previously used implicit integration scheme for RMHMC, with comparable performance. We have demonstrated these results both for Bayesian logistic regression and a Bayesian hierarchical model, where we have also provided insights as to how one might use explicit RMHMC in practice.
Finally, by reducing the computational burden of RMHMC, we hope to expand the range of models that may currently be too computationally expensive for RMHMC and enable others to do the same with our open-source Python package hamiltorch. We will explore these models in future work, with the objective of providing the tools for improving inference and achieving better uncertainty estimates.
Hamiltorch: a PyTorch Python package for sampling.
All the Monte Carlo methods implemented in this paper were performed using hamiltorch
. This is a Python package we developed that uses Hamiltonian Monte Carlo (HMC) to sample from probability distributions.hamiltorch is built with a PyTorch (Paszke et al., 2017a) backend to take advantage of the innate automatic differentiation. PyTorch is a machine learning framework that makes modelling with neural networks in Python easier. Therefore since hamiltorch is based on PyTorch, we ensured that hamiltorch was built to sample directly from neural network models (i.e. objects inheriting from the torch.nn.Module).
To use hamiltorch to sample a Bayesian neural network one only needs to define a data set and a neural network model. PyTorch allows us to run much of the code on a GPU, which results in a large speed increase over running on a CPU. For example, if we use the Pytorch MNIST CNN example (Paszke et al., 2017b) based on the LeNet architecture (LeCun et al., 1998), for the GPU we are able to achieve samples per second, whereas on the CPU the sampling rate is samples per second. This is for a network that consists of two convolutional layers and two fully connected layers, which contains weights (including biases).
Finally, it is simple to switch between integration schemes by changing the argument passed to the sampler. Therefore easily switching between HMC and RMHMC is part of our framework. The structure of hamiltorch aims to make performing HMC more scalable and accessible to practitioners, as well as enabling researchers to add their own metrics for RMHMC. For the code base please refer to https://github.com/AdamCobb/hamiltorch and for a tutorial please refer to https://adamcobb.github.io/journal/hamiltorch.html.
- Williams and Rasmussen  Christopher KI Williams and Carl Edward Rasmussen. Gaussian Processes for Machine Learning, volume 2. MIT Press Cambridge, MA, 2006.
- Hoffman et al.  Matthew D Hoffman, David M Blei, Chong Wang, and John Paisley. Stochastic variational inference. The Journal of Machine Learning Research, 14(1):1303–1347, 2013.
David JC MacKay.
A practical Bayesian framework for backpropagation networks.Neural computation, 4(3):448–472, 1992.
- Girolami and Calderhead  Mark Girolami and Ben Calderhead. Riemann manifold Langevin and Hamiltonian Monte Carlo methods. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 73(2):123–214, 2011.
- Neal et al.  Radford M Neal et al. MCMC using Hamiltonian dynamics. Handbook of Markov Chain Monte Carlo, 2(11):2, 2011.
- Paszke et al. [2017a] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in PyTorch. 2017a.
- Gal and Smith  Yarin Gal and Lewis Smith. Sufficient conditions for idealised models to have no adversarial examples: a theoretical and empirical study with Bayesian neural networks. arXiv preprint arXiv:1806.00667, 2018.
- Li et al.  Cheng Li, Sanvesh Srivastava, and David B Dunson. Simple, scalable and accurate posterior interval estimation. Biometrika, 104(3):665–680, 2017.
- Blei et al.  David M Blei, Alp Kucukelbir, and Jon D McAuliffe. Variational inference: A review for statisticians. Journal of the American Statistical Association, 112(518):859–877, 2017.
- Cobb et al.  Adam D Cobb, Stephen J Roberts, and Yarin Gal. Loss-calibrated approximate inference in Bayesian neural networks. arXiv preprint arXiv:1805.03901, 2018.
- Metropolis et al.  Nicholas Metropolis, Arianna W Rosenbluth, Marshall N Rosenbluth, Augusta H Teller, and Edward Teller. Equation of state calculations by fast computing machines. The journal of chemical physics, 21(6):1087–1092, 1953.
- Hastings  W Keith Hastings. Monte Carlo sampling methods using Markov chains and their applications. 1970.
- Bishop  Christopher M. Bishop. Pattern Recognition and Machine Learning (Information Science and Statistics). Springer-Verlag, Berlin, Heidelberg, 2006. ISBN 0387310738.
- Duane et al.  Simon Duane, Anthony D Kennedy, Brian J Pendleton, and Duncan Roweth. Hybrid Monte Carlo. Physics letters B, 195(2):216–222, 1987.
- Alder and Wainwright  Berni J Alder and Thomas Everett Wainwright. Studies in molecular dynamics. I. General method. The Journal of Chemical Physics, 31(2):459–466, 1959.
- Neal  Radford M Neal. BAYESIAN LEARNING FOR NEURAL NETWORKS. PhD thesis, University of Toronto, 1995.
- Roberts and Stramer  Gareth O Roberts and Osnat Stramer. Langevin diffusions and Metropolis-Hastings algorithms. Methodology and computing in applied probability, 4(4):337–357, 2002.
- Hoffman and Gelman  Matthew D Hoffman and Andrew Gelman. The No-U-Turn sampler: adaptively setting path lengths in Hamiltonian Monte Carlo. Journal of Machine Learning Research, 15(1):1593–1623, 2014.
- Betancourt [2013a] Michael J Betancourt. Generalizing the no-U-turn sampler to Riemannian manifolds. arXiv preprint arXiv:1304.1920, 2013a.
- Wang et al.  Ziyu Wang, Shakir Mohamed, and Nando Freitas. Adaptive Hamiltonian and Riemann Manifold Monte Carlo. In International Conference on Machine Learning, pages 1462–1470, 2013.
- Snoek et al.  Jasper Snoek, Hugo Larochelle, and Ryan P Adams. Practical Bayesian optimization of machine learning algorithms. In Advances in neural information processing systems, pages 2951–2959, 2012.
- Shahbaba et al.  Babak Shahbaba, Shiwei Lan, Wesley O Johnson, and Radford M Neal. Split Hamiltonian Monte Carlo. Statistics and Computing, 24(3):339–349, 2014.
- Lan et al.  Shiwei Lan, Vassilios Stathopoulos, Babak Shahbaba, and Mark Girolami. Lagrangian Dynamical Monte Carlo. arXiv preprint arXiv:1211.3759, 2012.
Hairer et al. 
Ernst Hairer, Christian Lubich, and Gerhard Wanner.
Geometric numerical integration: structure-preserving algorithms for ordinary differential equations, volume 31. Springer Science & Business Media, 2006.
- Amari and Nagaoka  Shun-Ichi Amari and Hiroshi Nagaoka. Methods of information geometry. 2000.
- Leimkuhler and Reich  Benedict Leimkuhler and Sebastian Reich. Simulating Hamiltonian dynamics, volume 14. Cambridge university press, 2004.
- Betancourt [2013b] Michael Betancourt. A general metric for Riemannian manifold Hamiltonian Monte Carlo. In International Conference on Geometric Science of Information, pages 327–334. Springer, 2013b.
- Tao [2016a] Molei Tao. Explicit high-order symplectic integrators for charged particles in general electromagnetic fields. Journal of Computational Physics, 327:245–251, 2016a.
- Luo et al.  Junjie Luo, Xin Wu, Guoqing Huang, and Fuyao Liu. Explicit Symplectic-like Integrators with Midpoint Permutations for Spinning Compact Binaries. The Astrophysical Journal, 834(1):64, 2017.
- Zhang et al.  Ruili Zhang, Yulei Wang, Yang He, Jianyuan Xiao, Jian Liu, Hong Qin, and Yifa Tang. Explicit symplectic algorithms based on generating functions for relativistic charged particle dynamics in time-dependent electromagnetic field. Physics of Plasmas, 25(2):022117, 2018.
- Pihajoki  Pauli Pihajoki. Explicit methods in extended phase space for inseparable Hamiltonian problems. Celestial Mechanics and Dynamical Astronomy, 121(3):211–231, 2015.
- Tao [2016b] Molei Tao. Explicit symplectic approximation of nonseparable Hamiltonians: Algorithm and long time performance. Physical Review E, 94(4):043303, 2016b.
- Strang  Gilbert Strang. On the construction and comparison of difference schemes. SIAM Journal on Numerical Analysis, 5(3):506–517, 1968.
- Yoshida  Haruo Yoshida. Construction of higher order symplectic integrators. Physics letters A, 150(5-7):262–268, 1990.
- Neal et al.  Radford M Neal et al. Slice sampling. The annals of statistics, 31(3):705–767, 2003.
- Paszke et al. [2017b] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. PyTorch MNIST example. https://github.com/pytorch/examples/blob/master/mnist/main.py, 2017b. Accessed: 8 September 2019.
- LeCun et al.  Yann LeCun, Léon Bottou, Yoshua Bengio, Patrick Haffner, et al. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
- Channell and Scovel  Paul J Channell and Clint Scovel. Symplectic integration of Hamiltonian systems. Nonlinearity, 3(2):231, 1990.
- Poincaré  H Poincaré. Les Méthodes nouvelles de la Mécanique Céleste. Paris, Grauthier-Viltars, 3:46, 1899.
Introducing an Explicit Symplectic Integration Scheme for Riemannian Manifold Hamiltonian Monte Carlo
Appendix A Derivation of distance metric
A common metric for measuring a distance between distributions is the Kullback–Leibler (KL)divergence,
If we expand the term within the integrand using a Taylor expansion around such that , we obtain
We note that the first two terms in Equation (A) cancel and the first order gradient term reduces to zero as follows,
which results in the outcome of Equation (4).
Appendix B Checking the symplectomorphism of the augmented Hamiltonian binding term
In order for a transformation matrix to be symplectic, it must satisfy the condition Channell and Scovel :
is a non-singular, skew-symmetric matrix and
is a non-singular, skew-symmetric matrix andis a symplectic matrix. This definition [Hairer et al., 2006, Chapter 6, Page 183] is derived from the area preserving properties of the linear mapping , where the structure of originates from the form of the determinant in the transformed space. In nonlinear systems, can be the Jacobian of a differentiable function [Poincaré, 1899] as shown in [Hairer et al., 2006, Chapter 6, Page 184].
To show how the binding term in Equation (10) meets this condition, we can transform the Hamiltonian to , where , , and , giving
as shown in Tao [2016b], where is the perturbative higher order remainder.
Therefore, if we employ this transformation of variables to the binding term , the linear system in Equation (13) corresponding to becomes:
after the manipulation of the variables from the linear system in (12).
We will now write this as
where is the transformation matrix in Equation (22). In order to check this transformation satisfies the condition in Equation (20), we do . After applying these matrix multiplications and using the trigonometric identity we see that thus confirming is a symplectomorphism.
Appendix C Hessians with negative eigenvalues
In Bayesian logistic regression, we can calculate the Fisher information matrix with knowledge that the metric will be positive semi-definite Girolami and Calderhead . However, once models become more complex, the positive semi-definiteness of the Fisher information metric is no longer guaranteed.
In order to ensure is positive semi-definite, we follow the the same SoftAbs metric that Betancourt [2013b] introduced for RMHMC. This allows us to filter the eigenspectrum by passing all the eigenvalues, , through the function
This approximates the absolute values of the eigenvalues and therefore ensures that the new metric is positive semi-definite, by introducing a parameter which controls the smoothness. As the SoftAbs function becomes the absolute value. In our work we set as in Betancourt [2013b].
Appendix D Experiments
d.1 Bayesian logistic regression
To tune the trajectory length, we started with and ran for iterations. We then looked at the autocorrelation in Figure 4(a) and set to less than half the period of the oscillations for each respective inference scheme (HMC: . Implicit RMHMC: . Explicit RMHMC: .). This led us to Figure 4(b), where the autocorrelation is significantly reduced.
d.1.1 Dependence on the prior and the initialisation of RMHMC
In this logistic regression example the assumed prior over the combined parameter vector,
In this logistic regression example the assumed prior over the combined parameter vector,, is given by . We note that this is compared to the known generative model when we build the data set. The value of defines the confidence in the prior. We expect that for large values of , the structure of the prior will dominate the when inferring the parameters, whereas small values of are expected to result in broader, less-informative priors that reduce in their influence in the presence of more data. We display this behaviour in Figure 6, where we display the samples for five of the eleven parameters.
Furthermore, it often benefits the RMHMC schemes to be initialised by a cheap HMC burn-in to ensure that the linear algebra associated with the Hessian is numerically stable (we found that far away initialisations caused numerical issues due to the Hessian becoming singular).
d.1.2 Implications of the binding term on performance
For the single 1D case, such that and , we can test how the performance of the binding term of the explicit integrator () affects long term performance. If we initialise the samplers at the location in parameter space and set the trajectory length to , we can observe how different values of affect the trajectory. In Figure 7, we compare the implicit integration scheme in Equation (6) to various explicit schemes with different values for . We treat the dashed red line, Implicit RMHMC, as the ground truth and show how all other explicit RMHMC trajectories compare. It is clear from this simple example, that has an important implication on the long term performance of RMHMC. This was also pointed out in the original paper by Tao [2016b], whereby must be larger than some threshold but also small enough such that the error bound remains small. Figure 7 shows what this means in practice.
In fact the purpose of the additional binding term was to improve on the long term performance of the integration scheme introduced by Pihajoki , which is equivalent to . We can clearly see this is the case, as the trajectory corresponding to in Figure 7 diverges from the implicit scheme shortly after traversing away from the initial condition. Furthermore, the largest value we were able to set for the binding term before the trajectory was rejected, , also diverges from the implicit scheme.