Differential geometry plays a fundamental role in applied mathematics, statistics, and computer science, including numerical integration [mclachlan2002splitting, HairerBook, leimkuhler2004simulating, Celledoni:2014, marsden2001discrete], optimisation [betancourt2018symplectic, bravetti2019optimization, Franca:2020, Franca:2021, FrancaBarp:2021, Alimisis21], sampling [rousset2010free, Duane:1987, betancourt2017geometric, livingstone2019geometric, barp2019hamiltonian]
, statistics on spaces with deep learning[Celledoni:2021, bronstein2021geometric], medical imaging and shape methods [vaillant2005surface, durrleman2009statistical]barpRiemannSteinKernelMethod2020], and the study of random maps [harms2020geometry], to name a few. Of particular relevance to this chapter is information geometry, i.e., the differential geometric treatment of smooth statistical manifolds, whose origin stems from a seminal article by Rao [rao1992information]
who introduced the Fisher metric tensor on parametrised statistical models, and thus a natural Riemannian geometry that was later observed to correspond to an infinitesimal distance with respect to the Kullback–Leibler (KL) divergence[jeffreys1946invariant]. The geometric study of statistical models has had many successes [amari2016information, ay2017information, nielsen2020elementary], ranging from statistical inference, where it was used to prove the optimality of the maximum likelihood estimator [amari2012differential], to the construction of the category of mathematical statistics, generated by Markov morphisms [chentsov1965categories, jost2021probabilistic]. Our goal in this chapter is to discuss the emergence of natural geometries within a few important areas of statistics and applied mathematics, namely optimisation, sampling, inference, and adaptive agents. We provide a conceptual introduction to the underlying ideas rather than a technical discussion, highlighting connections with various fields of mathematics and physics.
The vast majority of statistics and machine learning applications involve solvingoptimisation problems. Accelerated gradient-based methods [Polyak:1964, Nesterov:1983], and several variations thereof, have became workhorses in these fields. Recently, there has been great interest in studying such methods from a continuous-time limiting perspective; see, e.g., [Candes:2016, Wibisono:2016, Wilson:2021, Franca:2018a, Franca:2018b, Franca:2021b, muehlebach2021optimization, muehlebach2021constrained] and references therein. Such methods can be seen as 1st order integrators to a classical Hamiltonian system with dissipation. This raises the question on how to discretise the system such that important properties are preserved, assuming the system has fast convergence to critical points and desirable stability properties. It has been known for a long time that the class of symplectic integrators is the preferred choice for simulating physical systems [Takahashi:1984, Suzuki:1990, Yoshida:1990, SanzSerna:1992, Benettin:1994, HairerBook, mclachlan2002splitting, McLachlan:2006, Forest:2006, Kennedy:2013]. These discretisation techniques, designed to preserve the underlying (symplectic) geometry of Hamiltonian systems, also form the basis of Hamiltonian Monte Carlo (HMC) (or hybrid Monte Carlo) methods [Duane:1987, Neal:2011]. Originally, such a theory of geometric integration was developed with conservative systems in mind while, in optimisation, the associated system is naturally a dissipative one. Nevertheless, symplectic integrators were exploited in this context [betancourt2018symplectic, Franca:2020, bravetti2019optimization]. More recently, it has been proved that a generalisation of symplectic integrators to dissipative Hamiltonian systems is indeed able to preserve rates of convergence and stability [Franca:2021], which are the main properties of interest for optimisation. Followup work [FrancaBarp:2021] extended this approach, enabling optimisation on manifolds and problems with constraints. There is also a tight connection between optimisation on the space of measures and sampling which dates back to [Otto:2001, Jordan:1998]; we will revisit these ideas in relation to dissipative Hamiltonian systems.
methods are critical to the efficient implementation of many methodologies. Most modern samplers are based on Markov Chain Monte Carlo methods, which include slice samplers[neal2003slice, murray2010elliptical], piecewise-deterministic Markov chains, such as bouncy particle and zig-zag samplers [davis1984piecewise, bouchard2018bouncy, vanetti2017piecewise, bierkens2019zig, bierkens2017piecewise, peters2012rejection], Langevin algorithms [roberts1996exponential, durmus2017nonasymptotic, durmus2018efficient], interacting particle systems [garbuno-inigoInteractingLangevinDiffusions2019] and the class of HMC methods [Duane:1987, betancourt2017geometric, rousset2010free, betancourt2017conceptual, Neal:2011, barp2018geometry]. The original HMC algorithm was introduced in physics to sample distributions on gauge groups for lattice quantum chromodynamics [Duane:1987]. It combined two approaches that emerged in previous decades, namely the Metropolis-Hastings algorithm and the Hamiltonian formulation of molecular dynamics [metropolis1953equation, hastings1970monte, alder1959studies]. Modern HMC relies heavily on symplectic integrators to simulate a deterministic dynamic, responsible for generating distant moves between samples and thus reduce their correlation, while at the same time preserving important geometric properties. This deterministic step is then usually combined with a corrective step (originally a Metropolis-Hastings acceptance step) to ensure preservation of the correct target, and with a stochastic process, employed to speed up convergence to the target distribution. We will first focus on the geometry of measure-preserving diffusions, which emerges from ideas formulated by Poincaré and Volterra, and form the building block of many samplers. In particular, we will discuss ways to “accelerate” sampling using irreversibility and hypoellipticity. We will then introduce HMC focusing on its underlying Poisson geometry, the important role played by symmetries, and its connection to geometric integration.
We then discuss the problem of statistical inference, whose practical implementation usually relies upon sampling and optimisation. Given observations from a target distribution, many estimators belong to the family of the so-called and estimators [van2000asymptotic], which are obtained by finding the parameters that maximises (or are zeros of) a parametrised set of functions. These include the maximum likelihood and minimum Hyvärinen score matching estimators [vapnik1999nature, hyvarinen2005estimation], which are also particular instances of the minimum score estimators induced by scoring rules that quantify the discrepancy between a sample and a distribution [parry2012proper]. The Monge–Kantorovich transportation problem [villani2009optimal] motivates another important class of estimators, namely the minimum Kantorovich and -Wasserstein estimators, whose implementation use the Sinkhorn discrepancy [bassetti2006minimum, peyre2019computational, cuturi2013sinkhorn]
. Our discussion of inference builds upon the theory of Hilbertian subspaces and, in particular, reproducing kernels. These inference schemes rely on the continuity of linear functionals, such as probability and Schwartz distributions, over a class of functions to geometrise the analysis of integral probability metrics which measure the worse case integration error. We shall explain how maximum mean, kernelised, and score matching discrepancies arise naturally from topological considerations.
Models of adaptive agents are the basis of algorithmic-decision-making under uncertainty. This is a difficult problem that spans multiple disciplines such as statistical decision theory [bergerStatisticalDecisionTheory1985]vonneumannTheoryGamesEconomic1944], control theory [bellmanAppliedDynamicProgramming2015]bartoReinforcementLearningIntroduction1992], and active inference [dacostaActiveInferenceDiscrete2020]. To illustrate a generic use case for the previous methodologies we consider active inference, a unifying formulation of behaviour—subsuming perception, planning and learning—as a process of inference [fristonActionBehaviorFreeenergy2010, fristonFreeenergyPrincipleUnified2010, fristonActiveInferenceEpistemic2015, dacostaActiveInferenceDiscrete2020]. We describe decision-making under active inference using information geometry, revealing several special cases that are established notions in statistics, cognitive science and engineering. We then show how preserving this information geometry in algorithms enables adaptive algorithmic decision-making, endowing robots and artificial agents with useful capabilities, including robustness, generalisation and context-sensitivity [lanillosActiveInferenceRobotics2021, dacostaHowActiveInference2022a]. Active inference is an interesting use case because it has yet to be scaled—to tackle high dimensional problems—to the same extent as established approaches, such as reinforcement learning [silverMasteringGameGo2016]; however, numerical analyses generally show that active inference performs at least as well in simple environments [millidgeDeepActiveInference2020, vanderhimstDeepActiveInference2020, cullenActiveInferenceOpenAI2018, sajidActiveInferenceDemystified2021, markovicEmpiricalEvaluationActive2021a, paulActiveInferenceStochastic2021, mazzagliaContrastiveActiveInference2021], and better in environments featuring volatility, ambiguity and context switches [sajidActiveInferenceDemystified2021, markovicEmpiricalEvaluationActive2021a].
2 Accelerated optimisation
We shall be concerned with the problem of optimisation of a function , i.e., finding a point that maximises , or minimises , over a smooth manifold
. We will assume this function is differentiable to construct algorithms that rely on the flows of smooth vector fields guided by the derivatives of.
Many algorithms in optimisation are given as a sequence of finite differences, represented by iterations of a mapping , where is a step size. The analysis of such finite difference iterations is usually challenging, relying on painstaking algebra to obtain theoretical guarantees; such as convergence to a critical point, stability, and rates of convergence to a critical point. Even when these algorithms are seen as discretisations of a continuum system, whose behaviour is presumably understood, it is well-known that most discretisations break important properties of the system.
2.1 Principle of geometric integration
Fortunately, here comes into play one of the most fundamental ideas of geometric integration: many numerical integrators are very close—exponentially in the step size—to a smooth dynamics generated by a shadow vector field (a perturbation of the original vector field). This allows us to analyse the discrete trajectory implemented by the algorithm using powerful tools from dynamical systems and differential geometry, which are a priori reserved to smooth systems. Crucially, while numerical integrators typically diverge significantly from the dynamics they aim to simulate, geometric integrators respect the main properties of the system. In the context of optimisation this means respecting stability and rates of convergence. This was first demonstrated in [Franca:2021] and further extended in [FrancaBarp:2021]; our following discussion will be based on these works.
The simplest way to construct numerical methods to simulate the flow of a vector field arises when it is given by a sum, , and the flows of the individual vector fields and are—analytically or numerically—tractable. In such a case, we can approximate the exact flow , for step size , by composing the individual flows and . The simplest composition is given by . The Baker–Campbell–Hausdorff (BCH) formula then yields
where is the commutator between and . Thus, the numerical method itself can be seen as a smooth dynamical system with flow map . The goal of geometric integration is to construct numerical methods for which shares with the critical properties of interest; this is usually done by requiring preservation of some geometric structure.
Recall that a numerical map is said to be of order if ; we abuse notation slightly and let denote a well-defined distance over manifolds (see [Hansen:2011] for details). Thus, the expansion (1) also shows that the error in the approximation is , i.e., we have an integrator of order . One can also consider more elaborate compositions, such as
which is more accurate since the first term in (1) cancels out, yielding an integrator of order .111Higher-order methods are constructed by looking for appropriate compositions that cancel first terms in the BCH formula [Yoshida:1990]. However, methods for tend to be expensive numerically, with not so many benefits (if any) over methods of order .
2.2 Conservative flows and symplectic integrators
As a stepping stone, we first discuss the construction of suitable conservative flows, namely flows along which some function is constant, where is the phase space manifold of the system, i.e., the space in which the dynamics evolves. Such flows, which are amongst the most well-studied due to their importance in physics, will enable us to obtain our desired “rate-matching” optimisation methods and will also be central in our construction of geometric samplers.
To construct vector fields along the derivative of we shall need brackets. Geometrically, these are morphisms , also known as contravariant tensors of rank in physics, where is the dual space of . Note that on Riemannian manifolds (e.g., ) both spaces are isomorphic. In Euclidean space, , we define such -vector fields in terms of a state-dependent matrix as222We denote by the th component of and . We also use Einstein’s summation convention, i.e., repeated upper and lower indices are summed over.
Any vector field that depends linearly and locally on may be written in this manner. Notice that a decomposition induces a decomposition that is amenable to the splitting integrators previously mentioned. Importantly, vector fields that preserve correspond to bracket vector fields in which is antisymmetric [mclachlan1999geometric]. Constructing conservative flows is thus straightforward. Unfortunately, it is a rather more challenging task to construct efficient discretisations that retain this property; most well-known procedures, namely discrete-gradient and projection methods, only give rise to integrators that require solving implicit equations at every step, and they may break other important properties of the system.
For a particular class of conservative flows, it is possible to construct splitting integrators that—exactly—preserve another function that remains close to . Indeed, going back to the BCH formula (1), we see that if we were to approximate a conservative flow of by composing the flows of and , and, crucially, if we had a bracket for which the commutators can be written as
for some function , and so on for all commutators in (1), then the right-hand side of the BCH formula would itself be an expansion in terms of a vector field for some shadow function . In particular, would inherit all the properties of , i.e., properties common to -vector fields. This is precisely the case for Poisson brackets, written , which are antisymmetric brackets for which the Jacobi identity holds:
where is the Poisson bracket between functions and . The BCH formula then implies
Such an integrator can thus be seen as a Poisson system itself, generated by the above asymptotic shadow , which is exactly preserved.
Poisson brackets and their dynamics are the most important class of conservative dynamical systems, describing many physical systems, including all of fluid and classical mechanics. The two main classes of Poisson brackets are constant antisymmetric matrices on Euclidean space, and symplectic brackets for which is invertible at every point . Its inverse is denoted by and called a symplectic form. In this case, the function is called a Hamiltonian, denoted . The invertibility of the Poisson tensor implies that such a bracket exists only on even-dimensional spaces. Darboux theorem then ensures the existence of local coordinates in which the symplectic form can be represented as . Dynamically, this corresponds to the fact that these are 2nd order differential equations, requiring not only a position but also a momentum .333More precisely, the dynamics evolve on the cotangent bundle , with coordinates ; momentum and velocity are equivalent on the Riemannian manifolds that are used in practice. is called the configuration manifold with coordinates . Note that if then , and conversely if then . Thus, a change in coordinate is generated by its conjugate momentum , and vice-versa. Thus, the only way to generate dynamics on
in this case is by introducing a Hamiltonian depending on both position and momentum. From a numerical viewpoint, the extended phase space introduces extra degrees of freedom that allow us to incorporate “symmetries” in the Hamiltonian, which facilitate integration. Indeed, in practice, the Hamiltonian usually decomposes into a potential energy, associated to position and independent of momentum, and a kinetic energy, associated to momentum and invariant under position changes, both generating tractable flows. Thanks to this decomposition, we are able to construct numerical methods through splitting the vector field. Note also that, for symplectic brackets, the existence of a shadow Hamiltonian can be guaranteed beyond the case of splitting methods, e.g., for variational integrators—which use a discrete version of Hamilton’s principle of least action—and more generally for most symplectic integrators in which the symplectic bracket is preserved up to topological considerations described by the 1st de Rham cohomology of phase space.
2.3 Rate-matching integrators for smooth optimisation
Having obtained a vast family of smooth dynamics and integrators that closely preserve , we can now apply these ideas to optimisation. Vector fields for which a Hamiltonian function dissipates can be written as a bracket vector field for some negative semi-definite matrix [mclachlan1999geometric]. Let us consider a concrete example in in the form of a (generalised) conformal Hamiltonian system [McLachlan:2001, Franca:2020, Franca:2021]. Consider thus the Hamiltonian
where is a constant symmetric positive definite matrix with inverse . The associated vector field is , with being a “damping coefficient.” This is associated to the negative definite matrix
The equations of motion are
so the system is dissipative. Suppose has a minimizer in some region of interest and, without loss of generality, has value . Then and outside such a critical point, implying that is also a (strict) Lyapunov function; the existence of such a Lyapunov function implies that trajectories starting in the neighborhood of will converge to . In other words, the above system provably solves the optimisation problem
Two common choices for the damping are the constant case, , and the asymptotic vanishing case, for some constant (other choices are also possible). When is a convex function (resp. strongly convex function with parameter ) it is possible to show the following convergence rates [Franca:2018b]:
is the largest eigenvalue of the metric. The convergence rates of this system are therefore known under such convexity assumptions. Ideally, we want to design optimisation methods that preserve these rates, i.e., are “rate-matching”, and are also numerically stable. As we will see, such geometric integrators can be constructed by leveraging the shadow Hamiltonian property of symplectic methods on higher-dimensional conservative Hamiltonian systems [Franca:2021] (see also [marthinsen2014geometric, asorey1983generalized]). This holds not only on but on general settings, namely on arbitrary smooth manifolds [Franca:2021, FrancaBarp:2021].
In the conformal Hamiltonian case, the dissipation appears explicitly in the equations of motion. It is however theoretically convenient to consider an equivalent explicit time-dependent Hamiltonian formulation. Consider the following coordinate transformation into system (8):
It is easy to see that (8) is equivalent to standard Hamilton’s equations,
with the explicit time-dependent Hamiltonian
The rate of change of along the flow now satisfies
so the system is nonconservative; this equation is equivalent to (9).
Going one step further, let us now promote to a new coordinate and introduce its (conjugate) momentum . Consider thus the higher-dimensional Hamiltonian
Note that and are two arbitrary canonical coordinates. Denoting the time parameter of this system by , Hamilton’s equations read
This system is conservative since . Now, if we fix coordinates as
the conservative system (16) reduces precisely the original dissipative system (13); the 2nd equation in (16) reproduces (14), and the remaining equations are equivalent to the equations of motion associated to (13), which in turn are equivalent to (8) as previously noted. Formally, what we have done is to embed the original dissipative system with phase space into a higher-dimensional conservative system with phase space . The dissipative dynamics thus lies on a hypersurface of constant energy, , in high dimensions; see [Franca:2021] for details. The reason for doing this procedure, called symplectification, is purely theoretical: since the theory of symplectic integrators only accounts for conservative systems, we can now extend this theory to dissipative settings by applying a symplectic integrator to (13) and then fixing the relevant coordinates (17) in the resulting method. Geometrically, this corresponds to integrating the time flow exactly [Franca:2021, marthinsen2014geometric]. In [Franca:2021] such a procedure was defined under the name of presymplectic integrators, and these connections hold not only for the specific example above but also for general non-conservative Hamiltonian systems.
We are now ready to explain why this approach is suitable to construct practical optimisation methods. Let be a symplectic integrator of order applied to system (15). Denote by the numerical state, obtained by iterations of . Time is simulated over the grid , with step size . Because a symplectic integrator has a shadow Hamiltonian we have
Enforcing (17), the coordinate becomes simply the time discretization , which is exact, and so is since it is a function of time alone; importantly, does not couple to any of the other degrees of freedom so it is irrelevant whether we have access to or not. Replacing (15) into the above equation we conclude:
where we now denote , for . Hence, the time-dependent Hamiltonian also has a shadow, thanks to the cancellation of the variable . In particular, if we replace the explicit form of the Hamiltonian (13) we obtain444The kinetic part only contributes to the small error since is positive definite and . There are several technical details we are omitting, such as Lipschitz conditions on the Hamiltonian and on the numerical method, which we refer to [Franca:2021] for details.
Therefore, the known rates (11) for the continuum system are nearly preserved—and so would be any rates of more general time-dependent (dissipative) Hamiltonian systems. Moreover, as a consequence of (18), the original time-independent Hamiltonian (6) of the conformal formulation is also closely preserved, i.e., within the same bounded error in —recall transformation (12). However, this is also a Lyapunov function, hence the numerical method respects the stability properties of the original system as well.555Naturally, all these results hold for suitable choices of step size, which can be determined by a linear stability analysis of the particular numerical method under consideration.
In short, as a consequence of having a shadow Hamiltonian, such geometric integrators are able to reproduce all the relevant properties of the continuum system. These arguments are completely general; namely, they ultimately rely on the BCH formula, the existence of bracket vector fields and the symplectification procedure. Under these basic principles, no discrete-time analyses were necessary to obtain guarantees for the numerical method; which may not be particularly enlightening from a dynamical systems viewpoint and are only applicable on a (painful) case-by-case basis.
Let us now present an explict algorithm to solve the optimisation problem (10). Consider a generic (conservative) Hamiltonian , evolving in time . The well-known leapfrog or Störmer-Verlet method, the most used symplectic integrator in the literature, is based on the composition (2) and reads [HairerBook]
According to our prescription,
replacing the higher-dimensional Hamiltonian (15), imposing
the gauge fixing conditions (17), and recalling that cancels out, we obtain the following method:666
In a practical implementation, it is convenient to make
the change of variables into (20); recall the transformations (12). In this case the method reads
where we recall that is the step size and , for iterations . This method, which is a dissipative generalisation of the leapfrog, was proposed in [Franca:2021] and has very good performance when solving unconstrained problems (10). In a similar fashion, one can extend any (known) symplectic integrator to a dissipative setting; the above method is just one such example.
2.4 Manifold and constrained optimisation
Following [FrancaBarp:2021], we briefly mention how the previous approach can be extended in great generality, i.e., to an optimisation problem
where is an arbitrary Riemannian manifold. There are essentially two ways to solve this problem through a (dissipative) Hamiltonian approach. One is to is to simulate a Hamiltonian dynamics on by incorporating the metric of in the kinetic part of the Hamiltonian. Another is to consider a Hamiltonian dynamics on and embed into by imposing several constraints,777Theoretically, there is no loss of generality since Nash or Whitney embedding theorems tells us that any smooth manifold can be embedded into for sufficiently large .
This constrained case turns out to be particularly useful since we typically are unable to compute the geodesic flow on , but are able to construct robust constrained symplectic integrators for it.
As an example of the first approach, consider being a Lie group, with Lie algebra and generators . The analogous of Hamiltonian (13) is given by , where is a constant, and (they can be seen as matrices). The method (20) can be adapted to this setting, resulting in the following algorithm [FrancaBarp:2021] (recall footnote 6):
where is a matrix.
As an example of the second approach, one can constrain the integrator on to define a symplectic integrator on via the discrete constrained variational approach [marsden2001discrete] by introducing Lagrange multipliers, i.e., by considering the Hamiltonian , where is the Hamiltonian (13). In particular, the method (20) can be constrained to yield [FrancaBarp:2021]
where we have the projector with , and is the Jacobian matrix of the constraints; is the vector of Lagrange multipliers and accounts for the damping. In practice, the Lagrange multipliers are determined by solving the (nonlinear) algebraic equations for the constraints, i.e., the 2nd to 4th updates above are solved simultaneously. The above method consists in a dissipative generalisation of the well-known RATTLE integrator from molecular dynamics [Andersen:1983, Leimkuhler:1994, McLachlan:2014, leimkuhler2016efficient].
It is possible to generalise any other (conservative) symplectic method to this (dissipative) optimisation setting on manifolds. In this general setting, there still exists a shadow Hamiltonian so that convergence rates and stability are closely preserved numerically [FrancaBarp:2021] (similarly to (18) and (19)). In particular, one can also consider different types of kinetic energy, beyond the quadratic case discussed above, which may perform better in specific problems [Franca:2020]. This approach therefore allows one to adapt existing symplectic integrators to solve optimisation problems on Lie groups and other manifolds commonly appearing in machine learning, such as Stiefel, Grassmanians, or to solve constrained optimisation problems on .
2.5 Gradient flow as a high friction limit
Let us provide some intuition why simulating 2nd order systems is expected to yield faster algorithms. It has been shown that several other accelerated optimisation methods888Besides accelerated gradient based methods, accelerated extensions of important proximal-based methods such as proximal point, proximal-gradient, alternating direction method of multipliers (ADMM), Douglas-Rachford, Tseng splitting, etc., are implicit discretizations of (8); see [Franca:2021b] for details. are also discretisations of system (8) [Franca:2021b]. Moreover, in the large friction limit, , this system reduces to the 1st order gradient flow, (assuming ), which is the continuum limit of standard, i.e., nonaccelerated methods [Franca:2021b]. The same happens in more general settings; when the damping is too strong, the second derivative becomes negligible and the dynamics is approximately 1st order.
As an illustration, consider Figure 1 (left) where a particle immersed in a fluid falls under the influence of a potential force , that plays the role of “gravity”, and is constrained to move on a surface. In the underdamped case, the particle is under water, which is not so viscous, so it has acceleration and moves fast (even oscillate). In the overdamped case, the particle is in a highly viscous fluid, such as honey, and the drag force is comparable or stronger to , thus the particle moves slowly since it cannot accelerate; during the same elapsed time , an accelerated particle would travel a longer distance. We can indeed verify this behaviour numerically. In Figure 1 (right) we run algorithm (23) in the underdamped and overdamped regimes when solving an optimisation problem on the -sphere, i.e., on the Lie group .999The details are not important here, but this problem minimises the Hamiltonian of a spherical spin glass (see [FrancaBarp:2021] for details). The same behaviour is seen with the constrained method (24) as well. We can see that, in the overdamped regime, this method has essentially the same dynamics as the Riemannian gradient descent [Sra:2016], which is nonaccelerated and corresponds to a 1st order dynamics; all methods use the same step size, only the damping coefficient is changed.
2.6 Optimisation on the space of probability measures
There is a tight connection between sampling and optimisation on the space of probability measures which goes back to [Otto:2001, Jordan:1998]. Let be the space of probability measures on
with finite second moments, endowed with a Wasserstein-2 metric. The gradient flow of a functional
on the space of probability measures is the solution to the partial differential equation, which, under sufficient regularity conditions, is equivalent to [Otto:2001, Jordan:1998, ambrosioGradientFlowsMetric2005]
where and are the derivative and the divergence operators on , respectively. The evolution of this system solves the optimisation problem
i.e., as in the sense of distributions. We can consider the analogous situation with a dissipative flow induced by the conformal Hamiltonian tensor (7) on the space of probability measures; we set and for simplicity. Thus, instead of (25), we have a conformal Hamiltonian system in Wasserstein space given by the continuity equation
where now and is a measure over . Let be the free energy defined as
where is the (internal) energy, is the Hamiltonian (6), is the Shannon entropy, and is the inverse temperature. The functional derivative of the free energy equals
In particular, the minimiser of is the stationary density
Note also that the free energy (28) is nothing but the divergence (up to a constant which is the partition function):
which is nothing but the Fokker-Planck equation associated to the underdamped Langevin diffusion
where is a standard Wiener process. Thus, the underdamped Langevin can be seen as performing accelerated optimisation on the space of probability measures. A quantitative study of its speed of convergence is given by the theory of hypocoercivity [ottobreMarkovChainMonte2016, pavliotisStochasticProcessesApplications2014, villaniHypocoercivity2009].
The above results provide a tight connection between sampling and optimisation. Interestingly, by the same argument as used in section 2.5 (see Figure 1), the high friction limit, , of the underdamped Langevin diffusion (31) yields the overdamped Langevin diffusion [Franca:2021b, pavliotisStochasticProcessesApplications2014]
which corresponds precisely to the gradient flow (25) on the free energy functional [Jordan:1998, ambrosioGradientFlowsMetric2005], where now . Thus, in the same manner that a 2nd order damped Hamiltonian system may achieve accelerated optimisation compared to a 1st order gradient flow, the underdamped Langevin diffusion (31) may achieve accelerated sampling compared to the overdamped Langevin diffusion (32). Such an acceleration has indeed been demonstrated [Ma:2019] in continuous-time and for a particular discretisation.
3 Hamiltonian-based accelerated sampling
The purpose of sampling methods is to efficiently draw samples from a given target distribution or, more commonly, to calculate expectations with respect to :
However, generating i.i.d. samples is usually practically infeasible, even for finite sample spaces, as in high dimensions probability mass tends to concentrate in small regions of the sample space, while regions of high probability mass tend to be separated by large regions of negligible probability. Moreover, is usually only known up to a normalisation constant [mackay2003information]. To circumvent this issue, MCMC methods rely on constructing ergodic Markov chains that preserve the target distribution . If we run the chain long enough (), Birkhoff’s ergodic theorem guarantees that the estimator on the right-hand side of (33) converges to our target integral on the left-hand side almost surely [hairerErgodicPropertiesMarkov2018]
. An efficient sampling scheme is one that minimises the variance of the MCMC estimator. In other words, fewer samples will be needed to obtain a good estimate. Intuitively, good samplers are Markov chains that converge as fast as possible to the target distribution.
3.1 Optimising diffusion processes for sampling
As many MCMC methods are based on discretising continuous-time stochastic processes, the analysis of continuous-time processes is informative of the properties of efficient samplers.
Diffusion processes possess a rich geometric theory, extending that of vector fields, and have been widely studied in the context of sampling. They are Markov processes featuring almost surely continuous sample paths (i.e., no jumps) and correspond to the solutions of stochastic differential equations (SDEs). While a deterministic flow is given by a first order differential operator—namely, a vector field as used in §2—diffusions require specifying a set of vector fields , where represents the (deterministic) drift and the directions of the (random) Wiener processes , and are characterised by a second order differential operator of the form , known as the generator of the process. Equivalently, diffusions can be written as Stratonovich SDEs: .
For a smooth positive target measure , the complete family of -preserving diffusions is given by (up to a topological obstruction contribution) [barpUnifyingCanonicalDescription2021]
for a choice of antisymmetric bracket . Here is a differential operator on multi-vector fields, generalising the divergence on vector fields of , and is induced via an isomorphism defined by which allows to transfer the calculus of twisted differential forms to a measure-informed calculus on multi-vector fields [barpUnifyingCanonicalDescription2021]. The ergodicity of (34) is essentially characterised by Hörmander’s hypoellipticity condition; i.e., whether the Lie algebra of vector fields generated by spans the tangent spaces at every point [hormanderHypoellipticSecondOrder1967, pavliotisStochasticProcessesApplications2014, bismut1981martingales]. On Euclidean space the above complete class of measure preserving diffusions can be given succinctly by Itô SDEs [maCompleteRecipeStochastic2015]:
where reduce to symmetric and antisymmetric matrix fields and is the negative Lebesgue log-density of .
There are two well-studied criteria describing sampling efficiency in Markov processes: 1) the worst-case asymptotic variance of the MCMC estimator (33) over functions in , and 2) the spectral gap. The spectral gap is the lowest non-zero eigenvalue of the (negative) generator on . When it exists, it is an exponential convergence rate of the density of the process to the target density [hwangAcceleratingDiffusions2005, pavliotisStochasticProcessesApplications2014, chafaiEntropiesConvexityFunctional2004]
. Together, these criteria yield confidence intervals on the non-asymptotic variance of the MCMC estimator, which determines sampling performance[joulinCurvatureConcentrationError2010].
A fundamental criterion for efficient sampling is non-reversibility [hwangAcceleratingDiffusions2005, rey-belletIrreversibleLangevinSamplers2015, duncanVarianceReductionUsing2016]. A process is non-reversible if it is statistically distinguishable from its time-reversal when initialised at the target distribution [pavliotisStochasticProcessesApplications2014]. Measure-preserving diffusions are non-reversible precisely when [haussmannTimeReversalDiffusions1986]. Intuitively, non-reversible processes backtrack less often and thus furnish more diverse samples [nealImprovingAsymptoticVariance2004]. Furthermore, non-reversibility leads to mixing, which accelerates convergence to the target measure. It is well known that removing non-reversibility worsens the spectral gap and the asymptotic variance of the MCMC estimator [rey-belletIrreversibleLangevinSamplers2015, duncanVarianceReductionUsing2016, hwangAcceleratingDiffusions2005]. In diffusions with linear coefficients, one can construct the optimal non-reversible matrix to optimise the spectral gap [lelievreOptimalNonreversibleLinear2013, wuAttainingOptimalGaussian2014] or the asymptotic variance [duncanVarianceReductionUsing2016]. However, there are no generic guidelines on how to optimise non-reversibility in arbitrary diffusions. This suggests a two-step strategy to construct efficient samplers: 1) optimise reversible diffusions, and 2) add a non-reversible perturbation [zhangGeometryinformedIrreversiblePerturbations2021].
Diffusions on manifolds are reversible when , and thus have the form , which on Euclidean space reads
The spectral gap and the asymptotic variance of the MCMC estimator are the same optimality criteria in reversible Markov processes [miraOrderingImprovingPerformance2001]. When is positive definite everywhere, it defines a Riemannian metric on the state space. The generator is then the elliptic differential operator
where is the Riemannian gradient and is the Laplace-Beltrami operator, i.e., the Riemannian counterpart of the Laplace operator. Thus, reversible (elliptic) diffusions (37) are the natural generalisation of the overdamped Langevin dynamics (32) to Riemannian manifolds [girolami2011riemann]. Optimising to improve sampling amounts to endowing the state space with a suitable Riemannian geometry that exploits the structure of the target density. For example, sampling is improved by directing noise along vector fields that preserve the target density [abdulleAcceleratedConvergenceEquilibrium2019]. When the potential is strongly convex, the optimal Riemannian geometry is given by [helfferRemarksDecayCorrelations1998, saumardLogconcavityStrongLogconcavity2014]. Sampling can also be improved in hypoelliptic diffusions with degenerate noise (i.e., when is not positive definite). Intuitively, the absence of noise in some directions of space leads the process to backtrack less often and thus yield more diverse samples. For instance, in the linear case, the optimal spectral gap is attained for an irreversible diffusion with degenerate noise [guillinOptimalLinearDrift2021]. However, degenerate diffusions can be very slow to start with, as the absence of noise in some directions of space make it more difficult for the process to explore the state space [guillinOptimalLinearDrift2021].
Underdamped Langevin dynamics (31) combines all the desirable properties of an efficient sampler: it is irreversible, has degenerate noise, and achieves accelerated convergence to the target density [maThereAnalogNesterov2019]. We can optimise the reversible part of the dynamics (i.e., the friction ) to improve the asymptotic variance of the MCMC estimator [chakOptimalFrictionMatrix2021]. Lastly, we can significantly improve underdamped Langevin dynamics by adding additional non-reversible perturbations to the drift [duncanUsingPerturbedUnderdamped2017].
One way to obtain MCMC algorithms is to numerically integrate diffusion processes. As virtually all non-linear diffusion processes cannot be simulated exactly, we ultimately need to study the performance of discrete algorithms instead of their continuous counterparts. Alarmingly, many properties of diffusions can be lost in numerical integration. For example, numerical integration can affect ergodicity [mattinglyErgodicitySDEsApproximations2002]. An irreversible diffusion may sample more poorly than its reversible counterpart after integration [katsoulakisMeasuringIrreversibilityNumerical2014]. This may be because numerical discretisation can introduce, or otherwise change, the amount of non-reversibility [katsoulakisMeasuringIrreversibilityNumerical2014]. The invariant measure of the diffusion and its numerical integration may differ, a feature known as bias. We may observe very large bias even in the simplest schemes, such as the Euler-Maruyama integration of overdamped Langevin [roberts1996exponential]. Luckily, there are schemes whose bias can be controlled by the integration step size [mattinglyConvergenceNumericalTimeAveraging2010]; yet, this precludes using large step sizes. Alternatively, one can remove bias by supplementing the integration step with a Metropolis-Hastings corrective step; however, this makes the resulting algorithm reversible. In conclusion, designing efficient sampling algorithms with strong theoretical guarantees is a non-trivial problem that needs to be addressed in its own right.
3.2 Hamiltonian Monte Carlo
Constructing measure-preserving processes, in particular diffusions, is relatively straightforward. A much more challenging task consists of constructing efficient sampling algorithms with strong theoretical guarantees. We now discuss an important family of well-studied methods, known as Hamiltonian Monte Carlo (HMC), which can be implemented on any manifold, for any smooth fully supported target measure that is known up to a normalising constant. Some of these methods can be seen as an appropriate geometric integration of the underdamped Langevin diffusion, but it is in general simpler to view them as combining a geometrically integrated deterministic dynamics with a simple stochastic process that ensures ergodicity.
The conservative Hamiltonian systems previously discussed provide a natural candidate for the deterministic dynamics. Indeed, given a target measure , with a Riemannian measure (such as the Lebesgue measure on ), if we interpret the negative log-density as a potential energy, i.e., a function depending on position , one can then plug in the potential within Newton’s equation to obtain a deterministic proposal that is well-defined on any manifold, as soon as the acceleration and derivative operators have been replaced by their curved analogues
with given initial conditions for the position and velocity . This is a 2nd order system which evolves in the tangent bundle, , which is when . The resulting flow is conservative since it corresponds to a Hamiltonian system as discussed in section §2.2, with Hamiltonian , where is the Riemannian squared-norm, which is when and is the Riemannian metric; this is the manifold version of the Hamiltonian (6). This system preserves the symplectic measure , and thus also the canonical distribution , which is the product of the target distribution over position with the Gaussian measures on velocity (with covariance ). For instance, on ,
Moreover, the pushforward under the projection is precisely the target measure: . Concretely, the samples generated by moving according to Newton’s law, after ignoring their velocity component, have as their law. The main critical features and advantages in using Hamiltonian systems arises from their numerical implementation. Indeed, the flow of (38) is only tractable for the simplest target measures, namely those possessing a high degree of symmetry. In order to proceed, we must devise suitable numerical approximations which, unfortunately, not only break such symmetries but may lose key properties of the dynamics such as stationarity (typically not retained by discretisations). However, as we saw in section §2.2, most symplectic integrators have a shadow Hamiltonian and thus generate discrete trajectories that are close to the associated bona fide (shadow) Hamiltonian dynamics, that in particular preserve the shadow canonical distribution.
Most integrators used in sampling, such as the leapfrog, are geodesic integrators. These are splitting methods (see section §2.1) obtained by splitting the Hamiltonian , where and are to be treated as independent Hamiltonians in their own right. Both of these Hamiltonians generate dynamics that might be tractable: the Riemannian component , associated to the Riemannian reference measure, induces the geodesic flow, while the target density component gives rise to a vertical gradient flow, wherein the velocity is shifted by the direction of maximal density change, i.e., . The Jacobi identity and the BCH formula imply these integrators do possess a shadow Hamiltonian , and reproduce its dynamics. Such a shadow can in fact be explicitly obtained from (5) by computing iterated Poisson brackets; e.g., on and for , the three-stage geodesic integrator , with parameters , yields [radivojevic2018multi]
for some constants and . As an immediate consequence, these symplectic integrators preserve the reference symplectic measure
and can be used as a (deterministic) Markov proposal, which when combined with the Metropolis-Hastings acceptance step that depends only on the target density, gives rise to a measure-preserving process. Moreover, the existence of the shadow Hamiltonian ensures that the acceptance rate will remain high for distant proposals, allowing small correlations. However, since Hamiltonian flows are conservative, they remain stuck within energy level sets, which prevents ergodicity. It is thus necessary to introduce another measure-preserving process, known as the heat bath or thermostat, that explores different energy levels; the simplest such process corresponds to sampling a velocity from a Gaussian distribution. Bringing these ingredients together, we thus have the following HMC algorithm: given, compute according to
Heat bath: sample a velocity according to a Gaussian, .
Shadow Hamiltonian dynamics: move along the Hamiltonian flow generated by the geodesic integrator, , where .
Metropolis correction: accept with probability , where . If accepted then set , otherwise set .
The above rudimentary HMC method (originally known as Hybrid Monte Carlo) was proposed for simulations in lattice quantum chromodynamics with being the special unitary group, , and used a Hamiltonian dynamics ingeniously constructed from the Maurer-Cartan frame to compute the partition function of discretised gauge theories [Duane:1987]. This method has later been applied in molecular dynamics and statistics [rousset2010free, betancourt2017conceptual, neal1992bayesian, cances2007theoretical, Neal:2011].
While the above discussion provides a justification for the use of Hamiltonian mechanics, a more constructive argument from first principles can also be given. From the family of measure-preserving dynamics, which as we have seen can be written as (recall (34)), we want to identify those suited to practical implementations (here could be any distribution on some space having the target has a marginal). Only for the simplest distributions we can hope to find brackets for which the flow of is tractable. Instead, the standard approach to geometrically integrate this flow relies as before on splitting methods, which effectively decompose into simpler components by decomposing the reference measure from the density and taking advantage of any product structure of the density, so that .
There are three critical properties underpinning the success of HMC in practice. The first two are the preservation of the reference measure and the existence of a conserved shadow Hamiltonian for the numerical method. These imply that we remain close to preserving along the flow, and in particular leads to Jacobian-free Metropolis corrections with good acceptance rates for distant proposals (see [fang2014compressible] for examples of schemes with Jacobian corrections). Achieving these properties yield strong constraints on the choice of [barp2020bracket]; the shadow property is essentially exclusive to Poisson systems, for which the conservation of a common reference measure is equivalent to the triviality of the modular class in the first Poisson cohomology group [weinstein1997modular]. In particular, Poisson brackets that admit such an invariant measure have been carefully analysed and are known as unimodular; the only unimodular Poisson bracket that can be constructed on general manifolds seems to be precisely the symplectic one.
The third critical property is the existence of splittings methods for which all the composing flows are either tractable or have adequate approximations, namely the geodesic integrators. Indeed, as we have seen, the flow —induced by the potential —is always tractable, independently of the complexity of the target density; this is possible mainly due to the extra “symmetries” resulting from implementing the flow on a higher-dimensional space rather than . On the other hand, one key consideration for the tractability of the the geodesic flow —induced by the kinetic energy —is the choice of Riemannian metric; for most cases, it is numerically hard to implement since several implicit equations need to be solved. In general, it is desirable to use a Riemannian metric that reflects the intrinsic symmetries of the sample space, mathematically described by a Lie group action. Indeed, by using an invariant Riemannian metric, one greatly simplifies the equations of motion of the geodesic flow, reducing the usual nd order Euler-Langrange equations to the 1st order Euler-Arnold equations [holm1998euler, modin2010geodesics, barp2019hamiltonianB], with tractable solutions in many cases of interest, e.g., for naturally reductive homogeneous spaces; including