Convergence Rate of Riemannian Hamiltonian Monte Carlo and Faster Polytope Volume Computation

10/17/2017 ∙ by Yin Tat Lee, et al. ∙ Georgia Institute of Technology 0

We give the first rigorous proof of the convergence of Riemannian Hamiltonian Monte Carlo, a general (and practical) method for sampling Gibbs distributions. Our analysis shows that the rate of convergence is bounded in terms of natural smoothness parameters of an associated Riemannian manifold. We then apply the method with the manifold defined by the log barrier function to the problems of (1) uniformly sampling a polytope and (2) computing its volume, the latter by extending Gaussian cooling to the manifold setting. In both cases, the total number of steps needed is O^*(mn^2/3), improving the state of the art. A key ingredient of our analysis is a proof of an analog of the KLS conjecture for Gibbs distributions over manifolds.

READ FULL TEXT VIEW PDF
POST COMMENT

Comments

There are no comments yet.

Authors

page 1

page 2

page 3

page 4

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

Hamiltonian dynamics provide an elegant alternative to Newtonian mechanics. The Hamiltonian , which captures jointly the potential and kinetic energy of a particle, is a function of its position and velocity. First-order differential equations describe the change in both.

As we review in Section 2, these equations preserve the Hamiltonian .

Riemannian Hamiltonian Monte Carlo (or RHMC) [26, 25][6, 7]

is a Markov Chain Monte Carlo method for sampling from a desired distribution. The target distribution is encoded in the definition of the Hamiltonian. Each step of the method consists of the following: At a current point

,

  1. Pick a random velocity according to a local distribution defined by

    (in the simplest setting, this is the standard Gaussian distribution for every

    ).

  2. Move along the Hamiltonian curve defined by Hamiltonian dynamics at for time (distance) .

For a suitable choice of , the marginal distribution of the current point approaches the desired target distribution. Conceptually, the main advantage of RHMC is that it does not require a Metropolis filter (as in the Metropolis-Hastings method) and its step sizes are therefore not severely limited even in high dimension.

Over the past two decades, RHMC has become very popular in Statistics and Machine Learning, being applied to Bayesian learning, to evaluate expectations and likelihood of large models by sampling from the appropriate Gibbs distribution, etc. It has been reported to significantly outperform other known methods

[2, 25] and much effort has been made to make each step efficient by the use of numerical methods for solving ODEs.

In spite of all these developments and the remarkable empirical popularity of RHMC, analyzing its rate of convergence and thus rigorously explaining its success has remained an open question.

1.1 Results

In this paper, we analyze the mixing rate of Hamiltonian Monte Carlo for a general function as a Gibbs sampler, i.e., to generate samples from the density proportional to . The corresponding Hamiltonian is for some metric . We show that for in a compact manifold, the conductance of the Markov chain is bounded in terms of a few parameters of the metric and the function . The parameters and resulting bounds are given in Corollary 28 and Theorem 30. Roughly speaking, the guarantee says that Hamiltonian Monte Carlo mixes in polynomial time for smooth Hamiltonians. We note that these guarantees use only the smoothness and Cheeger constant (expansion) of the function, without any convexity type assumptions. Thus, they might provide insight in nonconvex settings where (R)HMC is often applied.

We then focus on logconcave densities in , i.e., is a convex function. This class of functions appears naturally in many contexts and is known to be sampleable in polynomial-time given access to a function value oracle. For logconcave densities, the current fastest sampling algorithms use function calls, even for uniform sampling [20, 23], and

oracle calls given a warm start after appropriate rounding (linear transformation)

[14]. In the prototypical setting of uniform sampling from a polytope , with inequalities, the general complexity is no better, with each function evaluation taking arithmetic operations, for an overall complexity of in the worst case and after rounding from a warm start. The work of Kannan and Narayanan [11] gives an algorithm of complexity from an arbitrary start and from a warm start (here is the matrix multiplication exponent), which is better than the general case when the number of facets is not too large. This was recently improved to from a warm start [13]; the subquadratic complexity for the number of steps is significant since all known general oracle methods cannot below a quadratic number of steps. The leading algorithms and their guarantees are summarized in Table 1.

Year Algorithm Steps Cost per step
1997 [10] Ball walk#
2003 [21] Hit-and-run#
2009 [11] Dikin walk
2016 [13] Geodesic walk
2016 [15] Ball walk#
This paper RHMC
Table 1: The complexity of uniform polytope sampling from a warm start, where each step of every algorithm uses bit of randomness. The entries marked # are for general convex bodies presented by oracles, while the rest are for polytopes.

In this paper, using RHMC, we improve the complexity of sampling polytopes. In fact we do this for a general family of Gibbs distributions, of the form where is a convex function over a polytope. When is the standard logarithmic barrier function and is its Hessian, we get a sampling method that mixes in only steps from a warm start! When , the resulting distribution is very close to uniform over the polytope.

Theorem 1.

Let be the logarithmic barrier for a polytope with constraints and variables. Hamiltonian Monte Carlo applied to the function and the metric given by with appropriate step size mixes in

steps where each step is the solution of a Hamiltonian ODE.

In recent independent work, Mangoubi and Smith [24] analyze Euclidean HMC in the oracle setting, i.e., assuming an oracle for evaluating . Their analysis formally gives a dimension-independent convergence rate based on certain regularity assumptions such as strong convexity and smoothness of the Hamiltonian . Unfortunately, these assumptions do not hold for the polytope sampling problem.

An important application of sampling is integration. The complexity of integration for general logconcave functions is also oracle calls. For polytopes, the most natural questions is computing its volume. For this problem, the current best complexity is , where the factor of

is the complexity of checking membership in a polytope. Thus, even for explicitly specified polytopes, the complexity of estimating the volume from previous work is asymptotically the same as that for a general convex body given by a membership oracle. Here we obtain a volume algorithm with complexity

, improving substantially on previous algorithms. The volume algorithm is based using Hamiltonian Monte Carlo for sampling from a sequence of Gibbs distributions over polytopes. We remark that in the case when 111We suspect that the LS barrier [12] might be used to get a faster algorithm even in the regime even if is sub-exponential. However, our proof requires a delicate estimate of the fourth derivative of the barrier functions. Therefore, such a result either requires a new proof or a unpleasantly long version of the current proof., the final complexity is arithmetic operations, improving by more than a quadratic factor in the dimension over the previous best complexity of operations for arbitrary polytopes. These results and prior developments are given in Table 2.

Year Algorithm Steps Cost per step
1989 [5] DFK
1989-93 [17, 4, 1, 18, 19] many improvements
1997 [10] DFK, Speedy walk, isotropy
2003 [22] Annealing, hit-and-run
2015 [3] Gaussian Cooling*
This paper RHMC + Gaussian Cooling
Table 2: The complexity of volume estimation, each step uses bit of randomness, all except the last for general convex bodies (the result marked * is for well-rounded convex bodies). The current paper applies to general polytopes, and is the first improvement utilizing their structure.
Theorem 2.

For any polytope with constraints and variables, and any , the Hamiltonian volume algorithm estimates the volume of to within multiplicative factor using steps where each step consists of solving a first-order ODE and takes time and is the bit complexity222 where is the largest absolute value of the determinant of a square sub-matrix of . of the polytope.

A key ingredient in the analysis of RHMC is a new isoperimetric inequality for Gibbs distributions over manifolds. This inequality can be seen as an evidence of a manifold version of the KLS hyperplane conjecture. For the family of Gibbs distributions induced by convex functions with convex Hessians, the expansion is within a constant factor of that of a hyperplane cut. This result might be of independent interest.

1.2 Approach and contributions

Traditional methods to sample from distributions in are based on random walks that take straight line steps (grid walk, ball walk, hit-and-run). While this leads to polynomial-time convergence for logconcave distributions, the length of each step has to be small due to boundary effects, and a Metropolis filter (rejection sampling) has to be applied to ensure the limiting distribution is the desired one. These walks cannot afford a step of length greater than for a distribution in isotropic position, and take a quadratic number of steps even for the hypercube. The Dikin walk for polytopes [11], which explicitly takes into account the boundary of polytope at each step, has a varying step size, but still runs into similar issues and the bound on its convergence rate is for a polytope with facets.

In a recent paper [13], we introduced the geodesic walk. Rather than using straight lines in Euclidean space, each step of the walk is along a geodesic (locally shortest path) of a Riemannian metric. More precisely, each step first makes a deterministic move depending on the current point (drift), then moves along a geodesic in a random initial direction and finally uses a Metropolis filter. Each step can be computed by solving a first-order ODE. Due to the combination of drift and geodesic, the local

-step distributions are smoother than that of the Dikin walk and larger steps can be taken while keeping a bounded rejection probability for the filter. For sampling polytopes, the manifold/metric defined by the standard log barrier gives a convergence rate of

, going below the quadratic (or higher) bound of all previous sampling methods.

One major difficulty with geodesic walk is ensuring the stationary distribution is uniform. For high dimensional problems, this necessitates taking a sufficiently small step size and then rejecting some samples according to the desired transition probabilities according to Metropolis filter. Unfortunately, computing these transition probabilities can be very expensive. For the geodesic walk, it entails solving an size matrix ODE.

Hamiltonian Monte Carlo bears some similarity to the geodesic walk — each step is a random (non-linear) curve. But the Hamiltonian-preserving nature of the process obviates the most expensive ingredient, Metropolis filter. Due to this, the step size can be made longer, and as a result we obtain a faster sampling algorithm for polytopes that mixes in steps (the per-step complexity remains essentially the same, needing the solution of an ODE).

To get a faster algorithm for volume computation, we extends the analysis to a general family of Gibbs distributions, including where is the standard log-barrier and . We show that the smoothness we need for the sampling corresponding to a variant of self-concordance defined in Definition 46. Furthermore, we establish an isoperimetric inequality for this class of functions. This can be viewed as an extension of the KLS hyperplane conjecture from Euclidean to Riemannian metrics (the analogous case in Euclidean space to what we prove here is the isoperimetry of the Gaussian density function multiplied by any logconcave function, a case for which the KLS conjecture holds). The mixing rate for this family of functions is for .

Finally, we study the Gaussian Cooling schedule of [3]. We show that in the manifold setting, the Gaussian distribution can be replaced by . Moreover, the speed of Gaussian Cooling depends on the “thin-shell” constant of the manifold and classical self-concordance of .

Combining all of these ideas, we obtain a faster algorithm for polytope volume computation. The resulting complexity of polytope volume computation is the same as that of sampling uniformly from a warm start: steps. To illustrate the improvement, for polytopes with facets, the new bound is while the previous best bound was .

1.3 Practicality

From the experiments, the ball walk/hit-and-run seem to mix in steps, the geodesic walk seems to mix in sublinear number of steps (due to the Metropolis filter bottleneck) and RHMC seems to mix in only polylogarithmic number of steps. One advantage of RHMC compared to the geodesic walk is that it does not require the expensive Metropolis filter that involves solving matrix ODEs. In the future, we plan to do an empirical comparison study of different sampling algorithms. We are hopeful that using RHMC we might finally be able to sample from polytopes in millions of dimensions after more than three decades of research on this topic!

1.4 Notation

Throughout the paper, we use lowercase letter for vectors and vector fields and uppercase letter for matrices and tensors. We use

to denote coordinate vectors. We use for the usual derivative, e.g. is the derivative of some function along a curve parametrized by , we use for the usual partial derivative. We use for the directional derivative of at along . We use for the usual gradient and the connection (manifold derivative, defined in Section D which takes into account the local metric), for the directional derivative of a vector with respect to the vector (or vector field) (again, defined in Section D), and if the curve is clear from the context. We use for the local metric. Given a point , is a matrix with entries Its inverse has entries Also, is the dimension, the number of inequalities. We use for the total variation (or ) distance between two distributions.

1.5 Organization

In Section 2, we define the Riemannian Hamiltonian Monte Carlo and study its basic properties such as time-reversibility. In Section 3, we give the the first convergence rate analysis of RHMC. However, the convergence rate is weak for the sampling applications (it is polynomial, but not better than previous methods). In Section 4, we introduce more parameters and use them to get a tighter analysis of RHMC. In Section 5, we study the isoperimetric constant of under the metric . In Section 6, we study the generalized Gaussian Cooling schedule and its relation to the thin-shell constant. Finally, in Section 7, we compute the parameters we need for the log barrier function.

2 Basics of Hamiltonian Monte Carlo

In this section, we define the Hamiltonian Monte Carlo method for sampling from a general distribution . Hamiltonian Monte Carlo uses curves instead of straight lines and this makes the walk time-reversible even if the target distribution is not uniform, with no need for a rejection sampling step. In contrast, classical approaches such as the ball walk require an explicit rejection step to converge to a desired stationary distribution.

Definition 3.

Given a continuous, twice-differentiable function (called the Hamiltonian, which often corresponds to the total energy of a system) where is the domain of , we say follows a Hamiltonian curve if it satisfies the Hamiltonian equations

(2.1)

We define the map where the follows the Hamiltonian curve with the initial condition

Hamiltonian Monte Carlo is the result of a sequence of randomly generated Hamiltonian curves.

algocf[htbp]    

Lemma 4 (Energy Conservation).

For any Hamiltonian curve , we have that

Proof.

Note that

Lemma 5 (Measure Preservation).

For any , we have that

where is the Jacobian of the map at the point .

Proof.

Let be a family of Hamiltonian curves given by . We write

By differentiating the Hamiltonian equations (2.1) w.r.t. , we have that

This can be captured by the following matrix ODE

using the equation

Therefore, . Next, we observe that

Hence,

Using the previous two lemmas, we now show that Hamiltonian Monte Carlo indeed converges to the desired distribution.

Lemma 6 (Time reversibility).

Let denote the probability density of one step of the Hamiltonian Monte Carlo starting at . We have that

for almost everywhere in and where .

Proof.

Fix and . Let be the component of . Let and . Then,

We note that this formula assumed that is invertible. Sard’s theorem showed that is measure where . Therefore, the formula is correct except for a measure zero subset.

By reversing time for the Hamiltonian curve, we have that for the same ,

(2.2)

where denotes the component of and in the first and second sum respectively.

We compare the first terms in both equations. Let . Since and , the inverse function theorem shows that is the inverse map of . Hence, we have that

Therefore, we have that and . Hence, we have that

Using that (Lemma 5), we have that

Hence, we have that

where we used that (Lemma 4) at the end.

For the second term in (2.2), by the same calculation, we have that

Combining both terms we have the result. ∎

The main challenge in analyzing Hamiltonian Monte Carlo is to bound its mixing time.

2.1 Hamiltonian Monte Carlo on Riemannian manifolds

Suppose we want to sample from the distribution . We define the following energy function :

(2.3)

One can view as the location and as the velocity. The following lemma shows that the first variable in the Hamiltonian curve satisfies a second-order differential equation. When we view the domain as a manifold, this equation is simply , namely, acts like a particle under the force field . (For relevant background on manifolds, we refer the reader to Appendix D).

Lemma 7.

In Euclidean coordinates, The Hamiltonian equation for (2.3) can be rewritten as

where and is the Levi-Civita connection on the manifold with metric .

Proof.

From the definition of the Hamiltonian curve, we have that

Putting the two equations together, we have that

Hence,

(2.4)

Using the formula of Christoffel symbols

we have that

Putting this into (2.4) gives

Motivated by this, we define the Hamiltonian map as the first component of the Hamiltonian dynamics operator defined earlier. For the reader familiar with Riemannian geometry, this is similar to the exponential map (for background, see Appendix D).

Definition 8.

Let where be the solution of the Hamiltonian equation with initial conditions and . We also denote by .

We now give two examples of Hamiltonian Monte Carlo.

Example 9.

When

, the Hamiltonian curve acts like stochastic gradient descent for the function

with each random perturbation drawn from a standard Gaussian.

When , the Hamiltonian curve acts like a stochastic Newton curve for the function :

where the volumetric function .

Next we derive a formula for the transition probability in Euclidean coordinates.

Lemma 10.

For any and , the probability density of the 1-step distribution from is given by

(2.5)

where is the Jacobian of the Hamiltonian map .

Proof.

We prove the formula by separately considering each s.t. , then summing up. In the tangent space , the point follows a Gaussian step. Therefore, the probability density of in is as follows:

Let and be defined by . Here is the same set as but endowed with the Euclidean metric. Hence, we have

The result follows from and

3 Convergence of Riemannian Hamiltonian Monte Carlo

Hamiltonian Monte Carlo is a Markov chain on a manifold whose stationary stationary distribution has density proportional to . We will bound the conductance of this Markov chain and thereby its mixing time to converge to the stationary distribution. Bounding conductance involves showing (a) the induced metric on the state space satisfies a strong isoperimetric inequality and (b) two points that are close in metric distance are also close in probabilistic distance, i.e., the one-step distributions from two nearby points have large overlap. In this section and the next, we present general conductance bounds using parameters determined by the associated manifold. In Section 7, we bound these parameters for the manifold corresponding to the logarithmic barrier in a polytope.

3.1 Basics of geometric Markov chains

For completeness, we will discuss some standard techniques in geometric random walks in this subsection. For a Markov chain with state space , stationary distribution and next step distribution for any , the conductance of the Markov chain is

The conductance of an ergodic Markov chain allows us to bound its mixing time, i.e., the rate of convergence to its stationary distribution, e.g., via the following theorem of Lovász and Simonovits.

Theorem 11 ([19]).

Let be the distribution of the current point after steps of a Markov chain with stationary distribution and conductance at least starting from initial distribution For any ,

Definition 12.

The isoperimetry of a metric space with target distribution is

where is the shortest path distance in .

The proof of the following theorem follows the standard outline for geometric random walks (see e.g., [29]).

Lemma 13.

Given a metric space and a time-reversible Markov chain on with stationary distribution . Fix any . Suppose that for any with , we have that . Then, the conductance of the Markov chain is .

Proof.

Let be any measurable subset of . Then our goal is to bound the conductance of the Markov chain

Since the Markov chain is time-reversible (For any two subsets , ), we can write the numerator of the left hand side above as

Define

Without loss of generality, we can assume that and (if not, and hence the conductance is .)

Next, we note that for any two points and , . Therefore, by the assumption, we have that . Therefore, by the definition of , we have that

Going back to the conductance,

Therefore, the conductance of the Markov chain is . ∎

Combining Theorem 13 and Lemma 13 gives the following result for bounding mixing time of general geometric random walk.

Lemma 14.

Given a metric space and a time-reversible Markov chain on with stationary distribution . Suppose that there exist and such that

  1. For any with , we have that .

  2. For any , we have that

Let be the distribution of the current point after steps of a Markov chain with stationary distribution starting from initial distribution For any ,

3.2 Overlap of one-step distributions

The mixing of the walk depends on smoothness parameters of the manifold and the functions used to define the Hamiltonian. Since each step of our walk involves a Gaussian vector, many smoothness parameters depend on choices of the random vector. Formally, let be the Hamiltonian curve used in a step of Hamiltonian Monte Carlo. In the analysis, we need a large fraction of Hamiltonian curves from any point on the manifold to be well-behaved. A Hamiltonian curve can be problematic when its velocity or length is too large and this happens with non-zero probability. Rather than using supremum bounds for our smoothness parameters, it suffices to use large probability bounds, where the probability is over the random choice of Hamiltonian curve at any point . To capture the notion that “most Hamiltonian curves are well-behaved”, we use an auxiliary function which assigns a real number to each Hamiltonian curve and measures how “good” the curve is. The smoothness parameters assume that this function is bounded and Lipshitz. One possible choice of such is which measures the initial velocity, but this will give us a weaker bound. Instead, we use the following which jointly bounds the change in position (first term) and change in velocity (second term).

Definition 15.

An auxiliary function is a non-negative real-valued function on the set of Hamiltonian curves, i.e., maps , with bounded parameters , such that

  1. For any variation of a Hamiltonian curve (see Definition 18) with , we have

  2. For any , where indicates a random Hamiltonian curve starting at , chosen by picking a random Gaussian initial velocity according to the local metric at .

3.2.1 Proof Outline

To bound the conductance of HMC, we need to show that one-step distributions from nearby points have large overlap for reasonably large step size . To this end, recall that the probability density of going from to is given by the following formula

In Section 3.2.2, we introduce the concept of variations of Hamiltonian curves and use it to bound . We can show that is in fact close to