In this paper, we study the continuous time underdamped Langevin diffusion represented by the following stochastic differential equation (SDE):
where , is a twice continuously-differentiable function and represents standard Brownian motion in . Under fairly mild conditions, it can be shown that the invariant distribution of the continuous-time process (1) is proportional to . Thus the marginal distribution of is proportional to . There is a discretized version of (1) which can be implemented algorithmically, and provides a useful way to sample from when the normalization constant is not known.
Our main result establishes the convergence of (1) as well as its discretization, to the invariant distribution. This provides explicit rates for sampling from log-smooth and strongly log-concave distributions using the underdamped Langevin MCMC algorithm (Algorithm 1).
Underdamped Langevin diffusion is particularly interesting because it contains a Hamiltonian component, and its discretization can be viewed as a form of Hamiltonian MCMC. Hamiltonian MCMC (see review of HMC in [Neal, 2011, Betancourt et al., 2017]) has been empirically observed to converge faster to the invariant distribution compared to standard Langevin MCMC which is a discretization of overdamped Langevin diffusion,
the first order SDE corresponding to the high friction limit of (1). This paper provides a non-asymptotic quantitative explanation for this statement.
1.1 Related Work
The first explicit proof of non-asymptotic convergence of overdamped Langevin MCMC for log-smooth and strongly log-concave distributions was given by [Dalalyan, 2017], where it was shown that discrete, overdamped Langevin diffusion achieves error, in total variation distance, in steps. Following this, [Durmus and Moulines, 2016] proved that the same algorithm achieves error, in 2-Wasserstein distance, in steps. [Cheng and Bartlett, 2017] obtained results similar to those in [Dalalyan, 2017] when the error is measured by KL-divergence. Recently [Raginsky et al., 2017, Dalalyan and Karagulyan, 2017] also analyzed convergence of overdamped Langevin MCMC with stochastic gradient updates. Asymptotic guarantees for overdamped Langevin MCMC was established much earlier in [Gelfand and Mitter, 1991, Roberts and Tweedie, 1996].
Hamiltonian Monte Carlo (HMC) is a broad class of algorithms which involve Hamiltonian dynamics in some form. We refer to [Ma et al., 2015] for a survey of the results in this area. Among these, the variant studied in this paper (Algorithm 1), based on the discretization of (1), has a natural physical interpretation as the evolution of a particle’s dynamics under a force field and drag. This equation was first proposed by [Kramers, 1940] in the context of chemical reactions. The continuous-time process has been studied extensively [Hérau, 2002, Villani, 2009, Eberle et al., 2017, Gorham et al., 2016, Baudoin, 2016, Bolley et al., 2010, Calogero, 2012, Dolbeault et al., 2015, Mischler and Mouhot, 2014].
However, to the best of our knowledge there has been no prior polynomial-in-dimension convergence result for any version of HMC under a log-smooth or strongly log-concave assumption for the target distribution111Following the first version of this paper, two recent papers also independently analyzed and provided non-asymptotic guarantees for different versions of HMC [Mangoubi and Smith, 2017, Lee and Vempala, 2017] . Most closely related to our work is the recent paper Eberle et al.  that demonstrated a contraction property of the continuous-time process (1). That result deals, however, with a much larger class of functions, and because of this the distance to the invariant distribution scales exponentially with dimension .
Our main contribution in this paper is to prove that Algorithm 1, a variant of HMC algorithm, converges to error in 2-Wasserstein distance after iterations, under the assumption that the target distribution is of the form , where is smooth and strongly convex (see section 1.4.1), with denoting the condition number. Compared to the results of Durmus and Moulines  on the convergence of Langevin MCMC in in iterations, this is an improvement in both and
. We also analyze the convergence of chain when we have noisy gradients with bounded variance and establish non-asymptotic convergence guarantees in this setting.
1.3 Organization of the Paper
In the next subsection we establish the notation and assumptions that we use throughout the paper. In Section 2 we present the discretized version of (1) and state our main results for convergence to the invariant distribution. Section 3 then establishes exponential convergence for the continuous-time process and in Section 4 we show how to control the discretization error. Finally in Section 5 we prove the convergence of the discretization of (1). We defer technical lemmas to the appendix.
1.4 Notation and Definitions
In this section, we present basic definitions and notational conventions. Throughout, we let
denotes the Euclidean norm, for a vector.
1.4.1 Assumption on
We make the following assumptions regarding the function .
The function is twice continuously-differentiable on and has Lipschitz continuous gradients; that is, there exists a positive constant such that for all we have
is -strongly convex, that is, there exists a positive constant such that for all ,
It is fairly easy to show that under these two assumptions the Hessian of is positive definite throughout its domain, with . We define as the condition number. Throughout the paper we denote the minimum of by . Finally, we assume that we have a gradient oracle ; that is, we have access to for all .
1.4.2 Coupling and Wasserstein Distance
Denote by the Borel -field of
. Given probability measuresand on , we define a transference plan between and as a probability measure on such that for all sets , and . We denote
as the set of all transference plans. A pair of random variablesis called a coupling if there exists a such that are distributed according to . (With some abuse of notation, we will also refer to as the coupling.)
We define the Wasserstein distance of order two between a pair of probability measures as follows:
Finally we denote by the set of transference plans that achieve the infimum in the definition of the Wasserstein distance between and (for more properties of see [Villani, 2008]).
1.4.3 Underdamped Langevin Diffusion
Throughout the paper we use to denote standard Brownian motion [Mörters and Peres, 2010]. Next we set up the notation specific to the continuous and discrete processes that we study in this paper.
Consider the exact underdamped Langevin diffusion defined by the SDE (1), with an initial condition for some distribution on . Let denote the distribution of and let denote the operator that maps from to :
One step of the discrete underdamped Langevin diffusion is defined by the SDE
with an initial condition . Let and be defined analogously to and for .
Note 1: The discrete update differs from (1) by using instead of in the drift of .
1.4.4 Stationary Distributions
Let . We let be the distribution of when .
The underdamped Langevin MCMC algorithm that we analyze in this paper in shown in Algorithm 1.
The random vector , conditioned on
, has a Gaussian distribution with conditional mean and covariance obtained from the following computations:
2.2 Main Result
The dependence of the runtime on is thus , which is a significant improvement over the corresponding runtime of (overdamped) Langevin diffusion in Durmus and Moulines .
2.2.1 Result with Stochastic Gradients
Now we state convergence guarantees when we have access to noisy gradients, , where is a independent random variable that satisfies
The noise is unbiased – .
The noise has bounded variance –
Each step of the dynamics is now driven by the SDE,
with an initial condition . Let and be defined analogously to and for in Section 1.4.3.
Theorem 3 (Proved in Appendix D).
Note that when the variance in the gradients – is large we recover back the rate of overdamped Langevin diffusion and we need steps to achieve accuracy of in .
3 Convergence of the Continuous-Time Process
In this section we prove Theorem 5, which demonstrates a contraction for solutions of the SDE (1). We will use Theorem 5 along with a bound on the discretization error between (1) and (3) to establish guarantees for Algorithm 1.
Let and be two arbitrary points in . Let be the Dirac delta distribution at and let be the Dirac delta distribution at . We pick where is the smoothness parameter of the function and . Then for every , there exists a coupling such that
A similar objective function was used in Eberle et al.  to prove contraction.
Given this theorem it is fairly easy to establish the exponential convergence of the continuous-time process to the stationary distribution in .
Let be arbitrary distribution with . Let and be the distributions of and , respectively (i.e., the images of and under the map ). Then
Proof We let such that . For every we let be the coupling as prescribed by Theorem 5. Then we have,
where follows as the Wasserstein distance is defined by the optimal coupling and by the tower property of expectation, follows by applying Theorem 5 and finally follows by choice of to be the optimal coupling. One can verify that the random variables defines a valid coupling between and . Taking square roots completes the proof.
Lemma 8 (Sandwich Inequality).
The triangle inequality for the Euclidean norm implies that
Thus we also get convergence of to :
Proof [Proof of Lemma 8] Using Young’s inequality, we have
The other direction follows identical arguments, using instead the inequality
We now turn to the proof of Theorem 5.
Proof [Proof of Theorem 5] We will prove Theorem 5 in four steps. Our proof relies on a synchronous coupling argument, where and are coupled (trivially) through independent and , and through shared Brownian motion .
Step 1: Following the definition of (1), we get
The two processes are coupled synchronously which ensures that the Brownian motion terms cancel out. For simplicity, we define and . As is twice differentiable, by Taylor’s theorem we have
Using the definition of we obtain
Similarly we also have the following derivative for the position update:
Step 2: Using the result from Step 1, we get
Here denotes the concatenation of and .
Step 3: Note that for any vector the quadratic form is equal to
Let us define the symmetric matrix
. We now compute and lower bound the eigenvalues of the matrixby making use of an appropriate choice of the parameters and . The eigenvalues of are given by the characteristic equation
By invoking a standard result of linear algebra (stated in the Appendix as Lemma 18), this is equivalent to solving the equation
Next we diagonalize and get equations of the form
where with are the eigenvalues of . By the strong convexity and smoothness assumptions we have . We plug in our choice of parameters, and , to get the following solutions to the characteristic equation:
This ensures that the minimum eigenvalue of satisfies .
4 Discretization Analysis
In this section, we study the solutions of the discrete process (3) up to for some small . Here, represents a single step of the Langevin MCMC algorithm. In Theorem 9, we will bound the discretization error between the continuous-time process (1) and the discrete process (3) starting from the same initial distribution. In particular, we bound . This will be sufficient to get the convergence rate stated in Theorem 1. Recall the definition of and from (2).
Furthermore, we will assume for now that the kinetic energy (second moment of velocity) is bounded for the continuous-time process,
In this section, we will repeatedly use the following inequality:
which follows from Jensen’s inequality using the convexity of .
We now present our main discretization theorem:
Let and be as defined in (2) corresponding to the continuous-time and discrete-time processes respectively. Let be any initial distribution and assume wlog that the step size . As before we choose and . Then the distance between the continuous-time process and the discrete-time process is upper bounded by
Proof We will once again use a standard synchronous coupling argument, in which and are coupled through the same initial distribution and common Brownian motion .
First, we bound the error in velocity. By using the expression for and from Lemma 10, we have
where follows from the Lemma 10 and , follows from application of Jensen’s inequality, follows as , is by application of the -smoothness property of , follows from the definition of , follows from Jensen’s inequality and follows by the uniform upper bound on the kinetic energy assumed in (8), and proven in Lemma 12.
This completes the bound for the velocity variable. Next we bound the discretization error in the position variable:
where the first line is by coupling through the initial distribution , the second line is by Jensen’s inequality and the third inequality uses the preceding bound. Setting and by our choice of we have that the squared Wasserstein distance is bounded as
Given our assumption that is chosen to be smaller than 1, this gives the upper bound:
Taking square roots establishes the desired result.
5 Proof of Theorem 1
Having established the convergence rate for the continuous-time SDE (1) and having proved a discretization error bound in Section 4 we now put these together and establish our main result for underdamped Langevin MCMC.
By the triangle inequality for ,
Let us define . Then by applying (10) times we have:
This inequality follows as . We now bound both terms and at a level to bound the total error at a level . Note that choice of (by upper bound on in Lemma 12) ensures that,
To control it is enough to ensure that
In Lemma 13 we establish a bound on