 # Underdamped Langevin MCMC: A non-asymptotic analysis

We study the underdamped Langevin diffusion when the log of the target distribution is smooth and strongly concave. We present a MCMC algorithm based on its discretization and show that it achieves ε error (in 2-Wasserstein distance) in O(√(d)/ε) steps. This is a significant improvement over the best known rate for overdamped Langevin MCMC, which is O(d/ε^2) steps under the same smoothness/concavity assumptions. The underdamped Langevin MCMC scheme can be viewed as a version of Hamiltonian Monte Carlo (HMC) which has been observed to outperform overdamped Langevin MCMC methods in a number of application areas. We provide quantitative rates that support this empirical wisdom.

## Authors

##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction

In this paper, we study the continuous time underdamped Langevin diffusion represented by the following stochastic differential equation (SDE):

 dvt =−γvtdt−u∇f(xt)dt+(√2γu)dBt (1) dxt =vtdt,

where , is a twice continuously-differentiable function and represents standard Brownian motion in . Under fairly mild conditions, it can be shown that the invariant distribution of the continuous-time process (1) is proportional to . Thus the marginal distribution of is proportional to . There is a discretized version of (1) which can be implemented algorithmically, and provides a useful way to sample from when the normalization constant is not known.

Our main result establishes the convergence of (1) as well as its discretization, to the invariant distribution. This provides explicit rates for sampling from log-smooth and strongly log-concave distributions using the underdamped Langevin MCMC algorithm (Algorithm 1).

Underdamped Langevin diffusion is particularly interesting because it contains a Hamiltonian component, and its discretization can be viewed as a form of Hamiltonian MCMC. Hamiltonian MCMC (see review of HMC in [Neal, 2011, Betancourt et al., 2017]) has been empirically observed to converge faster to the invariant distribution compared to standard Langevin MCMC which is a discretization of overdamped Langevin diffusion,

 dxt=−∇f(xt)dt+√2dBt

the first order SDE corresponding to the high friction limit of (1). This paper provides a non-asymptotic quantitative explanation for this statement.

### 1.1 Related Work

The first explicit proof of non-asymptotic convergence of overdamped Langevin MCMC for log-smooth and strongly log-concave distributions was given by [Dalalyan, 2017], where it was shown that discrete, overdamped Langevin diffusion achieves error, in total variation distance, in steps. Following this, [Durmus and Moulines, 2016] proved that the same algorithm achieves error, in 2-Wasserstein distance, in steps. [Cheng and Bartlett, 2017] obtained results similar to those in [Dalalyan, 2017] when the error is measured by KL-divergence. Recently [Raginsky et al., 2017, Dalalyan and Karagulyan, 2017] also analyzed convergence of overdamped Langevin MCMC with stochastic gradient updates. Asymptotic guarantees for overdamped Langevin MCMC was established much earlier in [Gelfand and Mitter, 1991, Roberts and Tweedie, 1996].

Hamiltonian Monte Carlo (HMC) is a broad class of algorithms which involve Hamiltonian dynamics in some form. We refer to [Ma et al., 2015] for a survey of the results in this area. Among these, the variant studied in this paper (Algorithm 1), based on the discretization of (1), has a natural physical interpretation as the evolution of a particle’s dynamics under a force field and drag. This equation was first proposed by [Kramers, 1940] in the context of chemical reactions. The continuous-time process has been studied extensively [Hérau, 2002, Villani, 2009, Eberle et al., 2017, Gorham et al., 2016, Baudoin, 2016, Bolley et al., 2010, Calogero, 2012, Dolbeault et al., 2015, Mischler and Mouhot, 2014].

However, to the best of our knowledge there has been no prior polynomial-in-dimension convergence result for any version of HMC under a log-smooth or strongly log-concave assumption for the target distribution111Following the first version of this paper, two recent papers also independently analyzed and provided non-asymptotic guarantees for different versions of HMC [Mangoubi and Smith, 2017, Lee and Vempala, 2017] . Most closely related to our work is the recent paper Eberle et al.  that demonstrated a contraction property of the continuous-time process (1). That result deals, however, with a much larger class of functions, and because of this the distance to the invariant distribution scales exponentially with dimension .

Also related is the recent work on understanding acceleration of first-order optimization methods as discretizations of second-order differential equations [Su et al., 2014, Krichene et al., 2015, Wibisono et al., 2016].

### 1.2 Contributions

Our main contribution in this paper is to prove that Algorithm 1, a variant of HMC algorithm, converges to error in 2-Wasserstein distance after iterations, under the assumption that the target distribution is of the form , where is smooth and strongly convex (see section 1.4.1), with denoting the condition number. Compared to the results of Durmus and Moulines  on the convergence of Langevin MCMC in in iterations, this is an improvement in both and

. We also analyze the convergence of chain when we have noisy gradients with bounded variance and establish non-asymptotic convergence guarantees in this setting.

### 1.3 Organization of the Paper

In the next subsection we establish the notation and assumptions that we use throughout the paper. In Section 2 we present the discretized version of (1) and state our main results for convergence to the invariant distribution. Section 3 then establishes exponential convergence for the continuous-time process and in Section 4 we show how to control the discretization error. Finally in Section 5 we prove the convergence of the discretization of (1). We defer technical lemmas to the appendix.

### 1.4 Notation and Definitions

In this section, we present basic definitions and notational conventions. Throughout, we let

denotes the Euclidean norm, for a vector

.

#### 1.4.1 Assumption on f

We make the following assumptions regarding the function .

1. The function is twice continuously-differentiable on and has Lipschitz continuous gradients; that is, there exists a positive constant such that for all we have

 ∥∇f(x)−∇f(y)∥2≤L∥x−y∥2.
2. is -strongly convex, that is, there exists a positive constant such that for all ,

 f(y)≥f(x)+⟨∇f(x),y−x⟩+m2∥x−y∥22.

It is fairly easy to show that under these two assumptions the Hessian of is positive definite throughout its domain, with . We define as the condition number. Throughout the paper we denote the minimum of by . Finally, we assume that we have a gradient oracle ; that is, we have access to for all .

#### 1.4.2 Coupling and Wasserstein Distance

Denote by the Borel -field of

. Given probability measures

and on , we define a transference plan between and as a probability measure on such that for all sets , and . We denote

as the set of all transference plans. A pair of random variables

is called a coupling if there exists a such that are distributed according to . (With some abuse of notation, we will also refer to as the coupling.)

We define the Wasserstein distance of order two between a pair of probability measures as follows:

 W2(μ,ν)≜(infζ∈Γ(μ,ν)∫∥x−y∥22dζ(x,y))1/2.

Finally we denote by the set of transference plans that achieve the infimum in the definition of the Wasserstein distance between and (for more properties of see [Villani, 2008]).

#### 1.4.3 Underdamped Langevin Diffusion

Throughout the paper we use to denote standard Brownian motion [Mörters and Peres, 2010]. Next we set up the notation specific to the continuous and discrete processes that we study in this paper.

1. Consider the exact underdamped Langevin diffusion defined by the SDE (1), with an initial condition for some distribution on . Let denote the distribution of and let denote the operator that maps from to :

 Φtp0=pt. (2)
2. One step of the discrete underdamped Langevin diffusion is defined by the SDE

 d~vt =−γ~vtdt−u∇f(~x0)dt+(√2γu)dBt (3) d~xt =~vsdt,

with an initial condition . Let and be defined analogously to and for .

Note 1: The discrete update differs from (1) by using instead of in the drift of .

Note 2: We will only be analyzing the solutions to (3) for small . Think of an integral solution of (3) as a single step of the discrete Langevin MCMC.

#### 1.4.4 Stationary Distributions

Throughout the paper, we denote by the unique distribution which satisfies . It can be shown that is the unique invariant distribution of (1) (see, for example, Proposition 6.1 in [Pavliotis, 2016]).

Let . We let be the distribution of when .

## 2 Results

### 2.1 Algorithm

The underdamped Langevin MCMC algorithm that we analyze in this paper in shown in Algorithm 1.

The random vector , conditioned on

, has a Gaussian distribution with conditional mean and covariance obtained from the following computations:

 E[vi+1]=vie−2δ−12L(1−e−2δ)∇f(xi) E[xi+1]=xi+12(1−e−2δ)vi−12L(δ−12(1−e−2δ))∇f(xi) E[(xi+1−E[xi+1])(xi+1−E[xi+1])⊤]=1L[δ−14e−4δ−34+e−2δ]⋅Id×d E[(vi+1−E[vi+1])(vi+1−E[vi+1])⊤]=1L(1−e−4δ)⋅Id×d E[(xi+1−E[xi+1])(vi+1−E[vi+1])⊤]=12L[1+e−4δ−2e−2δ]⋅Id×d.

The distribution is obtained by integrating the discrete underdamped Langevin diffusion (3) up to time , with the specific choice of and . In other words, if is the distribution of , then . Refer to Lemma 11 in Appendix A for the derivation.

### 2.2 Main Result

###### Theorem 1.

Let be the distribution of the iterate of Algorithm 1 after steps starting with the initial distribution . Let the initial distance to optimum satisfy . If we set the step size to be

 δ=ε104κ√1d/m+D2,

and run Algorithm 1 for iterations with

 n≥(52κ2ε)⋅(√dm+D2)⋅log⎛⎜ ⎜⎝24(dm+D2)ε⎞⎟ ⎟⎠,

then we have the guarantee that

 W2(p(n),p∗)≤ε.
###### Remark 2.

The dependence of the runtime on is thus , which is a significant improvement over the corresponding runtime of (overdamped) Langevin diffusion in Durmus and Moulines .

We note that the factor can be shaved off by using a time-varying step size. We present this result as Theorem 14 in Appendix C. In neither theorem have we attempted to optimize the constants.

#### 2.2.1 Result with Stochastic Gradients

Now we state convergence guarantees when we have access to noisy gradients, , where is a independent random variable that satisfies

1. The noise is unbiased – .

2. The noise has bounded variance –

Each step of the dynamics is now driven by the SDE,

 d^vt =−γ^vtdt−u^∇f(^x0)dt+(√2γu)dBt (4) d^xt =^vsdt,

with an initial condition . Let and be defined analogously to and for in Section 1.4.3.

###### Theorem 3 (Proved in Appendix D).

Let be the distribution of the iterate of Algorithm 2 (stated in Appendix D) after steps starting with the initial distribution . Let the initial distance to optimum satisfy . If we set the step size to be

 δ=min⎧⎪ ⎪⎨⎪ ⎪⎩εκ ⎷5479232(d/m+D2),ε2L21440σ2dκ⎫⎪ ⎪⎬⎪ ⎪⎭,

and run Algorithm 1 for iterations with

 n≥κδ⋅log⎛⎜ ⎜⎝36(dm+D2)ε⎞⎟ ⎟⎠,

then we have the guarantee that

 W2(p(n),p∗)≤ε.
###### Remark 4.

Note that when the variance in the gradients – is large we recover back the rate of overdamped Langevin diffusion and we need steps to achieve accuracy of in .

## 3 Convergence of the Continuous-Time Process

In this section we prove Theorem 5, which demonstrates a contraction for solutions of the SDE (1). We will use Theorem 5 along with a bound on the discretization error between (1) and (3) to establish guarantees for Algorithm 1.

###### Theorem 5.

Let and be two arbitrary points in . Let be the Dirac delta distribution at and let be the Dirac delta distribution at . We pick where is the smoothness parameter of the function and . Then for every , there exists a coupling such that

 E(xt,vt,yt,wt)∼ζt((x0,v0,y0,w0))[∥xt−yt∥22+∥(xt+vt)−(yt+wt)∥22] (5) ≤e−t/κ{∥x0−y0∥22+∥(x0+v0)−(y0+w0)∥22}.
###### Remark 6.

A similar objective function was used in Eberle et al.  to prove contraction.

Given this theorem it is fairly easy to establish the exponential convergence of the continuous-time process to the stationary distribution in .

###### Corollary 7.

Let be arbitrary distribution with . Let and be the distributions of and , respectively (i.e., the images of and under the map ). Then

 W2(Φtq0,q∗)≤e−t/2κW2(q0,q∗).

Proof  We let such that . For every we let be the coupling as prescribed by Theorem 5. Then we have,

 W22(qt,q∗) (i)≤E(x0,v0,y0,w0)∼ζ0[E(xt,vt,yt,wt)∼ζt(x0,v0,y0,w0)[∥xt−yt∥22+∥xt−yt+vt−wt∥22∣∣x0,y0,v0,w0]] (ii)≤E(x0,v0,y0,w0)∼ζ0[e−t/κ(∥x0−y0∥22+∥x0−y0+v0−w0∥22)] (iii)=e−t/κW22(q0,q∗),

where follows as the Wasserstein distance is defined by the optimal coupling and by the tower property of expectation, follows by applying Theorem 5 and finally follows by choice of to be the optimal coupling. One can verify that the random variables defines a valid coupling between and . Taking square roots completes the proof.

###### Lemma 8 (Sandwich Inequality).

The triangle inequality for the Euclidean norm implies that

 12W2(pt,p∗)≤W2(qt,q∗)≤2W2(pt,p∗). (6)

Thus we also get convergence of to :

 W2(Φtp0,p∗)≤4e−t/2κW2(p0,p∗).

Proof [Proof of Lemma 8] Using Young’s inequality, we have

 ∥x+v−(x′+v′)∥22≤2∥x−x′∥22+2∥v−v′∥22.

Let Then

 W2(qt,q∗) ≤√E(x,v,x′,v′)∼γt[∥x−x′∥22+∥x+v−(x′+v′)∥22] ≤√E(x,v,x′,v′)∼γt[3∥x−x′∥22+2∥v−v′∥22] ≤2√E(x,v,x′,v′)∼γt[∥x−x′∥22+∥v−v′∥22] =2W2(pt,p∗).

The other direction follows identical arguments, using instead the inequality

 ∥v−v′∥22≤2∥x+v−(x′+v′)∥22+2∥x−x′∥22.

We now turn to the proof of Theorem 5.

Proof [Proof of Theorem 5] We will prove Theorem 5 in four steps. Our proof relies on a synchronous coupling argument, where and are coupled (trivially) through independent and , and through shared Brownian motion .

Step 1: Following the definition of (1), we get

 ddt[(xt+vt)−(yt+wt)]= −(γ−1)vt−u∇f(xt)−{−(γ−1)wt−u∇f(yt)}.

The two processes are coupled synchronously which ensures that the Brownian motion terms cancel out. For simplicity, we define and . As is twice differentiable, by Taylor’s theorem we have

 ∇f(xt)−∇f(yt)=[∫10∇2f(xt+h(yt−xt))dh]≜Htzt.

Using the definition of we obtain

 ddt[zt+ψt]= −((γ−1)ψt+uHtzt).

Similarly we also have the following derivative for the position update:

 ddt[xt−yt] =ddt[zt]=ψt.

Step 2: Using the result from Step 1, we get

 ddt[∥zt+ψt∥22+∥zt∥22] =−2⟨(zt+ψt,zt),((γ−1)ψt+uHtzt,−ψt)⟩ =−2[zt+ψtzt][(γ−1)Id×duHt−(γ−1)Id×d−Id×dId×d]St[zt+ψtzt] (7)

Here denotes the concatenation of and .

Step 3: Note that for any vector the quadratic form is equal to

 x⊤Stx =x⊤(St+S⊤t2)x.

Let us define the symmetric matrix

. We now compute and lower bound the eigenvalues of the matrix

by making use of an appropriate choice of the parameters and . The eigenvalues of are given by the characteristic equation

 =0.

By invoking a standard result of linear algebra (stated in the Appendix as Lemma 18), this is equivalent to solving the equation

 det((γ−1−λ)(1−λ)Id×d−14(uHt−γId×d)2)=0.

Next we diagonalize and get equations of the form

 (γ−1−λ)(1−λ)−14(uΛj−γ)2=0,

where with are the eigenvalues of . By the strong convexity and smoothness assumptions we have . We plug in our choice of parameters, and , to get the following solutions to the characteristic equation:

 λ∗j=1±(1−Λj2L).

This ensures that the minimum eigenvalue of satisfies .

Step 4: Putting this together with our results in Step 2 we have the lower bound

 [zt+ψt,zt]⊤St[zt+ψt,zt] =[zt+ψt,zt]⊤Qt[zt+ψt,zt] ≥12κ[∥zt+ψt∥22+∥zt∥22].

Combining this with (7) yields

 ddt[∥zt+ψt∥22+∥zt∥22] ≤−1κ[∥zt+ψt∥22+∥zt∥22].

The convergence rate of Theorem 5 follows immediately from this result by applying Grönwall’s inequality (Corollary 3 in [Dragomir, 2003]).

## 4 Discretization Analysis

In this section, we study the solutions of the discrete process (3) up to for some small . Here, represents a single step of the Langevin MCMC algorithm. In Theorem 9, we will bound the discretization error between the continuous-time process (1) and the discrete process (3) starting from the same initial distribution. In particular, we bound . This will be sufficient to get the convergence rate stated in Theorem 1. Recall the definition of and from (2).

Furthermore, we will assume for now that the kinetic energy (second moment of velocity) is bounded for the continuous-time process,

 ∀t∈[0,δ]Ept[∥v∥22]≤EK. (8)

We derive an explicit bound on (in terms of problem parameters etc.) in Lemma 12 in Appendix B.

In this section, we will repeatedly use the following inequality:

 ∥∥∥∫t0vsds∥∥∥22=∥∥∥1t∫t0t⋅vsds∥∥∥22≤t∫t0∥vs∥22ds,

which follows from Jensen’s inequality using the convexity of .

We now present our main discretization theorem:

###### Theorem 9.

Let and be as defined in (2) corresponding to the continuous-time and discrete-time processes respectively. Let be any initial distribution and assume wlog that the step size . As before we choose and . Then the distance between the continuous-time process and the discrete-time process is upper bounded by

 W2(Φδp0,~Φδp0)≤δ2√2EK5.

Proof  We will once again use a standard synchronous coupling argument, in which and are coupled through the same initial distribution and common Brownian motion .

First, we bound the error in velocity. By using the expression for and from Lemma 10, we have

 E[∥vs−~vs∥22] (i)=E[∥∥∥u∫s0e−2(s−r)(∇f(xr)−∇f(x0))dr∥∥∥22] =u2E[∥∥∥∫s0e−2(s−r)(∇f(xr)−∇f(x0)dr)∥∥∥22] (ii)≤su2∫s0E[∥∥e−2(s−r)(∇f(xr)−∇f(x0))∥∥22]dr (vi)≤su2L2∫s0r(∫r0E[∥vw∥22]dw)dr =s4u2L2EK3,

where follows from the Lemma 10 and , follows from application of Jensen’s inequality, follows as , is by application of the -smoothness property of , follows from the definition of , follows from Jensen’s inequality and follows by the uniform upper bound on the kinetic energy assumed in (8), and proven in Lemma 12.

This completes the bound for the velocity variable. Next we bound the discretization error in the position variable:

 E[∥xs−~xs∥22] ≤s∫s0E[∥vr−~vr∥22]dr ≤s∫s0r4u2L2EK3dr =s6u2L2EK15,

where the first line is by coupling through the initial distribution , the second line is by Jensen’s inequality and the third inequality uses the preceding bound. Setting and by our choice of we have that the squared Wasserstein distance is bounded as

 W22(Φδp0,~Φp0)≤EK(δ43+δ615).

Given our assumption that is chosen to be smaller than 1, this gives the upper bound:

 W22(Φδp0,~Φp0)≤2EKδ45.

Taking square roots establishes the desired result.

## 5 Proof of Theorem 1

Having established the convergence rate for the continuous-time SDE (1) and having proved a discretization error bound in Section 4 we now put these together and establish our main result for underdamped Langevin MCMC.

Proof [Proof of Theorem 1] From Corollary 7, we have that for any

 W2(Φδq(i),q∗)≤e−δ/2κW2(q(i),q∗).

By the discretization error bound in Theorem 9 and the sandwich inequality (6), we get

 W2(Φδq(i),~Φδq(i))≤2W2(Φδp(i),~Φδp(i))≤δ2√8EK5.

By the triangle inequality for ,

 W2(q(i+1),q∗)=W2(~Φδq(i),q∗) ≤W2(Φδq(i),~Φδq(i))+W2(Φδq(i),q∗) (9) ≤δ2√8EK5+e−δ/2κW2(q(i),q∗). (10)

Let us define . Then by applying (10) times we have:

 W2(q(n),q∗) ≤ηnW2(q(0),q∗)+(1+η+…+ηn−1)δ2√8EK5 ≤2ηnW2(p(0),p∗)+(11−η)δ2√8EK5,

where the second step follows by summing the geometric series and by applying the upper bound (6). By another application of (6) we get:

 W2(p(n),p∗) ≤4ηnW2(p(0),p∗)T1+(11−η)δ2√32EK5T2.

Observe that

 1−η=1−e−δ/2κ ≥δ4κ.

This inequality follows as . We now bound both terms and at a level to bound the total error at a level . Note that choice of (by upper bound on in Lemma 12) ensures that,

 T2=(11−η)δ2√32EK5≤4κδ(δ2√32EK5)≤ε2.

To control it is enough to ensure that

 n>1log(η)log(8W2(p(0),p∗)ε).

In Lemma 13 we establish a bound on