 # Convergence of Langevin MCMC in KL-divergence

Langevin diffusion is a commonly used tool for sampling from a given distribution. In this work, we establish that when the target density $p^*$ is such that $-\log p^*$ is $L$-smooth and $m$-strongly convex, discrete Langevin diffusion produces a distribution $p$ with $KL(p\|p^*) \le \epsilon$ in $\tilde{O}(d/\epsilon)$ steps, where $d$ is the dimension of the sample space. We also study the convergence rate when the strong-convexity assumption is absent. By considering the Langevin diffusion as a gradient flow in the space of probability distributions, we obtain an elegant analysis that applies to the stronger property of convergence in KL-divergence and gives a conceptually simpler proof of the best-known convergence results in weaker metrics.


## 1 Introduction

Suppose that we would like to sample from a density

$$p^*(x) = e^{-U(x) + C},$$

where $C$ is the normalizing constant. We know $U$, but we do not know the normalizing constant. This comes up, for example, in variational inference, when the normalization constant is computationally intractable.

One way to sample from $p^*$ is to consider the Langevin diffusion:

$$\bar{x}_0 \sim \bar{p}_0, \qquad d\bar{x}_t = -\nabla U(\bar{x}_t)\,dt + \sqrt{2}\,dB_t, \tag{1}$$

where $\bar{p}_0$ is some initial distribution and $B_t$ is Brownian motion (see Section 4). The stationary distribution of the above SDE is $p^*$.

The Langevin MCMC algorithm, given in two equivalent forms in (3) and (4), is an algorithm based on discretizing (1).

Previous works have shown the convergence of (4) in both total variation distance and 2-Wasserstein distance. The approach in these papers relies on first showing the convergence of (1), and then bounding the discretization error between (4) and (2).

In this paper, our main goal is to establish the convergence of $p_t$ in (4) in KL-divergence. KL-divergence is perhaps the most natural notion of distance between probability distributions in this context, because of its close relationship to maximum likelihood estimation, its interpretation as information gain in Bayesian statistics, and its central role in information theory. Convergence in KL-divergence implies convergence in total variation and 2-Wasserstein distance, so we are able to obtain convergence rates in total variation and 2-Wasserstein distance that are comparable to the results shown in those previous works.

## 2 Related Work

The first non-asymptotic analysis of the discrete Langevin diffusion (4) was due to Dalalyan. This was soon followed by the work of Durmus and Moulines, which improved upon Dalalyan's results. Subsequently, Durmus and Moulines also established convergence of (4) in the 2-Wasserstein distance. We remark that the proofs of Lemma 7, 11 and 13 are essentially taken from .

In a slightly different direction from the goals of this paper, Bubeck et al. and Durmus et al. studied variants of (4) which work when $U$ is not smooth. This is important, for example, when we want to sample from the uniform distribution over some convex set, so that $U$ is the indicator function of that set.

Very recently, Dalalyan et al. proved the convergence of Langevin Monte Carlo when only stochastic gradients are available.

Our work also borrows heavily from the theory established in the book of Ambrosio, Gigli and Savaré, which studies the underlying probability distribution induced by (1) as a gradient flow in probability space. This allows us to view (4) as a deterministic convex optimization procedure over the probability space, with KL-divergence as the objective. This beautiful line of work relating SDEs with gradient flows in probability space was begun by Jordan, Kinderlehrer and Otto. We refer any interested reader to an excellent survey by Santambrogio.

Finally, we remark that this theory has some very interesting connections with the study of normalizing flows. For example, the tangent velocity of (2), given by $v_t = \nabla \log p^* - \nabla \log \bar{p}_t$ (where $\bar{p}_t$ denotes the distribution of $\bar{x}_t$), can be thought of as a deterministic transformation that induces a normalizing flow.

## 3 Our Contribution

In this section, we compare the results we obtain with those of the previous works discussed in Section 2.

Our main contribution is establishing the first non-asymptotic convergence in Kullback-Leibler divergence for (4) when $U$ is strongly convex and smooth (see Theorem 3). As a consequence, we also unify the proofs of convergence in total variation and $W_2$ as simple corollaries of the convergence in KL-divergence.

The following table compares the number of iterations of (3) required to achieve $\epsilon$ error in each of the three quantities according to the analysis of various papers.

In Section 7, we also state a convergence result for KL-divergence when $U$ is not strongly convex. The corollary for convergence in total variation has a better dependence on the dimension than the corresponding result in prior work, but a worse dependence on the target accuracy.

## 4 Definitions

We denote by $\mathscr{P}(\mathbb{R}^d)$ the space of all probability distributions over $\mathbb{R}^d$. In the rest of this paper, only distributions with densities with respect to the Lebesgue measure will appear (see Lemma 16), both in the algorithm and in the analysis. With abuse of notation, we use the same symbol (e.g. $p$) to denote both the probability distribution and its density with respect to the Lebesgue measure.

We let $B_t$ be the $d$-dimensional Brownian motion.

Let $p^*$ be the target distribution such that $U$ has $L$-Lipschitz continuous gradients and is $m$-strongly convex, i.e. for all $x \in \mathbb{R}^d$:

$$mI \preceq \nabla^2 U(x) \preceq LI.$$

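For a given initial distribution $\bar{p}_0$, the Exact Langevin Diffusion is given by the following stochastic differential equation (recall that $p^*(x) = e^{-U(x)+C}$):

$$\bar{x}_0 \sim \bar{p}_0, \qquad d\bar{x}_t = -\nabla U(\bar{x}_t)\,dt + \sqrt{2}\,dB_t. \tag{2}$$

(This is identical to (1), restated here for ease of reference.) For a given initial distribution $p_0$, and for a given stepsize $h$, the Langevin MCMC Algorithm is given by the following:

$$u_0 \sim p_0, \qquad u_{i+1} = u_i - h\cdot\nabla U(u_i) + \sqrt{2h}\,\xi_i, \tag{3}$$

where $\xi_i \overset{iid}{\sim} N(0, I)$.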
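As an illustration, update (3) can be sketched in a few lines of NumPy. This is a minimal sketch rather than the authors' implementation: the function names `langevin_mcmc` and `run_gaussian_demo`, the standard Gaussian target with $U(x) = \|x\|_2^2/2$, and all parameter values are assumptions chosen for the example.

```python
import numpy as np


def langevin_mcmc(grad_U, u0, h, n_steps, rng):
    """Iterate u_{i+1} = u_i - h * grad_U(u_i) + sqrt(2h) * xi_i, as in (3).

    `u0` may hold many independent chains, one per row.
    """
    u = np.array(u0, dtype=float)
    for _ in range(n_steps):
        xi = rng.standard_normal(u.shape)  # xi_i ~ N(0, I), fresh at every step
        u = u - h * grad_U(u) + np.sqrt(2.0 * h) * xi
    return u


def run_gaussian_demo(n_chains=4000, d=2, h=0.05, n_steps=400, seed=0):
    """Run many chains on the assumed target N(0, I), i.e. U(x) = ||x||^2 / 2."""
    rng = np.random.default_rng(seed)
    grad_U = lambda u: u  # gradient of ||x||^2 / 2
    u0 = 3.0 * rng.standard_normal((n_chains, d))  # deliberately over-dispersed start
    samples = langevin_mcmc(grad_U, u0, h, n_steps, rng)
    return samples.mean(axis=0), samples.var(axis=0)
```

For this quadratic $U$ the iterates stay Gaussian, and the per-coordinate variance recursion $\sigma^2 \leftarrow (1-h)^2\sigma^2 + 2h$ has fixed point $1/(1 - h/2)$, slightly above the target variance 1. This gap is the discretization bias that the stepsize controls.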

For a given initial distribution $p_0$ and stepsize $h$, the Discretized Langevin Diffusion is given by the following SDE:

$$x_0 \sim p_0, \qquad dx_t = -\nabla U(x_{\tau(t)})\,dt + \sqrt{2}\,dB_t, \tag{4}$$

and we let $p_t$ denote the distribution of $x_t$. Here $\tau(t) \triangleq \lfloor t/h \rfloor \cdot h$ (note that $\tau$ is parametrized by $h$). It is easily verified that for any $i$, $x_{ih}$ from (4) is equivalent to $u_i$ in (3). Note that the difference between (2) and (4) is in the drift term: one is $-\nabla U(\bar{x}_t)$, the other is $-\nabla U(x_{\tau(t)})$.

For the rest of this paper, we will use $p_t$ to exclusively denote the distribution of $x_t$ in (4).
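The equivalence between (3) and (4) can be checked by integrating (4) over a single step. The derivation below is a sketch under the assumption that the drift in (4) is frozen at the left endpoint of each step, i.e. $\tau(t) = ih$ for $t \in [ih, (i+1)h)$.

```latex
\begin{aligned}
x_{(i+1)h} &= x_{ih} - \int_{ih}^{(i+1)h} \nabla U(x_{\tau(t)})\,dt + \sqrt{2}\int_{ih}^{(i+1)h} dB_t \\
           &= x_{ih} - h\,\nabla U(x_{ih}) + \sqrt{2}\,\bigl(B_{(i+1)h} - B_{ih}\bigr) \\
           &\overset{d}{=} x_{ih} - h\,\nabla U(x_{ih}) + \sqrt{2h}\,\xi_i, \qquad \xi_i \sim N(0, I).
\end{aligned}
```

The last line uses the fact that a Brownian increment over an interval of length $h$ is distributed as $N(0, hI)$ and is independent of the past, which matches the update in (3).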

We assume without loss of generality that $\arg\min_x U(x) = 0$ and that $U(0) = 0$. (We can always shift the space to achieve this, and the minimizer of $U$ is easy to find using, say, gradient descent.)

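For the rest of this paper, we will let

$$F(\mu) = \begin{cases} \displaystyle\int \mu(x)\log\left(\frac{\mu(x)}{p^*(x)}\right)dx, & \text{if } \mu \text{ has a density with respect to the Lebesgue measure} \\ \infty, & \text{otherwise} \end{cases}$$

be the KL-divergence between $\mu$ and $p^*$. It is well known that $F$ is minimized by $p^*$, and $F(p^*) = 0$.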
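When both arguments are Gaussian, the KL-divergence has a closed form, which is convenient for sanity-checking numerical experiments. The helper below is a sketch for the special case of diagonal covariances; the name `kl_diag_gaussians` and the Gaussian choice of target are assumptions made for illustration, not part of the analysis.

```python
import numpy as np


def kl_diag_gaussians(mean_p, var_p, mean_q, var_q):
    """KL(p || q) for p = N(mean_p, diag(var_p)) and q = N(mean_q, diag(var_q))."""
    mean_p = np.asarray(mean_p, dtype=float)
    var_p = np.asarray(var_p, dtype=float)
    mean_q = np.asarray(mean_q, dtype=float)
    var_q = np.asarray(var_q, dtype=float)
    ratio = var_p / var_q
    # Per coordinate: 0.5 * (ratio + (mean difference)^2 / var_q - 1 - log(ratio)).
    terms = ratio + (mean_p - mean_q) ** 2 / var_q - 1.0 - np.log(ratio)
    return 0.5 * float(np.sum(terms))
```

Taking `q` to be a Gaussian target, `kl_diag_gaussians(mean, var, target_mean, target_var)` evaluates the functional $F$ at a Gaussian $\mu$. It returns zero exactly when the two distributions coincide.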

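Finally, given a vector field $v: \mathbb{R}^d \to \mathbb{R}^d$ and a distribution $\mu$, we define the $L^2(\mu)$-norm of $v$ as

$$\|v\|_{L^2(\mu)} \triangleq \sqrt{E_\mu\left[\|v(x)\|_2^2\right]}.$$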

### 4.1 Background on Wasserstein distance and curves in P(Rd)

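Given two distributions $\mu, \nu \in \mathscr{P}(\mathbb{R}^d)$, let $\Gamma(\mu,\nu)$ be the set of all joint distributions over the product space $\mathbb{R}^d \times \mathbb{R}^d$ whose marginals equal $\mu$ and $\nu$ respectively. ($\Gamma(\mu,\nu)$ is the set of all couplings.)

The 2-Wasserstein distance is defined as

$$W_2(\mu,\nu) = \sqrt{\inf_{\gamma\in\Gamma(\mu,\nu)} \int \|x-y\|_2^2 \, d\gamma(x,y)}.$$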
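In one dimension the infimum over couplings can be evaluated directly, which gives a concrete feel for the definition. The sketch below relies on two standard facts: for two empirical measures with equally many equal-weight atoms on the real line, the optimal coupling matches sorted samples, and for two one-dimensional Gaussians the distance has a closed form. The function names are assumptions for illustration.

```python
import numpy as np


def w2_empirical_1d(x, y):
    """W_2 between two 1-D empirical measures with the same number of atoms."""
    x = np.sort(np.asarray(x, dtype=float))
    y = np.sort(np.asarray(y, dtype=float))
    if x.shape != y.shape:
        raise ValueError("this sketch assumes equally many samples in x and y")
    # The monotone (sorted) matching is the optimal coupling on the real line.
    return float(np.sqrt(np.mean((x - y) ** 2)))


def w2_gaussian_1d(mean_a, std_a, mean_b, std_b):
    """Closed-form W_2 between N(mean_a, std_a^2) and N(mean_b, std_b^2)."""
    return float(np.sqrt((mean_a - mean_b) ** 2 + (std_a - std_b) ** 2))
```

For example, shifting every sample by 3 gives an empirical distance of exactly 3, in agreement with the Gaussian formula for two distributions that differ only in their means.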

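Let $(X_1, \mathcal{B}(X_1))$ and $(X_2, \mathcal{B}(X_2))$ be two measurable spaces, $\mu$ be a measure on $X_1$, and $r: X_1 \to X_2$ be a measurable map. The push-forward measure of $\mu$ through $r$ is defined as

$$r_\#\mu(B) = \mu(r^{-1}(B)) \quad \forall B \in \mathcal{B}(X_2).$$

Intuitively, for any integrable function $f$, $E_{x \sim r_\#\mu}[f(x)] = E_{x\sim\mu}[f(r(x))]$.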

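It is a well-known result that for any two distributions $\mu$ and $\nu$ which have densities with respect to the Lebesgue measure, the optimal coupling is induced by a map $T_{opt}: \mathbb{R}^d \to \mathbb{R}^d$, i.e. the infimum in the definition of $W_2(\mu,\nu)$ is attained by

$$\gamma^* = (Id, T_{opt})_\#\mu,$$

where $Id$ is the identity map, and $T_{opt}$ satisfies ${T_{opt}}_\#\mu = \nu$, so by definition, $\gamma^* \in \Gamma(\mu,\nu)$. We call $T_{opt}$ the optimal transport map, and $T_{opt} - Id$ the optimal displacement map.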

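Given two points $\nu$ and $\pi$ in $\mathscr{P}(\mathbb{R}^d)$, a curve $\mu_t: [0,1] \to \mathscr{P}(\mathbb{R}^d)$ is a constant-speed geodesic between $\nu$ and $\pi$ if $\mu_0 = \nu$, $\mu_1 = \pi$, and $W_2(\mu_s,\mu_t) = (t-s)\,W_2(\nu,\pi)$ for all $0 \le s \le t \le 1$. If $v_\nu^\pi$ is the optimal displacement map between $\nu$ and $\pi$, then the constant-speed geodesic is characterized by

$$\mu_t = (Id + t\,v_\nu^\pi)_\#\nu. \tag{5}$$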

Given a curve $\mu_t$, we define its metric derivative as

$$|\mu_t'| \triangleq \limsup_{s\to t}\frac{W_2(\mu_s,\mu_t)}{|s-t|} \tag{6}$$

. Intuitively, this is the speed of the curve in 2-Wasserstein distance. We say that a curve is absolutely continuous if for all .

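Given a curve $\mu_t$ and a sequence of velocity fields $v_t: \mathbb{R}^d \to \mathbb{R}^d$, we say that $\mu_t$ and $v_t$ satisfy the continuity equation at $t$ if

$$\frac{d}{dt}\mu_t(x) + \nabla\cdot\left(\mu_t(x)\cdot v_t(x)\right) = 0. \tag{7}$$

(We assume that $\mu_t$ has a density with respect to the Lebesgue measure for all $t$.)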

###### Remark 1

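If $\mu_t$ is a constant-speed geodesic between $\nu$ and $\pi$, then $v_\nu^\pi$ satisfies (7) at $t = 0$, by the characterization in (5).

We say that $v_t$ is tangent to $\mu_t$ at $t$ if the continuity equation holds and $\|v_t + w\|_{L^2(\mu_t)} \ge \|v_t\|_{L^2(\mu_t)}$ for all $w$ such that $\nabla\cdot(\mu_t w) = 0$. Intuitively, $v_t$ is tangent to $\mu_t$ if it minimizes $\|v_t\|_{L^2(\mu_t)}$ among all velocity fields that satisfy the continuity equation.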

## 5 Preliminary Lemmas

This section presents some basic results needed for our main theorem.

### 5.1 Calculus over P(Rd)

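In this section, we present some crucial lemmas which allow us to study the evolution of $F(\mu_t)$ along a curve $\mu_t$. These results are all immediate consequences of results proven in the book of Ambrosio, Gigli and Savaré.

###### Lemma 1

For any $\mu \in \mathscr{P}(\mathbb{R}^d)$, let $\frac{\delta F}{\delta\mu}(\mu): \mathbb{R}^d \to \mathbb{R}$ be the first variation of $F$ at $\mu$, defined as $\frac{\delta F}{\delta\mu}(\mu)(x) \triangleq \log\left(\frac{\mu(x)}{p^*(x)}\right) + 1$. Let the subdifferential of $F$ at $\mu$ be given by

$$w_\mu \triangleq \nabla\left(\frac{\delta F}{\delta\mu}(\mu)\right): \mathbb{R}^d \to \mathbb{R}^d.$$

For any curve $\mu_t$, and for any $v_t$ that satisfies the continuity equation for $\mu_t$ (see equation (7)), the following holds:

$$\frac{d}{dt}F(\mu_t) = E_{\mu_t}\left[\langle w_{\mu_t}(x), v_t(x)\rangle\right].$$

Based on Lemma 1, we define (for any $\mu$) the operator

$$D_\mu(v) \triangleq E_\mu\left[\langle w_\mu(x), v(x)\rangle\right] : (\mathbb{R}^d \to \mathbb{R}^d) \to \mathbb{R}. \tag{8}$$

$D_\mu(v)$ is linear in $v$.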

###### Lemma 2

Let $\mu_t$ be an absolutely continuous curve in $\mathscr{P}(\mathbb{R}^d)$ with tangent velocity field $v_t$. Let $|\mu_t'|$ be the metric derivative of $\mu_t$.

Then

$$\|v_t\|_{L^2(\mu_t)} = |\mu_t'|.$$

###### Lemma 3

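For any $\mu \in \mathscr{P}(\mathbb{R}^d)$, let $\|D_\mu\|_* \triangleq \sup_{\|v\|_{L^2(\mu)} \le 1} D_\mu(v)$, then

$$\|D_\mu\|_* = \sqrt{\int \left\|\nabla\left(\frac{\delta F}{\delta\mu}(\mu)\right)(x)\right\|_2^2 \mu(x)\,dx}.$$

Furthermore, for any absolutely continuous curve $\mu_t$ with tangent velocity $v_t$, we have

$$\left|\frac{d}{dt}F(\mu_t)\right| \le \|D_{\mu_t}\|_*\,\|v_t\|_{L^2(\mu_t)}.$$

As a corollary of Lemma 2 and Lemma 3, we have the following result:

###### Corollary 4

Let $\mu_t$ be an absolutely continuous curve with tangent velocity field $v_t$. Then

$$\frac{d}{dt}F(\mu_t) \le \|D_{\mu_t}\|_*\cdot|\mu_t'|.$$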

### 5.2 Exact and Discrete Gradient Flow for F(p)

In this section, we will study the curve $p_t$ defined in (4). Unless otherwise specified, we will assume that $p_0$ is an arbitrary distribution.

Let $x_t$ be as defined in (4).

For any given $t$ and for all $s$, we define a stochastic process $y_s^t$ as

$$y_s^t = x_s \text{ for } s \le t, \qquad dy_s^t = -\nabla U(y_s^t)\,ds + \sqrt{2}\,dB_s \text{ for } s \ge t, \tag{9}$$

and let $q_s^t$ denote the distribution of $y_s^t$.

From $s = t$ onwards, this is the exact Langevin diffusion with $p_t$ as the initial distribution (compare with expression (2)).

Finally, for each $t$, we define a process $z_s^t$ by

$$z_s^t = x_s \text{ for } s \le t, \qquad dz_s^t = \left(-\nabla U(z^t_{\tau(t)}) + \nabla U(z_s^t)\right)ds \text{ for } s \ge t, \tag{10}$$

and let $g_s^t$ denote the distribution of $z_s^t$.

$g_s^t$ represents the discretization error of $p_s$ through the divergence between $q_s^t$ and $p_s$ (formally stated in Lemma 5). Note that $g_t^t = p_t$ because $z_t^t = x_t$.

###### Remark 2

The $B_t$ in (4), (9) and (10) are the same. Thus, $x_s$ (from (4)), $y_s^t$ (from (9)) and $z_s^t$ (from (10)) define a coupling between the curves $p_s$, $q_s^t$ and $g_s^t$.

Our proof strategy is as follows:

1. In Lemma 5, we demonstrate that the divergence between $p_s$ (discretized Langevin) and $q_s^t$ (exact Langevin) can be represented as a curve $g_s^t$.

2. In Lemma 6, we demonstrate that the "decrease in $F(p_t)$ due to exact Langevin" given by $\frac{d}{ds}F(q_s^t)\big|_{s=t}$ is sufficiently negative.

3. In Lemma 7, we show that the "discretization error" given by $\frac{d}{ds}\left(F(p_s) - F(q_s^t)\right)\big|_{s=t}$ is small.

4. Added together, they imply that $\frac{d}{dt}F(p_t)$ is sufficiently negative.

###### Lemma 5

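For all $x \in \mathbb{R}^d$ and $t \ge 0$,

$$\frac{d}{ds}g_s^t(x)\Big|_{s=t} = \left(\frac{d}{ds}p_s(x) - \frac{d}{ds}q_s^t(x)\right)\Big|_{s=t}.$$

###### Lemma 6

For all $t \ge 0$ and $s \ge t$,

$$\frac{d}{ds}F(q_s^t) = -\|D_{q_s^t}\|_*^2.$$

###### Lemma 7

For all $t \ge 0$,

$$\frac{d}{ds}\left(F(p_s) - F(q_s^t)\right)\Big|_{s=t} \le \left(2L^2 h\sqrt{E_{p_{\tau(t)}}\left[\|x\|_2^2\right]} + 2L\sqrt{hd}\right)\cdot\|D_{p_t}\|_*.$$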

## 6 Strong Convexity Result

In this section, we study the consequence of assuming strong convexity and smoothness of $U$.

### 6.1 Theorem statement and discussion

###### Theorem 3

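Let $p_t$ and $p_0$ be as defined in (4) with $p_0 = N(0, \frac{1}{m}I)$.

If

$$h = \frac{m\epsilon}{16 d L^2}$$

and

$$k = \frac{16L^2}{m^2}\cdot\frac{d\log\frac{dL}{m\epsilon}}{\epsilon},$$

then

$$F(p_{kh}) = KL(p_{kh}\|p^*) \le \epsilon.$$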

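The above theorem immediately allows us to obtain the convergence rate of $p_{kh}$ in both total variation and 2-Wasserstein distance.

###### Corollary 8

Using the choice of $k$ and $h$ in Theorem 3, we get

1. $d_{TV}(p_{kh}, p^*) \le \sqrt{\epsilon}$

2. $W_2(p_{kh}, p^*) \le \sqrt{\frac{2\epsilon}{m}}$

The first item follows from Pinsker's inequality. The second item follows from (12), where we take $\mu_0$ to be $p^*$ and $\mu_1$ to be $p_{kh}$, and noting that $D_{p^*}(v) = 0$ for every $v$ because $w_{p^*} = 0$. To achieve $\delta$ accuracy in total variation or $W_2$, we apply Theorem 3 with $\epsilon = \delta^2$ and $\epsilon = m\delta^2/2$ respectively.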
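The stepsize and iteration count in Theorem 3 are explicit, so they can be checked on a toy problem where everything is available in closed form. The sketch below assumes a quadratic potential $U(x) = \frac{1}{2}\sum_j \lambda_j x_j^2$ with $m \le \lambda_j \le L$ and the initial distribution $N(0, \frac{1}{m}I)$. In that case each coordinate of (3) stays a centered Gaussian whose variance obeys $\sigma^2 \leftarrow (1 - h\lambda_j)^2\sigma^2 + 2h$, and the KL-divergence is a sum of one-dimensional Gaussian terms. The function names and the test problem are assumptions for illustration only.

```python
import math


def theorem3_parameters(d, L, m, eps):
    """Stepsize h and iteration count k as stated in Theorem 3 (k rounded up)."""
    h = m * eps / (16.0 * d * L ** 2)
    k = math.ceil(16.0 * L ** 2 / m ** 2 * d * math.log(d * L / (m * eps)) / eps)
    return h, k


def gaussian_kl_after_k_steps(curvatures, m, h, k):
    """Exact KL(p_kh || p*) for U(x) = 0.5 * sum_j lam_j * x_j^2 and p_0 = N(0, I/m)."""
    total = 0.0
    for lam in curvatures:
        var = 1.0 / m  # initial variance of this coordinate
        for _ in range(k):
            # Variance of (1 - h*lam) * u + sqrt(2h) * xi for independent xi ~ N(0, 1).
            var = (1.0 - h * lam) ** 2 * var + 2.0 * h
        r = lam * var  # ratio of current variance to target variance 1/lam
        total += 0.5 * (r - 1.0 - math.log(r))
    return total
```

With `d = 2`, `m = 1`, `L = 4` and `eps = 0.1`, calling `theorem3_parameters` and then `gaussian_kl_after_k_steps([1.0, 4.0], 1.0, h, k)` returns a value far below `eps`, which is consistent with the theorem being a worst-case guarantee over all $m$-strongly convex, $L$-smooth potentials.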

###### Remark 4

The logarithmic term in Theorem 3 is not crucial. One can run (3) a few times, each time aiming only to halve the objective (thus the stepsize starts out large and is also halved in each subsequent run). The proof is quite simple and will be omitted.

### 6.2 Proof of Theorem 3

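We now state the lemmas needed to prove Theorem 3. We first establish a notion of strong convexity of $F$ with respect to the $W_2$ metric.

###### Lemma 9

If $U$ is $m$-strongly convex, then

$$F(\mu_t) \le (1-t)F(\mu_0) + tF(\mu_1) - \frac{m}{2}t(1-t)W_2^2(\mu_0,\mu_1) \tag{11}$$

for all $\mu_0, \mu_1 \in \mathscr{P}(\mathbb{R}^d)$ and $t \in [0,1]$, where $\mu_t$ is the constant-speed geodesic between $\mu_0$ and $\mu_1$. (Recall from (5) that if $v_{\mu_0}^{\mu_1}$ is the optimal displacement map from $\mu_0$ to $\mu_1$, then $\mu_t = (Id + t\,v_{\mu_0}^{\mu_1})_\#\mu_0$.)

Equivalently,

$$F(\mu_1) \ge F(\mu_0) + D_{\mu_0}(v_{\mu_0}^{\mu_1}) + \frac{m}{2}W_2^2(\mu_0,\mu_1). \tag{12}$$

We call this the $m$-strong geodesic convexity of $F$ with respect to the $W_2$ distance.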

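Next, we use the strong geodesic convexity of $F$ to upper bound $F(\mu) - F(p^*)$ by $\frac{1}{2m}\|D_\mu\|_*^2$ (for any $\mu$). This is analogous to how $f(x) - f(x^*) \le \frac{1}{2m}\|\nabla f(x)\|_2^2$ for standard $m$-strongly-convex functions in $\mathbb{R}^d$.

###### Lemma 10

Under our assumption that $U$ is $m$-strongly convex, we have that for all $\mu$,

$$F(\mu) - F(p^*) \le \frac{1}{2m}\|D_\mu\|_*^2.$$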
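As a sanity check, the inequality in Lemma 10 can be verified by hand in a Gaussian special case. The choice $p^* = N(0, \frac{1}{m}I)$ (that is, $U(x) = \frac{m}{2}\|x\|_2^2$) and $\mu = N(a, s^2 I)$ is an assumption made purely for illustration, and the expression for $\|D_\mu\|_*$ uses the formula of Lemma 3 with $\nabla\frac{\delta F}{\delta\mu}(\mu) = \nabla\log\frac{\mu}{p^*}$.

```latex
\begin{aligned}
F(\mu) &= \tfrac{1}{2}\Bigl[d\,(r - 1 - \log r) + m\|a\|_2^2\Bigr], \qquad r \triangleq m s^2, \\
\nabla\log\frac{\mu(x)}{p^*(x)} &= \Bigl(m - \tfrac{1}{s^2}\Bigr)(x - a) + m a, \\
\frac{1}{2m}\|D_\mu\|_*^2 &= \frac{1}{2m}\Bigl[\Bigl(m - \tfrac{1}{s^2}\Bigr)^2 s^2 d + m^2\|a\|_2^2\Bigr]
  = \tfrac{1}{2}\Bigl[d\,\bigl(r - 2 + \tfrac{1}{r}\bigr) + m\|a\|_2^2\Bigr], \\
\frac{1}{2m}\|D_\mu\|_*^2 - F(\mu) &= \tfrac{d}{2}\Bigl(\log r - 1 + \tfrac{1}{r}\Bigr) \ge 0.
\end{aligned}
```

The last inequality is the elementary bound $\log r \ge 1 - \frac{1}{r}$ for $r > 0$, with equality only at $r = 1$, so in this example the bound is tight exactly when $\mu$ has the same covariance as $p^*$.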

Now, recall $p_t$ from (4). We use strong convexity to obtain a bound on $E_{p_t}\|x\|_2^2$ for all $t$. This will be important for bounding the discretization error in conjunction with Lemma 7.

###### Lemma 11

Let be as defined in (4). If is such that , and in the definition of (4), then for all ,

$$E_{p_t}\|x\|_2^2 \le \frac{4d}{m}$$

Finally, we put everything together to prove Theorem 3.

• Proof of Theorem 3

We first note that .

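By Lemma 11, for all $t$, $E_{p_t}\|x\|_2^2 \le \frac{4d}{m}$. Combined with Lemma 7, we get that for all $t$

$$\frac{d}{ds}\left(F(p_s) - F(q_s^t)\right)\Big|_{s=t} \le \left(4L^2 h\sqrt{\frac{d}{m}} + 2L\sqrt{hd}\right)\cdot\|D_{p_t}\|_*.$$

Suppose that $F(p_t) - F(p^*) \ge \epsilon$, and let

$$h = \frac{m\epsilon}{16dL^2} \le \frac{1}{16}\min\left\{\frac{m}{L^2}\sqrt{\frac{\epsilon}{d}},\ \frac{m\epsilon}{L^2 d}\right\},$$

then

$$\frac{d}{ds}\left(F(p_s) - F(q_s^t)\right)\Big|_{s=t} \le \left(4L^2 h\sqrt{\frac{d}{m}} + 2L\sqrt{hd}\right)\|D_{p_t}\|_* \le \frac{1}{2}\sqrt{m\epsilon}\,\|D_{p_t}\|_* \le \frac{1}{2}\|D_{p_t}\|_*^2,$$

where the last inequality holds because Lemma 10 and the assumption that $F(p_t) - F(p^*) \ge \epsilon$ together imply that $\|D_{p_t}\|_* \ge \sqrt{m\epsilon}$.

So combining Lemma 6 and Lemma 5, we have

$$\begin{aligned}
\frac{d}{dt}F(p_t) &= \frac{d}{ds}F(q_s^t)\Big|_{s=t} + \frac{d}{ds}\left(F(p_s) - F(q_s^t)\right)\Big|_{s=t} \\
&\le -\|D_{p_t}\|_*^2 + \frac{1}{2}\|D_{p_t}\|_*^2 \\
&= -\frac{1}{2}\|D_{p_t}\|_*^2 \\
&\le -m\left(F(p_t) - F(p^*)\right),
\end{aligned} \tag{13}$$

where the last line once again follows from Lemma 10.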

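To handle the case when $F(p_t) - F(p^*) < \epsilon$, we use the following argument:

1. From (13), we can conclude that $F(p_t) - F(p^*) \ge \epsilon$ implies $\frac{d}{dt}F(p_t) \le 0$.

2. By the results of Lemma 16 and Lemma 17, for all $t$, $\|D_{p_t}\|_*$ is finite and $|p_t'|$ is finite, so $\frac{d}{dt}F(p_t)$ is finite and $F(p_t)$ is continuous in $t$.

3. Thus, if $F(p_t) - F(p^*) \le \epsilon$ for some $t \le kh$, then $F(p_s) - F(p^*) \le \epsilon$ for all $s \ge t$, as $F(p_s) - F(p^*) \ge \epsilon$ implies $\frac{d}{ds}F(p_s) \le 0$ and $F(p_s)$ is continuous in $s$. Thus $F(p_{kh}) - F(p^*) \le \epsilon$.

Thus, we need only consider the case that $F(p_t) - F(p^*) > \epsilon$ for all $t \le kh$. This means that (13) holds for all $t \le kh$.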

By Gronwall's inequality, we get

$$F(p_{kh}) - F(p^*) \le \left(F(p_0) - F(p^*)\right)\exp(-mkh).$$
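For readers who want the Gronwall step spelled out, the following short derivation shows how the differential inequality (13) turns into exponential decay. The auxiliary function $G$ is introduced here only for this illustration.

```latex
\begin{aligned}
G(t) &\triangleq e^{mt}\bigl(F(p_t) - F(p^*)\bigr), \\
\frac{d}{dt}G(t) &= e^{mt}\Bigl[\frac{d}{dt}F(p_t) + m\bigl(F(p_t) - F(p^*)\bigr)\Bigr] \le 0 \quad \text{by (13)}, \\
G(kh) \le G(0) \;&\Longrightarrow\; F(p_{kh}) - F(p^*) \le \bigl(F(p_0) - F(p^*)\bigr)e^{-mkh}.
\end{aligned}
```

This is the same argument that gives linear convergence of gradient flow for an $m$-strongly-convex function on $\mathbb{R}^d$.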

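We thus need to pick

$$k = \frac{1}{mh}\log\frac{F(p_0) - F(p^*)}{\epsilon} = \frac{16L^2}{m^2}\cdot\frac{d\log\frac{F(p_0) - F(p^*)}{\epsilon}}{\epsilon},$$

using the fact that $h = \frac{m\epsilon}{16dL^2}$. Using $L$-smoothness and $m$-strong convexity, we can show that

$$-\log p^*(x) \le \frac{L}{2}\|x\|_2^2 + \frac{d}{2}\log\left(\frac{2\pi}{m}\right),$$

and

$$\log p_0(x) = -\frac{m}{2}\|x\|_2^2 - \frac{d}{2}\log\left(\frac{2\pi}{m}\right).$$

We thus get that $F(p_0) - F(p^*) \le \frac{dL}{m}$, so

$$k = \frac{16L^2}{m^2}\cdot\frac{d\log\frac{dL}{m\epsilon}}{\epsilon}.$$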

## 7 Weak convexity result

In this section, we study the case when $U$ is not strongly convex (but still convex and smooth). Let $\pi_h$ be the stationary distribution of (4) with stepsize $h$.

We will assume that we can choose an initial distribution $p_0$ which satisfies

$$W_2(p_0, p^*) = C_1 \tag{14}$$

and

$$\sqrt{E_{p^*}\|x\|_2^2} = C_2. \tag{15}$$

Let $h'$ be the largest stepsize such that

$$W_2(\pi_h, p^*) \le C_1, \quad \forall h \le h'. \tag{16}$$

### 7.1 Theorem statement and discussion

###### Theorem 5

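Let $C_1$, $C_2$ and $h'$ be defined as in the beginning of this section.

Let $p_t$ and $p_0$ be as defined in (4) with $p_0$ satisfying (14). If

$$h = \frac{1}{48}\min\left\{\frac{\epsilon}{C_1(C_1+C_2)L^2},\ \frac{\epsilon^2}{C_1^2 dL^2},\ h'\right\} = \frac{1}{48}\min\left\{\frac{\epsilon}{C_1C_2L^2},\ \frac{\epsilon^2}{C_1^2 dL^2},\ h'\right\}$$

and

$$k = \frac{2C_1^2}{\epsilon h} + \frac{2C_1^2\log\left(F(r_0) - F(p^*)\right)}{h},$$

then

$$F(p_{kh}) = KL(p_{kh}\|p^*) \le \epsilon.$$

Once again, applying Pinsker's inequality, we get that the above choice of $k$ and $h$ yields $d_{TV}(p_{kh}, p^*) \le \sqrt{\epsilon}$. Without strong convexity, we cannot get a bound on $W_2$ from bounding the KL-divergence like we did in Corollary 8.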

In prior work, a proof in the non-strongly-convex case was obtained by running Langevin MCMC on

$$\tilde{p}^* \propto p^*\cdot\exp\left(-\frac{\delta}{d}\|x\|_2^2\right).$$

is thus strongly convex with , and . By the results of , or , or Theorem 3, we need

$$k = \tilde{O}\left(\frac{d^3}{\delta^4}\right) \tag{17}$$

iterations to get $d_{TV}(p_{kh}, p^*) \le \delta$.

On the other hand, if we assume and the results of Theorem 5 implies that

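$$h = \frac{\epsilon}{L^2C_1}\min\left\{\frac{1}{C_2},\ \frac{\epsilon}{dC_1}\right\}.$$

To get $d_{TV}(p_{kh}, p^*) \le \delta$, we need

$$k = \frac{L^2C_1^3}{\delta^4}\max\left\{C_2,\ \frac{dC_1}{\delta^2}\right\}.$$

Even if we ignore $C_1$ and $C_2$, our result is not strictly better than (17) as we have a worse dependence on $\delta$. However, we do have a better dependence on $d$.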
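The trade-off between the two bounds can be made concrete with a rough calculation. The sketch below treats $L$, $C_1$ and $C_2$ as dimension-free constants and drops the logarithmic factors and the hidden constant in (17); both simplifications are assumptions made only for this comparison, and $k_{\text{ours}}$ is shorthand introduced here for the iteration count displayed above.

```latex
\begin{aligned}
k_{\text{ours}} &= \frac{L^2C_1^3}{\delta^4}\max\Bigl\{C_2,\ \frac{dC_1}{\delta^2}\Bigr\}
  = \frac{L^2C_1^4\,d}{\delta^6} \quad \text{when } \frac{dC_1}{\delta^2} \ge C_2, \\
\frac{k_{\text{ours}}}{d^3/\delta^4} &= \frac{L^2C_1^4}{d^2\delta^2}, \\
k_{\text{ours}} \le \frac{d^3}{\delta^4} \;&\Longleftrightarrow\; d\,\delta \ge L\,C_1^2.
\end{aligned}
```

Under these simplifications, the bound above is the smaller one in high dimension at moderate accuracy, while (17) is the smaller one when $\delta$ is very small relative to $1/d$.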

The proof of Theorem 5 is quite similar to that of Theorem 3, so we defer it to the appendix.

## 8 Supplementary Materials

• Proof of Lemma 1 The proof is directly from results in . See Theorem 10.4.9, with , with , , , , , and . The expression for comes from expression 10.1.16 (section E of chapter 10.1.2, page 233). See also expressions 10.4.67 and 10.4.68.

(One can also refer to Theorem 10.4.13 and Theorem 10.4.17 for proofs of for the KL-divergence functional in more general settings.) By Lemma 16, is well defined for all .

• Proof of Lemma 2 Theorem 8.3.1 of .

• Proof of Lemma 3 By the definition of $D_\mu$ in (8), Lemma 1, and the Cauchy-Schwarz inequality.

• Proof of Lemma 5 In this proof, we treat as a fixed but arbitrary number, and prove the Lemma for all . We will use , , , , and as defined in (4), (9) and (10).

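First, consider the case when $t = \tau(t)$. By definition, $y_t^t = x_t$ and $q_t^t = p_t$. By Fokker-Planck,

$$\frac{d}{ds}p_s(x)\Big|_{s=t} = \nabla\cdot\left(p_t(x)\nabla U(x)\right) + \mathrm{tr}\left(\nabla^2 p_t(x)\right) = \nabla\cdot\left(q_t^t(x)\nabla U(x)\right) + \mathrm{tr}\left(\nabla^2 q_t^t(x)\right) = \frac{d}{ds}q_s^t(x)\Big|_{s=t}.$$

On the other hand,

$$\frac{d}{ds}z_s^t\Big|_{s=t} = -\nabla U(z^t_{\tau(t)}) + \nabla U(z_t^t) = -\nabla U(x_t) + \nabla U(x_t) = 0.$$

Thus $\frac{d}{ds}g_s^t(x)\big|_{s=t} = 0$, so Lemma 5 holds.

In the remainder of this proof, we assume that $t \ne \tau(t)$.

For a given $\Theta \in \mathbb{R}^{2d}$, we let $\Pi_1(\Theta)$ denote the projection of $\Theta$ onto its first $d$ coordinates, and $\Pi_2(\Theta)$ denote the projection of $\Theta$ onto its last $d$ coordinates. With abuse of notation, for $P \in \mathscr{P}(\mathbb{R}^{2d})$, we let $\Pi_1(P)$ and $\Pi_2(P)$ denote the corresponding marginal densities.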

We will consider three stochastic processes: over for .

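First, we introduce the stochastic process $\Theta_s$ for $s \in [\tau(t), \tau(t)+h)$:

$$\Theta_{\tau(t)} = \begin{bmatrix} x_{\tau(t)} \\ -\nabla U(x_{\tau(t)}) \end{bmatrix}, \qquad d\Theta_s = \begin{bmatrix} \Pi_2(\Theta_s) \\ 0 \end{bmatrix} ds + \begin{bmatrix} \sqrt{2}\,dB_s \\ 0 \end{bmatrix} \quad \text{for } s \in [\tau(t), \tau(t)+h).$$

We let $P_s$ denote the density of $\Theta_s$. Intuitively, $P_s$ is the joint density of $x_s$ and $-\nabla U(x_{\tau(t)})$. One can verify that $\Pi_1(\Theta_s) = x_s$ and $\Pi_1(P_s) = p_s$. By Fokker-Planck, we have

$$\frac{d}{ds}P_s(\Theta)\Big|_{s=t} = -\nabla\cdot\left(P_t(\Theta)\cdot\begin{bmatrix}\Pi_2(\Theta) \\ 0\end{bmatrix}\right) + \sum_{i=1}^{d}\frac{\partial^2}{\partial\Theta_i^2}P_t(\Theta). \tag{18}$$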

Next, for any given , we introduce the stochastic process