# Analysis of Langevin Monte Carlo via convex optimization

In this paper, we provide new insights on the Unadjusted Langevin Algorithm. We show that this method can be formulated as a first-order optimization algorithm for an objective functional defined on the Wasserstein space of order 2. Using this interpretation and techniques borrowed from convex optimization, we give a non-asymptotic analysis of this method for sampling from a log-concave smooth target distribution on R^d. Our proofs are then easily extended to the Stochastic Gradient Langevin Dynamics, which is a popular extension of the Unadjusted Langevin Algorithm. Finally, this interpretation leads to a new methodology to sample from a non-smooth target distribution, for which a similar study is done.


## 1 Introduction

This paper deals with the problem of sampling from a probability measure π on (R^d, B(R^d)) which admits a density, still denoted by π, with respect to the Lebesgue measure, given for all x ∈ R^d by

 π(x) = e^{−U(x)} / ∫_{R^d} e^{−U(y)} dy,

where U : R^d → R. This problem arises in various fields such as Bayesian statistical inference [gelman:carlin:stern:rudin:2014, andrieu:defreitas:doucet:jordan:2003], ill-posed inverse problems [stuart:2010], and computational physics [krauth:2006].

Common and current methods to tackle this issue are Markov Chain Monte Carlo methods [brooks:gelman:jone:meng:2011], for example the Hastings-Metropolis algorithm [metropolis:rosenbluth:rosenbluth:teller:teller:1953, hastings:1970] or Gibbs sampling [geman:geman:1984]. All these methods boil down to building a Markov kernel on R^d whose invariant probability distribution is π. Yet, choosing an appropriate proposal distribution for the Hastings-Metropolis algorithm is a tricky subject. For this reason, it has been proposed to consider continuous dynamics which naturally leave the target distribution invariant. Perhaps one of the most famous examples is the over-damped Langevin diffusion [parisi:1981] associated with U, assumed to be continuously differentiable:

 dYt=−∇U(Yt)dt+√2dBt, (1)

where (B_t)_{t≥0} is a d-dimensional Brownian motion. Under appropriate conditions on U, this SDE admits a unique strong solution and defines a strong Markov semigroup (P_t)_{t≥0} which converges to π in total variation [roberts:tweedie:1996, Theorem 2.1] or in Wasserstein distance [bolley:gentil:guillin:2012]. However, simulating path solutions of such stochastic differential equations is not possible in most cases, and discretizations of these equations are used instead. In addition, the numerical solutions associated with these schemes define Markov kernels for which π is no longer invariant. Therefore, quantifying the error introduced by these approximations is crucial to justify their use to sample from the target π. We consider in this paper the Euler-Maruyama discretization of (1), which defines the (possibly inhomogeneous) Markov chain (X_k)_{k≥0} given for all k ≥ 0 by

 Xk+1=Xk−γk+1∇U(Xk)+√2γk+1Gk+1, (2)

where (γ_k)_{k≥1} is a sequence of step sizes which can be held constant or converge to 0, and (G_k)_{k≥1} is a sequence of i.i.d. standard d-dimensional Gaussian random variables. The use of the Euler-Maruyama discretization (2) to approximately sample from π is referred to as the Unadjusted Langevin Algorithm (ULA), or the Langevin Monte Carlo (LMC) algorithm, and has already been the matter of many works. For example, weak error estimates have been obtained in [talay:tubaro:1991], [mattingly:stuart:higham:2002] for the constant step-size setting and in [lamberton:pages:2003], [lemaire:2005] when (γ_k)_{k≥1} is non-increasing and goes to 0. Explicit and non-asymptotic bounds on the total variation distance ([dalalyan:2014], [durmus:moulines:2017]) or the Wasserstein distance ([durmus2016high]) between the distribution of X_k and π have been obtained. Roughly, all these results are based on a comparison between the discretization and the diffusion process, and they quantify how the error introduced by the discretization accumulates throughout the algorithm.

In this paper, we propose another point of view on ULA, which nevertheless shares some relations with the Langevin diffusion (1). Indeed, it has been shown in [jordan1998variational] that the family of distributions (μ₀P_t)_{t≥0}, where (P_t)_{t≥0} is the semigroup associated with (1) and μ₀ is a probability measure on R^d admitting a second moment, is the solution of a gradient flow equation in the Wasserstein space of order 2 associated with a particular functional F, see Section 2. Therefore, if π is invariant for (P_t)_{t≥0}, then it is a stationary solution of this equation, and it is the unique minimizer of F if U is convex. Starting from this observation, we interpret ULA as a first-order optimization algorithm on the Wasserstein space of order 2 with objective functional F. Namely, we adapt some proofs of convergence for the gradient descent algorithm from the convex optimization literature to obtain non-asymptotic and explicit bounds on the Kullback-Leibler divergence from π to averaged distributions associated with ULA, in both the constant and the non-increasing step-size setting. These bounds then easily imply computable bounds in total variation norm and Wasserstein distance. If the potential U is strongly convex and gradient Lipschitz, we recover the results of [durmus:moulines:2017], [durmus2016high] and [cheng:bartlett:2017] when the step size is held constant in (2) (see Table 1). In the case where U is only convex and from a warm start, we get computable bounds on the complexity for ULA to produce one sample with accuracy ε from π, in Kullback-Leibler (KL) divergence and total variation distance respectively; the precise orders are reported in Table 2. The bounds we get starting from a minimizer of U are presented in Table 3.

In addition, we propose two new algorithms to sample from a class of non-smooth log-concave distributions, for which we derive computable non-asymptotic bounds as well. The first one can be applied to Lipschitz convex potentials for which unbiased estimates of subgradients are available. Remarkably, the bounds we obtain for this algorithm depend on the dimension only through the initial condition and the variance of the stochastic subgradient estimates. The second method we propose is a generalization of the Stochastic Gradient Langevin Dynamics (SGLD) algorithm [welling:teh:2011]. The latter is a popular extension of ULA in which the gradient is replaced by a sequence of i.i.d. unbiased estimators. For this new scheme, we assume that U can be decomposed as the sum of two functions, the first at least continuously differentiable and the second only convex, and we use stochastic gradient estimates for the smooth part and the proximal operator associated with the non-smooth part. This new method is close to the one proposed in [durmus:moulines:pereyra:2016] but differs from it. To get computable bounds from the target distribution π, we interpret this algorithm as a first-order optimization algorithm and provide explicit bounds on the Kullback-Leibler divergence from π to the distributions associated with SGLD. In the case where U is strongly convex and gradient Lipschitz (i.e. the non-smooth part vanishes), we recover the complexity of [dalalyan:karagulyan:2017] for the Wasserstein distance; we obtain the same complexity for the total variation distance and a corresponding complexity for the KL divergence (Table 4). In the case where U is only convex and from a warm start, the complexities needed to get one sample with accuracy ε from π, in KL divergence and total variation distance respectively, are reported in Table 5. The bounds we get starting from a minimizer of U are presented in Table 6.
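To illustrate the idea behind SGLD, here is a minimal sketch in which the gradient in (2) is replaced by a minibatch estimate; the toy Gaussian-mean model, batch size and step size are our own choices, not taken from the paper.

```python
import numpy as np

def sgld(grad_estimate, x0, step_sizes, rng=None):
    """Stochastic Gradient Langevin Dynamics: recursion (2) with the exact
    gradient of U replaced by an unbiased stochastic estimate."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.asarray(x0, dtype=float)
    path = [x]
    for gamma in step_sizes:
        g = grad_estimate(x, rng)  # unbiased estimate of grad U(x)
        x = x - gamma * g + np.sqrt(2.0 * gamma) * rng.standard_normal(x.shape)
        path.append(x)
    return np.array(path)

# Toy posterior: U(x) = sum_i (x - y_i)^2 / 2 over data y_1, ..., y_n,
# so pi is Gaussian with mean equal to the sample mean of the data.
rng0 = np.random.default_rng(1)
data = rng0.standard_normal(100) + 2.0      # y_i ~ N(2, 1)
n, b = len(data), 10

def grad_batch(x, rng):
    """Minibatch estimate of grad U(x) = sum_i (x - y_i): unbiased because
    the subsampled sum is rescaled by n / b."""
    idx = rng.choice(n, size=b, replace=False)
    return (n / b) * np.sum(x - data[idx])

chain = sgld(grad_batch, 0.0, [1e-3] * 4000, rng=np.random.default_rng(2))
```

The chain mean settles near the sample mean of the data; the extra minibatch variance only inflates the fluctuations around it, not its location.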

Furthermore, SGLD has also been analyzed in a general setting, i.e. when the potential U is not necessarily convex. In [teh:vollmer:zygalakis:2015], this scheme is studied via weak error estimates. Finally, [raginsky2017non] and [xu2017global] give some results regarding the potential use of SGLD as an optimization algorithm to minimize the potential U, by targeting a density proportional to e^{−βU} for some inverse temperature β > 0.

In summary, our contributions are the following:

• We give a new interpretation of ULA and use it to derive bounds on the Kullback-Leibler divergence from π to the iterates of ULA. We recover the dependence on the dimension of [cheng:bartlett:2017, Theorem 3] in the strongly convex case and obtain tighter bounds. Note that this result implies previously known bounds between π and the iterates of ULA in Wasserstein and total variation distance, but with a completely different technique. We also give computable bounds when U is only convex, which improves the results of [durmus:moulines:2017], [dalalyan:2014] and [cheng:bartlett:2017].

• We give two new methodologies to sample from distributions with non-smooth potentials and carry out a non-asymptotic analysis of both. These two new algorithms are generalizations of SGLD.

The paper is organized as follows. In Section 2, we give some intuition on the strategy we take to analyze ULA and its variants; these ideas come from gradient flow theory in Wasserstein space. In Section 3, we give the main results we obtain on ULA, together with their proofs. In Section 4, two variants of ULA are presented and analyzed. Finally, numerical experiments on logistic regression models are presented in Section 5 to support our theoretical findings regarding the new methodologies.

### Notations and conventions

Denote by B(R^d) the Borel σ-field of R^d, Leb the Lebesgue measure on R^d, F(R^d) the set of all Borel measurable functions on R^d and, for f ∈ F(R^d), ‖f‖_∞ = sup_{x∈R^d} |f(x)|. For μ a probability measure on (R^d, B(R^d)) and f ∈ F(R^d) a μ-integrable function, denote by μ(f) the integral of f w.r.t. μ. Let μ and ν be two sigma-finite measures on (R^d, B(R^d)). Write μ ≪ ν if μ is absolutely continuous w.r.t. ν, and dμ/dν for the associated density. Let μ, ν be two probability measures on (R^d, B(R^d)). Define the Kullback-Leibler divergence of μ from ν by

 KL(μ|ν) = { ∫_{R^d} (dμ/dν)(x) log((dμ/dν)(x)) dν(x), if μ ≪ ν; +∞, otherwise.

We say that ζ is a transference plan of μ and ν if it is a probability measure on (R^d × R^d, B(R^d × R^d)) such that for all measurable sets A of B(R^d), ζ(A × R^d) = μ(A) and ζ(R^d × A) = ν(A). We denote by Π(μ,ν) the set of transference plans of μ and ν. Furthermore, we say that a couple of R^d-valued random variables (X, Y) is a coupling of μ and ν if there exists ζ ∈ Π(μ,ν) such that (X, Y) is distributed according to ζ. For two probability measures μ and ν, we define the Wasserstein distance of order 2 as

 W_2(μ,ν) = ( inf_{ζ∈Π(μ,ν)} ∫_{R^d×R^d} ‖x − y‖² dζ(x,y) )^{1/2}. (3)

By [VillaniTransport, Theorem 4.1], for all probability measures μ, ν on R^d, there exists a transference plan ζ⋆ ∈ Π(μ,ν) such that for any coupling (X, Y) distributed according to ζ⋆, W_2(μ,ν) = E[‖X − Y‖²]^{1/2}. Such a transference plan (respectively coupling) is called an optimal transference plan (respectively optimal coupling) associated with W_2. We denote by P_2(R^d) the set of probability measures with finite second moment: for all μ ∈ P_2(R^d), ∫_{R^d} ‖x‖² dμ(x) < +∞. By [VillaniTransport, Theorem 6.16], P_2(R^d) equipped with the Wasserstein distance W_2 is a complete separable metric space. Denote by P_2^a(R^d) the subset of P_2(R^d) of measures absolutely continuous with respect to the Lebesgue measure.
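In dimension one the infimum in (3) has a closed form: the optimal transference plan is the monotone rearrangement, so for two empirical measures with the same number of atoms W_2 reduces to matching order statistics. A small sketch (our own illustration, not from the paper):

```python
import numpy as np

def w2_empirical_1d(xs, ys):
    """W2 between two 1-D empirical measures with equally many atoms.
    On the real line the optimal coupling in (3) is the monotone
    rearrangement, so it suffices to pair sorted samples."""
    xs, ys = np.sort(xs), np.sort(ys)
    return np.sqrt(np.mean((xs - ys) ** 2))

# A translation by c moves every quantile by c, so W2 equals |c|.
xs = np.random.default_rng(7).standard_normal(50)
dist = w2_empirical_1d(xs, xs + 2.0)
```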

For two probability measures μ and ν on (R^d, B(R^d)), the total variation distance between μ and ν is defined by ‖μ − ν‖_TV = sup_{A∈B(R^d)} |μ(A) − ν(A)|.

Let k ∈ N and Ω be an open set of R^d. Denote by C^k(Ω) the set of k-times continuously differentiable functions from Ω to R, and by C_c^k(Ω) the set of k-times continuously differentiable functions from Ω to R with compact support. Let I be an interval of R and f : I → R. f is absolutely continuous on I if for all ε > 0, there exists δ > 0 such that for all n ∈ N* and t_1 ≤ t_2 ≤ ⋯ ≤ t_{2n}, t_i ∈ I,

 if ∑_{k=1}^n (t_{2k} − t_{2k−1}) ≤ δ, then ∑_{k=1}^n |f(t_{2k}) − f(t_{2k−1})| ≤ ε.

In the sequel, we take the convention that ∑_{k=n}^p = 0 and ∏_{k=n}^p = 1 for n, p ∈ N with n > p.

## 2 Interpretation of ULA as an optimization algorithm

Throughout this paper, we assume that U satisfies the following condition for some m ≥ 0.

###### A 1 (m).

U is m-convex, i.e. for all x, y ∈ R^d and t ∈ [0, 1],

 U(tx+(1−t)y)≤tU(x)+(1−t)U(y)−t(1−t)(m/2)∥x−y∥2

Note that A 1 includes the case where U is only convex, taking m = 0. We consider in this section the following additional condition on U, which will be relaxed in Section 4.

###### A 2.

U is continuously differentiable and gradient Lipschitz, i.e. there exists L ≥ 0 such that for all x, y ∈ R^d, ‖∇U(x) − ∇U(y)‖ ≤ L‖x − y‖.

Under A 1 and A 2, the Langevin diffusion (1) has a unique strong solution (Y_t^x)_{t≥0} starting at any x ∈ R^d. The Markov semigroup (P_t)_{t≥0}, given for all t ≥ 0, x ∈ R^d and A ∈ B(R^d) by P_t(x, A) = P(Y_t^x ∈ A), is reversible with respect to π, and π is its unique invariant probability measure, see [ambrosio:savare:zambotti:2009, Theorem 1.2, Theorem 1.6]. Using this probabilistic framework, [roberts:tweedie:1996, Theorem 1.2] shows that (P_t)_{t≥0} is irreducible with respect to the Lebesgue measure, strong Feller, and converges to π in total variation for all starting points. But to study the properties of the semigroup (P_t)_{t≥0}, another, complementary approach can be used. This dual point of view is based on the adjoint of the infinitesimal generator associated with (P_t)_{t≥0}. The strong generator A of (1) is defined for all f ∈ D(A) and x ∈ R^d by

 Af(x)=limt→0t−1(Ptf(x)−f(x)),

where D(A) is the set of functions f for which this limit exists for all x ∈ R^d and defines a function of the same space. In particular, for f ∈ C_c^2(R^d), we get by Itô's formula

 Af = −⟨∇U, ∇f⟩ + Δf.

In addition, by [ethier:kurtz:1986, Proposition 1.5], for all f ∈ D(A) and t ≥ 0, P_t f ∈ D(A), and t ↦ P_t f(x) is continuously differentiable with

 dPtf(x)dt=APtf(x)=PtAf(x). (4)

For all x ∈ R^d and t > 0, by Girsanov’s theorem [karatzas:shreve:1991, Theorem 5.1, Corollary 5.16, Chapter 3], P_t(x, ·) admits a density with respect to the Lebesgue measure, denoted by ρ_t^x. By (4), this density is a solution (in the weak sense) of the Fokker-Planck equation:

 ∂ρ_t^x/∂t = div(∇ρ_t^x + ρ_t^x ∇U),

meaning that for all t > 0 and ϕ ∈ C_c^∞(R^d),

 ∂/∂t ∫_{R^d} ϕ(y) ρ_t^x(y) dy = ∫_{R^d} Aϕ(y) ρ_t^x(y) dy. (5)

In the landmark paper [jordan1998variational], the authors show that if U is infinitely continuously differentiable, (ρ_t^x)_{t>0} is the limit of a minimization scheme which defines a sequence of probability measures as follows. For γ > 0, set μ̃_{0,γ} = μ₀ and

 ρ̃_{k,γ} = dμ̃_{k,γ}/dLeb, μ̃_{k+1,γ} = argmin_{μ∈P_2^a(R^d)} { (1/2) W_2²(μ̃_{k,γ}, μ) + γ F(μ) }, k ∈ N, (6)

where F is the free energy functional,

 F=H+E, (7)

where H and E are respectively the Boltzmann H-functional and the potential energy functional, given for all μ ∈ P_2(R^d) by

 H(μ) = { ∫_{R^d} (dμ/dLeb)(x) log((dμ/dLeb)(x)) dx, if μ ≪ Leb; +∞, otherwise, (8)
 E(μ) = ∫_{R^d} U(x) dμ(x). (9)

More precisely, [jordan1998variational, Theorem 5.1] shows that for all t > 0, the piecewise-constant interpolation of (ρ̃_{k,γ})_{k∈N} at time t converges to ρ_t^x weakly in L¹(R^d) as γ goes to 0. This result has been extended and cast into the framework of gradient flows in the Wasserstein space P_2(R^d), see [ambrosio2008gradient]. We provide a short introduction to this topic in Appendix A and present useful concepts and results for our proofs. Note that this scheme can be seen as a proximal-type algorithm (see [martinet:1970] and [rockafeller:1976]) on the Wasserstein space, used to minimize the functional F. The following lemma shows that π is the unique minimizer of F. As a result, the distribution of the Langevin diffusion is the steepest descent flow of F, and we recover intuitively that this process converges to the target distribution π.

###### Lemma .

Assume A 1. The following holds:

1. π ∈ P_2^a(R^d) and F(π) < +∞.

2. For all μ ∈ P_2^a(R^d) satisfying F(μ) < +∞,

 F(μ) − F(π) = KL(μ|π). (10)
###### Proof.

The proof is postponed to Section 7.1. ∎
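Identity (10) can be checked in closed form for Gaussian measures. Taking U(x) = x²/2 on R (so π = N(0, 1) and the normalizing constant is Z = √(2π)), both sides are explicit; the sketch below (our own worked example, not from the paper) verifies F(μ) − F(π) = KL(μ|π) for μ = N(a, σ²):

```python
import numpy as np

def free_energy_gauss(a, s2):
    """F(mu) = H(mu) + E(mu) for mu = N(a, s2) and potential U(x) = x^2/2."""
    H = -0.5 * np.log(2 * np.pi * np.e * s2)  # Boltzmann H-functional (negative entropy)
    E = 0.5 * (s2 + a ** 2)                   # potential energy E(mu) = E[X^2] / 2
    return H + E

def kl_gauss(a, s2):
    """KL(N(a, s2) | N(0, 1)) in closed form."""
    return 0.5 * (s2 + a ** 2 - 1.0 - np.log(s2))

F_pi = -0.5 * np.log(2 * np.pi)  # F(pi) = -log Z, here with Z = sqrt(2*pi)
gap = free_energy_gauss(1.3, 0.4) - F_pi
```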

Based on this interpretation, one could think of minimizing F on the Wasserstein space to get close to π, using the minimization scheme (6). However, while this scheme is shown in [jordan1998variational] to be well defined, finding explicit recursions is as difficult as minimizing F itself, so it cannot be used in practice. In addition, to the authors' knowledge, there is no efficient and practical scheme to optimize this functional. On the other hand, discretization schemes have been used to approximate the Langevin diffusion (1) and its long-time behaviour. One of the most popular methods is the Euler-Maruyama discretization given in (2). While most works study the theoretical properties of this discretization, to guarantee samples close to the target distribution π, by comparing the distributions of the discretization and of the diffusion through couplings or weak error expansions, we instead interpret this scheme as a first-order optimization algorithm for the objective functional F.

## 3 Main results for the Unadjusted Langevin algorithm

Let f : R^d → R be a convex continuously differentiable objective function with minimizer x_f. The inexact or stochastic gradient descent algorithm used to estimate x_f defines the sequence (x_n)_{n≥0}, starting from x_0 ∈ R^d, by the following recursion for n ≥ 0:

 xn+1=xn−γn+1∇f(xn)+γn+1Ξ(xn),

where (γ_n)_{n≥1} is a non-increasing sequence of step sizes and Ξ is a deterministic and/or stochastic perturbation of ∇f. To get an explicit bound on the convergence (in expectation) of the sequence (f(x_n))_{n≥0} to f(x_f), one possibility (see e.g. [beck:teboule:2009]) is to show that the following inequality holds: for all n ≥ 0,

 2γ_{n+1}(f(x_{n+1}) − f(x_f)) ≤ ‖x_n − x_f‖² − ‖x_{n+1} − x_f‖² + Cγ_{n+1}², (11)

for some constant C ≥ 0. In a similar manner as for inexact gradient algorithms, in this section we establish that ULA satisfies an inequality of the form (11), with the objective function F defined by (7) on P_2(R^d), but with the Wasserstein distance of order 2 in place of the Euclidean norm.
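As a sanity check, inequality (11) is easy to observe numerically for inexact gradient descent on a simple quadratic; the function, the perturbation and the (loose) constant C below are our own illustrative choices:

```python
import numpy as np

# Numerical illustration of inequality (11) for inexact gradient descent on
# f(x) = ||x||^2 / 2 (so x_f = 0 and f(x_f) = 0); the bounded perturbation
# Xi and the loose constant C are chosen for this example only.
rng = np.random.default_rng(3)
gamma, d, C = 0.1, 5, 200.0
x = rng.standard_normal(d)
holds = True
for _ in range(50):
    xi = 0.5 * rng.standard_normal(d)       # perturbation of grad f(x) = x
    x_next = x - gamma * x + gamma * xi     # inexact gradient step
    lhs = 2 * gamma * 0.5 * np.dot(x_next, x_next)
    rhs = np.dot(x, x) - np.dot(x_next, x_next) + C * gamma ** 2
    holds = holds and (lhs <= rhs)
    x = x_next
```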

Consider the family of Markov kernels (R_γ)_{γ>0} associated with the Euler-Maruyama discretization (2) for a sequence of step sizes, given for all γ > 0, x ∈ R^d and A ∈ B(R^d) by

 R_γ(x, A) = (4πγ)^{−d/2} ∫_A exp(−‖y − x + γ∇U(x)‖² / (4γ)) dy. (12)
###### Proposition .

Assume A 1(m) for some m ≥ 0 and A 2. For all γ ∈ (0, 1/L] and μ ∈ P_2(R^d), we have

 2γ{F(μR_γ) − F(π)} ≤ (1 − mγ)W_2²(μ, π) − W_2²(μR_γ, π) + 2γ²Ld, (13)

where F is defined in (7).

For our analysis, we decompose R_γ for all γ > 0 into the product of two elementary kernels S_γ and T_γ, given for all x ∈ R^d and A ∈ B(R^d) by

 S_γ(x, A) = δ_{x−γ∇U(x)}(A), T_γ(x, A) = (4πγ)^{−d/2} ∫_A exp(−‖y − x‖² / (4γ)) dy. (14)

We take the convention that T_0 is the identity kernel, given for all x ∈ R^d by T_0(x, ·) = δ_x. S_γ is the deterministic part of the Euler-Maruyama discretization, which corresponds to a gradient descent step relative to U for the functional E, whereas T_γ is the random part, which corresponds to going along the gradient flow of H. Note that R_γ = S_γT_γ and consider the following decomposition:

 F(μRγ)−F(π)=E(μRγ)−E(μSγ)+E(μSγ)−E(π)+H(μRγ)−H(π). (15)
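The factorization R_γ = S_γT_γ can also be checked pathwise: one Euler-Maruyama step coincides with a deterministic gradient step followed by adding N(0, 2γ I_d) noise. A minimal sketch (the toy potential is our own choice):

```python
import numpy as np

# One ULA kernel step R_gamma factors as S_gamma (Dirac at the gradient step)
# followed by T_gamma (adding N(0, 2*gamma*I) noise); with the same Gaussian
# draw, both routes produce the same point. Toy potential U(x) = ||x||^2 / 2.
rng = np.random.default_rng(4)
gamma = 0.05
grad_U = lambda x: x

x = rng.standard_normal(3)
g = rng.standard_normal(3)

x_direct = x - gamma * grad_U(x) + np.sqrt(2 * gamma) * g  # R_gamma step (2)
x_s = x - gamma * grad_U(x)                                # S_gamma
x_st = x_s + np.sqrt(2 * gamma) * g                        # then T_gamma
```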

The proof of the proposition then consists in bounding each difference in the decomposition above. This is the purpose of the following lemmas:

###### Lemma .

Assume A 2. For all γ > 0 and μ ∈ P_2(R^d),

 E(μTγ)−E(μ)≤Ldγ.
###### Proof.

First note that by [nesterov:2004, Lemma 1.2.3], for all x, y ∈ R^d, we have

 U(y) ≤ U(x) + ⟨∇U(x), y − x⟩ + (L/2)‖y − x‖². (16)

Therefore, for all γ > 0 and μ ∈ P_2(R^d), we get

 E(μT_γ) − E(μ) = (4πγ)^{−d/2} ∫_{R^d} ∫_{R^d} {U(x + y) − U(x)} e^{−‖y‖²/(4γ)} dy dμ(x) ≤ (4πγ)^{−d/2} ∫_{R^d} ∫_{R^d} {⟨∇U(x), y⟩ + (L/2)‖y‖²} e^{−‖y‖²/(4γ)} dy dμ(x).

The inner Gaussian integral of ⟨∇U(x), y⟩ vanishes by symmetry, and the second moment of N(0, 2γ I_d) equals 2γd, which gives the bound Ldγ and concludes the proof. ∎
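For the quadratic potential U(x) = ‖x‖²/2 (so L = 1), the bound of this lemma is tight: adding the T_γ noise N(0, 2γ I_d) raises the potential energy by exactly γd = Ldγ. A quick Monte Carlo check (our own illustration):

```python
import numpy as np

# For U(x) = ||x||^2 / 2 (gradient Lipschitz with L = 1), one T_gamma step
# adds independent N(0, 2*gamma*I_d) noise, so
# E(mu T_gamma) - E(mu) = E[||X + Z||^2 - ||X||^2] / 2 = d*gamma = L*d*gamma.
rng = np.random.default_rng(6)
d, gamma, n = 4, 0.1, 200_000

X = rng.standard_normal((n, d))                       # samples from mu = N(0, I)
Z = np.sqrt(2 * gamma) * rng.standard_normal((n, d))  # T_gamma noise

energy_gap = 0.5 * np.mean(np.sum((X + Z) ** 2, axis=1) - np.sum(X ** 2, axis=1))
```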

###### Lemma .

Assume A 1(m) for some m ≥ 0 and A 2. For all γ > 0 and μ, ν ∈ P_2(R^d),

 2γ{E(μS_γ) − E(ν)} ≤ (1 − mγ)W_2²(μ, ν) − W_2²(μS_γ, ν) − γ²(1 − γL) ∫_{R^d} ‖∇U(x)‖² dμ(x),

where E and S_γ are defined in (9) and (14) respectively.

###### Proof.

Using (16) and A 1, for all x, y ∈ R^d, we get

 U(x − γ∇U(x)) − U(y) = U(x − γ∇U(x)) − U(x) + U(x) − U(y) ≤ −γ(1 − γL/2)‖∇U(x)‖² + ⟨∇U(x), x − y⟩ − (m/2)‖x − y‖².

Multiplying both sides by 2γ and using 2γ⟨∇U(x), x − y⟩ = ‖x − y‖² − ‖x − γ∇U(x) − y‖² + γ²‖∇U(x)‖², we obtain:

 2γ{U(x−γ∇U(x))−U(y)}≤(1−mγ)∥x−y∥2−∥x−γ∇U(x)−y∥2−γ2(1−γL)∥∇U(x)∥2. (17)

Let now (X, Y) be an optimal coupling between μ and ν. Applying (17) with x = X, y = Y and taking expectations, we get

 2γ{E(μS_γ) − E(ν)} ≤ (1 − mγ)E[‖X − Y‖²] − E[‖X − γ∇U(X) − Y‖²] − γ²(1 − γL) ∫_{R^d} ‖∇U(x)‖² dμ(x).

Using that E[‖X − Y‖²] = W_2²(μ, ν) and that (X − γ∇U(X), Y) is a coupling of μS_γ and ν, so that W_2²(μS_γ, ν) ≤ E[‖X − γ∇U(X) − Y‖²], concludes the proof. ∎

###### Lemma .

Let γ > 0 and μ ∈ P_2(R^d). Then for all ν ∈ P_2(R^d),

 2γ{H(μTγ)−H(ν)}≤W22(μ,ν)−W22(μTγ,ν),

where T_γ is given in (14).

###### Proof.

Denote for all t ≥ 0 by μ_t = μT_t. Then (μ_t)_{t>0} is the solution (in the sense of distributions) of the heat equation:

 ∂μt∂t=Δμt,

and μ_t goes to μ as t goes to 0 in P_2(R^d). Let ε ∈ (0, γ) and ν ∈ P_2(R^d). Then by Theorem 7, for all s ∈ [ε, γ], there exists δ_s such that

 W_2²(μ_γ, ν) − W_2²(μ_ε, ν) = ∫_ε^γ δ_s ds, (18)
 δ_s/2 ≤ H(ν) − H(μ_s), for almost all s ∈ (ε, γ). (19)

In addition, by [VillaniTransport, Particular case 24.3], s ↦ H(μ_s) is non-increasing, and therefore (19) becomes

 δ_s/2 ≤ H(ν) − H(μ_γ), for almost all s ∈ (ε, γ).

Plugging this bound into (18) yields that for all ε ∈ (0, γ),

 W_2²(μ_γ, ν) − W_2²(μ_ε, ν) ≤ 2(γ − ε){H(ν) − H(μ_γ)}.

Letting ε go to 0 concludes the proof. ∎

We now have all the tools to prove the proposition above.

###### Proof of the proposition.

Let γ ∈ (0, 1/L] and μ ∈ P_2(R^d). By the first lemma above, applied to μS_γ, we get

 E(μRγ)−E(μSγ)=E(μSγTγ)−E(μSγ)≤Ldγ.

By the second lemma, applied with ν = π (which belongs to P_2^a(R^d) by the lemma of Section 2), and using γ ≤ 1/L to drop the last term,

 2γ{E(μS_γ) − E(π)} ≤ (1 − mγ)W_2²(μ, π) − W_2²(μS_γ, π).

By the third lemma and the lemma of Section 2,

 2γ{H(μRγ)−H(π)} =2γ{H((μSγ)Tγ)−H(π)} ≤W22(μSγ,π)−W22(μRγ,π).

Plugging these bounds in (15) concludes the proof. ∎

Based on inequalities of the form (11) and using the convexity of f, non-asymptotic bounds (in expectation) between f(x̄_n) and f(x_f) can be derived, where (x̄_n)_{n≥1} is the sequence of averages of (x_k)_{k≥0}, given for all n ≥ 1 by x̄_n = n^{−1} ∑_{k=1}^n x_k. Besides, if f is assumed to be strongly convex, a bound on ‖x_n − x_f‖² can be established. We will adapt this methodology to obtain bounds on the convergence of sequences of averaged measures, defined as follows. Let (γ_k)_{k≥1} and (λ_k)_{k≥1} be two non-increasing sequences of real numbers, referred to as the sequences of step sizes and weights respectively. Define for all N ∈ N and n ∈ N*,

 Γ_{N,N+n} = ∑_{k=N+1}^{N+n} γ_k, Λ_{N,N+n} = ∑_{k=N+1}^{N+n} λ_k. (20)

Let μ₀ be an initial distribution. The sequence of probability measures (ν_n^N)_{n≥1} is defined for all N ∈ N and n ∈ N* by

 ν_n^N = Λ_{N,N+n}^{−1} ∑_{k=N+1}^{N+n} λ_k μ₀Q_γ^k, Q_γ^k = R_{γ_1} ⋯ R_{γ_k}, for k ∈ N*, (21)

where R_γ is defined by (12) and N is a burn-in time. In the following, we take the convention that Q_γ^0 is the identity operator.
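Operationally, a draw from the averaged measure ν_n^N in (21) can be produced by first picking the iterate index k ∈ {N+1, …, N+n} with probability λ_k/Λ_{N,N+n}, then sampling from μ₀Q_γ^k. A minimal sketch over stored chain realisations (interface and names are our own):

```python
import numpy as np

def sample_averaged(iterates, weights, rng=None):
    """Draw from the averaged measure in (21): pick index k with probability
    lambda_k / Lambda_{N,N+n}, then return a realisation of that iterate.
    `iterates` stores one realisation per marginal mu_0 Q_gamma^k (our
    simplified stand-in for re-running the chain)."""
    rng = np.random.default_rng() if rng is None else rng
    w = np.asarray(weights, dtype=float)
    k = rng.choice(len(iterates), p=w / w.sum())
    return iterates[k]

iterates = np.arange(12.0).reshape(4, 3)   # 4 stored iterates in R^3
draw = sample_averaged(iterates, [1.0, 1.0, 1.0, 1.0],
                       rng=np.random.default_rng(5))
```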

###### Theorem 1.

Assume A 1(m) for some m ≥ 0 and A 2. Let (γ_k)_{k≥1} and (λ_k)_{k≥1} be two non-increasing sequences of positive real numbers satisfying γ_1 ≤ 1/L and, for all k ∈ N*, (1 − mγ_{k+1})λ_{k+1}/γ_{k+1} ≤ λ_k/γ_k. Let N ∈ N and n ∈ N*. Then for all μ₀ ∈ P_2(R^d), it holds:

 KL(ν_n^N ∣∣ π) + λ_{N+n} W_2²(μ₀Q_γ^{N+n}, π) / (2γ_{N+n}Λ_{N,N+n}) ≤ λ_{N+1}(1 − mγ_{N+1}) W_2²(μ₀Q_γ^N, π) / (2γ_{N+1}Λ_{N,N+n}) + (Ld / Λ_{N,N+n}) ∑_{k=N+1}^{N+n} γ_k λ_k,

where Λ_{N,N+n} and Q_γ^k are defined in (20) and (21).

###### Proof.

Using the convexity of the Kullback-Leibler divergence (see [cover:thomas:2006, Theorem 2.7.2] or [canerven:harremos:2014, Theorem 11]) and the proposition above, we obtain

 KL(ν_n^N ∣∣ π) ≤ Λ_{N,N+n}^{−1} ∑_{k=N+1}^{N+n} λ_k KL(μ₀Q_γ^k ∣∣ π) ≤ (2Λ_{N,N+n})^{−1} [ (1 − mγ_{N+1})(λ_{N+1}/γ_{N+1}) W_2²(μ₀Q_γ^N, π) − (λ_{N+n}/γ_{N+n}) W_2²(μ₀Q_γ^{N+n}, π) + ∑_{k=N+1}^{N+n−1} {(1 − mγ_{k+1})λ_{k+1}/γ_{k+1} − λ_k/γ_k} W_2²(μ₀Q_γ^k, π) + 2Ld ∑_{k=N+1}^{N+n} λ_k γ_k ].

We get the result using that, by assumption on the sequences, (1 − mγ_{k+1})λ_{k+1}/γ_{k+1} − λ_k/γ_k ≤ 0 for all k ∈ {N+1, …, N+n−1}. ∎

###### Corollary .

Assume A 1(0) and A 2. Let ε > 0 and μ₀ ∈ P_2(R^d). Let

 γ_ε = ε/(2Ld) and n_ε = ⌈W_2²(μ₀, π)/(γ_ε ε)⌉.

Then it holds KL(ν_{n_ε} | π) ≤ ε, where ν_{n_ε} is defined by (21) with N = 0, constant step size γ_k = γ_ε and weights λ_k = 1.

###### Proof.

We apply Theorem 1 with N = 0, γ_k = γ_ε and λ_k = 1 for all k ∈ N*. We obtain

 KL(ν_{n_ε} | π) + W_2²(μ₀Q_γ^{n_ε}, π) / (2γ_ε n_ε) ≤ W_2²(μ₀, π) / (2γ_ε n_ε) + (Ld/n_ε) ∑_{k=1}^{n_ε} γ_ε,

and the proof is concluded by a straightforward calculation using the definitions of γ_ε and n_ε. ∎

###### Corollary .

Assume A 1() for and A 2. Let . Define and for all by , , . Then, there exists such that for all we have , if , and for , we have