1 Introduction
This paper deals with the problem of sampling from a probability measure $\pi$ on $(\mathbb{R}^d, \mathcal{B}(\mathbb{R}^d))$ which admits a density, still denoted by $\pi$, with respect to the Lebesgue measure, given for all $x \in \mathbb{R}^d$ by $\pi(x) = \mathrm{e}^{-U(x)} / \int_{\mathbb{R}^d} \mathrm{e}^{-U(y)} \, \mathrm{d} y$, where $U : \mathbb{R}^d \to \mathbb{R}$ is called the potential. This problem arises in various fields such as Bayesian statistical inference
[gelman:carlin:stern:rudin:2014], [andrieu:defreitas:doucet:jordan:2003], ill-posed inverse problems [stuart:2010] or computational physics [krauth:2006]. Common and current methods to tackle this issue are Markov Chain Monte Carlo methods
[brooks:gelman:jone:meng:2011], for example the Hastings-Metropolis algorithm [metropolis:rosenbluth:rosenbluth:teller:teller:1953, hastings:1970] or Gibbs sampling [geman:geman:1984]. All these methods boil down to building a Markov kernel on $\mathbb{R}^d$ whose invariant probability distribution is $\pi$
. Yet, choosing an appropriate proposal distribution for the Hastings-Metropolis algorithm is a tricky subject. For this reason, it has been proposed to consider continuous dynamics which naturally leave the target distribution invariant. Perhaps one of the most famous such examples is the overdamped Langevin diffusion [parisi:1981] associated with $U$, assumed to be continuously differentiable:

(1) $\mathrm{d} Y_t = -\nabla U(Y_t) \, \mathrm{d} t + \sqrt{2} \, \mathrm{d} B_t \;,$

where $(B_t)_{t \geq 0}$ is a $d$-dimensional Brownian motion. Under appropriate conditions on $U$, this SDE admits a unique strong solution and defines a strong Markov semigroup $(P_t)_{t \geq 0}$ which converges to $\pi$ in total variation [roberts:tweedie:1996, Theorem 2.1] or Wasserstein distance [bolley:gentil:guillin:2012]. However, simulating path solutions of such stochastic differential equations is not possible in most cases, and discretizations of these equations are used instead. In addition, numerical solutions associated with these schemes define Markov kernels for which $\pi$ is no longer invariant. Therefore, quantifying the error introduced by these approximations is crucial to justify their use to sample from the target $\pi$. We consider in this paper the Euler-Maruyama discretization of (1), which defines the (possibly inhomogeneous) Markov chain $(X_k)_{k \geq 0}$ given for all $k \geq 0$ by

(2) $X_{k+1} = X_k - \gamma_{k+1} \nabla U(X_k) + \sqrt{2 \gamma_{k+1}} \, Z_{k+1} \;,$

where $(\gamma_k)_{k \geq 1}$ is a sequence of step sizes which can be held constant or converge to $0$, and $(Z_k)_{k \geq 1}$ is a sequence of i.i.d. standard $d$-dimensional Gaussian random variables. The use of the Euler-Maruyama discretization (2) to approximately sample from $\pi$ is referred to as the Unadjusted Langevin Algorithm (ULA) (or the Langevin Monte Carlo algorithm (LMC)), and has already been the matter of many works. For example, weak error estimates have been obtained in
[talay:tubaro:1991], [mattingly:stuart:higham:2002] for the constant step-size setting and in [lamberton:pages:2003], [lemaire:2005] when $(\gamma_k)_{k \geq 1}$ is nonincreasing and goes to $0$. Explicit and nonasymptotic bounds on the total variation ([dalalyan:2014], [durmus:moulines:2017]) or the Wasserstein distance ([durmus2016high]) between the distribution of the iterates and $\pi$ have been obtained. Roughly, all these results are based on the comparison between the discretization and the diffusion process and quantify how the error introduced by the discretization accumulates throughout the algorithm. In this paper, we propose another point of view on ULA, which nevertheless shares some relations with the Langevin diffusion (1). Indeed, it has been shown in [jordan1998variational] that the family of distributions $(\mu_0 P_t)_{t \geq 0}$, where $(P_t)_{t \geq 0}$ is the semigroup associated with (1) and $\mu_0$ is a probability measure on $\mathbb{R}^d$ admitting a second moment, is the solution of a gradient flow equation in the Wasserstein space of order $2$ associated with a particular functional $\mathscr{F}$, see Section 2. Therefore, if $\pi$ is invariant for $(P_t)_{t \geq 0}$, then it is a stationary solution of this equation, and it is the unique minimizer of $\mathscr{F}$ if $U$ is convex. Starting from this observation, we interpret ULA as a first-order optimization algorithm on the Wasserstein space of order $2$ with objective functional $\mathscr{F}$. Namely, we adapt some proofs of convergence for the gradient descent algorithm from the convex optimization literature to obtain nonasymptotic and explicit bounds on the Kullback-Leibler divergence from $\pi$ to averaged distributions associated with ULA, in the constant and nonincreasing step-size settings. These bounds easily imply computable bounds in total variation norm and Wasserstein distance. If the potential $U$ is strongly convex and gradient Lipschitz, we recover the results of [durmus:moulines:2017], [durmus2016high] and [cheng:bartlett:2017] when the step size is held constant in (2) (see Table 1). In the case where $U$ is only convex and from a warm start, we get bounds on the complexity for ULA needed to get one sample close to $\pi$ with an accuracy $\varepsilon > 0$, in Kullback-Leibler (KL) divergence and total variation distance respectively (Table 2). The bounds we get starting from a minimizer of $U$ are presented in Table 3.

Table 1: bounds in total variation, Wasserstein distance, and KL divergence for [durmus2016high], [cheng:bartlett:2017], and this paper (strongly convex case).
Table 2: bounds in total variation, Wasserstein distance, and KL divergence for [cheng:bartlett:2017] and this paper (convex case, warm start).
Table 3: bounds in total variation, Wasserstein distance, and KL divergence for [durmus:moulines:2017] and this paper (convex case, starting from a minimizer of $U$).
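For concreteness, the ULA recursion (2) can be implemented in a few lines. The sketch below uses an assumed standard Gaussian target ($U(x) = \|x\|^2/2$, so $\nabla U(x) = x$); this target, and all numerical constants, are illustrative choices and not the settings analyzed in the tables above.

```python
import numpy as np

def ula(grad_U, x0, n_iter, step=0.01, rng=None):
    """Unadjusted Langevin Algorithm, recursion (2):
    X_{k+1} = X_k - step * grad_U(X_k) + sqrt(2 * step) * Z_{k+1}."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.array(x0, dtype=float)
    samples = np.empty((n_iter, x.size))
    for k in range(n_iter):
        noise = rng.standard_normal(x.size)
        x = x - step * grad_U(x) + np.sqrt(2.0 * step) * noise
        samples[k] = x
    return samples

# Illustrative target: standard Gaussian in dimension 2, U(x) = ||x||^2 / 2.
samples = ula(lambda x: x, x0=np.zeros(2), n_iter=20000, step=0.05,
              rng=np.random.default_rng(0))
# After burn-in, the empirical moments should be close to those of N(0, I),
# up to the discretization bias introduced by the constant step size.
burned = samples[5000:]
print(burned.mean(axis=0), burned.var(axis=0))
```

Note that the chain is never Metropolis-adjusted: with a constant step size its invariant measure is only close to, not equal to, the target, which is exactly the error quantified by the bounds discussed above.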
In addition, we propose two new algorithms to sample from a class of nonsmooth log-concave distributions, for which we derive computable nonasymptotic bounds as well. The first one can be applied to Lipschitz convex potentials for which unbiased estimates of subgradients are available. Remarkably, the bounds we obtain for this algorithm depend on the dimension only through the initial condition and the variance of the stochastic subgradient estimates. The second method we propose is a generalization of the Stochastic Gradient Langevin Dynamics algorithm
[welling:teh:2011]. The latter is a popular extension of ULA in which the gradient is replaced by a sequence of i.i.d. unbiased estimators. For this new scheme, we assume that the potential $U$ can be decomposed as the sum of two functions, one at least continuously differentiable and the other only convex, and we use stochastic gradient estimates for the smooth part and the proximal operator associated with the nonsmooth part. This new method is close to the one proposed in [durmus:moulines:pereyra:2016] but is different. To get computable bounds from the target distribution $\pi$, we interpret this algorithm as a first-order optimization algorithm and provide explicit bounds on the Kullback-Leibler divergence from $\pi$ to the distributions associated with SGLD. In the case where $U$ is strongly convex and gradient Lipschitz, we recover the same complexity as [dalalyan:karagulyan:2017] for the Wasserstein distance. We obtain the same complexity for the total variation distance and a corresponding complexity for the KL divergence (Table 4). In the case where $U$ is only convex and from a warm start, we get the complexity needed to obtain one sample close to $\pi$ with an accuracy $\varepsilon > 0$ in KL divergence and total variation distance respectively, see Table 5. The bounds we get starting from a minimizer of $U$ are presented in Table 6. Furthermore, SGLD has also been analyzed in a general setting, i.e. when the potential is not necessarily convex. In [teh:vollmer:zygalakis:2015], this scheme is studied via weak error estimates. Finally, [raginsky2017non] and [xu2017global] give some results regarding the potential use of SGLD as an optimization algorithm to minimize the potential $U$, by targeting a density proportional to $\mathrm{e}^{-\beta U}$ for some $\beta > 0$.
Table 4: bounds in total variation, Wasserstein distance, and KL divergence for [dalalyan:karagulyan:2017] and this paper (strongly convex case).
Table 5: bounds in total variation, Wasserstein distance, and KL divergence for this paper (convex case, warm start).
Table 6: bounds in total variation, Wasserstein distance, and KL divergence for this paper (starting from a minimizer of $U$).
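To make the proximal variant of SGLD concrete, here is a minimal sketch of one stochastic-gradient Langevin step followed by a proximal correction for a nonsmooth convex part. The $\ell_1$ penalty, its soft-thresholding proximal operator, the single-term gradient estimator, and all constants are illustrative assumptions; they are not the exact scheme or conditions analyzed in the paper.

```python
import numpy as np

def prox_l1(x, t, lam):
    """Proximal operator of t * lam * ||.||_1 (soft-thresholding)."""
    return np.sign(x) * np.maximum(np.abs(x) - t * lam, 0.0)

def sgld_prox_step(x, stoch_grad, step, lam, rng):
    """One step: stochastic gradient for the smooth part of the potential,
    Gaussian noise of variance 2*step, then the proximal operator of the
    nonsmooth convex part."""
    noise = rng.standard_normal(x.size)
    y = x - step * stoch_grad(x) + np.sqrt(2.0 * step) * noise
    return prox_l1(y, step, lam)

# Illustrative smooth part: average of ||x - d_i||^2 / 2 over a data set;
# a single uniformly drawn term gives an unbiased gradient estimate.
rng = np.random.default_rng(1)
data = rng.standard_normal((100, 3)) + 2.0
def stoch_grad(x):
    i = rng.integers(len(data))
    return x - data[i]

x = np.zeros(3)
samples = []
for _ in range(5000):
    x = sgld_prox_step(x, stoch_grad, step=0.05, lam=0.1, rng=rng)
    samples.append(x.copy())
samples = np.array(samples[1000:])
print(samples.mean(axis=0))  # concentrates near the data mean, shrunk by the l1 penalty
```

The point of the proximal step is that the nonsmooth part never needs to be differentiated: it enters only through its (often closed-form) proximal map.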
In summary, our contributions are the following:

We give a new interpretation of ULA and use it to obtain bounds on the Kullback-Leibler divergence from $\pi$ to the iterates of ULA. We recover the dependence on the dimension of [cheng:bartlett:2017, Theorem 3] in the strongly convex case and get tighter bounds. Note that this result implies previously known bounds between $\pi$ and the iterates of ULA in the Wasserstein distance and the total variation distance, but with a completely different technique. We also give computable bounds when $U$ is only convex, which improve on the results of [durmus:moulines:2017], [dalalyan:2014] and [cheng:bartlett:2017].

We give two new methodologies to sample from distributions with nonsmooth potentials and provide a nonasymptotic analysis of them. These two new algorithms are generalizations of SGLD.
The paper is organized as follows. In Section 2, we give some intuition on the strategy we take to analyze ULA and its variants. These ideas come from gradient flow theory in Wasserstein space. In Section 3, we give the main results we obtain on ULA and their proof. In Section 4
, two variants of ULA are presented and analyzed. Finally, numerical experiments on logistic regression models are presented in
Section 5 to support our theoretical findings regarding our new methodologies.

Notations and conventions
Denote by $\mathcal{B}(\mathbb{R}^d)$ the Borel $\sigma$-field of $\mathbb{R}^d$, by $\mathrm{Leb}$ the Lebesgue measure on $\mathbb{R}^d$, and by $\mathbb{F}(\mathbb{R}^d)$ the set of all Borel measurable functions on $\mathbb{R}^d$; for $f \in \mathbb{F}(\mathbb{R}^d)$, set $\|f\|_\infty = \sup_{x \in \mathbb{R}^d} |f(x)|$. For a probability measure $\mu$ on $(\mathbb{R}^d, \mathcal{B}(\mathbb{R}^d))$ and a $\mu$-integrable function $f$, denote by $\mu(f)$ the integral of $f$ w.r.t. $\mu$. Let $\mu$ and $\nu$ be two sigma-finite measures on $(\mathbb{R}^d, \mathcal{B}(\mathbb{R}^d))$. Write $\mu \ll \nu$ if $\mu$ is absolutely continuous w.r.t. $\nu$, and $\mathrm{d}\mu / \mathrm{d}\nu$ for the associated density. Let $\mu, \nu$ be two probability measures on $(\mathbb{R}^d, \mathcal{B}(\mathbb{R}^d))$. Define the Kullback-Leibler divergence of $\mu$ from $\nu$ by

$\mathrm{KL}(\mu | \nu) = \int_{\mathbb{R}^d} \frac{\mathrm{d}\mu}{\mathrm{d}\nu}(x) \log\left( \frac{\mathrm{d}\mu}{\mathrm{d}\nu}(x) \right) \mathrm{d}\nu(x)$ if $\mu \ll \nu$, and $+\infty$ otherwise.
We say that $\zeta$ is a transference plan of $\mu$ and $\nu$ if it is a probability measure on $(\mathbb{R}^d \times \mathbb{R}^d, \mathcal{B}(\mathbb{R}^d \times \mathbb{R}^d))$ such that for all measurable sets $\mathsf{A}$ of $\mathbb{R}^d$, $\zeta(\mathsf{A} \times \mathbb{R}^d) = \mu(\mathsf{A})$ and $\zeta(\mathbb{R}^d \times \mathsf{A}) = \nu(\mathsf{A})$. We denote by $\Pi(\mu, \nu)$ the set of transference plans of $\mu$ and $\nu$. Furthermore, we say that a couple of random variables $(X, Y)$ is a coupling of $\mu$ and $\nu$ if there exists $\zeta \in \Pi(\mu, \nu)$ such that $(X, Y)$ is distributed according to $\zeta$. For two probability measures $\mu$ and $\nu$, we define the Wasserstein distance of order $2$ as

(3) $W_2(\mu, \nu) = \left( \inf_{\zeta \in \Pi(\mu, \nu)} \int_{\mathbb{R}^d \times \mathbb{R}^d} \|x - y\|^2 \, \mathrm{d}\zeta(x, y) \right)^{1/2} \;.$

By [VillaniTransport, Theorem 4.1], for all probability measures $\mu, \nu$ on $\mathbb{R}^d$, there exists a transference plan $\zeta^\star \in \Pi(\mu, \nu)$ such that for any coupling $(X, Y)$ distributed according to $\zeta^\star$, $W_2(\mu, \nu) = \mathbb{E}[\|X - Y\|^2]^{1/2}$. This kind of transference plan (respectively coupling) will be called an optimal transference plan (respectively optimal coupling) associated with $W_2$. We denote by $\mathcal{P}_2(\mathbb{R}^d)$ the set of probability measures with finite second moment: for all $\mu \in \mathcal{P}_2(\mathbb{R}^d)$, $\int_{\mathbb{R}^d} \|x\|^2 \, \mathrm{d}\mu(x) < +\infty$. By [VillaniTransport, Theorem 6.16], $\mathcal{P}_2(\mathbb{R}^d)$ equipped with the Wasserstein distance $W_2$ of order $2$ is a complete separable metric space.
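For intuition only: in dimension one, the Wasserstein distance of order $2$ between two empirical measures with equally many, equally weighted atoms is attained by the quantile coupling, i.e. by matching sorted samples. The following sketch is an illustrative computation and not part of the paper's analysis.

```python
import numpy as np

def w2_empirical_1d(xs, ys):
    """W2 distance between two 1-d empirical measures with the same number of
    equally weighted atoms: the optimal transference plan sorts both samples."""
    xs, ys = np.sort(xs), np.sort(ys)
    return np.sqrt(np.mean((xs - ys) ** 2))

# Two point clouds differing by a translation of 3: W2 equals the shift.
rng = np.random.default_rng(0)
a = rng.standard_normal(1000)
b = a + 3.0
print(w2_empirical_1d(a, b))  # -> approx 3.0 for a pure translation
```

The sorting trick is specific to the real line; in higher dimension computing an optimal coupling requires solving an optimal transport problem.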
For two probability measures $\mu$ and $\nu$ on $(\mathbb{R}^d, \mathcal{B}(\mathbb{R}^d))$, the total variation distance between $\mu$ and $\nu$ is defined by $\|\mu - \nu\|_{\mathrm{TV}} = \sup_{\mathsf{A} \in \mathcal{B}(\mathbb{R}^d)} |\mu(\mathsf{A}) - \nu(\mathsf{A})|$.
Let $k \in \mathbb{N}$ and $\mathsf{U}$ be an open set of $\mathbb{R}^d$. Denote by $C^k(\mathsf{U})$ the set of $k$-times continuously differentiable functions from $\mathsf{U}$ to $\mathbb{R}$, and by $C^k_c(\mathsf{U})$ the set of $k$-times continuously differentiable functions from $\mathsf{U}$ to $\mathbb{R}$ with compact support. Let $I$ be an interval and $f : I \to \mathbb{R}$. The function $f$ is absolutely continuous on $I$ if for all $\varepsilon > 0$, there exists $\delta > 0$ such that for all $n \in \mathbb{N}^*$ and all disjoint subintervals $(s_1, t_1), \ldots, (s_n, t_n)$ of $I$ with $\sum_{i=1}^n |t_i - s_i| \leq \delta$, we have $\sum_{i=1}^n |f(t_i) - f(s_i)| \leq \varepsilon$.
In the sequel, we take the convention that for $n < p$, $\sum_{k=p}^{n} = 0$ and $\prod_{k=p}^{n} = 1$.
2 Interpretation of ULA as an optimization algorithm
Throughout this paper, we assume that the potential $U$ satisfies the following condition for some $m \geq 0$.
A 1 ($m$).
$U$ is $m$-convex, i.e. for all $x, y \in \mathbb{R}^d$ and $t \in [0, 1]$,
$U(t x + (1 - t) y) \leq t U(x) + (1 - t) U(y) - \frac{m}{2} t (1 - t) \|x - y\|^2 \;.$
Note that A 1 includes the case where $U$ is only convex when $m = 0$. We consider in this section the following additional condition on $U$, which will be relaxed in Section 4.
A 2.
$U$ is continuously differentiable and gradient Lipschitz, i.e. there exists $L \geq 0$ such that for all $x, y \in \mathbb{R}^d$,
$\|\nabla U(x) - \nabla U(y)\| \leq L \|x - y\| \;.$
Under A 1 and A 2, the Langevin diffusion (1) has a unique strong solution $(Y_t)_{t \geq 0}$ for any starting point $x \in \mathbb{R}^d$. The Markovian semigroup $(P_t)_{t \geq 0}$, given for all $t \geq 0$, $x \in \mathbb{R}^d$ and $\mathsf{A} \in \mathcal{B}(\mathbb{R}^d)$ by $P_t(x, \mathsf{A}) = \mathbb{P}(Y_t \in \mathsf{A} \mid Y_0 = x)$, is reversible with respect to $\pi$, and $\pi$ is its unique invariant probability measure, see [ambrosio:savare:zambotti:2009, Theorem 1.2, Theorem 1.6]. Using this probabilistic framework, [roberts:tweedie:1996, Theorem 1.2] shows that the semigroup is irreducible with respect to the Lebesgue measure and strong Feller. But to study the properties of the semigroup $(P_t)_{t \geq 0}$, another complementary and significant approach can be used. This dual point of view is based on the adjoint of the infinitesimal generator associated with $(P_t)_{t \geq 0}$. The strong generator $\mathcal{A}$ of (1) is defined for all $f$ in its domain $D(\mathcal{A})$ and $x \in \mathbb{R}^d$ by

$\mathcal{A} f(x) = -\langle \nabla U(x), \nabla f(x) \rangle + \Delta f(x) \;,$

where $D(\mathcal{A})$ is the set of functions for which the limit defining the generator exists. In particular, for $f \in C^2_c(\mathbb{R}^d)$, this expression follows from Itô's formula.
In addition, by [ethier:kurtz:1986, Proposition 1.5], for all $f \in D(\mathcal{A})$ and $t \geq 0$, $P_t f \in D(\mathcal{A})$, and $t \mapsto P_t f$ is continuously differentiable,

(4) $\frac{\mathrm{d}}{\mathrm{d} t} P_t f = \mathcal{A} P_t f = P_t \mathcal{A} f \;.$
For all $x \in \mathbb{R}^d$ and $t > 0$, by Girsanov's Theorem [karatzas:shreve:1991, Theorem 5.1, Corollary 5.16, Chapter 3], $\delta_x P_t$ admits a density with respect to the Lebesgue measure, denoted by $\rho_t$. By (4), this density is a solution of the Fokker-Planck equation (in the weak sense):

$\frac{\partial \rho_t}{\partial t} = \operatorname{div}(\rho_t \nabla U) + \Delta \rho_t \;,$

meaning that for all $t \geq 0$ and $f \in C^\infty_c(\mathbb{R}^d)$,

(5) $\frac{\mathrm{d}}{\mathrm{d} t} \int_{\mathbb{R}^d} f(y) \rho_t(y) \, \mathrm{d} y = \int_{\mathbb{R}^d} \left\{ -\langle \nabla U(y), \nabla f(y) \rangle + \Delta f(y) \right\} \rho_t(y) \, \mathrm{d} y \;.$
In the landmark paper [jordan1998variational], the authors show that if $U$ is infinitely continuously differentiable, the law of the diffusion is the limit of the following minimization scheme, which defines a sequence of probability measures. For $\gamma > 0$ and $\mu_0 \in \mathcal{P}_2(\mathbb{R}^d)$, set $\varrho_0 = \mu_0$ and, for $k \geq 0$,

(6) $\varrho_{k+1} \in \operatorname*{arg\,min}_{\mu \in \mathcal{P}_2(\mathbb{R}^d)} \left\{ \mathscr{F}(\mu) + \frac{1}{2 \gamma} W_2^2(\mu, \varrho_k) \right\} \;,$

where $\mathscr{F}$ is the free energy functional,

(7) $\mathscr{F} = \mathscr{H} + \mathscr{E} \;,$

and $\mathscr{H}$, $\mathscr{E}$ are the Boltzmann H-functional and the potential energy functional, given for all $\mu \in \mathcal{P}_2(\mathbb{R}^d)$ by

(8) $\mathscr{H}(\mu) = \int_{\mathbb{R}^d} \frac{\mathrm{d}\mu}{\mathrm{d}\,\mathrm{Leb}}(x) \log\left( \frac{\mathrm{d}\mu}{\mathrm{d}\,\mathrm{Leb}}(x) \right) \mathrm{d} x$ if $\mu \ll \mathrm{Leb}$, and $+\infty$ otherwise,

(9) $\mathscr{E}(\mu) = \int_{\mathbb{R}^d} U(x) \, \mathrm{d}\mu(x) \;.$

More precisely, setting $\varrho_\gamma(t) = \varrho_k$ for $t \in [k\gamma, (k+1)\gamma)$, [jordan1998variational, Theorem 5.1] shows that for all $t > 0$, $\varrho_\gamma(t)$ converges weakly to the law of the diffusion at time $t$ as $\gamma$ goes to $0$. This result has been extended and cast into the framework of gradient flows in the Wasserstein space $(\mathcal{P}_2(\mathbb{R}^d), W_2)$, see [ambrosio2008gradient]. We provide a short introduction to this topic in Appendix A and present concepts and results which are useful for our proofs. Note that this scheme can be seen as a proximal-type algorithm (see [martinet:1970] and [rockafeller:1976]) on the Wasserstein space, used to minimize the functional $\mathscr{F}$. The following lemma shows that $\pi$ is the unique minimizer of $\mathscr{F}$. As a result, the distribution of the Langevin diffusion is the steepest descent flow of $\mathscr{F}$, and we intuitively recover that this process converges to the target distribution $\pi$.
Lemma.
Assume A 1. The following holds:

a) $\pi \in \mathcal{P}_2(\mathbb{R}^d)$ and $\mathscr{F}(\pi) < +\infty$.

b) For all $\mu \in \mathcal{P}_2(\mathbb{R}^d)$ satisfying $\mu \ll \mathrm{Leb}$,

(10) $\mathscr{F}(\mu) - \mathscr{F}(\pi) = \mathrm{KL}(\mu | \pi) \;.$
Proof.
The proof is postponed to Section 7.1. ∎
Based on this interpretation, one could think of minimizing $\mathscr{F}$ on the Wasserstein space to get close to $\pi$ using the minimization scheme (6). However, while this scheme is shown in [jordan1998variational] to be well-defined, finding explicit recursions is as difficult as minimizing $\mathscr{F}$ itself, and therefore the scheme cannot be used in practice. In addition, to the authors' knowledge, there are no efficient and practical schemes to optimize this functional. On the other hand, discretization schemes have been used to approximate the Langevin diffusion (1) and its long-time behaviour. One of the most popular methods is the Euler-Maruyama discretization given in (2). While most works study the theoretical properties of this discretization, to ensure that the samples produced are close to the target distribution $\pi$, by comparing the distributions of the iterates and the diffusion through couplings or weak error expansions, we interpret this scheme as a first-order optimization algorithm for the objective functional $\mathscr{F}$.
3 Main results for the Unadjusted Langevin algorithm
Let $f$ be a convex continuously differentiable objective function with $x^\star \in \operatorname*{arg\,min}_{\mathbb{R}^d} f$. The inexact or stochastic gradient descent algorithm used to estimate $x^\star$ defines the sequence $(x_n)_{n \geq 0}$, starting from $x_0 \in \mathbb{R}^d$, by the following recursion for $n \geq 0$:

$x_{n+1} = x_n - \gamma_{n+1} \left( \nabla f(x_n) + \zeta_{n+1} \right) \;,$

where $(\gamma_n)_{n \geq 1}$ is a nonincreasing sequence of step sizes and $(\zeta_n)_{n \geq 1}$ is a deterministic and/or stochastic perturbation of the gradients. To get an explicit bound on the convergence (in expectation) of the sequence $(f(x_n))_{n \geq 0}$ to $f(x^\star)$, one possibility (see e.g. [beck:teboule:2009]) is to show that the following inequality holds: for all $n \geq 0$,

(11) $2 \gamma_{n+1} \left\{ \mathbb{E}\left[ f(x_{n+1}) \right] - f(x^\star) \right\} \leq \mathbb{E}\left[ \|x_n - x^\star\|^2 \right] - \mathbb{E}\left[ \|x_{n+1} - x^\star\|^2 \right] + C \gamma_{n+1}^2 \;,$

for some constant $C \geq 0$. In a similar manner as for inexact gradient algorithms, in this section we establish that ULA satisfies an inequality of the form (11), with the objective functional $\mathscr{F}$ defined by (7) on $\mathcal{P}_2(\mathbb{R}^d)$ and with the Wasserstein distance of order $2$ in place of the Euclidean norm.
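To build intuition for why an inequality of the form (11) yields convergence, note that summing it over $n$ telescopes the squared distances, so the step-size-weighted average of the objective gaps is bounded by the initial distance plus a term in the squared step sizes. The sketch below checks this numerically for noisy gradient descent on a toy quadratic; the objective, step-size schedule, and noise level are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: 0.5 * np.dot(x, x)          # f(x) = ||x||^2 / 2, minimizer x* = 0
grad = lambda x: x
x, x_star = np.full(4, 5.0), np.zeros(4)
gammas = [0.5 / np.sqrt(n + 1) for n in range(2000)]

gap_sum = 0.0
for g in gammas:
    noise = 0.1 * rng.standard_normal(4)  # stochastic perturbation of the gradient
    x = x - g * (grad(x) + noise)
    gap_sum += g * (f(x) - f(x_star))

# Telescoping (11): sum_n 2*gamma_{n+1}*(E[f(x_{n+1})] - f(x*)) is bounded by
# ||x_0 - x*||^2 + C * sum_n gamma_{n+1}^2, so the weighted-average gap is small.
avg_gap = gap_sum / sum(gammas)
print(avg_gap)
```

The same telescoping argument, with $W_2^2$ in place of the squared Euclidean distance, is the backbone of the bounds for ULA below.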
Consider the family of Markov kernels $(R_\gamma)_{\gamma > 0}$ associated with the Euler-Maruyama discretization (2) for a sequence of step sizes $(\gamma_k)_{k \geq 1}$, given for all $x \in \mathbb{R}^d$, $\gamma > 0$ and $\mathsf{A} \in \mathcal{B}(\mathbb{R}^d)$ by

(12) $R_\gamma(x, \mathsf{A}) = (4 \pi \gamma)^{-d/2} \int_{\mathsf{A}} \exp\left( - \|y - x + \gamma \nabla U(x)\|^2 / (4 \gamma) \right) \mathrm{d} y \;.$

For our analysis, we decompose $R_\gamma$ for all $\gamma > 0$ into the product of two elementary kernels $S_\gamma$ and $T_\gamma$ given for all $x \in \mathbb{R}^d$ and $\mathsf{A} \in \mathcal{B}(\mathbb{R}^d)$ by

(14) $S_\gamma(x, \mathsf{A}) = \delta_{x - \gamma \nabla U(x)}(\mathsf{A}) \;, \qquad T_\gamma(x, \mathsf{A}) = (4 \pi \gamma)^{-d/2} \int_{\mathsf{A}} \mathrm{e}^{- \|y - x\|^2 / (4 \gamma)} \, \mathrm{d} y \;.$

We take the convention that $R_0$ is the identity kernel, given for all $x \in \mathbb{R}^d$ by $R_0(x, \cdot) = \delta_x$. The kernel $S_\gamma$ is the deterministic part of the Euler-Maruyama discretization, which corresponds to a gradient descent step for the potential energy functional $\mathscr{E}$, whereas $T_\gamma$ is the random part, which corresponds to going along the gradient flow of $\mathscr{H}$. Note then that $R_\gamma = S_\gamma T_\gamma$ and consider the following decomposition
(15) 
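Concretely, one step of the discretization (2) can be implemented as the deterministic gradient map followed by adding Gaussian noise of variance $2\gamma$, matching the split into a deterministic and a random part above. The sketch below is illustrative (the Gaussian potential and the function names are assumptions, not the paper's notation).

```python
import numpy as np

def S_step(x, grad_U, gamma):
    """Deterministic part: one gradient descent step for the potential energy."""
    return x - gamma * grad_U(x)

def T_step(x, gamma, rng):
    """Random part: Gaussian smoothing with variance 2*gamma per coordinate,
    i.e. gamma units of heat flow (the Wasserstein gradient flow of the entropy)."""
    return x + np.sqrt(2.0 * gamma) * rng.standard_normal(np.shape(x))

def ula_step(x, grad_U, gamma, rng):
    """One Euler-Maruyama step: the deterministic map followed by the noise."""
    return T_step(S_step(x, grad_U, gamma), gamma, rng)

rng = np.random.default_rng(0)
x = np.ones(3)
y = ula_step(x, lambda z: z, gamma=0.1, rng=rng)  # illustrative U(x) = ||x||^2 / 2
print(y)
```

Composing the two maps reproduces exactly the single update of (2); the analysis bounds the effect of each part on the free energy separately.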
The proof of Section 3 then consists in bounding each difference in the decomposition above. This is the matter of the following Lemma:
Lemma .
Assume A 2. For all and ,
Proof.
First note that by [nesterov:2004, Lemma 1.2.3], for all $x, y \in \mathbb{R}^d$, we have

(16) $U(y) \leq U(x) + \langle \nabla U(x), y - x \rangle + \frac{L}{2} \|y - x\|^2 \;.$
Therefore, for all and , we get
which concludes the proof. ∎
Proof.
Lemma .
Proof.
Denote for all by . Then, is the solution (in the sense of distributions) of the Fokker-Planck equation:
and goes to as goes to in . Let and . Then by Theorem 7, for all , there exists such that
(18)  
(19) 
In addition by [VillaniTransport, Particular case 24.3], is nonincreasing on and therefore (19) becomes
Plugging this bound in (18) yields that for all ,
Taking concludes the proof. ∎
We now have all the tools to prove Section 3.
Proof of Section 3.
Based on inequalities of the form (11) and using the convexity of $f$, for all $n \geq 1$, nonasymptotic bounds (in expectation) between $f(\bar{x}_n)$ and $f(x^\star)$ can be derived, where $(\bar{x}_n)_{n \geq 1}$ is the sequence of averages of $(x_k)_{k \geq 0}$ given for all $n \geq 1$ by $\bar{x}_n = n^{-1} \sum_{k=1}^{n} x_k$. Besides, if $f$ is assumed to be strongly convex, a bound on the distance to the minimizer can be established. We will adapt this methodology to get some bounds on the convergence of sequences of averaged measures defined as follows. Let $(\gamma_k)_{k \geq 1}$ and $(\lambda_k)_{k \geq 1}$ be two nonincreasing sequences of real numbers, referred to as the sequences of step sizes and weights respectively. Define for all $n \geq 1$ and $N \geq 0$,
(20) 
Let $\mu_0$ be an initial distribution. The sequence of averaged probability measures is defined for all $n \geq 1$, $N \geq 0$, by
(21) 
where the Markov kernels are defined by (12) and $N$ is a burn-in time. We take in the following the convention that an empty product of Markov kernels is the identity operator.
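In practice, a draw from such an averaged measure can be obtained by first picking an iteration index after burn-in with probability proportional to its weight, and then returning the chain's state at that index. The sketch below does this for an illustrative one-dimensional ULA chain with uniform weights; the target, the weights, and all constants are assumptions for illustration only.

```python
import numpy as np

def averaged_sample(chain, weights, n_burn, rng):
    """Draw from the weighted average of the laws of the iterates after burn-in:
    pick index k >= n_burn with probability lambda_k / sum(lambda), return X_k."""
    w = np.asarray(weights[n_burn:], dtype=float)
    k = rng.choice(np.arange(n_burn, len(chain)), p=w / w.sum())
    return chain[k]

# Illustrative chain: ULA for U(x) = x^2 / 2 in dimension 1.
rng = np.random.default_rng(0)
gamma, x = 0.05, 0.0
chain = []
for _ in range(10000):
    x = x - gamma * x + np.sqrt(2 * gamma) * rng.standard_normal()
    chain.append(x)
chain = np.array(chain)

draws = np.array([averaged_sample(chain, np.ones(len(chain)), 2000, rng)
                  for _ in range(2000)])
print(draws.mean(), draws.var())
```

Averaging over iterates plays the same role here as iterate averaging in convex optimization: it is the averaged measure, not the last iterate, that the bounds control in the merely convex case.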
Theorem 1.
Proof.
Using the convexity of KullbackLeibler divergence (see [cover:thomas:2006, Theorem 2.7.2] or [canerven:harremos:2014, Theorem 11]) and Section 3, we obtain
We conclude using that the inequality holds for all $k$. ∎
Proof.
We apply Theorem 1 with and for all . We obtain
and the proof is concluded by a straightforward calculation using the definition of and . ∎