 # User-friendly guarantees for the Langevin Monte Carlo with inaccurate gradient

In this paper, we revisit the recently established theoretical guarantees for the convergence of the Langevin Monte Carlo algorithm of sampling from a smooth and (strongly) log-concave density. We improve, in terms of constants, the existing results when the accuracy of sampling is measured in the Wasserstein distance and provide further insights on relations between, on the one hand, the Langevin Monte Carlo for sampling and, on the other hand, the gradient descent for optimization. More importantly, we establish non-asymptotic guarantees for the accuracy of a version of the Langevin Monte Carlo algorithm that is based on inaccurate evaluations of the gradient. Finally, we propose a variable-step version of the Langevin Monte Carlo algorithm that has two advantages. First, its step-sizes are independent of the target accuracy and, second, its rate provides a logarithmic improvement over the constant-step Langevin Monte Carlo algorithm.

## Authors

##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction

The problem of sampling a random vector distributed according to a given target distribution is central in many applications. In the present paper, we consider this problem in the case of a target distribution having a smooth and log-concave density

and when the sampling is performed by a version of the Langevin Monte Carlo algorithm (LMC). More precisely, for a positive integer , we consider a continuously differentiable function satisfying the following assumption: For some positive constants and , it holds

 {f(θ)−f(θ′)−∇f(θ′)⊤(θ−θ′)≥(\nicefracm2)∥θ−θ′∥22,\vphantom{I∫II}∥∇f(θ)−∇f(θ′)∥2≤M∥θ−θ′∥2,∀θ,θ′∈Rp, (1)

where stands for the gradient of and is the Euclidean norm. The target distributions considered in this paper are those having a density with respect to the Lebesgue measure on given by

 π(θ)=e−f(θ)∫Rpe−f(u)du. (2)

We say that the density is log-concave (resp. strongly log-concave) if the function satisfies the first inequality of (1) with (resp. ).

Most part of this work focused on the analysis of the LMC algorithm, which can be seen as the analogue in the problem of sampling of the gradient descent algorithm for optimization. For a sequence of positive parameters , referred to as the step-sizes and for an initial point that may be deterministic or random, the iterations of the LMC algorithm are defined by the update rule

 ϑk+1,h=ϑk,h−hk+1∇f(ϑk,h)+√2hk+1ξk+1;k=0,1,2,… (3)

where is a sequence of mutually independent, and independent of , centered Gaussian vectors with covariance matrices equal to identity.

When all the ’s are equal to some value , we will call the sequence in (3) the constant step LMC and will denote it by . When satisfies assumptions (1), if is small and is large (so that the product is large), the distribution of is known to be a good approximation to the distribution with density . An important question is to quantify the quality of this approximation. An appealing approach to address this question is by establishing non asymptotic upper bounds on the error of sampling; this kind of bounds are particularly useful for deriving a stopping rule for the LMC algorithm, as well as for understanding the computational complexity of sampling methods in high dimensional problems. In the present paper we establish such bounds by focusing on their user-friendliness. The latter means that our bounds are easy to interpret, hold under conditions that are not difficult to check and lead to simple theoretically grounded choice of the number of iterations and the step-size.

In the present work, we measure the error of sampling in the Wasserstein-Monge-Kantorovich distance . For two measures and defined on , and for a real number , is defined by

 Wq(μ,ν)=(infϱ∈ϱ(μ,ν)∫Rp×Rp∥θ−θ′∥q2dϱ(θ,θ′))1/q, (4)

where the

is with respect to all joint distributions

having and

as marginal distributions. For statistical and machine learning applications, we believe that this distance is more suitable for assessing the quality of approximate sampling schemes than other metrics such as the total variation or the Kullback-Leibler divergence. Indeed, bounds on the Wasserstein distance—unlike the bounds on the total-variation—provide direct guarantees on the accuracy of approximating the first and the second order moments.

Asymptotic properties of the LMC algorithm, also known as Unadjusted Langevin Algorithm (ULA), and its Metropolis adjusted version, MALA, have been studied in a number of papers (Roberts and Tweedie, 1996; Roberts and Rosenthal, 1998; Stramer and Tweedie, 1999a, b; Jarner and Hansen, 2000; Roberts and Stramer, 2002). These results do not emphasize the effect of the dimension on the computational complexity of the algorithm, which is roughly proportional to the number of iterations. Non asymptotic bounds on the total variation error of the LMC for log-concave and strongly log-concave distributions have been established by Dalalyan (2017b). If a warm start is available, the results in Dalalyan (2017b) imply that after iterations the LMC algorithm has an error bounded from above by . Furthermore, if we assume that in addition to (1) the function has a Lipschitz continuous Hessian, then a modified version of the LMC, the LMC with Ozaki discretization (LMCO), needs iterations to achieve a precision level . These results were improved and extended to the Wasserstein distance by (Durmus and Moulines, 2017, 2016). More precisely, they removed the condition of the warm start and proved that under the Lipschitz continuity assumption on the Hessian of , it is not necessary to modify the LMC for getting the rate . The last result is closely related to an error bound between a diffusion process and its Euler discretization established by Alfonsi et al. (2014).

On a related note, (Bubeck et al., 2018) studied the convergence of the LMC algorithm with reflection at the boundary of a compact set, which makes it possible to sample from a compactly supported density (see also (Brosse et al., 2017)). Extensions to non-smooth densities were presented in (Durmus and Pereyra, 2016; Luu et al., 2017). (Cheng and Bartlett, 2017) obtained guarantees similar to those in (Dalalyan, 2017b) when the error is measured by the Kullback-Leibler divergence. Very recently, (Cheng et al., 2017) derived non asymptotic guarantees for the underdamped LMC which turned out to improve on the previously known results. Langevin dynamics was used in (Andrieu et al., 2016; Brosse et al., 2017) in order to approximate normalizing constants of target distributions. Huggins and Zou (2017) established tight bounds in Wasserstein distance between the invariant distributions of two (Langevin) diffusions; the bounds involve mixing rates of the diffusions and the deviation in their drifts.

The goal of the present work is to push further the study of the LMC and its variants both by improving the existing guarantees and by extending them in some directions. Our main contributions can be summarized as follows:

• We state simplified guarantees in Wasserstein distance with improved constants both for the LMC and the LMCO when the step-size is constant, see Theorem 1 and Theorem 6.

• We propose a varying-step LMC which avoids a logarithmic factor in the number of iterations required to achieve a precision level , see Theorem 2.

• We extend the previous guarantees to the case where accurate evaluations of the gradient are unavailable. Thus, at each iteration of the algorithm, the gradient is computed within an error that has a deterministic and a stochastic component. Theorem 4 deals with functions satisfying (1), whereas Theorem 5 requires the additional assumption of the smoothness of the Hessian of .

• We propose a new second-order sampling algorithm termed LMCO’. It has a per-iteration computational cost comparable to that of the LMC and enjoys nearly the same guarantees as the LMCO, when the Hessian of is Lipschitz continuous, see Theorem 6.

• We provide a detailed discussion of the relations between, on the one hand, the sampling methods and guarantees of their convergence and, on the other hand, optimization methods and guarantees of their convergence (see Section 5).

We have to emphasize right away that Theorem 1 is a corrected version of (Dalalyan, 2017a, Theorem 1), whereas Theorem 4 extends (Dalalyan, 2017a, Theorem 3) to more general noise. In particular, Theorem 4 removes the unbiasedness and independence conditions. Furthermore, thanks to a shrewd use of a recursive inequality, the upper bound in Theorem 4 is tighter than the one in (Dalalyan, 2017a, Theorem 3).

As an illustration of the first two bullets mentioned in the above summary of our contributions, let us consider the following example. Assume that , and we have at our disposal an initial sampling distribution satisfying . The main inequalities in Theorem 1 and Theorem 2 imply that after iterations, the distribution obtained by the LMC algorithm satisfies

 W2(νK,π)≤(1−mh)KW2(ν0,π)+1.65(M/m)(hp)1/2 (5)

for the constant step LMC and

 W2(νK,π)≤3.5M√pm√M+m+(\nicefrac23)m(K−K1) (6)

for the varying-step LMC, where is an integer the precise value of which is provided in Theorem 2. One can compare these inequalities with the corresponding bound in (Durmus and Moulines, 2016): adapted to the constant-step, it takes the form

 W22(νK,π)≤ 2(1−mMhm+M)KW22(ν0,π) (7) +Mhpm(m+M)(h+m+M2mM)(2+M2hm+M2h26). (8)

For any , we can derive from these guarantees the smallest number of iterations, , for which there is a such that the corresponding upper bound is smaller than . The logarithms of these values for varying and are plotted in Figure 1. We observe that for all the considered values of and , the number of iterations derived from (6) (referred to as Theorem 2) is smaller than those derived from (5) (referred to as Theorem 1) and from (8) (referred to as DM bound). The difference between the varying-step LMC and the constant step LMC becomes more important when the target precision level gets smaller. In average over all values of , when , the number of iterations derived from (8) is 4.6 times larger than that derived from (6), and almost times larger than the number of iterations derived from (5). Figure 1: Plots showing the logarithm of the number of iterations as function of dimension p for several values of ϵ. The plotted values are derived from (5)-(8) using the data m=10, M=20, W2(ν0,π)=p+(p/m).

## 2 Guarantees in the Wasserstein distance with accurate gradient

The rationale behind the LMC (3) is simple: the Markov chain is the Euler discretization of a continuous-time diffusion process , known as Langevin diffusion. The latter is defined by the stochastic differential equation

 dLt=−∇f(Lt)dt+√2dWt,t≥0, (9)

where is a -dimensional Brownian motion. When satisfies condition (1), equation (9) has a unique strong solution, which is a Markov process. Furthermore, the process has as invariant density (Bhattacharya, 1978, Thm. 3.5). Let be the distribution of the -th iterate of the LMC algorithm, that is . In what follows, we present user-friendly guarantees on the closeness of and , when is strongly convex.

### 2.1 Reminder on guarantees for the constant-step LMC

When the function is -strongly convex and -gradient Lipschitz, upper bounds on the sampling error measured in Wasserstein distance of the LMC algorithm have been established in (Durmus and Moulines, 2016; Dalalyan, 2017a). We state below a slightly adapted version of their result, which will serve as a benchmark for the bounds obtained in this work.

###### Theorem 1.

Assume that and satisfies condition (1). The following claims hold:

1. If then .

2. If then .

We refer the readers interested in the proof of this theorem either to (Dalalyan, 2017a) or to Section 7, where the latter is obtained as a direct consequence of Theorem 4. The factor is obtained by upper bounding .

In practice, a relevant approach to getting an accuracy of at most is to minimize the upper bound provided by Theorem 1 with respect to , for a fixed . Then, one can choose the smallest for which the obtained upper bound is smaller than . One useful observation is that the upper bound of case (b) is an increasing function of . Its minimum is always attained at , which means that one can always look for a step-size in the interval by minimizing the upper bound in (a). This can be done using standard line-search methods such as the bisection algorithm.

Note that if the initial value is deterministic then, using the notation , in view of (Durmus and Moulines, 2016, Proposition 1), we have

 W2(ν0,π)2 =∫Rp∥θ0−θ∥22π(dθ)≤∥θ0−θ∗∥22+p/m. (10)

Finally, let us remark that if we choose and so that

 h≤\nicefrac2(m+M),e−mhKW2(ν0,π)≤ε/2,1.65(M/m)(hp)1/2≤ε/2, (11)

then we have . In other words, conditions (11) are sufficient for the density of the output of the LMC algorithm after iterations to be within the precision of the target density when the precision is measured using the Wasserstein distance. This readily yields

 h≤m2ε211M2p∧2m+MandhK≥1mlog(2(∥θ0−θ∗∥22+p/m)1/2ε) (12)

Assuming and to be constants, we can deduce from the last display that it suffices number of iterations in order to reach the precision level . This fact has been first established in (Dalalyan, 2017b) for the LMC algorithm with a warm start and the total-variation distance. It was later improved by Durmus and Moulines (2017, 2016), who showed that the same result holds for any starting point and established similar bounds for the Wasserstein distance. Theorem 1 above can be seen as a user-friendly version of the corresponding result established by Durmus and Moulines (2016).

###### Remark 2.1.

Although (10) is relevant for understanding the order of magnitude of , it has limited applicability since the distance might be hard to evaluate. An attractive alternative to that bound is the following111The second line follows from strong convexity whereas the third line is a consequence of the fact that is a stationary point of .:

 mW2(ν0,π)2 ≤m∥θ0−θ∗∥22+p (13) ≤2(f(θ0)−f(θ∗)−∇f(θ∗)⊤(θ0−θ))+p (14) =2(f(θ0)−f(θ∗))+p. (15)

If is lower bounded by some known constant, for instance if , the last inequality provides the computable upper bound .

### 2.2 Guarantees under strong convexity for the varying step LMC

The result of previous section provides a guarantee for the constant step LMC. One may wonder if using a variable step sizes can improve the convergence. Note that in (Durmus and Moulines, 2016, Theorem 5), guarantees for the variable step LMC are established. However, they do not lead to a clear message on the choice of the step-sizes. The next result fills this gap by showing that an appropriate selection of step-sizes improves on the constant step LMC with an improvement factor logarithmic in .

###### Theorem 2.

Let us consider the LMC algorithm with varying step-size defined by

 hk+1=2M+m+(\nicefrac23)m(k−K1)+,k=1,2,… (16)

where is the smallest non-negative integer satisfying222Combining the definition of and the upper bound in (10), one easily checks that if is bounded, then is upper bounded by a constant that does not depend on the dimension .

 K1≥ln(W2(ν0,π)/√p)+ln(m/M)+(\nicefrac12)ln(M+m)ln(1+\nicefrac2mM−m). (17)

If satisfies (1), then for every , we have

 W2(νk,π)≤3.5M√pm√M+m+(\nicefrac23)m(k−K1). (18)

The step size (16) has two important advantages as compared to the constant steps. The first advantage is that it is independent of the target precision level . The second advantage is that we get rid of the logarithmic terms in the number of iterations required to achieve the precision level . Indeed, it suffices iterations to get the right hand side of (18) smaller than , where depends neither on the dimension nor on the precision level .

Since the choice of in (16) might appear mysterious, we provide below a quick explanation of the main computations underpinning this choice. The main step of the proof of upper bounds on , is the following recursive inequality (see Proposition 2 in Section 7)

 W2(νk+1,π)≤(1−mhk+1)W2(νk,π)+1.65M√ph3/2k+1. (19)

Using the notation , this inequality can be rewritten as

 Bk+1≤(1−mhk+1)Bk+2(mhk+1/3)3/2.

Minimizing the right hand side with respect to , we find that the minimum is attained at the stationary point

 hk+1=3mB2k. (20)

With this , one checks that the sequence satisfies the recursive inequality

 B2k+1≤B2k(1−B2k)2≤B2k1+B2k.

The function being increasing in , we get

 B2k+1≤B2k1+B2k≤B2k−11+B2k−11+B2k−11+B2k−1=B2k−11+2B2k−1.

By repetitive application of the same argument, we get

 B2k+1≤B2K11+(k+1−K1)B2K1.

The integer was chosen so that , see (62). Inserting this upper bound in the right hand side of the last display, we get

 B2k+1≤2m3(M+m)+2m(k+1−K1).

Finally, replacing in (20) by its upper bound derived from the last display, we get the suggested value for .

### 2.3 Extension to mixtures of strongly log-concave densities

We describe here a simple setting in which a suitable version of the LMC algorithm yields efficient sampling algorithm for a target function which is not log-concave. Indeed, let us assume that

 π(θ)=∫Hπ1(θ|η)π0(dη), (21)

where is an arbitrary measurable space,

is a probability distribution on

and is a Markov kernel on . This means that defines a probability measure on of which is the first marginal.

###### Theorem 3.

Assume that so that for every , satisfies assumption (1). Define the mixture LMC (MLMC) algorithm as follows: sample and choose an initial value , then compute

 ϑMLMCk+1=ϑMLMCk−hk+1∇fη(ϑMLMCk)+√2hk+1ξk+1;k=0,1,2,… (22)

where is defined by (16) and is a sequence of mutually independent, and independent of , centered Gaussian vectors with covariance matrices equal to identity. It holds that, for every positive integer (see eq. 17 for the definition of ),

 W2(νk,π)≤3.5M√pm√M+m+(\nicefrac23)m(k−K1). (23)

This is result extends the applicability of Langevin based techniques to a wider framework than the one of strongly log-concave distributions. The proof is just a straightforward consequence of Theorem 2. Therefore, it is omitted.

## 3 Guarantees for the inaccurate gradient version

In some situations, the precise evaluation of the gradient is computationally expensive or practically impossible, but it is possible to obtain noisy evaluations of at any point. This is the setting considered in the present section. More precisely, we assume that at any point of the LMC algorithm, we can observe the value

 Yk,h=∇f(ϑk,h)+ζk, (24)

where is a sequence of random (noise) vectors. The noisy LMC (nLMC) algorithm is defined as

 ϑk+1,h=ϑk,h−hYk,h+√2hξk+1;k=0,1,2,… (25)

where and are as in (3). The noise is assumed to satisfy the following condition.

Condition N: for some and and for every ,

• (bounded bias) ,

• (bounded variance)

,

• (independence of updates) in (25) is independent of .

We emphasize right away that the random vectors are not assumed to be independent, as opposed to what is done in (Dalalyan, 2017a). The next theorem extends the guarantees of Theorem 1 to the inaccurate-gradient setting and to the nLMC algorithm.

###### Theorem 4.

Let be the -th iterate of the nLMC algorithm (25) and be its distribution. If the function satisfies condition (1) and then

 W2(νK,π)≤(1 −mh)KW2(ν0,π)+1.65(M/m)(hp)1/2 (26) +δ√pm+σ2(hp)1/21.65M+σ√m . (27)

To the best of our knowledge, the first result providing guarantees for sampling from a distribution in the scenario when precise evaluations of the log-density or its gradient are not available has been established in (Dalalyan, 2017a). Prior to that work, some asymptotic results has been established in (Alquier et al., 2016). The closely related problem of computing an average value with respect to a distribution, when the gradient of its log-density is known up to an additive noise, has been studied by Teh et al. (2016); Vollmer and Zygalakis (2015); Nagapetyan et al. (2017). Note that these settings are of the same flavor as those of stochastic approximation, an active area of research in optimization and machine learning.

As compared to the analogous result in (Dalalyan, 2017a), Theorem 4 above has several advantages. First, it extends the applicability of the result to the case of a biased noise. In other words, it allows for with nonzero means. Second, it considerably relaxes the independence assumption on the sequence , by replacing it by the independence of the updates. Third, and perhaps the most important advantage of Theorem 4 is the improved dependence of the upper bound on . Indeed, while the last term in the upper bound in Theorem 4 is , when , the corresponding term in (Dalalyan, 2017a, Th. 3) is only .

To understand the potential scope of applicability of Theorem 4, let us consider a generic example in which is the average of

functions defined through independent random variables

:

 f(θ)=1nn∑i=1ℓ(θ,Xi).

When the gradient of with respect to parameter is hard to compute, one can replace the evaluation of at each step by that of , where

is a random variable uniformly distributed in

and independent of . Under suitable assumptions, this random vector satisfies the conditions of Theorem 4 with and constant . Therefore, if we analyze the upper bound provided by (26), we see that the last term, due to the subsampling, is of the same order of magnitude as the second term. Thus, using the subsampled gradient in the LMC algorithm does not cause a significant deterioration of the precision while reducing considerably the computational burden.

Note that Theorem 4 allows to handle situations in which the approximations of the gradient are biased. This bias is controlled by the parameter . Such a bias can appear when using deterministic approximations of integrals or differentials. For instance, in statistical models with latent variables, the gradient of the log-likelihood has often an integral form. Such integrals can be approximated using quadrature rules, yielding a bias term, or Monte Carlo methods, yielding a variance term.

In the preliminary version (Dalalyan, 2017a) of this work, we made a mistake by claiming that the stochastic gradient version of the LMC, introduced in (Welling and Teh, 2011) and often referred to as Stochastic Gradient Langevin Dynamics (SGLD), has an error of the same order as the non-stochastic version of it. This claim is wrong, since when with a strongly convex function and iid variables , we have and proportional to . Therefore, choosing as a noisy version of the gradient (where is a uniformly over distributed random variable independent of ), we get but proportional to . Therefore, the last term in (26) is of order and dominates the other terms. Furthermore, replacing by with iid variables does not help, since then is of order and the last term in (26) is of order , which is still larger than the term . This discussion shows that Theorem 4 does not provide any interesting result when applied to SGLD. For a more in-depth analysis of the SGLD, we refer the reader to (Nagapetyan et al., 2017; Raginsky et al., 2017; Xu et al., 2017).

It is also worth mentioning here that another example of approximate gradient—based on a quadratic approximation of the log-likelihood of the generalized linear model—has been considered in (Huggins and Zou, 2017, Section 5). It corresponds, in terms of condition N, to a situation in which the variance vanishes but the bias is non-zero.

An important ingredient of the proof of Theorem 4 is the following simple result, which can be useful in other contexts as well (for a proof, see Lemma 7 in Section 7.6 below).

###### Lemma 1.

Let , and be given non-negative numbers such that . Assume that the sequence of non-negative numbers satisfies the recursive inequality

 x2k+1 ≤[(1−A)xk+C]2+B2 (28)

for every integer . Then, for all integers ,

 xk ≤(1−A)kx0+CA+B2C+√AB. (29)

Thanks to this lemma, the upper bound on the Wasserstein distance provided by (26) is sharper than the one proposed in (Dalalyan, 2017a).

## 4 Guarantees under additional smoothness

When the function has Lipschitz continuous Hessian, one can get improved rates of convergence. This has been noted by (Dalalyan, 2017b), who proposed to use a modified version of the LMC algorithm, the LMC with Ozaki discretization, in order to take advantage of the smoothness of the Hessian. On the other hand, it has been proved in (Alfonsi et al., 2014, 2015) that the boundedness of the third order derivative of (equivalent to the boundedness of the second-order derivative of the drift of the Langevin diffusion) implies that the Wasserstein distance between the marginals of the Langevin diffusion and its Euler discretization are of order . Note however, that in (Alfonsi et al., 2015) there is no evaluation of the impact of the dimension on the quality of the Euler approximation. This evaluation has been done by Durmus and Moulines (2016) who showed that the Wasserstein error of the Euler approximation is of order . This raises the following important question: is it possible to get advantage of the Lipschitz continuity of the Hessian of in order to improve the guarantees on the quality of sampling by the standard LMC algorithm. The answer of this question is affirmative and is stated in the next theorem.

In what follows, for any matrix , we denote by and , respectively, the spectral norm and the Frobenius norm of . We write or to indicate that the matrix is positive semi-definite.

Condition F: the function is twice differentiable and for some positive numbers , and ,

• (strong convexity) , for every ,

• (bounded second derivative) , for every ,

• (further smoothness) , for every .

###### Theorem 5.

Let be the -th iterate of the nLMC algorithm (25) and be its distribution. Assume that conditions F and N are satisfied. Then, for every , we have

 W2(νK,π)≤(1 −mh)KW2(ν0,π)+M2hp2m+11Mh√Mp5m (30) +δ√pm+2σ2√hpM2√hp+2σ√m. (31)

In the last inequality, is an upper bound for .

When applying the nLMC algorithm to sample from a target density, the user may usually specify four parameters: the step-size , the number of iterations , the tolerated precision of the deterministic approximation and the precision of the stochastic approximation. An attractive feature of Theorem 5 is that the contributions of these four parameters are well separated, especially if we upper bound the last term by . As a consequence, in order to have an error of order in Wasserstein distance, we might choose: at most of order , at most of order , of order and of order . Akin to Theorem 2, one can use variable step-sizes to avoid the logarithmic factor; we leave these computations to the reader.

Note that if we instantiate Theorem 5 to the case of accurate gradient evaluations, that is when , we recover the constant step-size version of (Durmus and Moulines, 2016, Theorem 8), with optimized constants.

Under the assumption of Lipschitz continuity of the Hessian of , one may wonder whether second-order methods that make use of the Hessian in addition to the gradient are able to outperform the standard LMC algorithm. The most relevant candidate algorithms for this are the LMC with Ozaki discretization (LMCO) and a variant of it, LMCO’, a slightly modified version of an algorithm introduced in (Dalalyan, 2017b). The LMCO is a recursive algorithm the update rule of which is defined as follows: For every , we set , which is an invertible matrix since is strongly convex, and define

 Mk=(Ip−e−hHk)H−1k,Σk=(Ip−e−2hHk)H−1k, (32) ϑLMCOk+1,h=ϑLMCOk,h−Mk∇f(ϑLMCOk,h)+Σ1/2kξk+1, (33)

where is a sequence of independent random vectors distributed according to the distribution. The LMCO’ algorithm is based on approximating the matrix exponentials by linear functions, more precisely, for ,

 ϑLMCO′k+1,h= ϑLMCO′k,h−h(Ip−12hH′k)∇f(ϑLMCO′k,h) (34) +√2h(Ip−hH′k+13h2(H′k)2)1/2ξk+1. (35)

Let us mention right away that the stochastic perturbation present in the last display can be computed in practice without taking the matrix square-root. Indeed, it suffices to generate two independent standard Gaussian vectors and ; then the random vector

 (Ip−(\nicefrac12)hH′k)ηk+1+(\nicefrac√36)hH′kη′k+1 (36)

has exactly the same distribution as .

In the rest of this section, we provide guarantees for methods LMCO and LMCO’. Note that we consider only the case where the gradient and the Hessian of are computed exactly, that is without any approximation.

###### Theorem 6.

Let and be, respectively, the distributions of the -th iterate of the LMCO algorithm (33) and the LMCO’ algorithm (35) with an initial distribution . Assume that conditions F and N are satisfied. Then, for every ,

 W2(νLMCOK,π)≤(1−0.25mh)kW2(ν0,π)+11.5M2h(p+1)m. (37)

 W2(νLMCO′K,π) ≤(1−0.25mh)kW2(ν0,π)+1.3M2h2√Mpm+7.3M2h(p+1)m. (38)

A very rough consequence of this theorem is that one has similar theoretical guarantees for the LMCO and the LMCO’ algorithms, since in most situations the middle term in the right hand side of (38) is smaller than the last term. On the other hand, the per-iteration cost of the modified algorithm LMCO’ is significantly smaller than the per-iteration cost of the original LMCO. Indeed, for the LMCO’ there is no need to compute matrix exponentials neither to invert matrices, one only needs to perform matrix-vector multiplication for

matrices. Note that for many matrices such a multiplication operation might be very cheap using the fast Fourier transform or other similar techniques. In addition, the computational complexity of the Hessian-vector product is provably of the same order as that of evaluating the gradient, see

(Griewank, 1993). Therefore, one iteration of the LMCO’ algorithm is not more costly than one iteration of the LMC. At the same time, the error bound (38) for the LMCO’ is smaller than the one for the LMC provided by Theorem 5. Indeed, the term present in the bound of Theorem 5 is generally of larger order than the term appearing in (38).

## 5 Relation with optimization

We have already mentioned that the LMC algorithm is very close to the gradient descent algorithm for computing the minimum of the function . However, when we compare the guarantees of Theorem 1 with those available for the optimization problem, we remark the following striking difference. The approximate computation of requires a number of steps of the order of to reach the precision , whereas, for reaching the same precision in sampling from , the LMC algorithm needs a number of iterations proportional to . The goal of this section is to explain that this, at first sight very disappointing behavior of the LMC algorithm is, in fact, continuously connected to the exponential convergence of the gradient descent.

The main ingredient for the explanation is that the function and the function have the same point of minimum , whatever the real number . In addition, if we define the density function , then the average value

 ¯θτ=∫Rpθπτ(θ)dθ

tends to the minimum point when goes to zero. Furthermore, the distribution tends to the Dirac measure at . Clearly, satisfies (1) with the constants and . Therefore, on the one hand, we can apply to claim (a) of Theorem 1, which tells us that if we choose , then

 W2(νK,πτ)≤(1−mM)KW2(δθ0,πτ)+1.65(Mm)(pτM)1/2. (39)

On the other hand, the LMC algorithm with the step-size applied to reads as

 ϑk+1,h=ϑk,h−1M∇f(ϑk,h)+√2τMξk+1;k=0,1,2,… (40)

When the parameter goes to zero, the LMC sequence (40) tends to the gradient descent sequence . Therefore, the limiting case of (39) corresponding to writes as

 ∥θ(K)−θ∗∥2≤(1−mM)K∥θ0−θ∗∥2, (41)

which is a well-known result in Optimization. This clearly shows that Theorem 1 is a natural extension of the results of convergence from optimization to sampling.

Such an analogy holds true for the Newton method as well. Its counterpart in sampling is the LMCO algorithm. Indeed, one easily checks that if is replaced by with going to zero, then, for any fixed step-size , the matrix in (33) tends to zero. This implies that the stochastic perturbation vanishes. On the other hand, the term tends to , as . Thus, the updates of the Newton algorithm can be seen as the limit case, when goes to zero, of the updates of the LMCO.

However, if we replace by in the upper bounds stated in Theorem 6 and we let go to zero, we do not retrieve the well-known guarantees for the Newtons method. The main reason is that Theorem 6 describes the behavior of the LMCO algorithm in the regime of small step-sizes , whereas Newton’s method corresponds to (a limit case of) the LMCO with a fixed . Using arguments similar to those employed in the proof of Theorem 6, one can establish the following result, the proof of which is postponed to Section 7.

###### Proposition 1.

Let be the distributions of the -th iterate of the LMCO algorithm (33) with an initial distribution . Assume that conditions F and N are satisfied. Then, for every and ,

 W2(νLMCOK,π) ≤2mM2(wKexp(vKw−2KK))2K (42)

with

 wK =M2W2K+1(ν0,π)2m+12e−mh, and vK=2M2M3/2√2p+2Km3+e−mh. (43)

If we replace in the right hand side of (42) the quantities , and , respectively, by , and , and we let go to zero, then it is clear that the term vanishes. On the other hand, if is the Dirac mass at some point , then converges to . Therefore, for Newton’s algorithm as a limiting case of (42) we get

 ∥θNewtonK−θ∗∥2≤2mM2(M2∥θ0−θ∗∥22m)2K. (44)

The latter provides the so called quadratic rate of convergence, which is a well-known result that can be found in many textbooks; see, for instance, (Chong and Zak, 2013, Theorem 9.1).

A particularly promising remark made in Section 2.3 is that all the results established for the problem of approximate sampling from a log-concave distribution can be carried over the distributions that can be written as a mixture of (strongly) log-concave distributions. The only required condition is to be able to sample from the mixing distribution. This provides a well identified class of (posterior) distributions for which the problem of finding the mode is difficult (because of nonconvexity) whereas the sampling problem can be solved efficiently.

There are certainly other interesting connections to uncover between sampling and optimization. One can think of lower bounds for sampling or finding a sampling counterpart of Nesterov acceleration. Some recent advances on the gradient flow (Wibisono et al., 2016) might be useful for achieving these goals.

## 6 Conclusion

We have presented easy-to-use finite-sample guarantees for sampling from a strongly log-concave density using the Langevin Monte-Carlo algorithm with a fixed step-size and extended it to the case where the gradient of the log-density can be evaluated up to some error term. Our results cover both deterministic and random error terms. We have also demonstrated that if the log-density has a Lipschitz continuous second-order derivative, then one can choose a larger step-size and obtain improved convergence rate.

We have also uncovered some analogies between sampling and optimization. The underlying principle is that an optimization algorithm may be seen as a limit case of a sampling algorithm. Therefore, the results characterizing the convergence of the optimization schemes should have their counterparts for sampling strategies. We have described these analogues for the steepest gradient descent and for the Newton algorithm. However, while in the optimization the relevant characteristics of the problem are the dimension , the desired accuracy and the condition number , the problem sampling involves an additional characteristic which is the scale given by the strong-convexity constant . Indeed, if we increase by keeping the condition number constant, the number of iterations for the LMC to reach the precision will decrease. In this respect, we have shown that the LMC with Ozaki discretization, termed LMCO, has a better dependence on the overall scale of than the original LMC algorithm. However, the weakness of the LMCO is the high computational cost of each iteration. Therefore, we have proposed a new algorithm, LMCO’, that improves the LMC in terms of its dependence on the scale and each iteration of LMCO’ is computationally much cheaper than each iteration of the LMCO.

Another interesting finding is that, in the case of accurate gradient evaluations (i.e., when there is no error in the gradient computation), a suitably chosen variable step-size leads to logarithmic improvement in the convergence rate of the LMC algorithm.

Interesting directions for future research are establishing lower bounds in the spirit of those existing in optimization, obtaining user-friendly guarantees for computing the posterior mean or for sampling from a non-smooth density. Some of these problems have already been tackled in several papers mentioned in previous sections, but we believe that the techniques developed in the present work might be helpful for revisiting and deepening the existing results.

## 7 Proofs

The basis of the proofs of all the theorems stated in previous sections is a recursive inequality that upper bounds the error at the step , , by an expression involving the error of the previous step, . We will also make repeated use of the Minkowski inequality and its integral version

 {E[(∫baXtdt)p]}1/p≤∫ba{E[|Xt|p]}1/pdt,∀p≥N∗, (45)

where is a random process almost all paths of which are integrable over the interval . Furthermore, for any random vector , we define the norm .

The next result is the central ingredient of the proofs of Theorems 4, 2 and 1. Readers interested only in the proof of Theorems 2 and 1, are invited—in the next proof—to consider the random vectors as equal to and as equal to . This implies, in particular, that .

###### Proposition 2.

Let us introduce (since , this value satisfies ). If satisfies (1) and , then

 W2(νk+1,π)2 ≤{ϱk+1W2(νk,π)+αM(h3k+1p)1/2+hk+1δ√p}2+σ2h2k+1p, (46)

with .

###### Proof.

To simplify notation, and since there is no risk of confusion, we will write instead of . Let be a random vector drawn from such that