In this paper, the problem of sampling from a target distribution
is investigated, where and the function satisfies Lipschitz continuity and a certain dissipativity condition. We establish non-asymptotic convergence rates for the Stochastic Gradient Langevin Dynamics (SGLD) algorithm, based on the stochastic differential equation
where is the standard Brownian motion in and is the inverse temperature parameter.
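For orientation, a Gibbs-type target and the associated overdamped Langevin SDE are commonly written as in the sketch below. The symbols $U$ (potential), $h = \nabla U$, $\theta$, $\beta$, $B$ and $d$ are assumed notation chosen for illustration, and the display is the standard textbook form rather than a verbatim restatement of (1).

```latex
\[
  \pi_\beta(A) \;=\; \frac{\int_A e^{-\beta U(\theta)}\,\mathrm{d}\theta}
                         {\int_{\mathbb{R}^d} e^{-\beta U(\theta)}\,\mathrm{d}\theta},
  \qquad A \in \mathcal{B}(\mathbb{R}^d),
\]
\[
  \mathrm{d}\theta_t \;=\; -h(\theta_t)\,\mathrm{d}t + \sqrt{2\beta^{-1}}\,\mathrm{d}B_t,
  \qquad h := \nabla U .
\]
```

Under suitable regularity and growth conditions on $U$, the measure $\pi_\beta$ is the invariant law of this diffusion, which is what makes discretisations of the SDE usable as samplers.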
Non-asymptotic convergence rates of Langevin dynamics based algorithms for approximate sampling of log-concave distributions have been intensively studied in recent years, starting with . This was followed by , , ,  amongst others.
Relaxing log-concavity is a more challenging problem. In , the log-concavity assumption is replaced by a log-concavity at infinity condition and and -Wasserstein distances convergence rates are obtained. In a similar setting,  analyzes sampling errors in the -Wasserstein distance for both overdamped and underdamped Langevin MCMC. In , only a dissipativity condition is assumed and convergence rates are obtained in the -Wasserstein distance. Moreover, a clear and strong link between sampling via SGLD algorithms and non-convex optimization is highlighted. One can further consult ,  and references therein.
In the present paper, we impose the dissipativity condition as in . Using a different Wasserstein-type metric, we obtain sharper estimates and allow for possibly dependent data sequences. The key new idea is that we compare the SGLD algorithm to suitable auxiliary continuous-time processes inspired by (1) and we rely on contraction results developed in  for (1).
2 Main results
Let be a probability space. We denote by the expectation of a random variable. For , is used to denote the usual space of -integrable real-valued random variables. Fix an integer . For an -valued random variable , its law on (the Borel sigma-algebra of ) is denoted by . Scalar product is denoted by , with standing for the corresponding norm (where the dimension of the space may vary depending on the context). We fix a discrete-time filtration , where is an i.i.d. sequence with values in some Polish space. This represents the flow of past information. The notation is self-explanatory. We also define the decreasing sequence of sigma-algebras , , representing future information at the respective time instants.
Fix an -valued random variable , representing the initial value of the procedure we consider. For each , define the -valued random process , by recursion:
where is a measurable function, , is an -valued, -adapted process and , is an independent sequence of standard -dimensional Gaussian random variables.
We interpret , as a stream of data and , as an artificially generated noise sequence. We assume throughout the paper that , and are independent.
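The recursion described above can be sketched in a few lines. This is a minimal illustration that assumes the usual SGLD update (current iterate, minus step size times the stochastic gradient, plus Gaussian noise scaled by the square root of twice the step size over the inverse temperature). The names `sgld_path`, `ar1_stream` and `toy_gradient`, the AR(1) data model and the quadratic toy gradient are all hypothetical and are not taken from the paper.

```python
import math
import random


def sgld_path(grad_estimate, theta0, data_stream, step_size, beta, noise_rng):
    """Iterate theta <- theta - step_size * H(theta, x) + sqrt(2*step_size/beta) * xi.

    grad_estimate(theta, x) plays the role of the measurable function H,
    data_stream supplies the data, and noise_rng generates the artificial
    Gaussian noise. Returns the list of all iterates, starting with theta0.
    """
    theta = list(theta0)
    path = [list(theta)]
    noise_scale = math.sqrt(2.0 * step_size / beta)
    for x in data_stream:
        g = grad_estimate(theta, x)
        theta = [t - step_size * gi + noise_scale * noise_rng.gauss(0.0, 1.0)
                 for t, gi in zip(theta, g)]
        path.append(list(theta))
    return path


def ar1_stream(n, rho, data_rng):
    """Hypothetical dependent (non-i.i.d.) data: X_{k+1} = rho * X_k + innovation."""
    x = 0.0
    for _ in range(n):
        x = rho * x + data_rng.gauss(0.0, 1.0)
        yield x


def toy_gradient(theta, x):
    # Hypothetical stochastic gradient H(theta, x) = theta - x (same x in each coordinate).
    return [t - x for t in theta]


# Separate generators keep the data stream and the injected noise independent.
data_rng = random.Random(0)
noise_rng = random.Random(1)
path = sgld_path(toy_gradient, [5.0, -5.0], ar1_stream(1000, 0.5, data_rng),
                 step_size=0.01, beta=1.0, noise_rng=noise_rng)
```

Using an autoregressive data stream here is only meant to show that nothing in the update itself requires i.i.d. data; whether a given stream satisfies the mixing assumption of the paper has to be checked separately.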
Let be continuously differentiable with derivative . Let us define the probability
We now present our assumptions. First, the moments of the initial condition need to be controlled.
Next, we require joint Lipschitz-continuity of .
There is and such that for all and ,
The data sequence , need not be i.i.d.; we require only a mixing property, defined in Section 3.1 below.
The process , is conditionally L-mixing with respect to . It satisfies
Stationarity of the process , would also be natural to assume but we need only the weaker property (4).
Finally, we present a dissipativity condition on .
There exist , such that, for all and ,
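To make a condition of this kind concrete, the sketch below numerically checks a dissipativity inequality of the common form "inner product of gradient and argument is at least a times the squared norm minus b" for a one-dimensional non-convex example. The form of the inequality, the potential U(theta) = theta^2/2 + 2 cos(theta), and the constants a = 1/2, b = 2 are assumptions chosen for illustration and do not come from the paper.

```python
import math


def grad_U(theta):
    # Gradient of the hypothetical potential U(theta) = theta**2 / 2 + 2 * cos(theta).
    # Its derivative 1 - 2*cos(theta) is bounded, so grad_U is Lipschitz.
    return theta - 2.0 * math.sin(theta)


def second_derivative_U(theta):
    return 1.0 - 2.0 * math.cos(theta)


def is_dissipative_on_grid(grad, a, b, grid):
    """Check grad(t) * t >= a * t**2 - b at every grid point (a numerical check, not a proof)."""
    return all(grad(t) * t >= a * t * t - b for t in grid)


grid = [k / 100.0 for k in range(-5000, 5001)]
dissipative = is_dissipative_on_grid(grad_U, a=0.5, b=2.0, grid=grid)
nonconvex = second_derivative_U(0.0) < 0.0  # negative curvature at the origin
print(dissipative, nonconvex)
```

For this example the inequality can also be verified by hand: theta^2 - 2 theta sin(theta) is at least theta^2 - 2|theta|, and theta^2/2 - 2|theta| + 2 = (|theta| - 2)^2 / 2 is non-negative. The example therefore shows a Lipschitz gradient that is dissipative although the potential is not convex.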
When for all for some (i.e. when is replaced by in (2)) then we arrive at the well-known unadjusted Langevin algorithm whose convergence properties have been amply analyzed, see e.g. [7, 12, 6, 21]. The case of i.i.d. , has also been investigated in great detail, see e.g. [23, 27, 21].
In the present article, better estimates are obtained for the distance between and than those of  and . Such rates have already been obtained in  for strongly convex and in  for that is convex outside a compact set. Here we make no convexity assumptions at all. This comes at the price of using the metric defined in (6) below while [23, 27, 21, 2] use Wasserstein distances with respect to the standard Euclidean metric, see (10) below.
Another novelty of our paper is that, just like in , we allow the data sample , to be dependent. As observed data have no reason to be i.i.d., we believe that such a result is fundamental to assure the robustness of the sampling method based on the stochastic gradient Langevin dynamics (2).
For any integer , let denote the set of probability measures on . For , let denote the set of probability measures on such that its respective marginals are . Define
which is the Wasserstein- distance associated to the bounded metric , .
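A Wasserstein distance built on a bounded (truncated) metric is typically written as in the following sketch. The notation $\mathcal{C}(\mu,\nu)$ for the set of couplings, the truncation level $1$, and the symbols $\theta,\theta'$ are assumptions made for illustration; the precise form used in (6) may differ.

```latex
\[
  W_1(\mu,\nu) \;:=\; \inf_{\zeta \in \mathcal{C}(\mu,\nu)}
  \int_{\mathbb{R}^d}\int_{\mathbb{R}^d}
  \bigl[\, 1 \wedge |\theta-\theta'| \,\bigr]\,\zeta(\mathrm{d}\theta,\mathrm{d}\theta') .
\]
```

Since the integrand is bounded by $1$, a distance of this form is finite for every pair of probability measures, without any moment assumption.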
In this work, the constants appearing are often denoted by for some natural number . Without further mention, these constants depend on , , , , , , , and on the process , through the quantities (13) below and, unless otherwise stated, they do not depend on anything else. In case of further dependencies (e.g. dependence on , which is due to the drift condition, coming from Lemma 3.6 below), we signal these in parentheses, e.g. .
Our main contribution is summarized in the following result. Define
Example 3.4 of  suggests that the best rate we can hope to get in (8) is , even in the convex case. The above theorem achieves this rate. We remark that, although the statement of Theorem 2.7 concerns the discrete-time recursive scheme (2), its proof is carried out entirely in a continuous-time setting, in Section 3. It relies on techniques from  and . The principal new idea is the introduction of the auxiliary process , , see (25) below.
Consider now a strengthening of Assumption 2.5 by imposing convexity outside a compact set.
There exist such that, for each , satisfying ,
Then, we can recover analogous results to Theorem 2.7 by considering the -Wasserstein distance. At this point, let us recall the definition of the familiar, “usual” Wasserstein- (also known as -Wasserstein) distance, for :
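The standard Wasserstein distance of order $p$ with respect to the Euclidean metric is usually defined as follows. The symbols $p$, $\mu$, $\nu$, $\theta$, $\theta'$ and the coupling set $\mathcal{C}(\mu,\nu)$ are assumed notation, and the display is the textbook definition rather than a verbatim copy of (10).

```latex
\[
  \mathcal{W}_p(\mu,\nu) \;:=\;
  \Bigl( \inf_{\zeta \in \mathcal{C}(\mu,\nu)}
  \int_{\mathbb{R}^d}\int_{\mathbb{R}^d}
  |\theta-\theta'|^{p}\,\zeta(\mathrm{d}\theta,\mathrm{d}\theta') \Bigr)^{1/p},
  \qquad p \ge 1 .
\]
```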
Strengthening the monotonicity condition (9) even guarantees convergence in .
There exists such that, for each , ,
2.1 Related work and our contributions
In the remarkable paper , a non-convex optimization problem is considered in the context of empirical risk minimization, which plays a central role in ML algorithms. The excess risk is decomposed into a sampling error resulting from the application of SGLD, a generalization error and a suboptimality error. Our aim is to improve the sampling error in the non-convex setting and provide sharper convergence estimates under more relaxed conditions. To this end, we focus on the comparison of our results with Proposition 3.3 of .
Condition of  is (much) stronger than Assumption 2.1 above. Assumption 2.5 is identical to in . Condition in  corresponds to Lipschitz-continuity of in its first variable with a Lipschitz-constant independent of its second variable and there means that , are bounded where and . Hence Assumption 2.2 here is neither stronger nor weaker than of ; the two conditions are incomparable. In any case, Assumption 2.2 does not seem to be restrictive for practical purposes. Condition in  is implied by Assumptions 2.2 and 2.3.
We obtain stronger rates (which we believe to be optimal) than those of . More precisely, we obtain a rate in (8) for the distance while  only obtains (which depends on ) but in the distance. Furthermore, we allow a possibly dependent data sequence. In other words,  is applicable only if is i.i.d. while Assumption 2.3 suffices for the derivation of our results.
Now let us turn to . The comparison is made only in the presence of convexity (outside a compact set) for as it is a requirement for the results in . Their Assumption 1.1 is precisely Assumptions 2.2 and 2.8 combined; however, this is stipulated for in  while we need it for , for all , as we allow dependent data streams. Furthermore, Assumption 1.3 in  requires that the variance of is controlled by a power of the step size while we do not need such an assumption. The second conclusion of their Theorem 1.4 (with , using their notation ) is the same as our Theorem 2.9.
In the particular case where , are i.i.d., one can replace Theorem 3.2 below by Doob’s inequality in the arguments for proving Theorem 2.7. The full power of Assumption 2.2 is used only in Lemma 3.14. When are i.i.d. then Lemma 3.14 is trivial and it is enough to assume only (A.2) of  instead of Assumption 2.2.
3.1 Conditional L-mixing
L-mixing processes and random fields were introduced in . In , the closely related concept of conditional L-mixing was created. We define this concept below and recall some related results. This section is an almost exact replica of Section 2 in .
We assume that the probability space is equipped with a discrete-time filtration , as well as with a decreasing sequence of sigma-fields , such that is independent of , for all .
Fix an integer and let be a set of parameters. A measurable function is called a random field. We drop the dependence on in the notation henceforth and write . A random process corresponds to a random field where is a singleton. A random field is -bounded for some if
Now we define conditional L-mixing. Recall that, for any family , of real-valued random variables, denotes a random variable that is an almost sure upper bound for each and it is, almost surely, smaller than or equal to any other such bound.
Let be -bounded for each . Define, for each , and for ,
When necessary, , and are used to emphasize dependence of these quantities on the domain which may vary.
Definition 3.1 (Conditional L-mixing).
We call uniformly conditionally L-mixing (UCLM) with respect to if is adapted to for all ; for all , it is -bounded; and the sequences , are also -bounded for all . In the case of stochastic processes (when is a singleton) the terminology “conditionally L-mixing process” is used.
Conditional L-mixing covers a broad class of processes (linear processes, functionals of Markov processes, etc.), see Example 2.1 in . The following maximal inequality is pivotal for our arguments.
Assume that for some Polish space-valued independent random variables , , . Fix and . Let , be a conditionally L-mixing process w.r.t. , satisfying a.s. for all . Let and let , be deterministic numbers. Then we have
almost surely, where is a deterministic constant depending only on but independent of .
See Theorem 2.6 of  (there, , , are assumed to be i.i.d.; the proof, though, trivially works for a merely independent sequence, too). ∎
Let , be conditionally L-mixing. Let Assumption 2.2 hold true. Then, for each , the random field , , , the closed ball of radius centered at , is uniformly conditionally L-mixing with
See Lemma 6.4 and Example 2.4 of . ∎
Let be bounded. Fix and let be a sequence of measurable random variables. Let be conditionally L-mixing and Lipschitz in . Define the process . Then
The proof is identical to that of Lemma 6.3 of , noting the Lipschitz continuity. ∎
3.2 Further notation and introduction of auxiliary processes
Assumption 2.5 implies
Also, Assumption 2.2 implies
with the constant defined in (3). We will employ a family of Lyapunov-functions in the sequel. For this purpose, let us define, for each , , for any real , and similarly
Notice that these functions are twice continuously differentiable and
Let denote the set of satisfying
For and for a non-negative measurable , the notation
is used. The following functional is pivotal in our arguments as it is used to measure the distance between probability measures. We define, for any and ,
Though is not a metric, it satisfies trivially
In the sequel we will need the case , that is, .
Our estimations are carried out in a continuous-time setting, so we define and discuss a number of auxiliary continuous-time processes below. First, consider , defined by the stochastic differential equation (SDE)
where is standard Brownian motion on , independent of . Its natural filtration is denoted by , henceforth. The meaning of is clear. Equation (23) has a unique solution on adapted to since is Lipschitz-continuous by (17). We proceed by defining, for each ,
Notice that , is also a Brownian motion and
Define , , the natural filtration of , .
Let us also introduce, for each and for each , the process , satisfying
with initial condition . Due to Assumption 2.2, there is a unique solution to (25) which is adapted to . Moreover, for any given and , consider the following auxiliary process, which plays an important role in the derivation of our results,
with initial condition , and notice that .
3.3 Layout of the proof
In view of (28), the main objective is to bound , which is decomposed as follows
The last term is controlled using the drift condition (29) below, due to the dissipativity Assumption 2.5, and Lipschitzness of the mean field , see (17). The second term is controlled uniformly in by a quantity which is proportional to . For that purpose, we use novel results by , which give us a contraction in , see Proposition 3.12 and, in particular, (49). To obtain this result, the mixing condition also plays a crucial role, see Lemma 3.19. Finally, the first term is controlled uniformly in by a quantity which is also proportional to , see Corollary 3.25. This is based on Kullback-Leibler distance estimates which go back to .
3.4 Crucial estimates
The next lemma shows that the SDEs (25) and (23) satisfy standard drift conditions involving the functions . Note that, on the left-hand side of (29) below, the infinitesimal generator of the diffusion process appears which is applied to the function .
Let Assumption 2.5 hold. For each ,
and, for all ,
where , with
By direct calculation, the left-hand side of (29) equals
Then, for , one observes that
As for , one obtains
Taking the two cases into consideration, we have, for all ,
The statement (30) follows in an identical way, noting that the constants which appear do not depend on . ∎
We note that for , hence . For any fixed sequence and , by Itô’s formula, one obtains almost surely,
where the expectation of the stochastic integral disappears since by . Differentiating both sides and using Lemma 3.6, one obtains