On stochastic gradient Langevin dynamics with dependent data streams: the fully non-convex case

We consider the problem of sampling from a target distribution which is not necessarily logconcave. Non-asymptotic estimates are established, in a suitable Wasserstein-type distance, for the Stochastic Gradient Langevin Dynamics (SGLD) algorithm, even when the gradient is driven by dependent data streams. Our estimates are sharper than those in previous studies and, in contrast to them, uniform in the number of iterations.


1 Introduction

In this paper, the problem of sampling from a target distribution

 $$\pi_\beta(\theta) \propto \exp(-\beta U(\theta))\,d\theta$$

is investigated, where and the function satisfies Lipschitz continuity and a certain dissipativity condition. We establish non-asymptotic convergence rates for the Stochastic Gradient Langevin Dynamics (SGLD) algorithm, based on the stochastic differential equation

 $$dx_t = -\beta\nabla U(x_t)\,dt + \sqrt{2\beta^{-1}}\,dB_t \tag{1}$$

where $(B_t)_{t\ge 0}$ is the standard Brownian motion in $\mathbb{R}^d$ and $\beta>0$ is the inverse temperature parameter.

Non-asymptotic convergence rates of Langevin dynamics based algorithms for approximate sampling of log-concave distributions have been intensively studied in recent years, starting with [7]. This was followed by [9], [13], [12], [5] amongst others.

Relaxing log-concavity is a more challenging problem. In [21], the log-concavity assumption is replaced by a logconcavity at infinity condition and and -Wasserstein distances convergence rates are obtained. In a similar setting, [6] analyzes sampling errors in the -Wasserstein distance for both overdamped and underdamped Langevin MCMC. In [23], only a dissipativity condition is assumed and convergence rates are obtained in the -Wasserstein distance. Moreover, a clear and strong link between sampling via SGLD algorithms and non-convex optimization is highlighted. One can further consult [27], [8] and references therein.

In the present paper, we impose the dissipativity condition as in [23]. Using a different Wasserstein-type metric, we obtain sharper estimates and allow for possibly dependent data sequences. The key new idea is that we compare the SGLD algorithm to suitable auxiliary continuous-time processes inspired by (1), and we rely on contraction results developed in [15] for (1).

2 Main results

Let $(\Omega,\mathcal{F},P)$ be a probability space. We denote by $E[X]$ the expectation of a random variable $X$. For $1\le p<\infty$, $L^p$ is used to denote the usual space of $p$-integrable real-valued random variables. Fix an integer $d\ge 1$. For an $\mathbb{R}^d$-valued random variable $X$, its law on $\mathcal{B}(\mathbb{R}^d)$ (the Borel sigma-algebra of $\mathbb{R}^d$) is denoted by $\mathcal{L}(X)$. Scalar product is denoted by $\langle\cdot,\cdot\rangle$, with $|\cdot|$ standing for the corresponding norm (where the dimension of the space may vary depending on the context).

We fix a discrete-time filtration , where is an i.i.d. sequence with values in some Polish space. This represents the flow of past information. The notation is self-explanatory. We also define the decreasing sequence of sigma-algebras , , representing future information at the respective time instants.

Fix an $\mathbb{R}^d$-valued random variable $\theta_0$, representing the initial value of the procedure we consider. For each $\lambda>0$, define the $\mathbb{R}^d$-valued random process $(\theta^\lambda_n)_{n\in\mathbb{N}}$ by recursion:

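 $$\theta^\lambda_0 := \theta_0, \qquad \theta^\lambda_{n+1} := \theta^\lambda_n - \lambda H(\theta^\lambda_n, X_{n+1}) + \sqrt{\frac{2\lambda}{\beta}}\,\xi_{n+1}, \quad n\in\mathbb{N}, \tag{2}$$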

where is a measurable function, , is an -valued, -adapted process and , is an independent sequence of standard -dimensional Gaussian random variables.

We interpret , as a stream of data and , as an artificially generated noise sequence. We assume throughout the paper that , and are independent.
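As an illustration of recursion (2), the following is a minimal sketch, not the authors' implementation. The function `H`, the AR(1) data stream and all parameter values are hypothetical choices made for this example only; the data dimension is taken equal to the parameter dimension so that `H` can act componentwise. The AR(1) stream is a simple instance of a dependent (non-i.i.d.) data sequence.

```python
import numpy as np


def H(theta, x):
    # Hypothetical stochastic gradient (an assumption for illustration):
    # Lipschitz in (theta, x), dissipative, and not the gradient of a convex potential.
    return theta + 2.0 * np.sin(theta) + np.tanh(x)


def ar1_stream(n_steps, dim, rho, rng):
    # Dependent data stream: X_{n+1} = rho * X_n + standard Gaussian noise.
    x = np.zeros(dim)
    for _ in range(n_steps):
        x = rho * x + rng.standard_normal(dim)
        yield x


def sgld(theta0, lam, beta, n_steps, rho=0.5, seed=0):
    # Iterates theta_{n+1} = theta_n - lam * H(theta_n, X_{n+1}) + sqrt(2 lam / beta) * xi_{n+1}.
    rng = np.random.default_rng(seed)
    theta = np.asarray(theta0, dtype=float).copy()
    noise_scale = np.sqrt(2.0 * lam / beta)
    for x in ar1_stream(n_steps, theta.shape[0], rho, rng):
        xi = rng.standard_normal(theta.shape[0])  # artificially generated noise
        theta = theta - lam * H(theta, x) + noise_scale * xi
    return theta


theta_final = sgld(np.zeros(3), lam=0.01, beta=1.0, n_steps=200)
```

In this sketch the data and the injected noise are drawn from the same generator for brevity; they are still independent of each other because each draw is an independent Gaussian vector.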

Let be continuously differentiable with derivative . Let us define the probability

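 $$\pi_\beta(A) := \frac{\int_A e^{-\beta U(\theta)}\,d\theta}{\int_{\mathbb{R}^d} e^{-\beta U(\theta)}\,d\theta}, \quad A\in\mathcal{B}(\mathbb{R}^d).$$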

It is implicitly assumed that $\int_{\mathbb{R}^d} e^{-\beta U(\theta)}\,d\theta<\infty$, and this is indeed the case under Assumption 2.5 below, as is easily seen. Our objective is to (approximately) sample from the distribution $\pi_\beta$ using the scheme (2).

We now present our assumptions. First, the moments of the initial condition need to be controlled.

Assumption 2.1.
 $$|\theta_0| \in \bigcap_{p\ge 1} L^p.$$

Next, we require joint Lipschitz-continuity of .

Assumption 2.2.

There is and such that for all and ,

 |H(θ,x)−H(θ′,x′)|≤K1|θ−θ′|+K2|x−x′|,

We set

 H∗:=|H(0,0)|. (3)

The data sequence $(X_n)_{n\in\mathbb{N}}$ need not be i.i.d.; we require only a mixing property, defined in Section 3.1 below.

Assumption 2.3.

The process , is conditionally -mixing with respect to . It satisfies

 E[H(θ,Xn)]=h(θ), θ∈Rd, n∈N. (4)
Remark 2.4.

Stationarity of the process $(X_n)_{n\in\mathbb{N}}$ would also be natural to assume, but we need only the weaker property (4).

Finally, we present a dissipativity condition on .

Assumption 2.5.

There exist , such that, for all and ,

 ⟨H(θ,x),θ⟩≥a|θ|2−b. (5)
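To make the dissipativity condition (5) concrete, here is a small numerical sketch in dimension one. The function `H` and the constants `a = 0.5`, `b = 4.5` are hypothetical and chosen only for this example; for this choice, $\langle H(\theta,x),\theta\rangle \ge \theta^2 - 3|\theta|$ and hence $\langle H(\theta,x),\theta\rangle - (a\theta^2 - b) \ge (|\theta|-3)^2/2 \ge 0$ for every $x$, although the drift is not monotone in $\theta$.

```python
import numpy as np


def H(theta, x):
    # Hypothetical scalar example: non-monotone in theta, bounded dependence on x.
    return theta + 2.0 * np.sin(theta) + np.tanh(x)


def dissipativity_gap(theta, x, a=0.5, b=4.5):
    # <H(theta, x), theta> - (a * theta^2 - b); condition (5) holds at (theta, x)
    # exactly when this quantity is nonnegative.
    return H(theta, x) * theta - (a * theta**2 - b)


# Evaluate the gap on a grid of parameter values and data values.
thetas = np.linspace(-50.0, 50.0, 2001)
xs = np.linspace(-10.0, 10.0, 201)
T, X = np.meshgrid(thetas, xs)
min_gap = dissipativity_gap(T, X).min()
```

A grid check of this kind cannot prove (5), but it is a quick way to catch a wrong guess for the constants before attempting the analytic bound.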

When for all for some (i.e. when is replaced by in (2)) then we arrive at the well-known unadjusted Langevin algorithm whose convergence properties have been amply analyzed, see e.g. [7, 12, 6, 21]. The case of i.i.d. , has also been investigated in great detail, see e.g. [23, 27, 21].

In the present article, better estimates are obtained for the distance between and than those of [23] and [27]. Such rates have already been obtained in [2] for strongly convex and in [21] for that is convex outside a compact set. Here we make no convexity assumptions at all. This comes at the price of using the metric defined in (6) below while [23, 27, 21, 2] use Wasserstein distances with respect to the standard Euclidean metric, see (10) below.

Another novelty of our paper is that, just like in [2], we allow the data sample , to be dependent. As observed data have no reason to be i.i.d., we believe that such a result is fundamental to assure the robustness of the sampling method based on the stochastic gradient Langevin dynamics (2).

For any integer , let denote the set of probability measures on . For , let denote the set of probability measures on such that its respective marginals are . Define

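 $$W_1(\mu,\nu) := \inf_{\zeta\in\mathcal{C}(\mu,\nu)} \int_{\mathbb{R}^d}\int_{\mathbb{R}^d} \big[|x-y|\wedge 1\big]\,\zeta(dx,dy), \tag{6}$$

which is the Wasserstein-1 distance associated with the bounded metric $|x-y|\wedge 1$, $x,y\in\mathbb{R}^d$.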
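The distance (6) can be computed exactly in a toy setting. The sketch below assumes two empirical measures with the same number of equally weighted atoms, in which case an optimal coupling can be taken to be a permutation, so a brute-force search over permutations suffices for very small samples. The helper name `truncated_w1` is hypothetical.

```python
from itertools import permutations
import math


def truncated_w1(xs, ys):
    # Exact W_1 for the bounded metric min(|x - y|, 1) between two empirical
    # measures with the same number n of equally weighted atoms.
    # Brute force over all n! matchings, so only usable for tiny n.
    n = len(xs)
    if n != len(ys):
        raise ValueError("both samples must have the same size")
    best = math.inf
    for perm in permutations(range(n)):
        cost = sum(min(math.dist(xs[i], ys[j]), 1.0) for i, j in enumerate(perm)) / n
        best = min(best, cost)
    return best


# One atom is moved far away; the bounded metric caps its contribution at 1.
mu = [(0.0,), (0.2,)]
nu = [(5.0,), (0.2,)]
w = truncated_w1(mu, nu)
```

Because the cost is capped at 1, the result never exceeds 1, which is one reason estimates in this metric are weaker than estimates in the usual Euclidean Wasserstein distance.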

Remark 2.6.

In this work, the constants appearing are often denoted by for some natural number . Without further mention, these constants depend on , , , , , , , and on the process , through the quantities (13) below and, unless otherwise stated, they do not depend on anything else. In case of further dependencies (e.g. dependence on , which is due to the drift condition, coming from Lemma 3.6 below), we signal these in parentheses, e.g. .

Our main contribution is summarized in the following result. Define

 $$\lambda_{\max} = \min\{a/(2K_1^2),\, 1/a\}. \tag{7}$$
Theorem 2.7.

Let Assumptions 2.1, 2.2, 2.3 and 2.5 be valid. Then there are finite constants , , such that, for every .

 $$W_1(\mathcal{L}(\theta^\lambda_n), \pi) \le C_1 e^{-C_0\lambda n} + C_2\sqrt{\lambda}, \quad n\in\mathbb{N}. \tag{8}$$

Example 3.4 of [2] suggests that the best rate we can hope to get in (8) is , even in the convex case. The above theorem achieves this rate. We remark that, although the statement of Theorem 2.7 concerns the discrete-time recursive scheme (2), its proof is carried out entirely in a continuous-time setting, in Section 3. It relies on techniques from [2] and [15]. The principal new idea is the introduction of the auxiliary process , , see (25) below.
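A bound of the form (8) translates into a step size and an iteration count for a prescribed accuracy. The sketch below is only an illustration: the constants `c0`, `c1`, `c2` and the cap `lam_max` are hypothetical inputs, since the theorem asserts the existence of such constants without the values used here. Each of the two terms in (8) is forced below `eps / 2`.

```python
import math


def step_and_iterations(eps, c0, c1, c2, lam_max):
    # Step size: make the bias term c2 * sqrt(lam) at most eps / 2.
    lam = min(lam_max, (eps / (2.0 * c2)) ** 2)
    # Iterations: make the transient term c1 * exp(-c0 * lam * n) at most eps / 2.
    n = max(0, math.ceil(math.log(2.0 * c1 / eps) / (c0 * lam)))
    return lam, n


def bound(lam, n, c0, c1, c2):
    # Right-hand side of an estimate of the form (8).
    return c1 * math.exp(-c0 * lam * n) + c2 * math.sqrt(lam)


lam, n = step_and_iterations(eps=0.1, c0=1.0, c1=1.0, c2=1.0, lam_max=0.5)
```

With this choice the step size scales like `eps**2` and the iteration count like `eps**-2 * log(1 / eps)`, up to constants.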

Consider now a strengthening of Assumption 2.5 by imposing convexity outside a compact set.

Assumption 2.8.

There exist such that, for each , satisfying ,

 ⟨H(θ,x)−H(θ′,x),θ−θ′⟩≥a|θ−θ′|2, x∈Rm. (9)

Then, we can recover analogous results to Theorem 2.7 by considering the -Wasserstein distance. At this point, let us recall the definition of the familiar, “usual” Wasserstein- (also known as -Wasserstein) distance, for :

 (10)
Theorem 2.9.

Let Assumptions 2.1, 2.2, 2.3 and 2.8 be valid. Then there are constants such that, for every ,

 $$\tilde W_1(\mathcal{L}(\theta^\lambda_n), \pi) \le C_4 e^{-C_3\lambda n} + C_5\sqrt{\lambda}, \quad n\in\mathbb{N}. \tag{11}$$

Strengthening the monotonicity condition (9) even guarantees convergence in .

Assumption 2.10.

There exists such that, for each , ,

 ⟨H(θ,x)−H(θ′,x),θ−θ′⟩≥a[|θ−θ′|2+|H(θ,x)−H(θ′,x)|2], x∈Rm.
Theorem 2.11.

Let Assumptions 2.1, 2.2, 2.3 and 2.10 be valid. Then there are constants such that

 (12)

holds for every .

2.1 Related work and our contributions

In the remarkable paper [23], a non-convex optimization problem is considered in the context of empirical risk minimization, which plays a central role in ML algorithms. The excess risk is decomposed into a sampling error resulting from the application of SGLD, a generalization error and a suboptimality error. Our aim is to improve the sampling error in the non-convex setting and provide sharper convergence estimates under more relaxed conditions. To this end, we focus on the comparison of our results with Proposition 3.3 of [23].

Condition of [23] is (much) stronger than Assumption 2.1 above. Assumption 2.5 is identical to in [23]. Condition in [23] corresponds to Lipschitz-continuity of in its first variable with a Lipschitz-constant independent from its second variable and there means that , are bounded where and . Hence Assumption 2.2 here is neither stronger nor weaker than of [23], they are incomparable conditions. In any case, Assumption 2.2 does not seem to be restrictive for practical purposes. Condition in [23] is implied by Assumptions 2.2 and 2.3.

We obtain stronger rates (which we believe to be optimal) than those of [23]. More precisely, we obtain a rate in (8) for the distance while [23] only obtains (which depends on ) but in the distance. Furthermore, we allow a possibly dependent data sequence. In other words, [23] is applicable only if is i.i.d. while Assumption 2.3 suffices for the derivation of our results.

Now let us turn to [21]. The comparison is made only in the presence of convexity (outside a compact set) for as it is a requirement for the results in [21]. Their Assumption 1.1 is precisely Assumptions 2.2 and 2.8 combined; however, this is stipulated for in [21] while we need it for , for all , as we allow dependent data streams. Furthermore, Assumption 1.3 in [21] requires that the variance of is controlled by a power of the step size, while we do not need such an assumption. The second conclusion of their Theorem 1.4 (with , using their notation ) is the same as our Theorem 2.9.

Remark 2.12.

In the particular case where , are i.i.d., one can replace Theorem 3.2 below by Doob’s inequality in the arguments for proving Theorem 2.7. The full power of Assumption 2.2 is used only in Lemma 3.14. When are i.i.d. then Lemma 3.14 is trivial and it is enough to assume only (A.2) of [23] instead of Assumption 2.2.

3 Proofs

3.1 Conditional L-mixing

$L$-mixing processes and random fields were introduced in [16]. In [4], the closely related concept of conditional $L$-mixing was created. We define this concept below and recall some related results. This section is an almost exact replica of Section 2 in [2].

We assume that the probability space is equipped with a discrete-time filtration , as well as with a decreasing sequence of sigma-fields , such that is independent of , for all .

Fix an integer and let be a set of parameters. A measurable function is called a random field. We drop the dependence on in the notation henceforth and write . A random process corresponds to a random field where is a singleton. A random field is -bounded for some if

 $$\sup_{n\in\mathbb{N}}\sup_{\theta\in D} E^{1/r}\big[|U_n(\theta)|^r\big] < \infty.$$

Now we define conditional -mixing. Recall that, for any family , of real-valued random variables, denotes a random variable that is an almost sure upper bound for each and it is, almost surely, smaller than or equal to any other such bound.

Let be -bounded for each . Define, for each , and for ,

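 $$\begin{aligned}
 M^n_r(U) &:= \operatorname*{ess\,sup}_{\theta\in D}\,\sup_{m\in\mathbb{N}} E^{1/r}\big[|U_{n+m}(\theta)|^r \,\big|\, \mathcal{R}_n\big],\\
 \gamma^n_r(\tau,U) &:= \operatorname*{ess\,sup}_{\theta\in D}\,\sup_{m\ge\tau} E^{1/r}\Big[\big|U_{n+m}(\theta) - E\big[U_{n+m}(\theta)\,\big|\,\mathcal{R}^+_{n+m-\tau}\vee\mathcal{R}_n\big]\big|^r \,\Big|\, \mathcal{R}_n\Big],\\
 \Gamma^n_r(U) &:= \sum_{\tau=0}^{\infty}\gamma^n_r(\tau,U).
 \end{aligned} \tag{13}$$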

When necessary, , and are used to emphasize dependence of these quantities on the domain which may vary.

Definition 3.1 (Conditional L-mixing).

We call uniformly conditionally -mixing (UCLM) with respect to if is adapted to for all ; for all , it is -bounded; and the sequences , are also -bounded for all . In the case of stochastic processes (when is a singleton) the terminology “conditionally -mixing process” is used.

Conditional $L$-mixing encompasses a broad class of processes (linear processes, functionals of Markov processes, etc.), see Example 2.1 in [2]. The following maximal inequality is pivotal for our arguments.

Theorem 3.2.

Assume that for some Polish space-valued independent random variables , , . Fix and . Let , be a conditionally -mixing process w.r.t. , satisfying a.s. for all . Let and let , be deterministic numbers. Then we have

 (14)

almost surely, where is a deterministic constant depending only on but independent of .

Proof.

See Theorem 2.6 of [4] (there, , , are assumed to be i.i.d.; the proof, though, trivially works for a merely independent sequence, too). ∎

Remark 3.3.

We will apply Theorem 3.2 with the choice . In that case it is known that , see Theorem A.1 of [2].

Lemma 3.4.

Let , be conditionally -mixing. Let Assumption 2.2 hold true. Then, for each , the random field , , , the closed ball of radius centered at , is uniformly conditionally -mixing with

 $$M^n_r(H(\theta,X), B(i)) \le K_1 i + K_2 M^n_r(X) + H^* \tag{15}$$

and

 $$\Gamma^n_r(H(\theta,X), B(i)) \le 2K_2\,\Gamma^n_r(X). \tag{16}$$
Proof.

See Lemma 6.4 and Example 2.4 of [2]. ∎

Lemma 3.5.

Let be bounded. Fix and let be a sequence of measurable random variables. Let be conditionally -mixing and Lipschitz in . Define the process . Then

 $$M^n_p(Y) \le M^n_p(X), \qquad \Gamma^n_p(Y) \le \Gamma^n_p(X).$$
Proof.

The proof is identical to that of Lemma 6.3 of [4], noting the Lipschitz continuity. ∎

3.2 Further notation and introduction of auxiliary processes

Throughout this section we assume that the hypotheses of Theorem 2.7 are valid. Note that Assumption 2.2 implies

 |h(θ)−h(θ′)|≤K1|θ−θ′|, θ,θ′∈Rd, (17)

Assumption 2.5 implies

 ⟨h(θ),θ⟩≥a|θ|2−b, θ∈Rd. (18)

Also, Assumption 2.2 implies

 |H(θ,x)|≤K1|θ|+K2|x|+H∗, (19)

with the constant defined in (3). We will employ a family of Lyapunov-functions in the sequel. For this purpose, let us define, for each , , for any real , and similarly

 $$V_p(\theta) := (1+|\theta|^2)^{p/2}, \quad \theta\in\mathbb{R}^d.$$

Notice that these functions are twice continuously differentiable and

 $$\lim_{|\theta|\to\infty} \frac{\nabla V_p(\theta)}{V_p(\theta)} = 0. \tag{20}$$
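For reference, the derivatives behind this limit can be written out explicitly. The following is a direct computation from the definition $V_p(\theta) = (1+|\theta|^2)^{p/2}$ with $\theta\in\mathbb{R}^d$; nothing beyond that definition is assumed.

```latex
\begin{align*}
\nabla V_p(\theta) &= p\,\theta\,(1+|\theta|^2)^{(p-2)/2},\\
\Delta V_p(\theta) &= p\,d\,(1+|\theta|^2)^{(p-2)/2}
                      + p(p-2)\,|\theta|^2\,(1+|\theta|^2)^{(p-4)/2},\\
\frac{|\nabla V_p(\theta)|}{V_p(\theta)} &= \frac{p\,|\theta|}{1+|\theta|^2}
   \;\le\; \frac{p}{2},
   \qquad\text{and}\qquad
   \frac{p\,|\theta|}{1+|\theta|^2}\xrightarrow[|\theta|\to\infty]{}0 .
\end{align*}
```

The uniform bound $p/2$ follows from $2|\theta|\le 1+|\theta|^2$.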

Let denote the set of satisfying

 $$\int_{\mathbb{R}^d} V_p(\theta)\,\mu(d\theta) < \infty.$$

For and for a non-negative measurable , the notation

 μ(f):=∫Rdf(θ)μ(dθ)

is used. The following functional is pivotal in our arguments as it is used to measure the distance between probability measures. We define, for any and ,

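 $$w_{1,p}(\mu,\nu) := \inf_{\zeta\in\mathcal{C}(\mu,\nu)} \int_{\mathbb{R}^d}\int_{\mathbb{R}^d} \big[1\wedge|\theta-\theta'|\big]\big(1+V_p(\theta)+V_p(\theta')\big)\,\zeta(d\theta\,d\theta'). \tag{21}$$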

Though is not a metric, it satisfies trivially

 $$W_1(\mu,\nu) \le w_{1,p}(\mu,\nu). \tag{22}$$

In the sequel we will need the case , that is, .

Our estimations are carried out in a continuous-time setting, so we define and discuss a number of auxiliary continuous-time processes below. First, consider , defined by the stochastic differential equation (SDE)

 $$dL_t = -h(L_t)\,dt + \sqrt{\frac{2}{\beta}}\,dB_t, \qquad L_0 := \theta_0, \tag{23}$$

where is standard Brownian motion on , independent of . Its natural filtration is denoted by , henceforth. The meaning of is clear. Equation (23) has a unique solution on adapted to since is Lipschitz-continuous by (17). We proceed by defining, for each ,

 $$L^\lambda_t := L_{\lambda t}, \quad t\in\mathbb{R}_+.$$

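Notice that $\tilde B^\lambda_t := B_{\lambda t}/\sqrt{\lambda}$, $t\in\mathbb{R}_+$, is also a Brownian motion and

 $$dL^\lambda_t = -\lambda h(L^\lambda_t)\,dt + \sqrt{\frac{2\lambda}{\beta}}\,d\tilde B^\lambda_t, \qquad L^\lambda_0 = \theta_0. \tag{24}$$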
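The time change leading from (23) to (24) can be spelled out in integral form. The sketch below assumes that the rescaled Brownian motion is $\tilde B^\lambda_t := B_{\lambda t}/\sqrt{\lambda}$, which is the scaling that makes the two equations consistent.

```latex
\begin{align*}
L^\lambda_t = L_{\lambda t}
  &= \theta_0 - \int_0^{\lambda t} h(L_s)\,ds + \sqrt{\tfrac{2}{\beta}}\;B_{\lambda t}\\
  &= \theta_0 - \lambda\int_0^{t} h(L_{\lambda u})\,du
     + \sqrt{\tfrac{2\lambda}{\beta}}\;\frac{B_{\lambda t}}{\sqrt{\lambda}}
     \qquad (s=\lambda u)\\
  &= \theta_0 - \lambda\int_0^{t} h(L^\lambda_u)\,du
     + \sqrt{\tfrac{2\lambda}{\beta}}\;\tilde B^\lambda_t .
\end{align*}
```

By Brownian scaling, $B_{\lambda t}/\sqrt{\lambda}$ has the same law as a standard Brownian motion, so one unit of time for $L^\lambda$ corresponds to one step of size $\lambda$ for the original dynamics.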

Define , , the natural filtration of , .

Let us also introduce, for each and for each , the process , satisfying

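 $$d\tilde Y^\lambda_t(x) = -\lambda H(\tilde Y^\lambda_t(x), x_{\lfloor t\rfloor})\,dt + \sqrt{\frac{2\lambda}{\beta}}\,d\tilde B^\lambda_t, \tag{25}$$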

with initial condition . Due to Assumption 2.2, there is a unique solution to (25) which is adapted to . Moreover, for any given and , consider the following auxiliary process, which plays an important role in the derivation of our results,

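 $$d\tilde\zeta^\lambda(t,s;x,\theta) = -\lambda H(\tilde\zeta^\lambda(t,s;x,\theta), x_{\lfloor t\rfloor})\,dt + \sqrt{\frac{2\lambda}{\beta}}\,d\tilde B^\lambda_t, \quad \text{for } t>s, \tag{26}$$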

with initial condition , and notice that .

Let us now define the continuously interpolated Euler-Maruyama approximation of

, via

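 $$dY^\lambda_t(x) = -\lambda H(Y^\lambda_{\lfloor t\rfloor}(x), x_{\lfloor t\rfloor})\,dt + \sqrt{\frac{2\lambda}{\beta}}\,d\tilde B^\lambda_t, \tag{27}$$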

with initial condition . Notice at this point that (27), can be solved by a simple recursion. In addition, if one considers , , where is a random element in defined by , , then for each integer ,

 $$\mathcal{L}(Y^\lambda_n(X)) = \mathcal{L}(\theta^\lambda_n). \tag{28}$$
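The "simple recursion" mentioned for (27) can be made explicit. On each interval $[n,n+1)$ the drift in (27) is frozen at its value at time $n$, so integrating over the interval gives the following sketch; the precise indexing of the data sequence relative to (2) is not reproduced here and should be read off from the definition of the random element used in (28).

```latex
\begin{align*}
Y^\lambda_{t}(x) &= Y^\lambda_{n}(x) - \lambda\,(t-n)\,H\big(Y^\lambda_{n}(x),x_{n}\big)
   + \sqrt{\tfrac{2\lambda}{\beta}}\,\big(\tilde B^\lambda_{t}-\tilde B^\lambda_{n}\big),
   \qquad t\in[n,n+1],\\
Y^\lambda_{n+1}(x) &= Y^\lambda_{n}(x) - \lambda\,H\big(Y^\lambda_{n}(x),x_{n}\big)
   + \sqrt{\tfrac{2\lambda}{\beta}}\,\big(\tilde B^\lambda_{n+1}-\tilde B^\lambda_{n}\big).
\end{align*}
```

Since the unit increments of a Brownian motion are independent standard Gaussian vectors, the values at integer times follow a recursion of the same form as (2).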

3.3 Layout of the proof

In view of (28), the main objective is to bound , which is decomposed as follows

The last term is controlled using the drift condition (29) below, due to the dissipativity Assumption 2.5, and Lipschitzness of the mean field , see (17). The second term is controlled uniformly in by a quantity which is proportional to . For that purpose, we use novel results by [15], which give us a contraction in , see Proposition 3.12 and, in particular, (49). To obtain this result, the mixing condition also plays a crucial role, see Lemma 3.19. Finally, the first term is controlled uniformly in by a quantity which is also proportional to , see Corollary 3.25. This is based on Kullback-Leibler distance estimates which go back to [10].

3.4 Crucial estimates

The next lemma shows that the SDEs (25) and (23) satisfy standard drift conditions involving the functions . Note that, on the left-hand side of (29) below, the infinitesimal generator of the diffusion process appears which is applied to the function .

Lemma 3.6.

Let Assumption 2.5 hold. For each ,

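 $$\frac{\Delta V_p(\theta)}{\beta} - \langle h(\theta), \nabla V_p(\theta)\rangle \le -C_6(p)V_p(\theta) + C_7(p), \quad \theta\in\mathbb{R}^d, \tag{29}$$

and, for all $x\in\mathbb{R}^m$,

 $$\frac{\Delta V_p(\theta)}{\beta} - \langle H(\theta,x), \nabla V_p(\theta)\rangle \le -C_6(p)V_p(\theta) + C_7(p), \quad \theta\in\mathbb{R}^d, \tag{30}$$

where $C_6(p) := ap/4$ and $C_7(p) := (3/4)\,ap\,v_p(\bar M(p))$, with

 $$\bar M(p) = \sqrt{\frac{1}{3} + \frac{4b}{3a} + \frac{4d}{3a\beta} + \frac{4(p-2)}{3a\beta}}. \tag{31}$$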
Proof.

By direct calculation, the left-hand side of (29) equals

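 $$\frac{dp\,(|\theta|^2+1)^{(p-2)/2}}{\beta} + \frac{p(p-2)\,(|\theta|^2+1)^{(p-4)/2}\,|\theta|^2}{\beta} - p\,\big\langle h(\theta), (|\theta|^2+1)^{(p-2)/2}\,\theta\big\rangle. \tag{32}$$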

By Assumption 2.5, see also (18), the third term of (32) is dominated by

 $$-pa\,|\theta|^2(|\theta|^2+1)^{(p-2)/2} + pb\,(|\theta|^2+1)^{(p-2)/2}. \tag{33}$$

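Then, for $|\theta| > \bar M(p)$, one observes that

 $$\frac{\Delta V_p(\theta)}{\beta} - \langle h(\theta), \nabla V_p(\theta)\rangle \le -\frac{ap}{4}V_p(\theta).$$

As for $|\theta| \le \bar M(p)$, one obtains

 $$\frac{\Delta V_p(\theta)}{\beta} - \langle h(\theta), \nabla V_p(\theta)\rangle + \frac{ap}{4}V_p(\theta) \le \frac{3}{4}\,ap\,v_p(\bar M(p)).$$

Combining the two cases, we have, for all $\theta\in\mathbb{R}^d$,

 $$\frac{\Delta V_p(\theta)}{\beta} - \langle h(\theta), \nabla V_p(\theta)\rangle \le -C_6(p)V_p(\theta) + C_7(p).$$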

The statement (30) follows in an identical way, noting that the constants which appear do not depend on $x$. ∎
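The drift inequality (29) can be sanity-checked numerically in a toy case. The sketch below makes several assumptions that are not taken from the text: dimension $d=1$, the hypothetical mean field $h(\theta)=\theta+2\sin\theta$ (which satisfies (18) with $a=1/2$, $b=2$), $\beta=1$, $p=4$, the reading $v_p(x)=(1+x^2)^{p/2}$, and the constants $C_6(p)=ap/4$, $C_7(p)=\tfrac34 ap\,v_p(\bar M(p))$ as they appear in the proof above.

```python
import numpy as np

# Hypothetical parameters for the check (assumptions, see the lead-in).
a, b, beta, d, p = 0.5, 2.0, 1.0, 1, 4.0


def h(theta):
    # Hypothetical mean field with <h(theta), theta> >= a * theta^2 - b.
    return theta + 2.0 * np.sin(theta)


def V(theta):
    return (1.0 + theta**2) ** (p / 2.0)


def grad_V(theta):
    return p * theta * (1.0 + theta**2) ** ((p - 2.0) / 2.0)


def lap_V(theta):
    return (p * d * (1.0 + theta**2) ** ((p - 2.0) / 2.0)
            + p * (p - 2.0) * theta**2 * (1.0 + theta**2) ** ((p - 4.0) / 2.0))


# Constants from (31) and from the two cases of the proof.
M_bar = np.sqrt(1.0 / 3.0 + 4.0 * b / (3.0 * a)
                + 4.0 * d / (3.0 * a * beta) + 4.0 * (p - 2.0) / (3.0 * a * beta))
C6 = a * p / 4.0
C7 = 0.75 * a * p * (1.0 + M_bar**2) ** (p / 2.0)

theta = np.linspace(-30.0, 30.0, 6001)
lhs = lap_V(theta) / beta - h(theta) * grad_V(theta)  # generator applied to V_p
rhs = -C6 * V(theta) + C7
max_violation = np.max(lhs - rhs)  # negative when (29) holds on the grid
```

The check only covers a finite grid and one parameter choice; it complements the proof rather than replacing it.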

Now, we proceed with the required moment estimates which play a crucial role in the derivation of the main results as given in Theorems 2.7, 2.9 and 2.11.

Lemma 3.7.

Let Assumptions 2.1, 2.2 and 2.5 hold. Let . For , let be the solution of (26) with an initial condition . Then, for any ,

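 $$\sup_{x\in(\mathbb{R}^m)^{\mathbb{N}}}\,\sup_{t\ge s} E\big[V_p(\tilde\zeta^\lambda(t,s;x,\tilde\theta))\big] \le E[V_p(\tilde\theta)] + 3\,v_p(\bar M(p)) \tag{34}$$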

where is defined in (31).

Proof.

We note that for , hence . For any fixed sequence and , by Itô’s formula, one obtains almost surely,

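 $$dV_p(\tilde\zeta^\lambda(t,s;x,\tilde\theta)) = \left[\lambda\frac{\Delta V_p(\tilde\zeta^\lambda(t,s;x,\tilde\theta))}{\beta} - \lambda\big\langle H(\tilde\zeta^\lambda(t,s;x,\tilde\theta), x_{\lfloor t\rfloor}), \nabla V_p(\tilde\zeta^\lambda(t,s;x,\tilde\theta))\big\rangle\right]dt + \sqrt{\frac{2\lambda}{\beta}}\,\big\langle \nabla V_p(\tilde\zeta^\lambda(t,s;x,\tilde\theta)), d\tilde B^\lambda_t\big\rangle,$$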

which implies

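 $$E\big[V_p(\tilde\zeta^\lambda(t,s;x,\tilde\theta))\big] = E[V_p(\tilde\theta)] + \int_s^t E\left[\lambda\frac{\Delta V_p(\tilde\zeta^\lambda(u,s;x,\tilde\theta))}{\beta} - \lambda\big\langle H(\tilde\zeta^\lambda(u,s;x,\tilde\theta), x_{\lfloor u\rfloor}), \nabla V_p(\tilde\zeta^\lambda(u,s;x,\tilde\theta))\big\rangle\right]du,$$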

where the expectation of the stochastic integral disappears since by . Differentiating both sides and using Lemma 3.6, one obtains

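 $$\begin{aligned}
 \frac{d}{dt}E\big[V_p(\tilde\zeta^\lambda(t,s;x,\tilde\theta))\big]
 &= E\left[\lambda\frac{\Delta V_p(\tilde\zeta^\lambda(t,s;x,\tilde\theta))}{\beta} - \lambda\big\langle H(\tilde\zeta^\lambda(t,s;x,\tilde\theta), x_{\lfloor t\rfloor}), \nabla V_p(\tilde\zeta^\lambda(t,s;x,\tilde\theta))\big\rangle\right]\\
 &\le -\lambda C_6(p)\,E\big[V_p(\tilde\zeta^\lambda(t,s;x,\tilde\theta))\big] + \lambda C_7(p),
 \end{aligned}$$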

which yields

 E[Vp(~ζλ(t,s;x,~θ))]≤e−λC6(p)(t−s)E[Vp(~θ)]+C7(p)C6(p)(1−e−λC6(p)(t−s))≤E[V