 # Quantitative Central Limit Theorems for Discrete Stochastic Processes

In this paper, we establish a generalization of the classical Central Limit Theorem for a family of stochastic processes that includes stochastic gradient descent and related gradient-based algorithms. Under certain regularity assumptions, we show that the iterates of these stochastic processes converge to an invariant distribution at a rate of O1/√(k) where k is the number of steps; this rate is provably tight.

## Authors

##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction

Many randomized algorithms in machine learning can be analyzed as some kind of stochastic process. For example, MCMC algorithms intentionally inject carefully designed randomness in order to sample from a desired target distribution. There is a second category of randomized algorithms for which the for which the goal is optimization rather than sampling, and the randomness is viewed as a price to pay for computational tractability. For example, stochastic gradient methods for large scale optimization use noisy estimates of a gradient because they are cheap. While such algorithms are not designed with the goal of sampling from a target distribution, an algorithm of this kind has random outputs, and its behavior is determined by the distribution of its output. Results in this paper provide tools for analyzing the convergence of such algorithms as stochastic processes.

We establish a quantitative Central Limit Theorem for stochastic processes that have the following form:

 xk+1=xk−δ∇U(xk)+√δξk(xk), (1)

where is an iterate, is a stepsize, is a potential function, and is a zero-mean, position-dependent noise variable. Under certain assumptions, we show that (1) converges in -Wasserstein distance to the following SDE:

 dx(t)=−∇U(x(t))dt+σ(x(t))dBt, (2)

where . The notion of convergence is summarized in the following informal statement of our main theorem:

###### Theorem 1

(Informal) Let denote the distribution of in (1), and let denote the invariant distribution of (2). Then there exist constants , such that for all , if and ,

 W2(pk,p∗)≤ϵ.

In other words, under the right scaling of the step size, the long-term distribution of depends only on the expected drift and the covariance matrix of the noise . As long as we know these two quantities, we can draw conclusions about the approximate behavior of (1) through , and ignore the other characteristics of .

Our result can be viewed as a general, quantitative form of the classical Central Limit Theorem, which can be thought of as showing that in (1) converges in distribution to , for the specific case of and . Our result is more general: can be any strongly convex function satisfying certain regularity assumptions and can vary with position. We show that converges to the invariant distribution of (2

), which is not necessarily a normal distribution. The fact that the classical CLT is a special case implies that the

rate in our main theorem cannot be improved in general. We discuss this in more detail in Section 4.1.1.

## 2 Related Work

Most relevant to this work is the quantitative CLT result due to Zhai (2018). In that paper, he established that for random variables with mean zero and covariance , , where is the standard Gaussian random variable. Prior to this, a number of other authors have also proved a rate, but without establishing dimension dependence (see, e.g., Bonis, 2015; Rio et al., 2009).

Another relevant line of work is the recent work on quantitative rates for Langevin MCMC algorithms. Langevin MCMC algorithms can be thought of as discretizations of the Langevin diffusion SDE, which is essentially (2) for . Authors such as Dalalyan (2017) and Durmus and Moulines (2016) were able to prove quantitative convergence results for Langevin MCMC by bounding its discretization error from the Langevin SDE. The processes we study in this paper differ from Langevin MCMC in two crucial ways: first, the noise is not Gaussian, and second, the diffusion matrix in (5) varies with .

Finally, this work is also motivated by results such as those due to Ruppert (1988), Polyak and Juditsky (1992), Fan et al. (2018), which show that iterates of the stochastic gradient algorithm with diminishing step size converge asymptotically to a normal distribution. (The limiting distribution of the appropriately rescaled iterates is Gaussian in this case, because a smooth is locally quadratic.) These classical results are asymptotic and do not give explicit rates.

## 3 Definitions and Assumptions

We will study the discrete process given by

 xk+1=xk−δ∇U(xk)+√2δTηk(xk), (3)

where

1. is the potential function,

2. are iid random variables which take values in some set and have distribution ,

3. is the noise map, and

4. is a stepsize.

Let denote the invariant distribution of (3). Define

 σx:=(Eq(η)[Tη(x)Tη(x)T])1/2. (4)

We will also study the continuous SDE given by

 dx(t)=−∇U(x(t))dt+√2σx(t)dBt, (5)

where denotes the standard -dimensional Brownian motion, and is as defined in (4). Let denote the invariant distribution of (5).

For convenience of notation, we define the following:

1. Let be the distribution of in (3).

2. Let be the transition map:

 Fη(x):=x−δ∇U(x)+√2δTη(x), (6)

so that . Note that also depends on , but we do not write this explicitly; the choice of should be clear from context.

3. Define as

 Φδ(p):=(Fη)#p, (7)

where denotes the pushforward operator; i.e., is the distribution of when , so that

We make the following assumptions about .

###### Assumption 1

There exist constants and satisfying, for all ,

 1 ∇U(0)=0, 2 ∇2U(x)⪯LI, 3 ∇2U(x)⪰mI, 4 ∥∥∇3U(x)∥∥2≤L,

where denotes the operator norm; see (8) below.

We make the following assumptions about and :

###### Assumption 2

There exists a constant , such that for all ,

 1. Eq(η)[Tη(x)]=0, 2. Eq(η)[Tη(x)Tη(x)T]≺c2σI.

### 3.1 Basic Notation

For any two distributions and , let be the 2-Wasserstein distance between and . We overload the notation and sometimes use for random variables and to denote the distance between their distributions.

For a

-order tensor

and a vector

, we define the product such that . Sometimes, to avoid ambiguity, we will write instead.

We let denote the operator norm:

 ∥M∥2=supv∈Rd,∥v∥2=1∥Mv∥2. (8)

It can be verified that for all , is a norm over .

Finally, we use the notation to denote two kinds of inner products:

1. For vectors , (the dot product).

2. For matrices , (the trace inner product).

Although the notation is overloaded, the usage should be clear from context.

## 4 Main Results and Discussion

We will consider two settings: one in which the noise in (3) does not depend on , and one in which it does. We will treat these results separately in Theorem 2 and Theorem 3.

### 4.1 Homogeneous Noise

Our first theorem deals with the case when is a constant with respect to . In addition to Assumption 1 and Assumption 2, we make the following assumptions:

###### Assumption 3

For all ,

 1. Tη(x)=Tη, 2. ∥∥Tη∥∥2≤√L, 3. σx=I.

Under these assumptions, the invariant distribution of (5) has the form

 p∗(x)∝e−U(x). (9)
###### Theorem 2

Let be an arbitrary initial distribution, and let be defined as in (3) with step size . Recall the definition of as the invariant distributioin of (3) and as the invariant distribution of (5).

For ,

 W2(^p,p∗)≤ϵ. (10)

 W2(pk,p∗)≤ϵ. (11)

An equivalent statement is that for a sufficiently large , and for sufficiently small , we can bound

 W2(pk,p∗)≤~O(d3/2√k). (12)

#### 4.1.1 Relation to the Classical Central Limit Theorem

Our result can be viewed as a generalization of the classical central limit theorem, which deals with sequences of the form

 Sk+1=∑k+1i=0ηi√k+1=√k√k+1⋅Sk+ηk+1√k+1≈Sk−12(k+1)Sk+√2√2(k+1)ηk+1

for some with mean and covariance . Thus, the sequence essentially has the same dynamics as from (3), with , and variable stepsize . To the best of the our knowledge, the best rate for the classical CLT is proven in Theorem 1.1 of Zhai (2018), with a rate of . It is also essentially tight, as Proposition 1.2 of Zhai (2018) shows that the is lower bounded by .

Our bound in Theorem 2 (equivalently, (12)) also shrinks as . We note that the sequence studied in Theorem 2 differs from , as the stepsize for is constant (i.e., does not depend on ). We stated Theorem 2 for constant step sizes mainly to simplify the proof. Our proof technique can also be applied to the variable step size setting; in Corollary 48 in the appendix, we use the results of Theorem 4 to prove that , which is the same as the constant-stepsize case.

This shows that the dependence in Theorem 2 is tight. Our dependence is , compared to the optimal rate of . However, our bound is applicable to a much more general setting, not just for .

### 4.2 Inhomogeneous Noise

We now examine the convergence of (3) under a general setting, in which the noise depends on the position.

In addition to the assumptions in Section 3, we make some additional assumptions about how depends on . We begin by defining some notation. For all and , we will let denote the derivative of wrt , denote the derivative of wrt , and denote the derivative of wrt , i.e.:

 1. ∀x,i,j and for η a.s., [Gη(x)]i,j:=∂∂xj[Tη(x)]i 2. ∀x,i,j,k and for η a.s., [Mη(x)]i,j,k:=∂2∂xj∂xk[Tη(x)]i 3. ∀x,i,j,k,l and for η a.s., [Nη(x)]i,j,k,l:=∂3∂xj∂xk∂xl[Tη(x)]i

We will assume that , , satisfy the following regularity:

###### Assumption 4

There exists an that satisfies Assumption 1 and, for all and for a.s.:

 1. Gη(x) is symmetric, 2. ∥∥Tη(x)∥∥2≤√L(∥x∥2+1), 3. ∥∥Gη(x)∥∥2≤√L, 4. ∥∥Mη(x)∥∥2≤√L, 5. ∥∥Nη(x)∥∥2≤√L.
###### Assumption 5

For any distributions and , .

Finally, we assume that is regular in the following sense:

###### Assumption 6

There exists a constant , such that the log of the invariant distribution of (5), , satisfies, for all ,

 1. ∥∥∇3f(x)∥∥2≤θ, 2. ∥∥∇2f(x)∥∥2≤θ(∥x∥2+1), 3. ∥∇f(x)∥2≤θ(∥x∥22+1).
###### Remark 1

If and are bounded by , then 2. and 3. are implied by 1., but we state the assumption this way for convenience.

#### 4.2.1 A motivating example

Before we state our main theorem, it will help to motivate some of our assumptions by considering an application to the stochastic gradient algorithm.

Consider a classification problem where one tries to learn the parameters of a model. One is given datapoints , and a likelihood function , and one tries to minimize for

 U(x):=1SS∑i=1Ui(x),withUi(x):=ℓ(x,(zi,yi)).

The stochastic gradient algorithm proceeds as follows:

 xk+1= xk−δ∇Uηk(xk) = xk−δ∇U(xk)+√2δTηk(xk), (13)

where for each , is an integer sampled uniformly from , and we define . Notice that (13) is identical to (3).

The mean and variance of

are

 Eη[Tη(x)] =0 Eη[Tη(x)Tη(x)T] =√δ/2Ei∼Unif({1...S})[(∇U(x)−∇Ui(x))(∇U(x)−∇Ui(x))T]

Assuming , Assumption 2 is true if for some constant ,

 Ei∼Unif({1...S})[(∇U(x)−∇Ui(x))(∇U(x)−∇Ui(x))T]≺c2σ⋅I (14)

Furthermore, are respectively ,
, , , so Assumption 4

is guaranteed by the loss function

having Lipschitz derivatives (in ) up to fourth order.

If is -strongly convex and has -Lipschitz gradients for all , then Assumption 5 is true for for all , by a synchronous coupling argument (see Lemma 33 in Appendix B).

Finally, we remark that the upper bound for Assumption 2.2 implied by (14) is in fact quite loose when .

We will now state our main theorem for this section:

###### Theorem 3

Let be an arbitrary initial distribution, and let be defined as in (3) with step size . Recall the defintion of as the invariant distribution of (3) and as the invariant distribution of (5). For ,

 W2(^p,p∗)≤ϵ. (15)

 W2(pk,p∗)≤ϵ. (16)
###### Remark 2

Like Theorem 2, this also gives a rate, which is optimal. (see Section 4.1.1).

## 5 Proof of Main Theorems

In this section, we sketch the proofs of Theorems 2 and 3.

### 5.1 Proof of Results for Homogeneous Diffusion

• We first prove (11).

By Theorem 4 below, for ,

 W2(pk,p∗)≤ e−mδk/8W2(p0,p∗)+282δ1/2d3/2(L+1)9/2max{1mlog(1m),1}7. (17)

Thus if , then

Additionally, if , then , so together, we get To summarize, our assumptions are

 δ≤ min⎧⎪ ⎪ ⎪⎨⎪ ⎪ ⎪⎩min{m2,1}218d2(L+1)3,ϵ22166d3(L+1)9max{1mlog(1m),1}14⎫⎪ ⎪ ⎪⎬⎪ ⎪ ⎪⎭=ϵ2d3⋅poly(1m,L)−1

and

 k≥ 8mδlog2W2(p0,p∗)ϵ=d3ϵ2⋅logW2(p0,p∗)ϵpoly(1m,L).

This proves (11). To prove (10), use our above assumption on , and take the limit of (17) as .

###### Theorem 4

Let be an arbitrary initial distribution, and let be defined as in (3).
Let be some arbitrary constant. For any step size satisfying , the Wasserstein distance between and is upper bounded as

 W2(pk,p∗)≤ e−mδk/8W2(p0,p∗)+282δ1/2d3/2(L+1)9/2max{1mlog(1m),1}7.
• Recall our definition of in (7). Let denote repeated applications of , so . Our objective is thus to bound .

We first use triangle inequality to split the objective into two terms:

 W2(Φkδ(p0),p∗)≤ W2(Φkδ(p0),Φkδ(p∗))+W2(Φkδ(p∗),p∗) (18)

The first term is easy to bound. We can apply Lemma 14 (in Appendix A) to get

 W2(Φkδ(p∗),p∗)≤e−mδk/8W2(p0,p∗) (19)

To bound the second term of (18), we use an argument adapted from (Zhai 2016):

 W2(Φkδ(p∗),p∗)= W2(Φδ(Φk−1δ(p∗)),p∗) ≤ W2(Φδ(Φk−1δ(p∗)),Φδ(p∗))+W2(Φδ(p∗),p∗) ≤ e−mδ/8W2(Φk−1δ(p∗),p∗)+W2(Φδ(p∗),p∗) ⋮ ≤ k−1∑i=0e−mδi/8W2(Φδ(p∗),p∗) ≤ 8mδW2(Φδ(p∗),p∗).

Here the third inequality is by induction. This reduces our problem to bounding the expression , which can be thought of as the one-step divergence between (3) and (5) when . We apply Lemma 1 below to get

 W2(Φδ(p∗),p∗)≤ 278δ3/2d3/2(L+1)9/2max{1mlog(1m),1}6 ⇒8mδW2(Φδ(p∗),p∗)≤ 282δ1/2d3/2(L+1)9/2max{1mlog(1m),1}7. (20)

Thus, substituting (19) and (20) into (18), we get

 W2(Φkδ(p0),p∗)≤ e−mδk/8W2(p0,p∗)+282δ1/2d3/2(L+1)9/2max{1mlog(1m),1}7. (21)

###### Lemma 1

Let . Then for any ,

 W2(pδ,p∗)≤278δ3/2d3/2(L+1)9/2max{1mlog(1m),1}6.

(This lemma is similar in spirit to Lemma 1.6 in Zhai (2018).)

• Using Talagrand’s inequality and the fact that is strongly convex, we can upper bound by for any distribution which has density wrt , i.e.:

 W22(p∗,pδ)≤ 2m∫(pδ(x)p∗(x)−1)2p∗(x)dx. (22)

See Lemma 12 in Appendix B for a rigorous proof of (22).

Under our assumptions on , we can apply Lemma 2 below, giving

 ∫BR(pδ(x)p∗(x)−1)2p∗(x)dx ≤ 223δ3d2(L+1)9∫exp(m32∥x∥22)(∥x∥122+1)p∗(x)dx ≤ 224δ3d2(L+1)9(∫exp(m16∥x∥22)p∗(x)dx+∫(∥x∥242+1)p∗(x)dx) ≤ 224δ3d2(L+1)9(8d+max{(2111mlog(28/m))11,2111m}) ≤ 2156δ3d3(L+1)9max{1mlog(1m),1}11,

where the first inequality is by Lemma 2, the second inequality is by Young’s inequality, the third inequality is by Lemma 34 and Lemma 38, with . Plugging the above into (22),

 W22(p∗,pδ)≤ 2156δ3d3(L+1)9max{1mlog(1m),1}12.

The following lemma studies the “discretization error” between the SDE (5) and one step of (3).

###### Lemma 2

Let . For any , , and ,

 ∣∣∣pδ(x)p∗(x)−1∣∣∣≤512δ3/2d(L+1)9/2exp(m32∥x∥22)(∥x∥62+1).
• Recall that . Thus by the change of variable formula, we have

 pδ(x) =∫p∗(F−1η(x))det(∇Fη(F−1η(x)))−1q(η)dη =Eq(η)⎡⎢ ⎢ ⎢ ⎢⎣p∗(F−1η(x))to16.9pt\vboxto16.9pt\pgfpicture\makeatletterto0.0pt\pgfsys@beginscope\definecolorpgfstrokecolorrgb0,0,0\pgfsys@color@rgb@stroke000\pgfsys@color@rgb@fill000\pgfsys@setlinewidth0.4pt\nullfontto0.0pt\pgfsys@beginscope\hbox{{\pgfsys@beginscope{}{{}{{{}}}{{}}{}{}{}{}{}{}{}{}{}{{}% \pgfsys@moveto{6.249756pt}{0.0pt}\pgfsys@curveto{6.249756pt}{3.451645pt}{3.451% 645pt}{6.249756pt}{0.0pt}{6.249756pt}\pgfsys@curveto{-3.451645pt}{6.249756pt}{% -6.249756pt}{3.451645pt}{-6.249756pt}{0.0pt}\pgfsys@curveto{-6.249756pt}{-3.45% 1645pt}{-3.451645pt}{-6.249756pt}{0.0pt}{-6.249756pt}\pgfsys@curveto{3.451645% pt}{-6.249756pt}{6.249756pt}{-3.451645pt}{6.249756pt}{0.0pt}\pgfsys@closepath% \pgfsys@moveto{0.0pt}{0.0pt}\pgfsys@stroke\pgfsys@invoke{ } }{{{{}}\pgfsys@beginscope{}\pgfsys@transformcm{1.0}{0.0}{0.0}{1.0}{-5.24992pt}% {-1.749973pt}{}\hbox{{\definecolor{pgfstrokecolor}{rgb}{0,0,0}% \pgfsys@color@rgb@stroke{0}{0}{0}{}\pgfsys@color@rgb@fill{0}{0}{0}{}\hbox{{1}} }}{}{}\pgfsys@endscope}}} {}{}{}\pgfsys@endscope}}\pgfsys@endscope\hss\pgfsys@discardpath\pgfsys@endscope\hss\endpgfpicturedet(∇Fη(F−1η(x)))−1to16.9pt\vboxto16.9pt\pgfpicture\makeatletterto0.0pt\pgfsys@beginscope\definecolorpgfstrokecolorrgb0,0,0\pgfsys@color@rgb@stroke000\pgfsys@color@rgb@fill000\pgfsys@setlinewidth0.4pt\nullfontto0.0pt\pgfsys@beginscope\hbox{{\pgfsys@beginscope{}{{}{{{}}}{{}}{}{}{}{}{}{}{}{}{}{{}% \pgfsys@moveto{6.249756pt}{0.0pt}\pgfsys@curveto{6.249756pt}{3.451645pt}{3.451% 645pt}{6.249756pt}{0.0pt}{6.249756pt}\pgfsys@curveto{-3.451645pt}{6.249756pt}{% -6.249756pt}{3.451645pt}{-6.249756pt}{0.0pt}\pgfsys@curveto{-6.249756pt}{-3.45% 1645pt}{-3.451645pt}{-6.249756pt}{0.0pt}{-6.249756pt}\pgfsys@curveto{3.451645% pt}{-6.249756pt}{6.249756pt}{-3.451645pt}{6.249756pt}{0.0pt}\pgfsys@closepath% \pgfsys@moveto{0.0pt}{0.0pt}\pgfsys@stroke\pgfsys@invoke{ } }{{{{}}\pgfsys@beginscope{}\pgfsys@transformcm{1.0}{0.0}{0.0}{1.0}{-5.24992pt}% {-1.749973pt}{}\hbox{{\definecolor{pgfstrokecolor}{rgb}{0,0,0}% \pgfsys@color@rgb@stroke{0}{0}{0}{}\pgfsys@color@rgb@fill{0}{0}{0}{}\hbox{{2}} }}{}{}\pgfsys@endscope}}} {}{}{}\pgfsys@endscope}}\pgfsys@endscope\hss\pgfsys@discardpath\pgfsys@endscope\hss\endpgfpicture⎤⎥ ⎥ ⎥ ⎥⎦, (23)

where denotes the Jacobian matrix of at . The invertibility of is shown in Lemma 46. We rewrite as its Taylor expansion about :

 p∗(F−1η(x))=

Substituting the above into (23) and applying Lemmas 3, 4, 5, and 6, we get

 pδ(x)= = (24)

for some satisfying

 |Δ|≤ p∗(x)⋅8δ3/2dL3/2(∥x∥2+1)+p∗(x)⋅16δ3/2dL5/2(∥x∥32+1) +p∗(x)⋅64δ3/2d(L+1)5/2(∥x∥52+1) +p∗(x)⋅256δ3/2exp(m32∥x∥22)(L+1)9/2(∥x∥62+1) ≤ p∗(x)⋅512δ3/2d(L+1)9/2exp(m32∥x∥22)(∥x∥62+1).

Furthermore, by using the expression and some algebra, we see that

 (25) = =0.

Substituting the above into (24) gives , which implies that

 ∣∣∣pδ(x)p∗(x)−1∣∣∣≤512δ3/2d(L+1)9/2exp(m32∥x∥22)(∥x∥62+1).

### 5.2 Proof of Results for Inhomogeneous Diffusion

The proof of Theorem 3 is quite similar to the proof of Theorem 2, and can be found in the Appendix (Section B). We will highlight some additional difficulties in the proof compared to Theorem 2.

The heart of the proof lies in Lemma 15, which bounds the discretization error between the SDE (5) and one step of the discrete process (3), in the form of . This is analogous to Lemma 1 in Section 5.1. Compared to the proof of Lemma 1, one additional difficulty is that we can no longer rely on Talagrand’s inequality (22). This is because is no longer guaranteed to be strongly log-concave. We instead use the fact that is subgaussian to upper bound by (see Corollary 40).

Lemma 1 in turn relies crucially on bounding the expression . This is proved in Lemma 16, which is the analog of Lemma 2 in Section 5.1. The additional difficulty is that we have to handle the effects of a diffusion matrix that depends on the position . Also, Lemma 16 relies on the closed-form expression for in order to cancel out terms of order less than