1 Introduction
Many randomized algorithms in machine learning can be analyzed as stochastic processes. For example, MCMC algorithms intentionally inject carefully designed randomness in order to sample from a desired target distribution. There is a second category of randomized algorithms for which the goal is optimization rather than sampling, and the randomness is a price paid for computational tractability. For example, stochastic gradient methods for large-scale optimization use noisy estimates of a gradient because they are cheap to compute. While such algorithms are not designed with the goal of sampling from a target distribution, an algorithm of this kind has random outputs, and its behavior is determined by the distribution of those outputs. The results in this paper provide tools for analyzing the convergence of such algorithms as stochastic processes.
We establish a quantitative Central Limit Theorem for stochastic processes that have the following form:
(1) $x_{k+1} = x_k - \delta \nabla U(x_k) + \sqrt{\delta}\,\xi_k(x_k),$
where $x_k \in \mathbb{R}^d$ is an iterate, $\delta$ is a stepsize, $U : \mathbb{R}^d \to \mathbb{R}$ is a potential function, and $\xi_k$ is a zero-mean, position-dependent noise variable. Under certain assumptions, we show that (1) converges in Wasserstein distance to the following SDE:
(2) $dx_t = -\nabla U(x_t)\,dt + M^{1/2}(x_t)\,dB_t,$
where $M(x) := \mathbb{E}\big[\xi(x)\,\xi(x)^\top\big]$. The notion of convergence is summarized in the following informal statement of our main theorem:
Theorem 1
In other words, under the right scaling of the step size, the long-term distribution of the iterates $x_k$ depends only on the expected drift and the covariance matrix of the noise $\xi$. As long as we know these two quantities, we can draw conclusions about the approximate behavior of (1) through (2), and ignore the other characteristics of the noise.
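To make this claim concrete, the following sketch simulates a one-dimensional instance of such a process under two different zero-mean, unit-variance noise distributions. The update rule, potential, and parameter values here are illustrative assumptions (an Euler-type recursion $x_{k+1} = x_k - \delta \nabla U(x_k) + \sqrt{\delta}\,\xi_k$ with a quadratic potential), not the paper's exact construction; the point is that the long-run variances nearly coincide because only the drift and the noise covariance enter the limit:

```python
import numpy as np

rng = np.random.default_rng(0)
delta = 0.01        # step size (hypothetical value)
n_steps = 100_000

def grad_U(x):
    return x        # illustrative quadratic potential U(x) = x**2 / 2

def run(sample_noise):
    """Iterate x <- x - delta*grad_U(x) + sqrt(delta)*xi and collect the tail."""
    x, tail = 0.0, []
    for k in range(n_steps):
        x = x - delta * grad_U(x) + np.sqrt(delta) * sample_noise()
        if k > n_steps // 2:          # discard burn-in
            tail.append(x)
    return np.array(tail)

# Two zero-mean, unit-variance noise distributions, one of them very non-Gaussian:
var_gauss = np.var(run(lambda: rng.standard_normal()))
var_rade = np.var(run(lambda: rng.choice([-1.0, 1.0])))

# Both long-run variances come out close to 1/2, the stationary variance of the
# limiting diffusion dx = -x dt + dB; the shape of the noise beyond its
# covariance does not matter.
print(var_gauss, var_rade)
```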
Our result can be viewed as a general, quantitative form of the classical Central Limit Theorem, which can be thought of as showing that $x_n$ in (1) converges in distribution to a standard Gaussian, for the specific case of a quadratic potential and a position-independent noise distribution. Our result is more general: $U$ can be any strongly convex function satisfying certain regularity assumptions, and the noise distribution can vary with position. We show that $x_k$ converges to the invariant distribution of (2), which is not necessarily a normal distribution. The fact that the classical CLT is a special case implies that the $\frac{1}{\sqrt{n}}$ rate in our main theorem cannot be improved in general. We discuss this in more detail in Section 4.1.1.
2 Related Work
Most relevant to this work is the quantitative CLT result due to Zhai (2018). In that paper, he established that for i.i.d. mean-zero random variables $X_i$ with identity covariance and bounded norm, $W_2\big(\frac{1}{\sqrt{n}}\sum_{i=1}^{n} X_i,\, G\big) = O\big(\frac{\sqrt{d}\log n}{\sqrt{n}}\big)$, where $G$ is a standard Gaussian random variable. Prior to this, a number of other authors had also proved a $\frac{1}{\sqrt{n}}$ rate, but without establishing the dimension dependence (see, e.g., Bonis, 2015; Rio et al., 2009).
Another relevant line of work is the recent work on quantitative rates for Langevin MCMC algorithms. Langevin MCMC algorithms can be thought of as discretizations of the Langevin diffusion SDE, which is essentially (2) with Gaussian noise and a constant diffusion matrix. Authors such as Dalalyan (2017) and Durmus and Moulines (2016) were able to prove quantitative convergence results for Langevin MCMC by bounding its discretization error relative to the Langevin SDE. The processes we study in this paper differ from Langevin MCMC in two crucial ways: first, the noise $\xi_k$ is not Gaussian, and second, the diffusion matrix in (5) varies with the position $x$.
Finally, this work is also motivated by results such as those due to Ruppert (1988), Polyak and Juditsky (1992), and Fan et al. (2018), which show that iterates of the stochastic gradient algorithm with diminishing step size converge asymptotically to a normal distribution. (The limiting distribution of the appropriately rescaled iterates is Gaussian in this case, because a smooth $U$ is locally quadratic.) These classical results are asymptotic and do not give explicit rates.
3 Definitions and Assumptions
We will study the discrete process given by
(3) $x_{k+1} = x_k - \delta \nabla U(x_k) + \sqrt{\delta}\,\xi(x_k, z_k),$
where

$U : \mathbb{R}^d \to \mathbb{R}$ is the potential function,

$z_k$ are iid random variables which take values in some set $\mathcal{Z}$ and have distribution $P_z$,

$\xi : \mathbb{R}^d \times \mathcal{Z} \to \mathbb{R}^d$ is the noise map, and

$\delta$ is a stepsize.
Let $\pi_\delta$ denote the invariant distribution of (3). Define
(4) $M(x) := \mathbb{E}_{z \sim P_z}\big[\xi(x, z)\,\xi(x, z)^\top\big].$
We will also study the continuous SDE given by
(5) $dx_t = -\nabla U(x_t)\,dt + M^{1/2}(x_t)\,dB_t,$
where $B_t$ denotes the standard $d$-dimensional Brownian motion, and $M$ is as defined in (4). Let $\pi^*$ denote the invariant distribution of (5).
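An SDE of this type can be simulated with the Euler–Maruyama scheme. The sketch below is a toy illustration, not the paper's construction: the potential, the position-dependent diffusion coefficient, and all parameter values are hypothetical choices, and the SDE form $dx = -\nabla U(x)\,dt + M^{1/2}(x)\,dB$ is assumed as reconstructed above:

```python
import numpy as np

rng = np.random.default_rng(1)

def grad_U(x):
    return x                                       # illustrative strongly convex potential

def M_sqrt(x):
    return np.sqrt(1.0 + 0.5 * np.tanh(x) ** 2)    # illustrative M(x)^{1/2}, varies with x

def euler_maruyama(x0, T, h):
    """One sample path of dx = -grad_U(x) dt + M_sqrt(x) dB, returned at time T."""
    x = x0
    for _ in range(int(T / h)):
        x += -h * grad_U(x) + np.sqrt(h) * M_sqrt(x) * rng.standard_normal()
    return x

# Draw many paths at a fixed time to approximate the distribution of x_T.
samples = np.array([euler_maruyama(0.0, T=5.0, h=0.01) for _ in range(2000)])
print(samples.mean(), samples.var())
```

With the drift pulling toward the origin, the sample mean is near zero and the variance settles between the values it would take under the smallest and largest diffusion coefficients.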
For convenience of notation, we define the following:

Let $q_k$ be the distribution of $x_k$ in (3).

Let $T$ be the transition map:
(6) $T(x, z) := x - \delta \nabla U(x) + \sqrt{\delta}\,\xi(x, z),$ so that $x_{k+1} = T(x_k, z_k)$. Note that $T$ also depends on $\delta$, but we do not write this explicitly; the choice of $\delta$ should be clear from context.

Define $\Phi$ as
(7) $\Phi(q) := T_{\#}\,(q \otimes P_z),$ where $\#$ denotes the pushforward operator; i.e., $\Phi(q)$ is the distribution of $T(x, z)$ when $x \sim q$ and $z \sim P_z$, so that $q_{k+1} = \Phi(q_k)$.
We make the following assumptions about the potential $U$.
Assumption 1
We make the following assumptions about the noise map $\xi$ and the noise distribution $P_z$:
Assumption 2
There exists a constant $\beta$, such that for all $x$,
3.1 Basic Notation
For any two distributions $p$ and $q$, let $W_2(p, q)$ be the 2-Wasserstein distance between $p$ and $q$. We overload the notation and sometimes write $W_2(X, Y)$ for random variables $X$ and $Y$ to denote the distance between their distributions.
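In one dimension, the $W_2$ distance between two empirical distributions has a simple closed form (the monotone coupling of sorted samples is optimal), which is convenient for numerically sanity-checking statements of this kind. A minimal sketch:

```python
import numpy as np

def w2_empirical_1d(xs, ys):
    """W2 between two equal-size 1-D empirical distributions.

    In one dimension the monotone (sorted) coupling is optimal, so W2 is the
    root-mean-square difference of the order statistics.
    """
    return np.sqrt(np.mean((np.sort(xs) - np.sort(ys)) ** 2))

rng = np.random.default_rng(0)
a = rng.standard_normal(100_000)          # samples from N(0, 1)
b = rng.standard_normal(100_000) + 1.0    # samples from N(1, 1)

d = w2_empirical_1d(a, b)
print(d)   # close to 1.0: W2(N(0,1), N(1,1)) = 1, a pure mean shift
```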
For a $k$-th order tensor $T$ and a vector $v \in \mathbb{R}^d$, we define the product $T[v]$ to be the $(k-1)$-th order tensor obtained by contracting the last index of $T$ with $v$: $(T[v])_{i_1, \dots, i_{k-1}} := \sum_{j=1}^{d} T_{i_1, \dots, i_{k-1}, j}\, v_j$. Sometimes, to avoid ambiguity, we will write $T[v]$ explicitly. We let $\|T\|_{op}$ denote the operator norm:
(8) $\|T\|_{op} := \sup_{\|v_1\| \le 1, \dots, \|v_k\| \le 1} \big|\,T[v_1][v_2]\cdots[v_k]\,\big|.$
It can be verified that for all $k$, $\|\cdot\|_{op}$ is a norm over the space of $k$-th order tensors.
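The contraction and the operator norm can be sketched numerically as follows. The helper names are our own, and the norm estimate is only a crude Monte-Carlo lower bound over random unit vectors, included for illustration:

```python
import numpy as np

def contract(T, v):
    """(T[v])_{i_1,...,i_{k-1}} = sum_j T_{i_1,...,i_{k-1},j} v_j : contract the last index."""
    return np.tensordot(T, v, axes=([-1], [0]))

def op_norm_lower_bound(T, trials=2000, seed=0):
    """Crude Monte-Carlo lower bound on sup over unit vectors of |T[v_1]...[v_k]|."""
    rng = np.random.default_rng(seed)
    d, best = T.shape[-1], 0.0
    for _ in range(trials):
        val = T
        for _ in range(T.ndim):
            v = rng.standard_normal(d)
            val = contract(val, v / np.linalg.norm(v))
        best = max(best, abs(float(val)))
    return best

# A third-order tensor with a single nonzero entry; its operator norm is exactly 2.
T = np.zeros((3, 3, 3))
T[0, 1, 2] = 2.0
e0, e1, e2 = np.eye(3)
print(contract(T, e2).shape)                                # one contraction leaves a matrix
print(float(contract(contract(contract(T, e2), e1), e0)))   # basis vectors attain the norm: 2.0
```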
Finally, we use the notation $\langle \cdot, \cdot \rangle$ to denote two kinds of inner products:

For vectors $u, v \in \mathbb{R}^d$, $\langle u, v \rangle := \sum_{i=1}^{d} u_i v_i$ (the dot product).

For matrices $A, B \in \mathbb{R}^{d \times d}$, $\langle A, B \rangle := \mathrm{tr}(A^\top B)$ (the trace inner product).
Although the notation is overloaded, the usage should be clear from context.
4 Main Results and Discussion
We will consider two settings: one in which the noise in (3) does not depend on the position $x$, and one in which it does. We treat these two settings separately in Theorem 2 and Theorem 3.
4.1 Homogeneous Noise
Our first theorem deals with the case when the noise distribution is constant with respect to $x$, so that $M(x) \equiv M$. In addition to Assumption 1 and Assumption 2, we make the following assumption:
Assumption 3
For all $x$, $M\, \nabla^2 U(x) = \nabla^2 U(x)\, M$.
Under these assumptions, the invariant distribution of (5) has the form
(9) $\pi^*(x) \propto \exp\big(-2 V(x)\big), \quad \text{where } \nabla V(x) = M^{-1} \nabla U(x).$
Theorem 2
Let $q_0$ be an arbitrary initial distribution, and let $x_k$ be defined as in (3) with step size $\delta$. Recall the definition of $\pi_\delta$ as the invariant distribution of (3) and $\pi^*$ as the invariant distribution of (5).
For ,
(10) 
If, in addition, ,
(11) 
An equivalent statement is that for sufficiently large $k$ and sufficiently small $\delta$, we can bound
(12) 
4.1.1 Relation to the Classical Central Limit Theorem
Our result can be viewed as a generalization of the classical central limit theorem, which deals with sequences of the form
$S_n := \frac{1}{\sqrt{n}} \sum_{i=1}^{n} \xi_i$
for some i.i.d. $\xi_i$ with mean $0$ and covariance $I_d$. Thus, the sequence $S_n$ essentially has the same dynamics as $x_k$ from (3), with a quadratic potential and a variable stepsize $\delta_k$ of order $1/k$. To the best of our knowledge, the best rate for the classical CLT is proven in Theorem 1.1 of Zhai (2018), with a rate of $O\big(\frac{\sqrt{d}\log n}{\sqrt{n}}\big)$. It is also essentially tight, as Proposition 1.2 of Zhai (2018) shows that the distance is lower bounded by $\Omega\big(\frac{\sqrt{d}}{\sqrt{n}}\big)$.
Our bound in Theorem 2 (equivalently, (12)) also shrinks as $\frac{1}{\sqrt{n}}$ under the scaling $\delta \sim \frac{1}{n}$. We note that the sequence studied in Theorem 2 differs from $S_n$, as the stepsize for $x_k$ is constant (i.e., does not depend on $k$). We stated Theorem 2 for constant step sizes mainly to simplify the proof. Our proof technique can also be applied to the variable-stepsize setting; in Corollary 48 in the appendix, we use the results of Theorem 4 to prove a bound for $S_n$ of the same order as in the constant-stepsize case.
This shows that the $\frac{1}{\sqrt{n}}$ dependence in Theorem 2 is tight. Our dimension dependence is worse than the optimal $\sqrt{d}$ rate. However, our bound is applicable to a much more general setting, and not just to the quadratic-potential, constant-noise dynamics of the classical CLT.
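As an empirical illustration of this correspondence, one can watch the $W_2$ distance between normalized Rademacher sums $S_n$ and a Gaussian sample shrink as $n$ grows. This is a rough numerical sketch (the sample sizes and the Rademacher choice are our own), using the sorted-coupling formula for the 1-D $W_2$ distance:

```python
import numpy as np

rng = np.random.default_rng(0)

def w2_1d(xs, ys):
    """1-D empirical W2 via the sorted (monotone) coupling."""
    return np.sqrt(np.mean((np.sort(xs) - np.sort(ys)) ** 2))

def rademacher_sums(n, n_samples):
    """n_samples draws of S_n = n^{-1/2} * (sum of n independent random signs)."""
    xi = rng.choice([-1.0, 1.0], size=(n_samples, n))
    return xi.sum(axis=1) / np.sqrt(n)

gauss = rng.standard_normal(20_000)
dists = {n: w2_1d(rademacher_sums(n, 20_000), gauss) for n in (4, 16, 64, 256)}
for n, d in dists.items():
    print(n, d)    # the distance shrinks as n grows
```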
4.2 Inhomogeneous Noise
We now examine the convergence of (3) under a general setting, in which the noise depends on the position.
In addition to the assumptions in Section 3, we make some additional assumptions about how $\xi$ depends on $x$. We begin by defining some notation. For all $z$ and $x$, we will let $\nabla \xi(x, z)$ denote the derivative of $\xi$ with respect to $x$, $\nabla^2 \xi(x, z)$ denote the derivative of $\nabla \xi$ with respect to $x$, and $\nabla^3 \xi(x, z)$ denote the derivative of $\nabla^2 \xi$ with respect to $x$.
We will assume that $\nabla \xi$, $\nabla^2 \xi$, and $\nabla^3 \xi$ satisfy the following regularity:
Assumption 4
There exists a constant that satisfies Assumption 1 and, for all $x$ and $P_z$-almost every $z$:
Assumption 5
For any distributions $p$ and $q$, $W_2\big(\Phi(p), \Phi(q)\big) \le e^{-\delta \lambda}\, W_2(p, q)$ for some constant $\lambda > 0$.
Finally, we assume that is regular in the following sense:
Assumption 6
There exists a constant $C$, such that the log of the invariant distribution of (5), $\log \pi^*$, satisfies, for all $x$,
Remark 1
If $\nabla^2 \xi$ and $\nabla^3 \xi$ are bounded, then conditions 2. and 3. are implied by condition 1., but we state the assumption this way for convenience.
4.2.1 A motivating example
Before we state our main theorem, it will help to motivate some of our assumptions by considering an application to the stochastic gradient algorithm.
Consider a classification problem where one tries to learn the parameters $x \in \mathbb{R}^d$ of a model. One is given datapoints $z_1, \dots, z_N$ and a likelihood function $f(x; z)$, and one tries to minimize the empirical objective
$U(x) := \frac{1}{N} \sum_{i=1}^{N} f(x; z_i).$
The stochastic gradient algorithm proceeds as follows:
(13) $x_{k+1} = x_k - \delta\, \nabla f(x_k; z_{i_k}),$
where for each $k$, $i_k$ is an integer sampled uniformly from $\{1, \dots, N\}$, and we define $\xi(x, i) := \sqrt{\delta}\,\big(\nabla U(x) - \nabla f(x; z_i)\big)$. Notice that (13) is identical to (3).
The mean and covariance of $\xi(x, i)$ are
$\mathbb{E}_i[\xi(x, i)] = 0 \quad \text{and} \quad M(x) = \frac{\delta}{N} \sum_{i=1}^{N} \big(\nabla U(x) - \nabla f(x; z_i)\big)\big(\nabla U(x) - \nabla f(x; z_i)\big)^\top.$
Assumption 2 is true if for some constant $\beta$,
(14)
Furthermore, $\nabla \xi$, $\nabla^2 \xi$, and $\nabla^3 \xi$ are the corresponding higher derivatives of $\sqrt{\delta}\,\big(\nabla U - \nabla f\big)$, so Assumption 4 is guaranteed by the loss function $f$ having Lipschitz derivatives (in $x$) up to fourth order. If $f(\cdot\,; z_i)$ is strongly convex and has Lipschitz gradients for all $z_i$, then Assumption 5 is true for sufficiently small $\delta$, by a synchronous coupling argument (see Lemma 33 in Appendix B).
Finally, we remark that the upper bound for Assumption 2.2 implied by (14) is in fact quite loose.
We will now state our main theorem for this section:
Theorem 3
5 Proof of Main Theorems
5.1 Proof of Results for Homogeneous Diffusion
Theorem 4
Let $q_0$ be an arbitrary initial distribution, and let $x_k$ be defined as in (3).
Let $K$ be some arbitrary constant.
For any sufficiently small step size $\delta$,
the Wasserstein distance between $q_k$ and $\pi^*$ is upper bounded as

Recall our definition of $\Phi$ in (7). Let $\Phi^k$ denote $k$ repeated applications of $\Phi$, so $q_k = \Phi^k(q_0)$. Our objective is thus to bound $W_2\big(\Phi^k(q_0), \pi^*\big)$.
We first use the triangle inequality to split the objective into two terms:
(18) To bound the second term of (18), we use an argument adapted from Zhai (2018):
Here the third inequality is by induction. This reduces our problem to bounding the expression $W_2\big(\Phi(\pi^*), \pi^*\big)$, which can be thought of as the one-step divergence between (3) and (5) when initialized at $\pi^*$. We apply Lemma 1 below to get
(20) Thus, substituting (19) and (20) into (18), we get
(21)
Lemma 1
Let . Then for any ,
(This lemma is similar in spirit to Lemma 1.6 in Zhai (2018).)

Using Talagrand’s inequality and the fact that $-\log \pi^*$ is strongly convex, we can upper bound $W_2(p, \pi^*)$ by a multiple of the square root of the KL divergence for any distribution $p$ which has a density with respect to $\pi^*$, i.e.:
(22) $W_2(p, \pi^*) \le \sqrt{\frac{2\,\mathrm{KL}(p \,\|\, \pi^*)}{m}}.$
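For intuition, Talagrand’s inequality can be checked in closed form for a Gaussian mean shift, where it holds with equality. This is a toy illustration with hypothetical constants; $m$ denotes the strong-convexity parameter of $-\log \pi^*$:

```python
import numpy as np

# Take pi* = N(0, 1/m), so that -log pi* is m-strongly convex, and let
# p = N(mu, 1/m) be a shifted copy. Both sides of Talagrand's inequality
# W2(p, pi*) <= sqrt(2 KL(p || pi*) / m) are then available in closed form.
m, mu = 2.0, 0.7                    # hypothetical constants

w2 = abs(mu)                        # W2 between Gaussians differing only in mean
kl = 0.5 * m * mu ** 2              # KL(N(mu, 1/m) || N(0, 1/m)) = mu^2 / (2 * (1/m))
bound = np.sqrt(2.0 * kl / m)

print(w2, bound)                    # equal: the inequality is tight for mean shifts
```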
The following lemma studies the “discretization error” between the SDE (5) and one step of (3).
Lemma 2
Let . For any , , and ,

Recall that $T(x, z) = x - \delta \nabla U(x) + \sqrt{\delta}\,\xi(x, z)$. Thus, by the change of variables formula, we have
(23) where $\nabla T$ denotes the Jacobian matrix of $T$ at $x$. The invertibility of $\nabla T$ is shown in Lemma 46. We rewrite this density as its Taylor expansion about $x$:
Substituting the above into (23) and applying Lemmas 3, 4, 5, and 6, we get
(24) for some satisfying
Furthermore, by using the expression for $\pi^*$ and some algebra, we see that
(25) Substituting the above into (24) gives , which implies that
5.2 Proof of Results for Inhomogeneous Diffusion
The proof of Theorem 3 is quite similar to the proof of Theorem 2, and can be found in the Appendix (Section B). We will highlight some additional difficulties in the proof compared to Theorem 2.
The heart of the proof lies in Lemma 15, which bounds the discretization error between the SDE (5) and one step of the discrete process (3), in the form of $W_2\big(\Phi(\pi^*), \pi^*\big)$. This is analogous to Lemma 1 in Section 5.1. Compared to the proof of Lemma 1, one additional difficulty is that we can no longer rely on Talagrand’s inequality (22). This is because $\pi^*$ is no longer guaranteed to be strongly log-concave. We instead use the fact that $\pi^*$ is sub-Gaussian to upper bound the Wasserstein distance in terms of the KL divergence (see Corollary 40).
Lemma 15 in turn relies crucially on bounding a one-step error expectation. This is proved in Lemma 16, which is the analog of Lemma 2 in Section 5.1. The additional difficulty is that we have to handle the effects of a diffusion matrix that depends on the position $x$. Also, Lemma 16 relies on the closed-form expression for $\pi^*$ in order to cancel out terms of order less than $\delta$.