Many randomized algorithms in machine learning can be analyzed as stochastic processes. For example, MCMC algorithms intentionally inject carefully designed randomness in order to sample from a desired target distribution. There is a second category of randomized algorithms for which the goal is optimization rather than sampling, and the randomness is viewed as a price to pay for computational tractability. For example, stochastic gradient methods for large-scale optimization use noisy estimates of a gradient because they are cheap. While such algorithms are not designed to sample from a target distribution, an algorithm of this kind has random outputs, and its behavior is determined by the distribution of its output. The results in this paper provide tools for analyzing the convergence of such algorithms as stochastic processes.
We establish a quantitative Central Limit Theorem for stochastic processes that have the following form:
where $x_k$ is the iterate, $\delta$ is a stepsize, $U$ is a potential function, and $\xi_k(x_k)$ is a zero-mean, position-dependent noise variable. Under certain assumptions, we show that (1) converges in 2-Wasserstein distance to the following SDE:
where $\sigma(x)$ is defined via $\sigma(x)\sigma(x)^{\top} = \mathbb{E}\big[\xi(x)\xi(x)^{\top}\big]$. The notion of convergence is summarized in the following informal statement of our main theorem:
In other words, under the right scaling of the step size, the long-term distribution of the iterates $x_k$ depends only on the expected drift $\nabla U$ and the covariance matrix of the noise $\xi_k$. As long as we know these two quantities, we can draw conclusions about the approximate behavior of (1) through (2), and ignore the other characteristics of the noise.
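To make this concrete, here is a toy simulation (our own illustration, not from the paper; the recursion $x_{k+1} = x_k - \delta\,\nabla U(x_k) + \sqrt{\delta}\,\xi_k$ and all names in the code are our assumptions, standing in for (1)) with a quadratic potential and decidedly non-Gaussian noise:

```python
import numpy as np

rng = np.random.default_rng(0)
d, delta, n_steps, n_chains = 2, 0.01, 3000, 4000

def grad_U(x):
    # U(x) = ||x||^2 / 2, so grad U(x) = x  (our illustrative choice)
    return x

def xi(shape):
    # Zero-mean, non-Gaussian noise with identity covariance (Rademacher entries)
    return rng.choice([-1.0, 1.0], size=shape)

x = np.zeros((n_chains, d))
for _ in range(n_steps):
    x = x - delta * grad_U(x) + np.sqrt(delta) * xi(x.shape)

# For this recursion the stationary per-coordinate variance is
# delta / (1 - (1 - delta)^2) = 1 / (2 - delta) ~ 0.5
print(x.var())
```

Despite the Rademacher noise being far from Gaussian, the long-run variance matches what is predicted from the noise covariance alone, illustrating the claim that only the drift and the noise covariance matter.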
Our result can be viewed as a general, quantitative form of the classical Central Limit Theorem, which can be thought of as showing that $x_k$ in (1) converges in distribution to a Gaussian, for the specific case of a quadratic potential and position-independent noise. Our result is more general: the potential $U$ can be any strongly convex function satisfying certain regularity assumptions, and the noise can vary with position. We show that $x_k$ converges to the invariant distribution of (2), which is not necessarily a normal distribution. The fact that the classical CLT is a special case implies that the rate in our main theorem cannot be improved in general. We discuss this in more detail in Section 4.1.1.
2 Related Work
Most relevant to this work is the quantitative CLT result due to Zhai (2018). In that paper, he bounded the 2-Wasserstein distance between the normalized sum of $n$ i.i.d. mean-zero random variables and a Gaussian with matching covariance, with an explicit dependence on the dimension. Prior to this, a number of other authors had also proved a $1/\sqrt{n}$ rate, but without establishing dimension dependence (see, e.g., Bonis, 2015; Rio et al., 2009).
Another relevant line of work is the recent work on quantitative rates for Langevin MCMC algorithms. Langevin MCMC algorithms can be thought of as discretizations of the Langevin diffusion SDE, which is essentially (2) for a constant diffusion matrix. Authors such as Dalalyan (2017) and Durmus and Moulines (2016) were able to prove quantitative convergence results for Langevin MCMC by bounding its discretization error relative to the Langevin SDE. The processes we study in this paper differ from Langevin MCMC in two crucial ways: first, the noise is not Gaussian, and second, the diffusion matrix in (5) varies with position.
Finally, this work is also motivated by results such as those due to Ruppert (1988), Polyak and Juditsky (1992), and Fan et al. (2018), which show that the iterates of the stochastic gradient algorithm with diminishing step size converge asymptotically to a normal distribution. (The limiting distribution of the appropriately rescaled iterates is Gaussian in this case because a smooth potential is locally quadratic around its minimizer.) These classical results are asymptotic and do not give explicit rates.
3 Definitions and Assumptions
We will study the discrete process given by
$U$ is the potential function,
$\theta_k$ are i.i.d. random variables which take values in some set $\Theta$ and have distribution $p_\theta$,
$\xi$ is the noise map, and
$\delta$ is a stepsize.
Let $\mu$ denote the invariant distribution of (3). Define
We will also study the continuous SDE given by
For convenience of notation, we define the following:
Let $\mu_k$ be the distribution of $x_k$ in (3).
Let $T$ be the transition map:
so that $\mu_{k+1} = T\mu_k$. Note that $T$ also depends on $\delta$, but we do not write this explicitly; the choice of $\delta$ should be clear from context.
where $\#$ denotes the pushforward operator; i.e., $f_{\#}p$ is the distribution of $f(x)$ when $x \sim p$, so that $T\mu_k$ is the pushforward of $\mu_k$ under one step of (3).
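As a small sanity check of the pushforward notation (our own illustration), pushing samples through a deterministic map approximates the pushforward distribution; e.g., the pushforward of $N(0,1)$ under $F(x) = 2x + 1$ is $N(1, 4)$:

```python
import numpy as np

rng = np.random.default_rng(1)
samples = rng.standard_normal(100_000)   # x ~ N(0, 1)
pushed = 2.0 * samples + 1.0             # samples from the pushforward under F(x) = 2x + 1

# The pushforward of N(0, 1) under F is N(1, 4); the empirical
# moments of the mapped samples should match.
print(pushed.mean(), pushed.var())
```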
We make the following assumptions about the potential $U$.
There exist constants $m$ and $L$ satisfying, for all $x$,
where $\|\cdot\|_2$ denotes the operator norm; see (8) below.
We make the following assumptions about the noise map $\xi$ and the noise distribution:
There exists a constant $B$ such that, for all $x$,
3.1 Basic Notation
For any two distributions $p$ and $q$, let $W_2(p, q)$ denote the 2-Wasserstein distance between $p$ and $q$. We overload this notation and sometimes write $W_2(X, Y)$ for random variables $X$ and $Y$ to denote the distance between their distributions.
Given a tensor $T \in \mathbb{R}^{d \times d \times d}$ and a vector $v \in \mathbb{R}^{d}$, we define the product $Tv \in \mathbb{R}^{d \times d}$ such that $(Tv)_{ij} = \sum_{k} T_{ijk} v_k$. Sometimes, to avoid ambiguity, we will write $T[v]$ instead.
We let $\|\cdot\|_2$ denote the operator norm:
It can be verified that, for tensors of any order, $\|\cdot\|_2$ is a norm over the corresponding space.
Finally, we use the notation $\langle \cdot, \cdot \rangle$ to denote two kinds of inner products:
For vectors $u, v \in \mathbb{R}^d$, $\langle u, v \rangle = u^{\top} v$ (the dot product).
For matrices $A, B \in \mathbb{R}^{d \times d}$, $\langle A, B \rangle = \mathrm{tr}(A^{\top} B)$ (the trace inner product).
Although the notation is overloaded, the usage should be clear from context.
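The notation above can be made concrete with NumPy (illustrative only; `np.linalg.norm(..., ord=2)` returns the largest singular value of a matrix, i.e. its operator norm):

```python
import numpy as np

A = np.array([[3.0, 0.0],
              [0.0, 1.0]])
B = np.eye(2)

# Operator norm: the largest singular value of A
op_norm = np.linalg.norm(A, ord=2)        # 3.0 for this diagonal A

# Vector inner product (the dot product)
u, v = np.array([1.0, 2.0]), np.array([3.0, 4.0])
dot = float(u @ v)                        # 1*3 + 2*4 = 11

# Trace inner product <A, B> = tr(A^T B)
trace_ip = float(np.trace(A.T @ B))       # 3 + 1 = 4
print(op_norm, dot, trace_ip)
```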
4 Main Results and Discussion
4.1 Homogeneous Noise
For all ,
Under these assumptions, the invariant distribution of (5) has the form
If, in addition, ,
An equivalent statement is that for sufficiently large $k$, and for sufficiently small $\delta$, we can bound
4.1.1 Relation to the Classical Central Limit Theorem
Our result can be viewed as a generalization of the classical central limit theorem, which deals with sequences of the form
for i.i.d. random variables with mean zero and covariance $\Sigma$. Thus, the sequence essentially has the same dynamics as $x_k$ from (3), with a quadratic potential and a variable stepsize. To the best of our knowledge, the best rate for the classical CLT is proven in Theorem 1.1 of Zhai (2018). It is also essentially tight, as Proposition 1.2 of Zhai (2018) provides a nearly matching lower bound.
Our bound in Theorem 2 (equivalently, (12)) also shrinks as the number of steps grows. We note that the sequence studied in Theorem 2 differs from the CLT sequence, as the stepsize in Theorem 2 is constant (i.e., does not depend on the iteration index). We stated Theorem 2 for constant step sizes mainly to simplify the proof. Our proof technique can also be applied to the variable-stepsize setting; in Corollary 48 in the appendix, we use the results of Theorem 4 to prove a bound with the same rate as in the constant-stepsize case.
This shows that the dependence on the number of steps in Theorem 2 is tight. Our dimension dependence is worse than the optimal dependence established by Zhai (2018). However, our bound applies in a much more general setting, not just to a quadratic potential.
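As a quick numerical illustration of this special case (our own toy check, not from the paper), normalized sums of i.i.d. Rademacher variables match the Gaussian moments as $n$ grows, exactly as the classical CLT predicts:

```python
import numpy as np

rng = np.random.default_rng(2)

def normalized_sum(n, n_samples=50_000):
    # S_n = n^{-1/2} * (xi_1 + ... + xi_n) for i.i.d. Rademacher xi_i
    xi = rng.choice([-1.0, 1.0], size=(n_samples, n))
    return xi.sum(axis=1) / np.sqrt(n)

for n in (4, 64, 1024):
    s = normalized_sum(n)
    # The variance stays at 1; the fourth moment approaches the Gaussian
    # value 3 (for Rademacher sums it equals exactly 3 - 2/n).
    print(n, round(s.var(), 3), round((s ** 4).mean(), 3))
```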
4.2 Inhomogeneous Noise
We now examine the convergence of (3) in a more general setting, in which the noise depends on the position.
In addition to the assumptions in Section 3, we make some additional assumptions about how the noise $\xi$ depends on the position $x$. We begin by defining some notation. For all $x$ and $\theta$, we will let $\nabla \xi$, $\nabla^2 \xi$, and $\nabla^3 \xi$ denote the first, second, and third derivatives of $\xi$ with respect to $x$, i.e.:
We will assume that $\nabla \xi$, $\nabla^2 \xi$, and $\nabla^3 \xi$ satisfy the following regularity conditions:
There exists an $L$ that satisfies Assumption 1 and, for all $x$ and almost surely over $\theta$:
For any distributions and , .
Finally, we assume that the invariant distribution of (5) is regular in the following sense:
There exists a constant such that the log-density of the invariant distribution of (5) satisfies, for all $x$,
If and are bounded by , then 2. and 3. are implied by 1., but we state the assumption this way for convenience.
4.2.1 A motivating example
Before we state our main theorem, it will help to motivate some of our assumptions by considering an application to the stochastic gradient algorithm.
Consider a classification problem in which one tries to learn the parameters of a model. One is given $n$ datapoints and a likelihood function, and one tries to minimize the average negative log-likelihood
The stochastic gradient algorithm proceeds as follows:
The mean and variance of the stochastic gradient are
Assumption 2 is satisfied if, for some constant,
Furthermore, the derivatives of the noise map can be expressed in terms of the corresponding derivatives of the loss, so Assumption 4 is guaranteed by the loss function having Lipschitz derivatives (in the model parameters) up to fourth order.
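The structure of the stochastic gradient noise can be illustrated with a toy least-squares problem (a stand-in for the likelihood objective above; the model and all names are our assumptions, not the paper's). The noise is the single-datapoint gradient minus the full gradient: it has mean zero, and its covariance varies with the position, which is exactly the inhomogeneity this section addresses:

```python
import numpy as np

rng = np.random.default_rng(3)
n, d = 200, 3
A = rng.standard_normal((n, d))
b = rng.standard_normal(n)

def full_grad(x):
    # U(x) = (1/2n) * sum_i (a_i . x - b_i)^2  (toy objective, our assumption)
    return A.T @ (A @ x - b) / n

def sgd_noise(x, m=20_000):
    # xi = gradient at a uniformly sampled datapoint, minus the full gradient
    idx = rng.integers(n, size=m)
    resid = A[idx] @ x - b[idx]
    return A[idx] * resid[:, None] - full_grad(x)

traces = {}
for name, point in (("x = 0", np.zeros(d)), ("x = 1", np.ones(d))):
    xi = sgd_noise(point)
    traces[name] = float(np.trace(np.cov(xi.T)))
    # mean is ~0 at every position; the covariance changes with the position
    print(name, np.linalg.norm(xi.mean(axis=0)), traces[name])
```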
We will now state our main theorem for this section:
5 Proof of Main Theorems
5.1 Proof of Results for Homogeneous Diffusion
Let $\mu_0$ be an arbitrary initial distribution, and let $x_k$ be defined as in (3).
Let $\delta_0$ be some arbitrary constant. For any step size $\delta \le \delta_0$, the Wasserstein distance between $\mu_k$ and the invariant distribution $\mu$ is upper bounded as
Recall our definition of the transition map $T$ in (7). Let $T^k$ denote $k$ repeated applications of $T$, so that $\mu_k = T^k \mu_0$. Our objective is thus to bound $W_2(T^k \mu_0, \mu)$.
We first use the triangle inequality to split the objective into two terms:
To bound the second term of (18), we use an argument adapted from Zhai (2018):
Here the third inequality is by induction. This reduces our problem to bounding the one-step divergence between (3) and (5) when both are started from the same distribution. We apply Lemma 1 below to get
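Schematically, the induction above has the following shape (here $T$ is the one-step transition map, $P_\delta$ is the time-$\delta$ SDE semigroup, $\mu$ is the invariant distribution of the SDE, $\varepsilon$ is the one-step discretization error, and $e^{-\alpha\delta}$ is a Wasserstein contraction rate for the SDE; these symbols are ours, introduced for illustration):

\begin{align*}
W_2(T^k\mu_0, \mu)
 &\le W_2\!\big(T(T^{k-1}\mu_0),\, P_\delta(T^{k-1}\mu_0)\big) + W_2\!\big(P_\delta(T^{k-1}\mu_0),\, P_\delta\mu\big) \\
 &\le \varepsilon + e^{-\alpha\delta}\, W_2(T^{k-1}\mu_0, \mu) \\
 &\le \varepsilon \sum_{j=0}^{k-1} e^{-\alpha\delta j} + e^{-\alpha\delta k}\, W_2(\mu_0, \mu)
 \;\le\; \frac{\varepsilon}{1 - e^{-\alpha\delta}} + e^{-\alpha\delta k}\, W_2(\mu_0, \mu),
\end{align*}

using $P_\delta \mu = \mu$ in the second line, since $\mu$ is invariant for the SDE.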
Let . Then for any ,
(This lemma is similar in spirit to Lemma 1.6 in Zhai (2018).)
Using Talagrand's inequality and the fact that $U$ is strongly convex, we can upper bound the squared 2-Wasserstein distance by the KL divergence for any distribution which has a density with respect to the invariant distribution, i.e.:
Let . For any , , and ,
By the change of variables formula, we have
where $J$ denotes the Jacobian matrix of the transition map. Its invertibility is shown in Lemma 46. We rewrite the density as its Taylor expansion about $x$:
for some satisfying
Furthermore, by using the expression and some algebra, we see that
Substituting the above into (24) gives , which implies that
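For concreteness, the second-order Taylor expansion with explicit remainder used in this step has the generic form (with $g$ a placeholder for the density being expanded; our notation, introduced for illustration):

\[
g(y) \;=\; g(x) \;+\; \big\langle \nabla g(x),\, y - x \big\rangle \;+\; \tfrac{1}{2}\,\big\langle y - x,\, \nabla^2 g(z)\,(y - x) \big\rangle
\]

for some $z = x + t(y - x)$ with $t \in [0, 1]$, so the remainder term is controlled by a bound on $\|\nabla^2 g\|_2$ over the segment between $x$ and $y$.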
5.2 Proof of Results for Inhomogeneous Diffusion
The heart of the proof lies in Lemma 15, which bounds the discretization error between the SDE (5) and one step of the discrete process (3). This is analogous to Lemma 1 in Section 5.1. Compared to the proof of Lemma 1, one additional difficulty is that we can no longer rely on Talagrand's inequality (22), because the invariant distribution is no longer guaranteed to be strongly log-concave. We instead use the fact that it is subgaussian to upper bound the Wasserstein distance in terms of the KL divergence (see Corollary 40).
Lemma 15 in turn relies crucially on bounding a key error expression. This is proved in Lemma 16, which is the analog of Lemma 2 in Section 5.1. The additional difficulty is that we have to handle the effects of a diffusion matrix that depends on the position $x$. Also, Lemma 16 relies on the closed-form expression for the invariant distribution in order to cancel out terms of order less than