Quantitative W_1 Convergence of Langevin-Like Stochastic Processes with Non-Convex Potential and State-Dependent Noise

Xiang Cheng et al. · July 7, 2019

We prove quantitative convergence rates at which discrete Langevin-like processes converge to the invariant distribution of a related stochastic differential equation. We study the setup where the additive noise can be non-Gaussian and state-dependent and the potential function can be non-convex. We show that the key properties of these processes depend on the potential function and the second moment of the additive noise. We apply our theoretical findings to studying the convergence of Stochastic Gradient Descent (SGD) for non-convex problems and corroborate them with experiments using SGD to train deep neural networks on the CIFAR-10 dataset.


1 Introduction

Stochastic Gradient Descent (SGD) is one of the workhorses of modern day machine learning. In many nonconvex optimization problems, such as training deep neural networks, SGD is able to produce solutions with good generalization error. Further, there is evidence that the generalization error of an SGD solution can be significantly better than that of Gradient Descent (GD) [12]. This suggests that, to understand the behavior of SGD, it is not enough to consider the limiting cases (such as small step-size or large batch-size) in which it degenerates to GD. We take an alternate view of SGD as a sampling algorithm, and aim to understand its convergence to an appropriate stationary distribution.

There has been rapid recent progress in understanding the finite-time behavior of MCMC methods, by comparing them to stochastic differential equations (SDEs), such as the Langevin diffusion. It is natural in this context to think of SGD as a discrete time approximation of an SDE. But there are two significant barriers to extending previous analyses to the case of SGD, because those analyses are mostly restricted to isotropic Gaussian noise. First, the noise in SGD can be far from Gaussian. For instance, sampling from a minibatch leads to a discrete distribution. Second, the noise depends significantly on the current state (the optimization variable). For instance, if the objective is an average over a training sample of a non-negative loss, as the objective approaches zero, the noise variance of minibatch SGD goes to zero. Any attempt to cast SGD as an SDE must thus be able to handle this kind of noise.

This motivates the study of Langevin MCMC-like methods that have a state-dependent noise term:

x_{k+1} = x_k − δ ∇U(x_k) + √δ ξ(x_k, ζ_k),    (1)

where x_k ∈ ℝ^d is the state variable at time k, δ is the step-size, U : ℝ^d → ℝ is a potential, ξ : ℝ^d × Z → ℝ^d is the noise function, and the ζ_k are sampled iid from some set Z (for example, in minibatch SGD, Z is the set of subsets of indices in the training sample. We discuss the SGD example in more detail in Section 6).

Throughout this paper, we assume that E_ζ[ξ(x, ζ)] = 0 for all x. We define a matrix-valued function M : ℝ^d → ℝ^{d×d} to be the square root of the covariance matrix of ξ, i.e. for all x,

M(x) = (E_ζ[ξ(x, ζ) ξ(x, ζ)^T])^{1/2},

where for a symmetric positive semidefinite matrix A, A^{1/2} is the unique symmetric positive semidefinite matrix B such that B² = A.
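As a concrete illustration (our own sketch, not the authors' code), the following numpy snippet estimates the covariance of ξ(x, ·) from samples and takes its symmetric PSD square root via an eigendecomposition; the noise sampler `xi` is a hypothetical stand-in for whatever noise the application supplies.

```python
import numpy as np

def psd_sqrt(A):
    """Unique symmetric PSD square root of a symmetric PSD matrix A."""
    w, V = np.linalg.eigh(A)           # A = V diag(w) V^T
    w = np.clip(w, 0.0, None)          # guard against tiny negative eigenvalues
    return (V * np.sqrt(w)) @ V.T

def M_of_x(x, xi, n_samples=10_000, rng=None):
    """Monte Carlo estimate of M(x) = (E[xi xi^T])^{1/2} for zero-mean noise xi(x, zeta)."""
    rng = rng or np.random.default_rng(0)
    samples = np.stack([xi(x, rng) for _ in range(n_samples)])  # shape (n, d)
    cov = samples.T @ samples / n_samples  # E[xi xi^T]; no centering since E[xi] = 0
    return psd_sqrt(cov)
```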

Given the above definition of M, it is natural to consider the following stochastic process:

dx_t = −∇U(x_{⌊t/δ⌋δ}) dt + M(x_{⌊t/δ⌋δ}) dB_t.    (2)

It can be verified that at discrete time intervals, (2) is equivalent to

x_{(k+1)δ} = x_{kδ} − δ ∇U(x_{kδ}) + √δ M(x_{kδ}) θ_k,

where θ_k ∼ N(0, I_d). Thus (2) is essentially (1) with the noise ξ(x, ζ) replaced by the simpler Gaussian noise M(x) θ.
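Under this reconstruction of the dynamics, one discrete step of each process can be sketched as follows; `grad_U`, `xi`, and `M` are placeholders for the potential gradient, the noise function, and its covariance square root.

```python
import numpy as np

def step_process_1(x, delta, grad_U, xi, rng):
    """One step of (1): x' = x - delta * grad_U(x) + sqrt(delta) * xi(x, zeta)."""
    return x - delta * grad_U(x) + np.sqrt(delta) * xi(x, rng)

def step_process_2(x, delta, grad_U, M, rng):
    """One step of (2): same drift, Gaussian noise with matched covariance M(x)^2."""
    theta = rng.standard_normal(x.shape)   # theta ~ N(0, I_d)
    return x - delta * grad_U(x) + np.sqrt(delta) * (M(x) @ theta)
```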

Finally, we can consider the continuous-time limit of (2):

dx_t = −∇U(x_t) dt + M(x_t) dB_t.    (3)

We will let p* denote the invariant distribution of (3). In Theorem 1, we establish quantitative rates at which (2) converges to p*. In Theorem 2, we establish quantitative rates at which (1) also converges to p*. Notice in particular that the only property of the noise function ξ in (1) that the SDE (3) retains is the covariance function M.

Our contributions are as follows:

  1. We prove a quantitative convergence rate for (2) to the invariant distribution p* of (3) in Theorem 1. This is the first quantitative rate for overdamped Langevin MCMC when the diffusion matrix is state-dependent. Our rate is comparable to earlier work with similar nonconvexity assumptions, but assuming a constant diffusion matrix [2, 4, 15].

  2. We prove a quantitative convergence rate for (1) to the invariant distribution p* of (3) in Theorem 2. Prior to this work, convergence of processes of the form (1) had only been established in [3], under much more restrictive convexity assumptions.

  3. Based on our theory, we describe a "large-noise" version of SGD and empirically evaluate its generalization performance. See Section 6.

2 Related work

There has been a long line of work in the study of convergence properties of stochastic processes. We review the ones most relevant to our work here.

Recent work on non-asymptotic convergence rates of Langevin MCMC algorithms began with [5] and [6], which established quantitative rates for unadjusted Langevin MCMC under log-concavity assumptions (i.e. (2) with U convex and a constant diffusion matrix). Another line of work [7, 8, 2, 4, 15] analyzed the convergence of MCMC algorithms under nonconvexity assumptions. In particular, [4] and [15] considered the overdamped Langevin MCMC algorithm under assumptions similar to Assumption A, but still with a constant diffusion matrix. Finally, [3] analyzed the convergence of (1) to (3) under much more restrictive assumptions: convexity of U, together with the requirement that (1) be a contractive process. In this paper, we show that convergence is possible as long as ξ is sufficiently regular and M is lower-bounded globally by some constant. Following our presentation of Theorems 1 and 2, we compare our results with some of the above-mentioned results.

On the other hand, the authors of [16, 11] have drawn connections between SGD and an SDE of the form (3). Furthermore, [1] proved quantitative rates at which iterates of (rescaled) SGD approach a normal distribution, assuming strong convexity around a local minimum. In the specific setting of deep learning, the authors of [12, 11, 13] studied the generalization properties of SGD. [11] in particular noted that the generalization error of SGD in deep learning seems strongly correlated with the magnitude of the noise covariance, and suggested that this may be explained by considering the underlying SDE.

3 An illustration of the importance of considering inhomogeneity

To motivate further discussion, we present a simple example to illustrate how a state-dependent diffusion σ can significantly skew the invariant distribution of (3) away from e^{−U}.

Figure 1: A 1-dimensional example exhibiting the importance of inhomogeneity: a simple construction showing how σ can affect the shape of the invariant distribution. Panels: (a) potential function U; (b) diffusion function σ; (c) the effective potential −log p* of the invariant distribution; (d) samples obtained from simulating the process (2). While U has two local minima, −log p* retains only the shallower one.

We construct the potential U to have two local minima: one shallow minimum and one deeper minimum; a plot of U can be found in Figure 1(a). The diffusion function σ is constructed to have increasing magnitude at larger values of x; a plot of σ can be found in Figure 1(b). This has the effect of biasing the invariant distribution towards smaller values of x.

Given U and σ, we consider the SDE in (3). Let p* be the invariant distribution under this SDE. Using the Fokker–Planck equation, we obtain an explicit expression for the invariant density:

p*(x) ∝ (1/σ²(x)) exp(−∫₀ˣ 2U′(y)/σ²(y) dy).

We can then compute the effective potential −log p*(x), which we plot in Figure 1(c).

Remarkably, −log p* has only one local minimum: the deeper minimum of U, located where the diffusion σ is large, was completely smoothed over. This is very different from the case of homogeneous noise (e.g. constant σ(x) = √2), in which p*(x) ∝ e^{−U(x)}. We also simulate (3) (using (2)) for the given U and σ with 1000 samples (each simulated for 1000 steps), and plot the histogram in Figure 1(d).
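The exact U and σ used in the figure did not survive extraction, so the sketch below substitutes a hypothetical double-well potential and a diffusion that grows with |x|; it simulates process (2) in one dimension and evaluates the Fokker–Planck density above on a grid.

```python
import numpy as np

# Hypothetical stand-ins for the paper's U and sigma:
def U_prime(x):
    return x**3 - 2.0 * x**2 + 0.5 * x   # derivative of a double-well potential

def sigma(x):
    return 0.5 + 1.5 * np.abs(x)         # diffusion, larger at large |x|

def simulate(n_paths=1000, n_steps=1000, delta=1e-3, rng=None):
    """Euler-Maruyama simulation of dx = -U'(x) dt + sigma(x) dB (process (2) in 1-d)."""
    rng = rng or np.random.default_rng(1)
    x = np.zeros(n_paths)
    for _ in range(n_steps):
        x += -U_prime(x) * delta + sigma(x) * np.sqrt(delta) * rng.standard_normal(n_paths)
    return x

def log_p_star(xs):
    """Unnormalized log-density of the invariant distribution via the 1-d
    Fokker-Planck formula: p*(x) ~ (1/sigma^2(x)) exp(-int_0^x 2 U'(y)/sigma^2(y) dy)."""
    integrand = 2.0 * U_prime(xs) / sigma(xs) ** 2
    increments = 0.5 * (integrand[1:] + integrand[:-1]) * np.diff(xs)  # trapezoid rule
    integral = np.concatenate([[0.0], np.cumsum(increments)])
    return -integral - 2.0 * np.log(sigma(xs))
```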

4 Assumptions and definitions

In this section, we state the assumptions and definitions that will be required for our main theoretical results in Theorem 1 and Theorem 2.

4.1 Assumptions

Assumption A

We assume that the potential U : ℝ^d → ℝ satisfies:

  1. The function U is continuously differentiable on ℝ^d and has Lipschitz continuous gradients; that is, there exists a positive constant L_U such that for all x, y ∈ ℝ^d, ‖∇U(x) − ∇U(y)‖₂ ≤ L_U ‖x − y‖₂.

  2. The function U has a stationary point at zero: ∇U(0) = 0.

  3. There exist constants m, R > 0 such that for all x, y with ‖x − y‖₂ ≥ R, we have ⟨∇U(x) − ∇U(y), x − y⟩ ≥ m ‖x − y‖₂²; in addition, for all x, y with ‖x − y‖₂ ≤ R, we have ⟨∇U(x) − ∇U(y), x − y⟩ ≥ −L_U ‖x − y‖₂².

Assumption B

We make the following assumptions on ξ and M:

  1. For all x, ξ(x, ·) has zero mean, i.e. E_ζ[ξ(x, ζ)] = 0.

  2. For almost every ζ, and for all x, the noise is bounded: ‖ξ(x, ζ)‖₂ is at most a fixed constant.

  3. For almost every ζ, and for all x, y, the noise is Lipschitz in its first argument: ‖ξ(x, ζ) − ξ(y, ζ)‖₂ ≤ L_ξ ‖x − y‖₂.

  4. There is a positive constant λ such that for all x, M(x) ⪰ λ I.

Remark 1

We discuss these assumptions in a specific setting in Section 6.2.

In the rest of this paper, we will work with the matrix-valued function M defined above:

M(x) = (E_ζ[ξ(x, ζ) ξ(x, ζ)^T])^{1/2}.    (4)

From the assumptions above, we can prove that M is bounded and Lipschitz; see Lemmas 3 and 4 in Appendix C for details. These properties will be crucial in ensuring convergence.

4.2 Definitions and notation

For a k-th order tensor T and a vector v, we define the product T[v] to be the (k−1)-th order tensor obtained by contracting the last index: (T[v])_{i₁,…,i_{k−1}} = Σ_j T_{i₁,…,i_{k−1},j} v_j.

For a tensor T, we use ‖T‖_op to denote the operator norm:

‖T‖_op = sup_{‖v₁‖₂ = ⋯ = ‖v_k‖₂ = 1} |T[v₁][v₂]⋯[v_k]|.

It can be verified that for all orders k, ‖·‖_op is a norm over k-th order tensors. Furthermore, when k = 1, this is the Euclidean norm, and when k = 2, this is the largest singular value.

We use ⟨·, ·⟩ to denote both vector and matrix inner products:

  1. For vectors u, v ∈ ℝ^d, ⟨u, v⟩ = Σᵢ uᵢvᵢ (the dot product).

  2. For matrices A, B ∈ ℝ^{d×d}, ⟨A, B⟩ = tr(A^T B) (the trace inner product).

We will use ⊗ to denote the outer product. For two vectors u, v ∈ ℝ^d, u ⊗ v is the matrix with entries (u ⊗ v)_{ij} = uᵢvⱼ. We extend this notation to matrix–vector outer products: if A ∈ ℝ^{d×d} and v ∈ ℝ^d, then A ⊗ v is the third-order tensor with (A ⊗ v)_{ijk} = A_{ij}v_k, and similarly (v ⊗ A)_{ijk} = vᵢA_{jk}. We will use v^{⊗2} to denote v ⊗ v and v^{⊗3} to denote v ⊗ v ⊗ v.
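As a quick sanity check of this notation (our own sketch), the products above map onto numpy as follows:

```python
import numpy as np

rng = np.random.default_rng(2)
u, v = rng.standard_normal(3), rng.standard_normal(3)
A, B = rng.standard_normal((3, 3)), rng.standard_normal((3, 3))

vec_inner = u @ v                                  # <u, v> = sum_i u_i v_i
mat_inner = np.trace(A.T @ B)                      # <A, B> = tr(A^T B)
outer_vv = np.einsum('i,j->ij', u, v)              # (u x v)_{ij} = u_i v_j
outer_Av = np.einsum('ij,k->ijk', A, v)            # (A x v)_{ijk} = A_{ij} v_k
T_contract = np.einsum('ijk,k->ij', outer_Av, u)   # tensor-vector product T[u]
```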

Given any distributions p and q over ℝ^d, a joint distribution γ over ℝ^d × ℝ^d is a coupling between p and q if its marginals are equal to p and q respectively. Let Γ(p, q) denote the space of all couplings between p and q. Then the 1-Wasserstein distance is defined as

W₁(p, q) = inf_{γ ∈ Γ(p, q)} E_{(x, y) ∼ γ}[‖x − y‖₂].    (5)
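In one dimension, W₁ between two empirical samples can be computed exactly; scipy exposes this directly. A small sketch with synthetic Gaussian samples (our own illustration):

```python
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(3)
x = rng.normal(0.0, 1.0, size=5000)   # samples from p
y = rng.normal(0.5, 1.2, size=5000)   # samples from q
print(wasserstein_distance(x, y))     # empirical W_1(p, q) in 1-d
```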

Our proof centers around a function f, defined in Appendix E. Intuitively, f(‖x − y‖₂) plays the role of the metric ‖x − y‖₂. We define an f-Wasserstein distance as follows:

W_f(p, q) = inf_{γ ∈ Γ(p, q)} E_{(x, y) ∼ γ}[f(‖x − y‖₂)].    (6)

We overload notation and sometimes write W_f(x, y) for random variables x and y to denote the distance between their distributions. We will prove our convergence results in W_f, which then implies convergence in W₁ by using Lemma 15.4.

Given a random variable x, we use p_x to denote its distribution.

5 Main results

In this section, we present our main convergence results beginning with convergence under Gaussian noise in Section 5.1 and proceeding to the non-Gaussian case in Section 5.2.

5.1 Convergence under Gaussian noise

In this section, we study the convergence rate of (2) to p*, the invariant distribution of (3). We will assume that U and ξ satisfy Assumptions A and B.

Theorem 1

Let ε be some target accuracy satisfying the stated upper bound, and define the step-size δ and the number of steps K in terms of ε and the problem parameters accordingly. Let x_k be as defined in (2), initialized at a suitable x₀ (see Remark 3). Then we have

W₁(p_{x_K}, p*) ≤ ε,

where p* is the invariant distribution of (3).

Remark 2

(For ease of reference: L_U, m, and R are from Assumption A, the Lipschitz constant of M is from Lemma 4, and L_ξ and λ are from Assumption B.)

Remark 3

Finding a suitable x₀ can be done very quickly using gradient descent with respect to U. The convergence of gradient descent to the ball of radius R is very fast, due to Assumption A.

Remark 4

In a nontrivial instance, Theorem 1 says that one can get ε error in W₁ between the iterate of (2) and p* after the stated number of steps; in other words, the theorem gives an explicit convergence rate. Substituting in the parameters, and after some algebra, we see how the step-size and step count simplify for a sufficiently small ε; the cases below examine the resulting rate.

Remark 5

As illustrated in Section 6.2, the constant from Assumption B.3 should be thought of as a regularization term which can be set arbitrarily large. In the following discussion, we will assume that the corresponding term is dominated by the remaining terms.

To gain intuition about this rate, let us consider what it looks like under a sequence of increasingly weaker assumptions:

a. Strongly convex potential, Gaussian noise: U is m-strongly convex and L-smooth, and the noise is Gaussian with a constant covariance matrix. (In reality we need to consider a truncated Gaussian so as not to violate Assumption B.2, but this is a minor issue.) In this case the parameters collapse and we recover the same rate as obtained in [6]. However, [6] obtains a bound in a metric stronger than W₁.

b. Non-convex potential, Gaussian noise: U is not strongly convex but satisfies Assumption A, and the noise is Gaussian with a constant covariance matrix. This is the setting studied by [4] and [15]. The rate we recover is in line with [4], and matches the best rate obtainable from [15].

c. Non-convex potential, inhomogeneous noise: U satisfies Assumption A, and ξ satisfies Assumption B. To simplify matters, suppose the problem is rescaled so that the noise lower bound equals 1. Then the main additional term compared to setting (b) involves the Lipschitz constant L_ξ of the noise. This seems to suggest that an L_ξ-Lipschitz noise can play a similar role in hindering mixing as an L-Lipschitz non-convex drift.

In the case when the dimension is high, computing M(x) could be difficult. But if, for each x, one has access to samples whose covariance is M(x)², then one can approximate the Gaussian term M(x)θ via the Central Limit Theorem by drawing a sufficiently large number of samples. The proof of Theorem 1 can be modified relatively easily to accommodate this. We discuss this in further detail in Appendix A.4.
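A minimal sketch of this averaging trick (our own illustration; `xi` is again a hypothetical noise sampler):

```python
import numpy as np

def gaussianized_noise(x, xi, n, rng):
    """Approximate M(x) @ theta, theta ~ N(0, I), without forming M(x):
    by the CLT, (1/sqrt(n)) * sum_i xi(x, zeta_i) tends to N(0, M(x)^2)
    as n grows, since the xi(x, zeta_i) are iid with mean 0 and covariance M(x)^2."""
    samples = np.stack([xi(x, rng) for _ in range(n)])  # n iid draws of xi(x, .)
    return samples.sum(axis=0) / np.sqrt(n)
```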

5.2 Convergence under non-Gaussian noise

In this section, we prove the convergence of (1) to the invariant distribution of (3). We will assume that U and ξ satisfy Assumptions A and B.

Theorem 2

Let ε be some target accuracy satisfying the stated upper bound, and define the relevant constants accordingly, where c is some universal constant specified in the proof.

Let the step-size be chosen sufficiently small. Let x_k have dynamics as defined in (1), and let p_k denote its distribution. Then, for any k satisfying the threshold

(7)

we get W₁(p_k, p*) ≤ ε.

Remark 6

For a desired accuracy ε, the number of steps needed is polynomial in 1/ε, with a considerably worse dependence than in Theorem 1. This is because we need to take many steps of (1) in order to approximate a single step of (2). For details, see the coupling construction in Equations (28)–(31).

In [3], the authors proved a convergence result of a similar flavor, i.e. that a sequence of the form (1) converges to the invariant distribution of (3). The dependence on ε in their paper is better than ours, but their proof makes a number of much stronger assumptions. In particular, they assume that U is strongly convex and that (1) is contractive.

6 Application to stochastic gradient descent

6.1 SGD as SDE

In this section, we will try to cast SGD in the form of (1). We will consider an objective of the form

U(x) = (1/n) Σ_{i=1}^n f_i(x).    (8)

We reserve the letter B to denote a random batch of indices from {1, …, n}, sampled with replacement (we will specify the batch size as needed), and write ∇f_B(x) = (1/|B|) Σ_{i∈B} ∇f_i(x).

For a single sample, i.e. |B| = 1, we define

Σ(x) = E_i[(∇f_i(x) − ∇U(x)) (∇f_i(x) − ∇U(x))^T],    (9)

i.e. Σ(x) is the covariance matrix of a single sampled gradient minus the true gradient.

A standard run of SGD, with minibatch size b, has the following form:

x_{k+1} = x_k − δ ∇f_B(x_k).    (10)

Notice that (10) is in the form of (1), with ξ(x, B) = √δ (∇U(x) − ∇f_B(x)). The covariance matrix of this noise term is (δ/b) Σ(x).

Because the magnitude of the noise covariance scales with δ/b, it follows that as δ/b → 0, (10) converges to the deterministic gradient-flow ODE.
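The correspondence can be sketched as follows (our own illustration; `grads` is a hypothetical helper returning the n × d matrix of per-sample gradients):

```python
import numpy as np

def sgd_step(x, delta, grads, batch_size, rng):
    """One minibatch-SGD step (10): x' = x - delta * mean_{i in B} grad f_i(x).
    Sampling is with replacement, as in Section 6.1."""
    G = grads(x)                                        # (n, d) per-sample gradients
    idx = rng.integers(0, G.shape[0], size=batch_size)  # batch B
    return x - delta * G[idx].mean(axis=0)

# Noise bookkeeping: with xi(x, B) = sqrt(delta) * (grad U(x) - grad f_B(x)),
# Cov[xi] = (delta / b) * Sigma(x), so the effective diffusion scales with delta / b.
```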

However, the loss of randomness as δ/b → 0 might not be desirable. It has been observed in certain cases that as SGD approaches GD, through either small step-size or large batch-size, the generalization error goes up [11]. In Section 6.3.1, we present a set of empirical results supporting this claim.

We argue that the right way to take the limit of SGD is to fix the ratio δ/b in (10). Specifically, we define the continuous limit of (10) as

dx_t = −∇U(x_t) dt + √(δ/b) Σ(x_t)^{1/2} dB_t.    (11)

(Recall that the noise covariance is (δ/b) Σ(x) in (10).) Notice that the above is of the form (3), with M(x) = √(δ/b) Σ(x)^{1/2}, which matches the covariance matrix of (10). Our definition is thus motivated by Theorem 2, which states that (1) converges to (3).

Let δ′ be some stepsize, and let c be an arbitrary constant. Consider the following stochastic sequence:

x_{k+1} = x_k − δ′ ∇f_B(x_k) + c δ′ (∇f_{B′}(x_k) − ∇f_{B′′}(x_k)),    (12)

where B, B′, B′′ are 3 mini-batches of size b, sampled iid and with replacement. Intuitively, in addition to the SGD noise, we inject additional noise by adding the difference between two independently sampled mini-batches.

We first note that (12) is in the form of (1), with

ξ(x, (B, B′, B′′)) = √δ′ (∇U(x) − ∇f_B(x)) + c √δ′ (∇f_{B′}(x) − ∇f_{B′′}(x)).    (13)

The noise covariance matrix is (δ′/b)(1 + 2c²) Σ(x).

If we pick c such that

(δ′/b)(1 + 2c²) = δ/b,    (14)

then we guarantee that the noise covariance of (12) is (δ/b) Σ(x), which matches (11). By Theorem 2, for sufficiently small δ′ and sufficiently large k, x_k of (12) converges to the invariant distribution of (11).

We stress that we are not proposing (12) as a practical algorithm. The reason that (12) is interesting is that it gives us a family of processes which converges to (11) and is implementable in practice. In Section 6.3.2, we implement (12) and evaluate its performance. From the experiments, it appears that (12) has similar test accuracy to vanilla SGD (10) with step-size δ and batch size b. We thus hypothesize that the test accuracy depends largely on the shape and scale of the noise covariance matrix, which implies that the generalization properties of (10) for large δ/b should extend to its limit (11).
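The precise scaling in (12) did not survive extraction, so the sketch below uses one self-consistent parameterization of the injected-noise update and the covariance-matching choice of c from (14); `grads` is the same hypothetical per-sample gradient helper as before.

```python
import numpy as np

def injected_noise_sgd_step(x, delta, grads, b, c, rng):
    """One step of a (12)-style update under our parameterization:
    x' = x - delta * grad f_B(x) + c * delta * (grad f_B'(x) - grad f_B''(x)),
    with B, B', B'' iid batches of size b.
    Noise covariance: (delta / b) * (1 + 2 c^2) * Sigma(x)."""
    G = grads(x)

    def batch_grad():
        idx = rng.integers(0, G.shape[0], size=b)
        return G[idx].mean(axis=0)

    gB, gB1, gB2 = batch_grad(), batch_grad(), batch_grad()
    return x - delta * gB + c * delta * (gB1 - gB2)

def match_c(delta_prime, b, delta, b_target):
    """Pick c as in (14) so that (delta'/b)(1 + 2 c^2) = delta/b_target,
    matching the target relative variance of the limit (11)."""
    ratio = (delta / b_target) / (delta_prime / b)
    assert ratio >= 1.0, "target relative variance must exceed the base run's"
    return np.sqrt((ratio - 1.0) / 2.0)
```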

We remark that [10] proposed a different way of injecting noise: multiplying the sampled gradient by suitably scaled Gaussian noise.

6.2 Satisfying Assumptions in Section 4.1

Finally, we remark on how the function U defined in (8), along with the stochastic sequence defined in (12), can satisfy the assumptions in Section 4.1.

First, let us additionally assume that for each i, f_i has the form

f_i(x) = g_i(x) + r(x),

where r is a regularizer that is strongly convex outside a ball of radius R, has a minimum at 0, and has Lipschitz gradients. These additional assumptions make sense when we are only interested in U over a bounded region, so r plays the role of a function that keeps us within that region.

It can immediately be verified that U then satisfies Assumption A.

The noise term in (13) satisfies Assumption B.1 by definition, and satisfies Assumption B.3 when the individual gradients ∇f_i are Lipschitz. Assumption B.2 holds if ‖∇f_i(x) − ∇U(x)‖₂ is bounded for all x and i, i.e. the sampled gradient does not deviate from the true gradient by more than a constant. We will need to assume Assumption B.4 directly, as it is a property of the distribution of the sampled gradients.

6.3 Experiments

In this section, we present experimental results. In all experiments, we use two different neural network architectures on the CIFAR-10 dataset [14]. The first architecture is a simple convolutional neural network, which we call CNN in the following, and the other is the VGG19 network [17]. To make our experiments consistent with the setting of SGD, we do not use batch normalization or dropout. In all of our experiments, we run the SGD algorithm for enough epochs that it converges sufficiently.

Let Σ(x) be the covariance matrix of a single sample, as defined in (9). For all SGD variants studied in this section, the noise covariance matrix will be some scaling of Σ. We define the relative variance of a sequence as the scaling of Σ in its continuous limit. For an SGD sequence with stepsize η and batchsize b, one can verify that this scaling is η/b. The authors of [11] have also observed that this ratio is correlated with the quality of SGD solutions.

Out of pragmatic graph-plotting considerations, we actually define relative variance to be the scaling with respect to the noise of a fixed baseline learning rate and batch size.
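Concretely (the baseline values below are placeholders, since the paper's baseline did not survive extraction):

```python
def relative_variance(lr, batch, lr0=0.1, batch0=128):
    """Relative variance of an SGD run: the scaling (lr / batch) of Sigma in its
    continuous limit, normalized by a baseline run with learning rate lr0 and
    batch size batch0 (hypothetical baseline values)."""
    return (lr / batch) / (lr0 / batch0)
```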

6.3.1 Accuracy vs. relative variance

In our first experiment, we show that there is a positive correlation between the relative variance of SGD (with respect to a particular baseline) and the final test accuracy of the trained model. We choose constant learning rates from a grid of values (15) and batch sizes from a grid of values (16). For each (learning rate, batch size) pair, we plot its final test accuracy against its relative variance in Figure 2. From the plot, higher relative variance indeed leads to better final test accuracy. We also highlight that, conditioned on the relative variance, the test accuracy is not significantly correlated with either the step-size or the batch-size alone: there is a strong correlation between the relative variance of an SGD sequence and its test accuracy, regardless of the particular combination of batch-size and learning rate.

Figure 2: Relationship between final test accuracy and the relative variance of the SGD algorithm.

6.3.2 SGD with injected noise

In this section, we implement and examine the performance of the Algorithm proposed in (12). In the Figure (3), each denotes a baseline SGD run, with learning-rate specified in the legend and batch-size specified by plot title. For example, in the first plot of Figure 3, the red denotes a SGD run with learning rate and batch-size . For each , we have a corresponding , of the same color. The corresponds to a run of (12), with , , and chosen so that the noise term as defined in (13) has covariance . In addition to and , we also plot in small teal marker all the other runs from Section 6.3.1. This helps highlight the linear trend between log(relative variance) and test accuracy that we observed in Section 6.3.1.

As can be seen, the (test error, relative variance) values for the injected-noise runs fall close to the linear trend (though there are some outliers). Specifically, a run of (12) produces similar test accuracy to vanilla SGD runs with the same relative variance (e.g. SGD runs with the same minibatch size and 10 times the learning rate). We highlight two potential implications. First, just like in Section 6.3.1, we observe that the test accuracy is strongly correlated with relative variance, even for noise of the form (13), which can have rather different higher moments than minibatch-SGD noise. Second, since the points fall close to the linear trend, we hypothesize that for all step-sizes, with c chosen as in (14), the test accuracy of (12) will be similar to the test accuracy of (10). Then, by our convergence result, (11) should also have similar test error. If true, this implies that we only need to study U and Σ to explain the generalization properties of SGD.

Figure 3: Injecting noise with minibatch

Finally, Figure 4 presents a similar experiment to Figure 3. This time, for each baseline run, we have a corresponding run of (12) with a different choice of step-size and batch size, and with c again chosen so that the noise term defined in (13) has matching covariance. We see that once again, the injected-noise runs fall close to the linear trend.

Figure 4: Injecting noise with minibatch

7 Acknowledgements

We gratefully acknowledge the support of the NSF through grant IIS-1619362 and of Google for a Google Research Award.

References

Appendix A Proofs for Convergence under Gaussian Noise (Section 5.1)

A.1 Proof of Theorem 1

In this section, we prove our main theorem. Our proof proceeds by recursively applying Lemma 1 over many steps.


  • Let be as defined in the Theorem statement.

    For the rest of this proof, consider defined as in Lemma 15 using the parameters ().

    Using Lemma 15.4, we know that

    As a consequence, for any two distributions , ,

    Suppose that we have the guarantee that

    (17)

    Then,

    where the last inequality is by our choice of parameters, and we have concluded our proof. The rest of this proof will be dedicated to showing (17).

    Let . Let and be defined as in (3). Let , then for all , by definition of .

    First, by our choice of the initial ,

    Combined with our choice of , we can apply Lemma 8 with , , to get that for all ,

    (18)

    Now, consider an arbitrary integer k. Over the corresponding time interval, (2) and (3) evolve as the processes (20) and (21), for which (22) is a coupling.

    We can thus apply Lemma 1 with the given , , , and . Then

    Applying the above recursively,

    (19)

    where we use the preceding bound for all steps. Using Lemma 9 and the definitions in Theorem 1, we can upper bound the initial error as

    By our definition of in the Theorem statement and by our definition of at the start of the proof,

    which implies

    Then (19) gives

    thus proving (17) and concluding our proof.

A.2 A coupling construction

In this subsection, we will study the evolution of (3) and (2) over a small time interval. Specifically, we will study

dx_t = −∇U(x_t) dt + M(x_t) dB_t,    (20)

dy_t = −∇U(y₀) dt + M(y₀) dB_t′.    (21)

One can verify that (20) is equivalent to (3), and (21) is equivalent to a single step of (2) (i.e. over an interval of length δ).

We first give the explicit coupling between (20) and (21):

Define (x_t, y_t) jointly using the following coupled SDE:

(22)