Stochastic Gradient Descent (SGD) is one of the workhorses of modern-day machine learning. In many nonconvex optimization problems, such as training deep neural networks, SGD is able to produce solutions with good generalization error. Further, there is evidence that the generalization error of an SGD solution can be significantly better than that of a Gradient Descent (GD) solution. This suggests that, to understand the behavior of SGD, it is not enough to consider the limiting cases (such as small step-size or large batch-size) in which it degenerates to GD. We take an alternate view of SGD as a sampling algorithm, and aim to understand its convergence to an appropriate stationary distribution.
There has been rapid recent progress in understanding the finite-time behavior of MCMC methods, by comparing them to stochastic differential equations (SDEs), such as the Langevin diffusion. It is natural in this context to think of SGD as a discrete time approximation of an SDE. But there are two significant barriers to extending previous analyses to the case of SGD, because those analyses are mostly restricted to isotropic Gaussian noise. First, the noise in SGD can be far from Gaussian. For instance, sampling from a minibatch leads to a discrete distribution. Second, the noise depends significantly on the current state (the optimization variable). For instance, if the objective is an average over a training sample of a non-negative loss, as the objective approaches zero, the noise variance of minibatch SGD goes to zero. Any attempt to cast SGD as an SDE must thus be able to handle this kind of noise.
This motivates the study of Langevin MCMC-like methods that have a state-dependent noise term:
where is the state variable at time , is the step-size, is a potential, is the noise function, and are sampled iid from some set (for example, in minibatch SGD, is the set of subsets of indices in the training sample). We discuss the SGD example in more detail in Section 6.
Throughout this paper, we assume that for all . We define a matrix-valued function to be the square root of the covariance matrix of , i.e. for all ,
Where for a symmetric positive semidefinite matrix , is the unique symmetric positive semidefinite matrix such that .
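The matrix square root described above can be sketched numerically. The following is a minimal illustration, assuming the covariance is available as a dense NumPy array; the function name `psd_sqrt` and the toy samples are hypothetical and not part of the paper.

```python
import numpy as np

def psd_sqrt(sigma):
    """Symmetric PSD square root of a symmetric PSD matrix via eigendecomposition."""
    sigma = 0.5 * (sigma + sigma.T)            # symmetrize against round-off
    eigvals, eigvecs = np.linalg.eigh(sigma)   # sigma = V diag(eigvals) V^T
    eigvals = np.clip(eigvals, 0.0, None)      # clip tiny negative eigenvalues
    return (eigvecs * np.sqrt(eigvals)) @ eigvecs.T

# Hypothetical zero-mean noise samples (one per row) at a fixed state.
samples = np.array([[1.0, 0.0], [-1.0, 0.0], [0.0, 2.0], [0.0, -2.0]])
cov = samples.T @ samples / len(samples)       # empirical covariance
root = psd_sqrt(cov)
```

Here `root @ root` recovers `cov` up to floating-point error, and `root` is itself symmetric positive semidefinite.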
Given the above definition of , it is natural to consider the following stochastic process:
Finally, we can consider the continuous-time limit of (2):
We will let denote the invariant distribution of (3). In Theorem 1, we establish quantitative rates at which (2) converges to . In Theorem 2, we establish quantitative rates at which (1) also converges to . Notice in particular that the only property of the SDE (3) that corresponds to the noise function in (1) is the covariance function .
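To make the discrete process concrete, here is a minimal simulation sketch. It assumes that (2) is a gradient step of size `step` plus the square root of `step` times the diffusion matrix applied to a standard Gaussian vector; the displayed equation is not reproduced in this text, so that form, the names `grad_U` and `M`, and the toy potential are all assumptions for illustration.

```python
import numpy as np

def simulate_discrete_langevin(grad_U, M, x0, step, n_steps, rng):
    """Iterate x <- x - step * grad_U(x) + sqrt(step) * M(x) @ theta, theta ~ N(0, I)."""
    x = np.array(x0, dtype=float)
    path = [x.copy()]
    for _ in range(n_steps):
        theta = rng.standard_normal(x.shape)                  # isotropic Gaussian
        x = x - step * grad_U(x) + np.sqrt(step) * (M(x) @ theta)
        path.append(x.copy())
    return np.array(path)

def grad_U(x):
    # Hypothetical quadratic potential U(x) = |x|^2 / 2.
    return x

def M(x):
    # Hypothetical state-dependent diffusion: larger noise far from the origin.
    return (1.0 + 0.5 * np.tanh(np.linalg.norm(x))) * np.eye(len(x))

rng = np.random.default_rng(0)
path = simulate_discrete_langevin(grad_U, M, [1.0, -2.0], step=0.01, n_steps=1000, rng=rng)
```

Setting `M` to the zero matrix turns the same loop into plain gradient descent, which is a convenient sanity check.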
Our contributions are as follows:
We prove a quantitative convergence rate for (2) to of (3) in Theorem 1. This is the first quantitative rate for overdamped Langevin MCMC when the diffusion matrix is state-dependent. Our rate is comparable to earlier work with similar assumptions of nonconvexity, but assuming a constant diffusion matrix [2, 4, 15].
Based on our theory, we describe a "large-noise" version of SGD and empirically evaluate its generalization performance. See Section 6.
2 Related work
There has been a long line of work in the study of convergence properties of stochastic processes. We review the ones most relevant to our work here.
Recent work on non-asymptotic convergence rates of Langevin MCMC algorithms began with  and , which established quantitative rates for Unadjusted Langevin MCMC under log-concavity assumptions (i.e. (2) with convex and for some constant ). Another line of work [7, 8, 2, 4, 15] analyzed the convergence of MCMC algorithms under nonconvexity assumptions. In particular,  and  considered the Overdamped Langevin MCMC algorithm under assumptions similar to Assumption A, but still assuming for some constant . Finally,  analyzed the convergence of (1) to (3) under much more restrictive assumptions of convexity of . In addition,  requires that (1) be a contractive process. In this paper, we show that convergence is possible as long as is sufficiently regular and is lower-bounded globally by some constant. Following our presentation of Theorems 1 and 2, we compare our result with some of the above-mentioned results.
3 An illustration of the importance of considering inhomogeneity
To motivate further discussion, we present a simple example to illustrate how can significantly skew the invariant distribution of (3) away from .
We will define the potential and the diffusion function as
A plot of can be found in Figure 1.a. We highlight the fact that is constructed to have two local minima: one shallow minimum at , and one deeper minimum at . A plot of can be found in Figure 1.b. is constructed to have increasing magnitude at larger values of . This has the effect of biasing the invariant distribution towards smaller values of .
Given and , we will consider the SDE in (3). Let be the invariant distribution under this SDE. Using the Fokker Planck equation, we obtain an explicit expression for the invariant distribution: . We can then integrate to get , which we plot in Figure 1.c.
Remarkably, has only one local minimum at . The deeper minimum of at was completely smoothed over by the effect of the large diffusion . This is very different from when the noise is homogeneous (e.g. ), in which case . We also simulate (3) (using (2)) for the given and for 1000 samples (each simulated for 1000 steps), and plot the histogram in Figure 1.d.
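The effect described in this section can be reproduced numerically in one dimension. The sketch below assumes (3) is read in the Ito sense as dx = -U'(x) dt + M(x) dB, for which the zero-flux solution of the Fokker-Planck equation is proportional to M(x)^{-2} exp(-∫ 2U'(s)/M(s)^2 ds). The double-well potential and the diffusion functions are hypothetical stand-ins, not the ones used for Figure 1.

```python
import numpy as np

def stationary_density_1d(grad_U, M, grid):
    """Zero-flux stationary density of dx = -U'(x) dt + M(x) dB on a uniform grid."""
    dx = grid[1] - grid[0]
    integrand = 2.0 * grad_U(grid) / M(grid) ** 2
    # Cumulative trapezoid rule for the integral of 2 U' / M^2 from grid[0].
    cum = np.concatenate(([0.0], np.cumsum(0.5 * (integrand[1:] + integrand[:-1]) * dx)))
    log_p = -cum - 2.0 * np.log(M(grid))
    log_p -= log_p.max()                       # avoid overflow before exponentiating
    p = np.exp(log_p)
    return p / (p.sum() * dx)                  # normalize with a Riemann sum

def grad_U(x):
    # Hypothetical tilted double well: U(x) = x^4/4 - x^2/2 - 0.3 x.
    return x ** 3 - x - 0.3

def M_flat(x):
    return np.ones_like(x)                     # homogeneous noise

def M_grow(x):
    return 1.0 + 1.5 / (1.0 + np.exp(-4.0 * x))  # noise grows with x

grid = np.linspace(-3.0, 3.0, 4001)
p_flat = stationary_density_1d(grad_U, M_flat, grid)
p_grow = stationary_density_1d(grad_U, M_grow, grid)
```

With `M_flat` the result agrees with the density proportional to exp(-2U), while `M_grow` shifts mass toward smaller values of x; plotting `p_flat` and `p_grow` against `grid` shows the difference.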
4 Assumptions and definitions
We assume that satisfies
The function is continuously-differentiable on and has Lipschitz continuous gradients; that is, there exists a positive constant such that for all ,
The function has a stationary point at zero:
There exist constants such that for all with , we have:
in addition, for all with , we have:
We make the following assumptions on and :
For all , has zero mean, i.e.
For almost sure , and for all ,
For almost sure , and for all ,
There is a positive constant such that for all ,
We discuss these assumptions in a specific setting at the end of Section 6.1.
4.2 Definitions and notation
For a tensor, we use to denote the operator norm:
It can be verified that for all , is a norm over . Furthermore, when , this is the Euclidean norm, and when , this is the largest singular value.
We use the notation to denote both vector and matrix inner products.
For vectors , (the dot product).
For matrices , (the trace inner product).
We will use to denote outer product. For two vectors , means that . We extend this notation to matrix vector outer products: if , and similarly if . We will use to denote and to denote .
Given any two distributions and , a joint distribution is a coupling between and if its marginals are equal to and respectively. Let denote the space of all couplings between and . Then the 1-Wasserstein distance is defined as
Our proof centers around a function , defined in Appendix E. Intuitively, plays the role of the metric . We will define a -Wasserstein distance as follows:
We overload the notation and sometimes use for random variables and to denote the distance between their distributions. We will prove our convergence results in , which then implies a convergence in by using Lemma 15.4.
Given a random variable , we use to denote its distribution.
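As a small illustration of the 1-Wasserstein distance defined above, the following sketch computes it between two one-dimensional empirical distributions with the same number of samples. It relies on the standard fact that, on the real line with cost |x - y|, matching sorted samples is an optimal coupling. The function name `wasserstein1_1d` is hypothetical.

```python
import numpy as np

def wasserstein1_1d(xs, ys):
    """1-Wasserstein distance between two equal-size 1-D empirical distributions."""
    xs = np.sort(np.asarray(xs, dtype=float))
    ys = np.sort(np.asarray(ys, dtype=float))
    if xs.shape != ys.shape:
        raise ValueError("this sketch requires equal sample sizes")
    # Sorted (monotone) matching is an optimal coupling in one dimension.
    return float(np.mean(np.abs(xs - ys)))

rng = np.random.default_rng(0)
a = rng.standard_normal(500)
shifted = a + 0.5                      # same shape, translated by 0.5
dist = wasserstein1_1d(a, rng.permutation(shifted))
```

Translating every sample by a constant moves the empirical distribution by exactly that constant in this distance, regardless of the order in which samples are listed.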
5 Main results
5.1 Convergence under Gaussian noise
Finding a suitable can be done very quickly using gradient descent with respect to . The convergence rate of to the ball of radius is very fast, due to Assumption A.
In a nontrivial instance,
Substituting in the parameters, and after some algebra, we see that for a sufficiently small , , and
To gain intuition about this term, let’s consider what it looks like under a sequence of increasingly weaker assumptions:
a. Strongly convex, Gaussian noise: -strongly convex, -smooth, for all . (In reality we need to consider a truncated Gaussian so as not to violate Assumption B.2, but this is a minor issue.) In this case, , , , , so . This is the same rate as obtained in . However,  gets a bound which is stronger than bound.
b. Non-convex, Gaussian noise: not strongly convex but satisfies Assumption A, and . In this case, , , This is the setting studied by  and . The rate we recover is , which is in line with , and is the best rate obtainable from .
c. Non-convex, Inhomogeneous noise: satisfies Assumption A, and satisfies Assumption B. To simplify matters, suppose the problem is rescaled so that . Then the main additional term compared to setting b. above is . This seems to suggest that the effect of a -Lipschitz noise can play a similar role in hindering mixing as a -Lipschitz nonconvex drift.
When the dimension is high, computing could be difficult, but if for each , one has access to samples whose covariance is , then one can approximate via the Central Limit Theorem (e.g.3) by drawing a sufficiently large number of samples. The proof of Theorem 1 can be modified relatively easily to accommodate this. We discuss this in further detail in Appendix A.4.
5.2 Convergence under non-Gaussian noise
Let be some target accuracy satisfying . Let us define
Where is some universal constant specified in the proof.
Let , for any . Let have dynamics as defined in (2) and let denote its distribution. Then for
In , the authors proved a convergence result of similar flavor, i.e. a sequence of the form (1) converges to of (3). The dependence in their paper is . This is faster than our rate, but their proof made a number of much stronger assumptions. In particular, they assumed that is strongly convex, and (1) is contractive.
6 Application to stochastic gradient descent
6.1 SGD as SDE
In this section, we will try to cast SGD in the form of (1). We will consider an objective of the form
. We reserve the letter to denote a random batch from , sampled with replacement (we will specify the batch size as needed). We will define as follows
For a single sample, i.e. , we define
I.e. is the covariance matrix of a single sampled gradient minus the true gradient.
A standard run of SGD, with minibatch size , has the following form:
Notice that is in the form of (1), with . The covariance matrix of the noise term is .
Because the magnitude of the noise covariance scales with , it follows that as , (10) converges to the deterministic gradient flow ODE.
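The dependence of the minibatch noise on the batch size can be checked exactly on a toy problem. The sketch below uses a hypothetical least-squares objective and enumerates every ordered batch drawn uniformly with replacement, so no sampling error is involved. It only verifies the standard fact that the covariance of a minibatch gradient is the single-sample covariance divided by the batch size; any additional step-size factor in the noise term of (10) is not modeled here.

```python
import itertools
import numpy as np

def per_sample_grads(w, X, y):
    """Gradients of the per-sample losses 0.5 * (x_i @ w - y_i) ** 2, one per row."""
    return (X @ w - y)[:, None] * X

def minibatch_noise_cov(grads, batch_size):
    """Exact covariance of (minibatch mean gradient - full gradient) when the
    batch is drawn uniformly with replacement, by enumerating all ordered batches."""
    n, d = grads.shape
    full = grads.mean(axis=0)
    cov = np.zeros((d, d))
    for batch in itertools.product(range(n), repeat=batch_size):
        dev = grads[list(batch)].mean(axis=0) - full
        cov += np.outer(dev, dev)
    return cov / n ** batch_size

# Hypothetical tiny data set so that full enumeration stays cheap.
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, -1.0]])
y = np.array([1.0, -1.0, 0.5, 2.0])
w = np.array([0.3, -0.2])
grads = per_sample_grads(w, X, y)
H = minibatch_noise_cov(grads, 1)      # single-sample covariance
H_b2 = minibatch_noise_cov(grads, 2)   # equals H / 2
```

Enumeration costs n^b evaluations, so this is only a check for very small n and b, not something to run on a real training set.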
However, the loss of randomness as might not be desirable. It has been observed in certain cases that as SGD approaches GD, through either small step-size or large batch-size, the generalization error goes up . In Section 6.3.1, we present a set of empirical results to support this claim.
(Recall that in (10)). Notice that the above is similar to (3), with , which matches the covariance matrix of (10). Our definition is thus motivated by Theorem 2, which states that (1) converges to (3).
Let be some stepsize, and let be an arbitrary constant. Consider the following stochastic sequence:
Where 3 mini-batches, sampled iid and with replacement, and . Intuitively, in addition to the SGD noise, we inject additional noise by adding the difference between two independently sampled mini-batches.
If we pick
We stress that we are not proposing (12) as a practical algorithm. The reason that (12) is interesting is that it gives us a family of which converges to (11), and is implementable in practice. In Section 6.3.2, we implement (12) and evaluate its performance. From the experiments, it appears that (12) has similar test accuracy to vanilla (10) with step-size . We thus hypothesize that the test accuracy depends largely on the shape and scale of the noise covariance matrix, which implies that the generalization properties of (10) for large should extend to its limit (11).
We remark that  proposed a different way of injecting noise, multiplying the sampled gradient with a suitably scaled Gaussian noise.
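The noise-injection idea described for (12) can be sketched as follows: take an ordinary minibatch gradient step, then add a scaled difference of two further independent minibatch gradients. The exact scaling used in (12) is not reproduced in this text, so `noise_scale` is a hypothetical free parameter, and the least-squares objective and helper names are assumptions for illustration only.

```python
import numpy as np

def batch_grad(w, idx, X, y):
    """Mean gradient of 0.5 * (x_i @ w - y_i) ** 2 over the indices in idx."""
    Xb, yb = X[idx], y[idx]
    return Xb.T @ (Xb @ w - yb) / len(idx)

def large_noise_sgd_step(w, grad_fn, batches, step, noise_scale):
    """One SGD step plus extra zero-mean noise from two independent minibatches.

    noise_scale is a hypothetical stand-in for the scaling used in (12).
    """
    eta, eta1, eta2 = batches
    extra = grad_fn(w, eta1) - grad_fn(w, eta2)   # difference of two minibatch gradients
    return w - step * grad_fn(w, eta) + noise_scale * extra

rng = np.random.default_rng(0)
X = rng.standard_normal((32, 3))
y = rng.standard_normal(32)

def grad_fn(w, idx):
    return batch_grad(w, idx, X, y)

w = np.zeros(3)
batches = rng.integers(0, len(X), size=(3, 8))    # three batches, drawn with replacement
w_next = large_noise_sgd_step(w, grad_fn, batches, step=0.1, noise_scale=0.05)
```

Because the two extra batches are identically distributed, the injected term has zero mean; if they happen to coincide, the update reduces to a plain SGD step.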
6.2 Satisfying Assumptions in Section 4.1
First, let us additionally assume that for each , has the form
Where is a -strongly convex regularizer outside a ball of radius . has a minimum at and has -Lipschitz gradients. Suppose further that . These additional assumptions make sense when we are only interested in over , so plays the role of a function that keeps us within .
It can immediately be verified that satisfies Assumption A with .
The noise term in (13) satisfies Assumption B.1 by definition, and satisfies Assumption B.3 with . Assumption B.2 is satisfied if is bounded for all , i.e. the sampled gradient does not deviate from the true gradient by more than a constant. We will need to assume Assumption B.4 directly, as it is a property of the distribution of for .
In this section, we present experimental results. In all experiments, we use two different neural network architectures on the CIFAR-10 dataset [14]. The first architecture is a simple convolutional neural network, which we call CNN in the following, and the other is the VGG19 network [17]. To make our experiments consistent with the setting of SGD, we do not use batch normalization or dropout. In all of our experiments, we run the SGD algorithm for epochs such that the algorithm converges sufficiently.
Let be the covariance matrix of a single sample as defined in (9). For all SGD variants studied in this section, the covariance matrix will be some scaling of . We define the relative variance of a sequence as the scaling of in its continuous limit. For a SGD sequence with stepsize and batchsize, one can verify that the relative scaling of is . The authors of  have also observed this ratio is correlated with the quality of SGD solutions.
Out of pragmatic graph plotting considerations, we actually define relative variance to be the scaling wrt the noise when learning rate= and batch size=.
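For concreteness, the relative variance of a run can be computed as below. This sketch assumes the scaling is the ratio of step-size to batch-size, normalized by the same ratio for a baseline run; the ratio itself and the baseline values are not reproduced in this text, so the default arguments here are hypothetical placeholders.

```python
import math

def relative_variance(lr, batch_size, base_lr, base_batch_size):
    """Noise scaling of an SGD run relative to a baseline, assuming the
    noise covariance scales like lr / batch_size."""
    return (lr / batch_size) / (base_lr / base_batch_size)

# Hypothetical baseline values, chosen only for illustration.
BASE_LR, BASE_BS = 0.01, 128
ratio_10x_lr = relative_variance(0.1, 128, BASE_LR, BASE_BS)
ratio_scaled = relative_variance(0.02, 256, BASE_LR, BASE_BS)
log_ratio = math.log10(ratio_10x_lr)   # runs are compared on a log scale
```

Under this assumption, multiplying the learning rate by ten at a fixed batch size multiplies the relative variance by ten, while scaling the learning rate and the batch size together leaves it unchanged.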
6.3.1 Accuracy vs relative Variance
In our first experiment, we show that there is a positive correlation between the relative variance of SGD (with respect to a particular baseline) and the final test accuracy of the trained model. We choose constant learning rate from
and batch size from
For each (learning rate, batch size) pair, we plot its final test accuracy against its relative variance in Figure 2. From the plot, higher relative variance indeed leads to better final test accuracy. We also highlight the fact that conditioned on the relative variance, the test accuracy is not significantly correlated with either the step-size or the batch-size. Specifically, there is a strong correlation between relative variance of a SGD sequence and its test accuracy, regardless of the combination of batch-size and learning rate.
6.3.2 SGD with injected noise
In this section, we implement and examine the performance of the algorithm proposed in (12). In Figure 3, each denotes a baseline SGD run, with learning-rate specified in the legend and batch-size specified by plot title. For example, in the first plot of Figure 3, the red denotes an SGD run with learning rate and batch-size . For each , we have a corresponding , of the same color. The corresponds to a run of (12), with , , and chosen so that the noise term as defined in (13) has covariance . In addition to and , we also plot in small teal markers all the other runs from Section 6.3.1. This helps highlight the linear trend between log(relative variance) and test accuracy that we observed in Section 6.3.1.
As can be seen, the (test error, relative variance) values for the runs fall close to the linear trend, though there are some outliers. Specifically, a run of (12) produces similar test accuracy to vanilla SGD runs with the same relative variance (e.g. SGD runs with the same minibatch size and 10 times the learning rate). We highlight two potential implications: First, just like in Section 6.3.1, we observe that the test accuracy is strongly correlated with relative variance, even for noise of the form (13), which can have rather different higher moments than . Second, since the points fall close to the linear trend, we hypothesize that for all , and for all chosen as in (14), the test accuracy of (12) will be similar to the test accuracy of (10). Then by our convergence result, (11) should also have similar test error. If true, then this implies that we only need to study and to explain the generalization properties of SGD.
We gratefully acknowledge the support of the NSF through grant IIS-1619362 and of Google for a Google Research Award.
- [1] A. Anastasiou, K. Balasubramanian, and M. A. Erdogdu. Normal approximation for stochastic gradient descent via non-asymptotic rates of martingale CLT. arXiv preprint arXiv:1904.02130, 2019.
- [2] N. Bou-Rabee, A. Eberle, and R. Zimmer. Coupling and convergence for Hamiltonian Monte Carlo. arXiv preprint arXiv:1805.00452, 2018.
- [3] X. Cheng, P. L. Bartlett, and M. I. Jordan. Quantitative central limit theorems for discrete stochastic processes. arXiv preprint arXiv:1902.00832, 2019.
- [4] X. Cheng, N. S. Chatterji, Y. Abbasi-Yadkori, P. L. Bartlett, and M. I. Jordan. Sharp convergence rates for Langevin dynamics in the nonconvex setting. arXiv preprint arXiv:1805.01648, 2018.
- [5] A. S. Dalalyan. Theoretical guarantees for approximate sampling from smooth and log-concave densities. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 79(3):651–676, 2017.
- [6] A. Durmus and E. Moulines. High-dimensional Bayesian inference via the unadjusted Langevin algorithm. arXiv preprint arXiv:1605.01559, 2016.
- [7] A. Eberle. Reflection coupling and Wasserstein contractivity without convexity. Comptes Rendus Mathematique, 349(19-20):1101–1104, 2011.
- [8] A. Eberle. Reflection couplings and contraction rates for diffusions. Probability Theory and Related Fields, 166(3-4):851–886, 2016.
- [9] R. Eldan, D. Mikulincer, and A. Zhai. The CLT in high dimensions: quantitative bounds via martingale embedding. arXiv preprint arXiv:1806.09087, 2018.
- [10] E. Hoffer, I. Hubara, and D. Soudry. Train longer, generalize better: closing the generalization gap in large batch training of neural networks. In Advances in Neural Information Processing Systems, pages 1731–1741, 2017.
- [11] S. Jastrzębski, Z. Kenton, D. Arpit, N. Ballas, A. Fischer, Y. Bengio, and A. Storkey. Three factors influencing minima in SGD. arXiv preprint arXiv:1711.04623, 2017.
- [12] N. S. Keskar, D. Mudigere, J. Nocedal, M. Smelyanskiy, and P. T. P. Tang. On large-batch training for deep learning: Generalization gap and sharp minima. arXiv preprint arXiv:1609.04836, 2016.
- [13] R. Kleinberg, Y. Li, and Y. Yuan. An alternative view: When does SGD escape local minima? arXiv preprint arXiv:1802.06175, 2018.
- [14] A. Krizhevsky and G. Hinton. Learning multiple layers of features from tiny images. Technical report, Citeseer, 2009.
- [15] Y.-A. Ma, Y. Chen, C. Jin, N. Flammarion, and M. I. Jordan. Sampling can be faster than optimization. arXiv preprint arXiv:1811.08413, 2018.
- [16] S. Mandt, M. Hoffman, and D. Blei. A variational analysis of stochastic gradient algorithms. In International Conference on Machine Learning, pages 354–363, 2016.
- [17] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
Appendix A Proofs for Convergence under Gaussian Noise (Section 5.1)
A.1 Proof of Theorem 1
In this section, we state our main Theorem. Our proof proceeds by recursively applying Lemma 1 over many steps.
Let be as defined in the Theorem statement.
For the rest of this proof, consider defined as in Lemma 15 using the parameters ().
Using Lemma 15.4, we know that
As a consequence, for any two distributions , ,
Suppose that we have the guarantee that
Where the last inequality is by our choice of , and we have concluded our proof. The rest of this proof will be dedicated to showing (17).
Let . Let and be defined as in (3). Let , then for all , by definition of .
First, by our choice of the initial ,
Combined with our choice of , we can apply Lemma 8 with , , to get that for all ,
We can thus apply Lemma 1 with the given , , , and . Then
Applying the above recursively,
By our definition of in the Theorem statement and by our definition of at the start of the proof,
A.2 A coupling construction
Define using the following coupled SDE: