Stochastic gradient descent performs variational inference, converges to limit cycles for deep networks

Stochastic gradient descent (SGD) is widely believed to perform implicit regularization when used to train deep neural networks, but the precise manner in which this occurs has thus far been elusive. We prove that SGD minimizes an average potential over the posterior distribution of weights along with an entropic regularization term. This potential is however not the original loss function in general. So SGD does perform variational inference, but for a different loss than the one used to compute the gradients. Even more surprisingly, SGD does not even converge in the classical sense: we show that the most likely trajectories of SGD for deep networks do not behave like Brownian motion around critical points. Instead, they resemble closed loops with deterministic components. We prove that such "out-of-equilibrium" behavior is a consequence of the fact that the gradient noise in SGD is highly non-isotropic; the covariance matrix of mini-batch gradients has a rank as small as 1 these claims, proven in the appendix.

Authors

• 21 publications
• 86 publications
• Quasi-potential as an implicit regularizer for the loss function in the stochastic gradient descent

We interpret the variational inference of the Stochastic Gradient Descen...
01/18/2019 ∙ by Wenqing Hu, et al. ∙ 0

• A geometric interpretation of stochastic gradient descent using diffusion metrics

Stochastic gradient descent (SGD) is a key ingredient in the training of...
10/27/2019 ∙ by R. Fioresi, et al. ∙ 0

• On the Origin of Implicit Regularization in Stochastic Gradient Descent

For infinitesimal learning rates, stochastic gradient descent (SGD) foll...
01/28/2021 ∙ by Samuel L. Smith, et al. ∙ 0

• Stabilizing Adversarial Nets With Prediction Methods

Adversarial neural networks solve many important problems in data scienc...
05/20/2017 ∙ by Abhay Yadav, et al. ∙ 0

• A deep learning theory for neural networks grounded in physics

In the last decade, deep learning has become a major component of artifi...
03/18/2021 ∙ by Benjamin Scellier, et al. ∙ 0

Stochastic gradient descent (SGD) has been the dominant optimization met...
07/29/2019 ∙ by Erhan Bilal, et al. ∙ 0

• A Simple Baseline for Bayesian Uncertainty in Deep Learning

We propose SWA-Gaussian (SWAG), a simple, scalable, and general purpose ...
02/07/2019 ∙ by Wesley Maddox, et al. ∙ 20

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

Our first result is to show precisely in what sense stochastic gradient descent (SGD) implicitly performs variational inference, as is often claimed informally in the literature. For a loss function with weights , if

is the steady-state distribution over the weights estimated by SGD,

 ρss=argminρ Ex∼ρ[Φ(x)]−η2b H(ρ);

where is the entropy of the distribution and and are the learning rate and batch-size, respectively. The potential , which we characterize explicitly, is related but not necessarily equal to . It is only a function of the architecture and the dataset. This implies that SGD implicitly performs variational inference with a uniform prior, albeit of a different loss than the one used to compute back-propagation gradients.

We next prove that the implicit potential is equal to our chosen loss if and only if the noise in mini-batch gradients is isotropic. This condition, however, is not satisfied for deep networks. Empirically, we find gradient noise to be highly non-isotropic with the rank of its covariance matrix being about of its dimension. Thus, SGD on deep networks implicitly discovers locations where , these are not the locations where . This is our second main result: the most likely locations of SGD are not the local minima, nor the saddle points, of the original loss. The deviation of these critical points, which we compute explicitly scales linearly with and is typically large in practice.

When mini-batch noise is non-isotropic, SGD does not even converge in the classical sense. We prove that, instead of undergoing Brownian motion in the vicinity of a critical point, trajectories have a deterministic component that causes SGD to traverse closed loops in the weight space. We detect such loops using a Fourier analysis of SGD trajectories. We also show through an example that SGD with non-isotropic noise can even converge to stable limit cycles around saddle points.

2 Background on continuous-time SGD

Stochastic gradient descent performs the following updates while training a network where is the learning rate and is the average gradient over a mini-batch ,

 ∇fb(x)=1b ∑k∈b ∇fk(x). (1)

We overload notation for both the set of examples in a mini-batch and its size. We assume that weights belong to a compact subset , to ensure appropriate boundary conditions for the evolution of steady-state densities in SGD, although all our results hold without this assumption if the loss grows unbounded as , for instance, with weight decay as a regularizer.

Definition 1 (Diffusion matrix D(x)).

If a mini-batch is sampled with replacement, we show in Section A.1

that the variance of mini-batch gradients is

where

 D(x)=(1N N∑k=1∇fk(x) ∇fk(x)⊤)−∇f(x) ∇f(x)⊤⪰0. (2)

Note that is independent of the learning rate and the batch-size . It only depends on the weights , architecture and loss defined by , and the dataset. We will often discuss two cases: isotropic diffusion when is a scalar multiple of identity, independent of , and non-isotropic diffusion, when is a general function of the weights .

We now construct a stochastic differential equation (SDE) for the discrete-time SGD updates.

Lemma 2 (Continuous-time SGD).

The continuous-time limit of SGD is given by

 dx(t)=−∇f(x) dt+√2β−1D(x) dW(t); (3)

where is Brownian motion and is the inverse temperature defined as . The steady-state distribution of the weights , evolves according to the Fokker-Planck equation (Risken, 1996, Ito form):

 ∂ρ∂t (FP)

where the notation denotes the divergence

for any vector

; the divergence operator is applied column-wise to matrices such as .

We refer to Li et al. (2017b, Thm. 1) for the proof of the convergence of discrete SGD to Eq. 3. Note that completely captures the magnitude of noise in SGD that depends only upon the learning rate and the mini-batch size .

Assumption 3 (Steady-state distribution exists and is unique).

We assume that the steady-state distribution of the Fokker-Planck equation Eq. FP exists and is unique, this is denoted by and satisfies,

 0=∂ρss∂t=∇⋅(∇f(x) ρss+β−1 ∇⋅(D(x) ρss)). (4)

3 SGD performs variational inference

Let us first implicitly define a potential using the steady-state distribution :

 Φ(x)=−β−1logρss(x), (5)

up to a constant. The potential depends only on the full-gradient and the diffusion matrix; see Appendix C for a proof. It will be made explicit in Section 5. We express in terms of the potential using a normalizing constant as

 ρss(x)=1Z(β) e−βΦ(x) (6)

which is also the steady-state solution of

 dx=β−1 ∇⋅D(x) dt−D(x) ∇Φ(x) dt+√2β−1D(x) dW(t) (7)

as can be verified by direct substitution in Eq. FP.

The above observation is very useful because it suggests that, if can be written in terms of the diffusion matrix and a gradient term , the steady-state distribution of this SDE is easily obtained. We exploit this observation to rewrite in terms a term that gives rise to the above steady-state, the spatial derivative of the diffusion matrix, and the remainder:

 j(x)=−∇f(x)+D(x) ∇Φ(x)−β−1∇⋅D(x), (8)

interpreted as the part of that cannot be written as for some . We now make an important assumption on which has its origins in thermodynamics.

Assumption 4 (Force j(x) is conservative).

We assume that

 ∇⋅j(x)=0. (9)

The Fokker-Planck equation Eq. FP typically models a physical system which exchanges energy with an external environment (Ottinger, 2005; Qian, 2014). In our case, this physical system is the gradient dynamics while the interaction with the environment is through the term involving temperature: . The second law of thermodynamics states that the entropy of a system can never decrease; in Appendix B we show how the above assumption is sufficient to satisfy the second law. We also discuss some properties of in Appendix C that are a consequence of this. The most important is that is always orthogonal to . We illustrate the effects of this assumption in Example 19.

This leads us to the main result of this section.

Theorem 5 (SGD performs variational inference).

The functional

 F(ρ) =β−1 KL(ρ || ρss) (10)

decreases monotonically along the trajectories of the Fokker-Planck equation Eq. FP and converges to its minimum, which is zero, at steady-state. Moreover, we also have an energetic-entropic split

 F(ρ)=Ex∈ρ [Φ(x)]−β−1H(ρ)+constant. (11)

Theorem 5, proven in Section F.1, shows that SGD implicitly minimizes a combination of two terms: an “energetic” term, and an “entropic” term. The first is the average potential over a distribution . The steady-state of SGD in Eq. 6

is such that it places most of its probability mass in regions of the parameter space with small values of

. The second shows that SGD has an implicit bias towards solutions that maximize the entropy of the distribution .

Note that the energetic term in Eq. 11 has potential , instead of . This is an important fact and the crux of this paper.

Lemma 6 (Potential equals original loss iff isotropic diffusion).

If the diffusion matrix is isotropic, i.e., a constant multiple of the identity, the implicit potential is the original loss itself

 D(x)=c Id×d⇔Φ(x)=f(x). (12)

This is proven in Section F.2. The definition in Eq. 8 shows that when is non-isotropic. This results in a deterministic component in the SGD dynamics which does not affect the functional , hence is called a “conservative force.” The following lemma is proven in Section F.3.

Lemma 7 (Most likely trajectories of SGD are limit cycles).

The force does not decrease in Eq. 11 and introduces a deterministic component in SGD given by

 ˙x=j(x). (13)

The condition in creftype 4 implies that most likely trajectories of SGD traverse closed trajectories in weight space.

Theorem 5 applies for a general and it is equivalent to the celebrated JKO functional (Jordan et al., 1997) in optimal transportation (Santambrogio, 2015; Villani, 2008) if the diffusion matrix is isotropic. Appendix D provides a brief overview using the heat equation as an example.

Corollary 8 (Wasserstein gradient flow for isotropic noise).

If , trajectories of the Fokker-Planck equation Eq. FP are gradient flow in the Wasserstein metric of the functional

 F(ρ)=Ex∼ρ[f(x)]−β−1H(ρ). (JKO)

Observe that the energetic term contains in Corollary 8. The proof follows from Lemmas 6 and 5, see Santambrogio (2017) for a rigorous treatment of Wasserstein metrics. The JKO functional above has had an enormous impact in optimal transport because results like Corollaries 8 and 5 provide a way to modify the functional in an interpretable fashion. Modifying the Fokker-Planck equation or the SGD updates directly to enforce regularization properties on the solutions is much harder.

3.2 Connection to Bayesian inference

Note the absence of any prior in Eq. 11. On the other hand, the evidence lower bound (Kingma and Welling, 2013) for the dataset is,

 −logp(Ξ) ≤Ex∼q[f(x)]+KL(q(x|Ξ) || p(x|Ξ)), (ELBO) ≤Ex∼q[f(x)]−H(q)+H(q,p);

where is the cross-entropy of the estimated steady-state and the variational prior. The implicit loss function of SGD in Eq. 11 therefore corresponds to a uniform prior . In other words, we have shown that SGD itself performs variational optimization with a uniform prior. Note that this prior is well-defined by our hypothesis of for some compact .

It is important to note that SGD implicitly minimizes a potential instead of the original loss in ELBO. We prove in Section 5 that this potential is quite different from if the diffusion matrix is non-isotropic, in particular, with respect to its critical points.

Remark 9 (SGD has an information bottleneck).

The functional Eq. 11 is equivalent to the information bottleneck principle in representation learning (Tishby et al., 1999). Minimizing this functional, explicitly, has been shown to lead to invariant representations (Achille and Soatto, 2017). Theorem 5 shows that SGD implicitly contains this bottleneck and therefore begets these properties, naturally.

Remark 10 (ELBO prior conflicts with SGD).

Working with ELBO in practice involves one or multiple steps of SGD to minimize the energetic term along with an estimate of the -divergence term, often using a factored Gaussian prior (Kingma and Welling, 2013; Jordan et al., 1999). As Theorem 5 shows, such an approach also enforces a uniform prior whose strength is determined by and conflicts with the externally imposed Gaussian prior. This conflict—which fundamentally arises from using SGD to minimize the energetic term—has resulted in researchers artificially modulating the strength of the -divergence term using a scalar pre-factor (Mandt et al., 2016).

3.3 Practical implications

We will show in Section 5 that the potential does not depend on the optimization process, it is only a function of the dataset and the architecture. The effect of two important parameters, the learning rate and the mini-batch size therefore completely determines the strength of the entropic regularization term. If , the implicit regularization of SGD goes to zero. This implies that

 β−1=η2b should not be small

is a good tenet for regularization of SGD.

Remark 11 (Learning rate should scale linearly with batch-size to generalize well).

In order to maintain the entropic regularization, the learning rate needs to scale linearly with the batch-size . This prediction, based on Theorem 5, fits very well with empirical evidence wherein one obtains good generalization performance only with small mini-batches in deep networks (Keskar et al., 2016), or via such linear scaling (Goyal et al., 2017).

Remark 12 (Sampling with replacement is better than without replacement).

The diffusion matrix for the case when mini-batches are sampled with replacement is very close to Eq. 2, see Section A.2. However, the corresponding inverse temperature is

 β′−1=η2b(1−bN) should not be small.

The extra factor of reduces the entropic regularization in Eq. 11, as , the inverse temperature . As a consequence, for the same learning rate and batch-size Theorem 5 predicts that sampling with replacement has better regularization than sampling without replacement. This effect is particularly pronounced at large batch-sizes.

4 Empirical characterization of SGD dynamics

Section 4.1 shows that the diffusion matrix for modern deep networks is highly non-isotropic with a very low rank. We also analyze trajectories of SGD and detect periodic components using a frequency analysis in Section 4.2; this validates the prediction of Lemma 7.

We consider three networks for these experiments: a convolutional network called small-lenet, a two-layer fully-connected network on MNIST (LeCun et al., 1998) and a smaller version of the All-CNN-C architecture of Springenberg et al. (2014) on the CIFAR-10 and CIFAR-100 datasets (Krizhevsky, 2009); see Appendix E for more details.

4.1 Highly non-isotropic D(x) for deep networks

Figs. 2 and 1 show the eigenspectrum111thresholded at . This formula is widely used, for instance, in numpy.

of the diffusion matrix. In all cases, it has a large fraction of almost-zero eigenvalues with a very small rank that ranges between

- . Moreover, non-zero eigenvalues are spread across a vast range with a large variance.

Remark 13 (Noise in SGD is largely independent of the weights).

The variance of noise in Eq. 3 is

 η D(xk)b=2 β−1D(xk).

We have plotted the eigenspectra of the diffusion matrix in Fig. 1 and Fig. 2 at three different instants, , and training completion; they are almost indistinguishable. This implies that the variance of the mini-batch gradients in deep networks can be considered a constant, highly non-isotropic matrix.

Remark 14 (More non-isotropic diffusion if data is diverse).

The eigenspectra in Fig. 2

for CIFAR-10 and CIFAR-100 have much larger eigenvalues and standard-deviation than those in

Fig. 1, this is expected because the images in the CIFAR datasets have more variety than those in MNIST. Similarly, while CIFAR-100 has qualitatively similar images as CIFAR-10, it has more classes and as a result, it is a much harder dataset. This correlates well with the fact that both the mean and standard-deviation of the eigenvalues in Fig. 1(b) are much higher than those in Fig. 1(a). Input augmentation increases the diversity of mini-batch gradients. This is seen in Fig. 1(c) where the standard-deviation of the eigenvalues is much higher as compared to Fig. 1(a).

Remark 15 (Inverse temperature scales with the mean of the eigenspectrum).

Remark 14 shows that the mean of the eigenspectrum is large if the dataset is diverse. Based on this, we propose that the inverse temperature should scale linearly with the mean of the eigenvalues of :

 (ηb) (1d d∑k=1 λ(D))=constant; (14)

where is the number of weights. This keeps the noise in SGD constant in magnitude for different values of the learning rate , mini-batch size , architectures, and datasets. Note that other hyper-parameters which affect stochasticity such as dropout probability are implicit inside .

Remark 16 (Variance of the eigenspectrum informs architecture search).

Compare the eigenspectra in Figs. 0(b) and 0(a) with those in Figs. 1(c) and 1(a). The former pair shows that small-lenet which is a much better network than small-fc also has a much larger rank, i.e., the number of non-zero eigenvalues ( is symmetric). The second pair shows that for the same dataset, data-augmentation creates a larger variance in the eigenspectrum. This suggests that both the quantities, viz., rank of the diffusion matrix and the variance of the eigenspectrum, inform the performance of a given architecture on the dataset. Note that as discussed in Remark 15, the mean of the eigenvalues can be controlled using the learning rate and the batch-size .

This observation is useful for automated architecture search where we can use the quantity

 rank(D)d+var(λ(D))

to estimate the efficacy of a given architecture, possibly, without even training, since does not depend on the weights much. This task currently requires enormous amounts of computational power (Zoph and Le, 2016; Baker et al., 2016; Brock et al., 2017).

4.2 Analysis of long-term trajectories

We train a smaller version of small-fc on down-sampled MNIST images for epochs and store snapshots of the weights after each epoch to get a long trajectory in the weight space. We discard the first epochs of training (“burnin”) to ensure that SGD has reached the steady-state. The learning rate is fixed to after this, up to epochs.

Remark 17 (Low-frequency periodic components in SGD trajectories).

Iterates of SGD, after it reaches the neighborhood of a critical point , are expected to perform Brownian motion with variance , the FFT in Fig. 2(a) would be flat if this were so. Instead, we see low-frequency modes in the trajectory that are indicators of a periodic dynamics of the force . These modes are not sharp peaks in the FFT because can be a non-linear function of the weights thereby causing the modes to spread into all dimensions of . The FFT is dominated by jittery high-frequency modes on the right with a slight increasing trend; this suggests the presence of colored noise in SGD at high-frequencies.

The auto-correlation (AC) in Fig. 2(b) should be compared with the AC for Brownian motion which decays to zero very quickly and stays within the red confidence bands (). Our iterates are significantly correlated with each other even at very large lags. This further indicates that trajectories of SGD do not perform Brownian motion.

Remark 18 (Gradient magnitude in deep networks is always large).

Fig. 2(c) shows that the full-gradient computed over the entire dataset (without burnin) does not decrease much with respect to the number of epochs. While it is expected to have a non-zero gradient norm because SGD only converges to a neighborhood of a critical point for non-zero learning rates, the magnitude of this gradient norm is quite large. This magnitude drops only by about a factor of over the next epochs. The presence of a non-zero also explains this, it causes SGD to be away from critical points, this phenomenon is made precise in Theorem 22. Let us note that a similar plot is also seen in Shwartz-Ziv and Tishby (2017) for the per-layer gradient magnitude.

5 SGD for deep networks is out-of-equilibrium

This section now gives an explicit formula for the potential . We also discuss implications of this for generalization in Section 5.3.

The fundamental difficulty in obtaining an explicit expression for is that even if the diffusion matrix is full-rank, there need not exist a function such that at all . We therefore split the analysis into two cases:

1. a local analysis near any critical point where we linearize and to compute for some , and

2. the general case where cannot be written as a local rotation and scaling of .

Let us introduce these cases with an example from Noh and Lee (2015).

Example 19 (Double-well potential with limit cycles).

Define

 Φ(x)=(x21−1)24+x222.

Instead of constructing a diffusion matrix , we will directly construct different gradients that lead to the same potential ; these are equivalent but the later is much easier. The dynamics is given by , where . We pick for some parameter where

 Jss(x)=e−(x21+x22)24 (−x2, x1).

Note that this satisfies Eq. 6 and does not change . Fig. 4 shows the gradient field along with a discussion.

5.1 Linearization around a critical point

Without loss of generality, let be a critical point of . This critical point can be a local minimum, maximum, or even a saddle point. We linearize the gradient around the origin and define a fixed matrix (the Hessian) to be . Let be the constant diffusion matrix matrix. The dynamics in Eq. 3 can now be written as

 dx=−Fx dt+√2β−1 D dW(t). (15)
Lemma 20 (Linearization).

The matrix in Eq. 15 can be uniquely decomposed into

 F =(D+Q) U; (16)

and are the symmetric and anti-symmetric parts of a matrix with , to get .

The above lemma is a classical result if the critical point is a local minimum, i.e., if the loss is locally convex near

; this case has also been explored in machine learning before

(Mandt et al., 2016). We refer to Kwon et al. (2005) for the proof that linearizes around any critical point.

We see from Lemma 20 that, near a critical point,

 ∇f=(D+Q) ∇Φ−β−1∇⋅D−β−1∇⋅Q (17)

up to the first order. This suggests that the effect of is to rotate the gradient field and move the critical points, also seen in Fig. 3(b). Note that and in the linearized analysis.

5.2 General case

We next give the general expression for the deviation of the critical points from those of the original loss .

A-type stochastic integration:

A Fokker-Planck equation is a deterministic partial differential equation (PDE) and every steady-state distribution,

in this case, has a unique such PDE that achieves it. However, the same PDE can be tied to different SDEs depending on the stochastic integration scheme, e.g., Ito, Stratonovich (Risken, 1996; Oksendal, 2003), Hanggi (Hänggi, 1978), -type etc. An “A-type” interpretation is one such scheme (Ao et al., 2007; Shi et al., 2012). It is widely used in non-equilibrium studies in physics and biology (Wang et al., 2008; Zhu et al., 2004) because it allows one to compute the steady-state distribution easily; its implications are supported by other mathematical analyses such as Tel et al. (1989); Qian (2014).

The main result of the section now follows. It exploits the A-type interpretation to compute the difference between the most likely locations of SGD which are given by the critical points of the potential and those of the original loss .

Theorem 22 (Most likely locations are not the critical points of the loss).

The Ito SDE

 dx=−∇f(x) dt+√2β−1D(x) dW(t)

is equivalent to the A-type SDE (Ao et al., 2007; Shi et al., 2012)

 dx=−(D(x)+Q(x)) ∇Φ(x) dt+√2β−1D(x) dW(t) (18)

with the same steady-state distribution and Fokker-Planck equation Eq. FP if

 (19)

The anti-symmetric matrix and the potential can be explicitly computed in terms of the gradient and the diffusion matrix . The potential does not depend on .

See Section F.4 for the proof. It exploits the fact that the the Ito SDE Eq. 3 and the A-type SDE Eq. 18 should have the same Fokker-Planck equations because they have the same steady-state distributions.

Remark 23 (SGD is far away from critical points).

The time spent by a Markov chain at a state is proportional to its steady-state distribution . While it is easily seen that SGD does not converge in the Cauchy sense due to the stochasticity, it is very surprising that it may spend a significant amount of time away from the critical points of the original loss. If has a large divergence, the set of states with might be drastically different than those with . This is also seen in example Fig. 3(c); in fact, SGD may even converge around a saddle point.

This also closes the logical loop we began in Section 3 where we assumed the existence of and defined the potential using it. Theorems 22 and 20 show that both can be defined uniquely in terms of the original quantities, i.e., the gradient term and the diffusion matrix . There is no ambiguity as to whether the potential results in the steady-state or vice-versa.

Remark 24 (Consistent with the linear case).

Theorem 22 presents a picture that is completely consistent with Lemma 20. If and , or if is a constant like the linear case in Lemma 20, the divergence of in Eq. 19 is zero.

Remark 25 (Out-of-equilibrium effect can be large even if D is constant).

The presence of a with non-zero divergence is the consequence of a non-isotropic and it persists even if is constant and independent of weights . So long as is not isotropic, as we discussed in the beginning of Section 5, there need not exist a function such that at all . This is also seen in our experiments, the diffusion matrix is almost constant with respect to weights for deep networks, but consequences of out-of-equilibrium behavior are still seen in Section 4.2.

Remark 26 (Out-of-equilibrium effect increases with β−1).

The effect predicted by Eq. 19 becomes more pronounced if is large. In other words, small batch-sizes or high learning rates cause SGD to be drastically out-of-equilibrium. Theorem 5 also shows that as , the implicit entropic regularization in SGD vanishes. Observe that these are exactly the conditions under which we typically obtain good generalization performance for deep networks (Keskar et al., 2016; Goyal et al., 2017). This suggests that non-equilibrium behavior in SGD is crucial to obtain good generalization performance, especially for high-dimensional models such as deep networks where such effects are expected to be more pronounced.

5.3 Generalization

It was found that solutions of discrete learning problems that generalize well belong to dense clusters in the weight space (Baldassi et al., 2015, 2016)

. Such dense clusters are exponentially fewer compared to isolated solutions. To exploit these observations, the authors proposed a loss called “local entropy” that is out-of-equilibrium by construction and can find these well-generalizable solutions easily. This idea has also been successful in deep learning where

Chaudhari et al. (2016) modified SGD to seek solutions in “wide minima” with low curvature to obtain improvements in generalization performance as well as convergence rate (Chaudhari et al., 2017a).

Local entropy is a smoothed version of the original loss given by

 fγ(x)=−log(Gγ ∗ e−f(x)),

where is a Gaussian kernel of variance . Even with an isotropic diffusion matrix, the steady-state distribution with as the loss function is . For large values of , the new loss makes the original local minima exponentially less likely. In other words, local entropy does not rely on non-isotropic gradient noise to obtain out-of-equilibrium behavior, it gets it explicitly, by construction. This is also seen in Fig. 3(c): if SGD is drastically out-of-equilibrium, it converges around the “wide” saddle point region at the origin which has a small local entropy.

Actively constructing out-of-equilibrium behavior leads to good generalization in practice. Our evidence that SGD on deep networks itself possesses out-of-equilibrium behavior then indicates that SGD for deep networks generalizes well because of such behavior.

6 Related work

SGD, variational inference and implicit regularization:

The idea that SGD is related to variational inference has been seen in machine learning before (Duvenaud et al., 2016; Mandt et al., 2016) under assumptions such as quadratic steady-states; for instance, see Mandt et al. (2017) for methods to approximate steady-states using SGD. Our results here are very different, we would instead like to understand properties of SGD itself. Indeed, in full generality, SGD performs variational inference using a new potential that it implicitly constructs given an architecture and a dataset.

It is widely believed that SGD is an implicit regularizer, see Zhang et al. (2016); Neyshabur et al. (2017); Shwartz-Ziv and Tishby (2017) among others. This belief stems from its remarkable empirical performance. Our results show that such intuition is very well-placed. Thanks to the special architecture of deep networks where gradient noise is highly non-isotropic, SGD helps itself to a potential with properties that lead to both generalization and acceleration.

SGD and noise:

Noise is often added in SGD to improve its behavior around saddle points for non-convex losses, see Lee et al. (2016); Anandkumar and Ge (2016); Ge et al. (2015). It is also quite indispensable for training deep networks (Hinton and Van Camp, 1993; Srivastava et al., 2014; Kingma et al., 2015; Gulcehre et al., 2016; Achille and Soatto, 2017). There is however a disconnect between these two directions due to the fact that while adding external gradient noise helps in theory, it works poorly in practice (Neelakantan et al., 2015; Chaudhari and Soatto, 2015). Instead, “noise tied to the architecture” works better, e.g., dropout, or small mini-batches. Our results close this gap and show that SGD crucially leverages the highly degenerate noise induced by the architecture.

Yin et al. (2017) construct a scalar measure of the gradient diversity given by , and analyze its effect on the maximum allowed batch-size in the context of distributed optimization.

Markov Chain Monte Carlo:

MCMC methods that sample from a negative log-likelihood have employed the idea of designing a force to accelerate convergence, see Ma et al. (2015) for a thorough survey, or Pavliotis (2016); Kaiser et al. (2017) for a rigorous treatment. We instead compute the potential given and , which necessitates the use of techniques from physics. In fact, our results show that since for deep networks due to non-isotropic gradient noise, very simple algorithms such as SGLD by Welling and Teh (2011) also benefit from the acceleration that their sophisticated counterparts aim for (Ding et al., 2014; Chen et al., 2016).

7 Discussion

The continuous-time point-of-view used in this paper gives access to general principles that govern SGD, such analyses are increasingly becoming popular (Wibisono et al., 2016; Chaudhari et al., 2017b). However, in practice, deep networks are trained for only a few epochs with discrete-time updates. Closing this gap is an important future direction. A promising avenue towards this is that for typical conditions in practice such as small mini-batches or large learning rates, SGD converges to the steady-state distribution quickly (Raginsky et al., 2017).

8 Acknowledgments

PC would like to thank Adam Oberman for introducing him to the JKO functional. The authors would also like to thank Alhussein Fawzi for numerous discussions during the conception of this paper and his contribution to its improvement.

Appendix A Diffusion matrix D(x)

In this section we denote and . Although we drop the dependence of on to keep the notation clear, we emphasize that the diffusion matrix depends on the weights .

a.1 With replacement

Let be

iid random variables in

. We would like to compute

 var(1b b∑j=1 gij)=Ei1,…,ib ⎧⎨⎩(1bb∑j=1 gij−g) (1bb∑j=1 gij−g)⊤⎫⎬⎭.

Note that we have that for any , the random vectors and are independent. We therefore have

 covar(gij,gik)=0=Eij, ik{(gij−g)(gik−g)⊤}

We use this to obtain

 var(1b b∑j=1 gij)=1b2 b∑j=1 var(gij) =1NbN∑k=1 ((gk−g) (gk−g)⊤) =1b (∑Nk=1 gk g⊤kN−g g⊤).

We will set

 D(x)=1N(N∑k=1 gk g⊤k)−g g⊤. (A1)

and assimilate the factor of in the inverse temperature .

a.2 Without replacement

Let us define an indicator random variable

that denotes if an example was sampled in batch . We can show that

 var(1i∈b)=bN−b2N2,

and for ,

 covar(1i∈b,1j∈b)=−b(N−b)N2(N−1).

Similar to Li et al. (2017a), we can now compute

 var(1b N∑k=1 gk 1k∈b) =1b2 var(N∑k=1 gk 1k∈b) =1b2 N∑k=1 gk g⊤k var(1k∈b)+1b2 N∑i,j=1, i≠j gi g⊤j covar(1i∈b,1j∈b) =1b (1−bN) [∑Nk=1gk g⊤kN−1−(1−1N−1) g g⊤].

We will again set

 D(x)=1N−1(N∑k=1gk g⊤k)−(1−1N−1) g g⊤ (A2)

and assimilate the factor of that depends on the batch-size in the inverse temperature .

Appendix B Discussion on creftype 4

Let be as defined in Eq. 11. In non-equilibrium thermodynamics, it is assumed that the local entropy production is a product of the force from Eq. A8 and the probability current from Eq. FP. This assumption in this form was first introduced by Prigogine (1955) based on the works of Onsager (1931a, b). See Frank (2005, Sec. 4.5) for a mathematical treatment and Jaynes (1980) for further discussion. The rate of entropy increase is given by

 β−1 dSidt=∫x∈Ω∇(δFδρ) J(x,t) dx.

This can now be written using Eq. A8 again as

 β−1 dSidt =∫ ρ D:(∇δFδρ) (∇δFδρ)⊤+∫ jρ(∇δFδρ) dx.

The first term in the above expression is non-negative, in order to ensure that , we require

 0 =∫ jρ(∇δFδρ) dx =∫ ∇⋅(jρ) (δFδρ) dx;

where the second equality again follows by integration by parts. It can be shown (Frank, 2005, Sec. 4.5.5) that the condition in creftype 4, viz., , is sufficient to make the above integral vanish and therefore for the entropy generation to be non-negative.

Appendix C Some properties of the force j

The Fokker-Planck equation Eq. FP can be written in terms of the probability current as

 0=ρsst =∇⋅(−j ρss+D ∇Φ ρss−β−1(∇⋅D) ρss+β−1∇⋅(Dρss)) =∇⋅ Jss.

Since we have , from the observation Eq. 7, we also have that

 0=ρsst=∇⋅(D ∇Φ ρss+β−1D ∇ρss),

and consequently,

 0 =∇⋅(j ρss) (A3) ⇒j(x) =Jssρss.

In other words, the conservative force is non-zero only if detailed balance is broken, i.e., . We also have

 0 =∇⋅(j ρss) =ρss(∇⋅j − j⋅∇Φ),

which shows using creftype 4 and for all that is always orthogonal to the gradient of the potential

 0 =j(x)⋅∇Φ(x) (A4) =j(x)⋅∇ρss.

Using the definition of in Eq. 8, we have detailed balance when

 ∇f(x)=D(x) ∇Φ(x)−β−1∇⋅D(x). (A5)

Appendix D Heat equation as a gradient flow

As first discovered in the works of Jordan, Kinderleherer and Otto (Jordan et al., 1998; Otto, 2001), certain partial differential equations can be seen as coming from a variational principle, i.e., they perform steepest descent with respect to functionals of their state distribution. Section 3 is a generalization of this idea, we give a short overview here with the heat equation. The heat equation

 ρt=∇⋅(∇ρ),

can be written as the steepest descent for the Dirichlet energy functional

 12 ∫Ω |∇ρ|2 dx.

However, the same PDE can also be seen as the gradient flow of the negative Shannon entropy in the Wasserstein metric (Santambrogio, 2017, 2015),

 −H(ρ)=∫Ω ρ(x)logρ(x) dx.

More precisely, the sequence of iterated minimization problems

 ρτk+1∈argminρ {−H(ρ)+W22(ρ,ρτk)2τ} (A6)

converges to trajectories of the heat equation as . This equivalence is extremely powerful because it allows us to interpret, and modify, the functional that PDEs such as the heat equation implicitly minimize.

This equivalence is also quite natural, the heat equation describes the probability density of pure Brownian motion: . The Wasserstein point-of-view suggests that Brownian motion maximizes the entropy of its state distribution, while the Dirichlet functional suggests that it minimizes the total-variation of its density. These are equivalent. While the latter has been used extensively in image processing, our paper suggests that the entropic regularization point-of-view is very useful to understand SGD for machine learning.

Appendix E Experimental setup

We consider the following three networks on the MNIST (LeCun et al., 1998) and the CIFAR-10 and CIFAR-100 datasets (Krizhevsky, 2009).

1. small-lenet: a smaller version of LeNet (LeCun et al., 1998)

on MNIST with batch-normalization and dropout (

) after both convolutional layers of and output channels, respectively. The fully-connected layer has hidden units. This network has weights and reaches about training and validation error.

2. small-fc:

a fully-connected network with two-layers, batch-normalization and rectified linear units that takes

down-sampled images of MNIST as input and has hidden units. Experiments in Section 4.2 use a smaller version of this network with hidden units and output classes ( input images); this is called tiny-fc.

3. small-allcnn: this a smaller version of the fully-convolutional network for CIFAR-10 and CIFAR-100 introduced by Springenberg et al. (2014) with batch-normalization and output channels in the first and second block respectively. It has weights and reaches about and training and validation errors, respectively.

We train the above networks with SGD with appropriate learning rate annealing and Nesterov’s momentum set to . We do not use any data-augmentation and pre-process data using global contrast normalization with ZCA for CIFAR-10 and CIFAR-100.

We use networks with about weights to keep the eigen-decomposition of tractable. These networks however possess all the architectural intricacies such as convolutions, dropout, batch-normalization etc. We evaluate using Eq. 2 with the network in evaluation mode.

Appendix F Proofs

f.1 Theorem 5

The -divergence is non-negative: with equality if and only if . The expression in Eq. 11 follows after writing

 logρss=−βΦ−logZ(β).

We now show that with equality only at when reaches its minimum and the Fokker-Planck equation achieves its steady-state. The first variation (Santambrogio, 2015) of computed from Eq. 11 is

 δFδρ(ρ)=Φ(x)+β−1(logρ+1), (A7)

which helps us write the Fokker-Planck equation Eq. FP as

 ρt=∇⋅(−j ρ+ρ D ∇(δFδρ)). (A8)

Together, we can now write

 dF(ρ)dt =∫x∈Ω ρt δFδρ dx

As we show in Appendix B, the first term above is zero due to creftype 4. Under suitable boundary condition on the Fokker-Planck equation which ensures that no probability mass flows across the boundary of the domain , after an integration by parts, the second term can be written as

 dF(ρ)dt =−∫x∈Ω ρ D:(∇ δFδρ(ρ)) (∇ δFδρ(ρ))⊤ dx ≤0.

In the above expression, denotes the matrix dot product . The final inequality with the quadratic form holds because is a covariance matrix. Moreover, we have from Eq. A7 that

 dF(ρss)dt=0.

f.2 Lemma 6

The forward implication can be checked by substituting in the Fokker-Planck equation Eq. FP while the reverse implication is true since otherwise Eq. A4 would not hold.

f.3 Lemma 7

The Fokker-Planck operator written as

 L ρ=∇⋅(−j ρ+ D ∇Φ ρ−β−1 (∇⋅D) ρ+β−1∇⋅(D ρ))

from Eqs. FP and 8 can be split into two operators

 L=LS+LA,

where the symmetric part is

 LSρ=∇⋅(D ∇Φ ρ−β−1 (∇⋅D) ρ+ β−1∇⋅(D ρ)) (A9)

and the anti-symmetric part is

 LAρ =∇⋅(−jρ) =∇⋅(−D ∇Φ ρ+ ∇fρ+β−1(∇⋅D)ρ) (A10) =∇⋅(β−1 D ∇ρ+ ∇fρ+β−1(∇⋅D)ρ).

We first note that does not affect in Theorem 5. For solutions of , we have

 ddtF(ρ) =∫Ω δFδρ ρt dx =∫Ω δFδρ ∇⋅(−j ρ) dx =0,

by creftype 4. The dynamics of the anti-symmetric operator is thus completely deterministic and conserves . In fact, the equation Eq. A10 is known as the Liouville equation (Frank, 2005) and describes the density of a completely deterministic dynamics given by

 ˙x=j(x); (A11)

where from Appendix C. On account of the trajectories of the Liouville operator being deterministic, they are also the most likely ones under the steady-state distribution .

f.4 Theorem 22

All the matrices below depend on the weights ; we suppress this to keep the notation clear. Our original SDE is given by

 dx=−∇f dt+√2β−1 D dW(t).

We will transform the original SDE into a new SDE

 G dx=−∇Φ