1 Introduction
Our first result is to show precisely in what sense stochastic gradient descent (SGD) implicitly performs variational inference, as is often claimed informally in the literature. For a loss function $f(x)$ with weights $x$, if $\rho^{ss}$ is the steady-state distribution over the weights estimated by SGD,
\[
\rho^{ss} = \operatorname*{arg\,min}_{\rho}\; \mathbb{E}_{x \sim \rho}\big[\Phi(x)\big] - \frac{\eta}{2b}\, H(\rho),
\]
where $H(\rho)$ is the entropy of the distribution $\rho$ and $\eta$ and $b$ are the learning rate and batch-size, respectively. The potential $\Phi(x)$, which we characterize explicitly, is related but not necessarily equal to $f(x)$; it is a function only of the architecture and the dataset. This implies that SGD implicitly performs variational inference with a uniform prior, albeit of a different loss than the one used to compute back-propagation gradients.
We next prove that the implicit potential $\Phi(x)$ is equal to our chosen loss $f(x)$ if and only if the noise in mini-batch gradients is isotropic. This condition, however, is not satisfied for deep networks. Empirically, we find gradient noise to be highly non-isotropic, with the rank of its covariance matrix being about 1% of its dimension. Thus, SGD on deep networks implicitly discovers locations where $\nabla \Phi(x) = 0$; these are not the locations where $\nabla f(x) = 0$. This is our second main result: the most likely locations of SGD are not the local minima, nor the saddle points, of the original loss. The deviation of these critical points, which we compute explicitly, scales linearly with $\beta^{-1} = \eta/(2b)$ and is typically large in practice.
When mini-batch noise is non-isotropic, SGD does not even converge in the classical sense. We prove that, instead of undergoing Brownian motion in the vicinity of a critical point, trajectories have a deterministic component that causes SGD to traverse closed loops in the weight space. We detect such loops using a Fourier analysis of SGD trajectories. We also show through an example that SGD with non-isotropic noise can even converge to stable limit cycles around saddle points.
2 Background on continuous-time SGD
Stochastic gradient descent performs the following updates while training a network with weights $x \in \mathbb{R}^d$ on a loss $f(x) = \frac{1}{N} \sum_{k=1}^{N} f_k(x)$, where $\eta$ is the learning rate and $\nabla f_b(x) = \frac{1}{b} \sum_{k \in b} \nabla f_k(x)$ is the average gradient over a mini-batch $b$:
\[
x_{k+1} = x_k - \eta\, \nabla f_b(x_k). \tag{1}
\]
We overload the notation $b$ for both the set of examples in a mini-batch and its size. We assume that the weights belong to a compact subset $\Omega \subset \mathbb{R}^d$, to ensure appropriate boundary conditions for the evolution of steady-state densities in SGD, although all our results hold without this assumption if the loss grows unbounded as $\|x\| \to \infty$, for instance, with weight decay as a regularizer.
Definition 1 (Diffusion matrix $D(x)$).
If a mini-batch is sampled with replacement, we show in Section A.1 that the variance of mini-batch gradients is $\operatorname{var}\big(\nabla f_b(x)\big) = \frac{D(x)}{b}$, where
\[
D(x) = \frac{1}{N} \sum_{k=1}^{N} \nabla f_k(x)\, \nabla f_k(x)^\top \;-\; \nabla f(x)\, \nabla f(x)^\top \;\succeq\; 0. \tag{2}
\]
Note that $D(x)$ is independent of the learning rate $\eta$ and the batch-size $b$. It depends only on the weights $x$, the architecture and loss defined by $f(x)$, and the dataset. We will often discuss two cases: isotropic diffusion, when $D(x)$ is a scalar multiple of the identity, independent of $x$, and non-isotropic diffusion, when $D(x)$ is a general function of the weights $x$.
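As a concrete illustration, the matrix in Eq. 2 can be estimated directly from per-sample gradients. The following sketch is ours (the helper name and the synthetic data are hypothetical); it assumes the per-sample gradients are stacked as rows of a numpy array:

```python
import numpy as np

def diffusion_matrix(per_sample_grads):
    """Empirical diffusion matrix of Eq. 2.

    per_sample_grads: (N, d) array whose k-th row is grad f_k(x) at the
    current weights.  Returns (1/N) sum_k g_k g_k^T - g_bar g_bar^T."""
    g = np.asarray(per_sample_grads, dtype=float)
    g_bar = g.mean(axis=0)                    # full gradient, grad f(x)
    second_moment = g.T @ g / g.shape[0]      # (1/N) sum_k g_k g_k^T
    return second_moment - np.outer(g_bar, g_bar)

# Synthetic stand-in for per-sample gradients of a small model.
rng = np.random.default_rng(0)
G = rng.normal(size=(1000, 5))
D = diffusion_matrix(G)
```

By construction the result is a symmetric positive semi-definite covariance matrix, independent of $\eta$ and $b$.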
We now construct a stochastic differential equation (SDE) for the discrete-time SGD updates.
Lemma 2 (Continuous-time SGD).
The continuous-time limit of SGD is given by
\[
dx(t) = -\nabla f(x)\, dt + \sqrt{2 \beta^{-1} D(x)}\; dW(t), \tag{3}
\]
where $W(t)$ is Brownian motion and $\beta$ is the inverse temperature defined as $\beta^{-1} = \frac{\eta}{2b}$. The distribution of the weights, $\rho(z, t)$, evolves according to the Fokker-Planck equation (Risken, 1996, Ito form):
\[
\rho_t = \nabla \cdot \Big( \nabla f(x)\, \rho + \beta^{-1}\, \nabla \cdot \big( D(x)\, \rho \big) \Big), \tag{FP}
\]
where the notation $\nabla \cdot v = \sum_{i=1}^{d} \partial_{x_i} v_i$ denotes the divergence for any vector $v(x) \in \mathbb{R}^d$; the divergence operator is applied column-wise to matrices such as $D(x)$.

We refer to Li et al. (2017b, Thm. 1) for the proof of the convergence of discrete SGD to Eq. 3. Note that $\beta^{-1}$ completely captures the magnitude of noise in SGD; it depends only upon the learning rate $\eta$ and the mini-batch size $b$.
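The SDE in Eq. 3 can be simulated with a standard Euler-Maruyama discretization. The sketch below is ours, with a hypothetical quadratic loss and isotropic diffusion chosen purely for illustration:

```python
import numpy as np

def sgd_sde_step(x, grad_f, D, beta_inv, dt, rng):
    """One Euler-Maruyama step of  dx = -grad f(x) dt + sqrt(2 beta_inv D(x)) dW."""
    w, V = np.linalg.eigh(2.0 * beta_inv * D(x))      # D(x) is symmetric PSD
    sqrt_cov = V @ np.diag(np.sqrt(np.clip(w, 0.0, None))) @ V.T
    noise = sqrt_cov @ rng.normal(size=x.size) * np.sqrt(dt)
    return x - grad_f(x) * dt + noise

# Hypothetical 2-d quadratic loss f(x) = |x|^2 / 2 with isotropic diffusion;
# iterates fluctuate around the minimum with variance set by beta_inv.
rng = np.random.default_rng(1)
x = np.array([1.0, -1.0])
for _ in range(5000):
    x = sgd_sde_step(x, lambda z: z, lambda z: np.eye(2), 0.05, 0.01, rng)
```

For this isotropic example the iterates settle into a Gibbs-like stationary distribution around the minimum rather than converging to a point, which is the behavior the lemma formalizes.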
Assumption 3 (Steady-state distribution exists and is unique).
We assume that the steady-state distribution of the Fokker-Planck equation Eq. FP exists and is unique; it is denoted by $\rho^{ss}(x)$ and satisfies
\[
0 = \rho^{ss}_t = \nabla \cdot \Big( \nabla f(x)\, \rho^{ss} + \beta^{-1}\, \nabla \cdot \big( D(x)\, \rho^{ss} \big) \Big). \tag{4}
\]
3 SGD performs variational inference
Let us first implicitly define a potential $\Phi(x)$ using the steady-state distribution $\rho^{ss}$:
\[
\Phi(x) = -\beta^{-1} \log \rho^{ss}(x), \tag{5}
\]
up to a constant. The potential $\Phi(x)$ depends only on the full-gradient and the diffusion matrix; see Appendix C for a proof. It will be made explicit in Section 5. We express $\rho^{ss}$ in terms of the potential using a normalizing constant $Z(\beta)$ as
\[
\rho^{ss}(x) = \frac{1}{Z(\beta)}\, e^{-\beta\, \Phi(x)}, \tag{6}
\]
which is also the steady-state solution of
\[
dx = -\big( D(x)\, \nabla \Phi(x) - \beta^{-1}\, \nabla \cdot D(x) \big)\, dt + \sqrt{2 \beta^{-1} D(x)}\; dW(t), \tag{7}
\]
as can be verified by direct substitution in Eq. FP.
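The substitution can be spelled out: with $\rho^{ss} \propto e^{-\beta \Phi}$ we have $\nabla \rho^{ss} = -\beta\, \nabla \Phi\, \rho^{ss}$, so

\[
\beta^{-1}\, \nabla \cdot \big( D\, \rho^{ss} \big)
= \beta^{-1} (\nabla \cdot D)\, \rho^{ss} + \beta^{-1} D\, \nabla \rho^{ss}
= \big( \beta^{-1}\, \nabla \cdot D - D\, \nabla \Phi \big)\, \rho^{ss},
\]

which exactly cancels the drift term $\big( D \nabla \Phi - \beta^{-1} \nabla \cdot D \big)\, \rho^{ss}$; the probability current therefore vanishes and $\rho^{ss}$ is a steady-state.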
The above observation is very useful because it suggests that, if $\nabla f(x)$ can be written in terms of the diffusion matrix $D(x)$ and a gradient term $\nabla \Phi(x)$, the steady-state distribution of this SDE is easily obtained. We exploit this observation and rewrite $\nabla f(x)$ in terms of a term $D(x)\, \nabla \Phi(x)$ that gives rise to the above steady-state, the spatial derivative of the diffusion matrix, and a remainder:
\[
\nabla f(x) = D(x)\, \nabla \Phi(x) - \beta^{-1}\, \nabla \cdot D(x) - j(x), \tag{8}
\]
where $j(x)$ is interpreted as the part of $\nabla f(x)$ that cannot be written as $D(x)\, \nabla \Phi'(x)$ for some $\Phi'$. We now make an important assumption on $j(x)$ which has its origins in thermodynamics.
Assumption 4 (Force $j(x)$ is conservative).
We assume that
\[
\nabla \cdot j(x) = 0. \tag{9}
\]
The Fokker-Planck equation Eq. FP typically models a physical system which exchanges energy with an external environment (Ottinger, 2005; Qian, 2014). In our case, this physical system is the gradient dynamics $-\nabla f(x)$, while the interaction with the environment is through the term involving temperature, $\beta^{-1}\, \nabla \cdot (D(x)\, \rho)$. The second law of thermodynamics states that the entropy of a system can never decrease; in Appendix B we show how the above assumption is sufficient to satisfy the second law. We also discuss some properties of $j(x)$ in Appendix C that are a consequence of this. The most important is that $j(x)$ is always orthogonal to $\nabla \rho^{ss}$. We illustrate the effects of this assumption in Example 19.
This leads us to the main result of this section.
Theorem 5 (SGD performs variational inference).
The functional
\[
F(\rho) = \beta^{-1}\, \mathrm{KL}\big( \rho \,\|\, \rho^{ss} \big) \tag{10}
\]
decreases monotonically along the trajectories of the Fokker-Planck equation Eq. FP and converges to its minimum, which is zero, at steady-state. Moreover, we also have an energetic-entropic split
\[
F(\rho) = \mathbb{E}_{x \sim \rho}\big[ \Phi(x) \big] - \beta^{-1} H(\rho) + \text{constant}. \tag{11}
\]
Theorem 5, proven in Section F.1, shows that SGD implicitly minimizes a combination of two terms: an "energetic" term and an "entropic" term. The first is the average potential $\Phi$ over a distribution $\rho$. The steady-state of SGD in Eq. 6 is such that it places most of its probability mass in regions of the parameter space with small values of $\Phi$. The second shows that SGD has an implicit bias towards solutions that maximize the entropy of the distribution $\rho$.

Note that the energetic term in Eq. 11 has potential $\Phi(x)$, instead of $f(x)$. This is an important fact and the crux of this paper.
Lemma 6 (Potential equals original loss iff isotropic diffusion).
If the diffusion matrix is isotropic, i.e., a constant multiple of the identity $D(x) = c\, I_d$, the implicit potential is the original loss itself, up to a scaling:
\[
\Phi(x) = c^{-1} f(x). \tag{12}
\]
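To see why, note that with $D(x) = c\, I_d$ the SDE in Eq. 3 reduces to Langevin dynamics, whose steady-state is the Gibbs distribution:

\[
dx = -\nabla f(x)\, dt + \sqrt{2 \beta^{-1} c}\; dW(t)
\quad \Longrightarrow \quad
\rho^{ss}(x) \propto e^{-\frac{\beta}{c} f(x)},
\]

so $\Phi(x) = -\beta^{-1} \log \rho^{ss}(x) = c^{-1} f(x)$ up to an additive constant; in particular, $\Phi$ and $f$ have the same critical points.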
This is proven in Section F.2. The definition in Eq. 8 shows that $j(x) \neq 0$ when $D(x)$ is non-isotropic. This results in a deterministic component in the SGD dynamics which does not affect the functional $F(\rho)$; hence $j(x)$ is called a "conservative force." The following lemma is proven in Section F.3.
Lemma 7 (Most likely trajectories of SGD are limit cycles).
The force $j(x)$ does not decrease $F(\rho)$ in Eq. 11 and introduces a deterministic component in SGD given by
\[
\dot{x} = j(x). \tag{13}
\]
The condition $\nabla \cdot j(x) = 0$ in Assumption 4 implies that the most likely trajectories of SGD traverse closed trajectories in the weight space.
3.1 Wasserstein gradient flow
Theorem 5 applies to a general diffusion matrix $D(x)$ and is equivalent to the celebrated JKO functional (Jordan et al., 1997) in optimal transportation (Santambrogio, 2015; Villani, 2008) if the diffusion matrix is isotropic. Appendix D provides a brief overview using the heat equation as an example.
Corollary 8 (Wasserstein gradient flow for isotropic noise).
If $D(x) = I_d$, trajectories of the Fokker-Planck equation Eq. FP are a gradient flow in the Wasserstein metric of the functional
\[
F(\rho) = \mathbb{E}_{x \sim \rho}\big[ f(x) \big] - \beta^{-1} H(\rho). \tag{JKO}
\]
Observe that the energetic term contains $f(x)$ in Corollary 8. The proof follows from Theorem 5 and Lemma 6; see Santambrogio (2017) for a rigorous treatment of Wasserstein metrics. The JKO functional above has had an enormous impact in optimal transport because results like Corollary 8 and Theorem 5 provide a way to modify the functional in an interpretable fashion. Modifying the Fokker-Planck equation or the SGD updates directly to enforce regularization properties on the solutions is much harder.
3.2 Connection to Bayesian inference
Note the absence of any prior in Eq. 11. On the other hand, the evidence lower bound (Kingma and Welling, 2013) for the dataset is
\[
\log p(\text{data}) \;\geq\; -\mathbb{E}_{x \sim \rho}\big[ f(x) \big] + H(\rho) - H(\rho, q), \tag{ELBO}
\]
where $H(\rho, q)$ is the cross-entropy of the estimated steady-state and the variational prior $q(x)$. The implicit loss function of SGD in Eq. 11 therefore corresponds to a uniform prior $q(x)$. In other words, we have shown that SGD itself performs variational optimization with a uniform prior. Note that this prior is well-defined by our hypothesis of $x \in \Omega$ for some compact $\Omega$.
It is important to note that SGD implicitly minimizes a potential $\Phi(x)$ instead of the original loss $f(x)$ in ELBO. We prove in Section 5 that this potential is quite different from $f(x)$ if the diffusion matrix $D(x)$ is non-isotropic, in particular with respect to its critical points.
Remark 9 (SGD has an information bottleneck).
The functional Eq. 11 is equivalent to the information bottleneck principle in representation learning (Tishby et al., 1999). Minimizing this functional explicitly has been shown to lead to invariant representations (Achille and Soatto, 2017). Theorem 5 shows that SGD implicitly contains this bottleneck and therefore begets these properties naturally.
Remark 10 (ELBO prior conflicts with SGD).
Working with ELBO in practice involves one or multiple steps of SGD to minimize the energetic term along with an estimate of the divergence term, often using a factored Gaussian prior (Kingma and Welling, 2013; Jordan et al., 1999). As Theorem 5 shows, such an approach also enforces a uniform prior whose strength is determined by $\beta^{-1}$ and conflicts with the externally imposed Gaussian prior. This conflict, which fundamentally arises from using SGD to minimize the energetic term, has resulted in researchers artificially modulating the strength of the divergence term using a scalar pre-factor (Mandt et al., 2016).
3.3 Practical implications
We will show in Section 5 that the potential $\Phi(x)$ does not depend on the optimization process; it is only a function of the dataset and the architecture. The learning rate $\eta$ and the mini-batch size $b$ therefore completely determine the strength of the entropic regularization term through $\beta^{-1} = \eta/(2b)$. If $\beta^{-1} \to 0$, the implicit regularization of SGD goes to zero. This implies that
\[
\beta^{-1} = \frac{\eta}{2b} \ \text{should not be small}
\]
is a good tenet for regularization of SGD.
Remark 11 (Learning rate should scale linearly with batch-size to generalize well).
In order to maintain the entropic regularization, the learning rate $\eta$ needs to scale linearly with the batch-size $b$ so as to keep $\beta^{-1}$ constant. This prediction, based on Theorem 5, fits very well with empirical evidence wherein one obtains good generalization performance only with small mini-batches in deep networks (Keskar et al., 2016), or via such linear scaling (Goyal et al., 2017).
Remark 12 (Sampling with replacement is better than without replacement).
The diffusion matrix $D(x)$ for the case when mini-batches are sampled without replacement is very close to Eq. 2; see Section A.2. However, the corresponding inverse temperature is
\[
\beta^{-1} = \frac{\eta}{2b} \left( 1 - \frac{b}{N} \right).
\]
The extra factor of $\big(1 - \frac{b}{N}\big)$ reduces the entropic regularization in Eq. 11: as $b \to N$, the inverse temperature $\beta^{-1} \to 0$. As a consequence, for the same learning rate $\eta$ and batch-size $b$, Theorem 5 predicts that sampling with replacement has better regularization than sampling without replacement. This effect is particularly pronounced at large batch-sizes.
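The finite-population factor behind this remark can be checked with a quick Monte-Carlo sketch (scalar synthetic "gradients" stand in for $\nabla f_k(x)$; everything here is illustrative, not from the paper's experiments):

```python
import random, statistics

# Variance of a mini-batch gradient mean: with replacement it is D/b; without
# replacement it picks up the finite-population factor (N - b)/(N - 1), i.e.
# roughly (1 - b/N).
random.seed(0)
N, b, trials = 100, 25, 20000
grads = [random.gauss(0.0, 1.0) for _ in range(N)]
mu = sum(grads) / N
D = sum((g - mu) ** 2 for g in grads) / N          # 1-d analogue of Eq. 2

def mb_mean(with_replacement):
    batch = (random.choices(grads, k=b) if with_replacement
             else random.sample(grads, b))
    return sum(batch) / b

var_with = statistics.pvariance([mb_mean(True) for _ in range(trials)])
var_without = statistics.pvariance([mb_mean(False) for _ in range(trials)])
```

The estimate `var_without` comes out smaller than `var_with` by roughly the factor $(N - b)/(N - 1)$, i.e., less gradient noise and hence less entropic regularization without replacement.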
4 Empirical characterization of SGD dynamics
Section 4.1 shows that the diffusion matrix $D(x)$ for modern deep networks is highly non-isotropic with a very low rank. We also analyze trajectories of SGD and detect periodic components using a frequency analysis in Section 4.2; this validates the prediction of Lemma 7.

We consider networks of three architectures for these experiments: a convolutional network called small-lenet and a two-layer fully-connected network called small-fc, both on MNIST (LeCun et al., 1998), and a smaller version of the All-CNN-C architecture of Springenberg et al. (2014) on the CIFAR-10 and CIFAR-100 datasets (Krizhevsky, 2009); see Appendix E for more details.
4.1 Highly non-isotropic $D(x)$ for deep networks
Figs. 1 and 2 show the eigenspectrum^1 of the diffusion matrix. In all cases, it has a large fraction of almost-zero eigenvalues, with a numerical rank that is a very small fraction of the number of weights. Moreover, the non-zero eigenvalues are spread across a vast range with a large variance.

^1 Eigenvalues are thresholded at $d\, \lambda_{\max}(D)\, \varepsilon_{\text{machine}}$; this criterion for the numerical rank is widely used, for instance, in numpy.

Remark 13 (Noise in SGD is largely independent of the weights).
The variance of the noise in Eq. 3 is
\[
\beta^{-1} D(x) = \frac{\eta}{2b}\, D(x).
\]
We have plotted the eigenspectra of the diffusion matrix in Fig. 1 and Fig. 2 at three different instants during training, up to completion; they are almost indistinguishable. This implies that the variance of the mini-batch gradients in deep networks can be considered a constant, highly non-isotropic matrix.
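The rank computation referenced in the footnote can be sketched as follows; the low-rank gradient structure here is synthetic, standing in for the per-sample gradients of a real network:

```python
import numpy as np

def eigenspectrum_and_rank(D):
    """Eigenvalues of a symmetric PSD matrix and its numerical rank, using the
    same threshold as numpy.linalg.matrix_rank: eigenvalues below
    d * lambda_max * machine_eps are treated as zero."""
    evals = np.linalg.eigvalsh(D)
    tol = D.shape[0] * evals.max() * np.finfo(D.dtype).eps
    return evals, int((evals > tol).sum())

# Synthetic per-sample gradients confined to a 3-d subspace of a
# 50-dimensional weight space, mimicking highly degenerate gradient noise.
rng = np.random.default_rng(2)
G = rng.normal(size=(500, 3)) @ rng.normal(size=(3, 50))
Gc = G - G.mean(axis=0)
D = Gc.T @ Gc / G.shape[0]
evals, rank = eigenspectrum_and_rank(D)
```

For this construction the numerical rank is 3 out of 50, i.e., the diffusion matrix is degenerate in the same qualitative way the experiments report for deep networks.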
Remark 14 (More non-isotropic diffusion if data is diverse).
The eigenspectra in Fig. 2 for CIFAR-10 and CIFAR-100 have much larger eigenvalues and standard-deviation than those in Fig. 1; this is expected because the images in the CIFAR datasets have more variety than those in MNIST. Similarly, while CIFAR-100 has qualitatively similar images to CIFAR-10, it has more classes and, as a result, is a much harder dataset. This correlates well with the fact that both the mean and standard-deviation of the eigenvalues in Fig. 2(b) are much higher than those in Fig. 2(a). Input augmentation increases the diversity of mini-batch gradients. This is seen in Fig. 2(c), where the standard-deviation of the eigenvalues is much higher as compared to Fig. 2(a).

Remark 15 (Inverse temperature scales with the mean of the eigenspectrum).
Remark 14 shows that the mean of the eigenspectrum is large if the dataset is diverse. Based on this, we propose that the inverse temperature $\beta$ should scale linearly with the mean of the eigenvalues of $D$:
\[
\frac{\eta}{2b}\, \frac{\operatorname{tr}(D)}{d} = \text{constant}, \tag{14}
\]
where $d$ is the number of weights. This keeps the magnitude of the noise in SGD constant for different values of the learning rate $\eta$, mini-batch size $b$, architectures, and datasets. Note that other hyper-parameters which affect stochasticity, such as the dropout probability, are implicit inside $D$.
Remark 16 (Variance of the eigenspectrum informs architecture search).
Compare the eigenspectra in Figs. 1(a) and 1(b) with those in Figs. 2(a) and 2(c). The former pair shows that small-lenet, which is a much better network than small-fc, also has a much larger rank, i.e., number of non-zero eigenvalues ($D$ is symmetric positive semi-definite). The second pair shows that, for the same dataset, data-augmentation creates a larger variance in the eigenspectrum. This suggests that both quantities, viz., the rank of the diffusion matrix and the variance of the eigenspectrum, inform the performance of a given architecture on the dataset. Note that, as discussed in Remark 15, the mean of the eigenvalues can be controlled using the learning rate $\eta$ and the batch-size $b$.

This observation is useful for automated architecture search, where we can use the rank and the variance of the eigenspectrum of $D$ to estimate the efficacy of a given architecture, possibly without even training, since $D$ does not depend much on the weights. This task currently requires enormous amounts of computational power (Zoph and Le, 2016; Baker et al., 2016; Brock et al., 2017).
Fig. 3(a) shows the Fast Fourier Transform (FFT) of $x_i(k) - x_i(k-1)$, where $k$ is the number of epochs and $i$ denotes the index of the weight. Fig. 3(b) shows the auto-correlation of $x_i(k)$ with confidence bands denoted by the dotted red lines. Both Figs. 3(a) and 3(b) show the mean and one standard-deviation over the weight index $i$; the standard deviation is very small, which indicates that all the weights have a very similar frequency spectrum. Figs. 3(a) and 3(b) should be compared with the FFT of white noise, which should be flat, and the auto-correlation of Brownian motion, which quickly decays to zero, respectively. Fig. 3 therefore shows that trajectories of SGD are not simply Brownian motion. Moreover, the gradient at these locations is quite large (Fig. 3(c)).

4.2 Analysis of long-term trajectories
We train a smaller version of small-fc on down-sampled MNIST images for a large number of epochs and store snapshots of the weights after each epoch to get a long trajectory in the weight space. We discard the initial epochs of training ("burn-in") to ensure that SGD has reached the steady-state; the learning rate is kept fixed for the remainder of training.
Remark 17 (Low-frequency periodic components in SGD trajectories).
Iterates of SGD, after reaching the neighborhood of a critical point $x^*$, are expected to perform Brownian motion with variance proportional to $\beta^{-1} D(x^*)$; the FFT in Fig. 3(a) would be flat if this were so. Instead, we see low-frequency modes in the trajectory that are indicators of a periodic dynamics of the force $j(x)$. These modes are not sharp peaks in the FFT because $j(x)$ can be a non-linear function of the weights, thereby causing the modes to spread into all dimensions of $x$. The FFT is dominated by jittery high-frequency modes on the right with a slight increasing trend; this suggests the presence of colored noise in SGD at high frequencies.

The auto-correlation (AC) in Fig. 3(b) should be compared with the AC for Brownian motion, which decays to zero very quickly and stays within the red confidence bands. Our iterates are significantly correlated with each other even at very large lags. This further indicates that trajectories of SGD do not perform Brownian motion.
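Both diagnostics can be sketched as follows; the trajectory here is synthetic (a low-frequency loop plus white noise), standing in for the stored SGD snapshots, which are not reproduced in the text:

```python
import numpy as np

def trajectory_diagnostics(x):
    """Power spectrum of the increments x(k) - x(k-1) and normalized
    auto-correlation of a scalar weight trajectory."""
    dx = np.diff(x)
    power = np.abs(np.fft.rfft(dx - dx.mean())) ** 2
    xc = x - x.mean()
    ac = np.correlate(xc, xc, mode="full")[xc.size - 1:]
    return power, ac / ac[0]

# Synthetic trajectory: a deterministic loop with period 256 epochs plus white
# noise; a pure Brownian/white-noise trajectory would have a flat spectrum and
# a quickly decaying auto-correlation.
rng = np.random.default_rng(3)
k = np.arange(4096)
traj = np.sin(2 * np.pi * k / 256) + 0.05 * rng.normal(size=k.size)
power, ac = trajectory_diagnostics(traj)
```

The spectrum of the increments peaks at the loop frequency and the auto-correlation stays large at lags of hundreds of epochs, which are exactly the signatures described above.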
Remark 18 (Gradient magnitude in deep networks is always large).
Fig. 3(c) shows that the full-gradient computed over the entire dataset (without burn-in) does not decrease much with the number of epochs. While a non-zero gradient norm is expected because SGD only converges to a neighborhood of a critical point for non-zero learning rates, the magnitude of this gradient norm is quite large and drops only by a small factor over the course of training. The presence of a non-zero $j(x)$ also explains this: it causes SGD to stay away from critical points, a phenomenon made precise in Theorem 22. Let us note that a similar plot is also seen in Shwartz-Ziv and Tishby (2017) for the per-layer gradient magnitude.
5 SGD for deep networks is out-of-equilibrium
This section gives an explicit formula for the potential $\Phi(x)$. We also discuss the implications of this result for generalization in Section 5.3.

The fundamental difficulty in obtaining an explicit expression for $\Phi$ is that, even if the diffusion matrix $D(x)$ is full-rank, there need not exist a function $\Phi(x)$ such that $\nabla \Phi(x) = D^{-1}(x)\, \nabla f(x)$ at all $x$. We therefore split the analysis into two cases:

(i) a local analysis near any critical point $\nabla f(x) = 0$, where we linearize $\nabla f(x) = A x$ and $\nabla \Phi(x) = U x$ to compute $U$ for some matrices $A, U \in \mathbb{R}^{d \times d}$, and

(ii) the general case, where $\nabla \Phi(x)$ cannot be written as a local rotation and scaling of $\nabla f(x)$.
Let us introduce these cases with an example from Noh and Lee (2015).

Example 19 (Double-well potential with limit cycles).
Define the potential to be a standard double-well in two dimensions, e.g.,
\[
\Phi(x) = \frac{(x_1^2 - 1)^2}{4} + \frac{x_2^2}{2}.
\]
Instead of constructing a diffusion matrix $D(x)$, we will directly construct different gradients $\nabla f(x)$ that lead to the same potential $\Phi$; the two constructions are equivalent, but the latter is much easier. The dynamics is given by $dx = -\nabla f(x)\, dt + \sqrt{2 \beta^{-1}}\, dW(t)$, where $\nabla f(x) = \nabla \Phi(x) - j(x)$. We pick a force $j(x)$, scaled by a parameter $\alpha$, that is divergence-free and orthogonal to $\nabla \Phi(x)$, for instance $j(x) = \alpha\, \big( -\partial_{x_2} \Phi,\; \partial_{x_1} \Phi \big)$. Note that this satisfies Eq. 6 and does not change $\rho^{ss}(x) \propto e^{-\beta \Phi(x)}$. Fig. 4 shows the gradient field along with a discussion.
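A simulation sketch of this example (the specific double-well and the rotated-gradient force are the assumed forms above, with $\alpha$, $\beta^{-1}$ and the step size chosen for illustration):

```python
import math, random

# Assumed forms: Phi(x) = (x1^2 - 1)^2 / 4 + x2^2 / 2 and a divergence-free
# force j = alpha * (-dPhi/dx2, dPhi/dx1), orthogonal to grad Phi, so it
# leaves rho^ss ~ exp(-beta Phi) unchanged while making trajectories circulate.
def grad_phi(x1, x2):
    return (x1 * (x1 * x1 - 1.0), x2)

def step(x1, x2, alpha, beta_inv, dt, rng):
    g1, g2 = grad_phi(x1, x2)
    j1, j2 = -alpha * g2, alpha * g1          # rotated gradient, div j = 0
    s = math.sqrt(2.0 * beta_inv * dt)
    return (x1 + (-g1 + j1) * dt + s * rng.gauss(0.0, 1.0),
            x2 + (-g2 + j2) * dt + s * rng.gauss(0.0, 1.0))

rng = random.Random(4)
x1, x2 = 1.0, 0.0
traj = []
for _ in range(20000):
    x1, x2 = step(x1, x2, alpha=2.0, beta_inv=0.01, dt=0.01, rng=rng)
    traj.append((x1, x2))
```

Because $j$ is orthogonal to $\nabla \Phi$, the radial part of the drift still contracts toward the wells while the rotational part makes the iterates circulate along level sets of $\Phi$ instead of settling down.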
5.1 Linearization around a critical point
Without loss of generality, let $x = 0$ be a critical point of $f(x)$. This critical point can be a local minimum, a local maximum, or even a saddle point. We linearize the gradient around the origin and define a fixed matrix $A$ (the Hessian) so that $\nabla f(x) = A x$. Let $D = D(0)$ be the constant diffusion matrix at the critical point. The dynamics in Eq. 3 can now be written as
\[
dx = -A x\, dt + \sqrt{2 \beta^{-1} D}\; dW(t). \tag{15}
\]
Lemma 20 (Linearization).
The matrix $A$ in Eq. 15 can be uniquely decomposed into
\[
A = (D + Q)\, U; \tag{16}
\]
$D$ and $Q$ are the symmetric and anti-symmetric parts of $A U^{-1}$, with $\Phi(x) = \frac{1}{2}\, x^\top U x$, to get $\rho^{ss}(x) \propto e^{-\beta \Phi(x)}$.
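Under the assumptions of Lemma 20, the decomposition can be checked numerically: the stationary covariance $S$ of the linear SDE solves the Lyapunov equation $A S + S A^\top = 2\beta^{-1} D$, and the symmetric and anti-symmetric parts of $\beta A S$ recover $D$ and $Q$, with $U = (\beta S)^{-1}$. A minimal sketch (the matrices and $\beta$ are chosen arbitrarily):

```python
import numpy as np

# For dx = -A x dt + sqrt(2 beta_inv D) dW, the stationary covariance S solves
# A S + S A^T = 2 beta_inv D.  With G = beta * A S and U = (beta * S)^{-1},
# the symmetric part of G is D, the anti-symmetric part is Q, and
# A = (D + Q) U, as in Eq. 16.
beta = 4.0
A = np.array([[3.0, 1.0], [-1.0, 2.0]])      # arbitrary stable matrix
D = np.array([[2.0, 0.5], [0.5, 1.0]])       # arbitrary PSD diffusion matrix

d = A.shape[0]
I = np.eye(d)
# Solve the Lyapunov equation by vectorization (Kronecker trick).
S = np.linalg.solve(np.kron(I, A) + np.kron(A, I),
                    (2.0 / beta) * D.reshape(-1)).reshape(d, d)

G = beta * A @ S
D_rec = 0.5 * (G + G.T)      # symmetric part: recovers D
Q = 0.5 * (G - G.T)          # anti-symmetric part
U = np.linalg.inv(beta * S)
```

The symmetric part reproduces $D$ exactly because $\frac{\beta}{2}(A S + S A^\top) = D$ by the Lyapunov equation, and $(D + Q)\, U = \beta A S (\beta S)^{-1} = A$ by construction.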
The above lemma is a classical result if the critical point is a local minimum, i.e., if the loss is locally convex near $x = 0$; this case has also been explored in machine learning before (Mandt et al., 2016). We refer to Kwon et al. (2005) for the proof that linearizes around any critical point.

5.2 General case
We next give the general expression for the deviation of the critical points $\nabla \Phi(x) = 0$ from those of the original loss $\nabla f(x) = 0$.

A-type stochastic integration:
A Fokker-Planck equation is a deterministic partial differential equation (PDE) and every steady-state distribution, $\rho^{ss} \propto e^{-\beta \Phi}$ in this case, has a unique such PDE that achieves it. However, the same PDE can be tied to different SDEs depending on the stochastic integration scheme, e.g., Ito, Stratonovich (Risken, 1996; Oksendal, 2003), Hänggi (Hänggi, 1978), etc. An "A-type" interpretation is one such scheme (Ao et al., 2007; Shi et al., 2012). It is widely used in non-equilibrium studies in physics and biology (Wang et al., 2008; Zhu et al., 2004) because it allows one to compute the steady-state distribution easily; its implications are supported by other mathematical analyses such as Tel et al. (1989); Qian (2014).

The main result of the section now follows. It exploits the A-type interpretation to compute the difference between the most likely locations of SGD, which are given by the critical points of the potential $\Phi(x)$, and those of the original loss $f(x)$.
Theorem 22 (Most likely locations are not the critical points of the loss).
The Ito SDE
\[
dx = -\nabla f(x)\, dt + \sqrt{2 \beta^{-1} D(x)}\; dW(t)
\]
is equivalent to the A-type SDE (Ao et al., 2007; Shi et al., 2012)
\[
dx = -\big( D(x) + Q(x) \big)\, \nabla \Phi(x)\, dt + \sqrt{2 \beta^{-1} D(x)}\; dW(t) \tag{18}
\]
with the same steady-state distribution $\rho^{ss}(x) \propto e^{-\beta \Phi(x)}$ and Fokker-Planck equation Eq. FP if
\[
\nabla f(x) = \big( D(x) + Q(x) \big)\, \nabla \Phi(x) - \beta^{-1}\, \nabla \cdot \big( D(x) + Q(x) \big). \tag{19}
\]
The anti-symmetric matrix $Q(x)$ and the potential $\Phi(x)$ can be explicitly computed in terms of the gradient $\nabla f(x)$ and the diffusion matrix $D(x)$. The potential does not depend on $\beta^{-1}$.
See Section F.4 for the proof. It exploits the fact that the Ito SDE Eq. 3 and the A-type SDE Eq. 18 should have the same Fokker-Planck equation because they have the same steady-state distribution.
Remark 23 (SGD is far away from critical points).
The time spent by a Markov chain at a state $x$ is proportional to its steady-state distribution $\rho^{ss}(x)$. While it is easily seen that SGD does not converge in the Cauchy sense due to the stochasticity, it is very surprising that it may spend a significant amount of time away from the critical points of the original loss. If $D(x) + Q(x)$ has a large divergence, the set of states with $\nabla \Phi(x) = 0$ might be drastically different from those with $\nabla f(x) = 0$. This is also seen in the example of Fig. 4(c); in fact, SGD may even converge around a saddle point.

This also closes the logical loop we began in Section 3, where we assumed the existence of $\rho^{ss}$ and defined the potential $\Phi$ using it. Theorem 22 and Lemma 20 show that both can be defined uniquely in terms of the original quantities, i.e., the gradient term $\nabla f(x)$ and the diffusion matrix $D(x)$. There is no ambiguity as to whether the potential $\Phi(x)$ results in the steady-state $\rho^{ss}(x)$ or vice versa.
Remark 24 (Consistent with the linear case).
Theorem 22 presents a picture that is completely consistent with Lemma 20. If $j(x) = 0$ and $Q(x) = 0$, or if $Q(x)$ is a constant like the linear case in Lemma 20, the divergence of $Q(x)$ in Eq. 19 is zero.
Remark 25 (Out-of-equilibrium effect can be large even if $D$ is constant).
The presence of a $Q(x)$ with non-zero divergence is a consequence of a non-isotropic $D(x)$, and it persists even if $D$ is constant and independent of the weights $x$. So long as $D$ is not isotropic, as we discussed at the beginning of Section 5, there need not exist a function $\Phi(x)$ such that $\nabla \Phi(x) = D^{-1} \nabla f(x)$ at all $x$. This is also seen in our experiments: the diffusion matrix is almost constant with respect to the weights for deep networks, but the consequences of out-of-equilibrium behavior are still seen in Section 4.2.
Remark 26 (Out-of-equilibrium effect increases with $\beta^{-1}$).
The effect predicted by Eq. 19 becomes more pronounced if $\beta^{-1} = \frac{\eta}{2b}$ is large. In other words, small batch-sizes or high learning rates cause SGD to be drastically out-of-equilibrium. Theorem 5 also shows that as $\beta^{-1} \to 0$, the implicit entropic regularization in SGD vanishes. Observe that these are exactly the conditions under which we typically obtain good generalization performance for deep networks (Keskar et al., 2016; Goyal et al., 2017). This suggests that non-equilibrium behavior in SGD is crucial to obtain good generalization performance, especially for high-dimensional models such as deep networks, where such effects are expected to be more pronounced.
5.3 Generalization
It was found that solutions of discrete learning problems that generalize well belong to dense clusters in the weight space (Baldassi et al., 2015, 2016). Such dense clusters are exponentially fewer compared to isolated solutions. To exploit these observations, the authors proposed a loss called "local entropy" that is out-of-equilibrium by construction and can find these well-generalizable solutions easily. This idea has also been successful in deep learning, where Chaudhari et al. (2016) modified SGD to seek solutions in "wide minima" with low curvature to obtain improvements in generalization performance as well as convergence rate (Chaudhari et al., 2017a).

Local entropy is a smoothed version of the original loss, given by
\[
f_{\gamma}(x) = -\log \big( G_{\gamma} * e^{-f} \big)(x),
\]
where $G_{\gamma}$ is a Gaussian kernel of variance $\gamma$. Even with an isotropic diffusion matrix, the steady-state distribution with $f_{\gamma}(x)$ as the loss function is $\rho^{ss}_{\gamma}(x) \propto e^{-\beta f_{\gamma}(x)}$. For large values of $\gamma$, the new loss makes the original local minima exponentially less likely. In other words, local entropy does not rely on non-isotropic gradient noise to obtain out-of-equilibrium behavior; it gets it explicitly, by construction. This is also seen in Fig. 4(c): if SGD is drastically out-of-equilibrium, it converges around the "wide" saddle point region at the origin, which has a small local entropy.
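The smoothing above can be sketched numerically; the toy 1-d loss and the value of $\gamma$ below are chosen arbitrarily for illustration:

```python
import math

# Local entropy f_gamma(x) = -log( (G_gamma * exp(-f))(x) ), computed by
# direct numerical convolution on a grid, for a toy 1-d double-well loss.
def local_entropy(f, xs, gamma):
    h = xs[1] - xs[0]
    norm = math.sqrt(2.0 * math.pi * gamma)
    out = []
    for x in xs:
        s = sum(math.exp(-((x - y) ** 2) / (2.0 * gamma) - f(y)) for y in xs)
        out.append(-math.log(s * h / norm))
    return out

f = lambda x: (x * x - 1.0) ** 2            # minima at +/- 1
xs = [0.02 * i - 4.0 for i in range(401)]
sharp = [f(x) for x in xs]
smooth = local_entropy(f, xs, gamma=0.5)
```

The smoothed loss is dramatically flatter than the original one: sharp, isolated minima are washed out while wide regions of low loss survive, which is the mechanism the text describes.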
Actively constructing out-of-equilibrium behavior thus leads to good generalization in practice. Our evidence that SGD on deep networks itself possesses out-of-equilibrium behavior then indicates that SGD for deep networks generalizes well because of such behavior.
6 Related work
SGD, variational inference and implicit regularization:
The idea that SGD is related to variational inference has been seen in machine learning before (Duvenaud et al., 2016; Mandt et al., 2016) under assumptions such as quadratic steady-states; for instance, see Mandt et al. (2017) for methods to approximate steady-states using SGD. Our results here are very different: we would instead like to understand the properties of SGD itself. Indeed, in full generality, SGD performs variational inference using a new potential $\Phi$ that it implicitly constructs given an architecture and a dataset.
It is widely believed that SGD is an implicit regularizer; see Zhang et al. (2016); Neyshabur et al. (2017); Shwartz-Ziv and Tishby (2017), among others. This belief stems from its remarkable empirical performance. Our results show that such intuition is very well-placed. Thanks to the special architecture of deep networks, where gradient noise is highly non-isotropic, SGD helps itself to a potential $\Phi$ with properties that lead to both generalization and acceleration.
SGD and noise:
Noise is often added in SGD to improve its behavior around saddle points of non-convex losses; see Lee et al. (2016); Anandkumar and Ge (2016); Ge et al. (2015). It is also quite indispensable for training deep networks (Hinton and Van Camp, 1993; Srivastava et al., 2014; Kingma et al., 2015; Gulcehre et al., 2016; Achille and Soatto, 2017). There is, however, a disconnect between these two directions: while adding external gradient noise helps in theory, it works poorly in practice (Neelakantan et al., 2015; Chaudhari and Soatto, 2015). Instead, "noise tied to the architecture" works better, e.g., dropout or small mini-batches. Our results close this gap and show that SGD crucially leverages the highly degenerate noise induced by the architecture.
Gradient diversity:
Yin et al. (2017) construct a scalar measure of the gradient diversity given by $\frac{\sum_k \|\nabla f_k(x)\|^2}{\|\sum_k \nabla f_k(x)\|^2}$, and analyze its effect on the maximum allowed batch-size in the context of distributed optimization.
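For concreteness, this measure can be computed as follows (the helper name is ours); identical per-sample gradients give diversity $1/n$, while pairwise-orthogonal gradients give $1$:

```python
# Gradient diversity of Yin et al. (2017):
#   sum_k ||g_k||^2 / || sum_k g_k ||^2,
# with per-sample gradients g_k given as rows of a list of lists.
def gradient_diversity(grads):
    total = [sum(col) for col in zip(*grads)]
    num = sum(sum(v * v for v in g) for g in grads)
    den = sum(v * v for v in total)
    return num / den

# Identical gradients give 1/n; pairwise-orthogonal gradients give 1.
same = [[1.0, 0.0]] * 4
ortho = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
```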
Markov Chain Monte Carlo:
MCMC methods that sample from a given negative log-likelihood have employed the idea of designing a force $j(x)$ to accelerate convergence; see Ma et al. (2015) for a thorough survey, or Pavliotis (2016); Kaiser et al. (2017) for a rigorous treatment. We instead compute the potential $\Phi$ given $\nabla f$ and $D$, which necessitates the use of techniques from physics. In fact, our results show that, since $j(x) \neq 0$ for deep networks due to non-isotropic gradient noise, very simple algorithms such as SGLD by Welling and Teh (2011) also benefit from the acceleration that their sophisticated counterparts aim for (Ding et al., 2014; Chen et al., 2016).
7 Discussion
The continuous-time point-of-view used in this paper gives access to general principles that govern SGD; such analyses are increasingly becoming popular (Wibisono et al., 2016; Chaudhari et al., 2017b). However, in practice, deep networks are trained for only a few epochs with discrete-time updates. Closing this gap is an important future direction. A promising avenue towards this is the fact that, for typical conditions in practice such as small mini-batches or large learning rates, SGD converges to the steady-state distribution quickly (Raginsky et al., 2017).
8 Acknowledgments
PC would like to thank Adam Oberman for introducing him to the JKO functional. The authors would also like to thank Alhussein Fawzi for numerous discussions during the conception of this paper and his contribution to its improvement.
References
 Achille and Soatto (2017) Achille, A. and Soatto, S. (2017). On the emergence of invariance and disentangling in deep representations. arXiv:1706.01350.
 Anandkumar and Ge (2016) Anandkumar, A. and Ge, R. (2016). Efficient approaches for escaping higher order saddle points in nonconvex optimization. In COLT, pages 81–102.
 Ao et al. (2007) Ao, P., Kwon, C., and Qian, H. (2007). On the existence of potential landscape in the evolution of complex systems. Complexity, 12(4):19–27.
 Baker et al. (2016) Baker, B., Gupta, O., Naik, N., and Raskar, R. (2016). Designing neural network architectures using reinforcement learning. arXiv:1611.02167.
 Baldassi et al. (2016) Baldassi, C., Borgs, C., Chayes, J., Ingrosso, A., Lucibello, C., Saglietti, L., and Zecchina, R. (2016). Unreasonable effectiveness of learning neural networks: From accessible states and robust ensembles to basic algorithmic schemes. PNAS, 113(48):E7655–E7662.

 Baldassi et al. (2015) Baldassi, C., Ingrosso, A., Lucibello, C., Saglietti, L., and Zecchina, R. (2015). Subdominant dense clusters allow for simple learning and high computational performance in neural networks with discrete synapses. Physical Review Letters, 115(12):128101.
 Brock et al. (2017) Brock, A., Lim, T., Ritchie, J., and Weston, N. (2017). SMASH: One-Shot Model Architecture Search through HyperNetworks. arXiv:1708.05344.
 Chaudhari et al. (2017a) Chaudhari, P., Baldassi, C., Zecchina, R., Soatto, S., Talwalkar, A., and Oberman, A. (2017a). Parle: parallelizing stochastic gradient descent. arXiv:1707.00424.
 Chaudhari et al. (2016) Chaudhari, P., Choromanska, A., Soatto, S., LeCun, Y., Baldassi, C., Borgs, C., Chayes, J., Sagun, L., and Zecchina, R. (2016). EntropySGD: biasing gradient descent into wide valleys. arXiv:1611.01838.
 Chaudhari et al. (2017b) Chaudhari, P., Oberman, A., Osher, S., Soatto, S., and Guillame, C. (2017b). Deep Relaxation: partial differential equations for optimizing deep neural networks. arXiv:1704.04932.
 Chaudhari and Soatto (2015) Chaudhari, P. and Soatto, S. (2015). On the energy landscape of deep networks. arXiv:1511.06485.
 Chen et al. (2016) Chen, C., Carlson, D., Gan, Z., Li, C., and Carin, L. (2016). Bridging the gap between stochastic gradient MCMC and stochastic optimization. In AISTATS, pages 1051–1060.
 Ding et al. (2014) Ding, N., Fang, Y., Babbush, R., Chen, C., Skeel, R., and Neven, H. (2014). Bayesian sampling using stochastic gradient thermostats. In NIPS, pages 3203–3211.
 Duvenaud et al. (2016) Duvenaud, D., Maclaurin, D., and Adams, R. (2016). Early stopping as nonparametric variational inference. In AISTATS, pages 1070–1077.
 Frank (2005) Frank, T. D. (2005). Nonlinear FokkerPlanck equations: fundamentals and applications. Springer Science & Business Media.

 Ge et al. (2015) Ge, R., Huang, F., Jin, C., and Yuan, Y. (2015). Escaping from saddle points — online stochastic gradient for tensor decomposition. In COLT, pages 797–842.
 Goyal et al. (2017) Goyal, P., Dollár, P., Girshick, R., Noordhuis, P., Wesolowski, L., Kyrola, A., Tulloch, A., Jia, Y., and He, K. (2017). Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour. arXiv:1706.02677.

 Gulcehre et al. (2016) Gulcehre, C., Moczulski, M., Denil, M., and Bengio, Y. (2016). Noisy activation functions. In ICML, pages 3059–3068.
 Hänggi (1978) Hänggi, P. (1978). On derivations and solutions of master equations and asymptotic representations. Zeitschrift für Physik B Condensed Matter, 30(1):85–95.

 Hinton and Van Camp (1993) Hinton, G. E. and Van Camp, D. (1993). Keeping the neural networks simple by minimizing the description length of the weights. In Proceedings of the Sixth Annual Conference on Computational Learning Theory, pages 5–13. ACM.
 Jaynes (1980) Jaynes, E. T. (1980). The minimum entropy production principle. Annual Review of Physical Chemistry, 31(1):579–601.
 Jordan et al. (1999) Jordan, M. I., Ghahramani, Z., Jaakkola, T. S., and Saul, L. K. (1999). An introduction to variational methods for graphical models. Machine learning, 37(2):183–233.
 Jordan et al. (1997) Jordan, R., Kinderlehrer, D., and Otto, F. (1997). Free energy and the Fokker-Planck equation. Physica D: Nonlinear Phenomena, 107(2–4):265–271.
 Jordan et al. (1998) Jordan, R., Kinderlehrer, D., and Otto, F. (1998). The variational formulation of the Fokker–Planck equation. SIAM journal on mathematical analysis, 29(1):1–17.
 Kaiser et al. (2017) Kaiser, M., Jack, R. L., and Zimmer, J. (2017). Acceleration of convergence to equilibrium in Markov chains by breaking detailed balance. Journal of Statistical Physics, 168(2):259–287.
 Keskar et al. (2016) Keskar, N. S., Mudigere, D., Nocedal, J., Smelyanskiy, M., and Tang, P. T. P. (2016). On largebatch training for deep learning: Generalization gap and sharp minima. arXiv:1609.04836.
 Kingma et al. (2015) Kingma, D. P., Salimans, T., and Welling, M. (2015). Variational dropout and the local reparameterization trick. In NIPS, pages 2575–2583.
 Kingma and Welling (2013) Kingma, D. P. and Welling, M. (2013). Autoencoding variational Bayes. arXiv:1312.6114.
 Krizhevsky (2009) Krizhevsky, A. (2009). Learning multiple layers of features from tiny images. Master’s thesis, Computer Science, University of Toronto.
 Kwon et al. (2005) Kwon, C., Ao, P., and Thouless, D. J. (2005). Structure of stochastic dynamics near fixed points. Proceedings of the National Academy of Sciences of the United States of America, 102(37):13029–13033.
 LeCun et al. (1998) LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. (1998). Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324.
 Lee et al. (2016) Lee, J. D., Simchowitz, M., Jordan, M. I., and Recht, B. (2016). Gradient descent only converges to minimizers. In COLT, pages 1246–1257.
 Li et al. (2017a) Li, C. J., Li, L., Qian, J., and Liu, J.-G. (2017a). Batch size matters: A diffusion approximation framework on nonconvex stochastic gradient descent. arXiv:1705.07562.
 Li et al. (2017b) Li, Q., Tai, C., and Weinan, E. (2017b). Stochastic modified equations and adaptive stochastic gradient algorithms. In ICML, pages 2101–2110.
 Ma et al. (2015) Ma, Y.-A., Chen, T., and Fox, E. (2015). A complete recipe for stochastic gradient MCMC. In NIPS, pages 2917–2925.
 Mandt et al. (2016) Mandt, S., Hoffman, M., and Blei, D. (2016). A variational analysis of stochastic gradient algorithms. In ICML, pages 354–363.
 Mandt et al. (2017) Mandt, S., Hoffman, M. D., and Blei, D. M. (2017). Stochastic Gradient Descent as Approximate Bayesian Inference. arXiv:1704.04289.
 Neelakantan et al. (2015) Neelakantan, A., Vilnis, L., Le, Q. V., Sutskever, I., Kaiser, L., Kurach, K., and Martens, J. (2015). Adding gradient noise improves learning for very deep networks. arXiv:1511.06807.
 Neyshabur et al. (2017) Neyshabur, B., Tomioka, R., Salakhutdinov, R., and Srebro, N. (2017). Geometry of optimization and implicit regularization in deep learning. arXiv:1705.03071.

 Noh and Lee (2015) Noh, J. D. and Lee, J. (2015). On the steady-state probability distribution of nonequilibrium stochastic systems. Journal of the Korean Physical Society, 66(4):544–552.
 Oksendal (2003) Oksendal, B. (2003). Stochastic differential equations. Springer.
 Onsager (1931a) Onsager, L. (1931a). Reciprocal relations in irreversible processes. I. Physical review, 37(4):405.
 Onsager (1931b) Onsager, L. (1931b). Reciprocal relations in irreversible processes. II. Physical review, 38(12):2265.
 Öttinger (2005) Öttinger, H. C. (2005). Beyond equilibrium thermodynamics. John Wiley & Sons.
 Otto (2001) Otto, F. (2001). The geometry of dissipative evolution equations: the porous medium equation. Communications in Partial Differential Equations, 26(1–2):101–174.
 Pavliotis (2016) Pavliotis, G. A. (2016). Stochastic processes and applications. Springer.
 Prigogine (1955) Prigogine, I. (1955). Thermodynamics of irreversible processes, volume 404. Thomas.
 Qian (2014) Qian, H. (2014). The zeroth law of thermodynamics and volumepreserving conservative system in equilibrium with stochastic damping. Physics Letters A, 378(7):609–616.
 Raginsky et al. (2017) Raginsky, M., Rakhlin, A., and Telgarsky, M. (2017). Nonconvex learning via Stochastic Gradient Langevin Dynamics: a nonasymptotic analysis. arXiv:1702.03849.
 Risken (1996) Risken, H. (1996). The Fokker-Planck Equation. Springer.
 Santambrogio (2015) Santambrogio, F. (2015). Optimal transport for applied mathematicians. Birkhäuser, NY.
 Santambrogio (2017) Santambrogio, F. (2017). Euclidean, metric, and Wasserstein gradient flows: an overview. Bulletin of Mathematical Sciences, 7(1):87–154.
 Shi et al. (2012) Shi, J., Chen, T., Yuan, R., Yuan, B., and Ao, P. (2012). Relation of a new interpretation of stochastic differential equations to Ito process. Journal of Statistical Physics, 148(3):579–590.
 Shwartz-Ziv and Tishby (2017) Shwartz-Ziv, R. and Tishby, N. (2017). Opening the black box of deep neural networks via information. arXiv:1703.00810.
 Springenberg et al. (2014) Springenberg, J., Dosovitskiy, A., Brox, T., and Riedmiller, M. (2014). Striving for simplicity: The all convolutional net. arXiv:1412.6806.
 Srivastava et al. (2014) Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. (2014). Dropout: a simple way to prevent neural networks from overfitting. JMLR, 15(1):1929–1958.
 Tel et al. (1989) Tel, T., Graham, R., and Hu, G. (1989). Nonequilibrium potentials and their power-series expansions. Physical Review A, 40(7):4065.
 Tishby et al. (1999) Tishby, N., Pereira, F. C., and Bialek, W. (1999). The information bottleneck method. In Proc. of the 37th Annual Allerton Conference on Communication, Control and Computing, pages 368–377.
 Villani (2008) Villani, C. (2008). Optimal transport: old and new, volume 338. Springer Science & Business Media.
 Wang et al. (2008) Wang, J., Xu, L., and Wang, E. (2008). Potential landscape and flux framework of nonequilibrium networks: robustness, dissipation, and coherence of biochemical oscillations. Proceedings of the National Academy of Sciences, 105(34):12271–12276.
 Welling and Teh (2011) Welling, M. and Teh, Y. W. (2011). Bayesian learning via stochastic gradient Langevin dynamics. In ICML, pages 681–688.
 Wibisono et al. (2016) Wibisono, A., Wilson, A. C., and Jordan, M. I. (2016). A variational perspective on accelerated methods in optimization. PNAS, page 201614734.
 Yin et al. (2017) Yin, D., Pananjady, A., Lam, M., Papailiopoulos, D., Ramchandran, K., and Bartlett, P. (2017). Gradient diversity empowers distributed learning. arXiv:1706.05699.
 Yin and Ao (2006) Yin, L. and Ao, P. (2006). Existence and construction of dynamical potential in nonequilibrium processes without detailed balance. Journal of Physics A: Mathematical and General, 39(27):8593.
 Zhang et al. (2016) Zhang, C., Bengio, S., Hardt, M., Recht, B., and Vinyals, O. (2016). Understanding deep learning requires rethinking generalization. arXiv:1611.03530.
 Zhu et al. (2004) Zhu, X.-M., Yin, L., Hood, L., and Ao, P. (2004). Calculating biological behaviors of epigenetic states in the phage λ life cycle. Functional & Integrative Genomics, 4(3):188–195.
 Zoph and Le (2016) Zoph, B. and Le, Q. V. (2016). Neural architecture search with reinforcement learning. arXiv:1611.01578.
Appendix A Diffusion matrix
In this section we denote $\nabla f_k \triangleq \nabla f_k(x)$ and $\nabla f \triangleq \nabla f(x)$. Although we drop the dependence of $D(x)$ on $x$ to keep the notation clear, we emphasize that the diffusion matrix $D$ depends on the weights $x$.
A.1 With replacement
Let $i_1, \ldots, i_b$ be iid random variables in $\{1, 2, \ldots, N\}$. We would like to compute
$$\mathrm{var}\left(\frac{1}{b} \sum_{j=1}^b \nabla f_{i_j}(x)\right).$$
Note that for any $j \neq k$, the random vectors $\nabla f_{i_j}$ and $\nabla f_{i_k}$ are independent. We therefore have
$$\mathrm{var}\left(\frac{1}{b} \sum_{j=1}^b \nabla f_{i_j}\right) = \frac{1}{b}\, \mathrm{var}\left(\nabla f_{i_1}\right).$$
We use this to obtain
$$\mathrm{var}\left(\frac{1}{b} \sum_{j=1}^b \nabla f_{i_j}\right) = \frac{1}{b} \left(\frac{1}{N} \sum_{k=1}^N \nabla f_k\, \nabla f_k^\top - \nabla f\, \nabla f^\top\right).$$
We will set
$$D(x) = \frac{1}{N} \sum_{k=1}^N \nabla f_k\, \nabla f_k^\top - \nabla f\, \nabla f^\top \tag{A1}$$
and assimilate the factor of $b^{-1}$ in the inverse temperature $\beta^{-1} = \frac{\eta}{2b}$.
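The with-replacement computation above is easy to verify numerically. Below is a minimal NumPy sketch (synthetic per-example gradients; all names and sizes are illustrative, not from our experiments) that builds the diffusion matrix of Eq. A1 and checks that the covariance of the mini-batch gradient is $D(x)/b$:

```python
import numpy as np

rng = np.random.default_rng(0)
N, d, b = 50, 3, 8                      # examples, weight dimension, batch-size

grads = rng.standard_normal((N, d))     # synthetic per-example gradients grad f_k
full_grad = grads.mean(axis=0)          # full gradient (1/N) sum_k grad f_k

# Eq. A1: D(x) = (1/N) sum_k grad f_k grad f_k^T - grad f grad f^T
D = grads.T @ grads / N - np.outer(full_grad, full_grad)

# Monte-Carlo covariance of the mini-batch gradient, sampled with replacement.
trials = 200_000
idx = rng.integers(0, N, size=(trials, b))
batch_grads = grads[idx].mean(axis=1)
mc_cov = np.cov(batch_grads.T, bias=True)

# The mini-batch gradient covariance matches D(x)/b; the factor 1/b is the one
# assimilated into the inverse temperature.
print(np.abs(mc_cov - D / b).max())
```

The same sketch also makes it clear that $D(x)$ is a covariance matrix, hence symmetric positive semi-definite.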
A.2 Without replacement
Let us define an indicator random variable $z_k \in \{0, 1\}$ that denotes whether an example $k$ was sampled in a batch of size $b$ drawn without replacement. We can show that
$$\mathbb{E}[z_k] = \frac{b}{N}, \qquad \mathrm{var}(z_k) = \frac{b}{N} \left(1 - \frac{b}{N}\right),$$
and for $p \neq q$,
$$\mathbb{E}[z_p\, z_q] = \frac{b (b-1)}{N (N-1)}.$$
Similar to Li et al. (2017a), we can now compute
$$\mathrm{var}\left(\frac{1}{b} \sum_{k=1}^N z_k\, \nabla f_k\right) = \frac{1}{b} \left(1 - \frac{b}{N}\right) \left(\frac{1}{N-1} \sum_{k=1}^N \nabla f_k\, \nabla f_k^\top - \frac{N}{N-1}\, \nabla f\, \nabla f^\top\right).$$
We will again set
$$D(x) = \frac{1}{N-1} \sum_{k=1}^N \nabla f_k\, \nabla f_k^\top - \frac{N}{N-1}\, \nabla f\, \nabla f^\top \tag{A2}$$
and assimilate the factor of $\frac{1}{b} \left(1 - \frac{b}{N}\right)$ that depends on the batch-size in the inverse temperature $\beta^{-1}$.
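The without-replacement case admits the same numerical sanity check. The sketch below (NumPy, synthetic gradients; sizes are illustrative) draws batches as random subsets and compares the empirical covariance of the mini-batch gradient against $\frac{1}{b}\left(1-\frac{b}{N}\right) D(x)$ with $D(x)$ from Eq. A2:

```python
import numpy as np

rng = np.random.default_rng(1)
N, d, b = 40, 3, 10
grads = rng.standard_normal((N, d))     # synthetic per-example gradients
g = grads.mean(axis=0)

# Eq. A2: D(x) = 1/(N-1) sum_k grad f_k grad f_k^T - N/(N-1) grad f grad f^T
D = grads.T @ grads / (N - 1) - N / (N - 1) * np.outer(g, g)
pred = (1 - b / N) / b * D              # predicted mini-batch gradient covariance

# Sample batches without replacement: take the first b entries of a random
# permutation of the N examples.
trials = 100_000
idx = rng.random((trials, N)).argsort(axis=1)[:, :b]
batch_grads = grads[idx].mean(axis=1)
mc_cov = np.cov(batch_grads.T, bias=True)
print(np.abs(mc_cov - pred).max())
```

Note that as $b \to N$ the predicted covariance vanishes, as it should: the "mini-batch" gradient becomes the deterministic full gradient.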
Appendix B Discussion on Assumption 4
Let $F(\rho)$ be as defined in Eq. 11. In non-equilibrium thermodynamics, it is assumed that the local entropy production is a product of the force from Eq. A8 and the probability current from Eq. FP. This assumption in this form was first introduced by Prigogine (1955) based on the works of Onsager (1931a, b). See Frank (2005, Sec. 4.5) for a mathematical treatment and Jaynes (1980) for further discussion. The rate of entropy increase is given by
This can now be written using Eq. A8 again as
The first term in the above expression is non-negative; in order to ensure that $\frac{dS}{dt} \geq 0$, we require
$$0 = \int j \cdot \nabla \rho\; dx = -\int \rho\; \nabla \cdot j\; dx,$$
where the second equality again follows by integration by parts. It can be shown (Frank, 2005, Sec. 4.5.5) that the condition in Assumption 4, viz. $\nabla \cdot j(x) = 0$, is sufficient to make the above integral vanish and therefore for the entropy generation to be non-negative.
Appendix C Some properties of the force
The Fokker-Planck equation Eq. FP can be written in terms of the probability current $J(x, t)$ as
$$\rho_t = -\nabla \cdot J(x, t).$$
Since we have $\rho^{ss}_t = 0$ at the steady-state, from the observation in Eq. 7 we also have that $\nabla \cdot J^{ss}(x) = 0$, and consequently,
$$J^{ss}(x) = j(x)\; \rho^{ss}(x). \tag{A3}$$
In other words, the conservative force $j(x)$ is non-zero only if detailed balance is broken, i.e., $J^{ss}(x) \neq 0$. We also have
$$0 = \nabla \cdot J^{ss}(x) = \rho^{ss}(x) \left(\nabla \cdot j(x) - \beta\; j(x) \cdot \nabla \Phi(x)\right),$$
which shows, using Assumption 4 and $\rho^{ss}(x) > 0$ for all $x$, that $j(x)$ is always orthogonal to the gradient of the potential:
$$j(x) \cdot \nabla \Phi(x) = 0. \tag{A4}$$
Using the definition of $j(x)$ in Eq. 8, we have detailed balance when
$$\nabla f(x) = D(x)\; \nabla \Phi(x) - \beta^{-1}\; \nabla \cdot D(x). \tag{A5}$$
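These identities can be checked on a linear example in the spirit of Kwon et al. (2005). The NumPy sketch below is an illustration under stated assumptions (constant diffusion, so the $\nabla \cdot D$ term vanishes, and $\beta = 1$; the construction of the matrices is hypothetical): it builds an Ornstein-Uhlenbeck process whose steady-state potential is quadratic and verifies that the force $j(x)$ is divergence-free and orthogonal to $\nabla \Phi$ as in Eq. A4:

```python
import numpy as np

rng = np.random.default_rng(2)
d = 4

# Pick a steady-state covariance S (symmetric PD), a diffusion matrix Dmat
# (symmetric PD) and an antisymmetric Q; setting A = (Dmat + Q) S^{-1}
# guarantees the Lyapunov equation A S + S A^T = 2 Dmat, i.e. the linear SDE
# dx = -A x dt + sqrt(2 Dmat) dW(t) has Gaussian steady-state covariance S.
M = rng.standard_normal((d, d)); S = M @ M.T + d * np.eye(d)
B = rng.standard_normal((d, d)); Dmat = B @ B.T + d * np.eye(d)
C = rng.standard_normal((d, d)); Q = C - C.T
S_inv = np.linalg.inv(S)
A = (Dmat + Q) @ S_inv

def grad_Phi(x):
    # Phi(x) = x^T S^{-1} x / 2 with beta = 1
    return S_inv @ x

def j(x):
    # j(x) = -grad f(x) + Dmat grad Phi(x) = -A x + Dmat S^{-1} x = -Q S^{-1} x
    return -A @ x + Dmat @ grad_Phi(x)

x = rng.standard_normal(d)
print(np.abs(A @ S + S @ A.T - 2 * Dmat).max())   # Lyapunov residual
print(j(x) @ grad_Phi(x))                          # orthogonality, Eq. A4
print(np.trace(Q @ S_inv))                         # divergence of j
```

All three printed quantities vanish up to floating-point error: $j(x) = -Q S^{-1} x$ has divergence $-\mathrm{tr}(Q S^{-1}) = 0$ and $j \cdot \nabla \Phi = -x^\top S^{-1} Q S^{-1} x = 0$ because $S^{-1} Q S^{-1}$ is antisymmetric.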
Appendix D Heat equation as a gradient flow
As first discovered in the works of Jordan, Kinderlehrer and Otto (Jordan et al., 1998; Otto, 2001), certain partial differential equations can be seen as coming from a variational principle, i.e., they perform steepest descent with respect to functionals of their state distribution. Section 3 is a generalization of this idea; we give a short overview here with the heat equation. The heat equation
$$\rho_t = \Delta \rho$$
can be written as the steepest descent for the Dirichlet energy functional
$$\frac{1}{2} \int \left\| \nabla \rho \right\|^2 dx.$$
However, the same PDE can also be seen as the gradient flow of the negative Shannon entropy
$$-H(\rho) = \int \rho\, \log \rho\; dx$$
in the Wasserstein metric (Santambrogio, 2017, 2015).
More precisely, the sequence of iterated minimization problems
$$\rho^{\tau}_{k+1} \in \operatorname*{arg\,min}_{\rho}\; -H(\rho) + \frac{1}{2\tau}\, W_2^2\!\left(\rho,\; \rho^{\tau}_{k}\right) \tag{A6}$$
converges to trajectories of the heat equation as $\tau \to 0$. This equivalence is extremely powerful because it allows us to interpret, and modify, the functional that PDEs such as the heat equation implicitly minimize.
This equivalence is also quite natural: the heat equation describes the probability density of pure Brownian motion, $dx = \sqrt{2}\; dW(t)$. The Wasserstein point-of-view suggests that Brownian motion maximizes the entropy of its state distribution, while the Dirichlet functional suggests that it minimizes the total-variation of its density. These are equivalent. While the latter has been used extensively in image processing, our paper suggests that the entropic regularization point-of-view is very useful for understanding SGD in machine learning.
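The entropic point-of-view can be made concrete numerically. The sketch below (NumPy; grid sizes, the periodic boundary, and the initial bump are illustrative choices) evolves a density with an explicit finite-difference heat equation and checks that the Shannon entropy is non-decreasing along the flow, as the Wasserstein gradient-flow interpretation predicts:

```python
import numpy as np

n = 200
dx = 1.0 / n
x = (np.arange(n) + 0.5) * dx
rho = np.exp(-200 * (x - 0.3) ** 2)    # initial density: a bump at 0.3
rho /= rho.sum() * dx                  # normalize to a probability density

dt = 0.4 * dx ** 2                     # stable explicit time step
entropies = []
for _ in range(2000):
    entropies.append(-(rho * np.log(rho + 1e-30)).sum() * dx)
    # explicit heat step with periodic boundary conditions
    lap = (np.roll(rho, 1) - 2 * rho + np.roll(rho, -1)) / dx ** 2
    rho = rho + dt * lap

# Entropy increases monotonically as the density spreads out.
print(entropies[0], entropies[-1])
```

The discrete update is a doubly-stochastic averaging of neighboring masses, so the monotone entropy increase holds exactly at the discrete level as well, not just in the continuum limit.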
Appendix E Experimental setup
We consider the following three networks on the MNIST (LeCun et al., 1998) and the CIFAR-10 and CIFAR-100 datasets (Krizhevsky, 2009).

small-lenet: a smaller version of LeNet (LeCun et al., 1998) on MNIST with batch-normalization and dropout after both convolutional layers; the final fully-connected layer is also narrower than in the original.

small-fc: a two-layer fully-connected network with batch-normalization and rectified linear units that takes down-sampled images of MNIST as input. Experiments in Section 4.2 use an even smaller version of this network, with fewer hidden units and output classes on further down-sampled inputs; this is called tiny-fc.

small-allcnn: this is a smaller version of the fully-convolutional network for CIFAR-10 and CIFAR-100 introduced by Springenberg et al. (2014), with batch-normalization and fewer output channels in the first and second blocks.
We train the above networks with SGD, with appropriate learning-rate annealing and Nesterov's momentum. We do not use any data-augmentation and pre-process data using global contrast normalization with ZCA for CIFAR-10 and CIFAR-100.
We use networks with a small number of weights to keep the eigen-decomposition of the diffusion matrix $D(x)$ tractable. These networks, however, possess all the architectural intricacies such as convolutions, dropout, and batch-normalization. We evaluate $D(x)$ using Eq. 2 with the network in evaluation mode.
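A minimal sketch of the pre-processing step, global contrast normalization followed by ZCA whitening, is given below. The helper names, toy image sizes, and the regularizer `eps` are illustrative assumptions, not our exact pipeline:

```python
import numpy as np

def gcn(images, eps=1e-8):
    # Global contrast normalization: per-image zero mean and unit rms contrast.
    X = images.reshape(len(images), -1).astype(np.float64)
    X = X - X.mean(axis=1, keepdims=True)
    return X / (np.sqrt((X ** 2).mean(axis=1, keepdims=True)) + eps)

def zca(X, eps=1e-4):
    # ZCA whitening: rotate to the PCA basis, rescale, rotate back.
    Xc = X - X.mean(axis=0)
    cov = Xc.T @ Xc / len(Xc)
    U, s, _ = np.linalg.svd(cov)
    W = U @ np.diag(1.0 / np.sqrt(s + eps)) @ U.T
    return Xc @ W

rng = np.random.default_rng(3)
imgs = rng.random((512, 8, 8))          # random stand-ins for CIFAR images
Xw = zca(gcn(imgs))
cov_w = Xw.T @ Xw / len(Xw)
# The covariance after whitening is close to identity; directions with (near)
# zero variance are damped by eps rather than amplified.
print(np.abs(cov_w - np.eye(cov_w.shape[0])).max())
```

In practice the whitening matrix is fit on the training set and applied unchanged to the validation set.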
Appendix F Proofs
F.1 Theorem 5
The divergence $D_{KL}\left(\rho\, \|\, \rho^{ss}\right)$ is non-negative, with equality if and only if $\rho = \rho^{ss}$. The expression in Eq. 11 follows after writing $\rho^{ss}(x) \propto e^{-\beta \Phi(x)}$ inside the logarithm of the divergence.
We now show that $\frac{dF}{dt} \leq 0$, with equality only at $\rho = \rho^{ss}$, when $F(\rho)$ reaches its minimum and the Fokker-Planck equation achieves its steady-state. The first variation (Santambrogio, 2015) of $F(\rho)$ computed from Eq. 11 is
$$\frac{\delta F}{\delta \rho}(x) = \Phi(x) + \beta^{-1} \left(1 + \log \rho(x)\right), \tag{A7}$$
which helps us write the Fokker-Planck equation Eq. FP as
$$\rho_t = \nabla \cdot \left(\rho\, D\, \nabla \frac{\delta F}{\delta \rho}\right) - \nabla \cdot \left(j\, \rho\right). \tag{A8}$$
Together, we can now write
$$\frac{dF}{dt} = \int \frac{\delta F}{\delta \rho}\; \rho_t\; dx = -\int \frac{\delta F}{\delta \rho}\; \nabla \cdot \left(j\, \rho\right) dx + \int \frac{\delta F}{\delta \rho}\; \nabla \cdot \left(\rho\, D\, \nabla \frac{\delta F}{\delta \rho}\right) dx.$$
As we show in Appendix B, the first term above is zero due to Assumption 4. Under suitable boundary conditions on the Fokker-Planck equation, which ensure that no probability mass flows across the boundary of the domain $\Omega$, after an integration by parts the second term can be written as
$$\int_{\Omega} \frac{\delta F}{\delta \rho}\; \nabla \cdot \left(\rho\, D\, \nabla \frac{\delta F}{\delta \rho}\right) dx = -\int_{\Omega} \rho\, \left(\nabla \frac{\delta F}{\delta \rho}\right)^{\!\top} D\, \left(\nabla \frac{\delta F}{\delta \rho}\right) dx \;\leq\; 0.$$
The final inequality with the quadratic form holds because $D(x)$ is a covariance matrix and hence positive semi-definite. Moreover, we have from Eq. A7 that $\frac{dF}{dt} = 0$ implies $\nabla \frac{\delta F}{\delta \rho} = 0$, i.e., $\rho = \rho^{ss}$.
F.2 Lemma 6
F.3 Lemma 7
The Fokker-Planck operator, written as
$$\rho_t = L\, \rho \triangleq \nabla \cdot \left(\rho\, D\, \nabla \frac{\delta F}{\delta \rho}\right) - \nabla \cdot \left(j\, \rho\right)$$
from Eqs. FP and 8, can be split into two operators,
$$L = L_S + L_A,$$
where the symmetric part is
$$L_S\, \rho = \nabla \cdot \left(\rho\, D\, \nabla \frac{\delta F}{\delta \rho}\right) \tag{A9}$$
and the antisymmetric part is
$$L_A\, \rho = -\nabla \cdot \left(j\, \rho\right). \tag{A10}$$
We first note that $L_A$ does not affect $\frac{dF}{dt}$ in Theorem 5. For solutions of $\rho_t = L_A\, \rho$, we have
$$\frac{dF}{dt} = \int \frac{\delta F}{\delta \rho}\; L_A\, \rho\; dx = -\int \frac{\delta F}{\delta \rho}\; \nabla \cdot \left(j\, \rho\right) dx = 0$$
by Assumption 4. The dynamics of the antisymmetric operator is thus completely deterministic and conserves $F(\rho)$. In fact, the equation Eq. A10 is known as the Liouville equation (Frank, 2005) and describes the density of a completely deterministic dynamics given by
$$\dot{x} = j(x), \tag{A11}$$
where $j(x)$ is the force defined in Appendix C. On account of the trajectories of the Liouville operator being deterministic, they are also the most likely ones under the steady-state distribution $\rho^{ss}$.
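The deterministic dynamics of Eq. A11 can be visualized on a linear example: taking $j(x) = -Q S^{-1} x$ with an antisymmetric $Q$ and a symmetric positive-definite $S$ (a hypothetical two-dimensional instance with $\beta = 1$, so $\Phi(x) = \frac{1}{2} x^\top S^{-1} x$), trajectories conserve the potential and therefore trace closed loops. A NumPy sketch with a classical RK4 integrator:

```python
import numpy as np

S = np.array([[2.0, 0.5],
              [0.5, 1.0]])              # steady-state covariance (symmetric PD)
Q = np.array([[0.0, 1.3],
              [-1.3, 0.0]])             # antisymmetric part of the drift
S_inv = np.linalg.inv(S)

def Phi(y):
    # quadratic potential Phi(x) = x^T S^{-1} x / 2 (beta = 1)
    return 0.5 * y @ S_inv @ y

def f(y):
    # deterministic Liouville dynamics x' = j(x) = -Q S^{-1} x (Eq. A11)
    return -Q @ (S_inv @ y)

x = np.array([1.0, 0.0])
phi0 = Phi(x)
dt = 1e-3
traj = [x.copy()]
for _ in range(20000):
    k1 = f(x); k2 = f(x + 0.5 * dt * k1)
    k3 = f(x + 0.5 * dt * k2); k4 = f(x + dt * k3)
    x = x + dt / 6 * (k1 + 2 * k2 + 2 * k3 + k4)
    traj.append(x.copy())

# Phi is conserved along the trajectory up to integrator accuracy, so the
# orbit stays on a level set of the potential: a closed loop, not a point.
print(abs(Phi(x) - phi0))
```

Since $\nabla \Phi \cdot j = -x^\top S^{-1} Q S^{-1} x = 0$ for antisymmetric $Q$, the conservation holds exactly in the continuum; the sketch only confirms it numerically.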
F.4 Theorem 22
All the matrices below depend on the weights $x$; we suppress this to keep the notation clear. Our original SDE is given by
$$dx = -\nabla f\; dt + \sqrt{2 \beta^{-1} D}\; dW(t).$$
We will transform the original SDE into a new SDE