Our first result is to show precisely in what sense stochastic gradient descent (SGD) implicitly performs variational inference, as is often claimed informally in the literature. For a loss function with weights , if
is the steady-state distribution over the weights estimated by SGD,
where is the entropy of the distribution and and are the learning rate and batch-size, respectively. The potential , which we characterize explicitly, is related but not necessarily equal to . It is only a function of the architecture and the dataset. This implies that SGD implicitly performs variational inference with a uniform prior, albeit of a different loss than the one used to compute back-propagation gradients.
We next prove that the implicit potential is equal to our chosen loss if and only if the noise in mini-batch gradients is isotropic. This condition, however, is not satisfied for deep networks. Empirically, we find gradient noise to be highly non-isotropic with the rank of its covariance matrix being about of its dimension. Thus, SGD on deep networks implicitly discovers locations where , these are not the locations where . This is our second main result: the most likely locations of SGD are not the local minima, nor the saddle points, of the original loss. The deviation of these critical points, which we compute explicitly scales linearly with and is typically large in practice.
When mini-batch noise is non-isotropic, SGD does not even converge in the classical sense. We prove that, instead of undergoing Brownian motion in the vicinity of a critical point, trajectories have a deterministic component that causes SGD to traverse closed loops in the weight space. We detect such loops using a Fourier analysis of SGD trajectories. We also show through an example that SGD with non-isotropic noise can even converge to stable limit cycles around saddle points.
2 Background on continuous-time SGD
Stochastic gradient descent performs the following updates while training a network where is the learning rate and is the average gradient over a mini-batch ,
We overload notation for both the set of examples in a mini-batch and its size. We assume that weights belong to a compact subset , to ensure appropriate boundary conditions for the evolution of steady-state densities in SGD, although all our results hold without this assumption if the loss grows unbounded as , for instance, with weight decay as a regularizer.
Definition 1 (Diffusion matrix ).
Note that is independent of the learning rate and the batch-size . It only depends on the weights , architecture and loss defined by , and the dataset. We will often discuss two cases: isotropic diffusion when is a scalar multiple of identity, independent of , and non-isotropic diffusion, when is a general function of the weights .
We now construct a stochastic differential equation (SDE) for the discrete-time SGD updates.
Lemma 2 (Continuous-time SGD).
The continuous-time limit of SGD is given by
where is Brownian motion and is the inverse temperature defined as . The steady-state distribution of the weights , evolves according to the Fokker-Planck equation (Risken, 1996, Ito form):
where the notation denotes the divergence for any vector
for any vector; the divergence operator is applied column-wise to matrices such as .
We refer to Li et al. (2017b, Thm. 1) for the proof of the convergence of discrete SGD to Eq. 3. Note that completely captures the magnitude of noise in SGD that depends only upon the learning rate and the mini-batch size .
Assumption 3 (Steady-state distribution exists and is unique).
We assume that the steady-state distribution of the Fokker-Planck equation Eq. FP exists and is unique, this is denoted by and satisfies,
3 SGD performs variational inference
Let us first implicitly define a potential using the steady-state distribution :
up to a constant. The potential depends only on the full-gradient and the diffusion matrix; see Appendix C for a proof. It will be made explicit in Section 5. We express in terms of the potential using a normalizing constant as
which is also the steady-state solution of
as can be verified by direct substitution in Eq. FP.
The above observation is very useful because it suggests that, if can be written in terms of the diffusion matrix and a gradient term , the steady-state distribution of this SDE is easily obtained. We exploit this observation to rewrite in terms a term that gives rise to the above steady-state, the spatial derivative of the diffusion matrix, and the remainder:
interpreted as the part of that cannot be written as for some . We now make an important assumption on which has its origins in thermodynamics.
Assumption 4 (Force is conservative).
We assume that
The Fokker-Planck equation Eq. FP typically models a physical system which exchanges energy with an external environment (Ottinger, 2005; Qian, 2014). In our case, this physical system is the gradient dynamics while the interaction with the environment is through the term involving temperature: . The second law of thermodynamics states that the entropy of a system can never decrease; in Appendix B we show how the above assumption is sufficient to satisfy the second law. We also discuss some properties of in Appendix C that are a consequence of this. The most important is that is always orthogonal to . We illustrate the effects of this assumption in Example 19.
This leads us to the main result of this section.
Theorem 5 (SGD performs variational inference).
decreases monotonically along the trajectories of the Fokker-Planck equation Eq. FP and converges to its minimum, which is zero, at steady-state. Moreover, we also have an energetic-entropic split
Theorem 5, proven in Section F.1, shows that SGD implicitly minimizes a combination of two terms: an “energetic” term, and an “entropic” term. The first is the average potential over a distribution . The steady-state of SGD in Eq. 6
is such that it places most of its probability mass in regions of the parameter space with small values of. The second shows that SGD has an implicit bias towards solutions that maximize the entropy of the distribution .
Note that the energetic term in Eq. 11 has potential , instead of . This is an important fact and the crux of this paper.
Lemma 6 (Potential equals original loss iff isotropic diffusion).
If the diffusion matrix is isotropic, i.e., a constant multiple of the identity, the implicit potential is the original loss itself
This is proven in Section F.2. The definition in Eq. 8 shows that when is non-isotropic. This results in a deterministic component in the SGD dynamics which does not affect the functional , hence is called a “conservative force.” The following lemma is proven in Section F.3.
Lemma 7 (Most likely trajectories of SGD are limit cycles).
3.1 Wasserstein gradient flow
Theorem 5 applies for a general and it is equivalent to the celebrated JKO functional (Jordan et al., 1997) in optimal transportation (Santambrogio, 2015; Villani, 2008) if the diffusion matrix is isotropic. Appendix D provides a brief overview using the heat equation as an example.
Corollary 8 (Wasserstein gradient flow for isotropic noise).
If , trajectories of the Fokker-Planck equation Eq. FP are gradient flow in the Wasserstein metric of the functional
Observe that the energetic term contains in Corollary 8. The proof follows from Lemmas 6 and 5, see Santambrogio (2017) for a rigorous treatment of Wasserstein metrics. The JKO functional above has had an enormous impact in optimal transport because results like Corollaries 8 and 5 provide a way to modify the functional in an interpretable fashion. Modifying the Fokker-Planck equation or the SGD updates directly to enforce regularization properties on the solutions is much harder.
3.2 Connection to Bayesian inference
where is the cross-entropy of the estimated steady-state and the variational prior. The implicit loss function of SGD in Eq. 11 therefore corresponds to a uniform prior . In other words, we have shown that SGD itself performs variational optimization with a uniform prior. Note that this prior is well-defined by our hypothesis of for some compact .
It is important to note that SGD implicitly minimizes a potential instead of the original loss in ELBO. We prove in Section 5 that this potential is quite different from if the diffusion matrix is non-isotropic, in particular, with respect to its critical points.
Remark 9 (SGD has an information bottleneck).
The functional Eq. 11 is equivalent to the information bottleneck principle in representation learning (Tishby et al., 1999). Minimizing this functional, explicitly, has been shown to lead to invariant representations (Achille and Soatto, 2017). Theorem 5 shows that SGD implicitly contains this bottleneck and therefore begets these properties, naturally.
Remark 10 (ELBO prior conflicts with SGD).
Working with ELBO in practice involves one or multiple steps of SGD to minimize the energetic term along with an estimate of the -divergence term, often using a factored Gaussian prior (Kingma and Welling, 2013; Jordan et al., 1999). As Theorem 5 shows, such an approach also enforces a uniform prior whose strength is determined by and conflicts with the externally imposed Gaussian prior. This conflict—which fundamentally arises from using SGD to minimize the energetic term—has resulted in researchers artificially modulating the strength of the -divergence term using a scalar pre-factor (Mandt et al., 2016).
3.3 Practical implications
We will show in Section 5 that the potential does not depend on the optimization process, it is only a function of the dataset and the architecture. The effect of two important parameters, the learning rate and the mini-batch size therefore completely determines the strength of the entropic regularization term. If , the implicit regularization of SGD goes to zero. This implies that
is a good tenet for regularization of SGD.
Remark 11 (Learning rate should scale linearly with batch-size to generalize well).
In order to maintain the entropic regularization, the learning rate needs to scale linearly with the batch-size . This prediction, based on Theorem 5, fits very well with empirical evidence wherein one obtains good generalization performance only with small mini-batches in deep networks (Keskar et al., 2016), or via such linear scaling (Goyal et al., 2017).
Remark 12 (Sampling with replacement is better than without replacement).
The extra factor of reduces the entropic regularization in Eq. 11, as , the inverse temperature . As a consequence, for the same learning rate and batch-size , Theorem 5 predicts that sampling with replacement has better regularization than sampling without replacement. This effect is particularly pronounced at large batch-sizes.
4 Empirical characterization of SGD dynamics
Section 4.1 shows that the diffusion matrix for modern deep networks is highly non-isotropic with a very low rank. We also analyze trajectories of SGD and detect periodic components using a frequency analysis in Section 4.2; this validates the prediction of Lemma 7.
We consider three networks for these experiments: a convolutional network called small-lenet, a two-layer fully-connected network on MNIST (LeCun et al., 1998) and a smaller version of the All-CNN-C architecture of Springenberg et al. (2014) on the CIFAR-10 and CIFAR-100 datasets (Krizhevsky, 2009); see Appendix E for more details.
4.1 Highly non-isotropic for deep networks
of the diffusion matrix. In all cases, it has a large fraction of almost-zero eigenvalues with a very small rank that ranges between- . Moreover, non-zero eigenvalues are spread across a vast range with a large variance.
Remark 13 (Noise in SGD is largely independent of the weights).
The variance of noise in Eq. 3 is
We have plotted the eigenspectra of the diffusion matrix in Fig. 1 and Fig. 2 at three different instants, , and training completion; they are almost indistinguishable. This implies that the variance of the mini-batch gradients in deep networks can be considered a constant, highly non-isotropic matrix.
Remark 14 (More non-isotropic diffusion if data is diverse).
The eigenspectra in Fig. 2 for CIFAR-10 and CIFAR-100 have much larger eigenvalues and standard-deviation than those in
for CIFAR-10 and CIFAR-100 have much larger eigenvalues and standard-deviation than those inFig. 1, this is expected because the images in the CIFAR datasets have more variety than those in MNIST. Similarly, while CIFAR-100 has qualitatively similar images as CIFAR-10, it has more classes and as a result, it is a much harder dataset. This correlates well with the fact that both the mean and standard-deviation of the eigenvalues in Fig. 1(b) are much higher than those in Fig. 1(a). Input augmentation increases the diversity of mini-batch gradients. This is seen in Fig. 1(c) where the standard-deviation of the eigenvalues is much higher as compared to Fig. 1(a).
Remark 15 (Inverse temperature scales with the mean of the eigenspectrum).
Remark 14 shows that the mean of the eigenspectrum is large if the dataset is diverse. Based on this, we propose that the inverse temperature should scale linearly with the mean of the eigenvalues of :
where is the number of weights. This keeps the noise in SGD constant in magnitude for different values of the learning rate , mini-batch size , architectures, and datasets. Note that other hyper-parameters which affect stochasticity such as dropout probability are implicit inside .
Remark 16 (Variance of the eigenspectrum informs architecture search).
Compare the eigenspectra in Figs. 0(b) and 0(a) with those in Figs. 1(c) and 1(a). The former pair shows that small-lenet which is a much better network than small-fc also has a much larger rank, i.e., the number of non-zero eigenvalues ( is symmetric). The second pair shows that for the same dataset, data-augmentation creates a larger variance in the eigenspectrum. This suggests that both the quantities, viz., rank of the diffusion matrix and the variance of the eigenspectrum, inform the performance of a given architecture on the dataset. Note that as discussed in Remark 15, the mean of the eigenvalues can be controlled using the learning rate and the batch-size .
This observation is useful for automated architecture search where we can use the quantity
to estimate the efficacy of a given architecture, possibly, without even training, since does not depend on the weights much. This task currently requires enormous amounts of computational power (Zoph and Le, 2016; Baker et al., 2016; Brock et al., 2017).
shows the Fast Fourier Transform (FFT) ofwhere
is the number of epochs anddenotes the index of the weight. Fig. 2(b) shows the auto-correlation of with confidence bands denoted by the dotted red lines. Both Figs. 2(b) and 2(a) show the mean and one standard-deviation over the weight index ; the standard deviation is very small which indicates that all the weights have a very similar frequency spectrum. Figs. 2(b) and 2(a)
should be compared with the FFT of white noise which should be flat and the auto-correlation of Brownian motion which quickly decays to zero, respectively.Figs. 3 and 2(a) therefore show that trajectories of SGD are not simply Brownian motion. Moreover the gradient at these locations is quite large (Fig. 2(c)).
4.2 Analysis of long-term trajectories
We train a smaller version of small-fc on down-sampled MNIST images for epochs and store snapshots of the weights after each epoch to get a long trajectory in the weight space. We discard the first epochs of training (“burnin”) to ensure that SGD has reached the steady-state. The learning rate is fixed to after this, up to epochs.
Remark 17 (Low-frequency periodic components in SGD trajectories).
Iterates of SGD, after it reaches the neighborhood of a critical point , are expected to perform Brownian motion with variance , the FFT in Fig. 2(a) would be flat if this were so. Instead, we see low-frequency modes in the trajectory that are indicators of a periodic dynamics of the force . These modes are not sharp peaks in the FFT because can be a non-linear function of the weights thereby causing the modes to spread into all dimensions of . The FFT is dominated by jittery high-frequency modes on the right with a slight increasing trend; this suggests the presence of colored noise in SGD at high-frequencies.
The auto-correlation (AC) in Fig. 2(b) should be compared with the AC for Brownian motion which decays to zero very quickly and stays within the red confidence bands (). Our iterates are significantly correlated with each other even at very large lags. This further indicates that trajectories of SGD do not perform Brownian motion.
Remark 18 (Gradient magnitude in deep networks is always large).
Fig. 2(c) shows that the full-gradient computed over the entire dataset (without burnin) does not decrease much with respect to the number of epochs. While it is expected to have a non-zero gradient norm because SGD only converges to a neighborhood of a critical point for non-zero learning rates, the magnitude of this gradient norm is quite large. This magnitude drops only by about a factor of over the next epochs. The presence of a non-zero also explains this, it causes SGD to be away from critical points, this phenomenon is made precise in Theorem 22. Let us note that a similar plot is also seen in Shwartz-Ziv and Tishby (2017) for the per-layer gradient magnitude.
5 SGD for deep networks is out-of-equilibrium
This section now gives an explicit formula for the potential . We also discuss implications of this for generalization in Section 5.3.
The fundamental difficulty in obtaining an explicit expression for is that even if the diffusion matrix is full-rank, there need not exist a function such that at all . We therefore split the analysis into two cases:
a local analysis near any critical point where we linearize and to compute for some , and
the general case where cannot be written as a local rotation and scaling of .
Let us introduce these cases with an example from Noh and Lee (2015).
Example 19 (Double-well potential with limit cycles).
Instead of constructing a diffusion matrix , we will directly construct different gradients that lead to the same potential ; these are equivalent but the later is much easier. The dynamics is given by , where . We pick for some parameter where
5.1 Linearization around a critical point
Without loss of generality, let be a critical point of . This critical point can be a local minimum, maximum, or even a saddle point. We linearize the gradient around the origin and define a fixed matrix (the Hessian) to be . Let be the constant diffusion matrix matrix. The dynamics in Eq. 3 can now be written as
Lemma 20 (Linearization).
The matrix in Eq. 15 can be uniquely decomposed into
and are the symmetric and anti-symmetric parts of a matrix with , to get .
The above lemma is a classical result if the critical point is a local minimum, i.e., if the loss is locally convex near
; this case has also been explored in machine learning before(Mandt et al., 2016). We refer to Kwon et al. (2005) for the proof that linearizes around any critical point.
5.2 General case
We next give the general expression for the deviation of the critical points from those of the original loss .
A-type stochastic integration:
A Fokker-Planck equation is a deterministic partial differential equation (PDE) and every steady-state distribution,in this case, has a unique such PDE that achieves it. However, the same PDE can be tied to different SDEs depending on the stochastic integration scheme, e.g., Ito, Stratonovich (Risken, 1996; Oksendal, 2003), Hanggi (Hänggi, 1978), -type etc. An “A-type” interpretation is one such scheme (Ao et al., 2007; Shi et al., 2012). It is widely used in non-equilibrium studies in physics and biology (Wang et al., 2008; Zhu et al., 2004) because it allows one to compute the steady-state distribution easily; its implications are supported by other mathematical analyses such as Tel et al. (1989); Qian (2014).
The main result of the section now follows. It exploits the A-type interpretation to compute the difference between the most likely locations of SGD which are given by the critical points of the potential and those of the original loss .
Theorem 22 (Most likely locations are not the critical points of the loss).
The Ito SDE
with the same steady-state distribution and Fokker-Planck equation Eq. FP if
The anti-symmetric matrix and the potential can be explicitly computed in terms of the gradient and the diffusion matrix . The potential does not depend on .
See Section F.4 for the proof. It exploits the fact that the the Ito SDE Eq. 3 and the A-type SDE Eq. 18 should have the same Fokker-Planck equations because they have the same steady-state distributions.
Remark 23 (SGD is far away from critical points).
The time spent by a Markov chain at a state is proportional to its steady-state distribution . While it is easily seen that SGD does not converge in the Cauchy sense due to the stochasticity, it is very surprising that it may spend a significant amount of time away from the critical points of the original loss. If has a large divergence, the set of states with might be drastically different than those with . This is also seen in example Fig. 3(c); in fact, SGD may even converge around a saddle point.
This also closes the logical loop we began in Section 3 where we assumed the existence of and defined the potential using it. Theorems 22 and 20 show that both can be defined uniquely in terms of the original quantities, i.e., the gradient term and the diffusion matrix . There is no ambiguity as to whether the potential results in the steady-state or vice-versa.
Remark 24 (Consistent with the linear case).
Remark 25 (Out-of-equilibrium effect can be large even if is constant).
The presence of a with non-zero divergence is the consequence of a non-isotropic and it persists even if is constant and independent of weights . So long as is not isotropic, as we discussed in the beginning of Section 5, there need not exist a function such that at all . This is also seen in our experiments, the diffusion matrix is almost constant with respect to weights for deep networks, but consequences of out-of-equilibrium behavior are still seen in Section 4.2.
Remark 26 (Out-of-equilibrium effect increases with ).
The effect predicted by Eq. 19 becomes more pronounced if is large. In other words, small batch-sizes or high learning rates cause SGD to be drastically out-of-equilibrium. Theorem 5 also shows that as , the implicit entropic regularization in SGD vanishes. Observe that these are exactly the conditions under which we typically obtain good generalization performance for deep networks (Keskar et al., 2016; Goyal et al., 2017). This suggests that non-equilibrium behavior in SGD is crucial to obtain good generalization performance, especially for high-dimensional models such as deep networks where such effects are expected to be more pronounced.
. Such dense clusters are exponentially fewer compared to isolated solutions. To exploit these observations, the authors proposed a loss called “local entropy” that is out-of-equilibrium by construction and can find these well-generalizable solutions easily. This idea has also been successful in deep learning whereChaudhari et al. (2016) modified SGD to seek solutions in “wide minima” with low curvature to obtain improvements in generalization performance as well as convergence rate (Chaudhari et al., 2017a).
Local entropy is a smoothed version of the original loss given by
where is a Gaussian kernel of variance . Even with an isotropic diffusion matrix, the steady-state distribution with as the loss function is . For large values of , the new loss makes the original local minima exponentially less likely. In other words, local entropy does not rely on non-isotropic gradient noise to obtain out-of-equilibrium behavior, it gets it explicitly, by construction. This is also seen in Fig. 3(c): if SGD is drastically out-of-equilibrium, it converges around the “wide” saddle point region at the origin which has a small local entropy.
Actively constructing out-of-equilibrium behavior leads to good generalization in practice. Our evidence that SGD on deep networks itself possesses out-of-equilibrium behavior then indicates that SGD for deep networks generalizes well because of such behavior.
6 Related work
SGD, variational inference and implicit regularization:
The idea that SGD is related to variational inference has been seen in machine learning before (Duvenaud et al., 2016; Mandt et al., 2016) under assumptions such as quadratic steady-states; for instance, see Mandt et al. (2017) for methods to approximate steady-states using SGD. Our results here are very different, we would instead like to understand properties of SGD itself. Indeed, in full generality, SGD performs variational inference using a new potential that it implicitly constructs given an architecture and a dataset.
It is widely believed that SGD is an implicit regularizer, see Zhang et al. (2016); Neyshabur et al. (2017); Shwartz-Ziv and Tishby (2017) among others. This belief stems from its remarkable empirical performance. Our results show that such intuition is very well-placed. Thanks to the special architecture of deep networks where gradient noise is highly non-isotropic, SGD helps itself to a potential with properties that lead to both generalization and acceleration.
SGD and noise:
Noise is often added in SGD to improve its behavior around saddle points for non-convex losses, see Lee et al. (2016); Anandkumar and Ge (2016); Ge et al. (2015). It is also quite indispensable for training deep networks (Hinton and Van Camp, 1993; Srivastava et al., 2014; Kingma et al., 2015; Gulcehre et al., 2016; Achille and Soatto, 2017). There is however a disconnect between these two directions due to the fact that while adding external gradient noise helps in theory, it works poorly in practice (Neelakantan et al., 2015; Chaudhari and Soatto, 2015). Instead, “noise tied to the architecture” works better, e.g., dropout, or small mini-batches. Our results close this gap and show that SGD crucially leverages the highly degenerate noise induced by the architecture.
Yin et al. (2017) construct a scalar measure of the gradient diversity given by , and analyze its effect on the maximum allowed batch-size in the context of distributed optimization.
Markov Chain Monte Carlo:
MCMC methods that sample from a negative log-likelihood have employed the idea of designing a force to accelerate convergence, see Ma et al. (2015) for a thorough survey, or Pavliotis (2016); Kaiser et al. (2017) for a rigorous treatment. We instead compute the potential given and , which necessitates the use of techniques from physics. In fact, our results show that since for deep networks due to non-isotropic gradient noise, very simple algorithms such as SGLD by Welling and Teh (2011) also benefit from the acceleration that their sophisticated counterparts aim for (Ding et al., 2014; Chen et al., 2016).
The continuous-time point-of-view used in this paper gives access to general principles that govern SGD, such analyses are increasingly becoming popular (Wibisono et al., 2016; Chaudhari et al., 2017b). However, in practice, deep networks are trained for only a few epochs with discrete-time updates. Closing this gap is an important future direction. A promising avenue towards this is that for typical conditions in practice such as small mini-batches or large learning rates, SGD converges to the steady-state distribution quickly (Raginsky et al., 2017).
PC would like to thank Adam Oberman for introducing him to the JKO functional. The authors would also like to thank Alhussein Fawzi for numerous discussions during the conception of this paper and his contribution to its improvement.
- Achille and Soatto (2017) Achille, A. and Soatto, S. (2017). On the emergence of invariance and disentangling in deep representations. arXiv:1706.01350.
- Anandkumar and Ge (2016) Anandkumar, A. and Ge, R. (2016). Efficient approaches for escaping higher order saddle points in non-convex optimization. In COLT, pages 81–102.
- Ao et al. (2007) Ao, P., Kwon, C., and Qian, H. (2007). On the existence of potential landscape in the evolution of complex systems. Complexity, 12(4):19–27.
- Baker et al. (2016) Baker, B., Gupta, O., Naik, N., and Raskar, R. (2016). Designing neural network architectures using reinforcement learning. arXiv:1611.02167.
- Baldassi et al. (2016) Baldassi, C., Borgs, C., Chayes, J., Ingrosso, A., Lucibello, C., Saglietti, L., and Zecchina, R. (2016). Unreasonable effectiveness of learning neural networks: From accessible states and robust ensembles to basic algorithmic schemes. PNAS, 113(48):E7655–E7662.
Baldassi et al. (2015)
Baldassi, C., Ingrosso, A., Lucibello, C., Saglietti, L., and Zecchina, R.
Subdominant dense clusters allow for simple learning and high computational performance in neural networks with discrete synapses.Physical review letters, 115(12):128101.
- Brock et al. (2017) Brock, A., Lim, T., Ritchie, J., and Weston, N. (2017). SMASH: One-Shot Model Architecture Search through HyperNetworks. arXiv:1708.05344.
- Chaudhari et al. (2017a) Chaudhari, P., Baldassi, C., Zecchina, R., Soatto, S., Talwalkar, A., and Oberman, A. (2017a). Parle: parallelizing stochastic gradient descent. arXiv:1707.00424.
- Chaudhari et al. (2016) Chaudhari, P., Choromanska, A., Soatto, S., LeCun, Y., Baldassi, C., Borgs, C., Chayes, J., Sagun, L., and Zecchina, R. (2016). Entropy-SGD: biasing gradient descent into wide valleys. arXiv:1611.01838.
- Chaudhari et al. (2017b) Chaudhari, P., Oberman, A., Osher, S., Soatto, S., and Guillame, C. (2017b). Deep Relaxation: partial differential equations for optimizing deep neural networks. arXiv:1704.04932.
- Chaudhari and Soatto (2015) Chaudhari, P. and Soatto, S. (2015). On the energy landscape of deep networks. arXiv:1511.06485.
- Chen et al. (2016) Chen, C., Carlson, D., Gan, Z., Li, C., and Carin, L. (2016). Bridging the gap between stochastic gradient MCMC and stochastic optimization. In AISTATS, pages 1051–1060.
- Ding et al. (2014) Ding, N., Fang, Y., Babbush, R., Chen, C., Skeel, R., and Neven, H. (2014). Bayesian sampling using stochastic gradient thermostats. In NIPS, pages 3203–3211.
- Duvenaud et al. (2016) Duvenaud, D., Maclaurin, D., and Adams, R. (2016). Early stopping as non-parametric variational inference. In AISTATS, pages 1070–1077.
- Frank (2005) Frank, T. D. (2005). Nonlinear Fokker-Planck equations: fundamentals and applications. Springer Science & Business Media.
Ge et al. (2015)
Ge, R., Huang, F., Jin, C., and Yuan, Y. (2015).
Escaping from saddle points — online stochastic gradient for tensor decomposition.In COLT, pages 797–842.
- Goyal et al. (2017) Goyal, P., Dollár, P., Girshick, R., Noordhuis, P., Wesolowski, L., Kyrola, A., Tulloch, A., Jia, Y., and He, K. (2017). Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour. arXiv:1706.02677.
Gulcehre et al. (2016)
Gulcehre, C., Moczulski, M., Denil, M., and Bengio, Y. (2016).
Noisy activation functions.In ICML, pages 3059–3068.
- Hänggi (1978) Hänggi, P. (1978). On derivations and solutions of master equations and asymptotic representations. Zeitschrift für Physik B Condensed Matter, 30(1):85–95.
Hinton and Van Camp (1993)
Hinton, G. E. and Van Camp, D. (1993).
Keeping the neural networks simple by minimizing the description
length of the weights.
Proceedings of the sixth annual conference on Computational learning theory, pages 5–13. ACM.
- Jaynes (1980) Jaynes, E. T. (1980). The minimum entropy production principle. Annual Review of Physical Chemistry, 31(1):579–601.
- Jordan et al. (1999) Jordan, M. I., Ghahramani, Z., Jaakkola, T. S., and Saul, L. K. (1999). An introduction to variational methods for graphical models. Machine learning, 37(2):183–233.
- Jordan et al. (1997) Jordan, R., Kinderlehrer, D., and Otto, F. (1997). Free energy and the fokker-planck equation. Physica D: Nonlinear Phenomena, 107(2-4):265–271.
- Jordan et al. (1998) Jordan, R., Kinderlehrer, D., and Otto, F. (1998). The variational formulation of the Fokker–Planck equation. SIAM journal on mathematical analysis, 29(1):1–17.
- Kaiser et al. (2017) Kaiser, M., Jack, R. L., and Zimmer, J. (2017). Acceleration of convergence to equilibrium in Markov chains by breaking detailed balance. Journal of Statistical Physics, 168(2):259–287.
- Keskar et al. (2016) Keskar, N. S., Mudigere, D., Nocedal, J., Smelyanskiy, M., and Tang, P. T. P. (2016). On large-batch training for deep learning: Generalization gap and sharp minima. arXiv:1609.04836.
- Kingma et al. (2015) Kingma, D. P., Salimans, T., and Welling, M. (2015). Variational dropout and the local reparameterization trick. In NIPS, pages 2575–2583.
- Kingma and Welling (2013) Kingma, D. P. and Welling, M. (2013). Auto-encoding variational Bayes. arXiv:1312.6114.
- Krizhevsky (2009) Krizhevsky, A. (2009). Learning multiple layers of features from tiny images. Master’s thesis, Computer Science, University of Toronto.
- Kwon et al. (2005) Kwon, C., Ao, P., and Thouless, D. J. (2005). Structure of stochastic dynamics near fixed points. Proceedings of the National Academy of Sciences of the United States of America, 102(37):13029–13033.
- LeCun et al. (1998) LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. (1998). Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324.
- Lee et al. (2016) Lee, J. D., Simchowitz, M., Jordan, M. I., and Recht, B. (2016). Gradient descent only converges to minimizers. In COLT, pages 1246–1257.
- Li et al. (2017a) Li, C. J., Li, L., Qian, J., and Liu, J.-G. (2017a). Batch size matters: A diffusion approximation framework on nonconvex stochastic gradient descent. arXiv:1705.07562.
- Li et al. (2017b) Li, Q., Tai, C., and Weinan, E. (2017b). Stochastic modified equations and adaptive stochastic gradient algorithms. In ICML, pages 2101–2110.
- Ma et al. (2015) Ma, Y.-A., Chen, T., and Fox, E. (2015). A complete recipe for stochastic gradient MCMC. In NIPS, pages 2917–2925.
- Mandt et al. (2016) Mandt, S., Hoffman, M., and Blei, D. (2016). A variational analysis of stochastic gradient algorithms. In ICML, pages 354–363.
- Mandt et al. (2017) Mandt, S., Hoffman, M. D., and Blei, D. M. (2017). Stochastic Gradient Descent as Approximate Bayesian Inference. arXiv:1704.04289.
- Neelakantan et al. (2015) Neelakantan, A., Vilnis, L., Le, Q. V., Sutskever, I., Kaiser, L., Kurach, K., and Martens, J. (2015). Adding gradient noise improves learning for very deep networks. arXiv:1511.06807.
- Neyshabur et al. (2017) Neyshabur, B., Tomioka, R., Salakhutdinov, R., and Srebro, N. (2017). Geometry of optimization and implicit regularization in deep learning. arXiv:1705.03071.
Noh and Lee (2015)
Noh, J. D. and Lee, J. (2015).
On the steady-state probability distribution of nonequilibrium stochastic systems.Journal of the Korean Physical Society, 66(4):544–552.
- Oksendal (2003) Oksendal, B. (2003). Stochastic differential equations. Springer.
- Onsager (1931a) Onsager, L. (1931a). Reciprocal relations in irreversible processes. I. Physical review, 37(4):405.
- Onsager (1931b) Onsager, L. (1931b). Reciprocal relations in irreversible processes. II. Physical review, 38(12):2265.
- Ottinger (2005) Ottinger, H. (2005). Beyond equilibrium thermodynamics. John Wiley & Sons.
- Otto (2001) Otto, F. (2001). The geometry of dissipative evolution equations: the porous medium equation.
- Pavliotis (2016) Pavliotis, G. A. (2016). Stochastic processes and applications. Springer.
- Prigogine (1955) Prigogine, I. (1955). Thermodynamics of irreversible processes, volume 404. Thomas.
- Qian (2014) Qian, H. (2014). The zeroth law of thermodynamics and volume-preserving conservative system in equilibrium with stochastic damping. Physics Letters A, 378(7):609–616.
- Raginsky et al. (2017) Raginsky, M., Rakhlin, A., and Telgarsky, M. (2017). Non-convex learning via Stochastic Gradient Langevin Dynamics: a nonasymptotic analysis. arXiv:1702.03849.
- Risken (1996) Risken, H. (1996). The Fokker-Planck Equation. Springer.
- Santambrogio (2015) Santambrogio, F. (2015). Optimal transport for applied mathematicians. Birkäuser, NY.
- Santambrogio (2017) Santambrogio, F. (2017). Euclidean, metric, and Wasserstein gradient flows: an overview. Bulletin of Mathematical Sciences, 7(1):87–154.
- Shi et al. (2012) Shi, J., Chen, T., Yuan, R., Yuan, B., and Ao, P. (2012). Relation of a new interpretation of stochastic differential equations to ito process. Journal of Statistical physics, 148(3):579–590.
- Shwartz-Ziv and Tishby (2017) Shwartz-Ziv, R. and Tishby, N. (2017). Opening the black box of deep neural networks via information. arXiv:1703.00810.
- Springenberg et al. (2014) Springenberg, J., Dosovitskiy, A., Brox, T., and Riedmiller, M. (2014). Striving for simplicity: The all convolutional net. arXiv:1412.6806.
- Srivastava et al. (2014) Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. (2014). Dropout: a simple way to prevent neural networks from overfitting. JMLR, 15(1):1929–1958.
- Tel et al. (1989) Tel, T., Graham, R., and Hu, G. (1989). Nonequilibrium potentials and their power-series expansions. Physical Review A, 40(7):4065.
- Tishby et al. (1999) Tishby, N., Pereira, F. C., and Bialek, W. (1999). The information bottleneck method. In Proc. of the 37-th Annual Allerton Conference on Communication, Control and Computing, pages 368–377.
- Villani (2008) Villani, C. (2008). Optimal transport: old and new, volume 338. Springer Science & Business Media.
- Wang et al. (2008) Wang, J., Xu, L., and Wang, E. (2008). Potential landscape and flux framework of nonequilibrium networks: robustness, dissipation, and coherence of biochemical oscillations. Proceedings of the National Academy of Sciences, 105(34):12271–12276.
- Welling and Teh (2011) Welling, M. and Teh, Y. W. (2011). Bayesian learning via stochastic gradient Langevin dynamics. In ICML, pages 681–688.
- Wibisono et al. (2016) Wibisono, A., Wilson, A. C., and Jordan, M. I. (2016). A variational perspective on accelerated methods in optimization. PNAS, page 201614734.
- Yin et al. (2017) Yin, D., Pananjady, A., Lam, M., Papailiopoulos, D., Ramchandran, K., and Bartlett, P. (2017). Gradient diversity empowers distributed learning. arXiv:1706.05699.
- Yin and Ao (2006) Yin, L. and Ao, P. (2006). Existence and construction of dynamical potential in nonequilibrium processes without detailed balance. Journal of Physics A: Mathematical and General, 39(27):8593.
- Zhang et al. (2016) Zhang, C., Bengio, S., Hardt, M., Recht, B., and Vinyals, O. (2016). Understanding deep learning requires rethinking generalization. arXiv:1611.03530.
- Zhu et al. (2004) Zhu, X.-M., Yin, L., Hood, L., and Ao, P. (2004). Calculating biological behaviors of epigenetic states in the phage life cycle. Functional & integrative genomics, 4(3):188–195.
- Zoph and Le (2016) Zoph, B. and Le, Q. V. (2016). Neural architecture search with reinforcement learning. arXiv:1611.01578.
Appendix A Diffusion matrix
In this section we denote and . Although we drop the dependence of on to keep the notation clear, we emphasize that the diffusion matrix depends on the weights .
a.1 With replacement
iid random variables in. We would like to compute
Note that we have that for any , the random vectors and are independent. We therefore have
We use this to obtain
We will set
and assimilate the factor of in the inverse temperature .
a.2 Without replacement
Appendix B Discussion on creftype 4
Let be as defined in Eq. 11. In non-equilibrium thermodynamics, it is assumed that the local entropy production is a product of the force from Eq. A8 and the probability current from Eq. FP. This assumption in this form was first introduced by Prigogine (1955) based on the works of Onsager (1931a, b). See Frank (2005, Sec. 4.5) for a mathematical treatment and Jaynes (1980) for further discussion. The rate of entropy increase is given by
This can now be written using Eq. A8 again as
The first term in the above expression is non-negative, in order to ensure that , we require
where the second equality again follows by integration by parts. It can be shown (Frank, 2005, Sec. 4.5.5) that the condition in creftype 4, viz., , is sufficient to make the above integral vanish and therefore for the entropy generation to be non-negative.
Appendix C Some properties of the force
The Fokker-Planck equation Eq. FP can be written in terms of the probability current as
Since we have , from the observation Eq. 7, we also have that
In other words, the conservative force is non-zero only if detailed balance is broken, i.e., . We also have
which shows using creftype 4 and for all that is always orthogonal to the gradient of the potential
Using the definition of in Eq. 8, we have detailed balance when
Appendix D Heat equation as a gradient flow
As first discovered in the works of Jordan, Kinderleherer and Otto (Jordan et al., 1998; Otto, 2001), certain partial differential equations can be seen as coming from a variational principle, i.e., they perform steepest descent with respect to functionals of their state distribution. Section 3 is a generalization of this idea, we give a short overview here with the heat equation. The heat equation
can be written as the steepest descent for the Dirichlet energy functional
More precisely, the sequence of iterated minimization problems
converges to trajectories of the heat equation as . This equivalence is extremely powerful because it allows us to interpret, and modify, the functional that PDEs such as the heat equation implicitly minimize.
This equivalence is also quite natural, the heat equation describes the probability density of pure Brownian motion: . The Wasserstein point-of-view suggests that Brownian motion maximizes the entropy of its state distribution, while the Dirichlet functional suggests that it minimizes the total-variation of its density. These are equivalent. While the latter has been used extensively in image processing, our paper suggests that the entropic regularization point-of-view is very useful to understand SGD for machine learning.
Appendix E Experimental setup
a fully-connected network with two-layers, batch-normalization and rectified linear units that takesdown-sampled images of MNIST as input and has hidden units. Experiments in Section 4.2 use a smaller version of this network with hidden units and output classes ( input images); this is called tiny-fc.
small-allcnn: this a smaller version of the fully-convolutional network for CIFAR-10 and CIFAR-100 introduced by Springenberg et al. (2014) with batch-normalization and output channels in the first and second block respectively. It has weights and reaches about and training and validation errors, respectively.
We train the above networks with SGD with appropriate learning rate annealing and Nesterov’s momentum set to . We do not use any data-augmentation and pre-process data using global contrast normalization with ZCA for CIFAR-10 and CIFAR-100.
We use networks with about weights to keep the eigen-decomposition of tractable. These networks however possess all the architectural intricacies such as convolutions, dropout, batch-normalization etc. We evaluate using Eq. 2 with the network in evaluation mode.
Appendix F Proofs
f.1 Theorem 5
The -divergence is non-negative: with equality if and only if . The expression in Eq. 11 follows after writing
which helps us write the Fokker-Planck equation Eq. FP as
Together, we can now write
As we show in Appendix B, the first term above is zero due to creftype 4. Under suitable boundary condition on the Fokker-Planck equation which ensures that no probability mass flows across the boundary of the domain , after an integration by parts, the second term can be written as
In the above expression, denotes the matrix dot product . The final inequality with the quadratic form holds because is a covariance matrix. Moreover, we have from Eq. A7 that
f.2 Lemma 6
f.3 Lemma 7
The Fokker-Planck operator written as
where the symmetric part is
and the anti-symmetric part is
We first note that does not affect in Theorem 5. For solutions of , we have
by creftype 4. The dynamics of the anti-symmetric operator is thus completely deterministic and conserves . In fact, the equation Eq. A10 is known as the Liouville equation (Frank, 2005) and describes the density of a completely deterministic dynamics given by
where from Appendix C. On account of the trajectories of the Liouville operator being deterministic, they are also the most likely ones under the steady-state distribution .
f.4 Theorem 22
All the matrices below depend on the weights ; we suppress this to keep the notation clear. Our original SDE is given by
We will transform the original SDE into a new SDE