A Diffusion Theory for Deep Learning Dynamics: Stochastic Gradient Descent Escapes From Sharp Minima Exponentially Fast

02/10/2020 ∙ by Zeke Xie, et al. ∙ The University of Tokyo

Stochastic optimization algorithms, such as Stochastic Gradient Descent (SGD) and its variants, are the mainstream methods for training deep networks in practice. However, the theoretical mechanism behind gradient noise remains to be investigated. Deep learning is known to find flat minima, whose large neighboring region in parameter space gives each nearby weight vector a similarly small error. In this paper, we focus on a fundamental problem in deep learning: how can deep learning usually find flat minima among so many minima? To answer the question, we develop a density diffusion theory (DDT) that reveals the fundamental dynamical mechanism of SGD and deep learning. More specifically, we study how the time to escape from a loss valley depends on the sharpness of minima, gradient noise, and hyperparameters. One of the most interesting findings is that stochastic gradient noise from SGD helps escape from sharp minima exponentially faster than from flat minima, while white noise helps escape from sharp minima only polynomially faster than from flat minima. We also find that large-batch training requires exponentially many iterations to pass through sharp minima and find flat minima. We present direct empirical evidence supporting the proposed theoretical results.


1 Introduction

In recent years, deep learning (LeCun et al., 2015) has achieved great empirical success in various application areas. Due to over-parametrization and the highly complex loss landscapes of deep networks, optimizing deep networks is a difficult task. Stochastic Gradient Descent (SGD) and its variants are the mainstream methods for training deep networks. Empirically, SGD can usually find flat minima among a large number of sharp minima and local minima (Hochreiter and Schmidhuber, 1995, 1997). A growing body of work reports that flat minima are closely related to generalization (Hardt et al., 2016; Zhang et al., 2017; Arpit et al., 2017; Hoffer et al., 2017; Dinh et al., 2017; Neyshabur et al., 2017; Wu et al., 2017; Dziugaite and Roy, 2017; Kleinberg et al., 2018). Some researchers study flatness itself: they try to measure flatness (Hochreiter and Schmidhuber, 1997; Keskar et al., 2017; Sagun et al., 2017; Yao et al., 2018), rescale flatness (Tsuzuku et al., 2019), and find flatter minima (Hoffer et al., 2017; Chaudhari et al., 2017; He et al., 2019). But we still lack a quantitative theory that answers why deep learning finds flat minima with such high probability.

In this paper, we develop a fundamental theory, Density Diffusion Theory, for deep learning dynamics. In particular, we hope to quantitatively analyze the process of learning flat minima from the viewpoint of the dynamical evolution of deep networks during training. Precisely predicting the evolution of a stochastic dynamical system is a nearly impossible task, so we turn to modeling the diffusion process of the probability density of the parameters instead of the parameters themselves. We show that the probability density of the parameters can be well captured in our theoretical framework. We have four main contributions.

  • The proposed theory reveals the fundamental roles of gradient noise, batch size, learning rate, and Hessians in deep learning dynamics.

  • SGD escapes from sharp minima exponentially faster than from flat minima.

  • The number of iterations required to escape from one valley depends exponentially on the ratio of the batch size to the learning rate.

  • The explorable parameter space around one minimum may be reduced to a much lower dimensional space.

2 Background


| Works | Noise | Theoretical Coverage | Escape Time |
| --- | --- | --- | --- |
| Zhu et al. (2019) | Anisotropic | Shallow networks | Qualitative |
| Simsekli et al. (2019) | Isotropic | General models | Polynomial |
| Nguyen et al. (2019) | Isotropic | General models | Polynomial |
| The proposed work | Anisotropic | General models | Exponential |

Table 1: Recent quantitative research on escape time analysis.

We mainly introduce Langevin Dynamics (LD) and its generalized background for deep learning in this section. We consider LD an important theoretical framework that can directly connect the study of flat minima with dynamical equations. Such a theoretical approach was proposed by physicists a long time ago (Kramers, 1940; Van Kampen, 1992), but it has not attracted enough attention from the machine learning community.

LD is motivated and originally derived as a discretization of a stochastic differential equation that leads to a stationary distribution equivalent to the posterior distribution. Machine learning researchers also introduced Stochastic Gradient Langevin Dynamics (SGLD) as Bayesian learning (Welling and Teh, 2011) and Bayesian sampling (Ding et al., 2014). The LD method in machine learning is to inject Gaussian noise into the parameter updates so that the distribution of the parameters will converge to the full posterior distribution rather than just the maximum a posteriori mode.
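As a concrete illustration of this injection scheme, here is a minimal SGLD-style update loop; the quadratic energy, constants, and function names are our own illustrative assumptions rather than a reference implementation from these papers.

```python
import numpy as np

rng = np.random.default_rng(0)

def grad_loss(theta):
    # Illustrative quadratic energy L(theta) = 0.5 * ||theta||^2
    return theta

theta = rng.normal(size=5)
eta, D = 0.01, 1.0          # learning rate and diffusion coefficient (D = 1 targets the posterior)
samples = []
for t in range(20_000):
    noise = rng.normal(size=theta.shape)
    theta = theta - eta * grad_loss(theta) + np.sqrt(2 * D * eta) * noise
    samples.append(theta.copy())

# The iterates approximate the distribution proportional to exp(-L(theta)/D).
print(np.std(np.array(samples)[5000:], axis=0))   # ~1.0 for the unit quadratic
```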

We denote the data samples as $\{x_j\}_{j=1}^{N}$, the parameters as $\theta$, and the training loss function as $L(\theta)$. For simplicity, we denote the training loss as $L$, which also serves as the energy function in LD. LD can be written as

$$\theta_{t+1} = \theta_{t} - \eta \nabla L(\theta_{t}) + \sqrt{2 D \eta}\, \zeta_{t}, \qquad (1)$$

where $\zeta_{t}$ obeys a standard normal distribution $\mathcal{N}(0, I)$, $D$ is the diffusion coefficient, and $\eta$ is the learning rate. Its corresponding continuous-time stochastic differential equation is written as

$$d\theta = -\nabla L(\theta)\, dt + \sqrt{2 D}\, dW_{t}, \qquad (2)$$

where we replace $\eta$ by $dt$ as unit time, and $dW_{t} \sim \mathcal{N}(0, I\, dt)$. We can obtain the corresponding Fokker-Planck equation as

$$\frac{\partial P(\theta, t)}{\partial t} = \nabla \cdot [P(\theta, t)\, \nabla L(\theta)] + D\, \nabla \cdot \nabla P(\theta, t), \qquad (3)$$

where $P(\theta, t)$ is the probability density function of $\theta$ at time $t$. The Fokker-Planck equation describes how the probability density function evolves. It has a stationary solution (Welling and Teh, 2011),

$$P(\theta) = \frac{1}{Z} \exp\left( -\frac{L(\theta)}{D} \right), \qquad (4)$$

where the partition function $Z$ is a normalization factor and $D = T$ plays the role of the temperature in statistical physics. Machine learning researchers usually set $D = 1$ in standard LD, so the resulting solution of LD converges to the exact posterior distribution. If we let $D$ tend to $0$, LD reduces to Gradient Descent and simply finds the maximum a posteriori parameters $\theta_{\mathrm{MAP}}$. We may regard Gradient Descent as a special case of Generalized LD (GLD). Mandt et al. (2017), a pioneer of the GLD perspective, proposed that SGD can be regarded as approximate Bayesian inference. Gradient Descent is used for finding the optimal solution, but only SGD and its variants can find "good" solutions for deep networks. SGLD performs worse than SGD but can still be an effective alternative for training deep networks. Based on Mandt et al. (2017), we introduce how GD, LD, and SGD are related as follows. GD can be written as

$$\theta_{t+1} = \theta_{t} - \eta \nabla L(\theta_{t}), \qquad (5)$$

while SGD can be written as

$$\theta_{t+1} = \theta_{t} - \eta \nabla \hat{L}(\theta_{t}), \qquad (6)$$
$$\nabla \hat{L}(\theta) = \nabla L(\theta) + \xi_{t}, \quad \xi_{t} \sim \mathcal{N}(0, C(\theta)), \qquad (7)$$

where $\hat{L}$ is the loss of one minibatch, and $C(\theta)$ represents the gradient noise covariance matrix. Let us replace $\eta$ by $dt$ as unit time. The continuous-time process corresponding to SGD is written as

$$d\theta = -\nabla L(\theta)\, dt + \sqrt{2 D(\theta)}\, dW_{t}, \qquad (8)$$
$$D(\theta) = \frac{\eta}{2} C(\theta), \qquad (9)$$

where $dW_{t} \sim \mathcal{N}(0, I\, dt)$. In the dynamics of SGD, $D(\theta)$ is a positive semi-definite diffusion matrix rather than a diffusion coefficient. The Fokker-Planck equation with a diffusion matrix is

$$\frac{\partial P(\theta, t)}{\partial t} = \nabla \cdot [P(\theta, t)\, \nabla L(\theta)] + \nabla \cdot \nabla \cdot [D(\theta) P(\theta, t)] \qquad (10)$$
$$= \sum_{i} \frac{\partial}{\partial \theta_{i}} \left[ P \frac{\partial L}{\partial \theta_{i}} \right] + \sum_{i, j} \frac{\partial^{2}}{\partial \theta_{i} \partial \theta_{j}} \left[ D_{ij}(\theta) P \right], \qquad (11)$$

where $\nabla$ is the nabla operator, and $D_{ij}$ is the element in the $i$th row and $j$th column of $D(\theta)$. We note that the essential difference between SGD and GD/LD lies in the diffusion matrix, which is not only anisotropic but also dynamical in SGD. Studying the Fokker-Planck equation corresponding to SGD is the starting point of our theory.
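To see that this diffusion matrix is genuinely anisotropic, one can estimate the minibatch gradient noise covariance $C(\theta)$ directly from resampled minibatch gradients. The toy linear model below is our own assumption, not the paper's setup.

```python
import numpy as np

rng = np.random.default_rng(0)
N, dim, B = 10_000, 2, 16
X = rng.normal(size=(N, dim)) * np.array([5.0, 0.5])     # strongly anisotropic inputs
y = X @ np.array([1.0, 1.0])
theta = np.zeros(dim)

# Sample many minibatch gradients of the squared loss at a fixed theta.
grads = []
for _ in range(2000):
    idx = rng.choice(N, size=B, replace=False)
    Xb, yb = X[idx], y[idx]
    grads.append(Xb.T @ (Xb @ theta - yb) / B)
C = np.cov(np.array(grads).T)                            # empirical gradient noise covariance

print(np.round(np.linalg.eigvalsh(C), 3))                # widely spread eigenvalues: anisotropic noise
```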

Zhu et al. (2019) studied anisotropic gradient noise and presented a theoretical analysis that applies only to very shallow networks under complex assumptions. Wu et al. (2018) studied the escape behavior of SGD from a dynamical perspective and obtained qualitative conclusions on batch size, learning rate, and sharpness. Hu et al. (2019) theoretically showed that the escape dynamics of SGD depends exponentially on the inverse learning rate, but provided no direct empirical evidence of the exponential relation.

Nguyen et al. (2019) mainly contributes to closing the theoretical gap between continuous-time and discrete-time dynamics. It also derives a polynomial relation between the mean escape time and the width of loss valleys in the case of isotropic Lévy noise. This polynomial relation is a well-studied conclusion in stochastic process theory (Imkeller and Pavlyukevich, 2006).

Simsekli et al. (2019) first argued that stochastic gradient noise is Lévy noise (stable variables) rather than Gaussian noise. They presented strong empirical evidence that stochastic gradient noise is heavy-tailed, and that the observed heavy-tailed distribution is closer to a stable distribution than to a Gaussian distribution. The empirical evidence in Simsekli et al. (2019) is actually based on a hidden assumption that stochastic gradient noise is isotropic and obeys the same distribution along each dimension. We tend to believe Simsekli et al. (2019) provides a good empirical analysis of stochastic gradient noise under the isotropic assumption: their empirical results actually support that the scales of the dimension-wise components of the gradient noise are random variables obeying a single stable distribution.

However, we focus on proposing a quantitative theory that can deal with stochastic gradient noise that is anisotropic and dynamical. The LD approach requires us to consider that the diffusion term in Langevin Diffusion arises from the random noise produced by minibatch sampling, not from different dimensions. The viewpoint that the gradient noise, as a stochastic process, is not an isotropic Lévy process but an anisotropic Gaussian process helps us capture deep learning dynamics more accurately. Our empirical results also support that the mean escape time depends exponentially on a valley's sharpness, whereas Lévy noise only produces a polynomial relation. What is more, according to the Generalized Central Limit Theorem (Gnedenko et al., 1954), the mean of many infinite-variance random variables converges to a stable distribution, while the mean of many finite-variance random variables converges to a Gaussian distribution. We think it is reasonable to assume that gradient noise has finite variance in practice.
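A quick simulation illustrates this dichotomy; the chosen distributions (exponential for the finite-variance case, Pareto with tail index 1.5 for the infinite-variance case) are our own illustrative picks.

```python
import numpy as np

rng = np.random.default_rng(0)
n, trials = 10_000, 2_000

# Finite variance: sample means concentrate like a Gaussian (classical CLT).
finite = rng.exponential(size=(trials, n)).mean(axis=1)

# Infinite variance (Pareto tail index 1.5 < 2): means converge to a stable law,
# so extreme sample means keep appearing no matter how large n is.
heavy = (rng.pareto(1.5, size=(trials, n)) + 1).mean(axis=1)

print("finite-variance means: std =", finite.std())
print("heavy-tailed means:    max =", heavy.max(), " (dominated by rare jumps)")
```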

Table 1 summarizes the differences between related works and the proposed work. To the best of our knowledge, anisotropic diffusion has not previously been introduced into the theoretical analysis of general gradient-based nonconvex learning. Anisotropic diffusion not only makes the stationary distribution of SGD distinct from the posterior of SGLD, but also makes deep learning powerful. A more fundamental theory that incorporates learning rate, batch size, and sharpness into deep learning dynamics is an important research direction.

3 Density Diffusion Theory

3.1 Diffusion Theory of SGLD

We first theoretically analyze the diffusion of SGLD in this section. SGLD is often used for deep learning, Bayesian learning, and other nonconvex learning problems. We note that we only propose the theory for continuous-time deep learning dynamics in the following analysis. Continuous-time learning dynamics governed by Langevin Diffusion is approximately equivalent to its Euler discretization when the learning rate is small enough, as theoretically guaranteed by Sato and Nakagawa (2014); Nguyen et al. (2019).

We write the discrete-time dynamics of SGLD as

$$\theta_{t+1} = \theta_{t} - \eta \nabla \hat{L}(\theta_{t}) + \sqrt{2 D_{0} \eta}\, \zeta_{t}, \quad \zeta_{t} \sim \mathcal{N}(0, I), \qquad (12)$$
$$D = D_{0} I + \frac{\eta}{2} C(\theta), \qquad (13)$$

where $D_{0} I$ indicates the diffusion matrix term from the injected gradient noise. We note that there are two sources of gradient noise in SGLD. The variance contributed by the stochastic gradient noise is $O(\eta^{2})$ per update, while the variance of the injected noise is $O(\eta)$. In the limit of $\eta \rightarrow 0$, the injected noise dominates the stochastic gradient noise. Thus the dynamics of SGLD is also governed by Equation 9, where the diffusion matrix can be regarded as a fixed scalar $D_{0}$ in theoretical analysis.
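The dominance of the injected noise for small learning rates can be checked by comparing the per-update variances directly: the injected term contributes variance $2 D_{0} \eta$, while the minibatch term contributes variance on the order of $\eta^{2} C$. The constants in this sketch are illustrative assumptions.

```python
D0, C = 0.01, 5.0   # illustrative injected-noise coefficient and gradient-noise variance
for eta in [1e-1, 1e-2, 1e-3, 1e-4]:
    injected = 2 * D0 * eta      # variance of sqrt(2*D0*eta)*zeta, Eq. (12)
    minibatch = eta**2 * C       # variance contributed by eta*xi_t, Eq. (7)
    print(f"eta={eta:.0e}  injected/minibatch = {injected / minibatch:.2f}")
# The ratio grows like 2*D0/(eta*C), so the injected noise dominates as eta -> 0.
```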

(a) One-dimensional escape. (b) High-dimensional escape.
Figure 1: Kramers Escape Problem. $a$ and $c$ are the minima of two neighboring valleys, and $b$ is the saddle point separating the two valleys. The Hessian matrix at $b$ has only one negative eigenvalue, whose eigenvector indicates the most possible escape direction. $c$ is located outside of Valley $a$.

We start the theoretical analysis with a simple problem. We assume there are two valleys, Sharp Valley $a$ and Flat Valley $c$, as shown in Figure 1, and Col $b$ is the boundary between the two valleys. The problem is: what is the mean escape time for a particle governed by Equation 2 to move from Sharp Valley $a$ to Flat Valley $c$? This problem is called the Kramers Escape Problem in statistical physics (Kramers, 1940).

Lemma 3.1 (Gauss's Divergence Theorem (Arfken and Weber, 1999; Lipschutz et al., 2009)).

Suppose $V$ is a compact subset of $\mathbb{R}^{n}$ with a piecewise smooth boundary $S$. If $J$ is a continuously differentiable vector field defined on a neighborhood of $V$, then

$$\int_{V} (\nabla \cdot J)\, dV = \oint_{S} J \cdot dS.$$

The left side is a volume integral over the volume $V$, and the right side is the surface integral over the boundary $S$ of the volume $V$.

Gauss's Divergence Theorem states that the surface integral of a vector field over a closed surface, called the flux through the surface, is equal to the volume integral of the divergence over the region inside the surface. We denote the mean escape time as $\tau$ and the escape rate as $\gamma = \tau^{-1}$, and we denote by $J$ the probability current. The probability current describes the "motion" of the probability density rather than the parameters' exact values. Applying Gauss's Divergence Theorem to the Fokker-Planck Equation results in

$$\frac{\partial P(\theta, t)}{\partial t} = -\nabla \cdot J(\theta, t), \qquad (14)$$
$$\frac{\partial}{\partial t} \int_{V_{a}} P(\theta, t)\, dV = -\oint_{S_{a}} J \cdot dS. \qquad (15)$$

The mean escape time is expressed (Van Kampen, 1992) as

$$\tau = \frac{1}{\gamma} = \frac{P(\theta \in V_{a})}{\oint_{S_{a}} J \cdot dS}, \qquad (16)$$

where $P(\theta \in V_{a})$ is the current probability inside Valley $a$, $J$ is the probability current produced by $P$, $\oint_{S_{a}} J \cdot dS$ is the probability flux (the surface integral of the probability current), $S_{a}$ is the surface (boundary) surrounding Valley $a$, and $V_{a}$ is the volume surrounded by $S_{a}$. In the case of one-dimensional escape, the flux reduces to the probability current through the boundary point $b$. The mean escape time, also called the first exit time, is the mean time required for a particle to escape a loss valley, a widely used concept in statistical physics and stochastic processes (Kramers, 1940; Van Kampen, 1992; Nguyen et al., 2019). The necessary background on divergence and flux (surface integrals) is given by Gauss's Divergence Theorem (Arfken and Weber, 1999; Lipschutz et al., 2009).

This approach was first studied by Kramers (1940) and discussed in detail by Van Kampen (1992). In order to analytically solve the Kramers Escape Problem for SGLD, we first need three approximating assumptions.

Assumption 1.

$D$ is $\theta$-independent: $D(\theta) = D$.

Assumption 2.

The system is in quasi-equilibrium: inside each valley, $\frac{\partial P(\theta, t)}{\partial t} \approx 0$.

Assumption 3.

The system is under small gradient noise (low temperature): $D \ll \Delta L$, the loss barrier height.

Assumption 1 is a mild assumption in LD/SGLD (Welling and Teh, 2011), and is justified when the injected noise dominates the stochastic gradient noise ($\eta \rightarrow 0$). This is true in the final stage of training. Assumption 2 is widely used in the Kramers Escape Problem across multiple fields, such as physics and chemistry (Kramers, 1940; Van Kampen, 1992). In deep learning it means the model fits the data well and the parameters are close to some minimum; its validity is examined by our empirical results. Assumption 3 is justified when $D$ is small relative to the barrier height. Under Assumption 1, the stationary distribution of the Fokker-Planck Equation is Equation 4. Under Assumption 2, $P(\theta, t)$ has locally reached the stationary distribution around critical points, although the distributions of different valleys may not have reached the stationary distribution. Under Assumption 3, we can apply the second-order Taylor approximation around critical points, and the probability density escapes from Valley $a$ only along the most possible paths (MPPs); the probability flux far from MPPs is negligible. MPPs must be critical paths. We generalize critical points to critical paths: a critical path is a path along which 1) the gradient perpendicular to the path direction is zero, and 2) the second-order directional derivatives perpendicular to the path direction are nonnegative. MPPs transport probability densities just as rivers transport water.

Theorem 3.2.

Suppose the loss function $L(\theta)$ is of class $C^{2}$ and one-dimensional. If Assumptions 1, 2, and 3 hold, and the dynamics is governed by SGLD, then the mean escape time from Valley $a$ to the neighboring valley is

$$\tau = 2\pi \left[ L''(a)\, |L''(b)| \right]^{-\frac{1}{2}} \exp\left( \frac{\Delta L}{D} \right),$$

where $L''(a)$ and $L''(b)$ are the second-order derivatives of the loss function at the minimum $a$ and the saddle point $b$, and $\Delta L = L(b) - L(a)$ is the loss barrier height.

Proof.

Supplementary Materials B.1. ∎
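As a sanity check on Theorem 3.2, one can simulate the Euler discretization of Equation 2 in a double-well loss and measure the mean first exit time; the double-well and all constants below are our own illustrative choices, not the paper's experiments.

```python
import numpy as np

rng = np.random.default_rng(0)

def grad_L(x):                            # double-well L(x) = (x**2 - 1)**2 / 4
    return x * (x**2 - 1)

def mean_escape_time(D, eta=0.01, runs=100):
    times = []
    for _ in range(runs):
        x, t = -1.0, 0.0                  # start at the minimum of the left valley
        while x < 0.0:                    # escape = crossing the saddle point at x = 0
            x += -eta * grad_L(x) + np.sqrt(2 * D * eta) * rng.normal()
            t += eta                      # dynamical time = iterations * learning rate
        times.append(t)
    return np.mean(times)

# log(tau) should be roughly linear in 1/D with slope Delta_L = 0.25.
for D in [0.1, 0.15, 0.2]:
    print(f"D={D}: tau={mean_escape_time(D):.1f}")
```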

We now generalize the theoretical analysis of one-dimensional escape to the $n$-dimensional space. For simplicity of notation, we first solve the simple case of high-dimensional escape in which only one most possible path exists between Sharp Valley $a$ and Flat Valley $c$. In the low temperature limit, the probability current naturally concentrates along the most possible path. The boundary between Sharp Valley $a$ and Flat Valley $c$ is still the saddle point $b$, where the Hessian matrix has only one negative eigenvalue and the corresponding eigenvector indicates the escape direction.

Theorem 3.3.

Suppose the loss function $L(\theta)$ is of class $C^{2}$ and $n$-dimensional, and only one most possible path exists between Valley $a$ and the outside of Valley $a$. If Assumptions 1, 2, and 3 hold, and the dynamics is governed by SGLD, then the mean escape time from Valley $a$ to the outside of Valley $a$ is

$$\tau = \frac{2\pi}{|\lambda_{b}^{-}|} \sqrt{\frac{|\det H_{b}|}{\det H_{a}}} \exp\left( \frac{\Delta L}{D} \right),$$

where $H_{a}$ and $H_{b}$ are the Hessians of the loss function at the minimum $a$ and the saddle point $b$, $\Delta L$ is the loss barrier height, and $\lambda_{b}^{-}$ is the only negative eigenvalue of the Hessian matrix $H_{b}$.

Proof.

Supplementary Materials B.2. ∎

Weight decay can ensure that the Hessians have no zero eigenvalues. We emphasize that the escape time in the proposed theory is the dynamical time, which equals the product of the number of iterations and the learning rate in experiments; it is not the same as the computational time. The learning rate is the time unit (time step). Each path has its own escape rate, and multiple paths combined have a total escape rate. If there are multiple parallel or sequential paths from the start valley to the end valley, we can compute the total escape rate easily based on the following computation rules.

Rule 1.

If there are parallel paths between the start valley and the end valley, then the escape rates are additive:

$$\gamma = \sum_{p} \gamma_{p}.$$

Rule 2.

If there are sequential paths between the start valley and the end valley, then the escape times are additive:

$$\tau = \sum_{p} \tau_{p}, \quad \text{i.e.,} \quad \gamma^{-1} = \sum_{p} \gamma_{p}^{-1}.$$

Proof.

Computation Rule 1 is based on the fact that probability currents and fluxes are additive. Computation Rule 2 is based on the fact that dynamical time is additive. ∎
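These two rules behave like conductances in parallel and resistances in series. A minimal sketch (the function names are ours):

```python
def parallel_rate(rates):
    # Rule 1: parallel escape paths add their rates (fluxes are additive).
    return sum(rates)

def sequential_rate(rates):
    # Rule 2: sequential paths add their escape times tau_p = 1/gamma_p.
    return 1.0 / sum(1.0 / g for g in rates)

print(parallel_rate([0.1, 0.3]))    # 0.4
print(sequential_rate([0.1, 0.3]))  # 0.075
```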

Based on Theorem 3.3 and the two computation rules, we can easily generalize the escape time analysis to the case of multiple parallel or sequential most possible paths. We can also easily obtain the stationary probability of locating in different valleys. The corollaries can be found in Supplementary Materials A.

Corollary 3.3.1.

Assume there are two connected valleys and the escape paths to the outside of the two valleys are negligible. If all assumptions of Theorem 3.3 hold, then the stationary probability ratio of locating in Sharp Valley $a$ to locating in Flat Valley $c$ is

$$\frac{P(\theta \in V_{a})}{P(\theta \in V_{c})} = \sqrt{\frac{\det H_{c}}{\det H_{a}}} \exp\left( -\frac{L(a) - L(c)}{D} \right).$$

Proof.

Supplementary Materials B.3. ∎

3.2 Diffusion Theory of SGD

We theoretically analyze the diffusion of SGD in this section. We first point out that the previous theoretical analysis of SGLD is based on the condition that the stationary distribution of the Fokker-Planck equation must be the target posterior, Equation 4. This condition is satisfied with white noise; however, it is not satisfied in SGD. The stationary distribution of SGD is distinct from the posterior of SGLD.

To solve the dynamics of SGD, we need a new version of Assumption 1, while Assumptions 2 and 3 are still required. We denote the unit vector along the escape direction at the saddle point $b$ as $e$ in the following analysis.

Assumption 4.

The diffusion matrix is stable around critical points: $D(\theta) \approx D(\theta^{\star})$ in the neighborhood of each critical point $\theta^{\star}$.

Under Assumption 4, the stationary distribution around critical points satisfies the condition of the Smoluchowski equation. Assumption 4 is justified by the relation between Hessians and the stochastic gradient noise covariance, which is discussed by Pawitan (2001); Zhu et al. (2019). We justify Assumption 4 again as follows.

Based on Smith and Le (2018), we express the stochastic gradient noise covariance as

$$C(\theta) = \frac{1}{B} \left[ \frac{1}{N} \sum_{j=1}^{N} \nabla L(\theta, x_{j}) \nabla L(\theta, x_{j})^{\top} - \nabla L(\theta) \nabla L(\theta)^{\top} \right]. \qquad (17)$$

It has been empirically observed that the gradient noise variance dominates the gradient mean in the final stage of SGD optimization,

$$\frac{1}{N} \sum_{j=1}^{N} \nabla L(\theta, x_{j}) \nabla L(\theta, x_{j})^{\top} \gg \nabla L(\theta) \nabla L(\theta)^{\top}, \qquad (18)$$
$$C(\theta) \approx \frac{1}{B N} \sum_{j=1}^{N} \nabla L(\theta, x_{j}) \nabla L(\theta, x_{j})^{\top}, \qquad (19)$$

and we know that $\frac{1}{N} \sum_{j} \nabla L(\theta, x_{j}) \nabla L(\theta, x_{j})^{\top}$ is the observed Fisher information matrix $F(\theta)$. From Pawitan (2001), we also know the Fisher information matrix satisfies, near minima,

$$F(\theta) \approx H(\theta). \qquad (20)$$

Thus we obtain the result of Zhu et al. (2019),

$$C(\theta) \approx \frac{1}{B} F(\theta) \qquad (21)$$
$$\approx \frac{1}{B} H(\theta), \qquad (22)$$

which gives

$$D(\theta) = \frac{\eta}{2} C(\theta) \approx \frac{\eta}{2 B} H(\theta). \qquad (23)$$

It indicates that the eigenvectors of $D(\theta)$ are closely aligned with the corresponding eigenvectors of $H(\theta)$. One usually simplifies the formula by using $D(\theta) = \frac{\eta}{2 B} H(\theta)$. Under the second-order Taylor approximation around a critical point $\theta^{\star}$,

$$L(\theta) \approx L(\theta^{\star}) + \frac{1}{2} (\theta - \theta^{\star})^{\top} H(\theta^{\star}) (\theta - \theta^{\star}), \qquad (24)$$

$H(\theta)$ and $D(\theta)$ are both locally stable around critical points. So Assumption 4 is justified.
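The relation $C(\theta) \approx F(\theta)/B \approx H(\theta)/B$ can be checked on a toy problem where the Fisher information equals the Hessian exactly, such as well-specified linear regression with unit label noise; the setup below is our own illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
N, dim = 50_000, 3
X = rng.normal(size=(N, dim)) * np.array([3.0, 1.0, 0.3])   # anisotropic inputs
theta_star = np.array([1.0, -2.0, 0.5])
y = X @ theta_star + rng.normal(size=N)                      # unit-variance label noise

# Per-sample gradients of L_j = 0.5*(y_j - theta^T x_j)^2 at the minimum theta*.
g = -(y - X @ theta_star)[:, None] * X
F = (g.T @ g) / N            # observed Fisher information, Eqs. (19)-(20)
H = (X.T @ X) / N            # Hessian of the quadratic loss

print(np.round(F, 2))
print(np.round(H, 2))        # F ~= H, so C = F/B ~= H/B, Eq. (22)
```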

Mandt et al. (2017) propose the stationary distribution of SGD around a minimum $a$ as

$$P(\theta) \propto \exp\left( -\frac{1}{2} (\theta - a)^{\top} \Sigma^{-1} (\theta - a) \right), \qquad (25)$$

where $\Sigma$ satisfies

$$\Sigma H(a) + H(a) \Sigma = 2 D(a). \qquad (26)$$

According to Equation 23, we have

$$\Sigma = \frac{\eta}{2 B} I, \quad \text{and thus} \quad P(\theta) \propto \exp\left( -\frac{B}{\eta} \| \theta - a \|^{2} \right). \qquad (27)$$

We can easily generalize the formula to the neighborhood around critical paths. Applying the second-order Taylor approximation to the neighborhood around the critical path, we get

$$P(\theta) \propto \exp\left( -\frac{B}{\eta} \| \theta - \hat{\theta} \|^{2} \right), \qquad (28)$$

where $\hat{\theta}$ is the point on the MPP nearest to $\theta$.

Theorem 3.4.

Suppose the loss function $L(\theta)$ is of class $C^{2}$ and one-dimensional. If Assumptions 4, 2, and 3 hold, and the dynamics is governed by SGD, then the mean escape time from Valley $a$ to the outside of Valley $a$ is

$$\tau = \frac{2\pi}{|H_{b}|} \exp\left[ \frac{2 B \Delta L}{\eta} \left( \frac{s}{H_{a}} + \frac{1 - s}{|H_{b}|} \right) \right],$$

where $s \in (0, 1)$ is a path-dependent parameter, $\eta$ is the learning rate, $B$ is the batch size, $N$ is the training data size, $H_{a} = L''(a)$ and $H_{b} = L''(b)$ are the second-order derivatives of the loss function at the minimum $a$ and the saddle point $b$, and $\Delta L$ is the loss barrier height.

Proof.

Supplementary Materials B.4. ∎

Theorem 3.5.

Suppose the loss function $L(\theta)$ is of class $C^{2}$ and $n$-dimensional, and only one most possible path exists between Valley $a$ and the outside of Valley $a$. If Assumptions 4, 2, and 3 hold, and the dynamics is governed by SGD, then the mean escape time from Valley $a$ to the outside of Valley $a$ is

$$\tau = \frac{2\pi}{|H_{be}|} \exp\left[ \frac{2 B \Delta L}{\eta} \left( \frac{s}{H_{ae}} + \frac{1 - s}{|H_{be}|} \right) \right],$$

where $e$ indicates the most possible escape direction, $s \in (0, 1)$ is a path-dependent parameter, $\eta$ is the learning rate, $B$ is the batch size, $N$ is the training data size, $H_{ae} = e^{\top} H_{a} e$ and $H_{be} = e^{\top} H_{b} e$ are the eigenvalues of the Hessians of the loss function at the minimum $a$ and the saddle point $b$ corresponding to the most possible escape direction, and $\Delta L$ is the loss barrier height.

Proof.

Supplementary Materials B.5. ∎

Corollary 3.5.1.

Suppose the probability density escapes along multiple parallel paths from Valley $a$ to the outside of Valley $a$. If all assumptions of Theorem 3.5 hold, then the total escape rate from Valley $a$ to the outside of Valley $a$ is

$$\gamma = \sum_{p} \gamma_{p} = \sum_{p} \frac{|H_{b e_{p}}|}{2\pi} \exp\left[ -\frac{2 B \Delta L_{p}}{\eta} \left( \frac{s}{H_{a e_{p}}} + \frac{1 - s}{|H_{b e_{p}}|} \right) \right],$$

where $p$ is the index of the most possible paths, and $e_{p}$ indicates the escape direction of the most possible path $p$.

Corollary 3.5.2.

Assume there are two connected valleys and the escape paths to the outside of the two valleys are negligible. If all assumptions of Theorem 3.5 hold, then the stationary distribution over these valleys is given by

$$P(\theta \in V_{i}) = \frac{\tau_{i}}{\sum_{j} \tau_{j}},$$

where $i$ is the index of valleys, and $\tau_{i}$ is the mean escape time from Valley $i$ to the outside of Valley $i$.

Proof.

Supplementary Materials B.6. ∎
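To illustrate Theorems 3.4 and 3.5 numerically, one can run a double-well simulation in which the diffusion coefficient tracks the local curvature, $D(\theta) = \eta |L''(\theta)| / (2B)$ per Equation 23, for different batch sizes; the loss and all constants are our own illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(1)

def grad_L(x):                      # double-well L(x) = (x**2 - 1)**2 / 4
    return x * (x**2 - 1)

def curvature(x):                   # |L''(x)| = |3x^2 - 1|
    return abs(3 * x**2 - 1)

def escape_time(eta, B, runs=50):
    times = []
    for _ in range(runs):
        x, t = -1.0, 0.0
        while x < 0.0:
            D = eta * curvature(x) / (2 * B)    # Hessian-aligned diffusion, Eq. (23)
            x += -eta * grad_L(x) + np.sqrt(2 * D * eta) * rng.normal()
            t += eta
        times.append(t)
    return np.mean(times)

# log(tau) grows roughly linearly in B/eta, as in Theorems 3.4-3.5.
for B in [2, 4, 8]:
    print(f"B={B}: tau={escape_time(eta=0.5, B=B):.1f}")
```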

The theoretical analysis of SGD also applies to dynamics with a mixture of stochastic gradient noise and injected Gaussian noise, as long as the eigenvectors of the total diffusion matrix are closely aligned with the eigenvectors of $H(\theta)$. Although deep learning dynamics is complex, we are able to model it within the theoretical framework of density diffusion.

4 Discussion

In this section, we discuss the new findings that follow from the proposed theory. In deep learning, one loss valley represents one mode, and the landscape contains many good modes and bad modes. The model transits from one mode to another during training. We can now clearly see the essential advantages of SGD.

SGLD: The transition time from one mode to its neighboring modes depends only on the Hessian determinant at the minimum, the Hessian determinant at the saddle point, the barrier height, and the temperature (the diffusion coefficient). SGLD favors flat minima only polynomially more than sharp minima.

SGD: First, we discover that the mean escape time depends exponentially on the ratio of the batch size to the learning rate. Second, the mean escape time depends exponentially on the eigenvalues of the Hessians of Valley $a$ and Col $b$ corresponding to the most possible escape directions, denoted as $H_{ae}$ and $H_{be}$, rather than polynomially on the Hessian determinants as in SGLD. The second-order directional derivatives orthogonal to the most possible escape direction have approximately zero impact on the escape dynamics. SGD escapes exponentially fast from "sharp" minima with large Hessian eigenvalues along the most possible escape directions, while SGLD escapes only polynomially fast from minima with large Hessian determinants.

The proposed theory helps us understand the role of gradient noise in deep learning much better than before. Zhu et al. (2019) show that anisotropic noise helps escape from sharp minima. However, according to the proof of Theorem 3.4, anisotropic noise does not necessarily help escape from sharp minima. From Theorem 3.5, we know that stochastic gradient noise is the kind of helpful noise that strongly encourages moving toward flat minima.

The proposed theory reveals the theoretical mechanism behind the known finding that large-batch training easily ends up in sharp minima, and that increasing the learning rate proportionally is helpful for large-batch training (Krizhevsky, 2014; Keskar et al., 2017; Sagun et al., 2017; Smith et al., 2018; Yao et al., 2018). The main cause is that large-batch (LB) training requires exponentially many iterations to pass through sharp minima and saddle points. The practical computational time is too short to achieve transitions to enough candidate minima, so the probability of ending up in sharp minima becomes much higher. If we apply Linear Scaling (Krizhevsky, 2014) to keep the ratio $B / \eta$ fixed, and train models for the same number of epochs (rather than iterations), the dynamics will remain nearly the same as before.
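The linear scaling rule falls out of the escape formula directly, since $B$ and $\eta$ enter the exponent only through the ratio $B/\eta$. A small numeric illustration of Theorem 3.5 with made-up curvature and barrier values:

```python
import math

def escape_time(B, eta, H_ae=2.0, H_be=1.0, dL=0.25, s=0.5):
    # Mean escape time from Theorem 3.5; curvature and barrier values are illustrative.
    exponent = (2 * B * dL / eta) * (s / H_ae + (1 - s) / abs(H_be))
    return (2 * math.pi / abs(H_be)) * math.exp(exponent)

print(f"{escape_time(B=32,  eta=1.0):.3g}")   # baseline
print(f"{escape_time(B=256, eta=1.0):.3g}")   # 8x batch size: exponentially slower escape
print(f"{escape_time(B=256, eta=8.0):.3g}")   # linear scaling (B/eta fixed): escape time unchanged
```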

The proposed theory also clarifies the meaning of "sharpness", which takes different forms in the contexts of SGLD and SGD. The Hessians of both minima and saddle points essentially matter in deep learning. In the context of SGLD, the "sharpness" of minima, quantified by the Hessian determinant, dominates the learning dynamics. In the context of SGD, the "sharpness", quantified by the eigenvalues of the Hessians corresponding to the most possible escape directions, dominates the learning dynamics. This "sharpness" exponentially speeds up the escape from sharp minima in SGD: the probability of escaping along one direction is exponentially weighted by the Hessian eigenvalue corresponding to that direction. SGD usually first climbs paths with large second-order directional derivatives and then moves across saddle points. The principal Hessian eigenvalues at a minimum decide the total escape rate from that minimum.

We claim the advantages of SGD mainly come from the exponential relation between the mean escape time and the Hessians. More precisely, the Hessians that dominate deep learning dynamics are the principal components of the Hessians at minima and the negative-eigenvalue components of the Hessians at saddle points. In fact, the Hessians of over-parametrized deep networks have mostly small, even nearly-zero, eigenvalues and only a small number of outlier eigenvalues (Sagun et al., 2017). Although the parameter space is very high-dimensional, the dynamics of SGD naturally avoids moving along those "meaningless" dimensions with small second-order directional derivatives. This characteristic significantly reduces the explorable parameter space around one minimum to a much lower dimensional space. Thus the parameter space of deep learning can be regarded as a probabilistic mixture of many simple low-dimensional spaces around different minima. We believe this will help us understand the generalization of deep learning in the future.

5 Empirical Analysis

Figure 2: The escape rate depends exponentially on the "path Hessians" in the dynamics of SGD: $\log(\gamma)$ is linear in $k^{2}$, which rescales the eigenvalues of the Hessians corresponding to the escape directions (the "path Hessians"). Panels: (a) Avila, (b) Banknote, (c) Cardiotocography, (d) Diagnosis.

Figure 3: The escape rate depends exponentially on the batch size in the dynamics of SGD: $\log(\gamma)$ is linear in $B$. Panels: (a) Avila, (b) Banknote, (c) Cardiotocography, (d) Diagnosis.

Figure 4: The escape rate depends exponentially on the learning rate in the dynamics of SGD: $\log(\gamma)$ is linear in $\eta^{-1}$. The estimated escape rate incorporates $\eta$ as the time unit. Panels: (a) Avila, (b) Banknote, (c) Cardiotocography, (d) Diagnosis.

We empirically study the proposed theoretical results in this section and try to directly validate the escape formulas on real-world data sets. The escape rates under various gradient noise scales, batch sizes, learning rates, and Hessians are simulated. How can we compare the escape time under various Hessians? Our method is to rescale the parameters: if we let $\tilde{\theta} = k \theta$ while keeping the loss values fixed, the Hessians are proportionally rescaled as $\tilde{H} = k^{-2} H$. The theoretical relations we try to validate can be formulated as follows (a simulation sketch of this protocol follows the list):

  • $\log(\gamma)$ is linear in $D^{-1}$ in the dynamics of white noise;

  • $\log(\gamma)$ is linear in $\log(k)$ in the dynamics of white noise;

  • $\log(\gamma)$ is linear in $k^{2}$, the inverse scale of the path Hessians, in the dynamics of SGD;

  • $\log(\gamma)$ is linear in $B$ in the dynamics of SGD;

  • $\log(\gamma)$ is linear in $\eta^{-1}$ in the dynamics of SGD.
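A minimal version of this protocol on a synthetic double-well (all constants are ours): run many trajectories from the minimum, record the dynamical time until the saddle is crossed, and inspect how the estimated escape rate varies with the rescaling factor $k$.

```python
import numpy as np

rng = np.random.default_rng(2)

def escape_rate(k, eta=0.5, B=4, runs=30):
    # Rescaled parameters theta = k * x keep the loss values (and Delta L) fixed
    # while the curvature seen by the optimizer scales as k**-2.
    times = []
    for _ in range(runs):
        th, t = -k, 0.0                        # the left minimum moves to theta = -k
        while th < 0.0:
            x = th / k
            g = x * (x**2 - 1) / k             # gradient of L(theta/k) w.r.t. theta
            H = abs(3 * x**2 - 1) / k**2       # rescaled path curvature
            D = eta * H / (2 * B)              # SGD-like diffusion, Eq. (23)
            th += -eta * g + np.sqrt(2 * D * eta) * rng.normal()
            t += eta
        times.append(t)
    return 1.0 / np.mean(times)

for k in [1.0, 1.2, 1.4]:                      # log(gamma) should fall linearly in k**2
    print(f"k={k}: gamma={escape_rate(k):.5f}")
```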

Datasets (De Stefano et al., 2018; Dua and Graff, 2017): a) Avila, b) Banknote Authentication, c) Cardiotocography, d) Dataset for Sensorless Drive Diagnosis. Models: fully-connected networks with ReLU activations and cross-entropy losses.

Experimental Settings: Supplementary Materials C.1.

Experimental Results: The experimental results on the dynamics of SGLD/white noise are presented in Supplementary Materials C.2; they fully support the density diffusion theory in the dynamics of white noise. Figures 2, 3, and 4 show that the exponential relation of the escape rate with the Hessian, the batch size, and the learning rate is clearly observed. The empirical results support the conclusion of Section 4 that large-batch training requires exponentially many iterations to transit from one valley to another. We particularly note that in Figures 2 and 4 we set the batch size to 1, which means our theoretical predictions hold in practice even for a single sample. We also conduct a set of supplementary experiments on the Styblinski-Tang function, logistic regression, and four-layer fully-connected networks with artificial data sets, presented in Supplementary Materials D, where we can more precisely control the gradient noise scale, the Hessians, and the exact locations of minima and loss barriers.

6 Conclusion

The proposed fundamental theory can analyze deep learning dynamics and is generally applicable to other gradient-based nonconvex learning settings. The proposed theory requires three assumptions: 1) the locally-stable diffusion assumption (Assumption 1/4), 2) the quasi-equilibrium assumption (Assumption 2), and 3) the low temperature assumption (Assumption 3). We show why the three assumptions are approximately reasonable in practice. In the context of SGLD (Section 3.1), we reveal an interesting finding that the mean escape time depends polynomially on the Hessian determinant and exponentially on the gradient noise scale. In the context of SGD (Section 3.2), we discover that the mean escape time from sharp minima depends exponentially on the Hessians, more precisely on the principal components of the Hessians. We also prove how hyperparameters, such as the batch size and the learning rate, contribute to deep learning dynamics. In Section 4, we present more findings derived from the diffusion theory. In SGLD, "sharpness" is quantified by Hessian determinants. In SGD, "sharpness" is quantified by "path Hessians", namely the eigenvalues of the Hessians corresponding to the escape directions. One essential characteristic of SGD is that the outlier Hessian components around critical points dominate the learning dynamics; most dimensions with nearly-zero second-order directional derivatives become approximately meaningless in SGD. We believe the proposed theory not only helps us understand how SGD and its variants work, but also provides researchers with new theoretical tools to understand deep learning.

Acknowledgement

I would like to express my deep gratitude to Professor Masashi Sugiyama and Professor Issei Sato for their patient guidance and useful critiques of this research work. My grateful thanks are also extended to Dr. Qianyuan Tang for his suggestions on the physics background. MS was supported by the International Research Center for Neurointelligence (WPI-IRCN) at The University of Tokyo Institutes for Advanced Study.

References

  • G. B. Arfken and H. J. Weber (1999) Mathematical methods for physicists. AAPT.
  • D. Arpit, S. Jastrzębski, N. Ballas, D. Krueger, E. Bengio, M. S. Kanwal, T. Maharaj, A. Fischer, A. Courville, Y. Bengio, et al. (2017) A closer look at memorization in deep networks. In International Conference on Machine Learning, pp. 233–242.
  • P. Chaudhari, A. Choromanska, S. Soatto, Y. LeCun, C. Baldassi, C. Borgs, J. Chayes, L. Sagun, and R. Zecchina (2017) Entropy-SGD: biasing gradient descent into wide valleys. In International Conference on Learning Representations.
  • C. De Stefano, M. Maniaci, F. Fontanella, and A. S. di Freca (2018) Reliable writer identification in medieval manuscripts through page layout features: the "Avila" Bible case. Engineering Applications of Artificial Intelligence 72, pp. 99–110.
  • N. Ding, Y. Fang, R. Babbush, C. Chen, R. D. Skeel, and H. Neven (2014) Bayesian sampling using stochastic gradient thermostats. In Advances in Neural Information Processing Systems, pp. 3203–3211.
  • L. Dinh, R. Pascanu, S. Bengio, and Y. Bengio (2017) Sharp minima can generalize for deep nets. In International Conference on Machine Learning, pp. 1019–1028.
  • D. Dua and C. Graff (2017) UCI machine learning repository. University of California, Irvine, School of Information and Computer Sciences.
  • G. K. Dziugaite and D. M. Roy (2017) Computing nonvacuous generalization bounds for deep (stochastic) neural networks with many more parameters than training data. arXiv preprint arXiv:1703.11008.
  • B. Gnedenko and A. Kolmogorov (1954) Limit distributions for sums of independent random variables. Am. J. Math 105.
  • M. Hardt, B. Recht, and Y. Singer (2016) Train faster, generalize better: stability of stochastic gradient descent. In International Conference on Machine Learning, pp. 1225–1234.
  • H. He, G. Huang, and Y. Yuan (2019) Asymmetric valleys: beyond sharp and flat local minima. In Advances in Neural Information Processing Systems 32, pp. 2549–2560.
  • S. Hochreiter and J. Schmidhuber (1995) Simplifying neural nets by discovering flat minima. In Advances in Neural Information Processing Systems, pp. 529–536.
  • S. Hochreiter and J. Schmidhuber (1997) Flat minima. Neural Computation 9 (1), pp. 1–42.
  • E. Hoffer, I. Hubara, and D. Soudry (2017) Train longer, generalize better: closing the generalization gap in large batch training of neural networks. In Advances in Neural Information Processing Systems, pp. 1729–1739.
  • W. Hu, C. J. Li, L. Li, and J. Liu (2019) On the diffusion approximation of nonconvex stochastic gradient descent. Annals of Mathematical Sciences and Applications 4 (1), pp. 3–32.
  • P. Imkeller and I. Pavlyukevich (2006) First exit times of SDEs driven by stable Lévy processes. Stochastic Processes and their Applications 116 (4), pp. 611–642.
  • N. S. Keskar, D. Mudigere, J. Nocedal, M. Smelyanskiy, and P. T. P. Tang (2017) On large-batch training for deep learning: generalization gap and sharp minima. In International Conference on Learning Representations.
  • R. Kleinberg, Y. Li, and Y. Yuan (2018) An alternative view: when does SGD escape local minima?. In International Conference on Machine Learning, pp. 2703–2712.
  • H. A. Kramers (1940) Brownian motion in a field of force and the diffusion model of chemical reactions. Physica 7 (4), pp. 284–304.
  • A. Krizhevsky (2014) One weird trick for parallelizing convolutional neural networks. arXiv preprint arXiv:1404.5997.
  • Y. LeCun, Y. Bengio, and G. Hinton (2015) Deep learning. Nature 521 (7553), pp. 436.
  • S. Lipschutz, M. R. Spiegel, and D. Spellman (2009) Vector analysis and an introduction to tensor analysis. McGraw-Hill.
  • S. Mandt, M. D. Hoffman, and D. M. Blei (2017) Stochastic gradient descent as approximate Bayesian inference. The Journal of Machine Learning Research 18 (1), pp. 4873–4907.
  • B. Neyshabur, S. Bhojanapalli, D. McAllester, and N. Srebro (2017) Exploring generalization in deep learning. In Advances in Neural Information Processing Systems, pp. 5949–5958.
  • T. H. Nguyen, U. Şimşekli, M. Gürbüzbalaban, and G. Richard (2019) First exit time analysis of stochastic gradient descent under heavy-tailed gradient noise. arXiv preprint arXiv:1906.09069.
  • Y. Pawitan (2001) In all likelihood: statistical modelling and inference using likelihood. Oxford University Press.
  • L. Sagun, U. Evci, V. U. Guney, Y. Dauphin, and L. Bottou (2017) Empirical analysis of the Hessian of over-parametrized neural networks. arXiv preprint arXiv:1706.04454.
  • I. Sato and H. Nakagawa (2014) Approximation analysis of stochastic gradient Langevin dynamics by using Fokker-Planck equation and Ito process. In International Conference on Machine Learning, pp. 982–990.
  • U. Simsekli, L. Sagun, and M. Gurbuzbalaban (2019) A tail-index analysis of stochastic gradient noise in deep neural networks. In International Conference on Machine Learning, pp. 5827–5837.
  • S. L. Smith, P. Kindermans, and Q. V. Le (2018) Don't decay the learning rate, increase the batch size. In International Conference on Learning Representations.
  • S. L. Smith and Q. V. Le (2018) A Bayesian perspective on generalization and stochastic gradient descent. In International Conference on Learning Representations.
  • Y. Tsuzuku, I. Sato, and M. Sugiyama (2019) Normalized flat minima: exploring scale invariant definition of flat minima for neural networks using PAC-Bayesian analysis. arXiv preprint arXiv:1901.04653.
  • N. G. Van Kampen (1992) Stochastic processes in physics and chemistry. Vol. 1, Elsevier.
  • M. Welling and Y. W. Teh (2011) Bayesian learning via stochastic gradient Langevin dynamics. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pp. 681–688.
  • L. Wu, C. Ma, and E. Weinan (2018) How SGD selects the global minima in over-parameterized learning: a dynamical stability perspective. In Advances in Neural Information Processing Systems, pp. 8279–8288.
  • L. Wu, Z. Zhu, et al. (2017) Towards understanding generalization of deep learning: perspective of loss landscapes. arXiv preprint arXiv:1706.10239.
  • Z. Yao, A. Gholami, Q. Lei, K. Keutzer, and M. W. Mahoney (2018) Hessian-based analysis of large batch training and robustness to adversaries. In Advances in Neural Information Processing Systems, pp. 4949–4959.
  • C. Zhang, S. Bengio, M. Hardt, B. Recht, and O. Vinyals (2017) Understanding deep learning requires rethinking generalization. In International Conference on Learning Representations.
  • Z. Zhu, J. Wu, B. Yu, L. Wu, and J. Ma (2019) The anisotropic noise in stochastic gradient descent: its behavior of escaping from sharp minima and regularization effects. In International Conference on Machine Learning.

Appendix A Corollaries

Corollary A.0.1.

Suppose the loss function $L(\theta)$ is of class $C^{2}$ and $n$-dimensional, and the probability density escapes along multiple parallel paths from Valley $a$ to the outside of Valley $a$. If Assumptions 1, 2, and 3 hold, and the dynamics is governed by SGLD, then the mean escape time is

$$\tau = \left[ \sum_{p} \frac{|\lambda_{b_{p}}^{-}|}{2\pi} \sqrt{\frac{\det H_{a}}{|\det H_{b_{p}}|}} \exp\left( -\frac{\Delta L_{p}}{D} \right) \right]^{-1},$$

where $p$ is the index of the MPPs, $H_{a}$ and $H_{b_{p}}$ are the Hessians of the loss function at the minimum $a$ and the saddle point $b_{p}$, $\Delta L_{p}$ is the loss barrier height of path $p$, and $\lambda_{b_{p}}^{-}$ is the only negative eigenvalue of the Hessian matrix $H_{b_{p}}$.

Corollary A.0.2.

Suppose the loss function $L(\theta)$ is of class $C^{2}$ and $n$-dimensional, there exist multiple valleys between the start valley and the end valley, and the probability density escapes along multiple sequential paths through these valleys. If Assumptions 1, 2, and 3 hold, and the dynamics is governed by SGLD, then the mean escape time is

$$\tau = \sum_{p} \frac{2\pi}{|\lambda_{b_{p}}^{-}|} \sqrt{\frac{|\det H_{b_{p}}|}{\det H_{a_{p}}}} \exp\left( \frac{\Delta L_{p}}{D} \right),$$

where $p$ is the index of the MPPs, $H_{a_{p}}$ and $H_{b_{p}}$ are the Hessians of the loss function at the minimum $a_{p}$ and the saddle point $b_{p}$, $\Delta L_{p}$ is the loss barrier height, and $\lambda_{b_{p}}^{-}$ is the only negative eigenvalue of $H_{b_{p}}$.

Appendix B Proofs

B.1 Proof of Theorem 3.2

Proof.

This proposition is a well-known conclusion in statistical physics under Assumptions 1, 2, and 3. We still provide an intuitive proof here, and the following theoretical analysis will be partly based on this proof. We decompose the proof into two steps: 1) compute the probability of locating in Valley $a$, $P(\theta \in V_{a})$, and 2) compute the probability flux $J$.

Step 1: Under Assumption 1, the stationary distribution around the minimum $a$ is $P(\theta) = P(a) \exp\left( -\frac{L(\theta) - L(a)}{T} \right)$, where $T = D$. Under Assumption 3, we may only consider the second-order Taylor approximation of the density function around critical points. We use the notation $T$ as the temperature parameter in the stationary distribution, and the notation $D$ as the diffusion coefficient in the dynamics, to distinguish their different roles.

$$P(\theta \in V_{a}) = \int_{V_{a}} P(\theta)\, d\theta \qquad (29)$$
$$= \int_{V_{a}} P(a) \exp\left( -\frac{L(\theta) - L(a)}{T} \right) d\theta \qquad (30)$$
$$\approx P(a) \int_{V_{a}} \exp\left( -\frac{H_{a} (\theta - a)^{2}}{2 T} \right) d\theta \qquad (31)$$
$$\approx P(a) \int_{-\infty}^{+\infty} \exp\left( -\frac{H_{a} (\theta - a)^{2}}{2 T} \right) d\theta \qquad (32)$$
$$= P(a) \sqrt{\frac{2 \pi T}{H_{a}}}. \qquad (33)$$

Step 2: The Fokker-Planck equation can be written in the conservation form

$$\frac{\partial P(\theta, t)}{\partial t} = -\frac{\partial J(\theta, t)}{\partial \theta}, \qquad (34)$$

which identifies the probability current

$$J(\theta, t) = -P \frac{\partial L}{\partial \theta} - D \frac{\partial P}{\partial \theta} \qquad (35)$$
$$= -D \exp\left( -\frac{L}{D} \right) \frac{\partial}{\partial \theta} \left[ \exp\left( \frac{L}{D} \right) P \right]. \qquad (36)$$

Applying this result to the Fokker-Planck Equation 11, we have

$$\frac{\partial P}{\partial t} = \frac{\partial}{\partial \theta} \left[ P \frac{\partial L}{\partial \theta} + D \frac{\partial P}{\partial \theta} \right] \qquad (37)$$
$$= \frac{\partial}{\partial \theta} \left\{ D \exp\left( -\frac{L}{D} \right) \frac{\partial}{\partial \theta} \left[ \exp\left( \frac{L}{D} \right) P \right] \right\}, \qquad (38)$$

and thus we obtain the Smoluchowski equation. Under Assumption 2, the current is approximately constant along the escape path,

$$\frac{\partial J}{\partial \theta} \approx 0, \qquad (39)$$

and the probability density beyond the saddle point must be approximately zero,

$$P(\theta \in V_{c}) \approx 0, \qquad (40)$$

because, as we want to compute the probability flux escaping from Valley $a$, the probability flux escaping from other valleys into Valley $a$ should be ignored. We integrate the equation from Valley $a$ to the outside of Valley $a$ along the most possible escape path:

$$\int_{a}^{c} \frac{J}{D} \exp\left( \frac{L(\theta)}{D} \right) d\theta = -\left[ \exp\left( \frac{L(\theta)}{D} \right) P(\theta) \right]_{a}^{c} \qquad (41)$$
$$= P(a) \exp\left( \frac{L(a)}{D} \right), \qquad (42)$$
$$J \int_{a}^{c} \frac{1}{D} \exp\left( \frac{L(\theta)}{D} \right) d\theta = P(a) \exp\left( \frac{L(a)}{D} \right), \qquad (43)$$
$$J = P(a) \exp\left( \frac{L(a)}{D} \right) \bigg/ \int_{a}^{c} \frac{1}{D} \exp\left( \frac{L(\theta)}{D} \right) d\theta. \qquad (44)$$

We move $J$ outside of the integral based on Gauss's Divergence Theorem: as there is no field source on the escape path, $\nabla \cdot J = 0$ there, and hence $J$ is fixed along the path from one minimum to another. Obviously, only minima are probability sources in deep learning. Under Assumption 3 and the second-order Taylor approximation around the saddle point $b$, we have

$$\int_{a}^{c} \frac{1}{D} \exp\left( \frac{L(\theta)}{D} \right) d\theta \approx \frac{1}{D} \int \exp\left( \frac{L(b)}{D} - \frac{|H_{b}| (\theta - b)^{2}}{2 D} \right) d\theta \qquad (45)$$
$$= \frac{1}{D} \exp\left( \frac{L(b)}{D} \right) \sqrt{\frac{2 \pi D}{|H_{b}|}}, \qquad (46)$$

so

$$J = P(a) \sqrt{\frac{D |H_{b}|}{2 \pi}} \exp\left( -\frac{\Delta L}{D} \right). \qquad (47), (48)$$

Based on the results of Step 1 and Step 2, we obtain

$$\tau = \frac{1}{\gamma} = \frac{P(\theta \in V_{a})}{J} \qquad (49)$$
$$= \frac{P(a) \sqrt{2 \pi T / H_{a}}}{P(a) \sqrt{D |H_{b}| / (2\pi)} \exp\left( -\Delta L / D \right)} \qquad (50)$$
$$= 2\pi \left[ H_{a} |H_{b}| \right]^{-\frac{1}{2}} \exp\left( \frac{\Delta L}{D} \right) \quad (\text{using } T = D), \qquad (51)$$
$$\gamma = \frac{\sqrt{H_{a} |H_{b}|}}{2\pi} \exp\left( -\frac{\Delta L}{D} \right). \qquad (52)$$

∎
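The closed form above can be cross-checked against the exact one-dimensional mean first passage time for constant $D$, $\tau = \frac{1}{D} \int_{a}^{c} e^{L(x)/D} \int_{-\infty}^{x} e^{-L(y)/D}\, dy\, dx$; the double-well and the diffusion coefficient below are illustrative.

```python
import numpy as np

D = 0.08
L = lambda x: (x**2 - 1)**2 / 4            # double-well: minimum a = -1, saddle b = 0
Lpp = lambda x: 3 * x**2 - 1

# Exact 1D mean first passage time from a = -1 to an absorbing point past the saddle:
xs = np.linspace(-1.0, 1.0, 1001)
inner = []
for x in xs:
    ys = np.linspace(-4.0, x, 1001)
    inner.append(np.trapz(np.exp(-L(ys) / D), ys))
tau_exact = np.trapz(np.exp(L(xs) / D) * np.array(inner), xs) / D

# Kramers closed form from Theorem 3.2:
dL = L(0.0) - L(-1.0)
tau_kramers = 2 * np.pi / np.sqrt(Lpp(-1.0) * abs(Lpp(0.0))) * np.exp(dL / D)

print(f"exact: {tau_exact:.1f}   Kramers: {tau_kramers:.1f}")   # close for small D
```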

B.2 Proof of Theorem 3.3

Proof.

We generalize the proof of Theorem 3.2 to the high-dimensional analog.

Step 1: Around the minimum $a$, the second-order Taylor approximation gives a multivariate Gaussian integral (with $\lambda_{a, i}$ the eigenvalues of $H_{a}$):

$$P(\theta \in V_{a}) = \int_{V_{a}} P(\theta)\, d\theta \qquad (53)$$
$$= \int_{V_{a}} P(a) \exp\left( -\frac{L(\theta) - L(a)}{T} \right) d\theta \qquad (54)$$
$$\approx P(a) \int \exp\left( -\frac{1}{2 T} (\theta - a)^{\top} H_{a} (\theta - a) \right) d\theta \qquad (55)$$
$$= P(a) \prod_{i=1}^{n} \sqrt{\frac{2 \pi T}{\lambda_{a, i}}} \qquad (56)$$
$$= P(a) \frac{(2 \pi T)^{\frac{n}{2}}}{\sqrt{\det H_{a}}}. \qquad (57)$$

Step 2: Based on the formula of the one-dimensional probability current and flux, the flux concentrates near the saddle point $b$: along the escape direction $e$ the current is the one-dimensional result of Theorem 3.2's proof, while the density extends as a Gaussian in the $n - 1$ directions orthogonal to $e$ (with eigenvalues $\lambda_{b, i}$ of $H_{b}$):

$$\oint_{S_{a}} J \cdot dS = J_{e} \prod_{i \neq e} \sqrt{\frac{2 \pi T}{\lambda_{b, i}}} \qquad (58)$$
$$= P(a) \sqrt{\frac{D |\lambda_{b}^{-}|}{2 \pi}} \exp\left( -\frac{\Delta L}{D} \right) \prod_{i \neq e} \sqrt{\frac{2 \pi T}{\lambda_{b, i}}} \qquad (59)$$
$$= P(a) \sqrt{\frac{D |\lambda_{b}^{-}|}{2 \pi}} \, (2 \pi T)^{\frac{n - 1}{2}} \left( \prod_{i \neq e} \lambda_{b, i} \right)^{-\frac{1}{2}} \exp\left( -\frac{\Delta L}{D} \right). \qquad (60)$$

So we have, using $T = D$,

$$\gamma = \frac{\oint_{S_{a}} J \cdot dS}{P(\theta \in V_{a})} = \frac{|\lambda_{b}^{-}|}{2 \pi} \sqrt{\frac{\det H_{a}}{|\det H_{b}|}} \exp\left( -\frac{\Delta L}{D} \right), \qquad (61)$$
$$\tau = \frac{2 \pi}{|\lambda_{b}^{-}|} \sqrt{\frac{|\det H_{b}|}{\det H_{a}}} \exp\left( \frac{\Delta L}{D} \right). \qquad (62)$$

∎

B.3 Proof of Corollary 3.3.1

Proof.

Under Assumptions 1 and 2, the stationary distribution must be the target posterior,

$$P(\theta) = \frac{1}{Z} \exp\left( -\frac{L(\theta)}{T} \right). \qquad (63)$$

We apply the second-order Taylor approximation and compute the probability of locating in Valley $a$ as

$$P(\theta \in V_{a}) = \int_{V_{a}} \frac{1}{Z} \exp\left( -\frac{L(\theta)}{T} \right) d\theta \qquad (64)$$
$$\approx \frac{1}{Z} \int \exp\left( -\frac{L(a)}{T} - \frac{1}{2 T} (\theta - a)^{\top} H_{a} (\theta - a) \right) d\theta \qquad (65)$$
$$= \frac{1}{Z} \exp\left( -\frac{L(a)}{T} \right) \int \exp\left( -\frac{1}{2 T} (\theta - a)^{\top} H_{a} (\theta - a) \right) d\theta \qquad (66)$$
$$= \frac{1}{Z} \exp\left( -\frac{L(a)}{T} \right) \prod_{i=1}^{n} \sqrt{\frac{2 \pi T}{\lambda_{a, i}}} \qquad (67)$$
$$= \frac{1}{Z} \exp\left( -\frac{L(a)}{T} \right) \frac{(2 \pi T)^{\frac{n}{2}}}{\sqrt{\det H_{a}}}. \qquad (68)$$

Similarly, we have

$$P(\theta \in V_{c}) = \frac{1}{Z} \exp\left( -\frac{L(c)}{T} \right) \frac{(2 \pi T)^{\frac{n}{2}}}{\sqrt{\det H_{c}}}, \qquad (69)$$

and $T = D$. Thus we obtain

$$\frac{P(\theta \in V_{a})}{P(\theta \in V_{c})} = \sqrt{\frac{\det H_{c}}{\det H_{a}}} \exp\left( -\frac{L(a) - L(c)}{D} \right). \qquad (70)$$

∎

B.4 Proof of Theorem 3.4

Proof.

Under Assumptions 2, 3, and 4, we can prove the proposition. We decompose the proof into two steps just like before. The following proof is similar to the proof of Theorem 3.2, except that the temperature is $T_{a} = \frac{\eta}{2 B} H_{a}$ near the minimum $a$ and $T_{b} = \frac{\eta}{2 B} |H_{b}|$ near the saddle point $b$. From the proof of Theorem 3.2, we know that $P(\theta \in V_{a})$ is dominated by the dynamics near the minimum $a$, and the probability current $J$ is dominated by the dynamics near the minimum $a$ and the col $b$. So the escape dynamics is insensitive to the variations of temperature between the minimum $a$'s $T_{a}$ and the col $b$'s $T_{b}$.

Step 1: Under Assumption 3, we may only consider the second-order Taylor approximation of the density function around critical points.

$$P(\theta \in V_{a}) = \int_{V_{a}} P(\theta)\, d\theta \qquad (71)$$
$$= \int_{V_{a}} P(a) \exp\left( -\frac{L(\theta) - L(a)}{T_{a}} \right) d\theta \qquad (72)$$
$$\approx P(a) \int \exp\left( -\frac{H_{a} (\theta - a)^{2}}{2 T_{a}} \right) d\theta \qquad (73)$$
$$= P(a) \sqrt{\frac{2 \pi T_{a}}{H_{a}}} \qquad (74)$$
$$= P(a) \sqrt{\frac{\pi \eta}{B}}. \qquad (75)$$

Step 2:

$$J(\theta, t) = -P \frac{\partial L}{\partial \theta} - \frac{\partial}{\partial \theta} \left[ D(\theta) P \right] \qquad (76)$$
$$= -P \frac{\partial L}{\partial \theta} - D(\theta) \frac{\partial P}{\partial \theta} - P \frac{\partial D(\theta)}{\partial \theta}. \qquad (77)$$

According to Equation 23, $\frac{\partial D(\theta)}{\partial \theta}$ is negligible near the minimum $a$ and the col $b$, thus

$$J(\theta, t) \approx -P \frac{\partial L}{\partial \theta} - D(\theta) \frac{\partial P}{\partial \theta}. \qquad (78)$$

Applying this result to the Fokker-Planck Equation 11, we have

$$\frac{\partial P}{\partial t} = \frac{\partial}{\partial \theta} \left[ P \frac{\partial L}{\partial \theta} + D(\theta) \frac{\partial P}{\partial \theta} \right] \qquad (79)$$
$$\approx \frac{\partial}{\partial \theta} \left\{ D(\theta) \exp\left( -\frac{L}{D(\theta)} \right) \frac{\partial}{\partial \theta} \left[ \exp\left( \frac{L}{D(\theta)} \right) P \right] \right\}. \qquad (80)$$

And thus we obtain the Smoluchowski equation and a new form of $J$:

$$\frac{\partial P}{\partial t} = \frac{\partial}{\partial \theta} \left\{ D(\theta) \exp\left( -\frac{L}{D(\theta)} \right) \frac{\partial}{\partial \theta} \left[ \exp\left( \frac{L}{D(\theta)} \right) P \right] \right\}, \qquad (81)$$
$$J = -D(\theta) \exp\left( -\frac{L}{D(\theta)} \right) \frac{\partial}{\partial \theta} \left[ \exp\left( \frac{L}{D(\theta)} \right) P \right]. \qquad (82)$$

We note that the Smoluchowski equation is true only near critical points. We assume the point $s^{\ast}$ is the midpoint on the most possible path between $a$ and $b$ at which the dominating temperature switches, and define the path-dependent parameter $s = \frac{L(s^{\ast}) - L(a)}{\Delta L} \in (0, 1)$. The temperature $T_{a}$ dominates the path segment $(a, s^{\ast})$, while the temperature $T_{b}$ dominates the path segment $(s^{\ast}, b)$. So we have

$$\exp\left( \frac{\Delta L}{T} \right) \rightarrow \exp\left( \frac{s \Delta L}{T_{a}} + \frac{(1 - s) \Delta L}{T_{b}} \right) = \exp\left[ \frac{2 B \Delta L}{\eta} \left( \frac{s}{H_{a}} + \frac{1 - s}{|H_{b}|} \right) \right]. \qquad (83)$$

Under Assumption 2, we integrate the equation from Valley $a$ to the outside of Valley $a$ along the most possible escape path:

$$\int_{a}^{c} \frac{J}{D(\theta)} \exp\left( \frac{L(\theta)}{D(\theta)} \right) d\theta = -\left[ \exp\left( \frac{L(\theta)}{D(\theta)} \right) P(\theta) \right]_{a}^{c} \qquad (84)$$
$$= P(a) \exp\left( \frac{L(a)}{T_{a}} \right). \qquad (85)$$