
How Many Factors Influence Minima in SGD?

by Victor Luo, et al.

Stochastic gradient descent (SGD) is widely used to train deep neural networks (DNNs), and considerable research has investigated the convergence dynamics of SGD and the minima it finds. The influencing factors identified in the literature include the learning rate, batch size, Hessian, and gradient covariance; stochastic differential equations (SDEs) are used to model SGD and to establish relationships among these factors that characterize the minima found by SGD. The ratio of batch size to learning rate has been identified as a main factor governing the underlying SGD dynamics, but the influence of other important factors, such as the Hessian and the gradient covariance, is not entirely agreed upon. This paper reviews the factors and relationships proposed in the recent literature and presents numerical findings on them. In particular, it confirms the four-factor and general relationship results obtained in Wang (2019), whereas the three-factor and associated relationship results found in Jastrzȩbski et al. (2018) may not hold beyond the special case considered there.
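The role of the batch-size-to-learning-rate ratio can be illustrated with a toy experiment. In the sketch below (a simplification, not the paper's experimental setup), SGD is run on a one-dimensional quadratic loss whose per-sample gradients carry Gaussian noise; minibatch averaging shrinks the gradient-noise variance by 1/B. Under the SDE view, the stationary variance of the iterate around the minimum scales roughly like η/B, so two runs with the same ratio should fluctuate comparably, while halving the ratio should roughly halve the variance. All function names and parameter values here are illustrative assumptions.

```python
import numpy as np

def sgd_stationary_var(eta, batch, sigma=1.0, steps=200_000, seed=0):
    """Stationary variance of SGD iterates on the toy loss L(w) = w^2 / 2.

    Per-sample gradient is modeled as w + noise with noise ~ N(0, sigma^2);
    averaging over a minibatch of size `batch` divides the noise variance
    by `batch`, i.e. the noise std by sqrt(batch).
    """
    rng = np.random.default_rng(seed)
    w, samples = 0.0, []
    for t in range(steps):
        g = w + rng.normal(0.0, sigma) / np.sqrt(batch)  # minibatch gradient
        w -= eta * g                                     # SGD update
        if t > steps // 2:                               # discard burn-in
            samples.append(w)
    return np.var(samples)

# Equal eta/batch ratio -> comparable fluctuation around the minimum.
v_small = sgd_stationary_var(eta=0.05, batch=5)
v_large = sgd_stationary_var(eta=0.10, batch=10)

# Halving the ratio roughly halves the stationary variance.
v_half = sgd_stationary_var(eta=0.05, batch=10)
```

For small η the Ornstein–Uhlenbeck approximation predicts a stationary variance of about η σ² / (2B), which is what the three runs above probe: the first two share the ratio η/B = 0.01, the third halves it.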


Three Factors Influencing Minima in SGD

We study the properties of the endpoint of stochastic gradient descent (...

Towards Theoretical Understanding of Large Batch Training in Stochastic Gradient Descent

Stochastic gradient descent (SGD) is almost ubiquitously used for traini...

On Large Batch Training and Sharp Minima: A Fokker-Planck Perspective

We study the statistical properties of the dynamic trajectory of stochas...

The Impact of Local Geometry and Batch Size on the Convergence and Divergence of Stochastic Gradient Descent

Stochastic small-batch (SB) methods, such as mini-batch Stochastic Gradi...

On the Validity of Modeling SGD with Stochastic Differential Equations (SDEs)

It is generally recognized that finite learning rate (LR), in contrast t...

Hessian based analysis of SGD for Deep Nets: Dynamics and Generalization

While stochastic gradient descent (SGD) and variants have been surprisin...

A Scale Invariant Flatness Measure for Deep Network Minima

It has been empirically observed that the flatness of minima obtained fr...