Towards Theoretical Understanding of Large Batch Training in Stochastic Gradient Descent

12/03/2018
by Xiaowu Dai, et al.

Stochastic gradient descent (SGD) is used almost ubiquitously for non-convex optimization tasks. The hypothesis of Keskar et al. [2017], that large-batch methods tend to converge to sharp minimizers, has recently received increasing attention. We theoretically justify this hypothesis by providing new properties of SGD in both the finite-time and asymptotic regimes. In particular, we give an explicit escaping time of SGD from a local minimum in the finite-time regime, and we prove that, in the asymptotic regime, SGD tends to converge to flatter minima regardless of the batch size (although it may take exponential time to converge). We also find that SGD with a larger ratio of learning rate to batch size tends to converge to a flat minimum faster; however, its generalization performance could be worse than that of SGD with a smaller ratio of learning rate to batch size. We include numerical experiments to corroborate these theoretical findings.
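
To make the role of the ratio of learning rate to batch size concrete, the following is a minimal toy sketch and not the paper's analysis or experiments. It assumes a one-dimensional quadratic loss f(x) = 0.5 * curvature * x^2 and per-sample gradients corrupted by independent Gaussian noise. The function names, parameters, and the quadratic model are all illustrative assumptions. Under these assumptions the SGD iterates have a closed-form stationary variance, lr * noise_std^2 / (batch_size * curvature * (2 - lr * curvature)), which for small learning rates depends on the learning rate and batch size mainly through their ratio.

```python
import random


def stationary_variance(lr, batch_size, curvature=1.0, noise_std=1.0):
    """Closed-form stationary variance of SGD iterates on the toy loss
    f(x) = 0.5 * curvature * x**2 with Gaussian per-sample gradient noise.

    Assumes 0 < lr * curvature < 2 so that the iteration is stable.
    """
    return lr * noise_std ** 2 / (batch_size * curvature * (2.0 - lr * curvature))


def simulate_sgd_variance(lr, batch_size, curvature=1.0, noise_std=1.0,
                          steps=20000, burn_in=1000, seed=0):
    """Run minibatch SGD on the toy loss and return the empirical variance
    of the iterates after a burn-in period."""
    rng = random.Random(seed)
    x = 0.0
    samples = []
    for t in range(steps):
        # Minibatch gradient = true gradient + average of per-sample noise.
        noise = sum(rng.gauss(0.0, noise_std) for _ in range(batch_size)) / batch_size
        grad = curvature * x + noise
        x -= lr * grad
        if t >= burn_in:
            samples.append(x)
    mean = sum(samples) / len(samples)
    return sum((s - mean) ** 2 for s in samples) / len(samples)


if __name__ == "__main__":
    # Two settings with the same lr / batch_size ratio, and one with a larger ratio.
    for lr, batch in [(0.05, 5), (0.1, 10), (0.1, 2)]:
        print(lr, batch, stationary_variance(lr, batch), simulate_sgd_variance(lr, batch))
```

In this toy model, a larger ratio of learning rate to batch size means larger fluctuations around the minimum, which is one informal way to picture why such runs leave a narrow basin sooner. The escaping-time and asymptotic results summarized above concern general non-convex losses and are not reproduced by this sketch.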

research
12/02/2021

On Large Batch Training and Sharp Minima: A Fokker-Planck Perspective

We study the statistical properties of the dynamic trajectory of stochas...
research
12/07/2020

Stochastic Gradient Descent with Large Learning Rate

As a simple and efficient optimization method in deep learning, stochast...
research
09/05/2020

S-SGD: Symmetrical Stochastic Gradient Descent with Weight Noise Injection for Reaching Flat Minima

The stochastic gradient descent (SGD) method is most widely used for dee...
research
09/24/2020

How Many Factors Influence Minima in SGD?

Stochastic gradient descent (SGD) is often applied to train Deep Neural ...
research
06/07/2023

Catapults in SGD: spikes in the training loss and their impact on generalization through feature learning

In this paper, we first present an explanation regarding the common occu...
research
04/27/2023

Noise Is Not the Main Factor Behind the Gap Between SGD and Adam on Transformers, but Sign Descent Might Be

The success of the Adam optimizer on a wide array of architectures has m...
research
03/14/2019

Inefficiency of K-FAC for Large Batch Size Training

In stochastic optimization, large batch training can leverage parallel r...
