On Large Batch Training and Sharp Minima: A Fokker-Planck Perspective

12/02/2021
by Xiaowu Dai, et al.

We study the statistical properties of the dynamic trajectory of stochastic gradient descent (SGD). We approximate mini-batch SGD and momentum SGD by stochastic differential equations (SDEs), and we exploit this continuous formulation together with the theory of Fokker-Planck equations to develop new results on the escaping phenomenon and on the relationship between large batch sizes and sharp minima. In particular, we find that the stochastic process solution tends to converge to flatter minima regardless of the batch size in the asymptotic regime; however, the convergence rate is rigorously proven to depend on the batch size. These results are validated empirically on various datasets and models.
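To make the SDE view concrete, here is a minimal, self-contained sketch (not the paper's construction; the loss, parameter values, and function names are illustrative). Mini-batch SGD with learning rate eta and batch size B is commonly modeled by dx_t = -L'(x_t) dt + sqrt(eta/B) * sigma * dW_t, whose Fokker-Planck stationary density under isotropic noise is p(x) proportional to exp(-2B * L(x) / (eta * sigma^2)). On a toy 1-d loss with two equally deep minima, one sharp and one flat, that stationary law puts more mass on the flat basin for every B (the mass of a basin scales like the inverse square root of its curvature), while the time to escape the sharp basin grows exponentially with B.

    import numpy as np

    rng = np.random.default_rng(0)

    # Toy 1-d loss L(x) = -c * log(exp(-a*(x+1)^2) + exp(-b*(x-1)^2)):
    # two (nearly) equally deep minima, sharp near x = -1 (curvature ~ 2ca)
    # and flat near x = +1 (curvature ~ 2cb), with a >> b.
    a, b, c = 25.0, 1.0, 0.1

    def grad_loss(x):
        p1 = np.exp(-a * (x + 1.0) ** 2)   # sharp well
        p2 = np.exp(-b * (x - 1.0) ** 2)   # flat well
        return c * (2 * a * (x + 1.0) * p1 + 2 * b * (x - 1.0) * p2) / (p1 + p2)

    def flat_basin_fraction(batch_size, eta=0.1, sigma=1.0, steps=100_000):
        """Euler-Maruyama for dx = -L'(x) dt + sqrt(eta/B) * sigma * dW with
        dt = eta; each step then matches one SGD update whose gradient noise
        has variance sigma^2 / batch_size."""
        x = -1.0                           # start inside the sharp basin
        hits = 0
        for _ in range(steps):
            x -= eta * grad_loss(x)
            x += (eta / np.sqrt(batch_size)) * sigma * rng.standard_normal()
            hits += x > 0.0                # iterate currently in the flat basin?
        return hits / steps

    # Small B escapes the sharp minimum quickly and settles near the
    # flat-biased stationary law; large B stays trapped within the same step
    # budget, even though its eventual stationary law is the same.
    for B in (1, 4, 16):
        print(f"B = {B:2d}: fraction of iterates in flat basin = "
              f"{flat_basin_fraction(B):.2f}")

With these illustrative values, small B should spend most of the run in the flat basin while larger B should remain near the sharp minimum; the exact fractions depend on the seed and the step budget.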

Related research

12/03/2018 · Towards Theoretical Understanding of Large Batch Training in Stochastic Gradient Descent
Stochastic gradient descent (SGD) is almost ubiquitously used for traini...

09/14/2017 · The Impact of Local Geometry and Batch Size on the Convergence and Divergence of Stochastic Gradient Descent
Stochastic small-batch (SB) methods, such as mini-batch Stochastic Gradi...

06/22/2022 · A view of mini-batch SGD via generating functions: conditions of convergence, phase transitions, benefit from negative momenta
Mini-batch SGD with momentum is a fundamental algorithm for learning lar...

03/22/2022 · Resonance in Weight Space: Covariate Shift Can Drive Divergence of SGD with Momentum
Most convergence guarantees for stochastic gradient descent with momentu...

05/21/2018 · SmoothOut: Smoothing Out Sharp Minima for Generalization in Large-Batch Deep Learning
In distributed deep learning, a large batch size in Stochastic Gradient ...

03/14/2019 · Inefficiency of K-FAC for Large Batch Size Training
In stochastic optimization, large batch training can leverage parallel r...

01/19/2023 · An SDE for Modeling SAM: Theory and Insights
We study the SAM (Sharpness-Aware Minimization) optimizer which has rece...
