On the Generalization Benefit of Noise in Stochastic Gradient Descent

06/26/2020
by Samuel L. Smith, et al.

It has long been argued that minibatch stochastic gradient descent can generalize better than large batch gradient descent in deep neural networks. However, recent papers have questioned this claim, arguing that this effect is simply a consequence of suboptimal hyperparameter tuning or insufficient compute budgets when the batch size is large. In this paper, we perform carefully designed experiments and rigorous hyperparameter sweeps on a range of popular models, which verify that small or moderately large batch sizes can substantially outperform very large batches on the test set. This occurs even when both models are trained for the same number of iterations and large batches achieve smaller training losses. Our results confirm that the noise in stochastic gradients can enhance generalization. We study how the optimal learning rate schedule changes as the epoch budget grows, and we provide a theoretical account of our observations based on the stochastic differential equation perspective of SGD dynamics.
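As a rough illustration of the stochastic differential equation perspective mentioned in the abstract, here is a minimal sketch in notation of my own choosing (the symbols B, \epsilon, C_i, and \Sigma are not taken from the paper). One SGD step with learning rate \epsilon on a minibatch \mathcal{B} of size B is

\[
  \omega_{t+1} \;=\; \omega_t - \epsilon\,\hat{g}(\omega_t),
  \qquad
  \hat{g}(\omega) \;=\; \frac{1}{B}\sum_{i \in \mathcal{B}} \nabla C_i(\omega),
\]

and, treating each step as a time increment dt = \epsilon, the standard small-learning-rate approximation models the trajectory as the SDE

\[
  d\omega \;=\; -\nabla C(\omega)\,dt \;+\; \sqrt{\tfrac{\epsilon}{B}}\;\Sigma(\omega)^{1/2}\,dW_t,
\]

where C is the full training loss, C_i the per-example loss, and \Sigma(\omega) the covariance of the per-example gradients. Under this approximation the gradient noise entering the dynamics scales with the ratio \epsilon/B, which is why, at a fixed learning rate, small or moderately large batches experience substantially more noise than very large batches.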


Related research

10/17/2017 · A Bayesian Perspective on Generalization and Stochastic Gradient Descent
This paper tackles two related questions at the heart of machine learnin...

10/21/2019 · Non-Gaussianity of Stochastic Gradient Noise
What enables Stochastic Gradient Descent (SGD) to achieve better general...

06/04/2021 · Fluctuation-dissipation Type Theorem in Stochastic Linear Learning
The fluctuation-dissipation theorem (FDT) is a simple yet powerful conse...

10/06/2020 · A Closer Look at Codistillation for Distributed Training
Codistillation has been proposed as a mechanism to share knowledge among...

05/09/2019 · The Effect of Network Width on Stochastic Gradient Descent and Generalization: an Empirical Study
We investigate how the final parameters found by stochastic gradient des...

09/19/2023 · On the different regimes of Stochastic Gradient Descent
Modern deep networks are trained with stochastic gradient descent (SGD) ...

07/21/2023 · Batch Clipping and Adaptive Layerwise Clipping for Differential Private Stochastic Gradient Descent
Each round in Differential Private Stochastic Gradient Descent (DPSGD) t...
