Stochastic Training is Not Necessary for Generalization

09/29/2021
by Jonas Geiping, et al.

It is widely believed that the implicit regularization of stochastic gradient descent (SGD) is fundamental to the impressive generalization behavior we observe in neural networks. In this work, we demonstrate that non-stochastic full-batch training can achieve strong performance on CIFAR-10 that is on par with SGD, using modern architectures in settings with and without data augmentation. To this end, we utilize modified hyperparameters and show that the implicit regularization of SGD can be completely replaced with explicit regularization. This strongly suggests that theories that rely heavily on properties of stochastic sampling to explain generalization are incomplete, as strong generalization behavior is still observed in the absence of stochastic sampling. Fundamentally, deep learning can succeed without stochasticity. Our observations further indicate that the perceived difficulty of full-batch training is largely the result of its optimization properties and the disproportionate time and effort spent by the ML community tuning optimizers and hyperparameters for small-batch training.
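To make concrete what replacing SGD's implicit regularization with an explicit term can look like, here is a minimal PyTorch sketch of a single full-batch update. The abstract does not specify the form of the explicit regularizer, so a squared gradient-norm penalty plus gradient clipping is assumed here purely for illustration; the function name, coefficient values, and clipping threshold are hypothetical and not the paper's exact recipe.

```python
# Minimal sketch (assumed, not the paper's exact method): full-batch gradient
# descent where SGD's implicit regularization is replaced by an explicit penalty
# on the squared norm of the loss gradient, plus gradient clipping.
import torch

def full_batch_step(model, loss_fn, inputs, targets, optimizer,
                    reg_coeff=1e-2, clip=0.25):
    """One deterministic update on the entire training set (no mini-batch sampling)."""
    optimizer.zero_grad()
    loss = loss_fn(model(inputs), targets)

    # Explicit regularizer (assumption): squared norm of the loss gradient,
    # computed with create_graph=True so its own gradient can be backpropagated.
    params = [p for p in model.parameters() if p.requires_grad]
    grads = torch.autograd.grad(loss, params, create_graph=True)
    penalty = sum((g ** 2).sum() for g in grads)

    (loss + reg_coeff * penalty).backward()
    torch.nn.utils.clip_grad_norm_(params, clip)  # stabilizes large full-batch steps
    optimizer.step()
    return loss.item()
```

In practice a full CIFAR-10 batch would be accumulated over device-sized chunks rather than evaluated in one forward pass; the only point of the sketch is that no stochastic mini-batch sampling enters the update.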


Related research

02/01/2021 · SGD Generalizes Better Than GD (And Regularization Doesn't Help)
We give a new separation result between the generalization performance o...

11/17/2020 · Contrastive Weight Regularization for Large Minibatch SGD
The minibatch stochastic gradient descent method (SGD) is widely applied...

06/27/2020 · Stochastic Batch Augmentation with An Effective Distilled Dynamic Soft Label Regularizer
Data augmentation has been intensively used in training deep neural net...

11/29/2022 · Disentangling the Mechanisms Behind Implicit Regularization in SGD
A number of competing hypotheses have been proposed to explain why small...

05/30/2023 · Auto-tune: PAC-Bayes Optimization over Prior and Posterior for Neural Networks
It is widely recognized that the generalization ability of neural networ...

12/28/2020 · Catastrophic Fisher Explosion: Early Phase Fisher Matrix Impacts Generalization
The early phase of training has been shown to be important in two ways f...

05/20/2022 · On the SDEs and Scaling Rules for Adaptive Gradient Algorithms
Approximating Stochastic Gradient Descent (SGD) as a Stochastic Differen...
