DeepAI AI Chat
Log In Sign Up

Stochastic Training is Not Necessary for Generalization

by   Jonas Geiping, et al.
University of Maryland
University of Siegen

It is widely believed that the implicit regularization of stochastic gradient descent (SGD) is fundamental to the impressive generalization behavior we observe in neural networks. In this work, we demonstrate that non-stochastic full-batch training can achieve strong performance on CIFAR-10 that is on-par with SGD, using modern architectures in settings with and without data augmentation. To this end, we utilize modified hyperparameters and show that the implicit regularization of SGD can be completely replaced with explicit regularization. This strongly suggests that theories that rely heavily on properties of stochastic sampling to explain generalization are incomplete, as strong generalization behavior is still observed in the absence of stochastic sampling. Fundamentally, deep learning can succeed without stochasticity. Our observations further indicate that the perceived difficulty of full-batch training is largely the result of its optimization properties and the disproportionate time and effort spent by the ML community tuning optimizers and hyperparameters for small-batch training.


page 1

page 2

page 3

page 4


SGD Generalizes Better Than GD (And Regularization Doesn't Help)

We give a new separation result between the generalization performance o...

Contrastive Weight Regularization for Large Minibatch SGD

The minibatch stochastic gradient descent method (SGD) is widely applied...

Stochastic Batch Augmentation with An Effective Distilled Dynamic Soft Label Regularizer

Data augmentation have been intensively used in training deep neural net...

Disentangling the Mechanisms Behind Implicit Regularization in SGD

A number of competing hypotheses have been proposed to explain why small...

Auto-tune: PAC-Bayes Optimization over Prior and Posterior for Neural Networks

It is widely recognized that the generalization ability of neural networ...

Catastrophic Fisher Explosion: Early Phase Fisher Matrix Impacts Generalization

The early phase of training has been shown to be important in two ways f...

On the SDEs and Scaling Rules for Adaptive Gradient Algorithms

Approximating Stochastic Gradient Descent (SGD) as a Stochastic Differen...

Code Repositories


Training vision models with full-batch gradient descent and regularization

view repo