
Stochastic Training is Not Necessary for Generalization

09/29/2021
by Jonas Geiping, et al.
University of Maryland
University of Siegen

It is widely believed that the implicit regularization of stochastic gradient descent (SGD) is fundamental to the impressive generalization behavior we observe in neural networks. In this work, we demonstrate that non-stochastic full-batch training can achieve strong performance on CIFAR-10 that is on par with SGD, using modern architectures in settings with and without data augmentation. To this end, we utilize modified hyperparameters and show that the implicit regularization of SGD can be completely replaced with explicit regularization. This strongly suggests that theories that rely heavily on properties of stochastic sampling to explain generalization are incomplete, as strong generalization behavior is still observed in the absence of stochastic sampling. Fundamentally, deep learning can succeed without stochasticity. Our observations further indicate that the perceived difficulty of full-batch training is largely the result of its optimization properties and the disproportionate time and effort spent by the ML community tuning optimizers and hyperparameters for small-batch training.
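
As a concrete illustration of swapping SGD's implicit regularization for an explicit one, below is a minimal sketch of a single full-batch gradient-descent step with a gradient-norm penalty. This is only a sketch under assumptions, not the authors' released implementation; the model, data tensors, and the `lr` and `reg_strength` values are illustrative placeholders.

```python
# Minimal sketch (PyTorch): one full-batch GD step in which an explicit
# gradient-norm penalty stands in for the implicit regularization usually
# attributed to SGD noise. Hyperparameter values are illustrative only.
import torch


def full_batch_step(model, inputs, targets, lr=0.1, reg_strength=0.5):
    """One full-batch gradient-descent step with a gradient-norm penalty."""
    params = [p for p in model.parameters() if p.requires_grad]
    loss = torch.nn.functional.cross_entropy(model(inputs), targets)

    # Differentiate once with create_graph=True so the penalty term itself
    # remains differentiable (double backpropagation).
    grads = torch.autograd.grad(loss, params, create_graph=True)
    grad_norm_sq = sum(g.pow(2).sum() for g in grads)

    regularized_loss = loss + reg_strength * grad_norm_sq
    model.zero_grad()
    regularized_loss.backward()

    # Plain (non-stochastic) gradient-descent update on the full batch.
    with torch.no_grad():
        for p in params:
            p -= lr * p.grad
    return loss.item()
```

In this sketch, `inputs` and `targets` would cover the entire training set (or be accumulated over it), so every update uses the exact, noise-free gradient; the explicit penalty and tuned hyperparameters must then supply the regularization that minibatch noise would otherwise provide.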


Related Research

02/01/2021

SGD Generalizes Better Than GD (And Regularization Doesn't Help)

We give a new separation result between the generalization performance o...
11/17/2020

Contrastive Weight Regularization for Large Minibatch SGD

The minibatch stochastic gradient descent method (SGD) is widely applied...
06/27/2020

Stochastic Batch Augmentation with An Effective Distilled Dynamic Soft Label Regularizer

Data augmentation has been intensively used in training deep neural net...
11/29/2022

Disentangling the Mechanisms Behind Implicit Regularization in SGD

A number of competing hypotheses have been proposed to explain why small...
05/30/2023

Auto-tune: PAC-Bayes Optimization over Prior and Posterior for Neural Networks

It is widely recognized that the generalization ability of neural networ...
12/28/2020

Catastrophic Fisher Explosion: Early Phase Fisher Matrix Impacts Generalization

The early phase of training has been shown to be important in two ways f...
05/20/2022

On the SDEs and Scaling Rules for Adaptive Gradient Algorithms

Approximating Stochastic Gradient Descent (SGD) as a Stochastic Differen...

Code Repositories

fullbatchtraining

Training vision models with full-batch gradient descent and regularization

