Beyond Implicit Bias: The Insignificance of SGD Noise in Online Learning

06/14/2023
by Nikhil Vyas, et al.

The success of SGD in deep learning has been ascribed by prior works to the implicit bias induced by a high learning rate or small batch size ("SGD noise"). While prior works focused on offline learning (i.e., multi-epoch training), we study the impact of SGD noise on online (i.e., single-epoch) learning. Through an extensive empirical analysis of image and language data, we demonstrate that large learning rate and small batch size do not confer any implicit bias advantages in online learning. In contrast to offline learning, the benefits of SGD noise in online learning are strictly computational, facilitating larger or more cost-effective gradient steps. Our work suggests that SGD in the online regime can be construed as taking noisy steps along the "golden path" of the noiseless gradient flow algorithm. We provide evidence to support this hypothesis by conducting experiments that reduce SGD noise during training and by measuring the pointwise functional distance between models trained with varying SGD noise levels, but at equivalent loss values. Our findings challenge the prevailing understanding of SGD and offer novel insights into its role in online learning.
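One way to read the "pointwise functional distance" measurement described above is as an average disagreement between two models' predictive distributions on a shared evaluation set. The PyTorch sketch below is a minimal illustration of that idea under stated assumptions, not the authors' implementation: the total-variation distance between softmax outputs is one reasonable instantiation of the metric, and the model and data-loader names are hypothetical.

```python
# Minimal sketch: estimating a pointwise functional distance between two
# trained classifiers on a shared evaluation set. Assumes PyTorch and two
# models of the same architecture trained with different SGD noise levels
# (e.g., large vs. small learning rate) and stopped at equal training loss.
# The metric here (mean total-variation distance between softmax outputs)
# is an assumption for illustration, not necessarily the paper's exact one.
import torch
import torch.nn.functional as F

@torch.no_grad()
def pointwise_functional_distance(model_a, model_b, loader, device="cpu"):
    """Average total-variation distance between the two models'
    predictive distributions over the evaluation set."""
    model_a.eval().to(device)
    model_b.eval().to(device)
    total, n = 0.0, 0
    for x, _ in loader:
        x = x.to(device)
        p_a = F.softmax(model_a(x), dim=-1)
        p_b = F.softmax(model_b(x), dim=-1)
        # TV distance per example: 0.5 * sum_i |p_a(i) - p_b(i)|
        tv = 0.5 * (p_a - p_b).abs().sum(dim=-1)
        total += tv.sum().item()
        n += x.size(0)
    return total / n

# Hypothetical usage: compare checkpoints trained with large vs. small
# SGD noise, taken at equivalent loss values.
# dist = pointwise_functional_distance(model_large_lr, model_small_lr, eval_loader)
# print(f"mean TV distance: {dist:.4f}")
```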

Related research

01/28/2021
On the Origin of Implicit Regularization in Stochastic Gradient Descent
For infinitesimal learning rates, stochastic gradient descent (SGD) foll...

07/11/2021
SGD: The Role of Implicit Regularization, Batch-size and Multiple-epochs
Multi-epoch, small-batch, Stochastic Gradient Descent (SGD) has been the...

08/05/2019
Learning Stages: Phenomenon, Root Cause, Mechanism Hypothesis, and Implications
Under StepDecay learning rate strategy (decaying the learning rate after...

10/16/2020
The Deep Bootstrap: Good Online Learners are Good Offline Generalizers
We propose a new framework for reasoning about generalization in deep le...

03/31/2021
Empirically explaining SGD from a line search perspective
Optimization in Deep Learning is mainly guided by vague intuitions and s...

04/27/2023
Noise Is Not the Main Factor Behind the Gap Between SGD and Adam on Transformers, but Sign Descent Might Be
The success of the Adam optimizer on a wide array of architectures has m...

11/04/2020
Direction Matters: On the Implicit Regularization Effect of Stochastic Gradient Descent with Moderate Learning Rate
Understanding the algorithmic regularization effect of stochastic gradie...
