Catapults in SGD: spikes in the training loss and their impact on generalization through feature learning

06/07/2023
by Libin Zhu, et al.

In this paper, we first present an explanation for the common occurrence of spikes in the training loss when neural networks are trained with stochastic gradient descent (SGD). We provide evidence that these spikes are "catapults", an optimization phenomenon originally observed in GD with large learning rates in [Lewkowycz et al. 2020]. We empirically show that these catapults occur in a low-dimensional subspace spanned by the top eigenvectors of the tangent kernel, for both GD and SGD. Second, we explain how catapults can lead to better generalization by demonstrating that they promote feature learning, increasing alignment with the Average Gradient Outer Product (AGOP) of the true predictor. Furthermore, we demonstrate that a smaller batch size in SGD induces a larger number of catapults, thereby improving AGOP alignment and test performance.
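As a rough illustration of the subspace claim (a sketch, not the paper's code), the following PyTorch snippet computes the empirical tangent kernel on a batch and the fraction of the residual f(x) − y lying in the span of the kernel's top eigenvectors, which is the quantity one would track to watch a catapult unfold in that low-dimensional subspace. It assumes a scalar-output model; the names empirical_ntk and top_subspace_share are ours.

```python
import torch
from torch.func import functional_call, jacrev, vmap

def empirical_ntk(model, xs):
    """Empirical tangent kernel K_ij = <df(x_i)/dtheta, df(x_j)/dtheta> on a batch."""
    params = dict(model.named_parameters())

    def f(p, x):
        # scalar output assumed; wrap the single input in a batch of one
        return functional_call(model, p, (x.unsqueeze(0),)).squeeze()

    jac = vmap(jacrev(f), (None, 0))(params, xs)        # per-sample Jacobians
    J = torch.cat([j.reshape(len(xs), -1) for j in jac.values()], dim=1)
    return J @ J.T                                      # (n, n)

def top_subspace_share(model, xs, ys, k=5):
    """Fraction of the squared residual ||f(X) - y||^2 captured by the
    top-k eigenvectors of the tangent kernel on this batch."""
    with torch.no_grad():
        resid = model(xs).squeeze(-1) - ys              # (n,)
    _, vecs = torch.linalg.eigh(empirical_ntk(model, xs))  # eigenvalues ascending
    coeffs = vecs[:, -k:].T @ resid                     # coordinates in top-k directions
    return (coeffs.pow(2).sum() / resid.pow(2).sum()).item()
```

Calling top_subspace_share once per training step and plotting it alongside the loss would show whether the spikes coincide with activity concentrated in the top eigendirections.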

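The AGOP alignment measurement can be made concrete in the same hedged way. For a function f, the AGOP is E_x[∇f(x) ∇f(x)^T], and the alignment between the model's AGOP and that of the true predictor can be scored with the Frobenius cosine similarity; a minimal sketch follows (agop and cosine_alignment are illustrative names, and the paper's exact estimator may differ):

```python
import torch

def agop(fn, xs):
    """Average Gradient Outer Product: mean over xs of grad f(x) grad f(x)^T.
    `fn` maps a single input tensor to a scalar."""
    grads = []
    for x in xs:
        x = x.detach().clone().requires_grad_(True)
        (g,) = torch.autograd.grad(fn(x), x)            # input gradient of f
        grads.append(g.flatten())
    G = torch.stack(grads)                              # (n, d)
    return G.T @ G / len(xs)                            # (d, d)

def cosine_alignment(A, B):
    """Frobenius cosine similarity <A, B>_F / (||A||_F ||B||_F)."""
    return ((A * B).sum() / (A.norm() * B.norm())).item()

# e.g. cosine_alignment(agop(lambda x: model(x.unsqueeze(0)).squeeze(), xs),
#                       agop(true_fn, xs))
```

Under this metric, the abstract's claim is that runs with more catapults (e.g., smaller SGD batches) end with a higher alignment score between the trained model's AGOP and that of the true predictor.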
Related research

05/31/2019 · Training Dynamics of Deep Networks using Stochastic Gradient Descent via Neural Tangent Kernel
12/03/2018 · Towards Theoretical Understanding of Large Batch Training in Stochastic Gradient Descent
06/13/2020 · The Pitfalls of Simplicity Bias in Neural Networks
06/24/2020 · When Do Neural Networks Outperform Kernel Methods?
09/15/2022 · Random initialisations performing above chance and how to find them
06/04/2021 · Learning Curves for SGD on Structured Features
06/21/2022 · On the Maximum Hessian Eigenvalue and Generalization
