Second-order regression models exhibit progressive sharpening to the edge of stability

10/10/2022
by Atish Agarwala, et al.

Recent studies of gradient descent with large step sizes have shown that there is often a regime with an initial increase in the largest eigenvalue of the loss Hessian (progressive sharpening), followed by a stabilization of the eigenvalue near the maximum value which allows convergence (edge of stability). These phenomena are intrinsically non-linear and do not occur for models in the constant Neural Tangent Kernel (NTK) regime, for which the predictive function is approximately linear in the parameters. As such, we consider the next simplest class of predictive models, namely those that are quadratic in the parameters, which we call second-order regression models. For quadratic objectives in two dimensions, we prove that this second-order regression model exhibits progressive sharpening of the NTK eigenvalue towards a value that differs slightly from the edge of stability, which we explicitly compute. In higher dimensions, the model generically shows similar behavior, even without the specific structure of a neural network, suggesting that progressive sharpening and edge-of-stability behavior are not unique features of neural networks, and could be a more general property of discrete learning algorithms in high-dimensional non-linear models.
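To make the setup concrete, here is a minimal sketch, not the paper's exact construction: full-batch gradient descent on a randomly generated quadratic-in-parameters model, printing the largest eigenvalue of the loss Hessian ("sharpness") alongside the classical stability threshold 2/eta for step size eta. The quadratic form Q, target y, step size, dimension, and initialization scale are all illustrative assumptions.

```python
import numpy as np

# Minimal sketch (illustrative, not the paper's construction): gradient
# descent on a model f(theta) = 0.5 * theta^T Q theta that is quadratic
# in its parameters, tracking the largest loss-Hessian eigenvalue
# ("sharpness") against the 2/eta stability threshold.
# Q, y, eta, d, and the initialization scale are assumptions.

rng = np.random.default_rng(0)
d = 20                                    # parameter dimension (assumed)
Q = rng.normal(size=(d, d))
Q = (Q + Q.T) / 2                         # symmetric quadratic form
Q /= np.abs(np.linalg.eigvalsh(Q)).max()  # normalize spectral norm to 1
y = 1.0                                   # scalar regression target (assumed)
eta = 1.2                                 # step size; threshold 2/eta ~ 1.67

def model(theta):
    return 0.5 * theta @ Q @ theta        # quadratic-in-parameters model

theta = 0.1 * rng.normal(size=d)          # small initialization
for step in range(201):
    r = model(theta) - y                  # residual
    g = Q @ theta                         # grad of f; NTK eigenvalue = |g|^2
    grad = r * g                          # gradient of the loss 0.5 * r^2
    H = np.outer(g, g) + r * Q            # Hessian: Gauss-Newton term + r * Q
    sharpness = np.linalg.eigvalsh(H)[-1] # largest Hessian eigenvalue
    if step % 20 == 0:
        print(f"step {step:3d}  loss {0.5 * r**2:8.4f}  "
              f"sharpness {sharpness:6.3f}  (2/eta = {2 / eta:.3f})")
    theta -= eta * grad
```

For a draw of Q where sharpening occurs, the printed sharpness should rise from below the 2/eta line toward it rather than converging smoothly; the paper's analysis predicts stabilization at a value slightly different from 2/eta. The exact trajectory depends on the seed, step size, and scaling chosen here.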


Related research

02/26/2021
Gradient Descent on Neural Networks Typically Occurs at the Edge of Stability
We empirically demonstrate that full-batch gradient descent on neural ne...

09/30/2022
Self-Stabilization: The Implicit Bias of Gradient Descent at the Edge of Stability
Traditional analyses of gradient descent show that when the largest eige...

07/26/2022
Analyzing Sharpness along GD Trajectory: Progressive Sharpening and Edge of Stability
Recent findings (e.g., arXiv:2103.00065) demonstrate that modern neural ...

07/09/2023
Trajectory Alignment: Understanding the Edge of Stability Phenomenon via Bifurcation Theory
Cohen et al. (2021) empirically study the evolution of the largest eigen...

10/07/2022
Understanding Edge-of-Stability Training Dynamics with a Minimalist Example
Recently, researchers observed that gradient descent for deep neural net...

07/29/2022
Adaptive Gradient Methods at the Edge of Stability
Very little is known about the training dynamics of adaptive gradient me...

12/14/2020
An Adaptive Memory Multi-Batch L-BFGS Algorithm for Neural Network Training
Motivated by the potential for parallel implementation of batch-based al...
