
Adaptive Gradient Methods at the Edge of Stability

07/29/2022
by Jeremy M. Cohen, et al. (Google, Carnegie Mellon University)

Very little is known about the training dynamics of adaptive gradient methods like Adam in deep learning. In this paper, we shed light on the behavior of these algorithms in the full-batch and sufficiently large batch settings. Specifically, we empirically demonstrate that during full-batch training, the maximum eigenvalue of the preconditioned Hessian typically equilibrates at a certain numerical value: the stability threshold of a gradient descent algorithm. For Adam with step size η and β_1 = 0.9, this stability threshold is 38/η. Similar effects occur during minibatch training, especially as the batch size grows. Yet, even though adaptive methods train at the "Adaptive Edge of Stability" (AEoS), their behavior in this regime differs significantly from that of non-adaptive methods at the Edge of Stability (EoS). Whereas non-adaptive algorithms at the EoS are blocked from entering high-curvature regions of the loss landscape, adaptive gradient methods at the AEoS can keep advancing into high-curvature regions while adapting the preconditioner to compensate. Our findings can serve as a foundation for the community's future understanding of adaptive gradient methods in deep learning.
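For readers who want to probe the quantity the abstract describes, below is a minimal PyTorch sketch (not the authors' code) of how one might estimate the maximum eigenvalue of the preconditioned Hessian, lambda_max(P^{-1} H), by power iteration on Hessian-vector products, with P approximated as diag(sqrt(v_hat) + eps) read off Adam's second-moment state. The helper names (adam_preconditioner, max_eig_preconditioned_hessian) and the usage snippet are illustrative assumptions, not code from the paper. As a side note, the 38/η figure is consistent with the classical stability threshold 2(1 + β_1) / ((1 - β_1) η) of EMA-momentum gradient descent on a quadratic, evaluated at β_1 = 0.9, under the simplifying assumption of a frozen preconditioner.

```python
import torch

def adam_preconditioner(optimizer, eps=1e-8):
    """Flatten Adam's diagonal preconditioner, sqrt(v_hat) + eps, in parameter order.

    Assumes the optimizer is torch.optim.Adam and that at least one step has been taken,
    so the 'exp_avg_sq' and 'step' entries exist in the optimizer state.
    """
    diag = []
    for group in optimizer.param_groups:
        beta2 = group["betas"][1]
        for p in group["params"]:
            state = optimizer.state[p]
            v_hat = state["exp_avg_sq"] / (1 - beta2 ** state["step"])  # bias-corrected second moment
            diag.append(v_hat.sqrt().reshape(-1) + eps)
    return torch.cat(diag)

def max_eig_preconditioned_hessian(loss, params, precond, iters=20, tol=1e-4):
    """Estimate lambda_max(P^{-1} H) by power iteration, where H is the Hessian of
    `loss` with respect to `params` and P = diag(precond)."""
    # First backward pass, keeping the graph so we can differentiate the gradient again.
    grads = torch.autograd.grad(loss, params, create_graph=True)
    flat_grad = torch.cat([g.reshape(-1) for g in grads])

    v = torch.randn_like(flat_grad)
    v /= v.norm()
    lam = 0.0
    for _ in range(iters):
        # Hessian-vector product H v via a second backward pass (Pearlmutter's trick).
        hv = torch.autograd.grad(flat_grad @ v, params, retain_graph=True)
        u = torch.cat([h.reshape(-1) for h in hv]) / precond  # apply P^{-1}
        lam_new = torch.dot(v, u).item()
        v = u / (u.norm() + 1e-12)
        if abs(lam_new - lam) < tol * max(abs(lam_new), 1.0):
            return lam_new
        lam = lam_new
    return lam

# Hypothetical usage during full-batch training (model and full_batch_loss are assumptions):
#   loss = full_batch_loss(model)
#   params = [p for g in optimizer.param_groups for p in g["params"]]
#   sharpness = max_eig_preconditioned_hessian(loss, params, adam_preconditioner(optimizer))
#   # Per the abstract, this value tends to equilibrate near 38 / lr for Adam with beta1 = 0.9.
```

Logging this estimate once per epoch is one way to reproduce the equilibration behavior reported in the paper; the power iteration converges to the dominant eigenvalue of P^{-1} H, which is real because P^{-1} H is similar to the symmetric matrix P^{-1/2} H P^{-1/2}.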


Related research

Gradient Descent on Neural Networks Typically Occurs at the Edge of Stability (02/26/2021)
We empirically demonstrate that full-batch gradient descent on neural ne...

Explaining the Adaptive Generalisation Gap (11/15/2020)
We conjecture that the reason for the difference in generalisation betwe...

Understanding Edge-of-Stability Training Dynamics with a Minimalist Example (10/07/2022)
Recently, researchers observed that gradient descent for deep neural net...

Analyzing Sharpness along GD Trajectory: Progressive Sharpening and Edge of Stability (07/26/2022)
Recent findings (e.g., arXiv:2103.00065) demonstrate that modern neural ...

Understanding Gradient Descent on Edge of Stability in Deep Learning (05/19/2022)
Deep learning experiments in Cohen et al. (2021) using deterministic Gra...

Second-order regression models exhibit progressive sharpening to the edge of stability (10/10/2022)
Recent studies of gradient descent with large step sizes have shown that...

Self-Tuning Stochastic Optimization with Curvature-Aware Gradient Filtering (11/09/2020)
Standard first-order stochastic optimization algorithms base their updat...