On the SDEs and Scaling Rules for Adaptive Gradient Algorithms

05/20/2022
by Sadhika Malladi, et al.

Approximating Stochastic Gradient Descent (SGD) as a Stochastic Differential Equation (SDE) has allowed researchers to enjoy the benefits of studying a continuous optimization trajectory while carefully preserving the stochasticity of SGD. Analogous study of adaptive gradient methods, such as RMSprop and Adam, has been challenging because there were no rigorously proven SDE approximations for these methods. This paper derives the SDE approximations for RMSprop and Adam, giving theoretical guarantees of their correctness as well as experimental validation of their applicability to common large-scale vision and language settings. A key practical result is the derivation of a square root scaling rule to adjust the optimization hyperparameters of RMSprop and Adam when changing batch size, and its empirical validation in deep learning settings.

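For concreteness, the following minimal Python sketch shows how such a square root scaling rule could be applied when the batch size is multiplied by a factor kappa. The abstract only states that the learning-rate-like hyperparameters are adjusted with a square-root dependence on batch size; the specific treatment of the momentum coefficients (betas) and epsilon below is an assumption modeled on SDE-preserving scalings in this line of work, not a quotation of the paper's exact prescription.

# Minimal sketch (not the authors' code) of a square root scaling rule for Adam/RMSprop.
# Assumptions: hyperparameter names follow the usual PyTorch Adam conventions (lr, betas, eps);
# the beta and eps adjustments below should be checked against the paper before use.

def sqrt_scale_adam_hparams(lr, betas, eps, kappa):
    """Rescale Adam hyperparameters when the batch size is multiplied by `kappa`."""
    beta1, beta2 = betas
    new_lr = lr * kappa ** 0.5                # square root scaling of the step size
    new_beta1 = 1.0 - kappa * (1.0 - beta1)   # assumed: shrink the averaging horizon linearly in kappa
    new_beta2 = 1.0 - kappa * (1.0 - beta2)
    new_eps = eps / kappa ** 0.5              # assumed: eps shrinks with sqrt(kappa)
    return new_lr, (new_beta1, new_beta2), new_eps

# Example: doubling the batch size (kappa = 2) from a standard Adam configuration.
print(sqrt_scale_adam_hparams(lr=3e-4, betas=(0.9, 0.999), eps=1e-8, kappa=2))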

