On the SDEs and Scaling Rules for Adaptive Gradient Algorithms

05/20/2022
by Sadhika Malladi, et al.

Approximating Stochastic Gradient Descent (SGD) as a Stochastic Differential Equation (SDE) has allowed researchers to enjoy the benefits of studying a continuous optimization trajectory while carefully preserving the stochasticity of SGD. Analogous study of adaptive gradient methods, such as RMSprop and Adam, has been challenging because there were no rigorously proven SDE approximations for these methods. This paper derives the SDE approximations for RMSprop and Adam, giving theoretical guarantees of their correctness as well as experimental validation of their applicability to common large-scale vision and language settings. A key practical result is the derivation of a square root scaling rule to adjust the optimization hyperparameters of RMSprop and Adam when changing batch size, and its empirical validation in deep learning settings.
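
The abstract states the scaling rule only at the level of the learning rate: when the batch size grows by a factor kappa, the learning rate is multiplied by sqrt(kappa). Below is a minimal Python sketch of how such a rescaling might be applied in practice. The helper name sqrt_scale_adam_hparams is hypothetical, and the adjustments shown for beta1, beta2, and eps are assumptions modeled on the idea that the moving-average horizons should track the batch-size change, not a verbatim transcription of the paper's prescription.

```python
import math

def sqrt_scale_adam_hparams(hparams: dict, old_batch: int, new_batch: int) -> dict:
    """Rescale Adam/RMSprop hyperparameters for a new batch size (sketch)."""
    kappa = new_batch / old_batch
    scaled = dict(hparams)
    # From the abstract: learning rate scales with the square root
    # of the batch-size ratio.
    scaled["lr"] = hparams["lr"] * math.sqrt(kappa)
    # Assumption: the moving-average horizons shrink linearly with kappa,
    # i.e. 1 - beta is scaled by kappa (clipped so beta stays in [0, 1)).
    for beta in ("beta1", "beta2"):
        if beta in hparams:
            scaled[beta] = 1.0 - min(kappa * (1.0 - hparams[beta]), 1.0)
    # Assumption: epsilon is shrunk by 1/sqrt(kappa) so the adaptive
    # denominator keeps a comparable scale after rescaling.
    if "eps" in hparams:
        scaled["eps"] = hparams["eps"] / math.sqrt(kappa)
    return scaled

# Example: doubling the batch size from 256 to 512.
base = {"lr": 3e-4, "beta1": 0.9, "beta2": 0.999, "eps": 1e-8}
print(sqrt_scale_adam_hparams(base, old_batch=256, new_batch=512))
```

In the example, kappa = 2, so the learning rate grows by sqrt(2) (from 3e-4 to about 4.24e-4) while the 1 - beta quantities double, illustrating the square root rule rather than the linear scaling rule commonly used for SGD.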

