On the SDEs and Scaling Rules for Adaptive Gradient Algorithms

by Sadhika Malladi, et al.

Approximating Stochastic Gradient Descent (SGD) as a Stochastic Differential Equation (SDE) has allowed researchers to study a continuous optimization trajectory while carefully preserving the stochasticity of SGD. An analogous study of adaptive gradient methods, such as RMSprop and Adam, has been challenging because there were no rigorously proven SDE approximations for these methods. This paper derives the SDE approximations for RMSprop and Adam, giving theoretical guarantees of their correctness as well as experimental validation of their applicability to common large-scale vision and language settings. A key practical result is the derivation of a square root scaling rule to adjust the optimization hyperparameters of RMSprop and Adam when changing batch size, and its empirical validation in deep learning settings.
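
To make the square root scaling rule concrete, the sketch below rescales Adam hyperparameters when the batch size is multiplied by a factor kappa. The abstract itself only names the rule. The specific form used here (learning rate multiplied by sqrt(kappa), 1 - beta1 and 1 - beta2 multiplied by kappa, epsilon divided by sqrt(kappa)) is our reading of the rule as stated in the full paper and should be checked against it. The function name `sqrt_scale_adam` and its argument names are hypothetical.

```python
import math


def sqrt_scale_adam(lr, beta1, beta2, eps, kappa):
    """Rescale Adam hyperparameters for a batch size multiplied by kappa.

    Assumed form of the square root scaling rule (verify against the paper):
      lr        -> lr * sqrt(kappa)
      1 - beta1 -> kappa * (1 - beta1)
      1 - beta2 -> kappa * (1 - beta2)
      eps       -> eps / sqrt(kappa)
    """
    if kappa <= 0:
        raise ValueError("kappa must be positive")

    new_beta1 = 1.0 - kappa * (1.0 - beta1)
    new_beta2 = 1.0 - kappa * (1.0 - beta2)

    # For large kappa the rescaled betas leave [0, 1); Adam is not
    # defined there, so fail loudly rather than clip silently.
    if not (0.0 <= new_beta1 < 1.0 and 0.0 <= new_beta2 < 1.0):
        raise ValueError("kappa too large: rescaled betas fall outside [0, 1)")

    root = math.sqrt(kappa)
    return {
        "lr": lr * root,
        "beta1": new_beta1,
        "beta2": new_beta2,
        "eps": eps / root,
    }


# Example: going from batch size 256 to 1024 means kappa = 4.
scaled = sqrt_scale_adam(lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8, kappa=4)
```

The returned dictionary can be passed to whatever optimizer constructor is in use. Raising an error when a rescaled beta leaves [0, 1) is a design choice of this sketch, not something the abstract prescribes.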




Stochastic Runge-Kutta methods and adaptive SGD-G2 stochastic gradient descent

The minimization of the loss function is of paramount importance in deep...

Computing the Variance of Shuffling Stochastic Gradient Algorithms via Power Spectral Density Analysis

When solving finite-sum minimization problems, two common alternatives t...

Why ADAM Beats SGD for Attention Models

While stochastic gradient descent (SGD) is still the de facto algorithm ...

Gaussian Process Inference Using Mini-batch Stochastic Gradient Descent: Convergence Guarantees and Empirical Benefits

Stochastic gradient descent (SGD) and its variants have established them...

Automatic and Simultaneous Adjustment of Learning Rate and Momentum for Stochastic Gradient Descent

Stochastic Gradient Descent (SGD) methods are prominent for training mac...

Escaping Saddle Points with Adaptive Gradient Methods

Adaptive methods such as Adam and RMSProp are widely used in deep learni...

Stationary Behavior of Constant Stepsize SGD Type Algorithms: An Asymptotic Characterization

Stochastic approximation (SA) and stochastic gradient descent (SGD) algo...