An Adaptive and Momental Bound Method for Stochastic Learning

10/27/2019
by Jianbang Ding, et al.

Training deep neural networks requires careful initialization and careful selection of learning rates. Stochastic gradient methods that adapt the learning rate using squared past gradients, e.g., AdaGrad, AdaDelta, and Adam, ease this burden somewhat. However, recent studies have shown that such methods come with their own pitfalls, including non-convergence issues, and alternative variants such as AMSGrad, AdaShift, and AdaBound have been proposed to address them. In this work, we identify a further problem of adaptive learning rate methods that appears at the beginning of training: Adam can produce extremely large learning rates that prevent learning from getting started. We propose the Adaptive and Momental Bound (AdaMod) method, which restricts the adaptive learning rates with adaptive and momental upper bounds. These dynamic bounds are computed as exponential moving averages of the adaptive learning rates themselves, which smooths out unexpectedly large learning rates and stabilizes the training of deep neural networks. Our experiments verify that AdaMod eliminates the extremely large learning rates throughout training and brings significant improvements over Adam, especially on complex networks such as DenseNet and Transformer. Our implementation is available at: https://github.com/lancopku/AdaMod

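For a concrete picture of the update rule described in the abstract, below is a minimal NumPy sketch, not the official implementation: it assumes Adam-style bias-corrected moment estimates, an element-wise adaptive learning rate, and a third coefficient (here named beta3) controlling the exponential moving average that serves as the momental bound. The function name adamod_step and the state dictionary layout are illustrative; see the linked repository for the authors' PyTorch optimizer.

```python
import numpy as np

def adamod_step(theta, grad, state, lr=1e-3,
                beta1=0.9, beta2=0.999, beta3=0.999, eps=1e-8):
    """One AdaMod-style update (sketch). `state` holds m, v, s, and step t."""
    state["t"] += 1
    t = state["t"]

    # Adam-style first and second moment estimates with bias correction.
    state["m"] = beta1 * state["m"] + (1 - beta1) * grad
    state["v"] = beta2 * state["v"] + (1 - beta2) * grad ** 2
    m_hat = state["m"] / (1 - beta1 ** t)
    v_hat = state["v"] / (1 - beta2 ** t)

    # Element-wise adaptive learning rate, as in Adam.
    eta = lr / (np.sqrt(v_hat) + eps)

    # Momental bound: exponential moving average of the adaptive rates.
    state["s"] = beta3 * state["s"] + (1 - beta3) * eta

    # Clip the current rates by their smoothed history, then take the step.
    eta_hat = np.minimum(eta, state["s"])
    return theta - eta_hat * m_hat


# Toy usage: minimize f(x) = x^2 with noisy gradients.
theta = np.array([5.0])
state = {"m": np.zeros_like(theta), "v": np.zeros_like(theta),
         "s": np.zeros_like(theta), "t": 0}
for _ in range(1000):
    grad = 2 * theta + np.random.normal(scale=0.1, size=theta.shape)
    theta = adamod_step(theta, grad, state, lr=0.01)
```

The clipping by a smoothed history of the rates is what caps the extremely large learning rates in the early steps, since the bound grows only gradually from its initial value.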