On the Variance of the Adaptive Learning Rate and Beyond

08/08/2019
by   Liyuan Liu, et al.
1

The learning rate warmup heuristic achieves remarkable success in stabilizing training, accelerating convergence and improving generalization for adaptive stochastic optimization algorithms like RMSprop and Adam. Here, we study its mechanism in details. Pursuing the theory behind warmup, we identify a problem of the adaptive learning rate (i.e., it has problematically large variance in the early stage), suggest warmup works as a variance reduction technique, and provide both empirical and theoretical evidence to verify our hypothesis. We further propose RAdam, a new variant of Adam, by introducing a term to rectify the variance of the adaptive learning rate. Extensive experimental results on image classification, language modeling, and neural machine translation verify our intuition and demonstrate the effectiveness and robustness of our proposed method. All implementations are available at: https://github.com/LiyuanLucasLiu/RAdam.

READ FULL TEXT
research
12/24/2019

CProp: Adaptive Learning Rate Scaling from Past Gradient Conformity

Most optimizers including stochastic gradient descent (SGD) and its adap...
research
10/27/2019

An Adaptive and Momental Bound Method for Stochastic Learning

Training deep neural networks requires intricate initialization and care...
research
09/21/2019

Using Statistics to Automate Stochastic Optimization

Despite the development of numerous adaptive optimizers, tuning the lear...
research
06/21/2020

Adaptive Learning Rates with Maximum Variation Averaging

Adaptive gradient methods such as RMSProp and Adam use exponential movin...
research
02/16/2021

GradInit: Learning to Initialize Neural Networks for Stable and Efficient Training

Changes in neural architectures have fostered significant breakthroughs ...
research
01/11/2022

Optimization Planning for 3D ConvNets

It is not trivial to optimally learn a 3D Convolutional Neural Networks ...
research
05/27/2019

Stochastic Gradient Methods with Layer-wise Adaptive Moments for Training of Deep Networks

We propose NovoGrad, a first-order stochastic gradient method with layer...

Please sign up or login with your details

Forgot password? Click here to reset