1 Introduction
The most popular algorithms for training Deep Neural Networks (DNNs) are Stochastic Gradient Descent (SGD) with momentum
(Polyak, 1964; Sutskever et al., 2013), Adam (Kingma and Ba, 2015), RMSProp
(Tieleman and Hinton, 2012), and AdaGrad (Duchi et al., 2011). SGD with momentum is the preferred algorithm for computer vision problems, while Adam is the most commonly used for natural language processing (NLP) and speech problems. Compared to SGD, Adam is perceived as safer and more robust to weight initialization and learning rate policy.
(On Adam's perceived robustness, see A. Karpathy, "A Recipe for Training Neural Networks", http://karpathy.github.io/2019/04/25/recipe/.) However, Adam has certain drawbacks. First, as noted in the original Adam paper (Kingma and Ba, 2015), the second moment can vanish or explode for some variables, which can lead to instability, especially during the initial phase of training. To alleviate this problem, a learning rate (LR) warmup is typically used (Vaswani et al., 2017). Second, Adam often leads to solutions that generalize worse than SGD (Wilson et al., 2017). Finally, it is incompatible with L2-regularization, as shown in Loshchilov and Hutter (2019). To improve Adam regularization, Loshchilov and Hutter (2019) proposed AdamW, a variant of Adam in which weight decay is decoupled from the moment computation. This decoupling significantly boosts the validation accuracy of models trained with Adam, especially for very large networks.
NovoGrad builds upon the strengths of SGD and Adam algorithms in the following ways:

- Gradient normalization with second moments makes it invariant to weight rescaling and improves the algorithm's robustness.

- NovoGrad computes second moments per layer, instead of per individual parameter, resulting in half the memory consumption of Adam (see the explanation in Section 3).

- NovoGrad uses decoupled weight decay (as in AdamW) for better regularization.
We applied NovoGrad to a variety of large-scale problems (image classification, neural machine translation, language modeling, and speech recognition) and found that, in all cases, it performs as well as or better than Adam/AdamW and SGD with momentum.
2 Related work
SGD-based algorithms take a batch of $B$ training samples $[x_1, \dots, x_B]$ and compute the stochastic gradient of the loss $L$ with respect to the weights $w_t$ at each step $t$:

(1) $g_t = \frac{1}{B} \sum_{i=1}^{B} \nabla_w L(x_i, w_t)$

SGD with momentum uses the first-order moment $m_t$ to update the weights:

(2) $m_t = \beta \cdot m_{t-1} + g_t$

(3) $w_{t+1} = w_t - \lambda_t \cdot m_t$

where $\lambda_t$ is the learning rate and $\beta$ is the momentum. (We moved $\lambda_t$ into the weight update for consistency with the TensorFlow and PyTorch implementations.)
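Equations (2)-(3) can be illustrated in a few lines (a minimal sketch; the function name and the list-of-floats parameter representation are ours, not from any framework):

```python
def sgd_momentum_step(w, m, grad, lr, beta):
    """One SGD-with-momentum update:
    m_t = beta * m_{t-1} + g_t, then w_{t+1} = w_t - lr * m_t.
    The learning rate sits in the weight update, matching the
    TensorFlow/PyTorch convention mentioned above."""
    m = [beta * mi + gi for mi, gi in zip(m, grad)]
    w = [wi - lr * mi for wi, mi in zip(w, m)]
    return w, m
```

Note that because the learning rate multiplies the accumulated momentum, changing the LR schedule takes effect immediately rather than being smoothed through the momentum buffer.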
Adam is a popular adaptive learning rate method (Kingma and Ba, 2015). It computes the first- and second-order moments, $m_t$ and $v_t$ respectively, using exponential moving averages:

(4) $m_t = \beta_1 \cdot m_{t-1} + (1-\beta_1) \cdot g_t$

(5) $v_t = \beta_2 \cdot v_{t-1} + (1-\beta_2) \cdot g_t^2$

The purpose of the second moment is to "normalize" the first moment during the weight update (we skip the bias correction for brevity):

(6) $w_{t+1} = w_t - \lambda_t \cdot \frac{m_t}{\sqrt{v_t} + \epsilon}$

Note that the Adam update is scale-invariant with respect to the gradient, and the weight update in Equation 6 is bounded approximately by $\lambda_t$ for typical $\beta_1$ and $\beta_2$. These two properties make Adam relatively robust to weight initialization and exploding gradients.
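A compact sketch of Equations (4)-(6) follows (here we do include the per-step bias correction that the text skips; the function name and plain-list representation are our own):

```python
import math

def adam_step(w, m, v, grad, t, lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update per Eqs. (4)-(6), with bias correction
    (t starts at 1)."""
    m = [b1 * mi + (1 - b1) * gi for mi, gi in zip(m, grad)]
    v = [b2 * vi + (1 - b2) * gi * gi for vi, gi in zip(v, grad)]
    m_hat = [mi / (1 - b1 ** t) for mi in m]   # bias-corrected moments
    v_hat = [vi / (1 - b2 ** t) for vi in v]
    w = [wi - lr * mh / (math.sqrt(vh) + eps)
         for wi, mh, vh in zip(w, m_hat, v_hat)]
    return w, m, v
```

On the first step the bias-corrected update has magnitude close to the learning rate itself, illustrating the boundedness property noted above.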
NovoGrad belongs to the family of Stochastic Normalized Gradient Descent (SNGD) methods (Hazan et al., 2015; Nesterov, 1984). SNGD uses only the direction of the stochastic gradient (SG) to update the weights, so the step size does not depend on the gradient's magnitude. By ignoring the gradient magnitude, SNGD is robust to vanishing and exploding gradients. Hazan et al. (2015) proved that the direction of the gradient is sufficient for convergence. In their experiments, SNGD performed comparably to SGD with momentum on small-scale problems like MNIST.
SGD with layer-wise gradient normalization was introduced by Singh et al. (2015) as a remedy against vanishing gradients. Their method scales up small gradients, while keeping large gradients unchanged:

$\hat{g}_t^l = g_t^l \cdot \left(1 + \log\left(1 + \frac{1}{\|g_t^l\|}\right)\right)$

where $g_t^l$ is the vector of gradients for layer $l$ at step $t$. A similar approach was proposed by Yu et al. (2018), who used layer-wise gradient normalization to alleviate both vanishing and exploding gradients. They divide the stochastic gradient for layer $l$ by its norm $\|g_t^l\|$:

$\hat{g}_t^l = \frac{g_t^l}{\|g_t^l\|}$

They showed that gradient normalization can boost both SGD with momentum and Adam.
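The normalization of Yu et al. is straightforward to sketch (a minimal illustration with a hypothetical function name; each layer's gradient is a plain list of floats):

```python
import math

def normalize_layer_grads(grads_per_layer, eps=1e-8):
    """Divide each layer's gradient by its own L2 norm, so every
    layer contributes a (near) unit-norm direction regardless of
    its gradient magnitude."""
    normalized = []
    for g in grads_per_layer:
        norm = math.sqrt(sum(x * x for x in g))
        normalized.append([x / (norm + eps) for x in g])
    return normalized
```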
NovoGrad is also closely related to Normalized Direction-preserving Adam (ND-Adam), an algorithm proposed by Zhang et al. (2017). For each layer $l$, ND-Adam first removes the projection of the gradient $g_t^l$ onto the current weights $w_t^l$:

$\hat{g}_t^l = g_t^l - (g_t^l \cdot w_t^l) \cdot w_t^l$

Then, $\hat{g}_t^l$ is used to compute the first moment and the scalar second moment:

$m_t^l = \beta_1 \cdot m_{t-1}^l + (1-\beta_1) \cdot \hat{g}_t^l$

$v_t^l = \beta_2 \cdot v_{t-1}^l + (1-\beta_2) \cdot \|\hat{g}_t^l\|^2$

Finally, the weights are updated with the first moment rescaled by the second moment, similarly to Adam:

$w_{t+1}^l = w_t^l - \lambda_t \cdot \frac{m_t^l}{\sqrt{v_t^l} + \epsilon}$

ND-Adam does not use weight decay or L2-regularization. Instead, layer weights are explicitly re-normalized in the spirit of Path-SGD (Neyshabur et al., 2015):

$w_{t+1}^l \leftarrow \frac{w_{t+1}^l}{\|w_{t+1}^l\|}$
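The projection-removal step of ND-Adam can be sketched as follows (a minimal illustration; the function name is ours, and the weights are assumed to be unit-normalized, as ND-Adam maintains):

```python
def remove_projection(g, w):
    """Subtract the component of the layer gradient g that points
    along the unit-normalized weights w, so the remaining update is
    tangent to the sphere and preserves the weight direction."""
    dot = sum(gi * wi for gi, wi in zip(g, w))
    return [gi - dot * wi for gi, wi in zip(g, w)]
```

The returned vector is orthogonal to w by construction.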
Wilson et al. (2017) showed that adaptive methods like Adam generalize worse than SGD with momentum. One solution to this problem, proposed by Keskar and Socher (2017), is to use Adam during the initial stage of training and switch to SGD in the later stage. Luo et al. (2019) proposed to improve Adam regularization by limiting the effective per-parameter learning rate $\lambda_t / \sqrt{v_t}$ to a certain range. They showed that limiting this factor from above helps decrease the training loss, while limiting it from below helps the model generalize better.
Loshchilov and Hutter (2019) showed that Adam's weak regularization is due to the fact that the second-moment normalization effectively disables L2-regularization. They proposed a new method, AdamW, which decouples the weight decay $d \cdot w_t$ from the gradient and adds it directly to the weight update:

(7) $w_{t+1} = w_t - \lambda_t \cdot \left(\frac{m_t}{\sqrt{v_t} + \epsilon} + d \cdot w_t\right)$
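Equation 7 differs from ordinary Adam with L2-regularization only in where the decay term enters. A minimal sketch (function name and list representation are ours; bias correction omitted for brevity, as in the text):

```python
import math

def adamw_step(w, m, v, grad, lr=0.001, b1=0.9, b2=0.999,
               eps=1e-8, wd=0.01):
    """One AdamW update per Eq. (7): the decoupled weight decay
    wd * w_t is added directly to the update instead of being
    mixed into the gradient (and thus into the moments)."""
    m = [b1 * mi + (1 - b1) * gi for mi, gi in zip(m, grad)]
    v = [b2 * vi + (1 - b2) * gi * gi for vi, gi in zip(v, grad)]
    w = [wi - lr * (mi / (math.sqrt(vi) + eps) + wd * wi)
         for wi, mi, vi in zip(w, m, v)]
    return w, m, v
```

Because the decay term never passes through the second-moment normalization, its strength is not diluted for parameters with large gradients.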
Because the second moment must be stored separately for every parameter, its computation in Adam doubles the memory required by the optimizer compared to SGD with momentum. This especially affects large models like OpenAI's GPT-2 with 1.5 billion parameters.
Shazeer and Stern (2018) proposed the AdaFactor algorithm, which reduces memory usage by replacing the full second moment with moving averages of the row and column sums of the squared gradients. For a layer defined by an $n \times m$ matrix, this reduces memory from $O(n \times m)$ to $O(n + m)$.

3 Algorithm
Our motivation for this work is to find an algorithm which: (1) performs equally well for image classification, machine translation, and language modeling, and (2) is robust to the learning rate choice and weight initialization. We begin with AdamW as the starting design point. To improve robustness to the learning rate choice, we switch to a layer-wise second moment. This improves stability during the initial training phase, allowing us to remove learning rate warmup and to use the same learning rate policy for a diverse set of tasks. We also use normalized gradients (Hazan et al., 2015) in the first moment for large batch training. The resulting algorithm, NovoGrad, combines the strengths of SGD and Adam without requiring sophisticated learning rate policy tuning, and it works well with large batch sizes.
Let $g_t^l$ be the stochastic gradient for layer $l$ at step $t$. NovoGrad first computes the second moment $v_t^l$ using the norm $\|g_t^l\|$ (we use the L2 norm; it would be interesting to see how L1 or L-infinity norms perform):

(8) $v_t^l = \beta_2 \cdot v_{t-1}^l + (1-\beta_2) \cdot \|g_t^l\|^2$

where $0 \le \beta_2 \le 1$ controls the exponential decay rate of the moving average. The moment $v_t^l$ is used to normalize the gradient when calculating the first-order moment $m_t^l$:

(9) $m_t^l = \beta_1 \cdot m_{t-1}^l + \frac{g_t^l}{\sqrt{v_t^l} + \epsilon}$

where $\beta_1$ is the momentum. The gradient rescaling at each layer improves robustness to weight initialization and prevents vanishing gradients.
Similarly to AdamW, we decouple the weight decay $d \cdot w_t$ from the stochastic gradient for regularization (we move weight decay into the first moment, while AdamW (Loshchilov and Hutter, 2019) uses weight decay in the weight update; we do not observe any difference in performance):

(10) $m_t^l = \beta_1 \cdot m_{t-1}^l + \left(\frac{g_t^l}{\sqrt{v_t^l} + \epsilon} + d \cdot w_t^l\right)$

In practice, good results are obtained with fixed default values of $\beta_1$ and $\beta_2$. The first moment can also be computed via an exponential moving average, in an Adam-like style, instead of momentum:

$m_t^l = \beta_1 \cdot m_{t-1}^l + (1-\beta_1) \cdot \left(\frac{g_t^l}{\sqrt{v_t^l} + \epsilon} + d \cdot w_t^l\right)$

We use the following moment initialization to remove bias:

$v_1^l = \|g_1^l\|^2, \qquad m_1^l = \frac{g_1^l}{\|g_1^l\|} + d \cdot w_1^l$
Weights are updated the same way as in SGD with momentum:

$w_{t+1} = w_t - \lambda_t \cdot m_t$

(To improve the algorithm's robustness for large learning rates, one can optionally apply layer-wise update clipping, similar to LARC; see also Shazeer and Stern (2018). The clipping ensures that $\|\lambda_t \cdot m_t^l\| \le \gamma \cdot \|w_t^l\|$, where $\gamma$ is the clipping threshold.)
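Putting Equations (8)-(10) and the bias-free initialization together, one NovoGrad step can be sketched as follows (a minimal sketch; the function name, the list-of-floats layer representation, and the hyperparameter values in the usage example are illustrative, not prescribed by the text):

```python
import math

def novograd_step(weights, grads, state, lr, b1, b2, wd, eps=1e-8):
    """One NovoGrad step over per-layer weight/gradient vectors:
      v^l: scalar second moment of the layer gradient norm (Eq. 8),
      m^l: momentum over the normalized gradient plus decoupled
           weight decay (Eq. 10), initialized bias-free at step 1,
      w:   plain SGD-with-momentum style weight update."""
    new_weights = []
    for l, (w, g) in enumerate(zip(weights, grads)):
        g_sq = sum(x * x for x in g)                 # ||g_t^l||^2
        if l not in state:                           # bias-free init
            v = g_sq
            m = [x / (math.sqrt(v) + eps) + wd * wi
                 for x, wi in zip(g, w)]
        else:
            v_prev, m_prev = state[l]
            v = b2 * v_prev + (1 - b2) * g_sq        # Eq. (8)
            denom = math.sqrt(v) + eps
            m = [b1 * mp + (x / denom + wd * wi)     # Eq. (10)
                 for mp, x, wi in zip(m_prev, g, w)]
        state[l] = (v, m)
        new_weights.append([wi - lr * mi for wi, mi in zip(w, m)])
    return new_weights, state
```

Note that `state` stores one scalar and one vector per layer, which is where the memory saving over Adam's per-parameter second moment comes from.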
To summarize, NovoGrad is a first-order SGD method with gradients normalized per layer. Borrowing from ND-Adam (Zhang et al., 2017), NovoGrad uses a layer-wise second moment for normalization, and it decouples weight decay from the stochastic gradient for regularization as in AdamW (Loshchilov and Hutter, 2019). NovoGrad has half the memory consumption of Adam (similar to AdaFactor (Shazeer and Stern, 2018), but with a simpler moment computation). Unlike AdaFactor, NovoGrad does not require learning rate warmup.
3.1 Notes on convergence
Similar to other methods that normalize the stochastic gradient by a second moment based on an exponential moving average, one can easily construct a counterexample for a one-dimensional stochastic convex problem, as shown by Wilson et al. (2017) and Reddi et al. (2018). To guarantee the convergence of NovoGrad in the stochastic convex case, we can apply the "AMSGrad" fix (Reddi et al., 2018), replacing $v_t^l$ in the gradient normalization with the non-decreasing sequence

$\hat{v}_t^l = \max(\hat{v}_{t-1}^l, v_t^l)$
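The fix amounts to normalizing by the running maximum of the second moment instead of its moving average (a minimal sketch; the function name is ours):

```python
import math

def amsgrad_normalize(g, v_t, v_hat_prev, eps=1e-8):
    """AMSGrad fix applied to a layer-wise second moment: normalize
    the layer gradient g by the running maximum of v, which is
    non-decreasing and restores stochastic convex convergence
    guarantees (Reddi et al., 2018)."""
    v_hat = max(v_hat_prev, v_t)
    denom = math.sqrt(v_hat) + eps
    return [x / denom for x in g], v_hat
```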
4 Experiments
We evaluated NovoGrad on the following models: ResNet-50 for image classification, Transformer-big for neural machine translation, Jasper for speech recognition, Transformer-XL for language modeling, and BERT for question answering,
and compared it to SGD with momentum, Adam, and AdamW. (Training was done with the OpenSeq2Seq toolkit (Kuchaiev et al., 2018) using mixed precision (Micikevicius et al., 2017) on a DGX-1 with 8 V100 GPUs.) In all the experiments, NovoGrad performed on par with or better than SGD and Adam/AdamW.
4.1 Image classification
We used ResNet-50 v2 (He et al., 2016) for the ImageNet classification task (Russakovsky et al., 2015). (Our OpenSeq2Seq mixed-precision replica of the TensorFlow ResNet-50: https://github.com/NVIDIA/OpenSeq2Seq/blob/master/example_configs/image2label/resnet50v2mp.py.)
We trained this model with 3 optimizers: SGD with momentum (SGD), AdamW, and NovoGrad. All models were trained with a batch size of 1024 for 100 epochs. We used polynomial (quadratic) LR decay for SGD with momentum and NovoGrad. We could not find any reference for training ResNet-50 on ImageNet with AdamW, so we report the best accuracy we achieved after an extensive hyperparameter search with cosine learning rate decay (Loshchilov and Hutter, 2016). We used only standard data augmentation methods: resize, flip, and random crop, and did not employ any additional training tricks (He et al., 2018). The single-crop validation accuracy for each algorithm is reported in Table 1.

Table 1: ImageNet: ResNet-50 v2 single-crop validation accuracy.

optimizer | batch | epochs | top-1,% | top-5,% | LR policy | init LR | WD
SGD       | 1K    | 100    | 76.38   | 93.08   | poly (2)  | 0.400   | 0.0001
SGD       | 1K    | 200    | 76.33   | 92.96   |           |         |
AdamW     | 1K    | 100    | 76.36   | 93.01   | cosine    | 0.002   | 0.120
AdamW     | 1K    | 200    | 76.48   | 92.94   |           |         |
NovoGrad  | 1K    | 100    | 77.00   | 93.37   | poly (2)  | 0.010   | 0.002
NovoGrad  | 1K    | 200    | 77.47   | 93.58   |           |         |
NovoGrad  | 1K    | 300    | 77.63   | 93.73   |           |         |
NovoGrad outperformed both AdamW and SGD, reaching a top-1 accuracy of 77% after 100 epochs. SGD and AdamW accuracy remained below 76.5% when trained for 200 epochs, while NovoGrad accuracy improved to 77.47%. NovoGrad demonstrated strong regularization capabilities: training for 100 additional epochs improved top-1 even further, to 77.63%. Note that this is "vanilla" ResNet-50, without sophisticated data augmentation or additional model tweaking (He et al., 2018).
4.1.1 Large batch training
Hazan et al. (2015) showed that a large batch size is beneficial for SNGD convergence, which motivated us to explore NovoGrad for large batch training. We trained ResNet-50 v2 with batch sizes of 8K and 32K. To compare with previous methods, we trained the model for 90 epochs. To emulate a large batch, we used a mini-batch of 128 per GPU and accumulated gradients from several mini-batches before each weight update.
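The accumulation step described above can be sketched as follows (a minimal illustration; the function name and flat-gradient representation are ours):

```python
def accumulate_gradients(minibatch_grads):
    """Average the gradients of several mini-batches before a single
    weight update, emulating one large batch (e.g., 128 samples per
    GPU accumulated over several steps)."""
    n = len(minibatch_grads)
    total = [0.0] * len(minibatch_grads[0])
    for g in minibatch_grads:
        total = [t + x for t, x in zip(total, g)]
    return [t / n for t in total]
```

Averaging (rather than summing) keeps the gradient scale independent of the number of accumulated mini-batches, so the learning rate does not need to be rescaled with the accumulation count.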
Table 2: ImageNet, ResNet-50 v2 trained with NovoGrad for 90 epochs: accuracy vs. batch size.

batch | top-1,% | top-5,% | init LR | weight decay
1K    | 76.86   | 93.31   | 0.01    | 0.0027
8K    | 76.64   | 93.12   | 0.02    | 0.0060
32K   | 75.48   | 92.46   | 0.03    | 0.0100
Instead of scaling the learning rate linearly with the batch size as in Goyal et al. (2017), we increased both the learning rate and the weight decay to improve regularization (see Table 2).
For comparison, we took 3 other methods which (1) use a fixed batch size during training and (2) do not modify the original model. All 3 methods employ SGD with momentum (SGD). The first method (Goyal et al., 2017) scales the LR linearly with batch size and uses LR warmup to stabilize the initial training phase. The second method (You et al., 2018) combines LR warmup with Layer-wise Adaptive Rate Scaling (LARS) (You et al., 2017). The last method (Codreanu et al., 2017) uses LR warmup and dynamic weight decay (WD).
Table 3: ImageNet, ResNet-50: comparison with other large-batch training methods (top-1 accuracy, %).

Reference               | Optimizer | Bag of Tricks            | #epochs | B=1K  | B=8K  | B=32K
Goyal et al. (2017)     | SGD       | LR warmup                | 90      | 76.47 | 76.26 | 72.45
You et al. (2018)       | SGD       | LR warmup, LARS          | 90      | 75.30 | 75.30 | 75.40
Codreanu et al. (2017)  | SGD       | LR warmup, multi-step WD | 92-100  | 76.50 | 76.26 | 75.31
ours                    | NovoGrad  | none                     | 90      | 76.86 | 76.64 | 75.48
NovoGrad outperformed all other methods without using any additional techniques like LR warmup (Goyal et al., 2017), dynamic weight decay, or special batch normalization initialization.
Jia et al. (2018) and Ying et al. (2018) proposed a few modifications to the ResNet-50 model which significantly improve the accuracy for large batch training. We plan to experiment with augmenting NovoGrad with these techniques, checkpoint averaging (Ying et al., 2018), and label smoothing (Szegedy et al., 2015).

4.2 Neural machine translation
We trained the Transformer "big" model (Vaswani et al., 2017) on the WMT 2014 English-to-German translation task. We used the OpenSeq2Seq (Kuchaiev et al., 2018) transformer-big, which differs from the original implementation (https://github.com/tensorflow/models/tree/master/official/transformer) in two ways: (1) we measure batch size in sentence pairs, not tokens, and (2) we use mixed precision training (Micikevicius et al., 2017). For these experiments, the vocabulary is 32K tokens based on joint source and target byte-pair encoding (Sennrich et al., 2015), built with https://github.com/google/sentencepiece. Models were trained on the WMT'14 dataset and evaluated on newstest14 with sacreBLEU (Post, 2018) on detokenized output (signature: BLEU+case.mixed+lang.en-de+numrefs.1+smooth.exp+test.wmt14/full+tok.13a+version.1.2.12). For Adam and AdamW we used the "Noam" learning rate policy (Shazeer and Stern, 2018): a warmup period of 8,000 steps, with the LR decreasing thereafter proportionally to the inverse square root of the step number. It was observed in Vaswani et al. (2017), and our experiments confirm this, that learning rate warmup is crucial for training Transformer-big with these algorithms. With NovoGrad, however, we were able to use the same poly decay policy as for ResNet-50 without any warmup.
Table 4: WMT'14 English-to-German: sacreBLEU on newstest14, cased (c) and lowercase (lc).

Optimizer | batch | epochs | BLEU (c) | BLEU (lc) | LR policy | init LR | weight decay
Adam      | 1K    | 100    | 27.6     | 28.1      | Noam      | 1.0     |
Adam      | 1K    | 200    | 27.8     | 28.3      | Noam      | 2.0     |
AdamW     | 1K    | 100    | 27.8     | 28.3      | Noam      | 2.0     |
AdamW     | 1K    | 200    | 27.8     | 28.2      | Noam      | 2.0     |
NovoGrad  | 1K    | 100    | 28.1     | 28.5      | poly (2)  | 0.03    |
NovoGrad  | 1K    | 200    | 28.5     | 29.0      | poly (2)  | 0.035   |
NovoGrad performed better than Adam/AdamW, especially for long runs. We observed that NovoGrad is also more stable than Adam with respect to the initial LR choice, and it converges without LR warmup.
4.3 Speech recognition
We conducted experiments with Jasper-10x5 (Li et al., 2019), a state-of-the-art deep convolutional neural acoustic model, on the LibriSpeech 960h speech recognition task (Panayotov et al., 2015). Jasper was trained with SGD with momentum (SGD) and with NovoGrad for 400 epochs. In both cases, we used a batch size of 256, polynomial LR decay, speed perturbation for data augmentation, and Layer-wise Adaptive Rate Clipping (LARC) for gradient clipping. (See https://github.com/NVIDIA/OpenSeq2Seq/blob/master/open_seq2seq/optimizers and https://github.com/NVIDIA/apex/blob/master/apex/parallel/LARC.py. LARC clips the gradient $g^l$ of each layer with respect to the layer weights $w^l$ so that $\|g^l\| \le \gamma \cdot \|w^l\|$, where $\gamma$ is the trust coefficient.)
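The layer-wise clipping can be sketched as follows (a minimal illustration, not the apex implementation; the function name and the trust coefficient value are ours):

```python
import math

def larc_clip(g, w, trust=0.02, eps=1e-8):
    """Layer-wise Adaptive Rate Clipping sketch: rescale the layer
    gradient g so that its norm is at most trust * ||w||; gradients
    already inside the trust region are left unchanged."""
    g_norm = math.sqrt(sum(x * x for x in g))
    w_norm = math.sqrt(sum(x * x for x in w))
    scale = min(1.0, trust * w_norm / (g_norm + eps))
    return [x * scale for x in g]
```

Unlike global-norm clipping, the threshold here adapts to each layer's weight scale, so layers with small weights get proportionally smaller updates.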
Table 5: LibriSpeech: Jasper-10x5 Word Error Rate (WER), %.

Optimizer | dev-clean | dev-other | test-clean | test-other
Adam      | 13.20     | 31.71     | 13.36      | 32.71
SGD       | 3.91      | 12.77     | 3.98       | 12.79
NovoGrad  | 3.64      | 11.89     | 3.86       | 11.95
We found that NovoGrad yields lower Word Error Rates (WER) compared to SGD with momentum, especially for long runs. Unfortunately, we were unable to get good results with Adam. The details about the model and the training parameters are available in Li et al. (2019).
4.4 Language modeling
We trained Transformer-XL (Dai et al., 2019), the state-of-the-art LM architecture, on the word-level WikiText-103 (Merity et al., 2016) benchmark. For all the experiments we used the base model configuration; all other hyperparameters were taken from the original Transformer-XL paper, and the source code was based on a publicly available implementation (https://github.com/cybertronai/transformer-xl). Each configuration was trained on the same total number of tokens, corresponding to the same number of epochs and training iterations for all optimizers.
Figure 1 shows that NovoGrad may require more training steps for the model to converge if compared to Adam. However, NovoGrad exhibits a much smaller gap between training and validation perplexity, which results in better generalization and improved performance on the test set.
Table 6: WikiText-103: Transformer-XL validation and test perplexity.

Optimizer | #tokens | batch | LR policy | init LR | WD | Val PPL | Test PPL
Adam      |         |       | cosine    |         |    |         |
AdamW     |         |       | cosine    |         |    |         |
NovoGrad  |         |       | poly (2)  | 0.01    |    |         |
4.5 Question answering
Question answering is a popular downstream NLP task which frequently uses a pretrained language model instead of training the resulting neural net from scratch. We fine-tuned the large BERT model with Adam, AdamW, and NovoGrad on the question answering benchmark SQuAD v1.1, which involves predicting the answer text span in a paragraph given a question. For Adam, LR warmup over the first 10% of iterations was used to stabilize the initial training phase. With NovoGrad we did not use LR warmup. Interestingly, while NovoGrad required 4 epochs to get comparable results, it still performed exactly the same number of updates as Adam because of its 2x larger batch size. Table 7 shows the best F1 and Exact Match (EM) scores obtained for the SQuAD benchmark on the evaluation dataset.
Table 7: SQuAD v1.1: BERT-large fine-tuning, best EM and F1 on the evaluation set.

Optimizer | batch | epochs | EM    | F1    | LR policy          | init LR | WD
Adam      | 12    | 2      | 84.66 | 91.28 | poly (1) + warmup  |         |
AdamW     | 16    | 2      | 84.52 | 91.19 | poly (1) + warmup  |         |
NovoGrad  | 24    | 4      | 84.43 | 91.14 | cosine             |         |
5 Conclusion
We propose NovoGrad, a first-order SGD method with gradients normalized by a layer-wise second moment computed as a moving average of the squared norms of layer gradients. Because the second moment is layer-wise, NovoGrad requires half the optimizer memory of Adam. NovoGrad also decouples weight decay from the gradient for better regularization.
We tested NovoGrad on very large models for image classification, translation, language modeling, and speech recognition. In these experiments, NovoGrad performed as well as or better than SGD and Adam/AdamW. We found that NovoGrad is more robust to the initial learning rate and weight initialization. For example, NovoGrad works well with the same learning rate decay schedule without warmup, while the other methods require warmup. The layer-wise normalized gradient makes training with NovoGrad robust for large batch sizes, and NovoGrad outperformed current methods for ResNet-50 large batch training. Its strong optimization and regularization qualities allow training longer without overfitting. NovoGrad and all models described in this work are open-sourced in the OpenSeq2Seq toolkit.
Acknowledgments
The authors would like to thank Anima Anandkumar, Yaroslav Bulatov, Ilya Loshchilov, and Sebastian Ruder for their valuable feedback.
References
 Codreanu et al. [2017] Valeriu Codreanu, Damian Podareanu, and Vikram Saletore. Scale out for large minibatch SGD: Residual network training on ImageNet-1K with improved accuracy and reduced time to train. arXiv e-prints arXiv:1711.04291, 2017.
 Dai et al. [2019] Zihang Dai, Zhilin Yang, Yiming Yang, William W Cohen, Jaime Carbonell, Quoc V Le, and Ruslan Salakhutdinov. Transformer-XL: Attentive language models beyond a fixed-length context. arXiv preprint arXiv:1901.02860, 2019.

 Duchi et al. [2011] John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, pages 2121–2159, 2011.
 Goyal et al. [2017] Priya Goyal, Piotr Dollár, Ross B. Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, and Kaiming He. Accurate, large minibatch SGD: training ImageNet in 1 hour. arXiv e-prints arXiv:1706.02677, 2017.
 Hazan et al. [2015] E. Hazan, K. Levy, and S. Shalev-Shwartz. Beyond convexity: Stochastic quasi-convex optimization. In Neural Information Processing Systems, pages 1585–1593, 2015.
 He et al. [2016] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Identity mappings in deep residual networks. arXiv e-prints arXiv:1603.05027, 2016.
 He et al. [2018] Tong He, Zhi Zhang, Hang Zhang, Zhongyue Zhang, Junyuan Xie, and Mu Li. Bag of tricks for image classification with convolutional neural networks. arXiv e-prints arXiv:1812.01187, 2018.
 Jia et al. [2018] Xianyan Jia, Shutao Song, Wei He, Yangzihao Wang, Haidong Rong, Feihu Zhou, Liqiang Xie, Zhenyu Guo, Yuanzhou Yang, Liwei Yu, Tiegang Chen, Guangxiao Hu, Shaohuai Shi, and Xiaowen Chu. Highly scalable deep learning training system with mixed-precision: Training ImageNet in four minutes. arXiv e-prints arXiv:1807.11205, 2018.
 Keskar and Socher [2017] Nitish Shirish Keskar and Richard Socher. Improving generalization performance by switching from Adam to SGD. arXiv e-prints arXiv:1712.07628, 2017.
 Kingma and Ba [2015] Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations, 2015.
 Kuchaiev et al. [2018] Oleksii Kuchaiev, Boris Ginsburg, Igor Gitman, Vitaly Lavrukhin, Carl Case, and Paulius Micikevicius. OpenSeq2Seq: extensible toolkit for distributed and mixed precision training of sequence-to-sequence models. arXiv e-prints arXiv:1805.10387, 2018.
 Li et al. [2019] J. Li, V. Lavrukhin, B. Ginsburg, R. Leary, O. Kuchaiev, J. M. Cohen, H. Nguyen, and R. T. Gadde. Jasper: An end-to-end convolutional neural acoustic model. arXiv e-prints arXiv:1904.03288, 2019.
 Loshchilov and Hutter [2016] Ilya Loshchilov and Frank Hutter. SGDR: Stochastic gradient descent with warm restarts. arXiv e-prints arXiv:1608.03983, 2016.
 Loshchilov and Hutter [2019] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=Bkg6RiCqY7.
 Luo et al. [2019] Liangchen Luo, Yuanhao Xiong, Yan Liu, and Xu Sun. Adaptive gradient methods with dynamic bound of learning rate. In International Conference on Learning Representations, 2019.
 Merity et al. [2016] Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. Pointer sentinel mixture models. arXiv preprint arXiv:1609.07843, 2016.
 Micikevicius et al. [2017] Paulius Micikevicius, Sharan Narang, Jonah Alben, Gregory F. Diamos, Erich Elsen, David García, Boris Ginsburg, Michael Houston, Oleksii Kuchaiev, Ganesh Venkatesh, and Hao Wu. Mixed precision training. ICLR, 2017.
 Nesterov [1984] Y. E. Nesterov. Minimization methods for nonsmooth convex and quasiconvex functions. Matekon, 29:519–531, 1984.
 Neyshabur et al. [2015] Behnam Neyshabur, Ruslan Salakhutdinov, and Nathan Srebro. Path-SGD: Path-normalized optimization in deep neural networks. In Neural Information Processing Systems, pages 2422–2430, 2015.
 Panayotov et al. [2015] Vassil Panayotov, Guoguo Chen, Daniel Povey, and Sanjeev Khudanpur. Librispeech: an asr corpus based on public domain audio books. In Acoustics, Speech and Signal Processing (ICASSP), 2015 IEEE International Conference on, pages 5206–5210. IEEE, 2015.
 Polyak [1964] B. T. Polyak. Some methods of speeding up the convergence of iteration methods. USSR Computational Mathematics and Mathematical Physics, 1964.
 Post [2018] Matt Post. A call for clarity in reporting BLEU scores. arXiv e-prints arXiv:1804.08771, 2018.
 Reddi et al. [2018] Sashank J. Reddi, Satyen Kale, and Sanjiv Kumar. On the convergence of Adam and beyond. In International Conference on Learning Representations, 2018.
 Russakovsky et al. [2015] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li FeiFei. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV), 115(3):211–252, 2015.
 Sennrich et al. [2015] Rico Sennrich, Barry Haddow, and Alexandra Birch. Neural machine translation of rare words with subword units. arXiv preprint arXiv:1508.07909, 2015.
 Shazeer and Stern [2018] Noam Shazeer and Mitchell Stern. Adafactor: Adaptive learning rates with sublinear memory cost. arXiv e-prints arXiv:1804.04235, 2018.
 Singh et al. [2015] Bharat Singh, Soham De, Yangmuzi Zhang, Thomas Goldstein, and Gavin Taylor. Layer-specific adaptive learning rates for deep networks. In 2015 IEEE 14th International Conference on Machine Learning and Applications (ICMLA), 2015.

 Sutskever et al. [2013] Ilya Sutskever, James Martens, George Dahl, and Geoffrey Hinton. On the importance of initialization and momentum in deep learning. In International Conference on Machine Learning, 2013.
 Szegedy et al. [2015] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jonathon Shlens, and Zbigniew Wojna. Rethinking the inception architecture for computer vision. arXiv preprint arXiv:1512.00567, 2015.
 Tieleman and Hinton [2012] T. Tieleman and G. Hinton. Lecture 6.5—RmsProp: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 2012.
 Vaswani et al. [2017] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. arXiv preprint arXiv: 1706.03762, 2017.
 Wilson et al. [2017] Ashia C Wilson, Rebecca Roelofs, Mitchell Stern, Nati Srebro, and Benjamin Recht. The marginal value of adaptive gradient methods in machine learning. In Neural Information Processing Systems, pages 4148–4158, 2017.
 Ying et al. [2018] Chris Ying, Sameer Kumar, Dehao Chen, Tao Wang, and Youlong Cheng. Image classification at supercomputer scale. In Neural Information Processing Systems, 2018.
 You et al. [2017] Yang You, Igor Gitman, and Boris Ginsburg. Large batch training of convolutional networks. arXiv e-prints arXiv:1708.03888, 2017.
 You et al. [2018] Yang You, Zhao Zhang, Cho-Jui Hsieh, and James Demmel. 100-epoch ImageNet training with AlexNet in 24 minutes. arXiv e-prints arXiv:1709.05011, 2018.
 Yu et al. [2018] Adams Wei Yu, Qihang Lin, Ruslan Salakhutdinov, and Jaime Carbonell. Block-normalized gradient method: An empirical study for training deep neural network. arXiv e-prints arXiv:1707.04822, 2018.
 Zhang et al. [2017] Zijun Zhang, Lin Ma, Zongpeng Li, and Chuan Wu. Normalized direction-preserving Adam. arXiv e-prints arXiv:1709.04546, 2017.