Adafactor: Adaptive Learning Rates with Sublinear Memory Cost

04/11/2018
by Noam Shazeer et al.

In several recently proposed stochastic optimization methods (e.g. RMSProp, Adam, Adadelta), parameter updates are scaled by the inverse square roots of exponential moving averages of squared past gradients. Maintaining these per-parameter second-moment estimators requires memory equal to the number of parameters. For the case of neural network weight matrices, we propose maintaining only the per-row and per-column sums of these moving averages, and estimating the per-parameter second moments based on these sums. We demonstrate empirically that this method produces similar results to the baseline. Secondly, we show that adaptive methods can produce larger-than-desired updates when the decay rate of the second moment accumulator is too slow. We propose update clipping and a gradually increasing decay rate scheme as remedies. Combining these methods and dropping momentum, we achieve comparable results to the published Adam regime in training the Transformer model on the WMT 2014 English-German machine translation task, while using very little auxiliary storage in the optimizer. Finally, we propose scaling the parameter updates based on the scale of the parameters themselves.
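
To make the abstract's ideas concrete, below is a minimal sketch of an Adafactor-style update for a single 2-D weight matrix. It illustrates the factored second-moment estimate (per-row and per-column moving averages instead of a full per-parameter accumulator), update clipping, the increasing decay rate, and parameter-scale-relative step sizes described above. The function name `adafactor_matrix_update` and the hyperparameter values are illustrative assumptions, not the paper's exact algorithm or published code.

```python
import numpy as np

def adafactor_matrix_update(W, grad, row_acc, col_acc, step,
                            lr=1e-2, eps=1e-30, clip_d=1.0):
    """One sketched Adafactor-style step for an n-by-m weight matrix W.

    row_acc (length n) and col_acc (length m) are the factored
    second-moment accumulators, updated in place.
    """
    # Gradually increasing decay rate: beta2_t = 1 - t^(-0.8).
    beta2 = 1.0 - step ** -0.8

    sq = grad ** 2 + eps
    # Keep only per-row and per-column moving averages of the squared
    # gradients: O(n + m) auxiliary memory instead of O(n * m).
    row_acc[:] = beta2 * row_acc + (1.0 - beta2) * sq.mean(axis=1)
    col_acc[:] = beta2 * col_acc + (1.0 - beta2) * sq.mean(axis=0)

    # Reconstruct a full second-moment estimate from the outer product
    # of the factors, normalized so its row/column sums are consistent.
    v_hat = np.outer(row_acc, col_acc) / row_acc.mean()

    update = grad / np.sqrt(v_hat)

    # Update clipping: shrink the step if its RMS exceeds the threshold.
    rms = np.sqrt(np.mean(update ** 2))
    update /= max(1.0, rms / clip_d)

    # Relative step size: scale the update by the RMS of the parameters
    # themselves (floored so freshly initialized weights still move).
    step_size = lr * max(np.sqrt(np.mean(W ** 2)), 1e-3)
    W -= step_size * update
    return W
```

The memory saving comes from `row_acc` and `col_acc`: for an n-by-m matrix they store n + m values rather than the n * m values a full Adam-style second-moment accumulator would require, while the outer-product reconstruction recovers a per-parameter scale for the update.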
