Momentum Centering and Asynchronous Update for Adaptive Gradient Methods

10/11/2021
by Juntang Zhuang, et al.

We propose ACProp (Asynchronous-Centering-Prop), an adaptive optimizer that combines centering of the second momentum with an asynchronous update (e.g. for the t-th update, the denominator uses information up to step t-1, while the numerator uses the gradient at step t). ACProp has both strong theoretical properties and strong empirical performance. Using the example by Reddi et al. (2018), we show that asynchronous optimizers (e.g. AdaShift, ACProp) have a weaker convergence condition than synchronous optimizers (e.g. Adam, RMSProp, AdaBelief); among asynchronous optimizers, we show that centering of the second momentum further weakens the convergence condition. We demonstrate that ACProp achieves a convergence rate of O(1/√T) in the stochastic non-convex case, which matches the oracle rate and improves on the O(log T/√T) rate of RMSProp and Adam. We validate ACProp in extensive empirical studies: it outperforms both SGD and other adaptive optimizers in image classification with CNNs, and outperforms well-tuned adaptive optimizers in the training of various GAN models, in reinforcement learning, and with transformers. In summary, ACProp combines good theoretical properties, including a weak convergence condition and an optimal convergence rate, with strong empirical performance, including SGD-like generalization and Adam-like training stability.
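For concreteness, here is a minimal NumPy sketch of the update rule described above: the denominator uses the centered second momentum accumulated up to step t-1 (asynchronous), while the numerator uses the gradient at step t. The hyperparameter names, the epsilon placement, the absence of bias correction, and the skipped parameter update on the very first step are simplifying assumptions of this sketch, not the paper's exact algorithm.

```python
import numpy as np

def acprop_step(theta, grad, state, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One ACProp-style update (sketch). `state` holds the EMA statistics."""
    m, s, t = state["m"], state["s"], state["t"]
    if t > 0:
        # Asynchronous update: the denominator only uses the centered second
        # momentum from steps 1..t-1, while the numerator uses g_t.
        theta = theta - lr * grad / (np.sqrt(s) + eps)
    # Refresh the statistics with the current gradient.
    m = beta1 * m + (1 - beta1) * grad
    s = beta2 * s + (1 - beta2) * (grad - m) ** 2  # centering: EMA of (g_t - m_t)^2
    state.update(m=m, s=s, t=t + 1)
    return theta, state

# Toy usage: the gradient of f(x) = ||x||^2 / 2 is simply x.
x = np.array([1.0, -2.0])
state = {"m": np.zeros_like(x), "s": np.zeros_like(x), "t": 0}
for _ in range(1000):
    x, state = acprop_step(x, x, state, lr=1e-2)
```

Decoupling the denominator from the current gradient in this way is what the abstract refers to as the asynchronous update; centering replaces the raw second moment of g_t with an estimate of its variance around the first momentum m_t.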
