
## 1 Introduction

First-order optimization algorithms with adaptive learning rates play an important role in deep learning due to their efficiency in solving large-scale optimization problems. Denote $g_t$ as the gradient of the loss function $f$ with respect to its parameters $\theta$ at timestep $t$; then the general update rule of these algorithms can be written as follows (Reddi et al., 2018):

$$\theta_{t+1} = \theta_t - \frac{\alpha_t}{\sqrt{v_t}}\, m_t. \qquad (1)$$

In the above equation, $m_t$ is a function of the historical gradients; $v_t$ is a $d$-dimensional vector with non-negative elements, which adapts the learning rate for the $d$ elements of $m_t$ respectively; $\alpha_t$ is the base learning rate; and $\frac{\alpha_t}{\sqrt{v_t}}$ is the adaptive step size for $m_t$.

One common choice of $m_t$ is the exponential moving average of the gradients, used in Momentum (Qian, 1999) and Adam (Kingma & Ba, 2014), which helps alleviate gradient oscillations. The commonly used $v_t$ in the deep learning community is the exponential moving average of squared gradients, as in Adadelta (Zeiler, 2012), RMSProp (Tieleman & Hinton, 2012), Adam (Kingma & Ba, 2014) and Nadam (Dozat, 2016).

Adam (Kingma & Ba, 2014) is a typical adaptive learning rate method, which combines exponential moving averages of the first and second moments with bias correction. In general, Adam is robust and efficient in both dense and sparse gradient cases, and is popular in deep learning research. However, Adam has been shown to fail to converge to an optimal solution in certain cases. Reddi et al. (2018) point out that the key issue in the convergence proof of Adam lies in the quantity

$$\Gamma_t \triangleq \frac{\sqrt{v_t}}{\alpha_t} - \frac{\sqrt{v_{t-1}}}{\alpha_{t-1}}, \qquad (2)$$

which is assumed to be positive, but unfortunately, such an assumption does not always hold in Adam. They provide a set of counterexamples and demonstrate that the violation of the positiveness of $\Gamma_t$ leads to undesirable convergence behavior in Adam.

Reddi et al. (2018) then propose two variants, AMSGrad and AdamNC, to address the issue by keeping $\Gamma_t$ positive. Specifically, AMSGrad defines $\hat{v}_t$ as the historical maximum of $v_t$, i.e., $\hat{v}_t = \max_{i \le t} v_i$, and replaces $v_t$ with $\hat{v}_t$ to keep the denominator non-decreasing, which forces $\Gamma_t$ to be positive; AdamNC instead forces $v_t$ to have "long-term memory" of past gradients and calculates $v_t$ as their average to make it stable. Though these two algorithms solve the non-convergence problem of Adam to a certain extent, they turn out to be inefficient in practice: they have to maintain a very large $v_t$ once a large gradient appears, and a large $v_t$ decreases the adaptive learning rate $\frac{\alpha_t}{\sqrt{v_t}}$ and slows down the training process.

In this paper, we provide a new insight into adaptive learning rate methods, which brings a new perspective on solving the non-convergence issue of Adam. Specifically, in Section 3, we study the counterexamples provided by Reddi et al. (2018) by analyzing the accumulated step size of each gradient $g_t$. We observe that in common adaptive learning rate methods, a large gradient tends to receive a relatively small step size, while a small gradient is likely to receive a relatively large step size. We show that the unbalanced step sizes stem from the inappropriate positive correlation between $v_t$ and $g_t$, and we argue that this is the fundamental cause of the non-convergence issue of Adam.

In Section 4, we further prove that decorrelating $v_t$ and $g_t$ leads to an equal and unbiased expected step size for each gradient, thus solving the non-convergence issue of Adam. We subsequently propose AdaShift, a decorrelated variant of adaptive learning rate methods, which achieves the decorrelation between $v_t$ and $g_t$ by calculating $v_t$ using temporally shifted gradients. Finally, in Section 5, we study the performance of the proposed AdaShift, and demonstrate that it solves the non-convergence issue of Adam, while still maintaining competitive performance with Adam in terms of both training speed and generalization.

## 2 Preliminaries

In Adam, $m_t$ and $v_t$ are defined as the exponential moving averages of $g_t$ and $g_t^2$:

$$m_t = \beta_1 m_{t-1} + (1-\beta_1)\, g_t \quad \text{and} \quad v_t = \beta_2 v_{t-1} + (1-\beta_2)\, g_t^2, \qquad (3)$$

where $\beta_1$ and $\beta_2$ are the exponential decay rates for $m_t$ and $v_t$, respectively, with $m_0 = 0$ and $v_0 = 0$. They can also be written as:

$$m_t = (1-\beta_1)\sum_{i=1}^{t}\beta_1^{t-i}\, g_i \quad \text{and} \quad v_t = (1-\beta_2)\sum_{i=1}^{t}\beta_2^{t-i}\, g_i^2. \qquad (4)$$

To avoid the bias in the estimation of the expected value at the initial timesteps, Kingma & Ba (2014) propose applying bias correction to $m_t$ and $v_t$. Using $m_t$ as an instance, it works as follows:

$$m_t = \frac{(1-\beta_1)\sum_{i=1}^{t}\beta_1^{t-i} g_i}{(1-\beta_1)\sum_{i=1}^{t}\beta_1^{t-i}} = \frac{\sum_{i=1}^{t}\beta_1^{t-i} g_i}{\sum_{i=1}^{t}\beta_1^{t-i}} = \frac{(1-\beta_1)\sum_{i=1}^{t}\beta_1^{t-i} g_i}{1-\beta_1^{t}}. \qquad (5)$$
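Taken together, Equations (3)-(5) plus the update rule give the familiar Adam step. Below is a minimal single-parameter sketch (the small `eps` added for numerical stability is standard practice and not part of the equations above):

```python
import numpy as np

def adam_step(theta, g, m, v, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update: moving averages of Eq. (3), bias correction of
    Eq. (5), then the parameter update of Eq. (1)."""
    m = beta1 * m + (1 - beta1) * g            # first moment, Eq. (3)
    v = beta2 * v + (1 - beta2) * g ** 2       # second moment, Eq. (3)
    m_hat = m / (1 - beta1 ** t)               # bias-corrected m_t, Eq. (5)
    v_hat = v / (1 - beta2 ** t)               # bias-corrected v_t
    theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)  # Eq. (1)
    return theta, m, v
```

Note that at $t = 1$ the bias correction makes $\hat{m}_1 = g_1$ and $\hat{v}_1 = g_1^2$ exactly, so the first step has magnitude close to $\alpha$ regardless of the gradient scale.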

#### Online optimization problem.

An online optimization problem consists of a sequence of cost functions $f_1(\theta), \dots, f_T(\theta)$, where the optimizer predicts the parameter $\theta_t$ at each timestep $t$ and evaluates it on a previously unknown cost function $f_t$. The performance of the optimizer is usually evaluated by the regret $R(T) = \sum_{t=1}^{T}\left[f_t(\theta_t) - f_t(\theta^*)\right]$, which is the sum of the differences between the online prediction $f_t(\theta_t)$ and the best fixed-point prediction $f_t(\theta^*)$ over all previous steps, where $\theta^* = \arg\min_{\theta\in\mathcal{F}} \sum_{t=1}^{T} f_t(\theta)$ is the best fixed-point parameter from a feasible set $\mathcal{F}$.

#### Counterexamples.

Reddi et al. (2018) highlight that for any fixed $\beta_1$ and $\beta_2$, there exists an online optimization problem where Adam has non-zero average regret, i.e., Adam does not converge to the optimal solution. The counterexample in the sequential version is given as follows:

$$f_t(\theta) = \begin{cases} C\theta, & \text{if } t \bmod d = 1; \\ -\theta, & \text{otherwise}, \end{cases} \qquad (6)$$

where $C$ is a relatively large constant and $d$ is the length of an epoch. In Equation 6, most gradients of $f_t(\theta)$ with respect to $\theta$ are $-1$, but the large positive gradient $C$ at the beginning of each epoch makes the overall gradient of each epoch positive, which means that one should decrease $\theta$ to minimize the loss. However, according to Reddi et al. (2018), the accumulated update of $\theta$ in Adam under some circumstances is the opposite (i.e., $\theta$ is increased), thus Adam cannot converge in such a case. Reddi et al. (2018) argue that the reason for the non-convergence of Adam is that the positiveness assumption on $\Gamma_t$ does not always hold.
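This behavior is easy to reproduce numerically. The sketch below runs Adam on Equation 6 with illustrative parameters ($C = 3$, $d = 3$, $\beta_1 = 0$, and a deliberately small $\beta_2 = 0.1$ to make the effect pronounced); although the correct direction is to decrease $\theta$, Adam steadily increases it, while a large $\beta_2$ restores the correct direction for this fixed problem:

```python
import numpy as np

def run_adam_on_counterexample(steps=3000, C=3.0, d=3,
                               alpha=0.01, beta1=0.0, beta2=0.1):
    """Simulate Adam on f_t(theta) = C*theta if t mod d == 1 else -theta."""
    theta, m, v = 0.0, 0.0, 0.0
    for t in range(1, steps + 1):
        g = C if t % d == 1 else -1.0          # gradient of f_t w.r.t. theta
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        m_hat = m / (1 - beta1 ** t)
        v_hat = v / (1 - beta2 ** t)
        theta -= alpha * m_hat / np.sqrt(v_hat)
    return theta

# Although the average gradient per epoch is (C - (d - 1)) / d > 0,
# Adam drifts in the wrong (positive) direction for a small beta2,
# while a large beta2 (e.g. 0.999) keeps it on the correct direction.
```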

#### Basic Solutions

Reddi et al. (2018) propose maintaining the strict positiveness of $\Gamma_t$ as a solution, for example, by keeping $v_t$ non-decreasing or using an increasing $\beta_2$. In fact, keeping $\Gamma_t$ positive is not the only way to guarantee the convergence of Adam. Another important observation is that for any fixed sequential online optimization problem with infinitely repeating epochs (e.g., Equation 6), Adam will converge as long as $\beta_1$ is large enough. Formally, we have the following theorem:

###### Theorem 1 (The influence of β1).

For any fixed sequential online convex optimization problem with infinitely repeating epochs of finite length $d$, suppose the gradients are bounded and the average gradient $g^*$ of an epoch is nonzero in every dimension. Then, for any fixed $\beta_2$ and any $\epsilon > 0$, there exists a $\beta_1$ close enough to $1$ such that Adam has average regret $R(T)/T < \epsilon$.

The intuition behind Theorem 1 is that if $\beta_1 \to 1$, then $m_t$ approaches the average gradient of an epoch, according to Equation 5. Therefore, no matter what the adaptive learning rate $\frac{\alpha_t}{\sqrt{v_t}}$ is, Adam will always converge along the correct direction.

## 3 The cause of non-convergence: unbalanced step size

In this section, we study the non-convergence issue by analyzing the counterexamples provided by Reddi et al. (2018). We show that the fundamental problem of common adaptive learning rate methods is that $v_t$ is positively correlated with the scale of the gradient $g_t$, which results in a small step size for a large gradient, and a large step size for a small gradient. We argue that such an unbalanced step size is the cause of non-convergence.

We will first define the *net update factor* for analyzing the accumulated influence of each gradient $g_t$, then apply the net update factor to study the behavior of Adam, using Equation 6 as an example. The argument will then be extended to the stochastic online optimization problem and to general cases.

### 3.1 Net update factor

When $\beta_1 \neq 0$, due to the exponential moving average effect of $m_t$, the influence of gradient $g_t$ extends to all subsequent timesteps. For timestep $i$ ($i \ge t$), the weight of $g_t$ in $m_i$ is $(1-\beta_1)\beta_1^{i-t}$. We accordingly define a new tool for our analysis: the net update $\mathrm{net}(g_t)$ of each gradient $g_t$, which is its accumulated influence on the entire optimization process:

$$\mathrm{net}(g_t) \triangleq \sum_{i=t}^{\infty} \frac{\alpha_i}{\sqrt{v_i}}\left[(1-\beta_1)\beta_1^{i-t} g_t\right] = k(g_t)\cdot g_t, \quad \text{where} \quad k(g_t) = \sum_{i=t}^{\infty} \frac{\alpha_i}{\sqrt{v_i}}(1-\beta_1)\beta_1^{i-t}, \qquad (7)$$

and we call $k(g_t)$ the net update factor of $g_t$, which is the equivalent accumulated step size for gradient $g_t$. Note that $k(g_t)$ depends on $\{v_i\}_{i=t}^{\infty}$, and in Adam, if $\beta_2 \neq 0$, then all elements of $\{v_i\}_{i=t}^{\infty}$ are related to $g_t$. Therefore, $k(g_t)$ is a function of $g_t$.

It is worth noting that in the Momentum method, $v_t$ is equivalently set to $1$. Therefore, with a fixed learning rate $\alpha$, we have $k(g_t) = \alpha$ and $\mathrm{net}(g_t) = \alpha\, g_t$, which means that the accumulated influence of each gradient in Momentum is the same as in vanilla SGD (Stochastic Gradient Descent). Hence, the convergence of Momentum is similar to that of vanilla SGD. However, in adaptive learning rate methods, $v_t$ is a function of the past gradients, which makes the convergence analysis nontrivial.

### 3.2 Analysis on sequential online optimization counterexamples

Note that $v_t$ appears in the definition of the net update factor (Equation 7). Before further analyzing the convergence of Adam using the net update factor, we first study the pattern of $v_t$ in the sequential online optimization problem in Equation 6. Since Equation 6 is deterministic, we can derive the limit formula of $v_t$ as follows:

###### Theorem 2 (Limit of vt).

In the sequential online optimization problem in Equation 6, denote $\beta_1$ and $\beta_2$ as the decay rates, $d$ as the length of an epoch, $n$ as the index of the epoch, and $i$ as the index of the timestep within one epoch. Then the limit of $v_{nd+i}$ as $n \to \infty$ is:

$$\lim_{n\to\infty} v_{nd+i} = \frac{1-\beta_2}{1-\beta_2^{d}}\,(C^2-1)\,\beta_2^{\,i-1} + 1. \qquad (8)$$
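Equation 8 can be sanity-checked numerically by simulating $v_t$ on the gradient sequence of Equation 6 and comparing the late-epoch values with the closed form (parameter values here are illustrative):

```python
def v_limit_empirical(C=3.0, d=3, beta2=0.1, epochs=200):
    """Simulate v_t on the repeating gradients (C, -1, ..., -1);
    return the d values of the final epoch."""
    v, last = 0.0, []
    for t in range(1, epochs * d + 1):
        g = C if t % d == 1 else -1.0
        v = beta2 * v + (1 - beta2) * g * g
        if t > (epochs - 1) * d:               # record the final epoch
            last.append(v)
    return last

def v_limit_formula(i, C=3.0, d=3, beta2=0.1):
    """Closed form of Eq. (8): lim_n v_{nd+i}."""
    return (1 - beta2) / (1 - beta2 ** d) * (C * C - 1) * beta2 ** (i - 1) + 1
```

After a few hundred epochs the simulated values agree with the closed form to machine precision; note that $v_{nd+1}$ (right after the large gradient $C$) is the largest value in the epoch, and it decays geometrically over the epoch.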

Given the formula of $\lim_{n\to\infty} v_{nd+i}$ in Equation 8, we now study the net update factor of each gradient. We start with the simple case where $\beta_1 = 0$. In this case we have

$$\lim_{n\to\infty} k(g_{nd+i}) = \lim_{n\to\infty} \frac{\alpha_{nd+i}}{\sqrt{v_{nd+i}}}. \qquad (9)$$

Since the limit of $v_{nd+i}$ in each epoch monotonically decreases with increasing index $i$ according to Equation 8, the limit of $k(g_{nd+i})$ monotonically increases within each epoch. Specifically, the first gradient $C$ of each epoch represents the correct updating direction, but its influence is the smallest in the epoch. In contrast, the net update factors of the subsequent gradients $-1$ are relatively larger, though they indicate a wrong updating direction.

The above problem stems from the inappropriate correlation between $v_t$ and $g_t$. Recall that $v_t = \beta_2 v_{t-1} + (1-\beta_2) g_t^2$, where $v_{t-1}$ is independent of $g_t$. When a new gradient $g_t$ arrives, if $g_t$ is large, $v_t$ is likely to be larger; and if $g_t$ is small, $v_t$ is also likely to be smaller. As a result, a large gradient is likely to have a small net update factor, while a small gradient is likely to have a large net update factor in Adam.

We further consider the general case where $\beta_1 \neq 0$. The result is presented in the following theorem:

###### Theorem 3 (Unbalanced net update factor).

In the sequential online optimization problem in Equation 6, when $\beta_1 \neq 0$, the limit of the net update factor of epoch $n$ is:

$$\lim_{n\to\infty} k(g_{nd+i}) = \sum_{t=nd+i}^{\infty} \frac{(1-\beta_1)\,\beta_1^{\,t-nd-i}}{\sqrt{\dfrac{1-\beta_2}{1-\beta_2^{d}}(C^2-1)\,\beta_2^{\,(t-1)\bmod d}+1}}. \qquad (10)$$

And there exists a $j$ ($1 < j \le d$) such that

$$\lim_{n\to\infty} k(C) = \lim_{n\to\infty} k(g_{nd+1}) \qquad (11)$$

and

$$\lim_{n\to\infty} k(g_{nd+j}) > \lim_{n\to\infty} k(g_{nd+j+1}) > \cdots > \lim_{n\to\infty} k(g_{nd+d+1}) = \lim_{n\to\infty} k(C), \qquad (12)$$

where $k(C)$ denotes the net update factor for the gradient $C$.

Theorem 3 tells us that, in the sequential online optimization problem in Equation 6, the net update factors are unbalanced. Specifically, the net update factor of the large gradient $C$ is the smallest in the entire epoch, while the small gradients $-1$ have larger net update factors. Such unbalanced net update factors can lead Adam in a wrong accumulated update direction.

### 3.3 Analysis on stochastic online optimization counterexamples

The counterexamples are also extended to stochastic cases in Reddi et al. (2018), where a finite set of cost functions appears in a stochastic order. Compared with the sequential online optimization counterexample, the stochastic version is more general and closer to the practical situation. For the simplest one-dimensional case, at each timestep $t$, the cost function $f_t$ is chosen i.i.d. as:

$$f_t(\theta) = \begin{cases} C\theta, & \text{with probability } p = \dfrac{1+\delta}{C+1}; \\[4pt] -\theta, & \text{with probability } 1-p = \dfrac{C-\delta}{C+1}, \end{cases} \qquad (13)$$

where $\delta$ is a small positive constant that is smaller than $C$. The expected cost function of the above problem is $\mathbb{E}[f_t(\theta)] = \delta\theta$; therefore, one should decrease $\theta$ to minimize the loss. Reddi et al. (2018) prove that when $C$ is large enough, the expectation of the accumulated parameter update in Adam is positive and results in increasing $\theta$.

To conduct a more rigorous study of the stochastic online optimization problem in Equation 13, we derive the expectation of the net update factor for each gradient in the following theorem:

###### Theorem 4 (Unbalanced net update factor in stochastic online optimization problem).

In the stochastic online optimization problem in Equation 13, the expectations of the net update factors are as follows:

$$k(C) = \sum_{t=0}^{\infty} (1-\beta_1)\beta_1^{t} \left[ \frac{1}{\sqrt{(1-\beta_2)\beta_2^{t} C^2 + \left(1+\beta_2^{t+1}-\beta_2^{t}\right)\mathbb{E}[g_i^2]}} + \frac{D_t}{8\left[(1-\beta_2)\beta_2^{t} C^2 + \left(1+\beta_2^{t+1}-\beta_2^{t}\right)\mathbb{E}[g_i^2]\right]^{5/2}} \right], \qquad (14)$$

and

$$k(-1) = \sum_{t=0}^{\infty} (1-\beta_1)\beta_1^{t} \left[ \frac{1}{\sqrt{(1-\beta_2)\beta_2^{t} + \left(1+\beta_2^{t+1}-\beta_2^{t}\right)\mathbb{E}[g_i^2]}} + \frac{D_t}{8\left[(1-\beta_2)\beta_2^{t} + \left(1+\beta_2^{t+1}-\beta_2^{t}\right)\mathbb{E}[g_i^2]\right]^{5/2}} \right], \qquad (15)$$

where $k(C)$ denotes the net update factor for gradient $C$, $k(-1)$ denotes the net update factor for gradient $-1$, and $D_t$ is a positive value.

Though the formulas of the net update factors in the stochastic case are more complicated than those in the deterministic case, the analysis is actually easier: gradients with the same scale share the same expected net update factor, so we only need to compare $k(C)$ and $k(-1)$. We can see that each term in the infinite series of $k(C)$ is smaller than the corresponding one in $k(-1)$; therefore, the accumulated influence of the gradient $C$ is smaller than that of the gradient $-1$.

The above observation can also be interpreted as a direct consequence of the inappropriate correlation between $v_t$ and $g_t$: given $g_t$, not only does $v_t$ positively correlate with $g_t$, but the entire infinite sequence $\{v_i\}_{i=t}^{\infty}$ positively correlates with $g_t$. Since the net update factor $k(g_t)$ negatively correlates with each $v_i$ in $\{v_i\}_{i=t}^{\infty}$, it also negatively correlates with $g_t$. That is, $k(g_t)$ for a large gradient is likely to be smaller, while $k(g_t)$ for a small gradient is likely to be larger.

The unbalanced net update factors cause the non-convergence problem of Adam, as well as of all other adaptive learning rate methods where $v_t$ correlates with $g_t$. All these counterexamples follow the same pattern: the large gradient is along the "correct" direction, while the small gradients are along the opposite direction. Because the accumulated influence of a large gradient is small while the accumulated influence of a small gradient is large, Adam may update the parameters in the wrong direction. Even when Adam updates the parameters in the right direction overall, the unbalanced net update factors are still unfavorable, since they slow down convergence.

## 4 The proposed method: decorrelation via temporal shifting

According to the previous discussion, we conclude that the main cause of the non-convergence of Adam is the inappropriate correlation between $v_t$ and $g_t$. Currently we have two possible solutions: (1) making $v_t$ act like a constant, which weakens the correlation, e.g., using a large $\beta_2$ or keeping $v_t$ non-decreasing (Reddi et al., 2018); (2) using a large $\beta_1$ (Theorem 1), where the aggressive momentum term helps to mitigate the impact of unbalanced net update factors. However, neither of them solves the problem fundamentally.

The dilemma caused by $v_t$ forces us to rethink its role. In adaptive learning rate methods, $v_t$ plays the role of estimating the second moment of the gradients, which reflects the scale of the gradient on average. With the adaptive learning rate $\frac{\alpha_t}{\sqrt{v_t}}$, the update step of $g_t$ is scaled down by $\sqrt{v_t}$, achieving rescaling invariance with respect to the scale of $g_t$, which is practically useful for making the training process easy to control and the training system robust. However, the current scheme of $v_t$, i.e., $v_t = \beta_2 v_{t-1} + (1-\beta_2) g_t^2$, introduces a positive correlation between $v_t$ and $g_t$, which results in reducing the effect of large gradients and increasing the effect of small gradients, and finally causes the non-convergence problem. Therefore, the key is to let $v_t$ be a quantity that reflects the scale of the gradients, while at the same time being decorrelated from the current gradient $g_t$. Formally, we have the following theorem:

###### Theorem 5 (Decorrelation leads to convergence).

For any fixed online optimization problem with infinite repetitions of a finite set of cost functions $\{f_1(\theta), \dots, f_n(\theta)\}$, assuming $\beta_1$ and $\alpha_t$ are fixed, if $v_t$ follows a fixed distribution and is independent of the current gradient $g_t$, then the expected net update factor for each gradient is identical.

Let $P_v$ denote the distribution of $v_t$. In the infinitely repeating online optimization scheme, the expectation of the net update factor for each gradient is

$$\mathbb{E}[k(g_t)] = \sum_{i=t}^{\infty} \mathbb{E}_{v_i \sim P_v}\!\left[\frac{\alpha_i}{\sqrt{v_i}}\,(1-\beta_1)\beta_1^{i-t}\right]. \qquad (16)$$

Given that each $v_i$ is independent of $g_t$, the expectation of the net update factor is independent of $g_t$ and remains the same for different gradients. With the expected net update factor being a fixed constant, the convergence of the adaptive learning rate method reduces to that of vanilla SGD.

Momentum (Qian, 1999) can be viewed as setting $v_t$ to a constant, which makes $v_t$ and $g_t$ independent. Furthermore, in our view, using an increasing $\beta_2$ (AdamNC) or keeping $v_t$ at its historical maximum (AMSGrad) also serves to make $v_t$ almost fixed. However, fixing $v_t$ is not a desirable solution, because it damages the adaptability of Adam in adapting the step size.

We will next introduce the proposed solution for making $v_t$ independent of $g_t$, which is based on a temporal independence assumption among the gradients. The proposed solution decorrelates $v_t$ and $g_t$ with a simple temporal shifting.

### 4.1 Temporal decorrelation

In a practical setting, $f_t(\theta)$ usually involves different mini-batches $x_t$, i.e., $f_t(\theta) = f(x_t; \theta)$. Given the randomness of mini-batch sampling, we assume that the mini-batches are independent of each other, and further assume that $f$ keeps unchanged over time; then the gradient of each mini-batch is independent of the others. We then present the temporal decorrelation algorithm as follows.

The key change is that the update rule for $v_t$ now involves the temporally shifted gradient $g_{t-n}$ (for some shift $n \ge 1$) instead of $g_t$ (line 4), which makes $v_t$ and $g_t$ temporally shifted and hence decorrelated. Note that in the sequential online optimization problem, the assumption that the gradients are independent of each other does not hold. However, in the stochastic online optimization problem and in practical neural network settings, our assumption generally holds.
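The idea can be sketched as follows, using a shift of $n = 1$ and no first moment (a simplified sketch, not the full algorithm). Notably, even on the sequential counterexample of Equation 6, where the independence assumption does not strictly hold, the temporal shift already restores the correct update direction under the same parameters that break Adam:

```python
import numpy as np

def run_shifted(steps=3000, C=3.0, d=3, alpha=0.01, beta2=0.1):
    """v_t is computed from the previous gradient g_{t-1} (temporal shift n=1),
    so the step size of g_t is decorrelated from its own magnitude."""
    theta, v, g_prev = 0.0, 1.0, None          # v starts at 1 to avoid div-by-zero
    for t in range(1, steps + 1):
        g = C if t % d == 1 else -1.0
        if g_prev is not None:
            v = beta2 * v + (1 - beta2) * g_prev ** 2   # shifted second moment
            theta -= alpha * g / np.sqrt(v)
        g_prev = g
    return theta
```

With the shift, the large gradient $C$ is divided by a $v$ built from the small gradients (so its step is large), and vice versa, which is exactly the rebalancing of net update factors discussed above.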

### 4.2 Temporal-spatial decorrelation

In most optimization schemes there are many parameters, i.e., the dimension of $\theta$ is high; thus $g_t$ and $v_t$ are also high-dimensional. However, in Algorithm 1, $v_t$ is computed element-wise; that is, we only use the $i$-th dimension of $g_{t-n}$ to calculate the $i$-th dimension of $v_t$. In other words, it only makes use of the independence between $g_{t-n}[i]$ and $g_t[i]$, where $g_t[i]$ denotes the $i$-th element of $g_t$.

Actually, in the case of high-dimensional $g_t$ and $v_t$, we can further assume that all elements of the gradient at a previous timestep $g_{t-n}$ are independent of the $i$-th dimension of the current gradient $g_t[i]$. Thus all elements in $g_{t-n}$ can be used to compute $v_t[i]$ without introducing correlation. We propose introducing a function $\phi$ over all elements of $g_{t-n}$ to achieve this goal, i.e., $v_t = \beta_2 v_{t-1} + (1-\beta_2)\,\phi(g_{t-n}^2)$. For ease of reference, we call the elements of $g_{t-n}$ other than $g_{t-n}[i]$ the spatial elements of $g_{t-n}$, and call $\phi$ the spatial function or spatial operation.

There is no restriction on the choice of $\phi$, and we use $\phi = \max$ for most of our experiments, which turns out to be a good choice. The max operation has the side effect of turning the adaptive learning rate into a scalar shared across dimensions. Importantly, we no longer interpret $v_t$ as the second moment of $g_t$. It is merely a random variable that is independent of $g_t$, while at the same time reflecting the overall gradient scale. We leave further investigations of $\phi$ as future work.

### 4.3 Block-wise temporal-spatial decorrelation

In practical settings, e.g., deep neural networks, $\theta$ usually consists of many parameter blocks, e.g., the weight and bias of each layer. In deep neural networks, the gradient scales (i.e., the variances) of different layers tend to be different (Glorot & Bengio, 2010; He et al., 2015). Different gradient scales make it hard to find a single learning rate that is suitable for all layers when using SGD or Momentum. Traditional adaptive learning rate methods apply element-wise rescaling for each gradient dimension, which achieves rescaling invariance and somewhat mitigates this problem. However, Adam sometimes generalizes worse than SGD (Wilson et al., 2017; Keskar & Socher, 2017), which might relate to the excessive learning rate adaptation in Adam.

In our temporal-spatial decorrelation scheme, we can solve the "different gradient scales" issue more naturally, by applying $\phi$ block-wise, outputting a shared adaptive learning rate scalar for each block. This makes the algorithm work like SGD with an adaptive learning rate, where each block has its own adaptive learning rate while the relative gradient scales among in-block elements are kept unchanged. The corresponding algorithm is illustrated in Algorithm 2, where the parameters, including the related $g_t$ and $v_t$, are divided into blocks. Each block contains the parameters of the same type or of the same layer of the neural network.

### 4.4 Incorporating first moment: moving averaging windows

First moment estimation, i.e., defining $m_t$ as a moving average of $g_t$, is an important technique of modern first-order optimization algorithms, as it alleviates mini-batch oscillations. In this section, we extend our algorithm to incorporate first moment estimation.

We have argued that $v_t$ needs to be decorrelated from $g_t$. Similarly, when introducing first moment estimation, we need to make $v_t$ and $m_t$ independent to keep the expected net update factor unbiased. Based on our assumptions of temporal and spatial independence, we further keep out the latest $n$ gradients $g_{t-n+1}, \dots, g_t$, and update $v_t$ and $m_t$ via

$$v_t = \beta_2 v_{t-1} + (1-\beta_2)\,\phi(g_{t-n}^2) \quad \text{and} \quad m_t = \frac{\sum_{i=0}^{n-1}\beta_1^{i}\, g_{t-i}}{\sum_{i=0}^{n-1}\beta_1^{i}}. \qquad (17)$$

In Equation 17, $\beta_1$ plays the role of the decay rate for the temporal elements. It can be viewed as a truncated version of the exponential moving average that is applied only to the latest few gradients. Since we use truncation, it is feasible to use a large $\beta_1$ without the risk of relying on overly old gradients. In the extreme case $\beta_1 = 1$, it becomes vanilla averaging.
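The pieces above can be combined into a compact sketch: a keep-out window of $n$ gradients, the truncated moving average of Equation 17 for $m_t$, and a reduce-max spatial operation for $v_t$. Class and parameter names here are illustrative, not the authors' reference implementation:

```python
import numpy as np
from collections import deque

class AdaShiftSketch:
    """Illustrative sketch of the update of Eq. (17): a keep-out window of n
    gradients, and a spatial operation phi applied to the shifted-out gradient."""

    def __init__(self, n=10, alpha=0.01, beta1=0.9, beta2=0.999,
                 phi=np.max, eps=1e-8):
        self.n, self.alpha, self.beta2 = n, alpha, beta2
        self.phi, self.eps = phi, eps
        self.window = deque(maxlen=n + 1)       # stores g_{t-n}, ..., g_t
        self.v = 0.0
        self.w = beta1 ** np.arange(n)          # weights beta1^0, ..., beta1^{n-1}

    def step(self, theta, g):
        self.window.append(np.asarray(g, dtype=float))
        if len(self.window) <= self.n:          # not enough history yet
            return theta
        g_shifted = self.window[0]              # g_{t-n}, decorrelated from g_t
        self.v = self.beta2 * self.v + (1 - self.beta2) * self.phi(g_shifted ** 2)
        recent = list(self.window)[:0:-1]       # g_t, g_{t-1}, ..., g_{t-n+1}
        m = sum(wi * gi for wi, gi in zip(self.w, recent)) / self.w.sum()
        return theta - self.alpha * m / (np.sqrt(self.v) + self.eps)
```

With $\phi = \max$ the second-moment accumulator becomes a scalar shared across dimensions, matching the reduce-max (max-AdaShift) variant; passing an identity $\phi$ recovers the purely temporal (non-AdaShift) variant element-wise.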

The pseudo code of the algorithm that unifies all the proposed techniques is presented in the Appendix, with the following parameters: spatial operation $\phi$, keep number $n$, base learning rate $\alpha$, and decay rates $\beta_1$ and $\beta_2$.

## 5 Experiments

In this section, we empirically study the proposed method and compare it with Adam, AMSGrad and SGD on various tasks, in terms of training performance and generalization. Unless otherwise stated, the reported result for each algorithm is the best we found via hyperparameter grid search. The anonymized code is provided at http://bit.ly/2NDXX6x.

### 5.2 Logistic Regression and Multilayer Perceptron on MNIST

We further compare the proposed method with Adam, AMSGrad and SGD using Logistic Regression and a Multilayer Perceptron on MNIST, where the Multilayer Perceptron has two hidden layers of equal width with no internal activation. The results are shown in Figure 3. We find that in Logistic Regression, these learning algorithms achieve very similar final results in terms of both training speed and generalization. For the Multilayer Perceptron, we compare Adam, AMSGrad and AdaShift with the reduce-max spatial operation (max-AdaShift) and without a spatial operation (non-AdaShift). We observe that max-AdaShift achieves the lowest training loss, while non-AdaShift shows mild training loss oscillation and at the same time achieves better generalization. The worse generalization of max-AdaShift may be due to overfitting in this task, and the better generalization of non-AdaShift may stem from the regularization effect of its relatively unstable step size.

### 5.3 DenseNet and ResNet on CIFAR-10

ResNet (He et al., 2016) and DenseNet (Huang et al., 2017) are two typical modern neural networks, which are efficient and widely used. We test our algorithm with a ResNet and a DenseNet on the CIFAR-10 dataset. We plot the best results of Adam, AMSGrad and AdaShift in Figure 4 and Figure 5 for ResNet and DenseNet, respectively. We can see that AMSGrad is relatively worse in terms of both training speed and generalization. Adam and AdaShift achieve competitive results, with AdaShift generally slightly better, especially in the test accuracy of ResNet and the training loss of DenseNet.

### 5.4 DenseNet with Tiny-ImageNet

We further increase the complexity of the dataset, switching from CIFAR-10 to Tiny-ImageNet, and compare the performance of Adam, AMSGrad and AdaShift with DenseNet. The results are shown in Figure 6, from which we can see that the training curves of Adam and AdaShift basically overlap, but AdaShift achieves higher test accuracy than Adam. AMSGrad has relatively higher training loss, and its test accuracy is relatively lower in the initial stage.

### 5.5 Generative model and Recurrent model

We also test our algorithm on the training of a generative model and a recurrent model. We choose WGAN-GP (Gulrajani et al., 2017), which involves a Lipschitz continuity condition (hard to optimize), and Neural Machine Translation (NMT) (Luong et al., 2017), which involves the typical recurrent unit LSTM. In Figure 6(a), we compare the performance of Adam, AMSGrad and AdaShift in training the WGAN-GP discriminator, given a fixed generator. We notice that AdaShift is significantly better than Adam, while the performance of AMSGrad is relatively unsatisfactory. The test performance of NMT in terms of BLEU is shown in Figure 6(b), where AdaShift achieves a higher BLEU score than Adam and AMSGrad.

## 6 Conclusion

In addition, based on our new perspective on adaptive learning rate methods, $v_t$ is no longer necessarily the second moment of $g_t$, but rather a random variable that is independent of $g_t$ and reflects the overall gradient scale. Thus, it is valid to calculate $v_t$ with the spatial elements of previous gradients. We further found that when the spatial operation $\phi$ outputs a shared scalar for each block, the resulting algorithm turns out to be closely related to SGD, where each block has an overall adaptive learning rate and the relative gradient scale within each block is maintained. The experimental results demonstrate that AdaShift is able to solve the non-convergence issue of Adam. In the meantime, AdaShift achieves competitive or even better training and testing performance when compared with Adam.

## References

• Dozat (2016) Timothy Dozat. Incorporating Nesterov momentum into Adam. In ICLR Workshop, 2016.
• Glorot & Bengio (2010) Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pp. 249–256, 2010.
• Gulrajani et al. (2017) Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron C Courville. Improved training of wasserstein gans. In Advances in Neural Information Processing Systems, pp. 5767–5777, 2017.
• He et al. (2015) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1026–1034, 2015.
• He et al. (2016) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778, 2016.
• Huang et al. (2017) Gao Huang, Zhuang Liu, Laurens van der Maaten, and Kilian Q Weinberger. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.
• Keskar & Socher (2017) Nitish Shirish Keskar and Richard Socher. Improving generalization performance by switching from adam to sgd. arXiv preprint arXiv:1712.07628, 2017.
• Kingma & Ba (2014) Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
• Luong et al. (2017) Minh-Thang Luong, Eugene Brevdo, and Rui Zhao. Neural machine translation (seq2seq) tutorial, 2017.
• Qian (1999) Ning Qian. On the momentum term in gradient descent learning algorithms. Neural networks, 12(1):145–151, 1999.
• Reddi et al. (2018) Sashank J. Reddi, Satyen Kale, and Sanjiv Kumar. On the convergence of adam and beyond. In International Conference on Learning Representations, 2018.
• Tieleman & Hinton (2012) Tijmen Tieleman and Geoffrey Hinton. Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 4(2):26–31, 2012.
• Wilson et al. (2017) Ashia C Wilson, Rebecca Roelofs, Mitchell Stern, Nati Srebro, and Benjamin Recht. The marginal value of adaptive gradient methods in machine learning. In Advances in Neural Information Processing Systems, pp. 4148–4158, 2017.
• Zeiler (2012) Matthew D. Zeiler. ADADELTA: An adaptive learning rate method. arXiv preprint arXiv:1212.5701, 2012.

## Appendix A The relation among β1, β2 and C

To provide an intuitive impression of the relation among $\beta_1$, $\beta_2$, $C$ and the convergence of Adam, we fix the sequential problem of Equation 6, vary $\beta_1$ and $\beta_2$, and let Adam run for 2000 timesteps (iterations). The final value of $\theta$ is shown in Figure 7(a). It suggests that for a fixed sequential online optimization problem, both $\beta_1$ and $\beta_2$ determine the direction and speed of the Adam optimization process. Furthermore, we also study the threshold value of $C$ at which Adam switches to the incorrect direction, for each fixed pair of $\beta_1$ and $\beta_2$. To simplify the experiments, we keep the overall gradient of each epoch fixed. The result is shown in Figure 7(b), which suggests that, with a larger $\beta_1$ or a larger $\beta_2$, a larger $C$ is needed to make Adam stride in the opposite direction. In other words, large $\beta_1$ and $\beta_2$ make non-convergence rare.

We also conduct the experiment in the stochastic problem to analyze the relation among $\beta_1$, $\beta_2$, $C$ and the convergence behavior of Adam. The results are shown in Figure 7(c) and Figure 7(d), and the observations are similar to the previous ones: a larger $C$ causes non-convergence more easily, and a larger $\beta_1$ or $\beta_2$ helps to resolve the non-convergence issue. In this experiment, the remaining hyperparameters are fixed.

###### Theorem 6 (Critical condition).

In the sequential online optimization problem of Equation 6, let $\alpha_t$ be fixed, and define $S(\beta_1, \beta_2, C)$ to be the sum of the limits of the step updates in a $d$-step epoch:

$$S(\beta_1, \beta_2, C) \triangleq \sum_{i=1}^{d}\,\lim_{n\to\infty} \frac{m_{nd+i}}{\sqrt{v_{nd+i}}}\,. \qquad (18)$$

Setting $S(\beta_1, \beta_2, C) = 0$ and assuming $C$ and $d$ are large enough that the higher-order terms vanish, we get the equation:

$$C + 1 = \frac{\left(1-\beta_1^{d}\right)\left(\sqrt{\beta_2^{d}}-\beta_1^{d}\right)\left(1-\sqrt{\beta_2}\right)}{\left(1-\beta_1\right)\left(\sqrt{\beta_2}-\beta_1\right)\left(1-\sqrt{\beta_2^{d}}\right)}\,. \qquad (19)$$

Equation 19, though quite complex, tells us that both $\beta_1$ and $\beta_2$ are closely related to the counterexamples, and that there exists a critical condition among these parameters.

## Appendix C Correlation between gt and vt

In order to verify the correlation between $g_t$ and $v_t$ in Adam and AdaShift, we conduct experiments to calculate the corresponding correlation coefficients. We train the Multilayer Perceptron on MNIST until convergence and gather the gradients of the second hidden layer at each step. Based on these data, we calculate the correlation coefficients between temporally shifted gradients, between spatial elements of shifted gradients, and between $g_t$ and $v_t$, over the last epochs, using the Pearson correlation coefficient, which is formulated as follows:

$$\rho = \frac{\sum_{i=1}^{n}\left(X_i-\bar{X}\right)\left(Y_i-\bar{Y}\right)}{\sqrt{\sum_{i=1}^{n}\left(X_i-\bar{X}\right)^2}\,\sqrt{\sum_{i=1}^{n}\left(Y_i-\bar{Y}\right)^2}}\,.$$
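For reference, the coefficient above can be computed directly as follows (a small sketch, numerically equivalent to `np.corrcoef` for 1-D samples):

```python
import numpy as np

def pearson(x, y):
    """Pearson correlation coefficient of two 1-D samples, per the formula above."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    xm, ym = x - x.mean(), y - y.mean()
    return (xm * ym).sum() / (np.sqrt((xm ** 2).sum()) * np.sqrt((ym ** 2).sum()))
```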

To verify the temporal correlation between $g_t$ and $g_{t-n}$, we vary the shift $n$ and calculate the average temporal correlation coefficient over all variables. Results are shown in Table 1.

To verify the spatial correlation between $g_t$ and the spatial elements of $g_{t-n}$, we again vary the shift $n$, randomly sample pairs of elements, and calculate the average spatial correlation coefficient over all selected pairs. Results are shown in Table 2.

To verify the correlation between $g_t$ and $v_t$ within Adam, we calculate the average correlation coefficient between $g_t$ and $v_t$ over all variables. The result is 0.435885276.

To verify the correlation between $g_t$ and $v_t$ within non-AdaShift and within max-AdaShift, we vary the keep number $n$ and calculate the average correlation coefficient over all variables. The results are shown in Table 3 and Table 4.

## Appendix D Proof of Theorem 1

###### Proof.

With bias correction, the formulation of $m_t$ is written as follows:

$$m_t = \frac{(1-\beta_1)\sum_{i=1}^{t}\beta_1^{t-i}g_i}{(1-\beta_1)\sum_{i=1}^{t}\beta_1^{t-i}} = \frac{\sum_{i=1}^{t}\beta_1^{t-i}g_i}{\sum_{i=1}^{t}\beta_1^{t-i}}. \qquad (20)$$

According to L'Hôpital's rule, we have:

$$\lim_{\beta_1\to 1}\sum_{i=1}^{t}\beta_1^{t-i} = \lim_{\beta_1\to 1}\frac{1-\beta_1^{t}}{1-\beta_1} = t.$$

Thus,

$$\lim_{\beta_1\to 1} m_t = \frac{\sum_{i=1}^{t}g_i}{t}.$$
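Both limits can be checked numerically by evaluating the sums at $\beta_1$ close to $1$ (a quick sketch with illustrative values):

```python
def weight_sum(beta1, t):
    """sum_{i=1}^{t} beta1^{t-i}, which tends to t as beta1 -> 1."""
    return sum(beta1 ** (t - i) for i in range(1, t + 1))

def m_t(beta1, grads):
    """Bias-corrected m_t of Eq. (20); tends to the plain average as beta1 -> 1."""
    t = len(grads)
    num = sum(beta1 ** (t - i) * g for i, g in enumerate(grads, start=1))
    return num / weight_sum(beta1, t)
```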

According to the definition of the limit, let $g^*$ denote the average gradient of an epoch; then for any $\epsilon > 0$ there exists a $\beta_1$ close enough to $1$ such that

$$\|m_t - g^*\|_\infty < \epsilon.$$

We set $\epsilon$ to be $\min_i |g^*[i]|/2$; then for each dimension $i$ of $m_t$,

$$\frac{g^*[i]}{2} \le m_t[i] \le \frac{3\,g^*[i]}{2}.$$

So $m_t$ shares the same sign as $g^*$ in every dimension.

Given that it is a convex optimization problem, let the optimal parameter be $\theta^*$, and let $G$ bound the maximum gradient scale. Since $m_t$ shares the sign of $g^*$ in every dimension, with a sufficiently small step size we have

$$\lim_{t\to\infty}\|\theta_t-\theta^*\|_\infty < \frac{\epsilon}{2G}. \qquad (21)$$

Given the gradient bound $G$, this implies $f_t(\theta_t) - f_t(\theta^*) < \epsilon/2$, which yields the average regret

$$R(T)/T = \sum_{t=1}^{T}\left[f_t(\theta_t)-f_t(\theta^*)\right]/T < \frac{\epsilon}{2}. \qquad (22)$$

Letting $\epsilon \to 0$ completes the proof. ∎