Adam revisited: a weighted past gradients perspective

01/01/2021
by Hui Zhong et al.

Adaptive learning rate methods have been successfully applied in many fields, especially in training deep neural networks. Recent results have shown that adaptive methods with exponentially increasing weights on the squared past gradients (e.g., ADAM, RMSPROP) may fail to converge to the optimal solution. Although many algorithms, such as AMSGRAD and ADAMNC, have been proposed to fix this non-convergence issue, achieving a data-dependent regret bound similar to or better than that of ADAGRAD remains a challenge for these methods. In this paper, we propose a novel adaptive method, the weighted adaptive algorithm (WADA), to tackle the non-convergence issue. Unlike AMSGRAD and ADAMNC, we use a milder weighting strategy on the squared past gradients, in which the weights grow linearly rather than exponentially. Based on this idea, we propose the weighted adaptive gradient method framework (WAGMF) and implement WADA on this framework. Moreover, we prove that WADA achieves a weighted data-dependent regret bound, which can be better than the original regret bound of ADAGRAD when the gradients decrease rapidly. This bound may partially explain the good practical performance of ADAM. Finally, extensive experiments demonstrate the effectiveness of WADA and its variants in comparison with several variants of ADAM on training convex problems and deep neural networks.
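The central idea, linearly growing weights on the squared past gradients rather than AdaGrad's uniform weights or Adam's exponentially growing ones, can be illustrated with a short sketch. The code below is an interpretation based on the abstract only: the first-moment (momentum) term, the 1/sqrt(t) step-size decay, the normalization by the sum of weights, and the function name wada_sketch are assumptions for clarity, not the authors' exact WADA/WAGMF update.

# Illustrative sketch (not the authors' reference implementation) of an
# adaptive update whose weight on the k-th squared gradient grows linearly
# with k, in contrast to AdaGrad (uniform weights) and Adam (exponential).
import numpy as np

def wada_sketch(grad_fn, x0, lr=0.1, beta1=0.9, eps=1e-8, steps=100):
    x = np.asarray(x0, dtype=float)
    m = np.zeros_like(x)      # first moment (Adam-style momentum, assumed)
    v = np.zeros_like(x)      # linearly weighted sum of squared gradients
    weight_sum = 0.0          # sum of the linear weights 1 + 2 + ... + t
    for t in range(1, steps + 1):
        g = grad_fn(x)
        m = beta1 * m + (1 - beta1) * g
        v += t * g * g        # weight on the t-th squared gradient is t
        weight_sum += t
        v_hat = v / weight_sum                      # normalized second moment
        x -= lr / np.sqrt(t) * m / (np.sqrt(v_hat) + eps)
    return x

# Example usage: minimize f(x) = 0.5 * ||x||^2, whose gradient is x.
x_opt = wada_sketch(lambda x: x, x0=[5.0, -3.0], steps=200)

Because the linear weights decay only polynomially into the past, this update keeps a long memory of the gradient history (closer to AdaGrad) while still emphasizing recent gradients, which is the trade-off the abstract attributes to WADA.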


Related research:

10/27/2019 - An Adaptive and Momental Bound Method for Stochastic Learning
  Training deep neural networks requires intricate initialization and care...

04/19/2019 - On the Convergence of Adam and Beyond
  Several recently proposed stochastic optimization methods that have been...

03/04/2019 - Optimistic Adaptive Acceleration for Optimization
  We consider a new variant of AMSGrad. AMSGrad RKK18 is a popular adaptiv...

05/04/2019 - NAMSG: An Efficient Method For Training Neural Networks
  We introduce NAMSG, an adaptive first-order algorithm for training neura...

06/21/2020 - Adaptive Learning Rates with Maximum Variation Averaging
  Adaptive gradient methods such as RMSProp and Adam use exponential movin...

04/28/2021 - FastAdaBelief: Improving Convergence Rate for Belief-based Adaptive Optimizer by Strong Convexity
  The AdaBelief algorithm demonstrates superior generalization ability to ...

02/12/2020 - LaProp: a Better Way to Combine Momentum with Adaptive Gradient
  Identifying a divergence problem in Adam, we propose a new optimizer, La...