A Sufficient Condition for Convergences of Adam and RMSProp

11/23/2018
by Fangyu Zou, et al.

Adam and RMSProp, two of the most influential adaptive stochastic algorithms for training deep neural networks, have been shown to diverge even in the convex setting via a few simple counterexamples. Many remedies have been proposed to make Adam/RMSProp-type algorithms converge, such as decreasing the adaptive learning rate, adopting a large batch size, incorporating a temporal decorrelation technique, and seeking an analogous surrogate. In contrast with existing approaches, we introduce an alternative, easy-to-check sufficient condition, depending only on the base learning rate parameters and the combinations of historical second-order moments, that guarantees the global convergence of generic Adam/RMSProp for large-scale non-convex stochastic optimization. Moreover, we show that the convergence of several variants of Adam, such as AdamNC and AdaEMA, follows directly from the proposed sufficient condition in the non-convex setting. In addition, we show that Adam is essentially a specifically weighted AdaGrad with exponential moving average momentum, which provides a new perspective for understanding Adam and RMSProp. Together with the sufficient condition, this observation yields a much deeper interpretation of their divergence. Finally, we validate the sufficient condition by applying Adam and RMSProp to the counterexamples and to training deep neural networks. The numerical results are in exact accord with the theoretical analysis.
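For context, the generic Adam/RMSProp update the abstract refers to maintains exponential moving averages of the gradients and of their squares. The following is a minimal sketch of one such step, assuming NumPy and illustrative hyperparameter names (base_lr, beta1, beta2, eps); it is not the paper's exact parameterization of the base learning rate or the second-moment weights.

```python
import numpy as np

def generic_adam_step(theta, grad, m, v, t,
                      base_lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam-style step (illustrative sketch, not the paper's exact scheme).

    theta : current iterate (NumPy array)
    grad  : stochastic gradient evaluated at theta
    m, v  : running first- and second-moment estimates
    t     : step counter, starting at 1
    """
    m = beta1 * m + (1 - beta1) * grad        # EMA of gradients (momentum)
    v = beta2 * v + (1 - beta2) * grad**2     # EMA of squared gradients
    m_hat = m / (1 - beta1**t)                # standard bias corrections
    v_hat = v / (1 - beta2**t)
    theta = theta - base_lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v
```

Setting beta1 = 0 and skipping the bias corrections gives an RMSProp-style update. Unrolling the second-moment recursion gives v_t = (1 - beta2) * sum_{k=1}^{t} beta2^{t-k} * g_k^2, an exponentially weighted sum of squared gradients in place of AdaGrad's unweighted sum, which is the sense in which the abstract describes Adam as a specifically weighted AdaGrad with exponential moving average momentum.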


Related research

Towards Practical Adam: Non-Convexity, Convergence Theory, and Mini-Batch Acceleration (01/14/2021)
Adam is one of the most influential adaptive stochastic algorithms for t...

An Adaptive Gradient Method with Energy and Momentum (03/23/2022)
We introduce a novel algorithm for gradient-based optimization of stocha...

AdaSAM: Boosting Sharpness-Aware Minimization with Adaptive Learning Rate and Momentum for Training Deep Neural Networks (03/01/2023)
Sharpness aware minimization (SAM) optimizer has been extensively explor...

Weighted AdaGrad with Unified Momentum (08/10/2018)
Integrating adaptive learning rate and momentum techniques into SGD lead...

UAdam: Unified Adam-Type Algorithmic Framework for Non-Convex Stochastic Optimization (05/09/2023)
Adam-type algorithms have become a preferred choice for optimisation in ...

An Adaptive and Momental Bound Method for Stochastic Learning (10/27/2019)
Training deep neural networks requires intricate initialization and care...

Optimal learning rate schedules in high-dimensional non-convex optimization problems (02/09/2022)
Learning rate schedules are ubiquitously used to speed up and improve op...
