
On the Convergence of AdaGrad with Momentum for Training Deep Neural Networks

by Fangyu Zou et al.
Stony Brook University

Adaptive stochastic gradient descent methods, such as AdaGrad, Adam, AdaDelta, Nadam, and AMSGrad, have proven effective for non-convex stochastic optimization problems such as training deep neural networks. However, their convergence rates in the non-convex stochastic setting remained largely unexplored until recent breakthrough results on AdaGrad (Ward et al., 2018) and perturbed AdaGrad (Li and Orabona, 2018). In this paper, we propose two new adaptive stochastic gradient methods, AdaHB and AdaNAG, which integrate coordinate-wise AdaGrad with heavy-ball momentum and Nesterov accelerated gradient momentum, respectively. By leveraging a newly developed unified formulation of these two momentum mechanisms, we jointly establish O(log T/√T) non-asymptotic convergence rates for AdaHB and AdaNAG in the non-convex stochastic setting. In particular, when the momentum term vanishes, we obtain the convergence rate of coordinate-wise AdaGrad in the non-convex stochastic setting as a byproduct.
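The core idea of combining coordinate-wise AdaGrad with heavy-ball momentum can be sketched as follows. This is an illustrative sketch only, not the paper's exact AdaHB algorithm: the step-size schedule, momentum weighting, and the `lr`, `beta`, and `eps` values here are assumptions chosen for the example.

```python
import numpy as np

def adahb_step(theta, grad, state, lr=0.1, beta=0.9, eps=1e-8):
    """One update of coordinate-wise AdaGrad with heavy-ball momentum.

    Illustrative sketch; the paper's AdaHB may differ in details such as
    the step-size schedule and how the momentum term is weighted.
    """
    # AdaGrad: accumulate squared gradients per coordinate.
    state["G"] += grad ** 2
    # Coordinate-wise adaptive scaling of the gradient.
    adaptive_grad = grad / (np.sqrt(state["G"]) + eps)
    # Heavy-ball momentum: blend the previous update direction
    # with the current adaptively scaled step.
    state["m"] = beta * state["m"] + lr * adaptive_grad
    return theta - state["m"]

# Toy usage: minimize f(x) = x^2 starting from x = 5.
theta = np.array([5.0])
state = {"G": np.zeros(1), "m": np.zeros(1)}
for _ in range(200):
    grad = 2.0 * theta
    theta = adahb_step(theta, grad, state)
```

Replacing the heavy-ball update with a Nesterov-style lookahead step (evaluating the gradient at the extrapolated point) would give the analogous AdaNAG-style variant.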



