A New Adaptive Gradient Method with Gradient Decomposition

07/18/2021
by   Zhou Shao, et al.

Adaptive gradient methods, especially Adam-type methods (such as Adam, AMSGrad, and AdaBound), have been proposed to speed up the training process with an element-wise scaling term on learning rates. However, they often generalize poorly compared with stochastic gradient descent (SGD) and its accelerated schemes, such as SGD with momentum (SGDM). In this paper, we propose a new adaptive method called DecGD, which simultaneously achieves good generalization like SGDM and rapid convergence like Adam-type methods. In particular, DecGD decomposes the current gradient into the product of two terms: a surrogate gradient and a loss-based vector. Our method adjusts the learning rates adaptively according to the current loss-based vector instead of the squared gradients used in Adam-type methods. The intuition behind DecGD's adaptive learning rates is that a good optimizer generally needs to decrease the learning rates as the loss decreases, similar to the learning rate decay scheduling technique. DecGD therefore converges rapidly in the early phases of training and controls the effective learning rates according to the loss-based vectors, which helps lead to better generalization. Convergence analysis is given for both convex and non-convex settings. Finally, empirical results on widely used tasks and models demonstrate that DecGD shows better generalization performance than SGDM and rapid convergence like Adam-type methods.
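To make the decomposition idea concrete, below is a minimal sketch of a DecGD-style update, assuming one plausible reading of the abstract: writing the gradient as g = 2*sqrt(L + c) * grad(sqrt(L + c)), so that grad(sqrt(L + c)) plays the role of the surrogate gradient and 2*sqrt(L + c) plays the role of the loss-based term that scales the effective learning rate. The function name decgd_step, the hyperparameters lr, c, and beta, and the exponential moving average of the loss-based term are illustrative assumptions, not the paper's exact algorithm.

```python
import numpy as np

def decgd_step(theta, loss_fn, grad_fn, state, lr=0.1, c=1e-8, beta=0.9):
    """One illustrative DecGD-style step (a sketch, not the paper's exact update).

    Decomposes the gradient g = grad L(theta) as
        g = 2*sqrt(L + c) * grad sqrt(L + c),
    i.e. a surrogate gradient times a loss-based term, and lets the
    effective step size track a smoothed version of that loss-based term,
    so the step shrinks automatically as the loss decreases.
    """
    loss = loss_fn(theta)
    g = grad_fn(theta)

    v_loss = 2.0 * np.sqrt(loss + c)      # loss-based term
    surrogate = g / v_loss                # surrogate gradient: grad sqrt(L + c)

    # Exponential moving average of the loss-based term plays the role that
    # the squared-gradient EMA plays in Adam-type methods (assumed detail).
    state["v"] = beta * state.get("v", v_loss) + (1.0 - beta) * v_loss

    # Effective learning rate scales with the smoothed loss-based term:
    # large early in training, decaying as the loss decreases.
    theta_new = theta - lr * state["v"] * surrogate
    return theta_new, state


# Usage on a toy quadratic: L(theta) = 0.5 * ||theta||^2
theta = np.array([3.0, -2.0])
state = {}
for _ in range(200):
    theta, state = decgd_step(
        theta,
        loss_fn=lambda t: 0.5 * np.dot(t, t),
        grad_fn=lambda t: t,
        state=state,
    )
print(theta)  # close to the minimizer at the origin
```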


