Calibrating the Learning Rate for Adaptive Gradient Methods to Improve Generalization Performance

by   Qianqian Tong, et al.

Although adaptive gradient methods (AGMs) have fast speed in training deep neural networks, it is known to generalize worse than the stochastic gradient descent (SGD) or SGD with momentum (S-Momentum). Many works have attempted to modify AGMs so to close the gap in generalization performance between AGMs and S-Momentum, but they do not answer why there is such a gap. We identify that the anisotropic scale of the adaptive learning rate (A-LR) used by AGMs contributes to the generalization performance gap, and all existing modified AGMs actually represent efforts in revising the A-LR. Because the A-LR varies significantly across the dimensions of the problem over the optimization epochs (i.e., anisotropic scale), we propose a new AGM by calibrating the A-LR with a softplus function, resulting in the Sadam and SAMSGrad methods[Code is available at]. These methods have better chance to not trap at sharp local minimizers, which helps them resume the dips in the generalization error curve observed with SGD and S-Momentum. We further provide a new way to analyze the convergence of AGMs (e.g., Adam, Sadam, and SAMSGrad) under the nonconvex, non-strongly convex, and Polyak-Łojasiewicz conditions. We prove that the convergence rate of ADAM also depends on its hyper-parameter epsilon, which has been overlooked in prior convergence analysis. Empirical studies support our observation of the anisotropic A-LR and show that the proposed methods outperform existing AGMs and generalize even better than S-Momentum in multiple deep learning tasks.


page 1

page 2

page 3

page 4


Closing the Generalization Gap of Adaptive Gradient Methods in Training Deep Neural Networks

Adaptive gradient methods, which adopt historical gradient information t...

Momentum Centering and Asynchronous Update for Adaptive Gradient Methods

We propose ACProp (Asynchronous-centering-Prop), an adaptive optimizer w...

Training Neural Networks for and by Interpolation

The majority of modern deep learning models are able to interpolate the ...

Spherical Perspective on Learning with Batch Norm

Batch Normalization (BN) is a prominent deep learning technique. In spit...

A Simple Asymmetric Momentum Make SGD Greatest Again

We propose the simplest SGD enhanced method ever, Loss-Controlled Asymme...

A Qualitative Study of the Dynamic Behavior of Adaptive Gradient Algorithms

The dynamic behavior of RMSprop and Adam algorithms is studied through a...

Please sign up or login with your details

Forgot password? Click here to reset