
Calibrating the Learning Rate for Adaptive Gradient Methods to Improve Generalization Performance

by Qianqian Tong, et al.

Although adaptive gradient methods (AGMs) train deep neural networks quickly, they are known to generalize worse than stochastic gradient descent (SGD) or SGD with momentum (S-Momentum). Many works have attempted to modify AGMs so as to close the generalization gap between AGMs and S-Momentum, but they do not answer why such a gap exists. We identify that the anisotropic scale of the adaptive learning rate (A-LR) used by AGMs contributes to the generalization gap, and that all existing modified AGMs in fact represent efforts to revise the A-LR. Because the A-LR varies significantly across the dimensions of the problem over the optimization epochs (i.e., it has anisotropic scale), we propose a new AGM that calibrates the A-LR with a softplus function, resulting in the Sadam and SAMSGrad methods [Code is available at]. These methods are less likely to be trapped at sharp local minimizers, which helps them recover the dips in the generalization error curve observed with SGD and S-Momentum. We further provide a new way to analyze the convergence of AGMs (e.g., Adam, Sadam, and SAMSGrad) under the nonconvex, non-strongly convex, and Polyak-Łojasiewicz conditions. We prove that the convergence rate of Adam also depends on its hyper-parameter epsilon, which has been overlooked in prior convergence analyses. Empirical studies support our observation of the anisotropic A-LR and show that the proposed methods outperform existing AGMs and generalize even better than S-Momentum on multiple deep learning tasks.
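The softplus calibration described above can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: it replaces Adam's A-LR denominator `sqrt(v) + eps` with `softplus(sqrt(v))`, as the abstract suggests; the sharpness parameter `beta`, the omission of bias correction, and all hyper-parameter values are assumptions for illustration.

```python
import numpy as np

def softplus(x, beta=50.0):
    # Numerically stable softplus: (1/beta) * log(1 + exp(beta * x)).
    # For large x it approaches x, so the update behaves like Adam;
    # for small x it smoothly lower-bounds the denominator, avoiding
    # an exploding adaptive learning rate when v is tiny.
    return np.logaddexp(0.0, beta * x) / beta

def sadam_step(param, grad, m, v,
               lr=1e-3, beta1=0.9, beta2=0.999, beta=50.0):
    """One Adam-style update with the A-LR denominator calibrated by a
    softplus (a sketch of the Sadam idea; bias correction omitted)."""
    m = beta1 * m + (1 - beta1) * grad        # first-moment estimate
    v = beta2 * v + (1 - beta2) * grad ** 2   # second-moment estimate
    denom = softplus(np.sqrt(v), beta=beta)   # calibrated A-LR denominator
    param = param - lr * m / denom
    return param, m, v
```

Because `softplus(x) -> x` as `x` grows, the calibrated denominator matches Adam's behavior on dimensions with large second moments while smoothing the anisotropic, near-zero dimensions that the paper identifies as the source of the generalization gap.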



