
Calibrating the Learning Rate for Adaptive Gradient Methods to Improve Generalization Performance

08/02/2019
by Qianqian Tong, et al.

Although adaptive gradient methods (AGMs) are fast at training deep neural networks, they are known to generalize worse than stochastic gradient descent (SGD) or SGD with momentum (S-Momentum). Many works have attempted to modify AGMs so as to close the generalization gap between AGMs and S-Momentum, but they do not explain why the gap exists. We identify that the anisotropic scale of the adaptive learning rate (A-LR) used by AGMs contributes to the generalization gap, and that all existing modified AGMs are in effect efforts to revise the A-LR. Because the A-LR varies significantly across the dimensions of the problem over the optimization epochs (i.e., it has an anisotropic scale), we propose a new AGM that calibrates the A-LR with a softplus function, resulting in the Sadam and SAMSGrad methods (code is available at https://github.com/neilliang90/Sadam.git). These methods have a better chance of not being trapped at sharp local minimizers, which helps them recover the dips in the generalization error curve observed with SGD and S-Momentum. We further provide a new way to analyze the convergence of AGMs (e.g., Adam, Sadam, and SAMSGrad) under the nonconvex, non-strongly convex, and Polyak-Łojasiewicz conditions. We prove that the convergence rate of Adam also depends on its hyper-parameter ε, a dependence overlooked in prior convergence analyses. Empirical studies support our observation of the anisotropic A-LR and show that the proposed methods outperform existing AGMs and generalize even better than S-Momentum on multiple deep learning tasks.
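
The abstract only states that the A-LR is calibrated with a softplus function; a minimal sketch of that idea, assuming the softplus simply replaces the usual sqrt(v_hat) + eps denominator of Adam (the function names, beta value, and hyper-parameters below are illustrative assumptions, not taken from the authors' code), might look as follows:

    import numpy as np

    def softplus(x, beta=50.0):
        # Smooth approximation of max(x, 0); larger beta keeps it close to x for x > 0
        # while bounding the denominator away from zero (numerically stable form).
        return np.logaddexp(0.0, beta * x) / beta

    def sadam_like_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, beta=50.0):
        # One update of a softplus-calibrated Adam variant (hypothetical sketch).
        # Standard Adam divides by sqrt(v_hat) + eps; here the denominator is
        # softplus(sqrt(v_hat)), which smooths and bounds the anisotropic A-LR.
        m = beta1 * m + (1 - beta1) * grad          # first-moment (momentum) estimate
        v = beta2 * v + (1 - beta2) * grad ** 2     # second-moment estimate
        m_hat = m / (1 - beta1 ** t)                # bias corrections (t starts at 1)
        v_hat = v / (1 - beta2 ** t)
        theta = theta - lr * m_hat / softplus(np.sqrt(v_hat), beta)
        return theta, m, v

Compared with Adam's 1/(sqrt(v_hat) + eps) factor, a softplus denominator caps how large any per-coordinate step can become, which is consistent with the calibration effect the abstract describes.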


Related research

Closing the Generalization Gap of Adaptive Gradient Methods in Training Deep Neural Networks (06/18/2018)
Momentum Centering and Asynchronous Update for Adaptive Gradient Methods (10/11/2021)
Positive-Negative Momentum: Manipulating Stochastic Gradient Noise to Improve Generalization (03/31/2021)
AdaBelief Optimizer: Adapting Stepsizes by the Belief in Observed Gradients (10/15/2020)
Training Neural Networks for and by Interpolation (06/13/2019)
Spherical Perspective on Learning with Batch Norm (06/23/2020)