Closing the Generalization Gap of Adaptive Gradient Methods in Training Deep Neural Networks

06/18/2018
by   Jinghui Chen, et al.

Adaptive gradient methods, which use historical gradient information to automatically adjust the learning rate, have been observed to generalize worse than stochastic gradient descent (SGD) with momentum when training deep neural networks. How to close this generalization gap remains an open problem. In this work, we show that adaptive gradient methods such as Adam and Amsgrad are sometimes "over-adapted". We design a new algorithm, called the Partially adaptive momentum estimation method (Padam), which unifies Adam/Amsgrad with SGD to achieve the best of both worlds. Experiments on standard benchmarks show that Padam maintains a convergence rate as fast as Adam/Amsgrad while generalizing as well as SGD when training deep neural networks. These results suggest that practitioners can once again adopt adaptive gradient methods for faster training of deep neural networks.
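As a rough illustration of the "partially adaptive" idea described in the abstract, the sketch below shows an Amsgrad-style update whose denominator is raised to a partial power p in [0, 1/2], so that p = 1/2 behaves like Amsgrad and p near 0 approaches SGD with momentum. This is a minimal sketch, not the authors' reference implementation; the function name, hyperparameter defaults, and the epsilon placement are illustrative assumptions.

```python
import numpy as np

def padam_like_step(theta, grad, m, v, v_hat,
                    lr=0.1, beta1=0.9, beta2=0.999, p=0.125, eps=1e-8):
    """One parameter update with a partially adaptive denominator.

    p in [0, 0.5] controls the degree of adaptivity:
    p = 0.5 gives an Amsgrad-like step, p -> 0 approaches SGD with momentum.
    """
    m = beta1 * m + (1 - beta1) * grad          # first moment (momentum)
    v = beta2 * v + (1 - beta2) * grad ** 2     # second moment
    v_hat = np.maximum(v_hat, v)                # Amsgrad-style running max
    theta = theta - lr * m / (v_hat ** p + eps) # partially adaptive scaling
    return theta, m, v, v_hat
```

In this sketch the partial exponent p is the only change relative to an Amsgrad-style step; keeping p well below 1/2 reduces how aggressively per-coordinate learning rates shrink, which is the "over-adaptation" the abstract refers to.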


Related research

08/02/2019 · Calibrating the Learning Rate for Adaptive Gradient Methods to Improve Generalization Performance
Although adaptive gradient methods (AGMs) have fast speed in training de...

05/23/2017 · The Marginal Value of Adaptive Gradient Methods in Machine Learning
Adaptive optimization methods, which perform local optimization with a m...

05/23/2019 · Blockwise Adaptivity: Faster Training and Better Generalization in Deep Learning
Stochastic methods with coordinate-wise adaptive stepsize (such as RMSpr...

03/01/2023 · AdaSAM: Boosting Sharpness-Aware Minimization with Adaptive Learning Rate and Momentum for Training Deep Neural Networks
Sharpness aware minimization (SAM) optimizer has been extensively explor...

10/21/2020 · Adaptive Gradient Method with Resilience and Momentum
Several variants of stochastic gradient descent (SGD) have been proposed...

06/29/2020 · Adai: Separating the Effects of Adaptive Learning Rate and Momentum Inertia
Adaptive Momentum Estimation (Adam), which combines Adaptive Learning Ra...
