Stochastic Gradient Descent with Nonlinear Conjugate Gradient-Style Adaptive Momentum

12/03/2020
by Bao Wang, et al.

Momentum plays a crucial role in stochastic gradient-based optimization algorithms for accelerating and improving the training of deep neural networks (DNNs). In deep learning practice, the momentum is usually weighted by a well-calibrated constant. However, tuning hyperparameters for momentum can be a significant computational burden. In this paper, we propose a novel adaptive momentum for improving DNN training; this adaptive momentum, which requires no momentum-related hyperparameter, is motivated by the nonlinear conjugate gradient (NCG) method. Stochastic gradient descent (SGD) with this new adaptive momentum eliminates the need for momentum hyperparameter calibration, allows a significantly larger learning rate, accelerates DNN training, and improves the final accuracy and robustness of the trained DNNs. For instance, SGD with this adaptive momentum reduces the classification error of ResNet110 from 5.25% to 4.64% on CIFAR10 and from 23.75% to 20.03% on CIFAR100. Furthermore, SGD with the new adaptive momentum also benefits adversarial training and improves the adversarial robustness of the trained DNNs.
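In classical NCG, the search direction is updated as d_k = -g_k + beta_k * d_{k-1}, where the coefficient beta_k is recomputed every iteration from consecutive gradients (for example, the Fletcher-Reeves ratio ||g_k||^2 / ||g_{k-1}||^2 or the Polak-Ribiere formula) rather than fixed in advance. The sketch below illustrates how such an adaptive momentum coefficient could be wired into an SGD-style optimizer. It is a minimal sketch under assumptions of ours: the class name NCGMomentumSGD, the Fletcher-Reeves choice of beta, and the clipping of beta to [0, 1] are illustrative and are not claimed to be the paper's exact update rule.

```python
import torch

class NCGMomentumSGD(torch.optim.Optimizer):
    """SGD with an NCG-style adaptive momentum coefficient (illustrative sketch).

    The momentum coefficient beta is recomputed at every step from consecutive
    stochastic gradients via a Fletcher-Reeves-style ratio, so no momentum
    hyperparameter needs to be tuned.
    """

    def __init__(self, params, lr=0.1, eps=1e-12):
        super().__init__(params, dict(lr=lr, eps=eps))

    @torch.no_grad()
    def step(self, closure=None):
        for group in self.param_groups:
            lr, eps = group["lr"], group["eps"]
            for p in group["params"]:
                if p.grad is None:
                    continue
                g = p.grad
                state = self.state[p]
                if "prev_grad" not in state:
                    # First step: plain steepest-descent direction.
                    direction = -g
                else:
                    # Fletcher-Reeves-style coefficient from consecutive gradients,
                    # clipped to [0, 1] for stability with noisy mini-batch gradients
                    # (the clipping is an assumption of this sketch).
                    beta = g.pow(2).sum() / (state["prev_grad"].pow(2).sum() + eps)
                    beta = beta.clamp(max=1.0)
                    direction = -g + beta * state["direction"]
                state["prev_grad"] = g.clone()
                state["direction"] = direction
                p.add_(direction, alpha=lr)
```

In a standard training loop such an optimizer would be used in place of torch.optim.SGD, e.g. optimizer = NCGMomentumSGD(model.parameters(), lr=0.1), with beta adapting per step instead of being hand-tuned.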

