Adaptive Gradient Method with Resilience and Momentum

10/21/2020
by Jie Liu, et al.

Several variants of stochastic gradient descent (SGD) have been proposed to improve the effectiveness and efficiency of training deep neural networks; among them, recent influential methods such as Adam and RMSProp adaptively control the parameter-wise learning rate. Although these methods converge much faster, most adaptive learning rate methods suffer from compromised generalization compared with SGD. In this paper, we propose an Adaptive Gradient Method with Resilience and Momentum (AdaRem), motivated by the observation that oscillations of network parameters slow down training, and we give a theoretical proof of convergence. For each parameter, AdaRem adjusts the parameter-wise learning rate according to whether the direction of the parameter's past changes is aligned with the direction of the current gradient, thus encouraging long-term consistent parameter updates with far fewer oscillations. Comprehensive experiments on a large-scale image recognition dataset, ImageNet, verify the effectiveness of AdaRem across various models and show that our method outperforms previous adaptive learning rate based algorithms in both training speed and test error.
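
As a rough illustration of the idea described in the abstract, the sketch below implements a resilience-style, parameter-wise learning rate adjustment in NumPy: the step for a parameter is enlarged when the current gradient agrees with the direction the parameter has recently been moving, and shrunk when it opposes it (an oscillation). The function name, the hyperparameters beta and eta, and the sign-based agreement measure are illustrative assumptions, not the published AdaRem update rule.

```python
import numpy as np

def resilience_step(theta, grad, direction, lr=0.01, beta=0.9, eta=0.5):
    """One parameter-wise update (illustrative sketch, not the exact AdaRem rule).

    theta     -- current parameters (np.ndarray)
    grad      -- current stochastic gradient (np.ndarray, same shape)
    direction -- running average of past updates (np.ndarray, same shape)
    """
    # Agreement in {-1, 0, +1}: positive where the step we are about to take
    # (-grad) points the same way the parameter has been moving, negative
    # where it would reverse the recent movement (an oscillation).
    agreement = np.sign(direction) * np.sign(-grad)
    # Enlarge the learning rate for consistently moving parameters,
    # shrink it for oscillating ones.
    per_param_lr = lr * (1.0 + eta * agreement)
    step = -per_param_lr * grad
    # Track the direction of recent updates, momentum-style, for the next call.
    direction = beta * direction + (1.0 - beta) * step
    return theta + step, direction
```

In a training loop, direction would start at zeros and be carried across iterations, playing a role analogous to a momentum buffer; on the first step the agreement term vanishes and the update reduces to plain SGD.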

research
06/18/2018
Closing the Generalization Gap of Adaptive Gradient Methods in Training Deep Neural Networks
Adaptive gradient methods, which adopt historical gradient information t...

research
06/22/2021
Adaptive Learning Rate and Momentum for Training Deep Neural Networks
Recent progress on deep learning relies heavily on the quality and effic...

research
01/24/2023
Read the Signs: Towards Invariance to Gradient Descent's Hyperparameter Initialization
We propose ActiveLR, an optimization meta algorithm that localizes the l...

research
05/27/2019
Stochastic Gradient Methods with Layer-wise Adaptive Moments for Training of Deep Networks
We propose NovoGrad, a first-order stochastic gradient method with layer...

research
07/06/2020
TDprop: Does Jacobi Preconditioning Help Temporal Difference Learning?
We investigate whether Jacobi preconditioning, accounting for the bootst...

research
10/10/2021
Frequency-aware SGD for Efficient Embedding Learning with Provable Benefits
Embedding learning has found widespread applications in recommendation s...

research
08/01/2022
Dynamic Batch Adaptation
Current deep learning adaptive optimizer methods adjust the step magnitu...
