Quasi-hyperbolic momentum and Adam for deep learning

10/16/2018
by Jerry Ma, et al.

Momentum-based acceleration of stochastic gradient descent (SGD) is widely used in deep learning. We propose the quasi-hyperbolic momentum algorithm (QHM) as an extremely simple alteration of momentum SGD, averaging a plain SGD step with a momentum step. We describe numerous connections to and identities with other algorithms, and we characterize the set of two-state optimization algorithms that QHM can recover. Finally, we propose a QH variant of Adam called QHAdam, and we empirically demonstrate that our algorithms lead to significantly improved training in a variety of settings, including a new state-of-the-art result on WMT16 EN-DE. We hope that these empirical results, combined with the conceptual and practical simplicity of QHM and QHAdam, will spur interest from both practitioners and researchers. PyTorch code is immediately available.
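The core QHM update is simple enough to sketch in a few lines. Below is a minimal NumPy illustration of one QHM step based on the abstract's description of averaging a plain SGD step with a momentum step (the function name and the toy quadratic are illustrative, not from the paper; the defaults nu = 0.7 and beta = 0.999 follow the rule-of-thumb values the paper suggests):

```python
import numpy as np

def qhm_step(theta, g_buf, grad, lr=0.1, beta=0.999, nu=0.7):
    """One quasi-hyperbolic momentum (QHM) step.

    Averages a plain SGD step with a momentum step:
    nu = 0 recovers plain SGD; nu = 1 recovers momentum SGD
    (with a normalized momentum buffer).
    """
    g_buf = beta * g_buf + (1.0 - beta) * grad   # EWMA momentum buffer
    step = (1.0 - nu) * grad + nu * g_buf        # blend immediate gradient with momentum
    return theta - lr * step, g_buf

# Toy usage: minimize f(x) = 0.5 * ||x||^2, whose gradient is simply x.
x, buf = np.array([3.0, -2.0]), np.zeros(2)
for _ in range(2000):
    x, buf = qhm_step(x, buf, grad=x)
print(x)  # approaches the minimizer at the origin
```

The interpolation parameter nu is what makes QHM a strict generalization: setting nu = 0 gives plain SGD, setting nu = 1 gives momentum SGD with an exponentially weighted moving-average buffer, and intermediate values blend the two.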


Related research:

- 05/28/2023: Acceleration of stochastic gradient descent with momentum by averaging: finite-sample rates and asymptotic normality
- 10/08/2021: Momentum Doesn't Change the Implicit Bias
- 01/25/2022: On Uniform Boundedness Properties of SGD and its Momentum Variants
- 10/01/2021: Accelerate Distributed Stochastic Descent for Nonconvex Optimization with Momentum
- 05/31/2016: Asynchrony begets Momentum, with an Application to Deep Learning
- 07/13/2022: Towards understanding how momentum improves generalization in deep learning
- 11/23/2021: Variance Reduction in Deep Learning: More Momentum is All You Need
