Decaying momentum helps neural network training

10/11/2019
by John Chen et al.

Momentum is a simple and popular technique in deep learning for gradient-based optimizers. We propose a decaying momentum (Demon) rule, motivated by decaying the total contribution of a gradient to all future updates. Applying Demon to Adam leads to significantly improved training, making it competitive with momentum SGD with learning rate decay, even in settings where adaptive methods are typically non-competitive. Similarly, applying Demon to momentum SGD rivals momentum SGD with learning rate decay and in many cases improves performance. Demon is trivial to implement and incurs limited extra computational overhead compared to its vanilla counterparts.
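For readers who want to see the shape of the rule, below is a minimal NumPy sketch of a decaying-momentum update of the kind the abstract describes, applied to momentum SGD. The exact schedule in demon_beta (scaling beta/(1 - beta) by the fraction of training remaining, 1 - t/T) is our reading of the stated motivation, and all function and parameter names are illustrative rather than the authors' reference implementation; consult the full text for the precise rule and the Adam variant.

import numpy as np

def demon_beta(step, total_steps, beta_init=0.9):
    # Assumed Demon schedule: scale beta/(1 - beta) by the fraction of
    # training remaining, (1 - t/T), then solve back for beta_t.
    remaining = 1.0 - step / float(total_steps)
    return beta_init * remaining / ((1.0 - beta_init) + beta_init * remaining)

def demon_sgd_step(params, grads, velocity, lr, step, total_steps, beta_init=0.9):
    # One SGD-with-momentum update where the fixed momentum coefficient
    # is replaced by the decayed value beta_t.
    beta_t = demon_beta(step, total_steps, beta_init)
    velocity = beta_t * velocity + grads
    return params - lr * velocity, velocity

# Toy usage: minimize ||w||^2 over 100 steps; by the final step beta_t
# has decayed to 0 and the update reduces to plain SGD.
w = np.ones(3)
v = np.zeros_like(w)
for t in range(100):
    w, v = demon_sgd_step(w, grads=2.0 * w, velocity=v, lr=0.1,
                          step=t, total_steps=100)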

Related research

07/27/2023 · The Marginal Value of Momentum for Small Learning Rate SGD
Momentum is known to accelerate the convergence of gradient descent in s...

07/02/2020 · Adaptive Braking for Mitigating Gradient Delay
Neural network training is commonly accelerated by using multiple synchr...

11/01/2017 · Don't Decay the Learning Rate, Increase the Batch Size
It is common practice to decay the learning rate. Here we show one can u...

09/05/2023 · A Simple Asymmetric Momentum Make SGD Greatest Again
We propose the simplest SGD enhanced method ever, Loss-Controlled Asymme...

02/02/2022 · Robust Training of Neural Networks using Scale Invariant Architectures
In contrast to SGD, adaptive gradient methods like Adam allow robust tra...

05/14/2019 · Robust Neural Network Training using Periodic Sampling over Model Weights
Deep neural networks provide best-in-class performance for a number of c...

11/04/2020 · Reverse engineering learned optimizers reveals known and novel mechanisms
Learned optimizers are algorithms that can themselves be trained to solv...
