Scheduled Restart Momentum for Accelerated Stochastic Gradient Descent

02/24/2020
by Bao Wang, et al.

Stochastic gradient descent (SGD) with constant momentum and its variants such as Adam are the optimization algorithms of choice for training deep neural networks (DNNs). Since DNN training is incredibly computationally expensive, there is great interest in speeding up convergence. Nesterov accelerated gradient (NAG) improves the convergence rate of gradient descent (GD) for convex optimization using a specially designed momentum; however, it accumulates error when an inexact gradient is used (such as in SGD), slowing convergence at best and diverging at worst. In this paper, we propose Scheduled Restart SGD (SRSGD), a new NAG-style scheme for training DNNs. SRSGD replaces the constant momentum in SGD by the increasing momentum in NAG but stabilizes the iterations by resetting the momentum to zero according to a schedule. Using a variety of models and benchmarks for image classification, we demonstrate that, in training DNNs, SRSGD significantly improves convergence and generalization; for instance, in training ResNet200 for ImageNet classification, SRSGD achieves an error rate of 20.93%, and these improvements become more significant as the network grows deeper. Furthermore, on both CIFAR and ImageNet, SRSGD reaches similar or even better error rates with fewer training epochs compared to the SGD baseline. We provide code for SRSGD at https://github.com/minhtannguyen/SRSGD.
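
Below is a minimal PyTorch-style sketch of the idea described in the abstract: NAG-style iterations whose momentum grows with an iteration counter, with that counter reset on a schedule. The class name SRSGDSketch, the default learning rate, and the single fixed restart_freq are illustrative assumptions, not the authors' released implementation (see the repository linked above for that).

```python
import torch


class SRSGDSketch(torch.optim.Optimizer):
    """Sketch of NAG momentum with scheduled restarts (illustrative only)."""

    def __init__(self, params, lr=0.1, restart_freq=40):
        super().__init__(params, dict(lr=lr, restart_freq=restart_freq))

    @torch.no_grad()
    def step(self, closure=None):
        for group in self.param_groups:
            lr, freq = group["lr"], group["restart_freq"]
            for p in group["params"]:
                if p.grad is None:
                    continue
                state = self.state[p]
                if not state:
                    state["k"] = 0                        # per-parameter iteration counter
                    state["v_prev"] = p.detach().clone()  # previous auxiliary iterate v^k
                k = state["k"]
                mu = k / (k + 3.0)                        # increasing NAG-style momentum
                v = p - lr * p.grad                       # gradient step: v^{k+1} = w^k - lr * grad f(w^k)
                p.copy_(v + mu * (v - state["v_prev"]))   # extrapolation: w^{k+1} = v^{k+1} + mu * (v^{k+1} - v^k)
                state["v_prev"] = v
                # Scheduled restart: reset the counter (and hence mu) to zero every freq steps.
                state["k"] = 0 if (k + 1) % freq == 0 else k + 1
```

For example, opt = SRSGDSketch(model.parameters(), lr=0.1, restart_freq=40) would discard the accumulated momentum every 40 iterations; a single fixed restart period is used here only for simplicity, whereas the scheduled restart described in the abstract allows the restart points to follow a schedule rather than a fixed period.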

Related research

Stochastic Gradient Descent with Nonlinear Conjugate Gradient-Style Adaptive Momentum (12/03/2020)
Momentum plays a crucial role in stochastic gradient-based optimization ...

Exploit Where Optimizer Explores via Residuals (04/11/2020)
To train neural networks faster, many research efforts have been devoted...

Towards understanding how momentum improves generalization in deep learning (07/13/2022)
Stochastic gradient descent (SGD) with momentum is widely used for train...

Normalized Direction-preserving Adam (09/13/2017)
Optimization algorithms for training deep models not only affects the co...

Time-Delay Momentum: A Regularization Perspective on the Convergence and Generalization of Stochastic Momentum for Deep Learning (03/02/2019)
In this paper we study the problem of convergence and generalization err...

SGDR: Stochastic Gradient Descent with Warm Restarts (08/13/2016)
Restart techniques are common in gradient-free optimization to deal with...

Fast Diffusion Model (06/12/2023)
Despite their success in real data synthesis, diffusion models (DMs) oft...
