Escaping Saddle Points Faster with Stochastic Momentum

06/05/2021
by Jun-Kun Wang, et al.

Stochastic gradient descent (SGD) with stochastic momentum is popular in nonconvex stochastic optimization and particularly for the training of deep neural networks. In standard SGD, parameters are updated by taking a step along the gradient computed at the current iterate on a batch of examples, and the addition of a "momentum" term biases the update in the direction of the previous change in parameters. In non-stochastic convex optimization one can show that a momentum adjustment provably reduces convergence time in many settings, yet such results have been elusive in the stochastic and non-convex settings. At the same time, a widely observed empirical phenomenon is that stochastic momentum appears to significantly improve convergence time when training deep networks, and variants of it have flourished in the development of other popular update methods, e.g. ADAM [KB15], AMSGrad [RKK18], etc. Yet theoretical justification for the use of stochastic momentum has remained a significant open question. In this paper we propose an answer: stochastic momentum improves deep network training because it modifies SGD to escape saddle points faster and, consequently, to more quickly find a second-order stationary point. Our theoretical results also shed light on the related question of how to choose the ideal momentum parameter: our analysis suggests that β ∈ [0, 1) should be large (close to 1), which comports with empirical findings. We also provide experimental findings that further validate these conclusions.
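To make the update rule concrete, below is a minimal sketch of SGD with (heavy-ball) stochastic momentum as described above: the new iterate follows the stochastic gradient and is additionally biased toward the previous change in parameters, scaled by β. The function names, the noise model, and the toy saddle objective are illustrative assumptions, not the paper's implementation.

```python
# Sketch of SGD with stochastic momentum (heavy-ball form), assuming a
# generic stochastic-gradient oracle `stochastic_grad(x)`.
import numpy as np

def sgd_momentum(x0, stochastic_grad, lr=0.01, beta=0.9, steps=1000):
    """Iterate x_{t+1} = x_t - lr * g_t + beta * (x_t - x_{t-1}),
    where g_t is a stochastic gradient evaluated at x_t."""
    x_prev = x0.copy()
    x = x0.copy()
    for _ in range(steps):
        g = stochastic_grad(x)                      # noisy gradient estimate
        x_next = x - lr * g + beta * (x - x_prev)   # momentum term biases the
        x_prev, x = x, x_next                       # step toward the previous change
    return x

# Toy usage: f(x) = 0.5 * (x1^2 - x2^2) has a saddle point at the origin;
# Gaussian noise stands in for mini-batch sampling.
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    grad = lambda x: np.array([x[0], -x[1]]) + 0.1 * rng.standard_normal(2)
    x_final = sgd_momentum(np.array([1e-3, 1e-3]), grad, beta=0.9)
    print(x_final)  # the iterate escapes along the negative-curvature (x2) direction
```

With β close to 1, as the analysis suggests, the accumulated displacement along the negative-curvature direction grows faster than with plain SGD (β = 0), which is the intuition behind faster escape from saddle points.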


Related research

10/01/2020  Understanding the Role of Momentum in Non-Convex Optimization: Practical Insights from a Lyapunov Analysis
Momentum methods are now used pervasively within the machine learning co...

07/27/2023  The Marginal Value of Momentum for Small Learning Rate SGD
Momentum is known to accelerate the convergence of gradient descent in s...

02/14/2018  Toward Deeper Understanding of Nonconvex Stochastic Optimization with Momentum using Diffusion Approximations
Momentum Stochastic Gradient Descent (MSGD) algorithm has been widely ap...

03/02/2019  Time-Delay Momentum: A Regularization Perspective on the Convergence and Generalization of Stochastic Momentum for Deep Learning
In this paper we study the problem of convergence and generalization err...

05/31/2016  Asynchrony begets Momentum, with an Application to Deep Learning
Asynchronous methods are widely used in deep learning, but have limited ...

06/12/2023  Fast Diffusion Model
Despite their success in real data synthesis, diffusion models (DMs) oft...

10/31/2018  MaSS: an Accelerated Stochastic Method for Over-parametrized Learning
In this paper we introduce MaSS (Momentum-added Stochastic Solver), an a...
