Toward Deeper Understanding of Nonconvex Stochastic Optimization with Momentum using Diffusion Approximations

02/14/2018
by   Tianyi Liu, et al.

The Momentum Stochastic Gradient Descent (MSGD) algorithm has been widely applied to many nonconvex optimization problems in machine learning; popular examples include training deep neural networks and dimensionality reduction. Due to the lack of convexity and the extra momentum term, the optimization theory of MSGD is still largely unknown. In this paper, we study this fundamental optimization algorithm through the lens of the so-called "strict saddle problem." Using a diffusion-approximation-type analysis, we show that momentum helps the iterates escape saddle points but hurts convergence within the neighborhood of optima (in the absence of step size annealing). Our theoretical discovery partially corroborates the empirical success of MSGD in training deep neural networks. Moreover, our analysis applies the martingale method and the "Fixed-State-Chain" method from the stochastic approximation literature, which are of independent interest.
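
For reference, below is a minimal sketch of the heavy-ball MSGD update the abstract refers to; the learning rate, momentum coefficient, noise level, and the toy strict-saddle objective are illustrative assumptions, not values taken from the paper.

import numpy as np

def msgd(grad_fn, x0, lr=0.01, momentum=0.9, n_steps=1000, noise_std=0.1, seed=0):
    """Heavy-ball momentum SGD: v <- mu*v - lr*(grad + noise); x <- x + v."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    v = np.zeros_like(x)
    for _ in range(n_steps):
        # Stochastic gradient: true gradient plus Gaussian noise (assumed noise model).
        g = grad_fn(x) + noise_std * rng.standard_normal(x.shape)
        v = momentum * v - lr * g  # momentum (velocity) update
        x = x + v                  # parameter update
    return x

# Toy strict-saddle objective f(x) = (x1^2 - x2^2) / 2 with gradient (x1, -x2);
# the origin is a saddle point that the iterates must escape along x2.
x_final = msgd(lambda x: np.array([x[0], -x[1]]), x0=[1.0, 1e-3])
print(x_final)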

Related research:

- 06/04/2018: Towards Understanding Acceleration Tradeoff between Momentum and Asynchrony in Nonconvex Stochastic Optimization
- 06/05/2021: Escaping Saddle Points Faster with Stochastic Momentum
- 08/03/2022: SGEM: stochastic gradient with energy and momentum
- 05/22/2017: Batch Size Matters: A Diffusion Approximation Framework on Nonconvex Stochastic Gradient Descent
- 06/12/2023: Fast Diffusion Model
- 04/14/2023: Who breaks early, looses: goal oriented training of deep neural networks based on port Hamiltonian dynamics
- 03/06/2018: Dimensionality Reduction for Stationary Time Series via Stochastic Nonconvex Optimization
