Escaping Saddle Points with Adaptive Gradient Methods

01/26/2019
by Matthew Staib, et al.

Adaptive methods such as Adam and RMSProp are widely used in deep learning but are not well understood. In this paper, we seek a crisp, clean and precise characterization of their behavior in nonconvex settings. To this end, we first provide a novel view of adaptive methods as preconditioned SGD, where the preconditioner is estimated in an online manner. By studying the preconditioner on its own, we elucidate its purpose: it rescales the stochastic gradient noise to be isotropic near stationary points, which helps escape saddle points. Furthermore, we show that adaptive methods can efficiently estimate the aforementioned preconditioner. By gluing together these two components, we provide the first (to our knowledge) second-order convergence result for any adaptive method. The key insight from our analysis is that, compared to SGD, adaptive methods escape saddle points faster, and can converge faster overall to second-order stationary points.
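For intuition about the "preconditioned SGD" view, here is a minimal sketch, not the paper's algorithm or analysis: it is standard RMSProp rewritten so that the preconditioner is explicit, namely the inverse square root of an online exponential-moving-average estimate of the per-coordinate second moment of the stochastic gradient. The function names and the toy noisy saddle objective are illustrative assumptions, not from the paper.

import numpy as np

def rmsprop_as_preconditioned_sgd(grad_fn, x0, lr=1e-3, beta=0.999,
                                  eps=1e-8, num_steps=1000):
    """RMSProp written explicitly as preconditioned SGD.

    grad_fn(x) returns a stochastic gradient at x. The diagonal
    preconditioner is the inverse square root of an exponential
    moving average of squared gradients, estimated online.
    """
    x = np.asarray(x0, dtype=float).copy()
    v = np.zeros_like(x)  # online estimate of E[g^2], coordinate-wise
    for _ in range(num_steps):
        g = grad_fn(x)
        # Update the second-moment statistics that define the preconditioner.
        v = beta * v + (1.0 - beta) * g ** 2
        # Preconditioned SGD step: each coordinate of the gradient (and its
        # noise) is rescaled by 1 / sqrt(v), making the noise roughly isotropic.
        x -= lr * g / (np.sqrt(v) + eps)
    return x

# Toy usage on a quadratic saddle f(x, y) = x^2 - y^2 with small additive noise
# (hypothetical example, chosen only to illustrate the interface).
rng = np.random.default_rng(0)

def noisy_saddle_grad(x):
    return np.array([2.0 * x[0], -2.0 * x[1]]) + 0.01 * rng.normal(size=2)

x_final = rmsprop_as_preconditioned_sgd(noisy_saddle_grad, x0=[1.0, 1e-3], lr=1e-2)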

Related research

03/07/2021 · Escaping Saddle Points with Stochastically Controlled Stochastic Gradient Methods
Stochastically controlled stochastic gradient (SCSG) methods have been p...

02/01/2019 · Sharp Analysis for Nonconvex SGD Escaping from Saddle Points
In this paper, we prove that the simplest Stochastic Gradient Descent (S...

08/16/2018 · On the Convergence of Adaptive Gradient Methods for Nonconvex Optimization
Adaptive gradient methods are workhorses in deep learning. However, the ...

06/12/2020 · Adaptive Gradient Methods Can Be Provably Faster than SGD after Finite Epochs
Adaptive gradient methods have attracted much attention of machine learn...

06/10/2019 · Adaptively Preconditioned Stochastic Gradient Langevin Dynamics
Stochastic Gradient Langevin Dynamics infuses isotropic gradient noise t...

11/04/2022 · How Does Adaptive Optimization Impact Local Neural Network Geometry?
Adaptive optimization methods are well known to achieve superior converg...

05/20/2022 · On the SDEs and Scaling Rules for Adaptive Gradient Algorithms
Approximating Stochastic Gradient Descent (SGD) as a Stochastic Differen...