1 Introduction
Stochastic gradient descent (SGD) is one of the most popular algorithms in machine learning, owing to its scalability to high-dimensional problems and its favorable generalization properties. SGD is applicable to a broad class of convex and nonconvex optimization problems arising in machine learning [1, 2], including deep learning, where it has been particularly successful [3, 4, 5]. In deep learning, many key tasks can be formulated as the following nonconvex optimization problem:
(1)  $\min_{w \in \mathbb{R}^d} f(w) := \frac{1}{n} \sum_{i=1}^{n} f^{(i)}(w),$

where $w \in \mathbb{R}^d$ contains the weights for the deep network to estimate, $f^{(i)}$ is the (typically nonconvex) loss function corresponding to the $i$-th data point, and $n$ is the number of data points [6, 7, 5]. SGD iterations consist of

(2)  $w_{k+1} = w_k - \eta \nabla \tilde{f}_k(w_k),$

where $\eta$ is the stepsize, $k$ denotes the iterations, $w_0$ is the initial point, and $\nabla \tilde{f}_k$ is an unbiased estimator of the actual gradient $\nabla f$, estimated from a subset of the component functions $f^{(i)}$. In particular, the gradients of the objective are estimated as averages of the form

(3)  $\nabla \tilde{f}_k(w) := \frac{1}{b} \sum_{i \in \Omega_k} \nabla f^{(i)}(w),$

where $\Omega_k \subset \{1, \dots, n\}$ is a random subset that is drawn with or without replacement at iteration $k$, and $b = |\Omega_k|$ denotes the number of elements in $\Omega_k$ [1].
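To make the recursion concrete, the following sketch implements plain minibatch SGD for a toy finite-sum objective. The quadratic losses, stepsize, and batch size are illustrative choices introduced here, not taken from the paper.

```python
import numpy as np

def sgd(grad_fns, w0, eta=0.1, batch_size=2, n_iters=500, seed=0):
    """Minibatch SGD for f(w) = (1/n) * sum_i f_i(w).

    grad_fns : list of per-example gradient functions.
    At each iteration a random subset Omega_k of size b is drawn and the
    full gradient is estimated by the average over Omega_k.
    """
    rng = np.random.default_rng(seed)
    w = float(w0)
    n = len(grad_fns)
    for _ in range(n_iters):
        omega = rng.choice(n, size=batch_size, replace=False)  # random subset
        g = sum(grad_fns[i](w) for i in omega) / batch_size    # unbiased gradient estimate
        w -= eta * g                                           # SGD step
    return w

# Toy example: f_i(w) = (w - c_i)^2 / 2, so f is minimized at mean(c_i) = 1.5.
centers = [0.0, 1.0, 2.0, 3.0]
grads = [lambda w, c=c: w - c for c in centers]
w_final = sgd(grads, w0=10.0)
```

The iterate hovers around the minimizer with a fluctuation whose size depends on the stepsize and batch size, which is exactly the stochasticity the continuous-time models below try to capture.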
The popularity and success of SGD in practice have motivated researchers to investigate and analyze the reasons behind this success, a topic which has been an active research area [6, 4]. One well-known hypothesis [8] that has gained recent popularity (see e.g. [4, 9]) is that, among all the local minima lying on the nonconvex energy landscape defined by the loss function (1), local minima that lie in wide valleys generalize better than those in sharp valleys, and that SGD is able to converge to the 'right' local minimum that generalizes better. This is visualized in Figure 1(right), where the local minimum on the right lies in a wide valley, whereas the local minimum on the left lies in a sharp, deep valley. Interpreting this hypothesis and the structure of the local minima found by SGD clearly requires a deeper understanding of the statistical properties of the gradient noise and its implications for the dynamics of SGD. A number of papers in the literature argue that the noise has Gaussian structure [10, 7, 11, 12, 13, 3]. Under the Gaussian noise assumption, the following continuous-time limit of SGD has been considered in the literature to analyze the behavior of SGD:
(4)  $\mathrm{d}W_t = -\nabla f(W_t)\, \mathrm{d}t + \sqrt{\eta \sigma}\, \mathrm{d}B_t,$

where $B_t$ is the standard Brownian motion, $\sigma$ is the noise variance, and $\eta$ is the stepsize. The Gaussianity of the gradient noise implicitly assumes that the gradient noise has a finite variance with light tails. In a recent study, [6] empirically illustrated that in various deep learning settings the gradient noise exhibits a heavy-tail behavior, which suggests that the Gaussian-based approximation is not always appropriate, and furthermore, that the heavy-tailed noise can be modeled by a symmetric $\alpha$-stable distribution ($\mathcal{S}\alpha\mathcal{S}$). Here, $\alpha \in (0, 2]$ is called the tail index and characterizes the heavy-tailedness of the distribution, and $\sigma$ is a scale parameter that will be formally defined in Section 2. This stable model generalizes the Gaussian model in the sense that $\alpha = 2$ reduces to the Gaussian model, whereas smaller values of $\alpha$ quantify the heavy-tailedness of the gradient noise (see Figure 1(left)). Under this noise model, the resulting continuous-time limit of SGD becomes [6]:

(5)  $\mathrm{d}W_t = -\nabla f(W_t)\, \mathrm{d}t + \eta^{(\alpha-1)/\alpha} \sigma\, \mathrm{d}L_t^{\alpha},$

where $L_t^{\alpha}$ is the $d$-dimensional $\alpha$-stable Lévy motion with independent components (which will be formally defined in Section 2). This process has also been investigated for global nonconvex optimization in a recent study [14].
The sample paths of the Lévy-driven SDE (5) behave fundamentally differently from those of the Brownian-motion-driven dynamics (4). This difference mainly originates from the fact that, unlike the Brownian motion, which has almost surely continuous sample paths, the Lévy motion can have discontinuities, also called 'jumps' [15] (cf. Figure 1(middle)). This fundamental difference becomes more prominent in the metastability properties of the SDE (5). Metastability studies consider the case where $W_0$ is initialized in a basin and analyze the minimum time such that $W_t$ exits that basin. It has been shown that when $\alpha < 2$ (i.e. the noise has a heavy-tailed component), this so-called first exit time depends only on the width of the basin and the value of $\alpha$; it does not depend on the height of the basin [16, 17, 18]. The empirical results in [6] showed that, in various deep learning settings, the estimated tail index $\alpha$ is significantly smaller than 2, suggesting that the metastability results can be used as a proxy for understanding the dynamics of SGD in discrete time, especially to shed more light on the hypothesis that SGD prefers wide minima.
While this approach brings a new perspective for analyzing SGD, approximating SGD by a continuous-time process might not be accurate for every stepsize $\eta$, and theoretical concerns have already been raised about the validity of such approximations [19]. Intuitively, one can expect that the metastable behavior of SGD would be similar to that of its continuous-time limit only when the discretization stepsize is small enough. Even though some theoretical results have recently been established for discretizations of SDEs driven by Brownian motion [20], it is not clear how discretized Lévy-driven SDEs behave in terms of metastability.
In this study, we provide a formal theoretical analysis in which we derive explicit conditions on the stepsize $\eta$ such that the metastability behavior of the discrete-time system (7) is guaranteed to be close to that of its continuous-time limit (6). More precisely, we consider a stochastic differential equation with both a Brownian term and a Lévy term, and its Euler discretization, as follows [21]:
(6)  $\mathrm{d}X_t = -\nabla f(X_t)\, \mathrm{d}t + \sigma\, \mathrm{d}B_t + \varepsilon\, \mathrm{d}L_t^{\alpha},$

(7)  $\bar{X}_{k+1} = \bar{X}_k - \eta \nabla f(\bar{X}_k) + \sigma \sqrt{\eta}\, Z_{k+1} + \varepsilon \eta^{1/\alpha} S_{k+1},$

with independent and identically distributed (i.i.d.) variables $Z_k \sim \mathcal{N}(0, I_d)$, where $I_d$ is the $d \times d$ identity matrix, the components of $S_k$ are i.i.d. with $\mathcal{S}\alpha\mathcal{S}(1)$ distribution, and $\varepsilon \geq 0$ is the amplitude of the noise. This dynamics includes (4) and (5) as special cases. Here, we choose $\varepsilon$ as a scalar for convenience; however, our analyses could easily be extended to the case where $\varepsilon$ is a function of the stepsize. Understanding the metastability behavior of SGD modeled by these dynamics requires understanding how long it takes for the continuous-time process given by (6) and its discretization (7) to exit a neighborhood of a local minimum $\bar{x}$, if started in that neighborhood. For this purpose, for any given local minimum $\bar{x}$ of $f$ and any $a > 0$, we define the set

(8)  $A := \{ x \in \mathbb{R}^d : \| x - \bar{x} \| \leq a \},$

which is the set of points in $\mathbb{R}^d$ at distance at most $a$ from the local minimum $\bar{x}$. We formally define the first exit times for $X_t$ and $\bar{X}_k$, respectively, as follows:
(9)  $\tau_a(\varepsilon) := \inf\{ t \geq 0 : X_t \notin A \},$

(10)  $\bar{\tau}_a(\varepsilon) := \inf\{ k\eta \geq 0 : \bar{X}_k \notin A \}.$
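The exit times in (9)-(10) are straightforward to estimate by simulation. The sketch below runs the Euler scheme and records the first time the iterate leaves a ball of radius $a$; for illustration it uses the one-dimensional quadratic $f(x) = x^2/2$ and takes $\alpha = 1$ (the Cauchy case), for which NumPy has a direct sampler. All numerical values are illustrative choices, not taken from the paper.

```python
import numpy as np

def first_exit_time(grad_f, x0, xbar, a, eta=1e-3, sigma=0.2, eps=0.2,
                    max_steps=200_000, seed=0):
    """First exit time of the Euler scheme from A = {x : |x - xbar| <= a}.

    The Gaussian increment is scaled by sqrt(eta) and the alpha-stable
    increment by eta**(1/alpha); here alpha = 1, so eta**(1/alpha) = eta.
    Returns k * eta for the first k with X_k outside A (censored at the
    step budget if no exit is observed).
    """
    rng = np.random.default_rng(seed)
    x = float(x0)
    for k in range(max_steps):
        if abs(x - xbar) > a:
            return k * eta
        z = rng.standard_normal()    # Gaussian part
        s = rng.standard_cauchy()    # symmetric stable part with alpha = 1
        x = x - eta * grad_f(x) + sigma * np.sqrt(eta) * z + eps * eta * s
    return max_steps * eta

# Quadratic basin f(x) = x^2 / 2 around xbar = 0, ball of radius 1.
tau_bar = first_exit_time(lambda x: x, x0=0.0, xbar=0.0, a=1.0)
```

In such a run the exit is typically triggered by a single large stable jump rather than by the accumulated Gaussian fluctuations, illustrating the qualitative difference discussed below.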
Our main result (Theorem 2) shows that, with a sufficiently small discretization step $\eta$, the probability that the discretized process exits a given neighborhood of the local optimum by a fixed time approximates that of the continuous process. This result also provides an explicit condition on the stepsize, which explains how the other parameters of the problem, such as the dimension $d$, the noise amplitude $\varepsilon$, and the scale $\sigma$ of the Gaussian noise, affect the similarity between the discretized and continuous processes. We validate our theory on a synthetic model and on neural networks.

Notations. For $a > 0$, the gamma function is defined as $\Gamma(a) := \int_0^{\infty} x^{a-1} e^{-x}\, \mathrm{d}x$. For any Borel probability measures $\mu$ and $\nu$ with domain $\mathbb{R}^d$, the total variation (TV) distance is defined as follows: $\| \mu - \nu \|_{TV} := \sup_{B \in \mathcal{B}(\mathbb{R}^d)} | \mu(B) - \nu(B) |$, where $\mathcal{B}(\mathbb{R}^d)$ denotes the Borel subsets of $\mathbb{R}^d$.
2 Technical Background
Symmetric stable distributions. The $\mathcal{S}\alpha\mathcal{S}(\sigma)$ distribution is a generalization of a centered Gaussian distribution, where $\alpha \in (0, 2]$ is called the tail index, a parameter that determines the amount of heavy-tailedness. We say that $X \sim \mathcal{S}\alpha\mathcal{S}(\sigma)$ if its characteristic function has the form $\mathbb{E}[\exp(i \omega X)] = \exp(-|\sigma \omega|^{\alpha})$, where $\sigma > 0$ is called the scale parameter. In the special case $\alpha = 2$, $\mathcal{S}\alpha\mathcal{S}(\sigma)$ reduces to the normal distribution $\mathcal{N}(0, 2\sigma^2)$. A crucial property of the stable distributions is that, when $X \sim \mathcal{S}\alpha\mathcal{S}(\sigma)$ with $\alpha < 2$, the moment $\mathbb{E}[|X|^p]$ is finite if and only if $p < \alpha$, which implies that $X$ has infinite variance as soon as $\alpha < 2$. While the probability density function does not have a closed-form analytical expression except for a few special cases of $\alpha$ (e.g. $\alpha = 2$: Gaussian, $\alpha = 1$: Cauchy), it is computationally easy to draw random samples from it by using the method proposed in [22].

Lévy processes and SDEs driven by Lévy motions. The standard $\alpha$-stable Lévy motion $L_t^{\alpha}$ on the real line is the unique process satisfying the following properties [21]:

- For any $0 \leq t_0 < t_1 < \dots < t_N$, the increments $L_{t_i}^{\alpha} - L_{t_{i-1}}^{\alpha}$ are independent for $i = 1, \dots, N$, and $L_0^{\alpha} = 0$ almost surely.
- $L_t^{\alpha} - L_s^{\alpha}$ and $L_{t-s}^{\alpha}$ have the same distribution, namely $\mathcal{S}\alpha\mathcal{S}((t-s)^{1/\alpha})$, for any $s < t$.
- $L_t^{\alpha}$ is continuous in probability: for all $\delta > 0$ and $s \geq 0$, $\mathbb{P}(|L_t^{\alpha} - L_s^{\alpha}| > \delta) \to 0$ as $t \to s$.
When $\alpha = 2$, $L_t^{\alpha}$ reduces to a scaled version of the standard Brownian motion, $\sqrt{2} B_t$. Since $L_t^{\alpha}$ for $\alpha < 2$ is only continuous in probability, it can incur a countable number of discontinuities at random times, which makes it fundamentally different from the Brownian motion, whose sample paths are almost surely continuous.

The $d$-dimensional Lévy motion $L_t^{\alpha}$ with independent components is a stochastic process on $\mathbb{R}^d$ where each coordinate corresponds to an independent scalar Lévy motion. Stochastic processes based on Lévy motion, such as (5), and their mathematical properties have been studied in the literature; we refer the reader to [23, 15] for details.
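Drawing $\mathcal{S}\alpha\mathcal{S}$ samples via the method of [22] (Chambers-Mallows-Stuck) takes only a few lines. The sketch below implements the symmetric ($\beta = 0$) case; the sample sizes and thresholds are illustrative.

```python
import numpy as np

def sas_sample(alpha, size, rng):
    """Standard symmetric alpha-stable samples (scale sigma = 1) via the
    Chambers-Mallows-Stuck method, symmetric case."""
    u = rng.uniform(-np.pi / 2, np.pi / 2, size)   # uniform angle
    w = rng.exponential(1.0, size)                 # unit-mean exponential
    if alpha == 1.0:
        return np.tan(u)                           # alpha = 1: Cauchy
    return (np.sin(alpha * u) / np.cos(u) ** (1.0 / alpha)
            * (np.cos((1.0 - alpha) * u) / w) ** ((1.0 - alpha) / alpha))

rng = np.random.default_rng(0)
gauss = sas_sample(2.0, 200_000, rng)   # alpha = 2: N(0, 2), finite variance
heavy = sas_sample(1.5, 200_000, rng)   # alpha = 1.5: infinite variance
frac_extreme = np.mean(np.abs(heavy) > 10.0)  # heavy tails produce outliers
```

For $\alpha = 2$ the formula reduces to $2 \sin(U) \sqrt{W}$, which is exactly $\mathcal{N}(0, 2)$; for $\alpha < 2$ a nonnegligible fraction of the samples is extreme, consistent with $\mathbb{E}[|X|^p] < \infty$ if and only if $p < \alpha$.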
First Exit Times of Continuous-Time Lévy Stable SDEs. Due to the discontinuities of Lévy-driven SDEs, their metastability behavior also differs significantly from that of their Brownian counterparts. In this section, we briefly review important theoretical results about the SDE given in (6).

For simplicity, let us consider the SDE (6) in dimension one, i.e. $d = 1$. In a relatively recent study [16], the authors considered this SDE, where the potential function $f$ is required to have a non-degenerate global minimum at the origin, and they proved the following theorem.
Theorem 1 ([16]).
Consider the SDE (6) in dimension $d = 1$ with $\sigma = 0$, and assume that it has a unique strong solution. Assume further that the objective $f$ has a global minimum at zero, satisfying the conditions $f(0) = 0$, $f'(0) = 0$, $f'(x) \neq 0$ if and only if $x \neq 0$, and $f''(0) > 0$. Then, there exist positive constants $\delta$, $\gamma$, $C$, and $\varepsilon_0$ such that for $0 < \varepsilon \leq \varepsilon_0$, the following holds:

(11)  $\big| \mathbb{P}_x\big( \lambda(\varepsilon)\, \tau_a(\varepsilon) > u \big) - e^{-u} \big| \leq C \varepsilon^{\gamma},$

uniformly for all $x \in [-a + \delta, a - \delta]$ and $u \geq 0$, where $\lambda(\varepsilon) := \frac{2}{\alpha} a^{-\alpha} \varepsilon^{\alpha}$ is the rate at which $\varepsilon L_t^{\alpha}$ makes a jump of size larger than $a$. Consequently,

(12)  $\mathbb{E}_x[\tau_a(\varepsilon)] \sim \frac{\alpha\, a^{\alpha}}{2}\, \varepsilon^{-\alpha} \quad \text{as } \varepsilon \to 0.$
This result indicates that the first exit time of (6) needs only polynomial time with respect to the width $a$ of the basin, and it does not depend on the depth of the basin, whereas Brownian systems need time exponential in the height of the basin in order to exit [24, 17]. This difference is mainly due to the discontinuities of the Lévy motion, which enable it to 'jump out' of the basin, whereas Brownian SDEs need to 'climb' the basin due to their continuity. Consequently, given that the gradient noise exhibits heavy-tailed behavior similar to an $\mathcal{S}\alpha\mathcal{S}$-distributed random variable, this result can be considered a proxy for understanding the wide-minima behavior of SGD.

We note that this result has already been extended to $\mathbb{R}^d$ in [18]. An extension to state-dependent noise has also been obtained in [25]. We also note that the metastability phenomenon is closely related to the spectral gap of the forward operator corresponding to the SDE dynamics (see e.g. [24]), and it is known that this quantity scales like $\varepsilon^{\alpha}$ for small $\varepsilon$, which determines the dependency on $\varepsilon$ in the first term of the exit time (12) due to Kramers' law [26, 27]. Burghoff and Pavlyukevich [27] showed that a similar scaling in $\varepsilon$ for the spectral gap holds if the SDE dynamics is restricted to a discrete grid with a small enough grid size.
3 Assumptions and the Main Result
In this study, our main goal is to obtain an explicit condition on the stepsize such that the first exit time (9) of the continuous-time process is similar to the first exit time (10) of its Euler discretization.
We first state our assumptions.
A 1.
The SDE (6) admits a unique strong solution.
A 2.
The process (6) satisfies a Novikov-type condition: $\mathbb{E}\big[ \exp\big( \frac{1}{2\sigma^2} \int_0^T \| \nabla f(X_s) \|^2\, \mathrm{d}s \big) \big] < \infty$.
A 3.
The gradient of $f$ is Hölder continuous with exponent $\gamma \in (0, 1]$: $\| \nabla f(x) - \nabla f(y) \| \leq L \| x - y \|^{\gamma}$ for all $x, y \in \mathbb{R}^d$.
A 4.
The gradient of $f$ satisfies the following assumption at the origin: $\| \nabla f(0) \| \leq B$.
A 5.
For some $m > 0$ and $b \geq 0$, $f$ is $(m, b)$-dissipative: $\langle x, \nabla f(x) \rangle \geq m \| x \|^2 - b$ for all $x \in \mathbb{R}^d$.
We note that, as opposed to the theory of SDEs driven by Brownian motion, the theory of Lévy-driven SDEs is still an active research field, where even the existence of solutions for general drift functions is not well established and the main contributions have appeared in the last decade [28, 29]. Therefore, A1 has been a common assumption in stochastic analysis, e.g. [16, 18, 30]. Nevertheless, existence and uniqueness results have very recently been established in [29] for SDEs with bounded Hölder drifts; hence A1 and A2 directly hold for bounded gradients, and extending this result to Hölder and dissipative drifts is out of the scope of this study. On the other hand, the assumptions A3-A5 are standard conditions, which are often considered for nonconvex optimization algorithms based on discretizations of diffusions [31, 32, 33, 34, 35].
Now, we identify an explicit condition for the stepsize, which is one of our main contributions.
A 6.
We now present our main result, whose proof can be found in the supplementary material.
Theorem 2.
Exit time versus problem parameters. In Theorem 2, if we let the stepsize $\eta$ go to zero with the other parameters fixed, the error constant in the bound also goes to zero, and since the tolerance can be chosen arbitrarily small, the probabilities of the first exit times of the discrete and continuous processes approach each other as the stepsize gets smaller, as expected. If instead we decrease $\varepsilon$ or $\sigma$, the bound also decreases monotonically, but it does not go to zero due to the first term in its expression.
Exit time versus width of local minima. Popular activation functions used in deep learning, such as ReLU, are almost everywhere differentiable, and therefore the cost function has a well-defined Hessian almost everywhere (see e.g. [36]). The eigenvalues of the Hessian of the objective near local minima have also been studied in the literature (see e.g. [37, 38]). If the Hessian around a local minimum is positive definite, the conditions for the multidimensional version of Theorem 1 in [18] are satisfied locally around that minimum. For local minima lying in wider valleys, the parameter $a$ can be taken larger, in which case the expected exit time will be larger by formula (12). In other words, the SDE (5) spends more time to exit wider valleys. Theorem 2 shows that SGD, modeled by the discretization of this SDE, inherits a similar behavior if the stepsize satisfies the conditions we provide.

4 Proof Overview
Relating the first exit times of $X_t$ and $\bar{X}_k$ often requires bounds on the distance between the two processes. In particular, if $\sup_{0 \leq t \leq T} \| X_t - \bar{X}_{\lfloor t/\eta \rfloor} \|$ is small with high probability, then we expect their first exit times from the set $A$ to be close to each other with high probability as well.

For objective functions with bounded gradients, in order to relate $\tau_a(\varepsilon)$ to $\bar{\tau}_a(\varepsilon)$, one can attempt to use the strong convergence of the Euler scheme (cf. [39], Proposition 1), which bounds the expected uniform distance between the two processes on $[0, T]$. By Markov's inequality, this result implies convergence in probability: for any $\delta > 0$ and $\rho \in (0, 1)$, there exists a small enough $\eta$ such that $\mathbb{P}(\sup_{0 \leq t \leq T} \| X_t - \bar{X}_{\lfloor t/\eta \rfloor} \| > \delta) \leq \rho$. Then, if the discrete process exits $A$ by time $T$, one of the following events must happen:

- the continuous process $X_t$ also exits $A$ by time $T$,
- $X_t$ stays in $A$ and the two processes are more than $\delta$ apart (with probability less than $\rho$),
- $X_t$ stays in $A$ and its distance to the boundary of $A$ is at most $\delta$ (with probability less than $\rho$).
Using this observation, we can bound the difference between the two exit probabilities. Even though we could use this result to relate $\tau_a(\varepsilon)$ to $\bar{\tau}_a(\varepsilon)$, this approach would not yield a meaningful condition on $\eta$, since the bounds for the strong error in general grow exponentially with the time horizon $T$, which means $\eta$ would have to be chosen exponentially small for a given $T$. Therefore, in our strategy, we choose a different path, where we do not use the strong convergence of the Euler scheme.
Our proof strategy is inspired by the recent study [20], where the authors analyze the empirical metastability of the Langevin equation driven by a Brownian motion. However, unlike the Brownian case considered in [20], some of the tools available for analyzing Brownian SDEs do not exist for Lévy-driven SDEs, which increases the difficulty of our task.
We first define a linearly interpolated version of the discrete-time process $\bar{X}_k$, which will be useful in our analysis, given as follows:

(13)  $Y_t = Y_0 + \int_0^t \tilde{b}(\mathbf{Y}, s)\, \mathrm{d}s + \sigma B_t + \varepsilon L_t^{\alpha},$

where $\mathbf{Y}$ denotes the whole process and the drift function is chosen as follows: $\tilde{b}(\mathbf{Y}, s) := -\sum_{k=0}^{\infty} \nabla f(Y_{k\eta})\, \mathbb{1}_{[k\eta, (k+1)\eta)}(s)$. Here, $\mathbb{1}_S$ denotes the indicator function of the set $S$, i.e. $\mathbb{1}_S(s) = 1$ if $s \in S$ and $\mathbb{1}_S(s) = 0$ otherwise. It is easy to verify that $Y_{k\eta} = \bar{X}_k$ for all $k$ [40, 31].
In our approach, we start by developing a Girsanov-like change of measures [23] to express the Kullback-Leibler (KL) divergence between the processes on $[0, T]$, which is defined as follows: $\mathrm{KL}(\mu \| \nu) := \int \log \frac{\mathrm{d}\mu}{\mathrm{d}\nu}\, \mathrm{d}\mu$, where $\mu$ denotes the law of $\{X_t\}_{t \in [0, T]}$, $\nu$ denotes the law of $\{Y_t\}_{t \in [0, T]}$, and $\frac{\mathrm{d}\mu}{\mathrm{d}\nu}$ is the Radon-Nikodym derivative of $\mu$ with respect to $\nu$. Here, we require A2 for the existence of a Girsanov transform between $\mu$ and $\nu$ and for establishing an explicit formula for the transform. In the supplementary document, we show that the KL divergence between $\mu$ and $\nu$ can be written as:

(14)  $\mathrm{KL}(\mu \| \nu) = \frac{1}{2\sigma^2}\, \mathbb{E} \int_0^T \big\| \nabla f(Y_s) - \nabla f(Y_{\eta \lfloor s/\eta \rfloor}) \big\|^2\, \mathrm{d}s.$

While such a result has been known for SDEs driven by Brownian motion [15], none of the references we are aware of expressed the KL divergence as in (14). We also note that one of the key reasons that allows us to obtain (14) is the presence of the Brownian motion in (6), i.e. $\sigma > 0$. For $\sigma = 0$, such a measure transformation cannot be performed [41].
In the next result (Theorem 3), we show that if the stepsize $\eta$ is chosen sufficiently small, the KL divergence between $\mu$ and $\nu$ is bounded. The proof technique is similar to the approach of [40, 31, 14]: the idea is to divide the integral in (14) into smaller pieces and to bound each piece separately. Once we obtain a bound on the KL divergence, by using an optimal coupling argument, the data processing inequality, and Pinsker's inequality, we obtain a bound on the total variation (TV) distance between $\mu$ and $\nu$ as follows:

$\| \mu - \nu \|_{TV} \leq \sqrt{\mathrm{KL}(\mu \| \nu) / 2},$

where the TV distance is defined in Section 1. Here, the optimal coupling between $\mu$ and $\nu$ is a joint probability measure of $\{X_t\}$ and $\{Y_t\}$ which satisfies the following identity [42]: $\mathbb{P}(\exists t \in [0, T] : X_t \neq Y_t) = \| \mu - \nu \|_{TV}$.
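Pinsker's inequality, $\| \mu - \nu \|_{TV} \leq \sqrt{\mathrm{KL}(\mu \| \nu)/2}$, can be checked numerically on small discrete distributions; the two distributions below are arbitrary illustrative choices.

```python
import numpy as np

def kl(p, q):
    """KL divergence between discrete distributions on the same support."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def tv(p, q):
    """Total variation distance: half the l1 distance for discrete measures."""
    return 0.5 * float(np.sum(np.abs(np.asarray(p, float) - np.asarray(q, float))))

p = [0.5, 0.3, 0.2]
q = [0.4, 0.4, 0.2]
tv_dist = tv(p, q)                      # 0.1
pinsker_bound = np.sqrt(kl(p, q) / 2)   # about 0.112
```

The bound is tight only up to a constant, but it converts the KL control obtained from the Girsanov computation into the TV control needed for the coupling argument.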
Combined with Theorem 3, this inequality implies the following useful result:

(15)  $\big| \mathbb{P}(X_{k\eta} \in A,\ \forall k\eta \in [0, T]) - \mathbb{P}(\bar{\tau}_a(\varepsilon) > T) \big| \leq \| \mu - \nu \|_{TV},$

where we used the fact that the event $\{ Y_{k\eta} \in A,\ \forall k\eta \in [0, T] \}$ is equivalent to the event $\{ \bar{\tau}_a(\varepsilon) > T \}$. The remaining task is to relate the probability $\mathbb{P}(X_{k\eta} \in A,\ \forall k\eta \in [0, T])$ to $\mathbb{P}(\tau_a(\varepsilon) > T)$. The former event ensures that the process does not leave the set $A$ at the grid times $t = k\eta$; however, it does not indicate that the process remains in $A$ when $t \in (k\eta, (k+1)\eta)$. In order to have control over the whole process, we introduce an event ensuring that the process stays close to the set $A$ for the whole time horizon $[0, T]$. By using this event, we can upper-bound $\mathbb{P}(X_{k\eta} \in A,\ \forall k\eta \in [0, T])$ in terms of $\mathbb{P}(\tau_a(\varepsilon) > T)$; by the same approach, we can obtain a lower bound as well. Hence, our final task reduces to bounding the probability that the process leaves $A$ and returns in between two grid points, which we perform by using the weak reflection principles of Lévy processes [43]. This finally yields Theorem 2.
5 Numerical Illustration
To illustrate our results, we first conduct experiments on a synthetic problem, where the cost function is set to $f(x) = \| x \|^2 / 2$. This corresponds to an Ornstein-Uhlenbeck-type process, which is commonly considered in metastability analyses [21]. This process locally satisfies the conditions A1-A5.

Since we cannot directly simulate the continuous-time process, we consider the stochastic process sampled from (7) with a sufficiently small stepsize as an approximation of the continuous scheme. Thus, we organize the experiments as follows. We first choose a very small stepsize. Starting from an initial point inside the set $A$, we iterate (7) until we find the first $k$ such that $\bar{X}_k \notin A$. We repeat this experiment many times, then take the average exit time as the 'ground truth'. We continue the experiments by calculating the first exit times for larger stepsizes (each similarly averaged over repetitions), and compute their distances to the ground truth.
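The protocol above can be sketched as follows for the one-dimensional quadratic, again with $\alpha = 1$ (Cauchy) so that plain NumPy suffices; stepsizes, noise levels, and run counts are illustrative choices, not those used in the paper.

```python
import numpy as np

def mean_exit_time(eta, n_runs=40, a=1.0, sigma=0.5, eps=0.5, seed=0):
    """Average first exit time of the Euler scheme for f(x) = x^2 / 2."""
    rng = np.random.default_rng(seed)
    times = []
    for _ in range(n_runs):
        x, k = 0.0, 0
        while abs(x) <= a and k < 100_000:
            x += (-eta * x + sigma * np.sqrt(eta) * rng.standard_normal()
                  + eps * eta * rng.standard_cauchy())
            k += 1
        times.append(k * eta)   # exit time on the physical time scale k * eta
    return float(np.mean(times))

tau_ref = mean_exit_time(eta=1e-3)  # small stepsize: proxy for the ground truth
errors = {eta: abs(mean_exit_time(eta) - tau_ref) for eta in (2e-3, 5e-3, 1e-2)}
```

With enough repetitions the error should grow with $\eta$, mirroring the trend in Figure 2(a); with only a few dozen runs the estimates remain noisy, so the trend is qualitative.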
The results for this experiment are shown in Figure 2. By Theorem 2, the distance between the first exit times of the discretized and continuous processes depends on two error terms, which we use to explain our experimental results.
We observe from Figure 2(a) that the error with respect to the ground-truth first exit time is an increasing function of $\eta$, which directly matches our theoretical result. Figure 2(b) shows that, in the small-noise limit, the error decreases with the tail index $\alpha$. By A6, with increased $\alpha$, one of the two error terms is reduced; the other term, on the contrary, increases with $\alpha$. In the small-noise limit, this effect is dominated by the decrease of the first term, which makes the error decrease overall. The speed of decrease then decelerates for larger $\alpha$, since the increasing term starts to dominate the decreasing one. This suggests that for a large $\alpha$, a very small stepsize would be required to reduce the distance between the first exit times of the two processes. In Figure 2(c), the error decreases when the Gaussian noise scale $\sigma$ increases. The reason is the same as in (b) and can be explained from the expressions of the two error terms in the conclusion of Theorem 2.

In Figure 2(d), for small dimension $d$, with the same exit-time interval, increasing $d$ makes both processes escape the interval earlier, with smaller exit times; hence the distance between their exit times becomes smaller. For larger $d$, the growth of the error terms with $d$ starts to dominate this 'early-escape' effect, and the speed of decrease of the error diminishes: we observe that the error even increases slightly for the largest dimensions considered.
In our second set of experiments, we consider the real-data setting used in [6]: a multilayer fully connected neural network with ReLU activations on the MNIST dataset. We adapted the code provided in [6]. For this model, we followed a similar methodology: we monitored the first exit time while varying the stepsize $\eta$, the number of layers (depth), and the number of neurons per layer (width). Since a local minimum is not analytically available, we first trained the networks with SGD until the vicinity of a local minimum was reached with at least 90% accuracy, and then measured the first exit times. In order to have a prominent level of gradient noise, we used a small minibatch size and did not add explicit Gaussian or Lévy noise. The results are given in Figure 3. We observe that, even with pure gradient noise, the error in the exit time behaves very similarly to what we observed in Figure 2(a), hence supporting our theory. We further observe that the error has a milder dependence on $\eta$ when the width and depth are relatively small, whereas the slope of the error increases for larger width and depth. This result shows that, in order to inherit the metastability properties of the continuous-time SDE, we need to use a smaller $\eta$ as we increase the size of the network. Note that this result does not conflict with Figure 2(d), since changing the width and depth does not simply change $d$; it also changes the landscape of the problem.

6 Conclusion
We studied SGD under a heavy-tailed gradient noise model, which has been empirically justified for a variety of deep learning tasks. While a continuous-time limit of SGD can be used as a proxy for investigating the metastability of SGD under this model, the system might behave differently once discretized. Addressing this issue, we derived explicit conditions on the stepsize such that the discrete-time system can inherit the metastability behavior of its continuous-time limit. We illustrated our results on a synthetic model and on neural networks.
Acknowledgments
We are grateful to Peter Tankov for providing us the derivations for the Girsanov-like change of measures. This work is partly supported by the French National Research Agency (ANR) as part of the FBIMATRIX (ANR-16-CE23-0014) project, and by the industrial chair Data Science & Artificial Intelligence from Télécom Paris. Mert Gürbüzbalaban acknowledges support from the grants NSF DMS-1723085 and NSF CCF-1814888.
References
 [1] Léon Bottou. Stochastic gradient descent tricks. In Neural networks: Tricks of the trade, pages 421–436. Springer, 2012.
 [2] Léon Bottou. Largescale machine learning with stochastic gradient descent. In Proceedings of COMPSTAT’2010, pages 177–186. Springer, 2010.
 [3] P. Chaudhari and S. Soatto. Stochastic gradient descent performs variational inference, converges to limit cycles for deep networks. In International Conference on Learning Representations, 2018.
 [4] Pratik Chaudhari, Anna Choromanska, Stefano Soatto, Yann LeCun, Carlo Baldassi, Christian Borgs, Jennifer Chayes, Levent Sagun, and Riccardo Zecchina. Entropy-SGD: Biasing gradient descent into wide valleys. arXiv preprint arXiv:1611.01838, 2016.
 [5] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521(7553):436-444, 2015.
 [6] U. Şimşekli, L. Sagun, and M. Gürbüzbalaban. A Tail-Index Analysis of Stochastic Gradient Noise in Deep Neural Networks. In ICML, 2019.
 [7] S. Jastrzebski, Z. Kenton, D. Arpit, N. Ballas, A. Fischer, Y. Bengio, and A. Storkey. Three factors influencing minima in SGD. arXiv preprint arXiv:1711.04623, 2017.
 [8] Sepp Hochreiter and Jürgen Schmidhuber. Flat minima. Neural Computation, 9(1):1–42, 1997.
 [9] Nitish Shirish Keskar, Dheevatsa Mudigere, Jorge Nocedal, Mikhail Smelyanskiy, and Ping Tak Peter Tang. On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima. arXiv preprint arXiv:1609.04836, 2016.
 [10] S. Mandt, M. Hoffman, and D. Blei. A variational analysis of stochastic gradient algorithms. In International Conference on Machine Learning, pages 354–363, 2016.
 [11] Q. Li, C. Tai, and W. E. Stochastic Modified Equations and Adaptive Stochastic Gradient Algorithms. In Proceedings of the 34th International Conference on Machine Learning, pages 2101–2110, 06–11 Aug 2017.
 [12] W. Hu, C. J. Li, L. Li, and J.G. Liu. On the diffusion approximation of nonconvex stochastic gradient descent. arXiv preprint arXiv:1705.07562, 2017.
 [13] Zhanxing Zhu, Jingfeng Wu, Bing Yu, Lei Wu, and Jinwen Ma. The Anisotropic Noise in Stochastic Gradient Descent: Its Behavior of Escaping from Minima and Regularization Effects. arXiv preprint arXiv:1803.00195, 2018.

 [14] Thanh Huy Nguyen, Umut Şimşekli, and Gaël Richard. Non-Asymptotic Analysis of Fractional Langevin Monte Carlo for Non-Convex Optimization. In International Conference on Machine Learning, 2019.
 [15] Bernt Karsten Øksendal and Agnes Sulem. Applied stochastic control of jump diffusions, volume 498. Springer, 2005.
 [16] Peter Imkeller and Ilya Pavlyukevich. First exit times of SDEs driven by stable Lévy processes. Stochastic Processes and their Applications, 116(4):611-642, 2006.
 [17] P. Imkeller, I. Pavlyukevich, and T. Wetzel. The hierarchy of exit times of Lévy-driven Langevin equations. The European Physical Journal Special Topics, 191(1):211-222, 2010.
 [18] Peter Imkeller, Ilya Pavlyukevich, and Michael Stauch. First exit times of non-linear dynamical systems in $\mathbb{R}^d$ perturbed by multifractal Lévy noise. Journal of Statistical Physics, 141(1):94-119, 2010.
 [19] S. Yaida. Fluctuationdissipation relations for stochastic gradient descent. In International Conference on Learning Representations, 2019.
 [20] B. Tzen, T. Liang, and M. Raginsky. Local Optimality and Generalization Guarantees for the Langevin Algorithm via Empirical Metastability. In Proceedings of the 2018 Conference on Learning Theory, 2018.
 [21] J. Duan. An Introduction to Stochastic Dynamics. Cambridge University Press, New York, 2015.
 [22] J. M. Chambers, C. L. Mallows, and B. W. Stuck. A method for simulating stable random variables. Journal of the american statistical association, 71(354):340–344, 1976.
 [23] Peter Tankov. Financial modelling with jump processes. Chapman and Hall/CRC, 2003.
 [24] Anton Bovier, Michael Eckhoff, Véronique Gayrard, and Markus Klein. Metastability in reversible diffusion processes I: Sharp asymptotics for capacities and exit times. Journal of the European Mathematical Society, 6(4):399–424, 2004.
 [25] Ilya Pavlyukevich. First exit times of solutions of stochastic differential equations driven by multiplicative Lévy noise with heavy tails. Stochastics and Dynamics, 11(02n03):495–519, 2011.
 [26] Nils Berglund. Kramers’ law: Validity, derivations and generalisations. arXiv preprint arXiv:1106.5799, 2011.
 [27] Toralf Burghoff and Ilya Pavlyukevich. Spectral Analysis for a Discrete Metastable System Driven by Lévy Flights. Journal of Statistical Physics, 161(1):171–196, 2015.
 [28] Enrico Priola et al. Pathwise uniqueness for singular SDEs driven by stable processes. Osaka Journal of Mathematics, 49(2):421-447, 2012.
 [29] Alexei M. Kulik. On weak uniqueness and distributional properties of a solution to an SDE with $\alpha$-stable noise. Stochastic Processes and their Applications, 129(2):473-506, 2019.
 [30] Mingjie Liang and Jian Wang. Gradient Estimates and Ergodicity for SDEs Driven by Multiplicative Lévy Noises via Coupling. arXiv preprint arXiv:1801.05936, 2018.
 [31] M. Raginsky, A. Rakhlin, and M. Telgarsky. Nonconvex learning via stochastic gradient Langevin dynamics: a nonasymptotic analysis. In Proceedings of the 2017 Conference on Learning Theory, volume 65, pages 1674–1703, 2017.
 [32] Pan Xu, Jinghui Chen, Difan Zou, and Quanquan Gu. Global convergence of Langevin dynamics based algorithms for nonconvex optimization. In Advances in Neural Information Processing Systems, pages 3125–3136, 2018.
 [33] M. A. Erdogdu, L. Mackey, and O. Shamir. Global Nonconvex Optimization with Discretized Diffusions. In Advances in Neural Information Processing Systems, pages 9693–9702, 2018.
 [34] Xuefeng Gao, Mert Gurbuzbalaban, and Lingjiong Zhu. Breaking Reversibility Accelerates Langevin Dynamics for Global NonConvex Optimization. arXiv eprints, page arXiv:1812.07725, Dec 2018.
 [35] Xuefeng Gao, Mert Gürbüzbalaban, and Lingjiong Zhu. Global Convergence of Stochastic Gradient Hamiltonian Monte Carlo for NonConvex Stochastic Optimization: NonAsymptotic Performance Bounds and MomentumBased Acceleration. arXiv eprints, page arXiv:1809.04618, Sep 2018.
 [36] Yuanzhi Li and Yang Yuan. Convergence Analysis of Twolayer Neural Networks with ReLU Activation. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 597–607. Curran Associates, Inc., 2017.
 [37] Levent Sagun, Leon Bottou, and Yann LeCun. Eigenvalues of the hessian in deep learning: Singularity and beyond. arXiv preprint arXiv:1611.07476, 2016.
 [38] Vardan Papyan. The full spectrum of deep net hessians at scale: Dynamics with sample size. arXiv preprint arXiv:1811.07062, 2018.
 [39] R. Mikulevičius and Fanhui Xu. On the rate of convergence of strong Euler approximation for SDEs driven by Lévy processes. Stochastics, 90(4):569-604, 2018.
 [40] A. S. Dalalyan. Theoretical guarantees for approximate sampling from smooth and logconcave densities. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 79(3):651–676, 2017.
 [41] Arnaud Debussche and Nicolas Fournier. Existence of densities for stable-like driven SDEs with Hölder continuous coefficients. Journal of Functional Analysis, 264(8):1757-1778, 2013.
 [42] Torgny Lindvall. Lectures on the coupling method. Courier Corporation, 2002.
 [43] Erhan Bayraktar, Sergey Nadtochiy, et al. Weak reflection principle for Lévy processes. The Annals of Applied Probability, 25(6):3251–3294, 2015.
 [44] Longjie Xie and Xicheng Zhang. Ergodicity of stochastic differential equations with jumps and singular coefficients. arXiv preprint arXiv:1705.07402, 2017.
 [45] Andreas Winkelbauer. Moments and absolute moments of the normal distribution. arXiv preprint arXiv:1209.4340, 2012.
7 Appendix
7.1 Proof of Theorem 2
Proof.
Lemma 1.
There exist constants , and such that:
Proof.
We have for ,
For , using that , we get:
Then the Gronwall lemma gives:
Hence,
By Lemma 7.1 in [44], Lemma S4 in [14] and Markov’s inequality, for any , we have:
where is a constant independent of and . By Lemma 3, we have:
and
Finally, we get:
∎
Now we prove the following lemma.
Lemma 2.
There exist constants and such that: