Understanding Long Range Memory Effects in Deep Neural Networks

05/05/2021
by Chengli Tan, et al.

Stochastic gradient descent (SGD) is of fundamental importance in deep learning. Despite its simplicity, elucidating its efficacy remains challenging. Conventionally, the success of SGD is attributed to the stochastic gradient noise (SGN) incurred during training. Based on this consensus, SGD is frequently treated and analyzed as the Euler-Maruyama discretization of a stochastic differential equation (SDE) driven by either Brownian or Lévy stable motion. In this study, we argue that SGN is neither Gaussian nor Lévy stable. Instead, motivated by the long-range correlation emerging in the SGN series, we propose that SGD can be viewed as a discretization of an SDE driven by fractional Brownian motion (FBM), which naturally accounts for the different convergence behaviors of SGD dynamics. Moreover, we derive an approximate first passage time for an SDE driven by FBM: a larger Hurst parameter implies a lower escaping rate, so SGD stays longer in flat minima. This coincides with the well-known observation that SGD favors flat minima, which tend to generalize well. Four groups of experiments are conducted to validate our conjecture, demonstrating that long-range memory effects persist across various model architectures, datasets, and training strategies. Our study opens up a new perspective and may contribute to a better understanding of SGD.
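To make the FBM view concrete, the following is a minimal, illustrative Python sketch (not the paper's implementation or experimental setup): it samples fractional Gaussian noise via a Cholesky factorization of its covariance and uses it to drive an Euler-type discretization of dθ_t = -∇L(θ_t) dt + σ dB^H_t on a toy one-dimensional double-well loss, counting barrier crossings as a rough proxy for the escaping rate. The function names (fgn, grad_double_well, run_sgd_like_sde) and all parameter values are hypothetical choices made for illustration.

```python
import numpy as np

def fgn(n, hurst, rng):
    """Sample n unit-step increments of fractional Brownian motion
    (i.e. fractional Gaussian noise) via a Cholesky factorization of
    its covariance matrix. O(n^3); fine for a small illustrative n."""
    k = np.arange(n)
    # Autocovariance of fGn: gamma(k) = 0.5*(|k+1|^{2H} - 2|k|^{2H} + |k-1|^{2H})
    gamma = 0.5 * (np.abs(k + 1) ** (2 * hurst)
                   - 2 * np.abs(k) ** (2 * hurst)
                   + np.abs(k - 1) ** (2 * hurst))
    cov = gamma[np.abs(k[:, None] - k[None, :])]
    return np.linalg.cholesky(cov) @ rng.standard_normal(n)

def grad_double_well(x):
    # Gradient of a toy double-well loss L(x) = (x^2 - 1)^2 with minima at +-1.
    return 4 * x * (x ** 2 - 1)

def run_sgd_like_sde(hurst, steps=2000, lr=0.01, sigma=2.0, x0=-1.0, seed=0):
    """Euler-type discretization of dx_t = -grad L(x_t) dt + sigma dB^H_t.
    By self-similarity, an FBM increment over a step of size lr has the
    distribution of lr**hurst times a unit-step fGn sample."""
    rng = np.random.default_rng(seed)
    noise = sigma * (lr ** hurst) * fgn(steps, hurst, rng)
    x, escapes = x0, 0
    for k in range(steps):
        x = x - lr * grad_double_well(x) + noise[k]
        if x > 0:            # crossed the barrier at x = 0
            escapes += 1
            x = x0           # restart in the left well (simplification)
    return escapes

for h in (0.5, 0.6, 0.7):    # H = 0.5 recovers ordinary Brownian motion
    print(f"H = {h}: barrier crossings = {run_sgd_like_sde(h)}")
```

Under these assumptions one would expect fewer crossings for larger H at a fixed σ and learning rate, qualitatively matching the lower escaping rate suggested by the first-passage-time analysis; note that both the lr**H amplitude scaling and the positive increment correlation contribute here, and restarting the iterate after a crossing while reusing the precomputed noise path is a simplification made for brevity.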

