The Heavy-Tail Phenomenon in SGD

06/08/2020
by Mert Gurbuzbalaban et al.

In recent years, various notions of capacity and complexity have been proposed for characterizing the generalization properties of stochastic gradient descent (SGD) in deep learning. Some of the popular notions that correlate well with the performance on unseen data are (i) the 'flatness' of the local minimum found by SGD, which is related to the eigenvalues of the Hessian, (ii) the ratio of the stepsize η to the batch size b, which essentially controls the magnitude of the stochastic gradient noise, and (iii) the 'tail-index', which measures the heaviness of the tails of the distribution of the network weights. In this paper, we argue that these three seemingly unrelated perspectives for generalization are deeply linked to each other. We claim that, depending on the structure of the Hessian of the loss at the minimum and the choices of the algorithm parameters η and b, the SGD iterates will converge to a heavy-tailed stationary distribution. We rigorously prove this claim in the setting of linear regression: we show that even in a simple quadratic optimization problem with independent and identically distributed Gaussian data, the iterates can be heavy-tailed with infinite variance. We further characterize the behavior of the tails with respect to the algorithm parameters, the dimension, and the curvature. We then translate our results into insights about the behavior of SGD in deep learning. We finally support our theory with experiments conducted on both synthetic data and neural networks. To our knowledge, these results are the first of their kind to rigorously characterize the empirically observed heavy-tailed behavior of SGD.
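
As a rough illustration of the quadratic-data claim, the sketch below (an assumed setup, not the authors' code) runs many independent SGD chains on a one-dimensional linear regression with i.i.d. Gaussian data and gauges the heaviness of the tails of the final iterates with a Hill estimator. The function names and parameter values (stepsize η = 1.2, batch sizes 1 and 10) are illustrative choices only.

```python
import numpy as np

# Illustrative sketch (not the paper's experimental code): SGD on the
# one-dimensional linear regression loss E[(a*x - y)^2] / 2 with i.i.d.
# standard Gaussian a and y. Each update
#     x <- (1 - eta * mean(a_i^2)) * x + eta * mean(a_i * y_i)
# is a random linear recursion whose stationary distribution can become
# heavy-tailed when the ratio eta / b is large.

rng = np.random.default_rng(0)

def sgd_samples(eta, b, n_runs=2000, n_iter=2000):
    """Run n_runs independent SGD chains for n_iter steps; return the final iterates."""
    x = np.zeros(n_runs)
    for _ in range(n_iter):
        a = rng.standard_normal((n_runs, b))
        y = rng.standard_normal((n_runs, b))
        grad = np.mean(a * (a * x[:, None] - y), axis=1)  # mini-batch gradient
        x = x - eta * grad
    return x

def hill_estimator(samples, k=100):
    """Hill estimate of the tail-index from the k largest magnitudes."""
    order = np.sort(np.abs(samples))[::-1]
    return k / np.sum(np.log(order[:k] / order[k]))

# Same stepsize, different batch sizes: the larger eta/b ratio is expected to
# give a smaller (heavier-tailed) index, here below 2, i.e. infinite variance.
for eta, b in [(1.2, 1), (1.2, 10)]:
    xs = sgd_samples(eta, b)
    print(f"eta={eta}, b={b:2d}: Hill tail-index estimate ~ {hill_estimator(xs):.2f}")
```

With these assumed settings, the b = 1 multiplier 1 − η·a² has second moment greater than one while its log-mean stays negative, so the chain has a stationary distribution with infinite variance; averaging over b = 10 samples tames the multiplier, and the estimated tail-index grows accordingly.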

Related research

06/16/2020 · Hausdorff Dimension, Stochastic Differential Equations, and Generalization in Neural Networks
Despite its success in a wide range of applications, characterizing the ...

05/13/2022 · Heavy-Tail Phenomenon in Decentralized SGD
Recent theoretical studies have shown that heavy-tails can emerge in sto...

02/20/2021 · Convergence Rates of Stochastic Gradient Descent under Infinite Noise Variance
Recent studies have provided both empirical and theoretical evidence ill...

12/06/2019 · Why ADAM Beats SGD for Attention Models
While stochastic gradient descent (SGD) is still the de facto algorithm ...

02/13/2020 · Fractional Underdamped Langevin Dynamics: Retargeting SGD with Momentum under Heavy-Tailed Gradient Noise
Stochastic gradient descent with momentum (SGDm) is one of the most popu...

06/07/2021 · Heavy Tails in SGD and Compressibility of Overparametrized Neural Networks
Neural network compression techniques have become increasingly popular a...

04/26/2022 · An Empirical Study of the Occurrence of Heavy-Tails in Training a ReLU Gate
A particular direction of recent advance about stochastic deep-learning ...
