Fractional Underdamped Langevin Dynamics: Retargeting SGD with Momentum under Heavy-Tailed Gradient Noise

02/13/2020
by Umut Şimşekli et al.

Stochastic gradient descent with momentum (SGDm) is one of the most popular optimization algorithms in deep learning. While there is a rich theory of SGDm for convex problems, the theory is considerably less developed in the context of deep learning, where the problem is non-convex and the gradient noise might exhibit a heavy-tailed behavior, as empirically observed in recent studies. In this study, we consider a continuous-time variant of SGDm, known as the underdamped Langevin dynamics (ULD), and investigate its asymptotic properties under heavy-tailed perturbations. Supported by recent studies from statistical physics, we argue both theoretically and empirically that the heavy tails of such perturbations can result in a bias even when the step-size is small, in the sense that the optima of the stationary distribution of the dynamics might not match the optima of the cost function to be optimized. As a remedy, we develop a novel framework, which we coin as fractional ULD (FULD), and prove that FULD targets the so-called Gibbs distribution, whose optima exactly match the optima of the original cost. We observe that the Euler discretization of FULD has noteworthy algorithmic similarities with natural gradient methods and gradient clipping, bringing a new perspective on understanding their role in deep learning. We support our theory with experiments conducted on a synthetic model and neural networks.
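To make the setting concrete, below is a minimal illustrative sketch of an Euler discretization of underdamped Langevin dynamics driven by symmetric α-stable (heavy-tailed) noise, the regime the abstract describes. The function names, the toy quadratic cost, the step-size and friction values, and the use of scipy.stats.levy_stable to draw the stable increments are all assumptions for illustration; this is not the paper's FULD algorithm.

```python
# Sketch (not the paper's exact algorithm): Euler-discretized underdamped
# Langevin dynamics with symmetric alpha-stable perturbations.
import numpy as np
from scipy.stats import levy_stable

def grad_f(x):
    # Toy quadratic cost f(x) = ||x||^2 / 2; replace with the gradient of interest.
    return x

def uld_alpha_stable(x0, steps=10_000, eta=1e-3, gamma=1.0, alpha=1.8, scale=0.1, seed=0):
    rng = np.random.default_rng(seed)
    x = np.array(x0, dtype=float)
    v = np.zeros_like(x)
    for _ in range(steps):
        # Symmetric alpha-stable increment (beta=0); over a step of length eta the
        # increment of an alpha-stable Levy process scales like eta**(1/alpha).
        noise = levy_stable.rvs(alpha, 0.0, size=x.shape, random_state=rng)
        v = v - eta * gamma * v - eta * grad_f(x) + scale * eta ** (1.0 / alpha) * noise
        x = x + eta * v
    return x

print(uld_alpha_stable(x0=[5.0, -3.0]))
```

Per the abstract, plain ULD of this form can be biased under heavy-tailed noise: the optima of its stationary distribution need not coincide with the optima of the cost. FULD modifies the dynamics so that the stationary distribution is the Gibbs distribution of the original cost; the sketch above only illustrates the baseline setting being corrected.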


Related research:

06/21/2019 · First Exit Time Analysis of Stochastic Gradient Descent Under Heavy-Tailed Gradient Noise
Stochastic gradient descent (SGD) has been widely used in machine learni...

06/08/2020 · The Heavy-Tail Phenomenon in SGD
In recent years, various notions of capacity and complexity have been pr...

01/22/2019 · Non-Asymptotic Analysis of Fractional Langevin Monte Carlo for Non-Convex Optimization
Recent studies on diffusion-based sampling methods have shown that Lange...

05/23/2022 · Chaotic Regularization and Heavy-Tailed Limits for Deterministic Gradient Descent
Recent studies have shown that gradient descent (GD) can achieve improve...

06/11/2020 · Multiplicative noise and heavy tails in stochastic optimization
Although stochastic optimization is central to modern machine learning, ...

06/02/2022 · Algorithmic Stability of Heavy-Tailed Stochastic Gradient Descent on Least Squares
Recent studies have shown that heavy tails can emerge in stochastic opti...

04/26/2022 · An Empirical Study of the Occurrence of Heavy-Tails in Training a ReLU Gate
A particular direction of recent advance about stochastic deep-learning ...
