Eliminating Sharp Minima from SGD with Truncated Heavy-tailed Noise

02/08/2021
by Xingyu Wang, et al.

The empirical success of deep learning is often attributed to SGD's mysterious ability to avoid sharp local minima in the loss landscape, which are well known to lead to poor generalization. Recently, empirical evidence of heavy-tailed gradient noise was reported in many deep learning tasks; in the presence of such heavy-tailed noise, SGD can be shown to escape sharp local minima, providing a partial solution to the mystery. In this work, we analyze a popular variant of SGD in which gradients are truncated above a fixed threshold. We show that it achieves a stronger notion of avoiding sharp minima: it can effectively eliminate sharp local minima entirely from its training trajectory. We characterize the dynamics of truncated SGD driven by heavy-tailed noise. First, we show that the truncation threshold and the width of the attraction field dictate the order of the first exit time from the associated local minimum. Moreover, when the objective function satisfies appropriate structural conditions, we prove that as the learning rate decreases the dynamics of the heavy-tailed SGD closely resemble those of a special continuous-time Markov chain that never visits any sharp minima. We verify our theoretical results with numerical experiments and discuss the implications for the generalizability of SGD in deep learning.
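
The algorithm analyzed here is standard SGD in which each stochastic gradient is truncated (clipped) at a fixed threshold b before the update, so no single step can move the iterate by more than the learning rate times b. Below is a minimal illustrative sketch of that update rule, not the authors' experimental setup: the one-dimensional piecewise-quadratic landscape (a narrow basin around 0 next to a wide basin around 2), the Pareto-type heavy-tailed noise, and all parameter values are assumptions chosen only to make the truncation step and the escape behavior concrete.

```python
import numpy as np

rng = np.random.default_rng(0)

def grad_loss(x):
    # Assumed 1-D landscape: a narrow (sharp) basin centered at 0 whose
    # attraction field ends at x = 0.2, next to a wide basin centered at 2.
    return 5.0 * x if x < 0.2 else 1.0 * (x - 2.0)

def heavy_tailed_noise(alpha=1.5):
    # Symmetric Pareto-type noise with tail P(|Z| > z) ~ z^(-alpha),
    # a stand-in for the heavy-tailed gradient noise discussed above.
    return rng.choice([-1.0, 1.0]) * (rng.pareto(alpha) + 1.0)

def truncated_sgd(x0, eta=0.1, b=5.0, steps=10_000):
    # SGD with gradient truncation: the noisy gradient is clipped to have
    # magnitude at most b, so one step moves the iterate by at most eta * b.
    x, path = x0, [x0]
    for _ in range(steps):
        g = grad_loss(x) + heavy_tailed_noise()
        g = float(np.clip(g, -b, b))   # truncation at threshold b
        x = x - eta * g
        path.append(x)
    return np.array(path)

path = truncated_sgd(x0=0.0)   # start inside the narrow basin

# First exit time from the narrow basin (0 if it is never left).
print("first exit from the narrow basin at step:", int(np.argmax(path > 0.2)))
# With heavy-tailed noise and truncation, the iterate leaves the narrow
# basin quickly and spends most of its time in the wide basin around 2.
print("fraction of steps spent in the wide basin:", np.mean(path > 0.2))
```

In this toy setup a single truncated jump of size up to eta * b is enough to clear the narrow basin, whereas clearing the wide basin requires several consecutive large-noise steps; shrinking eta or b so that eta * b falls below a basin's half-width likewise forces multi-step escapes, which is roughly the mechanism behind the exit-time scaling described in the abstract.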


research · 10/12/2020
Towards Theoretically Understanding Why SGD Generalizes Better Than ADAM in Deep Learning
It is not clear yet why ADAM-alike adaptive gradient algorithms suffer f...

research · 12/06/2019
Why ADAM Beats SGD for Attention Models
While stochastic gradient descent (SGD) is still the de facto algorithm ...

research · 09/22/2020
Anomalous diffusion dynamics of learning in deep neural networks
Learning in deep neural networks (DNNs) is implemented through minimizin...

research · 05/21/2018
Learning with Non-Convex Truncated Losses by SGD
Learning with a convex loss function has been a dominating paradigm for...

research · 11/29/2019
On the Heavy-Tailed Theory of Stochastic Gradient Descent for Deep Neural Networks
The gradient noise (GN) in the stochastic gradient descent (SGD) algorit...

research · 11/01/2021
A variance principle explains why dropout finds flatter minima
Although dropout has achieved great success in deep learning, little is ...

research · 03/02/2023
Why (and When) does Local SGD Generalize Better than SGD?
Local SGD is a communication-efficient variant of SGD for large-scale tr...
