A Diffusion Theory for Deep Learning Dynamics: Stochastic Gradient Descent Escapes From Sharp Minima Exponentially Fast

02/10/2020
by Zeke Xie et al.

Stochastic optimization algorithms, such as Stochastic Gradient Descent (SGD) and its variants, are the mainstream methods for training deep networks in practice. However, the theoretical mechanism behind gradient noise remains poorly understood. Deep learning is known to find flat minima, which occupy a large neighborhood in parameter space within which every weight vector attains a similarly small error. In this paper, we focus on a fundamental question in deep learning: how does deep learning usually find flat minima among so many minima? To answer this question, we develop a density diffusion theory (DDT) that reveals the fundamental dynamical mechanism of SGD and deep learning. More specifically, we study how the time required to escape from a loss valley depends on the sharpness of the minimum, the gradient noise, and the hyperparameters. One of the most interesting findings is that stochastic gradient noise lets SGD escape sharp minima exponentially faster than flat minima, whereas white noise yields only a polynomial speed-up at sharp minima relative to flat minima. We also find that large-batch training requires exponentially many iterations to pass through sharp minima and reach flat minima. We present direct empirical evidence supporting the proposed theoretical results.
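To make the claimed scaling concrete, the following is a minimal sketch in the standard Kramers escape-time form, assuming a one-dimensional loss valley. The symbols used here (curvature H_a at the minimum a, curvature |H_b| at the barrier b, barrier height ΔL, noise temperature T, learning rate η, batch size B) are notation introduced for this illustration only, and the formulas are a heuristic reading of the abstract's claim, not the paper's exact theorem.

```latex
\documentclass{article}
\usepackage{amsmath}
\begin{document}
% Sketch of the claimed scaling for a one-dimensional loss valley with
% curvature H_a > 0 at the minimum a, curvature |H_b| at the barrier b,
% and barrier height \Delta L = L(b) - L(a). This follows the standard
% Kramers escape-time form and is not the paper's exact result.
\begin{align}
  % Isotropic (white) noise at fixed temperature T: the sharpness H_a
  % appears only in the polynomial prefactor of the escape time.
  \tau_{\mathrm{white}}
    &\approx \frac{2\pi}{\sqrt{H_a \,\lvert H_b \rvert}}
       \exp\!\left(\frac{\Delta L}{T}\right), \\
  % If the noise covariance scales with the local curvature (as stochastic
  % gradient noise approximately does near a minimum), the effective
  % temperature is roughly proportional to \eta H_a / B, so the sharpness
  % moves into the exponent of the escape time.
  \tau_{\mathrm{SGD}}
    &\propto \exp\!\left(\frac{2 B \,\Delta L}{\eta\, H_a}\right).
\end{align}
\end{document}
```

Under this reading, the sharpness H_a enters only the prefactor for white noise but enters the exponent for SGD-like, curvature-dependent noise, which is the sense in which escape from sharp minima is exponentially rather than polynomially faster.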


