Theory of Deep Learning IIb: Optimization Properties of SGD

01/07/2018
by Chiyuan Zhang, et al.

In Theory IIb we characterize, with a mix of theory and experiments, the optimization of deep convolutional networks by Stochastic Gradient Descent (SGD). The main new result in this paper is theoretical and experimental evidence for the following conjecture about SGD: like the classical Langevin equation, SGD concentrates in probability on large-volume, "flat" minima, selecting flat minimizers that are, with very high probability, also global minimizers.
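
To make the conjecture concrete, here is a minimal numerical sketch (not from the paper): Langevin-style noisy gradient descent on a hypothetical one-dimensional loss with two equally deep minima, one sharp and one flat. The landscape, step size, noise temperature, and run lengths are illustrative assumptions; under such dynamics most runs settle in the flat, large-volume basin, which is the qualitative behavior the conjecture describes for SGD.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy loss: two global minima of equal depth (both zero),
# a sharp well at x = -1 (high curvature) and a flat well at x = +1 (low curvature).
SHARP_CURV, FLAT_CURV = 10.0, 0.5

def loss(x):
    return min(SHARP_CURV * (x + 1.0) ** 2, FLAT_CURV * (x - 1.0) ** 2)

def grad(x):
    # Gradient of whichever quadratic branch is active (a subgradient at the crossover).
    if SHARP_CURV * (x + 1.0) ** 2 <= FLAT_CURV * (x - 1.0) ** 2:
        return 2.0 * SHARP_CURV * (x + 1.0)
    return 2.0 * FLAT_CURV * (x - 1.0)

def noisy_gd_run(steps=20_000, lr=1e-2, temperature=0.5):
    # Langevin-style noisy gradient step: descent plus Gaussian noise,
    # standing in for the minibatch noise of SGD in this illustration.
    x = rng.uniform(-3.0, 3.0)  # random initialization
    for _ in range(steps):
        x = x - lr * grad(x) + np.sqrt(2.0 * lr * temperature) * rng.normal()
    return x

# Run many independent chains and count which basin each one ends in.
finals = np.array([noisy_gd_run() for _ in range(200)])
ends_flat = np.mean([SHARP_CURV * (x + 1.0) ** 2 > FLAT_CURV * (x - 1.0) ** 2
                     for x in finals])
print(f"fraction of runs ending in the flat basin: {ends_flat:.2f}")
```

With these (assumed) settings the noise lets the iterate hop between basins, and the long-run distribution places most of its mass in the flat well, whose larger volume makes it exponentially easier to stay in.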

Related research

02/10/2020 - A Diffusion Theory for Deep Learning Dynamics: Stochastic Gradient Descent Escapes From Sharp Minima Exponentially Fast
Stochastic optimization algorithms, such as Stochastic Gradient Descent ...

06/04/2021 - Stochastic gradient descent with noise of machine learning type. Part II: Continuous time analysis
The representation of functions by artificial neural networks depends on...

05/05/2021 - Understanding Long Range Memory Effects in Deep Neural Networks
Stochastic gradient descent (SGD) is of fundamental importance in deep l...

11/07/2021 - Quasi-potential theory for escape problem: Quantitative sharpness effect on SGD's escape from local minima
We develop a quantitative theory on an escape problem of a stochastic gr...

03/23/2023 - The Probabilistic Stability of Stochastic Gradient Descent
A fundamental open problem in deep learning theory is how to define and ...

07/18/2020 - On regularization of gradient descent, layer imbalance and flat minima
We analyze the training dynamics for deep linear networks using a new me...

05/20/2022 - On the SDEs and Scaling Rules for Adaptive Gradient Algorithms
Approximating Stochastic Gradient Descent (SGD) as a Stochastic Differen...
