Stochastic gradient descent performs variational inference, converges to limit cycles for deep networks

10/30/2017
by Pratik Chaudhari, et al.

Stochastic gradient descent (SGD) is widely believed to perform implicit regularization when used to train deep neural networks, but the precise manner in which this occurs has thus far been elusive. We prove that SGD minimizes an average potential over the posterior distribution of weights along with an entropic regularization term. This potential, however, is not the original loss function in general. So SGD does perform variational inference, but for a different loss than the one used to compute the gradients. Even more surprisingly, SGD does not even converge in the classical sense: we show that the most likely trajectories of SGD for deep networks do not behave like Brownian motion around critical points. Instead, they resemble closed loops with deterministic components. We prove that such "out-of-equilibrium" behavior is a consequence of the fact that the gradient noise in SGD is highly non-isotropic; the covariance matrix of mini-batch gradients has a rank as small as 1% of its dimension. We provide extensive empirical validation of these claims, which are proven in the appendix.
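The low-rank claim about the mini-batch gradient covariance lends itself to a quick empirical check. Below is a minimal sketch (not from the paper; it uses a synthetic logistic-regression problem rather than a deep network, and the 90% variance threshold is an arbitrary choice) of estimating the covariance of mini-batch gradients at a fixed weight vector and counting how many eigen-directions carry most of the noise variance.

```python
# Minimal sketch (assumption: synthetic data and a tiny logistic-regression
# model, not the networks studied in the paper) of estimating the covariance
# of mini-batch gradients and measuring its effective rank.
import numpy as np

rng = np.random.default_rng(0)

# Synthetic binary-classification data.
n, d = 2000, 20
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = (X @ w_true + 0.5 * rng.normal(size=n) > 0).astype(float)

def minibatch_grad(w, batch_idx):
    """Gradient of the average logistic loss on one mini-batch."""
    Xb, yb = X[batch_idx], y[batch_idx]
    p = 1.0 / (1.0 + np.exp(-(Xb @ w)))
    return Xb.T @ (p - yb) / len(batch_idx)

# Fix the weights and sample many mini-batch gradients at that point.
w = 0.1 * rng.normal(size=d)
batch_size, n_batches = 32, 500
grads = np.stack([
    minibatch_grad(w, rng.choice(n, size=batch_size, replace=False))
    for _ in range(n_batches)
])

# Covariance of the gradient noise around the mean (full-batch) gradient.
noise = grads - grads.mean(axis=0)
C = noise.T @ noise / n_batches
eigvals = np.linalg.eigvalsh(C)[::-1]

# "Effective rank": eigen-directions carrying 90% of the noise variance.
cum = np.cumsum(eigvals) / eigvals.sum()
eff_rank = int(np.searchsorted(cum, 0.90) + 1)
print(f"effective rank: {eff_rank} of {d} dimensions")
```

The same measurement on a deep network (per-mini-batch gradients flattened over all parameters) is what the abstract refers to: the noise covariance concentrates in a tiny fraction of the available directions, which is what makes the gradient noise highly non-isotropic.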


research
01/18/2019

Quasi-potential as an implicit regularizer for the loss function in the stochastic gradient descent

We interpret the variational inference of the Stochastic Gradient Descen...
research
10/27/2019

A geometric interpretation of stochastic gradient descent using diffusion metrics

Stochastic gradient descent (SGD) is a key ingredient in the training of...
research
08/13/2023

Law of Balance and Stationary Distribution of Stochastic Gradient Descent

The stochastic gradient descent (SGD) algorithm is the algorithm we use ...
research
12/24/2022

Visualizing Information Bottleneck through Variational Inference

The Information Bottleneck theory provides a theoretical and computation...
research
05/20/2017

Stabilizing Adversarial Nets With Prediction Methods

Adversarial neural networks solve many important problems in data scienc...
research
06/05/2023

Decentralized SGD and Average-direction SAM are Asymptotically Equivalent

Decentralized stochastic gradient descent (D-SGD) allows collaborative l...
research
03/18/2021

A deep learning theory for neural networks grounded in physics

In the last decade, deep learning has become a major component of artifi...
