Rethinking the limiting dynamics of SGD: modified loss, phase space oscillations, and anomalous diffusion

07/19/2021
by Daniel Kunin, et al.

In this work we explore the limiting dynamics of deep neural networks trained with stochastic gradient descent (SGD). We find empirically that long after performance has converged, networks continue to move through parameter space by a process of anomalous diffusion in which distance travelled grows as a power law in the number of gradient updates with a nontrivial exponent. We reveal an intricate interaction between the hyperparameters of optimization, the structure in the gradient noise, and the Hessian matrix at the end of training that explains this anomalous diffusion. To build this understanding, we first derive a continuous-time model for SGD with finite learning rates and batch sizes as an underdamped Langevin equation. We study this equation in the setting of linear regression, where we can derive exact, analytic expressions for the phase space dynamics of the parameters and their instantaneous velocities from initialization to stationarity. Using the Fokker-Planck equation, we show that the key ingredient driving these dynamics is not the original training loss, but rather the combination of a modified loss, which implicitly regularizes the velocity, and probability currents, which cause oscillations in phase space. We identify qualitative and quantitative predictions of this theory in the dynamics of a ResNet-18 model trained on ImageNet. Through the lens of statistical physics, we uncover a mechanistic origin for the anomalous limiting dynamics of deep neural networks trained with SGD.
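To make the central measurement concrete, the sketch below shows one way to estimate the diffusion exponent from a training trajectory: run SGD past convergence, record the Euclidean distance of the parameters from a reference iterate, and fit a power law Δ(t) ∝ t^c on a log-log scale (c = 0.5 is ordinary diffusion; any other exponent is anomalous). This is a minimal illustration on a toy linear-regression problem, not the authors' code; the task, hyperparameters, and function names are assumptions. On a strongly convex toy problem the displacement eventually saturates near the minimum, whereas the paper reports sustained power-law growth for deep networks; the snippet only illustrates how the exponent would be measured.

```python
# Minimal sketch (not the authors' code): measure how far SGD wanders in
# parameter space after the loss has converged, and fit a power-law exponent.
import numpy as np

rng = np.random.default_rng(0)

# Toy linear regression: y = X w* + noise (illustrative assumption).
n_samples, dim = 512, 64
X = rng.normal(size=(n_samples, dim))
w_star = rng.normal(size=dim)
y = X @ w_star + 0.1 * rng.normal(size=n_samples)


def sgd_displacement(lr=0.05, batch=32, burn_in=20_000, steps=100_000):
    """Run plain SGD, discard a burn-in so the loss has converged, then
    record the distance travelled from the post-burn-in reference iterate."""
    w = np.zeros(dim)
    for _ in range(burn_in):
        idx = rng.choice(n_samples, batch, replace=False)
        w -= lr * X[idx].T @ (X[idx] @ w - y[idx]) / batch
    w_ref = w.copy()
    dists = np.empty(steps)
    for t in range(steps):
        idx = rng.choice(n_samples, batch, replace=False)
        w -= lr * X[idx].T @ (X[idx] @ w - y[idx]) / batch
        dists[t] = np.linalg.norm(w - w_ref)
    return dists


dists = sgd_displacement()
t = np.arange(1, len(dists) + 1)
# Fit distance ~ t^c on a log-log scale: c = 0.5 is ordinary diffusion,
# any other exponent signals anomalous diffusion.
c, _ = np.polyfit(np.log(t[100:]), np.log(dists[100:] + 1e-12), 1)
print(f"estimated diffusion exponent c ≈ {c:.2f}")
```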

Related research

09/22/2020 · Anomalous diffusion dynamics of learning in deep neural networks
Learning in deep neural networks (DNNs) is implemented through minimizin...

12/08/2020 · Neural Mechanics: Symmetry and Broken Conservation Laws in Deep Learning Dynamics
Predicting the dynamics of neural network parameters during training is ...

02/24/2018 · A Walk with SGD
Exploring why stochastic gradient descent (SGD) based optimization metho...

02/08/2021 · SGD in the Large: Average-case Analysis, Asymptotics, and Stepsize Criticality
We propose a new framework, inspired by random matrix theory, for analyz...

06/07/2023 · Stochastic Collapse: How Gradient Noise Attracts SGD Dynamics Towards Simpler Subnetworks
In this work, we reveal a strong implicit bias of stochastic gradient de...

04/01/2023 · Doubly Stochastic Models: Learning with Unbiased Label Noises and Inference Stability
Random label noises (or observational noises) widely exist in practical ...

04/11/2023 · Simulations of quantum dynamics with fermionic phase-space representations using numerical matrix factorizations as stochastic gauges
The Gaussian phase-space representation can be used to implement quantum...
