Neural Mechanics: Symmetry and Broken Conservation Laws in Deep Learning Dynamics
Predicting the dynamics of neural network parameters during training is one of the key challenges in building a theoretical foundation for deep learning. A central obstacle is that the motion of a network in high-dimensional parameter space undergoes discrete finite steps along complex stochastic gradients derived from real-world datasets. We circumvent this obstacle through a unifying theoretical framework based on intrinsic symmetries embedded in a network's architecture that are present for any dataset. We show that any such symmetry imposes stringent geometric constraints on gradients and Hessians, leading to an associated conservation law in the continuous-time limit of stochastic gradient descent (SGD), akin to Noether's theorem in physics. We further show that finite learning rates used in practice can actually break these symmetry-induced conservation laws. We apply tools from finite difference methods to derive modified gradient flow, a differential equation that better approximates the numerical trajectory taken by SGD at finite learning rates. We combine modified gradient flow with our framework of symmetries to derive exact integral expressions for the dynamics of certain parameter combinations. We empirically validate our analytic predictions for learning dynamics on VGG-16 trained on Tiny ImageNet. Overall, by exploiting symmetry, our work demonstrates that we can analytically describe the learning dynamics of various parameter combinations at finite learning rates and batch sizes for state-of-the-art architectures trained on any dataset.
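The symmetry-to-conservation-law argument in the abstract can be illustrated with a minimal numerical sketch (this is a toy example, not the paper's code). For a scale-invariant loss, such as one produced by weights feeding into a normalization layer, the gradient is always orthogonal to the weight vector, so the squared weight norm is conserved under continuous-time gradient flow; a single discrete SGD step with finite learning rate breaks this conservation by exactly the second-order term. The loss function and dimensions below are arbitrary assumptions for illustration.

```python
import numpy as np

def loss(w, x):
    # Toy scale-invariant loss: depends on w only through its direction,
    # so loss(a * w, x) == loss(w, x) for any a > 0.
    u = w / np.linalg.norm(w)
    return -float(u @ x)

def grad(w, x, eps=1e-6):
    # Central finite-difference gradient of the toy loss.
    g = np.zeros_like(w)
    for i in range(w.size):
        e = np.zeros_like(w)
        e[i] = eps
        g[i] = (loss(w + e, x) - loss(w - e, x)) / (2 * eps)
    return g

rng = np.random.default_rng(0)
w = rng.normal(size=5)
x = rng.normal(size=5)
g = grad(w, x)

# Scale symmetry forces <w, grad> = 0, so d||w||^2/dt = 0 under
# gradient flow: a Noether-like conservation law.
print(abs(w @ g))  # numerically ~ 0

# A finite learning rate breaks the conservation law:
# ||w - eta * g||^2 = ||w||^2 + eta^2 * ||g||^2 (since <w, g> = 0),
# so the squared norm grows by eta^2 * ||g||^2 each step.
eta = 0.1
w_next = w - eta * g
broken = np.linalg.norm(w_next) ** 2 - np.linalg.norm(w) ** 2
print(broken, eta ** 2 * np.linalg.norm(g) ** 2)  # these two agree
```

The second printout makes the abstract's point concrete: the discrete update drifts away from the conserved quantity at a rate set by the squared learning rate, which is what the modified-gradient-flow analysis captures.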