The effective noise of Stochastic Gradient Descent

12/20/2021
by Francesca Mignacco et al.

Stochastic Gradient Descent (SGD) is the workhorse algorithm of deep learning technology. At each step of the training phase, a mini-batch of samples is drawn from the training dataset and the weights of the neural network are adjusted according to the performance on this specific subset of examples. The mini-batch sampling procedure introduces stochastic dynamics into the gradient descent, with a non-trivial, state-dependent noise. We characterize the stochasticity of SGD and of a recently introduced variant, persistent SGD, in a prototypical neural network model. In the under-parametrized regime, where the final training error is positive, the SGD dynamics reaches a stationary state and we define an effective temperature from the fluctuation-dissipation theorem, computed from dynamical mean-field theory. We use the effective temperature to quantify the magnitude of the SGD noise as a function of the problem parameters. In the over-parametrized regime, where the training error vanishes, we measure the noise magnitude of SGD by computing the average distance between two replicas of the system with the same initialization and two different realizations of the SGD noise. We find that the two noise measures behave similarly as functions of the problem parameters. Moreover, we observe that noisier algorithms lead to wider decision boundaries of the corresponding constraint satisfaction problem.
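For context on the effective temperature mentioned above: in the standard out-of-equilibrium fluctuation-dissipation construction (the usual definition in this setting; the paper's exact conventions and normalizations may differ), one compares the two-time correlation C(t,t') of the weights with their linear response R(t,t') to a small conjugate field and defines T_eff through

R(t,t') = \frac{1}{T_{\mathrm{eff}}} \, \frac{\partial C(t,t')}{\partial t'}, \qquad t > t',

so that T_eff coincides with the bath temperature when the equilibrium fluctuation-dissipation theorem holds, and is otherwise read off from the slope of the integrated response plotted against the correlation in the stationary state.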

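The replica-based noise measure described in the abstract can be illustrated with a short numerical experiment. The sketch below is a minimal illustration under assumed choices (a linear classifier with logistic loss on synthetic Gaussian data, batch size 10, learning rate 0.5), not the authors' code or model: two copies of the weights start from an identical initialization, each is updated with an independent random sequence of mini-batches, and their distance serves as a proxy for the magnitude of the SGD noise.

import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (illustrative choices): n Gaussian inputs in d dimensions,
# binary labels produced by a random "teacher" direction.
n, d = 2000, 100
X = rng.standard_normal((n, d)) / np.sqrt(d)
y = np.sign(X @ rng.standard_normal(d))

def sgd_step(w, batch_size, lr, batch_rng):
    """One mini-batch SGD step on the logistic loss of a linear classifier."""
    idx = batch_rng.choice(n, size=batch_size, replace=False)   # fresh random mini-batch
    margins = np.clip(y[idx] * (X[idx] @ w), -50.0, 50.0)       # clipped for numerical safety
    grad = -(X[idx] * (y[idx] / (1.0 + np.exp(margins)))[:, None]).mean(axis=0)
    return w - lr * grad

# Two replicas: identical initialization, independent mini-batch sequences.
w0 = 0.01 * rng.standard_normal(d)
wa, wb = w0.copy(), w0.copy()
rng_a, rng_b = np.random.default_rng(1), np.random.default_rng(2)

distances = []
for t in range(5000):
    wa = sgd_step(wa, batch_size=10, lr=0.5, batch_rng=rng_a)   # noise realization A
    wb = sgd_step(wb, batch_size=10, lr=0.5, batch_rng=rng_b)   # noise realization B
    if t % 100 == 0:
        distances.append(np.linalg.norm(wa - wb) / np.sqrt(d))

# The late-time replica distance is a simple proxy for the SGD noise magnitude.
print("replica distance per dimension:", np.mean(distances[-10:]))

Rerunning the loop with a larger batch size, or a smaller learning rate, should shrink the replica distance; that qualitative trend is what this noise measure is meant to expose.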
