Non-Gaussianity of Stochastic Gradient Noise

10/21/2019
by Abhishek Panigrahi et al.

What enables Stochastic Gradient Descent (SGD) to achieve better generalization than Gradient Descent (GD) in neural network training? This question has attracted much attention. In this paper, we study the distribution of the Stochastic Gradient Noise (SGN) vectors during training. We observe that for batch sizes 256 and above, the distribution is best described as Gaussian, at least in the early phases of training. This holds across datasets, architectures, and other choices.
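To make the object of study concrete, below is a minimal, hypothetical sketch of how one might collect SGN samples and check them for Gaussianity. The toy linear model, synthetic data, batch size, and the Shapiro-Wilk normality test are illustrative assumptions, not the paper's exact experimental protocol; here an SGN sample is taken to be a minibatch gradient minus the full-batch gradient.

```python
# Hypothetical sketch: estimate Stochastic Gradient Noise (SGN) samples and
# test a coordinate for Gaussianity. Model, data, and test are illustrative.
import torch
import torch.nn as nn
from scipy import stats

torch.manual_seed(0)

# Toy setup (assumption): a small linear model on synthetic regression data.
X = torch.randn(2048, 20)
y = torch.randn(2048, 1)
model = nn.Linear(20, 1)
loss_fn = nn.MSELoss()

def flat_grad(inputs, targets):
    """Return the loss gradient over the given samples as one flat vector."""
    model.zero_grad()
    loss = loss_fn(model(inputs), targets)
    loss.backward()
    return torch.cat([p.grad.flatten() for p in model.parameters()])

# Full-batch gradient serves as the reference "true" gradient.
full_grad = flat_grad(X, y)

# SGN sample = minibatch gradient - full-batch gradient.
batch_size = 256
noise_samples = []
for _ in range(200):
    idx = torch.randint(0, X.shape[0], (batch_size,))
    noise_samples.append(flat_grad(X[idx], y[idx]) - full_grad)
noise = torch.stack(noise_samples)  # shape: (num_samples, num_params)

# Check one noise coordinate for normality (Shapiro-Wilk as an example test).
stat, p_value = stats.shapiro(noise[:, 0].numpy())
print(f"Shapiro-Wilk statistic={stat:.3f}, p-value={p_value:.3f}")
```

In this sketch, repeating the test across coordinates (or using a multivariate criterion) and across training iterations would be the natural way to probe how Gaussianity depends on batch size and training phase.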
