Inherent Noise in Gradient Based Methods

05/26/2020
by Arushi Gupta, et al.

Previous work has analyzed gradient-based algorithms such as GD and SGD to explain why larger-capacity neural networks can generalize better than smaller ones, even without explicit regularizers. The presence of noise, and its effect on robustness to parameter perturbations, has been linked to generalization. We examine a property of GD and SGD: rather than iterating through the scalar weights of the network and updating them one at a time, GD (and SGD) updates all the parameters simultaneously. As a result, each parameter w^i computes its partial derivative at the stale iterate w_t, but the loss L̂(w_{t+1}) is then incurred at the updated point w_{t+1}. We show that this introduces noise into the optimization, and that this noise penalizes models that are sensitive to perturbations of their weights. The penalty is most pronounced for the batches currently being used for the update, and is larger for larger models.
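To make the staleness effect concrete, here is a minimal numerical sketch (not the authors' code): in a simultaneous GD step every coordinate uses a partial derivative evaluated at the stale iterate w_t, yet the loss is paid at w_{t+1}, whereas a hypothetical coordinate-by-coordinate update re-evaluates the gradient before each single-coordinate step. The quadratic loss, curvature matrix H, step size, and dimension below are illustrative assumptions of ours, chosen only to expose the gap between the two update rules.

```python
# Minimal sketch (illustrative assumptions, not the paper's construction):
# compare a standard simultaneous GD step, where every coordinate uses the
# gradient taken at the stale point w_t, against a hypothetical sequential
# update that refreshes the gradient after each single-coordinate step.
import numpy as np

rng = np.random.default_rng(0)
d = 10
A = rng.standard_normal((d, d))
H = A @ A.T + 0.1 * np.eye(d)       # positive-definite "curvature" matrix

def loss(w):
    return 0.5 * w @ H @ w          # simple quadratic surrogate loss

def grad(w):
    return H @ w

w0 = rng.standard_normal(d)
lr = 0.01

# Simultaneous (standard) GD step: all coordinates use grad(w0).
w_simultaneous = w0 - lr * grad(w0)

# Hypothetical sequential update: one coordinate at a time, with the
# partial derivative recomputed at the current (fresh) point.
w_sequential = w0.copy()
for i in range(d):
    w_sequential[i] -= lr * grad(w_sequential)[i]

print("loss at w_t            :", loss(w0))
print("loss after simultaneous:", loss(w_simultaneous))
print("loss after sequential  :", loss(w_sequential))
print("gap (simultaneous - sequential):",
      loss(w_simultaneous) - loss(w_sequential))
```

In this toy quadratic the gap between the two updates is of order lr^2 and is governed by the off-diagonal entries of H, i.e., by how strongly the coordinates interact; it vanishes when H is diagonal, in which case updating the parameters jointly or one at a time is equivalent.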


Related research

10/12/2020 - Towards Theoretically Understanding Why SGD Generalizes Better Than ADAM in Deep Learning
08/13/2023 - Understanding the robustness difference between stochastic gradient descent and adaptive gradient methods
06/11/2021 - Label Noise SGD Provably Prefers Flat Global Minimizers
01/31/2023 - Dissecting the Effects of SGD Noise in Distinct Regimes of Deep Learning
06/16/2017 - A Closer Look at Memorization in Deep Networks
04/14/2022 - RankNEAT: Outperforming Stochastic Gradient Search in Preference Learning Tasks
02/10/2021 - On Minibatch Noise: Discrete-Time SGD, Overparametrization, and Bayes
