Shape Matters: Understanding the Implicit Bias of the Noise Covariance

06/15/2020
by Jeff Z. HaoChen, et al.

The noise in stochastic gradient descent (SGD) provides a crucial implicit regularization effect for training overparameterized models. Prior theoretical work largely focuses on spherical Gaussian noise, whereas empirical studies show that parameter-dependent noise, induced by mini-batches or label perturbation, is far more effective than Gaussian noise. This paper theoretically characterizes this phenomenon on a quadratically-parameterized model introduced by Vaskevicius et al. and Woodworth et al. We show that in an overparameterized setting, SGD with label noise recovers the sparse ground truth from an arbitrary initialization, whereas SGD with Gaussian noise or gradient descent overfits to dense solutions with large norms. Our analysis reveals that parameter-dependent noise introduces a bias towards local minima with smaller noise variance, whereas spherical Gaussian noise does not. Code for our project is publicly available.
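
To make the contrast between the two noise types concrete, here is a minimal NumPy sketch. It is not the authors' released code. It assumes the quadratically-parameterized model has the form f_v(x) = <v ⊙ v, x> with squared loss, and that label noise means adding ±delta to the label at each step; the function names, `delta`, and `sigma` are hypothetical choices made for illustration.

```python
import numpy as np

def predict(v, x):
    # Assumed model form: f_v(x) = <v * v, x> (elementwise square of v).
    return np.dot(v * v, x)

def grad_clean(v, x, y):
    # Gradient of 0.5 * (f_v(x) - y)^2 with respect to v.
    return 2.0 * (predict(v, x) - y) * v * x

def grad_label_noise(v, x, y, delta, rng):
    # Label-noise SGD (assumed form): perturb the label by +/- delta.
    sign = rng.choice([-1.0, 1.0])
    return grad_clean(v, x, y + delta * sign)

def grad_gaussian_noise(v, x, y, sigma, rng):
    # Spherical Gaussian noise: the same variance in every coordinate,
    # independent of the current parameter v.
    return grad_clean(v, x, y) + sigma * rng.standard_normal(v.shape)

rng = np.random.default_rng(0)
v = np.array([1.0, 0.0, 0.5])   # second coordinate is exactly zero
x = np.array([0.3, -1.2, 0.7])
y = 0.4

clean = grad_clean(v, x, y)
noise_label = grad_label_noise(v, x, y, delta=0.5, rng=rng) - clean
noise_gauss = grad_gaussian_noise(v, x, y, sigma=0.5, rng=rng) - clean

print("label-noise perturbation:", noise_label)
print("Gaussian perturbation:   ", noise_gauss)
```

Under these assumptions the label-noise perturbation equals -2 * delta * sign * (v ⊙ x), so its size scales with |v| coordinate by coordinate and vanishes wherever v is zero, while the Gaussian perturbation has the same scale in every coordinate. This is one way to read the abstract's claim that parameter-dependent noise favors minima with smaller noise variance.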
