What Happens after SGD Reaches Zero Loss? – A Mathematical Framework

10/13/2021
by Zhiyuan Li, et al.

Understanding the implicit bias of Stochastic Gradient Descent (SGD) is one of the key challenges in deep learning, especially for overparametrized models, where the local minimizers of the loss function L can form a manifold. Intuitively, with a sufficiently small learning rate η, SGD tracks Gradient Descent (GD) until it gets close to such a manifold, where the gradient noise prevents further convergence. In this regime, Blanc et al. (2020) proved that SGD with label noise locally decreases a regularizer-like term, the sharpness of the loss, tr[∇^2 L]. The current paper gives a general framework for such analysis by adapting ideas from Katzenberger (1991). It allows, in principle, a complete characterization of the regularization effect of SGD around such a manifold – i.e., the "implicit bias" – using a stochastic differential equation (SDE) describing the limiting dynamics of the parameters, which is determined jointly by the loss function and the noise covariance. This yields some new results: (1) a global analysis of the implicit bias valid for η^-2 steps, in contrast to the local analysis of Blanc et al. (2020) that is only valid for η^-1.6 steps, and (2) the ability to handle arbitrary noise covariance. As an application, we show that with arbitrarily large initialization, label noise SGD can always escape the kernel regime and only requires O(κ ln d) samples for learning a κ-sparse overparametrized linear model in ℝ^d (Woodworth et al., 2020), while GD initialized in the kernel regime requires Ω(d) samples. This upper bound is minimax optimal and improves the previous Õ(κ^2) upper bound (HaoChen et al., 2020).
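The following minimal Python sketch (not from the paper) illustrates the setting the abstract's application refers to: label-noise SGD on the quadratically overparametrized linear model β = u ⊙ u of Woodworth et al. (2020), compared with noiseless SGD started from the same large initialization. All dimensions, step sizes, step counts, and noise levels are illustrative assumptions, not values from the paper.

```python
# Minimal sketch (illustrative, not the paper's code): label-noise SGD on the
# quadratically overparametrized linear model beta = u * u from Woodworth et
# al. (2020). The ground-truth beta* is kappa-sparse; all constants below
# (d, n, kappa, eta, noise level, step counts) are assumptions for the demo.
import numpy as np

rng = np.random.default_rng(0)

d, n, kappa = 50, 30, 3            # ambient dim, samples, sparsity (assumed)
beta_star = np.zeros(d)
beta_star[:kappa] = 1.0            # kappa-sparse ground truth
X = rng.standard_normal((n, d)) / np.sqrt(d)
y = X @ beta_star                  # noiseless labels, so interpolation is possible


def predict(u, x):
    """Overparametrized linear model: f(x; u) = <u * u, x>."""
    return x @ (u * u)


def sgd(u, eta, steps, label_noise_std=0.0):
    """One-sample SGD on squared loss, optionally with i.i.d. label noise."""
    u = u.copy()
    for _ in range(steps):
        i = rng.integers(n)
        y_noisy = y[i] + label_noise_std * rng.standard_normal()
        residual = predict(u, X[i]) - y_noisy
        # gradient of 0.5 * residual^2 w.r.t. u, using d(u*u)/du = 2u
        grad = residual * 2.0 * u * X[i]
        u -= eta * grad
    return u


u0 = np.full(d, 1.0)               # "large" initialization (kernel-like regime)

u_plain = sgd(u0, eta=0.01, steps=100_000, label_noise_std=0.0)
u_label = sgd(u0, eta=0.01, steps=100_000, label_noise_std=0.5)

for name, u in [("plain SGD", u_plain), ("label-noise SGD", u_label)]:
    beta = u * u
    print(f"{name}: train RMSE = {np.sqrt(np.mean((X @ beta - y) ** 2)):.3f}, "
          f"distance to sparse beta* = {np.linalg.norm(beta - beta_star):.3f}")
```

The noise enters only through the perturbed target y_noisy; once the iterates reach the zero-loss manifold, it is this noise that (per the abstract) drives a slow drift that decreases tr[∇^2 L], which in this parametrization favors sparser solutions.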

