The Implicit Regularization of Dynamical Stability in Stochastic Gradient Descent

05/27/2023
by Lei Wu et al.

In this paper, we study the implicit regularization of stochastic gradient descent (SGD) through the lens of dynamical stability (Wu et al., 2018). We start by revising existing stability analyses of SGD, showing how the Frobenius norm and the trace of the Hessian relate to different notions of stability. Notably, if a global minimum is linearly stable for SGD, then the trace of the Hessian must be less than or equal to 2/η, where η denotes the learning rate. By contrast, for gradient descent (GD), stability imposes a similar constraint, but only on the largest eigenvalue of the Hessian. We then analyze the generalization properties of these stable minima, focusing on two-layer ReLU networks and diagonal linear networks. For both models, we establish an equivalence between these sharpness measures and certain parameter norms, which allows us to show that the stable minima of SGD provably generalize well. By contrast, the stability-induced regularization of GD is provably too weak to ensure satisfactory generalization. This discrepancy offers an explanation of why SGD often generalizes better than GD. Moreover, the learning rate (LR) plays a pivotal role in the strength of the stability-induced regularization: as the LR increases, the regularization effect becomes more pronounced, elucidating why SGD with a larger LR consistently exhibits superior generalization. Finally, numerical experiments are provided to support our theoretical findings.
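
To make the two stability criteria above concrete, the following minimal Python sketch (not the authors' code; the learning rate eta and the toy Hessian H are illustrative assumptions) checks the GD condition lambda_max(H) <= 2/eta against the SGD necessary condition tr(H) <= 2/eta at a candidate minimum.

    # Hypothetical sketch: numerically comparing the GD and SGD linear-stability
    # conditions discussed in the abstract at a candidate minimum.
    # GD:  largest eigenvalue of the Hessian must satisfy lambda_max(H) <= 2/eta.
    # SGD: the trace of the Hessian must satisfy tr(H) <= 2/eta (necessary condition).
    import numpy as np

    eta = 0.05                           # learning rate (assumed value)
    rng = np.random.default_rng(0)

    # Toy positive semi-definite matrix standing in for the loss Hessian at a
    # global minimum; in practice H would be computed from the trained model.
    A = rng.standard_normal((10, 10))
    H = A @ A.T / 10.0

    lam_max = np.linalg.eigvalsh(H)[-1]  # largest eigenvalue (ascending order)
    trace_H = np.trace(H)

    print(f"GD linear stability  (lambda_max <= 2/eta): {lam_max <= 2.0 / eta}")
    print(f"SGD necessary condition (tr(H) <= 2/eta):   {trace_H <= 2.0 / eta}")

Since tr(H) upper-bounds lambda_max(H) for a positive semi-definite Hessian, the SGD condition is the more restrictive of the two, which is the sense in which SGD's stability-induced regularization is stronger than GD's.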


