When does SGD favor flat minima? A quantitative characterization via linear stability

07/06/2022
by Lei Wu, et al.

The observation that stochastic gradient descent (SGD) favors flat minima has played a fundamental role in understanding the implicit regularization of SGD and in guiding the tuning of hyperparameters. In this paper, we provide a quantitative explanation of this striking phenomenon by relating the particular noise structure of SGD to its linear stability (Wu et al., 2018). Specifically, we consider training over-parameterized models with square loss. We prove that if a global minimum θ^* is linearly stable for SGD, then it must satisfy ‖H(θ^*)‖_F ≤ O(√B/η), where ‖H(θ^*)‖_F, B, and η denote the Frobenius norm of the Hessian at θ^*, the batch size, and the learning rate, respectively. Otherwise, SGD escapes from that minimum exponentially fast. Hence, for minima accessible to SGD, the flatness, as measured by the Frobenius norm of the Hessian, is bounded independently of the model size and the sample size. The key to obtaining these results is exploiting the particular geometry awareness of SGD noise: 1) the noise magnitude is proportional to the loss value; 2) the noise directions concentrate in the sharp directions of the local landscape. This property of SGD noise provably holds for linear networks and random feature models (RFMs) and is empirically verified for nonlinear networks. Moreover, the validity and practical relevance of our theoretical findings are justified by extensive numerical experiments.
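The stability bound ‖H(θ^*)‖_F ≤ O(√B/η) can be probed numerically. Below is a minimal sketch (assuming PyTorch; the toy two-layer tanh network, the random data, and the hyperparameters are illustrative placeholders, not the paper's experimental setup) that trains an over-parameterized model with square loss via SGD for a few (η, B) settings and then compares the Frobenius norm of the Hessian at the reached minimum against the scale √B/η. Since the bound carries an unspecified constant, only the relative scaling across settings is meaningful.

# Illustrative sketch: compare ||H(theta*)||_F at SGD minima with sqrt(B)/eta.
# Assumes PyTorch; network, data, and hyperparameters are placeholders.
import torch

torch.manual_seed(0)

# Tiny regression task: n samples, a two-layer tanh net with many more
# parameters than samples (over-parameterized, so global minima exist).
n, d, width = 20, 5, 50
X = torch.randn(n, d)
y = torch.randn(n, 1)


def loss_fn(theta, idx=slice(None)):
    """Square loss of the two-layer net; all parameters flattened in theta."""
    w1 = theta[: width * d].view(width, d)
    w2 = theta[width * d:].view(1, width)
    pred = torch.tanh(X[idx] @ w1.t()) @ w2.t()
    return 0.5 * ((pred - y[idx]) ** 2).mean()


def train_sgd(lr, batch, steps=20000):
    """Plain SGD with mini-batches sampled with replacement."""
    theta = (0.5 * torch.randn(width * d + width)).requires_grad_(True)
    for _ in range(steps):
        idx = torch.randint(0, n, (batch,))
        grad, = torch.autograd.grad(loss_fn(theta, idx), theta)
        theta = (theta - lr * grad).detach().requires_grad_(True)
    return theta.detach()


for lr, batch in [(0.01, 2), (0.01, 16), (0.05, 2)]:
    theta = train_sgd(lr, batch)
    # Full Hessian of the training loss at the reached parameters.
    H = torch.autograd.functional.hessian(loss_fn, theta)
    frob = torch.linalg.matrix_norm(H).item()  # Frobenius norm by default
    print(f"eta={lr}, B={batch:2d}: loss={loss_fn(theta).item():.2e}, "
          f"||H||_F={frob:6.2f}, sqrt(B)/eta={batch ** 0.5 / lr:7.2f}")

If the linear-stability picture holds, sharper minima (larger ‖H‖_F) should only be reached when √B/η is large, i.e. with larger batches or smaller learning rates; the sketch simply prints both quantities side by side for visual comparison.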

