Stochastic gradient descent introduces an effective landscape-dependent regularization favoring flat solutions

06/02/2022
by Ning Yang, et al.

Generalization is one of the most important problems in deep learning (DL). In the overparameterized regime of neural networks, there exist many low-loss solutions that fit the training data equally well. The key question is which solution is more generalizable. Empirical studies have shown a strong correlation between the flatness of the loss landscape at a solution and its generalizability, and stochastic gradient descent (SGD) is crucial for finding flat solutions. To understand how SGD drives the learning system to flat solutions, we construct a simple model whose loss landscape has a continuous set of degenerate (or near-degenerate) minima. By solving the Fokker-Planck equation of the underlying stochastic learning dynamics, we show that, due to its strong anisotropy, the SGD noise introduces an additional effective loss term that decreases with flatness and has an overall strength that increases with the learning rate and batch-to-batch variation. We find that this additional landscape-dependent SGD loss breaks the degeneracy and serves as an effective regularization for finding flat solutions. Furthermore, stronger SGD noise shortens the convergence time to the flat solutions. However, we identify an upper bound for the SGD noise beyond which the system fails to converge. Our results not only elucidate the role of SGD in generalization but may also have important implications for hyperparameter selection, enabling efficient learning without divergence.
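As a rough illustration of the mechanism described above (not the specific model or Fokker-Planck derivation of the paper), the sketch below runs minibatch SGD on a hypothetical two-parameter per-sample loss whose full-batch minima form an exactly degenerate valley at y = 0 with a sharpness a(x) that varies along the valley. The construction, the choice a(x) = 1 + x^2, and all hyperparameters (sigma, learning rate, batch sizes, step counts) are illustrative assumptions.

```python
# Illustrative sketch (assumed toy model, not the paper's): minibatch SGD on a
# per-sample loss
#   l_i(x, y) = 0.5 * a(x) * y**2 + sigma * z_i * sqrt(a(x)) * y,
# where the fixed z_i have zero empirical mean. The full-batch loss is then
# 0.5 * a(x) * y**2 for every x: a degenerate valley at y = 0 whose sharpness
# a(x) varies along x, while minibatch gradient noise grows with a(x).
import numpy as np

rng = np.random.default_rng(0)

N, sigma = 1000, 1.0
z = rng.standard_normal(N)
z -= z.mean()                       # zero empirical mean -> exactly degenerate valley

a = lambda x: 1.0 + x**2            # sharpness (curvature in y) along the valley
da = lambda x: 2.0 * x              # derivative of a(x)

def minibatch_grad(x, y, idx):
    """Gradient of the average per-sample loss over the minibatch idx."""
    zb = z[idx].mean()              # batch-to-batch variation enters only here
    gx = 0.5 * da(x) * y**2 + sigma * zb * da(x) / (2.0 * np.sqrt(a(x))) * y
    gy = a(x) * y + sigma * zb * np.sqrt(a(x))
    return gx, gy

def run_sgd(batch_size, lr=0.1, steps=20_000, x0=2.0, y0=1.0):
    x, y = x0, y0
    for _ in range(steps):
        idx = rng.choice(N, size=batch_size, replace=False)
        gx, gy = minibatch_grad(x, y, idx)
        x, y = x - lr * gx, y - lr * gy
    return x

# Full batch (no gradient noise) settles near the sharp end of the valley;
# smaller batches drift along the degenerate valley toward the flat region x ~ 0.
for B in (N, 64, 8):
    xf = run_sgd(B)
    print(f"batch size {B:4d}: final x = {xf:+.3f}, sharpness a(x) = {a(xf):.3f}")
```

In this sketch the drift along the valley scales with the learning rate and with the variance of the minibatch average of the z_i, mirroring the stated dependence of the effective SGD loss on the learning rate and batch-to-batch variation.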



Related research

- The Regularization Effects of Anisotropic Noise in Stochastic Gradient Descent (03/01/2018): Understanding the generalization of deep learning has raised lots of con...
- Low-Pass Filtering SGD for Recovering Flat Optima in the Deep Learning Optimization Landscape (01/20/2022): In this paper, we study the sharpness of a deep learning (DL) loss lands...
- SGD May Never Escape Saddle Points (07/25/2021): Stochastic gradient descent (SGD) has been deployed to solve highly non-...
- Logarithmic landscape and power-law escape rate of SGD (05/20/2021): Stochastic gradient descent (SGD) undergoes complicated multiplicative n...
- Correlated Noise in Epoch-Based Stochastic Gradient Descent: Implications for Weight Variances (06/08/2023): Stochastic gradient descent (SGD) has become a cornerstone of neural net...
- On regularization of gradient descent, layer imbalance and flat minima (07/18/2020): We analyze the training dynamics for deep linear networks using a new me...
- Stochastic Collapse: How Gradient Noise Attracts SGD Dynamics Towards Simpler Subnetworks (06/07/2023): In this work, we reveal a strong implicit bias of stochastic gradient de...
