The Probabilistic Stability of Stochastic Gradient Descent

03/23/2023
by Liu Ziyin et al.

A fundamental open problem in deep learning theory is how to define and understand the stability of stochastic gradient descent (SGD) close to a fixed point. The conventional literature relies on the convergence of statistical moments of the parameters, especially the variance, to quantify stability. We revisit the definition of stability for SGD and instead use convergence in probability to define the probabilistic stability of SGD. The proposed notion of stability directly addresses a fundamental question in deep learning theory: how SGD selects a meaningful solution for a neural network from an enormous number of solutions that may overfit badly. To this end, we show that only through the lens of probabilistic stability does SGD exhibit rich and practically relevant phases of learning, such as complete loss of stability, incorrect learning, convergence to low-rank saddles, and correct learning. When applied to a neural network, these phase diagrams imply that SGD prefers low-rank saddles when the underlying gradient is noisy, thereby improving learning performance. This result stands in sharp contrast to the conventional wisdom that SGD prefers flatter minima over sharp ones, which we find insufficient to explain the experimental data. We also prove that the probabilistic stability of SGD can be quantified by the Lyapunov exponents of the SGD dynamics, which can easily be measured in practice. Our work potentially opens a new avenue for addressing the fundamental question of how the learning algorithm affects the learning outcome in deep learning.
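
As a concrete, purely illustrative example of the quantities named above (not taken from the paper), consider SGD on a toy one-dimensional least-squares loss 0.5 * (x * w)^2 with a fresh sample x_t drawn at every step. Near the fixed point w* = 0 the update is the multiplicative map w_{t+1} = (1 - lr * x_t^2) * w_t, so its Lyapunov exponent is E[log|1 - lr * x_t^2|]; when this exponent is negative, w_t converges to 0 in probability even though moments such as E[w_t^2] can diverge, which is precisely the gap between probabilistic and moment-based stability described in the abstract. The sketch below estimates the exponent numerically; the function name, learning rates, and noise level are assumptions chosen for illustration, not the authors' experimental setup.

import numpy as np

def sgd_lyapunov_exponent(lr, noise=2.0, steps=100_000, seed=0):
    """Estimate E[log|1 - lr * x^2|], the Lyapunov exponent of the
    linearized SGD map w_{t+1} = (1 - lr * x_t^2) * w_t around w* = 0."""
    rng = np.random.default_rng(seed)
    x = 1.0 + noise * rng.standard_normal(steps)  # per-sample inputs x_t
    return float(np.mean(np.log(np.abs(1.0 - lr * x ** 2))))

for lr in (0.05, 0.5, 1.5):
    lam = sgd_lyapunov_exponent(lr)
    status = "stable in probability" if lam < 0 else "unstable"
    print(f"lr={lr}: lambda={lam:+.3f} ({status})")

Sweeping the learning rate this way traces out the boundary where the exponent crosses zero, which is the kind of phase boundary the abstract refers to; for an actual network one would instead track the average log growth rate of a small parameter perturbation along the SGD trajectory.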
