On the Validity of Modeling SGD with Stochastic Differential Equations (SDEs)

02/24/2021
by Zhiyuan Li, et al.

It is generally recognized that a finite learning rate (LR), in contrast to an infinitesimal LR, is important for good generalization in real-life deep nets. Most attempted explanations propose approximating finite-LR SGD with Itô Stochastic Differential Equations (SDEs), but formal justification for this approximation (e.g., Li et al., 2019a) only applies to SGD with tiny LR, and direct experimental verification of the approximation appears computationally infeasible. The current paper clarifies the picture with the following contributions: (a) an efficient simulation algorithm, SVAG, that provably converges to the conventionally used Itô SDE approximation; (b) experiments using this simulation to demonstrate that the previously proposed SDE approximation can meaningfully capture the training and generalization properties of common deep nets; (c) a provable and empirically testable necessary condition for the SDE approximation to hold, and hence for its most famous implication, the linear scaling rule (Goyal et al., 2017; Smith et al., 2020), to apply. The analysis also gives rigorous insight into why the SDE approximation may fail.
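To make contribution (a) concrete, below is a minimal sketch of an SVAG-style update step, under the assumption that SVAG amplifies minibatch-gradient noise in the standard way: two independent minibatch gradients are mixed with weights that preserve the mean of the SGD update while amplifying its noise covariance by a factor l, and the step size is shrunk to lr / l, so that as l grows the iterates track the conventionally used Itô SDE dX_t = -∇L(X_t) dt + (lr · Σ(X_t))^{1/2} dW_t, where Σ is the minibatch-gradient covariance. The function name svag_step and the hyperparameter l are illustrative assumptions, not the authors' reference implementation.

```python
# Hypothetical sketch of an SVAG-style update (not the authors' code): two
# independent minibatch gradients are mixed with weights that preserve the
# mean but amplify the noise covariance by a factor of l, while the step
# size shrinks to lr / l; l = 1 recovers plain SGD.
import math
import torch

def svag_step(model, loss_fn, batch1, batch2, lr, l):
    # Mixing weights: a1 + a2 = 1 (mean preserved) and
    # a1**2 + a2**2 = l (noise covariance amplified l-fold).
    a1 = (1 + math.sqrt(2 * l - 1)) / 2
    a2 = (1 - math.sqrt(2 * l - 1)) / 2

    scaled_grads = []
    for batch, coeff in ((batch1, a1), (batch2, a2)):
        model.zero_grad()
        inputs, targets = batch
        loss_fn(model(inputs), targets).backward()
        scaled_grads.append([coeff * p.grad for p in model.parameters()])

    with torch.no_grad():
        for p, g1, g2 in zip(model.parameters(), *scaled_grads):
            p -= (lr / l) * (g1 + g2)
```

Setting l = 1 recovers plain SGD at learning rate lr, and larger l trades extra compute for closer fidelity to the SDE limit. By comparison, the linear scaling rule in contribution (c) says that multiplying the batch size by k should be matched by multiplying the LR by k; for example, batch size 128 at LR 0.1 corresponds to batch size 512 at LR 0.4.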


Related research

03/02/2023 · Why (and When) does Local SGD Generalize Better than SGD?
Local SGD is a communication-efficient variant of SGD for large-scale tr...

04/18/2022 · Multilevel Picard approximations for high-dimensional decoupled forward-backward stochastic differential equations
Backward stochastic differential equations (BSDEs) appear in numerous a...

09/24/2020 · How Many Factors Influence Minima in SGD?
Stochastic gradient descent (SGD) is often applied to train Deep Neural ...

02/02/2022 · Improved quantum algorithms for linear and nonlinear differential equations
We present substantially generalized and improved quantum algorithms ove...

06/15/2021 · Compression Implies Generalization
Explaining the surprising generalization performance of deep neural netw...

10/03/2019 · Harnessing the Power of Infinitely Wide Deep Nets on Small-data Tasks
Recent research shows that the following two models are equivalent: (a) ...
