Reconciling Modern Deep Learning with Traditional Optimization Analyses: The Intrinsic Learning Rate

10/06/2020
by Zhiyuan Li, et al.

Recent works (e.g., (Li and Arora, 2020)) suggest that the use of popular normalization schemes (including Batch Normalization) in today's deep learning can move it far from a traditional optimization viewpoint, permitting, for example, exponentially increasing learning rates. The current paper highlights other ways in which the behavior of normalized nets departs from traditional viewpoints, and then initiates a formal framework for studying their mathematics via a suitable adaptation of the conventional framework, namely modeling the SGD-induced training trajectory via a stochastic differential equation (SDE) with a noise term that captures gradient noise. This yields: (a) A new 'intrinsic learning rate' parameter that is the product of the normal learning rate and the weight decay factor. Analysis of the SDE shows how the effective speed of learning varies and equilibrates over time under the control of the intrinsic LR. (b) A challenge, via theory and experiments, to the popular belief that good generalization requires large learning rates at the start of training. (c) New experiments, backed by mathematical intuition, suggesting that the number of steps to equilibrium (in function space) scales as the inverse of the intrinsic learning rate, as opposed to the exponential-time convergence bound implied by SDE analysis. We name this the Fast Equilibrium Conjecture and suggest it holds the key to why Batch Normalization is effective.
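To make the abstract's key quantity concrete, here is a minimal PyTorch sketch (not from the paper; the toy model, random data, and hyperparameter values are illustrative assumptions) that trains a small BatchNorm network with SGD plus weight decay, computes the intrinsic learning rate as the product of the learning rate and the weight decay factor, and prints the effective step size lr / ||w||^2 for the (approximately) scale-invariant first layer.

```python
# Minimal sketch, assuming a toy BatchNorm network and random data;
# this is NOT the authors' code, only an illustration of the quantities involved.
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# BatchNorm after the first linear layer makes the loss (approximately)
# invariant to rescaling that layer's parameters.
model = nn.Sequential(nn.Linear(20, 64), nn.BatchNorm1d(64), nn.ReLU(), nn.Linear(64, 2))

lr, weight_decay = 0.1, 5e-4
intrinsic_lr = lr * weight_decay          # the paper's combined parameter
opt = torch.optim.SGD(model.parameters(), lr=lr, weight_decay=weight_decay)

# Random data standing in for a real dataset (illustrative assumption).
x, y = torch.randn(512, 20), torch.randint(0, 2, (512,))

for step in range(200):
    idx = torch.randint(0, 512, (64,))    # minibatch sampling is the source of gradient noise
    loss = F.cross_entropy(model(x[idx]), y[idx])
    opt.zero_grad()
    loss.backward()
    opt.step()
    if step % 50 == 0:
        w_norm2 = sum((p ** 2).sum() for p in model[0].parameters()).item()
        # For scale-invariant parameters the effective step size is lr / ||w||^2;
        # at equilibrium it is governed by the intrinsic LR = lr * weight_decay.
        print(step, loss.item(), lr / w_norm2, intrinsic_lr)
```

Per the Fast Equilibrium Conjecture stated above, reaching equilibrium in function space should take on the order of 1 / (lr * weight_decay) steps, far more than this toy loop runs; the snippet only shows how the intrinsic and effective learning rates would be tracked during training.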


Related research:
An Exponential Learning Rate Schedule for Deep Learning (10/16/2019)
Three Factors Influencing Minima in SGD (11/13/2017)
Understanding the Generalization Benefit of Normalization Layers: Sharpness Reduction (06/14/2022)
Training Scale-Invariant Neural Networks on the Sphere Can Happen in Three Regimes (09/08/2022)
The Effect of Network Width on Stochastic Gradient Descent and Generalization: an Empirical Study (05/09/2019)
On layer-level control of DNN training and its impact on generalization (06/05/2018)
On the Periodic Behavior of Neural Network Training with Batch Normalization and Weight Decay (06/29/2021)
