
An Exponential Learning Rate Schedule for Deep Learning
Intriguing empirical evidence exists that deep learning can work well wi...
read it

L2 Regularization versus Batch and Weight Normalization
Batch Normalization is a commonly used trick to improve the training of ...
read it

On Learning Rates and Schrödinger Operators
The learning rate is perhaps the single most important parameter in the ...
read it

The Effect of Network Width on Stochastic Gradient Descent and Generalization: an Empirical Study
We investigate how the final parameters found by stochastic gradient des...
read it

Meanfield Analysis of Batch Normalization
Batch Normalization (BatchNorm) is an extremely useful component of mode...
read it

On layerlevel control of DNN training and its impact on generalization
The generalization ability of a neural network depends on the optimizati...
read it

A Probabilistically Motivated Learning Rate Adaptation for Stochastic Optimization
Machine learning practitioners invest significant manual and computation...
read it
Reconciling Modern Deep Learning with Traditional Optimization Analyses: The Intrinsic Learning Rate
Recent works (e.g., (Li and Arora, 2020)) suggest that the use of popular normalization schemes (including Batch Normalization) in today's deep learning can move it far from a traditional optimization viewpoint, e.g., use of exponentially increasing learning rates. The current paper highlights other ways in which behavior of normalized nets departs from traditional viewpoints, and then initiates a formal framework for studying their mathematics via suitable adaptation of the conventional framework namely, modeling SGDinduced training trajectory via a suitable stochastic differential equation (SDE) with a noise term that captures gradient noise. This yields: (a) A new ' intrinsic learning rate' parameter that is the product of the normal learning rate and weight decay factor. Analysis of the SDE shows how the effective speed of learning varies and equilibrates over time under the control of intrinsic LR. (b) A challenge – via theory and experiments – to popular belief that good generalization requires large learning rates at the start of training. (c) New experiments, backed by mathematical intuition, suggesting the number of steps to equilibrium (in function space) scales as the inverse of the intrinsic learning rate, as opposed to the exponential time convergence bound implied by SDE analysis. We name it the Fast Equilibrium Conjecture and suggest it holds the key to why Batch Normalization is effective.
READ FULL TEXT
Comments
There are no comments yet.