Towards Explaining the Regularization Effect of Initial Large Learning Rate in Training Neural Networks

07/10/2019
by Yuanzhi Li, et al.

Stochastic gradient descent with a large initial learning rate is a widely adopted method for training modern neural net architectures. Although a small initial learning rate allows for faster training and better test performance initially, the large learning rate achieves better generalization soon after the learning rate is annealed. Towards explaining this phenomenon, we devise a setting in which we can prove that a two-layer network trained with a large initial learning rate and annealing provably generalizes better than the same network trained with a small learning rate from the start. The key insight in our analysis is that the order of learning different types of patterns is crucial: because the small-learning-rate model first memorizes low-noise, hard-to-fit patterns, it generalizes worse on higher-noise, easier-to-fit patterns than its large-learning-rate counterpart. This concept translates to a larger-scale setting: we demonstrate that one can add a small patch to CIFAR-10 images that is immediately memorizable by a model with a small initial learning rate, but ignored by the model with a large learning rate until after annealing. Our experiments show that this causes the small-learning-rate model's accuracy on unmodified images to suffer, as it relies too much on the patch early on.
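
Below is a minimal, illustrative PyTorch sketch (not the authors' code) of the CIFAR-10 experiment described above: a small label-correlated patch is stamped onto part of each training batch, and the same network is trained once with a small constant learning rate and once with a large initial learning rate that is annealed partway through, before measuring accuracy on unmodified test images. The architecture, patch encoding, epoch counts, and learning rates are assumptions chosen for brevity, not values taken from the paper.

```python
# Illustrative sketch only: hyperparameters and model are assumptions, not the paper's setup.
import torch
import torch.nn as nn
import torchvision
import torchvision.transforms as T


def add_patch(images, labels, patch_size=3):
    """Stamp a small, label-dependent patch in the corner of each image (an 'easy-to-fit' signal)."""
    patched = images.clone()
    for i, y in enumerate(labels):
        patched[i, :, :patch_size, :patch_size] = y.float() / 9.0  # crude label encoding
    return patched


def train_and_eval(lr_schedule, train_loader, test_loader, epochs=20):
    # Simple fully connected net; the paper's setting differs, this is just for illustration.
    model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 512), nn.ReLU(), nn.Linear(512, 10))
    opt = torch.optim.SGD(model.parameters(), lr=lr_schedule(0))
    loss_fn = nn.CrossEntropyLoss()
    for epoch in range(epochs):
        for g in opt.param_groups:          # apply the learning-rate schedule
            g["lr"] = lr_schedule(epoch)
        for x, y in train_loader:
            half = len(y) // 2               # patch half of each batch with the label-correlated signal
            x = torch.cat([add_patch(x[:half], y[:half]), x[half:]])
            opt.zero_grad()
            loss_fn(model(x), y).backward()
            opt.step()
    # Evaluate on *unmodified* test images to see how much the model leaned on the patch.
    correct = total = 0
    with torch.no_grad():
        for x, y in test_loader:
            correct += (model(x).argmax(1) == y).sum().item()
            total += len(y)
    return correct / total


if __name__ == "__main__":
    tf = T.ToTensor()
    train_set = torchvision.datasets.CIFAR10("data", train=True, download=True, transform=tf)
    test_set = torchvision.datasets.CIFAR10("data", train=False, download=True, transform=tf)
    train_loader = torch.utils.data.DataLoader(train_set, batch_size=128, shuffle=True)
    test_loader = torch.utils.data.DataLoader(test_set, batch_size=256)

    small_lr = lambda epoch: 0.01                                   # small learning rate from the start
    large_then_anneal = lambda epoch: 0.5 if epoch < 10 else 0.01   # large initial LR, annealed at epoch 10

    acc_small = train_and_eval(small_lr, train_loader, test_loader)
    acc_large = train_and_eval(large_then_anneal, train_loader, test_loader)
    print(f"clean test acc, small LR: {acc_small:.3f}; large LR + anneal: {acc_large:.3f}")
```

In this sketch the patch is a deliberately easy, label-correlated pattern; the claim under test is that the small-learning-rate run latches onto it early and pays for that on clean test images, while the large-learning-rate run largely ignores it until after annealing.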

Related research:

05/15/2020 · Learning Rate Annealing Can Provably Help Generalization, Even for Convex Problems
Learning rate schedule can significantly affect generalization performan...

01/28/2019 · Stiffness: A New Perspective on Generalization in Neural Networks
We investigate neural network training and generalization using the conc...

03/26/2019 · Improving image classifiers for small datasets by learning rate adaptations
Our paper introduces an efficient combination of established techniques ...

12/14/2022 · Maximal Initial Learning Rates in Deep ReLU Networks
Training a neural network requires choosing a suitable learning rate, in...

08/05/2019 · Learning Stages: Phenomenon, Root Cause, Mechanism Hypothesis, and Implications
Under StepDecay learning rate strategy (decaying the learning rate after...

11/04/2020 · Direction Matters: On the Implicit Regularization Effect of Stochastic Gradient Descent with Moderate Learning Rate
Understanding the algorithmic regularization effect of stochastic gradie...

05/02/2007 · The Parameter-Less Self-Organizing Map algorithm
The Parameter-Less Self-Organizing Map (PLSOM) is a new neural network a...
