Learning Rate Annealing Can Provably Help Generalization, Even for Convex Problems

05/15/2020
by   Preetum Nakkiran, et al.
8

Learning rate schedule can significantly affect generalization performance in modern neural networks, but the reasons for this are not yet understood. Li-Wei-Ma (2019) recently proved this behavior can exist in a simplified non-convex neural-network setting. In this note, we show that this phenomenon can exist even for convex learning problems – in particular, linear regression in 2 dimensions. We give a toy convex problem where learning rate annealing (large initial learning rate, followed by small learning rate) can lead gradient descent to minima with provably better generalization than using a small learning rate throughout. In our case, this occurs due to a combination of the mismatch between the test and train loss landscapes, and early-stopping.

READ FULL TEXT
research
07/10/2019

Towards Explaining the Regularization Effect of Initial Large Learning Rate in Training Neural Networks

Stochastic gradient descent with a large initial learning rate is a wide...
research
03/27/2023

Learning Rate Schedules in the Presence of Distribution Shift

We design learning rate schedules that minimize regret for SGD-based onl...
research
12/14/2022

Learning threshold neurons via the "edge of stability"

Existing analyses of neural network training often operate under the unr...
research
02/09/2022

Optimal learning rate schedules in high-dimensional non-convex optimization problems

Learning rate schedules are ubiquitously used to speed up and improve op...
research
02/28/2022

On the Benefits of Large Learning Rates for Kernel Methods

This paper studies an intriguing phenomenon related to the good generali...
research
05/02/2007

The Parameter-Less Self-Organizing Map algorithm

The Parameter-Less Self-Organizing Map (PLSOM) is a new neural network a...
research
11/25/2020

Implicit bias of deep linear networks in the large learning rate phase

Correctly choosing a learning rate (scheme) for gradient-based optimizat...

Please sign up or login with your details

Forgot password? Click here to reset