Large Learning Rate Tames Homogeneity: Convergence and Balancing Effect

10/07/2021
by Yuqing Wang, et al.

Recent empirical advances show that training deep models with a large learning rate often improves generalization performance. However, theoretical justification for the benefits of a large learning rate remains limited, owing to the difficulty of the analysis. In this paper, we consider Gradient Descent (GD) with a large learning rate on a homogeneous matrix factorization problem, i.e., min_{X,Y} ‖A - XY^⊤‖_F^2. We prove convergence for constant learning rates well beyond 2/L, where L is the largest eigenvalue of the Hessian at initialization. Moreover, we rigorously establish an implicit bias of GD induced by such a large learning rate, termed 'balancing': the magnitudes of X and Y at the limit of the GD iterations are close even if their initialization is significantly unbalanced. Numerical experiments are provided to support our theory.
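
To make the setup concrete, the following is a minimal numerical sketch (not the paper's code) of the balancing effect on the simplest instance of min_{X,Y} ‖A - XY^⊤‖_F^2, where A, X, and Y are scalars. The initialization (10, 0.01), the step size 0.015, and the iteration count are illustrative assumptions, not values from the paper.

```python
# A minimal sketch of GD on f(x, y) = (a - x*y)^2, the 1x1 case of
# min_{X,Y} ||A - X Y^T||_F^2, started from an unbalanced initialization
# and run with a constant learning rate above the classical 2/L threshold.
# All numerical values below are illustrative assumptions.
import numpy as np

a = 1.0
x, y = 10.0, 0.01          # deliberately unbalanced initialization
eta = 0.015                # constant learning rate

# Largest eigenvalue L of the Hessian of f at the initialization;
# the chosen eta exceeds the classical stability threshold 2/L.
H = np.array([[2 * y**2, 4 * x * y - 2 * a],
              [4 * x * y - 2 * a, 2 * x**2]])
L = np.linalg.eigvalsh(H).max()
print(f"2/L at initialization = {2 / L:.4f}, eta = {eta}")

for _ in range(200):
    r = a - x * y                       # residual
    gx, gy = -2 * r * y, -2 * r * x     # gradients of (a - x*y)^2
    x, y = x - eta * gx, y - eta * gy

print(f"final: x = {x:.3f}, y = {y:.3f}, loss = {(a - x * y)**2:.2e}")
print(f"|x|/|y| went from {10.0 / 0.01:.0f} to {abs(x) / abs(y):.2f}")
# With a very small learning rate, GD roughly preserves the imbalance
# x^2 - y^2; with this larger step the magnitudes of x and y end up much
# closer at convergence, illustrating the 'balancing' effect described above.
```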


Related research

Stochastic Learning Rate Optimization in the Stochastic Approximation and Online Learning Settings (10/20/2021)

Super-Convergence with an Unstable Learning Rate (02/22/2021)

Super-Convergence: Very Fast Training of Residual Networks Using Large Learning Rates (08/23/2017)

Maximal Initial Learning Rates in Deep ReLU Networks (12/14/2022)

SAM operates far from home: eigenvalue regularization as a dynamical phenomenon (02/17/2023)

Depth Dependence of μP Learning Rates in ReLU MLPs (05/13/2023)

Convergence of Contrastive Divergence with Annealed Learning Rate in Exponential Family (05/20/2016)
