Depth Dependence of μP Learning Rates in ReLU MLPs

05/13/2023
by Samy Jelassi et al.

In this short note we consider random fully connected ReLU networks of width n and depth L equipped with a mean-field weight initialization. Our purpose is to study the dependence on n and L of the maximal update (μP) learning rate, the largest learning rate for which the mean squared change in pre-activations after one step of gradient descent remains uniformly bounded at large n and L. As in prior work on μP by Yang et al., we find that this maximal update learning rate is independent of n for all but the first and last layer weights. However, we find that it has a non-trivial dependence on L, scaling like L^{-3/2}.
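To make the scaling concrete, here is a minimal sketch of how one might set per-layer learning rates in line with the abstract's claim: hidden-layer weights get a base rate multiplied by L^{-3/2}, while the first and last layers are treated separately. The helper names (build_mlp, depth_scaled_lrs) and the base rate eta0 are illustrative assumptions, not from the paper, and the 1/fan_in initialization below is only a stand-in for the paper's mean-field parametrization, whose exact first- and last-layer scalings are not reproduced here.

```python
import torch
import torch.nn as nn

def build_mlp(d_in, n, d_out, L):
    """ReLU MLP of width n and depth L; weights drawn with variance 1/fan_in
    (a simplified stand-in for the mean-field initialization)."""
    layers = [nn.Linear(d_in, n, bias=False)]
    for _ in range(L - 1):
        layers += [nn.ReLU(), nn.Linear(n, n, bias=False)]
    layers += [nn.ReLU(), nn.Linear(n, d_out, bias=False)]
    mlp = nn.Sequential(*layers)
    with torch.no_grad():
        for m in mlp:
            if isinstance(m, nn.Linear):
                fan_in = m.weight.shape[1]
                m.weight.normal_(0.0, fan_in ** -0.5)
    return mlp

def depth_scaled_lrs(mlp, eta0, L):
    """Assign eta0 * L^{-3/2} to hidden-layer weights (the depth scaling
    reported in the note); first/last layers keep eta0 as a placeholder,
    since their width-dependent scaling is handled separately in muP."""
    linears = [m for m in mlp if isinstance(m, nn.Linear)]
    groups = []
    for i, lin in enumerate(linears):
        lr = eta0 * L ** (-1.5) if 0 < i < len(linears) - 1 else eta0
        groups.append({"params": lin.parameters(), "lr": lr})
    return groups

# Example usage: width n = 512, depth L = 16.
model = build_mlp(d_in=10, n=512, d_out=1, L=16)
opt = torch.optim.SGD(depth_scaled_lrs(model, eta0=0.1, L=16), lr=0.1)
```

The point of the sketch is only the L^{-3/2} factor on the hidden layers; how one normalizes the first and last layers depends on the chosen parametrization and is left as a placeholder here.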


