Training Integrable Parameterizations of Deep Neural Networks in the Infinite-Width Limit

10/29/2021
by Karl Hajjar, et al.

To theoretically understand the behavior of trained deep neural networks, it is necessary to study the dynamics induced by gradient methods from a random initialization. However, the nonlinear and compositional structure of these models makes these dynamics difficult to analyze. To overcome these challenges, large-width asymptotics have recently emerged as a fruitful viewpoint and led to practical insights on real-world deep networks. For two-layer neural networks, it has been understood via these asymptotics that the nature of the trained model radically changes depending on the scale of the initial random weights, ranging from a kernel regime (for large initial variance) to a feature learning regime (for small initial variance). For deeper networks, more regimes are possible, and in this paper we study in detail a specific choice of "small" initialization corresponding to "mean-field" limits of neural networks, which we call integrable parameterizations (IPs). First, we show that under standard i.i.d. zero-mean initialization, integrable parameterizations of neural networks with more than four layers start at a stationary point in the infinite-width limit and no learning occurs. We then propose various methods to avoid this trivial behavior and analyze in detail the resulting dynamics. In particular, one of these methods consists of using large initial learning rates, and we show that it is equivalent to a modification of the recently proposed maximal update parameterization μP. We confirm our results with numerical experiments on image classification tasks, which additionally show a strong difference in behavior across various choices of activation functions that is not yet captured by theory.
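To make the width scalings concrete, here is a minimal NumPy sketch contrasting a standard 1/sqrt(fan-in) Gaussian initialization with an "integrable"/mean-field-style 1/fan-in initialization. The exponents and layer-wise choices below are illustrative assumptions rather than the paper's exact parameterization table, but they show how the forward signal of a deep network with the small scaling collapses as the width grows, which is the degeneracy described in the abstract.

```python
# Minimal sketch (illustrative assumptions, not the paper's exact scalings):
# contrast a standard 1/sqrt(fan_in) Gaussian initialization, which keeps
# pre-activations O(1), with an "integrable"/mean-field-style 1/fan_in
# initialization, whose forward signal shrinks layer after layer as the
# width grows, consistent with deep IPs starting near a stationary point.
import numpy as np

def init_layer(fan_in, fan_out, scheme, rng):
    """i.i.d. zero-mean Gaussian weights with a width-dependent standard deviation."""
    if scheme == "standard":      # std ~ 1/sqrt(fan_in): kernel/NTK-type scaling
        std = 1.0 / np.sqrt(fan_in)
    elif scheme == "integrable":  # std ~ 1/fan_in: "small" mean-field-type scaling
        std = 1.0 / fan_in
    else:
        raise ValueError(scheme)
    return rng.normal(0.0, std, size=(fan_out, fan_in))

def forward(x, weights):
    """Plain fully connected forward pass with ReLU between layers."""
    h = x
    for W in weights[:-1]:
        h = np.maximum(W @ h, 0.0)
    return weights[-1] @ h

rng = np.random.default_rng(0)
depth, d_in = 6, 32
x = rng.normal(size=d_in)

for scheme in ("standard", "integrable"):
    for width in (128, 1024):
        dims = [d_in] + [width] * (depth - 1) + [1]
        weights = [init_layer(dims[l], dims[l + 1], scheme, rng) for l in range(depth)]
        out = np.abs(forward(x, weights)).item()
        print(f"{scheme:10s} width={width:5d}  |output| ~ {out:.3e}")
```

Under the standard scaling the printed magnitude stays roughly of order one as the width increases, whereas under the small, integrable-style scaling it decays rapidly with width (and so do the gradients at initialization). The remedies studied in the paper, such as large initial learning rates recovering a modified μP, are aimed precisely at escaping this trivial regime.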

