
One-pass Stochastic Gradient Descent in Overparametrized Two-layer Neural Networks
There has been a recent surge of interest in understanding the convergen...

When Do Neural Networks Outperform Kernel Methods?
For a certain scaling of the initialization of stochastic gradient desce...

Tensor Programs III: Neural Matrix Laws
In a neural network (NN), weight matrices linearly transform inputs into...

Disentangling trainability and generalization in deep learning
A fundamental goal in deep learning is the characterization of trainabil...

Tensor Programs II: Neural Tangent Kernel for Any Architecture
We prove that a randomly initialized neural network of *any architecture...

The Dynamics of Learning: A Random Matrix Approach
Understanding the learning dynamics of neural networks is one of the key...

Batch Normalization Orthogonalizes Representations in Deep Random Networks
This paper underlines a subtle property of batch normalization (BN): Suc...
Scaling Limits of Wide Neural Networks with Weight Sharing: Gaussian Process Behavior, Gradient Independence, and Neural Tangent Kernel Derivation
Several recent trends in machine learning theory and practice, from the design of state-of-the-art Gaussian Processes to the convergence analysis of deep neural nets (DNNs) under stochastic gradient descent (SGD), have found it fruitful to study wide random neural networks. Central to these approaches are certain scaling limits of such networks. We unify these results by introducing a notion of a straight-line tensor program that can express most neural network computations, and we characterize its scaling limit when its tensors are large and randomized. From our framework follows (1) the convergence of random neural networks to Gaussian processes for architectures such as recurrent neural networks, convolutional neural networks, residual networks, attention, and any combination thereof, with or without batch normalization; (2) conditions under which the gradient independence assumption (that weights in backpropagation can be assumed to be independent from weights in the forward pass) leads to correct computation of gradient dynamics, and corrections when it does not; (3) the convergence of the Neural Tangent Kernel, a recently proposed kernel used to predict training dynamics of neural networks under gradient descent, at initialization for all architectures in (1) without batch normalization. Mathematically, our framework is general enough to rederive classical random matrix results such as the semicircle and the Marchenko-Pastur laws, as well as recent results in neural network Jacobian singular values. We hope our work opens a way toward design of even stronger Gaussian Processes, initialization schemes to avoid gradient explosion/vanishing, and deeper understanding of SGD dynamics in modern architectures.
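The abstract notes that the tensor-program framework rederives classical random matrix results such as the semicircle law. That claim is easy to check empirically: the eigenvalues of a suitably scaled symmetric random matrix concentrate on [-2, 2] with semicircle density. The sketch below is illustrative only; the matrix size and Gaussian entries are assumptions for the demonstration, not details from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000

# Symmetric Wigner matrix: i.i.d. N(0, 1) entries symmetrized, then scaled
# by 1/sqrt(2n) so off-diagonal entries have variance 1/n. Under this
# scaling the empirical spectrum converges to the semicircle law on [-2, 2].
A = rng.standard_normal((n, n))
S = (A + A.T) / np.sqrt(2 * n)

eigs = np.linalg.eigvalsh(S)

# Spectral edges should sit near -2 and +2.
print("edges:", eigs.min(), eigs.max())

# The semicircle density (1/2pi) * sqrt(4 - x^2) puts mass about 0.609
# on [-1, 1]; the empirical fraction of eigenvalues there should be close.
frac = np.mean(np.abs(eigs) <= 1.0)
print("mass on [-1, 1]:", frac)
```

The same experiment with `S = (A @ A.T) / n` instead recovers the Marchenko-Pastur law that the abstract also mentions.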