The Neural Covariance SDE: Shaped Infinite Depth-and-Width Networks at Initialization

by Mufan Bill Li, et al.

The logit outputs of a feedforward neural network at initialization are conditionally Gaussian, given a random covariance matrix defined by the penultimate layer. In this work, we study the distribution of this random matrix. Recent work has shown that shaping the activation function as network depth grows large is necessary for this covariance matrix to be non-degenerate. However, the current infinite-width-style understanding of this shaping method is unsatisfactory for large depth: infinite-width analyses ignore the microscopic fluctuations from layer to layer, but these fluctuations accumulate over many layers. To overcome this shortcoming, we study the random covariance matrix in the shaped infinite-depth-and-width limit. We identify the precise scaling of the activation function necessary to arrive at a non-trivial limit, and show that the random covariance matrix is governed by a stochastic differential equation (SDE) that we call the Neural Covariance SDE. Using simulations, we show that the SDE closely matches the distribution of the random covariance matrix of finite networks. Additionally, we recover an if-and-only-if condition for exploding and vanishing norms of large shaped networks based on the activation function.
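The abstract's simulation claim can be illustrated with a small sketch: propagate two inputs through a random fully connected network at initialization and record the 2x2 empirical covariance matrix of the penultimate layer. This is not the paper's exact experimental setup; it assumes one common shaping scheme, a leaky ReLU whose positive and negative slopes approach 1 (the identity) at rate 1/sqrt(depth), with illustrative slope constants `c_plus` and `c_minus` and a depth-to-width ratio of 1.

```python
import numpy as np

def shaped_relu(x, depth, c_plus=1.0, c_minus=-1.0):
    """Leaky ReLU whose slopes 1 + c/sqrt(depth) approach the identity
    as depth grows -- an illustrative shaping, not the paper's exact one."""
    s_plus = 1.0 + c_plus / np.sqrt(depth)
    s_minus = 1.0 + c_minus / np.sqrt(depth)
    return np.where(x > 0, s_plus * x, s_minus * x)

def penultimate_covariance(x1, x2, width, depth, rng):
    """Push two inputs through a random net at initialization and return
    the 2x2 empirical covariance matrix of the penultimate layer."""
    h = np.stack([x1, x2])                      # shape (2, width)
    for _ in range(depth):
        # He-style Gaussian weights with variance 1/width.
        W = rng.normal(0.0, 1.0 / np.sqrt(width), size=(width, width))
        h = shaped_relu(h @ W.T, depth)
    return h @ h.T / width                      # entries <h_i, h_j> / n

rng = np.random.default_rng(0)
width, depth = 256, 256                         # depth-to-width ratio ~ 1
x1 = rng.normal(size=width)
x2 = rng.normal(size=width)
cov = penultimate_covariance(x1, x2, width, depth, rng)
print(cov)
```

Because the slopes shrink toward 1 at rate 1/sqrt(depth), the per-layer fluctuations of the covariance are of order 1/depth and accumulate to an order-one stochastic effect over `depth` layers, which is exactly the regime the Neural Covariance SDE describes; with an unshaped activation the same recursion degenerates as depth grows.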
