# On the infinite width limit of neural networks with a standard parameterization

There are currently two parameterizations used to derive fixed kernels corresponding to infinite width neural networks, the NTK (Neural Tangent Kernel) parameterization and the naive standard parameterization. However, the extrapolation of both of these parameterizations to infinite width is problematic. The standard parameterization leads to a divergent neural tangent kernel while the NTK parameterization fails to capture crucial aspects of finite width networks such as: the dependence of training dynamics on relative layer widths, the relative training dynamics of weights and biases, and a nonstandard learning rate scale. Here we propose an improved extrapolation of the standard parameterization that preserves all of these properties as width is taken to infinity and yields a well-defined neural tangent kernel. We show experimentally that the resulting kernels typically achieve similar accuracy to those resulting from an NTK parameterization, but with better correspondence to the parameterization of typical finite width networks. Additionally, with careful tuning of width parameters, the improved standard parameterization kernels can outperform those stemming from an NTK parameterization. We release code implementing this improved standard parameterization as part of the Neural Tangents library at https://github.com/google/neural-tangents.


## 1 Introduction

Infinite width Bayesian [neal, lee2018deep, matthews2018, matthews2018b_arxiv, novak2018bayesian, garriga-alonso2018deep, NIPS2019_8809, yang2019scaling, yang2019wide, de2019random] and gradient descent trained [jacot2018neural, lee2019wide, chizat2019lazy, yang2019scaling, jacot2019freeze, dyer2019asymptotics, bietti2019inductive, arora2019on, arora2019harnessing, schwartz2019information] neural networks are an area of active and extremely promising work. There are currently two parameterizations used to derive fixed kernels corresponding to infinite width neural networks: the NTK parameterization [jacot2018neural, §2] and the naive standard parameterization [park2019effect, §2.1]; [glorot2010understanding, he2016deep]. (Another line of work applies a different scaling, and derives non-fixed infinite width kernels [mei2018mean, mei2019mean, chizat2018global, Nguyen2019MeanFL].) However, the extrapolations of both of these parameterizations to infinite width fail to capture crucial aspects of finite width networks:

• In finite width networks, differences in relative layer widths can have a profound effect on training dynamics. Under the NTK parameterization, as layer width goes to infinity, relative layer width has no effect on training dynamics or predictions.

• As the naive standard parameterization is extended to large widths, the largest stable learning rate scales like $1/\text{width}$ [karakida2018universal, Theorem 7]; [park2019effect, §H]. A learning rate that goes to zero as width goes to infinity poses a variety of practical and theoretical challenges, including a neural tangent kernel with entries that diverge to infinity.

• At finite width, convolutional networks with an NTK parameterization have been reported to generalize more poorly than those with a standard parameterization [park2019effect, §I] (though we do not consistently reproduce this relationship in our own experiments, see Figure 3).

• For neither NTK nor naive standard parameterization do infinite width learning rates agree closely with those typically used to train finite width standard parameterization networks.

• The relative learning dynamics of bias and weight parameters are different in the NTK parameterization than they are for a standard parameterization finite-width network.
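The learning rate scaling in the second bullet above can be illustrated with a toy model. The following is a minimal sketch, not taken from the paper: it assumes a single linear readout $f = \sum_j w_j y_j$ with all activations fixed to 1 and a squared loss, and the function name `gd_final_error` is hypothetical. In this setup the tangent kernel equals the width, so gradient descent is stable only for learning rates below $2/\text{width}$.

```python
def gd_final_error(width, lr, steps=200, target=1.0):
    """Run gradient descent on 0.5 * (f - target)**2 for f = sum_j w_j * y_j.

    Hypothetical toy setup: the activations y_j are all 1, so the readout
    tangent kernel is sum_j y_j**2 = width.
    """
    y = [1.0] * width
    w = [0.0] * width
    for _ in range(steps):
        f = sum(wj * yj for wj, yj in zip(w, y))
        residual = f - target
        # dL/dw_j = residual * y_j, so the residual is multiplied by
        # (1 - lr * width) at every step.
        w = [wj - lr * residual * yj for wj, yj in zip(w, y)]
    f = sum(wj * yj for wj, yj in zip(w, y))
    return abs(f - target)


for width in (10, 100, 1000):
    critical_lr = 2.0 / width  # stability threshold: 2 / kernel value
    below = gd_final_error(width, 0.9 * critical_lr)
    above = gd_final_error(width, 1.1 * critical_lr)
    print(f"width={width} critical_lr={critical_lr:.4f} "
          f"error_below={below:.2e} error_above={above:.2e}")
```

Just below the threshold the error shrinks toward zero, and just above it the error grows without bound, at every width. The threshold itself shrinks in proportion to $1/\text{width}$.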

In this note we propose an improved extrapolation of the standard parameterization to infinite width that resolves these inconsistencies while simultaneously leading to a well-defined neural tangent kernel. Namely, in this parameterization the resulting infinite width network maintains a learning rate scale that agrees with that used to train the original network, preserves the impact of relative layer widths on training dynamics for finite width networks, and similarly preserves the relative training dynamics of weights and biases.

## 2 Improved standard parameterization

Affine layers in neural networks are typically written as,

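$$z^{l+1} = W^l y^l + b^l \tag{1}$$

where $z^{l+1} \in \mathbb{R}^{N^{l+1}}$ are pre-activations, $y^l \in \mathbb{R}^{N^l}$ are activations, $W^l \in \mathbb{R}^{N^{l+1} \times N^l}$ are weights, and $b^l \in \mathbb{R}^{N^{l+1}}$ are biases. To preserve the scale of the pre-activations as the width of the network, $N^l$, is varied, one typically initializes the weights as $W^l_{ij} \sim \mathcal{N}(0, \sigma^2 / N^l)$ and biases as $b^l_i \sim \mathcal{N}(0, \sigma_b^2)$. However, as was noted in [jacot2018neural], this leads to divergent gradient flow dynamics as $N^l \to \infty$. In [jacot2018neural], the authors resolve this situation by using an alternative parameterization where affine layers are written as,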

$$z^{l+1} = \frac{\sigma}{\sqrt{N^l}} \omega^l y^l + b^l \tag{2}$$

where $\omega^l_{ij} \sim \mathcal{N}(0, 1)$. This leads to a well-behaved infinite-width limit, but involves a number of inconsistencies relative to standard neural networks.
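To see where the divergence comes from, it may help to compare the contribution of a single layer's weights to the tangent kernel under Eq. (1) and Eq. (2). The following is a sketch under stated assumptions: the activations $y^l_j$ are taken to be $\mathcal{O}(1)$, factors backpropagated from later layers are ignored, and $\Theta^l_W$ and $\Theta^l_\omega$ are notation introduced here for this single-layer contribution.

$$
\begin{aligned}
\text{Eq. (1):}\quad & \frac{\partial z^{l+1}_i}{\partial W^l_{ij}} = y^l_j, & \Theta^l_W(x, x') &= \sum_{j=1}^{N^l} y^l_j(x)\, y^l_j(x') = \mathcal{O}(N^l), \\
\text{Eq. (2):}\quad & \frac{\partial z^{l+1}_i}{\partial \omega^l_{ij}} = \frac{\sigma}{\sqrt{N^l}}\, y^l_j, & \Theta^l_\omega(x, x') &= \frac{\sigma^2}{N^l} \sum_{j=1}^{N^l} y^l_j(x)\, y^l_j(x') = \mathcal{O}(1).
\end{aligned}
$$

In both cases $\partial z^{l+1}_i / \partial b^l_i = 1$, so the bias contributes a constant. The sum over $N^l$ terms is what makes the kernel grow with width under Eq. (1). The explicit $1/\sqrt{N^l}$ factor in Eq. (2) cancels that growth.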

The core idea here is to write the width of the neural network in each layer in terms of an auxiliary parameter, $s$, as $N^l = s\,\mathsf{N}^l$, where $\mathsf{N}^l$ is the original layer width. We then write an affine layer as,

$$z^{l+1} = \frac{1}{\sqrt{s}} W^l y^l + b^l \tag{3}$$

The infinite width limit can be taken by letting $s \to \infty$. The parameter variances and original layer widths instead appear in the variance of the initializer (as is typically done for finite width networks). A complete set of equations describing an affine layer, and corresponding infinite width kernels, for this parameterization are given in Tables 1 and 2, for fully connected and convolutional architectures respectively.
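The scaling in Eq. (3) can be checked numerically. The following is a minimal sketch under assumptions not stated in the source: the layer input width is `s * n_orig`, weights are drawn with variance `sigma**2 / n_orig`, biases are omitted, and the activations are fixed to 1 to isolate the weight term. The names `improved_standard_preactivations`, `n_orig`, and `sigma` are hypothetical. Under these assumptions the pre-activation variance should stay near `sigma**2` as `s` grows.

```python
import math
import random


def improved_standard_preactivations(n_orig, s, n_out, sigma=1.0, seed=0):
    """Sample z = W y / sqrt(s) for a layer of input width N = s * n_orig.

    Weights are drawn as W_ij ~ N(0, sigma**2 / n_orig): the original width
    appears in the initializer, and 1/sqrt(s) appears in the layer equation.
    Biases are omitted and y is fixed to all ones (illustrative assumption).
    """
    rng = random.Random(seed)
    n_in = s * n_orig
    w_std = sigma / math.sqrt(n_orig)
    y = [1.0] * n_in
    z = []
    for _ in range(n_out):
        row = [rng.gauss(0.0, w_std) for _ in range(n_in)]
        z.append(sum(wij * yj for wij, yj in zip(row, y)) / math.sqrt(s))
    return z


def variance(values):
    """Population variance of a list of floats."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


for s in (1, 4, 16):
    z = improved_standard_preactivations(n_orig=8, s=s, n_out=2000)
    print(f"s={s} empirical pre-activation variance={variance(z):.3f}")
```

The sum over `s * n_orig` inputs grows the variance by a factor of `s`, and the `1 / sqrt(s)` prefactor removes exactly that factor, so the initializer can keep depending on the original width only.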

A formal proof of convergence of the improved standard parameterization to the specified kernels is beyond the scope of this short note. However, we observe that the proof technique in [lee2019wide] applies with minimal modification. Additionally, Monte Carlo validation of the correctness of the introduced kernels is performed as part of the Neural Tangents [novak2020neural] unit test suite.

## 3 Experiments

In this section, we study empirical properties of infinite and finite width networks stemming from both the NTK and improved standard parameterization. All of the experiments in this section were done using the Neural Tangents library [novak2020neural]. Here we focus on kernels corresponding to ReLU networks with .

In Figure 1 we compare the predictions of kernels for pairs of identical networks, but using the improved standard or NTK parameterization. We find that the performance of the kernels resulting from the two parameterizations is extremely similar, while the training dynamics of the improved standard parameterization network are expected to better match those of typical finite width networks. In Figure 2 we show that if the width parameter is carefully tuned, then the neural tangent kernel for a fully connected network using the improved standard parameterization can outperform the kernel for an NTK parameterized network. In Figure 3, we show that random finite width networks using the standard and NTK parameterization perform similarly.

## 4 Discussion

The analytic forms for the various kernels inspire some additional interesting observations:

• For the NTK parameterization, the kernel resulting from a Bayesian neural network and from gradient descent training of the readout layer of an infinite width network are the same. For both the naive and improved standard parameterization, however, the two differ.

• For neural networks with a standard parameterization, the magnitude of the contribution of the bias to the neural tangent kernel (and thus to learning dynamics) remains constant with increasing width. However, the contribution of the weights to the learning dynamics grows linearly with the layer width. We should thus expect that as networks become wide, the role played by the bias in training becomes less important.
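The second observation above can be made concrete with a small numerical sketch. It is illustrative only and makes assumptions beyond the source: a single affine layer $z = W y + b$ in the standard parameterization, activations given by ReLU applied to unit Gaussian samples, and the diagonal kernel entry $\Theta(x, x)$. The function name `readout_ntk_contributions` is hypothetical.

```python
import random


def relu(x):
    return max(x, 0.0)


def readout_ntk_contributions(width, seed=0):
    """Contributions to Theta(x, x) from the parameters of z = W y + b.

    For a standard parameterization dz_i/dW_ij = y_j and dz_i/db_i = 1, so
    the weights contribute sum_j y_j**2 and the bias contributes 1.
    Activations are hypothetical: ReLU of unit Gaussian samples.
    """
    rng = random.Random(seed)
    y = [relu(rng.gauss(0.0, 1.0)) for _ in range(width)]
    weight_part = sum(yj * yj for yj in y)
    bias_part = 1.0
    return weight_part, bias_part


for width in (100, 1000, 10000):
    weight_part, bias_part = readout_ntk_contributions(width)
    bias_share = bias_part / (weight_part + bias_part)
    print(f"width={width} weight_part={weight_part:.1f} "
          f"bias_part={bias_part:.1f} bias_share={bias_share:.5f}")
```

The weight contribution is a sum over `width` terms while the bias contribution is a single constant, so the bias share of the kernel shrinks roughly in proportion to `1 / width`.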

In this note, we introduced an improved extrapolation of finite width networks to infinite width that better matches the parameterization and learning dynamics of typical finite width networks. It is our hope that this will enable theory and experiments with infinite width networks to better explain the behavior of practical finite width networks.