On the infinite width limit of neural networks with a standard parameterization

01/21/2020 · by Jascha Sohl-Dickstein, et al.

There are currently two parameterizations used to derive fixed kernels corresponding to infinite width neural networks, the NTK (Neural Tangent Kernel) parameterization and the naive standard parameterization. However, the extrapolation of both of these parameterizations to infinite width is problematic. The standard parameterization leads to a divergent neural tangent kernel, while the NTK parameterization fails to capture crucial aspects of finite width networks, such as the dependence of training dynamics on relative layer widths and the relative training dynamics of weights and biases, and it requires a nonstandard learning rate scale. Here we propose an improved extrapolation of the standard parameterization that preserves all of these properties as width is taken to infinity and yields a well-defined neural tangent kernel. We show experimentally that the resulting kernels typically achieve similar accuracy to those resulting from an NTK parameterization, but with better correspondence to the parameterization of typical finite width networks. Additionally, with careful tuning of width parameters, the improved standard parameterization kernels can outperform those stemming from an NTK parameterization. We release code implementing this improved standard parameterization as part of the Neural Tangents library at https://github.com/google/neural-tangents.

1 Introduction

Infinite width Bayesian [neal, lee2018deep, matthews2018, matthews2018b_arxiv, novak2018bayesian, garriga-alonso2018deep, NIPS2019_8809, yang2019scaling, yang2019wide, de2019random] and gradient descent trained [jacot2018neural, lee2019wide, chizat2019lazy, yang2019scaling, jacot2019freeze, dyer2019asymptotics, bietti2019inductive, arora2019on, arora2019harnessing, schwartz2019information] neural networks are an area of active and extremely promising work. There are currently two parameterizations used to derive fixed kernels corresponding to infinite width neural networks (another line of work applies a different scaling and derives non-fixed infinite width kernels [mei2018mean, mei2019mean, chizat2018global, Nguyen2019MeanFL]): the NTK parameterization [jacot2018neural, §2], and the naive standard parameterization [park2019effect, §2.1; glorot2010understanding; he2016deep]. However, the extrapolations of both of these parameterizations to infinite width fail to capture crucial aspects of finite width networks:

  • In finite width networks, differences in relative layer widths can have a profound effect on training dynamics. Under the NTK parameterization, as layer width goes to infinity, relative layer width has no effect on training dynamics or predictions.

  • As the naive standard parameterization is extended to large widths, the largest stable learning rate scales like $1/n$, where $n$ is the layer width [karakida2018universal, Theorem 7; park2019effect, §H] (see the sketch after this list). A learning rate that goes to zero as width goes to infinity poses a variety of practical and theoretical challenges, including a neural tangent kernel with entries that diverge to infinity.

  • At finite width, convolutional networks with an NTK parameterization have been reported to generalize more poorly than those with a standard parameterization [park2019effect, §I] (though we do not consistently reproduce this relationship in our own experiments, see Figure 3).

  • For neither the NTK nor the naive standard parameterization do infinite width learning rates agree closely with those typically used to train finite width standard parameterization networks.

  • The relative learning dynamics of bias and weight parameters are different in the NTK parameterization than they are for a standard parameterization finite-width network.
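
The learning rate issue in the second bullet is easy to see numerically. Below is a minimal sketch (not from the paper) that estimates the diagonal of the empirical neural tangent kernel of a one-hidden-layer ReLU network under the naive standard parameterization at several widths; the entries grow roughly linearly with width (dominated by the readout weights), so the largest stable learning rate shrinks like $1/n$. The network definition and widths are illustrative choices, not the architectures used in the experiments below.

```python
# Hypothetical demonstration: under the naive standard parameterization the
# empirical NTK diagonal grows ~linearly with width n, so the largest stable
# learning rate must shrink like 1/n.
import jax
import jax.numpy as jnp

def init_params(key, n, d_in=16):
    kw1, kb1, kw2, kb2 = jax.random.split(key, 4)
    return {
        "W1": jax.random.normal(kw1, (n, d_in)) * jnp.sqrt(2.0 / d_in),  # sigma_w^2 = 2
        "b1": jnp.zeros(n),
        "W2": jax.random.normal(kw2, (1, n)) * jnp.sqrt(2.0 / n),
        "b2": jnp.zeros(1),
    }

def forward(params, x):
    h = jax.nn.relu(params["W1"] @ x + params["b1"])
    return (params["W2"] @ h + params["b2"])[0]

def ntk_diag(params, x):
    # Empirical NTK entry Theta(x, x) = sum over parameters of (df/dtheta)^2.
    grads = jax.grad(forward)(params, x)
    return sum(jnp.sum(g ** 2) for g in jax.tree_util.tree_leaves(grads))

key = jax.random.PRNGKey(0)
x = jax.random.normal(jax.random.PRNGKey(1), (16,))
for n in [128, 512, 2048]:
    theta = ntk_diag(init_params(key, n), x)
    print(f"width {n:5d}: Theta(x, x) ~ {float(theta):.1f}")  # grows roughly like n
```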

In this note we propose an improved extrapolation of the standard parameterization to infinite width that resolves these inconsistencies while simultaneously leading to a well-defined neural tangent kernel. Namely, in this parameterization the resulting infinite width network maintains a learning rate scale that agrees with that used to train the original network, preserves the impact of relative layer widths on training dynamics for finite width networks, and similarly preserves the relative training dynamics of weights and biases.

Table 1: Equations describing a fully connected layer for each parameterization (naive standard, NTK, and improved standard), both for a finite width network and for the corresponding infinite width NNGP and NT kernels: the layer equation, the weight shape, the weight and bias initializations, the NNGP kernel, and the NT kernel (which diverges for the naive standard parameterization). Here $\tilde{n}^l$ is the baseline (finite network) width of layer $l$, and $s$ is a width-scaling factor that is taken to $\infty$ for infinite width networks.
Table 2: Equations describing a convolutional layer for each parameterization (naive standard, NTK, and improved standard), both for a finite width network and for the corresponding infinite width NNGP and NT kernels: the layer equation, the weight shape, the weight and bias initializations, the NNGP kernel, and the NT kernel (which diverges for the naive standard parameterization). We use Einstein notation for summation: indices that appear only in a single term are implicitly summed over. The convolution kernel has a fixed number of spatial positions, with one index ranging over spatial locations within the kernel and another over input spatial locations offset by the kernel position; $\tilde{n}^l$ is the baseline (finite network) channel count of layer $l$, $\mathcal{A}$ is the diagonal averaging operator defined in xiao18a and novak2018bayesian, and $s$ is a width-scaling factor that is taken to $\infty$ for infinite channel count networks.

2 Improved standard parameterization

Affine layers in neural networks are typically written as,

z^{l+1}_i = \sum_{j=1}^{n^l} W^{l+1}_{ij}\, x^l_j + b^{l+1}_i ,    (1)

where $z^{l+1}$ are pre-activations, $x^l$ are activations, $W^{l+1}$ are weights, and $b^{l+1}$ are biases. To preserve the scale of the pre-activations as the width of the network, $n^l$, is varied, one typically initializes the weights as $W^{l+1}_{ij} \sim \mathcal{N}\left(0, \sigma_w^2 / n^l\right)$ and biases as $b^{l+1}_i \sim \mathcal{N}(0, \sigma_b^2)$. However, as was noted in [jacot2018neural], this leads to divergent gradient flow dynamics as $n^l \rightarrow \infty$. In [jacot2018neural], the authors resolve this situation by using an alternative parameterization in which affine layers are written as,

z^{l+1}_i = \frac{\sigma_w}{\sqrt{n^l}} \sum_{j=1}^{n^l} W^{l+1}_{ij}\, x^l_j + \sigma_b\, b^{l+1}_i ,    (2)

where $W^{l+1}_{ij} \sim \mathcal{N}(0, 1)$ and $b^{l+1}_i \sim \mathcal{N}(0, 1)$. This leads to a well-behaved infinite-width limit, but involves a number of inconsistencies relative to standard neural networks.

The core idea here is to write the width of the neural network in each layer in terms of an auxiliary width-scaling parameter $s$, as $n^l = s\, \tilde{n}^l$. We then write an affine layer as,

z^{l+1}_i = \frac{1}{\sqrt{s}} \sum_{j=1}^{s \tilde{n}^l} W^{l+1}_{ij}\, x^l_j + b^{l+1}_i .    (3)

The infinite width limit can be taken by letting $s \rightarrow \infty$. The parameter variances and original layer widths instead appear in the variance of the initializer, $W^{l+1}_{ij} \sim \mathcal{N}\left(0, \sigma_w^2 / \tilde{n}^l\right)$ and $b^{l+1}_i \sim \mathcal{N}(0, \sigma_b^2)$ (as is typically done for finite width networks). A complete set of equations describing an affine layer, and the corresponding infinite width kernels, for this parameterization are given in Tables 1 and 2, for fully connected and convolutional architectures respectively.
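
To make the differences between equations (1)-(3) concrete, here is a minimal sketch of the three forward-pass parameterizations for a single affine layer. It is an illustrative reading of the equations above, not the paper's released implementation; the function names and layer conventions are ours. At initialization all three produce pre-activations with the same distribution; they differ only in how the width scaling is split between the forward pass and the initializer, which is what changes the gradient dynamics.

```python
# Sketch of one affine layer under the three parameterizations (eqs. 1-3).
# Names and conventions are illustrative, not the paper's released code.
import jax
import jax.numpy as jnp

def affine_standard_naive(key, x, n_out, sigma_w=1.0, sigma_b=0.05):
    # Eq. (1): scaling lives entirely in the initializer, z = W x + b,
    # W_ij ~ N(0, sigma_w^2 / n_in).
    n_in = x.shape[0]
    kw, kb = jax.random.split(key)
    W = jax.random.normal(kw, (n_out, n_in)) * sigma_w / jnp.sqrt(n_in)
    b = jax.random.normal(kb, (n_out,)) * sigma_b
    return W @ x + b

def affine_ntk(key, x, n_out, sigma_w=1.0, sigma_b=0.05):
    # Eq. (2): unit-variance parameters, scaling lives in the forward pass,
    # z = (sigma_w / sqrt(n_in)) W x + sigma_b b.
    n_in = x.shape[0]
    kw, kb = jax.random.split(key)
    W = jax.random.normal(kw, (n_out, n_in))
    b = jax.random.normal(kb, (n_out,))
    return sigma_w / jnp.sqrt(n_in) * (W @ x) + sigma_b * b

def affine_standard_improved(key, x, n_out_base, s, sigma_w=1.0, sigma_b=0.05):
    # Eq. (3), as reconstructed above: actual widths are s * base widths,
    # the initializer keeps the *base* fan-in, and a 1/sqrt(s) factor in the
    # forward pass absorbs the extra width so the NTK stays finite as s -> inf.
    n_in_base = x.shape[0] // s          # x is assumed to have width s * n_in_base
    kw, kb = jax.random.split(key)
    W = jax.random.normal(kw, (s * n_out_base, x.shape[0])) * sigma_w / jnp.sqrt(n_in_base)
    b = jax.random.normal(kb, (s * n_out_base,)) * sigma_b
    return (W @ x) / jnp.sqrt(s) + b
```

Note that at $s = 1$ the improved standard layer reduces exactly to the naive standard layer, which is what makes it an extrapolation of the standard parameterization rather than a new parameterization.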

A formal proof of convergence of the improved standard parameterization to the specified kernels is beyond the scope of this short note. However, we observe that the proof technique in lee2019wide applies with minimal modification. Additionally, Monte Carlo validation of the correctness of the introduced kernels is performed as part of the Neural Tangents [novak2020neural] unit test suite.
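
The Monte Carlo check described above can be approximated in a few lines of JAX: draw several random initializations of a finite but wide network, compute the empirical NTK $\Theta(x, x') = \sum_\theta \partial_\theta f(x)\, \partial_\theta f(x')$ for each draw, and average; as the width-scaling factor grows, the average should approach the analytic kernel. The sketch below shows only this estimator (it assumes some `init_fn` and scalar-output `apply_fn`, such as the layer functions sketched earlier composed into a network); it is not the library's actual test code.

```python
# Hypothetical Monte Carlo estimator of the empirical NTK, averaged over
# random initializations; not the Neural Tangents test suite itself.
import jax
import jax.numpy as jnp

def empirical_ntk(apply_fn, params, x1, x2):
    # Theta(x1, x2) = sum over parameters of df(x1)/dtheta * df(x2)/dtheta.
    g1 = jax.grad(apply_fn)(params, x1)
    g2 = jax.grad(apply_fn)(params, x2)
    return sum(jnp.vdot(a, b) for a, b in
               zip(jax.tree_util.tree_leaves(g1), jax.tree_util.tree_leaves(g2)))

def monte_carlo_ntk(init_fn, apply_fn, key, x1, x2, n_samples=32):
    # Average the empirical NTK over n_samples random initializations.
    estimates = []
    for k in jax.random.split(key, n_samples):
        params = init_fn(k)
        estimates.append(empirical_ntk(apply_fn, params, x1, x2))
    return jnp.mean(jnp.stack(estimates))
```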

Figure 1: Infinite width networks with various architectures achieve similar error when using the improved standard parameterization or the NTK parameterization, while the improved standard parameterization better matches properties of typical finite width networks. Each point compares the neural tangent kernel prediction error for the same architecture on CIFAR-10, but using the NTK (x-axis) or improved standard (y-axis) parameterization. (Upper) Each point corresponds to a different training set size, depth (for FC / Conv; a fixed number of blocks, 4, for WRN), and width (hidden width for FC / Conv, widening factor for WRN). FC is a fully connected network with constant hidden width, and Conv-Vec / Conv-GAP correspond to constant-channel convolutional neural networks without / with global average pooling. WRN-LN is a Wide Residual Network with four residual blocks and Batch Normalization layers replaced with Layer Normalization. (Lower) Each layer width of the fully connected architecture is randomly sampled.
Figure 2: For fully connected networks, the neural tangent kernel prediction for the improved standard parameterization can outperform that for the NTK parameterization, especially when the layer widths used in the standard parameterization are tuned. Experiments are performed on the CIFAR-10 dataset with networks with 5 hidden layers.
Figure 3: SGD trained finite width neural networks perform similarly when using the standard parameterization or the NTK parameterization. For all experiments, the network was trained with an MSE loss on the full CIFAR-10 dataset (45k/5k/10k split). Each point in FC corresponds to a different width, and each point in Conv-Vec and Conv-GAP corresponds to a different number of channels in {8, 11, 16, 23, 32, 45, 64, 90, 128, 181, 256, 362, 512}. All networks are ReLU networks. They were trained with vanilla SGD, without L2 regularization or data augmentation. A constant learning rate was grid searched over 20 log-spaced values in [0.01, 100]; for the standard parameterization, the learning rate is divided by the width. FC networks were trained with batch size 1024 for 3,000 epochs, whereas Conv networks were trained with batch size 256 for 10,000 epochs.

3 Experiments

In this section, we study empirical properties of infinite and finite width networks stemming from both the NTK and improved standard parameterizations. All of the experiments in this section were done using the Neural Tangents library [novak2020neural]. Here we focus on kernels corresponding to ReLU networks.
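
For context, kernel computations of this kind can be written with Neural Tangents along the following lines. This is a hedged sketch rather than the experiment code used for the figures: the exact keyword names (in particular the `parameterization` argument and how layer widths are specified) may differ between library versions, so the current Neural Tangents documentation should be treated as authoritative.

```python
# Hypothetical usage sketch; argument names (e.g. parameterization='standard')
# are assumptions and may differ across Neural Tangents versions.
import jax.numpy as jnp
import neural_tangents as nt
from neural_tangents import stax

def fc_kernel_fn(parameterization):
    # A small fully connected ReLU network; widths and variances are
    # illustrative, not the settings used in the figures.
    _, _, kernel_fn = stax.serial(
        stax.Dense(512, W_std=2.0 ** 0.5, b_std=0.05, parameterization=parameterization),
        stax.Relu(),
        stax.Dense(512, W_std=2.0 ** 0.5, b_std=0.05, parameterization=parameterization),
        stax.Relu(),
        stax.Dense(1, W_std=2.0 ** 0.5, b_std=0.05, parameterization=parameterization),
    )
    return kernel_fn

x_train = jnp.zeros((8, 32 * 32 * 3))   # placeholder inputs
ntk_kernel = fc_kernel_fn('ntk')(x_train, None, 'ntk')
std_kernel = fc_kernel_fn('standard')(x_train, None, 'ntk')
```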

In Figure 1 we compare the predictions of kernels for pairs of identical networks using either the improved standard or the NTK parameterization. We find that the performance of the kernels resulting from the two parameterizations is extremely similar, while the training dynamics of the improved standard parameterization network are expected to better match those of typical finite width networks. In Figure 2 we show that if the width parameters are carefully tuned, the neural tangent kernel for a fully connected network using the improved standard parameterization can outperform the kernel for an NTK parameterized network. In Figure 3, we show that SGD trained finite width networks using the standard and NTK parameterizations perform similarly.

4 Discussion

The analytic forms for the various kernels inspire some additional interesting observations:

  • For the NTK parameterization, the kernel resulting from a Bayesian neural network and the kernel resulting from gradient descent training of only the readout layer of an infinite width network are the same. For both the naive and improved standard parameterizations, however, the two differ.

  • For neural networks with a standard parameterization, the magnitude of the contribution of the bias to the neural tangent kernel (and thus to learning dynamics) remains constant with increasing width. However, the contribution of the weights to the learning dynamics grows like the layer width $n$ (see the short calculation below). We should thus expect that as networks become wide, the role played by the bias in training becomes less important.
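
As a quick check of this last point (our sketch, using only the readout layer of equation (1)): for a standard parameterized layer $z_i = \sum_{j=1}^{n} W_{ij} x_j + b_i$, the contribution of that layer's parameters to the neural tangent kernel is

\sum_{j=1}^{n} \frac{\partial z_i(x)}{\partial W_{ij}} \frac{\partial z_i(x')}{\partial W_{ij}} + \frac{\partial z_i(x)}{\partial b_i} \frac{\partial z_i(x')}{\partial b_i} = \sum_{j=1}^{n} x_j x'_j + 1 \approx n\, \mathbb{E}\left[x_j x'_j\right] + 1 ,

so the weight term grows linearly with the width $n$ while the bias term stays at 1.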

In this note, we introduced an improved extrapolation of finite width networks to infinite width that better matches the parameterization and learning dynamics of typical finite width networks. It is our hope that this will enable theory and experiments with infinite width networks to better explain the behavior of practical finite width networks.

References