Infinite width Bayesian [neal, lee2018deep, matthews2018, matthews2018b_arxiv, novak2018bayesian, garriga-alonso2018deep, NIPS2019_8809, yang2019scaling, yang2019wide, de2019random] and gradient descent trained [jacot2018neural, lee2019wide, chizat2019lazy, yang2019scaling, jacot2019freeze, dyer2019asymptotics, bietti2019inductive, arora2019on, arora2019harnessing, schwartz2019information] neural networks are an area of active and extremely promising work. There are currently two parameterizations used to derive fixed kernels corresponding to infinite width neural networks: the NTK parameterization [jacot2018neural, §2]; and the naive standard parameterization [park2019effect, §2.1]; [glorot2010understanding, he2016deep]. (Another line of work applies a different scaling, and derives non-fixed infinite width kernels [mei2018mean, mei2019mean, chizat2018global, Nguyen2019MeanFL].) However, the extrapolations of both of these parameterizations to infinite width fail to capture crucial aspects of finite width networks:
In finite width networks, differences in relative layer widths can have a profound effect on training dynamics. Under the NTK parameterization, as layer width goes to infinity, relative layer width has no effect on training dynamics or predictions.
As the naive standard parameterization is extended to large widths, the largest stable learning rate scales like the inverse of the width [karakida2018universal, Theorem 7]; [park2019effect, §H]. A learning rate that goes to zero as width goes to infinity poses a variety of practical and theoretical challenges, including a neural tangent kernel with entries that diverge to infinity.
At finite width, convolutional networks with an NTK parameterization have been reported to generalize more poorly than those with a standard parameterization [park2019effect, §I] (though we do not consistently reproduce this relationship in our own experiments, see Figure 3).
For neither NTK nor naive standard parameterization do infinite width learning rates agree closely with those typically used to train finite width standard parameterization networks.
The relative learning dynamics of bias and weight parameters are different in the NTK parameterization than they are for a standard parameterization finite-width network.
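The learning rate scaling in the naive standard parameterization can be checked numerically. The following is a minimal sketch, not taken from the original note: it assumes a one-hidden-layer ReLU network with a scalar output and a single input, and the helper names `empirical_ntk_diag` and `mean_ntk_diag` are hypothetical. Under a linearized model trained by gradient descent on squared loss for one example, updates are stable only for learning rates below $2/\Theta(x, x)$, so a kernel entry that grows with width implies a largest stable learning rate that shrinks with width.

```python
import math
import random


def empirical_ntk_diag(x, width, sigma_w, sigma_b, rng):
    """Theta(x, x) = sum over parameters of (df/dtheta)^2 at initialization.

    Assumed network (naive standard parameterization, for illustration only):
        h_j = sum_k w_jk x_k + b_j,    f = sum_j v_j relu(h_j) + c
    with w_jk ~ N(0, sigma_w^2 / d), v_j ~ N(0, sigma_w^2 / width),
    and b_j ~ N(0, sigma_b^2).
    """
    d = len(x)
    x_sq = sum(xk * xk for xk in x)
    theta = 1.0  # df/dc = 1 for the readout bias c
    for _ in range(width):
        h = sum(rng.gauss(0.0, sigma_w / math.sqrt(d)) * xk for xk in x)
        h += rng.gauss(0.0, sigma_b)
        v = rng.gauss(0.0, sigma_w / math.sqrt(width))
        act = max(h, 0.0)
        gate = 1.0 if h > 0.0 else 0.0
        theta += act * act                       # (df/dv_j)^2
        theta += (v * gate) ** 2 * (x_sq + 1.0)  # sum_k (df/dw_jk)^2 + (df/db_j)^2
    return theta


def mean_ntk_diag(x, width, sigma_w=math.sqrt(2.0), sigma_b=0.1, draws=20, seed=0):
    """Average Theta(x, x) over several random initializations."""
    rng = random.Random(seed)
    total = sum(empirical_ntk_diag(x, width, sigma_w, sigma_b, rng) for _ in range(draws))
    return total / draws


x = [1.0] * 8
for width in (64, 256, 1024):
    theta = mean_ntk_diag(x, width)
    print(f"width={width:5d}  Theta(x,x)={theta:9.1f}  2/Theta={2.0 / theta:.5f}")
```

In this sketch the term from the readout weights, $\sum_j \mathrm{relu}(h_j)^2$, is the one that grows with width; the hidden-layer and bias terms stay of order one.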
In this note we propose an improved extrapolation of the standard parameterization to infinite width that resolves these inconsistencies while simultaneously leading to a well-defined neural tangent kernel. Namely, in this parameterization the resulting infinite width network maintains a learning rate scale that agrees with that used to train the original network, preserves the impact of relative layer widths on training dynamics for finite width networks, and similarly preserves the relative training dynamics of weights and biases.
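Table 1: Affine layer equations and corresponding infinite width kernels for fully connected architectures. Columns: Parameterization | Standard (naive) | NTK | Standard (improved).
Table 2: Affine layer equations and corresponding infinite width kernels for convolutional architectures. Columns: Parameterization | Standard (naive) | NTK | Standard (improved).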
2 Improved standard parameterization
Affine layers in neural networks are typically written as
$$z^{l+1} = W^{l+1} y^{l} + b^{l+1},$$
where $z^{l}$ are pre-activations, $y^{l} = \phi(z^{l})$ are activations, $W^{l+1}$ are weights, and $b^{l+1}$ are biases. To preserve the scale of the pre-activations as the width of the network, $n^{l}$, is varied, one typically initializes the weights as $W^{l+1}_{ij} \sim \mathcal{N}(0, \sigma_w^2 / n^{l})$ and biases as $b^{l+1}_{i} \sim \mathcal{N}(0, \sigma_b^2)$. However, as was noted in [jacot2018neural], this leads to divergent gradient flow dynamics as $n^{l} \to \infty$. In [jacot2018neural], the authors resolve this situation by using an alternative parameterization where affine layers are written as
$$z^{l+1} = \frac{\sigma_w}{\sqrt{n^{l}}} W^{l+1} y^{l} + \sigma_b b^{l+1},$$
where $W^{l+1}_{ij}, b^{l+1}_{i} \sim \mathcal{N}(0, 1)$. This leads to a well-behaved infinite-width limit, but involves a number of inconsistencies relative to standard neural networks.
The core idea here is to write the width of the neural network in each layer in terms of an auxiliary parameter, $s$, as $n^{l} = s N^{l}$. We then write an affine layer as
$$z^{l+1} = \frac{1}{\sqrt{s}} W^{l+1} y^{l} + b^{l+1}.$$
The infinite width limit can be taken by letting $s \to \infty$. The parameter variances $\sigma_w^2$ and $\sigma_b^2$ and original layer widths $N^{l}$ instead appear in the variance of the initializer, $W^{l+1}_{ij} \sim \mathcal{N}(0, \sigma_w^2 / N^{l})$ and $b^{l+1}_{i} \sim \mathcal{N}(0, \sigma_b^2)$ (as is typically done for finite width networks). A complete set of equations describing an affine layer, and corresponding infinite width kernels, for this parameterization are given in Tables 1 and 2, for fully connected and convolutional architectures respectively.
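The three ways of writing an affine layer can be compared side by side. The sketch below is illustrative only and is not taken from the note or from Neural Tangents: the function names are hypothetical, and it assumes the notation $n = sN$ for the actual width, $N$ for the original layer width, and $\sigma_w$, $\sigma_b$ for the weight and bias standard deviations. It checks that all three forms give pre-activations of the same scale at initialization; where they differ is in how gradients with respect to the stored parameters scale with width.

```python
import math
import random


def affine_naive_standard(y, n_out, sigma_w, sigma_b, rng):
    """z = W y + b with W_ij ~ N(0, sigma_w^2 / n_in), b_i ~ N(0, sigma_b^2)."""
    w_std = sigma_w / math.sqrt(len(y))
    return [sum(rng.gauss(0.0, w_std) * yj for yj in y) + rng.gauss(0.0, sigma_b)
            for _ in range(n_out)]


def affine_ntk(y, n_out, sigma_w, sigma_b, rng):
    """z = (sigma_w / sqrt(n_in)) W y + sigma_b b with W_ij, b_i ~ N(0, 1)."""
    scale = sigma_w / math.sqrt(len(y))
    return [scale * sum(rng.gauss(0.0, 1.0) * yj for yj in y)
            + sigma_b * rng.gauss(0.0, 1.0)
            for _ in range(n_out)]


def affine_improved_standard(y, n_out, sigma_w, sigma_b, s, rng):
    """z = (1 / sqrt(s)) W y + b with W_ij ~ N(0, sigma_w^2 / N_in),
    b_i ~ N(0, sigma_b^2), where the actual input width is n_in = s * N_in."""
    base_width = len(y) / s  # N_in, the original layer width
    w_std = sigma_w / math.sqrt(base_width)
    inv_sqrt_s = 1.0 / math.sqrt(s)
    return [inv_sqrt_s * sum(rng.gauss(0.0, w_std) * yj for yj in y)
            + rng.gauss(0.0, sigma_b)
            for _ in range(n_out)]


def mean_square(z):
    return sum(v * v for v in z) / len(z)


rng = random.Random(0)
y = [1.0] * 64  # activations with mean square 1; n_in = 64 = s * N_in with s = 4, N_in = 16
target = 1.5 ** 2 * 1.0 + 0.5 ** 2  # sigma_w^2 * mean(y^2) + sigma_b^2
print("target variance:", target)
print("naive standard: ", mean_square(affine_naive_standard(y, 2000, 1.5, 0.5, rng)))
print("NTK:            ", mean_square(affine_ntk(y, 2000, 1.5, 0.5, rng)))
print("improved:       ", mean_square(affine_improved_standard(y, 2000, 1.5, 0.5, 4, rng)))
```

Note that in the improved form the factor $1/\sqrt{s}$ multiplies only the weights, and $\sigma_w$, $\sigma_b$, and $N$ stay in the initializer, which is the design choice the text describes.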
A formal proof of convergence of the improved standard parameterization to the specified kernels is beyond the scope of this short note. However, we observe that the proof technique in [lee2019wide] applies with minimal modification. Additionally, Monte Carlo validation of the correctness of the introduced kernels is performed as part of the Neural Tangents [novak2020neural] unit test suite.
correspond to constant channel convolutional neural networks without / with global average pooling. WRN-LN is a Wide Residual Network with four residual blocks and Batch Normalization layers replaced with Layer Normalization. (Lower) Each layer width of the fully connected architecture is randomly sampled from with .
In this section, we study empirical properties of infinite and finite width networks stemming from both the NTK and improved standard parameterization. All of the experiments in this section were done using the Neural Tangents library [novak2020neural]. Here we focus on kernels corresponding to ReLU networks with .
In Figure 1 we compare the predictions of kernels for pairs of identical networks, using either the improved standard or the NTK parameterization. We find that the performance of the kernels resulting from the two parameterizations is extremely similar, while the training dynamics of the improved standard parameterization network are expected to better match those of typical finite width networks. In Figure 2 we show that if the width parameter is carefully tuned, then the neural tangent kernel for a fully connected network using the improved standard parameterization can outperform the kernel for an NTK parameterized network. In Figure 3, we show that random finite width networks using the standard and NTK parameterization perform similarly.
The analytic forms for the various kernels inspire some additional interesting observations:
For the NTK parameterization, the kernel resulting from a Bayesian neural network and the kernel resulting from gradient descent training of the readout layer of an infinite width network are the same. For both the naive and improved standard parameterization, however, the two differ.
For neural networks with a standard parameterization, the magnitude of the contribution of the bias to the neural tangent kernel (and thus to learning dynamics) remains constant with increasing width. However, the contribution of the weights to the learning dynamics grows in proportion to the layer width. We should thus expect that as networks become wide, the role played by the bias in training becomes less important.
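Both observations can be illustrated for a single readout layer with fixed incoming activations. This is a sketch under stated assumptions rather than the note's own derivation: the function names are hypothetical, only the readout weights and biases are treated as trainable, and the activations are a fixed deterministic pattern instead of the output of a real network. The gradient kernel of a readout layer is the sum of a weight term (squared gradients with respect to the weights) and a bias term (squared gradient with respect to the bias).

```python
import math


def readout_kernel_terms(y1, y2, param, sigma_w, sigma_b, s=1):
    """Weight and bias contributions to the kernel obtained by training
    only the readout layer, given incoming activations y1 and y2."""
    n = len(y1)
    overlap = sum(a * b for a, b in zip(y1, y2))
    if param == "naive_standard":      # z = W y + b
        return overlap, 1.0
    if param == "ntk":                 # z = (sigma_w / sqrt(n)) W y + sigma_b b
        return sigma_w ** 2 * overlap / n, sigma_b ** 2
    if param == "improved_standard":   # z = (1 / sqrt(s)) W y + b, with n = s * N
        return overlap / s, 1.0
    raise ValueError(f"unknown parameterization: {param}")


def bayesian_kernel(y1, y2, sigma_w, sigma_b):
    """Prior covariance of the readout pre-activations. It is the same for all
    three parameterizations, since they share the same prior over z."""
    n = len(y1)
    overlap = sum(a * b for a, b in zip(y1, y2))
    return sigma_w ** 2 * overlap / n + sigma_b ** 2


sigma_w, sigma_b, base_width = math.sqrt(2.0), 0.1, 8
for s in (1, 16, 64):
    n = s * base_width
    y = [(j % 4) * 0.5 for j in range(n)]  # fixed illustrative activations
    w_std, b_std = readout_kernel_terms(y, y, "naive_standard", sigma_w, sigma_b)
    w_ntk, b_ntk = readout_kernel_terms(y, y, "ntk", sigma_w, sigma_b)
    w_imp, b_imp = readout_kernel_terms(y, y, "improved_standard", sigma_w, sigma_b, s=s)
    print(f"n={n:4d}  naive: weights={w_std:7.2f} bias={b_std:.2f}"
          f"  improved: weights={w_imp:5.2f} bias={b_imp:.2f}"
          f"  NTK total={w_ntk + b_ntk:.3f}"
          f"  Bayesian={bayesian_kernel(y, y, sigma_w, sigma_b):.3f}")
```

In the naive standard case the weight term grows with the width $n$ while the bias term stays at 1. In the improved case the weight term is held at a value set by the original width $N$, and in neither standard form does the sum match the Bayesian kernel, whereas the NTK form matches it term by term.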
In this note, we introduced an improved extrapolation of finite width networks to infinite width that better matches the parameterization and learning dynamics of typical finite width networks. It is our hope that this will enable theory and experiments with infinite width networks to better explain the behavior of practical finite width networks.