Wide Neural Networks of Any Depth Evolve as Linear Models Under Gradient Descent

02/18/2019 · Jaehoon Lee, et al. · Google

A longstanding goal in deep learning research has been to precisely characterize training and generalization. However, the often complex loss landscapes of neural networks have made a theory of learning dynamics elusive. In this work, we show that for wide neural networks the learning dynamics simplify considerably and that, in the infinite width limit, they are governed by a linear model obtained from the first-order Taylor expansion of the network around its initial parameters. Furthermore, mirroring the correspondence between wide Bayesian neural networks and Gaussian processes, gradient-based training of wide neural networks with a squared loss produces test set predictions drawn from a Gaussian process with a particular compositional kernel. While these theoretical results are only exact in the infinite width limit, we nevertheless find excellent empirical agreement between the predictions of the original network and those of the linearized version even for finite practically-sized networks. This agreement is robust across different architectures, optimization methods, and loss functions.


1 Introduction

Machine learning models based on deep neural networks have achieved unprecedented performance across a wide range of tasks Krizhevsky et al. (2012); He et al. (2016); Devlin et al. (2018). Typically, these models are regarded as complex systems for which many types of theoretical analyses are intractable. Moreover, characterizing the gradient-based training dynamics of these models is challenging owing to the typically high-dimensional non-convex loss surfaces governing the optimization. As is common in the physical sciences, investigating the extreme limits of such systems can often shed light on these hard problems. For neural networks, one such limit is that of infinite width, which refers either to the number of hidden units in a fully-connected layer or to the number of channels in a convolutional layer. Under this limit, the output of the network at initialization is a draw from a Gaussian process (GP); moreover, the network output remains governed by a GP after exact Bayesian training using squared loss (Neal, 1994; Lee et al., 2018; Matthews et al., 2018; Novak et al., 2019; Garriga-Alonso et al., 2018). Aside from its theoretical simplicity, the infinite-width limit is also of practical interest as wider networks have been found to generalize better Neyshabur et al. (2015); Novak et al. (2018); Lee et al. (2018); Novak et al. (2019); Neyshabur et al. (2019).

In this work, we explore the learning dynamics of wide neural networks under gradient descent and find that the weight-space description of the dynamics becomes surprisingly simple: as the width becomes large, the neural network can be effectively replaced by its first-order Taylor expansion with respect to its parameters at initialization. For this induced linear model, the dynamics of gradient descent become analytically tractable. While the linearization is only exact in the infinite width limit, we nevertheless find excellent agreement between the predictions of the original network and those of the linearized version even for finite width configurations. The agreement persists across different architectures, optimization methods, and loss functions.

For squared loss, the exact learning dynamics admit a closed-form solution that allows us to characterize the evolution of the predictive distribution in terms of a GP. This result can be thought of as an extension of "sample-then-optimize" posterior sampling Matthews et al. (2017) to the training of deep neural networks. Our empirical simulations confirm that the result accurately models the variation in predictions across an ensemble of finite-width models with different random initializations.

1.1 Our Contribution

We begin by building on a recent result by Jacot et al. (2018) that characterizes the exact dynamics of network outputs throughout gradient descent training in the infinite width limit. Their results establish that gradient descent in parameter space corresponds to kernel gradient descent in function space with respect to a new kernel, the Neural Tangent Kernel (NTK). One may ask what this tells us about the nature of the dynamics in parameter space, where training updates are actually made. A key contribution of our work is to show that dynamics in parameter space are equivalent to the training dynamics of a model which is affine in the collection of all network parameters, the weights and biases. This result holds regardless of the choice of loss function. In the case of squared loss, the dynamics admit a closed-form solution as a function of time.

The output of an infinitely wide neural network is Gaussian at initialization (Lee et al., 2018; Matthews et al., 2018), and, as mentioned in Jacot et al. (2018), for squared loss it remains Gaussian throughout training (the setting is full-batch training under gradient flow). We derive explicit time-dependent expressions for the mean and covariance functions of this GP, and provide a novel interpretation of the result. In particular, it offers a quantitative understanding of the mechanism by which gradient descent differs from Bayesian posterior sampling of the parameters: while both methods generate draws from a GP, gradient descent does not generate samples from the posterior of any probabilistic model. This observation is in contrast to the "sample-then-optimize" framework of Matthews et al. (2017), in which only the top-layer weights are trained and gradient descent does sample from the Bayesian posterior. These observations establish a framework with which to analyze the long-standing question of whether, how, and in what contexts gradient descent provides concrete benefits relative to Bayesian inference.

As argued by Chizat & Bach (2018b), these theoretical results may appear too simple to be applicable to realistic neural networks. Nonetheless, we empirically investigate the applicability of the theory in the finite-width setting and find that it gives an accurate characterization of both the learning dynamics and posterior function distributions across a variety of conditions, including some practical network architectures such as the wide residual network Zagoruyko & Komodakis (2016).

1.2 Additional related work

Daniely et al. (2016) study the relationship between neural networks and kernels at initialization. They bound the difference between the infinite width kernel and the empirical kernel at finite width n, which diminishes as O(1/√n). Daniely (2017) uses the same kernel perspective to study stochastic gradient descent (SGD) training of neural networks.

Saxe et al. (2013) study the training dynamics of deep linear networks, in which the nonlinearities are treated as identity functions. Deep linear networks are linear in their inputs, but not their parameters. In contrast, we show that the outputs of sufficiently wide neural networks are linear in their parameters but not usually their inputs.

Du et al. (2018); Allen-Zhu et al. (2018a, b); Zou et al. (2018) study the convergence of gradient descent to global minima. They proved that for i.i.d. Gaussian initialization, the parameters of sufficiently wide networks move little from their initial values during SGD. This small motion of the parameters is crucial to the effect we present, where wide neural networks behave linearly in terms of their parameters throughout training.

Mei et al. (2018); Chizat & Bach (2018a); Rotskoff & Vanden-Eijnden (2018); Sirignano & Spiliopoulos (2018) analyze the mean field SGD dynamics of training neural networks in the large-width limit. Their mean field analysis describes distributional dynamics of network parameters via a PDE. However, their analysis is restricted to networks with a single hidden layer and to a scaling limit (1/n) different from ours (1/√n), the latter being the scaling commonly used in modern networks He et al. (2016); Glorot & Bengio (2010).

Finally, Zhang et al. (2019) observed that some of the layers in trained neural networks are robust to re-initialization, but not to re-randomization. Our framework provides some theoretical support for this empirical finding.

2 Theoretical Results

2.1 Notation and setup

Let D ⊂ R^{n_0} × R^k denote the training set, and let X and Y denote the inputs and labels, respectively. Consider a fully-connected feed-forward network with L hidden layers of widths n_l, for l = 1, ..., L, and a readout layer with n_{L+1} = k. For each x ∈ R^{n_0}, we use h^l(x), x^l(x) ∈ R^{n_l} to represent the pre- and post-activation functions at layer l with input x. The recurrence relation for a feed-forward network is defined as

    h^{l+1} = x^l W^{l+1} + b^{l+1},    x^{l+1} = φ(h^{l+1}),
    W^{l+1}_{ij} = (σ_ω / √n_l) ω^{l+1}_{ij},    b^{l+1}_j = σ_b β^{l+1}_j,       (1)

where φ is a point-wise activation function, W^{l+1} ∈ R^{n_l × n_{l+1}} and b^{l+1} ∈ R^{n_{l+1}} are the weights and biases, ω^{l+1}_{ij} and β^{l+1}_j are the trainable variables, drawn i.i.d. from a standard Gaussian at initialization, and σ_ω² and σ_b² are weight and bias variances. Note that this parametrization is non-standard (see SM Section E for further analysis), and we will refer to it as the NTK parameterization. It has already been adopted in several recent works van Laarhoven (2017); Karras et al. (2018); Jacot et al. (2018); Du et al. (2018); Park et al. (2018). Unlike the standard parameterization, which only normalizes the forward dynamics of the network, the NTK parameterization also normalizes its backward dynamics. We note that the predictions and training dynamics of NTK-parameterized networks are identical to those of standard networks, up to a width-dependent scaling factor in the learning rate for each parameter tensor. We compare the test performance of these two parameterizations in Figure S1 of the SM. Our results (linearity in weights, GP predictions) also hold for infinitely wide networks with a standard parameterization (see SM Section E and Figure S2).
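To make the setup concrete, here is a minimal sketch (not the authors' code) of an MLP in the NTK parameterization of Equation 1, written in JAX. The function names (init_params, mlp) and the values of σ_ω and σ_b are our own illustrative choices.

```python
# Minimal sketch of an MLP in the NTK parameterization of Equation 1: the trainable
# variables omega, beta are standard Gaussian at init, and the sigma_w / sqrt(n_l)
# and sigma_b factors live in the forward pass.
import jax
import jax.numpy as jnp

def init_params(key, layer_widths):
    """Standard-Gaussian trainable variables (omega^{l+1}, beta^{l+1}) for each layer."""
    params = []
    for n_in, n_out in zip(layer_widths[:-1], layer_widths[1:]):
        key, w_key, b_key = jax.random.split(key, 3)
        params.append((jax.random.normal(w_key, (n_in, n_out)),
                       jax.random.normal(b_key, (n_out,))))
    return params

def mlp(params, x, sigma_w=1.5, sigma_b=0.05, phi=jnp.tanh):
    """h^{l+1} = (sigma_w / sqrt(n_l)) x^l omega^{l+1} + sigma_b beta^{l+1}."""
    h = x
    for i, (omega, beta) in enumerate(params):
        z = sigma_w / jnp.sqrt(omega.shape[0]) * h @ omega + sigma_b * beta
        h = phi(z) if i < len(params) - 1 else z  # no nonlinearity on the readout layer
    return h

key = jax.random.PRNGKey(0)
params = init_params(key, [784, 2048, 2048, 10])  # n_0 = 784, two hidden layers, k = 10
x = jax.random.normal(jax.random.PRNGKey(1), (3, 784))
print(mlp(params, x).shape)  # (3, 10)
```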

We set θ^l ≡ vec({W^l, b^l}), the collection of parameters mapping into the l-th layer, and θ ≡ vec(∪_{l=1}^{L+1} θ^l) the collection of all parameters; θ^{≤l} and θ^{>l} are defined similarly. Denote by θ_t the parameters at time t and by θ_0 their initial values. We use f_t(x) ≡ h^{L+1}(x) ∈ R^k to denote the output (or logits) of the neural network at time t. Let ℓ(ŷ, y) denote the loss function, where the first argument is the prediction and the second argument the true label. In supervised learning, one is interested in learning a θ that minimizes the empirical loss (to simplify the notation for later equations, we use the total loss here instead of the average loss, but for all plots in Section 3 we show the average loss),

    L = Σ_{(x,y) ∈ D} ℓ(f_t(x; θ), y).       (2)

Let η be the learning rate (note that compared to the conventional parameterization, η is larger by a factor of width; the NTK parameterization allows usage of a universal learning rate scale irrespective of network width). Via continuous time gradient descent, the evolution of the parameters θ and the logits f can be written as

    θ̇_t = −η ∇_θ f_t(X)^T ∇_{f_t(X)} L       (3)
    ḟ_t(X) = ∇_θ f_t(X) θ̇_t = −η Θ̂_t(X, X) ∇_{f_t(X)} L       (4)

where f_t(X) = vec([f_t(x)]_{x ∈ D}) is the k|D| × 1 vector of concatenated logits for all examples, ∇_{f_t(X)} L is the gradient of the loss with respect to the model's output f_t(X), and Θ̂_t ≡ Θ̂_t(X, X) is the tangent kernel at time t, which is a k|D| × k|D| matrix

    Θ̂_t(X, X) = ∇_θ f_t(X) ∇_θ f_t(X)^T.       (5)

One can define the tangent kernel for general arguments, e.g. Θ̂_t(x, X) where x is a test input. We refer to Θ̂_t as the empirical tangent kernel.
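The following is a sketch of how the empirical tangent kernel of Equation 5 can be computed by explicitly materializing the Jacobian of the concatenated logits. It reuses the hypothetical mlp and params from the sketch above and is only practical for small models and datasets.

```python
# Sketch of the empirical tangent kernel of Equation 5: Theta_hat = J J^T, where J is
# the Jacobian of the concatenated logits with respect to all (flattened) parameters.
import jax
import jax.numpy as jnp
from jax.flatten_util import ravel_pytree

def empirical_ntk(params, x1, x2):
    flat_params, unravel = ravel_pytree(params)
    def f(flat, inputs):
        return mlp(unravel(flat), inputs).reshape(-1)   # concatenated logits
    j1 = jax.jacobian(f)(flat_params, x1)               # (k*|x1|, num_params)
    j2 = jax.jacobian(f)(flat_params, x2)               # (k*|x2|, num_params)
    return j1 @ j2.T                                    # (k*|x1|, k*|x2|)

theta_0 = empirical_ntk(params, x, x)   # empirical tangent kernel at initialization
```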

2.2 Linearized networks

In this section, we consider the training dynamics of the linearized network. Specifically, we replace the outputs of the neural network by their first-order Taylor expansion,

    f^lin_t(x) ≡ f_0(x) + ∇_θ f_0(x) ω_t,       (6)

where ω_t ≡ θ_t − θ_0 is the change in the parameters from their initial values (since f^lin_0 = f_0, we will often drop the superscript at time 0). Note that f^lin_t(x) is the sum of two terms: the first term is the initial output of the network, which remains unchanged during training, and the second term captures the change to the initial value during training. The dynamics of gradient flow using this linearized function are governed by

    ω̇_t = −η ∇_θ f_0(X)^T ∇_{f^lin_t(X)} L       (7)
    ḟ^lin_t(x) = −η Θ̂_0(x, X) ∇_{f^lin_t(X)} L       (8)

As ∇_θ f_0(x) remains constant throughout training, these dynamics are often quite simple. In the case of an MSE loss, i.e. ℓ(ŷ, y) = ½ ‖ŷ − y‖_2², the ODEs have closed-form solutions

    ω_t = −∇_θ f_0(X)^T Θ̂_0^{-1} (I − e^{−η Θ̂_0 t}) (f_0(X) − Y)       (9)
    f^lin_t(X) = (I − e^{−η Θ̂_0 t}) Y + e^{−η Θ̂_0 t} f_0(X)       (10)

For an arbitrary point x, f^lin_t(x) = μ_t(x) + γ_t(x), where

    μ_t(x) = Θ̂_0(x, X) Θ̂_0^{-1} (I − e^{−η Θ̂_0 t}) Y       (11)
    γ_t(x) = f_0(x) − Θ̂_0(x, X) Θ̂_0^{-1} (I − e^{−η Θ̂_0 t}) f_0(X)       (12)

Therefore, we can obtain the time evolution of the linearized neural network without training it: we only need to compute the tangent kernel Θ̂_0 and the outputs f_0 at initialization and use Equations 11, 12, and 9 to compute the dynamics of the outputs and the weights.
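The closed-form solutions can be evaluated directly. Below is a sketch implementing Equations 11 and 12 for test points, forming e^{−η Θ̂_0 t} via an eigendecomposition of the symmetric PSD empirical kernel; train_x, train_y (shape (|D|, k)), test_x, η, and t are placeholders, and empirical_ntk / mlp refer to the earlier sketches.

```python
# Sketch of the closed-form MSE dynamics of Equations 11-12 on test points.
def expm_sym(mat, scale):
    """exp(scale * mat) for a symmetric matrix, via eigendecomposition."""
    evals, evecs = jnp.linalg.eigh(mat)
    return (evecs * jnp.exp(scale * evals)) @ evecs.T

def linearized_predictions(params, train_x, train_y, test_x, t, eta=1.0):
    theta_dd = empirical_ntk(params, train_x, train_x)   # Theta_hat_0(X, X)
    theta_td = empirical_ntk(params, test_x, train_x)    # Theta_hat_0(x, X)
    f0_train = mlp(params, train_x).reshape(-1)
    f0_test = mlp(params, test_x).reshape(-1)
    decay = jnp.eye(theta_dd.shape[0]) - expm_sym(theta_dd, -eta * t)  # I - e^{-eta Theta t}
    mu = theta_td @ jnp.linalg.solve(theta_dd, decay @ train_y.reshape(-1))    # Equation 11
    gamma = f0_test - theta_td @ jnp.linalg.solve(theta_dd, decay @ f0_train)  # Equation 12
    return mu + gamma   # f_lin_t(test_x)
```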

2.3 Infinite width limit yields Gaussian processes

As the width of the hidden layers approaches infinity, the Central Limit Theorem (CLT) implies that the outputs at initialization {f_0(x)} converge to a multivariate Gaussian in distribution. Informally, this can be proved by induction. Conditioned on the activations x^l at layer l, each pre-activation h^{l+1}_i at layer l + 1 is a sum of n_l i.i.d., properly normalized random variables (the weight contributions) plus a Gaussian bias term. One can apply the CLT to conclude that the h^{l+1}_i are i.i.d. Gaussian; see Poole et al. (2016); Schoenholz et al. (2017); Lee et al. (2018); Xiao et al. (2018); Yang & Schoenholz (2017) for more details, and Matthews et al. (2018); Novak et al. (2019) for a formal treatment.

Therefore, randomly initialized neural networks are in correspondence with a certain class of Gaussian processes (hereinafter referred to as NNGPs), which facilitates a fully Bayesian treatment of neural networks (Lee et al., 2018; Matthews et al., 2018). More precisely, let f_i denote the i-th output dimension and K denote the sample-to-sample kernel function (of the pre-activation) of the outputs in the infinite width setting,

    K_{i,j}(x, x') = lim_{min(n_1, ..., n_L) → ∞} E[ f_{0,i}(x) · f_{0,j}(x') ],       (13)

then f_0(X) ~ N(0, K(X, X)), where K_{i,j}(x, x') denotes the covariance between the i-th output for x and the j-th output for x', which can be computed recursively (see Lee et al. (2018), Section 2.3, and Supplementary Material (SM) Section A). For an unobserved test input x*, the joint output distribution over (x*, X) is also a Gaussian process. Conditioning on the training samples (this imposes that f_0(X) = Y directly corresponds to the network predictions; in the case of softmax readout, variational or sampling methods are required to marginalize over f_0(X)), the posterior predictive distribution of f(x*) is also Gaussian,

    f(x*) | X, Y ~ N( K(x*, X) K^{-1} Y,  K(x*, x*) − K(x*, X) K^{-1} K(X, x*) ),       (14)

where K = K(X, X). This is the posterior predictive distribution resulting from exact Bayesian inference in an infinitely wide neural network.
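For reference, a short sketch of the posterior predictive computation in Equation 14 for a scalar output, given any kernel function kernel(x1, x2) returning a matrix (an analytic NNGP kernel or a Monte Carlo estimate); the jitter term is our own numerical-stability assumption, not part of Equation 14.

```python
# Sketch of the GP posterior predictive of Equation 14 (scalar output).
def nngp_posterior(kernel, train_x, train_y, test_x, jitter=1e-6):
    k_dd = kernel(train_x, train_x) + jitter * jnp.eye(train_x.shape[0])
    k_td = kernel(test_x, train_x)
    k_tt = kernel(test_x, test_x)
    mean = k_td @ jnp.linalg.solve(k_dd, train_y)
    cov = k_tt - k_td @ jnp.linalg.solve(k_dd, k_td.T)
    return mean, cov
```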

2.3.1 Gaussian processes from gradient descent training

If we freeze the variables θ^{≤L} after initialization and only optimize θ^{L+1}, the original network and its linearization are identical. Letting the width approach infinity, this particular tangent kernel will converge to K in probability, and Equation 11 will converge to the posterior in Equation 14 as t → ∞ (for further details see SM Section D). This is a realization of the "sample-then-optimize" approach for evaluating the posterior of a Gaussian process proposed in Matthews et al. (2017).

If none of the variables are frozen, then in the infinite width setting Θ̂_0 also converges in probability to a deterministic kernel Θ Jacot et al. (2018), which we sometimes refer to as the analytic kernel, and which can also be computed recursively (see SM Section A). Letting the width go to infinity, for any t the output f^lin_t(x) is also Gaussian distributed, because Equations 11 and 12 describe an affine transform of the Gaussian (f_0(x), f_0(X)). The mean μ_t(x) and the covariance Σ_t(x, x') are given by

    μ_t(x) = Θ(x, X) Θ^{-1} (I − e^{−η Θ t}) Y       (15)
    Σ_t(x, x') = K(x, x') + Θ(x, X) Θ^{-1} (I − e^{−η Θ t}) K (I − e^{−η Θ t}) Θ^{-1} Θ(X, x')
                 − ( Θ(x, X) Θ^{-1} (I − e^{−η Θ t}) K(X, x') + h.c. )       (16)

where Θ = Θ(X, X) and K = K(X, X). Unlike the case when only θ^{L+1} is optimized, Equations 15 and 16 do not admit an interpretation corresponding to the posterior sampling of a probabilistic model (one possible exception is when the NNGP kernel and the NTK are the same up to a scalar multiplication). We contrast the predictive distributions from the NNGP, the NTK-GP (i.e. Equations 15 and 16), and ensembles of NNs in Figure 3.
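The following sketch evaluates Equations 15 and 16: given the NNGP kernel blocks and NTK blocks as matrices, it returns the mean and covariance of f^lin_t on test points over random initializations. It reuses expm_sym from the earlier sketch; the block names (K_dd, T_td, ...) and Y (flattened training targets) are our own conventions.

```python
# Sketch of Equations 15-16 ('_dd' = train/train, '_td' = test/train, '_tt' = test/test).
def ntk_gp_moments(K_dd, K_td, K_tt, T_dd, T_td, Y, t, eta=1.0):
    decay = jnp.eye(T_dd.shape[0]) - expm_sym(T_dd, -eta * t)
    A = T_td @ jnp.linalg.solve(T_dd, decay)     # Theta(x, X) Theta^{-1} (I - e^{-eta Theta t})
    mean = A @ Y                                 # Equation 15
    cov = K_tt + A @ K_dd @ A.T - (A @ K_td.T + K_td @ A.T)   # Equation 16
    return mean, cov
```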

2.4 Infinite width networks are linearized networks

The ODEs (Equations 3, 4) of the original network are unsolvable in general, since Θ̂_t evolves with time. Remarkably, Jacot et al. (2018) showed that

    sup_{t ≤ T} ‖Θ̂_t − Θ̂_0‖_F = O(n^{-1/2}),       (17)

under the technical assumption that the integrated functional gradient norm remains stochastically bounded as n_1, ..., n_L → ∞ sequentially, i.e.

    ∫_0^T ‖∇_{f_t(X)} L‖_2 dt = O_P(1),       (18)

where T is the training budget (independent of the widths), i.e. the amount of allowable training time. This assumption was verified in Jacot et al. (2018) in some specific cases. Recent work on the convergence theory of over-parameterized neural networks provides a rigorous proof of Equation 18 in the simultaneous limit min(n_1, ..., n_L) → ∞, under discrete-time stochastic gradient descent with MSE loss on various architectures (fully-connected networks, residual CNNs, RNNs) (Du et al., 2018; Allen-Zhu et al., 2018b, a; Zou et al., 2018). Note that the bound in Equation 17 is just a theoretical upper bound, and we observe empirically that the exponent may be improved to −1; see Figure 2.

Coupling Equation 18 with Grönwall-type arguments, in the MSE setting we can upper bound the discrepancy between the outputs of the original network and those of its linearization,

    sup_{t ≤ T} ‖f_t(x) − f^lin_t(x)‖_2 = O(n^{-1/2}),       (19)

which approaches 0 as the width goes to infinity. Intuitively, the ODE of the original network (Equation 4) can be considered as an O(n^{-1/2}) fluctuation around the linearized ODE (Equation 8). One expects the difference between the solutions of these two ODEs to be upper bounded by some functional of ‖Θ̂_t − Θ̂_0‖_F (refer to SM Section F for the proof). Therefore, for a large-width network, the training dynamics can be well approximated by linearized dynamics.
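The finite-width agreement behind Equation 19 can be checked directly by training a network and its linearization side by side with discrete-time full-batch gradient descent on MSE and monitoring the output discrepancy. The sketch below uses illustrative learning rate and step count and reuses the hypothetical mlp from the earlier sketches.

```python
# Sketch: train the original network and its linearization side by side and report the
# sup-norm output discrepancy on the training set (cf. Equation 19).
import jax
import jax.numpy as jnp
from jax.flatten_util import ravel_pytree

def linearization_gap(params, train_x, train_y, eta=1.0, steps=200):
    flat0, unravel = ravel_pytree(params)
    f = lambda flat, inputs: mlp(unravel(flat), inputs).reshape(-1)
    y = train_y.reshape(-1)
    jac0 = jax.jacobian(f)(flat0, train_x)          # gradient features at initialization
    loss = lambda flat: 0.5 * jnp.sum((f(flat, train_x) - y) ** 2)

    flat, omega = flat0, jnp.zeros_like(flat0)      # omega_t = theta_t - theta_0
    for _ in range(steps):
        flat = flat - eta * jax.grad(loss)(flat)                 # original network
        f_lin = f(flat0, train_x) + jac0 @ omega                 # Equation 6
        omega = omega - eta * jac0.T @ (f_lin - y)               # discretized Equation 7
    f_lin = f(flat0, train_x) + jac0 @ omega
    return jnp.max(jnp.abs(f(flat, train_x) - f_lin))            # sup-norm discrepancy
```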

Note that the updates for individual weights in Equation 7 vanish in the infinite width limit, which for instance can be seen from the explicit width dependence of the gradients in the NTK parameterization. Individual weights move by a vanishingly small amount for wide networks in this regime of dynamics, as do hidden layer activations, but they collectively conspire to provide a finite change in the final output of the network, as is necessary for training.

An additional insight gained from linearization of the network is that the dynamics derived in Jacot et al. (2018) are equivalent to a random features method, where the features are the gradients of the model with respect to its weights.

2.5 Extensions

The linearization of wide neural networks and their training dynamics can be generalized in multiple directions.

2.5.1 Momentum

One direction is to go beyond vanilla gradient descent dynamics. We consider momentum updates (combining the usual two-stage update into a single equation),

    θ_{i+1} = θ_i + β (θ_i − θ_{i-1}) − η ∇_θ L |_{θ = θ_i}.       (20)

The discrete update to the function output becomes

    f^lin_{i+1}(x) = f^lin_i(x) + β ( f^lin_i(x) − f^lin_{i-1}(x) ) − η Θ̂_0(x, X) ∇_{f^lin_i(X)} L,       (21)

where f^lin_i(x) is the output of the linearized network after i steps. One can take the continuous time limit as in Qian (1999); Su et al. (2014) and obtain

    d²ω_t/dt² = −β̃ dω_t/dt − ∇_θ f_0(X)^T ∇_{f^lin_t(X)} L       (22)
    d²f^lin_t(x)/dt² = −β̃ df^lin_t(x)/dt − Θ̂_0(x, X) ∇_{f^lin_t(X)} L       (23)

where continuous time relates to steps as t = i√η and β̃ = (1 − β)/√η. These equations are also amenable to analytic treatment for MSE loss. See Figures 5, 6, and 7 for experimental agreement.
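Below is a sketch of the discrete momentum dynamics of Equation 21 applied directly in function space on the training set for MSE loss; β, η, and the step count are illustrative, and theta_dd / f0_train / Y are the training-set kernel, initial logits, and targets as in the earlier sketches.

```python
# Sketch of the momentum update of Equation 21 iterated in function space (MSE loss).
def momentum_function_space(f0_train, theta_dd, Y, eta=1.0, beta=0.9, steps=500):
    f_prev = f_curr = f0_train                       # f_lin_0 = f_0
    for _ in range(steps):
        grad_f = f_curr - Y                          # MSE: dL/df = f - Y
        f_next = f_curr + beta * (f_curr - f_prev) - eta * theta_dd @ grad_f  # Equation 21
        f_prev, f_curr = f_curr, f_next
    return f_curr
```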

Figure 1: Kernel convergence. Kernels computed from randomly initialized networks with one and three hidden layers converge to the corresponding analytic kernels as the width and the number of Monte Carlo samples increase. Colors indicate averages over different numbers of Monte Carlo samples.
Figure 2: Relative Frobenius norm change during training. One-hidden-layer networks of varying width. We measure the changes of the (read-out/non-read-out) weights, the empirical Θ̂, and the empirical K̂ after a fixed number of gradient descent steps. We see that the change in weights scales as 1/√n, whereas the changes in Θ̂ and K̂ are bounded by O(1/√n) but are closer to O(1/n).

2.5.2 Cross-entropy loss

One can extend the dynamics to general loss functions and multiple output dimensions. Unlike for squared error, we do not have a closed-form solution to the dynamics equation. However, the equations for the dynamics can be solved using an ODE solver as an initial value problem. For cross-entropy loss with softmax output (see SM Section C),

    ℓ(ŷ, y) = − Σ_i y_i log σ_i(ŷ),    σ_i(ŷ) = e^{ŷ_i} / Σ_j e^{ŷ_j},       (24)

the dynamics equation becomes

    ω̇_t = −η Σ_{x ∈ D} ∇_θ f_0(x)^T ( σ(f^lin_t(x)) − y(x) )       (25)
    ḟ^lin_t(x) = −η Σ_{x' ∈ D} Θ̂_0(x, x') ( σ(f^lin_t(x')) − y(x') )       (26)
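A sketch of integrating the linearized softmax cross-entropy dynamics of Equation 26 on the training set with a simple explicit Euler scheme (the experiments in Section 3 use an adaptive ODE integrator instead). theta_dd is the (k|D| × k|D|) empirical kernel on the training set, f0_train the initial logits, Y one-hot labels of shape (|D|, k); dt and the step count are illustrative.

```python
# Sketch: explicit Euler integration of the linearized cross-entropy dynamics (Equation 26).
def integrate_cross_entropy(f0_train, theta_dd, Y, eta=1.0, dt=1e-2, steps=10_000):
    n, k = Y.shape
    f = f0_train.reshape(n, k)
    theta = theta_dd.reshape(n, k, n, k)
    for _ in range(steps):
        grad_f = jax.nn.softmax(f, axis=-1) - Y                       # dL/df for softmax CE
        f = f - dt * eta * jnp.einsum('iajb,jb->ia', theta, grad_f)   # Equation 26
    return f
```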

2.5.3 Beyond fully-connected networks

Although our theoretical analysis has focused on fully-connected architectures, there is good reason to suspect the results extend to a much broader class of models. In particular, a wealth of recent literature suggests that the mean field theory governing the wide network limit of fully-connected models Poole et al. (2016); Schoenholz et al. (2017) extends naturally to residual networks Yang & Schoenholz (2017), CNNs Xiao et al. (2018), RNNs Chen et al. (2018), batch normalization Yang et al. (2019), and general architectures Yang (2019). We postpone the development of these theoretical extensions in favor of a purely empirical investigation of linearization for a variety of architectures (see Section 3.3).

Figure 3: Dynamics of the mean and variance of trained neural network outputs follow the analytic dynamics from linearization. Black lines indicate the time evolution of the predictive output distribution from an ensemble of 100 trained neural networks (NNs). The blue region indicates the analytic prediction of the output distribution throughout training (Equations 15, 16). Finally, the red region indicates the prediction that would result from training only the top layer, corresponding to an NNGP (Equations S30, S31). The trained network has 3 hidden layers of width 8192 and no bias. The output is computed for inputs interpolated between two training points (denoted with black dots), x(α) = α x^(1) + (1 − α) x^(2). The shaded region and dotted lines denote 2 standard deviations from the mean, denoted by solid lines. Training was performed with full-batch gradient descent. For dynamics of individual function draws, see SM Figure S4.

3 Experiments

In this section, we provide empirical support showing that the training dynamics of wide neural networks are well captured by linearized models. We consider fully-connected, convolutional, and wide ResNet architectures trained with full- and mini-batch gradient descent, using learning rates sufficiently small so that the continuous time approximation holds well. We consider two-class classification on CIFAR-10 (horses and planes) as well as ten-class classification on MNIST and CIFAR-10. When using MSE loss, we treat the binary classification task as regression, with the two classes regressing to two distinct target values.

3.1 Convergence of empirical kernel

As in Novak et al. (2019), we can use Monte Carlo estimates of the tangent kernel (Equation 5) to probe convergence to the infinite width kernel (analytically computed using Equations S3 and S6). For simplicity, we consider random inputs drawn from a standard Gaussian. In Figure 1, we observe convergence as both the width and the number of Monte Carlo samples increase. For both the NNGP and tangent kernels, we observe convergence rates consistent with those predicted by a CLT Daniely et al. (2016).
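A sketch of the Monte Carlo kernel estimates used here: average the empirical tangent kernel, and the output second moment as an NNGP estimate, over independent random initializations. It reuses init_params, mlp, and empirical_ntk from the earlier sketches; the number of samples is illustrative.

```python
# Sketch: Monte Carlo estimates of the NTK and NNGP kernels over random initializations.
def monte_carlo_kernels(key, x, layer_widths, num_samples=32):
    ntk_sum, nngp_sum = 0.0, 0.0
    for i in range(num_samples):
        draw = init_params(jax.random.fold_in(key, i), layer_widths)
        ntk_sum += empirical_ntk(draw, x, x)
        out = mlp(draw, x)                             # (|D|, k) outputs at initialization
        nngp_sum += out @ out.T / out.shape[1]         # average over output dimensions
    return ntk_sum / num_samples, nngp_sum / num_samples
```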

Moreover, the changes in the NNGP and tangent kernels and in individual weights during training become small as the width increases, as shown in Figure 2.

Figure 4: Full batch gradient descent on a model behaves similarly to analytic dynamics on its linearization, both for network outputs and for individual weights. A binary CIFAR classification task with MSE loss and a fully connected network with 5 hidden layers. All three panes in the first row show dynamics for a randomly selected subset of datapoints or parameters. The first two panes in the second row show that the dynamics of loss and accuracy for training and test points agree well between the original and linearized model. The bottom right pane shows the dynamics of the RMSE between the two models on test points. We observe that the empirical kernel Θ̂ gives more accurate dynamics for finite width networks.

3.2 Predictive output distribution

In the case of an MSE loss, the output distribution remains Gaussian throughout training. In Figure 3, the predictive output distribution for input points interpolated between two training points is shown for an ensemble of neural networks and their corresponding GPs. The interpolation is given by x(α) = α x^(1) + (1 − α) x^(2), where x^(1) and x^(2) are two training inputs with different classes. We observe that the mean and variance dynamics of neural network outputs during gradient descent training follow the analytic dynamics from linearization well (Equations 15, 16). Moreover, the NNGP posterior, which corresponds to exact Bayesian inference, while similar, is noticeably different from the predictive distribution at the end of gradient descent training. For dynamics of individual function draws, see SM Figure S4.

Figure 5: A convolutional network and its linearization behave similarly when trained using full batch gradient descent with a momentum optimizer. Binary CIFAR classification task with MSE loss, using a convolutional network with 3 hidden layers and average pooling after the last convolutional layer. The linearized model is trained directly by full batch gradient descent with momentum, rather than by integrating its continuous time analytic dynamics. Panes are the same as in Figure 4.

3.3 Comparison of training dynamics of linearized network to original network

For a particular realization of a finite width network, one can analytically predict the dynamics of the weights and outputs over the course of training using the empirical tangent kernel at initialization. In Figures 5, 6, and 7, we compare these linearized dynamics (Equations 9, 10) with the result of training the actual network. In all cases we see remarkably good agreement. We also observe that for finite networks, dynamics predicted using the empirical kernel Θ̂_0 better match the data than those obtained using the infinite-width, analytic kernel Θ. To understand this, we note that for finite width Θ̂_0 = Θ + O(n^{-1/2}), as plotted in Figure 2.

For general losses, e.g. cross-entropy with softmax output, we need to rely on solving the ODEs in Equations 25 and 26. We use the dopri5 method for ODE integration, which is the default integrator in TensorFlow (tf.contrib.integrate.odeint). In Figure 4, we see that the learning dynamics for the CIFAR-10 all-class classification task with cross-entropy loss are well described by the linearized model. In Figure 6, we test full MNIST digit classification with cross-entropy loss, trained with a momentum optimizer. For cross-entropy loss with softmax output, some logits at late times grow indefinitely, in contrast to MSE loss, where logits converge to the target values. The error between the original and linearized model for cross-entropy loss becomes much worse at late times if the two models deviate significantly before the logits enter their late-time steady-growth regime (see Figure S5).

One can directly optimize the parameters of f^lin instead of solving the ODE induced by the tangent kernel Θ̂_0. Standard neural network optimization techniques such as mini-batching, weight decay, and data augmentation can be directly applied. In Figures 5 and 7, we compare the training dynamics of the linearized and original networks while directly training both.

As discussed in Section 2.5.3, the linearized dynamics successfully describe the training of networks beyond vanilla fully-connected models. To demonstrate the generality of this procedure, we show that we can predict the learning dynamics of Wide Residual Networks (WRNs) Zagoruyko & Komodakis (2016). WRNs are a class of models popular in computer vision which leverage convolutions, batch normalization, skip connections, and average pooling. In Figure 7, we show a comparison between the linearized dynamics and the true dynamics for a wide residual network trained with MSE loss and SGD with momentum. We slightly modified the block structure described in Figure 7 so that each layer has a constant number of channels (1024 in this case), and otherwise followed the original implementation. As elsewhere, we see strong agreement between the predicted dynamics and the result of training.

Figure 6: A neural network and its linearization behave similarly when both are trained via SGD with momentum on cross-entropy loss on MNIST. The experiment is 10-class MNIST classification using a fully connected network with 2 hidden layers. Both models are trained using stochastic minibatching with batch size 64. Panes are the same as in Figure 4, except that the top row shows all ten logits for a single randomly selected datapoint.
group name | output size | block type
conv1 | 32 × 32 | [3×3, channel size]
conv2 | 32 × 32 | [3×3, channel size; 3×3, channel size] × N
conv3 | 16 × 16 | [3×3, channel size; 3×3, channel size] × N
conv4 | 8 × 8 | [3×3, channel size; 3×3, channel size] × N
avg-pool | 1 × 1 | [8 × 8]
Figure 7: A wide residual network and its linearization behave similarly when both are trained by SGD with momentum on MSE loss on CIFAR-10. (top) We adopt the network architecture from Zagoruyko & Komodakis (2016). In the residual block, we follow Batch Normalization-ReLU-Conv ordering. We use a constant channel size of 1024 in each layer. Both the linearized and original models are trained directly on full CIFAR-10 (|D| = 50,000), using stochastic minibatching with batch size 8. (bottom) Output dynamics for a randomly selected subset of train and test points are shown in the first row. The second row shows loss and accuracy curves for the original and linearized networks.

3.4 Effects of depth and dataset size

The training dynamics of a neural network match those of its linearization when the width is infinite and the dataset is finite. In previous experiments, we chose networks sufficiently wide to achieve small error between neural networks and their linearization for smaller datasets. Here we investigate how the agreement between the linearized dynamics and the true dynamics behaves as a function of width and dataset size across a wide range of models. We consider the Root Mean Squared Error (RMSE) between the predicted outputs and the true outputs of the network over the test set. Generally, the RMSE increases with time until it plateaus at the end of training at some final value. In Figure 8 we plot this plateau RMSE for a range of models as a function of both width and dataset size. Overall, we observe that as the width grows the error decreases. This decrease goes approximately as 1/√n for fully-connected networks, with more ambiguous scaling for convolutional and WRN architectures. Additionally, we see that the error grows approximately linearly in the size of the dataset. Thus, although the error grows with dataset size, this can be counterbalanced by a corresponding increase in the model size.

Figure 8: Error dependence on depth and dataset size. Final value of the RMSE for fully-connected, convolutional, and wide residual networks as the networks become wider, for varying depth and dataset size. Top row: error in fully connected networks as the depth is varied from 1 to 16 (left) and the dataset size is varied from 32 to 4096 (right). Bottom row: error in convolutional networks (left) as the depth is varied between 1 and 32, and in WRNs (right) for depths 10 and 16, corresponding to N = 1, 2 as described in Figure 7. Networks are critically initialized and trained with gradient descent on MSE loss.

4 Discussion

We showed theoretically that the learning dynamics in parameter space of deep nonlinear neural networks are exactly described by a linearized model in the infinite width limit. Empirical investigation revealed that this agrees very well with actual training dynamics and predictive distributions across fully-connected, convolutional, and even wide residual network architectures, as well as with different optimizers (gradient descent, momentum, mini-batching) and loss functions (MSE, cross-entropy). Our results suggest that a surprising number of realistic neural networks may be operating in the regime we studied.

In the regime which we study, since the learning dynamics are fully captured by the kernel Θ and the target signal, studying the properties of Θ to determine trainability and generalization is an interesting future direction. Furthermore, the infinite width limit gives us a simple characterization of both gradient descent and Bayesian inference. Some preliminary observations in Lee et al. (2018) showed that wide neural networks trained with SGD perform similarly to the corresponding GPs as width increases, while Novak et al. (2019) found the opposite in the case of convolutional networks without pooling. By studying properties of the NNGP kernel K and the tangent kernel Θ, we may shed light on the inductive bias of gradient descent.

Acknowledgements

We thank Roman Novak and Greg Yang for useful discussions and feedback.

References

Appendix A Computing NTK and NNGP Kernel

For completeness, we reproduce, informally, the recursive formulas for the NNGP kernel and the tangent kernel from Lee et al. (2018) and Jacot et al. (2018), respectively. Let the activation function φ be absolutely continuous. Let T and Ṫ be functions from 2 × 2 positive semi-definite matrices Σ to R given by

    T(Σ) = E[φ(u) φ(v)],    Ṫ(Σ) = E[φ'(u) φ'(v)],    (u, v) ~ N(0, Σ).       (S1)

In the infinite width limit, the NNGP and tangent kernel can be computed recursively. Let x, x' be two inputs in R^{n_0}. Then h^l_i(x) and h^l_i(x') converge in distribution to a joint Gaussian as n_1, ..., n_{l-1} → ∞. The mean is zero and the covariance is given by

    Σ^{l-1}(x, x') = [ [K^{l-1}(x, x), K^{l-1}(x, x')], [K^{l-1}(x', x), K^{l-1}(x', x')] ]       (S2)
    K^l(x, x') = σ_ω² T( Σ^{l-1}(x, x') ) + σ_b²       (S3)

with base case

    K^1(x, x') = σ_ω² (x^T x' / n_0) + σ_b².       (S4)

Using this, one can also derive the tangent kernel for gradient descent training. We will use induction to show that, as n_1, ..., n_L → ∞ sequentially,

    Θ̂^{L+1}(x, x') → Θ^{L+1}(x, x'),       (S5)

where

    Θ^{l+1}(x, x') = K^{l+1}(x, x') + σ_ω² Θ^l(x, x') Ṫ( Σ^l(x, x') )       (S6)

with Θ^1 = K^1. Let

    Θ̂^{l+1}(x, x') ≡ ∇_{θ^{≤l+1}} h^{l+1}(x) ∇_{θ^{≤l+1}} h^{l+1}(x')^T.       (S7)

Then

    Θ̂^{l+1}(x, x') = ∇_{θ^{l+1}} h^{l+1}(x) ∇_{θ^{l+1}} h^{l+1}(x')^T + ∇_{θ^{≤l}} h^{l+1}(x) ∇_{θ^{≤l}} h^{l+1}(x')^T.       (S8)

Letting n_1, ..., n_l → ∞ sequentially, the first term converges to the NNGP kernel K^{l+1}(x, x'). By applying the chain rule and the induction step (letting n_1, ..., n_{l-1} → ∞ sequentially), the second term is

    ∇_{θ^{≤l}} h^{l+1}_i(x) ∇_{θ^{≤l}} h^{l+1}_i(x')^T
      = (σ_ω² / n_l) Σ_{j, j'} ω^{l+1}_{ji} ω^{l+1}_{j'i} φ'(h^l_j(x)) φ'(h^l_{j'}(x')) ∇_{θ^{≤l}} h^l_j(x) ∇_{θ^{≤l}} h^l_{j'}(x')^T       (S9)
      → (σ_ω² / n_l) Σ_j φ'(h^l_j(x)) φ'(h^l_j(x')) Θ̂^l(x, x')       (S10)
      → σ_ω² E[ φ'(h^l(x)) φ'(h^l(x')) ] Θ^l(x, x')       (S11)
      = σ_ω² Ṫ( Σ^l(x, x') ) Θ^l(x, x').       (S12)

Appendix B Tangent kernel for ReLU and erf

For ReLU and erf activation functions, the tangent kernel can be computed analytically. We begin with the case φ = ReLU; using the formula from Cho & Saul (2009), we can compute T and Ṫ in closed form. Let Σ be a 2 × 2 PSD matrix. We will use

    E[ReLU(u) ReLU(v)] = (1 / 2π) ( √(Σ₁₁ Σ₂₂) sin θ + (π − θ) Σ₁₂ ),       (S13)

where

    θ = arccos( Σ₁₂ / √(Σ₁₁ Σ₂₂) ).       (S14)

Let u = h^l(x) and v = h^l(x'). Then (u, v) is a mean-zero Gaussian with covariance Σ = Σ^l(x, x'). Then

    T(Σ) = (1 / 2π) ( √(Σ₁₁ Σ₂₂ − Σ₁₂²) + (π − θ) Σ₁₂ )       (S15)
    Ṫ(Σ) = (π − θ) / (2π).       (S16)

For φ = erf, let Σ be the same as above. Following Williams (1997), we get

    T(Σ) = (2 / π) arcsin( 2 Σ₁₂ / √( (1 + 2 Σ₁₁)(1 + 2 Σ₂₂) ) )       (S17)
    Ṫ(Σ) = (4 / π) / √( (1 + 2 Σ₁₁)(1 + 2 Σ₂₂) − 4 Σ₁₂² ).       (S18)
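As a sketch, the recursion in Equations S3, S4, and S6, specialized to the erf closed forms of Equations S17-S18, can be implemented as follows; the depth and variance arguments are illustrative, and inputs are rows of x1 / x2.

```python
# Sketch: recursive computation of the analytic NNGP kernel and NTK for erf activations.
import jax.numpy as jnp

def analytic_kernels_erf(x1, x2, depth=3, sigma_w2=1.5, sigma_b2=0.05):
    """Returns the NNGP kernel K and tangent kernel Theta of the network outputs."""
    k12 = sigma_w2 * x1 @ x2.T / x1.shape[1] + sigma_b2                 # K^1(x, x')   (S4)
    k11 = sigma_w2 * jnp.sum(x1 * x1, axis=1) / x1.shape[1] + sigma_b2  # K^1(x, x)
    k22 = sigma_w2 * jnp.sum(x2 * x2, axis=1) / x2.shape[1] + sigma_b2  # K^1(x', x')
    theta = k12                                                         # Theta^1 = K^1
    for _ in range(depth):
        d = jnp.sqrt((1 + 2 * k11)[:, None] * (1 + 2 * k22)[None, :])
        t = (2 / jnp.pi) * jnp.arcsin(2 * k12 / d)                      # T(Sigma)     (S17)
        t_dot = (4 / jnp.pi) / jnp.sqrt(d ** 2 - 4 * k12 ** 2)          # T_dot(Sigma) (S18)
        new_k12 = sigma_w2 * t + sigma_b2                               # K^{l+1}      (S3)
        theta = new_k12 + sigma_w2 * theta * t_dot                      # Theta^{l+1}  (S6)
        k11 = sigma_w2 * (2 / jnp.pi) * jnp.arcsin(2 * k11 / (1 + 2 * k11)) + sigma_b2
        k22 = sigma_w2 * (2 / jnp.pi) * jnp.arcsin(2 * k22 / (1 + 2 * k22)) + sigma_b2
        k12 = new_k12
    return k12, theta
```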

Appendix C Multi-dimensional output and cross-entropy loss

For completeness, we include the derivation of the dynamics for cross-entropy loss with softmax output,

    ℓ(ŷ, y) = − Σ_i y_i log σ_i(ŷ),    σ_i(ŷ) = e^{ŷ_i} / Σ_j e^{ŷ_j}.       (S19)

Recall that L = Σ_{(x,y) ∈ D} ℓ(f_t(x), y). For a general input point x and an arbitrary function f(·; θ) parameterized by θ, gradient flow dynamics are given by

    θ̇_t = −η ∇_θ L = −η Σ_{x ∈ D} ∇_θ f_t(x)^T ∇_{f_t(x)} ℓ( f_t(x), y(x) )       (S20)
    ḟ_t(x) = ∇_θ f_t(x) θ̇_t = −η Σ_{x' ∈ D} Θ̂_t(x, x') ∇_{f_t(x')} ℓ( f_t(x'), y(x') ).       (S21)

Let d_t(x) ≡ σ( f_t(x) ) − y(x), the gradient of the cross-entropy loss with respect to the logits. The above is

    θ̇_t = −η Σ_{x ∈ D} ∇_θ f_t(x)^T d_t(x)       (S22)
    ḟ_t(x) = −η Σ_{x' ∈ D} Θ̂_t(x, x') d_t(x').       (S23)

The linearization is

    ω̇_t = −η Σ_{x ∈ D} ∇_θ f_0(x)^T ( σ( f^lin_t(x) ) − y(x) )       (S24)
    ḟ^lin_t(x) = −η Σ_{x' ∈ D} Θ̂_0(x, x') ( σ( f^lin_t(x') ) − y(x') ).       (S25)

Appendix D Gradient flow dynamics for training only the readout-layer

The connection between Gaussian processes and Bayesian wide neural networks can be extended to the setting when only the readout layer parameters are being optimized. More precisely, we show that when training only the readout layer, the outputs of the network form a Gaussian process (over an ensemble of draws from the parameter prior) throughout training, where that output is an interpolation between the GP prior and GP posterior.

Note that for any x, x', the empirical kernel of the readout layer, K̂(x, x') ≡ ∇_{θ^{L+1}} f(x) ∇_{θ^{L+1}} f(x')^T, converges to K(x, x') in probability in the infinite width limit, where for notational simplicity we assign K ≡ K(X, X) and K̂ ≡ K̂(X, X). The regression problem is specified with mean-squared loss

    L = (1/2) Σ_{(x,y) ∈ D} ‖f_t(x) − y‖_2²,       (S26)

and applying gradient flow to optimize the readout layer (and freezing all other parameters),

    θ̇^{L+1}_t = −η ∇_{θ^{L+1}} f_t(X)^T ( f_t(X) − Y ),       (S27)

where η is the learning rate. The solution to this ODE gives the evolution of the output for an arbitrary x. So long as the empirical kernel K̂ is invertible, it is

    f_t(x) = f_0(x) + K̂(x, X) K̂^{-1} ( e^{−η K̂ t} − I ) ( f_0(X) − Y ).       (S28)

For any x, x', letting n_l → ∞ for l = 1, ..., L, one has the convergence in probability and distribution, respectively,

    K̂(x, x') → K(x, x')  in probability,    f_0 → GP(0, K)  in distribution.       (S29)

Moreover, f_0(x) and the term containing f_0(X) are the only stochastic terms over the ensemble of network initializations; therefore, for any t, the output throughout training converges to a Gaussian distribution in the infinite width limit, with

    mean_t(x) = K(x, X) K^{-1} ( I − e^{−η K t} ) Y       (S30)
    var_t(x) = K(x, x) − K(x, X) K^{-1} ( I − e^{−2 η K t} ) K(X, x).       (S31)

Thus the output of the neural network is also a GP, and the asymptotic solution (i.e. t → ∞) is identical to the posterior of the NNGP (Equation 14). Therefore, in the infinite width case, the optimized neural network is performing posterior sampling if only the readout layer is being trained. This result is a realization of the sample-then-optimize equivalence identified in Matthews et al. (2017).
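Below is a sketch of the sample-then-optimize procedure above, reusing the hypothetical mlp / params from the earlier sketches: all hidden layers are frozen at initialization and only the readout variables are trained by full-batch gradient descent on MSE. The σ_ω and σ_b values are hard-coded to match that sketch's defaults; η and the step count are illustrative.

```python
# Sketch: train only the readout layer of the earlier mlp sketch; the resulting test
# prediction is one draw from the (approximate) GP posterior described above.
def train_readout_only(params, train_x, train_y, test_x, eta=0.1, steps=2000):
    hidden, (omega_out, beta_out) = params[:-1], params[-1]
    feats = lambda inputs: jnp.tanh(mlp(hidden, inputs))      # frozen features x^L
    sigma_w, sigma_b, n_l = 1.5, 0.05, omega_out.shape[0]

    def readout(w, b, inputs):
        return sigma_w / jnp.sqrt(n_l) * feats(inputs) @ w + sigma_b * b

    loss = lambda w, b: 0.5 * jnp.sum((readout(w, b, train_x) - train_y) ** 2)
    for _ in range(steps):
        g_w, g_b = jax.grad(loss, argnums=(0, 1))(omega_out, beta_out)
        omega_out, beta_out = omega_out - eta * g_w, beta_out - eta * g_b
    return readout(omega_out, beta_out, test_x)   # one posterior-style sample
```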

Appendix E Results for NTK parameterization transfer to standard parameterization

(a) MNIST
(b) CIFAR
Figure S1: NTK vs standard parameterization. Across different choices of dataset, activation function, and loss function, models obtained from (S)GD training in both parameterizations (circles and triangles denote NTK and standard parameterization, respectively) achieve similar performance.

Here we present a sketch of why the linearization results, derived for NTK-parameterized networks, also apply to networks with a standard parameterization,

    h^{l+1} = x^l W^{l+1} + b^{l+1},    W^{l+1}_{ij} ~ N(0, σ_ω² / n_l),    b^{l+1}_j ~ N(0, σ_b²).       (S32)

The NTK parameterization in Equation 1 is not commonly used for training neural networks. While the function that the network represents is the same for both the NTK and standard parameterizations, training dynamics under gradient descent are generally different for the two parameterizations. However, for a particular choice of layer-dependent learning rates the training dynamics also become identical. Let η^l_W and η^l_b be layer-dependent learning rates for W^l and b^l in the NTK parameterization, and let η_std be the learning rate for all parameters in the standard parameterization. Recall that gradient descent training of standard-parameterization networks requires a learning rate that scales with width like 1/n, so the product of the learning rate and the width defines a width-invariant learning rate (Park et al., 2018). If we choose