Machine learning models based on deep neural networks have achieved unprecedented performance across a wide range of tasks Krizhevsky et al. (2012); He et al. (2016); Devlin et al. (2018). Typically, these models are regarded as complex systems for which many types of theoretical analyses are intractable. Moreover, characterizing the gradient-based training dynamics of these models is challenging owing to the typically high-dimensional non-convex loss surfaces governing the optimization. As is common in the physical sciences, investigating the extreme limits of such systems can often shed light on these hard problems. For neural networks, one such limit is that of infinite width, which refers either to the number of hidden units in a fully-connected layer or to the number of channels in a convolutional layer. Under this limit, the output of the network at initialization is a draw from a Gaussian process (GP); moreover, the network output remains governed by a GP after exact Bayesian training using squared loss (Neal, 1994; Lee et al., 2018; Matthews et al., 2018; Novak et al., 2019; Garriga-Alonso et al., 2018). Aside from its theoretical simplicity, the infinite-width limit is also of practical interest as wider networks have been found to generalize better Neyshabur et al. (2015); Novak et al. (2018); Lee et al. (2018); Novak et al. (2019); Neyshabur et al. (2019).
In this work, we explore the learning dynamics of wide neural networks under gradient descent and find that the weight-space description of the dynamics becomes surprisingly simple: as the width becomes large, the neural network can be effectively replaced by its first-order Taylor expansion with respect to its parameters at initialization. For this induced linear model, the dynamics of gradient descent become analytically tractable. While the linearization is only exact in the infinite width limit, we nevertheless find excellent agreement between the predictions of the original network and those of the linearized version even for finite width configurations. The agreement persists across different architectures, optimization methods, and loss functions.
For squared loss, the exact learning dynamics admit a closed-form solution that allows us to characterize the evolution of the predictive distribution in terms of a GP. This result can be thought of as an extension of “sample-then-optimize” posterior sampling Matthews et al. (2017) to the training of deep neural networks. Our empirical simulations confirm that the result accurately models the variation in predictions across an ensemble of finite-width models with different random initializations.
1.1 Our Contribution
We begin by building on a recent result by Jacot et al. (2018) that characterizes the exact dynamics of network outputs throughout gradient descent training in the infinite width limit. Their results establish that gradient descent in parameter space corresponds to kernel gradient descent in function space with respect to a new kernel, the Neural Tangent Kernel (NTK). One may ask what this tells us about the nature of the dynamics in parameter space, where training updates are actually made. A key contribution of our work is to show that dynamics in parameter space are equivalent to the training dynamics of a model which is affine in the collection of all network parameters, the weights and biases. This result holds regardless of the choice of loss function. In the case of squared loss, the dynamics admit a closed-form solution as a function of time.
The output of an infinitely wide neural network is Gaussian at initialization (Lee et al., 2018; Matthews et al., 2018), and as mentioned in Jacot et al. (2018), for squared loss it remains Gaussian throughout training (the setting is full-batch training under gradient flow). We derive explicit time-dependent expressions for the mean and covariance functions of this GP, and provide a novel interpretation of the result. In particular, it offers a quantitative understanding of the mechanism by which gradient descent differs from Bayesian posterior sampling of the parameters: while both methods generate draws from a GP, gradient descent does not generate samples from the posterior of any probabilistic model. This observation is in contrast to the “sample-then-optimize” framework of Matthews et al. (2017), in which only the top-layer weights are trained and gradient descent does sample from the Bayesian posterior. These observations establish a framework with which to analyze the long-standing questions of whether, how, and in what contexts gradient descent provides concrete benefits relative to Bayesian inference.
As argued by Chizat & Bach (2018b), these theoretical results may appear too simple to be applicable to realistic neural networks. Nonetheless, we empirically investigate the applicability of the theory in the finite-width setting and find that it gives an accurate characterization of both the learning dynamics and posterior function distributions across a variety of conditions, including some practical network architectures such as the wide residual network Zagoruyko & Komodakis (2016).
1.2 Additional related work
Daniely et al. (2016) study the relationship between neural networks and kernels at initialization. They bound the difference between the infinite width kernel and the empirical kernel at finite width $n$, which diminishes as $\mathcal{O}(1/\sqrt{n})$. Daniely (2017) uses the same kernel perspective to study stochastic gradient descent (SGD) training of neural networks.
Saxe et al. (2013) study the training dynamics of deep linear networks, in which the nonlinearities are treated as identity functions. Deep linear networks are linear in their inputs, but not their parameters. In contrast, we show that the outputs of sufficiently wide neural networks are linear in their parameters but not usually their inputs.
Du et al. (2018); Allen-Zhu et al. (2018a, b); Zou et al. (2018) study the convergence of gradient descent to global minima. They proved that for i.i.d. Gaussian initialization, the parameters of sufficiently wide networks move little from their initial values during SGD. This small motion of the parameters is crucial to the effect we present, where wide neural networks behave linearly in terms of their parameters throughout training.
Mei et al. (2018); Chizat & Bach (2018a); Rotskoff & Vanden-Eijnden (2018); Sirignano & Spiliopoulos (2018) analyze the mean field SGD dynamics of training neural networks in the large-width limit. Their mean field analysis describes distributional dynamics of network parameters via a PDE. However, their analysis is restricted to one hidden layer networks with a scaling limit ($1/n$) different from ours ($1/\sqrt{n}$), which is commonly used in modern networks He et al. (2016); Glorot & Bengio (2010).
Finally, Zhang et al. (2019) observed that some of the layers in trained neural networks are robust to re-initialization, but not to re-randomization. Our framework provides some theoretical support for this empirical finding.
2 Theoretical Results
2.1 Notation and setup
Let $\mathcal{D} \subseteq \mathbb{R}^{n_0} \times \mathbb{R}^{k}$ denote the training set and $\mathcal{X} = \{x : (x, y) \in \mathcal{D}\}$ and $\mathcal{Y} = \{y : (x, y) \in \mathcal{D}\}$ denote the inputs and labels, respectively. Consider a fully-connected feed-forward network with $L$ hidden layers with widths $n_l$, for $l = 1, \dots, L$, and a readout layer with $n_{L+1} = k$. For each $x \in \mathbb{R}^{n_0}$, we use $h^l(x), x^l(x) \in \mathbb{R}^{n_l}$ to represent the pre- and post-activation functions at layer $l$ with input $x$. The recurrence relation for a feed-forward network is defined as

$$h^{l+1} = x^l W^{l+1} + b^{l+1}, \qquad x^{l+1} = \phi(h^{l+1}), \qquad W^{l+1}_{ij} = \frac{\sigma_\omega}{\sqrt{n_l}}\, \omega^{l+1}_{ij}, \qquad b^{l+1}_j = \sigma_b\, \beta^{l+1}_j,$$

where $\phi$ is a point-wise activation function, $W^{l+1} \in \mathbb{R}^{n_l \times n_{l+1}}$ and $b^{l+1} \in \mathbb{R}^{n_{l+1}}$ are the weights and biases, $\omega^{l}_{ij}$ and $\beta^{l}_j$ are the trainable variables, drawn i.i.d. from a standard Gaussian $\mathcal{N}(0, 1)$ at initialization, and $\sigma_\omega^2$ and $\sigma_b^2$ are the weight and bias variances. Note that this parameterization is non-standard (see SM Section E for further analysis), and we will refer to it as the NTK parameterization. It has already been adopted in several recent works van Laarhoven (2017); Karras et al. (2018); Jacot et al. (2018); Du et al. (2018); Park et al. (2018). Unlike the standard parameterization that only normalizes the forward dynamics of the network, the NTK parameterization also normalizes its backward dynamics. We note that the predictions and training dynamics of NTK-parameterized networks are identical to those of standard networks, up to a width-dependent scaling factor in the learning rate for each parameter tensor. We compare the test performance of these two parameterization methods in Figure S1 of the SM. Our results (linearity in weights, GP predictions) also hold for infinitely wide networks with a standard parameterization (see SM Section E, Figure S2).
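As a minimal sketch of the NTK parameterization described above, the NumPy snippet below stores unit-variance trainable variables and applies the $\sigma_\omega/\sqrt{n_l}$ and $\sigma_b$ factors inside the forward pass. The helper names (`init_params`, `forward`), the layer widths, and the values of `sigma_w` and `sigma_b` are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def init_params(widths, rng):
    """Draw the trainable variables (omega, beta) i.i.d. from N(0, 1), one pair per layer."""
    return [(rng.standard_normal((n_in, n_out)), rng.standard_normal(n_out))
            for n_in, n_out in zip(widths[:-1], widths[1:])]

def forward(params, x, sigma_w=1.5, sigma_b=0.05, phi=np.tanh):
    """NTK-parameterized forward pass.

    The effective weights are W = sigma_w / sqrt(fan_in) * omega and the
    effective biases are b = sigma_b * beta; the scaling lives in the
    forward pass rather than in the initialization.
    """
    h = x
    for i, (omega, beta) in enumerate(params):
        if i > 0:
            h = phi(h)  # post-activation of the previous layer
        h = h @ (sigma_w / np.sqrt(omega.shape[0]) * omega) + sigma_b * beta
    return h  # readout pre-activations (logits)

rng = np.random.default_rng(0)
widths = [8, 512, 512, 3]            # n_0, n_1, n_2, k (hypothetical sizes)
params = init_params(widths, rng)
x = rng.standard_normal((5, widths[0]))
logits = forward(params, x)
```

Because the width-dependent factor multiplies the variables rather than their initial scale, the gradient with respect to each `omega` also carries a $1/\sqrt{n_l}$ factor, which is the backward normalization mentioned above.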
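We set $\theta^l \equiv \mathrm{vec}(\{W^l, b^l\})$, the collection of parameters mapping to the $l$-th layer, whose cardinality is $(n_{l-1} + 1)\, n_l$. Define $\theta = \mathrm{vec}(\cup_{l=1}^{L+1} \theta^l)$ and similarly $\theta^{\le l}$ or $\theta^{> l}$. Denote by $\theta_t$ the time-dependence of the parameters and by $\theta_0$ their initial values. We use $f_t(x) \equiv h^{L+1}(x) \in \mathbb{R}^{k}$ to denote the output (or logits) of the neural network at time $t$. Let $\ell(\hat{y}, y) : \mathbb{R}^k \times \mathbb{R}^k \to \mathbb{R}$ denote the loss function where the first argument is the prediction and the second argument the true label. In supervised learning, one is interested in learning a $\theta$ that minimizes the empirical loss (to simplify the notation for later equations, we use the total loss here instead of the average loss, but for all plots in Section 3, we show the average loss),

$$\mathcal{L} = \sum_{(x, y) \in \mathcal{D}} \ell(f_t(x, \theta), y).$$

Let $\eta$ be the learning rate (note that compared to the conventional parameterization, $\eta$ is larger by a factor of the width; the NTK parameterization allows usage of a universal learning rate scale irrespective of network width). Via continuous time gradient descent, the evolution of the parameters $\theta$ and the logits $f$ can be written as

$$\dot{\theta}_t = -\eta\, \nabla_\theta f_t(\mathcal{X})^T\, \nabla_{f_t(\mathcal{X})} \mathcal{L}, \tag{3}$$

$$\partial_t f_t(\mathcal{X}) = \nabla_\theta f_t(\mathcal{X})\, \dot{\theta}_t = -\eta\, \hat{\Theta}_t(\mathcal{X}, \mathcal{X})\, \nabla_{f_t(\mathcal{X})} \mathcal{L}, \tag{4}$$

where $f_t(\mathcal{X}) = \mathrm{vec}([f_t(x)]_{x \in \mathcal{X}})$, the $k|\mathcal{D}| \times 1$ vector of concatenated logits for all examples, and $\nabla_{f_t(\mathcal{X})} \mathcal{L}$ is the gradient of the loss with respect to the model’s output, $f_t(\mathcal{X})$. $\hat{\Theta}_t \equiv \hat{\Theta}_t(\mathcal{X}, \mathcal{X})$ is the tangent kernel at time $t$, which is a $k|\mathcal{D}| \times k|\mathcal{D}|$ matrix

$$\hat{\Theta}_t = \nabla_\theta f_t(\mathcal{X})\, \nabla_\theta f_t(\mathcal{X})^T = \sum_{l=1}^{L+1} \nabla_{\theta^l} f_t(\mathcal{X})\, \nabla_{\theta^l} f_t(\mathcal{X})^T. \tag{5}$$

One can define the tangent kernel for general arguments, e.g. $\hat{\Theta}_t(x, \mathcal{X})$ where $x$ is a test input. We refer to $\hat{\Theta}_t$ as the empirical tangent kernel.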
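To make the definition of the empirical tangent kernel concrete, the sketch below computes it as a product of Jacobians for a one-hidden-layer tanh network with a scalar output in the NTK parameterization. The network size, the constants `SIGMA_W` and `SIGMA_B`, and the function names are assumptions chosen for illustration; the Jacobian is written out by hand so that only NumPy is needed.

```python
import numpy as np

SIGMA_W, SIGMA_B = 1.5, 0.1  # illustrative values (assumption)

def network(x, w1, b1, w2, b2):
    """One-hidden-layer tanh network with scalar output, NTK parameterization."""
    n0, n = w1.shape
    h = SIGMA_W / np.sqrt(n0) * x @ w1 + SIGMA_B * b1
    return SIGMA_W / np.sqrt(n) * np.tanh(h) @ w2 + SIGMA_B * b2

def jacobian(x, w1, b1, w2):
    """Rows are d f(x_a) / d theta, with theta = (w1, b1, w2, b2) flattened in that order."""
    m = x.shape[0]
    n0, n = w1.shape
    a = np.tanh(SIGMA_W / np.sqrt(n0) * x @ w1 + SIGMA_B * b1)        # (m, n)
    delta = SIGMA_W / np.sqrt(n) * w2 * (1.0 - a ** 2)                # d f / d h, (m, n)
    d_w1 = SIGMA_W / np.sqrt(n0) * x[:, :, None] * delta[:, None, :]  # (m, n0, n)
    d_b1 = SIGMA_B * delta                                            # (m, n)
    d_w2 = SIGMA_W / np.sqrt(n) * a                                   # (m, n)
    d_b2 = np.full((m, 1), SIGMA_B)                                   # (m, 1)
    return np.concatenate([d_w1.reshape(m, -1), d_b1, d_w2, d_b2], axis=1)

def empirical_ntk(x1, x2, w1, b1, w2):
    """Empirical tangent kernel between two batches: J(x1) J(x2)^T."""
    return jacobian(x1, w1, b1, w2) @ jacobian(x2, w1, b1, w2).T

rng = np.random.default_rng(0)
n0, n = 4, 256
w1, b1 = rng.standard_normal((n0, n)), rng.standard_normal(n)
w2, b2 = rng.standard_normal(n), rng.standard_normal()
x_train = rng.standard_normal((6, n0))
x_test = rng.standard_normal((2, n0))
theta_train = empirical_ntk(x_train, x_train, w1, b1, w2)  # (6, 6), symmetric PSD
theta_test = empirical_ntk(x_test, x_train, w1, b1, w2)    # (2, 6)
```

For deeper networks or vector outputs one would normally obtain the Jacobian by automatic differentiation; the hand-derived version here only serves to show that the kernel is a Gram matrix of parameter gradients.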
2.2 Linearized networks
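In this section, we consider the training dynamics of the linearized network. Specifically, we replace the outputs of the neural network by their first order Taylor expansion,

$$f^{\text{lin}}_t(x) \equiv f_0(x) + \nabla_\theta f_0(x)\big|_{\theta = \theta_0}\, \omega_t, \tag{6}$$

where $\omega_t \equiv \theta_t - \theta_0$ is the change in the parameters from their initial values (since $f^{\text{lin}}_0 = f_0$, we will often drop the superscript at time 0). Note that $f^{\text{lin}}_t$ is the sum of two terms: the first term is the initial output of the network, which remains unchanged during training, and the second term captures the change to the initial value during training. The dynamics of gradient flow using this linearized function are governed by

$$\dot{\omega}_t = -\eta\, \nabla_\theta f_0(\mathcal{X})^T\, \nabla_{f^{\text{lin}}_t(\mathcal{X})} \mathcal{L}, \tag{7}$$

$$\dot{f}^{\text{lin}}_t(x) = -\eta\, \hat{\Theta}_0(x, \mathcal{X})\, \nabla_{f^{\text{lin}}_t(\mathcal{X})} \mathcal{L}. \tag{8}$$

As $\nabla_\theta f_0(x)$ remains constant throughout training, these dynamics are often quite simple. In the case of an MSE loss, i.e., $\ell(\hat{y}, y) = \frac{1}{2} \|\hat{y} - y\|_2^2$, the ODEs have closed form solutions

$$\omega_t = -\nabla_\theta f_0(\mathcal{X})^T\, \hat{\Theta}_0^{-1} \left(I - e^{-\eta \hat{\Theta}_0 t}\right) \left(f_0(\mathcal{X}) - \mathcal{Y}\right), \tag{9}$$

$$f^{\text{lin}}_t(\mathcal{X}) = \left(I - e^{-\eta \hat{\Theta}_0 t}\right) \mathcal{Y} + e^{-\eta \hat{\Theta}_0 t} f_0(\mathcal{X}). \tag{10}$$

For an arbitrary point $x$, $f^{\text{lin}}_t(x) = \mu_t(x) + \gamma_t(x)$, where

$$\mu_t(x) = \hat{\Theta}_0(x, \mathcal{X})\, \hat{\Theta}_0^{-1} \left(I - e^{-\eta \hat{\Theta}_0 t}\right) \mathcal{Y}, \tag{11}$$

$$\gamma_t(x) = f_0(x) - \hat{\Theta}_0(x, \mathcal{X})\, \hat{\Theta}_0^{-1} \left(I - e^{-\eta \hat{\Theta}_0 t}\right) f_0(\mathcal{X}). \tag{12}$$

Therefore, we can obtain the time evolution of the linearized neural network without training it. We only need to compute the tangent kernel $\hat{\Theta}_0$ and the outputs $f_0$ at initialization and use Equations 11, 12, and 9 to compute the dynamics of the outputs and the weights.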
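The closed-form MSE solution can be checked numerically. The sketch below evaluates the outputs of a linear-in-parameters model at time $t$ from the kernel and the initial outputs alone, then compares them with small-step gradient descent on the same model. The random Jacobians stand in for $\nabla_\theta f_0$ and are an assumption made for illustration, as are the function names and the loss convention $\frac{1}{2}\|\hat{y} - y\|^2$ summed over examples.

```python
import numpy as np

def expm_sym(a, s):
    """exp(s * a) for a symmetric matrix a, via its eigendecomposition."""
    lam, u = np.linalg.eigh(a)
    return (u * np.exp(s * lam)) @ u.T

def linearized_mse_outputs(theta_tt, theta_xt, f0_train, f0_test, y, eta, t):
    """Outputs of the linearized model at time t under gradient flow with MSE loss."""
    decay = expm_sym(theta_tt, -eta * t)
    resid = (np.eye(len(y)) - decay) @ (y - f0_train)   # (I - e^{-eta Theta t})(Y - f_0(X))
    f_train = f0_train + resid
    f_test = f0_test + theta_xt @ np.linalg.solve(theta_tt, resid)
    return f_train, f_test

rng = np.random.default_rng(0)
m, m_test, p = 4, 3, 50
j_train = rng.standard_normal((m, p)) / np.sqrt(p)      # stand-in for grad_theta f_0(X)
j_test = rng.standard_normal((m_test, p)) / np.sqrt(p)  # stand-in for grad_theta f_0(x)
f0_train, f0_test = rng.standard_normal(m), rng.standard_normal(m_test)
y = rng.standard_normal(m)
theta_tt, theta_xt = j_train @ j_train.T, j_test @ j_train.T
eta, t = 1.0, 2.0
f_train, f_test = linearized_mse_outputs(theta_tt, theta_xt, f0_train, f0_test, y, eta, t)

# Cross-check: explicit small-step gradient descent on the same linear model.
n_steps = 2000
dt, omega = t / n_steps, np.zeros(p)
for _ in range(n_steps):
    omega -= dt * eta * j_train.T @ (f0_train + j_train @ omega - y)
f_train_gd = f0_train + j_train @ omega
f_test_gd = f0_test + j_test @ omega
```

The two routes agree up to the discretization error of the explicit integrator, and for large $t$ the training outputs converge to the labels whenever the kernel is positive definite.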
2.3 Infinite width limit yields Gaussian processes
As the width of the hidden layers approaches infinity, the Central Limit Theorem (CLT) implies that the outputs at initialization converge to a multivariate Gaussian in distribution. Informally, this can be proved by induction. Conditioning on the activations at layer $l$, the pre-activations $h^{l+1}_i$ at layer $l+1$ are i.i.d. Gaussian, since each is a sum of the layer-$l$ activations weighted by independent Gaussian variables; see Poole et al. (2016); Schoenholz et al. (2017); Lee et al. (2018); Xiao et al. (2018); Yang & Schoenholz (2017) for more details, and Matthews et al. (2018); Novak et al. (2019) for a formal treatment.
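Therefore, randomly initialized neural networks are in correspondence with a certain class of Gaussian processes (hereinafter referred to as NNGPs), which facilitates a fully Bayesian treatment of neural networks (Lee et al., 2018; Matthews et al., 2018). More precisely, let $f^i_t$ denote the $i$-th output dimension and $\mathcal{K}$ denote the sample-to-sample kernel function (of the pre-activation) of the outputs in the infinite width setting,

$$\mathcal{K}^{i,j}(x, x') = \lim_{\min(n_1, \dots, n_L) \to \infty} \mathbb{E}\left[f^i_0(x) \cdot f^j_0(x')\right],$$

then $f_0(\mathcal{X}) \sim \mathcal{N}(0, \mathcal{K}(\mathcal{X}, \mathcal{X}))$, where $\mathcal{K}^{i,j}(x, x')$ denotes the covariance between the $i$-th output of $x$ and the $j$-th output of $x'$, which can be computed recursively (see Lee et al. (2018) Section 2.3 and Supplementary Material (SM) Section A). For an unobserved test input $x$, the joint output distribution $f([x, \mathcal{X}])$ is also a Gaussian process. Conditioning on the training samples $f(\mathcal{X}) = \mathcal{Y}$ (this imposes that $h^{L+1}$ directly corresponds to the network predictions; in the case of softmax readout, variational or sampling methods are required to marginalize over $h^{L+1}$), the posterior predictive distribution of $f(x)$ is also Gaussian, with mean and variance

$$\mu(x) = \mathcal{K}(x, \mathcal{X})\, \mathcal{K}^{-1} \mathcal{Y}, \qquad \Sigma(x) = \mathcal{K}(x, x) - \mathcal{K}(x, \mathcal{X})\, \mathcal{K}^{-1}\, \mathcal{K}(x, \mathcal{X})^T,$$

where $\mathcal{K} = \mathcal{K}(\mathcal{X}, \mathcal{X})$. This is the posterior predictive distribution resulting from exact Bayesian inference in an infinitely wide neural network.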
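A minimal sketch of the noise-free posterior predictive computation follows. The RBF kernel is only a stand-in (an assumption) for the NNGP kernel so that the snippet is self-contained; the function names and the small `jitter` term added for numerical stability are likewise illustrative.

```python
import numpy as np

def rbf_kernel(a, b, length_scale=1.0):
    """Stand-in PSD kernel (assumption); the NNGP kernel would be used in its place."""
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-0.5 * d2 / length_scale ** 2)

def gp_posterior(k_tt, k_xt, k_xx, y, jitter=1e-10):
    """Posterior mean K(x,X) K^{-1} Y and covariance K(x,x) - K(x,X) K^{-1} K(X,x)."""
    chol = np.linalg.cholesky(k_tt + jitter * np.eye(len(k_tt)))
    alpha = np.linalg.solve(chol.T, np.linalg.solve(chol, y))  # K^{-1} Y
    v = np.linalg.solve(chol, k_xt.T)                          # chol^{-1} K(X, x)
    return k_xt @ alpha, k_xx - v.T @ v

rng = np.random.default_rng(0)
x_train = rng.standard_normal((6, 3))
x_test = rng.standard_normal((4, 3))
y = rng.standard_normal(6)
mean, cov = gp_posterior(rbf_kernel(x_train, x_train),
                         rbf_kernel(x_test, x_train),
                         rbf_kernel(x_test, x_test), y)
```

Evaluating the posterior at the training inputs themselves returns the labels with (numerically) zero variance, which is the conditioning $f(\mathcal{X}) = \mathcal{Y}$ stated above.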
2.3.1 Gaussian processes from gradient descent training
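If we freeze the variables $\theta^{\le L}$ after initialization and only optimize $\theta^{L+1}$, the original network and its linearization are identical. Letting the width approach infinity, this particular tangent kernel $\hat{\Theta}_0$ will converge to $\mathcal{K}$ in probability and Equation 11 will converge to the posterior of Section 2.3 as $t \to \infty$ (for further details see SM Section D). This is a realization of the “sample-then-optimize” approach for evaluating the posterior of a Gaussian process proposed in Matthews et al. (2017).
If none of the variables are frozen, in the infinite width setting, $\hat{\Theta}_0$ also converges in probability to a deterministic kernel $\Theta$ Jacot et al. (2018), which we sometimes refer to as the analytic kernel, and which can also be computed recursively (see SM Section A). Letting the width go to infinity, for any $t$, the output $f^{\text{lin}}_t(x)$ is also Gaussian distributed because Equations 11 and 12 describe an affine transform of the Gaussian $[f_0(x), f_0(\mathcal{X})]$. The mean ($\mu$) and the variance ($\Sigma$) are given by

$$\mu(x) = \Theta(x, \mathcal{X})\, \Theta^{-1} \left(I - e^{-\eta \Theta t}\right) \mathcal{Y}, \tag{15}$$

$$\Sigma(x, x') = \mathcal{K}(x, x') + \Theta(x, \mathcal{X})\, \Theta^{-1} \left(I - e^{-\eta \Theta t}\right) \mathcal{K} \left(I - e^{-\eta \Theta t}\right) \Theta^{-1}\, \Theta(\mathcal{X}, x') - \left(\Theta(x, \mathcal{X})\, \Theta^{-1} \left(I - e^{-\eta \Theta t}\right) \mathcal{K}(\mathcal{X}, x') + \text{h.c.}\right), \tag{16}$$

where $\Theta = \Theta(\mathcal{X}, \mathcal{X})$, $\mathcal{K} = \mathcal{K}(\mathcal{X}, \mathcal{X})$, and h.c. denotes the transpose of the preceding term with $x$ and $x'$ exchanged. Unlike the case when only $\theta^{L+1}$ is optimized, Equations 15 and 16 do not admit an interpretation corresponding to the posterior sampling of a probabilistic model (one possible exception is when the NNGP kernel and NTK are the same up to a scalar multiplication). We contrast the predictive distributions from the NNGP, NTK-GP (i.e. Equations 15 and 16) and ensembles of NNs in Figure 3.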
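The mean and covariance of the output distribution under gradient flow can be sketched as below. The kernels are built from random features purely so that they are positive semi-definite, and the choice $\Theta = 2\mathcal{K}$ is a toy assumption that exercises the exception noted above (NTK proportional to the NNGP kernel), in which case the $t \to \infty$ limit coincides with the Bayesian posterior.

```python
import numpy as np

def ntk_gp_moments(theta_tt, theta_xt, k_tt, k_xt, k_xx, y, eta, t):
    """Mean and covariance of the linearized network outputs at time t (MSE, gradient flow)."""
    lam, u = np.linalg.eigh(theta_tt)
    # Theta^{-1} (I - exp(-eta Theta t)), evaluated in the eigenbasis of Theta.
    m = (u * ((1.0 - np.exp(-eta * lam * t)) / lam)) @ u.T
    a = theta_xt @ m                      # Theta(x, X) Theta^{-1} (I - e^{-eta Theta t})
    cross = a @ k_xt.T                    # a K(X, x')
    return a @ y, k_xx + a @ k_tt @ a.T - cross - cross.T

rng = np.random.default_rng(0)
feats = rng.standard_normal((8, 30))      # random features, only used to build PSD kernels
tr, te = feats[:5], feats[5:]
k_tt, k_xt, k_xx = tr @ tr.T / 30, te @ tr.T / 30, te @ te.T / 30
y = rng.standard_normal(5)

# Toy choice Theta = 2 K (assumption): the late-time limit then equals the GP posterior.
mean_inf, cov_inf = ntk_gp_moments(2 * k_tt, 2 * k_xt, k_tt, k_xt, k_xx, y, eta=1.0, t=1e6)
```

At $t = 0$ the same function returns a zero mean and the prior covariance $\mathcal{K}(x, x')$, as expected for an untrained network. For a generic pair of kernels the late-time covariance differs from the posterior covariance, which is the distinction drawn in the text.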
2.4 Infinite width networks are linearized networks
under the technical assumption that the integrated functional gradient norm remains stochastically bounded as $n_1, \dots, n_L \to \infty$ sequentially, i.e.

$$\int_0^T \left\| \nabla_{f_t(\mathcal{X})} \mathcal{L} \right\|_2 \, dt = O_p(1), \tag{18}$$

where $T$ is the training budget (independent from the widths), i.e. the amount of allowable training time. This assumption was verified in Jacot et al. (2018) in some specific cases. Recent work on the convergence theory of over-parameterized neural networks provides a rigorous proof of Equation 18 in the simultaneous limit (i.e. $n_1, \dots, n_L \to \infty$ simultaneously) under discrete-time stochastic gradient descent with MSE loss on various architectures (fully-connected networks, residual CNNs, RNNs) (Du et al., 2018; Allen-Zhu et al., 2018b, a; Zou et al., 2018). Note that the bound in Equation 17 is just a theoretical upper bound and we observe empirically that the rate may be improved to $\mathcal{O}(1/n)$; see Figure 2.
Coupling Equation 18 with Grönwall-type arguments, in the MSE setting, we can upper bound the discrepancy between the outputs of the original network and those of its linearization by a quantity that approaches $0$ as the width goes to infinity. Intuitively, the ODE of the original network (Equation 4) can be considered as a $\|\hat{\Theta}_t - \hat{\Theta}_0\|_F$-fluctuation from the linearized ODE (Equation 8). One expects the difference between the solutions of these two ODEs to be upper bounded by some functional of $\|\hat{\Theta}_t - \hat{\Theta}_0\|_F$ (refer to SM Section F for the proof). Therefore, for a large width network, the training dynamics can be well approximated by linearized dynamics.
Note that the updates for individual weights in Equation 7 vanish in the infinite width limit, which for instance can be seen from the explicit width dependence of the gradients in the NTK parameterization. Individual weights move by a vanishingly small amount for wide networks in this regime of dynamics, as do hidden layer activations, but they collectively conspire to provide a finite change in the final output of the network, as is necessary for training.
An additional insight gained from linearization of the network is that the dynamics derived in Jacot et al. (2018) are equivalent to a random features method, where the features are the gradients of the model with respect to its weights.
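The random features reading can be illustrated in a few lines. In the sketch below the rows of `j_train` and `j_test` play the role of the gradient features $\nabla_\theta f_0$ (random stand-ins, an assumption), and the late-time kernel prediction is compared with minimum-norm least squares on those features.

```python
import numpy as np

rng = np.random.default_rng(0)
m, m_test, p = 5, 3, 40
j_train = rng.standard_normal((m, p))       # stand-in for grad_theta f_0(X) (assumption)
j_test = rng.standard_normal((m_test, p))   # stand-in for grad_theta f_0(x) (assumption)
f0_train, f0_test = rng.standard_normal(m), rng.standard_normal(m_test)
y = rng.standard_normal(m)

# Kernel view: late-time limit of the linearized MSE dynamics.
theta_tt, theta_xt = j_train @ j_train.T, j_test @ j_train.T
f_kernel = f0_test + theta_xt @ np.linalg.solve(theta_tt, y - f0_train)

# Random-features view: minimum-norm least squares on the gradient features.
omega, *_ = np.linalg.lstsq(j_train, y - f0_train, rcond=None)
f_features = f0_test + j_test @ omega
```

Both routes give the same test predictions because, for a full-row-rank feature matrix $J$, $J^T (J J^T)^{-1}$ is the pseudo-inverse that `np.linalg.lstsq` applies in the underdetermined case.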
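The linearization of wide neural networks and its training dynamics can be generalized in multiple directions.
One direction is to go beyond vanilla gradient descent dynamics. We consider momentum updates (combining the usual two stage update into a single equation)

$$\theta_{i+1} = \theta_i + \gamma\, (\theta_i - \theta_{i-1}) - \eta\, \nabla_\theta \mathcal{L}\big|_{\theta = \theta_i}.$$

The discrete update to the function output becomes

$$f^{\text{lin}}_{i+1}(x) = f^{\text{lin}}_i(x) - \eta\, \hat{\Theta}_0(x, \mathcal{X})\, \nabla_{f^{\text{lin}}_i(\mathcal{X})} \mathcal{L} + \gamma \left(f^{\text{lin}}_i(x) - f^{\text{lin}}_{i-1}(x)\right),$$

where $f^{\text{lin}}_i(x)$ is the output of the linearized network after $i$ update steps and $\gamma$ is the momentum coefficient.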
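As a sanity check of the function-space momentum recursion, the sketch below runs heavy-ball updates once on the parameters of a linear model and once directly on its outputs through the kernel, using MSE loss. The random Jacobians, step size, and momentum coefficient are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
m, m_test, p = 5, 3, 40
j_train = rng.standard_normal((m, p)) / np.sqrt(p)      # stand-in for grad_theta f_0(X)
j_test = rng.standard_normal((m_test, p)) / np.sqrt(p)  # stand-in for grad_theta f_0(x)
f0_train, f0_test = rng.standard_normal(m), rng.standard_normal(m_test)
y = rng.standard_normal(m)
eta, gamma, steps = 0.1, 0.9, 200

# Parameter space: omega_{i+1} = omega_i + gamma (omega_i - omega_{i-1}) - eta * grad.
w_prev = w = np.zeros(p)
for _ in range(steps):
    grad_f = f0_train + j_train @ w - y                  # MSE gradient w.r.t. train outputs
    w, w_prev = w + gamma * (w - w_prev) - eta * j_train.T @ grad_f, w
f_train_param = f0_train + j_train @ w
f_test_param = f0_test + j_test @ w

# Function space: the same recursion written on the outputs, using only the kernel.
theta_tt, theta_xt = j_train @ j_train.T, j_test @ j_train.T
f_tr_prev = f_tr = f0_train.copy()
f_te_prev = f_te = f0_test.copy()
for _ in range(steps):
    grad_f = f_tr - y
    f_tr_new = f_tr - eta * theta_tt @ grad_f + gamma * (f_tr - f_tr_prev)
    f_te_new = f_te - eta * theta_xt @ grad_f + gamma * (f_te - f_te_prev)
    f_tr_prev, f_tr = f_tr, f_tr_new
    f_te_prev, f_te = f_te, f_te_new
```

The two trajectories coincide up to floating-point error because the model is exactly linear in its parameters; for a finite-width nonlinear network the agreement is only approximate.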
2.5.2 Cross-entropy loss
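One can extend the loss function to general functions with multiple output dimensions. Unlike for squared error, we do not have a closed form solution to the dynamics equation. However, the equations for the dynamics can be solved using an ODE solver as an initial value problem. For cross-entropy loss with softmax output (see SM Section C),

$$\ell(f, y) = -\sum_i y^i \log \sigma(f)^i, \qquad \sigma(f)^i \equiv \frac{\exp(f^i)}{\sum_j \exp(f^j)},$$

the dynamics equation becomes

$$\dot{f}^{\text{lin}}_t(x) = -\eta\, \hat{\Theta}_0(x, \mathcal{X}) \left(\sigma(f^{\text{lin}}_t(\mathcal{X})) - \mathcal{Y}\right).$$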
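The cross-entropy dynamics can be integrated numerically as an initial value problem. The sketch below uses a fixed-step fourth-order Runge-Kutta scheme as a simple stand-in for an adaptive solver, and assumes that the kernel is shared across output dimensions (so that a single $|\mathcal{D}| \times |\mathcal{D}|$ matrix acts on each logit column). The kernel, labels, and step size are illustrative assumptions.

```python
import numpy as np

def softmax(f):
    z = f - f.max(axis=1, keepdims=True)   # subtract the row max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def rhs(f, theta, y, eta):
    """Right-hand side of the linearized ODE: -eta * Theta (softmax(f) - Y)."""
    return -eta * theta @ (softmax(f) - y)

def rk4(f, theta, y, eta, dt, steps):
    """Fixed-step RK4 integration of the logits on the training set."""
    for _ in range(steps):
        k1 = rhs(f, theta, y, eta)
        k2 = rhs(f + 0.5 * dt * k1, theta, y, eta)
        k3 = rhs(f + 0.5 * dt * k2, theta, y, eta)
        k4 = rhs(f + dt * k3, theta, y, eta)
        f = f + dt / 6.0 * (k1 + 2 * k2 + 2 * k3 + k4)
    return f

def cross_entropy(f, y):
    return -(y * np.log(softmax(f))).sum()

rng = np.random.default_rng(0)
m, k, p = 6, 3, 40
j_train = rng.standard_normal((m, p)) / np.sqrt(p)   # stand-in gradient features (assumption)
theta = j_train @ j_train.T                          # PSD kernel on the training set
y = np.eye(k)[rng.integers(0, k, size=m)]            # one-hot labels
f0 = 0.1 * rng.standard_normal((m, k))               # initial logits
f_final = rk4(f0, theta, y, eta=1.0, dt=0.05, steps=400)
loss_init, loss_final = cross_entropy(f0, y), cross_entropy(f_final, y)
```

Since both the softmax probabilities and the one-hot labels sum to one over classes, the sum of the logits of each example is conserved by these dynamics, which gives a convenient check on an integrator.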
2.5.3 Beyond fully-connected networks
Although our theoretical analysis has focused on fully-connected architectures, there is good reason to expect the results to extend to a much broader class of models. In particular, a wealth of recent literature suggests that the mean field theory governing the wide network limit of fully-connected models Poole et al. (2016); Schoenholz et al. (2017) extends naturally to residual networks Yang & Schoenholz (2017), CNNs Xiao et al. (2018), RNNs Chen et al. (2018), batch normalization Yang et al. (2019), and general architectures Yang (2019). We postpone the development of these theoretical extensions in favor of a purely empirical investigation of linearization for a variety of architectures (see Section 3.3).
In this section, we provide empirical support showing that the training dynamics of wide neural networks are well captured by linearized models. We consider fully-connected, convolutional, and wide ResNet architectures trained with full- and mini-batch gradient descent using learning rates sufficiently small so that the continuous time approximation holds well. We consider two-class classification on CIFAR-10 (horses and planes) as well as ten-class classification on MNIST and CIFAR-10. When using MSE loss, we treat the binary classification task as regression with one class regressing to $+1$ and the other to $-1$.
3.1 Convergence of empirical kernel
As in Novak et al. (2019), we can use Monte Carlo estimates of the tangent kernel (Equation 5) to probe convergence to the infinite width kernel (analytically computed using Equations S3, S6). For simplicity, we consider random inputs drawn from with . In Figure 1, we observe convergence as both the width $n$ increases and the number of Monte Carlo samples increases. For both NNGP and tangent kernels we observe $\|\hat{\Theta}^{(n)} - \Theta\|_F = \mathcal{O}(1/\sqrt{n})$ and $\|\hat{\mathcal{K}}^{(n)} - \mathcal{K}\|_F = \mathcal{O}(1/\sqrt{n})$, as predicted by a CLT Daniely et al. (2016).
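A small Monte Carlo experiment along these lines can be sketched for the NNGP kernel of a one-hidden-layer ReLU network, for which the infinite-width expectation has a known closed form (the degree-one arc-cosine formula). The input dimension, the values of `sigma_w` and `sigma_b`, and the list of widths are assumptions for illustration and do not reproduce the paper's experimental setup.

```python
import numpy as np

def relu_expectation(a, b, c):
    """E[relu(u) relu(v)] for (u, v) ~ N(0, [[a, c], [c, b]]) (arc-cosine formula)."""
    cos_t = np.clip(c / np.sqrt(a * b), -1.0, 1.0)
    t = np.arccos(cos_t)
    return np.sqrt(a * b) / (2 * np.pi) * (np.sin(t) + (np.pi - t) * cos_t)

def analytic_nngp(x1, x2, sigma_w, sigma_b):
    """Infinite-width NNGP kernel of the readout pre-activation, one hidden ReLU layer."""
    n0 = len(x1)
    k1 = lambda u, v: sigma_w ** 2 * (u @ v) / n0 + sigma_b ** 2   # first-layer kernel
    return sigma_w ** 2 * relu_expectation(k1(x1, x1), k1(x2, x2), k1(x1, x2)) + sigma_b ** 2

def empirical_nngp(x1, x2, width, sigma_w, sigma_b, rng):
    """Finite-width estimate from one random draw of the first-layer variables."""
    n0 = len(x1)
    omega = rng.standard_normal((n0, width))
    beta = rng.standard_normal(width)
    h1 = sigma_w / np.sqrt(n0) * x1 @ omega + sigma_b * beta
    h2 = sigma_w / np.sqrt(n0) * x2 @ omega + sigma_b * beta
    return sigma_w ** 2 * np.mean(np.maximum(h1, 0) * np.maximum(h2, 0)) + sigma_b ** 2

rng = np.random.default_rng(0)
x1, x2 = rng.standard_normal(16), rng.standard_normal(16)
exact = analytic_nngp(x1, x2, 1.0, 0.1)
errors = {n: abs(empirical_nngp(x1, x2, n, 1.0, 0.1, rng) - exact)
          for n in (100, 1_000, 10_000, 100_000)}
```

Each entry of `errors` comes from a single random draw, so individual values fluctuate; averaging over several draws per width would be needed to read off the $1/\sqrt{n}$ trend cleanly.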
Moreover, the change during training in the NNGP and tangent kernels, and in individual weights, becomes small as the width increases, as shown in Figure 2.
3.2 Predictive output distribution
In the case of an MSE loss, the output distribution remains Gaussian throughout training. In Figure 3, the predictive output distribution for input points interpolated between two training points is shown for an ensemble of neural networks and their corresponding GPs. The interpolation is given by $x(\alpha) = \alpha x^{(1)} + (1 - \alpha) x^{(2)}$, where $x^{(1)}$ and $x^{(2)}$ are two training inputs with different classes. We observe that the mean and variance dynamics of neural network outputs during gradient descent training follow the analytic dynamics from linearization well (Equations 15, 16). Moreover, the NNGP posterior, which corresponds to exact Bayesian inference, is similar to but noticeably different from the predictive distribution at the end of gradient descent training. For the dynamics of individual function draws, see SM Figure S4.
3.3 Comparison of training dynamics of linearized network to original network
For a particular realization of a finite width network, one can analytically predict the dynamics of the weights and outputs over the course of training using the empirical tangent kernel at initialization. In Figures 5, 6, and 7, we compare these linearized dynamics (Equations 9, 10) with the result of training the actual network. In all cases we see remarkably good agreement. We also observe that for finite networks, dynamics predicted using the empirical kernel $\hat{\Theta}$ better match the data than those obtained using the infinite-width, analytic kernel $\Theta$. To understand this we note that $\|\hat{\Theta}^{(n)}_T - \hat{\Theta}^{(n)}_0\|_F = \mathcal{O}(1/n) \le \mathcal{O}(1/\sqrt{n}) = \|\hat{\Theta}^{(n)}_0 - \Theta\|_F$, where $\hat{\Theta}^{(n)}_0$ denotes the empirical tangent kernel of a width-$n$ network, as plotted in Figure 2.
method for ODE integration, which is the default integrator in TensorFlow (tf.contrib.integrate.odeint). In Figure 4, we see that the learning dynamics for the CIFAR-10 all-class classification task with cross-entropy loss are well described by the linearized model. In Figure 6, we tested full MNIST digit classification with cross-entropy loss, trained with a momentum optimizer. For cross-entropy loss with softmax output, some logits at late times grow indefinitely, in contrast to MSE loss where logits converge to their target values. The error between the original and linearized models for cross-entropy loss becomes much worse at late times if the two models deviate significantly before the logits enter their late-time steady-growth regime (see Figure S5).
One can directly optimize the parameters of $f^{\text{lin}}$ instead of solving the ODE induced by the tangent kernel $\hat{\Theta}_0$. Standard neural network optimization techniques such as mini-batching, weight decay, and data augmentation can be directly applied. In Figures 5 and 7, we compare the training dynamics of the linearized and original networks while directly training both networks.
As discussed in Section 2.5.3, the linearized dynamics successfully describe the training of networks beyond vanilla fully-connected models. To demonstrate the generality of this procedure we show that we can predict the learning dynamics of Wide Residual Networks (WRN) Zagoruyko & Komodakis (2016). WRNs are a class of models that are popular in computer vision and leverage convolutions, batch normalization, skip connections, and average pooling. In Figure 7, we show a comparison between the linearized dynamics and the true dynamics for a wide residual network trained with MSE loss and SGD with momentum. We slightly modified the block structure described in Figure 7 so that each layer has a constant number of channels (1024 in this case) and otherwise followed the original implementation. As elsewhere, we see strong agreement between the predicted dynamics and the result of training.
|group name||output size||block type|
|conv1||32 × 32||[3 × 3, channel size]|
|avg-pool||1 × 1||[8 × 8]|
3.4 Effects of depth and dataset size
The training dynamics of a neural network match those of its linearization when the width is infinite and the dataset is finite. In previous experiments, we chose sufficiently wide networks to achieve small error between neural networks and their linearization for smaller datasets. Here we investigate how the agreement between the linearized dynamics and the true dynamics behaves as a function of width and dataset size across a wide range of models. We consider Root Mean Squared Error (RMSE) between the predicted outputs and the true outputs of the network over the test set. Generally, the RMSE will increase with time until it plateaus at the end of training at some final value. In Figure 8 we plot this plateau RMSE for a range of models as a function of both width and dataset size. Overall, we observe that as the width grows the error decreases. This decrease goes approximately as for fully-connected networks, with more ambiguous scaling for convolutional and WRN architectures. Additionally, we see that the error grows approximately linearly in the size of the dataset. Thus, although error grows with dataset this can be counterbalanced by a corresponding increase in the model size.
We showed theoretically that the learning dynamics in parameter space of deep nonlinear neural networks are exactly described by a linearized model in the infinite width limit. Empirical investigation revealed that this agrees very well with actual training dynamics and predictive distributions across fully-connected, convolutional, and even wide residual network architectures, as well as with different optimizers (gradient descent, momentum, mini-batching) and loss functions (MSE, cross-entropy). Our results suggest that a surprising number of realistic neural networks may be operating in the regime we studied.
In the regime which we study, since the learning dynamics are fully captured by the kernel $\hat{\Theta}$ and the target signal, studying the properties of $\hat{\Theta}$ to determine trainability and generalization is an interesting future direction. Furthermore, the infinite width limit gives us a simple characterization of both gradient descent and Bayesian inference. Some preliminary observations in Lee et al. (2018) showed that wide neural networks trained with SGD perform similarly to the corresponding GPs as width increases, while Novak et al. (2019) found the opposite in the case of convolutional networks without pooling. By studying properties of the NNGP kernel $\mathcal{K}$ and the tangent kernel $\Theta$, we may shed light on the inductive bias of gradient descent.
We thank Roman Novak and Greg Yang for useful discussions and feedback.
- Allen-Zhu et al. (2018a) Allen-Zhu, Z., Li, Y., and Song, Z. A convergence theory for deep learning via over-parameterization. arXiv preprint arXiv:1811.03962, 2018a.
- Allen-Zhu et al. (2018b) Allen-Zhu, Z., Li, Y., and Song, Z. On the convergence rate of training recurrent neural networks. arXiv preprint arXiv:1810.12065, 2018b.
- Chen et al. (2018) Chen, M., Pennington, J., and Schoenholz, S. Dynamical isometry and a mean field theory of RNNs: Gating enables signal propagation in recurrent neural networks. In International Conference on Machine Learning, 2018.
- Chizat & Bach (2018a) Chizat, L. and Bach, F. On the global convergence of gradient descent for over-parameterized models using optimal transport. In Advances in neural information processing systems, pp. 3040–3050, 2018a.
- Chizat & Bach (2018b) Chizat, L. and Bach, F. A note on lazy training in supervised differentiable programming. arXiv preprint arXiv:1812.07956, 2018b.
- Cho & Saul (2009) Cho, Y. and Saul, L. K. Kernel methods for deep learning. In Advances in neural information processing systems, 2009.
- Daniely (2017) Daniely, A. SGD learns the conjugate kernel class of the network. In Advances in Neural Information Processing Systems 30. 2017.
- Daniely et al. (2016) Daniely, A., Frostig, R., and Singer, Y. Toward deeper understanding of neural networks: The power of initialization and a dual view on expressivity. In Advances In Neural Information Processing Systems, 2016.
- Devlin et al. (2018) Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
- Dragomir (2003) Dragomir, S. S. Some Gronwall type inequalities and applications. Nova Science Publishers New York, 2003.
- Du et al. (2018) Du, S. S., Lee, J. D., Li, H., Wang, L., and Zhai, X. Gradient descent finds global minima of deep neural networks. arXiv preprint arXiv:1811.03804, 2018.
- Garriga-Alonso et al. (2018) Garriga-Alonso, A., Aitchison, L., and Rasmussen, C. E. Deep convolutional networks as shallow gaussian processes. arXiv preprint arXiv:1808.05587, 2018.
- Glorot & Bengio (2010) Glorot, X. and Bengio, Y. Understanding the difficulty of training deep feedforward neural networks. In International Conference on Artificial Intelligence and Statistics, pp. 249–256, 2010.
- He et al. (2016) He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In Conference on Computer Vision and Pattern Recognition, pp. 770–778, 2016.
- Jacot et al. (2018) Jacot, A., Gabriel, F., and Hongler, C. Neural tangent kernel: Convergence and generalization in neural networks. In Advances in Neural Information Processing Systems 31. 2018.
- Karras et al. (2018) Karras, T., Aila, T., Laine, S., and Lehtinen, J. Progressive growing of GANs for improved quality, stability, and variation. In International Conference on Learning Representations, 2018.
- Krizhevsky et al. (2012) Krizhevsky, A., Sutskever, I., and Hinton, G. E. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems. 2012.
- Lee et al. (2018) Lee, J., Bahri, Y., Novak, R., Schoenholz, S., Pennington, J., and Sohl-dickstein, J. Deep neural networks as gaussian processes. In International Conference on Learning Representations, 2018.
- Matthews et al. (2017) Matthews, A. G. d. G., Hron, J., Turner, R. E., and Ghahramani, Z. Sample-then-optimize posterior sampling for bayesian linear models. In NeurIPS Workshop on Advances in Approximate Bayesian Inference, 2017. URL http://approximateinference.org/2017/accepted/MatthewsEtAl2017.pdf.
- Matthews et al. (2018) Matthews, A. G. d. G., Hron, J., Rowland, M., Turner, R. E., and Ghahramani, Z. Gaussian process behaviour in wide deep neural networks. In International Conference on Learning Representations, 2018.
- Mei et al. (2018) Mei, S., Montanari, A., and Nguyen, P.-M. A mean field view of the landscape of two-layer neural networks. Proceedings of the National Academy of Sciences, 115(33):E7665–E7671, 2018.
- Neal (1994) Neal, R. M. Priors for infinite networks (tech. rep. no. crg-tr-94-1). University of Toronto, 1994.
- Neyshabur et al. (2015) Neyshabur, B., Tomioka, R., and Srebro, N. In search of the real inductive bias: On the role of implicit regularization in deep learning. In International Conference on Learning Representations workshop track, 2015.
- Neyshabur et al. (2019) Neyshabur, B., Li, Z., Bhojanapalli, S., LeCun, Y., and Srebro, N. The role of over-parametrization in generalization of neural networks. In International Conference on Learning Representations, 2019.
- Novak et al. (2018) Novak, R., Bahri, Y., Abolafia, D. A., Pennington, J., and Sohl-Dickstein, J. Sensitivity and generalization in neural networks: an empirical study. In International Conference on Learning Representations, 2018.
- Novak et al. (2019) Novak, R., Xiao, L., Lee, J., Bahri, Y., Yang, G., Hron, J., Abolafia, D. A., Pennington, J., and Sohl-Dickstein, J. Bayesian deep convolutional networks with many channels are gaussian processes. In International Conference on Learning Representations, 2019.
- Park et al. (2018) Park, D. S., Smith, S. L., Sohl-dickstein, J., and Le, Q. V. Optimal SGD hyperparameters for fully connected networks. In NeurIPS Bayesian Deep Learning Workshop, 2018. URL http://bayesiandeeplearning.org/2018/papers/86.pdf.
- Poole et al. (2016) Poole, B., Lahiri, S., Raghu, M., Sohl-Dickstein, J., and Ganguli, S. Exponential expressivity in deep neural networks through transient chaos. In Advances In Neural Information Processing Systems, pp. 3360–3368, 2016.
- Qian (1999) Qian, N. On the momentum term in gradient descent learning algorithms. Neural networks, 12(1):145–151, 1999.
- Rotskoff & Vanden-Eijnden (2018) Rotskoff, G. M. and Vanden-Eijnden, E. Neural networks as interacting particle systems: Asymptotic convexity of the loss landscape and universal scaling of the approximation error. arXiv preprint arXiv:1805.00915, 2018.
- Saxe et al. (2013) Saxe, A. M., McClelland, J. L., and Ganguli, S. Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. arXiv preprint arXiv:1312.6120, 2013.
- Schoenholz et al. (2017) Schoenholz, S. S., Gilmer, J., Ganguli, S., and Sohl-Dickstein, J. Deep information propagation. International Conference on Learning Representations, 2017.
- Sirignano & Spiliopoulos (2018) Sirignano, J. and Spiliopoulos, K. Mean field analysis of neural networks. arXiv preprint arXiv:1805.01053, 2018.
- Su et al. (2014) Su, W., Boyd, S., and Candes, E. A differential equation for modeling nesterov’s accelerated gradient method: Theory and insights. In Advances in Neural Information Processing Systems, pp. 2510–2518, 2014.
- van Laarhoven (2017) van Laarhoven, T. L2 regularization versus batch and weight normalization. arXiv preprint arXiv:1706.05350, 2017.
- Williams (1997) Williams, C. K. Computing with infinite networks. In Advances in neural information processing systems, pp. 295–301, 1997.
- Xiao et al. (2018) Xiao, L., Bahri, Y., Sohl-Dickstein, J., Schoenholz, S., and Pennington, J. Dynamical isometry and a mean field theory of CNNs: How to train 10,000-layer vanilla convolutional neural networks. In International Conference on Machine Learning, 2018.
- Yang (2019) Yang, G. Scaling limits of wide neural networks with weight sharing: Gaussian process behavior, gradient independence, and neural tangent kernel derivation. arXiv preprint arXiv:1902.04760, 2019.
- Yang & Schoenholz (2017) Yang, G. and Schoenholz, S. Mean field residual networks: On the edge of chaos. In Advances in Neural Information Processing Systems, 2017.
- Yang et al. (2019) Yang, G., Pennington, J., Rao, V., Sohl-Dickstein, J., and Schoenholz, S. S. A mean field theory of batch normalization. In International Conference on Learning Representations, 2019.
- Zagoruyko & Komodakis (2016) Zagoruyko, S. and Komodakis, N. Wide residual networks. In British Machine Vision Conference, 2016.
- Zhang et al. (2019) Zhang, C., Bengio, S., and Singer, Y. Are all layers created equal? arXiv preprint arXiv:1902.01996, 2019.
- Zou et al. (2018) Zou, D., Cao, Y., Zhou, D., and Gu, Q. Stochastic gradient descent optimizes over-parameterized deep ReLU networks. arXiv preprint arXiv:1811.08888, 2018.
Appendix A Computing NTK and NNGP Kernel
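For completeness, we reproduce, informally, the recursive formulas for the NNGP kernel and the tangent kernel from Lee et al. (2018) and Jacot et al. (2018), respectively. Let the activation function $\phi:\mathbb{R}\to\mathbb{R}$ be absolutely continuous. Let $\mathcal{T}$ and $\dot{\mathcal{T}}$ be functions from $2\times 2$ positive semi-definite matrices $\Sigma$ to $\mathbb{R}$ given by
$$\mathcal{T}(\Sigma) = \mathbb{E}\left[\phi(u)\phi(v)\right], \qquad \dot{\mathcal{T}}(\Sigma) = \mathbb{E}\left[\phi'(u)\phi'(v)\right], \qquad (u,v)\sim\mathcal{N}(0,\Sigma).$$
In the infinite width limit, the NNGP and tangent kernels can be computed recursively. Let $x, x'$ be two inputs in $\mathbb{R}^{n_0}$. Then $h^l(x)$ and $h^l(x')$ converge in distribution to a joint Gaussian as $\min\{n_1,\dots,n_{l-1}\}\to\infty$. The mean is zero and the variance $\mathcal{K}^l(x,x')$ is
$$\mathcal{K}^l(x,x') = \tilde{\mathcal{K}}^l(x,x')\otimes\mathbf{Id}_{n_l},$$
$$\tilde{\mathcal{K}}^l(x,x') = \sigma_\omega^2\,\mathcal{T}\left(\begin{bmatrix}\tilde{\mathcal{K}}^{l-1}(x,x) & \tilde{\mathcal{K}}^{l-1}(x,x')\\ \tilde{\mathcal{K}}^{l-1}(x,x') & \tilde{\mathcal{K}}^{l-1}(x',x')\end{bmatrix}\right) + \sigma_b^2,$$
with base case
$$\tilde{\mathcal{K}}^1(x,x') = \sigma_\omega^2\cdot\frac{1}{n_0}x^Tx' + \sigma_b^2.$$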
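Using this one can also derive the tangent kernel for gradient descent training. We will use induction to show that
$$\Theta^l(x,x') = \tilde{\Theta}^l(x,x')\otimes\mathbf{Id}_{n_l}, \qquad \tilde{\Theta}^l(x,x') = \tilde{\mathcal{K}}^l(x,x') + \sigma_\omega^2\,\tilde{\Theta}^{l-1}(x,x')\,\dot{\mathcal{T}}\left(\Sigma^{l-1}(x,x')\right),$$
with $\tilde{\Theta}^1 = \tilde{\mathcal{K}}^1$, where $\Sigma^{l-1}(x,x')$ denotes the $2\times 2$ matrix with diagonal entries $\tilde{\mathcal{K}}^{l-1}(x,x)$, $\tilde{\mathcal{K}}^{l-1}(x',x')$ and off-diagonal entries $\tilde{\mathcal{K}}^{l-1}(x,x')$. Let
$$J^l(x) = \nabla_{\theta^{\le l}}h_0^l(x) = \left[\nabla_{\theta^{l}}h_0^l(x),\ \nabla_{\theta^{<l}}h_0^l(x)\right].$$
Then
$$J^l(x)J^l(x')^T = \nabla_{\theta^{l}}h_0^l(x)\,\nabla_{\theta^{l}}h_0^l(x')^T + \nabla_{\theta^{<l}}h_0^l(x)\,\nabla_{\theta^{<l}}h_0^l(x')^T.$$
Letting $n_1,\dots,n_{l-1}\to\infty$ sequentially, the first term converges to the NNGP kernel $\mathcal{K}^l(x,x')$. By applying the chain rule and the induction step (letting $n_1,\dots,n_{l-2}\to\infty$ sequentially), the second term is
$$\nabla_{\theta^{<l}}h_0^l(x)\,\nabla_{\theta^{<l}}h_0^l(x')^T = \frac{\partial h_0^l(x)}{\partial h_0^{l-1}(x)}\,\nabla_{\theta^{\le l-1}}h_0^{l-1}(x)\,\nabla_{\theta^{\le l-1}}h_0^{l-1}(x')^T\,\frac{\partial h_0^l(x')}{\partial h_0^{l-1}(x')}^T$$
$$\to \frac{\partial h_0^l(x)}{\partial h_0^{l-1}(x)}\left(\tilde{\Theta}^{l-1}(x,x')\otimes\mathbf{Id}_{n_{l-1}}\right)\frac{\partial h_0^l(x')}{\partial h_0^{l-1}(x')}^T \qquad (n_1,\dots,n_{l-2}\to\infty)$$
$$\to \sigma_\omega^2\,\tilde{\Theta}^{l-1}(x,x')\,\mathbb{E}\left[\phi'\left(h_{0,i}^{l-1}(x)\right)\phi'\left(h_{0,i}^{l-1}(x')\right)\right]\otimes\mathbf{Id}_{n_l} \qquad (n_{l-1}\to\infty)$$
$$= \sigma_\omega^2\,\tilde{\Theta}^{l-1}(x,x')\,\dot{\mathcal{T}}\left(\Sigma^{l-1}(x,x')\right)\otimes\mathbf{Id}_{n_l}.$$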
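As a rough illustration of how the recursion can be evaluated, the sketch below iterates the scalar kernels layer by layer. It assumes the standard form of the recursion from Lee et al. (2018) and Jacot et al. (2018): the layer-1 kernel is $\sigma_\omega^2 x^Tx'/n_0 + \sigma_b^2$, each subsequent NNGP kernel is $\sigma_\omega^2\mathcal{T}(\cdot)+\sigma_b^2$, and the tangent kernel adds $\sigma_\omega^2$ times the previous tangent kernel times $\dot{\mathcal{T}}(\cdot)$. The function name `kernel_recursion` and the callables `T`, `T_dot` are hypothetical names chosen for this example. The identity activation is used only because its Gaussian expectations are exact by inspection; closed forms for other activations can be passed in the same way.

```python
def kernel_recursion(x, xp, depth, T, T_dot, sigma_w=1.0, sigma_b=0.0):
    """Return the scalar kernels (K^depth(x, x'), Theta^depth(x, x')).

    T and T_dot take the entries (s11, s12, s22) of a 2x2 covariance and
    return E[phi(u) phi(v)] and E[phi'(u) phi'(v)] for (u, v) ~ N(0, Sigma).
    """
    n0 = len(x)

    def base(a, b):
        # Assumed base case: sigma_w^2 * <a, b> / n0 + sigma_b^2
        return sigma_w ** 2 * sum(ai * bi for ai, bi in zip(a, b)) / n0 + sigma_b ** 2

    k11, k12, k22 = base(x, x), base(x, xp), base(xp, xp)
    theta = k12  # the tangent kernel equals the NNGP kernel at layer 1
    for _ in range(depth - 1):
        # T_dot is evaluated on the covariance of the previous layer
        t_dot = T_dot(k11, k12, k22)
        next_k11 = sigma_w ** 2 * T(k11, k11, k11) + sigma_b ** 2
        next_k12 = sigma_w ** 2 * T(k11, k12, k22) + sigma_b ** 2
        next_k22 = sigma_w ** 2 * T(k22, k22, k22) + sigma_b ** 2
        theta = next_k12 + sigma_w ** 2 * theta * t_dot
        k11, k12, k22 = next_k11, next_k12, next_k22
    return k12, theta


# Identity activation phi(u) = u: E[u v] = s12 and E[phi'(u) phi'(v)] = 1.
def identity_T(s11, s12, s22):
    return s12


def identity_T_dot(s11, s12, s22):
    return 1.0


x, xp = [1.0, 2.0], [3.0, -1.0]
k, theta = kernel_recursion(x, xp, depth=3, T=identity_T, T_dot=identity_T_dot)
print(k, theta)  # prints 0.5 1.5
```

For the identity activation with $\sigma_\omega = 1$ and $\sigma_b = 0$, the NNGP kernel stays at $x^Tx'/n_0$ at every layer, and the tangent kernel grows by that amount per layer, which is why the depth-3 value is three times the NNGP value.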
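Appendix B Tangent kernel for ReLU and erf
For ReLU and erf activation functions, the tangent kernel can be computed analytically. We begin with the case $\phi = \mathrm{ReLU}$; using the formula from Cho & Saul (2009), we can compute $\mathcal{T}$ and $\dot{\mathcal{T}}$ in closed form. Let $\Sigma$ be a $2\times 2$ PSD matrix. We will use
$$k_n(x,y) = \int \mathbb{1}[x\cdot w>0]\,\mathbb{1}[y\cdot w>0]\,(x\cdot w)^n(y\cdot w)^n\,\frac{e^{-\|w\|^2/2}}{(2\pi)^{d/2}}\,dw = \frac{1}{2\pi}\|x\|^n\|y\|^n J_n(\theta),$$
where
$$\theta(x,y) = \arccos\left(\frac{x\cdot y}{\|x\|\,\|y\|}\right), \qquad J_0(\theta) = \pi-\theta, \qquad J_1(\theta) = \sin\theta + (\pi-\theta)\cos\theta.$$
Let $u = x\cdot w$ and $v = y\cdot w$. Then $(u,v)$ is a mean zero Gaussian with $\Sigma = \begin{bmatrix}x\cdot x & x\cdot y\\ x\cdot y & y\cdot y\end{bmatrix}$. Then
$$\mathcal{T}(\Sigma) = k_1(x,y) = \frac{1}{2\pi}\|x\|\,\|y\|\,J_1(\theta), \qquad \dot{\mathcal{T}}(\Sigma) = k_0(x,y) = \frac{1}{2\pi}J_0(\theta).$$
For $\phi = \mathrm{erf}$, let $\Sigma$ be the same as above. Following Williams (1997), we get
$$\mathcal{T}(\Sigma) = \frac{2}{\pi}\sin^{-1}\left(\frac{2\,x\cdot y}{\sqrt{(1+2\,x\cdot x)(1+2\,y\cdot y)}}\right), \qquad \dot{\mathcal{T}}(\Sigma) = \frac{4}{\pi}\det(I+2\Sigma)^{-1/2}.$$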
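The closed forms for the two activations can be checked on simple covariances. The sketch below assumes the arc-cosine kernel expressions of Cho & Saul (2009) for ReLU, with the $1/(2\pi)$ normalization of a plain Gaussian expectation, and the arcsine expression of Williams (1997) for erf. The function names are hypothetical, and each function takes the three distinct entries of the $2\times 2$ covariance.

```python
import math


def relu_T_and_T_dot(s11, s12, s22):
    """E[relu(u) relu(v)] and E[relu'(u) relu'(v)] for (u, v) ~ N(0, Sigma).

    Assumes s11 > 0 and s22 > 0 so the angle between the inputs is defined.
    """
    cos_theta = max(-1.0, min(1.0, s12 / math.sqrt(s11 * s22)))
    theta = math.acos(cos_theta)
    j0 = math.pi - theta
    j1 = math.sin(theta) + (math.pi - theta) * cos_theta
    t = math.sqrt(s11 * s22) * j1 / (2.0 * math.pi)
    t_dot = j0 / (2.0 * math.pi)
    return t, t_dot


def erf_T_and_T_dot(s11, s12, s22):
    """E[erf(u) erf(v)] and E[erf'(u) erf'(v)] for (u, v) ~ N(0, Sigma)."""
    t = (2.0 / math.pi) * math.asin(
        2.0 * s12 / math.sqrt((1.0 + 2.0 * s11) * (1.0 + 2.0 * s22))
    )
    # det(I + 2 * Sigma) written out for a 2x2 matrix
    det = (1.0 + 2.0 * s11) * (1.0 + 2.0 * s22) - 4.0 * s12 ** 2
    t_dot = 4.0 / (math.pi * math.sqrt(det))
    return t, t_dot


# Independent standard normals: the expectations factorize.
print(relu_T_and_T_dot(1.0, 0.0, 1.0))  # approximately (1 / (2 * pi), 0.25)
print(erf_T_and_T_dot(1.0, 0.0, 1.0))   # approximately (0, 4 / (3 * pi))
```

The independent case is a convenient sanity check because each expectation reduces to a product of one-dimensional Gaussian integrals, for example $\mathbb{E}[\mathrm{ReLU}(u)]^2 = 1/(2\pi)$.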
Appendix C Multi-dimensional output and cross-entropy loss
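For completeness, we include the derivation for cross-entropy loss with softmax output,
$$\ell(f,y) = -\sum_i y^i\log\sigma(f^i), \qquad \sigma(f^i) \equiv \frac{\exp(f^i)}{\sum_j\exp(f^j)}.$$
Recall that $\partial\ell/\partial f^i = \sigma(f^i) - y^i$. For a general input point $x\in\mathbb{R}^{n_0}$ and for an arbitrary function $f^i(x)$ parameterized by $\theta$, gradient flow dynamics is given by
$$\dot{f}_t^i(x) = \nabla_\theta f_t^i(x)\,\frac{d\theta}{dt} = -\eta\,\nabla_\theta f_t^i(x)\sum_{(z,y)\in\mathcal{D}}\sum_j \nabla_\theta f_t^j(z)^T\,\frac{\partial\ell(f_t(z),y)}{\partial f^j}$$
$$= -\eta\sum_{(z,y)\in\mathcal{D}}\sum_j \nabla_\theta f_t^i(x)\,\nabla_\theta f_t^j(z)^T\left(\sigma(f_t^j(z)) - y^j\right).$$
Let $\hat{\Theta}^{ij}(x,\mathcal{X}) = \nabla_\theta f^i(x)\,\nabla_\theta f^j(\mathcal{X})^T$. The above is
$$\dot{f}_t(\mathcal{X}) = -\eta\,\hat{\Theta}_t(\mathcal{X},\mathcal{X})\left(\sigma(f_t(\mathcal{X})) - \mathcal{Y}\right),$$
$$\dot{f}_t(x) = -\eta\,\hat{\Theta}_t(x,\mathcal{X})\left(\sigma(f_t(\mathcal{X})) - \mathcal{Y}\right).$$
The linearization is
$$\dot{f}_t^{\mathrm{lin}}(\mathcal{X}) = -\eta\,\hat{\Theta}_0(\mathcal{X},\mathcal{X})\left(\sigma(f_t^{\mathrm{lin}}(\mathcal{X})) - \mathcal{Y}\right),$$
$$\dot{f}_t^{\mathrm{lin}}(x) = -\eta\,\hat{\Theta}_0(x,\mathcal{X})\left(\sigma(f_t^{\mathrm{lin}}(\mathcal{X})) - \mathcal{Y}\right).$$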
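The derivation above rests on the fact that the gradient of the softmax cross-entropy with respect to the logits is the softmax output minus the target (for targets that sum to one). The short sketch below checks this identity numerically with central finite differences; the helper names and the example logits are illustrative choices, not taken from the paper.

```python
import math


def softmax(f):
    """Numerically stable softmax of a list of logits."""
    m = max(f)
    exps = [math.exp(v - m) for v in f]
    z = sum(exps)
    return [e / z for e in exps]


def cross_entropy(f, y):
    """Cross-entropy between target distribution y and softmax(f)."""
    return -sum(yi * math.log(pi) for yi, pi in zip(y, softmax(f)))


def grad_logits(f, y):
    """Analytic gradient with respect to the logits: softmax(f) - y."""
    return [pi - yi for pi, yi in zip(softmax(f), y)]


def numerical_grad(f, y, eps=1e-6):
    """Central finite-difference gradient, used only as a check."""
    grads = []
    for i in range(len(f)):
        up, down = list(f), list(f)
        up[i] += eps
        down[i] -= eps
        grads.append((cross_entropy(up, y) - cross_entropy(down, y)) / (2.0 * eps))
    return grads


logits = [0.5, -1.0, 2.0]
target = [0.0, 0.0, 1.0]  # one-hot label
analytic = grad_logits(logits, target)
numeric = numerical_grad(logits, target)
max_gap = max(abs(a - n) for a, n in zip(analytic, numeric))
print(max_gap)  # small, limited by finite-difference error
```

This residual, softmax output minus target, is the quantity that multiplies the empirical tangent kernel in the function-space dynamics, in place of the plain residual that appears for squared loss.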
Appendix D Gradient flow dynamics for training only the readout-layer
The connection between Gaussian processes and Bayesian wide neural networks can be extended to the setting when only the readout layer parameters are being optimized. More precisely, we show that when training only the readout layer, the outputs of the network form a Gaussian process (over an ensemble of draws from the parameter prior) throughout training, where that output is an interpolation between the GP prior and GP posterior.
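Note that for any $x, x'\in\mathbb{R}^{n_0}$, in the infinite width limit $\bar{x}(x)\,\bar{x}(x')^T\to\mathcal{K}(x,x')$ in probability, where for notational simplicity we assign $\bar{x}(x) = \left[\frac{\sigma_\omega x^L(x)}{\sqrt{n_L}},\ \sigma_b\right]$. The regression problem is specified with mean-squared loss
$$\mathcal{L} = \frac{1}{2}\left\|f(\mathcal{X}) - \mathcal{Y}\right\|_2^2 = \frac{1}{2}\left\|\bar{x}(\mathcal{X})\,\theta^{L+1} - \mathcal{Y}\right\|_2^2,$$
and applying gradient flow to optimize the readout layer (and freezing all other parameters),
$$\dot{\theta}^{L+1} = -\eta\,\bar{x}(\mathcal{X})^T\left(\bar{x}(\mathcal{X})\,\theta^{L+1} - \mathcal{Y}\right),$$
where $\eta$ is the learning rate. The solution to this ODE gives the evolution of the output at an arbitrary point $x^*$. So long as the empirical kernel $\hat{\mathcal{K}}(\mathcal{X},\mathcal{X}) = \bar{x}(\mathcal{X})\,\bar{x}(\mathcal{X})^T$ is invertible, it is
$$f_t(x^*) = f_0(x^*) + \hat{\mathcal{K}}(x^*,\mathcal{X})\,\hat{\mathcal{K}}(\mathcal{X},\mathcal{X})^{-1}\left(e^{-\eta t\hat{\mathcal{K}}(\mathcal{X},\mathcal{X})} - I\right)\left(f_0(\mathcal{X}) - \mathcal{Y}\right).$$
For any $x, x'\in\mathbb{R}^{n_0}$, letting $n_l\to\infty$ for $l = 1,\dots,L$, one has the convergence in probability and in distribution, respectively,
$$\hat{\mathcal{K}}(x,x')\to\mathcal{K}(x,x'), \qquad \bar{x}(\mathcal{X})\,\theta_0^{L+1}\to\mathcal{N}\left(0,\mathcal{K}(\mathcal{X},\mathcal{X})\right).$$
Moreover, $f_0(x^*)$ and the term containing $f_0(\mathcal{X})$ are the only stochastic terms over the ensemble of network initializations; therefore, for any $t$, the output $f_t(x^*)$ throughout training converges to a Gaussian distribution in the infinite width limit, with
$$\mathbb{E}\left[f_t(x^*)\right] = \mathcal{K}(x^*,\mathcal{X})\,\mathcal{K}^{-1}\left(I - e^{-\eta\mathcal{K}t}\right)\mathcal{Y},$$
$$\mathrm{Var}\left[f_t(x^*)\right] = \mathcal{K}(x^*,x^*) - \mathcal{K}(x^*,\mathcal{X})\,\mathcal{K}^{-1}\left(I - e^{-2\eta\mathcal{K}t}\right)\mathcal{K}(x^*,\mathcal{X})^T,$$
where $\mathcal{K} = \mathcal{K}(\mathcal{X},\mathcal{X})$.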
Thus the output of the neural network is also a GP, and the asymptotic solution (i.e. $t\to\infty$) is identical to the posterior of the NNGP (Equation 2.3). Therefore, in the infinite width case, the optimized neural network is performing posterior sampling if only the readout layer is being trained. This result is a realization of the sample-then-optimize equivalence identified in Matthews et al. (2017).
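The interpolation between GP prior and GP posterior can be made concrete with a single training point, where the matrix exponential reduces to a scalar. The sketch below assumes the loss $\frac{1}{2}\|f(\mathcal{X})-\mathcal{Y}\|^2$ and that solving the resulting linear ODE gives mean $\mathcal{K}(x^*,\mathcal{X})\mathcal{K}^{-1}(I-e^{-\eta\mathcal{K}t})\mathcal{Y}$ and variance $\mathcal{K}(x^*,x^*)-\mathcal{K}(x^*,\mathcal{X})\mathcal{K}^{-1}(I-e^{-2\eta\mathcal{K}t})\mathcal{K}(x^*,\mathcal{X})^T$. The function name and the kernel values are hypothetical.

```python
import math


def readout_gp_moments(k_ss, k_sx, k_xx, y, eta, t):
    """Mean and variance of f_t(x*) when training only the readout layer.

    Single training point x with target y; k_ss = K(x*, x*),
    k_sx = K(x*, x), k_xx = K(x, x) > 0 (assumed scalar kernel values).
    """
    decay = math.exp(-eta * k_xx * t)
    mean = (k_sx / k_xx) * (1.0 - decay) * y
    var = k_ss - (k_sx ** 2 / k_xx) * (1.0 - decay ** 2)
    return mean, var


k_ss, k_sx, k_xx, y = 1.0, 0.6, 1.5, 2.0

prior = readout_gp_moments(k_ss, k_sx, k_xx, y, eta=0.1, t=0.0)
late = readout_gp_moments(k_ss, k_sx, k_xx, y, eta=0.1, t=1e4)

# Standard GP posterior mean and variance for one noiseless observation
posterior = (k_sx / k_xx * y, k_ss - k_sx ** 2 / k_xx)

print(prior)      # zero mean and prior variance k_ss
print(late)       # matches the GP posterior up to rounding
print(posterior)
```

At $t = 0$ the moments are those of the prior, and as $t$ grows they approach the posterior moments; the mean relaxes at rate $\eta\mathcal{K}$ and the variance at rate $2\eta\mathcal{K}$.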
Appendix E Results for NTK parameterization transfer to standard parameterization
Here we present a sketch for why the linearization results, derived for NTK parameterized networks, also apply to networks with a standard parameterization.
The NTK parameterization in Equation 1 is not commonly used for training neural networks. While the function that the network represents is the same for both NTK and standard parameterization, training dynamics under gradient descent are generally different for the two parameterizations. However, for a particular choice of layer-dependent learning rate, the training dynamics also become identical. Let $\eta^l_{\mathrm{NTK},w}$ and $\eta^l_{\mathrm{NTK},b}$ be the layer-dependent learning rates for $W^l$ and $b^l$ in the NTK parameterization, and $\eta_{\mathrm{std}} = \frac{1}{n_{\max}}\eta_0$ be the learning rate for all parameters in the standard parameterization, where $n_{\max} = \max_l n_l$. Recall that gradient descent training in standard neural networks requires a learning rate that scales with width like $\frac{1}{n_{\max}}$, so $\eta_0$ defines a width-invariant learning rate (Park et al., 2018). If we choose