1 Introduction
Deep neural networks have emerged in recent years as flexible parametric models which can fit complex patterns in data. As a contrasting approach, Gaussian processes have long served as a traditional nonparametric tool for modeling. An equivalence between these two approaches was derived in
Neal (1994a), for the case of one layer networks in the limit of infinite width. Neal (1994a) further suggested that a similar correspondence might hold for deeper networks.Consider a deep fullyconnected neural network with i.i.d. random parameters. Each scalar output of the network, an affine transformation of the final hidden layer, will be a sum of i.i.d. terms. As we will discuss in detail below, in the limit of infinite width the Central Limit Theorem
^{1}^{1}1Throughout this paper, we assume the conditions on the parameter distributions and nonlinearities are such that the Central Limit Theorem will hold; for instance, that the weight variance is scaled inversely proportional to the layer width.
implies that the function computed by the neural network (NN) is a function drawn from a Gaussian process (GP). In the case of single hiddenlayer networks, the form of the kernel of this GP is well known (Neal (1994a); Williams (1997)).This correspondence implies that if we choose the hypothesis space to be the class of infinitely wide neural networks, an i.i.d. prior over weights and biases can be replaced with a corresponding GP prior over functions. As noted by (Williams, 1997), this substitution enables exact Bayesian inference for regression using neural networks. The computation requires building the necessary covariance matrices over the training and test sets and straightforward linear algebra computations.
In light of the resurgence in popularity of neural networks, it is timely to revisit this line of work. We delineate the correspondence between deep and wide neural networks and GPs and utilize it for Bayesian training of neural networks on regression tasks.
1.1 Related Work
Our work touches on aspects of GPs, Bayesian learning, and compositional kernels. The correspondence between infinite neural networks and GPs was first noted by Neal (1994a, b). Williams (1997) computes analytic GP kernels for single hiddenlayer neural networks with error function or Gaussian nonlinearities and noted the use of the GP prior for exact Bayesian inference in regression. Duvenaud et al. (2014) discusses several routes to building deep GPs and observes the degenerate form of kernels that are composed infinitely many times – a point we will return to Section 3.2 – but they do not derive the form of GP kernels as we do. Hazan & Jaakkola (2015) also discusses constructing kernels equivalent to infinitely wide deep neural networks, but their construction does not go beyond two hidden layers with nonlinearities.
Related work has also appeared outside of the GP context but in compositional kernel constructions. Cho & Saul (2009)
derives compositional kernels for polynomial rectified nonlinearities, which includes the Sign and ReLU nonlinearities, and can be used in GPs; our manner of composing kernels matches theirs, though the context is different.
Daniely et al. (2016) extends the construction of compositional kernels to neural networks whose underlying directed acyclic graph is of general form. They also prove, utilizing the formalism of dual activations, that compositional kernels originating from fullyconnected topologies with the same nonlinearity become degenerate when composed infinitely many times. In a different context than compositional kernels, Poole et al. (2016); Schoenholz et al. (2017)study the same underlying recurrence relation for the specific case of fullyconnected networks and bounded nonlinearities. They distinguish regions in hyperparameter space with different fixed points and convergence behavior in the recurrence relations. The focus in these works was to better understand the expressivity and trainability of deep networks.
Drawing inspiration from the multilayer nature of deep neural networks, there is a line of work considering various approaches to stacking GPs, such as deep GPs (Lawrence & Moore (2007); Damianou & Lawrence (2013); Hensman & Lawrence (2014); Duvenaud et al. (2014); Bui et al. (2016)), which can give rise to a richer class of probabilistic models beyond GPs. This contrasts with our work, where we study GPs that are in direct correspondence with deep, infinitely wide neural networks. Krauth et al. (2016) has recently explored the performance of GP models with deep kernels given in Cho & Saul (2009), implemented with scalable approximations. However, they do not discuss the equivalence between deep neural networks and GPs with compositional kernels, which constitutes a conceptual contribution of our work. Furthermore, we note that the GP kernels in our work are more general than the compositional kernel construction outlined in Cho & Saul (2009) in two respects: (i) we are not limited to rectified polynomials but can deal with general nonlinearities, and (ii) we consider two additional hyperparameters in the kernels, which would correspond to the weight and bias parameter variances in a neural network. Finally, Gal & Ghahramani (2016) connects dropout in deep neural networks with approximate Bayesian inference in deep GPs.
Another series of recent works (Wilson et al. (2016b, a); AlShedivat et al. (2017)), termed deep kernel learning, utilize GPs with base kernels which take in features produced by a deep multilayer neural network, and train the resulting model endtoend. Our work differs from these in that our GP corresponds to a multilayer neural network. Additionally, our GP kernels have many fewer parameters, and these parameters correspond to the hyperparameters of the equivalent neural network.
1.2 Summary of Contributions
We begin by specifying the form of a GP which corresponds to a deep, infinitely wide neural network – hereafter referred to as the Neural Network GP (NNGP) – in terms of a recursive, deterministic computation of the kernel function. The prescription is valid for generic pointwise nonlinearities in fullyconnected feedforward networks. We develop a computationally efficient method (Section 2.5) to compute the covariance function corresponding to deep neural networks with fixed hyperparameters.
In this work, as a first proof of concept of our NNGP construction, we focus on exact Bayesian inference for regression tasks, treating classification as regression on class labels. While less principled, leastsquares classification performs well (Rifkin et al., 2003) and allows us to compare exact inference via a GP to prediction by a trained neural network on wellstudied tasks (MNIST and CIFAR10 classification). Note that it is possible to extend GPs to softmax classification with cross entropy loss (Williams & Barber (1998); Rasmussen & Williams (2006)), which we aim to investigate in future work.
We conduct experiments making Bayesian predictions on MNIST and CIFAR10 (Section 3) and compare against NNs trained with standard gradientbased approaches. The experiments explore different hyperparameter settings of the Bayesian training including network depth, nonlinearity, training set size (up to and including the full dataset consisting of tens of thousands of images), and weight and bias variance. Our experiments reveal that the best NNGP performance is consistently competitive against that of NNs trained with gradientbased techniques, and the best NNGP setting, chosen across hyperparameters, often surpasses that of conventional training (Section 3, Table 1). We further observe that, with increasing network width, the performance of neural networks with gradientbased training approaches that of the NNGP computation, and that the GP uncertainty is strongly correlated with prediction error. Furthermore, the performance of the NNGP depends on the structure of the kernel, which can be connected to recent work on signal propagation in networks with random parameters (Schoenholz et al., 2017).
2 Deep, infinitely wide Neural Networks are drawn from GPs
We begin by specifying the correspondence between GPs and deep, infinitely wide neural networks, which hinges crucially on application of the Central Limit Theorem. We review the singlehidden layer case (Section 2.2) before moving to the multilayer case (Section 2.3).
2.1 Notation
Consider an hiddenlayer fullyconnected neural network with hidden layers of width (for layer ) and pointwise nonlinearities . Let denote the input to the network, and let denote its output. The th component of the activations in the th layer, postnonlinearity and postaffine transformation, are denoted and respectively. We will refer to these as the post and preactivations. (We let for the input, dropping the Arabic numeral superscript, and instead use a Greek superscript to denote a particular input ). Weight and bias parameters for the th layer have components , which are independent and randomly drawn, and we take them all to have zero mean and variances and , respectively. denotes a Gaussian process with mean and covariance functions , respectively.
2.2 Review of Gaussian Processes and Singlelayer Neural Networks
We briefly review the correspondence between singlehidden layer neural networks and GPs (Neal (1994a, b); Williams (1997)). The th component of the network output, , is computed as,
(1) 
where we have emphasized the dependence on input . Because the weight and bias parameters are taken to be i.i.d., the postactivations are independent for . Moreover, since is a sum of i.i.d terms, it follows from the Central Limit Theorem that in the limit of infinite width ,
will be Gaussian distributed. Likewise, from the multidimensional Central Limit Theorem, any finite collection of
will have a joint multivariate Gaussian distribution, which is exactly the definition of a Gaussian process. Therefore we conclude that , a GP with mean and covariance , which are themselves independent of . Because the parameters have zero mean, we have that and,(2) 
where we have introduced as in Neal (1994a); it is obtained by integrating against the distribution of . Note that, as any two for are joint Gaussian and have zero covariance, they are guaranteed to be independent despite utilizing the same features produced by the hidden layer.
2.3 Gaussian Processes and Deep Neural Networks
The arguments of the previous section can be extended to deeper layers by induction. We proceed by taking the hidden layer widths to be infinite in succession (, , etc.) as we continue with the induction, to guarantee that the input to the layer under consideration is already governed by a GP. In Appendix C we provide an alternative derivation in terms of Bayesian marginalization over intermediate layers, which does not depend on the order of limits, in the case of a Gaussian prior on the weights. A concurrent work (de G. Matthews et al., 2018) further derives the convergence rate towards a GP if all layers are taken to infinite width simultaneously, but at different rates.
Suppose that is a GP, identical and independent for every (and hence are independent and identically distributed). After steps, the network computes
(3) 
As before, is a sum of i.i.d. random terms so that, as , any finite collection will have joint multivariate Gaussian distribution and . The covariance is
(4) 
By induction, the expectation in Equation 4 is over the GP governing
, but this is equivalent to integrating against the joint distribution of only
and . The latter is described by a zero mean, twodimensional Gaussian whose covariance matrix has distinct entries , , and . As such, these are the only three quantities that appear in the result. We introduce the shorthand(5) 
to emphasize the recursive relationship between and via a deterministic function whose form depends only on the nonlinearity . This gives an iterative series of computations which can be performed to obtain for the GP describing the network’s final output.
For the base case , suppose ; we can utilize the recursion relating and , where
(6) 
In fact, these recurrence relations have appeared in other contexts. They are exactly the relations derived in the mean field theory of signal propagation in fullyconnected random neural networks (Poole et al. (2016); Schoenholz et al. (2017)) and also appear in the literature on compositional kernels (Cho & Saul (2009); Daniely et al. (2016)
). For certain activation functions, Equation
5 can be computed analytically (Cho & Saul (2009); Daniely et al. (2016)). In the case of the ReLU nonlinearity, it yields the wellknown arccosine kernel (Cho & Saul (2009)) whose form we reproduce in Appendix B. When no analytic form exists, it can instead be efficiently computed numerically, as described in Section 2.5.2.4 Bayesian Training of Neural Networks using Gaussian Process Priors
Here we provide a short review of how a GP prior over functions can be used to do Bayesian inference; see e.g. (Rasmussen & Williams, 2006) for a comprehensive review of GPs. Given a dataset consisting of inputtarget pairs , we wish to make a Bayesian prediction at test point using a distribution over functions . This distribution is constrained to take values on the training inputs and,
(7) 
where are the targets on the training set, and corresponds to observation noise. We will assume a noise model consisting of a Gaussian with variance centered at .
If the conditions of Section 2.2 or 2.3 apply, our choice of prior over functions implies that are draws from a GP and is a multivariate Gaussian whose covariance matrix has the form
where the block structure corresponds to the division between the training set and the test point. That is, is an matrix whose th element is with , while e.g. the th element of is . As is standard in GPs, the integral in Equation 7 can be done exactly, resulting in with
(8)  
(9) 
where is the identity. The predicted distribution for is hence determined from straightforward matrix computations, yet nonetheless corresponds to fully Bayesian training of the deep neural network. The form of the covariance function used is determined by the choice of GP prior, i.e. the neural network model class, which depends on depth, nonlinearity, and weight and bias variances. We henceforth resume placing a superscript as in to emphasize the choice of depth for the compositional kernel.
2.5 Efficient Implementation of the GP Kernel
Given an layer deep neural network with fixed hyperparameters, constructing the covariance matrix for the equivalent GP involves computing the Gaussian integral in Equation 4 for all pairs of trainingtraining and trainingtest points, recursively for all layers. For some nonlinearities, such as ReLU, this integration can be done analytically. However, to compute the kernel corresponding to arbitrary nonlinearities, the integral must be performed numerically.
The most direct implementation of a numerical algorithm for would be to compute integrals independently for each pair of datapoints and each layer. This is prohibitively expensive and costs , where
is the sampling density for the pair of Gaussian random variables in the 2D integral and
are the training and test set sizes, respectively. However, by careful pipelining, and by preprocessing all inputs to have identical norm, we can improve this cost to , where and are sampling densities for a variance and correlation grid, as described below. In order to achieve this, we break the process into several steps:
Generate: preactivations consisting of elements linearly spaced between and ; variances with linearly spaced elements, where ; and correlations with linearly spaced elements. Note that we are using fixed, rather than adaptive, sampling grids to allow operations to be parallelized and reused across datapoints and layers.

Populate a matrix containing a lookup table for the function in Equation 5. This involves numerically approximating a Gaussian integral, in terms of the marginal variances and correlations . We guarantee that the marginal variance is identical for each datapoint, by preprocessing all datapoints to have identical norm at the input layer, so the number of entries in the lookup table need only be . These entries are computed as^{2}^{2}2For numerical reasons, in practice an independent 1D lookup table is built for the case that .:
(10) 
For every pair of datapoints and in layer , compute using Equation 5. Approximate the function
by bilinear interpolation into the matrix
from Step 2, where we interpolate into using the value of , and interpolate into using . Remember that , due to data preprocessing to guarantee constant norm. 
Repeat the previous step recursively for all layers. Bilinear interpolation has constant cost, so this has cost .
This computational recipe allows us to compute the covariance matrix for the NNGP corresponding to any wellbehaved nonlinearity
. All computational steps above can be implemented using accelerated tensor operations, and computation of
is typically faster than solving the system of linear equations in Equation 89. Figure 6 illustrates the close agreement between the kernel function computed numerically (using this approach) and analytically, for the ReLU nonlinearity. It also illustrates the angular dependence of the kernel and its evolution with increasing depth.Finally, note that the full computational pipeline is deterministic and differentiable
. The shape and properties of a deep network kernel are purely determined by hyperparameters of the deep neural network. Since GPs give exact marginal likelihood estimates, this kernel construction may allow principled hyperparameter selection, or nonlinearity design, e.g. by gradient ascent on the log likelihood w.r.t. the hyperparameters. Although this is not the focus of current work, we hope to return to this topic in followup work.
An open source implementation of the algorithm is available at https://github.com/brainresearch/nngp.
3 Experimental Results
3.1 Description
We compare NNGPs with SGD^{3}^{3}3For all presented results, the variant of SGD used is Adam. Although not shown, we found vanilla SGD produced qualitatively similar results, with slightly higher MSE. trained neural networks on the permutation invariant MNIST and CIFAR10 datasets. The baseline neural network is a fullyconnected network with identical width at each hidden layer. Training is on the mean squared error (MSE) loss, chosen so as to allow direct comparison to GP predictions. Formulating classification as regression often leads to good results (Rifkin & Klautau, 2004). Future work may involve evaluating the NNGP on a cross entropy loss using the approach in (Williams & Barber, 1998; Rasmussen & Williams, 2006). Training used the Adam optimizer (Kingma & Ba (2014)) with learning rate and initial weight/bias variances optimized over validation error using the Google Vizier hyperparameter tuner (Golovin et al., 2017). Dropout was not used. In future work, it would be interesting to incorporate dropout into the NNGP covariance matrix using an approach like that in (Schoenholz et al., 2017)
. For the study, nonlinearities were chosen to be either rectified linear units (ReLU) or hyperbolic tangent (Tanh). Class labels were encoded as a onehot, zeromean, regression target (i.e., entries of 0.1 for the incorrect class and 0.9 for the correct class). We constructed the covariance kernel numerically for ReLU and Tanh nonlinearities following the method described in Section
2.5.Performance: We find that the NNGP often outperforms trained finite width networks. See Table 1 and Figure 1.


We additionally find the performance of the best finitewidth NNs, trained with a variant of SGD, approaches that of the NNGP with increasing layer width. This is interesting from at least two, potentially related, standpoints. (1) NNs are commonly believed to be powerful because of their ability to do flexible representation learning, while our NNGP uses fixed basis functions; nonetheless, in our experiments we find no salient performance advantage to the former. (2) It hints at a possible relationship between SGD and Bayesian inference in certain regimes – were the neural networks trained in a fully Bayesian fashion, rather than by SGD, the approach to NNGP in the large width limit would be guaranteed. There is recent work suggesting that SGD can implement approximate Bayesian inference (Mandt et al., 2017) under certain assumptions.
The similarity of the performance of the widest NN in Figure 1 with the NNGP suggests that the limit of infinite network width, which is inherent to the GP, is far from being a disadvantage. Indeed, in practice it is found that the best generalizing NNs are in fact the widest. To support this, in Figure 2 we show generalization gap results from an experiment in which we train 180 fullyconnected networks with five hidden layers on CIFAR10 with a range of layer widths. For this experiment, we trained the networks using a standard cross entropy loss rather than MSE, leading to a slight difference in performance.
Uncertainty: One benefit in using a GP is that, due to its Bayesian nature, all predictions have uncertainty estimates (Equation 9). For conventional neural networks, capturing the uncertainty in a model’s predictions is challenging (Gal, 2016). In the NNGP, every test point has an explicit estimate of prediction variance associated with it (Equation 9). In our experiments, we observe that the NNGP uncertainty estimate is highly correlated with prediction error (Figure 3).
Num training  Model (ReLU)  Test accuracy  Model (tanh)  Test accuracy 

MNIST:1k  NN250003.190.00  0.9252  NN210000.600.00  0.9254 
GP201.450.28  0.9279  GP201.960.62  0.9266  
MNIST:10k  NN220000.420.16  0.9771  NN220002.411.84  0.9745 
GP70.610.07  0.9765  GP21.620.28  0.9773  
MNIST:50k  NN220000.600.44  0.9864  NN250000.280.34  0.9857 
GP10.100.48  0.9875  GP11.280.00  0.9879  
CIFAR:1k  NN55001.290.28  0.3225  NN12001.450.12  0.3378 
GP71.280.00  0.3608  GP502.970.97  0.3702  
CIFAR:10k  NN520001.601.07  0.4545  NN15001.481.59  0.4429 
GP52.970.28  0.4780  GP73.482.00  0.4766  
CIFAR:45k  NN350000.530.01  0.5313  NN220001.052.08  0.5034 
GP33.311.86  0.5566  GP33.481.52  0.5558 
3.2 Relationship to Deep Signal Propagation
Several prior works (Poole et al. (2016); Schoenholz et al. (2017); Daniely et al. (2016); Duvenaud et al. (2014)) have noted the recurrence relations Equation 5 commonly approach a functionally uninteresting fixed point with depth , in that becomes a constant or piecewise constant map. We now briefly relate our ability to train NNGPs with the convergence of to the fixedpoint kernel.
We will be particularly interested in contextualizing our results in relation to Poole et al. (2016); Schoenholz et al. (2017) which analyzed the fixed points and the approach to them in detail for bounded nonlinearities. To briefly recapitulate: there are regions of hyperparameter space (called phases) where changes only quantitatively with and . However, there are low dimensional boundaries that separate different phases and between them the nature of changes qualitatively.
For the Tanh nonlinearity, there are two distinct phases respectively called the “ordered” phase and the “chaotic” phase that can be understood as a competition between the weights and the biases of the network. A diagram showing these phases and the boundary between them is shown in Figure 4
. In the ordered phase, the features obtained by propagating an input through the each layer of the recursion become similar for dissimilar inputs. Fundamentally, this occurs because the different inputs share common bias vectors and so all inputs end up just approaching the random bias. In this case the covariance
for every pair of inputs , where is a constant that depends only on and . All inputs have unit correlation asymptotically with depth. By contrast in the chaotic phase the weight variance dominates and similar inputs become dissimilar with depth as they are randomly projected by the weight matrices. In this case, the covariance for but for . Here is the fixed point correlation. In each of these regimes, there is also a finite depthscale which describes the characteristic number of layers over which the covariance function decays exponentially towards its fixed point form. Exactly at the boundary between these two regimes is a line in space where the decay towards its fixed point is significantly slower and nonexponential. It was noted in Schoenholz et al. (2017) that this approach to the fixedpoint covariance fundamentally bounded whether or not neural networks could successfully be trained. It was shown that initializing networks on this line allowed for significantly deeper neural networks to be trained.For ReLU networks a similar picture emerges, however there are some subtleties due to the unbounded nature of the nonlinearity. In this case for all and , for all and every point becomes asymptotically correlated. Despite this, there are again two phases: a “bounded” phase in which is finite (and nonzero) and an unbounded phase in which is either infinite or zero. As in the Tanh case there are depth scales that control the rate of convergence to these fixed points and therefore limit the maximum trainable depth. The phase diagram for the ReLU nonlinearity is also shown in Figure 4.
In a striking analogy with the trainability of neural networks, we observe that the performance of the NNGP appears to closely track the structure from the phase diagram, clearly illustrated in Figure 4. Indeed, we see that as for hyperparameter settings that are far from criticality, the GP is unable to train and we encounter poor test set performance. By contrast, near criticality we observe that our models display high accuracy. Moreover, we find that the accuracy appears to drop more quickly away from the phase boundary with increase in depth of the GP kernel, . To understand this effect we note that information about data will be available to our model only through the difference . However, as the depth gets larger, this difference becomes increasingly small and at some point can no longer be represented due to numerical precision. At this point our test accuracy begins to quickly degrade to random chance.


4 Conclusion and Future Directions
By harnessing the limit of infinite width, we have specified a correspondence between priors on deep neural networks and Gaussian processes whose kernel function is constructed in a compositional, but fully deterministic and differentiable, manner. Use of a GP prior on functions enables exact Bayesian inference for regression from matrix computations, and hence we are able to obtain predictions and uncertainty estimates from deep neural networks without stochastic gradientbased training. The performance is competitive with the best neural networks (within specified class of fullyconnected models) trained on the same regression task under similar hyperparameter settings. While we were able to run experiments for somewhat large datasets (sizes of 50k), we intend to look into scalability for larger learning tasks, possibly harnessing recent progress in scalable GPs (QuiñoneroCandela & Rasmussen (2005); Hensman et al. (2013)).
In our experiments, we observed the performance of the optimized neural network appears to approach that of the GP computation with increasing width. Whether gradientbased stochastic optimization implements an approximate Bayesian computation is an interesting question (Mandt et al., 2017). Further investigation is needed to determine if SGD does approximately implement Bayesian inference under the conditions typically employed in practice.
Additionally, the NNGP provides explicit estimates of uncertainty. This may be useful in predicting model failure in critical applications of deep learning, or for active learning tasks where it can be used to identify the best datapoints to hand label.
Acknowledgments
We thank Ryan Adams, Samy Bengio, and Matt Hoffman for useful discussions and feedback, and Gamaleldin Elsayed and Daniel Levy for helpful comments on the manuscript.
References

AlShedivat et al. (2017)
Maruan AlShedivat, Andrew Gordon Wilson, Yunus Saatchi, Zhiting Hu, and Eric P
Xing.
Learning scalable deep kernels with recurrent structure.
Journal of Machine Learning Research
, 18(82):1–37, 2017.  Bui et al. (2016) Thang Bui, Daniel HernándezLobato, Jose HernandezLobato, Yingzhen Li, and Richard Turner. Deep gaussian processes for regression using approximate expectation propagation. In International Conference on Machine Learning, pp. 1472–1481, 2016.
 Cho & Saul (2009) Youngmin Cho and Lawrence K Saul. Kernel methods for deep learning. In Advances in neural information processing systems, pp. 342–350, 2009.
 Damianou & Lawrence (2013) Andreas Damianou and Neil Lawrence. Deep gaussian processes. In Artificial Intelligence and Statistics, pp. 207–215, 2013.
 Daniely et al. (2016) Amit Daniely, Roy Frostig, and Yoram Singer. Toward deeper understanding of neural networks: The power of initialization and a dual view on expressivity. In Advances In Neural Information Processing Systems, pp. 2253–2261, 2016.
 de G. Matthews et al. (2018) Alexander G. de G. Matthews, Jiri Hron, Mark Rowland, Richard E. Turner, and Zoubin Ghahramani. Gaussian process behaviour in wide deep neural networks. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=H1nGgWC.
 Duvenaud et al. (2014) David Duvenaud, Oren Rippel, Ryan Adams, and Zoubin Ghahramani. Avoiding pathologies in very deep networks. In Artificial Intelligence and Statistics, pp. 202–210, 2014.
 Gal (2016) Yarin Gal. Uncertainty in deep learning. PhD thesis, PhD thesis, University of Cambridge, 2016.
 Gal & Ghahramani (2016) Yarin Gal and Zoubin Ghahramani. Dropout as a bayesian approximation: Representing model uncertainty in deep learning. In international conference on machine learning, pp. 1050–1059, 2016.
 Golovin et al. (2017) Daniel Golovin, Benjamin Solnik, Subhodeep Moitra, Greg Kochanski, John Karro, and D Sculley. Google vizier: A service for blackbox optimization. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1487–1495. ACM, 2017.
 Hazan & Jaakkola (2015) Tamir Hazan and Tommi Jaakkola. Steps toward deep kernel methods from infinite neural networks. arXiv preprint arXiv:1508.05133, 2015.
 Hensman & Lawrence (2014) James Hensman and Neil D Lawrence. Nested variational compression in deep gaussian processes. arXiv preprint arXiv:1412.1370, 2014.
 Hensman et al. (2013) James Hensman, Nicolo Fusi, and Neil D Lawrence. Gaussian processes for big data. arXiv preprint arXiv:1309.6835, 2013.
 Kingma & Ba (2014) Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
 Krauth et al. (2016) Karl Krauth, Edwin V Bonilla, Kurt Cutajar, and Maurizio Filippone. Autogp: Exploring the capabilities and limitations of gaussian process models. arXiv preprint arXiv:1610.05392, 2016.
 Lawrence & Moore (2007) Neil D Lawrence and Andrew J Moore. Hierarchical gaussian process latent variable models. In Proceedings of the 24th international conference on Machine learning, pp. 481–488. ACM, 2007.
 Mandt et al. (2017) Stephan Mandt, Matthew D Hoffman, and David M Blei. Stochastic gradient descent as approximate bayesian inference. arXiv preprint arXiv:1704.04289, 2017.
 Neal (1994a) Radford M. Neal. Priors for infinite networks (tech. rep. no. crgtr941). University of Toronto, 1994a.
 Neal (1994b) Radford M. Neal. Bayesian Learning for Neural Networks. PhD thesis, University of Toronto, Dept. of Computer Science, 1994b.
 Poole et al. (2016) Ben Poole, Subhaneil Lahiri, Maithra Raghu, Jascha SohlDickstein, and Surya Ganguli. Exponential expressivity in deep neural networks through transient chaos. In Advances In Neural Information Processing Systems, pp. 3360–3368, 2016.
 QuiñoneroCandela & Rasmussen (2005) Joaquin QuiñoneroCandela and Carl Edward Rasmussen. A unifying view of sparse approximate gaussian process regression. Journal of Machine Learning Research, 6(Dec):1939–1959, 2005.
 Rasmussen & Williams (2006) Carl Edward Rasmussen and Christopher KI Williams. Gaussian processes for machine learning, volume 1. MIT press Cambridge, 2006.
 Rifkin & Klautau (2004) Ryan Rifkin and Aldebaro Klautau. In defense of onevsall classification. Journal of machine learning research, 5(Jan):101–141, 2004.
 Rifkin et al. (2003) Ryan Rifkin, Gene Yeo, Tomaso Poggio, et al. Regularized leastsquares classification. Nato Science Series Sub Series III Computer and Systems Sciences, 190:131–154, 2003.
 Schoenholz et al. (2017) Samuel S Schoenholz, Justin Gilmer, Surya Ganguli, and Jascha SohlDickstein. Deep information propagation. ICLR, 2017.
 Williams (1997) Christopher KI Williams. Computing with infinite networks. In Advances in neural information processing systems, pp. 295–301, 1997.
 Williams & Barber (1998) Christopher KI Williams and David Barber. Bayesian classification with gaussian processes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(12):1342–1351, 1998.
 Wilson et al. (2016a) Andrew G Wilson, Zhiting Hu, Ruslan R Salakhutdinov, and Eric P Xing. Stochastic variational deep kernel learning. In Advances in Neural Information Processing Systems, pp. 2586–2594, 2016a.
 Wilson et al. (2016b) Andrew Gordon Wilson, Zhiting Hu, Ruslan Salakhutdinov, and Eric P Xing. Deep kernel learning. In Artificial Intelligence and Statistics, pp. 370–378, 2016b.
Appendix A Draws from an NNGP prior
Figure 5 illustrates the nature of the GP prior for the ReLU nonlinearity by depicting samples of 1D functions drawn from a ReLU GP, , with fixed depth and .
Appendix B Analytic Form for Kernel and Comparison
In the main text, we noted that the recurrence relation Equation 5 can be computed analytically for certain nonlinearities. In particular, this was computed in Cho & Saul (2009) for polynomial rectified nonlinearities. For ReLU, the result including the weight and bias variance is
(11) 
To illustrate the angular form of and its evolution with , in Figure 6 we plot for the ReLU nonlinearity, where is the angle between and with norms such that . We observe a flattening of the angular structure with increase in depth , as predicted from the understanding in Section 3.2. Simultaneously, the figure also illustrates the good agreement between the kernel computed using the numerical implementation of Section 2.5 (blue, starred) and the analytic arccosine kernel, Equation 11 (red), for a particular choice of hyperparameters .
Appendix C Bayesian marginalization over intermediate layers
In this section, we present an alternate derivation of the equivalence between infinitely wide deep neural networks and Gaussian process by marginalization over intermediate layers. For this derivation, we take the weight and bias parameters to be drawn from independent Gaussians, with zero mean and appropriately scaled variance.
We are interested in finding the distribution over network outputs , conditioned on network inputs , for input dimensionality , output dimensionality , and dataset size . Intervening layers will have width , for
. We define the second moment matrix (here
postnonlinearity) for each layer to be(12) 
Our approach is to think of intermediate random variables corresponding to these second moments defined above. By definition, only depends on . In turn, the preactivations are described by a Gaussian process conditioned on the second moment matrix ,
(13) 
where
(14) 
This correspondence of each layer to a GP, conditioned on the layer’s second moment matrix, is exact even for finite width because the parameters are drawn from a Gaussian. Altogether, this justifies the graphical model depicted in Figure 7.
We will write as an integral over all the intervening second moment matrices ,
(15) 
This joint distribution can be decomposed as
(16) 
The directed decomposition in Equation 16 holds because is a function only of (see Equation 12), depends only on (see Equation 13), is a function only of , etc (Figure 7).
The sum in Equation 12 for is a sum over i.i.d. terms. As grows large, the Central Limit Theorem applies, and converges to a Gaussian with variance that shrinks as . Further, in the infinite width limit it will go to a delta function,
(17) 
with defined as in Equation 5. Similarly, the dependence of on can be expressed as a delta function,
(18) 
Substituting , and into Equation 16, we get
(19) 
So, in the limit of infinite width, is described by a Gaussian process with kernel .
Appendix D Details of the Experiments
We outline details of the experiments for Section 3. For MNIST we use a 50k/10k/10k split of the training/validation/test dataset. For CIFAR10, we used a 45k/5k/10k split. The validation set was used for choosing the best hyperparameters and evaluation on the test set is reported.
For training neural networks hyperparameters were optimized via random search on average 250 trials for each choice of (, depth, width, nonlinearity).
Random search range: Learning rate was sampled within in logscale, weight decay constant was sampled from in logscale, , was uniformly sampled and minibatch size was chosen equally among .
For the GP with given depth and nonlinearity, a grid of 30 points evenly spaced from 0.1 to 5.0 (for ) and 30 points evenly spaced from 0 to 2.0 (for ) was evaluated to generate the heatmap. The best GP run was chosen among the 900 evaluations in the  grid.
Computation time: We report computation times for NNGP experiments. The grid generation with took 440460s with 6 CPUs for , which was amortized over all the experiments. For full (50k) MNIST, constructing for each layer took 90140s (depending on CPU generation) running on 64 CPUs. Solving linear equations via Cholesky decomposition took 180220s for 1000 test points.
Details of NNGP implementaion: For all the experiments we used precomputed lookup tables with and . Default value for the target noise was set to and was increased by factor of 10 when Cholesky decomposition failed while solving Equation 8 and 9. We refer to Rasmussen & Williams (2006) for standard numerically stable implementation of GP regression.
Appendix E Further results
Here we include more results from experiments described in Section 3.
Uncertainty: Relationship between the target MSE and the GP’s uncertainty estimate for smaller training set size is shown in Figure 8.
Performance: Performance of grid points of  for varying depth is shown in Figure 9. The best performing NNGP’s hyperparameters are distributed near the critical line (Figure 10) where the phase changes as described in Section 3.2.
Num training  Model (ReLU)  Test accuracy  Model (tanh)  Test accuracy 

MNIST:100  NN250000.100.11  0.7786  NN15001.480.61  0.7766 
GP1001.790.83  0.7735  GP1003.140.97  0.7736  
MNIST:200  NN220000.520.00  0.8298  NN210001.801.99  0.8223 
GP1001.790.83  0.8282  GP1003.992.00  0.8277  
MNIST:500  NN250001.820.77  0.9028  NN150003.742.18  0.9001 
GP1001.790.83  0.8995  GP503.481.86  0.9008  
MNIST:1k  NN250003.190.00  0.9252  NN210000.600.00  0.9254 
GP201.450.28  0.9279  GP201.960.62  0.9266  
MNIST:2k  NN250002.880.01  0.9468  NN120000.981.30  0.9462 
GP101.110.55  0.9485  GP101.791.45  0.9477  
MNIST:5k  NN35002.920.22  0.9675  NN210004.122.18  0.9655 
GP70.610.07  0.9692  GP31.110.00  0.9693  
MNIST:10k  NN220000.420.16  0.9771  NN220002.411.84  0.9745 
GP70.610.07  0.9765  GP21.620.28  0.9773  
MNIST:20k  NN310002.450.98  0.9825  NN220000.210.10  0.9814 
GP51.620.83  0.9830  GP12.630.00  0.9836  
MNIST:50k  NN220000.600.44  0.9864  NN250000.280.34  0.9857 
GP10.100.48  0.9875  GP11.280.00  0.9879  
CIFAR:100  NN55001.881.00  0.2586  NN22003.222.09  0.2470 
GP34.490.97  0.2673  GP103.651.17  0.2718  
CIFAR:200  NN32000.170.00  0.2719  NN32001.410.21  0.2686 
GP33.991.72  0.3022  GP73.650.55  0.2927  
CIFAR:500  NN11001.260.63  0.3132  NN120000.110.90  0.2939 
GP201.790.21  0.3395  GP73.650.62  0.3291  
CIFAR:1k  NN55001.290.28  0.3225  NN12001.450.12  0.3378 
GP71.280.00  0.3608  GP502.970.97  0.3702  
CIFAR:2k  NN350005.590.57  0.3894  NN510000.861.28  0.3597 
GP34.161.17  0.3953  GP54.661.03  0.3959  
CIFAR:5k  NN520005.261.74  0.4241  NN150000.070.22  0.3993 
GP34.661.03  0.4454  GP103.651.38  0.4430  
CIFAR:10k  NN520001.601.07  0.4545  NN15001.481.59  0.4429 
GP52.970.28  0.4780  GP73.482.00  0.4766  
CIFAR:20k  NN350004.180.18  0.5041  NN250000.021.12  0.4565 
GP35.000.83  0.5118  GP73.141.93  0.5124  
CIFAR:45k  NN350000.530.01  0.5313  NN220001.052.08  0.5034 
GP33.311.86  0.5566  GP33.481.52  0.5558 
Comments
There are no comments yet.