1 Results
1.1 Toy example
In the introduction, we noted that infinite Bayesian networks perform worse than standard neural networks trained using stochastic gradient descent. Thus, as we make finite neural networks wider, there should be some point at which performance begins to degrade. We considered a simple, two-layer, fully-connected linear network with the full set of 4-dimensional inputs denoted $\mathbf{X}$, hidden unit activations denoted $\mathbf{F}$, and 10-dimensional outputs denoted $\mathbf{Y}$,
$$\mathbf{F} = \mathbf{X} \mathbf{W}_1, \qquad \mathbf{Y} = \mathbf{F} \mathbf{W}_2 + \boldsymbol{\xi}, \tag{1}$$
where $\boldsymbol{\xi}$ is IID standard Gaussian noise, $\mathbf{W}_1$ is the input-to-hidden weight matrix and $\mathbf{W}_2$ is the hidden-to-output weight matrix, whose columns, $\mathbf{w}_{1,\lambda}$ and $\mathbf{w}_{2,\lambda}$, are generated IID from,
$$\mathbf{w}_{1,\lambda} \sim \mathcal{N}\left(\mathbf{0}, \nu_1 \mathbf{I}\right), \qquad \mathbf{w}_{2,\lambda} \sim \mathcal{N}\left(\mathbf{0}, \nu_2 \mathbf{I}\right), \tag{2}$$
and where the variance of the weights is normalised by the number of inputs to that layer: $\nu_1 = 1/4$ for the 4-dimensional input, and $\nu_2 = 1/N$ for the width, $N$, of the hidden layer.

In the first example (Fig. 1
left), we generated targets for supervised learning using a second neural network with weights generated as described above and a fixed number of hidden units. We evaluated the Bayesian model evidence for networks with many different numbers of hidden units (x-axis). Bayesian reasoning would suggest that the model evidence for the true model (i.e. with a matched number of hidden units) should be higher than the model evidence for any other model, as indeed we found (Fig. 1, top left), and these patterns held true for the predictive probability, or equivalently test performance (Fig. 1, bottom left). While these results give an example where smaller networks perform better, they do not necessarily help us to understand the behaviour of neural networks on real datasets, where the true generative process for the data is not known, and is certainly not in our model class. As such, we considered two further examples where the neural network generating the targets lay outside of our model class. First, we used the same neural network to generate the targets, but multiplied the inputs by a constant (Fig. 1, middle). Second, we modified the inputs by zeroing out all but the first input unit (Fig. 1, right). In both of these experiments, there was an optimum number of hidden units, after which performance degraded as more hidden units were included.

To understand why this might be the case, it is insightful to consider the methods we used to evaluate the model evidence and generate these results. In particular, note that conditioned on $\mathbf{F}$, the output for any given channel, $\mathbf{y}_\lambda$, is IID and depends only on the corresponding column of the output weights, $\mathbf{w}_{2,\lambda}$,
$$P\left(\mathbf{Y} \mid \mathbf{F}, \mathbf{W}_2\right) = \prod_\lambda \mathcal{N}\left(\mathbf{y}_\lambda;\, \mathbf{F} \mathbf{w}_{2,\lambda},\, \mathbf{I}\right). \tag{3}$$
Thus, we can integrate over the output weights, $\mathbf{W}_2$, to obtain a distribution over $\mathbf{Y}$ conditioned on $\mathbf{F}$,
$$P\left(\mathbf{y}_\lambda \mid \mathbf{F}\right) = \mathcal{N}\left(\mathbf{y}_\lambda;\, \mathbf{0},\, \tfrac{1}{N} \mathbf{F} \mathbf{F}^\top + \mathbf{I}\right), \tag{4}$$
which is the classical Gaussian process representation of Bayesian linear regression (Rasmussen & Williams, 2006). Remembering that the hidden activity, $\mathbf{F} = \mathbf{X} \mathbf{W}_1$, is a deterministic function of the weights, $\mathbf{W}_1$, and inputs, $\mathbf{X}$, we can write this distribution as,
$$P\left(\mathbf{y}_\lambda \mid \mathbf{X}, \mathbf{W}_1\right) = \mathcal{N}\left(\mathbf{y}_\lambda;\, \mathbf{0},\, \tfrac{1}{N} \mathbf{X} \mathbf{W}_1 \mathbf{W}_1^\top \mathbf{X}^\top + \mathbf{I}\right). \tag{5}$$
Thus, the first-layer weights, $\mathbf{W}_1$, act as kernel hyperparameters in a Gaussian process: they control the covariance of the outputs, $\mathbf{y}_\lambda$. To evaluate the model evidence we need to integrate over $\mathbf{W}_1$,
$$P\left(\mathbf{Y} \mid \mathbf{X}\right) = \int d\mathbf{W}_1\, P\left(\mathbf{W}_1\right) \prod_\lambda P\left(\mathbf{y}_\lambda \mid \mathbf{X}, \mathbf{W}_1\right), \tag{6}$$
and we estimate this integral by drawing $K$ samples from the prior, $P(\mathbf{W}_1)$. Importantly, while $\mathbf{W}_1$ provides flexibility in the kernel in finite networks, this flexibility gradually disappears as we consider larger networks. In particular,
$$\lim_{N \to \infty} \tfrac{1}{N} \mathbf{W}_1 \mathbf{W}_1^\top = \nu_1 \mathbf{I} = \tfrac{1}{4} \mathbf{I}. \tag{7}$$
Therefore, in this limit, the distribution over $\mathbf{y}_\lambda$ converges to,
$$P\left(\mathbf{y}_\lambda \mid \mathbf{X}\right) = \mathcal{N}\left(\mathbf{y}_\lambda;\, \mathbf{0},\, \tfrac{1}{4} \mathbf{X} \mathbf{X}^\top + \mathbf{I}\right). \tag{8}$$
This is exactly the distribution we would expect from Bayesian linear regression in a one-layer network. Thus, by taking the infinite limit, we have eliminated the additional flexibility afforded by the two-layer network, and we can see that the superior performance of smaller networks in Fig. 1 emerges because they give additional flexibility in the covariance of the outputs, which gradually disappears as network size increases.
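The Monte Carlo estimate of the model evidence described above can be sketched in a few lines of numpy. This is an illustration under our own assumptions (the dataset sizes, the number of prior samples, and the choice of a three-hidden-unit "true" network are arbitrary), not the paper's code:

```python
import numpy as np

rng = np.random.default_rng(0)
P, D_in, D_out = 20, 4, 10           # datapoints, input dim, output dim

def logsumexp(a):
    m = a.max()
    return m + np.log(np.exp(a - m).sum())

def log_evidence(X, Y, N, K=2000, rng=rng):
    """Monte Carlo estimate of log P(Y|X) for the two-layer linear net.

    Integrating out the output weights gives, per output channel,
    y ~ N(0, (1/N) X W1 W1^T X^T + I); we average this Gaussian
    likelihood over K prior samples of W1.
    """
    n_data = X.shape[0]
    logps = np.empty(K)
    for k in range(K):
        W1 = rng.normal(0.0, np.sqrt(1.0 / X.shape[1]), (X.shape[1], N))
        F = X @ W1
        Kcov = F @ F.T / N + np.eye(n_data)        # GP covariance + unit noise
        sign, logdet = np.linalg.slogdet(Kcov)
        Kinv_Y = np.linalg.solve(Kcov, Y)
        # sum of log N(y_lambda; 0, Kcov) over output channels
        logps[k] = -0.5 * (np.einsum('pl,pl->', Y, Kinv_Y)
                           + Y.shape[1] * (logdet + n_data * np.log(2 * np.pi)))
    # log of the Monte Carlo average, computed stably
    return logsumexp(logps) - np.log(K)

# generate targets from a "true" network with N_true hidden units
N_true = 3
X = rng.normal(size=(P, D_in))
W1 = rng.normal(0.0, np.sqrt(1.0 / D_in), (D_in, N_true))
W2 = rng.normal(0.0, np.sqrt(1.0 / N_true), (N_true, D_out))
Y = X @ W1 @ W2 + rng.normal(size=(P, D_out))

for N in [1, 3, 10, 100]:
    print(N, log_evidence(X, Y, N))
```

As the width grows, the estimate should approach the closed-form evidence of the limiting one-layer Gaussian process model, mirroring the loss of kernel flexibility discussed above.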
1.2 Kernel representations for finite networks
In the previous section, we considered the simplest networks in which these phenomena emerge: two-layer linear networks. In this section, we set up a finite deep nonlinear network and show that activity flowing through this network can be understood entirely in terms of kernel matrices.
Consider a single layer within a fully-connected network, where the activity at the previous layer, $\mathbf{H}^{\ell-1}$, corresponding to a batch containing all inputs, is multiplied by a weight matrix, $\mathbf{W}^\ell$, to give activations, $\mathbf{F}^\ell$. This activation matrix is multiplied by another matrix, $\boldsymbol{\Xi}^\ell$, to give an updated activation matrix, $\tilde{\mathbf{F}}^\ell$. Critically, this multiplication leaves the representation unchanged (Fig. 2): $\boldsymbol{\Xi}^\ell$ simply helps to ensure that we can exactly compute the kernel in nonlinear finite networks (see below). Finally, the activations, $\tilde{\mathbf{F}}^\ell$, are passed through a nonlinearity, $\phi$, to give the activity at this layer, $\mathbf{H}^\ell$,
$$\mathbf{H}^\ell = \phi\left(\tilde{\mathbf{F}}^\ell\right), \tag{9}$$
where,
$$\mathbf{F}^\ell = \mathbf{H}^{\ell-1} \mathbf{W}^\ell, \qquad \tilde{\mathbf{F}}^\ell = \mathbf{F}^\ell \boldsymbol{\Xi}^\ell. \tag{10}$$
In a standard neural network, we would set $\boldsymbol{\Xi}^\ell = \mathbf{I}$, such that $\tilde{\mathbf{F}}^\ell = \mathbf{F}^\ell$ and $\mathbf{H}^\ell = \phi(\mathbf{F}^\ell)$. For a fully-connected network, the columns of $\mathbf{W}^\ell$, denoted $\mathbf{w}^\ell_\lambda$, are generated IID from a Gaussian distribution,
$$\mathbf{w}^\ell_\lambda \sim \mathcal{N}\left(\mathbf{0}, \tfrac{1}{N^{\ell-1}} \mathbf{I}\right), \tag{11}$$
where the normalization constants, $1/N^{\ell-1}$, ensure that activations remain normalized.
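This normalization is easy to check numerically: with weight variance $1/N^{\ell-1}$, the mean-squared activation is preserved as activity propagates through linear layers of varying width (a toy check with our own arbitrary sizes):

```python
import numpy as np

rng = np.random.default_rng(0)

# mean-squared activity should stay near 1 as width changes layer to layer
h = rng.normal(size=(16, 100))             # batch of 16 inputs, width 100
scales = [(h ** 2).mean()]
for width in (200, 50, 300):
    # weight variance = 1 / fan-in, matching the normalization in the text
    W = rng.normal(0.0, np.sqrt(1.0 / h.shape[1]), (h.shape[1], width))
    h = h @ W
    scales.append((h ** 2).mean())
print(scales)
```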
We now define the activation kernel and activity kernel,
$$\mathbf{G}^\ell = \tfrac{1}{N^\ell} \mathbf{F}^\ell \left(\mathbf{F}^\ell\right)^\top, \qquad \hat{\mathbf{G}}^\ell = \tfrac{1}{N^\ell} \mathbf{H}^\ell \left(\mathbf{H}^\ell\right)^\top, \tag{12}$$
where the kernels for $\mathbf{F}^\ell$ and $\tilde{\mathbf{F}}^\ell$ are equivalent, as we always use $M \to \infty$, for which $\tfrac{1}{M} \tilde{\mathbf{F}}^\ell (\tilde{\mathbf{F}}^\ell)^\top = \tfrac{1}{N^\ell} \mathbf{F}^\ell (\mathbf{F}^\ell)^\top$. The only remaining object is the covariance, $\boldsymbol{\Omega}^\ell$. As each channel (column) of the activations is a linear function of the corresponding channel of the weights, $\mathbf{f}^\ell_\lambda = \mathbf{H}^{\ell-1} \mathbf{w}^\ell_\lambda$, the activations are Gaussian and IID conditioned on the activity at the previous layer,
$$P\left(\mathbf{f}^\ell_\lambda \mid \mathbf{H}^{\ell-1}\right) = \mathcal{N}\left(\mathbf{f}^\ell_\lambda;\, \mathbf{0},\, \boldsymbol{\Omega}^\ell\right), \tag{13}$$
with covariance $\boldsymbol{\Omega}^\ell$. For a fully-connected network, the covariance, $\boldsymbol{\Omega}^\ell$, is equal to the previous layer's activity kernel, $\hat{\mathbf{G}}^{\ell-1}$,
$$\boldsymbol{\Omega}^\ell = \hat{\mathbf{G}}^{\ell-1}, \tag{14}$$
but the relationship is more complex in convolutional architectures (Garriga-Alonso et al., 2019; Novak et al., 2019) (Appendix A.2).
In order to work entirely in the kernel domain, we need to be able to transform $\hat{\mathbf{G}}^{\ell-1}$ to $\boldsymbol{\Omega}^\ell$ to $\mathbf{G}^\ell$, and back to $\hat{\mathbf{G}}^\ell$ (Fig. 2). The first transformation, from the activity kernel at the previous layer, $\hat{\mathbf{G}}^{\ell-1}$, to the covariance, $\boldsymbol{\Omega}^\ell$, is described above. To perform the next transformation, from the covariance, $\boldsymbol{\Omega}^\ell$, to the activation kernel, $\mathbf{G}^\ell$, we sample activations, $\mathbf{f}^\ell_\lambda$, from a Gaussian with covariance $\boldsymbol{\Omega}^\ell$ (Eq. 13), then directly compute the activation kernel from the activations (Eq. 12). Both of these transformations can be performed in either finite or infinite networks. However, the key difficulty comes when we try to transform the activation kernel, $\mathbf{G}^\ell$, into the activity kernel, $\hat{\mathbf{G}}^\ell$, in finite networks. For deep linear networks, which are useful for analytical if not practical purposes, this issue does not emerge, as the activity kernel is the activation kernel,
$$\hat{\mathbf{G}}^\ell = \mathbf{G}^\ell. \tag{15}$$
However, to compute $\hat{\mathbf{G}}^\ell$ for nonlinear networks, we need the activations to be Gaussian distributed, so that we can apply results from Cho & Saul (2009). In a standard finite network, where we take $\boldsymbol{\Xi}^\ell = \mathbf{I}$, the distribution cannot be Gaussian, as, conditioned on the kernel, the activations are constrained to satisfy $\tfrac{1}{N^\ell} \mathbf{F}^\ell (\mathbf{F}^\ell)^\top = \mathbf{G}^\ell$. This is the reason we require $\boldsymbol{\Xi}^\ell$: if we allow $\boldsymbol{\Xi}^\ell$ to be Gaussian distributed,
$$\Xi^\ell_{\lambda \mu} \sim \mathcal{N}\left(0, \tfrac{1}{N^\ell}\right), \tag{16}$$
and take the limit of $M \to \infty$,
$$\lim_{M \to \infty} \tfrac{1}{M} \tilde{\mathbf{F}}^\ell \left(\tilde{\mathbf{F}}^\ell\right)^\top = \mathbf{F}^\ell \left(\lim_{M \to \infty} \tfrac{1}{M} \boldsymbol{\Xi}^\ell \left(\boldsymbol{\Xi}^\ell\right)^\top\right) \left(\mathbf{F}^\ell\right)^\top = \tfrac{1}{N^\ell} \mathbf{F}^\ell \left(\mathbf{F}^\ell\right)^\top = \mathbf{G}^\ell, \tag{17}$$
then we simultaneously have Gaussian $\tilde{\mathbf{F}}^\ell$, and we have left the activation kernel unchanged. As this network alternates between finite and infinite layers, we call it a finite-infinite network.
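To make a finite-infinite layer concrete, here is a small numpy sketch, using our own notation and arbitrary sizes. The covariance-to-activation-kernel step samples finitely many Gaussian channels; the activation-to-activity-kernel step uses the closed-form ReLU expectation (the order-1 arc-cosine kernel of Cho & Saul, 2009), which is exactly what the infinite mixing matrix buys us:

```python
import numpy as np

rng = np.random.default_rng(0)

def relu_kernel(G):
    """Activity kernel of relu(f) for Gaussian f with covariance G
    (the order-1 arc-cosine kernel of Cho & Saul, 2009)."""
    d = np.sqrt(np.diag(G))
    c = np.clip(G / np.outer(d, d), -1.0, 1.0)
    theta = np.arccos(c)
    return np.outer(d, d) * (np.sin(theta) + (np.pi - theta) * np.cos(theta)) / (2 * np.pi)

def finite_infinite_layer(G_prev, N, rng):
    """One layer: finite Gaussian activations, then infinite mixing + relu."""
    P = G_prev.shape[0]
    L = np.linalg.cholesky(G_prev + 1e-6 * np.eye(P))   # jitter for rank deficiency
    F = L @ rng.normal(size=(P, N))   # N IID channels with covariance Omega = G_prev
    G = F @ F.T / N                   # activation kernel (finite, so it fluctuates)
    return relu_kernel(G)             # activity kernel after the infinite mixing step

X = rng.normal(size=(6, 4))
G = X @ X.T / 4                       # input kernel
for _ in range(3):                    # three hidden layers of width 50
    G = finite_infinite_layer(G, 50, rng)
print(np.diag(G))
```

Because the width is finite, repeated runs give different top-layer kernels: this run-to-run variability is precisely the kernel flexibility analysed in the following sections.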
1.3 Deep neural networks as deep Gaussian processes
Given the above setup, we can see that even a finite network with $\boldsymbol{\Xi}^\ell = \mathbf{I}$ is a deep Gaussian process. In particular, in a deep Gaussian process, the activations at layer $\ell$, denoted $\mathbf{F}^\ell$, consist of IID channels that are Gaussian-process distributed (Eq. 13), with a kernel/covariance determined by the activations at the previous layer. For a fully-connected network,
$$\boldsymbol{\Omega}^\ell = \hat{\mathbf{G}}^{\ell-1} = \tfrac{1}{N^{\ell-1}} \phi\left(\mathbf{F}^{\ell-1}\right) \phi\left(\mathbf{F}^{\ell-1}\right)^\top. \tag{18}$$
The relationship between finite neural networks and deep GPs is worth noting, because the same intuition, of the lower layers shaping the top-layer kernel, arises in both scenarios (e.g. Bui et al., 2016), and because there is potential for applying GP inference methods to neural networks, and vice versa.
1.4 Kernel flexibility: Prior viewpoint: Analytical
We can analyse how flexibility in the kernel emerges by looking at the variability (i.e. the variance and covariance) of $\boldsymbol{\Omega}^\ell$, $\mathbf{G}^\ell$ and $\hat{\mathbf{G}}^\ell$. In the appendix, we derive recursive updates for deep, linear, convolutional networks,
(19a)
(19b)
(19c)
where,
(19d)
and where $i$, $j$ and $k$ index datapoints, whereas $r$ and $s$ index spatial locations, and where fully-connected networks are a special case in which there is only one spatial location in the inputs and hidden layers.
For fully-connected networks, this expression predicts that the variance of the kernel is proportional to the depth (including the last layer; $L+1$) and inversely proportional to the width, $N$,
$$\mathrm{Var}\left[G^{L+1}_{ij}\right] \approx \frac{L+1}{N}\left(G^0_{ii}\, G^0_{jj} + \left(G^0_{ij}\right)^2\right), \tag{20}$$
where $\mathbf{G}^0$ is the input kernel. This expression is so simple because, for a fully-connected linear network, the expected covariance at each layer is the same. For more nonlinear, convolutional or locally-connected networks, the variance is still proportional to $1/N$, but the depth-dependence becomes more complex, as the covariance changes as it propagates through layers.
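This scaling is easy to spot-check by brute force. The sketch below samples deep, fully-connected linear networks from the prior and estimates the variance of the (scalar) top-layer kernel for a single input; the specific widths, depths and sample counts are arbitrary choices of ours, and we check only the proportionality, not the constant:

```python
import numpy as np

rng = np.random.default_rng(2)

def kernel_var(L, N, x, n_samples=3000, rng=rng):
    """Sample deep linear nets from the prior and return the variance of
    the scalar top-layer kernel g = |f|^2 / N for a single input x."""
    g = np.empty(n_samples)
    for s in range(n_samples):
        h = x
        for _ in range(L + 1):                # L hidden layers plus the last layer
            W = rng.normal(0.0, np.sqrt(1.0 / h.size), (h.size, N))
            h = h @ W
        g[s] = h @ h / N
    return g.var()

x = rng.normal(size=4)
v1 = kernel_var(L=2, N=32, x=x)
v2 = kernel_var(L=5, N=32, x=x)   # deeper: variance should grow roughly linearly
v3 = kernel_var(L=2, N=64, x=x)   # wider: variance should roughly halve
print(v1, v2, v3)
```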
1.5 Kernel flexibility: Prior viewpoint: Experimental
To check the validity of these expressions, we sampled 10,000 neural networks from the prior, and evaluated the variance of the kernel for a single input (Fig. 3). These inputs were either spatially unstructured (i.e. white noise) or spatially structured, in which case the inputs were the same across the whole image. For fully-connected networks, we confirmed that the variance of the kernel is proportional to the depth including the last layer, $L+1$, and inversely proportional to the width, $N$ (Fig. 3A). For locally-connected networks, we found that structured and unstructured inputs gave the same kernel variance, which is expected as any spatial structure is destroyed after the first layer (Fig. 3B). Further, for convolutional networks with structured input, the variance of the kernel was proportional to network depth (Fig. 3C bottom), but whenever that spatial structure was absent, either because it was absent in the inputs or because it was eliminated by a locally-connected network (Fig. 3BC bottom), the variance of the kernel was almost constant with depth (see Appendix A.2.1). The large decrease in kernel flexibility for locally-connected networks might be one reason behind the result in Novak et al. (2019) that locally-connected networks have performance that is very similar to an infinite-width network, in which all flexibility has been eliminated. Finally, as the spatial input size increases, for convolutional networks with spatially structured inputs, the variance of the kernel is constant, whereas for locally-connected or spatially unstructured inputs, the variance falls (Fig. 3D).

1.6 Kernel flexibility: Posterior viewpoint: Analytical
An alternative approach to understanding flexibility in finite neural networks is to consider the posterior viewpoint: how learning shapes top-level representations. To obtain analytical insights, we considered maximum a posteriori (MAP) and sampling-based inference in a deep, fully-connected, linear network. In both cases, we found that learned neural networks shift the representation from being close to the input kernel, defined by,
$$\mathbf{G}^0 = \tfrac{1}{N^0} \mathbf{X} \mathbf{X}^\top, \tag{21}$$
to being close to the output kernel, defined by,
$$\mathbf{G}^{\text{out}} = \tfrac{1}{N^{L+1}} \mathbf{Y} \mathbf{Y}^\top. \tag{22}$$
In particular, under MAP inference, the shape of the kernel smoothly transitions from the input to the output kernel (Appendix B.2),
(23)
where the interpolation at layer $\ell$ is governed by geometric averages of the widths of the layers below and above. Thus, the kernels (and the underlying weights) at each layer can be made arbitrarily large or small by changing the width, despite the prior distribution being chosen specifically to ensure that the scale of the kernels was invariant to network width. This is an issue inherent to the use of MAP inference, which often finds modes that give a poor characterisation of the Bayesian posterior. In contrast, if we sample the weights using Langevin sampling (Appendix C), and set the widths of all intermediate layers to a common value, $N$, then we get a similar recursion,
(24)
but where the top-layer representation depends on the ratio between the network width, $N$, and the number of output units, $N^{L+1}$. In particular, if the width is comparable to the number of output units, then we get a relationship very similar to that for MAP inference,
(25)
However, as the network width grows very large, the prior begins to dominate the posterior,
(26)
Finally, if the network width is small in comparison to the number of output units,
(27)
and the top-layer kernel converges to the output kernel, $\mathbf{G}^{\text{out}}$.
1.7 Kernel flexibility: Posterior viewpoint: Empirical
The above results suggest that finite neural networks perform well by giving flexibility to interpolate between the input kernel and output kernel. To see how this happens in real neural networks, we considered a 34-layer ResNet without batch norm, corresponding to the infinite network in Garriga-Alonso et al. (2019), trained on CIFAR-10. We began by computing the correlation between elements of the finite and infinite kernel (Fig. 4A top) as we go through network layers (x-axis), and as we go through training (blue lines). As expected, the randomly initialized, untrained network retains a high correlation with the infinite kernel at all layers, though the correlation is somewhat smaller for higher layers, as there have been more opportunities for discrepancies to build up. However, for trained networks, this correspondence between the finite and infinite networks is far weaker: even at the first layer the correlation is substantially reduced, and as we go through layers, the correlation decreases to almost zero. To understand whether this decrease in correlation indicated that the kernels were being actively shaped, we computed the correlation between the kernel for the finite network and the output kernel, defined by taking the inner product of vectors representing the one-hot class labels (Fig. 4A bottom). We found that while the correlation for the untrained network decreased across layers, training gives strong positive correlations with the output kernel, and these correlations increase as we move through network layers.

While correlation is a useful, simple measure of similarity, there are other measures that take into account the special structure of kernel matrices. In particular, we considered the marginal likelihood for the one-hot outputs corresponding to the class label, which is also a measure of the similarity between the network kernel and the kernel formed by the one-hot outputs. To compute this measure, we maximized the marginal likelihood using a covariance that is a scaled sum of the kernel defined by that layer of the network and the identity (see Appendix D). As we would hope, we found that for the infinite network, the marginal likelihood increased somewhat as we moved through network layers, and the untrained finite network had similar performance, except that there was a fall in performance at the last layer. In contrast, the marginal likelihood for the finite, trained networks was initially very close to that of the infinite networks, but grew rapidly as we moved through network layers.

To gain insight into how training shaped the neural network kernels, we computed their eigenvalue spectra. For the infinite network (Fig. 4C top), we found that the eigenvalue spectrum at all levels decayed as a power law. This is expected at the lowest level due to the well-known power spectrum of images (Van der Schaaf & van Hateren, 1996), but is not necessarily the case at higher levels. Given that the spectrum of the output kernel is just a small set of equal-sized eigenvalues corresponding to the class labels (Fig. 4C bottom, green line), we might expect the eigenspectrum of finite networks to gradually get steeper as we move through network layers. In fact, we find the opposite: for intermediate layers, the eigenvalue spectrum becomes flatter, which can be interpreted as the network attempting to retain as much information as possible about all aspects of the image. It is only at the last layer that the relevant information is selected, giving an eigenvalue spectrum with around 10 large and roughly equally-sized eigenvalues, followed by much smaller eigenvalues, which mirrors the spectrum of the output kernel.
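The kernel-similarity measures used here are straightforward to implement. The sketch below uses random features as a stand-in for a real network's layer activity (we do not reproduce the ResNet): `kernel_corr` is the element-wise correlation between two kernels, the output kernel is built from one-hot labels as in the text, and the eigenvalue spectrum comes from a symmetric eigendecomposition:

```python
import numpy as np

rng = np.random.default_rng(3)

def kernel_corr(K1, K2):
    """Pearson correlation between the elements of two kernel matrices."""
    a, b = K1.ravel(), K2.ravel()
    a = a - a.mean()
    b = b - b.mean()
    return a @ b / np.sqrt((a @ a) * (b @ b))

def output_kernel(labels, n_classes):
    """Inner products of one-hot label vectors: 1 if same class, else 0."""
    Y = np.eye(n_classes)[labels]
    return Y @ Y.T

# stand-in "layer features" for a batch of 100 inputs
H = rng.normal(size=(100, 256))
K_layer = H @ H.T / H.shape[1]
labels = rng.integers(0, 10, size=100)
K_out = output_kernel(labels, 10)

print("corr(layer, output):", kernel_corr(K_layer, K_out))

# eigenvalue spectrum of the layer kernel, sorted descending
eigs = np.linalg.eigvalsh(K_layer)[::-1]
print("top 5 eigenvalues:", eigs[:5])
```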
2 Conclusions
We have shown that finite Bayesian neural networks have more flexibility than infinite networks, and that this may explain the superior performance of finite networks. We assessed flexibility from two perspectives. First, we looked at the prior viewpoint: the variability in the top-layer kernel induced by the prior over a finite neural network. Second, we looked at the posterior viewpoint: the ability of the learning process to shape the top-layer kernel. Under both MAP inference and sampling in finite networks, learning gradually shaped top-layer representations so as to match the output kernel. For MAP inference, the degree of kernel shaping is not affected by network width, so it remains even in infinite networks. In contrast, as Bayesian networks are made wider, the kernels become gradually less flexible, eliminating the possibility for learning to shape the kernel. As such, this raises the possibility that MAP inference may perform better than fully Bayesian inference, at least for very wide networks with independent Gaussian priors over the weights.
References
Arora et al. (2019) Sanjeev Arora, Simon S Du, Wei Hu, Zhiyuan Li, Ruslan Salakhutdinov, and Ruosong Wang. On exact computation with an infinitely wide neural net. arXiv preprint arXiv:1904.11955, 2019.

Bui et al. (2016) Thang Bui, Daniel Hernández-Lobato, Jose Hernández-Lobato, Yingzhen Li, and Richard Turner. Deep Gaussian processes for regression using approximate expectation propagation. In International Conference on Machine Learning, pp. 1472–1481, 2016.

Chen et al. (2018) Tian Qi Chen, Yulia Rubanova, Jesse Bettencourt, and David K Duvenaud. Neural ordinary differential equations. In Advances in Neural Information Processing Systems, pp. 6571–6583, 2018.

Cho & Saul (2009) Youngmin Cho and Lawrence K Saul. Kernel methods for deep learning. NeurIPS, 2009.

Garriga-Alonso et al. (2019) Adrià Garriga-Alonso, Carl Edward Rasmussen, and Laurence Aitchison. Deep convolutional networks as shallow Gaussian processes. ICLR, 2019.

He et al. (2016) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Identity mappings in deep residual networks. In European Conference on Computer Vision, pp. 630–645. Springer, 2016.

Jacot et al. (2018) Arthur Jacot, Franck Gabriel, and Clément Hongler. Neural tangent kernel: convergence and generalization in neural networks. In Advances in Neural Information Processing Systems, pp. 8571–8580, 2018.

Lee et al. (2018) Jaehoon Lee, Yasaman Bahri, Roman Novak, Samuel S Schoenholz, Jeffrey Pennington, and Jascha Sohl-Dickstein. Deep neural networks as Gaussian processes. ICLR, 2018.

Matthews et al. (2018) AGDG Matthews, M Rowland, J Hron, RE Turner, and Z Ghahramani. Gaussian process behaviour in wide deep neural networks. ICLR, 2018.

Novak et al. (2019) Roman Novak, Lechao Xiao, Yasaman Bahri, Jaehoon Lee, Greg Yang, Jiri Hron, Daniel A Abolafia, Jeffrey Pennington, and Jascha Sohl-Dickstein. Bayesian deep convolutional networks with many channels are Gaussian processes. ICLR, 2019.

Ramsey (1926) Frank P Ramsey. Truth and probability. In Readings in Formal Epistemology, pp. 21–45. Springer, 1926.

Rasmussen & Williams (2006) Carl Edward Rasmussen and Christopher KI Williams. Gaussian Processes for Machine Learning. MIT Press, 2006.

Van der Schaaf & van Hateren (1996) A van der Schaaf and JH van Hateren. Modelling the power spectra of natural images: statistics and information. Vision Research, 36(17):2759–2770, 1996.

Zagoruyko & Komodakis (2016) Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. arXiv preprint arXiv:1605.07146, 2016.
Appendix A Kernel flexibility: prior viewpoint
To compute the covariance of the kernel for a deep network, we consider a recursion: we start with the covariance of $\boldsymbol{\Omega}^\ell$, then compute the resulting covariance of $\mathbf{G}^\ell$, then the resulting covariance of $\hat{\mathbf{G}}^\ell$. In particular, we apply the law of total covariance for $\mathbf{G}^\ell$, and we consider linear networks, for which $\hat{\mathbf{G}}^\ell = \mathbf{G}^\ell$,
(28a)  
(28b)  
(28c) 
The first equation is different for fully connected and convolutional networks, so we give its form later.
The expression for the covariance of $\mathbf{G}^\ell$ always behaves in the same way for linear and nonlinear, fully-connected and convolutional networks, so we consider this first. In particular, we always have $\mathrm{E}\left[\mathbf{G}^\ell \mid \boldsymbol{\Omega}^\ell\right] = \boldsymbol{\Omega}^\ell$, so the first term in Eq. (28b) is
(29) 
For the second term in Eq. (28b), we substitute the definition of $\mathbf{G}^\ell$ (Eq. 12),
(30) 
The activations are jointly Gaussian, so the required expectations are,
(31a)  
(31b)  
(31c) 
Thus, we can write the covariance of the kernels, conditioned on $\boldsymbol{\Omega}^\ell$, as,
$$\mathrm{Cov}\left[G^\ell_{ij}, G^\ell_{kl} \mid \boldsymbol{\Omega}^\ell\right] = \tfrac{1}{N^\ell}\left(\Omega^\ell_{ik}\, \Omega^\ell_{jl} + \Omega^\ell_{il}\, \Omega^\ell_{jk}\right). \tag{32}$$
Substituting this into the second term in Eq. (28b) and writing the expected product in terms of the product of expectations,
(33)  
(34)  
where,  
(35) 
Thus, Eq. (28b) can be written,
(36) 
We could directly use this expression, but it is somewhat unwieldy, as the right-hand side contains three covariance terms. Instead, note that it is possible to approximate this expression: the final two covariance terms are $\mathcal{O}(1/N^2)$, in comparison to $\mathcal{O}(1/N)$ for the other terms, and as such they become negligible as $N$ grows,
(37) 
Thus, the recursive updates become,
(38a)  
(38b)  
(38c) 
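The conditional covariance of the kernel (Eq. 32) is a standard Wishart-type identity and is easy to verify by simulation. In the sketch below (sizes and sample count are our own), the channels of $\mathbf{F}$ are drawn IID with covariance $\boldsymbol{\Omega}$, and the empirical covariance of $\mathbf{G} = \tfrac{1}{N}\mathbf{F}\mathbf{F}^\top$ is compared with $\tfrac{1}{N}\left(\Omega_{ik}\Omega_{jl} + \Omega_{il}\Omega_{jk}\right)$:

```python
import numpy as np

rng = np.random.default_rng(0)

P, N, S = 3, 8, 200_000                     # datapoints, width, simulations
A = rng.normal(size=(P, P))
Omega = A @ A.T + np.eye(P)                 # an arbitrary positive-definite covariance

Lc = np.linalg.cholesky(Omega)
F = Lc @ rng.normal(size=(S, P, N))         # S draws of N IID channels ~ N(0, Omega)
G = F @ F.transpose(0, 2, 1) / N            # S kernels, each P x P

# empirical Cov[G_ij, G_kl] across the S simulated networks
emp = np.cov(G.reshape(S, -1).T).reshape(P, P, P, P)
theory = (np.einsum('ik,jl->ijkl', Omega, Omega)
          + np.einsum('il,jk->ijkl', Omega, Omega)) / N
print(np.abs(emp - theory).max())
```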
A.1 Fully connected network
For a fully connected network,
(39)  
where the weights are drawn from a correlated, zero-mean Gaussian, with covariance
(40)  
Thus, the activations have distribution,
(41)  
where the covariance is given by,
(42)  
(43)  
(44)  
substituting for the expectation (Eq. 40), and identifying the activity kernel (Eq. 12),  
(45) 
Thus,
(46) 
which can be substituted into the recursion Eq. (38a) allowing the updates to be directly computed,
(47a)  
(47b)  
(47c) 
A.2 Convolutional network
For locally connected and convolutional networks, we introduce spatial structure into the activations, and we use spatial indices $r$, $s$ and $d$. Thus, the activations for datapoint $i$ at layer $\ell$, spatial location $r$ and channel $\lambda$ are given by,
(48) 
Note that for many purposes, these higher-order tensors can be treated as vectors and matrices, if we combine indices (e.g. using a "reshape" or "view" operation). The commas in the index list are used to denote how to combine indices for this particular operation, such that it can be understood as a standard matrix/vector operation. For the above equation, the activations are given by the matrix product of the activities from the previous layer and the weights, remembering that the combined input dimension includes the number of spatial locations in the input.

For a convolutional neural network, the weights are the same if we consider the same input-to-output channels and the same spatial displacement, $d$, and are uncorrelated otherwise,
(49) 
where $\mathcal{D}$ is the set of all valid spatial displacements for the convolution, and $D = |\mathcal{D}|$ is the number of valid spatial displacements (i.e. the size of the convolutional patch). For a locally-connected network, the only additional requirement is that the output spatial locations are the same,
(50) 
Now we can compute the covariance of the activations for a convolutional network,
(51)  
(52)  
(53)  
substituting the covariance of the weights (Eq. 49), and noting that the product of activities forms the definition of the activity kernel (Eq. 12),
(54) 
For locally connected intermediate layers, the derivation here is the same as previously, except that the output locations must be the same for there to be any covariance in the weights,
(55) 
Substituting this into Eq. (38a),
(56) 
Now, we can put together the full recursive updates for convolutional networks, by pulling the sum out of the covariance above, and by taking the indices in Eq. (38) to index both a datapoint and a spatial location (i.e. $i \to (i, r)$),
(57a)  
(57b)  
(57c) 
Finally, to compute these terms, note that we can compute the corresponding recursions,
(58) 
and this expression can be computed efficiently as a 2D convolution.
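A direct (if naive) implementation makes the index bookkeeping of this recursion explicit. Below, `G_prev[i, r, j, s]` is the previous layer's activity kernel over datapoints and (1-D, for simplicity) spatial locations, and the covariance averages the kernel over displacements `d`, i.e. a diagonal 2-D "convolution" over the two spatial indices; the edge handling (averaging over in-bounds displacements only) and all names are our own illustrative choices:

```python
import numpy as np

def conv_cov(G_prev, D):
    """Omega[i, r, j, s] = mean over displacements d in D of
    G_prev[i, r + d, j, s + d], keeping only in-bounds displacements."""
    P, S = G_prev.shape[0], G_prev.shape[1]
    Omega = np.zeros_like(G_prev)
    for i in range(P):
        for j in range(P):
            for r in range(S):
                for s in range(S):
                    vals = [G_prev[i, r + d, j, s + d] for d in D
                            if 0 <= r + d < S and 0 <= s + d < S]
                    Omega[i, r, j, s] = np.mean(vals) if vals else 0.0
    return Omega

rng = np.random.default_rng(0)
G = rng.normal(size=(2, 6, 2, 6))
G = G + G.transpose(2, 3, 0, 1)     # symmetrise so G[i, r, j, s] == G[j, s, i, r]
Omega = conv_cov(G, D=(-1, 0, 1))   # 1-D "patch" of size 3
print(Omega.shape)
```

In practice one would vectorize the displacement sum (e.g. as an actual 2-D convolution), but the quadruple loop makes the correspondence with the index notation transparent.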
a.2.1 Spatially structured and unstructured networks
To understand the very different results for spatially structured and unstructured networks (Fig. 3B–D) despite their having the same infinite limit, we need to consider how Eq. (58) interacts with Eq. (55). For a locally connected (i.e. spatially unstructured) network, the covariance of activations at different locations is always zero, i.e. for whereas, for a spatially structured network, the covariance terms have the same scale as the variance terms. The terms enter into the variance of the kernel through Eq. (58). Note that there are terms in this sum, and the sum is normalized by dividing by . Thus, in spatially structured networks, there are terms all with the same scale, so is . In contrast, for spatially unstructured networks, we have only nonzero terms, so