
Why bigger is not always better: on finite and infinite neural networks

10/17/2019
by Laurence Aitchison, et al.

Recent work has shown that the outputs of convolutional neural networks become Gaussian process (GP) distributed when we take the number of channels to infinity. In principle, these infinite networks should perform very well, both because they allow for exact Bayesian inference, and because widening networks is generally thought to improve (or at least not diminish) performance. However, Bayesian infinite networks perform poorly in comparison to finite networks, and our goal here is to explain this discrepancy. We note that the high-level representation induced by an infinite network has very little flexibility; it depends only on network hyperparameters such as depth, and as such cannot learn a good high-level representation of data. In contrast, finite networks correspond to a rich prior over high-level representations, corresponding to kernel hyperparameters. We analyse this flexibility from the perspective of the prior (looking at the structured prior covariance of the top-level kernel), and from the perspective of the posterior, showing that the representation in a learned, finite deep linear network slowly transitions from the kernel induced by the inputs towards the kernel induced by the outputs, both for gradient descent, and for Langevin sampling. Finally, we explore representation learning in deep, convolutional, nonlinear networks, showing that learned representations differ dramatically from the corresponding infinite network.



1 Results

1.1 Toy example

In the introduction, we noted that infinite Bayesian networks perform worse than standard neural networks trained using stochastic gradient descent. Thus, as we make finite neural networks wider, there should be some point at which performance begins to degrade. We considered a simple, two-layer, fully-connected linear network with the full set of 4-dimensional inputs denoted $X$, hidden unit activations denoted $H$, and 10-dimensional outputs denoted $Y$,

\[ Y = H W_2 + \Xi, \qquad H = X W_1, \tag{1} \]

where $\Xi$ is IID standard Gaussian noise, $W_1$ is the input-to-hidden weight matrix and $W_2$ is the hidden-to-output weight matrix, whose columns, $\mathbf{w}_1^\lambda$ and $\mathbf{w}_2^\lambda$, are generated IID from,

\[ \mathbf{w}_1^\lambda \sim \mathcal{N}\!\left(\mathbf{0}, \tfrac{1}{N_0}\mathbf{I}\right), \qquad \mathbf{w}_2^\lambda \sim \mathcal{N}\!\left(\mathbf{0}, \tfrac{1}{N_1}\mathbf{I}\right), \tag{2} \]

and where the variance of the weights is normalised by the number of inputs to that layer: $N_0 = 4$ for the 4-dimensional input, and $N_1$ for the width of the hidden layer.

In the first example (Fig. 1 left), we generated targets for supervised learning using a second neural network with weights generated as described above and a fixed number of hidden units. We evaluated the Bayesian model evidence for networks with many different numbers of hidden units (x-axis). Bayesian reasoning would suggest that the model evidence for the true model (i.e. with a matched number of hidden units) should be higher than the model evidence for any other model, as indeed we found (Fig. 1 top left), and these patterns held true for the predictive probability, or equivalently test performance (Fig. 1 bottom left). While these results give an example where smaller networks perform better, they do not necessarily help us to understand the behaviour of neural networks on real datasets, where the true generative process for the data is not known, and is certainly not in our model class. As such, we considered two further examples where the neural network generating the targets lay outside of our model class. First, we used the same neural network to generate the targets, but multiplied the inputs by a factor of 100 (Fig. 1 middle). Second, we modified the inputs by zeroing out all but the first input unit (Fig. 1 right). In both of these experiments, there was an optimum number of hidden units, after which performance degraded as more hidden units were included.

Figure 1: A toy fully-connected, two-layer Bayesian linear network showing situations in which smaller networks perform better than larger networks. Left: training data generated from the prior. Middle: training data generated from the prior, but where we scale up the inputs by a factor of 100. Right: training data generated from the prior, but where we zero out all but the first input dimension. Top: Bayesian model evidence. Bottom: predictive log-probability, or equivalently test performance.

To understand why this might be the case, it is insightful to consider the methods we used to evaluate the model evidence and generate these results. In particular, note that conditioned on $H$, the output for any given channel, $\mathbf{y}^\lambda$, is IID and depends only on the corresponding column of the output weights, $\mathbf{w}_2^\lambda$,

\[ \mathbf{y}^\lambda = H \mathbf{w}_2^\lambda + \boldsymbol{\xi}^\lambda. \tag{3} \]

Thus, we can integrate over the output weights, $\mathbf{w}_2^\lambda$, to obtain a distribution over $\mathbf{y}^\lambda$ conditioned on $H$,

\[ P\!\left(\mathbf{y}^\lambda \,\middle|\, H\right) = \mathcal{N}\!\left(\mathbf{y}^\lambda;\; \mathbf{0},\; \tfrac{1}{N_1} H H^T + \mathbf{I}\right), \tag{4} \]

which is the classical Gaussian process representation of Bayesian linear regression (Rasmussen & Williams, 2006). Remembering that the hidden activity, $H = X W_1$, is a deterministic function of the weights, $W_1$, and inputs, $X$, we can write this distribution as,

\[ P\!\left(\mathbf{y}^\lambda \,\middle|\, X, W_1\right) = \mathcal{N}\!\left(\mathbf{y}^\lambda;\; \mathbf{0},\; \tfrac{1}{N_1} X W_1 W_1^T X^T + \mathbf{I}\right). \tag{5} \]

Thus, the first-layer weights, $W_1$, act as kernel hyperparameters in a Gaussian process: they control the covariance of the outputs, $\mathbf{y}^\lambda$. To evaluate the model evidence we need to integrate over $W_1$,

\[ P\!\left(Y \,\middle|\, X\right) = \int \mathrm{d}W_1\; P\!\left(W_1\right) \prod_{\lambda} P\!\left(\mathbf{y}^\lambda \,\middle|\, X, W_1\right), \tag{6} \]

and we estimate this integral by drawing samples from the prior, $P(W_1)$. Importantly, while $W_1$ provides flexibility in the kernel in finite networks, this flexibility gradually disappears as we consider larger networks. In particular,

\[ \lim_{N_1 \to \infty} \tfrac{1}{N_1} W_1 W_1^T = \tfrac{1}{N_0} \mathbf{I}. \tag{7} \]

Therefore, in this limit, the distribution over $\mathbf{y}^\lambda$ converges to,

\[ P\!\left(\mathbf{y}^\lambda \,\middle|\, X\right) = \mathcal{N}\!\left(\mathbf{y}^\lambda;\; \mathbf{0},\; \tfrac{1}{N_0} X X^T + \mathbf{I}\right). \tag{8} \]

This is exactly the distribution we would expect from Bayesian linear regression in a one-layer network. Thus, by taking the infinite limit, we have eliminated the additional flexibility afforded by the two-layer network, and we can see that the superior performance of smaller networks in Fig. 1 emerges because they give additional flexibility in the covariance of the outputs, which gradually disappears as network size increases.
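As a concrete illustration, the following is a minimal sketch of the Monte Carlo estimate of the model evidence in Eq. (6): draw $W_1$ from the prior, evaluate the Gaussian-process marginal likelihood with covariance $\tfrac{1}{N_1} X W_1 W_1^T X^T + \mathbf{I}$, and average over draws. This is not the code used for Fig. 1; shapes, sample counts and the unit observation noise are illustrative, and all function and variable names are invented for the example.

```python
# Minimal sketch of the Monte Carlo model-evidence estimate in Eq. (6).
import numpy as np
from scipy.stats import multivariate_normal

def log_evidence(X, Y, N1, n_samples=1000, seed=0):
    """Estimate log P(Y | X) for the two-layer linear network by sampling W1."""
    rng = np.random.default_rng(seed)
    P, N0 = X.shape                              # P datapoints, N0 input dimensions
    log_liks = []
    for _ in range(n_samples):
        W1 = rng.normal(0.0, np.sqrt(1.0 / N0), size=(N0, N1))
        H = X @ W1                               # hidden activity, deterministic given W1
        K = H @ H.T / N1 + np.eye(P)             # output covariance: kernel + unit noise
        # output channels are IID given W1, so log-likelihoods sum over columns (Eq. 5)
        ll = sum(multivariate_normal.logpdf(Y[:, c], mean=np.zeros(P), cov=K)
                 for c in range(Y.shape[1]))
        log_liks.append(ll)
    # log-mean-exp over prior samples gives the Monte Carlo estimate of Eq. (6)
    return np.logaddexp.reduce(log_liks) - np.log(n_samples)

# Example usage: targets generated from a matched two-layer network
rng = np.random.default_rng(1)
X = rng.normal(size=(50, 4))
W1_true = rng.normal(0.0, np.sqrt(1 / 4), size=(4, 8))
W2_true = rng.normal(0.0, np.sqrt(1 / 8), size=(8, 10))
Y = X @ W1_true @ W2_true + rng.normal(size=(50, 10))
print(log_evidence(X, Y, N1=8, n_samples=200))
```

Sweeping `N1` and comparing the resulting estimates is the comparison plotted along the x-axis of Fig. 1 (top).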

1.2 Kernel representations for finite networks

In the previous section, we considered the simplest networks in which these phenomena emerge: a two-layer, linear network. In this section, we set up a finite deep nonlinear network and show that activity flowing through this network can be understood entirely in terms of kernel matrices.

Consider a single layer within a fully-connected network, where the activity at the previous layer, $A^{\ell-1}$, corresponding to a batch containing all inputs, is multiplied by a weight matrix, $W^\ell$, to give activations, $F^\ell$. This activation matrix is multiplied by another matrix, $\Omega^\ell$, to give an updated activation matrix, $\tilde{F}^\ell$. Critically, this multiplication leaves the representation unchanged (Fig. 2): $\Omega^\ell$ simply helps to ensure that we can exactly compute the kernel in nonlinear finite networks (see below). Finally, the updated activations, $\tilde{F}^\ell$, are passed through a non-linearity, $\phi$, to give the activity at this layer, $A^\ell$,

\[ A^\ell = \phi\!\left(\tilde{F}^\ell\right) = \phi\!\left(F^\ell \Omega^\ell\right), \tag{9} \]

where,

\[ F^\ell = A^{\ell-1} W^\ell. \tag{10} \]

In a standard neural network, we would set $\Omega^\ell = \mathbf{I}$, such that $\tilde{F}^\ell = F^\ell$ and $A^\ell = \phi\!\left(F^\ell\right)$. For a fully-connected network, the columns of $W^\ell$, denoted $\mathbf{w}^\ell_\lambda$, are generated IID from a Gaussian distribution,

\[ \mathbf{w}^\ell_\lambda \sim \mathcal{N}\!\left(\mathbf{0}, \tfrac{1}{N_{\ell-1}}\mathbf{I}\right), \tag{11} \]

where the normalization constant, $1/N_{\ell-1}$, ensures that activations remain normalized as the width changes.

We now define the activation kernel and activity kernel,

\[ K^\ell_{\rm activation} = \tfrac{1}{N_\ell} F^\ell \left(F^\ell\right)^T, \qquad K^\ell_{\rm activity} = \tfrac{1}{N_\ell} A^\ell \left(A^\ell\right)^T, \tag{12} \]

where the kernels for $F^\ell$ and $\tilde{F}^\ell$ are equivalent, as we always choose an $\Omega^\ell$ that leaves the kernel unchanged (see below). The only remaining object is the covariance, $\Sigma^\ell$. As each channel (column) of the activations is a linear function of the corresponding channel of the weights, $\mathbf{w}^\ell_\lambda$, the activations are Gaussian and IID conditioned on the activity at the previous layer,

\[ P\!\left(\mathbf{f}^\ell_\lambda \,\middle|\, A^{\ell-1}\right) = \mathcal{N}\!\left(\mathbf{f}^\ell_\lambda;\; \mathbf{0},\; \Sigma^\ell\right), \tag{13} \]

with covariance $\Sigma^\ell$. For a fully connected network, the covariance, $\Sigma^\ell$, is equal to the previous layer's activity kernel, $K^{\ell-1}_{\rm activity}$,

\[ \Sigma^\ell = K^{\ell-1}_{\rm activity} = \tfrac{1}{N_{\ell-1}} A^{\ell-1} \left(A^{\ell-1}\right)^T, \tag{14} \]

but the relationship is more complex in convolutional architectures (Garriga-Alonso et al., 2019; Novak et al., 2019) (Appendix A.2).

In order to work entirely in the kernel domain, we need to be able to transform $K^{\ell-1}_{\rm activity}$ to $\Sigma^\ell$ to $K^\ell_{\rm activation}$, and back to $K^\ell_{\rm activity}$ (Fig. 2). The first transformation, from the activity kernel at the previous layer, $K^{\ell-1}_{\rm activity}$, to the covariance, $\Sigma^\ell$, is described above. To perform the next transformation, from the covariance, $\Sigma^\ell$, to the activation kernel, $K^\ell_{\rm activation}$, we sample activations, $F^\ell$, from a Gaussian with covariance $\Sigma^\ell$ (Eq. 13), then directly compute the activation kernel from the activations (Eq. 12). Both of these transformations can be performed in either finite or infinite networks. However, the key difficulty comes when we try to transform the activation kernel, $K^\ell_{\rm activation}$, into the activity kernel, $K^\ell_{\rm activity}$, in finite networks. For deep linear networks, which are useful for analytical if not practical purposes, this issue does not emerge, as the activity kernel is the activation kernel,

\[ K^\ell_{\rm activity} = K^\ell_{\rm activation}. \tag{15} \]

However, to compute $K^\ell_{\rm activity}$ for nonlinear networks, we need the activations, $\tilde{F}^\ell$, to be Gaussian distributed, so that we can apply results from Cho & Saul (2009). In a standard finite network, where we take $\Omega^\ell = \mathbf{I}$, the distribution cannot be Gaussian, as the finite set of activations is constrained to exactly reproduce the activation kernel. This is the reason we require $\Omega^\ell$: if we allow the elements of $\Omega^\ell$ to be Gaussian distributed,

\[ \Omega^\ell_{\lambda\mu} \sim \mathcal{N}\!\left(0, \tfrac{1}{N_\ell}\right), \tag{16} \]

and take the limit of $M \to \infty$, where $M$ is the number of columns of $\Omega^\ell$,

\[ \lim_{M \to \infty} \tfrac{1}{M} \tilde{F}^\ell \left(\tilde{F}^\ell\right)^T = \tfrac{1}{N_\ell} F^\ell \left(F^\ell\right)^T = K^\ell_{\rm activation}, \tag{17} \]

then we simultaneously have Gaussian $\tilde{F}^\ell$, and we have left the activation kernel unchanged. As this network alternates between finite and infinite layers, we call it a finite-infinite network.

Figure 2: The relationships between the feature-space and kernel representations of the neural network. For a typical finite neural network, $\Omega^\ell = \mathbf{I}$, so $\tilde{F}^\ell = F^\ell$. For a finite-infinite network (which allows us to compute $K^\ell_{\rm activity}$ from $K^\ell_{\rm activation}$), we send $M \to \infty$, and draw the elements of $\Omega^\ell$ IID from a Gaussian distribution with zero mean and variance $1/N_\ell$.
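To make the kernel-domain picture concrete, here is a small simulation sketch. It assumes fully connected layers, a ReLU nonlinearity and fan-in-normalised weights, and, rather than using the Cho & Saul closed form, it simply draws a finite number of channels at each layer and computes the empirical kernels; the sizes and names are illustrative rather than taken from the paper.

```python
# Sketch of propagating a single prior draw of the kernel through a finite,
# fully connected network, working in the kernel domain as in Fig. 2:
# activity kernel -> covariance -> sampled activations -> activity kernel.
import numpy as np

def sample_top_kernel(X, n_channels, n_layers, seed=0,
                      phi=lambda f: np.maximum(f, 0.0)):
    rng = np.random.default_rng(seed)
    P, N0 = X.shape
    K = X @ X.T / N0                              # input activity kernel
    for _ in range(n_layers):
        Sigma = K                                 # fully connected: covariance = previous kernel
        L = np.linalg.cholesky(Sigma + 1e-8 * np.eye(P))
        F = L @ rng.normal(size=(P, n_channels))  # IID Gaussian channels with covariance Sigma (Eq. 13)
        A = phi(F)                                # activity = nonlinearity(activations)
        K = A @ A.T / n_channels                  # empirical activity kernel (Eq. 12)
    return K

X = np.random.default_rng(1).normal(size=(20, 4))
K_finite = sample_top_kernel(X, n_channels=50, n_layers=3)      # varies noticeably across seeds
K_wide = sample_top_kernel(X, n_channels=50_000, n_layers=3)    # close to the infinite-width kernel
```

Re-running with different seeds shows that the narrow network's top-layer kernel varies substantially from draw to draw, while the very wide network's kernel is essentially deterministic, which is the difference in flexibility discussed above.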

1.3 Deep neural networks as deep Gaussian processes

Given the above setup, we can see that even a finite network, with finitely many channels, is a deep Gaussian process. In particular, in a deep Gaussian process, the activations at layer $\ell$, denoted $F^\ell$, consist of IID channels that are Gaussian-process distributed (Eq. 13), with a kernel/covariance determined by the activities at the previous layer. For a fully connected network,

\[ \Sigma^\ell = K^{\ell-1}_{\rm activity} = \tfrac{1}{N_{\ell-1}} \phi\!\left(\tilde{F}^{\ell-1}\right) \phi\!\left(\tilde{F}^{\ell-1}\right)^T. \tag{18} \]

The relationship between finite neural networks and deep GPs is worth noting, because the same intuition, of the lower layers shaping the top-layer kernel, arises in both scenarios (e.g. Bui et al., 2016), and because there is potential for applying GP inference methods to neural networks, and vice versa.

1.4 Kernel flexibility: prior viewpoint: analytical

We can analyse how flexibility in the kernel emerges by looking at the variability (i.e. the variance and covariance) of the covariance, $\Sigma^\ell$, and the kernels, $K^\ell_{\rm activation}$ and $K^\ell_{\rm activity}$. In the appendix, we derive recursive updates for deep, linear, convolutional networks,

(19a)
(19b)
(19c)
where,
(19d)

and where $i$, $j$, $k$ and $l$ index datapoints, whereas $r$ and $s$ index spatial locations, and where fully connected networks are a special case in which there is only one spatial location in the inputs and hidden layers.

For fully connected networks, this expression predicts that the variance of the kernel is proportional to the depth (including the last layer; $L+1$) and inversely proportional to the width, $N$,

\[ \mathrm{Var}\!\left[K^{L+1}_{ij}\right] \propto \frac{L+1}{N}. \tag{20} \]

This expression is so simple because, for a fully-connected linear network, the expected covariance at each layer is the same. For nonlinear, convolutional or locally-connected networks, the variability is still proportional to $1/N$, but the depth dependence becomes more complex, as the covariance changes as it propagates through the layers.
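A rough numerical check of this scaling (not the experiment of Fig. 3; a plain deep linear fully connected network evaluated at a single input, with illustrative sizes and sample counts) can be written as follows.

```python
# Rough check that, for a deep linear fully connected network, the prior
# variance of the kernel entry k(x, x) grows with depth and shrinks with width.
import numpy as np

def kernel_variance(x, width, depth, n_networks=2000, seed=0):
    rng = np.random.default_rng(seed)
    ks = []
    for _ in range(n_networks):
        a, fan_in = x, x.shape[0]
        for _ in range(depth):
            W = rng.normal(0.0, np.sqrt(1.0 / fan_in), size=(fan_in, width))
            a, fan_in = a @ W, width
        ks.append(a @ a / width)                  # kernel entry for this single input
    return np.var(ks)

x = np.ones(4)
for width in (32, 128):
    for depth in (2, 8):
        print(width, depth, kernel_variance(x, width, depth))
# Expect the printed variances to scale roughly as depth / width.
```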

1.5 Kernel flexibility: prior viewpoint: experimental

To check the validity of these expressions, we sampled 10,000 neural networks from the prior, and evaluated the variance of the kernel for a single input (Fig. 3). These inputs were either spatially unstructured (i.e. white noise), or spatially structured, in which case the inputs were the same across the whole image. For fully connected networks, we confirmed that the variance of the kernel is proportional to the depth including the last layer, $L+1$, and inversely proportional to the width, $N$ (Fig. 3A). For locally connected networks, we found that structured and unstructured inputs gave the same kernel variance, which is expected as any spatial structure is destroyed after the first layer (Fig. 3B). Further, for convolutional networks with structured input, the variance of the kernel was proportional to network depth (Fig. 3C bottom), but whenever that spatial structure was absent, either because it was absent in the inputs or because it was eliminated by an LCN (Fig. 3BC bottom), the variance of the kernel was almost constant with depth (see Appendix A.2.1). The large decrease in kernel flexibility for locally connected networks might be one reason behind the result in Novak et al. (2019) that locally connected networks perform very similarly to an infinite-width network, in which all flexibility has been eliminated. Finally, as the spatial input size, $S$, increases, the variance of the kernel is constant for convolutional networks with spatially structured inputs, whereas for locally connected networks or spatially unstructured inputs, the variance falls (Fig. 3D).

Figure 3: The variance of the kernel for linear, fully connected and convolutional networks, with spatially structured and unstructured inputs. The dashed lines in all plots display the theoretical approximation (Eq. 19), which is valid when the width is much greater than the number of layers. The solid lines display the empirical variance of the kernel from 10,000 simulations. A: The variance of the kernel for fully connected networks, plotted against network width, $N$, for shallow (blue) and deep (orange) networks (top), and plotted against network depth, $L$, for narrow (green) and wide (red) networks (bottom). B: The variance of the kernel for locally connected networks with spatially structured and unstructured inputs, plotted against the number of channels, $N$, and against network depth, $L$. Note that the structured line lies underneath the unstructured line. The inputs are 1-dimensional. C: As in B, but for convolutional networks. D: The variance of the kernel as a function of the input spatial size, $S$, for deep LCNs (top) and CNNs (bottom) with spatially structured and unstructured inputs.

1.6 Kernel flexibility: posterior viewpoint: analytical

An alternative approach to understanding flexibility in finite neural networks is to consider the posterior viewpoint: how learning shapes top-level representations. To obtain analytical insights, we considered maximum a-posteriori (MAP) and sampling-based inference in a deep, fully-connected, linear network. In both cases, we found that learned neural networks shift the representation from being close to the input kernel, defined by,

\[ K_{\rm in} = \tfrac{1}{N_0} X X^T, \tag{21} \]

to being close to the output kernel, defined by,

\[ K_{\rm out} = \tfrac{1}{N_{L+1}} Y Y^T. \tag{22} \]

In particular, under MAP inference, the shape of the kernel smoothly transitions from the input to the output kernel (Appendix B.2),

(23)

where the coefficients are given by geometric averages of the widths of the layers below and above the layer in question. Thus, the kernels (and the underlying weights) at each layer can be made arbitrarily large or small by changing the widths, despite the prior distribution being chosen specifically to ensure that the scale of the kernels was invariant to network width. This is an issue inherent to the use of MAP inference, which often finds modes that give a poor characterisation of the Bayesian posterior. In contrast, if we sample the weights using Langevin sampling (Appendix C), and set all the intermediate layer widths to a common value, $N$, then we get a similar recursion,

(24)

but where the top-layer representation depends on the ratio between the network width, $N$, and the number of output units, $N_{L+1}$. In particular, for intermediate widths we get a relationship very similar to that for MAP inference,

(25)

However, as the network width grows very large, the prior begins to dominate, and the posterior over the top-layer kernel converges to the prior,

(26)

as $N \to \infty$. Finally, if the network width is small in comparison to the number of output units,

(27)

and the top-layer kernel converges to the output kernel, $K_{\rm out}$.
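The following sketch shows the qualitative effect with plain full-batch gradient descent on a deep linear network, rather than the MAP or Langevin analysis above; all sizes, targets and the learning rate are illustrative and are not taken from the paper. As training proceeds, the kernel of the last hidden layer becomes increasingly correlated with the output kernel.

```python
# Sketch of the posterior-viewpoint claim: gradient descent on a deep linear
# network moves the last hidden layer's kernel towards the output kernel.
import numpy as np

def kernel_corr(K1, K2):
    """Correlation coefficient between the elements of two kernel matrices."""
    return np.corrcoef(K1.ravel(), K2.ravel())[0, 1]

rng = np.random.default_rng(0)
P, N0, N, Nout, L = 100, 4, 64, 10, 3              # datapoints, input dim, width, outputs, hidden layers
X = rng.normal(size=(P, N0))
Y = np.tanh(X @ rng.normal(size=(N0, Nout)))       # arbitrary nonlinear targets

Ws = [rng.normal(0, np.sqrt(1.0 / N0), size=(N0, N))]
Ws += [rng.normal(0, np.sqrt(1.0 / N), size=(N, N)) for _ in range(L - 1)]
Ws += [rng.normal(0, np.sqrt(1.0 / N), size=(N, Nout))]

K_in, K_out = X @ X.T / N0, Y @ Y.T / Nout
lr = 1e-2
for step in range(5001):
    acts = [X]
    for W in Ws:                                   # forward pass through the linear layers
        acts.append(acts[-1] @ W)
    grad = (acts[-1] - Y) / P                      # gradient of the mean squared error
    for i in reversed(range(len(Ws))):             # backward pass
        gW = acts[i].T @ grad
        grad = grad @ Ws[i].T
        Ws[i] -= lr * gW
    if step % 1000 == 0:
        K_top = acts[-2] @ acts[-2].T / N          # kernel of the last hidden layer
        print(step, kernel_corr(K_top, K_in), kernel_corr(K_top, K_out))
```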

1.7 Kernel flexibility: posterior viewpoint: empirical

The above results suggest that finite neural networks perform well by giving flexibility to interpolate between the input kernel and the output kernel. To see how this happens in real neural networks, we considered a 34-layer ResNet without batchnorm, corresponding to the infinite network in Garriga-Alonso et al. (2019), trained on CIFAR-10. We began by computing the correlation between elements of the finite and infinite kernels (Fig. 4A top) as we go through network layers (x-axis), and as we go through training (blue lines). As expected, the randomly initialized, untrained network retains a high correlation with the infinite kernel at all layers, though the correlation is somewhat smaller for higher layers, as there have been more opportunities for discrepancies to build up. However, for trained networks, this correspondence between the finite and infinite networks is far weaker: even at the first layer the correlation is substantially reduced, and as we go through layers, the correlation decreases to almost zero. To understand whether this decrease in correlation indicated that the kernels were being actively shaped, we computed the correlation between the kernel for the finite network and the output kernel, defined by taking the inner product of vectors representing the one-hot class labels (Fig. 4A bottom). We found that while the correlation for the untrained network decreased across layers, training gave strong positive correlations with the output kernel, and these correlations increased as we moved through network layers. While correlation is a useful, simple measure of similarity, there are other measures of similarity that take into account the special structure of kernel matrices. In particular, we considered the marginal likelihood for the one-hot outputs corresponding to the class label, which is also a measure of the similarity between a layer's kernel and the kernel formed by the one-hot outputs. To compute this measure, we maximized the marginal likelihood using a covariance that is a scaled sum of the kernel defined by that layer of the network and the identity (see Appendix D). As we would hope, we found that for the infinite network, the marginal likelihood increased somewhat as we moved through network layers, and the untrained finite network had similar performance, except that there was a fall in performance at the last layer. In contrast, the marginal likelihood for the finite, trained networks was initially very close to that of the infinite network, but grew rapidly as we moved through network layers.
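For reference, the two comparison measures used here can be sketched as follows (this is not the paper's code; it assumes a precomputed kernel matrix `K` over a set of test points and an integer array of class labels, and fits the scaling of the covariance, $aK + b\mathbf{I}$, by maximising the marginal likelihood, as in Appendix D).

```python
# Sketch of the two kernel-comparison measures: element-wise correlation
# between kernels, and the GP marginal likelihood of the one-hot class labels
# under a covariance a*K + b*I with (a, b) fit by maximum likelihood.
import numpy as np
from scipy.optimize import minimize

def kernel_correlation(K1, K2):
    return np.corrcoef(K1.ravel(), K2.ravel())[0, 1]

def output_kernel(labels, n_classes):
    Y = np.eye(n_classes)[labels]                  # one-hot class labels
    return Y @ Y.T

def gp_marginal_likelihood(K, labels, n_classes):
    Y = np.eye(n_classes)[labels]
    P = K.shape[0]

    def neg_log_ml(log_ab):
        a, b = np.exp(log_ab)                      # positive scale parameters
        C = a * K + b * np.eye(P)
        _, logdet = np.linalg.slogdet(C)
        alpha = np.linalg.solve(C, Y)
        # independent GP over each one-hot output column
        return 0.5 * (np.sum(Y * alpha)
                      + n_classes * (logdet + P * np.log(2 * np.pi)))

    res = minimize(neg_log_ml, x0=np.zeros(2), method="Nelder-Mead")
    return -res.fun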

Figure 4: Comparison of kernels for finite and infinite neural networks at different layers. All kernels are computed on test data. A (top): Correlation coefficient between the kernel defined by the infinite network and the kernel defined by a finite network after different numbers of training epochs. A (bottom): Correlation coefficient between the kernel defined by the finite network (after different numbers of training epochs) and the output kernel defined by taking the inner product of one-hot vectors representing the class label. B (top): The Gaussian process marginal likelihood for the one-hot class labels. B (bottom): The fraction of variance in the direction of the one-hot output class labels. C (top): The eigenvalues of the kernel defined by the infinite network as we progress through layers, compared to a power law (grey). C (bottom): The eigenvalues of the kernel defined by the finite network after 200 training epochs, as we progress through network layers.

To gain insight into how training shaped the neural network kernels, we computed their eigenvalue spectra. For the infinite network (Fig. 4C top), we found that the eigenvalue spectrum at all levels decayed as a power law. This is expected at the lowest level due to the well-known power spectrum of images (Van der Schaaf & van Hateren, 1996), but is not necessarily the case at higher levels. Given that the spectrum of the output kernel is just a small set of equal-sized eigenvalues corresponding to the class labels (Fig. 4C bottom, green line), we might expect the eigenspectrum of finite networks to gradually get steeper as we move through network layers. In fact, we find the opposite: for intermediate layers, the eigenvalue spectrum becomes flatter, which can be interpreted as the network attempting to retain as much information as possible about all aspects of the image. It is only at the last layer that the relevant information is selected, giving an eigenvalue spectrum with around 10 large and roughly equally-sized eigenvalues, followed by much smaller eigenvalues, which mirrors the spectrum of the output kernel.
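The eigenvalue-spectrum comparison itself is straightforward to reproduce for any precomputed kernel; the following is a short sketch in which the axis choices and the reference power law are illustrative.

```python
# Sketch of the eigenvalue-spectrum comparison: sorted eigenvalues of a kernel
# matrix on log-log axes, to be compared against the output kernel, whose
# spectrum is roughly ten equal eigenvalues (one per class) followed by zeros.
import numpy as np
import matplotlib.pyplot as plt

def plot_spectrum(K, label):
    eig = np.linalg.eigvalsh(K)[::-1]              # eigenvalues, largest first
    plt.loglog(np.arange(1, len(eig) + 1), np.maximum(eig, 1e-12), label=label)

# usage (K_layer and K_out are assumed precomputed kernels):
# plot_spectrum(K_layer, "layer kernel"); plot_spectrum(K_out, "output kernel")
# plt.xlabel("rank"); plt.ylabel("eigenvalue"); plt.legend(); plt.show()
```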

2 Conclusions

We have shown that finite Bayesian neural networks have more flexibility than infinite networks, and that this may explain the superior performance of finite networks. We assessed flexibility from two perspectives. First, we looked at the prior viewpoint: the variability in the top-layer kernel induced by the prior over a finite neural network. Second, we looked at the posterior viewpoint: the ability of the learning process to shape the top-layer kernel. Under both MAP inference and sampling in finite networks, learning gradually shaped top-layer representations so as to match the output-kernel. For MAP inference, the degree of kernel shaping is not affected by network width, so it remains even in infinite networks. In contrast, as Bayesian networks are made wider, the kernels become gradually less flexible, eliminating the possibility for learning to shape the kernel. As such, this raises the possibility that MAP inference may perform better than full Bayesian inference, at least for very wide networks with independent Gaussian priors over the weights.

References

  • Arora et al. (2019) Sanjeev Arora, Simon S Du, Wei Hu, Zhiyuan Li, Ruslan Salakhutdinov, and Ruosong Wang. On exact computation with an infinitely wide neural net. arXiv preprint arXiv:1904.11955, 2019.
  • Bui et al. (2016) Thang Bui, Daniel Hernández-Lobato, José Miguel Hernández-Lobato, Yingzhen Li, and Richard Turner. Deep Gaussian processes for regression using approximate expectation propagation. In International Conference on Machine Learning, pp. 1472–1481, 2016.
  • Chen et al. (2018) Tian Qi Chen, Yulia Rubanova, Jesse Bettencourt, and David K Duvenaud. Neural ordinary differential equations. In Advances in Neural Information Processing Systems, pp. 6571–6583, 2018.
  • Cho & Saul (2009) Youngmin Cho and Lawrence K Saul. Kernel methods for deep learning. NeurIPS, 2009.
  • Garriga-Alonso et al. (2019) Adrià Garriga-Alonso, Carl Edward Rasmussen, and Laurence Aitchison. Deep convolutional networks as shallow Gaussian processes. ICLR, 2019.
  • He et al. (2016) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Identity mappings in deep residual networks. In European Conference on Computer Vision, pp. 630–645. Springer, 2016.
  • Jacot et al. (2018) Arthur Jacot, Franck Gabriel, and Clément Hongler. Neural tangent kernel: Convergence and generalization in neural networks. In Advances in Neural Information Processing Systems, pp. 8571–8580, 2018.
  • Lee et al. (2018) Jaehoon Lee, Yasaman Bahri, Roman Novak, Samuel S Schoenholz, Jeffrey Pennington, and Jascha Sohl-Dickstein. Deep neural networks as Gaussian processes. ICLR, 2018.
  • Matthews et al. (2018) Alexander G de G Matthews, Mark Rowland, Jiri Hron, Richard E Turner, and Zoubin Ghahramani. Gaussian process behaviour in wide deep neural networks. ICLR, 2018.
  • Novak et al. (2019) Roman Novak, Lechao Xiao, Yasaman Bahri, Jaehoon Lee, Greg Yang, Jiri Hron, Daniel A Abolafia, Jeffrey Pennington, and Jascha Sohl-Dickstein. Bayesian deep convolutional networks with many channels are Gaussian processes. ICLR, 2019.
  • Ramsey (1926) Frank P Ramsey. Truth and probability. In Readings in Formal Epistemology, pp. 21–45. Springer, 1926.
  • Rasmussen & Williams (2006) Carl Edward Rasmussen and Christopher KI Williams. Gaussian Processes for Machine Learning. MIT Press, 2006.
  • Van der Schaaf & van Hateren (1996) A van der Schaaf and JH van Hateren. Modelling the power spectra of natural images: statistics and information. Vision Research, 36(17):2759–2770, 1996.
  • Zagoruyko & Komodakis (2016) Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. arXiv preprint arXiv:1605.07146, 2016.

Appendix A Kernel flexibility: prior viewpoint

To compute the covariance of the kernel for a deep network, we consider a recursion in which we start with the variability of the previous layer's activity kernel, then compute the resulting variability of the covariance, $\Sigma^\ell$, and then the resulting variability of the activation kernel, $K^\ell$. In particular, we apply the law of total covariance for $K^\ell$, and we consider linear networks for which the activity kernel equals the activation kernel,

(28a)
(28b)
(28c)

The first equation is different for fully connected and convolutional networks, so we give its form later.
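For reference, the law of total covariance applied in Eq. (28b) is the standard identity: for random variables $a$ and $b$ and a conditioning variable $c$,

\[ \mathrm{Cov}\!\left[a, b\right] = \mathrm{Cov}\!\left[\mathbb{E}\!\left[a \,\middle|\, c\right], \mathbb{E}\!\left[b \,\middle|\, c\right]\right] + \mathbb{E}\!\left[\mathrm{Cov}\!\left[a, b \,\middle|\, c\right]\right], \]

where here $a$ and $b$ are elements of the kernel and the conditioning variable is the previous layer's activity (equivalently, the covariance $\Sigma^\ell$).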

The expression for the covariance of the kernel, Eq. (28b), always behaves in the same way for linear and nonlinear, fully connected and convolutional networks, so we consider it first. In particular, we always have $\mathbb{E}\!\left[K^\ell_{ij} \,\middle|\, \Sigma^\ell\right] = \Sigma^\ell_{ij}$, so the first term in Eq. (28b) is

(29)

For the second term in Eq. (28b), we substitute the definition of the kernel (Eq. 12),

(30)

The activations are jointly Gaussian, so the required expectations are,

(31a)
(31b)
(31c)
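These expectations are instances of the standard result for fourth moments of zero-mean jointly Gaussian variables (Isserlis'/Wick's theorem): for $f_i$ with covariance $\Sigma$,

\[ \mathbb{E}\!\left[f_i f_j f_k f_l\right] = \Sigma_{ij}\Sigma_{kl} + \Sigma_{ik}\Sigma_{jl} + \Sigma_{il}\Sigma_{jk}. \]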

Thus, we can write the covariance of the kernels as,

(32)

Substituting this into the second term in Eq. (28b) and writing the expected product in terms of the product of expectations,

(33)
(34)
where,
(35)

Thus, Eq. (28b) can be written,

(36)

We could directly use this expression, but it is somewhat unwieldy, as the right-hand side contains three covariance terms. Instead, note that it is possible to approximate this expression: the final two covariance terms are $\mathcal{O}(1/N^2)$, in comparison to $\mathcal{O}(1/N)$ for the other terms, and as such they become negligible as $N$ grows,

(37)

Thus, the recursive updates become,

(38a)
(38b)
(38c)

A.1 Fully connected network

For a fully connected network,

(39)
where the weights are drawn from a correlated, zero-mean Gaussian, with covariance
(40)
Thus, the activations have the distribution,
(41)
where the covariance is given by,
(42)
(43)
(44)
substituting for the expectation (Eq. 40), and identifying the activity kernel (Eq. 12),
(45)

Thus,

(46)

which can be substituted into the recursion Eq. (38a) allowing the updates to be directly computed,

(47a)
(47b)
(47c)

A.2 Convolutional network

For locally connected and convolutional networks, we introduce spatial structure into the activations, and we use spatial indices, $r$, $s$ and $d$. Thus, the activations for datapoint $i$ at layer $\ell$, spatial location $r$ and channel $\lambda$ are given by,

(48)

Note that for many purposes, these higher-order tensors can be treated as vectors and matrices, if we combine indices (e.g. using a "reshape" or "view" operation). The commas in the index list denote how to combine indices for this particular operation, such that it can be understood as a standard matrix/vector operation. For the above equation, the activations are given by the matrix product of the activities from the previous layer and the weights, where, remember, $S$ is the number of spatial locations in the input.

For a convolutional neural network, the weights are the same if we consider the same input-to-output channels and the same spatial displacement, $d$, and are uncorrelated otherwise,

(49)

where $\mathcal{D}$ is the set of all valid spatial displacements for the convolution, and $D$ is the number of valid spatial displacements (i.e. the size of the convolutional patch). For a locally-connected network, the only additional requirement is that the output spatial locations are the same,

(50)

Now we can compute the covariance of the activations, $\Sigma^\ell$, for a convolutional network,

(51)
(52)
(53)
substituting the covariance of the weights (Eq. 49), and noting that the product of activities forms the definition of the activity kernel (Eq. 12),
(54)

For locally connected intermediate layers, the derivation here is the same as previously, except that the output locations must be the same for there to be any covariance in the weights,

(55)

Substituting this into Eq. (38a),

(56)

Now, we can put together the full recursive updates for convolutional networks, by pulling the sum out of the covariance above, and by taking the indices in Eq. (38) to index both a datapoint and a spatial location (i.e. $i \to (i, r)$),

(57a)
(57b)
(57c)

Finally, to compute these terms, note that the recursions for the remaining spatial covariance quantities can be written as,

(58)

and this expression can be computed efficiently as a 2D convolution.
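A sketch of this kind of computation for a 1-D convolutional layer is given below. It follows the standard convolutional-kernel recursion (cf. Garriga-Alonso et al., 2019; Novak et al., 2019), in which the new spatial covariance between locations $(r, s)$ averages the old kernel over matched displacements, implemented as a 2D convolution with a diagonal filter; it is not claimed to match Eq. (58) exactly, and the patch size and shapes are illustrative.

```python
# Sketch of one spatial step of a convolutional kernel recursion in 1-D:
# K'[r, s] = (1/D) * sum_{d in patch} K[r + d, s + d], implemented as a 2D
# convolution of the spatial kernel matrix with a diagonal filter.
import numpy as np
from scipy.signal import convolve2d

def conv_kernel_step(K, patch_size):
    filt = np.eye(patch_size) / patch_size          # matched displacements only
    return convolve2d(K, filt, mode="same", boundary="fill")

# usage: start from the spatial kernel of the inputs for one pair of datapoints
S = 16
x = np.random.default_rng(0).normal(size=S)          # one 1-D, single-channel input
K = np.outer(x, x)
for _ in range(4):                                    # iterate once per layer
    K = conv_kernel_step(K, patch_size=3)
```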

A.2.1 Spatially structured and unstructured networks

To understand the very different results for spatially structured and unstructured networks (Fig. 3B–D), despite their having the same infinite limit, we need to consider how Eq. (58) interacts with Eq. (55). For a locally connected (i.e. spatially unstructured) network, the covariance of activations at different spatial locations is always zero, whereas, for a spatially structured network, these covariance terms have the same scale as the variance terms. These terms enter into the variance of the kernel through Eq. (58). Note that there are $S^2$ terms in this sum, and the sum is normalized by dividing by $S^2$. Thus, in spatially structured networks, there are $S^2$ terms, all with the same scale, so the normalized sum is $\mathcal{O}(1)$. In contrast, for spatially unstructured networks, we have only $S$ nonzero terms, so the normalized sum is $\mathcal{O}(1/S)$, and the variance of the kernel falls as the spatial input size grows (Fig. 3D).