# Depth induces scale-averaging in overparameterized linear Bayesian neural networks

Inference in deep Bayesian neural networks is only fully understood in the infinite-width limit, where the posterior flexibility afforded by increased depth washes out and the posterior predictive collapses to a shallow Gaussian process. Here, we interpret finite deep linear Bayesian neural networks as data-dependent scale mixtures of Gaussian process predictors across output channels. We leverage this observation to study representation learning in these networks, allowing us to connect limiting results obtained in previous studies within a unified framework. In total, these results advance our analytical understanding of how depth affects inference in a simple class of Bayesian neural networks.

## Authors

• 6 publications
• 31 publications
01/03/2020

### Wide Neural Networks with Bottlenecks are Deep Gaussian Processes

There is recently much work on the "wide limit" of neural networks, wher...
06/01/2021

### Asymptotics of representation learning in finite Bayesian neural networks

Recent works have suggested that finite Bayesian neural networks may out...
04/23/2021

### Exact priors of finite neural networks

Bayesian neural networks are theoretically well-understood only in the i...
08/19/2020

### Improving predictions of Bayesian neural networks via local linearization

In this paper we argue that in Bayesian deep learning, the frequently ut...
10/17/2019

### Why bigger is not always better: on finite and infinite neural networks

Recent work has shown that the outputs of convolutional neural networks ...
10/14/2020

### Exploring the Uncertainty Properties of Neural Networks' Implicit Priors in the Infinite-Width Limit

Modern deep learning models have achieved great success in predictive ac...
07/26/2021

### Are Bayesian neural networks intrinsically good at out-of-distribution detection?

The need to avoid confident predictions on unfamiliar data has sparked i...
##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## I Introduction

Understanding the effect of depth and width on inference is among the central goals of the modern theory of neural networks. Recent theoretical advances have elucidated the behavior of networks in the infinite-width limit, in which the complexity introduced by depth washes out and inference is described by Gaussian process regression [24, 32, 18, 23, 34, 15, 17]. However, inference at finite widths, where hidden layers retain the flexibility to learn task-relevant representations, remains incompletely understood [2, 37, 36, 17, 26]. In the setting of gradient-based maximum likelihood optimization, some insights have been gained through the study of finite overparameterized deep linear neural networks [11, 28, 35, 4]. In the fully Bayesian setting, the behavior of this simple class of models has been characterized in several limiting cases, including asymptotically at large but finite width [24, 2, 36, 19, 26]. However, a unifying perspective on these results is lacking, and our understanding of inference in deep linear Bayesian neural networks (henceforth BNNs) at finite width remains incomplete.

Here, we make the following contributions toward a more comprehensive understanding of BNN inference:

1. We express the moment generating function of the posterior predictive of a finite, overparameterized deep

BNN as a data-dependent continuous scale mixture of Gaussian process (GP) generating functions. This scale average induces coupling across output channels, and compliments previous interpretations of deep BNNs in terms of mixing over an adaptive kernel distribution [25, 1]. This observation is mathematically straightforward, but yields some useful insights into inference in finite BNNs. We extend this argument to compute the posterior mean feature kernel of the network’s first layer, allowing us to study the representations learned by finite BNNs.

2. We study the asymptotic behavior of these scale mixtures in several limits, allowing us to connect our results to previous work on the asymptotics of BNNs [36, 2, 26, 19]. We identify several interesting areas for future investigation, and point to challenges for precise characterization of how BNNs behave in certain asymptotic regimes.

## Ii Setup

We begin by defining our setup and our notation, which is mostly standard [14, 30, 21, 31]. Depending on context, will denote the

norm on vectors or the Frobenius norm on matrices. We will use the shorthand that integrals without specified domains are taken over all real matrices of the implied dimension. We use the standard Loewner order on real symmetric matrices, such that

(respectively ) means that the matrix is positive semi-definite, or PSD (respectively positive-definite, or PD). For a matrix , we let be its row-major vectorization. Then, denoting the Kronecker product by , we have for conformable matrices , , and . For brevity, we define the shorthand .

For a set of compatibly-sized matrices , , …, , we define a depth- BNN as the linear map

 f:Rn0→Rndx↦Wd⋯W1x. (1)

We will assume that the “hidden layer widths” are all greater than or equal to the output dimension , such that the rank of the end-to-end weight matrix is not constrained by an intermediate bottleneck. We make the standard choice of isotropic Gaussian priors over the weight matrices:

 [Wℓ]ij∼i.i.d.N(0,n−1ℓ−1), (2)

with variances chosen such that the prior variances of the activations at any layer do not diverge with increasing width

[20, 24, 32, 18, 23, 34, 15]. One could allow general layer-dependent variances , but for BNNs the additional factors can always be absorbed into the definition of the input so long as they are finite and non-zero. Thus, for the sake of notational clarity, we make the simplest choice of prior variances.

For a training dataset of examples, we choose an isotropic Gaussian likelihood

 p(D|W1,…,Wd)∝exp(−β2p∑μ=1∥f(xμ)−yμ∥2); (3)

we will refer to the inverse variance as the inverse temperature in analogy with statistical mechanics. The Bayes posterior over the weight matrices is then given up to normalization as .

We collect the training inputs and targets into data matrices and with elements and , respectively. We will sometimes find it useful to consider a differentiated test dataset with corresponding data matrices and . For these data, we define the associated normalized Gram matrices , , , , and . Our assumptions on the data will be given purely in terms of conditions on these Gram matrices. In particular, we will assume that the training input Gram matrix is invertible; other conditions will be introduced as needed. We note that this invertibility condition, combined with our assumption that the hidden layer widths are wide enough such that the end-to-end weight matrix is not rank-constrained, means that the

BNNs we consider can linearly interpolate their training data, and are thus overparameterized.

## Iii Scale-averaging in deep ℓBNNs

### Iii-a The function-space prior as a scale mixture

We begin with the nearly trivial observation that, for some input data matrix , the induced prior over network outputs can be expressed as a continuous scale mixture of matrix Gaussians. This expression will prove useful in our subsequent study of the posterior predictive by allowing us to compute integrals over network outputs rather than over network weights. This simplification is allowed thanks to the fact that the likelihood (3) models the targets as being independent of the parameters given the network outputs.

Recall from §II that the prior distribution of the first layer’s weight matrix is a matrix Gaussian:

 W⊤1∼MNn1×n0(0,n−10In0,In1). (4)

Then, for fixed, the distribution of induced by the prior over

can be read off using the properties of the matrix Gaussian under linear transformations

[30]:

 F=XW⊤1(W⊤2⋯W⊤d)∼MNp×nd(0,Gxx,L), (5)

where we have recognized the normalized Gram matrix and defined the matrix

 L≡Wd⋯W2W⊤2⋯W⊤d. (6)

For this to make sense, both and must be of full rank. As stated in §II, we assume the dataset to be such that is invertible. Moreover, denoting the prior distribution over induced by the priors over by , the stated assumption that , implies that is invertible -almost surely [14, 30].

Using the law of total expectation, we conclude that the prior over outputs for any fixed set of inputs with invertible Gram matrix is a continuous scale mixture of matrix Gaussians, with the prior density explicitly given as

 pd(F|X)=EL∼ϖ[(2π)−ndp/2det(L⊗Gxx)−1/2×etr(−12L−1F⊤G−1xxF)]. (7)

For a depth-two network, is a Wishart distribution

, which simplifies to a scalar Gamma-distributed random variable

when [30]. These results allow one to easily write down the density of with respect to Lebesgue measure in the two-layer case. We note that the density of is expressible for deeper networks in terms of the Meijer -function [9]; we will not further pursue this line of analysis in the present work. One could also integrate out weight matrices beyond , but this would yield more complicated formulas for the prior density, which do not permit easy analysis of the posterior predictive [36, 37]. In particular, it is unclear how one might obtain an exact expression for the joint function-space prior density for all examples [37].

### Iii-B The cumulant generating function of the posterior predictive

We now exploit the observations of the previous section to study the posterior predictives of finite

BNNs. To do so, we will consider the moment generating function

 Z(β,J)=EW1,⋯,Wd|X,Yetr(J⊤Wd⋯W1^X⊤) (8)

of the posterior predictive for some test data . To leverage the mixture-of-Gaussians interpretation of the prior, we express the generating function as an integral over function outputs and , yielding

 Z(β,J)∝∫dFd^Fexp(tr(^F⊤J)−β2∥F−Y∥2)×pd(F,^F|X,^X) (9)

in terms of the joint prior , where the implied constant of proportionality ensures that . Here, the joint prior is given by substituting the combined dataset into (7) under the temporary assumption that the Gram matrix of the combined dataset is invertible.

We now exchange integration over and with expectation over the scale matrix , which allows us to evaluate the Gaussian integrals over and exactly. This calculation is easily performed using row-major vectorization [21]; we defer a detailed sketch to the Appendix and merely summarize the result here. We define the symmetric matrix

 ΓL≡Ipnd+βGxx⊗L, (10)

the symmetric matrix

 ΣL≡G^x^x⊗L−β(G⊤x^x⊗L)Γ−1L(Gx^x⊗L), (11)

and the -dimensional vector

 μL≡β(G⊤x^x⊗L)Γ−1Lv(Y). (12)

We let

be a probability measure over

positive semidefinite matrices, defined by its density

 dρdϖ∝det(ΓL)−1/2exp(−12βv(Y)⊤Γ−1Lv(Y)) (13)

with respect to ; the implied constant of proportionality ensures that . Then,

 Z(β,J)=EL∼ρexp(μ⊤Lv(J)+12v(J)⊤ΣLv(J)). (14)

From this moment generating function, we can immediately read off that the mean and covariance of the posterior predictive are

 ⟨^F^μj⟩=EL∼ρμ⊤Lv(χ^μj) (15)

and

 cov(^F^μj,^F^νk)=EL∼ρv(χ^μj)⊤ΣLv(χ^νk)+covL∼ρ(μ⊤Lv(χ^μj),μ⊤Lv(χ^νk)), (16)

respectively, where we define the matrix . We remark that all of these results extend to the training set predictor with the replacement .

We recognize (14) as a scale-average of the GP generating function of a single-layer BNN, for which [24, 32, 18, 23, 34, 15, 31]. Similarly, the mean predictor is a scale-average of GP mean predictors, while the predictor covariance includes an additional term beyond the average of the GP covariance, as per the law of total covariance [30]. We emphasize that the scale distribution is data-dependent: depth allows the

BNN to adaptively couple its output channels in a way that a single-layer network cannot. We finally remark that, unlike in studies of gradient-based maximum likelihood estimation in deep linear networks

[11, 28, 35, 4], no exceptional assumptions on the weight distribution or data are required to obtain this intuitive picture.

The above results are rendered somewhat complicated by the need to average over PSD matrices. If , the situation simplifies substantially, as the scale variable is now a scalar , and the Kronecker products can be eliminated. Concretely, for , we define

 Γλ ≡Ip+βλGxx∈Rp×p, (17) Σλ ≡λG^x^x−βλ2G⊤x^xΓ−1λGx^x∈R^p×^p,and (18) μλ ≡λβG⊤x^xΓ−1λy∈R^p, (19)

and let be a probability measure on , defined by its density

 dρdϖ∝det(Γλ)−1/2exp(−12βy⊤Γ−1λy) (20)

with respect to ; the implied constant of proportionality ensures that . Then, the cumulant generating function of the posterior predictive of a deep BNN with scalar output can be expressed as

 Z(β,j)=Eλ∼ρexp(μ⊤λj+12j⊤Σλj). (21)

From this, we obtain correspondingly simplified expressions for the predictor mean and covariance, which reduce to

 ⟨^f⟩=Eλ∼ρμλ (22)

and

 cov(^f)=Eλ∼ρΣλ+covλ∼ρ(μλ), (23)

respectively. Again, these results represent scale-averages of shallow GP predictors, but they are of a simpler form thanks to the lack of mixing between outputs. Even in this simplified setting, and even if one makes a further restriction to the case in which there is only a single training example, the averages defy exact analysis for general values of the hyperparameters due to the terms of the form

in the exponent [9].

### Iii-C The zero-temperature limit

Though analysis of the scale distribution is challenging for general values of the likelihood variance, the situation simplifies somewhat in the zero-temperature limit of vanishing likelihood variance. In this limit, the likelihood tends to a collection of Dirac masses that enforce the constraint that the BNN interpolates its training set. For this interpretation to be sensible at the level of the posterior predictive, the training dataset must be linearly interpolatable, i.e., there must exist some matrix such that . We will focus on this case, and operate under the assumption that the training dataset Gram matrix is invertible. Then, we expect all expectations over to be sufficiently regular such that we can interchange the limit in with the integrals, which should allow us to compute them using the pointwise limit of the density . We note that, though this limit is convenient for theoretical analysis, it is somewhat unnatural from a Bayesian perspective, as it models the targets as a deterministic function of the outputs [20, 33].

Under these regularity assumptions, we expect to have the almost-sure low-temperature limit , which yields the almost-sure limiting behavior

 μL →(G⊤x^xG−1xx⊗In2)v(Y), (24) ΣL →(G^x^x−G⊤x^xG−1xxGx^x)⊗L. (25)

Then, the limiting mean predictor simplifies to

 limβ→∞⟨^F⟩=G⊤x^xG−1xxY. (26)

This precisely corresponds to the least-norm pseudoinverse solution to the system , which is intuitively sensible. Moreover, we have the limiting covariance

 limβ→∞cov(^F^μj,^F^νk)=(G^x^x−G⊤x^xG−1xxGx^x)^μ^ν×limβ→∞EL∼ρLjk, (27)

which is precisely the GP posterior sample-sample covariance, multiplied by a coupling between output channels.

This argument also yields an approximate density

 dρdϖ∝1det(L)p/2etr(−12Y⊤G−1xxYL−1). (28)

In the two-layer case, where follows a Wishart distribution, the limiting density of with respect to Lebesgue measure on PSD matrices should then be given as

 dρdL∝det(L)(n1−p)/2−(n2+1)/2×etr(−12(n1L+Y⊤G−1xxYL−1)). (29)

This implies that follows a matrix generalized inverse Gaussian (MGIG) distribution at low temperatures [8, 10]:

 L∼MGIGn2(Y⊤G−1xxY,n1In2,n1−p2). (30)

This observation yields several insights. First, it implies that the moment generating functions of and are given in terms of Bessel functions of matrix argument of the second kind [8, 10, 7, 13, 9]. Second, neither the mean nor the reciprocal mean of the MGIG are known in closed form for general values of the parameters [8, 10]. We will therefore resort to studying the behavior of these expectations in various asymptotic limits in §V. However, reasonably efficient algorithms for sampling from the MGIG are available; the situation is of course particularly simple when [10]. Therefore, this formulation could allow faster numerical studies of two-layer BNNs at low temperatures than is possible through naïve sampling of the weights, as the dimensionality of the search space is reduced from to .

## Iv Average first-layer feature kernels in ℓBNNs

We now use the methods of §III to study the average feature kernels of deep BNNs. For technical convenience, we restrict our attention to the kernel of the first hidden layer evaluated on the training set:

 K≡1n1XW1W⊤1X⊤. (31)

Then, we can proceed as before to integrate out of the posterior moment generating function of :

 Z(β,J)=EW1,…,Wd|X,Yetr(1n1JXW⊤1W1X⊤). (32)

As discussed in the Appendix, the required computation is straightforward as all integrals are Gaussian. Whereas we considered the full generating function of the posterior predictive, we focus only on the posterior mean of the kernel. Defining the scale-dependent matrix

 [ΔL]μν≡β2v(Y)⊤Γ−1L(GxxχμνGxx⊗L)Γ−1Lv(Y)−βtr[Γ−1L(GxxχμνGxx⊗L)] (33)

for is the matrix , the posterior-averaged feature kernel can be expressed as

 ⟨K⟩=Gxx+1n1EL∼ρΔL. (34)

As the shallow GP result is simply , this yields a natural interpretation of the mean kernel of a finite-width deep BNN as the GP kernel plus some correction. Though the complexity of the matrix renders this result somewhat less than fully transparent, the situation again simplifies for the case of scalar output, for which we have

 Δλ=λβ2GxxΓ−1λyy⊤Γ−1λGxx−λβGxxΓ−1λGxx. (35)

We observe that the first term in this result is the outer product of the non-scale-averaged mean training set predictor with itself. Strikingly, the matrix —when evaluated at —is precisely the matrix that appears as the asymptotic correction to the average kernel computed in our previous work [36].

Following the discussion of §III-C, we have the almost-sure pointwise low-temperature limit

 limβ→∞ΔL=YL−1Y⊤−ndGxx. (36)

Thus, to compute the average kernel at low temperatures, we must compute the limiting reciprocal mean of . As noted in §III-C, this is not known in closed form.

## V Asymptotic behavior of ℓBNNs

We now consider the asymptotic behavior of BNNs in various limits, allowing us to connect our results to those of previous works. So as to make contact with as many previous works as possible [24, 32, 18, 23, 34, 15, 36, 26, 19, 2], we will largely focus on the behavior of the average kernel . In all cases, we will assume that the hidden layer widths are of a comparable scale , such that the ratios remain fixed as is taken to be large. As in the rest of the paper, we will assume that is invertible, which requires that . Thus, in limits in which is taken to be large, we implicitly also take to be large. We will only consider limits in which the depth is held fixed and finite, or, at least, tends to infinity far more slowly than the hidden layer width, such that is perturbatively small [36, 26, 12]. For the sake of analytical tractability, will often restrict our attention to two-layer networks () in the zero-temperature limit (). For notational brevity, we define the ratios and , which, under our assumptions, are bounded as .

### V-a n→∞, p and nd fixed

We first consider the regime in which the hidden layer widths tend to infinity with fixed input dimension, output dimension, training dataset size, and depth. This is the most commonly considered asymptotic regime for BNNs [24, 32, 18, 23, 34, 15, 36, 26]. In this limit, a simple saddle-point argument shows that the data-dependence in can be neglected, and that the expectations over should be dominated by the mode of , which is [29]. Applying this result to evaluate the expectations in the posterior predictive (14), we recover the expected correspondence between infinitely-wide BNNs and Gaussian processes [24, 32, 18, 23, 34, 15, 36, 26].

Moreover, we can use this simple argument to recover the leading asymptotic correction to the average hidden layer kernel computed in our previous work [36]. As the expectation in (34) carries an overall factor of , the leading correction is simply given by evaluating at the saddle-point value of ; corrections to the saddle point at large but finite widths will lead subleading corrections to the kernel [29]. After some algebraic simplification, this yields

 ⟨K⟩=Gxx+γGxxΓ−1∞(Gyy−Γ∞)Γ−1∞Gxx+O(γ−2), (37)

where . This matches the result of [36], and is consistent with the observation in §IV that the finite-width kernel in the scalar output setting is simply the average of the asymptotic correction over scales. More generally, one could treat as a small perturbation of the identity, and use perturbative methods similar to those of our previous work to recover the results on corrections to predictor statistics given there [36].

### V-B p→∞, n and nd fixed

We next consider the regime in which the dataset size is taken to be large relative to the hidden layer width and output dimension. This regime is of interest because one expects posterior concentration to occur in the large-dataset regime [20, 33]. Focusing on the zero-temperature limit, we make the simple approximation of neglecting all terms in the density (29) that do not scale with , leaving . We then evaluate the integral over by a saddle-point approximation, yielding . This yields an average kernel of

 ⟨K⟩≈(1−γ)Gxx+1n1Y(1pY⊤G−1xxY)−1Y⊤ (38)

under the reasonable assumption that is of full rank in this regime. Notably, the correction to the GP kernel need not be vanishingly small.

### V-C n,p→∞ and nd fixed

We now consider the limit in which the hidden layer width and training dataset size tend to infinity for fixed depth and output dimension, as previously studied by Li & Sompolinsky [19]. We focus—as those authors did—on the zero-temperature limit, and restrict our attention to the two-layer case for the sake of analytical tractability. Then, exploiting the results of §III-C, we expect the expectations over to be dominated by the mode of the MGIG. Concretely, we neglect terms of order , while keeping the term as we expect it to be of order . Then, the mode of is determined by a continuous algebraic Ricatti equation (CARE) [7, 10]:

 In2−L−1(Y⊤G−1xxYn1)L−1−(1−α)L−1=0. (39)

Using the fact that the solutions of this equation commute with the matrix [7], this is identical to the defining equation for the “renormalization matrix” of [19] in the depth-two case. In particular, using the result of §III-C and §IV, this immediately implies that we recover their results for the predictor statistics and zero-temperature kernel.

After some simplification, the solution to this CARE yields

 ⟨K⟩≈12Gxx[(1+α)Ip+((1−α)2Ip+4γG−1xxGyy)1/2], (40)

but there may be corrections to the saddle point at non-vanishing (in particular, if grows faster than roughly [29]). To leading order in , we have

 (41)

which is easily seen to agree with the result in the finite- regime upon expanding when .

### V-D n,nd→∞ and p fixed

Our analysis in the preceding sections was facilitated by the fact that the dimensionality of the scale integral remained finite. However, regimes in which the number of outputs tends to infinity with the hidden layer width can also be of interest. In particular, this limit is relevant to the study of autoencoding in high dimensions, and potentially also to classification tasks with many groups (e.g., ImageNet

[27]

). Though the same techniques that permit easy asymptotic analysis of other limits cannot be directly applied

[29], the problem of computing kernel statistics can be reformulated as an integral over the kernel matrices themselves, as noted by Aitchison [2]. Then, provided that is held fixed, the kernel can be computed using a saddle-point approximation. As in the case above, it is easiest to make analytical progress in two-layer networks at zero temperature. There, one finds that the limiting kernel is determined by the solution to the CARE [2, 3]

 G−1xx−γK−1GyyK−1+(γ−1)K−1=0. (42)

The solution to this CARE yields

 ⟨K⟩≈12Gxx[(1−γ)Ip+((1−γ)2Ip+4γG−1xxGyy)1/2]. (43)

In particular, when , we recover Aichison [2]’s result that . More generally, we observe that this result is suggestively similar to the kernel in the case of large and finite . In particular, this result can be recovered by making what is in principle an unjustified naïve Laplace approximation to the integral over as in the preceding section while keeping terms of order and ignoring possible corrections to the saddle point from the high-dimensional measure. Further exploration of this will be an interesting subject for future investigation.

### V-E n,nd,p→∞

Finally, one might consider the regime in which the hidden layer width, output dimension, and dataset size tend jointly to infinity. This regime is more challenging to study than those discussed previously, as there is not a clear way to reduce the problem to a finite-dimensional integral. The natural setup for this joint asymptotic limit is a random design teacher-student setting, in which the input examples are independent and identically distributed samples from some distribution and the targets are generated by a linear model with random coefficient matrix. Then, the BNN problem is closely related to the random-design linear-rank matrix inference task, which is known to be challenging to analyze [6, 5, 22]. We direct the interested reader to recent works by Barbier and Macris [5] and by Maillard et al. [22] on this problem, and defer more detailed analysis to future work.

## Vi Discussion and conclusions

In this short paper, we have studied some aspects of inference in finite overparameterized BNNs. We presented a simple argument that leads to a clear conceptual picture of the effect of depth, and exploited those methods to connect the results of previous studies. However, we note that our approach is specialized to linear networks, and would not extend easily to nonlinear BNNs. Taken together, our results provide some insight into finite-width effects in a model where depth does not affect the hypothesis class, but does affect inference.

The output-mixing scale average interpretation studied in this work compliments previous interpretations of deep BNNs as mixtures of GPs across a data-adaptive distribution of the kernel that measures similarities between input examples. This interpretation has been pursued in a series of recent works by Aitchison and colleagues [1, 2, 3], starting with the abovementioned work on kernel statistics in deep BNNs [2]. Those authors have also studied a class of models that generalizes this interpretation of deep BNNs by explicitly fixing prior distributions over data-adaptive kernels, resulting in model predictions [2, 3]. This adaptive-kernel description has also recently been considered by Pleiss and Cunningham [25], who showed that the mean predictor of a two-layer deep GP can be interpreted as a data-dependent mixture of function bases. Their result covers a much broader model class than just BNNs—the class of all possible BNNs, linear or nonlinear, is a degenerate subclass of the set of deep GPs—but does not capture higher moments of the posterior predictive.

For a deep BNNs, the adaptive-kernel interpretation arises naturally if one integrates the readout weight matrix out of the prior, rather than integrating out the first layer weight matrix as we did here [1, 2]. Other than in the limit of large output dimension, integrating out rather than affords some advantages—some merely aesthetic, others technical—if one has the specific objective of analytically characterizing inference in BNNs. Both approaches allow for the study of the limit of large width and fixed dataset size, but the need to average over the dataset-size-dimensional kernel matrix makes the large-dataset limit harder to study in the adaptive-kernel interpretation. If one studies the posterior predictive generating function (8) using the adaptive kernel interpretation, one must contend with the need to average over the kernel matrix for the combined train-test set. The blocks of this combined kernel matrix do not appear on equal footing in the generating function because the likelihood only involves the training set; this results in conceptually more complex expressions. Finally, the approach taken here has the advantage of simplifying dramatically for single-output networks; such a simplification is not as obvious in the adaptive-kernel interpretation.

In concurrent work, Lee et al. [16] have proposed to manually introduce scale mixing to wide BNNs by fixing priors over the prior variances of the last layer’s weights. With this setup, taking the limit of infinite hidden layer width results in a scale mixture of GP predictors. Here, we observe that such scale-averaging arises naturally as an effect of depth in finite-width BNNs, hence their setup could be interpreted as manually compensating for the effective loss of depth in an infinite BNN. Based on numerical experiments, they claim that this method can in some cases improve generalization performance relative to that of the fixed-scale GP predictor corresponding to the infinite-width limit of a BNN with fixed prior weight variances. However, importantly, their setup does not consider coupling across multiple output channels. Comprehensive investigation of when data-adaptive scale mixing yields better generalization performance than a fixed-scale GP will be an interesting subject for future investigation.

To conclude, the results of this work illustrate several important conceptual points. Notably, the behavior of networks with many outputs is qualitatively distinct from those with scalar outputs, as there are interactions between output channels which are apparent neither in the scalar output case nor in the limit of infinite width and fixed output dimension. These interactions render both finite-size and asymptotic analyses more challenging. This issue is not merely one of abstract theoretical interest. Rather, it is potentially relevant to attempts to explain empirical results in deep learning. As modern image recognition tasks often include thousands of classes, the ratio of depth to output dimension of realistic networks may non-negligible

[27]. Thus, we believe that careful analysis of representation learning in joint limits of infinite hidden layer width, output dimension, depth, and dataset size will be an important subject for future work.

## Acknowledgments

We thank A. Atanasov and B. Bordelon for useful conversations and helpful comments on our manuscript.

## References

• [1] L. Aitchison, A. Yang, and S. W. Ober (2021-18–24 Jul) Deep kernel processes. In

Proceedings of the 38th International Conference on Machine Learning

, M. Meila and T. Zhang (Eds.),
Proceedings of Machine Learning Research, Vol. 139, pp. 130–140. Cited by: item 1, §VI, §VI.
• [2] L. Aitchison (2020-07) Why bigger is not always better: on finite and infinite neural networks. In Proceedings of the 37th International Conference on Machine Learning, H. Daumé III and A. Singh (Eds.), Proceedings of Machine Learning Research, Vol. 119, pp. 156–164. Cited by: item 2, §I, §V-D, §V, §VI, §VI.
• [3] L. Aitchison (2021) Deep kernel machines and fast solvers for deep kernel machines. arXiv preprint arXiv:2108.13097. Cited by: §V-D, §VI.
• [4] A. Atanasov, B. Bordelon, and C. Pehlevan (2021) Neural networks as kernel learners: the silent alignment effect. arXiv preprint arXiv:2111.00034. Cited by: §I, §III-B.
• [5] J. Barbier and N. Macris (2021)

Statistical limits of dictionary learning: random matrix theory and the spectral replica method

.
arXiv preprint arXiv:2109.06610. Cited by: §V-E.
• [6] J. Bun, R. Allez, J. Bouchaud, and M. Potters (2016) Rotational invariant estimator for general noisy matrices. IEEE Transactions on Information Theory 62 (12), pp. 7475–7490. Cited by: §V-E.
• [7] R. W. Butler and A. T. Wood (2003) Laplace approximation for Bessel functions of matrix argument. Journal of Computational and Applied Mathematics 155 (2), pp. 359–382. Cited by: §III-C, §V-C.
• [8] R. W. Butler (1998)

Generalized inverse Gaussian distributions and their Wishart connections

.
Scandinavian Journal of Statistics 25 (1), pp. 69–75. Cited by: §III-C.
• [9] (2021) NIST Digital Library of Mathematical Functions. Note: http://dlmf.nist.gov/, Release 1.1.1 of 2021-03-15F. W. J. Olver, A. B. Olde Daalhuis, D. W. Lozier, B. I. Schneider, R. F. Boisvert, C. W. Clark, B. R. Miller, B. V. Saunders, H. S. Cohl, and M. A. McClain, eds. Cited by: §III-A, §III-B, §III-C.
• [10] F. Fazayeli and A. Banerjee (2016) The matrix generalized inverse Gaussian distribution: properties and applications. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp. 648–664. Cited by: §III-C, §V-C.
• [11] K. Fukumizu (1998) Effect of batch learning in multilayer neural networks. In Proceedings of the 5th International Conference on Neural Information Processing, pp. 67–70. Cited by: §I, §III-B.
• [12] B. Hanin (2021) Random neural networks in the infinite width limit as Gaussian processes. arXiv preprint arXiv:2107.01562. Cited by: §V.
• [13] C. S. Herz (1955) Bessel functions of matrix argument. Annals of Mathematics, pp. 474–523. Cited by: §III-C.
• [14] R. A. Horn and C. R. Johnson (2012) Matrix analysis. Cambridge University Press. Cited by: Appendix A, Appendix A, §II, §III-A.
• [15] J. Hron, Y. Bahri, R. Novak, J. Pennington, and J. Sohl-Dickstein (2020) Exact posterior distributions of wide Bayesian neural networks. arXiv preprint arXiv:2006.10541. Cited by: §I, §II, §III-B, §V-A, §V.
• [16] H. Lee, E. Yun, H. Yang, and J. Lee (2021) Scale mixtures of neural network Gaussian processes. arXiv preprint arXiv:2107.01408. Cited by: §VI.
• [17] J. Lee, S. Schoenholz, J. Pennington, B. Adlam, L. Xiao, R. Novak, and J. Sohl-Dickstein (2020) Finite versus infinite neural networks: an empirical study. In Advances in Neural Information Processing Systems, H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin (Eds.), Vol. 33, pp. 15156–15172. Cited by: §I.
• [18] J. Lee, J. Sohl-Dickstein, J. Pennington, R. Novak, S. Schoenholz, and Y. Bahri (2018) Deep neural networks as Gaussian processes. In International Conference on Learning Representations, Cited by: §I, §II, §III-B, §V-A, §V.
• [19] Q. Li and H. Sompolinsky (2021-09)

Statistical mechanics of deep linear neural networks: the backpropagating kernel renormalization

.
Phys. Rev. X 11, pp. 031059. External Links: Document Cited by: item 2, §I, §V-C, §V.
• [20] D. J. MacKay (1992) A practical Bayesian framework for backpropagation networks. Neural Computation 4 (3), pp. 448–472. Cited by: §II, §III-C, §V-B.
• [21] J. R. Magnus and H. Neudecker (2019) Matrix differential calculus with applications in statistics and econometrics. John Wiley & Sons. Cited by: §II, §III-B.
• [22] A. Maillard, F. Krzakala, M. Mézard, and L. Zdeborová (2021) Perturbative construction of mean-field equations in extensive-rank matrix factorization and denoising. arXiv preprint arXiv:2110.08775. Cited by: §V-E.
• [23] A. G. d. G. Matthews, J. Hron, M. Rowland, R. E. Turner, and Z. Ghahramani (2018) Gaussian process behaviour in wide deep neural networks. In International Conference on Learning Representations, Cited by: §I, §II, §III-B, §V-A, §V.
• [24] R. M. Neal (1996) Priors for infinite networks. In Bayesian Learning for Neural Networks, pp. 29–53. Cited by: §I, §II, §III-B, §V-A, §V.
• [25] G. Pleiss and J. P. Cunningham (2021) The limitations of large width in neural networks: a deep Gaussian process perspective. In Advances in Neural Information Processing Systems, Vol. 34. Cited by: item 1, §VI.
• [26] D. A. Roberts, S. Yaida, and B. Hanin (2021) The principles of deep learning theory. arXiv preprint arXiv:2106.10165. Cited by: item 2, §I, §V-A, §V.
• [27] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei (2015) ImageNet Large Scale Visual Recognition Challenge.

International Journal of Computer Vision (IJCV)

115 (3), pp. 211–252.
External Links: Document Cited by: §V-D, §VI.
• [28] A. M. Saxe, J. L. McClelland, and S. Ganguli (2013) Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. arXiv preprint arXiv:1312.6120. Cited by: §I, §III-B.
• [29] Z. Shun and P. McCullagh (1995) Laplace approximation of high dimensional integrals. Journal of the Royal Statistical Society: Series B (Methodological) 57 (4), pp. 749–760. Cited by: §V-A, §V-A, §V-C, §V-D.
• [30] R. Vershynin (2018) High-dimensional probability: an introduction with applications in data science. Vol. 47, Cambridge University Press. Cited by: Appendix A, §II, §III-A, §III-A, §III-B.
• [31] C. K. Williams and C. E. Rasmussen (2006) Gaussian processes for machine learning. Vol. 2, MIT press Cambridge, MA. Cited by: §II, §III-B.
• [32] C. K. Williams (1997) Computing with infinite networks. Advances in Neural Information Processing Systems, pp. 295–301. Cited by: §I, §II, §III-B, §V-A, §V.
• [33] A. G. Wilson and P. Izmailov (2020) Bayesian deep learning and a probabilistic perspective of generalization. In Advances in Neural Information Processing Systems, H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin (Eds.), Vol. 33, pp. 4697–4708. Cited by: §III-C, §V-B.
• [34] G. Yang (2019) Scaling limits of wide neural networks with weight sharing: Gaussian process behavior, gradient independence, and neural tangent kernel derivation. arXiv preprint arXiv:1902.04760. Cited by: §I, §II, §III-B, §V-A, §V.
• [35] C. Yun, S. Krishnan, and H. Mobahi (2021) A unifying view on implicit bias in training linear neural networks. In International Conference on Learning Representations, Cited by: §I, §III-B.
• [36] J. A. Zavatone-Veth, A. Canatar, B. S. Ruben, and C. Pehlevan (2021) Asymptotics of representation learning in finite Bayesian neural networks. In Advances in Neural Information Processing Systems, Vol. 34. Cited by: item 2, §I, §III-A, §IV, §V-A, §V-A, §V.
• [37] J. A. Zavatone-Veth and C. Pehlevan (2021) Exact marginal prior distributions of finite Bayesian neural networks. In Advances in Neural Information Processing Systems, Vol. 34. Cited by: §I, §III-A.

## Appendix A Derivation of the posterior predictive generating function

In this short appendix, we sketch the derivations of our results for the moment generating function of the posterior predictive (reported in §III) and the posterior average kernel (reported in §IV). Following the setup in §III, computation of the moment generating function of the posterior predictive requires only the evaluation of a single Gaussian integral, hence we will omit many intermediate steps for brevity. We proceed under the assumption that the combined Gram matrix

 ~Gxx=[GxxGx^xG⊤x^xG^x^x] (44)

is invertible; the result extends to the general case by continuity [30, 14].

We start by using the representation of the generating function as an integral over predictions (9) and the expression for the function-space prior density as a continuous scale mixture (7). Then, assuming that that we can apply Fubini’s theorem to interchange the integrals over and with the integral over , our first task is to evaluate the matrix Gaussian integral

 (2π)−nd~p/2det(L)−~p/2det(~Gxx)−nd/2×∫d~Fetr(−12β(F−Y)(F−Y)⊤+^F⊤J−12L−1~F⊤~G−1xx~F), (45)

where . This integral is easiest to evaluate using row-major vectorization, for which . Then, defining the matrix

 A=[βIpnd000]+~G−1xx⊗L−1 (46)

and the vector

 b=[βv(Y)v(J)], (47)

the integral of interest can be expressed as

 (2π)−nd~p/2det(~Gxx⊗L)−1/2×∫dv(~F)exp(−12v(~F)⊤Av(~F)+b⊤v(~F))=det(~Gxx⊗L)−1/2det(A)−1/2exp(12b⊤A−1b) (48)

up to a normalizing factor of . Using properties of the Kroenecker product, we find after a bit of algebra that [14]

 det(~Gxx⊗L)det(A)=det[(~Gxx⊗L)A]=det(ΓL) (49)

and

 12b⊤A−1b−12βtr(YY⊤)=−12βv(Y)⊤Γ−1Lv(Y)+μ⊤Lv(J)+12v(J)⊤ΣLv(J),

where we have defined the matrices and and the vector as in (10), (11), and (12) of the main text, respectively. We then conclude the desired result upon grouping -dependent terms that do not depend on the source into the density .

The kernel statistics may be derived through an analogous procedure. We start with the posterior moment generating function

 Z(β,J)=EW1,…,Wd|X,Yetr(−121n1JXW⊤1W1X⊤), (50)

where the source term is defined with a factor of for convenience. As before, the first layer weight matrix can be integrated out, yielding

 Z(β,J)∝EL∼ϖ(2π)−ndp/2det(L)−p/2det(Gxx)−nd/2×∫dFexp(−β2∥F−Y∥2F)×etr(−12L−1F⊤G−1xx(Ip+1n1GxxJ)F). (51)

This is again a Gaussian integral, hence it can be evaluated by direct computation using the vectorization method discussed above. After varying the result with respect to , one obtains the formula (34) reported in the main text.