Bayesian Convolutional Neural Networks with Many Channels are Gaussian Processes

10/11/2018 · by Roman Novak, et al.

There is a previously identified equivalence between wide fully connected neural networks (FCNs) and Gaussian processes (GPs). This equivalence enables, for instance, test set predictions that would have resulted from a fully Bayesian, infinitely wide trained FCN to be computed without ever instantiating the FCN, but by instead evaluating the corresponding GP. In this work, we derive an analogous equivalence for multi-layer convolutional neural networks (CNNs) both with and without pooling layers, and achieve state of the art results on CIFAR10 for GPs without trainable kernels. We also introduce a Monte Carlo method to estimate the GP corresponding to a given neural network architecture, even in cases where the analytic form has too many terms to be computationally feasible. Surprisingly, in the absence of pooling layers, the GPs corresponding to CNNs with and without weight sharing are identical. As a consequence, translation equivariance in finite-channel CNNs trained with stochastic gradient descent (SGD) has no corresponding property in the Bayesian treatment of the infinite channel limit - a qualitative difference between the two regimes that is not present in the FCN case. We confirm experimentally that while in some scenarios the performance of SGD-trained finite CNNs approaches that of the corresponding GPs as the channel count increases, with careful tuning SGD-trained CNNs can significantly outperform their corresponding GPs, suggesting advantages from SGD training compared to fully Bayesian parameter estimation.


1 Introduction

Neural networks (NNs) demonstrate remarkable performance (He et al., 2016; Oord et al., 2016; Silver et al., 2017; Vaswani et al., 2017), but are still only poorly understood from a theoretical perspective (Goodfellow et al., 2015; Choromanska et al., 2015; Pascanu et al., 2014; Zhang et al., 2017). NN performance is often motivated in terms of model architectures, initializations, and training procedures together specifying biases, constraints, or implicit priors over the class of functions learned by a network. This induced structure in learned functions is believed to be well matched to structure inherent in many practical machine learning tasks, and in many real-world datasets. For instance, properties of NNs which are believed to make them well suited to modeling the world include: hierarchy and compositionality (Lin et al., 2017; Poggio et al., 2017), Markovian dynamics (Tiňo et al., 2004, 2007), and equivariances in time and space for RNNs (Werbos, 1988) and CNNs (Fukushima & Miyake, 1982; Rumelhart et al., 1985), respectively.

The recent discovery of an equivalence between deep neural networks and GPs (Lee et al., 2018; de G. Matthews et al., 2018) allows us to express an analytic form for the prior over functions encoded by deep NN architectures and initializations. This transforms an implicit prior over functions into an explicit prior, which can be analytically interrogated and easily reasoned about.

Previous work studying these Neural Network-equivalent Gaussian Processes (NN-GPs) has established the correspondence only for fully connected networks (FCNs). Additionally, previous work has not used analysis of NN-GPs to gain specific insights into the equivalent NNs.

In the present work, we extend the equivalence between NNs and NN-GPs to deep Convolutional Neural Networks (CNNs), both with and without pooling. CNNs are a particularly interesting architecture for study, since they are frequently held forth as a success of motivating NN design based on invariances and equivariances of the physical world (Cohen & Welling, 2016) – specifically, designing a NN to respect translation equivariance (Fukushima & Miyake, 1982; Rumelhart et al., 1985). As we will see in this work, absent pooling, this quality can vanish in the Bayesian treatment of the infinite width limit.

The specific novel contributions of the present work are:

  1. We show analytically that CNNs with many channels, trained in a fully Bayesian fashion, correspond to an NN-GP (§2, §3). We show this for CNNs both with and without pooling, with arbitrary convolutional striding, and with and without zero padding. We prove convergence as the number of channels in the hidden layers goes to infinity uniformly (§A.5.3), strengthening and extending the result of de G. Matthews et al. (2018) under mild conditions on the derivative of the nonlinearity.

  2. We show that in the absence of pooling, the NN-GP for a CNN and a Locally Connected Network (LCN) are identical (§5.1). An LCN has the same local connectivity pattern as a CNN, but without weight sharing or translation equivariance.

  3. We experimentally compare trained CNNs and LCNs and find that under certain conditions both perform similarly to the respective NN-GP (Figure 4, b, c). Moreover, both architectures tend to perform better with increased channel count, suggesting that, similarly to FCNs (Neyshabur et al., 2015; Novak et al., 2018), CNNs benefit from overparameterization (Figure 4, a, b), corroborating a similar trend observed in Canziani et al. (2016, Figure 2). However, we also show that careful tuning of hyperparameters allows finite CNNs trained with SGD to outperform their corresponding NN-GP by a significant margin. We experimentally disentangle and quantify the contributions stemming from local connectivity, equivariance, and invariance in a convolutional model in one such setting (Table 1).

  4. We introduce a Monte Carlo method to compute NN-GP kernels for situations (such as CNNs with pooling) where evaluating the NN-GP is otherwise computationally infeasible (§4).

1.1 Related work

In early work on neural network priors, Neal (1994) demonstrated that, in a fully connected network with a single hidden layer, certain natural priors over network parameters give rise to a Gaussian process prior over functions when the number of hidden units is taken to be infinite. Follow-up work extended the conditions under which this correspondence applied (Williams, 1997; Le Roux & Bengio, 2007; Hazan & Jaakkola, 2015). An exactly analogous correspondence for infinite width, finite depth deep fully connected networks was developed recently in Lee et al. (2018); de G. Matthews et al. (2018).

The line of work examining signal propagation in random deep networks (Poole et al., 2016; Schoenholz et al., 2017; Yang & Schoenholz, 2017; Hanin & Rolnick, 2018; Chen et al., 2018) is related to the construction of the GPs we consider. They apply a mean field approximation in which the pre-activation signal is replaced with a Gaussian, and the derivation of the covariance function with depth is the same as for the kernel function of a corresponding GP. Recently, Xiao et al. (2018) extended this to convolutional architectures without pooling. Xiao et al. (2018) also analyzed properties of the convolutional kernel at large depths to construct a phase diagram which will be relevant to NN-GP performance, as discussed in §A.2.

Compositional kernels arising from convolutional and fully connected layers also appeared outside of the GP context in Daniely et al. (2016). That work proves approximation guarantees between a network and its corresponding kernel, and shows that empirical kernels converge as the number of channels increases.

There is a line of work considering stacking of GPs, such as deep GPs (Lawrence & Moore, 2007; Damianou & Lawrence, 2013). These no longer correspond to GPs, though they can describe a rich class of probabilistic models beyond GPs. Alternatively, deep kernel learning (Wilson et al., 2016b, a; Bradshaw et al., 2017) utilizes GPs with base kernels which take in features produced by a deep neural network (often a CNN), and trains the resulting model end-to-end. Finally, van der Wilk et al. (2017) incorporates convolutional structure into GP kernels, with follow-up work stacking multiple such GPs (Kumar et al., 2018; Blomqvist et al., 2018; Anonymous, 2019) to produce a deep convolutional GP (which is no longer a GP). Our work differs from all of these in that our GP corresponds exactly to a fully Bayesian CNN in the infinite channel limit.

Borovykh (2018) analyzes the convergence of network outputs to a GP after marginalizing over all inputs in a dataset, in the case of a temporal CNN. Thus, while they also consider a GP limit, they do not address the dependence of network outputs on specific inputs, and their model is unable to generate test set predictions.

In concurrent work, Garriga-Alonso et al. (2018) derive an NN-GP kernel equivalent to one of the kernels considered in our work. In addition to explicitly specifying kernels corresponding to pooling and vectorizing, we also compare the NN-GP performance to finite-width SGD-trained CNNs and analyze the differences between the two models.

2 Many-channel Bayesian CNNs are Gaussian processes

2.1 Preliminaries

Consider a series of convolutional hidden layers. The parameters of the network are the convolutional filters and biases, with an outgoing (incoming) channel index and a filter-relative spatial location. (We use Roman letters to index channels and Greek letters for spatial and filter locations. For notational simplicity, we treat the 1D case with a single spatial dimension in the text; the spatial index can be extended to higher dimensions by replacing it with tuples. Similarly, our analysis straightforwardly generalizes to strided convolutions, see §A.3.) Assume a Gaussian prior on both the filter weights and biases,

(1)

The weight and bias variances are σ_ω² and σ_b², respectively. Each hidden layer has a number of channels (filters), a filter size, and a fraction v_β of the receptive field variance assigned to filter location β (with Σ_β v_β = 1). In experiments we use uniform v_β, but nonuniform choices should enable kernel properties better suited for ultra-deep networks, as in Xiao et al. (2018).

Let X denote a set of input images (training set, validation set, or both). For each input image x ∈ X, the network has activations and pre-activations at every layer, with a given input channel count and number of pixels, where

(2)

We emphasize that the activations and pre-activations depend on the input x. φ is a pointwise nonlinearity. The input to each convolution is assumed to be zero padded so that the spatial size is constant throughout the network.

A recurring quantity in this work will be the empirical uncentered covariance tensor of the activations, defined as

(3)

The empirical uncentered covariance is therefore a 4-dimensional random variable indexed by two inputs and two spatial locations (the dependence on the layer widths and on their weights and biases is implied and by default not stated explicitly). The empirical uncentered covariance of the inputs themselves is deterministic.

Whenever an index is omitted, the variable is assumed to contain all possible entries along the respective dimension.
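To make the definition concrete, the following minimal sketch (our own illustration, not the authors' code; array names and shapes are assumptions) computes the empirical uncentered covariance tensor from a batch of activations:

```python
import numpy as np

def empirical_uncentered_covariance(activations):
    """Empirical uncentered covariance tensor of a layer's activations.

    activations: array of shape (num_inputs, num_channels, num_pixels),
        holding the activations for every input image, channel and spatial location.
    Returns an array of shape (num_inputs, num_inputs, num_pixels, num_pixels)
    whose [x, x', a, a'] entry is the average over channels of
    activations[x, i, a] * activations[x', i, a'].
    """
    num_channels = activations.shape[1]
    return np.einsum('xia,yib->xyab', activations, activations) / num_channels
```

For instance, 4 inputs with 32 channels and 64 spatial locations yield a tensor of shape (4, 4, 64, 64), which already hints at the memory considerations discussed in §3 and §4.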

2.2 Correspondence between Gaussian processes and Bayesian deep CNNs with infinitely many channels

We next consider the prior over functions computed by a CNN in the limit of infinitely many channels in the hidden (excluding input and output) layers, and derive its equivalence to a GP with a compositional kernel. The following section gives a proof which uses the empirical uncentered covariance tensors to characterize finite-width intermediate layers and relies on explicit Bayesian marginalization over these intermediate layers. In Appendix A.5 we give several alternative derivations of the correspondence.

Figure 1: Graphical model for the computation performed by a feedforward neural network with Gaussian weights, in terms of inputs, pre-activations, and uncentered covariance tensors. Notice that, per Equation 4, each layer's pre-activations depend on the previous layer only through its empirical uncentered covariance.

2.2.1 A single convolutional layer is a GP conditioned on the uncentered covariance tensor of the previous layer’s activations

As can be seen in Equation 2, the pre-activation tensor is an affine transformation of the multivariate Gaussian weights and biases, with coefficients given by the previous layer's activations. An affine transformation of a multivariate Gaussian is itself Gaussian. Specifically,

(4)

where the first equality in Equation 4 follows from the independence of the weights and biases for each channel. The uncentered covariance tensor of the pre-activations is derived in Xiao et al. (2018): it is an affine transformation of the previous layer's covariance (a cross-correlation operator followed by a shifting operator), defined as follows:

(5)
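As an illustration of this affine map, here is a minimal 1D sketch (ours; sigma_w2 and sigma_b2 stand for the weight and bias variances σ_ω² and σ_b² of §2.1, v for the receptive field fractions, and the zero-padding convention is an assumption):

```python
import numpy as np

def covariance_map(K, v, sigma_w2, sigma_b2):
    """Affine covariance map of a 1D convolutional layer with zero padding.

    K:  covariance tensor of the previous layer's activations,
        shape (N, N, d, d) over input pairs and pixel pairs.
    v:  per-location fractions of the filter variance (odd length, centered, sums to 1).
    Returns the covariance tensor of the pre-activations, same shape as K.
    """
    N, _, d, _ = K.shape
    half = len(v) // 2
    out = np.full(K.shape, sigma_b2, dtype=K.dtype)
    for offset, v_beta in zip(range(-half, half + 1), v):
        lo, hi = max(0, -offset), min(d, d - offset)  # pixels whose shifted index stays in range
        # Zero padding: shifted pixels that fall outside the image contribute nothing.
        out[:, :, lo:hi, lo:hi] += sigma_w2 * v_beta * K[
            :, :, lo + offset:hi + offset, lo + offset:hi + offset]
    return out
```

Because filter weights at different relative locations are independent under the prior, only equal shifts of the two spatial indices contribute, which is why a single offset is applied to both spatial axes.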

2.2.2 Uncentered covariance tensor becomes deterministic with increasing channel count

The summands in Equation 3 are i.i.d., due to the independence of the weights and biases for each channel. Subject to weak restrictions on the nonlinearity φ, we can apply the law of large numbers and conclude that

(6)
(7)

For nonlinearities such as the ReLU (Nair & Hinton, 2010) and the error function, this expectation can be computed in closed form, as derived in Cho & Saul (2009) and Williams (1997) respectively.
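For instance, for the ReLU the expectation is given by the arc-cosine kernel of Cho & Saul (2009). A minimal sketch (ours) applying it entrywise to a covariance tensor in the layout used in the snippets above:

```python
import numpy as np

def relu_expectation(K):
    """Entrywise E[relu(u) relu(v)] for (u, v) jointly Gaussian with uncentered covariance K.

    K: covariance tensor of shape (N, N, d, d); K[x, x, a, a] holds the variances.
    Uses the arc-cosine kernel formula of Cho & Saul (2009).
    """
    var = np.einsum('xxaa->xa', K)                                  # variances, shape (N, d)
    norm = np.sqrt(var[:, None, :, None] * var[None, :, None, :])   # sqrt(K_uu * K_vv)
    cos_theta = np.clip(K / np.maximum(norm, 1e-12), -1.0, 1.0)
    theta = np.arccos(cos_theta)
    return norm * (np.sin(theta) + (np.pi - theta) * np.cos(theta)) / (2.0 * np.pi)
```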

2.2.3 Bayesian marginalization over all hidden layers

The distribution over the CNN outputs can be evaluated by marginalizing over all intermediate layer uncentered covariances in the network (see Figure 1):

(8)
(9)

In the limit of infinitely many channels in the hidden layers (unlike de G. Matthews et al. (2018), we do not require each width to be strictly increasing), all the conditional distributions except for the one over the outputs converge weakly to delta functions and can be integrated out. Precisely, Equation 9 reduces to the expression in the following theorem.

Theorem 2.1.

If the nonlinearity φ is Lipschitz, then we have the following convergence in distribution

(10)

i.e. the covariance map composed with itself once per hidden layer and applied to the input covariance.

In other words, the limiting tensor is the (deterministic) covariance of the CNN activations in the limit of infinitely many channels (hence the ∞ subscript) in each of the convolutional layers. See §A.5.3 for the proof. Therefore Equation 10 states that the outputs for any set of input examples and pixel indices are jointly Gaussian distributed – i.e. the output of a CNN with infinitely many channels in its hidden layers is described by a GP with this covariance function.

3 Transforming a GP over spatial locations into a GP over classes

In §2.2 we have shown that in the infinite channel limit a deep CNN is a GP indexed by input samples and spatial locations of the top layer. Further, its uncentered covariance tensor can be computed in closed form. Here we show that the transformations commonly used in CNN classifiers to obtain class predictions can be represented as either vectorization or projection (as long as we treat classification as regression, similarly to Lee et al. (2018)). Both of these operations preserve the GP equivalence and allow the computation of the covariance tensor of the respective GP (now indexed by input samples and target classes) as a simple transformation of the top layer's covariance tensor.

3.1 Vectorization

One common readout strategy is to vectorize (flatten) the output of the last convolutional layer into a vector and stack a fully connected layer on top:

(11)

where the weights and biases of the readout layer are i.i.d. Gaussian and the output dimension is the number of classes. The sample-sample kernel of the output of this particular GP (identical for each class) is

(12)
(13)

where the limit of infinite width is derived identically to §2.2. As observed in Xiao et al. (2018), to compute any spatially diagonal term of this kernel, one needs only the corresponding diagonal terms of the previous layer's covariance. Consequently, only these diagonal entries need to be stored, greatly reducing the memory cost (per covariance entry in an iterative or distributed setting). Note that this approach ignores pixel-pixel covariances and produces a GP corresponding to a locally connected network (see §5.1).
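Under the convention that the readout weight variance is divided by the number of flattened units, the vectorized readout kernel is simply the spatial average of the pixel-diagonal entries of the last layer's covariance, plus the bias variance. A minimal sketch (ours; the scaling convention is an assumption):

```python
import numpy as np

def vectorization_readout_kernel(K_last, sigma_w2, sigma_b2):
    """Sample-sample kernel of a dense readout applied to the flattened last layer.

    K_last: covariance tensor of the last layer's activations, shape (N, N, d, d).
    Only the spatially diagonal entries K_last[x, x', a, a] enter the result.
    """
    d = K_last.shape[-1]
    diag_sum = np.einsum('xyaa->xy', K_last)      # sum over matching pixel pairs
    return sigma_w2 * diag_sum / d + sigma_b2     # shape (N, N)
```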

3.2 Projection

Another approach is a projection collapsing the spatial dimensions. Let the projection be specified by a deterministic vector of spatial weights, with the readout weights and biases the same as above.

Figure 2: Different dimensionality-collapsing strategies described in §3. Validation accuracy of an MC-CNN-GP with pooling (item 1) is consistently better than that of the other models due to the translation invariance of its kernel. A CNN-GP with zero padding (§3.1) outperforms an analogous CNN-GP without padding as depth increases. At sufficiently large depth the spatial dimension of the output without padding is reduced to a single pixel, making the CNN-GP without padding equivalent to the center pixel selection strategy (item 2) – which also performs worse than the CNN-GP (we conjecture due to overfitting to centrally located features) but approaches the latter (right) in the limit of large depth, as information becomes more uniformly spatially distributed (Xiao et al., 2018). CNN-GPs generally outperform the FCN-GP, presumably due to the local connectivity prior, but can fail to capture nonlinear interactions between spatially distant pixels at shallow depths (left). Values are reported on a 2K/4K train/validation subset of CIFAR10. See §A.7.3 for experimental details.

Define the output to be

(14)
(15)

where the limiting behavior is derived identically to Equation 12. Examples of this approach include

  1. Global average pooling: take the projection vector to be uniform over spatial locations, and denote this particular GP as the CNN-GP with pooling. Then

    (16)

    This approach corresponds to applying global average pooling right after the last convolutional layer. (Spatially local average pooling in intermediate layers can be constructed in a similar fashion, §A.3; we focus on global average pooling in this work to more effectively isolate the effects of pooling from other aspects of the model, such as local connectivity or equivariance.) This approach takes all pixel-pixel covariances into consideration and makes the kernel translation invariant. However, it requires much more memory to compute the sample-sample covariance of the GP (per covariance entry in an iterative or distributed setting). It is impractical to use this method to analytically evaluate the GP, and we propose to use a Monte Carlo approach instead (see §4).

  2. Subsampling one particular pixel: take the projection vector to be an indicator of a single spatial location (e.g. the center pixel),

    (17)

    This approach makes use of only one pixel-pixel covariance, and requires the same amount of memory to compute as the kernel in §3.1.

We compare the performance of the presented strategies in Figure 2. Note that all of the described strategies admit stacking additional FC layers on top while retaining the GP equivalence, using a derivation analogous to §2.
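Both projection readouts above can be written as a contraction of the last layer's covariance tensor with the projection vector on both spatial indices. A minimal sketch (ours; scaling conventions and the example values are assumptions):

```python
import numpy as np

def projection_readout_kernel(K_last, h, sigma_w2, sigma_b2):
    """Sample-sample kernel of a readout that first projects out the spatial dimension.

    K_last: covariance tensor of the last layer's activations, shape (N, N, d, d).
    h:      deterministic projection vector over spatial locations, shape (d,).
    """
    return sigma_w2 * np.einsum('a,xyab,b->xy', h, K_last, h) + sigma_b2

# Example projection vectors for the two strategies above (d = 64 is a placeholder size):
d = 64
h_pool = np.full(d, 1.0 / d)     # item 1: global average pooling
h_pixel = np.eye(d)[d // 2]      # item 2: subsampling the center pixel
```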

4 Monte Carlo evaluation of intractable GP kernels

We introduce a Monte Carlo estimation method for NN-GP kernels which are computationally impractical to compute analytically, or for which we do not know the analytic form. Similar in spirit to traditional random feature methods (Rahimi & Recht, 2007), the core idea is to instantiate many random finite width networks and use the empirical uncentered covariances of activations to estimate the Monte Carlo-GP (MC-GP) kernel,

(18)

where the average is taken over M draws of the weights and biases from their prior distribution, and n is the width or number of channels in the hidden layers. The MC-GP kernel converges in probability to the analytic kernel with increasing width.

For finite-width networks, the Monte Carlo estimate has nonzero variance. From Daniely et al. (2016), we know that the variance of a single network's empirical kernel decreases as the width n grows, which leads to a variance of the MC-GP kernel that decreases with the product M·n. For finite n, the MC-GP kernel is also a biased estimate of the analytic kernel, where the bias depends solely on network width. We do not currently have an analytic form for this bias, but we can see in Figures 3 and 7 that for the hyperparameters we probe it is small relative to the variance; in particular, the estimate quality is nearly constant for constant M·n. We thus treat M·n as the effective sample size for the Monte Carlo kernel estimate. Increasing M and reducing n can reduce memory cost, though potentially at the expense of increased compute time and bias.

In a non-distributed setting, the MC-GP substantially reduces the memory required to compute the kernel, making the evaluation of CNN-GPs with pooling practical.
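A minimal sketch (ours, not the authors' implementation) of the estimator in Equation 18: draw M random networks, collect top-layer activations for all inputs, and average the uncentered second moments over channels and draws. The callable sample_network_activations is a hypothetical stand-in for instantiating a random finite network:

```python
import numpy as np

def mc_gp_kernel(inputs, sample_network_activations, num_draws):
    """Monte Carlo estimate of an NN-GP kernel in the spirit of Equation 18.

    inputs: array of shape (N, ...) holding N input images.
    sample_network_activations: callable that draws fresh parameters from the prior and
        returns top-layer activations of shape (N, n), e.g. post-pooling features.
    num_draws: number of random networks M to average over.
    """
    N = inputs.shape[0]
    K = np.zeros((N, N))
    total_channels = 0
    for _ in range(num_draws):
        acts = sample_network_activations(inputs)   # shape (N, n), fresh random network
        K += acts @ acts.T                          # accumulate uncentered second moments
        total_channels += acts.shape[1]
    return K / total_channels                       # effective sample size is M * n
```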


Figure 3: Validation accuracy (left) of an MC-CNN-GP increases with M·n (i.e. channel count times number of samples) and approaches that of the exact CNN-GP (not shown), while the distance (right) to the exact kernel decreases. The dark band in the left plot corresponds to ill-conditioning of the kernel matrix when the number of outer products contributing to it approximately equals its rank. Values reported are for a 3-layer model applied to a 2K/4K train/validation subset of downsampled CIFAR10. See Figure 7 for similar results with other architectures and §A.7.2 for experimental details.

5 Discussion

5.1 Bayesian CNNs with many channels are identical to locally connected networks, in the absence of pooling

Locally Connected Networks (LCNs) (Fukushima, 1975; Lecun, 1989) are CNNs without weight sharing between spatial locations. LCNs preserve the connectivity pattern, and thus topology, of a CNN. However, they do not possess the equivariance property of a CNN – if an input is translated, the latent representation in an LCN will be completely different, rather than also being translated.

The CNN-GP predictions without spatial pooling in §3.1 and item 2 of §3.2 depend only on sample-sample covariances, and do not depend on pixel-pixel covariances. LCNs destroy pixel-pixel covariances: in an LCN, the covariance between activations at distinct spatial locations is zero. However, LCNs preserve the covariances between input examples at every individual pixel. As a result, in the absence of pooling, LCN-GPs and CNN-GPs are identical. Moreover, LCN-GPs with pooling are identical to CNN-GPs with vectorization of the top layer (under suitable scaling of the readout weights). We confirm these findings experimentally in trained networks in the limit of large width in Figure 4 (b), as well as by demonstrating convergence of MC-GPs of the respective architectures to the same CNN-GP (modulo scaling) in Figures 3 and 7.

5.2 Pooling leverages equivariance to provide invariance

The only kernel leveraging pixel-pixel covariances is that of the CNN-GP with pooling. This enables the predictions of this GP and the corresponding CNN to be invariant to translations (modulo edge effects) – a beneficial quality for an image classifier. We observe strong experimental evidence supporting the benefits of invariance throughout this work (Figures 2, 3, 4 (b); Tables 1, 2), in both CNNs and CNN-GPs.

5.3 Finite-channel SGD-trained CNNs can outperform infinite-channel Bayesian CNNs, in the absence of pooling

In the absence of pooling, the benefits of equivariance and weight sharing are more challenging to explain in terms of Bayesian priors on class predictions (since without pooling equivariance is not a property of the outputs, but only of intermediary representations). Indeed, in this work we find that the performance of finite-width SGD-trained CNNs often approaches that of their CNN-GP counterpart (Figure 4, b, c) (this observation is conditioned on the respective NN fitting the training set to 100% accuracy; underfitting breaks the correspondence to an NN-GP, since train set predictions of such a network no longer correspond to the true training labels, and properly tuned underfitting often also leads to better generalization, see Table 2), suggesting that in those cases equivariance does not play a beneficial role in SGD-trained networks.

However, as can be seen in Tables 1 and 2 and Figure 4 (c), the best CNN overall outperforms the best CNN-GP by a significant margin – an observation specific to CNNs and not FCNs or LCNs. We observe this gap in performance especially in the case of networks trained with a large learning rate. In Table 1 we demonstrate this large gap in performance by evaluating different models with equivalent architecture and hyperparameter settings, chosen for good SGD-trained CNN performance.

We conjecture that equivariance, a property lacking in LCNs and the Bayesian treatment of the infinite channel CNN limit, contributes to the performance of SGD-trained finite-channel CNNs with the correct settings of hyperparameters. Nonetheless, more work is needed to disentangle and quantify the separate contributions of stochastic optimization and finite width effects to differences in performance between CNNs with weight sharing and their corresponding CNN-GPs.

(Figure 4 layout: panels (a) and (c) at the top; panel (b) shows rows LCN and CNN, columns No Pooling and Global Average Pooling; horizontal axis: #Channels.)
Figure 4: (a): SGD-trained CNNs often perform better with increasing number of channels. Each line corresponds to a particular choice of architecture and initialization hyperparameters, with the best learning rate and weight decay selected independently for each number of channels (horizontal axis). (b): SGD-trained CNNs often approach the performance of their corresponding CNN-GP with increasing number of channels. All models have the same architecture except for pooling and weight sharing, as well as training-related hyperparameters such as learning rate, weight decay and batch size, which are selected for each number of channels (horizontal axis) to maximize validation performance (vertical axis) of a neural network. As the number of channels grows, the best validation accuracy increases and approaches the accuracy of the respective GP (solid horizontal line). (c): However, the best-performing SGD-trained CNNs can outperform their corresponding CNN-GPs. Each point corresponds to the validation accuracy of a specific CNN-GP (one axis) and of the best CNN with the same architectural hyperparameters (other axis), selected among the 100%-accurate models on the full CIFAR10 training dataset with different learning rates, weight decay and numbers of channels. While the CNN-GP appears competitive against 100%-accurate CNNs (above the diagonal), the best CNNs overall outperform CNN-GPs by a significant margin (below the diagonal, right). For further analysis of factors leading to similar or diverging behavior between SGD-trained finite CNNs and infinite Bayesian CNNs see Tables 1 and 2. Experimental details: all networks have reached 100% training accuracy on CIFAR10. Values in (c) are reported on a 0.5K/4K train/validation subset of downsampled CIFAR10 for computational reasons. See §A.7.5 and §A.7.1 for full experimental details of the (a, c) and (b) plots respectively.
Quality: Compositionality | Local connectivity | Equivariance | Invariance
Model: FCN, FCN-GP | LCN (w/ pooling), CNN-GP | CNN | CNN w/ pooling
Table 1:

Disentangling the role of network topology, equivariance, and invariance on test performance, for SGD-trained and infinite width Bayesian networks.

Test error (%) on CIFAR10 of different models of the same depth, nonlinearity, and weight and bias variances. LCN and CNN-GP have a hierarchical local topology, which is beneficial for image recognition tasks, and outperform fully connected models (FCN and FCN-GP). As predicted in §5.1: (i) weight sharing has no effect in the Bayesian treatment of an infinite width CNN (CNN-GP performs similarly to an LCN, a CNN without weight sharing), and (ii) pooling has no effect on the generalization of an LCN model (LCN and LCN with pooling perform nearly identically). Local connectivity combined with equivariance (CNN), enabled by weight sharing in an SGD-trained finite model, allows for a significant improvement. Finally, invariance enabled by weight sharing and pooling allows for the best performance. Values are reported for 8-layer models. See §A.7.6 for experimental details and Table 2 for more model comparisons.
Model CIFAR10 MNIST Fashion-MNIST
CNN with pooling ()
CNN with and large learning rate ()
CNN-GP
CNN with small learning rate
CNN with (any learning rate)
Convolutional GP (van der Wilk et al., 2017)
ResNet GP (Garriga-Alonso et al., 2018)
Residual CNN-GP (Garriga-Alonso et al., 2018)
CNN-GP (Garriga-Alonso et al., 2018)
FCN-GP
FCN-GP (Lee et al., 2018)
FCN ()
Table 2: Aspects of architecture and inference influencing test performance. Test error (%) for the best model within each model family, maximizing validation accuracy over depth, width, and training and initialization hyperparameters. Except where indicated by parentheses, all models achieve 100% training accuracy. For SGD-trained CNNs, numbers in parentheses correspond to the same model family, but without the restriction on training accuracy. CNN-GP achieves state of the art results on CIFAR10 for GPs without trainable kernels and outperforms SGD models optimized with a small learning rate to 100% train accuracy. When SGD optimization is allowed to underfit the training set, there is a significant improvement in generalization. Further, when the nonlinearity is paired with a large learning rate, the performance of SGD-trained models again improves relative to CNN-GPs, suggesting a beneficial interplay between such nonlinearities and fast SGD training. These differences in performance between CNNs and CNN-GPs are not observed between FCNs and FCN-GPs, or between LCNs and LCN-GPs (Table 1), suggesting that equivariance is the underlying factor responsible for the improved performance of finite SGD-trained CNNs relative to infinite Bayesian CNNs without pooling. See §A.7.5 for experimental details.

6 Conclusion

In this work we have derived a Gaussian process that corresponds to a deep fully Bayesian CNN with infinitely many channels. The covariance of this GP can be efficiently computed either in closed form or by using Monte Carlo sampling, depending on the architecture.

The CNN-GP achieves state of the art results for GPs without trainable kernels on CIFAR10. It can perform competitively with CNNs (that fit the training set) of equivalent architecture and weight priors, which makes it an appealing choice for small datasets, as it eliminates all training-related hyperparameters. However, we found that the best overall performance is achieved by finite SGD-trained CNNs and not by their infinite Bayesian counterparts. We hope our work stimulates future research into disentangling the contributions of the two qualities (Bayesian treatment and infinite width) to the performance gap observed.

7 Acknowledgements

We thank Greg Yang, Sam Schoenholz, Vinay Rao, Daniel Freeman, and Qiang Zeng for frequent discussion and feedback on preliminary results.

References

Appendix A Appendix

a.1 Additional Figures

(Figure 5 panels: rows LCN and CNN; columns No Pooling and Global Average Pooling; horizontal axis: #Channels.)
Figure 5: Best validation loss (vertical axis) of trained neural networks (dashed line) approaches that of the respective (MC-)CNN-GP (solid horizontal line) as the number of channels increases (horizontal axis). See Figure 4 (b) for validation accuracy, Figure 6 for training loss and §A.7.1 for experimental details.
(Figure 6 panels: rows LCN and CNN; columns No Pooling and Global Average Pooling; horizontal axis: #Channels.)
Figure 6: Training loss (vertical axis) of the best (in terms of validation loss) neural networks as the number of channels increases (horizontal axis). While perfect loss is not achieved (though perfect accuracy is), we observe no consistent improvement when increasing the capacity of the network (left to right). This eliminates underfitting as a possible explanation for why small models perform worse in Figure 4 (b). See Figure 5 for validation loss and §A.7.1 for experimental details.

(Figure 7 rows, top to bottom: MC-CNN-GP with pooling; MC-LCN-GP; MC-LCN-GP with pooling; MC-FCN-GP.)

Figure 7: As in Figure 3, validation accuracy (left) of MC-GPs increases with M·n (i.e. width times number of samples), while the distance (right) to the respective exact GP kernel (or to the best available estimate, in the case of the CNN-GP with pooling, top row) decreases. We remark that when using shared weights, convergence is slower, since a smaller number of independent random parameters is used. For example, a single-layer MC-LCN-GP kernel is expected to converge faster than the MC-CNN-GP kernel by a factor comparable to the number of spatial locations, which is in agreement with the results in the second row and in Figure 3, as measured by the geometric mean of the ratios of the kernel distances from the (3-layer) MC-CNN-GP and MC-LCN-GP to the respective CNN-GP. See §A.7.2 for experimental details.

a.2 Relationship to Deep Signal Propagation

The recurrence relation linking the GP kernel at one layer to that of the previous layer, which follows from Equation 10, is precisely the covariance map examined in a series of related papers on signal propagation (Xiao et al., 2018; Poole et al., 2016; Schoenholz et al., 2017; Lee et al., 2018) (modulo notational differences). In those works, the action of this map on hidden-state covariance matrices was interpreted as defining a dynamical system whose large-depth behavior informs aspects of trainability. In particular, as depth goes to infinity the covariance approaches a fixed point. The convergence to a fixed point is problematic for learning because the hidden states no longer contain information that can distinguish different pairs of inputs. It is similarly problematic for GPs, as the kernel becomes pathological as it approaches a fixed point. Precisely, in the chaotic regime outputs of the GP become asymptotically decorrelated and therefore independent, while in the ordered regime they approach perfect correlation of 1. Either of these scenarios captures no information about the training data in the kernel and makes learning infeasible.

This problem can be ameliorated by judicious hyperparameter selection, which can reduce the rate of exponential convergence to the fixed point. For hyperparameters chosen on a critical line separating two untrainable phases, the convergence rates slow to polynomial, so that very deep networks can be trained and inference with deep NN-GP kernels can be performed – see Table 3.
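As a small illustration of this fixed-point behavior (ours, using the ReLU covariance map at the variance-preserving initialization σ_ω² = 2, σ_b² = 0 rather than the more general hyperparameters and nonlinearities covered by the phase diagram), the correlation between two equal-norm inputs creeps toward the fixed point 1 with depth:

```python
import numpy as np

def relu_correlation_map(c):
    """One layer of the ReLU covariance map at sigma_w^2 = 2, sigma_b^2 = 0,
    acting on the correlation between two equal-norm inputs."""
    theta = np.arccos(np.clip(c, -1.0, 1.0))
    return (np.sin(theta) + (np.pi - theta) * np.cos(theta)) / np.pi

c = 0.2  # initial correlation between two inputs
for depth in range(1, 1001):
    c = relu_correlation_map(c)
    if depth in (1, 10, 100, 1000):
        print(f"depth {depth:4d}: correlation {c:.6f}")
# The correlation drifts towards the fixed point 1, so a very deep kernel
# can no longer distinguish the two inputs.
```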

(Table 3 layout: rows CNN-GP and FCN-GP; columns of increasing depth, with the rightmost column showing the phase boundary.)
Table 3: Validation accuracy of CNN-GPs and FCN-GPs as a function of the weight variance (σ_ω², horizontal axis) and bias variance (σ_b², vertical axis). As predicted in §A.2, the regions of good performance concentrate around the critical line (phase boundary, right) as the depth increases (left to right). All plots share common axis ranges and employ the same nonlinearity. See §A.7.2 for experimental details.

a.3 Strided convolutions and average pooling in intermediate layers

Our analysis in the main text can easily be extended to cover average pooling and strided convolutions (applied before the pointwise nonlinearity). Recall that, conditioned on the previous layer's covariance, each channel of the pre-activations is a mean-zero multivariate Gaussian. Applying a deterministic linear operator to the spatial dimension of the pre-activations yields another mean-zero Gaussian, whose covariance is

(19)

One can easily see that the different channels of the transformed pre-activations remain i.i.d. multivariate Gaussian.

Strided convolution. A strided convolution is equivalent to a non-strided convolution composed with subsampling: for a stride of size s, it is equivalent to choosing the linear operator that selects every s-th spatial location.

Average pooling. Average pooling with stride s and window size ws is equivalent to choosing the linear operator that averages each window of ws consecutive spatial locations, with consecutive windows offset by s locations.
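A minimal sketch (ours) of these linear operators represented as matrices acting on the spatial dimension, together with the corresponding covariance transformation of Equation 19:

```python
import numpy as np

def subsampling_operator(d, stride):
    """Matrix selecting every `stride`-th of d spatial locations (strided convolution)."""
    idx = np.arange(0, d, stride)
    D = np.zeros((len(idx), d))
    D[np.arange(len(idx)), idx] = 1.0
    return D

def average_pooling_operator(d, stride, window):
    """Matrix averaging windows of `window` consecutive locations spaced `stride` apart."""
    starts = np.arange(0, d - window + 1, stride)
    D = np.zeros((len(starts), d))
    for row, start in enumerate(starts):
        D[row, start:start + window] = 1.0 / window
    return D

def transform_covariance(K, D):
    """Covariance tensor of the linearly transformed pre-activations (cf. Equation 19)."""
    return np.einsum('ua,xyab,vb->xyuv', D, K, D)
```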

a.4 Review of exact Bayesian regression with GPs

Our discussion in the paper has focused on model priors. A crucial benefit we derive by mapping to a GP is that Bayesian inference is straightforward to implement and can be done exactly for regression (Rasmussen & Williams, 2006, chapter 2), requiring only simple linear algebra. Let X denote the training inputs and y the training targets. The integral over the posterior can be evaluated analytically to give a posterior predictive distribution on a test point which is Normal, with

(20)
(21)

We use the shorthand K to denote the matrix formed by evaluating the GP covariance on the training inputs, and likewise the vector formed from the covariance between the test input and the training inputs. Computationally, the costly step in GP posterior predictions is the matrix inversion, which in all experiments was carried out exactly, and which typically scales cubically in the number of training points (though asymptotically faster algorithms exist for sufficiently large matrices). Nonetheless, there is a broad literature on approximate Bayesian inference with GPs which can be utilized for efficient implementation (Rasmussen & Williams, 2006, chapter 8; Quiñonero-Candela & Rasmussen, 2005).
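For completeness, a minimal sketch (ours) of this computation; the noise/jitter term sigma_eps2 and the Cholesky-based solve are implementation choices of this sketch, not taken from the paper:

```python
import numpy as np

def gp_posterior(K_train, K_test_train, K_test_test, y_train, sigma_eps2=1e-6):
    """Exact GP posterior predictive mean and covariance (Rasmussen & Williams, ch. 2).

    K_train:      (n, n) kernel matrix on the training inputs.
    K_test_train: (m, n) kernel between test and training inputs.
    K_test_test:  (m, m) kernel matrix on the test inputs.
    y_train:      (n, k) training targets (e.g. one-hot labels treated as regression).
    """
    n = K_train.shape[0]
    # Cholesky-based solves instead of an explicit inverse, for numerical stability.
    L = np.linalg.cholesky(K_train + sigma_eps2 * np.eye(n))
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train))
    mean = K_test_train @ alpha
    v = np.linalg.solve(L, K_test_train.T)
    cov = K_test_test - v.T @ v
    return mean, cov
```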

a.5 Kernel Convergence Proof

In this section, we present three different approaches to illustrate the weak convergence of neural networks to Gaussian processes as the number of channels goes to infinity. Although the first (§A.5.1) and second (§A.5.2) approaches (taking iterated limits) are less formal, they provide some intuition for the convergence of neural networks to GPs. The approach in §A.5.3 is more standard and the proof is more involved. We only provide the arguments for convolutional neural networks; it is straightforward to extend them to locally or fully connected networks.

We will use the following well-known theorem.

Theorem A.1 (Portmanteau Theorem).

Let X_n be a sequence of real-valued random variables and X a random variable. The following are equivalent:

  1. X_n converges to X in distribution,

  2. For every bounded continuous function f,

    (22)

  3. The characteristic functions of X_n converge pointwise to that of X, i.e. for all t,

    (23)

a.5.1 Forward Mode

We show that when the hidden layer widths are taken to infinity sequentially (layer by layer), a CNN converges to a GP in the following sense: the pre-activations of each layer converge to a Gaussian in distribution. We proceed by induction. It is not difficult to see that the channels of the first layer's pre-activations are pairwise independent (multivariate) Gaussian with identical distributions, and thus i.i.d. Gaussian. Assume the channels of a given layer are i.i.d. Gaussian (unconditionally). We claim that so are the channels of the next layer. Indeed, since both the connection weights between the two layers and the biases of different channels are independent, the next layer's channels are uncorrelated and have the same distribution. To prove that they are mutually independent, we only need to show that each channel converges to a Gaussian in distribution as the width of the previous layer goes to infinity. Since the previous layer's channels are i.i.d., the summands of the inner sum of Equation 2 are i.i.d. We can then apply a multivariate central limit theorem (assuming the covariance of the summands is finite) to conclude that each channel of the next layer converges to a Gaussian in distribution (note that we have applied the fact that the previous layer's limit is Gaussian).

a.5.2 Reverse Mode

Conditioning on the previous layer's activations, the empirical uncentered covariance is a random variable that converges in probability to its analytic counterpart as the number of channels goes to infinity (by the law of large numbers, see Equation 7).

It is clear that the different channels of the next layer's pre-activations are uncorrelated and have the same distribution. We will show that, for any channel index, the corresponding random variable “converges” to the Gaussian

(24)

in the sense that its characteristic function converges pointwise to that of this Gaussian, i.e. for all vectors,

(25)
Proof.

Applying Fubini’s Theorem and the formula for the characteristic function of a multivariate Gaussian,

(26)
(27)
(28)
(29)
(30)

We now take the limit of infinitely many channels and exchange it with the outer integral. The Lebesgue dominated convergence theorem allows us to do so because the inner integral is bounded above by a constant function which is absolutely integrable w.r.t. the outer integral. We then apply Theorem A.1, since the integrand is bounded and continuous, and

(31)

Repeatedly applying the same argument (here we need the nonlinearity to be continuous) gives

(32)

Note that the addition of various layers on top (as discussed in §3) does not change the proof in a qualitative way. ∎

a.5.3 Uniform Convergence Mode

In this section, we present a sufficient condition on the activation function so that the neural networks converge to a Gaussian process as all the hidden layer widths approach infinity uniformly. Precisely, we are interested in the case in which all hidden layer widths go to infinity simultaneously, i.e.,

(33)

Using Theorem A.1 and the arguments in the above section, it is not difficult to see that a sufficient condition is that the empirical covariance converges in probability to the analytic covariance.

Corollary A.1.1.

If the empirical uncentered covariance converges to the analytic covariance in probability as the widths go to infinity, then

(34)

In the remainder of this section, we provide a sufficient condition for the premise of Corollary A.1.1 (convergence of the empirical covariance in probability), borrowing some ideas from Daniely et al. (2016).

Notation. Let denote the set of positive semi-definite matrices and for , define

(35)

Further let and be a function and a random variable (induced by the activation ) given by

(36)
(37)

Finally, let denote the space of measurable functions with the following properties:

  1. Uniformly Squared Integrable: for every , there exists a positive constant such that

    (38)
  2. Lipschitz Continuity: for every , there exists such that for all ,

    (39)
  3. Uniform Convergence in Probability: for every and every ,

    (40)

We will also use and to denote the spaces of functions satisfying property 1, property 2 and property 3, respectively. It is not difficult to see that for every , is a vector space, and so is .

Definition A.1.

We say is linearly bounded (exponentially bounded) if there exist such that

(41)

Note that the class of linearly bounded (exponentially bounded) functions is closed under addition and scalar multiplication. Moreover, exponentially bounded functions contain all polynomials, and they are also closed under multiplication and under integration, in the sense that for any constant the function

(42)

is also exponentially bounded.

Lemma A.2.

The following is true:

  1. contains all exponentially bounded functions.

  2. contains all functions whose first derivative is exponentially bounded.

  3. contains all linearly bounded functions.

Proof.

1. We prove the first statement. Assume .

(43)

In the last inequality, we applied

Thus

(44)

2. To prove the second statement, let and define (similarly for ):

(45)

Then (and ). Let

(46)

and

(47)

Since is exponentially bounded, is also exponentially bounded. In addition, is exponentially bounded for any polynomial .

Applying the Mean Value Theorem (we use the notation to hide the dependence on and other absolute constants)

(48)
(49)
(50)
(51)

Note that the operator norm is bounded by the infinity norm (up to a multiplicative constant) and is exponentially bounded. There are constants (hidden in the notation above) such that the above is bounded by