Neural networks (NNs) demonstrate remarkable performance (He et al., 2016; Oord et al., 2016; Silver et al., 2017; Vaswani et al., 2017), but are still only poorly understood from a theoretical perspective (Goodfellow et al., 2015; Choromanska et al., 2015; Pascanu et al., 2014; Zhang et al., 2017). NN performance is often motivated in terms of model architectures, initializations, and training procedures together specifying biases, constraints, or implicit priors over the class of functions learned by a network. This induced structure in learned functions is believed to be well matched to structure inherent in many practical machine learning tasks, and in many real-world datasets. For instance, properties of NNs which are believed to make them well suited to modeling the world include: hierarchy and compositionality (Lin et al., 2017; Poggio et al., 2017), Markovian dynamics (Tiňo et al., 2004, 2007), and equivariances in time and space for RNNs (Werbos, 1988) and CNNs (Fukushima & Miyake, 1982; Rumelhart et al., 1985) respectively.
The recent discovery of an equivalence between deep neural networks and Gaussian processes (GPs) (Lee et al., 2018; de G. Matthews et al., 2018) allows us to express an analytic form for the prior over functions encoded by deep NN architectures and initializations. This transforms an implicit prior over functions into an explicit prior, which can be analytically interrogated and easily reasoned about.
Previous work studying these Neural Network-equivalent Gaussian Processes (NN-GPs) has established the correspondence only for fully connected networks (FCNs). Additionally, previous work has not used analysis of NN-GPs to gain specific insights into the equivalent NNs.
In the present work, we extend the equivalence between NNs and NN-GPs to deep Convolutional Neural Networks (CNNs), both with and without pooling. CNNs are a particularly interesting architecture for study, since they are frequently held forth as a success of motivating NN design based on invariances and equivariances of the physical world (Cohen & Welling, 2016) – specifically, designing a NN to respect translation equivariance (Fukushima & Miyake, 1982; Rumelhart et al., 1985). As we will see in this work, absent pooling, this quality can vanish in the Bayesian treatment of the infinite width limit.
The specific novel contributions of the present work are:
We show the NN-GP correspondence for CNNs both with and without pooling, with arbitrary convolutional striding, and with two padding schemes. We prove convergence as the number of channels in hidden layers goes to infinity uniformly (§A.5.3), strengthening and extending the result of de G. Matthews et al. (2018) under mild conditions on the nonlinearity derivative.
We show that in the absence of pooling, the NN-GP for a CNN and a Locally Connected Network (LCN) are identical (§5.1). An LCN has the same local connectivity pattern as a CNN, but without weight sharing or translation equivariance.
We experimentally compare trained CNNs and LCNs and find that under certain conditions both perform similarly to the respective NN-GP (Figure 4, b, c). Moreover, both architectures tend to perform better with increased channel count, suggesting that similarly to FCNs (Neyshabur et al., 2015; Novak et al., 2018) CNNs benefit from overparameterization (Figure 4, a, b), corroborating a similar trend observed in Canziani et al. (2016, Figure 2). However, we also show that careful tuning of hyperparameters allows finite CNNs trained with SGD to outperform their corresponding NN-GP by a significant margin. We experimentally disentangle and quantify the contributions stemming from local connectivity, equivariance, and invariance in a convolutional model in one such setting (Table 1).
We introduce a Monte Carlo method to compute NN-GP kernels for situations (such as CNNs with pooling) where evaluating the NN-GP is otherwise computationally infeasible (§4).
1.1 Related work
In early work on neural network priors, Neal (1994) demonstrated that, in a fully connected network with a single hidden layer, certain natural priors over network parameters give rise to a Gaussian process prior over functions when the number of hidden units is taken to be infinite. Follow-up work extended the conditions under which this correspondence applied (Williams, 1997; Le Roux & Bengio, 2007; Hazan & Jaakkola, 2015). An exactly analogous correspondence for infinite width, finite depth deep fully connected networks was developed recently in Lee et al. (2018); de G. Matthews et al. (2018).
The line of work examining signal propagation in random deep networks (Poole et al., 2016; Schoenholz et al., 2017; Yang & Schoenholz, 2017; Hanin & Rolnick, 2018; Chen et al., 2018) is related to the construction of the GPs we consider. They apply a mean field approximation in which the pre-activation signal is replaced with a Gaussian, and the derivation of the covariance function with depth is the same as for the kernel function of a corresponding GP. Recently, Xiao et al. (2018) extended this to convolutional architectures without pooling. Xiao et al. (2018) also analyzed properties of the convolutional kernel at large depths to construct a phase diagram which will be relevant to NN-GP performance, as discussed in §A.2.
Compositional kernels coming from convolutional and fully connected layers also appeared outside of the GP context in Daniely et al. (2016). In that work, the authors prove approximation guarantees between a network and its corresponding kernel, and show that empirical kernels converge as the number of channels increases.
There is a line of work considering stacking of GPs, such as deep GPs (Lawrence & Moore, 2007; Damianou & Lawrence, 2013). These no longer correspond to GPs, though they can describe a rich class of probabilistic models beyond GPs. Alternatively, deep kernel learning (Wilson et al., 2016b, a; Bradshaw et al., 2017) utilizes GPs with base kernels which take in features produced by a deep neural network (often a CNN), and trains the resulting model end-to-end. Finally, van der Wilk et al. (2017) incorporate convolutional structure into GP kernels, with follow-up work stacking multiple such GPs (Kumar et al., 2018; Blomqvist et al., 2018; Anonymous, 2019) to produce a deep convolutional GP (which is no longer a GP). Our work differs from all of these in that our GP corresponds exactly to a fully Bayesian CNN in the infinite channel limit.
Borovykh (2018) analyzes the convergence of network outputs to a GP after marginalizing over all inputs in a dataset, in the case of a temporal CNN. Thus, while they also consider a GP limit, they do not address the dependence of network outputs on specific inputs, and their model is unable to generate test set predictions.
In concurrent work, Garriga-Alonso et al. (2018) derive an NN-GP kernel equivalent to one of the kernels considered in our work. In addition to explicitly specifying kernels corresponding to pooling and vectorizing, we also compare the NN-GP performance to finite-width SGD-trained CNNs and analyze the differences between the two models.
2 Many-channel Bayesian CNNs are Gaussian processes
Consider a series of convolutional hidden layers, . The parameters of the network are the convolutional filters and biases, and , respectively, with outgoing (incoming) channel index () and filter relative spatial location . [Footnote 1: We will use Roman letters to index channels and Greek letters for spatial location. We use letters , etc to denote channel indices, , etc to denote spatial indices and , etc for filter indices.] For notational simplicity, we treat the 1D case with spatial dimension in the text, but the single spatial index can be extended to higher dimensions by replacing with tuples. Similarly, our analysis straightforwardly generalizes to strided convolutions (§A.3). Assume a Gaussian prior on both the filter weights and biases,
The weight and bias variances are, respectively. is the number of channels (filters) in layer , is the filter size, and is the fraction of the receptive field variance at location (with ). In experiments we utilize uniform , but nonuniform should enable kernel properties that are better suited for ultra-deep networks, as in Xiao et al. (2018).
Let denote a set of input images (training set or validation set or both). The network has activations and pre-activations for each input image , with input channel count , number of pixels , where
We emphasize the dependence of and on the input . is a pointwise nonlinearity. is assumed to be zero padded so that the spatial size is constant throughout the network.
A recurring quantity in this work will be the empirical uncentered covariance tensor of the activations , defined as
is therefore a 4-dimensional random variable indexed by two inputs and two spatial locations (the dependence on layer widths and their weights and biases is implied and by default not stated explicitly). , the empirical uncentered covariance of inputs, is deterministic.
Whenever an index is omitted, the variable is assumed to contain all possible entries along the respective dimension. E.g. is a tensor of shape , has the shape , has the shape , etc.
2.2 Correspondence between Gaussian processes and Bayesian deep CNNs with infinitely many channels
We next consider the prior over functions computed by a CNN in the limit of infinitely many channels in the hidden (excluding input and output) layers, for , and derive its equivalence to a GP with a compositional kernel. The following section gives a proof which uses the empirical uncentered covariance tensors to characterize finite width intermediate layers and relies on explicit Bayesian marginalization over these intermediate layers. In Appendix A.5 we give several alternative derivations of the correspondence.
2.2.1 A single convolutional layer is a GP conditioned on the uncentered covariance tensor of the previous layer’s activations
As can be seen in Equation 2, the pre-activation tensor is an affine transformation of the multivariate Gaussian , specified by the previous layer’s activations . An affine transformation of a multivariate Gaussian is itself a Gaussian. Specifically,
where the first equality in Equation 4 follows from the independence of the weights and biases for each channel . The uncentered covariance tensor for the pre-activations is derived in Xiao et al. (2018), where is an affine transformation (a cross-correlation operator followed by a shifting operator) defined as follows:
2.2.2 Uncentered covariance tensor becomes deterministic with increasing channel count
The summands in Equation 3 are i.i.d., due to the independence of the weights and biases for each channel . Subject to weak restrictions on the nonlinearity , we can apply the law of large numbers and conclude that,
2.2.3 Bayesian marginalization over all hidden layers
The distribution over the CNN outputs can be evaluated by marginalizing over all intermediate layer uncentered covariances in the network (see Figure 1):
In the limit of infinitely many channels in the hidden layers [Footnote 2: Unlike de G. Matthews et al. (2018), we do not require each to be strictly increasing.], all the conditional distributions except for converge weakly to delta functions and can be integrated out. Precisely, Equation 9 reduces to the expression in the following theorem.
If is Lipschitz, then we have the following convergence in distribution
i.e. composed with itself times and applied to .
In other words, is the (deterministic) covariance of the CNN activations in the limit of infinitely many (hence subscript) channels in each of the convolutional layers from to . See §A.5.3 for the proof. Therefore Equation 10 states that the outputs for any set of input examples and pixel indices are jointly Gaussian distributed – i.e. the output of a CNN with infinitely many channels in its hidden layers is described by a GP with a covariance function .
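To make the layer-by-layer composition concrete, the following is a minimal Python sketch for a 1D network with a single input channel. It is an illustration under stated assumptions, not the implementation behind the reported results. The function names (`affine_map`, `relu_map`, `input_covariance`, `cnn_gp_kernel`) are hypothetical. The affine map is assumed to average the covariance over filter offsets with uniform weighting and zero padding, scale by the weight variance, and add the bias variance. The nonlinearity is assumed to be ReLU, for which the Gaussian expectation has the closed arc-cosine form of Cho & Saul (2009).

```python
import math

def affine_map(K, k, sigma_w2, sigma_b2):
    """Assumed affine map: covariance of pre-activations from covariance of activations.

    K[x][xp][a][ap] is indexed by two inputs and two pixels; the filter covers
    offsets -k..k with uniform weighting, and the image is zero padded.
    """
    n, d = len(K), len(K[0][0])
    out = [[[[0.0] * d for _ in range(d)] for _ in range(n)] for _ in range(n)]
    for x in range(n):
        for xp in range(n):
            for a in range(d):
                for ap in range(d):
                    s = 0.0
                    for beta in range(-k, k + 1):
                        i, j = a + beta, ap + beta
                        if 0 <= i < d and 0 <= j < d:  # padded entries contribute zero
                            s += K[x][xp][i][j]
                    out[x][xp][a][ap] = sigma_b2 + sigma_w2 * s / (2 * k + 1)
    return out

def relu_map(K):
    """E[relu(u) relu(v)] for zero-mean Gaussians with covariance taken from K."""
    n, d = len(K), len(K[0][0])
    out = [[[[0.0] * d for _ in range(d)] for _ in range(n)] for _ in range(n)]
    for x in range(n):
        for xp in range(n):
            for a in range(d):
                for ap in range(d):
                    norm = math.sqrt(K[x][x][a][a] * K[xp][xp][ap][ap])
                    if norm > 0.0:
                        cos_t = max(-1.0, min(1.0, K[x][xp][a][ap] / norm))
                        theta = math.acos(cos_t)
                        out[x][xp][a][ap] = norm / (2 * math.pi) * (
                            math.sin(theta) + (math.pi - theta) * cos_t)
    return out

def input_covariance(inputs):
    """Uncentered covariance of single-channel inputs: products of pixel values."""
    n, d = len(inputs), len(inputs[0])
    return [[[[inputs[x][a] * inputs[xp][ap] for ap in range(d)]
              for a in range(d)] for xp in range(n)] for x in range(n)]

def cnn_gp_kernel(inputs, depth, k=1, sigma_w2=2.0, sigma_b2=0.1):
    """Compose (nonlinearity map after affine map) `depth` times on the input covariance."""
    K = input_covariance(inputs)
    for _ in range(depth):
        K = relu_map(affine_map(K, k, sigma_w2, sigma_b2))
    return K

inputs = [[1.0, 0.0, -1.0, 0.5], [0.5, 1.0, 0.0, -1.0]]
K2 = cnn_gp_kernel(inputs, depth=2)
print("covariance between the two inputs at pixel 0:", K2[0][1][0][0])
```

The tensor has one entry per pair of inputs and pair of pixels, so its size grows with the square of both the number of inputs and the number of pixels. The hyperparameter values in the example are arbitrary.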
3 Transforming a GP over spatial locations into a GP over classes
In §2.2 we have shown that in the infinite channel limit a deep CNN is a GP indexed by input samples and spatial locations of the top layer. Further, its uncentered covariance tensor can be computed in closed form. Here we show that transformations to obtain class predictions that are common in CNN classifiers can be represented as either vectorization or projection (as long as we treat classification as regression, similarly to Lee et al. (2018)). Both of these operations preserve the GP equivalence and allow the computation of the covariance tensor of the respective GP (now indexed by input samples and target classes) as a simple transformation of .
One common readout strategy is to vectorize (flatten) the output of the last convolutional layer into a vector and stack a fully connected layer on top:
where the weights and biases are i.i.d. Gaussian, , and is the number of classes. The sample-sample kernel of the output (identical for each class ) of this particular GP, denoted by , is
where the limit of infinite width is derived identically to §2.2. As observed in Xiao et al. (2018), to compute any diagonal terms of , one needs only the corresponding diagonal terms of . Consequently, we only need to store and the memory cost is (or per covariance entry in an iterative or distributed setting). Note that this approach ignores pixel-pixel covariances and produces a GP corresponding to a locally-connected network (see §5.1).
Another approach is a projection collapsing the spatial dimensions. Let be a deterministic vector, , and be the same as above.
Define the output to be
where the limiting behavior is derived identically to Equation 12. Examples of this approach include
1. Global average pooling: take and denote this particular GP as . Then
   This approach corresponds to applying global average pooling right after the last convolutional layer. [Footnote 3: Spatially local average pooling in intermediary layers can be constructed in a similar fashion (§A.3). We focus on global average pooling in this work to more effectively isolate the effects of pooling from other aspects of the model like local connectivity or equivariance.] This approach takes all pixel-pixel covariance into consideration and makes the kernel translation invariant. However, it requires memory to compute the sample-sample covariance of the GP (or per covariance entry in an iterative or distributed setting). It is impractical to use this method to analytically evaluate the GP, and we propose to use a Monte Carlo approach (see §4).
2. Subsampling one particular pixel: take ,
   This approach (denoted ) makes use of only one pixel-pixel covariance, and requires the same amount of memory as to compute.
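The three readouts of this section can be compared side by side with a short Python sketch. It is illustrative only. The function names are hypothetical, and the normalizations are assumptions: the vectorized readout is taken to average the same-pixel covariances over the pixels, and the projection readout is taken to contract both pixel indices with the projection vector, each scaled by the readout weight variance with the bias variance added. A raw input outer product stands in for the top-layer covariance tensor.

```python
def vectorize_readout(K, sigma_w2, sigma_b2):
    """Flatten-and-dense readout: uses only same-pixel (diagonal) covariances."""
    n, d = len(K), len(K[0][0])
    return [[sigma_b2 + sigma_w2 * sum(K[x][xp][a][a] for a in range(d)) / d
             for xp in range(n)] for x in range(n)]

def projection_readout(K, h, sigma_w2, sigma_b2):
    """Project the spatial dimension onto a fixed vector h before the dense layer."""
    n, d = len(K), len(h)
    return [[sigma_b2 + sigma_w2 * sum(h[a] * h[ap] * K[x][xp][a][ap]
                                       for a in range(d) for ap in range(d))
             for xp in range(n)] for x in range(n)]

def pooling_readout(K, sigma_w2, sigma_b2):
    """Global average pooling: h has every entry equal to 1/d."""
    d = len(K[0][0])
    return projection_readout(K, [1.0 / d] * d, sigma_w2, sigma_b2)

def subsample_readout(K, pixel, sigma_w2, sigma_b2):
    """Keep a single pixel: h is a one-hot vector."""
    d = len(K[0][0])
    h = [1.0 if a == pixel else 0.0 for a in range(d)]
    return projection_readout(K, h, sigma_w2, sigma_b2)

# Two 2-pixel inputs; the second is the first translated by one pixel.
X = [[1.0, 0.0], [0.0, 1.0]]
K = [[[[X[x][a] * X[xp][ap] for ap in range(2)] for a in range(2)]
      for xp in range(2)] for x in range(2)]

vec = vectorize_readout(K, sigma_w2=1.0, sigma_b2=0.0)
pool = pooling_readout(K, sigma_w2=1.0, sigma_b2=0.0)
sub = subsample_readout(K, pixel=0, sigma_w2=1.0, sigma_b2=0.0)
print(vec[0][0], vec[0][1])    # 0.5 0.0
print(pool[0][0], pool[0][1])  # 0.25 0.25
print(sub[0][0], sub[0][1])    # 1.0 0.0
```

In this toy case the pooling kernel assigns the same value to an input paired with itself and paired with its translation, whereas vectorization and subsampling treat the two inputs as uncorrelated. Only the pooling readout reads the off-diagonal pixel-pixel entries, which is why it is the one that needs the full tensor.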
4 Monte Carlo evaluation of intractable GP kernels
We introduce a Monte Carlo estimation method for NN-GP kernels which are computationally impractical to compute analytically, or for which we do not know the analytic form. Similar in spirit to traditional random feature methods (Rahimi & Recht, 2007), the core idea is to instantiate many random finite width networks and use the empirical uncentered covariances of activations to estimate the Monte Carlo-GP (MC-GP) kernel,
where consists of draws of the weights and biases from their prior distribution, , and is the width or number of channels in hidden layers. The MC-GP kernel converges to the analytic kernel with increasing width, in probability.
For finite width networks, the uncertainty in is . From Daniely et al. (2016), we know that , which leads to . For finite , is also a biased estimate of , where the bias depends solely on network width. We do not currently have an analytic form for this bias, but we can see in Figures 3 and 7 that for the hyperparameters we probe it is small relative to the variance. In particular, is nearly constant for constant . We thus treat as the effective sample size for the Monte Carlo kernel estimate. Increasing and reducing can reduce memory cost, though potentially at the expense of increased compute time and bias.
In a non-distributed setting, the MC-GP reduces the memory requirements to compute from to , making the evaluation of CNN-GPs with pooling practical.
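A minimal Python sketch of the Monte Carlo estimate, for a single hidden convolutional layer followed by global average pooling, is given below. It is a sketch under stated assumptions rather than the implementation used for the experiments. All function names are hypothetical, the input is 1D with one channel, the filter weighting is uniform with zero padding, and the nonlinearity is ReLU so that an analytic reference is available through the arc-cosine formula (Cho & Saul, 2009). The estimator stores only one pooled feature per channel and input, never the pixel-pixel covariance tensor.

```python
import math
import random

def relu_expectation(k11, k12, k22):
    """E[relu(u) relu(v)] for zero-mean jointly Gaussian (u, v)."""
    norm = math.sqrt(k11 * k22)
    if norm == 0.0:
        return 0.0
    cos_t = max(-1.0, min(1.0, k12 / norm))
    theta = math.acos(cos_t)
    return norm / (2 * math.pi) * (math.sin(theta) + (math.pi - theta) * cos_t)

def preact_cov(x, xp, a, ap, k, sigma_w2, sigma_b2):
    """Assumed pre-activation covariance between pixel a of x and pixel ap of xp."""
    d = len(x)
    s = sum(x[a + beta] * xp[ap + beta] for beta in range(-k, k + 1)
            if 0 <= a + beta < d and 0 <= ap + beta < d)
    return sigma_b2 + sigma_w2 * s / (2 * k + 1)

def analytic_pool_kernel(x, xp, k, sigma_w2, sigma_b2):
    """Reference value: average the ReLU expectation over all pixel pairs."""
    d = len(x)
    total = 0.0
    for a in range(d):
        for ap in range(d):
            total += relu_expectation(
                preact_cov(x, x, a, a, k, sigma_w2, sigma_b2),
                preact_cov(x, xp, a, ap, k, sigma_w2, sigma_b2),
                preact_cov(xp, xp, ap, ap, k, sigma_w2, sigma_b2))
    return sigma_b2 + sigma_w2 * total / d ** 2

def pooled_feature(x, w, b, k):
    """One channel: convolve (zero padding), apply ReLU, average over pixels."""
    d = len(x)
    total = 0.0
    for a in range(d):
        z = b + sum(w[beta + k] * x[a + beta]
                    for beta in range(-k, k + 1) if 0 <= a + beta < d)
        total += max(z, 0.0)
    return total / d

def mc_pool_kernel(x, xp, k, sigma_w2, sigma_b2, n_channels, n_draws, rng):
    """Average products of pooled features over random networks and channels."""
    w_std = math.sqrt(sigma_w2 / (2 * k + 1))
    b_std = math.sqrt(sigma_b2)
    acc = 0.0
    for _ in range(n_draws):          # independent draws of the parameters
        for _ in range(n_channels):   # channels within one finite-width network
            w = [rng.gauss(0.0, w_std) for _ in range(2 * k + 1)]
            b = rng.gauss(0.0, b_std)
            acc += pooled_feature(x, w, b, k) * pooled_feature(xp, w, b, k)
    return sigma_b2 + sigma_w2 * acc / (n_draws * n_channels)

rng = random.Random(0)
x = [1.0, 0.0, -1.0, 0.5]
xp = [0.5, 1.0, 0.0, -1.0]
exact = analytic_pool_kernel(x, xp, k=1, sigma_w2=2.0, sigma_b2=0.1)
estimate = mc_pool_kernel(x, xp, k=1, sigma_w2=2.0, sigma_b2=0.1,
                          n_channels=400, n_draws=100, rng=rng)
print(f"analytic: {exact:.4f}  Monte Carlo: {estimate:.4f}")
```

With a single hidden layer the channels are i.i.d., so the estimate is unbiased and its accuracy depends only on the product of the number of draws and the number of channels. For deeper networks this sketch would not capture the width-dependent bias discussed above.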
5.1 Bayesian CNNs with many channels are identical to locally connected networks, in the absence of pooling
Locally Connected Networks (LCNs) (Fukushima, 1975; Lecun, 1989) are CNNs without weight sharing between spatial locations. LCNs preserve the connectivity pattern, and thus topology, of a CNN. However, they do not possess the equivariance property of a CNN – if an input is translated, the latent representation in an LCN will be completely different, rather than also being translated.
The CNN-GP predictions without spatial pooling in §3.1 and item 2 depend only on sample-sample covariances, and do not depend on pixel-pixel covariances. LCNs destroy pixel-pixel covariances: , for and all and . However, LCNs preserve the covariances between input examples at every pixel: . As a result, in the absence of pooling, LCN-GPs and CNN-GPs are identical. Moreover, LCN-GPs with pooling are identical to CNN-GPs with vectorization of the top layer (under suitable scaling of ). We confirm these findings experimentally in trained networks in the limit of large width in Figure 4 (b), as well as by demonstrating convergence of MC-GPs of the respective architectures to the same CNN-GP (modulo scaling of ) in Figures 3 and 7.
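The effect of weight sharing on the two kinds of covariance can be checked numerically with a small Python sketch. It is illustrative only and makes simplifying assumptions: a single linear 1D layer with zero padding, no bias, unit-variance filter taps, and hypothetical function names. A CNN reuses one filter at every pixel, while an LCN draws an independent filter per pixel.

```python
import random

def preactivations(x, filters, k):
    """Pre-activations of a 1D layer; filters[a] is the filter used at pixel a."""
    d = len(x)
    return [sum(filters[a][beta + k] * x[a + beta]
                for beta in range(-k, k + 1) if 0 <= a + beta < d)
            for a in range(d)]

def sample_filters(d, k, share_weights, rng):
    """CNN: one filter shared by all pixels. LCN: an independent filter per pixel."""
    if share_weights:
        shared = [rng.gauss(0.0, 1.0) for _ in range(2 * k + 1)]
        return [shared] * d
    return [[rng.gauss(0.0, 1.0) for _ in range(2 * k + 1)] for _ in range(d)]

def estimate_cov(x, xp, a, ap, k, share_weights, n_samples, rng):
    """Monte Carlo estimate of E[z_a(x) z_ap(xp)] over random filters."""
    d = len(x)
    acc = 0.0
    for _ in range(n_samples):
        filters = sample_filters(d, k, share_weights, rng)
        acc += preactivations(x, filters, k)[a] * preactivations(xp, filters, k)[ap]
    return acc / n_samples

rng = random.Random(0)
x = [1.0, 2.0, 1.0, 0.5]
xp = [0.5, 1.0, 2.0, 1.0]  # x translated by one pixel
n = 20000

# Same pixel, two different inputs: identical in expectation for CNN and LCN (4.5).
cnn_same = estimate_cov(x, xp, 1, 1, 1, True, n, rng)
lcn_same = estimate_cov(x, xp, 1, 1, 1, False, n, rng)
# Different pixels: nonzero for the CNN (6.0 in expectation), zero for the LCN.
cnn_cross = estimate_cov(x, xp, 1, 2, 1, True, n, rng)
lcn_cross = estimate_cov(x, xp, 1, 2, 1, False, n, rng)
print(cnn_same, lcn_same, cnn_cross, lcn_cross)
```

The quoted expectations follow from summing products of the input values under the filter: 0.5 + 2 + 2 = 4.5 for the same-pixel case and 1 + 4 + 1 = 6 for the cross-pixel case with shared weights. A readout that never reads the cross-pixel entries therefore cannot distinguish the two architectures.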
5.2 Pooling leverages equivariance to provide invariance
The only kernel leveraging pixel-pixel covariances is that of the CNN-GP with pooling. This enables the predictions of this GP and the corresponding CNN to be invariant to translations (modulo edge effects) – a beneficial quality for an image classifier. We observe strong experimental evidence supporting the benefits of invariance throughout this work (Figures 2, 3, 4 (b); Tables 1, 2), in both CNNs and CNN-GPs.
5.3 Finite-channel SGD-trained CNNs can outperform infinite-channel Bayesian CNNs, in the absence of pooling
In the absence of pooling, the benefits of equivariance and weight sharing are more challenging to explain in terms of Bayesian priors on class predictions (since without pooling equivariance is not a property of the outputs, but only of intermediary representations). Indeed, in this work we find that the performance of finite-width SGD-trained CNNs often approaches that of their CNN-GP counterpart (Figure 4, b, c) [Footnote 4: This observation is conditioned on the respective NN fitting the training set to . Underfitting breaks the correspondence to an NN-GP, since train set predictions of such a network no longer correspond to the true training labels. Properly tuned underfitting often also leads to better generalization (Table 2).], suggesting that in those cases equivariance does not play a beneficial role in SGD-trained networks.
However, as can be seen in Tables 1, 2 and Figure 4 (c), the best CNN overall outperforms the best CNN-GP by a significant margin – an observation specific to CNNs and not FCNs or LCNs. We observe this gap in performance especially in the case of networks trained with a large learning rate. In Table 1 we demonstrate this large gap in performance by evaluating different models with equivalent architecture and hyperparameter settings, chosen for good SGD-trained CNN performance.
We conjecture that equivariance, a property lacking in LCNs and the Bayesian treatment of the infinite channel CNN limit, contributes to the performance of SGD-trained finite-channel CNNs with the correct settings of hyperparameters. Nonetheless, more work is needed to disentangle and quantify the separate contributions of stochastic optimization and finite width effects to differences in performance between CNNs with weight sharing and their corresponding CNN-GPs.
|(b)||No Pooling||Global Average Pooling|
|Model:||FCN||FCN-GP||LCN (w/ pooling)||CNN-GP||CNN||CNN w/ pooling|
Table 1: Disentangling the role of network topology, equivariance, and invariance on test performance, for SGD-trained and infinite width Bayesian networks. Test error (%) on CIFAR10 of different models of the same depth, nonlinearity, and weight and bias variances. LCN and CNN-GP have a hierarchical local topology, beneficial for image recognition tasks and outperform fully connected models (FCN and FCN-GP). As predicted in §5.1: (i) weight sharing has no effect in the Bayesian treatment of an infinite width CNN (CNN-GP performs similarly to an LCN, a CNN without weight sharing), and (ii) pooling has no effect on generalization of an LCN model (LCN and LCN with pooling perform nearly identically). Local connectivity combined with equivariance (CNN) is enabled by weight sharing in an SGD-trained finite model, allowing for a significant improvement. Finally, invariance enabled by weight sharing and pooling allows for the best performance. Values are reported for 8-layer models. See §A.7.6 for experimental details and Table 2 for more model comparisons.
|CNN with pooling||()|
|CNN with and large learning rate||()|
|CNN with small learning rate|
|CNN with (any learning rate)|
|Convolutional GP (van der Wilk et al., 2017)|
|ResNet GP (Garriga-Alonso et al., 2018)|
|Residual CNN-GP (Garriga-Alonso et al., 2018)|
|CNN-GP (Garriga-Alonso et al., 2018)|
|FCN-GP (Lee et al., 2018)|
In this work we have derived a Gaussian process that corresponds to a deep fully Bayesian CNN with infinitely many channels. The covariance of this GP can be efficiently computed either in closed form or by using Monte Carlo sampling, depending on the architecture.
The CNN-GP achieves state of the art results for GPs without trainable kernels on CIFAR10. It can perform competitively with CNNs (that fit the training set) of equivalent architecture and weight priors, which makes it an appealing choice for small datasets, as it eliminates all training-related hyperparameters. However, we found that the best overall performance is achieved by finite SGD-trained CNNs and not by their infinite Bayesian counterparts. We hope our work stimulates future research into disentangling the contributions of the two qualities (Bayesian treatment and infinite width) to the performance gap observed.
We thank Greg Yang, Sam Schoenholz, Vinay Rao, Daniel Freeman, and Qiang Zeng for frequent discussion and feedback on preliminary results.
- Abadi et al. (2016) Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, et al. Tensorflow: Large-scale machine learning on heterogeneous distributed systems. arXiv preprint arXiv:1603.04467, 2016.
- Anonymous (2019) Anonymous. Deep convolutional gaussian process. In Submitted to International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=HyeUPi09Y7. under review.
- Blomqvist et al. (2018) Kenneth Blomqvist, Samuel Kaski, and Markus Heinonen. Deep convolutional gaussian processes. arXiv preprint arXiv:1810.03052, 2018.
- Borovykh (2018) Anastasia Borovykh. A gaussian process perspective on convolutional neural networks. ResearchGate:325192731, 05 2018. URL https://www.researchgate.net/publication/325192731.
- Bradshaw et al. (2017) John Bradshaw, Alexander G de G Matthews, and Zoubin Ghahramani. Adversarial examples, uncertainty, and transfer testing robustness in gaussian process hybrid deep networks. arXiv preprint arXiv:1707.02476, 2017.
- Canziani et al. (2016) Alfredo Canziani, Adam Paszke, and Eugenio Culurciello. An analysis of deep neural network models for practical applications. arXiv preprint arXiv:1605.07678, 2016.
- Chen et al. (2018) Minmin Chen, Jeffrey Pennington, and Samuel Schoenholz. Dynamical isometry and a mean field theory of RNNs: Gating enables signal propagation in recurrent neural networks. In Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pp. 873–882, Stockholmsmässan, Stockholm Sweden, 10–15 Jul 2018. PMLR. URL http://proceedings.mlr.press/v80/chen18i.html.
- Cho & Saul (2009) Youngmin Cho and Lawrence K Saul. Kernel methods for deep learning. In Advances in neural information processing systems, pp. 342–350, 2009.
- Choromanska et al. (2015) Anna Choromanska, Mikael Henaff, Michael Mathieu, Gérard Ben Arous, and Yann LeCun. The loss surfaces of multilayer networks. In Artificial Intelligence and Statistics, pp. 192–204, 2015.
- Cohen & Welling (2016) Taco Cohen and Max Welling. Group equivariant convolutional networks. In International conference on machine learning, pp. 2990–2999, 2016.
- Damianou & Lawrence (2013) Andreas Damianou and Neil Lawrence. Deep gaussian processes. In Artificial Intelligence and Statistics, pp. 207–215, 2013.
- Daniely et al. (2016) Amit Daniely, Roy Frostig, and Yoram Singer. Toward deeper understanding of neural networks: The power of initialization and a dual view on expressivity. In Advances In Neural Information Processing Systems, pp. 2253–2261, 2016.
- de G. Matthews et al. (2018) Alexander G. de G. Matthews, Jiri Hron, Mark Rowland, Richard E. Turner, and Zoubin Ghahramani. Gaussian process behaviour in wide deep neural networks. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=H1-nGgWC-.
- Fukushima (1975) Kunihiko Fukushima. Cognitron: A self-organizing multilayered neural network. Biological cybernetics, 20(3-4):121–136, 1975.
- Fukushima & Miyake (1982) Kunihiko Fukushima and Sei Miyake. Neocognitron: A self-organizing neural network model for a mechanism of visual pattern recognition. In Competition and cooperation in neural nets, pp. 267–285. Springer, 1982.
- Garriga-Alonso et al. (2018) Adrià Garriga-Alonso, Laurence Aitchison, and Carl Edward Rasmussen. Deep convolutional networks as shallow Gaussian processes. arXiv preprint arXiv:1808.05587, aug 2018. URL https://arxiv.org/abs/1808.05587.
- Golovin et al. (2017) Daniel Golovin, Benjamin Solnik, Subhodeep Moitra, Greg Kochanski, John Karro, and D Sculley. Google vizier: A service for black-box optimization. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1487–1495. ACM, 2017.
- Goodfellow et al. (2015) Ian J Goodfellow, Oriol Vinyals, and Andrew M Saxe. Qualitatively characterizing neural network optimization problems. International Conference on Learning Representations, 2015.
- Hanin & Rolnick (2018) Boris Hanin and David Rolnick. How to start training: The effect of initialization and architecture. arXiv preprint arXiv:1803.01719, 2018.
- Hazan & Jaakkola (2015) Tamir Hazan and Tommi Jaakkola. Steps toward deep kernel methods from infinite neural networks. arXiv preprint arXiv:1508.05133, 2015.
- He et al. (2016) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778, 2016.
- Kingma & Ba (2015) Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. 3rd International Conference for Learning Representations, 2015.
- Krizhevsky (2009) Alex Krizhevsky. Learning multiple layers of features from tiny images. Technical report, Citeseer, 2009.
- Kumar et al. (2018) Vinayak Kumar, Vaibhav Singh, PK Srijith, and Andreas Damianou. Deep gaussian processes with convolutional kernels. arXiv preprint arXiv:1806.01655, 2018.
- Lawrence & Moore (2007) Neil D Lawrence and Andrew J Moore. Hierarchical gaussian process latent variable models. In Proceedings of the 24th international conference on Machine learning, pp. 481–488. ACM, 2007.
- Le Roux & Bengio (2007) Nicolas Le Roux and Yoshua Bengio. Continuous neural networks. In Artificial Intelligence and Statistics, pp. 404–411, 2007.
- Lecun (1989) Yann Lecun. Generalization and network design strategies. In Connectionism in perspective. Elsevier, 1989.
- LeCun et al. (1998) Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
- Lee et al. (2018) Jaehoon Lee, Yasaman Bahri, Roman Novak, Sam Schoenholz, Jeffrey Pennington, and Jascha Sohl-dickstein. Deep neural networks as gaussian processes. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=B1EA-M-0Z.
- Lin et al. (2017) Henry W Lin, Max Tegmark, and David Rolnick. Why does deep and cheap learning work so well? Journal of Statistical Physics, 168(6):1223–1247, 2017.
- Nair & Hinton (2010) Vinod Nair and Geoffrey E Hinton. Rectified linear units improve restricted boltzmann machines. In Proceedings of the 27th international conference on machine learning (ICML-10), pp. 807–814, 2010.
- Neal (1994) Radford M. Neal. Priors for infinite networks (tech. rep. no. crg-tr-94-1). University of Toronto, 1994.
- Neyshabur et al. (2015) Behnam Neyshabur, Ryota Tomioka, and Nathan Srebro. In search of the real inductive bias: On the role of implicit regularization in deep learning. Proceeding of the international Conference on Learning Representations workshop track, abs/1412.6614, 2015.
- Novak et al. (2018) Roman Novak, Yasaman Bahri, Daniel A. Abolafia, Jeffrey Pennington, and Jascha Sohl-Dickstein. Sensitivity and generalization in neural networks: an empirical study. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=HJC2SzZCW.
- Oord et al. (2016) Aaron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew Senior, and Koray Kavukcuoglu. Wavenet: A generative model for raw audio. arXiv preprint arXiv:1609.03499, 2016.
- Pascanu et al. (2014) Razvan Pascanu, Yann N Dauphin, Surya Ganguli, and Yoshua Bengio. On the saddle point problem for non-convex optimization. arXiv preprint arXiv:1405.4604, 2014.
- Poggio et al. (2017) Tomaso Poggio, Hrushikesh Mhaskar, Lorenzo Rosasco, Brando Miranda, and Qianli Liao. Why and when can deep-but not shallow-networks avoid the curse of dimensionality: A review. International Journal of Automation and Computing, 14(5):503–519, Oct 2017. ISSN 1751-8520. doi: 10.1007/s11633-017-1054-2. URL https://doi.org/10.1007/s11633-017-1054-2.
- Poole et al. (2016) Ben Poole, Subhaneil Lahiri, Maithra Raghu, Jascha Sohl-Dickstein, and Surya Ganguli. Exponential expressivity in deep neural networks through transient chaos. In Advances In Neural Information Processing Systems, pp. 3360–3368, 2016.
- Quiñonero-Candela & Rasmussen (2005) Joaquin Quiñonero-Candela and Carl Edward Rasmussen. A unifying view of sparse approximate gaussian process regression. Journal of Machine Learning Research, 6(Dec):1939–1959, 2005.
- Rahimi & Recht (2007) Ali Rahimi and Ben Recht. Random features for large-scale kernel machines. In Neural Information Processing Systems, 2007.
- Rasmussen & Williams (2006) Carl Edward Rasmussen and Christopher KI Williams. Gaussian processes for machine learning, volume 1. MIT press Cambridge, 2006.
- Rumelhart et al. (1985) David E Rumelhart, Geoffrey E Hinton, and Ronald J Williams. Learning internal representations by error propagation. Technical report, California Univ San Diego La Jolla Inst for Cognitive Science, 1985.
- Schoenholz et al. (2017) Samuel S Schoenholz, Justin Gilmer, Surya Ganguli, and Jascha Sohl-Dickstein. Deep information propagation. ICLR, 2017.
- Silver et al. (2017) David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, et al. Mastering the game of go without human knowledge. Nature, 550(7676):354, 2017.
- Tiňo et al. (2004) Peter Tiňo, Michal Cernansky, and Lubica Benuskova. Markovian architectural bias of recurrent neural networks. IEEE Transactions on Neural Networks, 15(1):6–15, 2004.
- Tiňo et al. (2007) Peter Tiňo, Barbara Hammer, and Mikael Bodén. Markovian bias of neural-based architectures with feedback connections. In Perspectives of neural-symbolic integration, pp. 95–133. Springer, 2007.
- van der Wilk et al. (2017) Mark van der Wilk, Carl Edward Rasmussen, and James Hensman. Convolutional gaussian processes. In Advances in Neural Information Processing Systems 30, pp. 2849–2858, 2017.
- Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, pp. 5998–6008, 2017.
- Vershynin (2010) Roman Vershynin. Introduction to the non-asymptotic analysis of random matrices. arXiv preprint arXiv:1011.3027, 2010.
- Werbos (1988) Paul J Werbos. Generalization of backpropagation with application to a recurrent gas market model. Neural Networks, 1(4):339–356, 1988.
- Williams (1997) Christopher KI Williams. Computing with infinite networks. In Advances in neural information processing systems, pp. 295–301, 1997.
- Wilson et al. (2016a) Andrew G Wilson, Zhiting Hu, Ruslan R Salakhutdinov, and Eric P Xing. Stochastic variational deep kernel learning. In Advances in Neural Information Processing Systems, pp. 2586–2594, 2016a.
- Wilson et al. (2016b) Andrew Gordon Wilson, Zhiting Hu, Ruslan Salakhutdinov, and Eric P Xing. Deep kernel learning. In Artificial Intelligence and Statistics, pp. 370–378, 2016b.
- Xiao et al. (2017) Han Xiao, Kashif Rasul, and Roland Vollgraf. Fashion-mnist: a novel image dataset for benchmarking machine learning algorithms, 2017.
- Xiao et al. (2018) Lechao Xiao, Yasaman Bahri, Jascha Sohl-Dickstein, Samuel Schoenholz, and Jeffrey Pennington. Dynamical isometry and a mean field theory of CNNs: How to train 10,000-layer vanilla convolutional neural networks. In Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pp. 5393–5402, Stockholmsmässan, Stockholm Sweden, 10–15 Jul 2018. PMLR. URL http://proceedings.mlr.press/v80/xiao18a.html.
- Yang & Schoenholz (2017) Ge Yang and Samuel Schoenholz. Mean field residual networks: On the edge of chaos. In Advances in neural information processing systems, pp. 7103–7114, 2017.
- Zhang et al. (2017) Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization. In International Conference on Learning Representations, 2017.
Appendix A Appendix
A.1 Additional Figures
[Figures: panel titles "No Pooling" and "Global Average Pooling" (appearing twice); "MC-CNN-GP with pooling" and "MC-LCN-GP with pooling".]
Caption fragment: the geometric mean of the ratios of the kernel distance from (3-layer) MC-CNN-GP and MC-LCN-GP to the respective CNN-GP is [value missing]. See §A.7.2 for experimental details.
A.2 Relationship to Deep Signal Propagation
The recurrence relation linking the GP kernel at layer $l+1$ to that of layer $l$, which follows from Equation 10, is precisely the covariance map examined in a series of related papers on signal propagation (Xiao et al., 2018; Poole et al., 2016; Schoenholz et al., 2017; Lee et al., 2018), modulo notational differences. In those works, the action of this map on hidden-state covariance matrices was interpreted as defining a dynamical system whose large-depth behavior informs aspects of trainability. In particular, as $l \to \infty$ the covariance approaches a fixed point. The convergence to a fixed point is problematic for learning because the hidden states no longer contain information that can distinguish different pairs of inputs. It is similarly problematic for GPs, as the kernel becomes pathological as it approaches a fixed point. Precisely, in the chaotic regime outputs of the GP become asymptotically decorrelated and therefore independent, while in the ordered regime they approach perfect correlation of $1$. In either scenario the kernel captures no information about the training data, which makes learning infeasible.
This problem can be ameliorated by judicious hyperparameter selection, which can reduce the rate of exponential convergence to the fixed point. For hyperparameters chosen on a critical line separating two untrainable phases, the convergence rate slows to polynomial, so that very deep networks can be trained and inference with deep NN-GP kernels can be performed (see Table 3).
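The contrast between exponential and polynomial convergence can be illustrated with a minimal sketch. It assumes a fully connected ReLU network rather than the CNN setting of the main text, for which the covariance map has a well-known closed form (the degree-1 arc-cosine kernel). The function names and the hyperparameter values `sigma_w2` (weight variance) and `sigma_b2` (bias variance) are illustrative choices, not taken from the paper.

```python
import math

def relu_covariance_step(q, c, sigma_w2, sigma_b2):
    """One step of the fully connected ReLU covariance map (an assumed toy setting).

    q is the shared variance of two inputs and c their correlation.
    Returns the variance and correlation one layer deeper.
    """
    c = max(-1.0, min(1.0, c))  # guard against rounding outside [-1, 1]
    # E[relu(u) relu(v)] for jointly Gaussian (u, v) with variance q and correlation c.
    cross = q / (2.0 * math.pi) * (math.sqrt(1.0 - c * c) + (math.pi - math.acos(c)) * c)
    q_next = sigma_w2 * q / 2.0 + sigma_b2
    c_next = (sigma_w2 * cross + sigma_b2) / q_next
    return q_next, c_next

def correlation_gap(depth, sigma_w2, sigma_b2, q0=1.0, c0=0.0):
    """Distance 1 - c of the correlation from the fixed point c = 1, layer by layer."""
    q, c = q0, c0
    gaps = [1.0 - c]
    for _ in range(depth):
        q, c = relu_covariance_step(q, c, sigma_w2, sigma_b2)
        gaps.append(1.0 - c)
    return gaps

# Ordered phase (illustrative values): the gap shrinks geometrically with depth.
ordered = correlation_gap(50, sigma_w2=1.0, sigma_b2=0.5)
# Critical point for ReLU (sigma_w2 = 2, sigma_b2 = 0): the gap shrinks only polynomially.
critical = correlation_gap(50, sigma_w2=2.0, sigma_b2=0.0)
```

Under these assumptions, after 50 layers the ordered-phase correlation is numerically indistinguishable from 1, while at the critical point two initially uncorrelated inputs remain distinguishable.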
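A.3 Strided convolutions and average pooling in intermediate layers
Our analysis in the main text can easily be extended to cover average pooling and strided convolutions (applied before the pointwise nonlinearity). Recall that, conditioned on the covariance $K^l$, the pre-activation $z^l_j(x) \in \mathbb{R}^{d_1}$ is a mean-zero multivariate Gaussian. Let $B \in \mathbb{R}^{d_2 \times d_1}$ denote a linear operator. Then $B z^l_j(x) \in \mathbb{R}^{d_2}$ is a mean-zero Gaussian, and its covariance is
$$\mathbb{E}\left[\left(B z^l_j(x)\right)\left(B z^l_j(x')\right)^T \,\middle|\, K^l\right] = B\, \mathbb{E}\left[z^l_j(x)\, z^l_j(x')^T \,\middle|\, K^l\right] B^T.$$
One can easily see that $\{B z^l_j\}_j$ are i.i.d. multivariate Gaussian.
Strided convolution. Strided convolution is equivalent to a non-strided convolution composed with sub-sampling. Let $s \in \mathbb{N}$ denote the size of the stride. Then the strided convolution is equivalent to choosing $B$ as follows: $B_{ij} = \delta(is - j)$ for $i \in \{0, 1, \dots, d_2 - 1\}$.
Average pooling. Average pooling with stride $s$ and window size $w$ is equivalent to choosing $B_{ij} = 1/w$ for $i \in \{0, 1, \dots, d_2 - 1\}$ and $j \in \{is, \dots, is + w - 1\}$, and $B_{ij} = 0$ otherwise.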
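The two choices of linear operator can be sketched in a few lines of NumPy. This is a one-dimensional illustration under assumed conventions (zero-based indexing, no padding). The helper names and the example covariance `K` are hypothetical and only serve to show how a covariance transforms under sub-sampling or pooling.

```python
import numpy as np

def subsample_matrix(d1, stride):
    """Operator with B[i, j] = 1 if j == i * stride, else 0 (keeps every stride-th position)."""
    d2 = (d1 + stride - 1) // stride
    B = np.zeros((d2, d1))
    B[np.arange(d2), np.arange(d2) * stride] = 1.0
    return B

def average_pool_matrix(d1, stride, window):
    """Operator averaging `window` consecutive positions, moving by `stride` (no padding)."""
    d2 = (d1 - window) // stride + 1
    B = np.zeros((d2, d1))
    for i in range(d2):
        B[i, i * stride : i * stride + window] = 1.0 / window
    return B

def transformed_covariance(B, K):
    """Covariance of B z when z is mean-zero with covariance K."""
    return B @ K @ B.T

# Hypothetical covariance over 6 spatial positions, used only for illustration.
d1 = 6
idx = np.arange(d1)
K = np.exp(-np.abs(idx[:, None] - idx[None, :]))

K_strided = transformed_covariance(subsample_matrix(d1, stride=2), K)
K_pooled = transformed_covariance(average_pool_matrix(d1, stride=2, window=2), K)
```

In this sketch sub-sampling simply selects the rows and columns of `K` at the retained positions, whereas pooling replaces each entry by the mean of the corresponding block of `K`.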
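A.4 Review of exact Bayesian regression with GPs
Our discussion in the paper has focused on model priors. A crucial benefit we derive by mapping to a GP is that Bayesian inference is straightforward to implement and can be done exactly for regression (Rasmussen & Williams, 2006, chapter 2), requiring only simple linear algebra. Let $\mathcal{X}$ denote the training inputs $x_1, \dots, x_{|\mathcal{X}|}$, $\mathbf{t}^T = (t_1, \dots, t_{|\mathcal{X}|})$ the training targets, and $\mathcal{D}$ collectively the training set. The integral over the posterior can be evaluated analytically to give a posterior predictive distribution on a test point $x^*$ which is Normal, $\mathcal{N}(\mu_*, \sigma_*^2)$, with
$$\mu_* = K(x^*, \mathcal{X})^T \left(K(\mathcal{X}, \mathcal{X}) + \sigma_\varepsilon^2 \mathbb{I}\right)^{-1} \mathbf{t},$$
$$\sigma_*^2 = K(x^*, x^*) - K(x^*, \mathcal{X})^T \left(K(\mathcal{X}, \mathcal{X}) + \sigma_\varepsilon^2 \mathbb{I}\right)^{-1} K(x^*, \mathcal{X}),$$
where $\sigma_\varepsilon^2$ is the variance of the observation noise. We use the shorthand $K(\mathcal{X}, \mathcal{X})$ to denote the $|\mathcal{X}| \times |\mathcal{X}|$ matrix formed by evaluating the GP covariance on the training inputs, and likewise $K(x^*, \mathcal{X})$ is a $|\mathcal{X}|$-length vector formed from the covariance between the test input and training inputs. Computationally, the costly step in GP posterior predictions comes from the matrix inversion, which in all experiments was carried out exactly, and typically scales as $\mathcal{O}(|\mathcal{X}|^3)$ (though algorithms with lower asymptotic cost exist for sufficiently large matrices). Nonetheless, there is a broad literature on approximate Bayesian inference with GPs which can be utilized for efficient implementation (Rasmussen & Williams, 2006, chapter 8; Quiñonero-Candela & Rasmussen, 2005).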
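The standard posterior predictive computation for GP regression (Rasmussen & Williams, 2006, chapter 2) can be sketched as follows. This is a generic illustration, not the code used for the experiments. The function `gp_posterior`, the RBF kernel standing in for an NN-GP kernel, and the toy data are all assumptions made for the example.

```python
import numpy as np

def gp_posterior(K_train, K_test_train, K_test_diag, targets, noise_var):
    """Exact GP regression posterior mean and variance at the test points.

    K_train:      (n, n) kernel on the training inputs
    K_test_train: (m, n) kernel between test and training inputs
    K_test_diag:  (m,)   prior variances at the test inputs
    targets:      (n,)   training targets
    noise_var:    observation-noise variance added to the diagonal
    """
    n = K_train.shape[0]
    # Cholesky factorization of the noisy training kernel: the cubic-cost step.
    L = np.linalg.cholesky(K_train + noise_var * np.eye(n))
    # alpha = (K + noise_var * I)^{-1} t via two solves against the factor.
    # np.linalg.solve does not exploit triangularity; scipy.linalg.solve_triangular would.
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, targets))
    mean = K_test_train @ alpha
    v = np.linalg.solve(L, K_test_train.T)  # shape (n, m)
    var = K_test_diag - np.sum(v * v, axis=0)
    return mean, var

def rbf(a, b, lengthscale=0.3):
    """Stand-in kernel for 1-D inputs (an assumption; any valid GP kernel could be used)."""
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / lengthscale ** 2)

# Toy regression problem, for illustration only.
x_train = np.linspace(0.0, 1.0, 5)
t_train = np.sin(2.0 * np.pi * x_train)
x_test = np.array([0.1, 0.5, 0.9])

K_train = rbf(x_train, x_train)
mean, var = gp_posterior(K_train, rbf(x_test, x_train), np.ones(len(x_test)), t_train, noise_var=1e-6)
```

Factorizing once and reusing the factor for both the mean and the variance avoids forming the explicit inverse, which is usually preferable numerically.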
A.5 Kernel Convergence Proof
In this section, we present three different approaches to illustrate the weak convergence of neural networks to Gaussian processes as the number of channels goes to infinity. Although the first (§A.5.1) and second (§A.5.2) approaches, which take iterated limits, are less formal, they provide some intuition for the convergence of neural networks to GPs. The approach in §A.5.3 is more standard and its proof is more involved. We only provide the arguments for convolutional neural networks; it is straightforward to extend them to locally or fully connected networks.
We will use the following well-known theorem.
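Theorem A.1 (Portmanteau Theorem).
Let $\{X_n\}_{n \ge 1}$ be a sequence of real-valued random variables and $X$ a real-valued random variable. The following are equivalent:
1. $X_n \to X$ in distribution.
2. For all bounded continuous functions $f$, $\lim_{n \to \infty} \mathbb{E}[f(X_n)] = \mathbb{E}[f(X)]$.
3. The characteristic functions of $X_n$, i.e. $\mathbb{E}[e^{itX_n}]$, converge pointwise to that of $X$, i.e. for all $t \in \mathbb{R}$, $\lim_{n \to \infty} \mathbb{E}[e^{itX_n}] = \mathbb{E}[e^{itX}]$.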
A.5.1 Forward Mode
We show that when taking the number of channels in each layer to infinity sequentially, a CNN converges to a GP in the following sense: the pre-activations $z^l_j$ of each layer $l$ (with $j$ the channel index) converge to a Gaussian in distribution. We proceed by induction. Consider the first layer. It is not difficult to see that its pre-activations are, across channels, pairwise independent (multivariate) Gaussians with identical distribution, and thus i.i.d. Gaussian. Assume $\{z^l_j\}_j$ are i.i.d. Gaussian (unconditionally). We claim that so are $\{z^{l+1}_j\}_j$. Indeed, since both the connection weights from layer $l$ to layer $l+1$ and the biases from different channels are independent, $\{z^{l+1}_j\}_j$ are uncorrelated and have the same distribution. To prove that they are mutually independent, we only need to show that for each $j$, $z^{l+1}_j$ converges to a Gaussian in distribution as the number of channels in the preceding layer goes to infinity. Since $\{z^l_j\}_j$ are i.i.d., the outcomes of the inner sum of Equation 2 are i.i.d. We can then apply a multivariate central limit theorem (assuming the covariance of these summands is finite) to conclude that $z^{l+1}_j$ converges to a Gaussian in distribution (note that we have used the fact that the bias term is Gaussian).
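The central limit step can be checked numerically with a small Monte Carlo sketch. For simplicity it assumes a one-hidden-layer fully connected ReLU network evaluated on a single fixed input, rather than a CNN. The function names, sample sizes and widths are illustrative choices.

```python
import numpy as np

def sample_outputs(width, num_networks, rng):
    """Output pre-activation of `num_networks` random one-hidden-layer ReLU networks.

    Every network sees the same fixed input, normalized so that the hidden
    pre-activations u_i are standard normal. The output is
    z = width**-0.5 * sum_i w_i * relu(u_i) with w_i ~ N(0, 1).
    """
    u = rng.standard_normal((num_networks, width))
    w = rng.standard_normal((num_networks, width))
    return np.sum(w * np.maximum(u, 0.0), axis=1) / np.sqrt(width)

def excess_kurtosis(z):
    """Sample excess kurtosis, which is 0 for a Gaussian."""
    z = z - z.mean()
    return np.mean(z ** 4) / np.mean(z ** 2) ** 2 - 3.0

rng = np.random.default_rng(0)
kurtosis_by_width = {
    n: excess_kurtosis(sample_outputs(n, 10_000, rng)) for n in (1, 16, 256)
}
```

In this toy model the summands are i.i.d. with excess kurtosis 15, so the excess kurtosis of the output decays as 15 divided by the width. The sampled values should therefore be far from Gaussian at width 1 and close to 0 at width 256, up to Monte Carlo error.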
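A.5.2 Reverse Mode
Conditioning on the covariance of the previous layer, the empirical covariance of the current layer is a random variable that converges in probability to the analytic covariance as the number of channels goes to infinity (the law of large numbers, see Equation 7).
It is clear that different channels of the last layer's pre-activations are uncorrelated and have the same distribution. We will show that for any channel index $j$, the corresponding random variable “converges” to a Gaussian in the sense that its characteristic function converges pointwise to that of the Gaussian, for each input and for all vectors $t$.
We first apply Fubini’s Theorem and the formula for the characteristic function of a multivariate Gaussian, which expresses the characteristic function as an iterated integral over the layer covariances.
We now apply the limit and switch its order with the outer integral. The Lebesgue dominated convergence theorem allows us to do so because the inner integral is bounded above by the constant function $1$, which is absolutely integrable w.r.t. the outer integral. We then apply Theorem A.1, since the Gaussian characteristic function is bounded and continuous in the covariance, and the empirical covariance converges in probability to the analytic covariance.
Repeatedly applying the same argument (here we need the covariance map to be continuous) gives the claimed limit.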
Note that the addition of various layers on top (as discussed in §3) does not change the proof in a qualitative way. ∎
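The law-of-large-numbers step at the start of this argument can be illustrated numerically. The sketch below assumes ReLU activations and two unit-variance pre-activations with correlation `c`, for which the analytic covariance has a closed form (the degree-1 arc-cosine kernel). The function names, the value of `c` and the channel counts are illustrative.

```python
import numpy as np

def analytic_relu_covariance(c):
    """E[relu(u) relu(v)] for unit-variance Gaussians u, v with correlation c."""
    return (np.sqrt(1.0 - c * c) + (np.pi - np.arccos(c)) * c) / (2.0 * np.pi)

def empirical_relu_covariance(c, num_channels, num_trials, rng):
    """Channel-averaged relu(u) * relu(v); returns one estimate per trial."""
    u = rng.standard_normal((num_trials, num_channels))
    eps = rng.standard_normal((num_trials, num_channels))
    v = c * u + np.sqrt(1.0 - c * c) * eps  # unit variance, correlation c with u
    return np.mean(np.maximum(u, 0.0) * np.maximum(v, 0.0), axis=1)

rng = np.random.default_rng(0)
c = 0.5
target = analytic_relu_covariance(c)

# Root-mean-square deviation of the empirical covariance from the analytic one.
rms_error = {}
for n in (16, 256, 4096):
    estimates = empirical_relu_covariance(c, n, 200, rng)
    rms_error[n] = float(np.sqrt(np.mean((estimates - target) ** 2)))
```

Under these assumptions the deviation shrinks roughly as the inverse square root of the number of channels, which is the concentration used in the first step of the argument.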
A.5.3 Uniform Convergence Mode
In this section, we present a sufficient condition on the activation function so that the neural networks converge to a Gaussian process as all the widths approach infinity uniformly. Precisely, we are interested in the limit in which all widths go to infinity simultaneously, rather than one layer at a time.
Using Theorem A.1 and the arguments in the above section, it is not difficult to see that a sufficient condition is that the empirical covariance converges in probability to the analytic covariance.
If , i.e. converges to in probability as , then
Notation. Let denote the set of positive semi-definite matrices and for , define
Further let and be a function and a random variable (induced by the activation ) given by
Finally, let denote the space of measurable functions with the following properties:
Uniformly Squared Integrable: for every , there exists a positive constant such that
Lipschitz Continuity: for every , there exists such that for all ,
Uniform Convergence in Probability: for every and every ,
We will also use and to denote the spaces of functions satisfying property 1, property 2 and property 3, respectively. It is not difficult to see that for every , is a vector space, and so is .
We say $\phi$ is linearly bounded (exponentially bounded) if there exist $a, b > 0$ such that $|\phi(x)| \le a + b|x|$ (respectively $|\phi(x)| \le a e^{b|x|}$).
Note that the class of linearly bounded (exponentially bounded) functions is closed under addition and scalar multiplication. Moreover, the class of exponentially bounded functions contains all polynomials and is closed under multiplication and under integration, in the sense that for any constant $C$ the function $x \mapsto \int_0^x \phi(t)\,dt + C$ is also exponentially bounded.
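The following is true:
1. The space of functions satisfying property 1 contains all exponentially bounded functions.
2. The space of functions satisfying property 2 contains all functions whose first derivative is exponentially bounded.
3. The space of functions satisfying property 3 contains all linearly bounded functions.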
1. We prove the first statement. Assume .
In the last inequality, we applied
2. To prove the second statement, let and define (similarly for ):
Then (and ). Let
Since is exponentially bounded, is also exponentially bounded. In addition, is exponentially bounded for any polynomial .
Applying the Mean Value Theorem (we use the notation to hide the dependence on and other absolute constants)
Note that the operator norm is bounded by the infinity norm (up to a multiplicative constant) and is exponentially bounded. There is a constant (hidden in ) and such that the above is bounded by