On the expected behaviour of noise regularised deep neural networks as Gaussian processes

10/12/2019
by   Arnu Pretorius, et al.

Recent work has established the equivalence between deep neural networks and Gaussian processes (GPs), resulting in so-called neural network Gaussian processes (NNGPs). The behaviour of these models depends on the initialisation of the corresponding network. In this work, we consider the impact of noise regularisation (e.g. dropout) on NNGPs, and relate their behaviour to signal propagation theory in noise regularised deep neural networks. For ReLU activations, we find that the best performing NNGPs have kernel parameters that correspond to a recently proposed initialisation scheme for noise regularised ReLU networks. In addition, we show how the noise influences the covariance matrix of the NNGP, producing a stronger prior towards simple functions away from the training points. We verify our theoretical findings with experiments on MNIST and CIFAR-10 as well as on synthetic data.


1 Introduction

Figure 1: Dependence of noisy NNGPs on critical parameters for performance. (a) Critical boundary for the kernel parameters as a function of noise. (b) MNIST test accuracy for a deep noisy NNGP over an equally spaced grid of kernel parameters, using subsets of the training and test sets. (c) CIFAR-10 test accuracy, details same as (b).

Modern deep neural networks (NNs) are powerful tools for modeling highly complex functions. However, deep NNs lack natural ways of incorporating uncertainty estimation, and (approximate) Bayesian inference for NNs remains a challenge. In contrast, non-parametric approaches such as Gaussian processes (GPs) provide exact Bayesian inference and well-calibrated uncertainty estimates, but typically consider substantially simpler models than deep NNs. Therefore, a large body of work has recently emerged attempting to combine parametric deep learning models and GPs so as to derive benefits from both. These approaches include deep GPs (Damianou and Lawrence, 2013; Duvenaud et al., 2014; Hensman and Lawrence, 2014; Dai et al., 2015; Bui et al., 2016; Salimbeni and Deisenroth, 2017), deep kernel learning (Wilson et al., 2016b, a; Al-Shedivat et al., 2016) and viewing deep learning with dropout as an approximate deep GP (Gal and Ghahramani, 2016).

For shallow infinite width neural networks, an exact equivalence to GPs has been known for some time (Neal, 1994; Williams, 1997; Le Roux and Bengio, 2007). However, this equivalence has only recently been extended to deeper architectures (Hazan and Jaakkola, 2015; Lee et al., 2018; Matthews et al., 2018; Novak et al., 2019). Referred to as neural network Gaussian processes (NNGPs) in Lee et al. (2018), the resulting models are GPs with an exact correspondence to infinitely wide deep neural networks. Importantly, the NNGP depends on the hyperparameters of the network and its initialisation, which determine the network's signal propagation dynamics.

In deep neural networks, signal propagation has been shown to exhibit distinct phases depending on the initialisation of the network (Poole et al., 2016). These phases include ordered and chaotic regimes associated with vanishing and exploding gradients respectively, which can result in poor network performance (Schoenholz et al., 2017). By initialising at the critical boundary between these two regimes, known as the “edge of chaos”, the flow of information through the network improves, often resulting in faster and deeper training for a variety of architectures (Pennington et al., 2017; Yang and Schoenholz, 2017; Chen et al., 2018; Xiao et al., 2018).

Lee et al. (2018) showed that NNGPs tend to inherit the above behaviour from their corresponding randomly initialised networks. In particular, there exists an interaction between poor signal propagation and a poorly constructed kernel. As a result, the performance of NNGPs tends to suffer if their kernels are constructed using kernel parameters that correspond to network initialisations far from the critical boundary. Furthermore, even at the critical boundary, inputs to a neural network can still become asymptotically correlated at large depth (Schoenholz et al., 2017). The rate of convergence of this correlation limits the depth to which networks can be trained, because after convergence the network is unable to distinguish between different training observations at the output layer. This dependence of performance on depth (in the constructed kernel) also manifests in NNGPs (Lee et al., 2018).

Various design decisions are required to instantiate a modern NN. Important decisions for trainability and test performance often include both initialisation and regularisation. If initialised poorly, a network might become untrainable due to poor signal propagation, whereas a lack of regularisation could hurt the test performance of the network if it starts to overfit. Commonly used approaches to alleviating these issues include principled initialisation schemes (Glorot and Bengio, 2010; He et al., 2015) and improved regularisation strategies. Among the most successful regularisation strategies is dropout (Srivastava et al., 2014), a form of noise regularisation where scaled Bernoulli noise is applied multiplicatively to the units of a network to prevent co-adaptation. However, as shown by Pretorius et al. (2018), these components do not act in isolation and therefore the initialisation of the network should depend on the amount of noise regularisation being applied.

In this paper, we investigate the following research question: do the signal propagation dynamics that influence noise regularised NNs also govern the behaviour of corresponding "noisy NNGPs"? In the presence of multiplicative noise regularisation, Pretorius et al. (2018) derived the critical initialisation for stable signal propagation in feedforward ReLU networks. More specifically, the authors showed that stable propagation is achieved by setting all unit biases to zero and sampling the weights from zero mean Gaussians with variance set equal to $2/(\mu_2 D_{l-1})$. Here, $D_{l-1}$ is the number of incoming units to the layer and $\mu_2$ is the second moment of the noise. For example, when using dropout, $\mu_2 = 1/p$ (where $p$ is the probability of keeping the unit active) and therefore the initial weights are sampled from $\mathcal{N}(0, 2p/D_{l-1})$. Furthermore, it was shown that the rate of convergence to a fixed point correlation between inputs increases as a function of the amount of regularisation being applied. Consequently, increased noise further limits the depth of trainability in neural networks. In this paper, we investigate whether these findings for noise regularised networks carry across to their noisy NNGP counterparts.
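A minimal NumPy sketch of this noise-dependent initialisation scheme follows; the function names and the keep probability used are illustrative choices, not code from the paper.

```python
import numpy as np

def dropout_second_moment(p):
    # Dropout noise eps ~ Bernoulli(p) / p, so E[eps] = 1 and mu2 = E[eps^2] = 1/p.
    return 1.0 / p

def critical_noisy_relu_init(fan_in, fan_out, mu2, rng):
    # Critical initialisation for noisy ReLU networks: zero biases and
    # weights drawn from N(0, 2 / (mu2 * fan_in)).
    sigma2 = 2.0 / (mu2 * fan_in)
    W = rng.normal(0.0, np.sqrt(sigma2), size=(fan_out, fan_in))
    b = np.zeros(fan_out)
    return W, b

rng = np.random.default_rng(0)
mu2 = dropout_second_moment(p=0.8)                  # illustrative keep probability
W, b = critical_noisy_relu_init(784, 512, mu2, rng)
print(W.var(), 2 * 0.8 / 784)                       # empirical vs. target variance
```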

We consider noise regularised fully-connected feedforward NNs and study the behaviour of noisy NNGPs. Our analysis is done in expectation over the noise, under a general noise model (of which dropout is a special case). We give the kernel corresponding to noisy ReLU NNGPs and highlight the different noise-induced degeneracies in the kernel as the depth becomes large. Specifically, we show that the above noise dependent initialisations promoting stable signal propagation in noisy ReLU NNs correspond exactly to the kernel parameters exhibiting good performance in noisy NNGPs, as shown in Figure 1. However, even at criticality, we show that as the noise tends to infinity the covariance of the NNGP becomes diagonal. As a result, noise regularisation translates into a stronger prior for simple functions away from the training points. Finally, we verify our findings with experiments on real-world and synthetic datasets.

2 Noise regularised deep neural networks as Gaussian processes

Figure 2: Noisy NNGP covariance: example of the covariance for a noisy NNGP with only two inputs (orange and purple) and two output units (green and blue).

We consider a noise regularised fully-connected deep feedforward neural network. Given an input $x \in \mathbb{R}^{D_0}$, we inject noise into the model

$$z^l(x) = W^l\big(a^{l-1}(x) \diamond \epsilon^{l-1}\big) + b^l, \qquad a^l(x) = \phi\big(z^l(x)\big), \qquad a^0(x) = x, \tag{1}$$

using the operator $\diamond$ to denote either addition or multiplication, where $\epsilon^{l-1}$ is the noise vector injected at the input to layer $l$, sampled from a pre-specified noise distribution. The noise is assumed to have $\mathbb{E}[\epsilon] = 0$ in the additive case, and $\mathbb{E}[\epsilon] = 1$ for multiplicative noise distributions, such that in both cases $\mathbb{E}_\epsilon[a \diamond \epsilon] = a$. The weights $W_{ij}^l$ and biases $b_i^l$ are sampled i.i.d. from zero mean Gaussian distributions with variances $\sigma_w^2/D_{l-1}$ and $\sigma_b^2$, respectively, where $D_l$ denotes the dimensionality of hidden layer $l$ in the network. Each hidden layer's activations $a^l(x)$ are computed element-wise using an activation function $\phi$. Lastly, we denote the second moment of the noise as $\mu_2 = \mathbb{E}[\epsilon^2]$ and define $f$ as the entire function mapping from input to output, with $f(x) = z^L(x)$.
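The following NumPy sketch performs one stochastic forward pass of a model of the form in (1) with multiplicative (dropout-style) noise; the layer sizes, keep probability and function names are illustrative assumptions rather than the paper's code.

```python
import numpy as np

def noisy_forward(x, weights, biases, p=0.8, rng=None):
    # One stochastic pass: multiplicative noise eps ~ Bernoulli(p) / p is applied
    # to each layer's input before the affine map, followed by a ReLU.
    if rng is None:
        rng = np.random.default_rng()
    a = x
    for W, b in zip(weights, biases):
        eps = rng.binomial(1, p, size=a.shape) / p   # E[eps] = 1, E[eps^2] = 1/p
        z = W @ (a * eps) + b
        a = np.maximum(z, 0.0)
    return z                                         # pre-activation output f(x)

# A small 3-layer network initialised at the critical per-weight variance 2p / fan_in.
rng = np.random.default_rng(1)
dims, p = [10, 64, 64, 1], 0.8
weights = [rng.normal(0.0, np.sqrt(2 * p / d_in), size=(d_out, d_in))
           for d_in, d_out in zip(dims[:-1], dims[1:])]
biases = [np.zeros(d) for d in dims[1:]]
print(noisy_forward(rng.normal(size=10), weights, biases, p, rng))
```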

By choosing parameter sampling distributions at initialisation we are implicitly specifying a prior over networks in parameter space. We now transition from parameter space to function space by instead specifying a prior directly over function values. Assume a training set of $n$ input-output pairs $\{(x_1, y_1), \dots, (x_n, y_n)\}$. If we can show that the vector of function values $f = [f(x_1), \dots, f(x_n)]^T$ is Gaussian distributed at initialisation, then the distribution over the output of the network at these points is completely determined by the second-order statistics $\mathbb{E}[f(x_i)]$ and $\mathbb{E}[f(x_i) f(x_j)]$, defining the following GP

$$f \mid x_1, \dots, x_n \sim \mathcal{N}\big(\mu, \Sigma^L\big). \tag{2}$$

We begin by assuming the following additive error model with regression outcomes $y_i = f(x_i) + e_i$, where $e_i \sim \mathcal{N}(0, \sigma_e^2)$ (note that here we consider scalar outputs, i.e. $D_L = 1$, hence $f(x_i) \in \mathbb{R}$; also, the additive error noise should not be confused with the injected noise in (1)). Since the errors $e_i$ are i.i.d. Gaussian, the joint distribution over all outcomes is

$$y \mid f \sim \mathcal{N}\big(f, \sigma_e^2 I_n\big), \tag{3}$$

where $y = [y_1, \dots, y_n]^T$. In GP regression we are interested in finding the marginal distribution

$$p(y \mid x_1, \dots, x_n) = \int p(y \mid f)\, p(f \mid x_1, \dots, x_n)\, df. \tag{4}$$

We proceed as in (Lee et al., 2018) to argue that $f$ is in fact a zero mean Gaussian, i.e. $\mu = 0$ (we refer the reader to Matthews et al. (2018) for a more formal approach), and derive the elements of the covariance matrix $\Sigma^L$ in (2) for noise regularised deep neural networks. Subsequently, we obtain an expression for (4) by combining (2) and (3) and using standard results for the marginal of a Gaussian.

To show that the expected distribution of $f$ over the injected noise is Gaussian, we first note that, conditioned on the inputs, the "output" units at layer $l$, stemming from the post-activations in the previous layer, are given by $z_i^l(x) = \sum_{j=1}^{D_{l-1}} W_{ij}^l \big(a_j^{l-1}(x) \diamond \epsilon_j\big) + b_i^l$, for $i = 1, \dots, D_l$. As previously mentioned, we sample the weights and biases i.i.d. from a zero mean Gaussian and define the noise to be i.i.d. such that $z_i^l(x)$ is unbiased in expectation of the injected noise. Therefore, in a wide network, $z_i^l(x)$ is a sum of a large collection of i.i.d. random variables. As $D_{l-1} \to \infty$, the central limit theorem ensures that the distribution of the unit outputs evaluated at the training inputs will tend to a Gaussian with mean zero and covariance $\Sigma^l$. As a result, the function values can be treated as samples from a GP given by $z^l \sim \mathcal{GP}(0, \Sigma^l)$. Here, $\Sigma^l$ is an $nD_l \times nD_l$ covariance matrix given by

$$\Sigma^l = K^l \otimes I_{D_l},$$

where $\otimes$ is the Kronecker product and $K^l$ is the $n \times n$ matrix of kernel values between the training inputs. The kernel function depends on the layer depth, the scale of the weights and biases and the amount of noise regularisation being applied. A schematic illustration of the covariance matrix is given in Figure 2 for the simple case of only two inputs and two output units. To derive the elements of the covariance $\Sigma^l$, consider the units $i, i'$ and inputs $x, x'$, which give the entries $\mathbb{E}\big[z_i^l(x)\, z_{i'}^l(x')\big]$.

Note that $\mathbb{E}\big[z_i^l(x)\, z_{i'}^l(x')\big] = 0$ for $i \neq i'$, due to the independence between the incoming connections (weights) associated with each output unit. Therefore, we only consider the case where $i = i'$, which for $x \neq x'$ gives

$$K^l(x, x') = \sigma_b^2 + \sigma_w^2\, \mathbb{E}\big[\big(a_j^{l-1}(x) \diamond \epsilon_j\big)\big(a_j^{l-1}(x') \diamond \epsilon'_j\big)\big] = \sigma_b^2 + \sigma_w^2\, \mathbb{E}\big[\phi\big(z_j^{l-1}(x)\big)\, \phi\big(z_j^{l-1}(x')\big)\big],$$

where the expectation is taken with respect to $z_j^{l-1} \sim \mathcal{N}(0, K^{l-1})$. The final equality follows from applying the above argument recursively for the previous layer $l-1$. For the case of $x = x'$ (and $i = i'$), we have that the diagonal components of the covariance matrix are given by

$$K^l(x, x) = \sigma_b^2 + \sigma_w^2\, \mathbb{E}\big[\big(a_j^{l-1}(x) \diamond \epsilon_j\big)^2\big],$$

where the influence of the noise is explicitly expressed through its second moment $\mu_2$. Using the inputs to provide the initial condition for this recursion and letting each layer width in the network tend to infinity in succession, this recursive construction gives $f$ as Gaussian distributed with mean zero and covariance

$$\Sigma^L = K^L \otimes I_{D_L}. \tag{5}$$

Finally, combining (2), (3) and (5) and using standard results for the marginal of a Gaussian distribution, the marginal in (4) can be shown to be

$$y \sim \mathcal{N}\big(0,\; \Sigma^L + \sigma_e^2 I_n\big),$$

where $\big(\Sigma^L + \sigma_e^2 I_n\big)_{ij} = K^L(x_i, x_j) + \sigma_e^2\, \delta_{ij}$, with $\delta_{ij}$ the Kronecker delta (Williams and Rasmussen, 2006). Therefore, together with the additive noise level $\sigma_e^2$, our model for the joint distribution over training outcomes is fully determined by the equivalent kernel $K^L$, corresponding to a layer-wise recursion of an infinite basis function expansion. This kernel, in turn, depends on the parameterisation of the network and the amount of injected noise.

Having developed our noisy NNGP model, we next discuss predicting outcomes for unseen test data points. To make a prediction, we evaluate the predictive distribution at a new test point $x^*$. Consider the joint distribution

$$\begin{bmatrix} y \\ f(x^*) \end{bmatrix} \sim \mathcal{N}\left(0,\; \begin{bmatrix} \Sigma^L + \sigma_e^2 I_n & k^* \\ (k^*)^T & k^{**} \end{bmatrix}\right),$$

where we can partition the covariance as follows: $k^* = [K^L(x_1, x^*), \dots, K^L(x_n, x^*)]^T$, an $n$ dimensional vector, and $k^{**} = K^L(x^*, x^*)$. Using standard results for the conditional distribution of a partitioned Gaussian vector, we find

$$f(x^*) \mid y \sim \mathcal{N}\big(\bar{\mu}^*, \bar{\sigma}^{2*}\big), \tag{6}$$

where $\bar{\mu}^* = (k^*)^T \big(\Sigma^L + \sigma_e^2 I_n\big)^{-1} y$ and $\bar{\sigma}^{2*} = k^{**} - (k^*)^T \big(\Sigma^L + \sigma_e^2 I_n\big)^{-1} k^*$. This result is the function space equivalent to exact Bayesian inference in parameter space: by computing the conditional in (6), we are implicitly performing an integration over the posterior of the parameters associated with an infinitely wide noise regularised neural network (Williams, 1997).
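A generic sketch of how the predictive computation in (6) can be carried out for any precomputed kernel follows; the function names, the default observation noise value and the use of np.linalg.solve are our own choices, not the paper's implementation.

```python
import numpy as np

def gp_predict(kernel, X_train, y_train, X_test, sigma_e2=1e-2):
    # Exact GP regression (eq. (6)):
    #   mean = k*^T (Sigma^L + sigma_e^2 I)^{-1} y
    #   var  = k** - k*^T (Sigma^L + sigma_e^2 I)^{-1} k*
    K = kernel(X_train, X_train) + sigma_e2 * np.eye(len(X_train))
    K_star = kernel(X_train, X_test)      # n x m cross-covariances (the k* vectors)
    K_ss = kernel(X_test, X_test)         # m x m test covariances
    K_inv_y = np.linalg.solve(K, y_train)
    K_inv_Ks = np.linalg.solve(K, K_star)
    mean = K_star.T @ K_inv_y
    var = np.diag(K_ss) - np.sum(K_star * K_inv_Ks, axis=0)
    return mean, var
```

Any kernel(X1, X2) that returns the matrix of $K^L$ values between two sets of inputs can be plugged in, including a kernel built from the noisy ReLU recursions of the next section.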

In the next section, we study how the properties of the kernel derived in this section depend on the parameters of the network when using ReLU activations. Furthermore, for the remainder of the paper we drop the dependence on the hidden units and training set indices for ease of notation and simply write the kernel values as $K^l(x, x')$ for a generic pair of inputs.

3 Kernel parameters and critical neural network initialisation

(Table 1 summarises, for additive noise (cases A.1-A.2) and multiplicative noise (cases M.1-M.5), the weight variance and bias variance settings together with the corresponding limiting value of the kernel as the depth tends to infinity.)

Table 1: Limiting behaviour for degenerate and critical noisy ReLU kernels.

We begin by examining the interaction between the parameters of the noisy NNGP kernel and its corresponding network initialisation. Specifically, we focus on ReLU activations and show that the kernel parameter values that lead to non-degenerate kernels for deep noisy NNGPs are exactly those that promote stable signal propagation in noise regularised ReLU networks.

Let $K^l(x, x') = \mathbb{E}\big[z_i^l(x)\, z_i^l(x')\big]$ and define the correlation $\rho^l = K^l(x, x')\big/\sqrt{K^l(x, x)\, K^l(x', x')}$; then the elements of the covariance at a hidden unit are

$$K^l(x, x') = \sigma_b^2 + \frac{\sigma_w^2}{2\pi} \sqrt{K^{l-1}(x, x)\, K^{l-1}(x', x')}\, \Big[\sin\theta^{l-1} + \big(\pi - \theta^{l-1}\big)\cos\theta^{l-1}\Big] \tag{7}$$

for $x \neq x'$, where

$$\theta^{l-1} = \cos^{-1}\big(\rho^{l-1}\big),$$

with diagonal elements given by

$$K^l(x, x) = \sigma_b^2 + \frac{\sigma_w^2 \mu_2}{2}\, K^{l-1}(x, x) \tag{8}$$

for multiplicative noise (additive noise instead contributes an extra noise-dependent term at each layer). These formulae are the kernel equivalent to the signal propagation recurrences derived in (Pretorius et al., 2018) for noisy ReLU networks. For no noise and outside the context of GPs, a similar result can be found in (Cho and Saul, 2009). Repeated substitution in (8) shows that

$$K^l(x, x) = \left(\frac{\sigma_w^2 \mu_2}{2}\right)^{\! l} K^0(x, x) \;+\; \sigma_b^2 \sum_{j=0}^{l-1} \left(\frac{\sigma_w^2 \mu_2}{2}\right)^{\! j}. \tag{9}$$

The limiting properties of the kernel are seen by letting $l \to \infty$ in (9). In this limit, several degenerate kernels arise, analogous to cases of unstable signal propagation dynamics in mean-field theory and other related work (Poole et al., 2016; Daniely et al., 2016; Schoenholz et al., 2017; Pretorius et al., 2018). We provide the different cases in Table 1. For any amount of additive noise, all possible settings (see A.1 and A.2) of the kernel parameters $\sigma_w^2$ and $\sigma_b^2$ in (9) will result in a degenerate kernel in the limit of infinite depth. The situation is similar for multiplicative noise, except for the case (M.5), where $\sigma_w^2 = 2/\mu_2$ and $\sigma_b^2 = 0$. We refer to these parameters in (M.5) as the critical kernel parameters. Here, the diagonal elements of the covariance stay fixed at their initial values even at extreme depth. These parameter values are identical to the proposed initialisations for deep noisy ReLU networks derived in (Pretorius et al., 2018).
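A small numerical check of the diagonal recursion (8) follows; the initial variance, dropout level and depth below are illustrative choices of ours.

```python
import numpy as np

def diag_kernel_depth(k0, sigma_w2, sigma_b2, mu2, depth):
    # Iterate the diagonal recursion K^l = sigma_b^2 + (sigma_w^2 * mu2 / 2) * K^{l-1}.
    k, trace = k0, [k0]
    for _ in range(depth):
        k = sigma_b2 + 0.5 * sigma_w2 * mu2 * k
        trace.append(k)
    return np.array(trace)

mu2, k0 = 1.25, 1.0                      # e.g. dropout with keep probability p = 0.8
for sigma_w2 in [1.0, 2.0 / mu2, 2.5]:   # sub-critical, critical, super-critical
    trace = diag_kernel_depth(k0, sigma_w2, sigma_b2=0.0, mu2=mu2, depth=50)
    print(f"sigma_w^2 = {sigma_w2:.2f}: K^50 = {trace[-1]:.3e}")
# Only sigma_w^2 = 2 / mu2 keeps the variance fixed at its initial value K^0;
# the other settings collapse to zero or blow up with depth.
```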

From (7) we can see that the off-diagonal elements of the covariance matrix are influenced by the noise level at the critical values through the relationship $\rho^l = \frac{1}{\pi \mu_2}\big[\sin\theta^{l-1} + (\pi - \theta^{l-1})\cos\theta^{l-1}\big]$. Furthermore, we note that $\rho^l \to 0$ as $\mu_2 \to \infty$. Therefore, multiplicative noise regularisation has a damping effect on the kernel function evaluated between different inputs, tending towards total dissimilarity and a diagonal covariance. This reduction in the richness of the covariance structure exploitable by the NNGP then enforces a strong prior for simple functions away from the training points. To see this effect, consider the predictive distribution in (6) for a test point $x^*$. For large amounts of noise, $k^* \to 0$ and therefore, in the limit, $\bar{\mu}^* = 0$ and $\bar{\sigma}^{2*} = k^{**}$. Since $\Sigma^L + \sigma_e^2 I_n$ is symmetric positive definite by definition and $k^* = 0$, the predicted outcome will be a sample from a zero mean Gaussian with maximal uncertainty as measured by the variance $k^{**}$, i.e. $f(x^*) \mid y \sim \mathcal{N}(0, k^{**})$.

To validate the above claims, the following section provides an empirical investigation. In particular, we test two hypotheses that stem from the above theoretical analysis, using both real-world and synthetic datasets.

4 Experiments

We have shown how the kernel parameters for a noisy NNGP relate to those for its corresponding deep neural network. In doing so, our discussion has led us to the following testable hypotheses:

Figure 3: Sensitivity of NNGP kernel parameters for different depths. (a) Test accuracy on MNIST for a shallower NNGP over an equally spaced grid of kernel parameters. (b) The same for CIFAR-10. (c)-(d) Same as (a) and (b), but at a greater depth.
  1. Noisy NNGPs perform best at their critical parameters: The sensitivity of the kernel parameters should cause noisy NNGPs to perform poorly at settings far away from the critical kernel parameters. Furthermore, the reliance on these critical values for good performance should become more marked at greater depth [Figures 1 and 3].

  2. Noise constrains the covariance and leads to simpler models away from the training points with larger uncertainty: Even at criticality, noise injection applies a shrinkage effect to the kernel function evaluated between different inputs to the noisy NNGP. This should lead to a constrained covariance structure, or in the limit of a large amount of noise, a completely diagonal covariance. The NNGP prior over functions regularised in this way should lead to simpler models away from the training points with increased estimates of uncertainty [Figures 4 and 5].

Figure 4: Effects of noise induced regularisation on the noisy NNGPs in Figure 1 for MNIST. (a) Depth evolution of the kernel variance for the critical kernel parameters (dashed orange lines) and two non-critical settings (solid red and solid blue lines). (b) Depth evolution of the kernel covariance between two inputs. (c) Relationship between accuracy and covariance norm; orange points correspond to critical kernel parameters, with larger sizes indicating more noise. (d) Scatter plot of accuracy as a function of mean posterior predictive variance. We measure the quality of uncertainty estimates by computing the correlation of the mean posterior predictive variance with test accuracy. The main plot contains all points, whereas the inset only contains points close to criticality (green to red showing an increase in noise).

H1: We begin by investigating the sensitivity of the kernel parameter values. As shown in Figure 1, we test the performance of deep NNGPs on MNIST and CIFAR-10 with kernels constructed over a grid of variance parameters, for varying values of the noise level. Our approach to classification in this paper is identical to (Lee et al., 2018), where classification is treated as a regression problem. Specifically, instead of one-hot output vectors, each output vector is recoded as a zero mean regression output, with a positive value at the index corresponding to the correct class and a small negative value at all other indices corresponding to the incorrect classes. The predicted class label for a given input is then simply the index corresponding to the maximum value in the output vector as predicted by the NNGP regression model. Figures 1(b) and (c) confirm our expectations, showing that the kernel parameters corresponding to NNGPs with good performance closely follow the critical initialisation boundary shown in Figure 1(a). As kernels are constructed further away from criticality, their performance starts to deteriorate.
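A sketch of this recoding follows; the particular zero-mean target values used below ($1 - 1/C$ for the correct class and $-1/C$ otherwise) are one common illustrative choice and are not necessarily the exact values used in the paper.

```python
import numpy as np

def labels_to_regression_targets(labels, num_classes):
    # Zero-mean recoding of one-hot labels: a positive value at the correct class
    # and a small negative value elsewhere, so each target vector sums to zero.
    Y = np.full((len(labels), num_classes), -1.0 / num_classes)
    Y[np.arange(len(labels)), labels] = 1.0 - 1.0 / num_classes
    return Y

def predict_class(mean_outputs):
    # The predicted label is the index of the maximum predicted regression output.
    return np.argmax(mean_outputs, axis=1)

Y = labels_to_regression_targets(np.array([3, 0, 9]), num_classes=10)
print(Y.sum(axis=1))          # each row sums to (approximately) zero
print(predict_class(Y))       # recovers [3, 0, 9]
```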

The sensitivity to the kernel parameters becomes more acute at larger depth, as shown in Figure 3. Panels (a) and (b) plot the results for a shallower depth. In this case, a wide band is seen to form around the critical boundary (beige shaded area), with kernel parameters far away from their critical values still able to perform reasonably well. This is no longer the case in panels (c) and (d), where we tested performance at a greater depth. At this depth, the NNGP is far more sensitive to the kernel parameters and only a few models with kernel parameters very close to the critical boundary are seen to perform well.

H2: For all the models evaluated in H1, we also study the influence of the noise on the kernel as well as on the resulting NNGP covariance matrix. For each model, we plot in Figure 4(a) and (b) the depth evolution of the kernel, using two inputs from the MNIST dataset. In (a) we track the variance of one of the inputs and in (b) the covariance between the two inputs. The dashed orange lines correspond to the critical kernel parameters, with the two non-critical settings shown in solid red and solid blue. The limiting behaviour described in (M.1), (M.3) and (M.5) in Table 1 is shown in (a), with all kernels tending towards degenerate function mappings, except those evolving under the critical parameters. Furthermore, in (b), we show the damping effect on the kernel at criticality, highlighted by decreasing asymptotes (dashed orange lines around layer 20) as more noise is being applied (darker lines).

The depth dynamics of the kernel also constrain the resulting covariance matrix. To see the effect of this, we use the Frobenius norm of the covariance matrix as a proxy for its richness. Figure 4(c) plots the relationship between the covariance norm and test accuracy for all the experiments in H1. Interestingly, the norm seems to suggest a step function relationship. Moving from right to left in (c), we observe a sudden and dramatic drop in performance beyond a certain amount of regularisation, as measured by a decreasing covariance norm. In other words, there seems to exist some requisite amount of information that must be captured by the covariance matrix in order for the NNGP to perform well. This is also the case at criticality: in (c), the orange points correspond to critical kernels, with larger points associated with more noise.
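The richness proxy itself is simply the Frobenius norm of the NNGP covariance matrix; a minimal NumPy helper (name ours) is:

```python
import numpy as np

def covariance_richness(K):
    # Frobenius norm of the NNGP covariance, used as a proxy for how much
    # structure the covariance retains under noise regularisation.
    return np.linalg.norm(K, ord="fro")
```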

The effect of increased noise regularisation on uncertainty estimation is shown in Figure 4(d), where we plot test accuracy as a function of the mean posterior predictive variance. For NNGPs far away from criticality (blue points in the main plot), we see little correlation between variance and test accuracy. The inset in (d) shows the same plot, but for NNGPs close to their critical parameters. For these models the (negative) correlation is stronger, possibly providing more reliable uncertainty estimates. Here, the green points are low noise models and the red points are high noise models. As expected, noise regularisation causes the posterior predictive variance to increase, leading to higher uncertainty. We performed the same investigation using the CIFAR-10 dataset and obtained similar results (see Appendix A).

Figure 5: 20-layer noisy NNGPs with 1D input data: left column no noise, middle column a small amount of noise, right column a large amount of noise. (a)-(c) Samples from the NNGP prior. (d)-(f) NNGP covariance. (g)-(i) NNGP fit (red line) using four training examples (blue dots) sampled from a simple sinusoidal function (green line).

Finally, to gain more insight, we consider a simple one-dimensional regression task (the example is taken from Chapter 1 in Bishop (2006)). The top row in Figure 5 shows samples from a 20-layer NNGP prior for (a) no noise, (b) a small amount of noise, and (c) a large amount of noise. We found that a small amount of bias variance improved each fit (see Appendix B for a discussion of non-zero biases). The covariance structure corresponding to each NNGP is shown in (d)-(f), located in the middle row of Figure 5. The bottom row, (g)-(i), shows the fit from the posterior predictive (red line) using four training examples (blue dots) sampled from a simple sinusoidal function (green line). Moving across the columns from left to right, we find that the samples from the prior become more erratic as the covariance becomes diagonal, which strongly regularises the regression model at previously unseen test points. Note that the model in (i) still corresponds to exact Bayesian inference, but with a strong prior for near constant functions with high uncertainty away from the training points.
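A self-contained sketch that reproduces this qualitative behaviour builds the noisy ReLU NNGP covariance over a 1D grid using recursions (7) and (8) and then conditions on four noisy sine observations; the input-layer kernel, grid, noise level and all numerical values here are our own illustrative assumptions, not the paper's settings.

```python
import numpy as np

def noisy_relu_nngp_cov(X, depth, sigma_w2, sigma_b2, mu2):
    # Full covariance K^depth over a set of 1D inputs: (7) for off-diagonal
    # entries, (8) for the diagonal.
    X = np.asarray(X, dtype=float)[:, None]
    K = sigma_b2 + sigma_w2 * (X @ X.T)              # linear input-layer kernel (assumed)
    for _ in range(depth):
        d = np.sqrt(np.diag(K))
        theta = np.arccos(np.clip(K / np.outer(d, d), -1.0, 1.0))
        K_new = sigma_b2 + sigma_w2 / (2 * np.pi) * np.outer(d, d) * (
            np.sin(theta) + (np.pi - theta) * np.cos(theta))
        np.fill_diagonal(K_new, sigma_b2 + 0.5 * sigma_w2 * mu2 * np.diag(K))
        K = K_new
    return K

# Four noisy observations of a sine function and a dense test grid (values ours).
rng = np.random.default_rng(2)
X_tr = np.array([-1.0, -0.3, 0.4, 0.9])
y_tr = np.sin(2 * np.pi * X_tr) + 0.05 * rng.normal(size=4)
X_te = np.linspace(-1.5, 1.5, 200)

mu2, n = 2.0, len(X_tr)                              # a large amount of multiplicative noise
K = noisy_relu_nngp_cov(np.concatenate([X_tr, X_te]), depth=20,
                        sigma_w2=2.0 / mu2, sigma_b2=0.05, mu2=mu2)
K_tr = K[:n, :n] + 1e-2 * np.eye(n)                  # Sigma^L + sigma_e^2 I
K_cross, K_test = K[:n, n:], K[n:, n:]
prior_samples = rng.multivariate_normal(np.zeros(len(X_te)),
                                        K_test + 1e-8 * np.eye(len(X_te)), size=3)
post_mean = K_cross.T @ np.linalg.solve(K_tr, y_tr)
post_var = np.diag(K_test) - np.sum(K_cross * np.linalg.solve(K_tr, K_cross), axis=0)
print(prior_samples.shape, post_mean[:3], post_var[:3])
```

Increasing mu2 drives the off-diagonal entries of K towards zero, so the posterior mean reverts to zero away from the four training points, mirroring panel (i).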

5 Discussion

We have shown that critical initialisation of noisy ReLU networks corresponds to a choice of optimal kernel parameters in noisy NNGPs and that deviation from these critical parameters leads to poor performance, becoming more severe with depth and the extent of the deviation. In addition, we highlighted the effect of noise on the covariance of a noisy NNGP at criticality, with noise in the limit yielding a fully diagonal covariance, acting as a regulariser on the posterior predictive.

It is interesting to reflect on the connection between deep NNs and GPs in the context of representation learning and noise regularisation. The core assumption in deep learning is that deep NNs learn distributed hierarchical representations useful for modeling the true underlying data generating mechanism, whereas shallow models do not. In these deeper models, noise regularisation is thought to be successful largely because of its influence on representations at different levels of abstraction during the training procedure (Goodfellow et al., 2016). As discussed in previous work (Neal, 1994; Matthews et al., 2018), the kernels associated with NNGPs do not use learned hierarchical representations. Nevertheless, these models are still able to perform as well as, or sometimes better than, their neural network counterparts (Lee et al., 2018). In the infinite width setting, the success of regularisation from noise injection is unlikely to have the same interpretation as in the finite width setting. We note that in the context of NNGPs, noise regularisation has a stronger connection with controlling the length scale parameter in commonly used kernel functions than with regularising through corrupted representations at different levels of abstraction. This connection with the length scale parameter means that noise regularisation in NNGPs may be more accurately interpreted as a useful mechanism for designing priors by controlling the smoothness of the kernel function.

Finally, recent work related to NNGPs has made it possible to accurately model the learning dynamics of deep neural networks by taking a function space perspective of gradient descent training in the infinite width limit (Jacot et al., 2018; Lee et al., 2019). We envision that a similar analysis could be applied to accurately model the learning dynamics of noise regularised deep neural networks.

References

  • M. Al-Shedivat, A. G. Wilson, Y. Saatchi, Z. Hu, and E. P. Xing (2016) Learning scalable deep kernels with recurrent structure. Journal of Machine Learning Research.
  • C. M. Bishop (2006) Pattern recognition and machine learning. Springer.
  • T. Bui, D. Hernández-Lobato, J. Hernández-Lobato, Y. Li, and R. Turner (2016) Deep Gaussian processes for regression using approximate expectation propagation. In International Conference on Machine Learning.
  • M. Chen, J. Pennington, and S. S. Schoenholz (2018) Dynamical isometry and a mean field theory of RNNs: gating enables signal propagation in recurrent neural networks. International Conference on Machine Learning.
  • Y. Cho and L. K. Saul (2009) Kernel methods for deep learning. In Advances in Neural Information Processing Systems.
  • Z. Dai, A. Damianou, J. González, and N. Lawrence (2015) Variational auto-encoded deep Gaussian processes. arXiv preprint arXiv:1511.06455.
  • A. Damianou and N. Lawrence (2013) Deep Gaussian processes. In Artificial Intelligence and Statistics.
  • A. Daniely, R. Frostig, and Y. Singer (2016) Toward deeper understanding of neural networks: the power of initialization and a dual view on expressivity. In Advances in Neural Information Processing Systems.
  • D. Duvenaud, O. Rippel, R. Adams, and Z. Ghahramani (2014) Avoiding pathologies in very deep networks. In Artificial Intelligence and Statistics.
  • Y. Gal and Z. Ghahramani (2016) Dropout as a Bayesian approximation: representing model uncertainty in deep learning. In International Conference on Machine Learning.
  • X. Glorot and Y. Bengio (2010) Understanding the difficulty of training deep feedforward neural networks. In Artificial Intelligence and Statistics.
  • I. Goodfellow, Y. Bengio, and A. Courville (2016) Deep learning. MIT Press.
  • T. Hazan and T. Jaakkola (2015) Steps toward deep kernel methods from infinite neural networks. arXiv preprint arXiv:1508.05133.
  • K. He, X. Zhang, S. Ren, and J. Sun (2015) Delving deep into rectifiers: surpassing human-level performance on ImageNet classification. In Proceedings of the IEEE International Conference on Computer Vision.
  • J. Hensman and N. D. Lawrence (2014) Nested variational compression in deep Gaussian processes. arXiv preprint arXiv:1412.1370.
  • A. Jacot, F. Gabriel, and C. Hongler (2018) Neural tangent kernel: convergence and generalization in neural networks. In Advances in Neural Information Processing Systems.
  • N. Le Roux and Y. Bengio (2007) Continuous neural networks. In Artificial Intelligence and Statistics.
  • J. Lee, Y. Bahri, R. Novak, S. S. Schoenholz, J. Pennington, and J. Sohl-Dickstein (2018) Deep neural networks as Gaussian processes. International Conference on Learning Representations.
  • J. Lee, L. Xiao, S. S. Schoenholz, Y. Bahri, J. Sohl-Dickstein, and J. Pennington (2019) Wide neural networks of any depth evolve as linear models under gradient descent. arXiv preprint arXiv:1902.06720.
  • A. G. Matthews, M. Rowland, J. Hron, R. E. Turner, and Z. Ghahramani (2018) Gaussian process behaviour in wide deep neural networks. International Conference on Learning Representations.
  • R. M. Neal (1994) Priors for infinite networks. Technical Report CRG-TR-94-1, University of Toronto.
  • R. Novak, L. Xiao, J. Lee, Y. Bahri, D. A. Abolafia, J. Pennington, and J. Sohl-Dickstein (2019) Bayesian convolutional neural networks with many channels are Gaussian processes. International Conference on Learning Representations.
  • J. Pennington, S. Schoenholz, and S. Ganguli (2017) Resurrecting the sigmoid in deep learning through dynamical isometry: theory and practice. In Advances in Neural Information Processing Systems.
  • B. Poole, S. Lahiri, M. Raghu, J. Sohl-Dickstein, and S. Ganguli (2016) Exponential expressivity in deep neural networks through transient chaos. In Advances in Neural Information Processing Systems.
  • A. Pretorius, E. Van Biljon, S. Kroon, and H. Kamper (2018) Critical initialisation for deep signal propagation in noisy rectifier neural networks. In Advances in Neural Information Processing Systems.
  • H. Salimbeni and M. Deisenroth (2017) Doubly stochastic variational inference for deep Gaussian processes. In Advances in Neural Information Processing Systems.
  • S. S. Schoenholz, J. Gilmer, S. Ganguli, and J. Sohl-Dickstein (2017) Deep information propagation. International Conference on Learning Representations.
  • N. Srivastava, G. E. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov (2014) Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research.
  • C. K. Williams and C. E. Rasmussen (2006) Gaussian processes for machine learning. MIT Press, Cambridge, MA.
  • C. K. Williams (1997) Computing with infinite networks. In Advances in Neural Information Processing Systems.
  • A. G. Wilson, Z. Hu, R. R. Salakhutdinov, and E. P. Xing (2016a) Stochastic variational deep kernel learning. In Advances in Neural Information Processing Systems.
  • A. G. Wilson, Z. Hu, R. Salakhutdinov, and E. P. Xing (2016b) Deep kernel learning. In Artificial Intelligence and Statistics.
  • L. Xiao, Y. Bahri, J. Sohl-Dickstein, S. S. Schoenholz, and J. Pennington (2018) Dynamical isometry and a mean field theory of CNNs: how to train 10,000-layer vanilla convolutional neural networks. International Conference on Machine Learning.
  • G. Yang and S. Schoenholz (2017) Mean field residual networks: on the edge of chaos. In Advances in Neural Information Processing Systems.

Appendix

A Additional results

Figure 6: Effects of noise induced regularisation on NNGPs in Figure 1 for CIFAR-10. (a) Depth evolution of the kernel variance for the critical kernel parameters (dashed orange lines) and two non-critical settings (solid red and solid blue lines). (b) Depth evolution of the kernel covariance between two inputs. (c) Relationship between accuracy and covariance norm; orange points correspond to critical kernel parameters, with larger sizes indicating more noise. (d) Quality of uncertainty estimates as measured by the correlation of the mean posterior predictive variance with test accuracy. The main plot contains all points, whereas the inset only contains points close to criticality (green to red showing an increase in noise).

Figure 6 provides additional results using CIFAR-10 instead of MNIST for the experiments presented in Figure 1. The results are similar to those described in the main text for MNIST.

B Kernels with non-zero biases

In our experiments, we noticed that noisy ReLU NNGPs often benefit from small non-zero biases. Therefore, we consider here the implication of non-zero biases for the evolution of the diagonal terms in the NNGP covariance. Recall that the diagonal (variance) terms of the covariance matrix can be expanded as follows:

$$K^L(x, x) = \left(\frac{\sigma_w^2 \mu_2}{2}\right)^{\! L} K^0(x, x) \;+\; \sigma_b^2 \sum_{j=0}^{L-1} \left(\frac{\sigma_w^2 \mu_2}{2}\right)^{\! j}. \tag{10}$$

We first focus on the multiplicative noise case, at the critical weight variance $\sigma_w^2 = 2/\mu_2$. Here, a non-zero bias translates into a second term in (10) that grows linearly with depth. For small initialised biases in typically deep networks this term will be small. For example, an $L$-layer deep neural network with a small bias variance $\sigma_b^2$ will translate into an NNGP covariance matrix with diagonal covariance terms given by $K^0(x, x) + L\sigma_b^2$. Therefore, at depth, the linear growth from a non-zero $\sigma_b^2$ is far less severe than the exponential growth or decay from an incorrect setting of $\sigma_w^2$. In the additive noise case, at its corresponding critical weight variance, the situation is similar, but with an added linear growth in noise. Unfortunately, it is less straightforward to analyse the effects of non-zero biases on the off-diagonal covariance terms.
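A quick numerical illustration of this comparison follows; the depth, bias variance and the size of the miss-specification used below are our own illustrative values.

```python
import numpy as np

def diag_variance_at_depth(k0, ratio, sigma_b2, depth):
    # Closed form of the expansion in (9)/(10): K^L = ratio^L * K^0 + sigma_b2 * sum_j ratio^j,
    # where ratio = sigma_w^2 * mu2 / 2 for multiplicative noise.
    powers = ratio ** np.arange(depth)
    return ratio ** depth * k0 + sigma_b2 * powers.sum()

k0, sigma_b2, depth = 1.0, 0.01, 100
for ratio in [0.95, 1.0, 1.05]:          # 5% below, at, and 5% above criticality
    print(ratio, diag_variance_at_depth(k0, ratio, sigma_b2, depth))
# At criticality the bias adds only depth * sigma_b2 = 1.0 to K^0, whereas a 5%
# miss-setting of sigma_w^2 shrinks or inflates the K^0 term by a factor of
# roughly 0.006 or 131 at this depth.
```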