1 Introduction
Modern deep neural networks (NNs) are powerful tools for modeling highly complex functions. However, deep NNs lack natural ways of incorporating uncertainty estimation, and (approximate) Bayesian inference for NNs remains a challenge. In contrast, nonparametric approaches such as Gaussian Processes (GPs) provide exact Bayesian inference and well-calibrated uncertainty estimates, but typically consider substantially simpler models than deep NNs. Therefore, a large body of work has recently emerged attempting to combine parametric deep learning models and GPs so as to derive benefits from both. These approaches include deep GPs
(Damianou and Lawrence, 2013; Duvenaud et al., 2014; Hensman and Lawrence, 2014; Dai et al., 2015; Bui et al., 2016; Salimbeni and Deisenroth, 2017), deep kernel learning (Wilson et al., 2016a,b; Al-Shedivat et al., 2016) and viewing deep learning with dropout as an approximate deep GP (Gal and Ghahramani, 2016). For shallow infinite-width neural networks, an exact equivalence to GPs has been known for some time (Neal, 1994; Williams, 1997; Le Roux and Bengio, 2007). However, this equivalence has only recently been extended to deeper architectures (Hazan and Jaakkola, 2015; Lee et al., 2018; Matthews et al., 2018; Novak et al., 2019). Referred to as neural network Gaussian processes (NNGPs) in Lee et al. (2018), the resulting models are GPs with an exact correspondence to infinitely wide deep neural networks. Importantly, the NNGP depends on the hyperparameters of the network and its initialisation, which determine the network's signal propagation dynamics.
In deep neural networks, signal propagation has been shown to exhibit distinct phases depending on the initialisation of the network (Poole et al., 2016). These phases include ordered and chaotic regimes, associated with vanishing and exploding gradients respectively, which can result in poor network performance (Schoenholz et al., 2017). Initialising at the critical boundary between these two regimes, known as the "edge of chaos", improves the flow of information through the network, often resulting in faster and deeper training for a variety of architectures (Pennington et al., 2017; Yang and Schoenholz, 2017; Chen et al., 2018; Xiao et al., 2018).
Lee et al. (2018) showed that NNGPs tend to inherit the above behaviour from their corresponding randomly initialised networks. In particular, there exists an interaction between poor signal propagation and a poorly constructed kernel. As a result, the performance of NNGPs tends to suffer if their kernels are constructed using kernel parameters that correspond to network initialisations far from the critical boundary. Furthermore, even at the critical boundary, inputs to a neural network can still become asymptotically correlated at large depth (Schoenholz et al., 2017). The rate of convergence in this correlation limits the depth to which networks can be trained, because after this convergence the network is unable to distinguish between different training observations at the output layer. This dependence on depth (in the constructed kernel) for performance also manifests in NNGPs (Lee et al., 2018).
Various design decisions are required to instantiate a modern NN. Important decisions for trainability and test performance often include both initialisation and regularisation. If initialised poorly, a network might become untrainable due to poor signal propagation, whereas a lack of regularisation could hurt the test performance of the network if it starts to overfit. Commonly used approaches to alleviating these issues include principled initialisation schemes (Glorot and Bengio, 2010; He et al., 2015) and improved regularisation strategies. Among the most successful regularisation strategies is dropout (Srivastava et al., 2014), a form of noise regularisation where scaled Bernoulli noise is applied multiplicatively to the units of a network to prevent co-adaptation. However, as shown by Pretorius et al. (2018), these components do not act in isolation and therefore the initialisation of the network should depend on the amount of noise regularisation being applied.
In this paper, we investigate the following research question: do the signal propagation dynamics that influence noise regularised NNs also govern the behaviour of corresponding “noisy NNGPs”? In the presence of multiplicative noise regularisation, Pretorius et al. (2018)
derived the critical initialisation for stable signal propagation in feedforward ReLU networks. More specifically, the authors showed that stable propagation is achieved by setting all unit biases to zero and sampling the weights from zero-mean Gaussians with variance set equal to $\sigma_w^2/D_{l-1} = 2/(D_{l-1}\mu_2)$. Here, $D_{l-1}$ is the number of incoming units to the layer and $\mu_2 = \mathbb{E}[\epsilon^2]$ is the second moment of the noise. For example, when using dropout, $\mu_2 = 1/p$ (where $p$ is the probability of keeping a unit active) and therefore the initial weights are sampled from $\mathcal{N}(0, 2p/D_{l-1})$. Furthermore, it was shown that the rate of convergence to a fixed-point correlation between inputs increases as a function of the amount of regularisation being applied. Consequently, increased noise further limits the depth to which neural networks can be trained. In this paper, we investigate whether these findings for noise regularised networks carry across to their noisy NNGP counterparts.

We consider noise regularised fully-connected feedforward NNs and study the behaviour of noisy NNGPs. Our analysis is done in expectation over the noise, under a general noise model (of which dropout is a special case). We give the kernel corresponding to noisy ReLU NNGPs and highlight the different noise-induced degeneracies in the kernel as the depth becomes large. Specifically, we show that the above noise-dependent initialisations promoting stable signal propagation in noisy ReLU NNs correspond exactly to the kernel parameters exhibiting good performance in noisy NNGPs, as shown in Figure 1. However, even at criticality, we show that as the noise tends to infinity the covariance of the NNGP becomes diagonal. As a result, noise regularisation translates into a stronger prior for simple functions away from the training points. Finally, we verify our findings with experiments on real-world and synthetic datasets.
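As a concrete illustration, the noise-dependent initialisation described above can be sketched in a few lines of NumPy. The function names and dropout setup below are ours, not code from the paper:

```python
import numpy as np

def critical_weight_std(fan_in, mu2):
    """Per-weight standard deviation at the critical point for noisy ReLU
    networks: weight variance sigma_w^2 / fan_in with sigma_w^2 = 2 / mu2,
    where mu2 is the second moment of the injected multiplicative noise."""
    return np.sqrt(2.0 / (mu2 * fan_in))

def init_layer(rng, fan_in, fan_out, dropout_keep_prob=1.0):
    """Sample one layer's parameters at the noise-dependent critical point.
    For dropout with keep probability p (kept units rescaled by 1/p),
    the noise second moment is mu2 = 1/p, giving N(0, 2p/fan_in) weights."""
    mu2 = 1.0 / dropout_keep_prob
    W = rng.normal(0.0, critical_weight_std(fan_in, mu2), size=(fan_out, fan_in))
    b = np.zeros(fan_out)  # biases are set to zero at criticality
    return W, b
```

For instance, with `dropout_keep_prob=0.5` and `fan_in=1000`, each weight is drawn from a Gaussian with variance $2 \times 0.5 / 1000 = 0.001$.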
2 Noise regularised deep neural networks as Gaussian processes
We consider a noise regularised fully-connected deep feedforward neural network. Given an input $a^0 = x$, we inject noise into the model via

$$h^l = W^l\big(a^{l-1} \diamond \epsilon^{l-1}\big) + b^l, \qquad a^l = \phi(h^l), \qquad (1)$$

using the operator $\diamond$ to denote either addition or multiplication, where $\epsilon^{l-1}$ is an input noise vector, sampled from a pre-specified noise distribution. The noise is assumed to have $\mathbb{E}[\epsilon] = 0$ in the additive case, and $\mathbb{E}[\epsilon] = 1$ for multiplicative noise distributions, such that in both cases $\mathbb{E}[\tilde{x}] = x$ for a noise-corrupted input $\tilde{x}$. The weights $W^l$ and biases $b^l$ are sampled i.i.d. from zero-mean Gaussian distributions with variances $\sigma_w^2/D_{l-1}$ and $\sigma_b^2$, respectively, where $D_{l-1}$ denotes the dimensionality of hidden layer $l-1$ in the network. Each hidden layer's activations $a^l$ are computed element-wise using an activation function $\phi$. Lastly, we denote the second moment of the noise as $\mu_2 = \mathbb{E}[\epsilon^2]$ and define $f$ as the entire function mapping from input to output, with $f(x) = h^L(x)$.

By choosing parameter sampling distributions at initialisation we are implicitly specifying a prior over networks in parameter space. We now transition from parameter space to function space by instead specifying a prior directly over function values. Assume a training set of $n$ input-output pairs $\{(x_i, y_i)\}_{i=1}^{n}$. If we can show that $f = [f(x_1), \dots, f(x_n)]$ is Gaussian distributed at initialisation, then the distribution over the output of the network at these points is completely determined by the second-order statistics $\mathbb{E}[f(x)]$ and $\mathbb{E}[f(x)f(x')]$, defining the following GP
$$f \sim \mathcal{GP}\big(\mathbb{E}[f(x)],\ \mathbb{E}[f(x)f(x')]\big). \qquad (2)$$
We begin by assuming the following additive error model with regression outcomes $y_i = f(x_i) + \xi_i$, where $\xi_i \sim \mathcal{N}(0, \sigma_\xi^2)$.¹ Since the errors are independent Gaussians, the joint distribution over all outcomes is

$$y \,|\, f \sim \mathcal{N}\big(f, \sigma_\xi^2 I_n\big), \qquad (3)$$

where $y = [y_1, \dots, y_n]^T$. In GP regression we are interested in finding the marginal distribution

$$p(y) = \int p(y \,|\, f)\, p(f)\, df. \qquad (4)$$

¹Note that here we consider scalar outputs, i.e. $D_L = 1$, hence $f(x_i) \in \mathbb{R}$. Also, the additive error noise $\xi$ should not be confused with the injected noise $\epsilon$ in (1).
We proceed as in Lee et al. (2018) to argue that $p(f)$ is in fact a zero-mean Gaussian (we refer the reader to Matthews et al. (2018) for a more formal approach) and derive the elements of the covariance matrix in (2) for noise regularised deep neural networks. Subsequently, we obtain an expression for (4) by combining (2) and (3) and using standard results for the marginal of a Gaussian.
To show that the expected distribution of $f$ over the injected noise is Gaussian, we first note that, conditioned on the inputs, the "output" units at layer $l$, stemming from the post-activations in the previous layer, are given by $h_j^l = \sum_{i=1}^{D_{l-1}} W_{ji}^l\,(a_i^{l-1} \diamond \epsilon_i^{l-1}) + b_j^l$, for $j = 1, \dots, D_l$. As previously mentioned, we sample the weights and biases i.i.d. from a zero-mean Gaussian and define the noise to be i.i.d. such that $h_j^l$ is unbiased in expectation over the injected noise. Therefore, in a wide network, $h_j^l$ is a sum of a large collection of i.i.d. random variables. As $D_{l-1} \to \infty$, the central limit theorem ensures that the distribution of $h_j^l$ will tend to a Gaussian with zero mean and covariance $K^l$ over the inputs. As a result, the function values can be treated as samples from a GP given by $f \sim \mathcal{N}(0, \Sigma)$. Here, $\Sigma$ is an $nD_L \times nD_L$ covariance matrix given by $\Sigma = K^L \otimes I_{D_L}$, where $\otimes$ is the Kronecker product. The kernel function depends on the layer depth, the scale of the weights and biases and the amount of noise regularisation being applied. A schematic illustration of the covariance matrix is given in Figure 2 for the simple case of only two inputs and two output units. To derive the elements of the covariance, consider the units $h_j^l(x)$, $h_{j'}^l(x')$ and inputs $x, x'$, which give

$$\mathbb{E}\big[h_j^l(x)\, h_{j'}^l(x')\big] = \left(\sigma_b^2 + \frac{\sigma_w^2}{D_{l-1}} \sum_{i=1}^{D_{l-1}} \mathbb{E}\big[(a_i^{l-1}(x) \diamond \epsilon_i)\,(a_i^{l-1}(x') \diamond \epsilon_i')\big]\right)\delta_{jj'}.$$

Note that $\mathbb{E}[h_j^l(x)\, h_{j'}^l(x')] = 0$ for $j \neq j'$, due to the independence between the incoming connections (weights) associated with each output unit. Therefore, we only consider the case where $j = j'$, which for $x \neq x'$ gives

$$K^l(x, x') = \sigma_b^2 + \sigma_w^2\, \mathbb{E}\big[\phi\big(h^{l-1}(x)\big)\,\phi\big(h^{l-1}(x')\big)\big],$$

where the expectation is taken with respect to $h^{l-1} \sim \mathcal{N}(0, K^{l-1})$. The final equality follows from applying the above argument recursively for the previous layer $l-1$. For the case of $x = x'$ (and $j = j'$), we have that the diagonal components of the covariance matrix are given by

$$K^l(x, x) = \sigma_b^2 + \sigma_w^2\,\mu_2\, \mathbb{E}\big[\phi\big(h^{l-1}(x)\big)^2\big],$$

where the influence of the noise is explicitly expressed through its second moment $\mu_2$. Using the initial condition $K^1(x, x') = \sigma_b^2 + \frac{\sigma_w^2}{D_0}\,\mathbb{E}\big[(x \diamond \epsilon) \cdot (x' \diamond \epsilon')\big]$ and letting each layer width in the network tend to infinity in succession, this recursive construction gives $f$ as Gaussian distributed with zero mean and covariance

$$\Sigma = K^L \otimes I_{D_L}. \qquad (5)$$
Finally, combining (2), (3) and (5) and using standard results for the marginal of a Gaussian distribution, the marginal in (4) can be shown to be

$$y \sim \mathcal{N}\big(0,\ K^L + \sigma_\xi^2 I_n\big),$$

where $\big[K^L + \sigma_\xi^2 I_n\big]_{ij} = K^L(x_i, x_j) + \sigma_\xi^2\,\delta_{ij}$, with $\delta_{ij}$ the Kronecker delta (Williams and Rasmussen, 2006). Therefore, together with the additive noise level $\sigma_\xi^2$, our model for the joint distribution over training outcomes is fully determined by the equivalent kernel corresponding to a layer-wise recursion of an infinite basis function expansion. This kernel, in turn, depends on the parameterisation of the network and the amount of injected noise.
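The central-limit construction above can be checked numerically: drawing many wide, randomly initialised noisy ReLU networks and estimating the output statistics over the draws should yield an approximately zero-mean Gaussian. The sketch below is ours, using an illustrative one-hidden-layer setup with dropout-style input noise at the critical scaling, not the paper's experimental code:

```python
import numpy as np

def empirical_output_stats(x1, x2, width=2048, n_draws=1000, p_keep=0.8, seed=0):
    """Monte Carlo check of the GP construction: sample random one-hidden-layer
    ReLU networks with multiplicative dropout noise on the inputs, and estimate
    the mean and covariance of the scalar outputs f(x1), f(x2) over the draws.
    Uses sigma_w^2 = 2/mu2 and sigma_b^2 = 0 (an illustrative sketch)."""
    rng = np.random.default_rng(seed)
    X = np.stack([x1, x2])                                # (2, d)
    d = X.shape[1]
    mu2 = 1.0 / p_keep                                    # second moment of dropout noise
    sw2 = 2.0 / mu2                                       # critical weight variance
    outs = np.empty((n_draws, 2))
    for i in range(n_draws):
        W1 = rng.normal(0.0, np.sqrt(sw2 / d), (width, d))
        eps = rng.binomial(1, p_keep, X.shape) / p_keep   # multiplicative noise draw
        h = np.maximum(W1 @ (X * eps).T, 0.0)             # (width, 2) hidden activations
        w2 = rng.normal(0.0, np.sqrt(sw2 / width), width)
        outs[i] = w2 @ h                                  # scalar output per input
    return outs.mean(axis=0), np.cov(outs, rowvar=False)
```

Over many draws, the empirical mean is close to zero and the empirical covariance plays the role of the kernel evaluated on the two inputs.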
Having developed our noisy NNGP model, we next discuss predicting outcomes for unseen test data points. To make a prediction, we evaluate the predictive distribution at a new test point $x^*$. Consider the joint distribution

$$\begin{bmatrix} y \\ f^* \end{bmatrix} \sim \mathcal{N}\left(0,\ \begin{bmatrix} K^L + \sigma_\xi^2 I_n & k_* \\ k_*^T & k_{**} \end{bmatrix}\right),$$

where we can partition the covariance as shown, with $k_* = [K^L(x^*, x_1), \dots, K^L(x^*, x_n)]^T$ an $n$-dimensional vector and $k_{**} = K^L(x^*, x^*)$. Using standard results for the conditional distribution of a partitioned Gaussian vector, we find

$$f^* \,|\, y \sim \mathcal{N}\big(\bar{\mu},\ \bar{\sigma}^2\big), \qquad (6)$$

where $\bar{\mu} = k_*^T \big(K^L + \sigma_\xi^2 I_n\big)^{-1} y$ and $\bar{\sigma}^2 = k_{**} - k_*^T \big(K^L + \sigma_\xi^2 I_n\big)^{-1} k_*$. This result is the function space equivalent to exact Bayesian inference in parameter space: by computing the conditional in (6), we are implicitly performing an integration over the posterior of the parameters associated with an infinitely wide noise regularised neural network (Williams, 1997).
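Given a kernel matrix, the conditional in (6) reduces to standard Gaussian algebra. A minimal sketch (ours, using a Cholesky solve rather than an explicit matrix inverse for numerical stability):

```python
import numpy as np

def gp_posterior(K_train, k_star, k_star_star, y, sigma_eps2):
    """Posterior predictive mean and variance for GP regression:
    mean = k_*^T (K + sigma_eps2 * I)^{-1} y
    var  = k_** - k_*^T (K + sigma_eps2 * I)^{-1} k_*
    K_train is the n x n kernel matrix on the training inputs."""
    n = K_train.shape[0]
    A = K_train + sigma_eps2 * np.eye(n)
    L = np.linalg.cholesky(A)                        # A = L L^T
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    v = np.linalg.solve(L, k_star)
    mean = k_star @ alpha
    var = k_star_star - v @ v
    return mean, var
```

With a small additive noise level, evaluating the predictive distribution at a training input recovers the observed target with near-zero variance, as expected from exact inference.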
In the next section, we study how the properties of the kernel derived in this section depend on the parameters of the network when using ReLU activations. Furthermore, for the remainder of the paper we drop the dependence on the hidden units and training set indices for ease of notation and simply refer to the kernel as $K^l$.
3 Kernel parameters and critical neural network initialisation
We begin by examining the interaction between the parameters of the noisy NNGP kernel and its corresponding network initialisation. Specifically, we focus on ReLU activations and show that the kernel parameter values that lead to non-degenerate kernels for deep noisy NNGPs are exactly those that promote stable signal propagation in noise regularised ReLU networks.
Let $\phi(h) = \max(0, h)$ be the ReLU activation and define the correlation $c^l = K^l(x, x')\big/\sqrt{K^l(x, x)\,K^l(x', x')}$ with angle $\theta^l = \arccos(c^l)$; then the elements of the covariance at a hidden unit are

$$K^l(x, x') = \sigma_b^2 + \frac{\sigma_w^2}{2\pi} \sqrt{K^{l-1}(x, x)\,K^{l-1}(x', x')}\,\Big[\sin\theta^{l-1} + \big(\pi - \theta^{l-1}\big)\cos\theta^{l-1}\Big] \qquad (7)$$

for $x \neq x'$, with diagonal elements given by

$$K^l(x, x) = \sigma_b^2 + \frac{\sigma_w^2 \mu_2}{2}\, K^{l-1}(x, x). \qquad (8)$$

These formulae are the kernel equivalent of the signal propagation recurrences derived in Pretorius et al. (2018) for noisy ReLU networks. For no noise, and outside the context of GPs, a similar result can be found in Cho and Saul (2009). Repeated substitution in (8) shows that

$$K^l(x, x) = \left(\frac{\sigma_w^2 \mu_2}{2}\right)^l K^0(x, x) + \sigma_b^2 \sum_{j=0}^{l-1} \left(\frac{\sigma_w^2 \mu_2}{2}\right)^j. \qquad (9)$$
The limiting properties of the kernel are seen by letting $l \to \infty$ in (9). In this limit, several degenerate kernels arise, analogous to cases of unstable signal propagation dynamics in mean-field theory and other related work (Poole et al., 2016; Daniely et al., 2016; Schoenholz et al., 2017; Pretorius et al., 2018). We provide the different cases in Table 1. For any amount of additive noise, all possible settings (see A.1 and A.2) of the kernel parameters $\sigma_w^2$ and $\sigma_b^2$ in (9) will result in a degenerate kernel in the limit of infinite depth. The situation is similar for multiplicative noise, except for the case (M.5), where $\sigma_w^2 = 2/\mu_2$ and $\sigma_b^2 = 0$. We refer to these parameters in (M.5) as the critical kernel parameters. Here, the diagonal elements of the covariance stay fixed at their initial values even at extreme depth. These parameter values are identical to the proposed initialisations for deep noisy ReLU networks derived in Pretorius et al. (2018).
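The depth behaviour of the diagonal terms under multiplicative noise can be reproduced by iterating the recursion (8) directly. A small sketch (ours, assuming the recursion as stated):

```python
import numpy as np

def diag_kernel_trace(q0, sigma_w2, sigma_b2, mu2, depth):
    """Iterate the diagonal ReLU kernel recursion under multiplicative noise,
    q^l = (sigma_w2 * mu2 / 2) * q^{l-1} + sigma_b2,
    returning the variance at every depth. At the critical parameters
    sigma_w2 = 2/mu2 and sigma_b2 = 0, the variance is exactly preserved;
    other settings explode or vanish exponentially with depth."""
    qs = [q0]
    for _ in range(depth):
        qs.append(sigma_w2 * mu2 / 2.0 * qs[-1] + sigma_b2)
    return np.array(qs)
```

For dropout with keep probability $p = 0.8$ (so $\mu_2 = 1.25$), the critical setting $\sigma_w^2 = 1.6$ keeps the variance fixed, while larger or smaller weight variances give exponential growth or decay.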
From (7) we can see that the off-diagonal elements of the covariance matrix are influenced by the noise level at the critical values through the relationship $\sigma_w^2 = 2/\mu_2$. Furthermore, we note that $K^l(x, x') \to 0$ as $\mu_2 \to \infty$. Therefore, multiplicative noise regularisation has a damping effect on the kernel function evaluated between different inputs, tending towards total dissimilarity and a diagonal covariance. This reduction in the richness of the covariance structure exploitable by the NNGP then enforces a strong prior for simple functions away from the training points. To see this effect, consider the predictive distribution in (6) for a test point $x^*$. For large amounts of noise, $k_* \to 0$ and therefore, in the limit, $\bar{\mu} \to 0$ and $\bar{\sigma}^2 \to k_{**}$. Since the covariance is symmetric positive definite by definition, the predicted outcome will be a sample from a zero-mean Gaussian with maximal uncertainty as measured by the variance $k_{**}$, i.e. $f^* \sim \mathcal{N}(0, k_{**})$.
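The damping of the off-diagonal entries can likewise be sketched by iterating the induced correlation map at criticality. The ReLU cross-expectation below follows Cho and Saul (2009); the per-layer $1/\mu_2$ damping factor is our reading of the mechanism described above, not the paper's exact equation:

```python
import numpy as np

def corr_map(c, mu2):
    """One layer of the input-correlation map for a ReLU NNGP at the critical
    parameters (sigma_w^2 = 2/mu2, sigma_b^2 = 0). For standardised jointly
    Gaussian pre-activations with correlation c, E[relu(u) relu(v)] is
    (sqrt(1 - c^2) + (pi - arccos c) * c) / (2 pi) times the variance; the
    independent noise contributes no mu2 factor off the diagonal, so the
    correlation is damped by 1/mu2 relative to the diagonal."""
    theta = np.arccos(np.clip(c, -1.0, 1.0))
    return (np.sqrt(1.0 - c ** 2) + (np.pi - theta) * c) / (np.pi * mu2)

def limiting_correlation(c0, mu2, depth=500):
    """Deep-network limit of the input correlation under the map above."""
    c = c0
    for _ in range(depth):
        c = corr_map(c, mu2)
    return c
```

Without noise ($\mu_2 = 1$) the correlation converges towards one at large depth, whereas increasing $\mu_2$ pushes the limiting correlation towards zero, i.e. towards a diagonal covariance.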
To validate the above claims, the following section provides an empirical investigation. In particular, we test two hypotheses that stem from the above theoretical analysis, using both realworld and synthetic datasets.
4 Experiments
We have shown how the kernel parameters for a noisy NNGP relate to those for its corresponding deep neural network. In doing so, our discussion has led us to the following testable hypotheses:

Noisy NNGPs perform best at their critical parameters: The sensitivity of the kernel parameters should cause noisy NNGPs to perform poorly at settings far away from the critical kernel parameters. Furthermore, the reliance on these critical values for good performance should become more marked at greater depth [Figures 1 and 3].

Noise constrains the covariance and leads to simpler models away from the training points with larger uncertainty: Even at criticality, noise injection applies a shrinkage effect to the kernel function evaluated between different inputs to the noisy NNGP. This should lead to a constrained covariance structure, or in the limit of a large amount of noise, a completely diagonal covariance. The NNGP prior over functions regularised in this way should lead to simpler models away from the training points with increased estimates of uncertainty [Figures 4 and 5].
H1: We begin by investigating the sensitivity of the kernel parameter values. As shown in Figure 1, we test the performance of deep NNGPs on MNIST and CIFAR-10 with kernels constructed over a grid of weight and bias variance parameters $(\sigma_w^2, \sigma_b^2)$, for varying values of the noise level parameter $\mu_2$. Our approach to classification in this paper is identical to that of Lee et al. (2018), where classification is treated as a regression problem. Specifically, instead of one-hot output vectors, each output vector is recoded as a zero-mean regression target, with the value 0.9 in the index corresponding to the correct class and $-0.1$ in all other indices corresponding to the incorrect classes. The predicted class label for a given input is then simply the index corresponding to the maximum value in the output vector as predicted by the NNGP regression model. Figures 1(b) and (c) confirm our expectations, showing that the kernel parameters corresponding to NNGPs with good performance closely follow the critical initialisation boundary shown in Figure 1(a). As kernels are constructed further away from criticality, their performance starts to deteriorate.
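The recoding step can be sketched as follows (a minimal version of the zero-mean target scheme described above; the helper names are ours):

```python
import numpy as np

def recode_labels(labels, num_classes):
    """Zero-mean regression targets for classification-as-regression
    (following Lee et al. (2018)): 0.9 at the correct class index and
    -0.9/(num_classes - 1) elsewhere (-0.1 for ten classes), so that
    every target vector sums to zero."""
    labels = np.asarray(labels)
    Y = np.full((labels.size, num_classes), -0.9 / (num_classes - 1))
    Y[np.arange(labels.size), labels] = 0.9
    return Y

def predict_classes(Y_pred):
    """Predicted label = index of the maximum regression output."""
    return np.argmax(Y_pred, axis=1)
```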
The sensitivity to the kernel parameters becomes more acute at larger depth, as shown in Figure 3. Panels (a) and (b) plot the results for a shallower network. In this case, a wide band is seen to form around the critical boundary (beige shaded area), with kernel parameters far away from their critical values still able to perform reasonably well. This is no longer the case in panels (c) and (d), where we tested performance at a much greater depth. There, the NNGP is far more sensitive to the kernel parameters and only a few models with kernel parameters very close to the critical boundary are seen to perform well.
H2: For all the models evaluated in H1, we also study the influence of the noise on the kernel as well as on the resulting NNGP covariance matrix. For each model, we plot in Figure 4(a) and (b) the depth evolution of the kernel, using two inputs from the MNIST dataset. In (a) we track the variance of one of the inputs and in (b) the covariance between the two inputs. The dashed orange lines correspond to the critical kernel parameters, with off-critical parameter settings shown in solid red and solid blue. The limiting behaviour described in (M.1), (M.3) and (M.5) in Table 1 is shown in (a), with all kernels tending towards degenerate function mappings, except those evolving under the critical parameters. Furthermore, in (b), we show the damping effect on the kernel at criticality, highlighted by decreasing asymptotes (dashed orange lines around layer 20) as more noise is applied (darker lines).
The depth dynamics of the kernel also constrain the resulting covariance matrix. To see the effect of this, we use the Frobenius norm of the covariance matrix as a proxy for its richness. Figure 4(c) plots the relationship between the covariance norm and test accuracy for all the experiments in H1. Interestingly, the norm seems to suggest a step function relationship. Moving from right to left in (c), we observe a sudden and dramatic drop in performance beyond a certain amount of regularisation, as measured by a decreasing covariance norm. In other words, there seems to exist some requisite amount of information that must be captured by the covariance matrix in order for the NNGP to perform well. This is also the case at criticality: in (c), the orange points correspond to critical kernels, with larger points associated with more noise.
The effect of increased noise regularisation on uncertainty estimation is shown in Figure 4(d), where we plot test accuracy as a function of the mean posterior predictive variance. For NNGPs far away from criticality (blue points in the main plot), we see little correlation between variance and test accuracy. The inset in (d) shows the same plot, but for NNGPs close to their critical parameters. For these models the (negative) correlation is stronger, possibly providing more reliable uncertainty estimates. Here, the green points are low-noise models and the red points are high-noise models. As expected, noise regularisation causes the posterior predictive variance to increase, leading to higher uncertainty. We did the same investigations using the CIFAR-10 dataset and obtained similar results (see Appendix A).
Finally, to gain more insight, we consider a simple one-dimensional regression task.² The top row in Figure 5 shows samples from a 20-layer NNGP prior for (a) no noise, (b) a small amount of noise, and (c) a large amount of noise. We found that a small amount of bias variance $\sigma_b^2$ improved each fit (see Appendix B for a discussion of non-zero biases). The covariance structure corresponding to each NNGP is shown in (d)-(f), located in the middle row of Figure 5. The bottom row, (g)-(i), shows the fit from the posterior predictive (red line) using four training examples (blue dots) sampled from a simple sinusoidal function (green line) with additive observation noise. Moving across the columns from left to right, we find that the samples from the prior become more erratic as the covariance becomes diagonal, which strongly regularises the regression model at previously unseen test points. Note that the model in (i) still corresponds to exact Bayesian inference, but with a strong prior for near-constant functions with high uncertainty away from the training points.

²The example is taken from Chapter 1 of Bishop (2006).
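Panels such as those in the top row of Figure 5 can be generated by sampling from the prior $\mathcal{N}(0, K)$ over a grid of inputs. A generic sketch (ours, not the paper's code):

```python
import numpy as np

def sample_prior_functions(K, n_samples, seed=0, jitter=1e-6):
    """Draw function samples f ~ N(0, K) over a grid of inputs, given a
    covariance matrix K evaluated on that grid. A small jitter is added to
    the diagonal for a stable Cholesky factorisation. As K approaches a
    diagonal matrix (heavy noise), samples become uncorrelated across inputs."""
    n = K.shape[0]
    L = np.linalg.cholesky(K + jitter * np.eye(n))
    z = np.random.default_rng(seed).normal(size=(n, n_samples))
    return L @ z  # each column is one prior function sample
```

Under a smooth covariance, neighbouring grid points are strongly correlated across samples; under a diagonal covariance they are not, which is exactly the erratic behaviour seen in panel (c).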
5 Discussion
We have shown that critical initialisation of noisy ReLU networks corresponds to a choice of optimal kernel parameters in noisy NNGPs and that deviation from these critical parameters leads to poor performance, becoming more severe with depth and the extent of the deviation. In addition, we highlighted the effect of noise on the covariance of a noisy NNGP at criticality, with noise in the limit yielding a fully diagonal covariance, acting as a regulariser on the posterior predictive.
It is interesting to reflect on the connection between deep NNs and GPs in the context of representation learning and noise regularisation. The core assumption in deep learning is that deep NNs learn distributed hierarchical representations useful for modeling the true underlying data generating mechanism, whereas shallow models do not. In these deeper models, noise regularisation is thought to be successful largely because of its influence on representations at different levels of abstraction during the training procedure (Goodfellow et al., 2016). As discussed in previous work (Neal, 1994; Matthews et al., 2018), the kernels associated with NNGPs do not use learned hierarchical representations. Nevertheless, these models are still able to perform as well as, or sometimes better than, their neural network counterparts (Lee et al., 2018). In the infinite width setting, the success of regularisation from noise injection is unlikely to have the same interpretation as in the finite width setting. We note that in the context of NNGPs, noise regularisation has a stronger connection with controlling the length scale parameter in commonly used kernel functions than with regularising through corrupted representations at different levels of abstraction. This connection with the length scale parameter means that noise regularisation in NNGPs may be more accurately interpreted as a useful mechanism for designing priors by controlling the smoothness of the kernel function.
Finally, recent work related to NNGPs has made it possible to accurately model the learning dynamics of deep neural networks by taking a function space perspective of gradient descent training in the infinite width limit (Jacot et al., 2018; Lee et al., 2019). We envision that a similar analysis could be applied to accurately model the learning dynamics of noise regularised deep neural networks.
References

Al-Shedivat, M., Wilson, A. G., Saatchi, Y., Hu, Z. and Xing, E. P. (2017). Learning scalable deep kernels with recurrent structure. Journal of Machine Learning Research.

Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Springer.

Bui, T., Hernández-Lobato, D., Hernández-Lobato, J. M., Li, Y. and Turner, R. (2016). Deep Gaussian processes for regression using approximate expectation propagation. In International Conference on Machine Learning.

Chen, M., Pennington, J. and Schoenholz, S. S. (2018). Dynamical isometry and a mean field theory of RNNs: gating enables signal propagation in recurrent neural networks. In International Conference on Machine Learning.

Cho, Y. and Saul, L. K. (2009). Kernel methods for deep learning. In Advances in Neural Information Processing Systems.

Dai, Z., Damianou, A., González, J. and Lawrence, N. (2015). Variational auto-encoded deep Gaussian processes. arXiv preprint arXiv:1511.06455.

Damianou, A. and Lawrence, N. (2013). Deep Gaussian processes. In Artificial Intelligence and Statistics.

Daniely, A., Frostig, R. and Singer, Y. (2016). Toward deeper understanding of neural networks: the power of initialization and a dual view on expressivity. In Advances in Neural Information Processing Systems.

Duvenaud, D., Rippel, O., Adams, R. and Ghahramani, Z. (2014). Avoiding pathologies in very deep networks. In Artificial Intelligence and Statistics.

Gal, Y. and Ghahramani, Z. (2016). Dropout as a Bayesian approximation: representing model uncertainty in deep learning. In International Conference on Machine Learning.

Glorot, X. and Bengio, Y. (2010). Understanding the difficulty of training deep feedforward neural networks. In Artificial Intelligence and Statistics.

Goodfellow, I., Bengio, Y. and Courville, A. (2016). Deep Learning. MIT Press.

Hazan, T. and Jaakkola, T. (2015). Steps toward deep kernel methods from infinite neural networks. arXiv preprint arXiv:1508.05133.

He, K., Zhang, X., Ren, S. and Sun, J. (2015). Delving deep into rectifiers: surpassing human-level performance on ImageNet classification. In Proceedings of the IEEE International Conference on Computer Vision.

Hensman, J. and Lawrence, N. (2014). Nested variational compression in deep Gaussian processes. arXiv preprint arXiv:1412.1370.

Jacot, A., Gabriel, F. and Hongler, C. (2018). Neural tangent kernel: convergence and generalization in neural networks. In Advances in Neural Information Processing Systems.

Le Roux, N. and Bengio, Y. (2007). Continuous neural networks. In Artificial Intelligence and Statistics.

Lee, J., Bahri, Y., Novak, R., Schoenholz, S. S., Pennington, J. and Sohl-Dickstein, J. (2018). Deep neural networks as Gaussian processes. In International Conference on Learning Representations.

Lee, J., Xiao, L., Schoenholz, S. S., Bahri, Y., Novak, R., Sohl-Dickstein, J. and Pennington, J. (2019). Wide neural networks of any depth evolve as linear models under gradient descent. arXiv preprint arXiv:1902.06720.

Matthews, A. G. de G., Rowland, M., Hron, J., Turner, R. E. and Ghahramani, Z. (2018). Gaussian process behaviour in wide deep neural networks. In International Conference on Learning Representations.

Neal, R. M. (1994). Priors for infinite networks. Technical report no. CRG-TR-94-1, University of Toronto.

Novak, R., Xiao, L., Lee, J., Bahri, Y., Yang, G., Hron, J., Abolafia, D. A., Pennington, J. and Sohl-Dickstein, J. (2019). Bayesian convolutional neural networks with many channels are Gaussian processes. In International Conference on Learning Representations.

Pennington, J., Schoenholz, S. S. and Ganguli, S. (2017). Resurrecting the sigmoid in deep learning through dynamical isometry: theory and practice. In Advances in Neural Information Processing Systems.

Poole, B., Lahiri, S., Raghu, M., Sohl-Dickstein, J. and Ganguli, S. (2016). Exponential expressivity in deep neural networks through transient chaos. In Advances in Neural Information Processing Systems.

Pretorius, A., van Biljon, E., Kroon, S. and Kamper, H. (2018). Critical initialisation for deep signal propagation in noisy rectifier neural networks. In Advances in Neural Information Processing Systems.

Salimbeni, H. and Deisenroth, M. (2017). Doubly stochastic variational inference for deep Gaussian processes. In Advances in Neural Information Processing Systems.

Schoenholz, S. S., Gilmer, J., Ganguli, S. and Sohl-Dickstein, J. (2017). Deep information propagation. In International Conference on Learning Representations.

Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I. and Salakhutdinov, R. (2014). Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research.

Williams, C. K. I. (1997). Computing with infinite networks. In Advances in Neural Information Processing Systems.

Williams, C. K. I. and Rasmussen, C. E. (2006). Gaussian Processes for Machine Learning. MIT Press.

Wilson, A. G., Hu, Z., Salakhutdinov, R. and Xing, E. P. (2016a). Deep kernel learning. In Artificial Intelligence and Statistics.

Wilson, A. G., Hu, Z., Salakhutdinov, R. and Xing, E. P. (2016b). Stochastic variational deep kernel learning. In Advances in Neural Information Processing Systems.

Xiao, L., Bahri, Y., Sohl-Dickstein, J., Schoenholz, S. and Pennington, J. (2018). Dynamical isometry and a mean field theory of CNNs: how to train 10,000-layer vanilla convolutional neural networks. In International Conference on Machine Learning.

Yang, G. and Schoenholz, S. (2017). Mean field residual networks: on the edge of chaos. In Advances in Neural Information Processing Systems.
Appendix
A Additional results
Figure 6 provides additional results using CIFAR-10 instead of MNIST for the experiments presented in Figure 1. The results are similar to those described in the main text for MNIST.
B Kernels with nonzero biases
In our experiments, we noticed that noisy ReLU NNGPs often benefit from small non-zero biases. Therefore, we consider here the implication of non-zero biases for the evolution of the diagonal terms in the NNGP covariance. Recall that the diagonal (variance) terms of the covariance matrix can be expanded as follows:

$$K^l(x, x) = \left(\frac{\sigma_w^2 \mu_2}{2}\right)^l K^0(x, x) + \sigma_b^2 \sum_{j=0}^{l-1} \left(\frac{\sigma_w^2 \mu_2}{2}\right)^j. \qquad (10)$$

We first focus on the multiplicative noise case, at the critical weight variance $\sigma_w^2 = 2/\mu_2$. Here, each factor $\sigma_w^2 \mu_2 / 2$ equals one, so a non-zero bias translates into a second term in (10) that grows linearly with depth, giving $K^L(x, x) = K^0(x, x) + L\sigma_b^2$ at depth $L$. For the small initial bias variances typically used in deep networks, this term will be small. Therefore, at depth, the linear growth from a non-zero $\sigma_b^2$ is far less severe than the exponential growth or decay from an incorrect setting of $\sigma_w^2$. In the additive noise case, the situation is similar, but with an added linear growth in the noise. Unfortunately, it is less straightforward to analyse the effects of non-zero biases on the off-diagonal covariance terms.
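The linear-versus-exponential comparison can be checked directly by iterating the recursion underlying (10); a short sketch (function names are ours):

```python
import numpy as np

def diag_variance(q0, depth, sigma_b2):
    """Closed form of the diagonal recursion at the critical weight variance
    sigma_w^2 = 2/mu2 (multiplicative noise): each layer contributes exactly
    sigma_b2, so q^L = q^0 + L * sigma_b2, i.e. linear growth in depth."""
    return q0 + depth * sigma_b2

def diag_variance_loop(q0, depth, sigma_w2, sigma_b2, mu2):
    """Direct iteration of q^l = (sigma_w2 * mu2 / 2) * q^{l-1} + sigma_b2,
    for comparing critical and off-critical weight variances."""
    q = q0
    for _ in range(depth):
        q = sigma_w2 * mu2 / 2.0 * q + sigma_b2
    return q
```

At criticality, a 100-layer network with a small bias variance only shifts the diagonal by the accumulated bias term, whereas an off-critical weight variance produces exponential blow-up over the same depth.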