1 Introduction
The high expressivity of deep NNs [1] can be combined with the predictive ability of GP regression by taking advantage of the equivalence between GPs and fully-connected, infinitely-wide, deep NNs with random weights and biases. This equivalence was first proved by [2] for shallow NNs. In a subsequent paper, the authors of [3] obtained analytically the covariance functions of the GPs induced by shallow NNs with two specific activation functions: the error function and the Gaussian function. More recently, the authors of [4] extended the work of [3] to deep NNs with generic activation functions, with reference to the mean-field theory of signal propagation [1, 5]. The term NNGP was coined by the authors of [4], but so far this method, despite its potential merits, has not been used extensively in applications. In particular, the authors of [4] validated the NNGP on two image recognition datasets, MNIST (handwritten digits) and CIFAR-10 (color images), and observed that NNGP outperforms finite-width NNs in terms of classification accuracy. The primary objective of our work is to exploit the flexibility of NNGP to approximate smooth and discontinuous functions, and furthermore to assess the effectiveness of NNGP in solving linear and nonlinear PDEs, compared to the GP and deep NN methods formulated recently in [6, 7, 8]. Specifically, GP regression has already been applied successfully to the solution of PDEs, including linear and nonlinear forward problems [6] as well as inverse problems [9, 10]. GP regression can bypass the discretization of differential operators by properly placing GP priors, and it can easily handle noisy data as well as the associated uncertainty propagation in time-dependent problems. Hence, the motivation of the current research is to evaluate whether NNGP indeed possesses not only the high expressivity of deep NNs but also the uncertainty quantification property of GPs. To this end, we perform three-way comparisons between NNGP, GP, and deep NN for prototypical problems involving function approximation and PDE solution.
The present paper makes two main contributions.
First, to pursue high expressivity, we increase the number of hyperparameters of NNGP by assuming that the prior variances of weights and biases vary layer by layer and that the weight variances between the input layer and the first hidden layer differ across the neurons of the input layer. We then train the resulting NNGP regression model by maximum likelihood estimation. We note that the authors of [4] assumed only two hyperparameters, one weight variance and one bias variance, which are kept fixed across layers; the two hyperparameters were adjusted to achieve the best performance on a validation dataset, but no training was performed in [4].
Second, we derive the analytical covariance function for the NNGP induced by the deep NN with the error-function nonlinearity. This is motivated by the observation that the analytical covariance function induced by the ReLU-activated NN, given in [4], cannot be used to solve the two-dimensional Poisson equation: the Laplacian of that covariance function with respect to one of its two-dimensional input vectors, which has to be computed in order to solve the problem (see Eq. (18)), tends to infinity as the two input points approach each other.
The paper is organized as follows. In Section 2, we review the equivalence between GPs and infinitely wide NNs, which is the cornerstone of NNGP; we change the notation for weight and bias variances compared to the previous work [4] in order to reflect a different hyperparameter setup. Section 3 introduces the analytically tractable covariance functions induced by deep NNs with the ReLU and error-function nonlinearities; we employ a version of Gaussian quadrature, proposed in [4], to validate numerically the two analytical covariance functions. The methodology of GP regression for function approximation and PDE solution is briefly reviewed in Section 4. Section 5 compares the accuracy of GP, NNGP, and NN in function approximation and PDE solution; the uncertainty estimates of GP and NNGP are also presented there. Finally, we conclude in Section 6 with a brief summary.
2 Equivalence between GPs and infinitely wide NNs
In this section, we adopt nearly the same notation as [4], except for the weight and bias variances. We add a subscript to the weight and bias variances to indicate the layer index, and a further subscript to the first-layer weight variances to distinguish the weights coming from different neurons of the input layer.
The fully-connected neural network can be represented as in Fig. 1. The equivalence relies on two assumptions:
(1) Weight and bias parameters are all random variables. For the layers beyond the first hidden layer, the weights are independent and identically distributed (i.i.d.) with zero mean and a common per-layer variance, and the biases are i.i.d. zero-mean Gaussian random variables; the weights and biases are independent of each other. Note that in [4] the weight and bias variances are kept fixed across layers, whereas here we assume that the variances vary layer by layer and thereby increase the number of variances to be determined. For the weights connecting the input layer to the first hidden layer, we require only independence across the input neurons, and the corresponding biases are independent zero-mean Gaussian random variables. Unlike [4], we remove the assumption that these first-layer weights are identically distributed. They do not have to be identically distributed, because we cannot invoke the Central Limit Theorem for the input of the first hidden layer: the width of the input layer is finite and will never tend to infinity. Allowing the weight variances to differ across input neurons, we again increase the number of variances to be determined in NNGP.
(2) The widths of all hidden layers tend to infinity. This allows one to use the Central Limit Theorem to deduce the equivalence relation. It follows from [3] and [11] that, for the analytical derivation of the covariance function, the distribution of the weights feeding the first hidden layer must be Gaussian. The specific forms of the weight distributions in the deeper layers do not really matter, since by the Central Limit Theorem one always obtains a Gaussian distribution whatever those distributions are.
Each component of the input of the second hidden layer is computed as
(1) 
In the rest of the paper, we use bold letters to represent column vectors and matrices. The dependence on the input vector is emphasized in the above equation. Since the weights and biases are identically distributed across the hidden neurons, the outputs of the first hidden layer are i.i.d. with respect to the hidden-neuron index. Because the input of the second hidden layer is a sum of i.i.d. terms, the Central Limit Theorem implies that it becomes Gaussian distributed as the width of the first hidden layer tends to infinity. Similarly, from the multidimensional Central Limit Theorem, any finite collection of these inputs, evaluated at arbitrary input vectors, forms a joint multivariate Gaussian distribution, which is exactly the definition of a Gaussian process.
Writing a GP in terms of its mean function and covariance function, we conclude that each component of the input of the second hidden layer is a GP; moreover, the different components are independent GPs with the same mean and covariance functions. The mean function of this GP is zero, due to the zero means of the weight and bias distributions, and the covariance function is computed as
(2) 
In the above derivation, we use the independence between different weights (expressed through a Kronecker delta), between weights and biases, and between the weights and the activations of the previous layer. The last equality in Eq. (2) comes from the independence of the first-hidden-layer outputs with respect to the hidden-neuron index. The expectation is taken over the distribution of the first-layer weights and biases. Actually, the input of the first hidden layer is itself a GP only if that distribution is Gaussian; in this case, the covariance function of the GP is
(3) 
The input vector components are deterministic, and thus the expectation over them in Eq. (3) reduces to the product of the corresponding components of the two input vectors.
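As a concrete illustration of this base case (the layer and neuron subscripts below are our notational assumption rather than a verbatim copy of the paper's symbols), the covariance of the first-hidden-layer input then reduces to
\[
K^{0}(\mathbf{x},\mathbf{x}') \;=\; \sigma_{b,0}^{2} \;+\; \sum_{j=1}^{d} \sigma_{w,0,j}^{2}\, x_{j}\, x_{j}' ,
\]
where $d$ is the input dimension and the per-neuron variances $\sigma_{w,0,j}^{2}$ reflect the relaxed i.i.d. assumption discussed above.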
3 Two NN-induced covariance kernels of GP
The arguments of the previous section can be extended to deeper layers by induction. For any deeper layer, using the multidimensional Central Limit Theorem, we conclude that both the pre-activations and the post-activations are GPs. Specifically, the pre-activation has a zero mean function and a covariance function computed as
(4) 
The expectation in Eq. (4) is taken over the GP governing the previous layer, and it amounts to integrating over the joint distribution of two Gaussian random variables, namely the previous-layer pre-activation evaluated at the two input vectors. The covariance function is rewritten as
(5) 
where the density function of the joint distribution of these two random variables is
(6) 
with the covariance matrix
(7) 
Eq. (5) is actually an iteration formula, given by
(8) 
where a diagonal matrix formed from the weight variances enters the iteration. Note that the kernel is symmetric with respect to its two arguments. The trivariate function appearing in the iteration represents the twofold integral in Eq. (5), and it depends on the specific form of the activation function. For certain activation functions, this trivariate function is analytically tractable. Two iteration formulas that are analytically derivable are given in the next two subsections.
3.1 Kernel from the ReLU nonlinearity
Letting the activation function be the rectifier (ReLU), we can derive the analytical expression for the twofold integral, and the corresponding iteration formula is [4, 11]
(9) 
Here we have increased the number of undetermined variances compared to the original formula in [4], in which a single weight variance and a single bias variance are shared by all layers. In total, the undetermined parameters comprise the layer-wise weight variances, the layer-wise bias variances, and the per-neuron weight variances of the input layer. These parameters (a.k.a. hyperparameters in GP regression) can be learned by maximum likelihood estimation.
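For orientation, the widely quoted arc-cosine form of this ReLU recursion [4, 11] is sketched below; the layer-wise variance subscripts follow our convention and are an assumption, not a verbatim copy of Eq. (9):
\[
K^{l}(\mathbf{x},\mathbf{x}') = \sigma_{b,l}^{2} + \frac{\sigma_{w,l}^{2}}{2\pi}\,\sqrt{K^{l-1}(\mathbf{x},\mathbf{x})\,K^{l-1}(\mathbf{x}',\mathbf{x}')}\;\Bigl(\sin\theta^{\,l-1} + (\pi-\theta^{\,l-1})\cos\theta^{\,l-1}\Bigr),
\qquad
\theta^{\,l-1} = \arccos\frac{K^{l-1}(\mathbf{x},\mathbf{x}')}{\sqrt{K^{l-1}(\mathbf{x},\mathbf{x})\,K^{l-1}(\mathbf{x}',\mathbf{x}')}} .
\]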
Since the trivariate function is not analytically tractable for a generic nonlinearity, the authors of [4] developed an efficient Gaussian quadrature scheme to approximate the twofold integral in Eq. (5). Fig. 2 (left) compares the numerical results computed by the Gaussian quadrature scheme with the analytical results computed by Eq. (9), showing good agreement.
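The following Python sketch illustrates how such a quadrature-based kernel iteration can be implemented; the function names, the per-layer variance arrays, and the quadrature order are our assumptions, not the actual code of [4].

```python
import numpy as np

def _ee_diag(phi, kxx, deg):
    """E[phi(u)^2] for u ~ N(0, kxx), by 1D Gauss-Hermite quadrature."""
    t, w = np.polynomial.hermite.hermgauss(deg)
    u = np.sqrt(2.0 * kxx) * t
    return np.sum(w * phi(u) ** 2) / np.sqrt(np.pi)

def _ee_cross(phi, kxx, kxy, kyy, deg):
    """E[phi(u) phi(v)] for (u, v) ~ N(0, [[kxx, kxy], [kxy, kyy]])."""
    t, w = np.polynomial.hermite.hermgauss(deg)
    L = np.linalg.cholesky(np.array([[kxx, kxy], [kxy, kyy]]) + 1e-10 * np.eye(2))
    z1, z2 = np.meshgrid(np.sqrt(2.0) * t, np.sqrt(2.0) * t, indexing="ij")
    u = L[0, 0] * z1
    v = L[1, 0] * z1 + L[1, 1] * z2
    return np.sum(np.outer(w, w) * phi(u) * phi(v)) / np.pi

def nngp_kernel(x, xp, phi, sw0_2, sb0_2, sw2, sb2, deg=30):
    """Iterate the NNGP covariance for two input vectors x and xp.
    sw0_2: per-input-neuron weight variances of the first layer (length d);
    sb0_2: first-layer bias variance; sw2[l], sb2[l]: weight/bias variances
    of the subsequent layers (our layer-wise hyperparameter setup)."""
    kxx = sb0_2 + np.sum(sw0_2 * x * x)
    kxy = sb0_2 + np.sum(sw0_2 * x * xp)
    kyy = sb0_2 + np.sum(sw0_2 * xp * xp)
    for sw, sb in zip(sw2, sb2):
        e_xy = _ee_cross(phi, kxx, kxy, kyy, deg)
        e_xx = _ee_diag(phi, kxx, deg)
        e_yy = _ee_diag(phi, kyy, deg)
        kxy, kxx, kyy = sb + sw * e_xy, sb + sw * e_xx, sb + sw * e_yy
    return kxy
```

Here, for instance, phi = lambda u: np.maximum(u, 0.0) would mimic the ReLU recursion of Section 3.1, and phi = scipy.special.erf the error-function recursion of Section 3.2, so the analytical formulas can be cross-checked against quadrature values in the spirit of Fig. 2.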
3.2 Kernel from the error-function nonlinearity
The error-function nonlinearity also yields an analytical iteration formula, just as the ReLU nonlinearity does. However, the existing formula, given in [3], holds only for a single hidden layer. The present paper extends the single-hidden-layer case to the multi-layer case using the strategy presented in Section 2.3 of [11]. The resulting iteration formula is
(10) 
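For orientation, iterating the single-layer result of [3] over layers in the manner of [11] yields an arcsine-type recursion of the following form; as before, the layer-wise variance subscripts are our assumption rather than the exact statement of (10):
\[
K^{l}(\mathbf{x},\mathbf{x}') = \sigma_{b,l}^{2} + \frac{2\,\sigma_{w,l}^{2}}{\pi}\,\arcsin\frac{2\,K^{l-1}(\mathbf{x},\mathbf{x}')}{\sqrt{\bigl(1+2K^{l-1}(\mathbf{x},\mathbf{x})\bigr)\bigl(1+2K^{l-1}(\mathbf{x}',\mathbf{x}')\bigr)}} .
\]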
To validate the above formula, we compare its analytical results with the numerical results from the aforementioned Gaussian quadrature scheme; as seen in Fig. 2 (right), the new iteration formula (10) is correct.
We refer to GPs whose kernels are induced iteratively by NNs as "NNGPs" for short. An NNGP is actually a special case of a GP, but we still distinguish it from a standard GP equipped with an ad hoc kernel (e.g., the squared exponential and Matern kernels). An exception is the output of the first hidden layer: this output is also a GP, and its covariance function is [3, 12]
(11) 
We still regard the GP with this kernel as a standard GP rather than an NNGP, as the kernel corresponds to an incomplete, shallow NN without an output layer. In Section 5.2.2 we compare the predictive ability of this kernel with our generic kernel computed by formula (10).
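For reference, the single-hidden-layer error-function kernel of [3], corresponding to (11), has the well-known closed form sketched below (up to the prefactor set by the output-layer weight variance; the notation is ours), with the input vector augmented by a unit component to absorb the bias:
\[
K(\mathbf{x},\mathbf{x}') \;\propto\; \frac{2}{\pi}\arcsin\frac{2\,\tilde{\mathbf{x}}^{\mathsf{T}}\boldsymbol{\Sigma}\,\tilde{\mathbf{x}}'}{\sqrt{\bigl(1+2\,\tilde{\mathbf{x}}^{\mathsf{T}}\boldsymbol{\Sigma}\,\tilde{\mathbf{x}}\bigr)\bigl(1+2\,\tilde{\mathbf{x}}'^{\mathsf{T}}\boldsymbol{\Sigma}\,\tilde{\mathbf{x}}'\bigr)}},
\qquad \tilde{\mathbf{x}} = (1, x_{1},\dots,x_{d})^{\mathsf{T}},
\]
where $\boldsymbol{\Sigma}$ denotes the covariance matrix of the input-to-hidden weights and bias.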
4 GP regression
The essence of GP regression is the conditional distribution of the quantities of interest given the known observations, where both are Gaussian random vectors. Thanks to the analytical tractability of the Gaussian distribution, we can derive analytically the mean vector and covariance matrix of the conditional distribution. Suppose that the observations and the quantities of interest have a joint multivariate Gaussian distribution with zero mean vector and the blockwise covariance matrix
(12) 
the conditional distribution is also a multivariate Gaussian distribution with the mean vector
(13) 
and the covariance matrix
(14) 
(see (A.5) and (A.6) of [12]). Zero-mean Gaussian white noise is assumed in the observations.
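A minimal Python sketch of the conditioning step in (13)-(14) is given below, assuming a Cholesky-based solve; the variable names are illustrative and not taken from the paper.

```python
import numpy as np

def gp_posterior(K_oo, K_oq, K_qq, y_obs, noise_var):
    """Posterior mean and covariance of the quantities of interest given
    noisy observations, via the standard Gaussian conditioning formulas."""
    n = K_oo.shape[0]
    L = np.linalg.cholesky(K_oo + noise_var * np.eye(n))     # C = K + sigma_n^2 I = L L^T
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_obs))  # C^{-1} y
    mean = K_oq.T @ alpha                                     # conditional mean, Eq. (13)
    V = np.linalg.solve(L, K_oq)
    cov = K_qq - V.T @ V                                      # conditional covariance, Eq. (14)
    return mean, cov
```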
Note that the observations and the quantities of interest are usually evaluated at specific time-space coordinates. To preserve a multivariate Gaussian distribution (12) at arbitrary time-space coordinates, the observations and the quantities of interest can be seen as samples of GPs indexed by arbitrary time and/or space coordinates. In particular, for solving PDEs, the observations consist of samples from different GPs. Different covariance functions will capture different trends and features that the observations exhibit.
Before computing the mean and covariance of the conditional distribution from formulas (13) and (14), we first need to learn the parameters hidden in the covariance function as well as the noise variance. These parameters are also known as hyperparameters and can be learned by maximum likelihood estimation (MLE) [12, 13]. In practice, we find the optimal hyperparameters (the undetermined parameters in the covariance function together with the noise variance) that minimize the negative log marginal-likelihood [12]
(15) 
where the determinant of the covariance matrix of the observations (or training data) appears, and this covariance matrix depends on the kernel hyperparameters and the noise variance.
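A corresponding sketch of the objective (15), again using a Cholesky factorization and illustrative names:

```python
import numpy as np

def negative_log_marginal_likelihood(K, y_obs, noise_var):
    """0.5*y^T C^{-1} y + 0.5*log|C| + 0.5*n*log(2*pi), with C = K + sigma_n^2 I."""
    n = len(y_obs)
    L = np.linalg.cholesky(K + noise_var * np.eye(n))
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_obs))
    return 0.5 * y_obs @ alpha + np.sum(np.log(np.diag(L))) + 0.5 * n * np.log(2.0 * np.pi)
```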
In what follows, we briefly review how different formulations of the observations enable GP regression to solve different problems, including function approximation and PDE solution.
4.1 Function approximation
We observe noisy values of an unknown scalar-valued function, where the noise is Gaussian and white. The objective is to approximate the unknown function given these observations. After prescribing the covariance function for computing the covariance matrix and then training the regression model by MLE, we obtain the function approximation from formula (13). Moreover, the variance computed from (14) yields the confidence interval. Collecting the evaluation points (usually called training inputs) in a matrix, we formulate the covariance matrix in (12) by:
(16) 
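For orientation, (16) has the standard block structure sketched below (the notation is ours), with the training inputs collected in $\mathbf{X}$ and the prediction points in $\mathbf{X}_{*}$:
\[
\mathbf{K} =
\begin{pmatrix}
K(\mathbf{X},\mathbf{X}) + \sigma_{n}^{2}\mathbf{I} & K(\mathbf{X},\mathbf{X}_{*})\\
K(\mathbf{X}_{*},\mathbf{X}) & K(\mathbf{X}_{*},\mathbf{X}_{*})
\end{pmatrix}.
\]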
4.2 PDE solution
First we consider the following time-independent linear PDE
(17) 
where the differential operator is linear. The source term and the boundary condition are the only information we know, and thus we place them in the observations: we select a set of boundary points as the evaluation points for the boundary condition and a set of domain points as the evaluation points for the source term, each contaminated by Gaussian white noise with its own variance. We assume that the solution is a GP with zero mean and a prescribed covariance function; the source term is then another GP, induced by the operator acting on the solution GP, whose covariance function is written as [6]
(18) 
Collecting the boundary evaluation points and the domain evaluation points in two matrices, we formulate the covariance matrix in (12) as [6]
(19) 
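A sketch of the observation block of this covariance matrix, following the construction in [6] (the operator placement is standard, but the notation is our assumption):
\[
\begin{pmatrix}
K_{uu}(\mathbf{X}_{b},\mathbf{X}_{b}) + \sigma_{n_b}^{2}\mathbf{I} & \mathcal{L}_{\mathbf{x}'}K_{uu}(\mathbf{X}_{b},\mathbf{X}_{d})\\
\mathcal{L}_{\mathbf{x}}K_{uu}(\mathbf{X}_{d},\mathbf{X}_{b}) & \mathcal{L}_{\mathbf{x}}\mathcal{L}_{\mathbf{x}'}K_{uu}(\mathbf{X}_{d},\mathbf{X}_{d}) + \sigma_{n_f}^{2}\mathbf{I}
\end{pmatrix},
\]
where $K_{uu}$ is the covariance function of the solution GP, $\mathbf{X}_{b}$ and $\mathbf{X}_{d}$ are the boundary and domain evaluation points, and $\sigma_{n_b}^{2}$, $\sigma_{n_f}^{2}$ are the corresponding noise variances; the remaining blocks of (12) involve $K_{uu}$ and $\mathcal{L}_{\mathbf{x}'}K_{uu}$ evaluated between the prediction points and the training points.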
For time-dependent problems, the above GP regression model also works, as we can regard the time variable as an additional dimension of the space-time domain (see Section 4.3.1 of [6]).
An alternative way to handle time-dependent problems is numerical GP regression [7]. Consider the following time-dependent PDE
(20) 
The differential operator can be nonlinear (as in Burgers' equation). We denote its linearized counterpart accordingly; for a linear operator, the two coincide. The use of the backward Euler scheme leads to a relation between the solutions at two successive time steps. According to the known information from the initial and boundary conditions, we define the observations at the current time step as the boundary data together with the solution values at the previous time step, the latter obtained from the initial condition or from the previous prediction. Thus, we actually apply GP regression at each time step. Assuming the solution at the current time step is a GP with zero mean and a prescribed covariance kernel, and collecting the boundary evaluation points selected at the current time step and the domain evaluation points sampled at the previous time step in two matrices, we formulate the covariance matrix as [7]
(21) 
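For orientation, assuming the PDE is written as $u_t + \mathcal{L}_{\mathbf{x}} u = 0$ (our sign convention), the backward Euler step underlying this construction reads
\[
u^{n}(\mathbf{x}) + \Delta t\,\mathcal{L}_{\mathbf{x}} u^{n}(\mathbf{x}) = u^{n-1}(\mathbf{x}),
\]
so that a GP prior placed on the solution at the current time step induces, through the linear(ized) operator, the covariances between the current and previous time steps that fill the blocks of (21).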
Generally, we let the covariance function have the same form at every time step, but with its own undetermined parameters at each step. To account for the propagation of uncertainty during time marching, we need to take into account the uncertainty of the predictions made at the previous time step. We therefore rewrite the predictive (posterior) variance in (14) as [7]
(22) 
5 Numerical results
We perform a three-way comparison between NNGP, NN, and GP for function approximation and PDE solution on prototypical problems. The hyperparameters of NNGP and GP are trained by minimizing the negative log marginal-likelihood function. The conjugate gradient algorithm (see the function minimize() in the GPML Matlab code, version 3.6 [14]) is employed for the optimization. For NN, the network parameters (weights and biases) are trained by the L-BFGS-B algorithm provided in the TensorFlow package.
Because of the nonconvex nature of the objective function, to avoid getting trapped in local minima, we run our optimization code ten times with different initializations. Among the ten groups of optimized hyperparameters, for GP/NNGP we choose the one yielding the smallest negative log marginal-likelihood, and for NN we select the one producing the lowest loss. The initializations of the hyperparameters for GP/NNGP are taken as the first ten entries of the Halton quasi-random sequence [15], and these initializations are kept deterministic. The initializations of the network weights are obtained from Xavier's initialization algorithm [16], which is provided by the TensorFlow package. Also, for each initialization at most 200 function evaluations are allowed in the conjugate gradient search for GP/NNGP. For NNGP, a central finite difference scheme is used to compute the gradients of the covariance function with respect to the hyperparameters, while for GP the analytical derivatives are employed. Additionally, we compute the relative error with respect to the reference function values or PDE solutions evaluated at test points in order to quantify the accuracy of function approximation and PDE solution.
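The multi-restart training loop can be sketched in Python as follows; the SciPy optimizer and the unit-cube Halton initializations are illustrative stand-ins for the GPML routine used in practice, and all names are our assumptions.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import qmc

def train_hyperparameters(nlml, dim, n_restarts=10, maxiter=200):
    """Multi-restart minimization of the negative log marginal-likelihood.
    `nlml` maps a hyperparameter vector (e.g., log-variances) to a scalar;
    maxiter caps the iterations as a stand-in for the evaluation budget."""
    inits = qmc.Halton(d=dim, scramble=False).random(n_restarts)  # deterministic starts
    best = None
    for x0 in inits:
        res = minimize(nlml, x0, method="CG", options={"maxiter": maxiter})
        if best is None or res.fun < best.fun:
            best = res
    return best.x, best.fun
```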
5.1 Function approximation
The covariance matrices of GP/NNGP are formulated according to (16).
5.1.1 Step function
We approximate a step function, which is piecewise constant with a single jump discontinuity. Ten evenly distributed training inputs and 100 equispaced test points are chosen. Fig. 3 shows the approximation by GPs with two commonly used kernels: the squared exponential (SE) and the Matern kernel [12]. The first kernel describes well data exhibiting a smooth trend, while the second is suitable for functions of finite regularity. However, for the step function, a nonsmooth function, the performance of both kernels is poor. The NNGP predictions shown in Figs. 4 and 5 are much better: away from the discontinuity, the NNGP approximation is accurate with small uncertainty, and the ReLU-induced kernel even succeeds in capturing the discontinuity. An important observation is that the ReLU-induced kernel is more suitable for nonsmooth functions than the error-function-induced kernel. Another observation regarding NNGP is that deepening the inducing NN does not change the approximation accuracy for the step function. Figs. 6 and 7 show the predictions of the NN, which are less accurate than NNGP's. Unlike GP/NNGP, the NN does not quantify any uncertainty in its approximation.
5.1.2 Hartmann 3D function
The Hartmann function is frequently used for testing global optimization algorithms; here we consider the trivariate case. This function is much smoother than the step function, and thus we can expect GP to perform well. To test the approximation accuracy, we first generate a set of points by taking the first entries of the Halton sequence, then randomly permute the points, and finally select the first 70% of the points as training points and the remaining 30% as test points.
Fig. 8 compares NN, NNGP, and GP in terms of approximation accuracy. Since the Hartmann 3D function is smooth, the NNGP with the error-function-induced kernel and the GP with the squared exponential kernel are both very accurate. In contrast to the nonsmooth step function case, the ReLU-induced kernel is not well suited to this smooth function. Additionally, the GP and NNGP both outperform the NN. It is observed again that the NN does not give any uncertainty estimate, whereas GP and NNGP do. Another observation is that increasing the depth of the inducing NN in NNGP does not change the accuracy.
5.2 PDE solution
We use the proposed covariance function derived from the error-function nonlinearity to solve the following two PDEs; that is, we replace the covariance function in Eq. (19) as well as in Eq. (21) with the new kernel (10). The derivation of the kernel's derivatives for the case of the Poisson equation is given in the Appendix.
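Since the high-order kernel derivatives needed for the Poisson case are error-prone to derive by hand, a simple finite-difference cross-check such as the sketch below can be useful; the helper name and step size are our choices, not part of the paper.

```python
import numpy as np

def laplacian_x(kernel, x, xp, h=1e-4):
    """Central-difference Laplacian of kernel(x, xp) with respect to the first
    two-dimensional argument x; useful for checking analytical derivative formulas."""
    lap = 0.0
    for j in range(2):
        e = np.zeros(2)
        e[j] = h
        lap += (kernel(x + e, xp) - 2.0 * kernel(x, xp) + kernel(x - e, xp)) / h ** 2
    return lap
```

Applying it once in each argument gives a numerical estimate of the operator-operator entries appearing in (19).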
5.2.1 2D Poisson equation
Consider the equation
(23) 
with two fabricated solutions, subject to Dirichlet boundary conditions. The second solution is more complex than the first, and thus we use more training points for it. The training inputs in the domain are chosen as the first entries of the Halton sequence, and the boundary training inputs are equispaced on the boundary, as shown in Fig. 9.
The covariance matrices of GP/NNGP are obtained from (19). Figs. 10 and 11 give the error plots of NN, NNGP, and GP for fabricated solutions 1 and 2, respectively. In total, 441 evenly distributed test points are used to evaluate the relative error. The NN results are obtained with the approach proposed in [8]; to keep the comparison fair, the same training inputs are used in the NN case. It is observed from the figures that, for both fabricated solutions, the approximation accuracies of the GP and NNGP are comparable. Also, the NN accuracy is nearly one order of magnitude lower than that of GP/NNGP.
Fig. 12 shows the uncertainty estimates of the GP and the one-layer NNGP evaluated on a cut line of the square domain, for the case of fabricated solution 2. We see that for both methods the uncertainty is reduced as the number of training points increases. Moreover, the uncertainty is strongly correlated with the prediction error, as smaller uncertainty corresponds to lower error. Note that the uncertainty at the endpoints of the cut line is exactly zero, due to the boundary condition.
5.2.2 1D Burgers’ equation
Consider the equation [7]
(24) 
with a prescribed viscosity; the initial condition is also prescribed. After linearization, in which the nonlinear convection term is evaluated using the mean solution computed at the previous time step by (13), we obtain the linearized differential operator. The covariance matrices of GP/NNGP are formulated by (21).
The time step size is fixed. To evaluate the relative error at each time step, we place 400 equispaced test points in the space domain. We also solve the same equation using the NN approach proposed in [8]. One main difference between the NN approach and numerical GP/NNGP regression is the sampling strategy for the training inputs: for NN, we sample the training inputs in the time-space domain, whereas for GP/NNGP, we sample the training inputs only in the space domain and then perform time marching. To make the comparison as fair as possible, we sample 10000 training points for NN, since for GP/NNGP we have at most 100 (number of time steps) x 100 (number of training inputs in space) = 10000 time-space sampling points. The Latin hypercube sampling strategy is adopted for the training inputs in NN, GP, and NNGP. To accelerate the training of numerical GP/NNGP, the initial guess of the hyperparameters at the current time step is taken as the optimized hyperparameters attained at the previous time step. This strategy is reasonable since, for a small time step, two successive GPs are highly correlated and therefore the respective hyperparameters should be close.
The kernel (11), rather than the SE kernel used for the Poisson equation, is adopted for the GP. We choose a nonstationary kernel instead of a stationary one because the solution exhibits a discontinuity that cannot be well captured by a stationary kernel. Additionally, the NN approach employs a network with four hidden layers of width 40 and the hyperbolic tangent nonlinearity.
We first consider the case where the initial condition is noise-free. Figs. 13, 14, and 15 show the numerical solutions computed by the GP, the NNGP with one hidden layer, and the NNGP with three hidden layers, respectively. We can see that the NNGP attains higher solution accuracy than the GP. Importantly, and differently from the previous examples, increasing the depth of the inducing NN improves the NNGP results in this example. The high accuracy of NNGP is attributed to its larger number of hyperparameters compared to the GP. More hyperparameters presumably imply higher expressivity for function approximation, which means that the kernel can approximate a wider spectrum of functions. For simple functions or solutions (as in the Hartmann 3D function and Poisson equation examples), the advantage of higher expressivity is not fully realized, but for complex ones, such as the step function and the Burgers' equation considered here, the advantage is clearly seen.
The NN results are shown in Fig. 16. Although the NN produces sufficiently accurate solutions, it does not give any uncertainty estimate. Fig. 17 collects the errors of GP, NNGP, and NN for the noise-free case. The NNGP and NN achieve errors of the same order of magnitude, while the GP performs worst. This is because NNGP and NN both have a large set of parameters to be tuned and thus possess higher expressivity.
Next we consider the case where the initial condition is contaminated by zero-mean Gaussian white noise. The NN approach for the noisy case is beyond the scope of the present paper, since without proper regularization (such as dropout [18], early stopping, or weighted L1/L2 penalties), the NN will easily overfit. Unlike the NN, GP and NNGP inherently include a model complexity penalty in the negative log marginal-likelihood and run less risk of overfitting (see the discussion in Section 5.4.1 of the book [12]).
Numerical solutions computed by the GP and NNGP are plotted in Figs. 18, 19, and 20. The NNGP still attains higher solution accuracy than the GP, owing to its higher expressivity. The data noise amplifies the uncertainty, represented by the orange shaded region. GP and NNGP handle noise well, because the noise variance can be learned directly from the training data and the corresponding uncertainty is quantified by the conditional (or posterior) distribution. Fig. 21 compares the solution accuracy of GP and NNGP: for long-term simulations, the accuracy of the NNGP is roughly one order of magnitude higher than that of the GP.
For NNGP, increasing the depth of the inducing NN does not guarantee an increase in accuracy. For example, the one-layer NNGP with 101 training points outperforms the three-layer NNGP with 101 training points, as shown in Fig. 21. A larger depth amounts to a larger number of hyperparameters, which merely increases the possibility of fitting complex functions better. In fact, the depth can itself be seen as another hyperparameter of NNGP.
6 Summary
A larger number of hyperparameters enables NNGP to achieve higher or comparable accuracy for both smooth and nonsmooth functions in comparison with GP, as can be seen from Table 1. The deep NN that induces the NNGP contributes the prior variances of its network parameters (weights and biases) as well as its depth to the hyperparameter list of NNGP, which endows NNGP with high expressivity. On the other hand, NNGP is able to estimate the uncertainty of its predictions, which is crucial for handling noisy data and for active learning [19]. Unlike the standard NN, a Bayesian NN can also provide uncertainty estimates; however, the conventional Bayesian NN [20] can be time-consuming to train due to the approximation of a high-dimensional integral over weight space.

Table 1: Summary of the comparisons. The last three columns list the kernel or nonlinearity used by each method.

Function/Solution | Feature   | Accuracy comparison | GP          | NNGP     | NN
Step function     | Nonsmooth | NNGP > {GP, NN}     | SE/Matern   | erf/ReLU | erf/ReLU
Hartmann 3D       | Smooth    | {NNGP, GP} > NN     | SE/Matern   | erf/ReLU | erf/ReLU
Poisson eq.       | Smooth    | {NNGP, GP} > NN     | SE          | erf      | erf/tanh
Burgers' eq.      | Nonsmooth | {NNGP, NN} > GP     | kernel (11) | erf      | tanh
Due to the need for inverting the covariance matrix, NNGP has cubic time complexity in the size of the training set (see Table 2). In Table 2, for GP and NNGP the cost is expressed in terms of the training set size and the number of evaluations of the negative log marginal-likelihood performed by the conjugate gradient algorithm. For NN, an accurate estimate of the time complexity of training is still an open question [21]. Generally, the training of a fully-connected NN is faster than that of a GP, because no matrix inversion is required: for each training point, the forward and backward propagation cost is linear in the total number of weights of the network. For NN, the relevant data size is the batch size; in this paper we fed the whole training set to the optimization algorithm, so the batch size equals the size of the training set, and the cost also scales with the number of iterations of the optimization algorithm. In the numerical examples of this paper, the training set size does not exceed 1000. However, for very large datasets, GP and NNGP will be less attractive than NN. In future work, we intend to leverage the scalable GPs recently developed in [22, 23, 24] to tackle large-dataset problems.
Table 2: Uncertainty quantification capability and training cost of the three methods.

            | GP                                                    | NNGP                                                  | NN
Uncertainty | ✓                                                     | ✓                                                     |
Cost        | cubic in training set size, per likelihood evaluation | cubic in training set size, per likelihood evaluation | linear in number of weights, per training point and iteration
Acknowledgements
We thank Mr. Dongkun Zhang and Dr. Maziar Raissi for useful discussions. This work was supported by AFOSR (FA9550-17-1-0013) and NSF (DMS-1736088). The first author was also supported by the National Natural Science Foundation of China (11701025). The second author was also supported by a gift from the Swiss company BHRobotics.
References
 [1] Ben Poole, Subhaneil Lahiri, Maithra Raghu, Jascha Sohl-Dickstein, and Surya Ganguli. Exponential expressivity in deep neural networks through transient chaos. In Advances in neural information processing systems, pages 3360–3368, 2016.
 [2] Radford M Neal. Bayesian learning for neural networks. PhD thesis, University of Toronto, 1995.
 [3] Christopher KI Williams. Computing with infinite networks. In Advances in neural information processing systems, pages 295–301, 1997.
 [4] Jaehoon Lee, Yasaman Bahri, Roman Novak, Samuel S Schoenholz, Jeffrey Pennington, and Jascha Sohl-Dickstein. Deep neural networks as Gaussian processes. arXiv preprint arXiv:1711.00165, 2017.
 [5] Haim Sompolinsky, Andrea Crisanti, and Hans-Jurgen Sommers. Chaos in random neural networks. Physical Review Letters, 61(3):259, 1988.
 [6] Maziar Raissi, Paris Perdikaris, and George Em Karniadakis. Inferring solutions of differential equations using noisy multifidelity data. Journal of Computational Physics, 335:736–746, 2017.
 [7] Maziar Raissi, Paris Perdikaris, and George Em Karniadakis. Numerical Gaussian processes for time-dependent and nonlinear partial differential equations. arXiv preprint arXiv:1703.10230, 2017.
 [8] Maziar Raissi, Paris Perdikaris, and George Em Karniadakis. Physics-informed deep learning (Part I): Data-driven solutions of nonlinear partial differential equations. arXiv preprint arXiv:1711.10561, 2017.
 [9] Guofei Pang, Paris Perdikaris, Wei Cai, and George Em Karniadakis. Discovering variable fractional orders of advection-dispersion equations from field data using multi-fidelity Bayesian optimization. Journal of Computational Physics, 348:694–714, 2017.
 [10] Maziar Raissi, Paris Perdikaris, and George Em Karniadakis. Machine learning of linear differential equations using Gaussian processes. Journal of Computational Physics, 348:683–693, 2017.

 [11] Youngmin Cho and Lawrence K Saul. Kernel methods for deep learning. In Advances in neural information processing systems, pages 342–350, 2009.
 [12] Carl Edward Rasmussen and Christopher KI Williams. Gaussian processes for machine learning. The MIT Press, 2006.
 [13] Loic Le Gratiet. Multi-fidelity Gaussian process regression for computer experiments. PhD thesis, Université Paris Diderot (Paris VII), 2013.
 [14] Carl Edward Rasmussen and Hannes Nickisch. Gaussian processes for machine learning (GPML) toolbox. Journal of Machine Learning Research, 11(Nov):3011–3015, 2010.
 [15] Ladislav Kocis and William J Whiten. Computational investigations of low-discrepancy sequences. ACM Transactions on Mathematical Software (TOMS), 23(2):266–294, 1997.

 [16] Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the thirteenth international conference on artificial intelligence and statistics, pages 249–256, 2010.
 [17] Cea Basdevant, M Deville, P Haldenwang, JM Lacroix, J Ouazzani, R Peyret, P Orlandi, and AT Patera. Spectral and finite difference solutions of the Burgers equation. Computers & Fluids, 14(1):23–41, 1986.
 [18] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929–1958, 2014.
 [19] Eric Brochu, Vlad M Cora, and Nando De Freitas. A tutorial on Bayesian optimization of expensive cost functions, with application to active user modeling and hierarchical reinforcement learning. arXiv preprint arXiv:1012.2599, 2010.
 [20] Radford M Neal. Bayesian learning for neural networks, volume 118. Springer Science & Business Media, 2012.
 [21] Le Song, Santosh Vempala, John Wilmes, and Bo Xie. On the complexity of learning neural networks. arXiv preprint arXiv:1707.04615, 2017.
 [22] James Hensman, Nicolo Fusi, and Neil D Lawrence. Gaussian processes for big data. arXiv preprint arXiv:1309.6835, 2013.
 [23] Sivaram Ambikasaran, Daniel Foreman-Mackey, Leslie Greengard, David W Hogg, and Michael O'Neil. Fast direct methods for Gaussian processes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(2):252–265, 2016.
 [24] Alexander Litvinenko, Ying Sun, Marc G Genton, and David Keyes. Likelihood approximation with hierarchical matrices for large spatial datasets. arXiv preprint arXiv:1709.04419, 2017.
Appendix: Derivatives of NNGP kernel from the error-function nonlinearity
Absorbing the constant coefficients into the variances, the kernel derived from the error-function nonlinearity, namely (10), can be further simplified to
(A1) 
To solve PDEs, we need to compute the derivatives of the kernel. We take the 2D Poisson equation as an example, for which we need the explicit forms of the Laplacian of the kernel with respect to one input vector as well as of the Laplacian taken with respect to both input vectors; partial derivatives up to fourth order therefore need to be derived. Denoting the iterated term in the above formula by a trivariate function, the use of the chain rule yields