Neural-net-induced Gaussian process regression for function approximation and PDE solution

06/22/2018, by Guofei Pang et al., Brown University

Neural-net-induced Gaussian process (NNGP) regression inherits both the high expressivity of deep neural networks (deep NNs) and the uncertainty quantification property of Gaussian processes (GPs). We generalize the current NNGP to first include a larger number of hyperparameters and subsequently train the model by maximum likelihood estimation. Unlike previous works on NNGP that targeted classification, here we apply the generalized NNGP to function approximation and to solving partial differential equations (PDEs). Specifically, we develop an analytical iteration formula to compute the covariance function of the GP induced by a deep NN with an error-function nonlinearity. We compare the performance of the generalized NNGP for function approximation and PDE solution with that of GPs and fully-connected NNs. We observe that for smooth functions the generalized NNGP can yield the same order of accuracy as the GP, while both NNGP and GP outperform the deep NN. For non-smooth functions, the generalized NNGP is superior to the GP and comparable or superior to the deep NN.


1 Introduction

The high expressivity of deep NNs [1] can be combined with the predictive ability of GP regression by taking advantage of the apparent equivalence between GPs and fully-connected, infinitely-wide, deep NNs with random weights and biases. This equivalence was first proved by [2] for shallow NNs. In a subsequent paper, the authors of [3] obtained analytically the covariance functions of the GPs induced by shallow NNs having two specific activation functions: error and Gaussian functions. More recently, the authors of [4] extended [3]'s work to the case of deep NNs having generic activation functions, with reference to the mean field theory of signal propagation [1, 5].

The term NNGP was coined by the authors of [4], but until now this method, despite its potential merits, has not been used extensively in applications. In particular, the authors of [4] validated the NNGP on two image recognition datasets: MNIST (handwritten digits) and CIFAR-10 (color images), and observed that NNGP outperforms finite-width NNs in terms of classification accuracy. The primary objective of our work is to exploit the flexibility of NNGP in order to approximate smooth and discontinuous functions, and furthermore to assess the effectiveness of NNGP for solving linear and nonlinear PDEs, compared to the GP and deep NN methods formulated recently in [6, 7, 8]. Specifically, GP regression has so far been successfully applied to the solution of PDEs, including linear and nonlinear forward problems [6] as well as inverse problems [9, 10]. The use of GP regression can bypass the discretization of differential operators by properly placing GP priors, and can easily handle noisy data as well as the associated uncertainty propagation in time-dependent problems. Hence, the motivation of the current research is to evaluate whether NNGP indeed possesses not only the high expressivity of deep NNs but also the uncertainty quantification property of GPs. To this end, we perform three-way comparisons between NNGP, GP, and deep NN for prototypical problems involving function approximation and PDE solution.

The present paper makes two main contributions.

First, to pursue high expressivity, we increase the number of hyperparameters of NNGP by assuming that the prior variances of weights and biases vary layer by layer and that the weight variances between the input layer and the first hidden layer are different for different neurons in the input layer. We also train the resulting NNGP regression model by using maximum likelihood estimation. We note that the authors of [4] assumed only two hyperparameters: one weight variance and one bias variance, which are kept fixed over different layers; the two hyperparameters were adjusted to achieve the best performance on a validation dataset. However, no training was performed in [4].

Second, we derive the analytical covariance function for the NNGP induced by the deep NN with the error-function nonlinearity. This is motivated by the observation that the analytical covariance function induced by the ReLU-activated NN, given in [4], cannot be used to solve the two-dimensional Poisson equation. Denoting by k(x, x′) the analytical covariance function for the ReLU case, where x and x′ are input vectors of length two, we observed that the Laplacian of the covariance function, which has to be computed in order to solve the problem (see Eq. (18)), tends to infinity as x′ → x.

The paper is organized as follows. In Section 2, we review the equivalence between GPs and infinitely wide NNs, which is a cornerstone of NNGP. We change the notation for the weight and bias variances compared to the previous work [4] in order to show a different hyperparameter setup. Section 3 introduces the analytically tractable covariance functions induced by the deep NNs with the ReLU and error-function nonlinearities. We employ a version of Gaussian quadrature, proposed in [4], to validate numerically the two analytical covariance functions. The methodology on GP regression for function approximation and PDE solution is briefly reviewed and summarized in Section 4. Section 5 compares the accuracy of GP, NNGP, and NN in function approximation and PDE solution. The uncertainty estimates of GP and NNGP are also presented in that section. Finally, we conclude in Section 6 with a brief summary.

Figure 1: A fully connected neural net with L hidden layers. Each unit (neuron) in a hidden layer has one input z and one output y with y = φ(z). Layer 0 is the input layer: its j-th unit carries the j-th component x_j of the input vector x, and the output layer carries the components of the output vector. The dimensions of the input and output spaces are d_in and d_out, respectively, and the width of hidden layer l is N_l. At the center of each unit is the activation function φ that transforms the unit's input into the corresponding output. Between two successive layers, the weight W^l_{ij} denotes the contribution of unit j in layer l−1 to unit i in layer l; layer L+1 is the output layer. The bias b^l_i is attached to unit i in layer l. Note that for clarity most of the connecting arrows between layers are omitted.

2 Equivalence between GPs and infinitely wide NNs

In this section, we adopt nearly the same notation as [4] except for the weight and bias variances. We add a subscript l to the variances σ_w² and σ_b² in order to indicate the layer index, and an extra subscript j to the first-layer weight variances in order to distinguish the weights coming from different neurons in the input layer.

The fully-connected neural net can be represented as in Fig. 1. The equivalence relies on two assumptions. (1) Weight and bias parameters are all random variables. For the layers beyond the first hidden layer, the weights are independent and identically distributed (i.i.d.) with zero mean and the same layer-dependent variance σ_{w,l}², and the biases are i.i.d. Gaussian random variables with zero mean and variance σ_{b,l}². The weights and biases are independent of each other. Note that in [4] the weight and bias variances are kept fixed over different layers, i.e., σ_{w,l}² = σ_w² and σ_{b,l}² = σ_b², but here we assume the variances vary layer by layer, which increases the number of variances to be determined. For the weights connecting the input layer to the first hidden layer, the weights need only be independent, with variance σ_{w,1,j}² for the weights emanating from input neuron j; the corresponding biases are independent Gaussian random variables with variance σ_{b,1}². Unlike [4], we remove the assumption that these first-layer weights are identically distributed. They do not have to be identically distributed, since we cannot invoke the Central Limit Theorem for the input of the first hidden layer: the width of the input layer is finite and never tends to infinity. Allowing the weight variances to differ for different neurons in the input layer again increases the number of variances to be determined in NNGP. (2) The widths of all hidden layers tend to infinity, i.e., N_l → ∞. This allows one to use the Central Limit Theorem to deduce the equivalence relation.

It follows from [3] and [11] that, for the analytical derivation of the covariance function, the distribution of the first-layer weights must be Gaussian. The specific forms of the weight distributions in the deeper layers do not really matter, since by the Central Limit Theorem one always obtains a Gaussian distribution whatever the underlying distribution is.

The i-th component z_i^2(x) of the input of the second hidden layer is computed as

(1)

In the rest of the paper, we use bold letters, e.g., x and W, to represent column vectors or matrices. The dependence on the input vector x is emphasized in the above equation. Since the first-layer weights and biases satisfy the independence assumptions above, the outputs of the first hidden layer are i.i.d. with respect to the hidden-unit index j. Because z_i^2(x) is a sum of i.i.d. terms (the second-layer weights being i.i.d. as well), the Central Limit Theorem yields that z_i^2(x) is Gaussian distributed as N_1 → ∞. Similarly, from the multidimensional Central Limit Theorem, any finite collection of values z_i^2(x^(1)), …, z_i^2(x^(m)) forms a joint multivariate Gaussian distribution, which is exactly the definition of a Gaussian process; here x^(1), …, x^(m) are arbitrary input vectors.
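
As a quick empirical illustration of this argument (not part of the original derivation), one can sample many finite-width random networks with the error-function nonlinearity and check that the second-layer pre-activations at two fixed inputs become jointly Gaussian with a stable covariance as the width grows. The Python sketch below uses arbitrary variance values, and the 1/width scaling of the second-layer weight variance is an assumption made so that the sum in Eq. (1) stays finite in the wide limit.

import numpy as np
from scipy.special import erf

rng = np.random.default_rng(0)
d_in, width, n_nets = 2, 2000, 5000          # input dim, hidden width, number of random nets
sw1, sb1 = 1.6, 0.1                          # first-layer weight/bias variances (arbitrary)
sw2, sb2 = 1.6, 0.1                          # second-layer variances (arbitrary)
x, xp = np.array([1.0, 0.5]), np.array([0.3, -0.7])

z2_x, z2_xp = np.empty(n_nets), np.empty(n_nets)
for k in range(n_nets):
    W1 = rng.normal(0.0, np.sqrt(sw1), size=(width, d_in))   # first-layer weights
    b1 = rng.normal(0.0, np.sqrt(sb1), size=width)
    W2 = rng.normal(0.0, np.sqrt(sw2 / width), size=width)   # scaled so the CLT limit exists
    b2 = rng.normal(0.0, np.sqrt(sb2))
    y1_x, y1_xp = erf(W1 @ x + b1), erf(W1 @ xp + b1)        # first-hidden-layer outputs
    z2_x[k] = W2 @ y1_x + b2                                 # one component of the input
    z2_xp[k] = W2 @ y1_xp + b2                               # of the second hidden layer
# The empirical 2x2 covariance approaches the NNGP kernel values K^2(x,x), K^2(x,x'), K^2(x',x').
print(np.cov(z2_x, z2_xp))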

Denoting by GP(μ, K) a Gaussian process with mean function μ and covariance function K, we learn that z_i^2 ~ GP(0, K^2). Also, the z_i^2 for different i are independent GPs with the same mean and covariance functions. The mean function is zero due to the zero means of the weight and bias distributions, and the covariance function is computed as

(2)

In the above derivation, we use the independence relations between the weights (with δ_{jk} denoting the Kronecker delta), between the weights and the biases, and between the weights and the activation-function outputs. The last equality in Eq. (2) comes from the independence of the first-hidden-layer outputs with respect to the subscript j. The input of the first hidden layer is denoted by z^1(x), and the expectation is taken over the distributions of the first-layer weights and biases. Actually, z_i^1 is a GP only if the first-layer weight distribution is Gaussian. In this case, the covariance function of the GP is

(3)

The input vector components x_j and x′_j are deterministic, and thus the covariance function in Eq. (3) can be written explicitly in terms of the first-layer weight and bias variances.

3 Two NN-induced covariance kernels of GP

The arguments of the previous section can be extended to deeper layers by induction. For layer l (l ≥ 2), using the multidimensional Central Limit Theorem, we learn that the unit inputs z_i^l and the unit outputs y_i^l are both GPs. Specifically, z_i^l has zero mean function and covariance function computed as

(4)

The expectation in Eq. (4) is taken over the GP governing the previous layer, and this amounts to integrating over the joint distribution of the two Gaussian random variables z_i^{l−1}(x) and z_i^{l−1}(x′) given the inputs x and x′. The covariance function is rewritten as

(5)

where the density function of the joint distribution of z_i^{l−1}(x) and z_i^{l−1}(x′) is

(6)

with the covariance matrix

(7)

Eq. (5) is actually an iteration formula, given by

(8)

where the diagonal matrix collects the first-layer weight variances σ_{w,1,j}². Note that K^l(x, x′) = K^l(x′, x) due to the symmetry of the kernel. The trivariate function F_φ represents the two-fold integral in Eq. (5) and depends on the specific form of the activation function φ. For certain φ, the function F_φ is analytically tractable. Two iteration formulas that are analytically derivable are given in the next two subsections.
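
For orientation, the recursion just described can be written compactly as below. The notation (K^l for the layer-l kernel, F_φ for the two-fold integral, σ_{w,l}², σ_{b,l}², σ_{w,1,j}² for the variances) is the one assumed in this section, and the exact placement of the variance factors should be checked against Eqs. (3) and (8); this is a sketch, not a restatement of those equations.

\[
K^{1}(\mathbf{x},\mathbf{x}') = \sigma_{b,1}^{2} + \sum_{j=1}^{d_{\mathrm{in}}} \sigma_{w,1,j}^{2}\, x_j x'_j,
\qquad
K^{l}(\mathbf{x},\mathbf{x}') = \sigma_{b,l}^{2} + \sigma_{w,l}^{2}\,
F_{\phi}\!\left(K^{l-1}(\mathbf{x},\mathbf{x}'),\,K^{l-1}(\mathbf{x},\mathbf{x}),\,K^{l-1}(\mathbf{x}',\mathbf{x}')\right),
\quad l \ge 2 .
\]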

3.1 Kernel from the ReLU nonlinearity

Letting the activation function be the rectifier φ(z) = max(0, z), we can derive the analytical expression for F_φ, and the corresponding iteration formula is [4, 11]

(9)

Here we have increased the number of undetermined variances compared to the original formula in [4], in which the weight and bias variances are shared across all layers. The undetermined parameters consist of the layer-wise weight variances σ_{w,l}², the layer-wise bias variances σ_{b,l}², and the per-input-neuron weight variances σ_{w,1,j}² of the first layer. These parameters (a.k.a. hyperparameters in GP regression) can be learned by using maximum likelihood estimation.
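
A compact implementation of this recursion, using the arc-cosine form of the ReLU integral from [4, 11] together with the per-layer variances described above, might look as follows; the variable names and the exact placement of the variance factors are our assumptions and should be checked against Eq. (9).

import numpy as np

def nngp_relu_kernel(x, xp, sw1, sb1, sw, sb):
    """ReLU-induced NNGP kernel value K(x, xp).
    sw1 : per-input-neuron first-layer weight variances (array of length d_in)
    sb1 : first-layer bias variance
    sw, sb : weight/bias variances of the deeper layers (one entry per layer)."""
    # First-layer kernel values (the Eq. (3)-type expression assumed in Section 2).
    kxx = sb1 + np.sum(sw1 * x * x)
    kpp = sb1 + np.sum(sw1 * xp * xp)
    kxp = sb1 + np.sum(sw1 * x * xp)
    for sw_l, sb_l in zip(sw, sb):
        # Arc-cosine formula for E[max(0,u) max(0,v)] under the bivariate Gaussian of Eq. (6).
        theta = np.arccos(np.clip(kxp / np.sqrt(kxx * kpp), -1.0, 1.0))
        kxp = sb_l + sw_l / (2.0 * np.pi) * np.sqrt(kxx * kpp) * (np.sin(theta) + (np.pi - theta) * np.cos(theta))
        # Diagonal entries follow from theta = 0; the simultaneous assignment keeps the old values on the RHS.
        kxx, kpp = sb_l + 0.5 * sw_l * kxx, sb_l + 0.5 * sw_l * kpp
    return kxp

# Illustrative call with the variance values of Fig. 2 and a three-hidden-layer inducing NN.
x, xp = np.array([1.0, 0.5]), np.array([0.3, -0.7])
print(nngp_relu_kernel(x, xp, sw1=np.array([1.6, 1.6]), sb1=0.1, sw=[1.6, 1.6, 1.6], sb=[0.1, 0.1, 0.1]))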

Since the trivariate function F_φ is not analytically tractable for a generic nonlinearity, the authors of [4] developed an efficient Gaussian quadrature scheme to approximate the two-fold integral in Eq. (5). Fig. 2 (left) compares the numerical results computed by the Gaussian quadrature scheme with the analytical results computed by Eq. (9), showing good agreement.
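
The check in Fig. 2 can be reproduced in spirit with an off-the-shelf Gauss–Hermite rule: the two-fold integral in Eq. (5) is the expectation of φ(u)φ(v) under a bivariate Gaussian with covariance matrix (7), which can be evaluated after a Cholesky change of variables. The sketch below is not the tailored quadrature of [4]; it is a generic alternative with all numerical values chosen arbitrarily.

import numpy as np

def expect_phi_phi(phi, kxx, kxp, kpp, deg=60):
    """Gauss-Hermite estimate of E[phi(u) phi(v)], (u, v) ~ N(0, [[kxx, kxp], [kxp, kpp]])."""
    t, w = np.polynomial.hermite.hermgauss(deg)            # nodes/weights for weight exp(-t^2)
    z1, z2 = np.meshgrid(np.sqrt(2) * t, np.sqrt(2) * t)   # standard-normal nodes
    w2d = np.outer(w, w) / np.pi                           # 2-D weights for N(0, I)
    A = np.linalg.cholesky(np.array([[kxx, kxp], [kxp, kpp]]))
    u = A[0, 0] * z1
    v = A[1, 0] * z1 + A[1, 1] * z2
    return np.sum(w2d * phi(u) * phi(v))

# Check one step of the recursion against the analytical ReLU (arc-cosine) result.
relu = lambda z: np.maximum(z, 0.0)
kxx, kxp, kpp = 1.7, 0.9, 1.7
theta = np.arccos(kxp / np.sqrt(kxx * kpp))
analytic = np.sqrt(kxx * kpp) / (2 * np.pi) * (np.sin(theta) + (np.pi - theta) * np.cos(theta))
print(expect_phi_phi(relu, kxx, kxp, kpp), analytic)       # the two numbers should agree closely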

Figure 2: Comparison of the analytical and numerical iteration formulas for the covariance functions from the ReLU (left) and error-function (right) nonlinearities. The covariance function of the GP is computed analytically and numerically for successive layers. For ease of comparison, the input vectors are normalized to unit length, and the covariance is plotted against the angle θ between the two vectors. All the weight variances are set to 1.6 and the bias variances to 0.1.

3.2 Kernel from the error-function nonlinearity

The error-function nonlinearity can also yield an analytical iteration formula, just as the ReLU nonlinearity does. However, the existing formula, given in [3], holds only for a single hidden layer. The present paper extends the single-hidden-layer case to the multi-layer case using the strategy presented in Section 2.3 of [11]. The resulting iteration formula is

(10)

To validate the above formula, we compare its analytical results with the numerical results from the aforementioned Gaussian quadrature scheme; as seen in Fig. 2 (right), the new iteration formula (10) is correct.
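
For completeness, we record the form that such a recursion takes when the single-hidden-layer result of [3] is promoted layer by layer with our layer-dependent variances. The precise constants should be checked against Eq. (10); the expression below is a sketch rather than a restatement of it.

\[
K^{l}(\mathbf{x},\mathbf{x}') = \sigma_{b,l}^{2} + \sigma_{w,l}^{2}\,\frac{2}{\pi}\,
\arcsin\!\left(
\frac{2\,K^{l-1}(\mathbf{x},\mathbf{x}')}
{\sqrt{\bigl(1+2K^{l-1}(\mathbf{x},\mathbf{x})\bigr)\bigl(1+2K^{l-1}(\mathbf{x}',\mathbf{x}')\bigr)}}
\right),
\qquad l \ge 2 .
\]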

We specifically call the GPs whose kernels are induced iteratively by NNs "NNGP" for short. NNGP is actually a specific case of GP, but we still distinguish it from a standard GP with an ad hoc kernel (e.g., squared exponential and Matérn kernels). An exception is the output of the first hidden layer. This output is also a GP, and its covariance function is [3, 12]

(11)

We still regard the GP with this kernel as a standard GP rather than an NNGP, as the kernel corresponds to an incomplete, shallow NN without an output layer. In Section 5.2.2 we will compare the predictive ability of this kernel with that of our generic kernel computed by formula (10).
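
The kernel in question is the arcsine kernel derived by Williams [3] (see also [12]) for an augmented input vector and a diagonal prior covariance Σ on the corresponding first-layer parameters; we restate it here only as a reminder, since Eq. (11) is used again in Section 5.2.2:

\[
K_{\mathrm{erf}}(\mathbf{x},\mathbf{x}') = \frac{2}{\pi}\,
\arcsin\!\left(
\frac{2\,\tilde{\mathbf{x}}^{\mathsf T}\boldsymbol{\Sigma}\,\tilde{\mathbf{x}}'}
{\sqrt{\bigl(1+2\,\tilde{\mathbf{x}}^{\mathsf T}\boldsymbol{\Sigma}\,\tilde{\mathbf{x}}\bigr)\bigl(1+2\,\tilde{\mathbf{x}}'^{\mathsf T}\boldsymbol{\Sigma}\,\tilde{\mathbf{x}}'\bigr)}}
\right),
\qquad \tilde{\mathbf{x}} = (1, x_1, \ldots, x_{d_{\mathrm{in}}})^{\mathsf T}.
\]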

4 GP regression

The essence of GP regression is the conditional distribution of the quantities of interest, y*, given the known observations y, where y* and y are both Gaussian random vectors. Thanks to the analytical tractability of the Gaussian distribution, we can derive analytically the mean vector and the covariance matrix of the conditional distribution. Supposing that y* and y have a joint multivariate Gaussian distribution with zero mean vector and block-wise covariance matrix

(12)

the conditional distribution is also a multivariate Gaussian distribution with the mean vector

(13)

and the covariance matrix

(14)

(See (A.5) and (A.6) of [12].) Zero-mean Gaussian white noise with variance σ_n² is assumed in the observations.

Note that the observations and the quantities of interest are usually evaluated at specific time–space coordinates. To preserve a multivariate Gaussian distribution of the form (12) at arbitrary time–space coordinates, the observations and the quantities of interest can be seen as samples of GPs evaluated at those coordinates, where the index set of the GPs represents arbitrary time and/or space coordinates. In particular, for solving PDEs, the observations consist of samples from different GPs. Different covariance functions will capture different trends and features that the observations exhibit.

Before computing the mean and covariance of the conditional distribution from formulas (13) and (14), we first need to learn the parameters hidden in the covariance function as well as the noise variance σ_n². These parameters are also known as hyperparameters and can be learned by using maximum likelihood estimation (MLE) [12, 13]. In practice, we find the optimal hyperparameters (the undetermined parameters in the covariance function together with the noise variance σ_n²) that minimize the negative log marginal-likelihood [12]

(15)

where |·| denotes the determinant and the covariance matrix of the observations (or training data) depends on the hyperparameters and the noise variance.
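
The whole workflow of Eqs. (13)–(15) fits in a few lines of generic code. The sketch below (plain NumPy, the kernel passed in as a function, a Cholesky factorization used instead of an explicit inverse) is only meant to make the formulas concrete; it is not the GPML/TensorFlow code actually used in Section 5, and the kernel and hyperparameter values are illustrative.

import numpy as np

def gp_posterior_and_nlml(kernel, X, y, Xstar, noise_var):
    """Posterior mean/covariance (Eqs. (13)-(14)) and negative log marginal-likelihood (Eq. (15)).
    kernel(A, B) must return the matrix of covariances between the rows of A and B."""
    n = X.shape[0]
    K = kernel(X, X) + noise_var * np.eye(n)        # covariance of the noisy observations
    L = np.linalg.cholesky(K)                       # K = L L^T
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    Ks = kernel(Xstar, X)                           # cross-covariance K(X*, X)
    Kss = kernel(Xstar, Xstar)
    mean = Ks @ alpha                                                   # Eq. (13)
    V = np.linalg.solve(L, Ks.T)
    cov = Kss - V.T @ V                                                 # Eq. (14)
    nlml = 0.5 * y @ alpha + np.sum(np.log(np.diag(L))) + 0.5 * n * np.log(2 * np.pi)  # Eq. (15)
    return mean, cov, nlml

# Example with a squared-exponential kernel and fixed (not optimized) hyperparameters.
se = lambda A, B: np.exp(-0.5 * ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1))
X = np.linspace(0, 1, 10)[:, None]; y = np.sin(2 * np.pi * X[:, 0])
Xs = np.linspace(0, 1, 100)[:, None]
mean, cov, nlml = gp_posterior_and_nlml(se, X, y, Xs, noise_var=1e-4)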

In what follows, we briefly review how different formulations of the observations enable GP regression to solve different problems, including function approximation and PDE solution.

4.1 Function approximation

Let the observations be noisy evaluations of an unknown scalar-valued function, y_i = f(x_i) + ε_i, where ε_i is Gaussian white noise with variance σ_n². The objective is to approximate f given the observations. After prescribing the covariance function for computing the covariance matrix and then training the regression model by MLE, we derive the function approximation by using formula (13). Moreover, the variance computed from (14) yields the confidence interval. Denoting by X the collection of the evaluation points (usually called training inputs), we formulate the covariance matrix in (12) by:

(16)

4.2 PDE solution

First we consider the following time-independent linear PDE

(17)

where L_x is a linear differential operator. The source term f and the boundary term g are the only information we know, and thus we put them in the observations: we select boundary points as the evaluation points for g and domain points as the evaluation points for f. The variances of the Gaussian white noises added to the boundary and domain observations are denoted σ_{n,b}² and σ_{n,f}², respectively. We assume that the solution u is a GP with zero mean and covariance function k_uu(x, x′), and that the source term f = L_x u is another GP that depends on u, whose covariance function is written as [6]

(18)

Denoting by X_b the collection of the boundary evaluation points and by X_f the collection of the domain evaluation points, we formulate the covariance matrix in (12) as [6]

(19)

For time-dependent problems, the above GP regression model also works, as we can regard the time variable as an additional dimension of the space–time domain (see Section 4.3.1 of [6]).
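
To make the structure of the blocks in (19) explicit, we record the generic form used in [6]; the notation k_uu, k_uf, k_ff, X_b, X_f is ours, and the observation-noise terms that appear in Eq. (19) are omitted here:

\[
k_{ff}(\mathbf{x},\mathbf{x}') = \mathcal{L}_{\mathbf{x}}\mathcal{L}_{\mathbf{x}'} k_{uu}(\mathbf{x},\mathbf{x}'),
\qquad
k_{uf}(\mathbf{x},\mathbf{x}') = \mathcal{L}_{\mathbf{x}'} k_{uu}(\mathbf{x},\mathbf{x}'),
\qquad
\mathbf{K} =
\begin{pmatrix}
k_{uu}(X_b, X_b) & k_{uf}(X_b, X_f)\\
k_{fu}(X_f, X_b) & k_{ff}(X_f, X_f)
\end{pmatrix}.
\]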

An alternative way to handle time-dependent problems is numerical GP regression [7]. Consider the following time-dependent PDE

(20)

The differential operator can be nonlinear (as in, say, Burgers' equation). We denote its linearized counterpart by L_x; for a linear operator, the linearization coincides with the operator itself. The use of the backward Euler scheme leads to a relation between the solution u^n at the current time step and the solution u^{n−1} at the previous one. According to the known information from the initial and boundary conditions, we define the observations at the n-th time step as the boundary data at the current step together with the solution values at the previous step. Thus, we actually apply GP regression at each time step. Assuming that u^n is a GP with zero mean and covariance kernel k, and denoting by X_b^n the collection of the boundary evaluation points selected at the n-th time step and by X^{n−1} the collection of the domain evaluation points sampled at the (n−1)-th time step, we formulate the covariance matrix as [7]

(21)

Generally, we let the covariance function take the same form for all time steps, except with varying undetermined parameters. Considering the propagation of uncertainty in the time marching, we need to take into account the uncertainty of the predictions at the previous time step. We rewrite the predictive or posterior variance in (14) as [7]

(22)
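
To fix ideas, assume the PDE is written as ∂u/∂t + L_x u = 0 after linearization (the sign convention here is ours, not necessarily that of Eq. (20)). Then the backward Euler step and the resulting kernel blocks take, in the spirit of [7], the following form, where k is the zero-mean prior kernel placed on u^n, the first block row/column corresponds to the boundary data at t_n, the second to the (previously predicted) solution values at t_{n−1}, and the noise variances of Eq. (21) are omitted:

\[
u^{n-1} = u^{n} + \Delta t\,\mathcal{L}_{\mathbf{x}} u^{n},
\qquad
\mathbf{K}^{n} =
\begin{pmatrix}
k(X_b^{n}, X_b^{n}) & \bigl[(\mathrm{I}+\Delta t\,\mathcal{L}_{\mathbf{x}'})k\bigr](X_b^{n}, X^{n-1})\\
\bigl[(\mathrm{I}+\Delta t\,\mathcal{L}_{\mathbf{x}})k\bigr](X^{n-1}, X_b^{n}) &
\bigl[(\mathrm{I}+\Delta t\,\mathcal{L}_{\mathbf{x}})(\mathrm{I}+\Delta t\,\mathcal{L}_{\mathbf{x}'})k\bigr](X^{n-1}, X^{n-1})
\end{pmatrix}.
\]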

5 Numerical results

We perform a three-way comparison between NNGP, NN, and GP for function approximation and PDE solution on prototypical problems. The hyperparameters of NNGP and GP are trained by minimizing the negative log marginal-likelihood function. The conjugate gradient algorithm (see the function minimize() in the GPML Matlab code, version 3.6 [14]) is employed for the optimization. For NN, the network parameters (weights and biases) are trained by the L-BFGS-B algorithm provided in the TensorFlow package.

Because of the non-convex nature of the objective function, to avoid being trapped in local minima, we run the optimization ten times with different initializations. Among the ten groups of optimized hyperparameters, for GP/NNGP we choose the one yielding the smallest negative log marginal-likelihood, and for NN we select the one producing the lowest loss. The initializations of the hyperparameters for GP/NNGP are taken as the first ten entries of the Halton quasi-random sequence [15], so these initializations are deterministic. The initializations of the network weights are obtained from Xavier's initialization algorithm [16], which is provided by the TensorFlow package. Also, for each initialization at most 200 function evaluations are allowed in the conjugate gradient search for GP/NNGP. For NNGP, a central finite difference scheme is used to compute the gradients of the covariance function with respect to the hyperparameters, while for GP analytical derivatives are employed. Additionally, we compute a relative error to quantify the accuracy of function approximation and PDE solution, comparing the vector of predicted values with the vector of exact function values or PDE solutions evaluated at the test points.
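
The training loop just described (multi-restart optimization of the negative log marginal-likelihood, with finite-difference gradients for the NNGP kernel) is conceptually simple. A schematic version is sketched below; it uses SciPy's conjugate-gradient optimizer in place of the GPML minimize() routine, an arbitrary finite-difference step, and maxiter as a stand-in for the function-evaluation budget, so it should be read as an illustration rather than the actual setup.

import numpy as np
from scipy.optimize import minimize

def central_diff_grad(f, theta, h=1e-6):
    """Central finite-difference gradient of a scalar function f at theta."""
    g = np.zeros_like(theta)
    for i in range(theta.size):
        e = np.zeros_like(theta); e[i] = h
        g[i] = (f(theta + e) - f(theta - e)) / (2 * h)
    return g

def train_gp(nlml, theta_inits, max_iters=200):
    """Multi-restart minimization of the negative log marginal-likelihood nlml(theta).
    Returns the hyperparameters with the smallest NLML over all restarts."""
    best = None
    for theta0 in theta_inits:                       # e.g., entries of a Halton sequence
        res = minimize(nlml, theta0, jac=lambda th: central_diff_grad(nlml, th),
                       method="CG", options={"maxiter": max_iters})
        if best is None or res.fun < best.fun:
            best = res
    return best.x

def relative_error(pred, exact):
    """Relative l2 error between predictions and exact values at the test points."""
    return np.linalg.norm(pred - exact) / np.linalg.norm(exact)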

5.1 Function approximation

The covariance matrices of GP/NNGP are formulated according to (16).

5.1.1 Step function

We approximate a step function, i.e., a piecewise-constant function with a single jump discontinuity. Ten evenly distributed training inputs and 100 equispaced test points are chosen. Fig. 3 shows the approximation by the GPs with two commonly used kernels: squared exponential (SE) and Matérn [12]. The first kernel describes well data exhibiting a smooth trend, while the second is suited to functions of finite regularity. However, for the step function, a non-smooth function, the performance of these kernels is poor. The NNGP predictions shown in Figs. 4 and 5 are much better. In the part of the domain away from the discontinuity, the NNGP approximation is good, with small uncertainty. In particular, the ReLU-induced kernel even succeeds in capturing the discontinuity. An important observation is that the ReLU-induced kernel is more suitable for non-smooth functions than the error-function-induced kernel. Another observation regarding NNGP is that deepening the inducing NN does not change the approximation accuracy for the step function. Figs. 6 and 7 show the predictions of NN, which are less accurate than NNGP's. Unlike GP/NNGP, NN does not quantify any uncertainty of the approximation.

Figure 3: Approximating the step function: GPs with SE (left) and Matern (right) kernels. "Two-std band" denotes the mean vector of the conditional distribution (computed from (13)) plus/minus twice the standard deviation (square roots of the diagonal entries of the covariance matrix computed from (14)). The quantities of interest are the function values at the 100 test points.

Figure 4: Approximating the step function: NNGP with error-function induced kernel. "One-layer" indicates that there is exactly one hidden layer in the inducing NN.
Figure 5: Approximating the step function: NNGP with ReLU induced kernel.
Figure 6: Approximating the step function: NN with the error-function nonlinearity ("5-unit" indicates that there are 5 units/neurons in each hidden layer).
Figure 7: Approximating the step function: NN with the ReLU nonlinearity.

5.1.2 Hartmann 3D function

The Hartmann function is frequently used for testing global optimization algorithms. Here we consider the trivariate case. The function is much smoother than the step function, and thus we can expect GP to perform well. To test the approximation accuracy, we first generate a set of points by taking the first entries of the Halton sequence, then randomly permute the points, and finally select the first 70% of the points as training points and the remaining 30% as test points.

Figure 8: Approximating the Hartmann 3D function: error comparison. ("nngp-erf-1" means the NNGP with one hidden layer and the error-function induced kernel; "gp-SE" denotes the GP with the squared exponential kernel; "nn-erf-3-20" indicates the three-layer, 20-unit-wide NN with the error-function nonlinearity.)

Fig. 8 compares NN, NNGP, and GP in terms of approximation accuracy. Since the Hartmann 3D function is smooth, the NNGP with the error-function induced kernel and the GP with the squared exponential kernel are both very accurate. In contrast to the non-smooth step function case, the ReLU-induced kernel is not well suited to this smooth function. Additionally, the GP and NNGP both outperform NN. It is observed again that NN does not give any uncertainty estimate, whereas GP and NNGP do. Another observation is that increasing the depth of the inducing NN in NNGP does not change the accuracy.

5.2 PDE solution

We use the proposed covariance function induced by the error-function nonlinearity to solve the following two PDEs. We replace the covariance function in Eq. (19) as well as in Eq. (21) with the new kernel (10). The derivation of the kernel's derivatives for the case of the Poisson equation is given in the Appendix.

5.2.1 2D Poisson equation

Consider the equation

(23)

with two fabricated solutions: solution 1 and the more complex solution 2. Dirichlet boundary conditions are assumed. Since the second solution is more complex than the first, we use more training points in that case. The domain training inputs are chosen from the first entries of the Halton sequence, and the boundary training inputs are equispaced on the boundary; both setups are shown in Fig. 9.

Figure 9: Solving 2D Poisson equation: Two setups of training inputs for both fabricated solutions.

The covariance matrices of GP/NNGP are obtained by (19). Figs. 10 and 11 give the error plots of NN, NNGP, and GP for fabricated solutions 1 and 2, respectively. In total, 441 evenly distributed test points are used to evaluate the relative error. The NN results are obtained according to the approach proposed in [8]. For a fair comparison, the same training inputs are used in the NN case. It is observed from the figures that, for both fabricated solutions, the approximation accuracy of the GP and NNGP is comparable. Also, the NN accuracy is nearly one order of magnitude lower than that of GP/NNGP.

Figure 10: Solving the 2D Poisson equation with fabricated solution 1. ("nn-tanh-3-20" represents the three-layer, 20-unit-wide NN with the hyperbolic tangent nonlinearity.)
Figure 11: Solving the 2D Poisson equation with fabricated solution 2.

Fig. 12 shows the uncertainty estimates of the GP and the one-layer NNGP evaluated on a cut line of the square domain, for the case of fabricated solution 2. We see that for both methods the uncertainty is reduced with an increasing number of training points. Moreover, the uncertainty is strongly correlated with the prediction error, as smaller uncertainty corresponds to lower error. Note that the uncertainty at the endpoints of the cut line, which lie on the boundary, is exactly zero due to the boundary condition.

Figure 12: Solving the 2D Poisson equation with fabricated solution 2: uncertainty estimates for the GP and the one-layer NNGP evaluated on a cut line of the domain. The vertical axis represents the standard deviation of the conditional or posterior distribution computed by (14). The relative errors evaluated on the cut line are 0.049, 0.029, 0.022, and 0.0081 from the top curve to the bottom one. It is seen that the uncertainty is strongly correlated with the prediction error.

5.2.2 1D Burgers’ equation

Consider the equation [7]

(24)

with viscosity ν and a prescribed initial condition. After linearization, in which the nonlinear convection term is replaced by μ^{n−1} ∂u/∂x, where μ^{n−1} is the mean vector computed for the previous time step by (13), we derive the linearized differential operator L_x. The covariance matrices of GP/NNGP are formulated by (21).
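
Concretely, the linearized backward Euler step for Burgers' equation takes the following form (the sign and viscosity conventions are those of the standard benchmark and are meant as an illustration, not as a restatement of Eq. (24)):

\[
\frac{u^{n}-u^{n-1}}{\Delta t} + \mu^{n-1}\,\frac{\partial u^{n}}{\partial x} - \nu\,\frac{\partial^{2} u^{n}}{\partial x^{2}} = 0
\quad\Longrightarrow\quad
\Bigl(\mathrm{I} + \Delta t\,\mu^{n-1}\,\partial_x - \Delta t\,\nu\,\partial_x^{2}\Bigr) u^{n} = u^{n-1},
\]

so that the linearized operator is \( \mathcal{L}_x = \mu^{n-1}\,\partial_x - \nu\,\partial_x^2 \).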

The time step size is fixed. To evaluate the relative error at each time step, we place 400 equispaced test points in the space domain. We also solve the same equation using the NN approach proposed in [8]. One main difference between the NN approach and numerical GP/NNGP regression is the sampling strategy for the training inputs. For NN, we sample the training inputs in the time–space domain, whereas for GP/NNGP we sample the training inputs only in the space domain and then perform time marching. To make the comparison as fair as possible, we sample 10000 training points for NN, since in GP/NNGP we have at most 100 (number of time steps) × 100 (number of training inputs in space) = 10000 time–space sampling points. The Latin hypercube sampling strategy is adopted for the training inputs in NN, GP, and NNGP. To accelerate the training of numerical GP/NNGP, the initial guess of the hyperparameters for the current time step is taken as the optimized hyperparameters obtained at the previous time step. This strategy is reasonable since, for a small time step, two successive GPs are highly correlated and therefore the respective hyperparameters should be close.

The kernel (11), rather than the SE kernel used for the Poisson equation, is adopted for the GP. We choose a non-stationary kernel instead of a stationary one because the solution exhibits a discontinuity that cannot be well captured by a stationary kernel. Additionally, a NN with four hidden layers of width 40 and the hyperbolic tangent nonlinearity is employed in the NN approach.

Figure 13: Solving the 1D Burgers' equation with noise-free initial condition: GP with kernel (11). The blue solid curve is the exact solution computed according to [17], the red circles are the training points at the previous time step, and the red dashed curve is the numerical solution. The x-axis ticks denote the locations of the 31 randomly sampled spatial training points. The orange shaded region is the two-standard-deviation band.
Figure 14: Solving the 1D Burgers' equation with noise-free initial condition: NNGP with the one-layer, error-function induced kernel.
Figure 15: Solving the 1D Burgers' equation with noise-free initial condition: NNGP with the three-layer, error-function induced kernel.

We first consider the case where the initial condition is noise-free. Figs. 13, 14, and 15 show the numerical solutions computed by the GP, the NNGP with one hidden layer, and the NNGP with three hidden layers, respectively. We can see that the NNGP has higher solution accuracy than the GP. Importantly, different from the previous examples, increasing the depth of the inducing NN in NNGP improves the results in the current example. The high accuracy of NNGP is attributed to a larger number of hyperparameters compared to the GP case. More hyperparameters presumably imply higher expressivity for function approximation, which means that we can use the kernel to approximate a wider spectrum of functions. For simple functions or solutions (as in the Hartmann 3D function and Poisson equation examples), the advantage of higher expressivity is not fully realized, but for complex ones, such as the step function and the Burgers' equation here, the advantage can be clearly seen.

Figure 16: Solving 1D Burgers’ equation with noise-free initial condition: NN with four hidden layers of 40-unit width as well as hyperbolic tangent nonlinearity. The maximum error appears near the discontinuity point.
Figure 17: Solving the 1D Burgers' equation with noise-free initial condition: accuracy comparison of NN, NNGP, and GP. "nngp-1" means the NNGP with one hidden layer and the error-function induced kernel. "31" and "101" denote the numbers of domain training points; the other settings are kept fixed for all the NNGP examples. Kernel (11) is adopted in the GP. "nn-tanh-4-40" represents the NN with four hidden layers of 40-unit width as well as the hyperbolic tangent nonlinearity.

The NN results are shown in Fig. 16. Although NN can produce sufficiently accurate solutions, it does not give any uncertainty estimate. Fig. 17 collects the errors from GP, NNGP, and NN for the noise-free case. The NNGP and NN achieve errors of the same order of magnitude, while the GP performs worst. This is because NNGP and NN both have a large set of parameters to be tuned and thus possess higher expressivity.

Next we consider the case where the initial condition is contaminated by Gaussian white noise of zero mean and prescribed variance. The NN approach for the noisy case is beyond the scope of the present paper, since, without proper regularization methods (such as dropout [18], early stopping, and weighted L1/L2 penalties), NN easily overfits. Unlike NN, GP and NNGP inherently include a model complexity penalty in the negative log marginal-likelihood and are less prone to overfitting (see the discussion in Section 5.4.1 of [12]).

Figure 18: Solving the 1D Burgers' equation with noisy initial condition: GP with kernel (11).
Figure 19: Solving the 1D Burgers' equation with noisy initial condition: NNGP with the one-layer, error-function induced kernel.
Figure 20: Solving the 1D Burgers' equation with noisy initial condition: NNGP with the three-layer, error-function induced kernel.
Figure 21: Solving the 1D Burgers' equation with noisy initial condition: accuracy comparison of GP and NNGP.

The numerical solutions computed by the GP and NNGP are plotted in Figs. 18, 19, and 20. NNGP still has higher solution accuracy than the GP due to its higher expressivity. Data noise amplifies the uncertainty represented by the orange shaded region. GP and NNGP can handle noise well, because the noise variance can be learned directly from the training data and the corresponding uncertainty is quantified by the conditional (or posterior) distribution. Fig. 21 compares the solution accuracy of GP and NNGP. For long-term simulations, the accuracy of the NNGP is roughly one order of magnitude higher than that of the GP.

For NNGP, increasing the depth of the inducing NN does not guarantee an increase in accuracy. For example, the one-layer NNGP with 101 training points outperforms the three-layer NNGP with 101 training points, as shown in Fig. 21. A larger depth amounts to a larger number of hyperparameters, which merely increases the possibility of fitting complex functions better. In fact, the depth can itself be seen as another hyperparameter of NNGP.

6 Summary

A larger number of hyperparameters enables NNGP to achieve higher or comparable accuracy for both smooth and non-smooth functions in comparison with GP, as can be seen from Table 1. The deep NN that induces the NNGP contributes the prior variances of its network parameters (weights and biases) as well as its depth to the hyperparameter list of NNGP, which endows NNGP with high expressivity. On the other hand, NNGP is able to estimate the uncertainty of its predictions, which is crucial for noisy-data handling and active learning [19]. Unlike a standard NN, a Bayesian NN can provide uncertainty estimates; however, the conventional Bayesian NN [20] can be time-consuming to train due to the approximation of a high-dimensional integral over weight space.

Function/Solution | Feature | Accuracy comparison | GP kernel | NNGP nonlinearity | NN nonlinearity
Step function | Non-smooth | NNGP > {GP, NN} | SE/Matern | erf/ReLU | erf/ReLU
Hartmann 3D | Smooth | {NNGP, GP} > NN | SE/Matern | erf/ReLU | erf/ReLU
Poisson eq. | Smooth | {NNGP, GP} > NN | SE | erf | erf/tanh
Burgers' eq. | Non-smooth | {NNGP, NN} > GP | kernel (11) | erf | tanh
Table 1: Comparison of GP, NNGP, and NN (accuracy) for function approximation and PDE solution.

Due to the need to invert the covariance matrix, NNGP has cubic time complexity in the training-set size n for training (see Table 2). In Table 2, n for GP and NNGP is the size of the training set, and I_GP and I_NNGP are the numbers of evaluations of the negative log marginal-likelihood in the conjugate gradient algorithm for GP and NNGP, respectively. For NN, an accurate estimate of the time complexity of training is still an open question [21]. Generally, the training of a fully-connected NN is faster than that of a GP, because one does not need to invert a matrix. For each training point, the forward and backward propagation need only linear cost in W, the total number of weights in the network. For NN, n means the batch size; in this paper we fed the whole training set to the optimization algorithm, and thus the batch size is exactly the size of the training set. The factor I_NN is the number of iterations of the optimization algorithm. In the numerical examples of this paper, the training-set size does not exceed 1000. However, for very large datasets, GP and NNGP will be less attractive compared to NN. In future work, we intend to leverage the scalable GPs recently developed in [22, 23, 24] to tackle large-dataset problems.

 | GP | NNGP | NN
Uncertainty | Yes | Yes | No
Cost of training | O(I_GP · n³) | O(I_NNGP · n³) | O(I_NN · W · n)
Table 2: Comparison of GP, NNGP, and NN (whether uncertainty is estimated, and the computational cost of training) for function approximation and PDE solution.

Acknowledgements

We thank Mr. Dongkun Zhang and Dr. Maziar Raissi for useful discussions. This work was supported by AFOSR (FA9550-17-1-0013) and NSF (DMS-1736088). The first author was also supported by the National Natural Science Foundation of China (11701025). The second author was also supported by a gift from Swiss company BH-Robotics.

References

Appendix: Derivatives of NNGP kernel from the error-function nonlinearity

Absorbing the constant coefficients into the variances, the kernel derived from the error-function nonlinearity, namely (10), can be further simplified to

(A1)

To solve PDEs, we need to compute the derivatives of the kernel. We take the 2D Poisson equation as an example, for which we need the explicit forms of the kernel acted on by the Laplacian in one argument and in both arguments (cf. Eqs. (18) and (19)); partial derivatives up to fourth order need to be derived. Denoting the arcsine term in the above iteration formula by a trivariate function, the use of the chain rule yields the required expressions.
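
Rather than reproduce the lengthy chain-rule expressions, we note that they can be generated and checked symbolically. The sketch below uses SymPy and a stand-in one-layer kernel of the arcsine form, with placeholder variance symbols; it is meant only to illustrate how the fourth-order mixed derivatives required by Eqs. (18)-(19) can be obtained, not to reproduce (A1) exactly.

import sympy as sp

x1, x2, y1, y2 = sp.symbols('x1 x2 y1 y2', real=True)                    # components of x and x'
sb, sw1, sw2 = sp.symbols('sigma_b2 sigma_w1 sigma_w2', positive=True)   # placeholder variances

# One-hidden-layer kernel of arcsine type (simplified stand-in for Eq. (A1)).
k1 = lambda u1, u2, v1, v2: sb + sw1 * u1 * v1 + sw2 * u2 * v2
num = 2 * k1(x1, x2, y1, y2)
den = sp.sqrt((1 + 2 * k1(x1, x2, x1, x2)) * (1 + 2 * k1(y1, y2, y1, y2)))
k = sp.asin(num / den)

# Derivatives needed for the 2D Poisson equation: Laplacians in x, in x', and in both.
lap_x   = sp.diff(k, x1, 2) + sp.diff(k, x2, 2)              # L_x k
lap_xp  = sp.diff(k, y1, 2) + sp.diff(k, y2, 2)              # L_x' k
lap_lap = sp.diff(lap_x, y1, 2) + sp.diff(lap_x, y2, 2)      # L_x L_x' k (fourth-order mixed)
print(lap_lap.subs({x1: 0.3, x2: -0.2, y1: 0.1, y2: 0.5,
                    sb: 0.1, sw1: 1.6, sw2: 1.6}).evalf())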