Deep neural networks (DNNs) have become increasingly important in scientific computing weinan2017deep ; weinan2018deep ; han2018solving ; he2018relu ; cai2019multi ; liao2019deep ; siegel2019approximation ; hamilton2019dnn ; cai2019deep ; wang2020mesh . A major potential advantage over traditional numerical methods is that DNNs could overcome the curse of dimensionality in high-dimensional problems. By drawing on traditional numerical methods, several studies have made progress in understanding the algorithmic characteristics of DNNs. For example, by exploring the ReLU DNN representation of continuous piecewise linear functions in FEM, he2018relu theoretically establishes that a ReLU DNN can accurately represent any linear finite element function. Regarding convergence behavior, xu_training_2018 ; xu2019frequency show a Frequency Principle (F-Principle): DNNs often eliminate low-frequency error first, while most conventional methods (e.g., the Jacobi method) exhibit the opposite convergence behavior, namely faster convergence for higher-frequency error. Such understanding could lead to better use of DNNs in practice; for example, multi-scale DNN algorithms have been proposed based on the F-Principle to eliminate high-frequency error quickly cai2019phasednn ; cai2019multi .
The aim of this paper is to investigate the different behaviors of DNNs and traditional numerical methods, e.g., the Ritz-Galerkin (R-G) method, in solving PDEs given a few sample points. We denote as the sample number and as the basis number in the Ritz-Galerkin method or the neuron number in DNNs. In traditional PDE models, we consider the situation where the functions in the equation are completely known, i.e., the sample number goes to infinity. In practical applications, however, such as signal processing, statistical mechanics, and chemical and biophysical dynamic systems, we often encounter the problem that only a few sample values can be obtained. We ask what solution the R-G method produces for this particular problem, and what solution is obtained by the DNN method. In this paper, we show that the R-G method treats the discrete sampling points as a linear combination of Dirac delta functions, while DNN methods implicitly impose regularity on the function that interpolates the discrete sampling points. We also use the F-Principle to show how the DNN differs from the R-G method. Our work indicates that, together with the implicit bias of DNNs, traditional methods, e.g., FEM he2018relu , could provide insights into understanding DNNs.
The rest of the paper is organized as follows. In section 2, we briefly introduce the R-G method and the DNN method. In section 3, we present the difference between using these two methods to solve PDEs theoretically and numerically. We end the paper with the conclusion in section 4.
In this section, we take the toy model of Poisson’s equation as an example to investigate the difference in solution behavior between the R-G method and the DNN method.
2.1 Poisson problem
We consider the d-dimensional Poisson problem posed on the bounded domain with Dirichlet boundary condition as
where represents the Laplace operator,
is a d-dimensional vector. It is known that the problem (1) admits a unique solution for , and its regularity can be raised to if for some . In the literature, there are a number of effective numerical methods for solving the boundary value problem (BVP) (1) in the general case. Here we consider a special situation: we only have the information of at the sample points . In practical applications, we may imagine that we only have finite experimental data, i.e., the value of , and have no more information of at other points. By solving such a particular Poisson problem (1) with the R-G method and the deep learning method, we aim to find the bias of these two methods in solving PDEs.
2.2 R-G method
The variational form of problem (1) is the following:
The weak form of (3) is to find such that
The problem (1) is the strong form if the solution . To numerically solve (4), we now introduce the finite dimensional space to approximate the infinite dimensional space . Let be a subspace with a sequence of basis functions . The numerical solution that we will find can be represented as
Depending on the type of basis functions, the R-G method is classified as the finite element method (FEM), the spectral method (SM), and so on. If the basis functions are local, namely, they have compact support, such as the linear hat basis function
this method is usually called the FEM. On the other hand, if we choose global basis functions such as Fourier or Legendre functions Shen2011Spectral , the R-G method is called the SM.
The error estimate theory of the R-G method has been well established. Under suitable assumptions on the regularity of the solution, the linear finite element solution has the following error estimate
where the constant is independent of grid size . The spectral method has the following error estimate
where the exponent depends only on the regularity (smoothness) of the solution . If is smooth enough and satisfies certain boundary conditions, the spectral method has the spectral accuracy.
In this paper, we use the R-G method to solve the Poisson problem (1) in the special situation where we only have the information of at the sample points . In this case, the integral on the right-hand side of equation (6) is hard to compute exactly. In the one-dimensional case, if we know the properties of , we might be able to compute this integral to high-order precision based on these points. In higher dimensions, however, the Monte Carlo (MC) method Christian2004Monte may be the best way, as far as we know, to compute the integral on the right-hand side of (6). Since DNN-based methods are often used in high-dimensional cases, we use MC integration to approximate the integral on the right-hand side of (6), and arrive at the following modified R-G equation
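As a rough illustration of the Monte Carlo step described above, the following Python sketch estimates the load vector entries (the integrals of the right-hand side against each basis function) by a sample average over the available points. The domain (0, 1), the sine basis, the sample values, and the names `sine_basis` and `mc_load_vector` are assumptions made only for this illustration; the scaling by the domain volume is likewise an assumption, and the normalization used in (8) may differ by a constant.

```python
import numpy as np

def sine_basis(j, x):
    """Hypothetical global basis: phi_j(x) = sin(j*pi*x) on (0, 1)."""
    return np.sin(j * np.pi * x)

def mc_load_vector(x_samples, f_samples, m, volume=1.0):
    """Monte Carlo estimate of b_j = int f * phi_j dx for j = 1..m.

    Only the sampled values f(x_i) are used; f is never evaluated elsewhere.
    """
    j = np.arange(1, m + 1)[:, None]          # shape (m, 1)
    phi = sine_basis(j, x_samples[None, :])   # shape (m, n)
    return volume * (phi * f_samples[None, :]).mean(axis=1)

# Assumed test data: uniform samples and f(x) = sin(pi*x) known only at them.
rng = np.random.default_rng(0)
x_samples = rng.uniform(0.0, 1.0, size=20000)
f_samples = np.sin(np.pi * x_samples)
b = mc_load_vector(x_samples, f_samples, m=3)
```

With many samples the estimate approaches the exact integrals (here 1/2 for the first entry and 0 for the other two); with only a few samples, as in the setting of this paper, the same formula is used but no longer approximates the integral well.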
2.3 DNN method
We now introduce the DNN method. The -layer neural network is denoted by
where , , , , is a scalar function and “” means entry-wise operation. We denote the set of parameters by
and an entry of by .
If the activation function in a one-hidden-layer DNN is chosen to have the form of the basis function in (5), then the one-hidden-layer DNN solution, similar to (5), is given as
), while in the DNN method both the basis and the coefficients are obtained through the gradient descent algorithm with a loss function. The model in (10) can be generalized to a normal DNN in (9).
The loss function corresponding to problem (1) is given by
or a variational form weinan2017deep
where the last term is for the boundary condition and is a hyper-parameter. In numerical experiments, we use the loss function (11) to train DNNs. We remark that the result with the loss function (12) is similar.
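To make the structure of such a loss concrete, here is a minimal Python sketch of a least-squares residual loss with a boundary penalty, in the spirit of (11), for a one-dimensional problem and a one-hidden-layer network with sine activation. The sign convention -u'' = f, the one-dimensional setting, the parameter name `beta` for the hyper-parameter, and all function names are assumptions for illustration; the exact form of (11) may differ.

```python
import numpy as np

def network(x, a, w, b):
    """One-hidden-layer network u(x) = sum_j a_j * sin(w_j * x + b_j)."""
    return np.sin(np.outer(x, w) + b) @ a

def network_second_derivative(x, a, w, b):
    """Analytic second derivative u''(x) of the sine network above."""
    return -(np.sin(np.outer(x, w) + b) * w**2) @ a

def residual_loss(params, x_samples, f_samples, boundary, beta):
    """Mean squared residual of -u'' = f at the samples plus a boundary penalty."""
    a, w, b = params
    residual = network_second_derivative(x_samples, a, w, b) + f_samples
    interior = np.mean(residual**2)
    penalty = np.sum(network(np.asarray(boundary), a, w, b) ** 2)
    return interior + beta * penalty

# Assumed example: u(x) = sin(pi*x) solves -u'' = pi^2 * sin(pi*x), u(-1) = u(1) = 0.
x = np.linspace(-1.0, 1.0, 11)
f = np.pi**2 * np.sin(np.pi * x)
exact = (np.array([1.0]), np.array([np.pi]), np.array([0.0]))
perturbed = (np.array([0.5]), np.array([np.pi]), np.array([0.0]))
loss_exact = residual_loss(exact, x, f, boundary=(-1.0, 1.0), beta=10.0)
loss_perturbed = residual_loss(perturbed, x, f, boundary=(-1.0, 1.0), beta=10.0)
```

The loss only sees the right-hand side at the sample points, so any network whose residual vanishes there attains a loss near zero regardless of its behavior between samples. In practice the parameters would be updated by gradient descent on this loss.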
3 Main results
3.1 R-G method in solving PDE
In the classical case, is a given function, and thus we may obtain the right-hand-side term of the R-G equation by using the information of at the quadrature points. As the number of basis functions approaches infinity, the numerical solution obtained by the R-G method (6) approximates the exact solution of problem (1). It is natural to ask: if we only have the information of at finitely many points, what happens to the numerical solution obtained by (8) when ?
Fixing the number of sample points , we study the property of the solution of the numerical method (8). We have the following theorem.
Theorem 1. When tends to infinity, the numerical method (8) is solving the problem
where is the Dirac delta function.
Proof. The variational form of (13) is: find such that
Then the above variational problem can be transformed into the R-G equation
we then obtain
This is exactly the equation (8) obtained by solving problem (1) with a special . According to the error estimation theory, the solution of this equation approximates the exact solution of the problem (13) as approaches infinity.
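The statement can be checked numerically in one dimension. The sketch below is an illustration under stated assumptions, not the setup of the experiments in this paper: the domain is taken as (0, 1) with homogeneous Dirichlet conditions, the basis is sin(jπx), the right-hand side uses a plain sample average, and the sample points and values are made up. The R-G solution is compared with the exact solution of the delta-source problem, which is piecewise linear and can be written with the Green's function of -u'' on (0, 1).

```python
import numpy as np

def rg_solution(x_eval, x_samples, f_samples, m):
    """Spectral R-G solution with basis sin(j*pi*x), j = 1..m, on (0, 1)."""
    j = np.arange(1, m + 1)
    # Load vector from the sample average (Monte Carlo right-hand side).
    load = (np.sin(np.pi * np.outer(j, x_samples)) * f_samples).mean(axis=1)
    # The stiffness matrix is diagonal: int phi_j' phi_k' dx = (j*pi)^2 / 2 if j == k.
    coeffs = load / ((j * np.pi) ** 2 / 2.0)
    return np.sin(np.pi * np.outer(x_eval, j)) @ coeffs

def delta_source_solution(x_eval, x_samples, f_samples):
    """Exact solution of -u'' = mean_k f_k * delta(x - x_k), u(0) = u(1) = 0."""
    x = x_eval[:, None]
    s = x_samples[None, :]
    # Green's function of -u'' on (0, 1): piecewise linear in x with a kink at s.
    green = np.where(x <= s, x * (1.0 - s), s * (1.0 - x))
    return (green * f_samples).mean(axis=1)

# Assumed sample data for illustration.
x_samples = np.array([0.15, 0.4, 0.7])
f_samples = np.array([1.0, -2.0, 1.5])
x_eval = np.linspace(0.0, 1.0, 201)
reference = delta_source_solution(x_eval, x_samples, f_samples)
errors = {
    m: np.max(np.abs(rg_solution(x_eval, x_samples, f_samples, m) - reference))
    for m in (10, 100, 1000)
}
```

As the number of basis functions grows, the maximum gap to the piecewise linear reference shrinks, which is the behavior the theorem describes for a fixed set of samples.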
3.2 DNN method in solving PDE
) are adaptive. In addition, the DNN problem is optimized by (stochastic) gradient descent. The experiments in the next section will show that when the number of bases goes to infinity, DNN methods solve (1) by a smoother function rather than the piecewise linear function in Theorem 1. In this section, we utilize the F-Principle to understand what leads to the smoothness.
We start from considering a two-layer neural network, following zhang2019explicitizing ,
where , , and , and () is the ReLU activation function. Note that this two-layer model is slightly different from the model in (10), for ease of calculation in zhang2019explicitizing . The target function is denoted by . The network is trained with the mean-squared error (MSE) loss function
is a probability density. Considering finite samples, we have
For any function defined on
, we use the following convention of the Fourier transform and its inverse:
where denotes the frequency.
is the norm, and are initial parameters before training, and
The minimization in (19) clearly shows that the DNN has an implicit bias in addition to the sample constraint in (20). As monotonically increases with , the optimization problem prefers to choose a function that has fewer high-frequency components, which makes explicit the implicit bias of the F-Principle: DNNs prefer low frequencies xu_training_2018 ; xu2019frequency . For a general DNN, the coefficient cannot be obtained exactly; however, the monotonically decreasing property of with respect to can be postulated based on the F-Principle. Since a function dominated by lower frequencies is smoother, the solution of the optimization problem is smoother than the one derived from solving (8), which simply interpolates all missing data by zero for .
In this section, we present three examples to investigate the numerical solution behaviors of the R-G and DNN methods. For simplicity, we consider the following two-point boundary value problem
where The problem (23) has the exact solution in the form of
Example 1: Fixing the number of sampling points , we use the R-G method and the DNN method to solve the problem (23).
R-G method. First, we use the R-G method to solve the problem (23), specifically the spectral method with the Fourier basis functions given as
We set the numbers of basis functions . Fig. 1 plots the numerical and exact solutions with different . One can see that the R-G solution approximates a piecewise linear function when . This result is consistent with the properties of the solutions we analyzed in Theorem 1.
DNN method. For better comparison with the R-G method, we choose the sine activation function in a DNN with one hidden layer. The number of neurons is . The loss function (11) is selected with the parameter . We reduce the loss to the order of 1e-4, and the learning rate is set to 1e-4. We use 1000 test points when drawing the figures. Fig. 2 plots the comparison between the DNN solution and the exact solution. We observe that the DNN solutions are always smooth even when is large.
Example 2: We set the number of sampling points , and keep the other parameters the same as in Example 1. The R-G solutions and DNN solutions are shown in Figs. 3 and 4, respectively. The results are consistent with the case when , i.e., the R-G solution converges to a piecewise linear function as becomes larger and larger, while the DNN solution remains smooth.
To further verify that the R-G solution approximates the piecewise linear function, we compute the mean square difference between the numerical solution (R-G or DNN) and the piecewise linear solution, which is obtained by interpolating the values of the numerical solution at the sampling points. We denote the numerical solution as and the interpolated function as . For simplicity, we set and . The mean square difference can be represented as
Here we choose . The mean square differences are presented in Fig. 5. In the R-G part, where we use a log-log coordinate system, we see that the R-G solution approximates the piecewise linear solution with algebraic precision when . In the DNN part, where we use a semi-log coordinate system, we observe that the distance between the DNN solution and the piecewise linear solution hardly changes with the increase of .
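The distance measure used above can be sketched in a few lines of Python. This is a minimal sketch, assuming the mean is taken over a set of test points that lie between the smallest and largest sampling points; the function name `mean_square_difference` and the two test functions are hypothetical and chosen only to show the two extreme cases.

```python
import numpy as np

def mean_square_difference(u, x_samples, x_test):
    """Mean squared gap between u and the piecewise linear interpolant
    of u through its values at the sampling points."""
    x_samples = np.sort(x_samples)
    # np.interp builds the piecewise linear interpolant; outside the sample
    # range it holds the end values, so x_test should stay inside that range.
    interpolant = np.interp(x_test, x_samples, u(x_samples))
    return np.mean((u(x_test) - interpolant) ** 2)

# Assumed sampling points and test points on [-1, 1].
x_samples = np.array([-1.0, -0.5, 0.0, 0.5, 1.0])
x_test = np.linspace(-1.0, 1.0, 1000)

# A function that is already piecewise linear on the sampling grid.
msd_piecewise_linear = mean_square_difference(np.abs, x_samples, x_test)
# A smooth function that is not piecewise linear.
msd_smooth = mean_square_difference(lambda x: np.sin(np.pi * x), x_samples, x_test)
```

A value near zero indicates that the numerical solution is essentially piecewise linear between the sampling points, while a value that stays bounded away from zero indicates a genuinely smooth solution.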
Example 3: In this example, we use the ReLU function as the basis function in the R-G method and as the activation function in the DNN method to repeat the experiments in Example 1. The number of sampling points is taken as . The unstated parameters are the same as those in Example 1.
The linear finite element function can be represented by ReLU functions, namely,
Hence we can use ReLU as the basis function for the R-G method; for convenience, however, we simply use piecewise linear functions instead of ReLU functions as the basis. As expected, the FEM solution is piecewise linear; see Fig. 6.
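The relation between the hat basis and ReLU can be verified directly. The sketch below assumes a uniform grid with spacing `h`, on which the hat function centered at a node equals a combination of three shifted ReLU functions; the function names and the chosen node are for illustration only, and the formula on a non-uniform grid would use different weights.

```python
import numpy as np

def relu(x):
    """ReLU(x) = max(x, 0), applied entry-wise."""
    return np.maximum(x, 0.0)

def hat_from_relu(x, center, h):
    """Hat function on a uniform grid written with three ReLU terms."""
    left = relu(x - (center - h))
    middle = relu(x - center)
    right = relu(x - (center + h))
    return (left - 2.0 * middle + right) / h

def hat_direct(x, center, h):
    """Usual definition: 1 at the center, linear to 0 at center +/- h, 0 elsewhere."""
    return np.maximum(1.0 - np.abs(x - center) / h, 0.0)

# Assumed node and spacing for the check.
x = np.linspace(-1.0, 2.0, 301)
from_relu = hat_from_relu(x, center=0.5, h=0.25)
direct = hat_direct(x, center=0.5, h=0.25)
```

The two evaluations agree up to rounding, which is why a ReLU basis and a piecewise linear basis span the same finite element functions on such a grid.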
In the DNN method, we construct neural networks with three hidden layers; the number of neurons in each layer is . As shown in Fig. 7, the DNN again learns the data as a very smooth function.
This paper compares the different behaviors of the Ritz-Galerkin method and the DNN method in solving PDEs to better understand the working principle of DNNs. We consider a particular Poisson problem (1), where the right-hand-side term is a discrete function. We analyze in theory why the two numerical methods behave differently. The R-G method treats the discrete as a linear combination of Dirac delta functions, while DNN methods implicitly impose regularity on the function that interpolates the discrete sampling points due to the F-Principle. Furthermore, the numerical experiments show that as the number of bases grows large, the solution obtained by the R-G method is piecewise linear regardless of the basis function, whereas the solution obtained by the DNN method is smooth and is not sensitive to the activation function. In conclusion, based on the theoretical and numerical study, the implicit bias, examined together with traditional methods, provides important understanding of the DNN methods.
Acknowledgements. Zhiqin Xu is supported by the Student Innovation Center and start-up fund at Shanghai Jiao Tong University. Jiwei Zhang is partially supported by NSFC under No. 11771035, the Natural Science Foundation of Hubei Province No. 2019CFA007, and Xiangtan University 2018ICIP01. Jiwei Zhang thanks Beijing Computational Science Research Center for its support and hospitality during his visit in the summer for this topic. Yaoyu Zhang is supported by National Science Foundation (Grant No. DMS-1638352) and the Ky Fan and Yu-Fen Fan Membership Fund.
Conflict of interest
The authors declare that they have no conflict of interest.
- (1) Brenner, S.C., Scott, L.R.: The mathematical theory of finite element methods. Springer, New York, third edition (2008)
- (2) Cai, W., Li, X., Liu, L.: PhaseDNN - a parallel phase shift deep neural network for adaptive wideband learning. arXiv preprint arXiv:1905.01389 (2019)
- (3) Cai, W., Xu, Z.Q.J.: Multi-scale deep neural networks for solving high dimensional PDEs. arXiv preprint arXiv:1910.11710 (2019)
- (4) Cai, Z., Chen, J., Liu, M., Liu, X.: Deep least-squares methods: an unsupervised learning-based numerical method for solving elliptic PDEs. arXiv preprint arXiv:1911.02109 (2019)
- (5) E, W., Han, J., Jentzen, A.: Deep learning-based numerical methods for high-dimensional parabolic partial differential equations and backward stochastic differential equations. Communications in Mathematics and Statistics 5(4), 349–380 (2017)
- (6) E, W., Yu, B.: The deep Ritz method: A deep learning-based numerical algorithm for solving variational problems. Communications in Mathematics and Statistics 6(1), 1–12 (2018)
- (7) Hamilton, A., Tran, T., Mckay, M., Quiring, B., Vassilevski, P.: DNN approximation of nonlinear finite element equations. Tech. rep., Lawrence Livermore National Lab. (LLNL), Livermore, CA (United States) (2019)
- (8) Han, J., Jentzen, A., E, W.: Solving high-dimensional partial differential equations using deep learning. Proceedings of the National Academy of Sciences 115(34), 8505–8510 (2018)
- (9) He, J., Li, L., Xu, J., Zheng, C.: ReLU deep neural networks and linear finite elements. arXiv preprint arXiv:1807.03973 (2018)
- (10) Liao, Y., Ming, P.: Deep Nitsche method: Deep Ritz method with essential boundary conditions. arXiv preprint arXiv:1912.01309 (2019)
- (11) Robert, C.P., Casella, G.: Monte Carlo Statistical Methods. Springer, New York (2004)
- (12) Shen, J., Tang, T., Wang, L.L.: Spectral methods. Algorithms, analysis and applications. Springer (2011)
- (13) Siegel, J.W., Xu, J.: On the approximation properties of neural networks. arXiv preprint arXiv:1904.02311 (2019)
- (14) Wang, Z., Zhang, Z.: A mesh-free method for interface problems using the deep learning approach. Journal of Computational Physics 400, 108963 (2020)
- (15) Xu, Z.Q.J., Zhang, Y., Luo, T., Xiao, Y., Ma, Z.: Frequency principle: Fourier analysis sheds light on deep neural networks. arXiv preprint arXiv:1901.06523 (2019)
- (16) Xu, Z.Q.J., Zhang, Y., Xiao, Y.: Training behavior of deep neural network in frequency domain. arXiv preprint arXiv:1807.01251 (2018)
- (17) Zhang, Y., Xu, Z.Q.J., Luo, T., Ma, Z.: Explicitizing an implicit bias of the frequency principle in two-layer neural networks. arXiv preprint arXiv:1905.10264 (2019)