Implicit bias with Ritz-Galerkin method in understanding deep learning for solving PDEs

02/19/2020 ∙ by Jihong Wang, et al. ∙ Shanghai Jiao Tong University ∙ Institute for Advanced Study ∙ Wuhan University

This paper aims at studying the difference between Ritz-Galerkin (R-G) method and deep neural network (DNN) method in solving partial differential equations (PDEs) to better understand deep learning. To this end, we consider solving a particular Poisson problem, where the information of the right-hand side of the equation f is only available at n sample points while the bases (neuron) number is much larger than n, which is common in DNN-based methods. Through both theoretical study and numerical study, we show the R-G method solves this particular problem by a piecewise linear function because R-G method considers the discrete sampling points as linear combinations of Dirac delta functions. However, we show that DNNs solve the problem with a much smoother function based on previous study of F-Principle (Xu et al., (2019) [15] and Zhang et al., (2019) [17]), that is, DNN methods implicitly impose regularity on the function that interpolates the discrete sampling points. Our work shows that with implicit bias of DNNs, the traditional methods, e.g., FEM, could provide insights into understanding DNNs.




1 Introduction

Deep neural networks (DNNs) have become increasingly important in scientific computing weinan2017deep ; weinan2018deep ; han2018solving ; he2018relu ; cai2019multi ; liao2019deep ; siegel2019approximation ; hamilton2019dnn ; cai2019deep ; wang2020mesh. A major potential advantage over traditional numerical methods is that DNNs could overcome the curse of dimensionality in high-dimensional problems. By drawing on traditional numerical methods, several studies have made progress in understanding the algorithmic characteristics of DNNs. For example, by exploring the ReLU DNN representation of continuous piecewise linear functions in the finite element method (FEM), he2018relu theoretically establishes that a ReLU DNN can accurately represent any linear finite element function. Regarding convergence behavior, xu_training_2018 ; xu2019frequency show a Frequency Principle (F-Principle): DNNs often eliminate low-frequency error first, whereas most conventional methods (e.g., the Jacobi method) exhibit the opposite convergence behavior, namely faster convergence for higher-frequency error. Such understanding could lead to better use of DNNs in practice; for example, multi-scale DNN algorithms have been proposed based on the F-Principle to eliminate high-frequency error quickly cai2019phasednn ; cai2019multi.

The aim of this paper is to investigate the different behaviors of DNNs and traditional numerical methods, e.g., the Ritz-Galerkin (R-G) method, in solving PDEs given a few sample points. We denote by $n$ the sample number and by $m$ the basis number in the R-G method or the neuron number in DNNs. In traditional PDE models, we consider the situation where the functions in the equation are completely known, i.e., the sample number $n$ goes to infinity. In practical applications, however, such as signal processing, statistical mechanics, and chemical and biophysical dynamic systems, we often encounter the problem that only a few sample values can be obtained. We ask how the R-G method behaves on this particular problem and what solution the DNN method obtains. In this paper, we show that the R-G method treats the discrete sampling points as a linear combination of Dirac delta functions, while DNN methods implicitly impose regularity on the function that interpolates the discrete sampling points. We also use the F-Principle to show how the DNN method differs from the R-G method. Our work indicates that, with the implicit bias of DNNs, traditional methods, e.g., FEM he2018relu, could provide insights into understanding DNNs.

The rest of the paper is organized as follows. In section 2, we briefly introduce the R-G method and the DNN method. In section 3, we present the difference between using these two methods to solve PDEs theoretically and numerically. We end the paper with the conclusion in section 4.

2 Preliminary

In this section, we take the toy model of Poisson’s equation as an example to investigate the difference in solution behavior between the R-G method and the DNN method.

2.1 Poisson problem

We consider the $d$-dimensional Poisson problem posed on a bounded domain $\Omega \subset \mathbb{R}^d$ with Dirichlet boundary condition:

$$-\Delta u(x) = f(x), \quad x \in \Omega; \qquad u(x) = 0, \quad x \in \partial\Omega, \tag{1}$$

where $\Delta$ represents the Laplace operator and $x = (x_1, x_2, \dots, x_d)$ is a $d$-dimensional vector. It is known that problem (1) admits a unique solution for $f \in L^2(\Omega)$, and that the regularity of the solution can be raised when $f$ is smoother. In the literature, there are a number of effective numerical methods for solving the boundary value problem (BVP) (1) in the general case. Here we consider a special situation: we only have the information of $f$ at the $n$ sample points $x_i$ ($i = 1, \dots, n$). In practical applications, we may imagine that we only have finite experimental data, i.e., the values $f(x_i)$ ($i = 1, \dots, n$), and no further information about $f$ at other points. By solving this particular Poisson problem (1) with the R-G method and the deep learning method, we aim to find the bias of these two methods in solving PDEs.

2.2 R-G method

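In this subsection, we briefly introduce the R-G method Brenner2008The. For problem (1), we construct the functional

$$J(u) = \frac{1}{2} a(u, u) - (f, u), \tag{2}$$

where

$$a(u, v) = \int_\Omega \nabla u(x) \cdot \nabla v(x) \, dx, \qquad (f, v) = \int_\Omega f(x) v(x) \, dx.$$

The variational form of problem (1) is the following: find $u \in H_0^1(\Omega)$ such that

$$J(u) = \min_{v \in H_0^1(\Omega)} J(v). \tag{3}$$

The weak form of (3) is to find $u \in H_0^1(\Omega)$ such that

$$a(u, v) = (f, v), \quad \forall v \in H_0^1(\Omega). \tag{4}$$

Problem (1) is the strong form if the solution $u$ is sufficiently regular. To numerically solve (4), we now introduce a finite-dimensional space $U_h$ to approximate the infinite-dimensional space $H_0^1(\Omega)$. Let $U_h \subset H_0^1(\Omega)$ be a subspace with a sequence of basis functions $\{\phi_1, \phi_2, \dots, \phi_m\}$. The numerical solution $u_h \in U_h$ that we seek can be represented as

$$u_h = \sum_{k=1}^m c_k \phi_k, \tag{5}$$

where the coefficients $\{c_k\}$ are the unknowns to be solved for. Replacing $H_0^1(\Omega)$ by $U_h$, both problems (3) and (4) are transformed into the following system:

$$\sum_{k=1}^m c_k \, a(\phi_k, \phi_j) = (f, \phi_j), \quad j = 1, 2, \dots, m. \tag{6}$$

From (6), we can calculate $\{c_k\}$ and then obtain the numerical solution $u_h$. We usually call (6) the R-G equation.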

Depending on the type of basis functions, the R-G method can be divided into the finite element method (FEM), the spectral method (SM), and so on. If the basis functions are local, namely, they have compact support, such as the linear hat basis function

$$\phi_k(x) = \begin{cases} \dfrac{x - x_{k-1}}{x_k - x_{k-1}}, & x \in [x_{k-1}, x_k], \\ \dfrac{x_{k+1} - x}{x_{k+1} - x_k}, & x \in [x_k, x_{k+1}], \\ 0, & \text{otherwise}, \end{cases} \tag{7}$$

the method is usually called the FEM. On the other hand, if we choose global basis functions such as the Fourier basis or Legendre functions Shen2011Spectral, we call the R-G method the SM.

The error estimate theory of the R-G method has been well established. Under suitable assumptions on the regularity of the solution, the linear finite element solution $u_h$ has the following error estimate

$$\| u - u_h \|_{1} \le C_1 h \, | u |_{2},$$

where $\| \cdot \|_1$ and $| \cdot |_2$ denote the $H^1$ norm and the $H^2$ seminorm, and the constant $C_1$ is independent of the grid size $h$. The spectral method has the following error estimate

$$\| u - u_h \| \le \frac{C_2}{m^{s}},$$

where the exponent $s$ depends only on the regularity (smoothness) of the solution $u$. If $u$ is smooth enough and satisfies certain boundary conditions, the spectral method achieves spectral accuracy.

In this paper, we use the R-G method to solve the Poisson problem (1) in the special situation where we only have the information of $f$ at the $n$ sample points $x_i$ ($i = 1, \dots, n$). In this case, the integral on the right-hand side of equation (6) is hard to compute exactly. In the one-dimensional case, if we know the properties of $f$, we might be able to compute this integral to high-order precision based on these points. In higher dimensions, however, the Monte Carlo (MC) method Christian2004Monte may be the best option for computing the integral on the right-hand side of (6), as far as we know. Since DNN-based methods are often used in high-dimensional cases, we use the MC integral to approximate the right-hand side of (6) and arrive at the following modified R-G equation

$$\sum_{k=1}^m c_k \, a(\phi_k, \phi_j) = \frac{1}{n} \sum_{i=1}^n f(x_i) \phi_j(x_i), \quad j = 1, 2, \dots, m. \tag{8}$$

In the later numerical experiments, we solve the Poisson problem (1) by the above R-G equation (8), and investigate the bias of the R-G method.
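As a minimal sketch of how (8) can be assembled and solved, the following Python snippet treats a one-dimensional instance, $-u'' = f$ on $(0, 1)$ with $u(0) = u(1) = 0$, using the sine basis $\sin(k\pi x)$. The domain, the basis, the sample points, the choice of $f$, and all function names are illustrative assumptions, not the setup of the experiments in this paper. With this basis the stiffness matrix is diagonal, so no linear solver is needed. The snippet also compares the result with the solution of the same problem driven by point sources at the sample points, written with the standard Green's function of the interval.

```python
import math


def rg_sine_solution(f, samples, m):
    """Coefficients c_1..c_m of the modified R-G system for -u'' = f on (0, 1),
    u(0) = u(1) = 0, with the assumed basis phi_k(x) = sin(k*pi*x)."""
    n = len(samples)
    coeffs = []
    for k in range(1, m + 1):
        # Right-hand side of the modified system: (1/n) * sum_i f(x_i) * phi_k(x_i)
        rhs = sum(f(x) * math.sin(k * math.pi * x) for x in samples) / n
        # The stiffness matrix is diagonal: a(phi_k, phi_k) = (k*pi)^2 / 2
        coeffs.append(rhs / ((k * math.pi) ** 2 / 2.0))
    return coeffs


def evaluate(coeffs, x):
    """Evaluate u_h(x) = sum_k c_k * sin(k*pi*x)."""
    return sum(c * math.sin((k + 1) * math.pi * x) for k, c in enumerate(coeffs))


def delta_source_solution(f, samples, x):
    """Solution of -u'' = (1/n) * sum_i f(x_i) * delta(x - x_i), u(0) = u(1) = 0,
    using the Green's function G(x, s) = min(x, s) * (1 - max(x, s))."""
    n = len(samples)
    return sum(f(s) * min(x, s) * (1.0 - max(x, s)) for s in samples) / n


def f(x):
    # Hypothetical right-hand side, only ever evaluated at the sample points.
    return math.pi ** 2 * math.sin(math.pi * x)


samples = [0.2, 0.4, 0.6, 0.8]
grid = [j / 100 for j in range(101)]
for m in (5, 50, 2000):
    c = rg_sine_solution(f, samples, m)
    gap = max(abs(evaluate(c, x) - delta_source_solution(f, samples, x)) for x in grid)
    print(m, gap)  # largest distance to the piecewise linear point-source solution
```

In this sketch the printed distance shrinks as `m` grows, which is the one-dimensional behavior that Section 3.1 analyzes.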

2.3 DNN method

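We now introduce the DNN method. The $L$-layer neural network is denoted by

$$u_\theta(x) = W^{[L-1]} \sigma \circ \big( W^{[L-2]} \sigma \circ ( \cdots ( W^{[1]} \sigma \circ ( W^{[0]} x + b^{[0]} ) + b^{[1]} ) \cdots ) + b^{[L-2]} \big) + b^{[L-1]}, \tag{9}$$

where $W^{[l]} \in \mathbb{R}^{m_{l+1} \times m_l}$, $b^{[l]} \in \mathbb{R}^{m_{l+1}}$, $m_0 = d$, $m_L = 1$, $\sigma$ is a scalar function and “$\circ$” means entry-wise operation. We denote the set of parameters by

$$\theta = \big( W^{[0]}, W^{[1]}, \dots, W^{[L-1]}, b^{[0]}, b^{[1]}, \dots, b^{[L-1]} \big),$$

and an entry of $W^{[l]}$ by $W^{[l]}_{ij}$.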

If the activation function in a one-hidden-layer DNN is chosen to have the form of the basis function $\phi$ in (5), then the solution of the one-hidden-layer DNN, similar to (5), is given as

$$u_h(x) = \sum_{k=1}^m c_k \, \phi(w_k \cdot x + b_k), \tag{10}$$

where $c_k$, $w_k$, $b_k$ are parameters. The basis functions in (5) are given and the coefficients are obtained by solving (6), while in the DNN method both the basis and the coefficients are obtained through the gradient descent algorithm with a loss function. The model in (10) can be generalized to a general DNN in (9).

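The loss function corresponding to problem (1) is given by

$$L(u_h) = \frac{1}{n} \sum_{i=1}^n \big( \Delta u_h(x_i) + f(x_i) \big)^2 + \beta \int_{\partial\Omega} u_h(x)^2 \, ds, \tag{11}$$

or a variational form weinan2017deep

$$L(u_h) = \frac{1}{2} \int_\Omega | \nabla u_h(x) |^2 \, dx - \frac{1}{n} \sum_{i=1}^n f(x_i) u_h(x_i) + \beta \int_{\partial\Omega} u_h(x)^2 \, ds, \tag{12}$$

where the last term enforces the boundary condition and $\beta$ is a hyper-parameter. In the numerical experiments, we use the loss function (11) to train DNNs. We remark that the results with the loss function (12) are similar.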
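To make the structure of such a loss concrete, here is a minimal sketch for a one-dimensional problem $-u'' = f$ on $(0, 1)$ with a one-hidden-layer network using the sine activation. It assumes a residual-type loss: the mean squared PDE residual at the sample points plus a boundary penalty weighted by a hyper-parameter, called `beta` here. The interval, the name `beta`, the parameter layout, and the function names are assumptions for illustration; no training loop is shown.

```python
import math


def network(params, x):
    """One-hidden-layer network u(x) = sum_k c_k * sin(w_k * x + b_k).
    `params` is a list of (c_k, w_k, b_k) triples (an assumed layout)."""
    return sum(c * math.sin(w * x + b) for c, w, b in params)


def network_second_derivative(params, x):
    """Exact u''(x) of the sine network: -sum_k c_k * w_k^2 * sin(w_k * x + b_k)."""
    return -sum(c * w * w * math.sin(w * x + b) for c, w, b in params)


def residual_loss(params, samples, f_values, beta, boundary=(0.0, 1.0)):
    """Mean squared residual of -u'' = f at the samples plus a boundary penalty.
    In one dimension the boundary integral reduces to a sum over two end points."""
    n = len(samples)
    interior = sum(
        (network_second_derivative(params, x) + fx) ** 2
        for x, fx in zip(samples, f_values)
    ) / n
    edge = sum(network(params, x) ** 2 for x in boundary)
    return interior + beta * edge


# Hypothetical data: f(x) = pi^2 * sin(pi * x), whose exact solution is sin(pi * x).
points = [0.1, 0.3, 0.5, 0.7, 0.9]
f_values = [math.pi ** 2 * math.sin(math.pi * x) for x in points]

exact_params = [(1.0, math.pi, 0.0)]      # represents u(x) = sin(pi * x)
perturbed_params = [(0.5, math.pi, 0.0)]  # wrong amplitude

print(residual_loss(exact_params, points, f_values, beta=10.0))
print(residual_loss(perturbed_params, points, f_values, beta=10.0))
```

The loss only sees $f$ at the sample points, so any network whose second derivative matches $-f$ there and which vanishes on the boundary attains a loss near zero; which of these functions gradient descent selects is the question of implicit bias.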

3 Main results

3.1 R-G method in solving PDE

In the classical case, $f$ is a given function, and thus we may obtain the right-hand-side term of the R-G equation by using the information of $f$ at the integration points. As the number of basis functions $m$ approaches infinity, the numerical solution obtained by the R-G method (6) approximates the exact solution of problem (1). It is interesting to ask: if we only have the information of $f$ at the $n$ sample points, what happens to the numerical solution obtained by (8) as $m \to \infty$?

Fixing the number of sample points $n$, we study the property of the solution of the numerical method (8). We have the following theorem.

Theorem 1

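When $m$ tends to infinity, the numerical method (8) is solving the problem

$$-\Delta u(x) = \frac{1}{n} \sum_{i=1}^n \delta(x - x_i) f(x_i), \quad x \in \Omega; \qquad u(x) = 0, \quad x \in \partial\Omega, \tag{13}$$

where $\delta$ is the Dirac delta function.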


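The variational form of (13) is: find $u \in H_0^1(\Omega)$ such that

$$a(u, v) = \Big( \frac{1}{n} \sum_{i=1}^n \delta(x - x_i) f(x_i), \, v \Big), \quad \forall v \in H_0^1(\Omega). \tag{14}$$

We use the R-G method (6) to solve (14) by replacing $H_0^1(\Omega)$ by a finite-dimensional space $U_h$. Set the numerical solution

$$u_h = \sum_{k=1}^m c_k \phi_k.$$

Then the above variational problem can be transformed into the R-G equation

$$\sum_{k=1}^m c_k \, a(\phi_k, \phi_j) = \frac{1}{n} \sum_{i=1}^n f(x_i) \big( \delta(x - x_i), \phi_j \big), \quad j = 1, 2, \dots, m.$$

Note that

$$\big( \delta(x - x_i), \phi_j \big) = \int_\Omega \delta(x - x_i) \phi_j(x) \, dx = \phi_j(x_i);$$

we then obtain

$$\sum_{k=1}^m c_k \, a(\phi_k, \phi_j) = \frac{1}{n} \sum_{i=1}^n f(x_i) \phi_j(x_i), \quad j = 1, 2, \dots, m.$$

This is exactly equation (8), obtained by solving problem (1) with the special $f$. According to the error estimation theory, the solution of this equation approximates the exact solution of problem (13) as $m$ approaches infinity.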

Remark: Since integrating the Dirac delta function twice yields a piecewise linear function, we know from Theorem 1 that, in the one-dimensional case, the numerical solution obtained by the R-G method (8) approaches a piecewise linear function as $m \to \infty$. We will verify this in later numerical experiments.
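The one-dimensional reasoning behind this remark can be written out as a short worked step. This is a sketch for an interval $\Omega = (a, b)$, which is an assumption made here for illustration; the jump relation is the standard consequence of integrating a Dirac delta across the point where it is concentrated.

```latex
% One-dimensional point-source problem on the assumed interval (a, b):
-u''(x) = \frac{1}{n} \sum_{i=1}^{n} f(x_i)\, \delta(x - x_i),
\qquad u(a) = u(b) = 0 .

% Away from the sample points the right-hand side vanishes, so
u''(x) = 0 \quad \text{for } x \neq x_i
\;\Longrightarrow\;
u \text{ is linear between consecutive sample points.}

% Integrating the equation over (x_i - \varepsilon, x_i + \varepsilon) and
% letting \varepsilon \to 0 gives a kink at each sample point, while u stays continuous:
u'(x_i^{+}) - u'(x_i^{-}) = -\frac{f(x_i)}{n} .
```

The solution is therefore continuous and piecewise linear, with a change of slope at each sample point whose size is set by $f(x_i)/n$.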

3.2 DNN method in solving PDE

DNNs are widely used in solving PDEs, especially for high-dimensional problems. Minimizing the loss functions in Eqs. (11) and (12) is equivalent to solving (6), except that the bases in (11) and (12) are adaptive. In addition, the DNN problem is optimized by (stochastic) gradient descent. The experiments in the next section show that when the number of bases goes to infinity, DNN methods solve (1) by a smoother function rather than the piecewise linear function in Theorem 1. In this section, we utilize the F-Principle to understand what leads to the smoothness.

We start from considering a two-layer neural network, following zhang2019explicitizing ,


where , , and , and () is the activation function of ReLU. Note that this two-layer model is slightly different from the model in (10) for easy calculation in zhang2019explicitizing . The target function is denoted by . The network is trained by mean-squared error (MSE) loss function



is a probability density. Considering finite samples, we have


For any function defined on

, we use the following convention of the Fourier transform and its inverse:

where denotes the frequency.

zhang2019explicitizing shows that when the neuron number is sufficiently large, training the network in (16) with gradient descent is equivalent to solving the following optimization problem




is the norm, and are initial parameters before training, and


The minimization in (19) clearly shows that the DNN has an implicit bias in addition to the sample constraint in (20). Because the weight in the objective of (19) monotonically increases with the frequency, the optimization problem prefers a function with fewer high-frequency components, which makes explicit the implicit bias of the F-Principle: DNNs prefer low frequencies xu_training_2018 ; xu2019frequency. For a general DNN, the coefficient cannot be obtained exactly; however, the corresponding monotonic dependence on the frequency can be postulated based on the F-Principle. Since a function dominated by lower frequencies is smoother, the solution of the optimization problem is smoother than the one derived from solving (8), which simply fills in all missing data of $f$ with zero.
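The effect of a frequency-dependent penalty can be illustrated with a toy computation. The sketch below is not the optimization problem from zhang2019explicitizing: it assumes a sine expansion $h(x) = \sum_k a_k \sin(k\pi x)$ on $(0, 1)$, a polynomial weight $(k\pi)^p$ chosen purely for illustration, and hypothetical sample data. It minimizes $\sum_k (k\pi)^p a_k^2$ subject to $h(x_i) = y_i$ through the standard Lagrange-multiplier solution of an equality-constrained quadratic problem.

```python
import math


def solve_linear(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(M[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (M[r][n] - s) / M[r][r]
    return x


def weighted_interpolant(xs, ys, m, power):
    """Minimize sum_k (k*pi)**power * a_k**2 subject to h(x_i) = y_i,
    where h(x) = sum_{k=1}^m a_k * sin(k*pi*x). Returns h as a callable."""
    n = len(xs)
    inv_w = [1.0 / (k * math.pi) ** power for k in range(1, m + 1)]
    phi = [[math.sin(k * math.pi * x) for k in range(1, m + 1)] for x in xs]
    # Optimality: a = W^{-1} Phi^T lam, with (Phi W^{-1} Phi^T) lam = y.
    gram = [
        [sum(phi[i][k] * inv_w[k] * phi[j][k] for k in range(m)) for j in range(n)]
        for i in range(n)
    ]
    lam = solve_linear(gram, ys)
    a = [inv_w[k] * sum(lam[i] * phi[i][k] for i in range(n)) for k in range(m)]
    return lambda x: sum(a[k] * math.sin((k + 1) * math.pi * x) for k in range(m))


def kink_size(h, x0, step=0.02):
    """Absolute second difference of h at x0; large values indicate a kink."""
    return abs(h(x0 + step) + h(x0 - step) - 2.0 * h(x0))


xs, ys = [0.25, 0.5, 0.75], [1.0, -1.0, 1.0]  # hypothetical samples
h_p2 = weighted_interpolant(xs, ys, m=200, power=2)
h_p4 = weighted_interpolant(xs, ys, m=200, power=4)
print(kink_size(h_p2, 0.5), kink_size(h_p4, 0.5))
```

Both functions pass through the same samples. With `power=2` the objective is proportional to $\int |h'|^2 dx$, whose constrained minimizer is piecewise linear, while the faster-growing weight with `power=4` suppresses high frequencies more strongly and gives a visibly smoother interpolant.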

3.3 Experiments

In this section, we present three examples to investigate the numerical solution behaviors of the R-G and DNN methods. For simplicity, we consider the following two-point boundary value problem


where The problem (23) has the exact solution in the form of

Example 1: Fixing the number of sampling points , we use the R-G method and the DNN method to solve problem (23).

R-G method. First, we use the R-G method to solve problem (23), specifically the spectral method with the Fourier basis functions given as

We set the numbers of basis functions . Fig. 1 plots the numerical and exact solutions with different . One can see that the R-G solution approximates a piecewise linear function when . This result is consistent with the properties of the solutions we analyzed in Theorem 1.

Figure 1: Numerical solution in SM with fixed

DNN method. For better comparison with the R-G method, we choose the sine activation function in a DNN with one hidden layer, and the number of neurons . The loss function (11) is selected with the parameter . We reduce the loss to an order of 1e-4, and the learning rate is set to 1e-4. We use 1000 test points when drawing the figures. Fig. 2 plots the comparison between the DNN solution and the exact solution. We observe that the DNN solutions are always smooth even when is large.

Figure 2: Solutions in DNN method with fixed

Example 2: We set the number of sampling points , and keep the other parameters the same as in Example 1. The R-G solutions and DNN solutions are shown in Figs. 3 and 4, respectively. The results are consistent with the case when , i.e. the R-G solution converges to a piecewise linear function as is taken larger and larger, while the DNN solution remains smooth.

Figure 3: Numerical solution in SM with fixed
Figure 4: Numerical solution in DNN method with fixed

To further verify that the R-G solution approximates the piecewise linear function, we compute the mean square difference between numerical solution (R-G and DNN) and piecewise linear solution, which is obtained by interpolating the values of the numerical solutions at sampling points. We denote the numerical solution as and the interpolated function as . For simplicity, we set and . The mean square difference can be represented as

Here we choose . The mean square differences are presented in Fig. 5. In the R-G part, where we use a log-log coordinate system, we see that the R-G solution approaches the piecewise linear solution at an algebraic rate when . In the DNN part, where we use a semi-log coordinate system, we observe that the distance between the DNN solution and the piecewise linear solution hardly changes with the increase of .

Figure 5: Mean square difference of numerical solution and piecewise linear solution vs.
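A measurement of this kind can be sketched as follows. The snippet interpolates a numerical solution linearly through its own values at a set of nodes and averages the squared gap over test points. Taking the nodes to be the sample points together with the two end points of the interval, as well as the function names and the example inputs, are assumptions made for illustration.

```python
import bisect


def piecewise_linear(nodes, values, x):
    """Linear interpolation through (nodes[i], values[i]); nodes must be sorted."""
    i = bisect.bisect_right(nodes, x) - 1
    i = min(max(i, 0), len(nodes) - 2)  # clamp to a valid interval index
    t = (x - nodes[i]) / (nodes[i + 1] - nodes[i])
    return (1.0 - t) * values[i] + t * values[i + 1]


def mean_square_difference(u, nodes, test_points):
    """Mean squared gap between u and the piecewise linear function that
    agrees with u at the given nodes."""
    values = [u(s) for s in nodes]
    return sum(
        (u(x) - piecewise_linear(nodes, values, x)) ** 2 for x in test_points
    ) / len(test_points)


# Hypothetical usage: a curved function versus an already linear one.
nodes = [0.0, 0.5, 1.0]
test_points = [j / 1000 for j in range(1001)]
print(mean_square_difference(lambda x: x * x, nodes, test_points))
print(mean_square_difference(lambda x: 2.0 * x + 1.0, nodes, test_points))
```

A value near zero means the numerical solution is close to piecewise linear between the nodes; a value that stays bounded away from zero indicates a solution that remains curved.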

Example 3: In this example, we use the ReLU function as the basis function in the R-G method and as the activation function in the DNN method to repeat the experiment of Example 1. The number of sampling points is taken as . The unstated parameters are the same as those in Example 1.

The linear finite element function can be represented by ReLU functions, namely,

$$\phi_k(x) = \frac{1}{h_{k-1}} \mathrm{ReLU}(x - x_{k-1}) - \Big( \frac{1}{h_{k-1}} + \frac{1}{h_k} \Big) \mathrm{ReLU}(x - x_k) + \frac{1}{h_k} \mathrm{ReLU}(x - x_{k+1}),$$

where $h_{k-1} = x_k - x_{k-1}$ and $h_k = x_{k+1} - x_k$. We can therefore use ReLU as the basis function for the R-G method; for convenience, however, we simply use piecewise linear functions instead of ReLU functions as the basis functions. The FEM solution is piecewise linear, as shown in Fig. 6.
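The equivalence between a linear hat function and a combination of three shifted ReLU functions can be checked numerically. The sketch below uses a hat with nodes `left < centre < right` and possibly unequal spacing; the node values and function names are arbitrary choices for illustration.

```python
def relu(z):
    return max(z, 0.0)


def hat(x, left, centre, right):
    """Linear hat function: 1 at `centre`, 0 outside [left, right]."""
    if left <= x <= centre:
        return (x - left) / (centre - left)
    if centre < x <= right:
        return (right - x) / (right - centre)
    return 0.0


def hat_from_relu(x, left, centre, right):
    """The same hat function written as a combination of three ReLU terms."""
    h_left, h_right = centre - left, right - centre
    return (
        relu(x - left) / h_left
        - (1.0 / h_left + 1.0 / h_right) * relu(x - centre)
        + relu(x - right) / h_right
    )


# Compare the two forms on a grid that extends beyond the support of the hat.
grid = [-1.0 + 0.01 * j for j in range(401)]
worst = max(abs(hat(x, 0.2, 0.5, 1.1) - hat_from_relu(x, 0.2, 0.5, 1.1)) for x in grid)
print(worst)  # difference is at the level of floating-point round-off
```

The slope changes of the three ReLU terms are $1/h_{k-1}$, $-(1/h_{k-1} + 1/h_k)$ and $1/h_k$, which sum to zero, so the combination vanishes to the right of the support as well as to the left.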

In the DNN method, we construct neural networks with three hidden layers; the number of neurons in each layer is . As shown in Fig. 7, the DNN again learns the data as a very smooth function.

Figure 6: Numerical solution in FEM with fixed
Figure 7: Numerical solution in DNN method with fixed

4 Conclusion

This paper compares the behaviors of the Ritz-Galerkin method and the DNN method in solving PDEs in order to better understand the working principle of DNNs. We consider a particular Poisson problem (1), where the right-hand side $f$ is only known at discrete sample points. We analyze theoretically why the two numerical methods behave differently. The R-G method treats the discrete $f$ as a linear combination of Dirac delta functions, while DNN methods implicitly impose regularity on the function that interpolates the discrete sampling points, due to the F-Principle. Furthermore, the numerical experiments show that, as the number of bases grows large, the solution obtained by the R-G method is piecewise linear regardless of the basis function, whereas the solution obtained by the DNN method is smooth and is not sensitive to the activation function. In conclusion, based on the theoretical and numerical study, comparing the implicit bias of DNNs with traditional methods provides important insight into DNN methods.

Zhiqin Xu is supported by the Student Innovation Center and a start-up fund at Shanghai Jiao Tong University. Jiwei Zhang is partially supported by NSFC under No. 11771035, the Natural Science Foundation of Hubei Province No. 2019CFA007, and Xiangtan University 2018ICIP01. Jiwei Zhang thanks the Beijing Computational Science Research Center for its support and hospitality during his visit in the summer for this topic. Yaoyu Zhang is supported by the National Science Foundation (Grant No. DMS-1638352) and the Ky Fan and Yu-Fen Fan Membership Fund.

Conflict of interest

The authors declare that they have no conflict of interest.