
SelectNet: Self-paced Learning for High-dimensional Partial Differential Equations

The residual method with deep neural networks as function parametrization has been applied to solve certain high-dimensional partial differential equations (PDEs) successfully; however, its convergence is slow and might not be guaranteed even within a simple class of PDEs. To improve the convergence of the network-based residual model, we introduce a novel self-paced learning framework, SelectNet, which quantifies the difficulty of training samples, chooses simpler samples in the early stage of training, and slowly explores more challenging samples, e.g., samples with larger residual errors, mimicking the human cognitive process for more efficient learning. In particular, a selection network and the PDE solution network are trained simultaneously; the selection network adaptively weights the training samples of the solution network, achieving the goal of self-paced learning. Numerical examples indicate that the proposed SelectNet model outperforms existing models in convergence speed and robustness, especially for low-regularity solutions.


1 Introduction

High-dimensional partial differential equations (PDEs) are important tools in physical, financial, and biological models [30, 15, 49, 16, 46]. However, developing numerical methods for high-dimensional PDEs has been a challenging task due to the curse of dimensionality in the discretization of the problem. For example, in traditional methods such as finite difference methods and finite element methods, $O(N^d)$ degrees of freedom are required for a $d$-dimensional problem if we set $N$ grid points or basis functions in each direction. Even if $d$ is only moderately large, the exponential growth in the dimension makes traditional methods computationally intractable.

Recent research on the approximation theory of deep neural networks (DNNs) shows that deep network approximation is a powerful tool for mesh-free function parametrization. The study of the approximation theory of neural networks traces back to the pioneering work [8, 19, 1] on the universal approximation of shallow networks with sigmoid activation functions. Recent research has focused on the approximation rate of DNNs for various function spaces in terms of the number of network parameters, showing that deep networks are more powerful than shallow networks in terms of approximation efficiency; examples include smooth functions [34, 32, 47, 14, 36, 45, 13], piecewise smooth functions [39], band-limited functions [37], and continuous functions [48, 42, 41]. The reader is referred to [41] for the explicit characterization of the approximation error for networks with arbitrary width and depth.

In particular, deep network approximation can lessen or overcome the curse of dimensionality under certain circumstances, making it an attractive tool for solving high-dimensional problems. For a sufficiently smooth function that is the integral transform of a high-dimensional (essentially) compactly supported function with a one-dimensional integral kernel, the absence of the curse of dimensionality can be shown by connecting network approximation with Monte Carlo sampling or, equivalently, the law of large numbers [1, 37]. For functions in the Korobov space, connecting deep network approximation with sparse grid approximation shows that it can lessen the curse of dimensionality [36]. For general continuous functions, [38] proves that deep network approximation can significantly reduce the curse of dimensionality through the Kolmogorov-Arnold superposition theorem. Finally, if the approximation error is only measured on a low-dimensional manifold, there is no curse of dimensionality for deep network approximation [6, 4, 41].

As an efficient function parametrization tool, neural networks have been applied to solve PDEs via various approaches. Early work in [29] applies neural networks to approximate PDE solutions defined on grid points. Later in [10, 28], DNNs are employed to approximate solutions in the whole domain and PDEs are solved by minimizing the discrete residual at prescribed collocation points. DNNs coupled with boundary governing terms by design can satisfy boundary conditions automatically [35]; nevertheless, designing such boundary governing terms is usually difficult for complex geometry. Another approach to enforcing boundary conditions is to add the boundary residual error to the loss function as a penalty term and minimize it together with the PDE residual error [17, 27]. This second technique is in the same spirit as residual methods in finite element analysis and is more convenient to implement; therefore, it has been widely utilized for PDEs on complex domains. However, network computation used to be prohibitively expensive, which limited the applications of network-based PDE solvers. Thanks to the development of GPU-based parallel computing over the last two decades, which greatly accelerates network computation, network-based PDE solvers have been revisited recently and have become a popular tool, especially for high-dimensional problems [43, 3, 50, 31, 12, 18, 2, 20, 21, 5]. Nevertheless, most network-based PDE solvers suffer from robustness issues: their convergence is slow and might not be guaranteed even within a simple class of PDEs.

To ease the issue above, we introduce a novel self-paced learning framework, SelectNet, to adaptively choose training samples in the residual model. Self-paced learning [26] is a recently proposed learning technique that selects a subset of the training samples for actual training over time. Specifically, for a given training data set, self-paced learning uses a binary vector to indicate whether or not each training sample should be included in the current training stage. The philosophy of self-paced learning is to mimic the learning style of human beings, who tend to learn the easier aspects of a task first and deal with more complicated samples later. Based on self-paced learning, a novel technique for selected sampling is put forward, which uses a selection neural network instead of the 0-1 selection vector; this helps the learning process avoid redundant training information and speeds up the convergence of the learning outcome. This idea is further improved in [22] by introducing a DNN to select training data for image classification. Among similar works, a state-of-the-art algorithm named SelectNet is proposed in [33] for image classification, especially for imbalanced data problems. Based on the observation that samples near the singularity of the PDE solution are rare compared to samples from the regular part, we extend the SelectNet of [33] to network-based residual models, especially for PDE solutions with certain irregularity. As we shall see later, numerical results show that the proposed model is competitive with the traditional (basic) residual model for analytic solutions, and outperforms it in convergence speed for low-regularity solutions.

The organization of this paper is as follows. In Section 2, we introduce the residual methods and formulate the corresponding optimization model. In Section 3, we present the SelectNet model in detail. In Section 4, we put forward the error estimates of the basic and SelectNet models. In Section 5, we discuss the network implementation in the proposed model. In Section 6, we present ample numerical experiments for various equations to validate our model. We conclude with some remarks in the final section.

2 Residual Methods for PDEs

In this work, we aim at solving the following (initial) boundary value problems, given a bounded domain:

  • elliptic equations

    (1)
  • parabolic equations

    (2)
  • hyperbolic equations

    (3)

where $u$ is the unknown solution function, the remaining functions are given data, the spatial differential operator acts on the derivatives of $u$, and the boundary operator specifies a Dirichlet, Neumann, or Robin boundary condition.

In this method, the temporal variable is regarded as an extra spatial coordinate and is not treated differently from the spatial variables. For simplicity, the PDEs in (1)-(3) are unified in the following form

(4)

where the independent variable collects the spatial variable and possibly the temporal variable; the differential operator represents a generic PDE; the boundary operator represents the governing conditions, including the boundary condition and possibly the initial condition; and the two equations in (4) are posed on their respective domains (the interior and the boundary of the space-time domain).

Now we seek a neural network that approximates the solution of the PDE (4). Note that the residuals for the PDE and for the governing conditions can be written as

(5)

One can solve the PDE by searching for the optimal network parameters that minimize the sum of the squared norms of these two residuals, i.e.,

(6)

where the norm is usually the $L^2$-norm and a weight parameter balances the two terms, e.g.,

(7)
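To make the discretized objective concrete, here is a minimal sketch of how the basic residual loss (6)-(7) could be assembled in PyTorch, assuming for illustration a Poisson-type operator with Dirichlet data; the functions u_net, f, g, the weight lam, and the Monte Carlo discretization of the norms are assumptions of this sketch, not the paper's exact formulation.

import torch

def pde_residual(u_net, x, f):
    # point-wise PDE residual -Laplacian(u)(x) - f(x), computed with autograd
    x = x.requires_grad_(True)
    u = u_net(x)                                                    # shape (N, 1)
    grad_u = torch.autograd.grad(u.sum(), x, create_graph=True)[0]  # shape (N, d)
    lap = 0.0
    for i in range(x.shape[1]):
        lap = lap + torch.autograd.grad(grad_u[:, i].sum(), x,
                                        create_graph=True)[0][:, i]
    return -lap - f(x)

def boundary_residual(u_net, xb, g):
    # point-wise boundary residual u(x) - g(x) for a Dirichlet-type condition
    return u_net(xb).squeeze(-1) - g(xb)

def basic_loss(u_net, x_int, x_bdry, f, g, lam=1.0):
    # Monte Carlo version of ||PDE residual||^2 + lam * ||boundary residual||^2
    r_int = pde_residual(u_net, x_int, f)
    r_bdry = boundary_residual(u_net, x_bdry, g)
    return (r_int ** 2).mean() + lam * (r_bdry ** 2).mean()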

3 SelectNet Model

The network-based residual model has been applied to solve certain high-dimensional PDEs successfully. However, its convergence is slow and might not be guaranteed even within a simple class of PDEs. To ease this issue, we introduce a novel self-paced learning framework, SelectNet, to adaptively choose training samples in the residual model. The basic philosophy is to mimic the human cognitive process for more efficient learning: learning first from easier examples and slowly exploring more complicated ones. The proposed model is related to selected sampling [7, 23], an important tool of deep learning for computer science applications. Nevertheless, the effectiveness of selected sampling in scientific computing has not been fully explored yet.

In particular, a selection network (with subscript s for "selection") and the PDE solution network are trained simultaneously; the selection network adaptively weights the training samples of the solution network, achieving the goal of self-paced learning. The selection network acts as a "mentor" that helps decide whether a sample is important enough to train the "student" solution network. The "mentor" can filter out redundant training information and thus help to speed up the convergence. This idea originates from self-paced learning [25] and is further improved in [22] by introducing a DNN to select training data for image classification. Among similar works, a state-of-the-art algorithm named SelectNet was proposed in [33] for image classification, especially for imbalanced data problems. Based on the observation that samples near the singularity of the PDE solution are rare compared to samples from the regular part, we extend the SelectNet of [33] to network-based residual models, especially for PDE solutions with certain irregularity.

Originally, in image classification, for a given training data set, self-paced learning uses a binary vector to indicate whether or not each training sample should be included in the current training stage (an entry equals one if the corresponding sample is included in the current iteration, and zero otherwise). The overall target function including this selection vector is

(8)

where the first term is the loss of a DNN classifier for assigning each sample to its label. When the binary selection variables are relaxed to the interval $[0,1]$ and an alternating convex search is applied to solve the relaxed optimization, a straightforward derivation reveals the optimal value of each selection entry in each iteration:

(9)

A sample whose loss is smaller than the threshold is treated as an "easy" sample and will be selected for training. When the classifier is updated with the selection vector fixed, it is trained only on the selected "easy" samples. This mechanism helps to reduce the generalization error in image classification, where the training data distribution usually differs from the test data distribution.
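As a toy illustration of this selection rule (not part of the paper's algorithm), the snippet below applies the thresholding to a handful of made-up per-sample losses and shows how harder samples enter the training set as the pace parameter grows.

import numpy as np

def self_paced_weights(per_sample_losses, threshold):
    # rule (9): select a sample (weight 1) only if its loss is below the threshold
    return (per_sample_losses < threshold).astype(float)

losses = np.array([0.05, 0.30, 1.20, 0.70])     # hypothetical per-sample losses
for threshold in [0.1, 0.5, 1.5]:               # growing "pace" parameter
    v = self_paced_weights(losses, threshold)
    mean_selected = (v * losses).sum() / max(v.sum(), 1.0)
    print(threshold, v, mean_selected)          # more samples selected as the threshold grows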

When solving high-dimensional PDEs, the training and test data distributions are the same and there is no limitation on sampling. Hence, the desired mechanism in this case is to: 1) impose no bias in choosing samples, for fast convergence in the early stage of training; 2) exclude "easy" samples and focus on "difficult" samples, for better convergence in the later stage of training. In particular, the selection network should satisfy the following requirements. First, as a weight function, it is set to be bounded and uniformly scaled. Second, it should generate no bias towards particular samples in the early stage of training. Third, its role is to assign higher weights to samples with larger point-wise residual errors in the later stage of training.

In practice, two selection networks are introduced separately for the PDE residual and for the condition residual. They are required to be bounded as

(10)
(11)

and uniformly scaled as

(12)

Based on the model of residual methods in (7) and the requirements above, we have the following SelectNet framework for the residual method:

(13)

where the last penalty term, with a small penalty parameter, enforces (12) approximately. In the early stage of training, the selection network can be initialized as the constant one and, with high probability, it has a slowly varying and smooth configuration, satisfying the second requirement above. In the later stage of training, the selection network has been optimized by the inner maximization problem to emphasize difficult samples, so the third requirement above is satisfied. Besides, the conditions in (10)-(11) hold automatically if the last activation function of the selection network is bounded and the network output is properly rescaled and shifted, as we shall discuss in the next section.
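A minimal sketch of an objective of the form (13) is given below, assuming the point-wise residuals and the selection-network values at the sampled points have already been computed (e.g., as in the sketch at the end of Section 2); the variable names, the Monte Carlo averaging of the norms, and the penalty scaling by 1/eps with a small eps are assumptions about the exact form, not a verbatim reproduction of (13).

def selectnet_objective(r_int, r_bdry, w_int, w_bdry, lam=1.0, eps=1e-3):
    # r_int, r_bdry: point-wise PDE / boundary residuals at the sampled points
    # w_int, w_bdry: values of the bounded selection networks at the same points
    weighted = (w_int * r_int ** 2).mean() + lam * (w_bdry * r_bdry ** 2).mean()
    # penalty pushing the average selection weight towards one, cf. (12)
    penalty = (w_int.mean() - 1.0) ** 2 + (w_bdry.mean() - 1.0) ** 2
    # maximized over the selection networks, minimized over the solution network
    return weighted - penalty / eps

The penalty keeps the inner maximization from inflating the weights uniformly, so the selection networks can only increase the emphasis on some samples at the expense of others.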

4 Error Estimates

In this section, a theoretical analysis is presented to show that the solution errors of the basic and SelectNet models are bounded by the loss function (the residual errors). Specifically, we take an elliptic PDE with a Neumann boundary condition as an example. The conclusion can be generalized to other well-posed PDEs by similar arguments. Consider

(14)

where the domain is an open subset of $\mathbb{R}^d$ with a smooth boundary, and the coefficients and data functions are given in suitable function spaces.

Suppose the problem (14) admits a unique solution in the appropriate Sobolev space. Also, suppose the variational optimization problem

(15)

with a prescribed admissible set has a feasible solution satisfying

(16)

then

(17)

where the constant depends only on the domain and the equation data. Furthermore, let two admissible sets of selection functions be given, each containing the constant function one. Suppose the variational optimization problem

(18)

has a feasible solution satisfying

(19)

then (17), with this feasible solution in place of the previous one, also holds.

Let $u$ be the true solution of (14), and let $e$ denote the difference between $u$ and the feasible approximate solution. Starting from the identity

(20)

we multiply both sides of (20) by $e$ and integrate over the domain; integration by parts then gives

(21)

Hence, by the Cauchy-Schwarz inequality,

(22)

By the trace theorem, the $L^2$-norm of $e$ on the boundary is bounded by a constant multiple of its $H^1$-norm over the domain, with a constant depending only on the domain; then we have

(23)

with a constant depending only on the domain and the equation data. Finally, by the hypothesis (16), the estimate (17) follows directly from (23).

Similarly, using the fact that the selection weights in the SelectNet loss are bounded below by a positive constant, we can obtain the same estimate for the feasible solution of the SelectNet model (18).
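For concreteness, the chain of estimates (20)-(23) can be written out for a model Neumann problem; the specific operator below, $-\Delta u + u = f$ in $\Omega$ with $\partial u/\partial n = g$ on $\partial\Omega$, is an assumption used only to illustrate the argument and need not coincide with the exact form of (14).

Let $u$ be the true solution, $\phi$ a feasible approximation, $e = u - \phi$, and define the residuals $r_1 = -\Delta\phi + \phi - f$ in $\Omega$ and $r_2 = \partial\phi/\partial n - g$ on $\partial\Omega$, so that $-\Delta e + e = -r_1$ in $\Omega$ and $\partial e/\partial n = -r_2$ on $\partial\Omega$. Multiplying by $e$ and integrating by parts,
\[
\int_\Omega \bigl(|\nabla e|^2 + e^2\bigr)\,dx = -\int_\Omega r_1 e\,dx - \int_{\partial\Omega} r_2 e\,ds.
\]
By the Cauchy-Schwarz inequality and the trace inequality $\|e\|_{L^2(\partial\Omega)} \le C_{\mathrm{tr}}\|e\|_{H^1(\Omega)}$,
\[
\|e\|_{H^1(\Omega)}^2 \le \bigl(\|r_1\|_{L^2(\Omega)} + C_{\mathrm{tr}}\|r_2\|_{L^2(\partial\Omega)}\bigr)\,\|e\|_{H^1(\Omega)},
\]
hence $\|e\|_{H^1(\Omega)} \le \|r_1\|_{L^2(\Omega)} + C_{\mathrm{tr}}\|r_2\|_{L^2(\partial\Omega)}$, which is an estimate of the form (17) once the residual norms are controlled as in the hypothesis (16).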

By the triangle inequality, we can conclude that the solutions of the basic and SelectNet models are equivalent as long as the loss functions are sufficiently minimized; that is, under the hypotheses of Theorem 4.1, we have

(24)

5 Network Implementation

5.1 Network Architecture

The proposed framework is independent of the choice of DNNs. Advanced network design may improve the accuracy and convergence of the proposed framework, which would be interesting for future work.

In this paper, feedforward neural networks are used repeatedly. Such a network, taking an input vector and depending on a set of parameters, is defined recursively as follows:

(25)

where the activation function is an application-dependent nonlinearity, and the parameter set consists of all the weights and biases, satisfying

(26)

The number of neurons in each hidden layer is called the width of the network, and the number of layers is called the depth.

For simplicity, we deploy a feedforward neural network with a chosen activation function as the solution network that approximates the solution of the PDE. As for the selection network introduced in Section 3, since it is required to be bounded, it can be defined via

(27)

where the outer function is the sigmoid function and the inner function is a generic network, e.g., a feedforward neural network with the ReLU activation.
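As a concrete but purely illustrative realization of these two architectures, the sketch below builds a fully connected solution network and a selection network whose output is squashed by a sigmoid and then rescaled into a prescribed interval, so that bounds of the type (10)-(11) hold by construction; the class names, widths, depths, activation choices, and bound values are assumptions of this sketch.

import torch
import torch.nn as nn

class FeedForward(nn.Module):
    # plain fully connected network as in (25): hidden layers of equal width
    def __init__(self, dim_in, width, depth, activation=None):
        super().__init__()
        act = activation if activation is not None else nn.Tanh()
        layers = [nn.Linear(dim_in, width), act]
        for _ in range(depth - 1):
            layers += [nn.Linear(width, width), act]
        layers += [nn.Linear(width, 1)]
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x)

class BoundedSelectionNet(nn.Module):
    # selection network as in (27): low + (high - low) * sigmoid(generic network)
    def __init__(self, dim_in, width, depth, low=0.1, high=10.0):
        super().__init__()
        self.base = FeedForward(dim_in, width, depth, activation=nn.ReLU())
        self.low, self.high = low, high

    def forward(self, x):
        return self.low + (self.high - self.low) * torch.sigmoid(self.base(x))

Here the selection network's output always lies in the interval between the two bounds, mirroring the rescaling and shifting mentioned at the end of Section 3.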

5.2 Special Network for Dirichlet Boundary Conditions

In the case of homogeneous Dirichlet boundary conditions, it is worth mentioning a special network design that satisfies the boundary condition automatically as discussed in [28, 3].

Let us focus on the boundary value problem to introduce this special network structure. It is straightforward to generalize the idea to the case of an initial boundary value problem, so we omit that discussion. Assume a homogeneous Dirichlet boundary condition

(28)

then a solution network automatically satisfying the condition above can be constructed by

(29)

where one factor is a generic network as in (25), and the other is a specifically chosen function that vanishes on the boundary.

For example, if the domain is the $d$-dimensional unit ball, then this function can take the form

(30)

For another example, if the domain is a $d$-dimensional cube, then it can take the form

(31)

Since the boundary condition is always fulfilled, it suffices to solve the min-max problem

(32)

to identify the best solution network.
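A minimal sketch of the construction (29) follows, assuming for illustration the d-dimensional unit ball and the unit cube [0, 1]^d as domains; the exact expressions in (30)-(31) are not reproduced, and the factors below are simply standard choices that vanish on the respective boundaries. The base network can be any generic network, e.g., the FeedForward sketch above.

import torch

def h_unit_ball(x):
    # vanishes on the unit sphere |x| = 1
    return 1.0 - (x ** 2).sum(dim=1, keepdim=True)

def h_unit_cube(x):
    # vanishes whenever some coordinate hits 0 or 1 (unit cube [0, 1]^d)
    return torch.prod(x * (1.0 - x), dim=1, keepdim=True)

def dirichlet_solution_net(base_net, h):
    # construction (29): phi(x) = h(x) * base_net(x) is zero on the boundary
    return lambda x: h(x) * base_net(x)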

5.3 Derivatives of Networks

Note that the evaluation of the optimization problem in (13) involves derivatives of the network with respect to its input. When the activation function of the network is differentiable, the network is differentiable and these derivatives can be evaluated efficiently via the backpropagation algorithm. However, the network adopted in this paper is not differentiable, so numerical differentiation is utilized to estimate the derivatives of the network. For example, for the elliptic operator, the Laplacian of the network can be estimated by the second-order central difference formula

(33)

up to an error of $O(h^2)$, where $h$ is the step size of the difference stencil.
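The sketch below implements such a second-order central-difference estimate of the Laplacian of a network, vectorized over a batch of points; the step size h and the function names are illustrative choices.

import torch

def laplacian_fd(u_net, x, h=1e-3):
    # Delta u(x) ~ sum_i [u(x + h e_i) - 2 u(x) + u(x - h e_i)] / h^2
    n, d = x.shape
    u0 = u_net(x).squeeze(-1)
    lap = torch.zeros(n)
    for i in range(d):
        e = torch.zeros(d)
        e[i] = h
        lap = lap + (u_net(x + e).squeeze(-1) - 2.0 * u0
                     + u_net(x - e).squeeze(-1)) / h ** 2
    return lap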

5.4 Network Training

Once the networks have been set up, it remains to train them by solving the min-max problem in (13). The stochastic gradient descent (SGD) method or its variants (e.g., Adam [24]) are efficient tools for solving this problem numerically. Although the convergence of SGD for min-max problems is still an active research topic [40, 9, 44], empirical success shows that SGD can provide a good approximate solution.

Before completing the algorithm description of SelectNet, let us introduce the key setup of SGD and summarize it in Algorithm 1 below. In each training iteration, we first draw uniformly distributed training points in the interior of the domain and on its boundary, and define the empirical loss at these training points as

(34)

where the norms in (13) are approximated by averages over the sampled points. Next, the selection network parameters can be updated by gradient ascent via

(35)

and the solution network parameters can be updated by gradient descent via

(36)

with prescribed step sizes. Note that the training points are randomly renewed in each iteration. In fact, for the same set of training points in an iteration, the updates (35) and (36) can each be performed multiple times; the numbers of inner updates are listed in Table 1.

Input:  the PDE (4)
Output:  the parameters in the solution network
  Set parameters , , for iterations and parameters , for sample sizes
  Initialize and
  for  do
     Generate uniformly distributed sampling points and
     for  do
        Update
     end for
     
     for  do
        Update
     end for
     
     if the stopping criterion is satisfied then
        Return
     end if
  end for
Algorithm 1 Residual Model with SelectNet
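Algorithm 1 can be rendered compactly in code. The sketch below is illustrative only: it uses the Adam optimizer in place of the plain SGD updates (35)-(36), and the sampling routines, the objective function assembling (13), the learning rates, and the update counts are placeholders rather than the paper's settings.

import torch

def train_selectnet(u_net, sel_int, sel_bdry, objective,
                    sample_interior, sample_boundary,
                    n_iter=1000, n_sel_steps=1, n_sol_steps=1,
                    lr_sel=1e-3, lr_sol=1e-3):
    opt_sel = torch.optim.Adam(list(sel_int.parameters()) + list(sel_bdry.parameters()),
                               lr=lr_sel)
    opt_sol = torch.optim.Adam(u_net.parameters(), lr=lr_sol)
    for _ in range(n_iter):
        # renew the training points in every outer iteration
        x_int, x_bdry = sample_interior(), sample_boundary()
        for _ in range(n_sel_steps):
            # inner maximization: gradient ascent on the selection networks
            loss = -objective(u_net, sel_int, sel_bdry, x_int, x_bdry)
            opt_sel.zero_grad()
            loss.backward()
            opt_sel.step()
        for _ in range(n_sol_steps):
            # outer minimization: gradient descent on the solution network
            loss = objective(u_net, sel_int, sel_bdry, x_int, x_bdry)
            opt_sol.zero_grad()
            loss.backward()
            opt_sol.step()
    return u_net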

6 Numerical Examples

In this section, the proposed SelectNet model is tested on several high-dimensional PDE examples, including elliptic, parabolic, and hyperbolic problems, both linear and nonlinear. The basic model (7) is also tested for comparison. For the basic and SelectNet models, we choose the feedforward architecture for the solution network and the feedforward architecture with the ReLU activation for the selection network. AdaGrad [11] is employed to solve the optimization problems, with learning rates

(37)

for the selection network, and

(38)

for the solution network, where the learning rates are piecewise constant over equispaced segments of the total number of iterations. Other parameters used in the model and algorithm are listed in Table 1.

the dimension of the problem
the width of each layer in the solution network
the width of each layer in the selection network
the depth of the solution network
the depth of the selection network
the upper bound of the selection network
the lower bound of the selection network
number of iterations in the optimization
number of updates of the selection network in each iteration
number of updates of the solution network in each iteration
number of training points inside the domain in each iteration
number of training points on the domain boundary in each iteration
penalty parameter to uniform the selection network
summation weight of the boundary residual
Table 1: Parameters in the model and algorithm.

We take the relative error at uniformly distributed sampling points as the metric to evaluate the accuracy, which is formulated as

(39)
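A sketch of this metric, assuming the exact solution is available as a callable and the error is measured in the discrete relative L2 sense over the sampled test points; both assumptions are illustrative.

import torch

def relative_error(u_net, u_exact, x_test):
    # discrete relative L2 error over uniformly sampled test points
    with torch.no_grad():
        ref = u_exact(x_test)
        diff = u_net(x_test).squeeze(-1) - ref
        return torch.sqrt((diff ** 2).sum() / (ref ** 2).sum())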

6.1 Elliptic Equations with Low-Regularity Solutions

First, let us consider the nonlinear elliptic equation inside a bounded domain

(40)

with . In this case, we specify the exact solution by

(41)

whose first derivative is singular at the origin and whose third derivative is singular on the boundary. Problems in two different dimensions are solved. The implementation parameters and the minimal errors obtained by the basic and SelectNet models are listed in Tables 2 and 3. Also, the curves of error decay versus iterations are shown in Fig. 1. From these results, it is observed that both models work for the elliptic problem, but SelectNet clearly outperforms the basic model: its error decay is more numerically stable and reaches a much lower level. The advantage of SelectNet is more significant when the dimension is higher. Besides, we present in Fig. 2 the following surfaces

  • the -slice of numerical solution

  • the -slice of solution error

  • the -slice of selection network

for $d=20$. It shows that the numerical error accumulates near the origin due to the low regularity there. On the other hand, the selection network attains its peak at the origin, implying that the weighting of training points is concentrated near the origin, where the error is concentrated.

Additionally, to test the robustness of the models, we run them on one of the cases with various parameter settings. The results are listed in Table 4. It is seen that the basic model does not converge if fewer sample points are chosen in each iteration, while SelectNet always converges (here "converge" means that the errors decrease overall as the iterations proceed). This reflects the robustness of SelectNet compared with the classical basic model.

Parameters Dimensions SelectNet Basic
/
/
/
/
/
Table 2: The parameters of the implementation for various models in the nonlinear elliptic example.
Dimension SelectNet Basic
Table 3: errors obtained by various models in the nonlinear elliptic example.
Figure 1: Errors vs. iterations in the nonlinear elliptic example (red: SelectNet model; blue: basic model).
Figure 2: The -surfaces of the numerical solution, solution error and selection network by SelectNet (d=20) in the nonlinear elliptic example.
Model Minimal error If converges
Basic 10000 1 1000 No
2000 No
5000 Yes
10000 Yes
100 100 1000 No
2000 No
5000 No
10000 Yes
SelectNet 10000 1 1000 Yes
2000 Yes
5000 Yes
10000 Yes
100 100 1000 Yes
2000 Yes
5000 Yes
10000 Yes
Table 4: The minimal errors and the convergence status obtained by the basic and SelectNet models with various parameters in the nonlinear elliptic example.

6.2 Parabolic Equations

In this section, SelectNet is tested on an initial boundary value problem of the parabolic equation, which is given by

(42)

with . Two examples are presented in this section.

First, we take the case of the heat equation, and the exact solution is set as

(43)

In the SelectNet model, time-discretization schemes are not utilized. Instead, we regard the temporal variable as an extra spatial variable of the problem. Hence the problem domain is an analogue of a hyper-cylinder, and the "boundary" value is specified on the bottom (the initial condition) and on the lateral side. As in the preceding examples, the parameters and the computed results of the basic and SelectNet models in the two tested dimensions are listed in Tables 5 and 6. It is clearly shown that SelectNet still obtains smaller errors than the basic model under the same parameter setting. In Fig. 3 the curves of error decay are presented, and in Fig. 4 the surfaces of the numerical solution, solution error, and selection network for $d=20$ are displayed.
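To illustrate how the space-time sampling could be organized, the sketch below draws interior points in the hyper-cylinder and "boundary" points on its bottom and lateral side; the unit-ball spatial domain, the time horizon T, and all names are assumptions of the sketch, not the paper's actual setup.

import torch

def sample_ball(n, d):
    # uniform points in the d-dimensional unit ball
    direction = torch.randn(n, d)
    direction = direction / direction.norm(dim=1, keepdim=True)
    radius = torch.rand(n, 1) ** (1.0 / d)
    return radius * direction

def sample_interior(n, d, T=1.0):
    # (x, t) uniformly in the space-time cylinder
    t = T * torch.rand(n, 1)
    return torch.cat([sample_ball(n, d), t], dim=1)

def sample_boundary(n, d, T=1.0):
    # half of the points on the bottom t = 0 (initial condition),
    # half on the lateral side |x| = 1 (spatial boundary condition)
    bottom = torch.cat([sample_ball(n // 2, d), torch.zeros(n // 2, 1)], dim=1)
    side_x = torch.randn(n - n // 2, d)
    side_x = side_x / side_x.norm(dim=1, keepdim=True)
    side_t = T * torch.rand(n - n // 2, 1)
    return torch.cat([bottom, torch.cat([side_x, side_t], dim=1)], dim=0)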

Parameters Dimensions SelectNet Basic
/
/
/
/
/
Table 5: The parameters of the implementation for various models in the heat example.
Dimension SelectNet Basic
Table 6: errors obtained by various models in the heat example.
Figure 3: Errors vs. iterations in the heat example (red: SelectNet model; blue: basic model).
Figure 4: The -surfaces of the numerical solution, solution error and selection network by SelectNet (d=20) in the heat example.

Second, we consider an operator with a non-constant coefficient, and the exact solution is set as

(44)

Note that the exact solution has only limited smoothness on two singular slices. The parameters, the obtained errors, and the error-decay curves are shown in Tables 7 and 8 and Fig. 5. The strong singularity of the solution makes this problem much more difficult than the previous ones; correspondingly, we take more sampling points in each iteration. It can be seen in the figure that the error curves of SelectNet decay faster and to lower levels than those of the basic model in both tested dimensions. In particular, in the higher-dimensional case the basic model cannot reduce the error to a comparable level, while SelectNet still converges to the solution with a small error. Moreover, the surfaces of the results for $d=20$ are shown in Fig. 6, from which we observe that the solution error is mainly distributed near the singular slices.

Parameters Dimensions SelectNet Basic
/
/
/
/
/
Table 7: The parameters of the implementation for various models in the second parabolic example.
Dimension SelectNet Basic
Table 8: errors obtained by various models in the second parabolic example.
Figure 5: Errors vs. iterations in the second parabolic example (red: SelectNet model; blue: basic model).
Figure 6: The -surfaces of the numerical solution, solution error and selection network by SelectNet (d=20) in the second parabolic example.

6.3 Hyperbolic Equations

In the last example, we test SelectNet by solving the initial boundary value problem of the hyperbolic (wave) equation, which is given by

(45)

where the exact solution is set as

(46)

As in the parabolic example, we list the parameters and the obtained errors for the two tested dimensions in Tables 9 and 10, which demonstrate that SelectNet still converges faster than the basic model (especially when the dimension is higher). Also, we display the curves of error decay in Fig. 7 and the surfaces of the results in Fig. 8. The results for the parabolic and hyperbolic equations imply that our proposed model works successfully for time-dependent problems.