1 Introduction
High-dimensional partial differential equations (PDEs) are important tools in physical, financial, and biological models [30, 15, 49, 16, 46]. However, developing numerical methods for high-dimensional PDEs has been a challenging task due to the curse of dimensionality in the discretization of the problem. For example, in traditional methods such as finite difference methods and finite element methods, O(N^d) degrees of freedom are required for a d-dimensional problem if we set N grid points or basis functions in each direction. Even if d becomes only moderately large, the exponential growth in the dimension makes traditional methods computationally intractable.

Recent research on the approximation theory of deep neural networks (DNNs) shows that deep network approximation is a powerful tool for mesh-free function parametrization. The research on the approximation theory of neural networks traces back to the pioneering work [8, 19, 1]
on the universal approximation of shallow networks with sigmoid activation functions. More recent research has focused on the approximation rates of DNNs for various function spaces in terms of the number of network parameters, showing that deep networks are more powerful than shallow networks in terms of approximation efficiency. For example, approximation rates have been established for smooth functions [34, 32, 47, 14, 36, 45, 13], piecewise smooth functions [39], band-limited functions [37], and continuous functions [48, 42, 41]. The reader is referred to [41] for an explicit characterization of the approximation error for networks with arbitrary width and depth.

In particular, deep network approximation can lessen or overcome the curse of dimensionality under certain circumstances, making it an attractive tool for solving high-dimensional problems. For a sufficiently smooth function that is the integral transform of a high-dimensional (essentially) compactly supported function with a one-dimensional integral kernel, the absence of the curse of dimensionality can be shown by establishing the connection of network approximation with Monte Carlo sampling, or equivalently the law of large numbers [1, 37]. For functions in the Korobov space, connecting deep network approximation and sparse grid approximation shows that deep network approximation can lessen the curse of dimensionality [36]. For general continuous functions, [38] proves that deep network approximation can significantly reduce the curse of dimensionality through the Kolmogorov-Arnold superposition theorem. Finally, if the approximation error is only of concern on a low-dimensional manifold, there is no curse of dimensionality for deep network approximation [6, 4, 41].

As an efficient function parametrization tool, neural networks have been applied to solve PDEs via various approaches. Early work in [29] applies neural networks to approximate PDE solutions defined on grid points. Later, in [10, 28], DNNs are employed to approximate solutions on the whole domain, and PDEs are solved by minimizing the discrete residual at prescribed collocation points. DNNs coupled with boundary governing terms by design can satisfy boundary conditions [35]. Nevertheless, designing boundary governing terms is usually difficult for complex geometries. Another approach to enforcing boundary conditions is to add the boundary residual error to the loss function as a penalty term and minimize it together with the PDE residual error [17, 27]. The latter technique is in the same spirit as residual methods in finite element methods and is more convenient to implement. Therefore, it has been widely utilized for PDEs on complex domains. However, network computation was usually expensive, limiting the applications of network-based PDE solvers. Thanks to the development of GPU-based parallel computing over the last two decades, which has greatly boosted network computation, network-based PDE solvers have been revisited recently and have become a popular tool, especially for high-dimensional problems [43, 3, 50, 31, 12, 18, 2, 20, 21, 5]. Nevertheless, most network-based PDE solvers suffer from robustness issues: their convergence is slow and might not be guaranteed even within a simple class of PDEs.

To ease the issue above, we introduce a novel self-paced learning framework, SelectNet, to adaptively choose training samples in the residual model. Self-paced learning [26] is a recently proposed learning technique that chooses a subset of the training samples for actual training over time. Specifically, for a training data set with n samples, self-paced learning uses a binary vector v to indicate whether each training sample should be included in the current training stage. The philosophy of self-paced learning is to simulate the learning style of human beings, which tends to learn easier aspects of a learning task first and deal with more complicated samples later. Based on self-paced learning, a novel technique for selected sampling is put forward, which uses a selection neural network instead of the 0-1 selection vector v. Hence, it helps learning avoid redundant training information and speeds up the convergence of learning outcomes. This idea is further improved in [22] by introducing a DNN to select training data for image classification. Among similar works, a state-of-the-art algorithm named SelectNet is proposed in [33] for image classification, especially for imbalanced data problems. Based on the observation that samples near the singularity of the PDE solution are rare compared to samples from the regular part, we extend SelectNet [33] to network-based residual models, especially for PDE solutions with certain irregularity. As we shall see later, numerical results show that the proposed model is competitive with the traditional (basic) residual model for analytic solutions, and it outperforms the basic model for low-regularity solutions in terms of convergence speed.

The organization of this paper is as follows. In Section 2, we introduce residual methods and formulate the corresponding optimization model. In Section 3, we present the SelectNet model in detail. In Section 4, we put forward error estimates for the basic and SelectNet models. In Section 5, we discuss the network implementation of the proposed model. In Section 6, we present ample numerical experiments on various equations to validate our model. We conclude with some remarks in the final section.
2 Residual Methods for PDEs
In this work, we aim at solving the following (initial) boundary value problems, given a bounded domain Ω ⊂ R^d:

elliptic equations

(1)  D u(x) = f(x) in Ω,  B u(x) = g(x) on ∂Ω;

parabolic equations

(2)  ∂u/∂t(x, t) − D u(x, t) = f(x, t) in Ω × (0, T],  B u(x, t) = g(x, t) on ∂Ω × (0, T],  u(x, 0) = h₁(x) in Ω;

hyperbolic equations

(3)  ∂²u/∂t²(x, t) − D u(x, t) = f(x, t) in Ω × (0, T],  B u(x, t) = g(x, t) on ∂Ω × (0, T],  u(x, 0) = h₁(x), ∂u/∂t(x, 0) = h₂(x) in Ω;

where u is the solution function; f, g, h₁, h₂ are given data functions; D is a spatial differential operator concerning the derivatives of u with respect to x; and B is a boundary operator specifying a Dirichlet, Neumann, or Robin boundary condition.

In this method, the temporal variable t is regarded as an extra spatial coordinate and is not treated differently from x. For simplicity, the PDEs in (1)-(3) are unified in the following form:

(4)  D u(x) = f(x), x ∈ Ω,  B u(x) = g(x), x ∈ Γ,

where x includes the spatial variable and possibly the temporal variable t; D u = f represents a generic PDE; B u = g represents the governing conditions, including the boundary condition and possibly the initial condition; and Ω, Γ are the corresponding domains of the two equations.
Now we seek a neural network u(x; θ) with parameters θ to approximate the solution of the PDE (4). Note that the residuals for the PDE and the governing conditions can be written as

(5)  r₁(x; θ) = D u(x; θ) − f(x),  r₂(x; θ) = B u(x; θ) − g(x).

One can solve the PDE by searching for the optimal parameters of the network that minimize the weighted sum of the squares of these two residuals, i.e.,

(6)  min_θ ‖r₁(·; θ)‖²_Ω + λ ‖r₂(·; θ)‖²_Γ,

where ‖·‖ is usually the L² norm and λ > 0 is a parameter weighting the sum, e.g.,

(7)  min_θ (1/|Ω|) ∫_Ω |D u(x; θ) − f(x)|² dx + λ (1/|Γ|) ∫_Γ |B u(x; θ) − g(x)|² dx.
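In practice, the integrals in (7) are estimated by Monte Carlo averages over randomly sampled collocation points. The sketch below (NumPy; the 1D Poisson problem, trial function, and all names are illustrative assumptions, not the paper's setup) evaluates such an empirical residual loss.

```python
import numpy as np

def empirical_residual_loss(Du, f, Bu, g, x_in, x_bd, lam=1.0):
    """Monte Carlo estimate of (7): mean squared PDE residual on interior
    samples plus lam times mean squared boundary residual."""
    pde_term = np.mean((Du(x_in) - f(x_in)) ** 2)
    bdry_term = np.mean((Bu(x_bd) - g(x_bd)) ** 2)
    return pde_term + lam * bdry_term

# Toy check: u(x) = sin(pi x) solves -u'' = pi^2 sin(pi x) with u(0) = u(1) = 0.
Du = lambda x: np.pi ** 2 * np.sin(np.pi * x)   # -u'' for the trial function
f  = lambda x: np.pi ** 2 * np.sin(np.pi * x)
Bu = lambda x: np.sin(np.pi * x)                # trace of the trial function
g  = lambda x: np.zeros_like(x)

x_in = np.random.rand(1000)          # interior samples in (0, 1)
x_bd = np.array([0.0, 1.0])          # boundary samples
loss = empirical_residual_loss(Du, f, Bu, g, x_in, x_bd, lam=1.0)
```

Since the trial function here is the exact solution, the computed loss is zero up to floating-point error; a generic network would yield a positive value to be driven down by training.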
3 SelectNet Model
The networkbased residual model has been applied to solve certain highdimensional PDEs successfully. However, its convergence is slow and might not be guaranteed even within a simple class of PDEs. To ease this issue, we introduce a novel selfpaced learning framework, SelectNet, to adaptively choose training samples in the residual model. The basic philosophy is to mimic the human cognitive process for more efficient learning: learning first from easier examples and slowly exploring more complicated ones. The proposed model is related to selected sampling [7, 23]
, an important tool of deep learning for computer science applications. Nevertheless, the effectiveness of selected sampling in scientific computing has not been fully explored yet.
In particular, a selection network φ_s(x; θ_s) (the subscript s stands for "selection") and the PDE solution network u(x; θ) are trained simultaneously; the selection network adaptively weights the training samples of the solution network, achieving the goal of self-paced learning. φ_s acts as a "mentor" helping to decide whether a sample is important enough to train the "student" network u. The "mentor" can significantly reduce redundant training information and help to speed up the convergence. This idea originates from self-paced learning [25] and is further improved in [22] by introducing a DNN to select training data for image classification. Among similar works, a state-of-the-art algorithm named SelectNet was proposed in [33] for image classification, especially for imbalanced data problems. Based on the observation that samples near the singularity of the PDE solution are rare compared to samples from the regular part, we extend SelectNet [33] to network-based residual models, especially for PDE solutions with certain irregularity.
Originally in image classification, for a training data set {(x_i, y_i)}_{i=1}^n, self-paced learning uses a vector v ∈ {0, 1}^n to indicate whether or not each training sample should be included in the current training stage (v_i = 1 if the i-th sample is included in the current iteration). The overall target function including v is

(8)  min_{w, v ∈ {0,1}^n} Σ_{i=1}^n v_i ℓ(y_i, F(x_i; w)) − λ Σ_{i=1}^n v_i,

where ℓ(y_i, F(x_i; w)) denotes the loss function of a DNN F(·; w) for classifying a sample x_i to its label y_i. When this model is relaxed to v ∈ [0, 1]^n and the alternative convex search is applied to solve the relaxed optimization, a straightforward derivation reveals a rule for the optimal value of each entry v_i in the k-th iteration:

(9)  v_i^(k) = 1 if ℓ(y_i, F(x_i; w^(k))) < λ, and v_i^(k) = 0 otherwise.

A sample with a smaller loss than the threshold λ is treated as an "easy" sample and will be selected in training. When computing w^(k+1) with v^(k) fixed, the classifier is trained only on the selected "easy" samples. This mechanism helps to reduce the generalization error in image classification, where the training data distribution usually differs from the test data distribution.
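The selection rule (9) reduces to a simple thresholding step; a minimal sketch (with hypothetical names) in NumPy:

```python
import numpy as np

def self_paced_select(losses, lam):
    """Rule (9): v_i = 1 if the i-th per-sample loss is below the threshold
    lam (an "easy" sample), otherwise v_i = 0."""
    return (losses < lam).astype(float)

losses = np.array([0.2, 1.5, 0.05, 3.0])
v = self_paced_select(losses, lam=1.0)   # -> [1., 0., 1., 0.]
```

A model would then be trained only on the samples with v_i = 1, after which the losses and the selection vector are recomputed.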
When solving high-dimensional PDEs, the training and test data distributions are the same and there is no limitation on sampling. Hence, the desired mechanism in this case is to: 1) introduce no bias in choosing samples, for fast convergence in the early stage of training; 2) exclude "easy" samples and focus on "difficult" samples, for better convergence in the later stage of training. In particular, the selection network should satisfy the following requirements. First of all, as a weight function, it is set to be bounded and uniformly scaled. Second, it should generate no bias towards samples in the early stage of training. Third, its role is to put higher weights on samples with larger pointwise residual errors in the later stage of training.
In practice, two particular selection networks φ_s1(x; θ_s1) and φ_s2(x; θ_s2) are introduced separately for the PDE residual and the condition residual. They are required to be bounded as

(10)  m ≤ φ_s1(x; θ_s1) ≤ M for all x ∈ Ω,

(11)  m ≤ φ_s2(x; θ_s2) ≤ M for all x ∈ Γ,

for constants 0 < m < 1 < M, and uniformly scaled as

(12)  (1/|Ω|) ∫_Ω φ_s1(x; θ_s1) dx = 1,  (1/|Γ|) ∫_Γ φ_s2(x; θ_s2) dx = 1.
Based on the residual model in (7) and the requirements above, we have the following SelectNet framework for the residual method:

(13)  min_θ max_{θ_s1, θ_s2} { (1/|Ω|) ∫_Ω φ_s1(x; θ_s1) |D u(x; θ) − f(x)|² dx + λ (1/|Γ|) ∫_Γ φ_s2(x; θ_s2) |B u(x; θ) − g(x)|² dx − (1/ε) [ ((1/|Ω|) ∫_Ω φ_s1(x; θ_s1) dx − 1)² + ((1/|Γ|) ∫_Γ φ_s2(x; θ_s2) dx − 1)² ] },

where the last penalty term with a small penalty parameter ε is to enforce (12) approximately. In the early stage of training, φ_s1 and φ_s2 can be initialized as constant one; they then have slowly varying, smooth configurations with high probability, satisfying the second requirement above. In the later stage of training, φ_s1 and φ_s2 have been optimized by the inner max problem to choose difficult samples, so the third requirement above is satisfied. Besides, the conditions in (10)-(11) hold automatically if the last layer of activation functions of the selection networks is bounded and the network output is properly rescaled and shifted, as we shall discuss in the next section.

4 Error Estimates
In this section, theoretical analysis is presented to show that the solution errors of the basic and SelectNet models are bounded by the loss function (residual errors). Specifically, we take the elliptic PDE with a Neumann boundary condition as an example; the conclusion can be generalized to other well-posed PDEs by similar arguments. Consider

(14)  −Δu + u = f in Ω,  ∂u/∂n = g on ∂Ω,

where Ω is an open subset of R^d whose boundary ∂Ω is smooth, and f ∈ L²(Ω), g ∈ L²(∂Ω) are given functions.
Suppose the problem (14) admits a unique solution u* ∈ H²(Ω). Also, suppose the variational optimization problem

(15)  min_{φ ∈ H²(Ω)} J(φ) := ‖−Δφ + φ − f‖²_{L²(Ω)} + λ ‖∂φ/∂n − g‖²_{L²(∂Ω)}

with admissible set H²(Ω) has a feasible solution ū satisfying

(16)  J(ū) ≤ ε²,

then

(17)  ‖ū − u*‖_{H¹(Ω)} ≤ Cε,

where C is a constant only depending on Ω and λ. Furthermore, let F be a subset of H²(Ω) which contains u*, and let S be a subset of {φ_s ∈ C(Ω̄) : m ≤ φ_s ≤ M} which contains the constant function 1. Suppose the variational optimization problem

(18)  min_{φ ∈ F} max_{φ_s ∈ S} J_s(φ, φ_s) := ∫_Ω φ_s |−Δφ + φ − f|² dx + λ ∫_∂Ω φ_s |∂φ/∂n − g|² ds

has a feasible solution ū_s satisfying

(19)  max_{φ_s ∈ S} J_s(ū_s, φ_s) ≤ ε²,

then (17) with ū_s replacing ū also holds.
Proof. Let u* be the true solution of (14), ū the feasible solution in (16), and e = ū − u*. Starting from the identity

(20)  −Δe + e = −Δū + ū − f in Ω,  ∂e/∂n = ∂ū/∂n − g on ∂Ω,

we multiply e to both sides of (20) and integrate over Ω. Since ū, u* ∈ H²(Ω), by integration by parts it follows that

(21)  ∫_Ω (|∇e|² + |e|²) dx = ∫_Ω (−Δū + ū − f) e dx + ∫_∂Ω (∂ū/∂n − g) e ds.

Hence, by the Cauchy-Schwarz inequality,

(22)  ‖e‖²_{H¹(Ω)} ≤ ‖−Δū + ū − f‖_{L²(Ω)} ‖e‖_{L²(Ω)} + ‖∂ū/∂n − g‖_{L²(∂Ω)} ‖e‖_{L²(∂Ω)}.

By the trace theorem, ‖e‖_{L²(∂Ω)} ≤ C₀ ‖e‖_{H¹(Ω)} for some C₀ which only depends on Ω; then we have

(23)  ‖e‖_{H¹(Ω)} ≤ ‖−Δū + ū − f‖_{L²(Ω)} + C₀ ‖∂ū/∂n − g‖_{L²(∂Ω)} ≤ C √(J(ū))

with C = √2 max(1, C₀ λ^{−1/2}). Finally, by the hypothesis (16), (17) directly follows from (23).

Similarly, since the constant function 1 belongs to S, taking φ_s = 1 in (19) yields J(ū_s) ≤ ε², and we obtain the same estimate for ū_s. ∎

By using the triangle inequality, we can conclude that the solutions of the basic and SelectNet models are equivalent as long as the loss functions are minimized sufficiently. That is, under the hypothesis of Theorem 4.1,

(24)  ‖ū − ū_s‖_{H¹(Ω)} ≤ 2Cε.
5 Network Implementation
5.1 Network Architecture
The proposed framework is independent of the choice of DNNs. Advanced network design may improve the accuracy and convergence of the proposed framework, which would be interesting for future work.
In this paper, feedforward neural networks will be repeatedly applied. Let φ(x; θ) denote such a network with input x ∈ R^d and parameters θ; it is defined recursively as follows:

(25)  h⁰ = x,  hˡ = σ(Wˡ hˡ⁻¹ + bˡ) for l = 1, …, L−1,  φ(x; θ) = W^L h^{L−1} + b^L,

where σ is an application-dependent nonlinear activation function, and θ consists of all the weights and biases satisfying

(26)  W¹ ∈ R^{N×d},  Wˡ ∈ R^{N×N} for l = 2, …, L−1,  W^L ∈ R^{1×N},  bˡ ∈ R^N for l = 1, …, L−1,  b^L ∈ R.

The number N is called the width of the network and L is called the depth.
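As a concrete illustration, the recursion (25) with parameter shapes as in (26) takes only a few lines of NumPy (the random initialization scale and the tanh activation are illustrative choices, not prescribed by the text):

```python
import numpy as np

def init_params(d, width, depth, rng):
    """Random weights and biases matching (26): the first layer maps
    R^d -> R^N, hidden layers map R^N -> R^N, the output layer R^N -> R."""
    sizes = [d] + [width] * (depth - 1) + [1]
    return [(rng.standard_normal((m, n)) / np.sqrt(n), np.zeros(m))
            for n, m in zip(sizes[:-1], sizes[1:])]

def fnn(x, params, sigma=np.tanh):
    """Recursion (25): h^0 = x, h^l = sigma(W^l h^{l-1} + b^l),
    followed by a linear output layer."""
    h = x
    for W, b in params[:-1]:
        h = sigma(W @ h + b)
    W, b = params[-1]
    return (W @ h + b)[0]

rng = np.random.default_rng(0)
params = init_params(d=10, width=50, depth=4, rng=rng)
y = fnn(np.ones(10), params)   # scalar network output
```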
For simplicity, we deploy a feedforward neural network as the solution network u(x; θ) that approximates the solution of the PDE. As for the selection network φ_s(x; θ_s) introduced in Section 3, since it is required to be bounded in [m, M], it can be defined via

(27)  φ_s(x; θ_s) = (M − m) · S(φ̂(x; θ_s)) + m,

where S(t) = 1/(1 + e^{−t}) is the sigmoidal function, and φ̂ is a generic network, e.g., a feedforward neural network with the ReLU activation max(0, t).
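A minimal sketch of the rescaling (27), which enforces the bounds (10)-(11) by construction (the bounds m, M and the stand-in raw network are assumptions for illustration):

```python
import numpy as np

def selection_net(x, raw_net, m=0.5, M=2.0):
    """Eq. (27): squash a generic network's output through the sigmoid
    and rescale so the result always lies in [m, M]."""
    s = 1.0 / (1.0 + np.exp(-raw_net(x)))   # sigmoid, values in (0, 1)
    return (M - m) * s + m

raw = lambda x: 10.0 * np.sin(x)             # stand-in for a ReLU network
vals = selection_net(np.linspace(-5, 5, 101), raw)
```

Whatever the raw network outputs, the selection weights stay inside the prescribed interval.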
5.2 Special Network for Dirichlet Boundary Conditions
In the case of homogeneous Dirichlet boundary conditions, it is worth mentioning a special network design that satisfies the boundary condition automatically as discussed in [28, 3].
Let us focus on the boundary value problem to introduce this special network structure. It is straightforward to generalize this idea to the case of an initial boundary value problem, so we omit that discussion. Assume a homogeneous Dirichlet boundary condition
(28)  u(x) = 0,  x ∈ ∂Ω,

then a solution network automatically satisfying the condition above can be constructed by

(29)  u(x; θ) = h(x) φ(x; θ),

where φ(x; θ) is a generic network as in (25), and h is a specifically chosen function such that h = 0 on ∂Ω. For example, if Ω is the d-dimensional unit ball, then h can take the form

(30)  h(x) = 1 − ‖x‖².

For another example, if Ω is the d-dimensional cube [−1, 1]^d, then h can take the form

(31)  h(x) = ∏_{k=1}^d (1 − x_k²).
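A sketch of the construction (29) with the two example choices of h above (the cube is assumed to be [−1, 1]^d, and the network φ is replaced by a simple stand-in):

```python
import numpy as np

def h_ball(x):
    """Vanishes on the unit sphere: h(x) = 1 - |x|^2 (cf. (30))."""
    return 1.0 - np.sum(x ** 2)

def h_cube(x):
    """Vanishes on the boundary of the cube [-1, 1]^d (assumed domain)."""
    return np.prod(1.0 - x ** 2)

def u(x, phi, h):
    """Eq. (29): a trial solution that satisfies u = 0 on the boundary for
    any generic network phi, since h vanishes there."""
    return h(x) * phi(x)

phi = lambda x: np.sum(x) + 1.0            # stand-in for a network
x_bd = np.array([1.0, 0.0, 0.0])           # a point on the unit sphere
val = u(x_bd, phi, h_ball)                 # -> 0.0
```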
Since the boundary condition is always fulfilled, it suffices to solve the min-max problem

(32)  min_θ max_{θ_s1} { (1/|Ω|) ∫_Ω φ_s1(x; θ_s1) |D u(x; θ) − f(x)|² dx − (1/ε) ((1/|Ω|) ∫_Ω φ_s1(x; θ_s1) dx − 1)² }

to identify the best solution network u(x; θ).
5.3 Derivatives of Networks
Note that the evaluation of the optimization problem in (13) involves derivatives of the network u(x; θ) with respect to x. When the activation function of the network is differentiable, the network is differentiable, and the derivatives with respect to x can be evaluated efficiently via the back-propagation algorithm. However, the network we adopt in this paper is not differentiable everywhere. Hence, numerical differentiation will be utilized to estimate the derivatives of networks. For example, for the elliptic operator D = Δ, Δu(x; θ) can be estimated by the second-order central difference formula

(33)  Δu(x; θ) ≈ Σ_{i=1}^d [ u(x + δe_i; θ) − 2u(x; θ) + u(x − δe_i; θ) ] / δ²,

up to an error of O(δ²), where e_i is the i-th standard basis vector and δ > 0 is the step size.
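A sketch of the coordinate-wise central difference estimate of the Laplacian in (33) (the step size delta is a tunable assumption):

```python
import numpy as np

def laplacian_fd(u, x, delta=1e-3):
    """Second-order central differences (33): sum over coordinates of
    (u(x + delta e_i) - 2 u(x) + u(x - delta e_i)) / delta^2,
    accurate up to O(delta^2) for smooth u."""
    d = x.size
    lap = 0.0
    for i in range(d):
        e = np.zeros(d)
        e[i] = delta
        lap += (u(x + e) - 2.0 * u(x) + u(x - e)) / delta ** 2
    return lap

# Check on u(x) = |x|^2, whose exact Laplacian is 2d.
u = lambda x: np.sum(x ** 2)
x = np.array([0.3, -0.7, 1.2])
approx = laplacian_fd(u, x)     # approximately 6.0 for d = 3
```

In practice delta balances truncation error O(delta^2) against floating-point cancellation in the divided differences.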
5.4 Network Training
Once the networks have been set up, the rest is to train them to solve the min-max problem in (13). The stochastic gradient descent (SGD) method or its variants (e.g., Adam [24]) is an efficient tool to solve this problem numerically. Although the convergence of SGD for min-max problems is still an active research topic [40, 9, 44], empirical success shows that SGD can provide a good approximate solution.

Before completing the algorithm description of SelectNet, let us introduce the key setup of SGD and summarize it in Algorithm 1 below. In each training iteration, we first draw uniformly distributed training points {x_i}_{i=1}^{N₁} ⊂ Ω and {y_j}_{j=1}^{N₂} ⊂ Γ, and define the empirical loss at these training points as

(34)  L(θ, θ_s1, θ_s2) = (1/N₁) Σ_{i=1}^{N₁} φ_s1(x_i; θ_s1) |D u(x_i; θ) − f(x_i)|² + λ (1/N₂) Σ_{j=1}^{N₂} φ_s2(y_j; θ_s2) |B u(y_j; θ) − g(y_j)|² − (1/ε) [ ((1/N₁) Σ_{i=1}^{N₁} φ_s1(x_i; θ_s1) − 1)² + ((1/N₂) Σ_{j=1}^{N₂} φ_s2(y_j; θ_s2) − 1)² ].

Next, the selection parameters (θ_s1, θ_s2) can be updated by gradient ascent via

(35)  θ_s ← θ_s + τ_s ∇_{θ_s} L(θ, θ_s1, θ_s2),

and θ can be updated by gradient descent via

(36)  θ ← θ − τ ∇_θ L(θ, θ_s1, θ_s2),

with step sizes τ_s and τ. Note that training points are randomly renewed in each iteration. In fact, for the same set of training points in each iteration, the updates (35) and (36) can be performed several times each.
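The alternating updates (35)-(36) perform gradient descent-ascent on a saddle-point problem. The toy objective below (not the PDE loss; all names are illustrative) shows the mechanics: ascend in the inner variable, descend in the outer one.

```python
def grad_descent_ascent(steps=500, tau=0.1, tau_s=0.1):
    """Alternating updates in the spirit of (35)-(36) on the toy saddle
    problem L(theta, theta_s) = (theta - 2)^2 - (theta_s - 3)^2:
    gradient ascent in theta_s, gradient descent in theta."""
    theta, theta_s = 0.0, 0.0
    for _ in range(steps):
        theta_s += tau_s * (-2.0 * (theta_s - 3.0))  # ascent step, cf. (35)
        theta   -= tau   * ( 2.0 * (theta - 2.0))    # descent step, cf. (36)
    return theta, theta_s

theta, theta_s = grad_descent_ascent()   # -> approximately (2.0, 3.0)
```

For this decoupled quadratic the iterates converge to the saddle point (2, 3); for the actual SelectNet loss the same alternating scheme is applied with stochastic gradients over freshly sampled training points.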
6 Numerical Examples
In this section, the proposed SelectNet model is tested on several PDE examples, including elliptic/parabolic and linear/nonlinear high-dimensional problems. The basic model (7) is also tested for comparison. For both the basic and SelectNet models, we choose a feedforward architecture for the solution network and a feedforward architecture with ReLU activation for the selection network. AdaGrad [11] is employed to solve the optimization problems, with learning rates

(37)

for the selection network, and

(38)

for the solution network, where the learning rates are piecewise constant on equispaced segments of the total iterations. Other parameters used in the model and algorithm are listed in Table 1.
the dimension of the problem  

the width of each layer in the solution network  
the width of each layer in the selection network  
the depth of the solution network  
the depth of the selection network  
the upper bound of the selection network  
the lower bound of the selection network  
number of iterations in the optimization  
number of updates of the selection network in each iteration  
number of updates of the solution network in each iteration  
number of training points inside the domain in each iteration  
number of training points on the domain boundary in each iteration  
penalty parameter to uniform the selection network  
summation weight of the boundary residual 
We take the relative ℓ² error at uniformly distributed test points {x_i}_{i=1}^{N} as the metric to evaluate accuracy, which is formulated by

(39)  e = ( Σ_{i=1}^{N} |u(x_i; θ) − u*(x_i)|² )^{1/2} / ( Σ_{i=1}^{N} |u*(x_i)|² )^{1/2},

where u* denotes the exact solution.
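A sketch of this error metric (the arrays stand in for network predictions and exact solution values at the test points):

```python
import numpy as np

def relative_l2_error(u_pred, u_exact):
    """Relative l2 error (39): ||u_pred - u_exact||_2 / ||u_exact||_2
    over the sampled test points."""
    return np.linalg.norm(u_pred - u_exact) / np.linalg.norm(u_exact)

u_exact = np.array([1.0, 2.0, 2.0])
u_pred  = np.array([1.0, 2.0, 2.3])
err = relative_l2_error(u_pred, u_exact)   # -> 0.1
```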
6.1 Elliptic Equations with LowRegularity Solutions
First, let us consider the nonlinear elliptic equation inside a bounded domain
(40) 
with . In this case, we specify the exact solution by
(41) 
whose first derivative is singular at the origin and whose third derivative is singular on the boundary. The problems are solved in both dimension settings. The implementation parameters and the minimal errors obtained by the basic and SelectNet models are listed in Tables 2 and 3. Also, the curves of error decay versus iterations are shown in Fig. 1. From these results, it is observed that both models work for the elliptic problem, but SelectNet performs clearly better than the basic model: its error decay is more numerically stable and reaches a much lower level. The advantage of SelectNet is more significant in higher dimensions. Besides, we present in Fig. 2 the following surfaces

the slice of numerical solution

the slice of solution error

the slice of selection network
for the higher-dimensional case. It shows the numerical error accumulates near the origin due to the low regularity there. On the other hand, the selection network attains its peak at the origin, implying that the selected training points concentrate near the origin, where the error concentrates.
Additionally, to test the robustness of the models, we run them with various parameters. The results are listed in Table 4. It is seen that the basic model does not converge if fewer sample points are chosen in each iteration, while SelectNet always converges (here "converge" means the errors obtained over the iterations decrease overall). This reflects the robustness of SelectNet compared with the classical basic model.
Parameters  Dimensions  SelectNet  Basic 

/  
/  
/  
/  
/ 
Dimension  SelectNet  Basic 

Model  Minimal error  If converges  
Basic  10000  1  1000  No  
2000  No  
5000  Yes  
10000  Yes  
100  100  1000  No  
2000  No  
5000  No  
10000  Yes  
SelectNet  10000  1  1000  Yes  
2000  Yes  
5000  Yes  
10000  Yes  
100  100  1000  Yes  
2000  Yes  
5000  Yes  
10000  Yes 
6.2 Parabolic Equations
In this section, SelectNet is tested on an initial boundary value problem of the parabolic equation, which is given by
(42) 
with . Two examples are presented in this section.
First, we take (heat equation), and the exact solution is set by
(43) 
In the SelectNet model, time-discretization schemes are not utilized. Instead, we regard t as an extra spatial variable of the problem. Hence, the problem domain is an analog of a hyper-cylinder, and the "boundary" value is specified on the bottom and the side. As in the preceding examples, the parameters and computed results of the basic and SelectNet models are listed in Tables 5 and 6. It is clearly shown that SelectNet obtains smaller errors than the basic model under the same parameter setting. In Fig. 3 the curves of error decay are presented, and in Fig. 4 the surfaces of the numerical solution, solution error, and selection network are displayed.
Parameters  Dimensions  SelectNet  Basic 

/  
/  
/  
/  
/ 
Dimension  SelectNet  Basic 

Second, we take which has a nonconstant coefficient , and the exact solution is set by
(44) 
Note that the solution has limited smoothness at certain times. The parameters, achieved errors, and error curves are shown in Tables 7 and 8 and Fig. 5. The low regularity of the solution makes this problem much more difficult than the previous ones. Correspondingly, we take more sampling points in each iteration. It can be seen in the figure that the error curves of SelectNet decay faster and to lower levels than those of the basic model in both dimension settings. In particular, in the higher-dimensional case the basic model cannot reduce the error to a satisfactory level, while SelectNet converges to the solution with a small error. Moreover, the surfaces of the results are shown in Fig. 6, from which we can observe that the solution error is mainly distributed near the singular time slices.
Parameters  Dimensions  SelectNet  Basic 

/  
/  
/  
/  
/ 
Dimension  SelectNet  Basic 

6.3 Hyperbolic Equations
In the last example, we test SelectNet by solving the initial boundary value problem of the hyperbolic (wave) equation, which is given by
(45) 
with the exact solution set by
(46) 
As in the parabolic example, we list the parameters and obtained errors for both dimension settings in Tables 9 and 10, which demonstrate that SelectNet still converges faster than the basic model (especially in higher dimensions). Also, we display the curves of error decay in Fig. 7 and the surfaces of the results in Fig. 8. The results in the examples of parabolic and hyperbolic equations imply that our proposed model works successfully for time-dependent problems.