1 Introduction
Physics-informed neural networks (PINNs) [32] have emerged in recent years and quickly become a powerful tool for solving both forward and inverse problems of partial differential equations (PDEs) via deep neural networks (DNNs) [33, 20, 16]. PINNs embed the PDEs into the loss of the neural network using automatic differentiation. Compared with traditional numerical PDE solvers, such as the finite difference method (FDM) and the finite element method (FEM), PINNs are mesh-free and therefore highly flexible. Moreover, PINNs can easily incorporate both physics-based constraints and data measurements into the loss function. PINNs have been applied to tackle diverse problems in computational science and engineering, such as inverse problems in nano-optics and metamaterials
[1], and fluid dynamics [33], parameter estimation in systems biology
[43, 2], and problems of inverse design and topology optimization [21]. In addition to standard PDEs, PINNs have also been extended to solve other types of PDEs, including integro-differential equations [20], fractional PDEs [29], and stochastic PDEs [47]. Despite the past success, addressing a wide range of PDE problems with increasing levels of complexity can be theoretically and practically challenging, and thus many aspects of PINNs still require further improvements to achieve more accurate prediction, higher computational efficiency, and training robustness [16]. A series of extensions to the vanilla PINN have been proposed to boost the performance of PINNs from various aspects. For example, better loss functions have been discovered via meta-learning [31], and gradient-enhanced PINNs (gPINNs) have been developed to embed the gradient information of the PDE residual into the loss [44]. In PINNs, the total loss is a weighted summation of multiple loss terms corresponding to the PDE and initial/boundary conditions, and different methods have been developed to automatically tune these weights and balance the losses [39, 40, 42]. Moreover, a different weight for each loss term could be set at every training point [24, 9, 21, 19]. For problems in a large domain, decomposition of the spatio-temporal domain accelerates the training of PINNs and improves their accuracy [26, 34, 15]. For time-dependent problems, it is usually helpful to first train PINNs within a short time domain and then gradually expand the time intervals of training until the entire time domain is covered [41, 17, 23, 11, 38]. In addition to these general methods, other problem-specific techniques have also been developed, e.g., enforcing Dirichlet or periodic boundary conditions exactly by constructing special neural network architectures [18, 5, 21].
PINNs are mainly optimized against the PDE loss, which guarantees that the trained network is consistent with the PDE to be solved. The PDE loss is evaluated at a set of scattered residual points. Intuitively, the effect of residual points on PINNs is similar to the effect of mesh points on FEM, and thus the location and distribution of these residual points should be highly important to the performance of PINNs. However, in previous studies on PINNs, two simple residual point sampling methods (i.e., an equispaced uniform grid and uniformly random sampling) have mainly been used, and the importance of residual point sampling has largely been overlooked.
1.1 Related work and our contributions
Different residual point sampling methods can be classified into two categories: uniform sampling and non-uniform sampling. Uniform sampling can be obtained in multiple ways. For example, we could use the nodes of an equispaced uniform grid as the residual points, or randomly sample the points according to a continuous uniform distribution in the computational domain. Although these two sampling methods are simple and widely used, alternative sampling methods may be applied. Latin hypercube sampling (LHS) [25, 36] was used in Ref. [32], and the Sobol sequence [35] was first used for PINNs in Ref. [29]. The Sobol sequence is one type of quasi-random low-discrepancy sequence; other examples include the Halton sequence [12] and the Hammersley sequence [13]. Low-discrepancy sequences usually perform better than uniformly distributed random numbers in many applications, such as numerical integration; hence, a comprehensive comparison of these methods for PINNs is warranted. However, very few comparisons [10, 3] have been performed. In this study, we extensively compared the performance of different uniform sampling methods, including (1) equispaced uniform grid, (2) uniformly random sampling, (3) LHS, (4) Sobol sequence, (5) Halton sequence, and (6) Hammersley sequence.
In supervised learning, the dataset is fixed during training, but in PINNs, we can select residual points at any location. Hence, instead of using the same residual points during training, in each optimization iteration we could select a new set of residual points, as first emphasized in Ref. [20]. While this strategy has been used in some works, it has not yet been systematically tested. Thus, in this study, we tested the performance of such a resampling strategy and, for the first time, investigated the effect of the number of residual points and the resampling period.
Uniform sampling works well for some simple PDEs, but it may not be efficient for those that are more complicated. To improve the accuracy, we could manually select the residual points in a non-uniform way, as was done in Ref. [22] for high-speed flows, but this approach is highly problem-dependent and usually tedious and time-consuming. In this study, we focus on automatic and adaptive non-uniform sampling. Motivated by adaptive mesh refinement in FEM, Lu et al. [20] proposed the first adaptive non-uniform sampling for PINNs in 2019, the residual-based adaptive refinement (RAR) method, which adds new residual points in the locations with large PDE residuals. In 2021, another sampling strategy [27] was developed, where all the residual points are resampled according to a probability density function (PDF) proportional to the PDE residual. In this study, motivated by these two ideas, we propose two new sampling strategies:

residual-based adaptive distribution (RAD), where the PDF for sampling is a nonlinear function of the PDE residual;

residual-based adaptive refinement with distribution (RAR-D), a hybrid method of RAR and RAD, i.e., the new residual points are added according to a PDF.
During the preparation of this paper, a few new studies appeared [45, 4, 7, 37, 30, 46, 14] that also proposed modified versions of RAR or PDF-based resampling. Most of these methods are special cases of the proposed RAD and RAR-D, and our methods can achieve better performance. We include a detailed comparison of these strategies in Section 2.4, after introducing the necessary notation and our newly proposed methods.
In this study, we have considered a total of 10 different sampling methods, including seven non-adaptive sampling methods (six different uniform samplings and one uniform sampling with resampling) and three adaptive sampling approaches (RAR, RAD, and RAR-D).

We compared the performance of these sampling methods for four forward problems of PDEs and investigated the effect of the number of residual points.

We also compared their performance for two inverse problems that have not yet been considered in the literature.

We performed more than 6000 simulations of PINNs to obtain all the results shown in this study.
1.2 Organization
This paper is organized as follows. In Section 2, after providing a brief overview of PINNs and different non-adaptive sampling strategies, two new adaptive non-uniform sampling strategies (RAD and RAR-D) are proposed. In Section 3, we compare the performance of 10 different methods for six different PDE problems, including four forward problems and two inverse problems. Section 4 summarizes the findings and concludes the paper.
2 Methods
This section briefly reviews physics-informed neural networks (PINNs) for solving forward and inverse partial differential equation (PDE) problems. Then, different types of uniform sampling are introduced. Next, two non-uniform residual-based adaptive sampling methods are proposed to enhance the accuracy and training efficiency of PINNs. Finally, a comparison with related methods is presented.
2.1 PINNs in solving forward and inverse PDEs
We consider a PDE parameterized by $\boldsymbol{\lambda}$ defined on a domain $\Omega$,

$$f\!\left(\mathbf{x};\frac{\partial u}{\partial x_1},\ldots,\frac{\partial u}{\partial x_d};\frac{\partial^2 u}{\partial x_1 \partial x_1},\ldots;\boldsymbol{\lambda}\right)=0,\qquad \mathbf{x}=(x_1,\ldots,x_d)\in\Omega,$$

with boundary conditions

$$\mathcal{B}(u,\mathbf{x})=0 \quad \text{on } \partial\Omega,$$

where $u(\mathbf{x})$ denotes the solution at $\mathbf{x}$. In PINNs, the initial condition is treated as a Dirichlet boundary condition.
A forward problem aims to obtain the solution $u(\mathbf{x})$ across the entire domain when the model parameters $\boldsymbol{\lambda}$ are known. In practice, the model parameters might be unknown while some observations of the solution are available, which leads to an inverse problem: discovering the parameters $\boldsymbol{\lambda}$ that best describe the observed data.
PINNs are capable of addressing both forward and inverse problems. To solve a forward problem, the solution $u(\mathbf{x})$ is represented by a neural network $\hat{u}(\mathbf{x};\boldsymbol{\theta})$. The network parameters $\boldsymbol{\theta}$ are trained to approximate the solution $u(\mathbf{x})$ by minimizing the loss function [32, 20]

$$\mathcal{L}(\boldsymbol{\theta}) = w_f \mathcal{L}_f(\boldsymbol{\theta};\mathcal{T}_f) + w_b \mathcal{L}_b(\boldsymbol{\theta};\mathcal{T}_b),$$

where

$$\mathcal{L}_f(\boldsymbol{\theta};\mathcal{T}_f) = \frac{1}{|\mathcal{T}_f|}\sum_{\mathbf{x}\in\mathcal{T}_f} \left| f\!\left(\mathbf{x};\frac{\partial \hat{u}}{\partial x_1},\ldots;\boldsymbol{\lambda}\right)\right|^2, \qquad \mathcal{L}_b(\boldsymbol{\theta};\mathcal{T}_b) = \frac{1}{|\mathcal{T}_b|}\sum_{\mathbf{x}\in\mathcal{T}_b} \left|\mathcal{B}(\hat{u},\mathbf{x})\right|^2, \tag{1}$$

and $w_f$ and $w_b$ are the weights. Two sets of points are sampled inside the domain ($\mathcal{T}_f \subset \Omega$) and on the boundary ($\mathcal{T}_b \subset \partial\Omega$). Here, $\mathcal{T}_f$ and $\mathcal{T}_b$ are referred to as the sets of "residual points".
To solve the inverse problem, an additional loss term corresponding to the misfit of the observed data at the locations $\mathcal{T}_i$, defined as

$$\mathcal{L}_i(\boldsymbol{\theta},\boldsymbol{\lambda};\mathcal{T}_i) = \frac{1}{|\mathcal{T}_i|}\sum_{\mathbf{x}\in\mathcal{T}_i} \left|\hat{u}(\mathbf{x}) - u(\mathbf{x})\right|^2,$$

is added to the loss function. The loss function is then defined as

$$\mathcal{L}(\boldsymbol{\theta},\boldsymbol{\lambda}) = w_f \mathcal{L}_f + w_b \mathcal{L}_b + w_i \mathcal{L}_i,$$

with an additional weight $w_i$. The network parameters $\boldsymbol{\theta}$ are then trained simultaneously with $\boldsymbol{\lambda}$.
For certain PDE problems, it is possible to enforce the boundary conditions directly by constructing a special network architecture [18, 5, 21, 44], which eliminates the loss term of the boundary conditions. In this study, the boundary conditions are enforced exactly and automatically. Hence, for a forward problem, the loss function is

$$\mathcal{L}(\boldsymbol{\theta}) = \mathcal{L}_f(\boldsymbol{\theta};\mathcal{T}_f).$$

For an inverse problem, the loss function is

$$\mathcal{L}(\boldsymbol{\theta},\boldsymbol{\lambda}) = \mathcal{L}_f(\boldsymbol{\theta},\boldsymbol{\lambda};\mathcal{T}_f) + w_i \mathcal{L}_i(\boldsymbol{\theta},\boldsymbol{\lambda};\mathcal{T}_i),$$

where the value of $w_i$ is chosen separately for the diffusion-reaction equation in Section 3.6 and the Korteweg-de Vries equation in Section 3.7.
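As a minimal illustration of the collocation loss $\mathcal{L}_f$ in Eq. (1), the sketch below evaluates the mean squared PDE residual for a toy heat equation with a hand-coded trial solution. This is not the paper's implementation: in an actual PINN, $\hat{u}$ is a neural network and its derivatives come from automatic differentiation; here a trial function with analytic derivatives stands in, and all function names are ours.

```python
import numpy as np

def pde_loss(points, u_t, u_xx):
    """Mean squared PDE residual over the residual points T_f, as in Eq. (1)."""
    residual = u_t(points) - u_xx(points)  # f = u_t - u_xx for the heat equation
    return np.mean(residual ** 2)

# Trial function u(x, t) = exp(-t) sin(x), which solves u_t = u_xx exactly.
u_t = lambda p: -np.exp(-p[:, 1]) * np.sin(p[:, 0])   # analytic du/dt
u_xx = lambda p: -np.exp(-p[:, 1]) * np.sin(p[:, 0])  # analytic d2u/dx2

rng = np.random.default_rng(0)
T_f = rng.uniform([0.0, 0.0], [np.pi, 1.0], size=(100, 2))  # residual points
loss = pde_loss(T_f, u_t, u_xx)  # zero for the exact solution
```

Any function that is not an exact solution produces a strictly positive loss, which is what gradient-based training of $\boldsymbol{\theta}$ drives toward zero.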
2.2 Uniformly distributed non-adaptive sampling
The training of PINNs requires a set of residual points $\mathcal{T}_f$. The sampling strategy for $\mathcal{T}_f$ plays a vital role in the accuracy and computational efficiency of PINNs. Here, we discuss several sampling approaches.
2.2.1 Fixed residual points
In most studies of PINNs, the residual points are specified at the beginning of training and never changed during the training process. Two simple sampling methods (equispaced uniform grids and uniformly random sampling) have been commonly used. Other sampling methods, such as Latin hypercube sampling (LHS) [25, 36] and the Sobol sequence [35], have also been used in some studies [32, 29, 10]. The Sobol sequence is one type of quasi-random low-discrepancy sequence. Low-discrepancy sequences are commonly used as a replacement for uniformly distributed random numbers and usually perform better in many applications such as numerical integration. This study also considers other low-discrepancy sequences, including the Halton sequence [12] and the Hammersley sequence [13].
We list the six uniform sampling methods as follows; examples of 400 points generated by the different methods are shown in Fig. 1.

Equispaced uniform grid (Grid): The residual points are chosen as the nodes of an equispaced uniform grid of the computational domain.

Uniformly random sampling (Random): The residual points are randomly sampled according to a continuous uniform distribution over the domain. In practice, this is usually done using pseudorandom number generators such as the PCG64 algorithm [28].

Latin hypercube sampling (LHS) [25, 36]: The LHS is a stratified Monte Carlo sampling method that partitions each dimension into equally probable intervals and draws one random sample from each interval.

Quasi-random low-discrepancy sequences:

Halton sequence (Halton) [12]: The Halton samples are generated by reversing the digits of the base-$p$ representation of the integers, using a different prime base $p$ for each dimension.

Hammersley sequence (Hammersley) [13]: The Hammersley sequence is the same as the Halton sequence, except in the first dimension, where the points are equispaced.

Sobol sequence (Sobol) [35]: The Sobol sequence is a base-2 digital sequence that fills the domain in a highly uniform manner.
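The "reversed base conversion" behind the Halton and Hammersley constructions can be sketched in a few lines of numpy. This is an illustrative implementation, not the one used in the paper (libraries such as SciPy's `scipy.stats.qmc` provide production versions); the Hammersley first coordinate here follows one common convention, $i/n$.

```python
import numpy as np

def radical_inverse(n, base):
    # Reverse the base-`base` digits of integer n behind the radix point,
    # e.g. n = 6 = 110_2 -> 0.011_2 = 0.375 for base 2.
    inv, denom = 0.0, 1.0
    while n > 0:
        n, digit = divmod(n, base)
        denom *= base
        inv += digit / denom
    return inv

PRIMES = [2, 3, 5, 7, 11, 13]  # one prime base per dimension

def halton(n, dim):
    # The i-th Halton point uses a distinct prime base in each dimension.
    return np.array([[radical_inverse(i, PRIMES[d]) for d in range(dim)]
                     for i in range(1, n + 1)])

def hammersley(n, dim):
    # Same as Halton, except the first coordinate is equispaced (i/n here).
    pts = np.empty((n, dim))
    pts[:, 0] = np.arange(1, n + 1) / n
    for d in range(1, dim):
        pts[:, d] = [radical_inverse(i, PRIMES[d - 1]) for i in range(1, n + 1)]
    return pts
```

For example, the first coordinates of the Halton points (base 2) are 1/2, 1/4, 3/4, 1/8, ..., which progressively fill the unit interval.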

2.2.2 Uniform points with resampling
In PINNs, a point at any location can be used to evaluate the PDE loss. Instead of using fixed residual points during training, we can select a new set of residual points after every certain number of optimization iterations [20]. The method used to sample the points each time can be any of those discussed in Section 2.2.1. We could even use different sampling methods at different times; the many possible implementations make it impossible to cover them all in this study.
In this study, we only consider Random sampling with resampling (RandomR). The RandomR method is the same as the Random method, except that the residual points are resampled periodically during training. The resampling period is also an important hyperparameter for accuracy, as we demonstrate in our empirical experiments in Section 3.2.

2.3 Non-uniform adaptive sampling

Although uniform sampling strategies have been predominantly employed, recent studies on non-uniform adaptive sampling strategies [20, 27], which improve the distribution of residual points during the training process, have demonstrated promising improvements and achieved better accuracy.
2.3.1 Residual-based adaptive refinement with greed (RAR-G)
The first adaptive sampling method for PINNs is the residual-based adaptive refinement (RAR) method proposed in Ref. [20]. RAR aims to improve the distribution of residual points during the training process by sampling more points in the locations where the PDE residual is large. Specifically, after every certain number of iterations, RAR adds new points in the locations with large PDE residuals (Algorithm 1). RAR focuses only on the points with large residuals, and thus it is a greedy algorithm. To better distinguish it from the other sampling methods, the RAR method is referred to as RAR-G in this study.
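The greedy selection step can be sketched as follows: evaluate the residual on a dense candidate set and keep the points where it is largest. This is an illustrative sketch of the selection step only (Algorithm 1 embeds it in a training loop); the function names and the toy residual are ours.

```python
import numpy as np

def rar_g_new_points(residual_fn, domain_lo, domain_hi, m,
                     n_candidates=10_000, rng=None):
    # Greedy refinement: draw a dense uniform candidate set, then keep the
    # m points where the absolute PDE residual is largest.
    rng = rng if rng is not None else np.random.default_rng()
    cand = rng.uniform(domain_lo, domain_hi, size=(n_candidates, len(domain_lo)))
    eps = np.abs(residual_fn(cand))
    top = np.argsort(eps)[-m:]  # indices of the m largest residuals
    return cand[top]

# Toy residual sharply peaked at the origin: new points cluster near x = 0.
res = lambda x: np.exp(-np.sum(x**2, axis=1) / 0.01)
new_pts = rar_g_new_points(res, [-1.0, -1.0], [1.0, 1.0], m=10,
                           rng=np.random.default_rng(0))
```

In a full training loop, `new_pts` would be appended to the current set of residual points before training continues.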
2.3.2 Residual-based adaptive distribution (RAD)
RAR-G significantly improves the performance of PINNs when solving certain PDEs whose solutions have steep gradients [20, 44]. Nevertheless, RAR-G focuses mainly on the locations where the PDE residual is largest and disregards the locations of smaller residuals. Another sampling strategy was developed later in Ref. [27], where all the residual points are resampled according to a probability density function (PDF) proportional to the PDE residual. Specifically, for any point $\mathbf{x}$, we first compute the PDE residual $\varepsilon(\mathbf{x})$, and then compute a probability as

$$p(\mathbf{x}) = \frac{\varepsilon(\mathbf{x})}{C},$$

where $C$ is a normalizing constant. Then all the residual points are sampled according to $p(\mathbf{x})$.
This approach works for certain PDEs, but as we show in our numerical examples, it does not work well in some cases. Following this idea, we propose an improved version called the residual-based adaptive distribution (RAD) method (Algorithm 2), where we use a new PDF defined as

$$p(\mathbf{x}) \propto \frac{\varepsilon^k(\mathbf{x})}{\mathbb{E}[\varepsilon^k(\mathbf{x})]} + c, \tag{2}$$

where $k \ge 0$ and $c \ge 0$ are two hyperparameters. $\mathbb{E}[\varepsilon^k(\mathbf{x})]$ can be approximated by a numerical integration method such as Monte Carlo integration. We note that the RandomR method in Section 2.2.2 is a special case of RAD obtained by choosing $k = 0$ or $c \to \infty$.
In RAD (Algorithm 2, line 4), we need to sample a set of points $\mathcal{T}_f$ according to $p(\mathbf{x})$, which can be done in a few ways. When the domain is low-dimensional, we can sample the points approximately in the following brute-force way:

Sample a set of dense points $\mathcal{S}_0$ using one of the methods in Section 2.2.1;

Compute the PDE residual $\varepsilon(\mathbf{x})$ for the points in $\mathcal{S}_0$;

Define a probability mass function over $\mathcal{S}_0$ according to Eq. (2) with a normalizing constant;

Sample a subset of points $\mathcal{T}_f$ from $\mathcal{S}_0$ according to this probability mass function.
This method is simple, easy to implement, and sufficient for many PDE problems. For more complicated cases, we can use other methods such as inverse transform sampling, Markov chain Monte Carlo (MCMC) methods, and generative adversarial networks (GANs) [8]. The two hyperparameters $k$ and $c$ in Eq. (2) control the profile of $p(\mathbf{x})$ and thus the distribution of the sampled points. We illustrate the effect of $k$ and $c$ using a simple 2D example,
(3) 
in Fig. 2. When $k = 0$, $p(\mathbf{x})$ becomes a uniform distribution. As the value of $k$ increases, more residual points with large PDE residuals are sampled. As the value of $c$ increases, the residual points incline toward a uniform distribution. Compared with RAR, RAD provides more freedom to balance the points in the locations with large and small residuals by tuning $k$ and $c$. The optimal values of $k$ and $c$ are problem-dependent, and based on our numerical results, the combination of $k = 1$ and $c = 1$ is usually a good default choice.
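The brute-force steps above can be sketched in numpy. This is an illustrative sketch under the stated assumptions (dense candidate set, expectation approximated by the sample mean), not the paper's implementation; the toy residual function is ours.

```python
import numpy as np

def rad_sample(residual_fn, dense_pts, n, k=1.0, c=1.0, rng=None):
    # Brute-force sampling from the Eq. (2) PDF,
    #   p(x) ∝ eps^k(x) / E[eps^k(x)] + c,
    # with the expectation approximated by the mean over the dense set S_0.
    rng = rng if rng is not None else np.random.default_rng()
    eps_k = np.abs(residual_fn(dense_pts)) ** k
    p = eps_k / np.mean(eps_k) + c
    p = p / p.sum()  # normalize to a probability mass function over S_0
    idx = rng.choice(len(dense_pts), size=n, replace=False, p=p)
    return dense_pts[idx]

# Toy residual peaked at the origin: large k concentrates points there,
# while k = 0 recovers uniform resampling (RandomR).
rng = np.random.default_rng(0)
S0 = rng.uniform(-1.0, 1.0, size=(5000, 2))
res = lambda x: np.exp(-np.sum(x**2, axis=1) / 0.05)
pts_adaptive = rad_sample(res, S0, 100, k=4.0, c=0.0, rng=rng)
pts_uniform = rad_sample(res, S0, 100, k=0.0, c=1.0, rng=rng)
```

Comparing the two draws, the adaptive points lie much closer to the residual peak than the uniform ones, which is exactly the behavior Fig. 2 illustrates for increasing $k$.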
2.3.3 Residual-based adaptive refinement with distribution (RAR-D)
We also propose a hybrid method of RAR-G and RAD, namely, residual-based adaptive refinement with distribution (RAR-D) (Algorithm 3). Similar to RAR-G, RAR-D repeatedly adds new points to the training dataset; similar to RAD, the new points are sampled based on the PDF in Eq. (2). We note that in the limit of large $k$, only the points with the largest PDE residuals are added, which recovers RAR-G. The optimal values of $k$ and $c$ are problem-dependent, and based on our numerical results, the combination of $k = 2$ and $c = 0$ is usually a good default choice.
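One refinement step of the hybrid scheme can be sketched as follows: draw the new points from the Eq. (2) PDF (as in RAD) and append them to the existing training set (as in RAR-G). Again an illustrative sketch with our own names, not the paper's code.

```python
import numpy as np

def rar_d_step(train_pts, residual_fn, dense_pts, m, k=2.0, c=0.0, rng=None):
    # One RAR-D step: sample m NEW points from the Eq. (2) PDF over a dense
    # candidate set and append them to the current residual points.
    rng = rng if rng is not None else np.random.default_rng()
    eps_k = np.abs(residual_fn(dense_pts)) ** k
    p = eps_k / np.mean(eps_k) + c
    p = p / p.sum()
    idx = rng.choice(len(dense_pts), size=m, replace=False, p=p)
    return np.vstack([train_pts, dense_pts[idx]])

rng = np.random.default_rng(0)
train = rng.uniform(-1.0, 1.0, size=(50, 2))          # initial residual points
cand = rng.uniform(-1.0, 1.0, size=(5000, 2))         # dense candidate set
res = lambda x: np.exp(-np.sum(x**2, axis=1) / 0.05)  # toy residual
train = rar_d_step(train, res, cand, m=10, k=2.0, c=0.0, rng=rng)
```

With $c = 0$ and increasingly large $k$, the PDF concentrates on the highest-residual candidates and the step approaches RAR-G's greedy top-$m$ selection.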
2.4 Comparison with related work
As discussed in Section 2.3, our proposed RAD and RAR-D are improved versions of the methods in Refs. [20, 27]. Here, we summarize the similarities between their methods and ours.
During the preparation of this paper, a few new papers appeared [45, 4, 7, 37, 30, 46, 14] that also proposed similar methods. Here, we summarize the similarities and differences between these studies and ours.

The method proposed by Gao et al. [7] (in December 2021) is a special case of RAD for a particular choice of $k$ and $c$.

Tang et al. [37] (in December 2021) proposed two methods: one is a special case of RAD and the other a special case of RAR-D, each for a particular choice of $k$ and $c$.

Zeng et al. [46] (in April 2022) proposed a subdomain version of RAR-G: the entire domain is divided into many subdomains, and new points are added to the several subdomains with the largest average PDE residual.

Zapf et al. [45] (in May 2022) proposed a modified version of RAR-G, in which points with small PDE residuals are removed while points with large PDE residuals are added. They show that, compared with RAR, this reduces the computational cost while keeping similar accuracy.

Hanna et al. [14] (in May 2022) proposed a method similar to RAR-D, but with a particular choice of PDF involving a small tolerance.

Similar to the work of Zapf et al., Daw et al. [4] (in July 2022) also proposed removing the points with small PDE residuals; however, instead of adding new points with large PDE residuals, they added new uniformly random sampled points.
Thus, all these methods are special cases of our proposed RAD and RAR-D (possibly with minor modifications). However, in our study, two tunable hyperparameters $k$ and $c$ are introduced. As we show in our results, the values of $k$ and $c$ can be crucial, since they significantly influence the distribution of the residual points. By choosing proper values of $k$ and $c$, our methods can outperform the other methods.
We also note that the pointwise weighting [24, 9, 21, 19] can be viewed as a special case of adaptive sampling, as follows. When the residual points are randomly sampled from a uniform distribution $U(\mathbf{x})$ and the number of residual points is large, the PDE loss in Eq. (1) can be approximated by $\mathcal{L}_f \approx \mathbb{E}_{\mathbf{x}\sim U}[\varepsilon^2(\mathbf{x})]$. If we consider a pointwise weighting function $w(\mathbf{x})$, then the loss becomes $\mathbb{E}_{\mathbf{x}\sim U}[w(\mathbf{x})\varepsilon^2(\mathbf{x})]$, while for RAD the loss is $\mathbb{E}_{\mathbf{x}\sim p}[\varepsilon^2(\mathbf{x})]$. If we choose $w(\mathbf{x})U(\mathbf{x})$ (divided by a normalizing constant) as the PDF $p(\mathbf{x})$, then the two losses are equal.
3 Results
We apply PINNs with all ten sampling methods in Section 2 to solve six forward and inverse PDE problems. In all examples, the hyperbolic tangent (tanh) is selected as the activation function. Table 1 summarizes the network width, depth, and optimizers used for each example. More details of the hyperparameters and training procedure can be found in the section for each specific problem.

Problems  Depth  Width  Optimizer

Section 3.2 Diffusion equation  4  32  Adam
Section 3.3 Burgers' equation  4  64  Adam + L-BFGS
Section 3.4 Allen-Cahn equation  4  64  Adam + L-BFGS
Section 3.5 Wave equation  6  100  Adam + L-BFGS
Section 3.6 Diffusion-reaction equation (inverse)  4  20  Adam
Section 3.7 Korteweg-de Vries equation (inverse)  4  100  Adam
For both forward and inverse problems, to evaluate the accuracy of the solution $\hat{u}$, the $L^2$ relative error

$$\frac{\|\hat{u} - u\|_2}{\|u\|_2}$$

is used. For inverse problems, to evaluate the accuracy of a predicted coefficient $\hat{\lambda}$, the relative error $|\hat{\lambda} - \lambda| / |\lambda|$ is also computed.
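As a concrete check of the metric, the $L^2$ relative error is a one-liner in numpy; this helper is illustrative, not the authors' code.

```python
import numpy as np

def l2_relative_error(u_true, u_pred):
    # ||u_pred - u_true||_2 / ||u_true||_2, evaluated on a set of test points
    return np.linalg.norm(u_pred - u_true) / np.linalg.norm(u_true)

u = np.array([3.0, 4.0])           # ||u||_2 = 5
u_hat = np.array([3.0, 5.0])       # ||u_hat - u||_2 = 1
err = l2_relative_error(u, u_hat)  # 1 / 5 = 0.2
```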
Because the result of a PINN is subject to randomness from the random sampling, network initialization, and optimization, for each case we run the same experiment at least 10 times and then compute the geometric mean and standard deviation of the errors. The code in this study is implemented using the library DeepXDE [20] and is publicly available from the GitHub repository https://github.com/lu-group/pinn-sampling.

3.1 Summary
Here, we first present a summary of the accuracy of all the methods for the forward and inverse problems, listed in Tables 2 and 3, respectively. A relatively small number of residual points is chosen to highlight the differences among the methods. In the section for each specific problem (Sections 3.2–3.7), we present the detailed analyses, including the convergence of the error during training, the convergence of the error with respect to the number of residual points, and the effects of different hyperparameters (e.g., the period of resampling in RandomR, the values of $k$ and $c$ in RAD and RAR-D, and the number of new points added each time in RAR-D). We note that RandomR is a special case of RAD obtained by choosing $k = 0$ or $c \to \infty$, and RAR-G is a special case of RAR-D obtained in the limit of large $k$.
Our main findings from the results are as follows.

The proposed RAD method consistently performed the best among the 10 sampling methods across all forward and inverse problems.

For PDEs with complicated solutions, such as the Burgers' equation and the multiscale wave equation, the proposed RAD and RAR-D methods are particularly effective and yield errors that are orders of magnitude lower.

For PDEs with smooth solutions, such as the diffusion equation and the diffusion-reaction equation, some uniform sampling methods, such as Hammersley and RandomR, also produce sufficiently low errors.

Compared with other uniform sampling methods, RandomR usually demonstrates better performance.

Among the six uniform sampling methods with fixed residual points, the low-discrepancy sequences (Halton, Hammersley, and Sobol) generally perform better than Random and LHS, which in turn perform better than Grid.
Diffusion  Burgers’  AllenCahn  Wave  
No. of residual points  30  2000  1000  2000 
Grid  0.66 0.06%  13.7 2.37%  93.4 6.98%  81.3 13.7% 
Random  0.74 0.17%  13.3 8.35%  22.2 16.9%  68.4 20.1% 
LHS  0.48 0.24%  13.5 9.05%  26.6 15.8%  75.9 33.1% 
Halton  0.24 0.17%  4.51 3.93%  0.29 0.14%  60.2 10.0% 
Hammersley  0.17 0.07%  3.02 2.98%  0.14 0.14%  58.9 8.52% 
Sobol  0.19 0.07%  3.38 3.21%  0.35 0.24%  57.5 14.7% 
RandomR  0.12 0.06%  1.69 1.67%  0.55 0.34%  0.72 0.90% 
RARG [20]  0.20 0.07%  0.12 0.04%  0.53 0.19%  0.81 0.11% 
RAD  0.11 0.07%  0.02 0.00%  0.08 0.06%  0.09 0.04% 
RARD  0.14 0.11%  0.03 0.01%  0.09 0.03%  0.29 0.04% 
  Diffusion-reaction (first two columns)  Korteweg-de Vries (last three columns)
No. of residual points  15  600
Grid  0.36 ± 0.12%  8.58 ± 2.14%  24.4 ± 11.1%  53.7 ± 30.7%  42.0 ± 22.3%
Random  0.35 ± 0.17%  5.77 ± 2.05%  8.86 ± 2.80%  16.4 ± 7.33%  16.8 ± 7.40%
LHS  0.36 ± 0.14%  7.00 ± 2.62%  10.9 ± 2.60%  22.0 ± 6.68%  22.6 ± 6.36%
Halton  0.23 ± 0.08%  6.16 ± 1.08%  8.76 ± 3.33%  16.7 ± 6.16%  17.2 ± 6.20%
Hammersley  0.28 ± 0.08%  6.37 ± 0.91%  4.49 ± 3.56%  5.24 ± 7.08%  5.71 ± 7.32%
Sobol  0.21 ± 0.06%  3.09 ± 0.75%  8.59 ± 3.67%  15.8 ± 6.15%  15.6 ± 5.79%
RandomR  0.19 ± 0.09%  3.43 ± 1.80%  0.97 ± 0.15%  0.41 ± 0.30%  1.14 ± 0.31%
RAR-G [20]  1.12 ± 0.11%  15.9 ± 1.53%  8.83 ± 1.98%  15.4 ± 9.29%  14.5 ± 9.25%
RAD  0.17 ± 0.09%  2.76 ± 1.32%  0.77 ± 0.11%  0.31 ± 0.19%  0.86 ± 0.25%
RAR-D  0.76 ± 0.24%  10.3 ± 3.28%  2.36 ± 0.98%  3.49 ± 2.21%  3.18 ± 2.02%
3.2 Diffusion equation
We first consider a one-dimensional diffusion equation, where $u$ is the concentration of the diffusing material; the exact solution is available in closed form.
We first compare the performance of the six uniform sampling methods with fixed residual points (Fig. 3A). The number of residual points ranges from 10 to 80 with an increment of 10. For each number of residual points, the network is trained with Adam as the optimizer for a fixed maximum number of iterations. When the number of points is large (e.g., more than 70), all these methods have similar performance. However, when the number of residual points is small (e.g., 50), the Hammersley and Sobol sequences perform better than the others, and the equispaced uniform grid and random sampling have the largest errors (about one order of magnitude larger than Hammersley and Sobol).
We then test the RandomR method using 30 residual points (Fig. 3B). The accuracy of RandomR depends strongly on the resampling period, and the optimal period in this problem is around 200. Compared with Random without resampling, RandomR always yields lower relative errors regardless of the resampling period; with a proper resampling period, the error can be lower by one order of magnitude. Among all the non-adaptive methods, RandomR performs the best.
Next, we test the performance of the non-uniform adaptive sampling methods. In Algorithms 2 and 3, the neural network is first pretrained using a number of Adam steps. In the RAD method, we use 30 residual points, resampled periodically. The errors of RAD for different values of $k$ and $c$ are shown in Figs. 3C and D. We note that RandomR is a special case of RAD with either $k = 0$ or $c \to \infty$. Here, RAD with small values of $k$ or large values of $c$ leads to better accuracy, i.e., when the points are almost uniformly distributed. For the RAR-D method (Figs. 3E and F), one residual point is added periodically, starting from 10 points. With suitable values of $k$ and $c$ (the two red lines in Figs. 3E and F), RAR-D performs the best.
When using 30 residual points, the errors of all the methods are listed in Table 2. For this diffusion equation, all the methods achieve good accuracy. Compared with RandomR (0.12%), RAD (0.11%) and RAR-D (0.14%) are not significantly better. The reason could be that the solution of this diffusion equation is very smooth, so uniformly distributed points are good enough. In the following examples, we show that RAD and RAR-D work significantly better and achieve errors orders of magnitude smaller than the non-adaptive methods.
3.3 Burgers’ equation
The Burgers’ equation is considered defined as:
where is the flow velocity and is the viscosity of the fluid. In this study, is set at . Different from the diffusion equation with a smooth solution, the solution of the Burgers’ equation has a sharp front when and is close to 1.
We first test the uniform sampling methods with the number of residual points ranging from 1,000 to 10,000 (Fig. 4A). The maximum number of iterations is 15,000 steps with Adam as the optimizer, followed by 15,000 steps of L-BFGS. Fig. 4A shows that the Hammersley method converges the fastest and reaches the lowest relative error among all the uniform sampling methods, while the Halton and Sobol sequences also perform adequately.
Fig. 4B shows the relative error as a function of the resampling period for the RandomR method with 2,000 residual points. As for the diffusion equation, RandomR always outperforms Random. The performance of RandomR is not sensitive to the resampling period when the period is smaller than 100, but choosing too large a resampling period can negatively affect its performance.
When applying the non-uniform adaptive methods, the neural network is first pretrained using 15,000 steps of Adam and then 1,000 steps of L-BFGS. In the RAD method, we use 2,000 residual points, which are resampled every 2,000 iterations (1,000 iterations of Adam followed by 1,000 iterations of L-BFGS). As indicated by Fig. 4C, the RAD method has significant advantages over the RandomR method (a special case of RAD obtained by choosing $k = 0$ or $c \to \infty$), whose relative errors barely decrease during training; both extreme cases perform worse. In contrast, for $k = 1$ and $c = 1$ (the red lines in Figs. 4C and D), the relative error declines rapidly. The RAD method is also effective when $k$ and $c$ are chosen in a moderate range.
For the RAR-D method, 1,000 residual points are used in the pretraining process, and 10 residual points are added every 2,000 iterations (1,000 iterations of Adam and 1,000 iterations of L-BFGS) until the total number of residual points reaches 2,000. As shown in Figs. 4E and F, the optimal values of $k$ and $c$ are found to be 2 and 0, respectively.
Since the solution of Burgers’ equation has a very steep region, when using 2000 residual points, both RAD and RARD have competitive advantages over the uniform sampling methods in terms of accuracy and efficiency. For the following three forward PDE problems (AllenCahn equation in Section 3.4, wave equation in Section 3.5, and diffusionreaction equation in Section 3.6), unless otherwise stated, the maximum iterations, the use of optimizer, and the training processes remain the same as the Burgers’ equation.
Table 2 summarizes the relative errors of all methods when the number of residual points is fixed at 2,000. All uniform sampling methods fail to capture the solution well: the relative errors of the Halton, Hammersley, and Sobol methods (about 3–5%) are around one-fourth of those of the Grid, Random, and LHS methods (about 13%). Even though RandomR performs the best among all uniform methods (1.69 ± 1.67%), the proposed RAD and RAR-D methods achieve a relative error about two orders of magnitude lower (0.02%).
3.4 Allen-Cahn equation
Next, we consider the Allen-Cahn equation with a small diffusion coefficient. Fig. 5 outlines the relative errors of the different sampling methods for the Allen-Cahn equation.
Similar patterns to the previous examples are found for the non-adaptive uniform sampling methods: the Hammersley method has the best accuracy (Fig. 5A). As the number of residual points becomes large, the differences between these uniform sampling methods become negligible; except for the equispaced uniform grid, the uniform sampling methods converge to relative errors of about the same magnitude. Fig. 5B shows that, when using 1,000 residual points for RandomR, lower relative errors are obtained if the resampling period is less than 500.
We next test the performance of RAD for different values of $k$ and $c$ and different numbers of residual points. In Figs. 5C and D, we resampled 500 residual points every 2,000 iterations, while in Figs. 5E and F, we used 1,000 residual points instead. In both cases, the combination of $k = 1$ and $c = 1$ (the red lines in Figs. 5C–F) gives good accuracy. When fewer residual points (e.g., 500) are used, the RAD method boosts the performance of PINNs.
Similarly, we also test RAR-D in Figs. 5G–J. In Figs. 5G and H, we pretrain the neural network with 500 residual points and add 10 residual points every 2,000 iterations until the total number of residual points reaches 1,000. In Figs. 5I and J, we pretrain with 1,000 residual points and grow to 2,000 residual points in the same fashion. We find that $k = 2$ and $c = 0$ are the best values for the RAR-D method in both scenarios, which outperforms the RAR-G method.
As shown in this example, the optimal values of $k$ and $c$ for the RAD and RAR-D methods remain stable even when a different number of residual points is chosen. In addition, we find that the optimal $k$ and $c$ for the Burgers' and Allen-Cahn equations are the same for both the RAD and the RAR-D methods. Thus, we could choose $k = 1$ and $c = 1$ for RAD and $k = 2$ and $c = 0$ for RAR-D by default when first applying these methods to a new PDE problem.
To compare across all sampling methods, Table 2 shows the relative error for the Allen-Cahn equation when we fix the number of residual points at 1000. The Grid, Random, and LHS methods are prone to substantial errors, all larger than 20%. In contrast, the other four uniform methods (Halton, Hammersley, Sobol, and Random-R) perform better and achieve relative errors of less than 1%. Remarkably, the proposed RAD and RAR-D methods further bring the relative error below 0.1%.
3.5 Wave equation
In this example, the following one-dimensional wave equation is considered:
where the exact solution is given as:
The solution exhibits multi-scale behavior in both the spatial and temporal directions.
When testing the six uniform sampling methods, we increase the number of residual points incrementally. The Hammersley method achieves the lowest relative error at the fastest rate (Fig. 6A). When the number of residual points is large enough, the Random, Halton, and Hammersley methods can all obtain low relative errors.
To determine the effectiveness of Random-R for different numbers of residual points, we test three scenarios: small, medium, and large sets of residual points (Figs. 6B, C, and D). In the medium case (Fig. 6C), Random-R attains relative errors orders of magnitude lower than the Random method. However, in the small and large cases (Figs. 6B and D), Random-R shows no advantage over the Random method regardless of the resampling period. This is because when the number of residual points is small, both the Random and Random-R methods fail to provide accurate predictions; on the other hand, when the number of residual points is large, the predictions by the Random method are already highly accurate, so Random-R cannot further improve the accuracy.
Since the optimal values of $k$ and $c$ for both the RAD and RAR-D methods are found to be the same for the Burgers' and Allen-Cahn equations, in this numerical experiment we only apply the default settings (i.e., RAD: $k=1$ and $c=1$; RAR-D: $k=2$ and $c=0$) and investigate the effect of other factors, including the number of residual points for the RAD method and the number of points added each time for the RAR-D method.
In Fig. 6E, we compare the performance of the three non-uniform adaptive sampling methods under the same number of residual points. We first train the network with Adam and then L-BFGS, and after each resampling in RAD or each addition of new points in RAR-D/RAR-G, we train the network with further iterations of L-BFGS. For the RAR-G and RAR-D methods, we first train the network with 50% of the final number of residual points and add 10 residual points each time until reaching the total number. As we can see from Fig. 6E, RAD achieves much better results when the number of residual points is small. As the number of residual points increases, the RAR-D method becomes more effective and eventually reaches accuracy comparable to the RAD method. Since the RAD method is more computationally costly than the RAR-D method for the same number of residual points, we suggest applying the RAD method when the number of residual points is small and the RAR-D method when it is large.
We next investigate the RAD method with different numbers of residual points. Fig. 6F illustrates that increasing the number of residual points yields lower relative errors, but with diminishing marginal returns. We also train the network for additional iterations to see if the relative error can decrease further; however, the relative errors converge and remain stable.
One important factor in the RAR-D and RAR-G methods is how new points are added: we can either add a small number of residual points each time over a longer training process, or add a large number each time over a shorter one. In Fig. 6G, we first train the network with 1000 residual points and then add new residual points at different rates until the total reaches 2000; after each addition, we train the network for 1000 steps of L-BFGS. Likewise, in Fig. 6H, we first train the network with 2500 residual points and add new points at different rates until the total reaches 5000. In both RAR-D cases (Figs. 6G and H), we find that the best strategy is to add 10 points each time. However, as shown by the two red-shaded regions in Figs. 6G and H, the results are more stable when we use a larger number of residual points. Fig. 6I is set up in the same way as Fig. 6H but tests the RAR-G method; the best strategy for RAR-G is identical to that of RAR-D.
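The "add a few points at a time" strategy described above can be sketched as a loop around a single RAR-D round. Again, this is a hypothetical NumPy sketch under illustrative assumptions (toy residual function, example domain bounds), not the paper's implementation; a real run would interleave network training between rounds.

```python
import numpy as np

def rar_d_add(pde_residual, train_pts, m=10, k=2.0, c=0.0,
              n_pool=10_000, domain=((-1.0, 1.0), (0.0, 1.0)), rng=None):
    """One RAR-D round: draw m new points from the residual-based density
    eps(x)^k / mean(eps^k) + c and append them to the training set."""
    rng = np.random.default_rng(rng)
    lo, hi = np.array(domain).T
    pool = rng.uniform(lo, hi, size=(n_pool, len(lo)))
    eps = np.abs(pde_residual(pool)) ** k
    prob = eps / eps.mean() + c
    prob /= prob.sum()
    new = pool[rng.choice(n_pool, size=m, replace=False, p=prob)]
    return np.concatenate([train_pts, new], axis=0)

# Grow a 1000-point set to 2000 points, 10 at a time (the best-performing
# rate in the experiments above); the residual here is a toy placeholder.
residual = lambda X: np.exp(-50.0 * X[:, 0] ** 2)
pts = np.random.default_rng(0).uniform((-1.0, 0.0), (1.0, 1.0), size=(1000, 2))
while len(pts) < 2000:
    pts = rar_d_add(residual, pts, m=10, rng=len(pts))
```

With $k=2$ and $c=0$, the density is purely residual-driven, so each round targets the current worst-resolved regions; a greedy RAR-G round would instead take the $m$ pool points with the largest residuals outright.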
Table 2 lists the relative error for the wave equation using all methods when the number of residual points is 2000. All uniform methods with fixed residual points perform poorly and fail to approximate the true solution. Random-R, as a special case of the proposed RAD, gives relative errors of around 1%. The RAR-D method significantly enhances the prediction accuracy, resulting in relative errors under 0.3%. In addition, RAD with the default setting of $k=1$ and $c=1$ converges to relative errors under 0.1%.
3.6 Diffusion-reaction equation
The first inverse problem we consider is the following diffusion-reaction system:
where $f$ is the source term, $D$ is the diffusion coefficient, and $u$ is the solute concentration. In this problem, we aim to infer the space-dependent reaction rate $k(x)$ from given measurements of the solution $u$. The exact unknown reaction rate is
We aim to learn the unknown function $k(x)$ and solve for $u$ using eight observations of $u$, which are uniformly distributed over the domain, including the two boundary points. The relative errors for both the solution $u$ (Figs. 7A, C, and E) and the unknown function $k$ (Figs. 7B, D, and F) are computed. Training uses the Adam optimizer. Figs. 7A and B summarize the performance of all uniform sampling methods; we note that in 1D, the Hammersley and Halton sequences are identical, and they outperform the other uniform methods. We then fix the number of residual points at 15 and compare the Random method with the Random-R method. The relative errors given by Random-R (Figs. 7C and D) remain steady regardless of the resampling period and are approximately the same as those produced by the Random method. This is because the diffusion-reaction system is fairly simple and can be easily handled by uniform sampling methods without resampling.
Next, we compare the Random, RAD, RAR-G, and RAR-D methods with the default settings (i.e., RAD: $k=1$ and $c=1$; RAR-D: $k=2$ and $c=0$) using different numbers of residual points. For the Random and RAD methods, training uses Adam. For RAR-G/RAR-D, we first train the neural network with 50% of the total number of residual points using Adam; then we add one point at a time and continue training with Adam until we reach the total number of residual points. As shown in Figs. 7E and F, the RAD method surpasses the other methods and produces low relative errors even when the number of residual points is very small. However, RAR-G and RAR-D perform even worse than Random sampling.
To sum up, we fix the number of residual points at 15 and present the relative errors for both the solution and the unknown function in Table 3. RAD yields the minimum relative errors (0.17% for the solution $u$; 2.76% for the reaction rate $k$). However, due to the simplicity of this PDE problem, some uniform sampling methods, especially Sobol and Random-R, have performance comparable to RAD. Generally speaking, the uniform sampling methods are adequate for solving this inverse PDE with a smooth solution. Still, the RAD method can further enhance the performance of PINNs, especially when the number of residual points is small.
3.7 Korteweg-de Vries equation
The second inverse problem we solve is the Korteweg-de Vries (KdV) equation:
where $\lambda_1$ and $\lambda_2$ are two unknown parameters; their exact values are 1 and 0.0025, respectively. The initial condition is $u(x, 0) = \cos(\pi x)$, and periodic boundary conditions are used. To infer $\lambda_1$ and $\lambda_2$, we assume that we have observations of two solution snapshots, each at 64 uniformly distributed points.
In Fig. 8, the first column (Figs. 8A, D, and G) shows the relative error of the solution $u$, while the second column (Figs. 8B, E, and H) and the third column (Figs. 8C, F, and I) show the relative errors of $\lambda_1$ and $\lambda_2$, respectively. Training uses Adam. Hammersley achieves better accuracy than the other uniform sampling methods. The Sobol and Halton methods behave comparably, as their curves (the yellow and green curves in Figs. 8A, B, and C) almost overlap. As shown in Figs. 8D, E, and F, the Random-R method yields higher accuracy than the Random method by about one order of magnitude in all cases when using 1000 residual points, and a smaller resampling period leads to smaller errors.
Figs. 8G, H, and I compare the Random-R, Random, RAD, RAR-G, and RAR-D methods using the same number of residual points and the same total number of iterations. For the Random and Random-R methods, we train the network with Adam. For the RAD method, we first train the network with Adam; then we resample the residual points and train for a further 1000 steps of Adam, repeating this 50 times. To fix the total number of iterations for the RAR-G/RAR-D methods, we adjust the number of new residual points added each time accordingly. For example, if the final number of residual points is 500, we first train the network using 250 residual points (i.e., 50% of the total) and then add 5 points and train for 1000 steps of Adam each time; if the final number is 1000, we first train using 500 residual points and then add 10 points and train for 1000 steps of Adam each time. As demonstrated by Figs. 8G, H, and I, the RAD method is the best, while the Random-R method is also reasonably accurate. We show one example of the training process with 600 residual points (Figs. 8J, K, and L) to illustrate the convergence of the solution, $\lambda_1$, and $\lambda_2$ during training. The resampling strategies, especially the RAD method, achieve the greatest success among all sampling methods.
Table 3 lists the relative errors of the solution $u$ and of the two unknown parameters $\lambda_1$ and $\lambda_2$ for all methods when the number of residual points is set at 600. Among the uniform sampling methods with fixed points, Hammersley gives the lowest relative errors. Random-R is the second-best method, providing relative errors of around 1%. With the smallest errors and standard deviations, the RAD method has compelling advantages over all other methods in terms of accuracy and robustness. It is noteworthy that the RAR-D method provides adequate accuracy and is less expensive than the Random-R and RAD methods for the same number of residual points; therefore, RAR-D is also a valuable approach to consider.
4 Conclusions
In this paper, we present a comprehensive study of two categories of sampling for physics-informed neural networks (PINNs): non-adaptive uniform sampling and adaptive non-uniform sampling. For non-adaptive uniform sampling, we considered six methods: (1) equispaced uniform grid (Grid), (2) uniformly random sampling (Random), (3) Latin hypercube sampling (LHS), (4) Halton sequence (Halton), (5) Hammersley sequence (Hammersley), and (6) Sobol sequence (Sobol). We also considered a resampling strategy for uniform sampling (Random-R). For adaptive non-uniform sampling, motivated by residual-based adaptive refinement with greed (RAR-G) [20], we proposed two new residual-based adaptive sampling methods: residual-based adaptive distribution (RAD) and residual-based adaptive refinement with distribution (RAR-D).
We extensively investigated the performance of these ten sampling methods in solving four forward and two inverse problems of partial differential equations (PDEs) under many setups, such as different numbers of residual points. Our results show that the proposed RAD and RAR-D significantly improve the accuracy of PINNs, by orders of magnitude, especially when the number of residual points is small. RAD and RAR-D also have great advantages for PDEs with complicated solutions, e.g., the solution of the Burgers' equation with steep gradients and the solution of the wave equation with multi-scale behavior. A summary of the comparison of these methods can be found in Section 3.1.
Based on our empirical results, we summarize the following suggestions as a practical guideline for choosing sampling methods for PINNs.

RAD with $k=1$ and $c=1$ can be chosen as the default sampling method when solving a new PDE. The hyperparameters $k$ and $c$ can be tuned to balance the points in locations with large and small PDE residuals.

RAR-D can achieve accuracy comparable to RAD, but RAR-D is more computationally efficient as it gradually increases the number of residual points. Hence, RAR-D ($k=2$ and $c=0$ by default) is preferable when computational resources are limited.

Random-R can be used when adaptive sampling is not possible, e.g., when it is difficult to sample residual points according to a probability density function. The resampling period should be chosen neither too small nor too large.

A low-discrepancy sequence (e.g., Hammersley) should be considered rather than Grid, Random, or LHS when we must use a fixed set of residual points, such as in PINNs with the augmented Lagrangian method (hPINNs) [21].
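For reference, both RAD and RAR-D draw residual points from the same residual-based probability density, where $\varepsilon(\mathbf{x})$ denotes the PDE residual at $\mathbf{x}$:

```latex
p(\mathbf{x}) \;\propto\; \frac{\varepsilon^{k}(\mathbf{x})}{\mathbb{E}\left[\varepsilon^{k}(\mathbf{x})\right]} + c
```

Setting $k=0$ (or letting $c \to \infty$) recovers uniform sampling, while larger $k$ concentrates points more aggressively where the residual is large.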
In this study, we sample residual points in RAD and RAR-D using a brute-force approach, which is simple, easy to implement, and sufficient for many PDEs. However, for high-dimensional problems, we need to use other methods, such as generative adversarial networks (GANs) [8], as was done in Ref. [37]. Moreover, the probability of sampling a point is only considered to be proportional to $\varepsilon^k(\mathbf{x}) / \mathbb{E}[\varepsilon^k(\mathbf{x})] + c$. While this probability works very well in this study, there may exist a better choice. We could learn a new probability density function by meta-learning, as was done for the loss functions of PINNs in Ref. [31].
References
 [1] (2020) Physics-informed neural networks for inverse problems in nano-optics and metamaterials. Optics Express 28 (8), pp. 11618. External Links: Document Cited by: §1.
 [2] (2022) Systems biology: identifiability analysis and parameter identification via systems-biology informed neural networks. arXiv preprint arXiv:2202.01723. Cited by: §1.
 [3] (2022) State-of-the-art review of design of experiments for physics-informed deep learning. arXiv preprint arXiv:2202.06416. Cited by: §1.1.
 [4] (2022) Rethinking the importance of sampling in physics-informed neural networks. arXiv preprint arXiv:2207.02338. Cited by: §1.1, 7th item, §2.4.
 [5] (2021) A method for representing periodic functions and enforcing exactly periodic boundary conditions with deep neural networks. Journal of Computational Physics 435, pp. 110242. Cited by: §1, §2.1.
 [6] (2015) Fast generation of 2D node distributions for mesh-free PDE discretizations. Computers & Mathematics with Applications 69 (7), pp. 531–544. Cited by: 4th item.
 [7] (2021) Active learning based sampling for high-dimensional nonlinear partial differential equations. arXiv preprint arXiv:2112.13988. Cited by: §1.1, 1st item, §2.4.
 [8] (2014) Generative adversarial nets. Advances in neural information processing systems 27. Cited by: §2.3.2, §4.
 [9] (2021) SelectNet: self-paced learning for high-dimensional partial differential equations. Journal of Computational Physics 441, pp. 110444. Cited by: §1, §2.4.
 [10] (2020) Analysis of three-dimensional potential problems in non-homogeneous media with deep learning based collocation method. arXiv preprint arXiv:2010.12060. Cited by: §1.1, §2.2.1.
 [11] (2022) Improved training of physics-informed neural networks with model ensembles. arXiv preprint arXiv:2204.05108. Cited by: §1.
 [12] (1960) On the efficiency of certain quasi-random sequences of points in evaluating multi-dimensional integrals. Numerische Mathematik 2 (1), pp. 84–90. Cited by: §1.1, item 4a, §2.2.1.
 [13] (1964) Monte Carlo methods. Methuen, London. Cited by: §1.1, item 4b, §2.2.1.
 [14] (2022) Residual-based adaptivity for two-phase flow simulation in porous media using physics-informed neural networks. Computer Methods in Applied Mechanics and Engineering 396, pp. 115100. Cited by: §1.1, 6th item, §2.4.
 [15] (2020) Extended physics-informed neural networks (XPINNs): a generalized space-time domain decomposition based deep learning framework for nonlinear partial differential equations. Communications in Computational Physics 28 (5), pp. 2002–2041. External Links: ISSN 1991-7120, Document Cited by: §1.
 [16] (2021) Physics-informed machine learning. Nature Reviews Physics 3 (6), pp. 422–440. External Links: Document Cited by: §1, §1.
 [17] (2021) Characterizing possible failure modes in physics-informed neural networks. Advances in Neural Information Processing Systems 34, pp. 26548–26560. Cited by: §1.
 [18] (2020) Systematic construction of neural forms for solving partial differential equations inside rectangular domains, subject to initial, boundary and interface conditions. International Journal on Artificial Intelligence Tools 29 (05), pp. 2050009. Cited by: §1, §2.1.
 [19] (2022) Revisiting PINNs: generative adversarial physics-informed neural networks and point-weighting method. arXiv preprint arXiv:2205.08754. Cited by: §1, §2.4.
 [20] (2021) DeepXDE: a deep learning library for solving differential equations. SIAM Review 63 (1), pp. 208–228. Cited by: §1.1, §1.1, §1, 1st item, §2.1, §2.2.2, §2.3.1, §2.3.2, §2.3, §2.4, Table 2, Table 3, §3, §4, 1.
 [21] (2021) Physics-informed neural networks with hard constraints for inverse design. SIAM Journal on Scientific Computing 43 (6). External Links: Document Cited by: §1, §1, §2.1, §2.4, 4th item.
 [22] (2020) Physics-informed neural networks for high-speed flows. Computer Methods in Applied Mechanics and Engineering 360, pp. 112789. Cited by: §1.1.
 [23] (2022) A novel sequential method to train physics informed neural networks for Allen-Cahn and Cahn-Hilliard equations. Computer Methods in Applied Mechanics and Engineering 390, pp. 114474. Cited by: §1.
 [24] (2020) Self-adaptive physics-informed neural networks using a soft attention mechanism. arXiv preprint arXiv:2009.04544. Cited by: §1, §2.4.
 [25] (2000) A comparison of three methods for selecting values of input variables in the analysis of output from a computer code. Technometrics 42 (1), pp. 55–61. Cited by: §1.1, item 3, §2.2.1.
 [26] (2020) PPINN: parareal physics-informed neural network for time-dependent PDEs. Computer Methods in Applied Mechanics and Engineering 370, pp. 113250. External Links: Document Cited by: §1.
 [27] (2021) Efficient training of physics-informed neural networks via importance sampling. Computer-Aided Civil and Infrastructure Engineering. Cited by: §1.1, 2nd item, §2.3.2, §2.3, §2.4.
 [28] (2014) PCG: a family of simple fast space-efficient statistically good algorithms for random number generation. ACM Transactions on Mathematical Software. Cited by: item 2.
 [29] (2019) fPINNs: fractional physics-informed neural networks. SIAM Journal on Scientific Computing 41 (4). External Links: Document Cited by: §1.1, §1, §2.2.1.
 [30] (2022) RANG: a residual-based adaptive node generation method for physics-informed neural networks. arXiv preprint arXiv:2205.01051. Cited by: §1.1, 4th item, §2.4.
 [31] (2022) Meta-learning PINN loss functions. Journal of Computational Physics 458, pp. 111121. External Links: Document Cited by: §1, §4.
 [32] (2019) Physics-informed neural networks: a deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations. Journal of Computational Physics 378, pp. 686–707. External Links: Document Cited by: §1.1, §1, §2.1, §2.2.1.
 [33] (2020) Hidden fluid mechanics: learning velocity and pressure fields from flow visualizations. Science 367 (6481), pp. 1026–1030. External Links: Document Cited by: §1.
 [34] (2021) Parallel physics-informed neural networks via domain decomposition. Journal of Computational Physics 447, pp. 110683. External Links: Document Cited by: §1.
 [35] (1967) On the distribution of points in a cube and the approximate evaluation of integrals. Zhurnal Vychislitel’noi Matematiki i Matematicheskoi Fiziki 7 (4), pp. 784–802. Cited by: §1.1, item 4c, §2.2.1.
 [36] (1987) Large sample properties of simulations using Latin hypercube sampling. Technometrics 29 (2), pp. 143–151. Cited by: §1.1, item 3, §2.2.1.
 [37] (2021) DAS: a deep adaptive sampling method for solving partial differential equations. arXiv preprint arXiv:2112.14038. Cited by: §1.1, 2nd item, §2.4, §4.
 [38] (2022) Respecting causality is all you need for training physics-informed neural networks. arXiv preprint arXiv:2203.07404. Cited by: §1.
 [39] (2021) Understanding and mitigating gradient flow pathologies in physics-informed neural networks. SIAM Journal on Scientific Computing 43 (5), pp. A3055–A3081. Cited by: §1.
 [40] (2022) When and why PINNs fail to train: a neural tangent kernel perspective. Journal of Computational Physics 449, pp. 110768. Cited by: §1.
 [41] (2020) Solving Allen-Cahn and Cahn-Hilliard equations using the adaptive physics informed neural networks. arXiv preprint arXiv:2007.04542. Cited by: §1.
 [42] (2022) Self-adaptive loss balanced physics-informed neural networks. Neurocomputing. Cited by: §1.
 [43] (2020) Systems biology informed deep learning for inferring parameters and hidden dynamics. PLOS Computational Biology 16 (11). External Links: Document Cited by: §1.
 [44] (2022) Gradient-enhanced physics-informed neural networks for forward and inverse PDE problems. Computer Methods in Applied Mechanics and Engineering 393, pp. 114823. External Links: Document Cited by: §1, §2.1, §2.3.2.
 [45] (2022) Investigating molecular transport in the human brain from MRI with physics-informed neural networks. arXiv preprint arXiv:2205.02592. Cited by: §1.1, 5th item, §2.4.
 [46] (2022) Adaptive deep neural networks methods for high-dimensional partial differential equations. Journal of Computational Physics 463, pp. 111232. Cited by: §1.1, 3rd item, §2.4.
 [47] (2019) Quantifying total uncertainty in physics-informed neural networks for solving forward and inverse stochastic problems. Journal of Computational Physics 397, pp. 108850. External Links: Document Cited by: §1.