1.1 Problem Statement
High-dimensional partial differential equations (PDEs) are widely used in modeling complicated phenomena, e.g., the Hamilton–Jacobi–Bellman (HJB) equation [HJB] in control theory and the Schrödinger equation [sch_equation] in quantum mechanics. There are various traditional numerical methods for solving PDEs, especially low-dimensional ones. However, for high-dimensional problems, the curse of dimensionality becomes a major computational challenge for many traditional methods; e.g., in the finite difference method [FDM], the computational complexity grows exponentially in the dimension. Consequently, traditional methods are often computationally intractable for high-dimensional problems.
In approximation theory, deep neural networks (DNNs) have been a more effective tool for function approximation than traditional tools such as finite elements and wavelets. In [yarotsky1, yarotsky2, shijun2, shijun3, shijun6, hon2021simultaneous], it has been shown that ReLU-type DNNs can provide more attractive approximation rates than traditional tools for approximating continuous and smooth functions, with explicit error characterization [shijun2, shijun3, shijun6] in the $L^\infty$-norm [yarotsky1, yarotsky2, shijun2, shijun3, shijun6] and in Sobolev norms [hon2021simultaneous]. If target functions lie in the space of functions with special integral representations [barron1993, weinan1, weinan2, siegel2021sharp, SIEGEL2020313, JCM-39-801], ReLU-type DNNs can lessen the curse of dimensionality in function approximation. Armed with advanced activation functions, DNNs are able to overcome the curse of dimensionality in approximating continuous functions [shijun4, shijun5, yarotsky3, shijun7]. In generalization theory, it has been shown that DNNs can achieve a dimension-independent error rate for solving PDEs [doi:10.1137/19M125649X, LuoYang2020, lu2021priori, chen2021representation]. These theoretical results have justified the recent application of DNNs to solving high-dimensional PDEs [Han8505, raissi2017physicsI, raissi2017physicsII, Deep_Ritz, DLPDE1, DLPDE2l, IntNet, DLPDE3]. Two main advantages of the deep-learning-based methods presented in these studies are summarized as follows: firstly, the curse of dimensionality can be weakened, or even overcome, for certain classes of PDEs [arnulf1, arnulf2]; secondly, deep-learning-based PDE solvers are mesh-free and avoid the tedious mesh generation that traditional solvers require for complex domains. Thus, deep-learning-based methods have shown tremendous potential to surpass other numerical methods, especially for solving high-dimensional PDEs in complex domains.
Deep-learning-based PDE solvers have been applied to problems in many areas of practical engineering and the life sciences [other1, other2, other3]. However, most deep-learning-based solvers are computationally expensive because convergence is usually slow and is not guaranteed to reach the desired precision. A recent study [gu2021selectnet] proposed a self-paced learning framework, inspired by an algorithm named "SelectNet" [liu2019selectnet] originally designed for image classification, to ease the issue of slow convergence. In [gu2021selectnet], new neural networks called selection networks are trained simultaneously with the residual-based solution network. The selection network learns to assign weights to training points so as to guide the solution network to learn more from points with larger residual errors.
In this paper, we aim to address the aforementioned issue of slow convergence by introducing active-learning-based adaptive sampling techniques that select the most informative training examples for the networks to be trained on. This algorithm is motivated by the active learning approach primarily used in supervised learning in computer vision [AL1, AL2, AL3, AL4], which chooses high-quality data to label so that the network learns faster and less human labeling effort is required.
1.2 Related work
1.2.1 Relevant Literature Review
Mesh-based neural networks have been proposed to approximate solution operators for specific types of differential equations [LEE1990110, Lagaris_1998, MALEK2006260, meshbased_1, meshbased_2]. However, these methods become computationally intractable in high dimensions since the degrees of freedom explode exponentially as the dimension increases. Therefore, mesh-free models [meshfree2, E_2017, DLPDE2l, raissi2017physicsI, meshfree1, WAN, chen2021friedrichs] are the key to solving high-dimensional PDEs. Mesh-free models use sampled points and the Monte-Carlo method to approximate the objective loss function, avoiding mesh generation, which is computationally restrictive in high dimensions. In an early study [DLPDE4], PDEs are solved via deep neural networks by directly minimizing the degree to which the network approximation violates the PDE at prescribed collocation points; boundary conditions are treated as hard constraints by constructing boundary governing functions that satisfy them exactly, although designing such functions is usually difficult for complicated PDE problems. In a later study concerning boundary value problems [DLPDE5], boundary residual errors are instead incorporated into a single loss function to enforce boundary conditions. The latter approach is more commonly used since it does not require auxiliary functions that are problem-dependent and difficult to construct.
Active learning methods are not commonly utilized in deep-learning-based PDE solvers. In [activePDE1], active learning is used to select "additional observation locations" to "minimize the variance of the PDE coefficient" and to "minimize the variance of the solution of the PDE". In [lu2020deepxde], residual-based adaptive refinement (RAR) is proposed to adaptively select points with large residuals. RAR first draws uniform points and computes the residual errors at these points, and then adds a certain number of the largest-residual points to the training set. RAR and our active-learning-based adaptive sampling (to be introduced in Section 2.2) share the same methodology of selecting large-residual points from a pool of uniform points. The difference is that our adaptive sampling ranks all points: low-residual points are less likely to be selected, whereas RAR still keeps them. Large-residual points are also ranked; i.e., even if two points both have large residuals, the one with the larger residual is more likely to be selected. In contrast, RAR does not differentiate them and simply chooses the points with the largest residuals. In [Adaptive1], various adaptive strategies have also been introduced. For example, in their adaptive-sampling-in-time approach, collocation points initially come from a small specified time interval that gradually expands. As shown in [Adaptive1], this approach considerably improves convergence and overall accuracy. However, the number of training points accumulates in this approach and the computation becomes increasingly expensive. In our adaptive sampling approach, the number of training points does not grow; therefore, in some cases, our adaptive sampling is a better fit, although the approaches in [Adaptive1] are already highly sophisticated.
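To make the contrast concrete, the following sketch (illustrative only; the function names and exact selection rules are simplifications of the published algorithms) compares RAR-style top-$m$ selection with residual-proportional ranking:

```python
import numpy as np

def rar_select(residuals, m):
    """RAR-style selection: take the m points with the largest residuals.
    Points below the cutoff are not ranked against each other at all."""
    return np.argsort(-residuals)[:m]

def ranked_select(residuals, n, k=1.0, rng=None):
    """Adaptive-sampling-style selection: every point is ranked, and the
    selection probability is proportional to residual**k, so low-residual
    points are unlikely (but not impossible) to be chosen."""
    rng = np.random.default_rng() if rng is None else rng
    w = residuals ** k
    return rng.choice(len(residuals), size=n, p=w / w.sum())
```

In this toy contrast, `rar_select` deterministically keeps the top residuals, while `ranked_select` draws points with probability proportional to their residuals, so even the largest-residual points are differentiated from each other.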
1.2.2 Deep-Learning-based Least Squares Method
Consider the following boundary value problem, for simplicity, to introduce the deep-learning-based least squares method: find the unknown solution $u$ such that
$$\mathcal{D} u = f \ \text{ in } \Omega, \qquad \mathcal{B} u = g \ \text{ on } \partial\Omega, \tag{1.1}$$
where $\partial\Omega$ is the boundary of the domain $\Omega$, and $\mathcal{D}$ and $\mathcal{B}$ are differential operators in $\Omega$ and on $\partial\Omega$, respectively. The goal is to train a neural network, denoted by $\hat{u}(x;\theta)$, where $\theta$ is the set of network parameters, to approximate the ground-truth solution $u$ of the PDE (1.1). In the least squares method (LSM), the PDE is solved by finding the optimal set of parameters $\theta^*$ that minimizes the loss function; i.e.,
$$\theta^* = \arg\min_{\theta}\ \big\|\mathcal{D}\hat{u}(\cdot;\theta) - f\big\|_{L^2(\Omega)}^2 + \lambda\,\big\|\mathcal{B}\hat{u}(\cdot;\theta) - g\big\|_{L^2(\partial\Omega)}^2 \approx \arg\min_{\theta}\ \frac{|\Omega|}{N_1}\sum_{i=1}^{N_1}\big|\mathcal{D}\hat{u}(x_i;\theta) - f(x_i)\big|^2 + \lambda\,\frac{|\partial\Omega|}{N_2}\sum_{j=1}^{N_2}\big|\mathcal{B}\hat{u}(y_j;\theta) - g(y_j)\big|^2, \tag{1.2}$$
where $\lambda$ is a positive hyper-parameter that weights the boundary loss. The last step is a Monte-Carlo approximation with $\{x_i\}_{i=1}^{N_1}\subset\Omega$ and $\{y_j\}_{j=1}^{N_2}\subset\partial\Omega$ being $N_1$ and $N_2$ collocation points sampled from the respective probability densities that $x$ and $y$ follow. This model will be referred to as the basic model in this paper. For a time-dependent PDE, the temporal coordinate can be regarded as another spatial coordinate to build a similar least squares model.
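As an illustration of the Monte-Carlo loss in (1.2), the following minimal sketch (hypothetical helper names; the residuals are supplied as callables and the constant volume factors are omitted, since they do not affect the minimizer) evaluates the least squares objective on sampled collocation points:

```python
import numpy as np

def lsm_loss(pde_residual, bc_residual, x_in, x_bd, lam=10.0):
    """Monte-Carlo least squares loss: mean squared PDE residual over
    interior collocation points plus lam times the mean squared boundary
    residual over boundary points, as in the last step of (1.2)."""
    interior = np.mean(pde_residual(x_in) ** 2)
    boundary = np.mean(bc_residual(x_bd) ** 2)
    return interior + lam * boundary

# toy 1D usage: for u(x) = x solving u'' = 0, u(0) = 0, u(1) = 1,
# both residuals vanish, so the loss is zero
x_in = np.linspace(0.01, 0.99, 99).reshape(-1, 1)
x_bd = np.array([[0.0], [1.0]])
zero = lambda x: np.zeros(len(x))
loss = lsm_loss(zero, zero, x_in, x_bd)
```

A solver then minimizes this quantity over the network parameters; any nonzero residual at the sampled points raises the loss.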
This paper is structured as follows. The active-learning-based adaptive sampling is introduced in Section 2. The detailed numerical implementation of the proposed sampling method is discussed in Section 3. In Section 4, extensive numerical experiments are provided to demonstrate the efficiency of the proposed method.
2 Active-Learning-based Adaptive Sampling
2.1 Overview of Active Learning
Active learning [AL5, AL6] is a machine learning method in which the learning algorithm inspects unlabeled data and interactively chooses the most informative data points to learn from. Active learning has primarily been applied in computer vision tasks [AL1], where it has shown extraordinary performance. Active learning for deep learning proceeds in rounds: in each round, the current model serves to assess the informativeness of training examples. It is usually used for supervised or semi-supervised learning in which there is a large amount of unlabeled data and the algorithm chooses the most informative data points to get labeled for training. There are two common metrics for measuring informativeness: uncertainty [uncertainty] and diversity [diversity]. In uncertainty sampling, the algorithm tries to find training examples that the current model is most uncertain about, as a proxy for the model output being incorrect. For example, in a classification problem, this uncertainty can be measured by the current model's prediction: if, for a training example, the model's output probability distribution over classes has similar values, as opposed to a high probability for a single class, then the current model is uncertain about this particular training example. In diversity sampling [diversity2], the model seeks to identify training examples that best represent the diversity of the data. It is worth mentioning that query by committee (QBC) [QBC1] is also a common active learning approach.
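For instance, a common way to operationalize uncertainty in classification is the entropy of the predicted class distribution; the sketch below (illustrative, not taken from the cited works) ranks unlabeled examples by predictive entropy:

```python
import numpy as np

def predictive_entropy(probs):
    """Entropy of class-probability rows; higher means more uncertain."""
    p = np.clip(probs, 1e-12, 1.0)  # avoid log(0)
    return -np.sum(p * np.log(p), axis=-1)

def most_uncertain(prob_matrix, k):
    """Indices of the k examples whose predicted class distributions are
    closest to uniform, i.e., the ones the model is least sure about."""
    return np.argsort(-predictive_entropy(prob_matrix))[:k]

# a near-uniform prediction is selected before a confident one
probs = np.array([[0.5, 0.5], [0.99, 0.01]])
```

Here the first example, with probabilities (0.5, 0.5), is ranked as more informative than the confidently classified second example.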
2.2 Adaptive Sampling (AS)
In supervised or semi-supervised learning, the uncertainty sampling algorithm chooses the data points that the current model is most uncertain about to be labeled and trained on. The least squares method for solving PDEs is an unsupervised learning approach because there are no labels in the loss function. However, the idea of uncertainty sampling can still be applied to choose the most informative collocation points to learn from. Hence, inspired by uncertainty sampling, we propose active-learning-based adaptive sampling to select high-quality training examples to speed up the convergence of PDE solvers built on the basic least squares idea in (1.2).
In our adaptive sampling, the objective is to preferentially choose collocation points with larger absolute residual errors. Intuitively, one can think of a large residual error at a point as a proxy for the model being wrong to a greater extent at that particular point. The fundamental methodology of adaptive sampling is to choose from a biased importance distribution that attaches higher priority to important volumes/regions of the domain. For simplicity, the absolute residual error of the network approximation at a point is defined as
$$r_\Omega(x) = \big|\mathcal{D}\hat{u}(x;\theta) - f(x)\big| \ \text{ in } \Omega, \qquad r_{\partial\Omega}(y) = \big|\mathcal{B}\hat{u}(y;\theta) - g(y)\big| \ \text{ on } \partial\Omega,$$
where $\hat{u}(\cdot;\theta)$ is the solution network. Adaptive sampling is thus proposed to choose collocation points following a distribution, denoted by $q$, whose probability density function is proportional to the residual errors, that is, to $r_\Omega$ in $\Omega$ and to $r_{\partial\Omega}$ on $\partial\Omega$, respectively. The following generalized density function can be applied to choose collocation points in $\Omega$ or on $\partial\Omega$, respectively. More precisely, we define
$$q(x) = \frac{1}{c}\, r(x)^k, \tag{2.2}$$
where $c$ is an unknown normalizing constant and $k$ is a non-negative constant exponent that controls the effect of adaptive sampling. With larger $k$, points with larger residual errors are more likely to be sampled; when $k = 0$, adaptive sampling has no effect.
As one may have noticed, the unknown normalizing constant $c$ is difficult to calculate. Therefore, one may not be able to directly sample points from $q$. Two techniques for simulating observations from $q$ are reviewed in the following.
2.2.1 Metropolis Hastings Sampling
The Metropolis-Hastings (MH) algorithm [MH1] is a commonly used Markov chain Monte Carlo (MCMC) method [MCMC1]. "Markov chain" refers to the fact that each sample depends on the current sample; "Monte Carlo" refers to simulating observations from a target distribution that is difficult to sample from directly. The Metropolis-Hastings algorithm is adopted to select points from the target distribution $q$. Note that all algorithms below sample points in $\Omega$ for training, and the same logic applies to sampling on $\partial\Omega$.
In Algorithm 1, the first $n_b$ points, called "burn-ins", are discarded, where $n_b$ is a non-negative integer. This is a computational trick to correct the bias of early samples. However, to the best of our knowledge, there is no rigorous justification for burn-ins. Note that in the actual implementation, Algorithm 2, a vectorized version following the coding style in Python, is deployed.
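A minimal random-walk Metropolis sampler targeting an unnormalized density might look as follows (a sketch with a symmetric Gaussian proposal; the paper's Algorithms 1 and 2 may differ in details such as the proposal distribution and vectorization):

```python
import numpy as np

def metropolis_hastings(log_p_tilde, x0, n_samples, burn_in=1000, step=1.0, rng=None):
    """Random-walk Metropolis: targets an unnormalized density p_tilde
    via its log. Each proposal depends on the current state, so the
    chain is inherently sequential."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.asarray(x0, dtype=float)
    samples = []
    for i in range(burn_in + n_samples):
        prop = x + step * rng.standard_normal(x.shape)   # symmetric proposal
        # accept with probability min(1, p(prop)/p(x)); log scale for stability
        if np.log(rng.random()) < log_p_tilde(prop) - log_p_tilde(x):
            x = prop
        if i >= burn_in:                                 # discard burn-ins
            samples.append(x.copy())
    return np.array(samples)
```

Because the normalizing constant cancels in the acceptance ratio, only the unnormalized density (here the residual power $r(x)^k$) is needed.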
2.2.2 Self-normalized Sampling
The Metropolis-Hastings algorithm can sometimes be computationally expensive, and its convergence can be slow. Therefore, an alternative algorithm is proposed to sample training examples. It is faster than the MH algorithm because it generates points entirely in parallel, whereas in the MH algorithm each sample depends on the previous one, so the MH algorithm cannot be parallelized. Our proposed sampling technique will be used in the numerical experiments.
This algorithm is named "self-normalized sampling" because the discrete probability density function it produces normalizes itself. While it is well known that the marginal distribution of points generated by the Metropolis-Hastings algorithm converges to the target distribution, the self-normalized sampling algorithm also efficiently generates sets of points that concentrate on large-residual-error volumes in our experiments. Moreover, in Appendix D, we discuss why this self-normalized algorithm makes sense.
In general, self-normalized sampling is less computationally expensive than the Metropolis-Hastings algorithm. However, in some rare cases, the residual error is large only in some extremely small volumes and relatively low elsewhere; the points generated in step 1 of Algorithm 3 can then fail to include any point inside these large-error volumes. A naive remedy is to generate more than $N$ points in step 1 so that a broader range of points is covered. On the other hand, the Metropolis-Hastings algorithm can still sample points in these small volumes, although the number of burn-ins needs to be very large to reach them.
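The core of self-normalized sampling can be sketched as follows (assuming the candidate pool has already been drawn uniformly, as in step 1 of Algorithm 3; the function and variable names are illustrative):

```python
import numpy as np

def self_normalized_sample(candidates, residual_fn, n, k=1.0, rng=None):
    """Steps 2-3 of the sketch: weight a uniform candidate pool by
    residual**k, normalize into a discrete pmf, and resample n points.
    Fully parallel: no draw depends on a previous draw."""
    rng = np.random.default_rng() if rng is None else rng
    w = np.abs(residual_fn(candidates)) ** k
    p = w / w.sum()                       # the discrete pmf normalizes itself
    idx = rng.choice(len(candidates), size=n, p=p)
    return candidates[idx]

rng = np.random.default_rng(1)
cand = rng.uniform(-1.0, 1.0, size=(2000, 1))   # step 1: uniform candidate pool
spike = lambda x: np.where(np.abs(x[:, 0]) < 0.1, 100.0, 0.01)
pts = self_normalized_sample(cand, spike, 500, rng=rng)
```

With a residual that is large only near the origin, almost all resampled points land in that region, illustrating how the method concentrates on large-residual volumes.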
A demonstration of points generated in 2D by Algorithms 2 and 3 is presented. Figure 2.2 shows an example of a distribution of squared residual errors. Figures 2.3 and 2.4 show the distributions of points generated by the Metropolis-Hastings algorithm and by self-normalized sampling, respectively, with burn-ins applied for the Metropolis-Hastings algorithm; the uniform distribution used in Algorithms 2 and 3 is replaced by the uniform annular distribution [gu2021selectnet]. Figure 2.1 shows the distribution of 500 points generated by the uniform annular distribution. The numbers of points inside the ellipse generated by each approach are counted in Table 1.
| Approach | Number of Points |
|---|---|
2.3 Adaptive Sampling: A Statistical Viewpoint
Adaptive sampling is motivated by active learning, which takes residual errors as a proxy for the model being wrong; it is natural to connect adaptive sampling with uncertainty sampling, which takes uncertainty as a proxy for the model being wrong. Adaptive sampling can also be explained from a statistical viewpoint. The definitions of expectation and the Monte-Carlo estimate are given in Appendix B.
Definition 2.1 (Owen, 2013 [mcbook]).
Let $X$ be a random variable on $\Omega$ with density $p$, and let $f$ be any function defined on $\Omega$. Let $q(x)$ be another probability density on $\Omega$ such that $q(x) = 0$ implies $f(x)p(x) = 0$. $q$ is also called the biasing distribution or importance distribution. The importance sampling estimate of $\mathbb{E}_p[f(X)]$ is defined by
$$\hat{\mu}_q = \frac{1}{n}\sum_{i=1}^{n}\frac{f(X_i)\,p(X_i)}{q(X_i)}, \qquad X_i \sim q.$$
Denote by $U$ the uniform density on $\Omega$. Note that if $p = q = U$, then $p(X_i)$ and $q(X_i)$ cancel each other out, and the importance sampling estimate reduces to the plain Monte-Carlo estimate.
Theorem 2.1 (Owen, 2013 [mcbook]).
The importance sampling estimate $\hat{\mu}_q$ is an unbiased estimate of $\mu = \mathbb{E}_p[f(X)]$; i.e., $\mathbb{E}_q[\hat{\mu}_q] = \mu$.
One way to justify adaptive sampling is through the notion of active learning; adaptive sampling can also be understood as a better estimation approach. In (1.2), the plain Monte-Carlo estimate faces the issue of slow convergence if, for example, the squared residual error is large in a small volume $V$ but low, or nearly zero, outside of $V$. The plain Monte-Carlo estimate could then fail to place even one point inside the volume $V$. Faster convergence can be achieved "by sampling from a distribution that overweights the important region, hence the name importance sampling" [mcbook]. This paper is not concerned with finding an unbiased estimate of the expectation but with finding the best parameter $\theta$ that minimizes it. Therefore, collocation points are sampled from $q$ without altering the formulation in the last step of (1.2), since the minimizer depends only on $\theta$.
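The importance sampling estimate of Definition 2.1 can be sketched as follows (an illustrative example, not the paper's code; the biasing density $q(x) = 2x$ on $(0,1)$ is chosen only for demonstration):

```python
import numpy as np

def importance_estimate(f, p, q_sample, q_pdf, n):
    """Unbiased importance sampling estimate of E_p[f(X)]:
    draw X_i ~ q and average f(X_i) * p(X_i) / q(X_i)."""
    x = q_sample(n)
    return np.mean(f(x) * p(x) / q_pdf(x))

# estimate E[X^2] for X ~ Uniform(0, 1) (true value 1/3) with a biasing
# density q(x) = 2x that overweights the region where f is large;
# inverse-CDF sampling for q: X = sqrt(U)
rng = np.random.default_rng(2)
q_sample = lambda n: np.sqrt(rng.uniform(size=n))
est = importance_estimate(lambda x: x ** 2, lambda x: np.ones_like(x),
                          q_sample, lambda x: 2 * x, 100000)
```

Overweighting the region where the integrand is large reduces the variance of the estimate relative to plain uniform sampling, which is exactly the mechanism adaptive sampling exploits.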
3 Setup of Numerical Implementation
3.1 Network Architecture
Note that the adaptive sampling introduced in Section 2 is an active-learning-based sampling technique that is independent of the choice of network architecture. In the numerical implementation, the solution network $\hat{u}(x;\theta)$ is a feed-forward neural network in which each layer is fully connected. The network can be expressed as a composition of affine maps and activation functions; i.e.,
$$\hat{u}(x;\theta) = W_L\,\sigma\big(W_{L-1}\cdots\sigma(W_1 x + b_1)\cdots + b_{L-1}\big) + b_L,$$
where $L$ is the depth of the network, $\sigma$ is a nonlinear activation function applied elementwise, $W_1 \in \mathbb{R}^{m\times d}$, $W_\ell \in \mathbb{R}^{m\times m}$ for $2 \le \ell \le L-1$, $W_L \in \mathbb{R}^{1\times m}$, $b_1,\dots,b_{L-1} \in \mathbb{R}^m$, and $b_L \in \mathbb{R}$; $m$ is the number of nodes or neurons in a layer, called the width of the network, and $d$ is the dimension of the problem. The network setting parameters used in the numerical implementation are listed in Table 2.
| Parameter | Value |
|---|---|
| Depth of the Network | 3 |
| Width of the Network | 100 |
| Dimension of Spatial Coordinates | to be specified |
| Learning Rate for Solution Network | Specified in (3.4) |
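A minimal NumPy sketch of such a fully connected network follows (depth 3 and width 100 as in Table 2; the initialization scheme and activation are illustrative choices, not necessarily those used in the experiments):

```python
import numpy as np

def init_params(d, width=100, depth=3, rng=None):
    """Fully connected network: input dimension d, (depth - 1) hidden
    layers of `width` neurons each, and a scalar output layer."""
    rng = np.random.default_rng() if rng is None else rng
    dims = [d] + [width] * (depth - 1) + [1]
    return [(rng.standard_normal((m, n)) / np.sqrt(n), np.zeros(m))
            for n, m in zip(dims[:-1], dims[1:])]

def forward(params, x):
    """u_hat(x; theta): alternating affine maps and tanh activations,
    with a linear output layer; x has shape (batch, d)."""
    h = x
    for W, b in params[:-1]:
        h = np.tanh(h @ W.T + b)
    W, b = params[-1]
    return (h @ W.T + b).squeeze(-1)
```

In practice the parameters would be trained by Adam as described below; this sketch only shows the forward evaluation used to compute residuals.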
The network is trained to solve the minimization problem defined in (1.2) via Adam [adam] (a variant of stochastic gradient descent), where the empirical loss is defined by
$$\hat{L}(\theta) = \frac{|\Omega|}{N_1}\sum_{i=1}^{N_1}\big|\mathcal{D}\hat{u}(x_i;\theta) - f(x_i)\big|^2 + \lambda\,\frac{|\partial\Omega|}{N_2}\sum_{j=1}^{N_2}\big|\mathcal{B}\hat{u}(y_j;\theta) - g(y_j)\big|^2, \tag{3.2}$$
and $\theta$ is updated by
$$\theta^{(n+1)} = \theta^{(n)} - \eta_n\,\nabla_\theta \hat{L}\big(\theta^{(n)}\big),$$
where $\eta_n$ denotes the learning rate of the solution network at the $n$-th epoch, defined by the schedule in (3.4), and $N$ is the total number of epochs. The index $n$ starts from 0; i.e., the learning rate for the first epoch is $\eta_0$.
3.2 Derivatives of Networks
In order to calculate residual errors, derivatives of $\hat{u}$ need to be evaluated. In the implementation, numerical differentiation (see Appendix C) with step size $\epsilon$ is used to estimate partial derivatives, except in Section 4.4, where Autograd is used since it outputs exact partial derivatives in 2D. For example, let $\hat{u}(x;\theta)$ be the solution network with $x \in \mathbb{R}^d$ and parameters $\theta$. The numerical differentiation estimate of $\partial\hat{u}/\partial x_i$ is defined by
$$\frac{\partial \hat{u}}{\partial x_i}(x;\theta) \approx \frac{\hat{u}(x + \epsilon e_i;\theta) - \hat{u}(x - \epsilon e_i;\theta)}{2\epsilon},$$
where $e_i$ is a vector of all $0$'s except a $1$ in the $i$-th entry and $\epsilon$ is the step size. Note that, as found in [gu2021selectnet], the truncation error of this approximation is small enough to be ignored in practice compared with the final solution error.
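A central-difference helper along these lines might be (a sketch; the actual implementation and step size may differ):

```python
import numpy as np

def central_diff(u, x, i, eps=1e-4):
    """Central-difference estimate of du/dx_i at points x (shape (n, d));
    the truncation error is O(eps**2)."""
    e = np.zeros(x.shape[-1])
    e[i] = eps                     # perturb only the i-th coordinate
    return (u(x + e) - u(x - e)) / (2 * eps)

# usage: for u(x) = sum_j x_j^2, the exact derivative is du/dx_i = 2 x_i
x = np.array([[1.0, 2.0, 3.0]])
d1 = central_diff(lambda z: np.sum(z ** 2, axis=-1), x, 1)
```

For this quadratic test function the central difference is exact up to rounding, since the $O(\epsilon^2)$ truncation term vanishes.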
3.3 Training Solution Networks with Active Sampling
3.4 Uniform Annular Distribution
The uniform annular approach is employed to sample points in $\Omega$ in the first step of Algorithms 2 and 3. It divides the interior of the domain into $n_a$ annuli and generates $N_1/n_a$ samples uniformly in each annulus, where $N_1$ is the number of training points in $\Omega$ in each epoch and $n_a$ is the number of annuli; in practice, $N_1$ should be divisible by $n_a$. This distribution is also utilized in [gu2021selectnet]. As a side note, the uniform annular distribution can also be understood as a biased distribution that covers a broader range of points in $\Omega$. Unlike the adaptive sampling distribution in (2.2), however, the uniform annular distribution is not necessarily a better distribution for the model to learn from for every PDE example.
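A possible sketch of the uniform annular sampler for a ball-shaped domain follows (the exact scheme in [gu2021selectnet] may differ, e.g., in how radii are weighted within each annulus; here radii are drawn uniformly per annulus and directions uniformly on the sphere):

```python
import numpy as np

def uniform_annular(n_total, n_annuli, d, radius=1.0, rng=None):
    """Sample n_total points in a d-dimensional ball of given radius by
    splitting [0, radius] into n_annuli equal-width annuli and drawing
    n_total / n_annuli points in each annulus (uniform radius within the
    annulus, uniform random direction)."""
    assert n_total % n_annuli == 0, "n_total should be divisible by n_annuli"
    rng = np.random.default_rng() if rng is None else rng
    per = n_total // n_annuli
    pts = []
    for j in range(n_annuli):
        r = rng.uniform(j * radius / n_annuli, (j + 1) * radius / n_annuli, per)
        v = rng.standard_normal((per, d))
        v /= np.linalg.norm(v, axis=1, keepdims=True)   # random unit directions
        pts.append(v * r[:, None])
    return np.concatenate(pts)
```

Compared with sampling uniformly by volume, this scheme places proportionally more points near the center, which is the "broader coverage" bias mentioned above.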
3.5 Accuracy Assessment
To measure the accuracy of the model, 10000 testing points, different from the training points, are sampled from the uniform annular distribution to approximate the overall relative error and the relative maximum modulus error at these points. Note that in the implementation the random seed is fixed, which implies that for the same PDE example all models are tested on the same set of uniform annular points. The overall relative error and the relative maximum modulus error are, respectively, defined as
$$e_2 = \frac{\Big(\sum_{i=1}^{10000}\big|\hat{u}(x_i;\theta) - u(x_i)\big|^2\Big)^{1/2}}{\Big(\sum_{i=1}^{10000}\big|u(x_i)\big|^2\Big)^{1/2}}, \qquad e_\infty = \frac{\max_{1\le i\le 10000}\big|\hat{u}(x_i;\theta) - u(x_i)\big|}{\max_{1\le i\le 10000}\big|u(x_i)\big|}.$$
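The two error metrics can be computed as follows (a minimal sketch with illustrative function names):

```python
import numpy as np

def relative_errors(u_hat, u_true, x_test):
    """Overall relative L2 error and relative maximum modulus error,
    both approximated on a fixed set of test points."""
    diff = u_hat(x_test) - u_true(x_test)
    ref = u_true(x_test)
    rel_l2 = np.sqrt(np.sum(diff ** 2) / np.sum(ref ** 2))
    rel_max = np.max(np.abs(diff)) / np.max(np.abs(ref))
    return rel_l2, rel_max

# toy usage: a network off by a uniform 10% has both relative errors 0.1
u_true = lambda x: x[:, 0]
u_hat = lambda x: 1.1 * x[:, 0]
xt = np.array([[1.0], [2.0]])
e2, emax = relative_errors(u_hat, u_true, xt)
```

Fixing the test set (via the fixed seed mentioned above) makes these numbers directly comparable across models.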
4 Numerical Experiment
In this section, our adaptive sampling is tested on three types of PDE examples: elliptic, parabolic, and hyperbolic equations, in various dimensions. Adaptive sampling is compared with RAR in the first example. The RAR algorithm used in this work (see Algorithm 5) is slightly different from the algorithm proposed in [lu2020deepxde], so that the comparison keeps the number of training points fixed. The number of points added per refinement in the algorithm of [lu2020deepxde] is chosen to be 2000 for testing. Finally, our adaptive sampling is applied within some existing frameworks to test whether it is compatible with them. The parameter setting for Sections 4.1 and 4.2 is listed in Table 3.
| Parameter | Value |
|---|---|
| Number of Epochs | 20000 |
| Number of Training Points in $\Omega$ in Each Epoch | 12000 |
| Number of Training Points on $\partial\Omega$ in Each Epoch, for Example (4.1) | 12000 |
| Number of Training Points on $\partial\Omega$ in Each Epoch, for Example (4.2) | 12000 + |
| Boundary Loss Weighting Term $\lambda$ in (3.2) | 10 |
4.1 Elliptic Equation
We consider a nonlinear elliptic equation: Find such that
where the problem data are given appropriately so that the exact solution is the prescribed function.
Adaptive sampling (AS) is tested on this PDE example in various high dimensions: 10, 20, and 100. Adaptive sampling is compared with the basic least squares method discussed in Section 1.2.2. Since the computational costs of the two models differ, for this particular example, error decay versus wall-clock time in seconds is also displayed to demonstrate the efficiency of adaptive sampling. Running times are obtained by training on a Quadro RTX 8000 graphics card; only training time is counted, excluding accuracy assessment and data saving. The overall relative error appears very large in 100 dimensions because the true solution vanishes in most of the volume when the dimension is high.
| Dimension | AS | Basic | RAR | Error Reduced by AS |
|---|---|---|---|---|
Time versus error decay in 10, 20, and 100 dimensions is demonstrated in Figures 4.1, 4.4, and 4.5, respectively. Figure 4.2 shows the surface of the ground-truth solution and of the network solutions with and without adaptive sampling. Figure 4.3 shows the heat-map of the absolute difference on this surface, where the color bar represents the absolute difference between the ground-truth solution and the network solution. These results clearly show that adaptive sampling helps to reduce the error, especially the relative maximum modulus error. More precisely, adaptive sampling significantly reduces the variance of the error distribution over the domain, so that, compared with the case without adaptive sampling, it is less likely that there are volumes/areas where the error is much larger than elsewhere. Moreover, adaptive sampling performs better than RAR in high-dimensional settings; it must be pointed out that RAR works well in low dimensions, and the paper [lu2020deepxde] is mainly concerned with low dimensions. In high dimensions, however, the effect of RAR is not strong enough. It is worth mentioning that the residual error and the actual error are not proportional to each other: in some cases, adaptive sampling, even with a small power $k$, tends to over-focus on volumes with extremely large residuals, in which case the RAR algorithm is a better fit for refining the network solution.
4.2 Parabolic Equation
We consider the following parabolic equation:
where the problem data are given appropriately so that the exact solution is the prescribed function.
Table 5 shows the overall relative error and relative maximum modulus error obtained by adaptive sampling and by the basic residual model at the end of 20000 epochs. Error decay versus the number of epochs in 10 and 20 dimensions is illustrated in Figures 4.6 and 4.9, respectively. Figure 4.7 shows the surface of the ground-truth solution and of the network solutions with and without adaptive sampling. Figure 4.8 shows the heat-map of the absolute difference on this surface, where the color bar represents the absolute difference between the ground-truth solution and the network solution. This example clearly demonstrates the advantage of adaptive sampling in reducing the variance of the error distribution over the domain, and in particular in reducing the relative maximum modulus error.
| Dimension | Adaptive Sampling | Basic Model | Error Reduced by AS |
|---|---|---|---|
4.3 Hyperbolic Equation
Lastly, we consider the following hyperbolic equation:
where the problem data are given appropriately so that the exact solution is the prescribed function, and the Laplace operator is taken in the spatial variable only.
Table 6 shows the overall relative error and relative maximum modulus error obtained by adaptive sampling and by the basic model at the end of 20000 epochs. Error decay versus the number of epochs in 10 and 20 dimensions is presented in Figures 4.10 and 4.13, respectively. Figure 4.11 shows the surface of the ground-truth solution and of the network solutions with and without adaptive sampling. Figure 4.12 shows the heat-map of the absolute difference on this surface, where the color bar represents the absolute difference between the ground-truth solution and the network solution. This example again shows that adaptive sampling reduces the variance of the error distribution over the domain and, in particular, the relative maximum modulus error.
| Dimension | Adaptive Sampling | Basic Model | Error Reduced by AS |
|---|---|---|---|
4.4 Poisson’s Equation
In this example, we show that adaptive sampling is compatible with some recent frameworks, such as the Deep Ritz method (DRM) [Deep_Ritz], the Deep Galerkin method (DGM) [other3], and weak adversarial networks (WAN) [WAN]. To this end, we consider a 2D Poisson's equation: Find $u$ such that
The exact solution is given by the prescribed function. All models are trained for 300 seconds on an NVIDIA Quadro P1000 graphics card. Each framework, with and without adaptive sampling, is run for 30 trials to approximate the PDE. The error obtained with adaptive sampling is compared with the error obtained without it to test whether adaptive sampling is compatible with the various frameworks and lowers the error. Note that different random seeds are used in different trials, while within each trial the same torch random seed is used with and without adaptive sampling, which implies that the networks are initialized with the same parameters. In this test, a ResNet [he2015deep] with 4 residual blocks is used with Tanh as the activation function. Each epoch uses 1024 uniform points on the boundary and 1024 in the domain for training. Moreover, instead of numerical differentiation, PyTorch Autograd is utilized to evaluate derivatives.
Table 7 shows statistics of the errors obtained in 30 trials with and without adaptive sampling. Figure 4.4 shows the average error of the 30 trials versus time in seconds for DGM, DRM, and WAN, each with and without adaptive sampling. These results clearly show that adaptive sampling is compatible with all three frameworks: the errors obtained with adaptive sampling are lower than those without it. Note that, as observed in the previous high-dimensional examples, the advantage of adaptive sampling becomes more significant when the dimension is high.
| Framework | Mean | Standard Deviation | Minimum Value | Coefficient of Variation |
|---|---|---|---|---|
Acknowledgement W.G. was partially supported by the National Science Foundation under grant DMS-2050133. C.W. was partially supported by the National Science Foundation Award DMS-2136380. The authors would like to express their deepest appreciation to Dr. Haizhao Yang from Purdue University for his invaluable insights, unparalleled support, and constructive advice. Thanks should also go to Dr. Yiqi Gu from the University of Hong Kong for helpful discussions.
Appendix A RAR Sampling
Appendix B Preliminary Definition
Let $X$ be a real random variable and $f$ be any real function. The expectation of $f(X)$ is defined by
$$\mathbb{E}[f(X)] = \int f(x)\,p(x)\,dx,$$
where $p$ is the probability density function of $X$.
Let $X_1,\dots,X_n$ be random variables that follow the uniform distribution on $\Omega$ and let $f$ be any function defined on $\Omega$ with density $p$. The plain Monte-Carlo estimate of $\mathbb{E}[f(X)]$, the expectation of $f(X)$, is defined by
$$\hat{\mu} = \frac{|\Omega|}{n}\sum_{i=1}^{n} f(X_i)\,p(X_i),$$
where $|\Omega|$ denotes the volume (length in 1D and area in 2D) of $\Omega$ and $n$ is the number of points sampled for estimating $\mathbb{E}[f(X)]$. Note that $\hat{\mu}$ is an unbiased estimate of $\mathbb{E}[f(X)]$.
Appendix C Numerical Differentiation
A real scalar function $f$ of $d$ variables is a rule that assigns a number $f(x)$ to an array of $d$ numbers $x = (x_1,\dots,x_d)$.
Let $f$ be a real scalar function of $d$ variables defined on $\Omega$. The partial derivative of $f$ at a point $x$ with respect to $x_i$ is given by
$$\frac{\partial f}{\partial x_i}(x) = \lim_{h\to 0}\frac{f(x + h e_i) - f(x)}{h},$$
where $e_i$ is an array of all zeros except a 1 for the $i$-th element.
Let $f$ be a real scalar function of $d$ variables defined on $\Omega$, $x \in \Omega$, and $\epsilon > 0$. The numerical differentiation estimate of $f$ with respect to $x_i$ is given by
$$\frac{\partial f}{\partial x_i}(x) \approx \frac{f(x + \epsilon e_i) - f(x - \epsilon e_i)}{2\epsilon}.$$
Note that the truncation error is of order $O(\epsilon^2)$.
Appendix D Self-normalized Sampling
Let $p$ be the desired distribution of a continuous variable over the domain $\Omega$. Suppose that we cannot compute $p$ directly but have the unnormalized version $\tilde{p}$ such that
$$p(x) = \frac{1}{c}\,\tilde{p}(x),$$
where $c$ is the unknown normalizing constant. The self-normalized sampling algorithm is:
In the first step, $N$ should be large even if $n$ is small; when $n$ is large, $N$ can be equal to, or even smaller than, $n$, as long as $N$ is still large. In the numerical experiments in this work, 12,000 collocation points are needed, so $N$ and $n$ are both set to be 12,000.
We now show that this algorithm makes sense. Let $A$ be any subset of $\Omega$.
The probability of $A$ in the sense of our target distribution is
$$P(A) = \int_A p(x)\,dx = \frac{\int_A \tilde{p}(x)\,dx}{\int_\Omega \tilde{p}(x)\,dx} \approx \frac{\dfrac{|A|}{n_A}\sum_{i=1}^{n_A}\tilde{p}(x_i)}{\dfrac{|\Omega|}{n_\Omega}\sum_{j=1}^{n_\Omega}\tilde{p}(y_j)}. \tag{D.1}$$
Note that the last step is Monte-Carlo integration, where $x_1,\dots,x_{n_A}$ and $y_1,\dots,y_{n_\Omega}$ are uniform random points in $A$ and in $\Omega$, respectively. This approximation is governed by the Law of Large Numbers; therefore, $n_A$ and $n_\Omega$ are expected to be large.
Now, suppose that out of the $N$ points generated in Algorithm 6, there are $N_A$ points in $A$. When $N$ and $N_A$ are large, by the Law of Large Numbers, $N_A/N \approx |A|/|\Omega|$. Hence, applying these $N_A$ and $N$ uniform points to Equation (D.1), the probability becomes
$$P(A) \approx \frac{\dfrac{|A|}{N_A}\sum_{x_i\in A}\tilde{p}(x_i)}{\dfrac{|\Omega|}{N}\sum_{i=1}^{N}\tilde{p}(x_i)} \approx \frac{\sum_{x_i\in A}\tilde{p}(x_i)}{\sum_{i=1}^{N}\tilde{p}(x_i)},$$
and this is exactly the probability of $A$ under the discrete p.m.f. in Algorithm 6.
One advantage of this algorithm is that it generates points in a completely parallel fashion, whereas in the Metropolis-Hastings algorithm each draw is based on the previous draw; thus, the MH algorithm cannot be parallelized.