1 Introduction
1.1 Problem Statement
High-dimensional partial differential equations are widely used in modeling complicated phenomena, e.g., the Hamilton–Jacobi–Bellman (HJB) equation [HJB] in control theory and the Schrödinger equation [sch_equation] in quantum mechanics. There are various traditional numerical methods for solving partial differential equations, especially for low-dimensional problems. However, when it comes to high-dimensional problems, the curse of dimensionality becomes a major computational challenge for many traditional methods; e.g., in the finite difference method [FDM], the computational complexity grows exponentially in the number of dimensions. Consequently, traditional methods are often computationally intractable for high-dimensional problems.

In approximation theory, deep neural networks (DNNs) have proven to be a more effective tool for function approximation than traditional tools such as finite elements and wavelets. In
[yarotsky1, yarotsky2, shijun2, shijun3, shijun6, hon2021simultaneous], it has been shown that ReLU-type DNNs can provide more attractive approximation rates than traditional tools for approximating continuous and smooth functions, with explicit error characterization [shijun2, shijun3, shijun6], in the $L^\infty$ norm [yarotsky1, yarotsky2, shijun2, shijun3, shijun6] and in Sobolev norms [hon2021simultaneous]. If target functions lie in spaces of functions with special integral representations [barron1993, weinan1, weinan2, siegel2021sharp, SIEGEL2020313, JCM39801], ReLU-type DNNs can lessen the curse of dimensionality in function approximation. Armed with advanced activation functions, DNNs are able to conquer the curse of dimensionality in approximation theory for approximating continuous functions
[shijun4, shijun5, yarotsky3, shijun7]. In generalization theory, it has been shown that DNNs can achieve a dimension-independent error rate for solving PDEs [doi:10.1137/19M125649X, LuoYang2020, lu2021priori, chen2021representation]. These theoretical results have justified the recent application of DNNs to solving high-dimensional PDEs [Han8505, raissi2017physicsI, raissi2017physicsII, Deep_Ritz, DLPDE1, DLPDE2l, IntNet, DLPDE3]. Two main advantages of the deep-learning-based methods presented in these studies are summarized as follows: first, the curse of dimensionality can be weakened, or even overcome, for certain classes of PDEs [arnulf1, arnulf2]; second, deep-learning-based PDE solvers are mesh-free and avoid the tedious mesh generation that traditional solvers require for complex domains. Thus, deep-learning-based methods have shown tremendous potential to surpass other numerical methods, especially for solving high-dimensional PDEs in complex domains.

Deep-learning-based PDE solvers have been applied to problems in many areas of practical engineering and the life sciences [other1, other2, other3]. However, most deep-learning-based solvers are computationally expensive because convergence is usually slow and is not guaranteed to reach the desired precision. A recent study [gu2021selectnet] proposed a self-paced learning framework, inspired by an algorithm named "SelectNet" [liu2019selectnet] originally designed for image classification, to ease the issue of slow convergence. In [gu2021selectnet], new neural networks called selection networks are trained simultaneously with the residual-model-based solution network. The selection network learns to weight training points so as to guide the solution network to learn more from points with larger residual errors.
In this paper, we aim to address the aforementioned issue of slow convergence by introducing active-learning-based adaptive sampling techniques that select the most informative training examples for the networks. The algorithm is motivated by the active learning approach primarily used in supervised learning in computer vision [AL1, AL2, AL3, AL4], which chooses high-quality data to label so that the network learns faster and less human labeling effort is required.

1.2 Related Work
1.2.1 Relevant Literature Review
Mesh-based neural networks have been proposed to approximate solution operators for specific types of differential equations [LEE1990110, Lagaris_1998, MALEK2006260, meshbased_1, meshbased_2]. However, these methods become computationally intractable in high dimensions since the degrees of freedom explode exponentially as the number of dimensions increases. Therefore, mesh-free models [meshfree2, E_2017, DLPDE2l, raissi2017physicsI, meshfree1, WAN, chen2021friedrichs] are the key to solving high-dimensional PDEs. Mesh-free models use sampled points and the Monte Carlo method to approximate the objective loss function, avoiding mesh generation that is computationally prohibitive in high dimensions. In an early study
[DLPDE4], PDEs are solved via deep neural networks by directly minimizing the degree to which the network approximation violates the PDE at prescribed collocation points; boundary conditions are treated as hard constraints by constructing boundary governing functions that satisfy them exactly, although designing such functions is usually difficult for complicated PDE problems. In a later study concerning boundary value problems [DLPDE5], boundary residual errors are instead included in a single loss function to enforce boundary conditions. The latter approach is more commonly used since it does not require auxiliary functions that are problem-dependent and difficult to construct.

Active learning methods are not commonly utilized in deep-learning-based PDE solvers. In [activePDE1], active learning is used to select "additional observation locations" that "minimize the variance of the PDE coefficient" and "minimize the variance of the solution of the PDE". In [lu2020deepxde], residual-based adaptive refinement (RAR) is proposed to adaptively select points with large residuals. RAR first draws uniform points and computes residual errors at these points, and then adds a certain number of large-residual points to the training set. RAR and our active-learning-based adaptive sampling (introduced in Section 2.2) share the methodology of selecting large-residual points from among uniform points. The difference is that our adaptive sampling ranks all points: low-residual points are less likely to be selected, whereas RAR still keeps them; large-residual points are also ranked, i.e., even if two points both have large residuals, the one with the larger residual error is more likely to be selected. In contrast, RAR does not differentiate them and simply chooses the points with the largest residuals. In [Adaptive1], various adaptive strategies have also been introduced. For example, in their adaptive-sampling-in-time approach, collocation points initially come from a small specified time interval that gradually expands. As shown in [Adaptive1], this approach considerably improves convergence and overall accuracy. However, the number of training points accumulates in this approach and the computation becomes more expensive. In our adaptive sampling approach, the number of training points does not grow; therefore, in some cases our adaptive sampling is a better fit, although the approaches in [Adaptive1] are already highly sophisticated.

1.2.2 Deep-Learning-based Least Squares Method
Consider the following boundary value problem, which we use to introduce the deep-learning-based least squares method: find the unknown solution $u$ such that

(1.1) $\mathcal{D}u(x) = f(x)$ in $\Omega$, $\quad \mathcal{B}u(x) = g(x)$ on $\partial\Omega$,

where $\partial\Omega$ is the boundary of the domain $\Omega$, and $\mathcal{D}$ and $\mathcal{B}$ are differential operators in $\Omega$ and on $\partial\Omega$, respectively. The goal is to train a neural network, denoted by $\phi(x;\theta)$, where $\theta$ is the set of network parameters, to approximate the ground truth solution $u$ of the PDE (1.1). In the least squares method (LSM), the PDE is solved by finding the optimal set of parameters $\theta^*$ that minimizes the loss function; i.e.,

(1.2) $\min_{\theta} L(\theta) := \|\mathcal{D}\phi(\cdot;\theta) - f\|_{L^2(\Omega)}^2 + \lambda \|\mathcal{B}\phi(\cdot;\theta) - g\|_{L^2(\partial\Omega)}^2 \approx \frac{|\Omega|}{N_1}\sum_{i=1}^{N_1} |\mathcal{D}\phi(x_i;\theta) - f(x_i)|^2 + \lambda\,\frac{|\partial\Omega|}{N_2}\sum_{j=1}^{N_2} |\mathcal{B}\phi(y_j;\theta) - g(y_j)|^2,$

where $\lambda$ is a positive hyper-parameter that weights the boundary loss. The last step is a Monte Carlo approximation with $\{x_i\}_{i=1}^{N_1}$ and $\{y_j\}_{j=1}^{N_2}$ being $N_1$ and $N_2$ collocation points sampled from the respective probability densities on $\Omega$ and $\partial\Omega$. This model will be referred to as the basic model in this paper. For a time-dependent PDE, the temporal coordinate can be regarded as another spatial coordinate to build a similar least squares model.
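As an illustration of the basic model, the following sketch evaluates a Monte Carlo loss of the form (1.2) for a toy 1D Poisson problem, with a one-parameter ansatz standing in for the neural network; the problem, ansatz, and names are illustrative assumptions, not the paper's setup.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 1D instance of the least squares loss (1.2): solve -u'' = f on (0, 1)
# with u(0) = u(1) = 0, where f(x) = pi^2 sin(pi x), so the exact solution is
# u(x) = sin(pi x).  The "network" is replaced by the one-parameter ansatz
# u_theta(x) = theta * sin(pi x).
def residual_loss(theta, n_interior=1000, lam=10.0):
    x = rng.uniform(0.0, 1.0, n_interior)                # interior collocation points
    f = np.pi**2 * np.sin(np.pi * x)
    pde_res = theta * np.pi**2 * np.sin(np.pi * x) - f   # -u_theta''(x) - f(x)
    interior = np.mean(pde_res**2)                       # Monte Carlo estimate over Omega
    # boundary term: squared values of u_theta at the two boundary points
    boundary = (theta * np.sin(0.0))**2 / 2 + (theta * np.sin(np.pi))**2 / 2
    return interior + lam * boundary
```

The loss vanishes (up to floating-point noise) at the exact solution `theta = 1` and is large away from it, mirroring how the basic model drives the network toward the PDE solution.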
1.3 Organization
This paper is structured as follows. The active-learning-based adaptive sampling is introduced in Section 2. The detailed numerical implementation of the proposed sampling method is discussed in Section 3. In Section 4, extensive numerical experiments demonstrate the efficiency of the proposed method.
2 Active-Learning-based Adaptive Sampling
2.1 Overview of Active Learning
Active learning [AL5, AL6] is a machine learning method in which the learning algorithm inspects unlabeled data and interactively chooses the most informative data points to learn from. Active learning has primarily been applied in computer vision tasks [AL1], where it has shown extraordinary performance. Active learning for deep learning proceeds in rounds: in each round, the current model serves to assess the informativeness of training examples. It is usually used in supervised or semi-supervised learning, where there is a large amount of unlabeled data and the algorithm chooses the most informative data points to get labeled for training. The two most common metrics of informativeness are uncertainty [uncertainty] and diversity [diversity]. In uncertainty sampling, the algorithm tries to find training examples that the current model is most uncertain about, as a proxy for the model output being incorrect. For example, in a classification problem, this uncertainty can be measured from the current model prediction: if, for a training example, the model's output probability distribution over classes has similar values rather than a high probability for a single class, then the current model is uncertain about that example. In diversity sampling [diversity2], the model seeks to identify training examples that best represent the diversity of the data. It is worth mentioning that query by committee (QBC) [QBC1] is also a common active learning approach.

2.2 Adaptive Sampling (AS)
In supervised or semi-supervised learning, the uncertainty sampling algorithm chooses the data points that the current model is most uncertain about to be labeled and trained on. The least squares method for solving PDEs is an unsupervised learning approach because there is no label in the loss function. However, the idea of uncertainty sampling can still be applied to choose the most informative collocation points to learn from. Hence, inspired by uncertainty sampling, we propose active-learning-based adaptive sampling to select high-quality training examples that speed up the convergence of PDE solvers built on the basic least squares idea in (1.2).

In our adaptive sampling, the objective is to preferentially choose collocation points with larger absolute residual errors. Intuitively, a large residual error at a point can be viewed as a proxy for the model being wrong to a greater extent at that particular point. The fundamental methodology of adaptive sampling is to sample from a biased importance distribution that attaches higher priority to important volumes/regions of the domain. For simplicity, the absolute residual error of the network approximation at a point is defined as follows:
(2.1) $r(x;\theta) = |\mathcal{D}\phi(x;\theta) - f(x)|$ for $x \in \Omega$, $\quad r(x;\theta) = |\mathcal{B}\phi(x;\theta) - g(x)|$ for $x \in \partial\Omega$,

where $\phi(\cdot;\theta)$ is the solution network. Adaptive sampling is thus proposed to choose collocation points following a distribution, denoted by $p$, whose probability density function is proportional to the residual errors in $\Omega$ and on $\partial\Omega$, respectively. More precisely, we define

(2.2) $p(x) = \frac{1}{Z}\, r(x;\theta)^k,$

where $Z$ is an unknown normalizing constant and $k$ is a nonnegative constant exponent that controls the strength of adaptive sampling. With larger $k$, points with larger residual errors are more likely to be sampled; when $k = 0$, adaptive sampling has no effect and points are sampled uniformly.
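The effect of the exponent $k$ can be illustrated numerically; the residual values below are made-up stand-ins, and the helper name is our own.

```python
import numpy as np

# Effect of the exponent k in (2.2): p(x) proportional to r(x)^k.
# With k = 0 every point is equally likely; larger k concentrates
# probability mass on points with large residuals.
def sampling_probs(residuals, k):
    w = np.abs(residuals) ** k
    return w / w.sum()          # self-normalization removes the unknown constant Z

r = np.array([0.1, 0.1, 0.1, 1.0])   # toy residual values at four points
p0 = sampling_probs(r, 0.0)          # k = 0: uniform, each point gets 1/4
p2 = sampling_probs(r, 2.0)          # k = 2: the large-residual point dominates
```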
As one may have noticed, the unknown normalizing constant $Z$ is difficult to calculate. Therefore, one may not be able to sample points directly from $p$. Two techniques to simulate observations from $p$ are reviewed in the following.
2.2.1 Metropolis-Hastings Sampling
The Metropolis-Hastings (MH) algorithm [MH1] is a commonly used Markov chain Monte Carlo (MCMC) method [MCMC1]: "Markov chain" because each new sample depends on the current sample, and "Monte Carlo" because it simulates observations from a target distribution that is difficult to sample from directly. The Metropolis-Hastings algorithm is adopted to select points from the target distribution $p$. Note that all algorithms below sample points in $\Omega$ for training; the same logic applies to sampling on $\partial\Omega$.
In Algorithm 1, the first $m$ points, called "burn-ins", are discarded, where $m$ is a nonnegative integer. This is a computational trick to correct the bias of the early samples; in practice, $m$ is chosen empirically. However, to the best of our knowledge, there is no rigorous justification for burn-ins. Note that in the actual implementation, Algorithm 2, a vectorized version following the coding style of Python, will be deployed.
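A minimal Metropolis-Hastings sketch targeting $p \propto r^k$ on $[0,1]$, with a symmetric Gaussian random-walk proposal. The residual profile, step size, and burn-in count are illustrative assumptions, not the paper's Algorithms 1 and 2.

```python
import numpy as np

rng = np.random.default_rng(1)

# Metropolis-Hastings with a symmetric random-walk proposal: accept a move
# with probability min(1, p(prop)/p(x)); the unknown constant Z cancels in
# the ratio.  The first `burn_in` samples are discarded.
def mh_sample(unnorm, n, burn_in=500, step=0.2):
    x = 0.5
    out = []
    for i in range(n + burn_in):
        prop = x + step * rng.normal()
        if 0.0 <= prop <= 1.0:                     # density is zero outside [0, 1]
            if rng.uniform() < unnorm(prop) / max(unnorm(x), 1e-300):
                x = prop
        if i >= burn_in:
            out.append(x)
    return np.array(out)

r = lambda x: np.exp(-50 * (x - 0.8) ** 2)   # toy residual peaked near x = 0.8
samples = mh_sample(lambda x: r(x) ** 2, 4000)   # target proportional to r^2
```

The chain concentrates around the large-residual region near $x = 0.8$, as the adaptive sampling intends.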
2.2.2 Self-normalized Sampling
The Metropolis-Hastings algorithm can sometimes be computationally expensive, and its convergence can be slow. Therefore, an alternative algorithm is proposed to sample training examples. It is faster than the MH algorithm because it generates points entirely in parallel, whereas in the MH algorithm each sample depends on the previous one, so the MH algorithm cannot be parallelized. This new sampling technique will be used in the numerical experiments.
This algorithm is named "self-normalized sampling" because the discrete probability density function it constructs normalizes itself. While it is well known that the marginal distribution of points generated by the Metropolis-Hastings algorithm converges to the target distribution, the self-normalized sampling algorithm also efficiently generates sets of points that concentrate on large-residual-error volumes in the experiments. Moreover, in Appendix D, we discuss why this self-normalized algorithm makes sense.
In general, self-normalized sampling is less computationally expensive than the Metropolis-Hastings algorithm. However, in some rare cases, the residual error is large only in some extremely small volumes and relatively low elsewhere; the points generated in step 1 of Algorithm 3 may then fail to include any point inside these large-error volumes. A naive remedy is to generate more points in step 1 so that a broader range of points is covered. On the other hand, the Metropolis-Hastings algorithm can still sample points in these small volumes, although the number of burn-ins needs to be very large to reach them.
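A minimal sketch of the self-normalized sampling idea: draw uniform candidates, weight each by its residual to the power $k$, normalize the weights into a discrete p.m.f., and resample training points from it. The residual function and constants are illustrative stand-ins, not the paper's Algorithm 3.

```python
import numpy as np

rng = np.random.default_rng(2)

# Self-normalized sampling: the discrete p.m.f. w / w.sum() normalizes
# itself, so the unknown constant Z in (2.2) never needs to be computed.
def self_normalized_sample(residual_fn, m_candidates, n_points, k=2.0):
    cand = rng.uniform(0.0, 1.0, m_candidates)      # step 1: uniform candidates
    w = np.abs(residual_fn(cand)) ** k              # residual-power weights
    probs = w / w.sum()                             # self-normalized p.m.f.
    idx = rng.choice(m_candidates, size=n_points, p=probs, replace=True)
    return cand[idx]

res = lambda x: np.exp(-50 * (x - 0.8) ** 2)        # toy residual peaked near 0.8
pts = self_normalized_sample(res, 12000, 12000)
```

All candidates are generated and weighted at once, which is why this approach parallelizes while MH does not.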
A demonstration of points generated in 2D by Algorithms 2 and 3 is presented. Figure 2.2 shows an example of a distribution of squared residual errors, together with the distribution of 500 points generated by a uniform annular distribution. Figures 2.3 and 2.4 show the distributions of points generated by the Metropolis-Hastings algorithm and by self-normalized sampling, respectively, where the uniform distribution used in Algorithms 2 and 3 is replaced by the uniform annular distribution [gu2021selectnet]. The number of points falling inside the ellipse under each approach is reported in Table 1.
Approach  Number of Points
Uniform Annular  61
Metropolis-Hastings  210
Self-normalized Sampling  255
2.3 Adaptive Sampling: A Statistical Viewpoint
Adaptive sampling is motivated by active learning, which takes the residual error as a proxy for the model being wrong; it is thus natural to connect adaptive sampling with uncertainty sampling, which takes uncertainty as such a proxy. Adaptive sampling can also be explained from a statistical viewpoint. The definitions of expectation and the Monte Carlo estimate are given in Appendix B.

Definition 2.1 (Owen, 2013 [mcbook]).
Let $X$ be a random variable on $\Omega$ with density $p$ and let $f$ be any function defined on $\Omega$. Let $q$ be another probability density on $\Omega$ such that $q(x) = 0$ implies $f(x)p(x) = 0$. $q$ is also called the biasing distribution or importance distribution. The importance sampling estimate of $\mu = \mathbb{E}_p[f(X)]$ is defined by
$$\hat{\mu}_q = \frac{1}{n}\sum_{i=1}^{n} \frac{f(x_i)\,p(x_i)}{q(x_i)}, \qquad x_i \sim q.$$
Denote by $U$ the uniform distribution on $\Omega$. Note that if $q = p$, then $p$ and $q$ cancel each other out and the plain Monte Carlo estimate is recovered.
Theorem 2.1 (Owen, 2013 [mcbook]). The importance sampling estimate $\hat{\mu}_q$ is an unbiased estimate of $\mu$; i.e., $\mathbb{E}_q[\hat{\mu}_q] = \mu$.
As one may have noticed, Theorem 2.1 can be applied to estimate the expectations in Equation (1.2), where $f$ is the squared residual, the nominal density $p$ is the uniform density $U$, and the biasing density $q$ is the one defined in (2.2); the importance sampling estimate of the loss follows accordingly.

One way to justify adaptive sampling is to adopt the notion of active learning; adaptive sampling can also be understood as a better estimation approach. In (1.2), the plain Monte Carlo estimate faces slow convergence if, for example, the squared residual error is large in a small volume $A$ but low, or nearly zero, outside of $A$: the plain Monte Carlo estimate could fail to have even one point inside the volume $A$. Faster convergence can be achieved "by sampling from a distribution that overweights the important region, hence the name importance sampling" [mcbook]. This paper is not concerned with finding an unbiased estimate of the expectation but with finding the best parameter $\theta$ that minimizes it. Therefore, collocation points will be sampled from the density in (2.2) without altering the formulation in the last step of (1.2), since the minimizer depends only on $\theta$.
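A small numerical illustration of importance sampling on $[0,1]$, with an indicator function concentrated on a tiny set; the function $f$ and the biasing density $q$ are illustrative choices, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(3)

# Estimate mu = E_U[f(X)] for f concentrated on a tiny set: plain Monte Carlo
# rarely hits the set, while a biasing density q that overweights the
# important region produces a much lower-variance estimate.
f = lambda x: (np.abs(x - 0.5) < 0.01).astype(float)   # indicator of a small set
mu = 0.02                                              # exact E_U[f] = length of the set

def plain_mc(n):
    return f(rng.uniform(0, 1, n)).mean()

def importance(n):
    # q: mixture putting 90% of its mass uniformly on [0.45, 0.55]
    on_peak = rng.uniform(0, 1, n) < 0.9
    x = np.where(on_peak, rng.uniform(0.45, 0.55, n), rng.uniform(0, 1, n))
    q = 0.9 * 10.0 * (np.abs(x - 0.5) <= 0.05) + 0.1   # mixture density at x
    return (f(x) / q).mean()                           # p = U has density 1
```

Both estimators target the same $\mu$, but the importance sampling estimator sees the rare event in roughly 18% of its draws instead of 2%.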
3 Setup of Numerical Implementation
3.1 Network Architecture
Note that the adaptive sampling introduced in Section 2 is an active-learning-based sampling technique that is independent of the choice of network architecture. In the numerical implementation, the solution network $\phi(x;\theta)$ is a feed-forward neural network in which each layer is fully connected. The network can be expressed as a composition of linear transformations and nonlinear activation functions; i.e.,

(3.1) $\phi(x;\theta) = W_L\,\sigma\big(W_{L-1}\cdots\sigma(W_1 x + b_1)\cdots + b_{L-1}\big) + b_L,$

where $L$ is the depth of the network, $\sigma$ is a nonlinear activation function, $W_l \in \mathbb{R}^{n_l \times n_{l-1}}$ and $b_l \in \mathbb{R}^{n_l}$ are the weights and biases of the $l$-th layer for $l = 1, \dots, L$, $n$ is the number of nodes or neurons in a hidden layer, called the width of the network, $d = n_0$ is the dimension of the problem, and $n_L = 1$. Network setting parameters used in the numerical implementation are listed in Table 2.

Symbol  Meaning  Value
$L$  Depth of the Network  3
$n$  Width of the Network  100
$\sigma$  Activation Function  —
$d$  Dimension of Spatial Coordinates  to be specified
$\eta_t$  Learning Rate for Solution Network  Specified in (3.4)
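A plain-NumPy sketch of the forward pass of a fully connected network of the form (3.1), with depth 3 and width 100 as in Table 2; tanh is a placeholder activation and the initialization scheme is an illustrative choice, not the paper's.

```python
import numpy as np

rng = np.random.default_rng(4)

# Fully connected network mapping R^d -> R: depth L = 3 means two hidden
# layers of the given width followed by a linear output layer, as in (3.1).
def init_params(d, width=100, depth=3):
    sizes = [d] + [width] * (depth - 1) + [1]
    return [(rng.normal(0.0, np.sqrt(2.0 / m), (n, m)), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def forward(params, x):
    # x: (batch, d) array of collocation points -> (batch,) network values
    h = x.T
    for W, b in params[:-1]:
        h = np.tanh(W @ h + b[:, None])     # hidden layers with activation
    W, b = params[-1]
    return (W @ h + b[:, None]).ravel()     # linear output layer

params = init_params(d=10)
vals = forward(params, rng.uniform(-1, 1, (5, 10)))
```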
The network is trained to solve the minimization problem defined in (1.2) via Adam [adam] (a variant of stochastic gradient descent), where the empirical loss is defined by

(3.2) $\hat{L}(\theta) = \frac{|\Omega|}{N_1}\sum_{i=1}^{N_1} |\mathcal{D}\phi(x_i;\theta) - f(x_i)|^2 + \lambda\,\frac{|\partial\Omega|}{N_2}\sum_{j=1}^{N_2} |\mathcal{B}\phi(y_j;\theta) - g(y_j)|^2,$

and the parameters are updated by

(3.3) $\theta_{t+1} = \theta_t - \eta_t \nabla_\theta \hat{L}(\theta_t),$

where $\eta_t$ is the learning rate of the solution network, defined by
(3.4)  
where $\eta_t$ denotes the learning rate at the $t$-th epoch and $T$ is the total number of epochs. The index $t$ starts from 0; i.e., the learning rate for the first epoch is $\eta_0$.

3.2 Derivatives of Networks
In order to calculate residual errors, the partial derivatives of $\phi$ need to be evaluated. In the implementation, numerical differentiation (see Appendix C) with step size $h$ will be used to estimate partial derivatives, except in Section 4.4, where Autograd is used since it outputs exact partial derivatives in 2D. For example, let $\phi(x;\theta)$ be the solution network with input $x \in \mathbb{R}^d$ and parameters $\theta$. The numerical differentiation estimate of $\partial\phi/\partial x_i$ is defined by
$$\frac{\partial \phi}{\partial x_i}(x;\theta) \approx \frac{\phi(x + h e_i;\theta) - \phi(x - h e_i;\theta)}{2h},$$
where $i = 1, \dots, d$, $e_i$ is a vector of all $0$'s except a $1$ in the $i$-th entry, and $h$ is the step size. Note that, as found in [gu2021selectnet], with a small step size the truncation error, of order $O(h^2)$, can be ignored in practice since it is dominated by the final training error.
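The central-difference estimate above can be sketched as follows; `phi` here is a toy function standing in for the network.

```python
import numpy as np

# Central-difference estimate of a partial derivative, as used to evaluate
# network derivatives: (phi(x + h e_i) - phi(x - h e_i)) / (2h), with O(h^2)
# truncation error.
def central_diff(phi, x, i, h=1e-4):
    e = np.zeros_like(x)
    e[i] = 1.0                                      # unit vector in direction i
    return (phi(x + h * e) - phi(x - h * e)) / (2.0 * h)

phi = lambda x: np.sin(x[0]) * x[1] ** 2            # d/dx0 = cos(x0) * x1^2
x = np.array([0.3, 2.0])
approx = central_diff(phi, x, 0)
exact = np.cos(0.3) * 4.0
```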
3.3 Training Solution Networks with Active Sampling
3.4 Uniform Annular Distribution
The uniform annular approach will be employed to sample points in $\Omega$ in the first step of Algorithms 2 and 3. The uniform annular approach divides the interior of the domain into $K$ annuli and generates $N/K$ samples uniformly in each annulus, where $N$ is the number of training points in $\Omega$ in each epoch and $K$ is the number of annuli; in practice, $N$ should be divisible by $K$. This distribution is also utilized in [gu2021selectnet]. As a side note, the uniform annular distribution can itself be understood as a biased distribution that covers a broader range of points in $\Omega$. Unlike the adaptive sampling distribution in (2.2), however, the uniform annular distribution is not necessarily a better distribution for the model to learn from for every PDE example.
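A sketch of uniform annular sampling on the 2D unit disk, assuming annuli of equal radial width and area-uniform sampling within each annulus; the details may differ from the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(5)

# Split the unit disk into K annuli of equal radial width and draw N/K points
# uniformly (by area) inside each annulus, so samples cover all radii.
def uniform_annular(n_points, n_annuli):
    assert n_points % n_annuli == 0        # N must be divisible by K
    per = n_points // n_annuli
    pts = []
    for j in range(n_annuli):
        r_in, r_out = j / n_annuli, (j + 1) / n_annuli
        # area-uniform radius within the annulus: r = sqrt(U(r_in^2, r_out^2))
        r = np.sqrt(rng.uniform(r_in**2, r_out**2, per))
        theta = rng.uniform(0.0, 2.0 * np.pi, per)
        pts.append(np.column_stack([r * np.cos(theta), r * np.sin(theta)]))
    return np.concatenate(pts)

pts = uniform_annular(12000, 10)
```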
3.5 Accuracy Assessment
To measure the accuracy of the model, 10000 testing points, different from the training points, will be sampled from the uniform annular distribution to approximate the overall relative $\ell^2$ error and the relative maximum modulus error at these points. Note that in the implementation the random seed is fixed, so for the same PDE example all models are tested on the same set of uniform annular points. The overall relative $\ell^2$ error and the relative maximum modulus error are, respectively, defined as follows:
$$e_{\ell^2} = \frac{\big(\sum_{i} |\phi(x_i;\theta) - u(x_i)|^2\big)^{1/2}}{\big(\sum_{i} |u(x_i)|^2\big)^{1/2}}, \qquad e_{\max} = \frac{\max_i |\phi(x_i;\theta) - u(x_i)|}{\max_i |u(x_i)|}.$$
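The two metrics can be sketched as follows; the vectors are toy stand-ins for network and ground-truth values at the test points, and the definitions are one common reconstruction since the source's formulas were lost.

```python
import numpy as np

# Overall relative l2 error and relative maximum modulus error of a network
# approximation phi against the ground truth u, both evaluated at test points.
def relative_l2(phi_vals, u_vals):
    return np.linalg.norm(phi_vals - u_vals) / np.linalg.norm(u_vals)

def relative_max_modulus(phi_vals, u_vals):
    return np.max(np.abs(phi_vals - u_vals)) / np.max(np.abs(u_vals))

u = np.array([1.0, -2.0, 0.5])     # toy ground-truth values
phi = np.array([1.1, -2.0, 0.5])   # toy network values
```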
4 Numerical Experiment
In this section, our adaptive sampling is tested on three types of PDE examples: elliptic, parabolic, and hyperbolic equations, in various dimensions. Adaptive sampling is compared with RAR in the first example. The RAR algorithm used in this work (see Algorithm 5) is slightly different from the algorithm proposed in [lu2020deepxde], so that the number of training points in the comparison stays fixed. The number of points added per refinement in the algorithm proposed in [lu2020deepxde] is chosen to be 2000 for testing. Finally, our adaptive sampling is applied to some current frameworks to test whether it is compatible with them. The parameter setting for Sections 4.1 and 4.2 is listed in Table 3.
Symbol  Meaning  Value
Number of Epochs  20000
Number of Training Points in $\Omega$ in Each Epoch  12000
Number of Training Points on $\partial\Omega$ in Each Epoch, for Example (4.1)  12000
Number of Training Points on $\partial\Omega$ in Each Epoch, for Example (4.2)  12000 +
$\lambda$  Boundary Loss Weighting Term in (3.2)  10
4.1 Elliptic Equation
We consider a nonlinear elliptic equation: find $u$ such that
(4.1)  
where the domain and the data are given appropriately so that the exact solution is as prescribed.
Adaptive sampling (AS) is tested on this PDE example in various high dimensions: 10, 20, and 100, and is compared with the basic least squares method discussed in Section 1.2.2. Since the computational costs of the two models differ, for this particular example the error decay versus time in seconds is also displayed to demonstrate the efficiency of adaptive sampling. Running times are obtained by training on a Quadro RTX 8000 graphics card. Only training time is counted; time for accuracy assessment and saving data is not included. The overall relative error appears very large in 100 dimensions because the true solution vanishes in most of the volume when the dimension is high.
Dimension  AS  Basic  RAR  Error Reduced by AS
10
20
100
The error decay versus time in 10, 20, and 100 dimensions is demonstrated in Figures 4.1, 4.4, and 4.5, respectively. Figure 4.2 shows the surfaces of the ground-truth solution and the network solutions with and without adaptive sampling. Figure 4.3 shows the heatmap of the absolute difference in the surfaces, where the color bar represents the absolute difference between the ground-truth solution and the network solution. These results clearly show that adaptive sampling helps to reduce the error, especially the relative maximum modulus error. More precisely, adaptive sampling significantly reduces the variance of the error distribution over the domain, so that, compared to training without adaptive sampling, it is less likely to have volumes/areas where the error is much larger. Moreover, adaptive sampling performs better than RAR in high-dimensional settings; it must be pointed out that RAR works well in low dimensions, and the paper [lu2020deepxde] is mainly concerned with low dimensions. In high dimensions, however, the effect of RAR is not strong enough. It is worth mentioning that the residual error and the actual error are not proportional to each other. In some cases, adaptive sampling, even with a small power $k$, will tend to over-focus on volumes with extremely large residuals, in which case the RAR algorithm is a better fit to refine the network solution.
4.2 Parabolic Equation
We consider the following parabolic equation:
(4.2)  
where the domain, the time interval, and the data are given appropriately so that the exact solution is as prescribed.
Table 5 shows the overall relative error and the relative maximum modulus error of adaptive sampling and the basic residual model at the end of 20000 epochs. The error decay versus the number of epochs in 10 and 20 dimensions is illustrated in Figures 4.6 and 4.9, respectively. Figure 4.7 shows the surfaces of the ground-truth solution and the network solutions with and without adaptive sampling. Figure 4.8 shows the heatmap of the absolute difference in the surfaces, where the color bar represents the absolute difference between the ground-truth solution and the network solution. This example evidently demonstrates the advantage of adaptive sampling in reducing the variance of the error distribution over the domain. These results clearly show that adaptive sampling helps to reduce the error, especially the relative maximum modulus error.
Dimension  Adaptive Sampling  Basic Model  Error Reduced by AS
10 D
20 D
4.3 Hyperbolic Equation
Lastly, we consider the following hyperbolic equation:
(4.3)  
where the domain, the time interval, the initial data, and the boundary data are given appropriately so that the exact solution is as prescribed; $\Delta$ denotes the Laplace operator taken in the spatial variables only.
Table 6 shows the overall relative error and the relative maximum modulus error of adaptive sampling and the basic model at the end of 20000 epochs. The error decay versus the number of epochs in 10 and 20 dimensions is presented in Figures 4.10 and 4.13, respectively. Figure 4.11 shows the surfaces of the ground-truth solution and the network solutions with and without adaptive sampling. Figure 4.12 shows the heatmap of the absolute difference in the surfaces, where the color bar represents the absolute difference between the ground-truth solution and the network solution. This example shows that adaptive sampling reduces the variance of the error distribution over the domain and helps to reduce the error, especially the relative maximum modulus error.
Dimension  Adaptive Sampling  Basic Model  Error Reduced by AS
10
20
4.4 Poisson’s Equation
In this example, we show that adaptive sampling is compatible with some recent frameworks, such as DRM [Deep_Ritz], DGM [other3], and WAN [WAN]. To this end, we consider a 2D Poisson's equation: find $u$ such that
The exact solution is as prescribed. All models are trained for 300 seconds on an NVIDIA Quadro P1000 graphics card. Each framework, without and with adaptive sampling, is run for 30 trials to approximate the PDE. The error obtained with adaptive sampling is compared with the error obtained without it, to test whether adaptive sampling is compatible with the various frameworks and lowers the error. Note that different random seeds are used in different trials, but within each trial the same torch random seed is used with and without adaptive sampling, which implies that the networks are initialized with the same parameters. In this test, a ResNet [he2015deep] with 4 residual blocks is used with Tanh as the activation function. Each epoch uses 1024 uniform points on the boundary and 1024 in the domain for training. Moreover, instead of numerical differentiation, PyTorch Autograd is utilized to evaluate derivatives.
Table 7 shows statistics of the errors obtained in the 30 trials without and with adaptive sampling. Figure 4.14 shows the average error of the 30 trials versus time in seconds for DGM, DRM, and WAN, with and without adaptive sampling. These results clearly show that adaptive sampling is compatible with all three frameworks: the errors obtained with adaptive sampling are lower than those without. Note that, as observed in the previous high-dimensional examples, the advantage of adaptive sampling becomes more significant when the dimension is high.
Framework  Mean  Standard Deviation  Minimum Value  Coefficient of Variation (%)
DGM  0.0333  0.0064  0.0244  19.2
DGM + AS  0.0280  0.0079  0.0155  29.2
DRM  0.0273  0.0097  0.0132  35.5
DRM + AS  0.0255  0.0093  0.0122  36.5
WAN  0.0329  0.0062  0.0245  18.8
WAN + AS  0.0282  0.0075  0.0153  26.6
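As a sanity check on the table, the coefficient of variation is the standard deviation expressed as a percentage of the mean; e.g., for the first DGM row:

```python
# Coefficient of variation (%) = standard deviation / mean * 100,
# using the first DGM row of the table.
cv = 0.0064 / 0.0333 * 100
```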
Acknowledgements. W.G. was partially supported by the National Science Foundation under grant DMS-2050133. C.W. was partially supported by National Science Foundation Award DMS-2136380. The authors would like to express their deepest appreciation to Dr. Haizhao Yang of Purdue University for his invaluable insights, unparalleled support, and constructive advice. Thanks also go to Dr. Yiqi Gu of the University of Hong Kong for helpful discussions.
References
Appendix A RAR Sampling
Appendix B Preliminary Definition
Definition B.1.
Let $X$ be a real random variable with probability density function $p$ and let $f$ be any real function. The expectation of $f(X)$ is defined by
$$\mathbb{E}[f(X)] = \int f(x)\,p(x)\,dx.$$

Definition B.2.
Let $x_1, \dots, x_N$ be real random variables that follow the uniform distribution on $\Omega$ and let $f$ be any function defined on $\Omega$. The plain Monte Carlo estimate of the integral of $f$ over $\Omega$ is defined by
$$\int_\Omega f(x)\,dx \approx \frac{|\Omega|}{N}\sum_{i=1}^{N} f(x_i),$$
where $|\Omega|$ denotes the volume (length in 1D and area in 2D) of $\Omega$ and $N$ is the number of points sampled for the estimate. Note that this is an unbiased estimate.
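A quick numerical check of the plain Monte Carlo estimate in Definition B.2, using the illustrative choice $f(x, y) = x + y$ on the unit square, whose exact integral is 1.

```python
import numpy as np

rng = np.random.default_rng(6)

# Plain Monte Carlo estimate of an integral over Omega = [0, 1]^2:
# |Omega| / N times the sum of f at N uniform points.
def plain_mc_integral(f, n):
    x = rng.uniform(0.0, 1.0, (n, 2))
    volume = 1.0                        # |Omega| for the unit square
    return volume / n * np.sum(f(x[:, 0], x[:, 1]))

est = plain_mc_integral(lambda x, y: x + y, 100000)
```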
Appendix C Numerical Differentiation
Definition C.1.
A real scalar function $f$ of $d$ variables is a rule that assigns a number $f(x_1, \dots, x_d)$ to an array $(x_1, \dots, x_d)$ of $d$ numbers.

Definition C.2.
Let $f$ be a real scalar function of $d$ variables defined on $\Omega$. The partial derivative of $f$ at a point $x$ with respect to $x_i$ is given by
$$\frac{\partial f}{\partial x_i}(x) = \lim_{h \to 0} \frac{f(x + h e_i) - f(x)}{h},$$
where $i = 1, \dots, d$ and $e_i$ is an array of all zeros except a 1 in the $i$-th entry.

Definition C.3.
Let $f$ be a real scalar function of $d$ variables defined on $\Omega$, $x \in \Omega$, and $h > 0$. The (central) numerical differentiation estimate of $\partial f / \partial x_i$ is given by:
$$\frac{\partial f}{\partial x_i}(x) \approx \frac{f(x + h e_i) - f(x - h e_i)}{2h}.$$
Note that the truncation error is of order $O(h^2)$.
Appendix D Selfnormalized Sampling
Let $p$ be the desired distribution of a continuous variable over the domain $\Omega$. Suppose that we cannot compute $p$ directly but have the unnormalized version $\tilde{p}$ such that
$$p(x) = \frac{1}{Z}\,\tilde{p}(x),$$
where $Z$ is the unknown normalizing constant. The self-normalized sampling algorithm is:

In the first step, the number of candidate points $M$ should be large even if the number of desired samples $N$ is small; when $N$ is large, $M$ can be equal to or even smaller than $N$, as long as $M$ is still large. In the numerical experiments in this work, 12,000 collocation points are needed, so $M$ and $N$ are both set to 12,000.
We will show that this algorithm makes sense. Let $A$ be any subset of $\Omega$. The probability of $A$ under the target distribution is:

(D.1) $P(A) = \int_A p(x)\,dx = \frac{\int_A \tilde{p}(x)\,dx}{\int_\Omega \tilde{p}(x)\,dx} \approx \frac{\frac{|A|}{n_A}\sum_{j=1}^{n_A} \tilde{p}(y_j)}{\frac{|\Omega|}{n}\sum_{i=1}^{n} \tilde{p}(z_i)},$

where the last step is Monte Carlo integration, with $\{y_j\}$ and $\{z_i\}$ being $n_A$ and $n$ random points in $A$ and $\Omega$, respectively. This approximation is governed by the Law of Large Numbers; therefore, $n_A$ and $n$ are expected to be large. Now, suppose that out of the $M$ points generated in Algorithm 6, there are $M_A$ points in $A$. When $M$ and $M_A$ are large, by the Law of Large Numbers, $M_A/M \approx |A|/|\Omega|$. Hence, we can use these points as the random points in Equation (D.1), and the above probability becomes:

(D.2) $P(A) \approx \frac{\frac{|A|}{M_A}\sum_{x_i \in A} \tilde{p}(x_i)}{\frac{|\Omega|}{M}\sum_{i=1}^{M} \tilde{p}(x_i)} \approx \frac{\sum_{x_i \in A} \tilde{p}(x_i)}{\sum_{i=1}^{M} \tilde{p}(x_i)},$

and this is exactly the probability of $A$ under the discrete p.m.f. in Algorithm 6.
One advantage of this algorithm is that it generates points in a completely parallel fashion, whereas in the Metropolis-Hastings algorithm each draw is based on the previous draw; thus, the MH algorithm cannot be parallelized.