# Estimation of distributions via multilevel Monte Carlo with stratified sampling

We design and implement a novel algorithm for computing a multilevel Monte Carlo (MLMC) estimator of the cumulative distribution function of a quantity of interest in problems with random input parameters or initial conditions. Our approach combines a standard MLMC method with stratified sampling by replacing standard Monte Carlo at each level with stratified Monte Carlo with proportional allocation. We show that the resulting stratified MLMC algorithm is more efficient than its standard MLMC counterpart, due to the reduction in variance at each level provided by the stratification of the random parameter's domain. A smoothing approximation for the indicator function based on kernel density estimation yields a more efficient algorithm compared to the typically used polynomial smoothing. The difference in computational cost between the smoothing methods depends on the required error tolerance.


## 1 Introduction and Motivation

Simulation of many complex systems, such as subsurface flows in porous media [1, 2] or reaction initiation in heterogeneous explosives [3], is complicated by a lack of information about key properties such as permeability or initial porosity. Uncertainty in the medium’s properties or initial state propagates into uncertainty in predicted quantities of interest (QoIs), such as mass flow rate or the material’s temperature.

Probabilistic methods, which treat an uncertain input or initial state of the system as a random variable, provide a natural venue to quantify predictive uncertainty in a QoI. These techniques render the QoI random as well, i.e., it takes on values that are distributed according to some probability density function (PDF). Such approaches include stochastic finite element methods (FEMs), which characterize the random parameter fields in terms of a finite set of random variables, e.g., via a spectral representation or a Karhunen-Loève expansion. This finite set of random variables defines a finite-dimensional outcome space on which the solution to the resulting stochastic partial differential equation (PDE) is defined. Examples of stochastic FEMs include stochastic Galerkin, which expands the solution of a stochastic PDE in terms of orthogonal basis functions, and stochastic collocation, which samples the random parameters at predetermined values or “nodes” [4]. While such methods perform well when the number of stochastic parameters (aka “stochastic dimension”) is low and these parameters exhibit a long correlation length, their performance deteriorates dramatically for many stochastic degrees of freedom and short correlation lengths [5]. In addition, since nonlinearity degrades the solution’s regularity in the outcome space, stochastic FEMs also struggle with highly nonlinear problems [6]. Another class of probabilistic techniques involves the derivation of deterministic equations for the statistical moments [7, 8] or the PDF [9, 10] of the QoI. While these methods do not suffer from the “curse of dimensionality”, they require a closure approximation for the derived moment or PDF equations. Such closures often involve perturbation expansions of relevant quantities into series in the powers of the random parameters’ variances, which limits the applicability of such methods to parameters with low coefficients of variation.

Monte Carlo simulation (MC) [11] remains the most robust and straightforward way to solve PDEs with random parameters or initial conditions. The method samples the random variables from their distribution, solves the deterministic PDE for each realization, and computes the resulting statistics of the QoI. While combining a nonintrusive character with a convergence rate that is independent of the stochastic dimension, MC converges very slowly: the standard deviation of the MC estimator for the QoI’s expectation value is inversely proportional to $\sqrt{N}$, where $N$ is the number of realizations [11]. While this drawback spurred the development of alternative probabilistic methods such as those listed above, efforts to combine MC with the multigrid concept by Heinrich [12, 13] and later by Giles [14] have sparked renewed interest in MC in the form of the multilevel Monte Carlo (MLMC) method. MLMC aims to achieve the same solution error as MC but at a lower computational cost by correcting realizations on a coarse spatial grid with sampling at finer levels of discretization. By sampling predominantly at the coarsest levels, where samples are cheaper to compute, it aims to outperform MC, which performs realizations only on the finest spatial grid. The related technique of multifidelity Monte Carlo (also referred to as solver-based MLMC, as opposed to traditional grid-based MLMC) generalizes this approach by using models of varying fidelities and speeds on the different levels [15, 16, 17].
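To make the $\mathcal{O}(N^{-1/2})$ behavior concrete, the following toy sketch (a hypothetical QoI $Q = W^2$ with $W \sim U(0,1)$, not one of the paper's testbeds) measures the empirical standard deviation of a plain MC estimator at two sample sizes:

```python
import random
import statistics

def mc_estimate(n, seed):
    """Plain MC estimate of E[W^2] for W ~ U(0,1); the exact value is 1/3."""
    rng = random.Random(seed)
    return sum(rng.random() ** 2 for _ in range(n)) / n

def estimator_std(n, reps=200):
    """Empirical standard deviation of the MC estimator over many replications."""
    return statistics.stdev(mc_estimate(n, seed) for seed in range(reps))

# Quadrupling the number of samples should roughly halve the estimator's
# standard deviation, consistent with the 1/sqrt(N) convergence rate.
s1, s4 = estimator_std(1000), estimator_std(4000)
print(s1 / s4)  # close to 2
```

The slow decay of this sampling error with $N$ is exactly what MLMC mitigates by shifting most of the sampling effort to cheap coarse levels.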

Most work on MLMC has been aimed at computing estimators for expectation values and variances of QoIs (see, e.g., [18, 19, 20, 21]). However, the information contained in these first two moments is insufficient to understand, e.g., the probability of rare events. This task requires knowledge of the QoI’s full PDF, or equivalently, its cumulative distribution function (CDF) [22, 23, 24]. Assuming the QoI $Q$ to be a function of a continuous random input variable $W$, i.e., $Q = Q(W)$, where $Q \colon \Omega \to \mathbb{R}$ is a measurable function with $\Omega$ the sample space, the CDF of $Q$ at a point $q$ is given by

$$
F(q) = \mathbb{E}\big[I_{(-\infty,q]}(Q)\big] = \int_{-\infty}^{\infty} I_{(-\infty,q]}(s)\, f(s)\, \mathrm{d}s = \int_{-\infty}^{q} f(s)\, \mathrm{d}s, \tag{1}
$$

where the indicator function is defined as

$$
I_{(-\infty,q]}(s) = \begin{cases} 1 & s \in (-\infty,q], \\ 0 & s \in (q,+\infty). \end{cases} \tag{2}
$$
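Estimating the CDF at a point thus amounts to averaging the indicator over samples of the QoI. A minimal sketch, assuming a toy QoI with a known CDF so the result can be checked (not one of the paper's testbeds):

```python
import random

def empirical_cdf(samples, q):
    """MC estimator of F(q) = E[ I_{(-inf,q]}(Q) ]: the fraction of samples <= q."""
    return sum(1 for s in samples if s <= q) / len(samples)

rng = random.Random(0)
# Hypothetical QoI: Q ~ U(0,1), whose exact CDF is F(q) = q on [0,1].
samples = [rng.random() for _ in range(100_000)]
print(empirical_cdf(samples, 0.25))  # approximately 0.25
```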

Application of MLMC to the estimation of distributions only started a few years ago. Giles et al. [25] developed an algorithm to estimate PDFs and CDFs using the indicator function approach, while Bierig et al. [26] approximated PDFs via a truncated moment sequence and the method of maximum entropy. Both approaches were used to approximate CDFs of molecular species in biochemical reaction networks [27]. Elfverson et al. [28] used MLMC to estimate failure probabilities, which are single-point evaluations of the CDF. Lu et al. [29] obtained a more efficient MLMC algorithm by calibrating the polynomial smoothing of the indicator function in [25] to optimize the smoothing bandwidth for a given value of the error tolerance, thereby achieving a faster variance decay with level, and by enabling an a posteriori switch to MC if the latter turned out to have a lower computational cost. Krumscheid et al. [30] developed an algorithm for approximating general parametric expectations, including CDFs and characteristic functions, and simultaneously deriving robustness indicators such as quantiles and conditional values-at-risk.

To reduce the computational cost of MLMC further, the multigrid approach can be combined with a sampling strategy at each level that is more efficient than standard MC. For example, quasi-Monte Carlo uses quasi-random, rather than random or pseudo-random, sequences to achieve faster convergence than MC, and has been used to speed up the MLMC computation of the mean system state [31]. Furthermore, a number of so-called “variance reduction” techniques have been developed to obtain estimators with a lower variance than MC for the same number of realizations. These include stratification, antithetic sampling, importance sampling, and control variates [11]. Recently, importance sampling was incorporated into MLMC for a more efficient computation of expectation values of QoIs [32]. Ullman et al. [33] estimated failure probabilities for rare events by combining subset simulation based on Markov chain Monte Carlo [34] with multilevel failure domains defined on a hierarchy of discrete spatial grids with decreasing mesh sizes. We propose to divide the domain of a random input parameter or initial condition into strata and apply stratified Monte Carlo at each discretization level to achieve variance reduction, and to estimate the CDF of an output QoI via the resulting “stratified” multilevel Monte Carlo approach.

Section 2 provides a mathematical formulation of the MLMC and stratified MLMC (sMLMC) estimators of CDFs, together with the cost and error associated with computing them. Section 3 discusses two testbed problems for assessing the performance of MLMC and sMLMC compared to standard MC. Conclusions and future research directions are reserved for Section 4.

## 2 Multilevel Monte Carlo for CDFs

Usually the QoI $Q$ cannot be simulated directly. Instead, one discretizes the underlying problem on a spatial grid with $M$ cells and introduces a sequence of random variables $Q_M$ that converges to $Q$ as $M$ increases. We assume that $Q_M$ converges to $Q$ both in the mean and in the sense of distribution as $M \to \infty$, i.e.,

$$
\mathbb{E}\big[Q_M - Q\big] = \mathcal{O}(M^{-\alpha_1}), \qquad \mathbb{E}\big[I_{(-\infty,q]}(Q_M) - I_{(-\infty,q]}(Q)\big] = \mathcal{O}(M^{-\alpha_2}), \tag{3}
$$

as $M \to \infty$, with rates $\alpha_1, \alpha_2 > 0$ independent of $q$. We approximate the statistics of $Q_M$ rather than those of $Q$; i.e., our goal is to estimate the CDF $F_M$ of $Q_M$ on some compact interval $[a,b]$. At each point $q \in [a,b]$, $F_M$ is given by

$$
F_M(q) = \mathbb{E}\big[I_{(-\infty,q]}(Q_M)\big]. \tag{4}
$$

We assume the PDF of $Q_M$ to be at least $r$-times continuously differentiable on $[a,b]$ for some $r \ge 1$ and all $M$. Then, given the values of $F_M$ at a set of $S+1$ equidistant points $\{q_n\}_{n=0}^{S} \subset [a,b]$ with separation distance $h$, we estimate $F_M$ at any point $q \in [a,b]$ via a piecewise polynomial interpolation of degree at most $r$. The resulting approximation of $F_M$ is given by

$$
F_{h,M}(q) = \sum_{n=0}^{S} \mathbb{E}\big[I_{(-\infty,q_n]}(Q_M)\big]\,\phi_n(q) \equiv \sum_{n=0}^{S} \mathbb{E}\big[I_n(Q_M)\big]\,\phi_n(q), \tag{5}
$$

where $\phi_n$ ($n = 0,\dots,S$) are, e.g., Lagrange basis polynomials or, in the case of cubic spline interpolation, third-degree polynomials. The goal is to find an unbiased estimator $\hat I_{n,M}$ of $\mathbb{E}[I_n(Q_M)]$, so that an estimator of $F_{h,M}$ is computed as

$$
\hat F_{h,M}(q) = \sum_{n=0}^{S} \hat I_{n,M}\,\phi_n(q). \tag{6}
$$

The estimator $\hat F_{h,M}$ converges to the CDF of the original QoI as $M$ increases (reducing the discretization error) and as the variance of $\hat I_{n,M}$ decreases (reducing the sampling error).

### 2.1 Standard Monte Carlo (MC)

The MC estimator for $\mathbb{E}[I_n(Q_M)]$ based on $N_{\mathrm{MC}}$ independent samples of $Q_M$ is defined by

$$
\hat I^{\mathrm{MC}}_{n,M} = \frac{1}{N_{\mathrm{MC}}} \sum_{j=1}^{N_{\mathrm{MC}}} I_n\big(Q_M^{(j)}\big), \tag{7}
$$

where $Q_M^{(j)}$ is the $j$th sample of $Q_M$. Let $\hat F^{\mathrm{MC}}_{h,M}$ denote the MC estimator of $F_{h,M}$, and $F_h$ be a piecewise polynomial interpolation of $F$ given its values at the points $q_0,\dots,q_S$. The error between the two, $e(\hat F^{\mathrm{MC}}_{h,M}) = \hat F^{\mathrm{MC}}_{h,M} - F_h$, consists of a sampling part related to estimating $\mathbb{E}[I_n(Q_M)]$ and a discretization part related to approximating $Q$ by $Q_M$. Specifically, the mean squared error (MSE) of $\hat F^{\mathrm{MC}}_{h,M}$ is bounded by

$$
\mathbb{E}\Big[\big\|e(\hat F^{\mathrm{MC}}_{h,M})\big\|_\infty^2\Big] \le \underbrace{\mathbb{E}\Big[\big\|\hat F^{\mathrm{MC}}_{h,M} - \mathbb{E}[\hat F^{\mathrm{MC}}_{h,M}]\big\|_\infty^2\Big]}_{e^{\mathrm{MC}}_1} + \underbrace{\big\|F_{h,M} - F_h\big\|_\infty^2}_{e^{\mathrm{MC}}_2} \le \max_{0\le n\le S} N_{\mathrm{MC}}^{-1}\,\mathbb{V}\big[I_n(Q_M)\big] + \max_{0\le n\le S} \big|\mathbb{E}[I_n(Q_M) - I_n(Q)]\big|^2, \tag{8}
$$

where $\|\cdot\|_\infty$ denotes the supremum norm and $\mathbb{V}$ refers to the variance operator. Here $e^{\mathrm{MC}}_1$ and $e^{\mathrm{MC}}_2$ are, respectively, the sampling and discretization errors in the mean squared sense. For the root mean squared error (RMSE) to be lower than a prescribed tolerance $\epsilon$, it is sufficient to limit both $e^{\mathrm{MC}}_1$ and $e^{\mathrm{MC}}_2$ to at most $\epsilon^2/2$.

For this and all other CDF estimators considered in this work, we assume that the number of interpolation points $S+1$ is large enough for the interpolation error to be negligible, so that $e(\cdot)$ represents the error between the estimator and the true CDF of $Q$.

### 2.2 Standard Multilevel Monte Carlo (MLMC) without smoothing

Rather than sampling $Q_M$ on a single spatial mesh, one considers a sequence of approximations $Q_{M_l}$ ($l = 0,\dots,L_{\max}$) of $Q$ associated with discrete meshes $\{\mathcal{D}_l\}$. Here $M_l$ denotes the number of cells in mesh $\mathcal{D}_l$, and $M_l$ grows geometrically with $l$, e.g., $M_l = s^d M_{l-1}$ for a refinement factor $s > 1$, where $d$ is the spatial dimension. The idea behind this approach is to start by performing cheap-to-compute samples on a coarse mesh, and then gradually correct the resulting estimate by sampling on finer grids, where generating a realization is more computationally expensive. One rewrites $\mathbb{E}[I_n(Q_{M_{L_{\max}}})]$ as a telescoping sum

$$
\mathbb{E}\big[I_n(Q_{M_{L_{\max}}})\big] = \sum_{l=0}^{L_{\max}} \mathbb{E}\big[I_n(Y_l)\big], \tag{9a}
$$

where $I_n(Y_l)$ ($l = 0,\dots,L_{\max}$) are given by

$$
I_n(Y_l) = \begin{cases} I_n(Q_{M_l}) - I_n(Q_{M_{l-1}}) & 1 \le l \le L_{\max}, \\ I_n(Q_{M_0}) & l = 0. \end{cases} \tag{9b}
$$

The MLMC estimator for $\mathbb{E}[I_n(Q_{M_{L_{\max}}})]$ is defined as

$$
\hat I^{\mathrm{MLMC}}_{n,M} = \sum_{l=0}^{L_{\max}} \hat I^{\mathrm{MC}}_n(Y_l) = \sum_{l=0}^{L_{\max}} \frac{1}{N_l} \sum_{j=1}^{N_l} I_n\big(Y_l^{(j)}\big). \tag{10}
$$

While $\mathbb{V}[I_n(Q_{M_l})]$ remains approximately constant with $l$, $\mathbb{V}[I_n(Y_l)]$ decreases with $l$, allowing the estimator to achieve the same overall sampling error as its MC counterpart with a number of samples $N_l$ that decreases as $l$ increases.
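The telescoping construction (9)–(10) can be sketched for a toy model in which the level-$l$ approximation of the QoI is the compound-interest approximation of $\exp(W)$ with $2^l$ steps (a hypothetical stand-in for a PDE discretization hierarchy, not one of the paper's testbeds). The key point is that levels $l$ and $l-1$ are evaluated on the same draw of $W$, so the level corrections are small and need few samples:

```python
import random

def q_level(w, l):
    """Level-l approximation of Q(w) = exp(w): compound formula with 2**l steps."""
    n = 2 ** l
    return (1.0 + w / n) ** n

def mlmc_indicator(q, n_samples, l_max, seed=0):
    """MLMC estimate of P(Q <= q) via the telescoping sum of level corrections,
    coupling levels l and l-1 through the same draw of W."""
    rng = random.Random(seed)
    est = 0.0
    for l in range(l_max + 1):
        n_l = n_samples[l]
        acc = 0.0
        for _ in range(n_l):
            w = rng.random()                       # W ~ U(0,1), a toy input
            fine = 1.0 if q_level(w, l) <= q else 0.0
            if l == 0:
                acc += fine
            else:
                coarse = 1.0 if q_level(w, l - 1) <= q else 0.0
                acc += fine - coarse               # level-l correction I_n(Y_l)
        est += acc / n_l
    return est

# For W ~ U(0,1), the exact limit is P(exp(W) <= 2) = ln 2 ~= 0.693; the MLMC
# estimate targets the level-3 approximation, which carries a small positive bias.
# Note the decreasing sample counts on the finer, more expensive levels.
print(mlmc_indicator(2.0, n_samples=[40000, 8000, 2000, 500], l_max=3))
```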

The MSE of the MLMC estimator for $F_{h,M}$ is bounded by

$$
\mathbb{E}\Big[\big\|e(\hat F^{\mathrm{MLMC}}_{h,M})\big\|_\infty^2\Big] \le \underbrace{\mathbb{E}\Big[\big\|\hat F^{\mathrm{MLMC}}_{h,M} - \mathbb{E}[\hat F^{\mathrm{MLMC}}_{h,M}]\big\|_\infty^2\Big]}_{e^{\mathrm{ML}}_1} + \underbrace{\big\|F_{h,M} - F_h\big\|_\infty^2}_{e^{\mathrm{ML}}_2} \le \max_{0\le n\le S} \sum_{l=0}^{L_{\max}} N_l^{-1}\,\mathbb{V}\big[I_n(Y_l)\big] + \max_{0\le n\le S} \big|\mathbb{E}[I_n(Q_{M_{L_{\max}}}) - I_n(Q)]\big|^2, \tag{11}
$$

where $e^{\mathrm{ML}}_1$ and $e^{\mathrm{ML}}_2$ are, respectively, the sampling and discretization errors in the mean squared sense. From the triangle inequality it follows that

$$
\max_{0\le n\le S} \big|\mathbb{E}[I_n(Y_{L_{\max}})]\big| \approx \max_{0\le n\le S} \big|\mathbb{E}[I_n(Q_{M_{L_{\max}}}) - I_n(Q)]\big|. \tag{12}
$$

Hence, to determine the maximum level $L_{\max}$ of an MLMC simulation with a given tolerance $\epsilon$, we check whether the estimated correction $\max_{0\le n\le S} |\mathbb{E}[I_n(Y_L)]|$ satisfies the discretization-error bound at the current level $L$. Once the value of $L_{\max}$ is found, we can compare the performance of MLMC to MC by performing the latter on this finest level, i.e., for $M$ in Section 2.1 equal to $M_{L_{\max}}$, re-using the samples already computed with MLMC at this level. Since both estimators then share the same finest discretization, this strategy ensures that the bound for $e^{\mathrm{ML}}_2$ in (11) is the same as the bound for $e^{\mathrm{MC}}_2$ in (8). To achieve an RMSE of at most $\epsilon$, it is sufficient that $e^{\mathrm{ML}}_1 \le \epsilon^2/2$ and $e^{\mathrm{ML}}_2 \le \epsilon^2/2$.

### 2.3 Standard Multilevel Monte Carlo (MLMC) with smoothing

The jump discontinuity in the indicator function may lead to a slow decay of $\mathbb{V}[I_n(Y_l)]$ with level and make MLMC slower than MC for sufficiently large values of the error tolerance [29]. To accelerate the variance decay and to improve the computational efficiency of MLMC, a sigmoid-type smoothing function can be used to remove the singularity in the indicator function. We consider two different smoothing functions, namely a polynomial proposed by Giles et al. [25] (which we will refer to as $g_G$) and the CDF of a Gaussian kernel in the context of kernel density estimation [35] (denoted by $g_K$).

#### 2.3.1 Smoothing via Giles’ polynomial

Under the assumption, made in Section 2, that the PDF of $Q_M$ is at least $r$-times continuously differentiable on $[a,b]$, we define a smoothing function $g_G$ that satisfies [25]

1. the cost of evaluating $g_G(s)$ is bounded by some constant independent of $s$;

2. $g_G$ is Lipschitz continuous;

3. $g_G(s) = 1$ for $s \le -1$ and $g_G(s) = 0$ for $s \ge 1$;

4. a moment-matching condition on $[-1,1]$, $\int_{-1}^{1} s^j \big[g_G(s) - I_{(-\infty,0]}(s)\big]\,\mathrm{d}s = 0$ for $j = 0,\dots,r-1$.

On $[-1,1]$, $g_G$ can be constructed as the uniquely determined polynomial of degree at most $r+1$ satisfying conditions 3 and 4. To obtain an appropriate smoothing function on all of $\mathbb{R}$, $g_G$ is extended with

$$
g_G(s) = \begin{cases} 1 & s < -1, \\ 0 & s > 1. \end{cases} \tag{13}
$$

The indicator function $I_n$ at each level $l$ ($l = 0,\dots,L_{\max}$) is replaced by $g_G\big((Q_{M_l} - q_n)/\delta_{G,l}\big)$, where the bandwidth $\delta_{G,l}$ is a measure of the width over which the discontinuity in $I_n$ is smoothed out. The MLMC estimator with smoothing for $\mathbb{E}[I_n(Q_M)]$ is given by

$$
\hat g^{\mathrm{MLMC}}_{n,M} = \sum_{l=0}^{L_{\max}} \hat g^{\mathrm{MC}}_n(Y_l), \tag{14}
$$

where $\hat g^{\mathrm{MC}}_n(Y_l) = N_l^{-1} \sum_{j=1}^{N_l} g_n(Y_l^{(j)})$ with

$$
g_n\big(Y_l^{(j)}\big) = \begin{cases} g_G\!\left(\dfrac{Q_{M_l}^{(j)} - q_n}{\delta_{G,l}}\right) - g_G\!\left(\dfrac{Q_{M_{l-1}}^{(j)} - q_n}{\delta_{G,l}}\right) & 1 \le l \le L_{\max}, \\[1.5em] g_G\!\left(\dfrac{Q_{M_l}^{(j)} - q_n}{\delta_{G,l}}\right) & l = 0. \end{cases} \tag{15}
$$

The MLMC estimator with smoothing for $F_{h,M}$ is then given by the piecewise polynomial interpolation

$$
\hat F^{\mathrm{MLMC}}_{h,\delta_G,M}(q) = \sum_{n=0}^{S} \hat g^{\mathrm{MLMC}}_{n,M}\,\phi_n(q) = \sum_{n=0}^{S} \sum_{l=0}^{L_{\max}} \hat g^{\mathrm{MC}}_n(Y_l)\,\phi_n(q). \tag{16}
$$

The MSE of $\hat F^{\mathrm{MLMC}}_{h,\delta_G,M}$ is bounded by

$$
\mathbb{E}\Big[\big\|e(\hat F^{\mathrm{MLMC}}_{h,\delta_G,M})\big\|_\infty^2\Big] \le \underbrace{\mathbb{E}\Big[\big\|\hat F^{\mathrm{MLMC}}_{h,\delta_G,M} - \mathbb{E}[\hat F^{\mathrm{MLMC}}_{h,\delta_G,M}]\big\|_\infty^2\Big]}_{e^{\mathrm{ML}}_1} + \underbrace{\big\|F_{h,M} - F_h\big\|_\infty^2}_{e^{\mathrm{ML}}_2} + \underbrace{\big\|\mathbb{E}[\hat F^{\mathrm{MLMC}}_{h,\delta_G,M}] - F_{h,M}\big\|_\infty^2}_{e^{\mathrm{ML}}_3}. \tag{17}
$$

Compared to (11), (17) contains an additional term $e^{\mathrm{ML}}_3$, which is the (mean squared) smoothing error. To achieve an RMSE of at most $\epsilon$, it is sufficient that the three terms $e^{\mathrm{ML}}_1$, $e^{\mathrm{ML}}_2$ and $e^{\mathrm{ML}}_3$ sum to at most $\epsilon^2$.

Finding the optimal value of the bandwidth $\delta_{G,l}$ at each level $l$ ($l = 0,\dots,L_{\max}$) is crucial to the performance of MLMC. The smoothing error should be as close as possible to its allotted share of the tolerance in order to maximize the smoothing and, hence, the variance decay with increasing level. A possible strategy consists of the following steps.

1. Given the error tolerance $\epsilon$, at level $l$ estimate the bandwidth $\delta_{G,l,n}$ for each interpolation point $q_n$ by solving

$$
\frac{1}{N_l^0} \left| \sum_{j=1}^{N_l^0} \left[ g_G\!\left(\frac{Q_{M_l}^{(j)} - q_n}{\delta_{G,l,n}}\right) - I_n\big(Q_{M_l}^{(j)}\big) \right] \right| = \frac{\epsilon}{2} \tag{18}
$$

based on a set of $N_l^0$ initial (warmup) samples.

2. Define the smoothing parameter for level $l$, $\delta_{G,l}$, as

$$
\delta_{G,l} = \max_{0\le n\le S} \delta_{G,l,n}. \tag{19}
$$
3. Repeat steps 1 and 2 for each new level.
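The bandwidth-selection step (18) can be sketched as a root search. The sketch below uses the Gaussian-CDF smoother of Section 2.3.2 in place of Giles' polynomial (to keep the code self-contained) and assumes the empirical smoothing error grows monotonically with the bandwidth, so simple bisection suffices; the warmup samples are hypothetical:

```python
import math
import random

def g_k(s):
    """CDF of the standard normal, used here as the smoothing function."""
    return 0.5 * (1.0 + math.erf(s / math.sqrt(2.0)))

def smoothing_error(samples, qn, delta):
    """Empirical smoothing bias of Eq. (18): |mean over samples of (g - indicator)|."""
    diff = sum(g_k((qn - s) / delta) - (1.0 if s <= qn else 0.0) for s in samples)
    return abs(diff) / len(samples)

def calibrate_bandwidth(samples, qn, eps, lo=1e-6, hi=10.0, iters=60):
    """Bisection for the delta at which smoothing_error(delta) = eps/2.
    Assumes the error crosses the target once on [lo, hi] (typical in practice)."""
    target = eps / 2.0
    if smoothing_error(samples, qn, hi) < target:
        return hi  # even the widest bandwidth stays within the tolerance share
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if smoothing_error(samples, qn, mid) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

rng = random.Random(1)
warmup = [rng.gauss(0.0, 1.0) for _ in range(2000)]   # hypothetical warmup samples
delta = calibrate_bandwidth(warmup, qn=0.5, eps=0.01)
print(delta)
```

Taking the maximum of the calibrated bandwidths over the interpolation points, as in (19), then gives the level bandwidth.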

We follow the numerical algorithm in Appendix A to compute $\hat F^{\mathrm{MLMC}}_{h,\delta_G,M}$ and to measure the associated computational cost (see Section 2.5). This algorithm is inspired by the approaches of [18, 29]. To compare the performance of the MLMC simulation with that of the corresponding MC simulation at the highest level (which ensures that both share the same discretization error), we also compute the number of MC samples of $Q_{M_{L_{\max}}}$ required to satisfy the sampling-error bound, as well as the resulting computational cost of the MC estimator (see Appendix A.1).

#### 2.3.2 Smoothing based on kernel density estimation (KDE)

We propose an alternative way of smoothing the indicator function, grounded in kernel density estimation (KDE), a nonparametric estimation of the PDF of a random variable. Let $q_1,\dots,q_n$ be independent and identically distributed samples drawn from the distribution of a QoI with unknown PDF $f$. Then the KDE of $f$ is given by

$$
\hat f_{\delta_K}(q) = \frac{1}{n} \sum_{i=1}^{n} K_{\delta_K}(q - q_i) = \frac{1}{n\,\delta_K} \sum_{i=1}^{n} K\!\left(\frac{q - q_i}{\delta_K}\right), \tag{20}
$$

where $K$ is a kernel and $\delta_K > 0$ is the bandwidth. We refer to $K_{\delta_K}(s) = \delta_K^{-1} K(s/\delta_K)$ as a scaled kernel. For a Gaussian kernel, the KDE of $f$ is

$$
\hat f_{\delta_K}(q) = \frac{1}{n\,\delta_K} \sum_{i=1}^{n} \frac{1}{\sqrt{2\pi}} \exp\!\left[-\frac{1}{2}\left(\frac{q - q_i}{\delta_K}\right)^2\right]. \tag{21}
$$

A corresponding estimate of the CDF is then obtained by integrating (21), i.e., by replacing each Gaussian kernel with its CDF, which yields

$$
\hat F_{\delta_K}(q) = \frac{1}{n} \sum_{i=1}^{n} \Phi\!\left(\frac{q - q_i}{\delta_K}\right). \tag{22}
$$

Here $\Phi$ is the CDF of the standard normal distribution and plays the role of the indicator smoothing function, $g_K$. The bandwidth $\delta_K$ is the counterpart of $\delta_G$ defined in Section 2.3.1. Based on (22), at each level $l$ ($l = 0,\dots,L_{\max}$) we replace $I_n(Q_{M_l})$ with $g_K\big((q_n - Q_{M_l})/\delta_{K,l}\big)$ and define an MLMC estimator with smoothing for $F_{h,M}$ similar to the one given by (16) but with the smoothing function $g_K$. The methodology of Section 2.3.1 is then used to bound its MSE and find the optimal value of the bandwidth $\delta_{K,l}$ at each level. The algorithm in Appendix A.1 is deployed to compute the estimator and to measure its computational cost.
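A minimal sketch of the KDE-based CDF estimate (22), using standard-normal toy samples (a hypothetical QoI, not from the paper) so the result can be checked against the exact CDF:

```python
import math
import random

def kde_cdf(samples, q, delta):
    """KDE-based CDF estimate of Eq. (22): the average of Gaussian-kernel CDFs
    Phi((q - q_i)/delta) over the samples q_i."""
    phi = lambda s: 0.5 * (1.0 + math.erf(s / math.sqrt(2.0)))
    return sum(phi((q - qi) / delta) for qi in samples) / len(samples)

rng = random.Random(2)
samples = [rng.gauss(0.0, 1.0) for _ in range(20_000)]
# The smoothed estimate should be close to the exact standard-normal CDF,
# which equals 0.5 at q = 0.
print(kde_cdf(samples, 0.0, delta=0.1))  # approximately 0.5
```

Unlike the raw indicator, this estimator is smooth in $q$, which is what accelerates the level-variance decay in the MLMC setting.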

### 2.4 Stratified Multilevel Monte Carlo (sMLMC)

In stratified Monte Carlo (sMC), one divides the domain $D$ of the uncertain input parameter $W$ into $r$ mutually exclusive and exhaustive regions $D_i$ ($i = 1,\dots,r$) called strata. Let $p_i$ be the probability of $W$ being in stratum $D_i$. The expectation value of $Q_M$ is

$$
\mathbb{E}\big[Q_M(W)\big] = \sum_{i=1}^{r} p_i\,\zeta_i, \tag{23}
$$

where $\zeta_i$ ($i = 1,\dots,r$) is the stratum mean given by

$$
\zeta_i = \int_{D_i} Q_M(w)\,\mathrm{d}F_W^{(i)}(w) = \frac{1}{p_i} \int_{D_i} Q_M(w)\,\mathrm{d}F_W(w). \tag{24}
$$

Here $F_W^{(i)}$ is the conditional CDF of $W$ given that $W \in D_i$. The sMC estimator for $\mathbb{E}[Q_M]$ is given by

$$
\hat Q^{\mathrm{sMC}}_M = \sum_{i=1}^{r} \frac{p_i}{n_i} \sum_{j=1}^{n_i} Q_M^{(j,i)}, \tag{25}
$$

where $n_i$ is the number of independent samples generated in the $i$th stratum, with $\sum_{i=1}^{r} n_i = N$, and $Q_M^{(j,i)}$ is the $j$th sample of $Q_M$ whose corresponding input parameter $w^{(j,i)}$ lies in stratum $D_i$. The variance of this estimator is

$$
\mathbb{V}\big[\hat Q^{\mathrm{sMC}}_M\big] = \sum_{i=1}^{r} \frac{\sigma_i^2\,p_i^2}{n_i}, \qquad \sigma_i^2 = \int_{D_i} \big(Q_M(w) - \zeta_i\big)^2\,\mathrm{d}F_W^{(i)}(w), \tag{26}
$$

with $\sigma_i^2$ being the variance of $Q_M$ within stratum $D_i$.

Two common choices for the allocation $\{n_i\}$ are proportional and optimal allocation [11]. For proportional allocation, $n_i = p_i N$ and

$$
\hat Q^{\mathrm{sMC,p}}_M = \frac{1}{N} \sum_{i=1}^{r} \sum_{j=1}^{n_i} Q_M^{(j,i)}, \tag{27}
$$
$$
\mathbb{V}\big[\hat Q^{\mathrm{sMC,p}}_M\big] = \frac{1}{N} \sum_{i=1}^{r} \sigma_i^2\,p_i = \mathbb{V}\big[\hat Q^{\mathrm{MC}}_M\big] - \frac{1}{N} \sum_{i=1}^{r} p_i \big(\zeta_i - \mathbb{E}[Q_M]\big)^2, \tag{28}
$$

which shows that stratification produces an estimator with a lower variance than its MC counterpart. For optimal allocation, $n_i = N p_i \sigma_i \big/ \sum_{k=1}^{r} p_k \sigma_k$ for $i = 1,\dots,r$, yielding

$$
\mathbb{V}\big[\hat Q^{\mathrm{sMC,o}}_M\big] = \frac{1}{N} \left(\sum_{i=1}^{r} \sigma_i\,p_i\right)^2. \tag{29}
$$

It can be shown [11] that this is the smallest variance possible for an sMC estimator, which, given (28), is also smaller than the variance of the corresponding MC estimator.
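The variance reduction promised by (28) is easy to verify numerically. The sketch below stratifies $W \sim U(0,1)$ into equal-probability intervals with proportional allocation and compares the empirical spread of the stratified and plain MC estimators of $\mathbb{E}[W^2]$ (a toy QoI, not one of the paper's testbeds):

```python
import random
import statistics

def mc_mean(n, rng):
    """Plain MC estimate of E[W^2] for W ~ U(0,1)."""
    return sum(rng.random() ** 2 for _ in range(n)) / n

def smc_mean(n, r, rng):
    """Stratified MC with proportional allocation: r equal-probability strata of
    U(0,1), n/r samples per stratum, p_i = 1/r, as in Eq. (27)."""
    per = n // r
    total = 0.0
    for i in range(r):
        lo = i / r  # left endpoint of stratum D_i = [i/r, (i+1)/r]
        total += sum((lo + rng.random() / r) ** 2 for _ in range(per)) / per / r
    return total

rng = random.Random(3)
reps = 300
mc_sd = statistics.stdev(mc_mean(400, rng) for _ in range(reps))
smc_sd = statistics.stdev(smc_mean(400, 8, rng) for _ in range(reps))
# Stratification removes the between-strata variance term of Eq. (28),
# so the stratified estimator's spread is markedly smaller.
print(mc_sd, smc_sd)
```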

Replacing $Q_M$ with $I_n(Q_M)$ in (25) leads to

$$
\hat I^{\mathrm{sMC}}_{n,M} = \sum_{i=1}^{r} \frac{p_i}{n_i} \sum_{j=1}^{n_i} I_n\big(Q_M^{(j,i)}\big). \tag{30}
$$

To combine the benefits of stratification with those of the multigrid approach, we replace MC at each level of the MLMC algorithm with sMC and refer to the resulting algorithm as “stratified multilevel Monte Carlo” (sMLMC). In analogy to (10), the sMLMC estimator for $\mathbb{E}[I_n(Q_{M_{L_{\max}}})]$ is defined as

$$
\hat I^{\mathrm{sMLMC}}_{n,M} = \sum_{l=0}^{L_{\max}} \hat I^{\mathrm{sMC}}_n(Y_l) = \sum_{l=0}^{L_{\max}} \sum_{i=1}^{r} \frac{p_i}{n_{i,l}} \sum_{j=1}^{n_{i,l}} I_n\big(Y_l^{(j,i)}\big), \tag{31}
$$

where the level corrections $Y_l^{(j,i)}$, computed from samples whose input parameter lies in stratum $D_i$, are defined in (9b), and $n_{i,l}$ is the number of samples in stratum $D_i$ at level $l$. The MSE of the sMLMC estimator for $F_{h,M}$ is bounded by

$$
\mathbb{E}\Big[\big\|e(\hat F^{\mathrm{sMLMC}}_{h,M})\big\|_\infty^2\Big] \le \max_{0\le n\le S} \sum_{l=0}^{L_{\max}} \mathbb{V}\big[\hat I^{\mathrm{sMC}}_n(Y_l)\big] + \max_{0\le n\le S} \big|\mathbb{E}[I_n(Q_{M_{L_{\max}}}) - I_n(Q)]\big|^2, \tag{32}
$$

where the bound for the discretization error is the same as the bound for $e^{\mathrm{MC}}_2$ in (8), provided the finest level satisfies $M = M_{L_{\max}}$. To reduce the computational cost further, we again smooth the indicator function, yielding an additional error term in (32) similar to $e^{\mathrm{ML}}_3$ in (17). The resulting sMLMC estimator with smoothing for $\mathbb{E}[I_n(Q_M)]$ is defined as

$$
\hat g^{\mathrm{sMLMC}}_{n,M} = \sum_{l=0}^{L_{\max}} \hat g^{\mathrm{sMC}}_n(Y_l), \tag{33}
$$

where $\hat g^{\mathrm{sMC}}_n(Y_l) = \sum_{i=1}^{r} \dfrac{p_i}{n_{i,l}} \sum_{j=1}^{n_{i,l}} g_n\big(Y_l^{(j,i)}\big)$ and

$$
g_n\big(Y_l^{(j,i)}\big) = \begin{cases} g_G\!\left(\dfrac{Q_{M_l}^{(j,i)} - q_n}{\delta_{G,l}}\right) - g_G\!\left(\dfrac{Q_{M_{l-1}}^{(j,i)} - q_n}{\delta_{G,l}}\right) & 1 \le l \le L_{\max}, \\[1.5em] g_G\!\left(\dfrac{Q_{M_l}^{(j,i)} - q_n}{\delta_{G,l}}\right) & l = 0 \end{cases} \tag{34}
$$

for polynomial-based smoothing, or

$$
g_n\big(Y_l^{(j,i)}\big) = \begin{cases} g_K\!\left(\dfrac{q_n - Q_{M_l}^{(j,i)}}{\delta_{K,l}}\right) - g_K\!\left(\dfrac{q_n - Q_{M_{l-1}}^{(j,i)}}{\delta_{K,l}}\right) & 1 \le l \le L_{\max}, \\[1.5em] g_K\!\left(\dfrac{q_n - Q_{M_l}^{(j,i)}}{\delta_{K,l}}\right) & l = 0 \end{cases} \tag{35}
$$

for KDE-based smoothing. The sMLMC estimator with smoothing for $F_{h,M}$ is then given by

$$
\hat F^{\mathrm{sMLMC}}_{h,\delta,M}(q) = \sum_{n=0}^{S} \hat g^{\mathrm{sMLMC}}_{n,M}\,\phi_n(q) = \sum_{n=0}^{S} \sum_{l=0}^{L_{\max}} \hat g^{\mathrm{sMC}}_n(Y_l)\,\phi_n(q), \tag{36}
$$

with $g_n$ given by (34) or (35). To compute the sMLMC estimators with and without smoothing and to measure the associated computational cost, we deploy the algorithm in Appendix A.2.

### 2.5 Relative cost of MLMC and sMLMC versus MC

We estimate the total cost of computing the MLMC estimator without smoothing of $F_{h,M}$ as an average over $N_{\mathrm{real}}$ independent realizations of the corresponding algorithm,

$$
C\big(\hat F^{\mathrm{MLMC}}_{h,M}\big) = \frac{1}{N_{\mathrm{real}}} \sum_{k=1}^{N_{\mathrm{real}}} \sum_{l=0}^{L^{\mathrm{MLMC}}_{\max,k}} \bar w_l^{(k)}\,N_l^{(k)}, \tag{37}
$$

where $\bar w_l^{(k)}$ is the average computational cost of computing a sample of $Y_l$ on level $l$ for realization $k$, and $L^{\mathrm{MLMC}}_{\max,k}$ denotes the finest discretization level at which the sampling is performed for this realization.

For sMLMC without smoothing,

$$
C\big(\hat F^{\mathrm{sMLMC}}_{h,M}\big) = \frac{1}{N_{\mathrm{real}}} \sum_{k=1}^{N_{\mathrm{real}}} \sum_{l=0}^{L^{\mathrm{sMLMC}}_{\max,k}} \sum_{i=1}^{r} \bar w_{i,l}^{(k)}\,n_{i,l}^{(k)}, \tag{38}
$$

where $\bar w_{i,l}^{(k)}$ is the average computational cost of computing a sample of $Y_l$ in stratum $D_i$ on level $l$ for realization $k$, and $L^{\mathrm{sMLMC}}_{\max,k}$ refers to the finest discretization level for this realization.

To compare the performance of MLMC and sMLMC, with or without smoothing, with that of MC, we consider $N_{\mathrm{real}}$ realizations of the MC algorithm and perform the $k$th realization on level $L^{\mathrm{MLMC}}_{\max,k}$ of the corresponding realization of MLMC without smoothing. This provides a single average cost,

$$
C\big(\hat F^{\mathrm{MC}}_{h,M}\big) = \frac{1}{N_{\mathrm{real}}} \sum_{k=1}^{N_{\mathrm{real}}} \bar w_{L^{\mathrm{MLMC}}_{\max,k}}\,N^{(k)}_{\mathrm{MC}}, \tag{39}
$$

against which to compare $C\big(\hat F^{\mathrm{MLMC}}_{h,M}\big)$, $C\big(\hat F^{\mathrm{sMLMC}}_{h,M}\big)$, and their smoothed counterparts (with $\delta = \delta_G$ for polynomial-based smoothing or $\delta = \delta_K$ for KDE-based smoothing). Here $N^{(k)}_{\mathrm{MC}}$ is the number of samples computed in the $k$th realization of the MC algorithm.

## 3 Numerical results

We consider two testbed problems: linear diffusion with a random diffusion coefficient, and the inviscid Burgers' equation with a random initial condition. In both cases, the random input variable $W$ is drawn from a truncated lognormal PDF defined on $[w_l, w_u]$ with parameters $\mu$ and $\sigma$, i.e.,

$$
f_W(w) = \begin{cases}
\dfrac{\sqrt{2}}{\sqrt{\pi}\,\sigma w}\,
\dfrac{\exp\!\left(-\dfrac{(\ln w - \mu)^2}{2\sigma^2}\right)}
{\operatorname{erf}\!\left(\dfrac{\ln w_u - \mu}{\sqrt{2}\,\sigma}\right) - \operatorname{erf}\!\left(\dfrac{\ln w_l - \mu}{\sqrt{2}\,\sigma}\right)} & w \in [w_l, w_u], \\[2.5em]
0 & \text{otherwise}.
\end{cases} \tag{40}
$$
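One simple way to draw from the truncated PDF (40) is rejection sampling: draw from the untruncated lognormal and discard samples outside $[w_l, w_u]$. A sketch with hypothetical parameter values (not the values used in the paper's experiments):

```python
import math
import random

def sample_truncated_lognormal(mu, sigma, w_lo, w_hi, rng):
    """Draw W from the truncated lognormal PDF of Eq. (40) by rejection:
    sample the untruncated lognormal and keep only draws inside [w_lo, w_hi]."""
    while True:
        w = math.exp(rng.gauss(mu, sigma))
        if w_lo <= w <= w_hi:
            return w

rng = random.Random(4)
draws = [sample_truncated_lognormal(0.0, 0.5, 0.5, 2.0, rng) for _ in range(10_000)]
print(min(draws), max(draws))  # all draws lie inside [0.5, 2.0]
```

Rejection is adequate when the truncation interval carries most of the probability mass; for narrow intervals, inverse-CDF sampling of the conditional distribution avoids the wasted draws.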

We assume the PDF of the QoI to be at least three times continuously differentiable, so that we can use cubic-spline interpolation to approximate its CDF. To compare the performance of MC, MLMC, and sMLMC, we measure their average computational cost over $N_{\mathrm{real}}$ independent runs for error tolerances $\epsilon$ including 0.008 and 0.005, and cap the maximum level of MLMC/sMLMC at a fixed value. At each level we use proportional allocation for the sMC algorithm. We choose an identical number of warmup samples $N_l^0$ for all independent runs of the MLMC algorithm, but use different values for MLMC with and without smoothing to eliminate oversampling as much as possible ($N_l^0$ is lower when smoothing is applied). For sMLMC, we define the number of warmup samples in each stratum, $n_{i,l}^0$, based on $N_l^0$ and the chosen allocation strategy (identical for all independent sMLMC runs). The values of $N_l^0$ for sMLMC are typically lower than those for MLMC and differ between the cases with and without smoothing, again to avoid oversampling.

For each value of $\epsilon$, each run involves the following steps.

1. Perform non-smoothed MLMC, yielding a maximum level $L_{\max}$.

2. Perform MC on level $L_{\max}$, re-using already computed samples from the MLMC run.

3. Perform smoothed MLMC using Giles' polynomial with a computed smoothing parameter $\delta_{G,l}$ at each level $l$.

4. Perform smoothed MLMC using KDE with a computed smoothing parameter $\delta_{K,l}$ at each level $l$.

5. Perform non-smoothed sMLMC.

6. Perform smoothed sMLMC using KDE.

### 3.1 Linear diffusion equation

Consider the diffusion problem

$$
\frac{\partial u}{\partial t} = D\,\frac{\partial^2 u}{\partial x^2}, \qquad x \in (0,4), \quad t > 0, \tag{41a}
$$
$$
u(0,t) = -1, \qquad u(4,t) = 1, \tag{41b}
$$
$$
u(x,0) = \tanh\!\left(\frac{2 - x}{0.05}\right). \tag{41c}
$$

Here