# Scheduling massively parallel multigrid for multilevel Monte Carlo methods

The computational complexity of naive, sampling-based uncertainty quantification for 3D partial differential equations is extremely high. Multilevel approaches, such as multilevel Monte Carlo (MLMC), can reduce the complexity significantly, but to exploit them fully in a parallel environment, sophisticated scheduling strategies are needed. Often fast algorithms that are executed in parallel are essential to compute fine level samples in 3D, whereas to compute individual coarse level samples only moderate numbers of processors can be employed efficiently. We make use of multiple instances of a parallel multigrid solver combined with advanced load balancing techniques. In particular, we optimize the concurrent execution across the three layers of the MLMC method: parallelization across levels, across samples, and across the spatial grid. The overall efficiency and performance of these methods will be analyzed. Here the scalability window of the multigrid solver is revealed as being essential, i.e., the property that the solution can be computed with a range of process numbers while maintaining good parallel efficiency. We evaluate the new scheduling strategies in a series of numerical tests, and conclude the paper demonstrating large 3D scaling experiments.


## 1 Introduction

Data uncertainties are ubiquitous in many application fields, such as subsurface flow or climate prediction. Inherent uncertainties in input data propagate to uncertainties in quantities of interest, such as the time it takes pollutants leaking from a waste repository to reach a drinking water well. This situation has driven the development of novel uncertainty quantification (UQ) methods; most commonly, using partial differential equations (PDEs) to model the physical processes and stochastic models to incorporate data uncertainties. Simulation outputs are then statistics (mean, moments, cumulative distribution function) of the quantities of interest. However, typical sampling-and-averaging techniques for computing statistics quickly become infeasible when each sample involves the numerical solution of a PDE.

### 1.1 Mathematical model and UQ methods

Let us consider an abstract, possibly nonlinear system of PDEs with uncertain data

 M(u;ω) = 0, (1)

where the solution $u$ is sought in some suitable space $V$ of functions on a domain $D \subset \mathbb{R}^d$, with $D$ open and bounded, subject to suitable boundary conditions. $\mathcal{M}$ is a differential operator depending on a set of random parameters, parametrised by an element $\omega$ of the abstract sample space $(\Omega, \mathcal{A}, \mathbb{P})$ that encapsulates the uncertainty in the data, with $\Omega$ the set of all outcomes, $\mathcal{A}$ the $\sigma$-algebra (the "set" of all events), and $\mathbb{P}$ the associated probability measure. As a consequence, the solution $u = u(x,\omega)$ itself is a random field, i.e. $u : D \times \Omega \to \mathbb{R}$, with realizations $u(\cdot,\omega) \in V$.

We are typically only interested in functionals $Q = Q(u)$ of $u$. To compute them we need to approximate the solution numerically, e.g. using finite element methods, which introduces a bias error. The cost typically grows in inverse proportion to some power of the bias error, i.e. as $\mathcal{O}(\varepsilon_b^{-r})$ for some $r > 0$, where $\varepsilon_b$ denotes the bias error tolerance. This is a challenging computational task that requires novel methodology combined with cutting-edge parallel computing for two reasons: firstly, real life applications lead to PDE systems in three dimensions that often can only be solved effectively and accurately on a parallel computer (even without data uncertainties); secondly, typical uncertainties in applications, such as a random diffusion coefficient $k(x,\omega)$, are spatially varying on many scales and cannot be described by a handful of stochastic parameters. This limits considerably the types of UQ methods that are applicable.

For low dimensional problems, stochastic Galerkin, stochastic collocation and polynomial chaos methods have been shown to provide efficient and powerful UQ tools (see, e.g., [13, 42, 26] and the references therein), but in general their complexity grows exponentially with the stochastic dimension. The cost of sampling methods, such as, e.g., Monte Carlo, does not grow with the stochastic dimension, but classical Monte Carlo is notoriously slow to converge. Multilevel Monte Carlo (MLMC) simulation [14, 6] can help to significantly accelerate the convergence. It has been applied and extended to a range of applications, see [2, 28, 7, 11, 22].

The idea of MLMC is to reduce algorithmic complexity by performing as much computational work as possible on coarse meshes. To this end, MLMC uses a hierarchy of discretisations of (1

) of increasing accuracy to estimate statistics of

more efficiently, i.e. using a large number of coarse samples to fully capture the variability, but only a handful of fine samples to eliminate the bias due to the spatial discretisation. Here, we employ multilevel methods not only to accelerate the stochastic part, but also to provide a scalable solver for individual realizations of (1).

### 1.2 Parallel methods and algorithms

Current leading-edge supercomputers provide a peak performance in the order of a hundred petaflop/s (i.e. $10^{17}$ floating point operations per second) [33]. However, all these computers draw their computational power from parallelism, with current processor numbers already at $P = \mathcal{O}(10^6)$, see [9]. The technological trend indicates that future exascale computers may use even larger processor counts. Consequently, designing efficient fast parallel algorithms for high performance computers is a challenging task today and will be even more so in the future.

MLMC methods are characterized by three algorithmic levels that are potential candidates for parallel execution. As in a standard Monte Carlo method, the algorithm uses a sequence of classical deterministic problems (samples) that can be computed in parallel. The size of these subproblems varies depending on which level of resolution the samples are computed on. We therefore distinguish between parallelism within an MLMC level and parallelism across MLMC levels. The third algorithmic level is the solver for each deterministic PDE problem, which can be parallelized itself. Indeed, the total number of samples on finer MLMC levels is typically moderate, so that the first two levels of parallelism will not suffice to exploit $\mathcal{O}(10^6)$ processors. Parallel solvers for elliptic PDEs are now able to solve systems with up to $10^{13}$ degrees of freedom on petascale machines [16] with compute times of a few minutes using highly parallel multigrid methods [5, 17]. In this paper, we will illustrate, for a simple model problem in three spatial dimensions, how these different levels of parallelism can be combined and how efficient parallel MLMC strategies can be designed.

To achieve this, we extend the massively parallel Hierarchical Hybrid Grids (HHG) framework [3, 19] that exhibits excellent strong and weak scaling behavior [21, 1] to the MLMC setting. We use the fast multigrid solver in HHG to generate spatially correlated samples of the random diffusion coefficient, as well as to solve the resulting subsurface flow problems efficiently. Furthermore, the hierarchy of discretisations in HHG provides the ideal multilevel framework for the MLMC algorithm.

Parallel solvers may not yield linear speedup, and the efficiency may deteriorate on a large parallel computer system when the problems become too small. In this case, too little work can be executed concurrently and the serial overhead dominates. This effect is well-known and can be understood prototypically in the form of Amdahl’s law [21]. In the MLMC context, problems of drastically different size must be solved. In general, a solver, when applied to a problem of given size, will be characterized by its scalability window, i.e., the processor range for which the parallel efficiency remains above an acceptable threshold. Because of memory constraints, the scalability window will open at a certain minimal processor number. For larger processor numbers the parallel efficiency will deteriorate until the scalability window closes. In practice, additional restrictions imposed by the system and the software permit only specific processor numbers within the scalability window to be used.

MLMC typically leads to a large number of small problems, a small number of very large problems, and a fair number of intermediate size problems. On the coarser levels, the problem size is in general too small to use the full machine. The problem is outside the scalability window and solver-parallelism alone is insufficient. On the other hand, the efficiency of parallelization across samples and across levels typically does not deteriorate, since only little data must be extracted from each sample to compute the final result of the UQ problem. However, on finer levels we may not have enough samples to fill the entire machine. Especially for adaptive MLMC, where the number of samples on each level is not known a priori but must be computed adaptively using data from all levels, this creates a challenging load balancing problem.

A large scalability window of the solver is essential to devise highly efficient execution strategies, but finding the optimal schedule is restricted by a complex combination of mathematical and technical constraints. Thus the scheduling problem becomes in itself a high-dimensional, multi-constrained, discrete optimisation problem. Developing suitable approaches in this setting is one of the main objectives of this paper. See [34, 35] for earlier static and dynamic load balancing approaches.

The paper is structured as follows: In Section 2, we briefly review the MLMC method and its adaptive version. Section 3 introduces the model problem. Here, we use an alternative PDE-based sampling technique for Matérn covariances [25, 29] that allows us to reuse the parallel multigrid solver. In Sections 4 and 5, we define a classification of different parallel execution strategies and develop them into different parallel scheduling approaches. In Section 6, we study the parallel efficiency of the proposed strategies and demonstrate their flexibility and robustness, before finishing in Section 7 with large-scale experiments on advanced supercomputer systems.

## 2 The Multilevel Monte Carlo method

To describe the MLMC method, we assume that we have a hierarchy of finite element (FE) discretisations of (1). Let $\{V_\ell\}_{\ell \ge 0}$ be a nested sequence of FE spaces with $V_\ell \subset V$, mesh size $h_\ell$ and $M_\ell$ degrees of freedom. In the Hierarchical Hybrid Grids (HHG) framework [3, 19], the underlying sequence of FE meshes is obtained via uniform mesh refinement from a coarsest grid $\mathcal{T}_0$, and thus $h_\ell = 2^{-\ell} h_0$ and $M_\ell \approx 8\, M_{\ell-1} = \mathcal{O}(2^{3\ell} M_0)$ in three space dimensions.

Denoting by $u_\ell$ the FE approximation of $u$ on level $\ell$, we have

$$\mathcal{M}_\ell(u_\ell;\omega) = 0, \quad \ell \ge 0. \qquad (2)$$

Here, the (non)linear operator $\mathcal{M}_\ell$ and the functional of interest $Q_\ell := Q(u_\ell)$ may also involve numerical approximations.

### 2.1 Standard Monte-Carlo Simulation

The standard Monte Carlo (MC) estimator for the expected value of $Q_L$ on level $L$ is given by

$$\hat{Q}^{\mathrm{MC}}_{L,N} = \frac{1}{N} \sum_{i=1}^{N} Q_L^{(i)}, \qquad (3)$$

where $Q_L^{(i)}$, $i = 1, \ldots, N$, are independent samples of $Q_L$.

There are two sources of error: (i) The bias error due to the FE approximation. Assuming that $|Q(u) - Q(u_\ell)| \le C_0(\omega)\, M_\ell^{-\alpha}$, for almost all $\omega \in \Omega$ and a rate $\alpha > 0$, it follows directly that there exists a constant $C_b$, independent of $M_L$, such that

$$\big|\mathbb{E}[Q_L - Q]\big| \le C_b\, M_L^{-\alpha} \le \varepsilon_b \qquad (4)$$

for $M_L$ sufficiently large, i.e. $M_L \gtrsim \varepsilon_b^{-1/\alpha}$ (cf. [36]).

(ii) There is a sampling error due to the finite number of samples in (3).

The total error is typically quantified via the mean square error (MSE), given by

$$e\big(\hat{Q}^{\mathrm{MC}}_{L,N}\big)^2 := \mathbb{E}\Big[\big(\hat{Q}^{\mathrm{MC}}_{L,N} - \mathbb{E}[Q]\big)^2\Big] = \big(\mathbb{E}[Q_L - Q]\big)^2 + N^{-1}\, \mathbb{V}[Q_L], \qquad (5)$$

where $\mathbb{V}[Q_L]$ denotes the variance of the random variable $Q_L$. The first term in (5) can be bounded in terms of (4), and the second term in (5) is smaller than a sample tolerance $\varepsilon_s^2$ if $N \ge \varepsilon_s^{-2}\, \mathbb{V}[Q_L]$. We note that for $L$ sufficiently large, $\mathbb{V}[Q_L] \approx \mathbb{V}[Q]$. To ensure that the total MSE is less than $\varepsilon^2$ we choose

$$\varepsilon_s^2 = \theta\, \varepsilon^2 \quad\text{and}\quad \varepsilon_b^2 = (1-\theta)\, \varepsilon^2, \quad \text{for any fixed } 0 < \theta < 1. \qquad (6)$$

Thus, to reduce (5) we need to choose a sufficiently fine FE mesh and a sufficiently large number of samples. This very quickly leads to an intractable problem for complex PDE problems in 3D. The cost $C_L$ for one sample of $Q_L$ depends on the complexity of the FE solver and of the random field generator. Typically it will grow like $C_L \le C_c\, M_L^{\gamma}$, for some $\gamma \ge 1$ and some constant $C_c$, independent of $M_L$ and of $\omega$. Thus, the total cost to achieve a MSE $e(\hat{Q}^{\mathrm{MC}}_{L,N})^2 \le \varepsilon^2$ (the $\varepsilon$-cost) is

$$\mathrm{Cost}\big(\hat{Q}^{\mathrm{MC}}_{L,N}\big) = \mathcal{O}\big(M_L^{\gamma}\, N\big) = \mathcal{O}\big(\varepsilon^{-2-\gamma/\alpha}\big). \qquad (7)$$

For the coefficient field and for the output functional studied below, we only have a small convergence rate $\alpha$ (see Section 3). In that case, even if $\gamma \approx 1$, to reduce the error by a factor 2 the cost grows by a factor of $2^{2+\gamma/\alpha}$, which quickly leads to an intractable problem even in a massively parallel environment.

### 2.2 Multilevel Monte-Carlo Simulation

Multilevel Monte Carlo (MLMC) simulation [14, 6, 2] seeks to reduce the variance of the estimator, and thus to reduce computational time, by recursively using coarser FE models as control variates. By exploiting the linearity of the expectation operator, we avoid estimating $\mathbb{E}[Q_L]$ directly on the finest level $L$ and do not compute all samples to the desired accuracy (bias error). Instead, using the simple identity $\mathbb{E}[Q_L] = \mathbb{E}[Q_0] + \sum_{\ell=1}^{L} \mathbb{E}[Q_\ell - Q_{\ell-1}]$, we estimate the mean on the coarsest level (Level 0) and correct this mean successively by adding estimates of the expected values of $Y_\ell := Q_\ell - Q_{\ell-1}$, for $\ell = 1, \ldots, L$. Setting $Y_0 := Q_0$, the MLMC estimator is then defined as

$$\hat{Q}^{\mathrm{ML}}_{L} := \sum_{\ell=0}^{L} \hat{Y}^{\mathrm{MC}}_{\ell, N_\ell}, \qquad (8)$$

where the numbers of samples $N_\ell$, $\ell = 0, \ldots, L$, are chosen to minimize the total cost of this estimator for a given prescribed sampling error (see Eqn. (11) below). Note that we require the FE solutions $u_\ell$ and $u_{\ell-1}$ on two consecutive levels to compute a sample of $Y_\ell$, for $\ell \ge 1$, and thus two PDE solves, but crucially both with the same $\omega$ and thus with the same PDE coefficient (see Algorithm 1).

The cost of this estimator is

$$\mathrm{Cost}\big(\hat{Q}^{\mathrm{ML}}_{L}\big) = \sum_{\ell=0}^{L} N_\ell\, C_\ell, \qquad (9)$$

where $C_\ell$ is the cost to compute one sample of $Y_\ell$ on level $\ell$. For simplicity, we use independent samples across all levels, so that the standard MC estimators in (8) are independent. Then, the MSE of $\hat{Q}^{\mathrm{ML}}_{L}$ simply expands to

$$e\big(\hat{Q}^{\mathrm{ML}}_{L}\big)^2 = \big(\mathbb{E}[Q_L - Q]\big)^2 + \sum_{\ell=0}^{L} N_\ell^{-1}\, \mathbb{V}[Y_\ell]. \qquad (10)$$

This leads to a hugely reduced variance of the estimator since both FE approximations $u_\ell$ and $u_{\ell-1}$ converge to $u$ and thus $\mathbb{V}[Y_\ell] \to 0$, as $\ell \to \infty$.

By choosing $M_L$ as in (4), we can ensure again that the bias error is less than $\varepsilon_b$, but we still have some freedom to choose the numbers of samples $N_\ell$ on each of the levels, and thus to ensure that the sampling error is less than $\varepsilon_s$. We will use this freedom to minimize the cost in (9) subject to the constraint $\sum_{\ell=0}^{L} N_\ell^{-1}\, \mathbb{V}[Y_\ell] \le \varepsilon_s^2$, a simple discrete, constrained optimization problem with respect to $N_0, \ldots, N_L$ (cf. [14, 6]). It leads to

$$N_\ell = \varepsilon_s^{-2} \left( \sum_{k=0}^{L} \sqrt{\mathbb{V}[Y_k]\, C_k} \right) \sqrt{\frac{\mathbb{V}[Y_\ell]}{C_\ell}}. \qquad (11)$$

Finally, under the assumptions that

$$C_\ell \le C_c\, M_\ell^{\gamma} \quad\text{and}\quad \mathbb{V}[Y_\ell] \le C_v\, M_\ell^{-\beta}, \qquad (12)$$

for some $\gamma, \beta > 0$ and for two constants $C_c$ and $C_v$, independent of $\ell$ and of $\omega$, the $\varepsilon$-cost to achieve $e(\hat{Q}^{\mathrm{ML}}_{L})^2 \le \varepsilon^2$ can be bounded by

$$\mathrm{Cost}\big(\hat{Q}^{\mathrm{ML}}_{L}\big) = \varepsilon_s^{-2} \left( \sum_{\ell=0}^{L} \sqrt{\mathbb{V}[Y_\ell]\, C_\ell} \right)^{2} \le C_{\mathrm{ML}}\, \varepsilon^{-2-\max\left(0,\, \frac{\gamma-\beta}{\alpha}\right)}. \qquad (13)$$

Typically $\beta = 2\alpha$ for smooth functionals $Q$. For CDFs we typically have $\beta = \alpha$.

There are three regimes: $\beta > \gamma$, $\beta = \gamma$ and $\beta < \gamma$. In the case of the exponential covariance, typically $\beta = 2\alpha$ and $\gamma > \beta$, so that the cost in (13) is $\mathcal{O}(\varepsilon^{-\gamma/\alpha})$, which is a full two orders of magnitude faster than the standard MC cost in (7). Moreover, MLMC is optimal for this problem, in the sense that its cost is asymptotically of the same order as the cost of computing a single sample to the same tolerance $\varepsilon$.
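The optimal allocation (11) and the resulting cost saving can be illustrated with a small computation. A minimal sketch, assuming a hierarchy $M_\ell = 8^\ell M_0$ and illustrative (not measured) rates $\beta = 2/3$, $\gamma = 1$:

```python
import math

def optimal_samples(variances, costs, eps_s):
    """Optimal number of samples per level, cf. Eq. (11):
    N_l = eps_s^-2 * sqrt(V[Y_l] / C_l) * sum_k sqrt(V[Y_k] * C_k)."""
    total = sum(math.sqrt(v * c) for v, c in zip(variances, costs))
    return [math.ceil(eps_s ** -2 * math.sqrt(v / c) * total)
            for v, c in zip(variances, costs)]

# Illustrative hierarchy: M_l = 8^l * M_0 in 3D, with assumed rates
# V[Y_l] ~ M_l^-beta (variance decay) and C_l ~ M_l^gamma (cost growth).
M0, L, beta, gamma = 1000.0, 4, 2.0 / 3.0, 1.0
M = [M0 * 8 ** l for l in range(L + 1)]
V = [m ** -beta for m in M]
C = [m ** gamma for m in M]
N = optimal_samples(V, C, eps_s=1e-3)

# Most samples land on the coarse levels, only a handful on the finest;
# compare with taking all N[0] samples on the finest level (naive MC).
mlmc_cost = sum(n * c for n, c in zip(N, C))
mc_cost = N[0] * C[-1]
```

The decay of $N_\ell$ with $\ell$ reflects the MLMC principle of doing as much work as possible on coarse meshes.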

### 2.3 Adaptive Multilevel Monte Carlo

In Algorithm 2 we present a simple sequential, adaptive algorithm from [14, 6] that uses the computed samples to estimate bias and sampling error and thus chooses near-optimal values for $L$ and $N_\ell$. Alternative adaptive algorithms are described in [15, 7, 11]. For the remainder of the paper we will restrict to uniform mesh refinement, i.e. $h_\ell = h_{\ell-1}/2$ and $M_\ell \approx 8\, M_{\ell-1}$ in 3D.

To estimate the bias error, let us assume that $\ell$ is sufficiently large, so that we are in the asymptotic regime, i.e. $|\mathbb{E}[Q_\ell - Q]| \approx C_b\, M_\ell^{-\alpha}$ in (4). Then (cf. [11])

$$\big|\mathbb{E}[Q_\ell - Q]\big| \le \frac{1}{8^{\alpha} - 1}\, \big|\hat{Y}^{\mathrm{MC}}_{\ell, N_\ell}\big|. \qquad (14)$$

Also, using the sample variance estimator $s_\ell^2$ to estimate $\mathbb{V}[Y_\ell]$ and the CPU times from the runs up to date to estimate $C_\ell$, we can estimate

$$N_\ell \approx \varepsilon_s^{-2} \left( \sum_{k=0}^{L} \sqrt{s_k^2\, C_k} \right) \sqrt{\frac{s_\ell^2}{C_\ell}}. \qquad (15)$$
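One update step of such an adaptive algorithm, combining the bias estimate (14) with the sample-number update (15), can be sketched as follows; the per-level variances, costs and tolerances are hypothetical placeholder values:

```python
import math

def bias_estimate(Y_mean_L, alpha):
    """Bias estimate on the finest level, cf. Eq. (14):
    |E[Q_L - Q]| <~ |Y_hat_L| / (8^alpha - 1)."""
    return abs(Y_mean_L) / (8 ** alpha - 1)

def update_samples(s2, costs, eps_s):
    """Eq. (15): N_l ~ eps_s^-2 * sqrt(s2_l / C_l) * sum_k sqrt(s2_k * C_k)."""
    total = sum(math.sqrt(v * c) for v, c in zip(s2, costs))
    return [math.ceil(eps_s ** -2 * math.sqrt(v / c) * total)
            for v, c in zip(s2, costs)]

# Toy adaptive step: sample variances s2_l and measured costs C_l (hypothetical)
s2 = [1e-2, 2.5e-3, 6e-4]
C = [1.0, 8.0, 64.0]
N_new = update_samples(s2, C, eps_s=1e-2)

bias = bias_estimate(Y_mean_L=1e-3, alpha=1.0 / 3.0)
# Add a new level L+1 only if the estimated bias exceeds eps_b.
needs_new_level = bias > 1e-3
```

In the full algorithm this step is repeated, with extra samples (or an extra level) scheduled whenever the estimated sampling error (or bias) exceeds its tolerance.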

## 3 Model problem and deterministic solver

As an example, we consider an elliptic PDE in weak form: Find $u \in V := H_0^1(D)$ such that

$$\int_D \nabla v(x) \cdot \big(k(x,\omega)\, \nabla u(x,\omega)\big)\, dx = \int_D f(x)\, v(x)\, dx, \quad \text{for all } v \in V \text{ and } \omega \in \Omega. \qquad (16)$$

This problem is motivated by subsurface flow. The solution $u$ and the coefficient $k$ are random fields on $D \times \Omega$, related to fluid pressure and rock permeability. For simplicity, we only consider $D = (0,1)^3$, homogeneous Dirichlet conditions and a deterministic source term $f$. If $k(\cdot,\omega)$ is continuous (as a function of $x$) and $k(x,\omega) > 0$ almost surely (a.s.) in $\Omega$, then it follows from the Lax-Milgram Lemma that this problem has a unique solution (cf. [4]). As quantities of interest in Section 7, we consider $Q = u(x^*,\cdot)$, for some $x^* \in D$, or alternatively the flux through some two-dimensional manifold $\Gamma \subset \overline{D}$.

### 3.1 Discretisation

To discretise (16), for each $\omega \in \Omega$, we use standard finite elements on a sequence of uniformly refined simplicial meshes $\{\mathcal{T}_\ell\}_{\ell \ge 0}$. Let $V_\ell$ be the FE space associated with $\mathcal{T}_\ell$, $\mathcal{N}_\ell$ the set of interior vertices, $h_\ell$ the mesh size and $M_\ell$ the number of degrees of freedom. Now, problem (16) is discretised by restricting it to functions $u_\ell, v_\ell \in V_\ell$. Using the nodal basis $\{\phi_j\}$ of $V_\ell$ and expanding $u_\ell = \sum_j U_j^{(\ell)} \phi_j$, this can be written as a linear equation system, where the entries of the system matrix are assembled elementwise based on a four node quadrature formula

$$A^{(\ell)}(\omega)\, U^{(\ell)}(\omega) = F^{(\ell)}, \quad\text{where}\quad A^{(\ell)}_{i,j}(\omega) := \sum_{\tau \in \mathcal{T}_\ell} \nabla\phi_i \cdot \nabla\phi_j \big|_{\tau}\, \frac{|\tau|}{4} \left( \sum_{k=1}^{4} k(x_k^{\tau}, \omega) \right), \quad\text{and}\quad F^{(\ell)}_i := \int_D f\, \phi_i\, dx.$$

Here $x_k^{\tau}$, $k = 1, \ldots, 4$, denote the four vertices of the element $\tau \in \mathcal{T}_\ell$.

The quantity of interest $Q = Q(u)$ is simply approximated by $Q_\ell := Q(u_\ell)$. For $Q_\ell$ to converge to $Q$, as $\ell \to \infty$, we need stronger assumptions on the random field $k$. Let $k(\cdot,\omega) \in C^{0,t}(\overline{D})$, i.e. Hölder-continuous with exponent $t$, and suppose $\max_{x \in \overline{D}} k(x,\cdot)$ and $1/\min_{x \in \overline{D}} k(x,\cdot)$ have bounded second moments. It was shown in [36] that

$$\mathbb{E}\big[(Q(u) - Q(u_\ell))^q\big] = \mathcal{O}\big(h_\ell^{tq}\big) = \mathcal{O}\big(M_\ell^{-tq/3}\big), \quad q = 1, 2. \qquad (17)$$

Hence, the bound in (4) holds with $\alpha = t/3$, and since

$$\mathbb{V}\big[Q(u_\ell) - Q(u_{\ell-1})\big] \le \mathbb{E}\big[(Q(u_\ell) - Q(u_{\ell-1}))^2\big] \le 2 \sum_{r=\ell,\ell-1} \mathbb{E}\big[(Q(u) - Q(u_r))^2\big],$$

the bound in (12) holds with $\beta = 2t/3$.

### 3.2 PDE-based sampling for lognormal random fields

A coefficient function of particular interest is the lognormal random field $k(x,\omega) = \exp(Z(x,\omega))$, where $Z$ is a mean-free, stationary Gaussian random field with exponential covariance

$$\mathbb{E}\big[Z(x,\omega)\, Z(y,\omega)\big] = \sigma^2 \exp\left( -\frac{|x-y|}{\lambda} \right). \qquad (18)$$

The two parameters in this model are the variance $\sigma^2$ and the correlation length $\lambda$. Individual samples of this random field are in $C^{0,t}(\overline{D})$, for any $t < 1/2$. In particular, this means that the convergence rates in (4) and (12) are $\alpha = t/3$ and $\beta = 2t/3$ in this case, for any $t < 1/2$. The field belongs to the larger class of Matérn covariances [25, 26], which also includes smoother, stationary lognormal fields, but we will only consider the exponential covariance in this paper.

Two of the most common approaches to realise the random field above are the Karhunen-Loève (KL) expansion [13] and circulant embedding [8, 20]. While the KL expansion is very convenient for analysis and essential for polynomial expansion methods such as stochastic collocation, it can quickly dominate all the computational cost for short correlation lengths $\lambda$ in three dimensions. Circulant embedding, on the other hand, relies on the Fast Fourier Transform, which may pose limits to scalability in a massively parallel environment. An alternative way to sample $Z$ is to exploit the fact that in three dimensions, mean-free Gaussian fields with exponential covariance are solutions to the stochastic partial differential equation (SPDE)

$$(\kappa^2 - \Delta)\, Z(x,\omega) \stackrel{d}{=} W(x,\omega), \qquad (19)$$

where the right hand side $W$ is Gaussian white noise with unit variance and $\stackrel{d}{=}$ denotes equality in distribution. As shown by Whittle [41], a solution of this SPDE will be Gaussian with exponential covariance, with correlation length $\lambda = 1/\kappa$.

In [25], the authors show how this SPDE can be solved using a FE discretisation, and this will be the approach we use to bring our fast parallel multigrid methods to bear again. Since we only require samples of $Z$ at the vertices of $\mathcal{T}_\ell$, we discretise (19) using again standard finite elements. If $Z_\ell$ now denotes the FE approximation to $Z$, then we approximate $k$ in our quadrature formula by $\exp(Z_\ell(x_k^{\tau}, \omega))$, for all $\tau \in \mathcal{T}_\ell$. It was shown in [25, 31] that $Z_\ell$ converges in a certain weak sense to $Z$, as $\ell \to \infty$. Since (19) is in principle posed on all of $\mathbb{R}^3$, we embed the problem into a larger domain $\widetilde{D} \supset D$ with artificial, homogeneous Neumann boundary conditions on $\partial\widetilde{D}$ (see [31]).
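As an illustration of the SPDE sampling idea (not of the HHG implementation), the following sketch draws an approximate sample of $Z$ by solving (19) with a spectral method on a small grid; the periodic boundary conditions and the grid size are simplifying assumptions replacing the Neumann embedding described above:

```python
import numpy as np

def sample_spde_field(n, kappa, rng, h=1.0):
    """Draw an approximate sample of Z solving (kappa^2 - Laplacian) Z = W
    on a periodic n^3 grid via FFT (a simplification of the FE/Neumann
    embedding used in the paper)."""
    # Discrete white noise: i.i.d. Gaussians scaled to unit variance density
    W = rng.standard_normal((n, n, n)) / h ** 1.5
    freq = 2.0 * np.pi * np.fft.fftfreq(n, d=h)
    kx, ky, kz = np.meshgrid(freq, freq, freq, indexing="ij")
    symbol = kappa ** 2 + kx ** 2 + ky ** 2 + kz ** 2  # symbol of kappa^2 - Lap
    Z = np.fft.ifftn(np.fft.fftn(W) / symbol).real     # solve in Fourier space
    return Z

rng = np.random.default_rng(42)
Z = sample_spde_field(n=32, kappa=4.0, rng=rng)
k = np.exp(Z)  # lognormal diffusion coefficient k = exp(Z)
```

In the paper, the Fourier solve is replaced by the same parallel multigrid solver that is also used for the flow problem (16), which is precisely what makes this sampling approach attractive here.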

## 4 Performance parameters and execution strategies

Although MLMC methods can achieve better computational complexity than standard MC methods, efficient parallel execution strategies are challenging and depend strongly on the performance characteristics of the solver, in our case a multigrid method. The ultimate goal is to distribute the processors to the different subtasks such that the total run time of the MLMC is minimal. This can be formulated as a high dimensional, multi-constraint discrete optimization problem. More precisely, this scheduling problem is in general NP-complete, see, e.g., [10, 12, 23, 37] and the references therein, precluding exact solutions in practically relevant situations.

### 4.1 Characteristic performance parameters

To design an efficient scheduling strategy, we rely on predictions of the time-to-solution and have to take into account fluctuations. Static strategies thus possibly suffer from a significant load imbalance and may result in poor parallel efficiency. Dynamic strategies, such as the greedy load balancing algorithms in [34], which take into account run-time data, are more robust, especially when run-times vary strongly within a level.

For the best performance, the number of processors per sample on level $\ell$ should lie within the scalability window of the PDE solver, where the parallel efficiency is above a prescribed threshold of, e.g., 80%. Due to machine constraints, the admissible processor numbers may be restricted to a subset, such as $P_{\ell,\theta} = 2^{\theta} P_{\ell,0}$ with $0 \le \theta \le S$, where $S$ characterizes the size of the scalability window and $P_{\ell,0}$ is the smallest admissible processor number on level $\ell$. Efficient implementations of 3D multigrid schemes such as, e.g., within HHG [3, 18, 19], have excellent strong scalability and a fairly large scalability window for a parallel efficiency threshold of 80%. The HHG solver has not only good strong scaling properties but also exhibits excellent weak scalability. We can thus assume that $P_{\ell,0} = 2^{3\ell} P_0^{\min}$ for PDEs in 3D. The value of $P_{\ell,0}$ is the number of processors for which the main memory capacity is fully utilized. Multigrid PDE solvers typically achieve the best parallel efficiency for $\theta = 0$, when each subdomain is as large as possible and the ratio of computation to communication is maximal (cf. [17]).

In the following, the time-to-solution for the $i$th sample on level $\ell$ executing on $2^{\theta} P_{\ell,0}$ processors is denoted by $t(i,\ell,\theta)$. We assume that

$$t(i,\ell,\theta) \approx C_{\ell,\theta}(\omega_i)\, t_{\ell,\theta}, \quad 1 \le i \le N_\ell, \quad 0 \le \ell \le L, \quad 0 \le \theta \le S. \qquad (20)$$

Here, $t_{\ell,\theta}$ is a reference time-to-solution per sample on level $\ell$. Several natural choices exist, such as the mode, median, mean or minimum over a sample set. The term $C_{\ell,\theta}(\omega_i)$ encapsulates fluctuations across samples. It depends on the robustness of the PDE solver, as well as on the type of parallel computer system. It is scaled such that it is equal to one if there are no run-time variations. Fig. 1 (right) shows a typical run-time distribution for 2048 samples, each computed with the same number of processors, for fixed values of $\sigma^2$ and $\lambda$ in (18).

Assuming no efficiency loss due to load imbalances and an optimal parallel efficiency for $\theta = 0$, the theoretical optimal mean run-time for the MLMC method is

$$t^{\mathrm{opt}}_{\mathrm{mlmc}} = \frac{P_0^{\min}}{P_{\max}} \sum_{\ell=0}^{L} N_\ell\, 2^{3\ell}\, \mathbb{E}(C_{\ell,0})\, t_{\ell,0}. \qquad (21)$$

There are three main sources of inefficiency in parallel MLMC algorithms: (i) a partly idle machine due to large run-time variations between samples scheduled in parallel, (ii) non-optimal strong scalability of the solver for $\theta > 0$, or (iii) over-sampling, i.e., more samples than required are scheduled to fill the machine. In the following we address (ii) and (iii) in more detail.

The strong parallel efficiency of a solver can be characterized in terms of $t_{\ell,\theta}$. In order to predict $t_{\ell,\theta}$, $0 \le \theta \le S$, we define a surrogate cost function depending on $\theta$ that is motivated by Amdahl’s law [21]:

$$t_{\ell,\theta} \approx t_{\ell,0}\left( B + 2^{-\theta}(1 - B) \right), \qquad \mathrm{Eff}_\ell(\theta) \approx \left( 2^{\theta} B + (1 - B) \right)^{-1}. \qquad (22)$$

The serial fraction parameter $B \in [0,1]$ in (22) quantifies the amount of non-parallelizable work. It can be calibrated from time measurements. For a solver with good scalability properties, $B$ is almost constant over the levels, so that we use a single value on all levels. Fig. 1 (left) shows the typical range of the scalability window. We also see the influence of different serial fraction parameters $B$ on the parallel efficiency and the good agreement of the cost model (22) with averaged measured run-times. The fitted serial fraction parameter $B$ remains small for the different types of PDEs considered within the HHG framework. In an adaptive strategy, we can also use performance measurements from past computations to fit better values of $B$ in the cost predictions for future scheduling steps.
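The surrogate model (22) is easy to evaluate. The following sketch uses an illustrative serial fraction $B = 0.005$ (an assumed value, not a fitted HHG measurement) and shows how a parallel efficiency threshold of 80% bounds the scalability window:

```python
def runtime(t0, theta, B):
    """Surrogate cost model (22): t_theta ~ t_0 * (B + 2^-theta * (1 - B)),
    after theta doublings of the process count; B is the serial fraction."""
    return t0 * (B + 2.0 ** -theta * (1.0 - B))

def efficiency(theta, B):
    """Parallel efficiency Eff(theta) ~ 1 / (2^theta * B + (1 - B))."""
    return 1.0 / (2.0 ** theta * B + (1.0 - B))

B = 0.005  # illustrative serial fraction, calibrated from timings in practice
effs = [efficiency(theta, B) for theta in range(7)]

# Efficiency decays monotonically with each doubling of the process count;
# the scalability window closes where it drops below the 80% threshold.
assert all(a > b for a, b in zip(effs, effs[1:]))
```

For this value of $B$ the efficiency stays above 80% for $\theta \le 5$ and falls below it at $\theta = 6$, illustrating how the fitted serial fraction directly determines the size $S$ of the scalability window.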

Let $J_\ell(\theta)$ denote the number of samples that can at most be computed simultaneously on level $\ell$ if $2^{3\ell+\theta} P_0^{\min}$ processors are used per sample, and by $k^{\mathrm{seq}}_\ell(\theta)$ we denote the number of sequential cycles required to run in total a minimum of $N_\ell$ samples. Then, $J_\ell(\theta)$, $k^{\mathrm{seq}}_\ell(\theta)$ and the associated relative load imbalance are given by

$$J_\ell(\theta) = \left\lfloor \frac{P_{\max}}{2^{3\ell+\theta}\, P_0^{\min}} \right\rfloor, \qquad k^{\mathrm{seq}}_\ell(\theta) = \left\lceil \frac{N_\ell}{J_\ell(\theta)} \right\rceil, \qquad \mathrm{Imb}_\ell(\theta) := 1 - \frac{2^{3\ell+\theta}\, P_0^{\min}\, N_\ell}{k^{\mathrm{seq}}_\ell(\theta)\, P_{\max}}. \qquad (23)$$

We note that $0 \le \mathrm{Imb}_\ell(\theta) < 1$, with $\mathrm{Imb}_\ell(\theta) = 0$ when no load imbalance occurs. For $\mathrm{Imb}_\ell(\theta) > 0$, part of the machine will be idle, either due to the floor operation in $J_\ell(\theta)$ or due to the ceiling operation in $k^{\mathrm{seq}}_\ell(\theta)$.

The remaining processors in the last sequential steps can be used to compute additional samples that improve the accuracy, but are not necessary to achieve the required tolerance, or we can schedule samples on other levels in parallel (see the next section). The product

$$\eta_\ell(\theta) := \big( 1 - \mathrm{Imb}_\ell(\theta) \big)\, \mathrm{Eff}_\ell(\theta) \qquad (24)$$

will be termed MLMC level efficiency, and we note that it also depends on $N_\ell$.

### 4.2 Classification of concurrent execution strategies

We classify execution strategies for MLMC methods in two ways, either referring to the layers of parallelism or to the resulting time-processor diagram.

#### 4.2.1 Layers of parallel execution

Especially on the finer grid levels in MLMC, the number of samples is too small to fully exploit modern parallel systems by executing individual samples in parallel. Multiple layers of parallelism must be identified. In the context of MLMC methods, three natural layers exist:

Level parallelism:

The estimators $\hat{Y}^{\mathrm{MC}}_{\ell,N_\ell}$, $0 \le \ell \le L$, may be computed in parallel.

Sample parallelism:

The $N_\ell$ samples on level $\ell$ may be evaluated in parallel.

Solver parallelism:

The PDE solver to compute an individual sample may be parallelized.

The loops over the levels and over the samples are inherently parallel, except for some minimal postprocessing to compute the statistical quantities of interest. The challenge is how to balance the load between different levels of parallelism and how to schedule the solvers for each sample. Especially in the adaptive setting, without a priori information on the numbers of levels and samples, an exclusive use of level parallelism is not always possible, but in most practical cases, a minimal number of required levels and samples is known a priori. For the moment, we assume $L$ and $N_\ell$, $0 \le \ell \le L$, to be fixed and given. In general, these quantities have to be determined dynamically (cf. Alg. 2).

The concurrent execution can now be classified according to the number of layers of parallelism that are exploited: one, two, or three. Typically, the number of levels and the number of samples on the finer levels are both small compared to the number of available processors on modern supercomputers, and thus solver parallelism is mandatory for large-scale computing. Thus, the only possible one-layer approach on supercomputers is the solver-only strategy. For a two-layer approach, one can either exploit the solver and level layers or the solver and sample layers. Since the number of levels $L+1$ is, in general, quite small, the solver-level strategy has significantly lower parallelization potential than a solver-sample strategy. Finally, the three-layer approach takes into account all three possible layers of parallelism and is the most flexible one.

#### 4.2.2 Concurrency in the processor-time diagram

An alternative way to classify different parallel execution models is to consider the time-processor diagram, where the scheduling of each sample $i$, $1 \le i \le N_\ell$, $0 \le \ell \le L$, is represented by a rectangular box with the height representing the number of processors used. A parallel execution model is called homogeneous bulk synchronous if, at any time in the processor diagram, all tasks execute on the same level with the same number of processors. Otherwise it is called heterogeneous bulk synchronous. The upper row of Fig. 2 illustrates two examples of homogeneous bulk synchronous strategies, whereas the lower row presents two heterogeneous strategies.

The one-layer homogeneous strategy, as shown in Fig. 2 (left), offers no flexibility. The theoretical run-time is simply given by $\sum_{\ell=0}^{L} N_\ell\, t_{\ell,\theta_\ell}$, where $\theta_\ell$ is such that $2^{3\ell+\theta_\ell} P_0^{\min} = P_{\max}$. It guarantees perfect load balancing, but will not lead to a good overall efficiency, since on the coarser levels $\theta_\ell$ is typically significantly larger than $S$. On the coarsest level we may even have $M_0 < P_{\max}$, i.e., fewer grid points than processors. Thus we will not further consider this option.

## 5 Examples for scheduling strategies

Our focus is on scheduling algorithms that are flexible with respect to the scalability window of the PDE solver and robust up to a huge number of processors $P_{\max}$. To solve the optimization problems, we will either impose additional assumptions that allow an exact solution, or we will use meta-heuristic search algorithms such as, e.g., simulated annealing [38, 39]. Before we introduce our scheduling approaches, we comment briefly on technical and practical aspects that are important for the implementation.

##### Sub-communicators

To parallelize over samples as well as within samples, we split the MPI_COMM_WORLD communicator via the MPI_Comm_split command and provide each sample with its own MPI sub-communicator. This requires only minimal changes to the multigrid algorithm and all MPI communication routines can still be used. A similar approach, using the MPI group concept, is used in [35].

##### Random number generator

To generate the samples of the diffusion coefficient we use the approach described in Sect. 3.2. This requires suitable random numbers for the definition of the white noise on the right hand side of (19). For large scale MLMC computations we select the Ran [30] generator, which has a very long period and is thus suitable even for huge numbers of realizations. It is parallelized straightforwardly by choosing different seeds for each process, see, e.g., [24].
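The idea of independent per-process streams can be sketched with NumPy's SeedSequence in place of the Ran generator (an illustrative stand-in, not the generator used in the paper):

```python
import numpy as np

def make_rank_rng(global_seed, rank):
    """Reproducible, statistically independent stream for one MPI rank:
    a fresh SeedSequence spawns child seeds, one per rank."""
    seq = np.random.SeedSequence(global_seed)
    child = seq.spawn(rank + 1)[rank]  # child sequence for this rank
    return np.random.default_rng(child)

rng_a = make_rank_rng(1234, rank=0)
rng_b = make_rank_rng(1234, rank=1)

# The same (seed, rank) pair always reproduces the same stream.
assert make_rank_rng(1234, 0).standard_normal() == rng_a.standard_normal()
```

Per-rank reproducibility is what makes individual samples repeatable for debugging, while the spawning mechanism avoids accidentally correlated streams across ranks.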

We consider now examples for the different classes of scheduling strategies.

### 5.1 Sample synchronous and level synchronous homogeneous

Here, to schedule the samples we assume that the run-time of the solver depends on the level $\ell$ and on the number of associated processors, but not on the particular sample, i.e., $C_{\ell,\theta}(\omega_i) \equiv 1$ in (20). As the different levels are treated sequentially and each concurrent sample is executed with the same number of processors, we can test all possible configurations.

Let $P_{\max}$ be fixed. Then, for a fixed $\theta$, the total time on level $\ell$ is $k^{\mathrm{seq}}_\ell(\theta)\, t_{\ell,\theta}$. We select the largest index $\theta_\ell$ such that

$$\theta_\ell = \operatorname*{arg\,min}_{0 \le \theta \le S} k^{\mathrm{seq}}_\ell(\theta)\, t_{\ell,\theta} = \operatorname*{arg\,max}_{0 \le \theta \le S} \mathrm{Eff}_\ell(\theta)\big( 1 - \mathrm{Imb}_\ell(\theta) \big).$$

Thus the minimization of the run-time per level is equivalent to a maximization of the total level efficiency (24). The computation of $\theta_\ell$ is trivial, provided $t_{\ell,\theta}$ is known for all $0 \le \theta \le S$. We can either set it to be the average of pre-computed timings of the solver on level $\ell$, or we can use (22) with a fitted serial fraction parameter $B$. In that case,

\[
\theta_\ell=\operatorname*{argmin}_{0\le\theta\le S} k^{\mathrm{seq}}_\ell(\theta)\bigl(B+2^{-\theta}(1-B)\bigr).
\]

The level $\ell$ enters this formula only implicitly, through $k^{\mathrm{seq}}_\ell(\theta)$ and through the growth of the per-sample processor numbers with $\ell$. Given $\theta_\ell$, we can group the processors accordingly and run $k^{\mathrm{seq}}_\ell(\theta_\ell)$ sequential steps for each level $\ell$. Note that the actual value of $t_{\ell,0}$ does not influence the selection of $\theta_\ell$. It does of course influence the absolute run-time.
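The selection of the scaling parameter thus reduces to a one-dimensional search over the scalability window. A minimal Python sketch, where the sample number, machine size ($P_{\max}$, $P^0_{\min}$), and serial fraction $B$ are hypothetical values of our choosing, and a level-$\ell$ sample with scale $\theta$ is assumed to occupy $2^{3\ell}2^{\theta}P^0_{\min}$ processes:

```python
import math

def k_seq(N_l, theta, level, P_max, P_min0):
    # Sequential steps on the level when each sample runs on
    # 2**(3*level + theta) * P_min0 processes.
    J = P_max // (2 ** (3 * level + theta) * P_min0)  # concurrent samples
    return math.ceil(N_l / J)

def theta_opt(N_l, level, S, P_max, P_min0, B):
    # theta_l = argmin_theta k_seq(theta) * (B + 2**(-theta) * (1 - B));
    # ties are broken towards the largest index theta.
    cost = lambda th: k_seq(N_l, th, level, P_max, P_min0) * (B + 2.0 ** -th * (1 - B))
    return min(range(S + 1), key=lambda th: (cost(th), -th))

# hypothetical setting: level 2, S = 4, 4096 processes, B = 0.05
print(theta_opt(128, 2, 4, 4096, 1, 0.05))  # -> 0 (many samples: stay small)
print(theta_opt(8, 2, 4, 4096, 1, 0.05))    # -> 3 (few samples: scale up)
```

The two calls illustrate the trade-off: with many samples, small processor counts per sample keep the parallel efficiency high, whereas with few samples, scaling each sample further reduces the number of sequential steps.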

We consider two variants: (i) Sample synchronous homogeneous (SaSyHom) imposes a synchronization step after each sequential step (see Fig. 3, left). Here statistical quantities can be updated after each step. (ii) Level synchronous homogeneous (LeSyHom), where each block of processors executes all of its $k^{\mathrm{seq}}_\ell(\theta_\ell)$ samples without any synchronization (see Fig. 3, centre).

Altogether $J_\ell(\theta_\ell)\,k^{\mathrm{seq}}_\ell(\theta_\ell)\ge N_\ell$ samples are computed. When the run-time does not vary across samples, both strategies result in the same MLMC run-time. If it does vary, then the LeSyHom strategy has the advantage that fluctuations in the run-time are averaged over the sequential steps, and a shorter overall MLMC run-time can be expected for sufficiently large $k^{\mathrm{seq}}_\ell(\theta_\ell)$.

### 5.2 Run-time robust homogeneous

So far, we have assumed that the run-time is sample independent, which is idealistic (see Fig. 1, right). In the experiment in Fig. 1 (right), 3 out of 2048 samples required a significantly longer run-time than the rest. On a large machine, only a few sequential steps are required per level, and in each of these steps the SaSyHom strategy must wait for the slowest concurrent sample. The (empirical) probability that one of the few slow samples delays a sequential step, and hence the whole level, is therefore high, and the expected run-time clearly exceeds the theoretical optimum determined by the actual run-times $t(i,\ell,\theta)$ of the individual samples. For the LeSyHom strategy, in contrast, slow samples are compensated within the processor blocks, so the probability of being delayed by them is small; in all other cases, the shorter run-time is achieved.

Let us now fix $\ell$ again and include the run-time variations in the determination of $\theta_\ell$. Unfortunately, in general, run-time distribution functions are not known analytically, and thus the expected run-time

\[
E_{\ell,\theta}:=\mathbb{E}\Biggl[\max_{1\le j\le J_\ell(\theta)}\Biggl(\sum_{k=1}^{k^{\mathrm{seq}}_\ell(\theta)} t(i_{jk},\ell,\theta)\Biggr)\Biggr] \tag{25}
\]

cannot be computed explicitly. Here, the samples are denoted by $i_{jk}$ with $1\le j\le J_\ell(\theta)$ and $1\le k\le k^{\mathrm{seq}}_\ell(\theta)$, according to their position in the time-processor diagram. The expression in (25) yields the actual, expected run-time on level $\ell$ to compute $N_\ell$ samples with $2^{3\ell}2^{\theta}P^0_{\min}$ processors per sample when no synchronization after the sequential steps is performed.

The main idea is now to compute an approximation $\hat E_{\ell,\theta}$ for $E_{\ell,\theta}$, and then to minimise it for each $\ell$, i.e., to find $\theta_\ell$ such that $\hat E_{\ell,\theta_\ell}\le\hat E_{\ell,\theta}$ for all $0\le\theta\le S$. As a first approximation, we replace $t(i_{jk},\ell,\theta)$ by the approximation in (20) and assume that the distribution of the stochastic cost factor neither depends on the level $\ell$ nor on the scale parameter $\theta$. Furthermore, we approximate the expected value by an average over $\mu$ samples to obtain the approximation

\[
\hat E_{\ell,\theta}(\mu):=\frac{1}{\mu}\sum_{m=1}^{\mu}\max_{1\le j\le J_\ell(\theta)}\Biggl(\sum_{k=1}^{k^{\mathrm{seq}}_\ell(\theta)} C_{0,0}\bigl(\omega^m_{i_{jk}}\bigr)\Biggr)\,t_{\ell,\theta}, \tag{26}
\]

where the $\omega^m_{i_{jk}}$ are independent draws of the stochastic cost factor. If reliable data for $t_{\ell,\theta}$ is available, we define

\[
\theta_\ell:=\operatorname*{argmin}_{0\le\theta\le S}\hat E_{\ell,\theta}(\mu).
\]

Otherwise, we include a further approximation and replace $t_{\ell,\theta}$ by $t_{\ell,0}\bigl(B+2^{-\theta}(1-B)\bigr)$ in (26) before finding the minimum of $\hat E_{\ell,\theta}(\mu)$. Here, we still require an estimate for the serial fraction parameter $B$. To decide on the number of samples $\mu$ in (26), we keep increasing $\mu$ until it is large enough so that $\mu$ and $2\mu$ yield the same $\theta_\ell$. For all our test settings, we found moderate values of $\mu$ to be sufficient.

To evaluate (26), we need some information on the stochastic cost factor $C_{0,0}$, which was assumed to be constant across levels and across the scaling window. We use a run-time histogram associated with a fixed level. This information is either available from past computations or can be built up adaptively within a MLMC method. Having the run-times $t_i$, $1\le i\le n$, of $n$ samples on that level at hand, we emulate $C_{0,0}$ by drawing a pseudo-random integer uniformly from one to $n$ and using the cost factor associated with the corresponding run-time. Having computed the value of $\theta_\ell$, we proceed as for LeSyHom and call this strategy run-time robust homogeneous (RuRoHom). For constant run-times, RuRoHom yields again the same run-times as LeSyHom and as SaSyHom.
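The resulting bootstrap estimate of (26) can be sketched as follows. This is an illustrative Python fragment with a hypothetical run-time histogram; `J`, `k_seq`, and `t_mean` play the roles of $J_\ell(\theta)$, $k^{\mathrm{seq}}_\ell(\theta)$, and $t_{\ell,\theta}$ and are assumed given:

```python
import random

def e_hat(mu, J, k_seq, t_mean, runtimes, rng):
    # Monte Carlo estimate in the spirit of (26): mu repetitions of the
    # maximum over J concurrent blocks of a sum of k_seq stochastic cost
    # factors, drawn from an empirical run-time histogram.
    factors = [t / t_mean for t in runtimes]   # empirical cost factors
    reps = [max(sum(rng.choice(factors) for _ in range(k_seq))
                for _ in range(J))
            for _ in range(mu)]
    return sum(reps) / mu * t_mean

# hypothetical histogram: most samples take ~10s, a few outliers ~20s
runtimes = [10.0] * 60 + [20.0] * 4
t_mean = sum(runtimes) / len(runtimes)
est = e_hat(200, 32, 1, t_mean, runtimes, random.Random(0))
assert t_mean < est <= 20.0
```

With 32 concurrent blocks and no synchronization, some block almost surely draws an outlier, so the estimated level run-time lies clearly above the mean sample run-time; this is exactly the effect that RuRoHom accounts for when selecting $\theta_\ell$.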

### 5.3 Dynamic variants

So far, we have used pre-computed values for $\theta_\ell$ and $k^{\mathrm{seq}}_\ell(\theta_\ell)$ in all variants, and each processor block carries out the computation for exactly $k^{\mathrm{seq}}_\ell(\theta_\ell)$ samples. For large run-time variations, this will still lead to unnecessary inefficiencies. Instead of assigning samples to each processor block a priori, they can also be assigned to the processor blocks dynamically at run-time. As soon as a block terminates a computation, a new sample is assigned to it until the required number $N_\ell$ is reached. This reduces over-sampling and can additionally reduce the total run-time on level $\ell$. However, on massively parallel architectures this will only be efficient if the dynamic distribution of samples does not lead to a significant communication overhead. The dynamic strategy can be combined with either the LeSyHom or the RuRoHom approach, and we denote the resulting variants dynamic level synchronous homogeneous (DyLeSyHom) and dynamic run-time robust homogeneous (DyRuRoHom), respectively. Fig. 3 (right) illustrates the DyLeSyHom strategy. Note specifically that here not all processor blocks execute the same number of sequential steps.

In order to utilize the full machine, it is crucial that no processor is blocked by actively waiting to coordinate the asynchronous execution. The necessary functionality may not be fully supported on all parallel systems. Here we use the MPI 2.0 standard, which permits one-sided communication and thus allows a non-intrusive implementation. The one-sided communication is accomplished by remote direct memory access (RDMA) using registered memory windows. In our implementation, we create a window on one processor to keep track of the number of samples that have already been computed. Exclusive locks guard a get/accumulate combination that accesses this counter.
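The effect of this one-sided counter can be illustrated with a small shared-memory sketch; this is our Python illustration, not the MPI implementation, with an atomic fetch-and-increment playing the role of the locked get/accumulate on the RDMA window:

```python
import multiprocessing as mp

def worker(block_id, counter, n_samples, results):
    # One processor block: atomically fetch-and-increment the shared
    # sample counter (the analogue of get/accumulate under an exclusive
    # lock) until all samples have been claimed.
    while True:
        with counter.get_lock():      # exclusive lock
            sample = counter.value    # "get"
            counter.value += 1        # "accumulate"
        if sample >= n_samples:
            return
        # ... the PDE solve for this sample would run here ...
        results.put((block_id, sample))

def run_dynamic(n_samples, n_blocks):
    ctx = mp.get_context("fork")      # assumes a POSIX system
    counter, results = ctx.Value("i", 0), ctx.Queue()
    blocks = [ctx.Process(target=worker, args=(b, counter, n_samples, results))
              for b in range(n_blocks)]
    for p in blocks:
        p.start()
    claimed = [results.get() for _ in range(n_samples)]
    for p in blocks:
        p.join()
    return sorted(sample for _, sample in claimed)

# every sample index is claimed exactly once, by whichever block is free
assert run_dynamic(20, 4) == list(range(20))
```

As in the MPI version, no block waits for the others; a fast block simply claims more samples, which is exactly the behaviour shown in Fig. 3 (right).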

### 5.4 Heterogeneous bulk synchronous scheduling

Heterogeneous strategies are clearly more flexible than homogeneous ones, but the number of scheduling possibilities grows exponentially. Thus, we must first reduce the complexity of the scheduling problem. In particular, we again ignore run-time variations and assume $t(i,\ell,\theta)=t_{\ell,\theta}$. We also assume that $N_\ell>0$ on all levels $\ell$. Within an adaptive strategy, samples may only be required on some of the levels at certain times, and thus this condition has to hold true only on a subset $I$ of the levels $\{0,\dots,L\}$.

In contrast to the homogeneous setups, we do not aim to find scaling parameters $\theta_\ell$ that minimize the run-time on each level separately, but instead minimize the total MLMC run-time. We formulate the minimization process as two constrained minimization steps that are coupled only in one direction. We have to identify the number $N_{\ell,\theta}$ of samples on level $\ell$ which are carried out in parallel, each with $2^{3\ell}2^{\theta}P^0_{\min}$ processors, as well as the number $k^{\mathrm{seq}}_\ell(\theta)$ of associated sequential steps. Firstly, assuming the $N_{\ell,\theta}$ to be given for all $0\le\ell\le L$, $0\le\theta\le S$, we solve the constrained minimization problem for $k^{\mathrm{seq}}_\ell(\theta)$:

\[
\operatorname*{argmin}_{k^{\mathrm{seq}}_\ell(\theta)\in\mathbb{N}_0}\Bigl(\max_{0\le\theta\le S} t_{\ell,\theta}\,k^{\mathrm{seq}}_\ell(\theta)\Bigr),\qquad \sum_{\theta=0}^{S}N_{\ell,\theta}\,k^{\mathrm{seq}}_\ell(\theta)\ge N_\ell.
\]

Secondly, having $k^{\mathrm{seq}}_\ell(\theta)$ at hand, we find values for $N_{\ell,\theta}$ so as to minimize

\[
\operatorname*{argmin}_{N_{\ell,\theta}\in\mathbb{N}_0}\;\max_{\substack{0\le\theta\le S\\ 0\le\ell\le L}} t_{\ell,\theta}\,k^{\mathrm{seq}}_\ell(\theta),
\]

the expected run-time, subject to the following inequality constraints

\[
0\le N_{\ell,\theta}\le 2^{-3\ell}\,2^{-\theta}\,P_{\max}/P^0_{\min}, \tag{27a}
\]
\[
\sum_{\theta=0}^{S}N_{\ell,\theta}>0 \quad\text{for }\ell\in I, \tag{27b}
\]
\[
\sum_{\ell=0}^{L}\sum_{\theta=0}^{S}N_{\ell,\theta}\,2^{3\ell}\,2^{\theta}\,P^0_{\min}\le P_{\max}. \tag{27c}
\]

We apply integer encoding [32] for the initialization and for possible mutation operators to guarantee that $N_{\ell,\theta}\in\mathbb{N}_0$. Clearly, since all $N_{\ell,\theta}$ are non-negative, (27c) implies (27a). However, even though it is redundant, (27a) is enforced explicitly to restrict the search space in the meta-heuristic optimization algorithm. The condition in (27b) that at least one sample is scheduled on each level at all times could also be relaxed. However, this would require a redistribution of processors in the optimization problem and can significantly increase the algorithmic and technical complexity. If (27b) is violated on some level $\ell\in I$, we schedule one sample on that level. Condition (27c), however, is a hard constraint: the number of processors that are scheduled cannot be larger than $P_{\max}$. If (27c) is violated, we enforce it by repeatedly reducing the $N_{\ell,\theta}$ until it holds. At first glance this possibly leads to an unbalanced work load, but the applied meta-heuristic search strategy compensates for it. With the values of $N_{\ell,\theta}$ identified, the samples are distributed dynamically onto the machine, see also [34].
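The first of the two minimization steps can be solved exactly for a single level: the optimal level run-time is a multiple of one of the step times $t_{\ell,\theta}$, so it suffices to scan these candidates in increasing order. A small illustrative sketch, in which the sample distribution `N_theta` and the step times are hypothetical values, not taken from the experiments:

```python
def k_seq_opt(N_l, N_theta, t_theta):
    # Choose k_seq(theta) in N_0 minimizing the level run-time
    # max_theta t[theta] * k[theta] subject to the coverage constraint
    # sum_theta N_theta[theta] * k[theta] >= N_l.
    candidates = sorted({k * t for t in t_theta for k in range(1, N_l + 1)})
    for T in candidates:
        # largest step numbers that fit into the time budget T
        k = [T // t if n > 0 else 0 for t, n in zip(t_theta, N_theta)]
        if sum(n * ki for n, ki in zip(N_theta, k)) >= N_l:
            return T, k

# hypothetical assignment: one block at theta=0, two blocks at theta=4
T, k = k_seq_opt(10, [1, 0, 0, 0, 2], [160, 80, 40, 20, 10])
assert (T, k) == (50, [0, 0, 0, 0, 5])
```

In the example, the two fast blocks cover all ten samples in five steps of 10 time units each; assigning any work to the slow block could not shorten the level run-time.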

To illustrate the complexity of this optimization task, consider the number of different combinations for the $N_{\ell,\theta}$ that satisfy (27a) but not necessarily (27b) and (27c). Already for moderate values of $L$, $S$, and $P_{\max}/P^0_{\min}$, the number of possible combinations is astronomically large, and even in the special case that the scalability window degenerates, i.e., that $S=0$, a huge number of possibilities remains.

As an example for the following two subsections, we consider $L=3$, $S=4$, and actual run-times from measurements in a set of numerical experiments:

\[
(t_{\ell,\theta})_{0\le\ell\le 3,\,0\le\theta\le 4}=
\begin{pmatrix}
167 & 83.84 & 42.30 & 21.63 & 11.60\\
171 & 86.28 & 44.53 & 23.13 & 12.41\\
177 & 90.40 & 47.07 & 24.21 & 12.97\\
179 & 91.61 & 48.27 & 24.86 & 13.63
\end{pmatrix}. \tag{28}
\]

#### 5.4.1 The degenerate case S=0 and a new auxiliary objective

For $S=0$, a cheap but non-optimal way to choose $N_{\ell,0}$ is

\[
N_{\ell,0}=\Biggl\lfloor\frac{P_{\max}\,N_\ell\,t_{\ell,0}}{\sum_{i=0}^{L}N_i\,2^{3i}\,P^0_{\min}\,t_{i,0}}\Biggr\rfloor. \tag{29}
\]

The corresponding run-time is $\max_{0\le\ell\le L} t_{\ell,0}\,\lceil N_\ell/N_{\ell,0}\rceil$, and the total number of processors used is $\sum_{\ell=0}^{L}N_{\ell,0}\,2^{3\ell}P^0_{\min}\le P_{\max}$. This choice is acceptable when the workload is evenly distributed across levels, which is one of the typical scenarios in MLMC. It also requires that the full machine can be exploited without any imbalance in the workload.
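The allocation rule (29) distributes processor blocks proportionally to the work per level. A short sketch, combining the first column of (28) with hypothetical sample numbers $N_\ell$ and a hypothetical machine size of $2^{15}$ processes with $P^0_{\min}=1$:

```python
import math

def n_level0(N, t0, P_max, P_min0):
    # Proportional allocation (29) for S = 0: N[l] samples with
    # per-sample run-time t0[l] on level l, where a level-l sample
    # occupies 2**(3*l) * P_min0 processes.
    denom = sum(N[i] * 2 ** (3 * i) * P_min0 * t0[i] for i in range(len(N)))
    return [math.floor(P_max * N[l] * t0[l] / denom) for l in range(len(N))]

# hypothetical sample numbers, first column of (28) as per-sample times
blocks = n_level0([512, 64, 8, 1], [167, 171, 177, 179], 2 ** 15, 1)
print(blocks)  # -> [7885, 1009, 130, 16]
```

In this configuration the four levels occupy $7885 + 1009\cdot 8 + 130\cdot 64 + 16\cdot 512 = 32469$ of the $32768$ available processes; the rounding in (29) is what makes the choice cheap but non-optimal.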

Using the first column of (28) in (29), we find as total run-time and the distribution