1 Introduction
Data uncertainties are ubiquitous in many application fields, such as subsurface flow or climate prediction. Inherent uncertainties in input data propagate to uncertainties in quantities of interest, such as the time it takes pollutants leaking from a waste repository to reach a drink water well. This situation has driven the development of novel uncertainty quantification (UQ) methods; most commonly, using partial differential equations (PDEs) to model the physical processes and stochastic models to incorporate data uncertainties. Simulation outputs are then statistics (mean, moments, cumulative distribution function) of the quantities of interest. However, typical samplingandaveraging techniques for computing statistics quickly become infeasible, when each sample involves the numerical solution of a PDE.
1.1 Mathematical model and UQ methods
Let us consider an abstract, possibly nonlinear system of PDEs with uncertain data
(1) 
where the solution is sought in some suitable space of functions with and open and bounded, subject to suitable boundary conditions. is a differential operator depending on a set of random parameters parametrised by an element of the abstract sample space that encapsulates the uncertainty in the data, with the set of all outcomes, the algebra (the “set” of all events), and
the associated probability measure. As a consequence the solution
itself is a random field, i.e. , with realizations in .We are typically only interested in functionals of . To compute them we need to approximate the solution numerically, e.g. using finite element methods, which introduces bias error. The cost typically grows inverse proportionally to some power of the bias error, i.e. where denotes the bias error tolerance. This is a challenging computational task that requires novel methodology combined with cuttingedge parallel computing for two reasons: firstly, real life applications lead to PDE systems in three dimensions that often can only be solved effectively and accurately on a parallel computer (even without data uncertainties); secondly, typical uncertainties in applications, such as a random diffusion coefficient , are spatially varying on many scales and cannot be described by a handful of stochastic parameters. This limits considerably the types of UQ methods that are applicable.
For low dimensional problems, stochastic Galerkin, stochastic collocation and polynomial chaos methods have been shown to provide efficient and powerful UQ tools (see, e.g., [13, 42, 26] and the references therein), but in general their complexity grows exponentially with the stochastic dimension. The cost of sampling methods, such as, e.g., Monte Carlo, does not grow with the stochastic dimension, but classical Monte Carlo is notoriously slow to converge. Multilevel Monte Carlo (MLMC) simulation [14, 6] can help to significantly accelerate the convergence. It has been applied and extended to a range of applications, see [2, 28, 7, 11, 22].
The idea of MLMC is to reduce algorithmic complexity by performing as much computational work as possible on coarse meshes. To this end, MLMC uses a hierarchy of discretisations of (1
) of increasing accuracy to estimate statistics of
more efficiently, i.e. using a large number of coarse samples to fully capture the variability, but only a handful of fine samples to eliminate the bias due to the spatial discretisation. Here, we employ multilevel methods not only to accelerate the stochastic part, but also to provide a scalable solver for individual realizations of (1).1.2 Parallel methods and algorithms
Current leadingedge supercomputers provide a peak performance in the order of a hundred petaflop/s (i.e. floating point operations per second) [33]. However, all these computers draw their computational power from parallelism, with current processor numbers already at see [9]. The technological trend indicates that future exascale computers may use . Consequently, designing efficient fast parallel algorithms for high performance computers is a challenging task today and will be even more so in the future.
MLMC methods are characterized by three algorithmic levels that are potential candidates for parallel execution. As in a standard Monte Carlo method, the algorithm uses a sequence of classical deterministic problems (samples) that can be computed in parallel. The size of these subproblems varies depending on which level of resolution the samples are computed on. We therefore distinguish between parallelism within an MLMC level and parallelism across MLMC levels. The third algorithmic level is the solver for each deterministic PDE problem which can be parallelized itself. Indeed, the total number of samples on finer MLMC levels is typically moderate, so that the first two levels of parallelism will not suffice to exploit processors. Parallel solvers for elliptic PDEs are now able to solve systems with degrees of freedom on petascale machines [16] with compute times of a few minutes using highly parallel multigrid methods [5, 17]. In this paper, we will illustrate for a simple model problem in three spatial dimensions, how these different levels of parallelism can be combined and how efficient parallel MLMC strategies can be designed.
To achieve this, we extend the massively parallel Hierarchical Hybrid Grids (HHG) framework [3, 19] that exhibits excellent strong and weak scaling behavior [21, 1] to the MLMC setting. We use the fast multigrid solver in HHG to generate spatially correlated samples of the random diffusion coefficient, as well as to solve the resulting subsurface flow problems efficiently. Furthermore, the hierarchy of discretisations in HHG provides the ideal multilevel framework for the MLMC algorithm.
Parallel solvers may not yield linear speedup and the efficiency may deteriorate on a large parallel computer system when the problems become too small. In this case, too little work can be executed concurrently and the scalar overhead dominates. This effect is wellknown and can be understood prototypically in the form of Amdahl’s law [21]. In the MLMC context, problems of drastically different size must be solved. In general, a solver, when applied to a problem of given size, will be characterized by its scalability window, i.e., the processor range for which the parallel efficiency remains above an acceptable threshold. Because of memory constraints, the scalability window will open at a certain minimal processor number. For larger processor numbers the parallel efficiency will deteriorate until the scalability window closes. In practice, additional restrictions imposed by the system and the software permit only specific processor numbers within the scalability window to be used.
MLMC typically leads to a large number of small problems, a small number of very large problems, and a fair number of intermediate size problems. On the coarser levels, the problem size is in general too small to use the full machine. The problem is outside the scalability window and solverparallelism alone is insufficient. On the other hand, the efficiency of parallelization across samples and across levels typically does not deteriorate, since only little data must be extracted from each sample to compute the final result of the UQ problem. However, on finer levels we may not have enough samples to fill the entire machine. Especially for adaptive MLMC, where the number of samples on each level is not known a priori but must be computed adaptively using data from all levels, this creates a challenging load balancing problem.
A large scalability window of the solver is essential to devise highly efficient execution strategies, but finding the optimal schedule is restricted by a complex combination of mathematical and technical constraints. Thus the scheduling problem becomes in itself a highdimensional, multiconstrained, discrete optimisation problem. Developing suitable approaches in this setting is one of the main objectives of this paper. See [34, 35] for earlier static and dynamic load balancing approaches.
The paper is structured as follows: In Section 2, we briefly review the MLMC method and its adaptive version. Section 3 introduces the model problem. Here, we use an alternative PDEbased sampling technique for Matérn covariances [25, 29] that allows us to reuse the parallel multigrid solver. In Sections 4 and 5, we define a classification of different parallel execution strategies and develop them into different parallel scheduling approaches. In Section 6, we study the parallel efficiency of the proposed strategies and demonstrate their flexibility and robustness, before finishing in Section 7 with largescale experiments on advanced supercomputer systems.
2 The Multilevel Monte Carlo method
To describe the MLMC method, we assume that we have a hierarchy of finite element (FE) discretisations of (1). Let be a nested sequence of FE spaces with , mesh size and degrees of freedom. In the Hierarchical Hybrid Grids (HHG) framework [3, 19], the underlying sequence of FE meshes is obtained via uniform mesh refinement from a coarsest grid , and thus and in three space dimensions.
Denoting by the FE approximation of on Level , we have
(2) 
Here, the (non)linear operator and the functional of interest may also involve numerical approximations.
2.1 Standard MonteCarlo Simulation
The standard Monte Carlo (MC) estimator for the expected value of on level is given by
(3) 
where , , are independent samples of .
There are two sources of error: (i) The bias error due to the FE approximation. Assuming that , for almost all and a constant , it follows directly that there exists a constant , independent of , such that
(4) 
for (cf. [36]).
(ii) There is a sampling error due to the finite number of samples in (3).
The total error is typically quantified via the mean square error (MSE), given by
(5) 
where
denotes the variance of the random variable
. The first term in (5) can be bounded in terms of (4), and the second term in is smaller than a sample tolerance if . We note that for sufficiently large, . To ensure that the total MSE is less than we choose(6) 
Thus, to reduce (5) we need to choose a sufficiently fine FE mesh and a sufficiently large number of samples. This very quickly leads to an intractable problem for complex PDE problems in 3D. The cost for one sample of depends on the complexity of the FE solver and of the random field generator. Typically it will grow like , for some and some constant , independent of and of . Thus, the total cost to achieve a MSE (the cost) is
(7) 
For the coefficient field and for the output functional studied below, we have only . In that case, even if , to reduce the error by a factor 2 the cost grows by a factor of , which quickly leads to an intractable problem even in a massively parallel environment.
2.2 Multilevel MonteCarlo Simulation
Multilevel Monte Carlo (MLMC) simulation [14, 6, 2] seeks to reduce the variance of the estimator and thus to reduce computational time, by recursively using coarser FE models as control variates. By exploiting the linearity of the expectation operator, we avoid estimating directly on the finest level and do not compute all samples to the desired accuracy (bias error). Instead, using the simple identity , we estimate the mean on the coarsest level (Level ) and correct this mean successively by adding estimates of the expected values of , for . Setting , the MLMC estimator is then defined as
(8) 
where the numbers of samples , , are chosen to minimize the total cost of this estimator for a given prescribed sampling error (see Eqn. (11) below). Note that we require the FE solutions and on two levels to compute a sample of , for , and thus two PDE solves, but crucially both with the same and thus with the same PDE coefficient (see Algorithm 1).
1.  For all levels do  
a.  For do  
i. Set up (2) for on Level and (if ).  
ii. Compute and (if ), as well as .  
b.  Compute .  
2.  Compute using (8). 
The cost of this estimator is
(9) 
where is the cost to compute one sample of on level . For simplicity, we use independent samples across all levels, so that the standard MC estimators in (8) are independent. Then, the MSE of simply expands to
(10) 
This leads to a hugely reduced variance of the estimator since both FE approximations and converge to and thus , as .
By choosing , we can ensure again that the bias error is less than , but we still have some freedom to choose the numbers of samples on each of the levels, and thus to ensure that the sampling error is less than . We will use this freedom to minimize the cost in (9) subject to the constraint , a simple discrete, constrained optimization problem with respect to (cf. [14, 6]). It leads to
(11) 
Finally, under the assumptions that
(12) 
for some and and for two constants and , independent of and of , the cost to achieve can be bounded by
(13) 
Typically for smooth functionals . For CDFs we typically have .
There are three regimes: , and . In the case of the exponential covariance, typically and and thus , which is a full two orders of magnitude faster than the standard MC method. Moreover, MLMC is optimal for this problem, in the sense that its cost is asymptotically of the same order as the cost of computing a single sample to the same tolerance .
2.3 Adaptive Multilevel Monte Carlo
In Algorithm 2 we present a simple sequential, adaptive algorithm from [14, 6] that uses the computed samples to estimate bias and sampling error and thus chooses the optimal values for and . Alternative adaptive algorithms are described in [15, 7, 11]. For the remainder of the paper we will restrict to uniform mesh refinement, i.e. and in 3D.
1.  Set , , and .  
2.  For all levels do  
a.  Compute new samples of until there are .  
b.  Compute and , and estimate .  
3.  Update the estimates for using (15) and  
if , increase and set .  
4.  If all and are unchanged,  
Go to 5.  
Else Return to 2.  
5.  Set . 
3 Model problem and deterministic solver
As an example, we consider an elliptic PDE in weak form: Find such that
(16) 
This problem is motivated from subsurface flow. The solution and the coefficient are random fields on related to fluid pressure and rock permeability. For simplicity, we only consider , homogeneous Dirichlet conditions and a deterministic source term . If is continuous (as a function of ) and almost surely (a.s.) in , then it follows from the LaxMilgram Lemma that this problem has a unique solution (cf. [4]). As quantities of interest in Section 7, we consider , for some , or alternatively , for some twodimensional manifold can be of interest.
3.1 Discretisation
To discretise (16), for each , we use standard finite elements on a sequence of uniformly refined simplicial meshes . Let be the FE space associated with , the set of interior vertices, the mesh size and the number of degrees of freedom. Now, problem (16) is discretised by restricting it to functions . Using the nodal basis of and expanding , this can be written as a linear equation system where the entries of the system matrix are assembled elementwise based on on a four node quadrature formula
where  
Here , denote the four vertices of the element .
The quantity of interest is simply approximated by . For to converge to , as , we need stronger assumptions on the random field . Let , i.e. Höldercontinuous with coefficient , and suppose and have bounded second moments. It was shown in [36] that
(17) 
Hence, the bound in (4) holds with , and since
the bound in (12) holds with .
3.2 PDEbased sampling for lognormal random fields
A coefficient function of particular interest is the lognormal random field , where is a meanfree, stationary Gaussian random field with exponential covariance
(18) 
The two parameters in this model are the variance and the correlation length . Individual samples of this random field are in , for any . In particular, this means that the convergence rates in (4) and (12) are and in this case, for any . The field belongs to the larger class of Matérn covariances [25, 26], which also includes smoother, stationary lognormal fields, but we will only consider the exponential covariance in this paper.
Two of the most common approaches to realise the random field above are KarhunenLoeve (KL) expansion [13] and circulant embedding [8, 20]. While the KL expansion is very convenient for analysis and essential for polynomial expansion methods such as stochastic collocation, it can quickly dominate all the computational cost for short correlation lengths
in three dimensions. Circulant embedding, on the other hand, relies on the Fast Fourier Transform, which may pose limits to scalability in a massively parallel environment. An alternative way to sample
is to exploit the fact that in three dimensions, meanfree Gaussian fields with exponential covariance are solutions to the stochastic partial differential equation (SPDE)(19) 
where the right hand side
is Gaussian white noise with unit variance and
denotes equality in distribution. As shown by Whittle [41], a solution of this SPDE will be Gaussian with exponential covariance and .In [25], the authors show how this SPDE can be solved using a FE discretisation and this will be the approach we use to bring our fast parallel multigrid methods to bear again. Since we only require samples of at the vertices of , we discretise (19) using again standard finite elements. If now denotes the FE approximation to , then we approximate in our qudrature formula by , for all . It was shown in [25, 31] that converges in a certain weak sense to with , for any . Since (19) is in principle posed on all of we embed the problem into the larger domain with artifical, homogeneous Neumann boundary conditions on (see [31]).
4 Performance parameters and execution strategies
Although MLMC methods can achieve better computational complexity than standard MC methods, efficient parallel execution strategies are challenging and depend strongly on the performance characteristics of the solver, in our case a multigrid method. The ultimate goal is to distribute the processors to the different subtasks such that the total run time of the MLMC is minimal. This can be formulated as a high dimensional, multiconstraint discrete optimization problem. More precisely, this scheduling problem is in general NPcomplete, see, e.g., [10, 12, 23, 37] and the references therein, precluding exact solutions in practically relevant situations.
4.1 Characteristic performance parameters
To design an efficient scheduling strategy, we rely on predictions of the timeto solution and have to take into account fluctuations. Static strategies thus possibly suffer from a significant load imbalance and may result in poor parallel efficiency. Dynamic strategies, such as the greedy load balancing algorithms in [34], which take into account runtime data are more robust, especially when runtimes vary strongly within a level.
For the best performance, the number of processors per sample on level should lie within the scalability window of the PDE solver, where the parallel efficiency is above a prescribed threshold of, e.g., 80%. Due to machine constraints, may be restricted to a subset, such as , where characterizes the size of the scalability window and . Efficient implementations of 3D multigrid schemes such as, e.g., within HHG [3, 18, 19], have excellent strong scalability and a fairly large scalability window, with a typical value of for a parallel efficiency threshold of . The HHG solver has not only good strong scaling properties but also exhibits excellent weak scalability. We can thus assume that and for PDEs in 3D. The value of is the number of processors for which the main memory capacity is fully utilized. Multigrid PDE solvers typically achieve the best parallel efficiency for , when each subdomain is as large as possible and the ratio of computation to communication is maximal (cf. [17]).
In the following, the timeto solution for the th sample on level executing on processors is denoted by . We assume that
(20) 
Here, is a reference timeto solution per sample on level . Several natural choices exist, such as the mode, median, mean or minimum over a sample set. The term encapsulates fluctuations across samples. It depends on the robustness of the PDE solver, as well as on the type of parallel computer system. It is scaled such that it is equal to one if there are no runtime variations. Fig. 1 (right) shows a typical runtime distribution for samples each of which was computed on processors with and in (18).
Assuming no efficiency loss due to load imbalances and an optimal parallel efficiency for , the theoretical optimal mean runtime for the MLMC method is
(21) 
There are three main sources of inefficiency in parallel MLMC algorithms: (i) a partly idle machine due to large runtime variations between samples scheduled in parallel, (ii) nonoptimal strong scalability properties of the solver, i.e., , or (iii) oversampling, i.e., more samples than required are scheduled to fill the machine. In the following we address (ii) and (iii) in more detail.
The strong parallel efficiency of a solver can be charaterized in terms of . In order to predict , , we define a surrogate cost function depending on that is motivated by Amdahl’s law [21]:
(22) 
The serial fraction parameter in (22) quantifies the amount of nonparallelizable work. It can be calibrated from time measurements. For a solver with good scalability properties, is almost constant over the levels so that we use a single value on all levels. Fig. 1 (left) shows the typical range of the scalability window, i.e., , for and . We also see the influence of different serial fraction parameters on the parallel efficiency and the good agreement of the cost model (22) with averaged measured runtimes. The fitted serial fraction parameter lies in the range of for different types of PDE within the HHG framework. In an adaptive strategy, we can also use performance measurements from past computations to fit better values of in the cost predictions for future scheduling steps.
Let denote the number of samples that can be at most computed simultaneously on level if processors are used per sample, and by we denote the number of required sequential cycles to run in total a minimum of samples. Then, , and the associated relative load imbalance are given by
(23) 
We note that , with when no load imbalance occurs. For , part of the machine will be idle either due to the or due to .
The remaining processors in the last sequential steps can be used to compute additional samples that improve the accuracy, but are not necessary to achieve the required tolerance, or we can schedule samples on other levels in parallel (see the next section). The product
(24) 
will be termed MLMC level efficiency and we note that it also depends on .
4.2 Classification of concurrent execution strategies
We classify execution strategies for MLMC methods in two ways, either referring to the layers of parallelisms or to the resulting timeprocessor diagram.
4.2.1 Layers of parallel execution
Especially on the finer grid levels in MLMC, the number of samples is too small to fully exploit modern parallel systems by executing individual samples in parallel. Multiple layers of parallelism must be identified. In the context of MLMC methods, three natural layers exist:
 Level parallelism:

The estimators on level may be computed in parallel.
 Sample parallelism:

The samples on level may be evaluated in parallel.
 Solver parallelism:

The PDE solver to compute sample may be parallelized.
The loops over the levels and over the samples are inherently parallel, except for some minimal postprocessing to compute the statistical quantities of interest. The challenge is how to balance the load between different levels of parallelism and how to schedule the solvers for each sample. Especially in the adaptive setting, without a priori information, an exclusive use of level parallelism is not always possible, but in most practical cases, a minimal number of required levels and samples is known a priori. For the moment, we assume and , , to be fixed and given. In general, these quantities have to be determined dynamically (cf. Alg. 2).
The concurrent execution can now be classified according to the number of layers of parallelism that are exploited: one, two, or three. Typically, and on modern supercomputers, and thus solver parallelism is mandatory for largescale computing. Thus, the only possible onelayer approach on supercomputers is the solveronly strategy. For a twolayer approach, one can either exploit the solver and level layers or the solver and sample layers. Since the number of levels is, in general, quite small, the solverlevel strategy has significantly lower parallelization potential than a solversample strategy. Finally, the threelayer approach takes into account all three possible layers of parallelism and is the most flexible one.
4.2.2 Concurrency in the processortime diagram
An alternative way to classify different parallel execution models is to consider the timeprocessor diagram, where the scheduling of each sample , , , is represented by a rectangular box with the height representing the number of processors used. A parallel execution model is called homogeneous bulk synchronous if at any time in the processor diagram, all tasks execute on the same level with the same number of processors. Otherwise it is called heterogeneous bulk synchronous. The upper row of Fig. 2 illustrates two examples of homogeneous bulk synchronous strategies, whereas the lower row presents two heterogeneous strategies.
The onelayer homogeneous strategy, as shown in Fig. 2 (left), offers no flexibility. The theoretical runtime is simply given by , where is such that . It guarantees perfect load balancing, but will not lead to a good overall efficiency since on the coarser levels is typically significantly larger than . On the coarsest level we may even have , i.e., less grid points than processors. Thus we will not further consider this option.
5 Examples for scheduling strategies
Our focus is on scheduling algorithms that are flexible with respect to the scalability window of the PDE solver and robust up to a huge number of processors
. To solve the optimization problems, we will either impose additional assumptions that allow an exact solution, or we will use metaheuristic search algorithms such as, e.g., simulated annealing
[38, 39]. Before we introduce our scheduling approaches, we comment briefly on technical and practical aspects that are important for the implementation.Subcommunicators
To parallelize over samples as well as within samples, we split the MPI_COMM_WORLD communicator via the MPI_Comm_split command and provide each sample with its own MPI subcommunicator. This requires only minimal changes to the multigrid algorithm and all MPI communication routines can still be used. A similar approach, using the MPI group concept, is used in [35].
Random number generator
To generate the samples of the diffusion coefficient we use the approach described in Sect. 3.2. This requires suitable random numbers for the definition of the white noise on the right hand side of (19). For large scale MLMC computations we select the Ran [30] generator that has a period of and is thus suitable even for realizations. It is parallelized straightforwardly by choosing different seeds for each process, see, e.g., [24].
We consider now examples for the different classes of scheduling strategies.
5.1 Sample synchronous and level synchronous homogeneous
Here, to schedule the samples we assume that the runtime of the solver depends on the level and on the number of associated processors, but not on the particular sample , . As the different levels are treated sequentially and each concurrent sample is executed with the same number of processors, we can test all possible configurations.
Let be fixed. Then, for a fixed , the total time on level is . We select the largest index such that
Thus the minimization of the runtime per level is equivalent to a maximization of the total level efficiency. The computation of is trivial provided is known for all . We can either set it to be the average of precomputed timings of the solver on level or we can use (22) with a fitted serial fraction parameter . In that case,
The level only enters this formula implicitly, through and through the growth factor . Given , we can group the processors accordingly and run sequential steps for each level . Note that the actual value of does not influence the selection of . It does of course influence the absolute runtime.
We consider two variants: (i) Sample synchronous homogeneous (SaSyHom) imposes a synchronization step after each sequential step (see Fig. 3, left). Here statistical quantities can be updated after each step. (ii) Level synchronous homogeneous (LeSyHom), where each block of processors executes all without any synchronization (see Fig. 3, centre).
Altogether samples are computed. When the runtime does not vary across samples, both strategies will results in the same MLMC runtime. If it does vary then the LeSyHom strategy has the advantage that fluctuations in the runtime will be averaged and a shorter overall MLMC runtime can be expected for sufficiently large .
5.2 Runtime robust homogeneous
So far, we have assumed that the runtime is sample independent, which is idealistic (see Fig. 1, right). In the experiment in Fig. 1 (right), 3 out of 2048 samples required a runtime of on processors. On a large machine with and with we need only sequential steps on level . Therefore, the (empirical) probability that the SaSyHom strategy leads to a runtime of is about , while the theoretical optimal runtime is . Here, is the actual runtime of the ith sample from Fig. 1 (right). The probability that the LeSyHom strategy leads to a runtime of is less than ; in all other cases, a runtime of is achieved.
Let us now fix again and include runtime variations in the determination of . Unfortunately, in general, runtime distribution functions are not known analytically, and thus the expected runtime
(25) 
cannot be computed explicitly. Here, the samples are denoted by with and , related to their position in the timeprocessor diagram. The expression in (25) yields the actual, expected runtime on level to compute samples with processors per sample when no synchronization after the sequential steps is performed.
The main idea is now to compute an approximation for , and then to minimise for each , i.e., to find such that for all . As a first approximation, we replace by the approximation in (20) and assume that the stochastic cost factor distribution neither depends on the level nor on the scale parameter . Furthermore, we approximate the expected value by an average over samples to obtain the approximation
(26) 
where . If reliable data for is available we define
Otherwise, we include a further approximation and replace by in (26) before finding the minimum of . Here, we still require an estimate for the serial fraction parameter . To decide on the number of samples in (26), we keep increasing until it is large enough so that and yield the same . For all our test settings, we found that is sufficient.
To evaluate (26), we need some information on the stochastic cost factor which was assumed to be constant across levels and across the scaling window. We use a runtime histogram associated with level . This information is either available from past computations or can be built up adaptively within a MLMC method. Having the runtimes , , of samples on level at hand, we emulate by using a pseudo random integer generator from a uniform discrete distribution ranging from one to and replace the obtained value by . Having computed the value of , we proceed as for LeSyHom and call this strategy runtime robust homogeneous (RuRoHom). For constant runtimes, RuRoHom yields again the same runtimes as LeSyHom and as SaSyHom.
5.3 Dynamic variants
So far, we have used precomputed values for and in all variants, and each processor block carries out the computation for exactly samples. For large runtime variations, this will still lead to unnecessary inefficiencies. Instead of assigning samples to each processor block apriori, they can also be assigned to the processor blocks dynamically at runtime. As soon as a block terminates a computation, a new sample is assigned to it until the required number is reached. This reduces oversampling and can additionally reduce the total runtime on level . However, on massively parallel architectures this will only be efficient when the dynamic distribution of samples does not lead to a significant communication overhead. The dynamic strategy can be combined with either the LeSyHom or the RuRoHom approach and we denote them dynamic level synchronous homogeneous (DyLeSyHom) and dynamic runtime robust homogeneous (DyRuRoHom), respectively. Fig. 3 (right) illustrates the DLeSyHom strategy. Note specifically that here not all processor blocks execute the same number of sequential steps.
In order to utilize the full machine, it is crucial that no processor is blocked by actively waiting to coordinate the asynchronous execution. The necessary functionality may not be fully supported on all parallel systems. Here we use the MPI 2.0 standard that permits onesided communication and thus allows a nonintrusive implementation. The onesided communication is accomplished by remote direct memory access (RDMA) using registered memory windows. In our implementation, we create a window on one processor to synchronize the number of samples that are already computed. Exclusive locks are performed on a get/accumulate combination to access the number of samples.
5.4 Heterogeneous bulk synchronous scheduling
Heterogeneous strategies are clearly more flexible than homogeneous ones, but the number of scheduling possibilities grows exponentially. Thus, we must first reduce the complexity of the scheduling problem. In particular, we ignore again runtime variations and assume . We also assume that on all levels . Within an adaptive strategy, samples may only be required on some of the levels at certain times and thus this condition has to hold true only on a subset of .
In contrast to the homogeneous setups, we do not aim to find scaling parameters that minimize the runtime on each level separately, but instead minimize the total MLMC runtime. We formulate the minimization process as two constrained minimization steps that are coupled only in one direction, where we have to identify the number of samples on level which are carried out in parallel with processors, as well as the number of associated sequential steps. Firstly, assuming to be given, for all , , we solve the constrained minimization problem for
Secondly, having at hand, we find values for such as to minimize
the expected runtime, subject to the following inequality constraints
(27a)  
(27b)  
(27c) 
We apply integer encoding [32] for the initialization and for possible mutation operators to guarantee that . Clearly, if then (27c) implies (27a). However, even though it is redundant, (27a) is enforced explicitly to restrict the search space in the metaheuristic optimization algorithm. The condition in (27b) that at least one sample is scheduled on each level at all times could also be relaxed. However, this would require a redistribution of processors in the optimization problem and can significantly increase the algorithmic and technical complexity. If (27b) is violated on some level , we set . Condition (27c), however, is a hard constraint. The number of processors that are scheduled cannot be larger than . If (27c) is violated, we enforce it by a repeated multiplication of by until it holds. At first glance this possibly leads to an unbalanced work load, but the applied metaheuristic search strategy compensates for it. With the values of identified, the samples are distributed dynamically onto the machine, see also [34].
To illustrate the complexity of this optimization task, we consider the number of different combinations for that satisfy (27a) but not necessarily (27b) and (27c). For example, for , , and , there are possible combinations. Even for the special case that the scalability window degenerates, i.e., that , there are still possibilities.
As an example for the following two subsections, we consider and actual runtimes from measurements in a set of numerical experiments:
(28) 
5.4.1 The degenerate case and a new auxiliary objective
For , a cheap but nonoptimal way to choose is
(29) 
The corresponding runtime is . The total number of processors is . This choice is acceptable when the workload is evenly distributed across levels, which is one of the typical scenarios in MLMC. It also requires that the full machine can be exploited without any imbalance in the workload.