1. Introduction
UQ requires the solution of stochastic PDEs with random data. Some methods for solving stochastic PDEs, e.g. stochastic Galerkin [lemaitre2010spectral, ghanem1991stochastic], are based on a standard approximation in space like Finite Elements or finite volumes, and different types of polynomial expansions in the stochastic direction [Xiu2003]
. Although generally powerful, these techniques typically suffer two significant drawbacks. The first one is being intrusive, i.e. a code that can be used to solve a deterministic problem must be modified to solve the stochastic analogue. The second one is the poor scaling in the dimension of the stochastic space, i.e. suffering from the socalled “curse of dimensionality”
[Babuska2010].In contrast, sampling methods, i.e. Monte Carlo (MC
) and its variants, have a convergence rate which is independent of the stochastic dimension and are nonintrusive. The only necessary assumption for such methods to converge to the exact statistics of the solution when the number of input samples tends to infinity is the existence of finite second moments, i.e. finite variance. This assumption is easily satisfied for many physical systems, making the method generally applicable to many problems of practical interest. They require the repeated evaluation of a deterministic model with different randomly generated inputs and can be implemented in a way that does not require intrusive modifications of the deterministic code. However, the number of samples required to achieve statistical convergence combined with the complexity of the computational model required to have enough spatial and/or temporal accuracy can make the
MC method rather expensive. Indeed, the reduction of overall computational cost to achieve a desired error tolerance is the principal motivation for the development of its variants, like the MLMC method considered herein.Stochastic collocation methods [Babuska2010] are also based on polynomial expansions but using orthogonal basis and appropriate quadratures. The final computation requires independent evaluations of the deterministic model, in the same way as sampling methods, thus sharing the nonintrusiveness property. Appropriate quadratures are required to mitigate the curse of dimensionality, which is a field of intense research [Xiu2005, Chen2017]. Similarly, importance sampling methods [Rauhut2012, Hampton2015, Adcock2017], and the deterministic design of experiments [Diaz2018b] are also known to reduce the necessary number of model evaluations, particularly in the contexts of function approximation [Luthen2021]. The algorithms discussed herein can be extended for parallelization of these “smart sampling” methods as well, but the required modifications are beyond the scope of this article.
Focus on this important complexity reduction makes MLMC [Kebaier2005, Giles2008, Giles2012, Mishra2012, Mishra2012a, Pisaroni2017], and its adaptive variants [Giles2009, Cliffe2011, Elfverson, Collier2015] arguably the most practical extensions of MC. These methods rely on different levels of computational effort for the deterministic model, e.g. a hierarchy of spatial or temporal meshes. These methods seek to combine samples on each level to benefit from the high accuracy of the expensive, higher levels, and the low computational cost of the less accurate, lower levels. A key aspect of the algorithm is the evaluation of how many samples should be used at each level.
Sampling methods for UQ, including MLMC, belong to a general class of methods for outerloop applications [Peherstorfer2018, Section 1.2], defined as computational applications in which there is an outer loop around a model that is called at each iteration to evaluate a function. This includes optimization and statistical inference apart from UQ. For example, in the case of optimization under uncertainty, a UQ problem is solved as an inner loop within each step of the outer loop of the optimization algorithm [de2020touu]. A big effort was performed during the 90s for the development of the Dakota library [Eldred1998, Eldred2000]
aimed to deal with this class of problems. Parallelization opportunities where classified into two areas, algorithmic and function evaluation
^{1}^{1}1The classification in [Eldred1998, Eldred2000] actually included four categories but two are enough for the current discussion about MLMC. and it was already observed that the former requires very little or no interprocessor communication in contrast to the latter. Therefore, the first question posed was how to select the amount of parallelism used on each area, i.e. one has to decide between the assignment of processors to the parallelization of the model or the parallelization of the outer loop (with the concurrent execution of several models). The analysis in [Eldred1998, Eldred2000] shows that the most efficient choice is to use the minimum number of processors that permit to run the model, for the simple reason that some loss of efficiency occurs in strong scaling.The parallelization of MLMC has been considered in recent years [Sukys2014, Sukys2014a, Gmeiner2016, Shegunov2020, Zakharov2020, Gantner2016, Baumgarten2021] as multiple levels of spatial/temporal discretization introduce additional scheduling challenges to effectively utilize resources of a parallel environment with minimal idling. One can conceptually exploit three levels of parallelism in MLMC. The first is the parallelization of the (deterministic) PDE solver itself, which is often dependent on the level, meaning that different levels will use different computational resources. The second is the parallelization of the different MC samples on each level of MLMC, which is the easier one given the independence between them and the large number required, especially at lower discretization levels. The third one is the parallelization between levels, which we found important for the optimality of the algorithm, as discussed below. The two main ingredients of the implementation are i) a processor partitioning scheme that and ii) a scheduling strategy and the goal is to maximize efficiency.
Scheduling strategies for MLMC sampling can be classified [Gmeiner2016] in three different ways. First, they are classified according to the number of parallelization layers exploited simultaneously, inside samples, between samples and between levels. Second, they are classified as homogeneous, when all the processors are working to sample on the same discretization level with the same number of processors per sample, and heterogeneous otherwise. It is important to note that, according to this definition, heterogeneous scheduling includes many different possibilities, e.g. using different number of processors for running samples at the same level (which is considered in [Gmeiner2016]) or running samples at different levels simultaneously (which is not considered in [Gmeiner2016]). A final classification is between static scheduling, when the assignment of work is done before starting to sample, and dynamic scheduling, where assignment is made “on the fly”. It is also important to note that the term “dynamic” is used in [Gmeiner2016] to refer to scheduling strategies where the work is assigned depending on a cost estimation made at the beginning of the calculation but before starting to sample. Several scheduling algorithms are studied in [Gmeiner2016], mostly homogeneous, but also a couple of heterogeneous variants (with and without strong scaling), both static and dynamic, although three layer parallelism was not investigated.
A scheduling with three layer parallelism was proposed in [Sukys2012, Sukys2014, Sukys2014a] with two variants, a static one in which the distribution is made based on work estimates given by the size of the discretization at each level and one in which the distribution of work depends also on the PDE coefficients (which are sample dependent) and is performed at execution time but before starting to sample. In these references, “adaptive load balancing” is used to refer to this second algorithm and it is explicitly stated that it is not “dynamic” see [Sukys2014a, page 111], although it is categorized in this way in [Gmeiner2016]. The processor partitioning scheme in [Sukys2012, Sukys2014, Sukys2014a] divides available units into levels, each of them composed by samplers that contain several cores each, that is, the computational power is distributed into many levels and samples at a given time and assignment is made depending on the load balancing strategy just mentioned. For this scheme a 2approximation is proved in [Sukys2014, Sukys2014a] assuming strong scaling of the PDE solver, that is, the execution time is at most twice the optimal. The observed efficiency is improved by the adaptive load balancing algorithm although actual computational times are slower than in the static case [Sukys2014, Figs. 3 and 4], which shows a very good strong and weak scaling up to 40k cores. A similar strategy was followed in [Gantner2016], where an object oriented implementation of several sampling methods, including MC and Quasi Monte Carlo (QMC), either single or multilevel is presented.
Another dynamic scheduling strategy for MLMC was recently proposed in [Tosi2021]. It exploits general purpose scheduling algorithms in [Tejedor2017] which permit to deal with complex dependencies between tasks and this is exploited to extract parallelism between samples and levels. Although in principle it can be also used with parallel sampling, the mapping of tasks to processors and its effects on load balancing is not discussed. Moreover, parallelization at the sample level is not shown in the numerical examples.
In this article we propose the first dynamic scheduling strategy with three layer parallelism that works on top of a multiple partition of the set of available processors. Given the number of levels of the MLMC algorithm and assuming a weak scaling of the PDE solver we define the number of processors per sample in terms of the number of processors used in the coarsest one, as in [Gmeiner2016]. We then consider multiple partitions of the set of available processors by these numbers of processors, thus obtaining different partitions, where is the maximum number of levels of the MLMC method. Together with this multiple partition we present a dynamic scheduling algorithm that assigns tasks to available resources prioritizing finer levels but allowing the concurrent execution of samples at different levels. This distinctive feature of our implementation admits a proof that the algorithm is a 2approximation, with a very small idling actually occurring only at the end of the computation or at the end of an adaptive MLMC iteration. Besides, excellent scalability is obtained, both strong and weak. With some additional modifications discussed below, this scalability is observed even for the difficult case of models having low persample evaluation times. Therefore, the novelty of this approach includes

a dynamic scheduling algorithm that exploits three layer parallelism: for each individual sample, across samples, and across levels;

the description of an MPI implementation based on a masterslave strategy with a multiple partition of slave processors;

a simple yet robust, dynamic batch sampling mechanism to reduce the communication between the master and the slaves;

a parallelization of the master for eliminating a bottleneck at extreme scales;

the numerical demonstration of the scalability of the implementation, also when stressed by short sampling times.
The article is structured as follows. Section 2 describes the MLMC method and an adaptive variant implemented herein and includes the description of the PDE we consider as an example. Section 3 describes the scheduling in an abstract way and its 2approximation property. Section 4 describes the actual implementation, including the parallel partition strategy. Section 5 presents several examples to demonstrate the weak and strong scalability of the implementation efficiency.
2. The multilevel Monte Carlo method
The principal idea of MLMC is to exploit a hierarchy of model discretizations to reduce the overall computational cost by transferring the majority of the sampling cost to the cheaper models, and having an accuracy governed by the more expensive models. The method is appropriate for any hierarchy of models for which convergence of the Quantity of Interest (QoI) is known, although such knowledge is not generally necessary.
2.1. Model problem
In this work we consider the following elliptic stochastic problem, although the algorithm permits consideration of a general class of PDE problems using parallel solvers. Given an oriented manifold and its corresponding interior domain , find such that
(1) 
where
, denotes the uncertainty, described by a complete probability space. Additionally, the stochastic coefficients, i.e. the diffusion
, forcing term , and boundary condition may be considered random fields too. We note that it is usual in the literature to assume these to be deterministic when stochastic domains are considered [Chaudhry20181127, Mohan2011874, Xiu2006, Harbrecht2008, Harbrecht2013, Harbrecht2016, Dambrine2017943, Dambrine2016921]. Let us assume that the realizations of and are bounded almost surely. We can define a bounded artificial domain that contains all possible realizations of , i.e. . We also assume that and are defined in , independently of . With the random solution of this problem at hand we aim to compute where is a QoI, e.g. an integral of on a subregion or surface, or an evaluation of at a specific point , and 𝔼 the expectation.The problem defined by (1) is wellposed under the assumption of uniform ellipticity [Barth2011, Babuska2010, Chaudhry20181127], i.e. there exists such that (and if a stochastic diffusion is considered). Under these assumptions, the bilinear form associated to the weak form of (1) is bounded and coercive and the LaxMilgram lemma guarantees a solution for any uniformly bounded by and [Chaudhry20181127].
The discretization of (1) is constructed on top of a grid , introducing a Lagrangian FE space. When the domain is stochastic we use the Aggregated Finite Element Method (AgFEM) method in which the grid is a shape regular partition of a background domain
which is typically a bounding box. A discrete approximation is then constructed identifying cut cells of the background mesh and building a subtriangulation on each of them. This construction permits the integration of the weak form of the problem and a judicious choice of the degrees of freedom can obtain a well posedproblem. The
AgFEM method was introduced in [Badia2017a], implemented in parallel in [Verdugo2019] and exploited to perform UQ in random domains in [Badia2021]. The reader is referred to these publications for further details.2.2. The standard Mlmc method
The MLMC method distributes sampling on a hierarchy of discretizations to reduce the overall computational cost, with respect to that of sampling on the finest one, while keeping a similar accuracy. The hierarchy consists of meshes of sizes . Performing simulations for different values of the random parameters on , the expectation is corrected using the whole hierarchy as
(2) 
where is the approximation of the QoI computed using the FE solution for the discretization and for and . The total error of this approximation is now given by
(3) 
The first term, the (squared) discretization error, is reduced by enforcing that provides a sufficiently accurate approximation. Assuming that it decays as for some constant and rate the maximum level required to make it smaller than, e.g. , is
(4) 
Estimates of and are available in some cases, e.g. elliptic stochastic equations [Teckentrup2013a] but they are generally unknown. The second term in (3) is the statistical error, given by . Under general conditions tends to zero as and in this sense, the MLMC method can be understood as a variance reduction method.
The total computational cost is given by where is the complexity (computational cost) of evaluating , i.e. the average cost to evaluate one sample at level and another one at level . Its minimization allows the optimal number of samples to be taken on each level [Giles2008] to have a mean squared error smaller than , as
(5) 
Observe that the number of samples per level in (5) depends on , and and for some problems a theoretical estimate in terms of is possible. These estimates, however, contain unknown constants which need to be determined at an initial screening phase. An alternative is to update them during the execution, as described in the next section.
2.3. An adaptive Mlmc method
A practical extension of MLMC is given by Adaptive Mulitlevel Monte Carlo (AMLMC) [Giles2009, Cliffe2011, Elfverson, Collier2015], which dynamically enforces the discretization and/or sampling error to be below a given tolerance. This is done by updating the maximum level and the sample sizes with (ideally) optimal values that reduce the MLMC error with minimal increase in computational cost. Since the actual expectation, variance, and cost are unknown, they are estimated from sample averages and variances, which we refer to as moment estimates. Extrapolation of these moments estimates are also needed when they are not available.
The algorithm used here begins with a simple initial set of samples, which may be chosen by any number of considerations. It is presented in Section 4, consisting of a loop in which a set of samples are evaluated, convergence is checked and, if it is not reached, the maximum level and the sample sizes are updated. In order to update an estimation of the constants and in (4) is made using to estimate the discretization error [Elfverson]. In order to update an estimation of variance and cost is required to evaluate (5). Variance is estimated by the sample variance, i.e.
(6) 
Cost estimates are computed for each level, by adding the total cputime used to compute each , denoted here as and computing the average cost on that level. From these the sample sizes are estimated to achieve the desired error, with extra samples on each level as a conservative estimate to avoid unnecessary outer loop iterations.
It is important to observe that each iteration of the adaptive algorithm, in which the number of samples and the maximum number of levels are updated, requires a reduction (to compute averages and sample variances). In practice, it results in a synchronization, as described in Section 4. This makes MLMC more challenging to scale to large core counts.
3. Parallel scheduling
In this section we present a new algorithm for the scheduling of sampling tasks required by the MLMC method in a parallel environment. For a general introduction to the subject see [Blazewicz2019, Vazirani2003]. In general (and imprecise) terms, the scheduling problem is the development of an algorithm for the assignment of tasks to processors to optimize some cost function, e.g. the total makespan (the maximum runtime). An important distinction to be made is between preemptive and nonpreemptive scheduling. In the first case, the scheduling algorithm is developed under the assumption that tasks can be preempted and restarted later on at no extra cost. We do not consider this possibility in our implementation and we therefore concentrate only on the second case.
Some results are available under the assumption that each task is executed in one processor, as in [Tosi2021]. Given a set of independent tasks , whose execution times are respectively, and a set of processors, the makespan of a scheduling is bounded by the average work of each processor where
and the maximum work of a task is given by
The bound is tight, the equality with the first argument of the maximum occurs when are all equal and with the second when . This problem is NPhard [Blazewicz2019].
A simple greedy algorithm to solve this problem is to assign a task to the processor that has the least amount of work already assigned, regardless of the execution time of each task. In this case, it is easy to show that , where is the makespan of the optimal scheduling [Vazirani2003], and it is said to be a 2approximation. Moreover, the bound is tight, which is shown by considering the case of tasks of equal processing time followed by a single task whose processing time is times larger. This example suggests that executing the longer tasks first reduces the makespan. This fact was exploited in [Graham1969] to develop a greedy algorithm in which tasks are ordered in decreasing execution time and then assigned to the first available processor, see algorithm 5.1.2 in [Blazewicz2019]. The makespan bound for this algorithm is , that is, a 4/3approximation.
However, it is important to emphasize that the execution time of a given sample in the MLMC method is hardly known in advance. In some cases the execution time can be estimated before the calculation based on the PDE coefficients. For example, in [Sukys2014], first order hyperbolic problems are approximated using explicit time integration methods with a time step determined by the (random) equation coefficients through the CFL stability condition. However, in general, tasks cannot be ordered by execution time beforehand and therefore a 2approximation bound is the best that can be generally expected. It is worth emphasizing that the actual makespan is much closer to the optimal in practice, see Section 5.
Let us now move to the case in which a task may require more than one processor. There are several factors to consider in the decision of how many processors are assigned to each task. In a general case [Gmeiner2016], it depends on two factors, the idling that can be generated and the loss of efficiency in the strong scaling of the underlying solver. If no idling occurs, as in the algorithm we propose herein, and there are enough samples to assign work to all processors, the inevitable loss of efficiency in strong scaling of the solver implies that the best efficiency is always obtained assigning the minimum number of processors that permit to execute each sample, as already noticed in [Eldred1998, Eldred2000]. On the other hand, executing all tasks with the same number of processors would lead to larger execution times on finer levels of the MLMC hierarchy. Assuming a weakly scalable solver, and assigning a number of tasks proportional to the size of the mesh at each level gives constant mean execution time for each sample, with variations with the change of the random PDE coefficients. This assumption is made in [Gmeiner2016], where the number of processors per task at level of the MLMC algorithm is taken of the form and is determined to maximize efficiency.
We divide the tasks in charge of each sample in the MLMC algorithm into sets that require processors with for fixed and we denote by the execution (wallclock) time of each of them. The problem of scheduling these tasks using processors to minimize the makespan is also NPcomplete. Therefore, we cannot expect to obtain a polynomial time algorithm, so we propose an algorithm to solve a restricted problem rather than the general one. We restrict the number of possible values of to where (hereafter means that is an integer divisor of with both , as usual). Observe that the choice in [Gmeiner2016] satisfies this assumption. We also assume that . This assumption of divisibility is not necessary in the scheduling algorithm, but simplifies the proof of Theorem 3.1 presented below, see Remark 3.3. The total work is now
(7) 
and
This work needs to be distributed between processors and we define if is assigned to processor and otherwise. The workload of processor is
(8) 
Under these restrictions (of the possible values of ) we propose Alg. 1, whose execution is illustrated in Fig. 1 and we prove that it is a 2approximation in Theorem 3.1. Observe that because wallclock times of samples are unknown before their actual execution, implementing Alg. 1 requires dynamic programming [Blazewicz2019]. A MPI implementation is described in the next section.
Theorem 3.1.
Given satisfying , Algorithm 1 is a 2approximation.
Proof.
At the beginning the set of processors is divided into groups of processors. Because , once tasks have been executed, each group of processors can be divided into groups of tasks each, which are ready to execute tasks of the group . The previous statement is true for any and therefore, no processor idles until there are no more tasks to execute. Once this fact is established, the proof is similar to the case of single processor tasks given in [Graham1966] (see [Vazirani2003] for the simplified proof followed here). Consider the group of processors that finishes last and let be the starting time of the last executed task and its execution time. Because the algorithm assigns tasks to processors with less workload, all the rest of the processors are busy at time and therefore . Because we get . ∎
Remark 3.2.
In the cases where the hypothesis of Theorem 3.1 are not satisfied it is still possible to apply the rationale behind Theorem 3.1. After all tasks have been executed, the group of processors can be divided into subgroups to execute tasks and there will be processors left. If for some they can be assigned to execute tasks , otherwise they became idle. However, even if these processors can be assigned further work, it is possible that they idle before the rest of tasks have been executed. For instance, consider the case of . After execution of tasks is finished two groups of processors are created and tasks in start being executed while the remaining two processors execute tasks in . If all tasks in are executed before those in these two processors start idling. However, if , which is the case of the MLMC, a good balance can be expected in practice.
Remark 3.3.
It is important to keep in mind the hypothesis of independence of tasks made at the beginning of this section. Theorem 3.1 is only valid under this assumption. However, this is actually the case in the standard MLMC algorithm described in Section 2.2, apart from some simple postprocessing at the end of the execution. In the case of the AMLMC algorithm described in Section 2.3 additional synchronization occurs at each iteration of the algorithm, introducing a dependency between tasks, i.e. those in the second iteration must be executed after those in the first. Therefore, the results of this section actually apply to the scheduling of tasks required to complete one iteration of the AMLMC which, in any case, represents a substantial amount of computational work.
4. A message passing implementation
In this section we describe an implementation of the algorithm described in Section 3 developed on top of the MPI standard. The two main ingredients of the implementation are a strategy for the partition of the set of processors into groups for parallel sampling, described in Section 4.1, and a dynamic scheduling algorithm, described in Section 4.2. We also include a modification that parallelizes the master coordinator that is designed for improving scalability at the most extreme scales, described in Section 4.3.
4.1. Processor partition strategy
The whole strategy is based on a masterslave approach, with a master processor, the coordinator, in charge of the decisions required for scheduling MLMC tasks and the remaining slave processors in charge of sampling (also referred as slaves in the following). The communication between the master and slaves required to implement Alg. 1 is described in Section 4.2. We therefore assume we are given processors, one master and slaves.
As described in Section 3, slave processors will be required to work on different samples and, more importantly, coordinate with different slaves depending on the sample they are executing. Therefore, different partitions are required at different instants during the calculation. To satisfy this requirement we generate a family of partitions of the set of slaves. The total number of partitions, denoted by as in the previous section, is fixed during the calculation and levels of the MLMC algorithm are mapped to each partition. Therefore, the same index is used in the following to denote the MLMC level and the partition in which samples of run. When the AMLMC algorithm described in Section 2.3 is used, a value of must be defined by the user such that during the whole calculation and the number of samples for levels are set to .
The family of partitions is generated in a hierarchical manner, which requires . The set of slave processors is first divided into groups of and a group with processors. In a second step, each group of processors is divided into groups of and one group of processors. The remaining group of processors is divided into (which can be ) groups of processors and one group of size . The process continues recursively until the final groups of processes are generated. The first processor of each group, which we will refer to as the root, will play an important role in the algorithm described in Section 4.2.
To fix ideas, consider processors that will be used to sample with tasks requiring , and processors each, as illustrated in Fig. 2. The slave processors will be first split into groups of processors each. The second partition of the family will be constructed splitting each of them into groups of size . Finally these groups will be split into subcommunicators having processors each. Each slave processor will belong to different groups. In this way, any slave processor may be used for the execution of a task requiring , or processors.
On the other hand, consider processors that will be used to sample with tasks requiring , and processors each, as illustrated in Fig. 3. The slave processors will be first split into groups of processors each. The second partition of the family will be constructed splitting each group into groups having processors and one group having processors. Finally, each group of the second partition will be split into one or two groups having processors.
The groups of processors are managed using the primitives provided by the MPI standard acting on MPI communicators. This allows copying MPI_comm_world using MPI_comm_dup and then to split it into new (sub)communicators calling MPI_comm_split (a collective operation), which requires a color (defined by an integer) to identify the group. In the first partition the color is given by where is the identifier of the current task in the global communicator (a copy of MPI_comm_world), which can be obtained calling MPI_comm_rank. The second splitting is performed in the same fashion, now with a color defined by and the current task in the first communicator just created. This procedure is easily coded in a recursive function with the current communicator as argument.
4.2. Dynamic scheduling algorithm
In this section, we describe the proposed dynamic scheduling in Alg. 2. At the beginning of the execution, the family of partitions described in the previous section is created, which is a collective operation. After this initial phase, which also includes the allocation of required data structures, the calculation of the MLMC estimator proceeds as shown in Alg. 2 (see line 2, line 2, and line 2).
Although the MPI standard permits a multipleprogram multipledata execution model, the singleprogram multipledata (SPMD) execution model is widely used and the implementation described here is made in this way. In this programming model, it is possible to perform different tasks on different processors coding the logic in terms of the MPI ranks, which are obtained calling MPI_comm_rank, as mentioned before. A call to this function on the global communicator permits determining if the current process is the master (with rank ) or a slave (with positive rank) and a call on a given subcommunicator if it is the root (with rank ) of that group or not. This functionality is implemented in functions i_am_master, i_am_slave, and i_am_root (which receive the current level to specify the partition) that return true or false. It is also worth emphasizing that in the SPMD programming model variables initialized in line 2 to line 2 of Alg. 2 are available in all processors, possibly with different (or no) use on each of them.
The communications are organized in two layers: between the master and the roots and between the roots and the rest of slaves. The master does not exchange any message, and therefore does not require any synchronization, with a slave that is not a root on some partition. Roots are in charge of communicating with the master to receive duties and with the rest of slaves to send them. There are two duties, DoSample and NextPartition, that the master assigns to each root (and roots to the rest of slaves).^{2}^{2}2Actually it is possible to have more duties, such as allocate and free, but their need depends on the software design and are not relevant for the description of the algorithm.
All slave processors start working in the finest partition (see the initialization at line 2 in Alg. 2) and emit a broadcast to get the duties from their root in line 2 of Alg. 2. This broadcast is executed in the slave communicators of level , e.g. in the green communicators shown in Fig. 2 and all slaves wait for their root (processors and in Fig. 2) to get the duty. Each root communicates to the master that the group is ready to execute at level (see line 2) and receives the instruction to either sample (with the duty DoSample) or proceed to the next partition (with the duty NextPartition) at line 2. This duty is defined by the master at line 2 depending on the number of samples required at that level, which is received from the root at line 2. The function ReceiveFromRoot is implemented calling MPI_Recv with the wildcard MPI_ANY_SOURCE for the argument source, which permits receiving a message from any of the roots. While there are samples to assign, the master keeps looping at line 2 actively listening from roots. Following the example of Fig. 2, roots and send their rank (RootId in Alg. 2) and the level (green) to the master, who receives one of them and assigns a sample at line 2 at each iteration of the loop in line 2. In this way the execution timeline shown in Fig. 1, which is consistent with the partition set partition shown in Fig. 2, starts with the samples assigned. When the roots receive the duty DoSample and broadcast it, all the slaves in the group start sampling at line 2.
While all the slaves are sampling, the master is waiting at line 2 until a group finishes, which occurs at time in Fig. 1. When slaves finish sampling they go to the next iteration of the loop in line 2 and arrive again to the broadcast at line 2, except the root that sends its rank and partition to the master at line 2. After receiving this message at line 2, the master proceeds then to decide again if there are samples to assign at line 2. In the execution timeline of Fig. 1 all the samples have been assigned at time so the master sends the duty NextPartition to root at line line 2. After receiving and broadcasting this duty, slaves in this group go to the next partition at line 2 and the group is divided into two groups shown in blue in Fig. 2 with roots and . In the next iteration of the loop, they arrive again to line 2 and to line 2 but now with . After the communication with the master they receive the duty DoSample and they start sampling at level while processors to are sampling at level . The same process occurs at time , now reaching level . When slaves finish sampling, if there are still samples to assign, the master assigns them samples at the same level, which occurs at time in Fig. 1. Observe that it may even happen that samples assigned on partitions separated by more than one level are executed concurrently.
At some point all samples have been assigned and for all . When the next sample finishes, which occurs at time in Fig. 1, the master will send it down in the partition hierarchy and if that will imply for the slaves to go to which will make them to break the loop at line 2. The master will count the roots that left the execution and will break the loop at line 2 when they reach ( in the execution timeline of Fig. 1).
To reduce communication, the samples are not typically communicated one at a time, but in batches of samples. These batches are always sequential ranges of integers which partition the set . Instead of communicating each individual sample, instead each individual batch of samples is communicated, by indicating the smallest and largest sample in the batch. This batch size, denoted is determined dynamically, depending on the current number of assigned samples on the particular level, , and the total number of assigned samples on the particular level, , as well as the number of partitions on that level, . Each batch size proportional to the current number of unassigned samples, denoted by , divided by the total number of samples . More specifically,
(9) 
subject to minimum and maximum batch sizes that can be modified by parameters. The default minimum is , and the default maximum is . These are the values used in all examples here. Naturally, should the size of the batch be larger than the number of remaining samples, the batch only contains the remaining samples.
After the sampling loop at line 2 is finished, there is a reduction process to get the final estimate . Observe that partial sums have been performed during sampling, keeping them on root processors, see line 2. At the end of the sampling loop, slave processors send these partial sums to the master at line 2, which are received (using local variables ) at line 2, accumulated at line 2 and finally weighted and added to compute the final estimate in line 2. If executing AMLMC, the number of samples and the maximum level are updated in line 2 and line 2 respectively and the whole process is repeated until convergence is reached.
4.3. Parallelizing the coordinator
The above implementation has a notable potential bottleneck in the use of a single master task that is seen in the example Section 5.1. This bottleneck is not currently an issue at the scale of computation usually encountered, but a potentially fast executing model requires significantly more communication throughput. It is indicative of a computational state wherein the overall processing power dramatically overshadows that of the evaluation of a single model, which is generally expected to occur at some large enough number of processors, regardless of the model.
It is anticipated that this is an important potential bottleneck in certain applications and on certain architectures. We therefore consider briefly the parallelization of the master coordinator. This is done by replacing the single task by a communication tree. Each communication task is limited to have a specified number of tasks which report to it, be they root tasks, or other coordinator tasks. This parameter is herein referred to as comm_limit. This functionality works with the batch sampling mechanism mentioned in Section 4.2. Samples are allocated based not only on total and available samples, but also the number of executing partitions on the slave coordinator , denoted , with respect to the total number of partitions that report to that master coordinator , . In this case the batch, , is proportional to
(10) 
and is again subject to similar minimum and maximum values.
Functionally, this implementation can be visualized as additional partition layers of tasks as in Fig. 2 or Fig. 3, but with an additional task separated on each new layer representing this dedicated communication task. The layers are usually identical, but often, there is a single communication partition that can only be partially filled. In such a case, that potential communication partition is lumped entirely into another communication partition. This implementation does not balance these loads and allows this limit on communication tasks to be violated. However, this violation occurs only on at most one task at each level of the tree, and in the worse case doubles the amount of communication on that layer.
The implementation improves scalability for the example of Section 5.1, but uses a relatively crude communication tree, and is not especially optimal. Additionally, all manager tasks exist outside of the context of the largest computational partition, which means that there is a lower bound on the rootspertask value. For example, in an example where and level uses task per model evaluation, and level uses tasks per model evaluation, the lower bound on the rootspertask value would be . This is a potential problem if there are both rapidly executing serial models, and slowly executing highly parallelized models being scheduled.
5. Numerical Examples
In this section we evaluate the scaling performance of the scheduling algorithm for several different examples, noting that here we adjust scaling based on the number of evaluations of the model on each level, called the sample sizes, and not with respect to e.g. discretization levels. The implementation of the algorithm presented herein relies in FEMPAR , [badia_fempar:_2017, Badia2020]
an open source, hybrid OpenMP/MPI parallel, objectoriented (OO) FORTRAN200X scientific software framework for the massively parallel FE simulation of multiscale/multiphysics problems governed by PDEs. We develop a new layer of software that permits the execution of several instances of a model as required by sampling
UQ methods. In this way it is possible to exploit advanced discretization methods provided by the FEMPAR library, e.g., AgFEM methods. These embedded techniques have been recently exploited in [Badia2021] to perform UQ in random geometries using the MLMC implementation described herein. The code here is open source, under the GPLv3 license, and available at the FEMPAR gitlab repository.First, we consider a model independent evaluation of performance, testing the scheduling algorithm using a computational model that simply idles, while executing in parallel in Section 5.1.
We consider first a “Parallel Benchmarking Example” in Section 5.1. This model idles for a random amount of time, while executing in parallel with core counts on each level as are anticipated by a 3D execution of an Algebraic Multigrid (AMG) algorithm.
As a second example, we consider a 3D Poisson problem in Section 5.2, with a random diffusion term that is described by a KarhunenLoève Expansion (KLE), and fixed Dirichlet boundary conditions of on the boundary of a randomized popcorn geometry similar to that described in [Badia2021]. This Poisson model is discretized using AMG, and executed in parallel with differing core counts on each level.
We consider two scaling experiments for these examples.

Strong scaling: The number of MLMC samples (or AMLMC tolerance) is kept constant, while the number of processors is increased.

Weak scaling: The number of MLMC samples is increased (or AMLMC tolerance is decreased), and the number of processors is increased, so that the relative number of MLMC samples to processors is fixed.
We consider here sample sizes (or initial sample sizes in the case of AMLMC), , which are integers that experience an exponential growth,
(11) 
(for ). This formula represents a theoretically ideal sample ratio when considering the computational costs and accuracies for computing integrated quantities from sufficiently regular problems discretized spatially and solved via the conjugate gradient method, as validated in [Badia2021]. To compute the scalability of the scheduling algorithm, we tune the overall computational cost of the experiments by a sample multiplier, denoted by , and sample each level according to
(12) 
where is a relatively small sample size closely coinciding with (11). Here , as the computational architecture has nodes of processing cores. Specifically, we utilize sample sizes as in Table 1, which are sizes broadly acceptable for discretizations in spatial dimensions.
We define as the wall clock time to compute the MLMC estimate as a function of and we define the speedup by
(13) 
The speedup is useful here for weak scaling, as this normalization allows us to compare results for different model evaluation times, which would be difficult to do if wall times were presented.
We also consider the effect of idling and coordination relative to the time spent in active computation. We divide the scheduling algorithm execution on each task into three parts, managing, , idling, , and active computation, , which we measure in coreseconds. The master task and any other coordinators are always contributing to the managing total, while executing tasks are contributing to either idling or active computation, depending on where they are in the scheduling algorithm execution. When an executing partition is in the setup or run phase, its cores contributes to the active total; otherwise, it contributes to the idling total. We define the efficiency as in [Sukys2014], to be
(14) 
a value in that measures the overhead of the scheduling algorithm as a ratio of total core time used in the computation.
We note that our results compare most closely to the similarly performed experiments in [Gmeiner2016] and [Tosi2021]. Our implementation performs well generally, and especially well at addressing the difficult problem of scalability when the model has a low execution time.
5.1. A benchmarking example
Our first example is a conceptually simple test of the scheduling algorithm scaling with a simple model class which utilizes partitions of individual processors, and whose only evaluation is to wait for a random amount of time. This model allows us to increase the stress from the communication in the scheduling algorithm. For each sample we define this time by a uniform distribution of mean
and variance , i.e. a random value between and , where and scale exponentially with the level of the model evaluation. This computation time is denoted here by . This exact knowledge of the statistics of times for model evaluation permits a more accurate benchmarking of the scheduling algorithm.For the parallel models considered here we do not vary the execution time based on the level of computation, which is a reasonable approximation when the size of the problem scales in correspondence with the number of processors used to compute it. This model is quite robust to evaluating the scalability of the scheduling algorithm independently of the specific computations of the model, and with the following examples, demonstrates the benefits to parallelization of the models being used. We consider in . Here, the short computation time of allows us to insure a high sustained level of communication in the scheduling algorithm as the number of tasks increases, stressing the communication structure, while the longer computation times may more accurately simulate real model evaluation times.
Specifically, we consider a benchmarking example that is discretized to run on , and cores. We refer to the average as the Short Time Model, the average as the Medium Time model, and the average as the Long Time model. Here in particular, this model is designed to represent a 3 dimensional AMG solver.
We also consider the same model parameters under strong scaling, with the sample multiplier fixed to , while for weak scaling is the number of computational nodes. The samples for each level are given by (12) and Table 1 for nodes.
Fig. 4 shows the speedup for the two scaling regimes given by (13), while Fig. 5 shows the efficiency. We note that the parallel benchmarking performs well for the Medium Time and Long Time models, but that the Short Time model provides a significant communication stress, and that in this case, the use of additional communication tasks improve scalability.
We also consider two additional modifications, beyond the consideration of additional communicators and execution times. For these, we use only one communicator task, and . First, is the variance of the execution time. We consider three cases: that of high , given by ; that of medium , given by ; and that of low , given by . These results are shown in Fig. 6, and Fig. 7. We note that changes to provide at most a marginal change in performance.
Additionally, we consider the use of batch sampling. Using a single master communicator, we consider the case when all batch sizes are of size for . These results are shown in Fig. 8, and Fig. 9. It is seen that there is worse scalability at higher core counts without batch sampling, as the single master communicator cannot act fast enough to efficiently communicate each individual sample to the execution partitions.
5.2. 3D Poisson in a random domain
Here we consider a discretized finite element model, where the number of tasks per model evaluation depends on the level. To this end, we describe the details of a 3D Poisson problem as in (1).
Here we assume the domain to be a stochastic "popcorn" shape as defined in [Badia2021], which provides a complex geometry that changes for each realized sample. The boundary conditions are given by a , Moreover, the diffusion term , is a KLE
where every point is lognormally distributed, and the associated normal distributed random variables are correlated by an exponential covariance function with correlation length
. The normal random variables havemean, and a standard deviation of
, with the first eigenfunctions in each of the two dimensions, used to construct the approximation with a totalorder basis of order , giving total functions. Here, totalorderrefers to a tensor product of polynomials in each dimension such that the sum of the orders of those polynomials is at most
.We consider levels of computation with , and . Here we consider a uniform coarsest grid of elements, and each subsequent grid has twice the number of elements in each dimension, up to elements for . The AMLMC here has initial samples on all three levels as given by (12) and Table 1 for the appropriate for nodes. For strong scaling, we consider consider the same model parameters under strong scaling, with the sample multiplier fixed to , while for weak scaling is the number of computational nodes.
Fig. 10 shows the speedup for the two scaling regimes given by (13), while Fig. 11 shows the efficiency. We note that there are no noticeable issues scaling up to the nodes tested for this MLMC example.
Finally, we utilize the AMLMC algorithm with this model, to test the effect of the synchronization points which that algorithm requires. For weak scaling we consider, a set of desired tolerances for desired error, in this case of the volume of the associated geometric shape. Specifically, the error tolerances are chosen to be between and . These tolerances are chosen to allow the same initial sampling of the other 3D Poisson example, but such that those samples are insufficient to converge in the initial iteration. The strong scaling utilizes , and an error tolerance corresponding of .
We note that as the ultimate sample sizes for this experiment are variable, the performance is more variable. Of note, each execution leads to a different final achieved error. We consider a raw scaling, and also consider a scaling that mitigates this discrepancy, by scaling the results inversely proportional to the achieved error. That is, lower achieved errors will have used more samples, and thus more computational resources. We adjust the wall times to these values, so that the results are neutral with respect to the achieved error of the final AMLMC estimation. The results are presented in Fig. 12, and Fig. 13. We note that both the strong and weak scaling are significantly more efficient for smaller for this problem, but that this is not a result of significant idling, so that the scalability is reasonable as increases.
6. Conclusions
In this work we present a new dynamic scheduling algorithm to compute samples in the MLMC method that fully exploits three levels of parallelism, within each sample, across samples, and across levels. The implementation of this algorithm based on the MPI standard described here is based on general primitives available in any version of the standard and therefore in any MPI distribution, i.e. it does not require any advanced feature like onesided communication.
We also demonstrate the performance of method when applied to MLMC, on a stress test of the scheduling algorithm on a benchmarking problem, as well as a Poisson model solved on a complex, sample dependent domain via AMG. The implementation is a powerful, robust, and scalable means to perform UQ via MLMC or other similar statistical computations. The numerical examples presented herein show excelent scalability on the examples tested, when certain good scaling features are utilized. The algorithm implementation also performs well when utilized for AMLMC. Overall, our results show performance that competes with that demonstrated in the similar experiments in [Gmeiner2016] and [Tosi2021] (which only show results exploiting two layers of parallelism), showing high efficiency, and notable performance on the difficult case of rapidly executing models.
The results in this work could also be extended by exploiting the techniques in [Tejedor2017] to account for the dependency between tasks in the context of parallel sampling. This is particularly relevant for the AMLMC method where some tasks have a dependency on other tasks that require more processors. Although this can be handled just by an iterative loop as descibed above, a dynamic scheduling with task dependency could potentially be more efficient. This development require to define appropriate priorities taking into account the amount of parallelism at sample level and is left for a future work.
Comments
There are no comments yet.