# A highly parallel algorithm for computing the action of a matrix exponential on a vector based on a multilevel Monte Carlo method

A novel algorithm for computing the action of a matrix exponential on a vector is proposed. The algorithm is based on a multilevel Monte Carlo method, and the vector solution is computed probabilistically by generating suitable random paths that evolve through the indices of the matrix according to a suitable probability law. The computational complexity is proved to be significantly better than that of the classical Monte Carlo method, which allows the computation of much more accurate solutions. Furthermore, the favorable features of the algorithm in terms of parallelism were exploited in practice to develop a highly scalable implementation capable of solving some test problems very efficiently on high-performance supercomputers equipped with a large number of cores. For the specific case of shared-memory architectures, the performance of the algorithm was compared with the results obtained using an available Krylov-based algorithm, outperforming the latter in all benchmarks analyzed so far.


## 1 Introduction

In contrast to numerical methods for solving linear algebra problems, the development of methods for evaluating functions of matrices has in general been much less explored. This can be explained partially by the underlying mathematical complexity of evaluating the function, but also from a computational point of view, because the algorithms developed so far tend to be less efficient and in general more difficult to parallelize. In addition, related to the first issue, an added difficulty appears in estimating the associated error of the numerical method, which is well understood for iterative linear algebra solvers, but becomes a rather cumbersome process for functions of matrices. This is even worse in the case of the matrix exponential, due to the lack of a clear and consensually agreed notion of the residual of the iterative method; see for instance [13].

In particular, the second issue represents a serious drawback, since in practice it prevents dealing with the large-scale problems appearing in science and engineering. Nowadays there is a plethora of applications described by mathematical models which require evaluating some type of matrix function in order to be solved numerically. For the specific case of the matrix exponential, we can find applications in fields as diverse as circuit simulations [39]; power grid simulations [35, 40]; nuclear reaction simulations [34]; analysis of transient solutions in Markov chains [36]; numerical solution of partial differential equations (PDEs) [30]; and analysis of complex networks [10], to cite just a few examples. More specifically, in the field of partial differential equations, numerically solving a boundary-value PDE problem by means of the method of lines requires in practice computing the action of a matrix exponential, therefore using exponential integrators [28]. On the other hand, in network analysis, determining some relevant metrics of the network, such as the total communicability, which characterizes the importance of the nodes inside the network, entails computing the exponential of the adjacency matrix of the network.

For the specific problem of computing the action of the matrix exponential on a vector, several classes of numerical methods have been proposed in the literature in the last decades (see the excellent review in [26], and references therein). Probably the most analyzed and disseminated are the Krylov subspace methods, which in practice construct a basis of a subspace using the Arnoldi process, and compute the exponential of the projected matrix (typically much smaller) using standard matrix exponential techniques [27].

An alternative to the aforementioned deterministic methods does exist, and consists in using probabilistic methods based on Monte Carlo (MC) simulations. Although much less known than the former, Monte Carlo methods specifically devised for solving linear algebra problems have been discussed in the literature in various forms over the years. In fact, it was the seminal paper by von Neumann and Ulam in the 1940s [22] that gave rise to an entire new field, and from there a multitude of relevant results and substantial improvements of the original algorithm have appeared in the literature; see e.g. [17] and [16] for further references. Essentially, the main idea is to generate a discrete Markov chain whose underlying random paths evolve through the different indices of the matrix. The method can be understood formally as a Monte Carlo sampling of the Neumann series of the inverse of the matrix. The convergence of the method was rigorously established in [29], and improved further more recently (see for instance [12] and [19], just to cite a few references).
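The von Neumann–Ulam idea sketched above can be illustrated with a short, self-contained example. The sketch below (function names and parameters are our own illustrative choices, not the paper's algorithm) estimates a single entry of $x = (I-A)^{-1}b = \sum_k A^k b$ by sampling the Neumann series with random walks over the matrix indices:

```python
import numpy as np

def neumann_mc_entry(A, b, i0=0, n_walks=50_000, p_absorb=0.5, seed=0):
    """Estimate the entry x_{i0} of x = (I - A)^{-1} b = sum_k A^k b by the
    von Neumann-Ulam scheme: random walks over the matrix indices carrying
    multiplicative importance weights (illustrative sketch only)."""
    rng = np.random.default_rng(seed)
    n = A.shape[0]
    est = 0.0
    for _ in range(n_walks):
        i, w = i0, 1.0
        est += w * b[i]                      # contribution of the current state
        while rng.random() > p_absorb:       # survive with prob. 1 - p_absorb
            row = np.abs(A[i])
            s = row.sum()
            if s == 0.0:
                break
            j = rng.choice(n, p=row / s)     # jump proportional to |A_ij|
            # weight compensates for transition and survival probabilities
            w *= A[i, j] / ((row[j] / s) * (1.0 - p_absorb))
            i = j
            est += w * b[i]
    return est / n_walks
```

For a small matrix with spectral radius below one, the estimate converges to the corresponding entry of the exact solution at the usual $O(M^{-1/2})$ Monte Carlo rate.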

Generalizing the method for dealing with some functions of matrices, such as the matrix exponential, was only recently accomplished in [7]. The method is based on generating random paths, which evolve through the indices of the matrix, governed now by a suitable continuous-time Markov chain. The vector solution is computed probabilistically by averaging over a suitable multiplicative functional.

The main advantages of the probabilistic methods, as already stated in the literature, are mainly due to their privileged computational features, such as simplicity to code and to parallelize. In practice this allows developing parallel codes with extremely low communication overhead among processors, with a positive impact on parallel features such as scalability and fault tolerance. Furthermore, there is also another distinguishing aspect of the method, which is the capability of computing the solution of the problem at specific chosen points, without the need to solve the entire problem globally. This remarkable feature has been explored for efficiently solving continuous problems such as boundary-value problems for PDEs in [3, 5, 6], offering significant advantages in dealing with some specific applications found in science and engineering.

Yet an important disadvantage of any Monte Carlo method is the slow convergence of the numerical method to the solution [20], in general of order $O(M^{-1/2})$, where $M$ denotes the sample size. Nevertheless, there already exist a few statistical techniques, such as variance reduction, multilevel Monte Carlo (MLMC), and quasi-random numbers, which have been proposed to mitigate in practice such poor performance, improving the order of the global error, and consequently the overall performance of the algorithm. Among all the aforementioned techniques, the multilevel method clearly stands out, and it has in fact become the preferred method to speed up the convergence of a variety of stochastic simulations, with a remarkable impact on a wide spectrum of applications. An excellent review has recently been published in [24], describing the method in detail as well as a variety of applications where it was successfully applied (see also [8] for more details specifically related to the topic of this paper).

One of the main contributions of this paper is precisely to develop a multilevel method for the problem of computing the action of a matrix exponential on a vector. This is done by conveniently adapting the probabilistic representation of the solution derived in [7] to the multilevel framework. In addition, the convergence of the method is analyzed, and the computational cost estimated. The second important contribution is the parallelization of the resulting algorithm, which allowed us to successfully run several relevant benchmarks on an extremely large number of processors, using machines that rank among the top-performance supercomputers in the world (according to the well-known TOP500 list [38]).

The outline of the paper is as follows. The mathematical description of the probabilistic method is briefly summarized in Section 2, and the problem is mathematically formalized according to the multilevel framework. In Section 3, the developed algorithm is described through the corresponding pseudocodes. Section 4 is devoted to the analysis of both the algorithm complexity and the numerical errors of the method. Finally, in Section 5, several benchmarks are run to assess the performance and scalability of the method, and, whenever available, a comparison with the performance obtained by the classical Krylov-based method is made. In closing, we highlight the main results and suggest directions for future research.

## 2 Mathematical description of the probabilistic method and multilevel Monte Carlo method

In order to implement any multilevel Monte Carlo method, it is mandatory to have a probabilistic representation of the solution. Thus, we describe next the probabilistic method used to compute the action of a matrix exponential on a vector.

### 2.1 Probabilistic method

The probabilistic representation for the action of a matrix exponential over a vector was introduced in [7] for dealing exclusively with adjacency matrices of undirected graphs. However, in the following we show that this representation can be straightforwardly generalized for dealing with arbitrary matrices.

Consider a general $n$-by-$n$ matrix $A$, a given $n$-dimensional vector $u$, and the $n$-dimensional vector $x$ corresponding to the solution after computing the action of the matrix exponential on the vector $u$, that is, $x = e^{tA}u$. Here the parameter $t$ is a constant, typically interpreted as the time variable in partial differential equations, or as an effective "temperature" of the network in problems related to complex networks (see [21], e.g.).

Let us define a diagonal matrix $D$, represented hereafter as a vector $d$ with entries $d_i$, $i=1,\dots,n$, and a matrix $T$ with entries given by

$$t_{ij}=\begin{cases}L_{ii}, & \text{if } i=j,\\ (-1)^{\sigma_{ij}}\,L_{ij}, & \text{otherwise},\end{cases}\tag{1}$$

where $\sigma$ is a binary matrix whose entries take the value $1$ or $0$ depending on the sign of the corresponding entry of $A$. Here $L$ denotes the Laplacian matrix, defined in the broad sense as a matrix with nonpositive off-diagonal entries and zero row sums. Then, it holds that $A = D - T$. Note that our definition differs from the classical one addressed to adjacency matrices [2]. Instead, in this paper the matrix $A$ can be any matrix. This is possible due to two changes. First, our diagonal matrix $D$ is not a degree matrix, since the diagonal term of the original matrix is added to the degree of the row (stored in $d$). Second, we replace the matrix $L$ with the matrix $T$, which takes into account that the matrix $A$ can have both positive and negative values, unlike an adjacency matrix. Thus, this does not constitute any restriction on the class of matrices amenable to be represented probabilistically. Quite the contrary, one can see that any arbitrary matrix can be straightforwardly decomposed in such a way.
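As a quick consistency check, the following sketch implements one plausible reading of the decomposition above (the convention $\sigma_{ij}=1$ when $A_{ij}<0$, and off-diagonal Laplacian entries $L_{ij}=-|A_{ij}|$, are our own assumptions, not necessarily the exact convention of [7]) and verifies that the splitting reproduces $A = D - T$:

```python
import numpy as np

# Hypothetical reconstruction of the splitting A = D - T described above:
# L has off-diagonal entries -|A_ij| and zero row sums, sigma_ij = 1 when
# A_ij < 0, and the diagonal of A is folded into the "degree" vector d.
def decompose(A):
    n = A.shape[0]
    L = -np.abs(A) * (1.0 - np.eye(n))      # off-diagonal: L_ij = -|A_ij| <= 0
    np.fill_diagonal(L, -L.sum(axis=1))     # zero row sums: L_ii = sum_{j!=i} |A_ij|
    sigma = (A < 0).astype(int)             # sign pattern of A (assumption)
    T = np.where(np.eye(n, dtype=bool),
                 np.diag(L),                # t_ii = L_ii
                 (-1.0) ** sigma * L)       # t_ij = (-1)^{sigma_ij} L_ij, Eq. (1)
    d = np.diag(A) + np.diag(L)             # diagonal of A added to the degrees
    return d, T

rng = np.random.default_rng(1)
A = rng.standard_normal((5, 5))
d, T = decompose(A)                          # then np.diag(d) - T equals A
```

Under this reading, any real matrix is recovered exactly from the pair $(d, T)$, illustrating that the decomposition imposes no restriction on $A$.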

Finding a probabilistic representation for this problem requires in practice [7] using a splitting method to approximate the action of the matrix exponential on the vector as follows,

$$\bar{x}=\left(e^{\Delta t\,D/2}\,e^{-\Delta t\,T}\,e^{\Delta t\,D/2}\right)^{N}u,\tag{2}$$

where $\Delta t = t/N$, which in the following and for convenience will be termed the time step. Note that $\bar{x}$ corresponds to an approximation of the true solution $x$. In fact, this corresponds to the Strang splitting method, and therefore leads to an error which after one time step is known [1] to be of order $\Delta t^3$ locally, and of order $\Delta t^2$ globally. Therefore, the true solution is recovered in the limit $\Delta t \to 0$. The probabilistic representation for computing a single entry $\bar{x}_i$ of the vector $\bar{x}$ is then given by

$$\bar{x}_i = e^{\Delta t\, d_i/2}\,\mathbb{E}\!\left[\prod_{k=1}^{N}\eta_k\right],\tag{3}$$

where the $\eta_k$, $k=1,\dots,N$, form a sequence of discrete random variables with outcomes on the matrix indices, combined with a two-point random variable whose values and probabilities are related to the matrix $T$. These probabilities correspond to the transition probabilities of a continuous-time Markov chain generated by the corresponding infinitesimal generator and evaluated at time $\Delta t$ for each step, being the solution of the Kolmogorov backward equations [9],

$$P'(t) = Q\,P(t),\qquad P(0)=I\quad(t\ge 0),\tag{4}$$

for the transition probability matrix $P(t)$.

A neat picture of this probabilistic representation can be given as follows: a random path starting at the chosen entry $i$ is generated according to the continuous-time Markov chain governed by the generator, and evolves in time by jumping randomly from one index to another. Along this process, suitable functions are evaluated, and the solution is obtained through the expected value of a suitable multiplicative functional.

Note that such a representation allows in practice computing a single entry of the vector solution, but it can be conveniently modified to represent the full vector solution. The probabilistic representation for computing a single entry requires generating suitable random paths evolving backward in time from the chosen state to a final state. Instead, the probabilistic representation for the full vector requires generating a random path that starts at a given state according to a specific initial distribution, and evolves forward in time governed by a continuous-time Markov chain generated by the transpose of the generator. The contribution to the $i$-th entry of the vector is mathematically formalized through the following representation:

$$\hat{x}_i = e^{\Delta t\, d_i/2}\,U\,\mathbb{E}\!\left[\prod_{k=1}^{N}\eta_k\right],\tag{5}$$

where the quantities involved are defined analogously to those in Eq. (3).

In order to adapt this representation to the multilevel Monte Carlo framework, it is convenient to use the notation typically used in the literature. This entails rewriting the probabilistic representation for computing a single entry of the vector solution as

$$\bar{x}_i = \mathbb{E}[P],\qquad P=\prod_{j=1}^{N/2}\eta^{(j)},\tag{6}$$

where

$$\eta^{(j)} = \begin{cases}\phi^{(j)}\, e^{\Delta t\,(d_{i_k}/2 + d_{i_{k+1}} + d_{i_{k+2}}/2)}, & j = 1,\dots,N/2-1,\\[4pt] \phi^{(j)}\, e^{\Delta t\,(d_{i_k}/2 + d_{i_{k+1}} + d_{i_{k+2}}/2)}\, u_k, & j = N/2.\end{cases}\tag{7}$$

Here $\phi^{(j)}$ collects the two-point random variables of the corresponding pair of steps, and the $d_{i_k}$ are evaluated along the states visited by the path. A similar expression can readily be found for the probabilistic representation of the full vector solution.

It is worth observing that we have to deal with two sources of error when implementing the probabilistic method in practice: the statistical error, coming from the use of a finite sample size for estimating the expected value, and the error due to the splitting method. In fact, the latter can be considered as the equivalent of the truncation error appearing when discretizing differential equations, and in the following it will be termed the truncation error.
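Both the representation in Eq. (2) and the quoted error orders can be checked numerically on a small dense example. The sketch below (the random pair $D$, $T$ is our own toy data, assuming the splitting takes the form $A = D - T$ consistent with Eq. (2)) confirms the second-order global accuracy of the Strang splitting:

```python
import numpy as np
from scipy.linalg import expm

# Numerical sanity check of the O(Δt^2) global truncation error of the
# Strang splitting in Eq. (2), on a small random pair (D, T).
rng = np.random.default_rng(0)
n, t = 6, 1.0
D = np.diag(rng.uniform(0.0, 1.0, n))        # diagonal part (toy data)
T = rng.uniform(0.0, 0.5, (n, n))            # generic non-diagonal part
u = rng.standard_normal(n)
exact = expm(t * (D - T)) @ u                # reference: x = e^{t(D-T)} u

def strang(N):
    dt = t / N
    S = expm(dt * D / 2) @ expm(-dt * T) @ expm(dt * D / 2)
    return np.linalg.matrix_power(S, N) @ u

errs = [np.linalg.norm(strang(N) - exact) for N in (8, 16, 32)]
rates = [np.log2(errs[k] / errs[k + 1]) for k in (0, 1)]
# rates should approach 2, the global order of the Strang splitting
```

Halving the time step should reduce the error by roughly a factor of four, i.e. the observed rates approach two, matching the global order quoted above.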

### 2.2 The multilevel Monte Carlo method

The multilevel Monte Carlo method we have developed is essentially based on the well-known method described many times in the literature. In the following we briefly introduce the ideas underlying the method for those readers not familiar with the topic. For further details see the excellent survey in [24], and references therein.

Essentially, the goal of the geometric multilevel Monte Carlo method consists in approximating the finest solution $\bar{x}_L$, obtained at the finest level of discretization $L$, using a sequence of coarser approximations obtained at previous levels $l$, from $l_0$ to $L-1$. In our specific problem this corresponds to different levels of discretization according to the value of the time step, now $\Delta t_l$, halved from one level to the next. The minimum and initial level $l_0$ is typically chosen such that the coarsest time step spans the entire interval. However, this is not theoretically required, and for this specific problem we show in Section 4 that it is best not to do so. This is because the computational cost tends to be independent of the level when simulating at the coarsest levels. Therefore, in the following we assume a generic minimum level $l_0$. The multilevel method can be formalized mathematically through the following telescoping series,

$$\bar{x}_L = \mathbb{E}[P_L] = \mathbb{E}[P_{l_0}] + \sum_{l=l_0+1}^{L} m_l,\tag{8}$$

where $m_l = \mathbb{E}[P_l - P_{l-1}]$, with $P_l$ defined as in Eq. (6) for the time step $\Delta t_l$, and

$$\eta^{(j)}_l = \begin{cases}\phi^{(j)}_l\, e^{\Delta t_l\,(d_{i_k}/2 + d_{i_{k+1}} + d_{i_{k+2}}/2)}, & j = 1,\dots,N/2-1,\\[4pt] \phi^{(j)}_l\, e^{\Delta t_l\,(d_{i_k}/2 + d_{i_{k+1}} + d_{i_{k+2}}/2)}\, u_k, & j = N/2,\end{cases}\tag{9}$$

with the quantities defined as before, now at level $l$. Note that this induces a truncation error proportional to $\Delta t_l^2$. Numerically, when finite samples of sizes $M_l$ are used, Eq. (8) can be approximated by the following estimator

$$\bar{x}_L \approx \frac{1}{M_0}\sum_{i=1}^{M_0} P^{(i)}_{l_0} + \sum_{l=l_0+1}^{L}\frac{1}{M_l}\sum_{i=1}^{M_l}\left(P^{(i)}_l - P^{(i)}_{l-1}\right).\tag{10}$$

It is worth observing that the samples used for computing the approximation at level $l$ are reused for computing the level $l-1$, adapting them conveniently to the coarser level. In fact, the underlying correlation between the two consecutive levels within the same sample becomes essential in order to reduce the overall variance for the same computational cost. The final goal of the multilevel method, however, is the converse: reducing the computational cost by conveniently choosing optimal sample sizes $M_l$, keeping the overall variance within a prescribed accuracy $\varepsilon$. After a suitable minimization process, the result, as explained in [24], is given by

$$M_l = \frac{1}{\varepsilon^2}\,\sqrt{\frac{V_l}{C_l}}\;\sum_{l'=l_0}^{L}\sqrt{V_{l'}\,C_{l'}},\tag{11}$$

where $C_l$ and $V_l$ are the computational cost per sample and the variance at each level $l$, respectively. The overall computational cost and variance can be calculated as follows:

$$C_T = \sum_{l=l_0}^{L} C_l\,M_l,\qquad V_T = \sum_{l=l_0}^{L}\frac{V_l}{M_l}.\tag{12}$$
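Equation (11) translates directly into a small helper routine. The following sketch (the variable names and the toy variance/cost profiles are our own illustrative choices) computes the optimal per-level sample sizes and can be checked against the variance constraint implied by Eq. (12):

```python
import math

# Optimal per-level sample sizes of Eq. (11) (standard MLMC formula):
#   M_l = eps^{-2} * sqrt(V_l / C_l) * sum_k sqrt(V_k * C_k)
def optimal_samples(V, C, eps):
    """V: per-sample variances, C: per-sample costs, eps: target accuracy."""
    s = sum(math.sqrt(v * c) for v, c in zip(V, C))
    return [math.ceil(math.sqrt(v / c) * s / eps ** 2) for v, c in zip(V, C)]

# Toy profiles: variance decay V_l ~ 4^{-l} (cf. Lemma 2), cost growth C_l ~ 2^l.
V = [4.0 ** (-l) for l in range(5)]
C = [2.0 ** l for l in range(5)]
M = optimal_samples(V, C, eps=0.01)   # most samples land on the coarsest level
```

With these sizes the total variance $\sum_l V_l/M_l$ stays below $\varepsilon^2$, while the bulk of the samples is spent on the cheap coarse levels.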

## 3 The multilevel algorithm

To implement the multilevel method for computing the action of the matrix exponential on a vector in practice, it is first necessary to introduce a suitable algorithm capable of generating the random paths efficiently. Second, we need to describe the strategy followed to compute the difference between any two consecutive levels, as appears in Eq. (10). This requires an efficient technique to reuse the paths obtained when simulating at the finer level $l$ for the coarser level $l-1$.

Concerning the first issue, we describe next the numerical method proposed to generate the continuous-time Markov chain in practice. Let $P(t)$ represent the transition probability matrix. Then the Kolmogorov backward equation in Eq. (4) can be equivalently represented as the following system of integral equations [9]

$$p_{ij}(t) = \delta_{ij}\,e^{-L_{ii}t} + \sum_{k\ne i}\int_0^t ds\;L_{ii}\,e^{-L_{ii}s}\,k_{ik}\,p_{kj}(t-s),\tag{13}$$

where $k_{ik} = -L_{ik}/L_{ii}$ for $k \ne i$. Let us consider a sequence of independent random times drawn from the exponential probability density. The integral equations above, along with the sequence of random times, can be used to simulate a path according to the following recursive algorithm. Generate a first random time obeying the exponential density function. Then, depending on whether it exceeds the time horizon or not, two different alternatives are taken. If it does, the algorithm is stopped, and no jump from the current state to a different state is taken. If, on the contrary, it does not, then the state jumps to a different state according to the probability function $k_{ik}$, and a new, second exponentially distributed random time is generated. The algorithm then proceeds, repeating the same elementary rules until the accumulated time exceeds the horizon, at which point it is stopped.
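The recursive rules above can be sketched in a few lines. The following is an illustrative path generator for a generic continuous-time Markov chain (the two-state generator used as example data is our own, not from the paper):

```python
import numpy as np

def sample_path(Q, i0, t_final, rng):
    """Simulate one path of a continuous-time Markov chain with generator Q
    (Q_ii = -sum_{j != i} Q_ij <= 0) up to time t_final, following the
    first-jump construction sketched above (illustrative only)."""
    t, i, path = 0.0, i0, [(0.0, i0)]
    while True:
        rate = -Q[i, i]
        if rate <= 0.0:                      # absorbing state: no further jumps
            break
        tau = rng.exponential(1.0 / rate)    # exponential holding time
        if t + tau > t_final:                # next jump would exceed the horizon
            break
        t += tau
        p = Q[i].copy()
        p[i] = 0.0                           # jump probs ~ off-diagonal rates
        i = int(rng.choice(len(p), p=p / p.sum()))
        path.append((t, i))
    return path

rng = np.random.default_rng(3)
Q = np.array([[-1.0, 1.0], [2.0, -2.0]])
path = sample_path(Q, 0, 5.0, rng)           # [(0.0, 0), (t1, 1), (t2, 0), ...]
```

Each returned pair records a jump time and the state entered at that time, which is exactly the information the multiplicative functional in Eq. (6) needs.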

In Fig. 1 a sketch for the case of two consecutive time steps is shown. This illustrates graphically how the second issue, related to the computation of the coarse level $l-1$ using the finer level $l$, has been solved in practice. There we plot the four different scenarios that may occur when generating random paths (assuming we are interested in computing only a single entry of the vector solution, and therefore forcing all random paths to start at the same state). Thus, from Eq. (6), the possible outcomes of the two random variables may induce up to two transitions to any of the rows of a given matrix during the two time steps of size $\Delta t_l$, but only the last one should be used for determining the paths corresponding to the previous level $l-1$. More specifically, the four panels describe the following scenarios: (a) transitions occur at both the first and the second time step; (b) a transition only at the first time step; (c) a transition only at the second time step; and (d) no transition at all. Note that the last scenario contributes zero to the term $m_l$ in (8).

In Algorithm 1, we describe the pseudocode corresponding to the implementation of the multilevel method. In fact, this constitutes the general setting for any implementation of the method for a variety of problems. The distinguishing feature among them is the procedure chosen to compute in practice each of the terms of the expansion in Eq. (8), as well as the associated variances. The pseudocode of the procedure for computing a single entry of the vector solution is described in Algorithm 2.

Although the multilevel method could be used to compute the full vector solution as well, the implementation would be much more involved and the performance of the algorithm less efficient. This is because it would require saving vectors instead of scalars for each of the levels in Eq. (8). This can be mitigated instead by computing a scalar function of the full vector solution; and since the complexity of the Monte Carlo algorithm for computing the full vector solution is similar to that for obtaining the solution at a single entry, in principle the computation time of the multilevel method for the former case should be comparable. In fact, the pseudocode is similar (see Eq. (3) and Eq. (5)).

Algorithm 1 (MLMC driver):

    INPUT: …
    Call MLMCL(…) for fast estimating … and … for …
    while … do
        Compute the optimal number of samples … for …
        Call MLMCL(…) for further improvement for …
        if … then EXIT
        else
            Increase number of levels, …
        end if
    end while

Algorithm 2 (level estimator):

    procedure MLMCL(…)
        …, …
        for … do
            …, …, …
            for … do
                …
                if … then
                    …
                end if
                generate … exponentially distributed
                while … do
                    generate … exponentially distributed
                    …
                    generate … according to Eq. (13)
                    …
                end while
                …
                if … then
                    …
                end if
            end for
            …
        end for
        return …
    end procedure
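The overall control flow of Algorithm 1 can be summarized by the following driver skeleton (the `level_estimator` interface and the simplified convergence test are our own illustrative choices; the optimal-sample refinement of Eq. (11) is omitted for brevity):

```python
import math

def mlmc(level_estimator, eps, l0=2, M0=1000):
    """Skeleton of the MLMC driver in Algorithm 1 (illustrative only).
    `level_estimator(l, M)` is a hypothetical interface returning
    (mean, variance, cost_per_sample) of M samples of P_l - P_{l-1}
    (or of P_{l0} itself at the coarsest level)."""
    L = l0 + 1
    means, Vs, Cs = {}, {}, {}
    while True:
        for l in range(l0, L + 1):
            if l not in means:               # warm-up samples for new levels
                means[l], Vs[l], Cs[l] = level_estimator(l, M0)
        # In the full algorithm the optimal M_l of Eq. (11) are computed
        # here and each level is refined accordingly (omitted for brevity).
        if abs(means[L]) < eps / math.sqrt(2.0):   # simplified weak-error test
            break                            # converged: finest level suffices
        L += 1                               # otherwise add a finer level
    return sum(means[l] for l in range(l0, L + 1)), L

# Toy deterministic "estimator" with mean decay 4^{-l} (hypothetical):
toy = lambda l, M: (1.0 if l == 2 else 4.0 ** (-l), 4.0 ** (-l), 2.0 ** l)
estimate, L = mlmc(toy, eps=1e-3, l0=2)
```

With the toy decay rate above, the driver keeps adding levels until the last correction term falls below the tolerance, then sums the telescoping series of Eq. (8).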

## 4 Convergence and Computational complexity of the multilevel algorithm

The computational complexity of any MLMC algorithm can be properly established by resorting to Theorem 1 in [24]. However, it is first mandatory to characterize the convergence of some important quantities, such as the mean $m_l$ and the variance $V_l$, as well as the computational time of the Monte Carlo algorithm, as a function of the level $l$.

Concerning the scaling of the mean with the level $l$, it can readily be estimated as follows. Since $P_l$ converges to the theoretical solution when $l \to \infty$, the mean $m_l = \mathbb{E}[P_l - P_{l-1}]$ corresponds in fact to the truncation error at level $l$. Recall that this was previously considered as being due to the Strang splitting method. Therefore, the local error after one time step of this approximation is known [1] to be

$$\varepsilon_S = \Delta t_l^3\left(\tfrac{1}{12}\,[D,[D,T]] - \tfrac{1}{24}\,[T,[T,D]]\right)u + O(\Delta t_l^4),\tag{14}$$

and globally of order $\Delta t_l^2$. This is in agreement with Fig. 2(a), where the mean is plotted as a function of the level $l$ for an example consisting of simulations of a small-world network of three different sizes.

Characterizing the variance as a function of the level turns out to be a much more involved procedure. To start, it holds that

$$\mathbb{V}[P_l - P_{l-1}] = \mathbb{E}\left[(P_l - P_{l-1})^2\right] - \left(\mathbb{E}[P_l - P_{l-1}]\right)^2 \le \mathbb{E}\left[(P_l - P_{l-1})^2\right].\tag{15}$$

Hence, the problem reduces to estimating $\mathbb{E}[(P_l - P_{l-1})^2]$. For this purpose, and as a preliminary step, a partial result is first established regarding the random variables $\eta^{(2)}_l$ and $\eta^{(1)}_{l-1}$, and then the final result is derived accordingly. These random variables are obtained when generating paths over a single time step (when the level is $l-1$) and over two consecutive time steps (when the level is $l$), the superscripts $(2)$ and $(1)$ denoting two and one consecutive steps, respectively. We therefore first establish the following Lemma.

###### Lemma 1

Let $\eta^{(2)}_l$ and $\eta^{(1)}_{l-1}$ be discrete random variables taking values according to the transition probabilities of a continuous-time Markov chain generated by the infinitesimal generator, evaluated over two time steps of size $\Delta t_l$ and one time step of size $\Delta t_{l-1}$, respectively. Then, it holds that

$$\mathbb{E}\left[\left(\eta^{(2)}_l - \eta^{(1)}_{l-1}\right)^2\right] = O(\Delta t_l^3),\tag{16}$$

where the superscripts refer to two consecutive steps at level $l$ and one step at level $l-1$, respectively.

Proof. Expanding $\eta^{(2)}_l$ and $\eta^{(1)}_{l-1}$ in powers of $\Delta t_l$ yields

$$\mathbb{E}\left[\left(\eta^{(2)}_l - \eta^{(1)}_{l-1}\right)^2\right] = \Delta t_l^2\,\mathbb{E}[\xi^2] + \mathbb{E}\left[O(\Delta t_l^3)\right],\tag{17}$$

where $\xi$ collects the leading-order difference between the two random variables. The possible outcomes of the random variables can be one of the following four different cases: (a) transitions at both time steps; (b) a transition only at the first time step; (c) a transition only at the second time step; and (d) no transition at all (see Fig. 1 for illustration). To distinguish among them, consider a pair of binary variables $(\alpha_1, \alpha_2)$, each taking the value $1$ when no transition occurs at the corresponding time step, and $0$ otherwise. From the transition probabilities of the corresponding continuous-time Markov chain, the probability of obtaining each of them is given by

$$p_{\alpha_1\alpha_2} = \left[\alpha_1\, e^{\Delta t_l d_i} + (1-\alpha_1)\left(1 - e^{\Delta t_l d_i}\right)\right]\left[\alpha_2\, e^{\Delta t_l d_j} + (1-\alpha_2)\left(1 - e^{\Delta t_l d_j}\right)\right].\tag{18}$$

By expanding in powers of $\Delta t_l$, we have

$$p_{\alpha_1\alpha_2} = \alpha_1\alpha_2 + C\,\Delta t_l + O(\Delta t_l^2),\tag{19}$$

where the constant $C$ depends merely on $d_i$ and $d_j$. Note that for case (d), which corresponds to $\alpha_1 = \alpha_2 = 1$, $\xi$ turns out to be zero, while all the remaining cases occur with probability at most $O(\Delta t_l)$; therefore it follows that $\mathbb{E}[\xi^2]$ is $O(\Delta t_l)$, and the proof is complete.

To find the global convergence rate, consider the following Lemma.

###### Lemma 2

Assume $\eta^{(j)}_l$, with $j = 1,\dots,N/2$, are discrete random variables taking values according to the transition probabilities of a continuous-time Markov chain generated by the infinitesimal generator and evaluated at time $\Delta t_l$. Then, it holds that

$$\mathbb{E}\left[(P_l - P_{l-1})^2\right] = O(\Delta t_l^2).\tag{20}$$

Proof. Expanding $P_l$ and $P_{l-1}$ in Eq. (9) in powers of $\Delta t_l$, we have

$$\mathbb{E}\left[(P_l - P_{l-1})^2\right] = \Delta t_l^2\,\mathbb{E}\!\left[\left(\sum_{j=1}^{N/2}\xi_j\right)^{\!2}\right] + O(\Delta t_l^3),\tag{21}$$

where each $\xi_j$ is defined as in Lemma 1 for the $j$-th pair of consecutive time steps. Taking into account that the $\xi_j$ are independent random variables governed by the same probability function, it follows that

$$\mathbb{E}\!\left[\left(\sum_{j=1}^{N/2}\xi_j\right)^{\!2}\right] = \sum_{j=1}^{N/2}\mathbb{E}[\xi_j^2] + 2\sum_{j<k}\mathbb{E}[\xi_j\,\xi_k].\tag{22}$$

Since $\xi_j$ is independent of $\xi_k$ for $j \ne k$, then $\mathbb{E}[\xi_j\,\xi_k] = \mathbb{E}[\xi_j]\,\mathbb{E}[\xi_k]$, and the cross terms do not contribute to the leading order. Therefore, we have

$$\sum_{j=1}^{N/2}\mathbb{E}[\xi_j^2] = \frac{N}{2}\,\mathbb{E}[\xi_1^2].\tag{23}$$

Finally, from Lemma 1 it turns out that $\mathbb{E}[\xi_1^2]$ is of order $\Delta t_l$, and since $N\,\Delta t_l$ is constant by definition, it follows that $\mathbb{E}[(P_l - P_{l-1})^2] = O(\Delta t_l^2)$.

In Fig. 2(b), the variance $\mathbb{V}[P_l - P_{l-1}]$ is shown as a function of the level $l$. The adjacency matrices correspond to a small-world network of three different sizes. Note that the numerically obtained convergence rate fully agrees with the theoretical estimation.

The computational time of the Monte Carlo algorithm was already estimated in [7], and is given by

$$T_{\mathrm{CPU}} = \alpha_{\mathrm{in}}\,\beta\,\bar{d}\,M + \alpha_{\mathrm{out}}\,\frac{\beta}{\Delta t_l}\,M.\tag{24}$$

Here $\bar{d}$ denotes the mean degree, while $\alpha_{\mathrm{in}}$ and $\alpha_{\mathrm{out}}$ are suitable proportionality constants. In Fig. 3, the results corresponding to the CPU time spent by the Monte Carlo algorithm when computing the total communicability of two different networks characterized by different values of the mean degree are shown. The results are in agreement with the theoretical estimation in Eq. (24). In particular, note that for sufficiently large $\Delta t_l$ (or equivalently sufficiently small $l$) the computational time tends to a constant value, while for smaller values it scales as $1/\Delta t_l$. This also explains what was mentioned previously in Section 2, namely that the initial level of the multilevel method could be chosen different from zero to obtain a better performance of the MLMC algorithm. In fact, depending on the value of $\Delta t_l$, and consequently on the level $l$, two different working regimes can be observed, and only in the regime characterized by a sufficiently small value of $\Delta t_l$ does the computational time asymptotically increase with the level. Specifically, this occurs when the contribution to the computational time of the second term in Eq. (24) is much larger than that of the first term. Assuming that the values of the proportionality constants $\alpha_{\mathrm{in}}$ and $\alpha_{\mathrm{out}}$ are similar, we can readily estimate the minimum value of the level needed for this purpose, which is given by

$$l_0 \gg \log_2\!\left(\beta\,\bar{d}\right).\tag{25}$$

However, in general both constants $\alpha_{\mathrm{in}}$ and $\alpha_{\mathrm{out}}$ are not only different, but also difficult to estimate theoretically. From numerical simulations, however, a more practical lower bound has been found, and is given by

$$l_0 = \log_2\!\left(2\,\beta\,d_{\max}\right),\tag{26}$$

where $d_{\max}$ corresponds to the maximum value of the diagonal matrix $D$. In Fig. 4 the computational time spent by the MLMC method for different values of the initial level $l_0$ is shown. Here the MLMC method was applied to the problem of computing the total communicability [11] of two different networks, small-world and scale-free, of the same size and with different maximum degrees. The value of $t$ was chosen separately for each network, for the scale-free network specifically to ensure the convergence of the method, as pointed out in [7]. Note that, for both networks, the computational time attains a minimum at a specific value of $l_0$, which is well approximated by Eq. (26).

In view of the convergence rates estimated above, we can apply the aforementioned Theorem 1 in [24] and conclude that the computational complexity of the proposed MLMC algorithm is of order $O(\varepsilon^{-2})$. In practice this means that the error due to the splitting in Eq. (2) is totally canceled out of the cost of the algorithm, leaving only the computational cost inherent to any Monte Carlo method due to the statistical error. In contrast, the complexity of the classical Monte Carlo algorithm proposed in [7] is of order $O(\varepsilon^{-5/2})$. Indeed, this can readily be proved as follows. Concerning the statistical error, the sample size required to achieve a prescribed accuracy $\varepsilon$ is $M = O(\varepsilon^{-2})$, while for the splitting error, the method being of order $\Delta t^2$, the time step required for a given $\varepsilon$ is $\Delta t = O(\varepsilon^{1/2})$. Therefore, the computational complexity, which depends on $M/\Delta t$, is given by $O(\varepsilon^{-5/2})$. In Fig. 5 the results corresponding to the computational time spent to compute the communicability of a single node of a small-world network are plotted as a function of the prescribed accuracy $\varepsilon$, for both the multilevel Monte Carlo and the classical Monte Carlo method. Note the perfect agreement with the theoretical estimates, and the notably superior performance with respect to the classical Monte Carlo method of [7] for smaller values of $\varepsilon$.
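The cost balance behind these two estimates can be written compactly as follows, with $M$ the sample size and $\Delta t$ the time step:

```latex
\underbrace{M = O(\varepsilon^{-2})}_{\text{statistical error}}
\quad\text{and}\quad
\underbrace{\Delta t = O(\varepsilon^{1/2})}_{\text{splitting error } O(\Delta t^2)}
\;\Longrightarrow\;
\mathrm{Cost}_{\mathrm{MC}}
  = O\!\left(\frac{M}{\Delta t}\right)
  = O\!\left(\varepsilon^{-2}\,\varepsilon^{-1/2}\right)
  = O\!\left(\varepsilon^{-5/2}\right),
\qquad
\mathrm{Cost}_{\mathrm{MLMC}} = O\!\left(\varepsilon^{-2}\right).
```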

## 5 Performance evaluation

To illustrate the performance of the multilevel Monte Carlo method, in the following we show the results corresponding to the several benchmarks conducted so far. They concern the numerical solution of a linear parabolic partial differential equation by means of an exponential integrator, as well as the numerical computation of the total communicability metric of some synthetic complex networks.

In fact, among the important applications of the multilevel algorithm for computing the action of a matrix exponential is the numerical solution of parabolic PDEs by means of the method of lines, using an exponential integrator. When the method of lines [31] is applied to an initial parabolic PDE problem by discretizing the spatial variables, a system of coupled ordinary differential equations, with time as the independent variable, is obtained. Finally, the system can be solved by resorting to the computation of a matrix exponential acting on the discretized initial value function. The method was applied here to solving the Dirichlet boundary-value problem for both a 3D heat equation and a 3D convection-diffusion equation. The former problem is given by

$$\frac{\partial u}{\partial t} = \nabla^2 u,\quad \text{in } \Omega=[-\delta,\delta]^3,\ t>0,\tag{27}$$

with boundary and initial conditions

$$u(x,t)\big|_{\partial\Omega}=0,\qquad u(x,0)=f(x).\tag{28}$$

The approximated solution $\hat{u}$, after discretizing in space with grid spacing $2\delta/n_x$ and using the standard seven-point stencil finite difference approximation, can be written formally as

 $\hat{u}(\mathbf{x},t)=e^{\frac{n_x^2\,t}{4\delta^2}\hat{L}}\,\hat{u}_0(\mathbf{x}),$ (29)

where $\hat{L}$ denotes the corresponding discretized Laplacian operator.
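To make the setting concrete, the following is a minimal sketch (our own, assuming SciPy's sparse toolbox is available; function and variable names are ours, not the paper's) of the method of lines for Eq. (27): the discretized Laplacian is assembled as a Kronecker sum over the interior grid of $[-\delta,\delta]^3$, and the action of its exponential on the discretized initial condition is evaluated with `scipy.sparse.linalg.expm_multiply`:

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import expm_multiply

def laplacian_3d(n, delta):
    """Seven-point-stencil Laplacian on (-delta, delta)^3 with n interior
    points per direction and homogeneous Dirichlet boundary data."""
    h = 2.0 * delta / (n + 1)  # grid spacing
    d1 = sp.diags([1.0, -2.0, 1.0], [-1, 0, 1], shape=(n, n)) / h**2
    I = sp.identity(n)
    # Kronecker-sum assembly of the 3D operator
    return (sp.kron(sp.kron(d1, I), I)
            + sp.kron(sp.kron(I, d1), I)
            + sp.kron(sp.kron(I, I), d1))

n, delta, t = 8, 1.0, 0.05
L = laplacian_3d(n, delta)
x = np.linspace(-delta, delta, n + 2)[1:-1]      # interior grid points
X, Y, Z = np.meshgrid(x, x, x, indexing="ij")
u0 = np.cos(np.pi * X / 2) * np.cos(np.pi * Y / 2) * np.cos(np.pi * Z / 2)
u_t = expm_multiply(t * L, u0.ravel())            # action of e^{tL} on u0
print(u_t.max())
```

Here `expm_multiply` plays the role that the Monte Carlo estimator plays in the paper; the point of the sketch is only to fix the matrix and vector entering Eq. (29).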

Concerning the convection-diffusion equation, mathematically we have

 $\dfrac{\partial u}{\partial t}=\nabla^2 u+\boldsymbol{\beta}\cdot\nabla u,\quad \mathbf{x}\in\Omega,\ t>0,$
 $u(\mathbf{x},t)|_{\partial\Omega}=g(\mathbf{x},t),$ (30)
 $u(\mathbf{x},0)=f(\mathbf{x}),$

where $\boldsymbol{\beta}$ is the velocity field. After applying the standard Galerkin finite element method [41] to the discretized domain, the following linear system of coupled first-order ODEs is obtained

 $M\dfrac{d\mathbf{u}}{dt}=-K\mathbf{u}+\mathbf{F},\qquad \mathbf{u}(0)=\mathbf{u}_0,$ (31)

where $M$ is the assembled mass matrix, $K$ the corresponding assembled stiffness matrix, and $\mathbf{F}$ the load vector. Concerning the boundary data, these are included by modifying the matrices and the load vector in the usual way. For computational convenience, in the following the mass matrix was lumped [41], resulting in practice in a diagonal mass matrix.
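Mass lumping can be sketched as follows (an illustrative example of ours using a small 1D linear-element mass matrix, not the COMSOL matrices used in the benchmarks): the consistent mass matrix is replaced by the diagonal matrix of its row sums, which preserves the total mass and makes $M^{-1}$ trivial to apply:

```python
import numpy as np
import scipy.sparse as sp

def lump_mass(M):
    """Row-sum lumping: replace the consistent mass matrix by the
    diagonal matrix of its row sums (preserves the total mass)."""
    return sp.diags(np.asarray(M.sum(axis=1)).ravel())

# Consistent mass matrix of linear (P1) elements on a uniform 1D mesh,
# used here only as a small reproducible example.
h = 0.1
M = sp.diags([h / 6, 2 * h / 3, h / 6], [-1, 0, 1], shape=(5, 5)).tocsr()
Ml = lump_mass(M)
print(Ml.diagonal())  # interior rows sum to h, boundary rows to 5h/6
```

With `Ml` diagonal, products such as $M^{-1}K$ in the formulas below reduce to a row scaling of $K$, which is the computational convenience referred to above.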

Formally, the solution of the inhomogeneous system of ODEs (31) can be written in terms of a matrix exponential as follows

 $\mathbf{u}(t)=e^{-tM^{-1}K}\mathbf{u}_0+\displaystyle\int_0^t ds\, e^{-sM^{-1}K}M^{-1}\mathbf{F}(t-s).$ (32)

Note that for the particular case of time-independent boundary data, the load vector becomes constant, and the solution simplifies to

 $\mathbf{u}(t)=e^{-tM^{-1}K}\mathbf{u}_0-K^{-1}M\left(e^{-tM^{-1}K}-I\right)M^{-1}\mathbf{F}.$ (33)

On the other hand, for arbitrary time-dependent boundary data, the integral in Eq. (32) can be computed resorting to suitable numerical quadratures. This procedure can in any case also be followed to avoid evaluating the inverse of the matrix $K$ in Eq. (33), which in general can be computationally costly, and in fact it was the procedure used here for solving numerically the system of equations in (31). Since the error in computing the matrix exponential was estimated to be of low order in the time step (see Sec. 4), to avoid degrading the overall order, in the following we have implemented the composite Simpson quadrature rule, which is known to be of much higher (fourth) order. More specifically, the solution can be computed as follows

 $\hat{\mathbf{u}}(t)=e^{-tM^{-1}K}\mathbf{u}_0+\dfrac{\Delta t}{3}\Bigg(M^{-1}\mathbf{F}(t)+2\displaystyle\sum_{j=1}^{N/2-1}e^{-t_{2j}M^{-1}K}M^{-1}\mathbf{F}(t-t_{2j})+4\displaystyle\sum_{j=1}^{N/2}e^{-t_{2j-1}M^{-1}K}M^{-1}\mathbf{F}(t-t_{2j-1})+e^{-tM^{-1}K}M^{-1}\mathbf{F}(0)\Bigg),$ (34)

where $\Delta t=t/N$, and $t_j=j\,\Delta t$. Note that this requires computing $N$ independent matrix exponential evaluations at different instants of time. However, it turns out that, using Algorithm 2, in practice only a single evaluation at the final time $t$ is needed. This is because, when using the Monte Carlo method for computing the matrix exponential at time $t$, the information required to evaluate the matrix exponential at intermediate times has also been automatically generated by the algorithm. In fact, the random paths generated up to time $t$, which have been simulated advancing in time steps of size $\Delta t$, can be used directly to evaluate the matrix exponential over the vector at any intermediate time $t_j$. Moreover, it is worth observing that this can be accomplished without any additional computational cost.
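The quadrature in Eq. (34) can be checked on a small dense example. The sketch below is our own (it uses `scipy.linalg.expm` directly rather than the Monte Carlo estimator, and a diagonal test matrix standing in for $M^{-1}K$): it applies the composite Simpson rule to the integral in Eq. (32) for a constant load vector and compares against the closed form of Eq. (33):

```python
import numpy as np
from scipy.linalg import expm

rng = np.random.default_rng(0)
n, t, N = 6, 1.0, 8                    # N must be even for Simpson's rule
A = np.diag(rng.uniform(1.0, 2.0, n))  # stands for M^{-1}K (lumped mass absorbed)
u0 = rng.standard_normal(n)
F = rng.standard_normal(n)             # constant load vector

# Composite Simpson weights for int_0^t g(s) ds with dt = t/N:
# pattern (dt/3) * [1, 4, 2, 4, ..., 2, 4, 1]
dt = t / N
w = np.array([1] + [4, 2] * (N // 2 - 1) + [4, 1]) * dt / 3
integral = sum(wj * (expm(-(j * dt) * A) @ F) for j, wj in enumerate(w))
u_simpson = expm(-t * A) @ u0 + integral

# Closed form for constant F (the Eq. (33)-type expression)
u_exact = expm(-t * A) @ u0 + np.linalg.solve(A, (np.eye(n) - expm(-t * A)) @ F)
print(np.max(np.abs(u_simpson - u_exact)))
```

The observed discrepancy is at the level of the fourth-order quadrature error, well below the first-order splitting error it is paired with in the text.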

In practice this can readily be done by slightly modifying the algorithm, as shown in boldface in the new Algorithm 3. Here the weight vector contains the suitable weights corresponding to the Simpson quadrature rule.

procedure MLMCL-FEM(…)
    …
    for … do
        …
        for … do
            …
            if … then
                …
            end if
            generate an exponentially distributed time increment
            while … do
                …
                generate an exponentially distributed time increment
                generate the next index according to Eq. (13)
                …
            end while
            …
            if … then
                …
            end if
            …
        end for
        …
    end for
    return …
end procedure
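Algorithm 3 is easiest to follow with a concrete, if simplified, implementation at hand. The sketch below is entirely our own: a plain single-level Monte Carlo estimator of one entry of $e^{tA}v$ via a Feynman-Kac-type random walk, with exponentially distributed holding times and index jumps drawn from the off-diagonal entries, in the spirit of the random paths described above; all names and the small test matrix are ours, and the multilevel machinery and Simpson weights of Algorithm 3 are deliberately omitted:

```python
import numpy as np
from scipy.linalg import expm

def mc_expm_action_entry(A, v, t, i, n_samples, rng):
    """Monte Carlo estimate of (e^{tA} v)_i via random paths that jump
    between indices with exponentially distributed holding times.
    A toy single-level version of the random-path idea."""
    d = np.diag(A)
    C = A - np.diag(d)               # off-diagonal part
    lam = np.abs(C).sum(axis=1)      # jump rate out of each state
    acc = 0.0
    for _ in range(n_samples):
        x, s, w = i, 0.0, 1.0
        while True:
            tau = rng.exponential(1.0 / lam[x]) if lam[x] > 0 else np.inf
            if s + tau >= t:         # no more jumps before time t
                w *= np.exp((d[x] + lam[x]) * (t - s))
                break
            w *= np.exp((d[x] + lam[x]) * tau)
            p = np.abs(C[x]) / lam[x]      # jump distribution
            j = rng.choice(len(v), p=p)
            w *= np.sign(C[x, j])          # carry the sign of the entry
            x, s = j, s + tau
        acc += w * v[x]
    return acc / n_samples

A = np.array([[-1.0, 0.5], [0.3, -0.8]])
v = np.array([1.0, 2.0])
t = 1.0
rng = np.random.default_rng(7)
est = mc_expm_action_entry(A, v, t, 0, 20000, rng)
exact = (expm(t * A) @ v)[0]
print(est, exact)
```

The same paths, truncated at the intermediate times $t_j$, also provide estimates of $(e^{t_j A}v)_i$ at no extra cost, which is precisely the property exploited in Eq. (34).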

Another important application of the matrix exponential consists in computing the total communicability of a network. By definition, the total communicability of a network [11] is given by

 $TC=(\mathbf{1},e^{A}\mathbf{1}),$ (35)

where $\mathbf{1}$ is a vector of ones, and $(\cdot\,,\cdot)$ denotes the scalar product. In the following we analyze the total communicability of several synthetic networks of the small-world and scale-free type, of arbitrary size. These networks were generated in Matlab using the functions smallw and pref, respectively, both freely available through the toolbox CONTEST [15].

In contrast to small-world networks, scale-free networks are characterized by the presence of hubs, which in practice entail a much larger maximum eigenvalue than for small-world networks. Since the value of this eigenvalue increases with the network size, in order to keep the numerical error constant it may be necessary for the MLMC method to increase strongly the number of required levels accordingly. To prevent such a computationally costly procedure, a reasonable alternative relies on computing a generalization of the communicability, namely $TC(\alpha)=(\mathbf{1},e^{\alpha A}\mathbf{1})$, where $\alpha$ is typically interpreted as an effective "temperature" of the network (see, e.g., [21]). Essentially, the idea exploited in [7] was to use the inverse of the maximum eigenvalue as the value of the parameter $\alpha$, which in practice controls the rapid growth of the norm of the matrix with the size of the network. In principle, different values of $\alpha$ could have a direct impact not only on the entries of the communicability vector, but also on the ranking of the nodes according to their communicability values. However, in practice this does not occur: through the analysis of the intersection similarity of several networks, it was shown in [7] that the chosen value of $\alpha$ does not affect the results significantly, the differences being in all cases well below the typical error tolerances, and even becoming smaller for increasingly larger network sizes. Consequently, and to ensure fast convergence of the method, in the simulations below we have used $\alpha=1/\lambda_{\max}$, where $\lambda_{\max}$ is the maximum eigenvalue of $A$. However, finding the maximum eigenvalue for large networks is itself computationally costly and, in the following, a faster alternative based on the maximum degree of the network, which is an upper bound on $\lambda_{\max}$, was used instead.
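The computation of the normalized total communicability with the maximum-degree bound can be sketched as follows (our own code; a plain ring lattice stands in for the CONTEST-generated networks, and all function names are ours):

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import expm_multiply

def total_communicability(A, alpha=None):
    """TC(alpha) = 1^T e^{alpha A} 1.  When alpha is omitted, use the
    cheap normalisation alpha = 1/deg_max, since the maximum degree
    bounds the maximum eigenvalue of an adjacency matrix from above."""
    n = A.shape[0]
    ones = np.ones(n)
    if alpha is None:
        deg_max = A.sum(axis=1).max()
        alpha = 1.0 / deg_max
    return ones @ expm_multiply(alpha * A, ones)

# Ring lattice (every node has degree 2), used only as a reproducible
# stand-in for a small-world network generated with CONTEST's smallw.
n = 100
ring = sp.diags([1, 1], [-1, 1], shape=(n, n)).tolil()
ring[0, n - 1] = ring[n - 1, 0] = 1
A = ring.tocsr()
print(total_communicability(A))
```

For this ring, $A\mathbf{1}=2\cdot\mathbf{1}$ and $\alpha=1/2$, so $TC(\alpha)=n\,e$ exactly, which makes the example easy to verify by hand.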

### 5.1 Shared memory architecture

The simulations corresponding to the shared memory architecture were run on both a commodity server and a single node of the MareNostrum supercomputer. The MLMC algorithm was implemented in OpenMP and, to compare its performance with other methods as well as to control the numerical errors, the MATLAB toolbox funmkryl, freely available in [23], was used. This toolbox implements a Krylov subspace method with deflated restarting for matrix functions [27]. Note that the computational core of Matlab is written in C/C++ and, specifically, operations involving matrix-vector or matrix-matrix multiplications show nowadays an optimal performance in the latest versions of Matlab, since they exploit very efficiently multithreaded execution as well as the SIMD units available in current microprocessors. Taking into account that the Krylov subspace method relies extensively on matrix-vector multiplications, we assume the performance obtained by the Matlab code to be more than competitive with respect to that of a native C/C++ code. Moreover, our OpenMP implementation was not specifically optimized, to ensure a fair comparison with Matlab. Finally, it is worth remarking that the choice of Matlab for comparison, rather than a native code, was essentially motivated by the lack of any freely available parallel code in C/C++.

Example A: Partial differential equations.

The computational time spent by both the MC and the MLMC method for solving the initial-boundary value problem for the 3D heat equation at a single point is shown in Tables 1 and 2. This was done for different matrix sizes and numbers of cores running on the commodity server, and for approximately the same accuracy. It is worth observing that, in view of the probabilistic nature of any Monte Carlo-based algorithm, the measured computational time for a single simulation cannot be uniquely defined. Therefore, in the following and for convenience, for both the Monte Carlo and the MLMC method, the computational times reported in all tables correspond to the most favorable simulation in terms of elapsed time, obtained after repeating the simulations a few times. Concerning the error, it was estimated using the aforementioned Krylov-based method, setting a very small value of the stopping-accuracy parameter as well as a suitable value of the restart parameter.

For comparison, the computational time spent by Matlab is also shown, but only for the smaller matrix size, since for the larger one the Matlab simulations run out of memory. As pointed out in [7], this is mainly due to the memory demands of any Krylov-based algorithm. In contrast, the Monte Carlo method is extremely efficient in terms of memory management, since it only requires allocating the input matrix in memory.

For the solution of the convection-diffusion equation, the arbitrary complex geometry plotted in Fig. 6 was used as the domain, with the Dirichlet boundary data prescribed at the surface of the outer sphere and at the surface of the inner cylinder. The size of the domain can be conveniently increased by simply rescaling both the sphere and the cylinder using a single scale parameter. To generate the computational mesh, and to obtain the corresponding FEM matrices and load vector, the scientific software COMSOL [14] was used, choosing specifically linear elements in the discretization settings. Concerning the element size used when meshing the geometry, the maximum and minimum sizes were kept fixed.

The computational time spent computing the solution at a single point and time inside the domain is shown in Table 3 for different numbers of cores. Note that both the MC method and the MLMC method scale well with the number of cores, while the computational time spent by Matlab rapidly saturates when increasing the number of cores, due to the heavier intercommunication overhead of the Krylov-based algorithm.

In Table 4 the computational time spent computing the solution for domains of different size is shown, with the number of cores now kept fixed at the maximum number available.

It is remarkable that the computational cost of the MC and MLMC methods appears to be almost independent of the size of the domain, while it increases almost linearly for the Krylov-based method. As already explained in [7] for the specific case of complex networks, this is mainly due to the similar matrix structure observed for any value of the matrix size. Because of this, the error becomes mostly independent of the size, and consequently it is not necessary to further adjust the sample size $M$ or the time step for increasingly larger matrices (for a given prescribed accuracy of the solution), making the computational cost of the algorithm almost independent of the size of the domain. This does not happen with the Krylov-based method, which specifically allows the MLMC method to achieve a higher computational performance than the Matlab solution for large-scale problems.

Even though the MC and MLMC methods were initially proposed to compute the solution at a single instant of time, it turns out that they can be used as well to obtain the solution at intermediate instants. As a remarkable feature, this can be done without any additional computational cost, as explained above. Concerning the Krylov-based methods, it is worth pointing out that there have also been some recent attempts [32] to improve the performance when computing the solution over a finite time interval, the performance of the resulting algorithm being, however, slightly worse than that of the algorithm for computing the solution at a single time. To test the accuracy of the solution obtained at intermediate times, in Fig. 7 the solution computed using the Monte Carlo method is compared with the solution obtained using the Krylov-based method. Note the excellent agreement between both solutions for all values of time.

Example B: Complex networks

Small-world networks. In Table 5 the computational time required to compute the total communicability of a small-world network is shown as a function of the number of cores for the Monte Carlo method, the MLMC method, and Matlab.

In Table 6 the results corresponding to a weak-scalability analysis of the MLMC algorithm are shown. For this purpose the algorithm was run for an increasing number of cores, searching in each case for the value of the accuracy $\varepsilon$ that keeps the simulation time constant. Note that, as the number of cores increases, the attainable value of $\varepsilon$ decreases accordingly. Indeed, since the computational cost of the MLMC algorithm is of order $\mathcal{O}(\varepsilon^{-2})$, the workload of the algorithm increases when reducing the value of $\varepsilon$, and the number of cores must therefore be increased correspondingly to keep the overall execution time approximately constant.

From the results in Table 6 it can be seen that, whenever the number of cores increases by a given factor, the value of $\varepsilon$ should be reduced by approximately the square root of that factor to keep the same execution time. It is worth observing that for such a reduction of $\varepsilon$ the workload of the algorithm grows by the same factor as the number of cores, and can therefore be fully absorbed by the additional cores. This is due to the remarkable scalability of the parallel algorithm. Similar conclusions can be drawn from the results obtained with larger numbers of cores.
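The scaling rule above can be stated as a one-liner (the numbers below are purely illustrative, not the values from Table 6): for an $\mathcal{O}(\varepsilon^{-2})$ algorithm, reducing $\varepsilon$ by a factor $k$ multiplies the workload, and hence the number of cores needed for a constant runtime, by $k^2$:

```python
# Back-of-the-envelope weak-scaling rule for an O(eps^-2) algorithm
# (an illustration of the argument above, not measured data).

def cores_needed(base_cores, eps_old, eps_new):
    """Cores required to keep runtime constant when the target
    accuracy changes from eps_old to eps_new."""
    return base_cores * (eps_old / eps_new) ** 2

print(cores_needed(12, 1e-3, 5e-4))  # -> 48.0
```

Halving the target error thus calls for four times as many cores, matching the behavior observed in the table.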

Table 7 shows the computational time when computing the total communicability for different network sizes. Here the number of cores was fixed at the maximum number available. It is remarkable that the computational cost of the MLMC method appears to be almost independent of the size of the network, while it increases almost linearly for the Krylov-based method.

Scale-free networks. In Table 8 the results corresponding to a scale-free network of large size are shown for different numbers of cores. Similarly to the results obtained for the small-world network, the MLMC method outperforms the Krylov-based method for large network sizes and core counts.

### 5.2 Distributed memory architecture

The simulations for a distributed memory architecture were carried out on the MareNostrum Supercomputer of the Barcelona Supercomputing Center (BSC) and on the Marconi Supercomputer at CINECA. In both cases, two processes were launched on each node (one per processor), with as many threads as physical cores available (24 threads on MareNostrum and 18 on Marconi). Up to 200 nodes (a total of 9600 cores) were used on MareNostrum and up to 160 nodes (5760 cores) on Marconi, which are respectively the maximum we had access to.

To the best of our knowledge, no parallel code suitable for distributed memory architectures and capable of computing the action of a matrix exponential over a vector is currently available. Therefore, in the following, only results corresponding to the proposed multilevel method implemented in MPI are given.

Example A: Partial differential equations.

The computational time spent by the multilevel method for computing the solution of the boundary-value problem for the 3D heat equation at a single point is shown in Table 9 for different numbers of cores. These results were all obtained on the Marconi system. The speedup column indicates how much faster the execution is relative to the run with half the number of cores (the previous row in the table).