## 1 Basic idea of ParaDIAG

We start with a basic introduction to ParaDIAG algorithms. Suppose we need to solve in parallel the system of ODEs with initial value arising from the semi-discretization of a time-dependent PDE, where . For finite element discretizations, is the mass matrix and is the stiffness matrix. For finite difference discretizations,

is just an identity matrix. The classical approach for solving such systems of ODEs is to apply a time-integrator, and then solve the resulting difference equation step-by-step in time. Instead, ParaDIAG tries to solve these difference equations

all-at-once. For linear multi-step methods, the all-at-once system is of the form(1.1) |

where are Toeplitz matrices
specified by the time-integrator and is the number of time
steps^{1}^{1}1For Runge-Kutta methods, the all-at-once system is
different and will be treated in a forthcoming update of this
note.. All ParaDIAG algorithms focus on treating the matrices
and , while keeping and unchanged. There are
mainly two approaches: first, using different step sizes
, e.g., a geometrically increasing sequence
with , which makes the
time-discretization matrices diagonalizable. This yields ParaDIAG
algorithms which are direct solvers in the ParaDIAG-I group
[12, 5, 7].

The second treatment is to use a uniform step size and solve the all-at-once system (1.1) iteratively, which leads to ParaDIAG algorithms in the ParaDIAG-II group. There are several variants, but the common point is to introduce the -circulant block matrix

(1.2) |

where and are Strang type -circulant matrices constructed from and , and is a free parameter. One can then either solve (1.1) via the stationary iteration [18]

(1.3) |

where is the iteration index, or via Krylov subspace methods (e.g., GMRES, MINRES) by solving the preconditioned system [13]

(1.4) |

which is nothing else than the stationary iteration (1.3) written at its fixed point, i.e. at convergence.

The algorithms proposed in [15] and [8] are essentially ParaDIAG-II algorithms as well, but they are derived from a different point of view. For example, in [8] the authors introduced a Waveform Relaxation (WR) iteration , , and after a time-discretization one can show that at each iteration the all-at-once system is , where . The algorithm in [15] can be understood similarly.

For each variant of ParaDIAG-II we need to compute with

being an input vector. The reason for using

is twofold: first, since and are Strang type -circulant matrices constructed from the Toeplitz matrices and , it naturally holds that converges to as goes to zero. This implies that by using a relatively small , the ParaDIAG-II algorithms converge rapidly. The second point lies in the fact that and can be diagonalized simultaneously, as is shown in the following Lemma.###### Lemma 1 (see [3])

Let (with and ) be the discrete Fourier matrix and define for any given parameter the diagonal matrix

Then the two -circulant matrices can be simultaneously diagonalized as

where and represents the first column of , .

Due to the property of the Kronecker product, we can factor and thus we can compute by performing the following three steps:

(1.5) |

where and . Since and are given by FFT techniques, Step-(a) and Step-(c) can be computed efficiently with operations. Step-(b) can be computed in parallel since all linear systems are completely independent from each other at different time points. These three steps represent the key steps of ParaDIAG algorithms and will appear frequently in this note, although the details differ in the various cases.

For nonlinear problems with , the basic idea for applying ParaDIAG algorithms is as follows: for linear multi-step methods, the non-linear all-at-once system is

(1.6) |

where . The Jacobian matrix of (1.6) is

(1.7) |

where . To apply ParaDIAG, we approximate the Jacobian matrix (1.7) by

where is constructed from the values by some averaging [6], e.g., or . Then, we can solve (1.6) by the following simplified Newton iteration:

(1.8) |

where for each iteration the increment can be obtained using a ParaDIAG algorithm performing the three steps in (1.5). If we use different step sizes as in [6], then and are already diagonalizable, and we can replace by in (1.8).

In practice, the ParaDIAG algorithms can be combined with a windowing technique: after a certain number of time steps computed in parallel in the current time window, the computation can be restarted for the next time window in a sequential way. This permits the use of a certain adaptivity in time and space.

## 2 ParaDIAG for Linear Advection-Diffusion Problems

To illustrate the ParaDIAG-I and ParaDIAG-II algorithms, we now
use the concrete example of the advection-diffusion equation
with periodic boundary conditions^{2}^{2}2We use periodic
boundary condition make the advection dominated situation harder
for PinT algorithms, see [4].

(2.1) |

where . Using the method of lines and a centered finite difference scheme for the spatial derivatives, we get the system of ODEs

(2.2a) | |||

where the matrix is | |||

(2.2b) | |||

Here , and the periodic boundary conditions cause a zero eigenvalue in the matrix . |

### 2.1 ParaDIAG-I

To use ParaDIAG as a direct solver, one has to use all different time steps to make the time stepping matrix diagonalizable, and one possibility is to use geometrically increasing time step sizes to discretize (2.2a) as proposed in [12],

(2.3) |

where is free parameter and is the first step size. We use the linear -method,

(2.4) |

We will only consider and , which corresponds to the Backward-Euler method and the Trapezoidal rule. For , the method is also called the Crank-Nicolson scheme. The difference equations (2.4) can be combined into the all-at-once system

(2.5a) | |||

where , is an identity matrix and are matrices representing the time-discretization, namely | |||

(2.5b) |

The right hand-side is given by with .

Let and . Then, we can rewrite (2.5a) as

(2.6) |

where is an identity matrix. The diagonalization of for and can be found in [5] and [7] respectively, but for the reader’s convenience, we show the details here:

###### Theorem 2.1 (see [5, 7])

For the geometrically increasing step sizes given by (2.3) with , the matrix can be diagonalized as , where

. The eigenvector matrix

and its inverse are Toeplitz matrices of the formwhere

Now using the typical ParaDIAG factorization

we can solve (2.6) by performing the three steps

(2.7) |

where and . Since and are given in closed form, we only have to do matrix vector multiplications for Step-(a) and Step-(c), or one could use a fast Toeplitz solver based on Fourier techniques. For Step-(b), the linear systems can be solved simultaneously in parallel. There is however an important issue with this direct time parallel solver ParaDIAG-I: if the time steps are very different, the truncation error of the time stepping scheme becomes worse, and if they are very close to each other, ParaDIAG-I suffers from roundoff error in the diagonalization used in Step-(a) and Step-(c). The best one can do is to balance the two errors, as a detailed analysis in [5, 7] shows, and this limits the applicability of ParaDIAG-I to shorter time intervals and few time steps: the roundoff error is proportional to the condition number of , i.e.,

If is an eigenvector matrix of , the scaled matrix with any invertible diagonal matrix is an eigenvector matrix of as well. From [5, 7], the matrix is a good choice.

To illustrate the limitations of ParaDIAG-I, we provide the Matlab code ParaDIAG_V1_for_ADE, to test it for the advection-diffusion equation. For given , and the constraint determines uniquely the value of the geometric time steps as

(2.8) |

For the space discretization, we fix . To study the accuracy of ParaDIAG-I, we use a reference solution obtained from the Matlab ODE solver ode45 with a very small absolute and relative tolerance, AbsTol= and RelTol=, see Figure 2.1 for an illustration.

The results of ParaDIAG-I compared to a sequential time integration are shown in Figure 2.2

We clearly see that using the geometric time steps (2.8) degrades the accuracy of the numerical solution, and when the time steps are too similar, the roundoff error problem sets in. This phenomenon was carefully studied in [5, 7], and the best possible geometrically stretched grid was determined, which leads to precise limits of time window length and number of time steps within which ParaDIAG-I can be reliably used.

### 2.2 ParaDIAG-II

Instead of using ParaDIAG as a direct solver with all different time steps to make the time stepping matrix diagonalizable, we can use it iteratively and solve a nearby problem in each iteration chosen such that the time stepping matrix of the nearby problem with uniform time step size can still be diagonalized. This idea leads to ParaDIAG algorithms in the ParaDIAG-II group. Among this group, we can use ParaDIAG within a stationary iteration or a Krylov subspace method. There are so far two very different ways to use ParaDIAG within a stationary iteration, proposed in [8] and [15]. The use of ParaDIAG within a Krylov subspace method can be found in [13, 18].

#### 2.2.1 ParaDIAG-II – Waveform Relaxation (WR) Variant

The ParaDIAG algorithm introduced in [8] is based on the Waveform Relaxation iteration

(2.9) |

where is the iteration index and is a free parameter. Upon convergence, the tail term is canceled and thus the converged solution is the solution of (2.2a). Applying the linear -method with a uniform step size to (2.9) gives

(2.10) |

where . We rewrite (2.10) as an all-at-once system,

(2.11a) | |||

where , and are given by | |||

(2.11b) |

The matrices are so-called -circulant matrices and can be diagonalized as stated in Lemma 1, and we can again use the typical ParaDIAG factorization . Hence, similar to (2.7) we can solve (2.11a) performing the three steps

(2.12) |

where and . In (2.7), Step-(a) and Step-(c) can be computed efficiently via FFT and Step-(b) is again highly parallel. The eigenvector matrix satisfies

(2.13) |

and thus the conditioning is depending on the choice of . The convergence properties of this ParaDIAG-II algorithm are summarized in the following theorem.

###### Theorem 2.2 (see [8])

For the linear system of ODEs , suppose with being an arbitrary eigenvalue of . Let be the -th iterate of the ParaDIAG-II algorithm (2.10) with and

be the reference solution obtained by directly applying the same time-integrator to the system of ODEs. Then the linear convergence estimate

holds, whereThis shows that the ParaDIAG-II algorithm (2.10) converges with a rate independent of the spectrum of the matrix and the step size of the time-discretization. The convergence factor becomes smaller when decreases, but the condition number of (cf. (2.13)) implies that can not be arbitrarily small (e.g., not of the size ), because in this case the roundoff error will pollute the accuracy. The best parameter is again the value balancing the roundoff error and the discretization error, like for the direct solver ParaDIAG-I, see [8] for more discussions. In practice, and are good choices.

We provide a Matlab code, namely ParaDIAG_V2_WR_for_ADE, to test the ParaDIAG-II algorithm (2.10). In the code, we use the fft command to obtain by just using the first columns of , instead of the entire matrices. To implement Step-(a) in (2.12), we use the fft command as follows:

b=reshape(b,Nx,Nt); sol_stepA=fft(Gam.*(b.’)).’; |

where b is the vector . Similarly, to implement Step-(c) we use the inverse FFT command ifft,

Uk=(invGam.*ifft(sol_stepB.’)).’; |

Here, Gam and invGam. With an initial guess chosen randomly as random(’unif’,-20,20,, ), the first 2 iterates of this ParaDIAG-II algorithm are shown in Figure 2.3.

The maximum error at each iteration is shown in Figure 2.4.

We next present some parallel speedup results for the ParaDIAG-II algorithm (2.10) based on Waveform Relaxation for a time-dependent advection-diffusion problem with periodic boundary conditions in 2D,

(2.14) |

where and . The results were obtained on the China Tianhe-1 supercomputer [19], which is a multi-array, configurable and cooperative parallel system with a theoretical peak performance of 1.372 petaflops, composed of high performance general-purpose microprocessors and a high-speed Infiniband network. We used the parallel Fortran library MUMPS (MUltifrontal Massively Parallel sparse direct Solver [2, 1]) version 4.10.0 to solve the linear systems in Step-(b) of (2.12). For Step-(a) and Step-(c), the fft and ifft commands are dissected into complex arithmetic operations. For the particular case when the source term is zero as shown in (2.11b), Step-(a) can be implemented in an economical way: only the first column of is needed to compute .

We provided the parallel codes in Fortran, which are zipped by a document labeled as ’Parallel Codes.zip’. For reader’s convenience, the zip document contains a README file, in which we briefly introduce how to use these codes. Note that the modification of the Fortran library MUMPS (serving the purpose of serializing it in a distributed memory environment) is due to the fact that the relevant spatial solves are currently executed only in serial. In addition, a number of note statements are appended to Fortran’s functions and subroutines. All the Fortran codes are compiled with mpich-3.1.3 using the icc compiler version 11.1.059 and -O2 optimization level.

We denote by ParaDIAG-II (B-E) the algorithm (2.10) using Backward-Euler, and by ParaDIAG-II (TR) the one using the Trapezoidal rule, and set . For comparison, we also apply the parareal algorithm and MGRiT to (2.14). The parareal algorithm is implemented using the two-level XBraid solver with F-relaxation, and MGRiT is the multilevel XBraid solver with FCF-relaxation (i.e., an initial F-relaxation followed by a C-relaxation and then a second F-relaxation). Furthermore, we skip the unnecessary work during the first XBraid down cycle for both the parareal and MGRiT algorithms, and fix the coarsening factor to 8. As shown in Table 2.1,

np |
||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|

B-E | TR | PR | MG | B-E | TR | PR | MG | B-E | TR | PR | MG | B-E | TR | PR | MG | B-E | TR | PR | MG | B-E | TR | PR | MG | |

4 | 4 | 4 | 9 | 4 | 4 | 5 | 10 | 7 | 5 | 5 | 33 | 20 | 5 | 5 | 51 | 26 | 5 | 5 | 54 | 27 | 5 | 5 | 55 | 27 |

8 | 4 | 4 | 9 | 4 | 4 | 5 | 10 | 7 | 5 | 5 | 33 | 20 | 5 | 5 | 51 | 26 | 5 | 5 | 54 | 27 | 5 | 5 | 55 | 27 |

16 | 4 | 4 | 9 | 4 | 4 | 5 | 10 | 7 | 5 | 5 | 33 | 20 | 5 | 5 | 51 | 26 | 5 | 5 | 54 | 27 | 5 | 5 | 55 | 27 |

32 | 4 | 4 | 9 | 4 | 4 | 5 | 10 | 7 | 5 | 5 | 33 | 20 | 5 | 5 | 51 | 26 | 5 | 5 | 54 | 27 | 5 | 5 | 55 | 27 |

64 | 4 | 4 | 9 | 4 | 4 | 5 | 10 | 7 | 5 | 5 | 33 | 20 | 5 | 5 | 51 | 26 | 5 | 5 | 54 | 27 | 5 | 5 | 55 | 27 |

128 | 4 | 4 | 9 | 4 | 4 | 5 | 10 | 7 | 5 | 5 | 33 | 20 | 5 | 5 | 51 | 26 | 5 | 5 | 54 | 27 | 5 | 5 | 55 | 27 |

B-E: ParaDIAG-II (B-E), TR: ParaDIAG-II (TR), PR: parareal, MG: MGRiT

ParaDIAG-II (B-E), ParaDIAG-II (TR), parareal and MGRiT converge robustly with respect to the number of processors. ParaDIAG-II (B-E) and ParaDIAG-II (TR) lead to parameter-robust convergence, while for parareal and MGRiT the required iteration counts increase dramatically as changes from 1 to . The tolerance tol for all experiments here is set to ). In Figure 2.5

we compare the measured CPU times for these PinT algorithms. Clearly, ParaDIAG-II (B-E) and ParaDIAG-II (TR) are two optimally scaling PinT algorithms, while for MGRiT and parareal the scaling is a little bit worse (this is because of the sequential coarse-grid-correction as we will see in the next subsection). The corresponding data is given in Table 2.2.

parareal | MGRiT | ParaDIAG-II (B-E) | ParaDIAG-II (TR) | |||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|

1.94 | 2.28 | 2.49 | 2.69 | 2.93 | 4.33 | 5.28 | 7.00 | 3.96 | 7.54 | 15.20 | 29.03 | 3.90 | 7.73 | 15.13 | 28.47 | |

1.95 | 2.30 | 2.53 | 2.80 | 3.03 | 4.72 | 5.57 | 7.41 | 4.03 | 7.86 | 15.53 | 30.93 | 3.89 | 7.75 | 15.03 | 29.00 | |

1.96 | 2.32 | 2.56 | 2.83 | 2.99 | 4.58 | 5.62 | 7.35 | 3.90 | 7.76 | 15.23 | 29.00 | 3.91 | 7.77 | 15.20 | 28.84 | |

1.97 | 2.33 | 2.57 | 2.85 | 2.99 | 4.58 | 5.61 | 7.37 | 3.87 | 7.33 | 15.23 | 29.85 | 3.87 | 7.72 | 15.14 | 28.63 | |

1.95 | 2.31 | 2.55 | 2.82 | 2.95 | 4.55 | 5.61 | 7.33 | 3.89 | 7.74 | 15.17 | 29.57 | 3.91 | 7.72 | 15.29 | 29.95 | |

1.94 | 2.30 | 2.54 | 2.81 | 2.99 | 4.46 | 5.64 | 7.35 | 3.56 | 7.75 | 15.15 | 28.37 | 3.89 | 7.72 | 15.15 | 29.26 |

Regarding the parallel efficiency measured by [14] ( is the wall-clock time using processors), the average parallel efficiency for ParaDIAG-II (B-E) is 92.06%, for ParaDIAG-II (TR) it is 90.70%, while it is only 22.82% for MGRiT and 8.75% for parareal.

#### 2.2.2 ParaDIAG-II – Parareal Variant

The second way to use ParaDIAG within a stationary iteration is based on formulating the coarse-grid-correction (CGC) procedure of the parareal algorithm[10, 9] as an all-at-once system and applying ParaDIAG to it. The parareal algorithm is an iterative PinT algorithm, based on the updating formula

(2.15) |

where and are called coarse and fine propagator, specified by two time-integrators. The quantity denotes a value calculated by applying successively steps of the fine propagator to the differential equations with initial value and the fine step size . The integer is called the coarsening ratio. Let

Then, the parareal algorithm is . This is the so called CGC, which is a sequential procedure and is often the bottleneck of the parallel efficiency. In [15], the author proposed an idea to parallelize the CGC: supposing we have to solve an initial-value problem

we apply to a slightly wrong problem, namely

where is a free parameter. We use the linear case to illustrate the details of ParaDIAG-II based on the parareal algorithm (for the nonlinear case, see [15]). We also use for simplicity Backward-Euler for . Let . The quantity computed from the previous iteration is

Note that all the quantities and can be computed simultaneously in parallel. Hence, . The parareal algorithm (2.15) can be rewritten as

where , which can be represented as

This problem is now precisely of the form (2.10) for , i.e. the Backward-Euler method, and the solution can be obtained using ParaDIAG-II (cf. 2.12). The convergence rate of this ParaDIAG-II parareal variant is summarized in the following theorem.

###### Theorem 2.3 (see [15])

Let be the convergence factor of the parareal algorithm with sequential-in-time CGC (i.e., the classical parareal algorithm) and be the convergence factor with parallel-in-time CGC. Then, there exists some threshold of the parameter , such that

In particular, for linear systems of ODEs with