1 Basic idea of ParaDIAG
We start with a basic introduction to ParaDIAG algorithms. Suppose we need to solve in parallel a system of ODEs with initial value, arising from the semi-discretization of a time-dependent PDE. For finite element discretizations, the matrix multiplying the time derivative is the mass matrix and the other matrix is the stiffness matrix; for finite difference discretizations,
the mass matrix is just an identity matrix. The classical approach for solving such systems of ODEs is to apply a time-integrator, and then solve the resulting difference equations step-by-step in time. Instead, ParaDIAG solves these difference equations all-at-once. For linear multi-step methods, the all-at-once system is of the form
where the time-discretization matrices are Toeplitz matrices specified by the time-integrator and the number of time steps. (For Runge-Kutta methods, the all-at-once system is different and will be treated in a forthcoming update of this note.) All ParaDIAG algorithms focus on treating these two Toeplitz matrices, while keeping the spatial matrices unchanged. There are mainly two approaches: the first uses different step sizes, e.g., a geometrically increasing sequence, which makes the time-discretization matrices diagonalizable. This yields ParaDIAG algorithms which are direct solvers, forming the ParaDIAG-I group [12, 5, 7].
The second approach is to use a uniform step size and solve the all-at-once system (1.1) iteratively, which leads to ParaDIAG algorithms in the ParaDIAG-II group. There are several variants, but the common point is to introduce the $\alpha$-circulant block matrix
and then to solve the preconditioned problem either by a stationary iteration, where $k$ denotes the iteration index, or via Krylov subspace methods (e.g., GMRES, MINRES) applied to the preconditioned system,
which is nothing else than the stationary iteration (1.3) written at its fixed point, i.e., at convergence.
The algorithms proposed in  and  are essentially ParaDIAG-II algorithms as well, but they were derived from a different point of view. For example, in  the authors introduced a Waveform Relaxation (WR) iteration, and after a time-discretization one can show that each iteration again amounts to solving an all-at-once system. The algorithm in  can be understood similarly.
For each variant of ParaDIAG-II we need to compute the action of the inverse of the $\alpha$-circulant block matrix on an input vector. The reason for using $\alpha$-circulant matrices is twofold: first, since the two circulant blocks are Strang-type $\alpha$-circulant matrices constructed from the Toeplitz matrices of the time-discretization, they naturally converge to these Toeplitz matrices as $\alpha$ goes to zero. This implies that by using a relatively small $\alpha$, the ParaDIAG-II algorithms converge rapidly. The second point lies in the fact that the two $\alpha$-circulant matrices can be diagonalized simultaneously, as is shown in the following lemma.
Lemma 1 (see )
Let $\mathbb{F}=\frac{1}{\sqrt{n_t}}\left[\omega^{jk}\right]_{j,k=0}^{n_t-1}$ (with $\omega=e^{-2\pi\mathrm{i}/n_t}$ and $\mathrm{i}=\sqrt{-1}$) be the discrete Fourier matrix and define for any given parameter $\alpha\in(0,1]$ the diagonal matrix
$$\Gamma_\alpha=\mathrm{diag}\left(1,\alpha^{\frac{1}{n_t}},\dots,\alpha^{\frac{n_t-1}{n_t}}\right).$$
Then the two $\alpha$-circulant matrices $C_1^{(\alpha)}$ and $C_2^{(\alpha)}$ can be simultaneously diagonalized as
$$C_l^{(\alpha)}=VD_lV^{-1},\quad V=\Gamma_\alpha^{-1}\mathbb{F}^{*},\quad D_l=\mathrm{diag}\left(\sqrt{n_t}\,\mathbb{F}\Gamma_\alpha c_l\right),$$
where $c_l$ represents the first column of $C_l^{(\alpha)}$, $l=1,2$.
Due to the property of the Kronecker product, we can factor the $\alpha$-circulant block matrix accordingly, and thus we can compute its inverse action by performing the following three steps:
where $V$ and $D_l$ are given in Lemma 1. Since multiplication by $V$ and $V^{-1}$ can be performed by FFT techniques, Step-(a) and Step-(c) can be computed efficiently. Step-(b) can be computed in parallel, since the linear systems at different time points are completely independent of each other. These three steps represent the key steps of ParaDIAG algorithms and will appear frequently in this note, although the details differ in the various cases.
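To make the three steps concrete, here is a minimal NumPy sketch (an illustration, not the note's Matlab code: the mass matrix is taken as the identity, the first columns `c1`, `c2` of the two $\alpha$-circulant blocks are inputs, and all function names are hypothetical). It uses the diagonal scaling $\Gamma_\alpha$ from Lemma 1, with NumPy's FFT playing the role of $\sqrt{n_t}\,\mathbb{F}$:

```python
import numpy as np

def alpha_circulant(first_col, alpha):
    """Build the alpha-circulant matrix with the given first column: it agrees
    with the lower-triangular Toeplitz part, and the entries wrapping around
    above the diagonal are multiplied by alpha."""
    n = len(first_col)
    C = np.zeros((n, n))
    for j in range(n):
        for k in range(n):
            d = j - k
            C[j, k] = first_col[d] if d >= 0 else alpha * first_col[d + n]
    return C

def paradiag_three_steps(c1, c2, A, r, alpha):
    """Solve (C1 kron I + C2 kron A) u = r, with C1, C2 the alpha-circulants
    built from first columns c1, c2, via Steps (a)-(c): FFT in time,
    independent spatial solves, inverse FFT (mass matrix = identity)."""
    nt, nx = len(c1), A.shape[0]
    gamma = alpha ** (np.arange(nt) / nt)          # diagonal of Gamma_alpha
    d1 = np.fft.fft(gamma * c1)                    # eigenvalues of C1
    d2 = np.fft.fft(gamma * c2)                    # eigenvalues of C2
    R = r.reshape(nt, nx)
    S = np.fft.fft(gamma[:, None] * R, axis=0)     # Step (a): scaled FFT in time
    W = np.stack([np.linalg.solve(d1[j] * np.eye(nx) + d2[j] * A, S[j])
                  for j in range(nt)])             # Step (b): parallel solves
    U = np.fft.ifft(W, axis=0) / gamma[:, None]    # Step (c): inverse transform
    return U.reshape(-1)
```

For real input the imaginary part of the returned vector is at roundoff level; a production implementation would of course distribute Step-(b) and use parallel FFTs.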
For nonlinear problems, the basic idea for applying ParaDIAG algorithms is as follows: for linear multi-step methods, the nonlinear all-at-once system is
where . The Jacobian matrix of (1.6) is
where . To apply ParaDIAG, we approximate the Jacobian matrix (1.7) by
where for each iteration the increment can be obtained using a ParaDIAG algorithm performing the three steps in (1.5). If we use different step sizes as in , then the time-discretization matrices are already diagonalizable, and no circulant approximation is needed in (1.8).
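To illustrate the quasi-Newton idea, here is a small sketch for a scalar nonlinear ODE $u'=f(u)$ discretized with Backward-Euler: the block-diagonal part of the true Jacobian, $\mathrm{diag}(f'(u_j))$, is replaced by its average, so that the approximate Jacobian has the Kronecker (Toeplitz) structure that ParaDIAG can diagonalize. (A hypothetical sketch: scalar in space, and the structured system is solved densely here instead of via the three steps in (1.5).)

```python
import numpy as np

def nonlinear_paradiag_newton(f, fprime, u0, dt, nt, tol=1e-12, maxit=60):
    """Quasi-Newton for the Backward-Euler all-at-once system of u' = f(u):
    residual F(U)_j = u_j - u_{j-1} - dt*f(u_j).  The varying diagonal
    Jacobian blocks are replaced by their average, giving the structured
    approximate Jacobian B - dt*jbar*I used for every update."""
    B = np.eye(nt) - np.eye(nt, k=-1)      # Backward-Euler Toeplitz matrix
    e1 = np.zeros(nt)
    e1[0] = 1.0
    U = np.full(nt, u0, dtype=float)       # initial guess: constant in time
    for _ in range(maxit):
        F = B @ U - u0 * e1 - dt * f(U)    # all-at-once residual
        if np.max(np.abs(F)) < tol:
            break
        jbar = np.mean(fprime(U))          # averaged Jacobian entry
        J = B - dt * jbar * np.eye(nt)     # Kronecker-structured approximation
        U = U - np.linalg.solve(J, F)
    return U
```

In a PDE setting each update with this structured Jacobian would be computed by the three diagonalization steps rather than a dense solve.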
In practice, the ParaDIAG algorithms can be combined with a windowing technique: after a certain number of time steps computed in parallel in the current time window, the computation can be restarted for the next time window in a sequential way. This permits the use of a certain adaptivity in time and space.
2 ParaDIAG for Linear Advection-Diffusion Problems
To illustrate the ParaDIAG-I and ParaDIAG-II algorithms, we now use the concrete example of the advection-diffusion equation with periodic boundary conditions. (We use periodic boundary conditions to make the advection-dominated situation harder for PinT algorithms, see .)
where . Using the method of lines and a centered finite difference scheme for the spatial derivatives, we get the system of ODEs
where the matrix stems from the centered finite difference discretization, and the periodic boundary conditions cause a zero eigenvalue in the matrix.
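For concreteness, such a spatial matrix can be assembled as follows (a sketch: the semi-discrete system is assumed to be written as $u'(t)=-Au(t)$, with $A$ combining centered first- and second-order differences; the exact signs and scalings of (2.2a) may differ):

```python
import numpy as np

def advection_diffusion_matrix(nx, dx, nu, c):
    """Centered finite differences with periodic BCs for advection-diffusion,
    returning A = c*D1 - nu*D2 (sign convention assumed).  The periodic
    wrap-around makes every row sum to zero, so A has the eigenvalue 0
    with eigenvector (1, ..., 1)."""
    I = np.arange(nx)
    D1 = np.zeros((nx, nx))                      # centered first derivative
    D2 = np.zeros((nx, nx))                      # centered second derivative
    D1[I, (I + 1) % nx] = 1.0 / (2 * dx)
    D1[I, (I - 1) % nx] = -1.0 / (2 * dx)
    D2[I, (I + 1) % nx] = 1.0 / dx**2
    D2[I, (I - 1) % nx] = 1.0 / dx**2
    D2[I, I] = -2.0 / dx**2
    return c * D1 - nu * D2
```

Because each row sums to zero, the constant vector lies in the kernel, which is exactly the zero eigenvalue mentioned above.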
To use ParaDIAG as a direct solver, one has to use time steps that are all different to make the time-stepping matrix diagonalizable, and one possibility is to use geometrically increasing time step sizes to discretize (2.2a), as proposed in ,
where the ratio of consecutive step sizes is a free parameter and the first step size is given. We use the linear $\theta$-method,
We will only consider $\theta=1$ and $\theta=\frac{1}{2}$, which correspond to the Backward-Euler method and the Trapezoidal rule; for $\theta=\frac{1}{2}$, the method is also called the Crank-Nicolson scheme. The difference equations (2.4) can be combined into the all-at-once system
where is an identity matrix and the matrices representing the time-discretization are given by
The right-hand side is given by with .
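A single step of the linear $\theta$-method can be sketched as follows (assuming the convention $u'=-Au+g$; the exact arrangement of the right-hand side in (2.4)-(2.5) may differ):

```python
import numpy as np

def theta_step(A, u, g_old, g_new, dt, theta):
    """One step of the linear theta-method for u' = -A u + g:
    (I + dt*theta*A) u_new = (I - dt*(1-theta)*A) u
                             + dt*(theta*g_new + (1-theta)*g_old).
    theta = 1 is Backward-Euler, theta = 1/2 the Trapezoidal rule."""
    n = A.shape[0]
    rhs = (np.eye(n) - dt * (1 - theta) * A) @ u \
        + dt * (theta * g_new + (1 - theta) * g_old)
    return np.linalg.solve(np.eye(n) + dt * theta * A, rhs)
```

Stacking these steps for all time points, with the implicit part on the diagonal and the explicit part on the subdiagonal, produces exactly the bidiagonal Toeplitz time-discretization blocks of the all-at-once system.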
Let and . Then, we can rewrite (2.5a) as
Now using the typical ParaDIAG factorization
we can solve (2.6) by performing the three steps
where $V$ is the eigenvector matrix and $D$ the diagonal eigenvalue matrix of the time-stepping matrix. Since $V$ and $V^{-1}$ are given in closed form, we only have to do matrix-vector multiplications for Step-(a) and Step-(c), or one could use a fast Toeplitz solver based on Fourier techniques. For Step-(b), the linear systems can be solved simultaneously in parallel. There is, however, an important issue with this direct time-parallel solver ParaDIAG-I: if the time steps are very different, the truncation error of the time-stepping scheme becomes large, and if they are very close to each other, ParaDIAG-I suffers from roundoff error in the diagonalization used in Step-(a) and Step-(c). The best one can do is to balance these two errors, as the detailed analysis in [5, 7] shows, and this limits the applicability of ParaDIAG-I to short time intervals and few time steps: the roundoff error is proportional to the condition number of the eigenvector matrix, i.e.,
To illustrate the limitations of ParaDIAG-I, we provide the Matlab code ParaDIAG_V1_for_ADE to test it for the advection-diffusion equation. For given parameters, the constraint uniquely determines the value of the geometric time steps as
For the space discretization, we fix . To study the accuracy of ParaDIAG-I, we use a reference solution obtained from the Matlab ODE solver ode45 with very small absolute and relative tolerances, AbsTol= and RelTol=; see Figure 2.1 for an illustration.
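A compact NumPy analogue of such a direct ParaDIAG-I solve is sketched below (Backward-Euler only, with the diagonalization of the bidiagonal time-stepping matrix computed numerically by `numpy.linalg.eig` rather than in the closed form used by the Matlab code; all names are hypothetical):

```python
import numpy as np

def paradiag_i_solve(A, u0, g, dts):
    """Direct ParaDIAG-I solve of the Backward-Euler steps
    (u_j - u_{j-1})/dt_j + A u_j = g_j, written all-at-once as
    (B kron I + I kron A) U = b with B bidiagonal.  Distinct step sizes
    dts make B diagonalizable, B = S diag(D) S^{-1}."""
    dts = np.asarray(dts, dtype=float)
    nt, nx = len(dts), A.shape[0]
    B = np.diag(1.0 / dts) - np.diag(1.0 / dts[1:], k=-1)
    b = np.array(g, dtype=float)
    b[0] = b[0] + u0 / dts[0]              # initial value enters the rhs
    D, S = np.linalg.eig(B)                # eigenvalues are 1/dt_j (distinct)
    W = np.linalg.solve(S, b)              # Step (a): (S^{-1} kron I) b
    X = np.stack([np.linalg.solve(D[j] * np.eye(nx) + A, W[j])
                  for j in range(nt)])     # Step (b): independent solves
    return (S @ X).real                    # Step (c): (S kron I) X
```

With geometrically stretched steps the eigenvalues $1/\Delta t_j$ are distinct and the eigendecomposition exists; as the steps approach each other, the eigenvector matrix $S$ becomes ill-conditioned, which is precisely the roundoff issue discussed above.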
The results of ParaDIAG-I compared to a sequential time integration are shown in Figure 2.2.
We clearly see that using the geometric time steps (2.8) degrades the accuracy of the numerical solution, and when the time steps are too similar, the roundoff error problem sets in. This phenomenon was carefully studied in [5, 7], and the best possible geometrically stretched grid was determined, which leads to precise limits of time window length and number of time steps within which ParaDIAG-I can be reliably used.
Instead of using ParaDIAG as a direct solver with all different time steps to make the time stepping matrix diagonalizable, we can use it iteratively and solve a nearby problem in each iteration chosen such that the time stepping matrix of the nearby problem with uniform time step size can still be diagonalized. This idea leads to ParaDIAG algorithms in the ParaDIAG-II group. Among this group, we can use ParaDIAG within a stationary iteration or a Krylov subspace method. There are so far two very different ways to use ParaDIAG within a stationary iteration, proposed in  and . The use of ParaDIAG within a Krylov subspace method can be found in [13, 18].
2.2.1 ParaDIAG-II – Waveform Relaxation (WR) Variant
The ParaDIAG algorithm introduced in  is based on the Waveform Relaxation iteration
where is the iteration index and is a free parameter. Upon convergence, the tail term is canceled and thus the converged solution is the solution of (2.2a). Applying the linear -method with a uniform step size to (2.9) gives
where . We rewrite (2.10) as an all-at-once system,
where , and are given by
The matrices are so-called $\alpha$-circulant matrices and can be diagonalized as stated in Lemma 1, so we can again use the typical ParaDIAG factorization. Hence, similar to (2.7), we can solve (2.11a) by performing the three steps
where and . As in (2.7), Step-(a) and Step-(c) can be computed efficiently via FFT, and Step-(b) is again highly parallel. The eigenvector matrix satisfies
and thus the conditioning depends on the choice of $\alpha$. The convergence properties of this ParaDIAG-II algorithm are summarized in the following theorem.
Theorem 2.2 (see )
For the linear system of ODEs, suppose the stated spectral condition holds for every eigenvalue of the matrix. Let the $k$-th iterate of the ParaDIAG-II algorithm (2.10) be given, with $\alpha\in(0,1)$, and let the reference solution be obtained by directly applying the same time-integrator to the system of ODEs. Then the linear convergence estimate holds, with convergence factor $\alpha/(1-\alpha)$ measured uniformly over all time points.
This shows that the ParaDIAG-II algorithm (2.10) converges with a rate independent of the spectrum of the matrix and of the step size of the time-discretization. The convergence factor becomes smaller when $\alpha$ decreases, but the condition number of the eigenvector matrix (cf. (2.13)) implies that $\alpha$ cannot be arbitrarily small, because otherwise the roundoff error pollutes the accuracy. The best parameter is again the value balancing the roundoff error and the discretization error, as for the direct solver ParaDIAG-I; see  for more discussion. In practice, and are good choices.
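The roundoff side of this trade-off can be checked numerically: since the Fourier matrix is unitary, the spectral condition number of the eigenvector matrix $V=\Gamma_\alpha^{-1}\mathbb{F}^{*}$ equals that of $\Gamma_\alpha^{-1}$, namely $\alpha^{-(n_t-1)/n_t}\approx 1/\alpha$ (a small sketch under the notation of Lemma 1):

```python
import numpy as np

def cond_V(nt, alpha):
    """Spectral condition number of V = Gamma_alpha^{-1} F^*.
    Since F is unitary, this equals cond(Gamma_alpha^{-1}) = alpha^{-(nt-1)/nt},
    i.e. roughly 1/alpha, independent of nt."""
    gamma = alpha ** (np.arange(nt) / nt)        # diagonal of Gamma_alpha
    F = np.fft.fft(np.eye(nt)) / np.sqrt(nt)     # unitary DFT matrix
    V = F.conj().T / gamma[:, None]              # Gamma_alpha^{-1} F^*
    return np.linalg.cond(V)
```

So halving $\alpha$ roughly doubles the amplification of roundoff in Steps (a) and (c), which is why very small $\alpha$ is counterproductive.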
We provide a Matlab code, ParaDIAG_V2_WR_for_ADE, to test the ParaDIAG-II algorithm (2.10). In the code, we use the fft command and only the first columns of the circulant matrices, instead of forming the entire matrices. To implement Step-(a) in (2.12), we use the fft command as follows:
where b is the vector . Similarly, to implement Step-(c) we use the inverse FFT command ifft,
Here, Gam and invGam denote the diagonal matrices $\Gamma_\alpha$ and $\Gamma_\alpha^{-1}$ from Lemma 1. With an initial guess chosen randomly as random('unif',-20,20, , ), the first 2 iterates of this ParaDIAG-II algorithm are shown in Figure 2.3.
The maximum error at each iteration is shown in Figure 2.4.
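The iteration behind such error curves can be reproduced in a few lines of dense NumPy (a sketch, not the parallel implementation: Backward-Euler, zero source term, and the preconditioner applied by a direct solve instead of the three FFT steps; naming is hypothetical):

```python
import numpy as np

def paradiag2_stationary(A, u0, dt, nt, alpha, niter):
    """ParaDIAG-II as a plain stationary iteration for Backward-Euler applied
    to u' = -A u (zero source): solve K u = b, K = B1 kron I + I kron A, with
    the alpha-circulant preconditioner P, iterating u <- u + P^{-1}(b - K u).
    Dense direct solves stand in for the three FFT steps of (2.12)."""
    nx = A.shape[0]
    B1 = (np.eye(nt) - np.eye(nt, k=-1)) / dt    # Backward-Euler Toeplitz block
    C1 = B1.copy()
    C1[0, -1] = -alpha / dt                      # alpha-circulant modification
    K = np.kron(B1, np.eye(nx)) + np.kron(np.eye(nt), A)
    P = np.kron(C1, np.eye(nx)) + np.kron(np.eye(nt), A)
    b = np.zeros(nt * nx)
    b[:nx] = u0 / dt                             # initial value enters the rhs
    u = np.zeros(nt * nx)                        # zero initial guess
    for _ in range(niter):
        u = u + np.linalg.solve(P, b - K @ u)    # preconditioned correction
    return u.reshape(nt, nx)
```

With $\alpha=0.01$ the error drops by roughly two orders of magnitude per iteration, consistent with the convergence factor discussed above.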
We next present some parallel speedup results for the ParaDIAG-II algorithm (2.10) based on Waveform Relaxation for a time-dependent advection-diffusion problem with periodic boundary conditions in 2D,
where and . The results were obtained on the China Tianhe-1 supercomputer, which is a multi-array, configurable and cooperative parallel system with a theoretical peak performance of 1.372 petaflops, composed of high performance general-purpose microprocessors and a high-speed Infiniband network. We used the parallel Fortran library MUMPS (MUltifrontal Massively Parallel sparse direct Solver [2, 1]), version 4.10.0, to solve the linear systems in Step-(b) of (2.12). For Step-(a) and Step-(c), the fft and ifft commands are dissected into complex arithmetic operations. For the particular case when the source term is zero, as shown in (2.11b), Step-(a) can be implemented in an economical way: only the first column of the matrix is needed.
We provide the parallel codes in Fortran in the archive 'Parallel Codes.zip'. For the reader's convenience, the archive contains a README file that briefly explains how to use the codes. Note that the Fortran library MUMPS was modified (to serialize it in a distributed memory environment) because the relevant spatial solves are currently executed only in serial. In addition, explanatory comments have been added to the Fortran functions and subroutines. All Fortran codes are compiled with mpich-3.1.3 using the icc compiler version 11.1.059 and the -O2 optimization level.
We denote by ParaDIAG-II (B-E) the algorithm (2.10) using Backward-Euler, and by ParaDIAG-II (TR) the one using the Trapezoidal rule, and set . For comparison, we also apply the parareal algorithm and MGRiT to (2.14). The parareal algorithm is implemented using the two-level XBraid solver with F-relaxation, and MGRiT is the multilevel XBraid solver with FCF-relaxation (i.e., an initial F-relaxation followed by a C-relaxation and then a second F-relaxation). Furthermore, we skip the unnecessary work during the first XBraid down cycle for both the parareal and MGRiT algorithms, and fix the coarsening factor to 8. As shown in Table 2.1,
B-E: ParaDIAG-II (B-E), TR: ParaDIAG-II (TR), PR: parareal, MG: MGRiT
ParaDIAG-II (B-E), ParaDIAG-II (TR), parareal and MGRiT converge robustly with respect to the number of processors. ParaDIAG-II (B-E) and ParaDIAG-II (TR) show parameter-robust convergence, while for parareal and MGRiT the required iteration counts increase dramatically as the parameter changes from 1 to . The tolerance tol for all experiments here is set to . In Figure 2.5
we compare the measured CPU times for these PinT algorithms. Clearly, ParaDIAG-II (B-E) and ParaDIAG-II (TR) are two optimally scaling PinT algorithms, while for MGRiT and parareal the scaling is somewhat worse (because of the sequential coarse-grid correction, as we will see in the next subsection). The corresponding data is given in Table 2.2.
(Table 2.2: CPU times for parareal, MGRiT, ParaDIAG-II (B-E) and ParaDIAG-II (TR).)
Regarding the parallel efficiency (based on the wall-clock time as a function of the number of processors), the average parallel efficiency for ParaDIAG-II (B-E) is 92.06%, for ParaDIAG-II (TR) it is 90.70%, while it is only 22.82% for MGRiT and 8.75% for parareal.
2.2.2 ParaDIAG-II – Parareal Variant
The second way to use ParaDIAG within a stationary iteration is based on formulating the coarse-grid correction (CGC) procedure of the parareal algorithm [10, 9] as an all-at-once system and applying ParaDIAG to it. The parareal algorithm is an iterative PinT algorithm based on the updating formula
where and are called the coarse and fine propagators, specified by two time-integrators. The quantity denotes a value calculated by successively applying steps of the fine propagator to the differential equations with the given initial value and the fine step size. The integer is called the coarsening ratio. Let
Then, the parareal algorithm can be written in matrix form. This is the so-called CGC, which is a sequential procedure and often the bottleneck for the parallel efficiency. In , the author proposed an idea to parallelize the CGC: supposing we have to solve an initial-value problem
we apply the coarse propagator to a slightly wrong problem, namely
where is a free parameter. We use the linear case to illustrate the details of ParaDIAG-II based on the parareal algorithm (for the nonlinear case, see ). For simplicity, we also use Backward-Euler for . Let . The quantity computed from the previous iteration is
Note that all the quantities and can be computed simultaneously in parallel. Hence, . The parareal algorithm (2.15) can be rewritten as
where , which can be represented as
This problem is now precisely of the form (2.10) for $\theta=1$, i.e., the Backward-Euler method, and the solution can be obtained using ParaDIAG-II (cf. (2.12)). The convergence rate of this ParaDIAG-II parareal variant is summarized in the following theorem.
Theorem 2.3 (see )
Let be the convergence factor of the parareal algorithm with sequential-in-time CGC (i.e., the classical parareal algorithm) and be the convergence factor with parallel-in-time CGC. Then, there exists some threshold of the parameter , such that
In particular, for linear systems of ODEs with