1. Introduction
The family of iterative solvers known as Krylov subspace methods (KSMs) (Greenbaum, 1997; Liesen and Strakoš, 2012; Saad, 2003; Van der Vorst, 2003) is among the most efficient present-day methods for solving large-scale sparse systems of linear equations. The Conjugate Gradient method (CG) can be considered the mother of all Krylov subspace methods. It was derived in 1952 (Hestenes and Stiefel, 1952) with the aim of solving linear systems $Ax = b$ with a sparse symmetric positive definite (SPD) matrix $A$. The CG method remains one of the most widely used methods for solving these systems today, which in turn form the basis of a plethora of scientific and industrial applications. However, driven by the essential transition from optimal time-to-solution on a single node towards massively parallel computing over the last decades (Shalf, 2007), the bottleneck for fast execution of Krylov subspace methods has shifted. Whereas in the past the application of the sparse matrix-vector product (spmv) was considered the most time-consuming part of the algorithm, the global synchronizations required in dot product and norm computations form the main bottleneck for efficient execution on present-day distributed memory hardware (Dongarra et al., 2011, 2015).
Driven by the increasing levels of parallelism in present-day HPC machines, research on the elimination of the global communication bottleneck has recently regained significant attention from the international computer science, engineering and numerical mathematics communities. Evolving from pioneering work on reducing communication in Krylov subspace methods from the late 1980s and '90s (Strakoš, 1987; Meurant, 1987; Chronopoulos and Gear, 1989; D'Azevedo et al., 1992; Demmel et al., 1993; Erhel, 1995), a number of variants of the classic Krylov subspace algorithms have been introduced in recent years. We specifically point out recent work by Chronopoulos et al. (Chronopoulos and Kucherov, 2010), Hoemmen (Hoemmen, 2010), Carson et al. (Carson et al., 2013; Carson and Demmel, 2014), McInnes et al. (McInnes et al., 2014), Grigori et al. (Grigori et al., 2016), Eller et al. (Eller and Gropp, 2016), Imberti et al. (Imberti and Erhel, 2017) and Zhuang et al. (Zhuang and Casas, 2017).
The research presented in this paper is situated in the research branch on so-called "communication hiding" or "pipelined" Krylov subspace methods (Ghysels et al., 2013; Ghysels and Vanroose, 2014; Cools and Vanroose, 2017). (Note: in the context of communication reduction in Krylov subspace methods, the terminology "pipelined" KSMs that is used throughout the related applied linear algebra and computer science literature refers to software pipelining, i.e. algorithmic reformulations of the KSM procedure in order to reduce communication overhead, and should not be confused with hardware-level instruction pipelining.) These algorithmic variations to classic Krylov subspace methods are designed to overlap the time-consuming global communications in each iteration of the algorithm with computational tasks such as calculating spmvs or axpys (vector operations of the form $y \leftarrow \alpha x + y$) as well as with global communication phases initiated in later iterations. Thanks to the reduction/elimination of the synchronization bottleneck, pipelined algorithms have been shown to increase parallel efficiency by allowing the algorithm to continue scaling on large numbers of processors (Sanan et al., 2016; Yamazaki et al., 2017). However, when redesigning these algorithms one should be wary not to deteriorate the mathematical properties (numerical stability, attainable accuracy, etc.) that are well established for the original Krylov algorithms. Research on analyzing and improving the numerical stability of pipelined Krylov subspace methods has recently been performed in (Carson et al., 2018; Cools et al., 2018; Cools, 2018; Cools et al., 2019); we point out the references therein for more information on the numerical properties of Krylov subspace methods.
Strong scaling results obtained from an MPI-based implementation of the length-$l$ pipelined Conjugate Gradient method, p($l$)CG for short, are presented in this work. The p($l$)CG method was recently presented in (Cornelis et al., 2017; Cools et al., 2019) and allows each global reduction phase to be overlapped with the communication and computational work of $l$ subsequent iterations. This overlap is achieved by exploiting the asynchronous non-blocking global "Iallreduce" MPI operations that have been available from the MPI-3 standard onward. The pipeline length $l$ is a parameter of the method that can be chosen depending on the problem and hardware setup (as a function of the communication-to-computation ratio). As is the case for all communication reducing Krylov subspace methods, a number of problem- and hardware-dependent factors influence the communication-to-computation ratio and thus directly affect the performance of the method. The choice of the pipeline length is subject to the speed of the communication network, the time consumed by computing the matrix-vector product and the effort one is willing to invest in a preconditioner. The argument for a longer pipeline is stronger for preconditioners that use limited communication (e.g. block Jacobi, no-overlap DDM, ...).
This work aims to provide the reader with insights on the implementation and use of the pipelined Conjugate Gradient method as a promising contemporary application of MPI. The focus of the paper is therefore on providing a brief overview of the properties of the p($l$)CG method which are required for a practical MPI-based implementation.
We conclude this introduction by presenting a short overview of the contents of this paper. Section 2 familiarizes the reader with the fundamentals of communication hiding Krylov subspace methods. It presents a high-level summary of the mathematical principles behind the length-$l$ pipelined CG method and formulates technical comments on the algorithm's pseudo-code, which is included in this section. Section 3 provides the practical framework for an MPI-based implementation of the method described in Section 2. It includes discussions on the ability to overlap the main algorithmic kernels in the algorithm using asynchronous non-blocking MPI_Iallreduce calls, the computational cost of these kernels, and an overview of the algorithm's storage requirements. Section 4 presents various strong scaling results that were obtained using an MPI-based implementation of the pipelined CG algorithm in the PETSc library (Balay et al., 2015). The experiments were executed on up to 1024 nodes of the US Department of Energy NERSC facility "Cori". The experimental results are supplemented by an elaborate discussion that aims to provide the reader with detailed insights on the performance gains reported in the strong scaling experiments. The manuscript is concluded by a short outlook in Section 5.
2. The pipelined CG algorithm
The deep pipelined Conjugate Gradient method, denoted p($l$)CG for short, was first presented in (Cornelis et al., 2017), where it was derived in analogy to the p($l$)GMRES method (Ghysels et al., 2013). The parameter $l$ represents the pipeline length, which indicates the number of iterations that are overlapped by each global reduction phase. We summarize the current state-of-the-art deep pipelined p($l$)CG method below.
2.1. Mathematical framework for p($l$)CG
Let $V_{i-l+1} = [v_0, v_1, \dots, v_{i-l}]$ be the orthonormal basis for the Krylov subspace $\mathcal{K}_{i-l+1}(A, v_0)$ in iteration $i$ of the p($l$)CG algorithm, consisting of $i-l+1$ vectors. Here $A$ is assumed to be a symmetric positive definite matrix. The Krylov subspace basis vectors satisfy the Lanczos relation
(1)  $A V_{i-l} = V_{i-l+1}\, T_{i-l+1,\, i-l},$
with
(2)  $T_{i-l+1,\, i-l} = \begin{pmatrix} \gamma_0 & \delta_0 & & \\ \delta_0 & \gamma_1 & \ddots & \\ & \ddots & \ddots & \delta_{i-l-2} \\ & & \delta_{i-l-2} & \gamma_{i-l-1} \\ & & & \delta_{i-l-1} \end{pmatrix}.$
Let $v_j := 0$ for $j < 0$; then the Lanczos relation (1) translates in vector notation to
(3)  $A v_j = \delta_{j-1}\, v_{j-1} + \gamma_j\, v_j + \delta_j\, v_{j+1}.$
The auxiliary basis $Z_{i+1} = [z_0, z_1, \dots, z_i]$ runs $l$ vectors ahead of the basis $V_{i-l+1}$ and is defined as
(4)  $z_j = \begin{cases} P_j(A)\, v_0, & 0 \le j < l, \\ P_l(A)\, v_{j-l}, & j \ge l, \end{cases}$
where the matrix polynomial $P_l(A)$ is given by
(5)  $P_l(A) = \prod_{k=0}^{l-1} \left( A - \sigma_k I \right),$
with optional stabilizing shifts $\sigma_0, \dots, \sigma_{l-1}$, see (Ghysels et al., 2013; Cornelis et al., 2017; Hoemmen, 2010). The choice of the polynomial is discussed in Section 2.2. Contrary to the basis $V_{i-l+1}$, the auxiliary basis $Z_{i+1}$ is not orthonormal. It is constructed using the recursive definition
(6)  $z_{j+1} = \left( A z_j - \gamma_{j-l}\, z_j - \delta_{j-l-1}\, z_{j-1} \right) / \delta_{j-l}, \qquad j \ge l,$
which is obtained by left-multiplying the Lanczos relation (3) on both sides by $P_l(A)$. Expression (6) translates into a Lanczos type matrix relation
(7)  $A Z_j = Z_{j+1}\, B_{j+1,j},$
where the matrix $B_{j+1,j}$ contains the matrix $T_{j+1-l,\,j-l}$, which is shifted $l$ places along the main diagonal, preceded by the shifts $\sigma_0, \dots, \sigma_{l-1}$ of the polynomial $P_l(A)$ in its leading $l$ columns, i.e.
(8)  $B_{j+1,j} = \begin{pmatrix} \sigma_0 & & & & & \\ 1 & \ddots & & & & \\ & \ddots & \sigma_{l-1} & & & \\ & & 1 & \gamma_0 & \delta_0 & \\ & & & \delta_0 & \gamma_1 & \ddots \\ & & & & \ddots & \ddots \end{pmatrix}.$
The bases $V$ and $Z$ are connected through the basis transformation $Z_j = V_j\, G_j$ for $0 \le j \le i-l+1$, where $G_j$ is a banded upper triangular matrix with a band width of $2l+1$ nonzero diagonals (Cornelis et al., 2017). For a symmetric matrix $A$ the matrix $G_j$ is symmetric around its $l$-th upper diagonal, since
(9)  $g_{j,i} = (v_j, z_i) = (v_j, P_l(A)\, v_{i-l}) = (P_l(A)\, v_j, v_{i-l}) = (z_{j+l}, v_{i-l}) = g_{i-l,\, j+l}.$
The following recurrence relation for $v_j$ is derived from the basis transformation (with $g_{k,j} := 0$ for $k < 0$):
(10)  $v_j = \Bigl( z_j - \sum_{k=j-2l}^{j-1} g_{k,j}\, v_k \Bigr) \big/\, g_{j,j}.$
A total of $l$ iterations after the dot-products
(11)  $(z_{i+1}, z_j), \qquad j = i-2l+1, \dots, i+1,$
have been initiated, the elements $g_{j,i+1}$ with $j = i-l+1, \dots, i+1$, which were computed as $(z_j, z_{i+1})$, are corrected as follows:
(12)  $g_{j,i+1} = \Bigl( (z_j, z_{i+1}) - \sum_{k=j-2l}^{j-1} g_{k,j}\, g_{k,i+1} \Bigr) \big/\, g_{j,j},$
for $j = i-l+1, \dots, i$, and
(13)  $g_{i+1,i+1} = \sqrt{ (z_{i+1}, z_{i+1}) - \textstyle\sum_{k=i-2l+1}^{i} g_{k,i+1}^2 }.$
The tridiagonal matrix $T$, see (2), is updated recursively by adding one column in each iteration $i$. The diagonal element $\gamma_j$ is characterized by the expressions:
(14)  $\gamma_j = \begin{cases} \left( g_{j,j}\, \sigma_j + g_{j,j+1} - g_{j-1,j}\, \delta_{j-1} \right) / g_{j,j}, & j < l, \\ \left( g_{j,j}\, \gamma_{j-l} + g_{j,j+1}\, \delta_{j-l} - g_{j-1,j}\, \delta_{j-1} \right) / g_{j,j}, & j \ge l. \end{cases}$
The term $g_{j-1,j}\, \delta_{j-1}$ is considered zero when $j = 0$. The update for the off-diagonal element $\delta_j$ is given by
(15)  $\delta_j = \begin{cases} g_{j+1,j+1} / g_{j,j}, & j < l, \\ g_{j+1,j+1}\, \delta_{j-l} / g_{j,j}, & j \ge l. \end{cases}$
The element $\delta_{j-1}$ has already been computed in the previous iteration and can thus simply be copied due to the symmetry of $T$.
Once the basis $V_{j+1}$ has been constructed, the solution $x_j$ can be updated based on a search direction $p_j$, following the classic derivation of D-Lanczos in (Saad, 2003), Sec. 6.7.1. The Ritz-Galerkin condition (which holds for $j \le i-l$)
(16)  $V_j^T r_j = V_j^T \left( b - A x_j \right) = 0$
implies $x_j = x_0 + V_j\, T_j^{-1}\, \|r_0\|\, e_1$. The LU factorization of the tridiagonal matrix $T_j$ is given by
(17)  $T_j = L_j U_j = \begin{pmatrix} 1 & & & \\ \lambda_1 & 1 & & \\ & \ddots & \ddots & \\ & & \lambda_{j-1} & 1 \end{pmatrix} \begin{pmatrix} \eta_0 & \delta_0 & & \\ & \eta_1 & \ddots & \\ & & \ddots & \delta_{j-2} \\ & & & \eta_{j-1} \end{pmatrix}.$
Note that $\eta_0 = \gamma_0$. It follows from (17) that the elements of the lower/upper triangular matrices $L_j$ and $U_j$ are given by (with $j \ge 1$)
(18)  $\lambda_j = \frac{\delta_{j-1}}{\eta_{j-1}}, \qquad \eta_j = \gamma_j - \lambda_j\, \delta_{j-1}.$
Expression (16) indicates that the approximate solution equals
(19)  $x_j = x_0 + P_j\, q_j,$
where $P_j := V_j U_j^{-1}$ and $q_j := L_j^{-1}\, \|r_0\|\, e_1$. The columns $p_k$ of $P_j$ (for $0 \le k \le j-1$) can be computed recursively. From $P_j U_j = V_j$ it follows
(20)  $p_k = \bigl( v_k - \delta_{k-1}\, p_{k-1} \bigr) / \eta_k.$
Denoting the vector $q_j$ by $[\zeta_0, \zeta_1, \dots, \zeta_{j-1}]^T$, it follows from
(21)  $L_j\, q_j = \|r_0\|\, e_1$
that $\zeta_0 = \|r_0\|$ and $\zeta_k = -\lambda_k\, \zeta_{k-1}$ for $k \ge 1$. The approximate solution is then updated using the recurrence relation
(22)  $x_{k+1} = x_k + \zeta_k\, p_k.$
The above expressions are combined in Alg. 1. Once the initial pipeline has been filled, the relations (6)-(10) can be used to recursively compute the basis vectors $v_j$ and $z_j$ in each subsequent iteration (see lines 19-21). The scalar results of the global reduction phase (line 23) are required $l$ iterations later (lines 9-10). In every iteration global communication is thus overlapped with the computational work and the communications of $l$ subsequent iterations. This "communication hiding" forms the core property of the p($l$)CG algorithm. Note that Alg. 1 presents a complete version of the algorithm including preconditioning, stable recurrences, stopping criteria, etc. These topics are covered in more detail in Section 2.2 below.
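As an illustration of the D-Lanczos style solution update described by (16)-(22), the following pure-Python sketch builds the Lanczos basis and updates the iterate via the LU-based recurrences. This is a textbook toy example under the definitions above (the helper names `matvec`, `dot` and `dlanczos` are our own), not the pipelined Alg. 1 or the PETSc implementation:

```python
import math

def matvec(A, v):
    return [sum(a * x for a, x in zip(row, v)) for row in A]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def dlanczos(A, b, m):
    """Solve A x = b (A SPD) with m D-Lanczos iterations, starting from x0 = 0."""
    n = len(b)
    x = [0.0] * n
    zeta = math.sqrt(dot(b, b))            # zeta_0 = ||r_0|| (x0 = 0, so r0 = b)
    v = [bi / zeta for bi in b]            # v_0 = r_0 / ||r_0||
    v_prev = [0.0] * n
    p_prev = [0.0] * n
    delta_prev = 0.0                       # delta_{-1} := 0
    eta = 0.0
    for j in range(m):
        w = matvec(A, v)
        gamma = dot(w, v)                  # diagonal entry gamma_j of T
        if j == 0:
            eta = gamma                    # eta_0 = gamma_0
        else:
            lam = delta_prev / eta         # lambda_j = delta_{j-1} / eta_{j-1}
            eta = gamma - lam * delta_prev # eta_j = gamma_j - lambda_j delta_{j-1}
            zeta = -lam * zeta             # zeta_j = -lambda_j zeta_{j-1}
        # search direction p_j = (v_j - delta_{j-1} p_{j-1}) / eta_j
        p = [(vi - delta_prev * pi) / eta for vi, pi in zip(v, p_prev)]
        # iterate update x_{j+1} = x_j + zeta_j p_j
        x = [xi + zeta * pi for xi, pi in zip(x, p)]
        # three-term Lanczos recurrence for the next basis vector
        w = [wi - gamma * vi - delta_prev * vpi for wi, vi, vpi in zip(w, v, v_prev)]
        delta = math.sqrt(dot(w, w))       # off-diagonal entry delta_j
        if delta < 1e-14:                  # invariant subspace reached: converged
            break
        v_prev, v = v, [wi / delta for wi in w]
        p_prev, delta_prev = p, delta
    return x
```

On a small SPD system the recurrences reproduce the solution up to rounding after $n$ iterations, confirming that the coupled updates (18), (20) and (22) are equivalent to solving the projected tridiagonal system directly.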
2.2. Technical comments on the algorithm
To complement the mathematical framework, a number of insightful technicalities of the algorithm are discussed below.
Preconditioned p($l$)CG
Alg. 1 is written in a general form that also includes preconditioning. This requires computing the unpreconditioned auxiliary variables $u_j$ (with $z_j = M^{-1} u_j$, where $M$ denotes the preconditioner) in addition to the quantities described in Section 2.1. These variables also satisfy a Lanczos type relation:
(23)  $u_{j+1} = \bigl( A z_j - \gamma_{j-l}\, u_j - \delta_{j-l-1}\, u_{j-1} \bigr) / \delta_{j-l}.$
The corresponding dot-products on line 23 are defined as $M^{-1}$-inner products in the context of the preconditioned algorithm, see (Cools et al., 2019) for details.
Residual norm in p($l$)CG
The residual $r_j = b - A x_j$ is not computed in Alg. 1, but its norm is characterized by the quantity $|\zeta_j|$ (or its natural norm equivalent in the preconditioned case) for $j \le i-l$. This quantity is useful to formulate a stopping criterion for the p($l$)CG iteration without introducing additional communication or computations, see Alg. 1, line 32.
Number of dot products in p($l$)CG
Although Alg. 1, line 23, indicates that in each iteration a total of $2l+1$ dot products need to be computed, the number of dot product computations can be limited to $l+2$ by exploiting the symmetry of the matrix $G$, see expression (9). Since $g_{j,i+1} = g_{i+1-l,\, j+l}$, the elements below the $l$-th upper diagonal follow from previously computed entries; only the dot products $(z_{i+1}, z_j)$ for $j = i-l+1, \dots, i+1$ and the $l$-th upper diagonal element need to be computed.
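The symmetry property (9) that underlies this saving can be verified numerically with a small pure-Python sketch (an illustrative toy with pipeline length $l = 1$ and a single arbitrarily chosen shift $\sigma_0 = 0.5$; the helper names are our own, not part of Alg. 1):

```python
import math

def matvec(A, v):
    return [sum(a * x for a, x in zip(row, v)) for row in A]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def lanczos_basis(A, v0, m):
    """Return m orthonormal Lanczos vectors for the symmetric matrix A."""
    beta = math.sqrt(dot(v0, v0))
    V = [[x / beta for x in v0]]
    delta_prev = 0.0
    for j in range(m - 1):
        w = matvec(A, V[j])
        gamma = dot(w, V[j])
        w = [wi - gamma * vi for wi, vi in zip(w, V[j])]
        if j > 0:
            w = [wi - delta_prev * vpi for wi, vpi in zip(w, V[j - 1])]
        delta = math.sqrt(dot(w, w))
        V.append([wi / delta for wi in w])
        delta_prev = delta
    return V

# 5x5 symmetric tridiagonal test matrix (2 on the diagonal, -1 off-diagonal)
n = 5
A = [[2.0 if r == c else (-1.0 if abs(r - c) == 1 else 0.0) for c in range(n)]
     for r in range(n)]
V = lanczos_basis(A, [1.0, 0.0, 0.0, 0.0, 0.0], n)

sigma = 0.5                                   # single shift, pipeline length l = 1
Z = [None] + [[ai - sigma * vi for ai, vi in zip(matvec(A, V[j - 1]), V[j - 1])]
              for j in range(1, n)]           # z_j = (A - sigma I) v_{j-1}

# verify g_{j,i} = (v_j, z_i) equals (v_{i-1}, z_{j+1}), i.e. symmetry of G
# around its l-th upper diagonal (here l = 1), for all valid index pairs
max_dev = max(abs(dot(V[j], Z[i]) - dot(V[i - 1], Z[j + 1]))
              for i in range(1, n) for j in range(0, n - 1))
```

Because $A$ is symmetric and $z_j$ is a polynomial in $A$ applied to $v_{j-1}$, the deviation `max_dev` is at the level of machine precision.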
Choice of auxiliary basis shifts
The auxiliary basis vectors are defined as $z_j = P_l(A)\, v_{j-l}$, see (4), but the basis $Z_{i+1}$ is in general not orthogonal. For longer pipeline lengths $l$, $Z_{i+1}$ may be very ill-conditioned and may become close to singular as $i$ increases. The effect is most pronounced when all shifts are zero, in which case $P_l(A) = A^l$. The shifts $\sigma_0, \dots, \sigma_{l-1}$ can be set to improve the conditioning of $Z_{i+1}$. It holds that
(24)  $\hat{Z} = P_l(A)\, V_{i-l+1},$
where $\hat{Z}$ is the part of the basis $Z_{i+1}$ obtained by dropping the first $l$ columns. Hence, the polynomial $P_l(A)$ has a major impact on the conditioning of the matrix $Z_{i+1}$, which in turn plays a crucial role in the propagation of local rounding errors in the p($l$)CG algorithm (Cornelis et al., 2017). This observation motivates minimization of $\|P_l(A)\|$. Optimal shift values for minimizing the Euclidean norm of $P_l(A)$ are the Chebyshev shifts (Hoemmen, 2010; Ghysels et al., 2013; Cornelis et al., 2017) (for $i = 0, \dots, l-1$):
(25)  $\sigma_i = \frac{\lambda_{\min} + \lambda_{\max}}{2} + \frac{\lambda_{\max} - \lambda_{\min}}{2} \cos\!\left( \frac{(2i+1)\,\pi}{2l} \right),$
which are used throughout this work. This requires a notion of the largest and smallest eigenvalues $\lambda_{\max}$ and $\lambda_{\min}$ (resp.), which can be estimated a priori, e.g. by a few power method iterations.
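A minimal sketch of computing these Chebyshev shifts, and of their effect on the maximum of $|P_l(t)|$ over the spectral interval, could look as follows (illustrative helper names of our own; the sampled comparison is a demonstration, not part of Alg. 1):

```python
import math

def chebyshev_shifts(lmin, lmax, l):
    """Chebyshev shifts (25) on the interval [lmin, lmax] for pipeline length l."""
    c = 0.5 * (lmin + lmax)   # center of the spectral interval
    h = 0.5 * (lmax - lmin)   # half-width of the spectral interval
    return [c + h * math.cos((2 * i + 1) * math.pi / (2 * l)) for i in range(l)]

def poly_abs_max(shifts, lmin, lmax, samples=1000):
    """Sampled maximum of |prod_i (t - sigma_i)| over [lmin, lmax]."""
    best = 0.0
    for k in range(samples + 1):
        t = lmin + (lmax - lmin) * k / samples
        p = 1.0
        for s in shifts:
            p *= (t - s)
        best = max(best, abs(p))
    return best
```

For example, on the interval $[0.01, 2.0]$ with $l = 4$, the Chebyshev shifts keep the polynomial uniformly small over the spectrum, whereas the zero-shift choice $P_l(t) = t^l$ grows to $2^l$ at the right end of the interval, illustrating why zero shifts degrade the conditioning of the auxiliary basis.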
Square root breakdowns in p($l$)CG
When for certain $i$ the matrix $Z_{i+1}^T Z_{i+1}$ becomes (numerically) singular, the Cholesky factorization procedure in p($l$)CG will fail. This may manifest in the form of a square root breakdown on line 10 in Alg. 1 when the root argument becomes negative. Numerical round-off errors may increase the occurrence of these breakdowns in practice, cf. (Cornelis et al., 2017; Cools et al., 2019). When a breakdown occurs the p($l$)CG iteration is restarted explicitly. This restarting may delay convergence compared to standard CG, where no square root breakdowns occur.
Numerically stable recurrences
Although expression (10) provides a recursive definition for the basis $V$, it was shown in (Cornelis et al., 2017) that this relation may lead to numerically unstable convergence behavior. We therefore introduce a total of $l+1$ bases, denoted by $Z^{(k)}$, where the upper index '$k$' ($0 \le k \le l$) labels the different bases and the lower index '$j$' indicates the iteration. The auxiliary bases are defined as:
(26)  $z_j^{(k)} = \begin{cases} P_j(A)\, v_0, & 0 \le j < k, \\ P_k(A)\, v_{j-k}, & j \ge k, \end{cases}$
where the polynomial $P_k(A)$ is defined by (5). By definition (26) the basis $Z^{(0)}$ denotes the original Krylov subspace basis
(27)  $Z^{(0)} = V, \qquad z_j^{(0)} = v_j,$
whereas the $l$-th basis is precisely the auxiliary basis that was defined earlier, see (4), i.e.
(28)  $Z^{(l)} = Z, \qquad z_j^{(l)} = z_j.$
Intuitively, the $k$-th basis $Z^{(k)}$ should be interpreted as running $k$ spmvs ahead of the original Krylov basis $V$. Note that the first vectors in the basis $V$ and all bases $Z^{(k)}$ are identical, which is represented by line 6 in Alg. 1. For each pair of subsequent bases $Z^{(k)}$ and $Z^{(k+1)}$ (for $0 \le k < l$) it holds that
(29)  $z_j^{(k+1)} = \left( A - \sigma_k I \right) z_{j-1}^{(k)}.$
By multiplying the original Lanczos relation (3) on both sides by the respective polynomial $P_k(A)$ with $0 \le k \le l$ and by exploiting the associativity of $A$ and $P_k(A)$, it is straightforward to see that each auxiliary basis satisfies a Lanczos type recurrence relation:
(30)  $z_{j+1}^{(k)} = \left( A z_j^{(k)} - \gamma_{j-k}\, z_j^{(k)} - \delta_{j-k-1}\, z_{j-1}^{(k)} \right) / \delta_{j-k},$
for $k = 0, \dots, l$ and $j \ge k$. When $k = 0$ expression (30) yields the Lanczos relation (3) for $V$, whereas setting $k = l$ results in the recurrence relation (6) for $Z$. To avoid the computation of additional spmvs, the relations (30) can be rewritten using expression (29) as:
(31)  $z_{j+1}^{(k)} = \left( z_{j+1}^{(k+1)} + \sigma_k\, z_j^{(k)} - \gamma_{j-k}\, z_j^{(k)} - \delta_{j-k-1}\, z_{j-1}^{(k)} \right) / \delta_{j-k},$
with $0 \le k < l$ and $j \ge k$. The recurrence relations (31) allow computing the vector updates for the bases $Z^{(0)}, \dots, Z^{(l-1)}$ without the need to compute spmvs. The recursive update (30) is used only for $k = l$, i.e. to compute the vectors in the auxiliary basis $Z = Z^{(l)}$ we use the recurrence relation
(32)  $z_{j+1}^{(l)} = \left( A z_j^{(l)} - \gamma_{j-l}\, z_j^{(l)} - \delta_{j-l-1}\, z_{j-1}^{(l)} \right) / \delta_{j-l},$
for $j \ge l$, which reduces to the recurrence relation (6).
The recursive relations derived above are summarized on lines 19-21 in Alg. 1. In the $i$-th iteration of Alg. 1 each basis is updated by adding one vector; the algorithm thus computes a total of $l+1$ new basis vectors per iteration, one per basis. All vector updates make use of the same scalar coefficients $\gamma$ and $\delta$ (the entries of $T$) that are computed in iteration $i$ of Alg. 1 (lines 12-18). Only one spmv, namely $A z_i^{(l)}$, is required per iteration to compute all basis vector updates.
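The consistency of the recurrence (29) with the definitions (26) can be checked on a toy example (pure Python with hypothetical helper names; a small symmetric tridiagonal matrix and arbitrarily chosen shifts $\sigma_0 = 0.3$, $\sigma_1 = 0.7$ are used as assumptions):

```python
import math

def matvec(A, v):
    return [sum(a * x for a, x in zip(row, v)) for row in A]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def lanczos_basis(A, v0, m):
    """Return m orthonormal Lanczos vectors for the symmetric matrix A."""
    beta = math.sqrt(dot(v0, v0))
    V = [[x / beta for x in v0]]
    delta_prev = 0.0
    for j in range(m - 1):
        w = matvec(A, V[j])
        gamma = dot(w, V[j])
        w = [wi - gamma * vi for wi, vi in zip(w, V[j])]
        if j > 0:
            w = [wi - delta_prev * vpi for wi, vpi in zip(w, V[j - 1])]
        delta = math.sqrt(dot(w, w))
        V.append([wi / delta for wi in w])
        delta_prev = delta
    return V

def apply_poly(A, shifts, v):
    """Apply P_k(A) = prod_m (A - sigma_m I) to the vector v."""
    for s in shifts:
        Av = matvec(A, v)
        v = [ai - s * vi for ai, vi in zip(Av, v)]
    return v

n = 6
A = [[2.0 if r == c else (-1.0 if abs(r - c) == 1 else 0.0) for c in range(n)]
     for r in range(n)]
V = lanczos_basis(A, [1.0] + [0.0] * (n - 1), n)
shifts = [0.3, 0.7]                      # pipeline length l = 2, shifts sigma_0, sigma_1

def z(k, j):                             # z^{(k)}_j = P_k(A) v_{j-k}, for j >= k
    return apply_poly(A, shifts[:k], V[j - k])

# check z^{(k+1)}_j == (A - sigma_k I) z^{(k)}_{j-1} for all valid k, j
max_dev = 0.0
for k in range(len(shifts)):
    for j in range(k + 1, n):
        zk = z(k, j - 1)
        rhs = [ai - shifts[k] * zi for ai, zi in zip(matvec(A, zk), zk)]
        max_dev = max(max_dev, max(abs(a - b) for a, b in zip(z(k + 1, j), rhs)))
```

Since the polynomial factors commute, both sides perform the same sequence of operations and the deviation vanishes, mirroring how Alg. 1 advances all $l+1$ bases with a single spmv per iteration.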
3. MPI-based implementation
Section 2 gave an overview of the mathematical properties of the p($l$)CG method. In this section we comment on technical aspects concerning the implementation of Alg. 1.
3.1. Hiding global communication
A schematic kernel-based representation of the p($l$)CG algorithm is introduced in this section. The following computational kernels are defined in iteration $i$ of Alg. 1:
tag  | type   | kernel description        | Alg. 1 line
-----|--------|---------------------------|------------
(K1) | spmv   | apply $A$ (and $M^{-1}$)  | 3-4
(K2) | scalar | update matrix $G$         | 9-10
(K3) | scalar | update matrix $T$         | 12-18
(K4) | axpy   | recursive basis updates   | 19-21
(K5) | dotpr  | compute dot products      | 23
(K6) | axpy   | update $x$ and $p$        | 24-33
The spmv kernel (K1) is considered the most computationally intensive part of the algorithm, and it is overlapped with the global reduction phase in (K5) to hide communication latency and idle core time. Kernels (K2), (K3), (K4) and (K6) represent local operations which are assumed to have a very low arithmetic complexity when executed on large scale multi-node hardware. These operations are also overlapped with the global reduction phase in p($l$)CG for pipeline lengths $l \ge 1$. In (K5) all local contributions to the dot products are computed by each worker, after which a global reduction phase is performed. The preconditioned p($l$)CG algorithm can be summarized schematically using these kernel definitions as displayed in Alg. 2.
Our implementation of Alg. 1 uses the MPI-3 standard communication library, which allows for asynchronous progress in the reduction by setting the environment variables
MPICH_ASYNC_PROGRESS=1; MPICH_MAX_THREAD_SAFETY=multiple;
Note that we also allow for asynchronous communication when running the classic CG method in our experiments. Global communication is initiated by an MPI_Iallreduce call which starts a non-blocking reduction:
MPI_Iallreduce(..., G(i-2l+1:i+1, i+1), ..., req(i));
The argument G(i-2l+1:i+1, i+1) represents the elements of the matrix $G$ that are computed in (K5) in iteration $i$. The result of the corresponding global reduction phase is signaled to have arrived by the call to
MPI_Wait(req(i),...);
The MPI_Request array element req(i) that is passed to MPI_Wait keeps track of the iteration index in which the global reduction phase was initiated. Since the p($l$)CG method overlaps $l$ spmvs with a single global reduction phase, the call to
MPI_Wait(req(i),...);
occurs in iteration $i+l$, i.e. $l$ iterations after the call
MPI_Iallreduce(..., req(i));
The schematic representation, Alg. 2, shows that the global reduction phase that is initiated by MPI_Iallreduce with request req(i) in iteration $i$ overlaps with a total of $l$ spmvs, namely the kernels (K1) in iterations $i+1$ up to $i+l$. The corresponding call to MPI_Wait with request req((i+l)-l) = req(i) takes place in iteration $i+l$, before the computations of (K2) in which the dot product results are required, but after the spmv kernel (K1) has been executed. The global reduction also overlaps with the less computationally intensive operations (K2), (K3), (K4) and (K6). Hence, the global communication latency of the dot products in (K5) is 'hidden' behind the computational work of $l$ p($l$)CG iterations.
Note that apart from overlapping global communication with computational kernels as described above, the p($l$)CG method also overlaps each global communication phase with other global communication phases (namely those initiated in the next $l-1$ iterations) when the pipeline length is larger than one. Additional insights on this topic can be found in Section 4.2.
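The resulting order of kernel and communication events can be mimicked by a small bookkeeping driver that records, per iteration, the spmv, the MPI_Wait on the reduction started $l$ iterations earlier, and the newly initiated MPI_Iallreduce (a scheduling sketch only; no actual MPI calls are made and the function name is hypothetical):

```python
def pipeline_schedule(l, iters):
    """Event order of p(l)CG-style overlap: the reduction req(i) started in
    iteration i is awaited in iteration i+l, after that iteration's spmv."""
    events = []
    for i in range(iters):
        events.append(("spmv", i))
        if i >= l:
            events.append(("wait", i - l))    # MPI_Wait(req(i-l), ...)
        events.append(("iallreduce", i))      # MPI_Iallreduce(..., req(i))
    return events
```

For $l = 2$, exactly two spmv events fall between the start of reduction req(0) and the corresponding wait, matching the overlap of $l$ spmvs per global reduction described above.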
3.2. Computational and storage costs
Table 1 gives an overview of implementation details of the p($l$)CG method, Alg. 1, including storage requirements and the number of flops (floating point operations) per iteration. Preconditioning is excluded here for simplicity. We compare to the same properties for Ghysels' p-CG method (Ghysels and Vanroose, 2014). The latter algorithm is conceptually equivalent to p(1)CG but was derived in an essentially different way, see (Ghysels and Vanroose, 2014; Cools and Vanroose, 2017; Cools et al., 2018).
method   | glred | spmv | flops (x$N$) | time                   | memory (vectors)
---------|-------|------|--------------|------------------------|-----------------
CG       | 2     | 1    | 10           | 2 glred + 1 spmv       | 3
p-CG     | 1     | 1    | 16           | max(glred, spmv)       | 6
p($l$)CG | 1     | 1    | $O(l)$       | max(glred, $l$ spmv)   | $O(l)$
3.2.1. Floating point operations per iteration
The Conjugate Gradient variants listed in Table 1 compute a single spmv in each iteration. However, as indicated by the Time column, the time per iteration may be reduced significantly by overlapping the global reduction phase with the computation of one or multiple spmvs. The time required by the local axpy and dotpr computations is neglected, since these operations are assumed to scale perfectly as a function of the number of workers. Alg. 1 requires a number of axpys proportional to the pipeline length $l$ per iteration to update the auxiliary vectors using the recurrence relations (31)-(32). The algorithm also computes the local dot products that form the matrix $G$ and uses two axpy operations to update the search direction $p$ and the iterate $x$. In summary, as indicated by the Flops column in Table 1, the number of flops per p($l$)CG iteration grows linearly with both the pipeline length $l$ and the matrix size $N$.
3.2.2. Storage requirements
Table 1 summarizes the storage requirements for the different variants of the CG algorithm. We refer to (Ghysels and Vanroose, 2014) for storage details on the CG and p-CG algorithms. Alg. 1 stores the three most recently updated vectors in each of the $l+1$ auxiliary bases $Z^{(k)}$ (which include the bases $V = Z^{(0)}$ and $Z = Z^{(l)}$). Furthermore, note that the most recent vectors in the basis $V$ need to be stored for the dot product computations (see Alg. 1, line 23), and that the search direction vector is also stored (Alg. 1, line 30). The total number of vectors the p($l$)CG algorithm keeps in memory at any time during execution (not counting $x$ and $b$) thus grows linearly with the pipeline length $l$.
4. Strong scaling results
Performance measurements result from a PETSc (Balay et al., 2015) implementation of the p($l$)CG algorithm, contributed by the authors, run on a NERSC distributed memory machine using MPI for communication. The latest version of the algorithm, Alg. 1, will be included in the PETSc distribution under > KSP > IMPLS > CG > PIPELCG from PETSc version 3.11 onward.
4.1. Hardware and software specifications
Parallel performance experiments are performed on up to 1024 compute nodes of the US DoE NERSC "Cori" cluster, using 16 Intel 2.3 GHz Haswell cores per node. Nodes are interconnected by a Cray Aries high speed "dragonfly" network. More details on the hardware can be found at https://www.nersc.gov/users/computational-systems/cori under > Cori Configuration > Haswell Nodes.
Performance measurements result from a PETSc (Balay et al., 2015) implementation of the preconditioned p($l$)CG algorithm, Alg. 1. We use PETSc version 3.10.4. The MPI library used for these experiments is Cray MPICH version 7.5.5. The timing results reported are always the most favorable results (in the sense of smallest overall runtime) over 5 individual runs of each method. Experiments also show results for classic CG and Ghysels' p-CG method (Ghysels and Vanroose, 2014) as a reference. The p-CG method is similar to p(1)CG in operational cost and communication overlap (see Table 1), although it may be numerically less stable than p($l$)CG, see (Cools et al., 2018) for details.
4.2. Discussion on experimental data
Figure 2 shows the result of three strong scaling experiments for the 3D Hydrostatic Ice Sheet Flow problem, which is available as an example in the PETSc Scalable Nonlinear Equations Solvers (SNES) folder. The Blatter/Pattyn equations are discretized using finite elements on three successively finer meshes (top/middle/bottom panels respectively). A Newton-Krylov outer-inner iteration is used to solve the nonlinear problem. The CG methods used as inner solver are combined with a block Jacobi preconditioner (one block Jacobi step per CG iteration; one block per processor; blocks are approximately inverted using ILU). The relative tolerance of the inner solvers is set to 1e-06. For the p($l$)CG method the Chebyshev shifts $\sigma_i$, see (25), are based on the interval $[\lambda_{\min}, \lambda_{\max}] = [0, 2]$. In summary, the following parameters are provided when calling the p($l$)CG solver:
-ksp_type pipelcg -pc_type bjacobi -ksp_rtol 1e-06 -ksp_pipelcg_pipel <l> -ksp_pipelcg_lmin 0.0 -ksp_pipelcg_lmax 2.0
The left panel of Figure 2 shows the speedup of the different CG variants relative to the execution time of classic CG on the smallest number of nodes tested. The dotted line shows the theoretical optimal speedup based on that baseline CG timing. As expected, the classic CG method stops scaling from a certain (problem size dependent) number of nodes onward, i.e. when the global reduction becomes the dominant time-consuming kernel in the algorithm. The overlap of communication and computations in the pipelined methods successfully "hides" the global synchronization latency behind the computational kernels, leading to improved scalability.
The right panel of Figure 2 presents a detailed timing breakdown of the individual kernels in the algorithms. The bars represent the maximum timing for a particular kernel over all iterations and over all MPI processes. It is apparent from the red bars that CG features two global reduction phases per iteration whereas the pipelined methods perform only one, see Table 1 (this feature is sometimes referred to as “communication avoiding” in the literature (Chronopoulos and Gear, 1989)). Total times (yellow bars) correspond to the dashed vertical lines in the left panel of Figure 2. The total time spent clearly decreases by “pipelining” global reductions (but note that the overlap is not explicitly shown in the right panel of Figure 2).
Somewhat surprisingly, in most cases the use of longer pipelines ($l > 1$) seems to improve scalability compared to the $l = 1$ case. This is unexpected, as Figure 2 (right) indicates that for these model problems the time spent in global communication is of the same order of magnitude as the time required to compute an spmv (+ prec). Hence, the theoretical model in Table 1 suggests that a pipeline length $l = 1$ would suffice for a perfect overlap, hiding the entire global reduction phase behind computational work, and that longer pipelines do not necessarily yield any benefit towards performance in this case. Figure 4 (left) presents this scenario schematically.
The explanation for the observation that speedup can be achieved for longer pipelines in this particular case is twofold. First, as indicated earlier in this work, a slight benefit may be noticeable from overlapping the global communication with axpys. For $l = 1$ only the local axpys on lines 26-33 in Alg. 1 are overlapped with the glred phase. However, longer pipelines also overlap the axpy operations on lines 19-21 of the next $l-1$ iterations with the global reduction phase. This overlap with a (slightly) larger volume of computations may induce a minor speedup when the time spent in the axpy operations is not completely negligible (which is often the case in practice due to read/writes to local memory), although the effective performance gain achieved by this overlap is generally assumed to be quite small.
Second, the overlap of one global reduction by (other) global communication phases may also lead to improved performance, even in the absence of overlap with a significant amount of computational work from the spmv. This "global communication staggering" effect is assumed to be most apparent in extremely communication-bound regimes, in which a global reduction takes significantly longer than the computation of spmvs and/or axpys. To validate this intuitively sensible premise, Figure 3 presents a simple small-scale experiment for solving a 4 million unknowns system with a finite differences 2D Laplacian five-point stencil (left, PETSc KSP ex2) and a trivial diagonal system (i.e. a "one-point stencil") with the eigenvalues of the 2D Laplacian on the main diagonal (right). Both problems are solved on 128 nodes using classic CG and p($l$)CG, and are equally challenging from a spectral perspective. For the Laplacian (left panel) the benefit of using p($l$)CG over CG is reflected in the total timings (yellow bars). However, the performance gained from using longer pipelines ($l > 1$) is quite limited, since the timing breakdown indicates that glred takes around the same order of time as an spmv. This scenario is represented schematically in Fig. 4 (left), from which it is clear that longer pipeline lengths are not useful here since there is no more global communication to overlap. A different view is presented by the diagonal matrix (right panel), which can be interpreted as a toy model for extremely communication-bound scenarios. In this case a significant speedup is observed by comparing p(2)CG to p(1)CG. Longer pipelines with $l > 2$ do not improve performance further, however. An explanation can again be found in Fig. 4 (right panel). A small gain in performance can be achieved by using p(1)CG instead of CG, which is due to the avoidance of one global communication phase per iteration and the overlap of global communication with the spmv.
However, contrary to the left-hand side panel, a significant performance gain of p(2)CG over p(1)CG is observed, due to the fact that p($l$)CG with $l \ge 2$ allows a new global reduction to start before the global reductions from previous iterations have finished. This overlap or "staggering" of the global communication phases for pipeline lengths $l \ge 2$ also allows for more robustness with respect to glred runtime variance. Together with the aforementioned overlap with axpy operations for $l > 1$, this "communication staggering" explains the speedup that was observed in the strong scaling experiments in Fig. 2 when using p($l$)CG with pipeline lengths $l > 1$.

5. Conclusions
The primary aim of this work is to present pipelined Krylov subspace methods as a valuable application of MPI-based programming that is of great practical interest for improving the scalability of linear solver codes. The performance gains and strong scaling benefits of the pipelining approach are demonstrated for the deep pipelined CG method on various large scale benchmark problems in this work. In case of optimal overlap of communications and computations, a substantial speedup factor can be observed when comparing p($l$)CG to classic CG. Similar results can be obtained using MPI-based implementations of other communication-avoiding and communication-hiding methods, many of which are available in the open source PETSc framework. We believe that pipelined Krylov subspace methods provide an excellent use case for the asynchronous non-blocking communications implemented in the MPI-3 standard. We encourage both MPI developers and users from applications to share their insights to further improve MPI functionality for these and other interesting future-oriented applications. Moreover, we encourage MPI developers to address the overhead of thread safety, which currently seems to be quite significant as observed in our experiments (but not reported in the current manuscript).
Acknowledgements.
The authors are grateful for the funding that supported this work. Specifically, J. Cornelis acknowledges funding by the University of Antwerp Research Council under the University Research Fund (BOF). S. Cools receives funding from the Flemish Research Foundation (FWO Flanders) under grant 12H4617N. This work was also supported in part by the U.S. Department of Energy, Office of Science, Office of Advanced Scientific Computing Research, "Scientific Discovery through Advanced Computing" (SciDAC) program through the FASTMath Institute under Contract No. DE-AC02-05CH11231 at Lawrence Berkeley National Laboratory.

References
 Balay et al. (2015) S. Balay, S. Abhyankar, M.F. Adams, J. Brown, P. Brune, K. Buschelman, L. Dalcin, V. Eijkhout, W.D. Gropp, D. Kaushik, M.G. Knepley, L. Curfman McInnes, K. Rupp, B.F. Smith, S. Zampini, and H. Zhang. 2015. PETSc Web page. http://www.mcs.anl.gov/petsc
 Carson and Demmel (2014) E. Carson and J. Demmel. 2014. A residual replacement strategy for improving the maximum attainable accuracy of sstep Krylov subspace methods. SIAM J. Matrix Anal. Appl. 35, 1 (2014), 22–43.
 Carson et al. (2013) E. Carson, N. Knight, and J. Demmel. 2013. Avoiding communication in nonsymmetric Lanczosbased Krylov subspace methods. SIAM Journal on Scientific Computing 35, 5 (2013), S42–S61.
 Carson et al. (2018) E. Carson, M. Rozlozník, Z. Strakoš, P. Tichý, and M. Tůma. 2018. The Numerical Stability Analysis of Pipelined Conjugate Gradient Methods: Historical Context and Methodology. SIAM Journal on Scientific Computing 40, 5 (2018), A3549–A3580.
 Chronopoulos and Gear (1989) A.T. Chronopoulos and C.W. Gear. 1989. sStep iterative methods for symmetric linear systems. J. Comput. Appl. Math. 25, 2 (1989), 153–168.
 Chronopoulos and Kucherov (2010) A.T. Chronopoulos and A.B. Kucherov. 2010. Block sstep Krylov iterative methods. Numerical Linear Algebra with Applications 17, 1 (2010), 3–15.
 Cools (2018) S. Cools. 2018. Analyzing and improving maximal attainable accuracy in the communication hiding pipelined BiCGStab method. Submitted, Parallel Computing, PMAA’18 Special Issue, preprint: arXiv:1809.01948 (2018).
 Cools et al. (2019) S. Cools, J. Cornelis, and W. Vanroose. 2019. Numerically Stable Recurrence Relations for the Communication Hiding Pipelined Conjugate Gradient Method. Submitted, IEEE Transactions on Parallel and Distributed Systems, preprint: arXiv:1902.03100 (2019).
 Cools and Vanroose (2017) S. Cools and W. Vanroose. 2017. The communication-hiding pipelined BiCGStab method for the parallel solution of large unsymmetric linear systems. Parallel Comput. 65 (2017), 1–20.
 Cools et al. (2018) S. Cools, E.F. Yetkin, E. Agullo, L. Giraud, and W. Vanroose. 2018. Analyzing the Effect of Local Rounding Error Propagation on the Maximal Attainable Accuracy of the Pipelined Conjugate Gradient Method. SIAM J. Matrix Anal. Appl. 39, 1 (2018), 426–450.
 Cornelis et al. (2017) J. Cornelis, S. Cools, and W. Vanroose. 2017. The communication-hiding Conjugate Gradient method with deep pipelines. Submitted, SIAM Journal on Scientific Computing, preprint: arXiv:1801.04728 (2017).
 D’Azevedo et al. (1992) E.F. D’Azevedo, V. Eijkhout, and C.H. Romine. 1992. Reducing communication costs in the Conjugate Gradient algorithm on distributed memory multiprocessors. Technical Report TM-12192, Oak Ridge National Laboratory, TN, US.
 Demmel et al. (1993) J.W. Demmel, M.T. Heath, and H.A. Van der Vorst. 1993. Parallel Numerical Linear Algebra. Acta Numerica 2 (1993), 111–197.
 Dongarra et al. (2011) J. Dongarra, P. Beckman, T. Moore, P. Aerts, G. Aloisio, J. Andre, D. Barkai, J. Berthou, T. Boku, B. Braunschweig, et al. 2011. The international exascale software project roadmap. International Journal of High Performance Computing Applications 25, 1 (2011), 3–60.
 Dongarra et al. (2015) J. Dongarra, M.A. Heroux, and P. Luszczek. 2015. HPCG Benchmark: a New Metric for Ranking High Performance Computing Systems. University of Tennessee, Electrical Engineering and Computer Science Department, Technical Report UT-EECS-15-736 (2015).
 Eller and Gropp (2016) P.R. Eller and W. Gropp. 2016. Scalable Nonblocking Preconditioned Conjugate Gradient Methods. In SC16: International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE, 204–215.
 Erhel (1995) J. Erhel. 1995. A parallel GMRES version for general sparse matrices. Electronic Transactions on Numerical Analysis 3, 12 (1995), 160–176.
 Ghysels et al. (2013) P. Ghysels, T.J. Ashby, K. Meerbergen, and W. Vanroose. 2013. Hiding global communication latency in the GMRES algorithm on massively parallel machines. SIAM Journal on Scientific Computing 35, 1 (2013), C48–C71.
 Ghysels and Vanroose (2014) P. Ghysels and W. Vanroose. 2014. Hiding global synchronization latency in the preconditioned Conjugate Gradient algorithm. Parallel Comput. 40, 7 (2014), 224–238.
 Greenbaum (1997) A. Greenbaum. 1997. Iterative methods for solving linear systems. SIAM.
 Grigori et al. (2016) L. Grigori, S. Moufawad, and F. Nataf. 2016. Enlarged Krylov Subspace Conjugate Gradient Methods for Reducing Communication. SIAM J. Matrix Anal. Appl. 37, 2 (2016), 744–773.
 Hestenes and Stiefel (1952) M.R. Hestenes and E. Stiefel. 1952. Methods of Conjugate Gradients for solving linear systems. J. Res. Nat. Bur. Standards 49, 6 (1952), 409–436.
 Hoemmen (2010) M. Hoemmen. 2010. Communication-avoiding Krylov subspace methods. Ph.D. dissertation. EECS Department, University of California, Berkeley.
 Imberti and Erhel (2017) D. Imberti and J. Erhel. 2017. Varying the s in Your s-step GMRES. Electronic Transactions on Numerical Analysis 47 (2017), 206–230.
 Liesen and Strakoš (2012) J. Liesen and Z. Strakoš. 2012. Krylov Subspace Methods: Principles and Analysis. Oxford University Press.
 McInnes et al. (2014) L.C. McInnes, B. Smith, H. Zhang, and R.T. Mills. 2014. Hierarchical Krylov and nested Krylov methods for extreme-scale computing. Parallel Comput. 40, 1 (2014), 17–31.
 Meurant (1987) G. Meurant. 1987. Multitasking the Conjugate Gradient method on the CRAY X-MP/48. Parallel Comput. 5, 3 (1987), 267–280.
 Saad (2003) Y. Saad. 2003. Iterative methods for sparse linear systems. SIAM.
 Sanan et al. (2016) P. Sanan, S.M. Schnepp, and D.A. May. 2016. Pipelined, Flexible Krylov Subspace Methods. SIAM Journal on Scientific Computing 38, 5 (2016), C441–C470.
 Shalf (2007) J. Shalf. 2007. The new landscape of parallel computer architecture. In Journal of Physics: Conference Series, Vol. 78. IOP Publishing, 012066.
 Strakoš (1987) Z. Strakoš. 1987. Effectivity and optimizing of algorithms and programs on the host-computer/array-processor system. Parallel Comput. 4, 2 (1987), 189–207.
 Van der Vorst (2003) H.A. Van der Vorst. 2003. Iterative Krylov methods for large linear systems. Vol. 13. Cambridge University Press.
 Yamazaki et al. (2017) I. Yamazaki, M. Hoemmen, P. Luszczek, and J. Dongarra. 2017. Improving performance of GMRES by reducing communication and pipelining global collectives. In Parallel and Distributed Processing Symposium Workshops (IPDPSW), 2017 IEEE International. IEEE, 1118–1127.
 Zhuang and Casas (2017) S. Zhuang and M. Casas. 2017. Iteration-Fusing Conjugate Gradient. In Proceedings of the International Conference on Supercomputing. ACM, 21–30.