Improving strong scaling of the Conjugate Gradient method for solving large linear systems using global reduction pipelining

05/15/2019
by   Siegfried Cools, et al.
UAntwerpen
Berkeley Lab

This paper presents performance results comparing MPI-based implementations of the popular Conjugate Gradient (CG) method and several of its communication hiding (or 'pipelined') variants. Pipelined CG methods are designed to efficiently solve SPD linear systems on massively parallel distributed memory hardware, and typically display significantly improved strong scaling compared to classic CG. This increase in parallel performance is achieved by overlapping the global reduction phase (MPI_Iallreduce) required to compute the inner products in each iteration with (chiefly local) computational work such as the matrix-vector product, as well as with other global communication. This work includes a brief introduction to the deep pipelined CG method for readers who may be unfamiliar with the specifics of the method. A brief overview of implementation details provides the practical tools required for implementation of the algorithm. Subsequently, easily reproducible strong scaling results on the US Department of Energy (DoE) NERSC machine 'Cori' (Phase I - Haswell nodes) on up to 1024 nodes with 16 MPI ranks per node are presented using an implementation of p(l)-CG that is available in the open source PETSc library. Observations on the staggering and overlap of the asynchronous, non-blocking global communication phases with communication and computational kernels are drawn from the experiments.


1. Introduction

The family of iterative solvers known as Krylov subspace methods (KSMs) (Greenbaum, 1997; Liesen and Strakoš, 2012; Saad, 2003; Van der Vorst, 2003) are among the most efficient present-day methods for solving large scale sparse systems of linear equations. The Conjugate Gradient method (CG) can be considered the mother of all Krylov subspace methods. It was derived in 1952 (Hestenes and Stiefel, 1952) with the aim of solving linear systems with a sparse symmetric positive definite (SPD) matrix. The CG method is one of the most widely used methods for solving these systems today, which in turn form the basis of a plethora of scientific and industrial applications. However, driven by the essential transition from optimal time-to-solution on a single node towards massively parallel computing over the last decades (Shalf, 2007), the bottleneck for fast execution of Krylov subspace methods has shifted. Whereas in the past the application of the sparse matrix-vector product (spmv) was considered the most time-consuming part of the algorithm, the global synchronizations required in dot product and norm computations form the main bottleneck for efficient execution on present-day distributed memory hardware (Dongarra et al., 2011, 2015).

Driven by the increasing levels of parallelism in present-day HPC machines, research on the elimination of the global communication bottleneck has recently regained significant attention from the international computer science, engineering and numerical mathematics communities. Evolving from pioneering work on reducing communication in Krylov subspace methods from the late 1980's and 90's (Strakoš, 1987; Meurant, 1987; Chronopoulos and Gear, 1989; D'Azevedo et al., 1992; Demmel et al., 1993; Erhel, 1995), a number of variants of the classic Krylov subspace algorithms have been introduced in recent years. We specifically point out recent work by Chronopoulos et al. (Chronopoulos and Kucherov, 2010), Hoemmen (Hoemmen, 2010), Carson et al. (Carson et al., 2013; Carson and Demmel, 2014), McInnes et al. (McInnes et al., 2014), Grigori et al. (Grigori et al., 2016), Eller et al. (Eller and Gropp, 2016), Imberti et al. (Imberti and Erhel, 2017) and Zhuang et al. (Zhuang and Casas, 2017).

The research presented in this paper is situated in the research branch on so-called "communication hiding" or "pipelined" Krylov subspace methods (Ghysels et al., 2013; Ghysels and Vanroose, 2014; Cools and Vanroose, 2017). (Note: in the context of communication reduction in Krylov subspace methods, the terminology "pipelined" KSMs that is used throughout the related applied linear algebra and computer science literature refers to software pipelining, i.e. algorithmic reformulations of the KSM procedure that reduce communication overhead, and should not be confused with hardware-level instruction pipelining (ILP).) These algorithmic variations to classic Krylov subspace methods are designed to overlap the time-consuming global communications in each iteration of the algorithm with computational tasks such as calculating spmvs or axpys (vector operations of the form y ← αx + y) as well as with global communication phases initiated in later iterations. Thanks to the reduction/elimination of the synchronization bottleneck, pipelined algorithms have been shown to increase parallel efficiency by allowing the algorithm to continue scaling on large numbers of processors (Sanan et al., 2016; Yamazaki et al., 2017). However, when redesigning these algorithms one should be wary not to deteriorate the mathematical properties (numerical stability, attainable accuracy, etc.) that are well established for the original Krylov algorithms. Research on analyzing and improving the numerical stability of pipelined Krylov subspace methods has recently been performed in (Carson et al., 2018; Cools et al., 2018; Cools, 2018; Cools et al., 2019); we point out the references therein for more information on the numerical properties of Krylov subspace methods.

Strong scaling results obtained from an MPI-based implementation of the l-length pipelined Conjugate Gradient method, p(l)-CG for short, are presented in this work. The p(l)-CG method was recently presented in (Cornelis et al., 2017; Cools et al., 2019) and allows each global reduction phase to be overlapped with the communication and computational work of subsequent iterations. This overlap is achieved by exploiting the asynchronous non-blocking global "Iallreduce" MPI operations that have been available from the MPI-3 standard onward. The pipeline length l is a parameter of the method that can be chosen depending on the problem and hardware setup (as a function of the communication-to-computation ratio). As is the case for all communication reducing Krylov subspace methods, a number of problem- and hardware-dependent factors influence the communication-to-computation ratio and thus directly affect the performance of the method. The choice of the pipeline length is subject to the speed of the communication network, the time consumed by computing the matrix-vector product and the effort one is willing to invest in a preconditioner. The argument for a longer pipeline use case is stronger for preconditioners that use limited communication (e.g. block Jacobi, no-overlap DDM, ...).

This work aims to provide the reader with insights on the implementation and use of the pipelined Conjugate Gradient method as a promising contemporary application of MPI. The focus of the paper is therefore on providing a brief overview of the properties of the p(l)-CG method which are required for a practical MPI-based implementation.

We conclude this introduction by presenting a short overview of the contents of this paper. Section 2 familiarizes the reader with the fundamentals of communication hiding Krylov subspace methods. It presents a high-level summary of the mathematical principles behind the l-length pipelined CG method and formulates technical comments on the algorithm's pseudo-code, which is included in this section. Section 3 provides the practical framework for an MPI-based implementation of the method described in Section 2. It includes discussions on the ability to overlap the main algorithmic kernels in the algorithm using asynchronous non-blocking MPI_Iallreduce calls, the computational cost of these kernels, and an overview of the algorithm's storage requirements. Section 4 presents various strong scaling results that were obtained using an MPI-based implementation of the pipelined CG algorithm in the PETSc library (Balay et al., 2015). The experiments were executed on up to 1024 nodes of the US Department of Energy NERSC facility "Cori". The experimental results are supplemented by an elaborate discussion that aims to provide the reader with detailed insights on the performance gains reported in the strong scaling experiments. The manuscript is concluded by a short outlook in Section 5.

2. The pipelined CG algorithm

The deep pipelined Conjugate Gradient method, denoted p(l)-CG for short, was first presented in (Cornelis et al., 2017), where it was derived in analogy to the p(l)-GMRES method (Ghysels et al., 2013). The parameter l represents the pipeline length, which indicates the number of iterations that are overlapped by each global reduction phase. We summarize the current state-of-the-art deep pipelined p(l)-CG method below.

2.1. Mathematical framework for p(l)-CG

Let V_i = [v_0, ..., v_i] be the orthonormal basis for the Krylov subspace in iteration i of the p(l)-CG algorithm, consisting of i + 1 vectors. Here A is assumed to be a symmetric positive definite matrix. The Krylov subspace basis vectors satisfy the Lanczos relation

(1)

with

(2)

Column-wise, the Lanczos relation (1) translates in vector notation to

(3)

The auxiliary basis Z_i runs l vectors ahead of the basis V_i and is defined as

(4)

where the degree-l matrix polynomial P_l is given by

(5)

with optional stabilizing shifts σ_0, ..., σ_{l-1}, see (Ghysels et al., 2013; Cornelis et al., 2017; Hoemmen, 2010). The choice of the polynomial is discussed in Section 2.2. Contrary to the basis V_i, the auxiliary basis Z_i is not orthonormal. It is constructed using the recursive definitions

(6)

which are obtained by left-multiplying the Lanczos relation (3) on both sides by the polynomial P_l(A). Expression (6) translates into a Lanczos type matrix relation

(7)

where this matrix contains the tridiagonal matrix T, shifted l places along the main diagonal, i.e.

(8)

The bases V_i and Z_i are connected through the basis transformation Z_i = V_i G_i, where G_i is a banded upper triangular matrix with a bandwidth of 2l + 1 non-zero diagonals (Cornelis et al., 2017). For a symmetric matrix A the matrix G_i is symmetric around its l-th upper diagonal, since

(9)

The following recurrence relation for the basis vectors is derived from the basis transformation:

(10)

A total of l iterations after the dot-products

(11)

have been initiated, the elements of the transformation matrix, which were computed from the auxiliary basis, are corrected as follows:

(12)

for , and

(13)

The tridiagonal matrix T, see (2), is updated recursively by adding one column in each iteration i. The diagonal element is characterized by the expressions:

(14)

The first term is considered zero when the corresponding index is out of range. The update for the off-diagonal element is given by

(15)

The element has already been computed in the previous iteration and can thus simply be copied, due to the symmetry of T.

1:
2:for  do
3:      # Compute matrix-vector product
4:     ; # Apply preconditioner
5:     if  then
6:           # Copy auxiliary basis vectors      
7:     end if
8:     if  then # Finalize dot-products ()
9:           # Update transformation matrix
10:          
11:          # Check for square root breakdown; restart if required
12:          if  then
13:                # Add column to Hessenberg matrix
14:               
15:          else
16:               
17:                          
18:          end if
19:           # Compute basis vectors
20:           # Compute basis vector
21:           # Compute basis vector      
22:     end if
23:      # Initiate dot-products ()
24:     if  then
25:          
26:     else if  then
27:           # Factorize Hessenberg matrix
28:          
29:           # Compute recursive residual norm
30:           # Update search direction
31:           # Update approximate solution
32:          if  then RETURN; end if # Check convergence criterion                
33:     end if
Algorithm 1 Preconditioned l-length pipelined p(l)-CG. Input: , , , , , , ,

Once the basis has been constructed, the solution can be updated based on a search direction, following the classic derivation of D-Lanczos in (Saad, 2003), Sec. 6.7.1. The Ritz-Galerkin condition

(16)

implies . The LU-factorization of the tridiagonal matrix is given by

(17)

It follows from (17) that the elements of the lower and upper triangular factors are given by

(18)

Expression (2.1) indicates that the approximate solution equals

(19)

The columns of the search direction matrix can be computed recursively. It follows that

(20)

Denoting the vector by , it follows from

(21)

that the coefficients satisfy a simple recurrence. The approximate solution is then updated using the recurrence relation

(22)

The above expressions are combined in Alg. 1. Once the initial pipeline has been filled, the relations (6)-(10) can be used to recursively compute the basis vectors in subsequent iterations (see lines 19-21). The scalar results of the global reduction phase (line 23) are required l iterations later (lines 9-10). In every iteration the global communication is thus overlapped with the computational work and the communications of the l subsequent iterations. This "communication hiding" forms the core property of the p(l)-CG algorithm. Note that Alg. 1 presents a complete version of the algorithm including preconditioning, stable recurrences, stopping criteria, etc. These topics are covered in more detail in Section 2.2 below.

2.2. Technical comments on the algorithm

To complement the mathematical framework a number of insightful technicalities of the algorithm are discussed.

Preconditioned p(l)-CG

Alg. 1 is written in a general form that also includes preconditioning. This requires computing the unpreconditioned auxiliary variables in addition to the quantities described in Section 2.1. These variables also satisfy a Lanczos type relation:

(23)

The corresponding dot-products on line 23 are defined as inner products with respect to the preconditioner in the context of the preconditioned algorithm, see (Cools et al., 2019) for details.

Residual norm in p(l)-CG

The residual is not computed in Alg. 1, but its norm is characterized by a scalar quantity (or its natural norm equivalent in the preconditioned case). This quantity is useful to formulate a stopping criterion for the p(l)-CG iteration without introducing additional communication or computations, see Alg. 1, line 32.

Number of dot products in p(l)-CG

Although Alg. 1, line 23, indicates that in each iteration a number of dot products needs to be computed, the number of dot product computations can be limited by exploiting the symmetry of the matrix G, see expression (2.1). Owing to the symmetric band, only the dot products up to and including the l-th upper diagonal element need to be computed.

Choice of auxiliary basis shifts

The auxiliary basis vectors are defined by applying the polynomial P_l(A), but the basis Z_i is in general not orthogonal. For longer pipeline lengths l, Z_i may be very ill-conditioned and may become close to singular as i increases. The effect is the most pronounced when the shifts are all zero. Shifts σ_i can be set to improve the conditioning of Z_i. It holds that

(24)

where the right-hand factor is the part of the basis obtained by dropping the first l columns. Hence, the polynomial P_l has a major impact on the conditioning of the matrix Z_i, which in turn plays a crucial role in the propagation of local rounding errors in the p(l)-CG algorithm (Cornelis et al., 2017). This observation motivates minimizing the norm of P_l(A). Optimal shift values for minimizing the Euclidean 2-norm of P_l(A) are the Chebyshev shifts (Hoemmen, 2010; Ghysels et al., 2013; Cornelis et al., 2017) (for i = 0, ..., l-1):

(25)

which are used throughout this work. This requires a notion of the largest eigenvalue λ_max (resp. the smallest eigenvalue λ_min), which can be estimated a priori, e.g. by a few power method iterations.

Square root breakdowns in p(l)-CG

When for a certain iteration the matrix becomes (numerically) singular, the Cholesky factorization procedure in p(l)-CG will fail. This may manifest in the form of a square root breakdown on line 10 in Alg. 1 when the root argument becomes negative. Numerical round-off errors may increase the occurrence of these breakdowns in practice, cf. (Cornelis et al., 2017; Cools et al., 2019). When a breakdown occurs the p(l)-CG iteration is restarted explicitly. This restarting may delay convergence compared to standard CG, where no square root breakdowns occur.

Numerically stable recurrences

Although expression (10) provides a recursive definition for the basis, it was shown in (Cornelis et al., 2017) that this relation may lead to numerically unstable convergence behavior. We therefore introduce a total of l + 1 bases, where an upper index labels the different bases and the lower index indicates the iteration. The auxiliary bases are defined as:

(26)

where the polynomial is defined by (5). By definition (26) the 0-th basis denotes the original Krylov subspace basis

(27)

whereas the l-th basis is precisely the auxiliary basis that was defined earlier, see (4), i.e.

(28)

Intuitively, the k-th basis should be interpreted as running k spmvs ahead of the original Krylov basis.

Note that the leading vectors in the original basis and in all auxiliary bases are identical, which is represented by line 6 in Alg. 1. For each pair of subsequent bases it holds that

(29)

By multiplying the original Lanczos relation (3) on both sides by the respective polynomials and by exploiting their associativity with A, it is straightforward to see that each auxiliary basis satisfies a Lanczos type recurrence relation:

(30)

Expression (30) yields the Lanczos relation (3) for the original basis, whereas for the deepest basis it results in the recurrence relation (6). To avoid the computation of additional spmvs, the relations (30) can be rewritten using expression (29) as:

(31)

The recurrence relations (31) allow computing the vector updates for the bases without computing additional spmvs. The recursive update (30) is used only for the deepest basis, i.e. to compute the vectors in the auxiliary basis we use the recurrence relation

(32)

which boils down to the recurrence relation (6).

The recursive relations derived above are summarized on lines 19-21 in Alg. 1. In the i-th iteration of Alg. 1 each basis is updated by adding one vector. The algorithm thus computes a total of l + 1 new basis vectors per iteration, one for each basis.

All vector updates make use of the same scalar coefficients that are computed in iteration i of Alg. 1 (lines 12-18). Only one spmv is required per iteration to compute all basis vector updates.

3. MPI-based implementation

Section 2 gave an overview of the mathematical properties of the p(l)-CG method. In this section we comment on technical aspects concerning the implementation of Alg. 1.

3.1. Hiding global communication

A schematic kernel-based representation of the p(l)-CG algorithm is introduced in this section. The following computational kernels are defined in iteration i of Alg. 1:

tag   type    kernel description                    Alg. 1 line
(K1)  spmv    apply A and the preconditioner        3-4
(K2)  scalar  update transformation matrix          9-10
(K3)  scalar  update tridiagonal matrix             12-18
(K4)  axpy    recursive basis updates               19-21
(K5)  dotpr   compute dot products                  23
(K6)  axpy    update search direction and iterate   24-33

The spmv kernel (K1) is considered to be the most computationally intensive part of the algorithm, which is overlapped with the global reduction phase in (K5) to hide communication latency and idle core time. Kernels (K2), (K3), (K4) and (K6) represent local operations which are assumed to have a very low arithmetic complexity when executed on large scale multi-node hardware. These operations are also overlapped with the global reduction phase in p(l)-CG. In (K5) all local contributions to the dot products are computed by each worker, after which a global reduction phase is performed. The preconditioned p(l)-CG algorithm can be summarized schematically using these kernel definitions as displayed in Alg. 2.

1:initialization ;
2:for  do
3:     (K1) spmv ;
4:     if  then
5:          MPI_Wait(req(i-l),…) ;
6:          (K2) scalar ;
7:          (K3) scalar ;
8:          (K4) axpy ;      
9:     end if
10:     (K5) dotpr ;
11:      MPI_Iallreduce(…,G(i-2l+1:i+1,i+1),…,req(i)) ;
12:     (K6) axpy ;
13:end for
14:finalization (drain the pipeline) ;
Algorithm 2 Schematic representation of p(l)-CG

Our implementation of Alg. 1 uses the MPI-3 standard communication library, which allows for asynchronous progress in the reduction by setting the environment variables

  MPICH_ASYNC_PROGRESS=1;
  MPICH_MAX_THREAD_SAFETY=multiple;

Note that we also allow for asynchronous communication when running the classic CG method in our experiments. Global communication is initiated by an MPI_Iallreduce call, which starts a non-blocking reduction:

  MPI_Iallreduce(...,G(i-2l+1:i+1,i+1),...,req(i));

The argument G(i-2l+1:i+1,i+1) represents the elements of the matrix G that are computed in (K5) in iteration i. The arrival of the result of the corresponding global reduction phase is awaited by the call to

  MPI_Wait(req(i),...);

The MPI_Request array element req(i) that is passed to MPI_Wait keeps track of the iteration index in which the global reduction phase was initiated. Since the p(l)-CG method overlaps l spmvs with a single global reduction phase, the call to

  MPI_Wait(req(i),...);

occurs in iteration i + l, i.e. l iterations after the call

  MPI_Iallreduce(..., req(i)).

The schematic representation, Alg. 2, shows that the global reduction phase that is initiated by MPI_Iallreduce with request req(i) in iteration i overlaps with a total of l spmvs, namely the kernels (K1) in iterations i+1 up to i+l. The corresponding call to MPI_Wait with request req((i+l)-l) = req(i) takes place in iteration i+l, before the computations of (K2) in which the dot product results are required, but after the spmv kernel (K1) has been executed. The global reduction also overlaps with the less computationally intensive operations (K2), (K3), (K4) and (K6). Hence, the global communication latency of the dot products in (K5) is 'hidden' behind the computational work of l p(l)-CG iterations.

Note that, apart from overlapping global communication with computational kernels as described above, the p(l)-CG method also overlaps each global communication phase with other global communication phases (namely those initiated in the next l-1 iterations) when the pipeline length l is larger than one. Additional insights on this topic can be found in Section 4.2.

3.2. Computational and storage costs

Table 1 gives an overview of implementation details of the p(l)-CG method, Alg. 1, including storage requirements and the number of flops (floating point operations) per iteration. Preconditioning is excluded here for simplicity. We compare to the same properties for Ghysels' p-CG method (Ghysels and Vanroose, 2014). The latter algorithm is conceptually equivalent to p(1)-CG but was derived in an essentially different way, see (Ghysels and Vanroose, 2014; Cools and Vanroose, 2017; Cools et al., 2018).

          glred   spmv   Flops            Time                Memory
CG        2       1      10               2 glred + 1 spmv    3
p-CG      1       1      16               max(glred, spmv)    6
p(l)-CG   1       1      (see Sec. 3.2)   max(glred, spmv)    (see Sec. 3.2)
Table 1. Specifications of different CG variants (without preconditioner); 'CG' denotes classic CG; 'p-CG' is Ghysels' pipelined CG (Ghysels and Vanroose, 2014). The glred and spmv columns list the number of synchronization phases and spmvs per iteration. The Flops column indicates the number of flops per iteration required to compute axpys and dot products. The Time column shows the time spent in global all-reduce communications and spmvs. The Memory column counts the number of vectors kept in memory (excluding the solution and right-hand side) at any time during execution.

3.2.1. Floating point operations per iteration

The Conjugate Gradient variants listed in Table 1 compute a single spmv in each iteration. However, as indicated by the Time column, the time per iteration may be reduced significantly by overlapping the global reduction phase with the computation of one or multiple spmvs. The time required by the local axpy and dot product computations is neglected, since these operations are assumed to scale perfectly as a function of the number of workers. Alg. 1 requires l + 1 axpys per iteration to update the auxiliary basis vectors using the recurrence relations (31)-(32). The algorithm also computes local dot products to form the matrix G and uses two axpy operations to update the search direction and the iterate. In summary, as indicated by the Flops column in Table 1, the number of axpys and dot products used by the p(l)-CG algorithm grows with the pipeline length l, amounting to a flop count per iteration that is linear in the matrix size N.

3.2.2. Storage requirements

Table 1 summarizes the storage requirements for different variants of the CG algorithm. We refer to (Ghysels and Vanroose, 2014) for storage details on the CG and p-CG algorithms. Alg. 1 stores the three most recently updated vectors in each of the l + 1 auxiliary bases (which include the original basis and the deepest auxiliary basis). Furthermore, note that the most recent vectors in the basis need to be stored for the dot product computations (see Alg. 1, line 23), and that the search direction vector is also stored (Alg. 1, line 30). The p(l)-CG algorithm thus keeps a fixed total number of vectors in memory at any time during execution (not counting the solution and right-hand side). Note that for l = 1 the minimum number of vectors needs to be stored.

4. Strong scaling results

Performance measurements result from a PETSc (Balay et al., 2015) implementation contributed by the authors of the p(l)-CG algorithm on a NERSC distributed memory machine using MPI for communication. The latest version of the algorithm, Alg. 1, will be included in the PETSc distribution under > KSP > IMPLS > CG > PIPELCG from PETSc version 3.11 onward.

4.1. Hardware and software specifications

Parallel performance experiments are performed on up to 1024 compute nodes of the US DoE NERSC “Cori” cluster on 16 Intel 2.3 GHz Haswell cores per node. Nodes are interconnected by a Cray Aries high speed “dragonfly” network. More details on the hardware can be found at https://www.nersc.gov/users/computational-systems/cori under > Cori Configuration > Haswell Nodes.

Performance measurements result from a PETSc (Balay et al., 2015) implementation of the preconditioned p(l)-CG algorithm, Alg. 1. We use PETSc version 3.10.4. The MPI library used for this experiment is Cray-MPICH version 7.5.5. The timing results reported are always the most favorable results (in the sense of smallest overall run-time) over 5 individual runs of each method. Experiments also show results for classic CG and Ghysels' p-CG method (Ghysels and Vanroose, 2014) as a reference. The p-CG method is similar to p(l)-CG in operational cost and communication overlap (see Table 1), although it may be numerically less stable than p(l)-CG, see (Cools et al., 2018) for details.

Figure 2. Non-linear 3D hydrostatic (Blatter/Pattyn) ice sheet flow equations (PETSc SNES ex48). Left: speedup over classic CG on the smallest node count for various pipeline lengths and problem sizes. Right: timing breakdown showing the maximum time spent in each kernel. Total timings correspond to the dashed line in the left graph(s). Legend: 'CG': classic CG; 'PCG': Ghysels' pipelined CG; 'PCG l': l-length pipelined CG (l=1,2,3).
Figure 3. Timing breakdowns showing the maximum time spent in each kernel for classic CG and p(l)-CG (l=1,2,3). Left: linear 2D finite difference 5-point stencil Laplacian system (PETSc KSP ex2) (see also Fig. 4, left panel for interpretation). Right: "communication bound" toy problem; diagonal matrix system with identical spectrum and right-hand side to the 2D 5-point stencil Laplacian (see also Fig. 4, right panel).
Figure 4. Schematic representation of the "communication hiding" property of the p(l)-CG method in different scenarios. The work flow for the main SPMV, AXPY and GLRED kernels during three consecutive CG (top), p(1)-CG (middle) and p(2)-CG (bottom) iterations is shown. Left: theoretically "ideal" scenario for pipeline length l = 1, where approximately equal time is spent in GLRED and SPMV. No speedup is expected for pipeline lengths l > 1 compared to l = 1 (see shaded areas). Right: extremely communication-bound scenario where the time spent in GLRED is much larger than the time of the SPMV computation. A significant speedup of p(2)-CG over p(1)-CG is expected as a result of the "staggered" communication phases.

4.2. Discussion on experimental data

Figure 2 shows the result of three strong scaling experiments for the 3D Hydrostatic Ice Sheet Flow problem, which is available as an example in the PETSc Scalable Nonlinear Equations Solvers (SNES) folder. The Blatter/Pattyn equations are discretized using three increasingly fine finite element meshes (top/middle/bottom). A Newton-Krylov outer-inner iteration is used to solve the non-linear problem. The CG methods used as inner solver are combined with a block Jacobi preconditioner (one block Jacobi step per CG iteration; one block per processor; blocks are approximately inverted using ILU). The relative tolerance of the inner solvers is set to 1e-6. For the p(l)-CG method the Chebyshev shifts, see (25), are based on the interval [0, 2]. In summary, the following parameters are provided when calling the p(l)-CG solver:

  -ksp_type pipelcg       -pc_type bjacobi
  -ksp_rtol 1e-06         -ksp_pipelcg_pipel <l>
  -ksp_pipelcg_lmin 0.0   -ksp_pipelcg_lmax 2.0
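A complete invocation might combine these solver options with the asynchronous-progress environment variables from Section 3.1. The following config fragment is a hypothetical launch script: the launcher, executable name, node and rank counts, and the pipeline length 2 are placeholder assumptions, not values prescribed by the experiments.

```shell
# Hypothetical launch script; adapt the launcher and binary name.
export MPICH_ASYNC_PROGRESS=1
export MPICH_MAX_THREAD_SAFETY=multiple

srun -N 128 --ntasks-per-node=16 ./ex48 \
    -ksp_type pipelcg -pc_type bjacobi \
    -ksp_rtol 1e-06 -ksp_pipelcg_pipel 2 \
    -ksp_pipelcg_lmin 0.0 -ksp_pipelcg_lmax 2.0
```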

The left panel of Figure 2 shows the speedup of different CG variants relative to the execution time of classic CG on the smallest number of nodes. The dotted line shows the theoretical optimal speedup based on that classic CG timing. As expected, the classic CG method stops scaling from a certain (problem size dependent) number of nodes onward, i.e. when the global reduction becomes the dominant time-consuming kernel in the algorithm. The overlap of communication and computations in the pipelined methods successfully "hides" the global synchronization latency behind the computational kernels, leading to improved scalability.

The right panel of Figure 2 presents a detailed timing breakdown of the individual kernels in the algorithms. The bars represent the maximum timing for a particular kernel over all iterations and over all MPI processes. It is apparent from the red bars that CG features two global reduction phases per iteration whereas the pipelined methods perform only one, see Table 1 (this feature is sometimes referred to as “communication avoiding” in the literature (Chronopoulos and Gear, 1989)). Total times (yellow bars) correspond to the dashed vertical lines in the left panel of Figure 2. The total time spent clearly decreases by “pipelining” global reductions (but note that the overlap is not explicitly shown in the right panel of Figure 2).

Somewhat surprisingly, in most cases the use of longer pipelines (l > 1) seems to improve scalability compared to the l = 1 case. This is unexpected, as Figure 2 (right) indicates that for these model problems the time spent in global communication is of the same order of magnitude as the time required to compute an spmv (+ prec). Hence, the theoretical model in Table 1 suggests that a pipeline length of l = 1 would suffice for a perfect overlap, hiding the entire global reduction phase behind computational work, and that longer pipelines should not yield any additional performance benefit in this case. Figure 4 (left) presents this scenario schematically.
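The expectation that l = 1 suffices when the reduction takes about as long as an spmv can be made concrete with a toy per-iteration cost model (our own illustrative simplification with hypothetical timings, not the actual model of Table 1):

```python
def iter_time(t_spmv, t_axpy, t_glred, l=None):
    """Toy per-iteration cost (an illustrative simplification, not Table 1).
    l is None : classic CG, two blocking reductions plus compute.
    l >= 1    : p(l)-CG, one reduction overlapped with l iterations of compute."""
    compute = t_spmv + t_axpy
    if l is None:
        return 2 * t_glred + compute
    # an in-flight reduction has l iterations of compute to hide behind
    return max(compute, t_glred / l)

# balanced regime: glred takes about as long as an spmv (hypothetical units)
t = dict(t_spmv=1.0, t_axpy=0.1, t_glred=1.0)
print(iter_time(**t))        # -> 3.1 : classic CG, reductions fully exposed
print(iter_time(**t, l=1))   # -> 1.1 : p(1)-CG, reduction fully hidden
print(iter_time(**t, l=2))   # -> 1.1 : p(2)-CG, no further gain predicted
```

Under this simplified model a single pipeline stage already hides the entire reduction, which is exactly why the observed extra speedup for l > 1 calls for the two-fold explanation given below.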

The explanation for the observation that speedup can be achieved for longer pipelines in this particular case is two-fold. First, as indicated earlier in this work, a slight benefit may be noticeable from overlapping the global communication with axpys. For l = 1, only the local axpys on lines 26-33 in Alg. 1 are overlapped with the glred phase. Longer pipelines, however, additionally overlap the axpy operations on lines 19-21 of subsequent iterations with the global reduction phase. This overlap with a (slightly) larger volume of computations may induce a minor speedup when the time spent in the axpy operations is not completely negligible (which is often the case in practice due to read-writes to local memory), although the effective performance gain achieved by this overlap is generally assumed to be quite small.

Second, the overlap of one global reduction with (other) global communication phases may also lead to improved performance, even in the absence of overlap with a significant amount of computational work from the spmv. This “global communication staggering” effect is assumed to be most apparent in extremely communication-bound regimes, in which a global reduction takes significantly longer than the computation of spmvs and/or axpys. To validate this intuitively sensible premise, Figure 3 presents a simple small-scale experiment for solving a 4 million unknowns system with a finite difference 2D Laplacian five-point stencil (left, PETSc KSP ex2) and a trivial diagonal system (i.e. a “one-point stencil”) with the eigenvalues of the 2D Laplacian on the main diagonal (right). Both problems are solved on 128 nodes using classic CG and p(l)-CG, and are equally challenging from a spectral perspective. For the Laplacian (left panel) the benefit of using p(l)-CG over CG is reflected in the total timings (yellow bars). However, the performance gained from using longer pipelines (l > 1) is quite limited, since the timing breakdown indicates that glred takes roughly as long as an spmv. This scenario is represented schematically in Fig. 4 (left), from which it is clear that pipeline lengths l > 1 are not useful here, since there is no more global communication left to overlap. A different picture is presented by the diagonal matrix (right panel), which can be interpreted as a toy model for extremely communication-bound scenarios. In this case a significant speedup is observed when comparing p(l)-CG with l ≥ 2 to p(1)-CG; even longer pipelines do not improve performance further, however. An explanation can again be found in Fig. 4 (right panel). A small gain in performance can be achieved by using p(1)-CG instead of CG, which is due to the avoidance of a global communication phase and the overlap of global communication with the spmv.
However, contrary to the left-hand panel, a significant performance gain of p(l)-CG with l ≥ 2 over p(1)-CG is observed, since deeper pipelines allow a new global reduction to be started before the global reductions from previous iterations have finished. This overlap or “staggering” of the global communication phases for pipeline lengths l ≥ 2 also provides more robustness with respect to glred run-time variance. Together with the aforementioned overlap with axpy operations for l ≥ 2, this “communication staggering” explains the speedup that was observed in the strong scaling experiments in Fig. 2 when using p(l)-CG with pipeline lengths l ≥ 2.
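The staggering argument can be illustrated with a toy per-iteration cost model (again our own simplification with hypothetical timings, not taken from the paper): when glred dominates, keeping l reductions in flight divides the exposed reduction time by roughly l.

```python
def iter_time(t_spmv, t_axpy, t_glred, l=None):
    """Toy per-iteration cost (an illustrative simplification, not Table 1).
    l is None : classic CG, two blocking reductions plus compute.
    l >= 1    : p(l)-CG, l staggered reductions in flight at a time."""
    compute = t_spmv + t_axpy
    if l is None:
        return 2 * t_glred + compute
    # each reduction has l iterations of compute to hide behind
    return max(compute, t_glred / l)

# communication-bound regime, cf. the diagonal "one-point stencil" toy problem
t = dict(t_spmv=0.2, t_axpy=0.1, t_glred=2.0)
print(iter_time(**t))        # -> 4.3 : classic CG
print(iter_time(**t, l=1))   # -> 2.0 : p(1)-CG, reduction still exposed
print(iter_time(**t, l=4))   # -> 0.5 : p(4)-CG, staggered reductions
```

In practice the gains saturate much earlier than this idealized model suggests (the experiments above see little benefit beyond l = 2), since the model neglects the extra axpy work, rounding error effects, and the start-up overhead of additional outstanding reductions.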

5. Conclusions

The primary aim of this work is to present pipelined Krylov subspace methods as a valuable application of MPI-based programming that is of great practical interest for improving the scalability of linear solver codes. The performance gains and strong scaling benefits of the pipelining approach are demonstrated for the deep pipelined CG method on various large scale benchmark problems in this work. In the case of optimal overlap of communication and computation, a substantial speedup of p(l)-CG over classic CG can be observed. Similar results can be obtained using MPI-based implementations of other communication-avoiding and -hiding methods, many of which are available in the open source PETSc framework. We believe that pipelined Krylov subspace methods provide an excellent use case for the asynchronous non-blocking collectives introduced in the MPI-3 standard. We encourage both MPI developers and application users to share their insights to further improve MPI functionality for these and other forward-looking applications. Moreover, we encourage MPI developers to address the overhead of thread safety, which appears to be quite significant in our experiments (but is not reported in the current manuscript).

Acknowledgements.
The authors are grateful for the funding that supported this work. Specifically, J. Cornelis acknowledges funding by the University of Antwerp Research Council under the University Research Fund (BOF). S. Cools receives funding from the Flemish Research Foundation (FWO Flanders) under grant 12H4617N. This work was also supported in part by the U.S. Department of Energy, Office of Science, Office of Advanced Scientific Computing Research, “Scientific Discovery through Advanced Computing” (SciDAC) program through the FASTMath Institute under Contract No. DE-AC02-05CH11231 at Lawrence Berkeley National Laboratory.

References

  • Balay et al. (2015) S. Balay, S. Abhyankar, M.F. Adams, J. Brown, P. Brune, K. Buschelman, L. Dalcin, V. Eijkhout, W.D. Gropp, D. Kaushik, M.G. Knepley, L. Curfman McInnes, K. Rupp, B.F. Smith, S. Zampini, and H. Zhang. 2015. PETSc Web page. www.mcs.anl.gov/petsc. http://www.mcs.anl.gov/petsc
  • Carson and Demmel (2014) E. Carson and J. Demmel. 2014. A residual replacement strategy for improving the maximum attainable accuracy of s-step Krylov subspace methods. SIAM J. Matrix Anal. Appl. 35, 1 (2014), 22–43.
  • Carson et al. (2013) E. Carson, N. Knight, and J. Demmel. 2013. Avoiding communication in nonsymmetric Lanczos-based Krylov subspace methods. SIAM Journal on Scientific Computing 35, 5 (2013), S42–S61.
  • Carson et al. (2018) E. Carson, M. Rozlozník, Z. Strakoš, P. Tichỳ, and M. Tůma. 2018. The Numerical Stability Analysis of Pipelined Conjugate Gradient Methods: Historical Context and Methodology. SIAM Journal on Scientific Computing 40, 5 (2018), A3549–A3580.
  • Chronopoulos and Gear (1989) A.T. Chronopoulos and C.W. Gear. 1989. s-Step iterative methods for symmetric linear systems. J. Comput. Appl. Math. 25, 2 (1989), 153–168.
  • Chronopoulos and Kucherov (2010) A.T. Chronopoulos and A.B. Kucherov. 2010. Block s-step Krylov iterative methods. Numerical Linear Algebra with Applications 17, 1 (2010), 3–15.
  • Cools (2018) S. Cools. 2018. Analyzing and improving maximal attainable accuracy in the communication hiding pipelined BiCGStab method. Submitted, Parallel Computing, PMAA’18 Special Issue, preprint: arXiv:1809.01948 (2018).
  • Cools et al. (2019) S. Cools, J. Cornelis, and W. Vanroose. 2019. Numerically Stable Recurrence Relations for the Communication Hiding Pipelined Conjugate Gradient Method. Submitted, IEEE Transactions on Parallel and Distributed Systems, preprint: arXiv:1902.03100 (2019).
  • Cools and Vanroose (2017) S. Cools and W. Vanroose. 2017. The communication-hiding pipelined BiCGStab method for the parallel solution of large unsymmetric linear systems. Parallel Comput. 65 (2017), 1–20.
  • Cools et al. (2018) S. Cools, E.F. Yetkin, E. Agullo, L. Giraud, and W. Vanroose. 2018. Analyzing the Effect of Local Rounding Error Propagation on the Maximal Attainable Accuracy of the Pipelined Conjugate Gradient Method. SIAM J. Matrix Anal. Appl. 39, 1 (2018), 426–450.
  • Cornelis et al. (2017) J. Cornelis, S. Cools, and W. Vanroose. 2017. The communication-hiding Conjugate Gradient method with deep pipelines. Submitted, SIAM Journal on Scientific Computing, preprint: arXiv:1801.04728 (2017).
  • D’Azevedo et al. (1992) E.F. D’Azevedo, V. Eijkhout, and C.H. Romine. 1992. Reducing communication costs in the Conjugate Gradient algorithm on distributed memory multiprocessors. Technical Report. Technical report, Oak Ridge National Lab, TM/12192, TN, US.
  • Demmel et al. (1993) J.W. Demmel, M.T. Heath, and H.A. Van der Vorst. 1993. Parallel Numerical Linear Algebra. Acta Numerica 2 (1993), 111–197.
  • Dongarra et al. (2011) J. Dongarra, P. Beckman, T. Moore, P. Aerts, G. Aloisio, J. Andre, D. Barkai, J. Berthou, T. Boku, B. Braunschweig, et al. 2011. The international exascale software project roadmap. International Journal of High Performance Computing Applications 25, 1 (2011), 3–60.
  • Dongarra et al. (2015) J. Dongarra, M.A. Heroux, and P. Luszczek. 2015. HPCG Benchmark: a New Metric for Ranking High Performance Computing Systems. University of Tennessee, Electrical Engineering and Computer Science Department, Technical Report UT-EECS-15-736 (2015).
  • Eller and Gropp (2016) P.R. Eller and W. Gropp. 2016. Scalable Non-blocking Preconditioned Conjugate Gradient Methods. In SC16: International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE, 204–215.
  • Erhel (1995) J. Erhel. 1995. A parallel GMRES version for general sparse matrices. Electronic Transactions on Numerical Analysis 3, 12 (1995), 160–176.
  • Ghysels et al. (2013) P. Ghysels, T.J. Ashby, K. Meerbergen, and W. Vanroose. 2013. Hiding global communication latency in the GMRES algorithm on massively parallel machines. SIAM Journal on Scientific Computing 35, 1 (2013), C48–C71.
  • Ghysels and Vanroose (2014) P. Ghysels and W. Vanroose. 2014. Hiding global synchronization latency in the preconditioned Conjugate Gradient algorithm. Parallel Comput. 40, 7 (2014), 224–238.
  • Greenbaum (1997) A. Greenbaum. 1997. Iterative methods for solving linear systems. SIAM.
  • Grigori et al. (2016) L. Grigori, S. Moufawad, and F. Nataf. 2016. Enlarged Krylov Subspace Conjugate Gradient Methods for Reducing Communication. SIAM J. Matrix Anal. Appl. 37, 2 (2016), 744–773.
  • Hestenes and Stiefel (1952) M.R. Hestenes and E. Stiefel. 1952. Methods of Conjugate Gradients for solving linear systems. J. Res. Nat. Bur. Standards 49, 6 (1952), 409–436.
  • Hoemmen (2010) M. Hoemmen. 2010. Communication-avoiding Krylov subspace methods, PhD dissertation. EECS Department, University of California, Berkeley.
  • Imberti and Erhel (2017) D. Imberti and J. Erhel. 2017. Varying the s in Your s-step GMRES. Electronic Transactions on Numerical Analysis 47 (2017), 206–230.
  • Liesen and Strakoš (2012) J. Liesen and Z. Strakoš. 2012. Krylov Subspace Methods: Principles and Analysis. Oxford University Press.
  • McInnes et al. (2014) L.C. McInnes, B. Smith, H. Zhang, and R.T. Mills. 2014. Hierarchical Krylov and nested Krylov methods for extreme-scale computing. Parallel Comput. 40, 1 (2014), 17–31.
  • Meurant (1987) G. Meurant. 1987. Multitasking the Conjugate Gradient method on the CRAY X-MP/48. Parallel Comput. 5, 3 (1987), 267–280.
  • Saad (2003) Y. Saad. 2003. Iterative methods for sparse linear systems. SIAM.
  • Sanan et al. (2016) P. Sanan, S.M. Schnepp, and D.A. May. 2016. Pipelined, Flexible Krylov Subspace Methods. SIAM Journal on Scientific Computing 38, 5 (2016), C441–C470.
  • Shalf (2007) J. Shalf. 2007. The new landscape of parallel computer architecture. In Journal of Physics: Conference Series, Vol. 78. IOP Publishing, 012066.
  • Strakoš (1987) Z. Strakoš. 1987. Effectivity and optimizing of algorithms and programs on the host-computer/array-processor system. Parallel Comput. 4, 2 (1987), 189–207.
  • Van der Vorst (2003) H.A. Van der Vorst. 2003. Iterative Krylov methods for large linear systems. Vol. 13. Cambridge University Press.
  • Yamazaki et al. (2017) I. Yamazaki, M. Hoemmen, P. Luszczek, and J. Dongarra. 2017. Improving performance of GMRES by reducing communication and pipelining global collectives. In Parallel and Distributed Processing Symposium Workshops (IPDPSW), 2017 IEEE International. IEEE, 1118–1127.
  • Zhuang and Casas (2017) S. Zhuang and M. Casas. 2017. Iteration-Fusing Conjugate Gradient. In Proceedings of the International Conference on Supercomputing. ACM, 21–30.