The family of iterative solvers known as Krylov subspace methods (KSMs) [28, 34, 42] are among the most efficient present-day methods for solving large scale sparse systems of linear equations. The mother of all Krylov subspace methods is undoubtedly the Conjugate Gradient method (CG) that was derived in 1952  to the aim of solving linear systems with a symmetric positive definite and preferably sparse matrix . The CG method is one of the most widely used methods for solving said systems today, which form the basis of a plethora of scientific and industrial applications. However, driven by the essential transition from optimal single node performance towards massively parallel computer hardware over the last decades , the bottleneck for fast execution of Krylov subspace methods has shifted. Whereas in the past the application of the sparse matrix-vector product (spmv) was considered the most time-consuming part of the algorithm, the global synchronizations required in dot product and norm computations form the main bottleneck for efficient execution on present-day distributed memory hardware .
Driven by the increasing levels of parallelism in present-day HPC machines, as attested by the current strive towards exascale high-performance computing software , research on the elimination of the global communication bottleneck has recently regained significant attention from the international computer science, engineering and numerical mathematics communities. Sprouting largely from pioneering work on reducing communication in Krylov subspace methods from the late 1980’s and 90’s [39, 30, 5, 11, 12, 16], a number of variants of the classic Krylov subspace algorithms have been introduced over the last years. We point out recent work by Chronopoulos et al. , Hoemmen , Carson et al. , McInnes et al. , Grigori et al. , Eller et al. , Imberti et al.  and Zhuang et al. .
The contents of the current work are situated in the research branch on so-called “pipelined” Krylov subspace methods111Note: In the context of communication reduction in Krylov subspace methods, the terminology “pipelined” KSMs that is used throughout the related applied linear algebra and computer science literature refers to software pipelining, i.e. algorithmic reformulations to the KSM procedure in order to reduce communication overhead, and should not be confused with hardware-level instruction pipelining (ILP). [17, 18, 8]. Alternatively called “communication hiding” methods, these algorithmic variations to classic Krylov subspace methods are designed to overlap the time-consuming global communications in each iteration of the algorithm with computational tasks such as calculating spmvs or axpys (vector operations of the form ). Thanks to the reduction/elimination of the synchronization bottleneck, pipelined algorithms have been shown to increase parallel efficiency by allowing the algorithm to continue scaling on large numbers of processors [35, 45]. However, the algorithmic reformulations that allows for this efficiency increase come at the cost of reduced numerical stability [18, 4], which presently is one of the main drawbacks of pipelined (as well as other communication reducing) methods. Research on analyzing and improving the numerical stability of pipelined Krylov subspace methods, which is essential both for a proper understanding and the practical usability of the methods, has recently been performed by the authors [9, 7] and others [2, 4].
This work presents a numerically stable variant of the -length pipelined Conjugate Gradient method, p()-CG for short. The p()-CG method was presented in  and allows to overlap each global reduction phase with the computational work of subsequent iterations. The pipeline length is a parameter of the method that can be chosen depending on the problem and hardware setup (as a function of the communication-to-computation ratio). As is the case for all communication reducing Krylov subspace methods, the preconditioner choice influences the communication-to-computation ratio and thus affects the performance of the method. The pipeline length hence also depends on the effort invested in the preconditioner. A preconditioner that uses limited global communication (block Jacobi, no-overlap DDM, …) is generally preferred in this setting.
The propagation of local rounding errors in the multi-term recurrence relations of the p()-CG algorithm is the primary source of loss of attainable accuracy on the final solution . By introducing intermediate auxiliary basis variables, we derive a new p()-CG algorithm with modified recurrence relations for which no rounding error propagation occurs. It is proven analytically that the resulting recurrence relations are numerically stable for any pipeline length . The new algorithm is guaranteed to reach the same accuracy as the classic CG method. This work thus resolves one of the major restrictions for the practical use of pipelined Krylov subspace methods. The redesigned algorithm comes at no additional computational cost and has only a minor storage overhead compared to the former p()-CG algorithm, thus effectively replacing the earlier implementation of the method. In addition, it is shown that formulating a preconditioned version of the new algorithm is straightforward.
We conclude this introduction by providing a short overview of the further contents of this manuscript. Section 2 presents a high-level summary of the basic principles behind the -length pipelined CG method and formulates the key numerical properties of the method that motivate this work. It familiarizes the reader with the notations and concepts used throughout this paper. In Section 3 the numerical stability analysis of the p()-CG recurrence relations is briefly recapped, as it forms the basis for the analysis of the stable algorithm in Section 4.4. Section 4 contains the main contributions of this work, presenting the technical details of the stable p()-CG algorithm alongside an overview of its main implementation properties and a numerical analysis of the new rounding error resilient recurrence relations. Numerical experiments that demonstrate both the parallel performance of the p()-CG method and the excellent attainable accuracy in comparison to earlier variants of pipelined Krylov subspace methods are presented in Section 5. The manuscript is concluded in Section 6.
2 Deep pipelined Conjugate Gradients
The deep pipelined Conjugate Gradient method, denoted p()-CG for short, was first presented in , where it was derived in analogy to the p()-GMRES method . The parameter represents the pipeline length which indicates the number of iterations that are overlapped by each global reduction phase. We summarize the current state-of-the-art deep pipelined p()-CG method below, which forms the starting point for the discussion in this work.
2.1 Basis recurrence relations in exact arithmetic
Let be the orthonormal basis for the Krylov subspace in iteration of the p()-CG algorithm, consisting of vectors. Here is a symmetric positive definite matrix. The Krylov subspace basis vectors satisfy the Lanczos relation
Let , then the Lanczos relation (1) translates in vector notation to
The auxiliary basis runs vectors ahead of the basis and is defined as
where the matrix polynomial is given by
with optional stabilizing shifts , see [17, 10, 26]. We refer to Section 2.2 for a discussion on the Krylov subspace basis , i.e. the choice of the polynomial . Contrary to the basis , the auxiliary basis is in general not orthonormal. It is constructed using the recursive definitions
where the matrix contains the matrix , which is shifted places along the main diagonal. The bases and are connected through the basis transformation for , where is a banded upper triangular matrix with a band width of non-zero diagonals . For a symmetric matrix the matrix is symmetric around its -th upper diagonal, since
The following recurrence relation for is derived from the basis transformation (with ):
A total of iterations after the dot-products
have been initiated, the elements with , which were computed as , are corrected as follows:
for , and:
Additionally, in the -th iteration the tridiagonal matrix , see (1), can be updated recursively by adding one column. The diagonal element is characterized by the expressions:
The term is considered zero when . The update for the off-diagonal element is given by
The element has already been computed in the previous iteration and can thus simply be copied due to the symmetry of .
Once the basis has been constructed, the solution can be updated based on a search direction , following the classic derivation of D-Lanczos in , Sec. 6.7.1. The Ritz-Galerkin condition
implies . The LU-factorization of the tridiagonal matrix is given by
Note that . It follows from (16) that the elements of the lower/upper triangular matrices and are given by (with )
Expression (2.1) indicates that the approximate solution equals
where and . Note that . The columns (for ) of can be computed recursively. From it follows
Denoting the vector by , it follows from that and for . Using the search direction and the scalar , the approximate solution is updated using the recurrence relation:
The above expressions are combined in Alg. 1. Once the initial pipeline for has been filled, the relations (6)-(9) are used to recursively compute the basis vectors and in iterations (see lines 15-16). The scalar results of the global reduction phase (line 18) are required iterations later (line 5-6). In every iteration global communication is thus overlapped with the computational work of subsequent iterations, forming the heart of the communication hiding p()-CG algorithm.
Dot products in p()-CG. Although Alg. 1, line 18 indicates that in each iteration a total of dot products need to be computed, the number of dot product computations can be limited to by exploiting the symmetry of the matrix , see expression (2.1). Since for , only the dot products for and the -th upper diagonal element need to be computed in iteration .
2.2 On the conditioning of the auxiliary basis
As is an orthonormal basis, the transformation can be interpreted as a QR factorization of the auxiliary basis . Moreover, it holds that is the Cholesky factorization of , since
The elements of the transformation matrix are computed on lines 5-6 of Alg. 1 precisely by means of this Cholesky factorization. This observation leads to the following two essential insights related to the conditioning of the basis .
Square root breakdowns in p()-CG. The auxiliary basis vectors are defined as , but the basis is in general not orthogonal. Hence, vectors are not necessarily linearly independent. In particular for longer , different vectors are expected to become more and more aligned. This leads to being ill-conditioned, closing to singularity as augments. The effect is the most pronounced when , in which case . Shifts can be set to improve the conditioning of , see also Remark 4.
When for certain the matrix becomes (numerically) singular, the Cholesky factorization procedure in p()-CG will fail. This may manifest in the form of a square root breakdown on line 7 in Alg. 1 when the root argument becomes negative. Numerical round-off errors may increase the occurrence of these breakdowns in practice. When a breakdown occurs in p()-CG the iteration is restarted, in analogy to the GMRES algorithm, although it should be noted that the nature of the breakdown in both algorithms is quite different. Evidently, the restarting strategy may delay convergence compared to standard CG, where no square root breakdowns occur.
Choice of the auxiliary basis and relation to the shifts. It follows from the Cholesky factorization (21) that the inverse of the transformation matrix is
The conditioning of the matrix is thus determined by the conditioning of . Furthermore, it holds that
where is a part of the basis obtained by dropping the first columns. Hence, the polynomial has a major impact on the conditioning of the matrix , which in turn plays a crucial role in the propagation of local rounding errors in the p()-CG algorithm , see Section 3.1 of the current work. This observation indicates intuitively why should preferably be as small as possible, which can be achieved by choosing appropriate values for the shifts . Optimal shift values in the sense of minimizing the Euclidean -norm of are the Chebyshev shifts [26, 17, 10] (for ):
2.3 Loss of orthogonality and attainable accuracy
This work is motivated by the observation that two main issues affect the convergence of pipelined CG methods: loss of basis vector orthogonality and inexact Lanczos relations. We comment briefly on both issues from a high-level point of view and clearly mark the scope of this work. Important insights about the similarities and differences between classic CG and p()-CG are highlighted before going into more details on the numerics in Sections 3 and 4.
2.3.1 Loss of orthogonality
It is well-known that in finite precision arithmetic the orthogonality of the Krylov subspace basis , i.e.
(identity matrix) may not hold exactly. Inexact orthogonality may appear in every variant of the CG algorithm, see, in particular in the D-Lanczos222Note: The D-Lanczos (short for “direct Lanczos”) algorithm is a variant of the CG method that is equivalent to the latter in exact arithmetic, save for the solution of the system which is computed by using Gaussian elimination in D-Lanczos. The D-Lanczos method is the basic Krylov subspace method from which the p()-CG method was derived, see , Section 2, for details. algorithm , where a new basis vector is constructed by orthogonalizing with respect to the previous two basis vectors, as well as in the related p()-CG method, Alg. 1. Loss of orthogonality typically leads to delay of convergence, meaning the residual deviates from the one in the scenario in which orthogonality would not be lost. We use a notation with bars to designate variables that are actually computed in a finite precision implementation of the algorithm. The key relation for the Conjugate Gradient method is the Ritz-Galerkin condition:
This equality only holds under the assumption that which requires . Note that in finite precision arithmetic the convergence delay can be observed in both the actual residual norm as well as the recursively computed residual norm , since both quantities are based on the (possibly non-orthogonal) basis , see Fig. 0(a)-1(a) (discussed further in Section 2.3.3).
2.3.2 The inexact Lanczos relation
The basis vectors in the pipelined CG algorithm are not computed explicitly using the Lanczos relation (1). Rather, they are computed recursively, see (9), to avoid the computation of additional spmvs. In finite precision, local rounding errors in the recurrence relation may contaminate the basis , such that the Lanczos relation is no longer valid. Moreover, due to propagation of local rounding errors,
may grow dramatically as the iteration proceeds. Using the classic model for floating point arithmetic with machine precision[32, 21, 20, 44], the round-off error on basic computations on the matrix , vectors , and a scalar are bounded by
Here indicates the finite precision floating point representation, is the maximum number of nonzeros in any row of , and the norm represents the Euclidean 2-norm. Under this model the recurrence relations for and in a finite precision implementation of p()-CG are
where and are local rounding errors which are bounded by and , and . The actual residual satisfies the following relations in a finite precision setting:
where . The recursively computed residual that appears in expression (2.3.2) tends to zero. The actual residual norm , on the other hand, stagnates around , a quantity referred to as the maximal attainable accuracy of the method. The difference between the norm of the actual residual and the recursively computed residual is illustrated in Fig. 0(a). A detailed analysis of the deviation from the Lanczos relation in finite precision (“inexact Lanczos”) can be found in Section 3.
2.3.3 Scope and limitations of this manuscript
The issue of inexact Lanczos relations in p()-CG is timely and deserving of attention. Loss of accuracy resulting from the inexact Lanczos relation has long been a limiting factor in applying p()-CG and related algorithms in practice. Fig. 0(a) illustrates how the norms of the actual residuals stagnate while the recursively computed residual norms continue to decrease. For p()-CG local rounding errors in the recurrence relations are propagated leading to reduced attainable accuracy compared to D-Lanczos. Moreover, while loss of orthogonality also warrants further investigation, this issue is not exclusive to pipelined methods. Delayed convergence is observed in classic Krylov subspace methods also, see Fig. 1, whereas loss of attainable accuracy is not. The issue could be addressed by e.g. re-orthogonalizing the basis, see . However, communication-reducing methods are not particularly suitable to include re-orthogonalization, since this introduces additional global reduction phases.
Although loss of orthogonality does not originate from applying the pipelining technique, it may be more pronounced for pipelined methods compared to their classic counterparts, see Fig. 0(b). However, Fig. 0(a) indicates that the effect of the inexact Lanczos relation on convergence is much more dramatic than the effect of inexact orthogonality for all pipeline lengths . This manuscript thus focuses on improving the numerical stability of the p()-CG method by neutralizing the propagation of local rounding errors in the recursively computed basis vector updates. As such, this work proposes a key step towards a numerically stable communication hiding variant of the CG method.
3 Analyzing rounding error propagation
This section recaps the analysis of local rounding errors that stem from the recurrence relations in the pipelined p()-CG method, Alg. 1. It aims to precisely explain the source of the loss of accuracy observed for the p()-CG method. The methodology for the analysis is similar to the one used in classic works by Paige [32, 33], Greenbaum [19, 21, 20], Gutknecht , Strakos , Meurant , Sleijpen [37, 38], Van der Vorst , Higham , and others.
Finite precision variants of the exact scalar and vector variables introduced in Section 2 are denoted by a bar symbol in this section. We differentiate between “actual” vector variables, which satisfy the Lanczos relations exactly but are not computed in the algorithm, and “recursively computed” variables, which contain machine-precision sized round-off errors related to finite precision computations.
3.1 Local rounding error behavior in finite precision
For any the true basis vector, denoted by , satisfies the Lanczos relation (3) exactly, that is, for :
without the addition of a local rounding error. This vector is not actually computed in the p()-CG algorithm. Instead, the computed basis vector is calculated from the finite precision variant of relation (9) for , i.e.:
where the size of the local rounding errors is bounded in terms of machine precision: . Let and . Relation (29) alternatively translates to the following formulation in matrix notation (with ):
where is a -by- rectangular matrix holding the entries directly below the main diagonal. The matrix collects the gaps between the actual and recursively computed basis vectors, which quantify the deviation from the Lanczos relation in the finite precision setting. These gaps are crucial in describing the propagation of local rounding errors throughout the p()-CG algorithm and are directly linked to the gap between the actual and recursively computed residuals, see expression (2.3.2). From (30) one obtains
where collects the local rounding errors. The computed auxiliary vector satisfies a finite precision version of the recurrence relation (6), which summarizes to
where is the local rounding error which can again be bounded in terms of machine precision . Expression (33) can be formulated in matrix notation as:
Furthermore, the following matrix relations hold between the scalar coefficients and in Alg. 1: