1 Introduction
Krylov subspace methods [25, 34, 36, 39, 44] have been used for decades as efficient iterative solution methods for linear systems. This paper considers the problem of solving algebraic linear systems of the form $Ax = b$, where $A$ is a real or complex nonsingular square matrix of dimension $n$ and the right-hand side vector $b$ correspondingly has length $n$. Given an initial guess $x_0$ for the solution and an initial residual $r_0 = b - Ax_0$, Krylov subspace methods construct a series of approximate solutions $x_i$ that lie in the $i$th Krylov subspace
$$\mathcal{K}_i(A, r_0) := \operatorname{span}\{r_0, A r_0, A^2 r_0, \ldots, A^{i-1} r_0\},$$
using some orthogonality constraint that differentiates the various Krylov subspace methods. For problems with symmetric (or Hermitian) positive definite (SPD) matrices – which are the main focus of this work – one of the most basic yet widely used Krylov subspace methods is the (preconditioned) Conjugate Gradient (CG) method, which dates back to the original 1952 paper [30], and in which the orthogonality constraint boils down to minimizing the energy norm of the error over the Krylov subspace, i.e.,
$$x_i = \underset{x \in x_0 + \mathcal{K}_i(A, r_0)}{\operatorname{argmin}} \|x - A^{-1}b\|_A,$$
where the energy norm is defined as $\|y\|_A := \sqrt{y^T A y}$ for the SPD matrix $A$.
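For reference, the CG iteration sketched below realizes this minimization property; this is a minimal numpy sketch in which the function name, test matrix and tolerances are our own illustrative choices, not taken from the text:

```python
import numpy as np

def cg(A, b, x0, tol=1e-12, maxit=1000):
    """Minimal (unpreconditioned) Conjugate Gradient sketch.

    At step i the iterate minimizes the A-norm of the error over
    x0 + K_i(A, r0), cf. the minimization property in the text."""
    x = x0.copy()
    r = b - A @ x            # initial residual r0
    p = r.copy()             # initial search direction
    rr = r @ r
    for _ in range(maxit):
        Ap = A @ p
        alpha = rr / (p @ Ap)            # step length
        x = x + alpha * p
        r = r - alpha * Ap               # recursive residual update
        rr_new = r @ r
        if np.sqrt(rr_new) <= tol * np.linalg.norm(b):
            break
        p = r + (rr_new / rr) * p        # new search direction
        rr = rr_new
    return x

# Small SPD test problem: 1D Laplacian (tridiagonal, SPD)
n = 50
A = 2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
b = np.ones(n)
x = cg(A, b, np.zeros(n))
```

Note that only one spmv ($Ap$) and two dot products are required per iteration; it is the global reductions behind those dot products that become the bottleneck discussed next.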
A clear trend in current (petascale) and future (exascale) high performance computing hardware consists of upscaling the number of parallel compute nodes as well as the number of processors per node [13]. Compute nodes are generally interconnected using fast InfiniBand connections. However, the increasing gap between computational performance and memory/interconnect latency implies that on HPC hardware data movement is much more expensive than floating point operations (flops), both with respect to execution time as well as energy consumption [13, 19]. As such, reducing time spent in moving data and/or waiting for data will be essential for exascale applications.
Although Krylov subspace algorithms are traditionally highly efficient with respect to the number of flops required to obtain a sufficiently precise approximate solution $x_i$, they are not optimized towards minimizing communication. The HPCG benchmark [14, 15], for example, shows that Krylov subspace methods attain only a small fraction of the machine peak performance on large scale hardware. Since the matrix $A$ is often sparse and generally requires only limited communication between neighboring processors, the primary bottleneck for parallel execution is typically not the sparse matrix-vector (spmv) product. Instead, parallel efficiency stalls due to communication latency caused by global synchronization phases when computing the dot products required for the orthogonalization procedures in the Krylov subspace algorithm.
A variety of interesting research branches on reducing or eliminating the synchronization bottleneck in Krylov subspace methods has emerged over the last decades. Based on the earliest ideas of communication reduction in Krylov subspace methods [41, 5, 10, 12, 17, 11, 7], a number of methods that aim to eliminate global synchronization points has recently been (re)introduced. These include hierarchical Krylov subspace methods [35], enlarged Krylov subspace methods [28], an iteration-fusing Conjugate Gradient method [47], $s$-step Krylov subspace methods (also called “communication avoiding” Krylov subspace methods) [6, 3, 2, 33], and pipelined Krylov subspace methods (also referred to as “communication hiding” Krylov subspace methods) [21, 22, 40, 16, 46]. The latter aim not only to reduce the number of global synchronization points in the algorithm, but also to overlap global communication with useful (mostly local) computations such as spmvs ($v \mapsto Av$) and axpys ($y \leftarrow \alpha x + y$). In this way idle core time is reduced by simultaneously performing the synchronization phase and the independent compute-bound calculations. When the time required by one global communication phase approximately equals the computation time for one spmv, the pipelined CG method from [22], denoted as pCG in this work, allows for a good overlap and thus features a significantly reduced total time to solution compared to the classic CG method. In heavily communication-bound scenarios where a global reduction takes significantly longer than one spmv, deeper pipelines allow the global reduction phase to be overlapped with the computational work of multiple spmvs. The concept of deep pipelines was introduced by Ghysels et al. [21] for the Generalized Minimal Residual (GMRES) method, and was recently extended to the CG method [9].
The advantages of using pipelined (and other) communication-reducing Krylov subspace methods from a performance point of view have been illustrated in many of the aforementioned works. However, reorganizing the traditional Krylov subspace algorithm into a communication-reducing variant typically introduces unwanted issues with the numerical stability of the algorithm. In exact arithmetic pipelined Krylov subspace methods produce a series of iterates identical to those of the classic CG method. In finite precision arithmetic, however, their behavior can differ significantly, as local rounding errors may induce a decrease in attainable accuracy and a delay of convergence. The impact of finite precision round-off errors on the numerical stability of classic CG has been studied extensively in a number of papers, among which [23, 26, 24, 29, 42, 43, 37, 20].
In communication-reducing CG variants the effects of local rounding errors are significantly amplified. We refer to our previous work [8] and the paper [4] by Carson et al. for a (historic) overview of the numerical stability analysis of pipelined Conjugate Gradient methods. In the present study we focus on analyzing the recently introduced deep pipelined Conjugate Gradient method [9], denoted by p()CG. (Note for a proper understanding that the p()CG method in [9] was not derived from the pCG method [22], but is rather based on similar principles as the p()GMRES method [21].) Although both pCG and p()CG are pipelined variants of the CG algorithm suffering from local rounding error propagation, the exact underlying mechanism by which error propagation occurs differs between these methods, as analyzed in Sections 2–3 of this work. In [9] it was observed that the numerical accuracy attainable by the p()CG method may be reduced drastically for larger pipeline lengths. Similar observations have been made for other classes of communication-reducing methods; see e.g. the studies [5, 2] for the influence of the parameter $s$ on the numerical stability of $s$-step (communication avoiding) Krylov subspace methods. The performance of pipelined Krylov subspace methods with respect to HPC system noise has recently been analyzed in a stochastic framework, see [38].
The sensitivity of the p()CG Krylov subspace method to local rounding errors may pose an obstacle to the practical usability and reliability of this pipelined method. It is therefore vital to thoroughly understand and – if possible – counteract the negative effects of local rounding error propagation on the numerical stability of the method. The aim of this study is twofold:
(i) to establish a theoretical framework for the behavior of local rounding errors that explains the observed loss of attainable accuracy in pipelined CG methods;

(ii) to exemplify possible – preferably easy-to-implement and cost-efficient – countermeasures which can be used to improve the attainable accuracy of these methods when required.
This paper analyzes the behavior of local rounding errors that stem from the multi-term recurrence relations in the p()CG algorithm, and compares it to recent related work in [4, 8] on the length-one pipelined CG method [22]. We explicitly characterize the propagation of local rounding errors throughout the algorithm and discuss the influence of the pipeline length and the choice of the Krylov basis on the maximal attainable accuracy of the method. Furthermore, based on the error analysis, a possible approach for stabilizing the p()CG method is suggested near the end of the manuscript. The analysis in this paper is limited to the effect of local rounding errors in the multi-term recurrence relations on attainable accuracy; a discussion of the loss of orthogonality [26, 29] and the consequential delay of convergence in pipelined CG variants is beyond the scope of this work.
The paper is structured as follows. Section 2 briefly summarizes the existing numerical rounding error analysis for classic CG and pipelined CG (pCG), meanwhile introducing the notation that will be used throughout the manuscript. Section 3 is devoted to analyzing the behavior of local rounding errors that stem from the recurrence relations in p()CG. In Section 4 the numerical analysis is extended by establishing practical bounds for the operator that governs the error propagation. This allows us to substantiate the impact of the pipeline length, the iteration index and the choice of the Krylov basis on attainable accuracy. Section 5 presents a countermeasure to reduce the impact of local rounding errors on the final attainable accuracy. This stabilization technique results directly from the error analysis; however, it comes at the cost of an increase in the algorithm’s computational cost. In Section 6 we present numerical experiments to verify and substantiate the numerical analysis in this work. Finally, the paper is concluded in Section 7.
2 Analyzing the effect of local rounding errors in classic CG and pCG
We first present an overview of the classic rounding error analysis of the standard CG method, Algorithm 1, which is treated in a broad variety of publications, see e.g. [26, 42, 43, 37]. In addition, we summarize the rounding error analysis of the length-one pipelined CG method (denoted ‘pCG’), Algorithm 2, see Ghysels and Vanroose [22]. This analysis was recently presented in [8, 4].
2.1 Behavior of local rounding errors in classic CG
In the classic Conjugate Gradient method, Algorithm 1, the recurrence relations for the approximate solution $x_{j+1}$, the residual $r_{j+1}$, and the search direction $p_{j+1}$ are given in exact arithmetic by
$$x_{j+1} = x_j + \alpha_j p_j, \qquad r_{j+1} = r_j - \alpha_j A p_j, \qquad p_{j+1} = r_{j+1} + \beta_{j+1} p_j, \tag{1}$$
respectively, where $\alpha_j = (r_j, r_j)/(p_j, A p_j)$ and $\beta_{j+1} = (r_{j+1}, r_{j+1})/(r_j, r_j)$. In finite precision arithmetic the recurrence relations for these vectors do not hold exactly but are contaminated by local rounding errors in each iteration. We differentiate between exact variables (e.g. the exact residual $r_j$) and the actually computed variables (e.g. the finite precision residual $\bar{r}_j$) by using a notation with bars for the computed quantities.
We use the classic model for floating point arithmetic with machine precision $\epsilon$. The round-off errors on scalar multiplication, vector summation, spmv application and dot product computation for an $n$ by $n$ matrix $A$, length-$n$ vectors $x$ and $y$, and a scalar number $\alpha$ are respectively bounded by
$$\|\alpha x - \mathrm{fl}(\alpha x)\| \le \|\alpha x\|\,\epsilon, \qquad \|x + y - \mathrm{fl}(x + y)\| \le (\|x\| + \|y\|)\,\epsilon,$$
$$\|A x - \mathrm{fl}(A x)\| \le m_A \sqrt{n}\,\|A\|\,\|x\|\,\epsilon, \qquad |(x, y) - \mathrm{fl}((x, y))| \le n\,\|x\|\,\|y\|\,\epsilon,$$
where $\mathrm{fl}(\cdot)$ indicates the finite precision floating point representation, $m_A$ is the maximum number of nonzeros in any row of $A$, and the norm $\|\cdot\|$ represents the Euclidean 2-norm in this manuscript.
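These bounds are easy to probe numerically. The sketch below (our own construction; vector length and seed are arbitrary) evaluates a dot product in single precision and checks the computed result against a double-precision reference of the same stored operands:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 10_000
x = rng.standard_normal(n).astype(np.float32)
y = rng.standard_normal(n).astype(np.float32)

# Dot product evaluated in single precision ...
dot32 = np.dot(x, y)
# ... versus a double-precision reference of the same stored vectors.
ref = np.dot(x.astype(np.float64), y.astype(np.float64))

u = np.float64(np.finfo(np.float32).eps)   # machine precision for float32
err = abs(np.float64(dot32) - ref)
bound = n * u * np.linalg.norm(x.astype(np.float64)) * np.linalg.norm(y.astype(np.float64))
# The observed error sits (far) below the worst-case bound n * u * ||x|| * ||y||.
```

In practice the observed error is typically orders of magnitude smaller than the worst-case bound, since round-off contributions of random sign largely cancel.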
Considering the recurrence relations for the approximate solution $\bar{x}_{j+1}$, the residual $\bar{r}_{j+1}$ and the search direction $\bar{p}_{j+1}$ computed by the CG algorithm in the finite precision framework yields
$$\bar{x}_{j+1} = \bar{x}_j + \bar\alpha_j \bar{p}_j + \delta^x_j, \qquad \bar{r}_{j+1} = \bar{r}_j - \bar\alpha_j A \bar{p}_j + \delta^r_j, \qquad \bar{p}_{j+1} = \bar{r}_{j+1} + \bar\beta_{j+1} \bar{p}_j + \delta^p_j, \tag{2}$$
where $\delta^x_j$, $\delta^r_j$ and $\delta^p_j$ represent local rounding errors, and where $\bar\alpha_j$ and $\bar\beta_{j+1}$ denote the computed scalar coefficients. We refer to the analysis in [4, 8] for bounds on the norms of these local rounding errors.
It is well-known that local rounding errors in the classic CG algorithm are accumulated but not amplified throughout the algorithm. This is concluded directly from computing the gap $f_j := (b - A\bar{x}_j) - \bar{r}_j$ between the true and the recursively computed residual. With this notation it follows from (2) that
$$f_{j+1} = (b - A\bar{x}_{j+1}) - \bar{r}_{j+1} = f_j - A\delta^x_j - \delta^r_j. \tag{3}$$
Hence in each iteration local rounding errors of the form $-A\delta^x_j - \delta^r_j$ add to the gap on the residual.
By introducing the matrix notation $F_i := [f_1, \ldots, f_i]$ for the residual gaps in the first $i$ iterations and by analogously defining $\Delta^x_i$ and $\Delta^r_i$ for the local rounding errors, expression (3) can be formulated as
$$F_i = f_0 \mathbf{1}_i^T - \left(A \Delta^x_i + \Delta^r_i\right) U_i, \tag{4}$$
where $\mathbf{1}_i$ is the length-$i$ vector of ones and $U_i$ is an $i \times i$ upper triangular matrix of ones. No amplification of local rounding errors occurs; indeed, local rounding errors are merely accumulated in the classic CG algorithm.
2.2 Propagation of local rounding errors in pCG
The behavior of local rounding errors in the length-one pCG method from [22], see Algorithm 2, was analyzed in our previous work [8] and in the related work [4]. Pipelined pCG uses additional recurrence relations for the auxiliary vector quantities $s_j := A p_j$, $w_j := A r_j$ and $z_j := A s_j$. The coupling between these recursively defined variables may cause amplification of the local rounding errors in the algorithm. In finite precision one has the following recurrence relations in pCG (without preconditioner):
$$\begin{aligned} \bar{x}_{j+1} &= \bar{x}_j + \bar\alpha_j \bar{p}_j + \delta^x_j, & \bar{r}_{j+1} &= \bar{r}_j - \bar\alpha_j \bar{s}_j + \delta^r_j, & \bar{w}_{j+1} &= \bar{w}_j - \bar\alpha_j \bar{z}_j + \delta^w_j, \\ \bar{p}_{j+1} &= \bar{r}_{j+1} + \bar\beta_{j+1} \bar{p}_j + \delta^p_j, & \bar{s}_{j+1} &= \bar{w}_{j+1} + \bar\beta_{j+1} \bar{s}_j + \delta^s_j, & \bar{z}_{j+1} &= \mathrm{fl}(A \bar{w}_{j+1}) + \bar\beta_{j+1} \bar{z}_j + \delta^z_j. \end{aligned} \tag{5}$$
Preconditioning of the various Conjugate Gradient variants has been largely omitted in this section for simplicity of notation. The extension of the local rounding error analysis to the preconditioned pCG (and, by extension, p()CG) algorithm is straightforward, since the recurrences for the unpreconditioned variables are decoupled from their preconditioned counterparts; we refer the reader to Section 3.4 and our related work in [8] for more details.
The respective bounds for the local rounding errors (based on the recurrence relations above) can be found in [8], Section 2.3, where it is furthermore shown that the residual gap $f_j = (b - A\bar{x}_j) - \bar{r}_j$ is coupled to the gaps on the auxiliary variables $\bar{s}_j$, $\bar{w}_j$ and $\bar{z}_j$ in pCG. We present the relations from [8] in matrix form. Writing the gaps as
and using the following expressions for the local rounding errors on the auxiliary variables
we obtain matrix expressions for the local rounding errors in pCG:
where
The inverse of the upper bidiagonal matrix can be expressed in terms of products of the scalar coefficients, i.e.
Furthermore, the matrix is defined by setting the first row of the former matrix to zero. For a correct interpretation we note that this matrix is not invertible; the notation merely indicates the close relation to the original matrix. The residual gap in pCG is summarized by the following expression:
(6) 
Hence, the entries of the coefficient matrices determine the propagation of the local rounding errors in pCG. These entries consist of products of the scalar coefficients $\bar\beta_k$. In exact arithmetic these coefficients equal $\beta_k = \|r_k\|^2/\|r_{k-1}\|^2$, such that
$$\prod_{k=m}^{j} \beta_k = \frac{\|r_j\|^2}{\|r_{m-1}\|^2}. \tag{7}$$
Since the residual norm in CG is not guaranteed to decrease monotonically, the factor $\|r_j\|^2/\|r_{m-1}\|^2$ may for some $j$ and $m$ be much larger than one. A similar argument may be used in the finite precision framework to derive that some entries of these coefficient matrices may be significantly larger than one, and may hence (possibly dramatically) amplify the corresponding local rounding errors in expression (6). This behavior is illustrated in Section 6, Fig. 2 (top right), where the pCG residual stagnates at a reduced maximal attainable accuracy level compared to classic CG, see Fig. 2 (top left).
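This amplification mechanism can be reproduced experimentally. Below is our own reconstruction of a Ghysels–Vanroose-style length-one pipelined CG loop (the variable names `s`, `w`, `z` mirror the auxiliary quantities above; the test problem and tolerances are illustrative choices), which records the residual gap at the end of the run:

```python
import numpy as np

def pcg_sketch(A, b, maxit, tol=1e-14):
    """Length-one pipelined CG sketch (Ghysels-Vanroose style, no
    preconditioner): s ~ A p, w ~ A r and z ~ A s are maintained by
    coupled recurrences, so a single spmv (A @ w) per iteration
    suffices and the dot products can be overlapped with it."""
    x = np.zeros_like(b)
    r = b.copy()
    w = A @ r
    p = np.zeros_like(b)
    s = np.zeros_like(b)
    z = np.zeros_like(b)
    alpha = 1.0
    gamma_old = 1.0
    bnorm = np.linalg.norm(b)
    for i in range(maxit):
        gamma = r @ r
        delta = w @ r
        q = A @ w                      # the only spmv in the loop body
        if i == 0:
            beta = 0.0
            alpha = gamma / delta
        else:
            beta = gamma / gamma_old
            alpha = gamma / (delta - beta * gamma / alpha)
        z = q + beta * z               # recursively maintained A s
        s = w + beta * s               # recursively maintained A p
        p = r + beta * p
        x = x + alpha * p
        r = r - alpha * s              # recursive residual
        w = w - alpha * z              # recursively maintained A r
        gamma_old = gamma
        if np.linalg.norm(r) <= tol * bnorm:
            break
    return x, r

n = 50
A = 2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
b = np.ones(n)
x, r = pcg_sketch(A, b, maxit=400)
# The recursive residual r can keep decreasing while the true residual
# b - A x stagnates at the (possibly reduced) attainable accuracy level.
gap = np.linalg.norm((b - A @ x) - r)
```

On ill-conditioned problems the stagnation level of `b - A @ x` for this loop is visibly worse than for classic CG, which is precisely the amplification effect quantified above.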
3 Analyzing local rounding error propagation in p()CG
This section aims to analyze the behavior of local rounding errors that stem from the recurrence relations in the p()CG algorithm with deep pipelines, Algorithm 3. The analysis follows a framework similar to Section 2; however, the p()CG method is in fact much more closely related to p()GMRES [21] than it is to pCG [22]. For a proper understanding we first summarize the essential definitions and recurrence relations used in the p()CG algorithm.
3.1 Basis vector recurrence relations in p()CG in exact arithmetic
Let $V_j := [v_0, v_1, \ldots, v_{j-1}]$ be the orthonormal basis for the Krylov subspace $\mathcal{K}_j(A, v_0)$ in iteration $j$ of the p()CG algorithm, where $A$ is a symmetric matrix. These vectors satisfy the Arnoldi relation
$$A V_j = V_{j+1} T_{j+1,j}, \tag{8}$$
where $T_{j+1,j}$ is the tridiagonal matrix
$$T_{j+1,j} = \begin{pmatrix} \gamma_0 & \delta_0 & & \\ \delta_0 & \gamma_1 & \ddots & \\ & \ddots & \ddots & \delta_{j-2} \\ & & \delta_{j-2} & \gamma_{j-1} \\ & & & \delta_{j-1} \end{pmatrix}.$$
For column $j$ the relation (8) translates in vector notation to the recursive definition of $v_{j+1}$:
$$v_{j+1} = \left(A v_j - \gamma_j v_j - \delta_{j-1} v_{j-1}\right)/\delta_j. \tag{9}$$
Note that for $j = 0$ it is assumed that $\delta_{-1} v_{-1} = 0$. We define the auxiliary vector basis $Z_j := [z_0, z_1, \ldots, z_{j-1}]$, which runs $l$ vectors ahead of the basis $V_j$ (i.e. the so-called pipeline of length $l$), as
$$z_j := \begin{cases} \prod_{k=0}^{j-1} (A - \sigma_k I)\, v_0, & j \le l, \\ P_l(A)\, v_{j-l}, & j > l, \end{cases} \qquad \text{with } P_l(A) := \prod_{k=0}^{l-1} (A - \sigma_k I), \tag{10}$$
with optional stabilizing shifts $\sigma_0, \ldots, \sigma_{l-1}$, see [21, 9]. We comment on the choice of the Krylov basis and its effect on numerical stability further on in this manuscript. Note that contrary to the basis $V_j$, the auxiliary basis $Z_j$ is in general not orthonormal. For $j < l$ one has the relation:
$$z_{j+1} = (A - \sigma_j I)\, z_j, \tag{11}$$
whereas for $j \ge l$ the relation (9) for $v_{j-l+1}$ is multiplied on both sides by $P_l(A)$ to obtain the recurrence relation for $z_{j+1}$:
$$z_{j+1} = \left(A z_j - \gamma_{j-l}\, z_j - \delta_{j-l-1}\, z_{j-1}\right)/\delta_{j-l}. \tag{12}$$
In matrix formulation the expressions (11)–(12) for the auxiliary basis translate into the Arnoldi-like relation
$$A Z_j = Z_{j+1} B_{j+1,j}, \tag{13}$$
where the matrix $B_{j+1,j}$ is a banded upper Hessenberg matrix containing the shifts $\sigma_0, \ldots, \sigma_{l-1}$ in its first $l$ columns and a shifted copy of the tridiagonal matrix $T$ in the subsequent columns.
The basis vectors $v_j$ and $z_j$ are connected through the basis transformation $Z_j = V_j G_j$. The upper triangular matrix $G_j$ has a band structure with a band width of at most $2l+1$ nonzero diagonals, see [9], Lemma 6. Using this basis transformation the following recurrence relation for $v_{j+1}$ is derived:
$$v_{j+1} = \left( z_{j+1} - \sum_{k=j+1-2l}^{j} g_{k,j+1}\, v_k \right) \Big/ g_{j+1,j+1}. \tag{14}$$
The recurrence relations (12) and (14) are used in Alg. 3 to recursively compute the respective basis vectors $z_{j+1}$ and $v_{j+1}$ in iterations $j \ge l$ (i.e. once the initial pipeline for $j < l$ has been filled).
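The driving three-term recurrence (9) is the symmetric Lanczos recurrence. A small self-contained sketch (our own construction, with generic symbol names) builds the orthonormal basis and verifies the Arnoldi relation (8) numerically:

```python
import numpy as np

def lanczos_basis(A, v0, m):
    """Symmetric Lanczos: build orthonormal basis V = [v_0, ..., v_m]
    and the (m+1) x m tridiagonal matrix T of recurrence coefficients,
    so that A V[:, :m] = V T (full reorthogonalization omitted)."""
    n = len(v0)
    V = np.zeros((n, m + 1))
    T = np.zeros((m + 1, m))
    V[:, 0] = v0 / np.linalg.norm(v0)
    for j in range(m):
        w = A @ V[:, j]
        if j > 0:
            T[j - 1, j] = T[j, j - 1]       # off-diagonal (symmetry)
            w = w - T[j - 1, j] * V[:, j - 1]
        T[j, j] = V[:, j] @ w               # diagonal entry
        w = w - T[j, j] * V[:, j]
        T[j + 1, j] = np.linalg.norm(w)     # next off-diagonal entry
        V[:, j + 1] = w / T[j + 1, j]       # next basis vector
    return V, T

n, m = 100, 20
A = 2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
V, T = lanczos_basis(A, np.random.default_rng(0).standard_normal(n), m)
arnoldi_residual = np.linalg.norm(A @ V[:, :m] - V @ T)
# arnoldi_residual sits at machine-precision level: the relation itself
# holds in finite precision; it is orthogonality of V that degrades.
```

This distinction (the Arnoldi relation holds to round-off while orthogonality is gradually lost) is exactly why the analysis below focuses on the recurrence relations rather than on orthogonality.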
Remark
Unlike the pCG Algorithm 2, the p()CG Algorithm 3 may encounter a square root breakdown during the iteration. When the argument of the square root becomes negative (possibly influenced by round-off errors in practice) a breakdown occurs. It is suggested in [9] that when a breakdown occurs the iteration is restarted, in analogy to the (pipelined) GMRES algorithm. Note however that this restarting strategy may delay convergence, cf. Section 6, Fig. 1.
3.2 Local rounding error behavior in finite precision p()CG
For any $j \ge 0$ the true basis vector, denoted by $\hat{v}_{j+1}$, satisfies the Arnoldi relation (9) exactly, that is, it is defined as
$$\hat{v}_{j+1} := \left(A \bar{v}_j - \gamma_j \bar{v}_j - \delta_{j-1} \bar{v}_{j-1}\right)/\delta_j. \tag{16}$$
For $j = 0$ it is assumed that $\hat{v}_0 = \bar{v}_0$. On the other hand, the computed basis vector $\bar{v}_{j+1}$ is calculated from the finite precision variant of the recurrence relation (14), i.e.
$$\bar{v}_{j+1} = \left( \bar{z}_{j+1} - \sum_{k=j+1-2l}^{j} g_{k,j+1}\, \bar{v}_k \right) \Big/ g_{j+1,j+1} + \xi^v_{j+1}, \tag{17}$$
where the size of the local rounding error $\xi^v_{j+1}$ can be bounded in terms of the machine precision $\epsilon$.
By subtracting the computed basis vector $\bar{v}_{j+1}$ from both sides of equation (16), it is easy to see that this relation alternatively translates to
$$A \bar{v}_j = \delta_j \bar{v}_{j+1} + \gamma_j \bar{v}_j + \delta_{j-1} \bar{v}_{j-1} + \delta_j \left(\hat{v}_{j+1} - \bar{v}_{j+1}\right),$$
or written in matrix notation:
$$A \bar{V}_j = \bar{V}_{j+1} T_{j+1,j} + \Theta_j, \tag{18}$$
where
$$\Theta_j := \left[\, \delta_0 (\hat{v}_1 - \bar{v}_1), \; \delta_1 (\hat{v}_2 - \bar{v}_2), \; \ldots, \; \delta_{j-1} (\hat{v}_j - \bar{v}_j) \,\right].$$