Numerical stability analysis of the class of communication hiding pipelined Conjugate Gradient methods

04/09/2018 · by Siegfried Cools, et al.

Krylov subspace methods are widely known as efficient algebraic methods for solving linear systems. However, on massively parallel hardware their performance is typically limited by communication latency rather than by floating point performance. With HPC hardware advancing towards the exascale regime, the gap between computation (i.e. flops) and communication (i.e. internode communication, as well as data movement within the memory hierarchy) keeps steadily increasing, imposing the need for scalable alternatives to traditional Krylov subspace methods. One such alternative is the class of pipelined Krylov subspace methods, which reduce the number of global synchronization points and overlap global communication latency with local arithmetic operations, thus 'hiding' the global reduction phases behind useful computations. To obtain this overlap the algorithm is reformulated by introducing a number of auxiliary vector quantities, which are computed using additional recurrence relations. Although pipelined Krylov subspace methods are equivalent to traditional Krylov subspace methods in exact arithmetic, the behavior of local rounding errors induced by the multi-term recurrence relations in finite precision may in practice affect convergence significantly. This numerical stability study aims to characterize the effect of local rounding errors in various pipelined versions of the popular Conjugate Gradient method. We derive expressions for the gaps between the true and (recursively) computed variables that are used to update the search directions in the different CG variants. Furthermore, we show how these results can be used to analyze and correct the effect of local rounding error propagation on the maximal attainable accuracy of pipelined CG methods. The analysis in this work is supplemented by various numerical experiments that demonstrate the numerical stability of the pipelined CG methods.


1 Introduction

Krylov subspace methods [25, 34, 36, 39, 44] have been used for decades as efficient iterative solution methods for linear systems. This paper considers the problem of solving algebraic linear systems of the form $Ax = b$, where $A$ is a real or complex non-singular $n \times n$ matrix and the right-hand side vector $b$ correspondingly has length $n$. Given an initial guess $x_0$ for the solution and the initial residual $r_0 = b - Ax_0$, Krylov subspace methods construct a series of approximate solutions $x_i \in x_0 + \mathcal{K}_i(A, r_0)$ based on the $i$-th Krylov subspace

$$ \mathcal{K}_i(A, r_0) = \operatorname{span}\left\{ r_0, A r_0, A^2 r_0, \ldots, A^{i-1} r_0 \right\}, $$

using some orthogonality constraint that differentiates the various Krylov subspace methods. For problems with symmetric (or Hermitian) positive definite (SPD) matrices – which are the main focus of this work – one of the most basic yet widely used Krylov subspace methods is the (preconditioned) Conjugate Gradient (CG) method, which dates back to the original 1952 paper [30]. Here the orthogonality constraint boils down to minimizing the $A$-norm of the error over the Krylov subspace, i.e.,

$$ x_i = \underset{x \,\in\, x_0 + \mathcal{K}_i(A, r_0)}{\operatorname{argmin}} \; \| x - \hat{x} \|_A, $$

where $\hat{x} = A^{-1} b$ denotes the exact solution and the energy norm is defined as $\|y\|_A = \sqrt{y^T A y}$ with the SPD matrix $A$.

A clear trend in current (petascale) and future (exascale) high performance computing hardware consists of up-scaling the number of parallel compute nodes as well as the number of processors per node [13]. Compute nodes are generally interconnected using fast InfiniBand connections. However, the increasing gap between computational performance and memory/interconnect latency implies that on HPC hardware data movement is much more expensive than floating point operations (flops), with respect to both execution time and energy consumption [13, 19]. As such, reducing the time spent in moving data and/or waiting for data will be essential for exascale applications.

Although Krylov subspace algorithms are traditionally highly efficient with respect to the number of flops required to obtain a sufficiently accurate approximate solution, they are not optimized towards minimizing communication. The HPCG benchmark [14, 15], for example, shows that Krylov subspace methods attain only a small fraction of the machine peak performance on large scale hardware. Since the matrix $A$ is often sparse and its application generally requires only limited communication between neighboring processors, the primary bottleneck for parallel execution is typically not the sparse matrix-vector (spmv) product. Instead, parallel efficiency stalls due to the communication latency caused by global synchronization phases when computing the dot-products required for the orthogonalization procedures in the Krylov subspace algorithm.

A variety of interesting research branches on reducing or eliminating the synchronization bottleneck in Krylov subspace methods has emerged over the last decades. Based on the earliest ideas of communication reduction in Krylov subspace methods [41, 5, 10, 12, 17, 11, 7], a number of methods that aim to eliminate global synchronization points have recently been (re)introduced. These include hierarchical Krylov subspace methods [35], enlarged Krylov subspace methods [28], an iteration fusing Conjugate Gradient method [47], $s$-step Krylov subspace methods (also called "communication avoiding" Krylov subspace methods) [6, 3, 2, 33], and pipelined Krylov subspace methods (also referred to as "communication hiding" Krylov subspace methods) [21, 22, 40, 16, 46]. The latter aim not only to reduce the number of global synchronization points in the algorithm, but also to overlap the global communication with useful (mostly local) computations such as spmvs ($w \leftarrow Av$) and axpys ($y \leftarrow \alpha x + y$). In this way idle core time is reduced by simultaneously performing the synchronization phase and the independent compute-bound calculations. When the time required by one global communication phase approximately equals the computation time for one spmv, the pipelined CG method from [22], denoted as p-CG in this work, allows for a good overlap and thus features a significantly reduced total time to solution compared to the classic CG method. In heavily communication-bound scenarios where a global reduction takes significantly longer than one spmv, deeper pipelines allow the global reduction phase to be overlapped with the computational work of multiple spmvs. The concept of deep pipelines was introduced by Ghysels et al. [21] for the Generalized Minimal Residual (GMRES) method, and was recently extended to the CG method [9].
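The overlap mechanism can be made concrete with a small sketch. The following is a minimal, hypothetical illustration (not taken from any of the cited implementations) of how a global reduction can be hidden behind independent local work using a non-blocking all-reduce in mpi4py; the 1-D data layout and all variable names are invented for the example. Run under an MPI launcher, e.g. `mpirun -n 4 python overlap.py`.

```python
# Minimal sketch: 'hiding' a global dot-product reduction behind local work.
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD

n_local = 100_000
r_local = np.random.rand(n_local)          # local slice of a vector r
w_local = np.random.rand(n_local)          # local slice of a vector w

# 1. Start the global dot-product (r, w) without waiting for it to finish.
local_dot = np.array([r_local @ w_local])
global_dot = np.empty(1)
req = comm.Iallreduce(local_dot, global_dot, op=MPI.SUM)

# 2. Meanwhile, perform independent local work (a stand-in for the spmv),
#    here a communication-free local stencil application.
v_local = np.empty_like(w_local)
v_local[1:-1] = 2.0 * w_local[1:-1] - w_local[:-2] - w_local[2:]
v_local[0], v_local[-1] = w_local[0], w_local[-1]   # toy boundary handling

# 3. Only now block until the reduction has completed.
req.Wait()
alpha = global_dot[0]   # scalar coefficient, available after the overlap
```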

The advantages of using pipelined (and other) communication reducing Krylov subspace methods from a performance point of view have been illustrated in many of the aforementioned works. However, reorganizing the traditional Krylov subspace algorithm into a communication reducing variant typically introduces unwanted issues with the numerical stability of the algorithm. In exact arithmetic pipelined Krylov subspace methods produce a series of iterates identical to those of the classic CG method. In finite precision arithmetic, however, their behavior can differ significantly, as local rounding errors may induce a decrease in attainable accuracy and a delay of convergence. The impact of finite precision round-off errors on the numerical stability of classic CG has been studied extensively in a number of papers, among which [23, 26, 24, 29, 42, 43, 37, 20].

In communication reducing CG variants the effects of local rounding errors are significantly amplified. We refer to our previous work [8] and the paper [4] by Carson et al. for a (historic) overview of the numerical stability analysis of pipelined Conjugate Gradient methods. In the present study we focus on analyzing the recently introduced deep pipelined Conjugate Gradient method [9], denoted by p($l$)-CG. (Note for a proper understanding that the p($l$)-CG method in [9] was not derived from the p-CG method [22], but is rather based on similar principles as the p($l$)-GMRES method [21].) Although both p-CG and p($l$)-CG are pipelined variants of the CG algorithm that suffer from local rounding error propagation, the exact underlying mechanism by which error propagation occurs differs between these methods, as analyzed in Sections 2-3 of this work. In [9] it was observed that the numerical accuracy attainable by the p($l$)-CG method may be reduced drastically for larger pipeline lengths $l$. Similar observations have been made for other classes of communication reducing methods; see e.g. the studies [5, 2] on the influence of the $s$-step parameter on the numerical stability of communication avoiding Krylov subspace methods. The performance of pipelined Krylov subspace methods in the presence of HPC system noise has recently been analyzed in a stochastic framework, see [38].

The sensitivity of the p($l$)-CG Krylov subspace method to local rounding errors may pose an obstacle to the practical usability and reliability of this pipelined method. It is therefore vital to thoroughly understand and – if possible – counteract the negative effects of local rounding error propagation on the numerical stability of the method. The aim of this study is two-fold:

  1. to establish a theoretical framework for the behavior of local rounding errors that explains the observed loss of attainable accuracy in pipelined CG methods;

  2. to exemplify possible – preferably easy-to-implement and cost-efficient – countermeasures which can be used to improve the attainable accuracy of these methods when required.

This paper analyzes the behavior of local rounding errors that stem from the multi-term recurrence relations in the p($l$)-CG algorithm, and compares them to the recent related work in [4, 8] on the length-one pipelined CG method [22]. We explicitly characterize the propagation of local rounding errors throughout the algorithm and discuss the influence of the pipeline length $l$ and the choice of the Krylov basis on the maximal attainable accuracy of the method. Furthermore, based on the error analysis, a possible approach for stabilizing the p($l$)-CG method is suggested near the end of the manuscript. The analysis in this paper is limited to the effect of local rounding errors in the multi-term recurrence relations on attainable accuracy; a discussion of the loss of orthogonality [26, 29] and the consequential delay of convergence in pipelined CG variants is beyond the scope of this work.

The paper is structured as follows. Section 2 briefly summarizes the existing numerical rounding error analysis for classic CG and pipelined CG (p-CG), while introducing the notation that will be used throughout the manuscript. Section 3 is devoted to analyzing the behavior of local rounding errors that stem from the recurrence relations in p($l$)-CG. In Section 4 the numerical analysis is extended by establishing practical bounds for the operator that governs the error propagation. This allows us to substantiate the impact of the pipeline length $l$, the iteration index $i$ and the choice of the Krylov basis on attainable accuracy. Section 5 presents a countermeasure to reduce the impact of local rounding errors on the final attainable accuracy. This stabilization technique results directly from the error analysis; however, it comes at the price of an increase in the algorithm's computational cost. In Section 6 we present numerical experiments to verify and substantiate the numerical analysis in this work. Finally, the paper is concluded in Section 7.

2 Analyzing the effect of local rounding errors in classic CG and p-CG

We first present an overview of the classic rounding error analysis of the standard CG method, Algorithm 1, which is treated in a broad variety of publications, see e.g. [26, 42, 43, 37]. In addition, we summarize the rounding error analysis of the length-one pipelined CG method (denoted as ‘p-CG’), Algorithm 2, see Ghysels and Vanroose [22]. This analysis was recently presented in [8, 4].

2.1 Behavior of local rounding errors in classic CG

1: $r_0 := b - A x_0$;
2: $p_0 := r_0$;
3: for $i = 0, \ldots, m$ do
4:     $s_i := A p_i$;
5:     $\alpha_i := (r_i, r_i) / (s_i, p_i)$;
6:     if $\sqrt{(r_i, r_i)} / \|r_0\| < \tau$ then
7:          RETURN;
8:     end if
9:     $x_{i+1} := x_i + \alpha_i p_i$;
10:     $r_{i+1} := r_i - \alpha_i s_i$;
11:     $\beta_{i+1} := (r_{i+1}, r_{i+1}) / (r_i, r_i)$;
12:     $p_{i+1} := r_{i+1} + \beta_{i+1} p_i$;
13: end for
Algorithm 1 Conjugate Gradient method (CG) Input: $A$, $b$, $x_0$, $m$, $\tau$
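For reference, the listing above can be transcribed directly into NumPy. The following is a minimal sketch (not the authors' implementation); the input conventions — at most $m$ iterations, relative residual tolerance $\tau$ — are assumed to match Algorithm 1.

```python
import numpy as np

def cg(A, b, x0, m, tau):
    """Classic CG, a direct transcription of Algorithm 1 (unpreconditioned)."""
    x = x0.copy()
    r = b - A @ x                 # r_0 := b - A x_0
    p = r.copy()                  # p_0 := r_0
    r0_norm = np.linalg.norm(r)
    rr = r @ r                    # (r_i, r_i)
    for i in range(m + 1):
        s = A @ p                 # s_i := A p_i
        alpha = rr / (s @ p)      # alpha_i := (r_i, r_i) / (s_i, p_i)
        if np.sqrt(rr) / r0_norm < tau:
            return x, i           # converged
        x = x + alpha * p         # x_{i+1} := x_i + alpha_i p_i
        r = r - alpha * s         # r_{i+1} := r_i - alpha_i s_i
        rr_new = r @ r
        beta = rr_new / rr        # beta_{i+1} := (r_{i+1}, r_{i+1}) / (r_i, r_i)
        p = r + beta * p          # p_{i+1} := r_{i+1} + beta_{i+1} p_i
        rr = rr_new
    return x, m
```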

In the classic Conjugate Gradient method, Algorithm 1, the recurrence relations for the approximate solution $x_{i+1}$, the residual $r_{i+1}$, and the search direction $p_{i+1}$ are given in exact arithmetic by

$$ x_{i+1} = x_i + \alpha_i p_i, \qquad r_{i+1} = r_i - \alpha_i A p_i, \qquad p_{i+1} = r_{i+1} + \beta_{i+1} p_i, \qquad (1) $$

respectively, where $\alpha_i = (r_i, r_i) / (A p_i, p_i)$ and $\beta_{i+1} = (r_{i+1}, r_{i+1}) / (r_i, r_i)$. In finite precision arithmetic the recurrence relations for these vectors do not hold exactly but are contaminated by local rounding errors in each iteration. We differentiate between exact variables (e.g. the exact residual $r_i$) and the actually computed variables (e.g. the finite precision residual $\bar{r}_i$) by using a notation with bars for the computed quantities.

We use the classic model for floating point arithmetic with machine precision $\epsilon$. The round-off errors on scalar multiplication, vector summation, spmv application and dot-product computation for an $n$-by-$n$ matrix $A$, length-$n$ vectors $v$, $w$ and a scalar number $\alpha$ are respectively bounded by

$$ \| \alpha v - \mathrm{fl}(\alpha v) \| \le \| \alpha v \| \, \epsilon, \qquad \| v + w - \mathrm{fl}(v + w) \| \le \left( \| v \| + \| w \| \right) \epsilon, $$
$$ \| A v - \mathrm{fl}(A v) \| \le \mu \sqrt{n} \, \| A \| \, \| v \| \, \epsilon, \qquad | (v, w) - \mathrm{fl}((v, w)) | \le n \, \| v \| \, \| w \| \, \epsilon, $$

where $\mathrm{fl}(\cdot)$ indicates the finite precision floating point representation, $\mu$ is the maximum number of nonzeros in any row of $A$, and the norm $\| \cdot \|$ represents the Euclidean 2-norm in this manuscript.
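The dot-product bound, for instance, can be checked numerically. The small experiment below uses float32 as the working precision, so that float64 can act as the 'exact' reference; the problem size and data are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
v64 = rng.standard_normal(n)
w64 = rng.standard_normal(n)

# Work in float32 so that float64 serves as the 'exact' reference.
eps32 = np.finfo(np.float32).eps           # machine precision of the working type
v32, w32 = v64.astype(np.float32), w64.astype(np.float32)

exact = np.dot(v32.astype(np.float64), w32.astype(np.float64))
rounded = np.float64(np.dot(v32, w32))     # fl((v, w)) in float32 arithmetic

error = abs(rounded - exact)
bound = n * np.linalg.norm(v32) * np.linalg.norm(w32) * eps32
print(f"|fl((v,w)) - (v,w)| = {error:.3e}  <=  n ||v|| ||w|| eps = {bound:.3e}")
assert error <= bound                      # the classic model bound holds
```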

Considering the recurrence relations for the approximate solution $\bar{x}_{i+1}$, the residual $\bar{r}_{i+1}$ and the search direction $\bar{p}_{i+1}$ computed by the CG algorithm in the finite precision framework yields

$$ \bar{x}_{i+1} = \bar{x}_i + \bar{\alpha}_i \bar{p}_i + \delta^x_{i+1}, \qquad \bar{r}_{i+1} = \bar{r}_i - \bar{\alpha}_i A \bar{p}_i + \delta^r_{i+1}, \qquad \bar{p}_{i+1} = \bar{r}_{i+1} + \bar{\beta}_{i+1} \bar{p}_i + \delta^p_{i+1}, \qquad (2) $$

where $\delta^x_{i+1}$, $\delta^r_{i+1}$ and $\delta^p_{i+1}$ represent local rounding errors, and where $\bar{\alpha}_i$ and $\bar{\beta}_{i+1}$ denote the scalar coefficients computed in finite precision. We refer to the analysis in [4, 8] for bounds on the norms of these local rounding errors.

It is well-known that local rounding errors in the classic CG algorithm are accumulated but not amplified throughout the algorithm. This is concluded directly from computing the gap $f_i = (b - A\bar{x}_i) - \bar{r}_i$ between the true residual $b - A\bar{x}_i$ and the recursively computed residual $\bar{r}_i$. With this notation it follows from (2) that

$$ f_{i+1} = f_i - A \delta^x_{i+1} - \delta^r_{i+1} = f_0 - \sum_{j=1}^{i+1} \left( A \delta^x_j + \delta^r_j \right). \qquad (3) $$

Hence in each iteration local rounding errors of the form $-A \delta^x_{i+1} - \delta^r_{i+1}$ add to the gap on the residual.

By introducing the matrix notation $F_i = [f_1, \ldots, f_i]$ for the residual gaps in the first $i$ iterations and by analogously defining $\Delta^x_i = [\delta^x_1, \ldots, \delta^x_i]$ and $\Delta^r_i = [\delta^r_1, \ldots, \delta^r_i]$ for the local rounding errors, expression (3) can (for $f_0 = 0$) be formulated as

$$ F_i = -\left( A \Delta^x_i + \Delta^r_i \right) U_i, \qquad (4) $$

where $U_i$ is an $i \times i$ upper triangular matrix of ones. No amplification of local rounding errors occurs; indeed, local rounding errors are merely accumulated in the classic CG algorithm.
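This accumulation behavior is easy to observe experimentally. The sketch below runs classic CG and records the norm of the residual gap $f_i = (b - A\bar{x}_i) - \bar{r}_i$ in each iteration; the SPD test matrix is a hypothetical construction chosen only for illustration.

```python
import numpy as np

def cg_with_gap(A, b, x0, m):
    """Run classic CG and record the residual gap  f_i = (b - A x_i) - r_i.

    In classic CG the gap grows by accumulation of O(eps) terms per
    iteration, without amplification (cf. expressions (3)-(4))."""
    x = x0.copy(); r = b - A @ x; p = r.copy(); rr = r @ r
    gaps = []
    for i in range(m):
        s = A @ p
        alpha = rr / (s @ p)
        x = x + alpha * p
        r = r - alpha * s                            # recursive residual
        gaps.append(np.linalg.norm(b - A @ x - r))   # ||f_{i+1}||
        rr_new = r @ r
        p = r + (rr_new / rr) * p
        rr = rr_new
    return x, gaps

# Example on a small SPD test matrix (hypothetical setup):
rng = np.random.default_rng(1)
Q, _ = np.linalg.qr(rng.standard_normal((200, 200)))
A = Q @ np.diag(np.linspace(1.0, 1.0e4, 200)) @ Q.T   # SPD, condition number 1e4
b = rng.standard_normal(200)
_, gaps = cg_with_gap(A, b, np.zeros(200), 150)
print(f"max residual gap over 150 iterations: {max(gaps):.2e}")  # stays at round-off level
```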

2.2 Propagation of local rounding errors in p-CG

1: $r_0 := b - A x_0$;
2: $w_0 := A r_0$;
3: for $i = 0, \ldots, m$ do
4:     $\gamma_i := (r_i, r_i)$;
5:     $\delta_i := (w_i, r_i)$;
6:     $q_i := A w_i$;
7:     if $\sqrt{\gamma_i} / \|r_0\| < \tau$ then
8:          RETURN;
9:     end if
10:     if $i > 0$ then
11:          $\beta_i := \gamma_i / \gamma_{i-1}$;
12:          $\alpha_i := \gamma_i / (\delta_i - \beta_i \gamma_i / \alpha_{i-1})$;
13:     else
14:          $\beta_i := 0$;
15:          $\alpha_i := \gamma_i / \delta_i$;
16:     end if
17:     $z_i := q_i + \beta_i z_{i-1}$;
18:     $s_i := w_i + \beta_i s_{i-1}$;
19:     $p_i := r_i + \beta_i p_{i-1}$;
20:     $x_{i+1} := x_i + \alpha_i p_i$;
21:     $r_{i+1} := r_i - \alpha_i s_i$;
22:     $w_{i+1} := w_i - \alpha_i z_i$;
23: end for
Algorithm 2 Pipelined Conjugate Gradient method (p-CG) Input: $A$, $b$, $x_0$, $m$, $\tau$

The behavior of local rounding errors in the length-one p-CG method from [22], see Algorithm 2, was analyzed in our previous work [8] and in the related work [4]. Pipelined p-CG uses additional recurrence relations for the auxiliary vector quantities $s_i := A p_i$, $w_i := A r_i$ and $z_i := A s_i$. The coupling between these recursively defined variables may cause amplification of the local rounding errors in the algorithm. In finite precision one has the following recurrence relations in p-CG (without preconditioner):

$$ \begin{aligned} \bar{x}_{i+1} &= \bar{x}_i + \bar{\alpha}_i \bar{p}_i + \delta^x_{i+1}, & \bar{p}_i &= \bar{r}_i + \bar{\beta}_i \bar{p}_{i-1} + \delta^p_i, \\ \bar{r}_{i+1} &= \bar{r}_i - \bar{\alpha}_i \bar{s}_i + \delta^r_{i+1}, & \bar{s}_i &= \bar{w}_i + \bar{\beta}_i \bar{s}_{i-1} + \delta^s_i, \\ \bar{w}_{i+1} &= \bar{w}_i - \bar{\alpha}_i \bar{z}_i + \delta^w_{i+1}, & \bar{z}_i &= \bar{q}_i + \bar{\beta}_i \bar{z}_{i-1} + \delta^z_i, \end{aligned} \qquad (5) $$

where $\bar{q}_i = \mathrm{fl}(A \bar{w}_i) = A \bar{w}_i + \delta^q_i$. (Preconditioning of the various Conjugate Gradient variants has been largely omitted in this section for simplicity of notation. The extension of the local rounding error analysis to the preconditioned p-CG and, by extension, p($l$)-CG algorithms is straightforward, since the recurrences for the unpreconditioned variables are decoupled from their preconditioned counterparts. We refer the reader to Section 3.4 and our related work in [8] for more details.)

The respective bounds for the local rounding errors $\delta^x_{i+1}$, $\delta^r_{i+1}$, $\delta^w_{i+1}$, $\delta^p_i$, $\delta^s_i$, $\delta^q_i$ and $\delta^z_i$ (based on the recurrence relations above) can be found in [8], Section 2.3, where it is furthermore shown that the residual gap $f_i = (b - A\bar{x}_i) - \bar{r}_i$ is coupled to the gaps $g_i = A\bar{p}_i - \bar{s}_i$, $h_i = A\bar{r}_i - \bar{w}_i$ and $k_i = A\bar{s}_i - \bar{z}_i$ on the auxiliary variables in p-CG. We present the relations from [8] in matrix form. Let $D_\alpha = \operatorname{diag}(\bar{\alpha}_0, \ldots, \bar{\alpha}_{i-1})$, let $\mathbf{1} = (1, \ldots, 1)^T$, and let $U_i$ be the upper triangular matrix of ones from Section 2.1. Writing the gaps as

$$ F_i = [f_1, \ldots, f_i], \qquad G_i = [g_0, \ldots, g_{i-1}], \qquad H_i = [h_0, \ldots, h_{i-1}], \qquad K_i = [k_0, \ldots, k_{i-1}], $$

and using the following expressions for the local rounding errors on the auxiliary variables,

$$ \Delta^p_i = [\delta^p_0, \ldots, \delta^p_{i-1}], \qquad \Delta^s_i = [\delta^s_0, \ldots, \delta^s_{i-1}], \qquad \Delta^q_i = [\delta^q_0, \ldots, \delta^q_{i-1}], \qquad \Delta^z_i = [\delta^z_0, \ldots, \delta^z_{i-1}], $$

we obtain matrix expressions for the local rounding errors in p-CG:

$$ G_i = \left( H_i + A \Delta^p_i - \Delta^s_i \right) B_i^{-1}, \qquad K_i = \left( A \Delta^s_i - \Delta^q_i - \Delta^z_i \right) B_i^{-1}, $$

where

$$ B_i = \begin{bmatrix} 1 & -\bar{\beta}_1 & & \\ & 1 & \ddots & \\ & & \ddots & -\bar{\beta}_{i-1} \\ & & & 1 \end{bmatrix}. $$

The inverse of the upper bidiagonal matrix $B_i$ can be expressed in terms of the products of the coefficients $\bar{\beta}_m$ (with $0 < m < i$), i.e.

$$ \left( B_i^{-1} \right)_{k,j} = \prod_{m=k+1}^{j} \bar{\beta}_m, \qquad k \le j. $$

Furthermore, the matrix $B_{i,0}^{-1}$ is defined by setting the first row of $B_i^{-1}$ to zero. For a correct interpretation we note that this matrix is not invertible; the notation merely indicates the close relation to $B_i^{-1}$. The residual gap in p-CG is summarized by the following expression:

$$ F_i = f_0 \, \mathbf{1}^T - \left( A \Delta^x_i + \Delta^r_i \right) U_i - \left( H_i + A \Delta^p_i - \Delta^s_i \right) B_i^{-1} D_\alpha \, U_i, \qquad (6) $$

in which the columns of $H_i$ in turn accumulate the gaps $k_j$, so that the matrix $B_{i,0}^{-1}$ enters the fully expanded expression through the contribution of $K_i$ to $H_i$.

Hence, the entries of the coefficient matrices $B_i^{-1}$ and $B_{i,0}^{-1}$ determine the propagation of the local rounding errors in p-CG. The entries of $B_i^{-1}$ consist of products of the scalar coefficients $\bar{\beta}_m$. In exact arithmetic these coefficients equal $\beta_m = \| r_m \|^2 / \| r_{m-1} \|^2$, such that

$$ \left( B_i^{-1} \right)_{k,j} = \prod_{m=k+1}^{j} \beta_m = \frac{\| r_j \|^2}{\| r_k \|^2}, \qquad k \le j. \qquad (7) $$

Since the residual norm in CG is not guaranteed to decrease monotonically, the factor $\| r_j \|^2 / \| r_k \|^2$ may for some $j > k$ be much larger than one. A similar argument may be used in the finite precision framework to derive that some entries of $B_i^{-1}$ may be significantly larger than one, and may hence (possibly dramatically) amplify the corresponding local rounding errors in expression (6). This behavior is illustrated in Section 6, Fig. 2 (top right), where the p-CG residual stagnates at a reduced maximal attainable accuracy level compared to classic CG, see Fig. 2 (top left).
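The stagnation behavior of p-CG can be reproduced with a direct NumPy transcription of the recurrences of Algorithm 2. The sketch below executes the recurrences sequentially — the communication overlap is a parallel-implementation aspect that does not affect the rounding error behavior — and records the true residual norms; the driver setup is hypothetical.

```python
import numpy as np

def pipelined_cg(A, b, x0, m):
    """Recurrences of Algorithm 2 (p-CG), executed sequentially in NumPy.

    Only the *numerical* behavior of the extra multi-term recurrences is
    reproduced here; no actual communication hiding takes place."""
    x = x0.copy()
    r = b - A @ x
    w = A @ r
    true_res = []
    alpha_old, gamma_old = 1.0, 1.0
    z = s = p = None
    for i in range(m):
        gamma = r @ r
        delta = w @ r
        q = A @ w                                  # spmv (overlapped in a parallel code)
        if i > 0:
            beta = gamma / gamma_old
            alpha = gamma / (delta - beta * gamma / alpha_old)
            z = q + beta * z
            s = w + beta * s
            p = r + beta * p
        else:
            beta = 0.0
            alpha = gamma / delta
            z, s, p = q.copy(), w.copy(), r.copy()
        x = x + alpha * p
        r = r - alpha * s                          # recursively computed residual
        w = w - alpha * z                          # recursively computed w = A r
        true_res.append(np.linalg.norm(b - A @ x))
        alpha_old, gamma_old = alpha, gamma
    return x, true_res

# Example (same hypothetical SPD setup as in the classic CG sketch above):
# x, res = pipelined_cg(A, b, np.zeros(len(b)), 150)
# res[-1] typically stagnates at a (much) higher level than for classic CG.
```

Comparing the final true residual norms of this sketch with those of the classic CG sketch on the same ill-conditioned system typically reproduces the qualitative stagnation difference discussed above.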

3 Analyzing local rounding error propagation in p($l$)-CG

This section aims to analyze the behavior of local rounding errors that stem from the recurrence relations in the p($l$)-CG algorithm with deep pipelines, Algorithm 3. The analysis follows a framework similar to Section 2; however, the p($l$)-CG method is in fact much more closely related to p($l$)-GMRES [21] than it is to p-CG [22]. For a proper understanding we first review the essential definitions and recurrence relations used in the p($l$)-CG algorithm.

3.1 Basis vector recurrence relations in p($l$)-CG in exact arithmetic

(The full listing of the p($l$)-CG algorithm is given in [9]. In each iteration $i$ the algorithm computes the spmv for the next auxiliary basis vector $z_{i+1}$ and initiates the global reduction for the required dot-products; once the pipeline is filled ($i \ge l$) it finalizes the dot-products of $l$ iterations earlier, updates the basis transformation matrix entries $g_{k,j}$ — including the square root computation of the diagonal element, preceded by the check "# Check for breakdown and restart if required" — recursively computes the basis vector $v_{i-l+1}$ via (14) and the vector $z_{i+1}$ via (12), updates the approximate solution, and RETURNs when the residual norm criterion (15) is satisfied.)
Algorithm 3 Deep pipelined CG (p($l$)-CG) Input: $A$, $b$, $x_0$, $l$, $m$, $\tau$, shifts $\sigma_0, \ldots, \sigma_{l-1}$

Let $V_i = [v_0, v_1, \ldots, v_i]$ be the orthonormal basis for the Krylov subspace $\mathcal{K}_{i+1}(A, r_0)$ in iteration $i$ of the p($l$)-CG algorithm, where $A$ is a symmetric matrix. These vectors satisfy the Arnoldi relation

$$ A V_{i-1} = V_i T_i, \qquad (8) $$

where $T_i$ is the $(i+1) \times i$ tridiagonal matrix

$$ T_i = \begin{bmatrix} \gamma_0 & \delta_0 & & \\ \delta_0 & \gamma_1 & \ddots & \\ & \delta_1 & \ddots & \delta_{i-2} \\ & & \ddots & \gamma_{i-1} \\ & & & \delta_{i-1} \end{bmatrix}. $$
For $j \le i-1$ the relation (8) translates in vector notation to the recursive definition of $v_{j+1}$:

$$ v_{j+1} = \left( A v_j - \gamma_j v_j - \delta_{j-1} v_{j-1} \right) / \delta_j. \qquad (9) $$

Note that for $j = 0$ it is assumed that $v_{-1} = 0$. We define the auxiliary vector basis $Z_{i+l} = [z_0, z_1, \ldots, z_{i+l}]$, which runs $l$ vectors ahead of the basis $V_i$ (i.e. the so-called pipeline of length $l$), as

$$ z_j = \begin{cases} \; \prod_{k=0}^{j-1} \left( A - \sigma_k I \right) v_0, & 0 \le j \le l, \\[4pt] \; \prod_{k=0}^{l-1} \left( A - \sigma_k I \right) v_{j-l}, & l < j \le i+l, \end{cases} \qquad (10) $$

with optional stabilizing shifts $\sigma_0, \ldots, \sigma_{l-1}$, see [21, 9]. We comment on the choice of the Krylov basis and its effect on numerical stability further on in this manuscript. Note that contrary to the basis $V_i$, the auxiliary basis $Z_{i+l}$ is in general not orthonormal. For $j < l$ one has the relation:

$$ z_{j+1} = \left( A - \sigma_j I \right) z_j, \qquad (11) $$

whereas for $j \ge l$ the relation (9) for $v_{j-l+1}$ is multiplied on both sides by $\prod_{k=0}^{l-1} (A - \sigma_k I)$ to obtain the recurrence relation for $z_{j+1}$:

$$ z_{j+1} = \left( A z_j - \gamma_{j-l} \, z_j - \delta_{j-l-1} \, z_{j-1} \right) / \delta_{j-l}. \qquad (12) $$

In matrix formulation the expressions (11)-(12) for $z_{j+1}$ translate into the Arnoldi-like relation

$$ A Z_{i+l-1} = Z_{i+l} \, B_{i+l}, \qquad (13) $$

where $B_{i+l}$ is the $(i+l+1) \times (i+l)$ banded matrix

$$ B_{i+l} = \begin{bmatrix} \sigma_0 & & & & & \\ 1 & \ddots & & & & \\ & \ddots & \sigma_{l-1} & & & \\ & & 1 & \gamma_0 & \delta_0 & \\ & & & \delta_0 & \gamma_1 & \ddots \\ & & & & \delta_1 & \ddots \\ & & & & & \ddots \end{bmatrix}, $$

whose first $l$ columns hold the shifts $\sigma_j$ (with ones on the subdiagonal) and whose trailing columns hold the corresponding tridiagonal entries of $T_i$.
The basis vectors $v_k$ and $z_j$ are connected through the basis transformation $Z_j = V_j G_j$ for $j \le i$, with entries $g_{k,j}$ such that $z_j = \sum_k g_{k,j} \, v_k$. The upper triangular matrix $G_j$ has a band structure with a band width of at most $2l+1$ non-zero diagonals, see [9], Lemma 6. Using this basis transformation the following recurrence relation for $v_{j+1}$ is derived:

$$ v_{j+1} = \left( z_{j+1} - \sum_{k=j-2l+1}^{j} g_{k,j+1} \, v_k \right) / \, g_{j+1,j+1}. \qquad (14) $$

The recurrence relations (12) and (14) are used in Alg. 3 to recursively compute the respective basis vectors $z_{i+1}$ and $v_{i-l+1}$ in iterations $i \ge l$ (i.e. once the initial pipeline for $z_1, \ldots, z_l$ has been filled).
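The relations (10)-(12) can be verified numerically. In the sketch below the tridiagonal coefficients $\gamma_j$, $\delta_j$ are obtained from a reference Lanczos run (with re-orthogonalization), the shifts $\sigma_k$ are chosen as Chebyshev points on the spectral interval — one of the basis-stabilizing choices suggested in [21, 9] — and the auxiliary vectors $z_j$ built via (11)-(12) are compared against the polynomial definition (10). All helper names are ours.

```python
import numpy as np

def lanczos(A, r0, steps):
    """Reference Lanczos process: returns V = [v_0, ..., v_steps] and the
    tridiagonal coefficients gamma_j (diagonal) and delta_j (off-diagonal)."""
    n = len(r0)
    V = np.zeros((n, steps + 1))
    gam, dlt = np.zeros(steps), np.zeros(steps)
    V[:, 0] = r0 / np.linalg.norm(r0)
    for j in range(steps):
        u = A @ V[:, j] - (dlt[j - 1] * V[:, j - 1] if j > 0 else 0.0)
        gam[j] = u @ V[:, j]
        u = u - gam[j] * V[:, j]
        u = u - V[:, :j + 1] @ (V[:, :j + 1].T @ u)   # re-orthogonalization
        dlt[j] = np.linalg.norm(u)
        V[:, j + 1] = u / dlt[j]
    return V, gam, dlt

def chebyshev_shifts(l, lmin, lmax):
    """Chebyshev points on [lmin, lmax], a common choice for the sigma_k."""
    k = np.arange(l)
    return (lmax + lmin) / 2 + (lmax - lmin) / 2 * np.cos((2 * k + 1) * np.pi / (2 * l))

# Build z_j by the recurrences (11)-(12) and check z_j = P_l(A) v_{j-l}, eq. (10).
rng = np.random.default_rng(2)
n, l, steps = 120, 2, 20
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))
lam = np.linspace(1.0, 100.0, n)
A = Q @ np.diag(lam) @ Q.T                      # SPD test matrix (illustrative)
V, gam, dlt = lanczos(A, rng.standard_normal(n), steps + l)
sig = chebyshev_shifts(l, lam.min(), lam.max())

Z = [V[:, 0]]                                   # z_0 = v_0
for j in range(steps + l - 1):
    if j < l:                                   # eq. (11)
        Z.append(A @ Z[j] - sig[j] * Z[j])
    else:                                       # eq. (12); the z_{j-1} term drops at j = l
        zm1 = dlt[j - l - 1] * Z[j - 1] if j > l else 0.0
        Z.append((A @ Z[j] - gam[j - l] * Z[j] - zm1) / dlt[j - l])

Pl = np.eye(n)
for s in sig:
    Pl = (A - s * np.eye(n)) @ Pl               # P_l(A) = prod_k (A - sigma_k I)
j = steps                                       # any index j > l
gap = np.linalg.norm(Z[j] - Pl @ V[:, j - l]) / np.linalg.norm(Z[j])
print(f"relative gap between recurrence (12) and definition (10): {gap:.2e}")
```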

Remark

Note that although the residual $r_i$ is not recursively computed in Algorithm 3, it is proven in [9] that the residual norm can be characterized by the quantity $\bar{\delta}_{i-1} \, | e_i^T y_i |$, i.e.

$$ \| r_i \| = \delta_{i-1} \, | e_i^T y_i |, \qquad (15) $$

where $y_i$ is the solution of the $i \times i$ tridiagonal system formed by the first $i$ rows of $T_i$ with right-hand side $\| r_0 \| \, e_1$, which defines the approximate solution $x_i = x_0 + V_{i-1} y_i$.

Remark

Unlike the p-CG Algorithm 2, the p($l$)-CG Algorithm 3 may encounter a square root breakdown during the iteration. When the argument of the square root in the computation of the diagonal element $g_{j+1,j+1}$ of the basis transformation matrix becomes negative (possibly influenced by round-off errors in practice) a breakdown occurs. It is suggested in [9] that when a breakdown occurs the iteration is restarted, in analogy to the (pipelined) GMRES algorithm. Note however that this restarting strategy may delay convergence, cf. Section 6, Fig. 1.
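A sketch of the corresponding guard, as it might appear in an implementation (the function and argument names are hypothetical):

```python
import math

def diag_transform_entry(zz, g_column):
    """Compute the diagonal entry g_{j+1,j+1} = sqrt((z,z) - sum_k g_{k,j+1}^2).

    Returns None when the square root argument turns negative due to
    round-off, in which case the caller should restart the iteration,
    in the spirit of the strategy suggested in [9]."""
    arg = zz - sum(g * g for g in g_column)
    return math.sqrt(arg) if arg >= 0.0 else None
```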

3.2 Local rounding error behavior in finite precision p($l$)-CG

For any $j \ge 0$ the true basis vector, denoted by $v_{j+1}$, satisfies the Arnoldi relation (9) exactly; that is, it is defined as

$$ v_{j+1} = \left( A \bar{v}_j - \bar{\gamma}_j \bar{v}_j - \bar{\delta}_{j-1} \bar{v}_{j-1} \right) / \bar{\delta}_j. \qquad (16) $$

For $j = 0$ it is assumed that $\bar{v}_{-1} = 0$. On the other hand, the computed basis vector $\bar{v}_{j+1}$ is calculated from the finite precision variant of the recurrence relation (14), i.e.

$$ \bar{v}_{j+1} = \left( \bar{z}_{j+1} - \sum_{k=j-2l+1}^{j} \bar{g}_{k,j+1} \, \bar{v}_k \right) / \, \bar{g}_{j+1,j+1} + \delta^v_{j+1}, \qquad (17) $$

where the size of the local rounding error $\delta^v_{j+1}$ can be bounded in terms of the machine precision $\epsilon$ as follows:

$$ \| \delta^v_{j+1} \| \le \frac{ \| \bar{z}_{j+1} \| + \sum_{k=j-2l+1}^{j} | \bar{g}_{k,j+1} | \, \| \bar{v}_k \| }{ | \bar{g}_{j+1,j+1} | } \; \mathcal{O}(\epsilon). $$
By subtracting the computed basis vector $\bar{v}_{j+1}$ from both sides of equation (16), it is easy to see that this relation alternatively translates to

$$ A \bar{v}_j = \bar{\gamma}_j \bar{v}_j + \bar{\delta}_{j-1} \bar{v}_{j-1} + \bar{\delta}_j \bar{v}_{j+1} + \bar{\delta}_j \left( v_{j+1} - \bar{v}_{j+1} \right), $$

or written in matrix notation:

$$ A \bar{V}_{i-1} = \bar{V}_i \bar{T}_i + \Theta_i, \qquad (18) $$

where $\bar{V}_i = [\bar{v}_0, \ldots, \bar{v}_i]$, $\bar{T}_i$ is the computed counterpart of the tridiagonal matrix $T_i$, and the columns of $\Theta_i$ are the scaled gaps $\theta_{j+1} = \bar{\delta}_j \left( v_{j+1} - \bar{v}_{j+1} \right)$ between the true and the computed basis vectors.