## 1 Introduction

Several variants of the conjugate gradient algorithm (CG) for solving a symmetric positive definite linear system have been proposed to make better use of parallelism; see, e.g., [19, 20, 14, 21, 3, 4, 8]. While all of these algorithms are mathematically equivalent, they behave differently when implemented in finite precision arithmetic. One difference is in the ultimately attainable accuracy of the computed solution. All of these algorithms compute an initial residual $r_0 = b - A x_0$, where $x_0$ is the initial guess for the solution, and then compute updated “residual” vectors $r_k$, $k = 1, 2, \ldots$, using a recurrence formula. In finite precision arithmetic, however, these updated vectors may differ from the actual residuals $b - A x_k$, where $x_k$ is the approximate solution vector generated at step $k$. When this difference becomes significant, the norms of the updated vectors may or may not continue to decrease, but the actual residual norm levels off. The level of accuracy of the approximate solution when this occurs is studied in [2, 5].

In this paper, we consider what happens before the actual and updated residual vectors start to deviate significantly. Even during this stage, the different variants may show different convergence patterns on difficult problems. This is a phenomenon that we wish to understand. On simpler problems, where the eigenvalues of the coefficient matrix are distributed more uniformly, the algorithms may all behave very similarly. This, too, is something that needs a mathematical explanation, since it may hold even after agreement with exact arithmetic is lost.
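The deviation between updated and true residuals is easy to reproduce. The following sketch is our own illustration (not taken from the papers cited); the $12 \times 12$ Hilbert matrix is an arbitrary ill-conditioned SPD example, and the iteration count is chosen only for demonstration. It runs the Hestenes–Stiefel recurrences and tracks the gap $\|r_k - (b - A x_k)\|$:

```python
import numpy as np

# Ill-conditioned SPD test matrix: 12x12 Hilbert matrix (illustrative choice only).
n = 12
A = np.array([[1.0 / (i + j + 1) for j in range(n)] for i in range(n)])
x_true = np.ones(n)
b = A @ x_true

# Hestenes-Stiefel CG recurrences.
x = np.zeros(n)
r = b.copy()          # updated residual, maintained by recurrence
p = r.copy()
gaps = []             # || r_k - (b - A x_k) ||
for k in range(50):
    Ap = A @ p
    a = (r @ r) / (p @ Ap)
    x = x + a * p
    r_new = r - a * Ap                  # recurrence update, NOT b - A x
    bk = (r_new @ r_new) / (r @ r)
    r = r_new
    p = r + bk * p
    gaps.append(np.linalg.norm(r - (b - A @ x)))

# The updated and true residuals agree at first, then slowly drift apart.
print(gaps[0], gaps[-1])
```

On such a problem the gap starts at the level of the machine precision and grows as rounding errors accumulate, while the recurrence residual $r_k$ keeps shrinking.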

A good deal of work beginning in the 1980s (and in the thesis of Paige [17] dating back to 1971) has been aimed at explaining the behavior of the Lanczos and conjugate gradient algorithms in finite precision arithmetic; see, for example, [16, 10, 12, 6]. In a seminal paper [16], Paige showed that a particular implementation of the Lanczos algorithm has certain nice properties even when implemented in finite precision arithmetic, and that those properties can be used to establish results about the eigenvalue/eigenvector approximations generated during a finite precision Lanczos computation. A nice summary of this work can be found in [18]. Later, these same properties were used to establish results about the convergence of the conjugate gradient algorithm in finite precision arithmetic. Natural questions to ask are: Which of the various proposed implementations satisfy these properties? Do those that satisfy the properties used in Paige’s analysis behave better than those that do not? For those that do not, are there other ways to explain their behavior?

In the following subsections we review the properties that have been assumed in order to establish good convergence results for the CG algorithm. As far as we know, these properties have not been rigorously proved even for standard implementations of the CG algorithm; in [10], for instance, it was simply assumed that these properties hold. We will not rigorously establish such properties in this paper either but will check numerically whether or not they hold for a number of test problems and whether satisfaction of such properties coincides with faster convergence (in terms of number of iterations).

Throughout the paper, $A$ will denote a real symmetric positive definite matrix, although the results are easily extended to complex Hermitian positive definite matrices. The symbol $\| \cdot \|$ will denote the 2-norm for vectors and the corresponding spectral norm for matrices.

### 1.1 Finite Precision Lanczos Computations

In [10] an analogy was established between finite precision Lanczos computations with matrix $A$ and initial vector $v_1$ and exact Lanczos applied to a larger matrix $\hat{A}$ whose eigenvalues all lie in tiny intervals about the eigenvalues of $A$. The initial vector $\hat{v}_1$ associated with $\hat{A}$ was such that the sum of squares of its components in the directions of eigenvectors of $\hat{A}$ corresponding to eigenvalues in the interval about an eigenvalue $\lambda_i$ of $A$ was approximately equal to the square of the component of $v_1$ in the direction of the corresponding eigenvector of $A$. This meant that theorems (that assume exact arithmetic) about the behavior of the Lanczos algorithm applied to such matrices $\hat{A}$ with such initial vectors $\hat{v}_1$ could be applied to finite precision computations with matrix $A$ and initial vector $v_1$. The assumptions needed for this analogy to hold were that the vectors generated by the finite precision computation satisfied

$$A V_k = V_k T_k + \beta_{k+1} v_{k+1} \xi_k^T + F_k, \qquad (1)$$

where $V_k$ is the $n$ by $k$ matrix whose columns are $v_1, \ldots, v_k$, $T_k$ is a symmetric tridiagonal matrix, $\xi_k$ is the $k$th unit vector $(0, \ldots, 0, 1)^T$, and $F_k$, which accounts for rounding errors, has columns $f_j$, $j = 1, \ldots, k$, satisfying

$$\| f_j \| \leq \epsilon_1 \| A \|, \qquad (2)$$

where $\epsilon_1$ is a modest multiple of the machine precision $\epsilon$. It is further assumed that, because of the choice of the normalization coefficients $\beta_{j+1}$, the 2-norm of each vector $v_j$ is approximately $1$, say, $\| v_j \| = 1 + O(\epsilon)$, and the inner product of successive pairs of vectors is almost $0$:

$$| \langle v_j, v_{j+1} \rangle | \leq \epsilon_0. \qquad (3)$$

The analysis in [10] applies to any computation that satisfies these assumptions for some modest $\epsilon_0$ and $\epsilon_1$. This analysis relies heavily on the work of Paige [16, 17], who showed that a good finite precision implementation of the Lanczos algorithm satisfies these assumptions, with explicit bounds on the quantities denoted here as $\epsilon_0$ and $\epsilon_1$.
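These assumptions can be checked numerically. The following sketch is our own illustration (the random test matrix, dimensions, and tolerances are assumptions for demonstration): it runs the standard Lanczos three-term recurrence without reorthogonalization and measures the column norms of the recurrence residual, which plays the role of $F_k$, together with the successive inner products in (3):

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 200, 40

# Random SPD matrix, scaled to have spectral norm ~1.
M = rng.standard_normal((n, n))
A = M @ M.T
A /= np.linalg.norm(A, 2)

# Lanczos three-term recurrence, no reorthogonalization.
v = rng.standard_normal(n)
v /= np.linalg.norm(v)
V = [v]
alphas, betas = [], []
v_prev = np.zeros(n)
beta = 0.0
for j in range(k):
    w = A @ V[-1] - beta * v_prev
    alpha = V[-1] @ w
    w = w - alpha * V[-1]
    beta = np.linalg.norm(w)
    v_prev = V[-1]
    V.append(w / beta)
    alphas.append(alpha)
    betas.append(beta)

Vk = np.column_stack(V[:k])
T = np.diag(alphas) + np.diag(betas[:-1], 1) + np.diag(betas[:-1], -1)

# Recurrence residual: columns should involve only local rounding errors.
F = A @ Vk - Vk @ T - betas[-1] * np.outer(V[k], np.eye(k)[-1])
col_norms = np.linalg.norm(F, axis=0)
succ_ip = [abs(V[j] @ V[j + 1]) for j in range(k)]
print(col_norms.max(), max(succ_ip))
```

Even though orthogonality among *distant* Lanczos vectors is typically lost, the recurrence residual columns and the *successive* inner products stay at the roundoff level, which is exactly what (2) and (3) require.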

### 1.2 Relation Between CG Residuals and Lanczos Vectors

The conjugate gradient algorithm for solving a symmetric positive definite linear system can be written in the following form, due to Hestenes and Stiefel [13]:
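A minimal sketch of the well-known Hestenes–Stiefel iteration follows (our code; the coefficient formulas $a_k = \langle r_k, r_k \rangle / \langle p_k, A p_k \rangle$ and $b_{k+1} = \langle r_{k+1}, r_{k+1} \rangle / \langle r_k, r_k \rangle$ are standard, while the test system at the bottom is an assumption for illustration):

```python
import numpy as np

def hscg(A, b, x0, num_iters):
    """Hestenes-Stiefel conjugate gradient (HSCG) sketch.

    The updated residual r_k is maintained by recurrence rather than
    recomputed as b - A x_k at each step.
    """
    x = x0.copy()
    r = b - A @ x          # r_0 = b - A x_0
    p = r.copy()           # p_0 = r_0
    for _ in range(num_iters):
        Ap = A @ p
        a = (r @ r) / (p @ Ap)            # a_k = <r_k,r_k> / <p_k, A p_k>
        x = x + a * p
        r_new = r - a * Ap
        bk = (r_new @ r_new) / (r @ r)    # b_{k+1} = <r_{k+1},r_{k+1}> / <r_k,r_k>
        r = r_new
        p = r + bk * p
    return x

# Usage: a small, well-conditioned SPD system (illustrative).
rng = np.random.default_rng(1)
M = rng.standard_normal((50, 50))
A = M @ M.T + 50 * np.eye(50)
x_star = rng.standard_normal(50)
b = A @ x_star
x = hscg(A, b, np.zeros(50), 100)
```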

It is well-known that if $v_1 = r_0 / \| r_0 \|$ in the Lanczos algorithm, then subsequent Lanczos vectors are just scaled versions of the corresponding CG residuals. To see this from the HSCG algorithm, we first note that the residual vectors $r_k$, $k = 1, 2, \ldots$, satisfy a 3-term recurrence:

$$r_{k+1} = r_k - a_k A r_k + \frac{a_k b_k}{a_{k-1}} \left( r_k - r_{k-1} \right).$$

If we define normalized residuals by $\hat{r}_k = r_k / \| r_k \|$, then these vectors satisfy

$$\| r_{k+1} \|\, \hat{r}_{k+1} = \| r_k \|\, \hat{r}_k - a_k \| r_k \|\, A \hat{r}_k + \frac{a_k b_k}{a_{k-1}} \left( \| r_k \|\, \hat{r}_k - \| r_{k-1} \|\, \hat{r}_{k-1} \right),$$

or,

$$A \hat{r}_k = -\frac{\| r_{k+1} \|}{a_k \| r_k \|}\, \hat{r}_{k+1} + \left( \frac{1}{a_k} + \frac{b_k}{a_{k-1}} \right) \hat{r}_k - \frac{b_k \| r_{k-1} \|}{a_{k-1} \| r_k \|}\, \hat{r}_{k-1}.$$

Finally, noting that $b_k = \| r_k \|^2 / \| r_{k-1} \|^2$, this becomes

$$A \hat{r}_k = -\frac{\sqrt{b_{k+1}}}{a_k}\, \hat{r}_{k+1} + \left( \frac{1}{a_k} + \frac{b_k}{a_{k-1}} \right) \hat{r}_k - \frac{\sqrt{b_k}}{a_{k-1}}\, \hat{r}_{k-1}. \qquad (4)$$

Thus, if $\hat{R}_k$ is the $n$ by $k$ matrix whose columns are $\hat{r}_0, \ldots, \hat{r}_{k-1}$, then

$$A \hat{R}_k = \hat{R}_k T_k - \frac{\sqrt{b_k}}{a_{k-1}}\, \hat{r}_k \xi_k^T, \qquad (5)$$

where $\xi_k = (0, \ldots, 0, 1)^T$ and $T_k$ is a symmetric tridiagonal matrix with diagonal entries $\frac{1}{a_{j-1}} + \frac{b_{j-1}}{a_{j-2}}$, $j = 1, \ldots, k$ (where terms involving negative indices are taken to be $0$), and sub and super diagonal entries $-\frac{\sqrt{b_j}}{a_{j-1}}$, $j = 1, \ldots, k-1$. It follows that if formula (5) can be replaced by something of the form (1) when the columns of $\hat{R}_k$ come from normalizing “residual” vectors in a finite precision CG computation, with the computed vectors satisfying properties (2) and (3) as well, then the analysis of [10] will apply to the finite precision CG computation. We emphasize again that it will give information about the rate at which the updated residual vectors decrease in norm, and thus is of interest only as long as these updated vectors resemble the actual residuals, $b - A x_k$.
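The correspondence between Lanczos vectors and normalized CG residuals can be observed numerically. The sketch below is our own illustration (the well-conditioned random SPD test matrix and the step count are assumptions): it runs HSCG and standard Lanczos started from $v_1 = r_0 / \|r_0\|$ and checks that each Lanczos vector matches the corresponding normalized residual up to sign (standard Lanczos chooses $\beta_{j+1} > 0$, so the signs alternate):

```python
import numpy as np

rng = np.random.default_rng(2)
n, steps = 60, 8
M = rng.standard_normal((n, n))
A = M @ M.T + n * np.eye(n)   # well-conditioned SPD (illustrative choice)
b = rng.standard_normal(n)

# HSCG with x0 = 0, storing normalized residuals r_k / ||r_k||.
x = np.zeros(n)
r, p = b.copy(), b.copy()
rhat = [r / np.linalg.norm(r)]
for _ in range(steps):
    Ap = A @ p
    a = (r @ r) / (p @ Ap)
    x = x + a * p
    r_new = r - a * Ap
    bk = (r_new @ r_new) / (r @ r)
    r = r_new
    p = r + bk * p
    rhat.append(r / np.linalg.norm(r))

# Lanczos with v1 = r0 / ||r0||.
v = b / np.linalg.norm(b)
V = [v]
v_prev, beta = np.zeros(n), 0.0
for _ in range(steps):
    w = A @ V[-1] - beta * v_prev
    alpha = V[-1] @ w
    w = w - alpha * V[-1]
    beta = np.linalg.norm(w)
    v_prev = V[-1]
    V.append(w / beta)

# Up to sign, v_{k+1} coincides with rhat_k.
overlaps = [abs(V[k] @ rhat[k]) for k in range(steps + 1)]
```

For a well-conditioned problem and a modest number of steps, each overlap is $1$ to roughly machine precision.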

### 1.3 Implications for Finite Precision CG Implementations

Under these conditions, the analysis in [10] shows that the updated residual vectors converge at the rate predicted by exact arithmetic theory for a symmetric positive definite matrix whose condition number $\kappa'$ is just slightly larger than the condition number $\kappa$ of $A$:

$$\frac{\| r_k \|}{\| r_0 \|} \lesssim 2 \sqrt{\kappa'} \left( \frac{\sqrt{\kappa'} - 1}{\sqrt{\kappa'} + 1} \right)^{k}. \qquad (6)$$

It shows further that the $A$-norm of the error in the finite precision computation – that is, the quantity $\| x - x_k \|_A$, where $x = A^{-1} b$ – is reduced at about the same rate as the $A$-norm of the error in exact CG applied to $\hat{A}$:

$$\frac{\| x - x_k \|_A}{\| x - x_0 \|_A} \lesssim 2 \left( \frac{\sqrt{\kappa'} - 1}{\sqrt{\kappa'} + 1} \right)^{k}. \qquad (7)$$

A sharper bound on the quantities in (6) and (7) can be given in terms of the size of the $k$th degree minimax polynomial on the union of tiny intervals containing the eigenvalues of $A$; if these intervals are $[\lambda_i - \delta, \lambda_i + \delta]$, $i = 1, \ldots, n$, then the quantity

$$\min_{p_k \in \mathcal{P}_k,\; p_k(0) = 1}\ \max_{z \in \cup_{i=1}^{n} [\lambda_i - \delta,\, \lambda_i + \delta]} | p_k(z) | \qquad (8)$$

is an upper bound for the quantity on the left in (7), and $\sqrt{\kappa'}$ times this value is an upper bound for the left-hand side of (6). For some eigenvalue distributions, the size of this minimax polynomial is not much less than that of the Chebyshev polynomial on the entire interval $[\lambda_{\min} - \delta, \lambda_{\max} + \delta]$, on which the bounds in (6) and (7) are based, but for other eigenvalue distributions, the difference can be great.

These bounds are independent of the initial residual $r_0$. With knowledge of the size of the components of $r_0$ in the directions of each eigenvector of $A$, the analysis in [10] gives additional insight into the convergence of a finite precision CG computation. It behaves like exact CG applied to a matrix $\hat{A}$ whose eigenvalues lie in tiny intervals about the eigenvalues of $A$, with an initial residual $\hat{r}_0$ satisfying

$$\sum_{j} \langle \hat{r}_0, \hat{\varphi}_{ij} \rangle^2 \approx \langle r_0, \varphi_i \rangle^2, \qquad (9)$$

where $\varphi_i$ is a normalized eigenvector of $A$ corresponding to eigenvalue $\lambda_i$, $\hat{\varphi}_{ij}$ is a normalized eigenvector of $\hat{A}$ corresponding to eigenvalue $\hat{\lambda}_{ij}$, and the sum over $j$ is the sum over all eigenvalues of $\hat{A}$ in the small interval about $\lambda_i$. In some cases, even assuming exact arithmetic, bounds based on the size of the minimax polynomial on the set of eigenvalues are large overestimates for observed convergence rates: while for any given $k$ there is an initial residual for which equality holds at step $k$ [9], the components of that initial residual may differ by hundreds of orders of magnitude. Such an initial residual could not even be represented on a machine with standard bounds on exponent size, so whatever the initial residual in the finite precision computation is, it is necessarily far from the worst possible one.

In the following sections, we consider three variants of the conjugate gradient algorithm and try to determine which ones satisfy the assumptions necessary for the analysis in [10] to apply; namely, (1), (2), and (3). We do not do a complete rounding error analysis, as that quickly becomes complicated. Instead, we indicate how such an analysis might go and then check numerically, using test problems, which variants satisfy the needed assumptions. We demonstrate that for problems where the size of the small intervals in (8) makes a significant difference in the convergence rate of exact CG applied to problems with eigenvalues in those intervals, different CG variants tend to converge differently in finite precision arithmetic, with significant correlation between convergence rate and the level to which (1), (2), and (3) are satisfied. For problems where the interval size in (8) makes little difference in the convergence rate of exact CG, as long as $\delta$ is “small”, or where the upper bound (7) adequately describes the convergence of exact CG, the finite precision implementations that we consider all converge similarly until the ultimate level of accuracy is reached.

The analysis in [10] is complicated, and the bounds on the interval size are not at all tight. Bounds of the form (6) and (7) can be obtained more easily by other means, if it is assumed that (1)–(3) hold (or, at least, that the eigenvalues of $T_k$ in (1) lie essentially between the largest and smallest eigenvalues of $A$), and if it is also assumed that the approximate solution $x_k$ generated by the finite precision computation satisfies

$$x_k \approx x_0 + \| r_0 \|\, V_k T_k^{-1} \xi_1, \qquad (10)$$

where $\xi_1 = (1, 0, \ldots, 0)^T$ and $V_k$ and $T_k$ are the matrices generated by the finite precision computation. (In exact arithmetic, this would be an exact formula for $x_k$.) For example, it is shown in [6, Theorem 2.2], using a simple proof with realistic bounds for the roundoff terms, that a bound of the form (6) holds.

A more general estimate is given in [15, Theorem 6.2] when the Lanczos algorithm is used to approximate $f(A) b$ for general functions $f$:

$$\left\| f(A) b - \| b \|\, V_k f(T_k)\, \xi_1 \right\| \leq c\, \| b \| \min_{p \in \mathcal{P}_k}\ \max_{z \in [\alpha, \beta]} | f(z) - p(z) |,$$

where $\mathcal{P}_k$ is the set of polynomials of degree at most $k$, $\alpha$ is slightly less than the smallest eigenvalue of $A$ (and assumed to be positive), $\beta$ is slightly greater than the largest eigenvalue of $A$, $c$ is a modest constant, and (for convenience only) a zero initial guess is assumed. If $f(z) = 1/z$, then this gives a bound on the 2-norm of the error in solving a linear system:

$$\| x - x_k \| \leq c\, \| b \| \min_{p \in \mathcal{P}_k}\ \max_{z \in [\alpha, \beta]} \left| \frac{1}{z} - p(z) \right|.$$

If we write $p$ in the form $p(z) = (1 - q(z))/z$, where $q$ is a polynomial of degree at most $k+1$ with $q(0) = 1$, and take $q$ to be the $(k+1)$st degree Chebyshev polynomial on the interval $[\alpha, \beta]$, normalized to have value $1$ at the origin, then we obtain the bound

$$\| x - x_k \| \leq \frac{2 c\, \| b \|}{\alpha} \left( \frac{\sqrt{\beta / \alpha} - 1}{\sqrt{\beta / \alpha} + 1} \right)^{k+1}.$$

Finally, to relate the 2-norm of the error at step $k$ to the 2-norm of the initial error $\| x - x_0 \| = \| A^{-1} b \| \geq \| b \| / \beta$, we can write

$$\frac{\| x - x_k \|}{\| x - x_0 \|} \leq 2 c\, \frac{\beta}{\alpha} \left( \frac{\sqrt{\beta / \alpha} - 1}{\sqrt{\beta / \alpha} + 1} \right)^{k+1}. \qquad (11)$$

## 2 Some CG Variants Designed to Make Better Use of Parallelism

While matrix-vector multiplication can be parallelized and vectors can be partitioned among different processors in the HSCG algorithm, almost none of the operations in that algorithm can be performed simultaneously. Looking at the algorithm of the previous section, it can be seen that at each iteration, the matrix-vector product $A p_k$ must be started, with at least part of it completed, before computation of the inner product $\langle p_k, A p_k \rangle$ can begin. This inner product must be completed before the vectors $x_{k+1}$ and $r_{k+1}$ can be formed, and $r_{k+1}$ must be at least partly completed before computation of the next inner product $\langle r_{k+1}, r_{k+1} \rangle$ can begin. This inner product must be completed before $p_{k+1}$ can be formed, and at least part of $p_{k+1}$ must be completed before the start of the next iteration, which computes $A p_{k+1}$. It has been observed that waiting for the two inner products to complete can be very costly when using large numbers of processors [1, 5].

Several mathematically equivalent CG variants have been devised to allow overlapping of inner products with each other and with the matrix-vector multiplication in each iteration of the algorithm. In the following subsections, we consider two of these: one due to Chronopoulos and Gear [3, 4] (CGCG) that allows either overlapping of the two inner products or overlapping of one of these with the matrix-vector product, and a pipelined version due to Ghysels and Vanroose [8] (GVCG) that allows overlapping of both inner products as well as the matrix-vector multiplication.

We give an indication of how closely equation (1) might be satisfied by each of these variants, along with the original HSCG algorithm, when rounding errors affect the computation. We do not do a complete rounding error analysis but measure the quantities in (2) and (3) for a set of test problems and determine if the size of these quantities correlates with the rate of convergence (in terms of number of iterations) before the ultimately attainable accuracy is achieved. All of our experiments are performed on a single processor using standard double precision arithmetic, and we do not consider the timing of the algorithms, only the number of iterations required to reach a given level of accuracy.

### 2.1 HSCG

When the Hestenes and Stiefel algorithm of the previous section is implemented in finite precision arithmetic, the vectors $r_k$ and $p_k$ satisfy

$$r_{k+1} = r_k - a_k A p_k + \eta_{k+1}, \qquad p_{k+1} = r_{k+1} + b_{k+1} p_k + \tau_{k+1},$$

where the roundoff terms $\eta_{k+1}$ and $\tau_{k+1}$ satisfy

$$\| \eta_{k+1} \| \leq \epsilon \left( \| r_k \| + |a_k|\, c\, \| A \|\, \| p_k \| \right), \qquad \| \tau_{k+1} \| \leq \epsilon \left( \| r_{k+1} \| + |b_{k+1}|\, \| p_k \| \right),$$

where $\epsilon$ is a modest multiple of the machine precision, and the constant $c$ depends on the method used for matrix-vector multiplication. If $A$ is an $n$ by $n$ matrix with at most $m$ nonzeros in any row and if the matrix-vector product is computed in the standard way, then $c$ can be taken to be proportional to $m$. We assume that the coefficients $a_k$ and $b_k$ are computed according to the formulas in the HSCG algorithm, namely,

$$a_k = \frac{\langle r_k, r_k \rangle}{\langle p_k, A p_k \rangle}, \qquad b_k = \frac{\langle r_k, r_k \rangle}{\langle r_{k-1}, r_{k-1} \rangle},$$

with any errors in computing these formulas being included in the $\eta$ and $\tau$ terms. It follows that

$$r_{k+1} = r_k - a_k A r_k + \frac{a_k b_k}{a_{k-1}} \left( r_k - r_{k-1} \right) + \psi_{k+1},$$

where

$$\psi_{k+1} = \eta_{k+1} - \frac{a_k b_k}{a_{k-1}}\, \eta_k - a_k A \tau_k. \qquad (12)$$

Defining $\hat{r}_k = r_k / \| r_k \|$, we can write

$$\| r_{k+1} \|\, \hat{r}_{k+1} = \| r_k \|\, \hat{r}_k - a_k \| r_k \|\, A \hat{r}_k + \frac{a_k b_k}{a_{k-1}} \left( \| r_k \|\, \hat{r}_k - \| r_{k-1} \|\, \hat{r}_{k-1} \right) + \psi_{k+1},$$

or,

$$A \hat{r}_k = -\frac{\| r_{k+1} \|}{a_k \| r_k \|}\, \hat{r}_{k+1} + \left( \frac{1}{a_k} + \frac{b_k}{a_{k-1}} \right) \hat{r}_k - \frac{b_k \| r_{k-1} \|}{a_{k-1} \| r_k \|}\, \hat{r}_{k-1} + \frac{\psi_{k+1}}{a_k \| r_k \|}.$$

With the formula for $b_k$, this takes a form similar to (4):

$$A \hat{r}_k = -\frac{\sqrt{b_{k+1}}}{a_k}\, \hat{r}_{k+1} + \left( \frac{1}{a_k} + \frac{b_k}{a_{k-1}} \right) \hat{r}_k - \frac{\sqrt{b_k}}{a_{k-1}}\, \hat{r}_{k-1} + \frac{\psi_{k+1}}{a_k \| r_k \|}. \qquad (13)$$

From (12), the last term can be written as

$$\frac{1}{a_k \| r_k \|} \left( \eta_{k+1} - \frac{a_k b_k}{a_{k-1}}\, \eta_k - a_k A \tau_k \right). \qquad (14)$$

This is the $k$th column of $F_k$ in (1), and while we have not proved that it remains below $\epsilon_1 \| A \|$ in norm, at least it involves only local rounding errors.

### 2.2 CGCG

Chronopoulos and Gear proposed the following version of the CG algorithm to make better use of parallelism [3]:
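A minimal sketch of the Chronopoulos–Gear iteration follows (our code; it uses the standard coefficient identity $a_k = \mu_k / (\nu_k - b_k \mu_k / a_{k-1})$ with $\mu_k = \langle r_k, r_k \rangle$ and $\nu_k = \langle A r_k, r_k \rangle$, and the test system at the bottom is an assumption for illustration):

```python
import numpy as np

def cgcg(A, b, x0, num_iters):
    """Chronopoulos-Gear CG (CGCG) sketch.

    w_k = A r_k is the only matvec per iteration; s_k = A p_k is maintained
    by recurrence, and a_k is obtained from <r_k,r_k> and <w_k,r_k>, so the
    two inner products can be computed simultaneously.
    """
    x = x0.copy()
    r = b - A @ x
    w = A @ r
    mu = r @ r            # <r_k, r_k>
    nu = w @ r            # <w_k, r_k> = <r_k, A r_k>
    a = mu / nu           # a_0 (since p_0 = r_0)
    p, s = r.copy(), w.copy()
    for _ in range(num_iters):
        x = x + a * p
        r = r - a * s
        w = A @ r                                # the single matrix-vector product
        mu_new = r @ r
        nu = w @ r
        bk = mu_new / mu                         # b_{k+1}
        a_new = mu_new / (nu - bk * mu_new / a)  # a_{k+1}
        mu, a = mu_new, a_new
        p = r + bk * p
        s = w + bk * s                           # s_{k+1} = A p_{k+1} in exact arithmetic
    return x

# Usage on a small, well-conditioned SPD system (illustrative).
rng = np.random.default_rng(3)
M = rng.standard_normal((50, 50))
A = M @ M.T + 50 * np.eye(50)
x_star = rng.standard_normal(50)
b = A @ x_star
x = cgcg(A, b, np.zeros(50), 60)
```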

Notice that the computation of $\langle r_k, r_k \rangle$ can be overlapped with that of $A r_k$. Alternatively, once $w_k = A r_k$ has been formed, the two inner products $\langle r_k, r_k \rangle$ and $\langle w_k, r_k \rangle$ can be computed simultaneously. In exact arithmetic, the additional vector $s_k$ is equal to $A p_k$.

The CGCG algorithm can be written in the form (4) in much the same way as the HSCG algorithm. For the finite precision computation, the relevant formulas are:

$$r_{k+1} = r_k - a_k s_k + \eta_{k+1}, \qquad w_k = A r_k + \theta_k, \qquad s_k = w_k + b_k s_{k-1} + \sigma_k,$$

where the roundoff terms $\eta_{k+1}$, $\theta_k$, and $\sigma_k$ satisfy

$$\| \eta_{k+1} \| \leq \epsilon \left( \| r_k \| + |a_k|\, \| s_k \| \right), \qquad \| \theta_k \| \leq \epsilon\, c\, \| A \|\, \| r_k \|, \qquad \| \sigma_k \| \leq \epsilon \left( \| w_k \| + |b_k|\, \| s_{k-1} \| \right).$$

Eliminating the $w$'s and $s$'s, we can obtain a three-term recurrence for $r_k$:

$$r_{k+1} = r_k - a_k A r_k + \frac{a_k b_k}{a_{k-1}} \left( r_k - r_{k-1} \right) + \psi_{k+1},$$

where

$$\psi_{k+1} = \eta_{k+1} - \frac{a_k b_k}{a_{k-1}}\, \eta_k - a_k \left( \theta_k + \sigma_k \right).$$

Defining $\hat{r}_k = r_k / \| r_k \|$, and proceeding exactly as was done for HSCG, we obtain equation (13), where now the last term is

$$\frac{1}{a_k \| r_k \|} \left( \eta_{k+1} - \frac{a_k b_k}{a_{k-1}}\, \eta_k - a_k \left( \theta_k + \sigma_k \right) \right). \qquad (15)$$

Again, this involves only local roundoff terms.

### 2.3 GVCG

This algorithm, developed by Ghysels and Vanroose [8] and also known as pipelined CG, is the most parallel of the three versions of CG that we consider:
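A minimal sketch of the pipelined iteration follows (our code, in the notation used below, with auxiliary vectors $s_k$, $w_k$, $z_k$, $q_k$; the test system at the bottom is an assumption for illustration):

```python
import numpy as np

def gvcg(A, b, x0, num_iters):
    """Ghysels-Vanroose pipelined CG (GVCG) sketch.

    The single matvec q = A w can be overlapped with both inner products;
    in exact arithmetic the auxiliary vectors satisfy s = A p, w = A r,
    z = A s, q = A w.
    """
    x = x0.copy()
    r = b - A @ x
    w = A @ r
    p = np.zeros_like(r)
    s = np.zeros_like(r)
    z = np.zeros_like(r)
    mu_prev = a_prev = None
    for _ in range(num_iters):
        mu = r @ r          # <r_k, r_k>   (overlappable)
        nu = w @ r          # <w_k, r_k>   (overlappable)
        q = A @ w           # the single matvec (overlappable)
        if mu_prev is None:
            bk, a = 0.0, mu / nu
        else:
            bk = mu / mu_prev
            a = mu / (nu - bk * mu / a_prev)
        z = q + bk * z
        s = w + bk * s
        p = r + bk * p
        x = x + a * p
        r = r - a * s
        w = w - a * z       # w_{k+1} = A r_{k+1} in exact arithmetic
        mu_prev, a_prev = mu, a
    return x

# Usage on a small, well-conditioned SPD system (illustrative).
rng = np.random.default_rng(4)
M = rng.standard_normal((50, 50))
A = M @ M.T + 50 * np.eye(50)
x_star = rng.standard_normal(50)
b = A @ x_star
x = gvcg(A, b, np.zeros(50), 60)
```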

Note that the computation of both inner products $\langle r_k, r_k \rangle$ and $\langle w_k, r_k \rangle$ required at each iteration can be overlapped with each other and with the matrix-vector product, $q_k = A w_k$, as well as with some of the vector operations. In exact arithmetic, the auxiliary vectors satisfy $s_k = A p_k$, $w_k = A r_k$, $z_k = A s_k$, $q_k = A w_k$.

In finite precision arithmetic, the vectors in the GVCG algorithm satisfy

$$r_{k+1} = r_k - a_k s_k + \eta_{k+1}, \qquad s_k = w_k + b_k s_{k-1} + \sigma_k,$$
$$w_{k+1} = w_k - a_k z_k + \mu_{k+1}, \qquad z_k = q_k + b_k z_{k-1} + \zeta_k, \qquad q_k = A w_k + \theta_k,$$

where the roundoff terms satisfy

$$\| \eta_{k+1} \| \leq \epsilon \left( \| r_k \| + |a_k|\, \| s_k \| \right), \qquad \| \sigma_k \| \leq \epsilon \left( \| w_k \| + |b_k|\, \| s_{k-1} \| \right),$$
$$\| \mu_{k+1} \| \leq \epsilon \left( \| w_k \| + |a_k|\, \| z_k \| \right), \qquad \| \zeta_k \| \leq \epsilon \left( \| q_k \| + |b_k|\, \| z_{k-1} \| \right), \qquad \| \theta_k \| \leq \epsilon\, c\, \| A \|\, \| w_k \|.$$

When we try to eliminate the auxiliary vectors and form a three-term recurrence for $r_k$, we find

$$r_{k+1} = r_k - a_k w_k + \frac{a_k b_k}{a_{k-1}} \left( r_k - r_{k-1} \right) + \eta_{k+1} - \frac{a_k b_k}{a_{k-1}}\, \eta_k - a_k \sigma_k.$$

It was noted that in exact arithmetic $w_k = A r_k$, so we can write this recurrence in the form

$$r_{k+1} = r_k - a_k A r_k + \frac{a_k b_k}{a_{k-1}} \left( r_k - r_{k-1} \right) - a_k \left( w_k - A r_k \right) + \eta_{k+1} - \frac{a_k b_k}{a_{k-1}}\, \eta_k - a_k \sigma_k. \qquad (16)$$

The amount by which the computed vector $r_k$ fails to satisfy a three-term recurrence now depends not only on local rounding errors, but also on the amount by which $w_k$ differs from $A r_k$. This will involve rounding errors made at all previous steps. To see the size of this difference, subtract $A$ times the equation for $r_{k+1}$ from the equation for $w_{k+1}$:

$$w_{k+1} - A r_{k+1} = \left( w_k - A r_k \right) - a_k \left( z_k - A s_k \right) + \mu_{k+1} - A \eta_{k+1},$$

and apply this recursively to obtain

$$w_{k+1} - A r_{k+1} = \left( w_0 - A r_0 \right) - \sum_{j=0}^{k} a_j \left( z_j - A s_j \right) + \sum_{j=0}^{k} \left( \mu_{j+1} - A \eta_{j+1} \right). \qquad (17)$$

To determine the size of the difference between $z_k$ and $A s_k$, subtract $A$ times the equation for $s_k$ from that for $z_k$ and apply recursively to find

$$z_k - A s_k = \sum_{j=0}^{k} \left( \prod_{i=j+1}^{k} b_i \right) \left( \theta_j + \zeta_j - A \sigma_j \right).$$

Finally, noting that $b_i = \| r_i \|^2 / \| r_{i-1} \|^2$, one can replace the above products to obtain

$$z_k - A s_k = \sum_{j=0}^{k} \frac{\| r_k \|^2}{\| r_j \|^2} \left( \theta_j + \zeta_j - A \sigma_j \right).$$

Substituting this expression into (17) and (17) into (16), we see the amount by which $r_k$ may fail to satisfy the three-term recurrence that it satisfied, to within local roundoff errors, in the other algorithms. This suggests that the matrix $F_k$ in (1) may be significantly larger for this algorithm than for the others.
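This accumulation can be observed directly. The sketch below is our own illustration (the Hilbert matrix is an assumed ill-conditioned test case, and the iteration count is arbitrary): it runs the pipelined recurrences and tracks $\| w_k - A r_k \|$, which involves rounding errors from all previous steps:

```python
import numpy as np

# Hilbert matrix as an illustrative ill-conditioned SPD test case.
n = 8
A = np.array([[1.0 / (i + j + 1) for j in range(n)] for i in range(n)])
b = A @ np.ones(n)

# Pipelined recurrences, tracking the deviation of w from A r.
x = np.zeros(n)
r = b.copy()
w = A @ r
p = np.zeros(n)
s = np.zeros(n)
z = np.zeros(n)
mu_prev = a_prev = None
devs = []                       # || w_k - A r_k ||
for _ in range(30):
    mu = r @ r
    nu = w @ r
    q = A @ w
    if mu_prev is None:
        bk, a = 0.0, mu / nu
    else:
        bk = mu / mu_prev
        a = mu / (nu - bk * mu / a_prev)
    z = q + bk * z
    s = w + bk * s
    p = r + bk * p
    x = x + a * p
    r = r - a * s
    w = w - a * z
    mu_prev, a_prev = mu, a
    devs.append(np.linalg.norm(w - A @ r))

print(devs[0], devs[-1])
```

The deviation starts at the roundoff level and typically grows over the iterations, in contrast to the purely local error terms of HSCG and CGCG.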

## 3 Some Test Problems

The following problem – `bcsstk03` from the BCSSTRUC1 set (BCS Structural Engineering
Matrices) in the Harwell-Boeing collection [7] – was studied in [2].
It is a $112$ by $112$ matrix with a condition number of roughly $7 \times 10^6$. For convenience, we
normalized the matrix so that the matrix we used had norm 1. In exact arithmetic,
the CG algorithm would obtain the exact solution in at most $n$ steps. Results of
running HSCG, CGCG, and GVCG are plotted in Figure 1.
We set a random solution vector $x$ and computed $b = A x$, and we used a zero initial guess $x_0 = 0$.
Computations were carried out in MATLAB, using standard double precision arithmetic.
The figure shows the $A$-norm of the error at each step $k$,
$\| x - x_k \|_A$, divided by the $A$-norm of the
initial error. Also shown is the upper bound (7), which is a large
overestimate for all variants. Note that the different variants of CG not only
reach different levels of accuracy, but even before the ultimate accuracy level
is reached, they converge at different rates.
The fastest (in terms of number of iterations) is HSCG, followed by CGCG, with GVCG
requiring the most iterations.

The situation is different, however, for other matrices in this collection, which might be considered more realistic in terms of the size of problems that would likely be solved with CG. Figure 2 shows the convergence of HSCG, CGCG, and GVCG for six other test matrices. Here each matrix was prescaled by its diagonal (to avoid possibly different rounding errors in preconditioned variants using the diagonal as a preconditioner), and the value printed on each plot is the condition number of the prescaled matrix. Again, we set a random solution vector and a zero initial guess. While there is still some difference in the attainable level of accuracy for the different variants, until this level is reached, all methods converge at essentially the same rate. The bound (7) is shown as well, and while this provides a good estimate of the actual convergence rate for some of the problems, it is a large overestimate for others.

For several of these problems, we computed

(18) |