## 1 Introduction

In this paper, we consider the use of Krylov subspace methods for solving large, sparse linear systems , where . We will focus on the conjugate gradient (CG) method [25], which is used when is symmetric positive definite. Given an initial approximate solution and corresponding residual , the CG method iteratively updates the approximate solution using a sequence of nested Krylov subspaces , where

denotes the -dimension Krylov subspace with matrix

and starting vector

. In iteration , the updated approximate solution is chosen by imposing the Galerkin condition .Thus each iteration of CG requires a matrix-vector product with in order to expand the dimension of the Krylov subspace and a number of inner products to perform the orthogonalization. On modern computer architectures, the speed with which the matrix-vector products and inner products can be computed is limited by communication (i.e., the movement of data). This limits the potential speed of individual iterations attainable by an implementation of CG. To perform a sparse matrix-vector product in parallel, each processor must communicate entries of the source vector and/or the destination vector that it owns to neighboring processors. Inner products require a global synchronization, i.e., the computation can not proceed until all processors have finished their local computation and communicated the result to other processors. For large-scale sparse problems on large-scale machines, the cost of synchronization between parallel processors can dominate the run-time (see, e.g., the exascale computing report [19, pp. 28]).

Research efforts toward removing the performance bottlenecks caused by communication in CG and other Krylov subspace methods have produced various approaches.
One such approach are the -step Krylov subspace methods (also called “communication-avoiding” Krylov subspace methods); for a thorough treatment of background, related work, and performance experiments, see, e.g., the theses [5, 27]. In -step Krylov subspace methods, instead of performing one iteration at a time, the iterations are performed in blocks of ; i.e., in each iteration, the Krylov subspace is expanded by dimensions by computing new basis vectors and then all inner products between the new basis vectors needed for the next iterations are computed in one block operation. In this way, computing the inner products for iterations only requires a single global synchronization, decreasing the synchronization cost per iteration by a factor of . This approach has been shown to lead to significant speedups for a number of problems and real-world applications (see, e.g., [35, 43]).
In the remainder of the paper, we will refer to the matrices whose columns consist of the basis vectors computed in each block as *-step basis matrices*.
Further details of the -step CG method are discussed in section 2. We emphasize that our use of the overloaded term “-step methods” here differs from other works, e.g., [14, 30] and [21, §9.2.7], in which ‘-step method’ refers to a type of restarted Lanczos procedure.

In exact arithmetic the -step CG method produces the exact same iterates as the classical CG method, but their behavior can differ significantly in finite precision. In both -step and classical Krylov subspace methods, rounding errors due to finite precision arithmetic have two basic effects: a decrease in attainable accuracy and a delay of convergence. It has long been known that for -step Krylov subspace methods, as is increased (and so the condition numbers of the -step basis matrices increase), the attainable accuracy decreases and the convergence delay increases relative to the classical CG method (see, e.g., [10]). At the extreme, if the parameter is chosen to be too large, the -dimensional bases computed for each block can be numerically rank deficient and the -step method can fail to converge.

This sensitive numerical behavior poses a practical obstacle to optimizing the performance of -step methods, and diminishes their usability and reliability. In a setting where the performance of CG is communication-bound, we expect that up to some point, increasing will decrease the time per iteration. If we pick only based on minimizing the time per iteration, however, we can run into problems.
First, the finite precision error may cause a large convergence delay, negating any potential performance gain with respect to the overall runtime. Since the number of iterations required for convergence for a given value is not known a priori, choosing the value that results in the fastest time-to-solution is a difficult problem. Second, the chosen parameter may cause -step CG to fail to converge to the user-specified accuracy. In this case, the particular problem is *unsolvable* by the -step CG method.

Requiring the user to choose the parameter thus diminishes the practical usability of -step Krylov subspace methods. It is therefore imperative that we develop a better understanding of the convergence rate and accuracy in finite precision -step CG and other -step Krylov subspace methods. Our hope is that by studying the theoretical properties of methods designed for large-scale computations in finite precision, we can develop methods and techniques that are efficient, capable of meeting application-dependent accuracy constraints, and which do not require that the user have extensive knowledge of numerical linear algebra.

Toward this goal, in this paper we develop a variable -step CG method in which the values can vary between blocks and are chosen automatically by the algorithm such that a user-specified accuracy is attained. We call this the *adaptive -step CG method*.
The condition for determining in each block is derived using a bound on the size of the gap between the true residuals and the updated residuals computed in finite precision.
We show that in order to prevent a loss of attainable accuracy, in a given block of iterations, the condition number of the -step basis matrix should be no larger than a quantity related to the inverse of the 2-norms of the residuals computed within that block. In other words, as the method converges, more ill-conditioned bases (i.e., larger values) can be used without a loss of attainable accuracy.
Furthermore, we show that the adaptive -step CG method can be implemented in a way that does not increase the synchronization cost per block versus (fixed) -step CG.

Whereas the fixed -step CG method may not attain the same accuracy as classical CG, our numerical experiments on a set of example matrices demonstrate that our adaptive -step CG method can achieve the same accuracy as classical CG, while still reducing the number of synchronizations required by up to over a factor of .

In section 2 we give a brief background on -step Krylov subspace methods and detail the -step CG method, and in section 3 we discuss related work. Section 4 presents our main contribution; in section 4.1, we derive a theoretical bound on how large the -step basis matrix condition number can be in terms of the norms of the updated residuals without affecting the attainable accuracy, and in section 4.2, we detail the adaptive -step CG algorithm that makes use of this bound in determining the number of iterations to perform in each block. Numerical experiments for a small set of test matrices are presented in section 5. We discuss possible extensions and future work and briefly conclude in section 6.

## 2 Background: -step Krylov subspace methods

The basic idea behind -step Krylov subspace methods is to grow the underlying Krylov subspace dimensions at a time and perform the necessary orthogonalization with a single global synchronization. Within a block of iterations, the vector updates are performed implicitly by updating the coordinates of the iteration vectors in the -dimensional basis spanned by the columns of the -step basis matrix. The algorithmic details vary depending on the particular Krylov subspace method considered, but the general approach is the same. There is a wealth of literature related to -step Krylov subspace methods. We give a brief summary of early related work below; a thorough bibliography can be found in [27, Table 1.1].

The -step approach was first used in the context of Krylov subspace methods by Van Rosendale [41], who developed a variant of the parallel conjugate gradient method which minimizes inner product data dependencies with the goal of exposing more parallelism. The term “-step Krylov subspace methods” was first used by Chronopoulos and Gear, who developed an -step CG method [9, 10]. This work was followed in subsequent years by the development of other -step variants of Orthomin and GMRES [11], Arnoldi and Symmetric Lanczos [31, 32], MINRES, GCR, and Orthomin [8, 13], Nonsymmetric Lanczos [33], and Orthodir [12]. Walker similarly used -step bases, but with the goal of improving stability in GMRES by replacing the modified Gram-Schmidt orthogonalization process with Householder QR [42].

Many of the earliest -step Krylov subspace methods used monomial bases (i.e., ) for the computed -step basis matrices, but it was found that convergence often could not be guaranteed for (see, e.g., [10]). This motivated research into the use of other more well-conditioned bases for the Krylov subspaces, including scaled monomial bases [26], Chebyshev bases [29, 16, 17], and Newton bases [1, 20]. The growing cost of communication in large-scale sparse problems has created a recent resurgence of interest in the implementation, optimization, and development of -step Krylov subspace methods; see, e.g., the recent works [18, 27, 35, 46, 23, 34, 44, 45, 43].

### 2.1 The -step conjugate gradient method

In this work, we focus on the -step conjugate gradient method (Algorithm 2), which is equivalent to the -step CG method (Algorithm 2) in exact arithmetic. In the remainder of this section, we give a brief explanation of the -step CG method; further details on the derivation, implementation, and performance of the -step CG method can be found in, e.g., [5, 27].

The -step CG method consists of an outer loop, indexed by , which iterates over the blocks of iterations, and an inner loop, which iterates over iterations within a block. For clarity, we globally index iterations by . It follows from the properties of CG that for ,

(1) |

Then the CG vectors for the next iterations lie in the sum of the column spaces of the matrices

(2) |

where is a polynomial of degree satisfying the three-term recurrence

(3) |

We define the -step basis matrix and define to be the same as except with all zeros in columns and . Note that the ’s in the subscripts are unnecessary here, but will be useful in describing the variable -step CG method later on. Under certain assumptions on the nonzero structure of , can be computed in parallel in each outer loop for the same asymptotic latency cost as a single matrix-vector product using the so-called “matrix powers kernel” (see [18] for details). Since the columns of satisfy (3), we can write

(4) |

where is a block tridiagonal matrix defined by the 3-term recurrence coefficients in (3).

By (1), there exist vectors , , and that represent the coordinates of the CG iterates , , and , respectively, in for . That is,

(5) | ||||

Therefore, in the inner loop of -step CG, rather than update the CG vectors explicitly, we instead update their coordinates in , i.e., for ,

Thus the matrix-vector product with in each iteration becomes a matrix-vector product with the much smaller matrix . This along with the length- vector updates can be performed locally by each processor in each inner loop iteration.

We can also reduce communication in computing the inner products. Letting , and using (5) and (4), and can be computed by

The matrix can be computed with one global synchronization per outer loop iteration, which, in terms of asymptotic parallel latency, costs the same as a single inner product. As and are dimension , and can be computed locally by each processor within the inner loop.

## 3 Related work

In this section, we discussed related work in the areas of variable -step Krylov subspace methods, the analysis of maximum attainable accuracy in finite precision CG, and inexact Krylov subspace methods.

### 3.1 Variable -step Krylov subspace methods

Williams et al. [43] use a variable -step BICGSTAB as the coarse grid solver in a multigrid method. The technique used here, termed “telescoping”, is motivated by the observation than some coarse grid solves converge after only a few iterations, whereas other solves take significantly longer. By starting with a small value and gradually increasing it, they ensure that -step BICGSTAB with larger is only used when the solve takes enough iterations to amortize the additional costs associated with -step CG.

Imberti and Erhel [28] have recently developed an -step GMRES method that allows variable sizes. They recommend choosing values according to the Fibonacci sequence up to some maximum value , i.e., starting with and for , . In this way, the sequence used by Imberti and Erhel is predetermined.

In contrast, in our approach, the sequence is dynamically chosen based on the 2-norm of the updated residual (which is not necessarily monotonically decreasing in CG). In addition, our method is designed such that a user-specified accuracy can be attained when , where is the accuracy attained by classical CG for the same problem.

### 3.2 Maximum attainable accuracy in Krylov subspace methods

In both -step and classical variants of Krylov subspace methods, finite precision roundoff error in updates to the approximate solution and the residual in each iteration can cause the updated residual and the true residual to grow further and further apart as the iterations proceed. If this deviation grows large, it can limit the *maximum attainable accuracy*, i.e., the accuracy with which we can solve on a computer with unit round-off .
Analyses of maximum attainable accuracy in CG and other classical KSMs are given by Greenbaum [22], van der Vorst and Ye [40], Sleijpen, van der Vorst, and Fokkema [37], Sleijpen, van der Vorst, and Modersitzki [38], Björck, Elfving, and Strakoš [2], and Gutknecht and Strakoš [24].
One important result of these analyses is the insight that loss of accuracy can be caused at a very early stage of the computation, which can not be corrected in later iterations. Analyses of the maximum attainable accuracy in -step CG and the -step biconjugate gradient method (BICG) can be found in [6, 5].

### 3.3 Inexact Krylov subspace methods

The term “inexact Krylov subspace methods” refers to Krylov subspace methods in which the matrix-vector products are computed inexactly as where represents some error matrix (due to either finite precision computation or intentional approximation). It is assumed that all other computations (orthogonalization and vector updates) are performed exactly. Using an analysis of the deviation of the true and updated residuals, it is shown that under these assumptions the size of the perturbation term can be allowed to grow inversely proportional to the norm of the updated residual without adversely affecting the attainable accuracy. This theory has potential application in mixed-precision Krylov subspace methods, as well as in situations where the operator is expensive to apply and can be approximated in order to gain performance.

Much work has been dedicated to inexact Krylov subspace methods. In [36]

, Simoncini and Szyld provide a general theory for inexact variants applicable to both symmetric and nonsymmetric linear systems and eigenvalue problems and use this to derive criteria for bounding the size of

in such a way that does not diminish the attainable accuracy of the method. A similar analysis was given by van de Eschof and Sleijpen [39]. These analyses confirm and improve upon the earlier empirical results for the inexact GMRES method of Bouras and Fraysseé [3] and the inexact conjugate gradient method of Bouras, Frayssé, and Giraud [4].Our work is very closely related to the theory of inexact Krylov subspace methods in that our analysis results in a somewhat analogous “relaxation strategy”, where instead of relaxing the accuracy of the matrix-vector products we are instead relaxing a constraint on the condition numbers of the computed -step basis matrices. In addition, our analysis assumes that all parts of the computation are performed in a fixed precision .

## 4 Variable -step CG

As discussed, the conditioning of the -step basis matrices in -step Krylov subspace methods affects the
rate of convergence and attainable accuracy.
In the -step CG method (Algorithm 2), the Krylov subspace basis is computed in blocks of size in each outer loop.
This method can be generalized to allow the blocks to be of varying size; in other words, a different value can be used in each outer
loop iteration. We denote the block size in outer iteration by , where for some maximum value
. We call this a *variable -step Krylov subspace method*. A general variable -step CG method is shown in Algorithm 3.

In this section, we derive a variable -step CG method which automatically determines in each outer loop such that a user-specified accuracy can be attained, which we call the adaptive -step CG method. Our analysis is based on the maximum attainable accuracy analysis for -step CG in [6], which shows that the attainable accuracy of -step CG depends on the condition numbers of the -step basis matrices computed in each outer loop. In summary, we show that if in outer loop , is selected such that the condition number of the basis matrix is inversely proportional to the maximum 2-norm of the residual vectors computed within outer loop , then the approximate solution produced by the variable -step CG method can be as accurate as the approximate solution produced by the classical CG method. Further, our method allows for the user to specify the desired accuracy, so in the case that a less accurate solution is required, our method automatically adjusts to allow for higher values. This effectively exploits the tradeoff between accuracy and performance in -step Krylov subspace methods.

In this section we derive our adaptive -step approach, which stems from a bound on the growth in outer loop of the gap between the true and updated residuals in finite precision.

### 4.1 A constraint on basis matrix condition number

In the remainder of the paper, hats denote quantities computed in finite precision, denotes the unit round-off, denotes the 2-norm, and denotes the 2-norm condition number . To simplify the analysis, we drop all terms (and higher order terms in ). Writing the true residual , we can write the bound

Then as (i.e., as the method converges), we have .
The size of the *residual gap*, i.e., , therefore determines the attainable accuracy of the method (see the related references in section 3.2).

We begin by considering the rounding errors made in the variable -step CG method (Algorithm 3). In outer loop and inner loops of finite precision variable -step CG, using standard models of floating point error (e.g., [21, §2.4]) we have

(6) | ||||

(7) | ||||

(8) | ||||

(9) | ||||

(10) | ||||

where denotes the maximum number of nonzeros per row in and . Let denote the residual gap in iteration . Similar to the analysis in [6], for variable -step CG we can write the growth of the residual gap in iteration in outer loop as

Our goal is now to manipulate a bound on this quantity in order to derive a method for determining the largest value we can use in outer loop such that the desired accuracy is attained. Using standard techniques, we have the norm-wise bound

where we have used . To simplify the notation, we let and . We note that depends on the polynomial basis used and is not expected to be too large; e.g., for the monomial basis, . The above bound can then be written

(11) | ||||

Note that assuming is full rank, we have

(12) | ||||

(13) |

To bound in terms of the size of the residuals, notice that by (9),

so

Thus together with (13) we can write

This allows us to write the bound (11) in the form

Using , we have

and writing , we have

Then

and letting , this bound can be written

(15) |

This gives a norm-wise bound on the growth of the residual gap due to finite precision errors made in outer iteration .

Letting , notice that we have

Suppose that at the end of the iterations we want the size of the residual gap to be on the order (where , as this is the best we could expect). Then assuming the number of outer iterations is not too high, this can be accomplished by requiring that in each outer loop, for ,

From (15), this means that we must have, for all inner iterations ,

(16) |

The bound (16) tells us that as the 2-norm of the residual decreases, we can tolerate a more ill-conditioned basis matrix without detriment to the attainable accuracy. Since grows with increasing , this suggests that can be allowed to increase as the method converges at a rate proportional to the inverse of the residual 2-norm. This naturally gives a relaxation strategy that is somewhat analogous to those derived for inexact Krylov subspace methods (see section 3.3). We discuss the implementation of this strategy in the following subsection.

### 4.2 The adaptive -step CG method

In this section, we propose a variable -step CG method that makes use of the constraint (16) in determining in each outer loop , which we call the *adaptive* -step CG method. Our approach uses the largest value possible in each outer iteration such that (16) is satisfied up to some set by the user (e.g., could be selected by auto-tuning to find the value that minimizes the time per iteration in -step CG). Since a variable -step CG method produces the same iterates as classical CG when

Comments

There are no comments yet.