1 Introduction
The Rolling Stones rollingStones  You Can’t Always Get What You Want …
No, you can’t always get what you want
You can’t always get what you want
You can’t always get what you want
But if you try sometimes, well you just might find
You get what you need
Let be an realvalued matrix, where and are vectors, where is a positive integer with . In this manuscript, appears in the context of two parallel distributedmemory computations: (1) finding an approximate solution to a linear system with a Krylov subspace method such as GMRES Saad86 and (2) finding the eigenvalues of using the KrylovSchur algorithm Stewart86. In both instances, orthgonalization of the Krylov basis vectors is required.
The GramSchmidt algorithm orthogonalizes a set of basis vectors. Krylov methods for linear system solvers and eigenvalue computations both depend upon the orthogonality of the basis vectors for the Krylov subspace in finite precision arithmetic. The Arnoldi algorithm applies GramSchmidt orthogonalization to the Krylov basis vectors which are columns of the matrix in the factorization . The loss of orthogonality of the computed basis vectors can deviate substantially from the machine precision . When the orthogonality of the basis vectors is completely lost, the Krylov iterations, may fail to converge. For the solution of linear systems of equations , Paige et. al. 2006simaxpaigerozloznikstrakos have shown that when the loss of orthogonality is bounded by , where , then the MGSGMRES algorithm is backward stable for the solution of linear systems. Here, is the condition number of the matrix , where
are the singular values of
. For eigenvalue computations employing the KrylovSchur algorithm, accurate and converged eigenvalue approximations are obtained when the loss of orthogonality remains close to . In this paper, a stable Arnoldi algorithm that achieves machine precision orthogonality is presented that performs well on a distributedmemory parallel computer.The orthogonalization kernels require a global reduction in the form of MPI_AllReduce and these communication patterns are expensive on a distributedmemory parallel computer BienzGroppOlson2019. Our algorithms are designed such that they require only one reduction to normalize each vector and apply orthogonal projections. Our focus is on algorithms that orthogonalize the Krylov vectors one column at a time as opposed to blocks (e.g. only one column becomes available at a time, and thus the vectors are processed in a “leftlooking” fashion). The Krylov basis dimension , is assumed to be much smaller than the dimension of the coefficient matrix . is then referred to as a tall and skinny matrix. These are typical of Krylov iterative methods for solving a linear system of equations or eigenvalue problems, which rely on the Arnoldi expansion.
The classical GramSchmidt (CGS) algorithm is preferable for parallel computation because it requires only two global reductions for each column vector (a projection step, followed by a vector normalization). In practice, however, CGS leads to numerical instability for the solution of and also eigenvalues, because the loss of orthogonality is bounded above by . This bound was conjectured for a long time and finally proven in the two papers glre:nm:05; smbl:nm:06. The GMRES iteration will stall and fail to converge if orthogonality of the Krylov vectors is completely lost, for example, when as described by Paige paige2018Book, where the matrix was introduced in Paige et. al. 2006simaxpaigerozloznikstrakos. In order to obtain backward stable eigenvalues from the KrylovSchur algorithm, Stewart Stewart86 demonstrates that orthogonality of the Krylov vectors is sufficient.
To reduce the loss of the orthogonality to the level of machine precision , the CGS algorithm can be applied twice (CGS2) to reorthogonalize the basis vectors. This is the “twice is enough” result from Kahan and Parlett doi:10.1137/1.9781611971163, which has also been proven by Giraud et. al. in glre:nm:05. We follow the convention that, under the assumption for a given input matrix and constant then CGS2 can construct orthogonal columns with machine precision, Theorem 2 in glre:nm:05. The number of floating point operations (flops) for CGS2 is therefore (twice the cost of CGS), and requires three global reductions.
A onereduce variant of CGS2 was derived in slayt:nla:20 and is applied to the Arnoldi iteration in this paper. It achieves the same loss of orthogonality as the original CGS2 algorithm but requires only one reduction per column vector. To reduce the number of global reductions and avoid cancellation errors, the normalization step is lagged and the Pythagorean theorem is employed. The reorthogonalization is also delayed to the next iteration and thus is performed “onthefly” as opposed to applying the entire algorithm in a second pass. The resulting algorithm combines these two steps into one global reduction and is referred to as the delayed DCGS2.
The DCGS2 algorithm is analysed in the context of the factorization and the Arnoldi expansion. For the latter, the algorithm must not only retain the level of orthogonality but also maintain the backward stability of the factorization (referred to as the representation error). Specifically, during the Arnoldi iterations, a sparse matrixvector product (SpMV) is applied to compute the next basis vector. Because of the delay, the algorithm computes the next vector from the previous vector that has been orthogonalized once. Hence, in the next iteration, the vector must be scaled appropriately. DCGS2 consists of two essential ingredients: (1) Stephen’s trick and (2) an Arnoldi representation error correction (not needed for the traditional factorization).
Extensive numerical results are presented in the context of the KrylovSchur eigenvalue algorithm to demonstrate the numerical stability and accuracy of the DCGS2 algorithm. Performance results are presented for the ORNL Summit supercomputer (two 22core IBM Power 9 CPUs and six NVIDIA Volta 100 GPUs per node) to demonstrate that the DCGS2 algorithm improves the CGS2 performance by a factor of up to on GPUs, while maintaining the same loss of orthogonality as the original CGS2 algorithm.
2 Related Work
Even though MGSGMRES is backward stable for the solution of linear systems, the CGS2GMRES algorithm was found to be more scalable for massively parallel computation in a 1998 study by Fraysé et. al. Frayse98 and included in the Trilinos framework by Bavier et. al. Belos. The more recent development of a onereduce MGSGMRES algorithm by Swirydowicz et. al. slayt:nla:20 implies that a reevaluation of these results is certainly warranted in the context of a onereduce DCGS2.
The traditional CGS2 algorithm requires three global reductions to orthogonalize each vector: one for the first projection, another for the second pass and a third for the normalization. Our onereduce DCGS2 algorithm delays reorthogonalization. This is achieved by lagging the normalization as originally proposed by Kim and Chronopoulos Kim92) and then applying Stephen’s trick. The Pythagorean trick introduced by Smoktunowicz et. al. smbl:nm:06 avoids cancellation errors and Carson et. al. 2020_arXive_Carson generalize this to block GramSchmidt algorithms. The delayed normalization for the Arnoldi iteration was employed by Hernandez et al. 2007hernandezparco without a correction. The algorithm presented by these authors is not forward stable for eigenvalue computations because the loss of orthogonality is as least , see Section 6. Delaying the normalization also requires a scaling to update the Krylov vectors in the Arnoldi expansion.
The complexity of the CGS algorithm is where has the dimensions by. The recent work slayt:nla:20; 2020yamazakiproceedingsofsiampp20 describe a onereduce inverse compact ICWYMGS algorithm with a triangular solve in the projection step. This BLAS level2 implementation of ICWY MGS requires one synchronization per step, the same as our new DCGS2. The original BLAS level1 formulation of modified GramSchmidt (MGS) requires flops (the same as CGS) and applies the elementary orthogonal projections sequentially, requiring a separate global reduction for each innerproduct. The complexity of ICWYMGS is and thus it requires 50% more flops than CGS or the traditional version of MGS, however, it is less costly than CGS2. The orthogonality error of MGS is bounded by (see Björck in 1967 Bjork_1967). Thus, MGS maintains orthogonality better than CGS but worse than CGS2.
An MGS2 algorithm with reorthogonalization exists, analogous to CGS2. In practice, the MGS2 algorithm exhibits an loss of orthogonality. The number of flops for MGS2 is . (Double the cost of CGS or MGS and the same as CGS2). Level 2 MGS2 requires flops ( for construction of the triangular correction matrix , for the first pass and for the second pass). These costs will be important considerations in our strong scaling studies of our new algorithms on the ORNL Summit supercomputer.
3 DCGS2 Algorithm for the Factorization
In this section, the classical GramSchmidt algorithm to compute the factorization is reviewed, where the matrix has dimensions . Algorithm 1 displays the pseudocode for the –th step of the traditional CGS2 algorithm. The algorithm orthogonalizes the column vector against the orthogonal column vectors of the matrix using a single projection:
followed by a second application of the projection in the form of a reorthogonalization pass,
Finally, the vector
is normalized to produce the orthonormal vector
,The factorization of (i.e. ), is produced at the end of the –th step. However, CGS2 still requires three global reductions: one for the first projection, another for the second pass, and the third for the normalization (steps 1, 3, and 5 respectively).
Algorithm 2 displays the pseudocode of the DCGS2 algorithm for computing the factorization. In order to obtain one global reduction, the second projection and normalization are lagged or delayed to the next step. The purpose of the Pythagorean trick is to mitigate cancellation errors due to finite precision arithmetic. Namely, the norm of the updated vector is computed as follows
where and the orthogonality of is assumed in finite precision arithmetic to . This corresponds to Step 3 in Algorithm 2.
Because the normalization is delayed, Stephen’s trick allows us to compute the norm with versus at the beginning of each iteration. Finally, for the orthogonalization of the column vector , the scalar is computed by using instead of . Therefore, after the computation of , the norm must be corrected
This correction corresponds to step 7 (Stephen’s trick) in Algorithm 2 given below.
For the final –th step, the traditional CGS2 algorithm is applied. This incurs two additional synchronizations.
4 DCGS2 Algorithm for the Arnoldi Expansion
Algorithm 3 displays the pseudocode of the standard CGS2 algorithm for the Arnoldi expansion. The only difference from the factorization in Algorithm 1 is that the next basis vector is generated by applying the sparsematrix vector multiply (SpMV) to the previously normalized column vector . At the end of the –th step, in exact arithmetic, the matrices would satisfy the Arnoldi expansion,
(1) 
The DCGS2 algorithm for the Arnoldi expansion will now be derived, requiring only one global reduction for each column vector. The representation error and loss of orthogonality are maintained at the same level as the traditional Arnoldi expansion computed with the CGS2 algorithm. To achieve one global reduction, the reorthogonalization and the normalization are delayed to the next step, and then combined.
When the reorthogonalization is delayed, the next basis vector is generated by applying an SpMV to the current vector. Namely, the next vector is computed as by using the vector instead of , where is the resulting vector Thus, an efficient strategy is required to compute from and also to generate the Hessenberg matrix elements that satisfy the Arnoldi relation.
After the delay, the reorthogonalized and normalized vector , is computed as follows
(2) 
Equation (2) can also be written as
(3) 
where is an upper triangular matrix.
Multiplying (2) by from the left, it follows that
Next the vector is computed, which is the resulting vector after orthogonalizing against the previous vectors :
The last term is dropped from (4), for two reasons:

The DCGS2 algorithm is constructed such that the loss of orthogonality is , and

is expected to be bounded by . Hence, when , the norm of the term is bounded by .
Therefore, at this point our sign is now an approximation and
where in the last equation.
Finally, using (2), it is possible to compute as
where . This step is Stephen’s trick in the context of Arnoldi.
After substitution of this expression, it follows that
where , and
The –th column of the Hessenburg matrix is computed as follows and satisfies the Arnoldi relation (1). First, reorder (4) in a factorization form:
From (2), it also follows that
which represents the orthogonalization of the vector . By replacing in (4) with the expression in (4), obtain
This is the representation of in the Krylov subspace having orthonormal basis vectors using the matrices and . However, the representation of in is needed with the matrix . Namely, write (4) as
From these equations, is computed using and .
Now, replacing with (4), we obtain
where

represents the factorization,

and are the reorthogonalization terms,

corrects the vector norm (Stephen’s trick),

and are the representation error correction terms, and

is the scaling factor.
Items 1, 2 and 3 are present in both the QR and Arnoldi algorithms. Items 4 and 5 are specific to the Arnoldi algorithm.
5 Computation and Communication Costs
Gram Schmidt algorithms are listed in Tables 1 and 2. Although theoretically equivalent, they exhibit different behaviour in finite precision arithmetic. Their computation and communication costs also vary. All the schemes, except MGS level1, are based on BLAS level2 operations. The level1 implementation of MGS applies each orthogonal projection sequentially to a vector and does not take advantage of the BLAS level2 such as DGEMM matrixmatrix multiplication. In addition, this algorithm requires one global reduction (MPI_AllReduce) per iteration to orthogonalize a vector, i.e., global reductions at the –th step. The level2 BLAS implementation of ICWYMGS batches together the orthogonal projections and computes one row of the strictly lower triangular matrix slayt:nla:20.
The resulting inverse compact projector is given by
where the triangular correction matrix is given by
The implied triangular solve requires an additional flops at the th step and leads to the slightly higher operation count compared to the level1 MGS. However, it requires only one global reduction, and hence the number of synchronizations does not depend on the number of vectors to be orthogonalized.
In the case of the DCGS2 algorithm, the correction matrix was derived in Appendix 1 of slayt:nla:20 and is given by
This form of the projector results in an loss of orthogonality and was employed in the step and pipelined GMRES described in 2020yamazakiproceedingsofsiampp20. When the matrix is split into and and applied across two iterations of the DCGS2 algorithm, the resulting loss of orthogonality is in practice.
orth scheme  flops per step  synchs  bandwidth 

MGS level 1  
MGS ICWY  1  
CGS  2  
CGS2  3  
CGS2 (lagged norm)  2  
DCGS2  1 
orth scheme  flops per step  synchs  bandwidth 

MGS level 1  
MGS ICWY  
CGS  
CGS2  
CGS2 (lagged norm)  
DCGS2 
Both of the CGS and CGS2 algorithms are based on level2 BLAS operations. CGS applies a single projection, and then normalizes, requiring two synchronizations. This projection step consists of two DGEMV operations and one DDOT for the normalization. CGS suffers from at least an loss of orthogonality. CGS2 acheives through reorthogonalization (see Figure 1). The additional projection within CGS2 accounts for one additional global reduction per step and an additional operations.
The DCGS2 algorithm requires one reduction (compared to the three reductions needed by CGS2) and employs level3 BLAS for the computation in a tallandskinny DGEMM. This leads to a higher sustained performance of DCGS2 compared with CGS2 (e.g. the GigaFlop/sec). In the context of the Arnoldi algorithm, DCGS2 requires an additional flops at the th step. The additional cost is due to the Arnoldi representation trick explained in Section 4. The correction term that is needed for Arnoldi requires an additional operations from a matrixvector product using the Hessenberg matrix in the Arnoldi expansion.
orth scheme  LOO  Proven 

MGS level 1  Bjork_1967  
MGS level 2  Conjectured  
CGS  glre:nm:05  
CGS2  glre:nm:05  
CGS2 (lagged norm)  Conjectured  
DCGS2  Conjectured 
6 Numerical Results
In this section, the numerical properties of these orthogonalization schemes in the context of the Arnoldi iteration are determined experimentally as formal proofs of backward stability are forthcoming
6.1 Manteuffel Matrix
The matrix generated by “central differences” presented in 1978_mantueffel_adaptive by Manteuffel is employed in our tests of the KrylovSchur eigenvalue solver based on the DCGS2 and Arnoldi algorithms. The matrix is convenient for an evaluation of the orthognalization schemes. The Manteuffel matrix is expressed as the sum
Comments
There are no comments yet.