I Introduction
The Gram-Schmidt algorithm constructs an orthogonal basis from a set of linearly independent vectors spanning a vector space. The classical Gram-Schmidt (CGS) algorithm is column-oriented, whereas the modified Gram-Schmidt (MGS) algorithm has both row and column variants. The numerical properties of both variants were clarified by Björck
[3], who established the backward stability of the QR factorization A = QR produced by MGS for least squares problems. The columns of Q lose orthogonality in both cases, and bounds on ||I - Q^T Q|| have been derived and improved by several authors. For the modified algorithm, Björck [3] derived a bound proportional to the condition number κ(A). The situation is worse for the classical algorithm, where the loss of orthogonality depends on κ²(A), as shown in [7] and [8]. For both CGS and MGS, the orthogonality can be restored to the level of machine roundoff by reorthogonalization of the columns of Q. Iterative reorthogonalization was discussed in Leon et al. [12], along with the "twice is enough" result of Kahan, which was later generalized by Giraud et al. [8]. A subsequent paper by Smoktunowicz et al. [22] introduced a more stable Cholesky-like CGS algorithm using a Pythagorean identity for the normalization of the basis vectors.

We present low synchronization variants of the iterated classical (CGS) and modified Gram-Schmidt (MGS) algorithms that require one and two global reduction communication steps. The MGS algorithms are based on the compact form of the orthogonalization process described in Björck [4] and Malard and Paige [13]. Our main contribution is to introduce a lag into the column-oriented MGS algorithm in order to delay normalization of the diagonal elements of R. In addition, the reductions can be overlapped with the computation and pipelined, as demonstrated by Ghysels et al. [6] and Yamazaki et al. [26]. Derivations of low communication algorithms for iterated CGS can be based on the paper by Ruhe [18], who establishes the equivalence of reorthogonalization for classical Gram-Schmidt with the Gauss-Jacobi iteration for the normal equations, and of modified Gram-Schmidt with Gauss-Seidel. The author unrolls the recurrence for reorthogonalization, thus obtaining a loop-invariant form of the algorithm.
The loop invariant allows us to combine the two reorthogonalization iterations into one step and thereby eliminate the need for an additional global reduction. The loss-of-orthogonality relation for the matrix Q is also exploited to reduce the computation and communication overhead.
Both the classical and modified Gram-Schmidt algorithms employ vector inner products and norms (level-1 and level-2 BLAS operations) to construct an orthonormal set of basis vectors. On modern multicore computers, global reductions use a tree-like sum of scalar partial products involving every core (MPI rank) on every node of the machine. The latency of this operation grows with the height of the tree, which is determined by the number of cores. The remaining operations required in the orthogonalization are scalable. These consist of the mass inner product Q^T w, the vector MAXPY operation w := w - Q r, and a sparse matrix-vector product (SpMV) at each iteration. For sparse matrices distributed by rows, the parallel SpMV typically requires only local communication between cores. The BLAS-1 vector sum is trivially data-parallel. Hence, the global reductions are the limiting factor for the available parallelism in the algorithm. Our MGS algorithm reduces the number of global reductions per iteration to only one.
II Gram-Schmidt and Reorthogonalization
Ruhe [18] provides a unique perspective by proving that the iterated classical and modified Gram-Schmidt algorithms are equivalent to Gauss-Jacobi and Gauss-Seidel iterations for linear systems. Let A be a nonsingular matrix with factorization A = QR obtained from the Gram-Schmidt algorithm. Consider one step of the classical Gram-Schmidt algorithm, with orthonormal vectors q_1, ..., q_j forming a matrix Q_j, and the vector a_{j+1}, which is to be orthogonalized to form the (j+1)-st column of Q. The classical Gram-Schmidt algorithm with reorthogonalization is presented as Algorithm 1.
Each iteration in Steps 3–5 is a reorthogonalization. In Step 7, the final vector is normalized to form the next column of Q. The normalization coefficient enters the matrix R below the column of projection coefficients, leading to the corresponding matrix factors
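As a concrete illustration, one column step of classical Gram-Schmidt with a single reorthogonalization pass (the "twice is enough" scheme) can be sketched in NumPy as follows; the function name and interface are ours, not the paper's:

```python
import numpy as np

def cgs2_step(Q, a):
    """One column of classical Gram-Schmidt with a single reorthogonalization
    pass ("twice is enough"). Returns the new unit vector and the R column."""
    r = Q.T @ a                  # mass inner product (one reduction in parallel)
    w = a - Q @ r                # first projection
    s = Q.T @ w                  # reorthogonalization (second reduction)
    w = w - Q @ s
    r = r + s                    # accumulate the correction into R
    beta = np.linalg.norm(w)     # normalization (third reduction)
    return w / beta, np.append(r, beta)

# build an orthonormal basis column by column
rng = np.random.default_rng(0)
A = rng.standard_normal((50, 5))
Q = A[:, :1] / np.linalg.norm(A[:, 0])
for j in range(1, 5):
    q, _ = cgs2_step(Q, A[:, j])
    Q = np.column_stack([Q, q])
```

Applied column by column as above, the reorthogonalized CGS restores orthogonality of the computed basis to the level of machine roundoff.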
For the column-wise modified Gram-Schmidt algorithm, each iteration (Steps 2 to 4) in Algorithm 2 is divided into rank-one updates,
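In code, this sequence of rank-one projections for one MGS column step looks as follows (a NumPy sketch under our own naming, with one inner product, i.e. one global reduction, per previous column):

```python
import numpy as np

def mgs_step(Q, a):
    """One column of modified Gram-Schmidt as a sequence of rank-one
    projections w <- (I - q_i q_i^T) w; in parallel, each inner product
    q_i^T w is a separate global reduction."""
    w = a.copy()
    r = np.zeros(Q.shape[1])
    for i in range(Q.shape[1]):
        r[i] = Q[:, i] @ w       # depends on the previously updated w
        w -= r[i] * Q[:, i]
    beta = np.linalg.norm(w)
    return w / beta, np.append(r, beta)
```

The data dependence of each inner product on the previously updated w is what makes plain MGS latency-bound in a distributed setting.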
The key observation made by Ruhe [18] is that, for the classical Gram-Schmidt algorithm, the loop invariants are obtained by unrolling the recurrences as follows,
and therefore w_k = (I - Q_j Q_j^T)^k a_{j+1}, where I - Q_j Q_j^T is a projection matrix. It follows that
This is the Gauss-Jacobi iteration applied to the normal equations Q_j^T Q_j r = Q_j^T a_{j+1} for the matrix Q_j and right-hand side vector a_{j+1}.
The loop-invariant form of the matrix update can be judiciously applied in order to minimize the number of communication steps in parallel implementations of classical Gram-Schmidt.
For distributed-memory parallel computation based on message-passing, the number of global reduction summation steps in the iterated classical Gram-Schmidt algorithm can be reduced to two by employing the Ruhe loop invariant. In particular, it is possible to write Step 4 of Algorithm 1 by combining the first two iterations of the reorthogonalization: given the first projection w_1 = (I - Q_j Q_j^T) a_{j+1}, the second pass yields w_2 = (I - Q_j Q_j^T) w_1 = a_{j+1} - Q_j (2I - Q_j^T Q_j) Q_j^T a_{j+1}.
The above derivation implies that the projection matrix takes the form
P = I - Q_j (I - L_j - L_j^T) Q_j^T,
where L_j is the strictly lower triangular part of Q_j^T Q_j.
The upper triangular matrix R can be computed one column at a time, and this suggests the form of a two-synchronization parallel algorithm for iterated CGS (Algorithm 3).
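A sketch of one column step of such a two-synchronization iterated CGS, assuming the combined update w = a - Q(2I - Q^T Q)Q^T a obtained by expanding (I - QQ^T)^2 (the function name and interface are ours):

```python
import numpy as np

def icgs_combined(Q, a):
    """Two CGS passes combined via the expansion
    (I - QQ^T)^2 a = a - Q (2I - Q^T Q) Q^T a.
    Q^T a and Q^T Q come from the single block product Q^T [Q, a], i.e. one
    fused global reduction; the normalization is the second reduction."""
    m = Q.shape[1]
    G = Q.T @ np.column_stack([Q, a])      # one fused block reduction
    r = (2.0 * np.eye(m) - G[:, :m]) @ G[:, m]
    w = a - Q @ r
    beta = np.linalg.norm(w)               # second reduction
    return w / beta, np.append(r, beta)
```

In exact arithmetic this reproduces two explicit projection passes while replacing their two mass inner products by one fused reduction.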
Two global reduction communications are required: one for the mass inner products in Step 1 and one for the normalization in Step 4. The above algorithm suggests that a synchronization step can be eliminated if the normalization in Step 4 were lagged to the next iteration. A lagged modified Gram-Schmidt algorithm based on the compact representation of the projection matrix is derived below.
In the case of the modified Gram-Schmidt algorithm, the loop-invariant form of the reorthogonalization update for a column of the matrix A is derived by Ruhe as
where
The columns of the matrix are updated as follows,
An important matrix power series expansion links the different Gram-Schmidt algorithms to the stability analyses and the loss-of-orthogonality relations. In addition, the equivalence of the modified Gram-Schmidt algorithm and the compact representation given below follows from the power series (I + L)^{-1} = I - L + L^2 - ...
A measure of the loss of orthogonality was introduced by Paige [14] as ||S||, where S = (I + U)^{-1} U and U is the strictly upper triangular part of Q^T Q. The norm ||S|| remains close to zero for orthogonal vectors and increases to one as orthogonality is lost. However, given the loop invariants, the matrix derived by Ruhe does not contain the higher-order inner products that are present in the matrix obtained for the compact MGS projection.
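A small NumPy sketch of this metric, under our reading that S = (I + U)^{-1} U with U the strictly upper triangular part of Q^T Q:

```python
import numpy as np

def paige_metric(Q):
    """Loss-of-orthogonality measure ||S||_2 with S = (I + U)^{-1} U, where U
    is the strictly upper triangular part of Q^T Q (our reading of [14])."""
    U = np.triu(Q.T @ Q, 1)
    S = np.linalg.solve(np.eye(Q.shape[1]) + U, U)
    return np.linalg.norm(S, 2)

# near zero for an orthonormal basis, near one for a dependent set
Q, _ = np.linalg.qr(np.random.default_rng(0).standard_normal((50, 5)))
good = paige_metric(Q)                                   # ~ machine precision
bad = paige_metric(np.column_stack([Q[:, 0], Q[:, 0]]))  # ~ 1
```

For a duplicated column the metric reaches one exactly, matching the behavior described in the text.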
In order to derive a block form of the modified Gram-Schmidt algorithm based on level-2 BLAS operations, Björck [4] proposed a compact representation for MGS analogous to the compact representation for products of Householder transformations [20]. The matrix projection can be written as
where L is the strictly lower triangular matrix of basis-vector inner products. The representation is then given by the matrix form
and where
The paper by Malard and Paige [13] shows how the transpose of the matrix above can be formed recursively as follows
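Assuming the compact projector has the form I - Q T Q^T with T = (I + L)^{-1} and L the strictly lower triangular part of Q^T Q, the following NumPy sketch forms T and checks its equivalence to the sequential MGS rank-one projections (names are ours):

```python
import numpy as np

def compact_T(Q):
    """Triangular factor for the compact MGS projector I - Q T Q^T, assuming
    T = (I + L)^{-1} with L the strictly lower triangular part of Q^T Q."""
    L = np.tril(Q.T @ Q, -1)
    m = Q.shape[1]
    return np.linalg.solve(np.eye(m) + L, np.eye(m))

# equivalence with the sequential MGS rank-one projections
rng = np.random.default_rng(3)
Q = rng.standard_normal((20, 3))
Q /= np.linalg.norm(Q, axis=0)     # unit columns, but not orthogonal
a = rng.standard_normal(20)
w_compact = a - Q @ (compact_T(Q) @ (Q.T @ a))
w_seq = a.copy()
for i in range(3):
    w_seq -= (Q[:, i] @ w_seq) * Q[:, i]
```

The two projections agree to roundoff for unit-norm columns, which is the equivalence the compact representation encodes.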
A parallel modified Gram-Schmidt algorithm that requires only one global synchronization step can be based on the above compact form, combined with a lagged normalization of the diagonal of the upper triangular matrix R.
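The following NumPy sketch illustrates the idea: each pass performs a single fused block inner product, from which the norm of the previous (still unnormalized) vector, the new row of the strictly lower triangular part of Q^T Q, and the projection coefficients for the next column are all recovered. It is an illustrative reconstruction under those assumptions, not the paper's exact Algorithm 4:

```python
import numpy as np

def one_sync_mgs_qr(A):
    """QR via compact MGS with lagged normalization: each pass of the loop
    performs ONE fused block inner product (the single global reduction),
    which returns the lagged norm u^T u, the new strictly lower triangular
    entries of Q^T Q, and the projection coefficients for the next column."""
    n, m = A.shape
    Q = np.zeros((n, m)); R = np.zeros((m, m)); L = np.zeros((m, m))
    u = A[:, 0].copy()                        # current unnormalized vector
    for k in range(m):
        last = (k == m - 1)
        B = u[:, None] if last else np.column_stack([u, A[:, k + 1]])
        M = np.column_stack([Q[:, :k], u]).T @ B   # the fused reduction
        beta = np.sqrt(M[k, 0])               # lagged norm ||u||
        Q[:, k] = u / beta; R[k, k] = beta
        L[k, :k] = M[:k, 0] / beta            # row of the strictly lower part
        if not last:
            c = M[:, 1].copy(); c[k] /= beta  # [Q_k, q_k]^T a_{k+1}
            r = np.linalg.solve(np.eye(k + 1) + L[:k + 1, :k + 1], c)
            u = A[:, k + 1] - Q[:, :k + 1] @ r    # compact MGS projection
            R[:k + 1, k + 1] = r
    return Q, R
```

In a distributed implementation the block product M is the one MPI_Allreduce of the iteration, while the triangular solve and vector update are local.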
The single global reduction in Algorithm 4 consists of the mass inner product in Step 2 and the norm in Step 3, fused together. The derivation above can be extended to the iterated classical Gram-Schmidt algorithm with two full reorthogonalization steps. The columns of Q are updated in a single operation requiring one synchronization, and the projection step then takes the following form with a second global reduction.
The resulting algorithm requires only two global communication steps. The first synchronization consists of the mass inner product in Step 2 and the normalization in Step 3. The second occurs in Step 8 with a second mass inner product and mass AXPY. Algorithm 5 can achieve orthogonality of the columns of Q to machine precision.
III One-Synchronization MGS-GMRES Algorithm
Given a large sparse linear system of equations Ax = b, with residual vector r_0 = b - Ax_0, the Generalized Minimal Residual (GMRES) algorithm of Saad and Schultz [19] employs the Arnoldi-Gram-Schmidt algorithm with β = ||r_0|| and v_1 = r_0/β in order to orthogonalize the vectors spanning the Krylov subspace K_m(A, r_0) = span{r_0, A r_0, ..., A^{m-1} r_0}.
The algorithm produces an orthogonal matrix V_{m+1}
and the Arnoldi matrix decomposition A V_m = V_{m+1} H_{m+1,m}. Indeed, it was observed by Paige et al. [15] that the algorithm can be viewed as the Gram-Schmidt QR factorization of a matrix formed by adding a column at each iteration.
The GMRES algorithm with modified Gram-Schmidt orthogonalization was derived by Saad and Schultz [19] and is presented as Algorithm 6 below.
The standard level-1 BLAS modified Gram-Schmidt (MGS) Algorithm 6 computes the orthogonalization in Steps 4 to 7. The normalization in Steps 8 and 9 is performed afterwards, looking forward one iteration. Ghysels et al. [6] proposed an alternative form of the normalization to avoid a global reduction.
This computation is unstable and suffers from numerical cancellation because it does not account for the loss of orthogonality in V_j and approximates the norm via the Pythagorean identity ||w||^2 = ||z||^2 - ||V_j^T z||^2, which holds only when the basis is exactly orthogonal.
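The cancellation can be demonstrated numerically; here the basis V, the dimensions, and the perturbation size are arbitrary choices for illustration:

```python
import numpy as np

# The one-reduction estimate ||w||^2 = ||z||^2 - ||V^T z||^2 holds only for an
# exactly orthogonal basis; when z lies nearly in span(V), the subtraction of
# two nearly equal O(1) numbers destroys the tiny result.
rng = np.random.default_rng(5)
V, _ = np.linalg.qr(rng.standard_normal((1000, 20)))
z = V @ rng.standard_normal(20) + 1e-12 * rng.standard_normal(1000)
r = V.T @ z
w = z - V @ r
direct = np.linalg.norm(w)                    # norm computed after projection
pythag = np.sqrt(max(z @ z - r @ r, 0.0))     # single-reduction estimate
rel_err = abs(pythag - direct) / direct       # far above machine precision
```

The projected norm here is tiny relative to ||z||, so the difference of the two dot products is dominated by their rounding errors and the estimate loses essentially all accuracy.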
Both Ghysels et al. [6] and Yamazaki et al. [26] describe pipelined algorithms based on CGS1 that incorporate an iteration lag in the normalization step to avoid the unstable normalization above. However, these algorithms are somewhat unstable because the loss of orthogonality for CGS1 is O(ε κ²(A)); see [8]. The authors employ a change of basis and shifts in order to mitigate these instabilities. However, the pipeline depths must still remain relatively small.
The MGS-GMRES algorithm is backward stable [15], and orthogonality is maintained to O(ε κ), where κ is the condition number of the matrix being orthogonalized, defined as the ratio of its largest to smallest singular values [7]. By employing the level-2 BLAS compact MGS algorithm with a backward lag, the normalization for one iteration is computed to full accuracy together with the orthogonalization of the next. In effect, the lag creates a pipeline in which orthogonalization of the next Krylov vector is performed before the corresponding diagonal element of R in the QR factorization is computed. The lagged compact algorithm is implemented below as Algorithm 7 for the j-th iteration of MGS-GMRES. The normalization for iteration j is computed in Steps 11 and 12. Therefore, the column of the Hessenberg matrix is precomputed one iteration ahead and is employed to solve the least squares problem. Because the loss of orthogonality for MGS-GMRES is O(ε κ), Algorithm 7 is not only more stable but can also be pipelined to an arbitrary iteration depth.
IV Implementation Details
The standard level-1 BLAS MGS-GMRES implementation with the modified Gram-Schmidt process requires j+1
separate inner products (global summations) in iteration j. Therefore, the amount of global communication grows quadratically with the iteration count. Several authors have proposed ways to reduce the communication overhead associated with MGS-GMRES. The iterated classical Gram-Schmidt algorithm with two reorthogonalization steps requires at least two global reductions. The Trilinos Belos solver stack employs an iterated classical Gram-Schmidt ICGS-GMRES [1]. Hernandez et al. [10] implement the reorthogonalization with communication pipelining. The approach developed here requires less computation and communication than previous algorithms. The one-sync level-2 BLAS MGS-GMRES algorithm performs the compact modified Gram-Schmidt process by combining a mass inner product and normalization into one global reduction (MPI_Allreduce) per GMRES iteration. The mass inner product computes V_j^T w as a single matrix-vector product. The projection update w := w - V_j r is implemented with the vector mass AXPY (MAXPY).
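In an MPI code, the fused reduction amounts to packing the local contributions to V^T w and w^T w into a single buffer for one MPI_Allreduce; the sketch below simulates row-partitioned ranks in NumPy, with the mpi4py call shown only as a comment:

```python
import numpy as np

def local_fused_buffer(V_local, u_local):
    """Local contribution to the single Allreduce of a one-sync iteration:
    V^T u and u^T u are packed into one buffer so a single summation
    delivers both the mass inner product and the (lagged) norm.
    In an mpi4py code:  comm.Allreduce(MPI.IN_PLACE, buf, op=MPI.SUM)"""
    return np.append(V_local.T @ u_local, u_local @ u_local)

# simulate four row-partitioned MPI ranks
rng = np.random.default_rng(6)
V = rng.standard_normal((400, 6)); u = rng.standard_normal(400)
total = np.sum([local_fused_buffer(V[i::4], u[i::4]) for i in range(4)], axis=0)
# total[:6] holds V^T u and total[6] holds u^T u after the "reduction"
```

Fusing the two quantities into one buffer is what collapses two latency-bound collectives into a single one per iteration.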
The PETSc library employs the vector MAXPY for a pipelined GMRES iteration with reorthogonalization [6].
For optimal performance of these two algorithms on current-generation multicore processors, cache blocking combined with loop unrolling can be exploited. Once the Krylov subspace dimension is sufficiently large, the outer loop of the mass inner product should be unrolled into batches of two, four, or more summations. The overall effect is to expose more work to multiple floating point units and to prefetch multiple columns of V into the cache together with w. The outer loop can also be threaded, given the typically large vector lengths. In an MPI-parallel implementation, the columns of V are partitioned across the MPI ranks. Unrolling should have the beneficial effect of increasing the memory bandwidth utilization of the processors. The vector MAXPY should also be unrolled, with the elements prefetched into the cache. The Hessenberg matrix is small and dense; thus, it is laid out in memory by columns for cache access and fast application of the orthogonal rotations in the least-squares problem. The remaining operations inside the one-sync MGS-GMRES are norms, which are BLAS-1 operations, (sparse) matrix-vector products, and the application of a preconditioner. BLAS-1 operations are usually highly optimized for the particular architecture. The sparse matrix-vector multiply is parallel and can be easily partitioned by rows when using a compressed sparse row (CSR) storage format.
Both the mass inner product and vector MAXPY are well-suited to a single-instruction multiple-data (SIMD) model of parallel execution. These computations can be massively threaded on a GPU. The mass inner product can be offloaded to the GPU but requires multiple CUDA kernel launches to perform synchronizations between reduction steps. Synchronization is possible within streaming multiprocessors (SMs) executing thread blocks; however, a CUDA kernel launch is needed to synchronize across SMs. Unrolling of the inner products into batches is also possible and permits higher sustained memory bandwidth between the GPU global memory and the SMs. This is also the case for the vector MAXPY. A CUDA implementation of the vector MAXPY is provided by PETSc. A recent paper by Romero et al. [16] also describes a vector MAXPY for the GPU. The GPU implementation of the CGS1-based CA-GMRES algorithm is described in [27]. An important optimization to save storage and promote data reuse is to fuse the matvec and preconditioner application steps and avoid storing the preconditioned vectors [24].
V Numerical Results
The primary motivation for proposing the low synchronization GMRES algorithms was to improve the performance and strong-scaling characteristics of DOE physics-based simulations at exascale and, in particular, the Nalu CFD solver, Domino et al. [5]. The Nalu model solves the incompressible Navier-Stokes equations on unstructured finite-element meshes. The momentum and pressure equations are time-split and solved sequentially. Both the momentum and pressure-continuity equations are solved with preconditioned GMRES iterations. The pressure preconditioner employs either smoothed-aggregation or classical Ruge-Stüben algebraic multigrid (C-AMG). The latter is provided by the Hypre BoomerAMG library of solvers [9]. The Krylov solvers, which typically involve a substantial amount of communication, are an expensive part of the simulation, requiring over 50% of the simulation time.
In order to evaluate the performance of the different GMRES solvers, a separate Hypre linear solver application has been implemented. In all our performance studies, matrices were obtained from Nalu wind turbine simulations and tested in our Hypre solver application. The performance studies involving GPU acceleration were completed on a single node of the SummitDev supercomputer at ORNL. SummitDev is built with IBM S822LC nodes, each consisting of two 10-core IBM Power8 processors with 256 GB of DDR4 memory and four NVIDIA Tesla P100 GPUs connected by NVLink running at 80 GB/s. Each GPU has 16 GB of HBM2 memory. The interconnect is based on two rails of Mellanox EDR InfiniBand.
Gram-Schmidt orthogonalization, as emphasized earlier, is the most expensive part of the GMRES algorithm. Hence, the algorithms developed in this paper, based upon the one- and two-synchronization CGS and MGS Gram-Schmidt processes, are applied. For each example, we specify which orthogonalization method was employed.
V-A Numerical Stability and Accuracy
Linear systems appearing in the literature are employed to evaluate the numerical stability and limiting accuracy of several of the GMRES algorithms described here. All tests were performed in Matlab, and the loss-of-orthogonality metric plotted for the Krylov vectors is the one derived by Paige [14]. The first problem was proposed by Simoncini and Szyld [21]. A diagonal matrix of dimension 100 is specified, with a random right-hand side b normalized so that ||b|| = 1. The matrix condition number is controlled by the small first diagonal element. A preconditioner is not employed in this case. The problem allows us to specify the limiting accuracy of the GMRES iteration. In particular, Figure 1 plots the relative residual versus the number of iterations for the standard level-1 BLAS MGS-GMRES and the level-2 one-sync lagged MGS-GMRES algorithm. These exhibit identical convergence behaviour, with the loss-of-orthogonality metric increasing to one at 80 iterations. The convergence stalls at this point. In the case of the one- or two-sync classical Gram-Schmidt (CGS2) algorithms based on Algorithm 5, the loss-of-orthogonality metric remains close to machine precision and the convergence curve continues to decrease without stalling.
A second problem is taken from the Florida (now Texas A&M) sparse matrix collection. The particular system is 'thermal1'. The right-hand side is obtained from b = Ax for a known solution vector x. For this problem a Ruge-Stüben [17] classical AMG preconditioner is applied. Once again, the standard level-1 MGS-GMRES and level-2 one-sync GMRES algorithms are compared with the CGS2 GMRES algorithms. The relative residuals and loss-of-orthogonality metrics are plotted in Figure 2. Observe that the convergence stalls for the MGS-GMRES algorithms once the loss-of-orthogonality metric reaches one, whereas the CGS2 variants maintain orthogonality to near the level of machine precision.
V-B GPU C/CUDA Performance
The results presented below were obtained on the SummitDev cluster (OLCF) using NVCC V9.0.67. All variables are allocated in device memory. The standard cuSPARSE and cuBLAS functions are used to perform BLAS operations (SpMV, AXPY, SCAL, DOT). MDOT and MAXPY could in principle be expressed as a dense matrix-matrix product and a combination of dense matrix-matrix products, respectively, and implemented using standard cuBLAS routines. However, this approach proved less efficient than our custom implementation. We speculate that the main reasons for the suboptimal performance were the need to transpose the basis matrix in MDOT and the need for two CUDA kernel calls instead of one for MAXPY. Hence, MAXPY and MDOT are implemented as fused CUDA kernels and optimized for maximum performance.
The problem employed for the GPU performance analysis was extracted from the Nalu Vestas V27 wind turbine simulation described in Thomas et al. [25]. The discretization scheme is based on the control-volume finite-element method (CVFEM). The matrix is stored in the standard compressed sparse row (CSR) format, which is supported by the cuSPARSE library. A preconditioner is not applied in this test.
Five approaches are compared for solving the linear system: (1) level-1 BLAS MGS-GMRES(m) compiled using CUDA, (2) MGS-GMRES(m) written by the authors for Hypre using only device memory and reducing unnecessary data copies and convergence checks, (3) GMRES(m) with CGS1 and the Ghysels normalization, (4) CGS2 GMRES (Algorithm 1), and (5) two-sync GMRES(m) as in Algorithm 3. Several restart values m are tested with a fixed relative tolerance. The only difference between approaches (2)-(5) is how they orthogonalize the Krylov vectors. Figure 3 displays the improvement in runtime obtained with the alternative orthogonalization approaches. The improvement is much greater, as expected, for large restart values. Figure 5 indicates the associated speedups.
Figure 4 plots the ratio of the wall-clock runtime taken by Gram-Schmidt to the total runtime of the sparse linear system solver. One can easily observe that the level-1 BLAS modified Gram-Schmidt accounts for up to 90% of the total runtime. The alternative orthogonalization strategies lower this fraction to between 40% and 70%. The remainder of the execution time is taken by various BLAS routines and matvecs. Hence, an even larger improvement in runtime is expected when a preconditioner is used.
Figure 5 displays the speedup resulting from the low-latency orthogonalization methods. The Hypre MGS-GMRES GPU implementation is taken as the baseline reference timing.
Figure 6 displays the number of GMRES iterations resulting from the low-latency orthogonalization methods. For short restart cycles, the number of iterations varies between the methods. This is expected, given the weaker stability of CGS1 compared with CGS2 and MGS.
The ‘Parabolic FEM’ matrix from the Florida sparse matrix collection is used in a second performance test, with total runtimes reported in Table I. The Hypre BoomerAMG cycle with 3 levels is the preconditioner, with a scaled Jacobi smoother on the GPU. The total solver (Solve) time is reported along with the Gram-Schmidt (Orth) times and the number of GMRES iterations (Iters) required to reach the specified relative tolerance. The lowest runtimes are achieved by the two-sync GMRES algorithm, which is roughly 60% faster than the Hypre level-1 BLAS MGS-GMRES.
Method               | Tolerance | Iters | Orth (s) | Solve (s)
---------------------|-----------|-------|----------|----------
GMRES-HYPRE          | 1e-13     | 78    | 0.95     | 3.06
GMRES-MGS (new)      | 1e-13     | 78    | 0.25     | 2.03
GMRES-CGS1 alt norm  | 1e-13     | N/A   | N/A      | N/A
GMRES-CGS2           | 1e-13     | 78    | 0.14     | 2.00
GMRES-two sync       | 1e-13     | 78    | 0.12     | 1.92
GMRES-HYPRE          | 1e-15     | N/A   | N/A      | N/A
GMRES-MGS (new)      | 1e-15     | 93    | 0.27     | 2.34
GMRES-CGS1 alt norm  | 1e-15     | N/A   | N/A      | N/A
GMRES-CGS2           | 1e-15     | 92    | 0.15     | 2.14
GMRES-two sync       | 1e-15     | 92    | 0.13     | 2.26
Initial testing of the Hypre linear solver on the Peregrine supercomputer at NREL has demonstrated that the low-latency algorithms reduce the communication overhead of the MPI global reductions. Peregrine nodes contain dual-socket Xeon E5-2670 v3 Haswell processors with 64 GB of DDR4 memory, interconnected with an InfiniBand network. The runtime spent in the Gram-Schmidt orthogonalization is plotted in Figure 7 for the Nalu V27 linear system. Clearly, the new algorithms reduce the time as the node count increases.
VI Conclusions and Future Goals
We have presented low-latency classical and modified Gram-Schmidt algorithms and applied them to the Arnoldi-GMRES algorithm. The number of synchronization steps depends on the desired level of orthogonality, stability, and limiting accuracy of the resulting GMRES Krylov iterations. A one-synchronization MGS-GMRES algorithm based on the compact MGS algorithm with a lagged normalization step results in a backward-stable algorithm [15] with O(ε κ) loss of orthogonality. GPU implementations of these low-latency algorithms achieve up to a 35 times speedup over the Hypre level-1 BLAS MGS-GMRES algorithm.
The execution speed of our GPU low-latency GMRES algorithms implies a new scaling paradigm for large sparse linear solvers. The relative balance in runtime between GMRES (including matrix-vector multiplies) and a preconditioner such as the Hypre BoomerAMG cycle may now shift in favor of a larger number of inexpensive GMRES iterations combined with a lightweight cycle, in order to achieve a much lower and near-optimal runtime. For example, we solved a large sparse linear system in under two seconds with 100 GMRES iterations, using GPU-accelerated matrix-vector multiplies, preconditioned by a less costly cycle and a Jacobi smoother on the GPU. The challenge is to extend this scaling result to a large number of nodes based on a low number of synchronizations.
An obvious extension of this work is a distributed-memory parallel implementation and testing using MPI+OpenMP and MPI+CUDA. A first step in this direction would be a multi-GPU implementation. The theory developed in this paper and the preliminary results indicate that our approach, while stable, also has the desired scalability properties in terms of both fine- and coarse-grain parallelism. Further investigation is needed to establish the MPI-parallel properties of the proposed algorithms at large scale on many nodes.
VII Acknowledgements
This work was authored in part by the National Renewable Energy Laboratory, operated by Alliance for Sustainable Energy, LLC, for the U.S. Department of Energy (DOE) under Contract No. DEAC3608GO28308. Funding provided by the Exascale Computing Project (17SC20SC), a collaborative effort of two U.S. Department of Energy organizations (Office of Science and the National Nuclear Security Administration) responsible for the planning and preparation of a capable exascale ecosystem, including software, applications, hardware, advanced system engineering, and early testbed platforms, in support of the nation’s exascale computing imperative. The views expressed in the article do not necessarily represent the views of the DOE or the U.S. Government. The U.S. Government retains and the publisher, by accepting the article for publication, acknowledges that the U.S. Government retains a nonexclusive, paidup, irrevocable, worldwide license to publish or reproduce the published form of this work, or allow others to do so, for U.S. Government purposes.
References
 [1] Bavier, E., M. Hoemmen, S. Rajamanickam, H. Thornquist (2012). Amesos2 and Belos: Direct and Iterative Solvers for Large Sparse Linear Systems. Sci. Prog. vol. 20. no. 3. 241–255.
 [2] Bischof, C., and Van Loan, C. (1987). The WY Representation for Products of Householder Matrices. SIAM J. Sci. Comp. vol. 8. 2–13.
 [3] Björck, A., (1967). Solving linear least squares problems by Gram-Schmidt orthogonalization. BIT. 7. 1–21.
 [4] Björck, A., (1994). Numerics of Gram-Schmidt orthogonalization. Linear Algebra Appl. vol. 197. 297–316.
 [5] Domino, S., (2018). Design-order, non-conformal low-Mach fluid algorithms using a hybrid CVFEM/DG approach. J. Comp. Phys., 359. 331–351.
 [6] Ghysels, P., T. J. Ashby, K. Meerbergen, W. Vanroose, (2013). Hiding global communication latency in the GMRES algorithm on massively parallel machines. SIAM J. Sci. Comp., vol. 35. C48–C71.
 [7] Giraud, L., J. Langou, M. Rozloznik, (2005). On the loss of orthogonality in the Gram-Schmidt orthogonalization process. Comput. Math. Appl. vol. 50. 1069–1075.
 [8] Giraud, L., J. Langou, M. Rozloznik, J. van den Eshof, (2006). Rounding error analysis of the classical Gram-Schmidt orthogonalization process. Numer. Math. vol. 101. 87–100.
 [9] Henson, V. E., and U. M. Yang, (2000). BoomerAMG: A parallel algebraic multigrid solver and preconditioner. Appl. Numer. Math., 41. 155–177.
 [10] Hernandez, V., J. E. Roman, and A. Tomas, (2007), Parallel Arnoldi solvers with enhanced scalability via global communication rearrangement. Parallel Comput. vol. 33. 521–540.
 [11] Hoemmen, M., (2010). Communication-avoiding Krylov subspace methods. University of California, Berkeley, Tech. Rep. UCB/EECS-2010-37.
 [12] Leon, S. J., A. Björck, W. Gander, (2010). Gram-Schmidt orthogonalization: 100 years and more. Numer. Linear Algebra Appl. 1–40.
 [13] Malard, J., and C. C. Paige, (1994), Efficiency and scalability of two parallel QR Factorization Algorithms. Proceedings of the Scalable HighPerformance Computing Conference, IEEE. 615–622.
 [14] Paige, C. C., (2018). The effects of loss of orthogonality on large scale numerical computations. Proceedings of ICCSA 2018, Gervasi et al. (Eds.), LNCS 10962. Springer. 429–439.
 [15] Paige, C. C., M. Rozloznik, and Z. Strakos, (2006). Modified Gram-Schmidt (MGS), least squares, and backward stability of MGS-GMRES. SIAM J. Matrix Anal. Appl., vol. 28. 264–284.
 [16] Romero, E., A. Tomás, A. Soriano, and I. Blanquer, (2014). A fast sparse block circulant matrix vector product. Proceedings EuroPar 2014, LNCS 8632, SpringerVerlag, 548–559.
 [17] Ruge, J. W., and K. Stüben, (1987). Algebraic Multigrid, in Multigrid Methods, S. McCormick, ed., Frontiers in Applied Mathematics, SIAM, 73–130.
 [18] Ruhe, A., (1983). Numerical aspects of Gram-Schmidt orthogonalization of vectors. Linear Algebra Appl. vol. 52. 592–601.
 [19] Saad, Y. and M. H. Schultz, (1986). GMRES: A generalized minimal residual algorithm for solving nonsymmetric linear systems. SIAM J. Sci. Statist. Comput. vol. 7, 856–869.
 [20] Schreiber, R. and C. F. Van Loan, (1989). A storage efficient representation for products of Householder transformations. SIAM J. Sci. Statist. Comput. vol. 10. 53–57.
 [21] Simoncini, V. and D. B. Szyld, (2003). Theory of inexact Krylov subspace methods and applications to scientific computing. SIAM J. Sci. Comput. vol. 25. 454–477.
 [22] Smoktunowicz, A., J. L. Barlow, J. Langou, (2006). A note on the error analysis of classical Gram-Schmidt. Numer. Math. 105 (2). 299–313.
 [23] Świrydowicz, K., Chalmers, N., Karakus, A., Warburton, T., (2017). Acceleration of tensor-product operations for high-order finite element methods. Available at https://arxiv.org/abs/1711.00903.

 [24] Świrydowicz, K., (2017). Strategies for Recycling Krylov Subspace Methods and Bilinear Form Estimation. Ph.D. Thesis, Virginia Tech. http://hdl.handle.net/10919/78695.
 [25] Thomas, S. J., S. Ananthan, S. Yellapantula, J. J. Hu, M. A. Sprague, (2018). A comparison of classical and aggregation-based algebraic multigrid preconditioners for high-fidelity simulation of wind-turbine incompressible flows. SIAM J. Sci. Statist. Comput., Submitted.
 [26] Yamazaki, I., M. Hoemmen, P. Luszczek, J. Dongarra, (2017). Improving performance of GMRES by reducing communication and pipelining global collectives. University of Tennessee, ICL Tech. Rep. 9732017.
 [27] Yamazaki, I., H. Anzt, S. Tomov, M. Hoemmen, and J. Dongarra, (2014). Improving the performance of CA-GMRES on multicores with multiple GPUs. IPDPS 2014. 382–391.