Low synchronization GMRES algorithms

Communication-avoiding and pipelined variants of Krylov solvers are critical for the scalability of linear system solvers on future exascale architectures. We present low synchronization variants of iterated classical (CGS) and modified Gram-Schmidt (MGS) algorithms that require one and two global reduction communication steps. Derivations of low synchronization iterated CGS algorithms are based on previous work by Ruhe. Our main contribution is to introduce a backward normalization lag into the compact WY form of MGS, resulting in an O(ε)κ(A) stable GMRES algorithm that requires only one global synchronization per iteration. The reduction operations are overlapped with computations and pipelined to optimize performance. Further improvements in performance are achieved by accelerating GMRES BLAS-2 operations on GPUs.


I Introduction

The Gram-Schmidt algorithm constructs an orthogonal basis from a set of linearly independent vectors spanning a vector space. The classical (CGS) algorithm is column-oriented, whereas the modified (MGS) Gram-Schmidt algorithm has both row and column variants. The numerical properties of both variants were clarified by Björck [3], who established the backward stability of the QR factorization produced by MGS for least squares problems. The columns of Q lose orthogonality in both cases, and bounds on the loss of orthogonality ‖I − Q^T Q‖ have been derived and improved by several authors. For the modified algorithm, Björck [3] derived a bound based on the condition number κ(A). The situation is worse for the classical algorithm, where the loss of orthogonality depends on κ²(A), as shown in [7] and [8]. For both CGS and MGS, the orthogonality can be restored to the level of machine round-off by reorthogonalization of the columns of Q. Iterative reorthogonalization was discussed in Leon et al. [12] along with the “twice is enough” result of Kahan, which was later generalized by Giraud et al. [8]. A subsequent paper by Smoktunowicz et al. [22] introduced a more stable Cholesky-like CGS algorithm using a Pythagorean identity for the normalization of the basis vectors.

We present low synchronization variants of iterated classical (CGS) and modified Gram-Schmidt (MGS) algorithms that require one and two global reduction communication steps. The MGS algorithms are based on the compact form of the orthogonalization process described in Björck [4] and Malard and Paige [13]. Our main contribution is to introduce a lag into the column-oriented MGS algorithm in order to delay normalization of the diagonal elements of R. In addition, the reductions can be overlapped with the computation and pipelined, as demonstrated by Ghysels et al. [6] and Yamazaki et al. [26]. Derivations of low communication algorithms for iterated CGS can be based on the paper by Ruhe [18], who establishes the equivalence of reorthogonalization for classical Gram-Schmidt with the Gauss-Jacobi iteration for the normal equations with coefficient matrix Q^T Q, and of modified Gram-Schmidt with Gauss-Seidel. The author unrolls the recurrence for re-orthogonalization, thus obtaining a loop-invariant form of the algorithm. The loop invariant allows us to combine the two reorthogonalization iterations into one step and thereby eliminate the need for an additional global reduction. The loss of orthogonality relation is also exploited to reduce the computation and communication overhead.

Both the classical and modified Gram-Schmidt algorithms may employ BLAS-2 operations built from vector inner products and norms to construct an orthonormal set of basis vectors. On modern multicore computers, global reductions use a tree-like sum of scalar products involving every core (MPI rank) on every node of the machine. The latency of this operation grows with the height of the tree, determined by the number of cores. The remaining operations required in the orthogonalization are scalable. These consist of the mass inner-product Q^T w, the vector MAXPY operation w − Q r, and a sparse matrix-vector product (SpMV) at each iteration. For sparse matrices distributed by rows, the parallel SpMV typically requires local communication between cores. The BLAS-1 vector sum is trivially data-parallel. Hence, the global reductions are the limiting factor for the available parallelism in the algorithm. Our MGS algorithm reduces the number of global reductions per iteration to only one.
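For concreteness, a minimal NumPy sketch of these two kernels (illustrative only, not the Hypre implementations; the names mdot and maxpy are ours):

```python
import numpy as np

def mdot(Q, w):
    """Mass inner product: all projection coefficients Q^T w in one pass.
    In a distributed setting this maps to a single global reduction."""
    return Q.T @ w

def maxpy(w, Q, r):
    """Vector MAXPY: subtract the combined projection Q r from w in one fused update."""
    return w - Q @ r
```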

II Gram-Schmidt and Reorthogonalization

Ruhe [18] provides a unique perspective by proving that the iterated classical and modified Gram-Schmidt algorithms are equivalent to Gauss-Jacobi and Gauss-Seidel iterations for linear systems. Let A be an n × n non-singular matrix with QR factorization A = QR obtained from the Gram-Schmidt algorithm. Consider one step of the classical Gram-Schmidt algorithm, with k orthonormal vectors forming an n × k matrix Q, and the vector a, which is to be orthogonalized to form the (k+1)-st column of Q. The classical Gram-Schmidt algorithm with re-orthogonalization is presented as Algorithm 1.

1:Start , .
2:for  do
3:      
4:      
5:      
6:end for
7:
Algorithm 1 Classical Gram-Schmidt Algorithm with re-orthogonalization

Each iteration in Steps 3–5 is a reorthogonalization. The notation in Step 7 means that the final vector is normalized to form the next column of Q. The normalization coefficient enters the matrix R below the vector r, leading to the corresponding matrix factors.
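As an illustration, the following NumPy sketch implements the re-orthogonalization passes of Algorithm 1 for a single new column (our own notation; the vector a is orthogonalized against the current orthonormal block Q):

```python
import numpy as np

def cgs_with_reorth(Q, a, passes=2):
    """Classical Gram-Schmidt with re-orthogonalization for one new column.
    Q: n-by-k block with (nearly) orthonormal columns, a: new vector of length n."""
    w = a.copy()
    r = np.zeros(Q.shape[1])
    for _ in range(passes):          # Steps 3-5: each pass is one reorthogonalization
        s = Q.T @ w                  # mass inner product (one global reduction)
        w = w - Q @ s                # MAXPY projection update
        r += s                       # accumulate the coefficients of the new R column
    beta = np.linalg.norm(w)         # normalization coefficient (a further reduction)
    return w / beta, r, beta         # Step 7: the next orthonormal column
```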

For the column-wise modified Gram-Schmidt algorithm, each iteration (Steps 2 to 4) in Algorithm 2 is divided into rank-1 updates.

1:for  do
2:      
3:      
4:      
5:end for
6:
Algorithm 2 Modified Gram-Schmidt Algorithm
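For comparison, a NumPy sketch of the column-oriented MGS projection of Algorithm 2, in which each coefficient is computed from the partially updated vector (again our own notation):

```python
import numpy as np

def mgs_column(Q, a):
    """Modified Gram-Schmidt (column variant): a sequence of rank-1 updates."""
    w = a.copy()
    r = np.zeros(Q.shape[1])
    for i in range(Q.shape[1]):      # Steps 2-4: one rank-1 update per existing column
        r[i] = Q[:, i] @ w           # inner product against the *updated* vector
        w = w - r[i] * Q[:, i]       # rank-1 projection update
    beta = np.linalg.norm(w)
    return w / beta, r, beta
```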

The key observation made by Ruhe [18] is that, for the classical Gram-Schmidt algorithm, loop invariants are obtained by unrolling the recurrences in Steps 3–5. The unrolled iterates can be expressed in terms of the projection matrix P = I − QQ^T, and the accumulated coefficients satisfy the fixed-point iteration r^(i+1) = r^(i) + Q^T( a − Q r^(i) ). This is the Gauss-Jacobi iteration applied to the normal equations with coefficient matrix Q^T Q and right-hand side vector Q^T a.

The loop-invariant form of the matrix update can be judiciously applied in order to minimize the number of communication steps in parallel implementations of classical Gram-Schmidt.

For distributed-memory parallel computation based on message-passing, the number of global reduction summation steps in the iterated classical Gram-Schmidt algorithm can be reduced to two by employing the Ruhe loop invariant. In particular, Step 4 of Algorithm 1 can be rewritten by combining the first two iterations of the re-orthogonalization into a single update. The above derivation implies that the projection matrix can be expressed in terms of a strictly lower triangular matrix of basis vector inner products.

The upper triangular matrix can be computed one column at a time and this suggests the form of a two synchronization step parallel algorithm for iterated CGS (Algorithm 3).

1:
2:
3:
4:
5:
Algorithm 3 Iterated Classical Gram-Schmidt Algorithm with Two Synchronizations
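One way to realize the two-synchronization structure is sketched below in NumPy: the two CGS passes are folded into a single projection with coefficients (2I − Q^TQ) Q^T a, under the assumption that the Gram matrix Q^TQ has been accumulated one column per iteration from earlier reductions. This is an illustration of the idea, not the authors' exact Algorithm 3.

```python
import numpy as np

def icgs_two_sync(Q, G, a):
    """Iterated CGS with the two re-orthogonalization passes combined via the
    loop invariant.  Q: current orthonormal block; G: accumulated Gram matrix
    Q^T Q; a: new vector to orthogonalize."""
    # --- first global reduction: mass inner products c = Q^T a ---
    c = Q.T @ a
    r = 2.0 * c - G @ c              # combined coefficients of the two CGS passes
    w = a - Q @ r                    # single MAXPY update
    # --- second global reduction: normalization ---
    beta = np.linalg.norm(w)
    return w / beta, r, beta
```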

Two global reduction communications are required: for the mass inner products in Step 1, and for the normalization in Step 4. The above algorithm suggests that a synchronization step can be eliminated if the normalization in Step 4 is lagged to the next iteration. A lagged modified Gram-Schmidt algorithm based on the compact representation of the projection matrix is derived below.

In the case of the modified Gram-Schmidt algorithm, Ruhe derives the loop-invariant form of the re-orthogonalization update for a column of the matrix by unrolling the sequence of rank-1 projections, and the columns of the matrix are updated accordingly. An important matrix power series expansion links the different Gram-Schmidt algorithms to the stability analyses and the loss of orthogonality relations. In addition, the equivalence of the modified Gram-Schmidt algorithm and the compact representation given below follows from this power series.

A measure of the loss of orthogonality was introduced by Paige [14] as ‖S_k‖, where S_k = (I + U_k)^{-1} U_k and U_k is the strictly upper triangular part of Q_k^T Q_k. The norm remains close to zero for orthogonal vectors and increases to one as orthogonality is lost. However, given the loop invariants, the matrix derived by Ruhe does not contain the higher-order inner products that are present in the matrix obtained for the compact MGS projection.
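A short sketch of how this measure can be evaluated, using the definition S_k = (I + U_k)^{-1} U_k with U_k the strictly upper triangular part of Q_k^T Q_k (illustrative; see Paige [14] for the precise analysis):

```python
import numpy as np

def loss_of_orthogonality(Q):
    """Paige's measure ||S_k||, close to zero for orthonormal columns and
    approaching one as linear dependence sets in."""
    G = Q.T @ Q
    U = np.triu(G, k=1)                            # strictly upper triangular part
    S = np.linalg.solve(np.eye(G.shape[1]) + U, U)
    return np.linalg.norm(S, 2)                    # spectral norm
```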

In order to derive a block form of the modified Gram-Schmidt algorithm based on level-2 BLAS type operations, Björck [4] proposed a compact representation for MGS analogous to the compact WY representation for Householder QR [20]. The matrix projection can be written as P = I − Q T Q^T, where T^{-1} = I + L and L is the strictly lower triangular matrix of basis vector inner products.

The paper by Malard and Paige [13] shows how the transpose of the matrix above can be formed recursively.

A parallel modified Gram-Schmidt algorithm that requires only one global synchronization step can be based on the above compact form combined with a lagged normalization of the diagonal of the upper triangular matrix R.

1:
2:
3:,
4:
5:
6:
7:
8:
Algorithm 4 Level-2 Modified Gram-Schmidt Algorithm with Normalization Lag
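A minimal NumPy sketch of the compact projection underlying Algorithm 4 is given below, assuming L holds the strictly lower triangular basis inner products accumulated from earlier reductions; the lagged normalization described in the text is omitted here for brevity.

```python
import numpy as np

def compact_mgs_projection(Q, L, a):
    """Apply the compact MGS projection a - Q (I + L)^{-1} Q^T a.  In the
    one-synchronization variant the mass inner product Q^T a and the norm needed
    for the (lagged) normalization come from the same global reduction."""
    c = Q.T @ a                                      # mass inner product (the reduction)
    r = np.linalg.solve(np.eye(L.shape[0]) + L, c)   # triangular solve, purely local work
    w = a - Q @ r                                    # MAXPY update
    return w, r
```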

The single global reduction in Algorithm 4 consists of the mass inner product in Step 2 and the norm in Step 3. The derivation above can be extended to the iterated classical Gram-Schmidt algorithm with two full reorthogonalization steps. The columns of Q are updated in a single operation requiring one synchronization, and the projection step then takes the following form with a second global reduction.

1:
2:
3:,
4:
5:
6:
7:
8:
Algorithm 5 Level-2 Iterated Classical Gram-Schmidt Algorithm with Normalization Lag

The resulting algorithm requires only two global communication steps. The first synchronization consists of the mass inner product in Step 2 and the normalization in Step 3. The second synchronization occurs in Step 8 with a second mass inner product and mass AXPY. Algorithm 5 can achieve orthogonality of the columns of the matrix Q to machine precision.

III One Synchronization MGS-GMRES Algorithm

Given a large sparse linear system of equations Ax = b, with residual vector r_0 = b − Ax_0, the Generalized Minimal Residual (GMRES) algorithm of Saad and Schultz [19] employs the Arnoldi Gram-Schmidt process, with v_1 = r_0/‖r_0‖, in order to orthogonalize the vectors spanning the Krylov subspace K_m(A, r_0) = span{ r_0, A r_0, …, A^{m−1} r_0 }.

The algorithm produces a matrix V_{m+1} with orthonormal columns and the Arnoldi decomposition A V_m = V_{m+1} H_m, where H_m is an (m+1) × m upper Hessenberg matrix. Indeed, it was observed by Paige et al. [15] that the algorithm can be viewed as the Gram-Schmidt QR factorization of the matrix [ r_0, A V_m ], formed by adding one column at each iteration.

The GMRES algorithm with modified Gram-Schmidt orthogonalization was derived by Saad and Schultz [19] and presented as Algorithm 6 below.

1:, .
2:for  do
3:      
4:      for   do
5:            
6:            
7:      end for
8:      
9:      
10:      Apply Givens rotations to
11:end for
12:
13:
Algorithm 6 MGS-GMRES

The standard level-1 BLAS modified Gram-Schmidt (MGS) Algorithm 6 computes the orthogonalization in Steps 4 to 7. The normalization in Steps 8 and 9 is performed afterwards, looking forward one iteration to compute v_{j+1}. Ghysels et al. [6] proposed an alternative form of the normalization to avoid this global reduction.
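A reconstruction of the idea (notation illustrative; see [6] for the exact form): the subdiagonal entry is estimated from quantities that are already available, via a Pythagorean-type identity h_{j+1,j} = ( ‖z_j‖² − Σ_{i=1}^{j} h_{i,j}² )^{1/2}, where z_j = A v_j is the vector before orthogonalization, so that no additional reduction is required for the norm of the projected vector.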

This computation is unstable and suffers from numerical cancellation because it does not account for the loss of orthogonality in V_j and only approximates the squared norm of the projected vector.

Both Ghysels et al. [6] and Yamazaki et al. [26] describe pipelined algorithms based on CGS-1 that incorporate an iteration lag in the normalization step to avoid the unstable normalization above. However, these algorithms are somewhat unstable because the loss of orthogonality for CGS-1 is O(ε)κ²(A); see [8]. The authors employ a change of basis, factorization, and shifts in order to mitigate these instabilities. However, the pipeline depths still must remain relatively small.

The MGS-GMRES algorithm is backward stable [15] and orthogonality is maintained to O(ε)κ(A), where κ(A) is the condition number of A, defined as the ratio of the largest to smallest singular values [7]. By employing the Level-2 BLAS compact MGS algorithm with a backward lag, the normalization belonging to one iteration is computed to full accuracy together with the orthogonalization of the next. In effect, the lag creates a pipeline where orthogonalization of the Krylov vectors is performed before the corresponding diagonal entry of the matrix R in the QR factorization is computed at the following iteration. The lagged compact algorithm is implemented below as Algorithm 7 for the j-th iteration of MGS-GMRES.

1:, .
2:
3:
4:
5:,
6:
7:Apply Givens rotations to
8:for  do
9:      
10:      
11:      ,
12:      
13:      Apply Givens rotations to
14:end for
15:
16:
Algorithm 7 One Synchronization MGS-GMRES with Lagged Normalization

The normalization for each iteration is computed in Steps 11 and 12. The corresponding column of the Hessenberg matrix is therefore pre-computed one iteration ahead and then employed to solve the least squares problem. Because the loss of orthogonality for MGS-GMRES is O(ε)κ(A), Algorithm 7 is not only more stable but can also be pipelined to an arbitrary iteration depth.
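The following NumPy sketch illustrates the lag mechanism for the underlying Arnoldi process: the norm of the previous, still unnormalized basis vector is recovered from the same fused reduction that orthogonalizes the new one. This is a simplified illustration of the idea behind Algorithm 7 in our own notation; it omits the Givens rotations and least squares update of GMRES.

```python
import numpy as np

def lagged_mgs_arnoldi(A, b, m):
    """Arnoldi with compact MGS and lagged normalization (illustrative sketch)."""
    n = len(b)
    V = np.zeros((n, m))
    H = np.zeros((m + 1, m))
    L = np.zeros((m, m))                 # strictly lower triangular inner products
    u = b.copy()                         # unnormalized candidate for v_1
    for j in range(m):
        z = A @ u                        # matvec applied to the unnormalized vector
        # --- single fused reduction: all inner products involving u and z ---
        cu = V[:, :j].T @ u              # lagged re-orthogonalization data (row of L)
        cz = V[:, :j].T @ z
        uu = u @ u                       # lagged squared norm of the previous candidate
        uz = u @ z
        # lagged normalization of the previous candidate vector
        beta = np.sqrt(uu)
        V[:, j] = u / beta
        if j > 0:
            H[j, j - 1] = beta           # subdiagonal entry, available one step late
        L[j, :j] = cu / beta             # new strictly lower triangular row
        z, cz = z / beta, cz / beta      # rescale quantities that involved u
        # compact MGS projection of z against V[:, :j+1]
        c = np.append(cz, uz / (beta * beta))
        T = np.eye(j + 1) + L[:j + 1, :j + 1]
        r = np.linalg.solve(T, c)
        H[:j + 1, j] = r
        u = z - V[:, :j + 1] @ r         # next candidate; its norm is obtained
                                         # from the next iteration's reduction
    return V, H
```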

1:, .
2:
3:
4:
5:,
6:
7:Apply Givens rotations to
8:for  do
9:      
10:      
11:      ,
12:      
13:      Second orthogonalization
14:      
15:      
16:      ,
17:      
18:      Apply Givens rotations to
19:      Apply Givens rotations to
20:end for
Algorithm 8 MGS-GMRES Pipeline

IV Implementation Details

The standard Level-1 BLAS MGS-GMRES implementation with the modified Gram-Schmidt process requires j + 1 separate inner products (global summations) in iteration j. Therefore, the amount of global communication grows quadratically with the number of iterations. Several authors have proposed ways to reduce the communication overhead associated with MGS-GMRES. The iterated classical Gram-Schmidt algorithm with two reorthogonalization steps requires at least two global reductions. The Trilinos-Belos solver stack employs an iterated classical Gram-Schmidt ICGS-GMRES [1]. There have been several approaches proposed in the literature to reduce the communication overhead associated with ICGS-GMRES. Hernandez et al. [10] implement the re-orthogonalization with communication pipelining. The approach developed here requires less computation and communication than previous algorithms. The one-sync Level-2 BLAS MGS-GMRES algorithm performs the compact modified Gram-Schmidt algorithm by combining a mass inner product and normalization into one global reduction (MPI_Allreduce) per GMRES iteration. The mass inner product Q^T w is computed as a matrix-vector product with the transposed basis (MDOT).

The projection update w − Q r is implemented with the vector mass AXPY (MAXPY).
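As an illustration, the fused reduction can be expressed with mpi4py as follows (a sketch in our own notation; the Hypre implementation is in C with its own communication layer):

```python
import numpy as np
from mpi4py import MPI

def fused_mdot_norm(comm, Q_loc, w_loc):
    """Fuse the mass inner product Q^T w and the squared norm w^T w into one
    Allreduce.  Q_loc and w_loc hold the locally owned rows of Q and w."""
    local = np.empty(Q_loc.shape[1] + 1)
    local[:-1] = Q_loc.T @ w_loc               # local part of the mass inner product
    local[-1] = w_loc @ w_loc                  # local part of the squared norm
    result = np.empty_like(local)
    comm.Allreduce(local, result, op=MPI.SUM)  # the single global reduction
    return result[:-1], np.sqrt(result[-1])    # Q^T w and ||w||_2

# usage: coeffs, nrm = fused_mdot_norm(MPI.COMM_WORLD, Q_loc, w_loc)
```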

The PETSc library employs the vector MAXPY for a pipelined GMRES iteration with reorthogonalization [6].

For optimal performance of these two algorithms on current generation multi-core processors, cache blocking combined with loop unrolling can be exploited. Once the Krylov subspace dimension is sufficiently large, the outer loop of the mass inner product should be unrolled into batches of two, four, or more summations. The overall effect is to expose more work to multiple floating point units and to pre-fetch the vector w together with multiple columns of Q into the cache. The outer loop can also be threaded, given the typically large vector lengths. In an MPI parallel implementation, the columns of Q are partitioned across the MPI ranks. Unrolling should have the beneficial effect of increasing the memory bandwidth utilization of the processors. The vector MAXPY should also be unrolled, with the elements pre-fetched into the cache. The Hessenberg matrix is small and dense; thus, it is laid out in memory by columns for cache access and fast application of the orthogonal rotations in the least-squares problem. The remaining operations inside the one-sync MGS-GMRES are norm computations, which are BLAS-1 operations, (sparse) matrix-vector products, and the application of a preconditioner. BLAS-1 operations are usually highly optimized for the particular architecture. The sparse matrix-vector multiply is parallel and can be easily partitioned by rows when using a compressed-sparse row (CSR) storage format.

Both the mass inner product and the vector MAXPY are well-suited to a single-instruction multiple-data (SIMD) model of parallel execution. These computations can be massively threaded on a GPU. For GPU execution, the mass inner product can be off-loaded to the GPU but requires multiple CUDA kernel launches to perform synchronizations between reduction steps. Synchronization is possible within streaming multiprocessors (SM) executing thread blocks; however, a CUDA kernel launch is needed to synchronize across SMs. Unrolling of the inner products into batches is also possible and permits higher sustained memory bandwidth between the GPU global memory and the SMs. This is also the case for the vector MAXPY. A CUDA implementation of the vector MAXPY is provided by PETSc. A recent paper by Romero et al. [16] also describes a vector MAXPY for the GPU. The GPU implementation of the CGS-1 based CA-GMRES algorithm is described in [27]. An important optimization to save storage and promote data re-use is to fuse the matvec and preconditioner steps and not store the preconditioned vectors [24].

V Numerical Results

The primary motivation for proposing the low synchronization GMRES algorithms was to improve the performance and strong-scaling characteristics of DOE physics-based simulations at exascale and, in particular, the Nalu CFD solver, Domino [5]. The Nalu model solves the incompressible Navier-Stokes equations on unstructured finite-element meshes. The momentum and pressure equations are time-split and solved sequentially. Both the momentum and pressure-continuity equations are solved with preconditioned GMRES iterations. The pressure preconditioner employs either smoothed-aggregation or classical Ruge-Stüben algebraic multigrid (C-AMG); the latter is provided by the Hypre-BoomerAMG library of solvers [9]. The Krylov solvers, which typically involve a substantial amount of communication, are an expensive part of the simulation, requiring over 50% of the simulation time.

In order to evaluate the performance of the different GMRES solvers, a separate Hypre linear solver application has been implemented. In all of our performance studies, matrices were obtained from Nalu wind turbine simulations and tested in our Hypre solver application. The performance studies involving GPU acceleration were completed on a single node of the Summit-Dev supercomputer at ORNL. Summit-Dev is built with IBM S822LC nodes, which consist of two 10-core IBM Power-8 processors with 256 GB of DDR4 memory and four NVIDIA Tesla P100 GPUs connected by NVLink running at 80 GB/sec. Each GPU has 16 GB of HBM2 memory. The interconnect is based on two rails of Mellanox EDR InfiniBand.

Gram-Schmidt orthogonalization, as emphasized earlier, is the most expensive part of the GMRES algorithm. Hence, the algorithms developed in this paper, based upon the one- and two-synchronization CGS and MGS Gram-Schmidt variants, are applied. For each example, we specify which orthogonalization method was employed.

V-A Numerical Stability and Accuracy

Linear systems appearing in the literature are employed to evaluate the numerical stability and limiting accuracy of several of the GMRES algorithms described here. All tests were performed in Matlab, and the loss of orthogonality metric plotted for the Krylov vectors is the one derived by Paige [14]. The first problem was proposed by Simoncini and Szyld [21]. A diagonal matrix of dimension 100 is specified, together with a random right-hand side b that is normalized to unit norm. The matrix condition number is controlled by the small first diagonal element. A preconditioner is not employed in this case. The problem allows us to specify the limiting accuracy of the GMRES iteration. In particular, Figure 1 plots the relative residual versus the number of iterations for the standard level-1 BLAS MGS-GMRES and the level-2 one-synch lagged MGS-GMRES algorithm. These exhibit identical convergence behaviour, with the loss of orthogonality metric increasing to one at 80 iterations, at which point the convergence of the relative residual stalls. In the case of the one- or two-synch classical Gram-Schmidt (CGS-2) algorithms based on Algorithm 5, the loss of orthogonality metric remains close to machine precision and the convergence curve continues to decrease without stalling.

A second problem is taken from the Florida (now Texas A&M) sparse matrix collection. The particular system is ‘thermal1’. The right-hand side is obtained from the matrix applied to a specified solution vector. For this problem a Ruge-Stüben [17] classical AMG preconditioner is applied. Once again the standard level-1 MGS-GMRES and level-2 one-synch GMRES algorithms are compared with the CGS-2 GMRES algorithm(s). The relative residuals and loss of orthogonality metrics are plotted in Figure 2. Observe that the convergence stalls for the MGS-GMRES algorithms once orthogonality is lost, whereas the CGS-2 variants maintain orthogonality to near the level of machine precision.

Fig. 1: Simoncini matrix. Comparison of limiting accuracy for MGS-GMRES and CGS2-GMRES. Loss of orthogonality measure ‖S_k‖.
Fig. 2: Thermal1 matrix. Comparison of limiting accuracy for MGS-GMRES and CGS2-GMRES. Loss of orthogonality measure ‖S_k‖.

V-B GPU C/CUDA Performance

The results presented below were obtained on the Summit-Dev computational cluster (OLCF) using NVCC V9.0.67. All variables are allocated in device memory. The standard cuSPARSE and cuBLAS functions are used to perform the BLAS operations (SpMV, AXPY, SCAL, DOT). In principle, MDOT and MAXPY can be expressed as a dense matrix-matrix product (MDOT) and a combination of dense matrix-matrix products (MAXPY) and implemented using standard cuBLAS routines. However, it was found that this approach was less efficient than our custom implementation. We speculate that the main reason for the sub-optimal performance was the need to transpose the basis matrix in MDOT and to perform two CUDA kernel calls instead of one for MAXPY. Hence, MAXPY and MDOT are implemented as fused operations in CUDA and optimized for maximum performance.

The problem employed for the GPU performance analysis was extracted from the Nalu Vestas V27 wind turbine simulation described in Thomas et al. [25]. The discretization scheme is based on the Control-Volume Finite-Element method (CVFEM). The matrix is stored in the standard compressed-sparse row (CSR) format, which is supported by the cuSPARSE library. A preconditioner is not applied in this test.

Five approaches are compared for solving the linear system: (1) Level-1 BLAS MGS-GMRES(m) compiled using CUDA, (2) MGS-GMRES(m) written by the authors for Hypre using only device memory and reducing unnecessary data copies and convergence checks, (3) GMRES(m) with CGS-1 and the Ghysels normalization, (4) CGS-2 GMRES (Algorithm 1), and (5) two-synch GMRES(m) as in Algorithm 3. Several restart values m are used, with a fixed relative tolerance. The only difference between approaches (2)–(5) is how they orthogonalize the Krylov vectors. Figure 3 displays the improvement in run-time obtained with the alternative orthogonalization approaches. The improvement is much greater, as expected, for large restart values. Figure 5 indicates the associated speedups.

Fig. 3: Performance of the Hypre GPU implementation of Level-1 BLAS MGS-GMRES(m) versus GMRES(m) with alternative orthogonalization strategies. Tolerance . Linear system from Nalu wind simulation

Figure 4 plots the ratio of the wall-clock time spent in Gram-Schmidt to the total run-time of the sparse linear system solver. One can easily observe that the Level-1 BLAS modified Gram-Schmidt accounts for up to 90% of the total run-time. Using GMRES with the alternative orthogonalization strategies lowers this fraction from roughly 70% down to 40%. The remainder of the execution time is taken by various BLAS routines and matvecs. Hence, an even larger relative improvement in run-time is expected when a preconditioner is used.

Fig. 4: Performance of the Hypre GPU implementation of Level-1 BLAS MGS-GMRES(m) with MGS and GMRES(m) with alternative orthogonalization strategies. The ratio of the time spent on Gram-Schmidt orthogonalization versus time to solution.

Figure 5 displays the speedup resulting from using low-latency orthogonalization methods. The implementation of the Hypre MGS-GMRES GPU implementation is taken as the baseline reference timing.

Fig. 5: The speedup resulting from replacing Hypre Level-1 BLAS modified Gram-Schmidt with alternate low-latency GMRES variants.

Figure 6 displays the number of GMRES iterations resulting from using the low-latency orthogonalization methods. It may be observed that for short restart cycles the number of iterations varies among the different methods. This is expected, given the stability of CGS-1 relative to CGS-2 and MGS.

Fig. 6: Number of iterations for Hypre Level-1 BLAS MGS-GMRES(m) versus alternate low latency GMRES(m).

The ‘Parabolic FEM’ matrix from the Florida sparse matrix collection is used in a second performance test. Total run-times are reported in Table I. The Hypre-BoomerAMG V-cycle with 3 levels is the preconditioner, with a scaled Jacobi smoother on the GPU. The total solver (Solve) time is reported along with the Gram-Schmidt (Orth) times and the number of GMRES iterations (Iters) required to reach the specified relative tolerance. The lowest run-times are achieved by the GMRES two-synch algorithm; these are 60% faster than the Hypre Level-1 BLAS MGS-GMRES (for the 1e-13 tolerance).

Method tolerance Iters Orth (s) Solve (s)
GMRES-HYPRE 1e-13 78 0.95 3.06
GMRES-MGS (new) 1e-13 78 0.25 2.03
GMRES-CGS1 alt norm 1e-13 N/A N/A N/A
GMRES-CGS2 1e-13 78 0.14 2.00
GMRES-two synch 1e-13 78 0.12 1.92
GMRES-HYPRE 1e-15 N/A N/A N/A
GMRES-MGS (new) 1e-15 93 0.27 2.34
GMRES-CGS1 alt norm 1e-15 N/A N/A N/A
GMRES-CGS2 1e-15 92 0.15 2.14
GMRES-two synch 1e-15 92 0.13 2.26
TABLE I: Parabolic FEM matrix; N/A indicates the method did not converge.

Initial testing of the Hypre linear solver on the Peregrine supercomputer at NREL has demonstrated that the low-latency algorithms reduce the communication overhead of the MPI global reductions. Peregrine nodes contain Intel Xeon E5-2670 v3 Haswell processors and 64 GB of DDR4 memory, and are interconnected with an InfiniBand network. The run-time spent in the Gram-Schmidt orthogonalization is plotted in Figure 7 for the Nalu V27 linear system. Clearly, the new algorithms reduce this time as the node count increases.


Fig. 7: Gram-Schmidt kernels time on Peregrine at NREL. Time for a single linear system solve. Matrix obtained from Vestas V27 66M element mesh with 29M DOF.

VI Conclusions and Future Goals

We have presented low-latency classical and modified Gram-Schmidt algorithms and applied them to the Arnoldi-GMRES algorithm. The number of synchronization steps depends on the desired level of orthogonality, stability, and limiting accuracy of the resulting GMRES Krylov iterations. A one-synchronization MGS-GMRES algorithm based on the compact MGS algorithm with a lagged normalization step results in a backward-stable algorithm [15] with O(ε)κ(A) loss of orthogonality. GPU implementations of these low-latency algorithms achieve up to a 35 times speed-up over the Hypre Level-1 BLAS MGS-GMRES algorithm.

The execution speed of our GPU low-latency GMRES algorithms implies a new scaling paradigm for large sparse linear solvers. The relative balance of run-time split between GMRES (including matrix-vector multiplies) and a preconditioner such as the Hypre-BoomerAMG V-cycle may now shift in favor of a larger number of inexpensive GMRES iterations combined with a light-weight V-cycle in order to achieve a much lower overall run-time. For example, we solved a large linear system in under two seconds with 100 GMRES iterations, using GPU-accelerated matrix-vector multiplies, preconditioned by a less costly V-cycle with a Jacobi smoother on the GPU. The challenge is to extend this scaling result to a large number of nodes based on a low number of synchronizations.

An obvious extension of this work is a distributed-memory parallel implementation and testing using MPI+OpenMP and MPI+CUDA. A first step in this direction would be a multi-GPU implementation. The theory developed in this paper and the preliminary results indicate that our approach, while stable, also has the desired scalability properties in terms of both fine- and coarse-grained parallelism. Further investigation is needed to establish the MPI-parallel properties of the proposed algorithms at large node counts.

VII Acknowledgements

This work was authored in part by the National Renewable Energy Laboratory, operated by Alliance for Sustainable Energy, LLC, for the U.S. Department of Energy (DOE) under Contract No. DE-AC36-08GO28308. Funding provided by the Exascale Computing Project (17-SC-20-SC), a collaborative effort of two U.S. Department of Energy organizations (Office of Science and the National Nuclear Security Administration) responsible for the planning and preparation of a capable exascale ecosystem, including software, applications, hardware, advanced system engineering, and early testbed platforms, in support of the nation’s exascale computing imperative. The views expressed in the article do not necessarily represent the views of the DOE or the U.S. Government. The U.S. Government retains and the publisher, by accepting the article for publication, acknowledges that the U.S. Government retains a nonexclusive, paid-up, irrevocable, worldwide license to publish or reproduce the published form of this work, or allow others to do so, for U.S. Government purposes.

References

  • [1] Bavier, E., M. Hoemmen, S. Rajamanickam, H. Thornquist (2012). Amesos2 and Belos: Direct and Iterative Solvers for Large Sparse Linear Systems. Sci. Prog. vol. 20. no. 3. 241–255.
  • [2] Bischof, C., and Van Loan, C. (1987). The WY Representation for Products of Householder Matrices. SIAM J. Sci. Comp. vol. 8. 2–13.
  • [3] Björck, A., (1967). Solving linear least squares problems by Gram-Schmidt orthogonalization. BIT. 7. 1–21.
  • [4] Björck, A., (1994). Numerics of Gram-Schmidt orthogonalization. Linear Algebra Appl. vol. 197. 297–316.
  • [5] Domino, S., (2018). Design-order, non-conformal low-Mach fluid algorithms using a hybrid CVFEM/DG approach. J. Comp. Phys., 359. 331–351.
  • [6] Ghysels, P., T. J. Ashby, K. Meerbergen, W. Vanroose, (2013). Hiding global communication latency in the GMRES algorithm on massively parallel machines. SIAM J. Sci. Comp., vol. 35, 48–71.
  • [7] Giraud, L, J. Langou, M. Rozloznik, (2005). On the loss of orthogonality in the Gram-Schmidt orthogonalization process. Comp. Math. Appl.. vol. 50. 1069–1075
  • [8] Giraud, L., J. Langou, M. Rozloznik, J. van den Eshof, (2006). Rounding error analysis of the classical Gram-Schmidt orthogonalization process. Numer. Math. vol. 101. 87–100.
  • [9] Henson, V. E., and U. M. Yang, (2000). BoomerAMG: A parallel algebraic multigrid solver and preconditioner. Appl. Numer. Math., 41. 155–177.
  • [10] Hernandez, V., J. E. Roman, and A. Tomas, (2007), Parallel Arnoldi solvers with enhanced scalability via global communication rearrangement. Parallel Comput. vol. 33. 521–540.
  • [11] Hoemmen, M, (2010). Communication-avoiding Krylov subspace methods. University of California, Berkeley, Tech. Rep. UCB/EECS-2010-37.
  • [12] Leon, S. J., A. Björck, W. Gander, (2010). Gram-Schmidt orthogonalization: 100 years and more. Numer. Linear Algebra Appl.. 1–40.
  • [13] Malard, J., and C. C. Paige, (1994), Efficiency and scalability of two parallel QR Factorization Algorithms. Proceedings of the Scalable High-Performance Computing Conference, IEEE. 615–622.
  • [14] Paige, C. C., (2018). The effects of loss of orthogonality on large scale numerical computations. Proceedings of ICCSA 2018, Gervasi et al. (Eds.), LNCS 10962. Springer. 429–439.
  • [15] Paige, C. C., M. Rozloznik, and Z. Strakos, (2006). Modified Gram Schmidt (MGS), least squares, and backward stability of MGS-GMRES. SIAM J. Matrix Anal. Appl., vol. 28. 264–284.
  • [16] Romero, E., A. Tomás, A. Soriano, and I. Blanquer, (2014). A fast sparse block circulant matrix vector product. Proceedings Euro-Par 2014, LNCS 8632, Springer-Verlag, 548–559.
  • [17] Ruge, J. W., and K. Stüben, (1987). Algebraic Multigrid, in Multigrid Methods, S. McCormick, ed., Frontiers in Applied Mathematics, SIAM, 73–130.
  • [18] Ruhe, A, (1983). Numerical aspects of Gram-Schmidt orthogonalization of vectors. Linear Algebra Appl. vol 52. 592–601.
  • [19] Saad, Y. and M. H. Schultz, (1986). GMRES: A generalized minimal residual algorithm for solving nonsymmetric linear systems. SIAM J. Sci. Statist. Comput. vol. 7, 856–869.
  • [20] Schreiber, R. and C. F. Van Loan, (1989). A storage efficient representation for products of Householder transformations. SIAM J. Sci. Statist. Comput. vol. 10. 53–57.
  • [21] Simoncini, V. and D. B. Szyld, (2003). Theory of inexact Krylov subspace methods and applications to scientific computing. SIAM J. Sci. Comput. vol. 25. 454–477.
  • [22] Smoktunowicz, A., J. L Barlow, J. Langou, (2006). A note on the error analysis of classical Gram-Schmidt. Numer. Math. 105 (2), 299-313.
  • [23] Świrydowicz, K., Chalmers, N., Karakus, A., Warburton, T. (2017) Acceleration of tensor-product operations for high-order finite element methods. Available at https://arxiv.org/abs/1711.00903.
  • [24] Świrydowicz, K., (2017). Strategies For Recycling Krylov Subspace Methods And Bilinear Form Estimation. Ph.D. Thesis, Virginia Tech. http://hdl.handle.net/10919/78695
  • [25] Thomas, S. J., S. Ananthan, S. Yellapantula, J. J. Hu, M. A. Sprague, (2018). A comparison of classical and aggregation-based algebraic multigrid preconditioners for high-fidelity simulation of wind-turbine incompressible flows. SIAM J. Sci. Statist. Comput., Submitted.
  • [26] Yamazaki, I., M. Hoemmen, P. Luszczek, J. Dongarra, (2017). Improving performance of GMRES by reducing communication and pipelining global collectives. University of Tennessee, ICL Tech. Rep. 973-2017.
  • [27] Yamazaki, I., H. Anzt, S. Tomov, M. Hoemmen, and J. Dongarra, (2014). Improving the performance of CA-GMRES on multicores with multiple GPUs. IPDPS 2014. 382–391.