# Neumann Series in GMRES and Algebraic Multigrid Smoothers

Neumann series underlie both Krylov methods and algebraic multigrid smoothers. A low-synch modified Gram-Schmidt (MGS)-GMRES algorithm is described that employs a Neumann series to accelerate the projection step. A corollary to the backward stability result of Paige et al. (2006) demonstrates that the truncated Neumann series approximation is sufficient for convergence of GMRES. The lower triangular solver associated with the correction matrix T_m = ( I + L_m )^-1 may then be replaced by a matrix-vector product with T_m = I - L_m. Next, Neumann series are applied to accelerate the classical Rüge-Stuben algebraic multigrid preconditioner using both a polynomial Gauss-Seidel or incomplete ILU smoother. The sparse triangular solver employed in these smoothers is replaced by an inner iteration based upon matrix-vector products. Henrici's departure from normality of the associated iteration matrices leads to a better understanding of these series. Connections are made between the (non)normality of the L and U factors and nonlinear stability analysis, as well as the pseudospectra of the coefficient matrix. Furthermore, re-orderings that preserve structural symmetry also reduce the departure from normality of the upper triangular factor and improve the relative residual of the triangular solves. To demonstrate the effectiveness of this approach on many-core architectures, the proposed solver and preconditioner are applied to the pressure continuity equation for the incompressible Navier-Stokes equations of fluid motion. The pressure solve time is reduced considerably with no change in the convergence rate and the polynomial Gauss-Seidel smoother is compared with a Jacobi smoother. Numerical and timing results are presented for Nalu-Wind and the PeleLM combustion codes, where ILU with iterative triangular solvers is shown to be much more effective than polynomial Gauss-Seidel.

## Authors

• 8 publications
• 2 publications
• 3 publications
• 5 publications
• 12 publications
11/18/2021

### ILU Smoothers for C-AMG with Scaled Triangular Factors

ILU smoothers can be effective in the algebraic multigrid V-cycle. Howev...
04/04/2022

### Numerical Prediction and Post-Test Numerical Analysis of the ASDMAD Wind Tunnel Tests in ETW

This paper presents numerical results in comparison with experimental da...
04/02/2021

### Two-Stage Gauss–Seidel Preconditioners and Smoothers for Krylov Solvers on a GPU cluster

Gauss-Seidel (GS) relaxation is often employed as a preconditioner for a...
03/16/2022

### Convergence Acceleration of Preconditioned CG Solver Based on Error Vector Sampling for a Sequence of Linear Systems

In this paper, we focus on solving a sequence of linear systems with an ...
11/16/2019

### New Polynomial Preconditioned GMRES

A new polynomial preconditioner is given for solving large systems of li...
03/28/2022

### Application of Stabilized Explicit Runge-Kutta Methods to the Incompressible Navier-Stokes Equations by means of a Projection Method and a Differential Algebraic Approach

In this master thesis we have compared different second order stabilized...
04/02/2021

### Low-Synch Gram-Schmidt with Delayed Reorthogonalization for Krylov Solvers

The parallel strong-scaling of Krylov iterative methods is largely deter...
##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction

The purpose of the present work is to show the important role Neumann series play in MGS-GMRES and C-AMG smoothers. By revealing this structure and analyzing it, we devise ways to speed-up the GMRES+AMG solver in hypre [Falgout2004]. The generalized minimal residual (GMRES) Krylov subspace method [Saad86] is often employed to solve the large linear systems arising in high-resolution physics based simulations using the incompressible Navier-Stokes equations in fluid mechanics. Świrydowicz et al. [2020-swirydowicz-nlawa] recently improved the parallel strong-scaling of the algorithm by reducing the MPI communication requirements to a minimum, while maintaining the numerical stability and robustness of the original algorithm. In order to achieve fast convergence of an elliptic solver, such as the pressure continuity equation, the classical Ruge-Stüben [Ruge1987] algebraic multigrid (C-AMG, or simply AMG) solver is applied as a preconditioner. The low-synchronization MGS-GMRES iteration, Gauss-Seidel and ILU smoothers employ direct triangular solvers.111Note that the phrase direct triangular solver is used to differentiate between the iterative approximations versus the traditional forward and backward recurrences. The triangular solve in GMRES is relatively small and local to each MPI rank. For Gauss-Seidel the coarse level matrices in the AMG -cycle are distributed and some communication is required. Triangular solvers are, in general, difficult to implement in parallel on many-core architectures. In this study, the associated iteration matrices are replaced by truncated Neumann series expansions leading to polynomial-type smoothers. They result in a highly efficient and backward stable approach for Exascale class supercomputers.

Consider the large, sparse linear systems arising from the discretization of the incompressible Navier-Stokes pressure continuity equation [Ashesh2021]. In the present study, linear systems of the form with an real-valued matrix, are solved with Krylov subspace methods using one -cycle of the C-AMG algorithm as the preconditioner. Here, let denote the initial residual with initial guess . Inside GMRES, the Arnoldi algorithm is applied to generate an orthonormal basis for the Krylov subspace spanned by the columns of the matrix, , where , and produces the Hessenberg matrix, , in the Arnoldi expansion such that

 AVm=Vm+1Hm+1,m.

In particular, the Arnoldi algorithm produces a factorization of and the columns of form an orthogonal basis for the Krylov subspace [2006--simax--paige-rozloznik-strakos]. The orthogonality of the columns determines the convergence of Krylov methods for linear system solvers. However, in finite-precision arithmetic, may “lose” orthogonality and this loss, as measured by , may deviate substantially from machine precision, , (see Paige and Strakos [Paige2002]). When linear independence is completely lost, the Krylov iterations may fail to converge. For example, the MGS-GMRES iteration will stall in this case, and this occurs when , where the matrix was introduced in the seminal backward stability analysis of Paige et al. [2006--simax--paige-rozloznik-strakos]. Here, is the strictly upper triangular part of .

The development of low-synchronization Gram-Schmidt and generalized minimal residual algorithms by Świrydowicz et al. [2020-swirydowicz-nlawa] and Bielich et al. [DCGS2Arnoldi] was largely driven by applications that need stable, yet scalable solvers. Both the modified (MGS) and classical Gram-Schmidt with re-orthogonalization (CGS2) are stable algorithms for a GMRES solver. Indeed, CGS2 results in an loss of orthogonality, which suffices for GMRES to converge. Paige et al. [2006--simax--paige-rozloznik-strakos] show that despite loss of orthogonality, MGS-GMRES is backward stable for the solution of linear systems. Here, the condition number of the matrix is given by , where and

are the maximum and minimum singular values of the matrix

, respectively.

An inverse compact (ICWY) modified Gram-Schmidt algorithm is presented in [2020-swirydowicz-nlawa] and is based upon the application of a projector

 P=I−VmTmVTm,T=(VTmVm)−1

where is again ,

is the identity matrix of dimension

, and is an correction matrix. To obtain a one-reduce MGS algorithm, or one MPI global reduction per GMRES iteration, the normalization is delayed to the next iteration. The correction matrix is obtained from the strictly lower triangular part of , denoted . Note that because has almost orthonormal columns, the norm of is small, and is close to (here, the identity matrix of dimension ).

In this work, an alternative to computing the inverse of via triangular solve is developed. A Neumann series expansion for the inverse of the lower triangular correction matrix, , results in the ICWY representation of the projector for the modified Gram-Schmidt factorization of the Arnoldi matrix . This is written as

 P=I−VmTVTm,Tm=(I+Lm)−1=I−Lm+L2m−⋯+Lpm, (1)

where the columns of are defined by the matrix-vector products . The sum is finite because the matrix is nilpotent, as originally noted by Ruhe [Ruhe83].

In this paper, a corollary to the backward stability result of Paige et al. [2006--simax--paige-rozloznik-strakos] and Paige and Strakoš [Paige2002] is presented that demonstrates matrix-vector multiplication by in the projection step is sufficient for convergence of MGS-GMRES when , where . A new formulation of GMRES based upon the truncated Neumann series for the matrix is presented. In particular, the loss of orthogonality and backward stability of the truncated and MGS-GMRES algorithms are the same because of the above corollary. For extremely ill-conditioned matrices, the convergence history of the new algorithm has been found to be identical to the original GMRES algorithm introduced by Saad and Schultz  [Saad86]. In particular, convergence is achieved when the norm-wise relative backward error reaches machine precision (NRBE); see Section 4 for more details. For this reason, Paige and Strakoš [Paige2002] recommended that the NBRE be applied as the stopping criterion.

A polynomial Gauss-Seidel iteration can also be interpreted with respect to the degree- Neumann series expansion as described in the recent paper by Mullowney et al. [Mullowney2021]. Given the regular matrix splitting , , and ,222Note that by convention, this matrix splitting is written such that is the strictly lower triangular part of , and is the strictly upper triangular part. The matrices and in Gauss-Seidel smoother are different from the those used in Gram-Schmidt projection in (1). the polynomial Gauss-Seidel iteration is given by

 x(k+1):=x(k)+(I+D−1L)−1D−1r(k)=x(k)+p∑j=0(−D−1L)jD−1r(k)

For the ILU factorization [ChowSaad1997], the factor has a unit diagonal and strictly lower triangular part . Thus, the inverse may be expressed as a Neumann series and, with nilpotent, this is a finite sum. Similarly, for the strictly upper triangular factor such that , after either row or row/column scaling, the inverse can be written as a Neumann series, as follows

 (I+Us)−1=I−Us+U2s−⋯

Furthermore, when , for the series converges. In the present study, the above iterations are applied as smoothers when constructing a C-AMG preconditioner.

In [Thomas2021], a method for scaling the factor to form in order to mitigate a high degree of non-normality and condition number is proposed. In general, fast convergence of the Jacobi iteration can then be expected [Anzt2015] when iteratively computing the triangular solution, thus avoiding the comparatively high cost on the GPU of a direct triangular solve. Here, the analysis in [Thomas2021] is extended and theoretical upper bounds on the departure from normality [Henrici1962] of a (row and/or column) scaled triangular matrix are provided. In the case of nonsymmetric coefficient matrices, reordering prior to scaling may be employed to enforce a symmetric sparsity pattern. Following the analysis in [ChowSaad1997], an empirical relationship between the drop tolerance and fill level used when computing an incomplete factorization and the subsequent departure from normality of the triangular factors following row and/or column scaling is established.

An additional contribution of our paper is to combine three different Neumann series to create fast and robust solvers. The time to solution is the most important metric versus the number of solver iterations taken in fluid mechanics simulations. The low-synchronization MGS-GMRES may take a larger number of iterations at little or no extra cost if the preconditioner is less expensive. However, the triangular solvers employed by smoothers are not efficient on multi-core architectures. Indeed, the sparse matrix-vector product (SpMV) is 25–50 faster on GPU architectures [Anzt2015]. This observation prompted the introduction of polynomial AMG smoothers in [Mullowney2021], which significantly lowers the cost of the -cycle.

The paper is organized as follows. Low synchronization and generalized minimal residual algorithms are discussed in Section 2. A corollary to Paige et al. [2006--simax--paige-rozloznik-strakos] in Section 3 establishes that is sufficient for convergence. In Section 4, a variant of MGS-GMRES is presented that uses this . Section 5 describes recent developments for hypre on GPUs. The polynomial Gauss-Seidel and ILU smoothers are then introduced in Section 5 where it is shown that pivoting leads to rapid convergence of iterative triangular solvers. The departure from normality of the factors can also be mitigated by either row or row and column scaling via the Ruiz algorithm, thus avoiding divergence of the iterations.

Notation Lowercase bold letters denote a column vector and uppercase letters are matrices (e.g., and , respectively). represents the scalar entry of a matrix , and denotes the column of . Where appropriate, for a matrix , is the –th column. Superscripts indicate the approximate solution (e.g., ) and corresponding residual (e.g., ) of an iterative method at step . Throughout this paper, we will explicitly refer to strictly upper/lower triangular matrices and use the notation (or ) and (or ).333We note that the distinction between these two notations in crucial. For , the size of the strictly upper triangular matrix changes with , whereas the size of remains fixed. Vector notation indicates a subset of the rows and/or columns of a matrix; e.g., denotes the first rows and columns of the matrix and the notation represents the entire row of the first columns of . represents an matrix, and in particular refers to a Hessenberg matrix. In cases where standard notation in the literature is respected that may otherwise conflict with the aforementioned notation, this will be explicitly indicated.

## 2 Low-Synchronization Gram-Schmidt Algorithms

Krylov linear system solvers are often required for extreme scale physics simulations on parallel machines with many-core accelerators. Their strong-scaling is limited by the number and frequency of global reductions in the form of MPI_AllReduce and these communication patterns are expensive. Low-synchronization algorithms are designed such that they require only one reduction per iteration to normalize each vector and apply projections.

A review of compact Gram Schmidt algorithms and their computational costs is given in [DCGS2Arnoldi]. The ICWY form for MGS is the lower triangular matrix and was derived in Świrydowicz et al. [2020-swirydowicz-nlawa]. Here, denotes the identity matrix and is strictly lower triangular. Specifically, these algorithms batch the inner-products together and compute one row of as

 Lk−1,1:k−2=(VTk−2vk−1)T. (2)

The resulting ICWY projector is given by

 P=I−Vk−1Tk−1VTk−1,Tk−1=(I+Lk−1)−1

The implied triangular solve requires an additional flops at iteration and thus leads to a slightly higher operation count compared to the original MGS algorithm. The matrix-vector multiply in (2) increases the complexity by ( total) but decreases the number of global reductions from at iteration to only one reduction when combined with the lagged normalization of a Krylov vector.

## 3 Loss of Orthogonality

When the Krylov vectors are orthogonalized via the finite precision MGS algorithm, and their loss of orthogonality is related in a straightforward way to the convergence of GMRES. In particular, orthogonality among the Krylov vectors is effectively maintained until the norm-wise relative backward error approaches the machine precision as discussed in Paige and Strakoš [Paige2002]. In this section, a corollary to Paige et al. [2006--simax--paige-rozloznik-strakos] establishes that , where for the ICWY MGS formulation of the Arnoldi expansion.

Let be an real-valued matrix, and consider the Arnoldi factorization of the matrix . After steps, in exact arithmetic, the algorithm produces the factorization

 AVk=Vk+1Hk+1,k,VTk+1Vk+1=Ik+1

where is an upper Hessenberg matrix. When applied to the linear system , assume , , and . The Arnoldi algorithm produces an orthogonal basis for the Krylov vectors spanned by the columns of the matrix . Following the notation in [2006--simax--paige-rozloznik-strakos], the notation is employed to represent the computed matrix in finite-precision arithmetic with columns, and to represent the properly normalized matrix with columns. We refer the reader to [2006--simax--paige-rozloznik-strakos] for more details on these definitions. Consider the computed matrix with Krylov vectors as columns. The strictly lower triangular matrix is obtained from the loss of orthogonality relation

 ¯VTk¯Vk=I+¯Lk+¯LTk.

Let us first consider the lower triangular solution algorithm for where the elements of appear in the correction matrix , along with higher powers of the inner products in the Neumann series.

A corollary to the backward stability results from Paige and Strakoš [Paige2002], and Paige et al. [2006--simax--paige-rozloznik-strakos] is now established, namely that the Neumann series for may be truncated according to

 Tk=(I+~Lk)−1=I−~Lk+O(ε2)κ2(B).

The essential result is based on the factorization of the matrix

 B=[r(0),AVk]=Vk+1[e1ρ,Hk+1,k].

A bound on the loss-of-orthogonality given by

 ∥I−~VTk+1~Vk+1∥F≤κ([r(0),AVk])O(ε),

and the derivation of this result allows us to find an upper bound for .

The upper triangular structure of the loss of orthogonality relation is revealed to be

 ~Uk=(I−~Sk)−1~Sk=~Sk(I−~Sk)−1

and it follows that ,

 I−~STk = I−~Lk(I+~Lk)−1 = I−~Lk(I−~Lk+~L2k−~L3k+⋯) = (I+~Lk)−1

where is the strictly upper triangular part of . It then follows immediately, that a bound on the loss of orthogonality is given by

 √2∥~VTk~vk∥2≤∥I−~VTk~Vk∥2=√2∥(I−~Sk)−1~Sk∥F≤43(2m)1/2^γn~κF(B)

where

The matrix is defined in [2006--simax--paige-rozloznik-strakos] to be any positive definite diagonal matrix. Therefore, it follows that

 ∥~Uk∥F=∥~LTk∥F≤O(ε)~κF(B),∥~Lk∥pF≤O(εp)~κpF(B)

and thus the matrix inverse from the Neumann series is given by

 Tk=(I+~Lk)−1 = I−~Lk+~L2k−~L3k+⋯ = I−~Lk+O(ε2)κ2(B)

The growth of the condition number above is related to the norm-wise relative backward error

 β(x(k))=∥r(k)∥2∥b∥2+∥A∥∞∥x(k)∥2

and in particular,

## 4 Low-synchronization MGS-GMRES

The MGS–GMRES orthogonalization algorithm is the factorization of a matrix formed by adding a new column to in each iteration

We remind the reader that bold-face with superscripts (e.g., ) denote the residual in the GMRES algorithm and subscripting (e.g., ) denotes the corresponding element in the upper triangular matrix .

The MGS-GMRES algorithm was proven to be backward stable for the solution of linear systems in [2006--simax--paige-rozloznik-strakos] and orthogonality is maintained to , depending upon the condition number of the matrix . The normalization of the Krylov vector at iteration represents the delayed scaling of the vector in the matrix-vector product . Therefore, an additional Step 8 is required in the one-reduce Algorithm 1, and . The diagonal element of the matrix, corresponds to , in the Arnoldi factorization of the matrix , and is updated after the MGS projection in Step 12 of the GMRES Algorithm 1. The higher computational speed on a GPU is achieved when the correction matrix is replaced by in Step 11 of the GMRES Algorithm 1, resulting in a matrix-vector multiply

The most common convergence criterion when solving linear systems of the form in existing iterative solver frameworks is based upon the relative residual,

 ∥r(k)∥2∥b∥2=∥b−Ax(k)∥2∥b∥2

where is some user-defined convergence tolerance. However, when the columns of become linearly dependent, as indicated by , the orthogonality of the Krylov vectors is completely lost. Then the convergence of MGS-GMRES flattens or stalls at this iteration. Due to the relationship with the backward error for solving linear systems , elucidated by Prager and Oettli [Prager64] and Rigal and Gaches [Rigal67], the backward stability analysis of Paige et al. [2006--simax--paige-rozloznik-strakos] relies instead upon the norm-wise relative backward error (NRBE) reaching machine precision.

## 5 Algebraic Multigrid Preconditioner

Neumann series also naturally arise when employing certain smoothers in AMG, and in particular, polynomial Gauss-Seidel and ILU smoothers. Before analyzing this, necessary background information on AMG is first provided.

AMG [Ruge1987] is an effective method for solving linear systems of equations. In theory, AMG can solve a linear system with unknowns in operations, and even though it was originally developed as a solver, it is now common practice to use AMG as a preconditioner to a Krylov method; in particular, when such Krylov solvers are applied to large-scale linear systems arising in physics-based simulations. As either a solver or preconditioner, an AMG method accelerates the solution of a linear system through error reduction by using a sequence of coarser matrices called a hierarchy. These matrices will be denoted , where , and . Each has dimensions where for . For the purposes of this paper, assume that

 Ak=RkAk−1Pk, (4)

for , where is a rectangular matrix with dimensions . is referred to as a prolongation matrix or prolongator. is the restriction matrix and in the Galerkin formulation of AMG. Associated with each , , is a solver called a smoother, which is usually a stationary iterative method, e.g. Gauss-Seidel, polynomial, or incomplete factorization. The coarse solver used for (the lowest level in the hierarchy) is often a direct solver, although it may be an iterative method if is singular.

The setup phase of AMG is nontrivial for several reasons. Each prolongator is developed algebraically from . Once the transfer matrices are determined, the coarse-matrix representations are computed recursively from through sparse triple-matrix multiplication. This describes a -cycle, the simplest complete AMG cycle; refer to Algorithm 1 in the recent paper by Thomas et al. [Thomas2021]

for a complete description. AMG methods achieve optimality (constant work per degree of freedom in

) through error reductions by the smoother and corrections propagated from coarser levels.

### 5.1 Ruge–Stüben Classical AMG

We now give an overview of classical Ruge-Stüben AMG, starting with notation. Interpolation is formulated in terms of matrix operations. Point

is a neighbor of if and only if there is a non-zero element of the matrix . Point strongly influences if and only if

 |aij|≥θmaxk≠i|aik|, (5)

where is the strength of connection threshold, . This strong influence relation is used to select coarse points. These points are retained in the next coarser level, and the remaining fine points are dropped. Let and be the coarse and fine points selected at level , and let be the number of grid points at level (). Then, , , is a matrix, and is an matrix. Here, the coarsening is performed row-wise by interpolating between coarse and fine points. The coarsening generally attempts to fulfill two contradictory criteria. In order to ensure that a chosen interpolation scheme is well-defined and of good quality, some close neighborhood of each fine point must contain a sufficient amount of coarse points to interpolate from. Hence the set of coarse points must be rich enough. However, the set of coarse points should be sufficiently small in order to achieve a reasonable coarsening rate. The interpolation should lead to a reduction of roughly five times the number of non-zeros at each level of the -cycle.

During the setup phase of AMG methods, the multi-level -cycle hierarchy is constructed consisting of linear systems having exponentially decreasing sizes on coarser levels. A strength-of-connection matrix , is typically first computed to indicate directions of algebraic smoothness and applied in the coarsening algorithms. The construction of may be performed efficiently, because each row is computed independently by selecting entries in the corresponding row of with a prescribed threshold value . hypre-BoomerAMG provides the parallel maximal independent set (PMIS) coarsening [DeSterck2006], which is modified from Luby’s algorithm [Luby1986] for finding maximal independent sets using random numbers.

Interpolation operators in AMG transfer residual errors between adjacent levels. There are a variety of interpolation schemes available. Direct interpolation [StubenAMG] is straightforward to implement in parallel because the interpolatory set of a fine point is just a subset of the neighbors of , and thus the interpolation weights can be determined solely by the -th equation. A bootstrap AMG (BAMG) [Manteuffel2010] variant of direct interpolation is generally found to be better than the original formula. The weights are computed by solving the local optimization problem

 min∥aiiwTi+ai,Csi∥2s.t.wTifCsi=fi,

where is a vector that contains , and denotes strong C-neighbors of and is a target vector that needs to be interpolated exactly. For elliptic problems where the near null-space is spanned by constant vectors, i.e., , the closed-form solution of (5.1) is given by

 wij=−aij+βi/nCsiaii+∑k∈Nwiaik,βi=∑k∈{fi∪Cwi}aik , (6)

where denotes the number of points in , the weak C-neighbors of , the F-neighbors, and the weak neighbors. A known issue of PMIS coarsening is that it can result in F-points without C-neighbors [DeSterck2008]. In such situations, distance-one interpolation algorithms often work well, whereas interpolation operators that can reach C-points at a larger range, such as the extended interpolation [DeSterck2008], can generally yield much better convergence. However, implementing extended interpolation is more complicated because the sparsity pattern of the interpolation operator cannot be determined a priori, which requires dynamically combining C-points in a distance-2 neighborhood. With minor modifications to the original form, it turns out that the extended interpolation operator can be rewritten by using standard sparse matrix computations such as matrix-matrix (M-M) multiplication and diagonal scaling with certain - and -sub-matrices. The coarse-fine C-F splitting of the coarse matrix is given by

 A=[AFFAFCACFACC]

where is assumed to be decomposed into , the diagonal, the strong part and weak part respectively, and , , and are the corresponding sub-matrices of and .

The full prolongation operator is given by

 P=[WI]

The extended “MM-ext” interpolation now takes the form

 W =−[(DFF+Dγ)−1(AsFF+Dβ)][D−1βAsFC]

with

 Dβ=diag(AsFC1C)Dγ=diag(AwFF1F+AwFC1C),

This formulation allows simple and efficient implementations. Similar approaches that are referred to as “MM-ext+i” modified from the original extended+i algorithm [DeSterck2008] and “MM-ext+e” are also available in BoomerAMG. See [Li2020] for details on the class of M-M based interpolation operators.

Aggressive coarsening reduces the grid and operator complexities of the AMG hierarchy, where a second coarsening is applied to the C-points obtained from the first coarsening to produce a smaller set of final C-points. The A-1 aggressive coarsening strategy described in [StubenAMG] is employed in our studies. The second PMIS coarsening is performed with the block of that has nonzero entry if is connected to with at least a path of length less than or equal to two. Aggressive coarsening is usually used with polynomial interpolation [Yang2010] which computes a second-stage interpolation matrix and combined with the first-stage as . The aforementioned MM-based interpolation is also available for the second stage. Finally, Galerkin triple-matrix products are used to build coarse-level matrices involving the prolongation and restriction operators. Falgout et al. [Falgout2020] provides the additional details omitted here on the algorithms employed in hypre for computing distributed and parallel sparse matrix-matrix M–M multiplications.

### 5.2 Polynomial Gauss–Seidel Smoother

To solve a linear system , the Gauss-Seidel (GS) iteration is based upon the matrix splitting , where and are the strictly lower and upper triangular parts of the matrix , respectively. Then, the traditional GS updates the solution based on the following recurrence,

 x(k+1):=x(k)+M−1r(k),k=0,1,2,… (7)

where , , and , or , for the forward or backward sweeps, respectively. In the following, is the –th Gauss-Seidel iterate. To avoid explicitly forming the matrix inverse in (7), a sparse-triangular solve is used to apply to the current residual vector .

To improve the parallel scalability, hypre implements a hybrid variant of GS [Baker:2011], where the neighboring processes first exchange the elements of the solution vector on the boundary of the subdomain assigned to an a parallel process (MPI rank), but then each MPI rank independently applies the local relaxation. Furthermore, in hypre, each rank may apply multiple local GS sweeps for each round of the neighborhood communication. With this approach, each local relaxation updates only the local part of the vector (during the local relaxation, the non-local solution elements on the boundary are not kept consistent among the neighboring ranks). This hybrid algorithm is shown to be effective for many problems [Baker:2011].

A polynomial GS relaxation employs a fixed number of “inner” stationary iterations for approximately solving the triangular system with ,

 ˆx(k+1):=ˆx(k)+ˆM−1(b−Aˆx(k)),k=0,1,2,… (8)

where represents the approximate triangular system solution, i.e., . A Jacobi inner iteration is employed to replace the direct triangular solve. In particular, if denotes the approximate solution from the -th inner iteration at the -th outer GS iteration, then the initial solution is chosen to be the diagonally scaled residual vector,

 g(0)k=D−1r(k), (9)

and the –st Jacobi iteration computes the approximate solution by the recurrence

 g(j+1)k := g(j)k+D−1(r(k)−(L+D)g(j)k) (10) = D−1(r(k)−Lg(j)k). (11)

When “zero” inner sweeps are performed, the polynomial GS recurrence becomes

 ˆx(k+1):=ˆx(k)+g(0)k=ˆx(k)+D−1(b−Aˆx(k)),

and this special case corresponds to Jacobi iteration for the global system, or local system on each MPI rank. When inner iterations are performed, it follows that

 ˆx(k+1) ≈ˆx(k)+(I+D−1L)−1D−1ˆr(k)=ˆx(k)+M−1ˆr(k),

where is approximated by the degree- Neumann expansion and is the residual following the iterated inner solve (e.g., ). Note that is strictly lower triangular and nilpotent so that the Neumann series converges in a finite number of steps.

### 5.3 ILU Smoothers with Scaled Factors and Iterated Triangular Solves

When applying an incomplete factorization as the smoother, careful consideration of algorithms which solve the triangular systems is needed. A direct triangular solver is comparatively more expensive on a GPU than iterations using sparse matrix vector products, and thus Jacobi iteration was proposed in [Anzt2015]. However, in cases when the triangular factors are highly non-normal, the Jacobi iterations may diverge [Anzt2015, Chow2015, Eiermann1993]. Note that triangular matrices are necessarily non-normal [Horn2012], and three methods to mitigate the degree of non-normality are explored: (1) The incomplete factorization with row scaling, (2) reordering to enforce symmetric sparsity patterns, and (3) row and column scaling. For (2), several reordering strategies are considered, and for (3), the Ruiz strategy [Knight2014] is considered. Before discussing both experimental and theoretical analyses for these approaches, some background on (non)-normality is first provided.

#### 5.3.1 Jacobi Iteration and Normality

The Jacobi iteration for solving can be written in the compact form below, with the regular splitting , and . The iteration matrix is defined as and

 x(k+1)=Gx(k)+D−1b (12)

or in the non-compact form, with

 x(k+1)=x(k)+D−1(b−Ax(k)) (13)

where is the diagonal part of . For the triangular systems resulting from the ILU factorization (as opposed to the splitting ), we denote the iteration matrices as and for the lower and upper triangular factors, and respectively. Let and be the diagonal parts of the triangular factors and and denotes the identity matrix. Assume has a unit diagonal, then

 GL = D−1L(DL−L)=I−L (14) GU = D−1U(DU−U)=I−D−1UU (15)

A sufficient condition for the Jacobi iteration to converge is that the spectral radius of the iteration matrix is less than unity, or . is strictly lower and is strictly upper triangular, which implies that the spectral radius of both iteration matrices is zero. Therefore, the Jacobi iteration converges in the asymptotic sense for any triangular system. However, the convergence of the Jacobi method may be affected by the degree of non-normality of the matrix, which is now defined.

A normal matrix satisfies , and so a matrix that is non-normal can be defined in terms of the difference between and to indicate the degree of non-normality. Measures of nonnormality and the behavior of nonnormal matrices, have been extensively considered; see [Ipsen1998, Henrici1962, Elsner1987, Trefethen2005]. In this paper, we use Henrici’s definition of the departure from normality of a matrix

 dep(A)=√∥A∥2F−∥D∥2F, (16)

where

is the diagonal matrix containing the eigenvalues of

[Henrici1962]. By no means is this the only practical metric for describing the degree of nonnormality, however, it is particularly useful in the context of the current paper.

#### 5.3.2 Iterating with an LDU Factorization

Given an incomplete factorization, when either factoring out the matrix from the incomplete , or computing an incomplete factorization, the iteration matrix (15) simplifies to , where is a strictly upper triangular matrix (SUT). In particular, this leads naturally to a Neumann series, where for

 x(k+1) = f−Usf+U2sf−⋯+(−1)kUksf = (I−Us+U2s−⋯+(−1)kUks)f = (I−Us)−1f.

In [Thomas2021], an extensive empirical analysis shows that for certain drop tolerances, and amounts of fill, . Thus, the convergence of the Neumann series is guaranteed. Even for those drop tolerances where does not hold, for moderate , and further goes to fairly quickly. Furthermore, for the (scaled) matrices considered in the current paper, a correlation exists between the maximum value (in magnitude) of and this rate of convergence. Figure 1, compares and for increasing values of for a PeleLM matrix dimension when computing an incomplete factorization. While the number of nonzeros in increases for , the norm rapidly decreases, and by ; see Figure 1. Of course, for strictly upper triangular matrices,

is guaranteed to be the zero matrix when

, however, numerical nilpotence occurs much sooner (i.e., ). Building a theoretical framework to describe the relationship between these metrics is part of our ongoing work, and we refer to [Thomas2021] for more empirical analyses of . In particular, a relationship between larger drop tolerances and smaller fill-in when computing the ILU factorizations is observed when using row scaling.

#### 5.3.3 Matrix Re-Orderings and Non-Normality

Richardson iteration can be interpreted in a nonlinear sense as a continuous in space and discrete time scheme for a heat-equation that is integrated to steady state. The Jacobi iteration can also be considered as a time integrator and the tools of nonlinear stability analysis can be applied. In particular, the recent work of Asllani et al. [Asllani2021] establishes the connection between non-normality, pseudo-spectra and the asymptotic convergence behaviour of such nonlinear schemes.

For the purposes of iteratively solving triangular systems, one might reasonably presuppose that matrix re-orderings have an effect on the departure from normality of the and factors. In particular, re-ordering can minimize the amount of fill-in and thereby reduce the and . Furthermore, symmetric sparsity patterns of the factors would lead to a similar number of Jacobi iterations for the triangular solves. Indeed, this is observed empirically and in the case of non-symmetric matrices, the structural symmetry resulting from the approximate minimum degree (AMD) re-ordering in combination with a row-scaled factorization leads to a small departure from normality and a rapidly converging Jacobi iteration.

Consider the Nalu-Wind pressure continuity equation matrix of dimension Million. Several different matrix re-orderings are applied to , followed by an incomplete factorization. Then the Jacobi iteration is applied to the approximate linear system where the right-hand side is obtained from the extracted pressure system. The relative residual for each of the triangular solves is reported in Table 1. The matrix re-orderings consisted of the symmetric reverse Cuthill-McKee (‘symrcm’) [GeorgeLiu1981], approximate minimum degree or AMD (‘amd’) [Amestoy1996], symmetric AMD (‘symamd’), and nested dissection (‘dissect’) [George1973] in Matlab combined with an ILU factorization and five (5) Jacobi iterations for the systems and .

Clearly, the AMD re-orderings lead to the lowest relative residuals and thus are candidates to be evaluated for a GMRES+AMG solver for the slightly non-symmetric Nalu-Wind pressure systems. Table 2 displays the departure from normality and after re-ordering the coefficient matrix followed by row scaling using the ILU factorization and row scaled .

The departure from normality without re-ordering or scaling is e. There is a clear correlation between the for each of the orderings in Table 2 and the size of the relative residuals in Table 1. Thus, a lower value for Henrici’s departure from normality is a good predictor of the rapid convergence of the iterative triangular solves.

#### 5.3.4 Iterating with an LU Factorization using Ruiz Scaling

Next, consider the application of the Ruiz strategy [Knight2014] directly to the problematic and factors themselves. In cases where only an

factorization is available, this strategy is appealing. The objective is to reduce the condition number of an ill-conditioned matrix, and it is also shown here to be effective in reducing

. A rigorous experimental analysis of this method is given in [Thomas2021] and in the present paper these results are augmented with theoretical bounds.

The Ruiz algorithm is an iterative procedure that uses row and column scaling such that the resulting diagonal entries of the matrix are one, and all other elements are less than (or equal to) one [Knight2014]. Algorithm 2 describes the ILUT smoother employing the scaling for non-normal triangular factors, and the practical implementation of this approach is described in more detail in [Thomas2021]. 444Note that row scaling can be used in place of Ruiz scaling by instead constructing , and letting . It is provided here for ease of reference. In Section 6 and for the applications considered in this paper, only the factor is scaled. However, this algorithm may also be extended to scale the factor in cases when both and are highly non-normal.

The row and column scaling algorithm results in a transformed linear system that now takes the form

 LDrUDcx=b

and the Jacobi iterations are applied to the upper triangular matrix . Here, and represent row and column scaling, respectively, of the factor. However, this matrix has a unit diagonal and the iterations can be expressed in terms of a Neumann series. When an or factorization in available, the series can be directly obtained from or as these are strictly upper triangular matrices. The matrix can also be extracted from either of these matrices by row scaling when possible.

When applying scaling to , a matrix results, with the identity matrix of dimension , and strictly upper triangular. Thus, the inverse of such a matrix can be expressed as a Neumann series

 (I+Us)−1=I−Us+U2s−⋯ (18)

Because is upper triangular and nilpotent, the above sum is finite, and the series is also guaranteed to converge when .

A loose upper bound for can be derived and attributed in part to the fill level permitted in the ILU factorization.

###### Theorem 1

Let represent the number of nonzero elements permitted in the and factors of an incomplete LU factorization. Then for scaled such that , with the identity matrix, and a strictly upper triangular matrix,

 ∥Us∥2≤∥Us∥F≤√n(p−1). (19)

###### Proof

The proof follows from taking the Frobenius norm of a strictly upper triangular matrix with a maximum of nonzero elements per row, and the observation that for all .

The bound in Theorem 1 is obviously pessimistic, even when allowing only a small fill level in the ILU factorization. However, it is observed in practice that after scaling, , and this is often true for the ILUT smoothers with for the applications considered in this paper. An increase in the size of is observed as the level of fill increases; see Table 3 for a comparison of with and the theoretical bound in Theorem 1, when varying the fill level.