HIFIR: Hybrid Incomplete Factorization with Iterative Refinement for Preconditioning Ill-conditioned and Singular Systems

06/18/2021 ∙ by Qiao Chen, et al. ∙ Stony Brook University 0

We introduce a software package called HIFIR for preconditioning sparse, unsymmetric, ill-conditioned, and potentially singular systems. HIFIR computes a hybrid incomplete factorization, which combines multilevel incomplete LU factorization with a truncated, rank-revealing QR factorization on the final Schur complement. This novel hybridization is based on the new theory of approximate generalized inverse and ϵ-accuracy. It enables near-optimal preconditioners for consistent systems and enables flexible GMRES to solve inconsistent systems when coupled with iterative refinement. In this paper, we focus on some practical algorithmic and software issues of HIFIR. In particular, we introduce a new inverse-based rook pivoting into ILU, which improves the robustness and the overall efficiency for some ill-conditioned systems by significantly reducing the size of the final Schur complement for some systems. We also describe the software design of HIFIR in terms of its efficient data structures for supporting rook pivoting in a multilevel setting, its template-based generic programming interfaces for mixed-precision real and complex values in C++, and its user-friendly high-level interfaces in MATLAB and Python. We demonstrate the effectiveness of HIFIR for ill-conditioned or singular systems arising from several applications, including the Helmholtz equation, linear elasticity, stationary incompressible Navier–Stokes equations, and time-dependent advection-diffusion equation.

READ FULL TEXT VIEW PDF
POST COMMENT

Comments

There are no comments yet.

Authors

page 12

page 13

page 25

page 31

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1. Introduction

We consider the problem of preconditioning an iterative solver for a linear system,

(1)

where is sparse and potentially singular, , and , or as in many applications, , , and . For generality, we will assume complex-valued systems in our discussions. In general, (1) is inconsistent in that , where denotes the Moore–Penrose pseudoinverse of and denotes the machine epsilon of a given floating-point number system. In this case, we must seek a least-squares solution of (1), i.e.,

(2)

or often preferably the pseudoinverse solution of (1), i.e.,

(3)

or equivalently, . When is large-scale, it is preferable to solve (1) using a Krylov subspace (KSP) method, such as GMRES (saad1986gmres), which seeks a solution in the th KSP

(4)

at the th iteration, where is typically equal to . It is well known that KSP methods can benefit from robust and effective preconditioners for ill-conditioned problems. This work introduces a software package called HIFIR, which delivers robust and computationally efficient preconditioners for singular systems. As a side product, HIFIR also improves the robustness in preconditioning ill-conditioned systems.

Compared to nonsingular systems, preconditioning (nearly) singular systems is a very challenging problem. Several software packages offer fairly robust and easy-to-use preconditioners for nonsingular systems, such as the multilevel ILU (MLILU) in ILUPACK (bollhofer2011ilupack), the supernodal ILU in SuperLU (li2011supernodal), and various parallel preconditioners in PETSc (petsc-user-ref). Conceptually, such software packages construct a preconditioner that approximates in that . Given , a right-preconditioned111

We consider only right preconditioning because left preconditioning alters the computation of residual vector

and in turn may lead to false stagnation or early terminations for ill-conditioned systems (ghai2019comparison). KSP method seeks a solution to

(5)

in , where typically , and then . Ideally, the preconditioned KSP methods would converge significantly faster than the unpreconditioned ones. However, when is (nearly) singular and the system (1) is inconsistent, there is a lack of robust algorithms and software. Some earlier techniques used a CGLS-type KSP method (e.g., (bjorck1996numerical; fong2011lsmr; paige1982lsqr)), which is mathematically equivalent to solving the normal equation using CG (bjorck1996numerical; hestenes1952methods) or MINRES (paige1975solution; fong2011lsmr). Those KSP methods tend to converge slowly due to the squaring of the condition number by the normal equation in the corresponding KSP (jiao2021approximate). More recently, there has been significant interest in preconditioning GMRES-type methods for singular systems or least-squares problems (hayami2010gmres; jiao2021approximate; morikuni2015convergence). For example, the so-called AB-GMRES (hayami2010gmres) solves the system using GMRES with , and then . Here, plays a similar role as , except that may be singular (or rank deficient if is rectangular). Hayami et al. (hayami2010gmres) constructed based on robust incomplete factorization (RIF) of Benzi and Tůma (benzi2003robustpd; benzi2003robust), which was originally developed for CGLS-type methods. Although RIF could accelerate the convergence of AB-GMRES in (hayami2010gmres), it was not robust in general (morikuni2015convergence). In more recent works (gould2017state; morikuni2015convergence), in AB-GMRES is typically chosen to be (or for real matrices), which unfortunately squares the condition number (analogous to CGLS) and in turn can slow down the convergence. This work aims to deliver a right preconditioner that is more efficient and robust than RIF, and more importantly, enables near-optimal convergence rates. We achieve this goal by leveraging the new theory of -accurate approximated generalized inverse (AGI) (jiao2021approximate), as outlined in Section 2.

Our development of HIFIR was based on our earlier software package called HILUCSI (chen2021hilucsi), which was a prototype implementation of an MLILU for nonsingular saddle-point problems. Compared to single-level ILUs (such as ILU() and ILUTP, etc. (saad2003iterative, Chapter 10)), MLILU is generally more robust for nonsingular indefinite systems (ghai2019comparison; chen2021hilucsi). HILUCSI leveraged several techniques in a novel way to achieve superior efficiency and robustness for saddle-point problems than other MLILU libraries (such as ARMS (saad2002arms), ILUPACK (bollhofer2011ilupack), and ILU++ (mayer2007ilu++)). In particular, HILUCSI achieved high efficiency by introducing a scalability-oriented dropping in a dual-thresholding strategy in the fan-in ILU222The technique is also known as the Crout version of ILU (li2003crout) or left-looking (eisenstat1981algorithms), but we adopt the terminology of “fan in” update, which is commonly used in parallel computing (demmel1993parallel) and is more suggestive in terms of its algorithmic behavior. for linear-time complexity in its factorization and solve. It improved robustness by leveraging mixed symmetric and unsymmetric preprocessing techniques at different levels and combining static and dynamic permutations. HIFIR inherits some of these core algorithmic components of HILUCSI, as described in Section 3. However, as an MLILU technique, HILUCSI was not robust for singular systems, for example, when its final level is singular. HIFIR is designed to overcome this deficiency by leveraging a rank-revealing factorization in its final level, introducing iterative refinement to build a variable preconditioner, and introducing a new pivoting strategy in its ILU portion, as we will detail in Section 3.

The main contributions of this work are as follows. First and foremost, we introduce one of the first software libraries to improve the robustness for ill-conditioned and (nearly) singular systems to achieve near machine precision. Our software library, called HIFIR, or Hybrid Incomplete Factorization with Iterative Refinement, computes an AGI (jiao2021approximate) by hybridizing incomplete LU and rank-revealing QR (RRQR) in a multilevel fashion. When used as a right-preconditioner for GMRES, this hybridization enables (near) optimal convergence for consistent or ill-conditioned systems. When fortified with iterative refinement in FGMRES (saad1993flexible)

, HIFIR enables the robust computation of the left null space and the pseudoinverse solution of inconsistent systems. We have implemented HIFIR using template-based objective-oriented programming in C++. For user-friendliness, the C++ HIFIR library is header-only, with easy-to-use high-level interfaces in MATLAB and Python. The software is open-source and has been made available at

https://github.com/hifirworks/hifir. Second, this work also introduces a novel inverse-based rook pivoting (IBRP) in the fan-in ILU. We describe efficient data structures for the efficient implementation of IBRP and show that this new pivoting strategy improves the robustness and efficiency for some challenging singular systems. Third, HIFIR offers some advanced features, such as the support of complex arithmetic, the ability to precondition both and using the same factorization, and the ability to multiply by an AGI of the preconditioning operator. These features enable the use of HIFIR as building blocks for advanced preconditioners, such as (parallel) block preconditioners. In addition, HIFIR supports mixed precision (e.g., double precision combined with single or potentially half precision) for the input matrix and the preconditioner, which is beneficial for heterogeneous hardware platforms and limited-memory settings.

The remainder of this paper is organized as follows. In Section 2, we give an overview of the theoretical foundation of HIFIR, including optimality conditions, treatment for singular systems, etc. In Section 3, we describe the algorithmic components of HIF and highlight some implementation details. Section 4 describes how to apply HIF as a preconditioner, including iterative refinement. In Section 5, we introduce the application programming interfaces of HIFIR in C++, MATLAB, and Python with example implementations. Section 6 demonstrates HIFIR for some large-scale applications with indefinite ill-conditioned and singular inconsistent systems. Finally, Section 7 concludes the paper with a discussion on future directions. For completeness, we present the details of our data structures in Appendix A and the complexity analysis of IBRP in Appendix B.

2. Theoretical foundations

In this section, we give an overview of the theoretical foundations of HIFIR. Most of the theory was based on that in (jiao2021approximate), except that we generalize the results from real matrices to complex ones. We present some of the most relevant theoretical results for completeness, but we omit the proofs because they follow the same arguments as those in (jiao2021approximate).

2.1. Mathematically optimal right-preconditioning operators for consistent systems

Let us first consider the issue of optimal preconditioning for a consistent system (1), where is in the range of (i.e.,

). In floating-point arithmetic, the convergence rate of a KSP method for such systems depends on the following generalized notion of condition numbers.

Definition 1 ().

Given a potentially singular matrix , the 2-norm condition number of

is the ratio between the largest and the smallest nonzero singular values of

, i.e., , where .

To accelerate a KSP method for such a system, we solve a right-preconditioned system

(6)

using a KSP method, and then . We refer to as a right preconditioning operator (RPO). Ideally, we would like . For nonsingular systems, is equivalent to in (5); for singular systems, generalizes . The symbol signifies that it is based on a generalized inverse.

Definition 2 ().

(rao1972generalized, Definitions 2.2) Given a potentially rank-deficient , is a generalized inverse of if and only if .

It is worth noting that the Moore–Penrose pseudoinverse is a special case of generalized inverses. Although it might be tempting to construct the RPO to approximate , the pseudoinverse is overly restrictive. The following two properties of generalized inverses make them particularly relevant to right-preconditioning singular systems.

Proposition 1.

(jiao2021approximate, Proposition 3.4) If is a generalized inverse of , then

is diagonalizable, and its eigenvalues are all zeros and ones. In other words, there exists a nonsingular matrix

, such that , where is the identity matrix with . Conversely, if there exists a nonsingular such that for , then is a generalized inverse of .

Proposition 2.

(jiao2021approximate, Proposition A.1) Given of rank and a generalized inverse with , the condition number of is bounded by , i.e., .

From Proposition 1, it is easy to show that any generalized inverse (or a nonzero scalar multiple of

) enables a mathematically optimal RPO for consistent systems in the following sense.

Theorem 1.

(jiao2021approximate, Theorem 3.6) Given and a generalized inverse , then GMRES with RPO with converges to a least-squares solution of (1) after one iteration for all and . Conversely, if GMRES with RPO converges to a least-squares solution in one iteration for all and , then is a scalar multiple of a generalized inverse of .

A corollary of Theorem 1 is that if , then the computed is the pseudoinverse solution of (1). Let denote a projection onto . Given any , it is easy to show that is also a generalized inverse, and the computed with as the RPO is then the pseudoinverse solution. Theorem 1 assumes exact arithmetic. With rounding errors, the condition number of must be bounded by a small constant, which holds in general if is bounded due to Proposition 2.

Although the above theory may seem abstract, it suggests a new approach for constructing RPO based on generalized inverses. In particular, one option to construct is to hybridize multilevel ILU with a rank-revealing decomposition on its final Schur complement. As an illustration, consider the following example that combines LU factorization without pivoting with QR factorization with column pivoting (QRCP) (Golub13MC).

Example 0.

Given , after steps of Gaussian elimination,

(7)

where is the Schur complement. Clearly, . Let the QRCP of be

(8)

where is unitary and for (i.e., the (numerical) rank of ). Let and be composed of the first columns of and , respectively. Then, is a generalized inverse of with . Furthermore,

(9)

is a generalized inverse of , with for as in Proposition 1.

We refer to the preceding construction of as a hybrid factorization. It enables a more efficient approach to construct an optimal RPO, for example, compared to applying QRCP to if . For the strategy to be successful, we must address some practical issues. First, we need to allow droppings in the factorization to reduce computational cost and memory requirement, especially for larger-scale systems. Second, we need to control to limit , for example, by leveraging pivoting and equilibration (duff2001algorithms). Third, it is desirable to make as small as possible before we apply QRCP. Hereafter, we will address the first two issues from a theoretical perspective and then address the third issue in Section 3.

2.2. Near-optimal right-preconditioning operators via approximate generalized inverses

Although a mathematically optimal RPO enables the most rapid convergence of KSP methods, the computational cost per iteration may be prohibitively high, so is the memory requirement. In practice, it may be more desirable to construct “near-optimal” RPOs by approximating a generalized inverse. The following definition and theorem establish the guideline for constructing such approximations.

Definition 3 ().

(jiao2021approximate, Definition 3.8) Given , is an -accurate approximate generalized inverse (AGI) if there exists such that

(10)

where is identity matrix with . A class of AGI is -accurate if tends to 0 as its control parameters are tightened. is a stable AGI if for some .

Theorem 2.

(jiao2021approximate, Theorem 3.9) GMRES with an -accurate AGI of converges to a least-squares solution of (1) in exact arithmetic for all consistent systems (i.e., ) with any initial guess .

A corollary of Theorem 2 is that given an -accurate AGI , is also an -accurate AGI, and the computed with as the RPO is the pseudoinverse solution. Mathematically, it is equivalent to compute with as the RPO and then project onto to obtain the pseudoinverse solution.

Remark 1.

In the literature, commonly used measures of accuracy and stability of a preconditioner for a nonsingular matrix were and , respectively; see, e.g., (benzi2002preconditioning). Our new definitions of accuracy and stability in Definition 3 are more general in that they apply to singular systems. In addition, they are more rigorous in that they are based on Theorem 2 and Proposition 2, respectively, instead of based on empirical evidence (benzi2002preconditioning).

Theorem 2 and Example 2.1 suggest that we can construct AGIs by replacing the LU factorization in hybrid factorization with some ILU variants. We will refer to the combination of ILU with a rank-revealing factorization on the Schur complement as a hybrid incomplete factorization (HIF). From the perspective of AGI, a good candidate ILU should satisfy three critical criteria. First, the ILU needs to have prudent dropping strategies to make the approximation as accurate as possible. Second, we must be able to control effectively for stability. Third, the computational cost and storage requirement should ideally scale linearly (or near-linearly) with respect to the input size. In the ILU literature (chow1997experimental; saad2003iterative; bollhofer2006multilevel), there had been significant attention to the first criterion. However, the second criterion excludes simple ILU techniques without pivoting, such as ILU() (saad2003iterative). The third criterion excludes ILU with relatively simple pivoting strategies, such as ILUTP (chow1997experimental; saad2003iterative) and its supernodal variants (li2011supernodal), which suffer from superlinear complexity (ghai2019comparison; chen2021hilucsi).

Although the second and third criteria may appear self-contradicting for traditional ILU techniques, they can be met by a well-designed MLILU technique. Before delving into the details of MLILU algorithms, let us briefly review MLILU and more importantly, show that MLILU can be used to construct accurate and stable AGIs. First, consider a two-level ILU (or more precisely, ILDU) of ,

(11)

where is an ILDU of the leading block, , , and

(12)

is the Schur complement; and are row and column permutation matrices, respectively; and correspond to row and column scaling diagonal matrices, respectively. The Schur complement can be factorized recursively using the same ILU technique, leading to an -level ILU preconditioners, namely,

(13)

where for in (11) at the th level (similarly for and all other permutation and scaling matrices), is composed of the “union” of in (11) for all levels, and is the final Schur complement. As in Example 2.1, we apply QRCP to to obtain , where is unitary and with , along its diagonal, and . We define an RPO as

(14)

where , , and are as in (13). We note the following fact.

Proposition 3.

(jiao2021approximate, Lemma 4.3) If no dropping is applied in MLILU, then in (14) is a generalized inverse of with for as in Proposition 1.

Since without dropping is an optimal RPO for consistent systems due to Theorem 1, we claim that an -accurate and stable constitutes a near-optimal RPO for consistent systems. The near optimality requires sufficient small droppings and numerical stability, or more precisely, should be close to for and must be controlled by the algorithm. To this end, we utilize an MLILU technique called HILUCSI, which stands for Hierarchical Incomplete LU-Crout with Scalability-oriented and Inverse-based droppings (chen2021hilucsi). HILUCSI leverages several techniques, including fan-in ILU (li2003crout), equilibration (duff2001algorithms), static and dynamic pivoting across different levels (chen2021hilucsi; bollhofer2006multilevel), etc., to meet the accuracy and stability requirements. As the name suggests, HILUCSI focuses on scalability in terms of problem sizes, and it has linear time complexity in each level for both the factorization and solve stages (chen2021hilucsi). We defer the detailed description of these algorithmic components to Section 3.

Remark 2.

Besides HILUCSI, there were several MLILU software packages, such as ARMS (saad2002arms), ILU++ (mayer2006alternative), ILUPACK (bollhofer2011ilupack), etc. Proposition 3 can also be applied to improve those packages to solve singular systems by applying QRCP to the final Schur complement. To harness this benefit, however, one must also extend them (especially ARMS and ILU++) to ensure the stability of the factor.

Finally, we note that it is sometimes needed to reuse HIF to construct preconditioners for both and . To achieve, we note the following property:

Proposition 4.

(jiao2021approximate, Proposition 3.5) If is a generalized inverse of , then is a generalized inverse of .

Conceptually, we can extend (11) to construct a preconditioner for as

(15)

2.3. Variable preconditioning via iterative refinement for null-space computation

The preceding discussions focused on consistent systems. For inconsistent systems, i.e., , an AGI (or even a generalized inverse333When , then right-preconditioned GMRES converges to a weighted-least-squares (WLS) solution for inconsistent systems, instead of least-squares solution (jiao2021approximate, Theorem 3.7). We cannot convert the WLS solution into a pseudoinverse solution by projecting it onto .) cannot guarantee the convergence of GMRES due to the following fact.

Theorem 3.

(jiao2021approximate, Theorem 2.4) GMRES with RPO does not break down until finding a least-squares solution of (1) for all and if and only if is range symmetric (i.e., ) and . Furthermore, is the pseudoinverse solution if .

Remark 3.

The main challenge posed by Theorem 3 is that it is difficult, if not impossible, to ensure the range symmetry of for an approximate generalized inverse . The requirement of range symmetry is the primary reason why is often used as in AB-GMRES (gould2017state; morikuni2015convergence). It is also a key factor for the prevalence of CGLS-type KSP methods (bjorck1996numerical; fong2011lsmr; paige1982lsqr) for solving singular and least-squares problems. Although such methods can be accelerated by applying some preconditioners, such as incomplete QR (jennings1984incomplete; saad1988preconditioning) or RIF (benzi2003robustpd; benzi2003robust), it is difficult for these preconditioners to overcome the slowdown caused by the squaring of the condition number by the normal equation.

Fortunately, this issue can be resolved by using flexible GMRES with variable preconditioners, as formalized by the following definition and theorem.

Definition 4 ().

(jiao2021approximate, Definition 2.1) Given a matrix , an initial vector , and variable preconditioners , the th flexible Krylov subspace (FKSP) associated with , , and is

(16)

where and for . The flexible Krylov matrix, denoted by , is composed of the basis vectors in (16).

The orthogonality of with can be enforced using a generalization of Arnoldi iterations as in (saad1993flexible).

Theorem 4.

(jiao2021approximate, Theorem 2.5) If FGMRES with variable preconditioners does not break down until step , where for a specific and for , respectively. If , then it finds a least-squares solution of (1) for an initial guess .

One effective approach to construct variable preconditioners is to introduce iterative refinement (IR) in HIF, leading to HIFIR. Specifically, given an -accurate AGI and an initial vector (typically, ), we refine the solution of iteratively by obtaining for as

(17)

Eq. (17) defines the th RPO in , and different may use different numbers of IR iterations. Note that (17) in general does not converge by a standalone iterative solver in that when is a generalized inverse, and Eq. (17) cannot introduce additional nonlinearity into the variable preconditioner and in turn undermine the robustness of FGMRES due to Theorem 4. Hence, we shift our attention to its use as a variable preconditioner in the context of computing null-space vectors. In particular, we apply the technique to compute the left null space of (i.e., ), which allows us to convert an inconsistent system into a consistent one. Furthermore, by applying the same technique to compute the right null space of (i.e., ), we can convert a least-squares solution from the consistent system into the pseudoinverse solution (jiao2021approximate).

3. Recursive construction of hybrid incomplete factorization

In this section, we describe the overall algorithm for constructing HIF. Similar to that of HILUCSI (chen2021hilucsi) and other MLILU algorithms, HIF is a recursive algorithm. As shown in Figure 1, at each level, HIF takes an input matrix . HIF first performs symmetric or unsymmetric preprocessing depending on whether is (nearly) pattern symmetric, i.e., ; see Section 3.4. The process leads to , where , , , and are obtained from preprocessing. If is nearly pattern symmetric, the preprocessing step may involve static deferring, which splits into a 2-by-2 block matrix, i.e., . Depending on whether the leading block is Hermitian, we then perform incomplete or factorizations, respectively, where and are unit lower and upper triangular matrices, respectively (i.e., their diagonal entries are ones). We compute these factorizations using fan-in updates (aka the Crout version of ILU), which update the th column of using columns 1 through and update the th row of using rows 1 through at the th step, respectively. For stability, we combine fan-in updates with dynamic deferring and scalability-oriented droppings; see Section 3.1. In addition, HIF enables rook pivoting when the previous level had too many deferrals as indicated by the Boolean tag ibrp in Figure 1; see Section 3.2. The dynamic deferring may permute some rows and columns in after and to obtain a new 2-by-2 block structure , where and are the leading rows and blocks in and , respectively. We then compute the Schur complement corresponding to and factorize either directly using RRQR or recursively using HIF, depending on whether is sufficiently small or nearly dense; see Section 3.3. Some of the components above are the same as those in HILUCSI (chen2021hilucsi), the predecessor of HIF. The key differences between HIF and HILUCSI are that 1) HIF introduces a variant of the rook pivoting (poole2000rook) to improve the stability of ILU and in turn, reduces the size of the final Schur complement and 2) HIF uses a rank-revealing QR factorization (chan1987rank) on . In the following, we first focus on these two aspects and then briefly outline the other components. We will describe how to apply HIF as a preconditioner in Section 4.

Figure 1. Overall control flow of hybrid incomplete factorization.

3.1. Incomplete with dynamic deferring and scalability-oriented and inverse-based droppings

The core of HIF at each level is incomplete factorization, or its variant of incomplete factorization. For robustness, we leverage the fan-in ILU, scalability-oriented dropping, inverse-based dropping, and dynamic deferring, as shown in Algorithm 1. Hereafter, we outline the dual thresholding and dynamic deferring steps in fan-in ILU. An optional step in Algorithm 1 is the inverse-based rook pivoting activated by the Boolean flag ibrp. We will dedicate Section 3.2 to this new pivoting strategy.

In Algorithm 1, the input matrix is obtained from applying preprocessing techniques on either the input matrix or the Schur complement from the previous level. Note that in the actual implementation, we do not form explicitly; instead, we rescale the entries in by and in a “just-in-time” fashion. The procedure ilu_factorize also takes and are as input, which were obtained from preprocessing along with and (see Figure 1 and Section 3.4). The main loop in Algorithm 1 factorizes , the leading block of , using fan-in ILU similar to those in (li2003crout; li2005crout), which delays the computation of the Schur complement as late as possible. Unlike (li2003crout; li2005crout), however, ilu_factorize dynamically permutes (aka defers) rows and columns in to the end of and . The procedure computes

(18)

where the first and second question marks (‘’) in (18) correspond to and , which we will describe their computations in Section 3.3. Algorithm 1 returns , , , , , , and , from which and (along with ) can be computed.

A noteworthy feature of ilu_factorize is its scalability-oriented dropping, which differs from the dropping strategies in (li2003crout; li2005crout) and (bollhofer2011ilupack). Consider the step of ILU at a particular level. Using the MATLAB’s colon notation, let and denote the th column and row of and , respectively. Our dropping strategy limits and to be proportional to numbers of nonzeros in the corresponding column and row of the original (i.e., the top level instead of the present level) input matrix, respectively; see line 18. This dropping strategy plays an important role for HILUCSI and HIF to achieve (near) linear complexity in both space and time. Besides this symbolic dropping, HILUCSI and HIF also adopted an inverse-based dropping (bollhofer2006multilevel), which drops every entry such that , where and are user-specified thresholds for the upper bound of and the drop tolerance, respectively (see Section 5.2). The dropping for is similar; see line 19.

inputs:

: input scaled matrix of size (i.e., , passed in as , , and separately)

: row and column permutation vectors of after preprocessing, respectively

: the dimension of the current leading block (i.e., )

params: , , , , , (adapted for present level), max_steps (for rook pivoting)

level: current level

ibrp: Boolean tag for enabling inverse-based rook pivoting

: number of nonzeros per row and column entry of the original user input matrix, respectively

outputs:

: approximate factors of the leading block

: off-diagonal blocks of (11)

: updated row and column permutation vectors, respectively

: updated leading block dimension

1:  ; ;
2:  for  to  do
3:     if ibrp then
4:        {perform inverse-based root pivoting, typically only if }
5:     else
6:        {fan-in update of diagonal entry}
7:     end if
8:     ;

{estimate inverse norms of current

and factors}
9:     while  or  do
10:        permute entries , and to the end; update and accordingly
11:        break if
12:        ; {update inverse norms}
13:        {recompute diagonal entry due to pivoting}
14:        {decrease leading block for factorization}
15:     end while
16:     {fan-in update th column in }
17:     {fan-in update th row in }
18:     select largest and entries in and , respectively {scalability-oriented dropping}
19:     drop entries and {inverse-based dropping}
20:  end for
21:  ; ; ; ;
22:  return  , , , , , , ,
Algorithm 1 ilu_factorize

Another core component in ilu_factorize is the inverse-based dynamic deferring. In particular, during the fan-in ILU, we dynamically defer and (line 10) if we encounter small or large and (line 9). Figure 2 illustrates the process of dynamic deferring. This deferring is similar to that in (bollhofer2006multilevel; bollhofer2011ilupack); in Section 3.2, we will extend it to support rook pivoting. Note that the inverse-based dropping in line 19 can be replaced by a different dropping strategy, such as that in (mayer2006alternative), but we utilized the inverse-based dropping since we are already estimating the inverse norms for deferring.

Figure 2. Illustration of dynamic deferring. and are shorthand for and , respectively. When encountering ill-conditioned factors ( or ) due to or (green regions in left panel), we dynamically defer both and to next level (right panel).

We note some implementation details. First, in line 18, we use quickselect (hoare1961algorithm), which has expected linear time complexity. Second, since we need to access both rows and columns of while computing the fan-in updates (lines 16 and 17), we need to store (or more precisely, ) in both row and column majors. We will describe the data structures in Section 3.5 and Appendix A. Third, the output and will only be used to compute the Schur complement in Section 3.3. Afterwards, and are discarded, since they can be reconstructed from and , respectively as in (11).

3.2. Inverse-based rook pivoting for coarse levels

Dynamic deferring symmetrically permutes rows and columns. Such a permutation strategy works well for reasonably well-conditioned matrices. However, we observe that symmetric permutations alone sometimes lead to relatively large Schur complements for highly ill-conditioned unsymmetric systems. To overcome this issue, we introduce an inverse-based rook pivoting (IBRP) for the fan-in ILU, by adapting the rook pivoting (poole2000rook) for complete LU factorization.

In the standard rook pivoting (poole2000rook), a pivot is found by searching in the row and column in alternating order until its magnitude is no smaller than those of all other entries in the row and column within the Schur complement. This strategy has a comparable cost as partial pivoting for dense matrices but enables superior stability. However, in the context of fan-in ILU, only the th row and column of the Schur complement are updated at the th step; the remaining part of the Schur complement is not available. Hence, we must modify the pivoting procedure to interleave the search with dynamic permutation and “just-in-time” fan-in updates. Figure 3 illustrates one step of the IBRP. Note that this permutation is more general than the dynamic deferring in Figure 2, in that it can exchange with a row in the middle of , and similarly for the rows in . Furthermore, the row and column interchanges are not symmetric in general. Note that IBRP requires a more sophisticated data structure to support the row and column interchanges, which we will address in Section 3.5.

Figure 3. Illustration of inverse-based rook pivoting in HIF. At the upper-left panel, we update and nonzeros in , find a pivot, and then interchange with the pivot row (indicated by curved arrow). The lower-left panel shows the process for .

Besides the difference dictated by the fan-in update, there are two other significant differences between IBRP and the standard rook pivoting. First, we do not simply use the magnitude to determine the pivot, since it may conflict with dynamic deferring. Instead, we add an inverse-based constraint when searching the pivot, so that the pivot row in and pivot column in would not arbitrarily enlarge the condition numbers of the and , respectively. Second, we do not locate the optimal rook pivot whose magnitude is the largest among its row and column; instead, we impose a maximum number of steps of IBRP, controlled by the parameter max_steps. For completeness, Algorithm 2 details the inverse-based rook pivoting. Note that we do not scale and by in lines 2 and 9 in Algorithm 2, compared to and in lines 16 and 17 in Algorithm 1. This omission of scaling is for efficiency purposes, because the choice of pivot does not depend on the scaling of the entries.

inputs:

: input scaled matrix (i.e., , passed in as , , and separately)

: step count in ILU factorization

: row and column permutation vectors of , respectively

: and factors at step , i.e, and , respectively

: diagonal entries at step

: inverse-norm threshold

: leading block dimension

max_steps: maximum number of rook pivoting steps

outputs:

: updated row and column permutation vectors, respectively

: updated and factors, respectively

: diagonal entries with updated

1:  for  to max_steps do
2:     {fan-in update of and without scaling by diagonal}
3:      {find pivot in }
4:     if  and  then
5:        ; ; {perform row interchange}
6:     else if  then
7:        ; break{extract from }
8:     end if
9:     {fan-in update of and without scaling by diagonal}
10:      {find pivot in }
11:     if  and  then
12:        ; ; {perform column interchange}
13:     else
14:        ; break{extract from }
15:     end if
16:  end for
17:  return  
Algorithm 2 ib_rook_pivot

We note an important practical issue. The IBRP can result in significantly denser and factors because the pivoting may undo the effects of the fill-reduction reordering in the preprocessing step. Hence, we enable IBRP only on the coarser levels (typically for ) when dynamic deferring is found to be ineffective. At the coarser levels, we also enlarge the scalability-oriented fill factors and to preserve more fills introduced by IBRP. Our experiments show that applying IBRP on coarser levels can significantly reduce the size of the final Schur complement for some challenging problems, as demonstrated by the example in Table 1.

nnz w/ IBRP in HIF fac. #levels nnz GMRES
time ratio iter.
76,480 329,762 yes 1,407 50 2.51 5 12.0 15
no 9,394 48 226 2 304 2
Table 1. Effectiveness of IBRP for reducing the final Schur complement size for the testing matrix shyy161 from the SuiteSparse Matrix Collection (davis2011university). indicates the dimension of the null space, and denotes the final Schur complement in HIF. We factorized the testing matrix using HIF with and without IBRP, and we report the factorization time (factor time) and nnz ratio (i.e., , aka fill ratio in (li2011supernodal)). We solved the consistent system with right-hand side (where ) using GMRES(30) with relative residual tolerance . The computing environment for running the test can be found in Section 6. Timing results for GMRES were negligible thus omitted.

3.3. Computing and factorizing Schur complements

After finishing ilu_factorize, we need to compute the Schur complement based on (12). This step involves a sparse matrix-matrix (SpMM) multiplication, for which we adopt the algorithm as described in (bank1993sparse). The space and time complexity of SpMM depends on the nonzeros in and . To achieve near-linear complexity, we apply scalability-oriented dropping before SpMM to the rows and columns of and , respectively. Recall that in ilu_factorize we already applied scalability-oriented dropping to the columns and rows of and , respectively. Hence, the nonzeros both rows and columns in and in are well controlled, allowing us to effectively bound the complexity of . In contrast, if we applied dropping after SpMM, the complexity of SpMM may be higher. Note that after computing , we drop both and as they can be reconstructed from the other terms as in (11), and then factorize recursively.

In HIF, the final Schur complement is factorized by rank-revealing QR (or truncated QRCP) in order to guarantee the stability of the preconditioner. Algorithm 3 outlines the procedure. We note a couple of details in the algorithm. First, line 1 computes QRCP for , where is a permutation matrix, and . The QRCP has a time complexity of . Second, lines 28 determine the numerical rank of by comparing the estimated condition number against a threshold , which defaults to . We estimate the -norm condition number using the incremental estimator in (bischof1990incremental), which has a linear complexity per step. Hence, the overall computational cost is dominated by QRCP. For efficiency, we implement the QRCP and condition-number estimator using LAPACK kernel functions xGEQP3 and xLAIC1, respectively.

inputs:

: final Schur complement of size

: condition number threshold for determining numerical rank (default value is )

outputs:

: and factors of

: permutation vector in QRCP

: numerical rank of

1:  factorize {compute QRCP via xGEQP3}
2:  for  to  do
3:     update estimated and {incremental condition number estimation via xLAIC1}
4:     if  then
5:        
6:        break
7:     end if
8:  end for
9:  return  , , ,
Algorithm 3 rrqr_final_schur

Due to the cubic time complexity of QRCP, we would like to make as small as possible, and ideally have , where is the number of unknowns in the original system. This high complexity was the motivation to use IBRP within multilevel ILU. In addition, we would also like to prevent having too many low-quality ILU levels. As a tradeoff, we trigger QRCP based