## 1. Introduction

We consider the problem of preconditioning an iterative solver for a linear system,

(1) $\mathbf{A}\boldsymbol{x}=\boldsymbol{b},$

where $\mathbf{A}\in\mathbb{C}^{n\times n}$ is sparse and potentially
singular, $\boldsymbol{x}\in\mathbb{C}^{n}$, and $\boldsymbol{b}\in\mathbb{C}^{n}$,
or as in many applications, $\mathbf{A}\in\mathbb{R}^{n\times n}$,
$\boldsymbol{x}\in\mathbb{R}^{n}$, and $\boldsymbol{b}\in\mathbb{R}^{n}$.
For generality, we will assume complex-valued systems in our discussions.
In general, (1) is *inconsistent* in that
$\boldsymbol{b}\notin\mathcal{R}(\mathbf{A})$, i.e.,
$\|(\mathbf{I}-\mathbf{A}\mathbf{A}^{+})\boldsymbol{b}\|\gg\epsilon_{\text{mach}}\|\boldsymbol{b}\|$,
where $\mathbf{A}^{+}$ denotes the Moore–Penrose pseudoinverse
of $\mathbf{A}$ and $\epsilon_{\text{mach}}$ denotes the machine
epsilon of a given floating-point number system. In this case, we
must seek a *least-squares solution* of (1),
i.e.,

(2) $\min_{\boldsymbol{x}\in\mathbb{C}^{n}}\|\boldsymbol{b}-\mathbf{A}\boldsymbol{x}\|_{2},$

or often preferably the *pseudoinverse solution* of (1),
i.e.,

(3) $\min\|\boldsymbol{x}\|_{2}\quad\text{subject to}\quad\boldsymbol{x}\in\operatorname*{arg\,min}_{\boldsymbol{z}}\|\boldsymbol{b}-\mathbf{A}\boldsymbol{z}\|_{2},$

or equivalently, $\boldsymbol{x}=\mathbf{A}^{+}\boldsymbol{b}$.
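As a small numeric illustration of the distinction between (2) and (3) (a sketch using NumPy; the matrix here is an arbitrary rank-deficient example, not from our test suite), a singular system has infinitely many least-squares solutions, and the pseudoinverse solution is the one of minimum norm:

```python
import numpy as np

# A rank-deficient 3x3 matrix (rank 2) and an inconsistent right-hand side.
A = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0],
              [7.0, 8.0, 9.0]])
b = np.array([1.0, 0.0, 1.0])

# Pseudoinverse solution x = A^+ b: the minimum-norm least-squares solution.
x_pinv = np.linalg.pinv(A) @ b

# Any least-squares solution differs from x_pinv by a null-space vector of A.
null_vec = np.array([1.0, -2.0, 1.0])   # A @ null_vec == 0
x_other = x_pinv + null_vec

# Both attain the same (minimal) residual norm in the sense of (2) ...
r_pinv = np.linalg.norm(b - A @ x_pinv)
r_other = np.linalg.norm(b - A @ x_other)
assert np.isclose(r_pinv, r_other)

# ... but x_pinv has the smaller 2-norm, as required by (3).
assert np.linalg.norm(x_pinv) < np.linalg.norm(x_other)
```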
When $\mathbf{A}$ is large-scale, it is preferable to solve (1)
using a *Krylov subspace* (*KSP*) method, such as GMRES
(saad1986gmres), which seeks a solution in the $k$th KSP

(4) $\mathcal{K}_{k}(\mathbf{A},\boldsymbol{v})=\operatorname{span}\{\boldsymbol{v},\mathbf{A}\boldsymbol{v},\dots,\mathbf{A}^{k-1}\boldsymbol{v}\}$

at the $k$th iteration, where $\boldsymbol{v}$ is typically equal
to $\boldsymbol{b}-\mathbf{A}\boldsymbol{x}_{0}$ for an initial guess $\boldsymbol{x}_{0}$. It is well known that KSP methods can benefit
from robust and effective preconditioners for ill-conditioned problems.
This work introduces a software package called *HIFIR*, which
delivers robust and computationally efficient preconditioners for
singular systems. As a side product, HIFIR also improves the robustness
in preconditioning ill-conditioned systems.

Compared to nonsingular systems, preconditioning (nearly) singular
systems is a very challenging problem. Several software packages offer
fairly robust and easy-to-use preconditioners for nonsingular systems,
such as the multilevel ILU (MLILU) in ILUPACK (bollhofer2011ilupack),
the supernodal ILU in SuperLU (li2011supernodal), and various
parallel preconditioners in PETSc (petsc-user-ref). Conceptually,
such software packages construct a preconditioner $\mathbf{M}\approx\mathbf{A}$
in that $\mathbf{M}^{-1}\mathbf{A}\approx\mathbf{I}$.
Given $\mathbf{M}$, a right-preconditioned^{1} KSP method seeks a solution to

(5) $\mathbf{A}\mathbf{M}^{-1}\boldsymbol{y}=\boldsymbol{b}$

in $\mathcal{K}_{k}(\mathbf{A}\mathbf{M}^{-1},\boldsymbol{v})$,
where typically $\boldsymbol{v}=\boldsymbol{b}-\mathbf{A}\boldsymbol{x}_{0}$, and then $\boldsymbol{x}=\mathbf{M}^{-1}\boldsymbol{y}$.

^{1}We consider only right preconditioning because left preconditioning alters the computation of the residual vector $\boldsymbol{b}-\mathbf{A}\boldsymbol{x}$ and in turn may lead to false stagnation or early terminations for ill-conditioned systems (ghai2019comparison).
Ideally, the preconditioned KSP methods would converge significantly
faster than the unpreconditioned ones. However, when $\mathbf{A}$
is (nearly) singular and the system (1) is inconsistent,
there is a lack of robust algorithms and software. Some earlier techniques
used a CGLS-type KSP method (e.g., (bjorck1996numerical; fong2011lsmr; paige1982lsqr)),
which is mathematically equivalent to solving the normal equation
$\mathbf{A}^{H}\mathbf{A}\boldsymbol{x}=\mathbf{A}^{H}\boldsymbol{b}$
using CG (bjorck1996numerical; hestenes1952methods) or MINRES
(paige1975solution; fong2011lsmr). Those KSP methods tend to
converge slowly due to the squaring of the condition number by the
normal equation in the corresponding KSP (jiao2021approximate).
More recently, there has been significant interest in preconditioning
GMRES-type methods for singular systems or least-squares problems
(hayami2010gmres; jiao2021approximate; morikuni2015convergence).
For example, the so-called AB-GMRES (hayami2010gmres) solves
the system $\mathbf{A}\mathbf{B}\boldsymbol{y}=\boldsymbol{b}$
using GMRES with $\mathbf{B}\in\mathbb{C}^{n\times m}$,
and then $\boldsymbol{x}=\mathbf{B}\boldsymbol{y}$. Here, $\mathbf{B}$
plays a similar role as $\mathbf{M}^{-1}$, except that $\mathbf{B}$
may be singular (or rank deficient if $\mathbf{A}$ is rectangular).
Hayami et al. (hayami2010gmres) constructed
based on robust incomplete factorization (RIF) of Benzi and Tůma
(benzi2003robustpd; benzi2003robust), which was originally developed
for CGLS-type methods. Although RIF could accelerate the convergence
of AB-GMRES in (hayami2010gmres), it was not robust in general
(morikuni2015convergence). In more recent works (gould2017state; morikuni2015convergence),
$\mathbf{B}$ in AB-GMRES is typically chosen to be $\mathbf{A}^{H}$
(or $\mathbf{A}^{T}$ for real matrices), which unfortunately
squares the condition number (analogous to CGLS) and in turn can slow
down the convergence. This work aims to deliver a right preconditioner
that is more efficient and robust than RIF and, more importantly,
enables near-optimal convergence rates. We achieve this goal by leveraging
the new theory of *$\epsilon$-accurate approximate generalized
inverse* (*AGI*) (jiao2021approximate), as outlined in
Section 2.

Our development of HIFIR was based on our earlier software package
called *HILUCSI* (chen2021hilucsi), which was a prototype
implementation of an MLILU for nonsingular saddle-point problems.
Compared to single-level ILUs (such as ILU($k$) and ILUTP (saad2003iterative, Chapter 10)),
MLILU is generally more robust for nonsingular indefinite systems
(ghai2019comparison; chen2021hilucsi). HILUCSI leveraged several
techniques in a novel way to achieve superior efficiency and robustness
for saddle-point problems than other MLILU libraries (such as ARMS
(saad2002arms), ILUPACK (bollhofer2011ilupack), and ILU++
(mayer2007ilu++)). In particular, HILUCSI achieved high efficiency
by introducing a scalability-oriented dropping in a dual-thresholding
strategy in the fan-in ILU for linear-time complexity in its factorization
and solve. (The fan-in technique is also known as the Crout version of ILU (li2003crout)
or left-looking ILU (eisenstat1981algorithms), but we adopt the
terminology of “fan-in” update, which is commonly used in parallel
computing (demmel1993parallel) and is more suggestive of
its algorithmic behavior.) It improved
robustness by leveraging mixed symmetric and unsymmetric preprocessing
techniques at different levels and combining static and dynamic permutations.
HIFIR inherits some of these core algorithmic components of HILUCSI,
as described in Section 3. However,
as an MLILU technique, HILUCSI was not robust for singular systems,
for example, when its final level is singular. HIFIR is designed to
overcome this deficiency by leveraging a rank-revealing factorization
in its final level, introducing iterative refinement to build a variable
preconditioner, and introducing a new pivoting strategy in its ILU
portion, as we will detail in Section 3.

The main contributions of this work are as follows. First and foremost,
we introduce one of the first software libraries to improve the robustness
of preconditioning ill-conditioned and (nearly) singular systems so that
near machine precision can be achieved. Our software library, called HIFIR, or *Hybrid
Incomplete Factorization with Iterative Refinement*, computes an AGI
(jiao2021approximate) by hybridizing incomplete LU and rank-revealing
QR (RRQR) in a multilevel fashion. When used as a right preconditioner
for GMRES, this hybridization enables (near) optimal convergence for
consistent or ill-conditioned systems. When fortified with iterative
refinement in FGMRES (saad1993flexible), HIFIR enables the robust
computation of the left null space and the pseudoinverse solution of
inconsistent systems. We have implemented HIFIR using template-based
object-oriented programming in C++. For user-friendliness, the C++
HIFIR library is header-only, with easy-to-use high-level interfaces
in MATLAB and Python. The software is open-source and available at
https://github.com/hifirworks/hifir.

Second, this work also introduces a novel inverse-based rook pivoting (IBRP) in the fan-in ILU. We describe data structures for the efficient implementation of IBRP and show that this new pivoting strategy improves the robustness and efficiency for some challenging singular systems. Third, HIFIR offers some advanced features, such as the support of complex arithmetic, the ability to precondition both $\mathbf{A}$ and $\mathbf{A}^{H}$ using the same factorization, and the ability to multiply by an AGI of the preconditioning operator. These features enable the use of HIFIR as a building block for advanced preconditioners, such as (parallel) block preconditioners. In addition, HIFIR supports mixed precision (e.g., double precision combined with single or potentially half precision) for the input matrix and the preconditioner, which is beneficial for heterogeneous hardware platforms and limited-memory settings.

The remainder of this paper is organized as follows. In Section 2, we give an overview of the theoretical foundation of HIFIR, including optimality conditions, treatment for singular systems, etc. In Section 3, we describe the algorithmic components of HIF and highlight some implementation details. Section 4 describes how to apply HIF as a preconditioner, including iterative refinement. In Section 5, we introduce the application programming interfaces of HIFIR in C++, MATLAB, and Python with example implementations. Section 6 demonstrates HIFIR for some large-scale applications with indefinite ill-conditioned and singular inconsistent systems. Finally, Section 7 concludes the paper with a discussion on future directions. For completeness, we present the details of our data structures in Appendix A and the complexity analysis of IBRP in Appendix B.

## 2. Theoretical foundations

In this section, we give an overview of the theoretical foundations of HIFIR. Most of the theory is based on (jiao2021approximate), except that we generalize the results from real matrices to complex ones. We present some of the most relevant theoretical results for completeness, but we omit the proofs because they follow the same arguments as those in (jiao2021approximate).

### 2.1. Mathematically optimal right-preconditioning operators for consistent systems

Let us first consider the issue of optimal preconditioning for a consistent system (1), where $\boldsymbol{b}$ is in the range of $\mathbf{A}$ (i.e., $\boldsymbol{b}\in\mathcal{R}(\mathbf{A})$). In floating-point arithmetic, the convergence rate of a KSP method for such systems depends on the following generalized notion of condition numbers.

###### Definition 1 ().

Given a potentially singular matrix
$\mathbf{A}\in\mathbb{C}^{m\times n}$, the *2-norm condition
number* $\kappa(\mathbf{A})$ of $\mathbf{A}$
is the ratio between the largest and the smallest nonzero singular values of
$\mathbf{A}$, i.e., $\kappa(\mathbf{A})=\sigma_{1}/\sigma_{r}$, where $r=\operatorname{rank}(\mathbf{A})$.

To accelerate a KSP method for such a system, we solve a right-preconditioned system

(6) $\mathbf{A}\mathbf{G}\boldsymbol{y}=\boldsymbol{b}$

using a KSP method, and then $\boldsymbol{x}=\mathbf{G}\boldsymbol{y}$.
We refer to $\mathbf{G}$ as a *right preconditioning operator*
(*RPO*). Ideally, we would like $\kappa(\mathbf{A}\mathbf{G})\approx1$.
For nonsingular systems, $\mathbf{G}$ is equivalent to $\mathbf{M}^{-1}$
in (5); for singular systems, $\mathbf{G}$
generalizes $\mathbf{M}^{-1}$. The symbol $\mathbf{G}$ signifies
that it is based on a *generalized inverse*.

###### Definition 2 ().

(rao1972generalized, Definition 2.2)
Given a potentially rank-deficient $\mathbf{A}\in\mathbb{C}^{m\times n}$,
$\mathbf{G}\in\mathbb{C}^{n\times m}$ is a *generalized inverse* of $\mathbf{A}$
if and only if $\mathbf{A}\mathbf{G}\mathbf{A}=\mathbf{A}$.

It is worth noting that the Moore–Penrose pseudoinverse is a special case of generalized inverses. Although it might be tempting to construct the RPO to approximate $\mathbf{A}^{+}$, the pseudoinverse is overly restrictive. The following two properties of generalized inverses make them particularly relevant to right-preconditioning singular systems.

###### Proposition 1.

(jiao2021approximate, Proposition 3.4) If $\mathbf{G}$ is a generalized inverse of $\mathbf{A}$, then
$\mathbf{A}\mathbf{G}$ is diagonalizable, and its eigenvalues are all zeros and ones. In other words, there exists a nonsingular matrix
$\mathbf{X}$ such that $\mathbf{A}\mathbf{G}=\mathbf{X}\begin{bmatrix}\mathbf{I}_{r}&\\&\mathbf{0}\end{bmatrix}\mathbf{X}^{-1}$, where $\mathbf{I}_{r}$ is the $r\times r$ identity matrix with $r=\operatorname{rank}(\mathbf{A}\mathbf{G})$. Conversely, if there exists a nonsingular $\mathbf{X}$ such that $\mathbf{A}\mathbf{G}=\mathbf{X}\begin{bmatrix}\mathbf{I}_{r}&\\&\mathbf{0}\end{bmatrix}\mathbf{X}^{-1}$ for $r=\operatorname{rank}(\mathbf{A})$, then $\mathbf{G}$ is a generalized inverse of $\mathbf{A}$.

###### Proposition 2.

(jiao2021approximate, Proposition A.1) Given $\mathbf{A}\in\mathbb{C}^{m\times n}$ of rank $r$ and a generalized inverse $\mathbf{G}$ with $\operatorname{rank}(\mathbf{G})=r$, the condition number of $\mathbf{A}\mathbf{G}$ is bounded by $\kappa(\mathbf{A})\,\kappa(\mathbf{G})$, i.e., $\kappa(\mathbf{A}\mathbf{G})\le\kappa(\mathbf{A})\,\kappa(\mathbf{G})$.

From Proposition 1, it is easy to show that any generalized inverse $\mathbf{G}$ (or a nonzero scalar multiple of
$\mathbf{G}$) enables a mathematically optimal RPO for consistent systems in the following sense.

###### Theorem 1.

(jiao2021approximate, Theorem 3.6) Given $\mathbf{A}\in\mathbb{C}^{n\times n}$ and a generalized inverse $\mathbf{G}$ of $\mathbf{A}$, GMRES with RPO $\alpha\mathbf{G}$ with $\alpha\neq0$ converges to a least-squares solution of (1) after one iteration for all $\boldsymbol{b}\in\mathcal{R}(\mathbf{A})$ and $\boldsymbol{x}_{0}\in\mathbb{C}^{n}$. Conversely, if GMRES with RPO $\mathbf{G}$ converges to a least-squares solution in one iteration for all $\boldsymbol{b}\in\mathcal{R}(\mathbf{A})$ and $\boldsymbol{x}_{0}\in\mathbb{C}^{n}$, then $\mathbf{G}$ is a scalar multiple of a generalized inverse of $\mathbf{A}$.

A corollary of Theorem 1 is that if $\mathcal{R}(\mathbf{G})=\mathcal{R}(\mathbf{A}^{H})$, then the computed $\boldsymbol{x}$ is the pseudoinverse solution of (1). Let $\mathbf{P}$ denote a projection onto $\mathcal{R}(\mathbf{A}^{H})$. Given any generalized inverse $\mathbf{G}$, it is easy to show that $\mathbf{P}\mathbf{G}$ is also a generalized inverse, and the computed $\boldsymbol{x}$ with $\mathbf{P}\mathbf{G}$ as the RPO is then the pseudoinverse solution. Theorem 1 assumes exact arithmetic. With rounding errors, the condition number of $\mathbf{A}\mathbf{G}$ must be bounded by a small constant, which holds in general if $\kappa(\mathbf{G})$ is bounded due to Proposition 2.
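To make the generalized-inverse identity of Definition 2 and the eigenvalue property of Proposition 1 concrete, the snippet below (a sketch using NumPy; the matrix is an arbitrary rank-deficient example) checks them for the Moore–Penrose pseudoinverse, which is one particular generalized inverse:

```python
import numpy as np

rng = np.random.default_rng(42)

# Build a rank-deficient 5x5 matrix of rank 3.
B = rng.standard_normal((5, 3))
C = rng.standard_normal((3, 5))
A = B @ C                      # rank(A) == 3

# The Moore-Penrose pseudoinverse is a special case of generalized inverses.
G = np.linalg.pinv(A)

# Definition 2: A G A == A characterizes a generalized inverse.
assert np.allclose(A @ G @ A, A)

# Proposition 1: A G is diagonalizable with eigenvalues in {0, 1}.
eigvals = np.linalg.eigvals(A @ G)
assert np.all(np.isclose(eigvals, 0) | np.isclose(eigvals, 1))

# The number of unit eigenvalues equals rank(A).
assert np.isclose(np.sum(eigvals.real), 3)
```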

Although the above theory may seem abstract, it suggests a new approach for constructing RPOs based on generalized inverses. In particular, one option is to hybridize multilevel ILU with a rank-revealing decomposition on its final Schur complement. As an illustration, consider the following example that combines LU factorization without pivoting with QR factorization with column pivoting (QRCP) (Golub13MC).

###### Example 0.

Given $\mathbf{A}\in\mathbb{C}^{n\times n}$, after $k$ steps of Gaussian elimination,

(7) $\mathbf{A}=\begin{bmatrix}\mathbf{L}_{11}&\\\mathbf{L}_{21}&\mathbf{I}\end{bmatrix}\begin{bmatrix}\mathbf{U}_{11}&\mathbf{U}_{12}\\&\mathbf{S}\end{bmatrix},$

where $\mathbf{S}$ is the *Schur
complement*. Clearly, $\operatorname{rank}(\mathbf{S})\ge\operatorname{rank}(\mathbf{A})-k$.
Let the QRCP of $\mathbf{S}$ be

(8) $\mathbf{S}\mathbf{\Pi}=\mathbf{Q}\mathbf{R},$

where $\mathbf{Q}$ is unitary and $r_{ii}\approx0$ for $i>r$ (i.e., $r$ is the (numerical) rank of $\mathbf{S}$). Let $\mathbf{Q}_{r}$ and $\mathbf{R}_{r}$ be composed of the first $r$ columns of $\mathbf{Q}$ and the first $r$ rows of $\mathbf{R}$, respectively. Then, $\mathbf{S}^{-}=\mathbf{\Pi}\mathbf{R}_{r}^{+}\mathbf{Q}_{r}^{H}$ is a generalized inverse of $\mathbf{S}$ with $\operatorname{rank}(\mathbf{S}^{-})=r$. Furthermore,

(9) $\mathbf{G}=\begin{bmatrix}\mathbf{U}_{11}&\mathbf{U}_{12}\\&\mathbf{S}\end{bmatrix}^{-}\begin{bmatrix}\mathbf{L}_{11}&\\\mathbf{L}_{21}&\mathbf{I}\end{bmatrix}^{-1}\quad\text{with}\quad\begin{bmatrix}\mathbf{U}_{11}&\mathbf{U}_{12}\\&\mathbf{S}\end{bmatrix}^{-}=\begin{bmatrix}\mathbf{U}_{11}^{-1}&-\mathbf{U}_{11}^{-1}\mathbf{U}_{12}\mathbf{S}^{-}\\&\mathbf{S}^{-}\end{bmatrix}$

is a generalized inverse of $\mathbf{A}$, with $\operatorname{rank}(\mathbf{A}\mathbf{G})=k+r$ as in Proposition 1.
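The construction in Example 2.1 can be sketched numerically. The following NumPy snippet (an illustrative sketch with a random singular matrix; it substitutes unpivoted QR for the QRCP named in the text, which suffices for this generic example) builds a generalized inverse of a singular $\mathbf{A}$ from a block-LU step plus a rank-revealing step on the Schur complement, and verifies the identity $\mathbf{A}\mathbf{G}\mathbf{A}=\mathbf{A}$:

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 6, 3          # system size and number of LU elimination steps

# A singular matrix of rank 5, so the 3x3 Schur complement has rank 2.
B = rng.standard_normal((n, 5))
C = rng.standard_normal((5, n))
A = B @ C

# k steps of block Gaussian elimination (no pivoting), as in Eq. (7).
A11, A12 = A[:k, :k], A[:k, k:]
A21, A22 = A[k:, :k], A[k:, k:]
S = A22 - A21 @ np.linalg.solve(A11, A12)      # Schur complement

# Rank-revealing step on S (the paper uses QRCP via LAPACK; plain QR is
# adequate here because the leading columns are generically independent).
Q, R = np.linalg.qr(S)
r = int(np.sum(np.abs(np.diag(R)) > 1e-10 * np.abs(R[0, 0])))
Qr, Rr = Q[:, :r], R[:r, :]
S_ginv = np.linalg.pinv(Rr) @ Qr.conj().T      # generalized inverse of S

# Assemble a generalized inverse G of A from the block factors:
# block generalized inverse of [[A11, A12], [0, S]] times the inverse of
# the unit block lower-triangular factor [[I, 0], [A21 A11^{-1}, I]].
Ublk_ginv = np.block([
    [np.linalg.inv(A11), -np.linalg.solve(A11, A12) @ S_ginv],
    [np.zeros((n - k, k)), S_ginv],
])
Lblk_inv = np.block([
    [np.eye(k), np.zeros((k, n - k))],
    [-A21 @ np.linalg.inv(A11), np.eye(n - k)],
])
G = Ublk_ginv @ Lblk_inv

# G satisfies the generalized-inverse identity A G A = A.
assert np.allclose(A @ G @ A, A)
```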

We refer to the preceding construction of $\mathbf{G}$ as a *hybrid
factorization*. It enables a more efficient approach to construct
an optimal RPO, for example, compared to applying QRCP to $\mathbf{A}$
directly if the Schur complement $\mathbf{S}$ is much smaller than
$\mathbf{A}$. For the strategy to be successful, we must address
some practical issues. First, we need to allow droppings in the factorization
to reduce the computational cost and memory requirement, especially for
larger-scale systems. Second, we need to control the condition numbers
of the triangular factors to limit $\kappa(\mathbf{G})$, for example, by
leveraging pivoting and equilibration (duff2001algorithms).
Third, it is desirable to make the final Schur complement as small as possible
before we apply QRCP. Hereafter, we will address the first two issues
from a theoretical perspective and then address the third issue in
Section 3.

### 2.2. Near-optimal right-preconditioning operators via approximate generalized inverses

Although a mathematically optimal RPO enables the most rapid convergence of KSP methods, the computational cost per iteration may be prohibitively high, as is the memory requirement. In practice, it may be more desirable to construct “near-optimal” RPOs by approximating a generalized inverse. The following definition and theorem establish the guideline for constructing such approximations.

###### Definition 3 ().

(jiao2021approximate, Definition 3.8)
Given $\mathbf{A}\in\mathbb{C}^{m\times n}$,
$\mathbf{G}$ is an *$\epsilon$-accurate approximate generalized inverse*
(*AGI*) if there exists a nonsingular matrix $\mathbf{X}$
such that

(10) $\mathbf{A}\mathbf{G}=\mathbf{X}\begin{bmatrix}\mathbf{I}_{r}&\\&\mathbf{0}\end{bmatrix}\mathbf{X}^{-1}+\mathbf{E}\quad\text{with}\quad\|\mathbf{E}\|=\epsilon<1,$

where $\mathbf{I}_{r}$ is the $r\times r$ identity matrix with $r=\operatorname{rank}(\mathbf{A})$. A class of AGIs is $\epsilon$-accurate if $\epsilon$ tends to 0 as its control parameters are tightened. $\mathbf{G}$ is a stable AGI if $\kappa(\mathbf{G})\le c\,\kappa(\mathbf{A})$ for some constant $c$.

###### Theorem 2.

(jiao2021approximate, Theorem 3.9) GMRES with an $\epsilon$-accurate AGI of $\mathbf{A}$ as the RPO converges to a least-squares solution of (1) in exact arithmetic for all consistent systems (i.e., $\boldsymbol{b}\in\mathcal{R}(\mathbf{A})$) with any initial guess $\boldsymbol{x}_{0}$.

A corollary of Theorem 2 is that given an $\epsilon$-accurate AGI $\mathbf{G}$, $\mathbf{P}\mathbf{G}$ is also an $\epsilon$-accurate AGI for a projection $\mathbf{P}$ onto $\mathcal{R}(\mathbf{A}^{H})$, and the computed $\boldsymbol{x}$ with $\mathbf{P}\mathbf{G}$ as the RPO is the pseudoinverse solution. Mathematically, it is equivalent to compute $\boldsymbol{x}$ with $\mathbf{G}$ as the RPO and then project $\boldsymbol{x}$ onto $\mathcal{R}(\mathbf{A}^{H})$ to obtain the pseudoinverse solution.

###### Remark 1.

In the literature, commonly used measures of accuracy and stability of a preconditioner $\mathbf{M}$ for a nonsingular matrix $\mathbf{A}$ were $\|\mathbf{A}-\mathbf{M}\|$ and $\|\mathbf{I}-\mathbf{M}^{-1}\mathbf{A}\|$, respectively; see, e.g., (benzi2002preconditioning). Our new definitions of accuracy and stability in Definition 3 are more general in that they apply to singular systems. In addition, they are more rigorous in that they are based on Theorem 2 and Proposition 2, respectively, instead of on empirical evidence (benzi2002preconditioning).

Theorem 2 and Example 2.1
suggest that we can construct AGIs by replacing the LU factorization
in hybrid factorization with some ILU variants. We will refer to the
combination of ILU with a rank-revealing factorization on the Schur
complement as a *hybrid incomplete factorization* (*HIF*).
From the perspective of AGI, a good candidate ILU should satisfy three
critical criteria. First, the ILU needs to have prudent dropping strategies
to make the approximation as accurate as possible. Second, we must
be able to control the condition numbers of the triangular factors effectively
for stability. Third, the computational cost and storage requirement
should ideally scale linearly (or near-linearly) with respect to the
input size. In the ILU literature (chow1997experimental; saad2003iterative; bollhofer2006multilevel),
there has been significant attention to the first criterion. However,
the second criterion excludes simple ILU techniques without pivoting,
such as ILU($k$) (saad2003iterative). The third criterion excludes
ILU with relatively simple pivoting strategies, such as ILUTP (chow1997experimental; saad2003iterative)
and its supernodal variants (li2011supernodal), which suffer
from superlinear complexity (ghai2019comparison; chen2021hilucsi).

Although the second and third criteria may appear self-contradicting for traditional ILU techniques, they can be met by a well-designed MLILU technique. Before delving into the details of MLILU algorithms, let us briefly review MLILU and, more importantly, show that MLILU can be used to construct accurate and stable AGIs. First, consider a two-level ILU (or more precisely, ILDU) of $\mathbf{A}$,

(11) $\hat{\mathbf{A}}=\mathbf{P}^{T}\mathbf{D}_{r}\mathbf{A}\mathbf{D}_{c}\mathbf{Q}\approx\begin{bmatrix}\mathbf{L}&\\\mathbf{L}_{E}&\mathbf{I}\end{bmatrix}\begin{bmatrix}\mathbf{D}&\\&\mathbf{S}_{C}\end{bmatrix}\begin{bmatrix}\mathbf{U}&\mathbf{U}_{F}\\&\mathbf{I}\end{bmatrix},$

where $\mathbf{L}\mathbf{D}\mathbf{U}$ is an ILDU of the leading block $\hat{\mathbf{A}}_{11}$, $\mathbf{L}_{E}\approx\hat{\mathbf{A}}_{21}\mathbf{U}^{-1}\mathbf{D}^{-1}$, $\mathbf{U}_{F}\approx\mathbf{D}^{-1}\mathbf{L}^{-1}\hat{\mathbf{A}}_{12}$, and

(12) $\mathbf{S}_{C}=\hat{\mathbf{A}}_{22}-\mathbf{L}_{E}\mathbf{D}\mathbf{U}_{F}$

is the Schur complement; $\mathbf{P}$ and $\mathbf{Q}$ are row and column permutation matrices, respectively; and $\mathbf{D}_{r}$ and $\mathbf{D}_{c}$ correspond to row and column scaling diagonal matrices, respectively. The Schur complement can be factorized recursively using the same ILU technique, leading to an $m$-level ILU preconditioner, namely,

(13) $\mathbf{M}=\mathbf{D}_{r}^{-1}\mathbf{P}\begin{bmatrix}\mathbf{L}&\\\mathbf{L}_{E}&\mathbf{I}\end{bmatrix}\begin{bmatrix}\mathbf{D}&\\&\mathbf{M}_{S}\end{bmatrix}\begin{bmatrix}\mathbf{U}&\mathbf{U}_{F}\\&\mathbf{I}\end{bmatrix}\mathbf{Q}^{T}\mathbf{D}_{c}^{-1},$

where $\mathbf{M}_{S}$ recursively stands for the factorization of $\mathbf{S}_{C}$ in (11) at the next level (similarly for the triangular factors and all other permutation and scaling matrices), $\mathbf{M}$ is composed of the “union” of the factors in (11) for all levels, and the recursion terminates at the final Schur complement $\mathbf{S}_{C}^{(m)}$. As in Example 2.1, we apply QRCP to $\mathbf{S}_{C}^{(m)}$ to obtain $\mathbf{S}_{C}^{(m)}\mathbf{\Pi}=\mathbf{Q}_{C}\mathbf{R}_{C}$, where $\mathbf{Q}_{C}$ is unitary and $\mathbf{R}_{C}$ is upper triangular with $|r_{11}|\ge|r_{22}|\ge\cdots$ along its diagonal, and $\mathbf{\Pi}$ is a permutation matrix. We define an RPO as

(14) $\mathbf{G}=\mathbf{D}_{c}\mathbf{Q}\begin{bmatrix}\mathbf{U}&\mathbf{U}_{F}\\&\mathbf{I}\end{bmatrix}^{-1}\begin{bmatrix}\mathbf{D}^{-1}&\\&\mathbf{G}_{S}\end{bmatrix}\begin{bmatrix}\mathbf{L}&\\\mathbf{L}_{E}&\mathbf{I}\end{bmatrix}^{-1}\mathbf{P}^{T}\mathbf{D}_{r}^{-1},$

where $\mathbf{G}_{S}$ is applied recursively at the intermediate levels and, at the final level, is a generalized inverse of $\mathbf{S}_{C}^{(m)}$ computed from its truncated QRCP as in Example 2.1, and $\mathbf{P}$, $\mathbf{Q}$, $\mathbf{D}_{r}$, and $\mathbf{D}_{c}$ are as in (13). We note the following fact.

###### Proposition 3.

Since $\mathbf{G}$ in (14) without dropping is an optimal RPO for consistent
systems due to Theorem 1, we claim that an
$\epsilon$-accurate and stable $\mathbf{G}$ constitutes a near-optimal
RPO for consistent systems. The near optimality requires sufficiently
small droppings and numerical stability; more precisely, the incomplete
factors should be close to their complete counterparts, and the condition
numbers of the triangular factors
must be controlled by the algorithm. To this end, we utilize an MLILU
technique called *HILUCSI*, which stands for *Hierarchical
Incomplete LU-Crout with Scalability-oriented and Inverse-based droppings*
(chen2021hilucsi). HILUCSI leverages several techniques, including
fan-in ILU (li2003crout), equilibration (duff2001algorithms),
static and dynamic pivoting across different levels (chen2021hilucsi; bollhofer2006multilevel),
etc., to meet the accuracy and stability requirements. As the name
suggests, HILUCSI focuses on scalability in terms of problem sizes,
and it has linear time complexity in each level for both the factorization
and solve stages (chen2021hilucsi). We defer the detailed description
of these algorithmic components to Section 3.

###### Remark 2.

Besides HILUCSI, there were several MLILU software packages, such as ARMS (saad2002arms), ILU++ (mayer2006alternative), ILUPACK (bollhofer2011ilupack), etc. Proposition 3 can also be applied to improve those packages to solve singular systems by applying QRCP to the final Schur complement. To harness this benefit, however, one must also extend them (especially ARMS and ILU++) to ensure the stability of the triangular factors.

Finally, we note that it is sometimes needed to reuse HIF to construct preconditioners for both $\mathbf{A}$ and $\mathbf{A}^{H}$. To achieve this, we note the following property:

###### Proposition 4.

(jiao2021approximate, Proposition 3.5) If $\mathbf{G}$ is a generalized inverse of $\mathbf{A}$, then $\mathbf{G}^{H}$ is a generalized inverse of $\mathbf{A}^{H}$.

Conceptually, we can extend (11) to construct a preconditioner for $\mathbf{A}^{H}$ as

(15) $\hat{\mathbf{A}}^{H}=\mathbf{Q}^{T}\mathbf{D}_{c}\mathbf{A}^{H}\mathbf{D}_{r}\mathbf{P}\approx\begin{bmatrix}\mathbf{U}^{H}&\\\mathbf{U}_{F}^{H}&\mathbf{I}\end{bmatrix}\begin{bmatrix}\mathbf{D}^{H}&\\&\mathbf{S}_{C}^{H}\end{bmatrix}\begin{bmatrix}\mathbf{L}^{H}&\mathbf{L}_{E}^{H}\\&\mathbf{I}\end{bmatrix}.$

### 2.3. Variable preconditioning via iterative refinement for null-space computation

The preceding discussions focused on consistent systems. For inconsistent
systems, i.e., $\boldsymbol{b}\notin\mathcal{R}(\mathbf{A})$,
an AGI (or even a generalized inverse^{3}) cannot guarantee the convergence of GMRES due to the following fact.

^{3}When $\mathcal{R}(\mathbf{G})\neq\mathcal{R}(\mathbf{A}^{H})$, right-preconditioned
GMRES converges to a weighted-least-squares (WLS) solution for inconsistent
systems, instead of a least-squares solution (jiao2021approximate, Theorem 3.7).
We cannot convert the WLS solution into a pseudoinverse solution by
projecting it onto $\mathcal{R}(\mathbf{A}^{H})$.

###### Theorem 3.

(jiao2021approximate, Theorem 2.4) GMRES with RPO $\mathbf{G}$ does not break down until finding a least-squares solution of (1) for all $\boldsymbol{b}$ and $\boldsymbol{x}_{0}$ if and only if $\mathbf{A}\mathbf{G}$ is range symmetric (i.e., $\mathcal{R}(\mathbf{A}\mathbf{G})=\mathcal{R}((\mathbf{A}\mathbf{G})^{H})$) and $\operatorname{rank}(\mathbf{A}\mathbf{G})=\operatorname{rank}(\mathbf{A})$. Furthermore, $\boldsymbol{x}$ is the pseudoinverse solution if $\mathcal{R}(\mathbf{G})=\mathcal{R}(\mathbf{A}^{H})$.

###### Remark 3.

The main challenge posed by Theorem 3 is that it is difficult, if not impossible, to ensure the range symmetry of $\mathbf{A}\mathbf{G}$ for an approximate generalized inverse $\mathbf{G}$. The requirement of range symmetry is the primary reason why $\mathbf{A}^{H}$ is often used as $\mathbf{B}$ in AB-GMRES (gould2017state; morikuni2015convergence). It is also a key factor for the prevalence of CGLS-type KSP methods (bjorck1996numerical; fong2011lsmr; paige1982lsqr) for solving singular and least-squares problems. Although such methods can be accelerated by applying some preconditioners, such as incomplete QR (jennings1984incomplete; saad1988preconditioning) or RIF (benzi2003robustpd; benzi2003robust), it is difficult for these preconditioners to overcome the slowdown caused by the squaring of the condition number by the normal equation.

Fortunately, this issue can be resolved by using flexible GMRES with
*variable preconditioners*, as formalized by the following definition
and theorem.

###### Definition 4 ().

(jiao2021approximate, Definition 2.1) Given
a matrix $\mathbf{A}\in\mathbb{C}^{n\times n}$, an initial vector
$\boldsymbol{v}$, and variable preconditioners $\mathbf{G}_{1},\mathbf{G}_{2},\dots,\mathbf{G}_{k}$,
the $k$th *flexible Krylov subspace* (*FKSP*) associated
with $\mathbf{A}$, $\boldsymbol{v}$, and $\{\mathbf{G}_{j}\}$
is

(16) $\mathcal{K}_{k}\left(\mathbf{A},\boldsymbol{v},\{\mathbf{G}_{j}\}\right)=\operatorname{span}\{\boldsymbol{v}_{1},\boldsymbol{v}_{2},\dots,\boldsymbol{v}_{k}\},$

where $\boldsymbol{v}_{1}=\boldsymbol{v}$
and $\boldsymbol{v}_{j}=\mathbf{A}\mathbf{G}_{j-1}\boldsymbol{v}_{j-1}$
for $j=2,3,\dots,k$. The *flexible Krylov matrix*, denoted by $\mathbf{K}_{k}$,
is composed of the basis vectors in (16).
The orthogonality of the basis vectors in $\mathcal{K}_{k}$ can be enforced using a generalization of Arnoldi iterations as in (saad1993flexible).

###### Theorem 4.

(jiao2021approximate, Theorem 2.5) Suppose FGMRES with variable preconditioners $\{\mathbf{G}_{j}\}$ does not break down until step $k$, where $\mathbf{G}_{j}=\mathbf{G}$ for a specific generalized inverse $\mathbf{G}$ at some step and $\mathbf{G}_{j}\approx\mathbf{G}$ at the other steps, respectively. If $\operatorname{rank}(\mathbf{A}\mathbf{G})=\operatorname{rank}(\mathbf{A})$, then it finds a least-squares solution of (1) for an initial guess $\boldsymbol{x}_{0}=\boldsymbol{0}$.

One effective approach to construct variable preconditioners is to
introduce *iterative refinement* (*IR*) in HIF, leading
to HIFIR. Specifically, given an $\epsilon$-accurate AGI $\mathbf{G}$
and an initial vector $\boldsymbol{x}^{(0)}$ (typically,
$\boldsymbol{x}^{(0)}=\boldsymbol{0}$), we refine the solution of
$\mathbf{A}\boldsymbol{x}=\boldsymbol{v}$ iteratively by obtaining $\boldsymbol{x}^{(j)}$ for
$j=1,2,\dots,N$ as

(17) $\boldsymbol{x}^{(j)}=\boldsymbol{x}^{(j-1)}+\mathbf{G}\left(\boldsymbol{v}-\mathbf{A}\boldsymbol{x}^{(j-1)}\right).$

Eq. (17) defines one application of the RPO in FGMRES, and different iterations may use different numbers $N$ of IR steps. Note that (17) in general does not converge as a standalone iterative solver, in that $\rho(\mathbf{I}-\mathbf{G}\mathbf{A})\ge1$ when $\mathbf{A}$ is singular, even if $\mathbf{G}$ is a generalized inverse. Eq. (17) introduces additional nonlinearity into the variable preconditioner, but it does not undermine the robustness of FGMRES due to Theorem 4. Hence, we shift our attention to its use as a variable preconditioner in the context of computing null-space vectors. In particular, we apply the technique to compute the left null space of $\mathbf{A}$ (i.e., $\mathcal{N}(\mathbf{A}^{H})$), which allows us to convert an inconsistent system into a consistent one. Furthermore, by applying the same technique to compute the right null space of $\mathbf{A}$ (i.e., $\mathcal{N}(\mathbf{A})$), we can convert a least-squares solution from the consistent system into the pseudoinverse solution (jiao2021approximate).
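The refinement step (17) can be sketched in a few lines. The snippet below (an illustrative sketch with NumPy; the matrix, the crude Jacobi-like approximate inverse, and the iteration counts are arbitrary choices, not HIFIR's) applies a fixed number of IR iterations with an approximate inverse $\mathbf{G}$, which is how one application of the variable preconditioner would be realized:

```python
import numpy as np

def apply_ir_preconditioner(A, G, v, n_iters=2):
    """Apply iterative refinement x^(j) = x^(j-1) + G (v - A x^(j-1)),
    starting from x^(0) = 0, as one application of a variable RPO."""
    x = np.zeros_like(v)
    for _ in range(n_iters):
        x = x + G @ (v - A @ x)
    return x

# A small nonsingular example with a crude approximate inverse
# (here: the inverse of the diagonal, i.e., a Jacobi-like G).
A = np.array([[4.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 2.0]])
G = np.diag(1.0 / np.diag(A))
v = np.array([1.0, 2.0, 3.0])

# More IR iterations yield a better approximation to the solve with A.
x_exact = np.linalg.solve(A, v)
err1 = np.linalg.norm(apply_ir_preconditioner(A, G, v, 1) - x_exact)
err4 = np.linalg.norm(apply_ir_preconditioner(A, G, v, 4) - x_exact)
assert err4 < err1
```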

## 3. Recursive construction of hybrid incomplete factorization

In this section, we describe the overall algorithm for constructing
HIF. Similar to that of HILUCSI (chen2021hilucsi) and other
MLILU algorithms, HIF is a recursive algorithm. As shown in Figure 1,
at each level, HIF takes an input matrix $\mathbf{A}$. HIF first
performs symmetric or unsymmetric preprocessing depending on whether
$\mathbf{A}$ is (nearly) pattern symmetric, i.e., whether the sparsity
patterns of $\mathbf{A}$ and $\mathbf{A}^{T}$ (nearly) coincide;
see Section 3.4. The process leads
to $\hat{\mathbf{A}}=\mathbf{P}^{T}\mathbf{D}_{r}\mathbf{A}\mathbf{D}_{c}\mathbf{Q}$,
where $\mathbf{P}$, $\mathbf{Q}$, $\mathbf{D}_{r}$,
and $\mathbf{D}_{c}$ are obtained from preprocessing. If $\mathbf{A}$
is nearly pattern symmetric, the preprocessing step may involve static
deferring, which splits $\hat{\mathbf{A}}$
into a 2-by-2 block matrix, i.e., $\hat{\mathbf{A}}=\left[\begin{smallmatrix}\hat{\mathbf{A}}_{11}&\hat{\mathbf{A}}_{12}\\\hat{\mathbf{A}}_{21}&\hat{\mathbf{A}}_{22}\end{smallmatrix}\right]$. Depending on whether the leading block $\hat{\mathbf{A}}_{11}$ is
Hermitian, we then perform incomplete $\mathbf{L}\mathbf{D}\mathbf{L}^{H}$
or $\mathbf{L}\mathbf{D}\mathbf{U}$ factorizations, respectively,
where $\mathbf{L}$ and $\mathbf{U}$ are unit lower and upper
triangular matrices, respectively (i.e., their diagonal entries are
ones). We compute these factorizations using *fan-in updates*
(aka the Crout version of ILU), which update the $k$th column of
$\mathbf{L}$ using columns 1 through $k-1$ and update the $k$th
row of $\mathbf{U}$ using rows 1 through $k-1$ at the $k$th
step, respectively. For stability, we combine fan-in updates with
dynamic deferring and scalability-oriented droppings; see Section 3.1.
In addition, HIF enables rook pivoting when the previous level had
too many deferrals, as indicated by the Boolean tag ibrp in
Figure 1; see Section 3.2.
The dynamic deferring may permute some rows and columns in $\hat{\mathbf{A}}$
after $\mathbf{P}$ and $\mathbf{Q}$ to obtain
a new 2-by-2 block structure, in which the leading block collects the
rows and columns that were not deferred in $\mathbf{L}$ and $\mathbf{U}$,
respectively. We then compute the Schur complement $\mathbf{S}_{C}$
corresponding to the leading block and factorize $\mathbf{S}_{C}$
either directly using RRQR or recursively using HIF, depending on
whether $\mathbf{S}_{C}$ is sufficiently small or nearly dense; see
Section 3.3. Some of the components above
are the same as those in HILUCSI (chen2021hilucsi), the predecessor
of HIF. The key differences between HIF and HILUCSI are that 1) HIF
introduces a variant of rook pivoting (poole2000rook) to
improve the stability of ILU and, in turn, reduce the size of the
final Schur complement, and 2) HIF uses a rank-revealing
QR factorization (chan1987rank) on the final Schur complement. In
the following, we first focus on these two aspects and then briefly
outline the other components. We will describe how to apply HIF as
a preconditioner in Section 4.

### 3.1. Incomplete $\mathbf{L}\mathbf{D}\mathbf{U}$ with dynamic deferring and scalability-oriented and inverse-based droppings

The core of HIF at each level is incomplete $\mathbf{L}\mathbf{D}\mathbf{U}$ factorization, or its variant, incomplete $\mathbf{L}\mathbf{D}\mathbf{L}^{H}$ factorization. For robustness, we leverage the fan-in ILU, scalability-oriented dropping, inverse-based dropping, and dynamic deferring, as shown in Algorithm 1. Hereafter, we outline the dual-thresholding and dynamic-deferring steps in the fan-in ILU. An optional step in Algorithm 1 is the inverse-based rook pivoting activated by the Boolean flag ibrp. We will dedicate Section 3.2 to this new pivoting strategy.
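To illustrate the fan-in (Crout) update order that the factorization relies on, here is a dense, dropping-free $\mathbf{L}\mathbf{D}\mathbf{U}$ sketch in NumPy (an illustration only; the sparse implementation in HIFIR adds deferring, pivoting, and dropping): at step $k$, it computes the $k$th column of $\mathbf{L}$ and the $k$th row of $\mathbf{U}$ from the previously computed columns and rows.

```python
import numpy as np

def fanin_ldu(A):
    """Dense Crout-style (fan-in) LDU: at step k, column k of L and row k
    of U are computed from columns/rows 1..k-1; no pivoting or dropping."""
    n = A.shape[0]
    L = np.eye(n)
    U = np.eye(n)
    d = np.zeros(n)
    for k in range(n):
        # Fan-in update of the k-th column of L*D and row of D*U.
        col = A[k:, k] - L[k:, :k] @ (d[:k] * U[:k, k])
        row = A[k, k:] - (L[k, :k] * d[:k]) @ U[:k, k:]
        d[k] = col[0]                  # == row[0], the k-th pivot
        L[k + 1:, k] = col[1:] / d[k]  # unit lower triangular L
        U[k, k + 1:] = row[1:] / d[k]  # unit upper triangular U
    return L, d, U

A = np.array([[4.0, 2.0, 1.0],
              [2.0, 5.0, 2.0],
              [1.0, 2.0, 6.0]])
L, d, U = fanin_ldu(A)
assert np.allclose(L @ np.diag(d) @ U, A)
```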

In Algorithm 1, the input matrix $\hat{\mathbf{A}}=\mathbf{P}^{T}\mathbf{D}_{r}\mathbf{A}\mathbf{D}_{c}\mathbf{Q}$
is obtained from applying preprocessing techniques on either the input
matrix $\mathbf{A}$ or the Schur complement from the previous
level. Note that in the actual implementation, we do not form $\hat{\mathbf{A}}$
explicitly; instead, we rescale the entries in $\mathbf{A}$ by
$\mathbf{D}_{r}$ and $\mathbf{D}_{c}$ in a “just-in-time” fashion.
The procedure ilu_factorize also takes $\mathbf{P}$
and $\mathbf{Q}$ as input, which were obtained from
preprocessing along with $\mathbf{D}_{r}$ and $\mathbf{D}_{c}$ (see
Figure 1 and Section 3.4).
The main loop in Algorithm 1 factorizes $\hat{\mathbf{A}}_{11}$,
the leading block of $\hat{\mathbf{A}}$, using fan-in ILU similar
to those in (li2003crout; li2005crout), which delays the computation
of the Schur complement as late as possible. Unlike (li2003crout; li2005crout),
however, ilu_factorize dynamically permutes (aka *defers*)
rows and columns in $\hat{\mathbf{A}}_{11}$ to the end of $\mathbf{L}$
and $\mathbf{U}$. The procedure computes

(18) $\hat{\mathbf{A}}\approx\begin{bmatrix}\mathbf{L}&\\\mathbf{L}_{E}&?\end{bmatrix}\begin{bmatrix}\mathbf{D}&\\&\mathbf{S}_{C}\end{bmatrix}\begin{bmatrix}\mathbf{U}&\mathbf{U}_{F}\\&?\end{bmatrix},$

where the first and second question marks (‘$?$’) in (18) correspond to the trailing blocks obtained from the recursive factorization of $\mathbf{S}_{C}$, whose computations we will describe in Section 3.3. Algorithm 1 returns $\mathbf{L}$, $\mathbf{D}$, $\mathbf{U}$, $\mathbf{L}_{E}$, $\mathbf{U}_{F}$, $\mathbf{P}$, and $\mathbf{Q}$, from which $\mathbf{S}_{C}$ and the preconditioner at this level (along with its AGI) can be computed.

A noteworthy feature of ilu_factorize is its *scalability-oriented
dropping*, which differs from the dropping strategies in (li2003crout; li2005crout)
and (bollhofer2011ilupack). Consider the $k$th step of ILU at
a particular level. Using MATLAB’s colon notation, let $\boldsymbol{\ell}_{k}=\mathbf{L}(k+1:n,k)$
and $\boldsymbol{u}_{k}=\mathbf{U}(k,k+1:n)$ denote the $k$th column and row of
$\mathbf{L}$ and $\mathbf{U}$,
respectively. Our dropping strategy limits $\operatorname{nnz}(\boldsymbol{\ell}_{k})$
and $\operatorname{nnz}(\boldsymbol{u}_{k})$ to be proportional
to the numbers of nonzeros in the corresponding column and row of the
original (i.e., the top level instead of the present level) input
matrix, respectively; see line 18. This
dropping strategy plays an important role for HILUCSI and HIF to achieve
(near) linear complexity in both space and time. Besides this symbolic
dropping, HILUCSI and HIF also adopt an inverse-based dropping (bollhofer2006multilevel),
which drops every entry $\ell_{ik}$ such that $\kappa_{\mathbf{L}}|\ell_{ik}|\le\tau$,
where $\kappa_{\mathbf{L}}$ and $\tau$ are user-specified thresholds
for the upper bound of $\|\mathbf{L}^{-1}\|$
and the drop tolerance, respectively (see Section 5.2).
The dropping for $\mathbf{U}$ is similar; see line 19.

Another core component in ilu_factorize is the inverse-based dynamic deferring. In particular, during the fan-in ILU, we dynamically defer the current row and column to the end (line 10) if we encounter a small pivot or overly large estimated $\|\mathbf{L}^{-1}\|$ or $\|\mathbf{U}^{-1}\|$ (line 9). Figure 2 illustrates the process of dynamic deferring. This deferring is similar to that in (bollhofer2006multilevel; bollhofer2011ilupack); in Section 3.2, we will extend it to support rook pivoting. Note that the inverse-based dropping in line 19 can be replaced by a different dropping strategy, such as that in (mayer2006alternative), but we utilize the inverse-based dropping since we are already estimating the inverse norms for deferring.

We note some implementation details. First, in line 18, we use quickselect (hoare1961algorithm), which has expected linear time complexity. Second, since we need to access both rows and columns of the partially computed factors while computing the fan-in updates (lines 16 and 17), we need to store them (or more precisely, their sparsity patterns) in both row and column majors. We will describe the data structures in Section 3.5 and Appendix A. Third, the outputs $\mathbf{L}_{E}$ and $\mathbf{U}_{F}$ will only be used to compute the Schur complement in Section 3.3. Afterwards, $\mathbf{L}_{E}$ and $\mathbf{U}_{F}$ are discarded, since they can be reconstructed from $\hat{\mathbf{A}}_{21}$ and $\hat{\mathbf{A}}_{12}$, respectively, as in (11).
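The scalability-oriented dropping amounts to a selection problem: keep only the largest-magnitude entries, with the budget proportional to the nonzero count of the corresponding row or column of the top-level input. A sketch (using NumPy's `argpartition`, which plays the role of quickselect here; the fill factor and the sample column are illustrative, not HIFIR's defaults):

```python
import numpy as np

def scalability_oriented_drop(values, indices, budget):
    """Keep only the `budget` largest-magnitude entries of a sparse
    column/row given as parallel (values, indices) arrays."""
    if len(values) <= budget:
        return values, indices
    # argpartition performs a quickselect-style partial sort: O(len) expected.
    keep = np.argpartition(np.abs(values), -budget)[-budget:]
    keep.sort()  # restore ascending index order for the sparse structure
    return values[keep], indices[keep]

# Column with 6 fills; budget = alpha * nnz of the original column.
vals = np.array([0.9, -0.01, 0.5, 0.002, -0.3, 0.05])
idx = np.array([2, 4, 5, 7, 8, 11])
alpha, nnz_orig = 2, 2          # fill factor and original nnz (illustrative)
v2, i2 = scalability_oriented_drop(vals, idx, alpha * nnz_orig)
assert set(np.abs(v2)) == {0.9, 0.5, 0.3, 0.05}
```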

### 3.2. Inverse-based rook pivoting for coarse levels

Dynamic deferring symmetrically permutes rows and columns. Such a
permutation strategy works well for reasonably well-conditioned matrices.
However, we observe that symmetric permutations alone sometimes lead
to relatively large Schur complements for highly ill-conditioned unsymmetric
systems. To overcome this issue, we introduce an *inverse-based
rook pivoting* (*IBRP*) for the fan-in ILU, by adapting the rook
pivoting (poole2000rook) for complete LU factorization.

In the standard rook pivoting (poole2000rook), a pivot is found by searching in the row and column in alternating order until its magnitude is no smaller than those of all other entries in its row and column within the Schur complement. This strategy has a cost comparable to that of partial pivoting for dense matrices but enables superior stability. However, in the context of fan-in ILU, only the $k$th row and column of the Schur complement are updated at the $k$th step; the remaining part of the Schur complement is not available. Hence, we must modify the pivoting procedure to interleave the search with dynamic permutation and “just-in-time” fan-in updates. Figure 3 illustrates one step of the IBRP. Note that this permutation is more general than the dynamic deferring in Figure 2, in that it can exchange the current row with a row in the middle of the unfactored block, and similarly for the columns. Furthermore, the row and column interchanges are not symmetric in general. Note that IBRP requires a more sophisticated data structure to support the row and column interchanges, which we will address in Section 3.5.
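For reference, the classical rook search that IBRP adapts can be written compactly for a dense block (a sketch of standard rook pivoting only; HIFIR's IBRP additionally interleaves fan-in updates, inverse-based constraints, and a step limit):

```python
import numpy as np

def rook_pivot(S):
    """Standard rook search on a dense block S: alternately take the
    largest entry in the current column, then in the current row, until
    the candidate is maximal in both. Returns its (row, col) position."""
    j = 0
    while True:
        i = int(np.argmax(np.abs(S[:, j])))      # best row in column j
        j_new = int(np.argmax(np.abs(S[i, :])))  # best column in row i
        if j_new == j:
            return i, j                          # maximal in row and column
        j = j_new

S = np.array([[1.0, 4.0, 2.0],
              [3.0, 0.5, 8.0],
              [2.0, 7.0, 1.0]])
i, j = rook_pivot(S)
# The returned pivot (value 8.0) dominates its entire row and column.
assert np.abs(S[i, j]) >= np.abs(S[i, :]).max()
assert np.abs(S[i, j]) >= np.abs(S[:, j]).max()
```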

Besides the difference dictated by the fan-in update, there are two other significant differences between IBRP and the standard rook pivoting. First, we do not simply use the magnitude to determine the pivot, since it may conflict with dynamic deferring. Instead, we add an inverse-based constraint when searching for the pivot, so that the pivot row in $\mathbf{U}$ and the pivot column in $\mathbf{L}$ would not arbitrarily enlarge the condition numbers of $\mathbf{U}$ and $\mathbf{L}$, respectively. Second, we do not locate the optimal rook pivot whose magnitude is the largest among its row and column; instead, we impose a maximum number of steps of IBRP, controlled by the parameter max_steps. For completeness, Algorithm 2 details the inverse-based rook pivoting. Note that we do not scale the candidate rows and columns by the diagonal factor in lines 2 and 9 in Algorithm 2, compared to lines 16 and 17 in Algorithm 1. This omission of scaling is for efficiency purposes, because the choice of pivot does not depend on the scaling of the entries.

We note an important practical issue. The IBRP can result in significantly
denser $\mathbf{L}$ and $\mathbf{U}$ factors because the
pivoting may undo the effects of the fill-reduction reordering in
the preprocessing step. Hence, we enable IBRP only on the coarser
levels, when dynamic deferring is
found to be ineffective. At the coarser levels, we also enlarge the
scalability-oriented fill factors
to preserve more *fills* introduced by IBRP. Our experiments
show that applying IBRP on coarser levels can significantly reduce
the size of the final Schur complement for some challenging problems,
as demonstrated by the example in Table 1.

| $n$ | nnz | w/ IBRP | in HIF | | fac. time | #levels | nnz ratio | GMRES iter. |
|---|---|---|---|---|---|---|---|---|
| 76,480 | 329,762 | yes | 1,407 | 50 | 2.51 | 5 | 12.0 | 15 |
| | | no | 9,394 | 48 | 226 | 2 | 304 | 2 |

### 3.3. Computing and factorizing Schur complements

After finishing ilu_factorize, we need to compute the Schur complement based on (12). This step involves a sparse matrix-matrix (SpMM) multiplication, for which we adopt the algorithm described in (bank1993sparse). The space and time complexity of SpMM depends on the numbers of nonzeros in $\mathbf{L}_{E}$ and $\mathbf{U}_{F}$. To achieve near-linear complexity, we apply scalability-oriented dropping before SpMM to the rows and columns of $\mathbf{L}_{E}$ and $\mathbf{U}_{F}$, respectively. Recall that in ilu_factorize we already applied scalability-oriented dropping to the columns and rows of $\mathbf{L}$ and $\mathbf{U}$, respectively. Hence, the numbers of nonzeros in both the rows and columns of $\mathbf{L}_{E}$ and $\mathbf{U}_{F}$ are well controlled, allowing us to effectively bound the complexity of the SpMM. In contrast, if we applied dropping after SpMM, the complexity of SpMM might be higher. Note that after computing $\mathbf{S}_{C}$, we drop both $\mathbf{L}_{E}$ and $\mathbf{U}_{F}$, as they can be reconstructed from the other terms as in (11), and then factorize $\mathbf{S}_{C}$ recursively.

In HIF, the final Schur complement $\mathbf{S}_{C}$ is factorized by rank-revealing QR (or truncated QRCP) in order to guarantee the stability of the preconditioner. Algorithm 3 outlines the procedure. We note a couple of details in the algorithm. First, line 1 computes the QRCP $\mathbf{S}_{C}\mathbf{\Pi}=\mathbf{Q}\mathbf{R}$, where $\mathbf{\Pi}$ is a permutation matrix, $\mathbf{Q}$ is unitary, and $\mathbf{R}$ is upper triangular. The QRCP has a cubic time complexity in the dimension of $\mathbf{S}_{C}$. Second, lines 2–8 determine the numerical rank of $\mathbf{S}_{C}$ by comparing the estimated condition number against a prescribed threshold. We estimate the 2-norm condition number using the incremental estimator in (bischof1990incremental), which has a linear complexity per step. Hence, the overall computational cost is dominated by QRCP. For efficiency, we implement the QRCP and condition-number estimator using the LAPACK kernel functions xGEQP3 and xLAIC1, respectively.
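The rank-revealing step can be illustrated with a compact column-pivoted QR in pure NumPy (a sketch only: the library uses LAPACK xGEQP3 with the incremental condition estimator xLAIC1, whereas this stand-in estimates the numerical rank from the diagonal of $\mathbf{R}$):

```python
import numpy as np

def qrcp_rank(S, tol=1e-12):
    """Column-pivoted QR via modified Gram-Schmidt, plus a simple
    numerical-rank estimate from the diagonal of R (a simplified
    stand-in for an incremental condition estimator)."""
    S = S.astype(float).copy()
    m, n = S.shape
    Q = np.zeros((m, n))
    R = np.zeros((n, n))
    perm = np.arange(n)
    for k in range(n):
        # Pivot: bring the remaining column with the largest norm to front.
        p = k + int(np.argmax(np.linalg.norm(S[:, k:], axis=0)))
        S[:, [k, p]] = S[:, [p, k]]
        R[:, [k, p]] = R[:, [p, k]]
        perm[[k, p]] = perm[[p, k]]
        R[k, k] = np.linalg.norm(S[:, k])
        if R[k, k] > 0:
            Q[:, k] = S[:, k] / R[k, k]
        # Orthogonalize the remaining columns against q_k (MGS step).
        R[k, k + 1:] = Q[:, k] @ S[:, k + 1:]
        S[:, k + 1:] -= np.outer(Q[:, k], R[k, k + 1:])
    d = np.abs(np.diag(R))
    rank = int(np.sum(d > tol * d[0]))
    return Q, R, perm, rank

# A 4x4 Schur-complement-like matrix of rank 2.
B = np.array([[1.0, 2.0], [0.5, -1.0], [2.0, 0.0], [1.0, 1.0]])
C = np.array([[1.0, 0.0, 2.0, 1.0], [0.0, 1.0, -1.0, 3.0]])
Sc = B @ C
Q, R, perm, rank = qrcp_rank(Sc)
assert rank == 2
assert np.allclose(Q @ R, Sc[:, perm])      # S_C * Pi == Q * R
```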

Due to the cubic time complexity of QRCP, we would like to make the dimension of the final Schur complement as small as possible, ideally negligible relative to $n$, the number of unknowns in the original system. This high complexity was the motivation to use IBRP within the multilevel ILU. In addition, we would also like to prevent having too many low-quality ILU levels. As a tradeoff, we trigger QRCP based
