HILUCSI: Simple, Robust, and Fast Multilevel ILU for Large-Scale Saddle-Point Problems from PDEs

11/22/2019 ∙ by Qiao Chen, et al. ∙ Stony Brook University 0

Incomplete factorization is a widely used preconditioning technique for Krylov subspace methods for solving large-scale sparse linear systems. Its multilevel variants, such as those in ILUPACK and ARMS, have been shown to be more robust for many symmetric or unsymmetric linear systems than the traditional, single-level incomplete LU (or ILU) techniques. However, multilevel ILU still lacked robustness and efficiency for some large-scale saddle-point problems, which often arise from systems of partial differential equations (PDEs). In this work, we introduce HILUCSI, or Hierarchical Incomplete LU-Crout with Scalability-oriented and Inverse-based dropping, which is specifically designed to take advantage of some special features of such systems. HILUCSI differs from the state-of-the-art ILU techniques in two main aspects. First, HILUCSI leverages the near or partial symmetry of the underlying problems and the inherent block structures of multilevel ILU to improve robustness and to simplify the treatment of indefinite systems. Second, HILUCSI introduces a scalability-oriented dropping in conjunction with a variant of inverse-based dropping to improve the efficiency for large-scale problems from PDEs. We demonstrate the effectiveness of HILUCSI for a number of benchmark problems, including those from mixed formulation of the Poisson equation, Stokes equations, and Navier-Stokes equations. We also compare its performance with ILUPACK, the supernodal ILUTP in SuperLU, and multithreaded direct solvers in PARDISO and MUMPS.



There are no comments yet.


This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

Krylov subspace (KSP) methods, such as GMRES [87, 89] and BiCGSTAB [102], are widely used for solving large-scale sparse unsymmetric or indefinite linear systems, especially those arising from numerical discretizations of partial differential equations (PDEs). For relatively ill-conditioned matrices, the KSP methods can significantly benefit from a robust and efficient preconditioner. Incomplete factorization techniques, such as incomplete or Cholesky factorization for M-matrices [77] and symmetric and positive definite (SPD) systems [51, 59, 71, 95] and incomplete LU (or ILU) factorization for symmetric indefinite [65, 93] and unsymmetric [69, 73, 84] systems, are among the most robust preconditioners. Earlier ILU methods, such as ILUT [84], ILUP [83], ILUTP [30, 87], ILUC [65, 66, 73], etc., lacked robustness for many indefinite systems; see e.g. [30, 43, 48, 69, 112] for some challenging benchmark problems that caused failures. With the more recent development of multilevel ILU techniques, such as ARMS [88, 90], ILUPACK [21, 23], MDRILU [111], ILU++ [75, 76], etc., the robustness of ILU has improved significantly for many applications. However, robustness and efficiency are sometimes problematic for highly ill-conditioned or saddle-point systems, which often arise from discretizations of PDEs, such as Stokes, Navier-Stokes, and Helmholtz equations, in application areas such as computational fluid dynamics (CFD), climate modeling, materials science, structural mechanics, multiphysics coupling, etc.; see e.g. [88] and Section 5 in this work for some examples. The objective of this work is to improve the robustness and efficiency of multilevel ILU for such systems. To this end, we introduce a new preconditioner, called HILUCSI (pronounced as Hi-Luxi), which stands for Hierarchical Incomplete LU-Crout with Scalability-oriented and Inverse-based dropping.

The development of HILUCSI was motivated by two observations. First, many linear systems from systems of PDEs are nearly or partially symmetric, with some block structures. Without loss of generality, assume matrix has the form


where . For linear systems from PDEs, often has nearly symmetric nonzero pattern, because in some commonly used numerical methods, such as finite differences [64], finite elements [31, 42], or finite volumes [63], the local support of the basis functions (a.k.a. trial functions in finite elements) and that of the test functions in the variational formulations are often the same or have significant overlaps. In addition, the numerical values are often nearly symmetric (i.e., ), because the numerical asymmetry is often due to small non-self-adjoint terms (such as advection in a diffusion-dominant advection-diffusion problem [42, p. 243]) or due to truncation errors (such as in a Petrov-Galerkin method for a self-adjoint PDE [42, p. 88]). For systems of PDEs, may be partially symmetric in the sense that may be nearly symmetric for the reasons mentioned above, but and may differ significantly, for example, due to strongly imposed constraints in a variational formulation [41], high-order treatment of Neumann boundary conditions in finite elements [17] or finite differences [64], imposition of jump conditions in immersed/embedded boundary methods [58, 80], etc.

The second observation was that it is sometimes advantageous to treat a symmetric indefinite system as unsymmetric in the context of incomplete factorization. This claim seems to contradict the conventional wisdom that “it is rarely sensible to throw away symmetry in preconditioning” [108]. Indeed, some state-of-the-art packages for symmetric indefinite systems preserve symmetry by pivoting or deferring of some combination of and blocks [50, 65, 93, 94], for example, by using the Bunch-Kaufman pivoting [27, 49] as in [50, 65]. Although it may be more efficient to use a mixture of and symmetric pivots in the context of complete factorization [37, 94], for multilevel ILU it is sometimes more robust to use unsymmetric pivoting for symmetric saddle-point problems, as we will demonstrate in Section 5.2.2. At the same time, the conventional wisdom is also sound in that leveraging symmetry not only can reduce the computational cost by up to half for some problems, but also may improve robustness for some matrices from systems of PDEs. A novel feature of HILUCSI is to combine symmetric preprocessing at the top levels for nearly or partially symmetric matrices and unsymmetric factorization at lower levels for all indefinite systems, including symmetric indefinite systems. In addition, HILUCSI introduces a static deferring strategy for (nearly) symmetric saddle-point problems, which eliminates the need for pivots and hence simplifies the implementation.

Due to the numerous applications of nearly or partially symmetric and indefinite systems, there have been significant interests in solving such systems in linear algebra, numerical PDE, and engineering communities. Many recent problems in the SuiteSparse Matrix Collections [34] and Matrix Market [18] belong to these classes, including some of the cases highlighted in [30, 69, 76, 88, 112]. To address these challenging problems, the PDE and multigrid communities have developed some customized preconditioners and solvers for specific equations; see [1, 2] for examples of special multigrid solvers and see [15] for a survey on special techniques for saddle-point problems. In terms of general-purpose solvers, a common practice is to treat such systems as general unsymmetric systems, such as in ARMS [90], ILU++ [75], supernodal ILUTP in SuperLU [69], etc. We note two exceptions. First, one may build a symmetric preconditioner for an unsymmetric , such as using in an algebraic preconditioner [33, 109] or using the self-adjoint terms in a physics-based preconditioner [3]. Second, for a partially symmetric matrix where is symmetric but in (1), one may factorize using a symmetric ILU and then apply unsymmetric ILU on the Schur complement [15]. HILUCSI differs from these two approaches in that it does not explicitly construct a symmetric approximation of , and it allows deferring some rows and columns in into the Schur complement.

Another novel feature of HILUCSI is its scalability-oriented dropping together with inverse-based dropping in the Crout version of multilevel ILU [23, 66]. Our scalability-oriented dropping shares some similarity to the space-based droppings (such as those in ILUT [84], ICMF [71], PARDISO 6 [94]) as well as area-based dropping in [69]. Those space or area-based droppings traditionally focused on controlling space complexity. In contrast, the primary goal of our scalability-oriented dropping is to achieve (near) linear-time complexity in the number of nonzeros in the input matrix. Although linear time complexity implies linear space complexity as a side product, the converse is not true in general. Besides the scalability-oriented dropping, we employ the inverse-based dropping due to Bollhöfer and Saad [19, 22, 23]. We show that our new dropping strategies along with mixed symmetric and unsymmetric processing enabled superior robustness and efficiency for HILUCSI compared to ILUPACK [21] and supernodal ILUTP [69] for saddle-point problems from PDEs. In addition, we show that the serial performance of HILUCSI with optimized parameters is competitive with the state-of-the-art multithreaded direct solvers in MUMPS [6] and PARDISO [94] on 24 cores with up to one million unknowns.

The remainder of the paper is organized as follows. In Section 2, we review some background on incomplete LU factorization and its multilevel variants. In Section 3, we describe the algorithm components of HILUCSI for robustness. In Section 4, we address the efficiency issues with a focus on scalability with respect to problem sizes. In Section 5, we present numerical results with HILUCSI as a right preconditioner for restarted GMRES and compare its performance with some state-of-the-art packages. Finally, Section 6 concludes the paper with a discussion on future work.

2 Background

In this section, we review some incomplete LU preconditioners. Because there is a vast literature on preconditioning and ILU, we focus on only some of the most relevant techniques and mathematical analysis. For comprehensive reviews, we refer readers to some surveys [14, 28, 91, 108] and textbooks [87, 103].

2.1 Incomplete LDU factorizations

ILU, or more precisely incomplete LDU (or ILDU) factorization with pivoting, performs an approximate factorization


where and are unit lower and upper triangular matrices, respectively, and and are row and column permutation matrices, which may be obtained from some static reordering, static or dynamic pivoting, or a combination of them. Let , and is a preconditioner of , or equivalently is a preconditioner of . We consider only right preconditioning in this work. In this context, when solving a linear system using an iterative method, especially a Krylov subspace method, one solves the right-preconditioned linear system


which ideally would converge much faster than solving the original linear system, and then .

2.1.1 Single-level ILU

We refer to the ILU with the basic form in (2) as a single-level ILU. Such a technique has been used to solve linear systems from PDEs since 1950s; see e.g. [106]. In 1977, Meijerink and Van der Vorst [77] showed that incomplete Cholesky (IC) factorization is stable for a symmetric M-matrix, i.e., a matrix with nonpositive off-diagonal entries and nonnegative entries in its inverse. Since then, IC has been extended to and become popular for SPD systems [51, 59, 71, 95]. However, ILU for unsymmetric or indefinite systems had turned out to be much more challenging, and it has been an active research topic over the past few decades; see e.g. [7, 20, 23, 30, 29, 74, 69, 84, 88].

In its simplest form, ILU does not involve any pivoting, and and preserve the sparsity patterns of the lower and upper triangular parts of , respectively. This approach is often referred to as ILU0 or ILU(0). To improve its robustness, one may allow fills, a.k.a. fill-ins, which are new nonzeros in the and factors. The fills may be introduced based on their levels in the elimination tree or based on the magnitude of numerical values. The former leads to the so-called ILU(), which zeros out all the fills of level or higher in the elimination tree. The combination of the two is known as ILU with dual thresholding (ILUT) [84]. The level-based fills may be replaced with some other dropping to control the numbers of fills in each row or column. The ILU implementations in PETSc [10] and hypre [98] use some variants of ILUT with dual thresholding.

ILUT may encounter zero or tiny pivots, which can lead to a breakdown of the factorization. One may replace tiny pivots with a small value, but such a trick is not robust [30]. The robustness may be improved by using pivoting, leading to the so-called ILUP [83] and ILUTP [87]. The ILU implementations in MATLAB [99], SPARSKIT [85], and SuperLU [69], for example, are based on ILUTP. However, ILUTP cannot prevent small pivots [88], so it is still not robust in practice; see [69, 76, 88] and Section 5 in this work for some failed cases with ILUTP.

2.1.2 Multilevel ILU

A more sophisticated form of ILU takes advantage of the global block structure of similar to that in (1) to compute a block factorization of . Specifically, let and denote the row and column permutation matrices. A block-structured preconditioner for the permuted matrix can be constructed via the approximation


where , , , and is the Schur complement. If is further approximated recursively using such a block factorization, one obtains a multilevel ILU (MLILU) [11, 12, 23, 76, 86, 88, 92, 111], also known as multilevel block factorization [107]

. Given a block vector

, then can be computed as


There are several motivations to use multilevel ILU. One of them is to expose parallelism, such as in ILUM and BILUTM [86, 92]. The idea is to permute so that in (4) is a block diagonal matrix, of which each block can then be factorized independently. This approach was inspired by the domain-decomposition (such as additive Schwarz) methods [96], but ILUM factorizes the Schur complement recursively instead of resolving the overlapping regions of subdomains iteratively. Another motivation of multilevel ILU is to construct an algebraic analogy of multigrid methods [82, 110]; see e.g. [11, 12, 13]. However, the fine-coarse grid relationship in multilevel ILU is quite different from the geometric [82, 26] and algebraic multigrid methods [55, 105, 104]. More recently, multilevel ILU has been used successfully in improving the robustness of ILU as a preconditioner; see e.g. [23, 76, 88, 111]. In this work, we strive to improve the robustness of multilevel ILU further along with its efficiency, especially for saddle-point systems arising from PDEs.

2.1.3 Criteria of accuracy, stability, and efficiency

For ILU to be “robust,” it should be accurate and stable. In [30], Chow and Saad consider these issues from an algorithmic point of view. Most notably, they emphasized the importance of avoiding small pivots. In [19], Bollhöfer pointed out the importance of monitoring and

, of which the norms can be estimated incrementally as in

[56]. Based on this idea, Bollhöfer and Saad developed a robust multilevel ILU approach [23], which dynamically defers the rows and columns that lead to large and . From an empirical point view, Benzi [14] measured the stability and accuracy of a preconditioner using and , respectively.

In the context of multilevel ILU, the accuracy and stability of the preconditioner is more complicated. From the algorithmic viewpoint, one of the most fundamental conditions is that the leading block must be well-conditioned [8, 78]. The Schur complement in general would be relatively ill-conditioned [9], but some preprocessing (such as equilibration [101, 35]) can be applied to the Schur complement to improve the stability of its factorization. In [23], Bollhöfer and Saad emphasized the accuracy of the Schur complement. They suggested to use a formulation due to Tismenetsky [100], which unfortunately often leads to excessive fills. A simpler alternative is to tighten the dropping thresholds for the Schur complement, as used in [21, 76, 90, 111].

In terms of efficiency of ILU, traditionally the focus has been on the “sparsity” of the and factors [30]. In this work, we consider a preconditioner to be efficient if it can be constructed in (nearly) linear time in the number of nonzeros in the input matrix , and can be computed in (nearly) linear time for . This definition requires not only the number of nonzeros in and , but also the computational time of the factorization, to be (approximately) proportional to the number of nonzeros in . For linear systems arising from finite differences or finite elements for PDEs, the number of nonzeros per row is typically bounded by a constant. For such systems, this criterion requires (near) linear time complexity in the number of unknowns. This efficiency objective is similar to that of Hackbusch’s hierarchical matrices [53] and also that of Axelsson and Vassilevsk’s AMLI for SPD systems [9]. Note that the efficiency requirement must be satisfied under the constraint of the stability and accuracy of . Although ILU0 and a simple ILUT may have linear complexity, they do not satisfy our accuracy and stability requirements.

2.2 Dropping strategies

For ILU, the dropping strategies are important for both robustness and efficiency. Most modern ILU techniques use some form of “dual thresholding” [84], which combines a dropping strategy based on numerical values along with another dropping strategy based on some combinatorial (or symbolic) properties, such as the level of the elimination tree or the number of fills in rows or columns. For the former, the dropping is controlled by a parameter known as drop tolerance (or in short, droptol or ). Different weighting schemes can be applied to numerical values. Let and denote the and factors of the ILU of the leading block of . In [19], Bollhöfer proposed to use and as weights for the entries in the th column of and the th row of , respectively. Li et al. adopted it in [66] and referred to it as dropping based on condition number estimation. In [23], Bollhöfer and Saad proposed a much more strict dropping strategy, which weights the values with a user-specified threshold, which corresponds to an upper bound of . In [74], Mayer proposed to weigh the values in the th column of with some norm of the th row of and vice versa. In this work, we introduce a scalability-oriented dropping as the combinatorial dropping to achieve near-linear time complexity, which we use together with inverse-based dropping.

We note two concepts that are closely related to dropping. The first is the Crout version of ILU (or ILUC) [66], also known as the left-looking ILU for symmetric matrices [39]. Unlike the typically -ordered Gaussian elimination, at the th step, the Crout ILU computes the th column of (i.e., ) by left looking, and analogously for the th row of (denoted by ). In this way, the numerical values in and are updated as late as possible, so it allows more accurate dropping compared to right-looking algorithms. In addition, and can be estimated incrementally for the inverse-based dropping in Crout ILU. The second concept is modified ILU (MILU) [87], introduced by Dupont et al. [38] and also by Gustafsson [52] for elliptic PDEs. MILU modifies the diagonal entries to compensate the discarded entries so that the sum of each row is preserved. This strategy improves the accuracy for certain applications [40], and it is adopted by SuperLU [69]. Our algorithm uses the Crout version of ILU. However, we do not use MILU, because it was not very effective for general linear systems in our testing.

2.3 Pivoting versus deferring

It is well known that small pivots can make Gaussian elimination unstable [49], and similar for ILU [30]. Analogous to and factorizations, small pivots in ILU can be mitigated by using some variant of partial pivoting for unsymmetric matrices (such as column pivoting in ILUP and ILUTP [83, 87], row pivoting in SuperLU [69], and “dual pivoting” in [73]) and by using Bunch-Kaufman pivots [27, 49] for symmetric indefinite matrices (such as in [65] and in [93]). The pivoting strategies may be dynamic in that they are determined at each step. However, unlike complete factorization, dynamic pivoting cannot guarantee the stability of incomplete factorization, even for nonsingular systems. To overcome this issue, Saad [88] advocated static pivoting, which permutes the matrix in a preprocessing step. Saad’s work was motivated by that of Olschowka and Neumaier [79], who showed that any structurally nonsingular matrix can be permuted and scaled to obtain an I-matrix, whose diagonal entries have magnitude 1 and whose off-diagonal entries have a magnitude less than or equal to 1. To this end, Saad proposed a permutation strategy called ddPQ to achieve some weak diagonal dominance of a leading block, which he then used as the leading block in multilevel ILU without further permutation. However, the weak diagonal dominance from ddPQ cannot guarantee good pivots. In our testing, ddPQ is quite sensitive to the input matrix.

Instead of pivoting, Bollhöfer and Saad [23] proposed to defer the rows and columns to the next level dynamically if they lead to too large norms of or . Such a strategy naturally leads to a dynamic construction of multilevel structure. Analogous to dynamic pivoting, we refer to these deferring strategies as dynamic deferring. An advantage of dynamic deferring versus dynamic pivoting is that the former allows using more efficient data structures, as will explain in Section 4. In addition, we introduce static deferring for (nearly) symmetric saddle-point problems that have many zeros in the diagonal, as we will describe in Section 3.

2.4 Reordering and equilibration

Reordering is an effective preprocessing technique for complete and incomplete factorizations. Some commonly used reordering methods include reverse Cuthill-Mckee (RCM) [72], approximate minimum degree (AMD) [5], nested dissection (ND) [47, 46, 60, 62], etc. Among these techniques, RCM aims to reduce the bandwidth, and it is widely used for (nearly) SPD systems [16, 51]. In contrast, AMD aims to reduce fills, and it is effective in improving the quality of droppings, especially for general unsymmetric matrices. Both ILUPACK and SuperLU use some variants of AMD as the default reordering strategy.

While reordering is concerned with sparsity patterns, equilibration focuses on numerical properties. A simple equilibration technique is to scale rows and columns [49, 101]. A more sophisticated technique, known as Hungarian scaling in [57], combines scaling with column permutation of , so that is an I-matrix [79]. Motivated by the work of Olschowka and Neumaier, Duff and Koster [35, 36] developed similar strategies for sparse matrices, which were implemented in the MC64 routine of the HSL library [61]. The strategy was adopted by several packages, including ILUPACK [24], SuperLU [69], PARDISO [94], MUMPS [6], MATLAB R2019a [99], etc. An I-matrix can reduce the condition number; in addition, an I-matrix has a weak diagonal dominance like ddPQ [88]. MC64 was designed for unsymmetric matrices. A sophisticated equilibration for symmetric indefinite systems was developed by Duff and Pralet [37]. A simple symmetrization strategy was described used in the routine HSL_MC64 in the HSL library, which applies a post-processing step after calling MC64, by setting and , so that preserves symmetry. We adopt the latter approach for (nearly) symmetric matrices.

3 Improving Robustness for Saddle-Point Problems

To achieve robustness for saddle-point problems, we adapt some of the essential components reviewed in the previous section to take advantage of near or partial symmetry in the underlying problem. Figure 1 gives a schematic of the factorization procedure in HILUCSI. In the following, we focus on three of its key components.

Figure 1: Flowchart of the multilevel ILDU factorization in HILUCSI.

3.1 Mixed symmetric and unsymmetric preprocessing

Like ILUPACK, HILUCSI leverages some preprocessing techniques, including reordering and equilibration, at each level of multilevel ILU. A novelty of HILUCSI is that it mixes the preprocessing techniques for symmetric and unsymmetric matrices in a hierarchical fashion. In particular, for nearly and partially symmetric matrices, we apply symmetric equilibration (similar to that in HSL_MC64 [61] as described in Section 2.4) and RCM reordering in the first level, and we apply unsymmetric equilibration and AMD reordering on at lower levels, where nzp denotes the nonzero pattern. In addition, after equilibration, we perform a post-processing to defer zero (or tiny) diagonal entries to the next level, as we will describe in Section 3.2; in this case, we apply symmetric equilibration and AMD reordering on the second level.

The mixed preprocessing requires some justification. First, we observe that equilibration is particularly important in multilevel ILU. Besides stabilizing the factorization of the Schur complement, after scaling, the norm of the leading block (i.e., ) is well bounded. Note that

Hence, bounding the norms of , , , and allows controlling . For SPD systems, it can be shown that it is necessary to bound the condition number of the leading block [78]

. As a heuristic, it is desirable to bound

on each level in a multilevel ILU for unsymmetric matrices.

Second, we observe that the symmetric equilibration is often more effective than unsymmetric equilibration for nearly symmetric matrices arising from systems of PDEs, especially when used in conjunction with RCM. This behavior is probably because those matrices may have some block diagonal dominance as defined in

[45], which may be destroyed by unsymmetric equilibration. Symmetric equilibration with RCM reordering permutes the dominant block diagonal within a narrower band, so that it may preserve block diagonal dominance more effectively. In addition, for SPD matrices, RCM tends to lead to smaller off-diagonal blocks, which also tends to improve the quality of the preconditioner. In our experiments, the combination of symmetric equilibration with RCM reordering at the top level significantly improves both robustness and efficiency for nearly symmetric matrices. However, if the sparsity pattern is unsymmetric or highly irregular, unsymmetric equilibration and AMD reordering tend to be more robust; see Section 5.3 for some comparisons.

Note that unlike MC64, symmetric equilibration does not produce I-matrices. Indeed, any zeros in the diagonal of the input matrix would remain in the diagonal after symmetric equilibration. The zero or tiny diagonal entries require some special attention, which we address using static deferring.

3.2 Static and dynamic deferring

HILUCSI differs from the other ILU techniques in that it avoids small pivots and ill-conditioned triangular factors using a combination of dynamic and static deferring. Our dynamic deferring is similar to that in [23]. In particular, we symmetrically permute a row and column to the lower-right corner if the diagonal is smaller than a threshold () or the estimated norms and exceed some threshold ( and ). Even though this dynamic deferring is effectively in resolving zero or tiny pivots in most cases, our experiments show that for saddle-point problems, it is advantageous to defer the zero and tiny diagonal entries directly to the next level during pre-processing. This is because zero or tiny pivots tend to lead to larger and , even when the tiny pivots are at the lower-right corner in the leading block. It is worth noting that this static deferring strategy eliminates the need of pivoting or deferring of blocks, which significantly simplifies the implementation. Note that in an extreme case, a matrix may have all zero diagonals. This does not introduce complications because HILUCSI would automatically switch to unsymmetric equilibration in the next level.

3.3 Hierarchical dual thresholding

HILUCSI uses a combination of numerical and combinatorial dropping strategies. For numerical dropping, we use the inverse-based thresholding similar to those in [19] and [23]. More specifically, we use and as weights for dropping in the th column in and and th row in in the current level, respectively, where is a use-specified upper bound of on each level. For combinatorial dropping, we introduce a scalability-oriented dropping. The basic idea is to limit the number of nonzeros (nnz) in the th column of , namely , by a factor of the input matrix, i.e.,


where denotes the average number of nonzeros in the columns of and is introduced to avoid excessive dropping for small columns in a highly nonuniform matrix. We limit the nnz in the rows of in a symmetric fashion. In the multilevel setting, we limit the rows and columns in all the levels based on the original input matrix, instead of based on the Schur complement from the previous level. Furthermore, before computing the Schur complement, we further apply dropping to limit the nonzeros in each column in based on the right-hand side of (6), and similarly for the rows in . This strategy is important for the scalability analysis. In Section 4, we will show that with this strategy, HILUCSI achieves near-linear time and space complexity for its factorization and triangular solves.

In HILUCSI, the numerical dropping is controlled by and , and the combinatorial dropping is controlled by . In practice, we found that the combination , , and is robust for a wide range of systems, while the combination , , and is robust and efficient for many saddle-point problems from PDEs. In addition, we observe that the accuracy of the factorization of the second level is often the most critical because it corresponds to factorizing the largest Schur complement. For this reason, we refine the thresholds for level 2 by reducing by a factor of 10, reducing by up to a factor of 2 (while restricting ), and doubling . For lower levels, however, we revert back for efficiency but use the refined and for stability and accuracy. ILUPACK [23] and ILU++[75] also adapt the thresholds based on levels, but their adaptation strategies differ from ours.

4 Achieving efficiency and scalability

In this section, we focus on the efficiency issues of HILUCSI for large-scale linear systems from PDEs.

4.1 Linear time complexity

For linear systems arising from PDEs, the input matrix typically has a constant number of nonzeros per row and per column. For these problems, excluding the preprocessing step, it can be shown that the total cost of HILUCSI on each level is linear in . We outline the argument for the first level. Given a matrix , let , , , and be the output of the factorization of the current level. First, omitting dynamic deferring and thresholding, it is easy to show, using an amortized analysis, that the number of floating point operations in Crout ILU is bounded by


where and denote the th row and th column of , respectively. Second, given an efficient data structure, the floating-point operations in dynamic deferring is proportional to Crout ILU. Furthermore, in the scalability-oriented dropping, we use quick select, which has an expected linear time complexity, to find the largest nonzeros, followed by quick sort after dropping. Hence, the asymptotic complexity of dropping is lower than that of Crout update. Assuming there is a constant number of nonzeros in each row and column,

and . Hence, Crout ILU with deferring takes linear time. Third, by ensuring the nonzeros in each row of and in each column of are bounded by a constant (see Section 3.3), the Schur component can be computed in linear time. This analysis applies to the other sparse levels. HILUCSI uses a dense factorization in the last level if its size is , of which the cost is also .

Note that the overall time complexity of HILUCSI may be slightly superlinear because the worst-case complexities of some preprocessing components (including MC64 and AMD [54, 36]) are superlinear. However, MC64 has an expected linear-time complexity under our assumptions [36], and the cost of AMD is typically negligible. In addition, the number of levels may not be a constant in the worst case, but it is typically small. Hence, by design HILUCSI is expected to deliver near-linear time complexity. In Section 5, we will show that HILUCSI indeed scales better than both ILUPACK and SuperLU, and its near-linear complexity makes its performance competitive with highly optimized direct solvers for large systems.

4.2 Implementation details

To realize the near-linear time complexity, we need an efficient data structure for sparse matrices. In particular, the data structure must support efficient sequential access of the th row of along with all the rows in , and similarly for and , as required by the th step of the Crout ILU. In addition, it must support efficient static and dynamic deferring. The standard storage formats, such as CSR (Compressed Sparse Row), CSC (Compressed Sparse Column), or AIJ, are insufficient. We use a bi-index data structure, similar to that proposed by Jones and Plassmann [59] for the row-version of incomplete Cholesky factorization and that used in Crout version of ILU without pivoting for unsymmetric matrices by Li, Saad, and Chow in [30, 66].

For simplicity, let us first consider the data structure in [66] for without deferring. Because is constructed row by row, we store it using the CSR format, in which the column indices for each row are sorted in ascending order. We then augment it using two size- arrays: Ufirst and Ulist, where the former stores the index of the first nonzero in each row of in CSR, and the latter maintains the linked lists of the entries in Ufirst by columns. At the th step, all the nonzero entries in the th column of must be in Ufirst. Hence, we can access a complete linked list of the entries in the th columns in through Ulist. The same holds for , except that we use CSC as the base for storing .

The data structure in [66] does not support deferring or pivoting. We extend it to support deferring. For simplicity, we describe the procedure for as follows. At the th step, suppose there have been deferrals and the current column in is going to be the th deferral. We move the current column in to the th position. By induction, there will be a gap of for the column indices in . Hence, before processing the th column, we move column to the th position in , which eliminates the gap in a “just-in-time” fashion for . For the indices in , we eliminate the gap at the end of the Crout ILU for the current level. The processing of is symmetric to that of . To implement this efficiently, we enlarge some size- arrays (in particular, the permutation vectors, Llist, and Ulist) to size . This memory overhead is negligible, and it enables us to preserve the amortized constant time per nonzero. In contrast, for Crout ILU with dynamic pivoting, such as those in [65] and [76], the data structure needs to support efficient sequential access to all the rows and columns in , and similarly for . That requirement would double the storage and also incur significantly more data movement. Note that ILUPACK [21] also extended the data structures in [66] to support deferring, but we were unable to find its implementation details for comparison.

5 Numerical results

We have implemented HILUCSI using the C++-11 standard. In this section, we assess the robustness and efficiency of our implementation for some challenging benchmark problems and compare its performance against some state-of-the-art packages. In particular, we chose ILUPACK v2.4 [21, 24] as the representative of multilevel ILU, partially because HILUCSI is based on the same Crout-version of multilevel ILU as in ILUPACK, and more importantly, ILUPACK has been optimized for both unsymmetric and symmetric matrices. In comparison with other packages, our tests showed that ILUPACK outperformed ARMS in ITSOL v2 [90] by up to an order of magnitude for larger unsymmetric systems, and the gap is even larger for symmetric systems; ILUPACK is also significantly more robust than ILU++ [75, 76]. In addition, we chose the supernodal ILUTP in SuperLU v5.2.1 [67, 69] as a representative of single-level ILU, because of its efficient supernodal implementation. In all of our tests, we used right-preconditioning for restarted GMRES(30), which limits the dimension of the Krylov subspace to 30. We used for the convergence criteria of GMRES and limited the number of iterations to . For HILUCSI and ILUPACK, we used our own implementation of flexible GMRES; for SuperLU, we used the GMRES implementation in the latest PETSc v3.11.3. We also compare HILUCSI with the multithreaded direct solvers in MUMPS 5.2.0 [6], MKL PARDISO v2018/3.222, and PARDISO 6.0 [81, 94] for larger-scale saddle-point systems.

We conducted our tests on a single node of a cluster running CentOS 7.4 with two 2.5 GHz 12-core Intel Xeon E5-2680v3 processors and 64 GB of RAM. We compiled HILUCSI, SuperLU and PETSc all using GCC 4.8.5 with the optimization option -O3, and we used the binary release of ILUPACK v2.4 for GNU64. We accessed ILUPACK through its MATLAB mex functions, of which the overhead is negligible. For accurate timing, both turbo and power saving modes were turned off for the processors.

5.1 Baseline as “black-box” preconditioners

As a baseline study, we assess HILUCSI for some benchmark problems in the literature. We collected more than 60 larger-scale benchmark problems that were highlighted in some recent ILU publications, which were mostly from the SuiteSparse Matrix Collections [34] and the Matrix Market [18]. For linear systems that are ill-conditioned, we only consider those with a meaningful right-hand side. We present results on some of the most challenging benchmark problems that were highlighted in [23], [69], and [112], together with two larger unsymmetric systems for Navier-Stokes (N-S) equations. Table 5.1 summarizes these unsymmetric matrices, including their application areas, types, sizes, and estimated condition numbers. Among the problems omitted here, HILUCSI failed only for the system invextr1_new in [112], which has a large null-space dimension of 2,910 and has also caused all the methods tested in [112] to fail.

In addition, we generated two sets of symmetric indefinite systems using FEniCS v2017.1.0 [4] by discretizing the 3D Stokes equation and the mixed formulation of the Poisson equation. These equations have a wide range of applications in CFD, solid mechanics, heat transfer, etc. We discretized the former using Taylor–Hood elements [97], and discretized the latter using a mixture of linear Brezzi-Douglas-Marini (BDM) elements [25] and degree-0 discontinuous Galerkin elements [32]. These problems are challenging in that the matrices have some nonuniform block structures, and they have many zeros in the diagonals. To facilitate scalability study, for each set, we generated three linear systems using meshes of different resolutions. Note that the matrices generated by FEniCS do not enforce symmetry exactly and contain some nearly zero values due to rounding errors. We manually filtered out the small values that are close to machine precision and then enforced symmetry using . For this baseline comparison, we used droptol for all the codes, as in [69], and used the recommended defaults for the other parameters for most problems. For ILUPACK, we used MC64 matching, AMD ordering, and condest (i.e., ) 5. For SuperLU, when using its default options, we could solve four problems. We doubled its fill factor from 10 to 20, which allowed SuperLU to solve another five problems. For HILUCSI, we used and for all the cases.

In Table 5.1, we report the overall runtimes (including both factorization and solve times) for each code, numbers of GMRES iterations, nnz ratios, etc. The fastest runtime for each case is highlighted in bold. HILUCSI had a 95% success rate for these problems with the default parameters, and it was the fastest for 65% of the cases. For twotone, which is not a PDE-based problem, we could not solve it unless we enlarge to . ILUPACK solved 80% of cases and it was the fastest for 10% of the cases. Among the failed cases, ILUPACK ran out of the 64GB of main memory for RM07R. For the symmetric problems, ILUPACK automatically detects symmetric matrices and then applies ILDL factorization with mixed and pivots automatically. This optimization in ILUPACK benefited its timing for those problems but hurt its robustness for the two larger systems from Stokes equations, which we could solve only by explicitly forcing ILUPACK to use unsymmetric ILU. In addition, ILUPACK was unable to solve PR02R, regardless of how we tuned condest. SuperLU was the least robust among the three: it solved only 45% of cases111In [69], supernodal ILUTP had a higher success rate with GMRES(50) and unlimited fill factor. We used GMRES(30) (the default in PETSc [10]) and a fill factor 10 (the default in SuperLU) or 20. and it was the fastest for 25% of cases. Note that for the largest system solved by all the codes, namely atmosmodl, HILUCSI outperformed ILUPACK and SuperLU by a factor of 6 and 9, respectively. On the other hand, for a medium-sized problem, namely e40r5000, SuperLU outperformed HILUCSI and ILUPACK by a factor of 7.5 and 15, respectively. This result shows that supernodal ILUTP excels in cache performance, but its ILUTP is fragile compared to multilevel ILU in ILUPACK and HILUCSI. Overall, HILUCSI delivered the best robustness and efficiency for these cases.

Baseline comparison of HILUCSI, ILUPACK, and SuperLU, denoted as H, I, and S, respectively, using robust parameters (droptol for all and for HILUCSI). , and indicate failed factorization and stagnation of GMRES(30), respectively. HILUCSI failed for twotone with , but it took 7.71 seconds with . For SuperLU, ‘*’ indicates doubling fill-factor. The leaders are in bold. Matrix Appl. n nnz Cond. factor. time (sec.) total time (sec.) GMRES iter. nnz ratio H I S H I S H I S H I S nearly symmetric, indefinite systems from PDEs venkat01 2D Euler 62,424 1,717,792 2.5e6 6.70 8.29 5.49 6.81 8.40 5.91 3 3 3 6.3 5.8 8.7 rma10 3D CFD 46,835 2,374,001 4.4e10 10.5 31.6 4.46 10.6 31.7 4.73 2 2 2 5.8 8.7 6.0 mixtank_new 29,957 1,995,041 2.2e11 39.7 128 40.2 128 9 3 15 41 Goodwin_071 3D 56,021 1,797,934 1.4e7 20.2 15.4 6.45 20.8 15.5 16.3 13 3 81 15 8.2 9.2 Goodwin_095 Navier- 100,037 3,226,066 3.4e7 38.2 42.3 19.8 39.9 42.6 21.1* 18 3 4 16 9.2 12 Goodwin_127 Stokes 178,437 5,778,545 8.2e7 70.9 95.1 62.3 75.1 95.1 65.0* 24 3 4 16 12 15 RM07R (N-S) 381,689 37,464,962 2.2e11 3.1e3 3.3e3 75 44 PR02R 2D N-S 161,070 8,185,136 1.1e21 256 261 14 28 e40r5000 17,281 553,956 2.5e10 18.0 36.8 2.26 18.7 36.8 2.4* 22 2 3 40 37 11 xeono2 materials 157,464 3,866,688 1.4e5 43.7 198 44.8 198 9 3 14 22 atmosmodl Helmholtz 1,489,752 10,319,760 1.5e3 78.2 496 608 85.1 502 718 16 6 125 12 27 8.9 other unsymmetric systems onetone1 circuit 36,057 335,552 9.4e6 0.39 0.92 0.44 1.12 11 5 3.3 3.0 twotone simulation 120,750 1,224,224 4.5e9 18.3 18.5 2 12 bbmat 2D airfoil 38,744 1,771,722 5.4e8 31.4 36.4 57.2 31.9 36.5 59.0* 9 3 8 17 13 18 symmetric, saddle-point problems from PDEs S3D1 3D Stokes 18,037 434,673 1.2e7 1.63 6.40 7.68 1.63 6.56 9.90* 2 1 41 11 53 19 S3D2 121,164 3,821,793 8.9e7 80.3 80.7 4 10 S3D3 853,376 31,067,368 6.3e8 777 781 4 13 M3D1 3D 29,404 522,024 1.7e5 6.57 8.28 6.66 8.31 5 3 15 19 M3D2 mixed 211,948 4,109,496 2.3e6 65.5 125 67.1 127 9 6 16 28 M3D3 Poisson 1,565,908 31,826,184 3.8e7 570 1.2e3 599 1.3e3 21 12 17 31

5.2 Optimized parameters for saddle-point problems

The default parameters in the baseline comparison are robust for general problems. However, they may be inefficient for saddle-point problems from PDEs. We now compare the software for such problems, including some nearly symmetric indefinite systems and purely symmetric saddle-point problems.

5.2.1 Nearly symmetric, indefinite systems

For nearly symmetric matrices, we use six PDE-based problems in Table 5.1, which are from different types of equations in CFD, including 2D Euler, 3D Navier-Stokes equations, and Helmholtz equations. Table 1 shows the comparison of HILUCSI, ILUPACK, and SuperLU for these systems in terms of the factorization times, total times, GMRES iterations, and nonzero rations. We highlighted the fastest runtimes in bold. For a fair comparison, we used for all the codes, used for both HILUCSI and ILUPACK, and used for HILUCSI. It can be seen that HILUCSI was the fastest for all the cases in terms of both factorization and total times. Compared to ILUPACK, the lower factorization cost of HILUCSI was due to a combination of smaller fill factors, fewer levels, and lower time complexity (c.f. Figure 2). However, HILUCSI required more GMRES iterations than ILUPACK, while SuperLU required significantly more iterations for the largest systems. In addition, we note that HILUCSI could solve all the problems with , which would improve the factorization time but increase the number of GMRES iterations for some systems.

Matrix factor. time (sec.) total time (sec.) GMRES iters. nnz ratio #levels
venkat 1.11 3.50 2.64 1.25 3.62 2.94 7 5 7 3.0 2.8 2.4 3 3
rma10 2.31 11.3 2.93 2.47 11.4 3.43 9 4 9 2.0 3.8 2.8 5 6
atmos 10.5 33.8 686 19.8 41.0 748 33 22 75 2.9 4.0 6.2 3 3
G_071 4.01 5.40 4.12 4.99 5.63 7.95 41 12 52 4.8 3.5 5.0 5 7
G_095 7.36 11.8 9.74 10.7 12.3 21.9 78 14 84 4.8 3.9 5.7 6 7
G_127 13.3 32.1 22.5 17.7 33.3 146 56 16 449 4.8 5.0 6.1 6 8
Table 1: Comparison of HILUCSI (denoted as H) versus ILUPACK (I) and SuperLU (S) for nearly pattern-symmetric, indefinite problems with optimized parameters (drop tolerance for all and for HILUCSI). Fastest times are in bold.

Figure 2 shows the relative speedups of HILUCSI and SuperLU versus ILUPACK in terms of factorization and solve times. It can be seen that HILUCSI outperformed ILUPACK for all six cases by a factor between 1.1 and 4.9. For the Goodwin problems, it is clear that the relative speedup increased as the problem sizes increased, thanks to the near-linear time complexity of HILUCSI as discussed in Section 4.1. We note that ILUPACK has a parameter elbow for controlling the size of reserved memory, but the parameter made no difference in our testing. ILUPACK also has another parameter lfil for space-based dropping, of which the use is discouraged in its documentation. Our tests showed that using a small lfil in ILUPACK indeed decreased its robustness, while its time complexity still remained higher than HILUCSI.

(a) Relative speedup of factorization time.
(b) Relative speedup of total time.
Figure 2: Speedups of (a) factorization and (b) total times of HILUCSI and SuperLU versus ILUPACK for nearly symmetric, indefinite problems with optimized parameters. Higher is better. G stands for Goodwin in the horizontal axis.

In addition, we observe that although SuperLU outperformed ILUPACK in terms of factorization times for all the Goodwin problems, it underperformed in terms of the overall times for these problems, because its solve time was significantly larger due to too many GMRES iterations. This again shows the superior robustness of multilevel ILU in HILUCSI and ILUPACK versus the single-level ILUTP in SuperLU.

5.2.2 Symmetric saddle-point problems

We now assess the robustness and efficiency of HILUCSI as the problem sizes increase. To this end, we use the symmetric saddle-point problems and compare HILUCSI with two different solvers in ILUPACK for symmetric and unsymmetric matrices, respectively. Because supernodal ILUTP failed for most of these problems, we do not include it in this comparison. For these saddle-point problems, because there are static deferring, our algorithm enabled symmetric matching in HILUCSI on the first two levels, and we applied RCM ordering for the first level and applied AMD ordering for all the other levels. For ILUPACK, we used AMD ordering, as recommended by ILUPACK’s documentation.

Table 2 shows the comparison of HILUCSI with ILUPACK in terms of the numbers of GMRES iterations, the nonzero ratios, and the numbers of levels, along with the runtimes of HILUCSI. Figure 3 shows the relative speedups of HILUCSI and symmetric ILUPACK relative to the unsymmetric ILUPACK. It can be seen that HILUCSI outperformed the unsymmetric ILUPACK by a factor of four to ten for these problems. The improvement was mostly due to the improved dropping in HILUCSI; in addition, HILUCSI also had fewer levels than unsymmetric ILUPACK. In contrast, the symmetric ILUPACK failed for the two larger systems for the Stokes equations due to encountering a structurally singular system during preprocessing. For the two larger cases for the mixed formulation of the Poisson equation, symmetric ILUPACK was notably less robust and required many more GMRES iterations. This results with ILUPACK justifies our observation in the Introduction that it is sometimes more robust to treat symmetric indefinite systems as unsymmetric. Among the four solved problems, symmetric ILUPACK improved the runtimes of unsymmetric ILUPACK by a factor of 1.2 to 2.6, due to performing computations only on the lower triangular part and different dropping strategies. Finally, note that the improvement from symmetric ILUPACK cannot be achieved if we did not explicit symmetrize the matrices or force ILUPACK to use the symmetric solver. In contrast, since HILUCSI treats the pattern symmetric matrices as nearly symmetric, it delivers more consistent performance even if the matrix is slightly unsymmetric due to rounding or truncation errors. Note that the timing results in Table 2 for HILUCSI did not take advantage of numerical symmetry. Using a symmetric kernel in the first two levels would further improve its performance by 10–20%. Hence, the penalty of losing numerical symmetry is relatively small when using HILUCSI. In addition, we observe that in Figure 3, the relative speedup of HILUCSI versus ILUPACK improves significantly as the size of the problem grows, similar to what we observed for unsymmetric systems.

Matrix HILUCSI GMRES iters. nnz ratio #levels
factor. total H IU IS H IU IS H IU IS
S3D1 0.44 0.45 4 3 7 2.0 4.6 6.4 3 6 5
S3D2 5.56 5.83 7 3 2.5 6.3 3 6
S3D3 61.7 64.1 7 4 2.7 8.4 4 9
M3D1 0.69 0.78 14 6 11 2.7 7.9 5.8 4 8 5
M3D2 6.25 7.75 26 11 29 2.6 9.5 7.3 5 10 5
M3D3 52.9 76.8 53 24 62 2.6 10 7.2 6 15 5
Table 2: Comparison of HILUCSI (denoted as H) with unsymmetric and symmetric ILUPACK (denoted by IU and IS, respectively) for symmetric saddle-point systems with droptol for all and for HILUCSI. indicates failed factorization.
(a) Relative speedup of factorization time.
(b) Relative speedup of total time.
Figure 3: Speedups of (a) factorization and (b) total times of HILUCSI and symmetric factorization in ILUPACK versus unsymmetric ILUPACK for symmetric saddle-point problems. Higher is better.

5.3 Benefits of mixed processing

To assess the effectiveness of mixing symmetric and unsymmetric preprocessing for HILUCSI as we described in Section 3.1, we applied symmetric preprocessing on zero, one, and two levels. Table 3 shows a comparison of the factorization times, total times, GMRES iterations, and nnz ratios for three different classes of problems. It can be seen that for matrices with fully unsymmetric structures, using symmetric preprocessing did not improve robustness and decreased efficiency. However, for many unsymmetric matrices with nearly symmetric structures, using symmetric preprocessing on the first level significantly improved robustness and efficiency. Furthermore, when static deferring is invoked in (nearly) symmetric saddle-point problems, using two levels of symmetric preprocessing further reduced the factorizations times, but the total runtime remained about the same.

Matrix factor. time total time GMRES iters. nnz ratio
H0 H1 H2 H0 H1 H2 H0 H1 H2 H0 H1 H2
general unsymmetric systems
bbmat 31.4 45.5 55.2 31.9 46.3 55.9 9 11 9 17 25 32
nearly symmetric systems
rma10 4.85 2.31 2.53 5.02 2.47 2.69 67 9 9 3.4 2.0 2.3
PR02R 256 293 261 300 14 15 28 32
symmetric, saddle-point problems
M3D3 53.6 52.9 77.3 76.8 52 53 2.6 2.6
M3D2 8.06 6.37 6.25 16.1 7.69 7.75 120 23 26 4.0 2.6 2.6
Table 3: Effect of mixing symmetric and unsymmetric processing in HILUCSI. H0, H1 and H2 denote using zero, one, and two levels of symmetric preprocessing.

5.4 Comparison with multithreaded solvers

Our numerical results above showed that HILUCSI improved the robustness and efficiency compared to the state-of-the-art ILU preconditioners for saddle-point problems from PDEs. In recent years, significant research has been devoted to developing parallel solvers, such as PETSc [10], hypre [44], SuperLU_MT/SuperLU_Dist [67], pARMS [70], MUMPS [6], PARDISO [81, 94], fine-grained ILU [29], etc. Since most computers nowadays have multiple cores, it is therefore a natural question to ask how well HILUCSI performs for such systems compared to state-of-the-art parallel solvers when using all the cores on a desktop or server. Hence, we compare our serial HILUCSI against three state-of-the-art multithreaded solvers, including MUMPS 5.2.0, MKL PARDISO v2018/3.222, and PARDISO 6.0, using 24 cores. Although these are all branded as direct solvers, MKL PARDISO and PARDISO trade stability for performance by pivoting within supernodes only and replacing tiny pivots by a small epsilon, similar to that in [68]. In addition, PARDISO 6 has an implementation of multilevel ILU for symmetric indefinite systems [93], but its overall performance was worse than the direct solver in PARDISO 6, so we only report the comparison with the direct solvers.

First, we compare the robustness. MUMPS solved all the problems in Table 5.1. In contrast, both versions of PARDISO failed for RM07R and PR02R even with iterative refinements enabled, resulting in a success rate of 90%, which is better than ILUPACK but worse than HILUCSI. The failures were due to encountering too many tiny or zero pivots within supernodes, which “rendered the factorization well-defined but essentially useless” [94] for these linear systems.

Next, we assess the performance for larger systems. We focus on the four largest systems in Table 5.1. Figure 4 shows the relative speed of HILUCSI (including its timing for RM07R in Table 5.1 and its timing for the others in Tables 1 and 2) versus those three packages. For the three larger problems, namely atmosmodl, S3D3, and M3D3, which had approximately one million unknowns, HILUCSI outperformed MUMPS and both versions of PARDISO on a single core. In addition, the serial HILUCSI is competitive with the parallel performance of MUMPS and PARDISO on 24 cores. However, for RM07R, which is relatively small, MUMPS outperformed HILUCSI significantly both in serial and in parallel. Overall, HILUCSI is very competitive for large-scale systems from PDEs with more than one million unknowns when using optimized parameters.

(a) Relative performance on one core.
(b) Relative performance on 24 cores.
Figure 4: Relative performance HILUCSI versus MUMPS, MKL PARDISO, and PARDISO 6 for larger systems. Higher is better for HILUCSI.

6 Conclusions and Future Work

In this paper, we described a multilevel incomplete LU factorization technique, called HILUCSI, which is designed for saddle-point problems from PDEs. The key novelty of HILUCSI is that it takes advantage of the near or partial symmetry of the underlying problems and the inherent block structures of multilevel ILU. More specifically, it applies symmetric equilibration and reordering techniques at the top level for nearly or partially symmetric systems, it statically defers tiny or zero pivots in saddle-point systems to the next level, and it applies unsymmetric equilibration and reordering techniques at coarser levels even for symmetric saddle-point problems. These techniques improved robustness of multilevel LU for such problems and simplified the treatment of indefinite systems. In addition, HILUCSI introduces a scalability-oriented dropping, which significantly improved the efficiency for large-scale problems. We demonstrated the robustness of HILUCSI as a right-preconditioner of restarted GMRES for symmetric and unsymmetric saddle-point problems from mixed Poisson, Stokes, and Navier-Stokes equations. Our results showed that HILUCSI outperforms ILUPACK by a factor of four to ten for medium to large problems. Because HILUCSI scales better as the number of unknowns increases, its performance advantage would become even wider for larger problems.

In its current form, HILUCSI has several limitations. First, if the memory is very limited, there may be too many scalability-oriented droppings and the preconditioner may lose robustness. We plan to optimize HILUCSI further for limited-memory situations. Second, for vector-valued PDEs, the matrices may exhibit block structures. It may be worthwhile to explore such block structures to improve cache performance, similar to that in [69] and [51]. Finally, HILUCSI is presently sequential. Although its serial performance is competitive with the parallel performance of MUMPS on 24 cores for large systems, it is desirable to speed up HILUCSI via parallelization.


Results were obtained using the Seawulf and LI-RED computer systems at the Institute for Advanced Computational Science of Stony Brook University, which were partially funded by the Empire State Development grant NYS #28451. We thank Dr. Matthias Bollhöfer for sharing ILUPACK with us, and we thank the anonymous reviewers for their many valuable comments and suggestions on the earlier versions of this paper, which have helped improve this work.


  • [1] M. F. Adams. Algebraic multigrid methods for constrained linear systems with applications to contact problems in solid mechanics. Numer. Linear Algebra Appl., 11(2-3):141–153, 2004.
  • [2] J. H. Adler, T. R. Benson, and S. P. MacLachlan. Preconditioning a mass-conserving discontinuous galerkin discretization of the stokes equations. Numer. Linear Algebra Appl., 24(3):e2047, 2017.
  • [3] B. Aksoylu and H. Klie. A family of physics-based preconditioners for solving elliptic equations on highly heterogeneous media. Appl. Numer. Math., 59(6):1159–1186, 2009.
  • [4] M. Alnæs, J. Blechta, J. Hake, A. Johansson, B. Kehlet, A. Logg, C. Richardson, J. Ring, M. E. Rognes, and G. N. Wells. The FEniCS project version 1.5. Arc. Num. Softw., 3(100):9–23, 2015.
  • [5] P. R. Amestoy, T. A. Davis, and I. S. Duff. An approximate minimum degree ordering algorithm. SIAM J. Matrix Anal. Appl., 17(4):886–905, 1996.
  • [6] P. R. Amestoy, I. S. Duff, J.-Y. L’Excellent, and J. Koster. MUMPS: a general purpose distributed memory sparse solver. In International Workshop on Applied Parallel Computing, pages 121–130. Springer, 2000.
  • [7] H. Anzt, E. Chow, and J. Dongarra. ParILUT—a new parallel threshold ILU factorization. SIAM J. Sci. Comput., 40(4):C503–C519, 2018.
  • [8] O. Axelsson. Iterative Solution Methods. Cambridge Press, 1994.
  • [9] O. Axelsson and P. S. Vassilevski. Algebraic multilevel preconditioning methods, II. SIAM J. Numer. Ana., 27(6):1569–1590, 1990.
  • [10] S. Balay, S. Abhyankar, M. F. Adams, J. Brown, P. Brune, K. Buschelman, L. Dalcin, V. Eijkhout, W. D. Gropp, D. Kaushik, M. G. Knepley, L. C. McInnes, K. Rupp, B. F. Smith, S. Zampini, and H. Zhang. PETSc Users Manual. Technical Report ANL-95/11 - Revision 3.7, Argonne National Laboratory, 2016.
  • [11] R. E. Bank and R. K. Smith. The incomplete factorization multigraph algorithm. SIAM J. Sci. Comput., 20(4):1349–1364, 1999.
  • [12] R. E. Bank and C. Wagner. Multilevel ILU decomposition. Numer. Math., 82(4):543–576, 1999.
  • [13] R. E. Bank and J. Xu. The hierarchical basis multigrid method and incomplete LU decomposition. Contemporary Mathematics, 180:163–173, 1994.
  • [14] M. Benzi. Preconditioning techniques for large linear systems: a survey. J. Comput. Phys., 182(2):418–477, 2002.
  • [15] M. Benzi, G. H. Golub, and J. Liesen. Numerical solution of saddle point problems. Acta Numerica, 14:1–137, 4 2005.
  • [16] M. Benzi, D. B. Szyld, and A. Van Duin. Orderings for incomplete factorization preconditioning of nonsymmetric problems. SIAM J. Sci. Comput., 20(5):1652–1670, 1999.
  • [17] P. Bochev, J. Cheung, M. Perego, and M. Gunzburger. Optimally accurate higher-order finite element methods on polytopial approximations of domains with smooth boundaries. arXiv preprint arXiv:1710.05628, 2017.
  • [18] R. F. Boisvert, R. Pozo, K. Remington, R. F. Barrett, and J. J. Dongarra. Matrix Market: a web resource for test matrix collections. In Qual. Numer. Softw., pages 125–137. Springer, 1997.
  • [19] M. Bollhöfer. A robust ILU with pivoting based on monitoring the growth of the inverse factors. Linear Algebra Appl., 338(1-3):201–218, 2001.
  • [20] M. Bollhöfer. A robust and efficient ILU that incorporates the growth of the inverse triangular factors. SIAM J. Sci. Comput., 25(1):86–103, 2003.
  • [21] M. Bollhöfer, J. I. Aliaga, A. F. Martın, and E. S. Quintana-Ortí. ILUPACK. In Encyclopedia of Parallel Computing. Springer, 2011.
  • [22] M. Bollhöfer and Y. Saad. On the relations between ILUs and factored approximate inverses. SIAM J. Matrix Anal. Appl., 24(1):219–237, 2002.
  • [23] M. Bollhöfer and Y. Saad. Multilevel preconditioners constructed from inverse-based ILUs. SIAM J. Sci. Comput., 27(5):1627–1650, 2006.
  • [24] M. Bollhöfer and Y. Saad. ILUPACK preconditioning software package. Available online at http://ilupack.tu-bs.de/. Release V2.4, June., 2011.
  • [25] F. Brezzi, J. Douglas, and L. D. Marini. Two families of mixed finite elements for second order elliptic problems. Numer. Math., 47(2):217–235, 1985.
  • [26] W. L. Briggs, V. E. Henson, and S. F. McCormick. A Multigrid Tutorial. SIAM, second edition edition, 2000.
  • [27] J. R. Bunch and L. Kaufman. Some stable methods for calculating inertia and solving symmetric linear systems. Math. Comput., 31(137):163–179, 1977.
  • [28] T. F. Chan and H. A. Van Der Vorst. Approximate and incomplete factorizations. In Parallel Numerical Algorithms, pages 167–202. Springer, 1997.
  • [29] E. Chow and A. Patel. Fine-grained parallel incomplete LU factorization. SIAM J. Sci. Comput., 37(2):C169–C193, 2015.
  • [30] E. Chow and Y. Saad. Experimental study of ILU preconditioners for indefinite matrices. J. Comput. Appl. Math., 86(2):387–414, 1997.
  • [31] P. G. Ciarlet. The Finite Element Method for Elliptic Problems. SIAM, 2002.
  • [32] B. Cockburn, G. E. Karniadakis, and C.-W. Shu. The Development of Discontinuous Galerkin Methods. Springer, Berlin Heidelberg, 2000.
  • [33] P. Concus and G. H. Golub. A generalized conjugate gradient method for nonsymmetric systems of linear equations. In Computing Methods in Applied Sciences and Engineering, pages 56–65. Springer, 1976.
  • [34] T. A. Davis and Y. Hu. The University of Florida sparse matrix collection. ACM Trans. Math. Softw., 38(1):1, 2011.
  • [35] I. S. Duff and J. Koster. The design and use of algorithms for permuting large entries to the diagonal of sparse matrices. SIAM J. Matrix Anal. Appl., 20(4):889–901, 1999.
  • [36] I. S. Duff and J. Koster. On algorithms for permuting large entries to the diagonal of a sparse matrix. SIAM J. Matrix Anal. Appl., 22(4):973–996, 2001.
  • [37] I. S. Duff and S. Pralet. Strategies for scaling and pivoting for sparse symmetric indefinite problems. SIAM J. Matrix Ana. Appl., 27(2):313–340, 2005.
  • [38] T. Dupont, R. P. Kendall, and H. Rachford, Jr. An approximate factorization procedure for solving self-adjoint elliptic difference equations. SIAM J. Num. Anal., 5(3):559–573, 1968.
  • [39] S. C. Eisenstat, M. H. Schultz, and A. H. Sherman. Algorithms and data structures for sparse symmetric Gaussian elimination. SIAM J. Sci. Stat. Comp., 2(2):225–237, 1981.
  • [40] H. C. Elman. A stability analysis of incomplete LU factorizations. Math. Comput., pages 191–217, 1986.
  • [41] H. C. Elman, D. J. Silvester, and A. J. Wathen. Finite elements and fast iterative solvers: with applications in incompressible fluid dynamics. Oxford University Press, 2014.
  • [42] A. Ern and J.-L. Guermond. Theory and Practice of Finite Elements, volume 159. Springer, 2013.
  • [43] O. G. Ernst and M. J. Gander. Why it is difficult to solve Helmholtz problems with classical iterative methods. In Numerical Analysis of Multiscale Problems, pages 325–363. Springer, 2012.
  • [44] R. D. Falgout and U. M. Yang. hypre: A library of high performance preconditioners. In Int. Conf. Comp. Sci., pages 632–641. Springer, 2002.
  • [45] D. G. Feingold, R. S. Varga, et al. Block diagonally dominant matrices and generalizations of the Gerschgorin circle theorem. Pac. J. Math., 12(4):1241–1250, 1962.
  • [46] A. George. Nested dissection of a regular finite element mesh. SIAM J. Numer. Ana., 10(2):345–363, 1973.
  • [47] A. George and J. W. Liu. Computer solution of large sparse positive definite systems. Prentice-Hall, Englwood Cliffs. NJ, 1981.
  • [48] A. Ghai, C. Lu, and X. Jiao. A comparison of preconditioned Krylov subspace methods for large-scale nonsymmetric linear systems. Numer. Linear Algebra Appl., page e2215, 2017.
  • [49] G. H. Golub and C. F. Van Loan. Matrix Computations. Johns Hopkins, 4th edition, 2013.
  • [50] C. Greif, S. He, and P. Liu. SYM-ILDL: Incomplete

    factorization of symmetric indefinite and skew-symmetric matrices.

    ACM Trans. Math. Softw., 44(1):1–21, 2017.
  • [51] A. Gupta and T. George. Adaptive techniques for improving the performance of incomplete factorization preconditioning. SIAM J. Sci. Comput., 32(1):84–110, 2010.
  • [52] I. Gustafsson. A class of first order factorization methods. IT Numer. Math., 18(2):142–156, 1978.
  • [53] W. Hackbusch. Hierarchical Matrices: Algorithms and Analysis, volume 49. Springer, 2015.
  • [54] P. Heggernes, S. Eisenstat, G. Kumfert, and A. Pothen. The computational complexity of the minimum degree algorithm. In Proceedings of 14th Norwegian Computer Science Conference, pages 98–109, 2001.
  • [55] V. E. Henson and U. M. Yang. BoomerAMG: a parallel algebraic multigrid solver and preconditioner. Applied Numerical Mathematics, 41(1):155–177, 2002.
  • [56] N. J. Higham. A survey of condition number estimation for triangular matrices. SIAM Review, 29:575–596, 1987.
  • [57] J. Hook, J. Pestana, F. Tisseur, and J. Hogg. Max-balanced Hungarian scalings. SIAM J. Matrix Ana. Appl., 40(1):320–346, 2019.
  • [58] H. Johansen and P. Colella. A Cartesian grid embedded boundary method for Poisson’s equation on irregular domains. J. Comput. Phys., 147(1):60–85, 1998.
  • [59] M. T. Jones and P. E. Plassmann. An improved incomplete Cholesky factorization. ACM Trans. Math. Softw., 21(1):5–17, 1995.
  • [60] G. Karypis and V. Kumar. A fast and high quality multilevel scheme for partitioning irregular graphs. SIAM J. Sci. Comput., 20(1):359–392, 1998.
  • [61] S. R. A. Laboratory. HSL. a collection of Fortran codes for large scale scientific computation. http://www.hsl.rl.ac.uk/, retrieved on July 6th, 2019.
  • [62] D. LaSalle and G. Karypis. Efficient nested dissection for multicore architectures. In European Conference on Parallel Processing, pages 467–478. Springer, 2015.
  • [63] R. J. LeVeque. Finite Volume Methods for Hyperbolic Problems, volume 31. Cambridge University Press, 2002.
  • [64] R. J. LeVeque. Finite Difference Methods for Ordinary and Partial Differential Equations: Steady State and Time Dependent Problems. SIAM, Philadelphia, 2007.
  • [65] N. Li and Y. Saad. Crout versions of ILU factorization with pivoting for sparse symmetric matrices. Electron. T. Numer. Ana., 20:75–85, 2005.
  • [66] N. Li, Y. Saad, and E. Chow. Crout versions of ILU for general sparse matrices. SIAM J. Sci. Comput., 25(2):716–728, 2003.
  • [67] X. S. Li. An overview of SuperLU: Algorithms, implementation, and user interface. ACM Trans. Math. Softw., 31(3):302–325, 2005.
  • [68] X. S. Li and J. Demmel. A scalable sparse direct solver using static pivoting. In Proceeding of the 9th SIAM conference on Parallel Processing for Scientific Computing, 1999.
  • [69] X. S. Li and M. Shao. A supernodal approach to incomplete LU factorization with partial pivoting. ACM Trans. Math. Softw., 37(4), 2011.
  • [70] Z. Li, Y. Saad, and M. Sosonkina. pARMS: a parallel version of the algebraic recursive multilevel solver. Numer. Linear Algebra Appl., 10(5-6):485–509, 2003.
  • [71] C.-J. Lin and J. J. Moré. Incomplete Cholesky factorizations with limited memory. SIAM J. Sci. Comput., 21(1):24–45, 1999.
  • [72] W.-H. Liu and A. H. Sherman. Comparative analysis of the Cuthill–McKee and the reverse Cuthill–McKee ordering algorithms for sparse matrices. SIAM J. Numer. Ana., 13(2):198–213, 1976.
  • [73] J. Mayer. ILUCP: a Crout ILU preconditioner with pivoting. Numer. Linear Algebra Appl., 12(9):941–955, 2005.
  • [74] J. Mayer. Alternative weighted dropping strategies for ILUTP. SIAM J. Sci. Comput., 27(4):1424–1437, 2006.
  • [75] J. Mayer. ILU++: A new software package for solving sparse linear systems with iterative methods. In PAMM: Proceedings in Applied Mathematics and Mechanics, volume 7, pages 2020123–2020124. Wiley Online Library, 2007.
  • [76] J. Mayer. A multilevel Crout ILU preconditioner with pivoting and row permutation. Numer. Linear Algebra Appl., 14(10):771–789, 2007.
  • [77] J. A. Meijerink and H. A. van der Vorst. An iterative solution method for linear systems of which the coefficient matrix is a symmetric M-matrix. Math. Comput., 31(137):148–162, 1977.
  • [78] Y. Notay. Algebraic multigrid and algebraic multilevel methods: a theoretical comparison. Numer. Linear Algebra Appl., 12(5-6):419–451, 2005.
  • [79] M. Olschowka and A. Neumaier. A new pivoting strategy for Gaussian elimination. Linear Algebra Appl., 240:131–151, 1996.
  • [80] C. S. Peskin. The immersed boundary method. Acta Numerica, 11:479–517, 2002.
  • [81] C. G. Petra, O. Schenk, M. Lubin, and K. Gärtner. An augmented incomplete factorization approach for computing the Schur complement in stochastic optimization. SIAM J. Sci. Comput., 36(2):C139–C162, 2014.
  • [82] J. W. Ruge and K. Stüben. Multigrid Methods, chapter Algebraic Multigrid, pages 73–130. Number 13. SIAM, 1987.
  • [83] Y. Saad. Preconditioning techniques for nonsymmetric and indefinite linear systems. J. Comput. Appl. Math., 24:89–105, 1988.
  • [84] Y. Saad. ILUT: A dual threshold incomplete LU factorization. Numer. Linear Algebra Appl., 1, 1994.
  • [85] Y. Saad. SPARSEKIT: a basic toolkit for sparse matrix computations. Technical report, University of Minnesota, 1994.
  • [86] Y. Saad. ILUM: a multi-elimination ILU preconditioner for general sparse matrices. SIAM J. Sci. Comput., 17(4):830–847, 1996.
  • [87] Y. Saad. Iterative Methods for Sparse Linear Systems. SIAM, 2nd edition, 2003.
  • [88] Y. Saad. Multilevel ILU with reorderings for diagonal dominance. SIAM J. Sci. Comput., 27(3):1032–1057, 2005.
  • [89] Y. Saad and M. Schultz. GMRES: A generalized minimal residual algorithm for solving nonsymmetric linear systems. SIAM J. Sci. Stat. Comput., 7(3):856–869, 1986.
  • [90] Y. Saad and B. Suchomel. ARMS: An algebraic recursive multilevel solver for general sparse linear systems. Numer. Linear Algebra Appl., 9(5):359–378, 2002.
  • [91] Y. Saad and H. A. Van Der Vorst. Iterative solution of linear systems in the 20th century. In Numerical Analysis: Historical Developments in the 20th Century, pages 175–207. Elsevier, 2001.
  • [92] Y. Saad and J. Zhang. BILUTM: a domain-based multilevel block ILUT preconditioner for general sparse matrices. SIAM J. Matrix Anal. Appl., 21(1):279–299, 1999.
  • [93] O. Schenk, M. Bollhöfer, and R. A. Römer. On large-scale diagonalization techniques for the anderson model of localization. SIAM Rev., 50(1):91–112, 2008.
  • [94] O. Schenk and K. Gärtner. Parallel sparse direct solver PARDISO – user guide version 6.0.0, 2018.
  • [95] H. D. Simon et al. Incomplete LU preconditioners for conjugate-gradient-type iterative methods. SPE Reservoir Engineering, 3(01):302–306, 1988.
  • [96] B. Smith, P. Bjorstad, and W. Gropp. Domain Decomposition: Parallel Multilevel Methods for Elliptic Partial Differential Equations. Cambridge University Press, 2004.
  • [97] C. Taylor and P. Hood. A numerical solution of the Navier-Stokes equations using the finite element technique. Computers & Fluids, 1(1):73–100, 1973.
  • [98] The HYPRE Team. hypre High-Performance Preconditioners User’s Manual, 2017. version 2.12.2.
  • [99] The MathWorks, Inc. MATLAB R2019a. Natick, MA, 2019.
  • [100] M. Tismenetsky. A new preconditioning technique for solving large sparse linear systems. Linear Algebra Appl., 154(331–353), 1991.
  • [101] A. van der Sluis. Condition numbers and equilibration of matrices. Numer. Math., 14:14–23, 1969.
  • [102] H. A. van der Vorst. Bi-CGSTAB: A fast and smoothly converging variant of Bi-CG for the solution of nonsymmetric linear systems. SIAM J. Sci. Stat. Comput., 13(2):631–644, 1992.
  • [103] H. A. Van der Vorst. Iterative Krylov Methods for Large Linear Systems, volume 13. Cambridge University Press, 2003.
  • [104] P. Vanek, M. Brezina, and J. Mandel. Convergence of algebraic multigrid based on smoothed aggregation. Numerische Mathematik, 88:559–579, 2001.
  • [105] P. Vaněk, J. Mandel, and M. Brezina. Algebraic multigrid by smoothed aggregation for second and fourth order elliptic problems. Computing, 56(3):179–196, 1996.
  • [106] R. S. Varga. Factorization and normalized iterative methods. Technical report, Westinghouse Electric Corp., Pittsburgh, 1959.
  • [107] P. S. Vassilevski. Multilevel Block Factorization Preconditioners: Matrix-Based Analysis and Algorithms for Solving Finite Element Equations. Springer, 2008.
  • [108] A. J. Wathen. Preconditioning. Acta Numerica, 24:329–376, 2015.
  • [109] O. Widlund. A Lanczos method for a class of nonsymmetric systems of linear equations. SIAM J. Numer. Ana., 15(4):801–812, 1978.
  • [110] G. Wittum. On the robustness of ILU smoothing. SIAM J. Sci. Stat. Comp., 10(4):699–717, 1989.
  • [111] J. Zhang. A multilevel dual reordering strategy for robust incomplete LU factorization of indefinite matrices. SIAM J. Matrix Anal. Appl., 22(3):925–947, 2001.
  • [112] Y. Zhu and A. H. Sameh. How to generate effective block jacobi preconditioners for solving large sparse linear systems. In Advances in Computational Fluid-Structure Interaction and Flow Simulation, pages 231–244. Springer, 2016.