1 Introduction
Many application problems lead to solving linear systems of the type

Ax = b,

where A is an n × n nonsingular real or complex system matrix and b is the associated right-hand side. In particular we are interested in the case where A is large-scale and sparse. The generic way of solving these systems nowadays consists of using state-of-the-art sparse direct solvers (cf., e.g., [3, 40, 11, 32]). Although high performance sparse direct solvers are very efficient in many cases, several structured problems, i.e., problems presenting specific, noticeable sparsity structures, cause the direct solver to produce a significant amount of fill-in during the factorization, leading to high memory requirements which can exceed the hardware capability. If problems of this kind are to be solved by a direct method nonetheless, one has to rely on out-of-core techniques. These techniques rely on memory locations external to the computer's working memory, i.e., disks, in order to overcome hardware limitations; see, e.g., [4]. The presence of high fill-in might lead to prohibitive execution time, suggesting the use of approximate factorization strategies in combination with Krylov subspace methods as a valid alternative approach. Among the most popular approximate factorization methods, we mention those based on incomplete factorization [39] and the more recently developed approaches in multilevel frameworks, such as [37, 7, 44]. For direct factorization methods, block structured algorithms, such as multifrontal methods or those based on supernodes, have demonstrated their superiority on modern hardware architectures, mainly due to the usage of dense linear algebra kernels such as level-3 BLAS for matrix-matrix operations or LAPACK for certain factorization templates. Part of the success of direct solvers is obtained from the symbolic analysis using the (column) elimination tree, which is able to predict dense blocks in advance and to set up the data structures appropriately.
For incomplete factorization methods this is usually not possible, except for very few approaches such as level-of-fill approaches [22]. For the symmetric positive definite case, in [34] a block incomplete Cholesky decomposition is computed which uses a supernodal structure, breaking up the supernodes into smaller blocks in order to allow refined dropping. A generic approach to block preconditioning methods is introduced in [9], where a C++ framework is provided offering block-oriented preconditioning methods for block structures defined by the user; one of these is a block tridiagonal ILU. A very efficient and successful incomplete Cholesky factorization method was presented in [21], where several aspects, such as blocking using the elimination tree or efficient implementation using dense matrix kernels, were put together to eventually end up in a very robust sparse block incomplete Cholesky factorization method. Furthermore, a supernodal block incomplete factorization approach was presented in [31].
In the present paper, our block ILU approach uses several known components and combines them, but it also introduces further strategies to construct efficient block structures with blocks of variable size for a block ILU factorization method. Furthermore, we improve the block partitioning during the factorization. It is the combination of several ingredients that eventually improves the block ILU method significantly over its scalar counterpart in many practical applications on modern computer architectures. Our approach thus generalizes the scalar ILU approach to a block approach; further prospective applications, such as using block ILUs within a multilevel framework, are subject to future research.
The paper is organized as follows. We briefly review established incomplete factorization methods (section 2) with special focus on the so-called Crout-type ILU, which is sometimes also referred to as left-looking ILU (at least with respect to L). We will demonstrate that this approach can easily be extended to a block ILU and focus on the major challenges when switching to a block method. Section 3 is devoted to providing the block structures required to make the block ILU approach efficient. It comprises techniques to improve diagonal dominance, to reduce fill-in, as well as a priori variable block partitioning and the aggregation of blocks during the factorization. Finally, we demonstrate in section 4 that the combination of these techniques results in a very efficient high performance incomplete factorization approach which can easily outperform the traditional ILU by orders of magnitude on modern computers using dense matrix kernels.
2 Incomplete Factorization Methods
The design of preconditioning methods based on incomplete factorization typically relies on efficiently computing approximate triangular factors without having too much symbolic information on hand. For level-of-fill ILUs one can certainly use information from the elimination tree [22]. In contrast to that, threshold-based ILUs are hardly able to use this kind of information. Instead, efficiency requires us either to compute significantly sparser factors which remain robust in spite of dropping or to heuristically introduce block structures to increase performance [8, 21]. The general incomplete factorization approaches are distinguished by how the portions of L and U are computed, e.g., row-wise (also referred to as the IKJ variant, known as ILUT [39]) as one example.
2.1 The Crout ILU
A particularly attractive incomplete factorization approach is the so-called Crout ILU [16, 25, 33, 39, 30], since it computes the columns of L and the rows of U simultaneously, using only the already computed parts of L and U. We highlight this version since we are going to establish our block ILU based on this variant. Algorithm 1 gives a rough sketch of this variant, omitting several technical details.
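Stripped of the sparse data structures and of dropping, the traversal order of the Crout scheme can be sketched in a few lines. The following Python sketch is purely illustrative (dense storage, exact factors, no dropping) and only mirrors the order in which Algorithm 1 visits the data:

```python
import numpy as np

def crout_ilu_dense(A):
    """Crout LU on a dense matrix: at step k, compute row k of U and
    column k of L using only the already finished columns of L and
    rows of U.  Dropping is omitted, so this returns the exact factors."""
    n = A.shape[0]
    L = np.eye(n)
    U = np.zeros((n, n))
    for k in range(n):
        # row k of U: subtract the updates from the finished rows of U
        U[k, k:] = A[k, k:] - L[k, :k] @ U[:k, k:]
        # column k of L: subtract the updates from the finished columns of L
        L[k+1:, k] = (A[k+1:, k] - L[k+1:, :k] @ U[:k, k]) / U[k, k]
    return L, U

A = np.array([[4., 1., 0.],
              [1., 4., 1.],
              [0., 1., 4.]])
L, U = crout_ilu_dense(A)
assert np.allclose(L @ U, A)
```

A sparse implementation with dropping replaces the dense slices by compressed columns/rows, which is exactly where the technical details mentioned above come in.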
An efficient realization of Algorithm 1 certainly does require us to deal with the updates of column k of L (resp., row k of U). This is usually realized using an auxiliary vector and two associated index arrays, whereas the final sparsified row/column is stored only in compressed format [39]. While this is more or less standard, the more difficult aspect in Algorithm 1 is to access L row-wise although it is stored in compressed sparse column format (similar arguments apply to U). A very elegant way to achieve this is to use additional auxiliary n-dimensional vectors L_head, L_list, L_first for L and U_head, U_list, U_first for U, which goes back to [16] and can also be found in [25]. A detailed description of how these additional vectors have to be used can be found in [25, 30]. The same kind of data structures can furthermore be used to access the initial matrix A by columns and by rows simultaneously, which is often used in sparse matrix-matrix multiplication codes. The only constraint to make this approach work is to ensure that the nonzero entries in each column of L are stored with increasing row indices (similar requirements are necessary for U and A). In total, if implemented efficiently, the Crout ILU is an extremely effective incomplete factorization approach: on one hand it computes the columns of L and rows of U simultaneously, and on the other hand it is extremely memory efficient, since only a constant number of additional auxiliary arrays of length n is required in addition to the factors L, U to be computed. In the next section we will describe how this approach can be easily turned into a block ILU.
2.2 Crout-type Block ILU
As a first step towards a block-structured ILU we would like to point out that Algorithm 1 can almost be implemented straightforwardly in the same way if the scalar entries are replaced by blocks. Formally, only minor changes, replacing the scalar operations by their block analogues, are necessary for the block version. In what follows we describe in more detail how our block-structured version of the Crout ILU is going to be realized. We assume that the initial matrix A is just a regular sparse matrix, usually without any specific block structure. We may assume that using some permutation strategy we end up with a sparse matrix where at least a block partitioning for the diagonal blocks is obtained. We will comment later in more detail on initial preprocessing steps of this kind. If the variable size of each diagonal block is given in advance, then this easily imposes a block structure for the block columns of L as well as the block rows of U. However, since we are going to drop entries of small size, we will maintain a scalar structure for the rows of L and the columns of U. In addition, we will store the dense diagonal blocks separately in a block diagonal matrix D. This gives a hybrid structure with blocks in one direction and a scalar representation in the other direction. The structure is illustrated in Figure 1.
The hybrid structure of L and U makes it easy to store the nonzeros of one block column of L in a single dense subdiagonal block and, similarly, the nonzero columns of a block row of U (see Figure 1 for a sketch). This way each block column of L consists of only one dense block and an index array referring to the nonzero row indices of L. This kind of block structure is quite analogous to the structures that are used in supernodal sparse direct factorization methods. Likewise, the update of a single block column of L can be computed using dense matrix kernels based on level-3 BLAS. To do so, one initially has to load the associated scalar sparse columns of A into a block column buffer; for each update we first need to gather the associated submatrices required for a level-3 BLAS update, perform the dense matrix-matrix multiplication (GEMM), and then scatter the submatrix back to the buffer. The same procedure is repeated for U. The update procedure is sketched in Figure 2.
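The gather/GEMM/scatter step can be sketched as follows. The function and variable names are assumptions made for illustration, not the paper's actual data layout:

```python
import numpy as np

# Schematic sketch of one update of a block column of L: the buffer
# holds the current block column densely over its nonzero rows; an
# earlier block column Lj contributes Lj @ W, where W is the small
# dense coupling block.  Only rows shared by both index sets are
# touched (gather -> dense GEMM -> scatter).

def block_update(buf, buf_rows, Lj, Lj_rows, W):
    """Compute buf -= Lj @ W restricted to the rows of Lj that also
    appear in buf_rows: gather those rows, apply one dense GEMM,
    scatter the result back into the buffer."""
    pos = {r: i for i, r in enumerate(buf_rows)}
    shared = [(i, pos[r]) for i, r in enumerate(Lj_rows) if r in pos]
    if not shared:
        return buf
    src = np.array([s for s, _ in shared])   # rows gathered from Lj
    dst = np.array([d for _, d in shared])   # rows scattered into buf
    buf[dst, :] -= Lj[src, :] @ W            # dense level-3 update
    return buf
```

In a compiled implementation the GEMM line is a single call to level-3 BLAS, which is where the performance of the block variant comes from.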
In total this leads to the basic algorithm of the block incomplete decomposition (BILU). Technically we compute A ≈ LDU, where L and U^T are unit block lower triangular and D is block diagonal. This requires factorizing and inverting the diagonal blocks using dense matrix kernels (LAPACK), but it simplifies dropping as well as the forward/backward substitution inside a Krylov subspace method. We finally note that, like the scalar Crout-type ILU, our approach does not incorporate pivoting except inside the diagonal blocks, where dense matrix kernels based on LAPACK are used. This is certainly a drawback; however, as we will demonstrate in the section on numerical results, using a combination of several approaches (in particular maximum weight matching and blocking strategies) we are still able to efficiently solve a large number of systems arising from practical application problems.
3 Setting Up and Improving the Block Structures
We will now discuss several strategies that are essential to make the BILU approach from the previous section efficient. We start with a well-established scaling and permutation strategy to improve the block diagonal dominance. Then we use an algorithm to detect block structures of the initial matrix in order to group the associated rows and columns together. Based on this block partitioning we reorder the system in order to reduce the fill-in. In order to detect further dense blocks that will potentially be generated by the BILU, we perform a simplified local ILU analysis to enlarge the block sizes of the initial block partitioning. Finally, during the computation of the BILU, we allow an even further increase in the block sizes whenever the additional amount of fill is moderate. The complete portfolio of blocking strategies is inspired by the philosophy that creating larger but fewer dense blocks is advantageous in combination with dense matrix kernels, such as level-3 BLAS and LAPACK, since these are known to better exploit the cache properties of the underlying hardware [28]; however, to avoid higher computational complexity, the maximum block size should be limited.
3.1 Maximum Weight Matching
Unless our given matrix is symmetric and positive definite, in the general (non)symmetric case we may encounter several (block) diagonal pivots of small magnitude (or even zero). A well-established technique that often bypasses this problem is the use of maximum weight matchings [35] as an alternative to pivoting. The original idea is to find a maximum weighted matching of the associated bipartite graph, where rows and columns of the matrix refer to the nodes and the matrix entries serve as edge weights [13]. The matching is obtained by computing a maximum product transversal, which is equivalent to maximizing the product of the absolute values of the diagonal entries. Finding a maximum product transversal is a well-known linear assignment problem in operations research and combinatorial optimization. Essentially, one has to take the negative logarithm of the entries and minimize the sum of the potential diagonal entries. For large sparse systems, as discussed here, an efficient algorithm was first presented in [14]. The problem is solved by a sparse variant of the Kuhn-Munkres algorithm. Combinatorial algorithms such as MC64 [14] are experimentally observed to be extremely fast, significantly faster than the incomplete factorization itself, though theoretical bounds for computing matchings are somewhat worse [14]. The algorithm returns a permutation as well as two dual vectors from which one has to take the exponential in order to get the desired diagonal scalings for the original matrix. For the present manuscript we will use the associated permutation and the related diagonal scalings as the default initial step, replacing A by

(1)    Â = Π D_r A D_c,

where D_r, D_c are real diagonal matrices and Π is a permutation matrix such that the entries of Â satisfy |â_ii| = 1 and |â_ij| ≤ 1. These scalings and this permutation were introduced in [35] for reducing pivoting in Gaussian elimination of full matrices. Their beneficial effect in combination with preconditioning methods has been established in [14, 5]. Furthermore, this kind of maximum weight matching is also widely used in sparse direct solvers (cf., e.g., [41]).
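For small dense examples, the maximum product transversal and the dual-variable scalings of (1) can be reproduced with generic assignment-problem tools. The following Python sketch is an LP-based illustration under stated assumptions, not MC64 itself (MC64 solves the same assignment problem with sparse shortest-path techniques), and the side on which the permutation is applied is a convention:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment, linprog

def mc64_like(A):
    """Dense illustration of the maximum product transversal and its
    dual-based scalings: minimise the sum of log(colmax_j) - log|a_ij|
    over permutations, recover the dual variables u, v by solving the
    dual LP, and build diagonal scalings so that the permuted, scaled
    matrix has unit diagonal and entries of modulus at most one.
    Assumes all entries of A are nonzero (for sparse matrices one
    restricts the costs to the nonzero pattern)."""
    n = A.shape[0]
    colmax = np.abs(A).max(axis=0)
    C = np.log(colmax) - np.log(np.abs(A))   # column-normalised costs >= 0
    _, perm = linear_sum_assignment(C)       # Kuhn-Munkres on the costs
    # duals: maximise sum(u) + sum(v) subject to u_i + v_j <= C[i, j]
    A_ub = np.zeros((n * n, 2 * n))
    for i in range(n):
        for j in range(n):
            A_ub[i * n + j, i] = 1.0
            A_ub[i * n + j, n + j] = 1.0
    res = linprog(-np.ones(2 * n), A_ub=A_ub, b_ub=C.ravel(),
                  bounds=[(None, None)] * (2 * n))
    u, v = res.x[:n], res.x[n:]
    Dr = np.diag(np.exp(u))                  # row scaling
    Dc = np.diag(np.exp(v) / colmax)         # column scaling
    P = np.zeros((n, n))
    P[perm, np.arange(n)] = 1.0              # move matched columns to diagonal
    return Dr, P, Dc
```

By complementary slackness the matched entries of Dr A Dc have modulus one, and all other entries have modulus at most one, which is exactly the property stated after (1).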
Example 1
Throughout this paper we will use the following matrix as a guided example to illustrate the components of our numerical method. The matrix venkat50 has been taken from the SuiteSparse Matrix Collection^{1}^{1}1https://sparse.tamu.edu/. Its size and average number of nonzero entries per row/column are moderate. The system is nonsingular and arises from the discretization of the 2-dimensional (2D) Euler equations and, from there, a solver over several time steps (this matrix refers to time step 50). The performance of sparse direct solvers on it is well documented in the collection. At this moment we use this matrix to illustrate the effect of using maximum weight matchings. To do so, we compute for all columns their maximum in absolute value, c_j = max_i |a_ij|, and similarly we proceed for the rows to obtain r_i. In Figure 3 we sketch the pattern of those entries of A that are large relative to these maxima. Analogously we sketch the pattern for Â, defined according to (1). As we can observe from Figure 3, the preprocessed matrix has significantly fewer large entries than the original matrix, and most of them are on the main diagonal or at least close to it.
Since the BILU, as well as its scalar counterpart as discussed in this paper, does not use further pivoting, the use of maximum weight matching is an essential generic preprocessing step for obtaining a successful incomplete factorization in many test cases, though there are certainly some practical problems where the use of maximum weight matchings is less beneficial.
3.2 CosineBased Preprocessing
The initial preprocessing step using maximum weight matching is hoped to simply improve the scalar diagonal dominance. We now propose a cosine-based strategy [38] to initialize a block structure of the matrix, which can possibly be improved further during the approximate factorization process; in our numerical experiments we will use BILU with and without the cosine-based blocking to illustrate the overall performance of the code. Given two rows i, j of a matrix A, their nonzero patterns can be represented by two row vectors p_i, p_j which have value 1 if and only if the associated entries of rows i, j are nonzero, and 0 otherwise. The major observation is that two rows have almost the same nonzero pattern if their formal scalar product satisfies (p_i p_j^T)^2 ≥ τ^2 (p_i p_i^T)(p_j p_j^T), i.e., the cosine of the angle between the patterns is at least τ. Since this computation is integer based, a simple counting strategy is sufficient, where τ is a prescribed threshold; we adopt the default value suggested in [38]. The algorithm uses the (sparse) pattern of the upper triangular part of the pattern product as long as the associated indices are not yet associated with some diagonal block.
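A minimal sketch of the (unmodified) cosine blocking, using a dense pattern product for clarity; the quadratic form of the test is our reading of the elided formula, and the greedy single pass is an illustrative simplification:

```python
import numpy as np
from scipy.sparse import csr_matrix

def cosine_blocks(A, tau=0.8):
    """Group rows whose nonzero patterns nearly coincide: rows i and j
    land in one diagonal block when G[i,j]^2 >= tau^2 * nnz_i * nnz_j,
    where G[i,j] counts the common pattern entries.  Dense G for
    clarity; the modified version of the paper would additionally skip
    rows whose nonzero count is statistically too large."""
    pattern = csr_matrix((np.ones_like(A.data), A.indices, A.indptr), A.shape)
    G = (pattern @ pattern.T).toarray()   # G[i,j] = |pattern_i ∩ pattern_j|
    nnz = np.diag(G)
    n = A.shape[0]
    owner = -np.ones(n, dtype=int)
    blocks = []
    for i in range(n):
        if owner[i] >= 0:
            continue
        blk = [i]
        owner[i] = len(blocks)
        for j in range(i + 1, n):
            if owner[j] < 0 and G[i, j] ** 2 >= tau ** 2 * nnz[i] * nnz[j]:
                blk.append(j)
                owner[j] = len(blocks)
        blocks.append(blk)
    return blocks
```

For example, two rows with identical patterns always satisfy the test, while rows with disjoint patterns never do.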
Overall, whenever we use the cosine-based strategy, we replace Â by

(2)    Ã = P^T Â P,

where Â is the matrix after the matching-based preprocessing and P is the permutation matrix generated by the cosine-based approach, grouping together columns and rows of Â. This results in Ã having an improved block pattern. We finally like to mention that, besides its benefits, the cosine strategy may become extremely inefficient in cases where the pattern product becomes relatively dense although Â is relatively sparse. This situation might occur, e.g., when some of the rows of Â, even though limited in number, are densely populated with nonzeros. For this reason we use a slightly modified version of Saad's cosine strategy that ignores rows/columns having too many nonzero entries. This is done by a simple statistical argument, computing the average number of nonzeros per row/column as well as the associated standard deviation. Rows (resp., columns) whose number of nonzeros exceeds the average by a multiple of the standard deviation are ignored for the cosine blocking strategy.
Example 2
We continue Example 1 and illustrate, for the matrix obtained after the maximum weight matching has been applied, how many blocks were detected by the cosine-based method. Here our results already refer to the modified version:

system size  # diag. blocks  max. size  avg. size  std. deviation

We can see that the majority of blocks detected by the cosine algorithm have a small block size.
3.3 Symmetric Reordering
After having identified potential initial blocks using the cosine-based strategy (or even when leaving it out), we will next reorder the system respecting the given block pattern. If the cosine strategy is not used, we simply use the scalar partitioning instead, i.e., the original matrix. Replacing the matrix by its companion matrix, which compresses each block into a scalar in a straightforward manner, we may reorder the associated companion matrix using standard symmetric reordering strategies such as approximate minimum degree [2] or nested dissection [26, 29] as implemented in the METIS package. Here, for simplicity, we restrict ourselves to the nested dissection ordering as implemented in METIS in order to reduce the fill-in further. After METIS is applied to the compressed companion matrix, we expand the associated permutation matrix to a block permutation matrix that preserves the block structure and thus obtain a reordering for the original matrix respecting the block partitioning. This gives a symmetrically reordered matrix

(3)    B = Π_B^T Ã Π_B,

where Ã denotes the matrix after the preceding preprocessing steps and Π_B is the expanded block permutation matrix. In summary, the approach consists of compressing the matrix to its companion matrix, reordering the companion matrix with nested dissection, and expanding the resulting permutation back to the block level.
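The compress-reorder-expand step can be sketched as follows. scipy's reverse Cuthill-McKee ordering stands in for the METIS nested dissection used in the paper, since any symmetric fill-reducing ordering of the companion matrix fits this role; function names are our own:

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import reverse_cuthill_mckee

def block_reorder(A, blocks):
    """Compress each block of rows/columns into a single node, reorder
    the companion matrix with a symmetric fill-reducing ordering, then
    expand the ordering back to a block permutation of A."""
    nb = len(blocks)
    owner = np.empty(A.shape[0], dtype=int)
    for b, blk in enumerate(blocks):
        owner[blk] = b
    # companion pattern: blocks b, c are connected iff some a_ij links them
    Ac = A.tocoo()
    comp = csr_matrix((np.ones(Ac.nnz), (owner[Ac.row], owner[Ac.col])),
                      shape=(nb, nb))
    comp = comp + comp.T                     # symmetrise the pattern
    p = reverse_cuthill_mckee(comp.tocsr(), symmetric_mode=True)
    # expand the block permutation back to scalar indices
    perm = np.concatenate([np.asarray(blocks[b]) for b in p])
    return A[perm][:, perm], perm
```

Because the permutation is expanded block by block, rows/columns that belong to the same diagonal block remain contiguous after the reordering.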
Example 3
Finally we illustrate in Figure 4 how the matrix from Examples 1 and 2 is reordered with nested dissection following an initial blocking strategy obtained by the cosine algorithm.
Given this preprocessed matrix we could now start with the Crout-type BILU, simply inheriting the block structure and its reordering. Apart from computing the incomplete factorization, we need to solve the associated linear system iteratively. Here we use, for simplicity, the restarted GMRES [36] method until the relative residual is sufficiently reduced. As right-hand side we use the vector of all ones. In our experimental environment, the code was implemented in C but using a C-MEX interface to MATLAB (R2015b). The same applies to the forward/backward solve. This MATLAB release uses Intel MKL 11.1.1 including BLAS and LAPACK 3.4.1. The results were obtained on a single node with 1 TB main memory and Intel Xeon E7-4880 v2 @ 2.5 GHz processors. For this specific example, we compare BILU as described so far with the MATLAB ilu function and its option crout (referred to hereafter as ILUC), which perfectly fits as the scalar counterpart of the block ILU. Both methods use maximum weight matching, the METIS reordering, and the same drop tolerance:
      time ILU [sec]  fill-in  time GMRES [sec]  # steps
BILU        1.9         4.3          3.0            29
ILUC       20.0         2.7          4.7            95
Evidently, the blocking strategy in combination with the BILU already outperforms its scalar counterpart by one order of magnitude with respect to the factorization time.
3.4 Guessing Initial Block Pattern via a Simplified ILU
Having improved the diagonal dominance and possibly having identified initial dense blocks and reordered the associated companion matrix using a fill-reducing method, we could now start factorizing the matrix (cf. Example 3). For reasons of efficiency it may pay off to take a closer look at the given matrix before starting the block incomplete factorization, in particular at the block partitioning. We underline that the dense blocks identified by the cosine-based analysis of the matrix pattern are unrelated to the ones defined by the elimination tree exploration performed by direct factorization methods. Here, within the context of an incomplete factorization, we cannot expect symbolic strategies to give a sufficiently accurate prediction of dense blocks, since dropping entries of small size during the factorization destroys most of the graph-based information. We propose instead to use a simplified incomplete factorization model that is drastically cheaper than the BILU itself but might serve as a simple first-order guess for the dense blocks that may show up during the factorization. There are several possibilities to compute block patterns for incomplete factorizations, e.g., exploiting masks like the level of fill or thresholds or both of them [43, 10, 23, 22, 42].
To be precise, let us briefly recall the level-of-fill approach (cf. [10, 23, 24, 39]). Initially, we define a level function via

lev(a_ij) = 0 if a_ij ≠ 0 or i = j, and lev(a_ij) = ∞ otherwise.

During the approximate factorization we are faced at step k with updates of the form

a_ij ← a_ij − a_ik a_kk^{−1} a_kj,

which modify the level function to become

lev(a_ij) ← min{ lev(a_ij), lev(a_ik) + lev(a_kj) + 1 }.

The level-of-fill ILU(ℓ) only allows these kinds of updates whenever the resulting level satisfies lev(a_ij) ≤ ℓ; otherwise the update is omitted. This limits the number of fill entries, and often enough small values of ℓ are used. For ℓ = 0, the level-of-fill ILU simply inherits the original pattern of A and disregards any fill outside the initial pattern. For ℓ = 1, additional nonzero entries can only be created by an update where a_ik and a_kj are nonzero in the initial matrix A. In short, fill-in is permitted from original entries only, but not from fill-in entries.
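The level-of-fill rules above translate into a short symbolic routine; the dense sketch below is illustrative only:

```python
import numpy as np

def ilu_levels(pattern, ell):
    """Symbolic level-of-fill ILU(ell): original nonzeros (and the
    diagonal) get level 0, everything else level infinity; an update
    at step k sets lev_ij = min(lev_ij, lev_ik + lev_kj + 1), and only
    entries with lev_ij <= ell are kept.  Returns the boolean pattern
    of the factors L + U (dense storage, for illustration only)."""
    n = pattern.shape[0]
    lev = np.where(pattern, 0.0, np.inf)
    lev[np.arange(n), np.arange(n)] = 0.0
    for k in range(n):
        for i in range(k + 1, n):
            if lev[i, k] <= ell:          # l_ik is kept, so it may update
                new = lev[i, k] + lev[k, k + 1:] + 1.0
                lev[i, k + 1:] = np.minimum(lev[i, k + 1:], new)
    lev[lev > ell] = np.inf               # drop everything above level ell
    return np.isfinite(lev)
```

With ell = 0 this reproduces the pattern of A; with ell = 1 only fill created by two original entries survives, matching the description above.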
Another ILU approach consists of dropping entries of small size (referred to as ILU(τ)), i.e., at step k of the ILU we discard entries l_ik of L (resp., u_kj of U) whenever they are small relative to the drop tolerance τ. This is applied regardless of whether the entries were originally nonzero or created as fill-in. Since we are using the Crout-based ILU (see Algorithm 1), at step k only column k of L and row k of U of the incomplete factorization are computed, which is why we relate their size to τ.
Certainly one could easily combine ILU(ℓ) and ILU(τ) to obtain some kind of ILU(ℓ,τ). To simulate ILU(1,τ) in particular, we apply the method only locally, estimating the fill pattern at step k. The idea is to simulate the behavior of the Crout ILU quickly and to obtain from this quick and simple simulation a good initial guess for the block pattern.

Suppose that we plan to estimate the fill pattern of column k of L and row k of U. The initial index sets consist of the associated pattern of column/row k of A. We do not plan to update the diagonal part as part of the incomplete factorization process and simply use the original diagonal entries as approximation; i.e., the initial pattern is reduced to those indices whose entries pass the drop tolerance. We call this the initial lower index set (resp., upper index set for the upper triangular part).

Next, in accordance with the philosophy of only allowing fill-in from the original nonzero entries, we seek all nonzero entries a_jk, j < k, of A. These can easily be obtained by looking at the nonzero pattern of column k of A and keeping, inside column k, only those indices j satisfying j < k. For all these indices j we need to check column j of A for indices i > k. Given a_ij ≠ 0, we can easily simulate the fill-in situation, adding i to the pattern only if the simulated fill entry, built from a_ij, a_jk, and a_jj of the original matrix, is large enough compared with the drop tolerance; thus this is only a local analysis. The column indices obtained by including this type of fill-in form the enlarged lower index set, and we proceed similarly for the upper index set. As we have now computed the estimated patterns of column k of L and row k of U, we can continue to compute similar patterns in subsequent steps, leading to a sequence of estimated patterns. For aggregating scalar columns/rows to build blocks, we simply need to build their union, measure the additional zero entries incurred when incorporating the next column/row, and remove the entries that refer to the diagonal block, which are considered part of a dense diagonal block. This way we can exactly compute the additional zero entries needed to fill up the blocks when adding a new column/row to the current block (certainly assuming that our local ILU analysis is accurate enough). Suppose that in this way we proceed from step s to s+1, and assume that the subdiagonal block of L consists of a certain number of nonzero rows and the superdiagonal block of U of a certain number of nonzero columns, the additional padded zeros being counted separately. Comparing the resulting number of stored entries in the block case with the number of nonzero entries in the scalar case quantifies the overhead incurred when going from step s to step s+1.
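The local simulation of one column pattern might look as follows. The exact form of the threshold test is elided in this copy of the text, so the version below, which compares |a_ij a_jk / a_jj| against the drop tolerance, is an assumption consistent with the surrounding description:

```python
import numpy as np

def simulate_column_pattern(A, k, tau):
    """Local ILU(1,tau) simulation of the pattern of column k of L
    (dense A for clarity).  Start from the original entries a_ik,
    i > k, that pass the drop tolerance; then add the fill positions i
    created by pairs of original entries a_ij, a_jk (j < k) whenever
    the simulated fill entry a_ij * a_jk / a_jj passes the tolerance.
    Original values of A substitute the unknown factor entries, so
    this is only a first-order local guess."""
    n = A.shape[0]
    pat = {i for i in range(k + 1, n) if abs(A[i, k]) >= tau}
    for j in range(k):
        if A[j, k] != 0:                      # original entry above the diagonal
            for i in range(k + 1, n):
                if A[i, j] != 0 and abs(A[i, j] * A[j, k] / A[j, j]) >= tau:
                    pat.add(i)                # simulated fill, large enough
    return sorted(pat)
```

Running this for consecutive k and comparing the unions of the resulting index sets gives exactly the padded-zero counts used for the blocking decision.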
In order to avoid an inflation of additional wasted zero entries, we incorporate the (s+1)st column/row only as long as the resulting storage overhead remains bounded. This allows the overhead of wasted zeros to grow by a modest fraction with respect to the scalar situation, or alternatively to have, say, a few additional rows in L and additional columns in U in the block partitioning (or 4 additional rows in L but no additional column in U, etc.).

In our test cases in section 4 we will illustrate how the algorithms perform with and without the ILU(1,τ) strategy. Besides, we will also demonstrate the behavior for a specific example as follows.
Example 4
We continue Examples 1–3, and in a first step we compare the two blocking strategies when applied separately and together (with the METIS reordering in between).
                   # blocks  max. size  avg. size  std. deviation
only cosine
only ILU(1,τ)
cosine + ILU(1,τ)
We can see that the major blocking was already obtained from the initial cosine strategy, while the ILU(1,τ) strategy has added some larger blocks. It looks as if the combination of both yields the best blocking. Looking at the performance of the associated BILU variants (using the abbreviation “c” for the pure cosine strategy, “i” for only using ILU(1,τ), and “ci” for both), we observe that the combination of both blocking strategies is, for this example, at least comparable with the initial blocking strategy:
          time ILU [sec]  fill-in  time GMRES [sec]  # steps
BILU(c)        1.9          4.3          3.0            29
BILU(i)        3.9          4.0          4.0            31
BILU(ci)       2.6          4.5          2.5            26
3.5 Progressive Aggregation
So far we have simply worked with variable block structures that were predefined in advance, using the cosine-based method, the ILU(1,τ) strategy, or both of them. In order to improve the blocking further, we will merge blocks during the factorization whenever two consecutive block columns/rows of L and U follow each other and the additional memory overhead is acceptable. Although this will increase the fill-in, and although this aggregation is not completely free, the expectation is that having fewer but larger blocks pays off in combination with the use of level-3 BLAS. We note that the computation of a block column of L (resp., block row of U) in a Crout-type BILU (see section 2.2) requires one to compute this block column at some step based on several preceding block columns of L. If their number decreases while their size increases as we compute the ILU, we may expect that the level-3 BLAS computation leads to an acceleration, as long as the maximum block size is limited to prevent the computational complexity from dominating the process.
Suppose that, after s steps of progressive aggregation, we have computed from our approximate decomposition the leading s block columns of L and block rows of U, as well as the leading inverse block diagonal part of D. The number of block columns in L (resp., block rows in U) has been predefined up to step s, whereas for steps s+1 and later we can still change the block sizes easily, since this part has not yet been computed. Merging block columns/rows s and s+1 requires us to rewrite the associated matrices with a common, merged block structure.
The aggregated inverse diagonal block of D is adapted accordingly, leading to a larger dense inverse diagonal block. Aggregating two consecutive block columns/rows typically increases the fill-in in L and U, since the aggregated blocks of L need to have a common nonzero row pattern and those of U must have a common nonzero column pattern. We allow the blocks to be aggregated progressively whenever the memory increase is mild. Suppose that block column s consists of a certain number of scalar columns and block column s+1 likewise; their subdiagonal blocks in L and superdiagonal blocks in U then occupy certain numbers of nonzero rows and columns, which determine the number of stored entries before the aggregation. Without explicitly computing the merged blocks, we can easily compute the associated number of nonzero rows of the merged L block and nonzero columns of the merged U block by simply forming the union of the nonzero index sets, ignoring any cancellations. This gives the number of nonzero entries after the aggregation, where the diagonal block is always stored in dense format. We let the algorithm aggregate blocks s and s+1 into one enlarged block whenever the relative or the absolute storage increase stays below a prescribed bound. Certainly one could vary these bounds, and we do not claim that they are “best” in some sense. The philosophy is to allow 20% additional fill-in, or at least two extra rows/columns (e.g., one in L and one in U). After checking some examples, this has turned out to be an acceptable compromise between fill-in and the size of the blocks.
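The acceptance test can be sketched as a small predicate. The exact formulas are elided in this copy of the text, so the constants (20% relative growth, two extra rows/columns) are taken from the surrounding discussion and the disjunctive form is an assumption:

```python
def accept_aggregation(nnz_separate, nnz_merged, width,
                       rho=1.2, extra_rows=2):
    """Decide whether two consecutive block columns/rows may be merged.
    nnz_separate: stored entries before merging; nnz_merged: entries of
    the merged blocks including padded zeros (unions of the index
    sets); width: number of scalar columns of the merged block, so one
    extra row costs `width` entries.  Accept when the storage grows by
    at most 20% (rho = 1.2) or by at most the equivalent of two extra
    rows/columns (e.g., one in L and one in U)."""
    return (nnz_merged <= rho * nnz_separate
            or nnz_merged - nnz_separate <= extra_rows * width)
```

For instance, growing from 100 to 115 stored entries is accepted by the relative test, while a small block growing from 10 to 14 entries at width 2 is accepted by the absolute test even though it grows by 40%.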
We note that the aggregation is checked at every step, allowing the block sizes to increase progressively. Because of this, it may happen that in all steps the current block is aggregated with its predecessor, so that eventually only one aggregated block remains. Theoretically the fill-in could be increased drastically, but we did not observe this in our practical experiments. This may be related to the fact that a fill-reducing ordering (in our case nested dissection) is applied prior to the BILU computation. Finally, we note that the data structures of the Crout-type BILU from section 2.2 can be adapted easily. Technically, the easiest implementation has turned out to be to define the first of the two merged blocks simply as void (block size 0) and to let the aggregated block take the position of the second. This way, the auxiliary vectors L_head, L_list, L_first for L and U_head, U_list, U_first for U from section 2.1 need not be changed at all, and the void block quickly drops out step by step (since it is no longer needed for updates).
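The linked-list mechanism from section 2.1 that makes this void-block trick cheap can be sketched as follows. This is an illustrative Python rendering, not the paper's implementation: `head`, `list_`, `first` play the roles of L_head, L_list, L_first, columns must be stored with increasing row indices, and rows must be consumed in increasing order:

```python
import numpy as np
from scipy.sparse import csc_matrix

def init_row_access(A):
    """For a CSC matrix with sorted column entries, set up three arrays
    giving row-wise access without a CSR copy: head[i] is the first
    column whose current front entry sits in row i, list_[j] chains the
    columns of that row, and first[j] is the position of column j's
    front entry."""
    n, m = A.shape
    head = -np.ones(n, dtype=int)
    list_ = -np.ones(m, dtype=int)
    first = np.array(A.indptr[:-1])          # front entry of each column
    for j in range(m):
        if first[j] < A.indptr[j + 1]:
            i = A.indices[first[j]]          # first row index in column j
            list_[j] = head[i]
            head[i] = j
    return head, list_, first

def pop_row(A, i, head, list_, first):
    """Return the (column, value) pairs of row i and advance the chains,
    re-linking each visited column under the next row it touches."""
    out, j = [], head[i]
    head[i] = -1
    while j != -1:
        nxt = list_[j]
        out.append((j, A.data[first[j]]))
        first[j] += 1                        # advance the front of column j
        if first[j] < A.indptr[j + 1]:
            r = A.indices[first[j]]
            list_[j] = head[r]
            head[r] = j
        j = nxt
    return out
```

A void (size-zero) block simply never appears in any chain, which is why the aggregation needs no changes to these vectors.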
Example 5
We finish Examples 1–4 by examining the additional benefits of the progressive aggregation. In analogy to Example 4 we sketch the compression rate of each single blocking strategy and their combination:
                                  # blocks  max. size  avg. size  std. deviation
cosine
progr. aggr.
cosine + ILU(1,τ) + progr. aggr.
As already observed earlier, the best blocking performance results from the combination of all three methods. Finally we compare the BILU method when using only one of the three blocking strategies with the version that incorporates all strategies:
           time ILU [sec]  fill-in  time GMRES [sec]  # steps
BILU(c)         1.9          4.3          3.0            29
BILU(i)         3.9          4.0          4.0            31
BILU(p)         3.7          3.5          7.4            52
BILU(cip)       1.9          4.9          2.3            26
For this example the overall performance is best when the three methods are used together, at the price of a slightly higher fill-in.
We conclude this section by noting that using progressive aggregation without an initial blocking strategy can become quite costly, since the strategy may merge two consecutive block columns/rows several times in succession, growing a scalar column/row into a block of size of a few hundred. It is clear that this can hardly be efficient in general, and this is why having some initial guess for the block partitioning prior to the progressive aggregation is useful.
3.6 Perturbing the Entries of the Diagonal Blocks
In the symmetric positive definite case one may use a block version of the strategy by [1] in order to guarantee that the block incomplete factorization does not break down due to the presence of singular or ill-conditioned diagonal blocks. In the general case, on the other hand, there exists no analogous strategy. Even in the symmetric positive definite case it was already observed in [27] that shifting the diagonal entry is sufficient when there are not too many undesired pivots. Since our BILU approach does not use pivoting except inside the diagonal blocks when employing LAPACK-based dense matrix kernels, it may occasionally happen that diagonal blocks become singular or ill-conditioned in spite of having the system preprocessed using maximum weight matching. To bypass this bottleneck (at least partially), we perturb the diagonal blocks as follows: Let  be the maximum entry of the matrix (after scaling) in absolute value and let  and  be some fixed absolute and relative tolerance (in practice we use  and ). Suppose that column  of a diagonal block consists of the entries , and denote their maximum in absolute value by . If this maximum is too small, or if it turns out during the factorization that the block diagonal system is singular or ill-conditioned, then we perturb the largest entry in absolute value of the column. We give preference to the diagonal entry whenever it is within the relative tolerance of the column maximum. After that we proceed analogously with respect to the rows of the diagonal block. By giving preference to the diagonal entries we preserve the original concept of maximum weight matching. Moreover, this tie-breaking strategy might render the system nonsingular or better conditioned (e.g., consider a matrix with entries of the same order of magnitude that is rank-deficient).
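As a rough sketch of the column-wise perturbation described above, with hypothetical names and tolerance values standing in for the (elided) quantities of the text:

```python
import numpy as np

def perturb_block(D, a_max, abs_tol=1e-12, rel_tol=1e-2):
    """Perturb columns of a diagonal block D whose largest entry is tiny
    relative to the global matrix maximum a_max, preferring the diagonal
    entry as tie-break; rows would be treated analogously afterwards."""
    D = D.copy()
    for j in range(D.shape[0]):
        col_max = np.max(np.abs(D[:, j]))
        if col_max <= abs_tol * a_max:           # (near-)zero column detected
            i = int(np.argmax(np.abs(D[:, j])))  # largest entry of column j
            if np.abs(D[j, j]) >= rel_tol * col_max:
                i = j                            # prefer the diagonal position
            shift = rel_tol * a_max
            D[i, j] += shift if D[i, j] >= 0 else -shift
    return D
```

Note that only the (near-)singular columns are touched; as the text stresses, a small number of such perturbations amounts to a low-rank modification of the factorization.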
Perturbing the diagonal blocks, in general, has to be applied with care and may easily introduce severe numerical problems, but as long as the number of perturbations is relatively small, the perturbation changes the factorization only by a low-rank modification, and the latter can usually be handled safely by Krylov subspace methods. As an alternative to perturbing some diagonal blocks when necessary, one could restart the BILU applied to a shifted system, which has been observed to be quite helpful [6]. However, in our comparisons we did not observe that BILU behaved better when using shifts.
3.7 Summarizing the Components of the Algorithm
After having explained the components that are combined to build up the BILU, we briefly summarize the main ingredients:

Initially we apply maximum weight matching in order to improve the diagonal dominance, i.e., (see section 3.1).

Apply the cosine-based blocking approach to the matched matrix as described in section 3.2. This way we obtain a permuted matrix.

Next reorder the compressed graph. Here the compressed graph refers to the matrix in which any diagonal block obtained from the cosine-based blocking strategy is replaced by a scalar whenever there is at least one nonzero entry inside this block. We use nested dissection [26, 29] for reordering, and we expand the permutation afterwards in order to preserve the block structure. From this reordering we obtain the next permuted matrix.

Given the reordered matrix, we simulate the behavior of our BILU using the simplified method from section 3.4. This simulation does not change the matrix anymore, but it provides an initial block structure prior to starting the BILU computation.

Based on this matrix and its block structure, compute the Crout-type BILU according to the chosen drop tolerance.

While computing the BILU, attempt to aggregate blocks progressively in order to build larger blocks on the fly.
Summing up all components, we eventually end up with an approximate factorization which is used as a preconditioner for Krylov subspace methods. Here the diagonal scaling matrices stem from (1), the permutation matrices are collected from (1), (2), and (3), and the core factorization is the BILU. It should be clear that, depending on the application, one may certainly skip some of these steps. For example, maximum weight matching is, in general, very beneficial as part of a black-box approach (see, e.g., [5]); however, for some specific applications one might want to avoid it because of its nonsymmetric permutation, which is not always helpful. Similarly, nested dissection is chosen just as one fill-reducing ordering; other orderings such as approximate minimum degree (AMD) [2] could have been used as well. Also, e.g., the cosine-based approach may not always pay off if the pattern of the original matrix does not carry enough inherent block structure. We have included this preprocessing in the experiments for two reasons: first, to keep the approach close to a black-box approach, even though the cosine-based blocking might fail to provide improvement for unstructured problems; second, to assess the contribution of each component as part of the complete block factorization approach.
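The six steps above can be sketched schematically as a pipeline. The stage functions below are placeholders for MC64, the cosine blocking, METIS, the ILU simulation, and the Crout BILU; none of the real implementations are reproduced here, and all names are illustrative.

```python
def build_bilu_preconditioner(A, matching, cosine_blocking,
                              nested_dissection, simulate_blocks,
                              crout_bilu, drop_tol):
    """Compose the preprocessing and factorization stages of the BILU
    pipeline; each stage is supplied as a callable."""
    A1, scalings, perm1 = matching(A)           # step 1: maximum weight matching
    A2, perm2 = cosine_blocking(A1)             # step 2: cosine-based blocking
    A3, perm3 = nested_dissection(A2)           # step 3: reorder compressed graph
    blocks = simulate_blocks(A3)                # step 4: cheap BILU simulation
    L, D, U = crout_bilu(A3, blocks, drop_tol)  # steps 5-6: Crout BILU with
                                                # progressive aggregation inside
    return L, D, U, scalings, (perm1, perm2, perm3)
```

The returned scalings and permutations are exactly the data needed to wrap the core BILU into a preconditioner for the original system, as summarized in the text.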
4 Numerical Experiments
For the numerical experiments we select 100 (cf. Appendix 6) large-scale nonsymmetric real nonsingular sparse matrices from the SuiteSparse Matrix Collection (see Example 1), each of them having a size of at least . Furthermore we use the same hardware configuration as in Example 3, which consists of a single node with 1 TB main memory and Intel Xeon E7-4880 v2 @ 2.5 GHz processors, each of them having  cores on a socket, leading to  cores in total. As numerical methods we use the scalar Crout-type ILU as implemented in binary code in MATLAB's ilu (referred to as ILUC) and the MATLAB implementation of GMRES [36] with a restart length of  and relative residual threshold . Our own variants of the BILU are implemented in C and use GMRES(30) as iterative solver as well. In order to distinguish between the single blocking strategies, we add suffixes such as "–p" or "cip" to our results in order to indicate which and how many of the three blocking strategies "cosine" (c), the ILU simulation of section 3.4 (i), and "progressive aggregation" (p) are used in combination with BILU. Notice that BILU without any of the strategies reduces to a scalar ILU. All matrices are preprocessed with maximum weight matching MC64 [14] and reordered with nested dissection METIS [26, 29] (in the case that the cosine blocking is used, METIS is applied to the compressed graph). We use several drop tolerances  and finally select the fastest ILU with respect to this selection. It is clear for incomplete factorization methods that their applicability is parameter dependent, and if one drop tolerance is optimal for one system, a different one may be required for another. To compensate for the large variety of problems we also state how often each choice of drop tolerance was selected. Besides ILUC as one benchmark, we use PARDISO [40] as another competitor, knowing that over a large selection of matrices, direct solvers are typically known to outperform iterative solvers.
However, comparing with PARDISO allows us to measure how far from or how close to an up-to-date high-performance sparse direct solver the new BILU is. Interestingly, PARDISO uses maximum weight matchings and nested dissection initially as well, which makes the comparison even more appropriate. Besides, we also compare the BILU with UMFPACK as implemented in MATLAB and with SuperILU [31], using a similar setup as for BILU.
4.1 Results
In order to evaluate the quality of the different incomplete factorization methods, PARDISO, and UMFPACK on the large selection of test problems, we use performance profiles as a tool for benchmarking and for comparing the algorithms. These profiles were first proposed in [12] for benchmarking optimization software and subsequently became a standard evaluation tool in the linear solver and optimization communities [20]. The profiles are generated by running the set of methods (eight variants of BILU, ILUC, SuperILU, UMFPACK, and PARDISO) on our set of $N$ sparse matrices and recording the information of interest, e.g., the time for the solution operation at a required drop tolerance and the memory consumption. Let us assume that a method $m$ reports a statistic $t_{m,s} \geq 0$ for a matrix $s$ and that a smaller statistic indicates a better solution strategy. We can further define $\check{t}_s = \min_m t_{m,s}$, which represents the best statistic for a given matrix $s$. Then for $\alpha \geq 1$ and each $m$ and $s$ we define
(4)  $k(t_{m,s}, \check{t}_s, \alpha) = \begin{cases} 1 & \text{if } t_{m,s} \leq \alpha\,\check{t}_s, \\ 0 & \text{otherwise.} \end{cases}$
The performance profile of the method $m$ is then defined by
(5)  $\rho_m(\alpha) = \frac{1}{N} \sum_{s=1}^{N} k(t_{m,s}, \check{t}_s, \alpha).$
Thus, the value of $\rho_m(\alpha)$ indicates the fraction of all examples which can be solved within $\alpha$ times the best strategy; e.g., $\rho_m(1)$ gives the fraction for which method $m$ is the most effective method, and $\lim_{\alpha \to \infty} \rho_m(\alpha)$ indicates the fraction for which the algorithm succeeded.
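The profile in (4)–(5) is straightforward to compute; a minimal sketch, assuming each row of the statistic matrix is one method, each column one problem, and failed runs are encoded as infinity:

```python
import numpy as np

def performance_profile(T, alphas):
    """T: (methods x problems) array of statistics (time, memory, ...),
    smaller is better; returns rho of shape (methods x len(alphas))."""
    T = np.asarray(T, dtype=float)
    best = T.min(axis=0)                  # best statistic per problem
    r = T / best                          # performance ratios t_{m,s}/t_s
    # rho_m(alpha): fraction of problems solved within alpha * best
    return np.array([[np.mean(r[m] <= a) for a in alphas]
                     for m in range(T.shape[0])])
```

For a failed run (infinite statistic), the ratio stays infinite and the run never counts as solved, which reproduces the limiting behavior described above.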
To report these statistics, we first display the best computation time in Figure 5. As we can easily see, the BILU methods outperform the scalar ILUC drastically. One has to be aware, though, that the block factorization method consumes more memory. In order to demonstrate that the additional amount of memory is usually still acceptable, we display for the methods from Figure 5 the associated memory consumption as a performance profile in Figure 6.
As one would expect, Figure 6 shows that the scalar factorization requires the smallest amount of memory, but the variants of BILU using the various blockings are most of the time within close range of the scalar version. The use of approximate factorization methods as an alternative to direct factorization methods is only partially justified by their smaller memory consumption. For many problems, as black-box solvers, direct methods are more reliable but occasionally too slow or too memory consuming. A natural alternative statistic is therefore based on weighting memory and time appropriately, taking the product of time and memory as the statistic [19]. This performance profile is shown in Figure 7, which demonstrates that with respect to both aspects, time and memory, the BILU variants are extremely attractive.
We would like to point out that small drop tolerances are rarely chosen, which is in line with the observations in [5]. This is illustrated in Figure 8.
Finally, we stress that the large selection of application problems has led the algorithm to select small sizes for the diagonal blocks, typically 1 and 2 (this analysis for BILU(cip) is reported in Figure 9). While this is not true for structured problems, this is the average result when considering datasets of heterogeneous nature.
Having this almost "scalar" structure in mind, the block-structured approach stays very close to the scalar version even in the frequent case when the factorization is relatively sparse and nontrivial block structures occur rarely. This makes the block-structured approach competitive even on a large set of problems for which it is not optimally designed.
4.2 Performance on selected problems
In this section we consider six real symmetric indefinite matrices ("af_shell*") which arise from industrial applications in sheet metal forming. We compare the symmetric indefinite version (BILDL) of our BILU, which then becomes an incomplete block LDL^T factorization. Likewise, matching is replaced by a symmetrized approach as introduced in [15]. Using [15], a real diagonal matrix and a permutation matrix are computed such that
(6) 
and all entries of the scaled matrix satisfy . Moreover, in practice the scaled matrix will have many diagonal blocks of size 1×1 such that  or of size 2×2 such that . For details we refer to [15]. The cosine-based compression technique is then applied to the compressed companion matrix, where the potential 2×2 pivots are merged. After that, compressing the additional blocks from the cosine algorithm, a symmetric reordering is applied to the compressed graph. The method is modified to deal with 1×1 and 2×2 pivots, whichever is locally more appropriate. This yields the symmetrically preprocessed block-structured matrix, where the permutation matrix refers to the overall permutation. It is approximately factorized using the underlying block structure, and a symmetrized perturbation strategy similar to that of section 3.6 is used whenever the diagonal blocks are ill-conditioned.
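As a hypothetical sketch of the merging step above (all names assumed), collapsing each 2×2 pivot pair into a single node of the compressed companion pattern could look like:

```python
def compress_pivot_pairs(n, pairs):
    """Map n row/column indices to compressed indices, merging each
    2x2 pivot pair (i, j) into one compressed node; returns the index
    map and the number of compressed nodes."""
    comp = [-1] * n
    next_id = 0
    for i, j in pairs:            # each 2x2 pivot becomes one node
        comp[i] = comp[j] = next_id
        next_id += 1
    for i in range(n):            # remaining 1x1 pivots keep their own node
        if comp[i] == -1:
            comp[i] = next_id
            next_id += 1
    return comp, next_id
```

The subsequent cosine blocking and symmetric reordering would then operate on the compressed indices, and the resulting permutation is expanded back to the original indices afterwards.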
We compare the block-structured incomplete factorization approach with the direct solver MA57 as implemented in MATLAB. Initially we compare the computation time for factorizing the matrix for the symmetric indefinite variants of BILU, depending on the drop tolerances, with the computation time required by MA57. The results in Figure 10 clearly demonstrate that the scalar approaches are far out of competition, whereas the block-structured approach even remains competitive with MA57 for relatively small drop tolerances. This shows the other face of the block-structured approach: for small drop tolerances it turns more and more into a high-performance direct solver.
The computation time for each matrix is normalized by the smallest computation time of BILDL and then averaged over the six sample matrices.
Obviously, for preconditioning methods one has to incorporate the computation time for the associated iterative solver; in our case we have chosen the simplified QMR [17, 18] as iterative solver. This certainly changes the situation, since a sufficiently small drop tolerance was required in order to obtain convergence (we use the backward error with a tolerance of ); cf. Figure 11.
In order to better display the total performance we draw a performance profile (5) in analogy to the previous section; see Figure 12.
The performance profile clearly underlines the strength of the block-structured approach even in comparison with a high-performance direct solver, whereas the scalar version suffers from the large amount of fill-in. This fill-in is illustrated in Figure 13, which demonstrates that the block-structured ILU consumes an amount of memory close to that required by MA57, at least for smaller drop tolerances.
5 Concluding remarks
We have demonstrated that using blocking strategies we are able to create a high-performance incomplete BILU that is able to outperform standard incomplete factorizations by orders of magnitude on modern computer architectures. Besides the blocking strategies, the use of dense matrix kernels is the major reason for this dramatic success in closing the gap between ILUs and up-to-date sparse direct solvers. Beyond the scope of this paper is the integration of BILU (JANUS BLOCK ILU, available at https://bilu.tubs.de) as a template inside multilevel factorization methods. We plan to investigate this topic in the near future.
References
 [1] (1984) A robust incomplete Choleski-conjugate gradient algorithm. Int. J. Numerical Methods in Engineering 20 (5), pp. 949–966.
 [2] (1996) An approximate minimum degree ordering algorithm. SIAM J. Matrix Analysis and Applications 17 (4), pp. 886–905.
 [3] (2001) A fully asynchronous multifrontal solver using distributed dynamic scheduling. SIAM J. Matrix Analysis and Applications 23 (1), pp. 15–41.
 [4] (2012) On computing inverse entries of a sparse matrix in an out-of-core environment. SIAM J. Scientific Computing 34 (4), pp. A1975–A1999.
 [5] (2000) Preconditioning highly indefinite and nonsymmetric matrices. SIAM J. Scientific Computing 22 (4), pp. 1333–1353.
 [6] (2002) Preconditioning techniques for large linear systems: a survey. J. Comput. Phys. 182, pp. 418–477.
 [7] (2006) Multilevel preconditioners constructed from inverse-based ILUs. SIAM J. Sci. Comput. 27 (5), pp. 1627–1650.
 [8] (2014) VBARMS: a variable block algebraic recursive multilevel solver for sparse linear systems. J. Comput. Appl. Math. 259, pp. 164–173.
 [9] (1998) An object-oriented framework for block preconditioning. ACM Trans. Math. Softw. 24 (2), pp. 159–183.
 [10] (1992) Towards a cost-effective ILU preconditioner with high level fill. BIT Numerical Mathematics 32 (3), pp. 442–463.
 [11] (2004) Algorithm 832: UMFPACK V4.3—an unsymmetric-pattern multifrontal method. ACM Trans. Math. Softw. 30 (2), pp. 196–199.
 [12] (2002) Benchmarking optimization software with performance profiles. Mathematical Programming 91 (2), pp. 201–213.
 [13] (2017) Direct methods for sparse matrices (2nd edition). Oxford University Press.
 [14] (1999) The design and use of algorithms for permuting large entries to the diagonal of sparse matrices. SIAM J. Matrix Analysis and Applications 20 (4), pp. 889–901.
 [15] (2005) Strategies for scaling and pivoting for sparse symmetric indefinite problems. SIAM J. Matrix Analysis and Applications 27 (2), pp. 313–340.
 [16] (1982) Yale sparse matrix package I: the symmetric codes. Int. J. Numerical Methods in Engineering 18 (8), pp. 1145–1151.
 [17] (1997) A QMR-based interior-point algorithm for solving linear programs. Mathematical Programming, Series B 76 (1), pp. 183–210.
 [18] (1995) Software for simplified Lanczos and QMR algorithms. Appl. Numer. Math. 19 (3), pp. 319–341.
 [19] (2012) An empirical analysis of the performance of preconditioners for SPD systems. ACM Trans. Math. Softw. 38 (4), pp. 24:1–24:30.
 [20] (2007) A numerical evaluation of sparse direct solvers for the solution of large sparse symmetric linear systems of equations. ACM Trans. Math. Softw. 33, pp. 10:1–10:31.
 [21] (2010) Adaptive techniques for improving the performance of incomplete factorization preconditioning. SIAM J. Sci. Comput. 32 (1), pp. 84–110.
 [22] (2008) On finding approximate supernodes for an efficient ILU(k) factorization. Parallel Comput. 34, pp. 345–362.
 [23] (2001) A scalable parallel algorithm for incomplete factor preconditioning. SIAM J. Scientific Computing 22 (6), pp. 2194–2215.
 [24] (2002) Level-based incomplete LU factorization: graph model and algorithms. Technical Report UCRL-JC-150789, Lawrence Livermore National Labs.
 [25] (1995) An improved incomplete Cholesky factorization. ACM Trans. Math. Softw. 21 (1), pp. 5–17.
 [26] (1998) A fast and high quality multilevel scheme for partitioning irregular graphs. SIAM J. Scientific Computing 20 (1), pp. 359–392.
 [27] (1978) The incomplete Cholesky-conjugate gradient method for the iterative solution of systems of linear equations. J. Comput. Phys. 26, pp. 43–65.
 [28] (2018) Parallel BLAS performance report. Technical Report 5, ICL-UT-18-01, University of Tennessee.
 [29] (2013) Multi-threaded graph partitioning. Technical report, Department of Computer Science & Engineering, University of Minnesota, Minneapolis.
 [30] (2004) Crout versions of ILU for general sparse matrices. SIAM J. Scientific Computing 25 (2), pp. 716–728.
 [31] (2011) A supernodal approach to incomplete LU factorization with partial pivoting. ACM Trans. Math. Softw. 37 (4), pp. 43:1–43:20.
 [32] (2005) An overview of SuperLU: algorithms, implementation, and user interface. ACM Trans. Mathematical Software 31 (3), pp. 302–325.
 [33] (1999) Incomplete Cholesky factorizations with limited memory. SIAM J. Scientific Computing 21 (1), pp. 24–45.
 [34] (1999) A blocked incomplete Cholesky preconditioner for hierarchical-memory computers. In Iterative Methods in Scientific Computation IV, IMACS Series in Computational and Applied Mathematics, pp. 211–221.
 [35] (1996) A new pivoting strategy for Gaussian elimination. Linear Algebra and its Applications 240, pp. 131–151.
 [36] (1986) GMRES: a generalized minimal residual algorithm for solving nonsymmetric linear systems. SIAM J. Sci. Statist. Comput. 7, pp. 856–869.
 [37] (2002) ARMS: an algebraic recursive multilevel solver for general sparse linear systems. Numer. Lin. Alg. w. Appl. 9, pp. 359–378.
 [38] (2003) Finding exact and approximate block structures for ILU preconditioning. SIAM J. Scientific Computing 24 (4), pp. 1107–1123.
 [39] (2003) Iterative methods for sparse linear systems. 2nd edition, SIAM Publications, Philadelphia.
 [40] (2004) Solving unsymmetric sparse systems of linear equations with PARDISO. Journal of Future Generation Computer Systems 20 (3), pp. 475–487.
 [41] (2004) The effects of unsymmetric matrix permutations and scalings in semiconductor device and circuit simulation. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 23 (3), pp. 400–411.
 [42] (2011) The importance of structure in incomplete factorization preconditioners. BIT Numerical Mathematics 51 (2), pp. 385–404.
 [43] (1981) A conjugate gradient truncated direct method for the iterative solution of the reservoir simulation pressure equation. Soc. Pet. Eng. J. 21, pp. 345–353.
 [44] (2016) An algebraic multilevel preconditioner with low-rank corrections for sparse symmetric matrices. SIAM J. Matrix Analysis and Applications 37 (1), pp. 235–259.
6 Appendix
List of matrices from the SuiteSparse Matrix Collection used for the numerical experiments.
name  size  name  size  

2D_54019_highK  lhr71c  
3D_51448_3D  lung2  
ASIC_100k  majorbasis  
ASIC_100ks  mark3jac120  
ASIC_320k  mark3jac120sc  
ASIC_320ks  mark3jac140  
ASIC_680k  mark3jac140sc  
ASIC_680ks  matrix_9  
atmosmodd  matrixnew_3  
atmosmodj  memchip  
atmosmodl  ohne2  
barrier21  para4  
barrier22  para5  
barrier23  para6  
barrier24  para7  
barrier29  para8  
barrier210  para9  
barrier211  para10  
barrier212  poisson3Db  
Baumann  Raj1  
bayer01  rajat16  
bcircuit  rajat17  
cage12  rajat18  
cage13  rajat20  
cage14  rajat21  
cage15  rajat23  
Chebyshev4  rajat24  
circuit_4  rajat25  
circuit5M_dc  rajat28  
circuit5M  rajat29  
crashbasis  rajat30  
dc1  rajat31  
dc2  scircuit  
dc3  shyy161  
ecl32  stomach  
epb3  tmt_unsym  
FEM_3D_thermal2  torso1  
Freescale1  torso2  
FullChip  torso3  
g7jac180  trans4  
g7jac180sc  trans5  
g7jac200  transient  
g7jac200sc  TSOPF_RS_b39_c30  
hcircuit  twotone  
hvdc2  venkat01  
ibm_matrix_2  venkat25  
laminar_duct3D  venkat50  
language  water_tank  
largebasis  Wordnet3  
lhr71  xenon2 