Javelin: A Scalable Implementation for Sparse Incomplete LU Factorization

12/13/2018
by   Joshua Dennis Booth, et al.
Franklin & Marshall College

In this work, we present a new scalable incomplete LU factorization framework called Javelin to be used as a preconditioner for solving sparse linear systems with iterative methods. Javelin allows for improved parallel factorization on shared-memory many-core systems by packaging the coefficient matrix into a format that allows for high-performance sparse matrix-vector multiplication and sparse triangular solves with minimal overheads. The framework achieves these goals by using a collection of traditional permutations, point-to-point thread synchronizations, tasking, and segmented prefix scans in a conventional compressed sparse row format. Moreover, this framework stresses the importance of co-designing dependent tasks, such as sparse factorization and triangular solves, on highly-threaded architectures. Using these changes, traditional fill-in and drop tolerance methods can be used, while still achieving observed speedups of up to 42x on 68 Intel Knights Landing cores and 12x on 14 Intel Haswell cores.


I Introduction

The solution of sparse linear systems of equations dominates the execution time of most parallel scientific simulations. A common approach to solving such systems is to use iterative methods, e.g., Conjugate Gradients (CG) or the Generalized Minimal Residual Method (GMRES), that rely on sparse matrix-vector multiplication (spmv) and sparse triangular solves (stri). However, iterative methods may not converge, or may take a large number of steps, depending on the conditioning of the system, and some form of preconditioning is normally used. Incomplete factorization is one common method used to precondition the system: the coefficient matrix representing the linear system is factored while controlling the number of nonzero elements calculated during the factorization. This incomplete factorization, used as a preconditioner M ≈ A, allows the linear system Ax = b to be modified to

M^{-1} A x = M^{-1} b.

Because the preconditioning method limits the number of nonzeros computed in M, the ratio of floating-point operations to memory traffic required to construct and apply M is low; incomplete factorization is therefore memory-bound, which makes strong scaling extremely hard to achieve. One common and useful preconditioner (M) is based on the incomplete LU factorization. In this work, we provide a new scalable incomplete LU factorization (ILU) framework called Javelin that is designed for current large many-core systems and allows for scalable sparse triangular solves without having to reformat the input matrix as most scalable triangular solve methods do [1].
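
As a rough back-of-the-envelope estimate (ours, not a figure from the paper): each nonzero processed by a sparse triangular solve contributes roughly 2 floating-point operations (one multiply, one subtract) but requires on the order of 20 bytes of memory traffic (an 8-byte value, a 4-byte column index, and an 8-byte vector entry), giving

arithmetic intensity ≈ 2 flops / 20 bytes = 0.1 flops per byte,

which is far below the several flops per byte of bandwidth that current processors would need to be compute-bound. The same argument applies to constructing the incomplete factors, since only a handful of operations are performed per nonzero touched.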

A rich variety of incomplete LU factorization algorithms exist that compute A ≈ LU, where L and U are lower and upper triangular matrices, respectively. In general, these algorithms can be categorized by the method they use to reduce nonzeros due to fill-in: dropping based on numerical value (ILU(τ)) and dropping based on fill-in level (ILU(k)). The design of Javelin allows any combination of the two to be used (ILU(k,τ)), along with modified ILU [2]. However, most of the current packages that can be used for traditional incomplete factorization are serial because of the difficulty of scaling and the overhead needed to reformat the output matrix in order to achieve scalable sparse triangular solves. Though parallel ILU packages exist [3, 4, 5, 6], many rely on complex data structures or on modifications of ILU that use approximations to reduce synchronizations, and they may not have a scalable stri. These packages apply these modifications because of the difficulty of achieving scalable performance with traditional ILU. Javelin is designed to retain the robust nature of traditional ILU methods while limiting the use of complex data structures, and it still achieves the scalability needed for incomplete factorization, spmv, and stri on current many-core systems such as the Intel Xeon Phi by co-optimizing incomplete factorization and stri.
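
For concreteness, the two standard dropping rules can be written as follows (this is the common formulation due to Saad; individual packages differ in the exact thresholds and norms used). For ILU(k), every entry is assigned a level, lev(a_ij) = 0 if a_ij ≠ 0 or i = j and ∞ otherwise, which is updated during elimination by

lev(a_ij) ← min( lev(a_ij), lev(a_ik) + lev(a_kj) + 1 ),

and any entry whose level exceeds k is dropped. For ILU(τ), an entry a_ij is dropped when it is small relative to its row, e.g., when |a_ij| < τ · ||a_i||. ILU(k,τ) applies both rules together.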

This paper presents Javelin, currently implemented as a threaded, templated C++ incomplete LU factorization using OpenMP, with the following contributions:

  • A parallel ILU optimized for parallel spmv and stri on many-core architectures

  • A parallel ILU that requires minimal data preprocessing and permutations

  • An empirical evaluation of Javelin on Intel Xeon (Haswell) and Intel Xeon Phi (Knights Landing) for ILU(0) and stri

  • An evaluation of optional parameters needed by the framework

II Background

This section provides a brief overview of the foundations and improvements of incomplete LU factorization (ILU) and of current scalable implementations of sparse matrix-vector multiplication (spmv) and sparse triangular solve (stri). The spmv and stri kernels are the driving force behind this work: we desire a scalable incomplete factorization that is optimized for state-of-the-art parallel many-core spmv and stri, since they dominate the time of iterative methods.

Incomplete LU factorization. Preconditioning based on incomplete LU decomposition is generally regarded as a robust “black-box” method for unstructured systems arising from a wide range of application areas. ILU has been studied extensively, including studies on variations of ILU(k), ILU(τ), and multilevel combinations of the two [7, 2, 8]. The best combination is highly dependent on the linear system. Variations exist that try to improve serial performance using supernodes [7]; however, unstructured systems with limited fill-in have a limited amount of overlapping sparsity patterns, which limits supernode construction.

Parallel incomplete LU factorization. The first work on parallel incomplete factorization was for regular, structured systems. However, the need for parallel incomplete factorization on unstructured systems has driven much work from the 1980s until today. The traditional approach to the irregular problem posed by unstructured systems is to use level scheduling [9, 10]. In level scheduling, the coefficient matrix is transformed into a graph, such that rows, columns, or blocks of the coefficient matrix are vertices and dependencies are edges. Groups of vertices that have no incoming edges are placed together into a level, and the outgoing edges from these vertices are removed. The rows, columns, or blocks in each level can then be scheduled concurrently. In a similar fashion, orderings such as Coloring and Nested-Dissection have been used to identify rows, columns, or blocks that can be factorized concurrently, in some cases exposing more concurrency. The disadvantage of all of these reorderings is that they may negatively impact the number of iterations needed for the iterative method to converge [11]. We explore this possible negative impact with multiple reorderings in Section VII. However, the performance of parallel incomplete factorization depends on many more factors than the exposed levels of parallelism. These factors include the programming model, task size, and data structures relative to the targeted machine. This was first noted in the early 1990s [12], and it was recently examined on current many-core systems with different factorization codes in [5]. Recently, one version of ILU has been shown to achieve very good performance on many-core and GPU systems [3]. Despite its performance, this method may produce an incomplete factorization that is nondeterministic, and it complicates traditional dropping and modified incomplete factorization due to race conditions.

Input: S contains the sparsity pattern for L and U, and the values of A are stored in S
for i = 1 to n do
   for k in row i of S do
      if k < i then
         S(i,k) = S(i,k) / S(k,k)
         for j in row k of S do
            if (i,j) in S and j > k then
               S(i,j) -= S(i,k) * S(k,j)
            end if
         end for
      end if
   end for
end for
Fig. 1: Up-looking Sparse Incomplete LU Factorization Algorithm

Sparse matrix-vector multiplication and triangular solve. Iterative methods, such as CG and GMRES, spend the vast majority of their time performing sparse matrix-vector multiplication (spmv) and sparse triangular solve (stri). In particular, preconditioned CG using an incomplete Cholesky decomposition spends a large fraction of its execution time in forward and backward stri [13]. Because these operations dominate the solve time, having an incomplete decomposition that allows us to use them efficiently is critically important. There has been a vast amount of research in this area over the past 10 years as large many-core systems have become the norm [13, 14, 1, 15]. Many of these methods require reordering and special data structures [1, 15]. Therefore, these methods are not ideal for iterative methods, as matrices would need to be copied into these data structures, and the number of iterations would depend on the ordering. Moreover, these orderings may not be ideal for parallel incomplete factorization. Two works stand out that limit reordering and data movement. The first work [15] introduces an efficient storage format for spmv, namely CSR5. CSR5 does not require reordering the sparsity pattern, depends on the common Compressed Sparse Row (CSR) format, and allows for quick segmented scans. The only overhead is a little extra storage for tile information to support segmented scan operations. Segmented scan has been shown to be an ideal way to implement spmv and stri on vector-based machines [16], and most many-core systems have large vector register lanes. The second work [13, 14] focuses on improving stri on the standard CSR format using general level scheduling and sparsified synchronization. This work demonstrates that many of the overheads of level scheduling can be removed, allowing stri to scale. Javelin uses these works as the foundation to achieve scalable stri, both during and after factorization, with modifications that allow incomplete factorization.
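
To make the level-scheduled baseline concrete, the sketch below (our own illustration with hypothetical names, not code from [13, 14] or from Javelin) shows a forward solve Lx = b on a CSR-stored lower triangular factor with a barrier between levels; Section VI uses an implementation of this style as its CSR-LVL baseline.

// Level-scheduled forward solve (L x = b) on a CSR lower-triangular factor
// with a barrier between levels. level_rows holds the rows in level order and
// level_ptr delimits each level; rows are assumed to have sorted column
// indices with the diagonal entry stored last.
#include <vector>
#include <omp.h>

struct CSR { std::vector<int> ptr, idx; std::vector<double> val; int n; };

void lower_solve_levels(const CSR& L, const std::vector<int>& level_ptr,
                        const std::vector<int>& level_rows,
                        const std::vector<double>& b, std::vector<double>& x) {
  const int nlevels = (int)level_ptr.size() - 1;
  #pragma omp parallel
  for (int lev = 0; lev < nlevels; ++lev) {
    // Rows inside one level are independent of each other.
    #pragma omp for schedule(dynamic, 64)
    for (int r = level_ptr[lev]; r < level_ptr[lev + 1]; ++r) {
      const int i = level_rows[r];
      double sum = b[i];
      int p = L.ptr[i];
      for (; p < L.ptr[i + 1] - 1; ++p)   // off-diagonal entries
        sum -= L.val[p] * x[L.idx[p]];
      x[i] = sum / L.val[p];              // diagonal stored last in the row
    }
    // Implicit barrier at the end of the omp for: the whole level must finish
    // before any thread starts a row of the next level.
  }
}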

III A Framework for ILU

The Javelin framework for parallel incomplete LU factorization depends on predetermining the sparsity pattern and applying an up-looking LU algorithm, i.e., a left-looking algorithm done by rows instead of columns, to that pattern. Both of these operations need to be done in parallel for the factorization code to scale on modern many-core systems. Determining the sparsity pattern in parallel has been studied previously [6], and a good implementation exists in [17, 5]. Therefore, we primarily focus on the second half, namely applying the up-looking LU algorithm to the given nonzero pattern. Fig. 1 presents the up-looking LU algorithm using standard MATLAB-style notation for the sparse matrix S containing the sparsity pattern, which is the form used by many incomplete factorization codes. The main reason for the popularity of up-looking LU is that it can easily be modified to support dropping based on some tolerance relative to the pivot value and modified ILU techniques, which help to reduce iteration counts [2]. Moreover, up-looking LU allows for local estimates of resilience to soft errors [18] and of the convergence rate [19]. Like many other incomplete factorizations, Javelin does not allow pivoting. We note that these algorithms could be applied to other preconditioners, which is why we call Javelin a framework.
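
As a concrete, serial illustration of the up-looking algorithm in Fig. 1, the sketch below performs ILU(0) in place on a CSR matrix. It is our own simplified rendering (not the Javelin source) and assumes sorted column indices and an explicitly stored, nonzero diagonal.

// Serial up-looking ILU(0): factor A in place so that its strictly lower part
// holds L (unit diagonal implied) and its upper part holds U. diag[i] gives
// the position of a_ii within the CSR arrays.
#include <vector>

struct CSR { std::vector<int> ptr, idx; std::vector<double> val; int n; };

void ilu0_uplooking(CSR& A, const std::vector<int>& diag) {
  std::vector<int> pos(A.n, -1);                        // scatter map: column -> slot in row i
  for (int i = 0; i < A.n; ++i) {
    for (int p = A.ptr[i]; p < A.ptr[i + 1]; ++p) pos[A.idx[p]] = p;
    for (int p = A.ptr[i]; p < A.ptr[i + 1]; ++p) {
      const int k = A.idx[p];
      if (k >= i) break;                                // only eliminate with earlier rows
      const double lik = (A.val[p] /= A.val[diag[k]]);  // L(i,k) = a_ik / u_kk
      for (int q = diag[k] + 1; q < A.ptr[k + 1]; ++q) {    // row k of U
        const int j = A.idx[q];
        if (pos[j] != -1) A.val[pos[j]] -= lik * A.val[q];  // update only the existing pattern
      }
    }
    for (int p = A.ptr[i]; p < A.ptr[i + 1]; ++p) pos[A.idx[p]] = -1;  // reset scatter map
  }
}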

Fig. 2: The general structure generated by level scheduling ordering. Levels in the top half can be factored in an up-looking method while point-to-point synchronizations are used to handle dependencies. The last level can be solved using one of two different methods.

In addition, Javelin uses level scheduling as the primary method to apply up-looking LU. Past work related to level scheduling of stri shows it is difficult to achieve high performance, with applications scaling to only a few cores. These difficulties are due to the requirement of thread barriers between levels and to levels containing too few rows to keep all threads busy. Javelin faces these issues head-on by using point-to-point synchronizations between levels and by switching to a second-stage approach when the size of the levels becomes too small for the thread count. Two different methods exist for factoring the second stage, namely Segmented-Row and Even-Rows. We explain both along with their trade-offs and demonstrate their performance differences in Javelin, though Javelin by default makes the choice for the user based on the matrix structure. In order to apply this two-stage structure, we first find the level scheduling order of either S(L) or S(L+U), depending on the user options, where S(L) denotes the sparsity pattern of the lower triangular half of the given matrix. We address the importance of the choice between S(L) and S(L+U) in the subsection on the second-stage factorization methods. Though finding the order alone would be enough to schedule the top-stage point-to-point factorization method, we also permute the nonzeros of the matrix into the level ordering while copying into the CSR data structures of L and U in parallel, allowing for first-touch placement.

III-A Level Scheduling, Upper Stage

During preprocessing, the sparsity pattern of the factors with the desired fill-in is found and permuted into a level-based ordering as in Fig. 2. Javelin uses a set of user-defined options to determine the portion of the matrix to factor using level scheduling. These options include the minimal number of rows per level, the relative location of a level in the ordering, and the row density, i.e., the number of nonzeros per row. The idea is that Javelin wants to apply level scheduling to levels that have a very large number of rows, so that no thread runs out of work and any imbalance in the amount of work is amortized across multiple rows. However, if the rows become too dense compared to the average density of the matrix, or the number of rows in a level is too small, these rows are moved toward the end to be solved with Javelin's second-stage solver. The catch is when a level with very few rows or a high row density sits in the middle of larger level sets, see Fig. 3. In such cases, moving these rows to the second stage would also move a lot of work (i.e., all the dependent rows) to the second stage. Traditional implementations of level scheduling use barriers and would have difficulty with such rows; Javelin, using point-to-point synchronizations, does not suffer in the same way. A sketch of this level-selection heuristic is given below.
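
The sketch below is an illustrative-only rendering of such a predicate; the function name, thresholds, and default values are hypothetical placeholders and not Javelin's actual settings.

// Decide whether a level stays in the upper (level-scheduled) stage or is
// deferred to the lower stage. All thresholds are made-up placeholders.
bool keep_in_upper_stage(int rows_in_level, double level_nnz_per_row,
                         double matrix_nnz_per_row, int level_index,
                         int num_levels, int num_threads) {
  const int    min_rows      = 4 * num_threads;  // enough rows to keep every thread busy
  const double density_ratio = 4.0;              // "too dense" relative to the matrix average
  const double tail_fraction = 0.9;              // only defer levels near the end of the ordering
  const bool too_small = rows_in_level < min_rows;
  const bool too_dense = level_nnz_per_row > density_ratio * matrix_nnz_per_row;
  const bool near_tail = level_index > tail_fraction * num_levels;
  // Small or dense levels in the middle of large ones are kept (Fig. 3), since
  // deferring them would drag all of their dependent rows into the lower stage.
  return !((too_small || too_dense) && near_tail);
}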

Fig. 3: Here we have a large number of tasks in levels x and x+2. However, between them is a level with few tasks. Therefore, this level may not be ideal to hand to our second-stage method, since moving it would also move all of its dependent rows.
Fig. 4: On the left side of the figure, we have the tasks and their dependencies generated by level scheduling. On the right side, we map tasks to threads while imposing an implied ordering of tasks within a level. Using this order, dependencies are pruned, resulting in only a few synchronizations.

Point-to-point synchronization for level scheduling was introduced as an effective way to parallelize sparse triangular solve [13, 14]. Here we use the observation that incomplete factorization done with up-looking LU has the same dependency structure as sparse triangular solve. In previous work, we demonstrated that such a method allows for improved levels of concurrency [20]. The principle of sparsifying point-to-point synchronizations for level scheduling is as follows. The level sets are found as in Fig. 4. Once the sets are found, rows are mapped to threads. This mapping of rows to threads induces an implied ordering, and the implied ordering is used to prune the full set of dependencies. In traditional methods, these dependencies are handled by a global barrier between levels or by generating tasks with dependencies in a parallel tasking programming model; both of these methods have high runtime overheads. In contrast, the point-to-point implementation relies on inexpensive spin-locks and allows certain threads to speed ahead of others into the next levels.
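
The sketch below is our own illustration of this pattern (hypothetical names, no dependency pruning, not the Javelin source): rows in level order are statically mapped to threads, and a per-row completion flag replaces the barrier at the end of each level.

// Point-to-point synchronized up-looking factorization over level-ordered rows.
// A thread spin-waits only on the rows its own nonzeros depend on.
#include <atomic>
#include <vector>
#include <omp.h>

struct CSR { std::vector<int> ptr, idx; std::vector<double> val; int n; };

void factor_rows_p2p(CSR& A, const std::vector<int>& row_order /* level order */) {
  std::vector<std::atomic<int>> done(A.n);
  for (auto& f : done) f.store(0, std::memory_order_relaxed);

  #pragma omp parallel
  {
    const int tid = omp_get_thread_num();
    const int nt  = omp_get_num_threads();
    // Round-robin map of level-ordered rows to threads: the "implied ordering".
    for (int r = tid; r < (int)row_order.size(); r += nt) {
      const int i = row_order[r];
      for (int p = A.ptr[i]; p < A.ptr[i + 1]; ++p) {
        const int k = A.idx[p];
        if (k >= i) break;                                // sorted column indices assumed
        while (!done[k].load(std::memory_order_acquire)) {
          // spin until row k is factored; a pause/yield hint would go here
        }
        // ... eliminate row i with row k, as in the serial ILU(0) sketch above ...
      }
      done[i].store(1, std::memory_order_release);
    }
  }
}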

III-B Lower Stage

At some point, level scheduling will no longer offer the degree of concurrency needed to achieve continued speedup on modern many-core systems. In these cases, Javelin switches to one of two methods, which we name Segmented-Row (SR) and Even-Rows (ER).

Fig. 5: The layout of blocks and tiles in the Segmented-Row (SR) method. Tiles are laid out in each subblock independently of one another and can span multiple rows. The nonzero elements in each row are indexed starting from 1. Contiguous nonzeros that are grouped together into a tile are colored similarly.

Segmented-Row. The Segmented-Row (SR) method is inspired by the segmented scan that gives CSR5 its cross-platform spmv scalability [15]. The goal is to have a structure that can be factored efficiently while already being set up for the spmv-like updates needed by stri. In preprocessing, the level ordering of S(L+U) is found, and the rows identified for SR are permuted to the end of the matrix during the copy/fill-in phase of building the CSR data structure. Fig. 5 provides a visual example of this format, where the index of each nonzero element in a row is numbered starting from 1. Each subblock is defined by the level scheduling used in the upper stage. Inside each subblock, nonzeros are grouped together contiguously, as displayed in Fig. 5 by different colors. Given a subblock formed from the level scheduling ordering, there are no dependencies between the columns within the same subblock, i.e., no column block of a subblock depends on another column block of that subblock. This is due to using the level ordering of S(L+U) in place of S(L); the latter does not guarantee this property. This observation allows tiles to be created within each level of the subblock in a manner similar to CSR5, requiring only a small additional array of pointers into the block. These tiles are the foundation of the SR algorithm. A tasking model is used to factor each level, to apply the needed updates to other tiles in L and U, and to allow the next subblock's set of tiles to be factored. In Javelin, OpenMP tasking is used as the programming model, and tile-size options are exposed to the user. Fig. 6 provides the algorithm; for simplicity, we do not explicitly state the task dependencies.

Let l_s be the first level to complete with SR and nlev the total number of levels.
for l = l_s to nlev do
   spawn all tiles (DIVIDE_COLUMNS(l))
   for m = l+1 to nlev do
      spawn all tiles (UPDATE_BLOCK(l, m))
   end for
end for
spawn all tiles (FACTOR_LU())
Fig. 6: Segmented-Row Method (SR) Algorithm

In Fig. 6, DIVIDE_COLUMNS takes the diagonal values corresponding to the columns used in a tile and divides the tile's entries by them. The UPDATE_BLOCK routine applies the multiplication-subtraction updates to nonzeros in tiles in later levels, similar to the innermost line of Fig. 1. Lastly, FACTOR_LU factors the tiles in the trailing corner blocks of L and U.
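
The sketch below shows one way the SR task structure of Fig. 6 can be expressed with OpenMP tasking; it is our own simplification (Tile, divide_columns, update_block, and factor_lu are illustrative placeholders), and it uses coarse taskwait points where the framework would instead track finer-grained tile dependencies.

// Coarse-grained OpenMP tasking sketch of the SR algorithm in Fig. 6.
#include <vector>
#include <omp.h>

struct Tile { /* pointers into the CSR arrays for one contiguous nonzero segment */ };

void divide_columns(Tile&) { /* divide tile entries by the pivots of its level */ }
void update_block(int src_level, Tile&) { /* multiply-subtract update from src_level */ }
void factor_lu(Tile&) { /* factor one tile of the trailing corner blocks */ }

void segmented_rows(std::vector<std::vector<Tile>>& tiles,  // tiles[l] = tiles of subblock l
                    std::vector<Tile>& corner) {            // tiles of the corner blocks
  const int L = (int)tiles.size();
  #pragma omp parallel
  #pragma omp single
  {
    for (int l = 0; l < L; ++l) {
      for (std::size_t t = 0; t < tiles[l].size(); ++t) {
        #pragma omp task firstprivate(l, t)
        divide_columns(tiles[l][t]);
      }
      #pragma omp taskwait     // columns of subblock l scaled by their pivots
      for (int m = l + 1; m < L; ++m)
        for (std::size_t t = 0; t < tiles[m].size(); ++t) {
          #pragma omp task firstprivate(l, m, t)
          update_block(l, tiles[m][t]);
        }
      #pragma omp taskwait     // updates from subblock l pushed to later subblocks
    }
    for (std::size_t t = 0; t < corner.size(); ++t) {
      #pragma omp task firstprivate(t)
      factor_lu(corner[t]);    // finally factor the corner blocks
    }
  }
}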

SR is useful in several situations: when fewer rows are excluded from the level scheduling method than there are threads on the system, and when the number of nonzeros per row is highly imbalanced. However, SR suffers from the need to use a tasking programming model and the overheads that come with it. On the other hand, the fixed tile size of SR lends itself to optimized vector operations (e.g., AVX); much new hardware, such as the Intel Xeon Phi, has large vector lanes that must be used to achieve top performance.

Fig. 7: The layout of blocks in the Even-Rows (ER) method. Each thread gets a block of rows and completes an up-looking incomplete LU on it while making the needed updates to the values in the corner blocks of L and U.
Let l_s be the first level to complete with ER.
for each thread's block of rows (levels l_s onward) in parallel do
   FACTOR_L(block)
end for
spawn all tiles (FACTOR_LU(corner))
Fig. 8: Even-Rows Method (ER) Algorithm

Even-Rows. The Even-Rows (ER) method is much simpler than SR. ER depends on the number of rows excluded from the level scheduling method being greater than the number of threads to be used on the system. With more rows than threads, each thread can factor multiple rows, and any imbalance is averaged over a thread's rows. The excluded rows are reordered to the end of the matrix, as in SR, after finding the level order. Here, either S(L) or S(L+U) can be used to find the level order. While S(L) generally offers larger levels, this does not seem to make much difference in performance; see Section VII. Options exist in preprocessing to form tiles similar to SR for use in stri, but they are not needed for the incomplete factorization. In Fig. 7, each thread gets a block of rows and factors them up to the corner piece while making the needed updates to the corner piece's L and U values. Fig. 8 provides an overview of the ER method, where FACTOR_L does the factorization of each row up to the corner piece. When all the threads are done with their rows, the corner piece can be factored with a call to FACTOR_LU. The factorization of the corner can be done in serial or in parallel; however, for most matrices, serial seems to be good enough given the level size.
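
A minimal sketch of ER follows (our own illustration; factor_row_until and factor_corner are hypothetical helpers): each thread factors its contiguous block of lower-stage rows against the already-factored upper-stage columns, and the small trailing corner is factored afterwards.

// Even-Rows lower stage: rows from lower_start to n-1 are split evenly across
// threads; the corner block (rows and columns >= lower_start) is factored last.
#include <vector>
#include <omp.h>

struct CSR { std::vector<int> ptr, idx; std::vector<double> val; int n; };

// Hypothetical helpers (placeholders only).
void factor_row_until(CSR&, int row, int corner_start) {
  // Up-looking elimination of `row` using only the already-factored upper-stage
  // rows (columns < corner_start); updates to the corner columns are written
  // into the row's own storage.
}
void factor_corner(CSR&, int corner_start) {
  // Factor the trailing corner block, serially or with tasks over tiles.
}

void even_rows(CSR& A, int lower_start) {
  const int nrows = A.n - lower_start;
  #pragma omp parallel
  {
    const int tid = omp_get_thread_num();
    const int nt  = omp_get_num_threads();
    const int begin = lower_start + (int)((long long)nrows * tid / nt);
    const int end   = lower_start + (int)((long long)nrows * (tid + 1) / nt);
    for (int i = begin; i < end; ++i)
      factor_row_until(A, i, lower_start);   // no inter-thread dependencies here
  }
  factor_corner(A, lower_start);
}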

IV Experimental setup

Here we provide the setup for the next three sections to evaluate the performance of incomplete LU factorization, sparse triangular solve, and the sensitivity of Javelin to input.

Test systems. We test our framework on two systems: Bridges at PSC and Stampede2 at TACC [21]. The first system contains two Intel Xeon E5-2695 v3 (Haswell) processors, each with 14 cores, and 128GB of DDR4. The second system contains an Intel Xeon Phi Knights Landing 7250 (KNL) with 68 cores and 96GB of DDR4. The KNL is set up in cache mode with 32KB L1, 1MB L2 per two-core tile, and 16GB of direct-mapped fast memory acting as an L3. Though KNL offers a variety of ways to configure its fast memory, such as cache and flat modes, we test in cache mode, as this is the mode recommended by TACC for most MPI+X scientific applications and therefore the mode most likely to be used by applications that would use Javelin. All codes are compiled with the Intel compiler 17.4 / 17.3 with -O3. We use OpenMP with dynamic scheduling and a chunk size of 1 for all our tests, though ER may benefit from different scheduling and chunk-size options; this decision was made to limit the number of combinations when finding the best overall performance. Additionally, we set OMP_PLACES=cores and OMP_PROC_BIND=close.

Matrix N Nnz RD SP Lvl
wang3 26064 177168 6.8 yes 10 B
TSOPF_RS_b300_c2 28338 2943887 103.88 no 180 B
3D_28984_Tetra 28984 285092 9.84 no 34 B
ibm_matrix_2 51448 537038 10.44 no 29 B
fem_filter 74062 1731206 23.38 yes 554 B
trans4 116835 749800 6.42 no 20 B
scircuit 170998 958936 5.61 yes 34 B
transient 178866 961368 5.37 yes 16 B
offshore 259789 4242673 16.33 yes 74 A
ASIC_320ks 321671 1316085 4.09 yes 16 B
af_shell3 504855 1.756e7 34.79 yes 630 A
parabolic_fem 525825 3674625 6.99 yes 28 A
ASIC_680ks 682712 1693767 2.48 yes 21 B
apache2 715176 4817870 6.74 yes 13 A
tmt_sym 726713 5080961 6.99 yes 28 B
ecology2 999999 4995991 5 yes 13 A
thermal2 1.2e6 8580313 6.99 yes 27 A
G3_circuit 1.5e6 7660826 4.83 yes 13 B
TABLE I: Test suite. N is the matrix dimension, Nnz is the number of nonzeros, RD is the row density (nonzeros / N), SP indicates whether the symbolic pattern of the matrix in natural order is symmetric, and Lvl is the number of levels found by the level scheduling. The final letter in each row gives the matrix's test group (A or B).

Matrices. We select a large array of different matrices from the SuiteSparse collection [22]; Table I provides the details for each. Our test suite is broken into two groups, labeled A and B. The first group (A) contains matrices commonly tested for ILU with regard to convergence [3, 7], despite being symmetric positive definite. The goal of group A is to demonstrate the effect that our structure and the preordering described in this section have on convergence. Though previous studies have looked at the effect of preordering on convergence, no study has looked at our particular combination with the level-set ordering; this is presented in Section VII. The second group (B) contains a wide array of matrices from multiple areas. While much previous work [3, 7] focuses on matrices coming from the discretization of PDEs, there is a growing need for iterative methods in other areas that have very irregular matrices, such as certain stages of circuit simulation [23, 24]. Therefore, to have a thorough examination, we include a wide range of matrices with different nonzero patterns and row densities, i.e., average numbers of nonzeros per row.

Incomplete levels. We only consider ILU(k) with k = 0, though Javelin supports other fill levels as implemented by other work [6, 17, 5] and commonly used in iterative solvers. As k increases, additional fill-in is allowed into the sparsity pattern; the placement and amount of this fill-in depend on the ordering of the matrix. Therefore, comparing performance as k varies would not provide a deep understanding of the scalability of Javelin without knowing where and how much fill-in was produced. It is therefore common to compare scalability primarily with ILU(0) [5, 3] over a large test suite of matrices with different sparsity patterns and row densities in order to estimate how well the implementation scales with fill-in, as we do in this paper.

Preordering. It is common to permute the nonzeros of a sparse matrix before applying an iterative method. These permutations allow for better memory access patterns and reductions in iteration counts [11]. Javelin currently only has the parallel level-set ordering needed by the framework built in, though options for additional parallel orderings may be added in the future. Therefore, we assume that the given matrix has already been ordered by the user in some manner that is suitable for the given solver method. For the performance evaluation, we assume the matrix has been ordered as follows: a Dulmage-Mendelsohn ordering is used to move nonzeros to the diagonal, and then Nested-Dissection (ND) from METIS [25] is applied. This ordering was chosen because ND is commonly applied to coefficient matrices for parallel factorization [17]. We consider the sensitivity of this choice in Section VII, comparing Reverse Cuthill-McKee (RCM) ordering against ND. We do not include the time for ordering, as this is the user's choice and depends on many factors.
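
For reference, a minimal sketch of obtaining an ND permutation through the METIS 5 C API is shown below; this reflects the standard METIS_NodeND interface as we understand it, the graph passed in must be the symmetrized adjacency of the matrix with the diagonal removed, and error handling is omitted.

// Compute a nested-dissection permutation with METIS 5. xadj/adjncy describe
// the symmetrized, diagonal-free pattern of the matrix in CSR-like form.
#include <metis.h>
#include <vector>

std::vector<idx_t> nd_permutation(idx_t n, std::vector<idx_t>& xadj,
                                  std::vector<idx_t>& adjncy) {
  std::vector<idx_t> perm(n), iperm(n);
  idx_t options[METIS_NOPTIONS];
  METIS_SetDefaultOptions(options);
  // perm and iperm receive the fill-reducing ordering and its inverse
  // (see the METIS manual for the exact convention).
  METIS_NodeND(&n, xadj.data(), adjncy.data(), /*vwgt=*/nullptr,
               options, perm.data(), iperm.data());
  return perm;
}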

V Performance evaluation of Incomplete LU

Performance relative to other packages. We first consider the performance of Javelin relative to other packages. We note that there do not exist many threaded incomplete factorization packages. Many popular packages, such as SuperLU [7] and Trilinos [26], only offer a serial option, and we found that Javelin in serial was either faster than or within a small margin of these serial packages. Additionally, most packages that offer traditional ILU use only ILU(k,τ) and do not provide an interface to ILU(k). Therefore, we use this combined method to compare against the commercial Watson Sparse Matrix Package (WSMP) [27], but we use ILU(k) for the scalability analysis for the reasons stated before. WSMP is given the matrix in the level ordering used by Javelin with k = 0 and with τ set so that the number of nonzeros is similar to that of ILU(0), and we do not allow pivoting. Here, only Javelin's level-scheduling stage is used, as we examine the tradeoff of using the lower stage later. We only consider the time to perform the numeric factorization, time(A, p), for a particular matrix A using p cores. Fig. 9 presents the slowdown of WSMP relative to Javelin, i.e.,

slowdown(A, p) = time_WSMP(A, p) / time_Javelin(A, p).

In many cases, WSMP failed due to numerical constraints placed in part by its internal structure and required reordering; we mark such columns with an 'x'. Moreover, we only present up to 8 cores, as WSMP does not scale past this point on either system. Note that Javelin is multiple orders of magnitude faster than WSMP both in serial and in parallel. This difference comes from Javelin's very lightweight data structures, which incur little overhead. WSMP (along with many other packages) uses supernode or supernode-like data structures that are too heavyweight for very sparse incomplete factorization, performing too many data-movement operations per floating-point operation. As core counts increase, other packages try to reduce synchronizations by grouping similar nonzero structures into these data structures, thus reducing the serial fraction that limits strong scaling. However, if there are not many similarities in the nonzero structure (as in ILU), the reduction is very small. In contrast, Javelin reduces the number of synchronizations through locality in its memory mapping and by using point-to-point synchronizations.

(a) Slowdown on Haswell
(b) Slowdown on KNL
Fig. 9: Slowdown of WSMP compared to Javelin on Haswell and KNL. WSMP does not have any real scaling behavior after 8 cores. ’x’ represents where WSMP failed due to internal issues.

Scalability of Javelin. For the remainder of this section, we define

speedup(A, p) = time(A, 1) / time(A, p)

for any particular system, where time(A, p) is the numeric factorization time for matrix A on p cores. We currently do not consider setup time, as this is merely finding the level scheduling and copying to CSR, which is done in parallel and would be reused by other methods such as spmv and stri. Moreover, almost all solver packages require some overhead to copy into their structures and preorder, and we do both our copies and level-set ordering in parallel.

(a) 14 Haswell Cores (1 Socket)
(b) 28 Haswell Cores (2 Sockets)
Fig. 10: Speedup on Intel Haswell. The LS bars are speedup due only to level scheduling with point-to-point synchronization and the LS+Lower bars are using both level scheduling and best lower method.

In Fig. 10, we present the speedup of Javelin over all matrices on Haswell for 14 and 28 cores. We denote the speedup using only level scheduling with point-to-point synchronizations as LS, and the best combination of level scheduling and a lower method as LS+Lower. The geometric mean of the speedup when selecting the best mixture of all methods and thread counts (not pictured) understates the level of speedup possible with Javelin. In general, we observe that most matrices achieve similar, sizable speedups using just level scheduling with point-to-point synchronizations on 14 cores. Several matrices are unable to reach this standard of performance. For ibm_matrix_2, trans4, and transient, our lower method is able to boost performance closer to the level of the other matrices. The matrix fem_filter is not helped by these methods, because its numerical pattern does not yield many large levels; see Section VII. As we cross the socket boundary from 14 to 28 cores, we notice a much larger performance gap between matrices. Matrices such as af_shell3 and 3D_28984_Tetra perform well across sockets, but most cannot. This is due to the non-uniform memory access (NUMA) between the sockets; currently, Javelin does not account for these types of accesses. The level scheduling depends on lightweight point-to-point synchronizations using locks that are not ideal across sockets due to their required memory loads. The SR lower-stage method does very poorly across sockets. This is because tasks taken from the tasking queue have no locality, which results in fewer matrices being boosted by a lower method. With the current OpenMP scheduling, ER also has trouble with memory accesses across sockets. ER could be improved with a more static scheduling or a NUMA-aware blocking of the distribution of the lower rows. Despite this, no matrix performs worse on 28 cores than on 14 cores, and the performance of ibm_matrix_2 and TSOPF_RS_b300_c2 is boosted by the lower methods.

(a) 68 KNL Cores (1 Socket)
(b) 68 KNL Cores X 2 Threads (1 Socket)
Fig. 11: Speedup on Intel KNL. The LS bars are speedup only due to level scheduling with point-to-point synchronization and the LS+Lower bars are using both level scheduling and best lower method.

In Fig. 11, we present the speedup of Javelin on KNL. KNL is a many-core system designed for very high levels of concurrency. Each core is relatively slow compared to the fast, massively out-of-order cores on Haswell; therefore, any inefficiency in parallelizing ILU is noticeable. In this case, the geometric mean of the best speedup over the parameters tested is 25.1x. Again, this value is not representative of the performance we observe. The first subfigure presents the speedup on 68 cores, which is all the cores of one socket with one thread each. We see that for most matrices the speedup is around 30x using only level scheduling and point-to-point synchronization. As a fraction of total system use, this is less than what is achieved on Haswell. The main reason is that the required level of concurrency may not be extractable from the matrices. The lower methods now only help in two cases, and only by a small fraction. This seems to be due to the overhead of using an OpenMP task queue with so many tasks at this thread count; a specialized lightweight tasking library is currently being constructed in Javelin for this reason. Despite this, we will observe the importance of the lower stage for stri in the next section. In the second subfigure, we test 136 threads (i.e., 68 cores with 2 threads each) to examine the use of multiple threads per core. While minor performance gains can be had for some matrices, we see that over-subscribing cores does not yield large improvements with Javelin due to pressure on the memory system. However, performance does not generally degrade.

VI Performance evaluation of triangular solve

This section provides an evaluation of sparse triangular solve (stri) inside Javelin. Recall that the objective of Javelin is to provide a scalable incomplete factorization that can be used as a preconditioner for an iterative solver, with a structure that makes stri and spmv scale. In standard execution, the incomplete factorization may be formed only once, but stri may be called thousands of times; exact iteration counts for a subset of matrices in our test suite can be found in Table II. We only consider stri here, as this is the primary kernel needed by methods like GMRES that use ILU, and we leave spmv for future work that addresses preconditioners that use spmv, such as successive over-relaxation. We compare performance against a standard implementation of stri in CSR format using OpenMP with barriers between levels of a level-set ordering, as done in previous works [14, 13, 1], and label it CSR-LVL. WSMP is not reported due to its lack of performance and its inability to factor many matrices. Using CSR-LVL as the base of comparison, performance is measured as

speedup_m(A, p) = time_CSR-LVL(A, 1) / time_m(A, p),

where m is the method, A is the matrix, and p is the number of cores. The two methods compared from Javelin are using only the level scheduling stage (LS) and using level scheduling together with the lower-stage blocking that is automatically picked by Javelin (LS + Lower). Note that LS alone is the same triangular solve method as in [13, 14].

(a) 14 Haswell Cores (1 Socket)
(b) 68 KNL Cores (1 Socket)
Fig. 12: The maximal speedup of stri using Javelin. The CSR-LVL bars are the speedup of the standard level-set stri implemented with OpenMP. The LS bars are due only to using the level scheduling stage of Javelin, and LS + Lower is the speedup using both stages.

In Fig. 12, we provide the maximal speedup of stri on one socket of Intel Haswell and Intel KNL. As is known, the traditional method of using level sets with barriers does not scale well. Using only the level scheduling method with pruning, Javelin achieves much better scaling. However, in some cases, such as wang3 on Haswell, the difference from the base case can be very small. The lower-stage blocks help the performance of stri on all matrices and increase the performance of wang3. This is seen most on Intel KNL, where the lower stage does not seem to help incomplete factorization as much as it does on Haswell. We believe this behavior is due to the parameters that choose between ER and SR, the lower-stage size, and the subblocking having been tuned with more weight given to stri than to ILU. This weighting reflects the fact that an iterative method's time is dominated by stri rather than ILU. Further investigation is needed to provide a better balance for different matrices, but this may require some estimate of the conditioning of the linear system.

VII Sensitivity analysis

In this section, we will examine options that can be modified in Javelin, and their general effects. We will examine the choice of ordering on iteration count and the distribution of levels for different choices of when to start the lower stage.

Matrix AMD RCM ND NAT LS-RCM LS-ND
offshore 11 10 11 9 11 11
parabolic_fem 300 116 270 288 113 270
af_shell3 420 381 648 284 350 390
thermal2 764 385 750 813 540 753
ecology2 1532 889 1152 881 868 1503
apache2 597 270 597 275 487 603
TABLE II: The number of iterations needed to converge based on ordering for group A matrices. AMD is symmetric approximate minimum degree (SYMAMD), NAT is the natural order, RCM is reverse Cuthill-McKee, ND is nested dissection, LS-RCM is the level ordering after RCM, and LS-ND is the level ordering after ND.

Iteration count. As previously stated, the number of iterations needed to converge to a solution depends not only on the conditioning of the system, but also on the ordering of the sparse coefficient matrix. Though some work has been done to better understand this phenomenon, it is still not fully understood why the order of the sparse coefficient matrix has such a strong impact on the convergence rate. It is known that Coloring and ND orderings generally increase iteration counts, while orderings like RCM decrease them. In Table II, we provide the number of iterations needed to converge to a solution within a fixed relative error tolerance for each of the matrices in group A, using an array of different orderings. We consider SYMAMD (AMD) due to a recommendation in [11], Reverse Cuthill-McKee (RCM), ND, and the natural order (NAT). Moreover, we consider the level scheduling ordering imposed on top of coefficient matrices preordered with RCM (LS-RCM) and ND (LS-ND) in order to measure how the level-set ordering done in Javelin affects the iteration count. Coloring is not considered, as it is known to be worse in terms of iterations than any other ordering considered here. We note that in most cases, RCM and NAT have the fewest iterations. However, LS-RCM is close to or even better than RCM in several cases. In Fig. 13, we present the speedup of group A matrices using only level scheduling with point-to-point synchronizations when the matrices are initially ordered with RCM. The speedup is calculated relative to a serial baseline with ND ordering. We see that the speedup is comparable to those in Section V. Therefore, either ordering is a candidate for increased performance using Javelin. However, the speedup relative to itself is slightly less than with ND ordering; the largest factor is the size of the level sets found. We also note that the speedup of stri is proportional to what can be achieved by ILU. As a result, it is up to the user to determine whether the speedup offsets the number of iterations. For example, the first 3 matrices in Fig. 13 would most likely be best preordered with ND, since they have good speedups and few iterations.

Fig. 13: The relative speedup of group A matrices on Intel Haswell when the input matrix is first ordered with RCM. We notice that we are able to obtain similar performance as when we first ordered with ND.

Levels and lower size. Here we provide an analysis of the number of levels and of the rows selected to be solved using one of the two lower methods. In Table III, we provide the statistics related to the level selection when considering the S(L+U) pattern. We note that there are other factors, in addition to the minimal number of rows per level, that we consider in selecting which rows are permuted to the end; these factors include the row density and the relative location of the row. As pointed out previously, we do not want to permute the rows of small or dense levels that sit in the middle of two larger levels.

Matrix Lvl Min Max Med R-16 R-24 R-32
wang3 10 5 1.11e4 582 5 21 21
TSOPF_RS… 180 1 4741 109 0 430 529
3D_28984… 34 1 9054 383 24 24 48
ibm_mat… 29 4 3.02e4 1332 20 41 41
fem_filter 554 2 9003 3 1792 1810 1862
trans4 20 1 8.57e4 84 13 13 13
scircuit 34 1 6.23e4 117 53 87 117
transient 16 1 7.82e4 24 2 47 88
offshore 74 1 3.04e4 2724 58 76 529
ASIC_320ks 16 3 2.23e5 2991 3 3 3
af_shell3 630 1 2.31e4 5 1751 2154 3682
parabolic_fem 28 1 1.02e5 5068 16 33 33
ASIC_680ks 21 1 5.84e5 1316 26 26 50
apache2 13 3 3.22e5 2961 3 19 19
tmt_sym 28 1 1.78e5 7393 18 18 47
ecology2 13 1 4.42e5 189 24 24 24
thermal2 27 1 2.72e5 16951 7 7 31
G3_circuit 13 2 6.19e5 4109 11 11 38
TABLE III: Level set information of the S(L+U) pattern. Lvl is the number of levels found; Min is the minimum, Max the maximum, and Med the median number of rows in a level. R-A is the number of rows moved to the end of the coefficient matrix when the minimum-rows-per-level threshold is set to A.

From Table III, we can observe that most matrices have only tens to hundreds of levels. Even matrices with high row density, such as TSOPF_RS_b300_c2 and offshore, fit into this range. We would like to see fewer levels because this 1) reduces the preprocessing time and 2) means the level sets tend to be larger, allowing for more concurrency. Because of the level-size distribution, the median becomes the best estimator of possible parallel performance. We see that most matrices can support hundreds of concurrent threads based on this median value. The exceptions are fem_filter, trans4, and af_shell3. While af_shell3 only has a median value of 5, we find that level scheduling still performs as well as or better than on other matrices. However, the other two do not do well with level scheduling. On a varying number of cores, the lower method is able to boost trans4, but not fem_filter. We observe that trans4 has relatively few rows that ever get permuted to the end (i.e., 13); even so, using SR we are able to improve its on-socket speedup. We believe we could improve this further by selecting a better tile size and by changing the tasking system to reduce overhead. Moreover, transient also suffers from a small median value; however, the lower method improves its on-socket performance on both Haswell and KNL, despite the current overheads in the lower method.

Matrix Min Max Median
TSOPF_RS_b300_c2 1 4753 122
3D_28984_Tetra 4 9059 543
ibm_matrix_2 1 30259 1361
trans4 1 8549 493
TABLE IV: Level set information of the S(L) pattern. Min is the minimum, Max the maximum, and Median the median number of rows in a level.

Additionally, we look at similar statistics for the S(L) pattern in Table IV. We note that this pattern can only be used with level scheduling alone or with level scheduling plus Even-Rows, but not with the Segmented-Row method. We note that the median number of rows per level increases for all of these matrices, as we would expect. However, the increase is very small except in a few cases such as trans4, and despite this increase, we could not get any real additional speedup for trans4 using the S(L) pattern. Therefore, by default we always recommend using the S(L+U) pattern, as it allows SR to be run and enables tiling for stri.

VIII Conclusions

This paper presented a new scalable framework for incomplete factorization, called Javelin, aimed at allowing traditional thresholding and fill-level options while being optimized for sparse matrix-vector multiplication and sparse triangular solves on current many-core systems. The framework uses level scheduling of up-looking LU factorization implemented with lightweight point-to-point synchronizations, followed by a second (i.e., lower) stage when level scheduling no longer provides the needed degree of concurrency. The lower stage is designed by co-optimizing the incomplete factorization and the stri that follows it. This is achieved by reducing the serial portion and the imbalance to a minimum by design, thus helping to achieve strong scalability. We demonstrated that Javelin achieves good on-socket speedups on both Intel Haswell and Intel KNL systems for a wide array of coefficient matrices, for both ILU and triangular solve. Moreover, we examined the effect that the required reordering has on the iteration count of several test systems. This demonstrates that not only can we scale incomplete factorization, but we can also leave the system in an ordering that has desirable properties.

Acknowledgments

We thank Andrew Bradley and Siva Rajamanickam for helpful discussions while at Sandia National Laboratories.

References

  • [1] H. Kabir, J. D. Booth, G. Aupy, A. Benoit, Y. Robert, and P. Raghavan, “STS-k: a multilevel sparse triangular solution scheme for NUMA multicores,” in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC 2015, Austin, TX, USA, November 15-20, 2015, 2015, pp. 55:1–55:11.
  • [2] S. MacLachlan, D. Osei-Kuffuor, and Y. Saad, “Modification and compensation strategies for threshold-based incomplete factorizations,” SIAM J. Sci. Comput., vol. 34, no. 1, pp. 48–75, Jan. 2012.
  • [3] E. Chow and A. Patel, “Fine-grained parallel incomplete LU factorization,” SIAM Journal on Scientific Computing, vol. 37, no. 2, pp. C169–C193, Jan 2015.
  • [4] P. Hénon and Y. Saad, “A parallel multistage ILU factorization based on a hierarchical graph decomposition,” SIAM Journal on Scientific Computing, vol. 28, no. 6, pp. 2266–2293, Jan 2006.
  • [5] J. D. Booth, K. Kim, and S. Rajamanickam, “A comparison of high-level programming choices for incomplete sparse factorization across different architectures,” in 2016 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW).   IEEE, May 2016.
  • [6] D. Hysom and A. Pothen, “A scalable parallel algorithm for incomplete factor preconditioning,” SIAM SISC, vol. 22, no. 6, pp. 2194–2215, Jan 2001.
  • [7] X. S. Li and M. Shao, “A supernodal approach to incomplete LU factorization with partial pivoting,” ACM Transactions on Mathematical Software, vol. 37, no. 4, pp. 1–20, Feb 2011.
  • [8] E. Chow and Y. Saad, “Experimental study of ILU preconditioners for indefinite matrices,” Journal of Computational and Applied Mathematics, vol. 86, no. 2, pp. 387–414, Dec 1997.
  • [9] X. Dong and G. Cooperman, “A bit-compatible parallelization for ILU(k) preconditioning,” in EuroPar 2011 Parallel Processing.   Springer Berlin Heidelberg, 2011, pp. 66–77.
  • [10] F. Gibou and C. Min, “On the performance of a simple parallel implementation of the ILU-PCG for the poisson equation on irregular domains,” Journal of Computational Physics, vol. 231, no. 14, pp. 4531–4536, May 2012.
  • [11] M. Benzi, D. B. Szyld, and A. van Duin, “Orderings for incomplete factorization preconditioning of nonsymmetric problems,” SIAM Journal on Scientific Computing, vol. 20, no. 5, pp. 1652–1670, Jan 1999.
  • [12] S. W. Hammond and R. Schreiber, “Efficient ICCG on a shared memory multiprocessor,” International Journal of High Speed Computing, vol. 04, no. 01, pp. 1–21, Mar 1992.
  • [13] J. Park, M. Smelyanskiy, N. Sundaram, and P. Dubey, “Sparsifying synchronization for high-performance shared-memory sparse triangular solver,” in Proceedings of the 29th International Conference on Supercomputing - Volume 8488, ser. ISC 2014.   New York, NY, USA: Springer-Verlag New York, Inc., 2014, pp. 124–140.
  • [14] A. M. Bradley, “A hybrid multithreaded direct sparse triangular solver,” in 2016 Proceedings of the Seventh SIAM Workshop on Combinatorial Scientific Computing.   Society for Industrial and Applied Mathematics, Jan 2016, pp. 13–22.
  • [15] W. Liu and B. Vinter, “CSR5: An efficient storage format for cross-platform sparse matrix-vector multiplication,” in Proceedings of the 29th ACM International Conference on Supercomputing (ICS ’15).   ACM Press, 2015.
  • [16] G. E. Blelloch, M. A. Heroux, and M. Zagha, “Segmented operations for sparse matrix computation on vector multiprocessors,” Pittsburgh, PA, USA, Tech. Rep., 1993.
  • [17] K. Kim, S. Rajamanickam, G. Stelle, H. C. Edwards, and S. Olivier, “Task parallel incomplete Cholesky factorization using 2D partitioned-block layout,” published on arXiv.
  • [18] M. Shantharam, S. Srinivasmurthy, and P. Raghavan, “Fault tolerant preconditioned conjugate gradient for sparse linear system solution,” in International Conference on Supercomputing, ICS’12, Venice, Italy, June 25-29, 2012, 2012, pp. 69–78.
  • [19] J. D. Booth, “Norm-coarsened ordering for parallel incomplete cholesky preconditioning,” in 2012 SC Companion: High Performance Computing, Networking Storage and Analysis, Salt Lake City, UT, USA, November 10-16, 2012, 2012, pp. 1532–1533.
  • [20] J. D. Booth and S. Rajamanickam, “HiLUK: Scalable incomplete factorization utilizing combinatorial methods to reduce overheads,” in SIAM CSC, 2016.
  • [21] J. Towns, T. Cockerill, M. Dahan, I. Foster, K. Gaither, A. Grimshaw, V. Hazlewood, S. Lathrop, D. Lifka, G. D. Peterson, R. Roskies, J. R. Scott, and N. Wilkins-Diehr, “XSEDE: Accelerating scientific discovery,” Computing in Science & Engineering, vol. 16, no. 5, pp. 62–74, Sept.-Oct. 2014.
  • [22] T. A. Davis and Y. Hu, “The University of Florida sparse matrix collection,” ACM TOMS, vol. 38, no. 1, pp. 1:1–1:25, Dec. 2011.
  • [23] W. H. A. Schilders, “Iterative solution of linear systems in circuit simulation,” in Progress in Industrial Mathematics at ECMI 2000.   Springer Berlin Heidelberg, 2002, pp. 272–277.
  • [24] Z. Li and C.-J. Shi, “An efficiently preconditioned GMRES method for fast parasitic-sensitive deep-submicron VLSI circuit simulation,” in Design, Automation and Test in Europe.   IEEE, 2005.
  • [25] G. Karypis and V. Kumar, “Metis – unstructured graph partitioning and sparse matrix ordering system, version 2.0,” Tech. Rep., 1995.
  • [26] M. Heroux, R. Bartlett, V. H. R. Hoekstra, J. Hu, T. Kolda, R. Lehoucq, K. Long, R. Pawlowski, E. Phipps, A. Salinger, H. Thornquist, R. Tuminaro, J. Willenbring, and A. Williams, “An Overview of Trilinos,” Sandia National Laboratories, Tech. Rep. SAND2003-2927, 2003.
  • [27] A. Gupta, “Enhancing performance and robustness of ILU preconditioners by blocking and selective transposition,” SIAM Journal on Scientific Computing, vol. 39, no. 1, pp. A303–A332, 2017.