A Novel Partitioning Method for Accelerating the Block Cimmino Algorithm

We propose a novel block-row partitioning method to improve the convergence rate of the block Cimmino algorithm for solving general sparse linear systems of equations. The convergence rate of the block Cimmino algorithm depends on the orthogonality among the block rows obtained by the partitioning method. The proposed method accounts for the numerical orthogonality among block rows through a row inner-product graph model of the coefficient matrix. In the graph partitioning formulation defined on this model, the partitioning objective of minimizing the cutsize directly corresponds to minimizing the sum of inter-block inner products between block rows, thus leading to an improvement in the eigenvalue spectrum of the iteration matrix. This in turn leads to a significant reduction in the number of iterations required for convergence. Extensive experiments conducted on a large set of matrices confirm the validity of the proposed method against a state-of-the-art method.

1 Introduction

Row projection methods are a class of linear system solvers [12, 13, 31] that are used for solving linear systems of equations of the form

$$Ax = b, \qquad (1)$$

where $A$ is an $n \times n$ sparse nonsymmetric nonsingular matrix and $x$ and $b$ are column vectors of size $n$. In these methods, the solution is computed through successive projections. There are two major variants, known as Kaczmarz [30] and Cimmino [17]. Kaczmarz obtains the solution through a product of orthogonal projections, whereas Cimmino reaches the solution through a sum of orthogonal projections. Cimmino is known to be more amenable to parallelism than Kaczmarz [12]. The number of iterations required by the Cimmino algorithm, however, can be quite large. One alternative is the block Cimmino method [6], which is a block row projection method. Iterative block Cimmino has been studied extensively in [6, 8, 12, 16, 21, 23, 24]. A direct version of block Cimmino based on augmented block row projections is proposed in [22]. However, as in any other direct solver [3, 11, 34, 36], this approach suffers from extensive memory requirements due to fill-in.

In the block Cimmino scheme, the linear system in Eq. 1 is partitioned into $K$ block rows, where $K$ is the number of block rows, as follows:

$$\begin{bmatrix} A_1 \\ A_2 \\ \vdots \\ A_K \end{bmatrix} x = \begin{bmatrix} b_1 \\ b_2 \\ \vdots \\ b_K \end{bmatrix}. \qquad (2)$$

In Eq. 2, the coefficient matrix $A$ and the right-hand side vector $b$ are partitioned conformably. Here $A_i$ is a submatrix of size $m_i \times n$ and $b_i$ is a subvector of size $m_i$, for $1 \le i \le K$, where

$$\sum_{i=1}^{K} m_i = n. \qquad (3)$$

The block Cimmino scheme is given in Algorithm 1, where $A_i^+$ is the Moore-Penrose pseudoinverse of $A_i$, defined as

$$A_i^+ = A_i^T \left( A_i A_i^T \right)^{-1}. \qquad (4)$$

In Algorithm 1, $A_i^+$ is used only for notational clarity and is never computed explicitly. In fact, at line 4 of Algorithm 1, the minimum-norm solution of an underdetermined linear least squares problem is obtained via the augmented system approach, as discussed later. In Algorithm 1, the projection vectors $\delta_i$, which are of size $n$, can be computed independently in parallel without any communication; hence the block Cimmino algorithm is quite suitable for parallel computing platforms.

1:  Choose an initial guess $x^{(0)}$
2:  while not converged do
3:     for $i = 1, \ldots, K$ do
4:        $\delta_i^{(k)} = A_i^+ \left( b_i - A_i x^{(k)} \right)$
5:     end for
6:     $x^{(k+1)} = x^{(k)} + \omega \sum_{i=1}^{K} \delta_i^{(k)}$
7:  end while
Algorithm 1 Block Cimmino method

An iteration of the block Cimmino method can be reformulated as follows:

$$x^{(k+1)} = x^{(k)} + \omega \sum_{i=1}^{K} A_i^+ \left( b_i - A_i x^{(k)} \right) = Q x^{(k)} + \omega \sum_{i=1}^{K} A_i^+ b_i, \qquad (5)$$

where $Q = I - \omega H$ is the iteration matrix of the block Cimmino algorithm. Here, $H$ is the sum of the orthogonal projectors $P_{\mathcal{R}(A_i^T)}$ (projections onto $\mathcal{R}(A_i^T)$) and is defined by

$$H = \sum_{i=1}^{K} P_{\mathcal{R}(A_i^T)} = \sum_{i=1}^{K} A_i^+ A_i. \qquad (6)$$

The projections in the block Cimmino iterations can be calculated using several approaches, such as the normal equations, the seminormal equations, the QR factorization, and the augmented system approach [5]. The normal and seminormal equation approaches are not considered since they can introduce numerical difficulties that may be disastrous when the problem is ill-conditioned. Although the QR factorization is numerically more stable, it is computationally expensive. Therefore, we use the augmented system approach, which requires the solution of a sparse linear system that can be performed efficiently by a sparse direct solver [6]. Note that if the partitioned matrix is in a column-overlapped block diagonal form, one could also use the algorithm in [37]. However, this approach is not considered since we assume no particular structure for the coefficient matrix.

In the augmented system approach, we obtain the minimum-norm solution $\delta_i$ of the underdetermined system

$$A_i \delta_i = r_i, \quad \text{where } r_i = b_i - A_i x^{(k)}, \qquad (7)$$

by solving the linear system

$$\begin{bmatrix} I & A_i^T \\ A_i & 0 \end{bmatrix} \begin{bmatrix} \delta_i \\ v_i \end{bmatrix} = \begin{bmatrix} 0 \\ r_i \end{bmatrix}. \qquad (8)$$

Hence, the solution of the augmented system gives the projection $\delta_i = A_i^+ r_i$.
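As a concrete illustration of the augmented system approach, the following minimal NumPy sketch (dense linear algebra for brevity; the actual solver uses the sparse direct solver MUMPS, and the function name here is illustrative) computes the projection $\delta_i = A_i^+ r_i$ by solving Eq. 8 and checks it against the explicit pseudoinverse of Eq. 4.

```python
import numpy as np

def project_via_augmented_system(A_i, r_i):
    """Minimum-norm solution of A_i * delta = r_i obtained from the
    augmented system [[I, A_i^T], [A_i, 0]] [delta; v] = [0; r_i] (Eq. 8)."""
    m_i, n = A_i.shape
    aug = np.block([[np.eye(n), A_i.T],
                    [A_i, np.zeros((m_i, m_i))]])
    rhs = np.concatenate([np.zeros(n), r_i])
    sol = np.linalg.solve(aug, rhs)
    return sol[:n]                                   # delta_i = A_i^+ r_i

# Tiny check with a random full-row-rank block (illustrative only)
rng = np.random.default_rng(0)
A_i = rng.standard_normal((3, 8))                    # m_i = 3 rows, n = 8 columns
r_i = rng.standard_normal(3)
delta = project_via_augmented_system(A_i, r_i)
assert np.allclose(delta, np.linalg.pinv(A_i) @ r_i)  # matches Eq. 4
```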

1.1 The Conjugate Gradient acceleration of block Cimmino method

The convergence rate of the block Cimmino algorithm is known to be slow [12]. In [12], the Conjugate Gradient (CG) method is proposed to accelerate the row projection method. It is also reported that the CG accelerated Cimmino method compares favorably with classical preconditioned Generalized Minimal Residual (GMRES) and Conjugate Gradient on the Normal Equations (CGNE) methods for the solution of nonsymmetric linear systems.

We note that the matrix $H$ in Eq. 6 is symmetric and positive definite. Hence, one can solve the following system using CG,

$$Hx = \xi, \quad \text{where } \xi = \sum_{i=1}^{K} A_i^+ b_i, \qquad (9)$$

and $x$ is the solution vector of the system in Eq. 1. Note that since the relaxation parameter $\omega$ would appear on both sides of this equation, it does not affect the convergence of CG. Algorithm 2 gives the pseudocode of the CG accelerated block Cimmino method, which is in fact the classical CG applied to the system given in Eq. 9.

1:  Choose $x^{(0)}$, set $k = 0$
2:  $r^{(0)} = \xi - H x^{(0)}$
3:  $p^{(0)} = r^{(0)}$
4:  while not converged do
5:     $\alpha_k = \left( r^{(k)T} r^{(k)} \right) / \left( p^{(k)T} H p^{(k)} \right)$
6:     $x^{(k+1)} = x^{(k)} + \alpha_k p^{(k)}$
7:     $r^{(k+1)} = r^{(k)} - \alpha_k H p^{(k)}$
8:     $\beta_k = \left( r^{(k+1)T} r^{(k+1)} \right) / \left( r^{(k)T} r^{(k)} \right)$
9:     $p^{(k+1)} = r^{(k+1)} + \beta_k p^{(k)}$
10:    $k = k + 1$
11:  end while
Algorithm 2 Conjugate Gradient acceleration of block Cimmino method

The matrix-vector multiplication with $H$ in CG is expressed as the solution of $K$ independent underdetermined systems at line 5, which can be performed in parallel; however, the resulting projection vectors need to be summed to obtain $H p^{(k)}$ by using an all-reduce operation.
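For illustration, the following NumPy sketch applies $H$ implicitly as a sum of per-block projections and runs classical CG on $Hx = \xi$ (Eq. 9). It uses explicit pseudoinverses instead of the augmented systems and factorizations employed by ABCD Solver, and all names are illustrative.

```python
import numpy as np

def cg_block_cimmino(blocks, rhs_blocks, tol=1e-8, max_iter=500):
    """Classical CG on H x = xi with H = sum_i A_i^+ A_i and xi = sum_i A_i^+ b_i."""
    n = blocks[0].shape[1]
    pinvs = [np.linalg.pinv(A_i) for A_i in blocks]          # A_i^+ (dense sketch)

    def apply_H(v):
        # Independent per-block projections; parallelizable, then summed
        return sum(P @ (A_i @ v) for P, A_i in zip(pinvs, blocks))

    xi = sum(P @ b_i for P, b_i in zip(pinvs, rhs_blocks))
    x = np.zeros(n)
    r = xi - apply_H(x)
    p = r.copy()
    rr = r @ r
    for _ in range(max_iter):
        Hp = apply_H(p)
        alpha = rr / (p @ Hp)
        x += alpha * p
        r -= alpha * Hp
        rr_new = r @ r
        if np.sqrt(rr_new) <= tol * np.sqrt(xi @ xi):
            break
        p = r + (rr_new / rr) * p
        rr = rr_new
    return x
```

In a usage sketch, one would split a nonsingular matrix into $K$ row blocks, call cg_block_cimmino on the blocks and the conformably partitioned right-hand side, and compare the result against a direct solve.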

1.2 The effect of partitioning

The convergence rate of the CG accelerated block Cimmino algorithm is essentially the convergence rate of CG applied to Eq. 9. A well-known upper bound on the convergence rate can be given in terms of the extreme eigenvalues ($\lambda_{\min}$ and $\lambda_{\max}$) of the coefficient matrix. Let

$$\kappa = \kappa_2(H) = \frac{\lambda_{\max}(H)}{\lambda_{\min}(H)} \qquad (10)$$

be the 2-norm condition number of $H$. Then, as in [27], an upper bound on the convergence rate of the CG accelerated block Cimmino can be given by

$$\| x^{(k)} - x^{*} \|_H \le 2 \left( \frac{\sqrt{\kappa} - 1}{\sqrt{\kappa} + 1} \right)^{k} \| x^{(0)} - x^{*} \|_H, \qquad (11)$$

where $x^{*}$ is the exact solution and $\| v \|_H = \sqrt{v^T H v}$. Furthermore, it has been shown that the convergence rate of CG depends not only on the extreme eigenvalues but also on the separation between those extreme eigenvalues and the interior eigenvalues [38], as well as on the clustering of the interior eigenvalues [29]. In summary, the convergence rate of the CG accelerated block Cimmino depends on the extreme eigenvalues and the number of eigenvalue clusters as well as the quality of the clustering.

Therefore, the partitioning of the coefficient matrix into block rows is crucial for improving the convergence rate of the CG accelerated block Cimmino algorithm, since it can improve the eigenvalue distribution of $H$. Note that the eigenvalues of $H$ are affected only by the block-row partitioning of the coefficient matrix and are independent of any column ordering [21].

Let the QR factorization of $A_i^T$ be defined as

$$A_i^T = Q_i R_i, \qquad (12)$$

where the matrices $Q_i$ and $R_i$ have dimensions $n \times m_i$ and $m_i \times m_i$, respectively. Then, the matrix $H$ will be

$$H = \sum_{i=1}^{K} Q_i Q_i^T = Q Q^T, \quad \text{where } Q = \left[ Q_1 \; Q_2 \; \cdots \; Q_K \right]. \qquad (13)$$

Since the eigenvalue spectrum of the matrix $QQ^T$ is the same as the spectrum of the matrix $Q^TQ$ [26], $H$ is similar to

$$Q^T Q = \begin{bmatrix} I & Q_1^T Q_2 & \cdots & Q_1^T Q_K \\ Q_2^T Q_1 & I & \cdots & Q_2^T Q_K \\ \vdots & \vdots & \ddots & \vdots \\ Q_K^T Q_1 & Q_K^T Q_2 & \cdots & I \end{bmatrix}, \qquad (14)$$

where the singular values of the matrix $Q_i^T Q_j$ are the cosines of the principal angles between the subspaces $\mathcal{R}(A_i^T)$ and $\mathcal{R}(A_j^T)$ [10]. Hence, the smaller the off-diagonal blocks of the matrix in Eq. 14, the more the eigenvalues of $H$ will be clustered around one, by the Gershgorin theorem [25]. Therefore, the convergence rate of the block Cimmino method highly depends on the orthogonality among the blocks. If the blocks are more orthogonal to each other, the inner products between rows of different blocks will be small and hence the eigenvalues will be clustered around one. Therefore, CG is expected to converge in a smaller number of iterations.
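For concreteness, since the diagonal entries of $Q^TQ$ in Eq. 14 are all equal to one, the Gershgorin theorem yields the following standard bound (stated here for clarity, not quoted from the paper): every eigenvalue $\lambda$ of $H$ satisfies

$$| \lambda - 1 | \;\le\; \max_{i} \sum_{j \neq i} \left| \left( Q^T Q \right)_{i,j} \right|,$$

so the smaller the entries of the off-diagonal blocks $Q_k^T Q_\ell$, the tighter the clustering of the spectrum of $H$ around one.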

In a recent study [21], a hypergraph partitioning method was proposed to find a “good” block-row partitioning for the CG accelerated block Cimmino algorithm. It was also shown to perform better than Cuthill-McKee [18] based methods (with or without filtering small elements in absolute value). However, the hypergraph partitioning method considers the structural orthogonality among block rows only loosely and does not consider the numerical values at all.

In this work, we propose a novel graph theoretical block-row partitioning method for the CG accelerated block Cimmino algorithm. For this purpose, we introduce a row inner-product graph model of a given matrix, and the problem of finding a “good” block-row partitioning is then formulated as a graph partitioning problem on this model. The proposed method takes the numerical orthogonality between the block rows of $A$ into account. In the proposed method, the partitioning objective of minimizing the cutsize directly corresponds to minimizing the sum of inter-block inner products between block rows, thus leading to an improvement in the eigenvalue spectrum of $H$.

The validity of the proposed method is confirmed against two baseline methods by conducting experiments on a large set of matrices that arise in a variety of real-life applications. One of the two baseline methods is the state-of-the-art hypergraph partitioning method introduced in [21]. We conduct experiments to study the effect of the partitioning on the eigenvalue spectrum of $H$. We also conduct experiments to compare the performance of the three methods in terms of the number of CG iterations and the parallel CG time to solution. The results of these experiments show that the proposed partitioning method is considerably better than both baseline methods in terms of all of these performance metrics. Finally, we compare the preprocessing overheads of the methods and show that the proposed method incurs much less overhead than the hypergraph partitioning method, thus allowing a better amortization.

The rest of the paper is organized as follows. Section 2 presents the proposed partitioning method and its implementation. In Section 3, we present and discuss the experimental results. Finally, Section 4 concludes the paper and gives directions for future research.

2 The proposed partitioning method

In this section, we first describe the row inner-product graph model and then show that finding a “good” block-row partitioning can be formulated as a graph partitioning problem on this graph model. Finally, we give the implementation details for the construction and partitioning of the graph model. We refer the reader to Appendix A for a short background on graph partitioning.

2.1 Row inner-product graph model

In the row inner-product graph $\mathcal{G}(A) = (\mathcal{V}, \mathcal{E})$ of matrix $A$, vertices represent the rows of matrix $A$ and edges represent nonzero inner products between rows. That is, $\mathcal{V}$ contains a vertex $v_i$ for each row $r_i$ of matrix $A$, and $\mathcal{E}$ contains an edge $e_{i,j}$ only if the inner product of rows $r_i$ and $r_j$ is nonzero. That is,

$$e_{i,j} \in \mathcal{E} \iff r_i \, r_j^T \neq 0. \qquad (15)$$

Each vertex $v_i$ can be associated with either a unit weight or a weight equal to the number of nonzeros in row $r_i$, that is,

$$w(v_i) = 1 \quad \text{or} \quad w(v_i) = nnz(r_i), \qquad (16)$$

respectively. Each edge $e_{i,j}$ is associated with a cost equal to the absolute value of the respective inner product. That is,

$$cost(e_{i,j}) = | r_i \, r_j^T |. \qquad (17)$$

This graph is topologically equivalent to the standard graph representation of the symmetric matrix $B = AA^T$ resulting from the sparse matrix-matrix multiplication operation $AA^T$. That is, the sparsity pattern of the off-diagonal part of $B$ corresponds to the adjacency matrix representation of $\mathcal{G}(A)$. Each off-diagonal nonzero $b_{i,j}$ of the resulting matrix incurs an edge $e_{i,j}$. Since each nonzero entry $b_{i,j}$ of $B$ is computed as the inner product of row $r_i$ and row $r_j$, the absolute value of the nonzero determines the cost of the respective edge $e_{i,j}$. That is, since $b_{i,j} = r_i \, r_j^T$, we have $cost(e_{i,j}) = |b_{i,j}|$.
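A minimal SciPy sketch of this construction is given below (the paper's implementation relies on an SpGEMM kernel rather than SciPy, and the function name is illustrative): the off-diagonal nonzeros of $B = AA^T$ become the weighted edges of $\mathcal{G}(A)$.

```python
import numpy as np
import scipy.sparse as sp

def row_inner_product_graph(A):
    """Edge list and edge costs of the row inner-product graph G(A),
    obtained from the off-diagonal nonzeros of B = A A^T (Eq. 17)."""
    A = sp.csr_matrix(A)
    B = A @ A.T                          # SpGEMM: B[i, j] = <row i, row j>
    B.eliminate_zeros()                  # drop entries that cancelled to zero
    B = B.tocoo()
    mask = B.row < B.col                 # keep each undirected edge once, skip diagonal
    edges = np.column_stack((B.row[mask], B.col[mask]))
    costs = np.abs(B.data[mask])         # cost(e_ij) = |B[i, j]|
    return edges, costs
```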

Figure 3(a) shows a sample sparse matrix $A$, and Fig. 3(b) depicts the proposed row inner-product graph $\mathcal{G}(A)$ for this sample matrix. As seen in Fig. 3(b), $\mathcal{G}(A)$ contains one vertex for each row of $A$ and one edge for each nonzero row inner product, where the cost of each edge is equal to the absolute value of the corresponding inner product. In Fig. 3(b), edges with large costs are shown with thick lines in order to make such high inner-product values more visible.

(a) Sample matrix
(b) Row inner-product graph of A
Figure 3: Row inner-product graph model.

Figure (a)a shows the resulting matrix $B$ of the sparse matrix-matrix multiplication $AA^T$ for the sample matrix. Only the off-diagonal nonzero entries, together with their values, are shown, since the values of the diagonal entries do not affect $\mathcal{G}(A)$. The cells that contain large nonzeros are shown with a black background in order to make such high values more visible. Comparing this matrix with Fig. 3(b) shows that the topology of the standard graph model of matrix $B$ is equivalent to $\mathcal{G}(A)$. Moreover, the values of the nonzero entries of matrix $B$ are equal to the costs of the respective edges of $\mathcal{G}(A)$; that is, each off-diagonal nonzero $b_{i,j}$ incurs an edge $e_{i,j}$ with $cost(e_{i,j}) = |b_{i,j}|$.

2.2 Block-row partitioning via partitioning of $\mathcal{G}(A)$

A $K$-way partition $\Pi = \{\mathcal{V}_1, \mathcal{V}_2, \ldots, \mathcal{V}_K\}$ of $\mathcal{G}(A)$ can be decoded as a partial permutation on the rows of $A$ to induce a permuted matrix $\hat{A}$, where

$$\hat{A} = PA = \begin{bmatrix} A_1 \\ A_2 \\ \vdots \\ A_K \end{bmatrix}. \qquad (18)$$

Here, $P$ denotes the row permutation matrix defined by the $K$-way partition as follows: the rows associated with the vertices in $\mathcal{V}_{k+1}$ are ordered after the rows associated with the vertices in $\mathcal{V}_k$ for $k = 1, \ldots, K-1$. That is, block row $A_k$ contains the set of rows corresponding to the set of vertices in part $\mathcal{V}_k$ of partition $\Pi$, where the ordering of the rows within block row $A_k$ is arbitrary for each $k$. Note that we use the notation $A_k$ to denote both block row $k$ and the set of rows in it. Since column permutation does not affect the convergence of the block Cimmino algorithm [21], the original column ordering of $A$ is maintained.

Consider a partition $\Pi$ of $\mathcal{G}(A)$. The weight of a part $\mathcal{V}_k$ is equal to either the number of rows or the number of nonzeros in block row $A_k$, depending on the vertex weighting scheme used according to Eq. 16. That is,

$$W(\mathcal{V}_k) = \sum_{v_i \in \mathcal{V}_k} w(v_i) = \begin{cases} |\mathcal{V}_k|, & \text{the number of rows in } A_k, \text{ for unit vertex weights,} \\ nnz(A_k), & \text{for nonzero-count vertex weights.} \end{cases} \qquad (19)$$

Therefore, in partitioning $\mathcal{G}(A)$, the partitioning constraint of maintaining balance among part weights according to Eq. 28 corresponds to maintaining balance on either the number of rows or the number of nonzeros among the block rows.

Consider a partition $\Pi$ of $\mathcal{G}(A)$. A cut edge $e_{i,j}$ between parts $\mathcal{V}_k$ and $\mathcal{V}_\ell$ represents a nonzero inter-block inner product between block rows $A_k$ and $A_\ell$. Therefore, the cutsize of $\Pi$ (given in Eq. 29) is equal to

$$cutsize(\Pi) = \sum_{e_{i,j} \in \mathcal{E}_{cut}} cost(e_{i,j}) = \sum_{e_{i,j} \in \mathcal{E}_{cut}} | r_i \, r_j^T |, \qquad (20)$$

which corresponds to the total sum of the inter-block inner products. So, in partitioning $\mathcal{G}(A)$, the partitioning objective of minimizing the cutsize corresponds to minimizing the sum of inter-block inner products between block rows. Therefore, this partitioning objective corresponds to making the block rows numerically more orthogonal to each other. In this way, we expect this method to yield faster convergence in the CG accelerated block Cimmino algorithm.
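As an illustration of this correspondence, the following sketch (illustrative names, SciPy-based) evaluates the cutsize of a given row partition directly from $B = AA^T$, i.e., the total inter-block inner-product sum that the partitioner tries to minimize.

```python
import numpy as np
import scipy.sparse as sp

def interblock_inner_product_sum(A, part):
    """Sum of |<r_i, r_j>| over row pairs assigned to different parts,
    i.e. the cutsize of the row partition `part` (array of part ids per row)."""
    part = np.asarray(part)
    B = (sp.csr_matrix(A) @ sp.csr_matrix(A).T).tocoo()
    cut = (part[B.row] != part[B.col]) & (B.row < B.col)
    return np.abs(B.data[cut]).sum()
```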

We introduce Figs. 6 and 9 in order to clarify the proposed graph partitioning method for block-row partitioning. Figure 6(a) shows a straightforward 3-way block-row partition of the sample matrix given in Fig. 3(a), where the first four, the second four, and the third four consecutive rows in the original order constitute block rows $A_1$, $A_2$, and $A_3$, respectively. Figure 6(b) shows the 3-way vertex partition of $\mathcal{G}(A)$ that corresponds to this straightforward 3-way block-row partition. Figure 9(a) shows a “good” 3-way vertex partition of $\mathcal{G}(A)$ obtained by using the graph partitioning tool METIS [33]. Figure 9(b) shows the permuted matrix and block-row partition induced by this 3-way vertex partition.

As seen in Figs. 6 and 9, both the straightforward and the “good” block-row partitions achieve perfect balance on the row counts of the blocks by having exactly four rows per block. The quality difference between the straightforward and “good” block-row partitions can easily be seen by comparing the 3-way partitions of $\mathcal{G}(A)$ in Figs. 6(b) and 9(a), respectively. As seen in Fig. 6(b), eight out of nine thick edges remain on the cut of the straightforward partition, whereas all nine thick edges remain internal in the “good” partition, as seen in Fig. 9(a).

Figure 6: (a) Straightforward 3-way row partition of $A$ and (b) 3-way partition of $\mathcal{G}(A)$ induced by (a).
Figure 9: (a) “Good” 3-way partition of $\mathcal{G}(A)$ and (b) 3-way row partition of $A$ induced by (a).

Figure 12 shows the block-checkerboard partitioning of the resulting matrix $B = AA^T$ induced by the straightforward and “good” block-row partitions of the sample matrix given in Figs. 6(a) and 9(b), respectively. Note that both the rows and the columns of the $B$ matrix are partitioned conformably with the row partitions of the $A$ matrix. The comparison of Figs. 12(a) and 12(b) shows that large nonzeros (dark cells) are scattered across the off-diagonal blocks of the $B$ matrix for the straightforward partitioning, whereas they are clustered in the diagonal blocks of $B$ for the “good” partitioning.

Figure 12: Block-checkerboard partition of matrix $B = AA^T$ induced by the 3-way (a) straightforward (Fig. 6(a)) and (b) “good” (Fig. 9(b)) block-row partitions of matrix $A$.
Figure 15: Inter-block row-inner-product matrix IP for (a) straightforward and (b) “good” block-row partitions.

We introduce Fig. 15 to compare the quality of the straightforward and “good” block-row partitions in terms of inter-block row-inner-product sums. In the figure, each off-diagonal entry $ip_{k,\ell}$ of the IP matrix shows the sum of the inter-block row inner products between the respective block rows $A_k$ and $A_\ell$. That is,

$$ip_{k,\ell} = \sum_{r_i \in A_k} \sum_{r_j \in A_\ell} | r_i \, r_j^T |. \qquad (21)$$

As seen in Fig. 15, the off-diagonal entries of IP are considerably larger for the straightforward partition than for the “good” partition. Note that $ip_{k,\ell}$ is also equal to the sum of the absolute values of the nonzeros of the off-diagonal block at row block $k$ and column block $\ell$ of the $B = AA^T$ matrix, i.e.,

$$ip_{k,\ell} = \sum_{b_{i,j} \in B_{k,\ell}} |b_{i,j}|. \qquad (22)$$

Therefore, the total sum of inter-block inner products, $\sum_{k \neq \ell} ip_{k,\ell}$, is considerably larger for the straightforward partition than it is for the “good” partition.
(a) Straightforward partitioning (Fig. 6(a))
(b) “Good” partitioning (Fig. 9(b))
Figure 18: Eigenvalue spectrum of $H$ for the block-row partitionings of the sample matrix given in Fig. 3(a).

Figures 18(a) and 18(b), respectively, show the eigenvalue spectrum of $H$ for the straightforward and “good” partitionings. As seen in the figures, the eigenvalues for the “good” partitioning reside in a much narrower interval around one than those for the straightforward partitioning. As seen in Fig. 18(b), with the “good” partitioning the eigenvalues are much better clustered around one and the smallest eigenvalue is much larger than that obtained with the straightforward partitioning method.

2.3 Implementation

Implementation of the proposed partitioning method consists of two stages: constructing $\mathcal{G}(A)$ and partitioning $\mathcal{G}(A)$.

Constructing $\mathcal{G}(A)$: For constructing $\mathcal{G}(A)$, we choose to use a basic sparse matrix-matrix multiplication (SpGEMM) [28] kernel due to existing efficient implementations. The edges of $\mathcal{G}(A)$ are obtained from the off-diagonal nonzeros of the $B = AA^T$ matrix, whereas their costs are obtained from the absolute values of those nonzeros.

Note that when matrix $A$ has dense column(s), the corresponding $B = AA^T$ matrix will be quite dense. In other words, when a column of $A$ has $d$ nonzeros, the corresponding $B$ matrix will have at least $d^2$ nonzeros. For example, Fig. 23(a) shows a sparse matrix which has a dense column. As seen in Fig. 23(b), the corresponding $B$ matrix is quite dense. Clearly, a large number of nonzeros in $B$ (i.e., a large number of edges in $\mathcal{G}(A)$) increases the memory requirement and computation cost of the SpGEMM as well as the time requirement for partitioning $\mathcal{G}(A)$.

In order to alleviate the aforementioned problem, we propose the following methodology for sparsening $\mathcal{G}(A)$. We identify a column $A(:,j)$ (in MATLAB [35] notation) of an $n \times n$ matrix $A$ as a dense column if it contains more than a prescribed number of nonzeros. Given $A$, we extract a sparser matrix $\tilde{A}$ by keeping only the largest (in absolute value) nonzeros of the dense columns of $A$. That is, the smallest entries of a dense $A$-matrix column are ignored while constructing the corresponding column of $\tilde{A}$. Hence, the SpGEMM operation is performed on $\tilde{A}$ to obtain the sparsened resulting matrix $\tilde{B} = \tilde{A}\tilde{A}^T$, which leads to a sparsened graph. For example, Fig. 23(c) shows the sparsity pattern of the matrix $\tilde{A}$ extracted from $A$ by keeping the largest nonzeros in the dense column of $A$. As seen in Fig. 23(d), matrix $\tilde{B}$ is very sparse compared to the matrix in Fig. 23(b). Note that both $\tilde{A}$ and $\tilde{B}$ are used only for constructing $\mathcal{G}(A)$; after the partitioning stage, both matrices are discarded. In the rest of the paper, $\tilde{A}$ will be referred to as $A$ for the sake of simplicity of presentation.
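The sparsening step can be sketched as follows (the column-density threshold and the number of retained entries are tunable parameters here, not the paper's exact settings, and the function name is illustrative).

```python
import numpy as np
import scipy.sparse as sp

def sparsen_dense_columns(A, col_nnz_threshold, keep):
    """In every column with more than `col_nnz_threshold` nonzeros,
    keep only the `keep` largest entries in absolute value."""
    A = sp.csc_matrix(A, copy=True)
    for j in range(A.shape[1]):
        start, end = A.indptr[j], A.indptr[j + 1]
        nnz_j = end - start
        if nnz_j > col_nnz_threshold:
            vals = A.data[start:end]
            drop = np.argsort(np.abs(vals))[: nnz_j - keep]
            vals[drop] = 0.0                  # zero out the smallest entries
    A.eliminate_zeros()
    return A.tocsr()
```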

Figure 23: Nonzero patterns of (a) $A$, (b) $AA^T$, (c) $\tilde{A}$, and (d) $\tilde{A}\tilde{A}^T$.

Partitioning $\mathcal{G}(A)$: We use the multilevel graph partitioning tool METIS [33] for partitioning $\mathcal{G}(A)$. Since METIS requires positive integer weights for edges, the edge costs, which are computed as floating-point values, must be cast to integers. In order to distribute the edge costs as evenly as possible between the smallest and largest edge costs in the graph, we scale them into a range from 1 to a maximum integer weight $W$ as follows in order to compute the integer edge weights required by METIS:

$$w(e_{i,j}) = \left\lfloor 1 + (W - 1)\,\frac{cost(e_{i,j}) - mincost}{maxcost - mincost} \right\rfloor. \qquad (25)$$

Here, $cost(e_{i,j})$ is the edge cost computed according to Eq. 17, $mincost$ and $maxcost$ respectively denote the minimum and maximum edge costs, and $w(e_{i,j})$ is the weight of the respective edge provided to METIS.

The scaling given in Eq. 25 may yield non-uniformly distributed integer edge weights in some instances. That is, there could be some clustering of integer edge weights even though the edge costs are not clustered in the original $\mathcal{G}(A)$. Ideally, edge weights would be real numbers. However, in order to use the limited range of integer weights more effectively, one needs more uniformly distributed edge weights. Therefore, we prescale the edge costs of $\mathcal{G}(A)$ by taking their square roots before assigning the integer weights using Eq. 25.
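A sketch of this integer weighting, including the square-root prescaling, is shown below; the maximum weight value and the rounding rule are illustrative choices rather than the exact settings used in the paper.

```python
import numpy as np

def metis_edge_weights(costs, w_max=1000):
    """Map floating-point edge costs to positive integers in [1, w_max]
    for METIS, after square-root prescaling of the costs."""
    scaled = np.sqrt(np.asarray(costs, dtype=float))
    lo, hi = scaled.min(), scaled.max()
    if hi == lo:                              # all costs equal: unit weights
        return np.ones(len(scaled), dtype=int)
    return (1 + np.floor((w_max - 1) * (scaled - lo) / (hi - lo))).astype(int)
```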

3 Experimental Results

3.1 Experimental Framework

In the experiments, we used the CG accelerated block Cimmino implementation available in ABCD Solver v1.0-beta [39]. In ABCD Solver, the MUMPS [3] sparse direct solver is used to factorize the augmented systems once and then to solve them repeatedly during the iterations. We used the same matrix scaling method [4], setting the NB1, NB2 and NB3 parameters to 10, 20 and 20, respectively, as suggested by one of the previous versions of ABCD Solver (version dated 2014-09-15) for the CG accelerated block Cimmino.

ABCD Solver includes a stabilized block-CG accelerated block Cimmino algorithm [8], especially for solving systems with multiple right-hand side vectors. Since the classical CG is guaranteed to converge for systems whose coefficient matrix is symmetric and positive definite, and since its convergence theory is well established, in this work we implement the classical CG accelerated block Cimmino of Algorithm 2 for solving sparse linear systems with a single right-hand side vector, rather than the block-CG acceleration. In all experiments with the CG accelerated block Cimmino, we use the normwise backward error [7] at iteration $k$,

$$\eta^{(k)} = \frac{\| b - A x^{(k)} \|}{\| A \| \, \| x^{(k)} \| + \| b \|}, \qquad (26)$$

as the stopping criterion, and we impose a fixed maximum number of iterations.
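For reference, the stopping criterion of Eq. 26 can be evaluated as in the following sketch; the choice of the infinity norm here is an assumption, since the norm used by ABCD Solver is not restated in this excerpt.

```python
import numpy as np
import scipy.sparse as sp

def normwise_backward_error(A, x, b):
    """Normwise backward error ||b - A x|| / (||A|| ||x|| + ||b||), in the infinity norm."""
    A = sp.csr_matrix(A)
    r = b - A @ x
    norm_A = np.abs(A).sum(axis=1).max()      # induced infinity norm of A
    return np.linalg.norm(r, np.inf) / (norm_A * np.linalg.norm(x, np.inf)
                                        + np.linalg.norm(b, np.inf))
```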

In the experiments, we have compared the performance of the proposed partitioning method (GP) against two baseline partitioning methods already available in ABCD Solver. The first baseline method, referred to as the uniform partitioning (UP) method in ABCD Solver, partitions the rows of the coefficient matrix into a given number of block rows with almost equal numbers of rows, without any permutation of the rows of $A$. Note that the UP method is the same as the straightforward partitioning method mentioned in Section 2.

The second baseline method is the hypergraph partitioning (HP) method [21]. This method uses the column-net model [14] of the sparse matrix $A$, in which rows and columns are respectively represented by vertices and hyperedges, both with unit weights [21]. Each hyperedge connects the set of vertices corresponding to the rows that have a nonzero in the respective column. In the HP method, a $K$-way partition of the vertices of the column-net model is used to find $K$ block rows. The partitioning constraint of maintaining balance on part weights corresponds to finding block rows with equal numbers of rows. The partitioning objective of minimizing the cutsize according to the cut-net metric corresponds to minimizing the number of linking columns among block rows. Here a linking column refers to a column that contains nonzeros in more than one block row.

The HP method in ABCD Solver uses the multilevel hypergraph partitioning tool PaToH [15] for partitioning the column-net model of matrix $A$. Here, we use the same parameters for PaToH as specified in ABCD Solver; that is, the final and initial imbalance parameters of PaToH are set to the values specified in ABCD Solver, and the other parameters are left at their PaToH defaults.

In the proposed GP method, as mentioned in Section 2.3, the multilevel graph partitioning tool METIS is used to partition the row inner-product graph model $\mathcal{G}(A)$ of $A$. The imbalance parameter of METIS is set to 10% and the k-way option is used. For the sake of a fair comparison between the HP and GP methods, unit vertex weights are used in $\mathcal{G}(A)$. The other parameters are left at their METIS defaults. Since both PaToH and METIS use randomized algorithms, for each instance we report the geometric mean over several runs with different seeds.

Here and hereafter, we use GP, HP and UP to refer to the respective block-row partitioning method as well as ABCD Solver that utilizes the regular block Cimmino algorithm for solving the systems partitioned by the respective method.

The extensive numerical experiments were conducted on a shared memory system. The shared memory system is a four-socket, 64-core computer that contains four AMD Opteron 6376 processors, where each processor has 16 cores running at 2.3 GHz, and a total of 64 GB of DDR3 memory. Due to memory bandwidth limitations of the platform, experiments are performed with 32 cores.

3.2 Effect of block-row partitioning on the eigenvalue spectrum of $H$

The convergence rate of the CG accelerated block Cimmino algorithm is related to the eigenvalue spectrum of $H$. By the nature of the block Cimmino algorithm, most of the eigenvalues of $H$ are clustered around 1, but there can be some eigenvalues at the extremes of the spectrum.

In this subsection, we conduct experiments to study the effect of the partitioning on the eigenvalue spectrum of $H$ by comparing the proposed GP method against UP and HP. In these experiments, in order to be able to compute the eigenvalue spectrum within a reasonable amount of time and memory, we use four small nonsymmetric sparse matrices from the SuiteSparse Matrix Collection [19]. The first and second matrices arise in Computational Fluid Dynamics problems, whereas the third and fourth matrices arise in Power Network problems. We partition each matrix into the same number of block rows for all three partitioning methods UP, HP and GP.

Figure 28: Eigenvalue spectra of $H$ (with the smallest and largest eigenvalues) and the numbers of CG iterations (iters) required for convergence for the four small test matrices.

Figure 28 shows the eigenvalue spectrum of $H$ obtained by the UP, HP and GP methods for each test instance. The figure also reports the number of CG iterations (iters) required for convergence as well as the smallest and largest eigenvalues of $H$. As seen in the figure, both HP and GP achieve significantly better clustering of the eigenvalues around 1 compared to UP. This experimental finding is reflected in a remarkable decrease in the number of CG iterations attained by both HP and GP over UP.

In the comparison of HP and GP, GP achieves better eigenvalue clustering, and hence a better convergence rate, than HP for all instances; the better clustering quality attained by GP over HP leads to a significant improvement in the convergence rate for each of the four matrices.

3.3 Dataset for Performance Analysis

For the following experiments, we selected nonsingular nonsymmetric square matrices from the SuiteSparse Matrix Collection [19] whose dimensions fall within a prescribed range. Only the HV15R and cage14 matrices were excluded due to memory limitations. We observed that at least one of the three partitioning methods was able to converge, within the CG iteration limit, in 80 of these instances.

Table 1 shows the properties of the matrices used in the experiments. In the table, the matrices are listed in decreasing order of their number of nonzeros.

The right-hand side vectors of the systems are obtained by multiplying the coefficient matrices with a vector whose elements are uniformly distributed random numbers, as in ABCD Solver. In all instances, the CG iterations are started from the zero vector.

Matrix name n nnz Matrix name n nnz
ML_Geer 1,504,002 110,686,677 para-10 155,924 2,094,873
RM07R 381,689 37,464,962 para-6 155,924 2,094,873
ML_Laplace 377,002 27,582,698 cage12 130,228 2,032,536
Transport 1,602,111 23,487,281 ASIC_320k 321,821 1,931,828
rajat31 4,690,002 20,316,253 rajat21 411,676 1,876,011
CoupCons3D 416,800 17,277,420 crashbasis 160,000 1,750,416
Freescale1 3,428,755 17,052,626 majorbasis 160,000 1,750,416
circuit5M_dc 3,523,317 14,865,409 venkat01 62,424 1,717,792
memchip 2,707,524 13,343,948 venkat50 62,424 1,717,777
atmosmodm 1,489,752 10,319,760 venkat25 62,424 1,717,763
atmosmodl 1,489,752 10,319,760 ASIC_680ks 682,712 1,693,767
atmosmodj 1,270,432 8,814,880 ASIC_320ks 321,671 1,316,085
atmosmodd 1,270,432 8,814,880 language 399,130 1,216,334
torso1 116,158 8,516,500 twotone 120,750 1,206,265
PR02R 161,070 8,185,136 matrix_9 103,430 1,205,518
cage13 445,315 7,479,343 torso2 115,967 1,033,473
ohne2 181,343 6,869,939 transient 178,866 961,368
pre2 659,033 5,834,044 ASIC_100k 99,340 940,621
Hamrle3 1,447,360 5,514,242 matrix-new_3 125,329 893,984
Chebyshev4 68,121 5,377,761 m133-b3 200,200 800,800
largebasis 440,020 5,240,084 shar_te2-b3 200,200 800,800
tmt_unsym 917,825 4,584,801 trans4 116,835 749,800
torso3 259,156 4,429,042 Baumann 112,211 748,331
xenon2 157,464 3,866,688 rajat25 87,190 606,489
laminar_duct3D 67,173 3,788,857 rajat28 87,190 606,489
rajat29 643,994 3,760,246 rajat20 86,916 604,299
FEM_3D_thermal2 147,900 3,489,300 LeGresley_87936 87,936 593,276
stomach 213,360 3,021,648 ASIC_100ks 99,190 578,890
thermomech_dK 204,316 2,846,228 3D_51448_3D 51,448 537,038
ASIC_680k 682,862 2,638,997 ibm_matrix_2 51,448 537,038
poisson3Db 85,623 2,374,949 hcircuit 105,676 513,072
barrier2-10 115,625 2,158,759 lung2 109,460 492,564
barrier2-11 115,625 2,158,759 2D_54019_highK 54,019 486,129
barrier2-9 115,625 2,158,759 rajat17 94,294 479,246
barrier2-12 115,625 2,158,759 rajat18 94,294 479,151
mc2depi 525,825 2,100,225 mark3jac140sc 64,089 376,395
para-8 155,924 2,094,873 shyy161 76,480 329,762
para-9 155,924 2,094,873 mark3jac120sc 54,929 322,483
para-5 155,924 2,094,873 circuit_4 80,209 307,604
para-7 155,924 2,094,873 bayer01 57,735 275,094
Table 1: Matrix properties (n: number of rows/columns, nnz: number of nonzeros)

We partition all test matrices into the same number of block rows for all three partitioning methods UP, HP and GP. Since all parallel runs are performed on 32 cores, the number of row blocks is greater than the number of cores. Consecutive row blocks are uniformly assigned to the cores so that each core is assigned an equal number of row blocks.

3.4 Convergence and parallel performance

In this subsection, we study the performance of the proposed GP method against UP and HP in terms of the number of CG iterations and parallel CG time to solution.

Table 2 shows the number of CG iterations and the parallel CG time for each matrix. In the table, “F” denotes that an algorithm fails to reach the desired backward error within the maximum number of iterations for the respective matrix instance, and “S” denotes that an algorithm fails during the factorization stage due to a singular block row. “S” occurs only for rajat29 with UP. As seen in Table 2, UP and HP fail to converge in 27 and 17 test instances, respectively, whereas GP fails in only one test instance.

In Table 2, the best result for each test instance is shown in bold. As seen in the table, out of 80 instances, the proposed GP method achieves the fastest convergence in 54 instances, whereas HP and UP achieve the fastest convergence in only 17 and 10 instances, respectively. Furthermore, as seen in Table 2, GP achieves the fastest iterative solution time in 51 instances, whereas HP and UP achieve the fastest solution time in 18 and 11 instances, respectively.

Matrix name # of CG iterations Parallel CG time to soln.
UP HP GP UP HP GP
ML_Geer 355 492 408 293.0 602.9 999.9
RM07R 1990 2128 506 1140.0 1228.3 246.1
ML_Laplace 760 952 971 162.0 314.0 314.6
Transport 2214 1666 1520 1400.0 1709.0 1416.5
rajat31 521 606 239 515.0 645.8 248.6
CoupCons3D 1280 479 152 333.0 114.4 39.1
Freescale1 F 4083 480 F 2996.2 337.6
circuit5M_dc 68 38 7 77.7 28.6 5.4
memchip 298 380 53 262.0 209.0 30.9
atmosmodm 44 23 42 27.8 12.8 23.2
atmosmodl 38 70 103 21.2 38.2 60.1
atmosmodj 167 148 138 69.4 67.6 65.9
atmosmodd 171 155 149 80.7 73.5 73.4
torso1 1559 854 1050 211.0 114.3 188.9
PR02R F 3505 2194 F 391.7 258.9
cage13 14 17 16 4.9 7.7 7.9
ohne2 F 2750 721 F 383.7 123.3
pre2 2343 264 524 535.0 47.9 137.1
Hamrle3 3555 2058 1583 2030.0 715.2 521.3
Chebyshev4 F F 173 F F 46.4
largebasis 1379 554 924 247.0 67.7 111.1
tmt_unsym 495 880 770 89.1 202.1 194.3
torso3 162 75 27 20.7 12.1 4.2
xenon2 25 19 21 1.8 1.2 1.3
laminar_duct3D 247 403 323 13.7 18.2 19.3
rajat29 S 14 28 S 11.8 25.1
FEM_3D_thermal2 86 73 47 5.1 5.2 3.6
stomach 85 37 15 8.5 4.0 1.5
thermomech_dK 26 7 7 2.4 0.5 0.5
ASIC_680k 23 8 5 17.0 7.9 4.0
poisson3Db F 1880 1457 F 82.7 70.7
barrier2-10 F F 4049 F F 309.6
barrier2-11 F F 4270 F F 323.0
barrier2-9 F F 3803 F F 278.1
barrier2-12 F F 4043 F F 280.3
mc2depi 262 487 409 25.5 70.2 54.1
para-8 F 3167 1821 F 279.6 149.8
para-9 F F 2826 F F 249.5
para-5 F F 3972 F F 374.1
para-7 F F 3504 F F 321.2
para-10 F F 3743 F F 308.4
para-6 F F 3756 F F 332.8
cage12 16 18 17 1.3 1.7 1.7
ASIC_320k 1248 856 677 416.0 347.6 315.1
rajat21 3358 F 3978 787.0 F 935.5
crashbasis 263 182 163 18.7 10.0 8.9
majorbasis 239 162 154 13.0 7.9 8.7
venkat01 179 95 85 4.2 1.8 1.7
venkat50 F 3245 3154 F 65.5 68.3
venkat25 F 2491 2250 F 53.8 48.6
ASIC_680ks 6 5 7 1.1 1.0 1.1
ASIC_320ks 66 3 9 6.6 0.3 0.8
language 759 438 146 104.0 53.9 16.5
twotone 2643 1092 742 145.0 39.1 30.6
matrix_9 4549 4291 2089 214.0 184.3 105.3
torso2 42 20 19 1.2 0.7 0.6
transient 2930 2044 F 583.0 253.8 F
ASIC_100k 1057 138 169 180.0 21.7 25.5
matrix-new_3 600 151 222 35.1 6.4 9.9
m133-b3 22 23 20 1.7 1.8 1.5
shar_te2-b3 22 23 20 1.9 1.8 1.6
trans4 F F 2011 F F 335.3
Baumann F F 2986 F F 103.8
rajat25 3054 2414 918 169.0 154.0 51.2
rajat28 2636 1280 842 149.0 82.0 43.8
rajat20 3186 1912 893 198.0 124.3 45.2
LeGresley_87936 F 935 137 F 15.0 2.3
ASIC_100ks 203 51 21 5.1 1.2 0.5
ibm_matrix_2 F 2055 3527 F 26.9 57.9
3D_51448_3D F 1926 3884 F 28.3 72.6
hcircuit F F 2686 F F 38.4
lung2 34 8 17 0.6 0.1 0.3
2D_54019_highK F 275 60 F 3.3 0.7
rajat17 1695 584 386 92.7 32.1 20.6
rajat18 1123 406 393 65.3 22.3 20.2
mark3jac140sc F F 1716 F F 24.3
shyy161 108 115 115 1.3 1.9 1.7
mark3jac120sc F F 1709 F F 19.0
circuit_4 F F 2400 F F 48.4
bayer01 648 25 27 6.1 0.3 0.3
Number of bests 10 17 54 11 18 51
Table 2: Number of CG iterations and parallel CG times in seconds

Figure 31 shows performance profiles [20], which compare multiple methods over a set of test instances. A performance profile compares each method with the best performing method for each test instance and reports the fraction of the test instances in which the method's performance is within a given factor of that of the best method. For example, in Fig. 31(a), a point with abscissa 2 on the performance curve of a method indicates the fraction of the instances in which that method is no worse than the best method by a factor of 2. If a method's curve is closer to the top-left corner, the method achieves better performance.
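The profiles can be computed as in the following Dolan-Moré-style sketch (illustrative; the array of per-instance results is a placeholder).

```python
import numpy as np

def performance_profile(results, taus):
    """For an (instances x methods) array of costs (np.inf for failures),
    return, per method, the fraction of instances whose cost is within a
    factor tau of the per-instance best, for each tau in `taus`."""
    results = np.asarray(results, dtype=float)
    best = results.min(axis=1, keepdims=True)
    ratios = results / best
    return np.array([[np.mean(ratios[:, m] <= tau) for tau in taus]
                     for m in range(results.shape[1])])

# e.g. performance_profile(iteration_counts, taus=[1, 2, 4]) for Fig. 31(a)
```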

Figure 31: Performance profiles for (a) convergence rate and (b) parallel CG time.

Figures 31(a) and 31(b) show the performance profiles in terms of the number of CG iterations and the parallel CG time, respectively. As seen in Fig. 31(a), the fraction of instances in which the number of CG iterations required by GP does not exceed that of the best method by a factor of 2 is considerably larger than the corresponding fractions for HP and UP. As seen in Fig. 31(b), the same holds for the parallel CG time: GP stays within a factor of 2 of the best method in a considerably larger fraction of the instances than HP and UP. Figures 31(a) and 31(b) also show that, as expected, the CG time closely follows the number of CG iterations.

3.5 Preprocessing Overhead and Amortization

In this subsection, we analyze the relative preprocessing overheads of the three methods UP, HP and GP in order to find out whether the intelligent partitioning methods HP and GP amortize their costs. For all three block-row partitioning methods, the preprocessing overhead includes matrix scaling, block-row partitioning, creating the submatrices corresponding to the block rows, and distributing these submatrices among the processors. Recall that the partitioning for UP is straightforward and hence incurs only a negligible additional cost in the preprocessing overhead. On the other hand, the intelligent partitioning algorithms utilized in HP and GP add a considerable cost to the overall preprocessing overhead. Furthermore, for GP, the construction of the row inner-product graph also incurs a significant cost.

In Table 3, we display the total preprocessing time and the total execution time of all three methods for each one of the 80 test instances. In the table, the total execution time is the sum of the total preprocessing time and the total solution time. Here, the total solution time includes the parallel factorization and parallel CG solution times. Note that the factorization in the block Cimmino algorithm in ABCD Solver is performed by the parallel sparse direct solver MUMPS [3]. The factorization, which needs to be performed only once, involves the symbolic and numerical factorizations of the coefficient matrices of the augmented systems that arise in the block Cimmino algorithm. Note that this factorization process is embarrassingly parallel since the factorizations of the coefficient matrices of the augmented systems are done independently. Recall that MUMPS is also used during the iterative solution stage, where at each iteration a linear system is solved using the factors of the augmented systems that were computed during the factorization stage.

As seen in Table 3, in terms of the preprocessing time, UP is the clear winner in all of the instances, as expected, whereas GP incurs much less preprocessing time than HP in all except two instances, namely pre2 and circuit_4. As also seen in Table 3, comparing all three methods, GP and UP respectively achieve the smallest total execution time in 50 and 23 instances, whereas HP achieves the smallest total execution time in only 7 instances.

Matrix name Preprocessing time Total execution time
UP HP GP UP HP GP
ML_Geer 69.9 462.2 194.2 377.7 1100.9 1252.1
RM07R 19.5 208.7 84.2 1168.0 1471.6 341.8
ML_Laplace 15.8 114.9 46.6 181.1 432.8 366.4
Transport 22.4 126.2 50.4 1429.3 1872.5 1487.6
rajat31 34.7 85.6 47.4 562.3 745.5 326.7
CoupCons3D 10.4 60.6 29.8 346.4 179.7 72.9
Freescale1 27.8 81.3 42.1 F 3086.6 394.3
circuit5M_dc 26.3 72.6 40.8 115.4 110.6 55.4
memchip 20.5 58.7 36.1 291.4 284.7 74.5
atmosmodm 12.5 67.5 25.3 45.9 91.3 56.6
atmosmodl 12.5 67.6 24.5 39.4 117.3 96.7
atmosmodj 8.9 57.7 21.3 83.1 133.7 94.8
atmosmodd 10.1 58.5 22.3 95.6 140.4 103.4
torso1 4.1 143.3 11.8 220.3 286.9 224.7
PR02R 4.9 31.0 14.3 F 425.9 275.1
cage13 7.4 71.8 33.4 19.4 101.0 71.1
ohne2 4.5 32.0 19.2 F 421.4 144.8
pre2 6.2 59.3 72.7 543.3 111.7 215.9
Hamrle3 9.8 30.8 16.5 2045.3 751.4 543.0
Chebyshev4 2.7 19.2 14.6 F F 98.7
largebasis 4.7 12.3 8.3 253.1 81.6 121.0
tmt_unsym 6.4 19.6 10.1 97.7 224.8 207.1
torso3 3.5 24.0 9.2 25.4 38.3 15.1
xenon2 2.9 14.4 7.5 5.7 16.7 9.9
laminar_duct3D 2.2 17.0 9.9 16.6 36.0 30.2
rajat29 4.3 28.0 14.0 S 1109.4 825.1
FEM_3D_thermal2 2.6 15.5 6.7 8.8 21.8 11.3
stomach 2.5 15.1 5.9 12.0 20.3 8.5
thermomech_dK 3.1 11.4 5.7 6.7 12.8 7.1
ASIC_680k 4.3 35.4 17.5 331.9 719.8 295.9
poisson3Db 2.2 16.3 9.6 F 100.0 81.0
barrier2-10 1.9 15.4 7.6 F F 318.0
barrier2-11 1.9 15.5 7.5 F F 331.7
barrier2-9 1.9 15.3 7.6 F F 286.8
barrier2-12 2.0 15.5 7.7 F F 289.0
mc2depi 3.3 10.9 5.4 30.1 83.0 61.2
para-8 2.0 14.0 6.8 F 294.7 158.0
para-9 2.0 14.0 6.8 F F 257.5
para-5 2.0 14.0 6.6 F F 381.8
para-7 1.9 14.1 6.7 F F 329.3
para-10 2.0 14.1 6.7 F F 316.9
para-6 1.9 14.3 6.7 F F 340.6
cage12 1.8 19.0 9.2 4.3 22.3 12.6
ASIC_320k 2.5 20.3 7.7 457.6 499.2 399.7
rajat21 2.7 23.6 15.0 803.4 F 963.8
crashbasis 1.5 7.1 4.2 21.2 18.0 14.0
majorbasis 1.6 7.2 4.1 15.5 16.0 13.6
venkat01 1.2 5.7 3.0 6.1 8.0 5.3
venkat50 1.2 5.6 3.0 F 71.8 71.9
venkat25 1.2 5.6 3.0 F 60.0 52.2
ASIC_680ks 3.5 9.0 7.5 6.7 13.3 11.0
ASIC_320ks 2.1 6.8 5.9 9.7 8.4 8.2
language 2.5 9.0 7.0 108.3 65.1 25.0
twotone 1.2 6.9 6.6 147.2 47.9 38.6
matrix_9 1.2 8.1 4.3 215.9 193.3 110.6
torso2 0.9 4.4 1.8 2.5 5.6 3.0
transient 1.2 6.2 3.8 587.1 264.8 F
ASIC_100k 1.0 6.5 3.9 214.6 64.8 56.7
matrix-new_3 0.9 6.2 3.4 36.8 14.0 14.3
m133-b3 1.3 9.2 8.3 3.8 12.0 10.7
shar_te2-b3 1.3 9.1 8.4 4.2 11.7 11.0
trans4 0.8 4.7 2.6 F F 360.5
Baumann 0.7 5.9 2.1 F F 106.6
rajat25 0.7 5.2 4.1 170.9 161.8 56.7
rajat28 0.7 5.1 3.9 150.8 89.4 49.1
rajat20 0.6 5.1 4.1 199.8 131.8 50.8
LeGresley_87936 0.7 2.6 1.5 F 18.1 4.4
ASIC_100ks 0.7 3.9 3.1 6.4 5.8 4.2
ibm_matrix_2 0.5 4.0 2.2 F 31.7 61.2
3D_51448_3D 0.5 4.1 2.2 F 33.2 75.2
hcircuit 0.6 2.2 1.8 F F 40.6
lung2 0.6 1.5 1.0 1.6 1.9 1.8
2D_54019_highK 0.5 2.6 1.3 F 8.1 2.8
rajat17 0.6 4.7 3.6 94.5 38.3 25.6
rajat18 0.6 4.7 3.5 67.0 28.4 25.3
mark3jac140sc 0.5 3.3 2.4 F F 27.2
shyy161 0.4 1.9 0.9 1.9 4.4 3.1
mark3jac120sc 0.4 3.0 2.1 F F 21.6
circuit_4 0.4 2.4 2.7 F F 51.5
bayer01 0.3 1.7 1.1 6.7 2.5 1.9
Geometric means of preprocessing times (UP, HP, GP): 2.3, 13.7, 7.5
Number of bests in total execution time (UP, HP, GP): 23, 7, 50
Table 3: Preprocessing time and total execution time (including preprocessing) in seconds

In the relative comparison of GP and UP, GP incurs only about three times more preprocessing time than UP on average (geometric means of 7.5 versus 2.3 seconds), which leads GP to achieve a smaller total execution time than UP in 56 out of 80 instances. In other words, compared to the UP method, the sequential implementation of GP amortizes its preprocessing cost for those 56 instances by sufficiently reducing the number of CG iterations. That is, GP amortizes its preprocessing cost for the solution of those 56 instances even if we solve only one linear system with a single right-hand side vector. Note that in many applications in which sparse linear systems arise, the solution of consecutive linear systems is required, where the coefficient matrix remains the same but the right-hand side vectors change. For such applications, the amortization performance of the proposed GP method will further improve. For example, the amortization performance of GP improves from 56 to 60 and 64 instances, respectively, for solving two and five consecutive linear systems.

Figures 34(a) and 34(b) show the performance profiles in terms of the preprocessing time and the total execution time, respectively. As seen in Fig. 34(a), the preprocessing overhead incurred by GP remains within a factor of 4 of that of the best method, UP, in the large majority of the instances. As seen in Fig. 34(b), the total execution time attained by GP stays within a factor of 2 of that of the best method in a considerably larger fraction of the instances than HP and UP.

Figure 34: Performance profiles for (a) preprocessing time and (b) total execution time for the GP, HP and UP methods.

4 Conclusion and future work

In this paper, we proposed a novel partitioning method in order to improve the CG accelerated block Cimmino algorithm. The proposed partitioning method takes the numerical orthogonality between the block rows of the coefficient matrix into account. The experiments on a large set of real-world systems show that the proposed method improves the convergence rate of the CG accelerated block Cimmino compared to the state-of-the-art hypergraph partitioning method. Moreover, it requires not only less preprocessing time and fewer CG iterations, but also much less total execution time than the hypergraph partitioning method.

As a future work, we consider two issues: further reducing the number of iterations through preconditioning and reducing the preprocessing overhead through parallelization.

Even though the matrix $H$ is not available explicitly, it could still be possible to obtain a preconditioner that further reduces the required number of CG iterations. One viable option could be a sparse approximate inverse type preconditioner, which does not require the coefficient matrix to be available explicitly. This approach could be viable especially when consecutive linear systems need to be solved with the same coefficient matrix.

The proposed method involves two computational stages, namely constructing row inner-product graph via computing SpGEMM operation and partitioning this graph. For parallelizing the first stage, parallel SpGEMM [1, 2, 9] operation could be used to construct local subgraphs on each processor. For parallelizing the second stage, a parallel graph partitioning tool ParMETIS [32] could be used. In each processor, the local subgraphs generated in parallel in the first stage could be used as input for ParMETIS.

Appendix A Graph and Graph Partitioning

An undirected graph $\mathcal{G} = (\mathcal{V}, \mathcal{E})$ consists of a set of vertices $\mathcal{V}$ and a set of edges $\mathcal{E}$. Each edge $e_{i,j}$ connects a pair of distinct vertices $v_i$ and $v_j$. Each vertex $v_i$ can be assigned a weight, denoted $w(v_i)$, and each edge $e_{i,j}$ can be assigned a cost, denoted $cost(e_{i,j})$.

$\Pi = \{\mathcal{V}_1, \mathcal{V}_2, \ldots, \mathcal{V}_K\}$ is defined as a $K$-way vertex partition of $\mathcal{G}$ if the parts are mutually disjoint and exhaustive. An edge $e_{i,j}$ is said to be cut if the vertices $v_i$ and $v_j$ belong to different parts, and uncut otherwise. The set of cut edges of a partition $\Pi$ is denoted as $\mathcal{E}_{cut}$. In a given partition $\Pi$ of $\mathcal{G}$, the weight $W(\mathcal{V}_k)$ of a part $\mathcal{V}_k$ is defined as the sum of the weights of the vertices in $\mathcal{V}_k$, i.e.,