1 Introduction
Row projection methods are a class of linear system solvers [12, 13, 31] for solving linear systems of equations of the form

(1) $Ax = f,$

where $A$ is an $n \times n$ sparse nonsymmetric nonsingular matrix, and $x$ and $f$ are column vectors of size $n$. In these methods, the solution is computed through successive projections. There are two major variations, known as Kaczmarz [30] and Cimmino [17]. Kaczmarz obtains the solution through a product of orthogonal projections, whereas Cimmino reaches the solution through a sum of orthogonal projections. Cimmino is known to be more amenable to parallelism than Kaczmarz [12]. The required number of iterations for the Cimmino algorithm, however, can be quite large. One alternative variation is block Cimmino [6], which is a block row projection method. Iterative block Cimmino has been studied extensively in [6, 8, 12, 16, 21, 23, 24]. A direct version of block Cimmino based on augmented block row projections is proposed in [22]. However, as in any other direct solver [3, 11, 34, 36], this approach also suffers from extensive memory requirements due to fill-in.

In the block Cimmino scheme, the linear system Eq. 1 is partitioned into $K$ block rows as follows:
(2) $\begin{pmatrix} A_1 \\ A_2 \\ \vdots \\ A_K \end{pmatrix} x = \begin{pmatrix} f_1 \\ f_2 \\ \vdots \\ f_K \end{pmatrix}$

In Eq. 2, the coefficient matrix and the right-hand side vector are partitioned conformably. Here, $A_i$ is a submatrix of size $m_i \times n$ and $f_i$ is a subvector of size $m_i$, for $i = 1, 2, \dots, K$, where

(3) $\sum_{i=1}^{K} m_i = n.$
The block Cimmino scheme is given in Algorithm 1, where $A_i^{+}$ is the Moore-Penrose pseudoinverse of $A_i$, defined as

(4) $A_i^{+} = A_i^T \left( A_i A_i^T \right)^{-1}.$

In Algorithm 1, $A_i^{+}$ is used for the sake of clarity of notation and is never computed explicitly. In fact, at line 4 of Algorithm 1, the minimum norm solution of an underdetermined linear least squares problem is obtained via the augmented system approach, as discussed later. In Algorithm 1, the projection vectors, which are of size $n$, can be computed independently in parallel without any communication; hence, the block Cimmino algorithm is quite suitable for parallel computing platforms.
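To make the iteration concrete, the following is a minimal dense NumPy sketch of a block Cimmino sweep (not the paper's implementation; the block sizes `m`, the relaxation parameter `omega`, and the explicit use of `pinv` are illustrative simplifications, whereas the actual solver computes the projections via augmented systems):

```python
# Sketch of the block Cimmino iteration: x <- x + omega * sum_i A_i^+ (f_i - A_i x).
import numpy as np

def block_cimmino(A, f, m, omega=1.0, iters=2000):
    n = A.shape[1]
    x = np.zeros(n)
    # Row offsets of the block rows A_1, ..., A_K.
    offs = np.concatenate(([0], np.cumsum(m)))
    # Pseudoinverses are formed explicitly here only for illustration.
    pinvs = [np.linalg.pinv(A[offs[i]:offs[i + 1], :]) for i in range(len(m))]
    for _ in range(iters):
        delta = np.zeros(n)
        for i in range(len(m)):
            r, s = offs[i], offs[i + 1]
            delta += pinvs[i] @ (f[r:s] - A[r:s, :] @ x)  # independent projections
        x += omega * delta
    return x

rng = np.random.default_rng(0)
A = np.eye(6) + 0.05 * rng.standard_normal((6, 6))  # nearly orthogonal block rows
xstar = rng.standard_normal(6)
f = A @ xstar
x = block_cimmino(A, f, m=[2, 2, 2])
assert np.allclose(x, xstar, atol=1e-8)
```

The inner loop over blocks is exactly the part that parallelizes with no communication; only the sum of the projection vectors must be combined.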
An iteration of the block Cimmino method can be reformulated as follows:

(5) $x^{(k+1)} = x^{(k)} + \omega \sum_{i=1}^{K} A_i^{+} \left( f_i - A_i x^{(k)} \right) = Q x^{(k)} + \omega \sum_{i=1}^{K} A_i^{+} f_i,$

where $\omega$ is a relaxation parameter and $Q = I - \omega H$ is the iteration matrix of the block Cimmino algorithm. $H$ is the sum of the projections $P_{\mathcal{R}(A_i^T)}$ (projections onto $\mathcal{R}(A_i^T)$) and is defined by

(6) $H = \sum_{i=1}^{K} A_i^{+} A_i.$
The projections in the block Cimmino iterations can be calculated using several approaches, such as the normal equations, the semi-normal equations, QR factorization, and the augmented system [5]. The normal and semi-normal equation approaches are not considered since they have the potential of introducing numerical difficulties that can be disastrous in some cases when the problem is ill-conditioned. Although the QR factorization is numerically more stable, it is computationally expensive. Therefore, we use the augmented system approach, which requires the solution of a sparse linear system that can be performed effectively by using a sparse direct solver [6]. Note that if the coefficient matrix is in a column-overlapped block diagonal form, one could also use the algorithm in [37]. However, this approach is not considered since we assume no structure for the coefficient matrix.
In the augmented system approach, we obtain the solution of Eq. 7 by solving the linear system Eq. 8,

(7) $\min \|u_i\| \quad \text{subject to} \quad A_i u_i = r_i,$

(8) $\begin{pmatrix} I & A_i^T \\ A_i & 0 \end{pmatrix} \begin{pmatrix} u_i \\ v_i \end{pmatrix} = \begin{pmatrix} 0 \\ r_i \end{pmatrix}.$

Hence, the solution of the augmented system gives the projection $u_i = A_i^{+} r_i$.
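The following small sketch verifies this equivalence numerically for one block row; the dense solve stands in for the sparse direct solver used in practice, and the dimensions are illustrative:

```python
# Computing the projection A_i^+ r via the augmented system
# [[I, A_i^T], [A_i, 0]] [u; v] = [0; r], so that u = A_i^+ r (Eq. 8).
import numpy as np

def projection_via_augmented(Ai, r):
    m, n = Ai.shape
    K = np.block([[np.eye(n), Ai.T],
                  [Ai, np.zeros((m, m))]])
    rhs = np.concatenate([np.zeros(n), r])
    sol = np.linalg.solve(K, rhs)
    return sol[:n]  # the first n components are u = A_i^+ r

rng = np.random.default_rng(1)
Ai = rng.standard_normal((3, 7))   # an underdetermined block row
r = rng.standard_normal(3)
u = projection_via_augmented(Ai, r)
assert np.allclose(u, np.linalg.pinv(Ai) @ r)
```

Eliminating $v_i$ from the augmented system recovers $u_i = A_i^T (A_i A_i^T)^{-1} r_i$, which is exactly $A_i^{+} r_i$ by Eq. 4.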
1.1 The Conjugate Gradient acceleration of the block Cimmino method
The convergence rate of the block Cimmino algorithm is known to be slow [12]. In [12], the Conjugate Gradient (CG) method is proposed to accelerate the row projection method. It is also reported that the CG accelerated Cimmino method competes favorably with the classical preconditioned Generalized Minimal Residual (GMRES) and Conjugate Gradient on the Normal Equations (CGNE) methods for the solution of nonsymmetric linear systems.
We note that the matrix $H$ in Eq. 6 is symmetric and positive definite. Hence, one can solve the following system using CG:

(9) $\omega H x = \omega \xi,$

where $\xi = \sum_{i=1}^{K} A_i^{+} f_i$ and $x$ is the solution vector of the system Eq. 1. Note that since $\omega$ appears on both sides of the equality, it does not affect the convergence of CG. Algorithm 2 is the pseudocode for the CG accelerated block Cimmino method, which is in fact the classical CG applied to the system given in Eq. 9.
The matrix-vector multiplication with $H$ in CG is expressed as the solution of $K$ independent underdetermined systems, which can be computed in parallel; the partial results, however, need to be summed to obtain $Hp$ by using an all-reduce operation.
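A hedged sketch of this CG acceleration follows (a serial simplification: `apply_H` plays the role of the parallel projections plus all-reduce, and the explicit pseudoinverses stand in for the augmented-system solves of the actual method):

```python
# CG applied to H x = xi (Eq. 9), with H = sum_i A_i^+ A_i applied implicitly
# and xi = sum_i A_i^+ f_i.
import numpy as np

def cg_block_cimmino(A, f, m, tol=1e-12, maxit=500):
    n = A.shape[1]
    offs = np.concatenate(([0], np.cumsum(m)))
    pinvs = [np.linalg.pinv(A[offs[i]:offs[i + 1], :]) for i in range(len(m))]

    def apply_H(v):
        # K independent projections; their sum is the all-reduce step.
        return sum(pinvs[i] @ (A[offs[i]:offs[i + 1], :] @ v)
                   for i in range(len(m)))

    xi = sum(pinvs[i] @ f[offs[i]:offs[i + 1]] for i in range(len(m)))
    x = np.zeros(n)
    r = xi - apply_H(x)
    p = r.copy()
    for _ in range(maxit):
        Hp = apply_H(p)
        alpha = (r @ r) / (p @ Hp)
        x += alpha * p
        r_new = r - alpha * Hp
        if np.linalg.norm(r_new) < tol:
            break
        p = r_new + ((r_new @ r_new) / (r @ r)) * p
        r = r_new
    return x

rng = np.random.default_rng(2)
A = np.eye(8) + 0.2 * rng.standard_normal((8, 8))
xstar = rng.standard_normal(8)
f = A @ xstar
x = cg_block_cimmino(A, f, m=[3, 3, 2])
assert np.allclose(x, xstar, atol=1e-8)
```

Since $\xi = \sum_i A_i^{+} f_i = H x^{*}$, CG on $Hx = \xi$ converges to the solution of the original system.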
1.2 The effect of partitioning
The convergence rate of the CG accelerated block Cimmino algorithm is essentially the convergence rate of CG applied to Eq. 9. A well-known upper bound on the convergence rate can be given in terms of the extreme eigenvalues ($\lambda_{\min}$ and $\lambda_{\max}$) of the coefficient matrix. Let

(10) $\kappa = \frac{\lambda_{\max}(H)}{\lambda_{\min}(H)}$

be the 2-norm condition number of $H$. Then, as in [27], an upper bound on the convergence rate of the CG accelerated block Cimmino can be given by

(11) $\left\| x^{*} - x^{(k)} \right\|_H \le 2 \left( \frac{\sqrt{\kappa} - 1}{\sqrt{\kappa} + 1} \right)^{k} \left\| x^{*} - x^{(0)} \right\|_H,$

where $x^{*}$ is the exact solution and $\|y\|_H = (y^T H y)^{1/2}$. Furthermore, it was shown that the convergence rate of CG depends not only on the extreme eigenvalues but also on the separation between those extreme eigenvalues and the interior eigenvalues [38], as well as on the clustering of the interior eigenvalues [29]. In summary, the convergence rate of the CG accelerated block Cimmino depends on the extreme eigenvalues and the number of eigenvalue clusters, as well as the quality of the clustering.
Therefore, the partitioning of the coefficient matrix into block rows is crucial for improving the convergence rate of the CG accelerated block Cimmino algorithm, since it can improve the eigenvalue distribution of $H$. Note that the eigenvalues of $H$ are affected only by the block-row partitioning of the coefficient matrix and are independent of any column ordering [21].
Let the QR factorization of $A_i^T$ be defined as

(12) $A_i^T = Q_i R_i,$

where the matrices $Q_i$ and $R_i$ have dimensions $n \times m_i$ and $m_i \times m_i$, respectively. Then, the matrix $H$ will be

(13) $H = \sum_{i=1}^{K} Q_i Q_i^T = U U^T, \quad \text{where } U = \left[\, Q_1 \;\; Q_2 \;\; \cdots \;\; Q_K \,\right].$

Since the eigenvalue spectrum of the matrix $UU^T$ is the same as the spectrum of the matrix $U^T U$ [26], $H$ is similar to

(14) $U^T U = \begin{pmatrix} I & Q_1^T Q_2 & \cdots & Q_1^T Q_K \\ Q_2^T Q_1 & I & \cdots & Q_2^T Q_K \\ \vdots & \vdots & \ddots & \vdots \\ Q_K^T Q_1 & Q_K^T Q_2 & \cdots & I \end{pmatrix},$

where the singular values of the matrix $Q_i^T Q_j$ represent the cosines of the principal angles between the subspaces $\mathcal{R}(A_i^T)$ and $\mathcal{R}(A_j^T)$ [10]. Hence, the smaller the off-diagonal entries of the matrix in Eq. 14, the more eigenvalues of $H$ will be clustered around one by the Gershgorin theorem [25]. Therefore, the convergence rate of the block Cimmino method highly depends on the orthogonality among the blocks. If the blocks are more orthogonal to each other, the row inner products between blocks will be small and hence the eigenvalues will be clustered around one. Therefore, CG is expected to converge in fewer iterations.

In a recent study [21], a hypergraph partitioning method is proposed to find a "good" block-row partitioning for the CG accelerated block Cimmino algorithm. It is also shown that it performs better than Cuthill-McKee [18] based methods (with or without filtering small elements in absolute value). However, the hypergraph partitioning method only loosely considers the structural orthogonality among block rows and does not consider the numerical values at all.
In this work, we propose a novel graph-theoretical block-row partitioning method for the CG accelerated block Cimmino algorithm. For this purpose, we introduce a row inner-product graph model of a given matrix, and then the problem of finding a "good" block-row partitioning is formulated as a graph partitioning problem on this graph model. The proposed method takes the numerical orthogonality between block rows of $A$ into account. In the proposed method, the partitioning objective of minimizing the cutsize directly corresponds to minimizing the sum of inter-block inner products between block rows, thus leading to an improvement in the eigenvalue spectrum of $H$.
The validity of the proposed method is confirmed against two baseline methods by conducting experiments on a large set of matrices that arise in a variety of real-life applications. One of the two baseline methods is the state-of-the-art hypergraph partitioning method introduced in [21]. We conduct experiments to study the effect of the partitioning on the eigenvalue spectrum of $H$. We also conduct experiments to compare the performance of the three methods in terms of the number of CG iterations and the parallel CG time to solution. The results of these experiments show that the proposed partitioning method is considerably better than both baseline methods in terms of all of these performance metrics. Finally, we compare the preprocessing overheads of the methods and show that the proposed method incurs much less overhead than the hypergraph partitioning method, thus allowing a better amortization.
2 The proposed partitioning method
In this section, we first describe the row inner-product graph model and then show that finding a "good" block-row partitioning can be formulated as a graph partitioning problem on this graph model. Finally, we give the implementation details for the construction and partitioning of the graph model. We refer the reader to Appendix A for a short background on graph partitioning.
2.1 Row inner-product graph model
In the row inner-product graph $G(A) = (V, E)$ of matrix $A$, vertices represent the rows of matrix $A$ and edges represent nonzero inner products between rows. That is, $V$ contains a vertex $v_i$ for each row $a_i$ of matrix $A$, and $E$ contains an edge $(v_i, v_j)$ only if the inner product of rows $a_i$ and $a_j$ is nonzero. That is,

(15) $(v_i, v_j) \in E \iff \langle a_i, a_j \rangle \ne 0.$

Each vertex $v_i$ can be associated with a unit weight or a weight equal to the number of nonzeros in row $a_i$, that is,

(16) $w(v_i) = 1 \quad \text{or} \quad w(v_i) = \mathrm{nnz}(a_i),$

respectively. Each edge $(v_i, v_j)$ is associated with a cost equal to the absolute value of the respective inner product. That is,

(17) $\mathrm{cost}(v_i, v_j) = \left| \langle a_i, a_j \rangle \right|.$

This graph is topologically equivalent to the standard graph representation of the symmetric matrix $B$ resulting from the sparse matrix-matrix multiplication $B = A A^T$. That is, the sparsity pattern of $B$ corresponds to the adjacency matrix representation of $G(A)$. Each off-diagonal nonzero $b_{ij}$ of the resulting matrix incurs an edge $(v_i, v_j)$. Since each nonzero entry $b_{ij}$ of $B$ is computed as the inner product of row $a_i$ and row $a_j$, the absolute value of the nonzero determines the cost of the respective edge $(v_i, v_j)$. That is, since $b_{ij} = \langle a_i, a_j \rangle$, we have $\mathrm{cost}(v_i, v_j) = |b_{ij}|$.
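This construction can be sketched directly with SciPy sparse matrices; the small matrix below is illustrative, not the paper's running example:

```python
# Constructing the row inner-product graph from B = A A^T.
# An edge (i, j) exists iff b_ij != 0, with cost |b_ij| (Eq. 17).
import numpy as np
import scipy.sparse as sp

A = sp.csr_matrix(np.array([[2.0, 0.0, 1.0],
                            [0.0, 3.0, 0.0],
                            [4.0, 0.0, 0.5]]))
B = (A @ A.T).tocoo()
edges = {}
for i, j, v in zip(B.row, B.col, B.data):
    if i < j and v != 0.0:           # keep off-diagonal nonzeros only
        edges[(i, j)] = abs(v)       # edge cost = |<a_i, a_j>|

assert (0, 2) in edges and edges[(0, 2)] == 8.5   # 2*4 + 1*0.5
assert (0, 1) not in edges                        # rows 0 and 1 are orthogonal
```

The diagonal of $B$ is discarded, matching the remark above that diagonal entries do not affect $G(A)$.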
Figure (a) shows a sample sparse matrix, and Fig. (b) depicts the proposed row inner-product graph $G(A)$ for this sample matrix. As seen in Fig. (b), $G(A)$ contains one vertex for each row and one edge for each nonzero row inner product. For example, if the inner product of rows $a_i$ and $a_j$ is nonzero, then $G(A)$ contains the edge $(v_i, v_j)$ with cost $|\langle a_i, a_j \rangle|$. Edges with large costs are shown with thick lines in order to make such high inner-product values more visible.

Figure (a) also shows the resulting matrix $B$ of the sparse matrix-matrix multiplication $B = A A^T$. Only the off-diagonal nonzero entries together with their values are shown, since the values of the diagonal entries do not affect $G(A)$. The cells that contain large nonzeros are shown with a black background in order to make such high values more visible. A comparison of Figs. (a) and (b) shows that the topology of the standard graph model of matrix $B$ is equivalent to $G(A)$, and that the values of the nonzero entries of matrix $B$ are equal to the costs of the respective edges of $G(A)$.
2.2 Block-row partitioning via partitioning $G(A)$
A $K$-way partition $\Pi = \{V_1, V_2, \dots, V_K\}$ of $G(A)$ can be decoded as a partial permutation on the rows of $A$ to induce a permuted matrix $\hat{A}$, where

(18) $\hat{A} = P A.$

Here, $P$ denotes the row permutation matrix defined by the $K$-way partition as follows: the rows associated with the vertices in $V_k$ are ordered after the rows associated with the vertices in $V_{k-1}$, for $k = 2, 3, \dots, K$. That is, block row $\hat{A}_k$ contains the set of rows corresponding to the set of vertices in part $V_k$ of partition $\Pi$, where the ordering of the rows within block row $\hat{A}_k$ is arbitrary for each $k$. Note that we use the notation $\hat{A}_k$ to denote both block row $k$ and the set of rows in it. Since column permutation does not affect the convergence of the block Cimmino algorithm [21], the original column ordering of $A$ is maintained.
Consider a partition $\Pi$ of $G(A)$. The weight $W(V_k)$ of a part $V_k$ is equal to either the number of rows or the number of nonzeros in block row $\hat{A}_k$, depending on which vertex weighting scheme of Eq. 16 is used. That is,

(19) $W(V_k) = \sum_{v_i \in V_k} w(v_i).$

Therefore, in partitioning $G(A)$, the partitioning constraint of maintaining balance among part weights according to Eq. 28 corresponds to maintaining balance on either the number of rows or the number of nonzeros among the block rows.
Consider a partition $\Pi$ of $G(A)$. A cut edge $(v_i, v_j)$ between parts $V_k$ and $V_\ell$ represents a nonzero inter-block inner product between block rows $\hat{A}_k$ and $\hat{A}_\ell$. Therefore, the cutsize of $\Pi$ (given in Eq. 29) is equal to

(20) $\mathrm{cutsize}(\Pi) = \sum_{(v_i, v_j) \in E_{\mathrm{cut}}} \mathrm{cost}(v_i, v_j) = \sum_{(v_i, v_j) \in E_{\mathrm{cut}}} \left| \langle a_i, a_j \rangle \right|,$

which corresponds to the total sum of the inter-block inner products. So, in partitioning $G(A)$, the partitioning objective of minimizing the cutsize corresponds to minimizing the sum of inter-block inner products between block rows. Therefore, this partitioning objective corresponds to making the block rows numerically more orthogonal to each other. In this way, we expect this method to yield faster convergence in the CG accelerated block Cimmino algorithm.
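The cutsize/inner-product correspondence of Eq. 20 can be checked on a toy example (the matrix and the part vector `part` are illustrative):

```python
# The cutsize of a vertex partition of G(A) equals the sum of the
# row inner products crossing between blocks (Eq. 20).
import numpy as np

A = np.array([[1.0, 1.0, 0.0, 0.0],
              [1.0, 2.0, 0.0, 0.0],
              [0.0, 0.1, 1.0, 0.0],
              [0.0, 0.0, 0.0, 1.0]])
B = A @ A.T
part = np.array([0, 0, 1, 1])       # rows 0,1 -> block 1; rows 2,3 -> block 2
cutsize = sum(abs(B[i, j])
              for i in range(4) for j in range(i + 1, 4)
              if B[i, j] != 0 and part[i] != part[j])
# Crossing inner products: <a_0, a_2> = 0.1 and <a_1, a_2> = 0.2.
assert np.isclose(cutsize, 0.3)
```

Minimizing this quantity over all balanced partitions is exactly the graph partitioning objective used by the proposed method.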
We introduce Figs. 9 and 6 in order to clarify the proposed graph partitioning method for block-row partitioning. Figure (a) shows a straightforward 3-way block-row partition of the sample matrix, where the first four, the second four and the third four consecutive rows in the original order constitute the block rows $\hat{A}_1$, $\hat{A}_2$ and $\hat{A}_3$, respectively. Figure (b) shows the 3-way vertex partition of $G(A)$ that corresponds to this straightforward 3-way block-row partition. Figure (a) shows a "good" 3-way vertex partition of $G(A)$ obtained by using the graph partitioning tool METIS [33], and Fig. (b) shows the permuted matrix $\hat{A}$ and the block-row partition induced by this 3-way vertex partition.

As seen in Figs. 9 and 6, both the straightforward and the "good" block-row partitions achieve perfect balance on the row counts of the blocks by having exactly four rows per block. The quality difference between the straightforward and "good" block-row partitions can easily be seen by comparing the respective 3-way partitions of $G(A)$. Eight out of the nine thick edges remain on the cut of the straightforward partition, whereas all nine thick edges remain internal in the "good" partition.

Figure 12 shows the block-checkerboard partitioning of the resulting $B$ matrix induced by the straightforward and "good" block-row partitionings of the sample matrix, respectively. Note that both the rows and the columns of the $B$ matrix are partitioned conformably with the row partition of the $A$ matrix. The comparison shows that large nonzeros (dark cells) are scattered across the off-diagonal blocks of the $B$ matrix for the straightforward partitioning, whereas they are clustered in the diagonal blocks of $B$ for the "good" partitioning.
We introduce Fig. 15 to compare the quality of the straightforward and "good" block-row partitions in terms of inter-block row inner-product sums. In the figure, each off-diagonal entry $ip_{k\ell}$ of the IP matrix shows the sum of the inter-block row inner products between the respective block rows $\hat{A}_k$ and $\hat{A}_\ell$. That is,

(21) $ip_{k\ell} = \sum_{a_i \in \hat{A}_k} \sum_{a_j \in \hat{A}_\ell} \left| \langle a_i, a_j \rangle \right|.$

As seen in Fig. 15, the $ip_{k\ell}$ values for the straightforward partition are considerably larger than those for the "good" partition. Note that $ip_{k\ell}$ is also equal to the sum of the absolute values of the nonzeros of the off-diagonal block $B_{k\ell}$ at row block $k$ and column block $\ell$ of the $B$ matrix, i.e.,

(22) $ip_{k\ell} = \sum_{b_{ij} \in B_{k\ell}} \left| b_{ij} \right|.$

Therefore, the total sum of inter-block inner products, $\sum_{k} \sum_{\ell \ne k} ip_{k\ell}$, is

(23)

for the straightforward partition, whereas for the "good" partition it is

(24)
Figures (b) and (a), respectively, show the eigenvalue spectrum of $H$ for the straightforward and "good" partitionings. As seen in the figures, the eigenvalues for the straightforward partitioning are spread over a much wider interval. After partitioning with the proposed method, the eigenvalues are much better clustered around 1 and the smallest eigenvalue is much larger than that of the straightforward partitioning method.
2.3 Implementation
Implementation of the proposed partitioning method consists of two stages: constructing and partitioning $G(A)$.
Constructing $G(A)$: For constructing $G(A)$, we choose to use a basic sparse matrix-matrix multiplication (SpGEMM) [28] kernel due to existing efficient implementations. The edges of $G(A)$ are obtained from the off-diagonal nonzeros of the $B = AA^T$ matrix, whereas their costs are obtained from the absolute values of those nonzeros.

Note that when matrix $A$ has dense column(s), the corresponding $B$ matrix will be quite dense. In other words, when a column has $d$ nonzeros, the corresponding $B$ matrix will have at least $d^2$ nonzeros (barring numerical cancellation). For example, Fig. (a) shows a sparse matrix that has a dense column, and as seen in Fig. (b), the corresponding $B$ matrix is dense. Clearly, a large number of nonzeros in $B$ (i.e., a large number of edges in $G(A)$) increases the memory requirement and the computational cost of SpGEMM, as well as the time required for partitioning $G(A)$.

In order to alleviate the aforementioned problem, we propose the following methodology for sparsening $B$. We identify a column $A(:,j)$ (in MATLAB [35] notation) of an $n \times n$ matrix $A$ as a dense column if it contains more than a threshold number of nonzeros. Given this threshold, we extract a sparse matrix $\tilde{A}$ from $A$ by keeping only the largest (in absolute value) nonzeros of the dense columns of $A$. That is, the smallest entries of a dense matrix column $A(:,j)$ are ignored while constructing column $\tilde{A}(:,j)$. Hence, the SpGEMM operation is performed on $\tilde{A}$ to obtain the sparsened resulting matrix $\tilde{B} = \tilde{A}\tilde{A}^T$, which leads to a sparsened graph. For example, Fig. (c) shows the sparsity pattern of the sparse matrix $\tilde{A}$ extracted from $A$ by keeping the largest nonzeros in the dense column of $A$. As seen in Fig. (d), the $\tilde{B}$ matrix is very sparse compared with Fig. (b). Note that both $\tilde{A}$ and $\tilde{B}$ are used only for constructing $G(A)$; after the partitioning stage, both matrices are discarded. In the rest of the paper, $\tilde{B}$ will be referred to as $B$ for the sake of simplicity of presentation.
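A minimal sketch of this dense-column sparsening follows; the threshold `t` and the kept count `k` are illustrative parameters, not the paper's settings:

```python
# Keep only the k largest-magnitude entries in any column with more than
# t nonzeros, zeroing the rest; columns at or below the threshold are untouched.
import numpy as np
import scipy.sparse as sp

def sparsen(A, t, k):
    A = A.tocsc(copy=True)
    for j in range(A.shape[1]):
        start, end = A.indptr[j], A.indptr[j + 1]
        if end - start > t:                        # column j is "dense"
            vals = A.data[start:end]
            small = np.argsort(np.abs(vals))[:-k]  # all but the k largest
            vals[small] = 0.0
    A.eliminate_zeros()
    return A

A = sp.csc_matrix(np.array([[1.0, 5.0],
                            [2.0, 0.0],
                            [3.0, 0.0],
                            [0.5, 0.0]]))
As = sparsen(A, t=2, k=2)          # column 0 is "dense" (4 > 2 nonzeros)
assert As[:, 0].nnz == 2           # only the 2 largest survive: 2.0 and 3.0
assert As[0, 1] == 5.0             # the sparse column is untouched
```

Running SpGEMM on the sparsened matrix then bounds the fill of $\tilde{B}$ contributed by formerly dense columns.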
Partitioning $G(A)$: We use the multilevel graph partitioning tool METIS [33] for partitioning $G(A)$. Since METIS uses positive integer weights for edges, the edge costs, which are computed as floating-point values, are cast to integers. In order to distribute the integer weights as evenly as possible between the small and large edge costs in the graph, we scale the costs to a range from 1 to an upper bound $W$ as follows in order to compute the integer edge weights required by METIS:

(25) $w(v_i, v_j) = \mathrm{round}\!\left( \frac{\mathrm{cost}(v_i, v_j) - mincost}{maxcost - mincost} \times (W - 1) \right) + 1.$

Here, $\mathrm{cost}(v_i, v_j)$ is the edge cost computed according to Eq. 17, $mincost$ and $maxcost$ respectively denote the minimum and maximum edge costs, and $w(v_i, v_j)$ is the weight of the respective edge provided to METIS.
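A sketch of this min-max integer scaling (the upper bound `W` is an illustrative assumption, as is the function name):

```python
# Linear min-max scaling of floating-point edge costs to integers in [1, W],
# as required by METIS (cf. Eq. 25).
import numpy as np

def metis_weights(costs, W=1000):
    c = np.asarray(costs, dtype=float)
    lo, hi = c.min(), c.max()
    if hi == lo:
        return np.ones(len(c), dtype=int)   # all costs equal: unit weights
    # map [lo, hi] -> [1, W], rounded to positive integers
    return (np.rint((c - lo) / (hi - lo) * (W - 1)) + 1).astype(int)

w = metis_weights([0.01, 1.0, 4.0, 100.0])
assert w.min() == 1 and w.max() == 1000
assert np.all(np.diff(w) > 0)               # relative order is preserved
```

Note that widely spread costs map a few edges near `W` and crowd the rest near 1, which motivates the prescaling discussed next.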
The scaling given in Eq. 25 may yield nonuniformly distributed integer edge weights in some instances. That is, there could be some clustering of the integer edge weights even though the edge costs are not clustered in the original graph. Ideally, the edge weights would be real numbers. However, in order to use the limited range of integer weights more effectively, one needs more uniformly distributed edge weights. Therefore, we prescale the edge costs of $G(A)$ by taking their square roots before assigning the integer weights using Eq. 25.

3 Experimental Results
3.1 Experimental Framework
In the experiments, we used the CG accelerated block Cimmino implementation available in ABCD Solver v1.0-beta [39]. In ABCD Solver, the MUMPS [3] sparse direct solver is used to factorize the systems once and then to solve them repeatedly during the iterations. We used the same matrix scaling method [4], setting the NB1, NB2 and NB3 parameters to 10, 20 and 20, respectively, as suggested by one of the previous versions of ABCD Solver (version of 2014-09-15) for the CG accelerated block Cimmino.
ABCD Solver includes a stabilized block-CG accelerated block Cimmino algorithm [8], especially for solving systems with multiple right-hand side vectors. Since the classical CG is guaranteed to converge for systems where the coefficient matrix is symmetric and positive definite, and since its convergence theory is well established, in this work we implement the classical CG accelerated block Cimmino of Algorithm 2 for solving sparse linear systems of equations with a single right-hand side vector, rather than the block-CG acceleration. In all experiments with the CG accelerated block Cimmino, we use the normwise backward error [7] at iteration $k$,

(26) $\eta_k = \frac{\left\| f - A x^{(k)} \right\|}{\|A\| \left\| x^{(k)} \right\| + \|f\|},$

as the stopping criterion, and we impose a limit on the maximum number of iterations.
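The stopping criterion of Eq. 26 can be sketched as follows; the choice of the infinity norm here is an illustrative assumption (the criterion is stated norm-generically above):

```python
# Normwise backward error eta_k = ||f - A x_k|| / (||A|| * ||x_k|| + ||f||),
# using the infinity norm as one concrete choice.
import numpy as np

def backward_error(A, f, x):
    r = f - A @ x
    return np.linalg.norm(r, np.inf) / (
        np.linalg.norm(A, np.inf) * np.linalg.norm(x, np.inf)
        + np.linalg.norm(f, np.inf))

A = np.array([[4.0, 1.0], [1.0, 3.0]])
xstar = np.array([1.0, -2.0])
f = A @ xstar
assert backward_error(A, f, xstar) == 0.0           # exact solution
assert backward_error(A, f, xstar + 1e-8) < 1e-7    # tiny perturbation
```

Unlike the plain residual norm, this quantity is scale-invariant: multiplying both $A$ and $f$ by a constant leaves $\eta_k$ unchanged.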
In the experiments, we compared the performance of the proposed partitioning method (GP) against two baseline partitioning methods already available in ABCD Solver. The first baseline method, referred to as the uniform partitioning (UP) method in ABCD Solver, partitions the rows of the coefficient matrix into a given number of block rows with almost equal numbers of rows, without any permutation on the rows of $A$. Note that the UP method is the same as the straightforward partitioning method mentioned in Section 2.
The second baseline method is the hypergraph partitioning (HP) method [21]. This method uses the column-net model [14] of the sparse matrix, in which rows and columns are respectively represented by vertices and hyperedges, both with unit weights [21]. Each hyperedge connects the set of vertices corresponding to the rows that have a nonzero in the respective column. In the HP method, a $K$-way partition of the vertices of the column-net model is used to find the block rows. The partitioning constraint of maintaining balance on part weights corresponds to finding block rows with equal numbers of rows. The partitioning objective of minimizing the cutsize according to the cut-net metric corresponds to minimizing the number of linking columns among block rows. Here, a linking column refers to a column that contains nonzeros in more than one block row.
The HP method in ABCD Solver uses the multilevel hypergraph partitioning tool PaToH [15] for partitioning the column-net model of matrix $A$. Here, we use the same parameters for PaToH as specified in ABCD Solver; that is, the final and initial imbalance parameters are set as in ABCD Solver and the other parameters are left at their PaToH defaults.

In the proposed GP method, as mentioned in Section 2.3, the multilevel graph partitioning tool METIS is used to partition the row inner-product graph model of $A$. The imbalance parameter of METIS is set to 10% and the k-way option is used. For the sake of a fair comparison between the HP and GP methods, unit vertex weights are used in $G(A)$. The other parameters are left at the METIS defaults. Since both PaToH and METIS use randomized algorithms, we report the geometric mean of the results of multiple runs with different seeds for each instance. Here and hereafter, we use GP, HP and UP to refer to the respective block-row partitioning method, as well as to ABCD Solver utilizing the regular block Cimmino algorithm for solving the systems partitioned by the respective method.
The extensive numerical experiments were conducted on a shared memory system: a four-socket, 64-core computer containing four AMD Opteron 6376 processors, where each processor has 16 cores running at 2.3 GHz, with a total of 64 GB of DDR3 memory. Due to the memory bandwidth limitations of the platform, experiments are performed with 32 cores.
3.2 Effect of block-row partitioning on the eigenvalue spectrum of $H$
The convergence rate of the CG accelerated block Cimmino algorithm is related to the eigenvalue spectrum of $H$. By the nature of the block Cimmino algorithm, most of the eigenvalues of $H$ are clustered around 1, but there can be some eigenvalues at the extremes of the spectrum.
In this subsection, we conduct experiments to study the effect of the partitioning on the eigenvalue spectrum of $H$ by comparing the proposed GP method against UP and HP. In these experiments, in order to be able to compute the eigenvalue spectrum within a reasonable amount of time and memory, we use four small nonsymmetric sparse matrices from the SuiteSparse Matrix Collection [19]. The first and second matrices arise in computational fluid dynamics problems, whereas the third and fourth matrices arise in power network problems. We partition the matrices into the same number of block rows for all three partitioning methods UP, HP and GP.
Figure 28 shows the eigenvalue spectrum of $H$ obtained by the UP, HP and GP methods for each test instance. The figure also reports the number of CG iterations (iters) required for convergence, as well as the smallest and largest eigenvalues of $H$. As seen in the figure, both the HP and GP methods achieve significantly better clustering of the eigenvalues around 1 compared to UP. This experimental finding is reflected in the remarkable decrease in the number of CG iterations attained by both HP and GP over UP.

In the comparison of HP and GP, GP achieves better eigenvalue clustering and hence a better convergence rate than HP for all instances. For all four instances, the better clustering quality attained by GP over HP leads to a significant improvement in the convergence rate.
3.3 Dataset for Performance Analysis
For the following experiments, we selected all nonsingular nonsymmetric square matrices within a chosen size range of rows and columns from the SuiteSparse Matrix Collection [19]. Only the HV15R and cage14 matrices were excluded, due to memory limitations. We observed that at least one of the three partitioning methods was able to converge within the CG iteration limit for 80 of these instances.
Table 1 shows the properties of the matrices used in the experiments. In the table, the matrices are listed in decreasing order of their number of nonzeros.

The right-hand side vectors of the systems are obtained by multiplying the coefficient matrices with a vector whose elements are uniform random numbers, as in ABCD Solver. In all instances, the CG iterations are started from the zero vector.
Matrix name  n  nnz  Matrix name  n  nnz 
ML_Geer  1,504,002  110,686,677  para10  155,924  2,094,873 
RM07R  381,689  37,464,962  para6  155,924  2,094,873 
ML_Laplace  377,002  27,582,698  cage12  130,228  2,032,536 
Transport  1,602,111  23,487,281  ASIC_320k  321,821  1,931,828 
rajat31  4,690,002  20,316,253  rajat21  411,676  1,876,011 
CoupCons3D  416,800  17,277,420  crashbasis  160,000  1,750,416 
Freescale1  3,428,755  17,052,626  majorbasis  160,000  1,750,416 
circuit5M_dc  3,523,317  14,865,409  venkat01  62,424  1,717,792 
memchip  2,707,524  13,343,948  venkat50  62,424  1,717,777 
atmosmodm  1,489,752  10,319,760  venkat25  62,424  1,717,763 
atmosmodl  1,489,752  10,319,760  ASIC_680ks  682,712  1,693,767 
atmosmodj  1,270,432  8,814,880  ASIC_320ks  321,671  1,316,085 
atmosmodd  1,270,432  8,814,880  language  399,130  1,216,334 
torso1  116,158  8,516,500  twotone  120,750  1,206,265 
PR02R  161,070  8,185,136  matrix_9  103,430  1,205,518 
cage13  445,315  7,479,343  torso2  115,967  1,033,473 
ohne2  181,343  6,869,939  transient  178,866  961,368 
pre2  659,033  5,834,044  ASIC_100k  99,340  940,621 
Hamrle3  1,447,360  5,514,242  matrixnew_3  125,329  893,984 
Chebyshev4  68,121  5,377,761  m133b3  200,200  800,800 
largebasis  440,020  5,240,084  shar_te2b3  200,200  800,800 
tmt_unsym  917,825  4,584,801  trans4  116,835  749,800 
torso3  259,156  4,429,042  Baumann  112,211  748,331 
xenon2  157,464  3,866,688  rajat25  87,190  606,489 
laminar_duct3D  67,173  3,788,857  rajat28  87,190  606,489 
rajat29  643,994  3,760,246  rajat20  86,916  604,299 
FEM_3D_thermal2  147,900  3,489,300  LeGresley_87936  87,936  593,276 
stomach  213,360  3,021,648  ASIC_100ks  99,190  578,890 
thermomech_dK  204,316  2,846,228  3D_51448_3D  51,448  537,038 
ASIC_680k  682,862  2,638,997  ibm_matrix_2  51,448  537,038 
poisson3Db  85,623  2,374,949  hcircuit  105,676  513,072 
barrier210  115,625  2,158,759  lung2  109,460  492,564 
barrier211  115,625  2,158,759  2D_54019_highK  54,019  486,129 
barrier29  115,625  2,158,759  rajat17  94,294  479,246 
barrier212  115,625  2,158,759  rajat18  94,294  479,151 
mc2depi  525,825  2,100,225  mark3jac140sc  64,089  376,395 
para8  155,924  2,094,873  shyy161  76,480  329,762 
para9  155,924  2,094,873  mark3jac120sc  54,929  322,483 
para5  155,924  2,094,873  circuit_4  80,209  307,604 
para7  155,924  2,094,873  bayer01  57,735  275,094 
We partition all test matrices into the same number of block rows for all three partitioning methods UP, HP and GP. Since all parallel runs are performed on 32 cores, the number of row blocks is greater than the number of cores. The consecutive row blocks are uniformly assigned to the cores so that each core is assigned an equal number of row blocks.
3.4 Convergence and parallel performance
In this subsection, we study the performance of the proposed GP method against UP and HP in terms of the number of CG iterations and parallel CG time to solution.
Table 2 shows the number of CG iterations and the parallel CG time for each matrix. In the table, "F" denotes that an algorithm fails to reach the desired backward error within the iteration limit for the respective matrix instance, and "S" denotes that an algorithm fails during the factorization stage due to a singular block row. "S" occurs only for rajat29 with UP. As seen in Table 2, UP and HP fail to converge in 27 and 17 test instances, respectively, whereas GP fails in only one test instance.
In Table 2, the best result for each test instance is shown in bold. As seen in the table, out of 80 instances, the proposed GP method achieves the fastest convergence in 54 instances, whereas HP and UP achieve the fastest convergence in only 17 and 10 instances, respectively. Furthermore, as seen in Table 2, GP achieves the fastest iterative solution time in 51 instances, whereas HP and UP achieve the fastest solution time in 18 and 11 instances, respectively.
Matrix name  # of CG iterations  Parallel CG time to soln.  
UP  HP  GP  UP  HP  GP  
ML_Geer  355  492  408  293.0  602.9  999.9 
RM07R  1990  2128  506  1140.0  1228.3  246.1 
ML_Laplace  760  952  971  162.0  314.0  314.6 
Transport  2214  1666  1520  1400.0  1709.0  1416.5 
rajat31  521  606  239  515.0  645.8  248.6 
CoupCons3D  1280  479  152  333.0  114.4  39.1 
Freescale1  F  4083  480  F  2996.2  337.6 
circuit5M_dc  68  38  7  77.7  28.6  5.4 
memchip  298  380  53  262.0  209.0  30.9 
atmosmodm  44  23  42  27.8  12.8  23.2 
atmosmodl  38  70  103  21.2  38.2  60.1 
atmosmodj  167  148  138  69.4  67.6  65.9 
atmosmodd  171  155  149  80.7  73.5  73.4 
torso1  1559  854  1050  211.0  114.3  188.9 
PR02R  F  3505  2194  F  391.7  258.9 
cage13  14  17  16  4.9  7.7  7.9 
ohne2  F  2750  721  F  383.7  123.3 
pre2  2343  264  524  535.0  47.9  137.1 
Hamrle3  3555  2058  1583  2030.0  715.2  521.3 
Chebyshev4  F  F  173  F  F  46.4 
largebasis  1379  554  924  247.0  67.7  111.1 
tmt_unsym  495  880  770  89.1  202.1  194.3 
torso3  162  75  27  20.7  12.1  4.2 
xenon2  25  19  21  1.8  1.2  1.3 
laminar_duct3D  247  403  323  13.7  18.2  19.3 
rajat29  S  14  28  S  11.8  25.1 
FEM_3D_thermal2  86  73  47  5.1  5.2  3.6 
stomach  85  37  15  8.5  4.0  1.5 
thermomech_dK  26  7  7  2.4  0.5  0.5 
ASIC_680k  23  8  5  17.0  7.9  4.0 
poisson3Db  F  1880  1457  F  82.7  70.7 
barrier210  F  F  4049  F  F  309.6 
barrier211  F  F  4270  F  F  323.0 
barrier29  F  F  3803  F  F  278.1 
barrier212  F  F  4043  F  F  280.3 
mc2depi  262  487  409  25.5  70.2  54.1 
para8  F  3167  1821  F  279.6  149.8 
para9  F  F  2826  F  F  249.5 
para5  F  F  3972  F  F  374.1 
para7  F  F  3504  F  F  321.2 
para10  F  F  3743  F  F  308.4 
para6  F  F  3756  F  F  332.8 
cage12  16  18  17  1.3  1.7  1.7 
ASIC_320k  1248  856  677  416.0  347.6  315.1 
rajat21  3358  F  3978  787.0  F  935.5 
crashbasis  263  182  163  18.7  10.0  8.9 
majorbasis  239  162  154  13.0  7.9  8.7 
venkat01  179  95  85  4.2  1.8  1.7 
venkat50  F  3245  3154  F  65.5  68.3 
venkat25  F  2491  2250  F  53.8  48.6 
ASIC_680ks  6  5  7  1.1  1.0  1.1 
ASIC_320ks  66  3  9  6.6  0.3  0.8 
language  759  438  146  104.0  53.9  16.5 
twotone  2643  1092  742  145.0  39.1  30.6 
matrix_9  4549  4291  2089  214.0  184.3  105.3 
torso2  42  20  19  1.2  0.7  0.6 
transient  2930  2044  F  583.0  253.8  F 
ASIC_100k  1057  138  169  180.0  21.7  25.5 
matrixnew_3  600  151  222  35.1  6.4  9.9 
m133b3  22  23  20  1.7  1.8  1.5 
shar_te2b3  22  23  20  1.9  1.8  1.6 
trans4  F  F  2011  F  F  335.3 
Baumann  F  F  2986  F  F  103.8 
rajat25  3054  2414  918  169.0  154.0  51.2 
rajat28  2636  1280  842  149.0  82.0  43.8 
rajat20  3186  1912  893  198.0  124.3  45.2 
LeGresley_87936  F  935  137  F  15.0  2.3 
ASIC_100ks  203  51  21  5.1  1.2  0.5 
ibm_matrix_2  F  2055  3527  F  26.9  57.9 
3D_51448_3D  F  1926  3884  F  28.3  72.6 
hcircuit  F  F  2686  F  F  38.4 
lung2  34  8  17  0.6  0.1  0.3 
2D_54019_highK  F  275  60  F  3.3  0.7 
rajat17  1695  584  386  92.7  32.1  20.6 
rajat18  1123  406  393  65.3  22.3  20.2 
mark3jac140sc  F  F  1716  F  F  24.3 
shyy161  108  115  115  1.3  1.9  1.7 
mark3jac120sc  F  F  1709  F  F  19.0 
circuit_4  F  F  2400  F  F  48.4 
bayer01  648  25  27  6.1  0.3  0.3 
Number of bests  10  17  54  11  18  51 
Figure 31 shows the performance profiles [20], which compare multiple methods over the set of test instances. A performance profile compares each method against the best performing method on each test instance and reports the fraction of test instances in which a method's performance is within a given factor of that of the best method. For example, in Fig. (a)a, a point (abscissa=, ordinate=) on the performance curve of a method indicates that the performance of that method is no worse than that of the best method by a factor of 2 in approximately of the instances. The closer a method's curve is to the top-left corner, the better its performance.
In Figs. (b)b and (a)a, we show the performance profiles in terms of the number of CG iterations and parallel CG times, respectively. As seen in Fig. (a)a, the number of CG iterations required by GP for convergence does not exceed that of the best method by a factor of 2 in approximately of the instances, whereas HP and UP achieve the same relative performance compared to the best method in approximately and of the instances, respectively. As seen in Fig. (b)b, the CG time using GP is not slower than that of the best method by a factor of 2 in approximately of the instances, whereas HP and UP achieve the same relative performance compared to the best method in approximately and of the instances, respectively. Figures (b)b and (a)a show that the CG time is directly proportional to the number of CG iterations, as expected.
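The performance-profile construction described above can be sketched in a few lines. This is a minimal illustration, not the authors' code; the cost values below are hypothetical, and `np.inf` is used here to mark failed runs (rows are methods, columns are test instances).

```python
import numpy as np

def performance_profile(costs, tau_grid):
    """For each method, return the fraction of instances whose cost is
    within a factor tau of the best method's cost, for each tau."""
    costs = np.asarray(costs, dtype=float)   # shape: (methods, instances)
    best = costs.min(axis=0)                 # best cost per instance
    ratios = costs / best                    # performance ratios; inf = failure
    # rho_m(tau) = fraction of instances with ratio <= tau
    return np.array([[np.mean(r <= tau) for tau in tau_grid] for r in ratios])

# Hypothetical costs for 3 methods on 4 instances; np.inf denotes failure.
costs = [[1.0, 4.0, 2.0, np.inf],
         [2.0, 1.0, 1.0, 3.0],
         [1.5, 2.0, 1.0, 1.0]]
profiles = performance_profile(costs, tau_grid=[1, 2, 4])
```

Plotting each row of `profiles` against `tau_grid` yields curves of the kind shown in the figures: a method whose curve rises toward 1 at small tau dominates the comparison.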
3.5 Preprocessing Overhead and Amortization
In this subsection, we analyze the relative preprocessing overhead of the three methods UP, HP and GP in order to determine whether the intelligent partitioning methods HP and GP amortize their costs. For all three block-row partitioning methods, the preprocessing overhead includes matrix scaling, block-row partitioning, creating the submatrices corresponding to these block rows, and distributing these submatrices among processors. Recall that the partitioning for UP is straightforward and hence adds only a negligible cost to the preprocessing overhead. On the other hand, the intelligent partitioning algorithms utilized in HP and GP add a considerable cost to the overall preprocessing overhead. Furthermore, for GP, the construction of the row inner-product graph also incurs a significant cost.
In Table 3, we display the total preprocessing time and the total execution time of all three methods for each of the 80 test instances. In the table, the total execution time is the sum of the total preprocessing time and the total solution time, where the total solution time includes the parallel factorization and parallel CG solution times. Note that the factorization in the block Cimmino algorithm in the ABCD Solver is performed by the parallel sparse direct solver MUMPS [3]. The factorization, which needs to be performed only once, involves the symbolic and numerical factorizations of the coefficient matrices of the augmented systems that arise in the block Cimmino algorithm. This factorization process is embarrassingly parallel, since the coefficient matrices of the augmented systems are factorized independently. Recall that MUMPS is also used during the iterative solution stage, where at each iteration a linear system is solved using the factors of the augmented systems computed during the factorization stage.
As seen in Table 3, in terms of preprocessing time, UP is the clear winner in all instances as expected, whereas GP incurs much less preprocessing time than HP in all but two instances. As also seen in Table 3, comparing all three methods, GP and UP achieve the smallest total execution time in 50 and 23 instances, respectively, whereas HP achieves the smallest total execution time in only 7 instances.
Matrix name  Preprocessing time  Total execution time
             UP  HP  GP          UP  HP  GP
ML_Geer  69.9  462.2  194.2  377.7  1100.9  1252.1 
RM07R  19.5  208.7  84.2  1168.0  1471.6  341.8 
ML_Laplace  15.8  114.9  46.6  181.1  432.8  366.4 
Transport  22.4  126.2  50.4  1429.3  1872.5  1487.6 
rajat31  34.7  85.6  47.4  562.3  745.5  326.7 
CoupCons3D  10.4  60.6  29.8  346.4  179.7  72.9 
Freescale1  27.8  81.3  42.1  F  3086.6  394.3 
circuit5M_dc  26.3  72.6  40.8  115.4  110.6  55.4 
memchip  20.5  58.7  36.1  291.4  284.7  74.5 
atmosmodm  12.5  67.5  25.3  45.9  91.3  56.6 
atmosmodl  12.5  67.6  24.5  39.4  117.3  96.7 
atmosmodj  8.9  57.7  21.3  83.1  133.7  94.8 
atmosmodd  10.1  58.5  22.3  95.6  140.4  103.4 
torso1  4.1  143.3  11.8  220.3  286.9  224.7 
PR02R  4.9  31.0  14.3  F  425.9  275.1 
cage13  7.4  71.8  33.4  19.4  101.0  71.1 
ohne2  4.5  32.0  19.2  F  421.4  144.8 
pre2  6.2  59.3  72.7  543.3  111.7  215.9 
Hamrle3  9.8  30.8  16.5  2045.3  751.4  543.0 
Chebyshev4  2.7  19.2  14.6  F  F  98.7 
largebasis  4.7  12.3  8.3  253.1  81.6  121.0 
tmt_unsym  6.4  19.6  10.1  97.7  224.8  207.1 
torso3  3.5  24.0  9.2  25.4  38.3  15.1 
xenon2  2.9  14.4  7.5  5.7  16.7  9.9 
laminar_duct3D  2.2  17.0  9.9  16.6  36.0  30.2 
rajat29  4.3  28.0  14.0  S  1109.4  825.1 
FEM_3D_thermal2  2.6  15.5  6.7  8.8  21.8  11.3 
stomach  2.5  15.1  5.9  12.0  20.3  8.5 
thermomech_dK  3.1  11.4  5.7  6.7  12.8  7.1 
ASIC_680k  4.3  35.4  17.5  331.9  719.8  295.9 
poisson3Db  2.2  16.3  9.6  F  100.0  81.0 
barrier210  1.9  15.4  7.6  F  F  318.0 
barrier211  1.9  15.5  7.5  F  F  331.7 
barrier29  1.9  15.3  7.6  F  F  286.8 
barrier212  2.0  15.5  7.7  F  F  289.0 
mc2depi  3.3  10.9  5.4  30.1  83.0  61.2 
para8  2.0  14.0  6.8  F  294.7  158.0 
para9  2.0  14.0  6.8  F  F  257.5 
para5  2.0  14.0  6.6  F  F  381.8 
para7  1.9  14.1  6.7  F  F  329.3 
para10  2.0  14.1  6.7  F  F  316.9 
para6  1.9  14.3  6.7  F  F  340.6 
cage12  1.8  19.0  9.2  4.3  22.3  12.6 
ASIC_320k  2.5  20.3  7.7  457.6  499.2  399.7 
rajat21  2.7  23.6  15.0  803.4  F  963.8 
crashbasis  1.5  7.1  4.2  21.2  18.0  14.0 
majorbasis  1.6  7.2  4.1  15.5  16.0  13.6 
venkat01  1.2  5.7  3.0  6.1  8.0  5.3 
venkat50  1.2  5.6  3.0  F  71.8  71.9 
venkat25  1.2  5.6  3.0  F  60.0  52.2 
ASIC_680ks  3.5  9.0  7.5  6.7  13.3  11.0 
ASIC_320ks  2.1  6.8  5.9  9.7  8.4  8.2 
language  2.5  9.0  7.0  108.3  65.1  25.0 
twotone  1.2  6.9  6.6  147.2  47.9  38.6 
matrix_9  1.2  8.1  4.3  215.9  193.3  110.6 
torso2  0.9  4.4  1.8  2.5  5.6  3.0 
transient  1.2  6.2  3.8  587.1  264.8  F 
ASIC_100k  1.0  6.5  3.9  214.6  64.8  56.7 
matrixnew_3  0.9  6.2  3.4  36.8  14.0  14.3 
m133b3  1.3  9.2  8.3  3.8  12.0  10.7 
shar_te2b3  1.3  9.1  8.4  4.2  11.7  11.0 
trans4  0.8  4.7  2.6  F  F  360.5 
Baumann  0.7  5.9  2.1  F  F  106.6 
rajat25  0.7  5.2  4.1  170.9  161.8  56.7 
rajat28  0.7  5.1  3.9  150.8  89.4  49.1 
rajat20  0.6  5.1  4.1  199.8  131.8  50.8 
LeGresley_87936  0.7  2.6  1.5  F  18.1  4.4 
ASIC_100ks  0.7  3.9  3.1  6.4  5.8  4.2 
ibm_matrix_2  0.5  4.0  2.2  F  31.7  61.2 
3D_51448_3D  0.5  4.1  2.2  F  33.2  75.2 
hcircuit  0.6  2.2  1.8  F  F  40.6 
lung2  0.6  1.5  1.0  1.6  1.9  1.8 
2D_54019_highK  0.5  2.6  1.3  F  8.1  2.8 
rajat17  0.6  4.7  3.6  94.5  38.3  25.6 
rajat18  0.6  4.7  3.5  67.0  28.4  25.3 
mark3jac140sc  0.5  3.3  2.4  F  F  27.2 
shyy161  0.4  1.9  0.9  1.9  4.4  3.1 
mark3jac120sc  0.4  3.0  2.1  F  F  21.6 
circuit_4  0.4  2.4  2.7  F  F  51.5 
bayer01  0.3  1.7  1.1  6.7  2.5  1.9 
Geometric means (preprocessing time)  2.3  13.7  7.5
Number of bests (total execution time)  23  7  50
In the relative comparison of GP and UP, GP incurs only more preprocessing time than UP on average, which leads GP to achieve a smaller total execution time than UP in 56 out of 80 instances. In other words, compared to the UP method, the sequential implementation of GP amortizes its preprocessing cost in those 56 instances by sufficiently reducing the number of CG iterations. That is, GP amortizes its preprocessing cost in those 56 instances even if only one linear system with a single right-hand side vector is solved. Note that in many applications in which sparse linear systems arise, the solution of consecutive linear systems is required, where the coefficient matrix remains the same but the right-hand side vector changes. For such applications, the amortization performance of the proposed GP method improves further. For example, the number of instances in which GP amortizes its cost increases from 56 to 60 and 64 for solving two and five consecutive linear systems, respectively.
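The amortization argument above reduces to a simple break-even computation: GP pays a one-time extra preprocessing cost that must be recovered by its faster per-solve time over consecutive solves. A minimal sketch with hypothetical timings (not taken from Table 3):

```python
import math

def break_even_solves(prep_up, solve_up, prep_gp, solve_gp):
    """Smallest number s of consecutive solves (same matrix, new RHS) for
    which prep_gp + s * solve_gp <= prep_up + s * solve_up.
    Returns None if GP's per-solve time is not actually smaller."""
    if solve_gp >= solve_up:
        return None
    extra_prep = prep_gp - prep_up            # one-time cost GP must recover
    return max(1, math.ceil(extra_prep / (solve_up - solve_gp)))

# Hypothetical timings in seconds: GP pays 6 s more preprocessing but
# saves 10 s per solve, so it already amortizes at the first solve.
s = break_even_solves(prep_up=2.0, solve_up=30.0, prep_gp=8.0, solve_gp=20.0)
```

With these numbers `s` is 1, matching the paper's observation that GP often amortizes even for a single right-hand side; a larger preprocessing gap simply pushes the break-even point to more consecutive solves.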
In Figs. (b)b and (a)a, we show the performance profiles in terms of the preprocessing time and the total execution time, respectively. As seen in Fig. (a)a, the preprocessing overhead incurred by GP remains no worse than that of the best method UP by a factor of 4 in of the instances. As seen in Fig. (b)b, the total execution time attained by GP does not exceed that of the best method by a factor of 2 in of the instances, whereas HP and UP achieve the same relative performance compared to the best method in only approximately and of the instances, respectively.
4 Conclusion and future work
In this paper, we proposed a novel partitioning method to improve the CG accelerated block Cimmino algorithm. The proposed partitioning method takes the numerical orthogonality between block rows of the coefficient matrix into account. Experiments on a large set of real-world systems show that the proposed method improves the convergence rate of the CG accelerated block Cimmino algorithm compared to the state-of-the-art hypergraph partitioning method. Moreover, it requires not only less preprocessing time and fewer CG iterations, but also much less total execution time than the hypergraph partitioning method.
As future work, we consider two issues: further reducing the number of iterations through preconditioning, and reducing the preprocessing overhead through parallelization.
Even though the matrix is not available explicitly, it could still be possible to obtain a preconditioner that further reduces the required number of CG iterations. One viable option could be a sparse approximate inverse type preconditioner, for which the coefficient matrix does not need to be available explicitly. This approach could be viable especially when consecutive linear systems need to be solved with the same coefficient matrix.
The proposed method involves two computational stages, namely constructing the row inner-product graph via an SpGEMM operation and partitioning this graph. To parallelize the first stage, a parallel SpGEMM [1, 2, 9] operation could be used to construct local subgraphs on each processor. To parallelize the second stage, the parallel graph partitioning tool ParMETIS [32] could be used, taking as input on each processor the local subgraphs generated in parallel in the first stage.
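The first stage amounts to forming the off-diagonal structure of B = A A^T: an edge connects rows i and j whenever their inner product is nonzero, with cost |<a_i, a_j>|. The sketch below illustrates this on a tiny hypothetical matrix stored as sparse row dictionaries; it checks all row pairs for clarity, whereas a real SpGEMM would exploit sparsity to skip structurally orthogonal pairs.

```python
# Hypothetical 4x4 sparse matrix A, one {column: value} dict per row.
A_rows = [
    {0: 2.0, 2: 1.0},
    {1: 3.0},
    {0: 1.0, 2: 2.0, 3: 1.0},
    {2: 1.0, 3: 4.0},
]

def row_inner_product_graph(rows):
    """Build the row inner-product graph of A: one vertex per row, and an
    edge {i, j} with cost |<a_i, a_j>| for every pair of rows whose inner
    product (an off-diagonal entry of B = A A^T) is nonzero."""
    edges = {}
    for i in range(len(rows)):
        for j in range(i + 1, len(rows)):
            # sparse dot product over the shared nonzero columns
            dot = sum(v * rows[j][c] for c, v in rows[i].items() if c in rows[j])
            if dot != 0.0:
                edges[(i, j)] = abs(dot)
    return edges

edges = row_inner_product_graph(A_rows)
# Row 1 shares no columns with the others, so it ends up isolated:
# heavily connected rows are numerically far from orthogonal and should
# be kept in the same block by the partitioner.
```

Partitioning this graph so that high-cost edges stay uncut then groups nearly non-orthogonal rows into the same block row, which is the goal of the GP method.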
Appendix A Graph and Graph Partitioning
An undirected graph $G=(V,E)$ consists of a set of vertices $V$ and a set of edges $E$. Each edge $e_{i,j} \in E$ connects a pair of distinct vertices $v_i$ and $v_j$. Each vertex $v_i$ can be assigned a weight, denoted $w(v_i)$, and each edge $e_{i,j}$ can be assigned a cost, denoted $c(e_{i,j})$.
$\Pi = \{V_1, V_2, \ldots, V_K\}$ is defined as a $K$-way vertex partition of $G$ if the parts are mutually disjoint and exhaustive. An edge $e_{i,j}$ is said to be cut if the vertices $v_i$ and $v_j$ belong to different vertex parts, and uncut otherwise. The set of cut edges of a partition $\Pi$ is denoted as $E_{cut}$. In a given partition $\Pi$ of $G$, the weight $W(V_k)$ of a part $V_k$ is defined as the sum of the weights of the vertices in $V_k$, i.e., $W(V_k) = \sum_{v_i \in V_k} w(v_i)$.
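The definitions above translate directly into code. The following sketch (with a hypothetical 2-way partition of a 4-vertex graph) computes the cut-edge set and the part weights just defined:

```python
def cut_edges_and_weights(edges, part, weights):
    """Given edge costs {(i, j): cost}, a vertex-to-part map, and vertex
    weights, return the set of cut edges and the weight of each part."""
    # an edge is cut iff its endpoints lie in different parts
    cut = {e for e in edges if part[e[0]] != part[e[1]]}
    K = max(part.values()) + 1
    W = [0.0] * K
    for v, w in weights.items():
        W[part[v]] += w           # W(V_k) = sum of vertex weights in V_k
    return cut, W

# Hypothetical instance: 4 vertices, 3 edges, parts V_0 = {0, 1}, V_1 = {2, 3}.
edges = {(0, 1): 1.0, (1, 2): 2.0, (2, 3): 1.0}
part = {0: 0, 1: 0, 2: 1, 3: 1}
weights = {0: 1.0, 1: 2.0, 2: 1.0, 3: 3.0}
cut, W = cut_edges_and_weights(edges, part, weights)
# Only edge (1, 2) crosses the part boundary; W = [3.0, 4.0].
```

A graph partitioner seeks a partition minimizing the total cost of `cut` while keeping the entries of `W` balanced.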