where A is an n × n sparse nonsymmetric nonsingular matrix, and x and b are column vectors of size n. In these methods, the solution is computed through successive projections. There are two major variations, known as Kaczmarz and Cimmino. Kaczmarz obtains the solution through a product of orthogonal projections, whereas Cimmino reaches the solution through a sum of orthogonal projections. Cimmino is known to be more amenable to parallelism than Kaczmarz. The required number of iterations for the Cimmino algorithm, however, can be quite large. One alternative variation is the block Cimmino method, which is a block row projection method. Iterative block Cimmino has been studied extensively in [6, 8, 12, 16, 21, 23, 24]. A direct version of block Cimmino based on augmented block row projections has also been proposed. However, as in any other direct solver [3, 11, 34, 36], this approach suffers from extensive memory requirements due to fill-in.
In the block Cimmino scheme, the linear system in Eq. 1 is partitioned into K block rows as follows:
In Eq. 2, the coefficient matrix A and the right-hand side vector b are partitioned conformably. Here, A_i is a submatrix of size m_i × n and b_i is a subvector of size m_i, for i = 1, 2, …, K, where
The block Cimmino scheme is given in Algorithm 1, where A_i^+ is the Moore–Penrose pseudoinverse of A_i, defined as
In Algorithm 1, A_i^+ is used only for clarity of notation and is never computed explicitly. In fact, at line 4 of Algorithm 1, the minimum-norm solution of an underdetermined linear least squares problem is obtained via the augmented system approach, as discussed later. The projection vectors in Algorithm 1, which are of size n, can be computed independently in parallel without any communication; hence, the block Cimmino algorithm is well suited to parallel computing platforms.
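As an illustration, the iteration can be sketched in a few lines for the special case where every block row is a single row, so that applying the pseudoinverse reduces to a scaled row. This is a minimal sketch only: the function name, the relaxation parameter omega, and the averaging over blocks are illustrative assumptions, not the solver's actual implementation.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def cimmino(A, b, omega=1.0, iters=2000):
    """Cimmino iteration with single-row blocks: each correction is the
    orthogonal projection of the residual direction onto span(a_i)."""
    n = len(A[0])
    x = [0.0] * n
    for _ in range(iters):
        delta = [0.0] * n
        for a_i, b_i in zip(A, b):
            r = b_i - dot(a_i, x)           # residual of row i
            scale = r / dot(a_i, a_i)       # minimum-norm correction factor
            for j in range(n):
                delta[j] += scale * a_i[j]  # sum of projections (parallelizable)
        x = [x[j] + omega * delta[j] / len(A) for j in range(n)]
    return x

A = [[4.0, 1.0], [1.0, 3.0]]
b = [1.0, 2.0]
x = cimmino(A, b)   # converges to the solution of A x = b
```

The inner loop over block rows is the part that parallelizes trivially; only the final summation of the corrections requires communication.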
An iteration of the block Cimmino method can be reformulated as follows:
where Q is the iteration matrix of the block Cimmino algorithm. H is the sum of the orthogonal projections onto the range of A_i^T and is defined by
The projections in the block Cimmino iterations can be computed using several approaches, such as the normal equations, the seminormal equations, the QR factorization, or the augmented system. The normal and seminormal equation approaches are not considered since they can introduce numerical difficulties that are disastrous in some cases when the problem is ill-conditioned. Although the QR factorization is numerically more stable, it is computationally expensive. Therefore, we use the augmented system approach, which requires the solution of a sparse linear system that can be handled effectively by a sparse direct solver. Note that if the coefficient matrix is in column-overlapped block diagonal form, one could also use an algorithm specialized for that structure. However, this approach is not considered since we assume no structure for the coefficient matrix.
Hence, the solution of the augmented system gives the desired projection.
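As a sketch of this approach, the minimum-norm solution z of A_i z = r can be read off the augmented system [[I, A_i^T], [A_i, 0]] [z; lam] = [0; r]. The 2-by-3 block row below is hypothetical, and dense Gaussian elimination stands in for the sparse direct solver used in practice.

```python
def solve(M, rhs):
    """Dense Gaussian elimination with partial pivoting (stand-in for a
    sparse direct solver)."""
    n = len(M)
    A = [row[:] + [rhs[i]] for i, row in enumerate(M)]
    for k in range(n):
        p = max(range(k, n), key=lambda i: abs(A[i][k]))
        A[k], A[p] = A[p], A[k]
        for i in range(k + 1, n):
            f = A[i][k] / A[k][k]
            for j in range(k, n + 1):
                A[i][j] -= f * A[k][j]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (A[i][n] - sum(A[i][j] * x[j] for j in range(i + 1, n))) / A[i][i]
    return x

Ai = [[1.0, 2.0, 0.0], [0.0, 1.0, 1.0]]   # hypothetical 2-by-3 block row
r = [1.0, 2.0]
m, n = len(Ai), len(Ai[0])
# build the augmented system [[I, Ai^T], [Ai, 0]]
aug = [[0.0] * (n + m) for _ in range(n + m)]
for i in range(n):
    aug[i][i] = 1.0
    for j in range(m):
        aug[i][n + j] = Ai[j][i]          # Ai^T block
        aug[n + j][i] = Ai[j][i]          # Ai block
sol = solve(aug, [0.0] * n + r)
z = sol[:n]   # z = Ai^+ r, the minimum-norm solution of Ai z = r
```

Eliminating lam confirms the construction: z = Ai^T (Ai Ai^T)^{-1} r, which is exactly the pseudoinverse applied to r.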
1.1 The Conjugate Gradient acceleration of block Cimmino method
The convergence rate of the block Cimmino algorithm is known to be slow. The Conjugate Gradient (CG) method has been proposed to accelerate the row projection method. It has also been reported that the CG accelerated Cimmino method competes favorably with classical preconditioned Generalized Minimal Residual (GMRES) and Conjugate Gradient on the Normal Equations (CGNE) methods for the solution of nonsymmetric linear systems.
We note that the matrix H in Eq. 6 is symmetric and positive definite. Hence, one can solve the following system using CG,
where the right-hand side is formed from the block projections and x is the solution vector of the system in Eq. 1. Note that since the same factor appears on both sides of the equality, it does not affect the convergence of CG. Algorithm 2 is the pseudocode for the CG accelerated block Cimmino method, which is in fact the classical CG applied to the system given in Eq. 9.
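A matrix-free sketch of this idea follows: CG only needs products with H, which can be accumulated as a sum of projections without ever forming H. The single-row blocks and helper names are illustrative assumptions, not the solver's implementation.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def cg(matvec, rhs, iters=100, tol=1e-20):
    """Classical CG on an SPD operator given only as a matvec callback."""
    n = len(rhs)
    x, r = [0.0] * n, rhs[:]
    p, rr = r[:], dot(rhs, rhs)
    for _ in range(iters):
        Ap = matvec(p)
        alpha = rr / dot(p, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * ai for ri, ai in zip(r, Ap)]
        rr_new = dot(r, r)
        if rr_new < tol:
            break
        p = [ri + (rr_new / rr) * pi for ri, pi in zip(r, p)]
        rr = rr_new
    return x

rows = [[4.0, 1.0], [1.0, 3.0]]    # single-row "blocks" of A
b = [1.0, 2.0]

def H(v):                          # H v = sum of projections of v
    out = [0.0] * len(v)
    for a in rows:
        s = dot(a, v) / dot(a, a)
        out = [o + s * ai for o, ai in zip(out, a)]
    return out

k = [0.0, 0.0]                     # right-hand side built from the blocks
for a, bi in zip(rows, b):
    s = bi / dot(a, a)
    k = [ki + s * ai for ki, ai in zip(k, a)]

x = cg(H, k)                       # solves H x = k, i.e. A x = b
```

In a distributed setting, each term of the sum inside H would be computed by a different processor and combined with an all-reduce.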
The matrix–vector multiplication in CG is expressed as the solution of independent underdetermined systems, which can be done in parallel; the partial results, however, need to be summed using an all-reduce operation to obtain the product.
1.2 The effect of partitioning
The convergence rate of the CG accelerated block Cimmino algorithm is essentially the convergence rate of CG applied to Eq. 9. A well-known upper bound on the convergence rate can be given in terms of the extreme eigenvalues (λ_min and λ_max) of the coefficient matrix. Let
be the 2-norm condition number of H. Then, an upper bound on the convergence rate of the CG accelerated block Cimmino can be given by
where x* is the exact solution. Furthermore, it has been shown that the convergence rate of CG depends not only on the extreme eigenvalues but also on the separation between those extreme eigenvalues and the interior eigenvalues, as well as the clustering of the interior eigenvalues. In summary, the convergence rate of the CG accelerated block Cimmino depends on the extreme eigenvalues and the number of clusters, as well as the quality of the clustering.
Therefore, the partitioning of the coefficient matrix into block rows is crucial for improving the convergence rate of the CG accelerated block Cimmino algorithm, since it can improve the eigenvalue distribution of H. Note that the eigenvalue spectrum of H is affected only by the block-row partitioning of the coefficient matrix and is independent of any column ordering.
Let the QR factorization of A_i^T be defined as
where the matrices Q_i and R_i have dimensions n × m_i and m_i × m_i, respectively. Then, the matrix H becomes
Since the eigenvalue spectrum of a product of two matrices is the same as the spectrum of the product taken in reverse order, H is similar to
where the singular values of the off-diagonal blocks represent the cosines of the principal angles between the subspaces spanned by the blocks. Hence, the smaller the off-diagonal entries of the matrix in Eq. 14, the more the eigenvalues of H will be clustered around one, by the Gershgorin theorem. Therefore, the convergence rate of the block Cimmino method highly depends on the orthogonality among the blocks. If the blocks are more orthogonal to each other, the row inner products between blocks will be small, and hence the eigenvalues will be clustered around one. Therefore, CG is expected to converge in fewer iterations.
In a recent study, a hypergraph partitioning method was proposed to find a “good” block-row partitioning for the CG accelerated block Cimmino algorithm. It was also shown to perform better than Cuthill–McKee based methods (with or without filtering small elements in absolute value). However, the hypergraph partitioning method only loosely considers the structural orthogonality among block rows and does not consider the numerical values at all.
In this work, we propose a novel graph theoretical block-row partitioning method for the CG accelerated block Cimmino algorithm. For this purpose, we introduce a row inner-product graph model of a given matrix, and the problem of finding a “good” block-row partitioning is then formulated as a graph partitioning problem on this model. The proposed method takes the numerical orthogonality between block rows of A into account. In the proposed method, the partitioning objective of minimizing the cutsize directly corresponds to minimizing the sum of inter-block inner products between block rows, thus leading to an improvement in the eigenvalue spectrum of H.
The validity of the proposed method is confirmed against two baseline methods by conducting experiments on a large set of matrices that arise in a variety of real-life applications. One of the two baseline methods is the state-of-the-art hypergraph partitioning method. We conduct experiments to study the effect of the partitioning on the eigenvalue spectrum of H. We also conduct experiments to compare the performance of the three methods in terms of the number of CG iterations and the parallel CG time to solution. The results of these experiments show that the proposed partitioning method is considerably better than both baseline methods in terms of all of these performance metrics. Finally, we compare the preprocessing overheads of the methods and show that the proposed method incurs much less overhead than the hypergraph partitioning method, thus allowing better amortization.
2 The proposed partitioning method
In this section, we first describe the row inner-product graph model and then show that finding a “good” block-row partitioning can be formulated as a graph partitioning problem on this graph model. Finally, we give the implementation details for the construction and partitioning of the graph model. We refer the reader to Appendix A for a short background on graph partitioning.
2.1 Row inner-product graph model
In the row inner-product graph G = (V, E) of matrix A, vertices represent the rows of A and edges represent nonzero inner products between rows. That is, V contains a vertex v_i for each row i of A, and E contains an edge (v_i, v_j) only if the inner product of rows i and j is nonzero. That is,
Each vertex v_i can be associated with either a unit weight or a weight equal to the number of nonzeros in row i, that is,
respectively. Each edge (v_i, v_j) is associated with a cost equal to the absolute value of the respective inner product. That is,
This graph is topologically equivalent to the standard graph representation of the symmetric matrix A A^T resulting from the sparse matrix–matrix multiplication operation. That is, the sparsity pattern of A A^T corresponds to the adjacency matrix representation of G. Each off-diagonal nonzero of the resulting matrix incurs an edge (v_i, v_j). Since each nonzero entry (i, j) of A A^T is computed as the inner product of row i and row j of A, the absolute value of that nonzero determines the cost of the respective edge (v_i, v_j).
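The construction can be sketched directly from the definition. The function below is a naive dense-pairwise version for illustration (the paper uses an SpGEMM kernel on the sparse matrix instead); the function name and the `eps` filter are assumptions.

```python
def row_inner_product_graph(A, eps=0.0):
    """Edges of the row inner-product graph: an edge (i, j) with cost
    |<row_i, row_j>| for every pair of rows whose inner product is nonzero."""
    m = len(A)
    edges = {}
    for i in range(m):
        for j in range(i + 1, m):
            ip = sum(x * y for x, y in zip(A[i], A[j]))
            if abs(ip) > eps:
                edges[(i, j)] = abs(ip)
    return edges

A = [[1.0, 0.0, 2.0],
     [0.0, 3.0, 0.0],
     [1.0, 1.0, 0.0]]
edges = row_inner_product_graph(A)
# rows 0 and 1 share no nonzero column, so there is no edge (0, 1)
```

The edge set mirrors the off-diagonal sparsity of A A^T, and the costs mirror the absolute values of its entries, exactly as described above.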
Figure (a) shows a sample sparse matrix. Figure (b) depicts the proposed row inner-product graph for this sample matrix. As seen in Fig. (b), G contains a vertex for each row and an edge for each nonzero row inner product, with the absolute value of the inner product as the edge cost. In Fig. (b), edges with large costs are shown with thick lines in order to make such high inner-product values more visible.
Figure (a) shows the resulting matrix of the sparse matrix–matrix multiplication A A^T. Only the off-diagonal nonzero entries together with their values are shown, since the values of the diagonal entries do not affect G. The cells that contain large nonzeros are shown with a black background in order to make such high values more visible. A comparison of Figs. (a) and (b) shows that the topology of the standard graph model of the A A^T matrix is equivalent to G. As also seen in Fig. (a), the values of the nonzero entries of A A^T are equal to the costs of the respective edges of G.
2.2 Block-row partitioning via partitioning G
A K-way partition of G can be decoded as a partial permutation on the rows of A to induce a permuted matrix, where
Here, P denotes the row permutation matrix defined by the K-way partition as follows: the rows associated with the vertices in part V_k are ordered after the rows associated with the vertices in part V_{k−1}, for 1 < k ≤ K. That is, block row A_k contains the set of rows corresponding to the set of vertices in part V_k of the partition, where the ordering of the rows within block row A_k is arbitrary for each k. Note that we use the notation A_k to denote both block row k and the set of rows in it. Since a column permutation does not affect the convergence of the block Cimmino algorithm, the original column ordering of A is maintained.
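Decoding a partition into a row order can be sketched in one line; the rows of part 0 come first, then part 1, and so on, keeping the original relative order within each part (which, as noted above, is arbitrary). The function name and example partition are illustrative.

```python
def block_row_order(part, K):
    """Row order induced by a K-way partition: rows of part k come before
    rows of part k+1; original relative order is kept within each part."""
    return [i for k in range(K) for i, p in enumerate(part) if p == k]

part = [1, 0, 1, 0]                 # hypothetical 2-way partition of 4 rows
order = block_row_order(part, 2)    # rows of part 0 first, then part 1
```

Applying this order to the rows of A (and leaving the columns untouched) yields the permuted matrix described above.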
Consider a K-way partition of G. The weight of a part is equal to either the number of rows or the number of nonzeros in the corresponding block row, depending on the vertex weighting scheme used according to Eq. 16. That is,
Therefore, in partitioning G, the partitioning constraint of maintaining balance among part weights according to Eq. 28 corresponds to maintaining balance on either the number of rows or the number of nonzeros among the block rows.
Consider a K-way partition of G. A cut edge between two parts represents a nonzero inter-block inner product between the corresponding block rows. Therefore, the cutsize of the partition (given in Eq. 29) is equal to
which corresponds to the total sum of the inter-block inner products. So, in partitioning G, the partitioning objective of minimizing the cutsize corresponds to minimizing the sum of inter-block inner products between block rows. Therefore, this partitioning objective corresponds to making the block rows numerically more orthogonal to each other. In this way, we expect this method to yield faster convergence in the CG accelerated block Cimmino algorithm.
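The correspondence is easy to verify on a toy graph. A minimal sketch, with an assumed edge dictionary format matching the construction sketched earlier:

```python
def cutsize(edges, part):
    """Sum of edge costs over cut edges; for the row inner-product graph this
    equals the total inter-block row inner product of the induced partition."""
    return sum(c for (i, j), c in edges.items() if part[i] != part[j])

# hypothetical 4-vertex graph: edge costs are absolute row inner products
edges = {(0, 1): 2.0, (1, 2): 0.5, (2, 3): 3.0}
cut = cutsize(edges, [0, 0, 1, 1])   # only edge (1, 2) crosses the cut
```

Here the partition keeps the two heavy edges internal, so the cutsize, and hence the total inter-block inner product, is only 0.5.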
We introduce Figs. 9 and 6 in order to clarify the proposed graph partitioning method for block-row partitioning. Figure (a) shows a straightforward 3-way block-row partition of the sample matrix given earlier, where the first four, the second four and the third four consecutive rows in the original order constitute the three block rows. Figure (b) shows the 3-way vertex partition of G that corresponds to this straightforward 3-way block-row partition. Figure (a) shows a “good” 3-way vertex partition of G obtained by using the graph partitioning tool METIS. Figure (b) shows the permuted matrix and block-row partition induced by this 3-way vertex partition.
As seen in Figs. 9 and 6, both the straightforward and the “good” block-row partitions achieve perfect balance on the row counts of the blocks by having exactly four rows per block. The quality difference between the straightforward and the “good” block-row partitions can easily be seen by comparing the 3-way partitions of G in Figs. (a) and (b), respectively. As seen in Fig. (b), eight out of nine thick edges remain on the cut of the straightforward partition, whereas all nine thick edges remain internal in the “good” partition, as seen in Fig. (a).
Figure 12 shows the block-checkerboard partitioning of the resulting A A^T matrix induced by the straightforward and the “good” block-row partitioning of the sample matrix, in Figs. (b) and (a), respectively. Note that both the rows and columns of the A A^T matrix are partitioned conformably with the row partitions of the A matrix. The comparison of Figs. (b) and (a) shows that large nonzeros (dark cells) are scattered across the off-diagonal blocks of the A A^T matrix for the straightforward partitioning, whereas large nonzeros (dark cells) are clustered in the diagonal blocks of A A^T for the “good” partitioning.
We introduce Fig. 15 to compare the quality of the straightforward and the “good” block-row partitions in terms of inter-block row inner-product sums. In the figure, each off-diagonal entry ip(k, l) of the IP matrix shows the sum of the inter-block row inner products between the respective block rows k and l. That is,
As seen in Fig. 15, the inter-block sums are large for the straightforward partition, whereas they are small for the “good” partition. Note that ip(k, l) is also equal to the sum of the absolute values of the nonzeros of the off-diagonal block at row block k and column block l of the A A^T matrix, i.e.,
Therefore, the total sum of inter-block inner products is
for the straightforward partition, whereas for the “good” partition it is
Figures (b) and (a), respectively, show the eigenvalue spectrum of H for the straightforward and the “good” partitionings. As seen in the figures, for the straightforward partitioning the eigenvalues are spread over a wide interval, whereas for the “good” partitioning they reside in a much narrower interval. As seen in Fig. (b), after using the proposed partitioning the eigenvalues are much better clustered around one, and the smallest eigenvalue is much larger than that of the straightforward partitioning method.
Implementation of the proposed partitioning method consists of two stages which are constructing and partitioning .
Constructing G: For constructing G, we choose to use a basic sparse matrix–matrix multiplication (SpGEMM) kernel due to existing efficient implementations. The edges of G are obtained from the off-diagonal nonzeros of the resulting matrix, whereas their costs are obtained from the absolute values of those nonzeros.
Note that when the matrix A has dense column(s), the corresponding A A^T matrix will be quite dense; a column with d nonzeros produces on the order of d × d nonzeros in A A^T. For example, Fig. (a) shows a sparse matrix that has a dense column. As seen in Fig. (b), the corresponding A A^T matrix is dense. Clearly, a large number of nonzeros in A A^T (i.e., a large number of edges in G) increases the memory requirement and computation cost of SpGEMM, as well as the time requirement for partitioning G.
In order to alleviate the aforementioned problem, we propose the following methodology for sparsening G. We identify a column of the A matrix as a dense column if it contains more than a threshold number of nonzeros. We then extract a sparse matrix from A by keeping only a fixed number of the largest (in absolute value) nonzeros of each dense column; the remaining smallest entries of a dense column are ignored during the construction of the corresponding column of the extracted matrix. Hence, the SpGEMM operation is performed on the extracted matrix to obtain a sparsened resulting matrix, which leads to a sparsened graph. For example, Fig. (c) shows the sparsity pattern of a sparse matrix extracted from the one in Fig. (a) by keeping the largest nonzeros in its dense column. As seen in Fig. (d), the resulting matrix is very sparse with respect to that in Fig. (b). Note that both extracted matrices are used only for constructing G; after the partitioning stage, both are discarded. In the rest of the paper, the extracted matrix will simply be referred to as A for the sake of the simplicity of presentation.
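A minimal sketch of the sparsening step, on a dense-list matrix for clarity. The parameters d (density threshold) and p (number of entries kept) are tuning knobs; the actual threshold values used in the paper are not reproduced here.

```python
def sparsen_columns(A, d, p):
    """Zero out all but the p largest-magnitude entries of every column
    that has more than d nonzeros (a 'dense' column)."""
    m, n = len(A), len(A[0])
    B = [row[:] for row in A]
    for j in range(n):
        # nonzeros of column j, largest magnitude first
        nz = sorted(((abs(A[i][j]), i) for i in range(m) if A[i][j] != 0.0),
                    reverse=True)
        if len(nz) > d:
            for _, i in nz[p:]:     # drop everything beyond the p largest
                B[i][j] = 0.0
    return B

A = [[5.0], [1.0], [3.0], [2.0]]    # one column with 4 nonzeros
B = sparsen_columns(A, d=3, p=2)    # keep only the 2 largest entries
```

Since only the construction of G uses the sparsened matrix, dropping small entries trades a slight loss of edge-cost accuracy for a much cheaper SpGEMM and partitioning step.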
Partitioning G: We use the multilevel graph partitioning tool METIS for partitioning G. Since METIS uses positive integer weights for edges, the edge costs computed as floating-point values must be cast to integers. In order to distribute the integer weights as evenly as possible between the small and large edge costs in the graph, we scale the costs to a range from 1 up to a maximum integer value as follows, in order to compute the integer edge weights required by METIS:
Here, cost(v_i, v_j) is the edge cost computed according to Eq. 17, mincost and maxcost respectively denote the minimum and maximum edge costs, and w(v_i, v_j) is the weight of the respective edge provided to METIS.
The scaling given in Eq. 25 may yield non-uniformly distributed integer edge weights in some instances. That is, the integer edge weights may become clustered even though the edge costs are not clustered in the original distribution. Ideally, edge weights should be real numbers; however, in order to use the limited range of integer weights more effectively, one needs more uniformly distributed edge weights. Therefore, we prescale the edge costs of G by taking their square roots before assigning the integer weights using Eq. 25.
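The two-step mapping (square-root prescale, then affine scale into the integer range) can be sketched as follows. The maximum weight of 1000 is an illustrative choice, not the value used in the paper.

```python
import math

def integer_edge_weights(costs, wmax=1000):
    """Map floating-point edge costs to integers in [1, wmax] for METIS.
    Costs are prescaled by a square root to spread out clustered values."""
    vals = [math.sqrt(c) for c in costs]
    lo, hi = min(vals), max(vals)
    if hi == lo:                      # all costs equal: unit weights
        return [1] * len(vals)
    return [1 + round((v - lo) / (hi - lo) * (wmax - 1)) for v in vals]

w = integer_edge_weights([1.0, 4.0, 9.0])
# after the sqrt prescale the costs 1, 4, 9 become 1, 2, 3,
# which map to evenly spaced integer weights
```

Without the square-root step, a few very large inner products would compress all small costs into the bottom of the integer range, making them indistinguishable to the partitioner.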
3 Experimental Results
3.1 Experimental Framework
In the experiments, we used the CG accelerated block Cimmino implementation available in ABCD Solver v1.0-beta. In ABCD Solver, the MUMPS sparse direct solver is used to factorize the system once; the factors are then reused during the iterative solution. We used the same matrix scaling method, setting the NB1, NB2 and NB3 parameters to 10, 20 and 20, respectively, as suggested by one of the previous versions of ABCD Solver (version of 2014-09-15) for the CG accelerated block Cimmino.
ABCD Solver includes a stabilized block-CG accelerated block Cimmino algorithm, especially for solving systems with multiple right-hand side vectors. Since the classical CG is guaranteed to converge for systems whose coefficient matrix is symmetric positive definite, and since its convergence theory is well established, in this work we implement the classical CG accelerated block Cimmino of Algorithm 2 for solving sparse linear systems with a single right-hand side vector, rather than the block-CG acceleration. In all experiments with the CG accelerated block Cimmino, we use the normwise backward error at each iteration,
as the stopping criterion, and we impose a limit on the maximum number of iterations.
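One common form of this criterion, sketched on dense lists for illustration (the norms and the exact formula used by the solver may differ in detail):

```python
def backward_error(A, x, b):
    """Normwise backward error ||b - A x|| / (||A|| ||x|| + ||b||),
    here evaluated in the infinity norm."""
    resid = [bi - sum(a * xi for a, xi in zip(row, x))
             for row, bi in zip(A, b)]
    norm_A = max(sum(abs(a) for a in row) for row in A)   # inf-norm of A
    norm_x = max(abs(v) for v in x)
    norm_b = max(abs(v) for v in b)
    return max(abs(r) for r in resid) / (norm_A * norm_x + norm_b)

A = [[2.0, 0.0], [0.0, 4.0]]
b = [2.0, 4.0]
err_exact = backward_error(A, [1.0, 1.0], b)   # exact solution: zero error
err_pert = backward_error(A, [1.1, 1.0], b)    # perturbed solution
```

The iteration stops once this quantity drops below the requested tolerance, which makes the criterion insensitive to the overall scaling of A, x and b.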
In the experiments, we compared the performance of the proposed partitioning method (GP) against two baseline partitioning methods already available in ABCD Solver. The first baseline method, referred to as the uniform partitioning (UP) method in ABCD Solver, partitions the rows of the coefficient matrix into a given number of block rows with almost equal numbers of rows, without any permutation on the rows of A. Note that the UP method is the same as the straightforward partitioning method mentioned in Section 2.
The second baseline method is the hypergraph partitioning (HP) method. This method uses the column-net model of the sparse matrix A, in which rows and columns are respectively represented by vertices and hyperedges, both with unit weights. Each hyperedge connects the set of vertices corresponding to the rows that have a nonzero in the respective column. In the HP method, a K-way partition of the vertices of the column-net model is used to find the block rows. The partitioning constraint of maintaining balance on part weights corresponds to finding block rows with equal numbers of rows. The partitioning objective of minimizing the cutsize according to the cut-net metric corresponds to minimizing the number of linking columns among block rows. Here, a linking column refers to a column that contains nonzeros in more than one block row.
The HP method in ABCD Solver uses the multilevel hypergraph partitioning tool PaToH for partitioning the column-net model of A. Here, we use the same parameters for PaToH as specified in ABCD Solver; that is, the final and initial imbalance parameters are set as specified there, and the other parameters are left at their PaToH defaults, as in ABCD Solver.
In the proposed GP method, as mentioned in Section 2.3, the multilevel graph partitioning tool METIS is used to partition the row inner-product graph model of A. The imbalance parameter of METIS is set to 10% and the k-way option is used. For the sake of a fair comparison between the HP and GP methods, unit vertex weights are used in G. The other parameters are left at their METIS defaults. Since both PaToH and METIS use randomized algorithms, we report the geometric mean of multiple runs with different seeds for each instance.
Here and hereafter, we use GP, HP and UP to refer to the respective block-row partitioning method as well as ABCD Solver that utilizes the regular block Cimmino algorithm for solving the systems partitioned by the respective method.
The extensive numerical experiments were conducted on a shared memory system: a four-socket 64-core computer that contains four AMD Opteron 6376 processors, where each processor has 16 cores running at 2.3 GHz, with a total of 64 GB of DDR3 memory. Due to memory bandwidth limitations of the platform, the experiments are performed with 32 cores.
3.2 Effect of block-row partitioning on the eigenvalue spectrum of H
The convergence rate of the CG accelerated block Cimmino algorithm is related to the eigenvalue spectrum of H. By the nature of the block Cimmino algorithm, most of the eigenvalues of H are clustered around 1, but there can be some eigenvalues at the extremes of the spectrum.
In this subsection, we conduct experiments to study the effect of the partitioning on the eigenvalue spectrum of H by comparing the proposed GP method against UP and HP. In these experiments, in order to be able to compute the eigenvalue spectrum within a reasonable amount of time and memory, we use four small nonsymmetric sparse matrices from the SuiteSparse Matrix Collection. The first and second matrices arise in computational fluid dynamics problems, whereas the third and fourth arise in power network problems. We partition the matrices into the same number of block rows for all three partitioning methods UP, HP and GP.
Figure 28 shows the eigenvalue spectrum of H obtained by the UP, HP and GP methods for each test instance. The figure also reports the number of CG iterations (iters) required for convergence, as well as the smallest and largest eigenvalues of H. As seen in the figure, both the HP and GP methods achieve significantly better clustering of the eigenvalues around 1 compared to UP. This experimental finding is reflected in a remarkable decrease in the number of CG iterations attained by both HP and GP over UP.
In the comparison of HP and GP, GP achieves better eigenvalue clustering, and hence a better convergence rate, than HP for all instances. For all four instances, the better clustering quality attained by GP over HP leads to a significant improvement in the convergence rate.
3.3 Dataset for Performance Analysis
For the following experiments, we selected all nonsingular nonsymmetric square matrices within a given size range from the SuiteSparse Matrix Collection. Only the HV15R and cage14 matrices are excluded, due to memory limitations. We observed that at least one of the three partitioning methods was able to converge, within the CG iteration limit, for 80 of these instances.
Table 1 shows the properties of the matrices that are used in the experiments. In the table, the matrices are listed in decreasing order of their numbers of nonzeros.
The right-hand side vectors of the systems are obtained by multiplying the coefficient matrices with a vector whose elements are uniform random numbers in the interval as in ABCD Solver. In all instances, the CG iterations are started from the zero vector.
We partition all test matrices into the same number of block rows for all three partitioning methods UP, HP and GP. Since all parallel runs are performed on 32 cores, the number of row blocks is greater than the number of cores. Consecutive row blocks are uniformly assigned to the cores so that each core is assigned an equal number of row blocks.
3.4 Convergence and parallel performance
In this subsection, we study the performance of the proposed GP method against UP and HP in terms of the number of CG iterations and parallel CG time to solution.
Table 2 shows the number of CG iterations and the parallel CG time for each matrix. In the table, one marker denotes that an algorithm fails to reach the desired backward error within the iteration limit for the respective matrix instance, and another denotes that an algorithm fails during the factorization stage due to a singular block row; the latter occurs for only one matrix, with UP. As seen in Table 2, UP and HP fail to converge in several test instances, whereas GP fails in only one test instance.
In Table 2, the best result for each test instance is shown in bold. As seen in the table, out of 80 instances, the proposed GP method achieves the fastest convergence in 54 instances, whereas HP and UP achieve the fastest convergence in only 17 and 10 instances, respectively. Furthermore, as seen in Table 2, GP achieves the fastest iterative solution time in 51 instances, whereas HP and UP achieve the fastest solution time in 18 and 11 instances, respectively.
|Matrix name||# of CG iterations (UP / HP / GP)||Parallel CG time to soln. (UP / HP / GP)|
|Number of bests||10 / 17 / 54||11 / 18 / 51|
Figure 31 shows performance profiles over the matrix instances, which compare multiple methods over a set of test instances. A performance profile compares multiple methods with respect to the best performing method for each test instance and reports the fraction of test instances in which a method's performance is within a given factor of that of the best method. For example, in Fig. (a), a point with abscissa 2 on the performance curve of a method indicates the fraction of instances in which that method performs no worse than the best method by a factor of 2. If a method is closer to the top-left corner, it achieves better performance.
In Figs. (b) and (a), we show the performance profiles in terms of the number of CG iterations and parallel CG times, respectively. As seen in Fig. (a), the number of CG iterations required by GP for convergence does not exceed that of the best method by a factor of 2 in the vast majority of the instances, whereas HP and UP achieve the same relative performance with respect to the best method in considerably smaller fractions of the instances. As seen in Fig. (b), the same holds for the CG time: GP is not slower than the best method by a factor of 2 in most instances, whereas HP and UP reach that relative performance much less often. Figures (b) and (a) show that the CG time is directly proportional to the number of CG iterations, as expected.
3.5 Preprocessing Overhead and Amortization
In this subsection, we analyze the relative preprocessing overheads of the three methods UP, HP and GP in order to find out whether the intelligent partitioning methods HP and GP are amortized. For all three block-row partitioning methods, the preprocessing overhead includes matrix scaling, block-row partitioning, creating the submatrices corresponding to the block rows, and distributing these submatrices among the processors. Recall that the partitioning for UP is straightforward and hence incurs only a negligible additional cost in the preprocessing overhead. On the other hand, the intelligent partitioning algorithms utilized in HP and GP add a considerable cost to the overall preprocessing overhead. Furthermore, for GP, the construction of the row inner-product graph also incurs a significant cost.
In Table 3, we display the total preprocessing time and the total execution time of all three methods for each of the 80 test instances. In the table, the total execution time is the sum of the total preprocessing time and the total solution time. Here, the total solution time includes the parallel factorization and parallel CG solution times. Note that the factorization in the block Cimmino algorithm in ABCD Solver is performed by the parallel sparse direct solver MUMPS. The factorization, which needs to be performed only once, involves the symbolic and numerical factorizations of the coefficient matrices of the augmented systems that arise in the block Cimmino algorithm. This factorization process is embarrassingly parallel, since the factorizations of the coefficient matrices of the augmented systems are done independently. Recall that MUMPS is also used during the iterative solution stage, where at each iteration a linear system is solved using the factors of the augmented systems computed during the factorization stage.
As seen in Table 3, in terms of preprocessing time, UP is the clear winner in all instances, as expected, whereas GP incurs much less preprocessing time than HP in all except two instances. As also seen in Table 3, comparing all three methods, GP and UP respectively achieve the smallest total execution time in 50 and 23 instances, whereas HP achieves the smallest total execution time in only 7 instances.
|Matrix name||Preprocessing time||Total execution time|
(Table 3 also reports the geometric means and the numbers of bests for each method.)
In the relative comparison of GP and UP, GP incurs only modestly more preprocessing time than UP on average, which leads GP to achieve less total execution time than UP in 56 out of 80 instances. In other words, compared to the UP method, the sequential implementation of GP amortizes its preprocessing cost for those 56 instances by sufficiently reducing the number of CG iterations. That is, GP amortizes its preprocessing cost for the solution of those 56 instances even if we solve only one linear system with a single right-hand side vector. Note that in many applications in which sparse linear systems arise, the solution of consecutive linear systems is required, where the coefficient matrix remains the same but the right-hand side vector changes. For such applications, the amortization performance of the proposed GP method will further improve. For example, the amortization performance of GP improves from 56 to 60 and 64 instances for solving two and five consecutive linear systems, respectively.
In Figs. (b) and (a), we show the performance profiles in terms of the preprocessing time and the total execution time, respectively. As seen in Fig. (a), the preprocessing overhead incurred by GP remains no worse than that of the best method, UP, by a factor of 4 in most of the instances. As seen in Fig. (b), the total execution time attained by GP does not exceed that of the best method by a factor of 2 in the vast majority of the instances, whereas HP and UP achieve the same relative performance with respect to the best method in considerably smaller fractions of the instances.
4 Conclusion and future work
In this paper, we proposed a novel partitioning method in order to improve the CG accelerated block Cimmino algorithm. The proposed partitioning method takes the numerical orthogonality between the block rows of the coefficient matrix into account. Experiments on a large set of real-world systems show that the proposed method improves the convergence rate of the CG accelerated block Cimmino compared to the state-of-the-art hypergraph partitioning method. Moreover, it requires not only less preprocessing time and fewer CG iterations, but also much less total execution time than the hypergraph partitioning method.
As future work, we consider two issues: further reducing the number of iterations through preconditioning, and reducing the preprocessing overhead through parallelization.
Even though the matrix H is not available explicitly, it may still be possible to obtain a preconditioner that further reduces the required number of CG iterations. One viable option could be a sparse approximate inverse type preconditioner, for which the coefficient matrix does not need to be available explicitly. This approach could be viable especially when consecutive linear systems need to be solved with the same coefficient matrix.
The proposed method involves two computational stages, namely constructing the row inner-product graph via an SpGEMM operation, and partitioning this graph. For parallelizing the first stage, a parallel SpGEMM [1, 2, 9] operation could be used to construct local subgraphs on each processor. For parallelizing the second stage, the parallel graph partitioning tool ParMETIS could be used. On each processor, the local subgraphs generated in parallel in the first stage could be used as input for ParMETIS.
Appendix A Graph and Graph Partitioning
An undirected graph G = (V, E) consists of a set of vertices V and a set of edges E. Each edge (v_i, v_j) connects a pair of distinct vertices v_i and v_j. Each vertex v_i can be assigned a weight, denoted w(v_i), and each edge (v_i, v_j) can be assigned a cost, denoted cost(v_i, v_j).
Π = {V_1, V_2, …, V_K} is defined as a K-way vertex partition of G if the parts are mutually disjoint and exhaustive. An edge (v_i, v_j) is said to be cut if the vertices v_i and v_j belong to different parts, and uncut otherwise. The set of cut edges of a partition is denoted E_cut. In a given partition of G, the weight of a part V_k is defined as the sum of the weights of the vertices in V_k, i.e.,