The maximum cardinality matching (MCM) problem is a classical topic in combinatorial optimization. Given a graph, it asks for a set of non-incident edges of maximum size. For the bipartite version of the problem, efficient sequential algorithms such asHopcroft-Karp  have been known for a long time. Practical algorithms for bipartite MCM have recently been studied intensively [10, 19, 5], leading to the development of scalable distributed-memory algorithms [20, 4].
Finding a maximum-cardinality matching on a bipartite graph that also has maximum weight (also called the assignment problem in the literature) is a harder problem, both w.r.t. complexity and in practice. This is because in the transversal problem (i.e. maximum cardinality matching), the edges are interchangeable, but in the assignment problem, they are not. As a result, only cardinality matchings allow concurrent vertex-disjoint searches in the graph, which makes the assignment problem harder to parallelize. Recent published attempts at parallelizing the assignment problem, e.g.  rely on the auction paradigm . While this approach has demonstrated some speedups, its scalability is limited and it is inefficient in maximizing the cardinality in distributed memory .
In this paper, we follow a different approach. Instead of relaxing both the maximum cardinality and maximum weight requirements at the same time, we only relax the maximum weight requirement and use an algorithm that always returns maximum cardinality. This means that we solve the transversal problems optimally and the assignment problem approximately. We only consider graphs that have perfect matchings. Hence, we approximate weights of a maximum-weight perfect matching (MWPM) and thus solve the approximate-weight perfect matching (AWPM) problem on distributed memory machines.
The motivation for this problem comes from sparse direct solvers. Often, sparse linear systems are too large to be solved in a single node, necessitating distributed-memory solvers such as SuperLU_DIST . Partial pivoting, which is stable in practice, requires dynamic row exchanges that are too expensive to perform in the distributed case. Consequently, distributed-memory solvers often resort to static pivoting where the input is pre-permuted to have a “heavy” diagonal so that the factorization can proceed without further pivoting. The definition of “heavy” at the minimum implies having only nonzeros on the diagonal. Whether maximizing the product or the sum of absolute values of the diagonal is the right choice for the objective function is debatable, but both can be solved by finding a perfect bipartite matching of maximum weight. In this formulation, rows and columns of the sparse matrix serve as vertices on each side of the bipartite graph, and nonzeros as edges between them.
Since the input matrix is already distributed as part of the library requirements, it is necessary to use distributed-memory parallel matching algorithms. However, the lack of scalable matching algorithms forces distributed-memory solvers to assemble the entire instance on a single node and then use a sequential matching library, such as the highly-optimized implementations of MWPM algorithms available in MC64 . For instance, the new algorithm in SuperLU_DIST demonstrated strong scaling to cores , but still used the sequential static pivoting step. Such reliance on a sequential library is disruptive to the computation, infeasible for larger instances, and certainly not scalable.
We use the exact distributed memory parallel cardinality matching algorithm from , and combine it with a distributed memory parallel approximation algorithm for the weight. Inspired by a sequential algorithm by Pettie and Sanders , the approximation relies on finding short augmentations. In our case, these augmentations are cycles of length 4 since we maintain a perfect matching.
In this manner, we get a scalable algorithm for the overall problem that can be used in the initialization of sparse direct solvers, although it is not restricted to that application.
The main contributions of this paper are as follows:
Algorithm: We present a highly-parallel algorithm for approximate weight perfect bipartite matching problem.
Usefulness: The presented algorithm returns perfect matchings that are often within of the optimum solution.
Performance: We provide a hybrid OpenMP-MPI implementation that runs significantly faster than a sequential implementation of the optimum algorithm (up to faster on 256 nodes of NERSC/Cori). On 256 nodes of the same system, the parallel implementation attains up to speedup relative to its runtime on a single node.
Impact: The presented algorithm can be used to find good pivots in distributed sparse direct solvers such as SuperLU_DIST, eliminating a longstanding performance bottleneck.
2 Related Work
The bipartite maximum cardinality matching problem has been studied for more than a century, and many different algorithms for solving it have been published over the years [15, 2, 27, 12]. Experimental studies [10, 16]
established that when using heuristic initialization, optimized variants of two different approaches, the Pothen-Fan algorithm  and the push-relabel algorithm  provide superior practical performance in the sequential case. Both algorithms have efficient shared memory counterparts [6, 19] which show good scaling on a single compute node. For distributed memory systems however, the problem has proven to be extremely challenging. Due to the suspected inherent sequentiality of the problem, i.e. no theoretically efficient parallel algorithm is known, such parallel algorithms tend to require a large number of consecutive communication rounds. More recently, a push-relabel variant that exploits the fact that local matching can be performed at a much faster rate than nonlocal operations was presented . A different approach formulated the problem in terms of sparse matrix operations . An implementation of the resulting algorithm scaled up to cores. Our work uses this algorithm as a subroutine.
For the weighted case, parallel approximation algorithms have been shown to scale very well , even in distributed memory . Furthermore, these algorithms also work for nonbipartite graphs. On the other hand, exact algorithms such as successive shortest paths have proven difficult to parallelize, even for shared memory. Currently, auction algorithms [30, 7], which essentially constitute a weighted version of the push-relabel algorithm, are a promising direction and can efficiently find matchings of near-maximum weight, but they tend to be very inefficient at finding perfect cardinality matchings in distributed memory . In shared memory however, they can be competitive . For that case, Hogg and Scott showed that the auction algorithm provides matrix orderings for direct linear solvers of a quality similar to the exact method  .
The aim of our work is similar to Hogg and Scott’s. However, since we target distributed memory machines, we need to develop a different algorithm. Pettie and Sanders described and analyzed sequential linear time approximation algorithms for the general weighted matching problem. Our approximation idea of using cycles of length 4 is inspired by this work.
3 Notation and Background
For any matrix of size there is a weighted bipartite graph , whose vertex set consists of row vertices in and column vertices in . For each nonzero entry of , contains an undirected edge that is incident to row vertex and column vertex , and has a weight .
Given a bipartite graph , a matching is a subset of such that at most one edge in is incident on any vertex. Given a matching in , an edge is matched if it belongs to , and unmatched otherwise. Similarly, a vertex is matched if it is an endpoint of a matched edge. If an edge is matched, we define and and call the vertices mates of each other. A matching is maximal if there is no other matching that properly contains , and is maximum if for every matching in . Furthermore, if , is called a perfect matching. When has full structural rank, the corresponding bipartite graph has a perfect matching. Since this paper focuses on perfect matching algorithms, we assume . We denote by and by . Now, the perfect matching problem consists of either finding a matching that is perfect, or deciding that no such matching exists in the input graph.
The weight of a matching is simply the sum of the weights of its edges. The maximum weight matching problem asks for a matching of maximum weight regardless of cardinality, while the maximum weight perfect matching (MWPM) problem asks for a perfect matching that has maximum weight among all perfect matchings. If we multiply all weights by , the MWPM problem becomes equivalent to the minimum weight perfect matching problem .
All the above problems can be further subdivided into the bipartite and general case, with the latter often requiring significantly more sophisticated algorithms. In the following, we will restrict our discussions to the bipartite case.
An -alternating cycle in is a cycle whose edges are alternately matched and unmatched in matching . By exchanging the matched and unmatched edges of an -alternating cycle, we get a new matching of the same cardinality, but potentially different weight. If the new weight is higher, the alternating cycle is also an -augmenting cycle. The difference in weight is called the gain of the cycle. Let denote the gain of the cycle formed by the vertices and and their mates and . A cycle that contains vertices and edges is called -cycle.
Since we study a maximization problem, for any , we say that a perfect matching in is -optimum if , where is a perfect matching of maximum weight in . We call an algorithm an -approximation algorithm if it always returns -optimum solutions.
4 The Pettie-Sanders Algorithm for Weighted Matching
A sequential linear time approximation algorithm for the weighted matching problem on general graphs was presented by Pettie and Sanders . Because we restrict ourselves to dealing with perfect matchings in bipartite graphs, the algorithm can be simplified significantly. This simplified version, called PSS, is based on three fundamental statements:
A perfect matching is -optimal if it contains no augmenting -cycles.
For any perfect matching , there exists a set of vertex-disjoint augmenting -cycles such that:
An augmenting -cycles can be found in time, where is the maximum degree among vertices in the cycle.
Based on that, a lower bound on the probability of obtaining cycles fromby selecting random augmenting -cycles can be derived, which in turn implies a randomized approximation guarantee, since the average gain of an augmenting -cycle from is:
Thus, the randomized PSS algorithm repeatedly selects and applies random augmenting -cycles to the matching. It thereby computes a optimum matching in expected time . We implement a practical variant of this algorithm that simply loops over the vertices to find and use augmenting -cycles and stops if no such cycles can be found, and use it for comparison with the parallel algorithm in Section 6.
In the same publication, Pettie and Sanders also present a derandomized version of their algorithm with a similar approximation guarantee. The simplified version is shown in Algorithm 1. Our parallel algorithm is inspired by this derandomized algorithm.
To find a -cycle rooted at a vertex , select a neighbor , then follow the matching edges incident to and . Since is a perfect matching, these edges always exists. Next, scan the neighborhood for . If the edge exists, we have found an alternating -cycle of gain .
The main problem when parallelizing the algorithm stems from Line . Here, greedy() is a set of vertex disjoint cycles selected from via a greedy algorithm. Since doing so in parallel would be costly, we deviate slightly from this idea and perform only local comparisons to find a heavy set of disjoint augmenting -cycles. This means that our algorithm does not have the same theoretical properties as Pettie and Sander’s algorithms, and it can terminate without returning a optimum matching, although this can only happen on very specific instances. While this could easily be overcome by applying random augmentations, doing so is not desirable from a practical point of view.
5 The Parallel Approximation Algorithm
5.1 Overview of the parallel AWPM algorithm
In our approximate weight perfect matching, or AWPM algorithm, we combine a parallel maximum cardinality matching (MCM) algorithm with a distributed memory approximation algorithm for the weight. The MCM algorithm is initialized using a maximal cardinality matching algorithm that returns a matching with cardinality at least half of the maximum. While this step is optional, doing so greatly improves the parallel performance. The maximal and maximum matching algorithms together provide a perfect matching whose weight is iteratively improved by discovering augmenting cycles following the idea of the Pettie-Sanders algorithm discussed above.
. Both of these algorithms are implemented using a handful of bulk-synchronous matrix and vector operations and are freely available as part of the CombBLAS library. Among several variants of the maximal matching algorithms, we used a simple greedy algorithm to initialize MCM. For this work, we modified the original cardinality matching algorithms in such a way that when selecting potential matching edges, we break ties by giving precedence to edges with higher weight. This simple heuristic often results in the perfect matchings having high weight without incurring any additional cost. Once a perfect matching is obtained, we improve its weight using our parallel approximation algorithm.
5.2 The approximate weight augmenting cycles algorithm
As mentioned in the last section, the algorithm for maximum weight approximation aims to find augmenting 4-cycles. Unlike longer cycles, they can be found in time proportional to the degree of their vertices, and doing so requires only a constant number of communication rounds in the distributed memory parallel setting. Thus, our algorithm can be considered a special case of the general augmenting cycles algorithm. We will refer to it as AWAC, i.e. approximate weight augmenting cycles, or augmenting cycles algorithm in short.
The vertices of a 4-cycle might be distributed over 4 processes, requiring 4 major communication steps, which will be described in detail below. We use a regular 2D partitioning of the input matrix. Thus, processes form a grid. They are denoted by in the grid. Let be the submatrix assigned to a process. Figure 1 illustrates the layout, and explains the labeling of the processes. The algorithm starts by distributing the matching information across rows and columns, and then loops over the four steps times, or until no more augmenting 4-cycle can be found, as shown in Algorithm 2.
Like the deterministic sequential Algorithm 1, we construct a set of vertex-disjoint augmentations. In effect, we parallelize the FOR ALL statement in Line 4 of Algorithm 1. However, we do not use the same greedy strategy to pick the augmenting 4-cycles in Line 7, since the standard greedy algorithm is inherently sequential. Instead, we perform limited local weight comparisons, which are described below.
After initialization, the first step will be to select vertices at which augmenting cycles are rooted. While all vertices are eligible, we can reduce this number since each cycle can be rooted at any of its 4 vertices. Because the matrix is stored in Compressed Sparse Column (CSC) format, we only start cycles from column vertices, reducing the number of root vertices by half. This number can be halved again by rooting a given cycle only at the column vertex with the lower index.
Now, for a potential cycle rooted at a column vertex and containing a row vertex , we generate a request to the owner of , which includes the edge weight , as shown in Algorithm 3. For performance reasons, the exchange of all such requests is bundeled into an All-to-All collective operation.
In the next step, we need to determine if and thus the alternating cycle of exists and is augmenting, i.e. has . Note that the weight of the matching edges is stored on all processes in the same grid row/column, and can thus be accessed directly, as shown in Algorithm 4. However, before we can augment the matching along this cycle, we have to make sure that it is vertex disjoint with other augmenting cycles, and if not, find out which cycle has the highest gain. We therefore send a request to the process that owns the matched edge .
Now, the owner of each edge collects all incoming requests for that edge, selects one with maximum gain, and discards all others. Since these requests correspond to cycles rooted at , all remaining cycles now are disjoint w.r.t. their edge, as shown in Algorithm 5. However, we still have to ensure that the cycles are disjoint with other cycles at the edge . Therefore, we send a request to its owner, which might not share any column or row with the sending process. Figure 2 shows an examples of such competing cycles.
In Step D, the owner of each edge collects requests for that edge, selects one with maximum gain, and discards all others, similar to Step C. Thus, all cycles are disjoint w.r.t. their edges. However, it is still possible that the edge of one cycle is the edge of a different cycle. Thus, if a process sent a C-request for an edge in Step C, then it will automatically discard the requests for other cycles that have as their edge. As mentioned in Section 4, our strategy deviates from the Pettie-Sanders algorithm here. The reason for this lies in the fact that finding the maximal set of augmenting 4-cycles would incur additional communication that most likely affects only a small number of vertices. Therefore, in the parallel case it is preferable to simply drop the problematic cycles and generate new ones instead.
5.3 Analysis of the parallel AWPM algorithm
We measure communication by the number of words moved () and the number of messages sent (). The cost of communicating a -word message is where is the latency and is the inverse bandwidth, both are defined relative to the cost of a single arithmetic operation. Hence, an algorithm that performs arithmetic operations, sends messages, and moves words takes time.
Since rows are permuted randomly at the start of the algorithm, in this analysis, matrix nonzeros are assumed to be i.i.d. distributed. We also assume a square process grid. The runtime of the MCM algorithm is dominated by parallel sparse matrix-sparse vector multiplication whose complexity has been analyzed in our prior work . Assuming to be the number of iterations needed to find an MCM, the perfect matching computation takes:
In Step A of the AWAC algorithm, we exchange a total of requests among all processes. Under the i.i.d. assumption, each process contains edges and spends time to prepare requests for all other processes. Let and be the set of rows and column whose pairwise connections are stored in process . Let and be the set of mates of vertices in and , respectively. Since a process stores at most rows and columns, and . In order to receive a request in the th process, there must be an edge between a pair of vertices in . Under the i.i.d. assumption, the number of such edges is . Hence a process receives messages in the communication round of Step A.
The number of requests in Step B cannot be greater than in Step A, and requests in Step C and augmentations in Step D are bounded by . Hence the total cost of AWAC is:
Note that in AWAC, is bounded by a small constant that depends on the desired approximation ratio.
|Core||Intel KNL||Intel Ivy Bridge|
|L1 Cache (KB)||32||32|
|L2 Cache (KB)||1024||256|
|STREAM BW (GB/s)||102||104|
|Interconnect||Aries (Dragonfly)||Aries (Dragonfly)|
|Compiler||Intel C++ Compiler (icpc) ver18.0.0|
6.1 Experimental setup
We evaluated the performance of our algorithms on the Edison and Cori supercomputers at NERSC. On Cori, we used KNL nodes set to cache mode where available 16GB MCDRAM is used as cache.Table 1 summarizes key features of these systems. We used Cray’s MPI implementation for inter-node communication and OpenMP for intra-node multithreading. Unless otherwise stated, all of our experiments used 4 MPI processes per node. We rely on the CombBLAS library  for parallel file I/O, data distribution and storage. We always used square process grids because rectangular grids are not supported in CombBLAS.
In our experiments, we used a diverse test set of real-world matrices from the University of Florida sparse matrix collection  and other sources as shown in Table 2. Before running matching algorithms, we normalized matrices such that the maximum entry of each row or column is and the other entries are bounded by .
We compare the performance of our parallel algorithm with the MWPM implementations in MC64  and an optimized sequential implementation of AWPM. Our sequential implementation uses the Kapr-Sipser and Push-Relabel algorithms  to find a perfect matching, and then uses the augmenting cycles algorithm to find an AWPM. MC64 provides five options for computing the weight of a matching. In all of our experiments except Table 4, we used option 4 that maximizes the sum of weights on matched edges. Since a matching algorithm is often used as a subroutine of another application, we exclude file I/O and graph generation time when reporting the running time of matching algorithms.
|boneS10||sym||0.91||28.19||model reduction problem|
|Serena||sym||1.39||32.96||gas reservoir simulation|
|dielFilterV3real||sym||1.10||45.20||higher-order finite element|
|Li7Nmax6 ||sym||0.66||212.23||nuclear config. interaction|
|HV15R||unsym||2.02||283.07||3D engine fan|
|perf008cr||sym||7.90||321.06||3D structural mechanics|
|nlpkkt240||sym||27.99||401.23||Sym. indef. KKT matrix|
|Nm7 ||unsym||4.08||437.37||nuclear config. interaction|
|NaluR3 ||unsym||17.60||473.71||Low Mach fluid flow|
|A05||unsym||1.56||1088.71||MHD for plasma fusion|
6.2 Approximation ratio attained by parallel AWPM algorithm
Even though our parallel algorithm can not guarantee the theoretical bound of the Pettie-Sanders algorithm, the obtained matchings are often very close to the optimum in practice as shown in Table 3. Here, we compute the approximation ratio by dividing approximated weight by optimum weight and show it in the last column. For 10 out of 16 matrices in Table 2, AWPM finds a matching with the optimum weight. Approximate weights of other matrices are also close to the optimum. This observation extends to many other matrices from the University of Florida collection. In an extended set of more than 100 matrices, the average approximation ratio attained by AWPM is 98.66% (min 86%, max 100%). Hence, our algorithm can successfully substitute MWPM algorithms in many practical problems without sacrificing the quality of the matchings.
6.3 Performance of the parallel AWPM algorithm
Figure 4 compares the runtime of the parallel AWPM algorithm with MC64 and sequential AWPM for all 16 matrices from Table 2. Here MC64+gather lines include the time to gather the whole matrix on a single node, representing the total time needed to compute MWPM sequentially in a distributed memory setting. Experiments in Figure 4 were run on Edison.
At first, we discuss the performance of matching algorithms on bigger matrices where a distributed memory algorithm is urgently needed. MC64 failed with A05, the largest matrix in our suite, because of insufficient memory in a single node of Edison. For matrices of this scale, our distributed-memory algorithm is the only viable option for enabling extreme scale scientific applications that rely on MWPM. For other large matrices, parallel AWPM is significantly faster than its sequential competitors. For some problems, this is true even on a single node. For example, when computing a matching on NaluR3 on a single node of Edison, parallel AWPM is and faster than MC64 and sequential AWPM, respectively. On 256 nodes (6144 cores), our parallel algorithm becomes and faster than MC64 and sequential AWPM. Note that AWPM is able to find the optimum solution on NaluR3. Hence, the drastic reduction in runtime comes for free, without sacrificing any matching quality. For other large matrices (with greater than 200M), parallel AWPM runs at least faster than MC64+gather on high concurrency. This observation also holds for most matrices in the second and third rows in Figure 4.
On smaller matrices (e.g., those in the first row of Figure 4), the performance gain from the parallel algorithm is not as dramatic as with bigger matrices. This is expected as it is hard for parallel algorithms to reduce a subsecond sequential runtime. However, for all matrices except Freescale2, parallel AWPM runs faster than MC64+gather on high concurrency. Hence our algorithm is competitive on smaller matrices and runs significantly faster on bigger matrices.
While most instances show excellent scalability, there are two outliers, Freescale2 and HV15R, where parallel AWPM does not run faster than the sequential AWPM. For both matrices, parallel AWPM spends more than 80% of the runtime on finding an initial perfect matching using a distributed-memory MCM algorithm. Obtaining perfect matchings on these two matrices require searching for long paths going through many processors, which is hard to parallelize. Nevertheless, even for these matrices, parallel AWPM remains competitive or significantly faster than MC64+gather, which is the main competitor of our algorithm in practice.
The matching algorithms discussed above show similar performance trends on Cori-KNL, as shown in Figure 5. For example, on the perf00cr matrix, parallel AWPM runs and faster than MC64+gather on 17,408 and 6,144 cores on Cori-KNL and Edison, respectively. On Cori-KNL, MC64 is able to compute matching for the A05 matrix in 135 seconds, whereas, parallel AWPM algorithm took just 0.44 seconds on 256 nodes of Cori-KNL.
6.4 Scalability of parallel AWPM algorithm
Unless otherwise noted, we report the speedup of the parallel AWPM algorithm relative to its runtime on a single node. AWPM still runs in parallel on a single node using 4 MPI processes and employs multithreading within a process.
At first, we discuss the scalability of the complete AWPM algorithm (including the time to compute the initial perfect matching) as shown in Figure 4 and 5. AWPM achieves speedup on average over 13 matrices that were able to run on a single node of Edison. However, the performance improvement is more significant on bigger matrices. For example, our algorithm attains the best speedup of for NaluR3 on 256 nodes on Edison. On 256 nodes of Cori-KNL, AWPM achieves speedup on average over 14 matrices that were able to run on a single node. The best speedups on Cori-KNL are attained on Li7Nmax6 () and NaluR3 (). For the A05 matrix, as we go from 16 nodes to 256 nodes, parallel AWPM runs and faster on Edison and Cori-KNL, respectively.
Since the primary contribution of the paper is the approximate weight augmenting cycles algorithm, we show its scalability separately in Figure 6. We only show results for matrices with more than 200M nonzeros. For all matrices in Figure 6, the parallel augmenting cycles algorithm attains more than a speedup on both Edison and Cori-KNL. Among these matrices, the best speedups of was observed on 256 nodes of Cori-KNL for the Li7Nmax6 matrix, highlighting the excellent scalability of the newly developed algorithm.
6.5 Insights on the performance of parallel AWPM algorithm
In order to study the behavior of parallel AWPM, we break down the execution time of two instances in Figure 7. For both instances, the newly developed augmenting cycles algorithm scales very well. This trend is also observed for most of the matrices, as was explained before. However, the maximum cardinality matching algorithm that is used to obtain an initial perfect matching starts to become the bottleneck on high concurrency. As studied extensively in a prior work , the performance of the parallel MCM algorithm suffers if it needs to search long augmenting paths that span many processors.
6.6 Performance of AWPM as a pre-pivoting tool for a distributed sparse direct solver
|Matching weight||Relative error|
As stated in Section 1, our motivation for AWPM comes from the need for parallel pre-pivoting of large entries to the main diagonal to improve stability of sparse direct solvers. We now evaluate how AWPM performs relative to MC64 in this regard, using the SuperLU_DIST direct solver. For each matrix in Table 4, we set the true solution vector , then generate the right-hand side vector by computing . We use the LAPACK-style simple equilibration scheme to compute the row and column scalings and which would make each row and each column of the scaled matrix have equal norm, and the maximum entry of each row or column is and the other entries are bounded by 1 (Hence, the weight of many matched edges is often 1, and the sum of the logarithms is 0.) We then apply the pre-pivoting strategies of MC64 or AWPM to compute a row permutation vector . Here, the maximization criterion is the sum of the logarithms of the weights, i.e. the product of the weights. The sparsity reordering is then obtained with METIS using the permuted matrix (graph). Finally, the LU factorization is computed as , and the solution is computed based on the transformed linear system. The relative solution error is reported in Table 4. For most matrices, the relative error obtained with AWPM is remarkably close to that of MC64, with the exception of circuit5M. This can be explained by the difference in weights found by MC64 and AWPM. However, for most matrices, the AWPM weights are identical to the MC64 weights. Recall in Table 3, when the sum of weights is maximized, AWPM achieves 99.99% optimum weight for this matrix. When using the permutation obtained from the “AWPM-sum” metric for circuit5M, the computed solution is as accurate as that of MC64. In the future, we plan to investigate the performance of the AWPM-sum and AWPM-product metrics in the solution accuracy of SuperLU_DIST.
7 Concluding Remarks
We presented a new distributed-memory parallel algorithm for the approximate-weight perfect matching problem on bipartite graphs. That is, our algorithm returns a perfect matching (if it exists) that has approximately maximum weight. Our motivation comes from distributed-memory sparse direct solvers where an initial permutation of the sparse matrix that places entries with large absolute values on the diagonal is often performed before the factorization, in order to avoid expensive pivoting during runtime. This process is called static pivoting and the permutation is ideally found using a maximum-weight perfect matching. However, previous attempts in parallelizing the exact algorithms met with limited success. Since the perfect matching part is a strict requirement of static pivoting, our algorithm only relaxes the weight requirement.
There are two key reasons for the performance of our algorithm. For the initial phase where we find a perfect matching, we use existing optimized distributed-memory cardinality matching algorithms with minor modifications to maximize the weight when tie-breaking. For the iterative phase that improves on the weights while maintaining maximum cardinality, we restrict the algorithm to 4-cycles, and thus avoid traversing long augmenting paths, following the Pettie-Sanders approximation algorithm. Hence, the crux of our paper is an efficient method for finding augmenting 4-cycles on a bipartite graph in distributed memory.
In terms of LP-duality, unlike the Hungarian algorithm , ours is a primal method because it maintains a feasible solution. It does not compute a solution to the dual problem, i.e. feasible potential. Since the dual values can be used to scale matrices in linear solvers  extending the algorithm to provide such scaling factors is a goal for future work.
We have tested the algorithm on the largest available instances in the Florida sparse matrix collection. While in some cases finding the perfect matching in parallel is expensive, the weight approximation always scales well. Our experiments show that the weight of the matchings we obtain is high and high enough to replace MC64 for pre-pivoting in linear solvers such as SuperLU_DIST, thereby making such solvers fully scalable.
-  Hasan Metin Aktulga, Aydin Buluç, Samuel Williams, and Chao Yang. Optimizing sparse matrix-multiple vectors multiplication for nuclear configuration interaction calculations. In Proceedings of the IPDPS, pages 1213–1222, 2014.
-  H. Alt, N. Blum, K. Mehlhorn, and M. Paul. Computing a maximum cardinality matching in a bipartite graph in time . Information Processing Letters, 37(4):237–240, 1991.
-  Ariful Azad and Aydin Buluç. A matrix-algebraic formulation of distributed-memory maximal cardinality matching algorithms in bipartite graphs. Parallel Computing, 58:117–130, 2016.
-  Ariful Azad and Aydın Buluç. Distributed-memory algorithms for maximum cardinality matching in bipartite graphs. In Proceedings of the IPDPS. IEEE, 2016.
-  Ariful Azad, Aydın Buluç, and Alex Pothen. Computing maximum cardinality matchings in parallel on bipartite graphs via tree-grafting. IEEE Transactions on Parallel and Distributed Systems (TPDS), 28(1):44–59, 2017.
-  Ariful Azad, Mahantesh Halappanavar, Sivasankaran Rajamanickam, Erik Boman, Arif Khan, and Alex Pothen. Multithreaded algorithms for maximum matching in bipartite graphs. In IPDPS, Shanghai, China, 2012. IEEE.
-  Dimitri P. Bertsekas and David A. Castañón. A generic auction algorithm for the minimum cost network flow problem. Comp. Opt. and Appl., 2(3):229–259, 1993.
-  Aydın Buluç and John R Gilbert. The Combinatorial BLAS: Design, implementation, and applications. International Journal of High-Performance Computing Applications (IJHPCA), 25(4), 2011.
-  Timothy A Davis and Yifan Hu. The University of Florida sparse matrix collection. ACM Transactions on Mathematical Software (TOMS), 38(1):1, 2011.
-  I. S. Duff, K. Kaya, and B. Uçar. Design, implementation, and analysis of maximum transversal algorithms. ACM Transaction on Mathematical Software, 38(2):13:1– 13:31, 2011.
-  Iain S. Duff and Jacko Koster. On algorithms for permuting large entries to the diagonal of a sparse matrix. SIAM Journal on Matrix Analysis and Applications, 22(4):973–996, 2001.
-  A. V. Goldberg and R. E. Tarjan. A new approach to the maximum flow problem. Journal of the Association for Computing Machinery, 35:921–940, 1988.
-  Mahantesh Halappanavar, John Feo, Oreste Villa, Antonino Tumeo, and Alex Pothen. Approximate weighted matching on emerging manycore and multithreaded architectures. The International Journal of High Performance Computing Applications, 26(4):413–430, 2012.
-  Jonathan Hogg and Jennifer Scott. On the use of suboptimal matchings for scaling and ordering sparse symmetric matrices. Numerical Linear Algebra with Applications, 22(4):648–663, 2015.
-  John E. Hopcroft and Richard M. Karp. An algorithm for maximum matchings in bipartite graphs. SIAM Journal on Computing, 2(4):225–231, 1973.
-  K. Kaya, J. Langguth, F. Manne, and B. Uçar. Push-relabel based algorithms for maximum transversal problem. Computers & Operations Research, 40(5):1266–1275, 2013.
-  H. W. Kuhn. The hungarian method for the assignment problem. Naval Research Logistics Quarterly, 2(1-2):83–97, 1955.
-  J. Langguth, F. Manne, and P. Sanders. Heuristic initialization for bipartite matching problems. ACM Journal of Experimental Algorithmics, 15:1.1–1.22, February, 2010.
-  Johannes Langguth, Ariful Azad, Mahantesh Halappanavar, and Fredrik Manne. On parallel push-relabel based algorithms for bipartite maximum matching. Parallel Computing, 40(7):289–308, 2014.
-  Johannes Langguth, Md. Mostofa Ali Patwary, and Fredrik Manne. Parallel algorithms for bipartite matching problems on distributed memory computers. Parallel Computing, 37(12):820–845, 2011.
-  Xiaoye S Li and James W Demmel. SuperLU_DIST: A scalable distributed-memory sparse direct solver for unsymmetric linear systems. ACM Transactions on Mathematical Software (TOMS), 29(2):110–140, 2003.
-  Paul Lin, Matthew Bettencourt, Stefan Domino, Travis Fisher, Mark Hoemmen, Jonathan Hu, Eric Phipps, Andrey Prokopenko, Sivasankaran Rajamanickam, Christopher Siefert, et al. Towards extreme-scale simulations for low mach fluids with second-generation Trilinos. Parallel Processing Letters, 24(04):1442005, 2014.
-  Fredrik Manne and Rob H. Bisseling. A parallel approximation algorithm for the weighted maximum matching problem. In International Conference on Parallel Processing and Applied Mathematics, pages 708–717, 2007.
-  Fredrik Manne and Mahantesh Halappanavar. New effective multithreaded matching algorithms. In Proceedings of the IPDPS, pages 519–528, 2014.
-  Markus Olschowka and Arnold Neumaier. A new pivoting strategy for gaussian elimination. Linear Algebra and its Applications, 240(Supplement C):131 – 151, 1996.
-  Seth Pettie and Peter Sanders. A simpler linear time 2/3-epsilon approximation for maximum weight matching. Inf. Process. Lett., 91(6):271–276, 2004.
-  A. Pothen and C.-J. Fan. Computing the block triangular form of a sparse matrix. ACM Trans. Math. Softw., 16:303–324, 1990.
-  Edward Jason Riedy. Making static pivoting scalable and dependable. PhD thesis, University of California, Berkeley, 2010.
-  P. Sao, X.S. Li, and R. Vuduc. A 3D LU factorization algorithm for sparse matrices. In Proceedings of the IPDPS (to appear). IEEE, 2018.
-  M. Sathe, O. Schenk, and H. Burkhart. An auction-based weighted matching implementation on massively parallel architectures. Parallel Computing, 38(12):595–614, 2012.
-  Alexander Schrijver. Combinatorial optimization: Polyhedra and efficiency, volume 24. Springer Science & Business Media, 2003.