A distributed-memory approximation algorithm for maximum weight perfect bipartite matching

01/30/2018 ∙ by Ariful Azad, et al. ∙ Simula Research Lab ∙ Berkeley Lab

We design and implement an efficient parallel approximation algorithm for the problem of maximum weight perfect matching in bipartite graphs, i.e. the problem of finding a set of non-adjacent edges that covers all vertices and has maximum weight. This problem differs from the maximum weight matching problem, for which scalable approximation algorithms are known. It is primarily motivated by finding good pivots in scalable sparse direct solvers before factorization, where sequential implementations of maximum weight perfect matching algorithms, such as those available in MC64, are widely used due to the lack of scalable alternatives. To overcome this limitation, we propose a fully parallel distributed-memory algorithm that first generates a perfect matching and then searches for weight-augmenting cycles of length four in parallel and iteratively augments the matching with a vertex-disjoint set of such cycles. For most practical problems the weights of the perfect matchings generated by our algorithm are very close to the optimum. An efficient implementation of the algorithm scales up to 256 nodes (17,408 cores) on a Cray XC40 supercomputer and can solve instances that are too large to be handled by a single node using the sequential algorithm.


1 Introduction

The maximum cardinality matching (MCM) problem is a classical topic in combinatorial optimization. Given a graph, it asks for a set of non-incident edges of maximum size. For the bipartite version of the problem, efficient sequential algorithms such as Hopcroft-Karp [15] have been known for a long time. Practical algorithms for bipartite MCM have recently been studied intensively [10, 19, 5], leading to the development of scalable distributed-memory algorithms [20, 4].

Finding a maximum-cardinality matching on a bipartite graph that also has maximum weight (also called the assignment problem in the literature) is a harder problem, both w.r.t. complexity and in practice. This is because in the transversal problem (i.e. maximum cardinality matching), the edges are interchangeable, but in the assignment problem, they are not. As a result, only cardinality matchings allow concurrent vertex-disjoint searches in the graph, which makes the assignment problem harder to parallelize. Recently published attempts at parallelizing the assignment problem, e.g. [30], rely on the auction paradigm [7]. While this approach has demonstrated some speedups, its scalability is limited and it is inefficient in maximizing the cardinality in distributed memory [28].

In this paper, we follow a different approach. Instead of relaxing both the maximum cardinality and maximum weight requirements at the same time, we only relax the maximum weight requirement and use an algorithm that always returns maximum cardinality. This means that we solve the transversal problem optimally and the assignment problem approximately. We only consider graphs that have perfect matchings. Hence, we approximate the weight of a maximum-weight perfect matching (MWPM) and thus solve the approximate-weight perfect matching (AWPM) problem on distributed-memory machines.

The motivation for this problem comes from sparse direct solvers. Often, sparse linear systems are too large to be solved on a single node, necessitating distributed-memory solvers such as SuperLU_DIST [21]. Partial pivoting, which is stable in practice, requires dynamic row exchanges that are too expensive to perform in the distributed case. Consequently, distributed-memory solvers often resort to static pivoting, where the input is pre-permuted to have a “heavy” diagonal so that the factorization can proceed without further pivoting. The definition of “heavy” at the minimum implies having only nonzeros on the diagonal. Whether maximizing the product or the sum of absolute values of the diagonal is the right choice for the objective function is debatable, but both can be solved by finding a perfect bipartite matching of maximum weight. In this formulation, rows and columns of the sparse matrix serve as vertices on each side of the bipartite graph, and nonzeros as edges between them.
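To make the connection concrete, the following C++ sketch (illustrative only; the triplet structure and function names are not part of any solver) scores a permutation that matches each column to a row by the sum of the logarithms of the matched entries, the quantity that static pivoting tries to maximize.

#include <cmath>
#include <cstdio>
#include <vector>

// A sparse matrix in triplet form viewed as a bipartite graph: rows are one
// vertex class, columns the other, and every nonzero a(i,j) is an edge
// {row i, column j} with weight |a(i,j)| (or log|a(i,j)| when the product of
// the diagonal entries is maximized).
struct Triplet { int row, col; double val; };

// perm[j] = row matched to column j, i.e. a perfect matching.
// Returns the sum of log-weights of the matched entries.
double matchingLogWeight(const std::vector<Triplet>& A,
                         const std::vector<int>& perm) {
  std::vector<double> w(perm.size(), 0.0);   // weight of the matched entry per column
  for (const Triplet& t : A)
    if (perm[t.col] == t.row) w[t.col] = std::fabs(t.val);
  double sum = 0.0;
  for (double x : w) sum += std::log(x);     // -inf if a matched entry is zero or missing
  return sum;
}

int main() {
  // 3x3 example: permuting the rows puts the large entries on the diagonal.
  std::vector<Triplet> A = {{0,0,0.1},{0,1,5.0},{1,1,0.2},{1,2,4.0},{2,0,3.0},{2,2,0.3}};
  std::vector<int> perm = {2, 0, 1};         // column j is matched to row perm[j]
  std::printf("log-weight of matching: %g\n", matchingLogWeight(A, perm));
  return 0;
}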

Since the input matrix is already distributed as part of the library requirements, it is necessary to use distributed-memory parallel matching algorithms. However, the lack of scalable matching algorithms forces distributed-memory solvers to assemble the entire instance on a single node and then use a sequential matching library, such as the highly-optimized implementations of MWPM algorithms available in MC64 [11]. For instance, the new algorithm in SuperLU_DIST demonstrated strong scaling to large core counts [29], but still used the sequential static pivoting step. Such reliance on a sequential library is disruptive to the computation, infeasible for larger instances, and certainly not scalable.

We use the exact distributed memory parallel cardinality matching algorithm from [4], and combine it with a distributed memory parallel approximation algorithm for the weight. Inspired by a sequential algorithm by Pettie and Sanders [26], the approximation relies on finding short augmentations. In our case, these augmentations are cycles of length 4 since we maintain a perfect matching.

In this manner, we get a scalable algorithm for the overall problem that can be used in the initialization of sparse direct solvers, although it is not restricted to that application.

The main contributions of this paper are as follows:


  • Algorithm: We present a highly parallel algorithm for the approximate-weight perfect bipartite matching problem.

  • Usefulness: The presented algorithm returns perfect matchings whose weights are often optimal and typically within a few percent of the optimum.

  • Performance: We provide a hybrid OpenMP-MPI implementation that runs significantly faster than a sequential implementation of the optimum algorithm on up to 256 nodes of NERSC/Cori, and that attains substantial speedup on 256 nodes relative to its runtime on a single node.

  • Impact: The presented algorithm can be used to find good pivots in distributed sparse direct solvers such as SuperLU_DIST, eliminating a longstanding performance bottleneck.

2 Related Work

The bipartite maximum cardinality matching problem has been studied for more than a century, and many different algorithms for solving it have been published over the years [15, 2, 27, 12]. Experimental studies [10, 16] established that, when using heuristic initialization [18], optimized variants of two different approaches, the Pothen-Fan algorithm [27] and the push-relabel algorithm [12], provide superior practical performance in the sequential case. Both algorithms have efficient shared-memory counterparts [6, 19] which show good scaling on a single compute node. For distributed-memory systems, however, the problem has proven to be extremely challenging. Since no theoretically efficient parallel algorithm is known and the problem is suspected to be inherently sequential, parallel algorithms tend to require a large number of consecutive communication rounds. More recently, a push-relabel variant that exploits the fact that local matching can be performed at a much faster rate than nonlocal operations was presented [20]. A different approach formulated the problem in terms of sparse matrix operations [4]. An implementation of the resulting algorithm scaled to thousands of cores. Our work uses this algorithm as a subroutine.

For the weighted case, parallel approximation algorithms have been shown to scale very well [24], even in distributed memory [23]. Furthermore, these algorithms also work for nonbipartite graphs. On the other hand, exact algorithms such as successive shortest paths have proven difficult to parallelize, even for shared memory. Currently, auction algorithms [30, 7], which essentially constitute a weighted version of the push-relabel algorithm, are a promising direction and can efficiently find matchings of near-maximum weight, but they tend to be very inefficient at finding perfect cardinality matchings in distributed memory [28]. In shared memory, however, they can be competitive [13]. For that case, Hogg and Scott showed that the auction algorithm provides matrix orderings for direct linear solvers of a quality similar to the exact method [14].

The aim of our work is similar to Hogg and Scott’s. However, since we target distributed-memory machines, we need to develop a different algorithm. Pettie and Sanders [26] described and analyzed sequential linear-time approximation algorithms for the general weighted matching problem. Our approximation idea of using cycles of length 4 is inspired by this work.

3 Notation and Background

For any m × n matrix A there is a weighted bipartite graph G = (R ∪ C, E), whose vertex set consists of m row vertices in R and n column vertices in C. For each nonzero entry a_ij of A, E contains an undirected edge {r_i, c_j} that is incident to row vertex r_i and column vertex c_j, and has a weight w(r_i, c_j) = a_ij.

Given a bipartite graph G = (R ∪ C, E), a matching M is a subset of E such that at most one edge in M is incident on any vertex. Given a matching M in G, an edge is matched if it belongs to M, and unmatched otherwise. Similarly, a vertex is matched if it is an endpoint of a matched edge. If an edge {r_i, c_j} is matched, we define mate(r_i) = c_j and mate(c_j) = r_i and call the vertices mates of each other. A matching M is maximal if there is no other matching that properly contains M, and M is maximum if |M| ≥ |M'| for every matching M' in G. Furthermore, if |M| = |R| = |C|, M is called a perfect matching. When A has full structural rank, the corresponding bipartite graph has a perfect matching. Since this paper focuses on perfect matching algorithms, we assume |R| = |C| = n. We denote |R| = |C| by n and |E| by m. Now, the perfect matching problem consists of either finding a matching that is perfect, or deciding that no such matching exists in the input graph.

The weight w(M) of a matching M is simply the sum of the weights of its edges. The maximum weight matching problem asks for a matching of maximum weight regardless of cardinality, while the maximum weight perfect matching (MWPM) problem asks for a perfect matching that has maximum weight among all perfect matchings. If we multiply all weights by -1, the MWPM problem becomes equivalent to the minimum weight perfect matching problem [31].

All the above problems can be further subdivided into the bipartite and general case, with the latter often requiring significantly more sophisticated algorithms. In the following, we will restrict our discussions to the bipartite case.

An M-alternating cycle in G is a cycle whose edges are alternately matched and unmatched in matching M. By exchanging the matched and unmatched edges of an M-alternating cycle, we get a new matching of the same cardinality, but potentially different weight. If the new weight is higher, the alternating cycle is also an M-augmenting cycle. The difference in weight is called the gain of the cycle. Let g(u, v) denote the gain of the cycle formed by the vertices u and v and their mates mate(u) and mate(v). A cycle that contains k vertices and k edges is called a k-cycle.

Since we study a maximization problem, for any 0 < α ≤ 1, we say that a perfect matching M in G is α-optimum if w(M) ≥ α · w(M_opt), where M_opt is a perfect matching of maximum weight in G. We call an algorithm an α-approximation algorithm if it always returns α-optimum solutions.

4 The Pettie-Sanders Algorithm for Weighted Matching

A sequential linear time approximation algorithm for the weighted matching problem on general graphs was presented by Pettie and Sanders [26]. Because we restrict ourselves to dealing with perfect matchings in bipartite graphs, the algorithm can be simplified significantly. This simplified version, called PSS, is based on three fundamental statements:

  1. A perfect matching is 2/3-optimum if it contains no augmenting 4-cycles.

  2. For any perfect matching M, there exists a set of vertex-disjoint augmenting 4-cycles whose total gain is at least a constant fraction of w(M_opt) - w(M).

  3. An augmenting 4-cycle can be found in O(Δ) time, where Δ is the maximum degree among the vertices in the cycle.

Based on these statements, a lower bound on the probability of obtaining cycles from the set in Statement 2 when selecting augmenting 4-cycles uniformly at random can be derived, which in turn implies a randomized approximation guarantee, since the average gain of an augmenting 4-cycle from that set can be bounded from below. Thus, the randomized PSS algorithm repeatedly selects and applies random augmenting 4-cycles to the matching. It thereby computes a (2/3 - ε)-optimum matching in expected time O(m log(1/ε)). We implement a practical variant of this algorithm that simply loops over the vertices to find and apply augmenting 4-cycles and stops when no such cycles can be found, and we use it for comparison with the parallel algorithm in Section 6.

In the same publication, Pettie and Sanders also present a derandomized version of their algorithm with a similar approximation guarantee. The simplified version is shown in Algorithm 1. Our parallel algorithm is inspired by this derandomized algorithm.

1: Let M be a perfect matching
2: for t = 1 to k do
3:     C := ∅
4:     for all v ∈ R ∪ C do
5:         c_v := max-gain augmenting 4-cycle rooted at v
6:         C := C ∪ {c_v}
7:     M := M ⊕ greedy(C)
return M
Algorithm 1 The deterministic sequential algorithm adapted from Pettie and Sanders [26]

To find a 4-cycle rooted at a vertex u, select a neighbor v, then follow the matched edges incident to u and v to mate(u) and mate(v). Since M is a perfect matching, these edges always exist. Next, scan the neighborhood of mate(v) for mate(u). If the edge {mate(u), mate(v)} exists, we have found an alternating 4-cycle of gain g(u, v) = w(u, v) + w(mate(u), mate(v)) - w(u, mate(u)) - w(v, mate(v)).
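The following sequential C++ sketch spells out this search for a column root vertex; it is illustrative only, and the CSC structure and function names are ours, not those of the actual implementation.

#include <cstdio>
#include <vector>

// Hypothetical CSC representation of the bipartite graph / sparse matrix.
struct CSC {
  int n;                       // number of rows = number of columns
  std::vector<int> colptr;     // size n+1
  std::vector<int> rowind;     // row indices of nonzeros, column by column
  std::vector<double> val;     // edge weights
};

// Weight of edge {r, c} if it exists (O(deg(c)) scan of column c).
bool edgeWeight(const CSC& A, int r, int c, double& w) {
  for (int k = A.colptr[c]; k < A.colptr[c + 1]; ++k)
    if (A.rowind[k] == r) { w = A.val[k]; return true; }
  return false;
}

// Maximum-gain augmenting 4-cycle rooted at column c. mateOfCol[c] and
// mateOfRow[r] describe the current perfect matching, and matchedW[c] is the
// weight of the matched edge {mateOfCol[c], c}. Returns the best gain and the
// row closing the cycle (bestRow = -1 if no augmenting cycle exists).
double bestCycleAtColumn(const CSC& A, int c,
                         const std::vector<int>& mateOfCol,
                         const std::vector<int>& mateOfRow,
                         const std::vector<double>& matchedW, int& bestRow) {
  double bestGain = 0.0;
  bestRow = -1;
  for (int k = A.colptr[c]; k < A.colptr[c + 1]; ++k) {
    int r = A.rowind[k];
    if (r == mateOfCol[c]) continue;          // skip the matched edge itself
    double wClose;
    // the closing edge {mateOfCol[c], mateOfRow[r]} must exist in the graph
    if (!edgeWeight(A, mateOfCol[c], mateOfRow[r], wClose)) continue;
    double gain = A.val[k] + wClose - matchedW[c] - matchedW[mateOfRow[r]];
    if (gain > bestGain) { bestGain = gain; bestRow = r; }
  }
  return bestGain;
}

int main() {
  // 2x2 example A = [[1,5],[4,1]] with the diagonal matched (weight 2).
  CSC A{2, {0, 2, 4}, {0, 1, 0, 1}, {1.0, 4.0, 5.0, 1.0}};
  std::vector<int> mateOfCol = {0, 1}, mateOfRow = {0, 1};
  std::vector<double> matchedW = {1.0, 1.0};
  int r;
  double g = bestCycleAtColumn(A, 0, mateOfCol, mateOfRow, matchedW, r);
  std::printf("best 4-cycle at column 0: row %d, gain %g\n", r, g);  // row 1, gain 7
  return 0;
}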

The main problem when parallelizing the algorithm stems from Line 7. Here, greedy(C) is a set of vertex-disjoint cycles selected from C via a greedy algorithm. Since doing so in parallel would be costly, we deviate slightly from this idea and perform only local comparisons to find a heavy set of disjoint augmenting 4-cycles. This means that our algorithm does not have the same theoretical properties as Pettie and Sanders' algorithms, and it can terminate without returning a (2/3 - ε)-optimum matching, although this can only happen on very specific instances. While this could easily be overcome by applying random augmentations, doing so is not desirable from a practical point of view.

5 The Parallel Approximation Algorithm

5.1 Overview of the parallel AWPM algorithm

In our approximate weight perfect matching, or AWPM algorithm, we combine a parallel maximum cardinality matching (MCM) algorithm with a distributed memory approximation algorithm for the weight. The MCM algorithm is initialized using a maximal cardinality matching algorithm that returns a matching with cardinality at least half of the maximum. While this step is optional, doing so greatly improves the parallel performance. The maximal and maximum matching algorithms together provide a perfect matching whose weight is iteratively improved by discovering augmenting cycles following the idea of the Pettie-Sanders algorithm discussed above.

Distributed-memory algorithms for cardinality matchings on bipartite graphs were developed in prior work [3, 4]. Both of these algorithms are implemented using a handful of bulk-synchronous matrix and vector operations and are freely available as part of the CombBLAS library. Among several variants of the maximal matching algorithms, we used a simple greedy algorithm to initialize MCM. For this work, we modified the original cardinality matching algorithms in such a way that when selecting potential matching edges, we break ties by giving precedence to edges with higher weight. This simple heuristic often results in the perfect matchings having high weight without incurring any additional cost. Once a perfect matching is obtained, we improve its weight using our parallel approximation algorithm.

5.2 The approximate weight augmenting cycles algorithm

As mentioned in the last section, the algorithm for maximum weight approximation aims to find augmenting 4-cycles. Unlike longer cycles, they can be found in time proportional to the degree of their vertices, and doing so requires only a constant number of communication rounds in the distributed memory parallel setting. Thus, our algorithm can be considered a special case of the general augmenting cycles algorithm. We will refer to it as AWAC, i.e. approximate weight augmenting cycles, or augmenting cycles algorithm in short.

The vertices of a 4-cycle might be distributed over 4 processes, requiring 4 major communication steps, which are described in detail below. We use a regular 2D partitioning of the input matrix: the p processes form a √p × √p grid and are denoted by P(i, j) in the grid. Let A_ij be the submatrix assigned to process P(i, j). Figure 1 illustrates the layout and explains the labeling of the processes. The algorithm starts by distributing the matching information across process rows and columns, and then loops over the four steps up to k times, or until no more augmenting 4-cycles can be found, as shown in Algorithm 2.
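As a small illustration of the layout, the following C++ sketch computes the owner of a nonzero under block partitioning; it is a simplification that assumes n is divisible by √p and is not the exact indexing used in CombBLAS.

#include <cstdio>

// Block 2D partitioning of an n x n matrix over a sqrt(p) x sqrt(p) grid.
struct Grid { int n; int sqrtp; };

// Process coordinates (process row, process column) owning nonzero (i, j).
void owner(const Grid& g, int i, int j, int& pr, int& pc) {
  int blk = g.n / g.sqrtp;     // rows/columns per process block
  pr = i / blk;
  pc = j / blk;
}

int main() {
  Grid g{8, 2};                // 8x8 matrix on a 2x2 process grid
  int pr, pc;
  owner(g, 5, 2, pr, pc);      // unmatched edge {r_5, c_2}
  std::printf("edge (5,2) lives on process (%d,%d)\n", pr, pc);
  // The matched edge {r_5, mate(r_5)} lives in the same process row pr, and
  // the matched edge {mate(c_2), c_2} in the same process column pc, which is
  // why mate information is replicated along process rows and columns.
  return 0;
}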

Figure 1: Augmenting 4-cycle in the 2D distributed matrix. Each square contains the part of the matrix assigned to one process. For simplicity, the matched edges are here gathered on the main diagonal (dashed red line). a and y denote process rows, while b and x denote process columns. Thus, if the edge {r_i, c_j} is on process P(a, b), then its incident matched edges {r_i, mate(r_i)} and {mate(c_j), c_j} are on processes P(a, x) and P(y, b), respectively, where x and y depend on the position of the matched edges. If the unmatched edge {mate(c_j), mate(r_i)} that closes the cycle exists in the graph, it must be on process P(y, x).
1: Input: a weighted bipartite graph G = (R ∪ C, E) and a perfect matching M
2: On each process P(a,b) do in parallel:
3: for all rows r_i of A contained in A_ab do
4:     Copy mate(r_i) and w(r_i, mate(r_i)) to a local array
5: for all columns c_j of A contained in A_ab do
6:     Copy mate(c_j) and w(mate(c_j), c_j) to a local array
7: for i = 1 to k do
8:     Step A: Cycle requests
9:     Step B: Cycle completion
10:    Step C: Local cycle comparison
11:    Step D: Nonlocal cycle comparison
12:    if no cycle was found then
13:        break
Algorithm 2 Sketch of the Basic Parallel Approximation Algorithm

Like the deterministic sequential Algorithm 1, we construct a set of vertex-disjoint augmentations. In effect, we parallelize the FOR ALL statement in Line 4 of Algorithm 1. However, we do not use the same greedy strategy to pick the augmenting 4-cycles in Line 7, since the standard greedy algorithm is inherently sequential. Instead, we perform limited local weight comparisons, which are described below.

After initialization, the first step will be to select vertices at which augmenting cycles are rooted. While all vertices are eligible, we can reduce this number since each cycle can be rooted at any of its 4 vertices. Because the matrix is stored in Compressed Sparse Column (CSC) format, we only start cycles from column vertices, reducing the number of root vertices by half. This number can be halved again by rooting a given cycle only at the column vertex with the lower index.

Now, for a potential cycle rooted at a column vertex c_j and containing a row vertex r_i, we generate a request to the process that would own the cycle-closing edge {mate(c_j), mate(r_i)}; the request includes the edge weight w(r_i, c_j), as shown in Algorithm 3. For performance reasons, the exchange of all such requests is bundled into an All-to-All collective operation (a sketch of this exchange is given after Algorithm 3).

1: for all processes P(a,b) do
2:     for all columns c_j of A contained in A_ab do
3:         for all rows r_i of A contained in A_ab do
4:             if edge {r_i, c_j} exists then
5:                 Let P(y,x) be the process that would own {mate(c_j), mate(r_i)}
6:                 Add A-request (c_j, r_i, w(r_i, c_j)) to the request queue for process P(y,x)
7: exchange A-requests via AllToAllv
Algorithm 3 Step A
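The bundled exchange can be realized with a single pair of collectives. The following C++/MPI sketch is illustrative only (the Request layout and function name are ours): it buckets requests per destination process and swaps them with MPI_Alltoall followed by MPI_Alltoallv.

#include <mpi.h>
#include <vector>

// Illustrative request record; the actual implementation packs its own fields.
struct Request { int rootCol, row; double weight; };

std::vector<Request> exchangeRequests(
    const std::vector<std::vector<Request>>& outgoing, MPI_Comm comm) {
  int p;
  MPI_Comm_size(comm, &p);

  // Build a contiguous send buffer plus per-destination counts/displacements
  // (counts are in bytes because the records are sent as MPI_BYTE).
  std::vector<int> sendcounts(p), sdispls(p), recvcounts(p), rdispls(p);
  std::vector<Request> sendbuf;
  for (int q = 0; q < p; ++q) {
    sdispls[q] = static_cast<int>(sendbuf.size() * sizeof(Request));
    sendcounts[q] = static_cast<int>(outgoing[q].size() * sizeof(Request));
    sendbuf.insert(sendbuf.end(), outgoing[q].begin(), outgoing[q].end());
  }

  // First exchange the counts, then the variable-sized payloads.
  MPI_Alltoall(sendcounts.data(), 1, MPI_INT, recvcounts.data(), 1, MPI_INT, comm);
  int total = 0;
  for (int q = 0; q < p; ++q) { rdispls[q] = total; total += recvcounts[q]; }

  std::vector<Request> recvbuf(total / sizeof(Request));
  MPI_Alltoallv(sendbuf.data(), sendcounts.data(), sdispls.data(), MPI_BYTE,
                recvbuf.data(), recvcounts.data(), rdispls.data(), MPI_BYTE, comm);
  return recvbuf;
}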

In the next step, we need to determine whether the edge {mate(c_j), mate(r_i)}, and thus the alternating cycle of c_j, r_i, mate(r_i), and mate(c_j), exists and is augmenting, i.e. has positive gain. Note that the weights of the matched edges are stored on all processes in the same grid row/column, and can thus be accessed directly, as shown in Algorithm 4. However, before we can augment the matching along this cycle, we have to make sure that it is vertex disjoint with other augmenting cycles, and if not, find out which cycle has the highest gain. We therefore send a request to the process that owns the matched edge {mate(c_j), c_j}.

1: for all processes P(y,x) do
2:     for all A-requests (c_j, r_i, w(r_i, c_j)) received by P(y,x) do
3:         if edge {mate(c_j), mate(r_i)} exists then
4:             g := w(r_i, c_j) + w(mate(c_j), mate(r_i)) - w(r_i, mate(r_i)) - w(mate(c_j), c_j)
5:             if g > 0 then
6:                 Add B-request to the request queue for the process owning {mate(c_j), c_j}
7: exchange B-requests via AllToAllv
Algorithm 4 Step B

Now, the owner of each matched edge {mate(c_j), c_j} collects all incoming requests for that edge, selects one with maximum gain, and discards all others. Since these requests correspond to cycles rooted at c_j, all remaining cycles are now disjoint w.r.t. their {mate(c_j), c_j} edge, as shown in Algorithm 5 (a sketch of this selection is given after the algorithm). However, we still have to ensure that the cycles are disjoint with other cycles at the edge {r_i, mate(r_i)}. Therefore, we send a request to its owner, which might not share any row or column of the process grid with the sending process. Figure 2 shows examples of such competing cycles.

1: for all processes P(y,b) do
2:     for all rows mate(c_j) of A contained in A_yb do
3:         Find the B-request for the matched edge {mate(c_j), c_j} with maximum gain
4:         Add a C-request to the request queue for the process owning {r_i, mate(r_i)}
5: exchange C-requests via AllToAllv
Algorithm 5 Step C
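The local comparison itself is a simple reduction per matched edge. The C++ sketch below (field and function names are illustrative) keeps, for each matched edge identified by its row vertex, only the incoming request with maximum gain.

#include <unordered_map>
#include <vector>

// Illustrative B-request record: the matched edge it targets (by row vertex),
// the cycle that produced it, and the gain computed in Step B.
struct BRequest { int matchedRow; int rootCol; int otherRow; double gain; };

std::vector<BRequest> keepBestPerEdge(const std::vector<BRequest>& incoming) {
  std::unordered_map<int, BRequest> best;     // matched row -> best request
  for (const BRequest& r : incoming) {
    auto it = best.find(r.matchedRow);
    if (it == best.end() || r.gain > it->second.gain) best[r.matchedRow] = r;
  }
  std::vector<BRequest> winners;
  winners.reserve(best.size());
  for (const auto& kv : best) winners.push_back(kv.second);
  return winners;                              // forwarded as C-requests
}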
Figure 2: Cycle collisions in the graph view. Left: multiple cycles, all of which are rooted at the same column vertex c_j, compete for the matched edge {mate(c_j), c_j} in Step C of the algorithm. Right: cycles rooted at different column vertices compete for the matched edge {r_i, mate(r_i)} in Step D.
Figure 3: Communication in the 2D distributed matrix view. In Step A (orange arrow) communication goes from P(a,b) to P(y,x), in Step B (blue arrow) from P(y,x) to P(y,b), and in Step C (purple arrow) from P(y,b) to P(a,x). In Step D, the matching is updated along process rows a and y, and along process columns b and x (green squares). In this example that includes all but the process in the center.

In Step D, the owner of each matched edge {r_i, mate(r_i)} collects the C-requests for that edge, selects one with maximum gain, and discards all others, similar to Step C. Thus, all cycles are disjoint w.r.t. their {r_i, mate(r_i)} edges. However, it is still possible that the {r_i, mate(r_i)} edge of one cycle is the {mate(c_j), c_j} edge of a different cycle. Thus, if a process sent a C-request for a matched edge in Step C, it automatically discards the requests of other cycles that have this edge as their {r_i, mate(r_i)} edge. As mentioned in Section 4, our strategy deviates from the Pettie-Sanders algorithm here. The reason for this lies in the fact that finding a maximal set of augmenting 4-cycles would incur additional communication that most likely affects only a small number of vertices. Therefore, in the parallel case it is preferable to simply drop the problematic cycles and generate new ones in a later iteration instead.

The final step consists of augmenting the matching by flipping matched and unmatched edges in each selected cycle and communicating the change along the affected process rows and columns, which is shown in Algorithm 6 (a small sketch of the local mate updates follows the algorithm). The entire path of the communication is sketched in Figure 3.

1: for all processes P(a,x) do
2:     for all columns mate(r_i) of A contained in A_ax do
3:         if no C-request was sent for {r_i, mate(r_i)} in Step C then
4:             Find the C-request for {r_i, mate(r_i)} with maximum gain; let its cycle be rooted at c_j
5:             k := mate(c_j)
6:             l := mate(r_i)
7:             broadcast mate(r_i) := c_j to all processes in the process row of r_i
8:             broadcast mate(c_j) := r_i to all processes in the process column of c_j
9:             broadcast mate(k) := l to all processes in the process row of k
10:            broadcast mate(l) := k to all processes in the process column of l
Algorithm 6 Step D
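Locally, augmenting a winning 4-cycle amounts to updating four mate entries, as the following C++ sketch illustrates; it is a simplification that ignores the data distribution, whereas the parallel algorithm additionally broadcasts the updated mates along the affected process rows and columns.

#include <vector>

// Flip the matched and unmatched edges of the 4-cycle rooted at column c_j
// and containing row r_i: the matched edges {r_i, mate(r_i)} and
// {mate(c_j), c_j} are dropped, and the unmatched edges {r_i, c_j} and
// {mate(c_j), mate(r_i)} become matched.
void flipCycle(int rootCol /* c_j */, int row /* r_i */,
               std::vector<int>& mateOfRow, std::vector<int>& mateOfCol) {
  int oldRowOfRoot = mateOfCol[rootCol];   // mate(c_j)
  int oldColOfRow  = mateOfRow[row];       // mate(r_i)
  mateOfRow[row] = rootCol;                // r_i is now matched to c_j
  mateOfCol[rootCol] = row;
  mateOfRow[oldRowOfRoot] = oldColOfRow;   // the old mates are matched to each other
  mateOfCol[oldColOfRow] = oldRowOfRoot;
}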

5.3 Analysis of the parallel AWPM algorithm

We measure communication by the number of words moved (W) and the number of messages sent (S). The cost of communicating a message of h words is α + hβ, where α is the latency and β is the inverse bandwidth, both defined relative to the cost of a single arithmetic operation. Hence, an algorithm that performs F arithmetic operations, sends S messages, and moves W words takes O(F + αS + βW) time.

Since rows are permuted randomly at the start of the algorithm, matrix nonzeros are assumed to be distributed i.i.d. in this analysis. We also assume a square √p × √p process grid. The runtime of the MCM algorithm is dominated by parallel sparse matrix-sparse vector multiplication, whose complexity has been analyzed in our prior work [4]. Assuming iters to be the number of iterations needed to find an MCM, the perfect matching computation takes O(iters · (m/p + β · n/√p + α · √p)) time.

In Step A of the AWAC algorithm, we exchange a total of O(m) requests among all processes. Under the i.i.d. assumption, each process contains O(m/p) edges and spends O(m/p) time to prepare requests for all other processes. Let R_i and C_i be the sets of rows and columns whose pairwise connections are stored in process P_i. Let mate(R_i) and mate(C_i) be the sets of mates of the vertices in R_i and C_i, respectively. Since a process stores at most n/√p rows and n/√p columns, |mate(R_i)| ≤ n/√p and |mate(C_i)| ≤ n/√p. In order for the ith process to receive a request, there must be an edge between a vertex in mate(C_i) and a vertex in mate(R_i). Under the i.i.d. assumption, the expected number of such edges is O(m/p). Hence a process receives O(m/p) requests in the communication round of Step A.
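Written out explicitly, the counting argument reads as follows (a sketch under the stated i.i.d. assumption, treating each of the n² positions as a nonzero with probability m/n²):

\[
  \mathbb{E}\bigl[\#\text{A-requests received by } P_i\bigr]
    = \sum_{r \in \mathrm{mate}(C_i)} \sum_{c \in \mathrm{mate}(R_i)} \Pr\bigl[\{r, c\} \in E\bigr]
    \le \frac{n}{\sqrt{p}} \cdot \frac{n}{\sqrt{p}} \cdot \frac{m}{n^2}
    = \frac{m}{p}.
\]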

The number of requests in Step B cannot be greater than in Step A, and the numbers of requests in Step C and augmentations in Step D are bounded by n. Hence each of the (at most) k rounds of AWAC performs O(m/p) local work, moves O(m/p) words per process, and uses a constant number of All-to-All exchanges, so under the model above the total cost of AWAC is O(k · (m/p + β · m/p + α · p)), where the α · p term charges each All-to-All with one message per process pair. Note that in AWAC, k is bounded by a small constant that depends on the desired approximation ratio.

                              Cori                       Edison
Core                          Intel KNL                  Intel Ivy Bridge
  Clock (GHz)                 1.4                        2.4
  L1 Cache (KB)               32                         32
  L2 Cache (KB)               1024                       256
  Gflops/core                 44                         19.2
Node Arch.
  Sockets/node                1                          2
  Cores/socket                68                         12
  STREAM BW (GB/s)            102                        104
  Memory (GB)                 96                         64
Overall system
  Nodes                       9,688                      5,586
  Interconnect                Aries (Dragonfly)          Aries (Dragonfly)
Prog. Environment
  Compiler                    Intel C++ Compiler (icpc) ver. 18.0.0
  Optimization                -O3                        -O3
Table 1: Overview of evaluated platforms. The Cori L2 cache is shared between 2 cores in a tile. Memory bandwidth is measured per node using the STREAM copy benchmark.

6 Results

6.1 Experimental setup

We evaluated the performance of our algorithms on the Edison and Cori supercomputers at NERSC. On Cori, we used KNL nodes set to cache mode, where the available 16GB MCDRAM is used as a cache. Table 1 summarizes key features of these systems. We used Cray’s MPI implementation for inter-node communication and OpenMP for intra-node multithreading. Unless otherwise stated, all of our experiments used 4 MPI processes per node. We rely on the CombBLAS library [8] for parallel file I/O, data distribution, and storage. We always used square process grids because rectangular grids are not supported in CombBLAS.

In our experiments, we used a diverse test set of real-world matrices from the University of Florida sparse matrix collection [9] and other sources, as shown in Table 2. Before running matching algorithms, we normalized the matrices such that the maximum absolute entry of each row or column is 1 and all other entries are bounded by 1 in absolute value.
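As an illustration of this kind of normalization, the following C++ sketch scales a small dense matrix first by rows and then by columns so that every entry has magnitude at most 1; it is a simplified stand-in, not the exact equilibration used in our experiments.

#include <algorithm>
#include <cmath>
#include <cstdio>
#include <vector>

// Row-then-column scaling: after the first pass every entry is at most 1 in
// magnitude, and the second pass makes each column attain 1.
void equilibrate(std::vector<std::vector<double>>& A) {
  int n = static_cast<int>(A.size());
  for (int i = 0; i < n; ++i) {                 // scale each row by its max magnitude
    double m = 0.0;
    for (double v : A[i]) m = std::max(m, std::fabs(v));
    if (m > 0.0) for (double& v : A[i]) v /= m;
  }
  for (int j = 0; j < n; ++j) {                 // then scale each column
    double m = 0.0;
    for (int i = 0; i < n; ++i) m = std::max(m, std::fabs(A[i][j]));
    if (m > 0.0) for (int i = 0; i < n; ++i) A[i][j] /= m;
  }
}

int main() {
  std::vector<std::vector<double>> A = {{10.0, 2.0}, {0.5, 4.0}};
  equilibrate(A);
  for (const auto& row : A) std::printf("%.3f %.3f\n", row[0], row[1]);
  return 0;
}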

We compare the performance of our parallel algorithm with the MWPM implementation in MC64 [11] and an optimized sequential implementation of AWPM. Our sequential implementation uses the Karp-Sipser and push-relabel algorithms [16] to find a perfect matching, and then uses the augmenting cycles algorithm to find an AWPM. MC64 provides five options for computing the weight of a matching. In all of our experiments except those in Table 4, we used option 4, which maximizes the sum of the weights of the matched edges. Since a matching algorithm is often used as a subroutine of another application, we exclude file I/O and graph generation time when reporting the running time of matching algorithms.

Matrix             Type    Rows (x10^6)   nnz (x10^6)   Description
memchip            unsym   2.71           14.81         circuit simulation
rajat31            unsym   4.69           20.32         circuit simulation
Freescale2         unsym   3.00           23.04         circuit simulation
boneS10            sym     0.91           28.19         model reduction problem
Serena             sym     1.39           32.96         gas reservoir simulation
audikw_1           sym     0.94           39.30         structural problem
dielFilterV3real   sym     1.10           45.20         higher-order finite element
Flan_1565          sym     1.56           59.49         structural problem
circuit5M          unsym   5.56           59.52         circuit simulation
--------------------------------------------------------------------------------
Li7Nmax6 [1]       sym     0.66           212.23        nuclear config. interaction
HV15R              unsym   2.02           283.07        3D engine fan
perf008cr          sym     7.90           321.06        3D structural mechanics
nlpkkt240          sym     27.99          401.23        sym. indef. KKT matrix
Nm7 [1]            unsym   4.08           437.37        nuclear config. interaction
NaluR3 [22]        unsym   17.60          473.71        low Mach fluid flow
A05                unsym   1.56           1088.71       MHD for plasma fusion
Table 2: Square matrices used to evaluate matching algorithms. All matrices, except five, are from the University of Florida sparse matrix collection [9]. Bigger matrices (more than 200M nonzeros) appear below the horizontal line.

6.2 Approximation ratio attained by parallel AWPM algorithm

Even though our parallel algorithm cannot guarantee the theoretical bound of the Pettie-Sanders algorithm, the obtained matchings are often very close to the optimum in practice, as shown in Table 3. Here, we compute the approximation ratio by dividing the approximate weight by the optimum weight and show it in the last column. For 10 out of the 16 matrices in Table 2, AWPM finds a matching with the optimum weight. The approximate weights of the other matrices are also close to the optimum. This observation extends to many other matrices from the University of Florida collection: in an extended set of more than 100 matrices, the average approximation ratio attained by AWPM is 98.66% (min 86%, max 100%). Hence, our algorithm can successfully substitute for MWPM algorithms in many practical problems without sacrificing the quality of the matchings.

Matrix MC64 AWPM Approx. ratio
memchip 2,707,524 2,707,520  100%
rajat31 4,690,002 4,690,002 100%
Freescale2 2,994,270 2,989,080 99.98%
boneS10 914,898 914,898 100%
Serena 1,391,344 1,391,340 100%
audikw_1 943,624 943,624 100%
dielFilterV3real 1,102,796 1,102,796 100%
Flan_1565 1,564,794 1,564,794 100%
circuit5M 5,557,920 5,557,890 99.99%
Li7Nmax6 663,526 663,521 99.99%
HV15R 1,877,709 1,770,960 94.31%
perf008cr 7,902,327 7,902,327 100%
nlpkkt240 27,762,507 27,710,011 99.81%
Nm7 4,079,730 4,079,730 100%
NaluR3 17,598,889 17,598,889 100%
A05 1,140,660 1,140,660 100%
Table 3: The weight of matchings from AWPM and MC64.
Figure 4: Comparing the parallel AWPM algorithm with sequential AWPM and MC64 on Edison. AWPM time includes the computation of a perfect matching and the newly developed augmenting cycles algorithm. A red line plots the time to gather a distributed matrix on a single node and then run MC64. For each matrix, the best speedup attained by the parallel AWPM algorithm relative to MC64+gather time is shown at the top of the corresponding subplot. Four MPI processes per node and 6 threads per process were used. Matrices are arranged in ascending order of their number of nonzeros, first from left to right and then from top to bottom.
Figure 5: Comparing parallel AWPM with two sequential algorithms for four representative matrices on Cori-KNL. The best speedup of the parallel AWPM algorithm relative to MC64+gather time is shown at the top of the corresponding subplot. See the caption of Figure 4 for further detail.
Figure 6: Strong scaling of the approximate weight augmenting cycles algorithm for the largest seven matrices from Table 2. The scalability plots start from one node. Four MPI processes are used in each node of Edison and Cori-KNL.
Figure 7: Breakdown of parallel AWPM runtime on Cori-KNL.

6.3 Performance of the parallel AWPM algorithm

Figure 4 compares the runtime of the parallel AWPM algorithm with MC64 and sequential AWPM for all 16 matrices from Table 2. Here MC64+gather lines include the time to gather the whole matrix on a single node, representing the total time needed to compute MWPM sequentially in a distributed memory setting. Experiments in Figure 4 were run on Edison.

At first, we discuss the performance of the matching algorithms on the bigger matrices, where a distributed-memory algorithm is needed most urgently. MC64 failed on A05, the largest matrix in our suite, because of insufficient memory on a single node of Edison. For matrices of this scale, our distributed-memory algorithm is the only viable option for enabling extreme-scale scientific applications that rely on MWPM. For other large matrices, parallel AWPM is significantly faster than its sequential competitors, for some problems even on a single node. For example, when computing a matching on NaluR3, parallel AWPM is already faster than both MC64 and sequential AWPM on a single node of Edison, and the gap widens further on 256 nodes (6,144 cores). Note that AWPM finds the optimum solution on NaluR3; hence, the drastic reduction in runtime comes for free, without sacrificing any matching quality. For the other large matrices (with more than 200M nonzeros), parallel AWPM also runs substantially faster than MC64+gather at high concurrency. This observation holds for most matrices in the second and third rows of Figure 4 as well.

On smaller matrices (e.g., those in the first row of Figure 4), the performance gain from the parallel algorithm is not as dramatic as with the bigger matrices. This is expected, as it is hard for a parallel algorithm to reduce a subsecond sequential runtime. However, for all matrices except Freescale2, parallel AWPM runs faster than MC64+gather at high concurrency. Hence our algorithm is competitive on smaller matrices and runs significantly faster on bigger matrices.

While most instances show excellent scalability, there are two outliers, Freescale2 and HV15R, where parallel AWPM does not run faster than sequential AWPM. For both matrices, parallel AWPM spends more than 80% of its runtime on finding an initial perfect matching using the distributed-memory MCM algorithm [4]. Obtaining perfect matchings on these two matrices requires searching for long augmenting paths that go through many processors, which is hard to parallelize. Nevertheless, even for these matrices, parallel AWPM remains competitive with or significantly faster than MC64+gather, which is the main competitor of our algorithm in practice.

The matching algorithms discussed above show similar performance trends on Cori-KNL, as shown in Figure 5. For example, on the perf008cr matrix, parallel AWPM runs faster than MC64+gather on both 17,408 cores of Cori-KNL and 6,144 cores of Edison. On Cori-KNL, MC64 is able to compute a matching for the A05 matrix in 135 seconds, whereas the parallel AWPM algorithm takes just 0.44 seconds on 256 nodes of Cori-KNL, an improvement of roughly 300x.

6.4 Scalability of parallel AWPM algorithm

Unless otherwise noted, we report the speedup of the parallel AWPM algorithm relative to its runtime on a single node. AWPM still runs in parallel on a single node using 4 MPI processes and employs multithreading within a process.

At first, we discuss the scalability of the complete AWPM algorithm (including the time to compute the initial perfect matching), as shown in Figures 4 and 5. On Edison, AWPM achieves good average speedup over the 13 matrices that were able to run on a single node, and the performance improvement is more significant on the bigger matrices; for example, the best speedup on 256 nodes of Edison is attained for NaluR3. On 256 nodes of Cori-KNL, AWPM again scales well on average over the 14 matrices that were able to run on a single node, with the best speedups attained on Li7Nmax6 and NaluR3. For the A05 matrix, going from 16 nodes to 256 nodes substantially reduces the runtime of parallel AWPM on both Edison and Cori-KNL.

Since the primary contribution of this paper is the approximate weight augmenting cycles algorithm, we show its scalability separately in Figure 6. We only show results for matrices with more than 200M nonzeros. For all matrices in Figure 6, the parallel augmenting cycles algorithm scales well on both Edison and Cori-KNL. Among these matrices, the best speedup was observed on 256 nodes of Cori-KNL for the Li7Nmax6 matrix, highlighting the excellent scalability of the newly developed algorithm.

6.5 Insights on the performance of parallel AWPM algorithm

In order to study the behavior of parallel AWPM, we break down the execution time of two instances in Figure 7. For both instances, the newly developed augmenting cycles algorithm scales very well. This trend is also observed for most of the matrices, as was explained before. However, the maximum cardinality matching algorithm that is used to obtain an initial perfect matching starts to become the bottleneck on high concurrency. As studied extensively in a prior work [4], the performance of the parallel MCM algorithm suffers if it needs to search long augmenting paths that span many processors.

6.6 Performance of AWPM as a pre-pivoting tool for a distributed sparse direct solver

Matching weight Relative error
Matrix MC64 AWPM MC64 AWPM
memchip
rajat31
Freescale2 -11982.80 -15122.14
boneS10
Serena -6.33 -6.33
audikw_1 -81.90 -81.90
dielFilterV3real -28.97 -28.97
Flan_1565
circuit5M -1298.52 -2491.49
perf008cr
A05 -780638 -780638
Table 4: Weights of matchings from MC64 and AWPM when the sum of the logarithms of the weights is maximized, and the relative error of the SuperLU_DIST solution. For the other matrices in Table 2, we could not get a solution from SuperLU_DIST.

As stated in Section 1, our motivation for AWPM comes from the need for parallel pre-pivoting of large entries to the main diagonal to improve the stability of sparse direct solvers. We now evaluate how AWPM performs relative to MC64 in this regard, using the SuperLU_DIST direct solver. For each matrix in Table 4, we set the true solution vector x and then generate the right-hand side vector by computing b = Ax. We use the LAPACK-style simple equilibration scheme to compute row and column scalings D_r and D_c that make each row and each column of the scaled matrix D_r A D_c have equal norm, so that the maximum entry of each row or column is 1 and the other entries are bounded by 1 (hence, the weight of many matched edges is often 1, and the sum of the logarithms is 0). We then apply the pre-pivoting strategies of MC64 or AWPM to compute a row permutation. Here, the maximization criterion is the sum of the logarithms of the weights, i.e. the product of the weights. The sparsity reordering is then obtained with METIS using the permuted matrix (graph). Finally, the LU factorization of the scaled and permuted matrix is computed, and the solution is obtained from the transformed linear system. The relative solution error is reported in Table 4. For most matrices, the relative error obtained with AWPM is remarkably close to that of MC64, with the exception of circuit5M. This can be explained by the difference in the weights found by MC64 and AWPM; for most other matrices, the AWPM weights are identical to the MC64 weights. Recall from Table 3 that, when the sum of weights is maximized, AWPM achieves 99.99% of the optimum weight for this matrix. When using the permutation obtained from the “AWPM-sum” metric for circuit5M, the computed solution is as accurate as that of MC64. In the future, we plan to investigate the effect of the AWPM-sum and AWPM-product metrics on the solution accuracy of SuperLU_DIST.

7 Concluding Remarks

We presented a new distributed-memory parallel algorithm for the approximate-weight perfect matching problem on bipartite graphs. That is, our algorithm returns a perfect matching (if it exists) that has approximately maximum weight. Our motivation comes from distributed-memory sparse direct solvers where an initial permutation of the sparse matrix that places entries with large absolute values on the diagonal is often performed before the factorization, in order to avoid expensive pivoting during runtime. This process is called static pivoting and the permutation is ideally found using a maximum-weight perfect matching. However, previous attempts in parallelizing the exact algorithms met with limited success. Since the perfect matching part is a strict requirement of static pivoting, our algorithm only relaxes the weight requirement.

There are two key reasons for the performance of our algorithm. For the initial phase where we find a perfect matching, we use existing optimized distributed-memory cardinality matching algorithms with minor modifications to maximize the weight when tie-breaking. For the iterative phase that improves on the weights while maintaining maximum cardinality, we restrict the algorithm to 4-cycles, and thus avoid traversing long augmenting paths, following the Pettie-Sanders approximation algorithm. Hence, the crux of our paper is an efficient method for finding augmenting 4-cycles on a bipartite graph in distributed memory.

In terms of LP-duality, unlike the Hungarian algorithm [17], ours is a primal method because it maintains a feasible solution. It does not compute a solution to the dual problem, i.e. a feasible potential. Since the dual values can be used to scale matrices in linear solvers [25], extending the algorithm to provide such scaling factors is a goal for future work.

We have tested the algorithm on the largest available instances in the Florida sparse matrix collection. While in some cases finding the perfect matching in parallel is expensive, the weight approximation always scales well. Our experiments show that the weight of the matchings we obtain is high enough to replace MC64 for pre-pivoting in linear solvers such as SuperLU_DIST, thereby making such solvers fully scalable.

References

  • [1] Hasan Metin Aktulga, Aydin Buluç, Samuel Williams, and Chao Yang. Optimizing sparse matrix-multiple vectors multiplication for nuclear configuration interaction calculations. In Proceedings of the IPDPS, pages 1213–1222, 2014.
  • [2] H. Alt, N. Blum, K. Mehlhorn, and M. Paul. Computing a maximum cardinality matching in a bipartite graph in time O(n^1.5 (m/log n)^0.5). Information Processing Letters, 37(4):237–240, 1991.
  • [3] Ariful Azad and Aydin Buluç. A matrix-algebraic formulation of distributed-memory maximal cardinality matching algorithms in bipartite graphs. Parallel Computing, 58:117–130, 2016.
  • [4] Ariful Azad and Aydın Buluç. Distributed-memory algorithms for maximum cardinality matching in bipartite graphs. In Proceedings of the IPDPS. IEEE, 2016.
  • [5] Ariful Azad, Aydın Buluç, and Alex Pothen. Computing maximum cardinality matchings in parallel on bipartite graphs via tree-grafting. IEEE Transactions on Parallel and Distributed Systems (TPDS), 28(1):44–59, 2017.
  • [6] Ariful Azad, Mahantesh Halappanavar, Sivasankaran Rajamanickam, Erik Boman, Arif Khan, and Alex Pothen. Multithreaded algorithms for maximum matching in bipartite graphs. In IPDPS, Shanghai, China, 2012. IEEE.
  • [7] Dimitri P. Bertsekas and David A. Castañón. A generic auction algorithm for the minimum cost network flow problem. Comp. Opt. and Appl., 2(3):229–259, 1993.
  • [8] Aydın Buluç and John R Gilbert. The Combinatorial BLAS: Design, implementation, and applications. International Journal of High-Performance Computing Applications (IJHPCA), 25(4), 2011.
  • [9] Timothy A Davis and Yifan Hu. The University of Florida sparse matrix collection. ACM Transactions on Mathematical Software (TOMS), 38(1):1, 2011.
  • [10] I. S. Duff, K. Kaya, and B. Uçar. Design, implementation, and analysis of maximum transversal algorithms. ACM Transaction on Mathematical Software, 38(2):13:1– 13:31, 2011.
  • [11] Iain S. Duff and Jacko Koster. On algorithms for permuting large entries to the diagonal of a sparse matrix. SIAM Journal on Matrix Analysis and Applications, 22(4):973–996, 2001.
  • [12] A. V. Goldberg and R. E. Tarjan. A new approach to the maximum flow problem. Journal of the Association for Computing Machinery, 35:921–940, 1988.
  • [13] Mahantesh Halappanavar, John Feo, Oreste Villa, Antonino Tumeo, and Alex Pothen. Approximate weighted matching on emerging manycore and multithreaded architectures. The International Journal of High Performance Computing Applications, 26(4):413–430, 2012.
  • [14] Jonathan Hogg and Jennifer Scott. On the use of suboptimal matchings for scaling and ordering sparse symmetric matrices. Numerical Linear Algebra with Applications, 22(4):648–663, 2015.
  • [15] John E. Hopcroft and Richard M. Karp. An algorithm for maximum matchings in bipartite graphs. SIAM Journal on Computing, 2(4):225–231, 1973.
  • [16] K. Kaya, J. Langguth, F. Manne, and B. Uçar. Push-relabel based algorithms for maximum transversal problem. Computers & Operations Research, 40(5):1266–1275, 2013.
  • [17] H. W. Kuhn. The hungarian method for the assignment problem. Naval Research Logistics Quarterly, 2(1-2):83–97, 1955.
  • [18] J. Langguth, F. Manne, and P. Sanders. Heuristic initialization for bipartite matching problems. ACM Journal of Experimental Algorithmics, 15:1.1–1.22, February, 2010.
  • [19] Johannes Langguth, Ariful Azad, Mahantesh Halappanavar, and Fredrik Manne. On parallel push-relabel based algorithms for bipartite maximum matching. Parallel Computing, 40(7):289–308, 2014.
  • [20] Johannes Langguth, Md. Mostofa Ali Patwary, and Fredrik Manne. Parallel algorithms for bipartite matching problems on distributed memory computers. Parallel Computing, 37(12):820–845, 2011.
  • [21] Xiaoye S Li and James W Demmel. SuperLU_DIST: A scalable distributed-memory sparse direct solver for unsymmetric linear systems. ACM Transactions on Mathematical Software (TOMS), 29(2):110–140, 2003.
  • [22] Paul Lin, Matthew Bettencourt, Stefan Domino, Travis Fisher, Mark Hoemmen, Jonathan Hu, Eric Phipps, Andrey Prokopenko, Sivasankaran Rajamanickam, Christopher Siefert, et al. Towards extreme-scale simulations for low mach fluids with second-generation Trilinos. Parallel Processing Letters, 24(04):1442005, 2014.
  • [23] Fredrik Manne and Rob H. Bisseling. A parallel approximation algorithm for the weighted maximum matching problem. In International Conference on Parallel Processing and Applied Mathematics, pages 708–717, 2007.
  • [24] Fredrik Manne and Mahantesh Halappanavar. New effective multithreaded matching algorithms. In Proceedings of the IPDPS, pages 519–528, 2014.
  • [25] Markus Olschowka and Arnold Neumaier. A new pivoting strategy for gaussian elimination. Linear Algebra and its Applications, 240(Supplement C):131 – 151, 1996.
  • [26] Seth Pettie and Peter Sanders. A simpler linear time 2/3-epsilon approximation for maximum weight matching. Inf. Process. Lett., 91(6):271–276, 2004.
  • [27] A. Pothen and C.-J. Fan. Computing the block triangular form of a sparse matrix. ACM Trans. Math. Softw., 16:303–324, 1990.
  • [28] Edward Jason Riedy. Making static pivoting scalable and dependable. PhD thesis, University of California, Berkeley, 2010.
  • [29] P. Sao, X.S. Li, and R. Vuduc. A 3D LU factorization algorithm for sparse matrices. In Proceedings of the IPDPS (to appear). IEEE, 2018.
  • [30] M. Sathe, O. Schenk, and H. Burkhart. An auction-based weighted matching implementation on massively parallel architectures. Parallel Computing, 38(12):595–614, 2012.
  • [31] Alexander Schrijver. Combinatorial optimization: Polyhedra and efficiency, volume 24. Springer Science & Business Media, 2003.