A matching in a graph is a collection of edges that do not share any vertices. The problem of finding a matching of maximum weight in an edge-weighted graph, henceforth referred to as the maximum weight matching (MWM) problem, is one of the most fundamental combinatorial optimization problems, with a wide range of applications in data mining and bioinformatics. For instance, maximum weight matchings can improve the quality of data clustering  or partitioning , as well as the discovery of subgraphs in networks in bioinformatics [50, 12]. Other applications are in trading markets and computational advertising [58, 55, 15], kidney exchange [22, 13], online labor markets , and semi-supervised learning (see  for other similar examples). Further applications arise in numerical linear algebra, e.g., in sparse linear solvers [26, 27], decomposition of sparse matrices , and computing sparse bases for underdetermined matrices .
The study of MWM dates back to the introduction of the complexity class P as the set of “tractable” problems by Edmonds [28, 29], who designed a polynomial-time algorithm for this problem on general graphs. Since then, there have been numerous attempts at developing faster algorithms for MWM (see, e.g., [31, 34, 32, 33, 35, 17, 25]), culminating in the -time algorithm of Gabow and Tarjan  (footnote 2: throughout the paper, we use .). For approximation algorithms, the greedy algorithm that repeatedly picks the heaviest edge possible in the matching achieves a 2-approximation, and after a series of works [62, 63, 59, 23], an -time algorithm for -approximation was developed by Duan and Pettie .
Nowadays, many applications that involve MWM require processing massive graphs that are typically being stored and processed in a distributed fashion. Classical algorithmic approaches to MWM are no longer viable options to cope with challenges that stem from processing massive graphs and one should now instead focus on algorithms that can be implemented efficiently in distributed settings even at the cost of a (slight) reduction in the quality of the solution.
In this paper, we design a simple and efficient greedy distributed algorithm for the maximum weight matching problem with an approximation ratio that nearly matches that of the sequential greedy algorithm for MWM. Our algorithm can also be easily implemented in two rounds of parallel computation in MapReduce-style computation frameworks, which is known to be the minimum number of rounds necessary for solving this problem.
1.1 Background and Related Work
Maximum weight and maximum cardinality matchings have been studied extensively in different models of computation for processing massive graphs such as streaming and distributed settings; see, e.g., [54, 30, 39, 48, 3, 1, 45, 53, 16, 7, 6, 57, 2, 51, 43, 5, 18, 42, 4] and references therein.
Closely related to our work, Lattanzi et al.  designed MapReduce algorithms with - and -approximation guarantees, respectively, for unweighted and weighted matchings in rounds on machines with memory . These results were subsequently improved to -approximation for both problems in rounds by Ahn and Guha  using sophisticated primal-dual algorithms and the multiplicative-weight-update method (for unweighted bipartite matching, a simpler algorithm with -approximation in rounds using space was recently proposed in ). Very recently, Harvey et al.  designed a -approximation algorithm for weighted matchings in rounds based on the local-ratio theorem of  for MWM. Furthermore, Assadi and Khanna  designed a MapReduce algorithm with memory (using the so-called randomized composable coreset method, which we also exploit in this paper) that achieves an -approximation to both problems in only two rounds of computation, which is the optimal number of rounds by a result of . This result was very recently improved by Assadi et al.  to (almost) -approximation for unweighted matchings. Recent papers [18, 4, 38] also considered these problems with smaller per-machine memory and achieved - and -approximation for unweighted and weighted matchings in rounds and memory per machine. The approximation ratio for weighted matchings in these results was very recently improved to by .
Our work is also closely aligned with the trend on “parallelizing” sequential greedy algorithms in distributed settings (e.g., [49, 56, 19, 20, 42]). As noted elegantly by Kumar et al. : “Greedy algorithms are practitioners’ best friends—they are intuitive, simple to implement, and often lead to very good solutions. However, implementing greedy algorithms in a distributed setting is challenging since the greedy choice is inherently sequential, and it is not clear how to take advantage of the extra processing power.” As such there have been extensive efforts in recent years to carry over the greedy algorithms in the sequential setting to distributed models as well. These results are typically of two types: they either use a relatively large number of rounds to “faithfully” simulate the greedy algorithm, i.e., to obtain approximation guarantees that (almost) match that of the greedy algorithm [49, 20], or use a very small number of rounds, say one or two, for “weak” simulation, resulting in approximation guarantees that are within some constant factor of the corresponding greedy algorithm [56, 19]. Table 1 provides a succinct summary of previous work.
1.2 Our Contribution
In this paper, we “faithfully” parallelize the sequential greedy algorithm for MWM in two rounds of parallel computation. In particular, we present an algorithm in the MapReduce framework (defined formally in Section 2) that for any constant , outputs a -approximation to maximum weight matching in expectation using machines each with memory and in only two rounds of computation; here, and denote the number of edges and vertices in the graph, respectively. See Theorem 2 for the formal statement of this result.
Our distributed algorithm works as follows: send each edge of the graph to machines randomly, run the greedy algorithm (the one that repeatedly picks the heaviest available edge in the matching) on each part separately, combine the outputs of the greedy algorithms on a single machine, and find a near-optimal weighted matching among these edges using any standard offline algorithm, say the algorithm of  (see also Section 3.1 for details on when one can simply run the greedy algorithm at the end). We prove that this simple algorithm leads to an almost 2-approximate matching of the original graph. This technique of partitioning the input randomly and computing a subgraph of each piece (here, a matching output by the greedy algorithm) is called the randomized composable coreset technique and has been used previously in the context of unweighted matchings [5, 4] and constrained submodular maximization [56, 19] (see Section 2.1). Finally, the number of rounds of our algorithm is optimal by a result of .
Comparison with prior work.
We conclude this section by making the following two comparisons:
The number of rounds used by our algorithm is an absolute constant, two, independent of the approximation ratio of the algorithm. This significantly improves upon the previously best MapReduce algorithms of [2, 42], which require a large unspecified constant number of rounds to achieve a similar guarantee on the approximation ratio. Other algorithms for weighted matching with a similar guarantee on the number of rounds as ours are that of , which achieves -approximation (improvable to (almost) -approximation using the Crouch-Stubbs technique ) in six rounds when using the same per-machine memory as ours (and at least three rounds when allowing even more memory), and the (almost) -approximation of , which is based on a considerably more complicated algorithm, as it first finds an approximation to unweighted matching with better than -approximation, which is well known to be a “hard task” for matchings (footnote 3: for instance, getting efficient algorithms with better than -approximation in both the streaming and dynamic graph models are longstanding open problems, and this was very recently proven to be impossible for online edge-arrival graphs ). We emphasize that the main bottleneck in MapReduce computation is the transition between different rounds (see, e.g., ), and hence minimizing the number of rounds is the primary goal in this setting.
Previous work on parallelizing greedy algorithms in MapReduce framework suffered from one of the following two drawbacks: either a suboptimal approximation guarantee compared to the greedy algorithm even by allowing unbounded computation time on each machine [19, 56], or a relatively large number of rounds to match the performance of the greedy algorithm exactly or to within a factor of [20, 49, 42]. Obtaining MapReduce algorithms that can (almost) match the performance of the greedy algorithm without blowing up the number of rounds (in the context of submodular maximization) has been posed as an open question very recently . To our knowledge, ours is the first parallel implementation of the greedy algorithm in a minimal number of rounds with (almost) no blow-up in the approximation ratio. It is a fascinating open question if our improvement for MWM can be extended to other greedy algorithms, in particular, for constrained submodular maximization.
In addition to the aforementioned theoretical improvements over previous works, our algorithm has the benefit of being extremely simple (with nearly all the details “pushed” to the analysis), making it easily implementable in the MapReduce model (the only other MapReduce algorithm for matching that we know to have been implemented previously is , which is limited to bipartite graphs in a crucial way). We believe this additional feature of our algorithm is an important contribution of this paper. We discuss this further in Section 4, where we present our experimental results and give an empirical comparison of our algorithm with prior algorithms.
Throughout, . For a graph , denotes the weight of a maximum weight matching in .
Let be a graph and let denote any permutation of . denotes the standard greedy algorithm, which iterates over the edges according to and adds to the matching iff both and are unmatched. It is a standard fact that when is sorted in non-increasing order of weights, outputs a 2-approximation to .
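As an illustration, the greedy procedure described above can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the representation of edges as `(u, v, weight)` tuples and the names `greedy_matching`, `greedy_mwm`, and `matching_weight` are hypothetical, and ties are broken by endpoint names only to obtain a fixed order.

```python
def greedy_matching(ordered_edges):
    """Scan edges in the given order; keep an edge iff both endpoints are unmatched."""
    matched = set()
    matching = []
    for u, v, w in ordered_edges:
        if u != v and u not in matched and v not in matched:
            matching.append((u, v, w))
            matched.update((u, v))
    return matching


def greedy_mwm(edges):
    """Greedy for weighted matching: process edges in non-increasing weight order."""
    # Assumption: endpoints are comparable (e.g. strings) so ties have a fixed order.
    order = sorted(edges, key=lambda e: (-e[2], e[0], e[1]))
    return greedy_matching(order)


def matching_weight(matching):
    return sum(w for _, _, w in matching)


# A path a-b-c-d whose middle edge is the heaviest.
path = [("a", "b", 2.0), ("b", "c", 3.0), ("c", "d", 2.0)]
result = greedy_mwm(path)
```

On this three-edge path the greedy output is the single middle edge of weight 3, while the optimum (the two outer edges) has weight 4, which is consistent with the factor-two guarantee.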
We adopt the MapReduce model as formalized by Karloff et al. ; see also [40, 9]. Let with and be the input graph. In this model, there are machines, each with a memory of such that , i.e., at most a constant factor larger than the input size, and both , i.e., sublinear in the input size. The motivation behind these constraints is that the number of machines, and local memory of each machine should be much smaller than the input size to the problem since these frameworks are used to process massive datasets. Computation in this model proceeds in synchronous rounds: in each round, each machine performs some local computation and at the end of the round machines exchange messages to guide the computation for the next round. All messages received by each machine in one round have to fit into the memory of the machine.
2.1 Randomized Composable Coresets
We briefly review the notion of randomized composable coresets originally introduced by  in the context of submodular maximization, and further refined in  for graph problems. Our definition slightly deviates from previous works as we will remark below.
Let be an edge-set of a weighted graph . Let be a parameter. A collection of edges is a random -clustering of with expected multiplicity iff each edge in is sent to different sets chosen uniformly at random, where is chosen independently for each edge from the binomial distribution with trials and expected value . A random clustering of naturally defines clustering the graph into subgraphs where for all ; as a result, we use random clustering for both the edge-set and the input graph interchangeably.
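The sampling step in this definition can be sketched as follows. This is only an illustrative sketch: the parameter names `num_parts` (the number of sets) and `multiplicity` (the expected number of copies of each edge) are assumptions introduced here, and the binomial draw is taken to have `num_parts` trials with success probability `multiplicity / num_parts`, so that its mean equals the expected multiplicity.

```python
import random


def random_clustering(edges, num_parts, multiplicity, rng):
    """Send each edge to a binomially distributed number of distinct parts."""
    p = multiplicity / num_parts  # success probability of each binomial trial
    parts = [[] for _ in range(num_parts)]
    for e in edges:
        # Number of copies: Binomial(num_parts, p), drawn independently per edge.
        copies = sum(rng.random() < p for _ in range(num_parts))
        # Choose that many distinct parts uniformly at random.
        for i in rng.sample(range(num_parts), copies):
            parts[i].append(e)
    return parts


edges = [("a", "b", 1.0), ("b", "c", 2.0), ("c", "d", 3.0)]
parts = random_clustering(edges, num_parts=4, multiplicity=2, rng=random.Random(0))
```

Drawing the count first and then a uniformly random set of that size gives the same distribution as including the edge in every part independently with probability `p`, which is the product structure used later in the analysis.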
Consider an algorithm ALG that given a graph outputs a subgraph with at most edges. Let be integers and denote a random -clustering with expected multiplicity of a graph . We say that ALG outputs an -approximate -randomized composable coreset of size for the weighted matching problem iff
where denotes the weight of a maximum weight matching in the given graph. Here, the expectation is taken over the random choice of the random clustering.
We remark that our definition is somewhat different from [56, 5] in the following sense: previous works considered the case where each edge is sent to exactly different subgraphs (only was considered in ), while we send each edge to subgraphs in expectation. This way of partitioning has the simple yet helpful property that it makes the distribution of graphs a product distribution, i.e., each graph is chosen independently even conditioned on all other graphs in the random -clustering. At the same time, the size of each subgraph and the total number of edges across all subgraphs are still, respectively, and with high probability.
Randomized Coresets in MapReduce Framework.
Suppose is the input and let . We use a randomized coreset to obtain a MapReduce algorithm:
Random clustering: Create a random -clustering of expected multiplicity and allocate each graph to the machine .
Coreset: Each machine creates a randomized composable coreset .
Post-processing: Collect the union of coresets to create on one machine and return a -approximation to MWM on using any offline algorithm.
It is easy to verify that the algorithm requires machines with memory and only two rounds of computation. Moreover, by Definition 2.1, the output of this algorithm is an -approximation to maximum weight matching of .
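The three steps above can be sketched as a single-process simulation. This is a hedged sketch rather than a real MapReduce job: the function `two_round_matching` and its parameters are hypothetical names, the coreset and offline solvers are passed in as plain functions, and the greedy routine is used as a stand-in for both in the example. The clustering is implemented by including each edge in each part independently with probability `multiplicity / num_parts`.

```python
import random


def greedy(edges):
    """Greedy matching over (u, v, weight) tuples in non-increasing weight order."""
    matched, out = set(), []
    for u, v, w in sorted(edges, key=lambda e: (-e[2], e[0], e[1])):
        if u not in matched and v not in matched:
            out.append((u, v, w))
            matched.update((u, v))
    return out


def two_round_matching(edges, num_parts, multiplicity, coreset, offline, rng):
    p = min(1.0, multiplicity / num_parts)
    # Random clustering: one (simulated) machine per part.
    parts = [[e for e in edges if rng.random() < p] for _ in range(num_parts)]
    # Round 1: every machine computes a coreset of its own part.
    coresets = [coreset(part) for part in parts]
    # Round 2: one machine collects the union and runs the offline algorithm.
    union = sorted({e for c in coresets for e in c})
    return offline(union)


edges = [("a", "b", 2.0), ("b", "c", 3.0), ("c", "d", 2.0), ("d", "e", 1.0)]
answer = two_round_matching(edges, 4, 2, greedy, greedy, random.Random(7))
```

Passing the solvers as arguments keeps the reduction separate from the choice of coreset, mirroring how the text states the reduction for an arbitrary randomized composable coreset.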
3 A Randomized Coreset for Maximum Weight Matching
We present a simple randomized composable coreset for MWM in this section and then use it to design an efficient MapReduce algorithm for this problem. In Appendix A, we prove the optimality of the size of our coreset using an adaptation of the argument in .
For , there exists a -randomized composable coreset of size with expected multiplicity for the maximum weight matching problem.
Our coreset in Theorem 1 is simply the greedy algorithm for MWM (with consistent tie-breaking). Let be a graph and be a random -clustering of with expected multiplicity . We propose the GreedyCoreset for approximating MWM on : on each subgraph , simply return as the coreset, where sorts the edges in in non-increasing order of their weights (breaking the ties consistently across all ). In the following lemma, we analyze the performance of this coreset.
Suppose is a random -clustering of with expected multiplicity and . Define the graph with ; then,
Let be a permutation of all edges in in non-increasing order of their weights (consistent with orderings for ). Notice that for each , as well. Hence, in the following, we use the permutation instead of each . Throughout the proof, we fix an arbitrary maximum weight matching in and denote by the weight of . For any edge , we use to refer to the set of edges in that appear before . We slightly abuse the notation and use to mean that we run and stop exactly before processing the edge , i.e., we only consider the edges in .
Definition (Free/Blocked Edges).
We say that an edge is free for machine iff no end point of is matched by ; otherwise we call blocked. We use to denote the set of free edges in and to denote the blocked edges.
We emphasize that in Definition 3, an edge can be free or blocked on some machine , without necessarily even appearing in . In other words, this definition is independent of whether belongs to or not. However, notice that if an edge is free on machine and it also appears in , then would definitely belong to the matching (i.e., the coreset on machine ). On the other hand, if an edge is blocked in machine , then necessarily some edge exists in such that is incident on and . We refer to as the certificate of in machine .
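To make the free/blocked distinction concrete, the following sketch checks whether an edge is free on a machine by running greedy on that machine's edges and stopping just before the queried edge in the global order. The names `order_key` and `certificate` are assumptions made for this example, and the tie-breaking rule (by endpoint names) is only one possible consistent choice.

```python
def order_key(edge):
    """Global order: non-increasing weight, ties broken by endpoints."""
    u, v, w = edge
    return (-w, u, v)


def certificate(edge, part_edges):
    """Return a matched edge blocking `edge` on this machine, or None if it is free.

    `edge` itself does not need to belong to `part_edges`.
    """
    matched_by = {}
    for f in sorted(part_edges, key=order_key):
        if order_key(f) >= order_key(edge):
            break  # stop exactly before the position of `edge` in the order
        u, v, _ = f
        if u not in matched_by and v not in matched_by:
            matched_by[u] = matched_by[v] = f
    return matched_by.get(edge[0]) or matched_by.get(edge[1])


part = [("b", "c", 3.0), ("d", "e", 1.0)]
blocked = certificate(("a", "b", 2.0), part)  # blocked by the heavier edge (b, c)
free = certificate(("c", "d", 5.0), part)     # nothing heavier precedes it
```

Note that the first query concerns an edge that is absent from the machine's part, which matches the remark that the definition does not depend on membership.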
The idea behind the proof of Lemma 3.1 is as follows. Recall that the distribution of each graph in the random clustering is the same, and is independent of other graphs. Hence, we can focus on each machine , say machine , separately. Consider blocked edges in machine : for any such edge, we have already picked another edge with at least the same weight in the matching of machine (by definition of an edge being blocked). We can hence use a simple charging argument here to argue that the matching of machine already has enough edges to “compensate” for blocked edges of that were not picked by machine .
The main part of the argument is however to show that we can find enough edges in chosen by other machines that can be added to to also compensate for free edges in machine . The idea here is that since the distribution of the input to all machines is the same and is independent across machines, if an edge is free in machine , it is “most likely” free in many other machines as well, in particular, in a machine which also contains this edge. By definition, this edge would then be chosen in the matching . We then use another careful charging scheme to argue that we can indeed “augment” the matching by free edges in that appear in to obtain an almost-two approximation. The main difficulty here is that even though edges in were free in machine , they may still be incident on edges in with equal or smaller weight (as being free only implies that these edges were not incident on edges with higher weight) and hence they cannot be readily added to ; this is the reason we need to find “short augmenting paths” in , which may require switching some edges out of as well.
We now start with the formal proof. Throughout, define , i.e., the set of free edges in machine that are present in . We refer to these edges as available free edges (we prove later that essentially any free edge is also available with a large probability). To perform the charging, we need to partition the edges of and as follows (see Figure 1 for an illustration):
Let be an edge with maximum weight in and be its certificate in . Additionally, let be the other edge incident on in (set if no such edge exists).
Type edges: If belongs to or is , then add to and to . Remove both and from . We refer to as a set of type edges.
Type edges: If and is not incident on any certificate edge other than , then add to , to , and to . Remove from and from . We refer to as a set of type edges.
Type edges: If and is incident on another certificate edge which is a certificate for some edge , then add to , to , and to . Remove from and from . We refer to as a set of type edges.
Continue the process from the first line until no edge remains in . Add the edges remaining in after this step to the set .
It is immediate to verify that , and , and all these sets are pairwise disjoint. The following claim states properties of this partitioning and the charging scheme.
In the partitioning scheme, for any set of edges:
type : .
type : .
type : .
Since is chosen before , . Also, since is a certificate of , , a simple calculation finalizes the proof.
Since is free, it means is chosen after visiting ; hence . Since is a certificate for , .
Follows since and are certificates for and , respectively, and hence and .
This finalizes the proof.
We now lower bound the weight of the maximum weight matching in the graph .
Define , and hence . To prove the lemma, it suffices to show that there exists a matching in with . Notice that edges in () are already in and hence can be used in . Moreover, for any edge , a certificate of also belongs to the matching . We now show how to design the matching using only the certificate edges and edges in such that . We use the partitioning scheme in Claim 3.2 to construct as follows:
For a set of type edges : add to the matching .
For a set of type edges : add to .
For a set of type edges : if , add to , otherwise add to .
Add any edge in to .
It is immediate to verify that all edges added to indeed belong to and they also form a matching. To lower bound the weight of , notice that by Claim 3.2, for any types of edges in , at least half the weights of edges in also appear in . This immediately implies that , finalizing the proof.
We now use the randomness in the clustering to argue that nearly all free edges on machine are also available.
We first have,
This is because random clustering induces a product distribution on inputs across machines and hence the fact that is free in is independent of whether is picked in some other matching for . We can now calculate the probability that an edge belongs to a matching for a fixed .
|(the event “” only depends on the edges in )|
since each edge appears in with probability . Now notice that the marginal distribution of the graph and under random clustering is the same; as a result the probability that an edge is good in is equal to this probability for the graph as well. Using this, plus the fact that the event that belongs to is independent of all other graphs for , we have,
|(by Eq (1))|
Define as the set of all edges such that . By the above equation, for any edge , we have that,
which finalizes the proof.
3.1 MapReduce Algorithms for Maximum Weight Matching
We now present our MapReduce algorithm based on the coreset in Theorem 1.
There is a randomized MapReduce algorithm that for any , outputs a -approximation to maximum weight matching in expectation using machines each with memory and in only two rounds of parallel computation. The local computation on each machine requires time. Here, and denote the number of edges and vertices in the input graph, respectively.
We simply use the randomized coreset in Theorem 1 which has expected multiplicity with the reduction in Section 2.1. The bound on the number of rounds, memory per-machine, and the total number of machines follows immediately from this and the discussion in Section 2.1. To obtain the bounds on the running time of the algorithm, we simply need to run Greedy to compute each coreset which takes time as the total number of edges in the input of each machine is bounded by . To compute the final answer, we run the algorithm of  which takes time on an -vertex graph with edges, where here again.
The algorithm in Theorem 2 implies a theoretically efficient (almost) -approximation for MWM in the MapReduce model. Implementing this algorithm in practice, however, can be slightly challenging, simply due to the post-processing step, which requires computing an (almost) maximum weight matching. Such algorithms tend to be tricky on general graphs, mainly because of the need to handle “blossoms” (see, e.g., [35, 24]), although one can use any readily available algorithm for MWM in this step. It is thus natural to consider simpler post-processing steps that are easier to implement in practice. An obvious candidate here is the greedy algorithm itself. It follows already from the guarantee of Greedy and Theorem 1 (by the same exact argument as in Theorem 2) that this would lead to an (almost) -approximation. We next prove that this algorithm in fact already achieves an improved approximation of .
The MapReduce algorithm for maximum weight matching obtained by applying Greedy as the post-processing step to GreedyCoreset always outputs a -approximation in expectation.
We present the proof of Theorem 3 in the next section.
3.2 Proof of Theorem 3
The proof of this theorem is based on a more careful analysis of GreedyCoreset and on writing a factor-revealing LP for the overall approximation ratio of the algorithm. To be precise, in the post-processing step, we do the following: we run the greedy algorithm on the union of coresets received from all machines and return either this matching, denoted by , or an arbitrary matching computed as a coreset on one machine, say on machine , as the final solution, whichever has the larger weight. (We note that returning the matching is only used to handle the “pathological case” that arises in the analysis, in which each graph already contains a very large weighted matching, i.e., even after sampling by a factor of , we can still find a near-optimum solution in the sampled graph.)
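The post-processing rule described in this paragraph can be sketched as follows. This is a minimal sketch under assumptions: coresets are given as lists of `(u, v, weight)` tuples, the function names are hypothetical, and the single-machine fallback is taken to be the first coreset in the list.

```python
def greedy(edges):
    """Greedy matching in non-increasing weight order, ties broken by endpoints."""
    matched, out = set(), []
    for u, v, w in sorted(edges, key=lambda e: (-e[2], e[0], e[1])):
        if u not in matched and v not in matched:
            out.append((u, v, w))
            matched.update((u, v))
    return out


def weight(matching):
    return sum(w for _, _, w in matching)


def postprocess(coresets):
    """Return the heavier of: greedy on the union of coresets, or one coreset alone."""
    union = {e for c in coresets for e in c}
    on_union = greedy(union)
    single = coresets[0]  # assumption: the fallback machine is the first one
    return max(on_union, single, key=weight)


# Greedy on the union picks the heavy middle edge (weight 3) and blocks both
# outer edges, so the coreset of the first machine (weight 4) is returned.
coresets = [[("a", "b", 2.0), ("c", "d", 2.0)], [("b", "c", 3.0)]]
chosen = postprocess(coresets)
```

The example is deliberately tiny and only shows that the fallback branch can be the heavier one; it is not meant to represent the pathological case from the analysis.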
Recall the definition of (available) free and blocked edges from the proof of Lemma 3.1 and let , , and be defined as before (along with their partitioning into different types). We start with the following simple lemma.
For any pair of edges in that belong to the same type set, there exists a unique edge in with by the first part of Claim 3.2. Hence, .
For any edge in that belongs to a type set, there exists a unique edge in with by the second part of Claim 3.2. Hence, .
For any pair of edges in that belong to the same type set, there exists a unique pair of edges in with by (the proof of) the third part of Claim 3.2. Hence, .
Finally, since are disjoint and are all subset of , we have , as desired.
Recall that is the graph consisting of all edges contained in the coresets. The following lemma is a stronger version of Lemma 3.3 from the previous section.
We compute a matching inside the subgraph as in Lemma 3.6. Edges in belong to by construction. For any pair of edges in , there exists a unique edge (chosen from ) with by Claim 3.2. Any edge in also belongs to by construction. Finally, for any triple of type edges , the weight of the edges added to by construction (using third part of Claim 3.2) is at least equal to , hence clearly these edges have weight at least in total (note that in general, one can even get a higher value here by “mix and matching” the max-term for each triple of edges separately, but we shall not do that here, as our analysis suggests this further extension is not helpful).
We also have the following simple relation between the weight of blocked and available free edges (by second and third part of Claim 3.2):
Let denote the final matching computed by the greedy algorithm on the graph . As greedy always outputs a -approximation of , we have:
Finally define the following quantity:
Note that by Lemma 3.4 in previous section and choice of in GreedyCoreset,
We now use the equations in Lemmas 3.5, 3.6, together with Eq (3), Eq (4), and Eq (6), to form a factor-revealing LP for the performance of the algorithm in Theorem 3. Define the following variables (they are scaled by and are hence between and ):
We design the following linear program: A factor-revealing LP for the performance of the algorithm in Theorem 3.
We have the following lemma on the correctness of this factor-revealing LP.
Let be the matching output by the algorithm in Theorem 3 and be the optimal value of LP; then, .
Recall that variables in the LP are in fact scaled by the value of and are hence in . We now have:
Constraint (13) holds as we are rescaling every variable by .
Finally, the solution of the LP is always at least as large as both matchings and and is stated relative to which by Eq (6) is at least , hence finalizing the proof.
As this LP contains only a constant number of variables, we can solve it exactly using any standard LP-solver such as Gurobi’s command line tool  which results in the following claim.
The optimal solution of LP is .
Recall that in the analysis of Theorem 3, denotes the weight of the greedy solution on a single subgraph, i.e., after subsampling the graph by a factor of . It is thus natural to expect that weight of of should be much smaller than . After all, any single weighted matching in would incur a “loss” of factor in its weight in due to sampling, and unless contains “many” (at least ) edge-disjoint near-optimal weighted matchings, then indeed . This suggests that in most cases, we can expect weight of to be much smaller than (we demonstrate this on the datasets studied in this paper in our experimental studies in Section 4).
On the other hand, it is easy to verify from the factor-revealing LP for Theorem 3 that if weight of is , then in fact the solution returned by the algorithm is a -approximation as opposed to only (simply change LHS in Constraint (8) to instead). In particular, for “more realistic” values of , i.e., around , the theoretical guarantee of this algorithm already gets within a factor 90% of the sequential greedy algorithm (i.e., the approximation ratio becomes ).
4 Empirical Study
We report the results of evaluating our algorithm on a number of publicly available datasets.
We use different datasets with varying sizes, ranging from about six million to half a trillion edges. Table 2 provides the statistics.
|Dataset||Number of vertices||Number of edges||Maximum degree|
We remark that the number of edges is half the sum of vertex degrees, whereas some previous work (e.g., ) reports the latter as the number of edges. Three of our datasets (namely Friendster, Orkut, and LiveJournal, all taken from SNAP) are the public datasets used in evaluating a hierarchical clustering algorithm in  (they also have a fourth private dataset). Maximum weight matching is important in the context of hierarchical clustering because it can be used to generate a more balanced hierarchy: iteratively find a maximum weight matching and contract the edges of the matching. Then the size of clusters at each level will be nearly the same and equal to . We also add a fourth dataset of our own based on public DBLP co-authorship : vertices denote authors and edge weights denote the number of papers two researchers co-authored.
We implement our algorithm on a real proprietary MapReduce-like cluster, i.e., no simulations, with tens to hundreds of machines, depending on the size of the dataset. In our set-up, as in any standard MapReduce system, data is initially stored in distributed storage (each edge on an arbitrary machine), and is then fed into the system and the output matching will be stored in a single file on one machine. To be more precise, our implementation follows the approach in Theorem 3: we first compute a coreset on each machine using GreedyCoreset and then after combining the edges, we simply run the greedy algorithm again to compute the final solution. As such, our algorithm can be seen as the distributed version of the sequential greedy algorithm for the weighted matching problem. Our experiments verify that our coreset approach provides significant speed-ups with little quality loss compared to sequential greedy , which complements our theoretical results in Theorems 2 and 3.
The two main measures we are interested in are the quality of the returned solution and the speed-up of the algorithms compared to the sequential greedy algorithm:
Quality. Our coreset-based approach typically secures close to 99.5% of the weight of the solution found by the sequential greedy algorithm (and slightly above this for the cardinality instead of weight). Table 3 shows the quality of the solution obtained on each dataset separately. We emphasize that even though our theoretical proof in Theorem 3 guarantees a 66% performance (for the coreset-based algorithm with greedy as post-processing compared to the sequential greedy algorithm), we never lose more than 1–2% in terms of the solution quality (recall also Remark 3.9 for a theoretical explanation of this phenomenon).
Speed-up. The speed-up achieved by our coreset-based approach varies significantly between the datasets—ranging from x to x—compared to the sequential greedy algorithm. The speed-up is larger for bigger graphs, where the coreset computation can really take advantage of the parallelism. Table 3 provides the speed-up obtained on each dataset separately.
Several remarks are in order. Firstly, since the graphs in our datasets are too big, we could not compute the value of an optimal weighted matching on these graphs. Therefore, we only compare the quality of our solution to that of the sequential greedy algorithm, which does not use a coreset. Additionally, in Table 3, the actual running times and the number of machines are withheld due to company policy, so as not to reveal specifics of the hardware and architecture of our company’s clusters. Instead, we focus on speed-up (as did several previous works), which is, unlike absolute running time, mostly independent of the computing infrastructure and thus a better performance measure for the algorithms. Indeed, we expect any MapReduce-based system to see speed-ups similar to what we report. Nevertheless, we provide the following back-of-the-envelope calculation to help the reader estimate the general range of the running times. In the biggest dataset we consider, namely Friendster, storing the graph in the standard edge format used by the Stanford Network Analysis Project (SNAP) requires 20–30TB. Merely reading this data into memory from a typical 7200 RPM HDD (with maximum read speed of 160MB/s) takes at least 35–50 hours. On this dataset, we obtain about 130x speed-up by building a coreset and running the greedy algorithm on it. Our solution secures 99.83% of the weight of the solution found by the sequential greedy algorithm, and its cardinality is 99.74% of that of the greedy solution.
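The back-of-the-envelope estimate above can be reproduced with a few lines of arithmetic. The sketch below assumes decimal units (1TB = 10^12 bytes, 1MB = 10^6 bytes) and a sustained read speed equal to the quoted maximum; the helper name `read_hours` is ours.

```python
TB = 10**12  # assumption: decimal terabytes
MB = 10**6   # assumption: decimal megabytes


def read_hours(size_bytes, bytes_per_second):
    """Time in hours to stream size_bytes at a constant throughput."""
    return size_bytes / bytes_per_second / 3600


low = read_hours(20 * TB, 160 * MB)   # about 34.7 hours
high = read_hours(30 * TB, 160 * MB)  # about 52.1 hours
print(f"{low:.1f} to {high:.1f} hours")
```

Under these assumptions the two endpoints come to roughly 35 and 52 hours, in line with the range quoted in the text.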
Role of the Multiplicity Parameter.
We also verify empirically that increasing the multiplicity parameter in the coreset computation improves the overall quality of the final matching. This holds for both the cardinality and the weight of the matching obtained. We demonstrate this for two values of the number of subgraphs into which we partition the original graph in the coreset approach (which is also the number of machines). Our results in this part are summarized in Figure 2.
4.1 Comparison with Prior Algorithms
We provide an empirical comparison of our algorithm with two prior algorithms: a simple and natural sampling-based algorithm, and the coreset-based approach of Assadi and Khanna for unweighted matching together with its extension to weighted matchings using the Crouch-Stubbs technique. At the end of this section, we also present some further theoretical comparison of our algorithm with prior algorithms.
Empirical comparison with Sampling Algorithm.
We compare our coreset approach with a simple sampling-based algorithm that first samples a fraction of the edges of the input graph and then computes a weighted matching only on these edges, as opposed to the whole graph. This is in fact equivalent to comparing against the weight of the matching returned as a coreset on one of the machines, as the subgraph on each machine is simply a random subgraph of the input graph and the coreset is just the weighted greedy matching. As shown in Figure 3, the quality of the best greedy matching (among all subgraphs in the random clustering) degrades as we increase the number of pieces. In contrast, the quality of the final solution improves as we increase the number of pieces, as expected from Remark 3.9. Our experimental results in Figure 3 demonstrate the power of our algorithm compared to this simple approach towards implementing the sequential greedy algorithm on massive graphs.
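The comparison above can be mimicked on a small input with the following sketch. It is a toy illustration under our own assumptions (the names `greedy_matching` and `sampling_vs_coreset`, the `(u, v, weight)` edge format, and the seeded uniform partition are hypothetical), and it says nothing about the behavior on the datasets of Figure 3.

```python
import random


def greedy_matching(edges):
    """Greedy weighted matching: heaviest edges first, skip conflicts."""
    matched, matching = set(), []
    for u, v, w in sorted(edges, key=lambda e: e[2], reverse=True):
        if u != v and u not in matched and v not in matched:
            matching.append((u, v, w))
            matched.update((u, v))
    return matching


def total_weight(matching):
    return sum(w for _, _, w in matching)


def sampling_vs_coreset(edges, k, seed=0):
    """Return (best single-piece weight, final coreset weight, sequential
    greedy weight) for a random k-way partition of the edges."""
    rng = random.Random(seed)
    pieces = [[] for _ in range(k)]
    for e in edges:
        pieces[rng.randrange(k)].append(e)
    per_piece = [greedy_matching(p) for p in pieces]
    # Sampling baseline: the best matching found on any single random piece.
    best_piece = max(total_weight(m) for m in per_piece)
    # Coreset approach: greedy on the union of all per-piece matchings.
    final = total_weight(greedy_matching([e for m in per_piece for e in m]))
    sequential = total_weight(greedy_matching(edges))
    return best_piece, final, sequential
```

Dividing the first two returned values by the third gives the relative quality of the sampling baseline and of the coreset solution with respect to sequential greedy.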
Empirical comparison with Assadi-Khanna Algorithm.
To provide a more complete picture, we also implemented the previous coreset-based algorithm (for unweighted matching) by Assadi and Khanna as well as its extension to the weighted setting via the Crouch-Stubbs technique.
Overall, we learn that our new method provides more speed-up and better solution quality (i.e., the weight of the matching). The weight of the matching produced by the previous method is a few percentage points worse than what our method produces. This is partly because the previous method works on unweighted graphs and we had to extend it to a weighted matching algorithm via the Crouch-Stubbs technique, which theoretically loses a factor of 2.
The speed-up of the Assadi-Khanna algorithm is an order of magnitude smaller than that of our method on the two bigger graphs: 4.72 vs. 131.58 for Friendster and 1.36 vs. 17.13 for Orkut. For the two smaller graphs, the speed-up of the Assadi-Khanna algorithm is about a factor of 2 less than that of our new method. See Table 4 for exact numbers.
One drawback of the Assadi-Khanna algorithm is that it needs to find a maximum matching in the first stage, as opposed to a greedy matching in our method (see  for further discussion on the necessity of this in their approach). Maximum matching algorithms for non-bipartite graphs rely on the structure of the so-called blossoms and (even in the case of unweighted graphs) are much more complicated than the greedy algorithm; they are also much slower: more than quadratic time vs. linear time. Therefore, one needs to balance the number of pieces the graph is chopped into more carefully: enough pieces that the coreset stage is not computationally prohibitive, but not so many that the second stage, where we run a single-machine maximum matching algorithm, receives too many edges.
For the LiveJournal graph, we ran more experiments to understand the effect of the parameter from the Crouch-Stubbs reduction. Recall that the overhead of their technique to turn an unweighted matching algorithm into a weighted matching algorithm is, theoretically, a factor . In our case, changing from down to did not affect the running time significantly. However, it had a major impact on the solution quality: the relative weight of the produced matching improved from 87.58% to 96.52%. We further note that not using the Crouch-Stubbs technique (and merely running the unweighted algorithm of Assadi and Khanna) resulted in a solution of relative weight only 22.16% (which is to be expected, as the original algorithm of  is for the unweighted matching problem only and is oblivious to the weights of the edges).
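To make the role of this reduction concrete, the weight-class idea behind it can be sketched as follows. This is a minimal illustration under our own assumptions, not the implementation used in the experiments: the geometric thresholds with base `1 + eps`, the name `weight_class_reduction`, and the use of a simple maximal matching as the unweighted subroutine (a stand-in for the Assadi-Khanna coreset algorithm, which computes maximum matchings) are all hypothetical choices.

```python
import math


def maximal_matching(edges):
    """Unweighted stand-in subroutine: scan edges in the given order and
    keep every edge whose endpoints are both free (weights are ignored)."""
    matched, out = set(), []
    for u, v, w in edges:
        if u != v and u not in matched and v not in matched:
            out.append((u, v, w))
            matched.update((u, v))
    return out


def weight_class_reduction(edges, eps, unweighted_matcher=maximal_matching):
    """Run an unweighted matcher on each class {e : w(e) >= threshold},
    then merge the resulting matchings greedily from heaviest class down."""
    edges = [e for e in edges if e[2] > 0]  # assumption: positive weights only
    if not edges:
        return []
    base = 1.0 + eps
    w_min = min(w for _, _, w in edges)
    w_max = max(w for _, _, w in edges)
    levels = int(math.floor(math.log(w_max / w_min, base)))
    matched, result = set(), []
    for i in range(levels, -1, -1):  # heaviest weight class first
        threshold = w_min * base**i
        class_edges = [e for e in edges if e[2] >= threshold]
        for u, v, w in unweighted_matcher(class_edges):
            if u not in matched and v not in matched:
                result.append((u, v, w))
                matched.update((u, v))
    return result
```

On the path with edge weights 1, 10, 1, the weight-oblivious `maximal_matching` alone may pick the two light edges, whereas the reduction first sees the heavy middle edge in the top weight class. A smaller `eps` yields more weight classes and hence more calls to the unweighted subroutine.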
Theoretical comparison with prior constant round MapReduce algorithms.
Prior MapReduce algorithms for weighted matching that use a constant number of rounds (and space comparable to ours) include [51, 16, 5, 4], which require a small number of rounds, typically between 2 and 10, and [2, 42], which require a large unspecified constant number of rounds. The approximation ratio of our algorithm improves upon all algorithms in the first category while using the minimal number of 2 rounds but matches that of  and is worse than  in the second category. However, the algorithms in the second category are considerably more complicated than our algorithm (in particular the -approximation algorithm of ) and also require a very large number of rounds of computation in the MapReduce model, and are hence less amenable to efficient implementation in practice.
Theoretical comparison with prior MapReduce algorithms with smaller memory.
There are also MapReduce algorithms known in the literature for weighted matching that require a smaller memory per machine than our algorithm (typically just near-linear in memory) at a cost of super-constant (typically