Connected Components at Scale via Local Contractions

07/27/2018
by Jakub Łącki, et al.

As a fundamental tool in hierarchical graph clustering, computing connected components has been a central problem in large-scale data mining. While many algorithms have been developed for this problem, they are either not scalable in practice or lack strong theoretical guarantees on the parallel running time, that is, the number of communication rounds. So far, the best proven guarantee is O(log n), which matches the running time in the PRAM model. In this paper, we aim to design a distributed algorithm for this problem that works well both in theory and in practice. In particular, we present a simple algorithm based on contractions and provide a scalable implementation of it in MapReduce. On the theoretical side, in addition to showing O(log n) convergence for all graphs, we prove an O(log log n) parallel running time with high probability for a certain class of random graphs. We work in the MPC model, which captures popular parallel computing frameworks such as MapReduce, Hadoop, and Spark. On the practical side, we show that our algorithm outperforms the state-of-the-art MapReduce algorithms. To confirm its scalability, we report empirical results on graphs with several trillions of edges.
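To make the idea of contraction-based connectivity concrete, here is a minimal sequential sketch in Python. It is not the paper's algorithm or its MapReduce implementation. It assumes one plausible contraction rule: in each round every vertex gets a random priority, points to the lowest-priority vertex in its closed neighborhood, and each resulting pointer tree is collapsed into a single super-vertex. The function name `connected_components`, its parameters, and the round counter are hypothetical and chosen only for illustration.

```python
import random


def connected_components(num_vertices, edges, seed=0):
    """Label components by repeated random contractions (illustrative sketch).

    Returns (comp, rounds): comp[v] is a representative vertex shared by all
    vertices in v's component; rounds is the number of contraction rounds
    this sequential sketch performed.
    """
    rng = random.Random(seed)
    comp = list(range(num_vertices))   # original vertex -> current super-vertex
    live = set(range(num_vertices))    # super-vertices of the contracted graph
    cur_edges = {(min(u, v), max(u, v)) for u, v in edges if u != v}
    rounds = 0

    while cur_edges:
        rounds += 1

        # Assumption: fresh uniformly random priorities in every round.
        order = list(live)
        rng.shuffle(order)
        rank = {v: i for i, v in enumerate(order)}

        # Each vertex points to the lowest-rank vertex in its closed neighborhood.
        parent = {v: v for v in live}
        for u, v in cur_edges:
            if rank[v] < rank[parent[u]]:
                parent[u] = v
            if rank[u] < rank[parent[v]]:
                parent[v] = u

        # Pointers only lead to strictly lower ranks, so they form a forest.
        def find(v):
            while parent[v] != v:
                parent[v] = parent[parent[v]]  # path halving
                v = parent[v]
            return v

        root = {v: find(v) for v in live}

        # Contract: relabel vertices and edges, dropping self-loops and duplicates.
        comp = [root[c] for c in comp]
        cur_edges = {
            (min(root[u], root[v]), max(root[u], root[v]))
            for u, v in cur_edges
            if root[u] != root[v]
        }
        live = set(root.values())

    return comp, rounds


if __name__ == "__main__":
    labels, rounds = connected_components(6, [(0, 1), (1, 2), (3, 4)])
    print(len(set(labels)), "components after", rounds, "round(s)")
```

Every remaining edge forces its higher-rank endpoint to merge into another vertex, so the contracted graph shrinks in each round and the loop terminates once no edges are left. A distributed version would have to spread the pointer-following step over communication rounds, which this sketch does not model.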

research
09/24/2020

Parallel Graph Algorithms in Constant Adaptive Rounds: Theory meets Practice

We study fundamental graph problems such as graph connectivity, minimum ...
research
08/07/2023

TeraHAC: Hierarchical Agglomerative Clustering of Trillion-Edge Graphs

We introduce TeraHAC, a (1+ϵ)-approximate hierarchical agglomerative clu...
research
12/10/2020

Building Graphs at a Large Scale: Union Find Shuffle

Large scale graph processing using distributed computing frameworks is b...
research
12/28/2017

ASYMP: Fault-tolerant Mining of Massive Graphs

We present ASYMP, a distributed graph processing system developed for th...
research
07/18/2018

Evolving Large-Scale Data Stream Analytics based on Scalable PANFIS

Many distributed machine learning frameworks have recently been built to...
research
06/15/2020

Hypergraph Clustering Based on PageRank

A hypergraph is a useful combinatorial object to model ternary or higher...
research
10/14/2019

FastSV: A Distributed-Memory Connected Component Algorithm with Fast Convergence

This paper presents a new distributed-memory algorithm called FastSV for...
