TeraHAC: Hierarchical Agglomerative Clustering of Trillion-Edge Graphs

08/07/2023
by Laxman Dhulipala, et al.

We introduce TeraHAC, a (1+ϵ)-approximate hierarchical agglomerative clustering (HAC) algorithm that scales to trillion-edge graphs. Our algorithm is based on a new approach to computing (1+ϵ)-approximate HAC that combines the nearest-neighbor chain algorithm with the notion of (1+ϵ)-approximate HAC. Our approach allows us to partition the graph among multiple machines and make significant progress in computing the clustering within each partition before any communication with other partitions is needed. We evaluate TeraHAC on a number of real-world and synthetic graphs of up to 8 trillion edges. We show that TeraHAC requires over 100x fewer rounds than previously known approaches for computing HAC. It is up to 8.3x faster than SCC, the state-of-the-art distributed algorithm for hierarchical clustering, while achieving 1.16x higher quality. In fact, TeraHAC essentially retains the quality of the celebrated HAC algorithm while significantly improving the running time.
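
To make the notion of (1+ϵ)-approximate HAC concrete, here is a minimal single-machine sketch in Python. It is not the TeraHAC algorithm: it does no partitioning and does not use the nearest-neighbor chain technique. It assumes a weighted similarity graph, average linkage (total edge weight between two clusters divided by the product of their sizes), and the common reading of (1+ϵ)-approximation in which a merge is allowed whenever its similarity is at least the current maximum divided by (1+ϵ). The function name `approx_graph_hac` and its signature are hypothetical.

```python
from typing import Dict, List, Tuple

Edge = Tuple[int, int, float]            # (u, v, similarity weight)
Merge = Tuple[int, int, int, float]      # (cluster_a, cluster_b, new_id, similarity)


def approx_graph_hac(n: int, edges: List[Edge], eps: float) -> List[Merge]:
    """Sequential sketch of (1+eps)-approximate average-linkage HAC on a graph.

    Assumption: a merge is acceptable if its similarity s satisfies
    s * (1 + eps) >= best, where best is the current maximum similarity.
    """
    # adj[a][b] holds the total edge weight between clusters a and b.
    adj: Dict[int, Dict[int, float]] = {u: {} for u in range(n)}
    size: Dict[int, int] = {u: 1 for u in range(n)}
    for u, v, w in edges:
        adj[u][v] = adj[u].get(v, 0.0) + w
        adj[v][u] = adj[v].get(u, 0.0) + w

    def sim(a: int, b: int) -> float:
        # Average linkage: missing edges count as similarity zero.
        return adj[a][b] / (size[a] * size[b])

    merges: List[Merge] = []
    next_id = n
    while True:
        candidates = [(sim(a, b), a, b) for a in adj for b in adj[a] if a < b]
        if not candidates:
            break  # no edges left between clusters
        best = max(c[0] for c in candidates)
        # Take the first acceptable edge, not necessarily the best one.
        s, a, b = next(c for c in candidates if c[0] * (1 + eps) >= best)

        # Combine the neighborhoods of a and b into the new cluster.
        nbrs: Dict[int, float] = {}
        for x in (a, b):
            for y, w in adj[x].items():
                if y not in (a, b):
                    nbrs[y] = nbrs.get(y, 0.0) + w
                    del adj[y][x]
            del adj[x]
        new = next_id
        next_id += 1
        adj[new] = nbrs
        for y, w in nbrs.items():
            adj[y][new] = w
        size[new] = size.pop(a) + size.pop(b)
        merges.append((a, b, new, s))
    return merges
```

With `eps = 0` the sketch reduces to exact greedy HAC, since only a maximum-similarity edge passes the test. A larger `eps` widens the set of acceptable merges, which is the slack that an approximate algorithm can exploit to perform many merges without first locating the global maximum.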

research
07/27/2018

Connected Components at Scale via Local Contractions

As a fundamental tool in hierarchical graph clustering, computing connec...
research
06/05/2018

Hierarchical Graph Clustering using Node Pair Sampling

We present a novel hierarchical graph clustering algorithm inspired by m...
research
01/31/2022

Fast Distributed k-Means with a Small Number of Rounds

We propose a new algorithm for k-means clustering in a distributed setti...
research
12/05/2022

Stars: Tera-Scale Graph Building for Clustering and Graph Learning

A fundamental procedure in the analysis of massive datasets is the const...
research
04/14/2021

Exact and Approximate Hierarchical Clustering Using A*

Hierarchical clustering is a critical task in numerous domains. Many app...
research
08/07/2015

Sublinear Partition Estimation

The output scores of a neural network classifier are converted to probab...
research
12/09/2017

A Streaming Algorithm for Graph Clustering

We introduce a novel algorithm to perform graph clustering in the edge s...
