Hierarchical Agglomerative Graph Clustering in Poly-Logarithmic Depth

06/23/2022
by   Laxman Dhulipala, et al.

Obtaining scalable algorithms for hierarchical agglomerative clustering (HAC) is of significant interest due to the massive size of real-world datasets. At the same time, efficiently parallelizing HAC is difficult due to the seemingly sequential nature of the algorithm. In this paper, we address this issue and present ParHAC, the first efficient parallel HAC algorithm with sublinear depth for the widely used average-linkage function. In particular, we provide a (1+ϵ)-approximation algorithm for this problem on m-edge graphs using Õ(m) work and poly-logarithmic depth. Moreover, we show that obtaining similar bounds for exact average-linkage HAC is not possible under standard complexity-theoretic assumptions. We complement our theoretical results with a comprehensive study of the ParHAC algorithm in terms of its scalability, performance, and quality, and compare it with several state-of-the-art sequential and parallel baselines. On a broad set of large, publicly available real-world datasets, we find that ParHAC obtains a 50.1x speedup on average over the best sequential baseline, while achieving quality similar to the exact HAC algorithm. We also show that ParHAC can cluster one of the largest publicly available graph datasets, with 124 billion edges, in a little over three hours using a commodity multicore machine.
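
To make the problem statement concrete, the following is a minimal sequential sketch of average-linkage HAC on a weighted graph. It is not ParHAC and makes no attempt at the Õ(m) work or poly-logarithmic depth bounds; it only illustrates the objective being approximated. Two things are assumed here because the abstract does not spell them out: the similarity of two clusters is taken to be the total weight of the edges between them divided by the product of their sizes, and a merge counts as (1+ϵ)-approximate when its similarity is at least the current maximum divided by (1+ϵ). The function name `average_linkage_hac` and its parameters are hypothetical.

```python
def average_linkage_hac(n, edges, eps=0.0):
    """Naive sequential average-linkage HAC on a graph (illustrative only).

    n     : number of vertices, labeled 0..n-1
    edges : iterable of (u, v, weight) similarity edges
    eps   : slack; eps=0.0 gives exact HAC under the assumed definition
    Returns a list of merges (a, b, similarity); cluster b is merged into a.
    """
    size = {v: 1 for v in range(n)}
    # cut[a][b] = total weight of edges between clusters a and b
    cut = {v: {} for v in range(n)}
    for u, v, w in edges:
        if u == v:
            continue  # self-loops never join two different clusters
        cut[u][v] = cut[u].get(v, 0.0) + w
        cut[v][u] = cut[v].get(u, 0.0) + w

    def sim(a, b):
        # Assumed average-linkage similarity: cut weight / (|A| * |B|)
        return cut[a][b] / (size[a] * size[b])

    merges = []
    while True:
        pairs = [(a, b) for a in cut for b in cut[a] if a < b]
        if not pairs:
            break  # no edges left between distinct clusters
        best = max(sim(a, b) for a, b in pairs)
        # Any pair within a (1 + eps) factor of the best is acceptable;
        # take the first such pair in iteration order.
        a, b = next(p for p in pairs if sim(*p) * (1 + eps) >= best)
        merges.append((a, b, sim(a, b)))
        # Merge cluster b into cluster a, combining cut weights.
        del cut[a][b]
        del cut[b][a]
        for c, w in cut.pop(b).items():
            cut[a][c] = cut[a].get(c, 0.0) + w
            cut[c][a] = cut[a][c]
            del cut[c][b]
        size[a] += size.pop(b)
    return merges
```

On the path graph with edges `(0, 1, 3.0)`, `(1, 2, 1.0)`, `(2, 3, 4.0)`, calling the function with `eps=0.0` merges vertices 2 and 3 first, since that edge has the largest similarity. With `eps=0.5` the sketch is also allowed to start with vertices 0 and 1, because 3.0 is within a factor of 1.5 of 4.0. Under the assumed definition, this freedom to accept near-best merges is the relaxation that the approximate setting permits.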

research
06/10/2021

Hierarchical Agglomerative Graph Clustering in Nearly-Linear Time

We study the widely used hierarchical agglomerative clustering (HAC) alg...
research
04/19/2023

Nearly Work-Efficient Parallel DFS in Undirected Graphs

We present the first parallel depth-first search algorithm for undirecte...
research
07/13/2023

Breaking 3-Factor Approximation for Correlation Clustering in Polylogarithmic Rounds

In this paper, we study parallel algorithms for the correlation clusteri...
research
12/21/2020

Parallel Index-Based Structural Graph Clustering and Its Approximation

SCAN (Structural Clustering Algorithm for Networks) is a well-studied, w...
research
04/03/2019

Efficient Estimation of Heat Kernel PageRank for Local Clustering

Given an undirected graph G and a seed node s, the local clustering prob...
research
03/21/2022

Scaling Up Maximal k-plex Enumeration

Finding all maximal k-plexes on networks is a fundamental research probl...
research
03/03/2020

Scalable Distributed Approximation of Internal Measures for Clustering Evaluation

The most widely used internal measure for clustering evaluation is the s...
