Scaling Hierarchical Agglomerative Clustering to Billion-sized Datasets

05/25/2021
by Baris Sumengen, et al.

Hierarchical Agglomerative Clustering (HAC) is one of the oldest but still most widely used clustering methods. However, HAC is notoriously hard to scale to large data sets: its underlying complexity is at least quadratic in the number of data points, and many HAC algorithms are inherently sequential. In this paper, we propose Reciprocal Agglomerative Clustering (RAC), a distributed algorithm for HAC that uses a novel strategy to efficiently merge clusters in parallel. We prove theoretically that RAC recovers the exact solution of HAC. Furthermore, under clusterability and balancedness assumptions, we show provable speedups in total runtime due to the parallelism. We also show that these speedups are achievable for certain probabilistic data models. In extensive experiments, we show that this parallelism is achieved on real-world data sets and that the proposed RAC algorithm can recover the HAC hierarchy on billions of data points connected by trillions of edges in less than an hour.
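The abstract does not spell out the merge strategy, so the following is only a minimal single-machine sketch under one assumption suggested by the name: in each round, every pair of clusters that are each other's nearest neighbor (a reciprocal pair) is merged at once. Average linkage, 1-D points, and the names `average_linkage` and `rac_rounds` are illustrative choices, not taken from the paper, and nothing here is distributed.

```python
def average_linkage(a, b):
    """Mean pairwise distance between two clusters of 1-D points."""
    return sum(abs(x - y) for x in a for y in b) / (len(a) * len(b))


def rac_rounds(points):
    """Merge reciprocal nearest-neighbor clusters round by round.

    Illustrative sketch only (assumed strategy, not the paper's code).
    Returns a list of rounds; each round lists the clusters formed in it.
    """
    clusters = {i: (p,) for i, p in enumerate(points)}
    next_id = len(points)
    rounds = []
    while len(clusters) > 1:
        # Step 1: each cluster finds its nearest neighbor. These searches
        # are independent of each other, so they could run in parallel.
        nn = {
            i: min(
                (j for j in clusters if j != i),
                key=lambda j: (average_linkage(clusters[i], clusters[j]), j),
            )
            for i in clusters
        }
        # Step 2: keep only pairs that point at each other.
        pairs = [(i, j) for i, j in nn.items() if i < j and nn[j] == i]
        # Step 3: merge every reciprocal pair found in this round.
        merged = []
        for i, j in pairs:
            new_cluster = tuple(sorted(clusters.pop(i) + clusters.pop(j)))
            clusters[next_id] = new_cluster
            next_id += 1
            merged.append(new_cluster)
        rounds.append(merged)
    return rounds


if __name__ == "__main__":
    for number, merged in enumerate(rac_rounds([0.0, 1.0, 5.0, 7.0, 20.0]), 1):
        print(f"round {number}: {merged}")
```

On the five points in the usage example, the first round merges two pairs simultaneously, so the full hierarchy is built in three rounds where a sequential procedure would perform four merges one after another.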


research
02/21/2018

Scalable and Robust Sparse Subspace Clustering Using Randomized Clustering and Multilayer Graphs

Sparse subspace clustering (SSC) is one of the current state-of-the-art ...
research
09/05/2023

Data Aggregation for Hierarchical Clustering

Hierarchical Agglomerative Clustering (HAC) is likely the earliest and m...
research
09/23/2022

Creating Compact Regions of Social Determinants of Health

Regionalization is the act of breaking a dataset into contiguous homogen...
research
10/12/2018

On The Equivalence of Tries and Dendrograms - Efficient Hierarchical Clustering of Traffic Data

The widespread use of GPS-enabled devices generates voluminous and conti...
research
08/12/2023

A parallel algorithm for Delaunay triangulation of moving points on the plane

Delaunay Triangulation(DT) is one of the important geometric problems th...
research
07/09/2019

Hierarchical Clustering Supported by Reciprocal Nearest Neighbors

Clustering is a fundamental analysis tool aiming at classifying data poi...
research
08/15/2012

A Novel Strategy Selection Method for Multi-Objective Clustering Algorithms Using Game Theory

The most important factors which contribute to the efficiency of game-th...
