ParChain: A Framework for Parallel Hierarchical Agglomerative Clustering using Nearest-Neighbor Chain

06/08/2021
by Shangdi Yu, et al.

This paper studies the hierarchical clustering problem, where the goal is to produce a dendrogram that represents clusters at varying scales of a data set. We propose the ParChain framework for designing parallel hierarchical agglomerative clustering (HAC) algorithms, and using the framework we obtain novel parallel algorithms for the complete linkage, average linkage, and Ward's linkage criteria. Compared to most previous parallel HAC algorithms, which require quadratic memory, our new algorithms require only linear memory, and are scalable to large data sets. ParChain is based on our parallelization of the nearest-neighbor chain algorithm, and enables multiple clusters to be merged on every round. We introduce two key optimizations that are critical for efficiency: a range query optimization that reduces the number of distance computations required when finding nearest neighbors of clusters, and a caching optimization that stores a subset of previously computed distances, which are likely to be reused. Experimentally, we show that our highly-optimized implementations using 48 cores with two-way hyper-threading achieve 5.8–110.1x speedup over state-of-the-art parallel HAC algorithms and achieve 13.75–54.23x self-relative speedup. Compared to state-of-the-art algorithms, our algorithms require up to 237.3x less space. Our algorithms are able to scale to data set sizes with tens of millions of points, which existing algorithms are not able to handle.
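The idea of merging multiple clusters on every round can be illustrated with a small sequential sketch. It is not the ParChain implementation: it assumes complete linkage, recomputes every cluster distance from scratch on each round (so it uses neither the range query nor the caching optimization), and runs on a single thread. The function names `complete_linkage` and `merge_reciprocal_pairs_by_round` are hypothetical, chosen for this illustration. In each round the sketch finds the nearest neighbor of every cluster and merges all pairs that are each other's nearest neighbor, which is safe for linkage criteria such as complete, average, and Ward's because they satisfy the reducibility property.

```python
import math


def complete_linkage(a, b, points):
    """Complete linkage: the largest pairwise distance between two clusters.

    Clusters are tuples of indices into `points`.
    """
    return max(math.dist(points[i], points[j]) for i in a for j in b)


def merge_reciprocal_pairs_by_round(points):
    """Illustrative sketch (not the paper's code): merge every reciprocal
    nearest-neighbor pair of clusters in each round.

    Returns a list of rounds; each round is a list of
    (cluster_a, cluster_b, linkage_distance) merges.
    """
    clusters = [(i,) for i in range(len(points))]
    rounds = []
    while len(clusters) > 1:
        n = len(clusters)
        nn, nn_dist = [], []
        for i in range(n):
            # Nearest neighbor of cluster i; ties go to the lowest index,
            # which guarantees at least one reciprocal pair per round.
            d, j = min(
                (complete_linkage(clusters[i], clusters[j], points), j)
                for j in range(n)
                if j != i
            )
            nn.append(j)
            nn_dist.append(d)
        # A pair (i, j) is reciprocal when each is the other's nearest neighbor.
        pairs = [(i, nn[i]) for i in range(n) if i < nn[i] and nn[nn[i]] == i]
        merged = {k for pair in pairs for k in pair}
        rounds.append([(clusters[i], clusters[j], nn_dist[i]) for i, j in pairs])
        # Keep the untouched clusters and append the newly merged ones.
        clusters = [c for k, c in enumerate(clusters) if k not in merged] + [
            clusters[i] + clusters[j] for i, j in pairs
        ]
    return rounds


if __name__ == "__main__":
    pts = [(0.0,), (1.0,), (10.0,), (11.0,), (50.0,)]
    for r, merges in enumerate(merge_reciprocal_pairs_by_round(pts), start=1):
        print(r, merges)
```

On five collinear points at 0, 1, 10, 11, and 50, the first round merges two pairs at once, so the four merges of the dendrogram finish in three rounds rather than four. In a parallel setting the per-cluster nearest-neighbor searches within a round are independent of each other, which is what makes the round structure attractive.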

research
03/09/2023

Parallel Filtered Graphs for Hierarchical Clustering

Given all pairwise weights (distances) among a set of objects, filtered ...
research
02/24/2019

clusterNOR: A NUMA-Optimized Clustering Framework

Clustering algorithms are iterative and have complex data access pattern...
research
09/05/2023

Data Aggregation for Hierarchical Clustering

Hierarchical Agglomerative Clustering (HAC) is likely the earliest and m...
research
02/28/2019

Efficient Parameter-free Clustering Using First Neighbor Relations

We present a new clustering method in the form of a single clustering eq...
research
10/26/2009

Parallelization of the LBG Vector Quantization Algorithm for Shared Memory Systems

This paper proposes a parallel approach for the Vector Quantization (VQ)...
research
07/21/2021

SkyCell: A Space-Pruning Based Parallel Skyline Algorithm

Skyline computation is an essential database operation that has many app...
research
07/27/2021

Scalable Community Detection via Parallel Correlation Clustering

Graph clustering and community detection are central problems in modern ...
