Scalable Hierarchical Clustering with Tree Grafting

12/31/2019
by Nicholas Monath, et al.

We introduce Grinch, a new algorithm for large-scale, non-greedy hierarchical clustering with general linkage functions that compute arbitrary similarity between two point sets. The key components of Grinch are its rotate and graft subroutines that efficiently reconfigure the hierarchy as new points arrive, supporting discovery of clusters with complex structure. Grinch is motivated by a new notion of separability for clustering with linkage functions: we prove that when the model is consistent with a ground-truth clustering, Grinch is guaranteed to produce a cluster tree containing the ground-truth, independent of data arrival order. Our empirical results on benchmark and author coreference datasets (with standard and learned linkage functions) show that Grinch is more accurate than other scalable methods, and orders of magnitude faster than hierarchical agglomerative clustering.
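As a rough illustration of what a linkage function over two point sets can look like, the sketch below scores a pair of sets by negative Euclidean distance, so that higher values mean more similar. This is not code from the paper: the names `average_linkage` and `single_linkage`, the choice of distance, and the toy points are all assumptions made for illustration, and Grinch's rotate and graft subroutines are not shown.

```python
from itertools import product
from math import dist


def average_linkage(a, b):
    """Mean similarity over all cross-set pairs (hypothetical helper).

    Similarity is taken to be negative Euclidean distance, an assumption
    made here for illustration only.
    """
    if not a or not b:
        raise ValueError("both point sets must be non-empty")
    total = sum(-dist(p, q) for p, q in product(a, b))
    return total / (len(a) * len(b))


def single_linkage(a, b):
    """Similarity of the closest cross-set pair (hypothetical helper)."""
    if not a or not b:
        raise ValueError("both point sets must be non-empty")
    return max(-dist(p, q) for p, q in product(a, b))


# Toy point sets: `near` sits next to `left`, `far` does not.
left = [(0.0, 0.0), (0.0, 1.0)]
near = [(1.0, 0.0)]
far = [(10.0, 0.0)]

# A clustering procedure would prefer to merge `left` with the set
# that receives the higher linkage score.
best = max([near, far], key=lambda s: average_linkage(left, s))
```

Any function with this shape (two point sets in, one similarity score out) could stand in for the linkage, including a learned model; the two helpers above are only the simplest standard choices.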

research
04/06/2017

An Online Hierarchical Algorithm for Extreme Clustering

Many modern clustering methods scale well to a large number of data item...
research
03/03/2023

Contrastive Hierarchical Clustering

Deep clustering has been dominated by flat models, which split a dataset...
research
08/30/2020

An Objective for Hierarchical Clustering in Euclidean Space and its Connection to Bisecting K-means

This paper explores hierarchical clustering in the case where pairs of p...
research
02/14/2020

Clustering based on Point-Set Kernel

Measuring similarity between two objects is the core operation in existi...
research
07/27/2021

Scalable Community Detection via Parallel Correlation Clustering

Graph clustering and community detection are central problems in modern ...
research
03/19/2020

Clustering with Fast, Automated and Reproducible assessment applied to longitudinal neural tracking

Across many areas, from neural tracking to database entity resolution, m...
research
10/16/2019

FISHDBC: Flexible, Incremental, Scalable, Hierarchical Density-Based Clustering for Arbitrary Data and Distance

FISHDBC is a flexible, incremental, scalable, and hierarchical density-b...
