An Online Hierarchical Algorithm for Extreme Clustering

04/06/2017
by Ari Kobren, et al.

Many modern clustering methods scale well to a large number of data items, N, but not to a large number of clusters, K. This paper introduces PERCH, a new non-greedy algorithm for online hierarchical clustering that scales to both massive N and K, a problem setting we term extreme clustering. Our algorithm efficiently routes new data points to the leaves of an incrementally built tree. Motivated by the desire for both accuracy and speed, our approach performs tree rotations to enhance subtree purity and encourage balance. We prove that, under a natural separability assumption, our non-greedy algorithm will produce trees with perfect dendrogram purity regardless of online data arrival order. Our experiments demonstrate that PERCH constructs more accurate trees than other tree-building clustering algorithms and scales well with both N and K, achieving a higher quality clustering than the strongest flat clustering competitor in nearly half the time.
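
The abstract refers to dendrogram purity without defining it. As a rough illustration only, the sketch below computes the measure as it is commonly defined in the hierarchical clustering literature: for every pair of points that share a ground-truth cluster, take the leaves under their least common ancestor and record the fraction that belong to that cluster, then average over all such pairs. The nested-tuple tree encoding and the function names `leaves`, `lca_leaves`, and `dendrogram_purity` are assumptions made for this example and are not taken from the paper; the brute-force loops are meant for tiny inputs, not for the extreme-scale setting the paper targets.

```python
from itertools import combinations

# Assumed encoding: a tree is either a leaf (a str or int point id)
# or a 2-tuple (left_subtree, right_subtree).

def leaves(tree):
    """Return the list of leaf ids under a nested-tuple tree."""
    if isinstance(tree, tuple):
        return leaves(tree[0]) + leaves(tree[1])
    return [tree]

def lca_leaves(tree, a, b):
    """Return the leaves under the least common ancestor of a and b."""
    node = tree
    while isinstance(node, tuple):
        for child in node:
            under = leaves(child)
            if a in under and b in under:
                node = child  # both points lie in this child, so descend
                break
        else:
            break  # a and b separate here, so node is their LCA
    return leaves(node)

def dendrogram_purity(tree, labels):
    """Average purity of the LCA subtree over all same-cluster pairs."""
    total, count = 0.0, 0
    for a, b in combinations(leaves(tree), 2):
        if labels[a] != labels[b]:
            continue
        under = lca_leaves(tree, a, b)
        total += sum(labels[x] == labels[a] for x in under) / len(under)
        count += 1
    # Assumption: with no same-cluster pairs, report a purity of 1.0.
    return total / count if count else 1.0

labels = {"a1": 0, "a2": 0, "b1": 1, "b2": 1}
pure_tree = (("a1", "a2"), ("b1", "b2"))
mixed_tree = (("a1", "b1"), ("a2", "b2"))
print(dendrogram_purity(pure_tree, labels))
print(dendrogram_purity(mixed_tree, labels))
```

In the first tree each ground-truth cluster occupies its own subtree, so every same-cluster pair has a pure least common ancestor. In the second tree the only ancestor shared by same-cluster points is the root, where half the leaves carry the other label, so the score drops.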

research
12/31/2019

Scalable Hierarchical Clustering with Tree Grafting

We introduce Grinch, a new algorithm for large-scale, non-greedy hierarc...
research
10/22/2020

Scalable Bottom-Up Hierarchical Clustering

Bottom-up algorithms such as the classic hierarchical agglomerative clus...
research
03/19/2019

A Quantum Annealing-Based Approach to Extreme Clustering

In this age of data abundance, there is a growing need for algorithms an...
research
06/23/2020

BETULA: Numerically Stable CF-Trees for BIRCH Clustering

BIRCH clustering is a widely known approach for clustering, that has inf...
research
01/26/2018

Information Content of a Phylogenetic Tree in a Data Matrix

Phylogenetic trees in genetics and biology in general are all binary. We...
research
09/21/2020

Interactive Steering of Hierarchical Clustering

Hierarchical clustering is an important technique to organize big data f...
research
03/03/2023

Contrastive Hierarchical Clustering

Deep clustering has been dominated by flat models, which split a dataset...
