Fast Online Clustering with Randomized Skeleton Sets

06/10/2015
by Krzysztof Choromanski, et al.

We present a new fast online clustering algorithm that reliably recovers arbitrarily shaped data clusters in high-throughput data streams. Unlike existing state-of-the-art online clustering methods based on k-means or k-medoids, it does not make any restrictive generative assumptions. In addition, in contrast to existing nonparametric clustering techniques such as DBSCAN or DenStream, it gives provable theoretical guarantees. To achieve fast clustering, we propose to represent each cluster by a skeleton set, which is updated continuously as new data is seen. A skeleton set consists of weighted samples from the data, where the weights encode local densities. The size of each skeleton set is adapted according to the cluster geometry. The proposed technique automatically detects the number of clusters and is robust to outliers. The algorithm works on unbounded data streams, where more than one pass over the data is not feasible. We provide theoretical guarantees on the quality of the clustering and also demonstrate its advantage over the existing state-of-the-art methods on several datasets.
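To make the skeleton-set idea concrete, here is a minimal single-pass sketch in Python. It is not the authors' algorithm and carries none of the paper's guarantees; it only illustrates the general pattern the abstract describes: keep a small set of weighted samples, let each weight count the nearby points absorbed so far (a crude stand-in for local density), and read clusters off the samples. The class name `ToySkeletonStream`, the parameters `radius`, `link_factor`, and `min_weight`, and the rule of linking samples by a distance threshold are all assumptions made for illustration.

```python
import math


class ToySkeletonStream:
    """Toy one-pass clusterer; an illustrative assumption, not the paper's method."""

    def __init__(self, radius, link_factor=2.0):
        self.radius = radius              # absorb a point if it lies this close to a sample
        self.link = link_factor * radius  # samples this close are placed in one cluster
        self.points = []                  # skeleton samples (tuples)
        self.weights = []                 # number of stream points absorbed by each sample

    def update(self, x):
        """Process one stream point; each point is seen exactly once."""
        best, best_d = None, float("inf")
        for i, p in enumerate(self.points):
            d = math.dist(p, x)
            if d < best_d:
                best, best_d = i, d
        if best is not None and best_d <= self.radius:
            self.weights[best] += 1       # denser regions accumulate larger weights
        else:
            self.points.append(tuple(x))  # uncovered region: add a new skeleton sample
            self.weights.append(1)

    def clusters(self, min_weight=1):
        """Group samples with weight >= min_weight into connected components."""
        keep = [i for i, w in enumerate(self.weights) if w >= min_weight]
        parent = {i: i for i in keep}

        def find(i):
            while parent[i] != i:
                parent[i] = parent[parent[i]]  # path halving
                i = parent[i]
            return i

        for pos, a in enumerate(keep):
            for b in keep[pos + 1:]:
                if math.dist(self.points[a], self.points[b]) <= self.link:
                    parent[find(a)] = find(b)

        groups = {}
        for i in keep:
            groups.setdefault(find(i), []).append(i)
        return list(groups.values())


# Usage on a tiny hypothetical stream: two dense groups and one isolated point.
stream = [(0, 0), (0.1, 0), (1.5, 0), (10, 10), (10.1, 10), (50, 50)]
s = ToySkeletonStream(radius=1.0)
for x in stream:
    s.update(x)
print(len(s.clusters()), len(s.clusters(min_weight=2)))
```

In this sketch the number of clusters is not fixed in advance; it falls out of the connected components. Raising `min_weight` discards low-weight samples, which is one simple way to ignore isolated points, at the cost of also dropping sparse samples at the edge of a real cluster.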

research
02/28/2023

Scalable Clustering: Large Scale Unsupervised Learning of Gaussian Mixture Models with Outliers

Clustering is a widely used technique with a long and rich history in a ...
research
04/21/2021

Skeleton Clustering: Dimension-Free Density-based Clustering

We introduce a density-based clustering method called skeleton clusterin...
research
01/21/2021

Fast Clustering of Short Text Streams Using Efficient Cluster Indexing and Dynamic Similarity Thresholds

Short text stream clustering is an important but challenging task since ...
research
06/16/2023

Adversarially robust clustering with optimality guarantees

We consider the problem of clustering data points coming from sub-Gaussi...
research
12/22/2020

Fast and Accurate k-means++ via Rejection Sampling

k-means++ is a widely used clustering algorithm that is easy to i...
research
10/01/2014

Riemannian Multi-Manifold Modeling

This paper advocates a novel framework for segmenting a dataset in a Rie...
research
04/12/2018

Clustering via Boundary Erosion

Clustering analysis identifies samples as groups based on either their m...
