Hybridized Threshold Clustering for Massive Data

07/05/2019
by Jianmei Luo, et al.

As the size n of datasets becomes massive, many commonly used clustering algorithms (for example, k-means or hierarchical agglomerative clustering (HAC)) require prohibitive computational cost and memory. In this paper, we propose a solution to these clustering problems by extending threshold clustering (TC) to problems of instance selection. TC is a recently developed clustering algorithm designed to partition data into many small clusters in linearithmic time (on average). Our proposed clustering method is as follows. First, TC is performed and each cluster is reduced to a single "prototype" point. Then, TC is applied repeatedly to these prototype points until sufficient data reduction has been obtained. Finally, a more sophisticated clustering algorithm is applied to the reduced prototype points, and each original point inherits the cluster of its prototype, thereby obtaining a clustering of all n data points. This entire procedure is called iterative hybridized threshold clustering (IHTC). Through simulation results and by applying our methodology to several real datasets, we show that IHTC combined with k-means or HAC substantially reduces the run time and memory usage of the original clustering algorithms while still preserving their performance. Additionally, IHTC helps prevent clustering algorithms from overfitting to singular data points.
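
The three steps described in the abstract (reduce to prototypes, repeat, then cluster the prototypes and propagate labels back) can be illustrated with a minimal sketch. This is not the authors' implementation. The function `threshold_groups` is a simplified greedy stand-in for TC that only preserves the "every group has at least t points" property and runs in quadratic time, not the linearithmic time of the published algorithm. Using centroids as prototypes, the farthest-point k-means initialization, and the names `t` (minimum group size), `m` (number of reduction rounds), and `k` (final number of clusters) are all assumptions made for illustration.

```python
import math
import random


def centroid(points):
    """Coordinate-wise mean of a non-empty list of equal-length tuples."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def threshold_groups(points, t):
    """Greedy stand-in for TC (assumption): index groups of size >= t."""
    unassigned = set(range(len(points)))
    groups = []
    for i in range(len(points)):
        if i not in unassigned:
            continue
        if len(unassigned) < t:
            break  # too few points left to seed another full group
        unassigned.remove(i)
        # Seed i takes its t - 1 nearest still-unassigned neighbours.
        nearest = sorted(
            unassigned, key=lambda j: (math.dist(points[i], points[j]), j)
        )[: t - 1]
        unassigned.difference_update(nearest)
        groups.append([i] + nearest)
    if not groups:  # fewer than t points in total: keep them together
        return [sorted(unassigned)]
    centers = [centroid([points[j] for j in g]) for g in groups]
    for j in sorted(unassigned):  # leftovers join the nearest group
        best = min(range(len(groups)), key=lambda g: math.dist(points[j], centers[g]))
        groups[best].append(j)
    return groups


def kmeans(points, k, iters=50):
    """Plain Lloyd iterations with deterministic farthest-point initialization."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(math.dist(p, c) for c in centers)))
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [min(range(k), key=lambda c: math.dist(p, centers[c])) for p in points]
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous center
                centers[c] = centroid(members)
    return labels


def ihtc(points, t, m, k):
    """Reduce points to prototypes m times, cluster prototypes, map labels back."""
    prototypes = list(points)
    owner = list(range(len(points)))  # owner[i]: prototype representing point i
    for _ in range(m):
        groups = threshold_groups(prototypes, t)
        group_of = {j: g for g, members in enumerate(groups) for j in members}
        prototypes = [centroid([prototypes[j] for j in members]) for members in groups]
        owner = [group_of[o] for o in owner]
    proto_labels = kmeans(prototypes, k)
    return [proto_labels[o] for o in owner]


# Usage: two well-separated synthetic blobs of 40 points each.
random.seed(0)
blob_a = [(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(40)]
blob_b = [(random.gauss(100, 1), random.gauss(100, 1)) for _ in range(40)]
labels = ihtc(blob_a + blob_b, t=2, m=2, k=2)
```

Each reduction round shrinks the working set by roughly a factor of t, so the final clustering step sees about n / t^m prototypes. That shrinkage is where the run-time and memory savings described above would come from, provided the grouping step itself is cheap.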


research
08/23/2021

Cube Sampled K-Prototype Clustering for Featured Data

Clustering large amount of data is becoming increasingly important in th...
research
11/22/2022

Global k-means++: an effective relaxation of the global k-means clustering algorithm

The k-means algorithm is a very prevalent clustering method because of i...
research
05/12/2023

Rethinking k-means from manifold learning perspective

Although numerous clustering algorithms have been developed, many existi...
research
12/16/2020

Predictive K-means with local models

Supervised classification can be effective for prediction but sometimes ...
research
07/25/2023

DBGSA: A Novel Data Adaptive Bregman Clustering Algorithm

With the development of Big data technology, data analysis has become in...
research
09/13/2022

Genie: A new, fast, and outlier-resistant hierarchical clustering algorithm

The time needed to apply a hierarchical clustering algorithm is most oft...
research
03/19/2019

Predictive Clustering

We show how to convert any clustering into a prediction set. This has th...
