An efficient K-means algorithm for Massive Data

05/10/2016
by Marco Capó, et al.

Due to the progressive growth of the amount of data available in a wide variety of scientific fields, it has become more difficult to manipulate and analyze such information. Even though datasets have grown in size, the K-means algorithm remains one of the most popular clustering methods, in spite of its dependency on the initial settings and its high computational cost, especially in terms of distance computations. In this work, we propose an efficient approximation to the K-means problem intended for massive data. Our approach recursively partitions the entire dataset into a small number of subsets, each of which is characterized by its representative (center of mass) and weight (cardinality). A weighted version of the K-means algorithm is then applied to this local representation, which can drastically reduce the number of distances computed. In addition to some theoretical properties, we present experimental results indicating that our method outperforms well-known approaches, such as K-means++ and minibatch K-means, in terms of the relation between the number of distance computations and the quality of the approximation.
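
The core idea in the abstract (summarize each subset by its center of mass and cardinality, then run weighted K-means on those summaries) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not specify how the recursive partition is built, so the sketch assumes a single-level uniform grid instead, and the function names (grid_representatives, weighted_kmeans) and parameters (bins_per_dim, n_iter, seed) are hypothetical.

```python
import random
from collections import defaultdict


def grid_representatives(points, bins_per_dim=4):
    """Group points by uniform grid cell (an assumed, non-recursive partition).

    Returns one representative (center of mass) and one weight
    (cardinality) per non-empty cell.
    """
    dims = len(points[0])
    lo = [min(p[d] for p in points) for d in range(dims)]
    hi = [max(p[d] for p in points) for d in range(dims)]
    cells = defaultdict(list)
    for p in points:
        key = tuple(
            min(int(bins_per_dim * (p[d] - lo[d]) / ((hi[d] - lo[d]) or 1.0)),
                bins_per_dim - 1)
            for d in range(dims)
        )
        cells[key].append(p)
    reps, weights = [], []
    for members in cells.values():
        reps.append([sum(p[d] for p in members) / len(members)
                     for d in range(dims)])
        weights.append(len(members))
    return reps, weights


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def weighted_kmeans(reps, weights, k, n_iter=50, seed=0):
    """Lloyd-style iterations over weighted representatives."""
    dims = len(reps[0])
    rng = random.Random(seed)
    # Assumed initialization: k distinct representatives chosen at random.
    centroids = [list(c) for c in rng.sample(reps, k)]
    for _ in range(n_iter):
        # Assignment step: len(reps) * k distances instead of n * k.
        labels = [min(range(k), key=lambda j: sq_dist(r, centroids[j]))
                  for r in reps]
        # Update step: each centroid is the weighted mean of its members.
        new_centroids = []
        for j in range(k):
            idx = [i for i, lab in enumerate(labels) if lab == j]
            total = sum(weights[i] for i in idx)
            if total == 0:
                new_centroids.append(centroids[j])  # keep an empty cluster
                continue
            new_centroids.append(
                [sum(weights[i] * reps[i][d] for i in idx) / total
                 for d in range(dims)]
            )
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids
```

Because each representative is a center of mass weighted by its cardinality, the weighted mean of the representatives equals the mean of the original points, while each iteration computes only (number of cells) x k distances rather than n x k. How closely the result approximates K-means on the full dataset depends on the partition, which is the part the paper itself addresses.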


research
01/09/2018

An efficient K-means clustering algorithm for massive data

The analysis of continuously larger datasets is a task of major importanc...
research
03/15/2019

Tackling Initial Centroid of K-Means with Distance Part (DP-KMeans)

The initial centroid is a fairly challenging problem in the k-means meth...
research
10/18/2022

An enhanced method of initial cluster center selection for K-means algorithm

Clustering is one of the widely used techniques to find out patterns fro...
research
03/11/2013

Improved Performance of Unsupervised Method by Renovated K-Means

Clustering is a separation of data into groups of similar objects. Every...
research
07/30/2019

Parallelization of Kmeans++ using CUDA

K-means++ is an algorithm which is invented to improve the process of fi...
research
06/02/2021

Band Depth based initialization of k-Means for functional data clustering

The k-Means algorithm is one of the most popular choices for clustering ...
