
An efficient K-means clustering algorithm for massive data
The analysis of continuously larger datasets is a task of major importanc...

Tackling Initial Centroid of KMeans with Distance Part (DPKMeans)
The initial centroid is a fairly challenging problem in the kmeans meth...

Improved Performance of Unsupervised Method by Renovated KMeans
Clustering is a separation of data into groups of similar objects. Every...

Parallelization of Kmeans++ using CUDA
Kmeans++ is an algorithm which is invented to improve the process of fi...

Improved Guarantees for kmeans++ and kmeans++ Parallel
In this paper, we study kmeans++ and kmeans++ parallel, the two most p...

Band Depth based initialization of kMeans for functional data clustering
The kMeans algorithm is one of the most popular choices for clustering ...

A balanced kmeans algorithm for weighted point sets
The classical kmeans algorithm for partitioning n points in R^d into k ...
An efficient K-means algorithm for Massive Data
Due to the progressive growth of the amount of data available in a wide variety of scientific fields, it has become more difficult to manipulate and analyze such information. Even though datasets have grown in size, the K-means algorithm remains one of the most popular clustering methods, in spite of its dependency on the initial settings and its high computational cost, especially in terms of distance computations. In this work, we propose an efficient approximation to the K-means problem intended for massive data. Our approach recursively partitions the entire dataset into a small number of subsets, each of which is characterized by its representative (center of mass) and weight (cardinality). A weighted version of the K-means algorithm is then applied over this local representation, which can drastically reduce the number of distances computed. In addition to some theoretical properties, experimental results indicate that our method outperforms well-known approaches, such as K-means++ and minibatch K-means, in terms of the relation between the number of distance computations and the quality of the approximation.
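The idea in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not specify how the partition is built, so the median split of the largest block used here, the function names `partition_representatives` and `weighted_kmeans`, and the parameters `max_blocks`, `n_iter` and `seed` are all assumptions made for illustration. Only the general scheme (summarize each subset by its center of mass and cardinality, then run a weighted K-means over those summaries) comes from the text.

```python
import numpy as np


def partition_representatives(X, max_blocks=64):
    """Split X into at most `max_blocks` blocks and summarize each block by
    its center of mass (representative) and its cardinality (weight).

    The splitting rule is an assumption for illustration: the largest block
    is cut at the median of its widest dimension.
    """
    blocks = [np.arange(len(X))]
    while len(blocks) < max_blocks:
        i = max(range(len(blocks)), key=lambda j: len(blocks[j]))
        idx = blocks[i]
        if len(idx) < 2:
            break  # nothing left to split
        pts = X[idx]
        dim = np.argmax(pts.max(axis=0) - pts.min(axis=0))
        order = np.argsort(pts[:, dim])
        half = len(idx) // 2
        blocks[i] = idx[order[:half]]
        blocks.append(idx[order[half:]])
    reps = np.array([X[b].mean(axis=0) for b in blocks])
    weights = np.array([len(b) for b in blocks], dtype=float)
    return reps, weights


def weighted_kmeans(reps, weights, k, n_iter=50, seed=0):
    """Lloyd-style iterations over the representatives only.

    Each iteration computes len(reps) * k distances instead of len(X) * k.
    Requires k <= len(reps).
    """
    rng = np.random.default_rng(seed)
    centers = reps[rng.choice(len(reps), size=k, replace=False)]
    for _ in range(n_iter):
        # Squared distances from every representative to every center.
        d2 = ((reps[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        new_centers = centers.copy()
        for j in range(k):
            mask = labels == j
            if mask.any():
                # Weighted mean: each representative counts as `weight` points.
                new_centers[j] = np.average(
                    reps[mask], axis=0, weights=weights[mask]
                )
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels


if __name__ == "__main__":
    rng = np.random.default_rng(42)
    # Synthetic data: three Gaussian blobs in 2D (illustrative only).
    X = np.vstack([
        rng.normal(loc=c, scale=0.5, size=(10_000, 2))
        for c in ([0.0, 0.0], [5.0, 5.0], [0.0, 5.0])
    ])
    reps, weights = partition_representatives(X, max_blocks=64)
    centers, _ = weighted_kmeans(reps, weights, k=3)
    print(centers)
```

In this sketch, each iteration of the weighted step touches 64 representatives rather than 30,000 points, which is the kind of reduction in distance computations the abstract refers to. How closely the result approximates K-means on the full dataset depends on the partition, which is where the method's actual design matters.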