Big-means: Less is More for K-means Clustering

04/14/2022
by   Rustam Mussabayev, et al.
0

K-means clustering plays a vital role in data mining. However, its performance drastically drops when applied to huge amounts of data. We propose a new heuristic that is built on the basis of regular K-means for faster and more accurate big data clustering using the "less is more" and MSSC decomposition approaches. The main advantage of the proposed algorithm is that it naturally turns the K-means local search into global one through the process of decomposition of the MSSC problem. On one hand, decomposition of the MSSC problem into smaller subproblems reduces the computational complexity and allows for their parallel processing. On the other hand, the MSSC decomposition provides a new method for the natural data-driven shaking of the incumbent solution while introducing a new neighborhood structure for the solution of the MSSC problem. This leads to a new heuristic that improves K-means in big data conditions. The scalability of the algorithm to big data can be easily adjusted by choosing the appropriate number of subproblems and their size. The proposed algorithm is both scalable and accurate. In our experiments it outperforms all recent state-of-the-art algorithms for the MSSC in terms of time as well as the solution quality.

READ FULL TEXT

page 1

page 2

page 3

page 4

research
02/10/2020

Distributed Bayesian Matrix Decomposition for Big Data Mining and Clustering

Matrix decomposition is one of the fundamental tools to discover knowled...
research
03/24/2023

Distributed Silhouette Algorithm: Evaluating Clustering on Big Data

In the big data era, the key feature that each algorithm needs to have i...
research
06/03/2019

Big-Data Clustering: K-Means or K-Indicators?

The K-means algorithm is arguably the most popular data clustering metho...
research
07/09/2018

Using Multi-Core HW/SW Co-design Architecture for Accelerating K-means Clustering Algorithm

The capability of classifying and clustering a desired set of data is an...
research
10/17/2016

High-performance K-means Implementation based on a Simplified Map-Reduce Architecture

The k-means algorithm is one of the most common clustering algorithms an...
research
03/02/2022

A New Framework for Expressing, Parallelizing and Optimizing Big Data Applications

The Forelem framework was first introduced as a means to optimize databa...

Please sign up or login with your details

Forgot password? Click here to reset