Big-Data Clustering: K-Means or K-Indicators?

06/03/2019
by Feiyu Chen, et al.

The K-means algorithm is arguably the most popular data clustering method. It is commonly applied to processed datasets in some "feature space", as in spectral clustering. However, K-means is highly sensitive to initialization and therefore encounters a scalability bottleneck with respect to the number of clusters K as this number grows in big-data applications. In this work, we promote a closely related model called the K-indicators model and construct an efficient, semi-convex-relaxation algorithm that requires no randomized initialization. We present extensive empirical results to show the advantages of the new algorithm when K is large. In particular, using the new algorithm to start the K-means algorithm, without any replication, can significantly outperform standard K-means run with a large number of currently state-of-the-art random replications.
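To make the comparison in the abstract concrete, the following is a minimal Python sketch of the two ways of running K-means that it contrasts: many random replications with the best objective kept, versus a single run started from centers supplied by another method. It does not implement the K-indicators algorithm, which the abstract does not specify. The function names (`lloyd`, `kmeans_random_restarts`, `make_blobs`), the toy data, and the hand-picked `warm_start` centers are all illustrative assumptions, with `warm_start` standing in for whatever a deterministic initializer would provide.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def lloyd(points, centers, max_iter=100):
    """Run Lloyd's K-means iterations from the given initial centers.

    Returns (centers, labels, objective), where objective is the sum of
    squared distances from each point to its assigned center.
    """
    centers = [tuple(c) for c in centers]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest center.
        new_labels = [
            min(range(len(centers)), key=lambda j: sq_dist(p, centers[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: each center moves to the mean of its members.
        for j in range(len(centers)):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:  # an empty cluster keeps its previous center
                centers[j] = tuple(sum(col) / len(members) for col in zip(*members))
    objective = sum(sq_dist(p, centers[l]) for p, l in zip(points, labels))
    return centers, labels, objective


def kmeans_random_restarts(points, k, n_restarts, seed=0):
    """Standard practice: run K-means from several random initializations
    (replications) and keep the run with the lowest objective."""
    rng = random.Random(seed)
    best = None
    for _ in range(n_restarts):
        init = rng.sample(points, k)  # k distinct data points as centers
        result = lloyd(points, init)
        if best is None or result[2] < best[2]:
            best = result
    return best


def make_blobs(seed=1):
    """Toy data (assumption): three well-separated 2-D Gaussian blobs."""
    rng = random.Random(seed)
    true_centers = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
    return [
        (cx + rng.gauss(0, 0.5), cy + rng.gauss(0, 0.5))
        for cx, cy in true_centers
        for _ in range(30)
    ]


points = make_blobs()

# Hypothetical stand-in for centers produced by a deterministic initializer;
# these values are hand-picked for the toy data, not computed by K-indicators.
warm_start = [(0.5, 0.5), (9.0, 1.0), (1.0, 9.0)]

_, warm_labels, obj_warm = lloyd(points, warm_start)
_, _, obj_rand = kmeans_random_restarts(points, k=3, n_restarts=10)
print("single warm-started run:", obj_warm)
print("best of 10 random runs: ", obj_rand)
```

On a toy problem with K = 3, both strategies are likely to behave similarly. The abstract's claim concerns large K, where the number of random replications needed to find a good solution grows and a single well-initialized run becomes attractive.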


research
11/11/2020

Clustering of Big Data with Mixed Features

Clustering large, mixed data is a central problem in data mining. Many a...
research
03/02/2022

A New Framework for Expressing, Parallelizing and Optimizing Big Data Applications

The Forelem framework was first introduced as a means to optimize databa...
research
04/14/2022

Big-means: Less is More for K-means Clustering

K-means clustering plays a vital role in data mining. However, its perfo...
research
08/07/2015

Spectral Clustering and Block Models: A Review And A New Algorithm

We focus on spectral clustering of unlabeled graphs and review some resu...
research
11/27/2018

A Frequency Scaling based Performance Indicator Framework for Big Data Systems

It is important for big data systems to identify their performance bottl...
research
06/28/2020

Breathing k-Means

We propose a new algorithm for the k-means problem which repeatedly incr...
research
07/09/2018

Using Multi-Core HW/SW Co-design Architecture for Accelerating K-means Clustering Algorithm

The capability of classifying and clustering a desired set of data is an...
