Fast k-means based on KNN Graph

05/04/2017
by Cheng-Hao Deng, et al.

In the era of big data, k-means clustering has been widely adopted as a basic processing tool in various contexts. However, its computational cost can become prohibitively high when both the data size and the number of clusters are large. It is well known that the processing bottleneck of k-means lies in finding the closest centroid for each sample in every iteration. In this paper, a novel solution to the scalability issue of k-means is presented. In the proposal, k-means is supported by an approximate k-nearest neighbors graph. In each k-means iteration, a data sample is compared only to the clusters in which its nearest neighbors reside. Since the number of nearest neighbors considered is much smaller than k, the processing cost of this step becomes minor and independent of k. The processing bottleneck is therefore overcome. Most interestingly, the k-nearest neighbor graph is itself constructed by iteratively calling the fast k-means. Compared with existing fast k-means variants, the proposed algorithm achieves speed-ups of hundreds to thousands of times while maintaining high clustering quality. When tested on 10 million 512-dimensional data samples, it takes only 5.2 hours to produce 1 million clusters. By contrast, traditional k-means would take 3 years to complete clustering at the same scale.
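The graph-supported assignment step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`brute_force_knn`, `centroids_of`, `graph_kmeans`) and parameters are hypothetical, and an exact brute-force neighbor graph stands in for the approximate graph that the paper builds by calling the fast k-means itself. The point it shows is that each sample is compared only to the centroids of its own cluster and of its neighbors' clusters, rather than to all k centroids.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def brute_force_knn(points, n_neighbors):
    """Exact kNN graph by brute force.

    Assumption: this is only a stand-in for the approximate graph
    used in the paper, which is constructed differently.
    """
    graph = []
    for i, p in enumerate(points):
        others = [j for j in range(len(points)) if j != i]
        others.sort(key=lambda j: sq_dist(p, points[j]))
        graph.append(others[:n_neighbors])
    return graph


def centroids_of(points, labels, k):
    """Mean of each cluster; empty clusters get None."""
    dim = len(points[0])
    sums = [[0.0] * dim for _ in range(k)]
    counts = [0] * k
    for p, c in zip(points, labels):
        counts[c] += 1
        for d in range(dim):
            sums[c][d] += p[d]
    return [
        [s / counts[c] for s in sums[c]] if counts[c] else None
        for c in range(k)
    ]


def graph_kmeans(points, k, graph, init_labels, n_iter=10):
    """k-means where each sample only checks its neighbors' clusters."""
    labels = list(init_labels)
    for _ in range(n_iter):
        cents = centroids_of(points, labels, k)
        new_labels = []
        for i, p in enumerate(points):
            # Candidate clusters: the sample's own cluster plus the
            # clusters of its graph neighbors (all non-empty by construction).
            candidates = {labels[i]} | {labels[j] for j in graph[i]}
            new_labels.append(
                min(candidates, key=lambda c: sq_dist(p, cents[c]))
            )
        if new_labels == labels:  # converged
            break
        labels = new_labels
    return labels


# Toy usage: two well-separated 1-D groups with a poor initial labeling.
points = [[0.0], [0.1], [0.2], [10.0], [10.1], [10.2]]
graph = brute_force_knn(points, n_neighbors=2)
labels = graph_kmeans(points, k=2, graph=graph, init_labels=[0, 0, 1, 1, 1, 1])
print(labels)  # [0, 0, 0, 1, 1, 1]
```

In this sketch the number of distance computations per sample is bounded by the neighbor count plus one, whatever the value of k, which is the property the abstract relies on. The brute-force graph construction used here is itself quadratic in the number of samples, so it is suitable only for illustration.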

research
11/11/2020

Clustering of Big Data with Mixed Features

Clustering large, mixed data is a central problem in data mining. Many a...
research
10/06/2019

Exact and/or Fast Nearest Neighbors

Prior methods for retrieval of nearest neighbors in high dimensions are ...
research
05/30/2016

k2-means for fast and accurate large scale clustering

We propose k^2-means, a new clustering method which efficiently copes wi...
research
12/20/2017

Fast kNN mode seeking clustering applied to active learning

A significantly faster algorithm is presented for the original kNN mode ...
research
09/21/2010

Balancing clusters to reduce response time variability in large scale image search

Many algorithms for approximate nearest neighbor search in high-dimensio...
research
10/08/2016

Boost K-Means

Due to its simplicity and versatility, k-means remains popular since it ...
research
05/19/2020

k-sums: another side of k-means

In this paper, the decades-old clustering method k-means is revisited. T...
