Systematically improving existing k-means initialization algorithms at nearly no cost, by pairwise-nearest-neighbor smoothing

02/08/2022
by Carlo Baldassi, et al.

We present a meta-method for initializing (seeding) the k-means clustering algorithm, called PNN-smoothing. It consists of splitting a given dataset into J random subsets, clustering each of them individually, and merging the resulting clusterings with the pairwise-nearest-neighbor (PNN) method. It is a meta-method in the sense that any seeding algorithm can be used when clustering the individual subsets. If the computational complexity of that seeding algorithm is linear in the size of the data N and the number of clusters k, PNN-smoothing is also almost linear with an appropriate choice of J, and in practice it is at most a few percent slower in most cases. We show empirically, using several existing seeding methods and testing on several synthetic and real datasets, that this procedure results in systematically better costs. It can also be applied recursively and is easily parallelized. Our implementation is publicly available at https://github.com/carlobaldassi/KMeansPNNSmoothing.jl
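The split, cluster, and merge steps described above can be illustrated with a minimal pure-Python sketch. It is not the authors' Julia implementation. All function names here (`lloyd`, `pnn_merge`, `pnn_smoothing_seed`, `uniform_seeder`) are hypothetical. The sketch assumes that "clustering each subset" means seeding followed by a few Lloyd iterations, and that the merge step uses the standard PNN cost n_a n_b / (n_a + n_b) · ||c_a − c_b||², found by a brute-force search over pairs rather than by an efficient nearest-neighbor structure.

```python
import random

def sqdist(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def lloyd(points, centers, iters=10):
    """Plain Lloyd iterations; returns (centroid, size) for each non-empty cluster."""
    for _ in range(iters):
        groups = [[] for _ in centers]
        for p in points:
            j = min(range(len(centers)), key=lambda i: sqdist(p, centers[i]))
            groups[j].append(p)
        # Recompute each centroid as the mean of its group; keep the old one if empty.
        centers = [
            tuple(sum(col) / len(g) for col in zip(*g)) if g else c
            for g, c in zip(groups, centers)
        ]
    return [(c, len(g)) for c, g in zip(centers, groups) if g]

def pnn_merge(clusters, k):
    """Greedily merge the pair with the smallest cost increase until k clusters remain."""
    clusters = list(clusters)
    while len(clusters) > k:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                (ca, na), (cb, nb) = clusters[a], clusters[b]
                cost = na * nb / (na + nb) * sqdist(ca, cb)
                if best is None or cost < best[0]:
                    best = (cost, a, b)
        _, a, b = best
        (ca, na), (cb, nb) = clusters[a], clusters[b]
        # The merged centroid is the size-weighted mean of the two centroids.
        merged = tuple((na * x + nb * y) / (na + nb) for x, y in zip(ca, cb))
        clusters[a] = (merged, na + nb)
        del clusters[b]
    return [c for c, _ in clusters]

def uniform_seeder(points, k, rng):
    """Simplest possible seeding: k distinct data points chosen uniformly at random."""
    return rng.sample(points, k)

def pnn_smoothing_seed(points, k, J, seeder, rng):
    """Split into J random subsets, cluster each, then PNN-merge the J*k centroids to k."""
    shuffled = points[:]
    rng.shuffle(shuffled)
    subsets = [shuffled[j::J] for j in range(J)]
    pooled = []
    for sub in subsets:
        pooled.extend(lloyd(sub, seeder(sub, k, rng)))
    return pnn_merge(pooled, k)

if __name__ == "__main__":
    rng = random.Random(0)
    blobs = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
    data = [(cx + rng.gauss(0, 1), cy + rng.gauss(0, 1))
            for cx, cy in blobs for _ in range(20)]
    print(pnn_smoothing_seed(data, k=3, J=4, seeder=uniform_seeder, rng=rng))
```

The `seeder` argument is what makes this a meta-method: any function that returns k initial centers for a subset can be passed in place of `uniform_seeder`. The returned centers are meant as the starting point for a regular k-means run on the full dataset. If some subset clusters come out empty, the pool can hold fewer than J·k centroids, a case this sketch handles only by skipping them.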
