A Distributed and Approximated Nearest Neighbors Algorithm for an Efficient Large Scale Mean Shift Clustering

02/11/2019
by Gaël Beck, et al.
In this paper we target the class of modal clustering methods, in which clusters are defined in terms of the local modes of the probability density function that generates the data. The most well-known modal clustering method is k-means clustering. Mean Shift clustering is a generalization of k-means that computes arbitrarily shaped clusters, defined as the basins of attraction of the local modes created by the density gradient ascent paths. Despite its potential, the Mean Shift approach is a computationally expensive method for unsupervised learning. We therefore introduce two contributions aimed at providing clustering algorithms with linear time complexity, as opposed to the quadratic time complexity of exact Mean Shift clustering. First, we propose a scalable procedure to approximate the density gradient ascent. Second, we present a scalable cluster labeling technique. Both rely on Locality Sensitive Hashing (LSH) to approximate nearest neighbors. These two techniques may be used for moderate-sized datasets. Furthermore, we show that using our proposed approximations of the density gradient ascent as a pre-processing step in other clustering methods can also improve dedicated classification metrics. For the latter, a distributed implementation written for the Spark/Scala ecosystem is proposed. For all the clustering methods considered, we present experimental results illustrating their labeling accuracy and their potential to solve concrete problems.
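To make the idea of an LSH-approximated gradient ascent concrete, here is a minimal single-machine sketch in Python. It is not the authors' Spark/Scala implementation. The hash (one random projection cut into slices of width `w`), the function names, and the parameters `k`, `w`, `iterations` and `seed` are all assumptions chosen for illustration. Each point is moved to the mean of its `k` nearest neighbors, and the neighbor search is restricted to the point's own hash bucket rather than the full dataset.

```python
import math
import random


def lsh_buckets(points, w, rng):
    """Group point indices by one random-projection hash (illustrative choice)."""
    dim = len(points[0])
    a = [rng.gauss(0.0, 1.0) for _ in range(dim)]  # random projection direction
    b = rng.uniform(0.0, w)                        # random offset
    buckets = {}
    for i, p in enumerate(points):
        projection = sum(ai * pi for ai, pi in zip(a, p))
        key = math.floor((projection + b) / w)
        buckets.setdefault(key, []).append(i)
    return buckets


def approx_ascent_step(points, k, w, rng):
    """Move each point to the mean of its k nearest neighbors within its bucket."""
    shifted = list(points)
    for members in lsh_buckets(points, w, rng).values():
        for i in members:
            # Exact k-NN search, but only among points sharing the bucket.
            nearest = sorted(
                members, key=lambda j: math.dist(points[i], points[j])
            )[:k]
            dim = len(points[i])
            shifted[i] = tuple(
                sum(points[j][d] for j in nearest) / len(nearest)
                for d in range(dim)
            )
    return shifted


def approx_gradient_ascent(points, k=5, w=1.0, iterations=5, seed=0):
    """Repeat the approximate step; a fresh hash is drawn at every iteration."""
    rng = random.Random(seed)
    for _ in range(iterations):
        points = approx_ascent_step(points, k, w, rng)
    return points
```

In this sketch the cost of one iteration depends on the bucket sizes and not on all pairs of points, so it grows roughly linearly with the number of points as long as buckets stay bounded in size. The price is that a true nearest neighbor falling in a different bucket is missed, which is the approximation being traded for speed.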

research
04/24/2013

The K-modes algorithm for clustering

Many clustering algorithms exist that estimate a cluster centroid, such ...
research
12/20/2017

Fast kNN mode seeking clustering applied to active learning

A significantly faster algorithm is presented for the original kNN mode ...
research
04/20/2014

Clustering via Mode Seeking by Direct Estimation of the Gradient of a Log-Density

Mean shift clustering finds the modes of the data probability density by...
research
11/20/2017

On Convergence of Epanechnikov Mean Shift

Epanechnikov Mean Shift is a simple yet empirically very effective algor...
research
05/10/2018

Analysis of a Mode Clustering Diagram

Mode-based clustering methods define clusters to be the basins of attrac...
research
08/06/2014

The functional mean-shift algorithm for mode hunting and clustering in infinite dimensions

We introduce the functional mean-shift algorithm, an iterative algorithm...
research
07/04/2022

An Improved Probability Propagation Algorithm for Density Peak Clustering Based on Natural Nearest Neighborhood

Clustering by fast search and find of density peaks (DPC) (Science, 2014) ...
