Clustering by the Probability Distributions from Extreme Value Theory

02/20/2022
by Sixiao Zheng, et al.

Clustering is an essential task in unsupervised learning that aims to automatically separate instances into coherent subsets. As one of the best-known clustering algorithms, k-means assigns each sample point, including those at cluster boundaries, to a unique cluster, but it does not use information about the sample distribution or density. It would potentially be more beneficial to consider the probability that each sample belongs to a candidate cluster. To this end, this paper generalizes k-means to model the distribution of clusters. Our clustering algorithm models the distribution of distances to centroids that exceed a threshold with the Generalized Pareto Distribution (GPD) from Extreme Value Theory (EVT). Specifically, we propose the concept of centroid margin distance, use the GPD to establish a probability model for each cluster, and perform clustering based on the covering probability function derived from the GPD. GPD k-means thus approaches clustering from a probabilistic perspective. We also introduce a naive baseline, dubbed Generalized Extreme Value (GEV) k-means. The GEV fits the distribution of block maxima. In contrast, the GPD fits the distribution of distances to the centroid that exceed a sufficiently large threshold, which leads to more stable performance for GPD k-means. Notably, GEV k-means can also estimate cluster structure and thus performs reasonably well compared with classical k-means. Extensive experiments on synthetic and real datasets demonstrate that GPD k-means outperforms competitors. The code is released at https://github.com/sixiaozheng/EVT-K-means.
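To make the peaks-over-threshold idea in the abstract concrete, here is a minimal Python sketch that fits a GPD to the distances from a single centroid that exceed a threshold. It is an illustration under stated assumptions and not the authors' implementation: the quantile-based threshold, the method-of-moments estimator, and the names `fit_gpd_moments`, `gpd_cdf`, `cluster_tail_model`, and `tail_probability` are all hypothetical. In particular, `tail_probability` is only a stand-in for the covering probability function, whose exact definition the abstract does not give.

```python
import math
import statistics


def fit_gpd_moments(excesses):
    """Method-of-moments GPD fit (an assumed estimator, valid for shape < 1/2).

    Returns (xi, sigma): the shape and scale parameters.
    """
    mean = statistics.fmean(excesses)
    var = statistics.variance(excesses)
    ratio = mean * mean / var
    xi = 0.5 * (1.0 - ratio)
    sigma = 0.5 * mean * (1.0 + ratio)
    return xi, sigma


def gpd_cdf(y, xi, sigma):
    """CDF of the GPD evaluated at an excess y over the threshold."""
    if y <= 0:
        return 0.0
    if abs(xi) < 1e-12:
        # Limiting case xi -> 0 is the exponential distribution.
        return 1.0 - math.exp(-y / sigma)
    base = 1.0 + xi * y / sigma
    if base <= 0:
        # Beyond the finite upper endpoint that exists when xi < 0.
        return 1.0
    return 1.0 - base ** (-1.0 / xi)


def cluster_tail_model(points, centroid, quantile=0.8):
    """Fit a GPD to the distances from `centroid` that exceed a threshold.

    The threshold is taken as an empirical quantile of the distances,
    which is an assumption made for this sketch.
    """
    dists = sorted(math.dist(p, centroid) for p in points)
    threshold = dists[int(quantile * (len(dists) - 1))]
    excesses = [d - threshold for d in dists if d > threshold]
    xi, sigma = fit_gpd_moments(excesses)
    return threshold, xi, sigma


def tail_probability(point, centroid, model):
    """Hypothetical membership score: 1 - F(distance - threshold).

    Points closer than the threshold score 1.0, and the score decays
    as a point moves further into the fitted tail.
    """
    threshold, xi, sigma = model
    d = math.dist(point, centroid)
    return 1.0 - gpd_cdf(d - threshold, xi, sigma)
```

A probabilistic assignment step could then compare such scores across clusters and assign each point to the cluster with the highest score, instead of using the raw nearest-centroid distance as classical k-means does. A maximum-likelihood fit (for example via a statistics library) would be a natural replacement for the moment estimator used here.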

research
06/10/2022

A new distance measurement and its application in K-Means Algorithm

K-Means clustering algorithm is one of the most commonly used clustering...
research
01/07/2022

Probabilistic spatial clustering based on the Self Discipline Learning (SDL) model of autonomous learning

Unsupervised clustering algorithm can effectively reduce the dimension o...
research
07/02/2019

A flexible EM-like clustering algorithm for noisy data

We design a new robust clustering algorithm that can deal efficiently wi...
research
09/13/2017

An efficient clustering algorithm from the measure of local Gaussian distribution

In this paper, I will introduce a fast and novel clustering algorithm ba...
research
10/17/2021

Noise-robust Clustering

This paper presents noise-robust clustering techniques in unsupervised m...
research
06/13/2013

Non-parametric Power-law Data Clustering

It has always been a great challenge for clustering algorithms to automa...
research
07/04/2019

k is the Magic Number -- Inferring the Number of Clusters Through Nonparametric Concentration Inequalities

Most convex and nonconvex clustering algorithms come with one crucial pa...
