Evolution of K-means solution landscapes with the addition of dataset outliers and a robust clustering comparison measure for their analysis

06/25/2023
by   Luke Dicks, et al.

The K-means algorithm remains one of the most widely used clustering methods due to its simplicity and general utility. The performance of K-means depends upon locating minima that are low in the cost function, amongst a potentially vast number of solutions. Here, we use the energy landscape approach to map the change in K-means solution space as a result of increasing dataset outliers and show that the cost function surface becomes more funnelled. Kinetic analysis reveals that in all cases the overall funnel is composed of shallow locally-funnelled regions, each of which is separated from the others by areas that do not support any clustering solutions. These shallow regions correspond to different types of clustering solution, and their increasing number with outliers leads to longer pathways within the funnel and a reduced correlation between accuracy and cost function. Finally, we propose that the rates obtained from kinetic analysis provide a novel measure of similarity between clustering solutions that incorporates information about the paths between them. This measure is robust to outliers and we illustrate its application to datasets containing multiple outliers.
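To make the idea of many K-means solutions concrete, the following is a minimal sketch rather than the authors' energy landscape method. It runs plain Lloyd's algorithm from many random starts and collects the distinct cost function values reached. The function names (`kmeans_cost`, `lloyd`, `make_data`, `distinct_minima`), the toy three-blob dataset, and the uniform outlier model are all assumptions made for illustration.

```python
import random


def kmeans_cost(points, centroids, labels):
    """K-means cost: sum of squared distances to the assigned centroid."""
    return sum(
        sum((p - c) ** 2 for p, c in zip(point, centroids[label]))
        for point, label in zip(points, labels)
    )


def lloyd(points, k, rng, max_iter=100):
    """Run Lloyd's algorithm from a random start; return (cost, centroids, labels)."""
    # Initialise centroids at k distinct data points chosen at random.
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(max_iter):
        # Assignment step: nearest centroid by squared Euclidean distance.
        new_labels = [
            min(
                range(k),
                key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centroids[j])),
            )
            for p in points
        ]
        if new_labels == labels:
            break  # assignments unchanged, so a local minimum has been reached
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = [sum(x) / len(members) for x in zip(*members)]
    return kmeans_cost(points, centroids, labels), centroids, labels


def make_data(rng, n_outliers=0):
    """Hypothetical toy data: three Gaussian blobs plus uniform outliers."""
    centres = [(0.0, 0.0), (5.0, 0.0), (0.0, 5.0)]
    pts = [
        (cx + rng.gauss(0, 0.5), cy + rng.gauss(0, 0.5))
        for cx, cy in centres
        for _ in range(30)
    ]
    pts += [(rng.uniform(-20, 20), rng.uniform(-20, 20)) for _ in range(n_outliers)]
    return pts


def distinct_minima(points, k=3, n_starts=50, seed=0):
    """Sorted distinct final costs found over n_starts random restarts."""
    rng = random.Random(seed)
    costs = {round(lloyd(points, k, rng)[0], 6) for _ in range(n_starts)}
    return sorted(costs)


if __name__ == "__main__":
    data_rng = random.Random(1)
    for n_out in (0, 5, 20):
        minima = distinct_minima(make_data(data_rng, n_out))
        print(n_out, "outliers:", len(minima), "distinct final costs")
```

Random restarts only sample the minima that Lloyd's algorithm happens to reach. They say nothing about the pathways or rates between solutions, which the kinetic analysis described above is designed to capture.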
