Shape complexity in cluster analysis

05/17/2022
by Eduardo J. Aguilar, et al.

In cluster analysis, a common first step is to scale the data so that they can be better partitioned into clusters. Although many techniques have been introduced over the years to this end, it is probably fair to say that the workhorse of this preprocessing phase has been division of the data by the standard deviation along each dimension. Like division by the standard deviation, the great majority of scaling techniques are rooted in some statistical view of the data. Here we explore the use of the multidimensional shapes of data to obtain scaling factors for use prior to clustering by a method, such as k-means, that makes explicit use of distances between samples. We borrow from cosmology and related areas the recently introduced notion of shape complexity, which in the variant we use is a relatively simple, data-dependent nonlinear function that we show can help determine appropriate scaling factors. Focusing on what might be called "midrange" distances, we formulate a constrained nonlinear programming problem and use it to produce candidate sets of scaling factors, which can then be sifted on the basis of further considerations of the data, for example expert knowledge. We give results on some iconic data sets, highlighting the strengths and potential weaknesses of the new approach. These results are generally positive across all the data sets used.
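
As a point of reference, the baseline preprocessing step mentioned in the abstract, dividing the data by the standard deviation along each dimension, can be sketched as follows. This is only an illustrative sketch of that baseline, not the shape-complexity method of the paper; the function name `scale_by_std`, the choice of the sample standard deviation, and the toy data are assumptions made for the example.

```python
from statistics import stdev


def scale_by_std(rows):
    """Divide each column of `rows` by its sample standard deviation.

    `rows` is a list of samples, each a list of floats (one per dimension).
    Returns the scaled samples and the per-dimension scaling factors.
    A dimension with zero spread is left unscaled (factor 1.0), which is
    an assumption made here to avoid division by zero.
    """
    columns = list(zip(*rows))
    factors = [stdev(col) or 1.0 for col in columns]
    scaled = [[value / f for value, f in zip(row, factors)] for row in rows]
    return scaled, factors


# Toy data (hypothetical): the second dimension has a much larger spread,
# so it would dominate Euclidean distances in an unscaled k-means run.
rows = [[1.0, 100.0], [2.0, 300.0], [3.0, 200.0], [4.0, 400.0]]
scaled, factors = scale_by_std(rows)
```

After this scaling every dimension has unit sample standard deviation, so no single dimension dominates the distances that a method like k-means relies on. The abstract's proposal is to choose the per-dimension factors differently, based on the shape of the data.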

research
12/22/2019

Pooled scale estimators for scaling prior to cluster analysis

We propose a new approach for scaling prior to cluster analysis based on...
research
12/22/2019

Pooled variable scaling for cluster analysis

We propose a new approach for scaling prior to cluster analysis based on...
research
02/22/2016

Recovering the number of clusters in data sets with noise features using feature rescaling factors

In this paper we introduce three methods for re-scaling data sets aiming...
research
06/08/2020

A Notion of Individual Fairness for Clustering

A common distinction in fair machine learning, in particular in fair cla...
research
12/31/2018

An Analysis of Classical Multidimensional Scaling

Classical multidimensional scaling is an important tool for dimension re...
research
10/24/2018

Modified Multidimensional Scaling and High Dimensional Clustering

Multidimensional scaling is an important dimension reduction tool in sta...
research
08/13/2011

Partition Decomposition for Roll Call Data

In this paper we bring to bear some new tools from statistical learning ...
