Selecting the number of clusters, clustering models, and algorithms. A unifying approach based on the quadratic discriminant score

11/03/2021
by Luca Coraggio, et al.

Cluster analysis requires many decisions: the clustering method and the implied reference model, the number of clusters and, often, several hyper-parameters and algorithm tunings. In practice, one produces several partitions, and a final one is chosen based on validation or selection criteria. There exists an abundance of validation methods that, implicitly or explicitly, assume a certain clustering notion. Moreover, they are often restricted to operate on partitions obtained from a specific method. In this paper, we focus on groups that can be well separated by quadratic or linear boundaries. The reference cluster concept is defined through the quadratic discriminant score function and parameters describing clusters' size, center and scatter. We develop two cluster-quality criteria called quadratic scores. We show that these criteria are consistent with groups generated from a general class of elliptically-symmetric distributions. Groups of this type are commonly sought in applications. The connection with likelihood theory for mixture models and model-based clustering is investigated. Based on bootstrap resampling of the quadratic scores, we propose a selection rule that allows choosing among many clustering solutions. The proposed method has the distinctive advantage that it can compare partitions that cannot be compared with other state-of-the-art methods. Extensive numerical experiments and the analysis of real data show that, even if some competing methods turn out to be superior in some setups, the proposed methodology achieves better overall performance.
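The abstract does not spell out the score, so the following is only a minimal sketch. It assumes the textbook quadratic discriminant score from discriminant analysis, log(size) - 0.5 log det(scatter) - 0.5 (squared Mahalanobis distance to the center), evaluated for each point and each cluster. The function names `quadratic_scores` and `hard_score`, and the idea of averaging each point's best score as a partition-quality summary, are illustrative assumptions and not necessarily the criteria defined in the paper.

```python
import numpy as np


def quadratic_scores(X, weights, means, covs):
    """Return an (n, K) array of quadratic discriminant scores.

    weights, means, covs play the role of cluster size, center and scatter.
    """
    X = np.asarray(X, dtype=float)
    n, K = X.shape[0], len(weights)
    scores = np.empty((n, K))
    for k in range(K):
        diff = X - means[k]
        # Squared Mahalanobis distance; solve rather than invert the scatter matrix.
        maha = np.einsum("ij,ij->i", diff, np.linalg.solve(covs[k], diff.T).T)
        _, logdet = np.linalg.slogdet(covs[k])
        scores[:, k] = np.log(weights[k]) - 0.5 * logdet - 0.5 * maha
    return scores


def hard_score(X, weights, means, covs):
    """Average of each point's best score (assumed summary, for illustration)."""
    return quadratic_scores(X, weights, means, covs).max(axis=1).mean()
```

Under these assumptions, a point is assigned to the cluster with the largest score, and candidate partitions could be compared by computing such a summary on bootstrap resamples of the data.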

