Superclustering by finding statistically significant separable groups of optimal Gaussian clusters

09/05/2023
by Oleg I. Berngardt, et al.

The paper presents an algorithm that clusters a dataset in two levels: it first finds the number of Gaussian clusters that is optimal under the BIC criterion, then groups those clusters into superclusters that are optimal in terms of their statistical separability. The algorithm consists of three stages: (1) representing the dataset as a mixture of Gaussian distributions (clusters), whose number is chosen at the minimum of the BIC criterion; (2) using the Mahalanobis distance to estimate the distances between clusters and the cluster sizes; (3) combining the resulting clusters into superclusters with the DBSCAN method, choosing its hyperparameter (the maximum distance) so that it maximizes the introduced matrix quality criterion at the maximum number of superclusters. The matrix quality criterion is the proportion of statistically significantly separated superclusters among all superclusters found. The algorithm has only one hyperparameter, the statistical significance level, and automatically detects the optimal number and shape of superclusters using a statistical hypothesis testing approach. The algorithm demonstrates good results on test datasets in both noisy and noiseless settings. An essential advantage of the algorithm is that an already trained clusterer can predict the correct supercluster for new data and can perform soft (fuzzy) clustering. The disadvantages of the algorithm are its low speed and the stochastic nature of the final clustering. It also requires a sufficiently large dataset, which is typical for many statistical methods.
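
The three stages can be illustrated with a minimal Python sketch built on scikit-learn. This is not the authors' implementation. The function names, the `max_components` bound, and the way the two directed Mahalanobis distances are symmetrized are assumptions made for illustration. The DBSCAN distance `eps` is supplied by hand here, whereas the paper selects it by maximizing the matrix quality criterion, which this sketch does not implement. Summing mixture responsibilities per supercluster is likewise only one plausible way to obtain soft assignments.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.mixture import GaussianMixture


def fit_gmm_by_bic(X, max_components=8, seed=0):
    """Stage 1: fit mixtures with 1..max_components and keep the lowest-BIC one."""
    best_bic, best_gmm = np.inf, None
    for k in range(1, max_components + 1):
        gmm = GaussianMixture(n_components=k, covariance_type="full",
                              random_state=seed).fit(X)
        bic = gmm.bic(X)
        if bic < best_bic:
            best_bic, best_gmm = bic, gmm
    return best_gmm


def mahalanobis_matrix(gmm):
    """Stage 2: pairwise Mahalanobis distances between cluster centres.

    Assumption: the distance from i to j uses the covariance of cluster i,
    and the matrix is symmetrized by taking the smaller of the two directions.
    """
    means = gmm.means_
    inv_covs = [np.linalg.inv(c) for c in gmm.covariances_]
    k = len(means)
    D = np.zeros((k, k))
    for i in range(k):
        for j in range(k):
            diff = means[j] - means[i]
            D[i, j] = np.sqrt(diff @ inv_covs[i] @ diff)
    return np.minimum(D, D.T)


def group_clusters(D, eps):
    """Stage 3: merge clusters closer than eps into superclusters with DBSCAN.

    eps is fixed by the caller here (hypothetical simplification).
    """
    return DBSCAN(eps=eps, min_samples=1, metric="precomputed").fit_predict(D)


def predict_supercluster(gmm, labels, X_new):
    """Hard prediction: map each point's Gaussian cluster to its supercluster."""
    return labels[gmm.predict(X_new)]


def soft_supercluster_proba(gmm, labels, X_new):
    """Soft prediction (assumption): sum responsibilities within a supercluster."""
    resp = gmm.predict_proba(X_new)
    proba = np.zeros((len(X_new), labels.max() + 1))
    for component, supercluster in enumerate(labels):
        proba[:, supercluster] += resp[:, component]
    return proba
```

A typical call sequence would be `gmm = fit_gmm_by_bic(X)`, `labels = group_clusters(mahalanobis_matrix(gmm), eps)`, and then `predict_supercluster(gmm, labels, X_new)` for new data. Because `labels` has one entry per Gaussian component, a new point is assigned through its component, without refitting.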

