Statistical power for cluster analysis

03/01/2020
by E. S. Dalmaijer, et al.

Clustering algorithms are gaining in popularity due to their compelling ability to identify discrete subgroups in data, and their increasing accessibility in mainstream programming languages and statistical software. While researchers can follow guidelines to choose the right algorithms and to determine what constitutes convincing clustering, there are no firmly established ways of computing a priori statistical power for cluster analysis. Here, we take a simulation approach to estimate power and classification accuracy for popular analysis pipelines. We systematically varied cluster size, number of clusters, number of differing features between clusters, effect size within each differing feature, and cluster covariance structure in generated datasets. We then subjected these datasets to common dimensionality reduction approaches (none, multi-dimensional scaling, or uniform manifold approximation and projection) and clustering algorithms (k-means; hierarchical agglomerative clustering with Ward linkage and Euclidean distance, or with average linkage and cosine distance; or HDBSCAN). Furthermore, we simulated additional datasets to explore the effect of sample size and cluster separation on statistical power and classification accuracy. We found that clustering outcomes were driven by large effect sizes or the accumulation of many smaller effects across features, and were mostly unaffected by differences in covariance structure. Sufficient statistical power can be achieved with relatively small samples (N=20 per subgroup), provided cluster separation is large (Δ=4). Finally, we discuss whether fuzzy clustering (c-means) could provide a more parsimonious alternative for identifying separable multivariate normal distributions, particularly those with lower centroid separation.
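The simulation approach described above can be sketched in a few lines: generate multivariate normal clusters with a known separation, run a clustering pipeline, and count how often the planted structure is recovered. The sketch below is a minimal, hypothetical illustration using scikit-learn's k-means and the adjusted Rand index as the recovery criterion; the function name, default parameters, and the 0.8 recovery threshold are illustrative assumptions, not the authors' exact pipeline.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(0)

def simulate_power(n_per_cluster=20, n_features=5, delta=4.0,
                   n_sims=100, threshold=0.8):
    """Estimate statistical power as the proportion of simulated datasets
    in which k-means recovers the planted two-cluster structure
    (adjusted Rand index >= threshold). Illustrative sketch only."""
    hits = 0
    for _ in range(n_sims):
        # Two spherical multivariate normal clusters whose centroids
        # differ by `delta` (in SD units) on every feature.
        a = rng.normal(0.0, 1.0, size=(n_per_cluster, n_features))
        b = rng.normal(delta, 1.0, size=(n_per_cluster, n_features))
        X = np.vstack([a, b])
        y = np.array([0] * n_per_cluster + [1] * n_per_cluster)
        labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
        if adjusted_rand_score(y, labels) >= threshold:
            hits += 1
    return hits / n_sims
```

With Δ=4 on each of five features the clusters are very well separated, so estimated power approaches 1 even at N=20 per subgroup, in line with the abstract's conclusion; shrinking `delta` or `n_features` shows power falling off.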


