Statistical power for cluster analysis

03/01/2020
by E. S. Dalmaijer, et al.

Cluster algorithms are gaining in popularity due to their compelling ability to identify discrete subgroups in data, and their increasing accessibility in mainstream programming languages and statistical software. While researchers can follow guidelines to choose the right algorithms and to determine what constitutes convincing clustering, there are no firmly established ways of computing a priori statistical power for cluster analysis. Here, we take a simulation approach to estimate power and classification accuracy for popular analysis pipelines. We systematically varied cluster size, number of clusters, number of different features between clusters, effect size within each different feature, and cluster covariance structure in generated datasets. We then subjected these datasets to common dimensionality reduction approaches (none, multi-dimensional scaling, or uniform manifold approximation and projection) and cluster algorithms (k-means; hierarchical agglomerative clustering with Ward linkage and Euclidean distance, or with average linkage and cosine distance; and HDBSCAN). Furthermore, we simulated additional datasets to explore the effect of sample size and cluster separation on statistical power and classification accuracy. We found that clustering outcomes were driven by large effect sizes or the accumulation of many smaller effects across features, and were mostly unaffected by differences in covariance structure. Sufficient statistical power can be achieved with relatively small samples (N = 20 per subgroup), provided cluster separation is large (Δ = 4). Finally, we discuss whether fuzzy clustering (c-means) could provide a more parsimonious alternative for identifying separable multivariate normal distributions, particularly those with lower centroid separation.
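The simulation approach described above can be illustrated with a short Monte Carlo sketch. The snippet below is not the authors' code: it assumes power is defined as the proportion of simulated datasets in which a clustering pipeline (here, plain k-means with no dimensionality reduction) recovers the planted two-cluster structure above a chosen accuracy criterion, with cluster separation Δ applied along a single feature. The function names `simulate_dataset` and `estimate_power`, and the adjusted-Rand-index criterion of 0.9, are hypothetical choices for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score


def simulate_dataset(n_per_cluster=20, n_features=4, delta=4.0, rng=None):
    """Generate two multivariate normal clusters whose centroids differ
    by `delta` standard deviations along the first feature."""
    if rng is None:
        rng = np.random.default_rng()
    a = rng.normal(0.0, 1.0, size=(n_per_cluster, n_features))
    b = rng.normal(0.0, 1.0, size=(n_per_cluster, n_features))
    b[:, 0] += delta  # centroid separation on one feature only
    X = np.vstack([a, b])
    y = np.repeat([0, 1], n_per_cluster)  # ground-truth labels
    return X, y


def estimate_power(n_sims=1000, accuracy_criterion=0.9, **kwargs):
    """Estimate power as the proportion of simulations in which k-means
    recovers the true two-cluster structure above the criterion,
    measured here with the adjusted Rand index (a hypothetical choice)."""
    rng = np.random.default_rng(1)
    hits = 0
    for _ in range(n_sims):
        X, y = simulate_dataset(rng=rng, **kwargs)
        labels = KMeans(n_clusters=2, n_init=10).fit_predict(X)
        if adjusted_rand_score(y, labels) >= accuracy_criterion:
            hits += 1
    return hits / n_sims


# Large separation (Δ = 4) should yield high power even at N = 20 per
# subgroup; small separation (Δ = 1) should not.
print(estimate_power(n_per_cluster=20, delta=4.0))
print(estimate_power(n_per_cluster=20, delta=1.0))
```

Other pipelines from the abstract could be swapped in at the clustering step (e.g., `sklearn.cluster.AgglomerativeClustering` with Ward or average linkage, or HDBSCAN from the `hdbscan` package), and a dimensionality reduction step (MDS or UMAP) could be applied to `X` beforehand, without changing the power-estimation loop.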
