Powerful Significance Testing for Unbalanced Clusters

08/24/2023
by Thomas H. Keefe, et al.

Clustering methods are popular for revealing structure in data, particularly in the high-dimensional setting common to contemporary data science. A central statistical question is, "are the clusters really there?" One pioneering method in statistical cluster validation is SigClust, but it is severely underpowered in the important setting where the candidate clusters have unbalanced sizes, such as in rare subtypes of disease. We show why this is the case, and propose a remedy that is powerful in both the unbalanced and balanced settings, using a novel generalization of k-means clustering. We illustrate the value of our method using a high-dimensional dataset of gene expression in kidney cancer patients. A Python implementation is available at https://github.com/thomaskeefe/sigclust.
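
As a rough illustration of the kind of test the abstract describes, the sketch below computes a 2-means cluster index (within-cluster sum of squares divided by total sum of squares) and compares it with values simulated from a single Gaussian null. It is not the authors' implementation and does not use the API of the linked sigclust package. All function names are hypothetical, and the null here uses per-feature sample variances as a simplifying assumption rather than an estimated covariance eigenvalue spectrum. It does not include the generalization of k-means that the paper proposes for unbalanced clusters.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def centroid(pts):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(pts)
    return [sum(col) / n for col in zip(*pts)]


def cluster_index(points, rng, n_init=3, n_iter=50):
    """Best 2-means within-cluster SS divided by total SS (smaller = stronger split)."""
    total_ss = sum(sq_dist(p, centroid(points)) for p in points)
    best = total_ss
    for _ in range(n_init):
        centers = rng.sample(points, 2)  # random restart from two data points
        for _ in range(n_iter):
            groups = ([], [])
            for p in points:
                nearer = 0 if sq_dist(p, centers[0]) <= sq_dist(p, centers[1]) else 1
                groups[nearer].append(p)
            # Keep the old center if a cluster happens to be empty.
            new_centers = [centroid(g) if g else c for g, c in zip(groups, centers)]
            if new_centers == centers:
                break
            centers = new_centers
        within = sum(min(sq_dist(p, centers[0]), sq_dist(p, centers[1])) for p in points)
        best = min(best, within)
    return best / total_ss


def gaussian_null_pvalue(points, n_sim=50, seed=0):
    """Monte Carlo p-value of the observed cluster index under a Gaussian null.

    Assumption: the null is a single Gaussian with independent coordinates whose
    variances are the per-feature sample variances of the data.
    """
    rng = random.Random(seed)
    observed = cluster_index(points, rng)
    n = len(points)
    means = centroid(points)
    sds = [
        (sum((x - m) ** 2 for x in col) / (n - 1)) ** 0.5
        for col, m in zip(zip(*points), means)
    ]
    hits = 0
    for _ in range(n_sim):
        sim = [[rng.gauss(0.0, s) for s in sds] for _ in range(n)]
        if cluster_index(sim, rng) <= observed:
            hits += 1
    # Add-one correction so the estimated p-value is never exactly zero.
    return observed, (hits + 1) / (n_sim + 1)


if __name__ == "__main__":
    demo_rng = random.Random(1)
    group_a = [[demo_rng.gauss(0, 1), demo_rng.gauss(0, 1)] for _ in range(20)]
    group_b = [[demo_rng.gauss(10, 1), demo_rng.gauss(0, 1)] for _ in range(20)]
    ci, p = gaussian_null_pvalue(group_a + group_b)
    print("cluster index:", ci, "p-value:", p)
```

A small p-value means the observed 2-means split is tighter than splits typically found in unclustered Gaussian data of the same size and scale. The sketch is pure Python and intended for small examples only; high-dimensional data such as gene expression would call for a vectorized implementation.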


research
03/18/2021

Outcome-guided Sparse K-means for Disease Subtype Discovery via Integrating Phenotypic Data with High-dimensional Transcriptomic Data

The discovery of disease subtypes is an essential step for developing pr...
research
10/05/2016

Non-Parametric Cluster Significance Testing with Reference to a Unimodal Null Distribution

Cluster analysis is an unsupervised learning strategy that can be employ...
research
04/30/2023

A new clustering framework

Detection of clusters is a crucial task across many disciplines such as ...
research
03/28/2023

Genetic Analysis of Prostate Cancer with Computer Science Methods

Metastatic prostate cancer is one of the most common cancers in men. In ...
research
09/26/2019

CS Sparse K-means: An Algorithm for Cluster-Specific Feature Selection in High-Dimensional Clustering

Feature selection is an important and challenging task in high dimension...
research
05/25/2023

Metrics for quantifying isotropy in high dimensional unsupervised clustering tasks in a materials context

Clustering is a common task in machine learning, but clusters of unlabel...
research
11/09/2020

Stratification of Systemic Lupus Erythematosus Patients Using Gene Expression Data to Reveal Expression of Distinct Immune Pathways

Systemic lupus erythematosus (SLE) is the tenth leading cause of death i...
