Simultaneous Estimation of Number of Clusters and Feature Sparsity in Clustering High-Dimensional Data

09/04/2019
by Yujia Li, et al.

Estimating the number of clusters (K) is a critical and often difficult task in cluster analysis. Many methods have been proposed to estimate K, including some top performers that use a resampling approach. When performing cluster analysis on high-dimensional data, simultaneous clustering and feature selection is needed for improved interpretation and performance. To our knowledge, no study has investigated simultaneous estimation of K and feature selection in an exploratory cluster analysis. In this paper, we propose a resampling method to fill this gap and evaluate its performance under the sparse K-means clustering framework. The proposed target function balances the sensitivity and specificity of the pairwise clustering of subjects, comparing the clustering of the full data with that of subsampled data. In extensive simulations, the method is among the best performers relative to classical methods in estimating K in low-dimensional data. For high-dimensional simulated data, it also shows superior performance in simultaneously estimating K and the feature sparsity parameter. Finally, we evaluated the methods on four microarray, two RNA-seq, one SNP and two non-omics datasets. The proposed method achieves better clustering accuracy with fewer selected predictive genes in almost all real applications.
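
To make the idea of pairwise sensitivity and specificity concrete, the sketch below compares two label vectors for the same subjects: one taken from a clustering of the full data and one from a clustering of a subsample. It is only an illustration under stated assumptions, not the paper's target function. The function names (`pairwise_agreement`, `balanced_score`) are hypothetical, and combining the two rates as sensitivity + specificity - 1 is an assumed choice, since the abstract does not say how the two quantities are balanced.

```python
from itertools import combinations


def pairwise_agreement(full_labels, sub_labels):
    """Pairwise sensitivity and specificity between two clusterings.

    full_labels: cluster labels of the subsampled subjects taken from the
                 clustering of the full data (treated as the reference).
    sub_labels:  cluster labels of the same subjects from clustering the
                 subsample alone.
    Both are hypothetical inputs for illustration.
    """
    if len(full_labels) != len(sub_labels):
        raise ValueError("label vectors must have the same length")

    tp = fn = tn = fp = 0
    for i, j in combinations(range(len(full_labels)), 2):
        same_full = full_labels[i] == full_labels[j]
        same_sub = sub_labels[i] == sub_labels[j]
        if same_full and same_sub:
            tp += 1  # pair kept together in both clusterings
        elif same_full:
            fn += 1  # together in the reference, split in the subsample
        elif same_sub:
            fp += 1  # split in the reference, merged in the subsample
        else:
            tn += 1  # pair kept apart in both clusterings

    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity


def balanced_score(full_labels, sub_labels):
    """Assumed combination: sensitivity + specificity - 1."""
    sensitivity, specificity = pairwise_agreement(full_labels, sub_labels)
    return sensitivity + specificity - 1.0
```

In a resampling scheme one would average such a score over many subsamples for each candidate K (and, in the sparse setting, each candidate sparsity parameter) and prefer the most stable setting. Because the comparison is made on pairs, the score does not depend on how cluster labels are numbered.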


research
03/05/2012

Subspace clustering of high-dimensional data: a predictive approach

In several application domains, high-dimensional observations are collec...
research
01/01/2020

Toward Generalized Clustering through an One-Dimensional Approach

After generalizing the concept of clusters to incorporate clusters that ...
research
07/21/2022

Fast Data Driven Estimation of Cluster Number in Multiplex Images using Embedded Density Outliers

The usage of chemical imaging technologies is becoming a routine accompa...
research
05/27/2023

Dynamic User Segmentation and Usage Profiling

Usage data of a group of users distributed across a number of categories...
research
10/27/2022

Clustering High-dimensional Data via Feature Selection

High-dimensional clustering analysis is a challenging problem in statist...
research
09/26/2019

CS Sparse K-means: An Algorithm for Cluster-Specific Feature Selection in High-Dimensional Clustering

Feature selection is an important and challenging task in high dimension...
research
02/07/2023

Sparse GEMINI for Joint Discriminative Clustering and Feature Selection

Feature selection in clustering is a hard task which involves simultaneo...
