Selective inference for clustering with unknown variance

by Youngjoo Yun, et al.

In many modern statistical problems, the limited available data must be used both to develop hypotheses and to test them; that is, the same data serve both exploratory and confirmatory analysis. Reusing a dataset for both exploration and testing can introduce severe selection bias and lead to many false discoveries. Selective inference is a framework for performing valid inference even when the same data are reused for exploration and testing. In this work, we consider selective inference for data clustering: a clustering procedure is used to hypothesize a separation of the data points into a collection of subgroups, and we then wish to test whether these data-dependent clusters in fact represent meaningful differences within the data. Recent work by Gao et al. [2022] provides a selective inference framework for this setting when hierarchical clustering is used to produce the cluster assignments, and Chen and Witten [2022] extended this framework to k-means clustering. Both works assume a known covariance structure for the data, but in practice the noise level must be estimated, and this is particularly challenging when the true cluster structure is unknown. We extend this line of work to the setting of noise with unknown variance and provide a selective inference method for this more general setting. Empirical results show that our new method is better able to maintain high power while controlling Type I error when the true noise level is unknown.
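The selection bias described in the abstract can be made concrete with a small simulation. The sketch below is an illustration of the problem only, not the selective inference method of this paper or of the cited works. It assumes one-dimensional standard normal data with no true clusters, a plain Lloyd-style 2-means routine, and a naive two-sample z-test with a normal approximation; the function names (`two_means_1d`, `naive_p_value`, `rejection_rates`) are hypothetical and chosen for this example.

```python
import math
import numpy as np


def two_means_1d(x, n_iter=50):
    """Lloyd iterations for k = 2 on 1-D data; returns boolean labels."""
    c0, c1 = x.min(), x.max()  # deterministic start at the two extremes
    for _ in range(n_iter):
        labels = np.abs(x - c1) < np.abs(x - c0)  # True = nearer to c1
        new_c0, new_c1 = x[~labels].mean(), x[labels].mean()
        if new_c0 == c0 and new_c1 == c1:
            break  # centers stopped moving
        c0, c1 = new_c0, new_c1
    return labels


def naive_p_value(x, labels):
    """Two-sided z-test for a difference in group means.

    Uses a normal approximation and ignores how the labels were chosen.
    """
    a, b = x[labels], x[~labels]
    se = math.sqrt(a.var(ddof=1) / len(a) + b.var(ddof=1) / len(b))
    z = (a.mean() - b.mean()) / se
    return math.erfc(abs(z) / math.sqrt(2))


def rejection_rates(n_reps=500, n=100, alpha=0.05, seed=0):
    """Fraction of null datasets rejected at level alpha.

    Returns (rate with data-driven labels, rate with random labels).
    """
    rng = np.random.default_rng(seed)
    naive_hits = control_hits = 0
    for _ in range(n_reps):
        x = rng.standard_normal(n)  # a single Gaussian: no real clusters
        naive_hits += naive_p_value(x, two_means_1d(x)) < alpha
        # Control: labels drawn independently of the data.
        random_labels = rng.random(n) < 0.5
        control_hits += naive_p_value(x, random_labels) < alpha
    return naive_hits / n_reps, control_hits / n_reps


if __name__ == "__main__":
    naive, control = rejection_rates()
    print(f"rejection rate, clusters found from the data: {naive:.3f}")
    print(f"rejection rate, labels independent of data:   {control:.3f}")
```

Under these assumptions, the test applied to data-driven clusters rejects far more often than the nominal 5% level even though every dataset is pure noise, while the same test with labels chosen independently of the data stays near the nominal level. This gap is what selective inference is designed to correct.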



Selective inference for k-means clustering

We consider the problem of testing for a difference in means between clu...

Selective inference after convex clustering with ℓ_1 penalization

Classical inference methods notoriously fail when applied to data-driven...

Selective Inference for Latent Block Models

Model selection in latent block models has been a challenging but import...

Selective Clustering Annotated using Modes of Projections

Selective clustering annotated using modes of projections (SCAMP) is a n...

How much does your data exploration overfit? Controlling bias via information usage

Modern data is messy and high-dimensional, and it is often not clear a p...

Inference for Dependent Data with Learned Clusters

This paper presents and analyzes an approach to cluster-based inference ...

Selective inference for the problem of regions via multiscale bootstrap

Selective inference procedures are considered for computing approximatel...
