Post-clustering difference testing: valid inference and practical considerations

10/24/2022
by Benjamin Hivert, et al.

Clustering is an unsupervised analysis method that groups observations into homogeneous, well-separated subgroups called clusters. To interpret the clusters, statistical hypothesis testing is often used to identify the variables that significantly separate the estimated clusters from each other. However, the hypotheses tested in this way are data-driven, since they are derived from the clustering results themselves. This double use of the data causes traditional hypothesis tests to fail to control the Type I error rate, in particular because of the uncertainty in the clustering process and the artificial differences it can create between clusters. We propose three novel statistical hypothesis tests that account for the clustering process. Our tests efficiently control the Type I error rate by identifying only variables that contain a true signal separating groups of observations.
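The Type I error inflation described above can be reproduced with a small simulation. The sketch below is only an illustration of the problem, not one of the three tests proposed by the authors. It assumes a single Gaussian variable with no cluster structure, a hand-written one-dimensional 2-means clustering, and a large-sample Welch z-test as the "traditional" test; all function names and parameter values are hypothetical choices made for this example.

```python
import random
from statistics import NormalDist, mean, variance


def two_means_1d(x, n_iter=50):
    """Lloyd's algorithm with k=2 on a one-dimensional sample."""
    c0, c1 = min(x), max(x)  # start the two centers at the extremes
    g0, g1 = [], []
    for _ in range(n_iter):
        g0 = [v for v in x if abs(v - c0) <= abs(v - c1)]
        g1 = [v for v in x if abs(v - c0) > abs(v - c1)]
        new_c0, new_c1 = mean(g0), mean(g1)
        if (new_c0, new_c1) == (c0, c1):  # assignments are stable
            break
        c0, c1 = new_c0, new_c1
    return g0, g1


def naive_p_value(g0, g1):
    """Two-sided Welch z-test (normal approximation) that ignores
    how the two groups were obtained."""
    if len(g0) < 2 or len(g1) < 2:
        return 1.0  # not enough data to estimate a variance
    se = (variance(g0) / len(g0) + variance(g1) / len(g1)) ** 0.5
    z = (mean(g0) - mean(g1)) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


def rejection_rate(split, n_sim=200, n=100, alpha=0.05, seed=0):
    """Share of null simulations in which the naive test rejects."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_sim):
        # Null data: one homogeneous Gaussian population, no true clusters.
        x = [rng.gauss(0.0, 1.0) for _ in range(n)]
        g0, g1 = split(x)
        if naive_p_value(g0, g1) < alpha:
            rejections += 1
    return rejections / n_sim


def fixed_split(x):
    """Control: groups defined without looking at the values."""
    half = len(x) // 2
    return x[:half], x[half:]


rate_after_clustering = rejection_rate(two_means_1d)
rate_fixed_groups = rejection_rate(fixed_split)
print("rejection rate, groups from clustering:", rate_after_clustering)
print("rejection rate, groups fixed in advance:", rate_fixed_groups)
```

When the groups are fixed without looking at the values, the rejection rate should stay close to the nominal level of 0.05. When the groups come from clustering the same variable that is then tested, the test rejects in nearly every simulation even though no true signal separates the observations.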


