Detecting confounding due to subject identification in clinical machine learning diagnostic applications: a permutation test approach

12/08/2017
by Elias Chaibub Neto, et al.

Recently, Saeb et al. (2017) showed that, in diagnostic machine learning applications, randomly assigning the records of each subject to both the training and test sets (record-wise data split) can lead to massive underestimation of the cross-validation prediction error. The cause is "subject identity confounding": the classifier learns to identify subjects instead of recognizing disease. To solve this problem, the authors recommended randomly assigning all the data of each subject to either the training set or the test set (subject-wise data split). The adoption of subject-wise splitting was criticized by Little et al. (2017) on the basis that it can violate assumptions required by cross-validation to consistently estimate generalization error. In particular, subject-wise splitting in heterogeneous datasets might lead to model under-fitting and larger classification errors. Hence, Little et al. argue that the discrepancies between the prediction error estimates generated by the two splitting strategies may be explained by overestimation of prediction error under subject-wise cross-validation, rather than by underestimation under record-wise cross-validation. To shed light on this controversy, we focus on simpler classification performance metrics and develop permutation tests that can detect identity confounding. By focusing on permutation tests, we are able to evaluate the merits of record-wise and subject-wise data splits under more general statistical dependencies and distributional structures of the data, including situations where cross-validation breaks down. We illustrate the application of our tests using synthetic and real data from a Parkinson's disease study.
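The two splitting strategies, and the general idea of probing identity confounding with a permutation, can be illustrated with a small synthetic sketch. It is not the authors' procedure, which the abstract does not spell out. Everything in it is an assumption made for illustration: the function names, the synthetic data (a per-subject feature signature with no disease signal at all), the 1-nearest-neighbor classifier, and the choice to shuffle disease labels across whole subjects so that every record of a subject keeps a common label.

```python
import random

def make_records(n_subjects=20, n_records=10, n_features=8, seed=0):
    """Hypothetical data: features encode who the subject is, not the disease."""
    rng = random.Random(seed)
    rows = []  # each row is (subject_id, disease_label, feature_vector)
    for subject in range(n_subjects):
        label = subject % 2  # half cases, half controls
        signature = [rng.gauss(0, 1) for _ in range(n_features)]
        for _ in range(n_records):
            rows.append((subject, label, [m + rng.gauss(0, 0.05) for m in signature]))
    return rows

def record_wise_split(rows, rng):
    """Every subject contributes records to both the training and the test set."""
    by_subject = {}
    for row in rows:
        by_subject.setdefault(row[0], []).append(row)
    train, test = [], []
    for records in by_subject.values():
        records = records[:]
        rng.shuffle(records)
        half = len(records) // 2
        train += records[:half]
        test += records[half:]
    return train, test

def subject_wise_split(rows, rng):
    """Each subject lands entirely in the training set or entirely in the test set."""
    subjects = sorted({row[0] for row in rows})
    rng.shuffle(subjects)
    test_ids = set(subjects[: len(subjects) // 2])
    train = [row for row in rows if row[0] not in test_ids]
    test = [row for row in rows if row[0] in test_ids]
    return train, test

def nn_accuracy(train, test):
    """Accuracy of a 1-nearest-neighbor classifier (squared Euclidean distance)."""
    def dist(a, b):
        return sum((u - v) ** 2 for u, v in zip(a, b))
    hits = 0
    for _, label, x in test:
        nearest = min(train, key=lambda row: dist(row[2], x))
        hits += nearest[1] == label
    return hits / len(test)

def shuffle_labels_by_subject(rows, rng):
    """Permute disease labels across subjects, keeping one label per subject."""
    subjects = sorted({row[0] for row in rows})
    label_of = {row[0]: row[1] for row in rows}
    shuffled = [label_of[s] for s in subjects]
    rng.shuffle(shuffled)
    new_label = dict(zip(subjects, shuffled))
    return [(s, new_label[s], x) for s, _, x in rows]

def run_demo(n_permutations=20, seed=1):
    rng = random.Random(seed)
    rows = make_records()
    record_acc = nn_accuracy(*record_wise_split(rows, rng))
    subject_acc = nn_accuracy(*subject_wise_split(rows, rng))
    # Labels shuffled across subjects carry no disease information, so any
    # accuracy that survives the shuffle must come from recognizing subjects.
    permuted_accs = [
        nn_accuracy(*record_wise_split(shuffle_labels_by_subject(rows, rng), rng))
        for _ in range(n_permutations)
    ]
    return record_acc, subject_acc, permuted_accs

if __name__ == "__main__":
    record_acc, subject_acc, permuted_accs = run_demo()
    print("record-wise accuracy:", record_acc)
    print("subject-wise accuracy:", subject_acc)
    print("record-wise accuracy after subject-wise label shuffling:", permuted_accs)
```

In this toy setting the record-wise accuracy stays high even after the labels are shuffled across subjects, which is the signature of a classifier that identifies subjects rather than disease. The subject-wise estimate has no such shortcut available, although, as Little et al. note, it may carry biases of its own in heterogeneous data.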


