Confound-leakage: Confound Removal in Machine Learning Leads to Leakage

10/17/2022
by Sami Hamdan, et al.

Machine learning (ML) approaches to data analysis are now widely adopted in many fields, including epidemiology and medicine. To apply these approaches, confounds must first be removed, which is commonly done by featurewise removal of their variance via linear regression before applying ML. Here, we show that this common approach to confound removal biases ML models, leading to misleading results. Specifically, it can leak information such that null or moderate effects become amplified to near-perfect prediction when nonlinear ML approaches are subsequently applied. We identify and evaluate possible mechanisms for such confound-leakage and provide practical guidance to mitigate its negative impact. We demonstrate the real-world importance of confound-leakage by analyzing a clinical dataset in which accuracy is overestimated for predicting attention deficit hyperactivity disorder (ADHD) with depression as a confound. Our results have wide-reaching implications for the implementation and deployment of ML workflows and call for caution against naïve use of standard confound removal approaches.
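
To make the procedure under discussion concrete, the following is a minimal sketch of featurewise confound removal by linear regression, the step that precedes the ML model. It is not the authors' code. The function names `fit_confound_regression` and `remove_confound`, the synthetic data, and the train/test split are all assumptions made for illustration. The sketch shows only the deconfounding step and does not attempt to reproduce the leakage effect reported in the paper.

```python
import numpy as np

def fit_confound_regression(X, c):
    """Fit one linear regression per feature column of X on the confound c.

    Hypothetical helper: returns coefficients of shape (2, n_features),
    where row 0 is the intercept and row 1 is the slope on the confound.
    """
    C = np.column_stack([np.ones(len(c)), c])  # design matrix: intercept + confound
    beta, *_ = np.linalg.lstsq(C, X, rcond=None)
    return beta

def remove_confound(X, c, beta):
    """Subtract the confound-predicted part from every feature (residualize)."""
    C = np.column_stack([np.ones(len(c)), c])
    return X - C @ beta

# Synthetic data (assumption): feature 0 depends on the confound, feature 1 does not.
rng = np.random.default_rng(0)
n = 200
confound = rng.normal(size=n)
X = np.column_stack([
    2.0 * confound + rng.normal(size=n),
    rng.normal(size=n),
])

# Fit the confound model on the training split only, then apply it to both splits.
train, test = np.arange(0, 150), np.arange(150, n)
beta = fit_confound_regression(X[train], confound[train])
X_train_res = remove_confound(X[train], confound[train], beta)
X_test_res = remove_confound(X[test], confound[test], beta)
```

After this step the training residuals are linearly uncorrelated with the confound by construction. The abstract's warning is that this linear guarantee does not prevent a nonlinear learner applied afterwards from exploiting information the procedure leaves behind or introduces.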


