Leakage and the Reproducibility Crisis in ML-based Science

07/14/2022
by Sayash Kapoor, et al.

The use of machine learning (ML) methods for prediction and forecasting has become widespread across the quantitative sciences. However, there are many known methodological pitfalls, including data leakage, in ML-based science. In this paper, we systematically investigate reproducibility issues in ML-based science. We show that data leakage is indeed a widespread problem and has led to severe reproducibility failures. Specifically, through a survey of literature in research communities that adopted ML methods, we find 17 fields where errors have been found, collectively affecting 329 papers and in some cases leading to wildly overoptimistic conclusions. Based on our survey, we present a fine-grained taxonomy of 8 types of leakage that range from textbook errors to open research problems. We argue for fundamental methodological changes to ML-based science so that cases of leakage can be caught before publication. To that end, we propose model info sheets for reporting scientific claims based on ML models that would address all types of leakage identified in our survey. To investigate the impact of reproducibility errors and the efficacy of model info sheets, we undertake a reproducibility study in a field where complex ML models are believed to vastly outperform older statistical models such as Logistic Regression (LR): civil war prediction. We find that all papers claiming the superior performance of complex ML models compared to LR models fail to reproduce due to data leakage, and complex ML models don't perform substantively better than decades-old LR models. While none of these errors could have been caught by reading the papers, model info sheets would enable the detection of leakage in each case.
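
The abstract does not enumerate the eight leakage types. As a purely illustrative sketch, the Python snippet below shows one commonly cited textbook form of leakage: estimating preprocessing statistics on the training and test data together. The toy values and the helper names `fit_scaler` and `transform` are hypothetical and are not taken from the paper.

```python
from statistics import mean, pstdev


def fit_scaler(values):
    """Return (mean, population std) estimated from the given values only."""
    return mean(values), pstdev(values)


def transform(values, params):
    """Standardize values using previously estimated (mean, std)."""
    mu, sigma = params
    return [(v - mu) / sigma for v in values]


# Hypothetical toy data: the held-out point lies far outside the training range.
train = [1.0, 2.0, 3.0]
test = [10.0]

# Leaky: statistics are computed on train + test, so the held-out point
# influences how every sample is represented.
leaky_params = fit_scaler(train + test)

# Leakage-free: statistics come from the training split only and are
# then reused, unchanged, on the test split.
clean_params = fit_scaler(train)

leaky_test = transform(test, leaky_params)
clean_test = transform(test, clean_params)

print("leaky mean/std:", leaky_params)
print("clean mean/std:", clean_params)
print("test point, leaky vs clean scaling:", leaky_test[0], clean_test[0])
```

In the leaky variant the unusual test point looks far less extreme after scaling, because it has already shifted the statistics used to scale it. In a real analysis, information flowing from held-out data in this way can make reported performance look better than it would be on genuinely unseen data.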


