When More is Less: Incorporating Additional Datasets Can Hurt Performance By Introducing Spurious Correlations

08/08/2023
by Rhys Compton, et al.

In machine learning, incorporating more data is often seen as a reliable strategy for improving model performance; this work challenges that notion by demonstrating that the addition of external datasets can, in many cases, hurt the resulting model's performance. In a large-scale empirical study across combinations of four different open-source chest x-ray datasets and nine different labels, we demonstrate that in 43% of settings, a model trained on data from two hospitals has poorer worst-group accuracy over both hospitals than a model trained on just a single hospital's data. This surprising result occurs even though the added hospital makes the training distribution more similar to the test distribution. We explain that this phenomenon arises from the spurious correlation that emerges between the disease and the hospital, due to hospital-specific image artifacts. We highlight the trade-off one encounters when training on multiple datasets, between the obvious benefit of additional data and the insidious cost of the introduced spurious correlation. In some cases, balancing the dataset can remove the spurious correlation and improve performance, but it is not always an effective strategy. We contextualize our results within the literature on spurious correlations to help explain these outcomes. Our experiments underscore the importance of exercising caution when selecting training data for machine learning models, especially in settings where there is a risk of spurious correlations, such as medical imaging. The risks outlined highlight the need for careful data selection and model evaluation in future research and practice.
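
As a rough illustration of the quantities the abstract refers to, the Python sketch below computes worst-group accuracy on a toy two-hospital dataset and shows one simple way to balance it. Everything here is an assumption made for illustration: the function names, the choice of (hospital, label) pairs as groups, the toy prevalences, and the subsampling scheme are hypothetical and are not taken from the paper's code or exact procedure.

```python
import random
from collections import defaultdict


def accuracy(labels, preds):
    """Plain average accuracy over all examples."""
    return sum(int(y == p) for y, p in zip(labels, preds)) / len(labels)


def worst_group_accuracy(labels, preds, hospitals):
    """Minimum accuracy over groups; groups are assumed to be (hospital, label) pairs."""
    correct, total = defaultdict(int), defaultdict(int)
    for y, p, h in zip(labels, preds, hospitals):
        total[(h, y)] += 1
        correct[(h, y)] += int(y == p)
    return min(correct[g] / total[g] for g in total)


def balanced_indices(labels, hospitals, seed=0):
    """Subsample every (hospital, label) group to the size of the smallest one.

    This equalizes disease prevalence across hospitals, so hospital identity
    no longer predicts the label. It is one simple balancing scheme, assumed here.
    """
    groups = defaultdict(list)
    for i, (y, h) in enumerate(zip(labels, hospitals)):
        groups[(h, y)].append(i)
    n = min(len(idx) for idx in groups.values())
    rng = random.Random(seed)
    return sorted(i for idx in groups.values() for i in rng.sample(idx, n))


# Toy data (hypothetical): hospital A has 90% disease prevalence, hospital B has 10%.
hospitals = ["A"] * 10 + ["B"] * 10
labels = [1] * 9 + [0] + [1] + [0] * 9

# A "shortcut" model that ignores the image and predicts disease from the hospital alone.
shortcut_preds = [1 if h == "A" else 0 for h in hospitals]

avg_acc = accuracy(labels, shortcut_preds)
wg_acc = worst_group_accuracy(labels, shortcut_preds, hospitals)
kept = balanced_indices(labels, hospitals)
```

In this toy setting the shortcut model looks strong on average yet fails completely on the minority groups, which is the kind of gap that worst-group accuracy is meant to expose. The balanced subset removes the hospital-label correlation but keeps only a small fraction of the data, which hints at why balancing may not always help.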


