The Dataset Multiplicity Problem: How Unreliable Data Impacts Predictions

04/20/2023
by Anna P. Meyer, et al.

We introduce dataset multiplicity, a way to study how inaccuracies, uncertainty, and social bias in training datasets impact test-time predictions. The dataset multiplicity framework asks a counterfactual question: what set of models (and associated test-time predictions) would result if we could somehow access all hypothetical, unbiased versions of the dataset? We discuss how to use this framework to capture various sources of uncertainty about a dataset's factual accuracy, including systemic social bias, data collection practices, and noisy labels or features. We show how to exactly analyze the impacts of dataset multiplicity for a specific model architecture and type of uncertainty: linear models with label errors. Our empirical analysis shows that real-world datasets, under reasonable assumptions, contain many test samples whose predictions are affected by dataset multiplicity. Furthermore, the choice of domain-specific dataset multiplicity definition determines which samples are affected and whether different demographic groups are disparately impacted. Finally, we discuss implications of dataset multiplicity for machine learning practice and research, including considerations for when model outcomes should not be trusted.
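To make the "linear models with label errors" case concrete, here is a minimal sketch, not the authors' implementation. It assumes unregularized least squares with an invertible X^T X, and a hypothetical uncertainty model in which at most `k` training labels may each be off by at most `delta`. The function name `prediction_range` and both parameters are illustrative assumptions. Because a least-squares prediction is linear in the training labels, the interval of attainable predictions under this model can be computed in closed form.

```python
import numpy as np

def prediction_range(X, y, x_test, k, delta):
    """Interval of least-squares predictions at x_test when up to k
    training labels may each shift by at most delta (assumed model)."""
    # The prediction is linear in the labels: y_hat = z @ y,
    # where z = X (X^T X)^{-1} x_test.
    z = X @ np.linalg.solve(X.T @ X, x_test)
    base = float(z @ y)
    # Extremes are reached by shifting the k labels with the largest
    # |z_i| by delta, with sign matching (or opposing) z_i.
    top_k = np.sort(np.abs(z))[::-1][:k]
    slack = float(delta * top_k.sum())
    return base - slack, base + slack

# Synthetic data, for illustration only.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=50)
x_test = np.array([0.2, -0.1, 0.4])

low, high = prediction_range(X, y, x_test, k=5, delta=1.0)
print(low, high)
```

If the resulting interval straddles a decision threshold, the test sample's outcome depends on which plausible version of the dataset was used for training, which is the kind of sample the abstract describes as affected by dataset multiplicity.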


