Adaptive Sampling Strategies to Construct Equitable Training Datasets

01/31/2022
by William Cai, et al.

In domains ranging from computer vision to natural language processing, machine learning models have been shown to exhibit stark disparities, often performing worse for members of traditionally underserved groups. One factor contributing to these performance gaps is a lack of representation in the data the models are trained on. It is often unclear, however, how to operationalize representativeness in specific applications. Here we formalize the problem of creating equitable training datasets, and propose a statistical framework for addressing this problem. We consider a setting where a model builder must decide how to allocate a fixed data collection budget to gather training data from different subgroups. We then frame dataset creation as a constrained optimization problem, in which one maximizes a function of group-specific performance metrics based on (estimated) group-specific learning rates and costs per sample. This flexible approach incorporates preferences of model-builders and other stakeholders, as well as the statistical properties of the learning task. When data collection decisions are made sequentially, we show that under certain conditions this optimization problem can be efficiently solved even without prior knowledge of the learning rates. To illustrate our approach, we conduct a simulation study of polygenic risk scores on synthetic genomic data – an application domain that often suffers from non-representative data collection. We find that our adaptive sampling strategy outperforms several common data collection heuristics, including equal and proportional sampling, demonstrating the value of strategic dataset design for building equitable models.
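
The budget-allocation step described above can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes, purely for illustration, that each group's expected loss follows a power-law learning curve `a[g] * n**(-b[g])`, that the objective is a weighted sum of group losses, and that sample counts may be fractional. The function name `allocate_budget` and all parameter values are hypothetical. Under these assumptions the Lagrangian condition gives each group's sample count in closed form for a given multiplier, and the multiplier is found by bisection so that total spending matches the budget.

```python
from typing import List, Sequence


def allocate_budget(
    a: Sequence[float],       # assumed learning-curve scale per group
    b: Sequence[float],       # assumed learning-curve decay rate per group (> 0)
    cost: Sequence[float],    # cost per sample for each group
    weight: Sequence[float],  # stakeholder weight on each group's loss
    budget: float,
    iters: int = 200,
) -> List[float]:
    """Minimize sum_g weight[g] * a[g] * n_g**(-b[g])
    subject to sum_g cost[g] * n_g <= budget (continuous relaxation)."""

    def samples(lam: float) -> List[float]:
        # Stationarity: weight * a * b * n**(-(b + 1)) = lam * cost
        return [
            (w * ag * bg / (lam * c)) ** (1.0 / (bg + 1.0))
            for ag, bg, c, w in zip(a, b, cost, weight)
        ]

    def spend(lam: float) -> float:
        return sum(c * n for c, n in zip(cost, samples(lam)))

    lo, hi = 1e-12, 1e12  # bracket for the multiplier (assumed wide enough)
    for _ in range(iters):
        mid = (lo * hi) ** 0.5  # bisection on a log scale
        if spend(mid) > budget:
            lo = mid  # spending too much: raise the multiplier
        else:
            hi = mid
    return samples(hi)  # hi always satisfies the budget constraint


if __name__ == "__main__":
    # Hypothetical two-group example: group 1 learns more slowly and costs more.
    n = allocate_budget(
        a=[1.0, 1.0], b=[0.5, 0.3], cost=[1.0, 2.0],
        weight=[1.0, 1.0], budget=1000.0,
    )
    print([round(x, 1) for x in n])
```

In this sketch the weights encode stakeholder preferences and the `a`, `b` parameters stand in for estimated learning rates. In the sequential setting the abstract mentions, those parameters would not be known in advance and would have to be estimated as data arrive, which this example does not attempt.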
