Representation Matters: Assessing the Importance of Subgroup Allocations in Training Data

03/05/2021
by   Esther Rolf, et al.

Collecting more diverse and representative training data is often touted as a remedy for the disparate performance of machine learning predictors across subpopulations. However, a precise framework for understanding how dataset properties like diversity affect learning outcomes is largely lacking. By casting data collection as part of the learning process, we demonstrate that diverse representation in training data is key not only to improving performance for individual subgroups, but also to achieving population-level objectives. Our analysis and experiments describe how dataset compositions influence performance and provide constructive results for using trends in existing data, alongside domain knowledge, to help guide intentional, objective-aware dataset design.
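
To make the central idea concrete, the sketch below is a minimal, hypothetical experiment (not the authors' code or setup): with a fixed training budget, it sweeps the fraction of samples allocated to one of two synthetic subgroups and reports per-group and population-weighted test error. The linear data-generating process, group labels A/B, population share, and allocation grid are all assumptions made purely for illustration of how subgroup allocations can shape both subgroup and population-level performance.

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
d, n_train, n_test = 5, 400, 2000
pop_share_a = 0.7                      # assumed population share of group A

# Each subgroup follows its own "true" linear relationship (assumption).
beta_a = rng.normal(size=d)
beta_b = rng.normal(size=d)

def sample_group(beta, n):
    X = rng.normal(size=(n, d))
    y = X @ beta + 0.1 * rng.normal(size=n)
    return X, y

# Fixed per-group test sets, reused to evaluate every allocation.
Xa_te, ya_te = sample_group(beta_a, n_test)
Xb_te, yb_te = sample_group(beta_b, n_test)

print(f"{'alloc_A':>8} {'err_A':>8} {'err_B':>8} {'pop_err':>8}")
for alloc_a in np.linspace(0.1, 0.9, 9):   # fraction of the training budget given to A
    n_a = int(alloc_a * n_train)
    Xa, ya = sample_group(beta_a, n_a)
    Xb, yb = sample_group(beta_b, n_train - n_a)
    model = LinearRegression().fit(np.vstack([Xa, Xb]), np.concatenate([ya, yb]))

    err_a = mean_squared_error(ya_te, model.predict(Xa_te))
    err_b = mean_squared_error(yb_te, model.predict(Xb_te))
    pop_err = pop_share_a * err_a + (1 - pop_share_a) * err_b  # population-level risk
    print(f"{alloc_a:8.2f} {err_a:8.3f} {err_b:8.3f} {pop_err:8.3f}")

Running the sketch shows that the allocation minimizing population-weighted error generally differs from the one that is best for either subgroup alone, which is the kind of trade-off an objective-aware dataset design has to navigate.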


