Reduced, Reused and Recycled: The Life of a Dataset in Machine Learning Research

12/03/2021
by Bernard Koch, et al.

Benchmark datasets play a central role in the organization of machine learning research. They coordinate researchers around shared research problems and serve as a measure of progress towards shared goals. Despite the foundational role of benchmarking practices in this field, relatively little attention has been paid to the dynamics of benchmark dataset use and reuse, within or across machine learning subcommunities. In this paper, we dig into these dynamics. We study how dataset usage patterns differ across machine learning subcommunities and over time, from 2015 to 2020. We find increasing concentration on fewer and fewer datasets within task communities, significant adoption of datasets from other tasks, and concentration across the field on datasets that have been introduced by researchers situated within a small number of elite institutions. Our results have implications for scientific evaluation, AI ethics, and equity/access within the field.


