Garbage In, Garbage Out? Do Machine Learning Application Papers in Social Computing Report Where Human-Labeled Training Data Comes From?

12/17/2019
by R. Stuart Geiger, et al.

Many machine learning projects for new application areas involve teams of humans who label data for a particular purpose, ranging from hired crowdworkers to the paper's authors labeling the data themselves. Such a task is quite similar to (or a form of) structured content analysis, a longstanding methodology in the social sciences and humanities with many established best practices. In this paper, we investigate to what extent a sample of machine learning application papers in social computing (specifically, papers from arXiv and traditional publications performing an ML classification task on Twitter data) give specific details about whether such best practices were followed. Our team conducted multiple rounds of structured content analysis of each paper, making determinations such as: does the paper report who the labelers were, what their qualifications were, whether they independently labeled the same items, whether inter-rater reliability metrics were disclosed, what level of training and/or instructions were given to labelers, whether compensation for crowdworkers was disclosed, and whether the training data is publicly available? We find a wide divergence in whether such practices were followed and documented. Much of machine learning research and education focuses on what is done once a "gold standard" of training data is available, but we discuss issues around the equally important aspect of whether such data is reliable in the first place.
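As an illustration of what an "inter-rater reliability metric" can look like in practice, the following is a minimal sketch of Cohen's kappa for two labelers who independently labeled the same items. The abstract does not say which metrics the surveyed papers used, so the choice of Cohen's kappa is an assumption, and the function name `cohens_kappa` and the example labels are hypothetical.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two labelers who labeled the same items.

    Illustrative only: the choice of metric and all names here are
    assumptions, not something taken from the paper.
    """
    if len(labels_a) != len(labels_b) or len(labels_a) == 0:
        raise ValueError("Both labelers must label the same non-empty set of items")

    n = len(labels_a)

    # Observed agreement: fraction of items given the same label by both.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: probability both pick the same category if each
    # labeler assigned labels at random according to their own label frequencies.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    categories = set(counts_a) | set(counts_b)
    expected = sum(counts_a[c] * counts_b[c] for c in categories) / (n * n)

    if expected == 1.0:
        # Both labelers used a single identical category for every item;
        # kappa is undefined (0/0), so report perfect agreement by convention.
        return 1.0

    return (observed - expected) / (1.0 - expected)


# Hypothetical labels from two labelers on the same four tweets.
labeler_1 = ["spam", "spam", "ham", "ham"]
labeler_2 = ["spam", "ham", "ham", "ham"]
kappa = cohens_kappa(labeler_1, labeler_2)
```

Raw agreement here is 3 of 4 items, but half of that agreement would be expected by chance given each labeler's label frequencies, which is why a chance-corrected metric is often reported alongside (or instead of) simple percent agreement.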


