Dealing with Disagreements: Looking Beyond the Majority Vote in Subjective Annotations

Majority voting and averaging are common approaches for resolving annotator disagreements and deriving a single ground-truth label from multiple annotations. However, annotators may systematically disagree with one another, often reflecting their individual biases and values, especially in subjective tasks such as detecting affect, aggression, and hate speech. Annotator disagreements may capture important nuances in such tasks that are often lost when annotations are aggregated to a single ground truth. To address this, we investigate the efficacy of multi-annotator models. In particular, our multi-task approach treats predicting each annotator's judgements as a separate subtask, while sharing a common learned representation of the task. We show that this approach yields the same or better performance than aggregating labels in the data prior to training, across seven different binary classification tasks. Our approach also provides a way to estimate uncertainty in predictions, which we demonstrate correlates better with annotation disagreements than traditional methods. Being able to model uncertainty is especially useful in deployment scenarios where knowing when not to make a prediction is important.
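The architecture described in the abstract lends itself to a compact illustration. Below is a minimal sketch, assuming a shared encoder that feeds one binary classification head per annotator, with the spread of per-annotator predictions used as an uncertainty signal; the class, function, and parameter names (MultiAnnotatorModel, predict_with_uncertainty, num_annotators) are illustrative and not taken from the paper's code.

```python
# Sketch of a multi-annotator, multi-task classifier: a shared encoder
# feeds one binary head per annotator, and the disagreement across the
# per-annotator predictions serves as an uncertainty estimate.
import torch
import torch.nn as nn


class MultiAnnotatorModel(nn.Module):
    def __init__(self, encoder: nn.Module, encoder_dim: int, num_annotators: int):
        super().__init__()
        self.encoder = encoder                      # shared learned representation
        self.heads = nn.ModuleList(                 # one subtask head per annotator
            [nn.Linear(encoder_dim, 1) for _ in range(num_annotators)]
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.encoder(x)                         # (batch, encoder_dim)
        # Stack each annotator head's logit into a (batch, num_annotators) tensor.
        return torch.cat([head(h) for head in self.heads], dim=-1)

    def predict_with_uncertainty(self, x: torch.Tensor):
        probs = torch.sigmoid(self.forward(x))      # per-annotator P(label = 1)
        aggregate = probs.mean(dim=-1)              # aggregated prediction
        uncertainty = probs.var(dim=-1)             # spread across annotator heads
        return aggregate, uncertainty


# Toy usage with a stand-in encoder; in practice the encoder would be a
# pretrained text model and x a batch of token embeddings or features.
encoder = nn.Sequential(nn.Linear(16, 32), nn.ReLU())
model = MultiAnnotatorModel(encoder, encoder_dim=32, num_annotators=5)
x = torch.randn(8, 16)
aggregate, uncertainty = model.predict_with_uncertainty(x)
print(aggregate.shape, uncertainty.shape)           # torch.Size([8]) torch.Size([8])
```

In a real training setup one would presumably mask the per-annotator loss for examples that annotator did not label, since crowd-sourced annotations are typically sparse; the sketch above omits training entirely and only shows the shared-encoder, per-annotator-head structure and a simple disagreement-based uncertainty estimate.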

Related research

10/12/2021 · On Releasing Annotator-Level Labels and Information in Datasets
A common practice in building NLP datasets, especially using crowd-sourc...

07/05/2023 · Evaluating AI systems under uncertain ground truth: a case study in dermatology
For safety, AI systems in health undergo thorough evaluations before dep...

10/14/2022 · The Invariant Ground Truth of Affect
Affective computing strives to unveil the unknown relationship between a...

07/11/2021 · Learning from Crowds with Sparse and Imbalanced Annotations
Traditional supervised learning requires ground truth labels for the tra...

02/06/2023 · When the Ground Truth is not True: Modelling Human Biases in Temporal Annotations
In supervised learning, low quality annotations lead to poorly performin...

06/02/2021 · Survey Equivalence: A Procedure for Measuring Classifier Accuracy Against Human Labels
In many classification tasks, the ground truth is either noisy or subjec...

04/02/2019 · Generating Labels for Regression of Subjective Constructs using Triplet Embeddings
Human annotations serve an important role in computational models where ...
