Improve learning combining crowdsourced labels by weighting Areas Under the Margin

09/30/2022
by Tanguy Lefort, et al.

In supervised learning – for instance in image classification – modern massive datasets are commonly labeled by a crowd of workers. The labels obtained in this crowdsourcing setting are then aggregated for training. The aggregation step generally leverages a per-worker trust score. Yet, such worker-centric approaches discard the ambiguity of each task. Some intrinsically ambiguous tasks might even fool expert workers, which could eventually be harmful for the learning step. In a standard supervised learning setting – with one label per task and balanced classes – the Area Under the Margin (AUM) statistic is tailored to identify mislabeled data. We adapt the AUM to identify ambiguous tasks in crowdsourced learning scenarios, introducing the Weighted AUM (WAUM). The WAUM is an average of AUMs weighted by worker- and task-dependent scores. We show that the WAUM can help discard ambiguous tasks from the training set, leading to better generalization or calibration performance. We report improvements over feature-blind aggregation strategies, both in simulated settings and on the CIFAR-10H crowdsourced dataset.
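To make the two statistics concrete, here is a minimal NumPy sketch. The AUM of a sample is the margin of its assigned label (assigned-label logit minus the largest other logit) averaged over training epochs; the WAUM then averages the per-worker AUMs with trust weights. The function names, array shapes, and the use of a single fixed trust weight per worker are illustrative assumptions, not the paper's exact worker- and task-dependent scoring.

```python
import numpy as np

def aum(logits, label):
    """Area Under the Margin for one (task, label) pair.

    logits: array of shape (n_epochs, n_classes), the model's logits
            for this task recorded at each training epoch.
    label:  the class index assigned by the worker.
    Returns the margin averaged over epochs.
    """
    margins = []
    for z in logits:
        others = np.delete(z, label)          # logits of all other classes
        margins.append(z[label] - others.max())
    return float(np.mean(margins))

def waum(per_worker_logits, per_worker_labels, trust):
    """Weighted AUM for one task: trust-weighted average of each
    worker's AUM (simplified to one scalar weight per worker)."""
    aums = np.array([aum(z, y)
                     for z, y in zip(per_worker_logits, per_worker_labels)])
    w = np.asarray(trust, dtype=float)
    return float(np.sum(w * aums) / np.sum(w))

# Toy task seen by two workers over two epochs: class 0 dominates,
# so the worker who answered 0 gets a positive AUM (+1.5) and the
# worker who answered 1 a negative one (-1.5).
logits = np.array([[2.0, 1.0, 0.0],
                   [3.0, 1.0, 0.0]])
score = waum([logits, logits], [0, 1], trust=[3.0, 1.0])  # 0.75
```

Tasks whose WAUM falls below a chosen cutoff (e.g. a low quantile of the WAUM distribution over the training set) can then be flagged as ambiguous and discarded before the final training run.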

Related research

11/13/2020 · End-to-End Learning from Noisy Crowd to Supervised Machine Learning Models
Labeling real-world datasets is time consuming but indispensable for sup...

01/28/2020 · Identifying Mislabeled Data using the Area Under the Margin Ranking
Not all data in a typical training set help with generalization; some sa...

07/13/2022 · Is one annotation enough? A data-centric image classification benchmark for noisy and ambiguous label estimation
High-quality data is necessary for modern machine learning. However, the...

12/04/2021 · In Search of Ambiguity: A Three-Stage Workflow Design to Clarify Annotation Guidelines for Crowd Workers
We propose a novel three-stage FIND-RESOLVE-LABEL workflow for crowdsour...

06/06/2021 · Embracing Ambiguity: Shifting the Training Target of NLI Models
Natural Language Inference (NLI) datasets contain examples with highly a...

02/12/2019 · Crowdsourced PAC Learning under Classification Noise
In this paper, we analyze PAC learnability from labels produced by crowd...
