Practice of Efficient Data Collection via Crowdsourcing at Large-Scale

12/10/2019
by   Alexey Drutsa, et al.

Modern machine learning algorithms need large datasets for training. Crowdsourcing has become a popular way to label large datasets faster and at lower cost than is possible with a limited number of experts. However, because crowdsourcing performers are non-professional and vary in expertise, their labels are much noisier than those obtained from experts. For this reason, collecting good-quality data within a limited budget requires special techniques such as incremental relabelling, aggregation, and pricing. We introduce data labelling via public crowdsourcing marketplaces and present the key components of efficient label collection. We show how to choose one of several real label collection tasks, experiment with settings for the labelling process, and launch a label collection project on Yandex.Toloka, one of the largest crowdsourcing marketplaces. The projects will be run on real crowds. We also present the main algorithms for aggregation, incremental relabelling, and pricing in crowdsourcing. In particular, we first discuss how to connect these three components to build an efficient label collection process, and second, share rich industrial experience of applying these algorithms and constructing large-scale label collection pipelines (emphasizing best practices and common pitfalls).
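
As a rough illustration of how aggregation and incremental relabelling can fit together, the sketch below uses plain majority voting and asks for extra labels only while the vote is not decisive. It is a minimal sketch, not the authors' method and not the Yandex.Toloka API: the function names, the `request_label` callback, and the parameters `min_overlap`, `max_overlap`, and `threshold` are all hypothetical and chosen only for this example.

```python
from collections import Counter


def majority_vote(labels):
    """Aggregate one item's noisy labels by simple majority.

    Returns the most frequent label and its share of all votes.
    """
    counts = Counter(labels)
    label, votes = counts.most_common(1)[0]
    return label, votes / len(labels)


def collect_incrementally(request_label, min_overlap=3, max_overlap=5, threshold=0.8):
    """Request labels for one item until the majority is decisive.

    request_label is a hypothetical zero-argument callable that returns one
    new label from a performer. Collection starts with min_overlap labels and
    stops once the majority share reaches threshold or max_overlap labels
    have been collected, so extra budget goes only to ambiguous items.
    """
    labels = [request_label() for _ in range(min_overlap)]
    label, share = majority_vote(labels)
    while share < threshold and len(labels) < max_overlap:
        labels.append(request_label())
        label, share = majority_vote(labels)
    return label, share, len(labels)


if __name__ == "__main__":
    # Simulated performers for a single item; a real project would fetch
    # these answers from the crowdsourcing marketplace instead.
    answers = iter(["cat", "dog", "cat", "cat", "cat"])
    print(collect_incrementally(lambda: next(answers)))
```

Items on which performers agree stop at the minimum overlap, while disputed items receive more labels up to the cap. In this sketch pricing is reflected only through the number of labels bought per item.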

research
07/02/2021

Vox Populi, Vox DIY: Benchmark Dataset for Crowdsourced Audio Transcription

Domain-specific data is the crux of the successful transfer of machine l...
research
06/17/2022

Crowdsourcing Relative Rankings of Multi-Word Expressions: Experts versus Non-Experts

In this study we investigate to which degree experts and non-experts agr...
research
09/29/2022

TruEyes: Utilizing Microtasks in Mobile Apps for Crowdsourced Labeling of Machine Learning Datasets

The growing use of supervised machine learning in research and industry ...
research
06/08/2019

Doubly Robust Crowdsourcing

Large-scale labeled datasets are the indispensable fuel that ignites the...
research
06/19/2017

Multi-Label Annotation Aggregation in Crowdsourcing

As a means of human-based computation, crowdsourcing has been widely use...
research
06/15/2021

Benchmark dataset of memes with text transcriptions for automatic detection of multi-modal misogynistic content

In this paper we present a benchmark dataset generated as part of a proj...
research
02/11/2021

OpinionRank: Extracting Ground Truth Labels from Unreliable Expert Opinions with Graph-Based Spectral Ranking

As larger and more comprehensive datasets become standard in contemporar...