Vox Populi, Vox DIY: Benchmark Dataset for Crowdsourced Audio Transcription

07/02/2021
by Nikita Pavlichenko, et al.

Domain-specific data is the crux of the successful transfer of machine learning systems from benchmarks to real life. Crowdsourcing has become one of the standard tools for cheap and time-efficient data collection for simple problems such as image classification, thanks in large part to advances in research on aggregation methods. However, the applicability of crowdsourcing to more complex tasks (e.g., speech recognition) remains limited due to the lack of principled aggregation methods for these modalities. The main obstacle to designing advanced aggregation methods is the absence of training data, and in this work we focus on bridging this gap in speech recognition. To this end, we collect and release CrowdSpeech, the first publicly available large-scale dataset of crowdsourced audio transcriptions. Evaluation of existing aggregation methods on our data shows room for improvement, suggesting that our work may lead to the design of better algorithms. At a higher level, we also contribute to the more general challenge of collecting high-quality datasets using crowdsourcing: we develop a principled pipeline for constructing datasets of crowdsourced audio transcriptions in any novel domain. We show its applicability to an under-resourced language by constructing VoxDIY, a counterpart of CrowdSpeech for the Russian language. We also release the code that allows a full replication of our data collection pipeline and share various insights on best practices of data collection via crowdsourcing.
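To make the notion of aggregating crowdsourced transcriptions concrete, the following is a minimal sketch and not a method from the paper. It assumes several workers transcribe the same recording and that one of their answers must be selected as the final result. The selection rule (pick the transcription with the smallest total word-level edit distance to all the others, i.e. a medoid) is a simple illustrative baseline chosen here, and the function names `word_edit_distance`, `wer`, and `aggregate_medoid` are hypothetical. Word error rate (WER) is included as a common way to score a transcription against a reference.

```python
from typing import List, Sequence


def word_edit_distance(a: Sequence[str], b: Sequence[str]) -> int:
    """Levenshtein distance between two sequences of words."""
    prev = list(range(len(b) + 1))
    for i, word_a in enumerate(a, start=1):
        curr = [i]
        for j, word_b in enumerate(b, start=1):
            cost = 0 if word_a == word_b else 1
            curr.append(min(
                prev[j] + 1,         # delete word_a
                curr[j - 1] + 1,     # insert word_b
                prev[j - 1] + cost,  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word edit distance divided by reference length."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return word_edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def aggregate_medoid(transcriptions: List[str]) -> str:
    """Illustrative baseline: return the transcription closest to all others.

    Ties are broken in favor of the earliest transcription in the list.
    """
    if not transcriptions:
        raise ValueError("need at least one transcription")
    tokenized = [t.split() for t in transcriptions]

    def total_distance(i: int) -> int:
        return sum(word_edit_distance(tokenized[i], other) for other in tokenized)

    best_index = min(range(len(transcriptions)), key=total_distance)
    return transcriptions[best_index]


if __name__ == "__main__":
    # Made-up worker answers for a single recording (not data from CrowdSpeech).
    workers = [
        "the cat sat on the mat",
        "the cat sat on a mat",
        "a cat sad on the mat",
    ]
    result = aggregate_medoid(workers)
    print(result, wer("the cat sat on the mat", result))
```

A baseline like this can only return one of the submitted answers, so it cannot recover a correct transcription when every worker makes a different mistake. That limitation is one reason more principled aggregation methods for text are of interest.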
