Scalable Data Annotation Pipeline for High-Quality Large Speech Datasets Development

09/01/2021
by   Mingkuan Liu, et al.

This paper introduces a human-in-the-loop (HITL) data annotation pipeline for generating high-quality, large-scale speech datasets. The pipeline combines the strengths of humans and machines, pairing machine pre-labeling with fully manual auditing to annotate datasets more quickly, accurately, and cost-effectively. Quality control mechanisms such as blind testing, behavior monitoring, and data validation are built into the pipeline to mitigate potential bias introduced by machine-generated labels. Our A/B testing and pilot results demonstrate that the HITL pipeline can improve annotation speed and capacity by at least 80%, with quality comparable to or higher than manual double-pass annotation. We are using this scalable pipeline to create and continuously grow ultra-high volume off-the-shelf (UHV-OTS) speech corpora for multiple languages, with the capability to expand by 10,000+ hours per language annually. Customized datasets can be produced from the UHV-OTS corpora using dynamic packaging. UHV-OTS is a long-term Appen project to support commercial and academic research data needs in speech processing. Each year, Appen will donate a number of free speech datasets from the UHV-OTS corpora under the CC-BY-SA license to support research in the academic and open source communities. We are also releasing the code for the data pre-processing and pre-tagging pipeline under the Apache 2.0 license so that the results reported in the paper can be reproduced.

