GigaSpeech: An Evolving, Multi-domain ASR Corpus with 10,000 Hours of Transcribed Audio

06/13/2021
by   Guoguo Chen, et al.
0

This paper introduces GigaSpeech, an evolving, multi-domain English speech recognition corpus with 10,000 hours of high quality labeled audio suitable for supervised training, and 40,000 hours of total audio suitable for semi-supervised and unsupervised training. Around 40,000 hours of transcribed audio is first collected from audiobooks, podcasts and YouTube, covering both read and spontaneous speaking styles, and a variety of topics, such as arts, science, sports, etc. A new forced alignment and segmentation pipeline is proposed to create sentence segments suitable for speech recognition training, and to filter out segments with low-quality transcription. For system training, GigaSpeech provides five subsets of different sizes, 10h, 250h, 1000h, 2500h, and 10000h. For our 10,000-hour XL training subset, we cap the word error rate at 4 training subsets, we cap it at 0 other hand, are re-processed by professional human transcribers to ensure high transcription quality. Baseline systems are provided for popular speech recognition toolkits, namely Athena, ESPnet, Kaldi and Pika.

READ FULL TEXT

page 1

page 2

page 3

page 4

research
10/07/2021

WenetSpeech: A 10000+ Hours Multi-domain Mandarin Corpus for Speech Recognition

In this paper, we present WenetSpeech, a multi-domain Mandarin corpus co...
research
05/26/2017

Semi-Supervised Model Training for Unbounded Conversational Speech Recognition

For conversational large-vocabulary continuous speech recognition (LVCSR...
research
12/17/2019

Libri-Light: A Benchmark for ASR with Limited or No Supervision

We introduce a new collection of spoken English audio suitable for train...
research
10/24/2022

10 hours data is all you need

We propose a novel procedure to generate pseudo mandarin speech data nam...
research
02/12/2023

ASR Bundestag: A Large-Scale political debate dataset in German

We present ASR Bundestag, a dataset for automatic speech recognition in ...
research
03/07/2022

Creating Speech-to-Speech Corpus from Dubbed Series

Dubbed series are gaining a lot of popularity in recent years with stron...
research
09/01/2023

Mi-Go: Test Framework which uses YouTube as Data Source for Evaluating Speech Recognition Models like OpenAI's Whisper

This article introduces Mi-Go, a novel testing framework aimed at evalua...

Please sign up or login with your details

Forgot password? Click here to reset