WenetSpeech: A 10000+ Hours Multi-domain Mandarin Corpus for Speech Recognition

10/07/2021
by BinBin Zhang, et al.

In this paper, we present WenetSpeech, a multi-domain Mandarin corpus consisting of 10000+ hours of high-quality labeled speech, 2400+ hours of weakly labeled speech, and about 10000 hours of unlabeled speech, for a total of 22400+ hours. We collect the data from YouTube and Podcast; it covers a variety of speaking styles, scenarios, domains, topics, and noise conditions. For the YouTube data, an optical character recognition (OCR) based method generates audio/text segmentation candidates from the corresponding video captions, while for the Podcast data a high-quality ASR transcription system generates the audio/text pair candidates. We then propose a novel end-to-end label error detection approach to further validate and filter the candidates. Along with WenetSpeech, we provide three manually labeled high-quality test sets for evaluation: Dev, for cross-validation during training; Test_Net, collected from the Internet as a matched test; and Test_Meeting, recorded from real meetings as a more challenging mismatched test. Baseline systems trained on WenetSpeech are provided for three popular speech recognition toolkits, namely Kaldi, ESPnet, and WeNet, and their recognition results on the three test sets are provided as benchmarks. To the best of our knowledge, WenetSpeech is currently the largest open-sourced Mandarin speech corpus with transcriptions, and it benefits research on production-level speech recognition.
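
As a rough illustration of the validate-and-filter step described above, the sketch below sorts audio/text candidates into tiers by a validation score. It is not the authors' label error detection method: the `Candidate` type, the `confidence` field, the tier names, and both threshold values are hypothetical and chosen only to show the idea of tiered filtering.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List

# Hypothetical thresholds for illustration only; the abstract does not state any.
HIGH_QUALITY_MIN = 0.95
WEAK_MIN = 0.60


@dataclass
class Candidate:
    """One audio/text pair candidate (hypothetical structure)."""
    audio_id: str
    text: str
    confidence: float  # assumed validation score in [0, 1]


def tier_of(candidate: Candidate) -> str:
    """Map a candidate to a tier based on its assumed validation score."""
    if candidate.confidence >= HIGH_QUALITY_MIN:
        return "high_quality"
    if candidate.confidence >= WEAK_MIN:
        return "weak"
    return "rejected"


def partition(candidates: Iterable[Candidate]) -> Dict[str, List[Candidate]]:
    """Group candidates by tier, preserving input order within each tier."""
    tiers: Dict[str, List[Candidate]] = {
        "high_quality": [],
        "weak": [],
        "rejected": [],
    }
    for candidate in candidates:
        tiers[tier_of(candidate)].append(candidate)
    return tiers


if __name__ == "__main__":
    demo = [
        Candidate("utt1", "你好", 0.98),
        Candidate("utt2", "今天天气", 0.72),
        Candidate("utt3", "语音识别", 0.31),
    ]
    for name, items in partition(demo).items():
        print(name, [c.audio_id for c in items])
```

In this sketch the top tier would correspond to transcriptions trusted as training labels and the middle tier to weakly labeled data; how the real pipeline scores candidates and where it draws the lines is described in the full paper, not here.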
