Common Voice: A Massively-Multilingual Speech Corpus

by Rosana Ardila, et al.
Mozilla Corporation
Indiana University

The Common Voice corpus is a massively-multilingual collection of transcribed speech intended for speech technology research and development. Common Voice is designed for Automatic Speech Recognition purposes but can be useful in other domains (e.g., language identification). To achieve scale and sustainability, the Common Voice project employs crowdsourcing for both data collection and data validation. The most recent release includes 29 languages, and as of November 2019 a total of 38 languages are collecting data. Over 50,000 individuals have participated so far, resulting in 2,500 hours of collected audio. To our knowledge, this is the largest audio corpus in the public domain for speech recognition, both in number of hours and in number of languages. As an example use case for Common Voice, we present speech recognition experiments using Mozilla's DeepSpeech Speech-to-Text toolkit. By applying transfer learning from a source English model, we find an average Character Error Rate improvement of 5.99 ± 5.48 for twelve target languages (German, French, Italian, Turkish, Catalan, Slovenian, Welsh, Irish, Breton, Tatar, Chuvash, and Kabyle). For most of these languages, these are the first-ever published results on end-to-end Automatic Speech Recognition.
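The abstract reports its results as Character Error Rate (CER) but does not say how the score is computed. As a minimal sketch, assuming the common definition (character-level Levenshtein distance between the reference transcript and the recognizer output, divided by the reference length), CER could be computed as follows. The function names `edit_distance` and `cer` are illustrative only and are not taken from the paper or from DeepSpeech, and reporting the value as a percentage is also an assumption.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings, counted in characters."""
    # prev[j] holds the distance between the first i-1 characters of ref
    # and the first j characters of hyp.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to the empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # delete a reference character
                curr[j - 1] + 1,     # insert a hypothesis character
                prev[j - 1] + cost,  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character Error Rate as a percentage of the reference length (assumed definition)."""
    if not ref:
        raise ValueError("reference transcript must be non-empty")
    return 100.0 * edit_distance(ref, hyp) / len(ref)
```

Under this definition, a hypothesis identical to the reference scores 0, and one substituted character in a five-character reference scores 20. Lower is better, so an "improvement" corresponds to a reduction in this value.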




