MLS: A Large-Scale Multilingual Dataset for Speech Research

12/07/2020
by   Vineel Pratap, et al.

This paper introduces the Multilingual LibriSpeech (MLS) dataset, a large multilingual corpus suitable for speech research. The dataset is derived from read audiobooks from LibriVox and covers 8 languages, including about 44.5K hours of English and a total of about 6K hours for the other languages. Additionally, we provide Language Models (LM) and baseline Automatic Speech Recognition (ASR) models for all the languages in our dataset. We believe such a large transcribed dataset will open new avenues in ASR and Text-To-Speech (TTS) research. The dataset will be made freely available to anyone at http://www.openslr.org.

research
10/31/2020

Multilingual Bottleneck Features for Improving ASR Performance of Code-Switched Speech in Under-Resourced Languages

In this work, we explore the benefits of using multilingual bottleneck f...
research
06/07/2023

Zambezi Voice: A Multilingual Speech Corpus for Zambian Languages

This work introduces Zambezi Voice, an open-source multilingual speech r...
research
07/30/2019

MaSS: A Large and Clean Multilingual Corpus of Sentence-aligned Spoken Utterances Extracted from the Bible

The CMU Wilderness Multilingual Speech Dataset is a newly published mult...
research
06/16/2023

CML-TTS A Multilingual Dataset for Speech Synthesis in Low-Resource Languages

In this paper, we present CML-TTS, a recursive acronym for CML-Multi-Lin...
research
05/25/2020

FT Speech: Danish Parliament Speech Corpus

This paper introduces FT Speech, a new speech corpus created from the re...
research
03/06/2022

Automatic Speech Recognition Datasets in Cantonese: A Survey and New Dataset

Automatic speech recognition (ASR) on low resource languages improves t...
research
03/30/2021

MediaSpeech: Multilanguage ASR Benchmark and Dataset

The performance of automated speech recognition (ASR) systems is well kn...
