Speech Wikimedia: A 77 Language Multilingual Speech Dataset

08/30/2023
by Rafael Mosquera Gómez, et al.

The Speech Wikimedia Dataset is a publicly available compilation of audio with transcriptions extracted from Wikimedia Commons. It includes 1780 hours (195 GB) of CC-BY-SA licensed transcribed speech from a diverse set of scenarios and speakers, in 77 different languages. Each audio file has one or more transcriptions in different languages, making this dataset suitable for training speech recognition, speech translation, and machine translation models.
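As a rough illustration of why multiple transcriptions per audio file support all three tasks, the sketch below derives training pairs from a single record. The record layout (`Record`, `audio_path`, `audio_lang`, `transcripts`) and the helper names are hypothetical, chosen only for this example; the dataset's actual file format is not described here, and knowing the spoken language of each file is an assumption.

```python
from dataclasses import dataclass, field
from itertools import permutations
from typing import Dict, List, Tuple


@dataclass
class Record:
    """Hypothetical record: one audio file plus transcriptions keyed by language code."""
    audio_path: str
    audio_lang: str  # assumed to be known for each file
    transcripts: Dict[str, str] = field(default_factory=dict)


def asr_pairs(rec: Record) -> List[Tuple[str, str]]:
    """Speech recognition: audio paired with the transcript in the spoken language."""
    text = rec.transcripts.get(rec.audio_lang)
    return [(rec.audio_path, text)] if text is not None else []


def st_pairs(rec: Record) -> List[Tuple[str, str, str]]:
    """Speech translation: audio paired with each transcript in another language."""
    return [
        (rec.audio_path, lang, text)
        for lang, text in rec.transcripts.items()
        if lang != rec.audio_lang
    ]


def mt_pairs(rec: Record) -> List[Tuple[str, str, str, str]]:
    """Machine translation: every ordered pair of transcripts of the same audio."""
    return [
        (src, rec.transcripts[src], tgt, rec.transcripts[tgt])
        for src, tgt in permutations(rec.transcripts, 2)
    ]


# Illustrative record, not taken from the dataset.
example = Record("clip.ogg", "en", {"en": "hello", "es": "hola"})
print(asr_pairs(example))
print(st_pairs(example))
print(mt_pairs(example))
```

A file with only one transcription yields a recognition or speech translation pair but no machine translation pairs, so the usable data per task depends on how many languages each file is transcribed in.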
