SpeechMatrix: A Large-Scale Mined Corpus of Multilingual Speech-to-Speech Translations

11/08/2022
by Paul-Ambroise Duquenne, et al.

We present SpeechMatrix, a large-scale multilingual corpus of speech-to-speech translations mined from real speech in European Parliament recordings. It contains speech alignments in 136 language pairs, with a total of 418 thousand hours of speech. To evaluate the quality of this parallel speech, we train bilingual speech-to-speech translation models on mined data only and establish extensive baseline results on the EuroParl-ST, VoxPopuli and FLEURS test sets. Enabled by the multilinguality of SpeechMatrix, we also explore multilingual speech-to-speech translation, a topic that few other works have addressed. We also demonstrate that model pre-training and sparse scaling using Mixture-of-Experts bring large gains in translation performance. The mined data and models are freely available.


