Europarl-ST: A Multilingual Corpus For Speech Translation Of Parliamentary Debates

11/08/2019
by Javier Iranzo-Sánchez, et al.

Research into spoken language translation (SLT) is often hampered by the lack of task-specific data resources, as currently available SLT datasets cover only a limited set of language pairs. In this paper, we present Europarl-ST, a novel multilingual SLT corpus containing paired audio-text samples for SLT from and into 6 European languages, for a total of 30 different translation directions. The corpus has been compiled from the debates held in the European Parliament between 2008 and 2012. This paper describes the corpus creation process and presents a series of automatic speech recognition, machine translation and spoken language translation experiments that highlight the potential of this new resource. The corpus is released under a Creative Commons license and is freely accessible and downloadable.
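The figure of 30 translation directions follows from counting every ordered (source, target) pair among 6 languages, which gives 6 × 5 = 30. A minimal sketch of that count is given here; the labels `L1` to `L6` are placeholders chosen for illustration, since the abstract does not name the languages.

```python
from itertools import permutations

# Placeholder labels (assumption): the abstract does not list the six languages.
languages = ["L1", "L2", "L3", "L4", "L5", "L6"]

# Each ordered (source, target) pair with source != target is one direction.
directions = list(permutations(languages, 2))

print(len(directions))
```

Because direction matters in translation, (L1, L2) and (L2, L1) are counted separately, which is why ordered pairs are used rather than unordered combinations.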


