CoVoST 2 and Massively Multilingual Speech-to-Text Translation

07/20/2020
by Changhan Wang, et al.

Speech translation has recently become an increasingly popular topic of research, partly due to the development of benchmark datasets. Nevertheless, current datasets cover a limited number of languages. To foster research in massively multilingual speech translation and in speech translation for low-resource language pairs, we release CoVoST 2, a large-scale multilingual speech translation corpus covering translations from 21 languages into English and from English into 15 languages. It is the largest open dataset available to date in terms of total volume and language coverage. Data sanity checks provide evidence of the quality of the data, which is released under a CC0 license. We also provide extensive speech recognition, bilingual and multilingual machine translation, and speech translation baselines.


research
02/02/2021

The Multilingual TEDx Corpus for Speech Recognition and Translation

We present the Multilingual TEDx corpus, built to support speech recogni...
research
02/04/2020

CoVoST: A Diverse Multilingual Speech-To-Text Translation Corpus

Spoken language translation has recently witnessed a resurgence in popul...
research
11/08/2022

SpeechMatrix: A Large-Scale Mined Corpus of Multilingual Speech-to-Speech Translations

We present SpeechMatrix, a large-scale multilingual corpus of speech-to-...
research
11/08/2019

Europarl-ST: A Multilingual Corpus For Speech Translation Of Parliamentary Debates

Current research into spoken language translation (SLT) is often hampere...
research
07/17/2023

Multilingual Speech-to-Speech Translation into Multiple Target Languages

Speech-to-speech translation (S2ST) enables spoken communication between...
research
05/19/2023

HalOmi: A Manually Annotated Benchmark for Multilingual Hallucination and Omission Detection in Machine Translation

Hallucinations in machine translation are translations that contain info...
research
02/14/2017

A case study on using speech-to-translation alignments for language documentation

For many low-resource or endangered languages, spoken language resources...
