BSTC: A Large-Scale Chinese-English Speech Translation Dataset

04/08/2021
by Ruiqing Zhang, et al.

This paper presents BSTC (Baidu Speech Translation Corpus), a large-scale Chinese-English speech translation dataset. The dataset is built from a collection of licensed videos of talks and lectures. It includes about 68 hours of Mandarin speech, manual transcripts and their English translations, and automatic transcripts produced by an automatic speech recognition (ASR) model. In addition, three experienced interpreters simultaneously interpreted the test talks in a mock conference setting. We expect the corpus to promote research on automatic simultaneous translation and the development of practical systems. We have organized simultaneous translation tasks and used the corpus to evaluate automatic simultaneous translation systems.

research
06/06/2021

Itihasa: A large-scale corpus for Sanskrit to English translation

This work introduces Itihasa, a large-scale translation dataset containi...
research
12/23/2022

Dubbing in Practice: A Large Scale Study of Human Localization With Insights for Automatic Dubbing

We investigate how humans perform the task of dubbing video content from...
research
04/11/2023

A Corpus-based Analysis of Attitudinal Changes in Lin Yutang's Self-translation of Between Tears and Laughter

Attitude is omnipresent in almost every type of text. There has yet to b...
research
06/17/2021

Lost in Interpreting: Speech Translation from Source or Interpreter?

Interpreters facilitate multi-lingual meetings but the affordable set of...
research
05/06/2021

Quantitative Evaluation of Alternative Translations in a Corpus of Highly Dissimilar Finnish Paraphrases

In this paper, we present a quantitative evaluation of differences betwe...
research
05/26/2023

Robustness of Multi-Source MT to Transcription Errors

Automatic speech translation is sensitive to speech recognition errors, ...
research
11/08/2022

SpeechMatrix: A Large-Scale Mined Corpus of Multilingual Speech-to-Speech Translations

We present SpeechMatrix, a large-scale multilingual corpus of speech-to-...
