GigaST: A 10,000-hour Pseudo Speech Translation Corpus

04/08/2022
by Rong Ye, et al.

This paper introduces GigaST, a large-scale pseudo speech translation (ST) corpus. We create the corpus by translating the transcripts in GigaSpeech, an English ASR corpus, into German and Chinese. The training set is translated by a strong machine translation system, and the test set is translated by human translators. ST models trained with our corpus as additional data obtain new state-of-the-art results on the MuST-C English-German benchmark test set. We provide a detailed description of the translation process and verify its quality. We make the translated text data public in the hope of facilitating research in speech translation. We also release the training scripts on NeurST to make our systems easy to replicate. The GigaST dataset is available at https://st-benchmark.github.io/resources/GigaST.


