Log In Sign Up

End-to-end Speech Translation via Cross-modal Progressive Training

by   Rong Ye, et al.

End-to-end speech translation models have become a new trend in the research due to their potential of reducing error propagation. However, these models still suffer from the challenge of data scarcity. How to effectively make use of unlabeled or other parallel corpora from machine translation is promising but still an open problem. In this paper, we propose Cross Speech-Text Network (XSTNet), an end-to-end model for speech-to-text translation. XSTNet takes both speech and text as input and outputs both transcription and translation text. The model benefits from its three key design aspects: a self supervising pre-trained sub-network as the audio encoder, a multi-task training objective to exploit additional parallel bilingual text, and a progressive training procedure. We evaluate the performance of XSTNet and baselines on the MuST-C En-De/Fr/Ru datasets. XSTNet achieves state-of-the-art results on all three language directions with an average BLEU of 27.8, outperforming the previous best method by 3.7 BLEU. The code and the models will be released to the public.


page 1

page 2

page 3

page 4


Cross-modal Contrastive Learning for Speech Translation

How can we learn unified representations for spoken utterances and their...

M3ST: Mix at Three Levels for Speech Translation

How to solve the data scarcity problem for end-to-end speech-to-text tra...

SDST: Successive Decoding for Speech-to-text Translation

End-to-end speech-to-text translation (ST), which directly translates th...

STEMM: Self-learning with Speech-text Manifold Mixup for Speech Translation

How to learn a better speech representation for end-to-end speech-to-tex...

Fine-tuning on Clean Data for End-to-End Speech Translation: FBK @ IWSLT 2018

This paper describes FBK's submission to the end-to-end English-German s...

Regularizing End-to-End Speech Translation with Triangular Decomposition Agreement

End-to-end speech-to-text translation (E2E-ST) is becoming increasingly ...

Towards Unsupervised Speech-to-Text Translation

We present a framework for building speech-to-text translation (ST) syst...