End-to-end Speech Translation via Cross-modal Progressive Training

04/21/2021
by Rong Ye, et al.

End-to-end speech translation models have become a new trend in research because of their potential to reduce error propagation. However, these models still suffer from data scarcity. Making effective use of unlabeled data or of parallel corpora from machine translation is promising but remains an open problem. In this paper, we propose the Cross Speech-Text Network (XSTNet), an end-to-end model for speech-to-text translation. XSTNet takes both speech and text as input and outputs both transcription and translation text. The model benefits from three key design aspects: a self-supervised pre-trained sub-network as the audio encoder, a multi-task training objective that exploits additional parallel bilingual text, and a progressive training procedure. We evaluate XSTNet and baselines on the MuST-C En-De/Fr/Ru datasets. XSTNet achieves state-of-the-art results on all three language directions with an average BLEU of 27.8, outperforming the previous best method by 3.7 BLEU. The code and the models will be released to the public.
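
The combination of a multi-task objective with a progressive training procedure can be illustrated with a minimal sketch. The abstract does not specify the stages, task weights, or step counts, so everything below is an assumption made for illustration: the two-stage schedule (parallel text first, then all tasks jointly), the equal task weights, and the names `active_tasks`, `multitask_loss`, and `mt_pretrain_steps` are hypothetical and are not taken from the XSTNet code.

```python
from typing import Dict, List

# Hypothetical task names: speech translation, speech recognition, text translation.
ST, ASR, MT = "st", "asr", "mt"


def active_tasks(step: int, mt_pretrain_steps: int) -> List[str]:
    """Return the tasks trained at a given step of an assumed two-stage schedule."""
    if step < mt_pretrain_steps:
        # Stage 1 (assumed): train on parallel bilingual text only.
        return [MT]
    # Stage 2 (assumed): train on all tasks jointly.
    return [ST, ASR, MT]


def multitask_loss(losses: Dict[str, float], tasks: List[str]) -> float:
    """Sum the per-task losses of the active tasks (equal weights assumed)."""
    return sum(losses[t] for t in tasks)


if __name__ == "__main__":
    # Made-up per-task loss values, used only to show the bookkeeping.
    example = {ST: 2.0, ASR: 1.0, MT: 0.5}
    for step in (0, 999, 1000):
        tasks = active_tasks(step, mt_pretrain_steps=1000)
        print(step, tasks, multitask_loss(example, tasks))
```

In a real system the per-task losses would come from a shared encoder-decoder model rather than a fixed dictionary; the sketch only shows how a schedule can switch the set of tasks that contribute to the objective.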


