Improving speech translation by fusing speech and text

05/23/2023
by Wenbiao Yin, et al.

In speech translation, leveraging multimodal data to improve model performance and to address the limitations of individual modalities has proven highly effective. In this paper, we harness the complementary strengths of speech and text, two disparate modalities. We observe three levels of modality gap between them: Modal input representation, Modal semantic, and Modal hidden states. To tackle these gaps, we propose Fuse-Speech-Text (FST), a cross-modal model that supports three distinct input modalities for translation: speech, text, and fused speech-text. We leverage multiple techniques for cross-modal alignment and conduct a comprehensive analysis to assess their impact on speech translation, machine translation, and fused speech-text translation. We evaluate FST on the MuST-C, GigaST, and newstest benchmarks. Experiments show that FST achieves an average of 34.0 BLEU on MuST-C En→De/Es/Fr (+1.1 BLEU over the previous state of the art). Further experiments demonstrate that, unlike prior work, FST does not degrade on the MT task. Instead, it yields an average improvement of 3.2 BLEU over the pre-trained MT model.

research
05/24/2023

CMOT: Cross-modal Mixup via Optimal Transport for Speech Translation

End-to-end speech translation (ST) is the task of translating speech sig...
research
03/20/2022

STEMM: Self-learning with Speech-text Manifold Mixup for Speech Translation

How to learn a better speech representation for end-to-end speech-to-tex...
research
05/07/2021

Learning Shared Semantic Space for Speech-to-Text Translation

Having numerous potential applications and great impact, end-to-end spee...
research
02/10/2021

Fused Acoustic and Text Encoding for Multimodal Bilingual Pretraining and Speech Translation

Recently text and speech representation learning has successfully improv...
research
05/15/2023

Understanding and Bridging the Modality Gap for Speech Translation

How to achieve better end-to-end speech translation (ST) by leveraging (...
research
05/19/2023

DUB: Discrete Unit Back-translation for Speech Translation

How can speech-to-text translation (ST) perform as well as machine trans...
research
08/28/2023

An Empirical Study of Consistency Regularization for End-to-End Speech-to-Text Translation

Consistency regularization methods, such as R-Drop (Liang et al., 2021) ...
