Understanding and Bridging the Modality Gap for Speech Translation

05/15/2023
by Qingkai Fang, et al.

How can end-to-end speech translation (ST) be improved by leveraging text machine translation (MT) data? Among existing techniques, multi-task learning is an effective way to share knowledge between ST and MT, since the additional MT data helps the model learn the source-to-target mapping. However, because speech and text differ, a gap always remains between ST and MT. In this paper, we first aim to understand this modality gap in terms of target-side representation differences, and link it to another well-known problem in neural machine translation: exposure bias. We find that the modality gap is relatively small during training, except in some difficult cases, but keeps growing during inference due to a cascading effect. To address these problems, we propose the Cross-modal Regularization with Scheduled Sampling (Cress) method. Specifically, we regularize the output predictions of ST and MT, whose target-side contexts are derived by sampling between ground-truth words and self-generated words with a varying probability. Furthermore, we introduce token-level adaptive training, which assigns different training weights to target tokens in order to handle difficult cases with large modality gaps. Experiments and analysis show that our approach effectively bridges the modality gap and achieves promising results in all eight directions of the MuST-C dataset.
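The two ingredients named in the abstract, a target-side context sampled between ground-truth and self-generated words, and a regularizer on the ST and MT output predictions, can be illustrated with a small sketch. This is not the authors' implementation. The linear decay schedule, the choice of a symmetric KL divergence as the regularizer, the way per-token weights enter the loss, and all function names (`gold_probability`, `mix_target_context`, `cross_modal_regularizer`) are assumptions made for illustration only.

```python
import math
import random


def gold_probability(step, total_steps, floor=0.0):
    """Hypothetical schedule: probability of keeping the ground-truth word,
    decaying linearly from 1.0 toward `floor` over training."""
    return max(floor, 1.0 - step / total_steps)


def mix_target_context(gold, predicted, p_gold, rng):
    """At each position keep the ground-truth token with probability p_gold,
    otherwise use the model's own predicted token."""
    return [g if rng.random() < p_gold else p for g, p in zip(gold, predicted)]


def kl(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as lists."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def symmetric_kl(p, q):
    """Average of KL(p || q) and KL(q || p); an assumed choice of divergence."""
    return 0.5 * (kl(p, q) + kl(q, p))


def cross_modal_regularizer(st_dists, mt_dists, weights=None):
    """Weighted mean divergence between ST and MT per-token output
    distributions. `weights` stands in for token-level adaptive weights;
    how such weights are computed is not specified here."""
    if weights is None:
        weights = [1.0] * len(st_dists)
    total = sum(
        w * symmetric_kl(p, q) for w, p, q in zip(weights, st_dists, mt_dists)
    )
    return total / len(st_dists)


if __name__ == "__main__":
    rng = random.Random(0)
    gold = ["das", "ist", "gut"]
    predicted = ["das", "war", "gut"]
    p_gold = gold_probability(step=50, total_steps=100)
    context = mix_target_context(gold, predicted, p_gold, rng)

    # Toy per-token output distributions over a three-word vocabulary.
    st = [[0.7, 0.2, 0.1], [0.4, 0.4, 0.2], [0.1, 0.1, 0.8]]
    mt = [[0.8, 0.1, 0.1], [0.1, 0.8, 0.1], [0.1, 0.1, 0.8]]
    print(context, cross_modal_regularizer(st, mt))
```

In this sketch the second token, where the ST and MT distributions disagree most, contributes the largest share of the regularizer, which is the kind of position a larger token-level weight would emphasize.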

research
05/24/2023

CMOT: Cross-modal Mixup via Optimal Transport for Speech Translation

End-to-end speech translation (ST) is the task of translating speech sig...
research
03/20/2022

STEMM: Self-learning with Speech-text Manifold Mixup for Speech Translation

How to learn a better speech representation for end-to-end speech-to-tex...
research
05/23/2023

Improving speech translation by fusing speech and text

In speech translation, leveraging multimodal data to improve model perfo...
research
05/09/2023

E2TIMT: Efficient and Effective Modal Adapter for Text Image Machine Translation

Text image machine translation (TIMT) aims to translate texts embedded i...
research
05/07/2021

Learning Shared Semantic Space for Speech-to-Text Translation

Having numerous potential applications and great impact, end-to-end spee...
research
08/28/2023

An Empirical Study of Consistency Regularization for End-to-End Speech-to-Text Translation

Consistency regularization methods, such as R-Drop (Liang et al., 2021) ...
research
05/07/2023

Leveraging Synthetic Targets for Machine Translation

In this work, we provide a recipe for training machine translation model...
