Bridging the Modality Gap for Speech-to-Text Translation

10/28/2020
by Yuchen Liu, et al.

End-to-end speech translation aims to translate speech in one language into text in another language in an end-to-end manner. Most existing methods employ an encoder-decoder structure in which a single encoder must learn acoustic representations and semantic information simultaneously. This design ignores the modality differences between speech and text and overloads the encoder, which makes such models difficult to train. To address these issues, we propose a Speech-to-Text Adaptation for Speech Translation (STAST) model, which aims to improve end-to-end performance by bridging the modality gap between speech and text. Specifically, we decouple the speech translation encoder into three parts and introduce a shrink mechanism to match the length of the speech representation with that of the corresponding text transcription. To obtain better semantic representations, we fully integrate a text-based translation model into STAST so that the two tasks can be trained in the same latent space. Furthermore, we introduce a cross-modal adaptation method to reduce the distance between speech and text representations. Experimental results on English-French and English-German speech translation corpora show that our model significantly outperforms strong baselines and achieves new state-of-the-art performance.
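The abstract does not say how the shrink mechanism or the cross-modal distance is computed, so the following is only a minimal sketch under stated assumptions, not the authors' implementation. It assumes each acoustic frame carries a CTC-style label with a reserved blank id: frames labeled blank are dropped and runs of identical labels are averaged into one vector, so the speech sequence ends up with one vector per predicted token. It then assumes a mean squared distance between the shrunk speech vectors and same-length text vectors. The names `shrink`, `mean_squared_distance`, and `blank` are hypothetical.

```python
from itertools import groupby


def shrink(frames, frame_labels, blank=0):
    """Collapse frame-level vectors into one vector per non-blank label run.

    frames: list of equal-length float lists, one per acoustic frame.
    frame_labels: per-frame label ids (assumed CTC-style; `blank` is assumed).
    """
    if len(frames) != len(frame_labels):
        raise ValueError("frames and frame_labels must have the same length")
    shrunk = []
    start = 0
    for label, run in groupby(frame_labels):
        run_len = len(list(run))
        segment = frames[start:start + run_len]
        start += run_len
        if label == blank:
            continue  # blank frames carry no token, so they are dropped
        dim = len(segment[0])
        # Average the frames in this run into a single token-level vector.
        shrunk.append([sum(vec[d] for vec in segment) / run_len for d in range(dim)])
    return shrunk


def mean_squared_distance(speech_repr, text_repr):
    """Mean squared difference between two equal-length vector sequences.

    The choice of squared distance is an assumption made for illustration.
    """
    if len(speech_repr) != len(text_repr):
        raise ValueError("sequences must have the same length after shrinking")
    total = 0.0
    count = 0
    for s_vec, t_vec in zip(speech_repr, text_repr):
        for s, t in zip(s_vec, t_vec):
            total += (s - t) ** 2
            count += 1
    return total / count


# Four frames: two frames of token 2, one blank frame, one frame of token 4.
frames = [[1.0, 1.0], [3.0, 3.0], [0.0, 0.0], [5.0, 7.0]]
labels = [2, 2, 0, 4]
speech = shrink(frames, labels)           # [[2.0, 2.0], [5.0, 7.0]]
text = [[2.0, 2.0], [5.0, 5.0]]           # toy text-side vectors, same length
print(speech, mean_squared_distance(speech, text))
```

Because shrinking leaves one vector per predicted token, the speech and text sequences can be compared position by position, which is what makes a simple element-wise distance usable in this sketch.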

research
09/21/2020

TED: Triple Supervision Decouples End-to-end Speech-to-text Translation

An end-to-end speech-to-text translation (ST) takes audio in a source la...
research
02/10/2021

Fused Acoustic and Text Encoding for Multimodal Bilingual Pretraining and Speech Translation

Recently text and speech representation learning has successfully improv...
research
10/13/2021

End-to-end translation of human neural activity to speech with a dual-dual generative adversarial network

In a recent study of auditory evoked potential (AEP) based brain-compute...
research
05/07/2021

Learning Shared Semantic Space for Speech-to-Text Translation

Having numerous potential applications and great impact, end-to-end spee...
research
10/28/2022

Efficient Speech Translation with Dynamic Latent Perceivers

Transformers have been the dominant architecture for Speech Translation ...
research
05/25/2023

End-to-End Simultaneous Speech Translation with Differentiable Segmentation

End-to-end simultaneous speech translation (SimulST) outputs translation...
research
04/04/2022

Analysis of Joint Speech-Text Embeddings for Semantic Matching

Embeddings play an important role in many recent end-to-end solutions fo...
