
M-Adapter: Modality Adaptation for End-to-End Speech-to-Text Translation

by Jinming Zhao, et al.

End-to-end speech-to-text translation models are often initialized with a pre-trained speech encoder and a pre-trained text decoder. This leads to a significant training gap between pre-training and fine-tuning, largely due to the modality differences between speech outputs from the encoder and text inputs to the decoder. In this work, we aim to bridge the modality gap between speech and text to improve translation quality. We propose M-Adapter, a novel Transformer-based module, to adapt speech representations to text. While shrinking the speech sequence, M-Adapter produces the features desired for speech-to-text translation by modelling global and local dependencies of a speech sequence. Our experimental results show that our model outperforms a strong baseline by up to 1 BLEU point on the MuST-C En→De dataset. [Our code is available at]
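
The abstract does not spell out M-Adapter's internals. Purely as an illustration of what "shrinking the speech sequence" while modelling local and global dependencies could look like, the sketch below assumes strided average pooling for the local step and unparameterised dot-product self-attention for the global step. The function names (`avg_pool`, `self_attention`, `shrink_and_mix`) and the toy input are hypothetical, and this is not the authors' implementation.

```python
import math


def avg_pool(seq, stride):
    """Local step (assumed): average each window of `stride` consecutive frames."""
    pooled = []
    for start in range(0, len(seq), stride):
        window = seq[start:start + stride]
        dim = len(window[0])
        pooled.append([sum(f[d] for f in window) / len(window) for d in range(dim)])
    return pooled


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(seq):
    """Global step (assumed): every frame becomes a weighted mix of all frames.

    No learned projections are used; a real module would have trainable weights.
    """
    dim = len(seq[0])
    scale = math.sqrt(dim)
    out = []
    for q in seq:
        weights = softmax([sum(a * b for a, b in zip(q, k)) / scale for k in seq])
        out.append([sum(w * v[d] for w, v in zip(weights, seq)) for d in range(dim)])
    return out


def shrink_and_mix(frames, stride=2):
    """Shorten the sequence first, then mix information across the shorter sequence."""
    return self_attention(avg_pool(frames, stride))


# Toy "speech" input: 8 frames with 2 features each (made-up values).
frames = [[float(t), float(t % 3)] for t in range(8)]
adapted = shrink_and_mix(frames, stride=2)
print(len(frames), "->", len(adapted))
```

With `stride=2`, the output has half as many frames as the input, which is the kind of length reduction that brings a speech sequence closer to the length of its text counterpart.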




Bridging the Gap between Pre-Training and Fine-Tuning for End-to-End Speech Translation

End-to-end speech translation, a hot topic in recent years, aims to tran...

Inter-connection: Effective Connection between Pre-trained Encoder and Decoder for Speech Translation

In end-to-end speech translation, speech and text pre-trained models imp...

AttentionHTR: Handwritten Text Recognition Based on Attention Encoder-Decoder Networks

This work proposes an attention-based sequence-to-sequence model for han...

RedApt: An Adaptor for wav2vec 2 Encoding Faster and Smaller Speech Translation without Quality Compromise

Pre-trained speech Transformers in speech translation (ST) have facilita...

Dict-TTS: Learning to Pronounce with Prior Dictionary Knowledge for Text-to-Speech

Polyphone disambiguation aims to capture accurate pronunciation knowledg...

GraphTTS: graph-to-sequence modelling in neural text-to-speech

This paper leverages the graph-to-sequence method in neural text-to-spee...

Prosody Learning Mechanism for Speech Synthesis System Without Text Length Limit

Recent neural speech synthesis systems have gradually focused on the con...