M-Adapter: Modality Adaptation for End-to-End Speech-to-Text Translation

07/03/2022 · by Jinming Zhao, et al.

End-to-end speech-to-text translation models are often initialized with a pre-trained speech encoder and a pre-trained text decoder. This creates a significant gap between pre-training and fine-tuning, largely due to the modality difference between the speech outputs of the encoder and the text inputs expected by the decoder. In this work, we aim to bridge the modality gap between speech and text to improve translation quality. We propose M-Adapter, a novel Transformer-based module, to adapt speech representations to text. While shrinking the speech sequence, M-Adapter produces features suited to speech-to-text translation by modelling both the global and the local dependencies of the speech sequence. Our experimental results show that our model outperforms a strong baseline by up to 1 BLEU point on the MuST-C En→De dataset. Our code is available at https://github.com/mingzi151/w2v2-st.
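The abstract describes M-Adapter as a Transformer-based module that shortens the speech sequence while modelling its global and local dependencies. Below is a minimal, hypothetical PyTorch sketch of that general idea, using a strided convolution for shrinkage and local context and multi-head self-attention for global context; the class name, layer choices, and hyper-parameters are illustrative assumptions, not the authors' implementation (see https://github.com/mingzi151/w2v2-st for the released code).

# Illustrative sketch only: shrink a speech representation sequence while
# modelling local (convolution) and global (self-attention) dependencies.
# All names and hyper-parameters are assumptions, not the paper's M-Adapter.
import torch
import torch.nn as nn


class ModalityAdapterSketch(nn.Module):
    def __init__(self, dim: int = 768, n_heads: int = 8,
                 kernel_size: int = 3, stride: int = 2):
        super().__init__()
        # Strided 1-D convolution: captures local context and shrinks the
        # sequence length by roughly a factor of `stride`.
        self.conv = nn.Conv1d(dim, dim, kernel_size, stride=stride,
                              padding=kernel_size // 2)
        # Multi-head self-attention: models global dependencies across the
        # shortened sequence.
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, dim) outputs of a speech encoder, e.g. wav2vec 2.0
        h = self.conv(x.transpose(1, 2)).transpose(1, 2)  # (batch, ~time/stride, dim)
        attn_out, _ = self.attn(h, h, h)
        return self.norm(h + attn_out)  # residual connection + LayerNorm


if __name__ == "__main__":
    adapter = ModalityAdapterSketch()
    speech = torch.randn(2, 100, 768)   # dummy encoder outputs
    print(adapter(speech).shape)        # torch.Size([2, 50, 768])

The shortened, attention-refined sequence would then be fed to the text decoder in place of the raw speech-encoder outputs, which is the modality-adaptation role the abstract attributes to M-Adapter.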


