DeepAI

Fused Acoustic and Text Encoding for Multimodal Bilingual Pretraining and Speech Translation

02/10/2021
by   Renjie Zheng, et al.

Recently, text and speech representation learning have successfully improved many language-related tasks. However, all existing methods learn from only one input modality, while a unified acoustic and text representation is desired by many speech-related tasks such as speech translation. We propose a Fused Acoustic and Text Masked Language Model (FAT-MLM), which jointly learns a unified representation for both acoustic and text input. Within this cross-modal representation learning framework, we further present an end-to-end model for Fused Acoustic and Text Speech Translation (FAT-ST). Experiments on three translation directions show that our proposed speech translation models fine-tuned from FAT-MLM substantially improve translation quality (+5.90 BLEU).
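The core idea of the abstract — masking parts of a sequence that fuses acoustic frames and text tokens, then predicting the masked tokens from the joint representation — can be illustrated with a minimal sketch. This is an assumption-laden toy model, not the paper's actual architecture: the class names, dimensions, masking rate, and the choice of a plain Transformer encoder are all illustrative.

```python
# Toy sketch of a fused acoustic+text masked-LM step.
# All names and hyperparameters here are illustrative assumptions,
# not the FAT-MLM architecture from the paper.
import torch
import torch.nn as nn


class FusedMaskedLM(nn.Module):
    def __init__(self, vocab_size=100, acoustic_dim=80, d_model=64):
        super().__init__()
        self.text_embed = nn.Embedding(vocab_size, d_model)
        # Project filterbank-style acoustic frames into the shared model space.
        self.acoustic_proj = nn.Linear(acoustic_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, acoustic, tokens, text_mask):
        # acoustic: (B, T_a, acoustic_dim); tokens: (B, T_t); text_mask: (B, T_t) bool
        a = self.acoustic_proj(acoustic)
        t = self.text_embed(tokens)
        # Zero out embeddings at masked text positions (stand-in for a [MASK] token).
        t = t.masked_fill(text_mask.unsqueeze(-1), 0.0)
        # Fuse both modalities into one sequence so attention spans speech and text.
        fused = torch.cat([a, t], dim=1)
        h = self.encoder(fused)
        # Predict vocabulary logits at the text positions only.
        text_h = h[:, acoustic.size(1):]
        return self.lm_head(text_h)


model = FusedMaskedLM()
acoustic = torch.randn(2, 30, 80)            # 2 utterances, 30 frames each
tokens = torch.randint(0, 100, (2, 10))      # paired transcripts, 10 tokens each
mask = torch.zeros(2, 10, dtype=torch.bool)  # deterministic mask for the demo
mask[:, :2] = True
logits = model(acoustic, tokens, mask)       # (2, 10, 100)
loss = nn.functional.cross_entropy(logits[mask], tokens[mask])
```

The masked text positions are reconstructed from a representation that attends over both the acoustic frames and the surrounding text, which is what makes the learned encoding useful for a downstream speech-translation decoder.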
