MAM: Masked Acoustic Modeling for End-to-End Speech-to-Text Translation

10/22/2020
by Junkun Chen, et al.

End-to-end Speech-to-text Translation (E2E-ST), which directly translates source-language speech into target-language text, is widely useful in practice, but traditional cascaded approaches (ASR+MT) often suffer from error propagation in the pipeline. On the other hand, existing end-to-end solutions depend heavily on source-language transcriptions for pre-training or for multi-task training with Automatic Speech Recognition (ASR). We instead propose a simple technique to learn a robust speech encoder in a self-supervised fashion using only the speech side, which allows it to utilize speech data without transcriptions. This technique, termed Masked Acoustic Modeling (MAM), can also perform pre-training, for the first time, on any acoustic signals (including non-speech ones) without annotation. Compared with current state-of-the-art models on ST, our technique achieves a +1.4 BLEU improvement without using transcriptions, and +1.2 BLEU when transcriptions are used. Pre-training MAM on arbitrary acoustic signals also boosts downstream speech-related tasks.
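
To make the idea concrete, below is a minimal sketch of MAM-style self-supervised pre-training. It assumes PyTorch, log-mel spectrogram inputs, and simple per-frame random masking (the paper masks spans of the spectrogram); the class and parameter names such as MAMPretrainer and mask_ratio are illustrative assumptions, not the authors' implementation.

```python
# Sketch of Masked Acoustic Modeling (MAM) style pre-training.
# Assumptions (not from the paper's released code): PyTorch, log-mel
# spectrograms of shape (batch, frames, n_mels), a learned mask embedding,
# and an L2 reconstruction loss computed only on the masked frames.

import torch
import torch.nn as nn


class MAMPretrainer(nn.Module):
    def __init__(self, n_mels=80, d_model=256, n_heads=4, n_layers=6, mask_ratio=0.15):
        super().__init__()
        self.mask_ratio = mask_ratio
        self.input_proj = nn.Linear(n_mels, d_model)
        self.mask_embed = nn.Parameter(torch.randn(d_model))  # learned [MASK] vector
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.reconstruct = nn.Linear(d_model, n_mels)  # predict the masked frames

    def forward(self, spec):
        # spec: (batch, frames, n_mels) log-mel spectrogram; no transcription needed
        x = self.input_proj(spec)
        mask = torch.rand(spec.shape[:2], device=spec.device) < self.mask_ratio
        x = torch.where(mask.unsqueeze(-1), self.mask_embed.expand_as(x), x)
        hidden = self.encoder(x)
        pred = self.reconstruct(hidden)
        # Self-supervised signal: reconstruction loss only on masked positions
        loss = ((pred - spec) ** 2)[mask].mean()
        return loss, hidden


# Usage: pre-train on unlabeled audio, then reuse `encoder` as the ST speech encoder.
model = MAMPretrainer()
dummy_batch = torch.randn(8, 200, 80)  # 8 utterances, 200 frames, 80 mel bins
loss, _ = model(dummy_batch)
loss.backward()
```

After pre-training on unlabeled (possibly non-speech) audio, the encoder would be reused and fine-tuned inside the speech-to-text translation model, optionally alongside transcription-based multi-task training when transcriptions are available.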


Related research

05/12/2021 - Stacked Acoustic-and-Textual Encoding: Integrating the Pre-trained Models into Speech Translation Encoders
Encoder pre-training is promising in end-to-end Speech Translation (ST),...

09/05/2018 - Pre-training on high-resource speech recognition improves low-resource speech-to-text translation
We present a simple approach to improve direct speech-to-text translatio...

04/07/2022 - Speech Pre-training with Acoustic Piece
Previous speech pre-training methods, such as wav2vec2.0 and HuBERT, pre...

04/06/2022 - Enhanced Direct Speech-to-Speech Translation Using Self-supervised Pre-training and Data Augmentation
Direct speech-to-speech translation (S2ST) models suffer from data scarc...

02/02/2019 - Using multi-task learning to improve the performance of acoustic-to-word and conventional hybrid models
Acoustic-to-word (A2W) models that allow direct mapping from acoustic si...

07/29/2020 - Transformer-based unsupervised pre-training for acoustic representation learning
Computational audio analysis has become a central issue in associated ar...

04/23/2023 - SAR: Self-Supervised Anti-Distortion Representation for End-To-End Speech Model
In recent Text-to-Speech (TTS) systems, a neural vocoder often generates...
