Enhanced Direct Speech-to-Speech Translation Using Self-supervised Pre-training and Data Augmentation

04/06/2022
by Sravya Popuri, et al.

Direct speech-to-speech translation (S2ST) models suffer from data scarcity: little parallel S2ST data exists compared to the data available for conventional cascaded systems, which chain automatic speech recognition (ASR), machine translation (MT), and text-to-speech (TTS) synthesis. In this work, we explore self-supervised pre-training with unlabeled speech data and data augmentation to tackle this issue. We take advantage of a recently proposed speech-to-unit translation (S2UT) framework that encodes target speech into discrete representations, and we transfer pre-training and efficient partial finetuning techniques that work well for speech-to-text translation (S2T) to the S2UT domain by studying both speech encoder and discrete unit decoder pre-training. Our experiments show that self-supervised pre-training consistently improves model performance over multitask learning, with a BLEU gain of 4.3-12.0 under various data setups, and that it can be further combined with data augmentation techniques that apply MT to create weakly supervised training data. Audio samples are available at: https://facebookresearch.github.io/speech_translation/enhanced_direct_s2st_units/index.html
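
The abstract does not say how target speech is turned into discrete representations. As a rough illustration only, the sketch below assumes one common approach: each frame of speech features is mapped to the index of its nearest cluster centroid, and consecutive repeated indices are then collapsed into a shorter unit sequence. The centroids, frame values, and function names are hypothetical toy choices, not the authors' implementation; in practice the features would come from a self-supervised speech model and the centroids from clustering those features.

```python
from itertools import groupby
from typing import List, Sequence


def nearest_centroid(frame: Sequence[float],
                     centroids: Sequence[Sequence[float]]) -> int:
    """Return the index of the centroid closest to `frame` (squared Euclidean)."""
    best_idx, best_dist = 0, float("inf")
    for idx, centroid in enumerate(centroids):
        dist = sum((f - c) ** 2 for f, c in zip(frame, centroid))
        if dist < best_dist:
            best_idx, best_dist = idx, dist
    return best_idx


def speech_to_units(frames: Sequence[Sequence[float]],
                    centroids: Sequence[Sequence[float]]) -> List[int]:
    """Map every feature frame to a discrete unit ID (its nearest centroid)."""
    return [nearest_centroid(frame, centroids) for frame in frames]


def collapse_repeats(units: Sequence[int]) -> List[int]:
    """Merge runs of identical consecutive units into a single unit."""
    return [unit for unit, _ in groupby(units)]


# Hypothetical toy data: three 2-D centroids and six 2-D feature frames.
centroids = [[0.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
frames = [[0.1, 0.0], [0.0, 0.1], [0.9, 1.1],
          [1.0, 0.9], [0.1, 0.9], [0.0, 0.0]]

units = speech_to_units(frames, centroids)
reduced = collapse_repeats(units)
print(units)    # [0, 0, 1, 1, 2, 0]
print(reduced)  # [0, 1, 2, 0]
```

Under this assumption, a sequence such as `reduced` is what a discrete unit decoder would be trained to predict in place of text tokens or spectrogram frames.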


