Pre-training for Speech Translation: CTC Meets Optimal Transport

01/27/2023
by Phuong-Hang Le, et al.

The gap between speech and text modalities is a major challenge in speech-to-text translation (ST). Various methods have been proposed to reduce this gap, but most of them require architectural changes in ST training. In this work, we propose to mitigate this issue at the pre-training stage, requiring no change in the ST model. First, we show that the connectionist temporal classification (CTC) loss can reduce the modality gap by design. We provide a quantitative comparison with the more common cross-entropy loss, showing that pre-training with CTC consistently achieves better final ST accuracy. Nevertheless, CTC is only a partial solution; thus, as our second contribution, we propose a novel pre-training method combining CTC and optimal transport to further reduce this gap. Our method pre-trains a Siamese-like model composed of two encoders, one for acoustic inputs and the other for textual inputs, such that they produce representations that are close to each other in the Wasserstein space. Extensive experiments on the standard CoVoST-2 and MuST-C datasets show that our pre-training method, applied to the vanilla encoder-decoder Transformer, achieves state-of-the-art performance under the no-external-data setting and performs on par with recent strong multi-task learning systems trained with external data. Finally, our method can also be applied on top of these multi-task systems, leading to further improvements.
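To make the idea concrete, the following is a minimal PyTorch sketch (not the authors' released code) of a joint objective of this kind: a CTC loss on the speech branch plus an entropic-regularized optimal-transport (Sinkhorn) cost that pulls the speech-encoder and text-encoder representations together. All names here (sinkhorn_ot_cost, pretraining_loss, speech_feats, text_feats, ctc_logits, ot_weight) and the choice of the Sinkhorn approximation to the Wasserstein distance are illustrative assumptions, not the paper's exact formulation.

```python
import math

import torch
import torch.nn.functional as F


def sinkhorn_ot_cost(x, y, epsilon=0.1, n_iters=50):
    # Entropic-regularized optimal-transport cost between two feature sequences:
    # x is (T, d) speech-encoder states, y is (S, d) text-encoder states.
    # Uniform marginals over positions are assumed.
    cost = torch.cdist(x, y, p=2)  # (T, S) pairwise L2 costs
    log_a = torch.full((x.size(0),), -math.log(x.size(0)),
                       device=x.device, dtype=x.dtype)
    log_b = torch.full((y.size(0),), -math.log(y.size(0)),
                       device=y.device, dtype=y.dtype)
    f = torch.zeros_like(log_a)
    g = torch.zeros_like(log_b)
    for _ in range(n_iters):  # log-domain Sinkhorn updates of the dual potentials
        f = -epsilon * torch.logsumexp((g[None, :] - cost) / epsilon + log_b[None, :], dim=1)
        g = -epsilon * torch.logsumexp((f[:, None] - cost) / epsilon + log_a[:, None], dim=0)
    plan = torch.exp((f[:, None] + g[None, :] - cost) / epsilon
                     + log_a[:, None] + log_b[None, :])  # (T, S) transport plan
    return (plan * cost).sum()  # regularized Wasserstein-style cost


def pretraining_loss(speech_feats, text_feats, ctc_logits, targets,
                     input_lengths, target_lengths, ot_weight=1.0):
    # Joint pre-training objective for one example: CTC on the speech branch
    # plus an OT term aligning the two encoders' output sequences.
    # ctc_logits: (T, 1, vocab); targets: (1, L) transcript token ids.
    log_probs = F.log_softmax(ctc_logits, dim=-1)
    ctc = F.ctc_loss(log_probs, targets, input_lengths, target_lengths)
    ot = sinkhorn_ot_cost(speech_feats, text_feats)
    return ctc + ot_weight * ot
```

The regularization strength epsilon, the number of Sinkhorn iterations, and ot_weight are placeholder hyperparameters; in practice they would be tuned, and the OT term would be computed per batch between the acoustic and textual encoders during pre-training before standard ST fine-tuning.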


