Revisiting End-to-End Speech-to-Text Translation From Scratch

06/09/2022
by   Biao Zhang, et al.

End-to-end (E2E) speech-to-text translation (ST) often depends on pretraining its encoder and/or decoder using source transcripts via speech recognition or text translation tasks, without which translation performance drops substantially. However, transcripts are not always available, and how important such pretraining is for E2E ST has rarely been studied in the literature. In this paper, we revisit this question and explore the extent to which the quality of E2E ST trained on speech-translation pairs alone can be improved. We reexamine several techniques proven beneficial to ST previously, and offer a set of best practices that bias a Transformer-based E2E ST system toward training from scratch. In addition, we propose a parameterized distance penalty to facilitate the modeling of locality in the self-attention model for speech. On four benchmarks covering 23 languages, our experiments show that, without using any transcripts or pretraining, the proposed system reaches and even outperforms previous studies adopting pretraining, although the gap remains in (extremely) low-resource settings. Finally, we discuss neural acoustic feature modeling, where a neural model is designed to extract acoustic features from raw speech signals directly, with the goal of simplifying inductive biases and giving the model more freedom in describing speech. For the first time, we demonstrate its feasibility and show encouraging results on ST tasks.
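The distance penalty mentioned above biases self-attention toward nearby frames by subtracting a distance-dependent term from the attention logits. The paper's parameterization is more involved (and learned per head); the sketch below is a minimal illustration assuming a single scalar penalty weight `w`, with the function name `attention_with_distance_penalty` chosen here for clarity rather than taken from the paper.

```python
import numpy as np

def attention_with_distance_penalty(q, k, v, w):
    """Single-head self-attention with a distance penalty on the logits.

    q, k, v: arrays of shape (T, d) for a length-T speech sequence.
    w: non-negative scalar; larger values bias attention toward
       nearby positions (the paper learns this parameter instead).
    """
    T, d = q.shape
    # Standard scaled dot-product scores, shape (T, T).
    logits = q @ k.T / np.sqrt(d)
    # |i - j| distance matrix between query and key positions.
    idx = np.arange(T)
    dist = np.abs(idx[:, None] - idx[None, :])
    # Penalize distant positions before the softmax.
    logits = logits - w * dist
    # Numerically stable softmax over keys.
    logits -= logits.max(axis=-1, keepdims=True)
    probs = np.exp(logits)
    probs /= probs.sum(axis=-1, keepdims=True)
    return probs @ v
```

With `w = 0` this reduces to ordinary self-attention; as `w` grows, each position attends almost entirely to itself, so the output approaches `v`, which is one way to see how the penalty encodes locality.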


