Phone Features Improve Speech Translation

by Elizabeth Salesky, et al.

End-to-end models for speech translation (ST) more tightly couple speech recognition (ASR) and machine translation (MT) than a traditional cascade of separate ASR and MT models, with simpler model architectures and the potential for reduced error propagation. Their performance is often assumed to be superior, though in many conditions this is not yet the case. We compare cascaded and end-to-end models across high, medium, and low-resource conditions, and show that cascades remain stronger baselines. Further, we introduce two methods to incorporate phone features into ST models. We show that these features improve both architectures, closing the gap between end-to-end models and cascades, and outperforming previous academic work – by up to 9 BLEU on our low-resource setting.
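One common way to incorporate phone features into an end-to-end ST model is to attach a phone embedding to each frame of the acoustic input, using frame-level phone labels from an external aligner or recognizer. The sketch below illustrates this idea only; the function names, dimensions, and the use of an untrained random embedding table are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def embed_phones(phone_ids, num_phones, embed_dim, rng):
    """Look up an embedding vector for each frame's phone label.

    The table is randomly initialized here for illustration; in a real
    model it would be a trained parameter.
    """
    table = rng.standard_normal((num_phones, embed_dim))
    return table[phone_ids]  # shape: (num_frames, embed_dim)

def concat_phone_features(fbank, phone_ids, num_phones=40, embed_dim=8, seed=0):
    """Concatenate a phone embedding onto each frame's filterbank vector."""
    rng = np.random.default_rng(seed)
    phone_emb = embed_phones(phone_ids, num_phones, embed_dim, rng)
    # (num_frames, fbank_dim + embed_dim): the ST encoder then consumes
    # the augmented features instead of raw filterbanks alone.
    return np.concatenate([fbank, phone_emb], axis=-1)

# Toy example: 100 frames of 80-dim filterbanks with frame-level phone labels
fbank = np.zeros((100, 80))
phone_ids = np.zeros(100, dtype=int)
features = concat_phone_features(fbank, phone_ids)
print(features.shape)  # (100, 88)
```

The same phone labels can instead be used to segment and average frames within each phone, trading sequence length for robustness; both variants presuppose a phone recognizer of reasonable quality.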
