
Regularizing End-to-End Speech Translation with Triangular Decomposition Agreement

by Yichao Du, et al.
Alibaba Group
Rutgers University

End-to-end speech-to-text translation (E2E-ST) is becoming increasingly popular because of its potential for less error propagation, lower latency, and fewer parameters. Given a triplet training corpus ⟨speech, transcription, translation⟩, a conventional high-quality E2E-ST system uses the ⟨speech, transcription⟩ pairs to pre-train the model and then uses the ⟨speech, translation⟩ pairs to optimize it further. However, each stage of this process involves only two of the three elements, and this loose coupling fails to fully exploit the association within the triplet data. In this paper, we model the joint probability of transcription and translation conditioned on the speech input so that the triplet data is used directly. On this basis, we propose a novel regularization method for model training that improves the agreement between the two paths of the dual-path decomposition of this joint probability, which should be equal in theory. To achieve this, we add two Kullback-Leibler divergence regularization terms to the training objective to reduce the mismatch between the output probabilities of the two paths. The trained model can then be naturally converted into an E2E-ST model by means of a pre-defined early-stop tag. Experiments on the MuST-C benchmark demonstrate that the proposed approach significantly outperforms state-of-the-art E2E-ST baselines on all 8 language pairs, while also achieving better performance on the automatic speech recognition task. Our code is open-sourced at
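The agreement idea in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the abstract does not give the exact form of the loss, so the code below assumes that each path yields a discrete probability distribution over the same set of candidate outputs and that the regularizer is the sum of the two directed KL divergences between them. The function names `kl_divergence` and `agreement_penalty`, the toy distributions, and the interpretation of the two paths (transcription first versus translation first) are all hypothetical choices made for illustration.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as equal-length lists."""
    # eps guards against log(0) when q assigns zero mass to an outcome.
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )


def agreement_penalty(p_path_a, p_path_b):
    """Sum of the two directed KL terms between the outputs of the two paths.

    Assumption: this stands in for the two KL regularization terms mentioned
    in the abstract; the real objective may weight or define them differently.
    """
    return kl_divergence(p_path_a, p_path_b) + kl_divergence(p_path_b, p_path_a)


# Toy distributions over four candidate outputs (made-up numbers).
# Assumed path A: speech -> transcription -> translation.
# Assumed path B: speech -> translation -> transcription.
path_a = [0.70, 0.15, 0.10, 0.05]
path_b = [0.55, 0.25, 0.15, 0.05]

penalty = agreement_penalty(path_a, path_b)   # positive when the paths disagree
identical = agreement_penalty(path_a, path_a)  # zero when the paths agree

print(f"penalty between paths: {penalty:.4f}")
print(f"penalty for identical paths: {identical:.4f}")
```

In a training loop, a term of this kind would be added to the usual likelihood losses so that minimizing it pushes the two paths toward the same output probabilities; how it is weighted is not stated in the abstract.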



