Exploring Phoneme-Level Speech Representations for End-to-End Speech Translation

by Elizabeth Salesky, et al.
Carnegie Mellon University

Previous work on end-to-end translation from speech has primarily used frame-level features as speech representations, which create longer, sparser sequences than text. We show that a naive method to create compressed phoneme-like speech representations is far more effective and efficient for translation than traditional frame-level speech features. Specifically, we generate phoneme labels for speech frames and average consecutive frames with the same label to create shorter, higher-level source sequences for translation. We see improvements of up to 5 BLEU on both our high and low resource language pairs, with a reduction in training time of 60%. The improvements hold across multiple data sizes and two language pairs.
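The compression step the abstract describes can be sketched in a few lines: given one phoneme label per frame, collapse each run of consecutive frames with the same label into a single averaged vector. This is a minimal illustration assuming frame features as a NumPy array; the function and variable names are illustrative, not from the paper's code.

```python
import numpy as np

def average_phoneme_segments(frames, labels):
    """Average consecutive frames that share a phoneme label.

    frames: (T, D) array of frame-level speech features
    labels: length-T sequence of phoneme labels, one per frame
    Returns a shorter (S, D) array of segment means and the S segment labels.
    """
    labels = np.asarray(labels)
    # Indices where the label changes between adjacent frames
    boundaries = np.flatnonzero(labels[1:] != labels[:-1]) + 1
    # Split the frame sequence at those boundaries and average each run
    segments = [run.mean(axis=0) for run in np.split(frames, boundaries)]
    segment_labels = labels[np.concatenate(([0], boundaries))]
    return np.stack(segments), segment_labels
```

For example, six frames labeled `a a b b b c` collapse to three averaged vectors labeled `a b c`, shortening the source sequence the translation model must attend over.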




Related papers

End-to-End Automatic Speech Translation of Audiobooks

We investigate end-to-end speech-to-text translation on a corpus of audi...

Leveraging unsupervised and weakly-supervised data to improve direct speech-to-speech translation

End-to-end speech-to-speech translation (S2ST) without relying on interm...

Self-Supervised Representations Improve End-to-End Speech Translation

End-to-end speech-to-text translation can provide a simpler and smaller ...

Phone Features Improve Speech Translation

End-to-end models for speech translation (ST) more tightly couple speech...

On Using SpecAugment for End-to-End Speech Translation

This work investigates a simple data augmentation technique, SpecAugment...

Speechformer: Reducing Information Loss in Direct Speech Translation

Transformer-based models have gained increasing popularity achieving sta...

SegAugment: Maximizing the Utility of Speech Translation Data with Segmentation-based Augmentations

Data scarcity is one of the main issues with the end-to-end approach for...
