Relative Positional Encoding for Speech Recognition and Direct Translation

by Ngoc-Quan Pham, et al.
Maastricht University
Johns Hopkins University
Carnegie Mellon University

Transformer models are powerful sequence-to-sequence architectures that can map speech inputs directly to transcriptions or translations. However, the mechanism for modeling positions in this model was tailored for text, and is therefore less well suited to acoustic inputs. In this work, we adapt the relative position encoding scheme to the Speech Transformer, where the key addition is the relative distance between input states in the self-attention network. As a result, the network can better adapt to the variable distributions present in speech data. Our experiments show that the resulting model achieves the best recognition result on the Switchboard benchmark in the non-augmentation condition, and the best published result on the MuST-C speech translation benchmark. We also show that this model makes better use of synthetic data than the Transformer, and adapts better to variable sentence segmentation quality in speech translation.
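As a rough illustration of what "relative distance between input states in the self-attention network" can mean, the sketch below adds a position term that depends only on the offset i - j to the usual content-based attention score. It is a simplified toy, not the authors' implementation: the function names, the sinusoidal encoding of the signed offset, and the omission of learned projections and biases are all assumptions made for brevity.

```python
import math

def sinusoid(offset, dim):
    """Sinusoidal encoding of a signed relative offset (hypothetical helper)."""
    enc = []
    for k in range(0, dim, 2):
        freq = 1.0 / (10000 ** (k / dim))
        enc.append(math.sin(offset * freq))
        enc.append(math.cos(offset * freq))
    return enc[:dim]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def relative_self_attention(states):
    """Single-head self-attention with an added relative-distance term.

    Assumption: queries, keys and values are the raw input states (no learned
    projections), so that only the role of the relative term is visible.
    """
    n, dim = len(states), len(states[0])
    scale = math.sqrt(dim)
    outputs, weights = [], []
    for i in range(n):
        scores = []
        for j in range(n):
            content = dot(states[i], states[j])               # content-based term
            position = dot(states[i], sinusoid(i - j, dim))   # depends only on i - j
            scores.append((content + position) / scale)
        w = softmax(scores)
        weights.append(w)
        outputs.append([sum(w[j] * states[j][d] for j in range(n))
                        for d in range(dim)])
    return outputs, weights

# Toy input: three "acoustic frames" with four features each (made-up values).
states = [[0.1, 0.2, 0.3, 0.4], [0.4, 0.3, 0.2, 0.1], [0.0, 0.5, 0.5, 0.0]]
outputs, weights = relative_self_attention(states)
```

Because the position term is a function of i - j rather than of absolute indices, shifting the whole sequence in time leaves that term unchanged, which is one intuition for why such a scheme may suit speech inputs of varying length and segmentation.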



Conformer-based End-to-end Speech Recognition With Rotary Position Embedding

Transformer-based end-to-end speech recognition models have received con...

Streaming Simultaneous Speech Translation with Augmented Memory Transformer

Transformer-based models have achieved state-of-the-art performance on s...

Transformer-based Acoustic Modeling for Hybrid Speech Recognition

We propose and evaluate transformer-based acoustic models (AMs) for hybr...

Lattice Transformer for Speech Translation

Recent advances in sequence modeling have highlighted the strengths of t...

Fluent Translations from Disfluent Speech in End-to-End Speech Translation

Spoken language translation applications for speech suffer due to conver...

CAPE: Encoding Relative Positions with Continuous Augmented Positional Embeddings

Without positional information, attention-based transformer neural netwo...

Variable Attention Masking for Configurable Transformer Transducer Speech Recognition

This work studies the use of attention masking in transformer transducer...
