Relative Positional Encoding for Speech Recognition and Direct Translation

05/20/2020
by   Ngoc-Quan Pham, et al.
0

Transformer models are powerful sequence-to-sequence architectures that are capable of directly mapping speech inputs to transcriptions or translations. However, the mechanism for modeling positions in this model was tailored for text modeling, and thus is less ideal for acoustic inputs. In this work, we adapt the relative position encoding scheme to the Speech Transformer, where the key addition is relative distance between input states in the self-attention network. As a result, the network can better adapt to the variable distributions present in speech data. Our experiments show that our resulting model achieves the best recognition result on the Switchboard benchmark in the non-augmentation condition, and the best published result in the MuST-C speech translation benchmark. We also show that this model is able to better utilize synthetic data than the Transformer, and adapts better to variable sentence segmentation quality for speech translation.

READ FULL TEXT

page 1

page 2

page 3

page 4

research
07/13/2021

Conformer-based End-to-end Speech Recognition With Rotary Position Embedding

Transformer-based end-to-end speech recognition models have received con...
research
10/30/2020

Streaming Simultaneous Speech Translation with Augmented Memory Transformer

Transformer-based models have achieved state-of-the-art performance on s...
research
10/22/2019

Transformer-based Acoustic Modeling for Hybrid Speech Recognition

We propose and evaluate transformer-based acoustic models (AMs) for hybr...
research
06/13/2019

Lattice Transformer for Speech Translation

Recent advances in sequence modeling have highlighted the strengths of t...
research
06/03/2019

Fluent Translations from Disfluent Speech in End-to-End Speech Translation

Spoken language translation applications for speech suffer due to conver...
research
06/06/2021

CAPE: Encoding Relative Positions with Continuous Augmented Positional Embeddings

Without positional information, attention-based transformer neural netwo...
research
11/02/2022

Variable Attention Masking for Configurable Transformer Transducer Speech Recognition

This work studies the use of attention masking in transformer transducer...

Please sign up or login with your details

Forgot password? Click here to reset