Speechformer: Reducing Information Loss in Direct Speech Translation

09/09/2021
by Sara Papi, et al.

Transformer-based models have gained increasing popularity, achieving state-of-the-art performance in many research fields, including speech translation. However, the Transformer's quadratic complexity with respect to the input sequence length prevents its direct adoption for audio signals, which are typically represented by long sequences. Current solutions resort to an initial, sub-optimal compression based on a fixed sampling of raw audio features, so potentially useful linguistic information is not accessible to the higher-level layers of the architecture. To solve this issue, we propose Speechformer, an architecture that, thanks to reduced memory usage in the attention layers, avoids the initial lossy compression and aggregates information only at a higher level, according to more informed linguistic criteria. Experiments on three language pairs (en->de/es/nl) show the efficacy of our solution, with gains of up to 0.8 BLEU on the standard MuST-C corpus and of up to 4.0 BLEU in a low-resource scenario.
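The abstract contrasts content-agnostic downsampling with linguistically informed aggregation. The minimal PyTorch sketch below illustrates that contrast under stated assumptions: the 4x strided-convolution baseline, the 80-channel mel-filterbank input, and the ctc_compress helper (which merges consecutive frames sharing the same greedy CTC prediction, a plausible reading of "more informed linguistic criteria") are illustrative choices, not the paper's exact implementation.

```python
import torch
import torch.nn as nn

# Fixed compression (the approach the paper avoids): two strided
# convolutions always merge 4 adjacent frames, regardless of content.
downsample = nn.Sequential(
    nn.Conv1d(80, 256, kernel_size=3, stride=2, padding=1),
    nn.ReLU(),
    nn.Conv1d(256, 256, kernel_size=3, stride=2, padding=1),
    nn.ReLU(),
)
mel = torch.randn(1, 80, 1000)   # (batch, mel bins, frames) ~ 10 s of audio
print(downsample(mel).shape)     # torch.Size([1, 256, 250]): content-agnostic 4x reduction


def ctc_compress(frames: torch.Tensor, ctc_logits: torch.Tensor) -> torch.Tensor:
    """Merge consecutive frames whose greedy CTC predictions agree.

    frames:     (T, d) encoder states, one per audio frame.
    ctc_logits: (T, V) per-frame scores over a character/phone vocabulary.
    Returns a (T', d) tensor, T' <= T, with one averaged vector per run of
    identical predictions, so compression follows linguistic boundaries.
    """
    labels = ctc_logits.argmax(dim=-1)                 # (T,) predicted label per frame
    _, counts = torch.unique_consecutive(labels, return_counts=True)
    groups = torch.split(frames, counts.tolist())      # runs of identical predictions
    return torch.stack([g.mean(dim=0) for g in groups])


frames, logits = torch.randn(1000, 256), torch.randn(1000, 32)
print(ctc_compress(frames, logits).shape)  # (T', 256), T' varies with the audio content
```

In the full model, the compressed sequence would then feed standard Transformer layers; the memory-efficient attention that lets Speechformer skip the initial convolutions is orthogonal to this aggregation step and is not sketched here.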

Related research

07/07/2021 · Efficient Transformer for Direct Speech Translation
The advent of Transformer-based models has surpassed the barriers of tex...

02/02/2021 · CTC-based Compression for Direct Speech Translation
Previous studies demonstrated that a dynamic phone-informed compression ...

04/17/2020 · Enriching the Transformer with Linguistic and Semantic Factors for Low-Resource Machine Translation
Introducing factors, that is to say, word features such as linguistic in...

10/28/2022 · Efficient Speech Translation with Dynamic Latent Perceivers
Transformers have been the dominant architecture for Speech Translation ...

06/04/2019 · Exploring Phoneme-Level Speech Representations for End-to-End Speech Translation
Previous work on end-to-end translation from speech has primarily used f...

05/19/2023 · AlignAtt: Using Attention-based Audio-Translation Alignments as a Guide for Simultaneous Speech Translation
Attention is the core mechanism of today's most used architectures for n...

02/09/2021 · Bayesian Transformer Language Models for Speech Recognition
State-of-the-art neural language models (LMs) represented by Transformer...
