Streaming Simultaneous Speech Translation with Augmented Memory Transformer

10/30/2020
by Xutai Ma, et al.

Transformer-based models have achieved state-of-the-art performance on speech translation tasks. However, the model architecture is not efficient enough for streaming scenarios, since self-attention is computed over the entire input sequence and the computational cost grows quadratically with the input length. Moreover, most previous work on simultaneous speech translation, the task of generating translations from partial audio input, ignores the time spent generating the translation when analyzing latency. Under that assumption, a system may show good latency-quality trade-offs yet still be inapplicable in real-time scenarios. In this paper, we focus on streaming simultaneous speech translation, where systems must not only translate from partial input but also handle very long or continuous input. We propose an end-to-end transformer-based sequence-to-sequence model equipped with an augmented memory transformer encoder, an approach that has shown great success on streaming automatic speech recognition with hybrid or transducer-based models. We empirically evaluate the effect of segment, context, and memory sizes on the proposed model and compare our approach to a transformer with a unidirectional mask.
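The key property described above is that attention is restricted to a short segment (plus limited context and a small bank of summary "memory" vectors), so the per-segment cost stays bounded instead of growing with the full input. Below is a minimal sketch of that pattern, assuming a PyTorch-style interface; the class name, the hyperparameters (`segment_size`, `left_context`, `memory_size`), and the mean-pooled segment summary are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn

class AugmentedMemoryAttention(nn.Module):
    """Sketch of segment-based self-attention with a memory bank (illustrative only)."""

    def __init__(self, dim=256, heads=4, segment_size=32, left_context=16, memory_size=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.segment_size = segment_size
        self.left_context = left_context
        self.memory_size = memory_size  # number of past summary vectors kept

    def forward(self, frames):  # frames: (batch, time, dim)
        outputs, memory_bank = [], []
        for start in range(0, frames.size(1), self.segment_size):
            seg = frames[:, start:start + self.segment_size]          # current segment
            ctx_start = max(0, start - self.left_context)
            left = frames[:, ctx_start:start]                         # short left context
            mem = (torch.cat(memory_bank[-self.memory_size:], dim=1)
                   if memory_bank else seg.new_zeros(seg.size(0), 0, seg.size(2)))
            # Attend only over [memory, left context, segment]; cost per step is
            # bounded by the memory/context/segment sizes, not the full input.
            kv = torch.cat([mem, left, seg], dim=1)
            out, _ = self.attn(seg, kv, kv)
            outputs.append(out)
            # Summarize this segment into one vector for future segments (assumption:
            # mean pooling stands in for the paper's learned summary).
            memory_bank.append(out.mean(dim=1, keepdim=True))
        return torch.cat(outputs, dim=1)

x = torch.randn(2, 100, 256)
y = AugmentedMemoryAttention()(x)
print(y.shape)  # torch.Size([2, 100, 256])
```

Because each segment attends to a key/value set whose size is capped by the segment, context, and memory sizes, very long or continuous input can be encoded with roughly constant per-segment cost, which is what makes this encoder suitable for streaming.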


