In the past few years, models employing self-attention have been achieving state-of-the-art results in many tasks such as machine translation, language modeling, and language understanding [1, 2]. In particular, large Transformer-based language models have brought gains in speech recognition tasks when used for second-pass re-scoring and in first-pass shallow fusion [3]. As typically used in sequence-to-sequence transduction tasks [4, 5, 6, 7, 8], Transformer-based models attend over encoder features using decoder features, implying that the decoding has to be done in a label-synchronous way, thereby posing a challenge for streaming speech recognition applications. An additional challenge for streaming speech recognition with these models is that the number of computations for self-attention increases quadratically with the input sequence length. For streaming to be computationally practical, it is highly desirable that the time it takes to process each frame remains constant relative to the length of the input.
For streaming speech recognition models, recurrent neural networks (RNNs) have been the de facto choice, since they can model the temporal dependencies in the audio features effectively [9] while maintaining a constant computational requirement for each frame. Streamable end-to-end modeling architectures such as the Recurrent Neural Network Transducer (RNN-T) [10, 11, 12], Recurrent Neural Aligner (RNA) [13], and Neural Transducer [14] utilize an encoder-decoder framework where both the encoder and the decoder are layers of RNNs that generate features from audio and labels, respectively. In particular, the RNN-T and RNA models are trained to learn alignments between the acoustic encoder features and the label encoder features, and so lend themselves naturally to frame-synchronous decoding.
Previously, several optimizations were performed to enable running RNN-T on device, and extensive architecture and output unit searches were conducted for RNN-T [11]. In this paper, we explore the possibility of replacing the RNN-based audio and label encoders in the conventional RNN-T architecture with Transformer encoders. With a view to preserving model streamability, we show that Transformer-based models can be trained with self-attention restricted to a fixed number of past input frames and previous labels. This results in some degradation of performance (compared to attending to all past input frames and labels), but the model then satisfies a constant computational requirement for processing each frame, making it suitable for streaming. Given the simple architecture and parallelizable nature of self-attention computations, we observe large improvements in training time and training resource utilization compared to RNN-T models that employ RNNs.
The RNN-T architecture¹ (as depicted in Figure 1) is a neural network architecture that can be trained end-to-end with the RNN-T loss to map input sequences (e.g. audio feature vectors) to target sequences (e.g. phonemes, graphemes). Given an input sequence of real-valued vectors $\mathbf{x} = (x_1, \ldots, x_T)$ of length $T$, the RNN-T model tries to predict the target sequence of labels $\mathbf{y} = (y_1, \ldots, y_U)$ of length $U$.

¹We use “RNN-T architecture” or “RNN-T model” interchangeably in this paper to refer to the neural network architecture described in Eq. (3) and Eq. (4), and “RNN-T loss”, defined in Eq. (5), to refer to the loss used to train this architecture.
Unlike a typical attention-based sequence-to-sequence model, which attends over the entire input for every prediction in the output sequence, the RNN-T model produces a probability distribution over the label space at every time step, where the output label space includes an additional null (blank) label indicating the lack of output for that time step, similar to the Connectionist Temporal Classification (CTC) framework [15]. Unlike CTC, however, this label distribution is also conditioned on the previous label history.
The RNN-T model defines a conditional distribution $P(\mathbf{z}|\mathbf{x})$ over all possible alignments, where $\mathbf{z} = ((z_1, t_1), \ldots, (z_{\bar{U}}, t_{\bar{U}}))$ is a sequence of pairs of length $\bar{U}$, and $(z_i, t_i)$ represents an alignment between output label $z_i$ and the encoded feature at time $t_i$. The labels $z_i$ can optionally be blank labels (null predictions). Removing the blank labels gives the actual output label sequence $\mathbf{y}$, of length $U \le \bar{U}$.
We can marginalize $P(\mathbf{z}|\mathbf{x})$ over all possible alignments $\mathbf{z}$ to obtain the probability of the target label sequence $\mathbf{y}$ given the input sequence $\mathbf{x}$:

$$P(\mathbf{y}|\mathbf{x}) = \sum_{\mathbf{z} \in \mathcal{Z}(\mathbf{y}, T)} P(\mathbf{z}|\mathbf{x}), \quad (1)$$

where $\mathcal{Z}(\mathbf{y}, T)$ is the set of valid alignments of length $T$ for the label sequence $\mathbf{y}$.
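To make the marginalization in Eq. (1) concrete, the following toy sketch (illustrative only, not the paper's implementation) enumerates the set of valid alignments for a short target sequence by brute force; the blank symbol and the vocabulary here are assumptions of the example.

```python
import itertools

BLANK = "_"

def remove_blanks(z):
    # Removing the blank labels from an alignment recovers the label sequence y.
    return tuple(s for s in z if s != BLANK)

def valid_alignments(y, T, vocab):
    # Z(y, T) in Eq. (1): all length-T alignments (one symbol per frame)
    # whose non-blank subsequence equals the target y.
    return [z for z in itertools.product(vocab + (BLANK,), repeat=T)
            if remove_blanks(z) == tuple(y)]

aligns = valid_alignments("ab", T=4, vocab=("a", "b"))
print(len(aligns))  # 6: the ways to place 'a' before 'b' among 4 frames
```

Summing a model's probability over exactly this set of alignments is what Eq. (1) expresses; the forward algorithm described later computes the same sum efficiently.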
2 Transformer Transducer
2.1 RNN-T Architecture and Loss
In this paper, we adopt the monotonic RNN-T loss [16], which performs similarly to the original RNN-T loss [10] but constrains the model to output exactly one label per frame, resulting in a strictly monotonic alignment between audio and labels. For simplicity, in the following paragraphs, we use the term “RNN-T loss” to refer to the monotonic RNN-T loss.
With the monotonicity constraint, the probability of an alignment $\mathbf{z}$ can be factorized as

$$P(\mathbf{z}|\mathbf{x}) = \prod_{i=1}^{\bar{U}} P(z_i \mid \mathbf{x}, t_i, \mathrm{Labels}(z_{1:i-1})), \quad (2)$$

where $\mathrm{Labels}(z_{1:i-1})$ is the sequence of non-blank labels in $z_{1:i-1}$. The RNN-T architecture parameterizes $P(z_i \mid \mathbf{x}, t_i, \mathrm{Labels}(z_{1:i-1}))$
with an audio encoder, a label encoder, and a joint network. The encoders are two neural networks that encode the input sequence and the target output sequence, respectively. Previous work
has employed Long Short-Term Memory models (LSTMs) as the encoders, giving the RNN-T its name. However, this framework is not restricted to RNNs. In this paper, we are particularly interested in replacing the LSTM encoders with Transformers [1, 2]. In the following, we refer to this new architecture as the Transformer Transducer (T-T). As in the original RNN-T model, the joint network combines the audio encoder output at $t_i$ and the label encoder output given the previous non-blank output label sequence $\mathrm{Labels}(z_{1:i-1})$ as follows:

$$\mathrm{Joint} = \mathrm{Linear}(\mathrm{AudioEncoder}(\mathbf{x})_{t_i}) + \mathrm{Linear}(\mathrm{LabelEncoder}(\mathrm{Labels}(z_{1:i-1}))), \quad (3)$$

$$P(z_i \mid \mathbf{x}, t_i, \mathrm{Labels}(z_{1:i-1})) = \mathrm{Softmax}(\mathrm{Linear}(\tanh(\mathrm{Joint}))), \quad (4)$$
where each $\mathrm{Linear}$ function is a different single-layer feed-forward neural network, $\mathrm{AudioEncoder}(\mathbf{x})_{t_i}$ is the audio encoder output at time $t_i$, and $\mathrm{LabelEncoder}(\mathrm{Labels}(z_{1:i-1}))$ is the label encoder output given the previous non-blank label sequence.
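As a rough illustration of Eqs. (3) and (4), the joint network can be sketched in NumPy as below; all dimensions, the vocabulary size, and the initialization are placeholders for this sketch, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
d_audio, d_label, d_joint, vocab = 512, 512, 640, 29   # assumed sizes

W_a = rng.standard_normal((d_audio, d_joint)) * 0.02   # Linear(audio encoder output)
W_l = rng.standard_normal((d_label, d_joint)) * 0.02   # Linear(label encoder output)
W_o = rng.standard_normal((d_joint, vocab)) * 0.02     # output projection

def joint(audio_t, label_u):
    # Eq. (3): additive combination of the two projected encoder outputs.
    h = np.tanh(audio_t @ W_a + label_u @ W_l)
    # Eq. (4): softmax over the label space (including blank).
    logits = h @ W_o
    p = np.exp(logits - logits.max())
    return p / p.sum()

probs = joint(rng.standard_normal(d_audio), rng.standard_normal(d_label))
```

The key property is that the distribution for a given frame depends only on one audio encoder state and one label encoder state, which is what allows the forward algorithm below to share computation across alignments.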
Computing Eq. (1) by naively summing over all valid alignments is intractable. Therefore, we define the forward variable $\alpha(t, u)$ as the sum of probabilities for all paths ending at time-frame $t$ and label position $u$. We then use the forward algorithm [10, 17] to compute the last alpha variable $\alpha(T, U)$, which corresponds to $P(\mathbf{y}|\mathbf{x})$ as defined in Eq. (1). Efficient computation of $\alpha(T, U)$ using the forward algorithm is enabled by the fact that the local probability estimate (Eq. (4)) at any given label position and any given time-frame does not depend on the alignment path. The training loss for the model is then the sum of the negative log probabilities defined in Eq. (1) over all the training examples,

$$\mathcal{L} = -\sum_{i=1}^{N} \log P(\mathbf{y}_i \mid \mathbf{x}_i), \quad (5)$$

where $T_i$ and $U_i$ are the lengths of the input sequence $\mathbf{x}_i$ and the output target label sequence $\mathbf{y}_i$ of the $i$-th training example, respectively.
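The forward recursion described above can be sketched as a small dynamic program. The sketch below assumes the monotonic variant (exactly one symbol, label or blank, per frame) and takes precomputed log-probabilities as inputs; it is illustrative, not the paper's implementation.

```python
import numpy as np

def monotonic_rnnt_neg_log_likelihood(log_p_blank, log_p_label):
    """-log P(y | x) via the forward algorithm.

    log_p_blank[t, u]: log-prob of emitting blank at frame t after u labels.
    log_p_label[t, u]: log-prob of emitting the (u+1)-th target label at frame t.
    """
    T, U1 = log_p_blank.shape
    U = U1 - 1
    alpha = np.full((T + 1, U + 1), -np.inf)
    alpha[0, 0] = 0.0                       # empty prefix before any frame
    for t in range(1, T + 1):
        for u in range(0, min(U, t) + 1):
            stay = alpha[t - 1, u] + log_p_blank[t - 1, u]       # emit blank
            emit = (alpha[t - 1, u - 1] + log_p_label[t - 1, u - 1]
                    if u > 0 else -np.inf)                       # emit label u
            alpha[t, u] = np.logaddexp(stay, emit)
    return -alpha[T, U]                     # per-example term of Eq. (5)
```

For a toy case with T = 2, U = 1 and all local probabilities equal to 0.5, the two valid alignments each have probability 0.25, so the loss is $-\log 0.5 = \log 2$, which the recursion reproduces.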
The Transformer [1] is composed of a stack of multiple identical layers. Each layer has two sub-layers, a multi-headed attention layer and a feed-forward layer. Our multi-headed attention layer first projects the input to queries $Q$, keys $K$, and values $V$ for all the attention heads. The attention mechanism is applied separately for each head, and it provides a flexible way to control the context that the model uses: for example, we can mask the attention scores so that each frame attends only to the left of the current frame, producing output conditioned only on the previous state history. The weight-averaged values for all heads are concatenated and passed to a dense layer. We then employ a residual connection and layer normalization (LayerNorm) [18] on the output of the dense layer to form the final output of the multi-headed attention sub-layer (i.e. $\mathrm{LayerNorm}(x + \mathrm{AttentionLayer}(x))$, where $x$ is the input to the multi-headed attention sub-layer). We also apply dropout to the normalized attention probabilities and to the output of the dense layer to prevent overfitting. Our feed-forward sub-layer has two dense layers, with ReLU as the activation for the first. Again, dropout is applied to both dense layers for regularization, and a residual connection and LayerNorm are applied to the output of the second dense layer (i.e. $\mathrm{LayerNorm}(x + \mathrm{FeedForwardLayer}(x))$, where $x$ is the input to the feed-forward sub-layer). See Figure 2 for more details.
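The multi-headed attention sub-layer described above (scaled dot-product attention per head, concatenation, a dense layer, then residual plus LayerNorm) can be sketched as follows; dropout is omitted and all sizes are illustrative.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def softmax(scores):
    e = np.exp(scores - scores.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def mha_sublayer(x, Wq, Wk, Wv, Wo, n_heads, mask=None):
    T, d = x.shape
    dh = d // n_heads
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    heads = []
    for h in range(n_heads):
        s = slice(h * dh, (h + 1) * dh)
        scores = q[:, s] @ k[:, s].T / np.sqrt(dh)
        if mask is not None:
            # Disallowed positions get a large negative score before softmax.
            scores = np.where(mask, scores, -1e9)
        heads.append(softmax(scores) @ v[:, s])   # weight-averaged values
    out = np.concatenate(heads, axis=-1) @ Wo     # dense layer after concat
    return layer_norm(x + out)                    # residual + LayerNorm

rng = np.random.default_rng(0)
d = 8
x = rng.standard_normal((5, d))
W = [rng.standard_normal((d, d)) * 0.1 for _ in range(4)]
causal = np.tril(np.ones((5, 5), dtype=bool))     # left-context-only mask
y = mha_sublayer(x, *W, n_heads=2, mask=causal)
```

Passing a lower-triangular mask, as in the example, yields exactly the "attend only to the previous state history" behavior discussed in the text.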
Note that the LabelEncoder states do not attend to the AudioEncoder states, in contrast to the encoder-decoder attention in [1]. As discussed in the Introduction, such cross-attention poses a challenge for streaming applications. Instead, we implement the AudioEncoder and LabelEncoder of Eq. (3), which are LSTMs in conventional RNN-T architectures [10, 12, 11], using the Transformers described above. In tandem with the RNN-T architecture described in the previous section, the attention mechanism here only operates within the AudioEncoder or within the LabelEncoder, contrary to the standard practice for Transformer-based systems. In addition, to model sequential order, we use the relative positional encoding proposed in [2]. With relative positional encoding, the positional information only affects the attention scores instead of being added to the inputs. This allows us to reuse previously computed states, rather than recomputing all previous states and taking the last one in an overlapping inference manner, when the number of frames or labels that the AudioEncoder or LabelEncoder has processed is larger than the maximum length used during training (which would again be intractable for streaming applications). More specifically, the complexity of running one-step inference to obtain activations at time $t$ is $O(t)$: the cost of attending to $t$ previous states plus the feed-forward computation for the current step, when using relative positional encoding. With absolute positional encoding, by contrast, the encoding added to the input must be shifted by one whenever $t$ exceeds the maximum length used during training, which precludes reuse of the computed states and makes the per-step complexity $O(t^2)$. However, even with the complexity reduced from $O(t^2)$ to $O(t)$ by relative positional encoding, latency still grows over time. One intuitive solution is to limit the model to attend only to a moving window of states, making the one-step inference complexity constant.
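The moving-window idea can be sketched as a one-step inference routine that keeps a cache of previous states and attends over at most a fixed window of them, so the per-frame cost stays bounded; the projections and window size below are placeholders, and relative positional terms are omitted for brevity.

```python
import numpy as np

def one_step(cache, x_t, Wq, Wk, Wv, window=10):
    # Append the new frame's state, then attend over at most `window`
    # cached states: per-step cost is bounded by the window size.
    cache.append(x_t)
    ctx = np.stack(cache[-window:])
    q = x_t @ Wq
    scores = (ctx @ Wk) @ q / np.sqrt(q.shape[0])
    w = np.exp(scores - scores.max())
    w = w / w.sum()
    return w @ (ctx @ Wv)

rng = np.random.default_rng(1)
d = 4
Wq, Wk, Wv = (rng.standard_normal((d, d)) * 0.1 for _ in range(3))
cache = []
outs = [one_step(cache, rng.standard_normal(d), Wq, Wk, Wv, window=3)
        for _ in range(6)]
```

Because positional information does not enter the cached states in this scheme, old entries never need to be recomputed as the window slides forward.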
Note that training or inference with attention over a limited context is not possible for Transformer-based models that have attention from the decoder to the encoder, since such a setup is itself trying to learn the alignment. In contrast, the separation of the AudioEncoder and LabelEncoder, together with the fact that the alignment is handled by a separate forward-backward process within the RNN-T architecture, makes it possible to train with attention over an explicitly specified, limited context.
Table 1: Transformer encoder layer parameters.
  Input feature/embedding size: 512
  Dense layer 1: 2048
  Dense layer 2: 1024
  Number of attention heads: 4
3 Experiments and Results
3.1 Data

We evaluated the proposed model using the publicly available LibriSpeech ASR corpus [20]. The LibriSpeech dataset consists of 970 hours of audio data with corresponding text transcripts (around 10M word tokens) and an additional 800M-word-token text-only dataset. The paired audio/transcript data was used to train the T-T models and an LSTM-based baseline. The full 810M word tokens of text data were used for standalone language model (LM) training. We extracted 128-channel logmel energy values from a 32 ms window, stacked every 4 frames, and sub-sampled every 3 frames, to produce a 512-dimensional acoustic feature vector with a stride of 30 ms. SpecAugment-style feature augmentation [19] was applied during model training to prevent overfitting and to improve generalization, using only frequency masking and time masking.
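The frame stacking and subsampling described above can be sketched as follows, with shapes assumed from the text (128-dim logmel frames, a stack of 4, subsampling by 3 for a 30 ms output stride); this is an illustrative sketch, not the paper's pipeline code.

```python
import numpy as np

def stack_and_subsample(logmel, stack=4, subsample=3):
    # Concatenate each window of `stack` consecutive frames into one vector,
    # then keep every `subsample`-th stacked frame.
    n = logmel.shape[0] - stack + 1
    stacked = np.stack([logmel[i:i + stack].reshape(-1) for i in range(n)])
    return stacked[::subsample]

feats = stack_and_subsample(np.zeros((100, 128)))
print(feats.shape)  # (33, 512): 512-dim vectors at a 3x coarser frame rate
```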
3.2 Transformer Transducer
Our Transformer Transducer model architecture has 15 audio encoder layers and 2 label encoder layers. Every layer is identical for both the audio and label encoders; the details of each computation layer are shown in Figure 2 and Table 1. We also add dropout on each attention layer and between Transformer layers, to keep the model from overfitting quickly (it can overfit the relatively small training data within a few hours). We train this model to output grapheme units in all our experiments. We found that the Transformer Transducer models trained much faster than an LSTM-based RNN-T model with a similar number of parameters.
We first compared the performance of Transformer Transducer (T-T) models with full attention on audio to an RNN-T model using a bidirectional LSTM audio encoder. As shown in Table 2, the T-T model significantly outperforms the LSTM-based RNN-T baseline. We also observed that T-T models can achieve recognition accuracy competitive with existing wordpiece-based end-to-end models of similar model size. To compare with systems using shallow fusion [15, 21] with separately trained LMs, we also trained a Transformer-based LM with the same architecture as the label encoder used in T-T, using the full 810M word token dataset. This Transformer LM (6 layers; 57M parameters) was evaluated by its perplexity on the dev-clean set; the use of dropout, and of larger models, did not improve either perplexity or WER. Shallow fusion was then performed using that LM with both the trained T-T system and the trained bidirectional LSTM-based RNN-T baseline, with scaling factors on the LM output and on the non-blank symbol sequence length tuned on the LibriSpeech dev sets. The results are shown in Table 2 in the “With LM” column. The shallow fusion result for the T-T system is competitive with corresponding results for top-performing existing systems.
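As a hedged sketch of the decoding-time combination just described, shallow fusion interpolates the transducer's log-probability for a candidate label with a separately trained LM's log-probability, plus a reward on non-blank symbol sequence length; the weight values below are placeholders, not the tuned scaling factors.

```python
def fused_score(log_p_tt, log_p_lm, lm_weight=0.3, length_bonus=0.5):
    # Beam-search score for extending a hypothesis with one non-blank label:
    # transducer score + weighted LM score + length reward.
    return log_p_tt + lm_weight * log_p_lm + length_bonus
```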
Next, we ran training and decoding experiments using T-T models with limited attention windows over audio and text, with a view to building online streaming speech recognition systems with low latency. Similarly to the use of unidirectional RNN audio encoders in online models, where activations for time $t$ are computed conditioned only on audio frames before $t$, here we constrain the AudioEncoder to attend only to the left of the current frame by masking the attention scores to the right of the current frame. In order to make one-step inference for the AudioEncoder tractable (i.e. to have constant time complexity), we further limit its attention to a fixed window of previous states by again masking the attention scores. Due to limited computational resources, we used the same mask for all Transformer layers, but the use of different contexts (masks) for different layers is worth exploring. The results are shown in Table 3, where -1 means that the model uses the entire left or right context, and positive values give the number of states the model attends to, to the left or right of the current frame. As expected, using the entire audio history gives the lowest WER, but a wide limited state history preserves good performance: conditioning the model on the 10 previous states is only around 0.1% absolute worse than full left-context attention.
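The attention masks used in these experiments can be sketched as a boolean matrix, with -1 meaning unlimited context as in the tables; this is an illustrative sketch of the masking scheme, not the paper's code.

```python
import numpy as np

def context_mask(T, left, right):
    # mask[i, j] is True when position i may attend to position j.
    # left/right give the window sizes; -1 means unlimited context.
    i = np.arange(T)[:, None]
    j = np.arange(T)[None, :]
    ok_left = (j >= i - left) if left >= 0 else np.ones((T, T), bool)
    ok_right = (j <= i + right) if right >= 0 else np.ones((T, T), bool)
    return ok_left & ok_right

m = context_mask(5, left=2, right=0)   # streaming: 2 past states, no future
```

Applying such a mask at every layer is all that distinguishes the streaming configurations in Tables 3 through 6 from the full-attention model.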
Similarly, we explored the use of limited right context to allow the model to see some future audio frames, in the hope of bridging the gap between a left-context-only T-T model (left = -1, right = 0) and a full-attention T-T model (left = -1, right = -1). Since we apply the same mask for every layer, the latency introduced by the right context is aggregated over all the layers. For example, as in Figure 3, for a 3-layer Transformer with one frame of right context to produce an output, it actually needs to wait for the frame 3 steps ahead to arrive, which is 90 ms of latency in our case. As Table 4 shows, with a single frame of future context per layer (450 ms of total latency), performance improves by more than 16% relative over the left-attention-only baseline, and with a right context of 6 frames per layer (2.7 s of latency), it is only around 10% relative worse than the full-attention T-T.
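Since the same right-context mask is applied at every layer, the total look-ahead latency is simply the product of the number of layers, the per-layer right context, and the frame stride (30 ms here):

```python
def lookahead_latency_ms(n_layers, right_context, frame_stride_ms=30):
    # Each layer's look-ahead compounds on top of the layers below it.
    return n_layers * right_context * frame_stride_ms

print(lookahead_latency_ms(3, 1))    # 90 ms: the Figure 3 example
print(lookahead_latency_ms(15, 1))   # 450 ms: 15 layers, 1 frame each
print(lookahead_latency_ms(15, 6))   # 2700 ms (~2.7 s)
```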
In addition, we evaluated how the left context used in the T-T LabelEncoder affects performance. In Table 5, we show that constraining each layer to use only one previous label state yields the best accuracy, and even outperforms the baseline, which uses the full label history. We hypothesize that this is because the models overfit more when provided with the full label history. We see a similar trend when limiting left label states while using a full-attention T-T audio encoder. Finally, Table 6 reports the results when using a limited left context of 10 frames (which reduces the time complexity of one-step inference to a constant) together with look-ahead to future frames, as a way of bridging the gap between the performance of left-only-attention and full-attention models.
In this paper, we presented the Transformer Transducer model, embedding Transformer-based self-attention for audio and label encoding within the RNN-T architecture, resulting in an end-to-end model that can be optimized using a loss function that efficiently marginalizes over all possible alignments and that is well suited to time-synchronous decoding. This model is competitive with the best reported standard attention-based models trained using cross-entropy loss, and can easily be used for streaming speech recognition by limiting the audio and label context used in self-attention. Transformer Transducer models train significantly faster than LSTM-based RNN-T models, and they allow us to trade off recognition accuracy and latency in a flexible manner.
-  Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin, “Attention is all you need,” in Advances in neural information processing systems, 2017, pp. 5998–6008.
-  Zihang Dai, Zhilin Yang, Yiming Yang, William W Cohen, Jaime Carbonell, Quoc V Le, and Ruslan Salakhutdinov, “Transformer-xl: Attentive language models beyond a fixed-length context,” arXiv preprint arXiv:1901.02860, 2019.
-  Kazuki Irie, Albert Zeyer, Ralf Schlüter, and Hermann Ney, “Language modeling with deep transformers,” arXiv preprint arXiv:1905.04226, 2019.
-  Linhao Dong, Shuang Xu, and Bo Xu, “Speech-transformer: A no-recurrence sequence-to-sequence model for speech recognition,” in Proc. ICASSP, 2018, pp. 5884–5888.
-  Ngoc-Quan Pham, Thai-Son Nguyen, Jan Niehues, Markus Müller, and Alex Waibel, “Very deep self-attention networks for end-to-end speech recognition,” CoRR, vol. abs/1904.13377, 2019.
-  Qiang Wang, Bei Li, Tong Xiao, Jingbo Zhu, Changliang Li, Derek F. Wong, and Lidia S. Chao, “Learning deep transformer models for machine translation,” CoRR, vol. abs/1906.01787, 2019.
-  Shiyu Zhou, Linhao Dong, Shuang Xu, and Bo Xu, “Syllable-based sequence-to-sequence speech recognition with the transformer in mandarin chinese,” in Proc. Interspeech, 2018, pp. 791–795.
-  Abdelrahman Mohamed, Dmytro Okhonko, and Luke Zettlemoyer, “Transformers with convolutional context for ASR,” CoRR, vol. abs/1904.11660, 2019.
-  Haşim Sak, Andrew Senior, and Francoise Beaufays, “Long Short-Term Memory Recurrent Neural Network Architectures for Large Scale Acoustic Modeling,” in INTERSPEECH 2014, 2014.
-  Alex Graves, “Sequence transduction with recurrent neural networks,” arXiv preprint arXiv:1211.3711, 2012.
-  Kanishka Rao, Haşim Sak, and Rohit Prabhavalkar, “Exploring architectures, data and units for streaming end-to-end speech recognition with RNN-transducer,” in 2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU). IEEE, 2017, pp. 193–199.
-  Yanzhang (Ryan) He, Rohit Prabhavalkar, Kanishka Rao, Wei Li, Anton Bakhtin, and Ian McGraw, “Streaming small-footprint keyword spotting using sequence-to-sequence models,” in Automatic Speech Recognition and Understanding (ASRU), 2017 IEEE Workshop on, 2017.
-  Haşim Sak, Matt Shannon, Kanishka Rao, and Françoise Beaufays, “Recurrent neural aligner: An encoder-decoder neural network model for sequence to sequence mapping,” in Proc. Interspeech 2017, 2017, pp. 1298–1302.
-  Navdeep Jaitly, David Sussillo, Quoc V Le, Oriol Vinyals, Ilya Sutskever, and Samy Bengio, “A neural transducer,” arXiv preprint arXiv:1511.04868, 2015.
-  Alex Graves, Santiago Fernández, Faustino Gomez, and Jürgen Schmidhuber, “Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks,” in Proc. ICML, 2006.
-  Anshuman Tripathi, Han Lu, Hasim Sak, and Hagen Soltau, “Monotonic Recurrent Neural Network Transducer and Decoding Strategies,” in Proc. ASRU, 2019.
-  L. R. Rabiner and B.-H. Juang, Fundamentals of Speech Recognition, PTR Prentice-Hall, Inc., Englewood Cliffs, New Jersey 07632, 1993.
-  Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton, “Layer normalization,” arXiv preprint arXiv:1607.06450, 2016.
-  Daniel S Park, William Chan, Yu Zhang, Chung-Cheng Chiu, Barret Zoph, Ekin D Cubuk, and Quoc V Le, “Specaugment: A simple data augmentation method for automatic speech recognition,” arXiv preprint arXiv:1904.08779, 2019.
-  Vassil Panayotov, Guoguo Chen, Daniel Povey, and Sanjeev Khudanpur, “Librispeech: an asr corpus based on public domain audio books,” in 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2015, pp. 5206–5210.
-  Jan Chorowski and Navdeep Jaitly, “Towards better decoding and language model integration in sequence to sequence models,” in Proc. Interspeech 2017, 2017, pp. 523–527.