Variable Attention Masking for Configurable Transformer Transducer Speech Recognition

11/02/2022
by Pawel Swietojanski, et al.

This work studies the use of attention masking in transformer-transducer-based speech recognition for building a single configurable model for different deployment scenarios. We present a comprehensive set of experiments comparing fixed masking, where the same attention mask is applied at every frame, with chunked masking, where the attention mask for each frame is determined by chunk boundaries, in terms of recognition accuracy and latency. We then explore the use of variable masking, where the attention masks are sampled from a target distribution at training time, to build models that can work in different configurations. Finally, we investigate how a single configurable model can be used to perform both first-pass streaming recognition and second-pass acoustic rescoring. Experiments show that chunked masking achieves a better accuracy vs. latency trade-off compared to fixed masking, both with and without FastEmit. We also show that variable masking improves the accuracy by up to 8% in the acoustic re-scoring scenario.

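The masking schemes described in the abstract map onto simple mask-construction code. The sketch below is an illustrative assumption, not the paper's implementation: the function names, the candidate chunk sizes, and the single-left-chunk default are hypothetical choices made for the example.

import numpy as np

def chunked_attention_mask(num_frames, chunk_size, num_left_chunks=1):
    # mask[i, j] is True when query frame i may attend to key frame j.
    # Frames within the same chunk see each other; a limited number of
    # chunks to the left add context without unbounded look-back.
    chunk_ids = np.arange(num_frames) // chunk_size
    query_chunks = chunk_ids[:, None]
    key_chunks = chunk_ids[None, :]
    return (key_chunks <= query_chunks) & (key_chunks >= query_chunks - num_left_chunks)

def sample_variable_mask(num_frames, chunk_sizes=(8, 16, 32), rng=None):
    # Variable masking (sketch): draw a chunk size per utterance at training
    # time so a single model learns to run under several configurations.
    rng = rng or np.random.default_rng()
    chunk_size = int(rng.choice(chunk_sizes))
    return chunked_attention_mask(num_frames, chunk_size)

# Example: a 12-frame utterance with chunks of 4 frames and one left chunk.
mask = chunked_attention_mask(12, chunk_size=4, num_left_chunks=1)

In a transformer layer, the False entries would typically be turned into a large negative bias on the attention logits before the softmax, so masked frames contribute nothing; fixed masking, by contrast, applies the same per-frame context window everywhere rather than a pattern tied to chunk boundaries.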
