On the Prediction Network Architecture in RNN-T for ASR

06/29/2022
by   Dario Albesano, et al.

RNN-T models have gained popularity in the literature and in commercial systems because they are competitive and can operate in online streaming mode. In this work, we conduct an extensive study comparing several prediction network architectures for both monotonic and original RNN-T models. We compare four types of prediction networks built on a common state-of-the-art Conformer encoder and report results on LibriSpeech and an internal medical conversation data set. Our study covers both offline batch-mode and online streaming scenarios. In contrast to some previous works, our results show that a Transformer does not always outperform an LSTM when used as the prediction network alongside a Conformer encoder. Inspired by our scoreboard, we propose a new, simple prediction network architecture, N-Concat, that outperforms the others in our online streaming benchmark. Transformer and n-gram reduced architectures perform very similarly, yet they differ in important ways in how they use the previous context. Overall, we obtained up to 4.1% relative WER improvement compared to our LSTM baseline, while reducing prediction network parameters by nearly an order of magnitude (8.4 times).
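The abstract does not define N-Concat or the n-gram reduced architectures, so the following is only an illustrative sketch of what a limited-context prediction network might look like, assuming (from the name alone) that the embeddings of the last n labels are concatenated. The class `NGramConcatPredictor`, its parameters, and the random untrained embeddings are all hypothetical and are not the authors' implementation.

```python
import random


class NGramConcatPredictor:
    """Hypothetical limited-context prediction network (an assumption, not the
    paper's N-Concat): the output depends only on the last n labels."""

    def __init__(self, vocab_size, embed_dim, n, pad_id=0, seed=0):
        if n < 1:
            raise ValueError("n must be at least 1")
        rng = random.Random(seed)
        self.n = n
        self.pad_id = pad_id
        self.embed_dim = embed_dim
        # One embedding vector per label (randomly initialised, untrained).
        self.embedding = [
            [rng.uniform(-0.1, 0.1) for _ in range(embed_dim)]
            for _ in range(vocab_size)
        ]

    def __call__(self, history):
        # Keep only the last n labels; left-pad short histories with pad_id.
        context = list(history)[-self.n:]
        context = [self.pad_id] * (self.n - len(context)) + context
        # Concatenate the n embeddings into one vector of size n * embed_dim.
        out = []
        for label in context:
            out.extend(self.embedding[label])
        return out


predictor = NGramConcatPredictor(vocab_size=10, embed_dim=4, n=2)
a = predictor([3, 5, 7])
b = predictor([9, 5, 7])
# a and b share the last two labels (5, 7), so the outputs are identical.
```

Unlike an LSTM, whose state can in principle carry information from the whole label history, a network of this kind sees a fixed window of labels and needs no recurrent weights, which is one plausible reading of how the parameter count can shrink.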


research
05/28/2020

On the Comparison of Popular End-to-End Models for Large Scale Speech Recognition

Recently, there has been a strong push to transition from hybrid models ...
research
09/29/2022

ConvRNN-T: Convolutional Augmented Recurrent Neural Network Transducers for Streaming Speech Recognition

The recurrent neural network transducer (RNN-T) is a prominent streaming...
research
11/03/2020

Improving RNN transducer with normalized jointer network

Recurrent neural transducer (RNN-T) is a promising end-to-end (E2E) mode...
research
06/26/2022

On Comparison of Encoders for Attention based End to End Speech Recognition in Standalone and Rescoring Mode

The streaming automatic speech recognition (ASR) models are more popular...
research
11/09/2020

Benchmarking LF-MMI, CTC and RNN-T Criteria for Streaming ASR

In this work, to measure the accuracy and efficiency for a latency-contr...
research
01/25/2022

Improving the fusion of acoustic and text representations in RNN-T

The recurrent neural network transducer (RNN-T) has recently become the ...
research
04/06/2021

Understanding Medical Conversations: Rich Transcription, Confidence Scores Information Extraction

In this paper, we describe novel components for extracting clinically re...
