Accelerating Inference and Language Model Fusion of Recurrent Neural Network Transducers via End-to-End 4-bit Quantization

06/16/2022
by Andrea Fasoli, et al.

We report on aggressive quantization strategies that greatly accelerate inference of Recurrent Neural Network Transducers (RNN-T). We use a 4-bit integer representation for both weights and activations and apply Quantization Aware Training (QAT) to retrain the full model (acoustic encoder and language model), achieving near-iso-accuracy. We show that customized quantization schemes, tailored to the local properties of the network, are essential to achieving good performance while limiting the computational overhead of QAT. Density-ratio language model fusion has shown remarkable accuracy gains on RNN-T workloads, but it severely increases the computational cost of inference. We show that our quantization strategies enable large beam widths for hypothesis search while achieving streaming-compatible runtimes and a full-model compression ratio of 7.6× compared to the full-precision model. Via hardware simulations, we estimate a 3.4× acceleration from FP16 to INT4 for the end-to-end quantized RNN-T, inclusive of LM fusion, resulting in a Real Time Factor (RTF) of 0.06. On the NIST Hub5 2000, Hub5 2001, and RT-03 test sets, we retain most of the gains associated with LM fusion, improving the average WER by >1.5
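
To make the two core mechanisms concrete, the sketch below is a minimal, hypothetical illustration and not code from the paper: a per-tensor symmetric 4-bit fake quantizer with a straight-through estimator, of the kind typically used in QAT, and a density-ratio fusion score that adds an external LM and subtracts a source-domain LM from the RNN-T log-probability. The function names and the λ weights are illustrative assumptions.

```python
import torch

def fake_quantize(x: torch.Tensor, num_bits: int = 4) -> torch.Tensor:
    """Uniform symmetric fake quantization with a per-tensor scale.

    The forward pass sees 4-bit-quantized values; the backward pass uses a
    straight-through estimator so the full-precision parameters keep training.
    """
    qmax = 2 ** (num_bits - 1) - 1                        # 7 for signed INT4
    scale = x.detach().abs().max().clamp(min=1e-8) / qmax # per-tensor scale
    x_q = torch.clamp(torch.round(x / scale), -qmax - 1, qmax) * scale
    return x + (x_q - x).detach()                         # straight-through estimator


def density_ratio_score(log_p_rnnt: float,
                        log_p_ext_lm: float,
                        log_p_src_lm: float,
                        lam_ext: float = 0.6,             # illustrative weights
                        lam_src: float = 0.4) -> float:
    """Hypothesis score under density-ratio LM fusion: add the external LM
    score and subtract the source-domain LM estimate."""
    return log_p_rnnt + lam_ext * log_p_ext_lm - lam_src * log_p_src_lm
```

In such a setup, fake_quantize would wrap the weights and activations of each quantized layer during QAT, while density_ratio_score would rank hypotheses at each beam-search expansion step.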

Related research

10/26/2020 · Improved Neural Language Model Fusion for Streaming Recurrent Neural Network Transducer
Recurrent Neural Network Transducer (RNN-T), like most end-to-end speech...

08/27/2021 · 4-bit Quantization of LSTM-based Speech Recognition Models
We investigate the impact of aggressive low-precision representations of...

02/26/2020 · Quantized Neural Network Inference with Precision Batching
We present PrecisionBatching, a quantized inference algorithm for speedi...

02/01/2018 · Alternating Multi-bit Quantization for Recurrent Neural Networks
Recurrent neural networks have achieved excellent performance in many ap...

06/30/2022 · Sub-8-Bit Quantization Aware Training for 8-Bit Neural Network Accelerator with On-Device Speech Recognition
We present a novel sub-8-bit quantization-aware training (S8BQAT) scheme...

04/09/2021 · Language model fusion for streaming end to end speech recognition
Streaming processing of speech audio is required for many contemporary p...
