Improving the fusion of acoustic and text representations in RNN-T

01/25/2022
by   Chao Zhang, et al.

The recurrent neural network transducer (RNN-T) has recently become the mainstream end-to-end approach for streaming automatic speech recognition (ASR). To estimate the output distributions over subword units, RNN-T uses a fully connected layer as the joint network, which fuses the acoustic representations extracted by the acoustic encoder with the text representations produced by the prediction network from the previous subword units. In this paper, we propose to use gating, bilinear pooling, and a combination of the two in the joint network to produce more expressive representations to feed into the output layer. A regularisation method is also proposed to enable better acoustic encoder training by reducing the gradients back-propagated into the prediction network at the beginning of RNN-T training. Experimental results on a multilingual ASR setting for voice search over nine languages show that the joint use of the proposed methods can result in 4% to 5% relative word error rate reductions with only a few million extra parameters.
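To make the three fusion styles named in the abstract concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not the paper's formulation: the function names, the toy dimensions, the random weights, the omission of bias terms, and the specific gating and low-rank bilinear forms are all generic textbook choices rather than equations taken from the full text.

```python
import math
import random


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def rand_matrix(rows, cols, rng):
    """Small random weights, for illustration only."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def additive_joint(f, g, W_a, W_t):
    """Baseline: one fully connected layer over both inputs.

    A linear layer on the concatenation [f; g] equals W_a f + W_t g.
    """
    return [math.tanh(a + t) for a, t in zip(matvec(W_a, f), matvec(W_t, g))]


def gated_joint(f, g, W_a, W_t, U_a, U_t):
    """Assumed gating form: a sigmoid gate mixes the two projections."""
    a = [math.tanh(v) for v in matvec(W_a, f)]
    t = [math.tanh(v) for v in matvec(W_t, g)]
    gate = [sigmoid(x + y) for x, y in zip(matvec(U_a, f), matvec(U_t, g))]
    return [k * x + (1.0 - k) * y for k, x, y in zip(gate, a, t)]


def bilinear_joint(f, g, W_a, W_t):
    """Assumed low-rank bilinear pooling: element-wise product of projections."""
    a = [math.tanh(v) for v in matvec(W_a, f)]
    t = [math.tanh(v) for v in matvec(W_t, g)]
    return [x * y for x, y in zip(a, t)]


# Hypothetical toy sizes: acoustic dim 3, text dim 2, joint dim 4.
rng = random.Random(0)
f = [0.5, -1.0, 0.25]   # acoustic encoder output for one frame
g = [1.0, 0.3]          # prediction network output for one label history
W_a, U_a = rand_matrix(4, 3, rng), rand_matrix(4, 3, rng)
W_t, U_t = rand_matrix(4, 2, rng), rand_matrix(4, 2, rng)

h_add = additive_joint(f, g, W_a, W_t)
h_gate = gated_joint(f, g, W_a, W_t, U_a, U_t)
h_bil = bilinear_joint(f, g, W_a, W_t)
```

In a full RNN-T, each of these fused vectors would be passed through an output layer and a softmax over the subword units plus blank, for every pair of acoustic frame and label position. The additive form can only sum the two sources, whereas the gated and bilinear forms let one representation rescale the other multiplicatively, which is the kind of extra expressiveness the abstract refers to.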
