Say Goodbye to RNN-T Loss: A Novel CIF-based Transducer Architecture for Automatic Speech Recognition

07/26/2023
by   Tian-Hao Zhang, et al.
0

RNN-T models are widely used in ASR, which rely on the RNN-T loss to achieve length alignment between input audio and target sequence. However, the implementation complexity and the alignment-based optimization target of RNN-T loss lead to computational redundancy and a reduced role for predictor network, respectively. In this paper, we propose a novel model named CIF-Transducer (CIF-T) which incorporates the Continuous Integrate-and-Fire (CIF) mechanism with the RNN-T model to achieve efficient alignment. In this way, the RNN-T loss is abandoned, thus bringing a computational reduction and allowing the predictor network a more significant role. We also introduce Funnel-CIF, Context Blocks, Unified Gating and Bilinear Pooling joint network, and auxiliary training strategy to further improve performance. Experiments on the 178-hour AISHELL-1 and 10000-hour WenetSpeech datasets show that CIF-T achieves state-of-the-art results with lower computational overhead compared to RNN-T models.

READ FULL TEXT

page 1

page 2

page 3

page 4

research
11/03/2020

Improving RNN transducer with normalized jointer network

Recurrent neural transducer (RNN-T) is a promising end-to-end (E2E) mode...
research
11/05/2020

Alignment Restricted Streaming Recurrent Neural Network Transducer

There is a growing interest in the speech community in developing Recurr...
research
11/04/2022

Multi-blank Transducers for Speech Recognition

This paper proposes a modification to RNN-Transducer (RNN-T) models for ...
research
04/19/2022

An Investigation of Monotonic Transducers for Large-Scale Automatic Speech Recognition

The two most popular loss functions for streaming end-to-end automatic s...
research
10/08/2021

Input Length Matters: An Empirical Study Of RNN-T And MWER Training For Long-form Telephony Speech Recognition

End-to-end models have achieved state-of-the-art results on several auto...
research
11/01/2021

Sequence Transduction with Graph-based Supervision

The recurrent neural network transducer (RNN-T) objective plays a major ...
research
08/16/2022

Uconv-Conformer: High Reduction of Input Sequence Length for End-to-End Speech Recognition

Optimization of modern ASR architectures is among the highest priority t...

Please sign up or login with your details

Forgot password? Click here to reset