Cascade RNN-Transducer: Syllable Based Streaming On-device Mandarin Speech Recognition with a Syllable-to-Character Converter

11/17/2020
by Xiong Wang, et al.

End-to-end models are favored in automatic speech recognition (ASR) because of their simplified system structure and superior performance. Among these models, the recurrent neural network transducer (RNN-T) has achieved significant progress in streaming on-device speech recognition because of its high accuracy and low latency. RNN-T adopts a prediction network to enhance language information, but its language modeling ability is limited because it still needs paired speech-text data to train. Further strengthening the language modeling ability through extra text data, such as shallow fusion with an external language model, brings only a small performance gain. Because Mandarin Chinese is a character-based language and each character is pronounced as a tonal syllable, this paper proposes a novel cascade RNN-T approach to improve the language modeling ability of RNN-T. Our approach first uses an RNN-T to transform acoustic features into a syllable sequence, and then converts the syllable sequence into a character sequence through an RNN-T-based syllable-to-character converter. In this way, a rich text repository can easily be used to strengthen the language modeling ability. By introducing several important tricks, the cascade RNN-T approach surpasses the character-based RNN-T by a large margin on several Mandarin test sets, with much higher recognition quality and similar latency.
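The two-stage structure described in the abstract can be sketched roughly as follows. This is only an illustration of the data flow, not the authors' implementation: the toy lexicon, the function names (`text_to_pairs`, `build_lookup_converter`, `cascade`) and the dictionary-lookup converter are all hypothetical stand-ins. In particular, the lookup table replaces the RNN-T-based syllable-to-character converter, and the way training pairs are derived from text here is an assumption made for the example.

```python
from typing import Callable, Dict, List, Sequence, Tuple

# Hypothetical toy lexicon mapping a character to one tonal syllable.
# A real lexicon is far larger and must handle polyphonic characters.
TOY_LEXICON: Dict[str, str] = {
    "你": "ni3", "好": "hao3", "世": "shi4", "界": "jie4",
}

Pair = Tuple[List[str], List[str]]


def text_to_pairs(sentences: Sequence[str], lexicon: Dict[str, str]) -> List[Pair]:
    """Derive (syllable sequence, character sequence) pairs from text alone.

    No audio is needed, which is why text-only data can feed the second stage.
    Sentences containing characters outside the lexicon are skipped.
    """
    pairs: List[Pair] = []
    for sentence in sentences:
        if all(ch in lexicon for ch in sentence):
            pairs.append(([lexicon[ch] for ch in sentence], list(sentence)))
    return pairs


def build_lookup_converter(pairs: Sequence[Pair]) -> Callable[[Sequence[str]], List[str]]:
    """Stand-in for the syllable-to-character converter (assumption).

    It maps each syllable to the first character seen for it and ignores
    context, so it cannot resolve homophones the way a trained model would.
    """
    table: Dict[str, str] = {}
    for syllables, chars in pairs:
        for syl, ch in zip(syllables, chars):
            table.setdefault(syl, ch)
    return lambda syllables: [table[s] for s in syllables]


def cascade(
    features: Sequence[float],
    acoustic_to_syllables: Callable[[Sequence[float]], List[str]],
    syllables_to_characters: Callable[[Sequence[str]], List[str]],
) -> str:
    """Run stage 1 (acoustics to syllables), then stage 2 (syllables to characters)."""
    syllables = acoustic_to_syllables(features)
    return "".join(syllables_to_characters(syllables))


if __name__ == "__main__":
    pairs = text_to_pairs(["你好", "世界"], TOY_LEXICON)
    converter = build_lookup_converter(pairs)
    # Stub acoustic model: always emits the same syllables, for illustration only.
    stub_acoustic_model = lambda feats: ["ni3", "hao3"]
    print(cascade([0.0, 0.1], stub_acoustic_model, converter))
```

The point of the split is that only the first stage requires paired speech-text data, while the second stage can be trained on pairs generated from text.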


research
10/26/2020

Improved Neural Language Model Fusion for Streaming Recurrent Neural Network Transducer

Recurrent Neural Network Transducer (RNN-T), like most end-to-end speech...
research
09/19/2019

A Random Gossip BMUF Process for Neural Language Modeling

LSTM language model is an essential component of industrial ASR systems....
research
08/09/2018

Character-Level Language Modeling with Deeper Self-Attention

LSTMs and other RNN variants have shown strong performance on character-...
research
08/06/2019

Two-stage Training for Chinese Dialect Recognition

In this paper, we present a two-stage language identification (LID) syst...
research
07/11/2023

Improving RNN-Transducers with Acoustic LookAhead

RNN-Transducers (RNN-Ts) have gained widespread acceptance as an end-to-...
research
07/29/2022

Pronunciation-aware unique character encoding for RNN Transducer-based Mandarin speech recognition

For Mandarin end-to-end (E2E) automatic speech recognition (ASR) tasks, ...
research
08/18/2015

End-to-End Attention-based Large Vocabulary Speech Recognition

Many of the current state-of-the-art Large Vocabulary Continuous Speech ...
