U2++: Unified Two-pass Bidirectional End-to-end Model for Speech Recognition

06/10/2021
by Di Wu, et al.

The unified streaming and non-streaming two-pass (U2) end-to-end model for speech recognition has shown strong performance in streaming capability, accuracy, real-time factor (RTF), and latency. In this paper, we present U2++, an enhanced version of U2 that further improves accuracy. The core idea of U2++ is to use both the forward and the backward information of the label sequences during training to learn richer information, and to combine the forward and backward predictions at decoding to give more accurate recognition results. We also propose a new data augmentation method called SpecSub to make the U2++ model more accurate and robust. Our experiments show that, compared with U2, U2++ converges faster in training, is more robust to the choice of decoding method, and achieves a consistent 5%-8% word error rate reduction. On AISHELL-1, U2++ achieves a 4.63% character error rate (CER) in a non-streaming setup and 5.05% in a streaming setup with 320 ms latency. To the best of our knowledge, 5.05% is the best published streaming result on the AISHELL-1 test set.
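
The idea of combining forward and backward predictions at decoding can be illustrated with a minimal rescoring sketch. The abstract does not give the combination formula, so everything below is an assumption made for illustration: the function names, the linear interpolation between a left-to-right and a right-to-left decoder score, the added first-pass CTC score, and the weights `reverse_weight` and `ctc_weight` are hypothetical and are not taken from the paper's text.

```python
from typing import List, Tuple

# Hypothetical hypothesis record: (text, l2r_score, r2l_score, ctc_score).
# All scores are log-probabilities, so higher (closer to zero) is better.
Hypothesis = Tuple[str, float, float, float]


def combine_scores(
    l2r_score: float,
    r2l_score: float,
    ctc_score: float,
    reverse_weight: float = 0.3,  # assumed weight on the backward decoder
    ctc_weight: float = 0.5,      # assumed weight on the first-pass score
) -> float:
    """Interpolate forward and backward decoder scores, then add the CTC score."""
    attention = (1.0 - reverse_weight) * l2r_score + reverse_weight * r2l_score
    return attention + ctc_weight * ctc_score


def rescore(
    hyps: List[Hypothesis],
    reverse_weight: float = 0.3,
    ctc_weight: float = 0.5,
) -> str:
    """Return the text of the hypothesis with the best combined score."""
    best = max(
        hyps,
        key=lambda h: combine_scores(h[1], h[2], h[3], reverse_weight, ctc_weight),
    )
    return best[0]


# Two candidate transcripts from a hypothetical first pass.
candidates: List[Hypothesis] = [
    ("a", -1.0, -5.0, -2.0),  # favored by the forward decoder
    ("b", -2.0, -1.0, -2.0),  # favored by the backward decoder
]
forward_only = rescore(candidates, reverse_weight=0.0)
bidirectional = rescore(candidates, reverse_weight=0.5)
```

With `reverse_weight=0.0` only the forward decoder influences the ranking; raising it lets the backward decoder overturn a hypothesis that the forward pass preferred, which is the kind of complementary evidence the abstract describes.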

