Streaming Chunk-Aware Multihead Attention for Online End-to-End Speech Recognition

05/21/2020
by   Shiliang Zhang, et al.
0

Recently, streaming end-to-end automatic speech recognition (E2E-ASR) has gained more and more attention. Many efforts have been paid to turn the non-streaming attention-based E2E-ASR system into streaming architecture. In this work, we propose a novel online E2E-ASR system by using Streaming Chunk-Aware Multihead Attention(SCAMA) and a latency control memory equipped self-attention network (LC-SAN-M). LC-SAN-M uses chunk-level input to control the latency of encoder. As to SCAMA, a jointly trained predictor is used to control the output of encoder when feeding to decoder, which enables decoder to generate output in streaming manner. Experimental results on the open 170-hour AISHELL-1 and an industrial-level 20000-hour Mandarin speech recognition tasks show that our approach can significantly outperform the MoChA-based baseline system under comparable setup. On the AISHELL-1 task, our proposed method achieves a character error rate (CER) of 7.39 which is the best published performance for online ASR.

READ FULL TEXT
research
01/08/2020

Streaming automatic speech recognition with the transformer model

Encoder-decoder based sequence-to-sequence models have demonstrated stat...
research
10/27/2020

Universal ASR: Unifying Streaming and Non-Streaming ASR Using a Single Encoder-Decoder Model

Recently, online end-to-end ASR has gained increasing attention. However...
research
02/18/2019

Self-Attention Aligner: A Latency-Control End-to-End Model for ASR Using Self-Attention Network and Chunk-Hopping

Self-attention network, an attention-based feedforward neural network, h...
research
08/12/2020

Online Automatic Speech Recognition with Listen, Attend and Spell Model

The Listen, Attend and Spell (LAS) model and other attention-based autom...
research
04/09/2019

Performance Monitoring for End-to-End Speech Recognition

Measuring performance of an automatic speech recognition (ASR) system wi...
research
06/18/2023

SURT 2.0: Advances in Transducer-based Multi-talker Speech Recognition

The Streaming Unmixing and Recognition Transducer (SURT) model was propo...
research
11/03/2022

Streaming Audio-Visual Speech Recognition with Alignment Regularization

Recognizing a word shortly after it is spoken is an important requiremen...

Please sign up or login with your details

Forgot password? Click here to reset