Streaming End-to-End ASR based on Blockwise Non-Autoregressive Models

07/20/2021
by Tianzi Wang, et al.

Non-autoregressive (NAR) modeling has attracted increasing attention in speech processing. With recent state-of-the-art attention-based automatic speech recognition (ASR) architectures, NAR models can achieve a promising real-time factor (RTF) improvement with only a small degradation in accuracy compared to autoregressive (AR) models. However, inference must wait until the full speech utterance is complete, which limits the use of these models in low-latency scenarios. To address this issue, we propose a novel end-to-end streaming NAR speech recognition system that combines blockwise attention with connectionist temporal classification with mask-predict (Mask-CTC) NAR. During inference, the input audio is split into small blocks, which are then processed in a blockwise streaming fashion. To address the insertion and deletion errors at the edges of each block's output, we apply an overlapping decoding strategy with a dynamic mapping trick that can produce more coherent sentences. Experimental results show that the proposed method improves online ASR recognition in low-latency conditions compared to vanilla Mask-CTC. Moreover, it achieves a much faster inference speed than AR attention-based models. All of our code will be publicly available at https://github.com/espnet/espnet.
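
To make the blockwise idea concrete, the following is a minimal sketch of how a sequence of input frames might be cut into overlapping blocks before each block is decoded. It is an illustration under stated assumptions, not the authors' implementation: the function name `split_into_overlapping_blocks`, the parameters `block_size` and `overlap`, and the example sizes are all hypothetical, and the sketch covers only the block segmentation. The dynamic mapping step that merges the overlapping outputs is not shown, since the abstract does not describe it in detail.

```python
from typing import List, Tuple


def split_into_overlapping_blocks(
    num_frames: int, block_size: int, overlap: int
) -> List[Tuple[int, int]]:
    """Return (start, end) frame index pairs, with end exclusive.

    Hypothetical helper for illustration only: consecutive blocks share
    `overlap` frames, so tokens near a block edge are seen by two blocks.
    """
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    if not 0 <= overlap < block_size:
        raise ValueError("overlap must satisfy 0 <= overlap < block_size")

    hop = block_size - overlap  # frames by which the window advances
    blocks: List[Tuple[int, int]] = []
    start = 0
    while start < num_frames:
        end = min(start + block_size, num_frames)
        blocks.append((start, end))
        if end == num_frames:
            break  # the last block reached the end of the utterance
        start += hop
    return blocks


if __name__ == "__main__":
    # Assumed sizes, chosen only to keep the example small.
    for start, end in split_into_overlapping_blocks(10, block_size=4, overlap=2):
        print(f"block covers frames [{start}, {end})")
```

In a streaming setting each block would be passed to the recognizer as soon as its last frame arrives, so latency is bounded by the block length rather than by the utterance length.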

research
09/19/2023

Semi-Autoregressive Streaming ASR With Label Context

Non-autoregressive (NAR) modeling has gained significant interest in spe...
research
05/11/2020

Listen Attentively, and Spell Once: Whole Sentence Generation via a Non-Autoregressive Architecture for Low-Latency Speech Recognition

Although attention based end-to-end models have achieved promising perfo...
research
10/11/2021

A Comparative Study on Non-Autoregressive Modelings for Speech-to-Text Generation

Non-autoregressive (NAR) models simultaneously generate multiple outputs...
research
10/20/2021

An Investigation of Enhancing CTC Model for Triggered Attention-based Streaming ASR

In the present paper, an attempt is made to combine Mask-CTC and the tri...
research
06/16/2021

Multi-Speaker ASR Combining Non-Autoregressive Conformer CTC and Conditional Speaker Chain

Non-autoregressive (NAR) models have achieved a large inference computat...
research
12/21/2022

4D ASR: Joint modeling of CTC, Attention, Transducer, and Mask-Predict decoders

The network architecture of end-to-end (E2E) automatic speech recognitio...
research
07/15/2022

Knowledge Transfer and Distillation from Autoregressive to Non-Autoregressive Speech Recognition

Modern non-autoregressive (NAR) speech recognition systems aim to accele...
