Semi-Autoregressive Streaming ASR With Label Context

09/19/2023
by Siddhant Arora, et al.

Non-autoregressive (NAR) modeling has gained significant interest in speech processing since these models achieve dramatically lower inference time than autoregressive (AR) models while also achieving good transcription accuracy. Since NAR automatic speech recognition (ASR) models must wait for the completion of the entire utterance before processing, some works explore streaming NAR models based on blockwise attention for low-latency applications. However, streaming NAR models significantly lag in accuracy compared to streaming AR and non-streaming NAR models. To address this, we propose a streaming "semi-autoregressive" ASR model that incorporates the labels emitted in previous blocks as additional context using a Language Model (LM) subnetwork. We also introduce a novel greedy decoding algorithm that addresses insertion and deletion errors near block boundaries while not significantly increasing the inference time. Experiments show that our method outperforms the existing streaming NAR model by 19% relative on Tedlium2, 16%/8% on Librispeech-100 clean/other test sets, and 19%/8% on Switchboard (SWB) / Callhome (CH) test sets. It also reduces the accuracy gap with streaming AR and non-streaming NAR models while achieving 2.5x lower latency. We also demonstrate that our approach can effectively utilize external text data to pre-train the LM subnetwork to further improve streaming ASR accuracy.
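
To make the "semi-autoregressive" idea concrete, the following is a minimal toy sketch, not the authors' implementation. It assumes a CTC-style output with blank index 0, and a hypothetical `score_block` callable standing in for the encoder plus LM subnetwork. All names here (`score_block`, `ctc_collapse`, `semi_ar_greedy_decode`) are illustrative assumptions. Frames inside a block are labeled in one shot (the NAR part), while each block is conditioned on the labels emitted by earlier blocks (the AR part). The boundary handling shown is only the standard CTC repeat-merge carried across blocks, not the paper's decoding algorithm.

```python
from typing import Callable, List, Sequence, Tuple

BLANK = 0  # assumed CTC blank index (an assumption, not from the paper)

# Hypothetical scorer: (frames of one block, labels emitted so far) -> one label per frame.
ScoreFn = Callable[[Sequence[int], Tuple[int, ...]], List[int]]


def ctc_collapse(frame_labels: Sequence[int], prev_label: int) -> Tuple[List[int], int]:
    """Greedy CTC collapse: drop blanks and merge repeated frame labels.

    prev_label is the last frame label of the previous block, so a label that
    straddles a block boundary is not emitted twice.
    """
    out: List[int] = []
    for lab in frame_labels:
        if lab != prev_label and lab != BLANK:
            out.append(lab)
        prev_label = lab
    return out, prev_label


def semi_ar_greedy_decode(frames: Sequence[int], block_size: int, score_block: ScoreFn) -> List[int]:
    """Decode block by block, feeding previously emitted labels back as context."""
    emitted: List[int] = []
    prev_label = BLANK
    for start in range(0, len(frames), block_size):
        block = frames[start:start + block_size]
        # All frames of the block are labeled at once, given the label context.
        frame_labels = score_block(block, tuple(emitted))
        new_labels, prev_label = ctc_collapse(frame_labels, prev_label)
        emitted.extend(new_labels)
    return emitted


def toy_score_block(block: Sequence[int], context: Tuple[int, ...]) -> List[int]:
    """Stand-in scorer: treats each frame as its own best label and ignores context."""
    return list(block)


if __name__ == "__main__":
    # Label 2 spans the boundary between the second and third blocks.
    print(semi_ar_greedy_decode([0, 1, 1, 0, 2, 2, 2, 0, 3], 3, toy_score_block))
```

In a real system the scorer would be a neural model, and the context would pass through the LM subnetwork rather than being handed over as a raw tuple. The sketch only shows the control flow of blockwise decoding with label context.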

research
07/20/2021

Streaming End-to-End ASR based on Blockwise Non-Autoregressive Models

Non-autoregressive (NAR) modeling has gained more and more attention in ...
research
10/12/2020

Universal ASR: Unify and Improve Streaming ASR with Full-context Modeling

Streaming automatic speech recognition (ASR) aims to emit each hypothesi...
research
04/15/2022

Streaming Align-Refine for Non-autoregressive Deliberation

We propose a streaming non-autoregressive (non-AR) decoding algorithm to...
research
05/25/2022

Improving CTC-based ASR Models with Gated Interlayer Collaboration

For Automatic Speech Recognition (ASR), the CTC-based methods have becom...
research
07/14/2022

Scene Text Recognition with Permuted Autoregressive Sequence Models

Context-aware STR methods typically use internal autoregressive (AR) lan...
research
03/12/2020

Hybrid Autoregressive Transducer (HAT)

This paper proposes and evaluates the hybrid autoregressive transducer (...
research
10/20/2021

An Investigation of Enhancing CTC Model for Triggered Attention-based Streaming ASR

In the present paper, an attempt is made to combine Mask-CTC and the tri...
