VAD-free Streaming Hybrid CTC/Attention ASR for Unsegmented Recording

07/15/2021
by   Hirofumi Inaguma, et al.
0

In this work, we propose novel decoding algorithms to enable streaming automatic speech recognition (ASR) on unsegmented long-form recordings without voice activity detection (VAD), based on monotonic chunkwise attention (MoChA) with an auxiliary connectionist temporal classification (CTC) objective. We propose a block-synchronous beam search decoding to take advantage of efficient batched output-synchronous and low-latency input-synchronous searches. We also propose a VAD-free inference algorithm that leverages CTC probabilities to determine a suitable timing to reset the model states to tackle the vulnerability to long-form data. Experimental evaluations demonstrate that the block-synchronous decoding achieves comparable accuracy to the label-synchronous one. Moreover, the VAD-free inference can recognize long-form speech robustly for up to a few hours.

READ FULL TEXT

page 1

page 2

page 3

page 4

research
01/25/2022

Run-and-back stitch search: novel block synchronous decoding for streaming encoder-decoder ASR

A streaming style inference of encoder-decoder automatic speech recognit...
research
10/26/2022

Monotonic segmental attention for automatic speech recognition

We introduce a novel segmental-attention model for automatic speech reco...
research
05/19/2020

Enhancing Monotonic Multihead Attention for Streaming ASR

We investigate a monotonic multihead attention (MMA) by extending hard m...
research
07/24/2023

Integration of Frame- and Label-synchronous Beam Search for Streaming Encoder-decoder Speech Recognition

Although frame-based models, such as CTC and transducers, have an affini...
research
05/10/2020

CTC-synchronous Training for Monotonic Attention Model

Monotonic chunkwise attention (MoChA) has been studied for the online st...
research
05/20/2020

A Comparison of Label-Synchronous and Frame-Synchronous End-to-End Models for Speech Recognition

End-to-end models are gaining wider attention in the field of automatic ...
research
04/13/2021

Equivalence of Segmental and Neural Transducer Modeling: A Proof of Concept

With the advent of direct models in automatic speech recognition (ASR), ...

Please sign up or login with your details

Forgot password? Click here to reset