Endpoint Detection for Streaming End-to-End Multi-talker ASR

01/24/2022
by   Liang Lu, et al.
0

Streaming end-to-end multi-talker speech recognition aims at transcribing the overlapped speech from conversations or meetings with an all-neural model in a streaming fashion, which is fundamentally different from a modular-based approach that usually cascades the speech separation and the speech recognition models trained independently. Previously, we proposed the Streaming Unmixing and Recognition Transducer (SURT) model based on recurrent neural network transducer (RNN-T) for this problem and presented promising results. However, for real applications, the speech recognition system is also required to determine the timestamp when a speaker finishes speaking for prompt system response. This problem, known as endpoint (EP) detection, has not been studied previously for multi-talker end-to-end models. In this work, we address the EP detection problem in the SURT framework by introducing an end-of-sentence token as an output unit, following the practice of single-talker end-to-end models. Furthermore, we also present a latency penalty approach that can significantly cut down the EP detection latency. Our experimental results based on the 2-speaker LibrispeechMix dataset show that the SURT model can achieve promising EP detection without significantly degradation of the recognition accuracy.

READ FULL TEXT

page 1

page 2

page 3

page 4

research
04/05/2021

Streaming Multi-talker Speech Recognition with Joint Speaker Identification

In multi-talker scenarios such as meetings and conversations, speech pro...
research
05/10/2022

Separator-Transducer-Segmenter: Streaming Recognition and Segmentation of Multi-party Speech

Streaming recognition and segmentation of multi-party conversations with...
research
11/23/2020

Streaming Multi-speaker ASR with RNN-T

Recent research shows end-to-end ASR systems can recognize overlapped sp...
research
11/26/2020

Streaming end-to-end multi-talker speech recognition

End-to-end multi-talker speech recognition is an emerging research trend...
research
05/04/2021

Streaming end-to-end speech recognition with jointly trained neural feature enhancement

In this paper, we present a streaming end-to-end speech recognition mode...
research
11/15/2018

Streaming End-to-end Speech Recognition For Mobile Devices

End-to-end (E2E) models, which directly predict output character sequenc...
research
09/09/2022

Streaming Target-Speaker ASR with Neural Transducer

Although recent advances in deep learning technology have boosted automa...

Please sign up or login with your details

Forgot password? Click here to reset