Multi-turn RNN-T for streaming recognition of multi-party speech

12/19/2021
by   Ilya Sklyar, et al.
0

Automatic speech recognition (ASR) of single channel far-field recordings with an unknown number of speakers is traditionally tackled by cascaded modules. Recent research shows that end-to-end (E2E) multi-speaker ASR models can achieve superior recognition accuracy compared to modular systems. However, these models do not ensure real-time applicability due to their dependency on full audio context. This work takes real-time applicability as the first priority in model design and addresses a few challenges in previous work on multi-speaker recurrent neural network transducer (MS-RNN-T). First, we introduce on-the-fly overlapping speech simulation during training, yielding 14 Second, we propose a novel multi-turn RNN-T (MT-RNN-T) model with an overlap-based target arrangement strategy that generalizes to an arbitrary number of speakers without changes in the model architecture. We investigate the impact of the maximum number of speakers seen during training on MT-RNN-T performance on LibriCSS test set, and report 28 the two-speaker MS-RNN-T. Third, we experiment with a rich transcription strategy for joint recognition and segmentation of multi-party speech. Through an in-depth analysis, we discuss potential pitfalls of the proposed system as well as promising future research directions.

READ FULL TEXT

page 1

page 2

page 3

page 4

research
05/10/2022

Separator-Transducer-Segmenter: Streaming Recognition and Segmentation of Multi-party Speech

Streaming recognition and segmentation of multi-party conversations with...
research
11/23/2020

Streaming Multi-speaker ASR with RNN-T

Recent research shows end-to-end ASR systems can recognize overlapped sp...
research
09/17/2021

Continuous Streaming Multi-Talker ASR with Dual-path Transducers

Streaming recognition of multi-talker conversations has so far been eval...
research
10/30/2021

Real-time Speaker counting in a cocktail party scenario using Attention-guided Convolutional Neural Network

Most current speech technology systems are designed to operate well even...
research
04/06/2021

Understanding Medical Conversations: Rich Transcription, Confidence Scores Information Extraction

In this paper, we describe novel components for extracting clinically re...
research
07/06/2021

A Comparative Study of Modular and Joint Approaches for Speaker-Attributed ASR on Monaural Long-Form Audio

Speaker-attributed automatic speech recognition (SA-ASR) is a task to re...
research
05/05/2022

Speaker Recognition in the Wild

In this paper, we propose a pipeline to find the number of speakers, as ...

Please sign up or login with your details

Forgot password? Click here to reset