Continuous Streaming Multi-Talker ASR with Dual-path Transducers

09/17/2021
by Desh Raj, et al.

Streaming recognition of multi-talker conversations has so far been evaluated only for two-speaker, single-turn sessions. In this paper, we investigate it for multi-turn meetings containing multiple speakers using the Streaming Unmixing and Recognition Transducer (SURT) model, and show that naively extending the single-turn model to this harder setting incurs a performance penalty. As a solution, we propose the dual-path (DP) modeling strategy first used for time-domain speech separation. We experiment with LSTM- and Transformer-based DP models, and show that they reduce word error rate (WER) while converging faster. We also explore training strategies such as chunk width randomization and curriculum learning for these models, and demonstrate their importance through ablation studies. Finally, we evaluate our models on the LibriCSS meeting data, where they perform competitively with offline separation-based methods.
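The dual-path idea named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes non-overlapping chunks, and it uses a causal running mean as a stand-in for the LSTM or Transformer layers. The helper names (`segment`, `running_mean`, `dual_path_block`, `sample_chunk_width`) and the chunk-width range are hypothetical. The sketch only shows the data flow: a long sequence is split into chunks, an intra-chunk pass models local context inside each chunk, and an inter-chunk pass models context across chunks at each within-chunk position.

```python
import random

def segment(frames, chunk_width, pad_value=0.0):
    """Split a 1-D list of frames into non-overlapping, equal-width chunks.

    Assumption: no overlap between chunks; the last chunk is padded.
    """
    n_chunks = -(-len(frames) // chunk_width)  # ceiling division
    padded = list(frames) + [pad_value] * (n_chunks * chunk_width - len(frames))
    return [padded[i * chunk_width:(i + 1) * chunk_width] for i in range(n_chunks)]

def running_mean(seq):
    """Causal placeholder for a sequence model: mean of the items seen so far."""
    out, total = [], 0.0
    for count, x in enumerate(seq, start=1):
        total += x
        out.append(total / count)
    return out

def dual_path_block(chunks):
    """Apply one intra-chunk pass followed by one inter-chunk pass."""
    # Intra-chunk pass: local context along the frames of each chunk.
    intra = [running_mean(chunk) for chunk in chunks]
    # Inter-chunk pass: for each within-chunk position, context across chunks.
    columns = [running_mean(list(col)) for col in zip(*intra)]
    # Transpose back to the [chunk][position] layout.
    return [list(row) for row in zip(*columns)]

def sample_chunk_width(rng, low=2, high=4):
    """Hypothetical chunk width randomization: draw a width per training batch."""
    return rng.randint(low, high)

frames = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
chunks = segment(frames, chunk_width=3)
out = dual_path_block(chunks)
print(out)  # [[1.0, 1.5, 2.0], [2.5, 3.0, 3.5]]
```

Because both placeholder passes are causal, each output depends only on the current and earlier frames, which is the property a streaming model needs. A real model would replace `running_mean` with learned layers and stack several such blocks.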


research
05/10/2022

Separator-Transducer-Segmenter: Streaming Recognition and Segmentation of Multi-party Speech

Streaming recognition and segmentation of multi-party conversations with...
research
12/19/2021

Multi-turn RNN-T for streaming recognition of multi-party speech

Automatic speech recognition (ASR) of single channel far-field recording...
research
06/18/2023

SURT 2.0: Advances in Transducer-based Multi-talker Speech Recognition

The Streaming Unmixing and Recognition Transducer (SURT) model was propo...
research
06/15/2022

On the Design and Training Strategies for RNN-based Online Neural Speech Separation Systems

While the performance of offline neural speech separation systems has be...
research
09/09/2022

Streaming Target-Speaker ASR with Neural Transducer

Although recent advances in deep learning technology have boosted automa...
research
02/23/2021

Dual-Path Modeling for Long Recording Speech Separation in Meetings

The continuous speech separation (CSS) is a task to separate the speech ...
research
02/24/2022

Closing the Gap between Single-User and Multi-User VoiceFilter-Lite

VoiceFilter-Lite is a speaker-conditioned voice separation model that pl...
