Streaming end-to-end multi-talker speech recognition

11/26/2020
by Liang Lu, et al.

End-to-end multi-talker speech recognition is an emerging research trend in the speech community due to its vast potential in applications such as conversation and meeting transcription. To the best of our knowledge, all existing research is limited to the offline scenario. In this work, we propose the Streaming Unmixing and Recognition Transducer (SURT) for end-to-end multi-talker speech recognition. Our model employs the Recurrent Neural Network Transducer (RNN-T) as the backbone, which can meet various latency constraints. We study two model architectures, one based on a speaker-differentiator encoder and the other on a mask encoder. To train this model, we investigate the widely used Permutation Invariant Training (PIT) approach and the recently introduced Heuristic Error Assignment Training (HEAT) approach. Based on experiments on the publicly available LibriSpeechMix dataset, we show that HEAT achieves better accuracy than PIT, and that the SURT model with a 120-millisecond algorithmic latency constraint compares favorably in accuracy with the offline sequence-to-sequence baseline model.
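The difference between the two training criteria named above can be illustrated with a small sketch. It is not the authors' implementation: the function names `pit_loss` and `heat_loss`, the `pair_loss` matrix, and the numeric values are hypothetical, and the sketch assumes that the HEAT heuristic orders references by utterance start time. In a real system each entry of `pair_loss` would be a transducer loss between one output channel and one reference transcript.

```python
from itertools import permutations


def pit_loss(pair_loss):
    """Permutation Invariant Training: pick the cheapest label assignment.

    pair_loss[i][j] is the loss of output channel i scored against
    reference j (hypothetical, precomputed values).
    """
    n = len(pair_loss)
    # Try every one of the n! channel-to-reference assignments.
    best = min(
        permutations(range(n)),
        key=lambda perm: sum(pair_loss[i][perm[i]] for i in range(n)),
    )
    total = sum(pair_loss[i][best[i]] for i in range(n))
    return total, best


def heat_loss(pair_loss, start_times):
    """Heuristic assignment: a single fixed assignment, no search.

    Assumption for this sketch: the reference that starts earliest is
    assigned to channel 0, the next one to channel 1, and so on.
    """
    n = len(pair_loss)
    order = sorted(range(n), key=lambda j: start_times[j])
    total = sum(pair_loss[i][order[i]] for i in range(n))
    return total, tuple(order)


# Two output channels, two references (made-up loss values).
pair_loss = [[1.0, 4.0],
             [3.0, 2.0]]

pit_total, pit_assignment = pit_loss(pair_loss)
heat_total, heat_assignment = heat_loss(pair_loss, start_times=[2.0, 0.5])
```

With two speakers PIT scores two candidate assignments per training example, and the count grows factorially with the number of channels, whereas the heuristic variant always scores exactly one.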

research
01/24/2022

Endpoint Detection for Streaming End-to-End Multi-talker ASR

Streaming end-to-end multi-talker speech recognition aims at transcribin...
research
04/05/2021

Streaming Multi-talker Speech Recognition with Joint Speaker Identification

In multi-talker scenarios such as meetings and conversations, speech pro...
research
05/01/2020

Exploring Pre-training with Alignments for RNN Transducer based End-to-End Speech Recognition

Recently, the recurrent neural network transducer (RNN-T) architecture h...
research
11/02/2022

Fast-U2++: Fast and Accurate End-to-End Speech Recognition in Joint CTC/Attention Frames

Recently, the unified streaming and non-streaming two-pass (U2/U2++) end...
research
09/14/2023

DiariST: Streaming Speech Translation with Speaker Diarization

End-to-end speech translation (ST) for conversation recordings involves ...
research
11/06/2017

Improved training for online end-to-end speech recognition systems

Achieving high accuracy with end-to-end speech recognizers requires care...
research
05/19/2020

Exploring Transformers for Large-Scale Speech Recognition

While recurrent neural networks still largely define state-of-the-art sp...
