Streaming Speaker-Attributed ASR with Token-Level Speaker Embeddings

03/30/2022
by Naoyuki Kanda, et al.
This paper presents a streaming speaker-attributed automatic speech recognition (SA-ASR) model that can recognize "who spoke what" with low latency, even when multiple people are speaking simultaneously. Our model is based on token-level serialized output training (t-SOT), which was recently proposed to transcribe multi-talker speech in a streaming fashion. To further recognize speaker identities, we propose an encoder-decoder-based speaker embedding extractor that can estimate a speaker representation for each recognized token, from both non-overlapping and overlapping speech. The proposed speaker embedding, named t-vector, is extracted synchronously with the t-SOT ASR model, enabling speaker identification (SID) or speaker diarization (SD) to be executed jointly with multi-talker transcription at low latency. We evaluate the proposed model on a joint task of ASR and SID/SD using the LibriSpeechMix and LibriCSS corpora. The proposed model achieves substantially better accuracy than a prior streaming model and shows comparable or sometimes even superior results to the state-of-the-art offline SA-ASR model.
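
To make the idea of token-level speaker attribution concrete, the following is a minimal sketch, not the paper's implementation. It assumes that each recognized token already comes with a speaker embedding (standing in for a t-vector) and that enrolled speaker profiles are available. The function names, the cosine-similarity scoring rule, and the two-dimensional toy vectors are all assumptions made for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def attribute_tokens(tokens, token_embeddings, profiles):
    """Label each token with the enrolled speaker whose profile is most
    similar to that token's embedding (hypothetical scoring rule)."""
    attributed = []
    for token, emb in zip(tokens, token_embeddings):
        best = max(profiles, key=lambda name: cosine(emb, profiles[name]))
        attributed.append((token, best))
    return attributed


# Toy data: tokens from two overlapping speakers, interleaved in emission order.
# These embeddings are made-up values, not outputs of any real model.
tokens = ["hello", "how", "hi", "are", "you"]
token_embeddings = [
    [0.9, 0.1],
    [0.8, 0.2],
    [0.1, 0.95],
    [0.85, 0.1],
    [0.9, 0.0],
]
profiles = {"spk_A": [1.0, 0.0], "spk_B": [0.0, 1.0]}

attributed = attribute_tokens(tokens, token_embeddings, profiles)
for token, speaker in attributed:
    print(f"{speaker}: {token}")
```

Because the decision is made per token, a word from a second speaker that lands in the middle of the first speaker's utterance can still receive its own label, which is the property that a per-token embedding is meant to provide for overlapping speech.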

research
10/07/2021

Transcribe-to-Diarize: Neural Speaker Diarization for Unlimited Number of Speakers using End-to-End Speaker-Attributed ASR

This paper presents Transcribe-to-Diarize, a new approach for neural spe...
research
09/14/2023

DiariST: Streaming Speech Translation with Speaker Diarization

End-to-end speech translation (ST) for conversation recordings involves ...
research
05/23/2023

BA-SOT: Boundary-Aware Serialized Output Training for Multi-Talker ASR

The recently proposed serialized output training (SOT) simplifies multi-...
research
02/02/2022

Streaming Multi-Talker ASR with Token-Level Serialized Output Training

This paper proposes a token-level serialized output training (t-SOT), a ...
research
11/23/2020

Streaming Multi-speaker ASR with RNN-T

Recent research shows end-to-end ASR systems can recognize overlapped sp...
research
09/09/2022

Streaming Target-Speaker ASR with Neural Transducer

Although recent advances in deep learning technology have boosted automa...
research
07/07/2023

Token-Level Serialized Output Training for Joint Streaming ASR and ST Leveraging Textual Alignments

In real-world applications, users often require both translations and tr...
