Streaming Multi-Talker ASR with Token-Level Serialized Output Training

02/02/2022
by Naoyuki Kanda, et al.

This paper proposes token-level serialized output training (t-SOT), a novel framework for streaming multi-talker automatic speech recognition (ASR). Unlike existing streaming multi-talker ASR models, which use multiple output layers, the t-SOT model has a single output layer that generates the recognition tokens (e.g., words, subwords) of multiple speakers in chronological order based on their emission times. A special token that indicates a change of "virtual" output channel is introduced to keep track of overlapping utterances. Compared to prior streaming multi-talker ASR models, the t-SOT model has a lower inference cost and a simpler model architecture. Moreover, in our experiments with the LibriSpeechMix and LibriCSS datasets, the t-SOT-based transformer transducer model achieves state-of-the-art word error rates, outperforming prior results by a significant margin. For non-overlapping speech, the t-SOT model is on par with a single-talker ASR model in terms of both accuracy and computational cost, opening the door to deploying one model for both single- and multi-talker scenarios.
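The serialization idea described in the abstract can be illustrated with a small sketch. This is not the authors' implementation. It assumes that each token already carries an emission time and a virtual channel index, that there are only two virtual channels, that decoding starts on channel 0, and that the channel-change token is spelled `<cc>`. The function names `serialize` and `deserialize` are hypothetical.

```python
from typing import List, Tuple

CC = "<cc>"  # assumed spelling of the channel-change token

def serialize(tokens: List[Tuple[float, int, str]]) -> List[str]:
    """Merge (emission_time, channel, token) triples into one stream.

    Tokens are sorted by emission time, and CC is inserted whenever
    the channel differs from that of the previously emitted token.
    Assumes two channels (0 and 1) and that the stream starts on 0.
    """
    out: List[str] = []
    current = 0
    for _, channel, token in sorted(tokens, key=lambda t: t[0]):
        if channel != current:
            out.append(CC)
            current = channel
        out.append(token)
    return out

def deserialize(stream: List[str]) -> List[List[str]]:
    """Split a serialized stream back into two per-channel token lists."""
    channels: List[List[str]] = [[], []]
    current = 0
    for token in stream:
        if token == CC:
            current = 1 - current  # toggle between the two channels
        else:
            channels[current].append(token)
    return channels

# Two overlapping utterances with made-up emission times (seconds).
utterance_a = [(0.0, 0, "hello"), (0.5, 0, "how"), (1.0, 0, "are"), (1.5, 0, "you")]
utterance_b = [(0.8, 1, "good"), (1.2, 1, "morning")]

stream = serialize(utterance_a + utterance_b)
recovered = deserialize(stream)
print(" ".join(stream))
print(recovered)
```

Because the merged stream is a single token sequence, one output layer is enough to produce it, and for non-overlapping speech no `<cc>` token is emitted, so the stream is identical to ordinary single-talker output.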

research
09/12/2022

VarArray Meets t-SOT: Advancing the State of the Art of Streaming Distant Conversational Speech Recognition

This paper presents a novel streaming automatic speech recognition (ASR)...
research
07/07/2023

Token-Level Serialized Output Training for Joint Streaming ASR and ST Leveraging Textual Alignments

In real-world applications, users often require both translations and tr...
research
03/30/2022

Streaming Speaker-Attributed ASR with Token-Level Speaker Embeddings

This paper presents a streaming speaker-attributed automatic speech reco...
research
05/18/2023

ZeroPrompt: Streaming Acoustic Encoders are Zero-Shot Masked LMs

In this paper, we present ZeroPrompt (Figure 1-(a)) and the correspondin...
research
10/27/2022

Simulating realistic speech overlaps improves multi-talker ASR

Multi-talker automatic speech recognition (ASR) has been studied to gene...
research
05/23/2023

Model Stealing Attack against Multi-Exit Networks

Compared to traditional neural networks with a single exit, a multi-exit...
research
06/28/2023

Accelerating Transducers through Adjacent Token Merging

Recent end-to-end automatic speech recognition (ASR) systems often utili...
