Token-level Speaker Change Detection Using Speaker Difference and Speech Content via Continuous Integrate-and-fire

11/17/2022
by Zhiyun Fan, et al.

In multi-talker scenarios such as meetings and conversations, speech processing systems are usually required to segment the audio and then transcribe each segment. These two stages are addressed separately by speaker change detection (SCD) and automatic speech recognition (ASR). Most previous SCD systems rely solely on speaker information and ignore the importance of speech content. In this paper, we propose a novel SCD system that considers both cues: speaker difference and speech content. The two cues are converted into token-level representations by the continuous integrate-and-fire (CIF) mechanism and then combined to detect speaker changes at the token acoustic boundaries. We evaluate our approach on AISHELL-4, a public dataset of real recorded meetings. The experimental results show that our method outperforms a competitive frame-level baseline system by 2.45% equal coverage-purity (ECP). In addition, we demonstrate the importance of speech content and speaker difference to the SCD task, and the advantages of conducting SCD at the token acoustic boundaries rather than frame by frame.
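
To make the integrate-and-fire idea concrete, the following is a minimal sketch of how frame-level vectors might be accumulated into token-level vectors. It is not the authors' implementation: the function name `cif`, the firing threshold of 1.0, and the assumption that each per-frame weight lies in [0, 1] are illustrative choices, and in a real system the weights would be predicted by a network rather than passed in by hand.

```python
from typing import List, Tuple


def cif(
    frames: List[List[float]],
    weights: List[float],
    threshold: float = 1.0,  # assumed firing threshold, for illustration
) -> Tuple[List[List[float]], List[int]]:
    """Accumulate weighted frames; emit a token when the weight sum hits the threshold.

    Assumes every weight is in [0, 1], so at most one token fires per frame.
    Returns the token vectors and the frame index at which each token fired.
    """
    dim = len(frames[0]) if frames else 0
    tokens: List[List[float]] = []
    boundaries: List[int] = []
    acc_w = 0.0              # weight accumulated for the current token
    acc_v = [0.0] * dim      # weighted sum of frames for the current token

    for t, (h, a) in enumerate(zip(frames, weights)):
        if acc_w + a < threshold:
            # Not enough weight yet: keep integrating.
            acc_w += a
            acc_v = [v + a * x for v, x in zip(acc_v, h)]
        else:
            # Fire: use only the part of this frame's weight needed to
            # reach the threshold, and carry the rest into the next token.
            used = threshold - acc_w
            tokens.append([v + used * x for v, x in zip(acc_v, h)])
            boundaries.append(t)
            rest = a - used
            acc_w = rest
            acc_v = [rest * x for x in h]

    return tokens, boundaries


if __name__ == "__main__":
    toy_frames = [[1.0], [3.0], [2.0], [4.0]]
    toy_weights = [0.5, 0.5, 0.25, 0.75]
    print(cif(toy_frames, toy_weights))
```

The returned frame indices play the role of token acoustic boundaries in this sketch; a detector operating at token level would make one decision per boundary instead of one per frame. Any weight left over after the last fired token is simply discarded here, which is a simplification.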

research
06/27/2022

Sequence-level Speaker Change Detection with Difference-based Continuous Integrate-and-fire

Speaker change detection is an important task in multi-party interaction...
research
07/09/2019

Joint Speech Recognition and Speaker Diarization via Sequence Transduction

Speech applications dealing with conversations require not only recogniz...
research
10/30/2021

Speaker conditioning of acoustic models using affine transformation for multi-speaker speech recognition

This study addresses the problem of single-channel Automatic Speech Reco...
research
11/11/2022

Augmenting Transformer-Transducer Based Speaker Change Detection With Token-Level Training Loss

In this work we propose a novel token-based training strategy that impro...
research
03/06/2023

FoundationTTS: Text-to-Speech for ASR Customization with Generative Language Model

Neural text-to-speech (TTS) generally consists of cascaded architecture ...
research
05/23/2023

BA-SOT: Boundary-Aware Serialized Output Training for Multi-Talker ASR

The recently proposed serialized output training (SOT) simplifies multi-...
research
09/19/2023

Improving Speaker Diarization using Semantic Information: Joint Pairwise Constraints Propagation

Speaker diarization has gained considerable attention within speech proc...
