Towards Word-Level End-to-End Neural Speaker Diarization with Auxiliary Network

09/15/2023
by Yiling Huang, et al.

While standard speaker diarization attempts to answer the question "who spoke when", most real-world applications are more interested in determining "who spoke what". Both the conventional modularized approach and the more recent end-to-end neural diarization (EEND) require an additional automatic speech recognition (ASR) model and an orchestration algorithm to associate the speaker labels with the recognized words. In this paper, we propose Word-level End-to-End Neural Diarization (WEEND) with auxiliary network, a multi-task learning algorithm that performs end-to-end ASR and speaker diarization in the same neural architecture. That is, while speech is being recognized, a speaker label is predicted simultaneously for each recognized word. Experimental results demonstrate that WEEND outperforms the turn-based diarization baseline system on all 2-speaker short-form scenarios and can generalize to audio lengths of 5 minutes. Although conversations with 3 or more speakers are harder, we find that with enough in-domain training data, WEEND has the potential to deliver high-quality diarized text.
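To make the "orchestration" step of the conventional pipeline concrete, the sketch below assigns each recognized word to the speaker turn it overlaps most in time. It is only an illustration under stated assumptions, not the method proposed in the paper and not the baseline it evaluates: the `Word` and `Turn` types, the `assign_speakers` function, and the maximum-overlap rule are all hypothetical choices made for this example.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Word:
    """A recognized word with timestamps in seconds (hypothetical ASR output)."""
    text: str
    start: float
    end: float


@dataclass
class Turn:
    """A speaker turn with timestamps in seconds (hypothetical diarization output)."""
    speaker: str
    start: float
    end: float


def overlap(a_start: float, a_end: float, b_start: float, b_end: float) -> float:
    """Length of the intersection of two time intervals, or 0.0 if disjoint."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def assign_speakers(words: List[Word], turns: List[Turn]) -> List[Tuple[str, Optional[str]]]:
    """Label each word with the speaker whose turn overlaps it the most.

    Words that overlap no turn at all get the label None.
    """
    labeled = []
    for w in words:
        best_speaker = None
        best_overlap = 0.0
        for t in turns:
            o = overlap(w.start, w.end, t.start, t.end)
            if o > best_overlap:
                best_speaker, best_overlap = t.speaker, o
        labeled.append((w.text, best_speaker))
    return labeled


if __name__ == "__main__":
    words = [Word("hello", 0.0, 0.4), Word("there", 0.5, 0.9), Word("hi", 1.1, 1.3)]
    turns = [Turn("spk1", 0.0, 1.0), Turn("spk2", 1.0, 2.0)]
    for text, speaker in assign_speakers(words, turns):
        print(speaker, text)
```

A post-hoc alignment of this kind depends on the accuracy of both the word timestamps and the turn boundaries, which is part of the motivation for predicting a speaker label jointly with each word.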

research
02/16/2020

Speech Corpus of Ainu Folklore and End-to-end Speech Recognition for Ainu Language

Ainu is an unwritten language that has been spoken by Ainu people who ar...
research
05/09/2018

Improving End-of-turn Detection in Spoken Dialogues by Detecting Speaker Intentions as a Secondary Task

This work focuses on the use of acoustic cues for modeling turn-taking i...
research
02/02/2022

ASR-Aware End-to-end Neural Diarization

We present a Conformer-based end-to-end neural diarization (EEND) model ...
research
03/01/2022

Extended Graph Temporal Classification for Multi-Speaker End-to-End ASR

Graph-based temporal classification (GTC), a generalized form of the con...
research
07/09/2019

Joint Speech Recognition and Speaker Diarization via Sequence Transduction

Speech applications dealing with conversations require not only recogniz...
research
10/30/2020

Comparison of Speaker Role Recognition and Speaker Enrollment Protocol for conversational Clinical Interviews

Conversations between a clinician and a patient, in natural conditions, ...
research
11/03/2020

Training Wake Word Detection with Synthesized Speech Data on Confusion Words

Confusing-words are commonly encountered in real-life keyword spotting a...
