Towards End-to-end Speaker Diarization in the Wild

11/02/2022
by   Zexu Pan, et al.

Speaker diarization algorithms address the "who spoke when" problem in audio recordings. Algorithms trained end-to-end have proven superior to classical modular-cascaded systems in constrained scenarios with a small number of speakers. However, their performance on in-the-wild recordings, which contain more speakers and shorter utterances, remains to be investigated. In this paper, we address this gap, showing that an attractor-based end-to-end system can also perform remarkably well in the latter scenario when first pre-trained on a carefully designed simulated dataset that matches the distribution of in-the-wild recordings. We also propose to use an attention mechanism to increase the network capacity in decoding more speaker attractors, and to jointly train the attractors on a speaker recognition task to improve the speaker attractor representation. Even though the model we propose is audio-only, we find it significantly outperforms both audio-only and audio-visual baselines on the AVA-AVD benchmark dataset, achieving state-of-the-art results with an absolute reduction in diarization error of 23.3%.
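To make the attractor idea concrete, here is a minimal sketch of how attractor-based diarization systems in general turn frame embeddings and speaker attractors into "who spoke when" decisions: each frame embedding is scored against each attractor with a dot product, passed through a sigmoid, and thresholded. This is not the model proposed in the paper. The function name `speaker_activity`, the 0.5 threshold, and the toy two-dimensional values are assumptions chosen for illustration, and the attention-based attractor decoder and speaker recognition objective mentioned above are not shown.

```python
import math


def sigmoid(x):
    """Logistic function mapping a score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def speaker_activity(frame_embeddings, attractors, threshold=0.5):
    """Return a frames x speakers matrix of 0/1 activity decisions.

    Hypothetical helper: each frame embedding is compared with each
    speaker attractor by dot product; the sigmoid of that score is
    read as the probability that the speaker is active in the frame.
    """
    activity = []
    for frame in frame_embeddings:
        row = []
        for attractor in attractors:
            score = sum(f * a for f, a in zip(frame, attractor))
            row.append(1 if sigmoid(score) > threshold else 0)
        activity.append(row)
    return activity


# Toy values (assumed, not from the paper): 4 frames, 2 speakers.
frames = [[2.0, -2.0], [-2.0, 2.0], [2.0, 2.0], [-2.0, -2.0]]
attractors = [[1.0, 0.0], [0.0, 1.0]]

activity = speaker_activity(frames, attractors)
for t, row in enumerate(activity):
    print(f"frame {t}: speakers active = {row}")
```

Because every speaker gets an independent decision per frame, overlapping speech (the third toy frame) and silence (the fourth) are both representable, which is one reason this formulation suits multi-speaker recordings.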

research
06/02/2020

Neural Speaker Diarization with Speaker-Wise Chain Rule

Speaker diarization is an essential step for processing multi-speaker au...
research
01/06/2021

Hypothesis Stitcher for End-to-End Speaker-attributed ASR on Long-form Multi-talker Recordings

An end-to-end (E2E) speaker-attributed automatic speech recognition (SA-...
research
04/05/2021

End-to-End Speaker-Attributed ASR with Transformer

This paper presents our recent effort on end-to-end speaker-attributed a...
research
11/29/2021

AVA-AVD: Audio-visual Speaker Diarization in the Wild

Audio-visual speaker diarization aims at detecting "who spoke when" usi...
research
04/03/2021

Diarization of Legal Proceedings. Identifying and Transcribing Judicial Speech from Recorded Court Audio

United States Courts make audio recordings of oral arguments available a...
research
09/24/2018

Speaker Naming in Movies

We propose a new model for speaker naming in movies that leverages visua...
research
02/26/2019

Utterance-level Aggregation For Speaker Recognition In The Wild

The objective of this paper is speaker recognition "in the wild"-where u...
