BW-EDA-EEND: Streaming End-to-End Neural Speaker Diarization for a Variable Number of Speakers

by Eunjung Han, et al.

We present BW-EDA-EEND, a novel online end-to-end neural diarization system that processes data incrementally for a variable number of speakers. The system is based on the EDA architecture of Horiguchi et al., but uses an incremental Transformer encoder that attends only to its left context and uses block-level recurrence in the hidden states to carry information from block to block, making the algorithm's complexity linear in time. We propose two variants. For unlimited-latency BW-EDA-EEND, which processes inputs in linear time, we show only moderate degradation relative to offline EDA-EEND for up to two speakers using a context size of 10 seconds. With more than two speakers, the accuracy gap between online and offline grows, but the algorithm still outperforms a baseline offline clustering diarization system for one to four speakers with unlimited context size, and shows comparable accuracy with a context size of 10 seconds. For limited-latency BW-EDA-EEND, which produces diarization outputs block-by-block as audio arrives, we show accuracy comparable to the offline clustering-based system.
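The block-wise, left-context-only attention described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: it is a single attention layer in NumPy, and the function name `blockwise_attention`, the parameters `block_size` and `context_blocks`, and the choice to cache the layer's inputs from earlier blocks are all assumptions made for illustration. It also assumes full attention within the current block. The point it shows is that each block attends to a bounded amount of left context, so the work per block is constant and the total cost grows linearly with the number of frames.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def blockwise_attention(frames, block_size, context_blocks, w_q, w_k, w_v):
    """Toy single-layer, single-head attention processed block by block.

    frames: array of shape (T, d), one row per input frame.
    Each block attends to itself plus the cached states of at most
    `context_blocks` earlier blocks (hypothetical parameter), never to
    future frames.
    """
    cache = []    # states carried over from previous blocks
    outputs = []
    for start in range(0, len(frames), block_size):
        block = frames[start:start + block_size]
        # Keys and values see the left context and the current block only.
        memory = np.concatenate(cache + [block], axis=0)
        q = block @ w_q
        k = memory @ w_k
        v = memory @ w_v
        scores = q @ k.T / np.sqrt(q.shape[-1])
        outputs.append(softmax(scores) @ v)
        # Carry this block forward and drop the oldest one if over budget.
        cache.append(block)
        if len(cache) > context_blocks:
            cache.pop(0)
    return np.concatenate(outputs, axis=0)


rng = np.random.default_rng(0)
d = 4
frames = rng.normal(size=(12, d))
w_q, w_k, w_v = (rng.normal(size=(d, d)) for _ in range(3))
out = blockwise_attention(frames, block_size=4, context_blocks=1,
                          w_q=w_q, w_k=w_k, w_v=w_v)
```

Because the cache never holds more than `context_blocks` blocks, the attention matrix for each block has a fixed maximum size regardless of how long the recording is. Outputs for a block can be emitted as soon as that block has arrived, which is the behavior the limited-latency variant relies on.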





Online End-to-End Neural Diarization Handling Overlapping Speech and Flexible Numbers of Speakers

This paper proposes an online end-to-end diarization that can handle ove...

Online Speaker Diarization with Graph-based Label Generation

This paper introduces an online speaker diarization system that can hand...

Towards Neural Diarization for Unlimited Numbers of Speakers Using Global and Local Attractors

Attractor-based end-to-end diarization is achieving comparable accuracy ...

Overlap-aware low-latency online speaker diarization based on end-to-end local segmentation

We propose to address online speaker diarization as a combination of inc...

Block-Online Guided Source Separation

We propose a block-online algorithm of guided source separation (GSS). G...

Three-Dimensional Lip Motion Network for Text-Independent Speaker Recognition

Lip motion reflects behavior characteristics of speakers, and thus can b...

Lightweight Speaker Verification for Online Identification of New Speakers with Short Segments

Verifying if two audio segments belong to the same speaker has been rece...