Integrating end-to-end neural and clustering-based diarization: Getting the best of both worlds

10/26/2020
by Keisuke Kinoshita, et al.

Recent diarization technologies can be categorized into two approaches, clustering-based and end-to-end neural, which have different pros and cons. Clustering-based approaches assign speaker labels to speech regions by clustering speaker embeddings such as x-vectors. While they can be seen as the current state of the art, working on various challenging data with reasonable robustness and accuracy, they have a critical disadvantage: they cannot handle overlapped speech, which is inevitable in natural conversational data. In contrast, end-to-end neural diarization (EEND), which directly predicts diarization labels using a neural network, was devised to handle overlapped speech. While EEND, which can easily incorporate emerging deep-learning technologies, has started outperforming the x-vector clustering approach on some realistic datasets, it is difficult to make it work for "long" recordings (e.g., recordings longer than 10 minutes) because of, e.g., its huge memory consumption. Block-wise independent processing is also difficult because it poses an inter-block label permutation problem, i.e., an ambiguity in the speaker label assignments between blocks. In this paper, we propose a simple but effective hybrid diarization framework that works with overlapped speech and for long recordings containing an arbitrary number of speakers. It modifies the conventional EEND framework to simultaneously output global speaker embeddings so that speaker clustering can be performed across blocks to solve the permutation problem. With experiments based on simulated noisy reverberant 2-speaker meeting-like data, we show that the proposed framework works significantly better than the original EEND, especially when the input data is long.
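To make the inter-block label permutation problem concrete, the following is a minimal sketch, not the authors' implementation. It assumes each block yields a frame-by-speaker activity matrix plus one embedding per local speaker slot, and it replaces the clustering step with a simple greedy alignment to running centroids. The function name `align_blocks`, the toy embeddings, and the block sizes are all hypothetical.

```python
import itertools
import numpy as np


def align_blocks(block_acts, block_embs):
    """Stitch block-wise diarization outputs into one consistent labelling.

    block_acts: list of (T_b, S) arrays, per-frame speaker activities per block.
    block_embs: list of (S, D) arrays, one embedding per local speaker slot.
    Returns a (sum of T_b, S) array whose columns keep the same speaker identity.
    """
    num_spk = block_embs[0].shape[0]
    # Assumption: the first block defines the global speaker order.
    centroids = block_embs[0].astype(float).copy()
    counts = np.ones(num_spk)
    aligned = [block_acts[0]]
    for acts, embs in zip(block_acts[1:], block_embs[1:]):
        # Cosine similarity: rows are global speakers, columns are local slots.
        c = centroids / np.linalg.norm(centroids, axis=1, keepdims=True)
        e = embs / np.linalg.norm(embs, axis=1, keepdims=True)
        sim = c @ e.T
        # Exhaustive search over slot permutations; a permutation never maps
        # two slots of the same block to the same global speaker.
        best = list(max(
            itertools.permutations(range(num_spk)),
            key=lambda p: sum(sim[g, p[g]] for g in range(num_spk)),
        ))
        aligned.append(acts[:, best])  # column g now holds global speaker g
        # Update the running mean embedding of each global speaker.
        centroids = (centroids * counts[:, None] + embs[best]) / (counts[:, None] + 1)
        counts += 1
    return np.concatenate(aligned, axis=0)


# Toy data: 2 speakers, 3 blocks of 4 frames; blocks 2 and 3 have swapped slots.
rng = np.random.default_rng(0)
spk = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])  # hypothetical speaker embeddings
truth = rng.integers(0, 2, size=(12, 2))            # both columns may be 1 (overlap)
block_acts, block_embs = [], []
for b, perm in enumerate([[0, 1], [1, 0], [1, 0]]):
    block_acts.append(truth[4 * b:4 * (b + 1)][:, perm])
    block_embs.append(spk[perm] + 0.05 * rng.standard_normal((2, 3)))

stitched = align_blocks(block_acts, block_embs)
print(np.array_equal(stitched, truth))
```

Exhaustive permutation search is only practical for a handful of speakers, and this sketch assumes every block contains the same speakers. With more speakers, or speakers who are absent from some blocks, a proper clustering of the embeddings across blocks would be needed in its place.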


research
05/19/2021

Advances in integration of end-to-end neural and clustering-based diarization for real conversational speech

Recently, we proposed a novel speaker diarization method called End-to-E...
research
02/24/2020

End-to-End Neural Diarization: Reformulating Speaker Diarization as Simple Multi-label Classification

The most common approach to speaker diarization is clustering of speaker...
research
02/14/2022

Tight integration of neural- and clustering-based diarization through deep unfolding of infinite Gaussian mixture model

Speaker diarization has been investigated extensively as an important ce...
research
04/18/2022

Robust End-to-end Speaker Diarization with Generic Neural Clustering

End-to-end speaker diarization approaches have shown exceptional perform...
research
09/13/2019

End-to-End Neural Speaker Diarization with Self-attention

Speaker diarization has been mainly developed based on the clustering of...
research
05/29/2023

An Experimental Review of Speaker Diarization methods with application to Two-Speaker Conversational Telephone Speech recordings

We performed an experimental review of current diarization systems for t...
research
10/14/2021

Auxiliary Loss of Transformer with Residual Connection for End-to-End Speaker Diarization

End-to-end neural diarization (EEND) with self-attention directly predic...
