Attention-based Encoder-Decoder End-to-End Neural Diarization with Embedding Enhancer

09/13/2023
by Zhengyang Chen, et al.

Deep neural network-based systems have significantly improved the performance of speaker diarization tasks. However, end-to-end neural diarization (EEND) systems often struggle to generalize to scenarios with an unseen number of speakers, while target speaker voice activity detection (TS-VAD) systems tend to be overly complex. In this paper, we propose a simple attention-based encoder-decoder network for end-to-end neural diarization (AED-EEND). In our training process, we introduce a teacher-forcing strategy to address the speaker permutation problem, leading to faster model convergence. For evaluation, we propose an iterative decoding method that outputs diarization results for each speaker sequentially. Additionally, we propose an Enhancer module to enhance the frame-level speaker embeddings, enabling the model to handle scenarios with an unseen number of speakers. We also explore replacing the transformer encoder with a Conformer architecture, which better models local information. Furthermore, we discovered that commonly used simulation datasets for speaker diarization have a much higher overlap ratio than real data. We found that using simulated training data that is more consistent with real data can achieve an improvement in consistency. Extensive experimental validation demonstrates the effectiveness of our proposed methodologies. Our best system achieved new state-of-the-art diarization error rate (DER) performance on all the CALLHOME (10.08%) [...] and AMI (13.00%) [...] (VAD) is used. Beyond speaker diarization, our AED-EEND system also shows remarkable competitiveness as a speech type detection model.
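
The overlap ratio mentioned in the abstract can be made concrete with a small sketch. The definition used here (frames with two or more active speakers, divided by frames with any speech) is an assumption, since the abstract does not say which definition the authors use. The function name `overlap_ratio` and the toy labels are hypothetical and are not taken from the paper.

```python
from typing import Sequence


def overlap_ratio(activity: Sequence[Sequence[int]]) -> float:
    """Fraction of speech frames in which two or more speakers are active.

    activity[t][s] is 1 if speaker s is active in frame t, else 0.
    Assumed definition: overlapped frames / frames containing any speech.
    """
    speech_frames = 0
    overlap_frames = 0
    for frame in activity:
        active = sum(1 for a in frame if a)
        if active >= 1:
            speech_frames += 1
        if active >= 2:
            overlap_frames += 1
    # Avoid division by zero for recordings with no speech at all.
    return overlap_frames / speech_frames if speech_frames else 0.0


# Toy recording: 6 frames, 2 speakers (hypothetical labels, not real data).
labels = [
    [0, 0],  # silence
    [1, 0],  # speaker 0 only
    [1, 1],  # overlap
    [1, 1],  # overlap
    [0, 1],  # speaker 1 only
    [0, 0],  # silence
]
ratio = overlap_ratio(labels)
```

Computing this statistic on both a simulated corpus and a real one is one way to check how far the simulated overlap statistics drift from real conversations.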

research
05/18/2023

Attention-based Encoder-Decoder Network for End-to-End Neural Speaker Diarization with Target Speaker Attractor

This paper proposes a novel Attention-based Encoder-Decoder network for ...
research
08/27/2022

Target Speaker Voice Activity Detection with Transformers and Its Integration with End-to-End Neural Diarization

This paper describes a speaker diarization model based on target speaker...
research
05/05/2021

End-to-End Diarization for Variable Number of Speakers with Local-Global Networks and Discriminative Speaker Embeddings

We present an end-to-end deep network model that performs meeting diariz...
research
01/25/2022

Improved Mispronunciation detection system using a hybrid CTC-ATT based approach for L2 English speakers

This report proposes state-of-the-art research in the field of Computer ...
research
06/24/2023

Improving End-to-End Neural Diarization Using Conversational Summary Representations

Speaker diarization is a task concerned with partitioning an audio recor...
research
06/14/2021

End-to-end Neural Diarization: From Transformer to Conformer

We propose a new end-to-end neural diarization (EEND) system that is bas...
research
06/20/2021

Encoder-Decoder Based Attractor Calculation for End-to-End Neural Diarization

This paper investigates an end-to-end neural diarization (EEND) method f...
