Target Speaker Voice Activity Detection with Transformers and Its Integration with End-to-End Neural Diarization

08/27/2022
by   Dongmei Wang, et al.

This paper describes a speaker diarization model based on target speaker voice activity detection (TS-VAD) using transformers. To overcome the original TS-VAD model's drawback of being unable to handle an arbitrary number of speakers, we investigate model architectures that use input tensors with variable-length time and speaker dimensions. Transformer layers are applied to the speaker axis to make the model output insensitive to the order of the speaker profiles provided to the TS-VAD model. Time-wise sequential layers are interspersed between these speaker-wise transformer layers to allow the temporal and cross-speaker correlations of the input speech signal to be captured. We also extend a diarization model based on end-to-end neural diarization with encoder-decoder based attractors (EEND-EDA) by replacing its dot-product-based speaker detection layer with the transformer-based TS-VAD. Experimental results on VoxConverse show that using the transformers for the cross-speaker modeling reduces the diarization error rate (DER) of TS-VAD by 11.3% [...] EEND-EDA reduces DER by 6.9% [...] EEND-EDA with a similar model size, achieving a new SOTA DER of 11.18% [...] widely used training data setting.
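The role of the speaker-axis layers can be illustrated with a small sketch. It is not the paper's model: it assumes identity query/key/value projections instead of learned transformer weights, and it uses a running mean as a stand-in for the time-wise sequential layer. All function names (`speaker_axis_attention`, `time_axis_running_mean`, `block`) are hypothetical. The sketch only shows that attention across the speaker axis, interleaved with a per-speaker time-wise layer, accepts any number of speakers and reorders its output when the speaker profiles are reordered.

```python
import math
import random


def speaker_axis_attention(x):
    """Self-attention across speakers, applied independently to each frame.

    x[t][s] is a feature vector (list of floats) for frame t and speaker s.
    Assumption: identity Q/K/V projections, single head, no learned weights.
    """
    out = []
    for frame in x:
        d = len(frame[0])
        new_frame = []
        for q in frame:
            scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in frame]
            top = max(scores)  # subtract the max for a numerically stable softmax
            weights = [math.exp(s - top) for s in scores]
            z = sum(weights)
            new_frame.append(
                [sum(w / z * v[i] for w, v in zip(weights, frame)) for i in range(d)]
            )
        out.append(new_frame)
    return out


def time_axis_running_mean(x):
    """Stand-in for a time-wise sequential layer: running mean per speaker."""
    n_frames, n_spk, d = len(x), len(x[0]), len(x[0][0])
    out = [[None] * n_spk for _ in range(n_frames)]
    for s in range(n_spk):
        total = [0.0] * d
        for t in range(n_frames):
            total = [acc + v for acc, v in zip(total, x[t][s])]
            out[t][s] = [acc / (t + 1) for acc in total]
    return out


def block(x):
    """One speaker-wise layer followed by one time-wise layer."""
    return time_axis_running_mean(speaker_axis_attention(x))


def permute_speakers(x, perm):
    return [[frame[p] for p in perm] for frame in x]


def max_abs_diff(a, b):
    return max(
        abs(u - v)
        for fa, fb in zip(a, b)
        for va, vb in zip(fa, fb)
        for u, v in zip(va, vb)
    )


random.seed(0)
T, S, D = 5, 3, 4  # frames, speakers, feature size (arbitrary toy values)
x = [[[random.gauss(0, 1) for _ in range(D)] for _ in range(S)] for _ in range(T)]
perm = [2, 0, 1]

# Reordering the speaker profiles before the block should give the same
# result as reordering the block's output, up to floating-point rounding.
diff = max_abs_diff(block(permute_speakers(x, perm)), permute_speakers(block(x), perm))
print(diff < 1e-9)
```

Because neither function fixes the size of the speaker axis, the same `block` can be called on an input with two speakers or with ten, which is the property the original TS-VAD model lacked.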


research
09/13/2023

Attention-based Encoder-Decoder End-to-End Neural Diarization with Embedding Enhancer

Deep neural network-based systems have significantly improved the perfor...
research
03/02/2023

Improving Transformer-based End-to-End Speaker Diarization by Assigning Auxiliary Losses to Attention Heads

Transformer-based end-to-end neural speaker diarization (EEND) models ut...
research
06/14/2021

End-to-end Neural Diarization: From Transformer to Conformer

We propose a new end-to-end neural diarization (EEND) system that is bas...
research
10/14/2021

Auxiliary Loss of Transformer with Residual Connection for End-to-End Speaker Diarization

End-to-end neural diarization (EEND) with self-attention directly predic...
research
11/11/2021

Towards an Efficient Voice Identification Using Wav2Vec2.0 and HuBERT Based on the Quran Reciters Dataset

Current authentication and trusted systems depend on classical and biome...
research
05/03/2021

AvaTr: One-Shot Speaker Extraction with Transformers

To extract the voice of a target speaker when mixed with a variety of ot...
research
12/14/2021

End-to-end speaker diarization with transformer

Speaker diarization is connected to semantic segmentation in computer vi...
