Target-Speaker Voice Activity Detection: a Novel Approach for Multi-Speaker Diarization in a Dinner Party Scenario

05/14/2020
by Ivan Medennikov, et al.

Speaker diarization for real-life scenarios is an extremely challenging problem. Widely used clustering-based diarization approaches perform rather poorly in such conditions, mainly due to their limited ability to handle overlapping speech. We propose a novel Target-Speaker Voice Activity Detection (TS-VAD) approach, which directly predicts the activity of each speaker in each time frame. The TS-VAD model takes conventional speech features (e.g., MFCC) along with an i-vector for each speaker as inputs. A set of binary classification output layers produces the activity of each speaker. The i-vectors can be estimated iteratively, starting from a strong clustering-based diarization. We also extend the TS-VAD approach to the multi-microphone case using a simple attention mechanism on top of hidden representations extracted from the single-channel TS-VAD model. Moreover, we investigate post-processing strategies for the predicted speaker activity probabilities. Experiments on the CHiME-6 unsegmented data show that TS-VAD achieves state-of-the-art results, outperforming the baseline x-vector-based system by more than 30% absolute Diarization Error Rate (DER).
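To make the data flow in the abstract concrete, here is a minimal NumPy sketch under clearly stated assumptions. It is not the authors' model: random linear layers stand in for the trained network, the single-layer scoring vector used for channel attention is an assumed form, and the threshold-plus-median-filter post-processing is just one simple strategy of the kind the abstract mentions. All function names (`ts_vad_hidden`, `speaker_activity`, `channel_attention`, `postprocess`) are hypothetical. What the sketch does show: each speaker's i-vector is appended to every frame's features, one sigmoid output per speaker gives a per-frame activity probability, per-channel hidden representations are combined with softmax weights over microphones, and the probabilities are smoothed into binary speaker activity.

```python
import numpy as np

rng = np.random.default_rng(0)


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def ts_vad_hidden(feats, ivectors, w_hidden):
    """Per-speaker hidden representations (hypothetical stand-in for the network).

    feats:    (T, F) frame-level features, e.g. MFCCs
    ivectors: (N, D) one i-vector per target speaker
    w_hidden: (F + D, H) shared projection applied to every speaker
    returns:  (N, T, H)
    """
    T, N = feats.shape[0], ivectors.shape[0]
    # Repeat each speaker's i-vector over all frames and append it to the features.
    tiled = np.broadcast_to(ivectors[:, None, :], (N, T, ivectors.shape[1]))
    frames = np.broadcast_to(feats[None, :, :], (N, T, feats.shape[1]))
    x = np.concatenate([frames, tiled], axis=-1)  # (N, T, F + D)
    return np.tanh(x @ w_hidden)


def speaker_activity(hidden, w_out):
    """One binary (sigmoid) output per speaker per frame: (N, T, H) -> (N, T)."""
    return sigmoid(hidden @ w_out)


def channel_attention(hidden_per_channel, w_score):
    """Combine microphones with softmax attention weights (assumed scoring form).

    hidden_per_channel: (C, N, T, H)
    returns: combined (N, T, H) and weights (C, N, T) summing to 1 over C
    """
    scores = hidden_per_channel @ w_score                # (C, N, T)
    scores = scores - scores.max(axis=0, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=0, keepdims=True)        # softmax over channels
    combined = (weights[..., None] * hidden_per_channel).sum(axis=0)
    return combined, weights


def postprocess(probs, threshold=0.5, win=11):
    """Median-filter each speaker's probabilities over time, then threshold."""
    half = win // 2
    padded = np.pad(probs, ((0, 0), (half, half)), mode="edge")
    windows = np.lib.stride_tricks.sliding_window_view(padded, win, axis=1)  # (N, T, win)
    return np.median(windows, axis=-1) > threshold


# Toy dimensions: frames, feature dim, i-vector dim, hidden dim, speakers, channels.
T, F, D, H, N, C = 200, 40, 100, 64, 4, 3
feats_per_channel = rng.standard_normal((C, T, F))
ivectors = rng.standard_normal((N, D))
w_hidden = rng.standard_normal((F + D, H)) * 0.1
w_out = rng.standard_normal(H) * 0.1
w_score = rng.standard_normal(H) * 0.1

hidden = np.stack([ts_vad_hidden(f, ivectors, w_hidden) for f in feats_per_channel])
combined, weights = channel_attention(hidden, w_score)
probs = speaker_activity(combined, w_out)
activity = postprocess(probs)
print(probs.shape, activity.shape)  # (4, 200) (4, 200)
```

Because every speaker has its own output, several speakers can be active in the same frame, which is how this formulation handles overlapping speech that a single cluster label per segment cannot express. The median filter removes isolated one-frame spikes before thresholding; the window length and threshold here are arbitrary illustrative values.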


research
10/28/2022

Target-Speaker Voice Activity Detection via Sequence-to-Sequence Prediction

Target-speaker voice activity detection is currently a promising approac...
research
02/10/2022

The USTC-Ximalaya system for the ICASSP 2022 multi-channel multi-party meeting transcription (M2MeT) challenge

We propose two improvements to target-speaker voice activity detection (...
research
01/17/2023

The Newsbridge -Telecom SudParis VoxCeleb Speaker Recognition Challenge 2022 System Description

We describe the system used by our team for the VoxCeleb Speaker Recogni...
research
03/07/2023

TS-SEP: Joint Diarization and Separation Conditioned on Estimated Speaker Embeddings

Since diarization and source separation of meeting data are closely rela...
research
11/28/2021

Speaker Embedding-aware Neural Diarization for Flexible Number of Speakers with Textual Information

Overlapping speech diarization is always treated as a multi-label classi...
research
07/13/2019

Speaker Recognition with Random Digit Strings Using Uncertainty Normalized HMM-based i-vectors

In this paper, we combine Hidden Markov Models (HMMs) with i-vector extr...
research
02/06/2022

Cross-Channel Attention-Based Target Speaker Voice Activity Detection: Experimental Results for M2MeT Challenge

In this paper, we present the speaker diarization system for the Multi-c...
