Self-supervised Speaker Diarization

04/08/2022
by   Yehoshua Dissen, et al.

Over the last few years, deep learning has grown in popularity for speaker verification, identification, and diarization. A significant part of this success is due to the demonstrated effectiveness of the speaker representations these models learn. These representations, however, depend heavily on large amounts of annotated data and can be sensitive to new domains. This study proposes an entirely unsupervised deep-learning model for speaker diarization. Specifically, the study focuses on generating high-quality neural speaker representations without any annotated data, as well as on estimating secondary hyperparameters of the model without annotations. The speaker embeddings are produced by an encoder trained in a self-supervised fashion on pairs of adjacent segments assumed to belong to the same speaker. The trained encoder is then used to self-generate pseudo-labels, which serve two purposes: training a similarity score between different segments of the same call using probabilistic linear discriminant analysis (PLDA), and learning a clustering stopping threshold. We compared our model to state-of-the-art unsupervised as well as supervised baselines on the CallHome benchmarks. According to empirical results, our approach outperforms unsupervised methods when only two speakers are present in the call, and is only slightly worse than recent supervised models.
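As a rough illustration of two ideas in the abstract (positive pairs taken from adjacent segments, and clustering that stops at a similarity threshold), here is a minimal Python sketch. It is not the authors' implementation: cosine similarity stands in for the PLDA score, the threshold is fixed by hand rather than learned from pseudo-labels, the embeddings are synthetic rather than produced by a trained encoder, and all function names are hypothetical.

```python
import numpy as np


def adjacent_pairs(num_segments):
    """Index pairs (i, i + 1) of neighbouring segments, assumed same-speaker."""
    return [(i, i + 1) for i in range(num_segments - 1)]


def cosine_similarity_matrix(embeddings):
    """Pairwise cosine similarity (an assumed stand-in for a PLDA score)."""
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    return normed @ normed.T


def cluster_with_threshold(embeddings, stop_threshold):
    """Average-linkage agglomerative clustering that stops merging once the
    best cluster-to-cluster similarity falls below stop_threshold."""
    sim = cosine_similarity_matrix(embeddings)
    clusters = [[i] for i in range(len(embeddings))]
    while len(clusters) > 1:
        best_score, best_pair = -np.inf, None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                score = sim[np.ix_(clusters[a], clusters[b])].mean()
                if score > best_score:
                    best_score, best_pair = score, (a, b)
        if best_score < stop_threshold:
            break  # remaining clusters are treated as distinct speakers
        a, b = best_pair
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    labels = np.empty(len(embeddings), dtype=int)
    for label, members in enumerate(clusters):
        labels[members] = label
    return labels


# Toy "call": six segment embeddings around two synthetic speaker centres.
rng = np.random.default_rng(0)
centres = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
true_speakers = np.array([0, 0, 0, 1, 1, 1])
toy_embeddings = centres[true_speakers] + 0.05 * rng.standard_normal((6, 3))

pairs = adjacent_pairs(len(toy_embeddings))
labels = cluster_with_threshold(toy_embeddings, stop_threshold=0.5)
```

In this toy call the pair (2, 3) straddles a speaker change, which shows why the abstract describes adjacent segments as only assumed to share a speaker: the assumption holds for most pairs but not at turn boundaries.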

