Quantitative Evidence on Overlooked Aspects of Enrollment Speaker Embeddings for Target Speaker Separation

10/23/2022
by Xiaoyu Liu, et al.

Single channel target speaker separation (TSS) aims at extracting a speaker's voice from a mixture of multiple talkers given an enrollment utterance of that speaker. A typical deep learning TSS framework consists of an upstream model that obtains enrollment speaker embeddings and a downstream model that performs the separation conditioned on the embeddings. In this paper, we look into several important but overlooked aspects of the enrollment embeddings, including the suitability of the widely used speaker identification embeddings, the introduction of the log-mel filterbank and self-supervised embeddings, and the embeddings' cross-dataset generalization capability. Our results show that the speaker identification embeddings could lose relevant information due to a sub-optimal metric, training objective, or common pre-processing. In contrast, both the filterbank and the self-supervised embeddings preserve the integrity of the speaker information, but the former consistently outperforms the latter in a cross-dataset evaluation. The competitive separation and generalization performance of the previously overlooked filterbank embedding is consistent across our study, which calls for future research on better upstream features.
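As a rough illustration of the upstream/downstream split described above, the following is a minimal NumPy sketch of a log-mel filterbank enrollment embedding and one simple way to condition mixture features on it. The abstract does not say how the filterbank features are pooled or how the downstream model consumes them, so the time-averaging step, the frame-wise concatenation, the parameter values (16 kHz, 512-point FFT, 40 mel bands), and all function names here are assumptions made for illustration, not the authors' method.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_mels, n_fft, sample_rate):
    """Triangular filters with centers evenly spaced on the mel scale."""
    n_bins = n_fft // 2 + 1
    fft_freqs = np.linspace(0.0, sample_rate / 2.0, n_bins)
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0), n_mels + 2)
    hz_points = mel_to_hz(mel_points)
    fb = np.zeros((n_mels, n_bins))
    for i in range(n_mels):
        left, center, right = hz_points[i], hz_points[i + 1], hz_points[i + 2]
        rising = (fft_freqs - left) / (center - left)
        falling = (right - fft_freqs) / (right - center)
        fb[i] = np.maximum(0.0, np.minimum(rising, falling))
    return fb

def log_mel_frames(wave, sample_rate=16000, n_fft=512, hop=160, n_mels=40):
    """Frame-level log-mel filterbank features, shape (n_frames, n_mels)."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(wave) - n_fft) // hop
    frames = np.stack([wave[t * hop: t * hop + n_fft] * window for t in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    mel_energy = power @ mel_filterbank(n_mels, n_fft, sample_rate).T
    return np.log(mel_energy + 1e-8)  # small floor avoids log(0)

def enrollment_embedding(wave, **kwargs):
    """ASSUMPTION: pool the enrollment utterance by averaging over time."""
    return log_mel_frames(wave, **kwargs).mean(axis=0)

def condition_on_embedding(mixture_feats, embedding):
    """ASSUMPTION: condition by appending the embedding to every mixture frame."""
    tiled = np.tile(embedding, (mixture_feats.shape[0], 1))
    return np.concatenate([mixture_feats, tiled], axis=1)

# Synthetic stand-ins for real audio: 1 s of enrollment, 2 s of mixture.
rng = np.random.default_rng(0)
enrollment = rng.standard_normal(16000)
mixture = rng.standard_normal(32000)

embedding = enrollment_embedding(enrollment)
downstream_input = condition_on_embedding(log_mel_frames(mixture), embedding)
```

In this sketch `downstream_input` is what a separation network would receive; the network itself is omitted. A speaker identification or self-supervised model could be swapped in for `enrollment_embedding` without changing the conditioning step, which is the kind of comparison the abstract describes.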


