
Disentangled speaker and nuisance attribute embedding for robust speaker verification

by Woo Hyun Kang, et al.

In recent years, various deep learning-based embedding methods have been proposed and have shown impressive performance in speaker verification. However, like most classical embedding techniques, deep learning-based methods are known to suffer severe performance degradation when handling speech samples recorded under different conditions (e.g., recording devices, emotional states). In this paper, we propose a novel fully supervised training method for extracting a speaker embedding vector disentangled from the variability caused by nuisance attributes. The proposed framework was compared with conventional deep learning-based embedding methods on the RSR2015 and VoxCeleb1 datasets. Experimental results show that the proposed approach can extract speaker embeddings that are robust to channel and emotional variability.



Robust Speech Representation Learning via Flow-based Embedding Regularization

Over the recent years, various deep learning-based methods were proposed...

Powerful Speaker Embedding Training Framework by Adversarially Disentangled Identity Representation

The main challenge of speaker verification in the wild is the interferen...

Masked cross self-attention encoding for deep speaker embedding

In general, speaker verification tasks require the extraction of speaker...

Improving Embedding Extraction for Speaker Verification with Ladder Network

Speaker verification is an established yet challenging task in speech pr...

Speaker Verification in Emotional Talking Environments based on Three-Stage Framework

This work is dedicated to introducing, executing, and assessing a three-...

Exploring Transfer Learning for Low Resource Emotional TTS

During the last few years, spoken language technologies have known a big...

Channel adversarial training for speaker verification and diarization

Previous work has encouraged domain-invariance in deep speaker embedding...