
Deep multi-metric learning for text-independent speaker verification

by Jiwei Xu, et al.

Text-independent speaker verification is an important artificial intelligence problem with a wide range of applications, such as criminal investigation, payment certification, and interest-based customer services. Its purpose is to determine whether two given uncontrolled utterances originate from the same speaker. Extracting speech features for each speaker with deep neural networks is a promising direction, and a straightforward solution is to train a discriminative feature extraction network with a metric learning loss function. However, a single loss function often has limitations. We therefore use deep multi-metric learning and introduce three different losses for this problem: triplet loss, n-pair loss, and angular loss. The three loss functions work cooperatively to train a feature extraction network equipped with residual connections and squeeze-and-excitation attention. We conduct experiments on the large-scale VoxCeleb2 dataset, which contains over a million utterances from over 6,000 speakers, and the proposed network obtains an equal error rate of 3.48%, which is a very competitive result. Code for both training and testing, along with pretrained models, is available at <>; this is the first publicly available code repository for large-scale text-independent speaker verification with performance on par with state-of-the-art systems.
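
To make the idea of cooperating losses concrete, here is a minimal NumPy sketch of how triplet, n-pair, and angular losses might be combined into one objective. It is not the authors' implementation. The standard textbook forms of the three losses are used, and the function names, the margin, the angle `alpha_deg`, and the equal weights are all illustrative assumptions, since the abstract does not specify how the losses are weighted or how triplets are mined.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale each row (one embedding per row) to unit length."""
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)

def triplet_loss(a, p, n, margin=0.2):
    """Hinge on squared distances: anchor-positive should beat anchor-negative by `margin`."""
    d_ap = np.sum((a - p) ** 2, axis=1)
    d_an = np.sum((a - n) ** 2, axis=1)
    return float(np.mean(np.maximum(0.0, d_ap - d_an + margin)))

def n_pair_loss(a, p):
    """N-pair loss: row i of `a` and `p` share a speaker; all other rows are negatives.

    Equivalent to softmax cross-entropy over the similarity matrix with the
    diagonal as the target class.
    """
    sim = a @ p.T                                   # (N, N) dot-product similarities
    m = sim.max(axis=1, keepdims=True)              # subtract row max for stability
    lse = m[:, 0] + np.log(np.exp(sim - m).sum(axis=1))
    return float(np.mean(lse - np.diag(sim)))

def angular_loss(a, p, n, alpha_deg=45.0):
    """Angular loss: bounds the angle at the negative in the (a, p, n) triangle."""
    c = (a + p) / 2.0                               # midpoint of anchor and positive
    tan_sq = np.tan(np.deg2rad(alpha_deg)) ** 2
    d_ap = np.sum((a - p) ** 2, axis=1)
    d_nc = np.sum((n - c) ** 2, axis=1)
    return float(np.mean(np.maximum(0.0, d_ap - 4.0 * tan_sq * d_nc)))

def multi_metric_loss(a, p, n, weights=(1.0, 1.0, 1.0)):
    """Weighted sum of the three losses (the weights are a hypothetical choice)."""
    a, p, n = l2_normalize(a), l2_normalize(p), l2_normalize(n)
    w_tri, w_np, w_ang = weights
    return (w_tri * triplet_loss(a, p, n)
            + w_np * n_pair_loss(a, p)
            + w_ang * angular_loss(a, p, n))

# Toy batch: 4 speakers, 8-dimensional embeddings (random stand-ins for network outputs).
rng = np.random.default_rng(0)
anchors = rng.normal(size=(4, 8))
positives = anchors + 0.1 * rng.normal(size=(4, 8))  # same speaker, slightly perturbed
negatives = rng.normal(size=(4, 8))                  # treated here as other speakers
total = multi_metric_loss(anchors, positives, negatives)
```

In a real training loop the embeddings would come from the feature extraction network and the summed loss would be backpropagated through it; the random arrays above only exercise the arithmetic.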




End-to-end losses based on speaker basis vectors and all-speaker hard negative mining for speaker verification

In recent years, speaker verification has been primarily performed using...

Multi-task Metric Learning for Text-independent Speaker Verification

In this work, we introduce metric learning (ML) to enhance the deep embe...

Contrastive-mixup learning for improved speaker verification

This paper proposes a novel formulation of prototypical loss with mixup ...

Unified Hypersphere Embedding for Speaker Recognition

Incremental improvements in accuracy of Convolutional Neural Networks ar...

Atlas Based Representation and Metric Learning on Manifolds

We explore the use of a topological manifold, represented as a collectio...

Wespeaker: A Research and Production oriented Speaker Embedding Learning Toolkit

Speaker modeling is essential for many related tasks, such as speaker re...