Attention Back-end for Automatic Speaker Verification with Multiple Enrollment Utterances

by Chang Zeng, et al.

A back-end model is a key element of modern speaker verification systems. Probabilistic linear discriminant analysis (PLDA) has been widely used as a back-end model in speaker verification. However, it cannot make full use of multiple utterances from enrollment speakers. In this paper, we propose a novel attention-based back-end model that can be used for both text-independent (TI) and text-dependent (TD) speaker verification with multiple enrollment utterances. It employs scaled-dot self-attention and feed-forward self-attention networks as architectures that learn the intra-relationships of the enrollment utterances. To verify the proposed attention back-end, we combine it with two completely different but dominant speaker encoders, a time delay neural network (TDNN) and a ResNet, both trained using the additive-margin-based softmax loss and the uniform loss, and compare them with the conventional PLDA and cosine scoring approaches. Experimental results on a multi-genre dataset called CN-Celeb show that our proposed approach outperforms PLDA scoring with TDNN and cosine scoring with ResNet by around 14.1 [...]. An experiment is also reported in this paper to examine the impact of some significant hyper-parameters of the proposed back-end model.
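The idea of letting enrollment utterances attend to one another before scoring can be illustrated with a minimal sketch. This is not the authors' model. It assumes identity query/key/value projections (no learned weights), mean pooling of the attended embeddings, and cosine scoring against the test embedding. The function names and toy embeddings are hypothetical.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(scores):
    # Subtract the maximum for numerical stability.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(embeddings):
    # Scaled dot-product self-attention over enrollment embeddings.
    # Assumption: Q, K and V are the embeddings themselves (no learned projections).
    d = len(embeddings[0])
    scale = math.sqrt(d)
    attended = []
    for q in embeddings:
        weights = softmax([dot(q, k) / scale for k in embeddings])
        attended.append(
            [sum(w * v[i] for w, v in zip(weights, embeddings)) for i in range(d)]
        )
    return attended

def mean_pool(vectors):
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

def score(enroll, test):
    # Aggregate the enrollment utterances into one speaker vector, then score.
    speaker = mean_pool(self_attention(enroll))
    return cosine(speaker, test)

# Toy 3-dimensional embeddings (hypothetical values).
enroll = [[1.0, 0.1, 0.0], [0.9, 0.2, 0.1], [1.0, 0.0, 0.2]]
target = [0.95, 0.1, 0.1]
nontarget = [0.0, 1.0, 0.0]

print(score(enroll, target) > score(enroll, nontarget))
```

In a trained back-end the projections and the pooling would be learned. The sketch only shows how multiple enrollment embeddings can be combined through attention weights rather than scored one at a time.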



