Joint Speaker Encoder and Neural Back-end Model for Fully End-to-End Automatic Speaker Verification with Multiple Enrollment Utterances

09/01/2022
by   Chang Zeng, et al.
0

Conventional automatic speaker verification systems can usually be decomposed into a front-end model such as time delay neural network (TDNN) for extracting speaker embeddings and a back-end model such as statistics-based probabilistic linear discriminant analysis (PLDA) or neural network-based neural PLDA (NPLDA) for similarity scoring. However, the sequential optimization of the front-end and back-end models may lead to a local minimum, which theoretically prevents the whole system from achieving the best optimization. Although some methods have been proposed for jointly optimizing the two models, such as the generalized end-to-end (GE2E) model and NPLDA E2E model, all of these methods are designed for use with a single enrollment utterance. In this paper, we propose a new E2E joint method for speaker verification especially designed for the practical case of multiple enrollment utterances. In order to leverage the intra-relationship among multiple enrollment utterances, our model comes equipped with frame-level and utterance-level attention mechanisms. We also utilize several data augmentation techniques, including conventional noise augmentation using MUSAN and RIRs datasets and a unique speaker embedding-level mixup strategy for better optimization.

READ FULL TEXT
research
04/04/2021

Attention Back-end for Automatic Speaker Verification with Multiple Enrollment Utterances

A back-end model is a key element of modern speaker verification systems...
research
04/17/2019

RawNet: Advanced end-to-end deep neural network using raw waveforms for text-independent speaker verification

Recently, direct modeling of raw waveforms using deep neural networks ha...
research
01/03/2017

End-to-End Attention based Text-Dependent Speaker Verification

A new type of End-to-End system for text-dependent speaker verification ...
research
04/05/2021

Dr-Vectors: Decision Residual Networks and an Improved Loss for Speaker Recognition

Many neural network speaker recognition systems model each speaker using...
research
08/08/2020

Variable frame rate-based data augmentation to handle speaking-style variability for automatic speaker verification

The effects of speaking-style variability on automatic speaker verificat...
research
12/27/2018

Tied Hidden Factors in Neural Networks for End-to-End Speaker Recognition

In this paper we propose a method to model speaker and session variabili...
research
02/23/2023

Incorporating Uncertainty from Speaker Embedding Estimation to Speaker Verification

Speech utterances recorded under differing conditions exhibit varying de...

Please sign up or login with your details

Forgot password? Click here to reset