Exploring the Use of an Unsupervised Autoregressive Model as a Shared Encoder for Text-Dependent Speaker Verification

08/08/2020
by   Vijay Ravi, et al.
0

In this paper, we propose a novel way of addressing text-dependent automatic speaker verification (TD-ASV) by using a shared-encoder with task-specific decoders. An autoregressive predictive coding (APC) encoder is pre-trained in an unsupervised manner using both out-of-domain (LibriSpeech, VoxCeleb) and in-domain (DeepMine) unlabeled datasets to learn generic, high-level feature representation that encapsulates speaker and phonetic content. Two task-specific decoders were trained using labeled datasets to classify speakers (SID) and phrases (PID). Speaker embeddings extracted from the SID decoder were scored using a PLDA. SID and PID systems were fused at the score level. There is a 51.9 supervised x-vector baseline on the cross-lingual DeepMine dataset. However, the i-vector/HMM method outperformed the proposed APC encoder-decoder system. A fusion of the x-vector/PLDA baseline and the SID/PLDA scores prior to PID fusion further improved performance by 15 proposed approach to the x-vector system. We show that the proposed approach can leverage from large, unlabeled, data-rich domains, and learn speech patterns independent of downstream tasks. Such a system can provide competitive performance in domain-mismatched scenarios where test data is from data-scarce domains.

READ FULL TEXT
08/05/2019

Cross-lingual Text-independent Speaker Verification using Unsupervised Adversarial Discriminative Domain Adaptation

Speaker verification systems often degrade significantly when there is a...
08/03/2022

The SJTU System for Short-duration Speaker Verification Challenge 2021

This paper presents the SJTU system for both text-dependent and text-ind...
10/18/2018

Unsupervised Neural Text Simplification

The paper presents a first attempt towards unsupervised neural text simp...
11/25/2020

Vocal Tract Length Perturbation for Text-Dependent Speaker Verification with Autoregressive Prediction Coding

In this letter, we propose a vocal tract length (VTL) perturbation metho...
03/20/2020

Improving Embedding Extraction for Speaker Verification with Ladder Network

Speaker verification is an established yet challenging task in speech pr...
04/05/2019

An Unsupervised Autoregressive Model for Speech Representation Learning

This paper proposes a novel unsupervised autoregressive neural model for...
06/18/2021

Fusion of Embeddings Networks for Robust Combination of Text Dependent and Independent Speaker Recognition

By implicitly recognizing a user based on his/her speech input, speaker ...