Speaker verification (SV) offers a natural and flexible option for biometric authentication. Text-independent SV, which does not require fixed voice content as input, is both flexible and challenging. In real-world scenarios, however, the performance of a speaker verification system may degrade significantly when it is trained on one language and tested on another. Language mismatch arises in two scenarios: (i) the speaker verification system is trained on one language, but the enrollment and test data for speakers are in a second language; and (ii) the enrollment data is in one language, but the test data is in a second language. This study focuses on the first scenario, where the speaker model is trained on English data but the enrollment and test materials for speakers are in a new language, Chinese. Since it is not desirable to re-train the speaker model on a new language, the challenge is to find an alternative solution that allows an existing system to maintain performance when the enrollment and test speaker data are from a new language.
Recently, speaker representation models have moved from the commonly used i-vector model [kenny2007joint, matvejka2011full, hansen2015speaker] with a probabilistic linear discriminant analysis (PLDA) back-end [kenny2010bayesian, prince2007probabilistic] to a new paradigm: speaker embeddings trained with deep neural networks. Various speaker embeddings based on different network architectures [snyder2018x, michelsanti2017conditional], attention mechanisms [rahman2018attention, zhang2016end] [wan2018generalized, zhang2018text], noise robustness [yu2017adversarial, xia2018speaker], and training paradigms [heigold2016end, Heo:2017ci] have been proposed and have greatly improved the performance of speaker verification systems. Snyder et al. [snyder2018x] recently proposed the x-vector model, which is based on a Time-Delay Deep Neural Network (TDNN) architecture that computes speaker embeddings from variable-length acoustic segments. The x-vector model has been very successful in various speaker recognition tasks, and we use it as the baseline in this study.
However, models trained with these deep neural networks may not generalize well to datasets from other domains. To alleviate this domain mismatch problem, we can use domain adaptation methods to reduce the domain shift. One option is to compensate for the mismatch by estimating a compensation model [aronowitz2014inter, kanagasundaram2015improving, misra2018maximum, misra2018modelling] from unlabeled data and source domain data. Adversarial adaptation methods [ganin2016domain, pei2018multi, yu2017adversarial, chen2016adversarial] have also been applied to ensure that the network cannot distinguish between the distributions of training and testing examples. Wang et al. [wang2018unsupervised] proposed an unsupervised approach based on Domain Adversarial Training (DAT) to address the speaker recognition problem in domain-mismatched conditions.
In this study, we introduce the unsupervised Adversarial Discriminative Domain Adaptation (ADDA) [tzeng2017adversarial] approach, which was originally tested on image classification tasks, and adapt it to cross-lingual unsupervised adaptation for text-independent speaker verification. Because unsupervised adaptation does not require target domain labels, it greatly reduces labeling costs and can exploit the large amount of publicly available online data. Our approach requires only source data and unlabeled target domain data to learn an asymmetric mapping that adapts the target domain feature encoder to the source domain. Furthermore, ADDA uses separate encoders for the source and target domains, without assuming that the source and target domain data have a similar class distribution. We show that ADDA is more effective than other domain-adversarial methods while being considerably simpler. The source data is in English, from NIST SRE04-08, Mixer 6 and Switchboard, and the target data is in Chinese, from AISHELL-1. With ADDA adaptation, the Equal Error Rate (EER) of the x-vector system decreases from 9.331% to 7.645%, a relative reduction of 18.07%. ADDA also achieves a 12.54% relative reduction in EER compared with DAT.
We describe the ADDA approach and the corresponding baseline systems in Section 2, explain our experimental setup in Section 3, and present results and discussion in Section 4. Finally, we conclude in Section 5 and outline future work.
1.1 Related work
A number of domain adaptation approaches have been proposed to alleviate the domain shift problem. For example, Wang et al. [wang2018unsupervised] apply the DAT technique to alleviate the mismatch between i-vectors from different domains. They use a multi-task learning framework to jointly learn a shared feature extractor and two classifiers. With a gradient reversal layer in the domain classifier, the shared feature extractor can extract domain-invariant and speaker-discriminative features. In [aronowitz2014inter, kanagasundaram2015improving], the authors proposed an Inter-Dataset Variability Compensation (IDVC) technique that removes the mismatch using Nuisance Attribute Projection (NAP). First, a subspace representing the variability across the different datasets is computed; NAP is then used to remove that subspace as an i-vector pre-processing step. All of this work addressed i-vectors for speaker verification, whereas our work targets the recently proposed x-vectors and shows very promising results.
2 Speaker verification systems
2.1 The X-vector system
We use a recently proposed and successful speaker model, the X-vector [snyder2018x], to extract speaker representations, and a Probabilistic Linear Discriminant Analysis (PLDA) back-end to compare pairs of enrollment and test speaker embeddings. The X-vector model is based on a Time-Delay Deep Neural Network (TDNN) architecture that computes speaker embeddings from variable-length acoustic segments. The network consists of layers that operate on speech frames, a statistics pooling layer that aggregates over the frame-level representations, additional layers that operate at the segment level, and finally a softmax output layer. The embeddings are extracted after the statistics pooling layer.
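To illustrate how a statistics pooling layer turns a variable number of frames into a fixed-length segment-level vector, the following is a minimal sketch in plain Python. It assumes the pooled statistics are the per-dimension mean and (population) standard deviation, as in the x-vector formulation; the function name `statistics_pooling` and the toy inputs are hypothetical and not part of our system.

```python
import math

def statistics_pooling(frames):
    """Collapse a variable-length list of frame-level vectors into one
    fixed-length vector: per-dimension means followed by per-dimension stds."""
    if not frames:
        raise ValueError("need at least one frame")
    num_frames = len(frames)
    dim = len(frames[0])
    means = [sum(f[d] for f in frames) / num_frames for d in range(dim)]
    # Population standard deviation over the time axis (an assumption).
    stds = [
        math.sqrt(sum((f[d] - means[d]) ** 2 for f in frames) / num_frames)
        for d in range(dim)
    ]
    return means + stds

# Two toy "segments" of different lengths (made-up 2-dimensional frames).
short_segment = statistics_pooling([[1.0, 2.0], [3.0, 2.0]])
long_segment = statistics_pooling([[1.0, 2.0], [3.0, 2.0], [1.0, 2.0], [3.0, 2.0]])
```

Both calls return a vector of length 4 regardless of the number of input frames, which is what allows the segment-level layers to operate on utterances of arbitrary duration.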
2.2 Cross-lingual adversarial training baseline
To address the cross-lingual speaker verification problem, we first implement a Domain Adversarial Neural Network (DANN) [wang2018unsupervised] using Domain Adversarial Training (DAT) [ganin2016domain] to transfer speaker information from labeled English data to another language for which only unlabeled data exists, for example, Chinese. The DANN shown in Fig. 1 is a “Y-shaped” network with two discriminative branches: a speaker recognizer and an adversarial language classifier. Both branches take input from a shared feature extractor that aims to learn hidden representations that capture the underlying speaker information and are independent of language.
We can implement the language-independent speaker verification system under the assumption that DANN can learn features that perform well on speaker classification for both the source and target language data and that are invariant to the shift in language. This can be done by minimizing the speaker classification loss while maximizing the domain classification loss through a gradient reversal layer. DANN has two main components: 1) a speaker recognizer for the source data; 2) an adversarial language classifier that predicts a scalar indicating whether the input speech is from the source language or the target language. The two classifiers take input from the shared feature extractor, which operates on the average of the speaker embeddings. The loss function of DANN is a multi-task loss that combines the loss of the speaker classifier and the loss of the domain classifier with a weight $\lambda$. Following [ganin2016domain], training DANN consists of optimizing

$$E(\theta_f, \theta_y, \theta_d) = \frac{1}{n}\sum_{i=1}^{n} L_y^i(\theta_f, \theta_y) - \lambda\left(\frac{1}{n}\sum_{i=1}^{n} L_d^i(\theta_f, \theta_d) + \frac{1}{n'}\sum_{i=n+1}^{n+n'} L_d^i(\theta_f, \theta_d)\right),$$

where $\theta_f$, $\theta_y$ and $\theta_d$ are the parameters of the joint feature extractor and the two classifiers, and $L_y$ and $L_d$ are the prediction and the domain loss functions. $n$ and $n'$ are the number of samples of the source and target domain data, respectively. We can optimize this loss function using stochastic gradient descent to obtain the parameters

$$(\hat{\theta}_f, \hat{\theta}_y) = \arg\min_{\theta_f, \theta_y} E(\theta_f, \theta_y, \hat{\theta}_d), \qquad \hat{\theta}_d = \arg\max_{\theta_d} E(\hat{\theta}_f, \hat{\theta}_y, \theta_d).$$

Using this DAT approach, we are able to minimize the divergence between the source and target feature distributions. The learned embeddings are therefore less dependent on the shift in language.
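As a rough illustration of the multi-task loss described above, the sketch below evaluates a DANN-style objective from per-sample losses: the mean speaker loss on labeled source samples minus a weighted domain loss averaged separately over the source and target samples. The form follows the DAT formulation of [ganin2016domain]; the function name `dann_objective`, the argument `lam` (the weight) and all numbers are hypothetical and for illustration only.

```python
def dann_objective(speaker_losses, src_domain_losses, tgt_domain_losses, lam):
    """Mean speaker loss on labeled source samples minus lam times the
    domain loss, averaged separately over source and target samples."""
    speaker_term = sum(speaker_losses) / len(speaker_losses)
    domain_term = (
        sum(src_domain_losses) / len(src_domain_losses)
        + sum(tgt_domain_losses) / len(tgt_domain_losses)
    )
    return speaker_term - lam * domain_term

# Toy per-sample losses (made-up numbers).
speaker = [0.5, 1.5]   # speaker loss on labeled source (English) samples
src_dom = [0.2, 0.4]   # domain loss on source samples

# The language classifier separates the two languages fairly well ...
separable = dann_objective(speaker, src_dom, [0.6, 0.6, 0.6], lam=1.0)
# ... versus being confused on the target (Chinese) samples.
confused = dann_objective(speaker, src_dom, [2.0, 2.0, 2.0], lam=1.0)
```

With the speaker loss held equal, the objective is lower when the language classifier is more confused, which is the direction in which the feature extractor is pushed; the language classifier itself is trained in the opposite direction, and the gradient reversal layer realizes both updates in a single backward pass.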
2.3 Adversarial discriminative domain adaptation
In contrast to the DAT method, which applies a gradient reversal layer to confuse the domain classifier, we apply the Adversarial Discriminative Domain Adaptation (ADDA) approach to directly learn an asymmetric mapping, in which we modify the target model to match the source distribution. A summary of the entire training process is provided in Fig. 2. Unlike the DAT method, which uses a shared feature encoder, our proposed ADDA approach uses separate encoders for the source and target domain data. When there is a significant domain shift, the DAT method may not work well, since it inherently assumes that the source and target domain data have a similar class distribution.
We define input samples $x \in \mathcal{X}$ with data labels $y \in \mathcal{Y}$, where $\mathcal{X}$ and $\mathcal{Y}$ are the input space and output space, respectively. In our speaker verification experiments, $x$ and $y$ are x-vectors and speaker labels. The probability distribution $p(x, y)$, however, may differ between the training and evaluation datasets because of domain mismatch such as language mismatch. We denote by $p_s(x, y)$ and $p_t(x, y)$ the source domain and target domain distributions, respectively, with source samples $X_s$, source labels $Y_s$ and unlabeled target samples $X_t$. Our goal is to minimize the distance between the empirical source and target mapping distributions $M_s(X_s)$ and $M_t(X_t)$. We first learn a source mapping $M_s$, along with a source classifier $C$, and then learn to adapt the target domain encoder $M_t$ to the source domain.

We train the source classification model using a standard cross-entropy loss, defined below:

$$\min_{M_s, C} L_{cls}(X_s, Y_s) = -\mathbb{E}_{(x_s, y_s) \sim (X_s, Y_s)} \sum_{k=1}^{K} \mathbb{1}_{[k = y_s]} \log C(M_s(x_s))_k,$$

where $K$ is the number of source speaker classes.

To minimize the distance between the source and target representations, we use a domain discriminator $D$ that classifies whether a data point is drawn from the source or the target domain. We optimize $D$ using an adversarial loss $L_{adv_D}$, defined below:

$$\min_{D} L_{adv_D}(X_s, X_t, M_s, M_t) = -\mathbb{E}_{x_s \sim X_s}\left[\log D(M_s(x_s))\right] - \mathbb{E}_{x_t \sim X_t}\left[\log\left(1 - D(M_t(x_t))\right)\right].$$

The DAT method uses a gradient reversal layer [ganin2016domain] to learn the mapping by maximizing the discriminator loss directly, so its adversarial loss is $L_{adv_M} = -L_{adv_D}$. In contrast to DAT, we train the target mapping with the loss function defined below:

$$\min_{M_t} L_{adv_M}(X_s, X_t, D) = -\mathbb{E}_{x_t \sim X_t}\left[\log D(M_t(x_t))\right].$$

This objective has the same fixed-point properties as the minimax loss but provides stronger gradients to the target mapping.

We optimize these objectives in two steps. First, we train a discriminative source classification model; we use a three-layer Deep Neural Network (DNN) whose input features are x-vectors. We optimize the classification loss $L_{cls}$ over the source domain mapping function $M_s$ and classifier $C$ by training with the labeled source English data, $X_s$ and $Y_s$. Second, because we keep $M_s$ fixed while learning $M_t$, we can optimize $L_{adv_D}$ and $L_{adv_M}$ without revisiting the first objective term.
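The two adversarial losses of the second step can be sketched in plain Python as follows. This is a minimal illustration, assuming the domain discriminator ends in a sigmoid and outputs the probability that an embedding comes from the source domain; the function names and the probe value 0.01 are hypothetical. The last function compares gradient magnitudes with respect to the discriminator logit for the inverted-label loss used here and for the minimax loss used with gradient reversal.

```python
import math

def discriminator_loss(d_source, d_target):
    """Discriminator loss: source embeddings are labeled 1, target 0.
    Inputs are discriminator outputs in the open interval (0, 1)."""
    src = -sum(math.log(p) for p in d_source) / len(d_source)
    tgt = -sum(math.log(1.0 - p) for p in d_target) / len(d_target)
    return src + tgt

def target_encoder_loss(d_target):
    """Inverted-label loss: the target encoder is rewarded when the
    discriminator mistakes target embeddings for source embeddings."""
    return -sum(math.log(p) for p in d_target) / len(d_target)

def grad_magnitude_wrt_logit(p):
    """For p = sigmoid(z), return |d/dz| of the inverted-label loss -log(p)
    and of the minimax loss log(1 - p), for a single target sample."""
    return 1.0 - p, p

# Early in adaptation the discriminator easily spots target data, so its
# output on a target embedding is close to 0.
inverted_grad, minimax_grad = grad_magnitude_wrt_logit(0.01)
```

In this regime the inverted-label loss still yields a gradient magnitude close to 1, whereas the minimax loss yields one close to 0, which is one way to read the remark that the objective provides stronger gradients to the target mapping.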
Through this unsupervised adversarial discriminative domain adaptation approach, we can adapt the target encoder to the source domain. In the next section, we will present promising results on cross-lingual text-independent speaker verification tasks using ADDA.
3 Experimental setup
3.1 English Corpora
We use Speaker Recognition Evaluation (SRE) 04-08, Mixer 6, and Switchboard (SWBD) to train the x-vector model. The SRE corpus is part of the Mixer 6 project, which was designed to support the development of robust speaker recognition technology by providing carefully collected speech across numerous microphones. Switchboard is a collection of two-sided telephone conversations among thousands of speakers from all areas of the United States.
3.2 Chinese Corpora
AISHELL-1 [bu2017aishell] is a subset of the AISHELL-ASR0009 corpus, a 500-hour multi-channel Mandarin speech corpus designed for various speech and speaker processing tasks. Speech utterances are recorded at 44.1 kHz via microphones, at 16 kHz via Android phones and at 16 kHz via iPhones.
There are 360 participants in the recording, and the speakers' gender, accent, age, and birthplace are recorded as metadata. About 80 percent of the speakers are aged 16 to 25, and most come from northern China. The corpus is split into training and test sets without speaker overlap. Although the training data provides speaker labels, we neither use any of its speaker label information nor include it in training our x-vector model; we use it only for unsupervised domain adaptation. We refer to it as the AISHELL unlabeled training set.
The training set contains 120,098 utterances from 340 speakers; the test set contains 7,176 utterances from 20 speakers. For each speaker, around 360 utterances (about 26 minutes of speech in total) are released. To test our proposed unsupervised ADDA approach, we do not use any speaker labels of the training data. We train our x-vector based speaker model on the SRE04-08, Mixer 6, and Switchboard datasets, and evaluate on 143,520 trials from the Chinese AISHELL test set.
3.3 Evaluation setup
We use SRE04-08, Mixer 6 and Switchboard data to train the TDNN-based x-vector model. We follow the Kaldi SRE16 recipe to augment the training data by adding noise and reverberation. We use an energy-based VAD, and the raw features used to train the model are 23-dimensional MFCCs. Having established the x-vector system using English data, we now address the challenge of evaluating enrollment and test speakers in a mismatched language, Chinese. To accomplish this, a set of unlabeled data in the new language is needed; we use the target domain AISHELL unlabeled training data. We extract x-vectors from the source domain SRE and SWBD data and from the target domain AISHELL unlabeled data to train the adaptation network.
We train the Adversarial Domain Adaptation Network (ADAN) in two steps. First, we train a DNN encoder and classifier on SRE and SWBD x-vectors. Next, we use the pre-trained source model as the initialization for the target DNN encoder and perform adversarial adaptation to learn a target domain mapping on the AISHELL unlabeled x-vectors.
During testing, we feed the enrollment and test x-vectors of the AISHELL evaluation set to the ADDA network and extract new enrollment and test embeddings using the trained target encoder. The adapted embeddings are therefore expected to be domain-invariant, speaker-discriminative representations that lie in the same subspace. We apply mean and length normalization to the adapted embeddings. For the back-end, we train a Probabilistic Linear Discriminant Analysis (PLDA) model on the combined clean and noise-augmented SRE data, and compute log-likelihood ratio scores for the enrollment and test trials. We also perform unsupervised PLDA adaptation using Kaldi to exploit the AISHELL unlabeled data.
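The mean and length normalization step can be sketched as follows. This is a minimal illustration in plain Python, not the Kaldi implementation: the function name is hypothetical, and estimating the mean from the embeddings passed in is an assumption, since the mean may equally be estimated on a separate data set and supplied through the `mean` argument.

```python
import math

def mean_and_length_normalize(embeddings, mean=None):
    """Subtract a mean vector, then scale each embedding to unit L2 norm."""
    dim = len(embeddings[0])
    if mean is None:
        # Assumption: estimate the mean from the embeddings passed in.
        mean = [sum(e[d] for e in embeddings) / len(embeddings) for d in range(dim)]
    normalized = []
    for e in embeddings:
        centered = [e[d] - mean[d] for d in range(dim)]
        norm = math.sqrt(sum(v * v for v in centered))
        # Leave an all-zero vector unchanged to avoid dividing by zero.
        normalized.append([v / norm for v in centered] if norm > 0 else centered)
    return normalized

# Two toy 2-dimensional "embeddings" (made-up values).
adapted = mean_and_length_normalize([[3.0, 4.0], [1.0, 0.0]])
```

After this step every embedding lies on the unit sphere around the estimated mean, a common pre-processing step before PLDA scoring.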
3.4 Model configuration
For this experiment, our base architecture is a three-layer Deep Neural Network, which is fine-tuned on the source domain for 100 epochs using a batch size of 128. When training ADDA, the adversarial discriminator consists of three additional fully connected layers: two hidden layers and an adversarial discriminator output. With the exception of the output, these additional fully connected layers use a ReLU activation function. ADDA target encoder training then proceeds for another 100 epochs with a batch size of 128. For the DAT training, the shared feature encoder is a three-layer DNN. We use an Adam optimizer with a learning rate. The speaker classifier and the language classifier are two-layer DNNs. To confuse the language domain classifier, the language classifier has a gradient reversal layer. We use a multi-task loss with equal weights to combine the two cross-entropy losses.
4 Results and Discussions
In this section, we present experimental results for the x-vector, the x-vector with DAT and the x-vector with ADDA training, with and without PLDA adaptation, in Table 1. We use Linear Discriminant Analysis (LDA) to reduce all three embeddings to 256 dimensions for comparison. In addition, we concatenate the DAT embedding with the x-vector, since we find that this always performs better than the DAT embedding alone. From Table 1, we observe that our proposed method, ADDA, greatly improves the Equal Error Rate (EER) on the AISHELL test trials. After ADDA adaptation, the EER of the x-vector system decreases from 9.331% to 7.645%, a relative reduction of 18.07%. The ADDA approach also achieves a 12.54% relative improvement over the concatenated x-vector and DAT embedding. The main reason that ADDA works better may be that it uses an adversarial discriminator to adapt the target encoder to the source domain. Moreover, by initializing the target representation space with the pre-trained source model, we can effectively learn the asymmetric mapping function.
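The relative improvements quoted above can be reproduced directly from the EER values. The short sketch below does so; the helper name `relative_reduction` is hypothetical.

```python
def relative_reduction(baseline, improved):
    """Relative reduction, in percent, when a metric drops from baseline to improved."""
    return 100.0 * (baseline - improved) / baseline

# EER (%) values quoted in the text and in Table 1, without PLDA adaptation.
xvector_eer, dat_eer, adda_eer = 9.331, 8.741, 7.645

adda_vs_xvector = relative_reduction(xvector_eer, adda_eer)  # about 18.07
adda_vs_dat = relative_reduction(dat_eer, adda_eer)          # about 12.54
```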
Fig. 3 shows the Detection Error Trade-off (DET) curves of our speaker recognition system in three different settings without PLDA adaptation. From the figure, we see that after DAT or ADDA adaptation, the overall speaker verification performance improves significantly compared with the x-vector system. Further, both the False Positive Rate (FPR) and the False Negative Rate (FNR) of the ADDA embedding system are reduced by a large margin compared with the x-vector+DAT embedding system. This indicates that the ADDA embedding is more invariant to language shift.
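For reference, the EER is the operating point on the DET curve at which the FPR and FNR are equal. The sketch below estimates it from a list of target and non-target trial scores by sweeping the decision threshold; it is a simple quadratic-time illustration rather than the scoring tool used in our experiments, and the function name and the scores are made up.

```python
def equal_error_rate(target_scores, nontarget_scores):
    """Sweep a threshold over all observed scores and return the error rate
    at the point where false negative and false positive rates are closest."""
    best_gap, best_eer = None, None
    for threshold in sorted(set(target_scores) | set(nontarget_scores)):
        # A trial is accepted when its score is at or above the threshold.
        fnr = sum(s < threshold for s in target_scores) / len(target_scores)
        fpr = sum(s >= threshold for s in nontarget_scores) / len(nontarget_scores)
        gap = abs(fnr - fpr)
        if best_gap is None or gap < best_gap:
            best_gap, best_eer = gap, (fnr + fpr) / 2.0
    return best_eer

# Made-up scores; higher means "more likely the same speaker".
targets = [2.0, 1.5, 0.4, 1.1]
nontargets = [-1.0, 0.5, -0.3, 0.1]
eer = equal_error_rate(targets, nontargets)  # 0.25 for these scores
```

For these toy scores, the threshold 0.5 rejects one of four target trials and accepts one of four non-target trials, so both error rates equal 25%.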
| System | EER (%) | |
|---|---|---|
| x-vector + DAT | 8.741 | 0.7475 |
| x-vector + PLDA adaptation | 9.162 | 0.7095 |
| x-vector + DAT + PLDA adaptation | 7.799 | 0.6989 |
| ADDA embedding + PLDA adaptation | 7.504 | 0.7062 |
4.2 Visualization of speaker embeddings
To investigate the effect that ADDA has on speaker verification, we further assess the quality of the learned speaker features. Using t-SNE [maaten2008visualizing], we plot the embeddings after LDA for the same K speakers of the AISHELL test set. The results are presented in Fig. 4: Fig. 4 (a) visualizes the x-vectors, and Fig. 4 (b) visualizes the ADDA embeddings. It can be seen that the ADDA embeddings are better at separating different speakers. For the x-vectors, in contrast, we observe that some utterances from different speakers are grouped together and are not well separated in the embedding space. Speaker “0764” is also difficult to separate from speaker “0765” with either method, probably because these two speakers have very similar speaker characteristics.
4.3 Clustering analysis
To quantitatively analyze the quality of the adapted speaker representations, we also perform clustering on the adapted embeddings. Since t-SNE does not preserve the distance information that most clustering algorithms require, we perform K-means clustering on the LDA-transformed x-vectors and ADDA embeddings rather than on their t-SNE projections. Given the ground truth speaker labels, we compute the Normalized Mutual Information (NMI) [vinh2010information] of the K-means cluster assignment. NMI is a metric that measures the agreement between the ground truth labels and the clustering results. The NMI score of the x-vectors is 0.787, and the NMI score of the ADDA embeddings is 0.802, which is 1.9% higher in relative terms. This result is consistent with the t-SNE visualization. We therefore conclude that ADDA adaptation yields speaker embeddings that are more speaker-discriminative and more language-independent.
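As an illustration of the metric, the sketch below computes NMI between ground truth speaker labels and a cluster assignment in plain Python. Several normalizations of mutual information exist; this sketch assumes normalization by the arithmetic mean of the two entropies, which may differ from the variant behind the scores reported above. The function name and the toy labels are hypothetical.

```python
import math
from collections import Counter

def normalized_mutual_information(labels_true, labels_pred):
    """NMI between two labelings, normalized by the arithmetic mean of the
    two entropies (an assumption; other normalizations are also in use)."""
    n = len(labels_true)
    count_true = Counter(labels_true)
    count_pred = Counter(labels_pred)
    count_joint = Counter(zip(labels_true, labels_pred))

    def entropy(counts):
        return -sum((c / n) * math.log(c / n) for c in counts.values())

    # Mutual information summed over the non-empty cells of the contingency table.
    mutual_info = sum(
        (c / n) * math.log(c * n / (count_true[t] * count_pred[p]))
        for (t, p), c in count_joint.items()
    )
    h_true, h_pred = entropy(count_true), entropy(count_pred)
    if h_true == 0.0 and h_pred == 0.0:
        return 1.0  # both labelings consist of a single cluster
    return mutual_info / ((h_true + h_pred) / 2.0)

speakers = ["A", "A", "B", "B"]
perfect = normalized_mutual_information(speakers, [1, 1, 0, 0])    # same partition, renamed
unrelated = normalized_mutual_information(speakers, [0, 1, 0, 1])  # independent of speaker
```

A clustering that recovers the speaker partition exactly scores 1 regardless of how the clusters are numbered, while a clustering that is statistically independent of the speaker labels scores 0.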
5 Conclusions and Future Work
In this paper, we presented a discriminative adversarial unsupervised adaptation method. By exploring how to alleviate the domain mismatch problem in an English-Chinese cross-lingual speaker verification task, we showed that our proposed unsupervised ADDA approach performs well on speaker verification for the target domain data. Additional data analysis indicated that the representations learned via ADDA are well separated and are less dependent on the shift in language.
In the future, we would like to investigate the influence of phonetic content on cross-lingual text-independent speaker verification. We intend to use a phoneme decoder to analyze the linguistic factor of speaker models.