Speaker recognition is the task of recognizing a person from his or her voice, given a small amount of speech from that speaker [1]. Recent progress has shown the successful application of deep neural networks to derive deep speaker embeddings from speech utterances [2, 3]. Analogous to word embeddings [4, 5], a speaker embedding is a fixed-length, continuous-valued vector that provides a succinct characterization of a speaker’s voice as rendered in a speech utterance. Similar to the classical i-vectors [6], deep speaker embeddings live in a simpler Euclidean space in which distances can be measured easily, compared to the much more complex input patterns. Techniques such as within-class covariance normalization (WCCN) [7], linear discriminant analysis (LDA) [8], and probabilistic LDA (PLDA) [9, 10, 11] can then be applied.
Systems comprising x-vector (or i-vector) speaker embeddings followed by PLDA have shown state-of-the-art performance on speaker verification tasks. Training an x-vector PLDA system typically requires over a hundred hours of speaker-labeled training data, and the training set must contain multiple recordings of each speaker under different settings (recording devices, transmission channels, noise, reverberation, etc.). These knowledge sources contribute to the robustness of the system against such nuisance factors. The challenging problem of domain mismatch arises when a speaker recognition system is used in a domain (e.g., a different language or demographic) that differs from that of the training data, in which case its performance degrades considerably.
It is impractical to re-train the system for each and every domain, as collecting large labeled datasets is expensive and time consuming. A more viable solution is to adapt the already trained model using a smaller, and possibly unlabeled, set of in-domain data. Domain adaptation can be accomplished at different stages of the x-vector (or i-vector) PLDA pipeline. PLDA adaptation is preferable in practice, since the same feature extraction and speaker embedding front-end can be reused while domain-adapted PLDA backends cater for the conditions of each specific deployment.
PLDA adaptation involves adapting its mean vector and covariance matrices. (Mean shift due to domain mismatch can be handled by centering the datasets to a common origin.) In the case of unsupervised adaptation (i.e., no labels are given), the major challenge is how to adapt the within- and between-class covariance matrices given that only the total covariance matrix can be estimated directly from the in-domain data. In this paper, we show that this can be accomplished by applying a principle similar to that of feature-based correlation alignment (CORAL), from which pseudo-in-domain within- and between-class covariance matrices can be computed. We further improve the robustness by introducing additional adaptation parameters and regularization into the adaptation equation. The proposed unsupervised adaptation method is referred to as CORAL+.
2 Domain adaptation of PLDA
This section presents a brief description of probabilistic linear discriminant analysis (PLDA), which is widely used in state-of-the-art speaker recognition systems. We then draw attention to the domain mismatch issue and to how the correlation alignment (CORAL) [14, 15] technique deals with it via feature transformation.
2.1 Probabilistic LDA
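Let the vector $\phi$ be a speaker embedding (e.g., an x-vector or i-vector). PLDA assumes that $\phi$ is generated from a linear Gaussian model:

$$p(\phi \mid h, x) = \mathcal{N}(\phi \mid \mu + Fh + Gx, \Sigma)$$

The vector $\mu$ represents the global mean, while $F$ and $G$ are the speaker and channel loading matrices, and the diagonal matrix $\Sigma$ models the residual variances. The variables $h$ and $x$ are the latent speaker and channel variables, respectively. A PLDA model is essentially a Gaussian distribution in the speaker embedding space. This can be seen more clearly in the form of the marginal density:

$$p(\phi) = \mathcal{N}(\phi \mid \mu, \Phi_b + \Phi_w) \tag{2}$$

The main idea here is to account for the speaker and channel variability with a between-class covariance matrix $\Phi_b$ and a within-class covariance matrix $\Phi_w$:

$$\Phi_b = FF^T, \qquad \Phi_w = GG^T + \Sigma$$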
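As a short worked step, the marginal covariance can be derived from the generative model. The sketch below assumes the standard PLDA priors, which the text does not state explicitly: the latent variables $h$ and $x$ are independent, zero-mean, unit-covariance Gaussians, and the residual $\epsilon$ has covariance $\Sigma$. The symbols $\phi$, $F$, $G$, $\Phi_b$ and $\Phi_w$ are notation chosen here for the embedding, the two loading matrices, and the between- and within-class covariance matrices.

```latex
\begin{aligned}
\phi &= \mu + Fh + Gx + \epsilon,
  \qquad h \sim \mathcal{N}(0, I),\; x \sim \mathcal{N}(0, I),\; \epsilon \sim \mathcal{N}(0, \Sigma) \\
\mathrm{cov}(\phi) &= F\,\mathrm{cov}(h)\,F^T + G\,\mathrm{cov}(x)\,G^T + \mathrm{cov}(\epsilon) \\
  &= \underbrace{FF^T}_{\Phi_b} + \underbrace{GG^T + \Sigma}_{\Phi_w}
\end{aligned}
```

Under these assumptions the speaker term contributes the between-class part and everything else (channel plus residual) contributes the within-class part, which is why the marginal density is a single Gaussian with covariance $\Phi_b + \Phi_w$.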
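In a speaker verification task, the PLDA model serves as a backend classifier. For a given pair of enrollment and test utterances, i.e., their speaker embeddings $\phi_1$ and $\phi_2$, we compute the log-likelihood ratio score

$$s(\phi_1, \phi_2) = \log \frac{p(\phi_1, \phi_2)}{p(\phi_1)\, p(\phi_2)}$$

corresponding to the hypothesis test of whether the two embeddings belong to the same speaker or to different speakers. The denominator is evaluated by substituting $\phi_1$ and $\phi_2$ in turn into (2). The numerator is computed using

$$p(\phi_1, \phi_2) = \mathcal{N}\left( \begin{bmatrix} \phi_1 \\ \phi_2 \end{bmatrix} \,\middle|\, \begin{bmatrix} \mu \\ \mu \end{bmatrix}, \begin{bmatrix} \Sigma_{\mathrm{tot}} & \Phi_b \\ \Phi_b & \Sigma_{\mathrm{tot}} \end{bmatrix} \right)$$

where $\mu$ is the global mean, $\Phi_b$ and $\Phi_w$ are the between- and within-class covariance matrices, and $\Sigma_{\mathrm{tot}} = \Phi_b + \Phi_w$ is the total covariance matrix. The assumption is that the unseen data follow the same distribution as that given by the within- and between-class covariance matrices derived from the training set (i.e., the dataset used to train the PLDA). A problem arises when the training set is drawn from a domain (out-of-domain) different from that of the enrollment and test utterances (in-domain).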
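The scoring rule described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function names and the arguments `mu`, `phi_b` and `phi_w` (global mean, between-class and within-class covariance matrices) are hypothetical, and the joint density is assumed to be the standard two-covariance form in which the two embeddings share the speaker term, so their cross-covariance is the between-class matrix.

```python
import numpy as np

def gaussian_logpdf(x, mean, cov):
    """Log-density of a multivariate Gaussian, computed via a Cholesky factor."""
    dim = x.shape[0]
    chol = np.linalg.cholesky(cov)
    z = np.linalg.solve(chol, x - mean)            # z = L^{-1} (x - mean)
    log_det = 2.0 * np.sum(np.log(np.diag(chol)))  # log|cov|
    return -0.5 * (dim * np.log(2.0 * np.pi) + log_det + z @ z)

def plda_llr(phi1, phi2, mu, phi_b, phi_w):
    """Log-likelihood ratio: same-speaker versus different-speaker hypothesis."""
    total = phi_b + phi_w
    # Same speaker: the pair is jointly Gaussian with cross-covariance phi_b.
    joint_mean = np.concatenate([mu, mu])
    joint_cov = np.block([[total, phi_b], [phi_b, total]])
    log_same = gaussian_logpdf(np.concatenate([phi1, phi2]), joint_mean, joint_cov)
    # Different speakers: the two embeddings are independent draws from the marginal.
    log_diff = gaussian_logpdf(phi1, mu, total) + gaussian_logpdf(phi2, mu, total)
    return log_same - log_diff
```

In practice the score is usually evaluated in a closed quadratic form for speed; the direct evaluation here is chosen only because it mirrors the ratio of densities term by term.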
2.2 Correlation Alignment
Correlation alignment (CORAL) [14] aims to align the second-order statistics, i.e., covariance matrices, of the out-of-domain (OOD) features to match those of the in-domain (InD) features. No class (i.e., speaker) labels are used, and it therefore belongs to the class of unsupervised adaptation techniques. The algorithm consists of two steps, namely whitening followed by re-coloring. Let $C_o$ and $C_I$ be the covariance matrices of the OOD and InD data, respectively. Denoting an OOD vector by $\phi$, domain adaptation is performed by first whitening and then re-coloring, as follows:

$$\phi' = C_I^{1/2}\, C_o^{-1/2}\, \phi$$

where $C_o^{-1/2} = Q_o \Lambda_o^{-1/2} Q_o^T$ whitens the input vector, and $C_I^{1/2} = Q_I \Lambda_I^{1/2} Q_I^T$ does the re-coloring. Here, $Q$ and $\Lambda$ are the eigenvector and eigenvalue matrices of the corresponding covariance matrix. (The whitening and re-coloring procedures are better known as the zero-phase component analysis (ZCA) transformation [17]. As opposed to principal component analysis (PCA) and Cholesky whitening (and re-coloring), ZCA maximally preserves the similarity of the transformed features to the original space.) Such a simple and “frustratingly easy” approach has been shown to outperform a more complicated non-linear transformation reported in [18]. In [15], CORAL is performed on the OOD x-vector (or i-vector) embeddings, and the transformed vectors (pseudo in-domain) are used to re-train the PLDA. Note that the speaker labels of the OOD training data remain the same.
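The whitening and re-coloring steps can be sketched as follows. This is an illustrative NumPy version under stated assumptions: embeddings are stored as rows, the covariance matrices are estimated with `np.cov`, and the helper names (`sym_matrix_power`, `coral_transform`, `coral_adapt`) and the small eigenvalue floor `eps` are hypothetical choices, not part of the original method description.

```python
import numpy as np

def sym_matrix_power(cov, power, eps=1e-10):
    """Symmetric (ZCA-style) matrix power via eigenvalue decomposition."""
    eigval, eigvec = np.linalg.eigh(cov)
    eigval = np.clip(eigval, eps, None)   # guard against tiny or negative eigenvalues
    return (eigvec * eigval ** power) @ eigvec.T

def coral_transform(cov_ood, cov_ind):
    """Matrix that whitens with the OOD covariance, then re-colors with the InD one."""
    return sym_matrix_power(cov_ind, 0.5) @ sym_matrix_power(cov_ood, -0.5)

def coral_adapt(x_ood, x_ind):
    """Map OOD row vectors so that their covariance matches the in-domain data."""
    cov_ood = np.cov(x_ood, rowvar=False)
    cov_ind = np.cov(x_ind, rowvar=False)
    a = coral_transform(cov_ood, cov_ind)
    return x_ood @ a.T                    # applies phi -> A phi to every row
```

After `coral_adapt`, the sample covariance of the transformed OOD vectors equals the in-domain sample covariance up to numerical precision, while the speaker labels attached to the OOD rows are untouched.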
3 The CORAL+ Algorithm
CORAL is a feature-based domain adaptation technique [14]. We propose integrating CORAL into PLDA, leading to a model-based domain adaptation.
3.1 Domain adaptation
It is well known that a linear transformation of a normally distributed vector leads to an equivalent transformation of the mean vector and covariance matrix of its density function. Let $A = C_I^{1/2} C_o^{-1/2}$ be the CORAL transformation matrix, where $C_o$ and $C_I$ are the OOD and in-domain covariance matrices, and let $\phi' = A\phi$ be the transformed vector. The covariance matrix of the pseudo in-domain vector is given by

$$\mathrm{cov}(A\phi) = A\,(\Phi_b + \Phi_w)\,A^T = A\Phi_b A^T + A\Phi_w A^T$$

Here, we have considered a PLDA trained on OOD data with a total covariance matrix given by the sum of the between-class covariance matrix $\Phi_b$ and the within-class covariance matrix $\Phi_w$, as noted in Section 2.1. The above equation shows that training a PLDA on the transformed vectors $A\phi$, as proposed in [15], is equivalent to transforming the within-class, between-class, and total covariance matrices of a PLDA trained on OOD data.
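The equivalence stated above can be checked numerically on toy data. The sketch below is only an illustration: it uses simple scatter-matrix estimates of the between- and within-class covariances as a stand-in for full PLDA training, and the function name `class_covariances`, the toy data, and the random matrix `a` are all hypothetical.

```python
import numpy as np

def class_covariances(x, labels):
    """Between- and within-class covariance estimates from labeled row vectors.

    Simple scatter-matrix estimates; a stand-in for full PLDA training.
    """
    mu = x.mean(axis=0)
    dim = x.shape[1]
    between = np.zeros((dim, dim))
    within = np.zeros((dim, dim))
    for spk in np.unique(labels):
        xs = x[labels == spk]
        ms = xs.mean(axis=0)
        between += len(xs) * np.outer(ms - mu, ms - mu)
        within += (xs - ms).T @ (xs - ms)
    return between / len(x), within / len(x)

# Toy data: 20 hypothetical speakers, 10 vectors each, 5 dimensions.
rng = np.random.default_rng(0)
labels = np.repeat(np.arange(20), 10)
x = rng.normal(size=(20, 5))[labels] + 0.5 * rng.normal(size=(200, 5))
a = rng.normal(size=(5, 5))   # any linear map; CORAL would supply a specific one

b_ood, w_ood = class_covariances(x, labels)
b_new, w_new = class_covariances(x @ a.T, labels)   # re-estimate on transformed data

# Re-estimating on the transformed vectors matches transforming the estimates directly.
print(np.allclose(b_new, a @ b_ood @ a.T), np.allclose(w_new, a @ w_ood @ a.T))
```

The check holds for these moment-based estimates because they are quadratic in the centered data. For a PLDA trained with an iterative procedure, the same relation is what the argument in the text predicts, but it is not verified by this toy example.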
3.2 Model-level adaptation
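Instead of replacing the covariance matrices in an OOD PLDA with the pseudo in-domain matrices, model-level adaptation allows us to consider their interpolation:

$$\Phi_b^+ = (1-\gamma)\,\Phi_b + \gamma\, A\Phi_b A^T, \qquad \Phi_w^+ = (1-\beta)\,\Phi_w + \beta\, A\Phi_w A^T$$

where $\Phi_b$ and $\Phi_w$ are the between- and within-class covariance matrices of the OOD PLDA, $A$ is the CORAL transformation matrix, and $\{\gamma, \beta\}$ are the adaptation parameters, constrained to lie between zero and one. Notice that the first term on the right-hand side of each equation is the OOD between/within covariance matrix, while the second term is the pseudo-in-domain covariance matrix. For clarity, we further simplify the adaptation equations as follows:

$$\Phi_b^+ = \Phi_b + \gamma\,(A\Phi_b A^T - \Phi_b), \qquad \Phi_w^+ = \Phi_w + \beta\,(A\Phi_w A^T - \Phi_w) \tag{8}$$

The second term on the right-hand side of each equation represents the new information seen in the in-domain data, to be added to the PLDA model.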
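The interpolation described above can be written as a one-line update. The sketch below is illustrative only: `phi_ood` stands for either the between- or the within-class covariance matrix of the OOD PLDA, `a` for the CORAL transformation matrix, and `weight` for the corresponding adaptation parameter; these names are assumptions made for the example.

```python
import numpy as np

def interpolate_covariance(phi_ood, a, weight):
    """Return phi_ood + weight * (A phi_ood A^T - phi_ood), with weight in [0, 1]."""
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie between zero and one")
    phi_pseudo = a @ phi_ood @ a.T        # pseudo-in-domain covariance matrix
    return phi_ood + weight * (phi_pseudo - phi_ood)
```

With `weight = 0` the OOD matrix is returned unchanged, and with `weight = 1` the result is the pseudo-in-domain matrix, i.e., the model that feature-level CORAL followed by re-training would aim at. The function would be called once for the between-class matrix and once for the within-class matrix, each with its own weight.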
3.3 Regularized adaptation
The central idea of domain adaptation is to propagate the uncertainty seen in the in-domain data to the PLDA model. The adaptation equations in (8) do not guarantee that the variances, and therefore the uncertainty, increase. In this section, we achieve this goal in a transformed space in which the OOD and pseudo-in-domain matrices are simultaneously diagonalized.
Let $\Phi_o$ denote the OOD (between- or within-class) covariance matrix and $\Phi_I = A\Phi_o A^T$ its pseudo-in-domain counterpart, where $A$ is the CORAL transformation matrix. Let $B$ be a nonsingular matrix such that $B^T \Phi_o B = I$ and $B^T \Phi_I B = E$, where $E$ is a diagonal matrix. This procedure is referred to as simultaneous diagonalization. The transformation matrix $B$ is obtained by performing eigenvalue decomposition (EVD) twice: first on the matrix $\Phi_o$, and then on $\Phi_I$ after the first transformation has been applied. The procedure is illustrated in Algorithm 1.
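The two-step eigenvalue decomposition can be sketched as below. This is a minimal NumPy illustration rather than a reproduction of Algorithm 1: the function name and the argument names `phi_ood` and `phi_ind` (the OOD matrix and its pseudo-in-domain counterpart) are hypothetical, and `phi_ood` is assumed to be symmetric positive definite so that it can be whitened.

```python
import numpy as np

def simultaneous_diagonalization(phi_ood, phi_ind):
    """Return (B, e) with B.T @ phi_ood @ B = I and B.T @ phi_ind @ B = diag(e)."""
    # First EVD: whiten the OOD matrix, P = Q Lambda^{-1/2}.
    lam, q = np.linalg.eigh(phi_ood)
    p = q / np.sqrt(lam)                  # scales column j of Q by lam[j] ** -0.5
    # Second EVD: diagonalize the pseudo-in-domain matrix in the whitened space.
    e, r = np.linalg.eigh(p.T @ phi_ind @ p)
    # R is orthogonal, so B = P R keeps the OOD matrix at identity.
    return p @ r, e
```

The returned `e` holds the diagonal of the second matrix in the transformed space. An entry above one marks a direction in which the pseudo-in-domain variance exceeds the OOD variance, and an entry below one marks the opposite.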
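By applying the simultaneous diagonalization to (8), the following adaptation is obtained:

$$\Phi_b^+ = \Phi_b + \gamma\, B_b^{-T}(E_b - I)\,B_b^{-1}, \qquad \Phi_w^+ = \Phi_w + \beta\, B_w^{-T}(E_w - I)\,B_w^{-1} \tag{9}$$

where, for each of the between- and within-class cases, $B$ simultaneously diagonalizes the OOD covariance matrix (to the identity $I$) and its pseudo-in-domain counterpart (to the diagonal matrix $E$). As before, the between- and within-class covariance matrices are adapted separately. Notice that the term $(E - I)$ will end up with negative variances if any diagonal element of $E$ is less than one. We propose the following regularized adaptation:

$$\Phi_b^+ = \Phi_b + \gamma\, B_b^{-T}\max(0, E_b - I)\,B_b^{-1}, \qquad \Phi_w^+ = \Phi_w + \beta\, B_w^{-T}\max(0, E_w - I)\,B_w^{-1} \tag{10}$$

The $\max(\cdot)$ operator ensures that the variance increases. We refer to the regularized adaptation in (10) as the CORAL+ algorithm, while (9) corresponds to the CORAL+ algorithm without regularization. Algorithm 1 summarizes the CORAL+ algorithm. Figure 1 shows a plot of the diagonal elements of the term $(E - I)$ in (10). Entries with negative variances are removed automatically by the $\max(\cdot)$ operator, which ensures that the uncertainty increases (or stays the same) in the adaptation process.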
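A compact sketch of the regularized update is given below. It is one reading of the description above, not the authors' code: it assumes that the adapted matrix is the OOD matrix plus a weighted term of the form $B^{-T}\max(0, E - I)B^{-1}$, where $B$ and the diagonal $E$ come from simultaneously diagonalizing the OOD matrix and its pseudo-in-domain counterpart. The function and argument names are hypothetical, and `phi_ood` is assumed to be symmetric positive definite.

```python
import numpy as np

def coral_plus_adapt(phi_ood, a, weight, regularize=True):
    """Adapt one covariance matrix (between- or within-class) of an OOD PLDA.

    phi_ood : OOD covariance matrix (symmetric positive definite).
    a       : CORAL transformation matrix (re-coloring times whitening).
    weight  : adaptation parameter between zero and one.
    """
    phi_ind = a @ phi_ood @ a.T                   # pseudo-in-domain matrix
    # Simultaneous diagonalization via two eigenvalue decompositions.
    lam, q = np.linalg.eigh(phi_ood)
    p = q / np.sqrt(lam)                          # P.T @ phi_ood @ P = I
    e, r = np.linalg.eigh(p.T @ phi_ind @ p)      # diagonal of E
    b_inv = np.linalg.inv(p @ r)
    delta = e - 1.0                               # diagonal of (E - I)
    if regularize:
        delta = np.maximum(delta, 0.0)            # keep only variance increases
    # B^{-T} diag(delta) B^{-1}, added to the OOD matrix with the given weight.
    return phi_ood + weight * (b_inv.T * delta) @ b_inv
```

With `regularize=False` the result coincides with the plain interpolation between the OOD and pseudo-in-domain matrices. With regularization, directions in which the in-domain data show less variance than the OOD model are left untouched, so the adapted matrix minus the OOD matrix is positive semi-definite.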
4 Experiments
Experiments were conducted on the recent SRE’16 and SRE’18 datasets. Performance was measured in terms of equal error rate (EER) and minimum detection cost (MinCost) [19, 20]. The latest SREs organized by NIST have focused on domain mismatch as one of the technical challenges. In both SRE’16 and SRE’18, the training set consists primarily of English speech corpora collected over multiple years in North America. This dataset encompasses Switchboard, Fisher, and the MIXER corpora used in SREs 04–06, 08, 10, and 12. The enrollment and test segments are in Tagalog and Cantonese for SRE’16, and in Tunisian Arabic for SRE’18. Domain adaptation was performed using the unlabeled subsets provided for the evaluation.
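For readers unfamiliar with the first metric, the EER is the operating point at which the miss rate and the false-alarm rate are equal. The sketch below is a simple illustration and not the official NIST scoring tool: the function name is hypothetical, ties between scores are not treated specially, and the EER is approximated at the threshold where the two error rates are closest. MinCost is omitted because it depends on cost parameters defined in the evaluation plans, which are not given in the text.

```python
import numpy as np

def equal_error_rate(target_scores, nontarget_scores):
    """Approximate EER from same-speaker (target) and different-speaker scores."""
    scores = np.concatenate([target_scores, nontarget_scores])
    labels = np.concatenate([np.ones(len(target_scores)),
                             np.zeros(len(nontarget_scores))])
    labels = labels[np.argsort(scores)]
    # Threshold just above the i-th sorted score: trials 0..i are rejected.
    miss = np.cumsum(labels) / len(target_scores)
    false_alarm = 1.0 - np.cumsum(1.0 - labels) / len(nontarget_scores)
    idx = np.argmin(np.abs(miss - false_alarm))
    return 0.5 * (miss[idx] + false_alarm[idx])
```

The function returns a fraction, so it would be multiplied by 100 to obtain a percentage.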
The enrollment utterances have a nominal duration of 60 seconds, while the test duration ranges from 10 to 60 seconds. We used x-vector speaker embeddings, which have been shown to be very effective for speaker verification over short utterances. (Recent results show that the i-vector is more effective for longer utterances of over 2 minutes.) The x-vector extractor follows the same configuration and was trained using the same setup as the Kaldi recipe (https://github.com/kaldi-asr/kaldi/tree/master/egs/sre16/v2). A slight difference here is that we used an attention model in the pooling layer and extended the data augmentation.
In our experiments, the dimension of the x-vector was 512. As is common in most state-of-the-art systems, LDA was used to reduce the dimensionality. We investigated the cases of 150- and 200-dimensional x-vectors after LDA projection. The CORAL transformation was applied to the raw x-vectors before LDA. The transformed and then projected x-vectors were used to train a PLDA for the CORAL PLDA baseline. It is worth noting that the LDA projection matrix was computed from the raw x-vectors, from which the CORAL transformation was also derived. We find that this gives better performance than that reported in [15].
The proposed CORAL+ is a model-based adaptation technique. Domain adaptation is achieved by adapting the parameters (i.e., covariance matrices) of the OOD PLDA, as in (10) and Algorithm 1, using the unlabeled in-domain dataset. The adaptation parameters were set empirically to in the experiments. Tables 1 and 2 show the performance of the baseline PLDA model trained on the out-of-domain English dataset (OOD PLDA), the PLDA trained on x-vectors that have been adapted using CORAL (CORAL PLDA), and the OOD PLDA adapted to the in-domain data with the CORAL+ algorithm (CORAL+ PLDA). Also shown in the tables is the CORAL+ adaptation without regularization (w/o reg). This corresponds to using (9) in place of (10) in Algorithm 1.
The results on both SRE’16 and SRE’18 show consistent improvement of CORAL+ PLDA over the OOD PLDA baseline. The relative improvement amounts to and reduction in EER, and and reduction in MinCost on SRE’16 and SRE’18, respectively, for LDA dimension of . Also shown in the tables is an unsupervised adaptation method implemented in Kaldi (Kaldi PLDA). The proposed CORAL+ PLDA consistently outperforms this baseline on both SRE’16 and SRE’18, though the improvement is more apparent on SRE’18. At LDA dimension of 200, the relative improvement amounts to reduction in EER, and reduction in MinCost on SRE’18.
Compared to the feature-based CORAL (CORAL PLDA), the benefit of CORAL+ (CORAL+ PLDA) is more apparent on SRE’18. We obtained a relative reduction of in EER and in MinCost at LDA dimension of . It is worth mentioning that SRE’16 has an unlabeled set of about the same size as that of SRE’18. Nevertheless, the SRE’18 unlabeled set exhibits less variability (speaker and channel). This also explains the benefit of regularized adaptation on SRE’18, where only a smaller and more constrained unlabeled dataset is available for domain adaptation.
5 Conclusion
We have presented the CORAL+ algorithm for unsupervised adaptation of the PLDA backend to deal with the domain mismatch issue in practical applications. Similar to the feature-based correlation alignment (CORAL) technique, CORAL+ domain adaptation is accomplished by matching the out-of-domain statistics to those of the in-domain data. We show that statistics matching can be applied directly to the PLDA model. We further improve the robustness by introducing additional adaptation parameters and regularization into the adaptation equation. The proposed method shows significant improvement over the PLDA baseline. The results also show the benefit of model-based adaptation, especially when the data available for adaptation are relatively small and constrained.
|EER (%)||MinCost||EER (%)||MinCost|
|EER (%)||MinCost||EER (%)||MinCost|
[1] J. H. L. Hansen and T. Hasan, “How humans and machines recognize voices: a tutorial review,” IEEE Signal Processing Magazine, vol. 32, no. 6, pp. 74–99, 2015.
[2] D. Snyder, D. Garcia-Romero, D. Povey, and S. Khudanpur, “Deep neural network embeddings for text-independent speaker verification,” in Proc. Interspeech, 2017, pp. 999–1003.
[3] E. Variani, X. Lei, E. McDermott, I. L. Moreno, and J. Gonzalez-Dominguez, “Deep neural networks for small footprint text-dependent speaker verification,” in Proc. ICASSP, 2014, pp. 4052–4056.
[4] Y. Bengio, R. Ducharme, and P. Vincent, “A neural probabilistic language model,” in NIPS, 2000.
[5] T. Mikolov, I. Sutskever, K. Chen, G. Corrado, and J. Dean, “Distributed representations of words and phrases and their compositionality,” in NIPS, 2013.
[6] N. Dehak, P. Kenny, R. Dehak, P. Dumouchel, and P. Ouellet, “Front end factor analysis for speaker verification,” IEEE Transactions on Audio, Speech and Language Processing, vol. 19, no. 4, pp. 788–798, 2010.
[7] A. O. Hatch, S. Kajarekar, and A. Stolcke, “Within-class covariance normalization for SVM-based speaker recognition,” in Proc. Interspeech, 2006, pp. 1471–1474.
[8] C. Bishop, Pattern recognition and machine learning, Springer, New York, 2006.
[9] S. J. D. Prince and J. H. Elder, “Probabilistic linear discriminant analysis for inferences about identity,” in Proc. ICCV, 2007, pp. 1–8.
[10] S. Ioffe, “Probabilistic linear discriminant analysis,” in Proc. ECCV, Part IV, LNCS 3954, 2006, pp. 531–542.
[11] P. Kenny, “Bayesian speaker verification with heavy-tailed priors,” in Proc. Odyssey: Speaker and Language Recognition Workshop, 2010.
[12] D. Snyder, D. Garcia-Romero, G. Sell, D. Povey, and S. Khudanpur, “X-vectors: Robust DNN embeddings for speaker recognition,” in Proc. ICASSP, 2018, pp. 5329–5333.
[13] K. A. Lee, V. Hautamaki, T. Kinnunen, et al., “The I4U mega fusion and collaboration for NIST speaker recognition evaluation 2016,” in Proc. Interspeech, 2017, pp. 1328–1332.
[14] B. Sun, J. Feng, and K. Saenko, “Return of frustratingly easy domain adaptation,” in Proc. AAAI, vol. 6, 2016, p. 8.
[15] J. Alam, G. Bhattacharya, and P. Kenny, “Speaker verification in mismatched conditions with frustratingly easy domain adaptation,” in Proc. Odyssey, 2018, pp. 176–180.
[16] S. J. D. Prince, Computer vision: models, learning, and inference, Cambridge University Press, 2012.
[17] A. Kessy, A. Lewin, and K. Strimmer, “Optimal whitening and decorrelation,” The American Statistician, vol. 2018, pp. 1–6, 2018.
[18] W. Lin, M.-W. Mak, L. Li, and J.-T. Chien, “Reducing domain mismatch by maximum mean discrepancy based autoencoders,” in Proc. Odyssey, 2018, pp. 162–167.
[19] National Institute of Standards and Technology, “NIST 2016 Speaker Recognition Evaluation Plan,” NIST SRE, 2016.
[20] National Institute of Standards and Technology, “NIST 2018 Speaker Recognition Evaluation Plan,” NIST SRE, 2018.
[21] K. Okabe, T. Koshinaka, and K. Shinoda, “Attentive statistics pooling for deep speaker embedding,” in Proc. Interspeech, 2018, pp. 2252–2256.