1 Introduction
Speaker verification (SV) aims to authenticate a person based on his or her voice samples [1, 2]. The factor analysis approaches for SV led to a new era with their high performance [3, 4]. Later, the total variability model based i-vector system became a benchmark for SV studies in the current decade [5]. Recently, deep neural network (DNN) based systems have been the focus of the research community. Due to their good performance, DNN-based systems have been incorporated in most submissions to the latest NIST SRE challenge [6, 7, 8].
The initial attempts with DNNs for SV were made in the context of i-vector speaker modeling, in terms of computing phonetic posteriors [9, 10]. Alternative approaches extract bottleneck features from DNN acoustic models that are combined with the acoustic features [11, 12]. However, such approaches require a large amount of transcribed data and may not be as effective on out-of-domain data [11]. This led to the exploration of end-to-end DNN systems for SV that learn the speaker models in a discriminative manner [13, 14, 15, 16, 17].
The recent work in this direction focuses on speaker embeddings that are scored with a probabilistic linear discriminant analysis (PLDA) based backend [18, 19]. Such systems give comparable or better results than those obtained with i-vector speaker modeling. Further, they have proven to be very effective under short-duration utterance scenarios [20]. A study on the score-level fusion of i-vector and embedding based systems [18] showed that the fused system outperforms the individual systems due to their complementary characteristics. Later on, the robustness of x-vectors was further explored by applying data augmentation [21].
Another strategy to improve x-vector based SV is to include some input from generative models, as their fusion has been found promising. The embedding process is discriminative in nature, whereas the i-vector framework is a generative model. Specifically, x-vector extraction is achieved by training a DNN to discriminate among different output labels, while the i-vector model relies on a universal background model (UBM) to collect sufficient statistics for deriving speaker models. However, direct concatenation or score-level fusion of these two models may not be an effective way to build application-oriented systems, as it increases the need for runtime computation and memory. This motivated us to develop an efficient way of including information from the generative model based on total variability modeling in an embedding based SV system.
In this work, we propose a novel approach that learns a transformation matrix using the i-vectors and x-vectors from the background data to utilize both generative and discriminative characteristics. Canonical correlation analysis (CCA) between these vectors is used to derive this transformation model. CCA has previously been used for the analysis of correlation among different features [22] and for the fusion of multimodal features in SV [23]. Additionally, it has been used for co-whitening of short and long duration utterances in an i-vector system [24]. In this work, CCA is used to maximize the correlation between the two models based on the generative and discriminative paradigms in order to discover complementary attributes. The transformation model is then used to transform standard x-vectors, so that they also benefit from the input of the generative model. Moreover, a comparison of the proposed system and the fusion of the i-vector and x-vector systems is presented to highlight the impact of the work for practical systems.
In the following sections, we first introduce the fundamentals of the i-vector and x-vector approaches for SV in Section 2. Section 3 introduces the proposed framework of generative x-vectors. The results of the SV experiments using the proposed approach are reported in Section 4. Finally, Section 5 concludes the work.
2 Speaker recognition paradigms: Generative vs. Discriminative
This section explains the basics of the i-vector and x-vector systems, as they form the basis of the proposed framework of generative x-vectors. The detailed structure and parameters used for the various modules of both systems are also described.
2.1 The i-vector: a generative model
An i-vector system is based on a generative model derived using the total variability model (TVM) [5]. The TVM is learned in an unsupervised manner and is used to represent each utterance by a compact low-dimensional vector as follows:

M = m + Tw,   (1)

where M is the Gaussian mixture model (GMM) mean supervector of an utterance, m represents the UBM mean supervector, and T is the total variability matrix used to obtain the i-vector w.

2.2 The x-vector: a discriminative model
Generative models are successful due to their strong mathematical representations. However, treating speaker discrimination directly as the training objective helps to increase robustness. In this regard, researchers have recently paid more attention to discriminative training for speaker recognition, as discussed in the introduction. We consider the x-vector as the discriminative baseline system, since it is comparable with i-vector systems for text-independent speaker recognition, especially for short utterances. The DNN embedding structure in our work basically follows that of [18, 21]. We do not use any data augmentation in the current work, which deserves future exploration.
A time-delay neural network (TDNN) [25] is trained using the same acoustic features as in the i-vector system. The TDNN model includes five frame-level hidden layers, all using rectified linear unit (ReLU) activation and batch normalization [26]. The specific time-delay contexts of these frame-level layers are listed in Table 1. A statistics pooling layer follows the output of the last frame-level layer and computes the mean and standard deviation over the frames of the input segment. The mean and standard deviation are stacked such that the output dimension is doubled. The final two hidden layers are 512-dimensional layers operating at the segment level, prior to the softmax layer whose targets are the speaker labels for each audio segment. The softmax and the second segment-level layer are removed during the testing phase, and 512-dimensional x-vectors are extracted at the output of the first segment-level layer.
Table 1: Context specification of the frame-level TDNN layers.

Layer index | Layer context  | Output dimension
1           | (-2,-1,0,1,2)  | 512
2           | (-2,0,2)       | 512
3           | (-3,0,3)       | 512
4           | (0)            | 512
5           | (0)            | 1500
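The statistics pooling step described above (mean and standard deviation over frames, stacked so the output dimension doubles) can be sketched as follows. This is a minimal NumPy illustration of the pooling operation only, not the Kaldi implementation, and the frame count is an arbitrary example:

```python
import numpy as np

def stats_pooling(frames: np.ndarray) -> np.ndarray:
    """frames: (num_frames, feat_dim) output of the last frame-level layer.
    Returns a (2 * feat_dim,) segment-level vector: [mean, stddev]."""
    mean = frames.mean(axis=0)
    std = frames.std(axis=0)
    return np.concatenate([mean, std])

# e.g. 300 frames of the 1500-dim layer-5 output -> 3000-dim segment vector
seg = stats_pooling(np.random.default_rng(1).normal(size=(300, 1500)))
assert seg.shape == (3000,)
```

The doubled 3000-dimensional vector is what the first 512-dimensional segment-level layer then consumes.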
3 Generative x-vectors: DNN embeddings with generative model input
In this work, we propose a novel approach that takes advantage of the correlation between the i-vectors and x-vectors to utilize their complementary ways of learning speaker models. A transformation model is learned using CCA by considering the i-vectors and the corresponding x-vectors from the background speech data as input pairs. During the enrollment and testing phases, the i-vector system is excluded from the pipeline and only the x-vector system is used, whose output is linearly transformed using the transformation matrix obtained from the CCA model. We refer to this transformed output as the generative x-vector, since it captures certain properties of the input generative model (the i-vector model) through the transformation.

Fig. 1 illustrates the steps to obtain the proposed generative x-vector representation of speakers. During the training stage, a transformation matrix is learned by applying CCA, and this matrix is later used for generative x-vector extraction. It is important to note that the TVM is only used for extracting i-vectors of the background data and is computed once. No further i-vector extraction is involved during enrollment and test sessions. Hence, this framework is expected to have lower latency than feature concatenation or score-level fusion of the two systems.
We first mathematically explain the left panel of Fig. 1. In order to take advantage of the generative model information, we seek a pair of matrices W_i and W_x that satisfy

(W_i, W_x) = argmax_{W_i, W_x} corr(W_i^T Φ, W_x^T Ψ).   (2)

Here, Φ and Ψ contain the corresponding i-vectors and x-vectors from the same set of utterances.
The proposed transformation with CCA is hypothesized to transfer information from the generative model to the discriminative model and vice versa. Therefore, the resultant transformation matrices for i-vectors and x-vectors are denoted as W_i and W_x, respectively. Let N be the number of background utterances used to train the transformation models with CCA. The background i-vectors and x-vectors input to CCA have dimensions d_i and d_x, respectively. On applying CCA, we obtain transformation matrices W_i of size d_i × d and W_x of size d_x × d, where d = min(d_i, d_x).
During the SV experiments, we only use the x-vector pipeline, as shown in the right panel of Fig. 1. Given an x-vector x, the proposed vector is computed as

x̂ = W_x^T x,   (3)

where x̂ denotes the x-vector with generative model input that we refer to as a generative x-vector.
Both the i-vectors and x-vectors are zero-centered in all of the mathematical expressions in this section. The details of CCA and the transformation of x-vectors are discussed in the following subsections.
3.1 Canonical correlation analysis
As mentioned above, in this work we aim to maximize the linear relationship between a set of i-vectors and x-vectors. Note that the dimensions of an i-vector and an x-vector are not the same. Given a fixed number of background speech utterances used to derive the background i-vectors and x-vectors, applying CCA maximizes the correlation between the input vector pairs of different dimensions.
Mathematically, given random vectors x and y, CCA defines a new pair of variables u and v via linear combinations of x and y:

u = a^T x,   (4)
v = b^T y.   (5)

CCA aims to find vectors a and b that maximize the correlation ρ = corr(u, v), which can be written as

(a, b) = argmax_{a, b} corr(a^T x, b^T y).   (6)
With the constraints that

var(u) = a^T Σ_xx a = 1   (7)

and

var(v) = b^T Σ_yy b = 1,   (8)

the correlation parameter to be maximized becomes

ρ = a^T Σ_xy b,   (9)

where Σ_xx, Σ_yy and Σ_xy are the covariance matrices of x, of y, and between x and y, respectively.
We then obtain the first pair of canonical variates (u_1, v_1) by maximizing ρ as given in Equation (9). The remaining canonical variates (u_k, v_k) maximize ρ subject to being uncorrelated with (u_j, v_j) for all j < k. This procedure is iterated d times, where d is determined by the dimensions of the two random vectors. Finally, a_k is obtained as the k-th eigenvector of Σ_xx^{-1} Σ_xy Σ_yy^{-1} Σ_yx and, similarly, b_k as the k-th eigenvector of Σ_yy^{-1} Σ_yx Σ_xx^{-1} Σ_xy.

3.2 CCA based x-vector transformation
In canonical correlation analysis, we aim to find mutually orthogonal pairs of maximally correlated linear combinations of the variables in x and y. In our work, the random vectors x and y discussed in Section 3.1 correspond to the i-vector matrix Φ and the x-vector matrix Ψ, respectively.
Revisiting the objective function given in Equation (2), it can be solved with the following constraints,

W_i^T Σ_ii W_i = I   (10)

and

W_x^T Σ_xx W_x = I,   (11)

so that the x-vectors are automatically whitened in the testing phase. Notice that Σ_ii and Σ_xx denote the empirical covariances of the i-vectors and x-vectors, respectively.
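The closed-form eigenvector solution of Section 3.1 and the unit-variance (whitening) constraints can be checked numerically on synthetic data. The following sketch mirrors the covariance notation used above; the dimensions and the correlated-data model are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(3)

# Two correlated zero-mean random vectors: x (a stand-in for i-vectors)
# and y (a stand-in for x-vectors), with toy dimensions p and q.
n, p, q = 2000, 4, 3
x = rng.normal(size=(n, p))
y = x[:, :q] + 0.5 * rng.normal(size=(n, q))
x -= x.mean(axis=0); y -= y.mean(axis=0)

# Empirical covariances.
Sxx = x.T @ x / n; Syy = y.T @ y / n; Sxy = x.T @ y / n

# a_k are the eigenvectors of Sxx^{-1} Sxy Syy^{-1} Syx (Section 3.1).
M = np.linalg.solve(Sxx, Sxy) @ np.linalg.solve(Syy, Sxy.T)
eigvals, A = np.linalg.eig(M)
k = int(np.argmax(eigvals.real))
a1 = A[:, k].real
b1 = np.linalg.solve(Syy, Sxy.T) @ a1  # matching direction for y

# Enforce the unit-variance constraints (7)-(8), i.e. the whitening
# constraints (10)-(11) applied column by column.
a1 /= np.sqrt(a1 @ Sxx @ a1)
b1 /= np.sqrt(b1 @ Syy @ b1)
u, v = x @ a1, y @ b1

assert abs(u.var() - 1.0) < 1e-9 and abs(v.var() - 1.0) < 1e-9
# The first canonical correlation equals the sqrt of the top eigenvalue.
assert abs(abs(np.corrcoef(u, v)[0, 1]) - np.sqrt(eigvals.real[k])) < 1e-6
```

In the paper's setting, stacking the normalized directions column-wise yields the transformation matrices applied to the i-vectors and x-vectors.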
3.3 t-SNE visualization
Together with the proposed generative x-vector system, a contrast system is also introduced that derives another transformation model in the reverse direction: the generative i-vector takes input from the x-vector based discriminative model. We refer to this transformed i-vector as the contrast i-vector, as input from the discriminative model has been used. We visualize each speaker representation to examine the distribution of a subset of speakers using the t-Distributed Stochastic Neighbor Embedding (t-SNE) technique [27], which is widely used for the visualization of high-dimensional data.
We randomly chose 5 speakers from the database that have more than 20 utterances and extracted the corresponding i-vectors, x-vectors, generative x-vectors and contrast i-vectors. Figure 2 shows the t-SNE distributions of the different representations. It is observed that the proposed generative x-vectors benefit from the generative information with increased separability, while the distribution of the contrast i-vectors highly resembles that of the original i-vectors.
A possible reason for this is that discriminative models like the x-vector learn the differences among speakers without learning the characteristics of each individual speaker. Thus, when information from a discriminative model is used as input to the generative i-vector model, it may not contribute towards a better SV performance. On the other hand, generative models such as the i-vector learn the characteristics of each speaker and therefore add speaker-specific information when used as input to a discriminative model. Additionally, discriminative models work well for a closed set of speakers, whereas there is no such constraint for generative models.
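A visualization of this kind can be produced with scikit-learn's t-SNE. The sketch below runs on synthetic speaker-structured vectors (the dimensions, cluster model and t-SNE parameters are illustrative assumptions, not the actual vectors of the experiment):

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(4)

# Hypothetical stand-in for the experiment: 5 speakers with 25 utterances
# each, represented by 512-dim vectors (e.g. x-vectors or their transforms).
n_speakers, n_utts, dim = 5, 25, 512
labels = np.repeat(np.arange(n_speakers), n_utts)

# Simulate speaker structure: per-speaker mean plus within-speaker noise.
means = rng.normal(scale=3.0, size=(n_speakers, dim))
vecs = means[labels] + rng.normal(size=(len(labels), dim))

emb = TSNE(n_components=2, perplexity=20, init="pca",
           random_state=0).fit_transform(vecs)
assert emb.shape == (n_speakers * n_utts, 2)
# `emb` can then be scatter-plotted, colored by `labels`, as in Fig. 2.
```

A representation with better speaker separability shows tighter, better-separated clusters in the resulting 2-D scatter plot.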
Table 2: Performance of the different SV systems on NIST SRE 2010 CC'5 in terms of EER (%) and DCF. "con. ivec" denotes the contrast i-vector and "gen. xvec" the proposed generative x-vector.

                 |                 EER (%)                    |                  DCF
Tasks            | ivec  | xvec  | fusion | con. ivec | gen. xvec | ivec | xvec | fusion | con. ivec | gen. xvec
coreext-coreext  | 2.20  | 2.96  | 2.19   | 2.23      | 1.51      | 0.42 | 0.42 | 0.36   | 0.44      | 0.35
core-10sec       | 6.07  | 6.39  | 4.71   | 6.00      | 4.41      | 0.85 | 0.72 | 0.78   | 0.84      | 0.70
10sec-10sec      | 11.46 | 11.51 | 8.92   | 11.56     | 8.93      | 0.98 | 0.85 | 0.88   | 0.96      | 0.89
4 Experimental Results
4.1 Database
The SV experiments in this work are performed using the NIST SRE 2010 database [28]. The common condition 5 (CC'5) has been chosen for the evaluation. Further, we consider different enrollment and test scenarios under this task, namely coreext-coreext, core-10sec, and 10sec-10sec, where coreext and core consist of long-duration utterances, while 10sec denotes short-duration speech of 10 seconds. Additionally, Switchboard 2 Phases 1, 2, and 3 as well as Switchboard Cellular, along with the NIST SREs from 2004 to 2008, are used as background data for learning the background models.
4.2 Implementation details
In this work, 20-dimensional mel frequency cepstral coefficient (MFCC) features, along with their delta and acceleration coefficients, are extracted for each frame of 25 ms with a shift of 10 ms. The i-vector model is used as a baseline system for reference in our studies. A full-covariance gender-independent UBM with 2048 components is used in the i-vector framework to obtain 600-dimensional i-vectors. For both systems, the dimensionality is reduced to 200 with linear discriminant analysis (LDA). For the x-vector system, the TDNN is trained on the same 20-dimensional MFCC features. All nonlinearities in the neural network are ReLUs.
We use PLDA for channel/session compensation and scoring in our experiments. Further, length normalization is applied before performing PLDA [29]. The PLDA model is trained with 200 speaker factors and a full covariance, while the channel factor is ignored. The results are reported in terms of equal error rate (EER) and detection cost function (DCF) following the protocol of the NIST SRE 2010 evaluation plan [28]. We used Kaldi recipes for building the baseline systems in this work [30].
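As an illustration of the EER metric used below, a simplified threshold sweep over synthetic target and non-target trial scores is sketched here. This is a toy scorer for intuition only; the reported results follow the official NIST SRE 2010 protocol:

```python
import numpy as np

def compute_eer(target_scores, nontarget_scores):
    """Equal error rate: the operating point where the false-rejection
    rate on target trials equals the false-acceptance rate on
    non-target trials."""
    scores = np.concatenate([target_scores, nontarget_scores])
    labels = np.concatenate([np.ones(len(target_scores)),
                             np.zeros(len(nontarget_scores))])
    order = np.argsort(scores)
    labels = labels[order]
    # Sweep the threshold over the sorted scores.
    fr = np.cumsum(labels) / labels.sum()                 # false rejections
    fa = 1 - np.cumsum(1 - labels) / (1 - labels).sum()   # false acceptances
    idx = int(np.argmin(np.abs(fr - fa)))
    return (fr[idx] + fa[idx]) / 2

rng = np.random.default_rng(5)
tgt = rng.normal(loc=2.0, size=1000)   # synthetic target trial scores
non = rng.normal(loc=0.0, size=1000)   # synthetic non-target trial scores
eer = compute_eer(tgt, non)
assert 0.0 < eer < 0.5
```

The better separated the two score distributions, the lower the EER; perfectly separated scores give an EER of zero.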
4.3 Results and discussion
In this section, the results of the individual baseline systems using i-vectors and x-vectors are compared with those of the proposed generative x-vectors. We further apply score fusion to the i-vector and x-vector systems and compare it with the generative x-vectors to investigate their effectiveness in capturing the complementary information from the generative model.
Table 2 reports the performance of the different SV frameworks used in this study. Comparing the i-vector and x-vector baselines, it is clear that the i-vector works better when both the enrollment and test utterances are of long duration, i.e., for the coreext-coreext task. On the other hand, the results for the core-10sec and 10sec-10sec tasks show that the x-vector system performs comparably to the i-vector system for short-duration test utterances when the enrollment data is either short or long. Further, a score-level fusion of these two systems results in a gain on all considered tasks of the NIST SRE 2010 database. The system fusion results follow the trend reported by the authors of [18].
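A minimal sketch of score-level fusion as used for the reference system above: z-normalize each system's scores over the trial list and take a weighted sum. The equal weight is a hypothetical choice for illustration; in practice the fusion weights are tuned (e.g. by logistic-regression calibration) on a development set, and the paper does not specify its calibration:

```python
import numpy as np

def fuse_scores(scores_ivec, scores_xvec, w=0.5):
    """Score-level fusion of two SV systems over the same trial list.
    w is an assumed weight; tune it on held-out data in practice."""
    def znorm(s):
        s = np.asarray(s, dtype=float)
        return (s - s.mean()) / s.std()   # put both systems on one scale
    return w * znorm(scores_ivec) + (1 - w) * znorm(scores_xvec)

# One fused score per trial, from hypothetical per-system scores.
fused = fuse_scores([1.0, -0.5, 2.0, 0.3], [0.8, -1.0, 1.5, 0.1])
assert fused.shape == (4,)
```

Note that this style of fusion requires running both front-ends on every trial, which is exactly the runtime cost the proposed generative x-vector avoids.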
We then focus on the results of the proposed generative x-vector system and its contrast i-vector system. It is observed that the generative x-vector system outperforms the standard x-vector system, reducing the EER from 2.96% to 1.51% on the coreext-coreext task. On the other hand, the performance of the contrast i-vector system is similar to that of the original i-vector system. Finally, we compare the performance of the proposed generative x-vector with the score-level fusion. For the short utterance cases, the performance of both systems is comparable. The proposed system outperforms the score fusion for the core condition with long utterances. Hence, the proposed system achieves remarkable performance with lower latency and less computational burden compared to the fusion of the x-vector and i-vector systems. This highlights its importance as a field-deployable system in a practical setting.
The detection error trade-off (DET) curves for the different systems on the coreext-coreext task are illustrated in Fig. 3. The superior performance of the proposed system is clearly reflected in this plot, with a DET curve that is well separated from those of the baseline individual systems as well as their fusion. In terms of EER, we observe 48.83%, 31.01% and 22.46% relative improvements over the original x-vector system for the coreext-coreext, core-10sec and 10sec-10sec tasks of CC'5 on the NIST SRE 2010 database, respectively. Future work will focus on extending this framework with data augmentation to overcome mismatch conditions [21, 31, 32].
5 Conclusions
This work focuses on improving a DNN embedding based SV system by taking input from generative models. Total variability speaker modeling is used as the generative model in these studies. A transformation model is learned by applying CCA to background data i-vectors and x-vectors. This model is then used to obtain the generative x-vectors, which are found to perform superior to both their x-vector baseline and i-vector counterparts. The studies are performed on the NIST SRE 2010 database under three different conditions and reveal 48.83%, 31.01% and 22.46% relative improvements in EER for the coreext-coreext, core-10sec and 10sec-10sec tasks, respectively. This confirms the importance of using input from generative models in the discriminative framework of DNN embeddings for SV. Additionally, the performance of the generative x-vectors is found to be superior for long utterances and competitive for short utterances compared to the score-level fusion of the i-vector and x-vector systems. Thus, this kind of approach has lower latency than dimension concatenation or score-level fusion of the systems, which makes it useful for practical applications.
References
 [1] T. Kinnunen and H. Li, “An overview of text-independent speaker recognition: From features to supervectors,” Speech Communication, vol. 52, no. 1, pp. 12–40, 2010.
 [2] J. H. L. Hansen and T. Hasan, “Speaker recognition by machines and humans: A tutorial review,” IEEE Signal Processing Magazine, vol. 32, no. 6, pp. 74–99, Nov 2015.
 [3] P. Kenny, “Joint factor analysis of speaker and session variability: Theory and algorithms,” Tech. Rep. CRIM-06/08-13, CRIM, Montreal, 2005.
 [4] P. Kenny, G. Boulianne, P. Ouellet, and P. Dumouchel, “Joint factor analysis versus eigenchannels in speaker recognition,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 15, no. 4, pp. 1435–1447, May 2007.
 [5] N. Dehak, P. Kenny, R. Dehak, P. Dumouchel, and P. Ouellet, “Front-end factor analysis for speaker verification,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 4, pp. 788–798, 2011.
 [6] K. A. Lee and SRE’16 I4U Group, “The I4U mega fusion and collaboration for NIST speaker recognition evaluation 2016,” in Proc. Interspeech 2017, 2017, pp. 1328–1332.
 [7] P. A. Torres-Carrasquillo, F. Richardson, S. Nercessian, D. Sturim, W. Campbell, Y. Gwon, S. Vattam, N. Dehak, H. Mallidi, P. S. Nidadavolu, R. Li, and R. Dehak, “The MIT-LL, JHU and LRDE NIST 2016 speaker recognition evaluation system,” in Proc. Interspeech 2017, 2017, pp. 1333–1337.
 [8] N. Kumar, R. K. Das, S. Jelil, B. K. Dhanush, H. Kashyap, K. S. R. Murty, S. Ganapathy, R. Sinha, and S. R. M. Prasanna, “IITG-Indigo system for NIST 2016 SRE challenge,” in Proc. Interspeech 2017, 2017, pp. 2859–2863.
 [9] Y. Lei, N. Scheffer, L. Ferrer, and M. McLaren, “A novel scheme for speaker recognition using a phonetically-aware deep neural network,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2014, May 2014, pp. 1695–1699.
 [10] P. Kenny, V. Gupta, T. Stafylakis, P. Ouellet, and J. Alam, “Deep neural networks for extracting Baum-Welch statistics for speaker recognition,” in Speaker Odyssey 2014, 2014, pp. 293–298.
 [11] F. Richardson, D. Reynolds, and N. Dehak, “Deep neural network approaches to speaker and language recognition,” IEEE Signal Processing Letters, vol. 22, no. 10, pp. 1671–1675, Oct 2015.
 [12] M. McLaren, Y. Lei, and L. Ferrer, “Advances in deep neural network approaches to speaker recognition,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2015, April 2015, pp. 4814–4818.
 [13] E. Variani, X. Lei, E. McDermott, I. L. Moreno, and J. Gonzalez-Dominguez, “Deep neural networks for small footprint text-dependent speaker verification,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2014, May 2014, pp. 4052–4056.
 [14] G. Heigold, I. Moreno, S. Bengio, and N. Shazeer, “End-to-end text-dependent speaker verification,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2016, March 2016, pp. 5115–5119.
 [15] S. X. Zhang, Z. Chen, Y. Zhao, J. Li, and Y. Gong, “End-to-end attention based text-dependent speaker verification,” in IEEE Spoken Language Technology Workshop (SLT) 2016, Dec 2016, pp. 171–178.
 [16] D. Snyder, P. Ghahremani, D. Povey, D. Garcia-Romero, Y. Carmiel, and S. Khudanpur, “Deep neural network-based speaker embeddings for end-to-end speaker verification,” in IEEE Spoken Language Technology Workshop (SLT) 2016, Dec 2016, pp. 165–170.
 [17] C. Li, X. Ma, B. Jiang, X. Li, X. Zhang, X. Liu, Y. Cao, A. Kannan, and Z. Zhu, “Deep speaker: an end-to-end neural speaker embedding system,” arXiv:1705.02304 [cs.CL], 2017.
 [18] D. Snyder, D. Garcia-Romero, D. Povey, and S. Khudanpur, “Deep neural network embeddings for text-independent speaker verification,” in Proc. Interspeech 2017, 2017, pp. 999–1003.
 [19] N. Brummer, A. Silnova, L. Burget, and T. Stafylakis, “Gaussian meta-embeddings for efficient scoring of a heavy-tailed PLDA model,” in Proc. Odyssey 2018 The Speaker and Language Recognition Workshop, 2018, pp. 349–356.
 [20] C. Zhang and K. Koishida, “End-to-end text-independent speaker verification with triplet loss on short utterances,” in Proc. Interspeech 2017, August 2017, pp. 1487–1491.
 [21] D. Snyder, D. Garcia-Romero, G. Sell, D. Povey, and S. Khudanpur, “X-vectors: Robust DNN embeddings for speaker recognition,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2018, April 2018, pp. 5329–5333.
 [22] R. K. Das and S. R. M. Prasanna, “Exploring different attributes of source information for speaker verification with limited test data,” Journal of the Acoustical Society of America, vol. 140, no. 1, pp. 184, 2016.
 [23] M. E. Sargin, E. Erzin, Y. Yemez, and A. M. Tekalp, “Multimodal speaker identification using canonical correlation analysis,” in IEEE International Conference on Acoustics, Speech and Signal Processing, 2006. ICASSP 2006 Proceedings, 2006, pp. I–I.
 [24] L. Xu, K. A. Lee, H. Li, and Z. Yang, “Co-whitening of i-vectors for short and long duration speaker verification,” in Proc. Interspeech 2018, September 2018, pp. 1066–1070.
 [25] A. Waibel, T. Hanazawa, G. Hinton, K. Shikano, and K. J. Lang, “Phoneme recognition using time-delay neural networks,” IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 37, no. 3, pp. 328–339, Mar 1989.
 [26] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” in Proc. ICML, 2015, pp. 448–456.

 [27] L. van der Maaten and G. Hinton, “Visualizing data using t-SNE,” Journal of Machine Learning Research, vol. 9, pp. 2579–2605, 2008.
 [28] “The NIST year 2010 speaker recognition evaluation plan,” April 2010.
 [29] S. J. D. Prince and J. H. Elder, “Probabilistic linear discriminant analysis for inferences about identity,” in Computer Vision, 2007. ICCV 2007. IEEE 11th International Conference on. IEEE, 2007, pp. 1–8.
 [30] D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Hannemann, P. Motlicek, Y. Qian, P. Schwarz, J. Silovsky, G. Stemmer, and K. Vesely, “The kaldi speech recognition toolkit,” in IEEE Workshop on Automatic Speech Recognition and Understanding 2011, Dec. 2011.
 [31] M. Mclaren, D. Castan, M. K. Nandwana, L. Ferrer, and E. Yılmaz, “How to train your speaker embeddings extractor,” in Proc. Odyssey 2018 The Speaker and Language Recognition Workshop, 2018, pp. 327–334.
 [32] S. Novoselov, A. Shulipa, I. Kremnev, A. Kozlov, and V. Shchemelinin, “On deep speaker embeddings for text-independent speaker recognition,” in Proc. Odyssey 2018 The Speaker and Language Recognition Workshop, 2018, pp. 378–385.