Speaker verification (SV) is defined as the task of verifying a person using their speech/voice signal. SV is a non-invasive biometric authentication method with many real-world applications, e.g. as an alternative to passwords, in airport security, banking transactions and home automation. SV methods can be broadly categorized into text-independent (TI) and text-dependent (TD). In TI-SV, speakers are free to speak any text or sentence during the enrollment and test phases. In TD-SV, on the other hand, speakers are constrained to speak predefined text during both the enrollment/training and test phases. Therefore, TD-SV maintains a matched phonetic context between the training and test phases and yields a lower error rate than TI-SV, in particular for SV using short speech utterances of a few seconds, which is ideal for real-life applications.
In the model domain, GMM-UBM with maximum-a-posteriori (MAP) adaptation [2] and i-vector [3] techniques are popular. It is well known that GMM-UBM outperforms the i-vector technique for speaker verification using short utterances. With the recent progress of deep neural networks (DNNs), a new speaker embedding technique has been introduced for speaker recognition, which is called x-vector [5] and has attracted much attention. In this method, a DNN is trained to model variable-length speech segments in the initial few layers and to discriminate the speakers at the output layer. The output of a particular DNN hidden layer for a given speech signal is then used as a vectorized representation (i.e. x-vector) of that speech signal. During enrollment, target speakers are represented by x-vectors, and during the test, the x-vector of a test utterance is scored against the claimant-specific x-vector with probabilistic linear discriminant analysis (PLDA). The effectiveness of x-vector for TI-SV using cepstral features and senone-discriminant BN features can be found in [5, 6], and it has been observed that phonetic information is valuable for speaker embedding using DNNs. To the best of our knowledge, the study of x-vectors for SV has focused on the text-independent setting only, and a general comparison between x-vector and GMM-UBM is missing from the literature.
On the other hand, many techniques have been proposed in the literature to improve the performance of speaker verification [2, 3, 5, 7, 8, 9, 10]. They can be grouped into two broad categories: feature domain and model domain. Feature domain methods include Mel-frequency cepstral coefficients (MFCCs) [11], perceptual linear predictive (PLP) coefficients [12] and DNN-based bottleneck (BN) features [13, 14, 15]. When extracting BN features, it is common to feed cepstral features to a DNN that discriminates speakers, senones, a combination of them, or tri-phone states at the output layer. Afterward, the output of a particular DNN hidden layer, called the deep feature, is projected onto a low-dimensional space via principal component analysis (PCA) to obtain the BN feature. Recently, time-contrastive-learning (TCL) and phone-discriminant DNN BN features have been introduced for TD-SV [13]. In TCL, text-dependent pass-phrase utterances (excluding the evaluation set) are split into a predefined number of segments. The number of segments is equal to the number of classes. The frames within a particular segment are assigned the same class label. A DNN is then trained to discriminate the classes. In the case of the phone BN feature, a DNN is trained to discriminate the phones at the output layer. It is shown in [13] that TCL and phone BN features give lower error rates than other existing BNs with both the GMM-UBM and i-vector frameworks. However, the behaviour of these features in the emerging x-vector paradigm has not been investigated.
The two sets of observations above motivate us to study the use of x-vector for TD-SV, in comparison with i-vector and GMM-UBM, and the impact of different bottleneck features on its performance. We conduct experiments on the RedDots 2016 challenge database [18]. We show that the type of feature has only a marginal impact on the performance of x-vector, with the TCL BN feature achieving the lowest equal error rate. On the other hand, the performance gap between different BN features on i-vector is significant. The fusion of x-vector and i-vector systems gives a large gain in performance. The GMM-UBM technique shows its advantage for TD-SV using short utterances.
2 Modeling techniques
In this section, we briefly describe the different modeling techniques, which are commonly used in speaker verification.
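In the GMM-UBM approach [2], a GMM-based universal background model is trained using data from many non-target speakers. Speaker models are then derived from the GMM-UBM with MAP adaptation. During the test, a test utterance $X = \{x_1, \ldots, x_n\}$ is scored against the claimant model $\lambda_r$ and the GMM-UBM $\lambda_{ubm}$. Finally, the log-likelihood ratio is calculated as
$$\Lambda(X) = \frac{1}{n} \sum_{t=1}^{n} \left[ \log p(x_t \mid \lambda_r) - \log p(x_t \mid \lambda_{ubm}) \right].$$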
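As an illustration of this scoring step, the following is a minimal pure-Python sketch of an average per-frame log-likelihood ratio between a claimant GMM and a UBM. The toy two-component models, the diagonal covariances and all function names (`log_gauss_diag`, `gmm_loglik`, `llr`) are assumptions made for the example, not the configuration used in the experiments.

```python
import math

def log_gauss_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian evaluated at vector x."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )

def gmm_loglik(x, gmm):
    """Log-likelihood of one frame under a GMM given as (weight, mean, var) tuples."""
    terms = [math.log(w) + log_gauss_diag(x, m, v) for w, m, v in gmm]
    peak = max(terms)  # log-sum-exp trick for numerical stability
    return peak + math.log(sum(math.exp(t - peak) for t in terms))

def llr(frames, target_gmm, ubm):
    """Average per-frame log-likelihood ratio of the claimant model against the UBM."""
    total = sum(gmm_loglik(x, target_gmm) - gmm_loglik(x, ubm) for x in frames)
    return total / len(frames)

# Toy models (hypothetical): the claimant model differs from the UBM only in
# the mean of its first component, mimicking mean-only MAP adaptation.
ubm = [(0.5, [0.0, 0.0], [1.0, 1.0]), (0.5, [3.0, 3.0], [1.0, 1.0])]
target = [(0.5, [0.5, 0.5], [1.0, 1.0]), (0.5, [3.0, 3.0], [1.0, 1.0])]
frames = [[0.6, 0.4], [0.5, 0.7], [0.4, 0.5]]

score = llr(frames, target, ubm)
```

A positive `score` favours the claimant model; in practice the decision is made by comparing the score with a threshold.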
In the i-vector approach [3], a speech signal is represented by a low-dimensional vector called an i-vector, which is obtained by projecting the signal onto a low-dimensional subspace (called the total variability (T) space) of a speaker-independent GMM-UBM super-vector, where speaker and channel information is assumed to be dense. For a given speech signal of a speaker, the speaker- and channel-dependent GMM super-vector $M$ can be expressed as
$$M = m + Tw,$$
where $m$ denotes the speaker-independent GMM super-vector, $T$ is the total variability matrix and $w$ is called an i-vector. During the enrollment phase, each target is represented by an average i-vector computed over the i-vectors of his/her training utterances (or speech sessions). In the test phase, the i-vector of a test utterance is scored against the claimant-specific i-vector (obtained during enrollment) with PLDA.
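The PLDA score of a pair of vectors $(w_1, w_2)$ is the log-likelihood ratio
$$\text{score}(w_1, w_2) = \log \frac{p(w_1, w_2 \mid H_s)}{p(w_1, w_2 \mid H_d)},$$
where the hypotheses $H_s$ and $H_d$ indicate that $w_1$ and $w_2$ come from the same speaker or from different speakers, respectively.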
In the x-vector method [5], a speech utterance is characterized by a vector that is obtained as the output of a hidden layer of a DNN, and the vector is called an x-vector. The DNN is trained to model speech segments of variable length in the first several layers and to embed the speakers in the last hidden layers. The loss function is the cross-entropy loss used to discriminate speakers at the output layer. Similarly to the i-vector system, speakers are represented by their average x-vectors computed over their training speech utterances in the enrollment phase. In the test phase, the x-vector of the test utterance is scored against the claimant-specific x-vector with PLDA. Several studies using x-vectors can be found in speaker verification with multi-conditional recordings, and in language recognition with triphone-state discriminant BN features trained using single or multiple languages. More details about the x-vector technique can be found in [5]. Fig. 1 illustrates TD-SV using x-vectors.
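The enrollment-by-averaging step shared by the i-vector and x-vector systems can be sketched as follows. Note that the scoring function here is plain cosine similarity, used only as a simple stand-in for PLDA, which the paper actually uses; the three-dimensional vectors and the function names are hypothetical.

```python
import math

def average_embedding(vectors):
    """Enrollment: represent a speaker by the mean of the utterance embeddings."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def cosine_score(a, b):
    """Cosine similarity between two embeddings (stand-in for PLDA scoring)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Three enrollment embeddings of one (hypothetical) target speaker.
enroll = [[1.0, 0.0, 0.2], [0.8, 0.2, 0.0], [0.9, 0.1, 0.1]]
model = average_embedding(enroll)

same = cosine_score(model, [0.9, 0.1, 0.1])  # test embedding close to the model
diff = cosine_score(model, [0.0, 1.0, 0.0])  # test embedding far from the model
```

The same two-step structure (average at enrollment, pairwise score at test) applies regardless of which back-end replaces `cosine_score`.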
3 Bottleneck features
In this section, we briefly present the various bottleneck feature extraction methods used for TD-SV in this work.
3.1 Speaker discriminant BN (spkr-BN)
A DNN is trained to optimize a cross-entropy based objective function for discriminating speakers at the output layer. The cross-entropy function can be defined as
$$L(\theta) = -\sum_{k} c_k \log o_k,$$
where $L$, $\theta$, $c_k$ and $o_k$ denote the loss, the parameters of the DNN, the (one-hot) class label of the input feature vector and the a posteriori output at the DNN output layer for class $k$, respectively. The output of a particular hidden layer for a given speech segment is projected onto the low-dimensional space using PCA to get the spkr-BN feature.
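For concreteness, here is a small sketch of the cross-entropy loss for a single input frame, computed from the raw output-layer activations via a softmax. The three-class logits are made-up values, and `softmax` and `cross_entropy` are illustrative helper names rather than functions from any toolkit used in the paper.

```python
import math

def softmax(logits):
    """Convert output-layer activations into posterior probabilities."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, label):
    """Cross-entropy loss of one frame whose true class index is `label`."""
    return -math.log(softmax(logits)[label])

logits = [4.0, 0.5, 0.1]                 # hypothetical activations for 3 classes
loss_correct = cross_entropy(logits, 0)  # true class has the largest activation
loss_wrong = cross_entropy(logits, 2)    # true class has the smallest activation
```

The loss is small when the network assigns a high posterior to the true class and grows as that posterior shrinks.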
3.2 Speaker+pass-phrase discriminant (spkr+phrase-BN)
This system is analogous to the speaker discriminant BN. A DNN is trained to optimize two cross-entropy based objective functions simultaneously, one for discriminating speakers and the other for discriminating pass-phrases, defined on two different sets of output nodes:
$$L_{total} = \alpha L_{spkr} + \beta L_{phrase},$$
where $\alpha$ and $\beta$ weight the two objectives. In our case, equal importance is given to the two functions, i.e. $\alpha = \beta$.
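A minimal sketch of such a two-task objective is given below, assuming a single convex weight `alpha` so that `alpha = 0.5` corresponds to equal importance. The parameterisation, the logits and the function names are assumptions for illustration; the paper only states that both losses are weighted equally.

```python
import math

def neg_log_posterior(logits, label):
    """Cross-entropy of one frame, computed with a stable log-softmax."""
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return -(logits[label] - log_norm)

def multitask_loss(spk_logits, spk_label, phrase_logits, phrase_label, alpha=0.5):
    """Weighted sum of the speaker loss and the pass-phrase loss.

    alpha = 0.5 gives both tasks equal importance (assumed parameterisation).
    """
    spk_loss = neg_log_posterior(spk_logits, spk_label)
    phrase_loss = neg_log_posterior(phrase_logits, phrase_label)
    return alpha * spk_loss + (1.0 - alpha) * phrase_loss

# Two separate sets of output nodes: 4 speakers and 2 pass-phrases (toy sizes).
loss = multitask_loss([0.0, 0.0, 0.0, 0.0], 1, [0.0, 0.0], 0)
```

With uniform activations the two losses are log 4 and log 2, so the combined value is their equally weighted mean.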
3.3 Phone discriminant (PHN-BN)
This system is similar to the spkr-BN. The only difference is that phones are discriminated at the output layer of the DNNs. The phone labels are obtained by using automatic speech recognition (ASR) systems. Three ASR systems are considered for generating the transcription of speech signals, yielding three different systems: a) PHN-BN1: the phoneme recognizer is based on [22]; b) PHN-BN2: the phone recognizer is based on an end-to-end segmental phoneme recognizer [23]; and c) PHN-BN3: this system uses forced alignment for phone recognition, based on the same end-to-end segmental model as in PHN-BN2. Frames detected as sil and pause are discarded before feeding feature vectors into the respective DNNs. A more detailed analysis can be found in [13].
3.4 Time-contrastive learning (TCL)
The objective behind this feature is to capture the temporal information available in speech utterances in an unsupervised manner, i.e. without any ASR or manual transcriptions [13]. There are two settings for the method. In the first one, the training data of the DNN are first randomized and then split into chunks (segments) of $M$ frames. For $N$ classes in TCL, $N$ segments are taken at a time and the frames within the $n$-th of them are assigned class label $n$, i.e. the frame $x_t$ at position $t$ of the data stream receives the label
$$c(x_t) = \left( \left( \lceil t/M \rceil - 1 \right) \bmod N \right) + 1, \quad t = 1, 2, \ldots$$
This is called stream-wise TCL (sTCL). Similarly to spkr-BN and PHN-BNs, a DNN is trained to discriminate the classes at the output layer of DNN with a cross-entropy function and afterward, BN features are extracted.
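The stream-wise labelling can be sketched in a few lines. The function name and the toy values of the chunk length and number of classes are assumptions, and the labels are 0-based here purely for programming convenience.

```python
def stcl_labels(num_frames, chunk_len, num_classes):
    """Assign a class label to every frame of a concatenated data stream.

    Consecutive chunks of `chunk_len` frames receive labels 0, 1, ...,
    num_classes - 1, and the labelling then wraps around to 0.
    """
    return [(t // chunk_len) % num_classes for t in range(num_frames)]

# Toy stream of 10 frames, chunks of 2 frames, 3 classes.
labels = stcl_labels(10, 2, 3)
```

Because the labels depend only on the position in the stream, no transcription or speaker information is needed to generate the training targets.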
In the second setting, each utterance is split uniformly into $N$ segments, corresponding to $N$ classes, and all frames within one segment are assigned the same class label, which we call utterance-wise TCL (uTCL). Afterward, a DNN is trained similarly to sTCL, and BN features are extracted. In this study, we set the number of classes $N$ as per [13].
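The utterance-wise variant differs only in that the segment boundaries are defined per utterance. A possible sketch, with a hypothetical function name and 0-based labels, is:

```python
def utcl_labels(num_frames, num_classes):
    """Split one utterance uniformly into `num_classes` segments.

    Frame t receives the index of the segment it falls into, so every
    utterance uses the same label set regardless of its length.
    """
    return [t * num_classes // num_frames for t in range(num_frames)]

short_utt = utcl_labels(6, 3)   # 6 frames, 3 classes
long_utt = utcl_labels(10, 3)   # 10 frames, 3 classes
```

When the utterance length is not a multiple of the number of classes, integer division makes some segments one frame longer than others; how the original implementation handles this remainder is not stated in the text.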
4 Experimental setup
Experiments are conducted on the m-part-01 task (male speakers) of the RedDots database as per the protocol [18]. Each target model is trained with three utterances, and each utterance is only a few seconds long on average. There are four different types of trials for system evaluation, as detailed in Table 1.
MFCC feature vectors, consisting of static coefficients and their dynamic coefficients, are extracted from speech signals using a Hamming window and a fixed frame shift. An energy-based voice activity detector is used to discard low-energy frames. The selected frames are normalized to zero mean and unit variance at the utterance level.
A GMM-UBM with diagonal covariance matrices is trained using speech files from the TIMIT database, covering both male and female speakers. This data set is also used for training the PCA that yields the low-dimensional BN features. MAP adaptation is run for a fixed number of iterations with a fixed relevance factor.
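The frame selection and utterance-level normalization described in the front-end can be sketched as follows. The thresholding rule (a fraction of the mean frame energy) and the function names are assumptions, since the text does not specify how the energy-based detector sets its threshold.

```python
import math

def energy_vad(frames, energies, ratio=0.5):
    """Keep frames whose energy exceeds `ratio` times the mean energy.

    The threshold rule is a hypothetical choice for illustration.
    """
    threshold = ratio * sum(energies) / len(energies)
    return [f for f, e in zip(frames, energies) if e > threshold]

def mean_variance_normalize(frames):
    """Normalize each feature dimension to zero mean and unit variance
    over the frames of one utterance."""
    n = len(frames)
    dims = list(zip(*frames))
    means = [sum(d) / n for d in dims]
    # Fall back to 1.0 when a dimension is constant, to avoid division by zero.
    stds = [math.sqrt(sum((v - m) ** 2 for v in d) / n) or 1.0
            for d, m in zip(dims, means)]
    return [[(v - m) / s for v, m, s in zip(f, means, stds)] for f in frames]

frames = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [100.0, 0.0]]
energies = [5.0, 6.0, 7.0, 0.1]   # the last frame is nearly silent
kept = energy_vad(frames, energies)
normalized = mean_variance_normalize(kept)
```

Normalizing after frame selection means that the statistics are computed on speech frames only, so silence does not bias the mean and variance.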
The T-space in the i-vector system is trained using utterances covering pass-phrases from the RSR2015 database, spoken by male and female speakers, while excluding the pass-phrases that overlap with the RedDots database. This data set is also used for training the PLDA, the x-vector network and the DNNs. The numbers of speakers and pass-phrases in this data set determine the number of nodes at the output layer of the DNNs for the spkr-BN, x-vector and spkr+phrase-BN systems.
The Kaldi toolkit [24] is used for implementing the x-vector technique, i.e. the speaker embedding part. The number of DNN layers, the activation functions, the number of neurons per layer and the other parameters are kept as in the reference implementation. The x-vector dimension is chosen to match the dimension used in the i-vector based systems. Kaldi truncates the training data into chunks and, because the utterances used for training in this work are short, we set the minimum and maximum chunk sizes explicitly rather than using the Kaldi defaults.
The CNTK toolkit [25] is used for implementing the bottleneck feature extraction with a variable batch size, a variable learning rate and a fixed number of training epochs, as per the default parameter settings. Seven-layer feed-forward networks with a sigmoid activation function are used, with the same number of neurons in each hidden layer. For BN feature extraction, the output of the fourth hidden layer for spkr-BN and spkr+phrase-BN, and of the second hidden layer for PHN-BN and TCL-BN, is projected onto the low-dimensional space, following prior work.
Table 1: Number of trials of each type used for system evaluation, including the number of trials for each non-target type.
In PLDA, the utterances of the same pass-phrase from a particular speaker are treated as an individual speaker (class). This gives 8100 classes (4239 male and 3861 female) in PLDA. The speaker and channel factors are kept full in PLDA, i.e. equal to the dimension of the i-vector, the x-vector or the fused vector (where the i-vector and the x-vector of each utterance are concatenated) in the respective systems. Before PLDA, i-vectors and x-vectors are normalized with iterative spherical normalization [26]. System performance is measured in terms of equal error rate (EER) and minimum detection cost function (MinDCF) as per the 2008 SRE [27].
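To make the EER metric concrete, the sketch below sweeps a decision threshold over all observed scores and reports the operating point where the false acceptance and false rejection rates are closest. This is a simple approximation written for illustration, with made-up scores; evaluation toolkits typically interpolate between operating points.

```python
def equal_error_rate(target_scores, nontarget_scores):
    """Approximate EER (as a fraction) by sweeping thresholds over observed scores."""
    best_gap, eer = None, None
    for thr in sorted(set(target_scores) | set(nontarget_scores)):
        # False acceptance: non-target trials scoring at or above the threshold.
        far = sum(s >= thr for s in nontarget_scores) / len(nontarget_scores)
        # False rejection: target trials scoring below the threshold.
        frr = sum(s < thr for s in target_scores) / len(target_scores)
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2.0
    return eer

targets = [2.0, 1.5, 1.0, 0.2]        # hypothetical genuine-trial scores
nontargets = [0.5, 0.1, -0.3, -1.0]   # hypothetical impostor-trial scores
eer = equal_error_rate(targets, nontargets)
```

In this toy example one target score falls below one non-target score, so a quarter of each trial type is misclassified at the balanced threshold.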
|Feature|# of classes|Method|Non-target type [%EER/(MinDCF×100)]|||Average|
| | |score fusion (i,x)|2.06/0.84|4.15/1.73|0.50/0.16|2.24/0.91|
| | |score fusion (i,x)|1.94/0.77|4.28/1.77|0.55/0.18|2.26/0.91|
5 Results and Discussions
Table 2 compares the TD-SV performance of various features combined with GMM-UBM, i-vector or x-vector, as well as the fusion of i-vector and x-vector, on the RedDots database (m-part-01 task). In score fusion, the scores of the different systems are combined with equal weights. The following observations can be made from the table.
First, there are only marginal performance differences among the features under the x-vector framework, although the uTCL BN feature gives the lowest EER. This could be because the x-vector network is trained in a way that neutralizes the differences in representation power across BN features. On the other hand, the impact of features is rather significant for i-vector and GMM-UBM, for which PHN-BN and TCL-BN outperform the others by large margins. In most cases, the BN features outperform MFCC across the modeling methods.
Second, GMM-UBM gives the lowest error rates for all features explored when a single modeling technique is used. This indicates that GMM-UBM is better than x-vector or i-vector alone for TD-SV using short utterances.
Third, the x-vector yields lower error rates than the i-vector for all features except the PHN and TCL BNs. Fusing the x-vector system with the i-vector system significantly reduces the EERs compared with either system alone, indicating their complementary nature.
Finally, it is interesting to notice that i-vector performs better than x-vector when a strongly performing, highly discriminative BN feature is used, e.g. PHN-BNs and TCL-BNs. The opposite holds when MFCCs, spkr-BN and spkr+phrase-BN are used.
It is worth mentioning that score fusion of the GMM-UBM, i-vector and x-vector systems does not lead to further performance improvement, and hence those results are not shown in this paper. More complex fusion strategies will be explored in future work.
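Equal-weight score fusion, as used for the i-vector and x-vector systems, amounts to a per-trial weighted combination of the system scores. The sketch below assumes the scores of all systems are aligned by trial and that the equal weights sum to one; both the function name and the example scores are illustrative.

```python
def fuse_scores(system_scores, weights=None):
    """Combine per-trial scores from several systems.

    `system_scores` holds one list of scores per system, aligned by trial.
    With no weights given, every system receives the same weight 1/n.
    """
    n = len(system_scores)
    if weights is None:
        weights = [1.0 / n] * n
    return [sum(w * s for w, s in zip(weights, trial))
            for trial in zip(*system_scores)]

ivector_scores = [1.0, -2.0]   # hypothetical scores for two trials
xvector_scores = [3.0, 0.0]
fused = fuse_scores([ivector_scores, xvector_scores])
```

Passing explicit `weights` turns the same function into a simple weighted fusion, which is one of the more elaborate strategies that could be explored.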
6 Conclusion
In this paper, we studied the use of x-vector and its combination with various bottleneck (BN) features for text-dependent speaker verification (TD-SV) using short utterances. We further compared the TD-SV performance of x-vector with the Gaussian mixture model-universal background model (GMM-UBM) and i-vector methods. The experiments lead to a set of interesting results. First, BN features have a marginal impact on the performance of x-vector, while they have a large impact on i-vector and GMM-UBM, in favor of phone-discriminant and time-contrastive-learning BN features. Second, the fusion of x-vector and i-vector largely boosts the performance, while GMM-UBM remains a favorable framework for TD-SV with short utterances.
References
- [1] F. Bimbot, J.-F. Bonastre, et al., “A Tutorial on Text-independent Speaker Verification,” EURASIP Journal on Applied Signal Processing, vol. 4, pp. 430–451, 2004.
- [2] D. A. Reynolds, T. F. Quatieri, and R. B. Dunn, “Speaker Verification using Adapted Gaussian Mixture Models,” Digital Signal Processing, vol. 10, pp. 19–41, 2000.
- [3] N. Dehak, P. Kenny, R. Dehak, P. Ouellet, and P. Dumouchel, “Front-End Factor Analysis for Speaker Verification,” IEEE Trans. on Audio, Speech and Language Processing, vol. 19, pp. 788–798, 2011.
- [4] H. Delgado, M. Todisco, M. Sahidullah, A. K. Sarkar, N. Evans, T. Kinnunen, and Z.-H. Tan, “Further Optimisations of Constant Q Cepstral Processing For Integrated Utterance And Text-dependent Speaker Verification,” in Proc. of IEEE Spoken Language Technology Workshop (SLT), 2016.
- [5] D. Snyder, D. Garcia-Romero, G. Sell, D. Povey, and S. Khudanpur, “X-vectors: Robust DNN Embeddings For Speaker Recognition,” in Proc. of IEEE Int. Conf. Acoust. Speech Signal Processing (ICASSP), 2018, pp. 5329–5333.
- [6] M. H. Rahman, I. Himawan, M. McLaren, C. Fookes, and S. Sridharan, “Employing Phonetic Information in DNN Speaker Embeddings to Improve Speaker Recognition Performance,” in Proc. of Interspeech, 2018, pp. 3593–3597.
- [7] T. Kinnunen and H. Li, “An Overview Of Text-independent Speaker Recognition: From Features To Supervectors,” Speech Communication, vol. 52, no. 1, pp. 12–40, 2010.
- [8] S. Wang, J. Rohdin, L. Burget, O. Plchot, Y. Qian, K. Yu, and J. Cernocky, “On the Usage of Phonetic Information for Text-independent Speaker Embedding Extraction,” in Proc. of Interspeech, 2019, pp. 1148–1152.
- [9] D. Garcia-Romero, D. Snyder, G. Sell, A. McCree, D. Povey, and S. Khudanpur, “X-vector DNN Refinement With Full-Length Recordings for Speaker Recognition,” in Proc. of Interspeech, 2019, pp. 1493–1496.
- [10] N. Tawara, A. Ogawa, T. Iwata, M. Delcroix, and T. Ogawa, “Frame-level Phoneme-invariant Speaker Embedding for Text-independent Speaker Recognition on Extremely Short Utterances,” in Proc. of IEEE Int. Conf. Acoust. Speech Signal Processing (ICASSP), 2020, pp. 6799–6803.
- [11] S. B. Davis and P. Mermelstein, “Comparison of Parametric Representations for Monosyllabic Word Recognition in Continuously Spoken Sentences,” IEEE Trans. Acoust. Speech Signal Processing, vol. 28, pp. 357–366, 1980.
- [12] H. Hermansky, “Perceptual Linear Predictive (PLP) Analysis of Speech,” J. Acoust. Soc. Am., vol. 87, pp. 1738–1752, 1990.
- [13] A. K. Sarkar, Z.-H. Tan, H. Tang, S. Shon, and J. R. Glass, “Time-Contrastive Learning Based Deep Bottleneck Features for Text-Dependent Speaker Verification,” IEEE/ACM Trans. Audio, Speech & Language Processing, vol. 27, no. 8, pp. 1267–1279, 2019.
- [14] H. Yu, Z.-H. Tan, Z. Ma, and J. Guo, “Adversarial Network Bottleneck Features For Noise Robust Speaker Verification,” in Proc. of Interspeech, 2017, pp. 1492–1496.
- [15] A. K. Sarkar, C. T. Do, V. B. Le, and C. Barras, “Combination Of Cepstral And Phonetically Discriminative Features For Speaker Verification,” IEEE Signal Process. Lett., vol. 21, no. 9, pp. 1040–1044, 2014.
- [16] Y. Liu, Y. Qian, N. Chen, T. Fu, Y. Zhang, and K. Yu, “Deep Feature for Text-dependent Speaker Verification,” Speech Communication, vol. 73, pp. 1–13, 2015.
- [17] M. McLaren, Y. Lei, and L. Ferrer, “Advances In Deep Neural Network Approaches To Speaker Recognition,” in Proc. of IEEE Int. Conf. Acoust. Speech Signal Processing (ICASSP), 2015, pp. 4814–4818.
- [18] “The RedDots challenge: Towards characterizing speakers from short utterances,” https://sites.google.com/site/thereddotsproject/reddots-challenge.
- [19] M. Senoussaoui et al., “Mixture of PLDA Models In I-Vector Space For Gender-Independent Speaker Recognition,” in Proc. of Interspeech, 2011, pp. 25–28.
- [20] D. Snyder, D. Garcia-Romero, G. Sell, A. McCree, D. Povey, and S. Khudanpur, “Speaker Recognition For Multi-speaker Conversations Using X-vectors,” in Proc. of IEEE Int. Conf. Acoust. Speech Signal Processing (ICASSP), 2019, pp. 5796–5800.
- [21] D. Snyder, D. Garcia-Romero, A. McCree, G. Sell, D. Povey, and S. Khudanpur, “Spoken Language Recognition Using X-vectors,” in Proc. of Odyssey: The Speaker and Language Recognition Workshop, 2018, pp. 105–111.
- [22] P. Schwarz, P. Matejka, and J. Cernocky, “Hierarchical Structures of Neural Networks for Phoneme Recognition,” in Proc. of IEEE Int. Conf. Acoust. Speech Signal Processing (ICASSP), 2006, pp. 325–328.
- [23] H. Tang, L. Lu, K. Gimpel, K. Livescu, C. Dyer, N. A. Smith, and S. Renals, “End-to-end Neural Segmental Models For Speech Recognition,” IEEE Journal of Selected Topics in Signal Processing, vol. 11, pp. 1254–1264, 2017.
- [24] https://github.com/kaldi-asr/kaldi/tree/master/egs/sre16/v2.
- [25] Y. Dong, A. Eversole, M. Seltzer, K. Yao, Z. Huang, B. Guenter, O. Kuchaiev, et al., “An Introduction to Computational Networks and the Computational Network Toolkit,” Microsoft Technical Report MSR-TR-2014-112, 2014.
- [26] P. M. Bousquet et al., “Variance-Spectra Based Normalization For i-vector Standard And Probabilistic Linear Discriminant Analysis,” in Proc. of Odyssey Speaker and Language Recognition Workshop, 2012.
- [27] https://www.nist.gov/itl/iad/mig/2008-nist-speaker-recognition-evaluation-results.