Compared to previous years, the 2016 NIST speaker recognition evaluation (SRE) marked a major shift from English towards Austronesian and Chinese languages. As in previous years, the task is speaker detection, with the focus on telephone speech recorded over a variety of handset types. The main challenges introduced in this evaluation are duration and language variability. The potential variation of languages, recording environments and test-segment durations addressed in this evaluation influenced the design of our system. Our goal was to utilize recent advances in language normalization, domain adaptation, speech activity detection and session compensation techniques to mitigate the adverse bias introduced in this year's evaluation.
Over recent years, the i-vector representation of speech segments has been widely used by state-of-the-art speaker recognition systems. Speaker recognition technology based on i-vectors currently dominates the research field owing to its performance, low computational cost and the compatibility of i-vectors with machine learning techniques. This dominance is reflected by the recent NIST i-vector machine learning challenge, which was designed to find the most promising algorithmic approaches to speaker recognition specifically on the basis of i-vectors [11, 18, 23, 12]. DNN-based frame alignment, despite its remarkable performance in text-independent speaker recognition on English data [13, 9], failed to provide recognition performance even comparable to that of the traditional GMM. We therefore concentrated on the cepstral GMM/i-vector system.
We outline in this paper the Intelligent Voice system, the techniques used and the results obtained on the SRE 2016 development set, which is intended to mirror the evaluation conditions, as well as a timing report. Section 2 describes the data used for system training. The front-end and back-end processing of the system are presented in Sections 3 and 4, respectively. In Section 5, we describe the experimental evaluation of the system on the SRE 2016 development set. Finally, we present a timing analysis of the system in Section 6.
2 Training Condition
The fixed training condition is used to build our speaker recognition system. Only conversational telephone speech data from datasets released through the Linguistic Data Consortium (LDC) have been used, including NIST SRE 2004-2010 and the Switchboard corpora (Switchboard Cellular Parts I and II, Switchboard 2 Phases I, II and III) for the different steps of system training. A more detailed description of the data used in system training is presented in Table 1. We have also included in system training the unlabelled set of 2472 telephone calls from both minor (Cebuano and Mandarin) and major (Tagalog and Cantonese) languages provided by NIST. The following sections indicate when and how this set was used in training.
3 Front-End Processing
In this section we describe the main steps in the front-end processing of our speaker recognition system, including speech activity detection, acoustic feature extraction and i-vector feature extraction.
3.1 Speech Activity Detection
The first stage of any speaker recognition system is to detect the speech content in an audio signal. An accurate speech activity detector (SAD) can improve speaker recognition performance. Several techniques have been proposed for SAD, including unsupervised methods based on thresholding the signal energy, and supervised methods that train a speech/non-speech classifier such as support vector machines (SVMs) and Gaussian mixture models (GMMs). Hidden Markov models (HMMs) have also been successful. Recently, it has been shown that DNN systems achieve impressive improvements in performance, especially at low signal-to-noise ratios (SNRs). In our work we have utilized a two-class DNN-HMM classifier to perform this task. The DNN-HMM hybrid configuration, with cross-entropy as the objective function, has been trained with the back-propagation algorithm. The softmax layer produces posterior probabilities for speech and non-speech, which are then converted into log-likelihoods. Using 2-state HMMs corresponding to speech and non-speech, frame-wise decisions are made by Viterbi decoding. As input to the network, we fed 40-dimensional filter-bank features along with 7 frames of context from each side. The network has 6 hidden layers with 512 units each. The architecture of our DNN-HMM SAD is shown in Figure 1. Approximately 100 hours of speech data from the Switchboard telephony data, with word alignments as ground truth, were used to train our SAD. The DNN training is performed on an NVIDIA TITAN X GPU, using the Kaldi software. Evaluated on 50 hours of telephone speech data from the same database, our DNN-HMM SAD yielded a frame-level misclassification (speech/non-speech) rate of 5.9%, whereas an energy-based SAD did not perform better than 20%.
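The Viterbi decoding step described above can be sketched as follows. This is a minimal illustration over per-frame log-likelihoods (assumed already converted from the DNN posteriors), not the Kaldi implementation we actually used:

```python
import numpy as np

def viterbi_sad(loglik, log_trans, log_prior):
    """Frame-wise speech/non-speech decisions by Viterbi decoding.

    loglik:    (T, 2) per-frame log-likelihoods for [non-speech, speech]
    log_trans: (2, 2) log transition matrix of the 2-state HMM
    log_prior: (2,)   log initial state probabilities
    Returns a length-T array of state labels (0 = non-speech, 1 = speech).
    """
    T = loglik.shape[0]
    delta = log_prior + loglik[0]        # best log-score ending in each state
    psi = np.zeros((T, 2), dtype=int)    # back-pointers
    for t in range(1, T):
        scores = delta[:, None] + log_trans   # (prev_state, state)
        psi[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + loglik[t]
    # Backtrack the best state sequence
    path = np.zeros(T, dtype=int)
    path[-1] = delta.argmax()
    for t in range(T - 1, 0, -1):
        path[t - 1] = psi[t, path[t]]
    return path
```

With sticky self-transitions, the decoder smooths over isolated frames that the DNN classifies inconsistently, which is the reason for the HMM layer on top of the frame classifier.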
3.2 Acoustic Features
For acoustic features we have experimented with different configurations of cepstral features. We have used 39-dimensional PLP features and 60-dimensional MFCC features (including their first and second order derivatives) as acoustic features. Moreover, our experiments indicated that the combination of these two feature sets performs particularly well in score fusion. Both PLP and MFCC are extracted at an 8 kHz sampling frequency using Kaldi, with 25 and 20 ms frame lengths, respectively, and a 10 ms frame shift (other configurations are the Kaldi defaults). For each utterance, the features are centered using short-term (3 s window) cepstral mean and variance normalization (ST-CMVN). Finally, we employed our DNN-HMM speech activity detector (SAD) to drop non-speech frames.
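A minimal sketch of ST-CMVN as described above; the window length (300 frames, about 3 s at a 10 ms frame shift) follows the text, while the epsilon guarding the division is an implementation assumption (our system uses the equivalent Kaldi operations):

```python
import numpy as np

def st_cmvn(feats, win=300):
    """Short-term cepstral mean and variance normalization.

    feats: (T, D) feature matrix; win: window length in frames.
    Each frame is normalized by the mean/std of a window centred on it.
    """
    T = feats.shape[0]
    out = np.empty_like(feats, dtype=float)
    half = win // 2
    for t in range(T):
        lo, hi = max(0, t - half), min(T, t + half + 1)
        seg = feats[lo:hi]
        out[t] = (feats[t] - seg.mean(axis=0)) / (seg.std(axis=0) + 1e-8)
    return out
```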
3.3 i-Vector Features
Since their introduction, i-vectors have brought a significant increase in recognition performance to the speaker recognition community. i-Vectors are low-dimensional representations of Baum-Welch statistics obtained with respect to a GMM, referred to as the universal background model (UBM), in a single subspace which includes all characteristics of speaker and inter-session variability, named the total variability matrix. On each acoustic feature we trained a full-covariance, gender-independent UBM with 2048 Gaussians, followed by a 600-dimensional i-vector extractor, to establish our MFCC- and PLP-based i-vector systems. The unlabelled set of development data was used in the training of both the UBM and the i-vector extractor. The open-source Kaldi software has been used for all these processing steps.
It has been shown that successive acoustic observation vectors tend to be highly correlated, which may be problematic for maximum a posteriori (MAP) estimation of i-vectors. To address this issue, scaling the zero- and first-order Baum-Welch statistics before presenting them to the i-vector extractor has been proposed. It turns out that a scale factor of 0.33 gives a slight edge, resulting in a better decision cost function. This scaling was applied both when training the i-vector extractor and at test time.
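The statistics scaling can be illustrated as follows; the shapes of the posterior and feature matrices are assumptions for the sketch:

```python
import numpy as np

def baum_welch_stats(posteriors, feats, scale=0.33):
    """Compute scaled zero- and first-order Baum-Welch statistics.

    posteriors: (T, C) frame-level Gaussian posteriors from the UBM
    feats:      (T, D) acoustic features
    The 0.33 scaling of both statistics (applied in training and test)
    compensates for the correlation of successive frames.
    """
    n = posteriors.sum(axis=0)   # zero-order stats, shape (C,)
    f = posteriors.T @ feats     # first-order stats, shape (C, D)
    return scale * n, scale * f
```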
4 Back-End Processing
This section describes the steps performed in the back-end processing of our speaker recognition system.
4.1 Nearest-neighbor Discriminant Analysis (NDA)
Nearest-neighbor discriminant analysis is a nonparametric discriminant analysis technique, originally proposed for pattern classification and recently applied to speaker recognition. The nonparametric within- and between-class scatter matrices, S_w and S_b respectively, are computed based on nearest-neighbor sample information. The NDA transform is then formed from the eigenvectors of S_w^{-1} S_b. It has been shown that as the number of nearest neighbors approaches the number of samples in each class, NDA essentially becomes the LDA projection. NDA has been found to outperform LDA owing to its ability to capture the local structure and boundary information within and across different speakers. We applied an NDA projection matrix, computed using the 10 nearest neighbors, to centered i-vectors. The resulting dimensionality-reduced i-vectors are then whitened using both the training data and the unlabelled development set.
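A simplified sketch of the NDA scatter estimation, omitting the distance-based weighting function of the original formulation and assuming centred i-vectors; the neighbor count and output dimensionality are illustrative parameters:

```python
import numpy as np

def local_mean(x, pool, k):
    """Mean of the k nearest neighbors of x within pool."""
    d = np.linalg.norm(pool - x, axis=1)
    return pool[np.argsort(d)[:k]].mean(axis=0)

def nda_transform(X, y, k=10, dim=300):
    """Nonparametric discriminant analysis projection (unweighted sketch).

    X: (N, D) centred i-vectors, y: (N,) speaker labels.
    Within-class scatter uses each sample's k nearest neighbors from its
    own class; between-class scatter uses the k nearest neighbors from
    each competing class.
    """
    D = X.shape[1]
    Sw = np.zeros((D, D))
    Sb = np.zeros((D, D))
    classes = np.unique(y)
    for c in classes:
        Xc = X[y == c]
        for x in Xc:
            dw = x - local_mean(x, Xc, k + 1)  # +1: x is its own neighbor
            Sw += np.outer(dw, dw)
            for c2 in classes:
                if c2 == c:
                    continue
                db = x - local_mean(x, X[y == c2], k)
                Sb += np.outer(db, db)
    # NDA basis: leading eigenvectors of Sw^-1 Sb
    vals, vecs = np.linalg.eig(np.linalg.pinv(Sw) @ Sb)
    order = np.argsort(-vals.real)
    return vecs[:, order[:dim]].real
```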
4.2 Short-Duration Variability Compensation
The enrolment condition of the development set is supposed to provide at least 60 seconds of speech for each target speaker. Nevertheless, our SAD indicates that the speech content is as low as 26 seconds in some cases. The test segments, whose duration ranges from 9 to 60 seconds of speech material, can yield poor performance for the shorter segments. As indicated in Figure 2, more than one third of the test segments have a speech duration of less than 20 seconds. We have addressed this issue by proposing a short-duration variability compensation method. The proposed method first extracts, from each audio segment in the unlabelled development set, a partial excerpt of 10 seconds of speech material with random selection of the starting point (Figure 3). Each audio file in the unlabelled development set, together with its extracted excerpt, thus yields two 400-dimensional i-vectors, one of which carries at most 10 seconds of speech material. Considering each pair as one class, we computed an LDA projection matrix to remove directions attributed to duration variability. Moreover, the projected i-vectors are also subjected to within-class covariance normalization (WCCN) using the same class labels.
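The WCCN step applied after the pair-based LDA can be sketched as follows, with each (full segment, 10 s excerpt) pair treated as one class; the Cholesky-based whitening is the standard WCCN construction:

```python
import numpy as np

def wccn(X, y):
    """Within-class covariance normalization (WCCN).

    X: (N, D) projected i-vectors; y: (N,) class labels, here one class
    per (full segment, 10 s excerpt) pair. Returns i-vectors transformed
    so that the average within-class covariance becomes the identity.
    """
    classes = np.unique(y)
    D = X.shape[1]
    W = np.zeros((D, D))
    for c in classes:
        W += np.cov(X[y == c].T, bias=True)   # per-class covariance
    W /= len(classes)
    B = np.linalg.cholesky(np.linalg.inv(W))  # W^-1 = B B^T
    return X @ B
```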
4.3 Language Normalization
Language-source normalization is an effective technique for reducing language dependency in the state-of-the-art i-vector/PLDA speaker recognition system. It can be implemented by extending SN-LDA in order to mitigate variations that separate languages. This can be accomplished by using the language label to identify different sources during training. Language-Normalized LDA (LN-LDA) utilizes a language-normalized within-speaker scatter matrix S_w, which is estimated as the variability not captured by the between-speaker scatter matrix S_b:

S_w = S_t - S_b,

where S_t and S_b are the total scatter and normalized between-speaker scatter matrices respectively, formulated as follows:

S_t = (1/N) sum_{i=1..N} (w_i - w̄)(w_i - w̄)^T,

where N is the total number of i-vectors, w_i is the i-th i-vector and w̄ is the mean of all i-vectors, and

S_b = (1/N) sum_{l=1..L} sum_{s=1..S_l} n_s^l (w̄_s^l - w̄^l)(w̄_s^l - w̄^l)^T,

where L is the number of languages in the training set, S_l is the number of speakers in language l, n_s^l is the number of i-vectors from speaker s in language l, w̄_s^l is the mean of the i-vectors from speaker s in language l and finally w̄^l is the mean of all i-vectors in language l. We applied an LN-LDA projection matrix to reduce the i-vector dimensions down to 300.
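The scatter estimation above can be sketched as follows; the per-speaker count weighting is one consistent choice of normalization (so that S_t = S_w + S_b holds in the single-language case):

```python
import numpy as np

def ln_lda_scatters(X, spk, lang):
    """Language-normalized scatter matrices for LN-LDA.

    X: (N, D) i-vectors; spk: (N,) speaker labels; lang: (N,) language
    labels. The between-speaker scatter is accumulated per language
    around the language mean, and the within-speaker scatter is the
    variability not captured by it: Sw = St - Sb.
    """
    N, D = X.shape
    mu = X.mean(axis=0)
    Xc = X - mu
    St = Xc.T @ Xc / N                      # total scatter
    Sb = np.zeros((D, D))
    for l in np.unique(lang):
        in_l = lang == l
        mu_l = X[in_l].mean(axis=0)         # language mean
        for s in np.unique(spk[in_l]):
            Xs = X[in_l & (spk == s)]
            d = Xs.mean(axis=0) - mu_l      # speaker mean vs language mean
            Sb += len(Xs) * np.outer(d, d)
    Sb /= N
    return St - Sb, Sb
```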
4.4 PLDA
Probabilistic Linear Discriminant Analysis (PLDA) provides a powerful mechanism to distinguish between-speaker variability, which characterizes speaker information, from all other sources of undesired variability that characterize distortions. Since i-vectors are assumed to be generated by a generative model, they can be broken down into statistically independent speaker and session components with Gaussian distributions [5, 8]. Although it has been shown that their distribution follows a Student's t rather than a Gaussian distribution, length-normalizing the entire set of i-vectors as a pre-processing step can approximately Gaussianize their distributions and, as a result, raise the performance of Gaussian PLDA to that of heavy-tailed PLDA. A standard Gaussian PLDA assumes that an i-vector w is modelled according to

w = μ + Φβ + ε,

where μ is the mean of the i-vectors, the columns of the matrix Φ contain the basis for the between-speaker subspace, the latent identity variable β denotes the speaker factor that represents the identity of the speaker, and the residual ε, which is normally distributed with zero mean and full covariance matrix Σ, represents within-speaker variability.
For each acoustic feature we have trained two PLDA models. The first, out-domain PLDA (Φ_out, Σ_out), is trained using the training set presented in Table 1, and the second, in-domain PLDA (Φ_in, Σ_in), was trained using the unlabelled development set. Our efforts to cluster the development set (e.g. using the out-domain PLDA) were not very successful, as it appears that almost all segments are uttered by different speakers. Therefore, each i-vector was considered to be uttered by a distinct speaker. We also set the number of speaker factors to 200.
4.5 Domain Adaptation
Domain adaptation has gained considerable attention with the aim of compensating for cross-speech-source variability between in-domain and out-of-domain data. Unsupervised adaptation of out-domain PLDA parameters has been shown to yield better performance on in-domain data. Using the in-domain and out-domain PLDA models trained in Section 4.4, we interpolated their parameters as follows:

Φ Φ^T = α Φ_in Φ_in^T + (1 - α) Φ_out Φ_out^T,
Σ = α Σ_in + (1 - α) Σ_out,

where 0 ≤ α ≤ 1 weights the in-domain model. For our submission we chose the value of α that performed best on the development set.
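A sketch of the interpolation in the covariance domain; a basis Φ can be recovered from the interpolated between-speaker covariance by eigendecomposition if needed:

```python
import numpy as np

def interpolate_plda(Phi_in, Sigma_in, Phi_out, Sigma_out, alpha):
    """Interpolate in-domain and out-of-domain PLDA parameters in the
    covariance domain; alpha weights the in-domain model."""
    Sb = alpha * (Phi_in @ Phi_in.T) + (1 - alpha) * (Phi_out @ Phi_out.T)
    Sigma = alpha * Sigma_in + (1 - alpha) * Sigma_out
    return Sb, Sigma
```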
4.6 Score Computation and Normalization
For the one-segment enrolment condition, the speaker model is the length-normalized i-vector of that segment; for the three-segment enrolment condition, we simply used the length-normalized mean of the three length-normalized i-vectors as the speaker model. Each speaker model is tested against each test segment in the trial list. For two trial i-vectors w_e and w_t, the PLDA verification score is computed as

s(w_e, w_t) = w_e^T Λ w_e + w_t^T Λ w_t + 2 w_e^T Γ w_t + c,

where the matrices Λ, Γ and the constant c are derived from the PLDA model parameters. It has been shown, and confirmed in our experiments, that score normalization can have a great impact on the performance of the recognition system. We used the symmetric s-norm, which normalizes the score s of the pair (e, t) using the formula

s'(e, t) = ( (s - μ_e)/σ_e + (s - μ_t)/σ_t ) / 2,

where the means (μ_e, μ_t) and standard deviations (σ_e, σ_t) are computed by matching e and t, respectively, against the unlabelled set as impostor speakers.
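A minimal sketch of the symmetric s-norm, given cohort scores obtained by scoring the enrolment and test i-vectors against the unlabelled set:

```python
import numpy as np

def s_norm(score, enroll_cohort_scores, test_cohort_scores):
    """Symmetric score normalization (s-norm): the raw score of a trial
    (e, t) is normalized by impostor statistics gathered by scoring e
    and t against a cohort (here, the unlabelled development set)."""
    mu_e, sd_e = enroll_cohort_scores.mean(), enroll_cohort_scores.std()
    mu_t, sd_t = test_cohort_scores.mean(), test_cohort_scores.std()
    return 0.5 * ((score - mu_e) / sd_e + (score - mu_t) / sd_t)
```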
4.7 Quality Measure Function
It has been shown that there is a dependency between the value of the decision threshold and the duration of both enrolment and test segments. Applying a quality measure function (QMF) enabled us to compensate for the shift in the threshold due to differences in speech duration. We conducted experiments to estimate the dependency of the threshold shift on the duration of the test segment, and used a QMF that shifts the PLDA verification score by a function of d, the duration of the test segment in seconds.
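A hypothetical shape for such a QMF, affine in the log of the test duration; the coefficients below are placeholders that would be fitted on development data, not the values used in our submission:

```python
import math

def apply_qmf(score, test_dur, a, b):
    """Placeholder QMF: shift the PLDA score by an affine function of
    log test-segment duration. a and b are hypothetical coefficients
    to be fitted on development data."""
    return score + a + b * math.log(test_dur)
```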
In the literature, the performance of speaker recognition is usually reported in terms of the calibration-insensitive equal error rate (EER) or the minimum decision cost function (minDCF). However, in real applications of speaker recognition there is a need to present recognition results in terms of calibrated log-likelihood ratios. We have utilized the BOSARIS Toolkit for the calibration of scores. The C_llr metric provides an ideal reference value for judging calibration: if C_llr is close to its minimum, C_llr^min, then the system can be said to be well calibrated. The choice of target probability (P_target) had a great impact on the performance of the calibration. For our primary submission we set the value of P_target that performed best on the development set; a different value was used for our secondary submission.
5 Results and Discussion
In this section we present the results obtained on the protocol provided by NIST for the development set, which is supposed to mirror the evaluation set. The results are shown in Table 2. The first part of the table indicates the results obtained by the primary system. As can be seen, the fusion of MFCC and PLP (a simple sum of the MFCC and PLP scores) resulted in a relative improvement of almost 10% compared to MFCC alone, in terms of both evaluation metrics. In order to quantify the contribution of the different system components we have defined different scenarios. In scenario A, we analysed the effect of using LDA instead of NDA. As can be seen from the results, LDA outperforms NDA in the case of PLP; however, in fusion, NDA resulted in better performance in terms of the primary metric. In scenario B, we analysed the effect of the short-duration compensation technique proposed in Section 4.2. The results indicate superior performance using this technique. In scenario C, we investigated the effect of language normalization on the performance of the system. If we replace LN-LDA with simple LDA, we see a performance degradation for MFCC as well as for the fusion; however, PLP seems not to be adversely affected. The effect of using the QMF is investigated in scenario D. Finally, in scenario E, we can see the major improvement obtained through the domain adaptation technique explained in Section 4.5. For our secondary submission, we incorporated a disjoint portion of the labelled development set (10 out of 20 speakers) in both LN-LDA and in-domain PLDA training. We evaluated the system on almost 6k of the 24k trials, drawn from the other portion, to avoid any over-fitting, which is particularly important for the domain adaptation technique. This resulted in a relative improvement of 11% over the primary system in terms of the primary metric.
However, the results can be misleading, since the recording condition may be the same for all speakers in the development set.
6 Time Analysis
This section reports on the CPU execution time (single-threaded) and the amount of memory used to process a single trial, which includes the time for creating models from the enrolment data and the time needed for processing the test segments. The analysis was performed on an Intel(R) Xeon(R) E5-2670 CPU at 2.60 GHz. The results are shown in Table 3. We used the Unix time command to report these results. The user time is the actual CPU time used in executing the process (single thread). The real time is the wall-clock time (the elapsed time, including time slices used by other processes and the time the process spends blocked). The system time is the amount of CPU time spent in the kernel within the process. We have also reported the memory allocated for each stage of execution. The most computationally intensive stage is the extraction of i-vectors (both MFCC- and PLP-based), which also depends on the duration of the segments. For enrolment, we have reported the time required to extract a model from a segment with a duration of 140 seconds and a speech duration of 60 seconds. The time and memory required for front-end processing are negligible compared to the i-vector extraction stage, since they only involve matrix operations. The time required for our SAD is also reported, which increases linearly with the duration of the segment.
7 Conclusions and Perspectives
We have presented the Intelligent Voice speaker recognition system used for the NIST 2016 speaker recognition evaluation. Our system is based on a score fusion of MFCC- and PLP-based i-vector/PLDA systems. We have described the main components of the system, including acoustic feature extraction, speech activity detection and i-vector extraction as front-end processing, and language normalization, short-duration compensation, channel compensation and domain adaptation as back-end processing. For our future work, we intend to use the ALISP segmentation technique in order to extract meaningful acoustic units so as to train supervised GMM or DNN models.
-  N. Brümmer and E. de Villiers. The BOSARIS toolkit user guide: Theory, algorithms and code for binary classifier score processing. Documentation of BOSARIS toolkit, 2011.
-  G. Chollet, J. Černockỳ, A. Constantinescu, S. Deligne, and F. Bimbot. Toward ALISP: A proposal for automatic language independent speech processing. In Computational Models of Speech Pattern Processing, pages 375–388. Springer, 1999.
-  N. Dehak, P. Kenny, R. Dehak, P. Dumouchel, and P. Ouellet. Front-end factor analysis for speaker verification. IEEE Transactions on Audio, Speech, and Language Processing, 19(4):788–798, 2011.
-  K. Fukunaga and J. Mantock. Nonparametric discriminant analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence, (6):671–678, 1983.
-  D. Garcia-Romero and C. Y. Espy-Wilson. Analysis of i-vector length normalization in speaker recognition systems. In INTERSPEECH, pages 249–252, 2011.
-  D. Garcia-Romero, A. McCree, S. Shum, N. Brummer, and C. Vaquero. Unsupervised domain adaptation for i-vector speaker recognition. In Proceedings of Odyssey, The Speaker and Language Recognition Workshop, Joensuu, Finland, 2014.
-  C. S. Greenberg, D. Bansé, G. R. Doddington, D. Garcia-Romero, J. J. Godfrey, T. Kinnunen, A. F. Martin, A. McCree, M. Przybocki, and D. A. Reynolds. The NIST 2014 speaker recognition i-vector machine learning challenge. In Proceedings of Odyssey, The Speaker and Language Recognition Workshop, Joensuu, Finland, 2014.
-  P. Kenny. Bayesian speaker verification with heavy-tailed priors. In Proceedings of Odyssey, The Speaker and Language Recognition Workshop, page 14, 2010.
-  P. Kenny, V. Gupta, T. Stafylakis, P. Ouellet, and J. Alam. Deep neural networks for extracting Baum-Welch statistics for speaker recognition. In Proceedings of Odyssey, The Speaker and Language Recognition Workshop, pages 293–298, Joensuu, Finland, 2014.
-  P. Kenny, T. Stafylakis, P. Ouellet, M. J. Alam, and P. Dumouchel. PLDA for speaker verification with utterances of arbitrary duration. In 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, pages 7649–7653. IEEE, 2013.
-  A. Khosravani and M. Homayounpour. Linearly constrained minimum variance for robust i-vector based speaker recognition. In Proceedings of Odyssey, The Speaker and Language Recognition Workshop, pages 249–253, Joensuu, Finland, 2014.
-  E. Khoury, L. El Shafey, M. Ferras, and S. Marcel. Hierarchical speaker clustering methods for the NIST i-vector challenge. In Proceedings of Odyssey, The Speaker and Language Recognition Workshop, Joensuu, Finland, 2014.
-  Y. Lei, N. Scheffer, L. Ferrer, and M. McLaren. A novel scheme for speaker recognition using a phonetically-aware deep neural network. In 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1695–1699. IEEE, 2014.
-  M. McLaren, M. I. Mandasari, and D. A. van Leeuwen. Source normalization for language-independent speaker recognition using i-vectors. In Proceedings of Odyssey, The Speaker and Language Recognition Workshop, pages 55–61, Singapore, 2012.
-  M. McLaren and D. van Leeuwen. Source-normalized LDA for robust speaker recognition using i-vectors from multiple speech sources. IEEE Transactions on Audio, Speech, and Language Processing, 20(3):755–766, 2012.
-  N. Mesgarani, M. Slaney, and S. A. Shamma. Discrimination of speech from nonspeech based on multiscale spectro-temporal modulations. IEEE Transactions on Audio, Speech, and Language Processing, 14(3):920–930, 2006.
-  T. Ng, B. Zhang, L. Nguyen, S. Matsoukas, X. Zhou, N. Mesgarani, K. Veselỳ, and P. Matejka. Developing a speech activity detection system for the DARPA RATS program. In INTERSPEECH, pages 1969–1972, 2012.
-  S. Novoselov, T. Pekhovsky, and K. Simonchik. STC speaker recognition system for the NIST i-vector challenge. In Proceedings of Odyssey, The Speaker and Language Recognition Workshop, pages 231–240, Joensuu, Finland, 2014.
-  T. Pfau, D. P. Ellis, and A. Stolcke. Multispeaker speech activity detection for the ICSI meeting recorder. In Automatic Speech Recognition and Understanding, 2001. ASRU '01. IEEE Workshop on, pages 107–110. IEEE, 2001.
-  D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Hannemann, P. Motlicek, Y. Qian, P. Schwarz, et al. The Kaldi speech recognition toolkit. In IEEE 2011 Workshop on Automatic Speech Recognition and Understanding. IEEE Signal Processing Society, 2011.
-  N. Ryant, M. Liberman, and J. Yuan. Speech activity detection on YouTube using deep neural networks. In INTERSPEECH, pages 728–731, 2013.
-  S. O. Sadjadi, S. Ganapathy, and J. Pelecanos. The IBM 2016 speaker recognition system. In Odyssey 2016: The Speaker and Language Recognition Workshop, pages 174–180, Bilbao, Spain, June 21-24 2016.
-  B. Vesnicer, J. Zganec-Gros, S. Dobrisek, and V. Struc. Incorporating duration information into i-vector-based speaker-recognition systems. In Proceedings of Odyssey, The Speaker and Language Recognition Workshop, pages 241–248, Joensuu, Finland, 2014.