In recent years, there have been many attempts to take advantage of neural networks (NNs) in speaker verification. Most of these attempts replace or improve one of the components of an i-vector + PLDA system (feature extraction, calculation of sufficient statistics, i-vector extraction, or PLDA) with a neural network. Examples include: using NN bottleneck features instead of conventional MFCC features, NN acoustic models replacing Gaussian Mixture Models for the extraction of sufficient statistics, and NNs either complementing PLDA [3, 4] or replacing it. More ambitiously, NNs that take the frame-level features of an utterance as input and directly produce an utterance-level representation, usually referred to as an embedding, have in the past two years almost replaced the generative i-vector approach in text-independent speaker recognition [6, 7, 8, 9, 10, 11, 12].
These embeddings are obtained by means of a pooling mechanism (for example, taking the mean) over the frame-wise outputs of one or more layers in the NN, or by the use of a recurrent NN. An obvious advantage over i-vectors lies in the much smaller number of model parameters, typically around 10 million in the x-vector case [11, 12], compared to approximately 50 million parameters for the i-vector's UBM and extractor combined. This results in very fast and memory-efficient embedding extraction. A disadvantage of the x-vector framework lies in its training, during which it is essential to massively augment the training data and split them into many rather short (2–5 second) examples.
In this work, we continue the research from our previous work, where we kept the large parameter space of the generative i-vector extractor and focused on discriminative retraining of such a model. We were able to retain the model's robustness and even increase SV performance by optimizing the model for discrimination between speakers, a task closely related to the final speaker verification. However, the memory requirements and large computational cost during training not only limited us in running experiments effectively, but, more importantly, prevented us from pursuing our research goal, which is to include this model in a larger DNN scheme that is closer to an end-to-end system.
To solve this problem, we had to drastically decrease the number of trainable model parameters, of course without a major decrease in performance. Factorization of similar, or even the same, models has been studied before. In 2003, the Subspace Precision and Mean (SPAM) model for acoustic modeling in speech recognition was introduced by Axelrod et al. and later optimized by Povey. SPAM models are Gaussian mixture models with a subspace constraint, where each covariance matrix is represented as a weighted sum of globally shared full-rank matrices. In 2014, Cumani and Laface proposed an i-vector extractor factorization for faster i-vector extraction and a smaller memory footprint, where each row of the i-vector extractor matrix is represented as a linear combination of the atoms of a common dictionary, under the assumption that it is not necessary to store all rows of this matrix to perform i-vector extraction.
In our approach to factorization, we were inspired by Cumani and Laface, but instead of factorizing each row, we perform the factorization at the level of the submatrices of the i-vector extractor that represent individual GMM-UBM components. Our motivation is also different, as we primarily aim to greatly decrease the memory footprint and therefore substantially speed up the discriminative training. For now, we ignore the possible speedup of i-vector extraction.
To finally obtain a discriminative i-vector extractor, we use the same strategy as in the x-vector framework [6, 10, 11]: we retrain the NN representation of our factorized generative model to optimize the multi-class cross-entropy over a set of training speakers. This is in contrast with our previous research, where we optimized the binary cross-entropy over verification trials formed by pairs of i-vectors. We show that, with such an approach, we can achieve reasonable performance. Our results are perhaps not as competitive as those achieved with current state-of-the-art x-vector systems; nevertheless, we are now closer to our goal, which is to further use this model in a fully end-to-end discriminative system that can be initialized from a robust generative baseline.
In order to compare both approaches (generative and discriminative) on a speaker verification task, both versions of i-vectors were extracted and used in a standard generative PLDA backend.
2 Theoretical Background
In our previous work, we represented a standard generative i-vector-based SV system as a series of “elementary” feed-forward NNs, each representing an individual i-vector building block (e.g., GMM-UBM, i-vector extractor, PLDA classifier). In the beginning, each NN was trained separately to mimic the equivalent block from the generative training. After this “initialization”, all blocks were connected and jointly retrained. This scheme already seemingly fits our goal, but it was exactly the i-vector extractor component that posed the biggest challenge, and we had to resort to ad-hoc simplifications, such as PCA-based dimensionality reduction of the large-dimensional sufficient statistics coming from the GMM-UBM.
In this paper, we focus on the i-vector extractor block and its efficient discriminative retraining. We keep the generative GMM-UBM and PLDA models unchanged.
2.1 i-vector Baseline
The i-vectors provide a way of reducing large-dimensional input data to a low-dimensional feature vector while retaining most of the relevant information. The main principle is that the utterance-dependent Gaussian Mixture Model (GMM) supervector of concatenated mean vectors lies in a low-dimensional subspace, defined by a $CD \times M$ matrix $\mathbf{T}$, commonly referred to as an i-vector extractor (with $C$ being the number of GMM components, $D$ the feature dimensionality, and $M$ the subspace dimensionality), and whose coordinates are given by the ($M$-dimensional) i-vector $\boldsymbol{\phi}$. The closed-form solution for computing the i-vector can be expressed as a function of the zero- and first-order GMM statistics: $n_c = \sum_t \gamma_c(t)$ and $\mathbf{f}_c = \sum_t \gamma_c(t)\,\mathbf{x}_t$, where $\gamma_c(t)$ is the posterior (or occupation) probability of frame $\mathbf{x}_t$ being generated by the mixture component $c$. The i-vector is then computed as

$$\boldsymbol{\phi} = \Big(\mathbf{I} + \sum_c n_c \bar{\mathbf{T}}_c^{\top}\bar{\mathbf{T}}_c\Big)^{-1} \sum_c \bar{\mathbf{T}}_c^{\top}\bar{\mathbf{f}}_c,$$

where $\bar{\mathbf{T}}_c$ and $\bar{\mathbf{f}}_c$ are the “normalized” variants of the per-component block $\mathbf{T}_c$ and of $\mathbf{f}_c$, respectively:

$$\bar{\mathbf{T}}_c = \boldsymbol{\Sigma}_c^{-\frac{1}{2}}\mathbf{T}_c, \qquad \bar{\mathbf{f}}_c = \boldsymbol{\Sigma}_c^{-\frac{1}{2}}\left(\mathbf{f}_c - n_c\boldsymbol{\mu}_c\right),$$

and $\boldsymbol{\Sigma}_c^{-\frac{1}{2}}$ is a symmetrical decomposition (such as Cholesky) of the inverse of the GMM-UBM covariance matrix $\boldsymbol{\Sigma}_c$, with $\boldsymbol{\mu}_c$ the corresponding UBM mean vector.
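The closed-form extraction above can be sketched in a few lines of NumPy. This is an illustrative implementation under the shapes defined above; the function name, variable names, and per-component block layout are our own, not code from the paper.

```python
import numpy as np

def extract_ivector(n, f_bar, T_bar):
    """Closed-form i-vector from normalized sufficient statistics.

    n     : (C,)      zero-order statistics per GMM component
    f_bar : (C, D)    normalized first-order statistics
    T_bar : (C, D, M) normalized per-component blocks of the extractor
    """
    C, D, M = T_bar.shape
    # Posterior precision of the i-vector: I + sum_c n_c * T_c^T T_c
    L = np.eye(M) + np.einsum('c,cdm,cdn->mn', n, T_bar, T_bar)
    # Linear term: sum_c T_c^T f_c
    b = np.einsum('cdm,cd->m', T_bar, f_bar)
    # Solve L * phi = b instead of forming the inverse explicitly
    return np.linalg.solve(L, b)
```

Note that solving the linear system is preferable to explicitly inverting the $M \times M$ precision matrix.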
2.2 Factorization of i-vector Extractor
In this work, we propose to factorize each component submatrix $\mathbf{T}_c$ as:

$$\mathbf{T}_c = \sum_{k=1}^{K} \alpha_{c,k}\,\mathbf{B}_k,$$

where $K$ is the number of factors, $\mathbf{B}_k$ are the base matrices, and $\alpha_{c,k}$ are the scalar weights for each component $c$. Note that the bases $\mathbf{B}_k$ are shared across all components $c$. The number of parameters in this new model representation is $K(DM + C)$, while the number of parameters in the original i-vector extractor was $CDM$. Since there is no general requirement of linear independence for the individual matrices $\mathbf{T}_c$ in the original i-vector concept, $K$ would have to be equal to $C$ in order for the factorized model to fully describe the original subspace $\mathbf{T}$. However, our assumption is that there is, in fact, some level of linear dependency, and therefore $K$ can be chosen significantly smaller than $C$, reducing the original model parameter space.
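A minimal NumPy sketch of this factorization and of the parameter counts follows. The concrete configuration used in the check ($C = 2048$ components, $D = 60$-dimensional features, $M = 400$, $K = 250$) is our assumption based on typical setups and on the parameter-count reductions quoted in this paper.

```python
import numpy as np

def reconstruct_blocks(alpha, bases):
    """T_c = sum_k alpha[c, k] * bases[k].

    alpha : (C, K) per-component scalar weights
    bases : (K, D, M) shared base matrices
    returns (C, D, M) reconstructed extractor blocks
    """
    return np.einsum('ck,kdm->cdm', alpha, bases)

def param_counts(C, D, M, K):
    original = C * D * M          # full i-vector extractor
    factorized = K * (D * M + C)  # K bases plus C*K scalar weights
    return original, factorized
```

With the assumed configuration, the original extractor has about 49 million parameters and the factorized one about 6.5 million, a roughly 7.5-fold reduction.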
2.3 Discriminatively Trained Factorized i-vector Extractor
In our previous work, the discriminative training of $\mathbf{T}$ was based on using a multi-class logistic regression with parameters $\mathbf{W}$ and $\mathbf{b}$ as a classifier (classifying speakers), with both the extractor and the classifier being optimized w.r.t. the categorical cross-entropy as the objective function (also depicted in Fig. 2):

$$E = -\sum_{n=1}^{N}\sum_{s=1}^{S} t_{sn} \ln y_{sn},$$

where $t_{sn}$ is the $s$-th element of the target variable $\mathbf{t}_n$ in 1-of-K coding, $S$ is the number of speakers (classes), $N$ is the number of training samples, and $y_{sn} = P(s \mid \boldsymbol{\phi}_n)$ is the posterior probability (parametrized by the logistic regression) of speaker $s$ given the $n$-th utterance. For the purpose of this work, let us treat the i-vector $\boldsymbol{\phi}_n$ as a function of $\mathbf{T}$.
A generatively trained i-vector extractor was used as the initialization. In this work, we continue using this framework with some adjustments. Let us generalize the optimization objective by adding an $L^2$ regularizer:

$$E_{\mathrm{reg}} = E + \lambda\, d,$$

where $d$ is the Euclidean distance between our factorized matrix $\mathbf{T}_{\mathrm{f}}$ and the original generatively trained matrix $\mathbf{T}$.
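The regularized objective can be sketched in NumPy as follows. This is a toy illustration: the parameter names are our own, and we use the squared Euclidean distance for the penalty term as one common choice; the paper's exact form of the distance may differ.

```python
import numpy as np

def softmax(z):
    """Numerically stable row-wise softmax."""
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def regularized_objective(phi, labels, W, b, T_fact, T_orig, lam):
    """Multi-class cross-entropy over speakers plus an L2 pull
    towards the generatively trained extractor.

    phi    : (N, M) i-vectors;  labels : (N,) speaker indices
    W, b   : logistic-regression parameters, shapes (M, S) and (S,)
    T_fact, T_orig : current factorized and original extractor matrices
    lam    : regularization weight
    """
    y = softmax(phi @ W + b)                        # (N, S) speaker posteriors
    xent = -np.log(y[np.arange(len(labels)), labels]).sum()
    penalty = lam * np.sum((T_fact - T_orig) ** 2)  # squared Euclidean distance
    return xent + penalty
```

With $\lambda = 0$ this reduces to the plain cross-entropy objective of the previous section.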
We used two training schemes, which differ in the initialization and in the regularizing factor. In the scheme-1 initialization, we select as bases the $K$ eigen-vectors (corresponding to the largest eigen-values) of the covariance matrix of the vectorized $\mathbf{T}_c$'s ($C$ vectors of $DM$ dimensionality). The weights $\alpha_{c,k}$ are then computed as the solution of the system of equations $\mathbf{T}_c = \sum_k \alpha_{c,k}\mathbf{B}_k$. For this scheme, we globally set $\lambda = 0$. In phase-1 of this scheme, only the classifier is trained for several epochs, until convergence on a cross-validation set is reached. Then, in phase-2, both the classifier and the extractor (represented by $\alpha_{c,k}$ and $\mathbf{B}_k$) are retrained until convergence on the cross-validation set is reached.
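The scheme-1 initialization can be sketched as follows. This is an illustrative NumPy version under our assumptions: we take the top eigenvectors of the covariance of the vectorized component blocks as bases and obtain the weights by least squares; the paper's exact numerical procedure may differ.

```python
import numpy as np

def eigen_init(T_blocks, K):
    """Scheme-1 initialization of the factorized extractor.

    Bases: top-K eigenvectors of the covariance of the vectorized
    component blocks. Weights: least-squares solution of
    T_c = sum_k alpha[c, k] * B_k.

    T_blocks : (C, D, M) generatively trained extractor blocks
    returns  : alpha (C, K), bases (K, D, M)
    """
    C, D, M = T_blocks.shape
    X = T_blocks.reshape(C, D * M)      # C vectors of D*M dimensionality
    Xc = X - X.mean(axis=0)
    cov = Xc.T @ Xc / C                 # (D*M, D*M) covariance matrix
    w, V = np.linalg.eigh(cov)          # eigenvalues in ascending order
    B = V[:, -K:][:, ::-1].T            # (K, D*M), largest eigenvalue first
    # Solve B^T alpha_c = x_c for every component c in one call
    alpha, *_ = np.linalg.lstsq(B.T, X.T, rcond=None)
    return alpha.T, B.reshape(K, D, M)
```

For realistic dimensionalities the $DM \times DM$ covariance is large, so in practice one would compute the eigenvectors from the $C \times C$ Gram matrix instead; the small version above keeps the idea readable.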
In scheme-2, we started from a random initialization, and for the first epoch (phase-0), $\lambda$ was set to a large number. After that, $\lambda$ was set to zero for the rest of the training, and phase-1 and phase-2 copied those of scheme-1.
We experimented with different $\lambda$-progression schemes (exponential decrease, a lower constant value during the whole training, etc.); however, we discovered that one epoch was enough to reach the minimal distance to $\mathbf{T}$. More epochs or learning-rate decay did not bring any significant improvement, either in the distance $d$ or in the final EERs. In general, we used the stochastic gradient descent algorithm for parameter optimization.
3 System Setup
We used the PRISM training dataset definition, without added noise or reverberation, to train the UBM and the i-vector extractor. The set comprises Fisher 1 and 2, Switchboard phases 2 and 3, and Switchboard cellphone phases 1 and 2, along with a set of Mixer speakers. This includes the 66 held-out speakers from SRE10 (see Section III-B5 of the PRISM paper), and 965, 980, 485 and 310 speakers from SRE08, SRE06, SRE05 and SRE04, respectively. A total of 13,916 speakers are available in the Fisher data and 1,991 in the Switchboard data.
Two variants of gender-independent PLDA models were trained: one on the clean training data, while the second one also included artificially added mixes of noises and reverberation. The proportion of artificially corrupted segments to clean segments in the PLDA training set is detailed in Sec. 3.2.
We evaluated our systems on the female portions of NIST SRE 2010 (tel-tel, int-int and int-mic) and PRISM (prism,noi, prism,rev and prism,chn; see Section III.B of the PRISM paper), where tel-tel and prism,chn represent telephone speech, int-int and int-mic represent interview speech, and prism,noi and prism,rev represent speech artificially corrupted with noise and reverberation, respectively.
Additionally, we used the Core-Core condition from the SITW challenge (sitw-core-core). The SITW dataset is a large collection of real-world data containing speech from individuals across a wide array of challenging acoustic and environmental conditions.
We also tested on NIST SRE 2016, splitting the trial set by language into Tagalog (sre16-tgl-f) and Cantonese (sre16-yue-f). We used only female trials (both single- and multi-session). We did not use the SRE'16 unlabeled development set in any way.
We randomly selected 500 utterances from 500 different speakers as a cross-validation set from the PRISM training dataset.
3.2 PLDA and i-vector Extractor Augmentation Sets
To extend the training set, we created new artificially corrupted training sets from the PRISM training set. In addition to noise and reverberation, the data were also augmented with randomly generated cuts. In our experiments, we used 30% of the original training data to generate cuts with durations between 3 and 5 seconds. The composition of the augmentation set is described in detail in our previous work.
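The random-cut augmentation can be sketched as follows. This is a toy illustration: the 100 frames-per-second rate and the uniform duration sampling are our assumptions, not details from the paper.

```python
import random

def make_random_cut(n_frames, min_dur=3.0, max_dur=5.0, frame_rate=100):
    """Pick a random contiguous chunk of min_dur..max_dur seconds
    (expressed in frames) from an utterance of n_frames frames.

    Returns (start, end) frame indices; utterances shorter than the
    sampled duration are kept whole.
    """
    dur = random.uniform(min_dur, max_dur)
    length = min(n_frames, int(dur * frame_rate))
    start = random.randint(0, n_frames - length)
    return start, start + length
```

In an augmentation pipeline, such cuts would be sampled for a random 30% subset of the training utterances.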
[Table 1: EER (%) of the evaluated systems; the left block corresponds to PLDA trained on clean data, the right block to PLDA trained with extension data.]
4 Experiments and Discussion
One of the issues we had to solve before we could even begin experimenting with the factorized model was its proper initialization. We present two different strategies for initialization and then experiment with the subsequent discriminative retraining of such models. We also provide comparisons with the generative baseline and with discriminative retraining of its full representation. In our experiments with factorization, we set the number of bases to 250. This means that the matrix $\mathbf{T}$ is represented by 7.5 times fewer parameters compared to the original model, and, when compared to the i-vector extractor block in Fig. 1, the number of parameters is almost halved. In all of our experiments, we set the i-vector subspace dimensionality to 400.
For clarity, we denote the different ways of obtaining the i-vector extractor as follows:
We trained a baseline i-vector extractor in the traditional generative way, using the original PRISM training corpus without any augmentations.
We initialized the bases of the factorized model with eigen-vectors.
We initialized the bases randomly and then ran a single epoch of training with the loss function in (9).
We discriminatively re-trained a full representation of the baseline generative i-vector extractor.
To avoid over-fitting of the classifier during discriminative training, it was necessary to filter the training data. We selected speakers with at least 5 utterances in the original data. This step limits the training data to 3493 speakers with 59112 utterances (177336 utterances including augmentation).
For all experiments, we kept the same PLDA configuration. The i-vectors are pre-processed with mean normalization and LDA (reducing the dimensionality to 200), and finally they are length-normalized.
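The pre-processing chain can be sketched in NumPy as follows. This is an illustrative version; the function name is our own, and the mean and LDA projection are assumed to be estimated beforehand on the PLDA training data.

```python
import numpy as np

def preprocess(ivecs, mean, lda):
    """i-vector pre-processing applied before PLDA scoring:
    mean normalization, LDA projection, then length normalization.

    ivecs : (N, M) raw i-vectors
    mean  : (M,)   training-set mean
    lda   : (M, R) LDA projection matrix (R = 200 in our setup)
    """
    x = (ivecs - mean) @ lda                          # center and project
    return x / np.linalg.norm(x, axis=1, keepdims=True)  # unit-length rows
```

Length normalization makes the i-vector distribution closer to Gaussian, which matches the assumptions of the PLDA backend.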
Our results in terms of EER are presented in Tab. 1, which is divided into two vertical blocks to provide a comparison between PLDA trained on clean data and multi-condition PLDA training, where the PLDA is trained also on augmented copies of its training data. We are primarily interested in the general robustness of our methods and therefore focus on overall performance across all conditions rather than looking closely into individual cases.
The table is also divided into three horizontal blocks based on the type of condition: telephone channel (tel-tel, sre16-yue-f), microphone (int-int, int-mic, prism,chn, sitw-core-core), and artificially created conditions (prism,noi, prism,rev). We did not use any type of adaptation, score normalization, or any other technique for improving results in the SRE16 and other conditions. For the two factorized systems, we also present the results of their initializations, before retraining (for scheme-2, after the first epoch with the penalty).
When we compare the baseline system (columns in the table) with the initialized models for discriminative training, we can see that one of the initializations is always significantly worse than the baseline, while the other one gives much better results, only slightly degraded compared to the baseline, indicating that we were able to represent the original i-vector model well.
Starting from the weaker initialization, we reach significant improvements with discriminative parameter re-estimation. Unfortunately, the results indicate that the model got stuck in a local minimum and was not able to improve to the level of the baseline.
The other initialization variant proved to be a significantly better starting point. After discriminative parameter re-estimation, the model obtained a slight improvement across all conditions with respect to its initialization, and it also achieved a slight improvement over the baseline or almost reached its performance.
Finally, we can compare the factorized systems with the discriminative retraining of the full i-vector extractor representation. The full model achieves the best overall performance (slightly better than the factorized one), but the factorized architecture offers approximately 4 times faster training with 7.5 times fewer parameters, which will allow us to further extend the model to also include the GMM-UBM representation.
In this work, we have presented a way of refining the discriminative training of the i-vector extractor from our previous work. We were able to slightly outperform the generative baseline. Our approach conveniently fits the current efforts of building fully end-to-end discriminative systems, and it provides a way of robustly initializing a large and important part of such a system. Needless to say, we have not created a new state-of-the-art system; however, we have prepared a solid platform for our further research. In our ongoing research, we will focus on the closed-form solution of the generative objective and on the direct estimation of the bases, which will be helpful for simpler initialization. We also plan to analyze the effect of the number of factors.
-  A. Lozano-Diez, A. Silnova, P. Matějka, O. Glembek, O. Plchot, J. Pešán, L. Burget, and J. Gonzalez-Rodriguez, “Analysis and Optimization of Bottleneck Features for Speaker Recognition,” in Proceedings of Odyssey 2016, vol. 2016, no. 06. International Speech Communication Association, 2016, pp. 352–357. [Online]. Available: http://www.fit.vutbr.cz/research/view_pub.php.cz.iso-8859-2?id=11219
-  Y. Lei, N. Scheffer, L. Ferrer, and M. McLaren, “A novel scheme for speaker recognition using a phonetically-aware deep neural network,” in 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), May 2014, pp. 1695–1699.
-  S. Novoselov, T. Pekhovsky, O. Kudashev, V. S. Mendelev, and A. Prudnikov, “Non-linear PLDA for i-vector speaker verification,” in Interspeech 2015, Sept 2015, pp. 214–218.
-  G. Bhattacharya, J. Alam, P. Kenny, and V. Gupta, “Modelling speaker and channel variability using deep neural networks for robust speaker verification,” in 2016 IEEE Spoken Language Technology Workshop, SLT 2016, San Diego, CA, USA, December 13-16, 2016.
-  O. Ghahabi and J. Hernando, “Deep belief networks for i-vector based speaker recognition,” in 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), May 2014, pp. 1700–1704.
-  E. Variani, X. Lei, E. McDermott, I. L. Moreno, and J. Gonzalez-Dominguez, “Deep neural networks for small footprint text-dependent speaker verification,” in 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), May 2014, pp. 4052–4056.
-  G. Heigold, I. Moreno, S. Bengio, and N. Shazeer, “End-to-end text-dependent speaker verification,” in 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), March 2016, pp. 5115–5119.
-  S. X. Zhang, Z. Chen, Y. Zhao, J. Li, and Y. Gong, “End-to-End attention based text-dependent speaker verification,” in 2016 IEEE Spoken Language Technology Workshop (SLT), Dec 2016, pp. 171–178.
-  D. Snyder, P. Ghahremani, D. Povey, D. Garcia-Romero, Y. Carmiel, and S. Khudanpur, “Deep neural network-based speaker embeddings for end-to-end speaker verification,” in 2016 IEEE Spoken Language Technology Workshop (SLT), Dec 2016, pp. 165–170.
-  G. Bhattacharya, J. Alam, and P. Kenny, “Deep Speaker Embeddings for Short-Duration Speaker Verification,” in Interspeech 2017, 08 2017, pp. 1517–1521.
-  D. Snyder, D. Garcia-Romero, D. Povey, and S. Khudanpur, “Deep Neural Network Embeddings for Text-Independent Speaker Verification,” in Interspeech 2017, Aug 2017.
-  D. Snyder, D. Garcia-Romero, G. Sell, D. Povey, and S. Khudanpur, “X-vectors: Robust DNN Embeddings for Speaker Recognition,” in Proceedings of ICASSP, 2018.
-  O. Novotny, O. Plchot, O. Glembek, L. Burget, and P. Matejka, “Discriminatively re-trained i-vector extractor for speaker recognition,” accepted to ICASSP 2019, 2019.
-  S. Axelrod, V. Goel, B. Kingsbury, K. Visweswariah, and R. A. Gopinath, “Large vocabulary conversational speech recognition with a subspace constraint on inverse covariance matrices,” in INTERSPEECH, 2003.
-  D. Povey, “SPAM and full covariance for speech recognition,” in INTERSPEECH 2006 - ICSLP, Ninth International Conference on Spoken Language Processing, Pittsburgh, PA, USA, September 17-21, 2006, 2006. [Online]. Available: http://www.isca-speech.org/archive/interspeech_2006/i06_2047.html
-  S. Cumani and P. Laface, “Factorized sub-space estimation for fast and memory effective i-vector extraction,” Audio, Speech, and Language Processing, IEEE/ACM Transactions on, vol. 22, pp. 248–259, 2014.
-  O. Glembek, L. Burget, N. Brümmer, O. Plchot, and P. Matějka, “Discriminatively Trained i-vector Extractor for Speaker Verification,” in Proceedings of Interspeech 2011, no. 8. International Speech Communication Association, 2011, pp. 137–140. [Online]. Available: http://www.fit.vutbr.cz/research/view_pub.php.cs?id=9752
-  O. Novotný, O. Plchot, P. Matějka, L. Mošner, and O. Glembek, “On the use of X-vectors for Robust Speaker Recognition,” in Proceedings of Odyssey 2018, no. 6. International Speech Communication Association, 2018, pp. 168–175. [Online]. Available: http://www.fit.vutbr.cz/research/view_pub.php.cs?id=11787
-  J. Rohdin, A. Silnova, M. Diez, O. Plchot, P. Matějka, and L. Burget, “End-to-end DNN based speaker recognition inspired by i-vector and PLDA,” in Proceedings of ICASSP. IEEE Signal Processing Society, 2018.
-  N. Dehak, P. Kenny, R. Dehak, P. Dumouchel, and P. Ouellet, “Front-End Factor Analysis For Speaker Verification,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 4, pp. 788–798, May 2011.
-  L. Ferrer, H. Bratt, L. Burget, H. Cernocky, O. Glembek, M. Graciarena, A. Lawson, Y. Lei, P. Matejka, O. Plchot, and N. Scheffer, “Promoting robustness for speaker modeling in the community: the PRISM evaluation set,” in Proceedings of SRE11 analysis workshop, Atlanta, Dec. 2011.
-  “National Institute of Standards and Technology,” http://www.nist.gov/speech/tests/spk/index.htm.
-  M. McLaren, L. Ferrer, D. Castan, and A. Lawson, “The Speakers in the Wild (SITW) Speaker Recognition Database,” in Interspeech 2016, 2016, pp. 818–822. [Online]. Available: http://dx.doi.org/10.21437/Interspeech.2016-1129
-  “The NIST year 2016 Speaker Recognition Evaluation Plan,” https://www.nist.gov/sites/default/files/documents/2016/10/