1 Introduction
In recent years, there have been many attempts to take advantage of neural networks (NNs) in speaker verification. Most of these attempts have replaced or improved one of the components of an i-vector + PLDA system (feature extraction, calculation of sufficient statistics, i-vector extraction, or PLDA) with a neural network. As examples, let us mention: using NN bottleneck features instead of conventional MFCC features
[1], NN acoustic models replacing Gaussian Mixture Models for the extraction of sufficient statistics
[2], and NNs for either complementing PLDA [3, 4] or replacing it [5]. More ambitiously, NNs that take the frame-level features of an utterance as input and directly produce an utterance-level representation—usually referred to as an embedding—have in the past two years almost replaced the generative i-vector approach in text-independent speaker recognition [6, 7, 8, 9, 10, 11, 12]. These embeddings are obtained by means of a pooling mechanism (for example, taking the mean) over the frame-wise outputs of one or more layers in the NN [6], or by the use of a recurrent NN [7]. An obvious advantage—compared to i-vectors—lies in a much smaller number of model parameters, which is typically around 10 million in the x-vector case [11, 12], compared to approximately 50 million parameters of an i-vector system (UBM and i-vector extractor combined). This results in a very fast and memory-efficient embedding extraction. A disadvantage of the x-vector framework can be seen in training, during which it is essential to massively augment the training data and split them into many rather short (2–5 second) examples.
In this work, we continue with our research from [13], where we kept the large parameter space of the generative i-vector extractor and focused on discriminative retraining of such a model. We were able to retain the model's robustness and even increase the SV performance by optimizing the model for discrimination between speakers—a task closely related to the final speaker verification. However, the memory requirements and large computational cost during training not only limited us in running experiments effectively, but, more importantly, prevented us from pursuing our research goal, which is to include this model in a larger DNN scheme that is closer to an end-to-end system.
To solve our problem, we had to drastically decrease the number of trainable model parameters, but, of course, without a major decrease in performance. In the past, others have dealt with the same issue and experimented with factorization of similar or even the same models as ours. In 2003, the Subspace Precision and Mean (SPAM) model for acoustic modeling in speech recognition was introduced in [14] and later optimized by Daniel Povey in [15]. SPAM models are Gaussian mixture models with a subspace constraint, where each covariance matrix is represented as a weighted sum of globally shared full-rank matrices. In 2014, Sandro Cumani proposed an i-vector extractor factorization [16] for faster i-vector extraction and a smaller memory footprint, where each row of the i-vector extractor matrix is represented as a linear combination of the atoms of a common dictionary, under the assumption that it is not necessary to store all rows of this matrix to perform i-vector extraction.
In our approach to factorization, we were inspired by [16], but instead of factorizing each row, we perform the factorization at the level of the sub-matrices of the i-vector extractor that represent the individual GMM-UBM components. Our motivation is also different, as we aim to greatly decrease the memory footprint and thereby substantially speed up the discriminative training. For now, we ignore the possible i-vector extraction speed-up.
To finally obtain a discriminative i-vector extractor, we still use the same strategy as in the x-vector framework [6, 10, 11]: we retrain the NN representation of our factorized generative model to optimize the multi-class cross-entropy over a set of training speakers. This is in contrast with our previous research [17], where we optimized the binary cross-entropy over verification trials formed by pairs of i-vectors. We show that, with such an approach, we can achieve a reasonable performance. Our results are perhaps not as competitive as those achieved with current state-of-the-art x-vector systems [18]; nevertheless, we are now closer to our goal, which is to further use this model in a fully end-to-end discriminative system [19] that can be initialized from a robust generative baseline.
In order to compare both approaches (generative and discriminative) on a speaker verification task, both versions of i-vectors were extracted and used in a standard generative PLDA backend.
2 Theoretical Background
In [19], we built an end-to-end system (Fig. 1) that already seemingly fits our goal, but it was precisely the i-vector extractor component that posed the biggest challenge, and we had to resort to ad-hoc simplifications, such as PCA-based dimensionality reduction of the high-dimensional sufficient statistics coming from the GMM-UBM. Our approach was to represent a standard generative i-vector-based SV system as a series of "elementary" feed-forward NNs, each representing an individual i-vector building block (e.g., GMM-UBM, i-vector extractor, PLDA classifier). In the beginning, each NN was trained separately to mimic the equivalent block from the generative training. After this "initialization", all blocks were connected and jointly retrained.
In this paper, we focus on the i-vector extractor block and its effective discriminative retraining. We still keep the generative GMM-UBM and PLDA models.
2.1 i-vector Baseline
The i-vectors [20] provide a way of reducing large-dimensional input data to a low-dimensional feature vector while retaining most of the relevant information. The main principle is that the utterance-dependent Gaussian Mixture Model (GMM) supervector of concatenated mean vectors lies in a low-dimensional subspace—defined by a $CF \times M$ matrix $\mathbf{T}$, commonly referred to as an i-vector extractor, with $C$ being the number of GMM components, $F$ being the feature dimensionality, and $M$ being the subspace dimensionality—and whose coordinates are given by the ($M$-dimensional) i-vector $\boldsymbol{\phi}$. The closed-form solution for computing the i-vector can be expressed as a function of the zero- and first-order GMM statistics $n_c$ and $\mathbf{f}_c$, where

$$n_c = \sum_t \gamma_c(t), \tag{1}$$
$$\mathbf{f}_c = \sum_t \gamma_c(t)\,\mathbf{o}_t, \tag{2}$$

where $\gamma_c(t)$
is the posterior (or occupation) probability of frame $\mathbf{o}_t$
being generated by the mixture component $c$. The i-vector is then computed as

$$\boldsymbol{\phi} = \mathbf{L}^{-1} \bar{\mathbf{T}}^{\top} \bar{\mathbf{f}}, \tag{3}$$

with

$$\mathbf{L} = \mathbf{I} + \sum_c n_c \bar{\mathbf{T}}_c^{\top} \bar{\mathbf{T}}_c, \tag{4}$$

where $\bar{\mathbf{T}}_c$ and $\bar{\mathbf{f}}_c$ are the "normalized" variants of $\mathbf{T}_c$ and $\mathbf{f}_c$, respectively:

$$\bar{\mathbf{T}}_c = \boldsymbol{\Sigma}_c^{-1/2}\, \mathbf{T}_c, \tag{5}$$
$$\bar{\mathbf{f}}_c = \boldsymbol{\Sigma}_c^{-1/2} \left(\mathbf{f}_c - n_c \boldsymbol{\mu}_c\right), \tag{6}$$

and $\boldsymbol{\Sigma}_c^{-1/2}$ is a symmetrical decomposition (such as Cholesky) of the inverse of the GMM-UBM covariance matrix $\boldsymbol{\Sigma}_c$.
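The statistics-to-i-vector computation of Eqs. (1)–(6) can be sketched in a few lines of NumPy. The sizes and all parameter values below are toy random stand-ins of our choosing, not the paper's configuration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes (the real UBM is far larger): C components, F-dim features,
# M-dim i-vector subspace. All values are random stand-ins.
C, F, M = 8, 6, 4
T_bar = 0.1 * rng.standard_normal((C, F, M))  # normalized extractor blocks, Eq. (5)
n = 10.0 * rng.random(C)                      # zero-order statistics, Eq. (1)
f_bar = rng.standard_normal((C, F))           # normalized first-order stats, Eq. (6)

# Eq. (4): L = I + sum_c n_c * T_bar_c' T_bar_c
L = np.eye(M) + sum(n[c] * T_bar[c].T @ T_bar[c] for c in range(C))

# Eq. (3): phi = L^{-1} T_bar' f_bar, accumulated component by component.
rhs = sum(T_bar[c].T @ f_bar[c] for c in range(C))
phi = np.linalg.solve(L, rhs)  # the M-dimensional i-vector
```

Note that $\mathbf{L}$ is always symmetric positive definite (identity plus non-negatively weighted Gram matrices), so the solve is well posed.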
2.2 Factorization of the i-vector Extractor
In this work, we propose to factorize each component sub-matrix $\mathbf{T}_c$ as

$$\mathbf{T}_c = \sum_{k=1}^{K} \alpha_c^{(k)} \mathbf{B}_k, \tag{7}$$

where $K$ is the number of factors, $\mathbf{B}_k$ are the base matrices, and $\alpha_c^{(k)}$ are scalar weights for each component $c$. Note that the bases $\mathbf{B}_k$ are shared across all components $c$. The number of parameters in this new model representation is $KFM + CK$, while the number of parameters in the original i-vector extractor was $CFM$. Since there is no general requirement of linear independence for the individual $\mathbf{T}_c$ matrices in the original i-vector concept, $K$ would have to be equal to $C$ in order for the factorized model to fully describe the original subspace $\mathbf{T}$. However, our assumption is that there is, in fact, some level of linear dependency, and therefore $K$ can be chosen significantly smaller than $C$, reducing the original model parameter space.
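A minimal sketch of the factorization in (7), with toy dimensions of our choosing, shows both the reconstruction of the $\mathbf{T}_c$ blocks from the shared bases and the parameter-count comparison:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy sizes: C components, F x M component sub-matrices, K shared bases.
C, F, M, K = 16, 6, 4, 5
B = rng.standard_normal((K, F, M))   # shared bases B_k
alpha = rng.standard_normal((C, K))  # per-component scalar weights alpha_c^(k)

# Eq. (7): each T_c is a weighted sum of the shared bases.
T = np.einsum('ck,kfm->cfm', alpha, B)  # shape (C, F, M)

# Parameter counts: K*F*M + C*K (factorized) versus C*F*M (original).
factorized = K * F * M + C * K
original = C * F * M
```

With $K < C$ the factorized count is smaller; here 200 parameters stand in for the original 384.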
2.3 Discriminatively Trained Factorized i-vector Extractor
In our previous work [13], discriminative training of $\mathbf{T}$ was based on using a multi-class logistic regression with parameters $\mathbf{W}$ as a classifier (classifying speakers), both being optimized with the categorical cross-entropy as the objective function (also depicted in Fig. 2):

$$E = -\sum_{n=1}^{N}\sum_{s=1}^{S} t_{sn} \ln P(s \mid \boldsymbol{\phi}_n, \mathbf{W}), \tag{8}$$

where $t_{sn}$ is the $s$-th element of the target variable for the $n$-th sample in 1-of-K coding, $S$ is the number of speakers (classes), $N$ is the number of training samples, and $P(s \mid \boldsymbol{\phi}_n, \mathbf{W})$ is the posterior probability (parametrized by the logistic regression $\mathbf{W}$) of speaker $s$ given the $n$-th utterance. For the purpose of this work, let us treat the i-vector $\boldsymbol{\phi}_n$ as a function of $\mathbf{T}$. The generatively trained i-vector extractor was used as the initialization. In this work, we continue using this framework with some adjustments. Let us generalize the optimization objective by adding an $L_2$ regularizer:

$$E' = E + \lambda\, \lVert \mathbf{T}' - \mathbf{T} \rVert^2, \tag{9}$$

where $\lVert \mathbf{T}' - \mathbf{T} \rVert$ is the Euclidean distance between our factorized matrix $\mathbf{T}'$ and the original generatively trained matrix $\mathbf{T}$.
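The combined objective of (8) and (9) can be sketched as follows. The toy dimensions, the plain softmax classifier, and all values below are illustrative assumptions; in the real system the i-vectors are themselves a function of the extractor parameters and the gradient flows through them:

```python
import numpy as np

rng = np.random.default_rng(2)

S, N, M = 5, 20, 4                     # speakers, utterances, i-vector dim (toy)
W = 0.1 * rng.standard_normal((M, S))  # logistic-regression parameters
phi = rng.standard_normal((N, M))      # i-vectors (functions of T in the real model)
labels = rng.integers(0, S, size=N)    # true speaker per utterance

def objective(phi, labels, W, T_fact, T_orig, lam):
    # Eq. (8): multi-class cross-entropy over training speakers.
    logits = phi @ W
    logp = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    ce = -logp[np.arange(len(labels)), labels].sum()
    # Eq. (9): add lambda times the squared Euclidean distance between
    # the factorized extractor and the generatively trained one.
    return ce + lam * np.sum((T_fact - T_orig) ** 2)

T_orig = rng.standard_normal((8, 3))                  # stand-in for T
T_fact = T_orig + 0.01 * rng.standard_normal((8, 3))  # stand-in for T'
loss = objective(phi, labels, W, T_fact, T_orig, lam=1.0)
```

Setting `lam=0` recovers the pure cross-entropy objective used once the regularized phase is over.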
We used two training schemes, which differ in the initialization and in the regularization factor $\lambda$. In the scheme-1 initialization, we select the $K$ eigenvectors (those with the largest eigenvalues) of the covariance matrix of the vectorized $\mathbf{T}_c$'s ($C$ vectors of $FM$ dimensionality) as the bases $\mathbf{B}_k$. The weights $\alpha_c^{(k)}$ are computed as the solution of the system of equations given by (7). For this scheme, we globally set $\lambda = 0$. In phase 1 of this scheme, only the classifier $\mathbf{W}$ is trained for several epochs, until convergence on a cross-validation set is reached. Then, in phase 2, both the classifier $\mathbf{W}$ and the extractor (represented by $\mathbf{B}_k$ and $\alpha_c^{(k)}$) are retrained until convergence on the cross-validation set is reached. In scheme 2, we started from a random initialization, and for the first epoch (phase 0), $\lambda$ was set to a large number. After that, $\lambda$ was set to zero for the rest of the training, and phases 1 and 2 copied those of scheme 1.

We experimented with different $\lambda$ progression schemes (exponential decrease, a lower constant value during the whole training, etc.); however, we discovered that one epoch was enough to reach the minimal distance to the original matrix $\mathbf{T}$. More epochs or learning-rate decrease brought no significant improvement, neither in this distance nor in the final EERs.

In general, we used the stochastic gradient descent algorithm for parameter optimization.
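The scheme-1 initialization described above can be sketched as follows, with toy sizes of our choosing (in practice $C$, $F$, and $M$ are much larger, and with $K$ smaller than the number of components the weight system is solved in the least-squares sense):

```python
import numpy as np

rng = np.random.default_rng(3)

C, F, M, K = 32, 5, 4, 6                 # toy sizes
T = rng.standard_normal((C, F, M))       # generatively trained T_c blocks
V = T.reshape(C, F * M)                  # C vectors of F*M dimensionality

# Bases = top-K eigenvectors of the covariance of the vectorized T_c's.
cov = np.cov(V, rowvar=False)            # (F*M, F*M) covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)
B = eigvecs[:, np.argsort(eigvals)[::-1][:K]]   # (F*M, K), largest eigenvalues first

# Weights solve V_c ~= B @ alpha_c for each component (least squares).
alpha, *_ = np.linalg.lstsq(B, V.T, rcond=None)  # (K, C)
V_hat = (B @ alpha).T                            # reconstructed vectorized T_c's
```

The reconstruction `V_hat` is the projection of each component's vectorized sub-matrix onto the span of the chosen bases; with `K` equal to the full rank it would be exact.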
3 System Setup
3.1 Datasets
We used the PRISM [21] training dataset definition, without added noise or reverberation, to train the UBM and the i-vector extractor. The set comprises Fisher 1 and 2, Switchboard phases 2 and 3, and Switchboard Cellular phases 1 and 2, along with a set of Mixer speakers. This includes the 66 held-out speakers from SRE10 (see Section III-B5 of [21]) and 965, 980, 485, and 310 speakers from SRE08, SRE06, SRE05, and SRE04, respectively. A total of 13,916 speakers are available in the Fisher data and 1,991 in the Switchboard data.
Two variants of gender-independent PLDA models were trained: one on the clean training data, while the second also included artificially added mixes of noise and reverberation; see the details in Sec. 3.2.
We evaluated our systems on the female portions of NIST SRE 2010 [22] (tel-tel, int-int, and int-mic) and PRISM (prism,noi, prism,rev, and prism,chn; see Section III.B of [21]), where tel-tel and prism,chn represent telephone speech, int-int and int-mic represent interview speech, and prism,noi and prism,rev represent speech artificially corrupted with noise and reverberation.
Additionally, we used the Core-Core condition from the SITW challenge (sitw-core-core). The SITW [23] dataset is a large collection of real-world data exhibiting speech from individuals across a wide array of challenging acoustic and environmental conditions.
We also test on NIST SRE 2016 [24], but we split the trial set by language into Tagalog (sre16-tgl-f) and Cantonese (sre16-yue-f). We use only female trials (both single- and multi-session). We did not use the SRE'16 unlabeled development set in any way.
As a cross-validation set, we randomly selected 500 utterances from 500 different speakers of the PRISM training dataset.
3.2 PLDA and i-vector Extractor Augmentation Sets
To extend the training set, we created new artificially corrupted training sets from the PRISM training set. In addition to noise and reverberation, the data were also augmented with randomly generated cuts: in our experiments, we used 30% of the original training data to generate cuts with durations between 3 and 5 seconds. The composition of the augmentation set is described in detail in [18].
Table 1: Results in terms of EER [%]. In each PLDA block, the columns are: the generative baseline (base); the randomly initialized factorized model after the first regularized epoch (r-init) and after full discriminative retraining (r-ret); the eigenvector-initialized factorized model before (e-init) and after (e-ret) retraining; and the discriminatively retrained full extractor (full) [13].

               |                  PLDA clean                   |              PLDA extension data
Condition      | base  | r-init | r-ret | e-init | e-ret | full | base  | r-init | r-ret | e-init | e-ret | full
tel-tel        | 2.23  | 8.39   | 3.9   | 2.47   | 2.2   | 1.97 | 3.36  | 9.72   | 4.91  | 3.52   | 3.3   | 3.25
sre16-yue-f    | 10.9  | 17.39  | 12.79 | 11.29  | 10.96 | 10.97| 11.32 | 17.18  | 12.18 | 11.42  | 11.11 | 10.87
int-int        | 4.72  | 9.56   | 5.57  | 4.74   | 4.51  | 4.37 | 4.83  | 10.18  | 5.94  | 4.96   | 4.67  | 4.56
int-mic        | 2.15  | 5.27   | 2.69  | 2.23   | 2.18  | 2.11 | 2.02  | 5.67   | 2.65  | 2.28   | 2.1   | 1.91
prism,chn      | 1.13  | 5.63   | 2.25  | 0.92   | 0.83  | 0.88 | 1.14  | 5.95   | 1.98  | 1.11   | 1.12  | 1.14
sitw-core-core | 10.51 | 17.97  | 12.35 | 10.92  | 10.4  | 10.29| 10.57 | 17.54  | 12.33 | 10.84  | 10.47 | 10.21
prism,noi      | 4.34  | 11.74  | 6.15  | 4.6    | 4.29  | 3.97 | 3.66  | 10.73  | 5.27  | 4.04   | 3.73  | 3.44
prism,rev      | 2.81  | 8.59   | 3.67  | 2.84   | 2.49  | 2.54 | 2.45  | 7.25   | 3.17  | 2.49   | 2.3   | 2.34
4 Experiments and Discussion
One of the issues we had to solve before we could even begin experimenting with the factorized model was its proper initialization. We present two different strategies for initialization, and then we experiment with subsequent discriminative retraining of such models. We also provide comparisons with the generative baseline and with discriminative retraining of its full representation. In our experiments with factorization, we set the number of bases $K$ to 250. This means that the matrix $\mathbf{T}$ is represented by 7.5 times fewer parameters than the original model, and, compared to the i-vector extractor block from [19] in Fig. 1, the number of parameters is almost halved. In all of our experiments, we set the i-vector subspace dimensionality $M$ to 400.
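The 7.5-fold reduction can be checked with a worked parameter count. The UBM size $C = 2048$ and feature dimensionality $F = 60$ below are our assumption of a typical configuration (only $M = 400$ and $K = 250$ are stated above):

```python
# Assumed configuration: C components, F-dim features (our assumption),
# M-dim subspace and K bases (as stated in the text).
C, F, M, K = 2048, 60, 400, 250

original = C * F * M            # full i-vector extractor T: ~49.2 M parameters
factorized = K * F * M + C * K  # shared bases + per-component weights: ~6.5 M

ratio = original / factorized   # roughly 7.5
```

Under these assumed sizes, 49,152,000 original parameters shrink to 6,512,000, a factor of about 7.5, matching the figure quoted above.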
For clarity, we denote the different ways of obtaining the i-vector extractor as follows:

- Generative baseline: we trained a baseline i-vector extractor in the traditional generative way, using the original PRISM training corpus without any augmentations.

- Eigenvector initialization: we initialized the bases of the factorized model with eigenvectors, as described in Sec. 2.3.

- Eigenvector initialization with retraining: we initialized the bases with eigenvectors as above, and then continued training with the loss function from (8) and the two-phase training described in Sec. 2.3.

- Random initialization with retraining: we initialized the bases randomly, ran a single epoch of training with the loss function in (9), and then continued with the training phases described in Sec. 2.3.

- Full retraining: we discriminatively retrained the full representation of the baseline generative i-vector extractor [13].
To avoid overfitting of the classifier during discriminative training, it was necessary to filter the training data: we selected speakers with at least 5 utterances in the original data. This step limits the training data to 3,493 speakers with 59,112 utterances (177,336 utterances including augmentation).
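The filtering step amounts to a simple utterance count per speaker; the speaker and utterance IDs below are made up for illustration:

```python
from collections import Counter

# Hypothetical training list: (speaker_id, utterance_id) pairs.
utts = [("spk_a", f"u{i}") for i in range(6)] + \
       [("spk_b", f"u{i}") for i in range(5)] + \
       [("spk_c", f"u{i}") for i in range(3)]   # fewer than 5: dropped

counts = Counter(spk for spk, _ in utts)
kept = [(spk, u) for spk, u in utts if counts[spk] >= 5]
speakers = {spk for spk, _ in kept}  # {'spk_a', 'spk_b'}
```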
For all experiments, we kept the same PLDA configuration. The i-vectors are pre-processed with mean normalization and LDA (reducing the i-vectors to 200 dimensions), and finally they are length-normalized.
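The back-end pre-processing chain (mean normalization, LDA, length normalization) can be sketched as follows. The data are random stand-ins, and the LDA solver is a plain scatter-matrix implementation rather than whatever toolkit was actually used:

```python
import numpy as np

rng = np.random.default_rng(4)

# Toy setup: N i-vectors of dimension M, reduced to D dimensions by LDA.
N, M, D = 200, 40, 10
X = rng.standard_normal((N, M))
y = rng.integers(0, 20, size=N)          # speaker label per i-vector

# 1) Mean normalization.
mean = X.mean(axis=0)
Xc = X - mean

# 2) LDA: maximize between-class over within-class scatter.
Sw = np.zeros((M, M))
Sb = np.zeros((M, M))
for c in np.unique(y):
    Xk = Xc[y == c]
    mk = Xk.mean(axis=0)
    Sw += (Xk - mk).T @ (Xk - mk)
    Sb += len(Xk) * np.outer(mk, mk)
eigvals, eigvecs = np.linalg.eig(np.linalg.solve(Sw, Sb))
order = np.argsort(eigvals.real)[::-1][:D]
lda = eigvecs[:, order].real             # (M, D) projection

# 3) Length normalization to unit L2 norm.
Z = Xc @ lda
Z = Z / np.linalg.norm(Z, axis=1, keepdims=True)
```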
Our results in terms of EER are presented in Tab. 1, which is divided into two vertical blocks to provide a comparison between PLDA trained on the clean data and multi-condition PLDA training, where the PLDA is also trained on augmented copies of its training data. We are primarily interested in the general robustness of our methods, and we therefore focus on the overall performance across all conditions rather than looking closely into individual cases.
The table is also divided into three horizontal blocks based on the type of condition: telephone channel (tel-tel, sre16-yue-f), microphone (int-int, int-mic, prism,chn, sitw-core-core), and artificially created conditions (prism,noi, prism,rev). We did not use any adaptation, score normalization, or any other technique for improving the results in the SRE'16 or other conditions. For the two factorized systems, we also present results for their initializations, before they were retrained (in the random case, after the first epoch with the regularization penalty).
When we compare the baseline system with the results obtained with the initialized models for discriminative training, we can see that the random initialization is always significantly worse than the baseline. The eigenvector initialization yields much better results, only slightly degraded compared to the baseline, indicating that we were able to represent the original i-vector model well.
We can see that, starting from the random initialization, we reach significant improvements with discriminative parameter re-estimation. Unfortunately, these results indicate that the model got stuck in a local minimum and was not able to improve to the level of the baseline.
The eigenvector initialization proved to be a significantly better starting point. After discriminative parameter re-estimation, the model obtained a slight improvement across all conditions with respect to its initialization, and it also achieved a slight improvement over the baseline or almost reached its performance.
Finally, we can compare the factorized models with the discriminative retraining of the full i-vector representation [13]. The full model achieves the best overall performance (slightly better than the retrained eigenvector-initialized factorized model), but the factorized architecture offers approximately 4 times faster training with 7.5 times fewer parameters, which will allow us to further extend the model to also include the GMM-UBM representation.
5 Conclusion
In this work, we have presented a way of refining the discriminative training of an i-vector extractor from our previous work. We were able to slightly outperform the generative baseline. Our approach conveniently fits the current efforts of building fully end-to-end discriminative systems, and it provides a way of robustly initializing a large and important part of such a system. Needless to say, we have not created a new state-of-the-art system; however, we have prepared a solid platform for our further research. In our ongoing research, we will focus on the closed-form solution of the generative objective and the direct estimation of the factorized parameters, which will be helpful for simpler initialization. We also plan to analyze the effect of the number of factors $K$.
References
 [1] A. Lozano-Diez, A. Silnova, P. Matějka, O. Glembek, O. Plchot, J. Pešán, L. Burget, and J. Gonzalez-Rodriguez, "Analysis and Optimization of Bottleneck Features for Speaker Recognition," in Proceedings of Odyssey 2016, vol. 2016, no. 06. International Speech Communication Association, 2016, pp. 352–357. [Online]. Available: http://www.fit.vutbr.cz/research/view_pub.php.cz.iso88592?id=11219
 [2] Y. Lei, N. Scheffer, L. Ferrer, and M. McLaren, "A novel scheme for speaker recognition using a phonetically-aware deep neural network," in 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), May 2014, pp. 1695–1699.
 [3] S. Novoselov, T. Pekhovsky, O. Kudashev, V. S. Mendelev, and A. Prudnikov, "Non-linear PLDA for i-vector speaker verification," in 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2015, pp. 214–218.
 [4] G. Bhattacharya, J. Alam, P. Kenny, and V. Gupta, "Modelling speaker and channel variability using deep neural networks for robust speaker verification," in 2016 IEEE Spoken Language Technology Workshop, SLT 2016, San Diego, CA, USA, December 13-16, 2016.
 [5] O. Ghahabi and J. Hernando, "Deep belief networks for i-vector based speaker recognition," in 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), May 2014, pp. 1700–1704.
 [6] E. Variani, X. Lei, E. McDermott, I. L. Moreno, and J. Gonzalez-Dominguez, "Deep neural networks for small footprint text-dependent speaker verification," in 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), May 2014, pp. 4052–4056.
 [7] G. Heigold, I. Moreno, S. Bengio, and N. Shazeer, "End-to-end text-dependent speaker verification," in 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), March 2016, pp. 5115–5119.
 [8] S. X. Zhang, Z. Chen, Y. Zhao, J. Li, and Y. Gong, "End-to-End attention based text-dependent speaker verification," in 2016 IEEE Spoken Language Technology Workshop (SLT), Dec 2016, pp. 171–178.
 [9] D. Snyder, P. Ghahremani, D. Povey, D. Garcia-Romero, Y. Carmiel, and S. Khudanpur, "Deep neural network-based speaker embeddings for end-to-end speaker verification," in 2016 IEEE Spoken Language Technology Workshop (SLT), Dec 2016, pp. 165–170.
 [10] G. Bhattacharya, J. Alam, and P. Kenny, "Deep Speaker Embeddings for Short-Duration Speaker Verification," in Interspeech 2017, Aug 2017, pp. 1517–1521.
 [11] D. Snyder, D. Garcia-Romero, D. Povey, and S. Khudanpur, "Deep Neural Network Embeddings for Text-Independent Speaker Verification," in Interspeech 2017, Aug 2017.
 [12] D. Snyder, D. Garcia-Romero, G. Sell, D. Povey, and S. Khudanpur, "X-vectors: Robust DNN Embeddings for Speaker Recognition," in Proceedings of ICASSP, 2018.
 [13] O. Novotny, O. Plchot, O. Glembek, L. Burget, and P. Matejka, "Discriminatively retrained i-vector extractor for speaker recognition," accepted to ICASSP 2019, 2019.
 [14] S. Axelrod, V. Goel, B. Kingsbury, K. Visweswariah, and R. A. Gopinath, “Large vocabulary conversational speech recognition with a subspace constraint on inverse covariance matrices,” in INTERSPEECH, 2003.
 [15] D. Povey, "SPAM and full covariance for speech recognition," in INTERSPEECH 2006 - ICSLP, Ninth International Conference on Spoken Language Processing, Pittsburgh, PA, USA, September 17-21, 2006, 2006. [Online]. Available: http://www.iscaspeech.org/archive/interspeech_2006/i06_2047.html
 [16] S. Cumani and P. Laface, "Factorized subspace estimation for fast and memory effective i-vector extraction," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 22, pp. 248–259, 2014.
 [17] O. Glembek, L. Burget, N. Brümmer, O. Plchot, and P. Matějka, "Discriminatively Trained i-vector Extractor for Speaker Verification," in Proceedings of Interspeech 2011, no. 8. International Speech Communication Association, 2011, pp. 137–140. [Online]. Available: http://www.fit.vutbr.cz/research/view_pub.php.cs?id=9752
 [18] O. Novotný, O. Plchot, P. Matějka, L. Mošner, and O. Glembek, "On the use of X-vectors for Robust Speaker Recognition," in Proceedings of Odyssey 2018, no. 6. International Speech Communication Association, 2018, pp. 168–175. [Online]. Available: http://www.fit.vutbr.cz/research/view_pub.php.cs?id=11787
 [19] J. Rohdin, A. Silnova, M. Diez, O. Plchot, P. Matějka, and L. Burget, "End-to-end DNN based speaker recognition inspired by i-vector and PLDA," in Proceedings of ICASSP. IEEE Signal Processing Society, 2018.
 [20] N. Dehak, P. Kenny, R. Dehak, P. Dumouchel, and P. Ouellet, "Front-End Factor Analysis For Speaker Verification," IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 4, pp. 788–798, May 2011.
 [21] L. Ferrer, H. Bratt, L. Burget, H. Cernocky, O. Glembek, M. Graciarena, A. Lawson, Y. Lei, P. Matejka, O. Plchot, and N. Scheffer, “Promoting robustness for speaker modeling in the community: the PRISM evaluation set,” in Proceedings of SRE11 analysis workshop, Atlanta, Dec. 2011.
 [22] “National Institute of Standards and Technology,” http://www.nist.gov/speech/tests/spk/index.htm.
 [23] M. McLaren, L. Ferrer, D. Castan, and A. Lawson, “The Speakers in the Wild (SITW) Speaker Recognition Database,” in Interspeech 2016, 2016, pp. 818–822. [Online]. Available: http://dx.doi.org/10.21437/Interspeech.20161129
 [24] "The NIST year 2016 Speaker Recognition Evaluation Plan," https://www.nist.gov/sites/default/files/documents/2016/10/07/sre16_eval_plan_v1.3.pdf, 2016.