In recent years, there have been many attempts to take advantage of neural networks (NNs) in speaker verification. Most of the attempts have replaced or improved one of the components of an i-vector + PLDA system (feature extraction, calculation of sufficient statistics, i-vector extraction or PLDA) with a neural network. As examples, let us mention: using NN bottleneck features instead of conventional MFCC features
, NN acoustic models replacing Gaussian Mixture Models for extraction of sufficient statistics, NNs for either complementing PLDA [3, 4] or replacing it . More ambitiously, NNs that take the frame level features of an utterance as input and directly produce an utterance level representation—usually referred to as an embedding—have in the past two years almost replaced the generative i-vector approach in text independent speaker recognition [6, 7, 8, 9, 10, 11, 12].
These embeddings are obtained by the means of pooling mechanism, for example taking the mean, over the frame-wise outputs of one or more layers in the NN , or by the use of a recurrent NN . An obvious advantage—compared to i-vectors—lies in a much smaller amount of model parameters, which is typically around 10 million in the x-vector case [11, 12] compared to the i-vector with approximately 50 million parameters for both UBM and i-vector extractor. This results in a very fast and memory efficient embedding extraction. A disadvantage of the x-vector framework can be seen in training during which it is essential to massively augment the training data and split them into many rather short (2–5 seconds) examples.
In this research, we keep the large parameter space from the generative i-vector extractor and we focus on discriminative retraining of such a model that still uses a fairly complex GMM-UBM to provide the training examples (sufficient statistics). I-vector model is generally very robust, which is a property that we want to retain, but at the same time we want the model to focus on important features with respect to the task at hand—discrimination between speakers—and at the same time do not waste parameters to represent the redundant variability in the data.
To obtain a standalone discriminative i-vector extractor, we used the same strategy as in the x-vector framework [6, 10, 11] and we retrained the NN representation of our generative model to optimize the multi-class cross-entropy over a set of training speakers. This is in contrast with our previous research , where we optimized the binary cross-entropy over verification trials formed by pairs of i-vectors. We show that with such an approach we can achieve a reasonable improvement in performance. Our results are perhaps not as good as what can be achieved with current x-vector systems , but our goal is to further use this model in the fully end-to-end discriminative system  that can be initialized from a robust generative baseline. In , we were already able to build such a system, but it was just the i-vector extractor component that posed the biggest challenge and we had to resort to ad-hoc simplifications like PCA-based dimensionality reduction of input sufficient statistics.
In order to compare both approaches (generative and discriminative) on speaker verification task, both versions of i-vectors were extracted and used in a standard generative PLDA backend.
2 Theoretical Background
The i-vectors  provide a way of reducing large-dimensional input data to a small-dimensional feature vector while retaining most of the relevant information. The main principle is that the utterance-dependent Gaussian Mixture Model (GMM) supervector of concatenated mean vectors lies in a low-dimensional subspace—defined by matrix , commonly referred to as an i-vector extractor—and whose coordinates are given by the i-vector . The closed-form solution for computing the i-vector can be expressed as a function of the zero- and first-order GMM statistics: and , where
is the posterior (or occupation) probability of framebeing generated by the mixture component . The i-vector is then computed as
where and are the “normalized” variants of and , respectively:
and is a symmetrical decomposition (such as Cholesky) of an inverse of the GMM UBM covariance matrix .
2.1 Discriminatively Trained i-vector Extractor
is trained in a generative fashion using the EM algorithm. In this work, however, we focus on the i-vector extractor parameter re-estimation to better discriminate between speakers. Our experimental pipeline is plotted in Fig.1.
where are the parameters of logistic regression.
Multi-class cross-entropy was used as the objective function:
where, is -th element of the target variable in 1-of-K coding, is number of speakers (classes), and is number of training samples. For the purpose of this work, let us treat the i-vector as a function of . In stage-1, we trained the classifier (parameters
) only. After several epochs (stage-2), we trained the classifier and
-matrix jointly. As the optimizer, stochastic gradient descent algorithm was used.
3 System Setup
We used the PRISM  training dataset definition without added noise or reverberation to train UBM and i-vector extractor. The set comprises Fisher 1 and 2, Switchboard phase 2 and 3 and Switchboard cellphone phases 1 and 2, along with a set of Mixer speakers. This includes the 66 held out speakers from SRE10 (see Section III-B5 of ), and 965, 980, 485 and 310 speakers from SRE08, SRE06, SRE05 and SRE04, respectively. A total of 13,916 speakers are available in Fisher data and 1,991 in Switchboard data.
Two variants of gender-independent PLDA models were trained: one on the clean training data, the second included also artificially added different mixes of noises and reverberation. Artificially added noise and reverb segments totaled approximately segments or of total number of clean segments for PLDA training, see details in Sec. 3.2.
We evaluated our systems on the female portions of NIST SRE 2010  (tel-tel, int-int and int-mic) and PRISM (prism,noi, prism,rev and prism,chn, see section III.B of ), where tel-tel and prism,chn represent telephone speech, int-int and int-mic interview speech and prism,noi with prism,rev represent artificially corrupted speech with noise and reverberation.
Additionally, we used the Core-Core condition from the SITW challenge—sitw-core-core. SITW  dataset is a large collection of real-world data exhibiting speech from individuals across a wide array of challenging acoustic and environmental conditions.
We also test on NIST SRE 2016 , but we split the trial set by language into Tagalog (sre16-tgl-f) and Cantonese (sre16-yue-f). We use only female trials (both single- and multi-session). We did not use SRE’16 unlabeled development set in any way.
3.2 PLDA and i-vector Extractor Augmentation Sets
To extend the training set, we created new artificially corrupted training sets from the PRISM training set. In addition to using noise and reverberation presented below, data were also augmented with randomly generated cuts. In our experiments, we used 30% of original training data to generate cuts with durations between 3 to 5 seconds.
3.2.1 Adding Noise
We prepared a dataset of noises from three different sources:
200 samples (4 minutes long) taken from the Freesound library111http://www.freesound.org (real fan, HVAC, street, city, shop, crowd, library, office and workshop).
5 samples (4 minutes long) of artificially generated noises: various spectral modifications of white noise + 50 and 100 Hz hum.
18 samples (4 minutes long) of babbling noises by merging speech from 100 random speakers from Fisher database using speech activity detector.
The prepared set consists of real room impulse responses from several databases: MARDY , AIR , C4DM [23, 24], OPENAIR , RVB 2014 , and RWCP . Together, they form a set with all types of rooms (small rooms, big rooms, lecture room, restrooms, halls, stairs etc.). All room models have more than one impulse response per room (different RIR was used for source of the signal and source of the noise to simulate different locations of their sources). Rooms were split into two disjoint sets, with 396 rooms for training and 40 rooms for test.
3.2.3 Composition of the Augmented Training Set
To mix the reverberation, noise and signal at given SNR, we followed the procedure outlined in .
When jointly augmenting the data by noise and reverberation, the speech and noise are reverberated separately and different RIRs from the same room are used for speech signal and noise to simulate different positions of their sources. In the following step, we set a ratio of noise and signal energies to obtain the required SNR. Energies of the signal and noise are computed from frames given by the original signal’s voice activity detection (VAD). Finally, signal and noise are summed together at desired SNR. In case we want to add only noise or reverberation, the appropriate part of the algorithm is used.
|PLDA clean||PLDA extension data||PLDA clean||PLDA extension data|
4 Experiments and Discussion
We conducted a set of experiments, all with three different approaches in i-vector extractor training, denoted with letters B, G, D further on.
— In the first set of experiments, we trained a baseline i-vector extractor in the traditional generative way, using the original PRISM training corpus.
— In the second variant, we still used generative training, but we augmented the training data with noise, reverberation, and cuts as described in Section 3.2.
— In the last variant, we used the pre-trained generative i-vector extractor from G and we retrained it discriminatively. The training pipeline is shown in Fig. 1.
For discriminative training, the data preparation was necessary to avoid classifier overtraining. We used speakers with at least 5 utterances in original data only. This step limits the training data to 3493 speakers with 59112 utterances (177336 utterances including augmentation).
Our results (EER) are presented in Tab. 1. The results are presented for 400- and 600- dimensional i-vectors. Next, results are also divided based on PLDA, where we distinguish PLDA trained on clean data and multi-condition training. Finally, the table is also divided based on the type of the condition, for the telephone channel, microphone and artificially created conditions. We did not use any type of adaptation or any other technique used for results improvement in conditions from SRE16 and others.
When we compare baseline systems (B column in the table) with discriminatively retrained variant (D column in the table), we can see that except two cases (sre16-yue-f in 400-dim variant with “clean” PLDA and int-mic in 600-dim variant with “clean” PLDA) the discriminative training is better. The discriminative approach is also better compared with the generative approach with augmented data in the training, where you can see that augmentation in generative training caused mostly degradation in the final.
In tel-tel condition, we can see significant improvement with discriminative training, where 400-dim system have almost 12% relative improvement compared to baseline and it also outperforms 600-dim baseline. The similar situation is in the prism,ch condition, where we have 22% relative improvement in 400-dim variant, here we also outperform 600-dim discriminative variant of training.
The general improvement can be observed also when doing multi-condition training of the PLDA, but we can also see that it harmed the clean condition and helped more on the noisy one, which is an expected behavior.
Generative i-vector extraction training is unsupervised. When we add augmented data to the training list, i-vector extraction is forced to reserve a portion of parameters for representation of variability of noise, reverberation and so it limits parameters for speaker variability. In our supervised discriminative approach, we are forcing i-vector extractor to do the opposite. The extractor is forced to distinguish the speakers, so it should decrease the unwanted variability and keep as many parameters of the -matrix to the speaker variability. It can also help to limit usage GMM components which are not useful for speaker separation.
At this point it is appropriate to discuss our results when compared with the current x-vector recipes. We are fully aware that we do not reach the performance of x-vectors. Results presented here can be directly compared to our previous work  focused on analyzing the performance of the state-of-the-art i-vector and x-vector systems on the very same datasets. Here we present the i-vector system that is based purely on MFCCs, while in  we were using concatenation of MFCCs and DNN bottleneck features. Our current plan is to discriminatively retrain the baseline system from  and then finally replace the i-vector component in the fully end-to-end system presented in  by the discriminatively trained i-vector extractor.
4.1 Experiment observations
We found out, that robust classifier was necessary for proper -matrix retraining. We have conducted experiments with different depth of NN multi-class classifier until we settled on a topology with no hidden layer, which effectively equals to logistic regression. With this setup, we avoid problems with overtraining (especially in the early stage of our endeavor, where we did not use augmented data), there are fewer parameters to train, and time and memory requirements are within reasonable limits, yielding an overall robust classifier.
For effective i-vector extractor re-training, a well-trained classifier was crucial. In stage-2 of the training (where classifier was jointly retrained with the -matrix), a poorly trained classifier resulted in either negligible or even harmful update of the -matrix.
Because of its size, matrix was prone to overtraining, therefore, regularization was necessary. We have chosen L2 regularization centered around the initial ML matrix . This limits the estimate of from moving too far from the initial (already well-estimated) state.
After several unsuccessful experiments, where the change of was too rapid, we set learning rate during the full pipeline training to (so far was used). After this change, the regularization was not necessary anymore, and we received stable training.
Fixing the parameters of the classifier during stage-2 (and retraining only ) led to minor effect on the system, compared to the joint training.
Retraining from randomly initialized matrix rather than from a ML estimate did not lead to convergence.
In this work, we have presented a way of refining a standard generative i-vector extractor via discriminative training. We were able to outperform the generative baseline and make use of additional data obtained by the means of augmentation to further improve the performance when using the discriminative training. Our approach conveniently fits to the current efforts of building a fully end-to-end discriminative systems, and provides a way for a robust initialization of such a large and important part of the system. Needless to say, we have not created a new state-of-the-art system, however, we have prepared a solid platform for our further research.
-  A. Lozano-Diez, A. Silnova, P. Matějka, O. Glembek, O. Plchot, J. Pešán, L. Burget, and J. Gonzalez-Rodriguez, “Analysis and Optimization of Bottleneck Features for Speaker Recognition,” in Proceedings of Odyssey 2016. 2016, vol. 2016, pp. 352–357, International Speech Communication Association.
-  Y. Lei, N. Scheffer, L. Ferrer, and M. McLaren, “A novel scheme for speaker recognition using a phonetically-aware deep neural network,” in 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), May 2014, pp. 1695–1699.
-  S. Novoselov, T. Pekhovsky, O. Kudashev, V. S. Mendelev, and A. Prudnikov, “Non-linear PLDA for i-vector speaker verification,” in 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Sept 2015, pp. 214–218.
-  G. Bhattacharya, J. Alam, P. Kenny, and V. Gupta, “Modelling speaker and channel variability using deep neural networks for robust speaker verification,” in 2016 IEEE Spoken Language Technology Workshop, SLT 2016, San Diego, CA, USA, December 13-16, 2016.
O. Ghahabi and J. Hernando,
“Deep belief networks for i-vector based speaker recognition,”in 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), May 2014, pp. 1700–1704.
-  E. Variani, X. Lei, E. McDermott, I. L. Moreno, and J. Gonzalez-Dominguez, “Deep neural networks for small footprint text-dependent speaker verification,” in 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), May 2014, pp. 4052–4056.
-  G. Heigold, I. Moreno, S. Bengio, and N. Shazeer, “End-to-end text-dependent speaker verification,” in 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), March 2016, pp. 5115–5119.
-  S. X. Zhang, Z. Chen, Y. Zhao, J. Li, and Y. Gong, “End-to-End attention based text-dependent speaker verification,” in 2016 IEEE Spoken Language Technology Workshop (SLT), Dec 2016, pp. 171–178.
-  D. Snyder, P. Ghahremani, D. Povey, D. Garcia-Romero, Y. Carmiel, and S. Khudanpur, “Deep neural network-based speaker embeddings for end-to-end speaker verification,” in 2016 IEEE Spoken Language Technology Workshop (SLT), Dec 2016, pp. 165–170.
-  G. Bhattacharya, J. Alam, and P. Kenny, “Deep Speaker Embeddings for Short-Duration Speaker Verification,” in Interspeech 2017, 08 2017, pp. 1517–1521.
-  D. Snyder, D. Garcia-Romero, D. Povey, and S. Khudanpur, “Deep Neural Network Embeddings for Text-Independent Speaker Verification,” in Interspeech 2017, Aug 2017.
-  David Snyder, Daniel Garcia-Romero, Greg Sell, Daniel Povey, and Sanjeev Khudanpur, “X-vectors: Robust DNN Embeddings for Speaker Recognition,” in Proceedings of ICASSP, 2018.
-  Ondřej Glembek, Lukáš Burget, Niko Brümmer, Oldřich Plchot, and Pavel Matějka, “Discriminatively Trained i-vector Extractor for Speaker Verification,” in Proceedings of Interspeech 2011. 2011, number 8, pp. 137–140, International Speech Communication Association.
-  Ondřej Novotný, Oldřich Plchot, Pavel Matějka, Ladislav Mošner, and Ondřej Glembek, “On the use of X-vectors for Robust Speaker Recognition,” in Proceedings of Odyssey 2018. 2018, number 6, pp. 168–175, International Speech Communication Association.
-  Johan Rohdin, Anna Silnova, Mireia Diez, Oldřich Plchot, Pavel Matějka, and Lukáš Burget, “End-to-end DNN based speaker recognition inspired by i-vector and PLDA,” in Proceedings of ICASSP. 2018, IEEE Signal Processing Society.
-  N. Dehak, P. Kenny, R. Dehak, P. Dumouchel, and P. Ouellet, “Front-End Factor Analysis For Speaker Verification,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 4, pp. 788–798, May 2011.
-  L. Ferrer, H. Bratt, L. Burget, H. Cernocky, O. Glembek, M. Graciarena, A. Lawson, Y. Lei, P. Matejka, O. Plchot, and N. Scheffer, “Promoting robustness for speaker modeling in the community: the PRISM evaluation set,” in Proceedings of SRE11 analysis workshop, Atlanta, Dec. 2011.
-  “National Institute of Standards and Technology,” http://www.nist.gov/speech/tests/spk/index.htm.
-  Mitchell McLaren, Luciana Ferrer, Diego Castan, and Aaron Lawson, “The Speakers in the Wild (SITW) Speaker Recognition Database,” in Interspeech 2016, 2016, pp. 818–822.
“The NIST year 2016 Speaker Recognition Evaluation Plan,” https://www.nist.gov/sites/default/files/documents/2016/10/
-  “Multichannel Acoustic Reverberation Database at York,” http://www.commsp.ee.ic.ac.uk/ sap/resources/mardy-multichannel-acoustic-reverberation-database-at-york-database/.
-  “Aachen Impulse Response Database,” http://www.iks.rwth-aachen.de/en/research/tools-downloads/databases/aachen-impulse-response-database/.
-  “C4DM (Center for Digital Music) RIR database,” http://isophonics.net/content/room-impulse-response-data-set.
-  R. Stewart and M. Sandler, “Database of omnidirectional and B-format room impulse responses,” in 2010 IEEE International Conference on Acoustics, Speech and Signal Processing, March 2010, pp. 165–168.
-  “OpenAir Impulse Response Database,” http://www.openairlib.net/auralizationdb.
-  “Reverb Challenge ,” http://reverb2014.dereverberation.com/index.html.
-  “RWCP Sound Scene Database,” http://www.openslr.org/13/.