1 Introduction
Automatic speaker verification (ASV) has found a broad range of applications. Conventional ASV methods are based on statistical models [1, 2, 3]. Perhaps the most famous statistical model in ASV is the Gaussian mixture model-universal background model (GMM-UBM) [1]. This model represents the ‘main’ variance of speech signals by a set of global Gaussian components (the UBM), and speaker characteristics are represented as the ‘shift’ of speaker-dependent GMMs over each Gaussian component of the UBM, denoted by a ‘speaker supervector’. The GMM-UBM architecture was later enhanced by subspace models, which assume that a speaker supervector can be factorized into a speaker vector (usually low-dimensional) and a residual that represents intra-speaker variation. Joint factor analysis [2, 4] was the most successful subspace model in the early days, though the subsequent i-vector model attracted more attention [3]. Besides its simple structure and superior performance, the i-vector approach first demonstrated that a speaker can be represented by a low-dimensional vector, which is the precursor of the important concept of speaker embedding.
It should be emphasized, however, that the i-vector model is purely unsupervised, and the embeddings (i-vectors) contain a multitude of variations beyond speaker information. Therefore, it relies heavily on a powerful back-end scoring model to achieve reasonable performance. Among various back-end models, PLDA [5, 6] has been very powerful, in particular with simple whitening and length normalization [7]. In a nutshell, PLDA assumes that the ‘true’ speaker code within an i-vector is low-dimensional and follows a simple Gaussian prior, while the residual is a full-rank Gaussian, formally written as:
x_ij = μ + V y_i + ε_ij    (1)
where x_ij is the i-vector of utterance j of speaker i, y_i and ε_ij are the speaker code and the residual respectively, μ is the global shift and V is the speaker loading matrix. Under this assumption, the speaker prior p(y), the conditional p(x|y) and the marginal p(x) are all Gaussian. Fortunately, i-vectors match these conditions pretty well, due to the linear Gaussian structure of the i-vector model. Partly for this reason, the i-vector/PLDA framework remains a strong baseline on many ASV tasks.
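For intuition, the PLDA generative model above is easy to simulate. The sketch below draws several utterances of one speaker under the model; all dimensions and parameter values are toy assumptions chosen for illustration, not values from any trained system.

```python
import numpy as np

# Toy PLDA generative model: x = mu + V y + eps, with y ~ N(0, I).
rng = np.random.default_rng(0)
D, d = 20, 5                        # embedding dim and speaker-code dim (assumed)
mu = rng.normal(size=D)             # global shift
V = rng.normal(size=(D, d))         # low-rank speaker loading matrix
Sigma = 0.1 * np.eye(D)             # full-rank residual covariance (diagonal here)

def sample_speaker_embeddings(n_utts):
    """Draw n_utts embeddings of a single (random) speaker."""
    y = rng.normal(size=d)                                  # speaker code ~ N(0, I)
    eps = rng.multivariate_normal(np.zeros(D), Sigma, size=n_utts)
    return mu + y @ V.T + eps                               # x = mu + V y + eps

X = sample_speaker_embeddings(1000)
# within one speaker, only eps varies, so the scatter is governed by Sigma
```

Note how the between-speaker variation comes entirely from the low-rank term V y, while the within-speaker scatter is the full-rank residual.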
Recently, neural-based ASV models have shown great potential [8, 9, 10, 11]. These models utilize the power of deep neural networks (DNNs) to learn strong speaker-dependent features, ideally from a large amount of speaker-labelled data. The present research can be categorized into frame-based learning [8, 10] and utterance-based learning [9, 11, 12, 13]. Frame-based learning intends to learn short-time speaker features, and is thus more generally useful for speaker-related tasks, while utterance-based learning focuses on whole-utterance speaker representation and/or classification, and is hence more suitable for the ASV task. A popular utterance-based learning approach is the x-vector model proposed by Snyder et al. [11], where the first- and second-order statistics of frame-level features are collected and projected to a low-dimensional representation called an x-vector, with the objective of discriminating between the speakers in the training dataset. The x-vector model has achieved good performance in various speaker recognition tasks, as well as related tasks such as language identification [14]. Essentially, the x-vector model can be regarded as a deep and discriminative counterpart of the i-vector model, and is often called deep speaker embedding.
Interestingly, experiments show that the x-vector system also relies heavily on a strong back-end scoring model, in particular PLDA. Since x-vectors are already sufficiently discriminative, the role of PLDA here is regularization rather than discrimination (as in the i-vector paradigm): it (globally) discovers the underlying speaker codes that are intrinsically Gaussian, so that ASV scores based on these codes tend to be comparable across speakers. A potential problem, however, is that x-vectors inferred from DNNs are unconstrained, which means that the speaker distribution and the speaker conditionals could take any form. These unconstrained distributions may cause great difficulty for PLDA in discovering the underlying speaker codes that are assumed to be Gaussian. Some researchers have noticed this problem and proposed remedies that encourage the speaker conditionals to be more Gaussian [15, 16], but none of them constrain the prior, so the produced x-vectors are still not well suited to PLDA modeling.
In this paper, we investigate an explicit regularization model for unconstrained x-vectors. This model is inspired by the variational autoencoder (VAE) architecture, which is capable of projecting an unconstrained distribution onto a simple Gaussian distribution. This can be used to constrain the marginal distribution of x-vectors. Moreover, a cohesive loss is added to the VAE objective. This follows the same spirit as [15, 16] and can constrain the speaker conditionals. Experiments showed that with this VAE-based regularization, the performance of cosine scoring is largely improved, and is even comparable with PLDA. This indicates that VAE plays a similar role to PLDA, or, in other words, that PLDA works as a regularizer rather than a discriminator in x-vector scoring. Furthermore, the VAE-based speaker codes achieved state-of-the-art performance when scored with PLDA, demonstrating that (1) VAE-based speaker codes are more regularized and suitable for PLDA modeling, and (2) VAE-based regularization and PLDA scoring are complementary.
2 VAE-based speaker regularization
2.1 Revisiting PLDA
The principle of PLDA is to model the marginal distribution of speaker embeddings (i-vectors or x-vectors) by factoring the total variation of the embeddings into between-speaker variation and within-speaker variation. Based on this factorization, the ASV decision can be cast as a hypothesis test [5, 6], formulated by:
R(x_1, x_2) = log p(x_1, x_2 | s_1 = s_2) − log p(x_1) p(x_2)
where R denotes the confidence score, and the equality relation (s_1 = s_2) denotes that the two embeddings are from the same speaker.
According to Eq. (1), PLDA is a linear Gaussian model, in which the prior, the conditional, and the marginal are all Gaussian. If the embeddings do not satisfy this condition, PLDA cannot model them well, leading to inferior performance. This is the case for x-vectors: they are derived from DNNs, and both the speaker prior and the speaker conditionals are unconstrained. In order to deal with the unconstrained distributions of x-vectors, we need a probabilistic model more complex than PLDA.
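The hypothesis-test score has a closed form for a linear Gaussian model: under the same-speaker hypothesis the two embeddings share one speaker code and are jointly Gaussian with correlated blocks. The numpy sketch below illustrates this on toy data; all dimensions and parameters are illustrative assumptions, not values from our systems.

```python
import numpy as np

rng = np.random.default_rng(1)
D, d = 10, 3
mu = np.zeros(D)
V = rng.normal(size=(D, d))
W = 0.1 * np.eye(D)     # within-speaker (residual) covariance
B = V @ V.T             # between-speaker covariance induced by the prior on y

def log_gauss(x, m, C):
    """log N(x; m, C) for a full-covariance Gaussian."""
    diff = x - m
    _, logdet = np.linalg.slogdet(C)
    return -0.5 * (len(m) * np.log(2 * np.pi) + logdet
                   + diff @ np.linalg.solve(C, diff))

def plda_score(x1, x2):
    """Log-likelihood ratio: same speaker vs. independent speakers."""
    jm = np.concatenate([mu, mu])
    jC = np.block([[B + W, B], [B, B + W]])   # a shared y couples x1 and x2
    return (log_gauss(np.concatenate([x1, x2]), jm, jC)
            - log_gauss(x1, mu, B + W) - log_gauss(x2, mu, B + W))

def utt(y):
    """One utterance embedding of the speaker with code y."""
    return mu + V @ y + rng.multivariate_normal(np.zeros(D), W)

# sanity check: same-speaker pairs should score higher on average
same, diff = [], []
for _ in range(20):
    y1, y2 = rng.normal(size=d), rng.normal(size=d)
    same.append(plda_score(utt(y1), utt(y1)))
    diff.append(plda_score(utt(y1), utt(y2)))
```

The key point is that the score is only correct when the Gaussian assumptions on prior and residual actually hold, which is exactly what unconstrained x-vectors violate.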
2.2 VAE for regularization
VAE is a generative model (like PLDA) that can represent a complex data distribution [17]. The key idea of VAE is to learn a DNN-based mapping function f that maps a simple distribution p(z) to a complex distribution p(x). In other words, it represents complex observations by simply-distributed latent codes via distribution mapping. An illustration of this mapping is shown in Fig. 1. It can be easily shown that the mapped distribution is written as:
p(x) = p(z) |det(∂f⁻¹(x)/∂x)|,  with z = f⁻¹(x),
where f⁻¹ is the inverse function of f.
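As a quick check of this change-of-variables relation, consider a one-dimensional linear map f (a toy assumption for illustration), for which the mapped density is available in closed form:

```python
import numpy as np

# z ~ N(0, 1) is the simple source; f(z) = a*z + b is a toy invertible map.
a, b = 2.0, 1.0
f_inv = lambda x: (x - b) / a                      # inverse of f

def p_z(z):                                        # source density N(0, 1)
    return np.exp(-0.5 * z ** 2) / np.sqrt(2 * np.pi)

def p_x(x):                                        # change-of-variables formula
    return p_z(f_inv(x)) * abs(1.0 / a)            # |d f_inv / dx| = 1/|a|

def ref(x):                                        # closed form: x ~ N(b, a^2)
    return np.exp(-0.5 * ((x - b) / a) ** 2) / (a * np.sqrt(2 * np.pi))

xs = np.linspace(-5.0, 7.0, 50)
# p_x(xs) matches ref(xs) everywhere on the grid
```

With a DNN-based f the inverse and Jacobian are no longer available in closed form, which is why the VAE resorts to variational inference instead.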
Although VAE can be used to represent complex marginals, it does not involve any class structure, and so cannot be used directly in the hypothesis-test scoring framework. Nevertheless, if we can find the posterior p(z|x), the complex x can be mapped to a more constrained z, so the simple cosine distance can be used for verification. Moreover, the regularized code z tends to be easily modeled by PLDA, hence combining the strength of VAE in distribution mapping with the strength of PLDA in distinguishing between- and within-speaker variations. Fortunately, VAE provides a simple way to infer an approximate distribution of p(z|x), denoted by q(z|x). It learns a function g, parameterized by a DNN, to map x to the parameters of q(z|x), which are the mean and covariance if q(z|x) is assumed to be Gaussian. By this setting, the mean vector of q(z|x) can be treated as the VAE-regularized speaker code, and can be used in cosine- or PLDA-based scoring.
Fig. 2 illustrates the VAE framework. In this framework, a decoder f maps z to x, i.e.,
p(x|z) = N(x; f(z), σ²I),
where p(x|z) has been assumed to be Gaussian. Furthermore, an encoder g produces a distribution q(z|x) that approximates the posterior distribution p(z|x) as follows:
q(z|x) = N(z; μ(x), Σ(x)),
where μ(x) and Σ(x) are the mean and covariance produced by the encoder g.
The training objective is the log probability of the training data, log p(x). It is intractable, so a variational lower bound L is optimized instead, which depends on both the encoder q(z|x) and the decoder p(x|z). This is formally written as:
L = −KL(q(z|x) || p(z)) + E_{q(z|x)}[log p(x|z)],
where KL is the KL divergence, and E_{q(z|x)} denotes expectation w.r.t. the distribution q(z|x). As the expectation is intractable, a sampling scheme is often used, as shown in Fig. 2. More details of the training process can be found in [17].
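A single-sample estimate of this bound is easy to write down. The numpy sketch below uses toy linear maps standing in for the encoder and decoder DNNs; the weights and dimensions are illustrative assumptions, not our model's configuration.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=4)                 # one toy observation (a 4-dim "x-vector")

Wg = 0.1 * rng.normal(size=(2, 4))     # toy encoder weights (stand-in for g)
Wf = 0.1 * rng.normal(size=(4, 2))     # toy decoder weights (stand-in for f)

mu, logvar = Wg @ x, np.zeros(2)       # q(z|x) = N(mu, diag(exp(logvar)))
eps = rng.normal(size=2)
z = mu + np.exp(0.5 * logvar) * eps    # reparameterized sample from q(z|x)

# closed-form KL(q(z|x) || N(0, I)) -- the regularization term
kl = -0.5 * np.sum(1.0 + logvar - mu ** 2 - np.exp(logvar))

# reconstruction term: log N(x; f(z), I), dropping the additive constant
recon = -0.5 * np.sum((x - Wf @ z) ** 2)

elbo = recon - kl                      # single-sample estimate of the bound
```

The reparameterized sample z = μ + σ·ε is what makes the sampled expectation differentiable with respect to the encoder parameters.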
Note that L involves two components: a regularization term that pushes q(z|x) towards p(z), and a reconstruction term that encourages a good reconstruction of x from z. We are free to tune the relative weights of these two terms in practice, in order to obtain latent codes that are either more regularized or more representative. This freely-modified objective may no longer be a variational lower bound, though non-balanced weights often led to better performance in our experiments.
2.3 Speaker cohesive VAE
The standard VAE only constrains the marginal distribution to be Gaussian, which does not guarantee a Gaussian prior or a Gaussian conditional. This is because the VAE model is purely unsupervised and no speaker information is involved. This lack of speaker information is probably not a big issue for x-vectors, as they are speaker-discriminative already. However, considering speaker information may help VAE to produce better regularization. In particular, if the speaker codes of a particular speaker can be regularized to be Gaussian, scores based on either cosine distance or PLDA will be more comparable across speakers. This can be formulated as an additional term in the VAE objective function, which we call the speaker cohesive loss, denoted by L_c.
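The exact form of L_c is not reproduced here; one plausible instantiation, shown purely as an illustrative assumption, penalizes the scatter of each speaker's codes around that speaker's own centroid, so that same-speaker codes are pulled together:

```python
import numpy as np

def cohesive_loss(codes, speaker_ids):
    """Hypothetical cohesive penalty: mean squared distance of each code
    from the centroid of its own speaker. This is an assumed form for
    illustration, not the loss used in the paper."""
    loss, n = 0.0, 0
    for s in np.unique(speaker_ids):
        c = codes[speaker_ids == s]
        loss += np.sum((c - c.mean(axis=0)) ** 2)
        n += len(c)
    return loss / n

# demo: tightly clustered per-speaker codes incur a lower penalty
rng = np.random.default_rng(0)
ids = np.repeat(np.arange(4), 8)            # 4 toy speakers, 8 codes each
centers = rng.normal(size=(4, 3))
tight = centers[ids] + 0.01 * rng.normal(size=(32, 3))
loose = centers[ids] + 1.00 * rng.normal(size=(32, 3))
```

Any such per-speaker concentration term makes the conditionals tighter and more unimodal, which is the effect the cohesive loss is intended to have.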
3 Experiments
3.1 Data
Three datasets were used in our experiments: VoxCeleb, SITW and CSLTSITW. VoxCeleb was used for model training, while the other two were used for evaluation. More information about these three datasets is presented below.
VoxCeleb: A large-scale free speaker database collected by the University of Oxford, UK [18]. The entire database involves VoxCeleb1 and VoxCeleb2. This dataset, after removing the utterances shared with SITW, was used to train the x-vector model, plus the PLDA and VAE models. Data augmentation was applied, where the MUSAN corpus [19] was used to generate noisy utterances and the room impulse response (RIRS) corpus [20] was used to generate reverberant utterances.
SITW: A standard database used to test ASV performance in real-world conditions [21]. It was collected from open-source media channels, and consists of speech data covering well-known persons. There are two standard datasets for testing: Dev. Core and Eval. Core. We used Dev. Core to select model parameters, and Eval. Core to perform the test in our first experiment. Note that the acoustic condition of SITW is similar to that of the training set VoxCeleb, so this test can be regarded as an in-domain test.
CSLT-SITW: A small dataset collected by CSLT for commercial usage. It consists of a set of speakers, each recording short Chinese command words. The scenarios involve laboratory, corridor, street, restaurant, bus, subway, mall, home, etc. Speakers varied their poses during the recording, and the recording devices were placed both near and far. The acoustic condition of this dataset is quite different from that of the training set VoxCeleb, so it was used for the out-of-domain test.
3.2 Settings
We built several systems to validate the VAE-based regularization, each involving a particular pair of front-end and back-end.
3.2.1 Front-end
x-vector: The baseline x-vector front-end. It was built following the Kaldi SITW recipe [22]. The feature-learning component is a time-delay neural network (TDNN). The statistics pooling layer computes the mean and standard deviation of the frame-level features from a speech segment. The size of the output layer corresponds to the number of speakers in the training set. Once trained, the activations of the penultimate hidden layer are read out as the x-vector.
v-vector: The VAE-regularized speaker code. The VAE model is a feed-forward DNN consisting of several hidden layers with a code layer in between. The x-vectors of all the training utterances were used for VAE training.
c-vector: The VAE-regularized speaker code, with the cohesive loss involved in the VAE training. The model structure is the same as in the v-vector front-end, and the v-vector VAE was used as the initial model for training. The weight of the cohesive loss in the objective function was tuned on the development set.
a-vector: The speaker code regularized by a standard autoencoder (AE). The AE shares a similar structure with the VAE, but its latent codes are not probabilistic, so it is less capable of modeling complex distributions. The AE structure is identical to the VAE model in the v-vector front-end, except that the code layer is deterministic.
3.2.2 Back-end
Cosine: Simple cosine distance.
PCA: PCA-based projection (150-dim) plus cosine distance.
PLDA: PLDA scoring.
LPLDA: LDA-based projection (150-dim) plus PLDA scoring.
PPLDA: PCA-based projection (150-dim) plus PLDA scoring.
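The two simplest back-ends above can be sketched directly in numpy. The helper names below are hypothetical, and the toy data uses a 2-dim projection in place of the 150-dim one used in the real systems:

```python
import numpy as np

def cosine_score(a, b):
    """Cosine distance back-end: score of an enrollment/test pair."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def pca_project(X, dim):
    """Fit PCA on training vectors X (rows); return a projection function."""
    m = X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X - m, full_matrices=False)  # principal axes
    W = Vt[:dim].T                                        # top-`dim` directions
    return lambda v: (v - m) @ W

rng = np.random.default_rng(0)
train = rng.normal(size=(200, 10))      # toy stand-ins for training x-vectors
proj = pca_project(train, dim=2)        # 150-dim in the real systems
v1, v2 = proj(rng.normal(size=10)), proj(rng.normal(size=10))
score = cosine_score(v1, v2)
```

Note that neither back-end uses speaker labels; any discrimination they exhibit must come from the front-end embeddings themselves.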
3.3 In-domain test
The results on the two SITW evaluation sets, Dev. Core and Eval. Core, are reported in Table 1, in terms of equal error rate (EER).
First, focus on the x-vector front-end. It can be found that PLDA scoring outperformed cosine distance. As we argued, this should not be interpreted as reflecting the discriminative nature of PLDA, but rather its regularization capability. This is supported by the observation that the v-vector front-end achieved rather good performance with the cosine back-end (compared with x-vector + PLDA). Since VAE is purely unsupervised, it only contributes regularization. This suggests that PLDA plays a similar role to VAE.
Second, we observe that with PCA or LDA, PLDA performs much better. It is not convincing to assume that LDA and PCA improve the discriminative power of x-vectors (in particular PCA), so the only interpretation is that these two models perform regularization, generating more Gaussian codes that are suitable for PLDA. This regularization is similar to what VAE does, but it seems that VAE did a better job than PCA, and even better than LDA on the larger evaluation set Eval. Core, even without any speaker supervision.
Third, it can be found that c-vectors performed better than v-vectors with cosine scoring, confirming that involving the cohesive loss improves the regularization. When combined with PLDA, however, the advantage of c-vectors diminished. This is expected, as PLDA has already learned the speaker-discriminative knowledge.
Finally, we found that the other unsupervised regularization methods, PCA and AE, cannot obtain reasonable performance with cosine distance, indicating that they cannot achieve good regularization by themselves. This is in contrast to VAE, confirming the importance of probabilistic codes: without this probabilistic nature, it would be impossible to model the complex distribution of x-vectors.
SITW Dev. Core

            Cosine   PCA     PLDA    LPLDA   PPLDA
x-vector    15.67    16.17    9.09    3.12    4.16
a-vector    16.10    16.48   11.21    4.24    5.01
v-vector    10.32     9.94    3.62    3.54    4.31
c-vector     9.05     8.55    3.50    3.31    3.85

SITW Eval. Core

            Cosine   PCA     PLDA    LPLDA   PPLDA
x-vector    16.79    17.22    9.16    3.80    4.84
a-vector    16.05    16.81   12.14    4.27    5.09
v-vector    10.11    10.03    3.64    3.64    4.43
c-vector     9.05     8.83    3.77    3.53    4.10
3.4 Analysis
To better understand the VAE-based regularization, we compute the skewness and kurtosis of the distributions of the different speaker codes. The skewness and (excess) kurtosis are defined as follows:
skew(x) = E[((x − μ)/σ)³]
kurt(x) = E[((x − μ)/σ)⁴] − 3
where μ and σ denote the mean and standard deviation of x, respectively. The more Gaussian a distribution is, the closer these two values are to zero.
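These statistics are straightforward to estimate from samples. The sketch below checks that a Gaussian sample yields values near zero, while a heavier-tailed Laplace sample does not (the two distributions are illustrative stand-ins, not our speaker codes):

```python
import numpy as np

def skewness(x):
    """Sample skewness: third standardized moment."""
    m, s = x.mean(), x.std()
    return float(np.mean(((x - m) / s) ** 3))

def excess_kurtosis(x):
    """Sample excess kurtosis: fourth standardized moment minus 3."""
    m, s = x.mean(), x.std()
    return float(np.mean(((x - m) / s) ** 4) - 3.0)   # zero for a Gaussian

rng = np.random.default_rng(0)
gauss = rng.normal(size=200_000)      # near-zero skewness and excess kurtosis
heavy = rng.laplace(size=200_000)     # heavier tails: excess kurtosis near 3
```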
The utterance-level and speaker-level skewness and kurtosis of the different speaker codes are reported in Table 2. Focusing on the utterance-level results, it can be seen that the skewness and kurtosis values of both the v-vector and c-vector are clearly smaller than those of the x-vector. This means that the v-vector and c-vector are more Gaussian. For the speaker-level results, it can be found that the kurtosis is largely reduced for v-vectors and c-vectors. This indicates that the Gaussian regularization placed by VAE on the marginal has implicitly regularized the prior, which is the major reason that these vectors are more suitable for PLDA. The a-vector, derived from the AE, has smaller skewness but larger kurtosis compared to the x-vector, at both the utterance level and the speaker level, suggesting that the AE did not achieve good regularization.
            Skew(utt)   Kurt(utt)   Skew(spk)   Kurt(spk)
x-vector     0.0423      0.3604      0.0018      0.4499
a-vector     0.0072      0.7740      0.0014      0.9765
v-vector     0.0055      0.1324      0.0042      0.0285
c-vector     0.0043      0.1154      0.0076      0.0298
3.5 Out-of-domain test
In this experiment, we test the performance of the various systems on the CSLT-SITW dataset. Due to the limited data, three-fold cross-validation was used whenever training was required. Three schemes were compared: (1) directly using all the front-end and back-end models trained on VoxCeleb; (2) retraining all the models except the x-vector DNN; (3) the same as the retraining scheme, but with all the PLDA models trained by unsupervised adaptation [23]. The results show that scheme (2) is generally the best, and that PLDA adaptation contributes additional gains in some test settings. For simplicity, only the retraining results under scheme (2) are reported in Table 3. The results exhibit a similar trend as in the SITW test: both the v-vector and c-vector outperform the x-vector, and the c-vector obtained the best performance in nearly all the test settings. Compared to the SITW test, larger performance gains were obtained by the VAE regularization. This might be attributed to the more complex acoustic conditions of CSLT-SITW, though more investigation is required.

            Cosine   PCA     PLDA    LPLDA   PPLDA
x-vector    16.65    16.89   16.91   15.39   13.29
v-vector    13.55    13.71   12.46   12.06   12.02
c-vector    12.98    13.13   12.48   12.01   11.98
4 Conclusions
This paper proposed a VAE-based regularization for deep speaker embedding. With this model, x-vectors that usually exhibit a complex distribution are mapped to latent speaker codes that are simply Gaussian. The model was further enhanced by a speaker cohesive loss, which regularizes the speaker conditionals. Experiments on the SITW dataset and a private commercial dataset demonstrated that the VAE-regularized speaker codes achieve better performance with either cosine distance or PLDA scoring, compared to the x-vector baseline. Future work will investigate a speaker-aware VAE, where speaker codes and utterance codes are hierarchically linked as in PLDA.
References
[1] D. A. Reynolds, T. F. Quatieri, and R. B. Dunn, “Speaker verification using adapted Gaussian mixture models,” Digital Signal Processing, vol. 10, no. 1–3, pp. 19–41, 2000.
[2] P. Kenny, G. Boulianne, P. Ouellet, and P. Dumouchel, “Joint factor analysis versus eigenchannels in speaker recognition,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 15, no. 4, pp. 1435–1447, 2007.
[3] N. Dehak, P. J. Kenny, R. Dehak, P. Dumouchel, and P. Ouellet, “Front-end factor analysis for speaker verification,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 4, pp. 788–798, 2011.
[4] P. Kenny, “Joint factor analysis of speaker and session variability: Theory and algorithms,” Tech. Rep., 2005.
[5] S. Ioffe, “Probabilistic linear discriminant analysis,” in European Conference on Computer Vision (ECCV), 2006, pp. 531–542.
[6] S. J. Prince and J. H. Elder, “Probabilistic linear discriminant analysis for inferences about identity,” in 2007 IEEE 11th International Conference on Computer Vision. IEEE, 2007, pp. 1–8.
[7] D. Garcia-Romero and C. Y. Espy-Wilson, “Analysis of i-vector length normalization in speaker recognition systems,” in Twelfth Annual Conference of the International Speech Communication Association, 2011.
[8] E. Variani, X. Lei, E. McDermott, I. L. Moreno, and J. Gonzalez-Dominguez, “Deep neural networks for small footprint text-dependent speaker verification,” in 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2014, pp. 4052–4056.
[9] G. Heigold, I. Moreno, S. Bengio, and N. Shazeer, “End-to-end text-dependent speaker verification,” in 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2016, pp. 5115–5119.
[10] L. Li, Y. Chen, Y. Shi, Z. Tang, and D. Wang, “Deep speaker feature learning for text-independent speaker verification,” in Interspeech, 2017, pp. 1542–1546.
[11] D. Snyder, D. Garcia-Romero, G. Sell, D. Povey, and S. Khudanpur, “X-vectors: Robust DNN embeddings for speaker recognition,” in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018, pp. 5329–5333.
[12] S.-X. Zhang, Z. Chen, Y. Zhao, J. Li, and Y. Gong, “End-to-end attention based text-dependent speaker verification,” in 2016 IEEE Spoken Language Technology Workshop (SLT). IEEE, 2016, pp. 171–178.
[13] D. Snyder, P. Ghahremani, D. Povey, D. Garcia-Romero, Y. Carmiel, and S. Khudanpur, “Deep neural network-based speaker embeddings for end-to-end speaker verification,” in 2016 IEEE Spoken Language Technology Workshop (SLT). IEEE, 2016, pp. 165–170.
[14] D. Snyder, D. Garcia-Romero, A. McCree, G. Sell, D. Povey, and S. Khudanpur, “Spoken language recognition using x-vectors,” in Proc. Odyssey 2018 The Speaker and Language Recognition Workshop, 2018, pp. 105–111. [Online]. Available: http://dx.doi.org/10.21437/Odyssey.2018-15
[15] W. Cai, J. Chen, and M. Li, “Exploring the encoding layer and loss function in end-to-end speaker and language recognition system,” in Proc. Odyssey 2018 The Speaker and Language Recognition Workshop, 2018, pp. 74–81. [Online]. Available: http://dx.doi.org/10.21437/Odyssey.2018-11
[16] L. Li, Z. Tang, Y. Shi, and D. Wang, “Gaussian-constrained training for speaker verification,” in 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2019.
[17] D. P. Kingma and M. Welling, “Auto-encoding variational Bayes,” arXiv preprint arXiv:1312.6114, 2013.
[18] A. Nagrani, J. S. Chung, and A. Zisserman, “VoxCeleb: a large-scale speaker identification dataset,” arXiv preprint arXiv:1706.08612, 2017.
[19] D. Snyder, G. Chen, and D. Povey, “MUSAN: A Music, Speech, and Noise Corpus,” 2015.
[20] T. Ko, V. Peddinti, D. Povey, M. L. Seltzer, and S. Khudanpur, “A study on data augmentation of reverberant speech for robust speech recognition,” in 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2017, pp. 5220–5224.
[21] M. McLaren, L. Ferrer, D. Castan, and A. Lawson, “The speakers in the wild (SITW) speaker recognition database,” in Interspeech, 2016, pp. 818–822.
[22] D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Hannemann, P. Motlicek, Y. Qian, P. Schwarz et al., “The Kaldi speech recognition toolkit,” in IEEE 2011 Workshop on Automatic Speech Recognition and Understanding, no. EPFL-CONF-192584. IEEE Signal Processing Society, 2011.
[23] D. Garcia-Romero, X. Zhang, A. McCree, and D. Povey, “Improving speaker recognition performance in the domain adaptation challenge using deep neural networks,” in 2014 IEEE Spoken Language Technology Workshop (SLT). IEEE, 2014, pp. 378–383.