1 Introduction
Probabilistic linear discriminant analysis (PLDA) [9, 19, 22] has been extensively used in open-set verification tasks, such as speaker verification [2, 21, 8]. It represents the data with a linear Gaussian model, where the between-class distribution is a Gaussian and the within-class distributions of individual classes are homogeneous Gaussians. The parameters of this model involve a linear transform matrix and the between-class covariance of the data after the linear transform, and they can be estimated by maximum likelihood (ML) training. Once the model has been trained, it is possible to decide whether two samples are produced from the same class or from two different classes [9], and this decision is optimal in terms of minimum Bayes risk (MBR) [28].

A potential problem of PLDA is the unreliable estimation of the between-class covariance, denoted by $\boldsymbol{\Sigma}_b$. In many applications, the number of classes in the training data is limited. Take speaker verification as an example: the largest open-source dataset, VoxCeleb, contains 7,000+ speakers. Considering the high dimensionality of the data (e.g., speaker vectors in speaker verification, whose dimension is 400-600), it is difficult to estimate a reasonable between-speaker covariance with maximum likelihood training.
To gain some intuition, we conduct a simulation experiment: we draw sets of samples of varying size from a Gaussian distribution and a Laplacian distribution, and compute the ML-based variance estimate for each set. We then plot the variance of this variance estimate to show the reliability of the estimation. As shown in Fig. 1, when the number of samples is small, the variance of the variance estimate is large, indicating that the estimation is highly unreliable. This effect is even clearer with the Laplacian distribution, due to its heavy-tailed property.
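For concreteness, the following minimal NumPy sketch reproduces the spirit of this simulation; the sample sizes, the number of repetitions, and the unit-variance parameterization are our own choices rather than the exact settings used for Fig. 1:

```python
import numpy as np

rng = np.random.default_rng(0)
n_trials = 2000  # independent repetitions per sample size (assumed value)

for dist in ("gaussian", "laplacian"):
    for n in (10, 100, 1000, 10000):
        # draw n_trials independent sets of n samples, each with unit variance
        if dist == "gaussian":
            x = rng.normal(0.0, 1.0, size=(n_trials, n))
        else:
            # Laplace with scale 1/sqrt(2) also has unit variance
            x = rng.laplace(0.0, 1.0 / np.sqrt(2), size=(n_trials, n))
        # ML (biased) variance estimate for each set of samples
        var_ml = x.var(axis=1)
        # "variance of the variance": spread of the ML estimates across sets
        print(f"{dist:9s} n={n:6d}  var(var_ml)={var_ml.var():.4f}")
```

Because the heavy-tailed Laplacian has a larger fourth moment, its ML variance estimate fluctuates more across repetitions, which is the effect visible in Fig. 1.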
For PLDA, the limited number of classes leads to the same unreliable estimation of the between-class covariance $\boldsymbol{\Sigma}_b$. For speaker verification, the distribution of speaker vectors is known to be heavy-tailed [11], which makes the ML estimation of $\boldsymbol{\Sigma}_b$ even more unreliable. Moreover, the dimensionality of speaker vectors is often as high as 400-600, which further exacerbates the problem.
In this paper, we propose a robust estimation of the between-class covariance $\boldsymbol{\Sigma}_b$, by placing an Inverse-Wishart prior on $\boldsymbol{\Sigma}_b$ and then conducting maximum a posteriori (MAP) estimation. At first glance, the MAP estimation seems trivial once the prior and the associated conditional likelihood are given. However, for hierarchical probabilistic models such as PLDA, it is much more involved. This is because the prior is placed on the covariance of the class means while the likelihood is based on the class members. This complication leads to intractable inference for $\boldsymbol{\Sigma}_b$. We will prove that under some mild conditions, the MAP estimation can be reformulated as a simple linear interpolation between the $\hat{\boldsymbol{\Sigma}}_b$ derived by the standard PLDA and a prior covariance.

2 Theory
2.1 Preliminary of PLDA
We consider the two-covariance form of the PLDA model [22], which assumes a linear Gaussian model as follows, where $i$ indexes the class and $j$ indexes the samples of a particular class:
$\boldsymbol{z}_{ij} = \boldsymbol{V} \boldsymbol{x}_{ij}$ (1)

$p(\boldsymbol{\mu}_i) = N(\boldsymbol{\mu}_i;\ \boldsymbol{0},\ \mathrm{diag}(\boldsymbol{\epsilon}))$ (2)

$p(\boldsymbol{z}_{ij} \mid \boldsymbol{\mu}_i) = N(\boldsymbol{z}_{ij};\ \boldsymbol{\mu}_i,\ \boldsymbol{I})$ (3)
where $\boldsymbol{x}_{ij}$ is an observed sample, $\boldsymbol{\mu}_i$ is the latent class mean, and we assume the linear transform $\boldsymbol{V}$ is of full rank. By this assumption, the within-class covariance of the transformed data is the identity matrix $\boldsymbol{I}$, and the between-class covariance is computed as follows:

$\boldsymbol{\Sigma}_b = \mathrm{diag}(\epsilon_1, \ldots, \epsilon_D)$ (4)
The likelihood of the data of a particular class can be computed as follows:
$p(\boldsymbol{z}_{i1}, \ldots, \boldsymbol{z}_{in_i}) = \int p(\boldsymbol{\mu}_i) \prod_{j=1}^{n_i} p(\boldsymbol{z}_{ij} \mid \boldsymbol{\mu}_i)\, \mathrm{d}\boldsymbol{\mu}_i$ (5)
Since $p(\boldsymbol{\mu}_i)$ and $p(\boldsymbol{z}_{ij} \mid \boldsymbol{\mu}_i)$ are both Gaussian, it is easy to show that this marginal likelihood is Gaussian and can be computed efficiently. Collecting the data of all $K$ classes, the likelihood function can be computed as follows:
$\mathcal{L}(\boldsymbol{\theta}) = \sum_{i=1}^{K} \log p(\boldsymbol{z}_{i1}, \ldots, \boldsymbol{z}_{in_i}; \boldsymbol{\theta})$ (6)
where $\boldsymbol{\theta} = \{\boldsymbol{V}, \boldsymbol{\epsilon}\}$ are the parameters. Maximizing this function with respect to these parameters leads to maximum likelihood (ML) training.
Once the model has been trained, it can be employed to perform verification tasks. According to hypothesis testing theory [17], the following likelihood ratio (LR) is optimal in terms of Bayes risk when used to judge whether a test sample $\boldsymbol{z}^{t}$ belongs to the class represented by the enrollment samples $\{\boldsymbol{z}^{e}_{1}, \ldots, \boldsymbol{z}^{e}_{m}\}$:
$\mathrm{LR} = \frac{p(\boldsymbol{z}^{e}_{1}, \ldots, \boldsymbol{z}^{e}_{m}, \boldsymbol{z}^{t})}{p(\boldsymbol{z}^{e}_{1}, \ldots, \boldsymbol{z}^{e}_{m})\, p(\boldsymbol{z}^{t})}$ (7)
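As an illustration, the sketch below evaluates the marginal likelihood in (5) and the likelihood ratio in (7) in the transformed space, exploiting the fact that in each dimension the members of one class are jointly Gaussian with covariance $\boldsymbol{I}_n + \epsilon_d\boldsymbol{1}\boldsymbol{1}^{\top}$. The function names and the zero-mean assumption are our own choices, not taken from a particular toolkit:

```python
import numpy as np

def log_marginal(Z, eps):
    """log p(z_1, ..., z_n) of (5) for one class in the transformed space:
    class mean ~ N(0, diag(eps)), samples ~ N(mean, I).
    Z: (n, D) matrix of samples hypothesized to share one class."""
    n, D = Z.shape
    s1 = Z.sum(axis=0)           # per-dimension sum
    s2 = (Z ** 2).sum(axis=0)    # per-dimension sum of squares
    # per dimension, cov = I_n + eps * 1 1^T; use its closed-form determinant
    # and inverse (Woodbury) to avoid building n x n matrices
    logdet = np.log1p(n * eps)
    quad = s2 - eps / (1.0 + n * eps) * s1 ** 2
    return -0.5 * np.sum(n * np.log(2 * np.pi) + logdet + quad)

def log_lr(enroll, test, eps):
    """Log of the likelihood ratio in (7): same-class vs. different-class."""
    joint = np.vstack([enroll, test[None, :]])
    return (log_marginal(joint, eps)
            - log_marginal(enroll, eps)
            - log_marginal(test[None, :], eps))
```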
2.2 MAP estimation for $\boldsymbol{\Sigma}_b$
Suppose an Inverse-Wishart prior is placed on $\boldsymbol{\Sigma}_b$ [15]:
$p(\boldsymbol{\Sigma}_b) = \frac{1}{Z}\, |\boldsymbol{\Sigma}_b|^{-(m+D+1)/2} \exp\!\left\{-\tfrac{1}{2}\mathrm{tr}\!\left(\boldsymbol{\Phi}\boldsymbol{\Sigma}_b^{-1}\right)\right\}$ (8)
where $\boldsymbol{\Phi}$ is the scale matrix, $m$ is the degrees of freedom, $D$ is the dimension of the data, and $Z$ is a normalization term. If there are $N$ observations following a Gaussian, it is easy to derive that the MAP estimation for the covariance is given as follows [15]:
$\hat{\boldsymbol{\Sigma}}^{\mathrm{MAP}} = \frac{\boldsymbol{\Phi} + N\,\hat{\boldsymbol{\Sigma}}^{\mathrm{ML}}}{m + N + D + 1}$ (9)
where $\hat{\boldsymbol{\Sigma}}^{\mathrm{ML}}$ denotes the ML estimation for the covariance with the $N$ observations.
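A direct numerical transcription of (9) is shown below, assuming zero-mean observations and our own symbol names (Phi for the scale matrix, m for the degrees of freedom):

```python
import numpy as np

def map_covariance(X, Phi, m):
    """MAP covariance of zero-mean Gaussian observations X of shape (N, D),
    under an Inverse-Wishart prior with scale matrix Phi and m degrees of
    freedom, as in (9)."""
    N, D = X.shape
    sigma_ml = X.T @ X / N              # ML covariance with N observations
    return (Phi + N * sigma_ml) / (m + N + D + 1)
```

The prior dominates when N is small and the ML estimate dominates when N is large, which is exactly the shrinkage behavior sought for the between-class covariance.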
If we place an Inverse-Wishart prior on the between-class covariance $\boldsymbol{\Sigma}_b$ of the PLDA model, the graphical representation is shown in Fig. 2. In this case, the interaction between $\boldsymbol{\Sigma}_b$ and the observations is indirect, and there are no i.i.d. Gaussian samples that can be used to estimate $\boldsymbol{\Sigma}_b$ by (9). One possibility is to use the likelihood function (5) as the conditional probability, and derive the MAP estimation as follows:
$\hat{\boldsymbol{\Sigma}}_b = \arg\max_{\boldsymbol{\Sigma}_b}\ p(\boldsymbol{\Sigma}_b) \prod_{i=1}^{K} p(\boldsymbol{z}_{i1}, \ldots, \boldsymbol{z}_{in_i} \mid \boldsymbol{\Sigma}_b)$ (10)
Since (10) involves the undetermined parameter $\boldsymbol{V}$ and requires marginalizing over the latent class means, the inference for $\boldsymbol{\Sigma}_b$ is intractable. Although a variational approach can be used [25], the iterative process leads to an increased computational load. We will show that under mild conditions, a simple MAP estimation can be derived by using the $\hat{\boldsymbol{\Sigma}}_b$ derived from the standard PLDA.
Proposition 1.
If every class involves $n$ training samples and PLDA has been well trained, the between-class covariance can be written as $\boldsymbol{\Sigma}_b = \frac{1}{K}\sum_{i=1}^{K} \bar{\boldsymbol{z}}_i \bar{\boldsymbol{z}}_i^{\top} - \frac{1}{n}\boldsymbol{I}$, where $\bar{\boldsymbol{z}}_i = \frac{1}{n}\sum_{j=1}^{n} \boldsymbol{z}_{ij}$ is the mean of the transformed training samples of the $i$-th class.
Proof.
Since the PLDA model has been well trained, the transformed training vectors $\boldsymbol{z}_{ij}$ show a regularized distribution: the between-class covariance is $\mathrm{diag}(\boldsymbol{\epsilon})$ and the within-class covariance is $\boldsymbol{I}$. Considering a particular class, the joint probability of the class members is a function of $\boldsymbol{\epsilon}$. Considering a particular dimension $d$, this probability is given by:
$p(z_{i1}^{d}, \ldots, z_{in}^{d}) = \int N(\mu; 0, \epsilon_d) \prod_{j=1}^{n} N(z_{ij}^{d}; \mu, 1)\, \mathrm{d}\mu = N\!\left(\boldsymbol{z}_{i}^{d};\ \boldsymbol{0},\ \boldsymbol{I}_n + \epsilon_d \boldsymbol{1}\boldsymbol{1}^{\top}\right)$ (11)

where $\boldsymbol{z}_{i}^{d} = [z_{i1}^{d}, \ldots, z_{in}^{d}]^{\top}$.
Considering all the classes, the log likelihood with respect to $\epsilon_d$ is $\mathcal{L}(\epsilon_d) = \sum_{i=1}^{K} \log p(z_{i1}^{d}, \ldots, z_{in}^{d})$.
Taking the derivative of $\mathcal{L}(\epsilon_d)$ with respect to $\epsilon_d$ and setting it to 0:
$\frac{\partial \mathcal{L}(\epsilon_d)}{\partial \epsilon_d} = -\frac{Kn}{2(1+n\epsilon_d)} + \frac{1}{2(1+n\epsilon_d)^{2}} \sum_{i=1}^{K}\Big(\sum_{j=1}^{n} z_{ij}^{d}\Big)^{2} = 0$ (12)
A simple computation shows:
$\hat{\epsilon}_d = \frac{1}{K} \sum_{i=1}^{K} \big(\bar{z}_i^{d}\big)^{2} - \frac{1}{n}, \qquad \bar{z}_i^{d} = \frac{1}{n}\sum_{j=1}^{n} z_{ij}^{d}$ (13)
Since all the dimensions of $\boldsymbol{z}$ are independent, we have:
$\hat{\boldsymbol{\Sigma}}_b = \frac{1}{K} \sum_{i=1}^{K} \bar{\boldsymbol{z}}_i \bar{\boldsymbol{z}}_i^{\top} - \frac{1}{n}\boldsymbol{I}$ (14)
∎
Obviously, if $n$ is large, the estimation for $\boldsymbol{\Sigma}_b$ approaches:
$\hat{\boldsymbol{\Sigma}}_b \approx \frac{1}{K} \sum_{i=1}^{K} \bar{\boldsymbol{z}}_i \bar{\boldsymbol{z}}_i^{\top}$ (15)
which can be interpreted as an ML estimation for the covariance of a Gaussian distribution represented by the virtual samples $\{\bar{\boldsymbol{z}}_i\}$.
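The relation in (14) is easy to check by simulation; in the sketch below the number of classes, the samples per class, the dimension, and the true $\epsilon_d$ values are arbitrary choices of ours:

```python
import numpy as np

rng = np.random.default_rng(0)
K, n, D = 5000, 10, 4                     # classes, samples per class, dimension
eps = np.array([4.0, 2.0, 1.0, 0.5])      # true between-class variances

means = rng.normal(size=(K, D)) * np.sqrt(eps)       # class means ~ N(0, diag(eps))
z = means[:, None, :] + rng.normal(size=(K, n, D))   # class members ~ N(mean, I)
zbar = z.mean(axis=1)                                # virtual samples (class means)

sigma_b = zbar.T @ zbar / K - np.eye(D) / n          # the estimate in (14)
print(np.round(np.diag(sigma_b), 2))                 # close to eps
```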
Since the $\hat{\boldsymbol{\Sigma}}_b$ derived by PLDA is equivalent to the ML covariance derived from the virtual samples $\{\bar{\boldsymbol{z}}_i\}$, we can use these virtual samples as the observations of the underlying Gaussian model, and derive the MAP estimation for the covariance of these samples:
$\hat{\boldsymbol{\Sigma}}_b^{\mathrm{MAP}} = \frac{\boldsymbol{\Phi} + K\,\hat{\boldsymbol{\Sigma}}_b}{m + K + D + 1}$ (16)
where $\hat{\boldsymbol{\Sigma}}_b$ is obtained from the standard PLDA.
By defining appropriate $\boldsymbol{\Sigma}_0$ and $k$, the above equation can be reformulated as a simple linear interpolation:
$\hat{\boldsymbol{\Sigma}}_b^{\mathrm{MAP}} = \frac{k\,\boldsymbol{\Sigma}_0 + K\,\hat{\boldsymbol{\Sigma}}_b}{k + K}$ (17)
where $\boldsymbol{\Sigma}_0 = \boldsymbol{\Phi}/k$ can be interpreted as a prior covariance, and $k = m + D + 1$ is a hyper-parameter that represents the number of virtual samples associated with the prior. We have therefore derived a simple form of MAP estimation for the between-class covariance of PLDA.
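In code, (17) reduces to a one-line interpolation; the default identity prior and the argument names below are our own choices:

```python
import numpy as np

def map_between_class_cov(sigma_b_ml, K, k, sigma0=None):
    """MAP between-class covariance of (17): sigma_b_ml is the ML estimate
    from standard PLDA (supported by K training classes), sigma0 is the prior
    covariance, and k is the prior weight (number of virtual prior samples)."""
    if sigma0 is None:
        sigma0 = np.eye(sigma_b_ml.shape[0])   # identity prior as a default
    return (k * sigma0 + K * sigma_b_ml) / (k + K)
```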
We highlight that although the final result (17) looks simple, it should not be regarded as trivial. In fact, to derive this result, we have assumed that all the classes involve the same number of samples. This is not even true in most practical settings, demonstrating that (17) is not as straightforward as it appears at first glance. (Fortunately, one can verify that if each class contains sufficient samples, (15) remains correct and hence (17) still holds.)
2.3 Application to length normalization
The MAP-estimated $\boldsymbol{\Sigma}_b$ (and hence $\boldsymbol{\epsilon}$) can be used directly in PLDA scoring, which we will call PLDA/MAP. Moreover, the more robust $\boldsymbol{\Sigma}_b$ can be used to improve length normalization (LN) as well. LN is a simple and effective trick that has been widely used in speaker verification [7]. The key idea is that for a high-dimensional Gaussian distribution, most of the samples concentrate on an ellipsoidal surface defined by the covariance. Supposing the distribution has been aligned to the axes, this surface is as follows:
$\sum_{d=1}^{D} \frac{z_d^{2}}{\sigma_d^{2}} = D$ (18)
where $\sigma_d^{2}$ is the variance of the $d$-th dimension. In the PLDA model, this variance consists of the between-class variance $\epsilon_d$ and the within-class variance $1$, i.e., $\sigma_d^{2} = \epsilon_d + 1$. LN scales a speaker vector onto this surface if it does not lie on it, with the scale factor computed by:
$\alpha = \sqrt{\frac{D}{\sum_{d=1}^{D} z_d^{2}/\sigma_d^{2}}}$ (19)
It has been shown that this scaling can greatly improve the Gaussianity of the speaker vectors, hence making them more suitable for PLDA modeling.
Here we encounter the same problem as in PLDA scoring: if $\boldsymbol{\Sigma}_b$ is not well estimated, the scaling will be incorrect. In particular, for speaker vectors aligned to directions with a large $\epsilon_d$, the scaling tends to be aggressive. The MAP-based estimation for $\boldsymbol{\Sigma}_b$ discounts large variances and is thus expected to alleviate this problem. For that purpose, we simply use the MAP-estimated $\epsilon_d$ to compute the scale factor $\alpha$. We will call the revised length normalization LN/MAP.
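A minimal sketch of LN and LN/MAP following (18) and (19) is given below; eps denotes the diagonal of the between-class covariance (the ML one for standard LN, the MAP one for LN/MAP), and the function names are ours:

```python
import numpy as np

def ln_scale(z, eps):
    """Scale factor of (19) for one speaker vector z, using the per-dimension
    total variance sigma_d^2 = eps_d + 1 (between-class plus within-class)."""
    sigma2 = eps + 1.0
    return np.sqrt(len(z) / np.sum(z ** 2 / sigma2))

def length_normalize(z, eps):
    """Scale z onto the ellipsoidal surface of (18)."""
    return ln_scale(z, eps) * z

# LN/MAP simply replaces eps with the diagonal of the MAP-estimated
# between-class covariance, e.g. np.diag(map_between_class_cov(...)).
```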
3 Related work
Brummer et al. [1] presented the initial idea of Bayesian PLDA, with the aim of overcoming the shortcomings associated with the point estimation of the parameters in the conventional ML-based PLDA model.
Villalba et al. [25] employed a Bayesian approach to improve speaker verification with i-vectors [4]. This approach was further extended to deal with domain adaptation, where the PLDA parameters obtained in one domain were used as the prior when training PLDA in a new domain [27, 10]. Although theoretically interesting, it relies on variational inference in both training and test, which is computationally inconvenient, and so it is rarely used in practice.
The Inverse-Wishart distribution has generally been used as a prior for distance/correlation matrices. For example, Fang et al. [6] employed this prior to regularize metric learning in an i-vector system. Ito et al. [10] employed this prior to adapt the covariance in the GMM-UBM architecture for speaker verification.
4 Experiments
We evaluate the proposed approach on a speaker verification task, following the deep speaker embedding framework [5, 24, 12]. Given a speech segment, a speaker vector is produced by a deep neural network that consists of a frame-level feature extractor and an utterance-level pooling component. In this paper, we employ the x-vector model [23] to produce the speaker vectors. This model is trained using the Kaldi toolkit [18], following the SITW recipe (https://github.com/kaldi-asr/kaldi/tree/master/egs/sitw/). The dimensionality of the x-vectors was set to 512. Once the speaker vectors are obtained, a PLDA model with LDA dimension reduction is trained and employed to score the test trials. Note that our research goal here is to demonstrate the MAP-based estimation for $\boldsymbol{\Sigma}_b$ rather than to present a SOTA speaker verification system. For this purpose, using a public recipe in Kaldi is a reasonable choice. Readers can refer to [13, 26] for SOTA performance on the same task.

Table 1: EER (%) results on SITW and HI-MIA.

LDA dim | No. | Model | SITW.Dev | SITW.Eval | HI-MIA.Dev | HI-MIA.Eval
LDA[512] | 1 | PLDA | 3.697 | 4.019 | 1.080 | 0.891
LDA[512] | 2 | PLDA/MAP | 3.466 | 3.909 | 0.945 | 0.810
LDA[512] | 3 | PLDA + LN | 4.005 | 4.647 | 1.484 | 1.296
LDA[512] | 4 | PLDA + LN/MAP | 3.928 | 4.483 | 1.350 | 1.134
LDA[512] | 5 | PLDA/MAP + LN | 3.889 | 4.429 | 1.080 | 0.891
LDA[512] | 6 | PLDA/MAP + LN/MAP | 3.812 | 4.374 | 1.080 | 0.891
LDA[150] | 7 | PLDA | 3.273 | 3.800 | 1.215 | 0.972
LDA[150] | 8 | PLDA/MAP | 3.196 | 3.745 | 1.080 | 0.891
LDA[150] | 9 | PLDA + LN | 2.965 | 3.362 | 1.350 | 1.134
LDA[150] | 10 | PLDA + LN/MAP | 3.003 | 3.335 | 1.350 | 1.053
LDA[150] | 11 | PLDA/MAP + LN | 3.003 | 3.417 | 1.215 | 1.134
LDA[150] | 12 | PLDA/MAP + LN/MAP | 2.926 | 3.417 | 1.080 | 0.972
4.1 Data
Three datasets were used in our experiments: VoxCeleb, SITW, and HI-MIA. Details are as follows:
VoxCeleb [16, 3]: An open-source speaker dataset collected from media sources by the University of Oxford. This dataset contains 2,000+ hours of speech signals from 7,000+ speakers. It was used to train the x-vector model and the PLDA model used in the test on the SITW dataset.
SITW [14]: A standard evaluation dataset consisting of 299 speakers. The core-core trials built on the SITW.Dev set were used to optimize the prior weight $k$ in the MAP estimation of (17). The core-core trials built on the SITW.Eval set were used for evaluation.
HI-MIA [20]: An open-source text-dependent speaker recognition dataset. All the speech utterances contain the phrase 'Hi MIA', recorded by a microphone 3 meters away from the speaker. The development set (used for training PLDA and estimating the MAP prior weight) involves 5,062 utterances from 254 speakers, and the evaluation set involves 1,665 utterances from 86 speakers.
4.2 Behavior of the MAP estimation
In the first experiment, we study the behavior of the MAP estimation using the SITW.Dev core-core trials. We set the prior covariance $\boldsymbol{\Sigma}_0 = \boldsymbol{I}$ in (17), and set $K$ to the number of speakers used for training the PLDA model, which is 6,300 in our experiment. The performance of PLDA/MAP on SITW.Dev in terms of equal error rate (EER) is reported in Fig. 3, where the prior weight $k$ changes from 0 to 7,000. Note that PLDA/MAP with $k = 0$ is just the conventional PLDA. It is clear that PLDA/MAP can substantially improve the system performance with an appropriate $k$. Notice that there is an optimal $k$ that best trades off the contributions of the prior and the data.
4.3 Detailed results
In this experiment, we choose $k$ using the development sets (SITW.Dev for the SITW.Eval test, and HI-MIA.Dev for the HI-MIA.Eval test) based on the EER results with PLDA/MAP, and then apply the optimal $k$ to both PLDA/MAP and LN/MAP. The EER results are reported in Table 1.
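The selection of $k$ can be sketched as a simple sweep over the development trials; score_trials and compute_eer below are hypothetical helpers standing in for the actual PLDA scoring and EER routines, and the grid of $k$ values is our own choice:

```python
import numpy as np

def select_prior_weight(sigma_b_ml, K, dev_trials, dev_labels,
                        score_trials, compute_eer,
                        ks=range(0, 7001, 500)):
    """Pick the prior weight k that minimizes EER on the development trials.
    score_trials(trials, sigma_b) and compute_eer(scores, labels) are
    placeholders for the user's scoring pipeline and EER computation."""
    D = sigma_b_ml.shape[0]
    best_k, best_eer = None, np.inf
    for k in ks:
        sigma_b = (k * np.eye(D) + K * sigma_b_ml) / (k + K)   # interpolation (17)
        eer = compute_eer(score_trials(dev_trials, sigma_b), dev_labels)
        if eer < best_eer:
            best_k, best_eer = k, eer
    return best_k
```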
Firstly, we observe that in almost all cases, PLDA/MAP outperforms the conventional PLDA. The general improvement obtained by PLDA/MAP implies that the MAP estimation indeed delivers a better between-class covariance.
Secondly, we found that in most tests, LN/MAP clearly outperforms the standard LN. This further confirms that the MAP estimation produces a better between-class covariance. Moreover, since the improvement was obtained with the prior weight selected based on PLDA/MAP, we conclude that the priors for PLDA/MAP and LN/MAP are consistent.
Thirdly, we observe that in the LDA[512] tests, PLDA/MAP + LN (system 5) generally outperforms PLDA + LN (system 3), but this is not always the case in the LDA[150] tests. For example, in the SITW test, PLDA/MAP + LN (system 11) performs worse than PLDA + LN (system 9). A possible reason is that the LN operation changes the data statistics, so the MAP estimation based on the original statistics may be suboptimal. In general, LN/MAP is safer than PLDA/MAP, as it improves performance in almost all cases, though in many cases PLDA/MAP can deliver more improvement than LN/MAP.
Finally, combining PLDA/MAP and LN/MAP (system 6) may lead to performance improvements in some cases, but not always, and the improvement, even when observed, is not significant. This can again be explained by the suboptimality of the MAP estimation for the data after length normalization.
5 Conclusions
We presented a simple form of MAP estimation for the between-class covariance of the PLDA model. Our derivation shows that under mild conditions, the MAP estimation can be formed as a linear interpolation between the ML estimation obtained by standard PLDA and a prior covariance. We applied the MAP-estimated between-class covariance to both PLDA scoring and length normalization, and encouraging performance improvements were obtained. Future work will investigate better strategies to combine MAP estimation and length normalization.
Acknowledgment
This work was supported by the National Natural Science Foundation of China (NSFC) under Grants No.61633013 and No.62171250, and also the Innovation Research Program of Northwest Minzu University under Grant No.YXM2021005.
References
- [1] (2010) Bayesian PLDA. Technical report, Agnitio Labs.
- [2] (1997) Speaker recognition: a tutorial. Proceedings of the IEEE 85 (9), pp. 1437–1462.
- [3] (2018) VoxCeleb2: deep speaker recognition. In Proceedings of the Annual Conference of the International Speech Communication Association (INTERSPEECH), pp. 1086–1090.
- [4] (2011) Front-end factor analysis for speaker verification. IEEE Transactions on Audio, Speech, and Language Processing 19 (4), pp. 788–798.
- [5] (2014) Deep learning: methods and applications. Foundations and Trends in Signal Processing 7 (3–4), pp. 197–387.
- [6] (2013) Bayesian distance metric learning on i-vector for speaker verification. Ph.D. Thesis, Massachusetts Institute of Technology.
- [7] (2011) Analysis of i-vector length normalization in speaker recognition systems. In Proceedings of the Annual Conference of the International Speech Communication Association (INTERSPEECH).
- [8] (2015) Speaker recognition by machines and humans: a tutorial review. IEEE Signal Processing Magazine 32 (6), pp. 74–99.
- [9] (2006) Probabilistic linear discriminant analysis. In European Conference on Computer Vision (ECCV), pp. 531–542.
- [10] (2008) Speaker recognition based on variational Bayesian method. In Ninth Annual Conference of the International Speech Communication Association.
- [11] (2010) Bayesian speaker verification with heavy-tailed priors. In Proceedings of Odyssey: The Speaker and Language Recognition Workshop, pp. 14.
- [12] (2017) Deep speaker feature learning for text-independent speaker verification. In Proceedings of the Annual Conference of the International Speech Communication Association (INTERSPEECH), pp. 1542–1546.
- [13] (2019) Analysis of BUT submission in far-field scenarios of VOiCES 2019 challenge. In Proceedings of the Annual Conference of the International Speech Communication Association (INTERSPEECH), pp. 15–19.
- [14] (2016) The speakers in the wild (SITW) speaker recognition database. In Proceedings of the Annual Conference of the International Speech Communication Association (INTERSPEECH), pp. 818–822.
- [15] (2007) Conjugate Bayesian analysis of the Gaussian distribution. Technical report.
- [16] (2017) VoxCeleb: a large-scale speaker identification dataset. In Proceedings of the Annual Conference of the International Speech Communication Association (INTERSPEECH).
- [17] (1933) IX. On the problem of the most efficient tests of statistical hypotheses. Philosophical Transactions of the Royal Society of London. Series A, Containing Papers of a Mathematical or Physical Character 231 (694–706), pp. 289–337.
- [18] (2011) The Kaldi speech recognition toolkit. In IEEE Workshop on Automatic Speech Recognition and Understanding.
- [19] (2007) Probabilistic linear discriminant analysis for inferences about identity. In 2007 IEEE 11th International Conference on Computer Vision, pp. 1–8.
- [20] (2020) HI-MIA: a far-field text-dependent speaker verification database and the baselines. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 7609–7613.
- [21] (2002) An overview of automatic speaker recognition technology. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Vol. 4, pp. IV-4072.
- [22] (2014) Unifying probabilistic linear discriminant analysis variants in biometric authentication. In Joint IAPR International Workshops on Statistical Techniques in Pattern Recognition (SPR) and Structural and Syntactic Pattern Recognition (SSPR), pp. 464–475.
- [23] (2018) X-vectors: robust DNN embeddings for speaker recognition. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5329–5333.
- [24] (2014) Deep neural networks for small footprint text-dependent speaker verification. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4052–4056.
- [25] (2011) Towards fully Bayesian speaker recognition: integrating out the between-speaker covariance. In Twelfth Annual Conference of the International Speech Communication Association.
- [26] (2020) State-of-the-art speaker recognition with neural network embeddings in NIST SRE18 and Speakers in the Wild evaluations. Computer Speech & Language 60, pp. 101026.
- [27] (2012) Bayesian adaptation of PLDA based speaker recognition to domains with scarce development data. In Proceedings of Odyssey: The Speaker and Language Recognition Workshop, pp. 47–54.
- [28] (2020) Remarks on optimal scores for speaker recognition. arXiv preprint arXiv:2010.04862.