
An MAP Estimation for Between-Class Variance

Probabilistic linear discriminant analysis (PLDA) has been widely used in open-set verification tasks, such as speaker verification. A potential issue of this model is that the training set often contains a limited number of classes, which makes the estimation of the between-class variance unreliable. This unreliable estimation often leads to degraded generalization. In this paper, we present an MAP estimation for the between-class variance by employing an Inverse-Wishart prior. A key problem is that with hierarchical models such as PLDA, the prior is placed on the variance of the class means while the likelihood is based on the class members, which makes the posterior inference intractable. We derive a simple MAP estimation for such a model and test it in both PLDA scoring and length normalization. In both cases, the MAP-based estimation delivers interesting performance improvements.



1 Introduction

Probabilistic linear discriminant analysis (PLDA) [9, 19, 22] has been extensively used in open-set verification tasks, such as speaker verification [2, 21, 8]. It represents the data with a linear Gaussian model, where the between-class distribution is a Gaussian and the within-class distributions of individual classes are homogeneous Gaussians. The parameters of this model involve a linear transform matrix and the between-class covariance of the data after the linear transform, and they can be estimated by maximum likelihood (ML) training. Once the model has been trained, it can decide whether two samples are produced from the same class or from two different classes [9], and this decision is optimal in terms of minimum Bayes risk (MBR) [28].

A potential problem of PLDA is the unreliable estimation of the between-class covariance. In many applications, the number of classes in the training data is limited. Taking speaker verification as an example, the largest open-source dataset, VoxCeleb, contains 7,000+ speakers. Considering the high dimensionality of the data (e.g., speaker vectors in speaker verification, whose dimension is 400-600), it is difficult to estimate a reasonable between-speaker covariance with maximum likelihood training.

For an intuition, we ran a simulation experiment: we draw samples from a Gaussian distribution and from a Laplacian distribution, and compute the ML-based variance estimate for each sampling. We plot the variance of these variance estimates (the "variance's variance") to show the reliability of the estimation. As shown in Fig. 1, when the number of samples is small, the variance's variance is large, indicating that the estimation is highly unreliable. This effect is even more pronounced with the Laplacian distribution, due to its heavy-tailed property.

Figure 1: Variance's variance of samples from Gaussian and Laplacian distributions. For each distribution, we first sample a set of data points and compute their variance (i.e., the ML estimate of the underlying true variance). This process is repeated 10,000 times, and the variance of the 10,000 estimates is computed. The x-axis is the number of samples per draw, and the y-axis is the variance's variance. For simplicity, the data points are one-dimensional. For the sake of comparison, the true variances of the two distributions are both set to 1.0.
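
The simulation above can be sketched in a few lines of NumPy (the function and variable names are ours, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

# Both distributions are scaled so that the true variance is 1.0.
def gaussian(n):
    return rng.normal(0.0, 1.0, size=n)

def laplacian(n):
    # Laplace(0, b) has variance 2*b^2, so b = 1/sqrt(2) gives variance 1.
    return rng.laplace(0.0, 1.0 / np.sqrt(2.0), size=n)

def variance_of_variance(sampler, n, repeats=10_000):
    """Draw n points `repeats` times, take the ML variance estimate of
    each draw, and return the variance of those estimates."""
    estimates = [np.var(sampler(n)) for _ in range(repeats)]
    return np.var(estimates)

for n in (10, 100, 1000):
    print(n,
          variance_of_variance(gaussian, n, repeats=2_000),
          variance_of_variance(laplacian, n, repeats=2_000))
```

Small draws give a visibly larger variance's variance, and the heavy-tailed Laplacian is worse than the Gaussian at any fixed draw size.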

For PLDA, the limited number of classes leads to the same unreliable estimation of the between-class covariance. For speaker verification, the distribution of speaker vectors is known to be heavy-tailed [11], which makes the ML estimation even more unreliable. Moreover, the dimensionality of speaker vectors is often as high as 400-600, which further exacerbates the problem.

In this paper, we propose a robust estimation for the between-class variance by placing an Inverse-Wishart prior on it and then conducting maximum a posteriori (MAP) estimation. At first glance, the MAP estimation seems trivial if the prior and the associated conditional likelihood are given. However, for hierarchical probabilistic models such as PLDA, it is much more involved. This is because the prior is placed on the covariance of the class means while the likelihood is based on the class members. This complication leads to intractable inference for the between-class covariance. We will prove that under some mild conditions, the MAP estimation can be reformulated as a simple linear interpolation between the estimate derived by standard PLDA and a prior covariance.

2 Theory

2.1 Preliminaries of PLDA

We consider the two-covariance form of the PLDA model [22], which assumes a linear Gaussian model as follows, where k indexes the class and i indexes the samples of a particular class:

μ_k ∼ N(0, ε)    (1)
z_ki ∼ N(μ_k, I)    (2)
x_ki = A z_ki    (3)

where we assume the transform matrix A is of full rank. By this assumption, the within-class covariance in the transformed space z = A⁻¹x is the identity matrix I, and the between-class covariance of the data is computed as follows:

Σ_b = A ε Aᵀ    (4)

The likelihood of the data of a particular class can be computed as follows, where μ_k denotes the class mean:

p(x_k1, …, x_kn) = ∫ p(μ_k) ∏_{i=1}^{n} p(x_ki | μ_k) dμ_k    (5)

Since p(μ_k) and p(x_ki | μ_k) are both Gaussian, it is easy to show that p(x_k1, …, x_kn) is Gaussian and can be computed efficiently. Collecting all the data of the K classes, the likelihood function can be computed as follows:

L(θ) = ∏_{k=1}^{K} p(x_k1, …, x_kn; θ)    (6)

where θ collects the parameters (the linear transform and the between-class covariance). Maximizing this function with respect to these parameters leads to maximum likelihood (ML) training.

Once the model has been trained, it can be employed to perform verification tasks. According to hypothesis testing theory [17], the following likelihood ratio (LR) is optimal in terms of Bayes risk when used to judge whether a test sample x_t belongs to the class represented by the enrollment samples {x_1, …, x_n}:

LR = p(x_1, …, x_n, x_t) / ( p(x_1, …, x_n) p(x_t) )    (7)

Thanks to the linear Gaussian form of the model, the LR score has a closed form and can be computed efficiently [9, 19].
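
For intuition, the LR can be evaluated directly in the transformed space, where the within-class variance is 1 and the between-class variance is some eps. The sketch below treats one-dimensional samples and marginalizes the shared class mean in closed form via the joint Gaussian; all helper names are ours, and this is an illustration rather than the paper's implementation:

```python
import numpy as np

def log_gauss(x, cov):
    """Log density of a zero-mean Gaussian N(0, cov) evaluated at x."""
    x = np.atleast_1d(np.asarray(x, dtype=float))
    d = len(x)
    _, logdet = np.linalg.slogdet(cov)
    maha = x @ np.linalg.solve(cov, x)
    return -0.5 * (d * np.log(2.0 * np.pi) + logdet + maha)

def same_class_cov(n, eps):
    """Joint covariance of n one-dimensional samples sharing one class
    mean: between-class variance eps plus unit within-class noise."""
    return eps * np.ones((n, n)) + np.eye(n)

def plda_llr(enroll, test, eps):
    """log LR = log p(enroll, test) - log p(enroll) - log p(test)."""
    both = np.append(enroll, test)
    return (log_gauss(both, same_class_cov(len(both), eps))
            - log_gauss(enroll, same_class_cov(len(enroll), eps))
            - log_gauss([test], same_class_cov(1, eps)))
```

A test sample near the enrollment samples yields a positive score, while a distant one yields a negative score.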

2.2 MAP estimation for the between-class variance

Suppose an Inverse-Wishart prior on the between-class covariance ε [15]:

p(ε) = (1/Z) |ε|^{−(ν+d+1)/2} exp( −½ tr(Ψ ε⁻¹) )    (8)

where d is the dimension of the data, Ψ and ν are the scale matrix and degrees of freedom of the prior, and Z is a normalization term. If there are N observations following a Gaussian with covariance ε, it is easy to derive that the MAP estimation for ε is given as follows [15]:

ε_MAP = (Ψ + N ε_ML) / (ν + N + d + 1)    (9)

where ε_ML denotes the ML estimation for ε with the N observations.
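
Applying Eq. (9) is mechanical once the observations are in hand; a small NumPy sketch (the function name is ours):

```python
import numpy as np

def map_covariance(samples, Psi, nu):
    """MAP estimate of a zero-mean Gaussian covariance under an
    Inverse-Wishart(Psi, nu) prior:
        (Psi + N * Sigma_ML) / (nu + N + d + 1)."""
    N, d = samples.shape
    sigma_ml = samples.T @ samples / N   # ML estimate for zero-mean data
    return (Psi + N * sigma_ml) / (nu + N + d + 1)
```

With few samples the estimate is pulled toward the prior mode Ψ / (ν + d + 1); with many samples it approaches the ML estimate.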

If we place an Inverse-Wishart prior on the between-class covariance ε of the PLDA model, the graphical representation is shown in Fig. 2. In this case, the interaction between ε and the observations is indirect, and there are no i.i.d. Gaussian samples that can be used to estimate ε by (9). One possibility is to use the likelihood function (5) as the conditional probability and derive the MAP estimation as follows:

ε_MAP = argmax_ε p(ε) ∏_{k=1}^{K} p(x_k1, …, x_kn; ε)    (10)

Figure 2: Graphical model of PLDA with an Inverse-Wishart prior on the between-class variance.

Since the likelihood involves an undetermined transform parameter and requires marginalizing over the class means, the inference for ε is intractable. Although a variational approach can be used [25], the iterative process leads to increased computational load. We will show that a simple MAP estimation can be derived from the estimate produced by standard PLDA, under mild conditions.

Proposition 1.

If every class involves n training samples and the PLDA model has been well trained, the between-class covariance can be written as

ε = (1/K) ∑_{k=1}^{K} z̄_k z̄_kᵀ − (1/n) I,

where z̄_k = (1/n) ∑_{i=1}^{n} z_ki is the mean of the transformed members of class k.

Proof.

Since the PLDA model has been well trained, the transformed data z_ki = A⁻¹ x_ki show a regulated distribution: the between-class variance is ε and the within-class variance is I. Considering a particular class k, the joint probability of its class members is a function of ε. Considering a particular dimension j, with scalar between-class variance ε_j, the probability is given by:

p(z_k1^j, …, z_kn^j) = ∫ N(μ; 0, ε_j) ∏_{i=1}^{n} N(z_ki^j; μ, 1) dμ    (11)

Considering all the K classes gives the log likelihood L(ε_j) = ∑_k log p(z_k1^j, …, z_kn^j). As a function of ε_j, each term depends on the data only through the class mean z̄_k^j = (1/n) ∑_i z_ki^j, which is distributed as N(0, ε_j + 1/n). Taking the derivative of L(ε_j) with respect to ε_j and setting it to 0:

∑_{k=1}^{K} [ (z̄_k^j)² / (ε_j + 1/n)² − 1 / (ε_j + 1/n) ] = 0    (12)

A simple computation shows:

ε_j = (1/K) ∑_{k=1}^{K} (z̄_k^j)² − 1/n    (13)

Since all the dimensions are independent, collecting all dimensions gives:

ε = (1/K) ∑_{k=1}^{K} z̄_k z̄_kᵀ − (1/n) I    (14)

Obviously, if n is large, the estimation for ε approaches:

ε ≈ (1/K) ∑_{k=1}^{K} z̄_k z̄_kᵀ    (15)

which can be interpreted as an ML estimation for the covariance of a Gaussian distribution represented by the virtual samples z̄_k.

Since the ε derived by PLDA is equivalent to the ML covariance derived from the K virtual samples z̄_k (the transformed class means), we can use these virtual samples as the observations of the underlying Gaussian model and derive the MAP estimation for their covariance by (9):

ε_MAP = (Ψ + K ε) / (ν + K + d + 1)    (16)

where ε is obtained from the standard PLDA.

By defining Ψ = m ε₀ and ν = m − d − 1, the above equation can be reformulated as a simple linear interpolation:

ε_MAP = (K ε + m ε₀) / (K + m)    (17)

where ε₀ can be interpreted as a prior covariance, and m is a hyper-parameter that represents the number of virtual samples associated with ε₀. We have therefore derived a simple form of MAP estimation for the between-class variance of PLDA.

We highlight that although the final result (17) looks simple, it should not be regarded as trivial. In fact, to derive it we have assumed that all the classes involve the same number of samples, which rarely holds in practice, demonstrating that (17) is not as straightforward as it first appears.¹

¹Fortunately, one can verify that if each class contains sufficient samples, (15) remains correct and hence (17) still holds.
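
The interpolation itself is a one-line computation. The sketch below assumes our reconstruction above (prior scale Ψ = m ε₀ and ν = m − d − 1); the function name is ours:

```python
import numpy as np

def smooth_between_class(eps_ml, eps0, K, m):
    """Blend the ML between-class covariance eps_ml (estimated from K
    classes) with a prior covariance eps0, weighted as m virtual samples:
        (K * eps_ml + m * eps0) / (K + m)."""
    return (K * eps_ml + m * eps0) / (K + m)
```

Setting m = 0 recovers the conventional ML estimate, while increasing m pulls the estimate toward the prior covariance.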

2.3 Applied to length normalization

The MAP-estimated between-class covariance can be used directly in PLDA scoring, which we will call PLDA/MAP. Moreover, the more robust estimate can be used to improve length normalization (LN) as well. LN is a simple and effective trick that has been widely used in speaker verification [7]. The key idea is that for a high-dimensional Gaussian distribution, most of the samples concentrate on an ellipsoid surface defined by the covariance. Supposing the distribution has been aligned to the axes, the ellipsoid surface is as follows:

∑_{j=1}^{d} x_j² / σ_j² = d    (18)

where σ_j² is the variance of the j-th dimension. In the PLDA model, this variance consists of the between-class variance and the within-class variance, i.e., σ_j² = ε_j + 1. LN scales the speaker vectors onto this surface if they are not there, i.e., x′ = α x, with the scale factor computed by:

α = √( d / ∑_{j=1}^{d} x_j² / σ_j² )    (19)

It has been shown that this scaling can greatly improve the Gaussianity of the speaker vectors, hence making them more suitable for PLDA modeling.

Here we encounter the same problem as in PLDA scoring: if the between-class variance is not well estimated, the scaling will be incorrect. In particular, for speaker vectors aligned to directions with a large between-class variance, the scaling tends to be aggressive. The MAP-based estimation discounts large variances and is thus expected to alleviate this problem. For that purpose, we simply use the MAP-estimated variance to compute the scale factor α. We will call the revised length normalization LN/MAP.
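
The LN scaling sketched above amounts to the following (the helper name is ours; sigma2 holds the per-dimension total variance, between-class plus within-class):

```python
import numpy as np

def length_normalize(x, sigma2):
    """Scale x onto the ellipsoid sum_j x_j^2 / sigma2_j = d, where
    sigma2_j is the total (between- plus within-class) variance of
    dimension j."""
    d = len(x)
    alpha = np.sqrt(d / np.sum(x ** 2 / sigma2))
    return alpha * x
```

LN/MAP is obtained by feeding the MAP-smoothed between-class variances into sigma2 instead of the raw ML estimates.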

3 Related work

Brümmer et al. [1] presented the initial idea of Bayesian PLDA, with the aim of overcoming the shortcomings associated with the point estimation of the parameters in the conventional ML-based PLDA model.

Villalba et al. [25] employed a Bayesian approach to improve speaker verification with i-vectors [4]. This approach was further extended to deal with domain adaptation, where the PLDA parameters obtained in one domain were used as the prior when training PLDA in a new domain [27, 10]. Although theoretically appealing, it relies on variational inference in both training and test, which is computationally unfriendly, and so it is rarely used.

The Inverse-Wishart distribution has generally been used as a prior for distance/correlation matrices. For example, Fang et al. [6] employed this prior to regularize metric learning in an i-vector system. Ito et al. [10] employed it to adapt the covariances in the GMM-UBM architecture for speaker verification.

4 Experiments

We evaluate the proposed approach on a speaker verification task, following the deep speaker embedding framework [5, 24, 12]. Given a speech segment, a speaker vector is produced by a deep neural network that consists of a frame-level feature extractor and an utterance-level pooling component. In this paper, we employ the x-vector model [23] to produce the speaker vectors. This model is trained using the Kaldi toolkit [18], following the SITW recipe (https://github.com/kaldi-asr/kaldi/tree/master/egs/sitw/). The dimensionality of the x-vectors follows the recipe setting. Once the speaker vectors are obtained, a PLDA model with LDA dimension reduction is trained and employed to score the test trials. Note that our research goal here is to demonstrate the MAP-based estimation for the between-class covariance rather than to present a SOTA speaker verification system. For this purpose, using a public recipe in Kaldi is a reasonable choice. Readers can refer to [13, 26] for SOTA performance on the same task.

          No.  Model               SITW.Dev  SITW.Eval  HI-MIA.Dev  HI-MIA.Eval
LDA[512]   1   PLDA                   3.697      4.019       1.080        0.891
           2   PLDA/MAP               3.466      3.909       0.945        0.810
           3   PLDA + LN              4.005      4.647       1.484        1.296
           4   PLDA + LN/MAP          3.928      4.483       1.350        1.134
           5   PLDA/MAP + LN          3.889      4.429       1.080        0.891
           6   PLDA/MAP + LN/MAP      3.812      4.374       1.080        0.891
LDA[150]   7   PLDA                   3.273      3.800       1.215        0.972
           8   PLDA/MAP               3.196      3.745       1.080        0.891
           9   PLDA + LN              2.965      3.362       1.350        1.134
          10   PLDA + LN/MAP          3.003      3.335       1.350        1.053
          11   PLDA/MAP + LN          3.003      3.417       1.215        1.134
          12   PLDA/MAP + LN/MAP      2.926      3.417       1.080        0.972
Table 1: EER(%) results with different settings of PLDA and LN

4.1 Data

Three datasets were used in our experiments: VoxCeleb, SITW, and HI-MIA. Details are as follows:

VoxCeleb [16, 3]: An open-source speaker dataset collected from media sources by the University of Oxford. It contains 2,000+ hours of speech signals from 7,000+ speakers. This dataset was used to train the x-vector model and the PLDA model used in the test on the SITW dataset.

SITW [14]: A standard evaluation dataset consisting of 299 speakers. The core-core trials built on the SITW.Dev set were used to optimize the prior weight in the MAP estimation of (17). The core-core trials built on the SITW.Eval set were used for evaluation.

HI-MIA [20]: An open-source text-dependent speaker recognition dataset. All the speech utterances contain the words 'Hi, MIA', recorded by a microphone 3 meters away from the speaker. The development set (used for training PLDA and estimating the MAP prior weight) involves 5,062 utterances from 254 speakers, and the evaluation set involves 1,665 utterances from 86 speakers.

4.2 Behavior of the MAP estimation

In the first experiment, we study the behavior of the MAP estimation using the SITW.Dev core-core trials. We set the prior covariance to the identity (each dimension set to 1.0) in (17), and set the number of classes to the number of speakers used for training the PLDA model, which is 6,300 in our experiment. The performance of PLDA/MAP on SITW.Dev in terms of equal error rate (EER) is reported in Fig. 3, where the prior weight changes from 0 to 7,000. Note that PLDA/MAP with a prior weight of 0 is just the conventional PLDA. It is clear that PLDA/MAP can substantially improve system performance with an appropriate prior weight. Notice that there is an optimal prior weight that best trades off the contribution of the prior and the data.

Figure 3: EER results of PLDA/MAP on SITW.Dev with different prior weights.

4.3 Detailed results

In this experiment, we choose the prior weight using the development sets (SITW.Dev for the SITW.Eval test, and HI-MIA.Dev for the HI-MIA.Eval test) based on the EER results with PLDA/MAP, and then apply the optimal prior weight to both PLDA/MAP and LN/MAP. The EER results are reported in Table 1.
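
Since the prior weight is selected by development-set EER, a minimal EER computation may help with reproducing the selection. The threshold-sweep helper below is our own sketch, not the paper's code:

```python
import numpy as np

def eer(target_scores, nontarget_scores):
    """Equal error rate: sweep the threshold over all observed scores
    and return the average of the false-alarm and miss rates at the
    point where they are closest."""
    thresholds = np.sort(np.concatenate([target_scores, nontarget_scores]))
    far = np.array([np.mean(nontarget_scores >= t) for t in thresholds])  # false alarms
    frr = np.array([np.mean(target_scores < t) for t in thresholds])      # misses
    i = int(np.argmin(np.abs(far - frr)))
    return 0.5 * (far[i] + frr[i])
```

Selecting the prior weight then amounts to scoring the development trials once per candidate weight and keeping the weight with the lowest EER.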

Firstly, we observe that in almost all cases, PLDA/MAP outperforms the conventional PLDA. The general improvement obtained by PLDA/MAP implies that the MAP estimation indeed delivers a better between-class covariance.

Secondly, we found that in most tests, LN/MAP clearly outperforms the standard LN. This further confirms that the MAP estimation produces a better between-class covariance. Moreover, since the improvement was obtained with the prior weight selected based on PLDA/MAP, we conclude that the priors for PLDA/MAP and LN/MAP are consistent.

Thirdly, we observe that in the LDA[512]-dim tests, PLDA/MAP + LN (system 5) generally outperforms PLDA + LN (system 3), but this is not always the case in the LDA[150]-dim tests. For example, in the SITW test, PLDA/MAP + LN (system 11) performs worse than PLDA + LN (system 9). A possible reason is that the LN operation changes the data statistics, so the MAP estimation based on the original statistics may be suboptimal. In general, LN/MAP is safer than PLDA/MAP, as it improves performance in almost all cases, though in many cases PLDA/MAP delivers a larger improvement than LN/MAP.

Finally, it seems that combining PLDA/MAP and LN/MAP (system 6) may lead to performance improvement in some cases, but this is not always the case. The improvement, even when observed, is not significant. This can again be explained by the suboptimality of the MAP estimation for the data after length normalization.

5 Conclusions

We presented a simple form of MAP estimation for the between-class covariance in the PLDA model. Our derivation shows that under mild conditions, the MAP estimation takes the form of a linear interpolation between the ML estimation obtained by standard PLDA and a prior covariance. We applied the MAP-estimated between-class covariance to both PLDA scoring and length normalization, and interesting performance improvements were obtained in both cases. Future work will investigate better strategies to combine MAP estimation and length normalization.

Acknowledgment

This work was supported by the National Natural Science Foundation of China (NSFC) under Grants No.61633013 and No.62171250, and also the Innovation Research Program of Northwest Minzu University under Grant No.YXM2021005.

References

  • [1] N. Brümmer (2010) Bayesian PLDA. Technical report, Agnitio Labs.
  • [2] J. P. Campbell (1997) Speaker recognition: a tutorial. Proceedings of the IEEE 85(9), pp. 1437–1462.
  • [3] J. S. Chung, A. Nagrani, and A. Zisserman (2018) VoxCeleb2: deep speaker recognition. In Proceedings of the Annual Conference of International Speech Communication Association (INTERSPEECH), pp. 1086–1090.
  • [4] N. Dehak, P. J. Kenny, R. Dehak, P. Dumouchel, and P. Ouellet (2011) Front-end factor analysis for speaker verification. IEEE Transactions on Audio, Speech, and Language Processing 19(4), pp. 788–798.
  • [5] L. Deng and D. Yu (2014) Deep learning: methods and applications. Foundations and Trends in Signal Processing 7(3–4), pp. 197–387.
  • [6] X. Fang (2013) Bayesian distance metric learning on i-vector for speaker verification. Thesis, Massachusetts Institute of Technology.
  • [7] D. Garcia-Romero and C. Y. Espy-Wilson (2011) Analysis of i-vector length normalization in speaker recognition systems. In Proceedings of the Annual Conference of International Speech Communication Association (INTERSPEECH).
  • [8] J. H. Hansen and T. Hasan (2015) Speaker recognition by machines and humans: a tutorial review. IEEE Signal Processing Magazine 32(6), pp. 74–99.
  • [9] S. Ioffe (2006) Probabilistic linear discriminant analysis. In European Conference on Computer Vision (ECCV), pp. 531–542.
  • [10] T. Ito, K. Hashimoto, Y. Nankaku, A. Lee, and K. Tokuda (2008) Speaker recognition based on variational Bayesian method. In Ninth Annual Conference of the International Speech Communication Association.
  • [11] P. Kenny (2010) Bayesian speaker verification with heavy-tailed priors. In Proceedings of Odyssey: The Speaker and Language Recognition Workshop, pp. 14.
  • [12] L. Li, Y. Chen, Y. Shi, Z. Tang, and D. Wang (2017) Deep speaker feature learning for text-independent speaker verification. In Proceedings of the Annual Conference of International Speech Communication Association (INTERSPEECH), pp. 1542–1546.
  • [13] P. Matějka, O. Plchot, H. Zeinali, L. Mošner, A. Silnova, L. Burget, O. Novotný, and O. Glembek (2019) Analysis of BUT submission in far-field scenarios of VOiCES 2019 challenge. In Proceedings of INTERSPEECH 2019, pp. 15–19.
  • [14] M. McLaren, L. Ferrer, D. Castan, and A. Lawson (2016) The Speakers in the Wild (SITW) speaker recognition database. In Proceedings of the Annual Conference of International Speech Communication Association (INTERSPEECH), pp. 818–822.
  • [15] K. P. Murphy (2007) Conjugate Bayesian analysis of the Gaussian distribution. Technical report.
  • [16] A. Nagrani, J. S. Chung, and A. Zisserman (2017) VoxCeleb: a large-scale speaker identification dataset. In Proceedings of the Annual Conference of International Speech Communication Association (INTERSPEECH).
  • [17] J. Neyman and E. S. Pearson (1933) On the problem of the most efficient tests of statistical hypotheses. Philosophical Transactions of the Royal Society of London, Series A 231(694–706), pp. 289–337.
  • [18] D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Hannemann, P. Motlicek, Y. Qian, P. Schwarz, et al. (2011) The Kaldi speech recognition toolkit. In IEEE Workshop on Automatic Speech Recognition and Understanding.
  • [19] S. J. Prince and J. H. Elder (2007) Probabilistic linear discriminant analysis for inferences about identity. In 2007 IEEE 11th International Conference on Computer Vision, pp. 1–8.
  • [20] X. Qin, H. Bu, and M. Li (2020) HI-MIA: a far-field text-dependent speaker verification database and the baselines. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 7609–7613.
  • [21] D. A. Reynolds (2002) An overview of automatic speaker recognition technology. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Vol. 4, pp. IV-4072.
  • [22] A. Sizov, K. A. Lee, and T. Kinnunen (2014) Unifying probabilistic linear discriminant analysis variants in biometric authentication. In Joint IAPR International Workshops on Statistical Techniques in Pattern Recognition (SPR) and Structural and Syntactic Pattern Recognition (SSPR), pp. 464–475.
  • [23] D. Snyder, D. Garcia-Romero, G. Sell, D. Povey, and S. Khudanpur (2018) X-vectors: robust DNN embeddings for speaker recognition. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5329–5333.
  • [24] E. Variani, X. Lei, E. McDermott, I. L. Moreno, and J. Gonzalez-Dominguez (2014) Deep neural networks for small footprint text-dependent speaker verification. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4052–4056.
  • [25] J. Villalba and N. Brümmer (2011) Towards fully Bayesian speaker recognition: integrating out the between-speaker covariance. In Twelfth Annual Conference of the International Speech Communication Association.
  • [26] J. Villalba, N. Chen, D. Snyder, D. Garcia-Romero, A. McCree, G. Sell, J. Borgstrom, L. P. García-Perera, F. Richardson, R. Dehak, P. A. Torres-Carrasquillo, and N. Dehak (2020) State-of-the-art speaker recognition with neural network embeddings in NIST SRE18 and Speakers in the Wild evaluations. Computer Speech & Language 60, pp. 101026.
  • [27] J. Villalba and E. Lleida (2012) Bayesian adaptation of PLDA based speaker recognition to domains with scarce development data. In Proceedings of Odyssey: The Speaker and Language Recognition Workshop, pp. 47–54.
  • [28] D. Wang (2020) Remarks on optimal scores for speaker recognition. arXiv preprint arXiv:2010.04862.