Speaker verification (SV) is the task of verifying the identity of a person from the characteristics of his or her voice. It has been widely studied for decades with significant performance advancement. State-of-the-art SV systems are predominantly embedding based, comprising a front-end embedding extractor and a back-end scoring model. The front-end module transforms input speech into a compact embedding representation of speaker-related acoustic characteristics. The back-end model computes the similarity of two input speaker embeddings and determines whether they are from the same person.
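There are two commonly used back-end scoring methods. One is cosine scoring, which assumes that the input embeddings are angularly discriminative. The SV score is defined as the cosine similarity of two embeddings $\mathbf{x}_1$ and $\mathbf{x}_2$, which are mean-subtracted and length-normalized [garcia2011analysis], i.e.,
$$\hat{\mathbf{x}}_i = \frac{\mathbf{x}_i - \boldsymbol{\mu}}{\lVert \mathbf{x}_i - \boldsymbol{\mu} \rVert_2}, \quad i = 1, 2, \tag{1}$$
$$S_{\cos}(\hat{\mathbf{x}}_1, \hat{\mathbf{x}}_2) = \hat{\mathbf{x}}_1^{\top} \hat{\mathbf{x}}_2, \tag{2}$$
where $\boldsymbol{\mu}$ denotes the global mean of the embeddings.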
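The cosine back-end described above can be sketched in a few lines. This is a minimal illustration with NumPy, not the implementation used in the experiments; the function names and the toy random data are assumptions made for the example.

```python
import numpy as np

def normalize(embeddings, mean):
    """Subtract a global mean, then scale each row to unit L2 norm."""
    centered = embeddings - mean
    return centered / np.linalg.norm(centered, axis=1, keepdims=True)

def cosine_score(x1, x2):
    """Cosine similarity of two length-normalized embeddings (a plain dot product)."""
    return float(x1 @ x2)

rng = np.random.default_rng(0)
train = rng.normal(size=(100, 8))    # toy stand-in for training embeddings
mean = train.mean(axis=0)            # global mean estimated on the training data
pair = normalize(rng.normal(size=(2, 8)), mean)
score = cosine_score(pair[0], pair[1])
```

Because both embeddings have unit length after normalization, the dot product already equals the cosine of the angle between them, so no further division is needed at scoring time.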
The other method of back-end scoring is based on probabilistic linear discriminant analysis (PLDA) [ioffe2006probabilistic]. It assumes that the embeddings (also mean-subtracted and length-normalized) are in general Gaussian distributed.
It has been noted that the standard PLDA back-end performs significantly better than the cosine back-end on conventional i-vector embeddings [dehak2010front]. Unfortunately, with the powerful neural speaker embeddings that are widely used nowadays [zeinali2019but], the superiority of PLDA vanishes and even turns into inferiority. This phenomenon has been evident in our experimental studies, especially when the front-end is trained with the additive angular margin softmax loss [deng2019arcface, xiang2019margin].
The observation that PLDA is not as good as the cosine similarity runs against the common sense of back-end model design. Compared to the cosine, PLDA has more learnable parameters and incorporates additional speaker labels for training. Consequently, PLDA is generally considered to be more effective in discriminating speaker representations. This contradiction between experimental observations and theoretical expectation deserves a thoughtful investigation of PLDA. In [li2019gaussian, zhang2019vae, cai2020deep], it was argued that the problem arises from the neural speaker embeddings. It is noted that embeddings extracted from neural networks tend to be non-Gaussian for individual speakers and that the distributions across different speakers are non-homogeneous. These irregular distributions cause the performance degradation of verification systems with the PLDA back-end. In relation to this perspective, a series of regularization approaches have been proposed to force the neural embeddings to be homogeneously Gaussian distributed, e.g., Gaussian-constrained loss [li2019gaussian], variational auto-encoder [zhang2019vae] and discriminative normalization flow [cai2020deep, li2020neural].
In this paper, we present and substantiate a very different point of view from that in previous research. We argue that the suspected irregular distribution of speaker embeddings does not necessarily contribute to the inferiority of PLDA versus the cosine. Our view is based on the evidence that the cosine can be regarded as a special case of PLDA. Although this can be proved, we have not yet found any work mentioning it. Existing studies have treated the PLDA and the cosine scoring methods separately. We provide a short proof to unify them. It is noted that the cosine scoring, as a special case of PLDA, also assumes speaker embeddings to be homogeneously Gaussian distributed. Therefore, if the neural speaker embeddings were distributed irregularly as previously hypothesized, both back-ends should exhibit performance degradation.
By unifying the cosine and the PLDA back-ends, it can be shown that the cosine scoring puts stricter assumptions on the embeddings than PLDA. Details of these assumptions are explained in Section 3. Among them, the dimensional independence assumption is found to play a key role in explaining the performance gap between the two back-ends. This is evidenced by incorporating the dimensional independence assumption into the training of PLDA, leading to the diagonal PLDA (DPLDA). This variation of PLDA shows a significant performance improvement under the domain-matched condition. However, when severe domain mismatch exists and back-end adaptation is needed, PLDA performs better than both the cosine and DPLDA. This is because the dimensional independence assumption does not hold under such mismatch. Analysis of the between-/within-class covariance of speaker embeddings supports these statements.
2 Review of PLDA
Theoretically, PLDA is a probabilistic extension of the classical linear discriminant analysis (LDA) [balakrishnama1998linear]. It incorporates a Gaussian prior on the class centroids in LDA. Among the variants of PLDA, the two-covariance PLDA [sizov2014unifying] has been commonly used in speaker verification systems. A straightforward way to explain two-covariance PLDA is through a probabilistic graphical model [jordan2003introduction].
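Consider $N$ speech utterances coming from $M$ speakers, where the $m$-th speaker is associated with $n_m$ utterances. With a front-end embedding extractor, each utterance can be represented by an embedding of $D$ dimensions. The embedding of the $n$-th utterance from the $m$-th speaker is denoted as $\mathbf{x}_{m,n}$. Let $\mathcal{X} = \{\mathbf{x}_{m,n}\}$ represent these per-utterance embeddings. Additionally, PLDA supposes the existence of per-speaker embeddings $\mathcal{Y} = \{\mathbf{y}_m\}$. They are referred to as latent speaker identity variables in [brummer2010speaker].
With the graphical model shown in Fig. 1, these embeddings are generated as follows:
1) Randomly draw the per-speaker embedding $\mathbf{y}_m \sim \mathcal{N}(\boldsymbol{\mu}, \mathbf{B}^{-1})$, for $m = 1, \dots, M$;
2) Randomly draw the per-utterance embedding $\mathbf{x}_{m,n} \sim \mathcal{N}(\mathbf{y}_m, \mathbf{W}^{-1})$, for $n = 1, \dots, n_m$,
where $\theta = \{\boldsymbol{\mu}, \mathbf{B}, \mathbf{W}\}$ denotes the model parameters of PLDA. Note that $\mathbf{B}$ and $\mathbf{W}$ are precision matrices. The joint distribution $p_\theta(\mathcal{X}, \mathcal{Y})$ can be derived as,
$$p_\theta(\mathcal{X}, \mathcal{Y}) = \prod_{m=1}^{M} \left[ \mathcal{N}(\mathbf{y}_m; \boldsymbol{\mu}, \mathbf{B}^{-1}) \prod_{n=1}^{n_m} \mathcal{N}(\mathbf{x}_{m,n}; \mathbf{y}_m, \mathbf{W}^{-1}) \right].$$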
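Assuming the embeddings are mean-subtracted and length-normalized, we let the global mean $\boldsymbol{\mu} = \mathbf{0}$ to simplify the scoring function. Given two per-utterance embeddings $\mathbf{x}_1, \mathbf{x}_2$, PLDA generates a log-likelihood ratio (LLR) that measures the relative likelihood of the two embeddings coming from the same speaker. The LLR is defined as,
$$\mathrm{LLR}(\mathbf{x}_1, \mathbf{x}_2) = \log \frac{p(\mathbf{x}_1, \mathbf{x}_2 \mid \mathcal{H}_s)}{p(\mathbf{x}_1, \mathbf{x}_2 \mid \mathcal{H}_d)},$$
where $\mathcal{H}_s$ and $\mathcal{H}_d$ represent the same-speaker and different-speaker hypotheses. To derive the score function, without loss of generality, consider a set of $n$ embeddings $\mathcal{X}_n = \{\mathbf{x}_1, \dots, \mathbf{x}_n\}$ that come from the same speaker. It can be proved that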
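One standard way to carry out this marginalization over the latent speaker variable is sketched below. The notation is an assumption of this sketch and is not a transcription of the original equations: $\mathbf{B}$ and $\mathbf{W}$ are taken to be the between-class and within-class precision matrices of a two-covariance PLDA, the global mean is zero, $D$ is the embedding dimension, and $\mathbf{s}_n$ is the sum of the $n$ embeddings.

```latex
\begin{align}
\mathbf{s}_n &= \sum_{i=1}^{n} \mathbf{x}_i, \\
\log p(\mathcal{X}_n)
  &= -\frac{nD}{2}\log 2\pi
     + \frac{n}{2}\log\lvert\mathbf{W}\rvert
     + \frac{1}{2}\log\lvert\mathbf{B}\rvert
     - \frac{1}{2}\log\lvert\mathbf{B} + n\mathbf{W}\rvert \nonumber \\
  &\quad - \frac{1}{2}\sum_{i=1}^{n} \mathbf{x}_i^{\top}\mathbf{W}\mathbf{x}_i
     + \frac{1}{2}\,\mathbf{s}_n^{\top}\mathbf{W}
       (\mathbf{B} + n\mathbf{W})^{-1}\mathbf{W}\,\mathbf{s}_n, \\
\mathrm{LLR}(\mathbf{x}_1, \mathbf{x}_2)
  &= \log p(\{\mathbf{x}_1, \mathbf{x}_2\})
     - \log p(\{\mathbf{x}_1\}) - \log p(\{\mathbf{x}_2\}) \nonumber \\
  &= \frac{1}{2}(\mathbf{x}_1 + \mathbf{x}_2)^{\top}\mathbf{W}
       (\mathbf{B} + 2\mathbf{W})^{-1}\mathbf{W}(\mathbf{x}_1 + \mathbf{x}_2)
     - \frac{1}{2}\sum_{i=1}^{2}\mathbf{x}_i^{\top}\mathbf{W}
       (\mathbf{B} + \mathbf{W})^{-1}\mathbf{W}\mathbf{x}_i \nonumber \\
  &\quad + \log\lvert\mathbf{B} + \mathbf{W}\rvert
     - \frac{1}{2}\log\lvert\mathbf{B} + 2\mathbf{W}\rvert
     - \frac{1}{2}\log\lvert\mathbf{B}\rvert.
\end{align}
```

The second line follows from completing the square in the latent speaker variable, whose posterior precision is $\mathbf{B} + n\mathbf{W}$. In the pairwise LLR, the terms $\mathbf{x}_i^{\top}\mathbf{W}\mathbf{x}_i$ cancel between the numerator and the denominator.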
3 Cosine as a typical PLDA
Relating Eq.6 to Eq.2 for the cosine similarity measure, it is noted that when , the LLR of PLDA degrades into the cosine similarity, as . It is also noted that the condition of is not required. PLDA is equivalent to the cosine if and only if and , where .
Given , we have
Without loss of generality, we fix both covariances to the identity. In other words, the cosine is a typical PLDA whose within-class covariance and between-class covariance are both fixed as the identity matrix.
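The relation between the two scores can be checked numerically. The sketch below is an illustration under stated assumptions, not the paper's derivation: it assumes a two-covariance PLDA with zero mean, with `B` and `W` as between-class and within-class precision matrices, and the function names `log_marginal` and `plda_llr` are hypothetical. With both matrices set to the identity and unit-length embeddings, the LLR comes out as an affine function of the cosine similarity.

```python
import numpy as np

def _logdet(A):
    """Log-determinant of a positive-definite matrix."""
    return np.linalg.slogdet(A)[1]

def log_marginal(X, B, W):
    """log p(x_1..x_n) for embeddings that share one latent speaker variable.

    Assumes a zero global mean; B and W are precision matrices.
    """
    n, d = X.shape
    s = W @ X.sum(axis=0)                      # W times the sum of the embeddings
    L = B + n * W                              # posterior precision of the speaker variable
    quad = np.einsum("ij,jk,ik->", X, W, X)    # sum_i x_i^T W x_i
    return (-0.5 * n * d * np.log(2 * np.pi)
            + 0.5 * n * _logdet(W) + 0.5 * _logdet(B) - 0.5 * _logdet(L)
            - 0.5 * quad + 0.5 * s @ np.linalg.solve(L, s))

def plda_llr(x1, x2, B, W):
    """Same-speaker versus different-speaker log-likelihood ratio."""
    same = log_marginal(np.stack([x1, x2]), B, W)
    diff = log_marginal(x1[None, :], B, W) + log_marginal(x2[None, :], B, W)
    return same - diff

rng = np.random.default_rng(0)
d = 16
X = rng.normal(size=(6, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)  # length-normalize the toy embeddings
I = np.eye(d)
pairs = [(0, 1), (2, 3), (4, 5)]
llr = np.array([plda_llr(X[i], X[j], I, I) for i, j in pairs])
cos = np.array([X[i] @ X[j] for i, j in pairs])
offset = 0.5 * d * np.log(4.0 / 3.0) - 1.0 / 6.0   # constant term for identity matrices
print(np.allclose(llr, cos / 3.0 + offset))
```

Since the slope is positive and the offset does not depend on the trial, both scores rank trials identically, so EER and minimum DCF are unchanged. The same kind of affine relation, with a different slope and offset, holds when both matrices are positive multiples of the identity.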
So far we consider only the simplest pairwise scoring. In the general case of many-vs-many scoring, the PLDA and cosine are also closely related. For example, let us consider two sets of embeddings and of size and , respectively. Their centroids are denoted by and . It can be shown,
under the condition of . The term depends only on and .
This shows that the cosine imposes more stringent assumptions on the input embeddings than PLDA does. These assumptions are:
1) (dim-indep) Dimensions of speaker embeddings are mutually uncorrelated or independent;
2) Based on 1), all dimensions share the same variance value.
As the embeddings are assumed to be Gaussian, dimensional uncorrelatedness is equivalent to dimensional independence.
3.1 Diagonal PLDA
With Gaussian distributed embeddings, the dim-indep assumption implies that speaker embeddings have diagonal covariance. To analyse the significance of this assumption to the performance of the SV backend, a diagonal constraint is applied to updating the between-class and within-class matrices in Algorithm 1, i.e.,
where $(\cdot)^{\circ 2}$ denotes the Hadamard (element-wise) square. The PLDA trained in this way is named the diagonal PLDA (DPLDA). The relationship between DPLDA and PLDA is similar to that between the diagonal GMM and the full-covariance GMM.
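The effect of a diagonal constraint can be illustrated with a simple moment-based sketch. This is not the EM update of Algorithm 1, which is not reproduced here; it only shows how element-wise (Hadamard) squares yield per-dimension variances without forming full covariance matrices. The function name, the toy data and the normalization by the total number of utterances are assumptions made for the example.

```python
import numpy as np

def diagonal_covariances(X, labels):
    """Per-dimension between- and within-class variances (diagonal-only estimates)."""
    mu = X.mean(axis=0)
    between = np.zeros(X.shape[1])
    within = np.zeros(X.shape[1])
    for spk in np.unique(labels):
        Xs = X[labels == spk]
        mu_s = Xs.mean(axis=0)
        between += len(Xs) * (mu_s - mu) ** 2       # Hadamard square of the centroid offset
        within += ((Xs - mu_s) ** 2).sum(axis=0)    # Hadamard squares of the residuals
    return between / len(X), within / len(X)

rng = np.random.default_rng(0)
labels = np.repeat(np.arange(20), 5)                # 20 toy speakers, 5 utterances each
centroids = rng.normal(size=(20, 6))
X = centroids[labels] + 0.3 * rng.normal(size=(100, 6))
between, within = diagonal_covariances(X, labels)
```

Each estimate is a vector of length equal to the embedding dimension, so a diagonal model stores and inverts far fewer parameters than a full-covariance one, in the same way that a diagonal GMM does.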
4 Experimental setup
Experiments are carried out with the VoxCeleb1+2 [nagrani2017voxceleb] and the CNCeleb1 [li2022cn] databases. A vanilla ResNet34 [chung2020defence] model is trained with 1029K utterances from 5994 speakers in the training set of VoxCeleb2. Following the state-of-the-art training configuration (https://github.com/TaoRuijie/ECAPA-TDNN), data augmentation with speed perturbation, reverberation and spectrum augmentation [park2019specaugment] is applied. The AAM-softmax loss [deng2019arcface] is adopted to produce angular-discriminative speaker embeddings.
The input features to ResNet34 are 80-dimensional filterbank coefficients with mean normalization over a sliding window of up to 3 seconds. Voice activity detection is carried out with the default configuration in Kaldi (https://github.com/kaldi-asr/kaldi/blob/master/egs/voxceleb/v2/conf). The front-end module is trained to generate 256-dimensional speaker embeddings, which are subsequently mean-subtracted and length-normalized. The PLDA backend is implemented in Kaldi and modified into DPLDA according to Eq. 13-14.
Performance evaluation is carried out on the test sets of VoxCeleb1 and CNCeleb1. The evaluation metrics are the equal error rate (EER) and the decision cost function (DCF) with or .
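For readers unfamiliar with the EER, a minimal sketch of how it can be computed from trial scores is given below. It is an illustration only and not the scoring tool used in the experiments; the function name is hypothetical, tied scores receive no special handling, and the operating point is simply the accept count at which the two error rates are closest.

```python
import numpy as np

def equal_error_rate(scores, labels):
    """EER from trial scores; labels are 1 for target and 0 for non-target trials."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    order = np.argsort(-scores)                  # sort trials by descending score
    labels = labels[order]
    n_tar = labels.sum()
    n_non = len(labels) - n_tar
    # Accept the top-k trials for every k from 0 to the number of trials.
    accepted_tar = np.concatenate([[0], np.cumsum(labels)])
    accepted_non = np.concatenate([[0], np.cumsum(1 - labels)])
    fnr = 1.0 - accepted_tar / n_tar             # share of target trials rejected
    fpr = accepted_non / n_non                   # share of non-target trials accepted
    k = np.argmin(np.abs(fnr - fpr))             # point where the two rates are closest
    return float(0.5 * (fnr[k] + fpr[k]))

eer_separable = equal_error_rate([0.9, 0.8, 0.2, 0.1], [1, 1, 0, 0])
eer_overlap = equal_error_rate([0.9, 0.6, 0.7, 0.1], [1, 1, 0, 0])
```

In the second toy example one target trial scores below one non-target trial, so at the balanced threshold half of the targets are rejected and half of the non-targets are accepted.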
4.1 Performance comparison between backends
As shown in Table 1, the performance gap between the cosine and PLDA backends can be observed from the experiment on VoxCeleb. Cosine outperforms PLDA by relative improvements of in terms of equal error rate (EER) and in terms of minimum decision cost function with (DCF). The performance difference becomes much more significant with DCF, e.g., by PLDA versus by the cosine. Similar results are noted on other test sets of VoxCeleb1 (not listed here due to the page limit).
The conventional setting of using LDA to preprocess raw speaker embeddings before PLDA is evaluated. It is labelled as LDA+PLDA in Table 1. Using LDA appears to have a negative effect on PLDA. This may be due to the absence of the dim-indep constraint on LDA. We argue that it is unnecessary to apply LDA to regularize the embeddings. The commonly used LDA preprocessing is removed in the following experiments.
The DPLDA incorporates the dim-indep constraint into PLDA training. As shown in Table 1, it improves the EER of PLDA from to , which is comparable to cosine. This clearly confirms the importance of dim-indep.
4.2 Performance degradation in iterative PLDA training
According to the derivation in Section 3, PLDA implemented in Algorithm 1 is initialized as the cosine, e.g., . However, PLDA has been shown to be inferior to the cosine by the results in Table 1. Logically, the performance of PLDA is therefore expected to degrade during the iterative EM training. Fig. 2 plots the EER against the number of training iterations. Initially, PLDA achieves exactly the same performance as the cosine. In the first iteration, the EER rises sharply from 1.06% to 1.707%. For DPLDA, the dim-indep constraint counteracts the degradation.
4.3 When domain mismatch exists
The superiority of the cosine over PLDA has been evidenced on the VoxCeleb dataset, whose training and test data both come from the same domain, e.g., interviews collected from YouTube. In many real-world scenarios, domain mismatch between training and test data commonly exists. A practical solution is to acquire a certain amount of in-domain data and update the backend accordingly. The following experiment analyses the effect of domain mismatch on the performance of backend models.
The CNCeleb1 dataset is adopted as the domain-mismatched data. It is a multi-genre dataset of Chinese speech with very different acoustic conditions from VoxCeleb. The ResNet34 trained on VoxCeleb is deployed to extract embeddings from the utterances in CNCeleb1. The backends are trained and evaluated on the training and test embeddings of CNCeleb1.
As shown in Table 2, both the cosine and DPLDA perform worse than PLDA. Because the dim-indep assumption no longer holds, the diagonal constraint on covariance does not bring any performance improvement to the cosine and DPLDA.
4.4 Analysis of between-/within-class covariances
To analyze the correlation of individual dimensions of the embeddings, the between-class and within-class covariances, and , are computed as follows,
where and . These are the training equations of LDA and are closely related to the M-step of PLDA. Note that for visualization, the elements in and are converted into their absolute values.
In Fig. 3, both between-class and within-class covariances show clearly diagonal patterns in the domain-matched case (top plots). This provides additional evidence to support the aforementioned dim-indep assumption. However, the assumption breaks down with the strongly domain-mismatched data in CNCeleb. As shown by the two sub-plots at the bottom of Fig. 3, although the within-class covariance (right) still shows a clear diagonal pattern, the pattern tends to vanish for the between-class covariance (left). Off-diagonal elements have large absolute values and a dimension correlation pattern appears, suggesting that dim-indep is violated. The numerical measure of the diagonal index also confirms this observation.
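The kind of covariance analysis described above can be sketched as follows. The scatter definitions are the standard LDA ones, normalized here by the total number of utterances, which is an assumption and may differ from the normalization used for Fig. 3. The function `diagonal_ratio` is a hypothetical stand-in for a diagonality measure and is not claimed to be the diagonal index reported in the paper; the data are synthetic.

```python
import numpy as np

def class_covariances(X, labels):
    """Full between-class and within-class covariance matrices (LDA-style)."""
    mu = X.mean(axis=0)
    d = X.shape[1]
    S_b = np.zeros((d, d))
    S_w = np.zeros((d, d))
    for spk in np.unique(labels):
        Xs = X[labels == spk]
        mu_s = Xs.mean(axis=0)
        diff = (mu_s - mu)[:, None]
        S_b += len(Xs) * (diff @ diff.T)     # weighted outer product of the centroid offset
        R = Xs - mu_s
        S_w += R.T @ R                       # scatter of utterances around their centroid
    return S_b / len(X), S_w / len(X)

def diagonal_ratio(S):
    """Share of total absolute mass on the diagonal; 1.0 means purely diagonal."""
    A = np.abs(S)
    return float(np.trace(A) / A.sum())

rng = np.random.default_rng(0)
n_spk, n_utt, d = 200, 10, 4
labels = np.repeat(np.arange(n_spk), n_utt)
centroids = rng.normal(size=(n_spk, d))
mix = np.eye(d) + np.ones((d, d))            # mixing matrix that correlates the dimensions
noise = 0.1 * rng.normal(size=(n_spk * n_utt, d))
X_ind = centroids[labels] + noise            # speaker centroids with independent dimensions
X_cor = (centroids @ mix)[labels] + noise    # speaker centroids with correlated dimensions
S_b_ind, S_w_ind = class_covariances(X_ind, labels)
S_b_cor, S_w_cor = class_covariances(X_cor, labels)
```

In this toy setup the within-class covariance stays close to diagonal in both cases, while the between-class covariance loses its diagonal pattern once the speaker centroids are correlated across dimensions, which mirrors the qualitative pattern described for the domain-mismatched case.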
5 Conclusion
The reason why PLDA appears to be inferior to the cosine scoring with neural speaker embeddings has been explained with both theoretical and experimental evidence. It has been shown that the cosine scoring is essentially a special case of PLDA. Hence, the non-Gaussian distribution of speaker embeddings should not be held responsible for the performance difference between the PLDA and cosine back-ends. Instead, the difference should be attributed to the dimensional independence assumption made by the cosine, as evidenced in our experimental results and analysis. Nevertheless, this assumption fits well only in the domain-matched condition. When severe domain mismatch exists, the assumption no longer holds and PLDA can work better than the cosine. Further improvements on PLDA need to take this assumption into consideration. It is worth noting that the AAM-softmax loss should have the benefit of regularizing embeddings to be homogeneously Gaussian, given the good performance of the cosine scoring.