1 Introduction
Speaker verification (SV) is the task of verifying the identity of a person from the characteristics of his or her voice. It has been widely studied for decades with significant performance advancement. State-of-the-art SV systems are predominantly embedding based, comprising a frontend embedding extractor and a backend scoring model. The frontend module transforms input speech into a compact embedding representation of speaker-related acoustic characteristics. The backend model computes the similarity of two input speaker embeddings and determines whether they are from the same person.
There are two commonly used backend scoring methods. One is cosine scoring, which assumes the input embeddings are angularly discriminative. The SV score is defined as the cosine similarity of two embeddings $x_1$ and $x_2$, which are mean-subtracted and length-normalized [garcia2011analysis], i.e.,
$$x_i \leftarrow \frac{x_i - \mu}{\lVert x_i - \mu \rVert_2}, \quad i = 1, 2 \tag{1}$$
$$S_{\cos}(x_1, x_2) = x_1^{\top} x_2 \tag{2}$$
The other method of backend scoring is based on probabilistic linear discriminant analysis (PLDA) [ioffe2006probabilistic]. It assumes that the embeddings (also mean-subtracted and length-normalized) are in general Gaussian distributed.
It has been noted that the standard PLDA backend performs significantly better than the cosine backend on conventional i-vector embeddings [dehak2010front]. Unfortunately, with the powerful neural speaker embeddings that are widely used nowadays [zeinali2019but], the superiority of PLDA vanishes and even turns into inferiority. This phenomenon has been evident in our experimental studies, especially when the frontend is trained with the additive angular margin softmax loss [deng2019arcface, xiang2019margin].
The observation that PLDA is not as good as the cosine similarity runs counter to the conventional understanding of backend model design. Compared to the cosine, PLDA has more learnable parameters and incorporates additional speaker labels for training. Consequently, PLDA is generally considered to be more effective in discriminating speaker representations. This contradiction between experimental observations and theoretical expectation deserves careful investigation of PLDA.
In [li2019gaussian, zhang2019vae, cai2020deep], Cai et al. argued that the problem arises from the neural speaker embeddings. It is noted that embeddings extracted from neural networks tend to be non-Gaussian for individual speakers and that the distributions across different speakers are non-homogeneous. These irregular distributions cause the performance degradation of verification systems with the PLDA backend. In relation to this perspective, a series of regularization approaches have been proposed to force the neural embeddings to be homogeneously Gaussian distributed, e.g., Gaussian-constrained loss [li2019gaussian], variational autoencoder [zhang2019vae] and discriminative normalization flow [cai2020deep, li2020neural].
In this paper, we present and substantiate a very different point of view from that in previous research. We argue that the suspected irregular distribution of speaker embeddings does not necessarily contribute to the inferiority of PLDA versus the cosine. Our view is based on the evidence that the cosine can be regarded as a special case of PLDA. To the best of our knowledge, no prior work has mentioned this relationship; existing studies treat the PLDA and the cosine scoring methods separately. We provide a short proof to unify them. It is noted that the cosine scoring, as a special case of PLDA, also assumes speaker embeddings to be homogeneously Gaussian distributed. Therefore, if the neural speaker embeddings were distributed irregularly as previously hypothesized, both backends should exhibit performance degradation.
By unifying the cosine and the PLDA backends, it can be shown that the cosine scoring puts stricter assumptions on the embeddings than PLDA. Details of these assumptions are explained in Section 3. Among them, the dimensional independence assumption is found to play a key role in explaining the performance gap between the two backends. This is evidenced by incorporating the dimensional independence assumption into the training of PLDA, leading to the diagonal PLDA (DPLDA). This variant of PLDA shows a significant performance improvement under the domain-matched condition. However, when severe domain mismatch exists and backend adaptation is needed, PLDA performs better than both the cosine and DPLDA, because the dimensional independence assumption no longer holds. Analysis of the between-class and within-class covariances of speaker embeddings supports these statements.
2 Review of PLDA
Theoretically, PLDA is a probabilistic extension of the classical linear discriminant analysis (LDA) [balakrishnama1998linear]. It incorporates a Gaussian prior on the class centroids in LDA. Among the variants of PLDA, the two-covariance PLDA [sizov2014unifying] has been commonly used in speaker verification systems. A straightforward way to explain the two-covariance PLDA is through a probabilistic graphical model [jordan2003introduction].
2.1 Modeling
Consider $N$ speech utterances coming from $M$ speakers, where the $m$-th speaker is associated with $n_m$ utterances. With a frontend embedding extractor, each utterance can be represented by an embedding of $D$ dimensions. The embedding of the $n$-th utterance from the $m$-th speaker is denoted as $x_{m,n}$. Let $\mathcal{X} = \{x_{m,n}\}$ represent these per-utterance embeddings. Additionally, PLDA supposes the existence of per-speaker embeddings $\mathcal{Y} = \{y_m\}$. They are referred to as latent speaker identity variables in [brummer2010speaker].
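With the graphical model shown in Fig. 1, these embeddings are generated as follows:

1) Randomly draw the per-speaker embedding $y_m \sim \mathcal{N}(\mu, B^{-1})$, for $m = 1, \dots, M$;

2) Randomly draw the per-utterance embedding $x_{m,n} \sim \mathcal{N}(y_m, W^{-1})$, for $n = 1, \dots, n_m$,

where $\theta = \{\mu, B, W\}$ denotes the model parameters of PLDA. Note that $B$ and $W$ are precision matrices. The joint distribution $p_{\theta}(\mathcal{X}, \mathcal{Y})$ can be derived as
$$p_{\theta}(\mathcal{X}, \mathcal{Y}) = \prod_{m=1}^{M} \left[ \mathcal{N}(y_m; \mu, B^{-1}) \prod_{n=1}^{n_m} \mathcal{N}(x_{m,n}; y_m, W^{-1}) \right] \tag{3}$$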
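The two-step generative process above can be illustrated with a minimal NumPy sketch. It assumes, for simplicity, the same number of utterances for every speaker; the function name `sample_plda` and the toy parameter values are hypothetical and are not taken from the paper.

```python
import numpy as np

def sample_plda(num_speakers, utts_per_speaker, mu, B, W, seed=0):
    """Draw embeddings from a two-covariance PLDA model.

    B and W are precision matrices, so the between-class and
    within-class covariances are their inverses.
    """
    rng = np.random.default_rng(seed)
    dim = len(mu)
    between_cov = np.linalg.inv(B)
    within_cov = np.linalg.inv(W)
    # Step 1: one latent per-speaker embedding y_m ~ N(mu, B^{-1}).
    Y = rng.multivariate_normal(mu, between_cov, size=num_speakers)
    # Step 2: per-utterance embeddings x_{m,n} ~ N(y_m, W^{-1}).
    noise = rng.multivariate_normal(
        np.zeros(dim), within_cov, size=(num_speakers, utts_per_speaker)
    )
    X = (Y[:, None, :] + noise).reshape(-1, dim)
    # Speaker index of every row in X (speaker-major order).
    labels = np.repeat(np.arange(num_speakers), utts_per_speaker)
    return X, labels, Y

D = 4
X, labels, Y = sample_plda(
    num_speakers=50, utts_per_speaker=10,
    mu=np.zeros(D), B=0.5 * np.eye(D), W=4.0 * np.eye(D),
)
```

With these toy settings the residuals `X - Y[labels]` have a per-dimension variance close to the within-class variance 1/4, which is a quick sanity check that `W` is being treated as a precision rather than a covariance.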
2.2 Training
Estimation of PLDA model parameters can be done with the iterative EM algorithm, as described in Algorithm 1. The algorithm requires initialization of model parameters. In Kaldi [povey2011kaldi], the initialization strategy is to set $B = I$ and $W = I$.
2.3 Scoring
Assuming the embeddings are mean-subtracted and length-normalized, we let $\mu = 0$ to simplify the scoring function. Given two per-utterance embeddings $x_1, x_2$, PLDA generates a log-likelihood ratio (LLR) that measures the relative likelihood of the two embeddings coming from the same speaker. The LLR is defined as
$$S_{\mathrm{PLDA}}(x_1, x_2) = \log \frac{p(x_1, x_2 \mid \mathcal{H}_s)}{p(x_1, x_2 \mid \mathcal{H}_d)} \tag{4}$$
where $\mathcal{H}_s$ and $\mathcal{H}_d$ represent the same-speaker and different-speaker hypotheses. To derive the score function, without loss of generality, consider a set of embeddings that come from the same speaker. It can be proved that
3 Cosine as a typical PLDA
Relating Eq. 6 to Eq. 2 for the cosine similarity measure, it is noted that when $B = W = I$, the LLR of PLDA reduces to the cosine similarity, as $x_i^{\top} x_i = 1$. It is also noted that the condition of $B = W = I$ is not required. PLDA is equivalent to the cosine if and only if $B = bI$ and $W = wI$, where $b, w > 0$.
Given $B = bI$ and $W = wI$, we have
(9)  
(10) 
Without loss of generality, we let $B = W = I$. In other words, the cosine is a typical PLDA with both within-class covariance $W^{-1}$ and between-class covariance $B^{-1}$ fixed as an identity matrix.
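The relationship between the two scores can be checked numerically. The sketch below is not the paper's closed-form derivation. It evaluates the LLR directly from the Gaussian marginals implied by the two-covariance model with zero mean: under the same-speaker hypothesis the two embeddings share a latent variable, so their cross-covariance is the between-class covariance. The function names and the values of `b` and `w` are illustrative assumptions.

```python
import numpy as np

def gauss_logpdf(x, cov):
    """Log-density of a zero-mean Gaussian with covariance `cov` at x."""
    _, logdet = np.linalg.slogdet(cov)
    quad = x @ np.linalg.solve(cov, x)
    return -0.5 * (len(x) * np.log(2 * np.pi) + logdet + quad)

def plda_llr(x1, x2, B, W):
    """LLR of the two-covariance PLDA with zero mean and precisions B, W."""
    between = np.linalg.inv(B)
    total = between + np.linalg.inv(W)
    # Same speaker: x1 and x2 share y, so their cross-covariance is B^{-1}.
    joint = np.block([[total, between], [between, total]])
    same = gauss_logpdf(np.concatenate([x1, x2]), joint)
    # Different speakers: x1 and x2 are independent.
    diff = gauss_logpdf(x1, total) + gauss_logpdf(x2, total)
    return same - diff

rng = np.random.default_rng(0)
D, b, w = 8, 0.5, 3.0
E = rng.standard_normal((40, D))
E /= np.linalg.norm(E, axis=1, keepdims=True)  # length normalization
pairs = [(E[i], E[i + 1]) for i in range(0, 40, 2)]

cos = np.array([x1 @ x2 for x1, x2 in pairs])
llr = np.array([plda_llr(x1, x2, b * np.eye(D), w * np.eye(D)) for x1, x2 in pairs])

# Fit llr = slope * cos + offset and measure the deviation from the line.
slope, offset = np.polyfit(cos, llr, 1)
max_dev = np.max(np.abs(llr - (slope * cos + offset)))
```

For unit-norm embeddings and scaled-identity precisions, `max_dev` should be at the level of floating-point error, i.e. the LLR is an affine function of the cosine with a positive slope, so both scores rank trials identically. Inverting the 2×2 block covariance gives the slope as $a / (t^2 - a^2)$ with $a = 1/b$ and $t = 1/b + 1/w$.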
So far we consider only the simplest pairwise scoring. In the general case of many-vs-many scoring, the PLDA and cosine are also closely related. For example, let us consider two sets of embeddings and of size and , respectively. Their centroids are denoted by and . It can be shown that
(11)  
(12) 
under the condition of . The term depends only on and .
This shows that the cosine imposes more stringent assumptions than PLDA on the input embeddings. These assumptions are:

1) (dim-indep) Dimensions of speaker embeddings are mutually uncorrelated or independent;

2) Based on 1), all dimensions share the same variance value.

As the embeddings are assumed to be Gaussian, dimensional uncorrelatedness is equivalent to dimensional independence.
3.1 Diagonal PLDA
With Gaussian-distributed embeddings, the dim-indep assumption implies that speaker embeddings have diagonal covariance. To analyse the significance of this assumption to the performance of the SV backend, a diagonal constraint is applied to the updates of $B$ and $W$ in Algorithm 1, i.e.,
(13)  
(14) 
where $(\cdot)^{\circ 2}$ denotes the Hadamard square. The PLDA trained in this way is named the diagonal PLDA (DPLDA). The relationship between DPLDA and PLDA is similar to that between the diagonal GMM and the full-covariance GMM.
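The role of the Hadamard square in a diagonal constraint can be illustrated independently of the EM updates, which are not reproduced in this section. The sketch below is therefore not the DPLDA update itself. It only shows, on toy data, that averaging the element-wise square of zero-mean vectors yields exactly the diagonal of the full outer-product covariance and discards all cross-dimension terms. The function names are hypothetical.

```python
import numpy as np

def full_covariance(X):
    """Full covariance estimate (1/N) * sum_n x_n x_n^T for zero-mean rows."""
    return X.T @ X / len(X)

def diagonal_covariance(X):
    """Diagonal estimate built from the Hadamard (element-wise) square of each row."""
    return np.diag(np.mean(X ** 2, axis=0))

rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 5))
X[:, 1] += 0.8 * X[:, 0]  # introduce correlation between dimensions 0 and 1

full = full_covariance(X)
diag = diagonal_covariance(X)

# The diagonals agree; only the full estimate keeps the 0-1 correlation.
same_diagonal = np.allclose(np.diag(full), np.diag(diag))
```

Analogous to a diagonal GMM versus a full-covariance GMM, the diagonal estimate has D free parameters instead of D(D+1)/2, at the cost of ignoring any correlation that does exist between dimensions.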
4 Experimental setup
Experiments are carried out with the VoxCeleb1+2 [nagrani2017voxceleb] and the CN-Celeb1 [li2022cn] databases. A vanilla ResNet34 [chung2020defence] model is trained with 1029K utterances from 5994 speakers in the training set of VoxCeleb2. Following a state-of-the-art training configuration (https://github.com/TaoRuijie/ECAPA-TDNN), data augmentation with speed perturbation, reverberation and spectrum augmentation [park2019specaugment] is applied. The AAM-softmax loss [deng2019arcface] is adopted to produce angularly discriminative speaker embeddings.
The input features to ResNet34 are 80-dimensional filterbank coefficients with mean normalization over a sliding window of up to 3 seconds long. Voice activity detection is carried out with the default configuration in Kaldi (https://github.com/kaldi-asr/kaldi/blob/master/egs/voxceleb/v2/conf). The frontend module is trained to generate 256-dimensional speaker embeddings, which are subsequently mean-subtracted and length-normalized. The PLDA backend is implemented in Kaldi and modified to the DPLDA according to Eqs. 13-14.
Performance evaluation is carried out on the test sets of VoxCeleb1 and CN-Celeb1. The evaluation metrics are equal error rate (EER) and minimum decision cost function (DCF) with $P_{\mathrm{target}} = 0.01$ or $0.001$.
4.1 Performance comparison between backends
As shown in Table 1, a clear performance gap between the cosine and PLDA backends is observed on VoxCeleb. Cosine outperforms PLDA in terms of both equal error rate (1.06% versus 1.86%) and minimum decision cost function with $P_{\mathrm{target}} = 0.01$ (0.1083 versus 0.2198). The performance difference becomes much more significant with $P_{\mathrm{target}} = 0.001$, e.g., 0.3062 by PLDA versus 0.1137 by the cosine. Similar results are noted on other test sets of VoxCeleb1 (not listed here due to the page limit).
The conventional setting of using LDA to preprocess raw speaker embeddings before PLDA is also evaluated. It is labelled as LDA+PLDA in Table 1. Using LDA appears to have a negative effect on PLDA. This may be due to the absence of the dim-indep constraint on LDA. We argue that it is unnecessary to apply LDA to regularize the embeddings. The commonly used LDA preprocessing is removed in the following experiments.
Table 1: Performance of the backends on the VoxCeleb1 test set.
Backend    EER%   DCF0.01  DCF0.001
cos        1.06   0.1083   0.1137
PLDA       1.86   0.2198   0.3062
LDA+PLDA   2.17   0.2476   0.3715
DPLDA      1.11   0.1200   0.1426
The DPLDA incorporates the dim-indep constraint into PLDA training. As shown in Table 1, it improves the EER of PLDA from 1.86% to 1.11%, which is comparable to cosine. This clearly confirms the importance of dim-indep.
4.2 Performance degradation in iterative PLDA training
According to the derivation in Section 3, PLDA implemented in Algorithm 1 is initialized as the cosine, i.e., $B = W = I$. However, the PLDA has been shown to be inferior to the cosine by the results in Table 1. Logically, it would be expected that the performance of PLDA degrades during the iterative EM training. Fig. 2 shows the plot of EERs versus the number of training iterations. Initially PLDA achieves exactly the same performance as cosine. In the first iteration, the EER increases sharply from 1.06% to 1.707%. For DPLDA, the dim-indep constraint counteracts this degradation.
4.3 When domain mismatch exists
The superiority of cosine over PLDA has been evidenced on the VoxCeleb dataset, in which both training and test data come from the same domain, e.g., interviews collected from YouTube. In many real-world scenarios, domain mismatch between training and test data commonly exists. A practical solution is to acquire a certain amount of in-domain data and update the backend accordingly. The following experiment analyses the effect of domain mismatch on the performance of backend models.
The CN-Celeb1 dataset is adopted as the domain-mismatched data. It is a multi-genre dataset of Chinese speech with very different acoustic conditions from VoxCeleb. The ResNet34 trained on VoxCeleb is deployed to extract embeddings from the utterances in CN-Celeb1. The backends are trained and evaluated on the training and test embeddings of CN-Celeb1.
As shown in Table 2, the performance of both cosine and DPLDA is inferior to that of PLDA. Because the dim-indep assumption no longer holds, the diagonal constraint on covariance does not bring any performance improvement to cosine and DPLDA.
Table 2: Performance of the backends on the CN-Celeb1 test set.
Backend    EER%   DCF0.01  DCF0.001
cos        10.11  0.5308   0.7175
PLDA       8.90   0.4773   0.6331
DPLDA      10.24  0.5491   0.8277
4.4 Analysis of between/withinclass covariances
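To analyze the correlation of individual dimensions of the embeddings, the between-class and within-class covariances, $S_B$ and $S_W$, are computed as follows,
$$S_B = \frac{1}{N} \sum_{m=1}^{M} n_m (\bar{y}_m - \mu)(\bar{y}_m - \mu)^{\top} \tag{15}$$
$$S_W = \frac{1}{N} \sum_{m=1}^{M} \sum_{n=1}^{n_m} (x_{m,n} - \bar{y}_m)(x_{m,n} - \bar{y}_m)^{\top} \tag{16}$$
where $\bar{y}_m = \frac{1}{n_m} \sum_{n=1}^{n_m} x_{m,n}$ and $\mu = \frac{1}{N} \sum_{m=1}^{M} \sum_{n=1}^{n_m} x_{m,n}$. These are the training equations of LDA and are closely related to the M-step of PLDA. Note that for visualization, the elements in $S_B$ and $S_W$ are converted into their absolute values.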
In Fig. 3, both between-class and within-class covariances show clearly diagonal patterns in the domain-matched case (plots on the top). This provides additional evidence to support the aforementioned dim-indep assumption. However, this assumption is broken with the strongly domain-mismatched data in CN-Celeb. As shown by the two subplots at the bottom of Fig. 3, even though the within-class covariance (plot on the right) shows a clear diagonal pattern, the pattern tends to vanish for the between-class covariance (plot on the left). Off-diagonal elements have large absolute values and a pattern of correlation between dimensions appears, suggesting that dim-indep is violated. The numerical measure of diagonal index also confirms this observation.
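The covariance analysis can be reproduced in outline with the sketch below. It uses the standard LDA form of the between-class and within-class covariances, normalized by the total number of utterances, which is an assumption about the exact normalization. The diagonal index is not defined in this section, so `diagonal_ratio` (the share of absolute mass lying on the diagonal) is a hypothetical stand-in rather than the measure used in the experiments. The data are synthetic.

```python
import numpy as np

def class_covariances(X, labels):
    """Between- and within-class covariances in the standard LDA form."""
    N, D = X.shape
    mu = X.mean(axis=0)
    S_b = np.zeros((D, D))
    S_w = np.zeros((D, D))
    for spk in np.unique(labels):
        Xs = X[labels == spk]
        centroid = Xs.mean(axis=0)
        d = (centroid - mu)[:, None]
        S_b += len(Xs) * (d @ d.T)   # weighted scatter of the centroid
        R = Xs - centroid
        S_w += R.T @ R               # scatter around the centroid
    return S_b / N, S_w / N

def diagonal_ratio(S):
    """Hypothetical diagonality measure: share of absolute mass on the diagonal."""
    A = np.abs(S)
    return np.trace(A) / A.sum()

# Synthetic embeddings: 100 speakers, 20 utterances each, 6 dimensions.
rng = np.random.default_rng(0)
labels = np.repeat(np.arange(100), 20)
centroids = rng.standard_normal((100, 6))
X = centroids[labels] + 0.3 * rng.standard_normal((2000, 6))

S_b, S_w = class_covariances(X, labels)
ratio_b, ratio_w = diagonal_ratio(S_b), diagonal_ratio(S_w)
```

A value of `diagonal_ratio` close to 1 corresponds to the clear diagonal pattern described for the domain-matched case, while a lower value indicates substantial off-diagonal mass. As a consistency check, `S_b + S_w` equals the total (biased) covariance of `X`.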
5 Conclusion
The reason why PLDA appears to be inferior to the cosine scoring with neural speaker embeddings has been explained with both theoretical and experimental evidence. It has been shown that the cosine scoring is essentially a special case of PLDA. Hence, the non-Gaussian distribution of speaker embeddings should not be held responsible for the performance difference between the PLDA and cosine backends. Instead, the difference should be attributed to the dimensional independence assumption made by the cosine, as evidenced in our experimental results and analysis. Nevertheless, this assumption fits well only in the domain-matched condition. When severe domain mismatch exists, the assumption no longer holds and PLDA can work better than the cosine. Further improvements on PLDA need to take this assumption into consideration. It is worth noting that the AAM-softmax loss appears to have the benefit of regularizing embeddings to be homogeneously Gaussian, considering the good performance of the cosine scoring.