Automatic speaker verification (ASV) is the task of verifying whether a given speech utterance was spoken by a specific speaker. The i-vector embedding [dehak11] followed by probabilistic linear discriminant analysis (PLDA) [prince07, ioffe06] was dominant in ASV for a long time, until ASV started to benefit from deep learning in recent years. Deep neural networks (DNNs) have been investigated as replacements for individual components of the ASV pipeline, including front-end feature extraction [Snyder18, Tang19] and back-end modeling [Chien17], as well as for the entire pipeline in an end-to-end manner [Li17, Rohdin20]. Among these, using DNNs to extract discriminative speaker embeddings has been shown to be the most viable and effective. Therefore, recent works in ASV have focused on building network architectures that produce embedding vectors with improved speaker representations [Snyder18, Desplanques20, lee21, Liu22].
A DNN for extracting an utterance-level speaker embedding consists of three modules: (1) a frame-level feature encoder, (2) a pooling layer, and (3) utterance-level representation layers. The input to the first module is a sequence of acoustic features, e.g., mel-frequency cepstral coefficients (MFCCs) or filter-bank coefficients. Operating on relatively short-term acoustic context, this module outputs frame-level intermediate representations that carry more comprehensive speaker information. Various neural network architectures have been used as the encoder, e.g., the time-delay neural network (TDNN) [Snyder18], convolutional neural network (CNN) [Nagrani17], long short-term memory (LSTM) network [Bhattacharya17], a combination of LSTM and TDNN [Tang19], or gated recurrent unit (GRU) [Li17]. The second module converts the variable-length sequence of frame-level features into a single fixed-dimensional vector by temporal pooling. In addition to the most basic average statistics pooling, attention mechanisms [Zhu18, Okabe18, zhu21c_interspeech] are commonly used to form weighted statistics that focus on the essential frames and are in turn more speaker discriminative. The third module stacks several fully-connected layers, including one bottleneck layer from which the fixed-dimensional utterance-level speaker embedding is extracted in the testing phase. During training, the output nodes correspond to the set of speaker IDs in the training data. A softmax function is commonly used to constrain the predicted outputs so that they sum to one, and a cross-entropy (CE) loss is used as the training objective.
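The pooling step described above can be illustrated with a minimal NumPy sketch. It is not the implementation of any cited system: the function names, the single-head attention form `v^T tanh(W h_t + b)` and all shapes are assumptions made for illustration only.

```python
import numpy as np

def statistics_pooling(frames, weights=None):
    """Pool a (T, D) matrix of frame-level features into a (2*D,) vector.

    With weights=None this is plain mean/std pooling. Otherwise `weights`
    (non-negative, summing to one) yields weighted statistics.
    """
    frames = np.asarray(frames, dtype=float)
    n_frames = frames.shape[0]
    if weights is None:
        weights = np.full(n_frames, 1.0 / n_frames)
    w = np.asarray(weights, dtype=float)[:, None]
    mean = (w * frames).sum(axis=0)
    var = (w * frames ** 2).sum(axis=0) - mean ** 2
    std = np.sqrt(np.clip(var, 1e-12, None))  # guard against tiny negatives
    return np.concatenate([mean, std])

def attention_weights(frames, v, W, b):
    """Hypothetical single-head attention: softmax over v^T tanh(W h_t + b)."""
    scores = np.tanh(frames @ W.T + b) @ v
    scores = scores - scores.max()  # numerical stability
    w = np.exp(scores)
    return w / w.sum()
```

Utterances of different lengths map to vectors of the same size, which is the property the subsequent fully-connected layers rely on.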
Good speaker embeddings should be discriminative between different speakers and compact within the same speaker. Embeddings learned using the conventional softmax CE loss, however, are optimized only for inter-speaker discrepancy. To address this issue, margin penalties have been introduced, leading to the so-called large-margin softmax CE loss [fwang18, hwang18, deng18], which simultaneously enhances the intra-class compactness and inter-class discrepancy. In this paper, we refer to the embeddings extracted from networks trained with margin penalties as the large-margin embeddings.
The emergence of large-margin embeddings has triggered a gradual shift from parametric back-ends, such as the PLDA, to a simpler cosine similarity measure [Liu19, Zhou20]. One possible reason is that a PLDA model decomposes the total variability into within- and between-speaker covariance matrices [prince07, ioffe06], and the intra-speaker compactness of the large-margin embeddings makes the within-speaker variability modeling no longer essential. However, to the best of our knowledge, this explanation has not been analyzed experimentally. The goal of this paper is three-fold: (1) to study the properties of large-margin embeddings with respect to their predecessors, and to find (2) the scoring back-ends and (3) the pre-processing techniques best suited for large-margin embeddings.
The paper is organized as follows. Section 2 reviews the large-margin softmax CE loss, as well as cosine similarity and PLDA back-ends. Section 3 introduces our investigations and motivations. Section 4 shows the experimental setup and results. Section 5 provides a summary of our work.
2 Large-Margin Embeddings for ASV
2.1 Softmax and Large-Margin Softmax
2.1.1 Softmax Cross-Entropy Loss
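The softmax function is often used as an activation function to calculate the relative probabilities of the target classes in multi-way classification tasks. The cross-entropy (CE) loss is calculated as

$$L_{\mathrm{S}} = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{e^{\mathbf{W}_{y_i}^{\top}\mathbf{x}_i + b_{y_i}}}{\sum_{j=1}^{C} e^{\mathbf{W}_{j}^{\top}\mathbf{x}_i + b_{j}}} \tag{1}$$

where $N$ is the batch size, $C$ is the number of speakers in the training set, and $\mathbf{x}_i$ is the embedding representation of the $i$-th utterance, belonging to the $y_i$-th class. The vector $\mathbf{W}_j$ denotes the $j$-th column of the weight matrix $\mathbf{W}$, while $b_j$ is the corresponding bias term. The softmax function constrains the probabilities over all the classes to sum to 1, which helps training converge more quickly than it otherwise would. The exponent in the numerator of (1) is equal to $\|\mathbf{W}_{y_i}\|\,\|\mathbf{x}_i\|\cos\theta_{y_i} + b_{y_i}$, with $\theta_{y_i}$ the angle between the vectors $\mathbf{W}_{y_i}$ and $\mathbf{x}_i$. A modified softmax CE loss [ranjan17, liu17] further normalizes each weight vector so that $\|\mathbf{W}_j\| = 1$, normalizes the embedding vector $\mathbf{x}_i$ and re-scales it to $s$, and discards the bias term:

$$L_{\mathrm{M}} = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{e^{s\cos\theta_{y_i}}}{e^{s\cos\theta_{y_i}} + \sum_{j=1, j\neq y_i}^{C} e^{s\cos\theta_{j}}} \tag{2}$$

The modification enables the network to directly optimize angles and learn angularly distributed features, but not necessarily more discriminative ones [liu17].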
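As a rough numerical illustration of the modified softmax CE loss, the sketch below normalizes both the weight columns and the embeddings, scales the resulting cosines and drops the bias. The function name, the array shapes and the default scale `s=30.0` are assumptions chosen for illustration, not values taken from the paper.

```python
import numpy as np

def normalized_softmax_ce(X, W, y, s=30.0):
    """Modified softmax CE on cosine logits.

    X: (N, D) embeddings, W: (D, C) weight matrix, y: (N,) integer labels.
    `s` is the scale applied after normalization (illustrative default).
    """
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)  # unit-length embeddings
    Wn = W / np.linalg.norm(W, axis=0, keepdims=True)  # unit-length weight columns
    logits = s * (Xn @ Wn)                             # s * cos(theta_j), no bias
    logits = logits - logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_prob[np.arange(len(y)), y].mean()
```

Because of the two normalizations, the loss depends only on the angles: rescaling `X` or `W` leaves it unchanged.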
2.1.2 Large-Margin Softmax Cross Entropy Loss
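Since angles are used as the distance metric in (2), various techniques were introduced to incorporate margin penalties in order to enhance the speaker-discriminative power. They can be summarized with an angular function [deng18]

$$\psi(\theta_{y_i}) = \cos(m_1\theta_{y_i} + m_2) - m_3 \tag{3}$$

where $m_1$, $m_2$ and $m_3$ are the three margin penalties. The large-margin softmax cross-entropy (CE) loss is therefore

$$L_{\mathrm{LM}} = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{e^{s\,\psi(\theta_{y_i})}}{e^{s\,\psi(\theta_{y_i})} + \sum_{j=1, j\neq y_i}^{C} e^{s\cos\theta_{j}}} \tag{4}$$

with the scale $s$ and the angles $\theta_j$ as in (2). The margins $m_1$, $m_2$ and $m_3$ can be used simultaneously [deng18] or individually [liu17, fwang18, hwang18, deng18]. In the latter case, (4) is referred to, respectively, as the angular softmax (A-Softmax) [liu17], with $m_2 = m_3 = 0$,

$$\psi(\theta_{y_i}) = \cos(m_1\theta_{y_i}) \tag{5}$$

the additive angular margin softmax (AAM-Softmax or ArcFace) [deng18], with $m_1 = 1$ and $m_3 = 0$,

$$\psi(\theta_{y_i}) = \cos(\theta_{y_i} + m_2) \tag{6}$$

and the additive margin softmax (AM-Softmax) [fwang18], with $m_1 = 1$ and $m_2 = 0$,

$$\psi(\theta_{y_i}) = \cos\theta_{y_i} - m_3 \tag{7}$$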
The margin penalties enforce intra-class compactness and inter-class discrepancy. This corresponds to a reduced within-speaker variability and a larger between-speaker variability in speaker recognition terminology. We refer to this class of representation as large-margin embeddings in this paper.
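The effect of the three margins can be sketched in a few lines of NumPy, assuming the combined angular function cos(m1·θ + m2) − m3 of [deng18] applied to the target-class logit only. The function names and default values are hypothetical, and the sketch ignores the non-monotonic region m1·θ + m2 > π that practical implementations handle explicitly.

```python
import numpy as np

def margin_logits(cos_theta, y, s=30.0, m1=1.0, m2=0.0, m3=0.0):
    """Scaled logits with the margin applied to the target class only.

    cos_theta: (N, C) cosines between embeddings and class weights,
    y: (N,) integer labels. Defaults (no margin) are illustrative.
    """
    cos_theta = np.clip(cos_theta, -1.0, 1.0)
    rows = np.arange(len(y))
    theta_y = np.arccos(cos_theta[rows, y])
    out = cos_theta.copy()
    out[rows, y] = np.cos(m1 * theta_y + m2) - m3  # psi(theta_y)
    return s * out

def large_margin_ce(cos_theta, y, **margins):
    """Cross-entropy computed on the margin-penalized logits."""
    z = margin_logits(cos_theta, y, **margins)
    z = z - z.max(axis=1, keepdims=True)
    log_prob = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -log_prob[np.arange(len(y)), y].mean()
```

For the same set of angles, a positive additive margin lowers the target logit and therefore raises the loss, so the network must pull same-speaker embeddings closer to their class weight to compensate.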
2.2 Speaker Verification
Speaker verification can be accomplished by calculating the similarity between the two speaker embeddings corresponding to an enrollment and a test utterance. To this end, a simple cosine similarity measure can be used. Alternatively, a more sophisticated scoring back-end, such as probabilistic linear discriminant analysis (PLDA), can be trained.
2.2.1 Cosine Similarity
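Cosine similarity scoring is a computationally efficient method used in many verification tasks. When it is applied to speaker verification, the cosine of the angle between the enrollment ($\mathbf{x}_e$) and test ($\mathbf{x}_t$) embeddings is used as the decision score

$$S_{\cos}(\mathbf{x}_e, \mathbf{x}_t) = \frac{\mathbf{x}_e^{\top}\mathbf{x}_t}{\|\mathbf{x}_e\|\,\|\mathbf{x}_t\|} \tag{8}$$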
This technique has an advantage that no training is required. Scoring is performed directly in the speaker embedding space.
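Cosine scoring is simple enough to sketch directly. The function names and the decision threshold below are hypothetical; in practice the threshold would be tuned on held-out trials.

```python
import numpy as np

def cosine_score(enroll, test):
    """Cosine of the angle between an enrollment and a test embedding."""
    enroll = np.asarray(enroll, dtype=float)
    test = np.asarray(test, dtype=float)
    return float(enroll @ test / (np.linalg.norm(enroll) * np.linalg.norm(test)))

def verify(enroll, test, threshold):
    """Accept the trial if the score reaches the (assumed) threshold."""
    return cosine_score(enroll, test) >= threshold
```

The score is invariant to the length of either embedding, so no trained parameters are involved.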
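2.2.2 PLDA
As opposed to the cosine similarity measure, PLDA is a supervised method that requires speaker labels for training. There are multiple PLDA variants [prince07, ioffe06, Garcia-Romero11, Brummer10]. Here we focus on the formulation reported in [prince07], which is widely used in speaker recognition [kenny10, Lee19].
Let $\mathbf{x}$ be an embedding vector, which we assume follows a Gaussian distribution [bishop06, prince07, ioffe06]:

$$p(\mathbf{x} \mid \mathbf{h}, \mathbf{w}) = \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu} + \mathbf{F}\mathbf{h} + \mathbf{G}\mathbf{w},\ \boldsymbol{\Sigma}) \tag{9}$$

where $\boldsymbol{\mu}$ is the global mean. The matrices $\mathbf{F}$ and $\mathbf{G}$ are, respectively, the speaker and channel loading matrices, and $\boldsymbol{\Sigma}$ models the residual variances and is constrained to be a diagonal matrix. The vectors $\mathbf{h}$ and $\mathbf{w}$ are the latent speaker and channel variables, respectively, each with a standard normal prior. Integrating out the latent variables, we arrive at the following marginal density

$$p(\mathbf{x}) = \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu},\ \boldsymbol{\Phi}_{b} + \boldsymbol{\Phi}_{w}) \tag{10}$$

where $\{\boldsymbol{\Phi}_{b}, \boldsymbol{\Phi}_{w}\}$ are the between- and within-speaker covariance matrices given by

$$\boldsymbol{\Phi}_{b} = \mathbf{F}\mathbf{F}^{\top}, \qquad \boldsymbol{\Phi}_{w} = \mathbf{G}\mathbf{G}^{\top} + \boldsymbol{\Sigma} \tag{11}$$

In the testing phase, the log-likelihood ratio score between the enrollment ($\mathbf{x}_e$) and test ($\mathbf{x}_t$) embeddings is calculated as

$$S_{\mathrm{PLDA}}(\mathbf{x}_e, \mathbf{x}_t) = \log\frac{p(\mathbf{x}_e, \mathbf{x}_t)}{p(\mathbf{x}_e)\,p(\mathbf{x}_t)} \tag{12}$$

Here, the joint likelihood in the numerator can be computed as

$$p(\mathbf{x}_e, \mathbf{x}_t) = \mathcal{N}\left(\begin{bmatrix}\mathbf{x}_e \\ \mathbf{x}_t\end{bmatrix} \,\middle|\, \begin{bmatrix}\boldsymbol{\mu} \\ \boldsymbol{\mu}\end{bmatrix},\ \begin{bmatrix}\boldsymbol{\Phi}_{b} + \boldsymbol{\Phi}_{w} & \boldsymbol{\Phi}_{b} \\ \boldsymbol{\Phi}_{b} & \boldsymbol{\Phi}_{b} + \boldsymbol{\Phi}_{w}\end{bmatrix}\right) \tag{13}$$

while the likelihoods $p(\mathbf{x}_e)$ and $p(\mathbf{x}_t)$ in the denominator are evaluated using (10). It is evident that PLDA scoring involves the explicit use of the between- and within-speaker covariance matrices, which are absent in cosine scoring.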
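The log-likelihood ratio can be sketched numerically by evaluating the Gaussian densities directly, assuming the between-speaker covariance `phi_b`, the within-speaker covariance `phi_w` and the global mean `mu` have already been estimated. All names are hypothetical, and the brute-force evaluation is for illustration only; practical systems use an equivalent closed form that avoids building the stacked covariance matrix.

```python
import numpy as np

def gaussian_logpdf(x, mean, cov):
    """Log density of a multivariate Gaussian at x."""
    d = x - mean
    _, logdet = np.linalg.slogdet(cov)
    maha = d @ np.linalg.solve(cov, d)
    return -0.5 * (len(x) * np.log(2.0 * np.pi) + logdet + maha)

def plda_llr(x_e, x_t, mu, phi_b, phi_w):
    """log p(x_e, x_t | same speaker) - log p(x_e) - log p(x_t)."""
    total = phi_b + phi_w
    # Same-speaker hypothesis: the two embeddings share the speaker factor,
    # so their cross-covariance is the between-speaker covariance.
    joint_cov = np.block([[total, phi_b], [phi_b, total]])
    joint = gaussian_logpdf(np.concatenate([x_e, x_t]),
                            np.concatenate([mu, mu]), joint_cov)
    return joint - gaussian_logpdf(x_e, mu, total) - gaussian_logpdf(x_t, mu, total)
```

When `phi_b` is zero the two embeddings are independent under both hypotheses and the score collapses to zero, which makes explicit that the score is driven by the balance between the two covariance matrices.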
3 Covariance Modeling for Large-Margin Embeddings
PLDA [prince07, ioffe06] was originally introduced in ASV to work with the i-vector framework [dehak11, kenny10, matejka11]. Although the i-vector front-end has been replaced with more effective deep speaker embeddings, PLDA continues to be a promising back-end [Villalba19, lee20].
We study empirically the between- and within-speaker covariances of the conventional x-vector embeddings [Snyder18] and of the large-margin embeddings from an ECAPA-TDNN [Desplanques20]. The plots in Fig. 1 (a) and (b) show that the within-speaker covariance of the conventional x-vector embeddings is larger than the between-speaker covariance in most of the dimensions, regardless of whether length normalization (LN) is applied. In contrast, the between-speaker covariance is larger than the within-speaker covariance for the large-margin embeddings in all the dimensions, again regardless of LN, as shown in Fig. 1 (c) and (d). This indicates that the large-margin softmax CE loss effectively reduces the intra-speaker variability (i.e., enhances intra-speaker compactness) in the embedding space. This motivates us to constrain PLDA models to match the reduced within-speaker variability of large-margin embeddings. In our implementation, we constrain the within-speaker covariance to be a diagonal matrix in each iteration of the expectation-maximization (EM) [em] steps in PLDA training. For the linear discriminant analysis (LDA) pre-processing technique, we also use a constrained variant, which keeps only the diagonal elements of the within-speaker covariance matrix estimated from the data when computing the LDA transformation matrix. In this paper, these variants are referred to as PLDA-diag and LDA-diag, respectively.
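One way the diagonal constraint could be realized is sketched below for the LDA case, assuming scatter matrices estimated from labeled embeddings. The function names are hypothetical and the paper's EM recipe for PLDA is not reproduced here; the same `np.diag(np.diag(...))` operation would be applied to the within-speaker covariance after each M-step.

```python
import numpy as np

def scatter_matrices(X, labels):
    """Within- and between-speaker covariance of (N, D) embeddings."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    mu = X.mean(axis=0)
    dim = X.shape[1]
    S_w = np.zeros((dim, dim))
    S_b = np.zeros((dim, dim))
    for spk in np.unique(labels):
        Xs = X[labels == spk]
        ms = Xs.mean(axis=0)
        S_w += (Xs - ms).T @ (Xs - ms)
        S_b += len(Xs) * np.outer(ms - mu, ms - mu)
    return S_w / len(X), S_b / len(X)

def lda_transform(S_w, S_b, n_dims, diag_within=False):
    """LDA projection (D, n_dims); diag_within=True gives the 'diag' variant."""
    if diag_within:
        S_w = np.diag(np.diag(S_w))          # discard off-diagonal elements
    # Solve S_b v = lambda S_w v by whitening S_w with its Cholesky factor.
    L_inv = np.linalg.inv(np.linalg.cholesky(S_w))
    vals, vecs = np.linalg.eigh(L_inv @ S_b @ L_inv.T)
    order = np.argsort(vals)[::-1][:n_dims]  # largest eigenvalues first
    return L_inv.T @ vecs[:, order]
```

Embeddings would then be projected as `X @ A`, where `A` is the returned matrix. With a diagonal within-speaker covariance the whitening step reduces to a per-dimension rescaling.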
Fig. 2 shows the t-SNE visualizations of the conventional and large-margin embeddings. Comparing the scatter plots in Fig. 2 (a) and (b) clearly shows that the individual classes are more compact with the large-margin embeddings than with the conventional x-vector embeddings. In addition, the between-class distances are more uniform across classes with large-margin embeddings, as shown in Fig. 2 (b). This is consistent with Fig. 1, where the between-speaker covariance of the large-margin embeddings is distributed more evenly across all of the dimensions, while in the conventional embeddings, high covariance values concentrate in certain dimensions only.
4 Experiments
4.1 Experimental settings
In order to verify the effectiveness of the back-end techniques, the experiments are conducted on both the VoxCeleb1 [Nagrani17] and the Speakers in the Wild (SITW) core-core [McLaren16] test sets. For VoxCeleb1, we use the original test set Vox1-o and the hard test set Vox1-h. All of our front-ends and parametric back-ends are trained on the VoxCeleb2 dataset [Chung18]. Approximately of the train set is reserved for validation. There are no overlapping speakers between our training and evaluation sets. We employ augmentation techniques to diversify the training data for the embedding networks, including random dropping of audio chunks and frequency bands [Park19], speed perturbation [Ko15], and environmental corruption with a collection of room impulse responses (RIRs) and noises [Ko17]. For the parametric back-end training, a subset of VoxCeleb2 consisting of 300k utterances from 5,985 speakers is used without augmentation, considering that the training and testing data are in similar conditions.
We study several systems built on state-of-the-art TDNN, ECAPA-TDNN and MFA-TDNN backbones with softmax, AAM-Softmax and AM-Softmax cross-entropy (CE) losses for comparison [lee21, Liu22, Desplanques20, Snyder18]. The pooling options are average and attentive statistics pooling, and posterior inference pooling [lee21]. The details of the combinations are shown in Table 1. We use the SpeechBrain open-source toolkit [sb21] to implement all the front-ends and extract speaker embeddings. The input to the neural networks consists of 80-dimensional filterbank features.
We evaluate three scoring methods: cosine similarity, PLDA and PLDA-diag, and also the effect of length normalization (LN) [Garcia-Romero11] and LDA as pre-processing steps for PLDA, as well as LDA-diag. The dimensions of LDA and LDA-diag are set to 150. Results are reported in terms of equal error rate (EER) and the minimum normalized detection cost function (MinDCF) at and .
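For reference, the two metrics can be computed from lists of target and non-target trial scores roughly as follows. This is a simple threshold sweep, not the scoring tool used in the experiments; the function names are hypothetical, and `p_target`, `c_miss` and `c_fa` are left as caller-supplied operating-point parameters.

```python
import numpy as np

def error_rates(target_scores, nontarget_scores):
    """False-rejection and false-acceptance rates at every score threshold."""
    tar = np.sort(np.asarray(target_scores, dtype=float))
    non = np.sort(np.asarray(nontarget_scores, dtype=float))
    thresholds = np.sort(np.concatenate([tar, non]))
    # A trial is accepted when its score is >= the threshold.
    fr = np.searchsorted(tar, thresholds, side="left") / len(tar)
    fa = 1.0 - np.searchsorted(non, thresholds, side="left") / len(non)
    return thresholds, fr, fa

def eer(target_scores, nontarget_scores):
    """Equal error rate: the point where the two error rates meet."""
    _, fr, fa = error_rates(target_scores, nontarget_scores)
    idx = np.argmin(np.abs(fr - fa))
    return 0.5 * (fr[idx] + fa[idx])

def min_dcf(target_scores, nontarget_scores, p_target, c_miss=1.0, c_fa=1.0):
    """Minimum normalized detection cost over all thresholds."""
    _, fr, fa = error_rates(target_scores, nontarget_scores)
    fr = np.append(fr, 1.0)  # threshold above every score: reject all trials
    fa = np.append(fa, 0.0)
    dcf = c_miss * p_target * fr + c_fa * (1.0 - p_target) * fa
    return dcf.min() / min(c_miss * p_target, c_fa * (1.0 - p_target))
```

The normalization divides by the cost of the best trivial system (always accept or always reject), so a value of 1.0 means the scores carry no usable information at that operating point.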
4.2 Results and analysis
We first investigate the intra-speaker compactness in the conventional softmax embeddings (S6-S7 in Table 1) and the large-margin embeddings (S1-S5), respectively. Only LN is used as the pre-processing step before scoring. As shown in Table 1, for both S6 and S7, PLDA outperforms the cosine similarity measure, while for the five systems (S1-S5) with different types of large-margin softmax CE losses, the cosine similarity measure achieves better performance than PLDA. These observations are consistent on all three evaluation sets. This indicates that the within-speaker variability in the conventional softmax embeddings is effectively reduced by channel compensation in PLDA, while the channel compensation is no longer essential for large-margin embeddings and even deteriorates the ASV performance. Figure 1 depicts the difference in the covariance plots between different embeddings. Both the results in Table 1 and the covariance plots show that the large-margin softmax CE loss effectively reduces the intra-speaker variability in the embeddings. Comparing the front-ends, the large-margin embeddings (S1-S5) achieve much better performance than the conventional embeddings (S6, S7), which also confirms the effectiveness of large-margin softmax in learning speaker-discriminative embeddings.
Next, we investigate the effectiveness of using a diagonal within-class covariance matrix (denoted as PLDA-diag) in Table 1. Using the diagonalized within-speaker covariance in the PLDA model on the large-margin embeddings (S1-S5) reduces EER and minDCF on average by and , respectively, compared with the conventional PLDA with a full within-class covariance matrix. Additionally, it outperforms cosine similarity consistently, reducing EER and minDCF on average by and , respectively. For conventional embeddings (S6, S7), on the contrary, PLDA-diag degrades both EER and minDCF compared with the conventional PLDA.
Taking ECAPA-TDNN as an example front-end for large-margin embeddings (S1 vs. S2), we further investigate the effect of the embedding dimension on ASV performance. Cosine similarity gives similar performance for the two embedding dimensions, while the degradation caused by the conventional PLDA in the 512-d embedding system S2 is almost double that in the 192-d embedding system S1. If we use PLDA-diag instead of PLDA, the performance improves and both systems perform similarly again.
Next we investigate the feasibility of the pre-processing techniques of PLDA on the large-margin embeddings. Since Vox1-h shows the same trend in the performance as Vox1-o, we exclude it considering the page limit. Figure 3 shows the effect of length normalization (LN) on the large-margin embeddings (S1) and the conventional embeddings (S6) with both PLDA and PLDA-diag back-ends on the Vox1-o and SITW core-core test sets. We observe that applying LN reduces both EER and minDCF in almost all systems. The performance improvement in EER is larger than that in minDCF. Therefore, we conclude that LN is still effective for large-margin speaker embeddings. We also note that with or without LN, PLDA-diag outperforms PLDA significantly. We have validated all the large-margin embeddings in Table 1 and obtained the same results.
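As a reminder of what the LN step does to an embedding, a minimal sketch is given below. It assumes LN amounts to optional centering followed by scaling each embedding to unit Euclidean norm; any whitening step that often accompanies LN is omitted, and the function name is hypothetical.

```python
import numpy as np

def length_normalize(X, mean=None):
    """Project (N, D) embeddings onto the unit sphere.

    If `mean` is given (e.g. estimated on the back-end training data),
    the embeddings are centered first. Whitening is not included.
    """
    X = np.atleast_2d(np.asarray(X, dtype=float))
    if mean is not None:
        X = X - mean
    return X / np.linalg.norm(X, axis=1, keepdims=True)
```

Without centering, LN leaves cosine scores unchanged, so its effect is only visible in back-ends such as PLDA that are sensitive to embedding length.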
Figure 4 shows the effect of the LDA pre-processing technique on the same front-ends and back-ends. For the large-margin embeddings (S1), the conventional LDA does not help in conventional PLDA systems, and drastically increases errors when applied to PLDA-diag systems. Applying LDA-diag to the PLDA systems improves the performance, but much less than using PLDA-diag directly. Applying it to the PLDA-diag system degrades the performance slightly. We conclude that for large-margin embeddings, removing the off-diagonal elements of the within-speaker covariance matrix in either LDA or PLDA improves speaker modeling, and that using only PLDA-diag without LDA is sufficient to achieve good performance. For the conventional embeddings (S6), applying either LDA or LDA-diag does not greatly affect the performance. LDA helps when there is a slight mismatch between the SITW test set and the model training set.
5 Conclusion
This paper, for the first time, experimentally investigated the reasons for the shift from parametric back-ends to a simpler cosine similarity measure for the scoring of large-margin speaker embeddings in speaker verification. Our experiments on the state-of-the-art ECAPA-TDNN networks with AAM-Softmax and AM-Softmax cross-entropy losses on the VoxCeleb1 and SITW core-core test sets showed a substantial increase in intra-speaker compactness, which makes the conventional PLDA superfluous, while cosine similarity scoring appears to be sufficient. We found that simply discarding the off-diagonal elements of the within-speaker covariance matrix of the PLDA model improved the performance significantly with an average of EER reduction and minDCF reduction. It also outperformed cosine scoring consistently with reductions in EER and minDCF by and , respectively. In addition, this paper revisited the pre-processing techniques that have been widely used in ASV back-ends in the past, and assessed their effects. In the future, we will extend the evaluation to mismatched domains.
This project is supported by the Agency for Science, Technology and Research (A*STAR), Singapore (Project No. CR-2021-005).