Large Margin Softmax Loss for Speaker Verification

In neural network based speaker verification, the speaker embedding is expected to be discriminative between speakers while the intra-speaker distance should remain small. A variety of loss functions have been proposed to achieve this goal. In this paper, we investigate the large margin softmax loss with different configurations in speaker verification. Ring loss and the minimum hyperspherical energy criterion are introduced to further improve the performance. Results on VoxCeleb show that our best system outperforms the baseline approach by 15% in EER, and by 13% and 33% in minDCF08 and minDCF10, respectively.

1 Introduction

Recently, speaker embeddings extracted from deep neural networks have outperformed the conventional i-vector in many speaker verification tasks [1]. By virtue of this excellent performance, speaker embedding is becoming the next generation of speaker recognition technology. Similar to the i-vector, a speaker embedding encodes a variable-length utterance into a fixed-length vector representing the speaker characteristics. A variety of backend classifiers can be applied to suppress noise and session variability. Speaker embedding can also be used in other applications such as speaker diarization [2], speaker retrieval and speech synthesis.

Speaker verification is an open-set recognition problem. An utterance is accepted as belonging to a claimed speaker if their similarity exceeds a threshold. An ideal speaker embedding should be discriminative between different speakers and compact within the same speaker. Although cross-entropy with softmax is arguably one of the most commonly used loss functions for training the speaker embedding network, it is more suitable for classification and does not explicitly encourage discriminative learning of features.

To address this issue, different loss functions have been proposed. Triplet loss for speaker verification was first presented in [3, 4]. By selecting appropriate training samples, triplet loss performed well in both text-dependent and text-independent tasks. However, the performance is sensitive to the triplet mining strategy [5, 6] and it is time-consuming to design such a training procedure. In addition, speaker identity subspace loss [7], Gaussian Mixture loss [8], etc., have been proposed in other works.

On the other hand, efforts have been made to improve the original softmax loss. Center loss was introduced in [9, 10] to constrain features to gather around the corresponding centers and thus reduce the intra-speaker variability. Both triplet loss and center loss are optimized in the Euclidean space. In the last few years, angular-based losses have become popular. Compared with the Euclidean distance, angular distance is a more natural choice in the feature space. In [11, 12], the features and the weights of the output layer were normalized before softmax, making the loss function focus on the cosine similarity. Generalized end-to-end loss was proposed in [13]. The scaled cosine scores between the features and the estimated speaker centers were used as the logits to compute the loss. Angular margin softmax loss was first presented in [14, 15], in which the margin is incorporated with the angle in a multiplicative way. This method is extended in [16, 17, 18], where additive margins are used. Some of these losses have been applied in speaker verification [19, 20, 21, 22, 23]. Since all these losses combine softmax with margins, we call them the large margin softmax loss in this paper.

In this paper, we first build a baseline system using a generic toolkit similar to [24]. Several training strategies are used to improve the accuracy. We then compare the performance of the large margin softmax loss using different configurations. Ring loss [25] and the minimum hyperspherical energy (MHE) criterion [26] are introduced to enhance the discriminative learning and enlarge the inter-speaker separability. Experiments on VoxCeleb show that our baseline system achieves better performance than the Kaldi x-vector recipe and reduces the EER, minDCF08 and minDCF10 from 3.10%, 0.0169 and 0.4977 to 2.34%, 0.0122 and 0.3754, respectively. Using the large margin softmax loss with auxiliary objective functions, the best system further improves the performance to 2%, 0.0106 and 0.2487.

The organization of this paper is as follows. The speaker embedding we use is briefly introduced in Section 2. Section 3 describes the large margin softmax loss and different techniques to enhance the loss. Our experimental setup and results are given in Sections 4 and 5. The last section concludes the paper.

2 Speaker embedding

The deep neural network used to extract speaker embedding consists of frame-level and segment-level sub-networks, connected by a temporal pooling layer. The frame-level network can be seen as a speaker feature extractor which transforms the acoustical features into speaker-related vectors. These vectors are aggregated across the entire utterance by a pooling layer and further processed by several fully-connected layers. Different loss functions can be used to optimize the network. After training, the output of a hidden layer at the segment-level network is extracted as the speaker embedding. Cosine scoring, LDA and PLDA are usually applied to generate the verification scores. Since the output layer is removed during the test phase, the test speakers do not have to be present in the training data.
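The temporal pooling step can be illustrated with a minimal sketch. It assumes the statistics pooling of the x-vector architecture, i.e. the per-dimension mean and standard deviation concatenated into one vector; the function name and the toy frames are hypothetical.

```python
import math

def statistics_pooling(frames):
    """Map a variable number of frame-level vectors to one fixed-length vector.

    frames: list of frame-level vectors (lists of floats), one per frame.
    Returns the per-dimension means followed by the per-dimension
    (population) standard deviations.
    """
    num_frames, dim = len(frames), len(frames[0])
    means = [sum(f[d] for f in frames) / num_frames for d in range(dim)]
    stds = [
        math.sqrt(sum((f[d] - means[d]) ** 2 for f in frames) / num_frames)
        for d in range(dim)
    ]
    return means + stds

# Two utterances of different length give vectors of the same size.
short_utt = statistics_pooling([[1.0, 2.0], [3.0, 6.0]])
long_utt = statistics_pooling([[1.0, 2.0], [3.0, 6.0], [2.0, 4.0]])
```

The pooled vector has twice the frame-level dimension regardless of the utterance length, which is what lets the segment-level layers be ordinary fully-connected layers.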

In this paper, the speaker embedding is extracted from the x-vector architecture [1]. The x-vector is popular in many applications and has been provided as the official system in the recent NIST speaker recognition evaluation (SRE). The details are described in Section 4.2.

3 Large margin softmax loss

3.1 Definition

The widely used softmax loss is presented as

$$L_S = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{e^{\vec{w}_{y_i}^T\vec{x}_i}}{\sum_{j=1}^{C}e^{\vec{w}_j^T\vec{x}_i}} \tag{1}$$

where $N$ is the number of training samples, $C$ is the number of speakers in the training set, and $\vec{w}_j$ is the $j$-th column of the weights in the output layer. $\vec{x}_i$ is the input of the last (i.e. output) layer and $y_i$ is the ground truth label for the $i$-th sample. To avoid ambiguity, we use feature to represent $\vec{x}_i$ in this paper while embedding refers to the speaker embedding extracted from a hidden layer of the network. The logit $\vec{w}_j^T\vec{x}_i$ can be transformed to $\|\vec{w}_j\|\|\vec{x}_i\|\cos\theta_j$, where $\theta_j$ is the angle between $\vec{w}_j$ and $\vec{x}_i$. Eq. 1 is influenced by the norm of the weights. This is undesirable since we care more about the angle $\theta_j$.

In modified softmax [15], the weights are normalized as $\|\vec{w}_j\| = 1$ and the loss is

$$L_{MS} = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{e^{s\cos\theta_{y_i}}}{\sum_{j=1}^{C}e^{s\cos\theta_j}} \tag{2}$$

where $s$ is the scaling factor. This factor can be the feature norm or a fixed value if the feature is also normalized. We will discuss the feature normalization later.
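As a rough illustration of Eq. 2, the sketch below computes the modified softmax loss with a fixed scaling factor, i.e. with both the weights and the features normalized. The function names, the toy data and the value of the scale are assumptions made for this example only.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def modified_softmax_loss(features, weights, labels, s=20.0):
    """Cross entropy over the logits s * cos(theta_j), as in Eq. 2.

    features: N feature vectors; weights: C class weight vectors;
    labels: N integer class indices. s is an illustrative fixed scale.
    """
    total = 0.0
    for x, y in zip(features, labels):
        logits = [s * cosine(w, x) for w in weights]
        # Log-sum-exp with the maximum subtracted for numerical stability.
        top = max(logits)
        log_norm = top + math.log(sum(math.exp(z - top) for z in logits))
        total += log_norm - logits[y]
    return total / len(features)

features = [[1.0, 0.1], [0.2, 1.0]]
weights = [[2.0, 0.0], [0.0, 0.5]]  # norms differ on purpose
labels = [0, 1]
loss = modified_softmax_loss(features, weights, labels)
# Rescaling the weights leaves the loss unchanged: only the angles matter.
rescaled = modified_softmax_loss(
    features, [[10.0 * a for a in w] for w in weights], labels
)
```

Because the logits depend on the weights only through the cosine, `loss` and `rescaled` agree up to floating-point error, which is the property that separates Eq. 2 from Eq. 1.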

Based on Eq. 2, different margins can be introduced by reformulating the target logits. We define an angle function [18]

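$$\psi(\theta_{y_i}) = \cos(m_1\theta_{y_i} + m_2) - m_3 \tag{3}$$

where $m_1$, $m_2$ and $m_3$ are margins. Eq. 2 is rewritten as the large margin softmax loss

$$L_{LMS} = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{e^{s\cdot\psi(\theta_{y_i})}}{e^{s\cdot\psi(\theta_{y_i})} + \sum_{j=1, j\neq y_i}^{C}e^{s\cdot\cos\theta_j}} \tag{4}$$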

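Strictly, Eq. 3 is only valid when $m_1\theta_{y_i} + m_2$ stays within $[0, \pi]$, since $\psi(\theta_{y_i})$ should be a monotonically decreasing function of $\theta_{y_i}$. However, in practice, the angle $\theta_{y_i}$ usually lies in a limited range [18]. We can safely apply Eq. 3 and 4 to optimize the network when $m_1$ and $m_2$ are small enough for this condition to hold over that range.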
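The effect of the margins in Eq. 3 and 4 can be sketched as follows. The code takes the angles directly as input to keep the example short; the function names, the scale and the margin values are illustrative assumptions, and the sum in the denominator is taken over all non-target classes.

```python
import math

def angle_function(theta, m1=1.0, m2=0.0, m3=0.0):
    """Eq. 3: psi(theta) = cos(m1 * theta + m2) - m3."""
    return math.cos(m1 * theta + m2) - m3

def large_margin_softmax_loss(thetas, labels, s=20.0, m1=1.0, m2=0.0, m3=0.0):
    """Eq. 4 on precomputed angles.

    thetas[i][j] is the angle in radians between feature i and class weight j.
    """
    total = 0.0
    for row, y in zip(thetas, labels):
        logits = [s * math.cos(t) for t in row]
        # The margin is applied to the target logit only.
        logits[y] = s * angle_function(row[y], m1, m2, m3)
        top = max(logits)
        log_norm = top + math.log(sum(math.exp(z - top) for z in logits))
        total += log_norm - logits[y]
    return total / len(thetas)

thetas = [[0.4, 1.3, 1.6], [1.5, 0.5, 1.2]]
labels = [0, 1]
plain = large_margin_softmax_loss(thetas, labels)           # no margin (Eq. 2)
additive = large_margin_softmax_loss(thetas, labels, m3=0.2)  # margin on the cosine
angular = large_margin_softmax_loss(thetas, labels, m2=0.2)   # margin on the angle
```

For the same angles, either margin lowers the target logit and therefore raises the loss, so the network has to reduce the target angle further to reach the same loss value.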

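When $m_1$ is large, a new $\psi(\theta_{y_i})$ is required. With $m_2 = m_3 = 0$ and an integer $m_1$, we use the angle function defined in [15]

$$\psi(\theta_{y_i}) = (-1)^k\cos(m_1\theta_{y_i}) - 2k \tag{5}$$

where $\theta_{y_i} \in \left[\frac{k\pi}{m_1}, \frac{(k+1)\pi}{m_1}\right]$ and $k \in [0, m_1 - 1]$. The curves of the angle functions using different margins are illustrated in Fig. 1.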
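A small sketch of the piecewise angle function in Eq. 5 is given below. It assumes the segment rule of the angular softmax definition in [15], where the integer $k$ indexes the interval of width $\pi / m_1$ that contains the angle; the function name is hypothetical.

```python
import math

def asoftmax_angle_function(theta, m1):
    """Eq. 5 for an integer margin m1 and an angle theta in [0, pi]."""
    # k is the index of the segment [k*pi/m1, (k+1)*pi/m1] containing theta;
    # theta = pi is assigned to the last segment.
    k = min(int(theta * m1 / math.pi), m1 - 1)
    return (-1) ** k * math.cos(m1 * theta) - 2 * k

# Sample the curve for m1 = 4 on a regular grid over [0, pi].
grid = [math.pi * i / 200 for i in range(201)]
curve = [asoftmax_angle_function(t, 4) for t in grid]
```

With this construction the curve starts at 1 for a zero angle and decreases monotonically to $-(2m_1 - 1)$ at $\pi$, so a larger angle always yields a smaller target logit.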

The margins $m_1$, $m_2$ and $m_3$ can be used separately [15, 17, 16, 18]. They can also be combined as in [18]. In this paper, we only use a single margin at a time, since the performance gain from combining margins is relatively small while considerable effort is needed to tune the hyperparameters. When $m_1$, $m_2$ and $m_3$ are used individually, the losses are denoted as angular softmax (ASoftmax), additive angular margin softmax (ArcSoftmax) and additive margin softmax loss (AMSoftmax), respectively.

3.2 Feature normalization

As discussed in [11], the norm of the feature is related to the sample quality when the softmax loss is used. The network will minimize the loss by simply increasing the norm of the features for easy samples and ignoring the hard ones. This prevents the network from handling poor-quality samples well.

To solve this issue, feature normalization is used in many works. After normalization, the feature norm is eliminated from the loss and a fixed-value scaling factor is used instead. With feature normalization, the loss depends only on the angle function. Features with a small norm get much larger gradients than those with a large norm, making the network pay more attention to the low-quality samples [16].

Rather than learning to map the samples onto a fixed-norm hypersphere, feature normalization uses an additional normalization layer to do this job. In this paper, we instead introduce Ring loss [25] to apply the norm constraint directly to the features. The definition of the Ring loss is straightforward. We want the feature norm to be close to a target value $R$. An auxiliary loss is employed as

$$L_R = \frac{\lambda_R}{N}\sum_{i=1}^{N}\left(\|\vec{x}_i\| - R\right)^2 \tag{6}$$

where $\lambda_R$ is the weight of the Ring loss relative to the primary large margin softmax loss. The Ring loss can be considered a soft version of feature normalization, and the target norm $R$ can be learnt during the network training.
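Eq. 6 translates almost directly into code. The following sketch evaluates the Ring loss for a fixed target norm; in training the target norm would be a learnable parameter, which is not modelled here. The function name and the toy features are hypothetical.

```python
import math

def ring_loss(features, target_norm, weight=0.01):
    """Eq. 6: weighted mean squared gap between feature norms and the target."""
    total = 0.0
    for x in features:
        norm = math.sqrt(sum(a * a for a in x))
        total += (norm - target_norm) ** 2
    return weight * total / len(features)

# Feature norms are 5 and 10; with a target of 5 only the second one is penalized.
penalty = ring_loss([[3.0, 4.0], [6.0, 8.0]], target_norm=5.0)
on_target = ring_loss([[3.0, 4.0], [0.0, 5.0]], target_norm=5.0)
```

The penalty is zero only when every feature lies exactly on the sphere of the target radius, and it grows quadratically with the distance from that sphere.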

3.3 Enlarge inter-speaker feature separability

Although the large margin softmax loss improves the intra-class compactness, it does not explicitly promote the inter-class separability. In [26], the authors proposed a minimum hyperspherical energy (MHE) criterion to encourage the weights of the output layer to distribute evenly on the hypersphere. The MHE criterion is expressed as

$$L_M = \frac{\lambda_M}{N(C-1)}\sum_{i=1}^{N}\sum_{j=1, j\neq y_i}^{C} f\left(\|\hat{\vec{w}}_{y_i} - \hat{\vec{w}}_j\|\right) \tag{7}$$

where $\lambda_M$ is a weighting hyperparameter, $\hat{\vec{w}}_{y_i}$ and $\hat{\vec{w}}_j$ are the normalized weights in the loss function and $f(\cdot)$ is a decreasing function. Intuitively, the MHE loss enlarges the overall inter-class feature separability. Similar to the Ring loss, we include MHE as an auxiliary objective function.
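The MHE criterion in Eq. 7 can be sketched as below. The decreasing function is assumed here to be an inverse power of the distance, which is one possible choice and not necessarily the one used in the experiments; the inner sum runs over all classes other than the target class of each sample. All names and data are hypothetical.

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v]

def mhe_loss(weights, labels, weight=0.01, power=2.0):
    """Eq. 7 with the assumed energy f(z) = z ** -power.

    weights: C class weight vectors; labels: N integer class indices.
    Distinct classes must not share the same direction (zero distance).
    """
    unit = [normalize(w) for w in weights]
    num_classes = len(unit)
    total = 0.0
    for y in labels:
        for j in range(num_classes):
            if j == y:
                continue  # skip the target class itself
            total += math.dist(unit[y], unit[j]) ** -power
    return weight * total / (len(labels) * (num_classes - 1))

labels = [0, 1, 2]
# Three weights spread evenly on the unit circle (120 degrees apart).
spread = [[1.0, 0.0], [-0.5, math.sqrt(3) / 2], [-0.5, -math.sqrt(3) / 2]]
# Three weights pointing in almost the same direction.
clustered = [[1.0, 0.0], [0.9, 0.1], [0.9, -0.1]]
spread_energy = mhe_loss(spread, labels)
clustered_energy = mhe_loss(clustered, labels)
```

The evenly spread configuration has the lower energy, so minimizing this term pushes the class weights apart on the hypersphere.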

3.4 Annealing strategy during training

From a classification perspective, the large margin softmax makes the decision boundary more stringent for correctly classifying $\vec{x}_i$. The angle between $\vec{x}_i$ and $\vec{w}_{y_i}$ is required to be much smaller than the angles with other weights. From an optimization point of view, the margin lets well-separated features continue to receive large gradients, which shrinks the intra-class variance.

However, the margin also increases the training difficulty, especially when the network is randomly initialized. To stabilize the training procedure, an annealing strategy is applied. The target logit is replaced by the weighted average of the original logit and its large margin counterpart, which means

$$\psi_{train}(\theta_{y_i}) = \frac{1}{1+\lambda}\psi(\theta_{y_i}) + \frac{\lambda}{1+\lambda}\cos(\theta_{y_i}) \tag{8}$$

where $\lambda$ is an annealing weight that decays with the training step until it reaches the minimum value it can achieve, and the remaining hyperparameters control the annealing speed.
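The interpolation in Eq. 8 can be sketched as follows. The decay schedule for the annealing weight is an assumption: a polynomial decay clipped at a minimum value, a shape often seen in public implementations of angular softmax. The parameter names and default values are illustrative and are not the settings used in the experiments.

```python
import math

def annealing_lambda(step, lambda_min=0.0, lambda_base=1000.0, gamma=1e-4, power=5.0):
    """Assumed schedule: polynomial decay of lambda, clipped at lambda_min."""
    return max(lambda_min, lambda_base * (1.0 + gamma * step) ** (-power))

def annealed_target_logit(theta, step, psi, **schedule):
    """Eq. 8: weighted average of the margin logit psi(theta) and cos(theta)."""
    lam = annealing_lambda(step, **schedule)
    return (psi(theta) + lam * math.cos(theta)) / (1.0 + lam)

def am_psi(theta, m3=0.2):
    """Additive margin angle function used as the large margin counterpart."""
    return math.cos(theta) - m3

early = annealed_target_logit(0.5, step=0, psi=am_psi)       # close to cos(theta)
late = annealed_target_logit(0.5, step=10**7, psi=am_psi)    # close to psi(theta)
```

Early in training the weight is large and the target logit is almost the plain cosine; as the weight decays, the full margin is phased in. A non-zero minimum keeps part of the plain cosine in the target logit for the whole run.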

3.5 Other discussions

In [13], generalized end-to-end (GE2E) loss is proposed to train the speaker network. We rewrite Eq. 6 in [13] as

$$L_{GE2E} = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{e^{s\cdot\cos(\vec{x}_i, \vec{c}_{y_i}) + b}}{\sum_{j=1}^{C'}e^{s\cdot\cos(\vec{x}_i, \vec{c}_j) + b}} \tag{9}$$

where $C'$ is the number of speakers in a minibatch and $\vec{c}_j$ is the center of speaker $j$ estimated from the batch. The softmax is computed across the batch rather than the entire dataset. This is convenient when the training set is extremely large. If the bias $b$ is omitted and the estimated center $\vec{c}_j$ is replaced with a learnable weight $\vec{w}_j$, the GE2E loss becomes the modified softmax in Eq. 2. Hence, combined with the GE2E loss, the large margin softmax loss also has the potential to be applied to a dataset comprising millions of speakers.
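To make the relation between Eq. 9 and Eq. 2 concrete, the sketch below computes the batch-level loss with speaker centers estimated from the minibatch. As a simplifying assumption, each center is the plain mean of all batch features of that speaker; the function names and the toy batch are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def batch_centers(features, labels):
    """Mean feature per speaker, estimated from the minibatch only."""
    sums, counts = {}, {}
    for x, y in zip(features, labels):
        acc = sums.setdefault(y, [0.0] * len(x))
        for d, a in enumerate(x):
            acc[d] += a
        counts[y] = counts.get(y, 0) + 1
    return {y: [a / counts[y] for a in acc] for y, acc in sums.items()}

def ge2e_softmax_loss(features, labels, s=10.0, b=0.0):
    """Eq. 9: softmax over the speakers that are present in the batch."""
    centers = batch_centers(features, labels)
    total = 0.0
    for x, y in zip(features, labels):
        logits = {spk: s * cosine(x, c) + b for spk, c in centers.items()}
        top = max(logits.values())
        log_norm = top + math.log(sum(math.exp(z - top) for z in logits.values()))
        total += log_norm - logits[y]
    return total / len(features)

features = [[1.0, 0.1], [1.0, -0.1], [0.1, 1.0], [-0.1, 1.0]]
separated = ge2e_softmax_loss(features, ["a", "a", "b", "b"])
with_bias = ge2e_softmax_loss(features, ["a", "a", "b", "b"], b=-5.0)
mixed = ge2e_softmax_loss(features, ["a", "b", "a", "b"])
```

The bias shifts every logit by the same amount and cancels in the softmax, so `separated` and `with_bias` coincide. Replacing the estimated centers with learnable weights in this function would give the modified softmax of Eq. 2 restricted to the speakers in the batch.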

4 Experimental setup

4.1 Dataset

To investigate the performance of the large margin softmax loss, we have run experiments on the VoxCeleb dataset [27, 28]. The training set includes VoxCeleb1 dev part and VoxCeleb2. The VoxCeleb1 test part is used as the evaluation set. This setup is selected to be consistent with the Kaldi recipe [29].

Equal error rate (EER), minimum detection cost function from NIST SRE08 (minDCF08) [30] and SRE10 (minDCF10) [31] are presented to demonstrate the performance.
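For reference, EER is the operating point at which the miss rate equals the false alarm rate. A minimal sketch of its computation from trial scores is given below; it sweeps the threshold over the observed scores and takes the closest crossing, which is an approximation without interpolation. The function name and the toy scores are hypothetical.

```python
def equal_error_rate(target_scores, nontarget_scores):
    """Approximate EER by sweeping the threshold over all observed scores.

    A trial is accepted when its score is greater than or equal to the threshold.
    """
    thresholds = sorted(set(target_scores) | set(nontarget_scores))
    best_gap, eer = None, None
    for t in thresholds:
        miss = sum(score < t for score in target_scores) / len(target_scores)
        false_alarm = sum(score >= t for score in nontarget_scores) / len(nontarget_scores)
        gap = abs(miss - false_alarm)
        if best_gap is None or gap < best_gap:
            # Keep the threshold where the two error rates are closest.
            best_gap, eer = gap, (miss + false_alarm) / 2.0
    return eer

overlapping = equal_error_rate([0.9, 0.8, 0.7, 0.4], [0.6, 0.3, 0.2, 0.1])
clean = equal_error_rate([0.9, 0.8], [0.2, 0.1])
```

In the first example one target trial scores below one non-target trial, so both error rates meet at 25%; in the second the two score sets do not overlap and the EER is zero.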

4.2 Training details

The acoustic features in our experiments are 30-dim MFCCs with cepstral mean normalization. An energy-based voice activity detection (VAD) is applied. The training data is augmented using MUSAN [32] and RIR [33].

We use the same network architecture as Kaldi [1] to extract x-vectors with the following modifications.

• For the frame-level network, a 5-layer TDNN is used. The kernel size for each layer is . Unlike Kaldi, no dilation is used. This performs better in our experiments and is also suggested in other works [24]. Statistics pooling and a 2-layer segment-level network are appended after the frame-level network.

• Each hidden layer consists of an affine component followed by batch-normalization (BN) and ReLU non-linearity. This order of BN and ReLU does not necessarily lead to better performance, but the training is more stable than with the opposite order.

• The last ReLU in the segment-level network is removed. The non-linearity limits the feasible angles between the feature and the weights which is not a good choice when the angle-based large margin softmax loss is applied [15].

At every training step, we sample 64 speakers. For each speaker, a segment with 200 to 400 frames is sliced from the utterances. Softmax with cross entropy is used to train the baseline system.

$L_2$ regularization is applied to all layers in the network to prevent overfitting. We select stochastic gradient descent (SGD) as the optimizer and the initial learning rate is set to 0.01. A 1000-utterance validation set is randomly selected from the training set and the learning rate is halved if the validation loss stops improving for a while. The loss converges after the learning rate goes down below , resulting in around 2.5M training steps. No dropout is applied in our networks, as described in [24].
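The learning rate schedule described above can be sketched as a replay over a validation-loss history. The patience (how many checks without improvement count as being stuck) and the stopping threshold are assumptions chosen for illustration; only the initial value of 0.01 and the halving come from the description.

```python
def halve_on_plateau(val_losses, initial_lr=0.01, patience=3, min_lr=1e-6):
    """Return the learning rate in effect after each validation check.

    The rate is halved whenever the validation loss has not improved for
    `patience` consecutive checks; the replay stops once it falls below min_lr.
    """
    lr, best, stale, history = initial_lr, float("inf"), 0, []
    for loss in val_losses:
        if loss < best:
            best, stale = loss, 0
        else:
            stale += 1
            if stale >= patience:
                lr, stale = lr / 2.0, 0
        history.append(lr)
        if lr < min_lr:
            break
    return history

# The loss stalls for three checks after 0.9, which triggers one halving.
rates = halve_on_plateau([1.0, 0.9, 0.95, 0.93, 0.92, 0.8])
```

The stalled checks at 0.95, 0.93 and 0.92 do not beat the best value of 0.9, so the rate drops from 0.01 to 0.005 at the fifth check and stays there once the loss improves again.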

When training with the large margin softmax loss, an annealing strategy is used as described in Section 3.4. Specifically, we set a fast decay for AMSoftmax where . For ASoftmax and ArcSoftmax, the decay speed is slowed down by setting . The minimum value of $\lambda$ for ArcSoftmax and AMSoftmax is 0, while for ASoftmax, . This non-zero minimum results in a gentler angle function. For example, when , the target logit is similar to ASoftmax with without .

After training, the output of the second-to-last layer in the segment-level network is extracted as the speaker embedding. LDA is used to reduce the dimension to 200 and PLDA is then applied to generate the verification scores. One may also use the embedding extracted from the last layer with a simple cosine backend. However, in our experiments, we find that PLDA scoring generally performs better.

Our systems are implemented with the Kaldi and TensorFlow toolkits. The code and models have been released.

5 Results

Table 1 summarizes the results of different systems. The first row in Table 1 shows the Kaldi recipe for VoxCeleb. The first experiment is to validate the performance of our baseline system. We find a large weight decay parameter works well in our systems. When increasing this parameter from to , the EER is improved from 3% to 2.34%. The second row shows the performance of our baseline system, which significantly outperforms the standard Kaldi result. The third row is the result using the modified softmax loss. Without any margins, simply normalizing the weights in the modified softmax does not improve the performance.

The performance of the large margin softmax loss is shown in the following sections of Table 1. We remove the last ReLU for all these networks. This generally improves the results. For instance, the performance of ASoftmax ($m_1$=4) with ReLU is 2.12%, 0.0122 and 0.3214 in EER, minDCF08 and minDCF10, while without ReLU, it achieves 2.15%, 0.0113 and 0.3108 instead. The same trends are observed in other systems as well.

From Table 1, it is clear that ASoftmax achieves the best result when . The performance of ArcSoftmax is similar to that of ASoftmax and the best margin is about to . AMSoftmax performs the best among all these large margin softmax losses with the optimal margin . We notice that the best margins for these systems are relatively small compared with those reported in face verification [15, 16, 18].

We now investigate the influence of the Ring loss and the MHE loss. The weight for the Ring loss is set to 0.01 and the target norm $R$ is initialized at 20. Table 1 shows that the Ring loss improves the minDCF08 and minDCF10. The norm distributions of different systems are presented in Fig. 2. Since the weights of the softmax network are not normalized, the mean of the feature norm is very large (about 150). To show the feature norm without margins, we use the modified softmax instead. From Fig. 2, we find that using the margin helps to reduce the norm variance. The margin prevents the norm of the simple samples from growing too large. The norm distribution further shrinks when the Ring loss is applied. However, even when the network is trained using AMSoftmax without feature normalization, the norm variance is relatively small. Therefore, the effectiveness of the Ring loss with AMSoftmax is less significant in our experiment.

The performance of the MHE loss is presented in the last row of Table 1. The weight $\lambda_M$ is 0.01. The AMSoftmax with MHE loss achieves the best result among all the systems. This loss improves the baseline performance by 15% in EER, 13% in minDCF08 and 33% in minDCF10. To gain some insight into the MHE loss, we illustrate the distribution of the pairwise squared distances between the normalized weights. The distance, which is $\|\hat{\vec{w}}_i - \hat{\vec{w}}_j\|^2 = 2 - 2\cos(\vec{w}_i, \vec{w}_j)$, indicates the separability between speakers on the training set. In Fig. 3, it is shown that all the distributions have their means at about 2.0, indicating that $\cos(\vec{w}_i, \vec{w}_j)$ is 0 on average. The AMSoftmax with the MHE loss achieves the smallest variance of the inter-speaker distances, which means the features of speakers are distributed more evenly on the hypersphere, leading to a better overall separability in the feature space.
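The link between a mean squared distance of about 2.0 and near-orthogonal weights follows from expanding the squared distance between two unit vectors. A short derivation, writing $\hat{\vec{w}}_i$ and $\hat{\vec{w}}_j$ for two normalized weight vectors and $\theta_{ij}$ for the angle between them (notation assumed here):

```latex
\begin{aligned}
\|\hat{\vec{w}}_i - \hat{\vec{w}}_j\|^2
  &= \|\hat{\vec{w}}_i\|^2 + \|\hat{\vec{w}}_j\|^2 - 2\,\hat{\vec{w}}_i^T\hat{\vec{w}}_j \\
  &= 1 + 1 - 2\cos\theta_{ij} \\
  &= 2 - 2\cos\theta_{ij}
\end{aligned}
```

The squared distance therefore ranges from 0 (identical directions) to 4 (opposite directions), and a value of 2.0 corresponds to $\cos\theta_{ij} = 0$, i.e. orthogonal weight vectors.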

6 Conclusions

In this paper, we investigate the large margin softmax loss for speaker verification. By selecting an appropriate margin, the large margin softmax loss can achieve promising results. Ring loss and MHE loss are introduced to further improve the performance. Ring loss is a soft version of feature normalization and alleviates the impact of the feature norm. The MHE criterion is another loss function, which enlarges the overall inter-speaker separability. On VoxCeleb, our baseline system achieves better results than the Kaldi toolkit. We find AMSoftmax is easier to train and generally performs better than ASoftmax and ArcSoftmax in our experiments. The best system is obtained when AMSoftmax is used with the MHE loss. This combination substantially outperforms the baseline.

In the future, we will combine both the Ring loss and the MHE loss with the large margin softmax loss. More efforts will be made to enable simple cosine scoring and remove the need for the PLDA backend.

7 Acknowledgements

This work was supported by the National Natural Science Foundation of China under Grant No. 61403224 and No. U1836219.