Recently, speaker embeddings extracted from deep neural networks have outperformed the conventional i-vector in many speaker verification tasks. By virtue of this excellent performance, speaker embedding is becoming the next generation of speaker recognition technology. Similar to the i-vector, a speaker embedding encodes a variable-length utterance into a fixed-length vector representing the speaker characteristics. A variety of backend classifiers can be applied to suppress noise and session variability. Speaker embeddings can also be used in other applications such as speaker diarization, speaker retrieval and speech synthesis.
Speaker verification is an open-set recognition problem. An utterance is verified as belonging to a certain speaker if their similarity exceeds a threshold. An ideal speaker embedding should be discriminative between different speakers and compact within the same speaker. Although cross-entropy with softmax is arguably the most commonly used loss function to train the speaker embedding network, it is better suited to closed-set classification and does not explicitly encourage discriminative learning of features.
To address this issue, different loss functions have been proposed. Triplet loss for speaker verification was first presented in [3, 4]. By selecting appropriate training samples, triplet loss performed well in both text-dependent and text-independent tasks. However, the performance is sensitive to the triplet mining strategy [5, 6] and designing such a training procedure is time-consuming. Speaker identity subspace loss, Gaussian mixture loss, etc., were also proposed in other works.
On the other hand, efforts have been made to improve the original softmax loss. Center loss was introduced in [9, 10] to constrain features to gather around their corresponding centers and thus reduce the intra-speaker variability. Both triplet loss and center loss are optimized in the Euclidean space. In the last few years, angular-based losses have become popular. Compared with the Euclidean distance, angular distance is a more natural choice in the feature space. In [11, 12], the features and the weights of the output layer were normalized before the softmax, making the loss function focus on the cosine similarity. Generalized end-to-end loss was proposed in [13]. In [14, 15], a margin is incorporated with the angle in a multiplicative way. This method is extended in [16, 17, 18] where additive margins are used. Some of these losses have been applied in speaker verification [19, 20, 21, 22, 23]. Since all these losses combine softmax with margins, we call them the large margin softmax loss in this paper.
In this paper, we first build a baseline system using a generic toolkit, similar to [24]. Several training strategies are used to improve the accuracy. We then compare the performance of the large margin softmax loss under different configurations. Ring loss [25] and the minimum hyperspherical energy (MHE) criterion [26] are involved to enhance discriminative learning and enlarge the inter-speaker separability. Experiments on VoxCeleb show that our baseline system achieves better performance than the Kaldi x-vector recipe and reduces the EER, minDCF08 and minDCF10 from 3.10%, 0.0169 and 0.4977 to 2.34%, 0.0122 and 0.3754, respectively. Using the large margin softmax loss with auxiliary objective functions, the best system further improves these to 2.00%, 0.0106 and 0.2487.
The organization of this paper is as follows. The speaker embedding we use is briefly introduced in Section 2. Section 3 describes the large margin softmax loss and different techniques to enhance the loss. Our experimental setup and results are given in Sections 4 and 5. The last section concludes the paper.
2 Speaker embedding
The deep neural network used to extract speaker embeddings consists of frame-level and segment-level sub-networks, connected by a temporal pooling layer. The frame-level network can be seen as a speaker feature extractor which transforms the acoustic features into speaker-related vectors. These vectors are aggregated across the entire utterance by a pooling layer and further processed by several fully-connected layers. Different loss functions can be used to optimize the network. After training, the output of a hidden layer of the segment-level network is extracted as the speaker embedding. Cosine scoring, LDA and PLDA are usually applied to generate the verification scores. Since the output layer is removed during the test phase, the test speakers do not have to be present in the training data.
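The temporal pooling step described above can be sketched as follows (a minimal numpy illustration; the function name and the `eps` guard are our own additions, not part of the x-vector recipe):

```python
import numpy as np

def stats_pooling(frames, eps=1e-8):
    """Statistics pooling: aggregate a (T x D) matrix of frame-level vectors
    into one 2D-dimensional vector of per-dimension means and std deviations."""
    mean = frames.mean(axis=0)
    std = np.sqrt(frames.var(axis=0) + eps)  # eps guards against zero variance
    return np.concatenate([mean, std])

# Two 2-dim frames: per-dimension means [1, 2], standard deviations [1, 1].
pooled = stats_pooling(np.array([[0.0, 1.0], [2.0, 3.0]]))
```

Because both first- and second-order statistics are kept, the segment-level network receives a fixed-size input regardless of the utterance length.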
In this paper, speaker embeddings are extracted from the x-vector architecture [1]. The x-vector is popular in many applications and has been provided as the official baseline system in recent NIST speaker recognition evaluations (SRE). The details are described in Section 4.2.
3 Large margin softmax loss
The widely used softmax loss is presented as

$$L_S = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{e^{\mathbf{w}_{y_i}^T\mathbf{x}_i + b_{y_i}}}{\sum_{j=1}^{C} e^{\mathbf{w}_j^T\mathbf{x}_i + b_j}} \quad (1)$$

where $N$ is the number of training samples, $C$ is the number of speakers in the training set, $\mathbf{w}_j$ is the $j$-th column of the weights in the output layer and $b_j$ is the corresponding bias. $\mathbf{x}_i$ is the input of the last (i.e. output) layer and $y_i$ is the ground truth label of the $i$-th sample. To avoid ambiguity, we use feature to represent $\mathbf{x}_i$ in this paper, while embedding refers to the speaker embedding extracted from a hidden layer of the network. The logit $\mathbf{w}_{y_i}^T\mathbf{x}_i$ can be transformed to $\|\mathbf{w}_{y_i}\|\|\mathbf{x}_i\|\cos\theta_{y_i,i}$, where $\theta_{y_i,i}$ is the angle between $\mathbf{w}_{y_i}$ and $\mathbf{x}_i$. Eq. 1 is influenced by the norm of the weights, which is undesirable since we care more about the angle $\theta_{y_i,i}$.
In modified softmax [15], the weights are normalized as $\|\mathbf{w}_j\| = 1$ (with the bias set to zero) and the loss is

$$L_{MS} = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{e^{s\cos\theta_{y_i,i}}}{\sum_{j=1}^{C} e^{s\cos\theta_{j,i}}} \quad (2)$$

where $s$ is the scaling factor. This factor can be the feature norm $\|\mathbf{x}_i\|$ or a fixed value if the feature is also normalized. We will discuss the feature normalization later.
Eq. 2 is rewritten as the large margin softmax loss

$$L_{LMS} = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{e^{s\,\psi(\theta_{y_i,i})}}{e^{s\,\psi(\theta_{y_i,i})} + \sum_{j\neq y_i} e^{s\cos\theta_{j,i}}} \quad (3)$$

where the angle function is

$$\psi(\theta) = \cos(m_1\theta + m_2) - m_3 \quad (4)$$

and $m_1$, $m_2$ and $m_3$ are margins.
Strictly, Eq. 3 is only valid when $m_1 = 1$ and $\theta_{y_i,i} + m_2 \le \pi$, since $\psi(\theta)$ should be a monotonically decreasing function. However, in practice, the angle $\theta_{y_i,i}$ is usually in the range $[0, \pi/2]$. We can safely apply Eq. 3 and 4 to optimize the network when $m_1 = 1$ and $m_2 \le \pi/2$.
When $m_1$ is large, a new $\psi(\theta)$ is required. Let $m_2 = m_3 = 0$. For $m_1 > 1$, we use the angle function defined in [15]

$$\psi(\theta) = (-1)^k\cos(m_1\theta) - 2k, \quad \theta \in \left[\frac{k\pi}{m_1}, \frac{(k+1)\pi}{m_1}\right] \quad (5)$$

where $k \in [0, m_1 - 1]$ is an integer. The curves of the angle functions using different margins are illustrated in Fig. 1.
In this paper, we only use a single margin at a time, since the performance gain of combining margins is relatively small while large efforts are needed to tune the hyperparameters. When $m_1$, $m_2$ and $m_3$ are used individually, the losses are denoted as angular softmax (ASoftmax), additive angular margin softmax (ArcSoftmax) and additive margin softmax (AMSoftmax), respectively.
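As a concrete illustration, the margin-modified logits can be computed as below. This is a numpy sketch under our own naming (in a real system these operations run inside the network's output layer, with gradients); the default scale and margin values are illustrative, not the paper's settings:

```python
import numpy as np

def large_margin_logits(x, W, y, s=30.0, m1=1.0, m2=0.0, m3=0.0):
    """Scaled logits with a large-margin angle function: features and the
    output-layer weight columns are length-normalized, the target logit
    cos(theta) is replaced by psi(theta) = cos(m1*theta + m2) - m3, and
    everything is scaled by s. Valid where psi stays monotonically decreasing."""
    x = x / np.linalg.norm(x, axis=1, keepdims=True)   # normalize features
    W = W / np.linalg.norm(W, axis=0, keepdims=True)   # normalize weight columns
    cos = x @ W                                        # (N, C) cosine similarities
    theta = np.arccos(np.clip(cos, -1.0, 1.0))         # angles to each class weight
    logits = cos.copy()
    rows = np.arange(len(y))
    logits[rows, y] = np.cos(m1 * theta[rows, y] + m2) - m3  # margin on target only
    return s * logits

# m1=1, m2=0, m3>0 recovers AMSoftmax: only the target logit drops by s*m3.
x = np.array([[3.0, 4.0]]); W = np.eye(2); y = np.array([0])
plain = large_margin_logits(x, W, y)             # no margin: s * cos(theta)
am = large_margin_logits(x, W, y, m3=0.35)       # AMSoftmax-style margin
```

Setting `m2` instead of `m3` gives ArcSoftmax, and an integer `m1 > 1` (with the piecewise angle function) gives ASoftmax.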
3.2 Feature normalization
As discussed in [11], the norm of the feature is related to the sample quality when the softmax loss is used. The network will minimize the loss by simply increasing the norm of the features for easy samples and ignoring the hard ones. As a result, the network never learns to process poor-quality samples well.
To solve this issue, feature normalization has been proposed in many works. After normalization, the feature norm is eliminated from the loss and a fixed-value scaling factor is used instead. With feature normalization, the loss depends only on the angle. Features with small norm get much larger gradients than those with large norm, making the network pay more attention to the low-quality samples.
Rather than learning to map the samples onto a fixed-norm hypersphere, feature normalization uses an additional normalization layer to do this job. Unlike feature normalization, we introduce Ring loss [25] to directly apply the norm constraint on the features. The definition of the Ring loss is straightforward: we want the feature norm to be close to a target value $R$. An auxiliary loss is employed as

$$L_R = \frac{\lambda_R}{2N}\sum_{i=1}^{N}\left(\|\mathbf{x}_i\|_2 - R\right)^2 \quad (6)$$

where $\lambda_R$ is the loss weight relative to the primary large margin softmax loss. The Ring loss can be considered a soft version of feature normalization, and the target norm $R$ can be learnt during network training.
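The Ring loss penalty can be sketched in a few lines (a numpy illustration; the exact normalization constant is our assumption, and in training $R$ would be a learnable scalar rather than a fixed argument):

```python
import numpy as np

def ring_loss(features, R, lam=0.01):
    """Ring loss sketch: penalize the squared deviation of each feature's
    L2 norm from the target norm R, averaged over the batch."""
    norms = np.linalg.norm(features, axis=1)
    return lam / (2.0 * len(features)) * np.sum((norms - R) ** 2)

feats = np.array([[3.0, 4.0], [6.0, 8.0]])   # norms 5 and 10
loss = ring_loss(feats, R=5.0, lam=0.01)     # only the second sample deviates
```

In training this term is simply added to the large margin softmax loss, so features are softly pulled toward the ring of radius $R$ instead of being hard-normalized.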
3.3 Enlarge inter-speaker feature separability
Although the large margin softmax loss improves the intra-class compactness, it does not explicitly promote the inter-class separability. In [26], the authors proposed a minimum hyperspherical energy (MHE) criterion to encourage the weights of the output layer to distribute evenly on the hypersphere. The MHE criterion is expressed as

$$L_M = \lambda_M \sum_{i=1}^{C}\sum_{j=1, j\neq i}^{C} f\!\left(\|\tilde{\mathbf{w}}_i - \tilde{\mathbf{w}}_j\|\right) \quad (7)$$

where $\lambda_M$ is a weighting hyperparameter, $\tilde{\mathbf{w}}_i$, $\tilde{\mathbf{w}}_j$ are the normalized weights in the loss function and $f(\cdot)$ is a decreasing function. Intuitively, the MHE loss enlarges the overall inter-class feature separability. Similar to the Ring loss, we include MHE as an auxiliary objective function.
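A minimal MHE-style penalty might look like the following. This is a sketch under our own assumptions: the pairwise normalization constant and the inverse-square-distance choice of $f$ are illustrative (the MHE paper studies a family of Riesz-kernel energies):

```python
import numpy as np

def mhe_penalty(W, lam=0.01, f=lambda d2: 1.0 / (d2 + 1e-4)):
    """MHE sketch: normalize the class-weight columns of W, then penalize a
    decreasing function f of their pairwise squared Euclidean distances,
    pushing the weights to spread evenly on the hypersphere."""
    Wn = W / np.linalg.norm(W, axis=0, keepdims=True)   # columns = class weights
    C = Wn.shape[1]
    energy = 0.0
    for i in range(C):
        for j in range(i + 1, C):
            energy += f(np.sum((Wn[:, i] - Wn[:, j]) ** 2))
    return lam * 2.0 / (C * (C - 1)) * energy

# Well-spread (orthogonal) weights incur a small penalty ...
spread = mhe_penalty(np.eye(3))
# ... while two nearly collinear class weights incur a much larger one.
crowded = mhe_penalty(np.array([[1.0, 0.99, 0.0],
                                [0.0, 0.14, 1.0],
                                [0.0, 0.0, 0.0]]))
```

The gradient of this term acts only on the output-layer weights, which is why it complements the margin (which shapes the features) rather than replacing it.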
3.4 Annealing strategy during training
From a classification perspective, the large margin softmax makes the decision boundary more stringent in order to correctly classify $\mathbf{x}_i$: the angle between $\mathbf{x}_i$ and $\mathbf{w}_{y_i}$ is required to be much smaller than the angles to the other weights. From an optimization view, the existence of the margin makes even well-separated features continue to receive large gradients, which shrinks the intra-class variance.
However, the margin also increases the training difficulty, especially when the network is randomly initialized. To stabilize the training procedure, an annealing strategy is applied. The target logit is replaced by the weighted average of the original logit and its large margin counterpart:

$$f(\theta_{y_i,i}) = \frac{\psi(\theta_{y_i,i}) + \lambda\cos\theta_{y_i,i}}{1 + \lambda} \quad (8)$$

where $\lambda = \max(\lambda_{\min}, \lambda_{\text{base}}(1 + \gamma t)^{-p})$, $t$ is the training step, $\lambda_{\min}$ is the minimum value $\lambda$ can achieve, and $\lambda_{\text{base}}$, $\gamma$ and $p$ are the hyperparameters controlling the annealing speed.
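The schedule is easy to sketch; the hyperparameter values below are illustrative defaults, not the paper's settings:

```python
def annealed_target_logit(cos_theta, psi_theta, t,
                          lam_base=1000.0, gamma=1e-4, power=1.0, lam_min=0.0):
    """Annealing sketch: blend the plain cosine logit with the large-margin
    logit psi(theta). The blend weight lam decays with training step t, so
    training starts near plain softmax and ends at the full-margin loss."""
    lam = max(lam_min, lam_base * (1.0 + gamma * t) ** (-power))
    return (psi_theta + lam * cos_theta) / (1.0 + lam)

# Early in training the target logit stays close to cos(theta) ...
early = annealed_target_logit(0.8, 0.5, t=0)
# ... and approaches the full-margin psi(theta) as training proceeds.
late = annealed_target_logit(0.8, 0.5, t=10**9)
```

A non-zero `lam_min` keeps a residual amount of the plain cosine logit in the final objective, which is how the gentler ASoftmax variant described in Section 4.2 arises.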
3.5 Other discussions
The large margin softmax loss is also closely related to the generalized end-to-end (GE2E) loss [13], whose softmax variant can be written as

$$L_G = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{e^{s\cos(\mathbf{x}_i,\, \mathbf{c}_{y_i}) + b}}{\sum_{j=1}^{C_B} e^{s\cos(\mathbf{x}_i,\, \mathbf{c}_j) + b}} \quad (9)$$

where $C_B$ is the number of speakers in a minibatch and $\mathbf{c}_j$ is the center of speaker $j$ estimated from the batch. The softmax is computed across the batch rather than the entire dataset, which is convenient when the training set is extremely large. If the bias $b$ is omitted and the estimated center $\mathbf{c}_j$ is replaced with a learnable weight $\mathbf{w}_j$, the GE2E loss becomes the modified softmax in Eq. 2. Hence, combined with the GE2E formulation, the large margin softmax loss also has the potential to be applied to a dataset comprising millions of speakers.
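A batch-wise softmax with per-speaker centers estimated from the minibatch can be sketched as follows (a numpy illustration with our own naming; real GE2E additionally excludes the current utterance from its own center estimate, which we omit for brevity):

```python
import numpy as np

def batch_softmax_loss(x, labels, s=10.0, b=0.0):
    """GE2E-flavoured sketch: score each normalized embedding against the
    per-speaker centers estimated from the minibatch itself, then apply a
    softmax over only the speakers present in the batch."""
    x = x / np.linalg.norm(x, axis=1, keepdims=True)
    speakers = np.unique(labels)                    # sorted batch speakers
    centers = np.stack([x[labels == spk].mean(axis=0) for spk in speakers])
    centers = centers / np.linalg.norm(centers, axis=1, keepdims=True)
    scores = s * (x @ centers.T) + b                # (N, batch_speakers)
    scores -= scores.max(axis=1, keepdims=True)     # numerical stability
    logp = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
    target_cols = np.searchsorted(speakers, labels)
    return -logp[np.arange(len(x)), target_cols].mean()

# Two tight clusters along the axes: the correct grouping yields a small loss.
x = np.array([[1.0, 0.1], [1.0, -0.1], [0.1, 1.0], [-0.1, 1.0]])
loss = batch_softmax_loss(x, np.array([0, 0, 1, 1]))
```

Because the normalization runs over $C_B$ batch speakers instead of all $C$ training speakers, the memory cost of the output layer no longer grows with the training set.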
4 Experimental setup
4.1 Dataset

To investigate the performance of the large margin softmax loss, we run experiments on the VoxCeleb dataset [27, 28]. The training set includes the VoxCeleb1 dev part and VoxCeleb2. The VoxCeleb1 test part is used as the evaluation set. This setup is chosen to be consistent with the Kaldi recipe [29].
4.2 Training details
The acoustic features in our experiments are 30-dim MFCCs with cepstral mean normalization. An energy-based voice activity detection (VAD) is applied. The training data is augmented using MUSAN [32] and RIR [33].
We use the same network architecture as Kaldi [29] to extract x-vectors, with the following modifications.
For the frame-level network, a 5-layer TDNN is used with the same kernel sizes as the Kaldi recipe. Unlike Kaldi, no dilation is used; this performs better in our experiments and is also suggested in other works. Statistics pooling and a 2-layer segment-level network are appended after the frame-level network.
The last ReLU in the segment-level network is removed. The non-linearity limits the feasible angles between the features and the weights, which is harmful when the angle-based large margin softmax loss is applied.
At every training step, we sample 64 speakers. For each speaker, a segment with 200 to 400 frames is sliced from the utterances. Softmax with cross entropy is used to train the baseline system.
L2 regularization (weight decay) is applied to all layers in the network to prevent overfitting. We select stochastic gradient descent (SGD) as the optimizer and set the initial learning rate to 0.01. A 1000-utterance validation set is randomly selected from the training set, and the learning rate is halved if the validation loss gets stuck for a while. The loss converges after the learning rate has decayed sufficiently, resulting in around 2.5M training steps. No dropout is applied in our networks.
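The halve-on-plateau schedule above can be sketched as a small helper (parameter names and the patience window are our own illustration, not the exact criterion used):

```python
def halve_on_plateau(lr, val_losses, patience=3, min_delta=0.0):
    """Sketch of the schedule described above: halve the learning rate when
    the validation loss has not improved for `patience` consecutive checks."""
    recent = val_losses[-(patience + 1):]
    if len(recent) == patience + 1 and min(recent[1:]) >= recent[0] - min_delta:
        return lr / 2.0
    return lr

lr = 0.01
improving = halve_on_plateau(lr, [1.0, 0.9, 0.8])          # keeps lr
stuck = halve_on_plateau(lr, [0.8, 0.81, 0.80, 0.805])     # halves lr
```

Repeatedly applying this rule yields the geometric decay under which the loss eventually converges.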
When training with the large margin softmax loss, an annealing strategy is used as described in Section 3.4. Specifically, we set a fast decay of $\lambda$ for AMSoftmax. For ASoftmax and ArcSoftmax, the decay is slowed down. The minimum value $\lambda_{\min}$ is 0 for ArcSoftmax and AMSoftmax, while a non-zero $\lambda_{\min}$ is used for ASoftmax, which results in a more gentle effective angle function.
After training, the output of the second last layer in the segment-level network is extracted as the speaker embedding. LDA is used to reduce the dimension to 200 and PLDA is then applied to generate the verification scores. One may also use the embedding extracted from the last layer with a simple cosine backend. However, in our experiments, we find that PLDA scoring generally performs better.
5 Results

Table 1 summarizes the results of the different systems. The first row of Table 1 shows the Kaldi recipe for VoxCeleb. Our first experiment validates the performance of our baseline system. We find that a large weight decay parameter works well in our systems; increasing this parameter improves the EER from 3% to 2.34%. The second row shows the performance of our baseline system, which significantly outperforms the standard Kaldi result. The third row is the result of the modified softmax loss. Without any margins, the modified softmax does not perform better by simply normalizing the weights.
The performance of the large margin softmax loss is shown in the following sections of Table 1. We remove the last ReLU for all these networks, which generally improves the results. For instance, ASoftmax ($m_1 = 4$) with the ReLU achieves 2.12%, 0.0122 and 0.3214 in EER, minDCF08 and minDCF10, while without the ReLU it achieves 2.15%, 0.0113 and 0.3108. The same trend is observed in the other systems.
From Table 1, it is clear that ASoftmax achieves its best result when $m_1 = 4$. The performance of ArcSoftmax is similar to ASoftmax, and its best margin is also small. AMSoftmax performs the best among these large margin softmax losses at its optimal margin. We notice that the best margins for these systems are relatively small compared with those reported in face verification [15, 16, 18].
We now investigate the influence of the Ring loss and the MHE loss. The weight of the Ring loss is set to 0.01 and the target norm $R$ is initialized at 20. Table 1 shows that the Ring loss improves minDCF08 and minDCF10. The norm distributions of different systems are presented in Fig. 2. Since the weights of the softmax network are not normalized, the mean of the feature norm is very large (about 150). To show the feature norm without margins, we use the modified softmax instead. From Fig. 2, we find that using the margin helps to reduce the norm variance: the margin prevents the norms of easy samples from growing too large. The norm distribution shrinks further when the Ring loss is applied. However, even when the network is trained using AMSoftmax without feature normalization, the norm variance is already relatively small. Therefore, the effect of the Ring loss with AMSoftmax is less significant in our experiments.
The performance of the MHE loss is presented in the last row of Table 1, with the MHE weight set to 0.01. AMSoftmax with the MHE loss achieves the best result among all the systems, improving the baseline by 15% in EER, 13% in minDCF08 and 33% in minDCF10. To gain some insight into the MHE loss, we illustrate the distribution of the pairwise squared distances between the normalized weights. The squared distance, which equals $2 - 2\cos(\tilde{\mathbf{w}}_i, \tilde{\mathbf{w}}_j)$, indicates the separability between speakers on the training set. Fig. 3 shows that all the distributions have means at about 2.0, indicating that the cosine similarity between speaker weights is 0 on average. AMSoftmax with the MHE loss achieves the smallest variance of the inter-speaker distances, which means the speaker weights distribute more evenly on the hypersphere, leading to a better overall separability in the feature space.
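The identity behind this reading of Fig. 3 can be checked directly: for unit-norm vectors, the squared Euclidean distance and the cosine similarity determine each other, so a mean pairwise squared distance of about 2.0 implies a near-zero average cosine.

```python
import numpy as np

# For unit-norm weights, ||w_i - w_j||^2 = 2 - 2*cos(w_i, w_j).
w_i = np.array([1.0, 0.0])
w_j = np.array([0.0, 1.0])          # orthogonal pair: cosine similarity 0
d2 = np.sum((w_i - w_j) ** 2)       # expands to 2 - 2*dot for unit vectors
```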
Table 1 (excerpt):

| System | EER (%) | minDCF08 | minDCF10 |
|---|---|---|---|
| AMSoftmax + Ring loss | 2.07 | 0.0107 | 0.2687 |
| AMSoftmax + MHE | 2.00 | 0.0106 | 0.2487 |
6 Conclusions

In this paper, we investigate the large margin softmax loss for speaker verification. By selecting an appropriate margin, the large margin softmax loss achieves promising results. Ring loss and MHE loss are involved to further improve the performance: Ring loss is a soft version of feature normalization that alleviates the impact of the feature norm, and the MHE criterion is another loss function that enlarges the overall inter-speaker separability. On VoxCeleb, our baseline system achieves better results than the Kaldi recipe. We find that AMSoftmax is easier to train and generally performs better than ASoftmax and ArcSoftmax in our experiments. The best system is obtained when AMSoftmax is used with the MHE loss; this combination substantially outperforms the baseline.
In the future, we will combine both the Ring loss and the MHE loss with the large margin softmax loss. More efforts will be made to enable simple cosine scoring and remove the need for the PLDA backend.
This work was supported by the National Natural Science Foundation of China under Grant No. 61403224 and No. U1836219.
References

-  D. Snyder, D. Garcia-Romero, G. Sell, D. Povey, and S. Khudanpur, “X-vectors: Robust DNN embeddings for speaker recognition,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018, pp. 5329–5333.
-  D. Snyder, D. Garcia-Romero, G. Sell, A. McCree, D. Povey, and S. Khudanpur, “Speaker recognition for multi-speaker conversations using x-vectors,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2019.
-  C. Li, X. Ma, B. Jiang, X. Li, X. Zhang, X. Liu, Y. Cao, A. Kannan, and Z. Zhu, “Deep speaker: an end-to-end neural speaker embedding system,” arXiv preprint arXiv:1705.02304, 2017.
-  C. Zhang and K. Koishida, “End-to-end text-independent speaker verification with triplet loss on short utterances,” in Proc. Interspeech, 2017, pp. 1487–1491.
-  A. Hermans, L. Beyer, and B. Leibe, “In defense of the triplet loss for person re-identification,” arXiv preprint arXiv:1703.07737, 2017.
-  C.-Y. Wu, R. Manmatha, A. J. Smola, and P. Krahenbuhl, “Sampling matters in deep embedding learning,” in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 2840–2848.
-  R. Ji, X. Cai, and B. Xu, “An end-to-end text-independent speaker identification system on short utterances,” in Proc. Interspeech, 2018.
-  W. Wan, Y. Zhong, T. Li, and J. Chen, “Rethinking feature distribution for loss functions in image classification,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 9117–9126.
-  N. Li, D. Tuo, D. Su, Z. Li, and D. Yu, “Deep discriminative embeddings for duration robust speaker verification,” in Proc. Interspeech, 2018, pp. 2262–2266.
-  S. Yadav and A. Rai, “Learning discriminative features for speaker identification and verification,” in Proc. Interspeech, 2018, pp. 2237–2241.
-  R. Ranjan, C. D. Castillo, and R. Chellappa, “L2-constrained softmax loss for discriminative face verification,” arXiv preprint arXiv:1703.09507, 2017.
-  F. Wang, X. Xiang, J. Cheng, and A. L. Yuille, “Normface: l2 hypersphere embedding for face verification,” in Proceedings of the 25th ACM international conference on Multimedia. ACM, 2017, pp. 1041–1049.
-  L. Wan, Q. Wang, A. Papir, and I. L. Moreno, “Generalized end-to-end loss for speaker verification,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018, pp. 4879–4883.
-  W. Liu, Y. Wen, Z. Yu, and M. Yang, “Large-margin softmax loss for convolutional neural networks,” in Proc. ICML, 2016, pp. 507–516.
-  W. Liu, Y. Wen, Z. Yu, M. Li, B. Raj, and L. Song, “Sphereface: Deep hypersphere embedding for face recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 212–220.
-  F. Wang, J. Cheng, W. Liu, and H. Liu, “Additive margin softmax for face verification,” IEEE Signal Processing Letters, vol. 25, no. 7, pp. 926–930, 2018.
-  H. Wang, Y. Wang, Z. Zhou, X. Ji, D. Gong, J. Zhou, Z. Li, and W. Liu, “Cosface: Large margin cosine loss for deep face recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 5265–5274.
-  J. Deng, J. Guo, and S. Zafeiriou, “Arcface: Additive angular margin loss for deep face recognition,” arXiv preprint arXiv:1801.07698, 2018.
-  Z. Huang, S. Wang, and K. Yu, “Angular softmax for short-duration text-independent speaker verification,” in Proc. Interspeech, 2018, pp. 3623–3627.
-  Y. Li, F. Gao, Z. Ou, and J. Sun, “Angular softmax loss for end-to-end speaker verification,” arXiv preprint arXiv:1806.03464, 2018.
-  G. Bhattacharya, J. Alam, and P. Kenny, “Adapting end-to-end neural speaker verification to new languages and recording conditions with adversarial training,” arXiv preprint arXiv:1811.03055, 2018.
-  M. Hajibabaei and D. Dai, “Unified hypersphere embedding for speaker recognition,” arXiv preprint arXiv:1807.08312, 2018.
-  W. Xie, A. Nagrani, J. S. Chung, and A. Zisserman, “Utterance-level aggregation for speaker recognition in the wild,” arXiv preprint arXiv:1902.10107, 2019.
-  H. Zeinali, L. Burget, J. Rohdin, T. Stafylakis, and J. Cernocky, “How to improve your speaker embeddings extractor in generic toolkits,” arXiv preprint arXiv:1811.02066, 2018.
-  Y. Zheng, D. K. Pal, and M. Savvides, “Ring loss: Convex feature normalization for face recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 5089–5097.
-  W. Liu, R. Lin, Z. Liu, L. Liu, Z. Yu, B. Dai, and L. Song, “Learning towards minimum hyperspherical energy,” in Advances in Neural Information Processing Systems, 2018, pp. 6225–6236.
-  A. Nagrani, J. S. Chung, and A. Zisserman, “Voxceleb: a large-scale speaker identification dataset,” in Proc. Interspeech, 2017, pp. 2616–2620.
-  J. S. Chung, A. Nagrani, and A. Zisserman, “Voxceleb2: Deep speaker recognition,” in Proc. Interspeech, 2018, pp. 1086–1090.
-  D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Hannemann, P. Motlicek, Y. Qian, P. Schwarz, and Others, “The kaldi speech recognition toolkit,” in Proc. IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), 2011.
-  A. F. Martin and C. S. Greenberg, “NIST 2008 speaker recognition evaluation: Performance across telephone and room microphone channels,” in Proc. Interspeech, 2009, pp. 2579–2582.
-  ——, “The NIST 2010 speaker recognition evaluation,” in Proc. Interspeech, 2010, pp. 2726–2729.
-  D. Snyder, G. Chen, and D. Povey, “Musan: A music, speech, and noise corpus,” arXiv preprint arXiv:1510.08484, 2015.
-  T. Ko, V. Peddinti, D. Povey, M. L. Seltzer, and S. Khudanpur, “A study on data augmentation of reverberant speech for robust speech recognition,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2017, pp. 5220–5224.