1 Introduction
Recently, speaker embeddings extracted from deep neural networks have outperformed the conventional i-vector in many speaker verification tasks [1]. By virtue of this excellent performance, speaker embedding is becoming the next generation of speaker recognition technology. Similar to the i-vector, a speaker embedding encodes a variable-length utterance into a fixed-length vector representing the speaker characteristics. A variety of backend classifiers can be applied to suppress noise and session variability. Speaker embeddings can also be used in other applications such as speaker diarization [2], speaker retrieval and speech synthesis.
Speaker verification is an open-set recognition problem. An utterance is verified to be from a certain speaker if their similarity exceeds a threshold. An ideal speaker embedding should be discriminative between different speakers and compact within the same speaker. Although cross-entropy with softmax is arguably the most commonly used loss function for training speaker embedding networks, it is designed for classification and does not explicitly encourage discriminative learning of features.
To address this issue, different loss functions have been proposed. The triplet loss for speaker verification was first presented in [3, 4]. By selecting appropriate training samples, the triplet loss performed well in both text-dependent and text-independent tasks. However, its performance is sensitive to the triplet mining strategy [5, 6], and designing such a training procedure is time-consuming. The speaker identity subspace loss [7], the Gaussian mixture loss [8], etc., were proposed in other works.
On the other hand, efforts have been made to improve the original softmax loss. The center loss was introduced in [9, 10] to constrain features to gather around their corresponding centers and thus reduce the intra-speaker variability. Both the triplet loss and the center loss are optimized in the Euclidean space. In the last few years, angular-based losses have become popular. Compared with the Euclidean distance, the angular distance is a more natural choice in the feature space. In [11, 12], the features and the weights of the output layer were normalized before softmax, making the loss function focus on the cosine similarity. The generalized end-to-end loss was proposed in [13], where the scaled cosine scores between the features and the estimated speaker centers were used as the logits to compute the loss. The angular margin softmax loss was first presented in [14, 15], in which the margin is incorporated with the angle in a multiplicative way. This method was extended in [16, 17, 18], where additive margins are used. Some of these losses have been applied to speaker verification [19, 20, 21, 22, 23]. Since all these losses combine softmax with margins, we call them the large margin softmax loss in this paper.
In this paper, we first build a baseline system using a generic toolkit, similar to [24]. Several training strategies are used to improve the accuracy. We then compare the performance of the large margin softmax loss under different configurations. The Ring loss [25] and the minimum hyperspherical energy (MHE) criterion [26] are introduced to enhance the discriminative learning and enlarge the inter-speaker separability. Experiments on VoxCeleb show that our baseline system outperforms the Kaldi x-vector recipe, reducing the EER, minDCF08 and minDCF10 from 3.10%, 0.0169 and 0.4977 to 2.34%, 0.0122 and 0.3754, respectively. Using the large margin softmax loss with auxiliary objective functions, the best system further improves these metrics to 2.00%, 0.0106 and 0.2487.
The organization of this paper is as follows. The speaker embedding we use is briefly introduced in Section 2. Section 3 describes the large margin softmax loss and different techniques to enhance it. Our experimental setup and results are given in Sections 4 and 5. The last section concludes the paper.
2 Speaker embedding
The deep neural network used to extract the speaker embedding consists of frame-level and segment-level sub-networks, connected by a temporal pooling layer. The frame-level network can be seen as a speaker feature extractor that transforms the acoustic features into speaker-related vectors. These vectors are aggregated across the entire utterance by a pooling layer and further processed by several fully-connected layers. Different loss functions can be used to optimize the network. After training, the output of a hidden layer in the segment-level network is extracted as the speaker embedding. Cosine scoring, LDA and PLDA are usually applied to generate the verification scores. Since the output layer is removed during the test phase, the test speakers do not have to be present in the training data.
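The temporal pooling step described above can be sketched as follows. This is a minimal numpy illustration of mean-plus-standard-deviation statistics pooling as used in the x-vector network; the frame count and dimensionality below are arbitrary examples, not the paper's configuration.

```python
import numpy as np

def statistics_pooling(frames):
    """Aggregate T frame-level vectors (shape T x D) into one segment-level
    vector by concatenating their mean and standard deviation (2D dims)."""
    return np.concatenate([frames.mean(axis=0), frames.std(axis=0)])

# toy example: 200 frames of 512-dimensional frame-level outputs
frames = np.random.randn(200, 512)
pooled = statistics_pooling(frames)   # shape (1024,)
```

The pooled vector is then passed through the fully-connected segment-level layers, one of whose hidden outputs serves as the embedding.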
In this paper, speaker embeddings are extracted from the x-vector architecture [1]. The x-vector is popular in many applications and has been provided as the official baseline system in recent NIST speaker recognition evaluations (SRE). The details are described in Section 4.2.
3 Large margin softmax loss
3.1 Definition
The widely used softmax loss is

L_S = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{e^{\mathbf{w}_{y_i}^{T}\mathbf{x}_i}}{\sum_{j=1}^{C}e^{\mathbf{w}_{j}^{T}\mathbf{x}_i}}    (1)

where N is the number of training samples, C is the number of speakers in the training set, and \mathbf{w}_j is the j-th column of the weight matrix of the output layer. \mathbf{x}_i is the input of the last (i.e., output) layer and y_i is the ground-truth label of the i-th sample. To avoid ambiguity, we use the term feature for \mathbf{x}_i in this paper, while embedding refers to the speaker embedding extracted from a hidden layer of the network. The logit \mathbf{w}_j^{T}\mathbf{x}_i can be transformed to \lVert\mathbf{w}_j\rVert\lVert\mathbf{x}_i\rVert\cos\theta_{j,i}, where \theta_{j,i} is the angle between \mathbf{w}_j and \mathbf{x}_i. Eq. 1 is thus influenced by the norms of the weights, which is undesirable since we care more about the angle \theta_{j,i}.
In the modified softmax [15], the weights are normalized as \lVert\mathbf{w}_j\rVert = 1 and the loss becomes

L_{MS} = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{e^{s\cos\theta_{y_i,i}}}{\sum_{j=1}^{C}e^{s\cos\theta_{j,i}}}    (2)

where s is a scaling factor. This factor can be the feature norm \lVert\mathbf{x}_i\rVert, or a fixed value if the feature is also normalized. We discuss feature normalization later.
Based on Eq. 2, different margins can be introduced by reformulating the target logit. We define an angle function [18]

\psi(\theta) = \cos(m_1\theta + m_2) - m_3    (3)

where m_1, m_2 and m_3 are margins. Eq. 2 is then rewritten as the large margin softmax loss

L_{LM} = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{e^{s\psi(\theta_{y_i,i})}}{e^{s\psi(\theta_{y_i,i})} + \sum_{j\neq y_i}e^{s\cos\theta_{j,i}}}    (4)

Strictly, Eq. 3 is only valid when m_1 = 1 and \theta + m_2 \in [0, \pi], since \psi(\theta) should be a monotonically decreasing function of \theta. In practice, however, the angle \theta_{y_i,i} usually lies well within this range [18], so we can safely apply Eqs. 3 and 4 to optimize the network when m_1 = 1 and m_2 is small.
When m_1 > 1, a new \psi(\theta) is required. Let k be an integer with k \in [0, m_1 - 1]. For \theta \in [k\pi/m_1, (k+1)\pi/m_1], we use the angle function defined in [15]

\psi(\theta) = (-1)^{k}\cos(m_1\theta) - 2k    (5)

which remains monotonically decreasing over [0, \pi]. The curves of the angle functions with different margins are illustrated in Fig. 1.
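The angle functions of Eqs. 3 and 5 can be sketched as below. This is an illustrative numpy implementation, not the authors' code, and the margin values used at the bottom are examples only.

```python
import numpy as np

def psi(theta, m1=1.0, m2=0.0, m3=0.0):
    """Angle function for the large margin softmax target logit.
    m1 = 1: psi(theta) = cos(theta + m2) - m3 (Eq. 3, Arc-/AM-Softmax).
    Integer m1 > 1: the piecewise extension (-1)**k * cos(m1*theta) - 2k
    with k = floor(m1*theta/pi), which keeps psi monotonically
    decreasing over [0, pi] (Eq. 5, A-Softmax)."""
    theta = np.asarray(theta, dtype=float)
    if m1 == 1.0:
        return np.cos(theta + m2) - m3
    k = np.floor(m1 * theta / np.pi)
    return (-1.0) ** k * np.cos(m1 * theta) - 2.0 * k

theta = np.linspace(0.0, np.pi, 5)
am = psi(theta, m3=0.35)    # AM-Softmax with an example margin m3 = 0.35
arc = psi(theta, m2=0.25)   # Arc-Softmax with m2 = 0.25 (theta + m2 passes pi at the upper end)
a4 = psi(theta, m1=4)       # A-Softmax with m1 = 4
```

Plotting these curves reproduces the qualitative picture of Fig. 1: every margin pushes the target logit below the plain cos(theta) curve, so the true class must be separated by a larger angular gap to score the same logit.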
The margins m_1, m_2 and m_3 can be used separately [15, 17, 16, 18], or further combined as in [18]. In this paper, we use only a single margin at a time, since the performance gain from combining margins is relatively small while considerable effort is needed to tune the hyperparameters. When m_1, m_2 and m_3 are used individually, the losses are denoted as angular softmax (A-Softmax), additive angular margin softmax (Arc-Softmax) and additive margin softmax (AM-Softmax), respectively.
3.2 Feature normalization
As discussed in [11], the norm of the feature is related to the sample quality when the softmax loss is used. The network can minimize the loss simply by increasing the norms of the features of easy samples while ignoring the hard ones. As a result, the network avoids learning to process poor-quality samples well.
To solve this issue, feature normalization has been proposed in many works. After normalization, the feature norm is eliminated from the loss and a fixed-value scaling factor is used instead, so the loss depends only on the angle function. Features with small norms receive much larger gradients than those with large norms, making the network pay more attention to the low-quality samples [16].
Rather than learning to map the samples onto a fixed-norm hypersphere, feature normalization uses an additional normalization layer to do this job. In this paper, unlike feature normalization, we introduce the Ring loss [25] to apply the norm constraint directly on the features. The definition of the Ring loss is straightforward: we want the feature norm to be close to a target value R. An auxiliary loss is employed as

L_R = \frac{\lambda_R}{2N}\sum_{i=1}^{N}\left(\lVert\mathbf{x}_i\rVert - R\right)^2    (6)

where \lambda_R is the loss weight relative to the primary large margin softmax loss. The Ring loss can be considered a soft version of feature normalization, and the target norm R can be learned during network training.
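Eq. 6 can be sketched directly in numpy. This is a minimal illustration; in the actual system R is a trainable scalar updated by backpropagation, and the toy features and weight value here are assumptions.

```python
import numpy as np

def ring_loss(features, R, lam=0.01):
    """Ring loss (Eq. 6): penalize the squared deviation of each feature
    norm from the target norm R, averaged over the batch and scaled by
    the loss weight lam (R is trainable in practice)."""
    norms = np.linalg.norm(features, axis=1)
    return lam / (2.0 * len(features)) * np.sum((norms - R) ** 2)

# toy features with norms 5 and 10; only the second deviates from R = 5
x = np.array([[3.0, 4.0], [6.0, 8.0]])
loss = ring_loss(x, R=5.0)
```

Because the penalty is quadratic rather than a hard constraint, features are pulled toward the ring of radius R instead of being projected onto it, which is why the text calls it a soft version of feature normalization.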
3.3 Enlarging inter-speaker feature separability
Although the large margin softmax loss improves the intra-class compactness, it does not explicitly promote the inter-class separability. In [26], the authors proposed a minimum hyperspherical energy (MHE) criterion that encourages the weights of the output layer to distribute evenly on the hypersphere. The MHE criterion is expressed as

L_{MHE} = \frac{\lambda_M}{N}\sum_{i=1}^{N}\sum_{j\neq y_i} f\left(\lVert\hat{\mathbf{w}}_{y_i} - \hat{\mathbf{w}}_j\rVert\right)    (7)

where \lambda_M is a weighting hyperparameter, \hat{\mathbf{w}}_{y_i} and \hat{\mathbf{w}}_j are the normalized weights in the loss function, and f(\cdot) is a decreasing function. Intuitively, the MHE loss enlarges the overall inter-class feature separability. As with the Ring loss, we include MHE as an auxiliary objective function.
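A sketch of the MHE energy over the output-layer weights follows. The paper only requires f to be decreasing; f(z) = 1/z² is one common choice from [26] used here for illustration, and the batch-free normalization over weight pairs is a simplification of Eq. 7.

```python
import numpy as np

def mhe_loss(W, lam=0.01):
    """Minimum hyperspherical energy sketch: normalize each class weight,
    then penalize a decreasing function of all pairwise distances
    (here f(z) = 1/z**2) so the weights spread out on the hypersphere."""
    Wn = W / np.linalg.norm(W, axis=1, keepdims=True)
    C = len(Wn)
    energy = 0.0
    for i in range(C):
        for j in range(C):
            if i != j:
                energy += 1.0 / np.sum((Wn[i] - Wn[j]) ** 2)
    return lam * energy / (C * (C - 1))

# four class weights spread evenly on the unit circle -> low energy
W_even = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]])
loss_even = mhe_loss(W_even)
```

Clustering any two weights together raises the energy sharply, which is exactly the pressure that pushes speaker weights apart in Fig. 3.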
3.4 Annealing strategy during training
From a classification perspective, the large margin softmax makes the decision boundary more stringent in order to correctly classify \mathbf{x}_i: the angle between \mathbf{x}_i and \mathbf{w}_{y_i} is required to be much smaller than the angles to the other weights. From an optimization perspective, the margin ensures that even well-separated features continue to receive large gradients, which shrinks the intra-class variance.
However, the margin also increases the training difficulty, especially when the network is randomly initialized. To stabilize the training procedure, an annealing strategy is applied. The target logit is replaced by a weighted average of the original logit and its large margin counterpart:
f_{y_i} = \frac{\lambda\cos\theta_{y_i,i} + \psi(\theta_{y_i,i})}{1+\lambda}, \quad \lambda = \max\left(\lambda_{\min}, \lambda_{\mathrm{base}}(1+\gamma t)^{-\alpha}\right)    (8)

where t is the training step, \lambda_{\min} is the minimum value \lambda can reach, and \lambda_{\mathrm{base}}, \gamma and \alpha are hyperparameters controlling the annealing speed.
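The annealing of Eq. 8 can be sketched in a few lines. The hyperparameter values below are illustrative placeholders, not the values used in the experiments.

```python
def annealed_logit(cos_theta, psi_theta, t,
                   lambda_base=1000.0, gamma=0.001, alpha=5.0, lambda_min=0.0):
    """Annealed target logit (Eq. 8): a weighted average of the plain
    cosine logit and its large-margin counterpart.  lambda decays with
    the training step t, so the margin is phased in gradually."""
    lam = max(lambda_min, lambda_base * (1.0 + gamma * t) ** (-alpha))
    return (lam * cos_theta + psi_theta) / (1.0 + lam)

# early in training the target logit is close to the plain cos(theta) ...
early = annealed_logit(0.9, 0.6, t=0)
# ... late in training it approaches the large-margin logit psi(theta)
late = annealed_logit(0.9, 0.6, t=1_000_000)
```

With a large initial lambda the network effectively trains with plain softmax, and the margin takes over only once the weights are no longer random.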
3.5 Other discussions
In [13], the generalized end-to-end (GE2E) loss was proposed to train the speaker network. We rewrite Eq. 6 of [13] as

L_{G} = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{e^{s\cos(\mathbf{x}_i,\mathbf{c}_{y_i})+b}}{\sum_{k=1}^{M}e^{s\cos(\mathbf{x}_i,\mathbf{c}_k)+b}}    (9)

where M is the number of speakers in a mini-batch and \mathbf{c}_k is the center of speaker k estimated from the batch. The softmax is computed across the batch rather than the entire dataset, which is convenient when the training set is extremely large. If the bias b is omitted and the estimated center \mathbf{c}_k is replaced with a learnable weight \mathbf{w}_k, the GE2E loss becomes the modified softmax in Eq. 2. Hence, combined with the GE2E formulation, the large margin softmax loss can potentially be applied to datasets comprising millions of speakers.
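The batch-level softmax of Eq. 9 can be sketched as follows. This is a simplified illustration: GE2E proper excludes each sample from its own speaker center, which is omitted here, and the scale and toy batch are assumptions.

```python
import numpy as np

def ge2e_style_loss(x, labels, s=30.0, b=0.0):
    """Batch-level softmax over per-speaker centers (Eq. 9 as rewritten
    here): each center is the mean of that speaker's features in the
    mini-batch, and the logits are scaled cosine similarities to them."""
    speakers = np.unique(labels)
    centers = np.stack([x[labels == spk].mean(axis=0) for spk in speakers])
    xn = x / np.linalg.norm(x, axis=1, keepdims=True)
    cn = centers / np.linalg.norm(centers, axis=1, keepdims=True)
    logits = s * xn @ cn.T + b                  # scaled cosine scores
    target = np.searchsorted(speakers, labels)  # column of the true speaker
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_prob[np.arange(len(x)), target].mean()

# toy mini-batch: two speakers, two well-separated utterances each
x = np.array([[1.0, 0.0], [1.0, 0.1], [0.0, 1.0], [0.1, 1.0]])
labels = np.array([0, 0, 1, 1])
loss = ge2e_style_loss(x, labels)
```

Replacing the batch-estimated centers with learnable weight columns turns this into the modified softmax of Eq. 2, which is the equivalence the text describes.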
4 Experimental setup
4.1 Dataset
To investigate the performance of the large margin softmax loss, we have run experiments on the VoxCeleb dataset [27, 28]. The training set includes VoxCeleb1 dev part and VoxCeleb2. The VoxCeleb1 test part is used as the evaluation set. This setup is selected to be consistent with the Kaldi recipe [29].
4.2 Training details
The acoustic features in our experiments are 30-dimensional MFCCs with cepstral mean normalization. An energy-based voice activity detection (VAD) is applied. The training data is augmented using MUSAN [32] and RIR [33].
We use the same network architecture as Kaldi [1] to extract x-vectors, with the following modifications.

- For the frame-level network, a 5-layer TDNN is used, with a fixed kernel size per layer. Unlike Kaldi, no dilation is used. This performs better in our experiments and is also suggested in other works [24]. A statistics pooling layer and a 2-layer segment-level network are appended after the frame-level network.

- Each hidden layer consists of an affine component followed by batch normalization (BN) and a ReLU nonlinearity. This order of BN and ReLU does not necessarily lead to better performance, but training is more stable than with the opposite order.

- The last ReLU in the segment-level network is removed. This nonlinearity limits the feasible angles between the feature and the weights, which is undesirable when an angle-based large margin softmax loss is applied [15].
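Removing the dilation, as in the first modification above, changes the temporal receptive field of the stacked TDNN layers. A minimal sketch of that computation follows; the kernel sizes and dilations are assumptions for illustration, taken from the standard Kaldi x-vector configuration rather than our exact setup.

```python
def total_context(kernel_sizes, dilations=None):
    """Total temporal context (receptive field, in frames) of a stack of
    1-D convolutional (TDNN) layers: 1 + sum over layers of (k - 1) * d."""
    if dilations is None:
        dilations = [1] * len(kernel_sizes)  # no dilation, as in our setup
    context = 1
    for k, d in zip(kernel_sizes, dilations):
        context += (k - 1) * d
    return context

# kernel sizes assumed for illustration; the standard Kaldi x-vector
# frame-level network uses kernels (5, 3, 3, 1, 1) with dilations (1, 2, 3, 1, 1)
no_dilation = total_context([5, 3, 3, 1, 1])                   # 9 frames
kaldi_like = total_context([5, 3, 3, 1, 1], [1, 2, 3, 1, 1])   # 15 frames
```

Dropping the dilations narrows the context each frame-level output sees, leaving more of the temporal aggregation to the statistics pooling layer.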
At every training step, we sample 64 speakers. For each speaker, a segment of 200 to 400 frames is sliced from its utterances. Softmax with cross-entropy is used to train the baseline system.
L2 regularization is applied to all layers of the network to prevent overfitting. We select stochastic gradient descent (SGD) as the optimizer, with the initial learning rate set to 0.01. A 1000-utterance validation set is randomly selected from the training set, and the learning rate is halved whenever the validation loss stagnates. The loss converges after the learning rate drops below a small threshold, resulting in around 2.5M training steps. No dropout is applied in our networks, following [24].
When training with the large margin softmax loss, the annealing strategy described in Section 3.4 is used. Specifically, we use a fast decay of \lambda for AM-Softmax and a slower decay for A-Softmax and Arc-Softmax. The \lambda_{\min} for Arc-Softmax and AM-Softmax is 0, while a non-zero \lambda_{\min} is used for A-Softmax. This non-zero \lambda_{\min} results in a more gentle effective angle function, so the target logit behaves similarly to that of A-Softmax with a smaller margin and no annealing.
After training, the output of the second-to-last layer of the segment-level network is extracted as the speaker embedding. LDA is used to reduce the dimension to 200, and PLDA is then applied to generate the verification scores. One may also use the embedding extracted from the last layer with a simple cosine backend; however, in our experiments, we find that PLDA scoring generally performs better.
Our systems are implemented with the Kaldi and TensorFlow toolkits. The code and models have been released at https://github.com/mycrazycracy/tf-kaldi-speaker.
5 Results
Table 1 summarizes the results of the different systems. The first row of Table 1 shows the Kaldi recipe for VoxCeleb. The first experiment validates the performance of our baseline system. We find that a large weight decay parameter works well in our systems; increasing this parameter improves the EER from 3% to 2.34%. The second row shows the performance of our baseline system, which significantly outperforms the standard Kaldi result. The third row gives the result of the modified softmax loss. Without any margins, the modified softmax does not perform better by simply normalizing the weights.
The performance of the large margin softmax loss is shown in the following sections of Table 1. The last ReLU is removed in all these networks, which generally improves the results. For instance, A-Softmax (m_1 = 4) with the ReLU achieves 2.12%, 0.0122 and 0.3214 in EER, minDCF08 and minDCF10, while without the ReLU it achieves 2.15%, 0.0113 and 0.3108. The same trend is observed in the other systems.
From Table 1, it is clear that A-Softmax achieves its best result at an appropriate m_1. The performance of Arc-Softmax is similar to that of A-Softmax. AM-Softmax performs best among all these large margin softmax losses at its optimal margin. We notice that the best margins for these systems are relatively small compared with those reported in face verification [15, 16, 18].
We now investigate the influence of the Ring loss and the MHE loss. The weight \lambda_R for the Ring loss is set to 0.01 and R is initialized to 20. Table 1 shows that the Ring loss improves the minDCF08 and minDCF10. The norm distributions of different systems are presented in Fig. 2. Since the weights of the softmax network are not normalized, the mean of the feature norm is very large (about 150); to show the feature norms without margins, we therefore use the modified softmax instead. From Fig. 2, we find that using the margin helps to reduce the norm variance: the margin prevents the norms of the easy samples from growing too large. The norm distribution shrinks further when the Ring loss is applied. However, even when the network is trained with AM-Softmax alone, without feature normalization, the norm variance is already relatively small. Therefore, the benefit of the Ring loss on top of AM-Softmax is less significant in our experiments.
The performance of the MHE loss is presented in the last row of Table 1, with the weight \lambda_M set to 0.01. AM-Softmax with the MHE loss achieves the best result among all the systems, improving the baseline by 15% in EER, 13% in minDCF08 and 33% in minDCF10. To gain some insight into the MHE loss, we illustrate the distribution of the pairwise squared distances between the normalized weights. This distance, \lVert\hat{\mathbf{w}}_i - \hat{\mathbf{w}}_j\rVert^2 = 2 - 2\hat{\mathbf{w}}_i^{T}\hat{\mathbf{w}}_j, indicates the separability between speakers on the training set. Fig. 3 shows that all the distributions have their means at about 2.0, indicating that \hat{\mathbf{w}}_i^{T}\hat{\mathbf{w}}_j is 0 on average. AM-Softmax with the MHE loss achieves the smallest variance of the inter-speaker distances, which means the speaker weights distribute more evenly on the hypersphere, leading to a better overall separability in the feature space.
System                    EER(%)  minDCF08  minDCF10
Kaldi recipe^2            3.10    0.0169    0.4977
Softmax                   2.34    0.0122    0.3754
Modified Softmax          2.62    0.0131    0.4146
A-Softmax                 2.18    0.0119    0.3791
A-Softmax                 2.15    0.0113    0.3108
Arc-Softmax               2.14    0.0119    0.3610
Arc-Softmax               2.03    0.0120    0.4010
Arc-Softmax               2.12    0.0115    0.3138
Arc-Softmax               2.23    0.0123    0.3622
AM-Softmax                2.13    0.0113    0.3707
AM-Softmax                2.04    0.0111    0.2922
AM-Softmax                2.15    0.0119    0.3559
AM-Softmax                2.18    0.0115    0.3152
AM-Softmax + Ring loss    2.07    0.0107    0.2687
AM-Softmax + MHE          2.00    0.0106    0.2487

^2 https://github.com/kaldi-asr/kaldi/blob/master/egs/voxceleb/v2
6 Conclusions
In this paper, we investigated the large margin softmax loss for speaker verification. With an appropriate margin, the large margin softmax loss achieves promising results. The Ring loss and the MHE loss are introduced to further improve the performance: the Ring loss is a soft version of feature normalization that alleviates the impact of the feature norm, and the MHE criterion enlarges the overall inter-speaker separability. On VoxCeleb, our baseline system achieves better results than the Kaldi recipe. We find that AM-Softmax is easier to train and generally performs better than A-Softmax and Arc-Softmax in our experiments. The best system is obtained when AM-Softmax is combined with the MHE loss, which substantially outperforms the baseline.
In the future, we will combine both the Ring loss and the MHE loss with the large margin softmax loss. More efforts will be made to enable simple cosine scoring and remove the need for the PLDA backend.
7 Acknowledgements
This work was supported by the National Natural Science Foundation of China under Grant No. 61403224 and No. U1836219.
References
 [1] D. Snyder, D. Garcia-Romero, G. Sell, D. Povey, and S. Khudanpur, “X-vectors: Robust DNN embeddings for speaker recognition,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018, pp. 5329–5333.
 [2] D. Snyder, D. Garcia-Romero, G. Sell, A. McCree, D. Povey, and S. Khudanpur, “Speaker recognition for multi-speaker conversations using x-vectors,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2019.
 [3] C. Li, X. Ma, B. Jiang, X. Li, X. Zhang, X. Liu, Y. Cao, A. Kannan, and Z. Zhu, “Deep speaker: an end-to-end neural speaker embedding system,” arXiv preprint arXiv:1705.02304, 2017.
 [4] C. Zhang and K. Koishida, “End-to-end text-independent speaker verification with triplet loss on short utterances,” in Proc. Interspeech, 2017, pp. 1487–1491.
 [5] A. Hermans, L. Beyer, and B. Leibe, “In defense of the triplet loss for person re-identification,” arXiv preprint arXiv:1703.07737, 2017.

 [6] C.-Y. Wu, R. Manmatha, A. J. Smola, and P. Krahenbuhl, “Sampling matters in deep embedding learning,” in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 2840–2848.
 [7] R. Ji, X. Cai, and B. Xu, “An end-to-end text-independent speaker identification system on short utterances,” in Proc. Interspeech, 2018.

 [8] W. Wan, Y. Zhong, T. Li, and J. Chen, “Rethinking feature distribution for loss functions in image classification,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 9117–9126.
 [9] N. Li, D. Tuo, D. Su, Z. Li, and D. Yu, “Deep discriminative embeddings for duration robust speaker verification,” in Proc. Interspeech, 2018, pp. 2262–2266.
 [10] S. Yadav and A. Rai, “Learning discriminative features for speaker identification and verification,” in Proc. Interspeech, 2018, pp. 2237–2241.
 [11] R. Ranjan, C. D. Castillo, and R. Chellappa, “L2-constrained softmax loss for discriminative face verification,” arXiv preprint arXiv:1703.09507, 2017.
 [12] F. Wang, X. Xiang, J. Cheng, and A. L. Yuille, “Normface: l2 hypersphere embedding for face verification,” in Proceedings of the 25th ACM international conference on Multimedia. ACM, 2017, pp. 1041–1049.
 [13] L. Wan, Q. Wang, A. Papir, and I. L. Moreno, “Generalized end-to-end loss for speaker verification,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018, pp. 4879–4883.

 [14] W. Liu, Y. Wen, Z. Yu, and M. Yang, “Large-margin softmax loss for convolutional neural networks,” in Proc. ICML, 2016, pp. 507–516.
 [15] W. Liu, Y. Wen, Z. Yu, M. Li, B. Raj, and L. Song, “Sphereface: Deep hypersphere embedding for face recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 212–220.
 [16] F. Wang, J. Cheng, W. Liu, and H. Liu, “Additive margin softmax for face verification,” IEEE Signal Processing Letters, vol. 25, no. 7, pp. 926–930, 2018.
 [17] H. Wang, Y. Wang, Z. Zhou, X. Ji, D. Gong, J. Zhou, Z. Li, and W. Liu, “Cosface: Large margin cosine loss for deep face recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 5265–5274.
 [18] J. Deng, J. Guo, and S. Zafeiriou, “Arcface: Additive angular margin loss for deep face recognition,” arXiv preprint arXiv:1801.07698, 2018.
 [19] Z. Huang, S. Wang, and K. Yu, “Angular softmax for short-duration text-independent speaker verification,” in Proc. Interspeech, 2018, pp. 3623–3627.
 [20] Y. Li, F. Gao, Z. Ou, and J. Sun, “Angular softmax loss for end-to-end speaker verification,” arXiv preprint arXiv:1806.03464, 2018.
 [21] G. Bhattacharya, J. Alam, and P. Kenny, “Adapting end-to-end neural speaker verification to new languages and recording conditions with adversarial training,” arXiv preprint arXiv:1811.03055, 2018.
 [22] M. Hajibabaei and D. Dai, “Unified hypersphere embedding for speaker recognition,” arXiv preprint arXiv:1807.08312, 2018.
 [23] W. Xie, A. Nagrani, J. S. Chung, and A. Zisserman, “Utterance-level aggregation for speaker recognition in the wild,” arXiv preprint arXiv:1902.10107, 2019.
 [24] H. Zeinali, L. Burget, J. Rohdin, T. Stafylakis, and J. Cernocky, “How to improve your speaker embeddings extractor in generic toolkits,” arXiv preprint arXiv:1811.02066, 2018.
 [25] Y. Zheng, D. K. Pal, and M. Savvides, “Ring loss: Convex feature normalization for face recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 5089–5097.
 [26] W. Liu, R. Lin, Z. Liu, L. Liu, Z. Yu, B. Dai, and L. Song, “Learning towards minimum hyperspherical energy,” in Advances in Neural Information Processing Systems, 2018, pp. 6225–6236.
 [27] A. Nagrani, J. S. Chung, and A. Zisserman, “Voxceleb: a largescale speaker identification dataset,” in Proc. Interspeech, 2017, pp. 2616–2620.
 [28] J. S. Chung, A. Nagrani, and A. Zisserman, “Voxceleb2: Deep speaker recognition,” in Proc. Interspeech, 2018, pp. 1086–1090.
 [29] D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Hannemann, P. Motlicek, Y. Qian, P. Schwarz, and Others, “The kaldi speech recognition toolkit,” in Proc. IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), 2011.
 [30] A. F. Martin and C. S. Greenberg, “NIST 2008 speaker recognition evaluation: Performance across telephone and room microphone channels,” in Proc. Interspeech, 2009, pp. 2579–2582.
 [31] ——, “The NIST 2010 speaker recognition evaluation,” in Proc. Interspeech, 2010, pp. 2726–2729.
 [32] D. Snyder, G. Chen, and D. Povey, “Musan: A music, speech, and noise corpus,” arXiv preprint arXiv:1510.08484, 2015.
 [33] T. Ko, V. Peddinti, D. Povey, M. L. Seltzer, and S. Khudanpur, “A study on data augmentation of reverberant speech for robust speech recognition,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2017, pp. 5220–5224.