1 Introduction
In recent years, several studies have reported superior results using deep neural networks (DNNs) for extracting speaker embeddings compared with conventional state-of-the-art i-vector-based [1] speaker verification systems [2, 3, 4, 5, 6, 7]. Consequently, several recent studies have focused on designing loss functions that make DNNs suitable for speaker verification. Wan et al. proposed a generalized end-to-end (GE2E) loss function based on centroids, the average embeddings of each speaker, to train DNNs with higher generalization performance [5]. Li et al. applied a loss function based on angular softmax, which was originally proposed for face recognition [8], to create an angular margin between speakers in an embedding space [4].

The conventional studies on loss functions mentioned above do not address the following two problems. The first problem is that conventional loss functions consider only a limited number of speakers, determined by the mini-batch composition. When DNNs are repeatedly trained with small mini-batches, the network parameters can become biased toward the speakers included in a single mini-batch. The second problem is the excessive overhead of hard negative mining, which is important in metric-learning-based loss functions [9]. Hard negative mining is known to have a significant impact on the performance of metric learning. However, it is usually performed only at regular intervals because of practical issues, even though the hard negative samples change as the weight parameters are updated after each mini-batch; ideally, hard negative mining should be performed for every mini-batch. Although GE2E partially solves these problems, only a few speakers can be considered by its hard negative mining.

In this paper, we propose loss functions based on speaker bases to handle these problems. Speaker bases are trainable parameters that can represent speakers. Using loss functions based on speaker bases, we expect to train all speakers simultaneously and to perform hard negative mining in every mini-batch.
2 Related works
In this section, we introduce existing loss functions that can be used to train speaker verification systems, including those already applied successfully to speaker verification and those from the face recognition field.
2.1 Softmax-based loss function
The softmax-based loss function is widely used to train DNNs for identification purposes. Generally, when the softmax-based loss function is exploited for speaker verification, the output of the last hidden layer is used as the embedding of each utterance after training the DNN. Based on the output, $\mathbf{x}_i$, of the last hidden layer, the softmax loss function is calculated as

$$L_{softmax} = -\frac{1}{M}\sum_{i=1}^{M}\log\frac{e^{\mathbf{w}_{y_i}^{T}\mathbf{x}_i+b_{y_i}}}{\sum_{j=1}^{N}e^{\mathbf{w}_{j}^{T}\mathbf{x}_i+b_j}}, \quad (1)$$

where $\mathbf{x}_i$ and $y_i$ denote the embedding of the $i$-th utterance and the corresponding speaker label, respectively, $M$ is the number of utterances, $N$ is the number of speakers in the training set, $\mathbf{W} = [\mathbf{w}_1, \dots, \mathbf{w}_N]$ and $\mathbf{b}$ are the weight matrix and the bias vector of the output layer, respectively, and $e^{(\cdot)}$ is the exponential function.
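For concreteness, the following NumPy sketch evaluates equation (1) for a batch of embeddings. The shapes and names (`x`, `y`, `W`, `b`) are our illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def softmax_loss(x, y, W, b):
    """Equation (1): mean cross-entropy over M utterances.
    x: (M, D) embeddings, y: (M,) integer speaker labels,
    W: (D, N) output-layer weights, b: (N,) output-layer biases."""
    logits = x @ W + b                                    # (M, N)
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_prob[np.arange(len(y)), y].mean()
```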
2.2 Center loss function

The center loss function was proposed to reduce within-class variations while training embeddings with the softmax-based loss function [10]. To reduce within-class variations, the loss is calculated as the mean squared error between the embedding of each utterance and the center embedding of the corresponding speaker. This loss function was successfully applied in the field of face recognition, where a large performance improvement was reported. The center loss function, defined in equation (2), is not used by itself; in most cases, it is used in conjunction with the conventional softmax-based loss function.
$$L_{C} = \frac{\lambda}{2}\sum_{i=1}^{M}\left\lVert \mathbf{x}_i - \mathbf{c}_{y_i} \right\rVert_2^2, \quad (2)$$

where $\mathbf{c}_{y_i}$ is the center embedding of the $y_i$-th speaker and $\lambda$ is the weight factor of the center loss function. The center embedding of each speaker in the center loss function is not trained by gradient descent like the other parameters of the DNN. Rather, it is trained by moving the center embedding by a scalar $\alpha$ based on the delta center value calculated using the following formula:

$$\Delta \mathbf{c}_j = \frac{\sum_{i=1}^{M}\delta(y_i = j)\,(\mathbf{c}_j - \mathbf{x}_i)}{1 + \sum_{i=1}^{M}\delta(y_i = j)}, \quad (3)$$

where $\delta(\text{condition}) = 1$ if the condition is satisfied; otherwise, $\delta(\text{condition}) = 0$.
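A minimal sketch of equations (2) and (3) follows, assuming `centers` is an (N, D) array of per-speaker center embeddings that is kept outside the gradient-descent update, as the text describes; all names are illustrative.

```python
import numpy as np

def center_loss(x, y, centers, lam):
    """Equation (2): weighted squared distance to each speaker's center."""
    diff = x - centers[y]                  # (M, D)
    return 0.5 * lam * (diff ** 2).sum()

def update_centers(x, y, centers, alpha):
    """Equation (3): move each center by alpha times the delta-center value
    instead of taking a gradient-descent step."""
    for j in np.unique(y):
        mask = y == j
        delta = (centers[j] - x[mask]).sum(axis=0) / (1.0 + mask.sum())
        centers[j] = centers[j] - alpha * delta
    return centers
```

The denominator `1 + mask.sum()` prevents a division by zero and damps the update for speakers with few utterances in the mini-batch.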
2.3 Additive margin loss function
The additive margin softmax (AM-softmax) loss function was proposed to replace the inner product operation of the softmax-based loss function with a cosine similarity operation [8] and to widen the margin between classes in an embedding space [11]. This loss function is calculated based on the cosine similarity, $\cos\theta_{j,i}$, so that the embeddings of each speaker have an additional margin of $m$, as follows:

$$L_{AM} = -\frac{1}{M}\sum_{i=1}^{M}\log\frac{e^{s(\cos\theta_{y_i,i}-m)}}{e^{s(\cos\theta_{y_i,i}-m)} + \sum_{j=1,\, j\neq y_i}^{N} e^{s\cos\theta_{j,i}}}, \quad (4)$$

$$\cos\theta_{j,i} = \frac{\mathbf{w}_j^{T}\mathbf{x}_i}{\lVert\mathbf{w}_j\rVert\,\lVert\mathbf{x}_i\rVert}, \quad (5)$$

where $s$ is a scaling factor for stabilizing the training of the cosine-similarity-based loss. This loss function has been successfully applied to face recognition systems; however, to the best of our knowledge, its application to speaker recognition has not been reported. We expect the additive margin loss function to be effective for speaker verification because it is an improved version of the angular softmax loss function [8], which has been successfully applied to speaker verification [4].
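The sketch below implements equations (4) and (5); the values of `s` and `m` are illustrative defaults, not the settings used in the experiments.

```python
import numpy as np

def am_softmax_loss(x, y, W, s=30.0, m=0.35):
    """Equations (4)-(5): softmax over scaled cosine similarities with an
    additive margin m applied to the target class."""
    x_n = x / np.linalg.norm(x, axis=1, keepdims=True)   # unit-norm embeddings
    W_n = W / np.linalg.norm(W, axis=0, keepdims=True)   # unit-norm class weights
    cos = x_n @ W_n                                      # (M, N), equation (5)
    cos[np.arange(len(y)), y] -= m                       # additive margin on target
    logits = s * cos
    logits = logits - logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_prob[np.arange(len(y)), y].mean()
```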
2.4 Generalized end-to-end loss function
The GE2E loss function was proposed to reduce the distance between the embedding of each utterance and the centroid embedding of the corresponding speaker while increasing the distance to the centroid embeddings of other speakers [12]. The most significant characteristic of the GE2E loss function is that it does not calculate distances between individual samples; instead, it calculates distances to centroids obtained by averaging the embeddings of the same speaker. Wan et al. assumed that higher generalization performance could be achieved through distance comparisons with centroid embeddings [12]. For this purpose, the similarity score between the embedding $\mathbf{x}_{j,i}$ of the $i$-th utterance of the $j$-th speaker and the centroid $\mathbf{c}_k$ of the $k$-th speaker is calculated as follows:

$$S_{j,i,k} = w \cdot \cos(\mathbf{x}_{j,i}, \mathbf{c}_k) + b, \quad (6)$$

where $w$ and $b$ are the trainable parameters for scaling and shifting scores. It is important to note that the centroid, $\mathbf{c}_k$, in equation (6) is different from the center embedding, $\mathbf{c}_{y_i}$, in equation (2). The centroid embedding is calculated by averaging the embeddings of each speaker, as follows:

$$\mathbf{c}_k = \frac{1}{U}\sum_{i=1}^{U}\mathbf{x}_{k,i}, \quad (7)$$

where $U$ is the number of utterances of the $k$-th speaker. The GE2E loss function is calculated based on $S_{j,i,k}$, as follows:

$$L_{GE2E} = \sum_{j,i}\left(1 - \sigma(S_{j,i,j}) + \max_{k \neq j}\,\sigma(S_{j,i,k})\right), \quad (8)$$

where $\sigma(\cdot)$ is the sigmoid function for stabilizing training. In the GE2E loss function, hard negative mining is performed by selecting the largest value among the scores of negative pairs. It is important to note that one mini-batch must be composed of several utterances per speaker to calculate the loss defined by equation (6), because the centroid, $\mathbf{c}_k$, in equation (6) is calculated from multiple utterances of each speaker. This requirement limits the mini-batch configuration and thereby considerably reduces the number of speakers in a mini-batch. For example, if one configures a mini-batch of size 100 with five utterances per speaker, only 20 speakers are included in the mini-batch. This may be too small considering the number of speakers in speaker recognition datasets [13, 14].
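The following sketch implements the contrast form of GE2E matching the max-over-negatives description above, under our reconstruction of equations (6)-(8); the (J, U, D) batch layout and the fixed `w`, `b` values are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def ge2e_contrast_loss(x, w=10.0, b=-5.0):
    """Equations (6)-(8): x has shape (J, U, D), i.e. J speakers with U
    utterances each; w and b are the trainable scale/shift, fixed here."""
    x_n = x / np.linalg.norm(x, axis=2, keepdims=True)
    centroids = x_n.mean(axis=1)                         # equation (7), (J, D)
    c_n = centroids / np.linalg.norm(centroids, axis=1, keepdims=True)
    S = w * np.einsum('jud,kd->juk', x_n, c_n) + b       # equation (6), (J, U, J)
    loss = 0.0
    for j in range(S.shape[0]):
        pos = sigmoid(S[j, :, j])                        # score with own centroid
        neg = S[j].copy()
        neg[:, j] = -np.inf                              # mask the target speaker
        hard = sigmoid(neg.max(axis=1))                  # hardest negative centroid
        loss += (1.0 - pos + hard).sum()                 # equation (8)
    return loss
```

Note how the batch must be laid out as U utterances for each of J speakers before the centroids can be computed, which is exactly the mini-batch constraint discussed above.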
3 Proposed loss function

The proposed loss functions are based on the idea that the weight matrix between the last hidden layer and the output layer in the softmax-based loss function can replace the centroids required to calculate the GE2E loss function. For example, to calculate the softmax loss function for 1,000 speakers from 128-dimensional embeddings, a weight matrix of size [128, 1000] is required. This weight matrix can be interpreted as a set of 128-dimensional vectors, each representing one speaker. We interpret each 128-dimensional vector as the basis representing a speaker, and we train these basis vectors to replace each speaker's centroid. Using this approach, it is possible to train all speakers simultaneously, regardless of the size of a mini-batch. For example, it is possible to train a DNN to maximize between-speaker variations by adding the following simple term to the existing loss function:
$$L_{BS} = \frac{1}{N(N-1)}\sum_{i=1}^{N}\sum_{\substack{j=1 \\ j\neq i}}^{N}\cos(\mathbf{w}_i, \mathbf{w}_j), \quad (9)$$

where $N$ is the number of speakers and $\mathbf{w}_j$ is the basis of the $j$-th speaker. It is important to note that $L_{BS}$ is designed to consider all speakers simultaneously. We also expect that the proposed loss function, $L_{BS}$, is complementary to the conventional center loss function, which considers only within-class variations. In addition, speaker bases can be used to define a loss function that performs hard negative mining over all speakers, as shown below:
$$L_{HNM} = \frac{1}{MK}\sum_{i=1}^{M}\sum_{\mathbf{w}_j \in \mathcal{S}_i}\log\left(1 + e^{\cos(\mathbf{x}_i,\,\mathbf{w}_j) - \cos(\mathbf{x}_i,\,\mathbf{w}_{y_i})}\right), \quad (10)$$

where $\mathbf{x}_i$ and $\mathbf{w}_{y_i}$ denote the embedding of the $i$-th utterance and the basis of the corresponding speaker, respectively, and $\mathcal{S}_i$ is the set of the top $K$ speaker bases with large $\cos(\mathbf{x}_i, \mathbf{w}_j)$ values. Hard negative mining thus becomes possible within the loss calculation itself. We followed common practice in metric learning to design this loss function. Its main purpose is to reduce the negative similarities ($\cos(\mathbf{x}_i, \mathbf{w}_j)$, $j \neq y_i$) while increasing the positive similarity ($\cos(\mathbf{x}_i, \mathbf{w}_{y_i})$). The exponential function increases the gradient for samples with a large loss and decreases it for samples with a small loss. The additional term 1 inside the $\log(\cdot)$ function limits its argument to values greater than one, because the $\log(\cdot)$ function takes an excessively small value and an overly large gradient when its argument is close to zero. In conventional metric learning, hard negative mining is typically performed in a separate phase, and it is difficult to increase its frequency owing to the overhead of that phase. The GE2E loss function addresses the same problem in a manner similar to the proposed loss function, but the number of speakers included in its negative mining is quite limited. In contrast, the proposed loss function enables negative mining that considers all speakers in every mini-batch.
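Under the reconstructions of equations (9) and (10) given above, a minimal NumPy sketch of the two proposed terms could look as follows; the column-per-speaker layout of `W` and the value of `K` are our assumptions.

```python
import numpy as np

def basis_loss(W):
    """Equation (9): mean pairwise cosine similarity between the N speaker
    bases (columns of W); minimizing it spreads the bases apart."""
    W_n = W / np.linalg.norm(W, axis=0, keepdims=True)   # (D, N)
    cos = W_n.T @ W_n                                    # (N, N)
    N = cos.shape[0]
    return (cos.sum() - np.trace(cos)) / (N * (N - 1))

def hard_negative_loss(x, y, W, K=10):
    """Equation (10): softplus over the gap between the K most similar
    negative bases and the target basis. K is an illustrative value."""
    x_n = x / np.linalg.norm(x, axis=1, keepdims=True)
    W_n = W / np.linalg.norm(W, axis=0, keepdims=True)
    cos = x_n @ W_n                                      # (M, N)
    M = len(y)
    pos = cos[np.arange(M), y]                           # cos(x_i, w_{y_i})
    neg = cos.copy()
    neg[np.arange(M), y] = -np.inf                       # exclude the target basis
    top_k = -np.sort(-neg, axis=1)[:, :K]                # K hardest negatives
    return np.log1p(np.exp(top_k - pos[:, None])).mean()
```

Because both terms touch every column of `W`, every speaker in the training set contributes to every update, regardless of which speakers appear in the current mini-batch.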
Table 2: Hyperparameters and EERs of the compared loss functions.

| System | Hyperparameters | EER (%) |
|---|---|---|
| i-vector PLDA reported in [13] | - | 8.8 |
| Metric learning reported in [13] | - | 7.8 |
| Softmax loss (our implementation) | - | 7.78 |
| Center loss (our implementation) | λ, α | 6.55 |
| AM-softmax (our implementation) | weight decay (0.0001) | 7.31 |
| GE2E (our implementation) | 5 utterances per speaker, weight decay (0.0001) | 10.65 |
| Proposed 1 | - | 5.96 |
| Proposed 2 | weight decay (0.0001) | 5.55 |
Figure 1 shows the embeddings, centroids, and speaker bases extracted, from the utterances of five randomly selected speakers, using the DNN trained with the proposed loss functions ($L_{BS}$ and $L_{HNM}$). The figure shows that each speaker can be represented by a speaker basis.

Figure 2 shows a histogram of impostor scores, calculated as the cosine similarity between embeddings of different speakers, to confirm the effect of the proposed loss function, $L_{BS}$. We compared the baseline trained with and without the proposed loss function. Figure 2 shows that the impostor scores on the training set are reduced by the proposed loss function, and that between-speaker variations are increased compared with the center loss function (baseline).
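The exact impostor-score definition is not preserved in the source; assuming it is the cosine similarity between embeddings of different speakers, a Figure 2-style histogram could be reproduced with the following sketch, where `emb` and `labels` are hypothetical precomputed embeddings and speaker labels.

```python
import numpy as np

def impostor_scores(emb, labels):
    """Cosine similarities of all pairs of embeddings from different
    speakers; emb is (M, D), labels is (M,)."""
    e = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    cos = e @ e.T
    impostor = labels[:, None] != labels[None, :]        # non-target pairs
    return cos[np.triu(impostor, k=1)]                   # upper triangle only

# plt.hist(impostor_scores(emb, labels), bins=100) would then plot the histogram.
```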
4 Experiments
We used the VoxCeleb1 dataset, following the guidelines provided in [13]. We constructed a validation set from the utterances of 40 training-set speakers whose names start with 'B' and report the test equal error rate (EER) at the point where the lowest validation EER was observed. We implemented the DNNs in Keras with TensorFlow as the backend [16, 17, 18] and used Kaldi for acoustic feature extraction [19].

4.1 Experimental configuration
Sixty-four-dimensional filterbank energy features were extracted using a 25-ms Hamming window with a 10-ms shift, and mean normalization was applied over a 3-s sliding window. A ResNet-34 [20, 21] was modified as shown in Table 1 and used to extract 128-dimensional speaker embeddings, with the leaky rectified linear unit as the activation function. The Adam optimizer with a learning rate of 0.001 was used with a mini-batch size of 100. In our experiments, the performance of loss functions defined by the inner product was degraded by weight decay, whereas the performance of loss functions defined by the cosine similarity was improved by it. Table 2 shows the hyperparameters and the EERs.
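The actual features were extracted with Kaldi [19]; as a self-contained illustration of the 3-s sliding-window mean normalization step, assuming centered windows of 300 frames (3 s at a 10-ms shift), a sketch could look like this:

```python
import numpy as np

def sliding_mean_norm(feats, window=300):
    """Subtract the mean over a centered sliding window of `window` frames
    (300 frames = 3 s at a 10 ms shift) from each frame of (T, D) features."""
    T = len(feats)
    half = window // 2
    out = np.empty_like(feats)
    for t in range(T):
        lo = max(0, t - half)
        hi = min(T, t + half + 1)
        out[t] = feats[t] - feats[lo:hi].mean(axis=0)
    return out
```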
4.2 Results and analysis
We found that the DNN trained with the center loss showed the lowest EER among the losses discussed in Section 2. The GE2E loss function, which we expected to perform well, exhibited a relatively high EER. We interpret this result as a consequence of the fixed mini-batch size imposed by practical issues such as GPU memory. In particular, with the mini-batch size fixed at 100 and five utterances per speaker, one mini-batch contains only 20 speakers, which is an extremely limited number compared with the total number of speakers. Starting from the center loss function, which showed the best performance among the conventional loss functions, we applied the proposed loss functions and compared the performances. First, adding the loss function $L_{BS}$ defined by equation (9) to the center loss function reduced the EER by approximately 9%. This result indirectly shows that between-speaker variations were increased by the proposed loss function, $L_{BS}$. In addition, the error was reduced by a further 6% by replacing the center loss function with the proposed loss function, $L_{HNM}$. In total, the proposed loss functions reduced the error by 15%. Based on these results, we found that it is possible to design an effective loss function for speaker verification with the proposed speaker bases.
5 Conclusions
In this study, we interpreted the weight matrix of the output layer as a set of speaker bases and proposed end-to-end loss functions built on these bases for speaker verification. The proposed loss function consists of $L_{BS}$ for increasing between-speaker variations and $L_{HNM}$ for hard negative mining. The biggest advantage of the proposed loss functions is that all speakers can be considered simultaneously, regardless of the composition of the mini-batch. The experimental results on VoxCeleb showed that the proposed loss functions reduced the error by approximately 15% compared with the conventional loss functions. In addition, we found that the proposed loss functions can replace the conventional ones. A limitation of this work is that the speaker bases are highly dependent on the last training samples. Future work will investigate how to train speaker bases so as to mitigate this problem.
References
 [1] N. Dehak, P. Kenny, R. Dehak, P. Dumouchel, and P. Ouellet, "Front-end factor analysis for speaker verification," IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 4, pp. 788–798, 2011.
 [2] E. Variani, X. Lei, E. McDermott, I. L. Moreno, and J. Gonzalez-Dominguez, "Deep neural networks for small footprint text-dependent speaker verification," in International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2014, pp. 4052–4056.
 [3] C. Li, X. Ma, B. Jiang, X. Li, X. Zhang, X. Liu, Y. Cao, A. Kannan, and Z. Zhu, "Deep speaker: an end-to-end neural speaker embedding system," arXiv preprint arXiv:1705.02304, 2017.
 [4] Y. Li, F. Gao, Z. Ou, and J. Sun, "Angular softmax loss for end-to-end speaker verification," in Proceedings of INTERSPEECH, Hyderabad, India, 2018.
 [5] L. Wan, Q. Wang, A. Papir, and I. Moreno, "Generalized end-to-end loss for speaker verification," arXiv preprint arXiv:1710.10467, 2017.
 [6] J. Jung, H. Heo, I. Yang, H. Shim, and H. Yu, "A complete end-to-end speaker verification system using deep neural networks: From raw signals to verification result," in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018, pp. 5349–5353.
 [7] J. Jung, H. Heo, I. Yang, H. Shim, and H. Yu, "Avoiding speaker overfitting in end-to-end DNNs using raw waveform for text-independent speaker verification," in Proc. Interspeech 2018, 2018, pp. 3583–3587.

 [8] W. Liu, Y. Wen, Z. Yu, M. Li, B. Raj, and L. Song, "SphereFace: Deep hypersphere embedding for face recognition," in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, vol. 1, p. 1.
 [9] F. Schroff, D. Kalenichenko, and J. Philbin, "FaceNet: A unified embedding for face recognition and clustering," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 815–823.
 [10] Y. Wen, K. Zhang, Z. Li, and Y. Qiao, "A discriminative feature learning approach for deep face recognition," in European Conference on Computer Vision. Springer, 2016, pp. 499–515.
 [11] F. Wang, J. Cheng, W. Liu, and H. Liu, "Additive margin softmax for face verification," IEEE Signal Processing Letters, vol. 25, no. 7, pp. 926–930, 2018.
 [12] L. Wan, Q. Wang, A. Papir, and I. L. Moreno, "Generalized end-to-end loss for speaker verification," arXiv preprint arXiv:1710.10467, 2017.
 [13] A. Nagrani, J. S. Chung, and A. Zisserman, "VoxCeleb: a large-scale speaker identification dataset," in Interspeech, 2017.
 [14] A. F. Martin and C. S. Greenberg, "The NIST 2010 speaker recognition evaluation," in Eleventh Annual Conference of the International Speech Communication Association, 2010.

 [15] L. van der Maaten and G. Hinton, "Visualizing data using t-SNE," Journal of Machine Learning Research, vol. 9, no. Nov, 2008.
 [16] F. Chollet et al., "Keras," https://github.com/keras-team/keras, 2015.
 [17] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, et al., "TensorFlow: Large-scale machine learning on heterogeneous distributed systems," 2015.
 [18] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, M. Kudlur, J. Levenberg, R. Monga, S. Moore, D. G. Murray, B. Steiner, P. Tucker, V. Vasudevan, P. Warden, M. Wicke, Y. Yu, and X. Zheng, "TensorFlow: A system for large-scale machine learning," in 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), 2016, pp. 265–283.
 [19] D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Hannemann, P. Motlicek, Y. Qian, P. Schwarz, et al., "The Kaldi speech recognition toolkit," in IEEE 2011 Workshop on Automatic Speech Recognition and Understanding. IEEE Signal Processing Society, 2011, number EPFL-CONF-192584.
 [20] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
 [21] K. He, X. Zhang, S. Ren, and J. Sun, "Identity mappings in deep residual networks," in European Conference on Computer Vision. Springer, 2016, pp. 630–645.