In recent years, several studies have reported superior results using deep neural networks (DNNs) to extract speaker embeddings, compared with conventional state-of-the-art i-vector-based speaker verification systems [2, 3, 4, 5, 6, 7]. Consequently, several recent studies have focused on designing loss functions that make DNNs suitable for speaker verification. Wan et al. proposed the generalized end-to-end (GE2E) loss function, which is based on centroids, i.e., the average embeddings of each speaker, to train DNNs with higher generalization performance. Li et al. applied a loss function based on angular softmax, originally proposed for face recognition, to create an angular margin between speakers in an embedding space.
The conventional studies on loss functions mentioned above do not address the following two problems. The first is that conventional loss functions consider only a limited number of speakers, determined by the mini-batch composition. When a DNN is repeatedly trained with small mini-batches, the network parameters can become biased toward only the speakers included in each mini-batch. The second is the excessive overhead of hard negative mining, which is important in metric-learning-based loss functions. Hard negative mining is known to have a significant impact on the performance of metric learning; however, it is usually performed only at regular intervals because of practical issues, even though the hard negative samples change as the weight parameters are updated after each mini-batch. Ideally, hard negative mining should therefore be performed for every mini-batch. GE2E partially solves these problems, but only a few speakers can be considered by its hard negative mining. In this paper, we propose loss functions based on speaker bases to handle both problems. Speaker bases are trainable parameters that represent individual speakers. Using loss functions based on speaker bases, we expect that all speakers can be trained simultaneously and that hard negative mining can be performed in every mini-batch.
2 Related works
In this section, we introduce existing loss functions that can be used to train speaker verification systems. These include loss functions that have already been successfully applied to speaker verification systems as well as ones from the face recognition field.
2.1 Softmax-based loss function
The softmax-based loss function is widely used to train DNNs for identification. Generally, when the softmax-based loss function is exploited for speaker verification, the output of the last hidden layer is used as the embedding of each utterance after training the DNN. Based on the output, $\mathbf{x}_i$, of the last hidden layer, the softmax loss function is calculated as

$$L_{S} = -\frac{1}{M}\sum_{i=1}^{M} \log \frac{\exp(\mathbf{w}_{y_i}^{\top}\mathbf{x}_i + b_{y_i})}{\sum_{j=1}^{N} \exp(\mathbf{w}_{j}^{\top}\mathbf{x}_i + b_{j})}, \qquad (1)$$

where $\mathbf{x}_i$ and $y_i$ denote the embedding of the $i$-th utterance and the corresponding speaker label, respectively, $M$ is the number of utterances, $N$ is the number of speakers in the training set, $W = [\mathbf{w}_1, \ldots, \mathbf{w}_N]$ and $\mathbf{b}$ are the weight matrix and the bias vector of the output layer, respectively, and $\exp(\cdot)$ is the exponential function.
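As a concrete illustration, the softmax-based loss described above can be sketched in a few lines of NumPy. The shapes, toy data, and function name here are our own illustrative choices, not this paper's implementation (which used Keras/TensorFlow):

```python
import numpy as np

def softmax_loss(X, y, W, b):
    """Softmax cross-entropy over speaker logits.

    X: (M, D) utterance embeddings, y: (M,) speaker labels,
    W: (D, N) output-layer weight matrix, b: (N,) bias vector.
    """
    logits = X @ W + b                                   # (M, N)
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(y)), y].mean()

# toy example: 4 utterances, 8-dim embeddings, 3 speakers (all hypothetical)
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
y = np.array([0, 1, 2, 1])
W = rng.normal(size=(8, 3))
b = np.zeros(3)
loss = softmax_loss(X, y, W, b)
```

After training with this loss, each column of `W` corresponds to one speaker; the proposed method later reinterprets these columns as speaker bases.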
2.2 Center loss function
The center loss function was proposed to reduce within-class variations while training embeddings with the softmax-based loss function. To reduce within-class variations, the loss is calculated from the squared error between the embedding of each utterance and the center embedding of the corresponding speaker. This loss function was successfully applied in the field of face recognition, where a large performance improvement was reported. The center loss function, defined in equation (2), is not used by itself; in most cases, it is used in conjunction with the conventional softmax-based loss function:

$$L_{C} = \frac{1}{2}\sum_{i=1}^{M} \left\| \mathbf{x}_i - \mathbf{c}_{y_i} \right\|_2^2, \qquad (2)$$

where $\mathbf{c}_{y_i}$ is the center embedding of the $y_i$-th speaker, the total loss is $L_{S} + \lambda L_{C}$, and $\lambda$ is the weight factor of the center loss function. The center embedding of each speaker is not trained by gradient descent like the other parameters of the DNN. Rather, it is updated by moving the center embedding by a scalar $\alpha$ based on the delta center value calculated using the following formulas:

$$\Delta \mathbf{c}_{k} = \frac{\sum_{i=1}^{M} \delta(y_i = k)\,(\mathbf{c}_{k} - \mathbf{x}_i)}{1 + \sum_{i=1}^{M} \delta(y_i = k)}, \qquad (3)$$

$$\mathbf{c}_{k}^{t+1} = \mathbf{c}_{k}^{t} - \alpha\, \Delta \mathbf{c}_{k}^{t}, \qquad (4)$$

where $\delta(\text{condition}) = 1$ if the condition is satisfied; otherwise, $\delta(\text{condition}) = 0$.
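The center loss and its delta-center update can be sketched as follows. This is a minimal NumPy illustration with hypothetical toy data and names, not the authors' implementation:

```python
import numpy as np

def center_loss(X, y, centers):
    """Half the summed squared distance between each embedding and
    the center of its speaker."""
    diff = X - centers[y]                  # (M, D)
    return 0.5 * (diff ** 2).sum()

def delta_centers(X, y, centers):
    """Delta-center value: for each speaker k, the sum of
    (c_k - x_i) over its utterances, divided by (1 + count)."""
    delta = np.zeros_like(centers)
    for k in range(len(centers)):
        mask = (y == k)
        delta[k] = (centers[k] - X[mask]).sum(axis=0) / (1 + mask.sum())
    return delta

# one update step: move each center by a scalar step alpha (hypothetical data)
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 4))
y = np.array([0, 0, 1, 1, 2, 2])
centers = np.zeros((3, 4))
alpha = 0.5
centers = centers - alpha * delta_centers(X, y, centers)
```

Note that the centers are moved toward the class means rather than updated by backpropagation, which is the point the text makes about the update rule.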
2.3 Additive margin loss function
The additive margin softmax (AM-softmax) loss function was proposed to replace the inner product in the softmax-based loss function with cosine similarity and to widen the margin between classes in an embedding space. It is calculated from the cosine similarity, $\cos\theta_j$, so that the embeddings of each speaker have an additional margin of $m$, as follows:

$$L_{AM} = -\frac{1}{M}\sum_{i=1}^{M} \log \frac{\exp\!\big(s\,(\cos\theta_{y_i} - m)\big)}{\exp\!\big(s\,(\cos\theta_{y_i} - m)\big) + \sum_{j \neq y_i} \exp\!\big(s\,\cos\theta_{j}\big)}, \qquad (5)$$

where $\cos\theta_j = \mathbf{w}_j^{\top}\mathbf{x}_i \,/\, (\|\mathbf{w}_j\|\,\|\mathbf{x}_i\|)$ and $s$ is a scaling factor for stabilizing the training of the cosine-similarity-based loss. This loss function has been successfully applied to face recognition systems; however, to the best of our knowledge, no studies on applying it to speaker recognition have been reported. We expect the additive margin loss function to be effective for speaker verification because it is an improved version of the angular softmax loss function, which has already been successfully applied to speaker verification.
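A minimal NumPy sketch of the additive margin computation follows; the default values `s = 30.0` and `m = 0.35` are illustrative assumptions, not values taken from this paper:

```python
import numpy as np

def am_softmax_loss(X, y, W, s=30.0, m=0.35):
    """Cosine-similarity logits with an additive margin m subtracted
    from the target speaker's similarity, then scaled by s.
    (The s and m defaults are illustrative assumptions.)"""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    Wn = W / np.linalg.norm(W, axis=0, keepdims=True)
    cos = Xn @ Wn                          # (M, N) cosine similarities
    rows = np.arange(len(y))
    cos[rows, y] -= m                      # additive margin on the target class
    logits = s * cos
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_probs[rows, y].mean()
```

Because the margin is subtracted only from the target class, the loss with `m > 0` is always at least as large as the margin-free cosine softmax on the same data, which is what forces the angular gap between speakers.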
2.4 Generalized end-to-end loss function
The GE2E loss function was proposed to reduce the distance between the embeddings of each utterance and the centroid embedding of the corresponding speaker while increasing the distance to the centroid embeddings of the other speakers. Its most significant characteristic is that it does not calculate distances between individual samples; instead, it calculates distances to centroids obtained by averaging the embeddings of the same speaker. Wan et al. assumed that higher generalization performance could be achieved through distance comparison with centroid embeddings. For this purpose, the similarity between the embedding of the $j$-th utterance of the $i$-th speaker, $\mathbf{x}_{ji}$, and the centroid of the $k$-th speaker, $\mathbf{c}_k$, is calculated as follows:

$$s_{ji,k} = w \cdot \cos(\mathbf{x}_{ji}, \mathbf{c}_{k}) + b, \qquad (6)$$

where $w$ and $b$ are trainable parameters for scaling and shifting the scores. It is important to note that the centroid, $\mathbf{c}_k$, in equation (6) is different from the center embedding, $\mathbf{c}_{y_i}$, in equation (2). The centroid embedding is calculated from the embeddings of each speaker, as follows:

$$\mathbf{c}_{k} = \frac{1}{m_k}\sum_{j=1}^{m_k} \mathbf{x}_{jk}, \qquad (7)$$

where $m_k$ is the number of utterances of the $k$-th speaker. The GE2E loss function is then calculated based on $s_{ji,k}$, as follows:

$$L_{G} = \sum_{j,i} \Big( 1 - \sigma(s_{ji,i}) + \max_{k \neq i}\, \sigma(s_{ji,k}) \Big), \qquad (8)$$

where $\sigma(\cdot)$ is the sigmoid function for stabilizing training. In the GE2E loss function, hard negative mining is performed by selecting the largest value among the scores of negative pairs. It is important to note that calculating the scores defined by equation (6) requires each mini-batch to contain several utterances per speaker, because the centroid, $\mathbf{c}_k$, is computed from multiple utterances of each speaker. This requirement restricts the mini-batch configuration and thereby considerably reduces the number of speakers in a mini-batch. For example, if one configures a mini-batch of size 100 with five utterances per speaker, only 20 speakers are included in the mini-batch. This may be too small considering the size of speaker recognition datasets [13, 14].
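The centroid computation and contrast-style GE2E loss described above can be sketched as follows. For simplicity, this illustration keeps the scale and shift as fixed scalars and does not exclude each utterance from its own centroid, which the original GE2E formulation does; treat it as a sketch rather than a faithful reimplementation:

```python
import numpy as np

def ge2e_contrast_loss(X, spk, w=10.0, b=-5.0):
    """Contrast-style GE2E loss over a mini-batch that contains
    several utterances per speaker.

    X: (M, D) embeddings; spk: (M,) speaker labels 0..K-1;
    w, b: scale/shift (trainable in GE2E, fixed scalars here).
    """
    K = spk.max() + 1
    # centroid of each speaker: the mean of that speaker's embeddings
    C = np.stack([X[spk == k].mean(axis=0) for k in range(K)])
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    Cn = C / np.linalg.norm(C, axis=1, keepdims=True)
    # sigmoid of the scaled and shifted cosine scores
    scores = 1.0 / (1.0 + np.exp(-(w * (Xn @ Cn.T) + b)))   # (M, K)
    loss = 0.0
    for i, k in enumerate(spk):
        pos = scores[i, k]                       # score against own centroid
        neg = np.max(np.delete(scores[i], k))    # hardest other centroid
        loss += 1.0 - pos + neg
    return loss / len(X)
```

The `np.max` over the remaining centroids is exactly the per-utterance hard negative selection, and it can only see the speakers present in the mini-batch, which is the limitation the proposed method targets.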
3 Proposed loss function
The proposed loss function is based on the idea that the weight matrix between the last hidden layer and the output layer in the softmax-based loss function can replace the centroids required to calculate the GE2E loss function. For example, to calculate the softmax loss function for 1000 speakers from 128-dimensional embeddings, a weight matrix of size [128, 1000] is required. This weight matrix can be interpreted as a set of 128-dimensional vectors, each representing one speaker. We interpret each of these 128-dimensional vectors as the basis of a speaker, and we train these basis vectors to replace each speaker's centroid. With this approach, it is possible to train all speakers simultaneously, regardless of the size of a mini-batch. For example, a DNN can be trained to maximize between-speaker variations by adding the following simple term to the existing loss function:

$$L_{BS} = \frac{1}{N(N-1)}\sum_{i=1}^{N}\sum_{\substack{j=1 \\ j \neq i}}^{N} \cos(\mathbf{w}_i, \mathbf{w}_j), \qquad (9)$$

where $N$ is the number of speakers and $\mathbf{w}_i$ is the basis of the $i$-th speaker. It is important to note that $L_{BS}$ is designed to consider all speakers simultaneously. We also expect the proposed loss function, $L_{BS}$, to be complementary to the conventional center loss function, which only considers within-class variations. In addition, the speaker bases can be used to define a loss function that performs hard negative mining over all speakers, as shown below:

$$L_{HNM} = \frac{1}{M}\sum_{i=1}^{M} \log \Big( 1 + \sum_{\mathbf{w}_k \in \Omega_i} \exp\!\big( \cos(\mathbf{x}_i, \mathbf{w}_k) - \cos(\mathbf{x}_i, \mathbf{w}_{y_i}) \big) \Big), \qquad (10)$$

where $\mathbf{x}_i$ and $\mathbf{w}_{y_i}$ denote the $i$-th utterance and the basis of the corresponding speaker, respectively, and $\Omega_i$ is the set of the top-$H$ speaker bases with the largest $\cos(\mathbf{x}_i, \mathbf{w}_k)$ values. Hard negative mining thus becomes possible within the loss calculation itself. We followed common practice in metric learning when designing this loss function. Its main purpose is to reduce the negative similarities, $\cos(\mathbf{x}_i, \mathbf{w}_k)$, while increasing the positive similarity, $\cos(\mathbf{x}_i, \mathbf{w}_{y_i})$. The exponential function increases the gradient of samples with large loss and decreases the gradient of samples with small loss. The additional term of 1 inside the logarithm keeps its argument greater than one; without it, the logarithm takes very small values and produces an overly large gradient when its argument is close to zero. In conventional metric learning, hard negative mining is typically performed in a separate phase, and it is difficult to increase its frequency owing to the overhead of that phase. The GE2E loss function addresses the same problem in a manner similar to the proposed loss function, but the number of speakers included in its negative mining is quite limited. In contrast, the proposed loss function enables negative mining over all speakers for every mini-batch.
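Under the interpretation above, both proposed terms can be sketched directly from the output-layer weight matrix. The pairwise-cosine form of the between-speaker term and the top-`H` selection (with `H = 5`) are our reading of the description, not verified implementation details:

```python
import numpy as np

def between_speaker_loss(W):
    """Mean pairwise cosine similarity among all speaker bases
    (columns of W); minimizing it pushes the bases apart."""
    Wn = W / np.linalg.norm(W, axis=0, keepdims=True)   # (D, N)
    G = Wn.T @ Wn                                       # (N, N) cosine matrix
    N = G.shape[0]
    return (G.sum() - np.trace(G)) / (N * (N - 1))      # exclude the diagonal

def hard_negative_loss(X, y, W, H=5):
    """For each utterance, compare its positive similarity with the
    H most similar non-target speaker bases (H = 5 is an assumption)."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    Wn = W / np.linalg.norm(W, axis=0, keepdims=True)
    cos = Xn @ Wn                                       # (M, N)
    loss = 0.0
    for i, k in enumerate(y):
        pos = cos[i, k]
        negs = np.delete(cos[i], k)
        hard = np.sort(negs)[-H:]          # top-H hardest negatives
        # the +1 inside the log keeps its argument above one
        loss += np.log1p(np.exp(hard - pos).sum())
    return loss / len(X)
```

Both terms depend only on the embeddings and the output-layer weights, so they cover all speakers in every mini-batch, unlike the centroid-based mining of GE2E.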
Table 2: Hyperparameters and EERs of each loss function.

| Loss function | Hyperparameters | EER (%) |
| --- | --- | --- |
| i-vector PLDA (reported) | - | 8.8 |
| Metric learning (reported) | - | 7.8 |
| Softmax loss (our implementation) | - | 7.78 |
| Center loss (our implementation) |  | 6.55 |
| AM-softmax (our implementation) | weight decay (0.0001) | 7.31 |
| GE2E (our implementation) | 5 utterances per speaker, weight decay (0.0001) | 10.65 |
| Proposed 2 | weight decay (0.0001) | 5.55 |
Figure 1 shows the embeddings, centroids, and speaker bases extracted from the DNN trained with the proposed loss functions, for the utterances of five randomly selected speakers. The figure shows that each speaker can be represented by its speaker basis.
Figure 2 shows histograms of impostor scores, used to confirm the effect of the proposed loss function. We compared the baseline trained with and without the proposed between-speaker term. The figure shows that the impostor scores on the training set are reduced by the proposed loss function, i.e., between-speaker variations are increased compared with the center loss function (baseline).
4 Experiments

We used the VoxCeleb1 dataset and followed its evaluation guideline. We constructed a validation set from the utterances of the 40 speakers in the training set whose names start with 'B' and report the test equal error rate (EER) at the point where the lowest validation EER was observed. We implemented the DNNs in Keras with TensorFlow as the back-end [16, 17, 18], and we used Kaldi for acoustic feature extraction.
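Since all results are reported as equal error rates, a small sketch of how an EER can be computed from lists of target and impostor scores may be helpful. The threshold sweep below is a simple illustrative approach, not the evaluation script used in this work:

```python
import numpy as np

def equal_error_rate(target_scores, impostor_scores):
    """Sweep thresholds over all observed scores and return the EER:
    the operating point where the false-acceptance rate (impostors
    accepted) and false-rejection rate (targets rejected) meet."""
    thresholds = np.sort(np.concatenate([target_scores, impostor_scores]))
    best = (1.0, 0.0)                      # (far, frr) with the largest gap
    for t in thresholds:
        frr = np.mean(target_scores < t)   # targets scored below threshold
        far = np.mean(impostor_scores >= t)  # impostors scored at/above it
        if abs(far - frr) < abs(best[0] - best[1]):
            best = (far, frr)
    return (best[0] + best[1]) / 2
```

With perfectly separated score distributions the returned EER is zero; overlapping distributions yield the crossing point of the two error rates.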
4.1 Experimental configuration
The network architecture was modified as shown in Table 1 and used to extract 128-dimensional speaker embeddings. We used the leaky rectified linear unit as the activation function. The Adam optimizer with a learning rate of 0.001 was used with a mini-batch size of 100. In our experiments, weight decay degraded the performance of the loss functions defined on the inner product but improved that of the loss functions defined on cosine similarity. Table 2 shows the hyperparameters and the EER of each loss function.
4.2 Results and analysis
We found that the DNN trained with the center loss showed the lowest EER among the losses discussed in Section 2. The GE2E loss function, which was expected to perform well, exhibited a relatively high EER. We interpret this as a consequence of the fixed mini-batch size imposed by practical constraints such as GPU memory: if the mini-batch size is fixed at 100 and five utterances per speaker are included, one mini-batch contains only 20 speakers, an extremely limited number compared with the total number of speakers. Based on the center loss function, which performed best among the conventional loss functions, we applied the proposed loss functions and compared the performances. First, combining the between-speaker term defined by equation (9) with the center loss function reduced the EER by approximately 9%. This result indirectly shows that between-speaker variations were increased by the proposed loss function. In addition, the error was reduced by 6% by replacing the center loss function with the proposed hard-negative-mining loss function. Finally, the proposed loss functions together reduced the error by 15%. Based on these results, we conclude that effective loss functions for speaker verification can be designed with the proposed speaker bases.
In this study, we interpreted the weight matrix of the output layer as a set of speaker bases and proposed end-to-end loss functions built on these bases for speaker verification. The proposed loss consists of one term for increasing between-speaker variations and another for hard negative mining. The biggest advantage of the proposed loss functions is that all speakers can be considered simultaneously, regardless of the composition of the mini-batch. Experimental results on VoxCeleb showed that the proposed loss function reduced the error by approximately 15% compared with the conventional loss functions. In addition, we found that the proposed loss function can replace the conventional loss functions. A limitation of this work is that the speaker bases are highly dependent on the most recent training samples. Future work will consider how to train speaker bases that mitigate this problem.
-  N. Dehak, P. Kenny, R. Dehak, P. Dumouchel, and P. Ouellet, “Front-end factor analysis for speaker verification,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 4, pp. 788–798, 2011.
-  E. Variani, X. Lei, E. McDermott, I. L. Moreno, and J. Gonzalez-Dominguez, “Deep neural networks for small footprint text-dependent speaker verification,” in International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2014, pp. 4052–4056.
-  C. Li, X. Ma, B. Jiang, X. Li, X. Zhang, X. Liu, Y. Cao, A. Kannan, and Z. Zhu, “Deep speaker: an end-to-end neural speaker embedding system,” arXiv preprint arXiv:1705.02304, 2017.
-  Y. Li, F. Gao, Z. Ou, and J. Sun, “Angular softmax loss for end-to-end speaker verification,” Proceedings of INTERSPEECH, Hyderabad, India, 2018.
-  L. Wan, Q. Wang, A. Papir, and I. Moreno, “Generalized end-to-end loss for speaker verification,” arXiv preprint arXiv:1710.10467, 2017.
-  J. Jung, H. Heo, I. Yang, H. Shim, and H. Yu, “A complete end-to-end speaker verification system using deep neural networks: From raw signals to verification result,” in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018, pp. 5349–5353.
-  J. Jung, H. Heo, I. Yang, H. Shim, and H. Yu, “Avoiding speaker overfitting in end-to-end dnns using raw waveform for text-independent speaker verification,” in Proc. Interspeech 2018, 2018, pp. 3583–3587.
-  W. Liu, Y. Wen, Z. Yu, M. Li, B. Raj, and L. Song, “Sphereface: Deep hypersphere embedding for face recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017.
-  F. Schroff, D. Kalenichenko, and J. Philbin, “Facenet: A unified embedding for face recognition and clustering,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 815–823.
-  Y. Wen, K. Zhang, Z. Li, and Y. Qiao, “A discriminative feature learning approach for deep face recognition,” in European Conference on Computer Vision. Springer, 2016, pp. 499–515.
-  F. Wang, J. Cheng, W. Liu, and H. Liu, “Additive margin softmax for face verification,” IEEE Signal Processing Letters, vol. 25, no. 7, pp. 926–930, 2018.
-  L. Wan, Q. Wang, A. Papir, and I. Lopez Moreno, “Generalized end-to-end loss for speaker verification,” arXiv preprint arXiv:1710.10467, 2017.
-  A. Nagrani, J. S. Chung, and A. Zisserman, “Voxceleb: a large-scale speaker identification dataset,” in Interspeech, 2017.
-  A. F. Martin and C. S. Greenberg, “The NIST 2010 speaker recognition evaluation,” in Eleventh Annual Conference of the International Speech Communication Association, 2010.
-  L. van der Maaten and G. Hinton, “Visualizing data using t-SNE,” Journal of Machine Learning Research, vol. 9, no. Nov, 2008.
-  F. Chollet et al., “Keras,” https://github.com/keras-team/keras, 2015.
-  M. Abadi, A. Agarwal, P. Barham, E. Brevdo, et al., “Tensorflow: Large-scale machine learning on heterogeneous distributed systems,” 2015.
-  M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, M. Kudlur, J. Levenberg, R. Monga, S. Moore, D. G. Murray, B. Steiner, P. Tucker, V. Vasudevan, P. Warden, M. Wicke, Y. Yu, and X. Zheng, “Tensorflow: A system for large-scale machine learning,” in 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), 2016, pp. 265–283.
-  D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Hannemann, P. Motlicek, Y. Qian, P. Schwarz, et al., “The kaldi speech recognition toolkit,” in IEEE 2011 workshop on automatic speech recognition and understanding. IEEE Signal Processing Society, 2011, number EPFL-CONF-192584.
-  K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.
-  K. He, X. Zhang, S. Ren, and J. Sun, “Identity mappings in deep residual networks,” in European conference on computer vision. Springer, 2016, pp. 630–645.