1 Introduction
The speaker recognition (SR) task can be defined as an utterance-level “sequence-to-one” learning problem, in that we try to retrieve information about an entire utterance rather than specific word content [1]. Moreover, there is no constraint on the lexicon, so training utterances and testing segments may have completely different contents [2]. Therefore, given the input speech data, the goal boils down to transforming them into utterance-level representations among which the inter-class variability is maximized and, simultaneously, the intra-class variability is minimized [3].
Typically, SR can be categorized into the speaker identification (SID) task and the speaker verification (SV) task [4]. The former classifies a speaker to a specific identity, while the latter determines whether a pair of utterances belongs to the same person. Under the open-set protocol, speaker identities in the testing set are usually disjoint from those in the training set, which makes SV more challenging yet closer to practice. Since it is impossible to classify testing utterances to known identities in the training set, we need to map speakers to a discriminative feature space. In this scenario, open-set SV is essentially a metric learning problem, where the key is to learn discriminative large-margin speaker embeddings.
Two categories of methods are commonly used to obtain utterance-level speaker representations. The first consists of a series of separate statistical models. Its representative is the classical i-vector approach [5]. First, frame-level feature sequences are extracted from raw audio signals. Then, selected feature frames in the training dataset are grouped together to estimate a Gaussian Mixture Model (GMM) based universal background model (UBM) [6]. Sufficient statistics of each utterance on the UBM are accumulated, and a factor analysis based i-vector extractor is trained to project the statistics into a low-rank total variability subspace [5].
The other category relies on a model trained by a downstream procedure through an end-to-end deep neural network [7, 8, 9, 10]. First, in the same way as in the i-vector approach, frame-level feature sequences are extracted. Then an automatic frame-level feature extractor such as a convolutional neural network (CNN) [8, 11], a time-delay neural network (TDNN) [9], or a Long Short-Term Memory (LSTM) network [7, 12] is designed to obtain a high-level abstract representation. Afterward, a statistics pooling layer [9] or an encoding layer [13] is built on top to extract the fixed-dimensional utterance-level representation. This utterance-level representation can be further processed by fully-connected (FC) layers and finally connected to an output layer. All the components in the end-to-end pipeline are jointly learned with a unified loss function.
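To illustrate how a pooling layer turns a variable-length frame sequence into a fixed-dimensional utterance-level vector, the following is a minimal sketch in plain Python. The function name `stats_pooling` and the toy inputs are hypothetical; the mean-plus-standard-deviation form only follows the general idea of statistics pooling and is not claimed to be the exact layer of any cited system.

```python
import math

def stats_pooling(frames):
    """Pool a list of frame-level vectors (T x D) into one vector of size 2*D.

    The output concatenates the per-dimension mean and standard deviation,
    so its size does not depend on the number of frames T.
    """
    num_frames = len(frames)
    dim = len(frames[0])
    means = [sum(f[d] for f in frames) / num_frames for d in range(dim)]
    stds = [
        math.sqrt(sum((f[d] - means[d]) ** 2 for f in frames) / num_frames)
        for d in range(dim)
    ]
    return means + stds

# Two toy "utterances" with different numbers of frames (assumed values).
short_utt = [[1.0, 2.0], [3.0, 4.0]]
long_utt = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]

pooled_short = stats_pooling(short_utt)
pooled_long = stats_pooling(long_utt)
```

Both pooled vectors have the same length regardless of the utterance duration, which is what allows a fixed-size FC layer to follow the pooling step.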
In the classical i-vector approach, an extra length normalization step is necessary to normalize the representations into the unit-length hyperspace before back-end modeling [14]. In an end-to-end system, once deep speaker embeddings such as x-vectors [15] have been extracted from the neural network, this length normalization step is also required when calculating pairwise scores.
In this paper, we explore an end-to-end SV system in which the length normalization step is built into the deep neural network. The neural network can therefore learn length-normalized speaker embeddings in an end-to-end manner.
2 Related works
2.1 Length normalization in i-vector approach
Length normalization has been analyzed and shown to be an effective strategy for SR, but mainly within the conventional i-vector approach [14]. As demonstrated in Fig. 1, this simple non-linear transformation of the i-vector has been the de facto standard before back-end modeling [16, 17].
For the closed-set SID task, length normalization followed by logistic regression or a support vector machine is commonly adopted to obtain the posterior probabilities for the speaker categories. For the open-set SV task, cosine similarity or length normalization followed by probabilistic linear discriminant analysis (PLDA) scoring [18, 19] is widely used to obtain the final pairwise scores. Cosine similarity is a similarity measure that is independent of magnitude; it can be seen as the length-normalized version of the inner product of two vectors. In the above systems, front-end i-vector modeling, the length normalization step, and back-end modeling are all independent of each other and performed separately.
2.2 Length normalization in end-to-end system with triplet loss
Previous works [8, 12, 20] introduced the triplet loss [21] and successfully trained models with normalized features in an end-to-end fashion. They all explicitly treat the open-set SV task as a metric learning problem. The triplet loss approach naturally requires a length normalization step, since distances are computed between normalized unit vectors.
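For readers unfamiliar with the triplet loss, the following minimal sketch shows its usual hinge form on length-normalized vectors, as popularized in [21]. The margin value of 0.2, the helper names, and the toy vectors are assumptions for illustration only, and no triplet mining is shown.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def squared_distance(u, v):
    """Squared Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge loss: the anchor should be closer to the positive than to the
    negative by at least `margin`, measured on unit-length vectors."""
    a, p, n = (l2_normalize(v) for v in (anchor, positive, negative))
    return max(0.0, squared_distance(a, p) - squared_distance(a, n) + margin)

# An "easy" triplet (positive nearly parallel to the anchor) gives zero loss;
# swapping positive and negative gives a large loss.
easy = triplet_loss([1.0, 0.0], [2.0, 0.1], [0.0, 3.0])
hard = triplet_loss([1.0, 0.0], [0.0, 3.0], [2.0, 0.1])
```

Because easy triplets contribute zero loss, the choice of which triplets to present during training matters a great deal in practice.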
However, a neural network trained with triplet loss requires a carefully designed triplet mining procedure. This procedure is non-trivial, being both time-consuming and performance-sensitive [22]. Besides, many closed-set tasks such as SID are classification problems, so it is intuitively unnecessary to implement a triplet mining procedure and explicitly treat them as metric learning problems. Therefore, we concentrate on the general scenario with a common classification network, in which the number of units in the output layer equals the number of pre-collected speaker categories in the training set.
2.3 Length normalization in common end-to-end deep speaker embedding system
For the open-set SV task, since it is impossible to classify testing utterances to known identities in the training set, the end-to-end classification network serves as an automatic speaker embedding extractor, as demonstrated in Fig. 2. Once deep speaker embeddings (e.g. x-vectors) are extracted, just as in the i-vector approach, cosine similarity or length normalization followed by PLDA is commonly required to obtain the final pairwise scores. Note that in both cosine similarity and PLDA modeling, length normalization is an extra step performed on the extracted speaker embeddings, outside the end-to-end training.
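As an illustration of this extra scoring step, the following minimal sketch (plain Python, with toy two-dimensional vectors standing in for speaker embeddings) shows that cosine similarity coincides with the inner product of explicitly length-normalized vectors. The helper names and the vectors are hypothetical and chosen for this example only.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def inner_product(u, v):
    return sum(a * b for a, b in zip(u, v))

def cosine_similarity(u, v):
    """Inner product divided by the product of the two vector lengths."""
    return inner_product(u, v) / (
        math.sqrt(inner_product(u, u)) * math.sqrt(inner_product(v, v))
    )

enroll = [3.0, 4.0]
test = [6.0, 8.0]    # same direction as enroll, twice the magnitude
other = [4.0, -3.0]  # orthogonal to enroll

cos_same = cosine_similarity(enroll, test)
dot_same = inner_product(l2_normalize(enroll), l2_normalize(test))
cos_other = cosine_similarity(enroll, other)
```

The score for `enroll` and `test` is unaffected by the difference in magnitude, which is the property that makes length normalization a natural preprocessing step for pairwise scoring.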
3 Deep length normalization
As described in Section 2.1, back-end modeling in the conventional i-vector approach is usually performed on the unit-length hyperspace. In an end-to-end deep neural network, however, the back-end softmax classifier in practice commonly adopts an inner-product based FC layer without normalization. This means that if we want to apply cosine similarity or PLDA to the extracted deep speaker embeddings, such as the representative x-vectors, we must first manually normalize them to unit length.
This motivates us to ask whether it is possible to learn length-normalized deep speaker embeddings in an end-to-end manner within a common classification network. One might wonder what the substantial difference is between performing length normalization inside and outside the end-to-end training. This issue has been studied in the computer vision community [23, 24]. The effect of deep length normalization is equivalent to adding an $L_2$ constraint to the original loss function. With deep speaker embeddings length-normalized inherently in an end-to-end manner, our optimization objective requires the speaker embeddings not only to be separated, but also to be constrained to a small unit hyperspace. This makes the network more difficult to train but, on the other hand, could greatly enhance its generalization capability.
To this end, a naive practice is simply to add an $L_2$ normalization layer before the output layer. However, we find that the training process may not converge and may lead to rather poor performance, especially when the number of output categories is very large. The reason might be that the surface area of the unit-length hypersphere does not have enough room both to accommodate so many speaker embeddings and to allow each category of them to be separable.
As done in [23, 24], we introduce a scale parameter $\alpha$ to shape the length-normalized speaker embeddings to a suitable radius. The scale layer scales the unit-length speaker embeddings to a fixed radius given by the parameter $\alpha$. Therefore, the complete formula of our deep length normalization can be expressed as
$$\tilde{f}(\mathbf{x}) = \frac{\alpha \, f(\mathbf{x})}{\| f(\mathbf{x}) \|_2} \tag{1}$$
where $\mathbf{x}$ is the input data sequence in the batch, $f(\mathbf{x})$ is the corresponding output of the penultimate layer of the network, and $\tilde{f}(\mathbf{x})$ is the deep normalized embedding.
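The normalization-and-scale operation can be sketched in a few lines of plain Python. This is only an illustrative forward computation under stated assumptions: the function name `deep_length_norm`, the argument name `alpha` for the scale parameter, and the toy vector are not from the original system, where this step is a layer inside the network.

```python
import math

def deep_length_norm(embedding, alpha):
    """L2-normalize `embedding`, then scale it to radius `alpha`."""
    norm = math.sqrt(sum(x * x for x in embedding))
    return [alpha * x / norm for x in embedding]

# Toy 2-dimensional stand-in for the penultimate-layer output (assumed values).
penultimate_output = [3.0, 4.0]
normalized = deep_length_norm(penultimate_output, alpha=12.0)

# Whatever the input length, the output lies on a sphere of radius alpha.
radius = math.sqrt(sum(x * x for x in normalized))
```

In a trainable network the same operation would be written with differentiable tensor operations, so that gradients flow through the normalization.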
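Our end-to-end system architecture with deep feature normalization is demonstrated in Fig. 3. The length-normalized speaker embedding can be directly fed into the output layer, and all the components in the network are optimized jointly with a unified cross-entropy loss function:
$$\ell = -\frac{1}{M} \sum_{i=1}^{M} \log \frac{e^{\mathbf{W}_{y_i}^{T} \tilde{f}(\mathbf{x}_i) + b_{y_i}}}{\sum_{j=1}^{C} e^{\mathbf{W}_{j}^{T} \tilde{f}(\mathbf{x}_i) + b_{j}}} \tag{2}$$
where $M$ is the training batch size, $C$ is the number of output categories, $\tilde{f}(\mathbf{x}_i)$ is the deep normalized embedding, $y_i$ is the corresponding ground truth label, and $\mathbf{W}$ and $\mathbf{b}$ are the weights and bias of the last layer of the network, which acts as a back-end classifier.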
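To make the role of the scale parameter in the loss more concrete, the sketch below evaluates the cross-entropy over a toy batch whose embeddings are first length-normalized and scaled. All names (`batch_loss`, `alpha`, the toy weights and embeddings) are assumptions for illustration; a real system would compute this inside a deep learning framework with automatic differentiation.

```python
import math

def deep_length_norm(embedding, alpha):
    """L2-normalize `embedding`, then scale it to radius `alpha`."""
    norm = math.sqrt(sum(x * x for x in embedding))
    return [alpha * x / norm for x in embedding]

def softmax_cross_entropy(logits, label):
    """Cross-entropy of one sample, computed with the log-sum-exp trick."""
    peak = max(logits)
    log_sum_exp = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_sum_exp - logits[label]

def batch_loss(embeddings, labels, weights, biases, alpha):
    """Mean cross-entropy over a batch of length-normalized embeddings."""
    total = 0.0
    for emb, label in zip(embeddings, labels):
        feat = deep_length_norm(emb, alpha)
        # Output layer: one inner product plus bias per speaker category.
        logits = [
            sum(w * x for w, x in zip(w_row, feat)) + b
            for w_row, b in zip(weights, biases)
        ]
        total += softmax_cross_entropy(logits, label)
    return total / len(embeddings)

# Toy setup (assumed values): 2 speakers, 2-dimensional embeddings.
weights = [[1.0, 0.0], [0.0, 1.0]]
biases = [0.0, 0.0]
embeddings = [[5.0, 0.5], [0.2, 3.0]]
labels = [0, 1]

loss_small_alpha = batch_loss(embeddings, labels, weights, biases, alpha=1.0)
loss_large_alpha = batch_loss(embeddings, labels, weights, biases, alpha=12.0)
```

In this toy case, where both samples are already on the correct side of the decision boundary, the loss decreases as `alpha` grows. This is in line with the observation that the cross-entropy objective alone tends to favour a large scale.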
In total, only a single scalar parameter $\alpha$ is introduced, and it can be trained together with the other components of the network. This scale parameter has a crucial impact on performance since it determines the radius of the length-normalized hyperspace. With a smaller $\alpha$, the network imposes a stronger constraint on a small-radius hyperspace, but faces the risk of not converging.
Therefore, it is vital to choose an appropriate $\alpha$ and normalize the features into a hyperspace with a suitable radius. For elegance, we may prefer to have the parameter $\alpha$ learned automatically by back-propagation. However, because the cross-entropy loss function only takes into account whether the speaker embeddings are separated correctly, it tends to increase the value of $\alpha$ to meet this demand. Therefore, the value of $\alpha$ learned by the network might always be high, which results in a relaxed constraint [23].
A better practice is to treat $\alpha$ as a hyperparameter and fix it to a low-valued constant in order to strengthen the constraint. However, too small an $\alpha$ for a large number of categories may lead to non-convergence. Hence, we should find an optimal balance point for $\alpha$.
Given the number of categories $C$ of a training dataset, in order to achieve a classification probability score of $p$, the authors of [23] give a theoretical lower bound on the scale parameter $\alpha$:
$$\alpha_{\text{low}} = \log \frac{p \, (C - 2)}{1 - p} \tag{3}$$
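As a worked example, and assuming the bound in [23] takes the form log(p (C - 2) / (1 - p)) for C categories and a target probability p, the lower bound can be evaluated directly. The function name `alpha_lower_bound` is hypothetical.

```python
import math

def alpha_lower_bound(num_classes, prob):
    """Lower bound on the scale: log(p * (C - 2) / (1 - p))."""
    return math.log(prob * (num_classes - 2) / (1.0 - prob))

# 1211 training speakers and a target probability of 0.9.
bound = alpha_lower_bound(num_classes=1211, prob=0.9)
```

For 1211 categories and a probability of 0.9, this evaluates to roughly 9.3. The bound grows only logarithmically with the number of categories, so doubling the number of speakers raises it by less than one.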
At the testing stage, speaker embeddings are extracted after the normalization layer. Since the embeddings have already been normalized to unit length, a simple inner product or PLDA can be adopted to obtain the final similarity scores.
4 Experiments
4.1 Data description
VoxCeleb1 is a large-scale text-independent SR dataset collected “in the wild”, which contains over 100,000 utterances from 1251 celebrities [25]. We focus on its open-set verification task.
The development dataset contains 1211 celebrities in total. The testing dataset contains 4715 utterances from the remaining 40 celebrities. There are 37720 trial pairs in total, including 18860 true trial pairs. To evaluate the system performance, we report results in terms of equal error rate (EER) and the minimum of the normalized detection cost function (minDCF) at $P_{\text{target}} = 0.01$ and $P_{\text{target}} = 0.001$, as shown in Table 2 and Table 3.
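For reference, the EER is the operating point at which the false rejection rate on true trials equals the false acceptance rate on impostor trials. The sketch below approximates it by sweeping a threshold over all observed scores. It is a simple quadratic-time illustration with hypothetical names and toy scores, not the scoring tool used in the experiments.

```python
def equal_error_rate(target_scores, nontarget_scores):
    """Approximate EER by sweeping the threshold over all observed scores.

    A trial is accepted when its score is >= threshold. Returns the average
    of the false rejection and false acceptance rates at the threshold where
    the two rates are closest.
    """
    best_gap, best_eer = None, None
    for threshold in sorted(set(target_scores) | set(nontarget_scores)):
        frr = sum(s < threshold for s in target_scores) / len(target_scores)
        far = sum(s >= threshold for s in nontarget_scores) / len(nontarget_scores)
        gap = abs(frr - far)
        if best_gap is None or gap < best_gap:
            best_gap, best_eer = gap, (frr + far) / 2.0
    return best_eer

target = [0.9, 0.8, 0.7, 0.3]     # same-speaker trial scores (toy values)
nontarget = [0.6, 0.4, 0.2, 0.1]  # different-speaker trial scores (toy values)
eer = equal_error_rate(target, nontarget)
```

With these toy scores, one of four true trials is rejected and one of four impostor trials is accepted at the balancing threshold, giving an EER of 25%.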
4.2 Referenced i-vector system
We build a reference i-vector system based on the Kaldi toolkit [26]. First, 20-dimensional mel-frequency cepstral coefficients (MFCC) are augmented with their delta and double-delta coefficients, making 60-dimensional MFCC feature vectors. Then, a frame-level energy-based voice activity detection (VAD) selects features corresponding to speech frames. A 2048-component full-covariance GMM-UBM is trained, along with a 400-dimensional i-vector extractor and a full-rank PLDA.
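Layer | Output size | Downsample | Channels | Blocks
--- | --- | --- | --- | ---
Conv1 | 64 | False | 16 | -
Res1 | 64 | False | 16 | 3
Res2 | 32 | True | 32 | 4
Res3 | 16 | True | 64 | 6
Res4 | 8 | True | 128 | 3
Average pool | 128 | - | - | -
FC (embedding) | 128 | - | - | -
Output | speaker categories | - | - | -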
4.3 End-to-end system
Audio is converted to 64-dimensional log mel-filterbank energies with a frame length of 25 ms, mean-normalized over a sliding window of up to 3 seconds. A frame-level energy-based voice activity detection (VAD) selects features corresponding to speech frames. To obtain a higher-level abstract representation, we design a deep convolutional neural network (CNN) based on the well-known ResNet-34 architecture [27], as described in Table 1. After the front-end deep CNN, we adopt the simplest average pooling layer to extract the utterance-level mean statistics. Therefore, given an input data sequence of shape $64 \times L$, where $L$ denotes the variable number of data frames, we finally obtain a 128-dimensional utterance-level representation.
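The sliding-window mean normalization mentioned at the start of this subsection can be sketched as follows. This is an illustrative plain-Python version under assumptions: the window is centered and expressed in frames, and a default of 300 frames would correspond to 3 seconds only at a 10 ms frame shift, which the text does not state. The function name is hypothetical.

```python
def sliding_mean_normalize(frames, window=300):
    """Subtract from each frame the mean of a centered window of frames.

    `frames` is a list of T feature vectors. The window spans
    2 * (window // 2) + 1 frames and is truncated at the utterance edges.
    """
    num_frames = len(frames)
    dim = len(frames[0])
    half = window // 2
    normalized = []
    for t in range(num_frames):
        start = max(0, t - half)
        end = min(num_frames, t + half + 1)
        span = frames[start:end]
        mean = [sum(f[d] for f in span) / len(span) for d in range(dim)]
        normalized.append([frames[t][d] - mean[d] for d in range(dim)])
    return normalized

# Toy input: 5 frames of 2-dimensional features; the second dimension is a
# constant offset that the normalization should remove entirely.
features = [[10.0 + t, -5.0] for t in range(5)]
normalized = sliding_mean_normalize(features, window=3)
```

The constant dimension becomes zero in every frame, while the slowly rising dimension keeps only its local deviation from the window mean.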
The model is trained with a mini-batch size of 128, using typical stochastic gradient descent with momentum 0.9 and weight decay 1e-4. The learning rate is set to 0.1, 0.01, and 0.001 in turn, and is switched when the training loss plateaus. For each training step, an integer $L$ within a fixed interval is randomly generated, and each sample in the mini-batch is cropped or extended to $L$ frames. After model training has finished, the 128-dimensional speaker embeddings are extracted after the penultimate layer of the neural network.
4.4 Evaluation
We first investigate the setting of the scale parameter $\alpha$. For the systems in Table 3 and Fig. 4, cosine similarity, or equivalently the normalized inner product, is adopted to measure the similarity between speaker embeddings. From Fig. 4, we can observe that the proposed normalized deep embedding system achieves the best minDCF of 0.475 and 0.586 and an EER of 5.01%, which outperforms the baseline system significantly. According to Equation (3), for 1211 speaker categories and a probability score of 0.9, the theoretical lower bound of $\alpha$ is 9. The performance is poor when $\alpha$ is below the lower bound and stable when $\alpha$ is above it. The best $\alpha$ in our experiment is 12, which is slightly larger than the lower bound.
We further compare the effect of the deep length normalization strategy and the traditional extra length normalization in the whole SV pipeline. The results are shown in Table 2. In both the i-vector and the baseline deep speaker embedding systems, the extra length normalization step followed by PLDA scoring achieves the best performance. For the normalized deep speaker embedding systems, since the speaker embeddings extracted from the neural network have already been normalized to unit length, no extra length normalization step is needed. In the testing stage, a simple inner product achieves the best performance, even slightly better than the PLDA scoring result. The reason might be that our normalized speaker embeddings are highly optimized, which could be incompatible with the objective function introduced by PLDA.
5 Conclusions
In this paper, we explore a deep length normalization strategy in an end-to-end SV system. We add an $L_2$ normalization layer followed by a scale layer before the output layer of the deep neural network. This simple yet efficient strategy makes the learned deep speaker embeddings normalized in an end-to-end manner. The value of the scale parameter $\alpha$ is crucial to the system performance, especially when the number of output categories is large. Experiments show that system performance can be significantly improved by setting a proper value of $\alpha$. In the testing stage of an $L_2$-normalized deep embedding system, a simple inner product can achieve state-of-the-art performance.
References
 [1] W. Campbell, D. Sturim, and D. Reynolds, “Support vector machines using GMM supervectors for speaker verification,” IEEE Signal Processing Letters, vol. 13, no. 5, pp. 308–311, 2006.
 [2] T. Kinnunen and H. Li, “An overview of text-independent speaker recognition: From features to supervectors,” Speech Communication, vol. 52, no. 1, pp. 12–40, 2010.
 [3] J. H. Hansen and T. Hasan, “Speaker recognition by machines and humans: A tutorial review,” IEEE Signal Processing Magazine, vol. 32, no. 6, pp. 74–99, 2015.
 [4] D. Reynolds and R. Rose, “Robust text-independent speaker identification using Gaussian mixture speaker models,” IEEE Transactions on Speech & Audio Processing, vol. 3, no. 1, pp. 72–83, 1995.
 [5] N. Dehak, P. Kenny, R. Dehak, P. Dumouchel, and P. Ouellet, “Front-end factor analysis for speaker verification,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 4, pp. 788–798, 2011.
 [6] D. Reynolds, T. Quatieri, and R. Dunn, “Speaker verification using adapted Gaussian mixture models,” in Digital Signal Processing, 2000, pp. 19–41.
 [7] J. Gonzalez-Dominguez, I. Lopez-Moreno, H. Sak, J. Gonzalez-Rodriguez, and P. J. Moreno, “Automatic language identification using long short-term memory recurrent neural networks,” in Proc. INTERSPEECH 2014, 2014.
 [8] L. Chao, M. Xiaokong, J. Bing, L. Xiangang, Z. Xuewei, L. Xiao, C. Ying, K. Ajay, and Z. Zhenyao, “Deep speaker: an end-to-end neural speaker embedding system,” 2017.
 [9] D. Snyder, P. Ghahremani, D. Povey, D. Garcia-Romero, Y. Carmiel, and S. Khudanpur, “Deep neural network-based speaker embeddings for end-to-end speaker verification,” in Proc. IEEE SLT 2017, pp. 165–170.
 [10] W. Cai, J. Chen, and M. Li, “Exploring the encoding layer and loss function in end-to-end speaker and language recognition system,” in Proc. Speaker Odyssey, 2018.
 [11] W. Cai, Z. Cai, W. Liu, X. Wang, and M. Li, “Insights into end-to-end learning scheme for language identification,” in Proc. ICASSP 2018, 2018.
 [12] H. Bredin, “Tristounet: triplet loss for speaker turn embedding,” in Proc. ICASSP 2017, 2017, pp. 5430–5434.
 [13] W. Cai, Z. Cai, X. Zhang, and M. Li, “A novel learnable dictionary encoding layer for end-to-end language identification,” in Proc. ICASSP 2018, 2018.
 [14] D. Garcia-Romero and C. Y. Espy-Wilson, “Analysis of i-vector length normalization in speaker recognition systems,” in Proc. INTERSPEECH, 2011, pp. 249–252.
 [15] D. Snyder, D. Garcia-Romero, G. Sell, D. Povey, and S. Khudanpur, “X-vectors: Robust DNN embeddings for speaker recognition,” in Proc. ICASSP 2018, 2018.
 [16] P. Bousquet, A. Larcher, D. Matrouf, J. Bonastre, and O. Plchot, “Variance-spectra based normalization for i-vector standard and probabilistic linear discriminant analysis,” in Proc. Odyssey, 2012.
 [17] D. Garcia-Romero and A. McCree, “Insights into deep neural networks for speaker recognition,” in Sixteenth Annual Conference of the International Speech Communication Association, 2015.
 [18] S. Prince and J. Elder, “Probabilistic linear discriminant analysis for inferences about identity,” in ICCV 2007, pp. 1–8.
 [19] P. Kenny, “Bayesian speaker verification with heavy-tailed priors,” in Proc. Odyssey Speaker and Language Recognition Workshop, Brno, Czech Republic, 2010.
 [20] C. Zhang and K. Koishida, “End-to-end text-independent speaker verification with triplet loss on short utterances,” in Proc. Interspeech 2017, 2017, pp. 1487–1491.
 [21] F. Schroff, D. Kalenichenko, and J. Philbin, “Facenet: A unified embedding for face recognition and clustering,” in IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 815–823.
 [22] W. Liu, Y. Wen, Z. Yu, M. Li, B. Raj, and L. Song, “Sphereface: Deep hypersphere embedding for face recognition,” in Proc. CVPR 2017, vol. 1, 2017.
 [23] R. Ranjan, C. D. Castillo, and R. Chellappa, “L2-constrained softmax loss for discriminative face verification,” arXiv preprint arXiv:1703.09507, 2017.
 [24] F. Wang, X. Xiang, J. Cheng, and A. L. Yuille, “Normface: L2 hypersphere embedding for face verification,” in Proceedings of the 25th ACM International Conference on Multimedia. ACM, 2017.
 [25] A. Nagrani, J. S. Chung, and A. Zisserman, “Voxceleb: a large-scale speaker identification dataset,” in Proc. INTERSPEECH 2017.
 [26] D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Hannemann, P. Motlicek, Y. Qian, P. Schwarz, J. Silovsky, G. Stemmer, and K. Vesely, “The Kaldi speech recognition toolkit,” in Proc. ASRU 2011.
 [27] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proc. CVPR 2016, 2016, pp. 770–778.