1 Introduction
Recently there have been several attempts to apply Deep Learning (DL) to build speaker embeddings. A speaker embedding usually refers to a single low-dimensional vector representation of the speaker characteristics, extracted from a speech signal using a neural network (NN) model. For text-independent speaker recognition (SR), which is the focus of this work, these models can be trained either in a supervised (e.g., [1, 2]) or in an unsupervised (e.g., [3, 4]) fashion. Supervised speaker embeddings are produced by training a deep architecture using speaker-labeled background data. This network, which is capable of producing high-level features, is usually trained to discriminate between the background speakers. Then, in the testing phase, the output layer is discarded, the feature vectors of an unknown speaker are passed through the network, and the pooled representation of the activations of a given hidden layer is taken as the speaker embedding [1, 2, 5, 6]. The results reported in different works show that, in most cases, the largest improvements over conventional i-vectors are obtained on short utterances [5, 2]. This suggests that DL technology can model the speaker characteristics of a short-duration speech signal better than traditional signal processing techniques.
In [1], the inputs of the network are the speaker feature vectors stacked over a context window. They use a deep neural network (DNN) with max-pooling and dropout regularization applied on the last two hidden layers. Other works employ other deep architectures such as convolutional neural networks (CNN) and time-delay neural networks (TDNN) [5, 7]. Snyder et al. [8] introduced temporal pooling to extract speaker embeddings and a DNN architecture with a PLDA-like objective function. This function operates on pairs of embeddings to maximize the probability for embeddings of the same speaker and minimize it otherwise. In [2], a TDNN is followed by a statistical pooling layer and a DNN classifier. The statistical pooling layer aggregates the variable-length input segments and produces fixed-dimensional statistics vectors as the inputs of a feed-forward network. This second part of the network has only two hidden layers, whose activations can be used as speaker embeddings. The preliminary results showed that these embeddings outperform the traditional i-vectors [9] for short-duration speech segments [2]. More recently, [6] has shown that data augmentation, consisting of added noise and reverberation, can significantly improve the performance of these embeddings (referred to as x-vectors), while it is not as effective for i-vectors. There have also been some efforts to improve the quality and generalization power of x-vectors by modifying the network architecture [10] and the training procedure [11, 12, 13].
Attention mechanisms are one of the main reasons for the success of sequence-to-sequence (seq2seq) models in tasks like neural machine translation (NMT) or automatic speech recognition (ASR) [14, 15, 16]. In seq2seq models, these mechanisms are applied over the encoded sequence to help the decoder decide which region of the sequence must be translated or recognized. In speaker recognition, these models have also been used for text-dependent speaker verification. In works like [5, 17, 18], attention is applied over the hidden states of a recurrent neural network (RNN) in order to pool these states into speaker embeddings. The same idea has also been used for text-independent verification. In [19], a unified framework is introduced for both speaker and language recognition. In this architecture, a variable-length input utterance is fed into a network that encodes an utterance-level representation. In addition to temporal pooling, they also adopt a self-attention pooling mechanism and a learnable dictionary encoding layer to obtain the utterance-level representation. Multi-head attention is a recent attention mechanism, originally proposed in [15] for the Transformer architecture, which has proven very effective in many seq2seq models such as [15, 20, 21].
In this paper we present a multi-head attention based network for speaker verification. This mechanism is used as a self-attentive pooling layer to create an utterance-level embedding. Given a set of encoded representations from a CNN feature extractor, self-attention performs a weighted average of these representations. This mechanism differs from other pooling methods in that the averaging weights are also trained as network parameters. In comparison with other works like [19], our approach introduces a major improvement by using multi-head attention instead of the single-head attention of [19]. This allows the model to attend to different parts of the sequence, which addresses one of the main limitations of vanilla self-attentive pooling. In the same way, multi-head attention also helps the network to attend to different subsets of the encoded representations. Therefore, the encoder is not forced to create overall speaker embeddings at the feature level. Attention allows the encoder to create different sets of features, so the model can attend to the most discriminative patterns from different positions of the sequence. The main contribution of this work is the introduction of a pooling layer that takes advantage of the multi-head self-attention mechanism to create more discriminative speaker embeddings. We compare this pooling mechanism with temporal and statistical pooling layers. In order to show the effectiveness of the proposed approach, these embeddings are assessed in a text-independent speaker verification task.
The rest of this paper is structured as follows. Section 2 explains self multi-head attention pooling. Section 3 describes the architecture of the system. Section 4 gives the details of the system setup. Experimental results are presented in Section 5. Concluding remarks and future work are given in Section 6.
2 Self Multi-Head Attention Pooling
Self-attentive pooling was initially proposed in [19] for text-independent speaker verification. Its objective was to use a trainable and better adapted pooling layer than vanilla temporal averaging. Given a sequence of encoded hidden states from a network, temporal pooling averages these representations over time to obtain a final encoded embedding. The main problem of this method is that it assumes that all the elements of the sequence contribute equally to the utterance-level representation. Self-attentive pooling is a mechanism that, through a trainable layer, assigns a weight to each representation of the sequence. Given these weights, the utterance-level representation is then obtained as the corresponding weighted average of these representations.
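Consider a sequence of encoded hidden states $h = [h_1, h_2, \dots, h_N]$, with $h_t \in \mathbb{R}^{D}$, and a trainable parameter $u \in \mathbb{R}^{D}$. We can define a relevance scalar weight for each element of the sequence through a softmax layer:
$$w_t = \frac{\exp(h_t^{T} u)}{\sum_{l=1}^{N} \exp(h_l^{T} u)} \tag{1}$$
Given the set of weights over all the elements of the sequence, we can then obtain the pooled representation $c$ as the weighted average of the hidden states:
$$c = \sum_{t=1}^{N} w_t h_t \tag{2}$$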
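As an illustration, the pooling of Equations (1) and (2) can be sketched in plain Python. This is a minimal sketch, not the implementation used in this work: it assumes that the relevance score of each step is the dot product between the hidden state and the trainable vector, the function names are hypothetical, and the parameter is fixed by hand instead of being learned.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attentive_pooling(h, u):
    """Pool N hidden states of size D into a single vector of size D.

    h: list of N lists of D floats (the encoded hidden states).
    u: list of D floats (trainable parameter, fixed here for illustration).
    """
    # Relevance score of each step: dot product between hidden state and u.
    scores = [sum(x * y for x, y in zip(h_t, u)) for h_t in h]
    # Softmax turns the scores into weights that sum to one over the sequence.
    w = softmax(scores)
    # Weighted average of the hidden states.
    dim = len(u)
    c = [sum(w_t * h_t[d] for w_t, h_t in zip(w, h)) for d in range(dim)]
    return c, w

# Toy example with N = 3 steps and D = 2 dimensions (illustrative values).
h = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
u = [0.0, 0.0]  # a zero vector yields uniform weights
c, w = self_attentive_pooling(h, u)
```

With a zero parameter vector all the scores are equal, so the weights are uniform and the result reduces to temporal average pooling. A trained parameter would instead give larger weights to the steps it finds more relevant.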
This attention mechanism has some limitations. The main restriction is that the attention weights are calculated from the whole encoded representation at once. Therefore, we assume that all the discriminative information of the signal must come from the same encoded representations of the utterance.
Multi-head attention was first introduced in [15]. This approach consists of splitting the encoded representations of the sequence into homogeneous sub-vectors called heads (Figure 1). If we consider $K$ heads for the multi-head attention, each hidden state $h_t \in \mathbb{R}^{D}$ is now split as $h_t = [h_t^1, h_t^2, \dots, h_t^K]$, where $h_t^j \in \mathbb{R}^{D/K}$. The head size is then $D/K$. In the same way, we also have a trainable parameter $u = [u_1, u_2, \dots, u_K]$, where $u_j \in \mathbb{R}^{D/K}$. A different attention is then applied over each head of the encoded sequence:
$$w_t^j = \frac{\exp\left((h_t^j)^{T} u_j\right)}{\sum_{l=1}^{N} \exp\left((h_l^j)^{T} u_j\right)} \tag{3}$$
where $w_t^j$ corresponds to the attention weight of head $j$ at step $t$ of the $N$-step sequence. If each head corresponds to a subspace of the hidden state, the weight sequence of that head can be considered as a probability distribution of the features of that subspace over the sequence. We then compute a pooled representation for each head in the same way as in vanilla self-attention:
$$c_j = \sum_{t=1}^{N} w_t^j h_t^j \tag{4}$$
where $c_j \in \mathbb{R}^{D/K}$ corresponds to the utterance-level representation from head $j$. The final utterance-level representation is then obtained by concatenating the utterance-level vectors from all the heads, $c = [c_1, c_2, \dots, c_K]$. This method allows the network to extract different kinds of information over different regions of the network. Besides, the main advantage of this attention variant is that it does not increase the complexity of the model by adding parameters. Instead of a global attention vector $u$, we now have a set of attention vectors whose sizes sum to the same number of components as $u$.
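The multi-head variant can be sketched in the same way. The following is a hedged illustration and not the code used in the experiments: it assumes dot-product relevance scores, contiguous and equally sized heads, and hypothetical function names, and it uses hand-picked values in place of trained parameters.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def multi_head_attention_pooling(h, u, num_heads):
    """Pool N hidden states of size D with K independent attention heads.

    h: list of N lists of D floats, u: list of D floats, num_heads: K.
    Returns the concatenated pooled vector (size D) and the K weight lists.
    """
    dim = len(u)
    if dim % num_heads != 0:
        raise ValueError("D must be divisible by the number of heads")
    head_size = dim // num_heads
    pooled, weights = [], []
    for j in range(num_heads):
        lo, hi = j * head_size, (j + 1) * head_size
        u_j = u[lo:hi]                     # attention vector of head j
        heads = [h_t[lo:hi] for h_t in h]  # sub-vectors of head j over time
        scores = [sum(x * y for x, y in zip(h_tj, u_j)) for h_tj in heads]
        w_j = softmax(scores)
        # Weighted average of the sub-vectors of this head.
        c_j = [sum(w * h_tj[d] for w, h_tj in zip(w_j, heads))
               for d in range(head_size)]
        pooled.extend(c_j)                 # concatenate the head outputs
        weights.append(w_j)
    return pooled, weights

# Toy example: N = 3 steps, D = 4 dimensions, K = 2 heads (illustrative values).
h = [[1.0, 0.0, 2.0, 0.0], [0.0, 1.0, 0.0, 2.0], [1.0, 1.0, 2.0, 2.0]]
u = [0.5, -0.5, 0.0, 0.0]
c, w = multi_head_attention_pooling(h, u, num_heads=2)
```

In this toy example the first head prefers the first step while the second head, whose attention vector is zero, weights all steps equally. The parameter vector has the same number of components as in the single-head case; only the way it is partitioned changes.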
3 System Description
Figure 2 shows the overall architecture used for this work. The proposed neural network consists of a CNN-based encoder and an attention-based pooling layer followed by a set of dense layers. The network is fed with variable-length mel-spectrogram features. These features are then mapped into a sequence of speaker representations through a CNN encoder. This CNN feature extractor is based on the VGG proposed in [22] for the ASR task. In our case, we have adapted this architecture to speaker verification. Our adapted VGG is composed of three convolutional blocks, where each block contains two consecutive convolutional layers followed by a max-pooling layer with a 2x2 stride. Hence, given a spectrogram of N frames, the VGG performs a downsampling that reduces its output to a sequence of N/8 representations. Given this set of representations, the attention mechanism is then used to transform the encoded states of the CNN into an overall speaker representation. Finally, this fixed-length embedding is fed into a set of fully connected (FC) layers and a softmax layer. We refer to the bottleneck layer preceding the softmax layer as the speaker embedding. The softmax layer corresponds to the speaker labels of the training partition of the corpus. Hence, the network is trained as a speaker classifier. The speaker embedding layer provides the speaker representation that is used for the speaker verification task.
Table 1: Setup of the CNN feature extractor.
Layer    Size  In Dim.  Out Dim.  Stride  Feat Size
conv11   3x3   1        128       1x1     128xN
conv12   3x3   128      128       1x1     128xN
mpool1   2x2   -        -         2x2     64xN/2
conv21   3x3   128      256       1x1     64xN/2
conv22   3x3   256      256       1x1     64xN/2
mpool2   2x2   -        -         2x2     32xN/4
conv31   3x3   256      512       1x1     32xN/4
conv32   3x3   512      512       1x1     32xN/4
mpool3   2x2   -        -         2x2     16xN/8
flatten  -     512      1         -       8192xN/8
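The feature sizes in the table can be checked with a small calculation. The sketch below is illustrative only: it assumes that the 3x3 convolutions preserve the spatial size (that is, "same" padding), that max pooling uses floor division, and that the input has 128 frequency bins as in the first row of the table. The function name and the number of frames are hypothetical.

```python
def vgg_output_shape(num_mels, num_frames, channels=(128, 256, 512)):
    """Trace the feature-map size through the three convolutional blocks.

    Returns (size of each flattened representation, number of time steps).
    """
    freq, time = num_mels, num_frames
    for _ in channels:
        # The two stride-1 convolutions of the block are assumed to keep
        # (freq, time) unchanged; the 2x2 max pooling halves both axes.
        freq, time = freq // 2, time // 2
    # Flatten the last block's channels and frequency bins per time step.
    return channels[-1] * freq, time

# Example with a hypothetical utterance of 400 frames.
feat_dim, steps = vgg_output_shape(num_mels=128, num_frames=400)
```

For 128 frequency bins the frequency axis shrinks to 16 after the three pooling layers, so each time step is flattened into a vector of 512 x 16 = 8192 values, and the number of time steps is one eighth of the number of input frames.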
4 Experimental Setup
The system proposed in this work is tested on VoxCeleb1 [23]. This corpus is a large multimedia database that contains over utterances for celebrities, extracted from videos uploaded to YouTube. For each person in the corpus there is an average of videos. Each of these videos has been split into approximately short speech utterances of seconds average length. The proposed approaches are evaluated on the original VoxCeleb1 speaker verification protocol [24]. Hence, the network is trained on the VoxCeleb1 development partition and evaluated on the test set.
Three different baselines are considered for comparison with the presented approach. The self multi-head attention pooling is evaluated against two statistics-based methods: temporal and statistical pooling. In order to evaluate them, these pooling layers replace the attention pooling block without modifying any other parameter of the network. The speaker vectors used for the verification tests are extracted from the same speaker embedding layer for each of the pooling methods. The metric used to compute the scores between embeddings for the verification task is the cosine distance. Additionally, we have also considered an i-vector + PLDA baseline [9, 25]. The i-vector is created from MFCC + delta coefficient features. The extraction is performed using a universal background model (UBM) and a total variability matrix. G-PLDA [25] is applied with eigenvector size.
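For reference, the two pooling baselines and the cosine scoring can be sketched as follows. This is a minimal sketch under stated assumptions: statistical pooling is taken to be the concatenation of the per-dimension mean and standard deviation, as in [2], with a population standard deviation; the function names are hypothetical; and in the actual system the scored vectors come from the speaker embedding layer rather than directly from the pooling output.

```python
import math

def temporal_pooling(h):
    """Average the hidden states over time (N x D -> D)."""
    n = len(h)
    return [sum(col) / n for col in zip(*h)]

def statistical_pooling(h):
    """Concatenate per-dimension mean and standard deviation (N x D -> 2D)."""
    n = len(h)
    mean = temporal_pooling(h)
    # Population standard deviation (divides by N); this is an assumption.
    std = [math.sqrt(sum((x - m) ** 2 for x in col) / n)
           for col, m in zip(zip(*h), mean)]
    return mean + std

def cosine_score(a, b):
    """Cosine similarity between two embeddings; higher means more similar."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy sequence with N = 2 steps and D = 2 dimensions (illustrative values).
h = [[1.0, 2.0], [3.0, 2.0]]
mean_vec = temporal_pooling(h)
stats_vec = statistical_pooling(h)
same_direction = cosine_score([1.0, 2.0], [2.0, 4.0])
orthogonal = cosine_score([1.0, 0.0], [0.0, 1.0])
```

Because the cosine score depends only on the angle between the two vectors, scaling an embedding does not change the verification score.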
Table 2: Evaluation results on the VoxCeleb1 text-independent verification task.
Approach                   DCF     EER
I-vector + PLDA            0.0078
CNN + Temporal Pooling     0.0047
CNN + Statistical Pooling  0.0046
CNN + Att. Pooling         0.005
CNN + MHA Pooling          0.0045  4.0
The proposed network has been trained to classify variable-length speaker utterances. For feature extraction we have used librosa [26] to extract dimension mel-spectrograms. The CNN encoder is then fed with x spectrograms to obtain a sequence of x encoded hidden representations. The setup of the CNN feature extractor can be found in Table 1. The pooling layer maps the encoded sequence into a single speaker representation. The following FC block consists of two consecutive dense layers with and dimension, where the last layer corresponds to the speaker embedding. A final softmax layer is then fed with the speaker embedding. Batch normalization has been applied on the dense layer and dropout on the softmax layer. The Adam optimizer has been used to train all the models with standard values and a learning rate of . Finally, we have applied a epochs patience early stopping criterion.
5 Results
The proposed attention pooling layer has been evaluated against different approaches in the VoxCeleb1 text-independent verification task, and the results are presented in Table 2. Performance is evaluated using the equal error rate (EER) and the minimum Decision Cost Function (DCF) calculated using , , and . MHA Pooling refers to the best self multi-head attention model that we have trained. This model has a head size, which corresponds to heads per encoded representation. The i-vector with PLDA has shown the worst results for this task. As mentioned before, i-vector performance decreases in the short-utterance condition. Following the i-vector, the statistics-based pooling layers have scored and EER, respectively. Vanilla self-attentive pooling has shown a EER. Similar to the work proposed in [19], self-attentive pooling does not lead to a big improvement. Here it has only shown a relative improvement. Self multi-head attention has shown the best result of all the evaluated approaches. It outperforms both i-vector + PLDA and statistics-based pooling layers with a EER and EER improvement, respectively. In comparison with the self-attentive pooling layer, MHA also shows a noticeable improvement of EER.
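As a reminder of how the EER is obtained from a list of trial scores, the sketch below sweeps a decision threshold over the observed scores and reports the operating point where the false acceptance and false rejection rates are closest. It is a simple illustration with a hypothetical function name and toy scores, not the evaluation tool used for the reported results, and it assumes that higher scores indicate the same speaker.

```python
def compute_eer(target_scores, nontarget_scores):
    """Approximate the equal error rate by sweeping all observed thresholds."""
    thresholds = sorted(set(target_scores) | set(nontarget_scores))
    best_gap, eer = None, None
    for thr in thresholds:
        # False rejection: a target trial scored below the threshold.
        frr = sum(s < thr for s in target_scores) / len(target_scores)
        # False acceptance: a non-target trial scored at or above the threshold.
        far = sum(s >= thr for s in nontarget_scores) / len(nontarget_scores)
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            # Keep the threshold where both error rates are closest.
            best_gap, eer = gap, (far + frr) / 2
    return eer

# Toy scores (illustrative values only).
target = [0.9, 0.8, 0.7, 0.4]
nontarget = [0.6, 0.3, 0.2, 0.1]
eer = compute_eer(target, nontarget)
```

In this toy example one target trial scores below one non-target trial, so at a threshold of 0.6 both error rates equal 25%. The sweep is quadratic in the number of trials, which is acceptable for a sketch but not for a full evaluation list.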
Figure 3 shows the DET curves of the i-vector + PLDA baseline and our proposed architecture with different pooling mechanisms. It shows that MHA pooling outperforms the other pooling mechanisms by a large margin, not only at the EER but also at all other operating points.
In order to understand the improvement achieved with self multi-head attention pooling in comparison with vanilla attention models, we have examined their attention weights. Figure 4 shows the weight values computed over the encoded features of the CNN for both vanilla and multi-head attention pooling in one of the VoxCeleb1 test utterances. At the top, we can see how each of the heads of the multi-head model attends to different regions of the sequence. This suggests that the model is able to capture subsets of features from the encoded representations in different parts of the signal. At the bottom, the weight values from the vanilla attention model are compared with the averaged weights of the different heads of the multi-head model (cumulative multi-head). Vanilla attention weights have a more uniform distribution over the sequence than the weights shown by the heads of the multi-head model in the top image. If we compare the weight alignments created by vanilla attention and cumulative multi-head, several discriminative regions are detected by both. However, there are some regions of the sequence attended by the MHA model that vanilla attention has not detected. This suggests that the additional degrees of freedom of MHA allow the detection procedure to focus on more specific regions of the sequence.
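The cumulative multi-head weights used in this comparison are simply the per-step average of the head weights. A minimal sketch, with a hypothetical function name and made-up weight values, is:

```python
def cumulative_weights(head_weights):
    """Average K per-head weight sequences (K x N) into one sequence of length N."""
    num_heads = len(head_weights)
    return [sum(step) / num_heads for step in zip(*head_weights)]

# Two heads over a sequence of three steps (illustrative values only).
head_weights = [
    [0.7, 0.2, 0.1],  # head 1 focuses on the first step
    [0.1, 0.1, 0.8],  # head 2 focuses on the last step
]
cum = cumulative_weights(head_weights)
```

Since the weights of each head sum to one over the sequence, the averaged weights also sum to one, which makes them directly comparable with the weights of a single-head model.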
6 Conclusions
In this paper we have applied a self multi-head attention mechanism to obtain speaker embeddings at the utterance level by pooling short-term features. This pooling layer has been tested in a neural network based on a CNN that maps spectrograms into sequences of speaker vectors. These vectors are then input to the pooling layer, whose output activation is connected to a set of dense layers. The network is trained as a speaker classifier, and a bottleneck layer from the fully connected block is used as the speaker embedding. We have compared this approach with other pooling methods on the text-independent verification task, using the speaker embeddings and applying cosine distance. The presented approach has outperformed standard pooling methods based on statistical layers and vanilla attention models. We have also analyzed the multi-head attention alignments over a sequence. This analysis has shown that the self multi-head attention layer is able to capture specific subsets of features over different regions of a sequence.
7 Acknowledgements
This work was supported in part by the Spanish Project DeepVoice (TEC201569266P).
References
 [1] E. Variani, X. Lei, E. McDermott, I. L. Moreno, and J. Gonzalez-Dominguez, “Deep neural networks for small footprint text-dependent speaker verification,” in 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2014, pp. 4052–4056.
 [2] D. Snyder, D. Garcia-Romero, D. Povey, and S. Khudanpur, “Deep neural network embeddings for text-independent speaker verification,” in Interspeech, 2017, pp. 999–1003.
 [3] V. Vasilakakis, S. Cumani, P. Laface, and P. Torino, “Speaker recognition by means of deep belief networks,” Proc. Biometric Technologies in Forensic Science, 2013.
 [4] P. Safari, O. Ghahabi, and F. J. Hernando Pericás, “From features to speaker vectors by means of restricted boltzmann machine adaptation,” in ODYSSEY 2016-The Speaker and Language Recognition Workshop, 2016, pp. 366–371.
 [5] G. Bhattacharya, M. J. Alam, and P. Kenny, “Deep speaker embeddings for short-duration speaker verification,” in Interspeech, 2017, pp. 1517–1521.
 [6] D. Snyder, D. Garcia-Romero, G. Sell, D. Povey, and S. Khudanpur, “X-vectors: Robust dnn embeddings for speaker recognition,” in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018, pp. 5329–5333.
 [7] L. Li, Z. Tang, D. Wang, and T. F. Zheng, “Full-info training for deep speaker feature learning,” in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018, pp. 5369–5373.
 [8] D. Snyder, P. Ghahremani, D. Povey, D. Garcia-Romero, Y. Carmiel, and S. Khudanpur, “Deep neural network-based speaker embeddings for end-to-end speaker verification,” in 2016 IEEE Spoken Language Technology Workshop (SLT). IEEE, 2016, pp. 165–170.
 [9] N. Dehak, P. J. Kenny, R. Dehak, P. Dumouchel, and P. Ouellet, “Front-end factor analysis for speaker verification,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 4, pp. 788–798, 2011.
 [10] S. Novoselov, A. Shulipa, I. Kremnev, A. Kozlov, and V. Shchemelinin, “On deep speaker embeddings for text-independent speaker recognition,” arXiv preprint arXiv:1804.10080, 2018.
 [11] L. Li, Z. Tang, Y. Shi, and D. Wang, “Gaussian-constrained training for speaker verification,” arXiv preprint arXiv:1811.03258, 2018.
 [12] H. Zeinali, L. Burget, J. Rohdin, T. Stafylakis, and J. Cernocky, “How to improve your speaker embeddings extractor in generic toolkits,” arXiv preprint arXiv:1811.02066, 2018.
 [13] Z. Huang, S. Wang, and K. Yu, “Angular softmax for short-duration text-independent speaker verification,” Proc. Interspeech, Hyderabad, 2018.
 [14] D. Bahdanau, K. Cho, and Y. Bengio, “Neural machine translation by jointly learning to align and translate,” arXiv preprint arXiv:1409.0473, 2014.
 [15] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” in Advances in Neural Information Processing Systems, 2017, pp. 5998–6008.
 [16] W. Chan, N. Jaitly, Q. V. Le, and O. Vinyals, “Listen, attend and spell,” arXiv preprint arXiv:1508.01211, 2015.
 [17] S.-X. Zhang, Z. Chen, Y. Zhao, J. Li, and Y. Gong, “End-to-end attention based text-dependent speaker verification,” in 2016 IEEE Spoken Language Technology Workshop (SLT). IEEE, 2016, pp. 171–178.
 [18] F. Chowdhury, Q. Wang, I. L. Moreno, and L. Wan, “Attention-based models for text-dependent speaker verification,” arXiv preprint arXiv:1710.10470, 2017.
 [19] W. Cai, J. Chen, and M. Li, “Exploring the encoding layer and loss function in end-to-end speaker and language recognition system,” in Proc. Odyssey 2018 The Speaker and Language Recognition Workshop, 2018, pp. 74–81.
 [20] C.-C. Chiu, T. N. Sainath, Y. Wu, R. Prabhavalkar, P. Nguyen, Z. Chen, A. Kannan, R. J. Weiss, K. Rao, E. Gonina et al., “State-of-the-art speech recognition with sequence-to-sequence models,” in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018, pp. 4774–4778.
 [21] M. Dehghani, S. Gouws, O. Vinyals, J. Uszkoreit, and Ł. Kaiser, “Universal transformers,” arXiv preprint arXiv:1807.03819, 2018.
 [22] T. Hori, S. Watanabe, Y. Zhang, and W. Chan, “Advances in joint ctc-attention based end-to-end speech recognition with a deep cnn encoder and rnn-lm,” arXiv preprint arXiv:1706.02737, 2017.
 [23] A. Nagrani, J. S. Chung, and A. Zisserman, “Voxceleb: a large-scale speaker identification dataset,” in INTERSPEECH, 2017.
 [24] J. S. Chung, A. Nagrani, and A. Zisserman, “Voxceleb2: Deep speaker recognition,” in INTERSPEECH, 2018.
 [25] S. J. Prince and J. H. Elder, “Probabilistic linear discriminant analysis for inferences about identity,” in 2007 IEEE 11th International Conference on Computer Vision. IEEE, 2007, pp. 1–8.
 [26] B. McFee, M. McVicar, S. Balke, V. Lostanlen, C. Thomé, C. Raffel, D. Lee, K. Lee, O. Nieto, F. Zalkow, D. Ellis, E. Battenberg, R. Yamamoto, J. Moore, Z. Wei, R. Bittner, K. Choi, nullmightybofo, P. Friesch, F.-R. Stöter, Thassilo, M. Vollrath, S. K. Golu, nehz, S. Waloschek, Seth, R. Naktinis, D. Repetto, C. F. Hawthorne, and C. Carr, “librosa/librosa: 0.6.3,” Feb. 2019. [Online]. Available: https://doi.org/10.5281/zenodo.2564164