Speaker verification aims to determine whether a pair of audio recordings corresponds to the same speaker. Given speech signals, speaker verification systems extract speaker identity patterns from the characteristics of the voice. These patterns can be either statistically modelled or encoded into discriminative speaker representations. Over the last few years, researchers have devoted considerable effort to encoding these traits into more discriminative speaker vectors. Current state-of-the-art speaker verification systems are based on deep learning approaches. These architectures are commonly trained as speaker classifiers so that they can be used as speaker embedding extractors. Speaker embeddings are fixed-length vectors extracted from one of the last layers of these deep neural networks. The best-known representation is the x-vector, which has become state-of-the-art for speaker recognition and has also been used for other tasks such as language and emotion recognition [3, 4].
Most of the recent network architectures used for speaker embedding extraction are composed of a front-end feature extractor, a pooling layer, and a set of fully connected (FC) layers. Lately, several architectures have been proposed to encode audio utterances into speaker embeddings for different choices of network inputs. Using MFCC features, the TDNN [5, 6] is currently the most widely used architecture. The TDNN is the x-vector front-end and consists of a stack of 1-D dilated CNNs. The idea behind the TDNN is to encode a sequence of MFCCs into a more discriminative sequence of vectors by capturing long-term feature relations. 2-D CNNs have also shown competitive results for speaker verification. Computer vision architectures such as VGG [7, 8, 9] and ResNet [10, 11, 12] have been adapted to capture speaker-discriminative information from the Mel spectrogram. In fact, ResNet34 has shown better performance than the TDNN in the most recent speaker verification challenges [13, 14]. Finally, there have also been attempts to work directly on the raw signal instead of using hand-crafted features [15, 16, 17].
Given the encoded sequence from the front-end, a pooling layer is adopted to obtain an utterance-level representation. Over the last few years, several studies have addressed different types of pooling strategies. The x-vector originally uses statistical pooling or the self-attentive pooling method proposed in . A wide set of pooling layers based on self attention has been proposed, improving on this vanilla self-attention mechanism. In , several attention models are applied over the same encoded sequence, producing multiple context vectors. In our previous work , the encoded sequence is split into different heads and a different attention model is applied over each head sub-sequence. Attention mechanisms have also been used to improve statistical pooling: in works like , attention is used to extract higher-order feature statistics. Finally, there are also works with competitive results, such as [20, 21, 22], which propose pooling methods independent of self-attention models.
In this paper we present a Double Multi-Head Attention (MHA) pooling layer for speaker verification. The use of this layer is inspired by , where Double MHA is presented as a double attention block that captures feature statistics and makes adaptive feature assignments over images. In this work, this mechanism is used as a combination of two self-attention pooling layers to create utterance-level speaker embeddings. Given a sequence of encoded representations from a CNN, Self MHA first concatenates the context vectors from the attention heads applied over the sub-embedding sequences. An additional self-attention mechanism is then applied over the multi-head context vector. This attention-based pooling summarizes the set of head context vectors into a global speaker representation, pooled through a weighted average of the head context vectors, where the head weights are produced by the self-attention mechanism. On the one hand, this approach allows the model to attend to different parts of the sequence, capturing different subsets of encoded representations at the same time. On the other hand, the pooling layer allows the model to select which head context vectors are the most relevant to produce the global context vector. In comparison with , the second pooling layer operates over the head context vectors produced by an MHA instead of the global descriptors produced by a multi-attention mechanism applied over an image.
2 Proposed Architecture
Our proposed system architecture is illustrated in Figure 1. It uses a CNN-based front-end that takes in variable-length Mel-spectrogram features and outputs a sequence of speaker representations. These representations are then fed to a Double MHA pooling, which is the main contribution of this work. The Double MHA layer comprises a Self MHA pooling and an additional self-attention layer that summarizes the information of the head context vectors into a unique speaker embedding. The combination of Self MHA pooling with this self head attention layer provides a deeper self-attention pooling mechanism (Figure 2). The speaker embedding obtained from the pooling layer is sent through a set of FC layers to predict the speaker posteriors. This network is trained with additive margin softmax (AMS) loss  as a speaker classifier, so that it can be used as a speaker embedding extractor.
2.1 Front-End Feature Extractor
Our feature extractor network is a larger version of the adapted VGG proposed in . This CNN comprises four convolution blocks, each of which contains two concatenated convolutional layers followed by a max pooling with a stride. Hence, given a spectrogram of frames, the VGG performs a down-sampling that reduces its output to a shorter sequence of representations. The output of the VGG is a set of feature maps, which are concatenated into a unique vector sequence. This reshaped sequence of hidden states can then be defined as , where corresponds to the hidden-state dimension.
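As a rough sketch of this reshaping step (the stride, channel count, and input dimensions below are illustrative assumptions, not values from the paper):

```python
import numpy as np

# Hypothetical dimensions for illustration: 300 frames, 80 Mel bins,
# and 256 output channels from the last VGG block.
T, F, C = 300, 80, 256
downsampling = 2 ** 4          # four blocks, each with an assumed stride-2 max pooling

# Simulated VGG output: (channels, time, frequency) feature maps.
feature_maps = np.random.randn(C, T // downsampling, F // downsampling)

# Concatenate the channel and frequency axes into one hidden-state vector
# per time step, giving a sequence of T/16 vectors of size C * F/16.
hidden_states = feature_maps.transpose(1, 0, 2).reshape(T // downsampling, -1)
print(hidden_states.shape)     # (18, 1280)
```

Each row of `hidden_states` is then one element of the sequence that the pooling layer operates on.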
2.2 Self Multi-Head Attention Pooling
The sequence of hidden states output by the front-end feature extractor can be expressed as , with . If we consider heads for the MHA pooling, we can define the hidden state as , where . Hence each feature vector is split into a set of sub-feature vectors of size . In the same way, we also have a trainable parameter , where . A self-attention operation is then applied over each head of the encoded sequence. The weights of each head alignment are defined as:
where corresponds to the attention weight of the head at step of the sequence and corresponds to the hidden-state dimension . If each head corresponds to a subspace of the hidden state, the weight sequence of that head can be considered as a probability density function of that subspace's features over the sequence. We then compute a new pooled representation for each head in the same way as vanilla self attention:
where corresponds to the utterance-level representation of head . The final utterance-level representation is then obtained by concatenating the utterance-level vectors of all the heads . This method allows the network to extract different kinds of information from different regions of the input.
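A minimal NumPy sketch of this pooling, assuming dot-product scoring of each head sub-vector against a per-head trainable vector (the parameter shapes and the 1/√d scaling are assumptions of this sketch, not details from the paper):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_mha_pooling(h, u):
    """Self multi-head attention pooling (illustrative sketch).

    h: (T, H) sequence of hidden states from the front-end.
    u: (heads, H // heads) trainable parameters, one vector per head.
    Returns the (H,) utterance-level vector: the concatenation of the
    per-head context vectors.
    """
    T, H = h.shape
    heads, d = u.shape                      # d = H // heads
    h = h.reshape(T, heads, d)              # split each frame into head sub-vectors
    # Alignment scores: dot product of each head sub-vector with its
    # trainable parameter, softmax-normalized over the time axis.
    scores = np.einsum('thd,hd->th', h, u) / np.sqrt(d)
    w = softmax(scores, axis=0)             # (T, heads), sums to 1 over time
    # Per-head context vectors: attention-weighted average over time.
    c = np.einsum('th,thd->hd', w, h)       # (heads, d)
    return c.reshape(-1)                    # concatenate heads -> (H,)
```

Each head thus attends to its own subspace of the hidden states, and the heads' context vectors are concatenated into one fixed-length vector regardless of the utterance length T.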
2.3 Double Multi-Head Attention
The main disadvantage of Self MHA pooling is that it assumes uniform head relevance: the output context vector is the concatenation of all the head context vectors and is used directly as the input of the following dense layers. Double MHA does not make this assumption. Instead, each utterance context vector is computed as a different linear combination of the head context vectors. A summarized vector is defined as a weighted average over the set of head context vectors , and a self-attention mechanism is used to pool the set of head context vectors and obtain an overall context vector .
where corresponds to the alignment weight of each head and is a trainable parameter. The context vector is then computed as the weighted average of the context vectors among heads. With this method, each utterance context vector is created by scaling the information of the most/least relevant heads. Considering the whole pooling layer, Double MHA allows the model to capture different kinds of speaker patterns in different regions of the input and, at the same time, to weight the relevance of each of these patterns for each utterance.
The number of heads used for this pooling defines both the context vector dimension and how the VGG feature maps are grouped. Considering channels and heads, for each head we create a context vector of dimension that contains a subset of feature maps. Therefore, as the number of heads grows, Double MHA can consider more subsets of features, but the dimension of the final utterance-level context vector decreases. This implies a trade-off between the number of feature subsets we can create and how compressed these features are in the context vector subspace.
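The second, head-level attention stage described above can be sketched as follows (again assuming dot-product scoring against a trainable vector; the scaling is an assumption of this sketch):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def double_mha_pooling(head_contexts, v):
    """Head-level attention stage of Double MHA (illustrative sketch).

    head_contexts: (heads, d) context vectors from Self MHA pooling.
    v: (d,) trainable parameter of the head-level self attention.
    Returns the (d,) utterance-level vector: a weighted average of the
    head context vectors, so heads judged more relevant for this
    utterance contribute more.
    """
    d = v.shape[0]
    scores = head_contexts @ v / np.sqrt(d)  # one relevance score per head
    w = softmax(scores)                      # head weights, sum to 1
    return w @ head_contexts                 # (d,) weighted average
```

Note that, unlike the concatenation in Self MHA, the output dimension here equals the head context vector dimension d, which is why a larger head count shrinks the final utterance-level representation.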
2.4 Fully-Connected Layers
The utterance-level speaker vector obtained from the pooling layer is fed into a set of four FC layers (Figure 1). Each of the first two FC layers is followed by a batch normalization layer and ReLU activations. A dense layer is adopted for the third FC layer, and the last FC layer corresponds to the speaker classification layer. Since AMS is used to train the network, the third layer is set up without activation or batch normalization, as proposed in . Once the network is trained, we can extract a speaker embedding from one of the intermediate FC layers. Following , we consider the second layer as the speaker embedding instead of the third one. The output of this FC layer then corresponds to the speaker representation used for the speaker verification task.
3 Experimental Setup
The proposed system (models are available at https://github.com/miquelindia90/DoubleAttentionSpeakerVerification) has been assessed on the VoxCeleb dataset [27, 7]. VoxCeleb is a large multimedia database that contains more than 1 million utterances from more than 6K celebrities. These utterances are 16 kHz audio chunks extracted from YouTube videos. VoxCeleb has two versions with several evaluation conditions and protocols. For our experiments, the VoxCeleb1 and VoxCeleb2 development partitions have been used to train both the baselines and the presented approach. No data augmentation has been applied to increase the training data. The performance of these systems has been evaluated on the original VoxCeleb1 test set.
Table 1: CNN front-end configuration (Layer, Size, In Dim., Out Dim., Stride, Feat Size).
Two different baselines have been considered for comparison with the presented approach. Double MHA pooling has been evaluated against two self-attention-based pooling methods: vanilla self attention and Self MHA. In order to evaluate them, these mechanisms replace the pooling layer of the system (Figure 1) without modifying any other block or parameter of the network. The speaker embeddings used for the verification tests have been extracted from the same FC layer for each of the pooling methods. Cosine distance has been used to compute the scores between pairs of speaker embeddings.
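The cosine scoring of a verification trial can be sketched as follows (the acceptance threshold here is illustrative; in practice it is tuned on a development set):

```python
import numpy as np

def cosine_score(e1, e2):
    """Cosine similarity between two speaker embeddings; higher scores
    mean the trial is more likely a same-speaker pair."""
    return float(e1 @ e2 / (np.linalg.norm(e1) * np.linalg.norm(e2)))

def verify(e1, e2, threshold=0.5):
    """Accept the trial as a same-speaker pair when the score exceeds
    the (illustrative) threshold."""
    return cosine_score(e1, e2) > threshold
```

Sweeping this threshold over the trial scores is what produces the DET curves and the EER/DCF operating points reported later.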
The proposed network has been trained to classify variable-length speaker utterances. As input features we have used log Mel spectrograms with ms-length Hamming windows and a ms window shift. The audios have not been filtered with any voice activity detection (VAD) system, and a 0.97-coefficient pre-emphasis has been applied. The audio features have only been normalized with cepstral mean normalization (CMN). The CNN encoder is then fed with the spectrograms to obtain a sequence of encoded hidden representations. For training we have used batches of N=350-frame audio chunks, while for testing the whole utterances have been encoded. The setup of the CNN feature extractor can be found in Table 1. For the pooling layer we have tuned the number of heads for both Self MHA and Double MHA. For the presented CNN setup we have considered 8, 16, and 32 heads, which imply head context vectors of dimension 640, 320, and 160, respectively. The last block of the system consists of four consecutive FC layers. The first three dense layers have dimension . The last FC layer has a dimension that corresponds to the number of training speaker labels. Batch normalization has been applied only on the first two dense layers, as mentioned in subsection 2.4. The network has been trained with AMS loss with and hyper-parameters. The batch size is set to samples, and the Adam optimizer has been used to train all the models with learning rate and weight decay. During training we have used a 15-patience early-stopping criterion, where the models have been validated every batches.
The proposed approach has been evaluated against different attention methods in the VoxCeleb text-independent verification task. Performance is evaluated using the equal error rate (EER) and the detection cost function (DCF) calculated using , , and . The results of this task are presented in Figure 3 and Table 2: DET curves are shown in Figure 3, and both EER and DCF metrics are reported in Table 2. Double MHA is referred to as DMHA in both analyses.
Self-attention pooling has shown the worst results for this task compared to the best-tuned Self MHA and Double MHA approaches. Compared to self attention, Self MHA has shown better results with 16 heads and worse results with both 8 and 32 heads. With 16 heads, Self MHA has shown a relative EER improvement in comparison with self-attention pooling. However, the DCF has only improved from to . With 8 and 32 heads, the Self MHA EER has degraded by and , respectively. Double MHA has shown better results with 16 heads than both the self attention and Self MHA approaches: a relative EER improvement in comparison with self attention and a relative improvement compared with 16-head Self MHA. In terms of DCF, Double MHA has shown the best result with a . If we compare Double MHA and Self MHA with 8 heads, Double MHA is better in terms of DCF but has not improved in terms of EER: the DCF has improved from to , but the EER has remained the same at . Double MHA with 32 heads has shown the worst results in comparison with both Self MHA and self attention, with a EER and a 0.0032 DCF.
As the results show, the best performance of the MHA-based approaches is achieved with 16 heads. Besides the verification metrics, Table 2 also indicates the head and global context vector dimensions. As discussed in subsection 2.3, the dimension in Self MHA and both the and dimensions in Double MHA are inversely proportional to the number of heads. Therefore, there is a trade-off between the number of heads and system performance, which is related to the context vector dimensions. The worst Double MHA performance is achieved with 32 heads. This setup implies that both the and dimensions are 160. This value can be considered small compared to current state-of-the-art speaker embeddings, whose dimensions range between 200 and 1500. Therefore, the system performance with 32 heads is worse because the context vector subspace is not big enough to encode all the discriminative speaker information from the CNN output. On the other hand, the larger the number of heads, the more subsets of speaker features can be captured over the CNN encoded sequence. With 8 heads, 640-dimensional head context vectors are extracted, and with 16 heads, head context vectors have 320 dimensions. Both the Self MHA and Double MHA approaches have shown the best results with 16 heads, which implies 320-dimensional context vectors. Therefore, the CNN output feature maps are more efficiently grouped into subsets of channels, which correspond to sub-sequences of 320-dimensional embeddings. Considering the set of 16 context vectors pooled in that layer, these representations are efficiently averaged by Double MHA into unique 320-dimensional utterance-level speaker representations.
In this paper we have implemented a Double Multi-Head Attention mechanism to obtain utterance-level speaker embeddings by pooling short-term representations. The proposed pooling layer is composed of a Self Multi-Head Attention pooling and a self-attention mechanism that summarizes the context vectors of the heads into a unique speaker vector. This pooling layer has been tested in a neural network based on a CNN that maps spectrograms into sequences of speaker vectors. These vectors are then input to the proposed pooling layer, whose output is connected to a set of dense layers. The network is trained as a speaker classifier, and a bottleneck layer from these fully connected layers is used as the speaker embedding. We have evaluated this approach against other pooling methods on the text-independent verification task, using the speaker embeddings with cosine distance scoring. The presented approach has outperformed both vanilla Self Attention and Self Multi-Head Attention poolings.
This work was supported in part by the Spanish Project DeepVoice (TEC2015-69266-P).
-  O. Ghahabi, P. Safari, and J. Hernando, “Deep learning in speaker recognition,” in Development and Analysis of Deep Learning Architectures. Springer, 2020, pp. 145–169.
-  D. Snyder, D. Garcia-Romero, G. Sell, D. Povey, and S. Khudanpur, “X-vectors: Robust dnn embeddings for speaker recognition,” in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018, pp. 5329–5333.
-  D. Snyder, D. Garcia-Romero, A. McCree, G. Sell, D. Povey, and S. Khudanpur, “Spoken language recognition using x-vectors,” in Odyssey, 2018, pp. 105–111.
-  R. Pappagari, T. Wang, J. Villalba, N. Chen, and N. Dehak, “x-vectors meet emotions: A study on dependencies between emotion and speaker recognition,” arXiv preprint arXiv:2002.05039, 2020.
-  D. Snyder, P. Ghahremani, D. Povey, D. Garcia-Romero, Y. Carmiel, and S. Khudanpur, “Deep neural network-based speaker embeddings for end-to-end speaker verification,” in 2016 IEEE Spoken Language Technology Workshop (SLT). IEEE, 2016, pp. 165–170.
-  D. Snyder, D. Garcia-Romero, D. Povey, and S. Khudanpur, “Deep neural network embeddings for text-independent speaker verification,” in Interspeech, 2017, pp. 999–1003.
-  J. S. Chung, A. Nagrani, and A. Zisserman, “Voxceleb2: Deep speaker recognition,” in INTERSPEECH, 2018.
-  ——, “Voxceleb2: Deep speaker recognition,” arXiv preprint arXiv:1806.05622, 2018.
-  M. India, P. Safari, and J. Hernando, “Self Multi-Head Attention for Speaker Recognition.”
-  G. Bhattacharya, J. Alam, and P. Kenny, “Deep speaker recognition: Modular or monolithic?” in Proc. Interspeech, 2019, pp. 1143–1147.
-  J. Zhou, T. Jiang, Z. Li, L. Li, and Q. Hong, “Deep speaker embedding extraction with channel-wise feature responses and additive supervision softmax loss function,” in Proc. Interspeech 2019, pp. 2883–2887, 2019.
-  A. Hajavi and A. Etemad, “A deep neural network for short-segment speaker recognition,” arXiv preprint arXiv:1907.10420, 2019.
-  J. S. Chung, A. Nagrani, E. Coto, W. Xie, M. McLaren, D. A. Reynolds, and A. Zisserman, “Voxsrc 2019: The first voxceleb speaker recognition challenge,” arXiv preprint arXiv:1912.02522, 2019.
-  H. Zeinali, S. Wang, A. Silnova, P. Matějka, and O. Plchot, “But system description to voxceleb speaker recognition challenge 2019,” arXiv preprint arXiv:1910.12592, 2019.
-  M. Ravanelli and Y. Bengio, “Speaker recognition from raw waveform with sincnet,” arXiv preprint arXiv:1808.00158, 2018.
-  J.-W. Jung, H.-S. Heo, I.-H. Yang, H.-J. Shim, and H.-J. Yu, “Avoiding speaker overfitting in end-to-end dnns using raw waveform for text-independent speaker verification,” extraction, vol. 8, no. 12, pp. 23–24, 2018.
-  J.-w. Jung, H.-S. Heo, J.-h. Kim, H.-j. Shim, and H.-J. Yu, “Rawnet: Advanced end-to-end deep neural network using raw waveforms for text-independent speaker verification,” arXiv preprint arXiv:1904.08104, 2019.
-  Y. Zhu, T. Ko, D. Snyder, B. Mak, and D. Povey, “Self-attentive speaker embeddings for text-independent speaker verification.” in Interspeech, 2018, pp. 3573–3577.
-  K. Okabe, T. Koshinaka, and K. Shinoda, “Attentive statistics pooling for deep speaker embedding,” arXiv preprint arXiv:1803.10963, 2018.
-  W. Cai, J. Chen, and M. Li, “Exploring the encoding layer and loss function in end-to-end speaker and language recognition system,” in Proc. Odyssey 2018 The Speaker and Language Recognition Workshop, 2018, pp. 74–81.
-  W. Xie, A. Nagrani, J. S. Chung, and A. Zisserman, “Utterance-level aggregation for speaker recognition in the wild,” in ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019, pp. 5791–5795.
-  Y. Jung, Y. Kim, H. Lim, Y. Choi, and H. Kim, “Spatial pyramid encoding with convex length normalization for text-independent speaker verification,” arXiv preprint arXiv:1906.08333, 2019.
-  Y. Chen, Y. Kalantidis, J. Li, S. Yan, and J. Feng, “A^2-Nets: Double attention networks,” in Advances in Neural Information Processing Systems, 2018, pp. 352–361.
-  Y. Liu, L. He, and J. Liu, “Large margin softmax loss for speaker verification,” arXiv preprint arXiv:1904.03479, 2019.
-  S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” arXiv preprint arXiv:1502.03167, 2015.
-  H. Zeinali, L. Burget, J. Rohdin, T. Stafylakis, and J. H. Cernocky, “How to improve your speaker embeddings extractor in generic toolkits,” in ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019, pp. 6141–6145.
-  A. Nagrani, J. S. Chung, and A. Zisserman, “Voxceleb: a large-scale speaker identification dataset,” in INTERSPEECH, 2017.