Double Multi-Head Attention for Speaker Verification

07/26/2020 ∙ by Miquel India, et al. ∙ Universitat Politècnica de Catalunya

Most state-of-the-art Deep Learning systems for speaker verification are based on speaker embedding extractors. These architectures are commonly composed of a feature extractor front-end together with a pooling layer to encode variable-length utterances into fixed-length speaker vectors. In this paper we present Double Multi-Head Attention pooling, which extends our previous approach based on Self Multi-Head Attention. An additional self attention layer is added to the pooling layer to summarize the context vectors produced by Multi-Head Attention into a unique speaker representation. This method enhances the pooling mechanism by weighting the information captured by each head, which results in more discriminative speaker embeddings. We have evaluated our approach with the VoxCeleb2 dataset. Our results show 9.19% and 4.29% relative improvement in terms of EER compared to Self Attention pooling and Self Multi-Head Attention, respectively. According to the obtained results, Double Multi-Head Attention has been shown to be an excellent approach to efficiently select the most relevant features captured by the CNN-based front-ends from the speech signal.




1 Introduction

Speaker verification aims to determine whether a pair of audio recordings corresponds to the same speaker. Given speech signals, speaker verification systems are able to extract speaker identity patterns from the characteristics of the voice. These patterns can be either statistically modelled or encoded into discriminative speaker representations. Over the last few years, researchers have put considerable effort into encoding these traits into more discriminative speaker vectors. Current state-of-the-art speaker verification systems are based on Deep Learning (DL) approaches. These architectures are commonly trained as speaker classifiers in order to be used as speaker embedding extractors. Speaker embeddings are fixed-length vectors extracted from some of the last layers of these Deep Neural Networks (DNNs) [1]. The best-known representation is the x-vector [2], which has become state-of-the-art for speaker recognition and has also been used for other tasks such as language and emotion recognition [3, 4].

Most of the recent network architectures used for speaker embedding extraction are composed of a front-end feature extractor, a pooling layer, and a set of Fully Connected (FC) layers. Lately, several architectures have been proposed to encode audio utterances into speaker embeddings for different choices of network inputs. Using Mel-Frequency Cepstral Coefficient (MFCC) features, the Time Delay Neural Network (TDNN) [5, 6] is currently the most widely used architecture. The TDNN is the x-vector front-end and consists of a stack of 1-D dilated Convolutional Neural Networks (CNNs). The idea behind the TDNN is to encode a sequence of MFCCs into a more discriminative sequence of vectors by capturing long-term feature relations. 2-D CNNs have also shown competitive results for speaker verification. Computer Vision architectures such as VGG [7, 8, 9] and ResNet [10, 11, 12] have been adapted to capture speaker discriminative information from the Mel-Spectrogram. In fact, ResNet34 has shown better performance than the TDNN in the most recent speaker verification challenges [13, 14]. Finally, there have also been some attempts to work directly on the raw signal instead of using hand-crafted features [15, 16, 17].

Given the encoded sequence from the front-end, a pooling layer is adopted to obtain an utterance-level representation. During the last few years, several studies have addressed different types of pooling strategies. The x-vector originally uses statistical pooling [6] or the Self Attentive pooling method proposed in [18]. A wide set of pooling layers based on self attention have been proposed to improve this vanilla self attention mechanism. In [18], several attentions are applied over the same encoded sequence, producing multiple context vectors. In our previous work [9], the encoded sequence is split into different heads and a different attention model is applied over each head sub-sequence. Attention mechanisms have also been used to improve statistical pooling. In works like [19], attention is used to extract better feature statistics. Finally, there are also works with competitive results such as [20, 21, 22], which propose pooling methods independent of self attention models.

In this paper we present a Double Multi-Head Attention (MHA) pooling layer for speaker verification. The use of this layer is inspired by [23], where Double MHA is presented as a double attention block which captures feature statistics and makes adaptive feature assignment over images. In this work, this mechanism is used as a combination of two self attention pooling layers to create utterance-level speaker embeddings. Given a sequence of encoded representations from a CNN, Self MHA first concatenates the context vectors from the head attentions applied over the sub-embedding sequences. An additional self attention mechanism is then applied over the multi-head context vector. This attention-based pooling summarizes the set of head context vectors into a global speaker representation. This representation is pooled through a weighted average of the head context vectors, where the head weights are produced by the self attention mechanism. On the one hand, this approach allows the model to attend to different parts of the sequence, capturing at the same time different subsets of encoded representations. On the other hand, the pooling layer allows the model to select which head context vectors are the most relevant to produce the global context vector. In comparison with [23], the second pooling layer operates over the head context vectors produced by an MHA instead of the global descriptors produced by a self multi attention mechanism applied over an image.

2 Proposed Architecture

Our proposed system architecture is illustrated in Figure 1. It uses a CNN-based front-end which takes in a set of variable-length mel-spectrogram features and outputs a sequence of speaker representations. These speaker representations are then passed to a Double MHA pooling layer, which is the main contribution of this work. The Double MHA layer comprises a Self MHA pooling and an additional Self Attention layer that summarizes the information of each head context vector into a unique speaker embedding. The combination of Self MHA pooling with this Self Head Attention layer provides a deeper self-attention pooling mechanism (Figure 2). The speaker embedding obtained from the pooling layer is sent through a set of FC layers to predict the speaker posteriors. This network architecture is trained with Additive Margin Softmax (AMS) loss [24] as a speaker classifier so that it can be used as a speaker embedding extractor.

2.1 Front-End Feature Extractor

Our feature extractor network is a larger version of the adapted VGG proposed in [9]. This CNN comprises four convolution blocks, each of which contains two concatenated convolutional layers followed by a max pooling with a 2x2 stride. Hence, given a spectrogram of $N$ frames, the VGG performs a down-sampling that reduces its output to a sequence of $N/16$ representations. The output of the VGG is a set of $M$ feature maps with $N/16 \times D'$ dimension ($M = 1024$ and $D' = 5$ in the setup of Table 1). These feature maps are concatenated into a unique vector sequence. This reshaped sequence of hidden states can now be defined as $h \in \mathbb{R}^{N/16 \times D}$, where $D = M D'$ corresponds to the hidden state dimension.

2.2 Self Multi-Head Attention Pooling

The sequence of hidden states output from the front-end feature extractor can be expressed as $h = [h_1, h_2, \ldots, h_T]$ with $h_t \in \mathbb{R}^{D}$, where $T = N/16$ is the sequence length. If we consider a number of $K$ heads for the MHA pooling, we can now define the hidden state as $h_t = [h_{t1}, h_{t2}, \ldots, h_{tK}]$, where $h_{tj} \in \mathbb{R}^{D/K}$. Hence each feature vector is split into a set of sub-feature vectors of size $D/K$. In the same way, we also have a trainable parameter $u = [u_1, u_2, \ldots, u_K]$, where $u_j \in \mathbb{R}^{D/K}$. A self attention operation is then applied over each head of the encoded sequences. The weights of each head alignment are defined as:

$$w_{tj} = \frac{\exp\left(h_{tj}^{\top} u_j / \sqrt{d_h}\right)}{\sum_{l=1}^{T} \exp\left(h_{lj}^{\top} u_j / \sqrt{d_h}\right)}$$

where $w_{tj}$ corresponds to the attention weight of head $j$ on step $t$ of the sequence and $d_h$ corresponds to the hidden state dimension $D/K$. If each head corresponds to a subspace of the hidden state, the weight sequence of that head can be considered a probability density function (pdf) of that subspace's features over the sequence. We then compute a new pooled representation for each head in the same way as in vanilla self attention:

$$c_j = \sum_{t=1}^{T} w_{tj} h_{tj}$$

where $c_j \in \mathbb{R}^{D/K}$ corresponds to the utterance-level representation from head $j$. The final utterance-level representation is then obtained by concatenating the utterance-level vectors from all the heads, $[c_1, c_2, \ldots, c_K]$. This method allows the network to extract different kinds of information over different regions of the network.
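The pooling described in this subsection can be sketched in NumPy as follows. This is only an illustrative sketch under stated assumptions, not the authors' implementation: the function names, the random inputs, and the toy sequence length are hypothetical, and the heads are assumed to be contiguous slices of each hidden state.

```python
import numpy as np


def softmax(x, axis):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def self_mha_pooling(h, u):
    """Pool a (T, D) sequence into K head context vectors of size D/K.

    h: (T, D) hidden states from the front-end.
    u: (K, D/K) one trainable attention vector per head.
    Returns (context, weights) with shapes (K, D/K) and (T, K).
    """
    T, D = h.shape
    K, d_h = u.shape
    assert K * d_h == D, "D must be divisible by the number of heads"
    heads = h.reshape(T, K, d_h)                # split each h_t into K sub-vectors
    scores = np.einsum("tkd,kd->tk", heads, u) / np.sqrt(d_h)
    weights = softmax(scores, axis=0)           # normalise over time, per head
    context = np.einsum("tk,tkd->kd", weights, heads)
    return context, weights


rng = np.random.default_rng(0)
T, D, K = 20, 5120, 16                          # toy length; hypothetical inputs
h = rng.standard_normal((T, D))
u = rng.standard_normal((K, D // K))
context, weights = self_mha_pooling(h, u)
utterance_vector = context.reshape(-1)          # concatenation of the K heads
```

Each column of `weights` sums to one over time, which matches the reading of the per-head weight sequence as a distribution over the sequence. With an all-zero `u`, every head reduces to plain temporal averaging.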

2.3 Double Multi-Head Attention

Figure 1: System Architecture.

The main disadvantage of Self MHA pooling is that it assumes uniform head relevance. The output context vector is the concatenation of all head context vectors and it is used as input to the following dense layers. Double MHA does not make that assumption. Instead, each utterance context vector is computed as a different linear combination of head context vectors. A summarized vector $c$ is then defined as a weighted average over the set of head context vectors $c_j$. A self attention mechanism is used to pool the set of head context vectors $c_j$ and obtain an overall context vector $c$:

$$w'_j = \frac{\exp\left(c_j^{\top} u'\right)}{\sum_{l=1}^{K} \exp\left(c_l^{\top} u'\right)}$$

$$c = \sum_{j=1}^{K} w'_j c_j$$

where $w'_j$ corresponds to the aligned weight of each head $j$ and $u' \in \mathbb{R}^{D/K}$ is a trainable parameter. The context vector $c$ is then computed as the weighted average of the context vectors among heads. With this method, each utterance context vector is created by scaling the information of the most and least relevant heads. Considering the whole pooling layer, Double MHA makes it possible to capture different kinds of speaker patterns in different regions of the input, and at the same time to weight the relevance of each of these patterns for each utterance.

Figure 2: An example of Double MHA pooling.
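The second attention step, which turns the set of head context vectors into a single utterance-level vector, can be sketched as below. This is a minimal sketch rather than the authors' code: the function name, the variable names, and the random inputs are hypothetical, and no score scaling is applied in this second attention because the text does not mention one.

```python
import numpy as np


def head_attention_pooling(head_contexts, u_prime):
    """Second attention: weighted average of the K head context vectors.

    head_contexts: (K, d) context vectors, one per head.
    u_prime: (d,) trainable attention vector shared by all heads.
    Returns (c, w) with shapes (d,) and (K,).
    """
    scores = head_contexts @ u_prime            # one scalar score per head
    scores = scores - scores.max()              # stabilise the exponentials
    w = np.exp(scores) / np.exp(scores).sum()   # softmax over heads
    c = w @ head_contexts                       # weighted average, shape (d,)
    return c, w


rng = np.random.default_rng(0)
K, d = 16, 320                                  # hypothetical toy setup
head_contexts = rng.standard_normal((K, d))
u_prime = 0.05 * rng.standard_normal(d)
c, w = head_attention_pooling(head_contexts, u_prime)
```

With an all-zero `u_prime` the head weights become uniform, so the output is the plain mean of the head context vectors. That is the uniform-relevance case that Double MHA is meant to relax.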

The number of heads used for this pooling defines both the context vector dimension and how the VGG feature maps are grouped. Considering $M$ channels and $K$ heads, each head produces a context vector of dimension $D/K$ (with $D$ the hidden state dimension), which contains a subset of $M/K$ feature maps. Therefore, as the number of heads grows, Double MHA can consider more subsets of features while the dimension of the final utterance-level context vector decreases. This implies a trade-off between the number of feature subsets we can create and how compressed these features are in the context vector subspace.
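The trade-off can be made concrete with a few lines of arithmetic. The sketch below uses the 1024 channels and 5120-dimensional hidden states listed in Table 1. The helper name is hypothetical, and it assumes the flattened vector is laid out so that each head receives whole feature maps.

```python
def head_layout(num_channels, hidden_dim, num_heads):
    """Return (context vector dimension, feature maps per head)."""
    if hidden_dim % num_heads or num_channels % num_heads:
        raise ValueError("heads must divide both channels and hidden dimension")
    return hidden_dim // num_heads, num_channels // num_heads


# Head counts considered in the experiments.
layouts = {k: head_layout(1024, 5120, k) for k in (8, 16, 32)}
```

Doubling the number of heads halves both the context vector dimension and the number of feature maps grouped in each head.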

2.4 Fully-Connected Layers

The utterance-level speaker vector obtained from the pooling layer is fed into a set of four FC layers (Figure 1). Each of the first two FC layers is followed by a batch normalization layer [25] and ReLU activations. A dense layer is adopted for the third FC layer and the last FC layer corresponds to the speaker classification layer. Since AMS is used to train the network, the third layer is set up without activation and batch normalization, as proposed in [24]. Once the network is trained, we can extract a speaker embedding from one of the intermediate FC layers. Following [26], we consider the second layer as the speaker embedding instead of the third one. The output of this FC layer then corresponds to the speaker representation that will be used for the speaker verification task.
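For readers unfamiliar with additive margin softmax, the loss applied on top of the classification layer can be sketched as follows. This is a generic NumPy sketch of the usual AMS formulation, not the authors' training code. The function name is hypothetical, and the default scale `s` and margin `m` are illustrative assumptions rather than values taken from this paper.

```python
import numpy as np


def ams_loss(embeddings, class_weights, labels, s=30.0, m=0.4):
    """Mean additive margin softmax loss.

    embeddings: (B, E) outputs of the layer before the classifier.
    class_weights: (C, E) one weight vector per training speaker.
    labels: (B,) integer speaker indices.
    s, m: scale and margin; the defaults are illustrative assumptions.
    """
    x = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    w = class_weights / np.linalg.norm(class_weights, axis=1, keepdims=True)
    cos = x @ w.T                                   # (B, C) cosine similarities
    rows = np.arange(len(labels))
    logits = s * cos
    logits[rows, labels] = s * (cos[rows, labels] - m)  # margin on the target only
    logits = logits - logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_prob[rows, labels].mean()
```

Because both the embeddings and the class weights are length-normalized, the logits depend only on cosine similarity, which is consistent with scoring trials by cosine distance at test time.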

3 Experimental Setup

The system proposed in this work, whose models have been made available, has been assessed on the VoxCeleb dataset [27, 7]. VoxCeleb is a large multimedia database that contains more than 1 million utterances from more than 6K celebrities. These utterances are 16 kHz audio chunks extracted from YouTube videos. VoxCeleb has two different versions with several evaluation conditions and protocols. For our experiments, the VoxCeleb1 and VoxCeleb2 development partitions have been used to train both the baseline and the presented approaches. No data augmentation has been applied to increase the training data. The performance of these systems has been evaluated on the original VoxCeleb1 test set.

Layer    Size  In Dim.  Out Dim.  Stride  Feat Size
conv11   3x3   1        128       1x1     Nx80
conv12   3x3   128      128       1x1     Nx80
mpool1   2x2   -        -         2x2     N/2x40
conv21   3x3   128      256       1x1     N/2x40
conv22   3x3   256      256       1x1     N/2x40
mpool2   2x2   -        -         2x2     N/4x20
conv31   3x3   256      512       1x1     N/4x20
conv32   3x3   512      512       1x1     N/4x20
mpool3   2x2   -        -         2x2     N/8x10
conv41   3x3   512      1024      1x1     N/8x10
conv42   3x3   1024     1024      1x1     N/8x10
mpool4   2x2   -        -         2x2     N/16x5
flatten  -     1024     1         -       N/16x5120
Table 1: CNN architecture. In Dim. and Out Dim. refer to the number of input and output feature maps of the layer. Feat Size refers to the dimension of each of these output feature maps.
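The feature sizes in Table 1 can be traced with a short helper. This is a sketch under two assumptions: the 3x3 convolutions preserve the map size, as the table indicates, and the max pooling rounds odd sizes down. The function name is hypothetical.

```python
def vgg_output_shape(num_frames, num_mels=80, channels=(128, 256, 512, 1024)):
    """Trace feature-map sizes through four conv blocks with 2x2 max pooling.

    The 3x3 convolutions are assumed to preserve the map size, so only the
    pooling layers change it (floor division models odd sizes).
    """
    time, freq = num_frames, num_mels
    for _ in channels:
        time, freq = time // 2, freq // 2       # 2x2 max pooling, stride 2x2
    hidden_dim = channels[-1] * freq            # flatten channels x frequency
    return time, hidden_dim


seq_len, hidden_dim = vgg_output_shape(352)
```

For a 352-frame input this gives a sequence of 22 vectors of dimension 5120, in line with the N/16x5120 entry of the flatten row.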

Two different baselines have been considered for comparison with the presented approach. Double MHA pooling has been evaluated against two self attentive pooling methods: vanilla Self Attention and Self MHA. In order to evaluate them, these mechanisms have replaced the pooling layer of the system (Figure 1) without modifying any other block or parameter of the network. The speaker embeddings used for the verification tests have been extracted from the same FC layer for each of the pooling methods. Cosine distance has been used to compute the scores between pairs of speaker embeddings.
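Scoring a trial with cosine similarity can be sketched as below. The function names and the decision threshold are hypothetical. In practice the threshold would be tuned on held-out trials.

```python
import numpy as np


def cosine_score(emb_a, emb_b):
    """Cosine similarity between two speaker embeddings (higher = more similar)."""
    a = np.asarray(emb_a, dtype=float)
    b = np.asarray(emb_b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


def same_speaker(emb_a, emb_b, threshold):
    """Accept the trial when the score reaches a tuned threshold."""
    return cosine_score(emb_a, emb_b) >= threshold
```

The score is invariant to the scale of either embedding, so only the direction of the vectors matters.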

The proposed network has been trained to classify variable-length speaker utterances. As input features we have used 80 dimension log Mel Spectrograms computed with Hamming windows. The audios have not been filtered with any Voice Activity Detection (VAD) system, and pre-emphasis with a 0.97 coefficient has been applied. The audio features have only been normalized with Cepstral Mean Normalization (CMN). The CNN encoder is then fed with $N \times 80$ spectrograms to obtain a sequence of $N/16 \times 5120$ encoded hidden representations. For training we have used batches of N=350 frame audio chunks, whereas for testing the whole utterances have been encoded. The setup of the CNN feature extractor can be found in Table 1. For the pooling layer we have tuned the number of heads for both Self MHA and Double MHA. For the presented CNN setup we have considered 8, 16, and 32 heads, which implies head context vectors of dimension 640, 320, and 160, respectively. The last block of the system consists of four consecutive FC layers. The first three dense layers share the same dimension, and the dimension of the last FC layer corresponds to the number of training speaker labels. Batch normalization has been applied only on the first two dense layers, as mentioned in subsection 2.4. The network has been trained with AMS loss, and the Adam optimizer with weight decay has been used to train all the models. During training we have used an early stopping criterion with a patience of 15, with the models validated periodically.
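The two signal-level steps mentioned above, pre-emphasis and mean normalization, can be sketched in NumPy. This is only a sketch with hypothetical function names. It assumes the common first-order pre-emphasis filter and a per-utterance mean computed over the time axis, neither of which is spelled out in the text.

```python
import numpy as np


def pre_emphasis(signal, coeff=0.97):
    """y[0] = x[0]; y[n] = x[n] - coeff * x[n-1] for n >= 1."""
    signal = np.asarray(signal, dtype=float)
    return np.append(signal[:1], signal[1:] - coeff * signal[:-1])


def mean_normalize(features):
    """Subtract the per-utterance mean of each feature dimension.

    features: (N, F) array, e.g. N frames of F log Mel coefficients.
    """
    features = np.asarray(features, dtype=float)
    return features - features.mean(axis=0, keepdims=True)
```

Pre-emphasis is applied to the waveform before the spectrogram is computed, while the mean normalization is applied to the resulting feature matrix.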

4 Results

Figure 3: DET curves for the experiments on VoxCeleb 1 test set verification task.

The proposed approach has been evaluated against different attention methods in the VoxCeleb text-independent verification task. Performance is evaluated using the Equal Error Rate (EER) and the Detection Cost Function (DCF). DET curves are shown in Figure 3 and both EER and DCF metrics are presented in Table 2. Double MHA is referred to as DMHA in both.
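The EER is the operating point where the miss rate and the false-alarm rate coincide. A simple way to approximate it from a list of trial scores is sketched below. The function name is hypothetical, and the threshold sweep is a basic approximation rather than the scoring tool used for the reported results.

```python
import numpy as np


def equal_error_rate(target_scores, nontarget_scores):
    """Approximate EER from two arrays of trial scores (higher = same speaker)."""
    target = np.asarray(target_scores, dtype=float)
    nontarget = np.asarray(nontarget_scores, dtype=float)
    thresholds = np.sort(np.concatenate([target, nontarget]))
    # Miss rate: targets rejected; false-alarm rate: non-targets accepted.
    miss = np.array([(target < t).mean() for t in thresholds])
    false_alarm = np.array([(nontarget >= t).mean() for t in thresholds])
    idx = np.argmin(np.abs(miss - false_alarm))
    return float((miss[idx] + false_alarm[idx]) / 2)
```

With perfectly separated target and non-target scores the function returns 0. Heavier overlap between the two score distributions pushes the value up.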

Self Attention pooling has shown the worst results for this task compared to the best tuned configurations of both Self MHA and Double MHA. Compared to Self Attention, Self MHA has shown better results with 16 heads and worse results with both 8 and 32 heads. With 16 heads, Self MHA has shown a relative EER improvement in comparison with Self Attention pooling, whereas DCF has improved only slightly. With 8 and 32 heads, Self MHA performance in terms of EER has degraded. Double MHA has shown better results with 16 heads than both the Self Attention and Self MHA approaches: a 9.19% relative EER improvement in comparison with Self Attention and a 4.29% relative improvement compared with 16 heads Self MHA. In terms of DCF, Double MHA has also shown the best result. If we compare Double MHA and Self MHA with 8 heads, Double MHA is better in terms of DCF but has not improved in terms of EER, which has remained the same. Double MHA with 32 heads has shown the worst results in comparison with both Self MHA and Self Attention, with a DCF of 0.0032.
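The relative improvements quoted in this section follow the usual definition for error metrics, where lower is better. A small sketch, with a hypothetical function name and made-up input values used only to show the arithmetic:

```python
def relative_improvement(baseline_error, new_error):
    """Relative reduction of an error metric, in percent."""
    return 100.0 * (baseline_error - new_error) / baseline_error


# Hypothetical values, not results from this paper.
example = relative_improvement(4.0, 3.0)
```

A positive value means the new system has a lower error than the baseline, and a negative value means it has degraded.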

As the results have shown, the best performance in MHA-based approaches is achieved with 16 heads. Besides verification metrics, Table 2 also indicates the head and global context vector dimensions. As discussed in subsection 2.3, the head context vector dimension in Self MHA and both the head and global context vector dimensions in Double MHA are inversely proportional to the number of heads. Therefore, there is a trade-off between the number of heads and system performance, which is related to the context vector dimensions. The worst performance shown by Double MHA is achieved with 32 heads. This setup implies that both the head and global context vector dimensions are 160. This value can be considered small compared to current state-of-the-art speaker embeddings, whose dimension ranges between 200 and 1500. Therefore, system performance with 32 heads is worse because the context vector subspace is not big enough to encode all the discriminative speaker information from the CNN output. On the other hand, the larger the number of heads, the more subsets of speaker features can be captured over the CNN encoded sequence. With 8 heads, 640 dimension head context vectors are extracted, and with 16 heads, head context vectors have 320 dimensions. Both Self MHA and Double MHA approaches have shown the best results with 16 heads, which implies 320 dimension context vectors. Therefore, CNN output feature maps are more efficiently grouped in subsets of channels, which correspond to sub-sequences of 320 dimension embeddings. Considering these sets of 16 context vectors pooled in that layer, these representations are efficiently averaged with Double MHA into unique 320 dimension utterance-level speaker representations.

Approach  Heads  Head dim  Global dim  EER (%)  DCF
DMHA                                   3.26     0.0027
Table 2: Evaluation results of the text-independent verification task on VoxCeleb 1.

5 Conclusion

In this paper we have implemented a Double Multi-Head Attention mechanism to obtain utterance-level speaker embeddings by pooling short-term representations. The proposed pooling layer is composed of a Self Multi-Head Attention pooling and a Self Attention mechanism that summarizes the context vectors of each head into a unique speaker vector. This pooling layer has been tested in a neural network based on a CNN that maps spectrograms into sequences of speaker vectors. These vectors are then input to the proposed pooling layer, whose output is then connected to a set of dense layers. The network is trained as a speaker classifier and a bottleneck layer from these fully connected layers is used as the speaker embedding. We have compared this approach with other pooling methods on the text-independent verification task, using the speaker embeddings and applying cosine distance. The presented approach has outperformed both vanilla Self Attention and Self Multi-Head Attention pooling.

6 Acknowledgements

This work was supported in part by the Spanish Project DeepVoice (TEC2015-69266-P).