Self Multi-Head Attention for Speaker Recognition

06/24/2019 · Miquel India et al. · Universitat Politècnica de Catalunya

Most state-of-the-art Deep Learning (DL) approaches for speaker recognition work at a short utterance level. Given the speech signal, these algorithms extract a sequence of speaker embeddings from short segments, and those are averaged to obtain an utterance level speaker representation. In this work we propose the use of an attention mechanism to obtain a discriminative speaker embedding given variable-length speech utterances. Our system is based on a Convolutional Neural Network (CNN) that encodes short-term speaker features from the spectrogram and a self multi-head attention model that maps these representations into a long-term speaker embedding. The attention model that we propose produces multiple alignments from different subsegments of the CNN encoded states over the sequence. Hence this mechanism works as a pooling layer which decides the most discriminative features over the sequence to obtain an utterance level representation. We have tested this approach for the verification task on the VoxCeleb1 dataset. The results show that self multi-head attention outperforms both temporal and statistical pooling methods with an 18% relative improvement in EER. The obtained results also show a 58% relative improvement in EER compared to i-vector+PLDA.


1 Introduction

Recently there have been several attempts to apply Deep Learning (DL) in order to build speaker embeddings. A speaker embedding often refers to a single low-dimensional vector representation of the speaker characteristics of a speech signal, extracted using a neural network model. For text-independent speaker recognition, which is the focus of this work, these models can be trained either in a supervised (e.g., [1, 2]) or in an unsupervised (e.g., [3, 4]) fashion. Supervised speaker embeddings are produced by training a deep architecture using speaker-labeled background data. This network, which is capable of producing high-level features, is usually trained to discriminate the background speakers. Then, in the testing phase, the output layer is discarded, the feature vectors of an unknown speaker are passed through the network, and the pooled representation from the activations of a given hidden layer is considered as the speaker embedding [1, 2, 5, 6]. The reported results from different works have shown that, in most cases, the largest improvements over conventional i-vectors are obtained on short utterances [5, 2]. This suggests that DL technology can model the speaker characteristics of a short-duration speech signal better than the traditional signal processing techniques.

In [1], the inputs of the network are the speaker feature vectors stacked over a context window. They use a DNN with max-pooling and dropout regularization applied on the last two hidden layers. There are also other works which employ other deep architectures such as CNNs and TDNNs [5, 7]. Snyder et al., in [8], introduced a temporal pooling to extract speaker embeddings and a DNN architecture with a PLDA-like objective function. This function operates on pairs of embeddings to maximize the probability for embeddings of the same speaker and minimize it otherwise. In [2], they take advantage of a TDNN, which is followed by a statistical pooling and a DNN classifier. The statistical pooling layer aggregates the variable-length input segments and prepares fixed-dimensional statistics vectors as the inputs of a feed-forward network. The second part of the network has only two hidden layers, whose activations can be used as speaker embeddings. Preliminary results showed that these embeddings outperform the traditional i-vectors [9] for short duration speech segments [2]. Moreover, a recent work [6] has shown that data augmentation, consisting of added noise and reverberation, can significantly improve the performance of these embeddings (referred to as x-vectors), while it is not so effective for i-vectors. There have also been some efforts to improve the quality and generalization power of x-vectors through modifications applied to the network architecture [10] and the training procedure [11, 12, 13].

Attention mechanisms are one of the main reasons for the success of sequence-to-sequence (seq2seq) models in tasks like Neural Machine Translation (NMT) or Automatic Speech Recognition (ASR) [14, 15, 16]. In seq2seq models, these algorithms are applied over the encoded sequence in order to help the decoder decide which region of the sequence must be either translated or recognized. For speaker recognition, these models have also been used for text-dependent speaker verification. In works like [5, 17, 18], attention is applied over the hidden states of an RNN in order to pool these states into speaker embeddings. The same idea has also been used for text-independent verification. In [19], a unified framework is introduced for both speaker and language recognition. In this architecture, a variable-length input utterance is fed into a network that encodes an utterance level representation. In addition to temporal pooling, they also adopt a self-attention pooling mechanism and a learnable dictionary encoding layer to obtain the utterance level representation. Multi-head attention is a newly emerging attention mechanism, originally proposed in [15] for the Transformer architecture, and it has proved very effective in many seq2seq models such as [15, 20, 21].

In this paper we present a multi-head attention based network for speaker verification. This mechanism is used as a self-attentive pooling layer to create an utterance level embedding. Given a set of encoded representations from a CNN feature extractor, self attention performs a weighted average of these representations. This mechanism differs from other pooling methods in that the averaging weights are also trained as network parameters. In comparison with other works like [19], our approach introduces a major improvement by using multi-head attention instead of the single-head attention used in [19]. This allows the model to attend to different parts of the sequence, which is one of the main limitations of vanilla self-attentive pooling. In the same way, multi-head attention also helps the network to attend to different sub-sets of the encoded representations. Therefore the encoder is not forced to create overall speaker embeddings at the feature level. Attention allows the encoder to create different sets of features, so the model can attend to the most discriminative patterns from different positions of the sequence. The main contribution of this work is the introduction of a pooling layer which takes advantage of the multi-head self-attention mechanism to create more discriminative speaker embeddings. We compare this pooling mechanism with temporal and statistical pooling layers. In order to show the effectiveness of the proposed approach, these embeddings are assessed in a text-independent speaker verification task.

The rest of this paper is structured as follows. Section 2 explains the self multi-head attention pooling. Section 3 illustrates the architecture of the system. Section 4 gives the details of the system setup. Experimental results are presented in Section 5. Concluding remarks and some future work are given in Section 6.

2 Self Multi-Head Attention Pooling

Self-attentive pooling was initially proposed in [19] for text-independent speaker verification. Its objective is to use a trainable pooling layer, better adapted to the task than the vanilla temporal average. Given a sequence of encoded hidden states from a network, temporal pooling averages these representations over time to obtain a final encoded embedding. The main problem of this method is that it assumes that all the elements of the sequence contribute equally to the utterance level representation. Self-attentive pooling is a mechanism that, through a trainable layer, is able to assign a weight to each representation of the sequence. Hence, given these weights, the utterance level representation is obtained as the corresponding weighted average of these representations.

Consider a sequence of hidden states $H = \{h_1, \dots, h_N\}$, with $h_t \in \mathbb{R}^{D}$, and a trainable parameter $u \in \mathbb{R}^{D}$. We can define a relevance scalar weight $w_t$ for each element of the sequence through a softmax layer:

$w_t = \dfrac{\exp(h_t^{\top} u)}{\sum_{l=1}^{N} \exp(h_l^{\top} u)}$   (1)

Given the set of weights $\{w_1, \dots, w_N\}$ over all the elements of the sequence, we can then obtain the pooled representation $c$ as the weighted average of the hidden states:

$c = \sum_{t=1}^{N} w_t h_t$   (2)
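
As an illustration, a minimal sketch of this self-attentive pooling layer is given below. The paper does not provide code, so the framework (PyTorch), the parameter initialization, and all names are our own assumptions.

```python
import torch
import torch.nn as nn


class SelfAttentivePooling(nn.Module):
    """Vanilla self-attentive pooling: a single trainable vector u scores each
    hidden state (eq. 1) and the pooled embedding is the weighted average (eq. 2)."""

    def __init__(self, hidden_dim: int):
        super().__init__()
        # trainable parameter u with the same dimension D as the hidden states
        self.u = nn.Parameter(torch.randn(hidden_dim) / hidden_dim ** 0.5)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (batch, seq_len, hidden_dim) encoded states from the front-end
        scores = h @ self.u                    # (batch, seq_len), h_t^T u
        w = torch.softmax(scores, dim=1)       # attention weights, eq. (1)
        c = (w.unsqueeze(-1) * h).sum(dim=1)   # weighted average, eq. (2)
        return c                               # (batch, hidden_dim)
```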

This attention mechanism has some limitations. The main restriction is that the attention weights are calculated considering the whole embedding. Therefore, we assume that all the discriminative information of the signal must come from the same encoded representation of the utterance.

Figure 1: An example of self multi-head attention pooling with multiple heads.

The multi-head attention model was first introduced in [15]. This approach consists of splitting the encoded representations of the sequence into homogeneous sub-vectors called heads (Figure 1). If we consider $K$ heads for the multi-head attention, each hidden state becomes $h_t = [h_t^1, \dots, h_t^K]$, where $h_t^j \in \mathbb{R}^{D/K}$. We can then compute the head size as $D_h = D/K$. In the same way, we also have a trainable parameter $u = [u^1, \dots, u^K]$, where $u^j \in \mathbb{R}^{D/K}$. A different attention is then applied over each head of the encoded sequence:

$w_t^{j} = \dfrac{\exp\big((h_t^{j})^{\top} u^{j}\big)}{\sum_{l=1}^{N} \exp\big((h_l^{j})^{\top} u^{j}\big)}$   (3)

where $w_t^{j}$ corresponds to the attention weight of head $j$ at step $t$ of the sequence. Since each head corresponds to a subspace of the hidden state, the weight sequence of that head can be considered as a probability distribution of the features of that sub-space over the sequence. We then compute a new pooled representation for each head in the same way as in vanilla self attention:

$c^{j} = \sum_{t=1}^{N} w_t^{j} h_t^{j}$   (4)

where $c^{j}$ corresponds to the utterance level representation from head $j$. The final utterance level representation is then obtained as the concatenation of the utterance level vectors from all the heads, $c = [c^1, \dots, c^K]$. This method allows the network to extract different kinds of information over different regions of the sequence. Besides, the main advantage of this attention variant is that it does not increase the complexity of the model by adding more parameters. Instead of having a single global attention vector $u$, we now have a set of per-head attention vectors which together contain the same number of components as $u$.
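
Under the same assumptions as the previous sketch (PyTorch, our own naming and initialization), the multi-head variant could look as follows. Note that the trainable parameter still has $D$ components in total; it is simply split into $K$ per-head vectors:

```python
import torch
import torch.nn as nn


class MultiHeadAttentivePooling(nn.Module):
    """Self multi-head attention pooling: each hidden state is split into K heads,
    every head is pooled with its own attention vector (eqs. 3 and 4), and the
    head-level embeddings are concatenated into the utterance representation."""

    def __init__(self, hidden_dim: int, num_heads: int):
        super().__init__()
        assert hidden_dim % num_heads == 0, "hidden_dim must be divisible by K"
        self.num_heads = num_heads
        self.head_dim = hidden_dim // num_heads
        # one attention vector u^j per head; together they have hidden_dim components
        self.u = nn.Parameter(torch.randn(num_heads, self.head_dim) / self.head_dim ** 0.5)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (batch, seq_len, hidden_dim) -> (batch, seq_len, K, head_dim)
        b, t, _ = h.shape
        heads = h.view(b, t, self.num_heads, self.head_dim)
        scores = (heads * self.u).sum(dim=-1)       # (batch, seq_len, K)
        w = torch.softmax(scores, dim=1)            # softmax over time, per head, eq. (3)
        c = (w.unsqueeze(-1) * heads).sum(dim=1)    # (batch, K, head_dim), eq. (4)
        return c.reshape(b, -1)                     # concatenated heads: (batch, hidden_dim)
```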

3 System Description

Figure 2 shows the overall architecture used for this work. The proposed neural network consists of a CNN based encoder and an attention based pooling layer followed by a set of dense layers. The network is fed with variable length mel-spectrogram features. These features are then mapped into a sequence of speaker representations through a CNN encoder. This CNN feature extractor is based on the VGG proposed in [22] for the ASR task. In our case, we have extended this architecture so as to work for speaker verification. Our adapted VGG is composed of three convolution blocks, where each block contains two concatenated convolutional layers followed by a max pooling layer with a 2x2 stride. Hence, given a spectrogram of N frames, the VGG performs a down-sampling that reduces its output into a sequence of N/8 representations. Given this set of representations, the attention mechanism is then used to transform the encoded states of the CNN into an overall speaker representation. Finally, this fixed length embedding is fed into a set of fully connected (FC) layers and a softmax layer. We refer to the bottleneck layer previous to the softmax layer as the speaker embedding. The softmax layer outputs correspond to the speaker labels of the training partition of the corpus. Hence the network is trained as a speaker classifier. The speaker embedding layer provides the speaker representation that will be used for the speaker verification task.

Figure 2: System Diagram.
Layer Size In Dim. Out Dim. Stride Feat Size
conv11 3x3 1 128 1x1 128xN
conv12 3x3 128 128 1x1 128xN
mpool1 2x2 - - 2x2 64xN/2
conv21 3x3 128 256 1x1 64xN/2
conv22 3x3 256 256 1x1 64xN/2
mpool2 2x2 - - 2x2 32xN/4
conv31 3x3 256 512 1x1 32xN/4
conv32 3x3 512 512 1x1 32xN/4
mpool3 2x2 - - 2x2 16xN/8
flatten - 512 1 - 8192xN/8
Table 1: CNN Architecture. In Dim. and Out Dim. refer to the input and output feature maps of the layer. Feat Size refers to the dimension of each of these output feature maps.
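
As a rough sketch of how Table 1 and Figure 2 fit together, the code below builds the VGG-style encoder and wires it to the multi-head pooling layer sketched in Section 2. PyTorch, the ReLU activations, the number of heads, and the dense-layer sizes are our own assumptions; only the convolutional shapes follow Table 1.

```python
import torch.nn as nn


def vgg_block(in_ch: int, out_ch: int) -> nn.Sequential:
    """Two 3x3 convolutions followed by 2x2 max pooling (one block of Table 1)."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1), nn.ReLU(),
        nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1), nn.ReLU(),
        nn.MaxPool2d(kernel_size=2, stride=2),
    )


class VGGEncoder(nn.Module):
    """CNN front-end: (batch, 1, 128, N) mel-spectrogram -> (batch, N//8, 8192)."""

    def __init__(self):
        super().__init__()
        self.blocks = nn.Sequential(vgg_block(1, 128), vgg_block(128, 256), vgg_block(256, 512))

    def forward(self, x):
        y = self.blocks(x)                    # (batch, 512, 16, N//8)
        b, ch, f, t = y.shape
        # flatten channels and frequency: 512 * 16 = 8192 features per time step
        return y.permute(0, 3, 1, 2).reshape(b, t, ch * f)


class SpeakerNet(nn.Module):
    """Encoder -> multi-head attention pooling -> dense layers -> speaker softmax."""

    def __init__(self, num_speakers: int, emb_dim: int = 512, num_heads: int = 32):
        super().__init__()
        self.encoder = VGGEncoder()
        self.pooling = MultiHeadAttentivePooling(8192, num_heads)  # Section 2 sketch
        # dense-layer sizes are placeholders; the last layer is the speaker embedding
        self.fc = nn.Sequential(nn.Linear(8192, emb_dim), nn.ReLU(), nn.Linear(emb_dim, emb_dim))
        self.classifier = nn.Linear(emb_dim, num_speakers)

    def forward(self, spec):
        h = self.encoder(spec)        # sequence of encoded states
        c = self.pooling(h)           # utterance level representation
        emb = self.fc(c)              # speaker embedding (used at test time)
        return self.classifier(emb)   # speaker logits (training only)
```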

4 Experimental Setup

The proposed system will be tested on VoxCeleb1 [23]. This corpus is a large multimedia database that contains over 100,000 utterances from 1,251 celebrities, extracted from videos uploaded to YouTube. For each person in the corpus there are several videos, and each video has been split into short speech utterances of a few seconds of average length. The proposed approaches will be evaluated on the original VoxCeleb1 speaker verification protocol [24]. Hence the network is trained with the VoxCeleb1 development partition and evaluated on the test set.

Three different baselines will be considered for comparison with the presented approach. The self multi-head attention pooling will be evaluated against two statistics based methods: temporal and statistical pooling. In order to evaluate them, these pooling layers replace the attention pooling block without modifying any other parameter of the network. The speaker vectors used for the verification tests are extracted from the same speaker embedding layer for each of the pooling methods. The metric used to compute the scores between embeddings for the verification task is the cosine distance. Additionally, we have also considered an i-vector + PLDA baseline [9, 25]. The i-vector is extracted from MFCC plus delta coefficient features, using a UBM and a total variability matrix. G-PLDA [25] is then applied for scoring.
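
As a small usage illustration, verification scoring with the cosine distance between two extracted speaker embeddings can be done as follows (a trivial sketch; the function name is ours):

```python
import torch
import torch.nn.functional as F


def verification_score(emb_a: torch.Tensor, emb_b: torch.Tensor) -> float:
    """Cosine similarity between two 1-D speaker embeddings; higher scores
    indicate that the two utterances are more likely from the same speaker."""
    return F.cosine_similarity(emb_a.unsqueeze(0), emb_b.unsqueeze(0)).item()
```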

Approach DCF EER
I-vector + PLDA 0.0078
CNN + Temporal Pooling 0.0047
CNN + Statistical Pooling 0.0046
CNN + Att. Pooling 0.005
CNN + MHA Pooling 0.0045 4.0
Table 2: Evaluation results for the text-independent verification task on VoxCeleb1. The results for our proposed architecture have been obtained using cosine scoring.
Figure 3: DET curves for the experiments on the VoxCeleb1 verification task. MHA stands for Multi-Head Attention.
Figure 4: Top: Analysis of the weight values for the first six heads of the multi-head attention over a test utterance. Bottom: Comparison between the vanilla attention weights and the weights averaged over all the heads of our proposed model (cumulative MHA). The weights are extracted from the same test utterance as in the top image.

The proposed network has been trained to classify variable length speaker utterances. For feature extraction we have used librosa [26] to extract 128-dimensional mel-spectrograms. The CNN encoder is then fed with 128xN spectrograms to obtain a sequence of N/8 encoded hidden representations of 8192 dimensions each. The setup of the CNN feature extractor can be found in Table 1. The pooling layer maps the encoded sequence into a single speaker representation. The following FC block consists of two consecutive dense layers, where the last layer corresponds to the speaker embedding. A final softmax layer is then fed with the speaker embedding. Batch normalization has been applied on the dense layers and dropout on the softmax layer. The Adam optimizer has been used to train all the models with standard values for its parameters. Finally, we have applied an early stopping criterion based on patience over the validation loss.
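
A minimal feature extraction sketch with librosa is shown below. The 128 mel bands match the CNN input size in Table 1, while the sampling rate, FFT size, and window/hop lengths are assumptions, since they are not stated here:

```python
import librosa
import numpy as np


def extract_mel_spectrogram(wav_path: str, sr: int = 16000, n_mels: int = 128) -> np.ndarray:
    """Log mel-spectrogram of shape (n_mels, N), used as input to the CNN encoder."""
    y, _ = librosa.load(wav_path, sr=sr)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=512, hop_length=160,
                                         win_length=400, n_mels=n_mels)
    return np.log(mel + 1e-6)
```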

5 Results

The proposed attention pooling layer has been evaluated against different approaches on the VoxCeleb1 text-independent verification task, and the results are presented in Table 2. Performance is evaluated using the EER and the minimum Decision Cost Function (DCF). MHA Pooling refers to the best self multi-head attention model that we have trained, i.e., the head size that yielded the best performance. I-vector with PLDA has shown the worst results for this task. As mentioned before, i-vector performance decreases in short-utterance conditions. Following the i-vector, the statistics based pooling layers (temporal and statistical pooling) obtained the next best results. Similar to the work proposed in [19], vanilla self-attentive pooling does not lead to a big improvement; it only shows a small relative improvement over the statistics based pooling layers. Self multi-head attention has shown the best result of all the evaluated approaches. It outperforms both i-vector+PLDA and the statistics based pooling layers with a 58% and an 18% relative EER improvement, respectively. In comparison with the self-attentive pooling layer, MHA also shows a noticeable EER improvement.
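
For reference, the EER reported in Table 2 can be computed from trial scores and target/non-target labels with a simple threshold sweep, as in the NumPy sketch below (not the authors' evaluation code):

```python
import numpy as np


def equal_error_rate(scores: np.ndarray, labels: np.ndarray) -> float:
    """EER: operating point where the false acceptance and false rejection
    rates are equal. scores: similarity scores; labels: 1 = target trial."""
    order = np.argsort(-scores)                 # sort trials by decreasing score
    labels = labels[order].astype(float)
    n_target = labels.sum()
    n_nontarget = len(labels) - n_target
    # after accepting the top-k trials: false rejection and false acceptance rates
    fr = 1.0 - np.cumsum(labels) / n_target
    fa = np.cumsum(1.0 - labels) / n_nontarget
    idx = np.argmin(np.abs(fr - fa))            # closest point to the FR = FA crossing
    return float((fr[idx] + fa[idx]) / 2.0)
```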

Figure 3 shows the DET curves of the i-vector+PLDA baseline and our proposed architecture with different pooling mechanisms. It shows that not only at the EER point but also at all other operating points, MHA pooling outperforms the other pooling mechanisms by a large margin.

In order to understand the improvement achieved with self multi-head attention pooling in comparison with vanilla attention models, we have assessed their attention weights. Figure 4 shows the weight values created over the encoded features of the CNN for both vanilla and multi-head attention pooling on one of the VoxCeleb1 test utterances. At the top, we can appreciate how each of the several heads of the multi-head model attends to different regions of the sequence. This suggests that the model is able to capture sub-sets of features from the encoded representations in different parts of the signal. At the bottom, the weight values from the vanilla attention model are compared with the averaged weights of the different heads of the multi-head model (cumulative multi-head). The vanilla attention weights have a more uniform distribution over the sequence than the weights shown by the heads of the multi-head model in the top image. If we compare the weight alignments created by both vanilla attention and cumulative multi-head, several discriminative regions are detected in common. However, there are some regions of the sequence attended to by the MHA model that vanilla attention has not detected. This suggests that the additional degrees of freedom of MHA allow the model to focus on more specific regions of the sequence.

6 Conclusions

In this paper we have applied a self multi-head attention mechanism to obtain speaker embeddings at the utterance level by pooling short-term features. This pooling layer has been tested in a neural network based on a CNN that maps spectrograms into sequences of speaker vectors. These vectors are then input to the pooling layer, whose output activation is connected to a set of dense layers. The network is trained as a speaker classifier, and a bottleneck layer from the fully connected block is used as the speaker embedding. We have compared this approach with other pooling methods on the text-independent verification task, using the speaker embeddings and applying the cosine distance. The presented approach has outperformed standard pooling methods based on statistical layers as well as vanilla attention models. We have also analyzed the multi-head attention alignments over a sequence. This analysis has shown that the self multi-head attention layer is able to capture specific sub-sets of features over different regions of a sequence.

7 Acknowledgements

This work was supported in part by the Spanish Project DeepVoice (TEC2015-69266-P).

References