Self-attention encoding and pooling for speaker recognition

The computing power of mobile devices limits end-user applications in terms of storage size, processing, memory, and energy consumption. These limitations motivate researchers to design more efficient deep models. On the other hand, self-attention networks based on the Transformer architecture have attracted remarkable interest due to their high parallelization capabilities and strong performance on a variety of Natural Language Processing (NLP) applications. Inspired by the Transformer, we propose a tandem Self-Attention Encoding and Pooling (SAEP) mechanism to obtain a discriminative speaker embedding given non-fixed length speech utterances. SAEP is a stack of identical blocks relying solely on self-attention and position-wise feed-forward networks to create vector representations of speakers. This approach encodes short-term speaker spectral features into speaker embeddings to be used in text-independent speaker verification. We have evaluated this approach on both the VoxCeleb1 and VoxCeleb2 datasets. The proposed architecture outperforms the baseline x-vector and shows competitive performance to some other benchmarks based on convolutions, with a significant reduction in model size. It employs 94 x-vector, respectively. This indicates that the proposed fully attention-based architecture is more efficient in extracting time-invariant features from speaker utterances.



1 Introduction

Recently, there have been several attempts to apply Deep Learning (DL) in order to build speaker embeddings [1, 2, 3, 4, 5]. A speaker embedding usually refers to a single low-dimensional vector representation of the speaker characteristics in a speech signal. It is extracted using a neural network (NN) model. Different architectures have been proposed to extract these speaker representations either in a supervised (e.g., [1, 3]) or in an unsupervised way (e.g., [2, 6]). In the supervised methods, a deep architecture is usually trained for classification purposes using speaker labels on the development set. Then, in the testing phase, the output layer is discarded, the feature vectors of an unknown speaker are passed through the network, and the pooled representation from one or more hidden layers is taken as the speaker embedding [3, 5].

On the other hand, attention mechanisms are one of the main reasons for the success of sequence-to-sequence (seq2seq) models in tasks like Neural Machine Translation (NMT) or Automatic Speech Recognition (ASR) [7, 8, 9]. In seq2seq models, these algorithms are applied over the encoded sequence in order to help the decoder decide which region of the sequence must be either translated or recognized. Self-attention, as a specific type of attention, is the process of applying the attention mechanism to every position of the input itself. Self-attention has shown promising results in a variety of NLP tasks, such as NMT [8], semantic role labelling [10], and language representations [11]. The popularity of self-attention, as employed in networks such as the Transformer [8], lies in its high parallelization capabilities in computation and its flexibility in modeling dependencies regardless of distance by explicitly attending to the whole signal. Multi-head attention is a more recent attention mechanism, which was originally proposed in [8] for the Transformer architecture and has proved very effective in many seq2seq models such as [8, 12, 13]. The authors of [14, 15] employ a multi-head attention mechanism in the pooling layer to aggregate the information over multiple input segments and output a single fixed-dimensional vector, which is further used in the DNN classifier.

For speaker recognition, the attention mechanism has been studied mostly in the pooling layer as an alternative to statistical pooling. In [4, 16, 17], attention is applied over the hidden states of an RNN in order to pool these states into speaker embeddings for text-dependent speaker verification. In [18], a unified framework is introduced for both speaker and language recognition using attention. In [14], a multi-attention mechanism based on [19] is proposed, which uses more than one attention model to capture different kinds of information from the encoder features. Other works like [15] have explored the use of multi-head attention for the pooling layer.

The computational limitations of mobile devices oblige end-user applications to use models which comply with certain constraints such as energy consumption, memory, and storage size. However, these constraints are not usually met by deep learning approaches. These obstacles motivate researchers to design more efficient deep models. MobileNets were proposed in [20, 21] and showed promising results for image classification. They are built on depth-wise separable convolutions [22], which were also used in Inception models [23] to reduce the computation in the first few layers. Architectures based on factorized convolutions were also found to be successful in [24] and [25] for image classification tasks. In another attempt, a bottleneck strategy was utilized in [26] to design a very small network. There are other reduced-computation networks such as structured transform networks [27] and deep fried CNNs [28].

In this paper, we present an end-to-end speaker embedding extractor for speaker verification inspired by self-attention networks. We show that self-attention mechanisms are able to capture time-invariant features from one's speech, which can result in a discriminative representation of the speaker. The end-to-end architecture comprises an encoder, a pooling layer, and a DNN classifier. The main contribution of this work is an encoder and pooling layer that rely solely on self-attention and feed-forward networks to create discriminative speaker embeddings. In order to show the effectiveness of the proposed approach, these embeddings are evaluated on the VoxCeleb1 [29], VoxCeleb2 [30], and VoxCeleb1-E [30] protocols. The preliminary results obtained with this attention-based system show superior performance compared to the x-vector baseline. Its performance is also competitive with some other CNN-based benchmarks while having far fewer parameters. More precisely, our proposed system employs more than , , , and fewer parameters compared to VGG-M, ResNet-34, ResNet-50, and x-vector, respectively. This is a substantial reduction in model size, which makes it a strong candidate for resource-constrained devices. To the best of our knowledge, there is only one study that addresses concise deep models for speaker identification on mobile devices [31], which is evaluated on the TIMIT [32] and MIT [33] datasets.

2 Proposed Architecture

The system presented in this work can be divided into three main stages, namely an encoder, a pooling layer, and a DNN classifier (Figure 1). The encoder is a stack of N identical blocks, each of which has two sub-layers. The first sub-layer is a single-head self-attention mechanism, and the second is a position-wise feed-forward network. A residual connection is employed around each of the two sub-layers, and it is followed by a layer normalization [34]. The output of the encoder network is a sequence of feature vectors. The pooling mechanism is then used to transform these encoded features into an overall speaker representation. Finally, the third stage of the network is a DNN classifier, which outputs the speaker posteriors. The activations of one or more of its hidden layers can be considered as the speaker embeddings. These embeddings can be further used in speaker verification using cosine scoring or more sophisticated back-ends.
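As an illustration of the cosine scoring mentioned above, the following minimal sketch compares an enrollment embedding with a test embedding. The function names and the decision threshold are hypothetical; in practice the threshold would be tuned on held-out trials.

```python
import numpy as np

def cosine_score(enroll: np.ndarray, test: np.ndarray) -> float:
    """Cosine similarity between two speaker embeddings (higher = more similar)."""
    enroll = enroll / np.linalg.norm(enroll)
    test = test / np.linalg.norm(test)
    return float(enroll @ test)

def verify(enroll: np.ndarray, test: np.ndarray, threshold: float = 0.5) -> bool:
    """Accept the trial if the score reaches the threshold (hypothetical value)."""
    return cosine_score(enroll, test) >= threshold
```

Because both vectors are length-normalized first, the score depends only on the angle between the embeddings and not on their magnitude.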

2.1 Self-Attention Encoder

Figure 1: The diagram of the proposed architecture. Encoder comprises a stack of N layers each of which is a series of attention mechanisms followed by position-wise feed-forward networks. A residual connection and layer normalization are applied to each of the sub-layers.

The input to each encoder block is passed through a self-attention network, which produces representations by applying attention to every position of the input sequence. This is done using a set of linear projections. Consider an input sequence $X = [x_1, \dots, x_T]$ of length $T$, with $x_t \in \mathbb{R}^{d_m}$, and a set of trainable parameters $W^Q \in \mathbb{R}^{d_m \times d_k}$, $W^K \in \mathbb{R}^{d_m \times d_k}$, and $W^V \in \mathbb{R}^{d_m \times d_v}$. The model transforms the input into queries $Q$, keys $K$, and values $V$:

$$Q = X W^Q, \qquad K = X W^K, \qquad V = X W^V$$

The output for each time instance $t$ of the attention layer is computed as:

$$o_t = \mathrm{Att}(q_t, K)\, V$$

where $\mathrm{Att}$ is an attention function that retrieves the keys $K$ with the query $q_t$. The two most commonly used attention functions are additive attention [7] and dot-product attention [35]. Here the attention is a scaled version of the dot-product attention mechanism, which was originally proposed in [8]. Dot-product attention is much faster and more space-efficient in practice than additive attention. However, as the dimension $d_k$ grows, additive attention outperforms unscaled dot-product attention [35]. The scaling factor $1/\sqrt{d_k}$ therefore closes the performance gap between the two while keeping the advantages of the dot product. The final output representation of the attention mechanism is formulated as:

$$O = \mathrm{softmax}\left(\frac{Q K^\top}{\sqrt{d_k}}\right) V$$

The output of the attention mechanism is sent to the other sub-layer, which is a position-wise feed-forward network. It is applied to each position separately and identically. It consists of two linear transformations with a ReLU activation in between:

$$\mathrm{FFN}(h) = \max(0,\, h W_1 + b_1)\, W_2 + b_2$$

where $h$ is the input and $\mathrm{FFN}(h)$ is the output of the feed-forward network. These linear transformations are the same across different positions; however, they differ from layer to layer. $W_1 \in \mathbb{R}^{d_m \times d_{ff}}$ and $W_2 \in \mathbb{R}^{d_{ff} \times d_m}$ can be considered as two convolutions with a kernel size of 1, while $d_{ff}$ is the feed-forward dimension.
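To make the data flow through one encoder block concrete, here is a minimal NumPy sketch of single-head scaled dot-product self-attention followed by a position-wise feed-forward network, each wrapped in a residual connection and layer normalization. It is an illustration under stated assumptions, not the authors' implementation: all names and sizes are hypothetical, the value projection is assumed to map back to the model dimension so that the residual sum is well-defined, and the learned gain and bias of layer normalization are omitted.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def layer_norm(x, eps=1e-6):
    # Normalize each frame over the feature axis (learned gain/bias omitted).
    mu = x.mean(axis=-1, keepdims=True)
    sigma = x.std(axis=-1, keepdims=True)
    return (x - mu) / (sigma + eps)

def self_attention(x, w_q, w_k, w_v):
    # Linear projections to queries, keys, and values.
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])   # (T, T) scaled dot products
    return softmax(scores, axis=-1) @ v       # (T, d_v)

def feed_forward(h, w1, b1, w2, b2):
    # Two linear maps with a ReLU in between, applied to every frame identically.
    return np.maximum(0.0, h @ w1 + b1) @ w2 + b2

def encoder_block(x, p):
    a = layer_norm(x + self_attention(x, p["w_q"], p["w_k"], p["w_v"]))
    return layer_norm(a + feed_forward(a, p["w1"], p["b1"], p["w2"], p["b2"]))

def init_params(d_m, d_k, d_ff, rng, scale=0.1):
    # Assumption: d_v = d_m, so the attention output can be added to its input.
    return {
        "w_q": rng.normal(0, scale, (d_m, d_k)),
        "w_k": rng.normal(0, scale, (d_m, d_k)),
        "w_v": rng.normal(0, scale, (d_m, d_m)),
        "w1": rng.normal(0, scale, (d_m, d_ff)),
        "b1": np.zeros(d_ff),
        "w2": rng.normal(0, scale, (d_ff, d_m)),
        "b2": np.zeros(d_m),
    }

rng = np.random.default_rng(0)
x = rng.normal(size=(50, 16))                  # 50 frames, hypothetical d_m = 16
params = init_params(d_m=16, d_k=8, d_ff=32, rng=rng)
y = encoder_block(x, params)                   # same shape as x: (50, 16)
```

Since neither sub-layer uses positional information in this sketch, permuting the input frames simply permutes the output frames, which is one way to see that every frame attends to the whole sequence at once.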

2.2 Self-Attention Pooling Layer

The self-attention based encoder outputs a sequence of short-term speaker representations. These frame-level features need to be converted into an utterance-level vector. In this work, we use a self-attention pooling layer. This layer is an additive attention based mechanism, which computes the compatibility function using a feed-forward network with a single hidden layer. Given the sequence of encoded features $H = [h_1, \dots, h_T]^\top$ with $h_t \in \mathbb{R}^{d_m}$, we compute the segment-level representation $C$ as:

$$C = \mathrm{softmax}(w_c H^\top)\, H$$

where $w_c \in \mathbb{R}^{d_m}$ is a trainable parameter. This self-attention can also be understood as a dot-product attention where the keys and the values correspond to the same representation and the query is only a trainable parameter. Therefore $C$ is a weighted average of the encoder sequence of features. These weights correspond to the alignment learned by the self-attention mechanism. This kind of attention pooling is different from the ones proposed in [14, 36], where attention is applied to extract the statistics necessary for statistical pooling. In this work, by contrast, we replace the whole statistical pooling with a simple additive attention mechanism with less computational cost.
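The pooling step can be sketched in a few lines of NumPy. This is a minimal illustration assuming the pooling is a softmax over the dot products between a single trainable vector and each frame; the names `self_attention_pooling` and `w_c` are chosen for illustration only.

```python
import numpy as np

def self_attention_pooling(h: np.ndarray, w_c: np.ndarray):
    """Pool frame-level features h (T, d) into one utterance-level vector (d,).

    w_c (d,) plays the role of the trainable query; the frames act as both
    keys and values.
    """
    scores = h @ w_c                    # one scalar score per frame, shape (T,)
    scores = scores - scores.max()      # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum()   # softmax over time
    return weights @ h, weights         # weighted average of the frames

rng = np.random.default_rng(0)
w_c = rng.normal(size=16)
c_short, a_short = self_attention_pooling(rng.normal(size=(30, 16)), w_c)
c_long, a_long = self_attention_pooling(rng.normal(size=(200, 16)), w_c)
```

Utterances of 30 and 200 frames both map to a 16-dimensional vector here, which is what allows variable-length input. With an all-zero `w_c` the weights become uniform and the layer reduces to plain mean pooling.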

2.3 Speaker Embeddings

The fixed-length vector obtained from the pooling layer is fed into a DNN classifier to output the speaker posteriors. This DNN classifier is based on three fully connected layers and a softmax layer, and it is trained with a multi-class cross-entropy objective.

Once the network is trained, embeddings are extracted from one of the DNN layers after the non-linearity. Instead of using the layer immediately before the softmax as the speaker embedding, we use the layer before that, as in [37]. This is done because it has been shown in [3] that this layer generalizes better, so it contains speaker-discriminative information that is less adapted to the training data.
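The extraction of embeddings from an intermediate layer can be illustrated as follows. This sketch rests on one reading of the description: three ReLU hidden layers followed by a separate softmax output layer, with the embedding taken after the non-linearity of the second hidden layer. The layer sizes and function names are hypothetical.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def classifier_forward(c, hidden, out):
    """Return the speaker posteriors and the activation of every hidden layer."""
    activations = []
    a = c
    for w, b in hidden:
        a = relu(a @ w + b)
        activations.append(a)
    logits = a @ out[0] + out[1]
    logits = logits - logits.max()           # numerical stability
    posteriors = np.exp(logits)
    posteriors = posteriors / posteriors.sum()
    return posteriors, activations

def extract_embedding(c, hidden):
    """Run only the first two hidden layers; the rest is not needed at test time."""
    a = c
    for w, b in hidden[:2]:
        a = relu(a @ w + b)
    return a

rng = np.random.default_rng(0)
d_pool, d_hidden, d_emb, n_spk = 16, 32, 8, 5    # hypothetical sizes
shapes = [(d_pool, d_hidden), (d_hidden, d_emb), (d_emb, d_emb)]
hidden = [(rng.normal(0, 0.1, s), np.zeros(s[1])) for s in shapes]
out = (rng.normal(0, 0.1, (d_emb, n_spk)), np.zeros(n_spk))

c = rng.normal(size=d_pool)                      # pooled utterance vector
posteriors, acts = classifier_forward(c, hidden, out)
embedding = extract_embedding(c, hidden)
```

The posteriors are only needed for the training loss; at test time the output layer and the last hidden layer can be skipped entirely.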

3 Experimental Setup

The performance of the proposed system is assessed using both the VoxCeleb1 [29] and VoxCeleb2 [30] datasets with three different protocols. The VoxCeleb1 development set is used for training in the first set of experiments (referred to as Vox1 hereafter). It contains over utterances from speakers. The trained models are evaluated on the original VoxCeleb1 test set, which contains verification pairs for each of the client and impostor tests. For the second set of experiments (referred to as Vox2 hereafter), the VoxCeleb2 development set is used for training. It comprises utterances among speakers. The VoxCeleb1 test set is used for the assessment. In the third set of experiments (referred to as Vox1-E hereafter), we use the VoxCeleb2 development set for training and VoxCeleb1-E for testing. VoxCeleb1-E consists of random pairs sampled from the entire VoxCeleb1 dataset (development + test), covering speakers.

For all the systems that we have trained, the librosa toolkit [38] is employed to extract -dimensional Mel-Frequency Cepstral Coefficients (MFCCs) with the standard window size and shift to represent the speech signal. The deltas and double deltas are appended to produce a dimensional input for the networks. No data augmentation or test-time augmentation has been used during training or testing. All features are subject to Cepstral Mean and Variance Normalization (CMVN) prior to training. All models are then trained on chunks of speech features with frames. During testing, embeddings are extracted from the whole speech signal. ReLU is used as the non-linearity for all the systems. The Adam optimizer has been used to train all the models with the standard values proposed in [39] and a learning rate of 1e-4.
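The normalization and chunking steps can be sketched with NumPy alone. The sketch assumes that CMVN is applied per utterance and that utterances shorter than the chunk length are extended by repetition; the source does not specify either detail, and the feature dimension and chunk length used here are placeholders.

```python
import numpy as np

def cmvn(feats: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Per-utterance mean and variance normalization of features (T, D)."""
    return (feats - feats.mean(axis=0)) / (feats.std(axis=0) + eps)

def random_chunk(feats: np.ndarray, chunk_len: int, rng) -> np.ndarray:
    """Crop a random window of chunk_len frames for training.

    Short utterances are repeated until they are long enough (an assumption).
    """
    n_frames = feats.shape[0]
    if n_frames < chunk_len:
        reps = -(-chunk_len // n_frames)          # ceiling division
        feats = np.tile(feats, (reps, 1))
        n_frames = feats.shape[0]
    start = rng.integers(0, n_frames - chunk_len + 1)
    return feats[start:start + chunk_len]

rng = np.random.default_rng(0)
utterance = rng.normal(3.0, 2.0, size=(500, 12))  # placeholder features
normalized = cmvn(utterance)
chunk = random_chunk(normalized, chunk_len=300, rng=rng)
short = random_chunk(normalized[:100], chunk_len=300, rng=rng)
```

At test time no cropping is applied: the whole normalized utterance is passed through the network, which the attention pooling makes possible.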

The architecture of the x-vector baseline is as proposed in [5]. Speaker embeddings are extracted from the affine component of the first layer after pooling. No dropout is used, as suggested in [37], since it does not improve the results for the verification task. This statement is also confirmed by our experiments. All TDNN layers are followed by a non-linearity and Batch Normalization [23]. For this baseline architecture we have also considered a PLDA back-end. The speaker embeddings are centered, then subject to Linear Discriminant Analysis (LDA) dimension reduction to and length-normalized. Additionally, we have trained a PLDA model to score the trials of these post-processed representations for the speaker verification task.

For the SAEP configuration, we use a stack of identical layers with , and position-wise feed-forward dimension of . For the encoder network a dropout of is considered, and for the rest of the network dropout is fixed to . The dimension of the first dense layer is equal to and all other dense layers are equal to the embedding dimension, which is . The size of for the embedding is chosen simply to match that of the i-vector system. No experiments have been conducted to explore the ideal size for the embedding. For SAEP, we have also tried Additive Margin Softmax (AMSoftmax) as originally proposed in [40], with a scaling factor of and margin as the hyperparameters.
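For readers unfamiliar with AMSoftmax, the loss can be sketched as follows: embeddings and class weights are length-normalized, the margin is subtracted from the target-class cosine, and the result is scaled before the usual cross-entropy. The scaling factor `s=30.0` and margin `m=0.35` below are placeholder values chosen for illustration and are not taken from this paper.

```python
import numpy as np

def am_softmax_loss(emb, weights, labels, s=30.0, m=0.35):
    """Additive-margin softmax loss.

    emb: (B, D) embeddings, weights: (D, C) class weights, labels: (B,) ints.
    s and m are placeholder hyperparameters.
    """
    emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    weights = weights / np.linalg.norm(weights, axis=0, keepdims=True)
    cos = emb @ weights                                  # (B, C) cosines
    rows = np.arange(len(labels))
    margin = np.zeros_like(cos)
    margin[rows, labels] = m                             # margin on target class only
    logits = s * (cos - margin)
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-log_prob[rows, labels].mean())

rng = np.random.default_rng(0)
emb = rng.normal(size=(4, 8))
weights = rng.normal(size=(8, 3))
labels = np.array([0, 2, 1, 0])
loss_with_margin = am_softmax_loss(emb, weights, labels)
loss_no_margin = am_softmax_loss(emb, weights, labels, m=0.0)
```

With `m=0.0` this reduces to a softmax over scaled cosines, so the margin can only increase the loss for the same inputs; the network has to separate the classes by at least that margin to compensate.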

Method Input Loss Dim. # Params EER%


i-vector/PLDA [29] MFCC - - 8.8
VGG-M [29] Spectrogram Softmax M
VGG-M [29] Spectrogram Softmax+Contrastive M
x-vector MFCC Softmax M
x-vector/LDA/PLDA MFCC Softmax M


ResNet-34 [30] Spectrogram Softmax+Contrastive M
ResNet-50 [30] Spectrogram Softmax+Contrastive M
x-vector MFCC Softmax M
x-vector/LDA/PLDA MFCC Softmax M


ResNet-50 [30] Spectrogram Softmax+Contrastive M
x-vector MFCC Softmax M
x-vector/LDA/PLDA MFCC Softmax M

Minimum estimation, actual number not provided by the reference.

Table 1: Evaluation results on the VoxCeleb datasets. Cosine distance is employed for scoring unless otherwise stated. Vox1: train on Vox1-dev & test on Vox1-test. Vox2: train on Vox2-dev & test on Vox1-test. Vox1-E: train on Vox2-dev & test on the extended VoxCeleb1 test set. The result reported for ResNet-50 on Vox1-E takes advantage of test-time augmentation [30].

4 Results

Results for the text-independent speaker verification task are summarized in Table 1 in three different sections for three different protocols. The cosine distance is employed for scoring unless otherwise stated. The performance of the different systems has been evaluated using the Equal Error Rate (EER). The proposed architecture is compared with the i-vector system and also with the x-vector system, which is nowadays one of the most popular end-to-end speaker embedding extractors. Additionally, in order to analyse the trade-off between performance and network size, we have also included results reported by some other approaches such as VGG-M [29], ResNet-34, and ResNet-50 [30]. These models benefit from a much larger number of parameters.
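Since all comparisons are reported as EER, a minimal sketch of how it can be computed from trial scores may be useful. The EER is the operating point at which the false rejection rate on target trials equals the false acceptance rate on non-target trials. This simple version sweeps the threshold over the observed scores without interpolation or special handling of ties, so it is an approximation rather than a reference scoring tool.

```python
import numpy as np

def compute_eer(target_scores, nontarget_scores) -> float:
    """Approximate equal error rate from target and non-target trial scores."""
    target_scores = np.asarray(target_scores, dtype=float)
    nontarget_scores = np.asarray(nontarget_scores, dtype=float)
    scores = np.concatenate([target_scores, nontarget_scores])
    labels = np.concatenate([np.ones(len(target_scores)),
                             np.zeros(len(nontarget_scores))])
    labels = labels[np.argsort(scores)]
    # Threshold just above the i-th sorted score: trials 0..i are rejected.
    fnr = np.cumsum(labels) / len(target_scores)                 # rejected targets
    fpr = 1.0 - np.cumsum(1.0 - labels) / len(nontarget_scores)  # accepted non-targets
    idx = np.argmin(np.abs(fnr - fpr))
    return float((fnr[idx] + fpr[idx]) / 2.0)

eer_separable = compute_eer([0.8, 0.9], [0.1, 0.2])
eer_mixed = compute_eer([0.9, 0.8, 0.3], [0.1, 0.2, 0.4])
```

In the second example one of three target trials scores below one of three non-target trials, so both error rates meet at one third.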

For the Vox1 protocol we have compared SAEP with x-vector and the VGG based architecture proposed in [29]. It is the smallest protocol in terms of development data. SAEP with vanilla softmax shows a small relative improvement in terms of EER compared to x-vector with LDA/PLDA, and VGG-M. However, using AMSoftmax this improvement increases to a in comparison with x-vector/LDA/PLDA and compared to VGG-M.

For the Vox2 and Vox1-E protocols we have added ResNet-34 and ResNet-50 based models, which have recently shown very competitive results for the speaker verification task [30]. Here SAEP outperforms x-vector by a larger margin. The presented approach with softmax loss shows almost relative improvement in terms of EER compared to x-vector with the LDA/PLDA back-end in the Vox2 protocol and more than in the Vox1-E protocol. Compared to the standard ResNet architectures, SAEP is able to perform similarly to ResNet-34 with more than fewer parameters. ResNet-34 and ResNet-50 have better results than our method and also x-vector, mainly due to the use of a much larger number of parameters and more sophisticated training strategies such as contrastive loss. Furthermore, ResNet-50 takes advantage of test-time augmentation [30] for the Vox1-E protocol.

Figure 2: Assessment of the impact of the key and value dimensions ($d_k = d_v$) on the performance of speaker embeddings in terms of EER% for the SAEP architecture. The results are reported for both Softmax and AMSoftmax training losses on the Vox2 protocol.

In Figure 2 we present the effect of the key and value dimensions on the performance of our proposed system for both Softmax and AMSoftmax losses on the Vox2 protocol. We have tried three different dimensions , , and . It is worth noting the total number of network parameters for each of these systems. These dimensions correspond to M, M, and M total model parameters, respectively. In addition to these experiments, in order to show the efficiency of our architecture in terms of the number of model parameters, we tried an alternative configuration with and . This configuration achieved EER on the Vox2 protocol with Softmax loss, with only M parameters. Compared to x-vector, which offers similar performance, this alternative configuration requires only about one-tenth of the parameters. This is a remarkable reduction, which supports the suitability of our solution for resource-constrained devices such as mobile phones.

5 Conclusions

We have presented a tandem encoder and pooling layer based solely on self-attention mechanisms and position-wise feed-forward networks. They are used in an end-to-end embedding extractor for text-independent speaker verification. Preliminary results show that this fully attention-based system is able to extract time-invariant speaker features and produce discriminative representations of speakers with far fewer parameters than some other CNN- and TDNN-based benchmarks. SAEP uses more than , , , and fewer parameters compared to VGG-M, ResNet-34, ResNet-50, and x-vector, respectively. This suggests that SAEP is better at capturing long-range dependencies than the other models. One reason is that the attention mechanism looks at all time instances at once, so it is more flexible in extracting features that are located far from one another. For future work, we would like to study the effects of various data augmentation strategies, since they are very effective for speaker embeddings such as x-vector. We also need to study how the system scales with more encoding layers, larger dimensions, and the addition of an appropriate back-end.

6 Acknowledgements

This work was supported in part by the Spanish Project DeepVoice (TEC2015-69266-P).