PoFormer: A simple pooling transformer for speaker verification

by   Yufeng Ma, et al.

Most recent speaker verification systems extract speaker embeddings with a deep neural network. The pooling layer in the network aggregates the frame-level features extracted by the backbone. In this paper, we propose a new transformer-based pooling structure called PoFormer to enhance the ability of the pooling layer to capture information along the whole time axis. Unlike previous works that apply the attention mechanism in a simple way or implement the multi-head mechanism serially instead of in parallel, PoFormer follows the original transformer structure with some minor modifications, namely a positional encoding generator, drop path and LayerScale, to make the training procedure more stable and to prevent overfitting. Evaluated on various datasets, PoFormer outperforms the existing pooling system with at least a 13.00% improvement in EER and a 9.12% improvement in minDCF.




1 Introduction

Speaker verification aims to determine whether two audio segments belong to the same speaker. This can be done by embedding the audio segments into high-dimensional vectors and then measuring their similarity. Before deep learning was applied to speaker verification, methods such as i-vectors [1] achieved good performance. In recent years, progress has been made by DNN-based systems such as DNN embeddings [2], x-vectors [3] and ECAPA-TDNN [4]. In most systems, the whole network can be divided into three parts: (1) a backbone that extracts features from the audio segment, (2) a pooling layer that captures global information in the extracted features and (3) an embedding layer with a loss function to classify the speaker.

Our work focuses on the pooling layer. In previous works, pooling layers such as statistical pooling [3] were proposed and achieve good results despite their simplicity. Moreover, attention-based modules have recently become popular in speaker verification. Works such as attentive statistical pooling [5], cas pooling [4], self-attentive speaker embedding [6] and serialized multi-layer multi-head attention [7] show that the attention mechanism is effective in aggregating frame-level features. The transformer, first proposed in [8], is also an attention-based structure and has been successful in various areas, including natural language processing [9] and computer vision [10, 11, 12]. Inspired by these results, we propose a pooling transformer, PoFormer, which introduces the transformer into the pooling layer of our speaker verification network to strengthen its capability of capturing information across the whole time domain. Unlike [7], which redesigns the inner structure of the attention module and implements the multi-head mechanism serially, we strictly follow [8], where the heads run in parallel, providing a simple but effective pooling transformer.

ViT [10] makes all positional encodings learnable. However, this degrades performance on input sequences of varying length. To make our model robust to audio segments of different lengths, we use a positional encoding generator (PEG) [13] to generate the positional encoding for the input of the transformer. Following [13], we use multiple PEGs, one for each transformer layer. Furthermore, because of its complexity, the transformer is vulnerable to overfitting. Therefore, we add LayerScale [14] and drop path [15] to our PoFormer.

The performance is evaluated on VoxCeleb1-O, VoxCeleb1-E and VoxCeleb1-H [16, 17]. With PEG, LayerScale and drop path, our model outperforms the baseline on all three test sets. On the hardest set, VoxCeleb1-H, the equal error rate (EER) is 13.00% better than the baseline and the minimum decision cost function (minDCF) is 9.12% better than the baseline. The two main contributions of our paper are: 1) we are the first to introduce the original transformer structure into the pooling layer of a speaker verification network and 2) we provide a set of effective strategies that improve the performance of our PoFormer and, at the same time, make it more robust.

2 Proposed Methods

Figure 1: The overall structure of our speaker verification system.

As the pooling layer aims to aggregate frame-level features and the transformer is known for its strong capability of capturing global information, we propose a transformer-based pooling layer named PoFormer (pooling transformer). This section first describes the overall structure of our speaker verification network and then focuses on the PoFormer. The overall structure of our proposed system is shown in Fig. 1. We follow the popular design where the whole network is divided into three parts: a backbone, a pooling layer and an embedding layer. In the backbone, 5 TDNN layers [18] are used to extract the frame-level features of the utterance. The features are then fed into our PoFormer, which integrates the feature sequence into one vector. After that, an embedding layer is used to obtain the speaker embedding. The network is trained as a speaker classifier with an AM-Softmax loss [19]. For comparison, we also construct a baseline system where the PoFormer is replaced by simple statistical pooling [3].
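For orientation, the statistical pooling used in the baseline can be sketched as follows. This is a minimal NumPy sketch, not the authors' implementation: the function name `statistics_pooling`, the `eps` term and the random inputs are assumptions made for illustration; only the 1500-dimensional backbone output comes from the paper.

```python
import numpy as np

def statistics_pooling(frames: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Concatenate the per-channel mean and standard deviation over time.

    frames: (T, D) frame-level features -> returns a (2 * D,) vector.
    """
    mean = frames.mean(axis=0)
    # eps keeps the square root well behaved for constant channels (an assumption).
    std = np.sqrt(frames.var(axis=0) + eps)
    return np.concatenate([mean, std])

# Utterances of different lengths map to vectors of the same size.
rng = np.random.default_rng(0)
short_utt = statistics_pooling(rng.standard_normal((120, 1500)))
long_utt = statistics_pooling(rng.standard_normal((300, 1500)))
print(short_utt.shape, long_utt.shape)
```

The pooled vector has a fixed size regardless of the utterance length, which is the property any replacement pooling layer, including PoFormer, has to preserve.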

2.1 PoFormer Architecture

The structure of PoFormer is also illustrated in Fig. 1. The output dimension of the TDNN is usually high, which makes the model computationally intensive and memory-hungry. To reduce the cost in space and time, the frame-level features are first compressed in dimension by a linear layer and then fed into several stacked transformer layers.

The multi-head self-attention (MHSA) module is the core of the transformer. Given an input sequence $X \in \mathbb{R}^{T \times D}$, where $T$ denotes the sequence length and $D$ denotes the feature dimension, $Q_i$, $K_i$ and $V_i$, where $i$ denotes the $i$-th head, are first generated by learnable linear projections $W_i^Q$, $W_i^K$ and $W_i^V$. Then the output from the $i$-th head can be formulated as

$$\mathrm{head}_i = \mathrm{softmax}\left(\frac{Q_i K_i^\top}{\sqrt{D/h}}\right) V_i, \quad i = 1, \dots, h,$$

where $h$ is the number of heads. After that, the outputs from the different heads are concatenated and transformed into the final output of the MHSA module by a linear layer:

$$\mathrm{MHSA}(X) = \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_h)\, W^O,$$

where $W^O$ is also a learnable parameter.

The feed-forward network (FFN) module is also important in the transformer. It consists of two linear layers with one non-linear layer in between. Formally, given the input $X$, the output can be written as

$$\mathrm{FFN}(X) = \sigma(X W_1 + b_1)\, W_2 + b_2,$$

where $\sigma$ denotes a non-linear activation function such as ReLU or GELU, and $W_1$, $b_1$, $W_2$ and $b_2$ are the parameters of the two linear layers.

To make the transformer easier to train, layer normalization and residual connections are used, and the layer normalization is added before the MHSA and the FFN module. The output from the $l$-th transformer layer can be formulated as:

$$X'_l = X_{l-1} + \mathrm{MHSA}(\mathrm{LN}(X_{l-1})) \tag{1}$$

$$X_l = X'_l + \mathrm{FFN}(\mathrm{LN}(X'_l)) \tag{2}$$

where $\mathrm{LN}$ denotes layer normalization. Several layers are stacked together to strengthen the capability of the model. Moreover, following ViT, a class token is appended to the feature sequence and used in later embedding and classification.
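The layer described above can be sketched in a few lines of NumPy. This is an illustrative sketch under stated assumptions rather than the authors' code: the helper names, the random initialisation, the affine-free layer normalization, the ReLU activation and the placement of the class token at index 0 are all choices made here; only the sizes (dimension 512, 4 heads, feed-forward dimension 1024) are taken from Section 3.2.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each frame over the feature axis (no learnable affine, for brevity)."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def mhsa(x, wq, wk, wv, wo, num_heads):
    """Parallel multi-head self-attention over a (T, D) sequence."""
    t, d = x.shape
    dk = d // num_heads
    # Project once, then split the feature axis into heads: (h, T, dk).
    q = (x @ wq).reshape(t, num_heads, dk).transpose(1, 0, 2)
    k = (x @ wk).reshape(t, num_heads, dk).transpose(1, 0, 2)
    v = (x @ wv).reshape(t, num_heads, dk).transpose(1, 0, 2)
    attn = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(dk))  # (h, T, T)
    heads = attn @ v                                         # (h, T, dk)
    # Concatenate the heads and apply the output projection.
    return heads.transpose(1, 0, 2).reshape(t, d) @ wo

def ffn(x, w1, b1, w2, b2):
    """Two linear layers with a ReLU in between."""
    return np.maximum(x @ w1 + b1, 0.0) @ w2 + b2

def pre_norm_layer(x, p, num_heads):
    """Residual MHSA and FFN branches, each fed with a normalized input."""
    x = x + mhsa(layer_norm(x), p["wq"], p["wk"], p["wv"], p["wo"], num_heads)
    return x + ffn(layer_norm(x), p["w1"], p["b1"], p["w2"], p["b2"])

rng = np.random.default_rng(0)
D, H, F, T = 512, 4, 1024, 50  # T is an arbitrary number of frames

def init(*shape):
    return 0.02 * rng.standard_normal(shape)

params = {"wq": init(D, D), "wk": init(D, D), "wv": init(D, D), "wo": init(D, D),
          "w1": init(D, F), "b1": np.zeros(F), "w2": init(F, D), "b2": np.zeros(D)}

frames = rng.standard_normal((T, D))
cls = np.zeros((1, D))                       # class token (learnable in practice)
seq = np.concatenate([cls, frames], axis=0)  # (T + 1, D)
out = pre_norm_layer(seq, params, H)
pooled = out[0]  # the class-token output serves as the utterance-level vector
```

All heads are computed in one batched matrix product, which is what "in parallel" means here, in contrast to a serialized design where each head consumes the output of the previous one.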

| Models | VoxCeleb1-O EER(%) | VoxCeleb1-O minDCF | VoxCeleb1-E EER(%) | VoxCeleb1-E minDCF | VoxCeleb1-H EER(%) | VoxCeleb1-H minDCF |
| --- | --- | --- | --- | --- | --- | --- |
| Snyder et al. (baseline) [3] | 2.0943 | 0.1834 | 1.9461 | 0.1958 | 3.1885 | 0.2773 |
| Zhu et al. [7] | 1.8558 | 0.2526 | 1.9874 | 0.2184 | 3.4930 | 0.3207 |
| PoFormer () | 1.6172 | 0.1499 | 1.6916 | 0.1756 | 2.8754 | 0.2572 |
| PoFormer () | 1.5641 | 0.1345 | 1.6547 | 0.1698 | 2.8255 | 0.2593 |
| PoFormer (7 layers) | 1.5218 | 0.1298 | 1.6203 | 0.1661 | 2.7739 | 0.2520 |

Table 1: Performance of our proposed PoFormer compared with statistical pooling and serialized multi-head multi-layer pooling. All the models are based on our own implementation and are trained with the same 1024-dimensional TDNN backbone.
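The relative gains quoted for VoxCeleb1-H (13.00% in EER and 9.12% in minDCF) can be reproduced from the baseline row and the last PoFormer row of Table 1. The helper name `relative_improvement` is introduced here for illustration only.

```python
def relative_improvement(baseline: float, proposed: float) -> float:
    """Relative reduction of an error metric, in percent."""
    return 100.0 * (baseline - proposed) / baseline

# VoxCeleb1-H numbers from Table 1: baseline row vs. the last PoFormer row.
eer_gain = relative_improvement(3.1885, 2.7739)
dcf_gain = relative_improvement(0.2773, 0.2520)
print(f"EER: {eer_gain:.2f}%  minDCF: {dcf_gain:.2f}%")
```

Both figures are therefore relative reductions with respect to the baseline, not absolute differences in percentage points.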

2.2 Positional Encoding Generator

Figure 2: The structure of positional encoding generator (PEG). Note that the class token is first split and concatenated back after the position encoding is generated.

To make the input of the transformer permutation-variant but translation-invariant, we use a 1-D depth-wise convolution layer to generate the positional encoding. The class token is not involved in generating the positional encoding: as illustrated in Fig. 2, the class token is split away from the features before the positional encoding is calculated. After the generated positional encoding is added to the frame-level features, the class token is concatenated back and the sequence is fed into the MHSA module. Moreover, we apply one PEG module before every transformer layer to improve the performance.
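A PEG of this kind can be sketched as below. The sketch assumes that the class token sits in the first row, that the kernel size is odd and that zero padding is used at the sequence borders; the function name `peg`, the kernel size of 3 and the feature dimension of 8 are illustrative choices, not values from the paper.

```python
import numpy as np

def peg(seq: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    """Add a convolution-generated positional encoding to the frames only.

    seq:    (T + 1, D) sequence whose first row is the class token (assumed layout).
    kernel: (K, D) depth-wise filter with odd K; each channel has its own taps.
    """
    cls, frames = seq[:1], seq[1:]                 # split the class token away
    k = kernel.shape[0]
    pad = k // 2
    padded = np.pad(frames, ((pad, pad), (0, 0)))  # zero padding keeps length T
    # Depth-wise 1-D convolution: channel d only sees channel d.
    pos = np.zeros_like(frames)
    for j in range(k):
        pos += padded[j:j + frames.shape[0]] * kernel[j]
    # Add the encoding to the frames, then concatenate the class token back.
    return np.concatenate([cls, frames + pos], axis=0)

rng = np.random.default_rng(0)
kernel = rng.standard_normal((3, 8))
seq_short = rng.standard_normal((51, 8))   # 50 frames plus the class token
seq_long = rng.standard_normal((201, 8))   # 200 frames plus the class token
out_short = peg(seq_short, kernel)
out_long = peg(seq_long, kernel)
```

Because the encoding is computed from a local neighbourhood of each frame, the same kernel applies to sequences of any length, which is why no retraining or interpolation is needed when test utterances are longer than training segments.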

2.3 Drop path and LayerScale

To make the training process more stable and to avoid overfitting, we add LayerScale and drop path to our PoFormer. LayerScale scales the outputs of the MHSA module and the FFN module channel-wise with learnable scale factors. This technique helps to stabilize the early stages of training. Formally, let $\lambda_l$ and $\lambda'_l$ denote the scale factors; then LayerScale transforms equations 1 and 2 into

$$X'_l = X_{l-1} + \lambda_l \odot \mathrm{MHSA}(\mathrm{LN}(X_{l-1})),$$

$$X_l = X'_l + \lambda'_l \odot \mathrm{FFN}(\mathrm{LN}(X'_l)),$$

where $X_{l-1}$ is the input of the $l$-th layer and $\odot$ denotes channel-wise multiplication.

Drop path [15] is inserted after the MHSA or FFN module, before its output is added to the residual connection branch. It sets the output of the MHSA or FFN module to zero with a certain probability, so that the input is passed forward unchanged through the residual connection. This prevents the model from overfitting the training data.
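The two techniques combine on each residual branch roughly as sketched below. The sketch makes several assumptions that the text does not state: the function names, the initial scale value of 1e-4, and the rescaling of kept branches by 1 / (1 - rate), which is a common drop path convention. `sublayer` stands for either the normalized MHSA or the normalized FFN module, and `np.tanh` is only a stand-in for it.

```python
import numpy as np

def drop_path(branch, drop_rate, training, rng):
    """Zero the whole residual branch with probability `drop_rate` during training."""
    if not training or drop_rate == 0.0:
        return branch
    keep = 1.0 - drop_rate
    if rng.random() >= keep:
        return np.zeros_like(branch)
    # Rescaling kept branches by 1 / keep preserves the expected output;
    # this convention is an assumption, not taken from the paper.
    return branch / keep

def residual_branch(x, sublayer, scale, drop_rate, training, rng):
    """x + DropPath(scale * sublayer(x)) with a channel-wise learnable `scale`."""
    return x + drop_path(scale * sublayer(x), drop_rate, training, rng)

rng = np.random.default_rng(0)
x = rng.standard_normal((10, 4))
scale = np.full(4, 1e-4)  # small initial value (assumed)
y_train = residual_branch(x, np.tanh, scale, drop_rate=0.3, training=True, rng=rng)
y_eval = residual_branch(x, np.tanh, scale, drop_rate=0.3, training=False, rng=rng)
```

When a branch is dropped the layer reduces to the identity, so the network is effectively trained as an ensemble of shallower models. In batched training the decision is usually drawn independently per sample; the single-sequence version above omits that detail.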

3 Experiments

3.1 Dataset

We used the VoxCeleb2 development set [17] to train all of our models. The data augmentation techniques were the same as in [20], except that we did not apply online data augmentation. After augmentation, we had 16,380,135 audio segments from 17,982 speakers. The models were evaluated by their equal error rate (EER) and minimum decision cost function (minDCF) on VoxCeleb1-O, VoxCeleb1-E and VoxCeleb1-H [16, 17].
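As a reminder of what the EER measures, the sketch below sweeps a decision threshold over the trial scores and reports the point where the false acceptance and false rejection rates are closest. It is a simplified illustration: the function name and the toy trial list are made up here, the quadratic-time sweep is only suitable for small lists, and established scoring tools typically interpolate between operating points. The minDCF is not covered because its cost parameters are not given in the text.

```python
def equal_error_rate(scores, labels):
    """Approximate EER (%) by sweeping the decision threshold over the trial scores.

    scores: similarity per trial (higher means "same speaker").
    labels: 1 for target trials, 0 for non-target trials.
    """
    n_target = sum(labels)
    n_nontarget = len(labels) - n_target
    best_gap, eer = float("inf"), None
    for threshold in sorted(set(scores)):
        # A trial is accepted when its score reaches the threshold.
        false_reject = sum(1 for s, y in zip(scores, labels) if y == 1 and s < threshold)
        false_accept = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= threshold)
        frr = false_reject / n_target
        far = false_accept / n_nontarget
        if abs(far - frr) < best_gap:
            best_gap, eer = abs(far - frr), (far + frr) / 2
    return 100.0 * eer

# Toy trial list: one non-target scores above two of the four targets.
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
toy_labels = [1, 1, 0, 1, 1, 0, 0, 0]
print(equal_error_rate(toy_scores, toy_labels))
```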

3.2 Experiment and Results

We used a 1024-dimensional TDNN backbone whose output dimension was 1500. Both the PoFormer dimension and the speaker embedding dimension were set to 512. The number of heads in PoFormer was 4 and the dimension of the feed-forward layer was 1024, twice the PoFormer dimension. The drop path rate was adjusted according to the number of layers. For PoFormers with , and transformer layers, the drop path rate was set as , and respectively. We used 81-dimensional fbank features as the input, with no extra voice activity detection. During training, 300 frames were extracted from each audio segment. The scale and the margin of AM-Softmax were 30 and 0.25 respectively. Our system was trained with an AdamW optimizer with a weight decay of 0.2. A cosine annealing learning rate scheduler was applied with an initial learning rate of 1e-3 and a minimum learning rate of 5e-5, and training lasted for 100,000 steps. Furthermore, a 10,000-step learning rate warm-up was used for stability.

Table 1 reports the performance of PoFormers with different numbers of layers on the three test sets. PoFormer outperforms both the baseline and the serialized multi-head multi-layer attention module (reproduced by us with the same backbone and embedding layer). On VoxCeleb1-H, our 7-layer PoFormer improved the EER by 13.00% and the minDCF by 9.12% compared with the baseline.
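The learning rate schedule described above might look like the following sketch. Two details are assumptions, since the text does not specify them: the warm-up is taken to be linear, and the 10,000 warm-up steps are counted as part of the 100,000 training steps. The function name `learning_rate` is likewise illustrative.

```python
import math

def learning_rate(step, total_steps=100_000, warmup_steps=10_000,
                  peak_lr=1e-3, min_lr=5e-5):
    """Linear warm-up followed by cosine annealing down to `min_lr`."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

# Print the rate at a few milestones of the run.
for step in (0, 5_000, 10_000, 55_000, 100_000):
    print(step, f"{learning_rate(step):.2e}")
```

The rate rises to 1e-3 by the end of the warm-up and then decays smoothly towards 5e-5 at the final step.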

3.3 Ablation Studies

3.3.1 Drop path

To show the importance of drop path, we present the performance of PoFormers with different drop path rates in Table 2. Only results on the most difficult test set, VoxCeleb1-H, are listed (the same holds for the following ablations). As shown in the table, the performance is poor without drop path, and the best result is observed with a drop path rate of 0.3. The PoFormers consist of three transformer layers and are trained without positional encoding. Note that, as the best drop path rate can be sensitive to the number of transformer layers, we recommend a larger drop path rate for more transformer layers.

| Drop path rate | EER(%) | minDCF |
| --- | --- | --- |
| (no drop path) | 6.0909 | 0.5043 |
| | 3.2414 | 0.2943 |
| | 3.1773 | 0.2841 |
| 0.3 | 3.0651 | 0.2734 |
| | 3.1990 | 0.2909 |

Table 2: Ablation study on drop path rate. PoFormers trained with different drop path rates and evaluated on VoxCeleb1-H are presented.

3.3.2 PEG

Table 3 reports the performance of PoFormers with different positional encoding strategies. Either removing the PEG or replacing it with a sinusoidal positional encoding degrades performance. Sinusoidal encodings cause the performance to decay because the test audio segments vary in length while the training sequences all have the same length. Moreover, as long as a PEG is used, the result is insensitive to the kernel size: PEGs with different kernel sizes give similar EER and minDCF.

| PEG kernel size | EER(%) | minDCF |
| --- | --- | --- |
| No PEG | 3.0651 | 0.2734 |
| sinusoidal | 3.3023 | 0.2951 |
| | 2.9384 | 0.2625 |
| | 2.8754 | 0.2572 |
| | 2.8555 | 0.2654 |

Table 3: Ablation study on PEG. PoFormers trained with different PEGs and evaluated on VoxCeleb1-H are presented.

3.3.3 Pre-norm and post-norm

Figure 3: The difference between (a) pre-norm transformer and (b) post-norm transformer.

As mentioned in Section 2, our PoFormer uses the pre-norm transformer [21], where layer normalization is applied before the MHSA module and the FFN module. However, the original transformer structure in [8] is a post-norm transformer. The difference is illustrated in Fig. 3. To find out which one is more suitable for speaker verification, we compare their performance; the results are listed in Table 4. The pre-norm PoFormer outperforms the post-norm one.

| Normalization | EER(%) | minDCF |
| --- | --- | --- |
| pre-norm | 2.8754 | 0.2572 |
| post-norm | 2.9251 | 0.2695 |

Table 4: Ablation study on pre-norm and post-norm. PoFormers with pre-norm and post-norm transformer layers are trained. Both are evaluated on VoxCeleb1-H.
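The two variants differ only in where layer normalization sits relative to the residual addition. The sketch below isolates that difference; `attn` and `ffn` are stand-in callables introduced for illustration (not the actual MHSA and FFN modules), and the layer normalization has no learnable affine parameters for brevity.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def pre_norm_block(x, attn, ffn):
    """Normalize the input of each sub-module; the residual path stays untouched."""
    x = x + attn(layer_norm(x))
    return x + ffn(layer_norm(x))

def post_norm_block(x, attn, ffn):
    """Normalize after each residual addition, as in the original transformer."""
    x = layer_norm(x + attn(x))
    return layer_norm(x + ffn(x))

def toy_attn(z):
    return 0.1 * z  # stand-in for the MHSA module

rng = np.random.default_rng(0)
x = rng.standard_normal((6, 16))
y_pre = pre_norm_block(x, toy_attn, np.tanh)
y_post = post_norm_block(x, toy_attn, np.tanh)
```

In the pre-norm variant the input reaches the output through an unnormalized identity path, whereas the post-norm variant renormalizes the sum after every sub-module.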

3.3.4 Class token

In our standard PoFormer, we use an extra class token to embed and classify the speaker. However, we find that additionally concatenating the mean and the standard deviation of the frame-level output of the last transformer layer gives slightly better performance, which is a bonus for our PoFormer. The results are reported in Table 5.

| PoFormer output | EER(%) | minDCF |
| --- | --- | --- |
| class token | 2.8754 | 0.2572 |
| class token + stats | 2.8295 | 0.2495 |

Table 5: Ablation study on class token. PoFormer with extra statistical information provides slightly better performance on VoxCeleb1-H.
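One way to build the "class token + stats" output is sketched below. It assumes that the class token occupies the first row of the last layer's output and that the three vectors are simply concatenated before the embedding layer; the function name `pooled_output` and the `eps` term are illustrative.

```python
import numpy as np

def pooled_output(last_layer: np.ndarray, use_stats: bool = True, eps: float = 1e-8):
    """Build the utterance-level vector from the last transformer layer.

    last_layer: (T + 1, D) output whose first row is the class token (assumed layout).
    """
    cls, frames = last_layer[0], last_layer[1:]
    if not use_stats:
        return cls
    mean = frames.mean(axis=0)
    std = np.sqrt(frames.var(axis=0) + eps)
    # Class token followed by the frame-level mean and standard deviation.
    return np.concatenate([cls, mean, std])

rng = np.random.default_rng(0)
last = rng.standard_normal((101, 512))  # 100 frames plus the class token
with_stats = pooled_output(last)
token_only = pooled_output(last, use_stats=False)
print(with_stats.shape, token_only.shape)
```

With this layout the input of the embedding layer grows from 512 to 1536 dimensions, so the first linear layer after pooling has to be sized accordingly.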

4 Conclusion

In this paper, we propose a transformer-based pooling structure, PoFormer, for speaker verification systems. To the best of our knowledge, our work is the first attempt to introduce the original transformer structure into the pooling layer. The multi-head self-attention mechanism can effectively capture global information. With PEG, LayerScale and drop path, our PoFormer outperforms the existing pooling systems evaluated in our experiments.

References


  • [1] Najim Dehak, Patrick J. Kenny, Réda Dehak, Pierre Dumouchel, and Pierre Ouellet, “Front-end factor analysis for speaker verification,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 4, pp. 788–798, 2011.
  • [2] David Snyder, Daniel Garcia-Romero, Daniel Povey, and Sanjeev Khudanpur, “Deep Neural Network Embeddings for Text-Independent Speaker Verification,” in Proc. Interspeech 2017, 2017, pp. 999–1003.
  • [3] David Snyder, Daniel Garcia-Romero, Gregory Sell, Daniel Povey, and Sanjeev Khudanpur, “X-vectors: Robust DNN embeddings for speaker recognition,” in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018, pp. 5329–5333.
  • [4] Brecht Desplanques, Jenthe Thienpondt, and Kris Demuynck, “ECAPA-TDNN: Emphasized channel attention, propagation and aggregation in TDNN based speaker verification,” in Interspeech 2020, Oct 2020.
  • [5] K. Okabe, T. Koshinaka, and K. Shinoda, “Attentive statistics pooling for deep speaker embedding,” in Interspeech 2018, 2018.
  • [6] Yingke Zhu, Tom Ko, David Snyder, Brian Kan-Wing Mak, and Daniel Povey, “Self-attentive speaker embeddings for text-independent speaker verification,” in INTERSPEECH, 2018.
  • [7] H. Zhu, K. A. Lee, and H. Li, “Serialized multi-layer multi-head attention for neural speaker embedding,” 2021.
  • [8] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin, “Attention is all you need,” in Proceedings of the 31st International Conference on Neural Information Processing Systems, Red Hook, NY, USA, 2017, NIPS’17, p. 6000–6010, Curran Associates Inc.
  • [9] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova, “BERT: pre-training of deep bidirectional transformers for language understanding,” CoRR, vol. abs/1810.04805, 2018.
  • [10] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby, “An image is worth 16x16 words: Transformers for image recognition at scale,” 2021.
  • [11] Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko, “End-to-end object detection with transformers,” 2020.
  • [12] Sixiao Zheng, Jiachen Lu, Hengshuang Zhao, Xiatian Zhu, Zekun Luo, Yabiao Wang, Yanwei Fu, Jianfeng Feng, Tao Xiang, Philip H.S. Torr, and Li Zhang, “Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers,” in CVPR, 2021.
  • [13] Xiangxiang Chu, Zhi Tian, Bo Zhang, Xinlong Wang, Xiaolin Wei, Huaxia Xia, and Chunhua Shen, “Conditional positional encodings for vision transformers,” 2021.
  • [14] Hugo Touvron, Matthieu Cord, Alexandre Sablayrolles, Gabriel Synnaeve, and Hervé Jégou, “Going deeper with image transformers,” 2021.
  • [15] Gustav Larsson, Michael Maire, and Gregory Shakhnarovich, “FractalNet: Ultra-deep neural networks without residuals,” 2017.
  • [16] Arsha Nagrani, Joon Son Chung, and Andrew Zisserman, “VoxCeleb: A large-scale speaker identification dataset,” in Interspeech 2017, Aug 2017.
  • [17] Joon Son Chung, Arsha Nagrani, and Andrew Zisserman, “VoxCeleb2: Deep speaker recognition,” in Interspeech 2018, Sep 2018.
  • [18] A. Waibel, T. Hanazawa, G. Hinton, K. Shikano, and K.J. Lang, “Phoneme recognition using time-delay neural networks,” IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 37, no. 3, pp. 328–339, 1989.
  • [19] Feng Wang, Jian Cheng, Weiyang Liu, and Haijun Liu, “Additive margin softmax for face verification,” IEEE Signal Processing Letters, vol. 25, no. 7, pp. 926–930, Jul 2018.
  • [20] Miao Zhao, Yufeng Ma, Min Liu, and Minqiang Xu, “The speakin system for voxceleb speaker recognition challange 2021,” 2021.
  • [21] Ruibin Xiong, Yunchang Yang, Di He, Kai Zheng, Shuxin Zheng, Huishuai Zhang, Yanyan Lan, Liwei Wang, and Tie-Yan Liu, “On layer normalization in the transformer architecture,” 2020.