Text-independent speaker verification verifies the identity of a speaker by judging whether a speech segment comes from that speaker. Since the technology is not restricted by the text content, it has extensive application scenarios. Most state-of-the-art speaker verification systems are based on the time delay neural network (TDNN), so improving the TDNN-based network structure has always been an active area of research. Enhancing the temporal context modeling capability can be very effective, which has been proved by introducing an LSTM or GRU layer before the pooling layer of the typical TDNN [1, 2]. The S-vector system uses the Transformer encoder module to replace the TDNN module of the conventional X-vector system [5, 6], which can capture better global temporal characteristics. Besides, ECAPA-TDNN, which puts more emphasis on channel interdependence, has also shown impressive performance improvements on the VoxCeleb test sets and in the 2019 VoxCeleb Speaker Recognition Challenge [7, 8].
In this work we propose MACCIF-TDNN to further enhance the TDNN architecture and the statistics pooling layer. We use the SE-Res2Blocks as in ECAPA-TDNN to explicitly model channel interdependence, realizing adaptive calibration of channel features, and to process context features in a multi-scale way at a more granular level than conventional TDNN-based methods. In addition, we explore using the Transformer encoder structure to model global context interdependence features at the utterance level. We also find it beneficial to extend the pooling method in a multi-head way, which can discriminate features from multiple aspects. We expect this arrangement to aggregate channel interdependence as well as local and global context interdependence features from multiple aspects. To the best of our knowledge, this is the first work that integrates SE-Res2Blocks, the Transformer encoder and multi-head attentive statistics pooling in a TDNN-based network.
2 Baseline system architectures
ECAPA-TDNN [7] integrates a Res2Net module to enhance the central convolutional layer and constructs hierarchical residual connections to handle multi-scale features. It also introduces 1-dimensional TDNN-specific SE-blocks, which rescale the intermediate time-context-bound frame-level features per channel according to global utterance properties. The pooling layer uses a channel- and context-dependent attention mechanism to attend to different speaker characteristics at different time steps for each feature map. Finally, Multi-layer Feature Aggregation (MFA) provides additional complementary information for the statistics pooling by concatenating the final frame-level features with the intermediate features of previous layers. The architecture is trained with the AAM-softmax [10, 11] loss. Additionally, the ECAPA-CNN-TDNN work [8] introduces a 2D convolutional stem into a strong ECAPA-TDNN baseline to transfer some of the strong characteristics of a ResNet-based model to this hybrid CNN-TDNN architecture.
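To make the per-channel SE rescaling step concrete, the following is a minimal NumPy sketch of a 1-dimensional SE block applied to frame-level features; the function name, bottleneck width and ReLU/sigmoid choices are illustrative assumptions, not the exact implementation of [7]:

```python
import numpy as np

def se_rescale(h, w1, b1, w2, b2):
    """1-D Squeeze-Excitation rescaling of frame-level features.

    h: (C, T) features with C channels and T frames. The squeeze step
    averages over time to get a global utterance descriptor; the
    excitation step maps it through a bottleneck and a sigmoid to
    per-channel scales in (0, 1), which rescale each channel."""
    z = h.mean(axis=1)                                   # squeeze: (C,)
    hidden = np.maximum(w1 @ z + b1, 0.0)                # bottleneck + ReLU
    s = 1.0 / (1.0 + np.exp(-(w2 @ hidden + b2)))        # sigmoid scales: (C,)
    return h * s[:, None]                                # per-channel rescaling

# toy example: 4 channels, 10 frames, bottleneck of width 2
rng = np.random.default_rng(0)
h = rng.standard_normal((4, 10))
w1, b1 = rng.standard_normal((2, 4)), np.zeros(2)
w2, b2 = rng.standard_normal((4, 2)), np.zeros(4)
out = se_rescale(h, w1, b1, w2, b2)
```

Note that the scale is constant across time for a given channel, which is exactly how global utterance properties modulate local frame-level features.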
The S-vector proposes a self-attention-based alternative to the X-vector system [5, 6]. The TDNN of the X-vector system, which has a finite context, is replaced with the Transformer encoder module. This arrangement can capture better speaker characteristics due to the unrestricted context. Also, self-attention is built on the dot product between frames, so it can capture the similarities across an utterance efficiently.
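The unrestricted-context property can be illustrated with a minimal NumPy sketch of single-head scaled dot-product self-attention over an utterance; the weight names and dimensions are illustrative:

```python
import numpy as np

def self_attention(x, wq, wk, wv):
    """Single-head scaled dot-product self-attention.

    x: (T, d) frame features. Each output frame is a weighted sum over
    ALL T frames, so the context is unrestricted, unlike a TDNN whose
    receptive field covers only a few neighbouring frames."""
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ k.T / np.sqrt(k.shape[1])          # (T, T) frame-to-frame similarities
    a = np.exp(scores - scores.max(axis=1, keepdims=True))
    a /= a.sum(axis=1, keepdims=True)               # softmax over frames
    return a @ v, a

rng = np.random.default_rng(1)
x = rng.standard_normal((6, 8))                     # 6 frames, 8-dim features
wq, wk, wv = (rng.standard_normal((8, 8)) for _ in range(3))
out, attn = self_attention(x, wq, wk, wv)
```

The (T, T) attention matrix is precisely the dot-product similarity across the utterance mentioned above.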
The R-vector system is based on the ResNet18 and ResNet34 implementations of the successful ResNet architecture. The convolutional frame layers of this network process the features as a 2-dimensional signal before collecting the mean and standard deviation statistics in the pooling layer.
3 Proposed system architectures
In this section, we examine some limitations of TDNN-based methods and incorporate potential solutions into our MACCIF-TDNN architecture. The following subsections focus on the aggregation of complementary features and the multi-head attentive statistics pooling. An overview of the complete architecture is given in Fig. 1.
3.1 Aggregation of complementary features
The temporal context in TDNN-based architectures is limited to several frames. As the network apparently benefits from longer-term dependence, we argue it could be beneficial to strengthen the capability of global temporal context modeling. For this purpose we introduce the Transformer encoder structure into our architecture. Different from RNN networks, the Transformer relies entirely on an attention mechanism to draw global dependence. The advantage of the Transformer encoder structure is therefore that it is not restricted to a finite context and attends to all frames at every time step. However, the encoder structure does not learn channel interdependence characteristics and lacks the ability of multi-scale temporal context modeling. On the other hand, the SE-Res2Block integrates the SE block and Res2Net, which not only models channel interdependence but also enhances the multi-scale local context modeling ability. Therefore, we combine the SE-Res2Block with the Transformer encoder module.
We aggregate the complementary features extracted by these two parts as in Fig. 2. For each frame we concatenate the output feature maps of the SE-Res2Blocks and the encoder layer. After the Multi-layer Feature Aggregation (MFA), a dense layer processes the concatenated information to generate the features for the attentive statistics pooling. To further aggregate multi-layer information, we use the outputs of all preceding layers and of the initial convolutional layer as the input of each frame layer, which is implemented by defining the residual connection in each frame layer as the sum of the outputs of all previous layers.
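A minimal NumPy sketch of this frame-wise aggregation step; the shapes, the ReLU dense layer and the variable names are illustrative assumptions:

```python
import numpy as np

def aggregate(se_res2_out, encoder_out, w_dense, b_dense):
    """Concatenate the two complementary feature maps frame by frame and
    project them with a dense layer before attentive statistics pooling.

    se_res2_out: (C1, T) SE-Res2Block features (local, multi-scale context).
    encoder_out: (C2, T) Transformer-encoder features (global context)."""
    cat = np.concatenate([se_res2_out, encoder_out], axis=0)  # (C1 + C2, T)
    return np.maximum(w_dense @ cat + b_dense, 0.0)           # dense layer + ReLU

rng = np.random.default_rng(2)
local_feats = rng.standard_normal((4, 5))    # C1 = 4 channels, T = 5 frames
global_feats = rng.standard_normal((3, 5))   # C2 = 3 channels
w = rng.standard_normal((6, 7))              # project 7 -> 6 channels
b = np.zeros((6, 1))
pooled_input = aggregate(local_feats, global_feats, w, b)
```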
3.2 Multi-head channel- and context-dependent statistics pooling
In recent TDNN-based architectures, the attention mechanism is integrated into the temporal pooling [13, 14]. ECAPA-TDNN [7] extends this temporal attention mechanism even further to the channel dimension, which enables the network to focus more on speaker characteristics that do not activate at identical or similar time instances. Inspired by Vector-Based Attentive Pooling, we argue that it might be beneficial to extend the channel- and context-dependent statistics pooling in a multi-head way to obtain better speaker embeddings from multiple aspects.
The channel- and context-dependent attention mechanism in [7] is defined as:

e_{t,c} = v_c^T f(W h_t + b) + k_c, \quad \alpha_{t,c} = \frac{\exp(e_{t,c})}{\sum_{\tau} \exp(e_{\tau,c})},

where the weights over all frames can be represented as a matrix A with entries A_{t,c} = \alpha_{t,c}. The score \alpha_{t,c} is the self-attention score which represents the importance of frame t given channel c.
To discriminate speakers from multiple aspects, we extend the pooling method in a multi-head way according to the following equations:

e_{t,c}^{(i)} = (v_c^{(i)})^T f(W^{(i)} h_t + b^{(i)}) + k_c^{(i)}, \quad \alpha_{t,c}^{(i)} = \frac{\exp(e_{t,c}^{(i)})}{\sum_{\tau} \exp(e_{\tau,c}^{(i)})},

where W^{(i)} and v^{(i)} are weight matrices; b^{(i)} and k^{(i)} are bias items; A^{(i)} with entries \alpha_{t,c}^{(i)} is the attention weight matrix generated by the i-th attention head. The weighted mean vector \tilde{\mu}^{(i)} and weighted standard deviation vector \tilde{\sigma}^{(i)} generated by the i-th attention head can be obtained by the following equations:

\tilde{\mu}_c^{(i)} = \sum_t \alpha_{t,c}^{(i)} h_{t,c}, \quad \tilde{\sigma}_c^{(i)} = \sqrt{\sum_t \alpha_{t,c}^{(i)} h_{t,c}^2 - (\tilde{\mu}_c^{(i)})^2}.

The final output of the pooling layer is given by concatenating the weighted mean vectors and weighted standard deviation vectors of all I attention heads:

o = [\tilde{\mu}^{(1)}; \tilde{\sigma}^{(1)}; \ldots; \tilde{\mu}^{(I)}; \tilde{\sigma}^{(I)}].
To ensure that each head collects dissimilar information, a penalty term P is added to the loss function; this penalty term encourages the diversity of the attention matrices across the different attention heads:

P = \lambda \sum_{i \neq j} \| (A^{(i)})^T A^{(j)} - \mu I \|_F^2,

where \lambda and \mu are the corresponding hyper-parameters and \| \cdot \|_F represents the Frobenius norm of a matrix.
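The multi-head pooling and a head-diversity penalty of this kind can be sketched in NumPy as follows; tanh as the nonlinearity f, the hidden width, and the exact penalty form are illustrative assumptions:

```python
import numpy as np

def softmax(e, axis):
    e = np.exp(e - e.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_asp(h, params):
    """h: (C, T) frame features; params: per-head tuples (W, b, V, k).
    Each head computes channel- and context-dependent attention weights,
    then a weighted mean and std per channel; head outputs are concatenated."""
    outs, attns = [], []
    for w, b, v, k in params:
        e = v @ np.tanh(w @ h + b) + k                # (C, T) scores per channel
        a = softmax(e, axis=1)                        # attend over the T frames
        mu = (a * h).sum(axis=1)                      # weighted mean, (C,)
        sg = np.sqrt(np.clip((a * h**2).sum(axis=1) - mu**2, 1e-9, None))
        outs.append(np.concatenate([mu, sg]))
        attns.append(a)
    return np.concatenate(outs), attns

def diversity_penalty(attns, lam=1.0, mu_coef=1.0):
    """Penalize pairwise similarity between the heads' attention matrices."""
    p = 0.0
    for i in range(len(attns)):
        for j in range(i + 1, len(attns)):
            m = attns[i] @ attns[j].T - mu_coef * np.eye(attns[i].shape[0])
            p += lam * np.sum(m**2)                   # squared Frobenius norm
    return p

rng = np.random.default_rng(3)
C, T, R, heads = 4, 10, 3, 2                          # channels, frames, hidden, heads
h = rng.standard_normal((C, T))
params = [(rng.standard_normal((R, C)), rng.standard_normal((R, 1)),
           rng.standard_normal((C, R)), rng.standard_normal((C, 1)))
          for _ in range(heads)]
emb, attns = multi_head_asp(h, params)                # embedding of size 2 * C * heads
```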
4.1 Dataset and Data Augmentation
We train models on the development set of VoxCeleb1 and the development set of VoxCeleb2 respectively. The VoxCeleb1 development set contains over 100,000 utterances from 1,211 celebrities, extracted from videos uploaded to YouTube. The VoxCeleb2 development set contains over 1 million utterances from 5,994 celebrities. To generate additional augmented copies, we augment the data with reverberation (the RIR dataset) and noise (the MUSAN dataset). The remaining three augmentations are generated with the open-source SoX (tempo up, tempo down) and FFmpeg (compression) libraries.
4.2 Training settings
We follow the same data preparation and feature extraction steps as the standard ECAPA-TDNN for all trained models. The input features are 80-dimensional MFCCs extracted with a window length of 25 ms and a shift of 10 ms, without voice activity detection. Three-second random crops of the MFCC feature vectors are normalized through cepstral mean subtraction. All models are trained using AAM-softmax [10, 11] with a margin of 0.2 and a softmax prescaling of 30 for 4 cycles. We apply a weight decay of 2e-4 on all weights in the model, including the AAM-softmax weights. We adopt 512 channels in the convolutional frame layers of all trained models. All models are trained with a cyclical learning rate varying between 1e-8 and 1e-3 using the triangular2 policy with the Adam optimizer. The duration of one cycle is set to 130k iterations. In the encoder layer of the MACCIF-TDNN system we employ 8 parallel attention heads. For the multi-head attentive pooling we employ 2 parallel attention heads, and the two penalty hyper-parameters are set to 1 and 1 respectively. The mini-batch size for training is 64.
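The triangular2 schedule can be expressed as a small function; this is a sketch of Smith's policy with the learning-rate bounds and 130k-iteration cycle stated above (step size = half a cycle):

```python
def triangular2_lr(it, base_lr=1e-8, max_lr=1e-3, step_size=65000):
    """Cyclical learning rate, triangular2 policy: within each cycle the
    rate ramps linearly from base_lr up to max_lr and back down, and the
    triangle's amplitude is halved every cycle. One full cycle spans
    2 * step_size iterations (130k here)."""
    cycle = it // (2 * step_size)
    x = abs(it / step_size - 2 * cycle - 1)        # position in cycle, in [0, 1]
    return base_lr + (max_lr - base_lr) * max(0.0, 1.0 - x) / (2 ** cycle)
```

For example, the rate starts at 1e-8, peaks at 1e-3 after 65k iterations, returns to 1e-8 at 130k, and the second cycle peaks at roughly half the amplitude.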
4.3 Evaluation settings
The scores are produced using the cosine distance between speaker embeddings extracted from the final fully connected layer for all systems. The VoxCeleb1 dataset is used to evaluate the systems. The standard Equal Error Rate (EER) and the minimum normalized detection cost function (MinDCF) are used as the evaluation metrics. For the MinDCF calculation we use the standard parameter settings.
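For concreteness, here is a minimal sketch of cosine scoring and a simple threshold-sweep EER computation; it is illustrative, not the exact evaluation code:

```python
import numpy as np

def cosine_score(e1, e2):
    """Cosine similarity between two speaker embeddings."""
    return float(e1 @ e2 / (np.linalg.norm(e1) * np.linalg.norm(e2)))

def eer(target_scores, nontarget_scores):
    """Equal Error Rate: sweep a threshold over all observed scores and
    return the operating point where the false-rejection and
    false-acceptance rates are closest."""
    best_gap, best = float("inf"), 1.0
    for thr in sorted(target_scores + nontarget_scores):
        fr = sum(s < thr for s in target_scores) / len(target_scores)
        fa = sum(s >= thr for s in nontarget_scores) / len(nontarget_scores)
        if abs(fa - fr) < best_gap:
            best_gap, best = abs(fa - fr), (fa + fr) / 2
    return best
```

A trial is accepted when the cosine score of its two embeddings exceeds the chosen threshold; perfectly separated target and non-target scores yield an EER of 0.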
4.4 Experimental Results
To reveal how each of the proposed improvements affects performance, we conduct a group of ablation studies; all results are shown in Table 1. All ablation studies are trained on the VoxCeleb1 training set, and the VoxCeleb1 test set is used for evaluation. Experiment A.0 uses the SE-Res2Block and attentive statistics pooling (ASP), which is essentially the same as ECAPA-TDNN; we refer to the publicly available code to complete this experiment, and all other improvements are based on this implementation. Experiment A.1 uses the multi-head ASP instead of the version applied in experiment A.0. Experiment A.2 uses the Transformer encoder layer instead of the SE-Res2Block. Different from experiments A.1 and A.2, A.3 uses the integrated structure of the SE-Res2Block and the Transformer encoder module with ASP. Experiment A.4 uses all the proposed improvements, i.e., the integration of the SE-Res2Block, the Transformer encoder module and the multi-head ASP.
Comparing the result of A.4 with A.2 confirms the benefit of integrating the SE-Res2Block, which improves the EER and MinDCF metrics by 3.4% and 2.4%, respectively. This indicates that capturing channel interdependence and multi-scale local context features is contributive. The results of experiments A.4 and A.1 show that the Transformer encoder module leads to performance improvements of about 1.6% in EER and 1.3% in MinDCF, respectively. This indicates that the limited local context of the frame-level features is insufficient and can be complemented with global utterance-based information. Comparing experiment A.4 with A.3, we can see that the multi-head attentive statistics pooling improves performance by about 0.5% in EER and 0.8% in MinDCF, respectively. This indicates that extending the pooling method in a multi-head way helps to discriminate features from multiple aspects. The above comparisons illustrate that all the proposed improvements have a positive impact on the performance of the TDNN-based system. Additionally, the results of experiments A.4 and A.0 show that aggregating all improvements achieves improvements of 4.5% in EER and 5.4% in MinDCF, respectively. The fact that the overall improvement brings greater benefits than each single one reveals that MACCIF-TDNN can indeed capture the complementary multi-aspect channel and context features.
A performance overview of the baseline systems and our proposed MACCIF-TDNN system is given in Table 2. The ECAPA-TDNN (reproduced) and our proposed MACCIF-TDNN models are trained on the VoxCeleb2 training set. Our proposed MACCIF-TDNN architecture clearly outperforms the E-TDNN, R-vector and S-vector baseline systems, and gives relative improvements of 4.8% in EER and 3.2% in MinDCF over the reproduced standard ECAPA-TDNN baseline system. This reveals that aggregating the channel and context interdependence features in a multi-aspect way in a TDNN-based architecture can indeed bring performance improvements. Note that we could not reproduce the reported performance of the standard ECAPA-TDNN (512) using the publicly available code. Moreover, both the ECAPA-TDNN (1024) and the latest ECAPA-CNN-TDNN are trained with 1024 channels in the convolutional frame layers, so they cannot be compared with MACCIF-TDNN directly. Nevertheless, MACCIF-TDNN still achieves competitive results.
In this paper, we propose a new TDNN-based network architecture which aggregates channel and context interdependence features in a multi-aspect way. Firstly, we use SE-Res2Net blocks to explicitly model channel interdependence and to process context features in a multi-scale way. Secondly, we explore using the Transformer encoder structure to model long-term context interdependence features. Before the pooling layer, we aggregate the outputs of the SE-Res2Net blocks and the Transformer encoder to leverage the complementary channel and context interdependence features that each of them learns. Finally, we extend the single attentive statistics pooling in a multi-head way, which can discriminate features from multiple aspects. Experiments show that the proposed MACCIF-TDNN architecture outperforms most state-of-the-art TDNN-based systems on the VoxCeleb1 test sets.
-  R. Li, D. Chen, and W. Zhang, “Voiceai systems to nist sre19 evaluation: Robust speaker recognition on conversational telephone speech,” in ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2020.
-  S. Novoselov, A. Gusev, A. Ivanov, T. Pekhovsky, A. Shulipa, G. Lavrentyeva, V. Volokhov, and A. Kozlov, “Stc speaker recognition systems for the voices from a distance challenge,” 2019.
-  S. Metilda, S. V. Katta, and S. Umesh, “S-vectors: Speaker embeddings based on transformer’s encoder for text-independent speaker verification,”
-  A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, “Attention is all you need,” arXiv preprint arXiv:1706.03762, 2017.
-  D. Snyder, D. Garcia-Romero, D. Povey, and S. Khudanpur, “Deep neural network embeddings for text-independent speaker verification,” in Interspeech 2017, 2017.
-  D. Snyder, D. Garcia-Romero, G. Sell, D. Povey, and S. Khudanpur, “X-vectors: Robust dnn embeddings for speaker recognition,” in ICASSP 2018 - 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018.
-  B. Desplanques, J. Thienpondt, and K. Demuynck, “Ecapa-tdnn: Emphasized channel attention, propagation and aggregation in tdnn based speaker verification,” in Interspeech 2020, 2020.
-  J. Thienpondt, B. Desplanques, and K. Demuynck, “Integrating frequency translational invariance in tdnns and frequency positional information in 2d resnets to enhance speaker verification,”
-  G. Huang, Z. Liu, V. Laurens, and K. Q. Weinberger, “Densely connected convolutional networks,” in 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
-  J. Deng, J. Guo, and S. Zafeiriou, “Arcface: Additive angular margin loss for deep face recognition,” in 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
-  X. Xiang, S. Wang, H. Huang, Y. Qian, and K. Yu, “Margin matters: Towards more discriminative deep neural network embeddings for speaker recognition,” in 2019 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), 2019.
-  H. Zeinali, S. Wang, A. Silnova, and O. Plchot, “But system description to voxceleb speaker recognition challenge 2019,”
-  K. Okabe, T. Koshinaka, and K. Shinoda, “Attentive statistics pooling for deep speaker embedding,” in Interspeech 2018, 2018.
-  M. India, P. Safari, and J. Hernando, “Self multi-head attention for speaker recognition,” in Interspeech 2019, 2019.
-  Y. Wu, C. Guo, H. Gao, X. Hou, and J. Xu, “Vector-based attentive pooling for text-independent speaker verification,” 2020.
-  A. Nagrani, J. S. Chung, and A. Zisserman, “Voxceleb: a large-scale speaker identification dataset,” in Interspeech, 2017.
-  J. S. Chung, A. Nagrani, and A. Zisserman, “Voxceleb2: Deep speaker recognition,”
-  T. Ko, V. Peddinti, D. Povey, M. L. Seltzer, and S. Khudanpur, “A study on data augmentation of reverberant speech for robust speech recognition,” in IEEE International Conference on Acoustics, 2017.
-  D. Snyder, G. Chen, and D. Povey, “Musan: A music, speech, and noise corpus,” Computer Science, 2015.
-  L. Smith, “Cyclical learning rates for training neural networks,” in 2017 IEEE Winter Conference on Applications of Computer Vision (WACV), 2017.
-  D. Kingma and J. Ba, “Adam: A method for stochastic optimization,” Computer Science, 2014.
-  https://github.com/joovvhan/ecapa-tdnn