1 System Description
For both Track 1 and Track 2, we adopt the same system settings and use no data other than VoxCeleb2-dev. This part focuses on the methods we implemented in this challenge.
1.1 Datasets and Data Augmentation
1.1.1 Training Data
The VoxCeleb2-dev dataset contains 1,092,009 utterances from 5,994 speakers in total. Data augmentation is also quite important in training speaker verification models. We first adopted 3-fold speed augmentation [2, 3] to generate two extra copies of each speaker: each speech segment was perturbed by a factor of 0.9 or 1.1 using the SoX speed function, yielding 3,276,027 utterances from 17,982 speakers. The traditional Kaldi-based [4, 5] method (offline augmentation) is widely adopted in this field, while recent studies [6, 7] proposed augmenting data on the fly (online augmentation). Our system contains both offline- and online-trained models, and the two data augmentation methods are applied separately for the two training modes:
Offline training mode: we used RIRs and MUSAN to create four extra copies of the training utterances, following the Kaldi VoxCeleb recipe. After this augmentation, 16,380,135 utterances from 17,982 speakers were available for acoustic feature extraction.
Online training mode: instead of concatenating different types of augmented copies, we adopted a chain-like augmentation. We predefine an effect chain composed of several augmentations, each of which is activated with its own probability. The effect chain is:
gain augment with a probability of 0.2
white noise augment with a probability of 0.2
RIR reverberation and noise addition augment with a probability of 0.6
time stretch augment with a probability of 0.2
It is worth mentioning that the offline 3-fold speed augmentation is also applied before online augmentation, so the number of classes is likewise 17,982. Speed perturbation changes the pitch of a speaker, while time stretching does not. Both foreground and background noises are added, randomly selected from MUSAN and the RIRs noise set.
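As a sketch, the chain-like scheme above can be expressed as an ordered list of (effect, probability) pairs. The effect names follow the list above, but the transform bodies are stubbed out here; a real pipeline would apply SoX/torchaudio effects to the waveform:

```python
import random

# Illustrative effect chain: each effect fires independently with its own
# probability, in a fixed order (a sketch, not the actual implementation).
AUGMENT_CHAIN = [
    ("gain", 0.2),
    ("white_noise", 0.2),
    ("rir_and_noise", 0.6),
    ("time_stretch", 0.2),
]

def apply_chain(wave, effects, rng):
    """Walk the effect chain; each effect is applied only if its coin flip
    succeeds. Returns the (here unchanged) waveform and the effects applied."""
    applied = []
    for name, prob in effects:
        if rng.random() < prob:
            applied.append(name)  # placeholder: a real system transforms `wave`
    return wave, applied
```

Because the probabilities are independent, a single utterance may receive zero, one, or several effects per epoch, which keeps the augmented distribution diverse without multiplying the dataset on disk.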
1.1.2 Development Sets
VoxCeleb1-O: 37,720 trials sampled from the VoxCeleb1 test set, covering only 40 speakers.
VoxCeleb1-E: This is an extended version of VoxCeleb1-O, containing 581,480 trials from 1,251 speakers.
VoxCeleb1-H: This set has 552,536 trials. It is harder since the two utterances in each pair share the same nationality and gender.
VoxSRC20-dev: This is the validation set of VoxSRC2020; its trials contain out-of-domain data provided by VoxCeleb_cd. This set has 263,486 trials.
VoxSRC21-val: This is the validation set of VoxSRC2021 and has 60,000 trials. Trials in this set contain more multi-lingual data.
In the offline training mode, we extracted both 81-dimensional and 96-dimensional log Mel filterbank energies with Kaldi, using a 25 ms window and a 10 ms frame shift; 200-frame chunks were used without extra voice activity detection (VAD). In the online training mode, speech segments were sliced to 2 seconds and augmented on the fly, and 96-dimensional log Mel filterbank energies were extracted with torchaudio. All features were cepstral-mean normalized in both training modes.
1.2 Network Structures
RepVGG  Recent research proposed a new way to construct ConvNets called the re-parameterization technique, which decouples the training-time and inference-time architectures. RepVGG, one of these re-parameterized models, shows competitive performance in the computer vision field. To the best of our knowledge, we are the first to introduce the RepVGG architecture to speaker verification. As Figure 1 shows, during training the RepVGG block has separate 3x3 and 1x1 convolutional branches, each with batch normalization, and an identity branch with only a batch normalization layer. Since a convolution and its batch normalization can be fused into a single convolution, and both the 1x1 convolution branch and the bare batch normalization branch can be transformed into 3x3 convolutions, all branches of this block are equivalent to three 3x3 convolutions. All these 3x3 convolutions share the same settings (kernel size, stride, groups, dilation, and so on), so they can be fused into a single 3x3 convolution by simply adding their parameters filter-wise. Once merged into one 3x3 convolution followed by a ReLU layer, the block is identical to a VGG block at inference time. We select RepVGG-A2, RepVGG-B1, RepVGG-B2g4, and RepVGG-B2 as our backbones. All models adopt 64 base channels except RepVGG-A2, which uses 96.
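The branch-fusion arithmetic described above can be sketched numerically. This is a minimal illustration of the re-parameterization steps (BN folding, 1x1-to-3x3 padding, identity-to-3x3 conversion), not the authors' implementation:

```python
import numpy as np

def fuse_conv_bn(w, gamma, beta, mean, var, eps=1e-5):
    """Fold a BatchNorm (gamma, beta, running mean/var) into the preceding
    convolution with weights w of shape (Cout, Cin, k, k).
    Returns the fused (weight, bias)."""
    scale = gamma / np.sqrt(var + eps)
    return w * scale[:, None, None, None], beta - mean * scale

def pad_1x1_to_3x3(w):
    """Embed a 1x1 kernel at the centre of an all-zero 3x3 kernel."""
    out = np.zeros((w.shape[0], w.shape[1], 3, 3), dtype=w.dtype)
    out[:, :, 1, 1] = w[:, :, 0, 0]
    return out

def identity_as_3x3(channels):
    """Express the identity branch as a 3x3 convolution (delta at the centre)."""
    w = np.zeros((channels, channels, 3, 3))
    w[np.arange(channels), np.arange(channels), 1, 1] = 1.0
    return w

# After fusing BN into each branch, the three 3x3 kernels (and biases) are
# simply summed filter-wise into the single inference-time convolution.
```

A full conversion would also handle grouped convolutions (e.g. for RepVGG-B2g4) before summing the three fused kernels and biases.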
ResNet  As one of the most classical ConvNets, ResNet has proved its effectiveness in speaker verification. In our systems, both the basic-block-based ResNet-34 and deeper bottleneck-block-based ResNets (ResNet-101 and ResNet-152) are adopted. All of these ResNets use 64 base channels.
1.2.2 Pooling Method
The pooling layer aggregates a variable-length frame-level sequence into an utterance-level embedding. The vanilla approach is to compute the mean and standard deviation along the time axis. However, it is limited by the fact that different frames may contribute unequally. An attention mechanism was therefore introduced to compute weighted statistics of the backbone outputs. Furthermore, multi-head mechanisms were introduced to increase the diversity of attention, such as multi-head self-attentive (MHSA) pooling and self multi-head attention (MHA) pooling. The main difference between these two methods is the definition of the heads: the former attends to the whole feature through different heads, which we call queries, while the latter splits the feature into several parts and lets each head focus on its corresponding part. We propose a multi-query multi-head attention pooling mechanism (MQMHA) for the first time by combining both multi-head methods above, since it can attend to different parts while gaining more diversified information. The method is described below:
Suppose we have a backbone output $X = [x_1, x_2, \dots, x_T]$, and each frame $x_t \in \mathbb{R}^{D}$ is split into $H$ parts $x_t = [x_t^1; x_t^2; \dots; x_t^H]$, where $H$ is the number of attention heads. Each head has $Q$ trainable query vectors $\{w^{h,q}\}_{q=1}^{Q}$. The attention weight of $x_t^h$ is defined as:
$$\alpha_t^{h,q} = \frac{\exp\left((w^{h,q})^{\top} x_t^h\right)}{\sum_{\tau=1}^{T} \exp\left((w^{h,q})^{\top} x_\tau^h\right)}$$
And the representation is expressed as:
$$e^{h,q} = \sum_{t=1}^{T} \alpha_t^{h,q}\, x_t^h$$
Thus MQMHA combines both MHSA and MHA, of which $H = 1$ and $Q = 1$ are the respective special cases. Finally, we concatenate all of the sub-representations to get the utterance-level embedding $e = [e^{1,1}; e^{1,2}; \dots; e^{H,Q}]$. An extra attentive standard deviation $\sigma^{h,q} = \sqrt{\sum_{t=1}^{T} \alpha_t^{h,q} \left(x_t^h - e^{h,q}\right)^2}$ is computed through the attention weights; this standard deviation is concatenated with $e$ to enhance the performance.
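A minimal numerical sketch of MQMHA pooling follows; the head/query layout and the attentive statistics match the description above, while the implementation details (array shapes, the flooring of the variance) are illustrative:

```python
import numpy as np

def mqmha_pool(x, queries):
    """Multi-query multi-head attention (MQMHA) pooling sketch.
    x: (T, D) frame-level features; queries: (H, Q, D // H) trainable
    query vectors -- H heads, each with Q queries over its own feature slice.
    Every (head, query) pair yields an attention-weighted mean and standard
    deviation; all of them are concatenated into the utterance embedding."""
    T, D = x.shape
    H, Q, d = queries.shape
    assert d == D // H
    parts = x.reshape(T, H, d)                  # split each frame into H parts
    outs = []
    for h in range(H):
        for q in range(Q):
            logits = parts[:, h, :] @ queries[h, q]          # (T,)
            w = np.exp(logits - logits.max())
            w /= w.sum()                                     # softmax over time
            mean = (w[:, None] * parts[:, h, :]).sum(axis=0)
            var = (w[:, None] * (parts[:, h, :] - mean) ** 2).sum(axis=0)
            outs.append(np.concatenate([mean, np.sqrt(np.maximum(var, 1e-9))]))
    return np.concatenate(outs)                 # shape: (2 * D * Q,)
```

With zero queries the attention is uniform and the pooled mean reduces to the plain temporal mean, which is a convenient sanity check; in a trained system the queries would be learned parameters.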
1.2.3 Loss Function
Recently, margin-based softmax losses have been widely used in speaker recognition. To further improve performance, we strengthen the AM-Softmax [19, 20] and AAM-Softmax loss functions in two ways.
First, the subcenter method was introduced to reduce the influence of possible noisy samples. Each class $j$ keeps $K$ sub-centers $\{w_{j,k}\}_{k=1}^{K}$, and the similarity between an example $x_i$ and class $j$ is:
$$\cos\theta_{i,j} = \max_{1 \le k \le K} \left( w_{j,k}^{\top} x_i \right)$$
where the $\max$ function selects the nearest sub-center, which inhibits possible noisy samples from interfering with the dominant class center.
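The sub-center selection can be sketched as follows; the shapes are illustrative (normalised embeddings and centres), and the max over the `K` axis implements the nearest-centre rule above:

```python
import numpy as np

def subcenter_cosine(x, W):
    """Sub-center similarity sketch.
    x: (N, D) length-normalised embeddings; W: (C, K, D) length-normalised
    centres, K sub-centers per class. The nearest (max-cosine) sub-center
    represents each class, so noisy samples tend to collapse onto a minority
    sub-center instead of disturbing the dominant one."""
    sims = np.einsum('nd,ckd->nck', x, W)   # (N, C, K) cosines
    return sims.max(axis=2)                 # select the nearest sub-center
```

The returned (N, C) matrix drops straight into any margin-softmax loss in place of the usual single-centre cosine scores.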
Secondly, we propose the Inter-TopK penalty to pay further attention to the centers that obtain high similarities with examples that do not belong to them, adding an extra penalty on these easily misclassified centers. Given a batch of $N$ examples and $C$ classes, the AM-Softmax loss with the extra Inter-TopK penalty is:
$$L = -\frac{1}{N} \sum_{i=1}^{N} \log \frac{e^{s\left(\cos\theta_{i,y_i} - m\right)}}{e^{s\left(\cos\theta_{i,y_i} - m\right)} + \sum_{j \ne y_i} e^{s\,\phi(\cos\theta_{i,j})}}$$
where $m$ is the original margin of AM-Softmax and $s$ is the scale of magnitude. We use $\phi(\cos\theta_{i,j})$ to replace $\cos\theta_{i,j}$ in the denominator:
$$\phi(\cos\theta_{i,j}) = \begin{cases} \cos\theta_{i,j} + m' & \text{if } j \text{ is among the top-}K \text{ closest non-target classes of } x_i \\ \cos\theta_{i,j} & \text{otherwise} \end{cases}$$
where $m'$ is an extra penalty focused on the $K$ closest centers to the example $x_i$; it reduces to the original AM-Softmax when $m' = 0$. Similarly, the Inter-TopK penalty can also be added to the AAM-Softmax loss function by replacing $\cos\theta_{i,y_i} - m$ with $\cos(\theta_{i,y_i} + m)$.
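The Inter-TopK-penalised AM-Softmax can be sketched as below. The hyperparameter values (`s`, `m`, `m_prime`, `k`) are placeholders for illustration, not our exact settings:

```python
import numpy as np

def inter_topk_am_softmax(cos, labels, s=32.0, m=0.2, m_prime=0.06, k=5):
    """AM-Softmax loss with an Inter-TopK penalty (illustrative sketch).
    cos: (N, C) cosine similarities between embeddings and class centers."""
    N, C = cos.shape
    rows = np.arange(N)
    logits = cos.copy()
    logits[rows, labels] -= m                   # target-class margin
    # find the k largest non-target cosines (easily confused centers) ...
    masked = cos.copy()
    masked[rows, labels] = -np.inf              # exclude the target class
    topk = np.argpartition(-masked, k, axis=1)[:, :k]
    # ... and push them further away with the extra penalty m'
    logits[rows[:, None], topk] += m_prime
    logits *= s
    # numerically stable cross-entropy over the adjusted logits
    logits -= logits.max(axis=1, keepdims=True)
    logp = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -logp[rows, labels].mean()
```

Setting `m_prime=0` recovers plain AM-Softmax, and any positive `m_prime` strictly increases the loss contribution of the most confusable centres, which is the intended effect of the penalty.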
1.3 Training Protocol
In the first stage, we used the SGD optimizer with a momentum of 0.9 and a weight decay of 1e-3 (4e-4 for online training). We trained all models on 8 GPUs with a mini-batch size of 1,024 and an initial learning rate of 0.08. As described in Section 1.1.1, 200-frame chunks of each sample were used to avoid over-fitting and speed up training. We adopted the ReduceLROnPlateau scheduler, validating every 2,000 iterations with a patience of 2; the minimum learning rate is 1.0e-6 and the decay factor is 0.1. Furthermore, the margin gradually increases from 0 to 0.2.
In the large-margin-based fine-tuning stage, the settings differ slightly from the first stage. Firstly, we removed the speed-augmented data from the training set to avoid domain mismatch, leaving only 5,994 classes. Secondly, we changed the chunk size from 200 to 600 frames and increased the margin exponentially from 0.2 to 0.5, replacing the AM-Softmax loss with AAM-Softmax; the Inter-TopK penalty was removed to keep training stable. Finally, we adopted a smaller fine-tuning learning rate of 8e-5 and a batch size of 256. The learning-rate scheduler is almost the same, while the decay factor became 0.5.
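One way to realise the exponential margin increase from 0.2 to 0.5 is geometric interpolation over the fine-tuning schedule; this helper is a sketch under that assumption, not the exact schedule we used:

```python
def margin_at(step, total_steps, m_start=0.2, m_end=0.5):
    """Geometrically interpolate the margin from m_start to m_end as
    fine-tuning progresses (one plausible 'exponential increase' schedule)."""
    t = min(max(step / total_steps, 0.0), 1.0)   # progress clamped to [0, 1]
    return m_start * (m_end / m_start) ** t
```

A geometric ramp grows the margin slowly at first, which avoids destabilising a model that was trained with the smaller margin.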
After completing the fine-tuning stage, 512-dimensional speaker embeddings were extracted from the fully connected (FC) layer, and length normalization was applied before computing cosine similarity. Moreover, we utilized speaker-wise adaptive score normalization (AS-Norm) and Quality Measure Functions (QMF) [11, 25] to calibrate the scores, which greatly enhanced the performance. For AS-Norm, we selected the original VoxCeleb2-dev dataset without any augmentation; after extracting embeddings, we averaged them speaker-wise, resulting in 5,994 cohorts. Scores were then calibrated by this speaker-wise AS-Norm using the top 400 imposter scores. For QMF, we combined three quality measures: speech duration computed by Kaldi, the imposter mean from AS-Norm, and the magnitude of the non-normalized embeddings. Following IDLAB's approach, we selected 30k trials from the original VoxCeleb2-dev as the QMF training set and trained a logistic regression (LR) model to serve as our QMF.
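The speaker-wise AS-Norm step can be sketched as follows; the cohort is the speaker-averaged embedding matrix described above (we use the top 400 imposters, a smaller `top_n` here for illustration):

```python
import numpy as np

def as_norm(score, enroll_emb, test_emb, cohort, top_n=400):
    """Adaptive symmetric score normalisation (AS-Norm) sketch.
    cohort: (M, D) speaker-wise averaged, length-normalised cohort embeddings.
    The raw cosine score is z-normalised against the top-N imposter scores
    of both the enrolment and the test embedding, then averaged."""
    def top_stats(emb):
        sims = cohort @ emb                 # cosine scores against the cohort
        top = np.sort(sims)[-top_n:]        # keep the N closest imposters
        return top.mean(), top.std()
    me, se = top_stats(enroll_emb)
    mt, st = top_stats(test_emb)
    return 0.5 * ((score - me) / se + (score - mt) / st)
```

Because the two z-norm terms are averaged, the normalised score is symmetric in the enrolment and test sides, as a verification score should be.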
[Table 2: sub-systems grouped by acoustic features — offline fbank 81, offline fbank 96, online fbank 96.]
1.5.1 Baseline System Ablation Study
In this subsection, we present an ablation study on our baseline system: a ResNet-34 backbone followed by MHA pooling and AM-Softmax. Performance was evaluated with the Equal Error Rate (EER) and the minimum Detection Cost Function (minDCF) computed with $C_{miss} = 1$, $C_{fa} = 1$, and $P_{target} = 0.01$ or $0.05$ depending on the trial list. As Table 1 shows, the baseline's performance improved significantly on various trials as we stacked our proposed methods. For convenience, we take the performance on VoxSRC21-val as the benchmark. First, changing the normal AM-Softmax to a 3-subcenter AM-Softmax improved the EER from 3.13% to 2.785% and the minDCF from 0.1686 to 0.1503. Adding the extra Inter-TopK penalty gave 2.58% EER and 0.1433 minDCF. Using MQMHA instead of MHA further reached 2.51% EER and 0.1403 minDCF. These steps already improved the baseline's EER by a relative 19.8% and its minDCF by a relative 16.78%. The domain-based large-margin fine-tuning then drastically improved the system from 2.51% to 1.9933% EER and from 0.1403 to 0.1158 minDCF. Applying the speaker-wise AS-Norm further achieved 1.8367% EER and 0.0996 minDCF, and the final QMF step reached 1.60% EER and 0.0906 minDCF; together, AS-Norm and QMF improved EER by a relative 19.6% and minDCF by a relative 21.76% over the fine-tuned system. In total, the ablation study improved the baseline's EER by a relative 48.9% and its minDCF by a relative 46.26%.
For all our models, we followed the same procedure, and the only variable is our backbone.
1.5.2 Sub-Systems and Fusion Performance
All our sub-systems are described in Table 2. A total of 9 different backbones were used to generate different representations. The offline-trained systems used both 81-dimensional and 96-dimensional acoustic features, while the online-trained systems adopted 96-dimensional features only. Table 3 shows the results achieved by our sub-systems on various trials. We found that larger models such as RepVGG-B1 and ResNet-101 yielded better results than smaller models like our baseline system. However, even bigger models like ResNet-152 and RepVGG-B2 did not bring a performance boost commensurate with their drastically increased parameter counts. It is also worth mentioning that these biggest models showed signs of over-fitting on the VoxCeleb2-dev dataset: once the learning rate dropped below 1e-4, their EER and minDCF degraded. Nevertheless, their performance remained state-of-the-art even when we terminated training at an earlier stage. The 96-dimensional Fbank features were a good complement to the 81-dimensional ones. The online training settings we used are not yet optimal, as we are still exploring this new paradigm; although it shows competitive results, it cannot reach the best results of our large offline models.
Table 4 shows some of our submissions to VoxSRC2021 and the final result of our fusion system. It is worth mentioning that our RepVGG-B1 achieved a 0.1212 minDCF and 2.2410% EER as a single model, while ResNet-152 achieved a 0.1195 minDCF and 2.16% EER. We tuned the fusion weights of all these models based on the results on VoxCeleb1-H and VoxSRC21-val. The final fusion achieved a 0.1034 minDCF and a 1.846% EER in the VoxSRC2021 challenge, a relative improvement of 12.47% in minDCF and 14.54% in EER over our ResNet-152 model.
In this challenge, we introduced a new backbone structure (RepVGG) to speaker verification for the first time. We also proposed MQMHA pooling, the Inter-TopK penalty, and domain-based large-margin fine-tuning. These methods, combined with large backbones, secured first place in track 1 and track 2 of VoxSRC 2021, with a final result of 0.1034 minDCF and 1.846% EER.
This work was supported by SpeakIn Technologies Co., Ltd.
-  J. S. Chung, A. Nagrani, and A. Zisserman, “Voxceleb2: Deep speaker recognition,” arXiv preprint arXiv:1806.05622, 2018.
-  H. Yamamoto, K. A. Lee, K. Okabe, and T. Koshinaka, “Speaker augmentation and bandwidth extension for deep speaker embedding.” in INTERSPEECH, 2019, pp. 406–410.
-  W. Wang, D. Cai, X. Qin, and M. Li, “The dku-dukeece systems for voxceleb speaker recognition challenge 2020,” arXiv preprint arXiv:2010.12731, 2020.
-  D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Hannemann, P. Motlicek, Y. Qian, P. Schwarz et al., “The kaldi speech recognition toolkit,” in IEEE 2011 workshop on automatic speech recognition and understanding. IEEE Signal Processing Society, 2011.
-  D. Snyder, D. Garcia-Romero, G. Sell, D. Povey, and S. Khudanpur, “X-vectors: Robust dnn embeddings for speaker recognition,” in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018, pp. 5329–5333.
-  W. Cai, J. Chen, J. Zhang, and M. Li, “On-the-fly data loader and utterance-level aggregation for speaker and language recognition,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 28, pp. 1038–1051, 2020.
-  M. Ravanelli, T. Parcollet, P. Plantinga, A. Rouhe, S. Cornell, L. Lugosch, C. Subakan, N. Dawalatabad, A. Heba, J. Zhong, J.-C. Chou, S.-L. Yeh, S.-W. Fu, C.-F. Liao, E. Rastorgueva, F. Grondin, W. Aris, H. Na, Y. Gao, R. D. Mori, and Y. Bengio, “Speechbrain: A general-purpose speech toolkit,” 2021.
-  T. Ko, V. Peddinti, D. Povey, M. L. Seltzer, and S. Khudanpur, “A study on data augmentation of reverberant speech for robust speech recognition,” in 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2017, pp. 5220–5224.
-  D. Snyder, G. Chen, and D. Povey, “Musan: A music, speech, and noise corpus,” arXiv preprint arXiv:1510.08484, 2015.
-  A. Nagrani, J. S. Chung, and A. Zisserman, “Voxceleb: a large-scale speaker identification dataset,” arXiv preprint arXiv:1706.08612, 2017.
-  J. Thienpondt, B. Desplanques, and K. Demuynck, “The idlab voxsrc-20 submission: Large margin fine-tuning and quality-aware score calibration in dnn based speaker verification,” in ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2021, pp. 5814–5818.
-  H. Zeinali, S. Wang, A. Silnova, P. Matějka, and O. Plchot, “But system description to voxceleb speaker recognition challenge 2019,” arXiv preprint arXiv:1910.12592, 2019.
-  X. Ding, X. Zhang, N. Ma, J. Han, G. Ding, and J. Sun, “Repvgg: Making vgg-style convnets great again,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 13733–13742.
-  K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.
-  D. Snyder, D. Garcia-Romero, D. Povey, and S. Khudanpur, “Deep neural network embeddings for text-independent speaker verification,” in Interspeech, 2017, pp. 999–1003.
-  K. Okabe, T. Koshinaka, and K. Shinoda, “Attentive statistics pooling for deep speaker embedding,” arXiv preprint arXiv:1803.10963, 2018.
-  Y. Zhu, T. Ko, D. Snyder, B. Mak, and D. Povey, “Self-attentive speaker embeddings for text-independent speaker verification.” in Interspeech, vol. 2018, 2018, pp. 3573–3577.
-  M. India, P. Safari, and J. Hernando, “Self multi-head attention for speaker recognition,” arXiv preprint arXiv:1906.09890, 2019.
-  F. Wang, J. Cheng, W. Liu, and H. Liu, “Additive margin softmax for face verification,” IEEE Signal Processing Letters, vol. 25, no. 7, pp. 926–930, 2018.
-  H. Wang, Y. Wang, Z. Zhou, X. Ji, D. Gong, J. Zhou, Z. Li, and W. Liu, “Cosface: Large margin cosine loss for deep face recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 5265–5274.
-  J. Deng, J. Guo, N. Xue, and S. Zafeiriou, “Arcface: Additive angular margin loss for deep face recognition,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 4690–4699.
-  J. Deng, J. Guo, T. Liu, M. Gong, and S. Zafeiriou, “Sub-center arcface: Boosting face recognition by large-scale noisy web faces,” in European Conference on Computer Vision. Springer, 2020, pp. 741–757.
-  A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga et al., “Pytorch: An imperative style, high-performance deep learning library,” Advances in neural information processing systems, vol. 32, pp. 8026–8037, 2019.
-  Y. Liu, L. He, and J. Liu, “Large margin softmax loss for speaker verification,” arXiv preprint arXiv:1904.03479, 2019.
-  J. Thienpondt, B. Desplanques, and K. Demuynck, “The idlab voxceleb speaker recognition challenge 2020 system description,” arXiv preprint arXiv:2010.12468, 2020.