The SpeakIn System for VoxCeleb Speaker Recognition Challenge 2021

09/05/2021 ∙ by Miao Zhao, et al. ∙ 0

This report describes our submission to Track 1 and Track 2 of the VoxCeleb Speaker Recognition Challenge 2021 (VoxSRC 2021). Both tracks share the same speaker verification system, which uses only VoxCeleb2-dev as its training set. The report covers data augmentation, network structures, domain-based large-margin fine-tuning, and back-end refinement. Our system is a fusion of 9 models and achieved first place in both tracks of VoxSRC 2021. The minDCF of our submission is 0.1034, and the corresponding EER is 1.8460%.




1 System Description

For both Track 1 and Track 2, we adopt the same system settings without any extra data other than VoxCeleb2-dev [1]. This section focuses on the methods we implemented for this challenge.

1.1 Datasets and Data Augmentation

1.1.1 Training Data

The VoxCeleb2-dev dataset contains 1,092,009 utterances from 5,994 speakers. Data augmentation is important when training speaker verification models. We first applied 3-fold speed augmentation [2, 3], treating each speed-perturbed copy as a new speaker and thereby tripling the number of speakers. Each speech segment was perturbed by a factor of 0.9 or 1.1 using the SoX speed function, which yielded 3,276,027 utterances from 17,982 speakers. The traditional Kaldi-based [4, 5] method (offline augmentation) is widely adopted in this field, while recent studies [6, 7] describe augmenting data on the fly (online augmentation). Our system contains both offline-trained and online-trained models, and the two augmentation methods are applied separately in the two training modes:

  • Offline training mode: We used RIRs [8] and MUSAN [9] to create four extra copies of the training utterances, following the data augmentation process of the Kaldi VoxCeleb recipe. After this augmentation, 16,380,135 utterances from 17,982 speakers were available for acoustic feature extraction.

  • Online training mode: Instead of concatenating different types of augmentation [7], we adopted a chain-like augmentation. We predefine an effect chain composed of several augmentations, each with its own probability of being activated. The effect chain is:

    • gain augment with a probability of 0.2

    • white noise augment with a probability of 0.2

    • RIR reverberation and noise addition augment with a probability of 0.6

    • time stretch augment with a probability of 0.2

    It is worth mentioning that the offline 3-fold speed augmentation is also used with online augmentation, so the number of classes is 17,982. Speed augmentation changes the pitch of a speaker, whereas time stretching does not. Both foreground and background noises are added, randomly selected from the MUSAN and RIRs noise sets.
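The chain-like augmentation above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the four probabilities come from the text, while the function names, the gain range, the noise level, and the pass-through placeholders for the RIR/noise and time-stretch effects are assumptions made here so the sketch runs without external data.

```python
import random

# Effect chain from the text: (effect name, activation probability).
EFFECT_CHAIN = [
    ("gain", 0.2),
    ("white_noise", 0.2),
    ("rir_reverb_and_noise", 0.6),
    ("time_stretch", 0.2),
]

def gain(wave, rng):
    # Hypothetical gain range; the report does not specify one.
    factor = rng.uniform(0.5, 1.5)
    return [factor * x for x in wave]

def white_noise(wave, rng):
    # Hypothetical noise level.
    return [x + rng.gauss(0.0, 0.01) for x in wave]

def placeholder(wave, rng):
    # Stand-in for effects that need external data (RIRs, MUSAN) or resampling.
    return list(wave)

EFFECTS = {
    "gain": gain,
    "white_noise": white_noise,
    "rir_reverb_and_noise": placeholder,
    "time_stretch": placeholder,
}

def augment(wave, rng):
    """Walk the chain once; each effect fires independently with its probability."""
    applied = []
    for name, prob in EFFECT_CHAIN:
        if rng.random() < prob:
            wave = EFFECTS[name](wave, rng)
            applied.append(name)
    return wave, applied

rng = random.Random(0)
wave = [0.0, 0.1, -0.1, 0.2]
augmented, applied = augment(wave, rng)
print(applied)
```

If the effects are activated independently, as assumed in this sketch, the chance that a sample passes through the chain untouched is 0.8 x 0.8 x 0.4 x 0.8, or roughly 20%.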

1.1.2 Development Set

To evaluate the performance of our models, we used 5 test sets [1, 10] as our development sets:

  • VoxCeleb1-O: 37,720 trials sampled from the VoxCeleb1 test set, which contains only 40 speakers.

  • VoxCeleb1-E: This is an extended version of VoxCeleb1-O. This set contains 581,480 trials from 1,251 speakers.

  • VoxCeleb1-H: This set has 552,536 trials. It is harder since each pair in this set shares the same nationality and gender.

  • VoxSRC20-dev: This is the validation set of VoxSRC2020, and the trials contain out-of-domain data provided by VoxCeleb_cd. This set has 263,486 trials.

  • VoxSRC21-val: This is the validation set of VoxSRC2021 and has 60,000 trials. Trials in this set contain more multi-lingual data.

1.1.3 Features

In the offline training mode, we extracted both 81-dimensional and 96-dimensional log Mel filter bank energies with Kaldi. The window size is 25 ms, and the frame shift is 10 ms. 200 frames of features were extracted per sample, without additional voice activity detection (VAD). In the online training mode, the speech segments were sliced to 2 seconds and augmented on the fly, and 96-dimensional log Mel filter bank energies were extracted with torchaudio. In both training modes, all features were cepstral mean normalized.
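As a rough illustration of the chunking and normalization steps described here, the sketch below crops a 200-frame chunk from a feature matrix and applies cepstral mean normalization. The random features stand in for real filter bank output, the helper names are hypothetical, and normalizing per chunk (rather than per utterance) is an assumption, since the report does not say which was used.

```python
import random

def random_chunk(frames, num_frames, rng):
    """Pick a random contiguous chunk of num_frames frames (or all frames if shorter)."""
    if len(frames) <= num_frames:
        return list(frames)
    start = rng.randrange(len(frames) - num_frames + 1)
    return frames[start:start + num_frames]

def cepstral_mean_normalize(frames):
    """Subtract the per-dimension mean over time from every frame."""
    dim = len(frames[0])
    means = [sum(f[d] for f in frames) / len(frames) for d in range(dim)]
    return [[f[d] - means[d] for d in range(dim)] for f in frames]

rng = random.Random(0)
# Stand-in for an utterance of 500 frames with 81 log Mel filter bank energies each.
utterance = [[rng.gauss(0.0, 1.0) for _ in range(81)] for _ in range(500)]
chunk = cepstral_mean_normalize(random_chunk(utterance, 200, rng))
print(len(chunk), len(chunk[0]))
```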

1.2 Network Structures

1.2.1 Backbone

Figure 1: Architecture of RepVGG block. Here (a) is the training time state. (b) demonstrates the process of conv-bn fusion. (c) is the inference time state. denotes element-wise addition.

Convolutional Neural Networks [6, 11, 12] have become the mainstream solution in speaker verification tasks. Our backbones include two types of state-of-the-art models:

  • RepVGG [13]

    Recent research proposed a new way to construct ConvNets, called the re-parameterization technique, which decouples the training-time architecture from the inference-time architecture. RepVGG, one of the re-parameterized models, shows competitive performance in computer vision. We introduced the RepVGG architecture to speaker verification for the first time. As Figure 1 shows, during training the RepVGG block has separate 3x3 and 1x1 convolutional layers, each followed by batch normalization, plus an identity branch with only a batch normalization layer. Since a convolution and its batch normalization can be fused into a single convolution, and both the 1x1 convolution and the identity batch normalization can be rewritten as 3x3 convolutions, all branches in this block are equivalent to three 3x3 convolutions. These 3x3 convolutions share the same settings (kernel size, stride, groups, dilation, and so on), so they can be fused into a single 3x3 convolution by simply adding their parameters filter-wise. Once merged into one 3x3 convolution followed by a ReLU layer, the block is the same as a VGG block at inference time. We selected RepVGG-A2, RepVGG-B1, RepVGG-B2g4, and RepVGG-B2 as our backbones. All models use 64 base channels except RepVGG-A2, which uses 96 base channels.

  • ResNet [14]

    As one of the most classical ConvNets, ResNet has proved its power in speaker verification. In our systems, we adopted both the basic-block-based ResNet-34 and deeper bottleneck-block-based ResNets (ResNet-101 and ResNet-152). All of these ResNets use 64 base channels.
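The RepVGG re-parameterization described above can be checked numerically on a toy case. The sketch below uses a single channel and made-up weights and batch-normalization statistics (all hypothetical), folds each branch's batch normalization into a 3x3 kernel and a bias, sums the three branches, and compares the result with the three-branch training-time output. The ReLU that follows the block is omitted because it is identical in both forms.

```python
import math

def conv3x3(x, kernel, bias=0.0):
    """Single-channel 3x3 convolution, stride 1, zero padding 1."""
    h, w = len(x), len(x[0])
    out = []
    for i in range(h):
        row = []
        for j in range(w):
            acc = bias
            for di in range(3):
                for dj in range(3):
                    ii, jj = i + di - 1, j + dj - 1
                    if 0 <= ii < h and 0 <= jj < w:
                        acc += kernel[di][dj] * x[ii][jj]
            row.append(acc)
        out.append(row)
    return out

def batch_norm(x, gamma, beta, mean, var, eps=1e-5):
    """Inference-time batch normalization with fixed statistics."""
    scale = gamma / math.sqrt(var + eps)
    return [[(v - mean) * scale + beta for v in row] for row in x]

def fuse_conv_bn(kernel, gamma, beta, mean, var, eps=1e-5):
    """Fold batch normalization into a bias-free convolution: returns (kernel, bias)."""
    scale = gamma / math.sqrt(var + eps)
    return [[k * scale for k in row] for row in kernel], beta - mean * scale

def as_3x3(weight):
    """Embed a 1x1 weight (or the identity, weight=1) at the center of a 3x3 kernel."""
    return [[0.0, 0.0, 0.0], [0.0, weight, 0.0], [0.0, 0.0, 0.0]]

# Hypothetical toy input, weights, and BN parameters (gamma, beta, mean, var).
x = [[1.0, 2.0, 3.0, 4.0], [0.0, -1.0, 2.0, 1.0], [3.0, 1.0, 0.0, -2.0]]
k3 = [[0.1, -0.2, 0.3], [0.0, 0.5, -0.1], [0.2, 0.1, -0.3]]
k1 = 0.7
bn3, bn1, bn_id = (1.2, 0.1, 0.05, 0.9), (0.8, -0.2, 0.01, 1.1), (1.0, 0.3, 0.2, 0.7)

# Training-time form: three branches, each with its own batch normalization.
y3 = batch_norm(conv3x3(x, k3), *bn3)
y1 = batch_norm([[k1 * v for v in row] for row in x], *bn1)
y_id = batch_norm(x, *bn_id)
train_out = [[a + b + c for a, b, c in zip(r3, r1, rid)]
             for r3, r1, rid in zip(y3, y1, y_id)]

# Inference-time form: one 3x3 convolution with summed kernels and biases.
fused = [fuse_conv_bn(k3, *bn3),
         fuse_conv_bn(as_3x3(k1), *bn1),
         fuse_conv_bn(as_3x3(1.0), *bn_id)]
merged_kernel = [[sum(k[i][j] for k, _ in fused) for j in range(3)] for i in range(3)]
merged_bias = sum(b for _, b in fused)
infer_out = conv3x3(x, merged_kernel, merged_bias)

max_diff = max(abs(a - b) for ra, rb in zip(train_out, infer_out) for a, b in zip(ra, rb))
print(max_diff)
```

The multi-channel case works the same way, with the scale and bias computed per output channel.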

1.2.2 Pooling Method

The pooling layer aggregates the variable-length frame sequence into an utterance-level embedding. The vanilla approach is to calculate the mean and standard deviation along the time axis [15]. However, this is limited by the fact that different frames may contribute unequally. An attention mechanism [16] was therefore introduced to calculate weighted statistics of the backbone outputs. Furthermore, multi-head mechanisms were introduced to increase the diversity of attention, such as multi-head self-attentive (MHSA) pooling [17] and self multi-head attention (MHA) pooling [18]. The main difference between these two methods is how the heads are defined. MHSA attends to the whole feature through different heads, which we call queries, whereas MHA splits the feature into several parts and each head focuses on its corresponding part. We proposed a multi-query multi-head attention pooling mechanism (MQMHA) for the first time by combining both multi-head methods, which helps the model attend to different parts and gather more diversified information. The method is described below:

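Suppose we have a backbone output $O = (o_1, o_2, \dots, o_T)$ with $o_t \in \mathbb{R}^{d}$, and each $o_t$ is split into $H$ parts with $o_t = [o_{t1}, o_{t2}, \dots, o_{tH}]$, where $H$ is the number of attention heads. Each head has $Q$ trainable query vectors $W_{hq}$, where $W_{hq} \in \mathbb{R}^{d/H}$. The attention weight of $o_{th}$ is defined as:

\[ w_{thq} = \frac{\exp(W_{hq}^{\top} o_{th})}{\sum_{i=1}^{T} \exp(W_{hq}^{\top} o_{ih})} \]

And the representation is expressed as:

\[ e_{hq} = \sum_{t=1}^{T} w_{thq} \, o_{th} \]

MQMHA thus combines both MHSA and MHA, of which $H = 1$ and $Q = 1$ are the respective special cases.

Finally, we concatenate all of the sub-representations $e_{hq}$ to get the utterance-level embedding $e$ with $e \in \mathbb{R}^{D}$, where $D = Q \times d$. An extra attentive standard deviation $\sigma$ is computed through the same attention weights and concatenated with $e$ to enhance the performance.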
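A minimal sketch of multi-query multi-head attention pooling is given below. It makes several assumptions that the text does not confirm: attention scores are plain dot products between a query vector and the head's slice of each frame, the attentive standard deviation follows the usual weighted second-moment form, and the function name and toy numbers are hypothetical.

```python
import math

def mqmha_pool(frames, queries):
    """frames: T x d list of frame vectors.
    queries[h][q]: query vector of length d // H for head h, query q.
    Returns the concatenated attentive means followed by attentive stds (length 2 * Q * d)."""
    num_heads = len(queries)
    num_queries = len(queries[0])
    part = len(frames[0]) // num_heads
    means, stds = [], []
    for h in range(num_heads):
        # Slice of every frame that belongs to head h.
        sub = [f[h * part:(h + 1) * part] for f in frames]
        for q in range(num_queries):
            logits = [sum(w * v for w, v in zip(queries[h][q], o)) for o in sub]
            top = max(logits)  # subtract the max for a numerically stable softmax
            exps = [math.exp(l - top) for l in logits]
            total = sum(exps)
            weights = [e / total for e in exps]  # softmax over time
            mean = [sum(a * o[k] for a, o in zip(weights, sub)) for k in range(part)]
            second = [sum(a * o[k] ** 2 for a, o in zip(weights, sub)) for k in range(part)]
            means.extend(mean)
            stds.extend(math.sqrt(max(s - m * m, 1e-12)) for s, m in zip(second, mean))
    return means + stds

# Toy example: T = 3 frames, d = 4, H = 2 heads, Q = 2 queries per head.
frames = [[1.0, 2.0, 0.0, 1.0], [3.0, 0.0, 2.0, 1.0], [2.0, 1.0, 1.0, 4.0]]
queries = [[[0.0, 0.0], [0.5, -0.5]], [[0.0, 0.0], [1.0, 0.2]]]
embedding = mqmha_pool(frames, queries)
print(len(embedding))
```

With a single head the sketch reduces to multi-query attention over the whole feature, and with a single query per head it reduces to split-feature multi-head attention, which matches the two special cases named in the text.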

Figure 2: Comparison between RepVGG and Basic Block.

1.2.3 Loss Function

Recently, margin-based softmax methods have been widely used in speaker recognition. To further improve performance, we strengthened the AM-Softmax [19, 20] and AAM-Softmax [21] loss functions with two methods.

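First, the subcenter method [22] was introduced to reduce the influence of possible noisy samples. The formulation is given by:

\[ \cos(\theta_{i,j}) = \max_{1 \le k \le K} \frac{x_i^{\top} W_{j,k}}{\lVert x_i \rVert \, \lVert W_{j,k} \rVert} \]

where $x_i$ is the embedding of the $i$-th example, $W_{j,k}$ is the $k$-th of the $K$ subcenters of class $j$, and the $\max$ function selects the nearest center, which prevents possible noisy samples from interfering with the dominant class center.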
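The nearest-center selection can be illustrated in a few lines. The sketch assumes cosine similarity between an embedding and each subcenter of a class; the function names and the toy vectors are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def subcenter_cosine(embedding, subcenters):
    """Class similarity = similarity to the nearest of the class's subcenters."""
    return max(cosine(embedding, center) for center in subcenters)

# One class with three subcenters; the second is the nearest to the embedding.
embedding = [1.0, 0.0]
subcenters = [[0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
similarity = subcenter_cosine(embedding, subcenters)
print(similarity)
```

Because only the nearest subcenter contributes, a noisy sample can attach itself to a minor subcenter without pulling the dominant one toward it.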

Secondly, we proposed the Inter-TopK penalty to pay further attention to the centers that obtain high similarities with samples that do not belong to them. It therefore adds an extra penalty on these easily misclassified centers. Given a batch of $N$ examples and $C$ classes, the AM-Softmax loss with the extra Inter-TopK penalty is:

\[ L = -\frac{1}{N} \sum_{i=1}^{N} \log \frac{e^{s(\cos\theta_{i,y_i} - m)}}{e^{s(\cos\theta_{i,y_i} - m)} + \sum_{j=1, j \ne y_i}^{C} e^{s \, \phi(\theta_{i,j})}} \]

where $m$ is the original margin of AM-Softmax and $s$ is the scale factor. We use $\phi(\theta_{i,j})$ to replace $\cos\theta_{i,j}$ in the denominator:

\[ \phi(\theta_{i,j}) = \begin{cases} \cos\theta_{i,j} + m' & \text{if } j \in \operatorname{arg\,topK}_{1 \le n \le C,\ n \ne y_i} (\cos\theta_{i,n}) \\ \cos\theta_{i,j} & \text{otherwise} \end{cases} \]

where $m'$ is an extra penalty that focuses on the $K$ closest centers to the example $x_i$ (here $K$ is the number of selected centers, not the number of subcenters). When $m' = 0$, this reduces to the original AM-Softmax. Similarly, the Inter-TopK penalty can also be added to the AAM-Softmax loss function by replacing $\cos\theta_{i,j} + m'$ with $\cos(\theta_{i,j} - m')$.

Table 1: Ablation Study on Our Baseline System. + here denotes stacking our methods.

Methods           VoxCeleb1-O       VoxCeleb1-E       VoxCeleb1-H       VoxSRC20-dev      VoxSRC21-val
                  EER(%)  minDCF    EER(%)  minDCF    EER(%)  minDCF    EER(%)  minDCF    EER(%)  minDCF
ResNet-34-64      1.0660  0.0876    1.0440  0.0971    1.7660  0.1561    2.7580  0.1357    3.1300  0.1686
+ K-Subcenter     0.9756  0.0840    1.0270  0.0930    1.7020  0.1467    2.6450  0.1321    2.7850  0.1503
++ Inter-TopK     0.9730  0.0912    1.0170  0.0892    1.6860  0.1415    2.5760  0.1297    2.5800  0.1433
+++ MQMHA         0.9305  0.0738    0.9809  0.0879    1.6020  0.1373    2.5070  0.1246    2.5100  0.1403
++++ Fine-tuning  0.6654  0.0573    0.8243  0.0725    1.3532  0.1154    2.1584  0.1054    1.9933  0.1158
+++++ AS-Norm     0.5594  0.0480    0.7632  0.0624    1.2122  0.0971    1.9120  0.0935    1.8367  0.0996
++++++ QMF        0.5249  0.0498    0.7130  0.0627    1.1240  0.0923    1.8330  0.0867    1.6020  0.0906
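The effect of the Inter-TopK penalty on a single example can be sketched as below. The code assumes the AM-Softmax variant: the target logit is the scaled cosine minus the margin, and the k most similar non-target centers have an extra penalty added to their cosines before scaling. The function name, the default scale, the penalty value, and the toy cosines are all hypothetical; only the margin of 0.2 appears in the report's training protocol.

```python
import math

def inter_topk_am_softmax(cosines, target, scale=32.0, margin=0.2, penalty=0.06, k=5):
    """Loss for one example; cosines[j] is the cosine to the center of class j."""
    others = [j for j in range(len(cosines)) if j != target]
    # The k non-target centers that are most similar to the example.
    hardest = set(sorted(others, key=lambda j: cosines[j], reverse=True)[:k])
    logits = []
    for j, c in enumerate(cosines):
        if j == target:
            logits.append(scale * (c - margin))
        elif j in hardest:
            logits.append(scale * (c + penalty))
        else:
            logits.append(scale * c)
    top = max(logits)  # log-sum-exp with the max subtracted for stability
    log_z = top + math.log(sum(math.exp(l - top) for l in logits))
    return log_z - logits[target]

cosines = [0.7, 0.5, 0.1, 0.4, -0.2]
plain = inter_topk_am_softmax(cosines, target=0, k=0)       # no penalized centers
penalized = inter_topk_am_softmax(cosines, target=0, k=2)   # penalize classes 1 and 3
print(plain, penalized)
```

Setting the penalty to zero, or selecting no centers, recovers the plain AM-Softmax loss, while a positive penalty enlarges the denominator and therefore the loss on examples that sit close to a wrong center.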

1.3 Training Protocol

We used PyTorch [23] to conduct our experiments. All of our models were trained in two stages.

In the first stage, we used the SGD optimizer with a momentum of 0.9 and a weight decay of 1e-3 (4e-4 for online training). All models were trained on 8 GPUs with a mini-batch size of 1,024 and an initial learning rate of 0.08. As described in Section 1.1.3, each sample in a batch consisted of 200 frames, to avoid over-fitting and speed up training. We adopted the ReduceLROnPlateau scheduler, validating every 2,000 iterations with a patience of 2. The minimum learning rate is 1.0e-6, and the decay factor is 0.1. Furthermore, the margin gradually increases from 0 to 0.2 [24].
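The plateau-based learning rate schedule can be mimicked with a few lines of plain Python. The class below is a hypothetical stand-in written for illustration, not PyTorch's scheduler; it assumes the learning rate is cut once the validation metric has failed to improve for more than `patience` consecutive checks, and it uses the initial rate, decay factor, patience, and floor quoted in the text.

```python
class PlateauScheduler:
    """Minimal reduce-on-plateau rule for a metric that should decrease."""

    def __init__(self, lr=0.08, factor=0.1, patience=2, min_lr=1e-6):
        self.lr = lr
        self.factor = factor
        self.patience = patience
        self.min_lr = min_lr
        self.best = float("inf")
        self.bad_checks = 0

    def step(self, metric):
        """Call once per validation (every 2,000 iterations in the text)."""
        if metric < self.best:
            self.best = metric
            self.bad_checks = 0
        else:
            self.bad_checks += 1
        if self.bad_checks > self.patience:
            self.lr = max(self.lr * self.factor, self.min_lr)
            self.bad_checks = 0
        return self.lr

scheduler = PlateauScheduler()
history = [scheduler.step(loss) for loss in [1.00, 0.90, 0.95, 0.93, 0.92]]
print(history)
```

In this toy run the third consecutive check without improvement triggers the first decay, and repeated decays stop at the minimum learning rate.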

In the large-margin fine-tuning stage [25], the settings differ slightly from the first stage. First, we removed the speed-augmented part from the training set to avoid domain mismatch, leaving only 5,994 classes. Second, we changed the chunk length from 200 to 600 frames and increased the margin exponentially from 0.2 to 0.5. The AM-Softmax loss was replaced by the AAM-Softmax loss, and the Inter-TopK penalty was removed to keep training stable. Finally, we adopted a smaller fine-tuning learning rate of 8e-5 and a batch size of 256. The learning rate scheduler is almost the same, except that the decay factor became 0.5.
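One possible reading of the two margin schedules is sketched below. The report gives only the endpoints and says that the first-stage margin grows "gradually" and the fine-tuning margin grows "exponentially", so the linear ramp, the geometric interpolation, the step counts, and the function names are all assumptions.

```python
def linear_margin(step, total_steps, start=0.0, end=0.2):
    """First-stage ramp, assumed linear here."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return start + (end - start) * frac

def exponential_margin(step, total_steps, start=0.2, end=0.5):
    """Fine-tuning ramp, assumed to interpolate geometrically between the endpoints."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return start * (end / start) ** frac

# Hypothetical schedule length of 100 steps.
first_stage = [linear_margin(s, 100) for s in (0, 50, 100)]
fine_tuning = [exponential_margin(s, 100) for s in (0, 50, 100)]
print(first_stage, fine_tuning)
```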

1.4 Back-end

After completing the fine-tuning stage, 512-dimensional speaker embeddings were extracted from the fully connected (FC) layer, and length normalization was applied before computing cosine similarity. Moreover, we used speaker-wise adaptive score normalization (AS-Norm) [3] and Quality Measure Functions (QMF) [11, 25] to calibrate the scores, and these methods greatly enhanced the performance. For AS-Norm, we selected the original VoxCeleb2-dev dataset without any augmentation. After extraction, the embeddings were averaged speaker-wise, which resulted in 5,994 cohorts. Scores were then calibrated by this speaker-wise AS-Norm using the top 400 imposter scores. For QMF, we combined three qualities: speech duration computed by Kaldi, imposter mean based on AS-Norm, and magnitude of the non-normalized embeddings. Following IDLAB's approach [11], we also selected 30k trials from the original VoxCeleb2-dev as the training set of QMF. A logistic regression (LR) model was then trained to serve as our QMF model.
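For orientation, a commonly used symmetric form of adaptive score normalization is sketched below: the raw score is standardized once with the statistics of the enrollment side's top cohort scores and once with those of the test side, and the two results are averaged. Whether the report uses exactly this form is an assumption; the function name and the toy cohort scores are hypothetical, and `top_n` would be 400 in the setting described above.

```python
import statistics

def as_norm(score, enroll_cohort_scores, test_cohort_scores, top_n=400):
    """Symmetric adaptive score normalization with top-N cohort selection."""
    def top_stats(scores):
        top = sorted(scores, reverse=True)[:top_n]
        return statistics.fmean(top), statistics.pstdev(top)

    mean_e, std_e = top_stats(enroll_cohort_scores)
    mean_t, std_t = top_stats(test_cohort_scores)
    return 0.5 * ((score - mean_e) / std_e + (score - mean_t) / std_t)

# Toy cohort scores (cosine of each side against every cohort embedding).
enroll_cohort = [0.1, 0.3, 0.5, -0.2]
test_cohort = [0.2, 0.0, 0.4, 0.6]
normalized = as_norm(0.7, enroll_cohort, test_cohort, top_n=2)
print(normalized)
```

The mean of the selected cohort scores is also the kind of quantity the text refers to as the imposter mean used as a QMF quality.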

Index   Backbone
Offline, fbank 81
S1      ResNet-34-64
S2      ResNet-101-64
S3      ResNet-152-64
S4      RepVGG-A2-96
S5      RepVGG-B1-64
S6      RepVGG-B2g4-64
S7      RepVGG-B2-64
Offline, fbank 96
S8      RepVGG-B2g4-64
Online, fbank 96
S9      RepVGG-B2g4-64
Table 2: Sub-System Structures.
System Index  VoxCeleb1-O       VoxCeleb1-E       VoxCeleb1-H       VoxSRC20-dev      VoxSRC21-val
              EER(%)  minDCF    EER(%)  minDCF    EER(%)  minDCF    EER(%)  minDCF    EER(%)  minDCF
S1            0.5249  0.0498    0.7130  0.0627    1.1240  0.0923    1.8330  0.0867    1.6020  0.0906
S2            0.5037  0.0356    0.6435  0.0514    0.9737  0.0783    1.5760  0.0753    1.3350  0.0685
S3            0.4613  0.0232    0.6342  0.0477    0.9932  0.0763    1.4770  0.0726    1.4550  0.0813
S4            0.5673  0.0309    0.6759  0.0550    1.0360  0.0830    1.5860  0.0797    1.4620  0.0776
S5            0.4401  0.0253    0.6518  0.0494    0.9914  0.0738    1.4960  0.0691    1.3610  0.0628
S6            0.4825  0.0374    0.6707  0.0508    1.0270  0.0783    1.5160  0.0725    1.4050  0.0730
S7            0.4825  0.0283    0.6511  0.0484    0.9965  0.0738    1.4910  0.0699    1.4180  0.0660
S8            0.5090  0.0340    0.6587  0.0489    0.9954  0.0707    1.4940  0.0699    1.4180  0.0698
S9            0.5673  0.0461    0.6961  0.0584    1.0910  0.0856    1.7040  0.0845    1.6420  0.0942
S1-S9         0.4189  0.0217    0.5826  0.0414    0.8868  0.0630    1.3400  0.0624    1.2710  0.0590
Table 3: Results on Development Sets. S1-S9 denotes the fusion of all nine sub-systems.

1.5 Results

1.5.1 Baseline System Ablation Study

In this subsection, we present the ablation study on our baseline system. The baseline system is a ResNet-34 backbone followed by MHA pooling and AM-Softmax. Performance was evaluated using the Equal Error Rate (EER) and the minimum Detection Cost Function (minDCF); the minDCF was calculated from the cost parameters $C_{miss}$ and $C_{fa}$ and a target prior $P_{target}$ whose value depends on the trial list. As Table 1 shows, the baseline system's performance improved significantly on the various trial lists as we stacked our proposed methods. For convenience, we took the performance on VoxSRC21-val as our benchmark. First, we replaced the normal AM-Softmax with the 3-subcenter AM-Softmax: the EER improved from 3.13% to 2.785%, and the minDCF improved from 0.1686 to 0.1503. Adding the Inter-TopK extra penalty brought the EER to 2.58% and the minDCF to 0.1433. Using MQMHA instead of MHA further reduced the EER to 2.51% and the minDCF to 0.1403. Together, these steps improved the baseline system's EER by 19.8% relative and its minDCF by 16.78% relative. The domain-based large-margin fine-tuning then improved the EER drastically from 2.51% to 1.9933%, and the minDCF from 0.1403 to 0.1158. Applying the speaker-wise AS-Norm further achieved 1.8367% EER and 0.0996 minDCF. The final QMF process reached 1.60% EER and 0.0906 minDCF. Compared to the fine-tuned system, AS-Norm and QMF together improved the EER by 19.6% relative and the minDCF by 21.76% relative. Over the whole ablation study, our baseline system improved by 48.9% relative in EER and 46.26% relative in minDCF.
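The relative improvements quoted in this subsection can be reproduced from the VoxSRC21-val columns of Table 1. The short check below recomputes them; the dictionary keys and function names are chosen here for illustration, and the results agree with the figures in the text to within about 0.1 percentage point of rounding.

```python
def relative_improvement(before, after):
    """Relative reduction in percent."""
    return 100.0 * (before - after) / before

# (EER %, minDCF) on VoxSRC21-val, taken from Table 1.
stages = {
    "baseline": (3.1300, 0.1686),
    "mqmha": (2.5100, 0.1403),
    "finetuned": (1.9933, 0.1158),
    "final": (1.6020, 0.0906),
}

def compare(start, end):
    (eer_a, dcf_a), (eer_b, dcf_b) = stages[start], stages[end]
    return relative_improvement(eer_a, eer_b), relative_improvement(dcf_a, dcf_b)

for start, end in [("baseline", "mqmha"), ("finetuned", "final"), ("baseline", "final")]:
    eer_gain, dcf_gain = compare(start, end)
    print(f"{start} -> {end}: EER {eer_gain:.2f}%, minDCF {dcf_gain:.2f}%")
```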

For all our models, we followed the same procedure, and the only variable is our backbone.

1.5.2 Sub-Systems and Fusion Performance

All our sub-systems are described in Table 2. A total of 9 sub-systems, built on 7 distinct backbones, were used to generate different representations. The offline-trained systems used both 81-dimensional and 96-dimensional acoustic features, while the online-trained system used 96-dimensional features only. Table 3 shows the results achieved by our sub-systems on the various trial lists. We found that larger models, such as RepVGG-B1 and ResNet-101, yielded better results than smaller models like our baseline system. However, even bigger models such as ResNet-152 and RepVGG-B2 did not bring a performance boost in proportion to their drastically increased parameter counts. It is also worth mentioning that these even bigger models showed signs of over-fitting on the VoxCeleb2-dev dataset: once the learning rate dropped below 1e-4, the EER and minDCF of these large systems degraded. Nevertheless, their performance remained state of the art even when we terminated training at an earlier stage. The 96-dimensional Fbank features were good complements to the 81-dimensional Fbank features. The online system setup we used is not the optimal choice, as we are still researching this new training paradigm. Although it shows competitive results, it does not match the best results of our large offline models.

Table 4 shows some of our submissions to VoxSRC 2021 and the final result of our fusion system. It is worth mentioning that, with only a single model, our RepVGG-B1 achieved a minDCF of 0.1212 and an EER of 2.2410%, while ResNet-152 achieved a minDCF of 0.1195 and an EER of 2.16%. We tuned the fusion weights of all these models based on the results on VoxCeleb1-H and VoxSRC21-val. The final fusion resulted in a minDCF of 0.1034 and an EER of 1.846% in the VoxSRC 2021 challenge. Compared to our ResNet-152 model, the fusion improved minDCF by 13.47% relative and EER by 14.54% relative.
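Score-level fusion and its evaluation can be illustrated with a toy example. The sketch assumes that fusion is a weighted average of the per-system scores, which the report does not spell out, and it uses a simple threshold sweep to approximate the EER. The scores, labels, weights, and function names are hypothetical.

```python
def fuse_scores(system_scores, weights):
    """system_scores[i][n] is the score of system i on trial n; returns one fused score per trial."""
    total = sum(weights)
    num_trials = len(system_scores[0])
    return [sum(w * s[n] for w, s in zip(weights, system_scores)) / total
            for n in range(num_trials)]

def equal_error_rate(scores, labels):
    """Approximate EER: sweep thresholds over the observed scores and take the point
    where false-accept and false-reject rates are closest."""
    num_target = sum(labels)
    num_nontarget = len(labels) - num_target
    best_gap, eer = float("inf"), None
    for threshold in sorted(set(scores)):
        frr = sum(1 for s, l in zip(scores, labels) if l == 1 and s < threshold) / num_target
        far = sum(1 for s, l in zip(scores, labels) if l == 0 and s >= threshold) / num_nontarget
        if abs(far - frr) < best_gap:
            best_gap, eer = abs(far - frr), (far + frr) / 2
    return eer

labels = [1, 1, 1, 0, 0, 0]                     # 1 = same speaker, 0 = different speakers
system_a = [0.9, 0.4, 0.8, 0.5, 0.1, 0.2]       # confuses the second and fourth trials
system_b = [0.3, 0.9, 0.7, 0.2, 0.6, 0.1]       # confuses the first and fifth trials
fused = fuse_scores([system_a, system_b], weights=[1.0, 1.0])

eer_a = equal_error_rate(system_a, labels)
eer_b = equal_error_rate(system_b, labels)
eer_fused = equal_error_rate(fused, labels)
print(eer_a, eer_b, eer_fused)
```

In this toy case each system makes a different mistake, so the fused scores separate the two classes even though neither system does on its own, which is the kind of complementarity that fusion weights tuned on development sets try to exploit.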

System Index  EER(%)  minDCF
S1            2.8890  0.1700
S2            -       -
S3            2.1690  0.1195
S4            -       -
S5            2.2410  0.1212
S6            -       -
S7            -       -
S8            -       -
S9            -       -
S1-S9         1.8460  0.1034
Table 4: Our Submissions to VoxSRC21-test. S1-S9 denotes the fusion of all nine sub-systems.

2 Conclusions

In this challenge, we introduced a new backbone structure (RepVGG) to speaker verification for the first time. We also proposed MQMHA, the Inter-TopK loss, and domain-based large-margin fine-tuning. These methods, together with the large backbones, secured our first place in Track 1 and Track 2 of VoxSRC 2021. The final result of our system was 0.1034 minDCF and 1.846% EER.

3 Acknowledgements

This work is supported by SpeakIn Technologies Co. Ltd.