Beijing ZKJ-NPU Speaker Verification System for VoxCeleb Speaker Recognition Challenge 2021

09/08/2021 ∙ by Li Zhang, et al.

In this report, we describe the Beijing ZKJ-NPU team submission to the VoxCeleb Speaker Recognition Challenge 2021 (VoxSRC-21). We participated in the fully supervised speaker verification tracks 1 and 2. In the challenge, we explored various advanced neural network structures with different pooling layers and objective loss functions. In addition, we introduced the ResNet-DTCF, CoAtNet and PyConv networks to advance the performance of CNN-based speaker embedding models. Moreover, we applied embedding normalization and score normalization at the evaluation stage. By fusing 11 and 14 systems, our final best performances (minDCF/EER) on the evaluation trials are 0.1205/2.8160% on track 1 and 0.1175/2.8400% on track 2. With our submission, we came in second place in the challenge for both tracks.

1 Introduction

As in previous years [chung2019voxsrc, nagrani2020voxsrc], VoxSRC-21 has four challenge tracks: fully supervised speaker verification with the VoxCeleb 2 development set (track 1), fully supervised speaker verification with an arbitrary training set (track 2), self-supervised speaker verification with the VoxCeleb 2 development set (track 3) and speaker diarisation (track 4). Our team worked on track 1 and track 2, the fully supervised tasks, with a particular focus on multi-lingual data. In total, we trained 15 systems based on multiple variants of TDNN [snyder2018x] and ResNet [he2016deep]. Moreover, we explored the large margin fine-tune strategy, embedding normalization and score normalization to further improve the results.

2 Data

2.1 Training and Evaluation Datasets

In this report, the VoxCeleb 2 development set is the only data set used to train all the models for both tracks, although other data sets are allowed for track 2. We evaluated all our single models on the VoxSRC-21 development trials (VoxSRC21-dev). Only the fused systems were also evaluated on the VoxSRC-21 test trials (VoxSRC21-eval), because the number of score uploads is limited.

2.2 Data Augmentation

Online data augmentation [cai2020fly] is performed for all speaker embedding models. Specifically, frequency-domain specAug, additive noise augmentation and reverberation augmentation are adopted for all models, and some models further go through the speaker augmentation step.

  • Frequency-Domain SpecAug: We apply time and frequency masking as well as time warping to the input spectrum (frequency-domain implementation) [park19e_interspeech].

  • Additive Noise: We add noise, music and babble from MUSAN [snyder2015musan] to the original speech of VoxCeleb 2.

  • Reverberation: The original speech in VoxCeleb 2 is convolved with room impulse responses (RIRs) [snyder2018x]; the RIRs are from [habets2006room].

  • Speaker-Aug: Speaker-Aug has proved successful in speech and speaker recognition tasks [yamamoto2019speaker]. We use the SoX toolkit to perturb the speech speed by a factor of 0.9 or 1.1. The generated utterances are treated as coming from new speakers. As a result, the total number of speakers after speaker-aug becomes 5,994 × 3 = 17,982.
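The speaker count after speaker augmentation can be checked with a short sketch. It assumes that each speed-perturbed copy of a speaker gets its own class index by offsetting the original speaker ID; the function name `augmented_speaker_label` and this particular index layout are illustrative assumptions, not details given in this report.

```python
NUM_SPEAKERS = 5994               # speakers in the VoxCeleb 2 development set
SPEED_FACTORS = (1.0, 0.9, 1.1)   # original speed plus the two perturbed versions


def augmented_speaker_label(speaker_id: int, speed: float) -> int:
    """Map (original speaker, speed factor) to a distinct class index.

    Hypothetical layout: block 0 holds the original speakers, block 1 the
    0.9x copies and block 2 the 1.1x copies.
    """
    if not 0 <= speaker_id < NUM_SPEAKERS:
        raise ValueError("speaker_id out of range")
    return SPEED_FACTORS.index(speed) * NUM_SPEAKERS + speaker_id


total_classes = NUM_SPEAKERS * len(SPEED_FACTORS)
print(total_classes)  # 17982
```

With this layout the classifier's output layer has 17,982 classes, matching the count stated above.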

3 Feature Extraction

In this challenge, we train the speaker embedding models with Mel Frequency Cepstral Coefficients (MFCCs) and Mel filter bank (Fbank) features. Specifically, E-TDNN uses 60-dimensional MFCCs as input features. The other models use 80- or 96-dimensional Fbank features instead.

4 Deep Speaker Embedding Models

In total, we trained 15 advanced speaker embedding models, all of which are variants of x-vector [snyder2018x] and ResNet [he2016deep]. We introduce them in detail below.

4.1 E-TDNN

The extended TDNN architecture (E-TDNN) [snyder2019speaker] has a slightly wider temporal context than the original x-vector model [snyder2018x] and interleaves dense layers between the convolutional layers. We adopt the same structure as [snyder2019speaker]. It comprises 4 blocks of 1D dilated convolutions plus affine layers. The first layer uses a kernel of size 5 and the other ones use a kernel of size 3. The dilation factors are 1, 2, 3 and 4 respectively. The embedding is extracted from the output of an affine layer that follows the statistics (mean + standard deviation) pooling layer [garcia2020jhu]. This model is trained with the cross entropy loss.
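As a worked example, the temporal context implied by these kernel sizes and dilation factors can be computed directly. The sketch assumes stride-1 convolutions and that the interleaved affine layers act on single frames and so add no context; the helper name `temporal_context` is hypothetical.

```python
def temporal_context(kernel_sizes, dilations):
    """Number of input frames seen by one output frame of stacked,
    stride-1 dilated 1D convolutions."""
    context = 1
    for kernel, dilation in zip(kernel_sizes, dilations):
        # each layer widens the context by (kernel - 1) * dilation frames
        context += (kernel - 1) * dilation
    return context


# kernel sizes 5, 3, 3, 3 with dilation factors 1, 2, 3, 4
print(temporal_context([5, 3, 3, 3], [1, 2, 3, 4]))  # 23
```

Under these assumptions, each frame-level output before pooling covers 1 + 4 + 4 + 6 + 8 = 23 input frames.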

4.2 ECAPA-TDNN

ECAPA-TDNN is known as one of the state-of-the-art speaker verification models [desplanques20_interspeech]. We train two ECAPA-TDNN models, configured with 1024 and 2048 channels respectively [thienpondt2021idlab]. The pooling layer of ECAPA-TDNN (1024) is attentive statistics pooling [desplanques20_interspeech].

4.3 ResNet

We train 7 ResNet34-related models with similar structures but different attention modules. They have {64, 128, 256, 512} or {32, 64, 128, 256} channels in the residual blocks and use multi-head attention statistics pooling. Depending on the model, the loss function is additive angular margin softmax (AAM-softmax), subcenter additive angular margin softmax (SC-AAM-softmax) or circle loss.

4.4 ResNet-SE

ResNet with squeeze and excitation attention (ResNet-SE) has recently achieved good performance in speaker verification [thienpondt2021idlab, zhang21g_interspeech]. In this work, we adopt ResNet34-SE with 256 channels of SE attention modules. We use two ResNet-SE models of different sizes. The channel configurations of their residual blocks are {32, 64, 128, 256} and {64, 128, 256, 512} respectively [heo2020clova]. The loss functions of ResNet34-SE (256) and ResNet34-SE (512) are AAM-softmax and SC-AAM-softmax respectively.

4.5 ResNet-BAM

Our recent study has shown that adding bottleneck attention modules (BAM) to ResNet improves speaker verification performance [zhang20ca_interspeech]. BAM is able to emphasize important elements in the 3D feature map generated by convolution. For the speech signal specifically, the 3D feature map has a channel dimension (the number of convolution filters), a time dimension and a frequency dimension. We adopt the ResNet-BAM model in our work, and the channels of the residual blocks are {64, 128, 256, 512}.

4.6 SE-Res2Net

In order to capture speaker characteristics at different levels, we use the Res2Net [2019Res2Net] network with the SE module. The channel configurations of each residual block are {32, 64, 128, 256}. In addition, we replace the last two fully connected (FC) layers of Res2Net with a TDNN layer to better model the context information.

4.7 D-TDNN-SE

We also adopt D-TDNN-SE, which uses dense connectivity [yu2020densely] and replaces the two-dimensional convolutional layers with feed-forward and TDNN layers.

4.8 ResNet-TDNN

We replace the last linear layer of ResNet with a 4-layer TDNN in order to capture the relationships of different frames, resulting in the ResNet-TDNN model. The number of channels in ResNet is still {64, 128, 256, 512}.

4.9 CNN-ECAPA

The hybrid CNN-ECAPA [thienpondt21_interspeech] structure was recently proposed to combine the advantages of convolutional neural networks and ECAPA-TDNN in a unified structure. Specifically, in our CNN-ECAPA, the number of channels in the convolutional stem is set to 128. The output feature map of the convolutional stem is subsequently flattened in the channel and frequency dimensions and used as input for the regular ECAPA-TDNN network.
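The flattening step can be illustrated with a minimal sketch on nested lists. It assumes the stem output is laid out as [channel][frequency][time]; the frequency and time sizes below are placeholders, and `flatten_channel_frequency` is a hypothetical helper rather than code from this work.

```python
def flatten_channel_frequency(feature_map):
    """Reshape a [channel][frequency][time] nested list into
    [channel * frequency][time], keeping the time axis untouched."""
    return [freq_row for channel in feature_map for freq_row in channel]


C, F, T = 128, 10, 50  # 128 stem channels; F and T are illustrative only
stem_out = [[[0.0] * T for _ in range(F)] for _ in range(C)]
tdnn_in = flatten_channel_frequency(stem_out)
print(len(tdnn_in), len(tdnn_in[0]))  # 1280 50
```

Each time step then carries a single C × F dimensional vector, which is the frame-level input format a 1D (TDNN-style) network expects.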

4.10 ResNet-DTCF

We propose the Duality-Temporal-Channel-Frequency (DTCF) attention for ResNet34, named ResNet-DTCF, to boost the representation extraction capability of CNNs in speaker verification. Unlike squeeze and excitation (SE) attention, which learns channel attention after averaging over the time and frequency dimensions simultaneously, the DTCF attention module first re-calibrates the channel-wise features by aggregating global context information along the temporal and frequency dimensions, and then applies dual channel-wise attention that preserves temporal and frequency information respectively. The DTCF attention module thus encodes temporal and frequency information into the channel-wise attention masks, avoiding the loss of global context information in the temporal and frequency dimensions. Details of this model can be found in our recent submission [zhangdct] to ASRU2021. The channels of the residual blocks in ResNet-DTCF are {32, 64, 128, 256}.

4.11 ResNet-Pyramid

Pyramidal convolution (PyConv) contains a pyramid of kernels, where each level involves different types of filters with varying size and depth. This design aims to capture different levels of detail in speech utterances [duta2020pyramidal]. Thus, in this work, we embed PyConv into ResNet50 to let the model capture long contextual information. The configuration of ResNet-Pyramid is the same as PyConvResNet-50 in [duta2020pyramidal].

4.12 CoAtNet

Transformer is a popular sequence-to-sequence model, originally used for neural machine translation (NMT) [vaswani2017attention] and later introduced to speech recognition [gulati20_interspeech]. The transformer family has recently become popular in computer vision (CV) as well. We borrow the new CoAtNet [dai2021coatnet] structure from CV and use it for speaker verification. The original transformers, which rely heavily on self-attention layers to model global dependencies, have larger model capacity, but their generalization can be worse than that of convolutional networks because they lack the right inductive bias. The design of CoAtNet [dai2021coatnet] specifically mitigates this problem by combining the advantages of both structures. We explore the performance of CoAtNet in this challenge. The pooling layer of CoAtNet in our model is average pooling and the loss function is cross entropy.

5 Pooling Layer

The pooling layer in speaker verification aggregates the frame-level speaker embeddings into an utterance-level speaker embedding for scoring. In this challenge, we adopt 5 pooling strategies: temporal average pooling (TAP) [yu2014mixed], statistics pooling (SP) [snyder2018x], self-attention pooling (SAP) [cai18_odyssey], attentive statistics pooling (ASP) [okabe18_interspeech] and multi-head attention pooling (MHAP) [india19_interspeech]. The pooling layer used by each model is listed in Table 1.
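To make the difference between these strategies concrete, here is a minimal sketch of TAP, SP and ASP on plain Python lists. It is a simplification under stated assumptions: the per-frame attention scores are passed in directly (in a real model they come from a small learned network), the standard deviation is the population form, and all function names are illustrative.

```python
import math


def temporal_average_pooling(frames):
    """TAP: mean of the frame-level vectors."""
    n = len(frames)
    return [sum(col) / n for col in zip(*frames)]


def attentive_statistics_pooling(frames, scores):
    """ASP: softmax-weighted mean and standard deviation, concatenated.

    `scores` holds one attention score per frame (assumed given here).
    """
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    mean = [sum(w * x for w, x in zip(weights, col)) for col in zip(*frames)]
    second = [sum(w * x * x for w, x in zip(weights, col)) for col in zip(*frames)]
    # clip tiny negative values caused by rounding before the square root
    std = [math.sqrt(max(m2 - m * m, 0.0)) for m2, m in zip(second, mean)]
    return mean + std


def statistics_pooling(frames):
    """SP: mean and standard deviation with uniform frame weights."""
    return attentive_statistics_pooling(frames, [0.0] * len(frames))


frames = [[1.0, 2.0], [3.0, 6.0]]          # 2 frames, 2 dimensions
print(temporal_average_pooling(frames))     # [2.0, 4.0]
print(statistics_pooling(frames))           # [2.0, 4.0, 1.0, 2.0]
```

SAP corresponds to keeping only the weighted mean, and MHAP applies several such attention heads and concatenates their outputs.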

6 Objective Loss Function

Softmax is the most commonly used loss function in classification tasks. Compared with the softmax loss, the additive angular margin loss (AAM-Softmax) [liu19f_interspeech] is more popular in speaker verification, because increasing inter-speaker distances and ensuring intra-speaker compactness are both important. In this work, we adopt softmax and AAM-Softmax in different models; the margin and scale settings used for AAM-Softmax are given in Section 7. Moreover, we also train some models with subcenter AAM-softmax (SC-AAM-softmax) [deng2020sub] and circle loss [sun2020circle]. We introduce these two losses below. The loss used by each model is listed in Table 1.
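A minimal sketch of AAM-Softmax for a single training example is shown below. It assumes the cosine similarities between the embedding and every class weight are already computed, and it omits the safeguards that practical implementations add when the angle plus margin exceeds π. The function name and the example values are illustrative.

```python
import math


def aam_softmax_loss(cosines, target, margin, scale):
    """Cross entropy over scaled cosines, with an additive angular margin
    applied to the target class only."""
    theta = math.acos(max(-1.0, min(1.0, cosines[target])))
    logits = [scale * c for c in cosines]
    logits[target] = scale * math.cos(theta + margin)
    peak = max(logits)  # subtract the peak for a stable log-sum-exp
    log_sum = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_sum - logits[target]


cosines = [0.8, 0.1, -0.3]  # illustrative similarities to 3 classes
with_margin = aam_softmax_loss(cosines, target=0, margin=0.2, scale=30.0)
no_margin = aam_softmax_loss(cosines, target=0, margin=0.0, scale=30.0)
print(with_margin > no_margin)  # True
```

The margin makes the target class harder to satisfy during training, which pushes embeddings of the same speaker closer together and embeddings of different speakers further apart.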

6.1 SC-AAM-Softmax

Subcenter additive angular margin loss (SC-AAM-Softmax) [deng2020sub] adds sub-centers for each class to reduce the possible impact of label noise in the training data and improve model robustness. In this challenge, we first set the number of sub-centers to 3 and then change it to 1 once the model achieves a good result. At that point, SC-AAM-Softmax reduces to ordinary AAM-Softmax.
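The sub-center step itself is small: for each class, the similarity to the closest sub-center is kept, and the usual angular-margin softmax is then applied to those values. The sketch below shows only this selection step; the function name and the numbers are illustrative.

```python
def subcenter_cosines(cos_to_subcenters):
    """For each class, keep the largest cosine over its sub-centers."""
    return [max(class_cosines) for class_cosines in cos_to_subcenters]


# two classes with 3 sub-centers each (illustrative values)
print(subcenter_cosines([[0.2, 0.7, 0.1], [-0.3, 0.0, 0.4]]))  # [0.7, 0.4]

# with a single sub-center per class the step is the identity,
# so the loss is the same as ordinary AAM-Softmax
print(subcenter_cosines([[0.7], [0.4]]))  # [0.7, 0.4]
```

Noisy or atypical samples can attach to a non-dominant sub-center instead of distorting the main class center, which is where the robustness to label noise comes from.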

6.2 Circle Loss

Most loss functions apply an equal penalty strength to the within-class similarity ($s_p$) and the between-class similarity ($s_n$). In contrast, circle loss [sun2020circle] uses the weight factors $\alpha_p$ and $\alpha_n$ to let $s_p$ and $s_n$ learn at different paces. In this work, we set the margin to 0.25 and gamma to 64.
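The following sketch follows the circle loss formulation in [sun2020circle] for one anchor, taking lists of within-class and between-class similarities. The defaults use the margin (0.25) and gamma (64) stated above; the function names and example similarities are illustrative.

```python
import math


def _logsumexp(values):
    peak = max(values)
    return peak + math.log(sum(math.exp(v - peak) for v in values))


def circle_loss(sp, sn, margin=0.25, gamma=64.0):
    """Circle loss for within-class (sp) and between-class (sn) similarities.

    Weight factors: alpha_p = max(1 + margin - s_p, 0),
                    alpha_n = max(s_n + margin, 0).
    Decision margins: 1 - margin for s_p and margin for s_n.
    """
    pos = [-gamma * max(1.0 + margin - s, 0.0) * (s - (1.0 - margin)) for s in sp]
    neg = [gamma * max(s + margin, 0.0) * (s - margin) for s in sn]
    z = _logsumexp(pos) + _logsumexp(neg)
    # numerically stable softplus: log(1 + exp(z))
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


well_separated = circle_loss(sp=[0.9], sn=[0.1])
poorly_separated = circle_loss(sp=[0.3], sn=[0.6])
print(well_separated < poorly_separated)  # True
```

Because each weight factor grows with the distance of a similarity from its optimum, a similarity that is still far from its target receives a larger gradient than one that is already close.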

7 Training Strategy

The model training process is composed of base training and fine-tuning. All models are first trained using the Adam optimizer [kingma2014adam] with a cyclical learning rate (CLR) using the triangular2 policy described in [smith2017cyclical], cycling between a minimum and a maximum learning rate. Then we use the large margin fine-tune (LMF) strategy [thienpondt2021idlab] to fine-tune our well-trained models, with a reduced maximum cyclical learning rate. The initial margin penalty and scale of the AAM-softmax layer are set to 0.2 and 30 respectively. In the fine-tuning stage, the penalty and scale are set to 0.25 and 35 instead. In base training, we use 3s utterances to train the models. In the fine-tuning step, we use 6s utterances.
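The triangular2 policy of [smith2017cyclical] can be sketched as follows. The learning-rate bounds and step size used in the example are placeholders chosen for readability, not the values used in this work, and `triangular2_lr` is a hypothetical helper name.

```python
import math


def triangular2_lr(iteration, step_size, min_lr, max_lr):
    """Cyclical learning rate, triangular2 policy: the rate rises linearly
    from min_lr to the peak over `step_size` iterations, falls back over the
    next `step_size`, and the peak amplitude is halved after every cycle."""
    cycle = math.floor(1 + iteration / (2 * step_size))
    x = abs(iteration / step_size - 2 * cycle + 1)
    return min_lr + (max_lr - min_lr) * max(0.0, 1.0 - x) / (2 ** (cycle - 1))


# placeholder bounds 0.0 and 1.0 with a step size of 10 iterations
for it in (0, 10, 20, 30, 50):
    print(it, triangular2_lr(it, step_size=10, min_lr=0.0, max_lr=1.0))
# peaks at iterations 10, 30, 50 are 1.0, 0.5, 0.25
```

Halving the amplitude each cycle lets training start with large exploratory steps and settle into progressively smaller oscillations.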

8 Normalization and Fusion

8.1 Embedding Normalization

We introduce embedding averaging and matrix scoring as embedding normalization tricks. The former simply averages the embeddings of all segments of an utterance to obtain the speaker representation of the utterance, while the latter scores each trial with a score matrix generated from the segments.

Embedding Average (EA): We train our models with 3-second utterances in each batch. In the test stage, we also split the test utterances into 3-second clips and extract a speaker embedding for each segment. Then we average the embeddings of all segments in the utterance, and the average is taken as the final speaker embedding of this utterance. Note that the large margin fine-tuned models, which are trained with 6-second utterances, are tested on 6-second segments of the enrollment and test utterances.

Matrix Score Average (MSA): We split each utterance into segments and then extract a speaker embedding for each segment. Thus, for one trial, scoring every enrollment segment against every test segment forms a score matrix. Then we average all scores in the matrix to obtain the final score of this trial.
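The two tricks can be sketched in a few lines, assuming cosine scoring and segment embeddings that are already extracted. The function names and the toy 2-dimensional embeddings are illustrative.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def embedding_average(segment_embeddings):
    """EA: average the segment embeddings of one utterance."""
    n = len(segment_embeddings)
    return [sum(col) / n for col in zip(*segment_embeddings)]


def matrix_score_average(enroll_segments, test_segments):
    """MSA: score every enrollment segment against every test segment,
    then average the whole score matrix."""
    scores = [[cosine(e, t) for t in test_segments] for e in enroll_segments]
    return sum(map(sum, scores)) / (len(enroll_segments) * len(test_segments))


enroll = [[1.0, 0.0], [0.0, 1.0]]   # two enrollment segments
test = [[1.0, 0.0]]                 # one test segment
print(embedding_average(enroll))            # [0.5, 0.5]
print(matrix_score_average(enroll, test))   # 0.5
```

EA averages before scoring, while MSA averages after scoring, so the two generally give different trial scores. In the toy example, scoring the EA embedding against the test segment gives about 0.707, whereas MSA gives 0.5.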

8.2 Score Normalization

Most of our models are scored by cosine similarity, but we adopt the PLDA scoring [snyder2018x] in Kaldi [povey2011kaldi] for models trained with the softmax loss function. In our experience, PLDA does not work well if the model is trained with margin-based softmax. In addition, we adopt adaptive symmetric score normalization (as-norm) to normalize all scores.
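A common form of adaptive symmetric score normalization is sketched below. It assumes that the enrollment and test embeddings have each been scored against a cohort of impostor embeddings, and that only the `top_n` highest cohort scores on each side are used for the statistics. The cohort, the value of `top_n` and the function name are assumptions; this report does not specify them.

```python
import statistics


def as_norm(raw_score, enroll_cohort_scores, test_cohort_scores, top_n):
    """Average of two z-normalized scores, one using the top-N cohort
    scores of the enrollment side and one using those of the test side."""
    def top_stats(scores):
        top = sorted(scores, reverse=True)[:top_n]
        return statistics.fmean(top), statistics.pstdev(top)

    mu_e, sd_e = top_stats(enroll_cohort_scores)
    mu_t, sd_t = top_stats(test_cohort_scores)
    return 0.5 * ((raw_score - mu_e) / sd_e + (raw_score - mu_t) / sd_t)


normalized = as_norm(
    raw_score=0.6,
    enroll_cohort_scores=[0.1, 0.3, -0.5],   # illustrative cohort scores
    test_cohort_scores=[0.0, 0.4, -0.9],
    top_n=2,
)
print(round(normalized, 6))  # 3.0
```

The selected cohort scores must not all be identical, otherwise the standard deviation is zero; production code would guard against that case.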

8.3 System Fusion

System fusion aims to further boost performance by integrating multiple models that are expected to be complementary. In the system fusion stage, we adopt manual calibration as well as automatic calibration. For manual calibration, we adopt score-level fusion that assigns different weights to different models according to their performance on the development set. Considering that the fused system may over-fit the development set with manual calibration, we also use the BOSARIS toolkit [brummer2013bosaris] for score calibration before system fusion.
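Score-level fusion with fixed weights amounts to a weighted sum of the per-system scores for each trial. The sketch below shows that operation only; the weights are placeholders, and `fuse_scores` is a hypothetical helper, not part of the BOSARIS toolkit.

```python
def fuse_scores(system_scores, weights, offset=0.0):
    """Linear score-level fusion.

    system_scores[i][j] is the score of system i on trial j; the result has
    one fused score per trial.
    """
    if len(system_scores) != len(weights):
        raise ValueError("need exactly one weight per system")
    return [
        offset + sum(w * s for w, s in zip(weights, trial))
        for trial in zip(*system_scores)
    ]


# two systems, two trials, placeholder weights
print(fuse_scores([[1.0, 0.0], [0.0, 1.0]], weights=[0.25, 0.75]))  # [0.25, 0.75]
```

In manual fusion the weights are picked by hand from development results, while a linear fusion trained on development scores learns the weights and offset instead. The per-system scores should be on comparable scales before they are combined, which is one reason to calibrate first.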

9 Experimental Results

We evaluate the 15 models with the above-mentioned strategies (each with a system index) on the VoxSRC21-dev and VoxSRC21-eval data sets, and the results are summarized in Table 1. Here ECAPA-TDNN (1024) is used for an ablation study. We can see from Table 1 that the EER and minDCF (p=0.05) of ECAPA-TDNN (1024) are 3.565% and 0.1833 on the VoxSRC21-dev trials (B1). With the large margin fine-tune (LMF) strategy, we observe a 0.634%/0.0201 absolute reduction in EER/minDCF (B2). At the evaluation stage, the use of embedding average (EA) achieves a further 0.299%/0.0181 EER/minDCF reduction (B3). In addition, the combination of matrix score average (MSA) and EA obtains a further 0.061%/0.0215 reduction in EER/minDCF (B4). These results show the effectiveness of the various tricks. We also notice that the best single model in terms of EER is ResNet34-TDNN (512) with SC-AAM-Softmax and speaker augmentation (indexed with L2), which achieves an EER/minDCF of 1.614%/0.0930 on the VoxSRC21-dev trials; only M2 reaches a lower minDCF (0.0854). As introduced before, we perform frequency-domain specAug, additive noise and reverberation augmentation dynamically during model training. Speaker augmentation is applied only to ECAPA-TDNN (2048), ResNet34-SE (512), CNN-ECAPA, SE-Res2Net, ResNet34-TDNN and CoAtNet. Comparing L1 with M1, where the only difference between the two systems is the loss function, SC-AAM-Softmax gives a clearly lower minDCF than circle loss (0.1532 vs. 0.1975), although its EER is slightly higher (3.012% vs. 2.817%).
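For readers unfamiliar with the two metrics, the sketch below computes EER and normalized minDCF from lists of target and non-target trial scores. It assumes unit costs for misses and false alarms together with the target prior p=0.05 mentioned above, and it treats tied scores one at a time, which is slightly less exact than official scoring tools. All function names and example scores are illustrative.

```python
def det_points(target_scores, nontarget_scores):
    """(miss rate, false-alarm rate) at every threshold, sweeping the
    threshold upward over the sorted scores."""
    labelled = sorted([(s, 1) for s in target_scores] +
                      [(s, 0) for s in nontarget_scores])
    n_tar, n_non = len(target_scores), len(nontarget_scores)
    missed, rejected = 0, 0
    points = [(0.0, 1.0)]  # threshold below every score: all trials accepted
    for _, is_target in labelled:
        if is_target:
            missed += 1
        else:
            rejected += 1
        points.append((missed / n_tar, (n_non - rejected) / n_non))
    return points


def eer(points):
    """Equal error rate: operating point where miss and false-alarm rates
    are closest, reported as their average."""
    miss, fa = min(points, key=lambda p: abs(p[0] - p[1]))
    return (miss + fa) / 2


def min_dcf(points, p_target=0.05, c_miss=1.0, c_fa=1.0):
    """Minimum detection cost, normalized by the cost of the best
    trivial system (accept all or reject all)."""
    best = min(c_miss * p_target * miss + c_fa * (1 - p_target) * fa
               for miss, fa in points)
    return best / min(c_miss * p_target, c_fa * (1 - p_target))


points = det_points([0.9, 0.8, 0.4], [0.5, 0.3, 0.2, 0.1])
print(eer(points), min_dcf(points))
```

Note that EER in Table 1 is reported in percent, so the fraction returned by `eer` would be multiplied by 100.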

We tried two fusion strategies. In Fusion1 in Table 1, model weights are assigned by hand according to the performance on the development trials. In Fusion2-4, models are fused by the BOSARIS toolkit [brummer2013bosaris], in which a linear fusion strategy is used. In general, system fusion leads to a clear performance gain due to the complementarity of different models. Finally, on the VoxSRC21-eval set, the best performances on track 1 and track 2 are achieved by the fusion of 11 and 14 systems respectively. Due to time limitations, we did not manage to submit a 14-system fusion for track 1. In detail, our best minDCF is 0.1205 and 0.1175 for track 1 and track 2 respectively, achieving second place in both tracks.

| System Index | Model Structure | Pooling Layer | Loss Function | Speaker Aug | VoxSRC21-dev EER(%) | VoxSRC21-dev minDCF | VoxSRC21-eval EER(%) | VoxSRC21-eval minDCF |
|---|---|---|---|---|---|---|---|---|
| A1 | E-TDNN | SP | Softmax | no | 7.384 | 0.4486 | - | - |
| A2 | +PLDA | SP | Softmax | no | 5.562 | 0.3029 | - | - |
| B1 | ECAPA-TDNN(1024) | ASP | AAM-Softmax | no | 3.565 | 0.1833 | - | - |
| B2 | +LMF | ASP | AAM-Softmax | no | 2.931 | 0.1632 | - | - |
| B3 | +LMF+EA | ASP | AAM-Softmax | no | 2.632 | 0.1451 | - | - |
| B4 | +LMF+EA+MSA | ASP | AAM-Softmax | no | 2.571 | 0.1236 | - | - |
| C1 | ECAPA-TDNN(2048) | ASP | SC-AAM-Softmax | yes | 3.126 | 0.1511 | - | - |
| C2 | +LMF+EA+MSA | ASP | SC-AAM-Softmax | yes | 2.012 | 0.1028 | - | - |
| D1 | ResNet34-SE(256) | ASP | AAM-Softmax | no | 3.419 | 0.2061 | - | - |
| D2 | +LMF+EA+MSA | ASP | AAM-Softmax | no | 2.976 | 0.1472 | - | - |
| E1 | ResNet34-SE(512) | SAP | SC-AAM-Softmax | yes | 3.359 | 0.1903 | - | - |
| E2 | +LMF+EA+MSA | SAP | SC-AAM-Softmax | yes | 2.252 | 0.1105 | - | - |
| F1 | ResNet34-DTCF | ASP | AAM-Softmax | no | 3.565 | 0.2096 | - | - |
| F2 | +LMF+EA+MSA | ASP | AAM-Softmax | no | 2.573 | 0.1281 | - | - |
| G1 | ResNet34-BAM | SAP | AAM-Softmax | no | 3.384 | 0.1931 | - | - |
| G2 | +LMF+EA+MSA | SAP | AAM-Softmax | no | 2.623 | 0.1331 | - | - |
| H1 | ResNet-Pyramid | TAP | AAM-Softmax | no | 3.562 | 0.2029 | - | - |
| H2 | +LMF+EA+MSA | TAP | AAM-Softmax | no | 2.132 | 0.1051 | - | - |
| I1 | CNN-ECAPA | SAP | SC-AAM-Softmax | yes | 2.837 | 0.1964 | - | - |
| I2 | +LMF | SAP | SC-AAM-Softmax | yes | 2.159 | 0.1310 | - | - |
| I3 | +LMF+EA+MSA | SAP | SC-AAM-Softmax | yes | 1.773 | 0.0979 | - | - |
| J1 | SE-Res2Net | MHAP | Circle-Loss | yes | 3.427 | 0.1972 | - | - |
| J2 | +LMF+EA+MSA | MHAP | Circle-Loss | yes | 1.912 | 0.1021 | - | - |
| K1 | SE-Res2Net | MHAP | SC-AAM-Softmax | yes | 3.345 | 0.1824 | - | - |
| K2 | +LMF+EA+MSA | MHAP | SC-AAM-Softmax | yes | 2.017 | 0.1185 | - | - |
| L1 | ResNet34-TDNN | MHAP | SC-AAM-Softmax | yes | 3.012 | 0.1532 | - | - |
| L2 | +LMF+EA+MSA | MHAP | SC-AAM-Softmax | yes | 1.614 | 0.0930 | - | - |
| M1 | ResNet34-TDNN | MHAP | Circle-Loss | yes | 2.817 | 0.1975 | - | - |
| M2 | +LMF+EA+MSA | MHAP | Circle-Loss | yes | 1.799 | 0.0854 | - | - |
| N1 | D-TDNN-SE | SAP | AAM-Softmax | no | 3.424 | 0.2201 | - | - |
| N2 | +LMF+EA+MSA | SAP | AAM-Softmax | no | 2.042 | 0.1087 | - | - |
| O1 | CoAtNet | SAP | Softmax | yes | 4.562 | 0.2531 | - | - |
| O2 | +PLDA | SAP | Softmax | yes | 2.125 | 0.1814 | - | - |
| Fusion1 | [B4+D2+F2+G2+E2] | - | - | - | 1.6384 | 0.0935 | 3.3360 | 0.1659 |
| Fusion2 | [C2+E2+H2+I2+K2+L2+N2] | - | - | - | 1.4384 | 0.0735 | 2.9440 | 0.1457 |
| Fusion3 | [B4+C2+D2+E2+F2+G2+J2+K2+L2+M2+N2] | - | - | - | 1.4210 | 0.0681 | 2.8160 | 0.1205 |
| Fusion4 | [B4+C2+D2+E2+F2+G2+H2+I3+J2+K2+L2+M2+N2+O2] | - | - | - | 1.1039 | 0.0582 | 2.8400 | 0.1175 |

Table 1: Experimental results on VoxSRC21-dev and VoxSRC21-eval (LMF: large margin fine-tune, EA: embedding average, MSA: matrix score average, TAP: temporal average pooling, SP: statistics pooling, ASP: attentive statistics pooling, SAP: self-attention pooling, MHAP: multi-head attention pooling)

10 Conclusion

In our submission to VoxSRC-21, we explored various TDNN-based and ResNet-based speaker embedding models on track 1 and 2. We particularly introduced ResNet-DTCF, CoAtNet and ResNet-Pyramid structures. We also used the large margin fine-tune strategy during model training to further improve the performance. Moreover, we specifically took embedding average and embedding matrix score average as normalization tricks at the evaluation stage. The take-home messages are as follows.

  • In our experiments, hybrid structures such as ResNet34-TDNN and CNN-ECAPA outperform the individual ResNet and TDNN structures.

  • Embedding normalization and score normalization during evaluation are useful tricks, which lead to clear performance gain.

  • Fusing multiple models is beneficial, as the models may complement each other.

11 Acknowledgement

This work was supported by Beijing ZKJ Technology Co., Ltd.

References