The IDLAB VoxCeleb Speaker Recognition Challenge 2021 System Description

09/09/2021 ∙ by Jenthe Thienpondt, et al. ∙ Ghent University

This technical report describes the IDLab submission for track 1 and 2 of the VoxCeleb Speaker Recognition Challenge 2021 (VoxSRC-21). This speaker verification competition focuses on short duration test recordings and cross-lingual trials. Currently, both Time Delay Neural Networks (TDNNs) and ResNets achieve state-of-the-art results in speaker verification. We opt to use a system fusion of hybrid architectures in our final submission. An ECAPA-TDNN baseline is enhanced with a 2D convolutional stem to transfer some of the strong characteristics of a ResNet based model to this hybrid CNN-TDNN architecture. Similarly, we incorporate absolute frequency positional information in the SE-ResNet architectures. All models are trained with a special mini-batch data sampling technique which constructs mini-batches with data that is the most challenging for the system on the level of intra-speaker variability. This intra-speaker variability is mainly caused by differences in language and background conditions between the speaker's utterances. The cross-lingual effects on the speaker verification scores are further compensated by introducing a binary cross-linguality trial feature in the logistic regression based system calibration. The final system fusion with two ECAPA CNN-TDNNs and three SE-ResNets enhanced with frequency positional information achieved a third place on the VoxSRC-21 leaderboard for both track 1 and 2 with a minDCF of 0.1291 and 0.1313 respectively.




1 System architectures

TDNN and ResNet based speaker verification systems achieve similar performance [voxsrc_2020_technical_report] while being structurally very different. TDNNs depend on 1D convolutions of which the kernels cover the complete frequency range of the input features. As a consequence, the absolute frequency position of each input feature is hard-coded through the order of the filter coefficients. This makes sense in the context of speaker verification as features based on absolute frequency information such as the speaker’s pitch provide crucial information. The main drawback is that many filters are needed to model the fine details of complex patterns that can occur at any frequency region.

ResNets use 2D convolutions with a small receptive field, capturing local frequency and temporal information. By exploiting local speaker-specific frequency patterns that repeat but for small frequency shifts, this model requires fewer feature channels to model fine frequency details. However, accurate absolute frequency positions of features are not explicitly encoded [position_info_resnet]. This can be sub-optimal as we expect significantly different patterns in the low frequency regions compared to the high frequency regions. We enhance both architectures to incorporate the beneficial characteristics of the other model. The modifications are described below and more information can be found in [ecapa_cnn_tdnn].

Figure 1: The 2D convolutional stem of the ECAPA CNN-TDNN architecture. $C$ and $T$ indicate the channel and temporal dimension, respectively. The kernel size is denoted by $k$. $S$ refers to the number of training speakers.

1.1 ECAPA CNN-TDNN with 2D convolutional stem

Architectures based on the TDNN topology apply the 1D convolutional operation on the frame-level intermediate features. Consequently, the kernel of the convolution encompasses the complete frequency range. We want the ECAPA-TDNN [ecapa_tdnn] to be equivariant to small and reasonable shifts in the frequency domain, compensating for realistic intra-speaker frequency variability. To accomplish this, we base the initial network layers on 2D convolutions. These layers will also require fewer filters to model high resolution input detail.

Inspired by the powerful 2D convolutions in the ResNet architectures, we adopt a similar structure for our ECAPA-TDNN stem. The proposed network configuration is shown in Figure 1. To make the stem more computationally efficient, we use a stride of 2 in the frequency dimension of the first and final 2D convolutional layer. The output feature maps of the new stem are subsequently flattened in the channel and frequency dimensions and used as input for the regular ECAPA-TDNN network. We will refer to this network as the ECAPA CNN-TDNN.
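The effect of the two frequency strides on the flattened stem output can be checked with a little arithmetic. The sketch below is only an illustration: the number of stem layers (four), the kernel size of 3 and the padding of 1 are assumptions, not values stated in this report, and the function names are hypothetical.

```python
def conv_out_size(size, kernel=3, stride=1, padding=1):
    """Output length of a convolution along one axis (standard formula)."""
    return (size + 2 * padding - kernel) // stride + 1


def stem_output_dim(n_mels=80, channels=128, freq_strides=(2, 1, 1, 2)):
    """Per-frame feature dimension after flattening channel and frequency.

    freq_strides is an assumed layer layout: only the first and the final
    layer use a stride of 2 in the frequency dimension.
    """
    freq = n_mels
    for stride in freq_strides:
        freq = conv_out_size(freq, stride=stride)
    return channels * freq, freq


flat_dim, freq_bins = stem_output_dim()
print(flat_dim, freq_bins)
```

Under these assumptions, 80 Mel bins are reduced to 20 frequency positions, so a stem with 128 channels hands a 2560-dimensional vector per frame to the TDNN layers.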

1.2 SE-ResNet with frequency position information

The second architecture is based on the ResNet [resnet] architecture. We use the same network topology as defined in [magneto], with the addition of an SE-block [se_block] at the end of each residual block. We also incorporate the same channel-dependent attentive statistics pooling layer as in [ecapa_tdnn].

1.2.1 Learnable frequency positional encodings

While a certain robustness against frequency translation is beneficial, the spatial equivariance of 2D convolutions can limit the ability of the network to fully exploit frequency-specific information. We argue that encoding positional frequency information in the intermediate feature representations can make the network incorporate and utilize knowledge of the feature frequency positions.

Positional encodings have been popularized with the rise of Transformer [transformer] models. By design, these Transformers are invariant to the re-ordering of the input sequence and positional encodings can alleviate this issue when required. In our submission we focus on learnable encodings as the computational overhead is negligible.

Consider the input of a residual module in our baseline ResNet architecture as $\mathbf{X} \in \mathbb{R}^{C \times F \times T}$, with $C$, $F$ and $T$ indicating the channel, frequency and time dimension, respectively. We define the positional encoding vector $\mathbf{p} \in \mathbb{R}^{F}$ as a trainable vector. The elements in this vector are broadcast to match the dimensions of the targeted input feature maps. The input of the residual module is now defined as $\mathbf{X} + \mathbf{p}$. We add a unique positional encoding to the input of each residual block after branching the skip connection.
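The broadcast addition can be illustrated with a few lines of NumPy. This is a minimal sketch, not the training code: the encoding is a fixed array standing in for the trainable vector, and the function name is hypothetical.

```python
import numpy as np


def add_frequency_encoding(x, p):
    """Add a per-frequency encoding p (shape F) to feature maps x (shape C x F x T)."""
    if x.shape[1] != p.shape[0]:
        raise ValueError("encoding length must equal the frequency dimension")
    # Broadcast p over the channel and time axes.
    return x + p[None, :, None]


C, F, T = 4, 5, 6
x = np.zeros((C, F, T))
p = np.arange(F, dtype=float)  # stands in for the trainable vector
y = add_frequency_encoding(x, p)
print(y.shape)
```

Every channel and every time step receives the same offset at a given frequency position, so the encoding adds only one parameter per frequency bin to each residual block.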

1.2.2 Frequency-wise Squeeze-Excitation (fwSE)

Squeeze-Excitation (SE) [se_block] blocks have been successfully applied in both TDNN and ResNet based speaker verification architectures. The first stage, the squeeze operation, calculates a vector $\mathbf{z}$ containing the mean descriptor for each feature map. It is followed by the excitation operation, which calculates a scalar for each feature map given the information in $\mathbf{z}$. Subsequently, each feature map is rescaled with its corresponding scalar.

We argue that rescaling per feature map is not tailored for the processing of speech in ResNets. Instead, we propose a frequency-wise Squeeze-Excitation (fwSE) block, which injects global frequency information across all feature maps. We calculate a vector $\mathbf{z} \in \mathbb{R}^{F}$ containing the mean descriptor for each frequency-channel of the intermediate feature maps in the following manner:

$$z_f = \frac{1}{C \times T} \sum_{i=1}^{C} \sum_{j=1}^{T} x_{i,f,j}$$

with $z_f$ the elements of $\mathbf{z}$ and $x_{i,f,j}$ the component of input feature map $\mathbf{X} \in \mathbb{R}^{C \times F \times T}$ corresponding with channel $i$, frequency position $f$ and time step $j$. From the mean descriptors in $\mathbf{z}$ we calculate a vector $\mathbf{s}$ containing the scaling scalars for each frequency-channel in the second stage, the excitation operation:

$$\mathbf{s} = \sigma(\mathbf{W}_2 f(\mathbf{W}_1 \mathbf{z} + \mathbf{b}_1) + \mathbf{b}_2)$$

with $\mathbf{W}_k$ and $\mathbf{b}_k$ indicating the weights and bias of a linear layer, $f(\cdot)$ the ReLU activation function and $\sigma(\cdot)$ the sigmoid function. Finally, each frequency-channel of $\mathbf{X}$ is scaled with the corresponding scalar in $\mathbf{s}$. The proposed frequency-wise SE-blocks are inserted at the end of each residual module before the additive skip connection.
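A compact NumPy sketch of the frequency-wise squeeze and excitation steps is given below. It assumes a feature map of shape (C, F, T) and a two-layer excitation with a bottleneck, as in a standard SE block; the bottleneck size, the random weights and the function name are illustrative only.

```python
import numpy as np


def fwse(x, w1, b1, w2, b2):
    """Frequency-wise SE on a feature map x of shape (C, F, T)."""
    z = x.mean(axis=(0, 2))                    # squeeze: one mean per frequency, shape (F,)
    h = np.maximum(w1 @ z + b1, 0.0)           # first linear layer followed by ReLU
    s = 1.0 / (1.0 + np.exp(-(w2 @ h + b2)))   # second linear layer followed by sigmoid
    return x * s[None, :, None], s             # rescale each frequency-channel


rng = np.random.default_rng(0)
C, F, T, B = 3, 8, 10, 4                       # B: assumed bottleneck dimension
x = rng.standard_normal((C, F, T))
w1, b1 = rng.standard_normal((B, F)), np.zeros(B)
w2, b2 = rng.standard_normal((F, B)), np.zeros(F)
y, s = fwse(x, w1, b1, w2, b2)
print(y.shape, s.shape)
```

In contrast to a regular SE block, the scaling vector has one entry per frequency position instead of one per feature map, so the same scalar is shared across all channels and time steps at that frequency.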

1.3 Single system configurations

The two selected ECAPA CNN-TDNN architectures in the system fusion use the same model architecture as described in [ecapa_cnn_tdnn]. We also introduce a larger variant ECAPA CNN-TDNN (big) with 256 channels in the 2D convolutional stem compared to 128 channels in the original model. The number of channels in the TDNN layers is set to 2048 and the ECAPA-TDNN includes a total of four SE-Res2Blocks.

In addition, we train three fwSE-ResNet variants with a topology as described in Section 1.2. We vary the number of layers in each of the four ResBlocks of the model as indicated by the four numbers between brackets in Table 1.

2 Training procedure

2.1 Initial training

All speaker embedding extractors are trained on the development part of the VoxCeleb2 corpus [vox2]. The input features are 80-dimensional log Mel-filterbank energies extracted with a window length of 25 ms and a frame-shift of 10 ms. We apply an online augmentation strategy on random training crops using the MUSAN [musan] corpus (additive music, noise & babble) and the RIR [rirs] corpus (reverb). SpecAugment [specaugment] is applied to the input features with a frequency and temporal masking dimension of 10 and 5, respectively. Finally, the input features are mean normalized across time per utterance.

During the construction of each mini-batch we randomly select 32 utterances from different speakers. For each utterance, we apply the four augmentations on random crops with a duration of 2 s and also include a crop of the original clean signal, which gives a total of 32 × 5 = 160 training samples. SpecAugment is applied on all samples. This mini-batch data sampling procedure has been shown to be beneficial for speaker diarization performance in [ecapa_diarization].
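The mini-batch layout can be sketched as follows. The metadata format (utterance id, speaker, duration in seconds), the function name and the choice of an independent crop position per condition are assumptions made for illustration; the actual augmentation and SpecAugment steps are omitted.

```python
import random

AUGMENTATIONS = ("music", "noise", "babble", "reverb")


def build_minibatch(utterances, n_utts=32, crop_s=2.0, rng=random):
    """Return (utt_id, speaker, condition, crop_start) tuples for one mini-batch.

    utterances: list of (utt_id, speaker, duration_s) tuples (hypothetical format).
    """
    by_speaker = {}
    for utt in utterances:
        by_speaker.setdefault(utt[1], []).append(utt)
    speakers = rng.sample(sorted(by_speaker), n_utts)  # distinct speakers
    batch = []
    for spk in speakers:
        utt_id, _, dur = rng.choice(by_speaker[spk])
        # One clean crop plus one crop per augmentation type.
        for cond in ("clean",) + AUGMENTATIONS:
            start = rng.uniform(0.0, max(dur - crop_s, 0.0))
            batch.append((utt_id, spk, cond, start))
    return batch


utts = [(f"u{s}_{k}", f"spk{s}", 8.0) for s in range(50) for k in range(3)]
batch = build_minibatch(utts, rng=random.Random(0))
print(len(batch))
```

With 32 utterances and five conditions per utterance, the batch contains 160 samples.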

The model parameter updates are determined by the Adam optimizer [adam] with a cyclical learning rate schedule. We use the triangular2 policy described in [clr]. The cycle length is set to 130k iterations with a minimum and maximum learning rate of 1e-8 and 1e-3, respectively. The ECAPA CNN-TDNN systems are trained for three full cycles, while the fwSE-ResNets are trained for one cycle to prevent overfitting. The AAM-softmax [arcface] layer uses a margin of 0.2 and a scale of 30. To prevent overfitting, a weight decay of 2e-4 is applied on the weights of the AAM-softmax layer, while a value of 2e-5 is used on all other layers of the model.
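For reference, the triangular2 policy can be written as a small function of the iteration count. The sketch assumes that the cycle length of 130k iterations covers both the increasing and the decreasing half of a cycle; the function name is hypothetical.

```python
def triangular2_lr(iteration, cycle_len=130_000, lr_min=1e-8, lr_max=1e-3):
    """Learning rate under the triangular2 policy: the amplitude halves every cycle."""
    half = cycle_len / 2
    cycle = iteration // cycle_len                    # 0-based cycle index
    pos = abs((iteration % cycle_len) / half - 1.0)   # 1 at the cycle edges, 0 at the peak
    return lr_min + (lr_max - lr_min) * (1.0 - pos) / (2 ** cycle)


for it in (0, 65_000, 130_000, 195_000):
    print(it, triangular2_lr(it))
```

Under this assumption the learning rate peaks at 1e-3 in the first cycle, at roughly 5e-4 in the second and at roughly 2.5e-4 in the third.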

2.2 Large margin fine-tuning and cross-lingual sampling

We fine-tune all systems with an adapted version of the Large Margin Fine-Tuning (LM-FT) strategy presented in [icassp_voxsrc20]. The CNN-TDNN systems use the original settings with an increase in crop size to 6 seconds accompanied by a margin increase to 0.5. For computational reasons, we limited the increase in crop length to only 4 seconds with a margin of 0.4 for the ResNet systems during fine-tuning.

To make the model embeddings more robust against language variation within the same speaker, we use an alternative training utterance sampling technique that constructs challenging cross-lingual mini-batches during fine-tuning. Each mini-batch contains two utterances in different languages for each of 64 training speakers, resulting in a total mini-batch size of 128. This targets the challenging cross-lingual test trial conditions. To determine the language of the utterances we use the estimated language labels provided by the VoxSRC-21 organizers. Each utterance is either left clean or augmented with music, babble, noise or reverb, with equal probability.
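One possible way to draw such a cross-lingual mini-batch is sketched below. The data layout and the function name are hypothetical, and restricting the draw to speakers with at least two language labels is an assumption; the report does not state how speakers with a single language are handled.

```python
import random


def cross_lingual_batch(utts_by_speaker, n_speakers=64, rng=random):
    """Draw two utterances in different languages for each sampled speaker.

    utts_by_speaker: {speaker: [(utt_id, language), ...]} (hypothetical format).
    """
    eligible = sorted(
        spk for spk, utts in utts_by_speaker.items()
        if len({lang for _, lang in utts}) >= 2
    )
    batch = []
    for spk in rng.sample(eligible, n_speakers):
        first = rng.choice(utts_by_speaker[spk])
        others = [u for u in utts_by_speaker[spk] if u[1] != first[1]]
        second = rng.choice(others)
        batch.extend([(spk,) + first, (spk,) + second])
    return batch


demo = {
    f"spk{i}": [(f"u{i}_a", "en"), (f"u{i}_b", "nl" if i % 2 else "fr"), (f"u{i}_c", "en")]
    for i in range(80)
}
demo["mono"] = [("m1", "en"), ("m2", "en")]  # single-language speaker, never sampled
batch = cross_lingual_batch(demo, rng=random.Random(1))
print(len(batch))
```

Each consecutive pair in the returned list shares a speaker but differs in language label, giving 128 samples per mini-batch.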

Due to time constraints, we did not investigate the possibility of combining the hard prototype mining based utterance selection criterion [sdsvc_paper, icassp_voxsrc20] with the cross-lingual sampling.

3 Scoring procedure

3.1 Score estimation and normalization

The verification trials are scored by calculating the cosine similarity between the enrollment model and the test utterance speaker embeddings. The scores are normalized using top-100 adaptive s-normalization [as_norm_original_1, as_norm_original_2]. The imposter cohort consists of top imposters selected from the speakers in the VoxCeleb2 training set. Each imposter speaker is represented by the average of their length-normalized training embeddings.
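The scoring step can be sketched with NumPy as follows. This is a common formulation of adaptive s-normalization in which each side of the trial selects its own top-N imposters; whether it matches the exact variant used in the submission is an assumption, and all names are illustrative.

```python
import numpy as np


def l2norm(x):
    """Length-normalize along the last axis."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


def as_norm(enroll, test, cohort, top_n=100):
    """Adaptive s-normalized cosine score between two embeddings.

    cohort: (N, D) matrix with one averaged embedding per imposter speaker.
    """
    enroll, test, cohort = l2norm(enroll), l2norm(test), l2norm(cohort)
    raw = float(enroll @ test)  # cosine similarity of the trial

    def stats(emb):
        top = np.sort(cohort @ emb)[-top_n:]  # scores against the closest imposters
        return top.mean(), top.std()

    mu_e, sd_e = stats(enroll)
    mu_t, sd_t = stats(test)
    return 0.5 * ((raw - mu_e) / sd_e + (raw - mu_t) / sd_t)


rng = np.random.default_rng(0)
cohort = rng.standard_normal((500, 32))
a, b = rng.standard_normal(32), rng.standard_normal(32)
score_ab = as_norm(a, b, cohort)
score_aa = as_norm(a, a, cohort)
print(score_ab, score_aa)
```

The normalized score is symmetric in the enrollment and test embeddings, and it is expressed in units of the standard deviation of the closest imposter scores.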

3.2 Score fusion and QMF score calibration

We fuse the system scores by taking the weighted average of the s-normalized single system scores. In line with the single system performance on the validation set, the ResNet based systems are assigned double the weight of the TDNN systems. Next, we perform quality-aware score calibration [icassp_voxsrc20] based on logistic regression with Quality Metric Functions (QMFs). As a first set of QMFs, we use the minimum and maximum value of $\log(d - l)$ over the enrollment and test utterance, with $d$ the duration of the considered utterance and $l$ slightly lower than the minimum expected test utterance duration (i.e. 1.99 s). A second set of QMFs is the minimum and maximum imposter mean score of both utterances in the trial against an imposter cohort consisting of the average utterance embedding of the VoxCeleb2 training speakers. As analyzed in [voxsrc_2020_technical_report], we use the inner product instead of the cosine similarity to compute this score as this results in better performance. Finally, we add a binary cross-linguality feature in the calibration backend indicating if the enrollment and test utterance are spoken in the same language or not according to the provided language label.
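The fusion and the quality features for a single trial can be sketched as below. The duration feature is assumed to take the form log(d - l) with l = 1.99 s, the feature ordering is arbitrary, and the function names are hypothetical; the imposter mean scores are taken as given inputs.

```python
import math


def fuse(scores, weights):
    """Weighted average of the s-normalized single-system scores of one trial."""
    return sum(w * s for w, s in zip(weights, scores)) / sum(weights)


def qmf_features(dur_enroll, dur_test, imp_enroll, imp_test,
                 lang_enroll, lang_test, l=1.99):
    """Quality features passed to the calibration backend next to the fused score."""
    d = [math.log(dur_enroll - l), math.log(dur_test - l)]  # assumed duration QMF
    m = [imp_enroll, imp_test]                              # imposter mean scores
    cross_lingual = 0.0 if lang_enroll == lang_test else 1.0
    return [min(d), max(d), min(m), max(m), cross_lingual]


# Two TDNN systems with weight 1 and three ResNet systems with weight 2.
weights = [1, 1, 2, 2, 2]
fused = fuse([1.0, 1.0, 2.0, 2.0, 2.0], weights)
feats = qmf_features(2.99, 1.99 + math.e, 0.1, 0.3, "en", "fr")
print(fused, feats)
```

The fused score and the five quality features together form the input vector of the logistic regression.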

For the open condition, we use the provided validation set to train the logistic regression calibration backend. To make the calibration system more robust against short duration trial conditions, we randomly crop half of the utterances included in the validation set between 2 and 4 seconds.

For the closed track, we create a validation set based on utterances of the VoxCeleb2 development set. We create 100K within-gender and cross-video trials, with an even split of positive and negative trials. Again, half of the utterances are cropped between 2 and 4 seconds to simulate short duration conditions. Half of the trials are cross-lingual to make the system calibration robust against varying language conditions. Because this calibration set uses the same utterances as those on which the models are trained, the impact of cross-linguality conditions will be underestimated by the calibration backend. To compensate, we consider the weight of the binary cross-linguality feature as a hyper-parameter to tune manually according to the performance on the VoxSRC-21 validation set.
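Treating the cross-linguality weight as a manually tuned hyper-parameter amounts to overriding one coefficient of the trained linear backend. The following sketch assumes a linear logistic regression model whose last feature is the binary cross-linguality flag; the function name and the example values are purely illustrative.

```python
def calibrated_score(features, weights, bias, cross_lingual_weight=None):
    """Calibrated log-odds of a linear logistic-regression backend.

    features[-1] is assumed to be the binary cross-linguality flag.
    """
    w = list(weights)
    if cross_lingual_weight is not None:
        w[-1] = cross_lingual_weight  # manual override, tuned on validation data
    return sum(wi * xi for wi, xi in zip(w, features)) + bias


trained = calibrated_score([2.0, 1.0], [1.0, -0.5], 0.1)
tuned = calibrated_score([2.0, 1.0], [1.0, -0.5], 0.1, cross_lingual_weight=-2.0)
print(trained, tuned)
```

Only cross-lingual trials are affected by the override, since the flag is zero for same-language trials.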

4 Results

We evaluate all single systems and the (calibrated) fusion systems on the given VoxSRC-21 validation set in Table 1. The results of the calibrated system fusion on the VoxSRC-21 test set are provided in Table 2. We report the EER and MinDCF metric using a $P_{target}$ value of 0.05 with $C_{FA} = 1$ and $C_{Miss} = 1$. All reported scores are s-normalized using an imposter cohort size of 100 speakers. The single system scores and system fusion without QMFs in Table 1 are not calibrated, while the reported fusion scores with QMFs are calibrated as described in Section 3.2.
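Both metrics can be computed from a list of trial scores and target labels with a threshold sweep. The sketch below is a simplified illustration, not the official challenge scoring tool: it ignores tied scores, takes the EER at the threshold where the miss and false alarm rates are closest, and uses default cost parameters that are assumed to match the challenge settings.

```python
import numpy as np


def eer_and_min_dcf(scores, labels, p_target=0.05, c_miss=1.0, c_fa=1.0):
    """Return (EER, normalized minDCF); labels are 1 for target, 0 for non-target."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)[np.argsort(scores, kind="stable")]
    n_tar, n_non = labels.sum(), (1 - labels).sum()
    # Index i places the threshold just above the i lowest scores (i = 0 accepts all).
    p_miss = np.concatenate(([0.0], np.cumsum(labels) / n_tar))
    p_fa = np.concatenate(([1.0], 1.0 - np.cumsum(1 - labels) / n_non))
    idx = np.argmin(np.abs(p_miss - p_fa))
    eer = 0.5 * (p_miss[idx] + p_fa[idx])
    dcf = c_miss * p_target * p_miss + c_fa * (1.0 - p_target) * p_fa
    # Normalize by the cost of the best system that ignores the scores.
    min_dcf = dcf.min() / min(c_miss * p_target, c_fa * (1.0 - p_target))
    return float(eer), float(min_dcf)


eer, min_dcf = eer_and_min_dcf([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1])
print(eer, min_dcf)
```

For the four toy trials above, one target scores below one non-target, which gives an EER of 0.5 and a normalized minDCF of 0.5.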

| System | EER(%) | MinDCF |
|---|---|---|
| ECAPA CNN-TDNN | 2.67 | 0.1604 |
| ECAPA CNN-TDNN (big) | 2.65 | 0.1531 |
| fwSE-ResNet86 (24, 32, 24, 6) | 2.37 | 0.1296 |
| fwSE-ResNet110 (24, 48, 32, 6) | 2.48 | 0.1334 |
| fwSE-ResNet90 (12, 32, 40, 6) | 2.45 | 0.1305 |
| Fusion | 1.71 | 0.0986 |
| Fusion + QMFs (closed) | 1.54 | 0.0943 |
| Fusion + QMFs (open) | 1.38 | 0.0880 |

Table 1: System performance on the VoxSRC-21 validation set.

| Systems | EER(%) | MinDCF |
|---|---|---|
| Fusion + QMFs (closed) | 2.27 | 0.1291 |
| Fusion + QMFs (open) | 2.05 | 0.1313 |

Table 2: Fusion system performance on the VoxSRC-21 test set.