The DKU-DukeECE System for the Self-Supervision Speaker Verification Task of the 2021 VoxCeleb Speaker Recognition Challenge

by   Danwei Cai, et al.
Duke University

This report describes the submission of the DKU-DukeECE team to the self-supervision speaker verification task of the 2021 VoxCeleb Speaker Recognition Challenge (VoxSRC). Our method employs an iterative labeling framework to learn self-supervised speaker representations with a deep neural network (DNN). The framework starts by training a self-supervised speaker embedding network that maximizes agreement between different segments within an utterance via a contrastive loss. Taking advantage of the DNN's ability to learn from data with label noise, we cluster the speaker embeddings obtained from the previous speaker network and use the resulting class assignments as pseudo labels to train a new DNN. We then iteratively train the speaker network with pseudo labels generated in the previous step to bootstrap the discriminative power of the DNN. Visual modality data is also incorporated into this self-labeling framework: the visual pseudo labels and the audio pseudo labels are fused with a cluster ensemble algorithm to generate a robust supervisory signal for representation learning. Our submission achieves an equal error rate (EER) of 5.58% and 5.59% on the challenge development and test sets, respectively.




1 Introduction

This report describes the submission of the DKU-DukeECE team to the self-supervision speaker verification task of the 2021 VoxCeleb Speaker Recognition Challenge (VoxSRC).

In our previous work on self-supervised speaker representation learning [3], we proposed a two-stage iterative labeling framework. In the first stage, contrastive self-supervised learning is used to pre-train the speaker embedding network. This gives the network a meaningful feature representation for the first round of clustering, instead of a random initialization. In the second stage, a clustering algorithm iteratively generates pseudo labels for the training data from the learned representation, and the network is trained with these labels in a supervised manner. The clustering algorithm can discover the intrinsic structure of the representation of the unlabeled data, providing more meaningful supervisory signals than contrastive learning, which draws negative samples uniformly from the training data without label information. The idea behind the proposed framework is to take advantage of the DNN’s ability to learn from data with label noise and bootstrap its discriminative power.

In this work, we extend this iterative labeling framework to multi-modal audio-visual data, considering that complementary information from different modalities can help the clustering algorithm generate more meaningful supervisory signals. Specifically, we train a visual representation network to encode face information using the pseudo labels generated from the audio data. With the resulting visual representations, clustering is performed to generate pseudo labels for the visual data. Then, we employ a cluster ensemble algorithm to fuse the pseudo labels generated by the different modalities. The fused pseudo labels are then used to train the speaker and face representation networks. With the cluster ensemble algorithm, information in one modality can flow to the other modality, providing more robust and fault-tolerant supervisory signals.

Figure 1: The proposed iterative framework for self-supervised speaker representation learning using multi-modal data.

2 Methods

This section describes the proposed iterative labeling framework for self-supervised speaker representation learning using multi-modal audio-visual data. The framework is illustrated in Figure 1.

  • Stage 1: contrastive training

    • Train an audio encoding network using contrastive self-supervised learning.

    • With this encoding network, extract representations for the audio data. Perform a clustering algorithm on these audio representations to generate pseudo labels.

  • Stage 2: iterative clustering and representation learning

    • With the generated pseudo labels, train the audio and visual encoding networks independently in a supervised manner.

    • With the audio encoding network, extract audio representations and perform clustering to generate audio pseudo labels.

    • With the visual encoding network, extract visual representations and perform clustering to generate visual pseudo labels.

    • Fuse the audio and visual pseudo labels using a cluster ensemble algorithm.

    • Repeat stage 2 for a limited number of rounds.

2.1 Contrastive self-supervised learning

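We employ a contrastive self-supervised learning (CSL) framework similar to those in [5, 8] to learn an initial audio representation. Let $\mathcal{D} = \{x_n\}_{n=1}^{N}$ be an unlabeled dataset with $N$ data samples. CSL assumes that each data sample defines its own class and performs instance discrimination. During training, we randomly sample a mini-batch of $M$ data samples from $\mathcal{D}$. For data point $x_i$, stochastic data augmentation is performed to generate two correlated views, i.e., $\tilde{x}_{i,1}$ and $\tilde{x}_{i,2}$, resulting in $2M$ data points in total for a mini-batch. Two different audio segments are randomly cropped from the original audio before data augmentation. $\tilde{x}_{i,1}$ and $\tilde{x}_{i,2}$ are considered a positive pair, and the other $2(M-1)$ data points are negative examples for $\tilde{x}_{i,1}$ and $\tilde{x}_{i,2}$.

During training, a neural network encoder $\Phi$ extracts representations for the augmented data samples,

$$z_{i,j} = \Phi(\tilde{x}_{i,j}), \quad j \in \{1, 2\}. \tag{1}$$

After that, the contrastive loss identifies the positive example $z_{i,2}$ (or $z_{i,1}$) among the negative examples for $z_{i,1}$ (or $z_{i,2}$). We adapt the contrastive loss from SimCLR [5] as:

$$\mathcal{L}_{\mathrm{CSL}} = \frac{1}{2M} \sum_{i=1}^{M} \left( l_{i,1} + l_{i,2} \right), \tag{2}$$

$$l_{i,j} = -\log \frac{\exp\left(\cos(z_{i,1}, z_{i,2})/\tau\right)}{\sum_{k=1}^{M} \sum_{l=1}^{2} \mathbb{1}_{[(k,l) \neq (i,j)]} \exp\left(\cos(z_{i,j}, z_{k,l})/\tau\right)}, \tag{3}$$

where $\mathbb{1}_{[(k,l) \neq (i,j)]}$ is an indicator function evaluating to 1 for every view other than the anchor itself (and to 0 when $k = i$ and $l = j$), $\cos(\cdot, \cdot)$ denotes the cosine similarity, and $\tau$ is a temperature parameter to scale the similarity scores. $l_{i,j}$ can be interpreted as the loss for anchor feature $z_{i,j}$. It computes a positive score for the positive feature $z_{i,3-j}$ and negative scores across all $2(M-1)$ negative features $\{z_{k,l} : k \neq i\}$.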
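The loss in this subsection can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `contrastive_loss` and the toy inputs are hypothetical, the embeddings are assumed to be already computed by the encoder, and `tau=0.1` follows the temperature reported in section 3.4.1.

```python
import numpy as np

def contrastive_loss(z1, z2, tau=0.1):
    """SimCLR-style contrastive loss for M positive pairs (z1[i], z2[i]).

    z1, z2: float arrays of shape (M, D) holding the two views' embeddings.
    """
    m = z1.shape[0]
    z = np.concatenate([z1, z2], axis=0)                # (2M, D)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)    # unit norm, so dot product = cosine
    sim = z @ z.T / tau                                 # scaled cosine similarities
    np.fill_diagonal(sim, -np.inf)                      # drop the anchor itself from the denominator
    # Index of the positive view for every anchor: i <-> i + M.
    pos = np.concatenate([np.arange(m, 2 * m), np.arange(m)])
    # Numerically stable log-sum-exp over each row (the denominator).
    row_max = sim.max(axis=1, keepdims=True)
    log_denominator = row_max[:, 0] + np.log(np.exp(sim - row_max).sum(axis=1))
    log_numerator = sim[np.arange(2 * m), pos]
    return float(np.mean(log_denominator - log_numerator))

# Toy usage: two orthogonal "speakers", each with two identical views.
views = np.array([[1.0, 0.0], [0.0, 1.0]])
loss = contrastive_loss(views, views)
```

In the toy usage each anchor sees one positive with cosine 1 and two negatives with cosine 0, so the loss is small but not zero; it approaches zero only as the negatives move further from the anchor.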

Figure 2: Within-cluster sum of square of a clustering procedure versus the number of clusters employed.

2.2 Generating pseudo labels by clustering

2.2.1 K-means clustering

Given the learned representations of the training data, we employ a clustering algorithm to generate cluster assignments and pseudo labels. In this paper, we use the well-known k-means algorithm because of its simplicity, speed, and scalability to large datasets.

Let $z_n \in \mathbb{R}^{d}$ denote the learned representation of the $n$-th sample in the $d$-dimensional feature space. K-means learns a centroid matrix $C \in \mathbb{R}^{d \times K}$ and the cluster assignment $y_n \in \{1, \dots, K\}$ for each representation $z_n$ with the following learning objective

$$\min_{C} \frac{1}{N} \sum_{n=1}^{N} \min_{y_n} \left\| z_n - C_{y_n} \right\|_2^2, \tag{4}$$

where $C_{y_n}$ is the $y_n$-th column of the centroid matrix $C$. The optimal assignments $\{y_n^{*}\}$ are used as pseudo labels.
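As a rough illustration of how this objective is usually minimized, the sketch below implements Lloyd's algorithm in NumPy. It is a toy version under assumptions: the report does not say which k-means implementation or initialization was used, the random initialization and the names `assign` and `kmeans` are hypothetical, and centroids are stored as rows of a `(K, d)` array rather than as columns.

```python
import numpy as np

def assign(z, centroids):
    """Index of the nearest centroid for every row of z."""
    dist = ((z[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)  # (N, K)
    return dist.argmin(axis=1)

def kmeans(z, k, n_iter=100, seed=0):
    """Lloyd's algorithm: alternate between assignment and centroid update."""
    rng = np.random.default_rng(seed)
    # Assumption: initialize centroids with k distinct random samples.
    centroids = z[rng.choice(len(z), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        labels = assign(z, centroids)
        updated = centroids.copy()
        for j in range(k):
            members = z[labels == j]
            if len(members) > 0:              # keep the old centroid if a cluster is empty
                updated[j] = members.mean(axis=0)
        if np.allclose(updated, centroids):   # converged
            break
        centroids = updated
    return assign(z, centroids), centroids

# Toy usage: two well-separated groups of points.
points = np.array([[0, 0], [0, 1], [1, 0], [1, 1],
                   [10, 10], [10, 11], [11, 10], [11, 11]], dtype=float)
pseudo_labels, centers = kmeans(points, k=2)
```

The pairwise distance array here has size N x K x d, which is fine for a toy example but would need batching or a dedicated library at the scale of about a million embeddings and thousands of clusters.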

2.2.2 Determine the number of clusters

To determine the optimal number of clusters, we employ the simple ‘elbow’ method. It calculates the total within-cluster sum of squares $W_K$ for the clustering outputs with different numbers of clusters $K$:

$$W_K = \sum_{k=1}^{K} \sum_{n:\, y_n = k} \left\| z_n - C_k \right\|_2^2, \tag{5}$$

where $z_n$ is the representation of the $n$-th sample, $y_n$ is its cluster assignment, and $C_k$ is the centroid of cluster $k$.

The curve of $W_K$ is plotted against a sequence of $K$ in ascending order. Figure 2 shows an example of such a curve. $W_K$ decreases as $K$ increases, and the decrease of $W_K$ flattens from some $K$ onwards, forming an ‘elbow’ of the curve. Such an ‘elbow’ indicates that additional clusters beyond that $K$ do little to reduce the intra-cluster variation; thus, the $K$ at the ‘elbow’ indicates the appropriate number of clusters. In Figure 2, the number of clusters can be chosen between 5,000 and 7,000.

This ‘elbow’ method is not exact, and the choice of the optimal number of clusters can be subjective. Still, it provides a meaningful way to help to determine the optimal number of clusters. A mathematically rigorous interpretation of this method can be found in [22].
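The quantity plotted in the elbow curve can be computed directly from the embeddings and their cluster assignments. The sketch below is illustrative only: the function name `within_cluster_ss` and the toy data are hypothetical, and each centroid is taken to be the mean of its cluster, which is what k-means converges to.

```python
import numpy as np

def within_cluster_ss(z, labels):
    """Total within-cluster sum of squares for one clustering output."""
    total = 0.0
    for k in np.unique(labels):
        members = z[labels == k]
        centroid = members.mean(axis=0)        # cluster mean as centroid
        total += ((members - centroid) ** 2).sum()
    return float(total)

# Toy usage: four 1-D points grouped into 1, 2, and 4 clusters.
z = np.array([[0.0], [1.0], [10.0], [11.0]])
curve = {
    1: within_cluster_ss(z, np.array([0, 0, 0, 0])),
    2: within_cluster_ss(z, np.array([0, 0, 1, 1])),
    4: within_cluster_ss(z, np.array([0, 1, 2, 3])),
}
```

On this toy data the value drops sharply from one to two clusters and only marginally afterwards, which is the flattening the elbow method looks for.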

2.3 Learning with pseudo labels

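Given a multi-modal dataset with audio modality $\mathcal{X}_a = \{x_{a,n}\}_{n=1}^{N}$ and visual modality $\mathcal{X}_v = \{x_{v,n}\}_{n=1}^{N}$, an audio encoder $\Phi_{\theta_a}$ and a visual encoder $\Phi_{\theta_v}$ are discriminatively trained with an audio classifier $g_{W_a}$ and a visual classifier $g_{W_v}$, respectively, using the generated pseudo labels $\{y_n\}_{n=1}^{N}$. For each modality, the representation can be extracted as

$$z_{a,n} = \Phi_{\theta_a}(x_{a,n}), \qquad z_{v,n} = \Phi_{\theta_v}(x_{v,n}). \tag{6}$$

For a single modality, the parameters $(\theta_a, W_a)$ or $(\theta_v, W_v)$ are jointly trained with the cross-entropy loss:

$$\mathcal{L}_{\mathrm{CE}} = -\sum_{n=1}^{N} \sum_{k=1}^{K} q(k \mid x_n) \log p(k \mid x_n), \qquad p(k \mid x_n) = \frac{\exp(s_k)}{\sum_{k'=1}^{K} \exp(s_{k'})}, \tag{7}$$

where $q(k \mid x_n) = \delta_{k, y_n}$ is the ground-truth distribution over labels for data sample $x_n$ with label $y_n$, a Dirac delta which equals 1 for $k = y_n$ and 0 otherwise, $s_k$ is the $k$-th element ($k \in \{1, \dots, K\}$) of the class score vector $s = g_{W}(\Phi_{\theta}(x_n)) \in \mathbb{R}^{K}$, and $K$ is the number of pseudo classes.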

2.4 Clustering audio-visual data

Clustering on the audio representations $\{z_{a,n}\}$ and the visual representations $\{z_{v,n}\}$ gives audio pseudo labels $\{y_{a,n}\}$ and visual pseudo labels $\{y_{v,n}\}$, respectively.

Considering that the audio and the visual representations contain complementary information from different modalities, we apply an additional clustering on the joint representations to generate more robust pseudo labels. Given the audio representation $z_a$ and the visual representation $z_v$, concatenating $z_a$ and $z_v$ gives the joint representation $z_{av} = (z_a, z_v)$. The joint pseudo labels $\{y_{av,n}\}$ are then generated by clustering on the joint representations.

2.5 Cluster ensemble

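We use a simple voting strategy [1, 14] to fuse the three clustering outputs, i.e., $\{y_{a,n}\}$, $\{y_{v,n}\}$, and $\{y_{av,n}\}$. Since the cluster labels in different clustering outputs are arbitrary, cluster correspondence must be established among the clustering outputs. This starts with a contingency matrix $\Omega \in \mathbb{R}^{K \times K}$ for the reference clustering output $\{y_n\}$ and the current clustering output $\{y'_n\}$, where $K$ is the number of clusters. Each entry $\Omega_{k,k'}$ represents the co-occurrence between cluster $k$ of the reference clustering output and cluster $k'$ of the current clustering output,

$$\Omega_{k,k'} = \sum_{n=1}^{N} \mathbb{1}_{[y_n = k]} \, \mathbb{1}_{[y'_n = k']}. \tag{8}$$

Cluster correspondence is solved by the following optimization problem,

$$\max_{\Gamma} \sum_{k=1}^{K} \sum_{k'=1}^{K} \Omega_{k,k'} \Gamma_{k,k'} \quad \text{s.t.} \quad \sum_{k=1}^{K} \Gamma_{k,k'} = 1 \ \forall k', \quad \sum_{k'=1}^{K} \Gamma_{k,k'} = 1 \ \forall k, \quad \Gamma_{k,k'} \in \{0, 1\}, \tag{9}$$

where $\Gamma$ is the correspondence matrix for the two clustering outputs. $\Gamma_{k,k'}$ equals 1 if cluster $k$ in the reference clustering output corresponds to cluster $k'$ in the current clustering output, and 0 otherwise. This optimization can be solved by the Hungarian algorithm [16].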

We select the joint pseudo labels as the reference clustering output and calculate cluster correspondence for the audio and visual pseudo labels. A globally consistent label set is obtained after the re-labeling process. Majority voting is then employed to determine a consensus pseudo label for each data sample in the multi-modal dataset.
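The re-labeling and voting steps can be sketched on a toy example. Two things here are assumptions rather than the authors' method: the correspondence is found by exhaustive search over permutations, which only works for a handful of clusters (at the scale of thousands of clusters a polynomial-time assignment solver such as the Hungarian algorithm [16] would be needed, for example SciPy's `linear_sum_assignment`), and samples without a majority fall back to the reference label. All function names are hypothetical.

```python
from collections import Counter
from itertools import permutations

def contingency(reference, current, k):
    """table[r][c] = number of samples with reference label r and current label c."""
    table = [[0] * k for _ in range(k)]
    for r, c in zip(reference, current):
        table[r][c] += 1
    return table

def best_correspondence(table):
    """Map each current label to a reference label, maximizing total co-occurrence.

    Exhaustive search over permutations: illustration only, feasible for small K.
    """
    k = len(table)
    best = max(permutations(range(k)),
               key=lambda perm: sum(table[perm[c]][c] for c in range(k)))
    return {c: best[c] for c in range(k)}

def fuse(reference, others, k):
    """Align every clustering to the reference labels, then take a majority vote."""
    aligned = [list(reference)]
    for labels in others:
        mapping = best_correspondence(contingency(reference, labels, k))
        aligned.append([mapping[c] for c in labels])
    fused = []
    for votes in zip(*aligned):
        label, count = Counter(votes).most_common(1)[0]
        # Assumption: without a strict majority, keep the reference label.
        fused.append(label if count > len(votes) / 2 else votes[0])
    return fused

# Toy usage: the joint clustering mislabels the last sample; audio and visual
# clusterings agree on it but use different (arbitrary) label names.
joint = [0, 0, 1, 1, 2, 1]
audio = [1, 1, 2, 2, 0, 0]
visual = [2, 2, 0, 0, 1, 1]
consensus = fuse(joint, [audio, visual], k=3)
```

After alignment the audio and visual outputs both place the last sample in cluster 2 of the reference labeling, so the vote overrides the joint clustering for that sample.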

2.6 Dealing with label noise: label smoothing regularization

One problem with the generated pseudo labels is label noise, which degrades the generalization performance of deep neural networks. We apply label smoothing regularization to estimate the marginalized effect of label noise during training. It prevents a DNN from assigning full probability to training samples with noisy labels [21, 18]. Specifically, for a training example $x_n$ with label $y_n$, label smoothing regularization replaces the label distribution $q(k \mid x_n)$ in equation (7) with

$$q'(k \mid x_n) = (1 - \epsilon)\, \delta_{k, y_n} + \frac{\epsilon}{K}, \tag{10}$$

where $\epsilon$ is a smoothing parameter with a fixed value in the experiments and $K$ is the number of pseudo classes.
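A small pure-Python sketch shows how the smoothed target distribution enters the cross-entropy loss. The function names are hypothetical, and the value `eps=0.1` is only an illustrative choice, since the smoothing parameter's value is not recoverable from the text above.

```python
import math

def smoothed_targets(label, num_classes, eps):
    """Target distribution: (1 - eps) on the pseudo label plus eps spread uniformly."""
    return [(1.0 - eps) * (1.0 if k == label else 0.0) + eps / num_classes
            for k in range(num_classes)]

def cross_entropy(scores, targets):
    """Cross entropy between a target distribution and softmax(scores)."""
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))  # stable log-sum-exp
    return -sum(q * (s - log_z) for q, s in zip(targets, scores))

# Toy usage with 4 pseudo classes; eps = 0.1 is an assumed, illustrative value.
targets = smoothed_targets(label=1, num_classes=4, eps=0.1)
confident_scores = [0.0, 20.0, 0.0, 0.0]
hard_loss = cross_entropy(confident_scores, smoothed_targets(1, 4, 0.0))
smooth_loss = cross_entropy(confident_scores, targets)
```

With hard targets an extremely confident prediction drives the loss toward zero, whereas the smoothed targets keep penalizing it, which is how the regularizer discourages the network from fully trusting a possibly wrong pseudo label.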

3 Experiments

3.1 Dataset

The experiments are conducted on the development set of VoxCeleb 2, which contains 1,092,009 video segments from 5,994 individuals [6]. Speaker labels are not used in the proposed method. For evaluation, the development set and test set of VoxCeleb 1 are used [17]. For each video segment in the VoxCeleb datasets, we extract six image frames per second.

3.2 Data augmentation

3.2.1 Data augmentation for audio data

Data augmentation has proven to be an effective strategy for both conventional supervised learning [2] and contrastive self-supervised learning [12, 11, 5] in the context of deep learning. We perform data augmentation with the MUSAN dataset [19]. The background additive noise types include ambient noise, music, and babble noise. The babble noise is constructed by mixing three to eight speech files into one. For reverberation, the audio is convolved with simulated room impulse responses (RIRs) drawn from the 40,000 simulated RIRs in MUSAN. We only use RIRs from small and medium rooms.

With contrastive self-supervised learning, one of three augmentation types is randomly applied to each training utterance: noise addition only, reverberation only, or both noise addition and reverberation. The signal-to-noise ratio (SNR) is set between 5 and 20 dB.

When training with pseudo labels, data augmentation is performed with a probability of 0.6. The SNR is randomly set between 0 and 20 dB.
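The additive-noise part of this augmentation amounts to scaling a noise signal so that the mixture reaches a target SNR. The sketch below is a minimal NumPy illustration with hypothetical function names; drawing the SNR uniformly from the stated range and tiling short noise clips are assumptions, and reverberation is left out.

```python
import numpy as np

def add_noise(speech, noise, snr_db):
    """Mix noise into speech so that the result has the requested SNR in dB."""
    noise = np.resize(noise, speech.shape)   # tile or crop the noise to the speech length
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2)
    # Scale so that speech_power / (scale**2 * noise_power) == 10 ** (snr_db / 10).
    scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return speech + scale * noise

def augment(speech, noise, rng, prob=0.6, snr_range=(0.0, 20.0)):
    """Apply additive noise with probability `prob` at a randomly drawn SNR."""
    if rng.random() >= prob:
        return speech
    return add_noise(speech, noise, rng.uniform(*snr_range))

# Toy usage with synthetic signals standing in for real audio.
rng = np.random.default_rng(0)
speech = rng.normal(size=16000)
noise = rng.normal(size=4000)
noisy = add_noise(speech, noise, snr_db=10.0)
```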

| Model | Audio NMI | Audio EER [%] | Visual NMI | Visual EER [%] | Fused label NMI |
|---|---|---|---|---|---|
| Fully supervised | 1 | 1.51 | - | - | - |
| Initial round (CSL) | 0.75858 | 8.86 | - | - | - |
| Round 1 | 0.90065 | 3.64 | 0.91071 | 5.55 | 0.95053 |
| Round 2 | 0.94455 | 2.05 | 0.95017 | 2.27 | 0.95739 |
| Round 3 | 0.95196 | 1.93 | 0.95462 | 1.78 | 0.95862 |
| Round 4 | - | 1.81 | - | - | - |

Table 1: Speaker verification performance (EER [%]) on the VoxCeleb 1 test set. The NMIs of the pseudo labels for each iteration are also reported.

| System | minDCF (original score) | EER [%] (original score) | minDCF (after score norm) | EER [%] (after score norm) |
|---|---|---|---|---|
| System 1 | 0.386 | 6.310 | 0.341 | 6.214 |
| System 2 | 0.375 | 6.217 | 0.336 | 6.057 |
| System 3 | 0.392 | 6.224 | 0.361 | 6.067 |
| Fusion | 0.344 | 5.683 | 0.315 | 5.578 |
| Fusion (test) | - | - | 0.341 | 5.594 |

Table 2: Speaker verification performance on the VoxSRC 2021 development and test sets.

3.2.2 Data augmentation for visual data

We sequentially apply these simple augmentations to the visual data: random cropping followed by resizing to a fixed resolution, random horizontal flipping, random color distortion, random grayscale conversion, and random Gaussian blur. The data augmentation is performed with a probability of 0.6. We normalize each image’s pixel values to a fixed range afterward.

3.3 Network architecture

3.3.1 Audio encoder

We opt for a residual convolutional network (ResNet) to learn speaker representations from spectral feature sequences of varying length [4]. The ResNet’s output feature maps are aggregated with a global statistics pooling layer, which calculates the mean and standard deviation of each feature map. A fully connected layer is employed afterward to extract the 128-dimensional speaker embedding.
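The pooling step is what turns a variable-length input into a fixed-size vector. A minimal NumPy sketch follows, assuming the feature maps are laid out as (channels, frequency, time) and that the statistics are taken over both the frequency and time axes of each map; the function name and shapes are hypothetical.

```python
import numpy as np

def statistics_pooling(feature_maps):
    """Pool (C, F, T) feature maps into a fixed-size vector of length 2 * C.

    The first C entries are per-map means, the last C are per-map standard deviations.
    """
    c = feature_maps.shape[0]
    flat = feature_maps.reshape(c, -1)          # one row per feature map
    return np.concatenate([flat.mean(axis=1), flat.std(axis=1)])

# Toy usage: inputs of different durations give vectors of the same size.
short = statistics_pooling(np.random.default_rng(0).normal(size=(4, 5, 20)))
long = statistics_pooling(np.random.default_rng(1).normal(size=(4, 5, 37)))
```

Because the output length depends only on the number of feature maps, the fully connected layer that follows can have a fixed input size regardless of the utterance duration.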

3.3.2 Visual encoder

We choose the standard ResNet-34 [9] as the visual encoder. After the pooling layer, a fully connected layer transforms the output to a 128-dimensional embedding.

3.4 Implementation details

3.4.1 Contrastive self-supervised learning setup

For feature extraction, we use a 40-dimensional log Mel-spectrogram computed with a 25 ms Hamming window and a 10 ms shift. A segment duration between 2 and 4 seconds is randomly drawn for each batch of audio data.

We use the same network architecture as in [2] but with half the feature map channels. ReLU non-linear activation and batch normalization are applied to each convolutional layer in the ResNet. Network parameters are updated using the Adam optimizer [13] with an initial learning rate of 0.001 and a batch size of 256. The temperature in equation (3) is set to 0.1.

3.4.2 Clustering setup

The number of clusters is set to 6,000 for k-means based on the ‘elbow’ method described in section 2.2.2. The curve of the within-cluster sum of squares $W_K$ versus the number of clusters $K$ shown in Figure 2 is based on the audio representation learned with the contrastive loss. With a dataset of about 1,000,000 segments, we vary the number of clusters from 1,000 to 20,000, so that the average cluster size ranges from about 1,000 down to 50.

3.4.3 Setup for supervised training

For the audio data, we extract an 80-dimensional log Mel-spectrogram as input features. A segment duration between 2 and 4 seconds is randomly drawn for each batch of audio data. The architecture of the audio encoder is the same as the one used in [2].

For both audio and visual encoders, dropout is added before the classification layer to prevent overfitting [20]. Network parameters are updated using the stochastic gradient descent (SGD) algorithm. The learning rate is initially set to 0.1 and is divided by ten whenever the training loss reaches a plateau.
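The learning-rate rule can be illustrated with a small plateau detector. The report does not define when the loss counts as having reached a plateau, so the `patience` and `min_delta` criteria below are assumptions, and the class name `PlateauScheduler` is hypothetical.

```python
class PlateauScheduler:
    """Divide the learning rate by ten when the training loss stops improving."""

    def __init__(self, lr=0.1, factor=0.1, patience=2, min_delta=1e-4):
        self.lr = lr
        self.factor = factor
        self.patience = patience      # assumed: epochs without improvement that are tolerated
        self.min_delta = min_delta    # assumed: smallest decrease that counts as improvement
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, loss):
        """Record one epoch's training loss and return the learning rate to use next."""
        if loss < self.best - self.min_delta:
            self.best = loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
            if self.bad_epochs > self.patience:
                self.lr *= self.factor
                self.bad_epochs = 0
        return self.lr

# Toy usage: the loss stalls at 0.8 for several epochs, then improves again.
scheduler = PlateauScheduler()
schedule = [scheduler.step(loss) for loss in [1.0, 0.8, 0.8, 0.8, 0.8, 0.5]]
```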

3.5 Robust training on final pseudo labels

Our final submission consists of three systems trained on the final pseudo labels.

  • System 1: the network architecture is the same as in the self-labeling framework; label smoothing regularization is applied.

  • System 2: a Squeeze-and-Excitation (SE) module [10] is added to the network used in the self-labeling framework, and the AAM-softmax loss [7] is used to train the network.

  • System 3: same as System 2, except that single-speaker audio segment information is used to improve the final pseudo labels: the mode label is used as the final label of each single-speaker audio segment.
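One way to read the label refinement in System 3 is sketched below: utterances known to come from the same single-speaker segment are grouped, and each group's most frequent pseudo label replaces the individual labels. The grouping key (`group_ids`) and the function name are hypothetical; the report does not describe how the single-speaker segments are identified.

```python
from collections import Counter

def apply_mode_label(group_ids, pseudo_labels):
    """Replace every pseudo label by the most frequent label within its group."""
    by_group = {}
    for group, label in zip(group_ids, pseudo_labels):
        by_group.setdefault(group, []).append(label)
    # Ties are resolved in favor of the label seen first within the group.
    mode = {group: Counter(labels).most_common(1)[0][0]
            for group, labels in by_group.items()}
    return [mode[group] for group in group_ids]

# Toy usage: one utterance in segment "seg1" carries a stray pseudo label.
refined = apply_mode_label(["seg1", "seg1", "seg1", "seg2", "seg2"],
                           [3, 3, 7, 5, 5])
```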

3.6 Score normalization

After scoring with cosine similarity, the scores from all trials are subject to score normalization. We utilize Adaptive Symmetric Score Normalization (AS-Norm) in our systems [15]. The cohort size is 300 for all systems.
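A common form of adaptive symmetric normalization can be sketched as follows. This is an illustration under assumptions: the exact AS-Norm variant used in the submission is not specified, the figure of 300 is interpreted here as the number of top-scoring cohort entries kept per side (`top_n`), and the function name and toy scores are hypothetical.

```python
import numpy as np

def as_norm(score, enroll_cohort_scores, test_cohort_scores, top_n=300):
    """Normalize a trial score with statistics of each side's closest cohort entries.

    enroll_cohort_scores: scores of the enrollment embedding against the cohort.
    test_cohort_scores:   scores of the test embedding against the cohort.
    """
    def top_stats(cohort_scores):
        top = np.sort(np.asarray(cohort_scores, dtype=float))[-top_n:]  # highest scores
        return top.mean(), top.std()

    mu_e, sd_e = top_stats(enroll_cohort_scores)
    mu_t, sd_t = top_stats(test_cohort_scores)
    return 0.5 * ((score - mu_e) / sd_e + (score - mu_t) / sd_t)

# Toy usage with a tiny cohort and top_n = 2.
normalized = as_norm(0.8, [0.0, 0.2, 0.4], [0.1, 0.1, 0.5, 0.7], top_n=2)
```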

3.7 Experimental results

Table 1 shows the results of each clustering iteration on the original VoxCeleb 1 test set. Normalized mutual information (NMI) is used to measure clustering quality. With four rounds of training, our method obtains an EER of 1.81%.
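For reference, NMI between pseudo labels and ground-truth speaker labels can be computed as in the pure-Python sketch below. Several normalizations of mutual information are in use; this sketch normalizes by the arithmetic mean of the two entropies, which is an assumption, since the report does not state which variant was used. The function names are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in nats) of a label sequence."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())

def nmi(labels_a, labels_b):
    """Mutual information normalized by the mean of the two label entropies."""
    n = len(labels_a)
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    joint = Counter(zip(labels_a, labels_b))
    mutual_info = sum((c / n) * math.log(c * n / (count_a[a] * count_b[b]))
                      for (a, b), c in joint.items())
    denominator = entropy(labels_a) + entropy(labels_b)
    # Two trivial one-cluster labelings are treated as identical.
    return 2 * mutual_info / denominator if denominator > 0 else 1.0

# Toy usage: the same partition under different label names, and an unrelated one.
same_partition = nmi([0, 0, 1, 1], [1, 1, 0, 0])
unrelated = nmi([0, 0, 1, 1], [0, 1, 0, 1])
```

NMI is invariant to how clusters are named, which is why it suits pseudo labels whose indices are arbitrary.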

Table 2 shows the results of our submission system on the VoxSRC 2021 development and test set.


References

  • [1] E. Bauer and R. Kohavi (1999) An Empirical Comparison of Voting Classification Algorithms: Bagging, Boosting, and Variants. Machine Learning 36 (1), pp. 105–139.
  • [2] D. Cai, W. Cai, and M. Li (2020) Within-Sample Variability-Invariant Loss for Robust Speaker Recognition Under Noisy Environments. In ICASSP, pp. 6469–6473.
  • [3] D. Cai, W. Wang, and M. Li (2021) An Iterative Framework for Self-Supervised Deep Speaker Representation Learning. In ICASSP, pp. 6728–6732.
  • [4] W. Cai, J. Chen, and M. Li (2018) Exploring the Encoding Layer and Loss Function in End-to-End Speaker and Language Recognition System. In Speaker Odyssey, pp. 74–81.
  • [5] T. Chen, S. Kornblith, M. Norouzi, and G. Hinton (2020) A Simple Framework for Contrastive Learning of Visual Representations. In ICML, pp. 1597–1607.
  • [6] J. S. Chung, A. Nagrani, and A. Zisserman (2018) Voxceleb2: Deep Speaker Recognition. In Interspeech, pp. 1086–1090.
  • [7] J. Deng, J. Guo, N. Xue, and S. Zafeiriou (2019) ArcFace: Additive Angular Margin Loss for Deep Face Recognition. In CVPR, pp. 4685–4694.
  • [8] W. Falcon and K. Cho (2020) A Framework For Contrastive Self-Supervised Learning And Designing A New Approach. arXiv:2009.00104.
  • [9] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep Residual Learning for Image Recognition. In CVPR, pp. 770–778.
  • [10] J. Hu, L. Shen, S. Albanie, G. Sun, and E. Wu (2019) Squeeze-and-Excitation Networks. CVPR.
  • [11] J. Huh, H. S. Heo, J. Kang, S. Watanabe, and J. S. Chung (2020) Augmentation Adversarial Training for Unsupervised Speaker Recognition. arXiv:2007.12085.
  • [12] N. Inoue and K. Goto (2020) Semi-Supervised Contrastive Learning with Generalized Contrastive Loss and Its Application to Speaker Recognition. arXiv:2006.04326.
  • [13] D. P. Kingma and J. Ba (2015) Adam: A Method for Stochastic Optimization. In ICLR.
  • [14] L. Lam and S. Suen (1997) Application of Majority Voting to Pattern Recognition: An Analysis of Its Behavior and Performance. IEEE Transactions on Systems, Man, and Cybernetics-Part A: Systems and Humans 27 (5), pp. 553–568.
  • [15] P. Matějka, O. Novotný, O. Plchot, L. Burget, M. D. Sánchez, and J. Černocký (2017) Analysis of Score Normalization in Multilingual Speaker Recognition. In Interspeech, pp. 1567–1571.
  • [16] J. Munkres (1957) Algorithms for the Assignment and Transportation Problems. Journal of the Society for Industrial and Applied Mathematics 5 (1), pp. 32–38.
  • [17] A. Nagrani, J. S. Chung, and A. Zisserman (2017) Voxceleb: A Large-Scale Speaker Identification Dataset. In Interspeech, pp. 2616–2620.
  • [18] G. Pereyra, G. Tucker, J. Chorowski, Ł. Kaiser, and G. Hinton (2017) Regularizing Neural Networks by Penalizing Confident Output Distributions. In ICLR (Workshop). arXiv:1701.06548.
  • [19] D. Snyder, G. Chen, and D. Povey (2015) MUSAN: A Music, Speech, and Noise Corpus. arXiv:1510.08484.
  • [20] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov (2014) Dropout: A Simple Way to Prevent Neural Networks from Overfitting. Journal of Machine Learning Research 15 (1), pp. 1929–1958.
  • [21] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna (2016) Rethinking the Inception Architecture for Computer Vision. In CVPR, pp. 2818–2826.
  • [22] R. Tibshirani, G. Walther, and T. Hastie (2001) Estimating the number of clusters in a data set via the gap statistic. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 63 (2), pp. 411–423.