1 Introduction
This report describes the submission of the DKU-DukeECE team to the self-supervised speaker verification task of the 2021 VoxCeleb Speaker Recognition Challenge (VoxSRC).
In our previous work on self-supervised speaker representation learning [3], we proposed a two-stage iterative labeling framework. In the first stage, contrastive self-supervised learning is used to pre-train the speaker embedding network. This allows the network to learn a meaningful feature representation for the first round of clustering instead of starting from random initialization. In the second stage, a clustering algorithm iteratively generates pseudo labels for the training data with the learned representation, and the network is trained with these labels in a supervised manner. The clustering algorithm can discover the intrinsic structure of the representation of the unlabeled data, providing more meaningful supervisory signals than contrastive learning, which draws negative samples uniformly from the training data without label information. The idea behind the proposed framework is to take advantage of the DNN's ability to learn from data with label noise and to bootstrap its discriminative power.
In this work, we extend this iterative labeling framework to multi-modal audio-visual data, considering that complementary information from different modalities can help the clustering algorithm generate more meaningful supervisory signals. Specifically, we train a visual representation network to encode face information using the pseudo labels generated from audio data. With the resulting visual representations, clustering is performed to generate pseudo labels for the visual data. Then, we employ a cluster ensemble algorithm to fuse the pseudo labels generated by the different modalities. The fused pseudo labels are then used to train the speaker and face representation networks. With the cluster ensemble algorithm, information in one modality can flow to the other, providing more robust and fault-tolerant supervisory signals.

2 Methods
This section describes the proposed iterative labeling framework for self-supervised speaker representation learning using multi-modal audio-visual data. We illustrate the proposed framework in figure 1.

Stage 1: contrastive training

- Train an audio encoding network using contrastive self-supervised learning.
- With this encoding network, extract representations for the audio data and perform clustering on these audio representations to generate pseudo labels.

Stage 2: iterative clustering and representation learning

- With the generated pseudo labels, train the audio and visual encoding networks independently in a supervised manner.
- With the audio encoding network, extract audio representations and perform clustering to generate audio pseudo labels.
- With the visual encoding network, extract visual representations and perform clustering to generate visual pseudo labels.
- Fuse the audio and visual pseudo labels using a cluster ensemble algorithm.
- Repeat stage 2 for a limited number of rounds.
2.1 Contrastive self-supervised learning
We employ a contrastive self-supervised learning (CSL) framework similar to the frameworks in [5, 8] to learn an initial audio representation. Let $\mathcal{D}=\{x_1,\dots,x_N\}$ be an unlabeled dataset with $N$ data samples. CSL assumes that each data sample defines its own class and performs instance discrimination. During training, we randomly sample a mini-batch of $M$ data samples from $\mathcal{D}$. For each data point $x_i$, stochastic data augmentation is performed to generate two correlated views, i.e., $\tilde{x}_{2i-1}$ and $\tilde{x}_{2i}$, resulting in $2M$ data points in total for a mini-batch. Two different audio segments are randomly cropped from the original audio before data augmentation. $\tilde{x}_{2i-1}$ and $\tilde{x}_{2i}$ are considered a positive pair, and the other $2(M-1)$ data points are negative examples for $\tilde{x}_{2i-1}$ and $\tilde{x}_{2i}$.
During training, a neural network encoder $f(\cdot)$ extracts representations for the augmented data samples,

$$ z_i = f(\tilde{x}_i), \quad i = 1, \dots, 2M \qquad (1) $$

After that, the contrastive loss identifies the positive example $\tilde{x}_j$ (or $\tilde{x}_i$) among the negative examples for $\tilde{x}_i$ (or $\tilde{x}_j$). We adapt the contrastive loss from SimCLR [5] as:

$$ \mathcal{L} = \frac{1}{2M} \sum_{i=1}^{M} \big[\, \ell(2i-1, 2i) + \ell(2i, 2i-1) \,\big] \qquad (2) $$

$$ \ell(i, j) = -\log \frac{\exp\!\big(\mathrm{sim}(z_i, z_j)/\tau\big)}{\sum_{k=1}^{2M} \mathbb{1}_{[k \neq i]} \exp\!\big(\mathrm{sim}(z_i, z_k)/\tau\big)} \qquad (3) $$

where $\mathbb{1}_{[k \neq i]}$ is an indicator function evaluating to 1 when $k \neq i$ and 0 otherwise, $\mathrm{sim}(\cdot, \cdot)$ denotes the cosine similarity, and $\tau$ is a temperature parameter that scales the similarity scores. $\ell(i, j)$ can be interpreted as the loss for anchor feature $z_i$: it computes a positive score for the positive feature $z_j$ and negative scores across all negative features $z_k$, $k \neq i$.
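To make the loss in equations (2)-(3) concrete, here is a minimal PyTorch sketch of an NT-Xent-style contrastive loss. The view-pairing convention (rows $2i$ and $2i+1$ form a positive pair) and the function name are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def nt_xent_loss(z: torch.Tensor, tau: float = 0.1) -> torch.Tensor:
    # z: (2M, D) embeddings; rows 0/1, 2/3, ... are assumed to be the two
    # augmented views of the same utterance (illustrative convention).
    z = F.normalize(z, dim=1)              # cosine similarity becomes a dot product
    sim = z @ z.t() / tau                  # (2M, 2M) scaled similarity matrix
    sim.fill_diagonal_(float('-inf'))      # exclude self-similarity (k != i)

    n = z.size(0)
    pos = torch.arange(n) ^ 1              # positive index: 0<->1, 2<->3, ...
    # Cross-entropy over each row implements equation (3) for every anchor.
    return F.cross_entropy(sim, pos.to(z.device))
```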
2.2 Generating pseudo labels by clustering
2.2.1 K-means clustering
Given the learned representations of the training data, we employ a clustering algorithm to generate cluster assignments as pseudo labels. In this paper, we use the well-known k-means algorithm because of its simplicity, speed, and scalability to large datasets.
Let $z_i \in \mathbb{R}^{d}$ be the learned representation in the $d$-dimensional feature space. k-means learns a centroid matrix $C = [c_1, \dots, c_k]$ and a cluster assignment $y_i$ for each representation $z_i$ with the following learning objective:

$$ \min_{C} \frac{1}{N} \sum_{i=1}^{N} \min_{y_i \in \{1,\dots,k\}} \big\lVert z_i - c_{y_i} \big\rVert_2^{2} \qquad (4) $$

where $c_{y_i}$ is the $y_i$-th column of the centroid matrix $C$. The optimal assignments $y_i^{*}$ are used as pseudo labels.
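As a concrete illustration of this pseudo-labeling step, the sketch below runs k-means on a matrix of extracted embeddings with scikit-learn. The paper does not name a particular implementation; for a corpus of around one million segments a GPU or mini-batch variant would typically be preferred.

```python
import numpy as np
from sklearn.cluster import KMeans

def generate_pseudo_labels(embeddings: np.ndarray, n_clusters: int = 6000) -> np.ndarray:
    """embeddings: (N, d) speaker or face representations from the encoder.
    Returns one pseudo label per sample (the k-means cluster assignment)."""
    kmeans = KMeans(n_clusters=n_clusters, n_init=10)
    return kmeans.fit_predict(embeddings)
```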
2.2.2 Determine the number of clusters
To determine the optimal number of clusters, we employ the simple 'elbow' method. It calculates the total within-cluster sum of squares $W_k$ for clustering outputs with different numbers of clusters $k$:

$$ W_k = \sum_{j=1}^{k} \sum_{i:\, y_i = j} \big\lVert z_i - c_j \big\rVert_2^{2} \qquad (5) $$

The total within-cluster sum of squares $W_k$ is plotted against a sequence of $k$ in ascending order; figure 2 shows an example of such a curve. $W_k$ decreases as $k$ increases, and the decrease flattens from some $k$ onwards, forming an 'elbow' in the curve. This 'elbow' indicates that additional clusters beyond that $k$ contribute little to reducing the intra-cluster variation; thus, the $k$ at the 'elbow' indicates an appropriate number of clusters. In figure 2, the number of clusters can be chosen between 5,000 and 7,000.

This 'elbow' method is not exact, and the choice of the optimal number of clusters can be subjective. Still, it provides a meaningful way to help determine the number of clusters. A mathematically rigorous treatment of this problem can be found in [22].
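A minimal sketch of the 'elbow' procedure follows: sweep the number of clusters and record the total within-cluster sum of squares $W_k$ of equation (5), which scikit-learn exposes as `inertia_`. The grid of $k$ values is illustrative only.

```python
import numpy as np
from sklearn.cluster import KMeans

def elbow_curve(embeddings: np.ndarray,
                k_values=(1000, 2000, 4000, 6000, 8000, 10000)) -> dict:
    wk = []
    for k in k_values:
        km = KMeans(n_clusters=k, n_init=3).fit(embeddings)
        wk.append(km.inertia_)        # W_k: sum of squared distances to closest centroid
    return dict(zip(k_values, wk))    # plot W_k against k and look for the bend
```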
2.3 Learning with pseudo labels
Given a multi-modal dataset with audio modality $\{x_i^{a}\}$ and visual modality $\{x_i^{v}\}$, an audio encoder $f^{a}(\cdot)$ and a visual encoder $f^{v}(\cdot)$ are discriminatively trained with an audio classifier $g^{a}(\cdot)$ and a visual classifier $g^{v}(\cdot)$, respectively, using the generated pseudo labels $\{y_i\}$. For each modality, the representation can be extracted as

$$ z_i^{a} = f^{a}(x_i^{a}), \qquad z_i^{v} = f^{v}(x_i^{v}) \qquad (6) $$

For a single modality, the parameters of $\{f^{a}, g^{a}\}$ or $\{f^{v}, g^{v}\}$ are jointly trained with the cross-entropy loss:

$$ \mathcal{L}_{\mathrm{CE}} = -\frac{1}{N} \sum_{i=1}^{N} \sum_{k=1}^{K} q(k \mid x_i) \log p(k \mid x_i) \qquad (7) $$

$$ p(k \mid x_i) = \frac{\exp\!\big(s_k(x_i)\big)}{\sum_{j=1}^{K} \exp\!\big(s_j(x_i)\big)} \qquad (8) $$

where $q(k \mid x_i)$ is the ground-truth distribution over labels for data sample $x_i$ with pseudo label $y_i$, a Dirac delta which equals 1 for $k = y_i$ and 0 otherwise, $s_k(x_i)$ is the $k$-th element ($k \in \{1, \dots, K\}$) of the class score vector $g(f(x_i))$, and $K$ is the number of pseudo classes.
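The following sketch shows one schematic training step with the pseudo labels, corresponding to equations (6)-(8). Here `encoder`, `classifier`, and `optimizer` are placeholders for either the audio or the visual branch; they are not taken from the authors' code.

```python
import torch
import torch.nn as nn

def train_step(encoder: nn.Module, classifier: nn.Module,
               batch: torch.Tensor, pseudo_labels: torch.Tensor,
               optimizer: torch.optim.Optimizer) -> float:
    emb = encoder(batch)                    # equation (6): z_i = f(x_i)
    logits = classifier(emb)                # class score vector s(x_i)
    # Cross-entropy over softmax scores implements equations (7)-(8).
    loss = nn.functional.cross_entropy(logits, pseudo_labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```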
2.4 Clustering audio-visual data
Clustering on the audio representations $\{z_i^{a}\}$ and the visual representations $\{z_i^{v}\}$ gives audio pseudo labels $y^{a}$ and visual pseudo labels $y^{v}$, respectively.
Considering that the audio and visual representations contain complementary information from different modalities, we apply an additional clustering on the joint representations to generate more robust pseudo labels. Given the audio representation $z_i^{a}$ and the visual representation $z_i^{v}$, concatenating them gives the joint representation $z_i^{av} = [z_i^{a}; z_i^{v}]$. The joint pseudo labels $y^{av}$ are then generated by clustering on the joint representations.
2.5 Cluster ensemble
We use a simple voting strategy [1, 14] to fuse the three clustering outputs, i.e., $y^{a}$, $y^{v}$, and $y^{av}$. Since the cluster labels in different clustering outputs are arbitrary, cluster correspondence must first be established among them. This starts with a contingency matrix $M \in \mathbb{R}^{K \times K}$ between the reference clustering output $y^{r}$ and the current clustering output $y^{c}$, where $K$ is the number of clusters. Each entry $M_{jk}$ represents the co-occurrence between cluster $j$ of the reference clustering output and cluster $k$ of the current clustering output:

$$ M_{jk} = \sum_{i=1}^{N} \mathbb{1}\big[y_i^{r} = j\big]\; \mathbb{1}\big[y_i^{c} = k\big] \qquad (9) $$

Cluster correspondence is solved by the following optimization problem:

$$ \max_{P} \sum_{j=1}^{K} \sum_{k=1}^{K} P_{jk} M_{jk} \quad \text{s.t.}\; P \in \{0,1\}^{K \times K},\; P\mathbf{1} = \mathbf{1},\; P^{\top}\mathbf{1} = \mathbf{1} \qquad (10) $$

where $P$ is the correspondence matrix for the two clustering outputs: $P_{jk}$ equals 1 if cluster $j$ in the reference clustering output corresponds to cluster $k$ in the current clustering output, and 0 otherwise. This optimization can be solved by the Hungarian algorithm [16].
We select the joint pseudo labels $y^{av}$ as the reference clustering output and calculate cluster correspondences for the audio and visual pseudo labels. A globally consistent label set is obtained after this re-labeling process. Majority voting is then employed to determine a consensus pseudo label for each data sample in the multi-modal dataset.
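To illustrate the ensemble, the sketch below builds the contingency matrix of equation (9), solves the correspondence problem of equation (10) with SciPy's Hungarian solver, and takes a per-sample majority vote. The tie-breaking rule (fall back to the joint reference label) is an assumption, as the paper does not state it.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def align_labels(reference: np.ndarray, current: np.ndarray, n_clusters: int) -> np.ndarray:
    # Contingency matrix M (equation 9): co-occurrence counts between clusterings.
    contingency = np.zeros((n_clusters, n_clusters), dtype=np.int64)
    np.add.at(contingency, (reference, current), 1)
    # Maximize total co-occurrence (equation 10); SciPy minimizes, so negate.
    _, col_ind = linear_sum_assignment(-contingency)
    mapping = np.empty(n_clusters, dtype=np.int64)
    mapping[col_ind] = np.arange(n_clusters)   # current cluster id -> reference cluster id
    return mapping[current]

def ensemble_vote(joint: np.ndarray, audio: np.ndarray, visual: np.ndarray,
                  n_clusters: int) -> np.ndarray:
    audio_aligned = align_labels(joint, audio, n_clusters)
    visual_aligned = align_labels(joint, visual, n_clusters)
    stacked = np.stack([joint, audio_aligned, visual_aligned])   # (3, N)
    fused = []
    for col in stacked.T:
        vals, counts = np.unique(col, return_counts=True)
        # Majority vote; if all three disagree, keep the joint (reference) label.
        fused.append(vals[np.argmax(counts)] if counts.max() > 1 else col[0])
    return np.array(fused)
```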
2.6 Dealing with label noise: label smoothing regularization
One problem with the generated pseudo labels is label noise, which degrades the generalization performance of deep neural networks. We apply label smoothing regularization to estimate the marginalized effect of label noise during training. It prevents a DNN from assigning full probability to training samples with noisy labels [21, 18]. Specifically, for a training example $x_i$ with pseudo label $y_i$, label smoothing regularization replaces the label distribution $q(k \mid x_i)$ in equation (7) with

$$ q'(k \mid x_i) = (1 - \epsilon)\, \delta_{k, y_i} + \frac{\epsilon}{K} \qquad (11) $$

where $\epsilon$ is a smoothing parameter, set to a fixed value in the experiments.
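Below is a minimal sketch of a cross-entropy loss with the smoothed targets of equation (11). The value epsilon = 0.1 is only a common default used for illustration; the paper's exact setting is not restated here.

```python
import torch
import torch.nn.functional as F

def label_smoothing_ce(logits: torch.Tensor, labels: torch.Tensor,
                       epsilon: float = 0.1) -> torch.Tensor:
    num_classes = logits.size(1)
    log_probs = F.log_softmax(logits, dim=1)
    # Smoothed target distribution q'(k|x): epsilon/K everywhere,
    # plus (1 - epsilon) extra mass on the pseudo-label class.
    smoothed = torch.full_like(log_probs, epsilon / num_classes)
    smoothed.scatter_(1, labels.unsqueeze(1), 1.0 - epsilon + epsilon / num_classes)
    return -(smoothed * log_probs).sum(dim=1).mean()
```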
3 Experiments
3.1 Dataset
The experiments are conducted on the development set of VoxCeleb 2, which contains 1,092,009 video segments from 5,994 individuals [6]. Speaker labels are not used in the proposed method. For evaluation, the development set and test set of VoxCeleb 1 are used [17]. For each video segment in the VoxCeleb datasets, we extract six image frames per second.
3.2 Data augmentation
3.2.1 Data augmentation for audio data
Data augmentation has proven to be an effective strategy for both conventional supervised learning [2] and contrastive self-supervised learning [12, 11, 5] in the context of deep learning. We perform data augmentation with the MUSAN dataset [19]. The background additive noise types include ambient noise, music, and babble noise. The babble noise is constructed by mixing three to eight speech files into one. For reverberation, the convolution operation is performed with 40,000 simulated room impulse responses (RIRs); we only use RIRs from small and medium rooms.

For contrastive self-supervised learning, one of three augmentation types is randomly applied to each training utterance: noise addition only, reverberation only, or both noise and reverberation. The signal-to-noise ratios (SNRs) are set between 5 and 20 dB.
When training with pseudo labels, data augmentation is performed with a probability of 0.6, and the SNR is randomly set between 0 and 20 dB.
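As an illustration of the additive-noise augmentation, the sketch below mixes a MUSAN-style noise signal into a speech waveform at a randomly drawn SNR. The helper name and the waveform handling are assumptions, not the authors' pipeline.

```python
import numpy as np

def add_noise(speech: np.ndarray, noise: np.ndarray,
              snr_db_range=(0.0, 20.0)) -> np.ndarray:
    snr_db = np.random.uniform(*snr_db_range)
    # Tile or crop the noise to match the speech length.
    if len(noise) < len(speech):
        noise = np.tile(noise, len(speech) // len(noise) + 1)
    noise = noise[: len(speech)]
    speech_power = np.mean(speech ** 2) + 1e-12
    noise_power = np.mean(noise ** 2) + 1e-12
    # Scale the noise so that 10*log10(speech_power / scaled_noise_power) = snr_db.
    scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return speech + scale * noise
```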
Table 1: Pseudo-label NMI and speaker verification EER (%) on the VoxCeleb 1 original test set for each clustering iteration.

| Model | Audio NMI | Audio EER (%) | Visual NMI | Visual EER (%) | Fused label NMI |
|---|---|---|---|---|---|
| Fully supervised | 1 | 1.51 | - | - | - |
| Initial round (CSL) | 0.75858 | 8.86 | - | - | - |
| Round 1 | 0.90065 | 3.64 | 0.91071 | 5.55 | 0.95053 |
| Round 2 | 0.94455 | 2.05 | 0.95017 | 2.27 | 0.95739 |
| Round 3 | 0.95196 | 1.93 | 0.95462 | 1.78 | 0.95862 |
| Round 4 | - | 1.81 | - | - | - |
Table 2: Results of the submission systems on the VoxSRC 2021 development set; the last row reports the fused system on the test set.

| System | minDCF (original) | EER % (original) | minDCF (score norm) | EER % (score norm) |
|---|---|---|---|---|
| System 1 | 0.386 | 6.310 | 0.341 | 6.214 |
| System 2 | 0.375 | 6.217 | 0.336 | 6.057 |
| System 3 | 0.392 | 6.224 | 0.361 | 6.067 |
| Fusion | 0.344 | 5.683 | 0.315 | 5.578 |
| Fusion (test) | - | - | 0.341 | 5.594 |
3.2.2 Data augmentation for visual data
We sequentially apply these simple augmentations to the visual data: random cropping followed by resizing, random horizontal flipping, random color distortion, random grayscale conversion, and random Gaussian blur. Data augmentation is performed with a probability of 0.6. Afterwards, we normalize each image's pixel values to a fixed range.
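A possible torchvision realization of the listed visual augmentations is sketched below. The crop size (112x112), the per-transform probabilities, and the distortion strengths are assumptions, since the exact values are not given in the text.

```python
from torchvision import transforms

visual_augment = transforms.Compose([
    transforms.RandomResizedCrop(112),                                     # random crop + resize
    transforms.RandomHorizontalFlip(),
    transforms.RandomApply([transforms.ColorJitter(0.4, 0.4, 0.4, 0.1)], p=0.8),  # color distortion
    transforms.RandomGrayscale(p=0.2),
    transforms.RandomApply([transforms.GaussianBlur(kernel_size=9)], p=0.5),
    transforms.ToTensor(),                                                 # scales pixels to [0, 1]
])
```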
3.3 Network architecture
3.3.1 Audio encoder
We opt for a residual convolutional network (ResNet) to learn speaker representations from spectral feature sequences of varying length [4]. The ResNet's output feature maps are aggregated with a global statistics pooling layer, which calculates the mean and standard deviation for each feature map. A fully connected layer is employed afterward to extract the 128-dimensional speaker embedding.
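The global statistics pooling described above can be sketched as follows; the input layout (channels by time, after collapsing the frequency axis) is an assumption for illustration.

```python
import torch
import torch.nn as nn

class StatsPooling(nn.Module):
    """Concatenate per-channel means and standard deviations over time."""
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time) feature maps from the ResNet trunk.
        mean = x.mean(dim=-1)
        std = x.std(dim=-1)
        return torch.cat([mean, std], dim=1)   # (batch, 2 * channels)

# A linear layer then maps the pooled statistics to the 128-d embedding, e.g.
# embedding = nn.Linear(2 * channels, 128)(StatsPooling()(feature_maps))
```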
3.3.2 Visual encoder
We choose the standard ResNet-34 [9] as the visual encoder. After the pooling layer, a fully connected layer transforms the output to a 128-dimensional embedding.
3.4 Implementation details
3.4.1 Contrastive self-supervised learning setup
For feature extraction, we use 40-dimensional log Mel-spectrograms with a 25 ms Hamming window and a 10 ms frame shift. The audio duration is randomly set between 2 and 4 seconds for each training batch.
We use the same network architecture as in [2] but with half the feature map channels. ReLU activation and batch normalization are applied to each convolutional layer in the ResNet. Network parameters are updated using the Adam optimizer [13] with an initial learning rate of 0.001 and a batch size of 256. The temperature $\tau$ in equation (3) is set to 0.1.

3.4.2 Clustering setup

The number of clusters is set to 6,000 for k-means based on the 'elbow' method described in section 2.2.2. The $k$–$W_k$ curve shown in figure 2 is based on the audio representations learned with the contrastive loss. With a dataset of roughly 1,000,000 segments, we sweep the number of clusters from 1,000 to 20,000, corresponding to an average cluster size between 50 and 1,000.
3.4.3 Setup for supervised training
For the audio data, we extract 80-dimensional log Mel-spectrograms as input features. The audio duration is randomly set between 2 and 4 seconds for each training batch. The architecture of the audio encoder is the same as the one used in [2].
For both the audio and visual encoders, dropout is added before the classification layer to prevent overfitting [20]. Network parameters are updated using the stochastic gradient descent (SGD) algorithm. The learning rate is initially set to 0.1 and is divided by ten whenever the training loss reaches a plateau.
3.5 Robust training on final pseudo labels
Our final submission consists of three systems trained on the final pseudo labels.
- System 1: the network architecture is the same as in the self-labeling framework; label smoothing regularization is applied.
- System 2: trained with a Squeeze-and-Excitation network [10] and the additive angular margin (ArcFace) loss [7].
- System 3: same as System 2; single-speaker audio segment information is used to improve the final pseudo labels: the most frequent (mode) pseudo label within a single-speaker audio segment is used as the final label for that segment.
3.6 Score normalization
After scoring with cosine similarity, scores from all trials are subject to score normalization. We utilize Adaptive Symmetric Score Normalization (AS-Norm) in our systems [15]. The cohort size is 300 for all systems.
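A hedged sketch of AS-Norm scoring is given below: each trial score is normalized by the statistics of the top-N cohort scores on the enrollment and test sides. Interpreting the stated cohort size of 300 as the adaptively selected top-N is our assumption.

```python
import numpy as np

def as_norm(score: float, enroll_cohort_scores: np.ndarray,
            test_cohort_scores: np.ndarray, top_n: int = 300) -> float:
    top_e = np.sort(enroll_cohort_scores)[-top_n:]   # top-N cohort scores vs. enrollment
    top_t = np.sort(test_cohort_scores)[-top_n:]     # top-N cohort scores vs. test
    z = (score - top_e.mean()) / (top_e.std() + 1e-8)
    t = (score - top_t.mean()) / (top_t.std() + 1e-8)
    return 0.5 * (z + t)                             # symmetric average of both normalizations
```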
3.7 Experimental results
Table 1 shows the results of each clustering iteration on the VoxCeleb 1 original test set. Normalized mutual information (NMI) is used to measure clustering quality. With four rounds of training, our method achieves an EER of 1.81%.
Table 2 shows the results of our submission systems on the VoxSRC 2021 development and test sets.
References
- [1] (1999) An Empirical Comparison of Voting Classification Algorithms: Bagging, Boosting, and Variants. Machine learning 36 (1), pp. 105–139. Cited by: §2.5.
- [2] (2020) Within-Sample Variability-Invariant Loss for Robust Speaker Recognition Under Noisy Environments. In ICASSP, pp. 6469–6473. Cited by: §3.2.1, §3.4.1, §3.4.3.
- [3] (2021) An Iterative Framework for Self-Supervised Deep Speaker Representation Learning. In ICASSP, pp. 6728–6732. Cited by: §1.
- [4] (2018) Exploring the Encoding Layer and Loss Function in End-to-End Speaker and Language Recognition System. In Speaker Odyssey, pp. 74–81. Cited by: §3.3.1.
- [5] (2020) A Simple Framework for Contrastive Learning of Visual Representations. In ICML, pp. 1597–1607. Cited by: §2.1, §3.2.1.
- [6] (2018) Voxceleb2: Deep Speaker Recognition. In Interspeech, pp. 1086–1090. Cited by: §3.1.
- [7] (2019) ArcFace: Additive Angular Margin Loss for Deep Face Recognition. In CVPR, pp. 4685–4694. Cited by: §3.5.
- [8] (2020) A Framework For Contrastive Self-Supervised Learning And Designing A New Approach. arXiv:2009.00104. Cited by: §2.1.
- [9] (2016) Deep Residual Learning for Image Recognition. In CVPR, pp. 770–778. Cited by: §3.3.2.
- [10] (2019) Squeeze-and-Excitation Networks. In CVPR. Cited by: §3.5.
- [11] (2020) Augmentation Adversarial Training for Unsupervised Speaker Recognition. arXiv:2007.12085. Cited by: §3.2.1.
- [12] (2020) Semi-Supervised Contrastive Learning with Generalized Contrastive Loss and Its Application to Speaker Recognition. arXiv:2006.04326. Cited by: §3.2.1.
- [13] (2015) Adam: A Method for Stochastic Optimization. In ICLR, Cited by: §3.4.1.
- [14] (1997) Application of Majority Voting to Pattern Recognition: An Analysis of Its Behavior and Performance. IEEE Transactions on Systems, Man, and Cybernetics, Part A: Systems and Humans 27 (5), pp. 553–568. Cited by: §2.5.
- [15] (2017) Analysis of Score Normalization in Multilingual Speaker Recognition. In Interspeech, pp. 1567–1571. Cited by: §3.6.
- [16] (1957) Algorithms for the Assignment and Transportation Problems. Journal of the society for industrial and applied mathematics 5 (1), pp. 32–38. Cited by: §2.5.
- [17] (2017) Voxceleb: A Large-Scale Speaker Identification Dataset. In Interspeech, pp. 2616–2620. Cited by: §3.1.
- [18] (2017) Regularizing Neural Networks by Penalizing Confident Output Distributions. In ICLR Workshop. arXiv:1701.06548. Cited by: §2.6.
- [19] (2015) MUSAN: A Music, Speech, and Noise Corpus. arXiv:1510.08484. Cited by: §3.2.1.
- [20] (2014) Dropout: A Simple Way to Prevent Neural Networks from Overfitting. Journal of Machine Learning Research 15 (1), pp. 1929–1958. Cited by: §3.4.3.
- [21] (2016) Rethinking the Inception Architecture for Computer Vision. In CVPR, pp. 2818–2826. Cited by: §2.6.
- [22] (2001) Estimating the Number of Clusters in a Data Set via the Gap Statistic. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 63 (2), pp. 411–423. Cited by: §2.2.2.