The DKU-DukeECE-Lenovo System for the Diarization Task of the 2021 VoxCeleb Speaker Recognition Challenge

09/05/2021 ∙ by Weiqing Wang, et al. ∙ Duke University ∙ Lenovo

This report describes the submission of the DKU-DukeECE-Lenovo team to the VoxCeleb Speaker Recognition Challenge (VoxSRC) 2021 track 4. Our system includes a voice activity detection (VAD) model, a speaker embedding model, two clustering-based speaker diarization systems with different similarity measurements, two different overlapped speech detection (OSD) models, and a target-speaker voice activity detection (TS-VAD) model. Our final submission, consisting of 5 independent systems, achieves a DER of 5.07% on the challenge test set.




1 Introduction

As speaker embeddings become more and more robust, conventional diarization systems also achieve good performance, since speaker confusion has been significantly reduced. To further improve the performance and reduce the diarization error rate (DER), many studies focus on overlapped speech detection (OSD) to reduce the missed speaker error, including speech separation [microsoft], overlap detection [overlap], end-to-end neural speaker diarization (EEND) [eend] and target-speaker voice activity detection (TS-VAD) [tsvad].

We also explore several OSD models in this challenge. First, a ResNet-based model with an LSTM back-end is employed for overlap detection, where the output is 1 for overlapped speech and 0 otherwise. Second, an x-vector- and ResNet-based TS-VAD model is used to refine the output from the conventional diarization systems. Finally, we propose a 2-speaker TS-VAD model for overlap detection, where a pair of speaker embeddings is fed to the TS-VAD model, and the overlapped speech regions between these two speakers are detected. Compared with the original TS-VAD, this method is not restricted by the number of speakers.

2 Dataset Description

Since our TS-VAD model takes a long time for inference, we only use the last 46 recordings in the test dataset as our validation dataset, referred to as VAL46. The development dataset, together with the remaining recordings in the test dataset, is used as our development dataset, which contains 402 recordings and is referred to as DEV402. The datasets used for each model in this challenge are:

  • Voice activity detection (VAD): AMI [AMI], ICSI [ICSI], ISL (LDC2004S05), NIST (LDC2004S09), SPINE1&2 (LDC2000S87, LDC2000S96, LDC2001S04, LDC2001S06, LDC2001S08), AISHELL-4 [aishell4], DIHARD II [DIHARDII] and DIHARD III [DIHARDIII] form the mixed training set. DEV402 and VAL46 are used for fine-tuning and validation, respectively.

  • Speaker embedding: We use Voxceleb 1 & 2 [Voxceleb] as the training set.

  • Agglomerative hierarchical clustering (AHC): We directly tune the parameters on DEV402.

  • LSTM-based similarity measurement with spectral clustering: We use the same dataset as VAD does.

  • Overlap detection: We use DEV402 for training and VAL46 for validation.

  • 2-speaker TS-VAD & TS-VAD: These models are first trained on data simulated from LibriSpeech [Librispeech]. Then we adapt the models to the VoxConverse [Voxconverse2020] dataset using data simulated from DEV402. Finally, we fine-tune the models on DEV402 and validate them on VAL46.

  • Data augmentation: We perform data augmentation with the MUSAN [MUSAN] and RIRs [RIRs] corpora.

Figure 1: The data simulation strategy for TS-VAD.

3 Detailed Model Configuration

The model architectures of VAD, overlap detection, and TS-VAD are very similar, including a ResNet34 [resnet], a statistical pooling layer, two BiLSTM [bilstm] layers and two fully-connected layers with a sigmoid function. For ResNet34, the channel width is {32, 64, 128, 256} with a kernel size of 3. Each BiLSTM layer contains 256 units per direction with a dropout rate of 0.1. The two fully-connected layers contain 128 and 1 unit, respectively.

3.1 VAD

We use a ResNet34 as the front-end model to extract the frame-level feature map. Next, a statistical pooling layer is employed on the feature map every S frames. Finally, two BiLSTM layers and two fully-connected layers with a sigmoid function produce the posterior probability of speech. In our experiments, the input is 8s chunked wav, and the acoustic feature is 80-dim log Mel-filterbank energies with a frame length of 25ms and a frameshift of 10ms.

The model is first trained on the mixed training set for 100 epochs with a learning rate of 0.0001 and then fine-tuned on DEV402 for 100 epochs with a learning rate of 0.00001, using binary cross-entropy (BCE) loss and the Adam optimizer. We train four models with different S (S = 1, 2, 4, 8) to obtain VAD outputs with different resolutions. Finally, we average the outputs from these four models at the frame level. The threshold for the speech decision is set to 0.6. Table 1 shows the false alarm (FA), miss detection (MISS) and accuracy on VAL46.

Model     FA [%]   MISS [%]   Accuracy [%]
S=1       2.57     1.56       96.33
S=2       2.97     1.30       96.17
S=4       3.22     1.32       95.93
S=8       3.40     1.59       95.54
Fusion    2.37     1.66       96.43
Table 1: The false alarm (FA), miss detection (MISS) and accuracy of the VAD model on VAL46
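
The frame-level fusion of the multi-resolution VAD outputs can be sketched as follows. This is a minimal illustration, not our training or inference code: it assumes that a model with pooling stride S emits one posterior per S frames and that its output is brought back to frame resolution by simple repetition (the upsampling method is an assumption), and the function names `upsample` and `fuse_vad` are hypothetical.

```python
def upsample(posteriors, stride, num_frames):
    """Repeat each pooled posterior `stride` times to reach frame resolution."""
    frames = [p for p in posteriors for _ in range(stride)]
    # Trim, then pad with the last value, so every model covers num_frames frames.
    frames = frames[:num_frames]
    frames += [frames[-1]] * (num_frames - len(frames))
    return frames


def fuse_vad(outputs_by_stride, num_frames, threshold=0.6):
    """Average frame-level posteriors across models, then apply the threshold."""
    upsampled = [upsample(p, s, num_frames) for s, p in outputs_by_stride.items()]
    fused = [sum(column) / len(column) for column in zip(*upsampled)]
    # Assumption: a frame counts as speech when the fused posterior exceeds the threshold.
    return [1 if p > threshold else 0 for p in fused]


# Toy posteriors for 8 frames from four hypothetical models (S = 1, 2, 4, 8).
outputs = {
    1: [0.9, 0.8, 0.7, 0.9, 0.2, 0.1, 0.3, 0.2],
    2: [0.9, 0.8, 0.3, 0.1],
    4: [0.8, 0.2],
    8: [0.5],
}
decisions = fuse_vad(outputs, num_frames=8)
print(decisions)
```

In this toy example the coarse S=8 model is uninformative on its own, but averaging it with the finer models still yields a clean speech/non-speech boundary.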

3.2 Speaker Embedding

The ResNet34 structure is employed as the front-end pattern extractor, which learns a frame-level representation from the input acoustic feature. A global statistics pooling (GSP) layer projects the variable-length input to a fixed-length vector. Next, a 128-dim fully-connected layer is adopted as the speaker embedding layer. ArcFace [arcface] (s=32, m=0.2) is used as the classifier. The detailed configuration of the neural network is the same as in [ffsvc20]. Table 2 shows the EER on the Voxceleb1 test set. The input is 24s chunked wav, and the acoustic feature is 80-dim log Mel-filterbank energies with a frame length of 25ms and a frameshift of 10ms.

Training data    EER [%]
Voxceleb 2       1.23
Table 2: The EER of the speaker embedding model
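
For readers unfamiliar with ArcFace, the sketch below shows how the scale s=32 and the additive angular margin m=0.2 enter the classifier logits. It is an illustration under simplifying assumptions, not our training code: the embedding and class weights are toy 2-dim vectors, the function name `arcface_logits` is hypothetical, and the usual safeguard for angles where θ + m exceeds π is omitted.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def arcface_logits(embedding, class_weights, target, s=32.0, m=0.2):
    """Return s*cos(theta_j) for non-target classes and s*cos(theta_y + m) for the target."""
    e = l2_normalize(embedding)
    logits = []
    for j, w in enumerate(class_weights):
        cos_theta = sum(a * b for a, b in zip(e, l2_normalize(w)))
        cos_theta = max(-1.0, min(1.0, cos_theta))  # guard against rounding error
        if j == target:
            # The margin is added to the angle of the target class only.
            cos_theta = math.cos(math.acos(cos_theta) + m)
        logits.append(s * cos_theta)
    return logits


# Toy example: the embedding is perfectly aligned with class 0.
logits = arcface_logits([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]], target=0)
print(logits)
```

The logits are then passed to a standard softmax cross-entropy loss. The margin lowers the target logit even for a perfectly aligned embedding, which pushes the network toward more compact speaker classes.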

3.3 Clustering-based System

3.3.1 AHC

We use an AHC method similar to [microsoft], without speech separation. First, speaker embeddings are extracted from uniformly segmented speech with a segment length of 1.28s and a shift of 0.32s, and two consecutive segments are merged into a longer segment if their similarity is greater than a segment threshold. The pairwise similarity is measured by cosine similarity. Next, we perform plain AHC on the similarity matrix with a relatively high stop threshold to obtain clusters with high confidence. These clusters are split into “long clusters” and “short clusters” by the total duration of each cluster, and the central embedding of each cluster is the mean of all speaker embeddings in the cluster. Each short cluster is then assigned to the closest long cluster according to the cosine similarity between their central embeddings. Finally, if a short cluster is too different from all long clusters, meaning that its similarity to every long cluster is lower than a speaker threshold, we treat it as a new speaker.

All of these parameters are tuned directly on DEV402 by grid search. In our experiments, the segment threshold is 0.54, the stop threshold is 0.62, the duration for classifying long and short clusters is 6s, and the speaker threshold is 0.2.
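
The last two steps, assigning short clusters to long clusters or promoting them to new speakers, can be sketched as follows. This is a simplified illustration rather than our implementation: the input format and the function name `reassign_short_clusters` are hypothetical, a cluster is assumed to be "long" when its duration is at least 6s, and short clusters that become new speakers are not merged with each other.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def mean_embedding(embeddings):
    """Central embedding: the mean of all embeddings in a cluster."""
    return [sum(column) / len(embeddings) for column in zip(*embeddings)]


def reassign_short_clusters(clusters, min_long_dur=6.0, speaker_thr=0.2):
    """Return one speaker label per cluster after handling the short clusters."""
    centers = [mean_embedding(c["embeddings"]) for c in clusters]
    long_ids = [i for i, c in enumerate(clusters) if c["duration"] >= min_long_dur]
    labels = {i: k for k, i in enumerate(long_ids)}
    next_label = len(long_ids)
    for i in range(len(clusters)):
        if i in labels:
            continue
        sims = [(cosine(centers[i], centers[j]), j) for j in long_ids]
        best_sim, best_j = max(sims) if sims else (-1.0, None)
        if best_j is not None and best_sim >= speaker_thr:
            labels[i] = labels[best_j]   # merge into the closest long cluster
        else:
            labels[i] = next_label       # too different: treat as a new speaker
            next_label += 1
    return [labels[i] for i in range(len(clusters))]


# Toy clusters with 2-dim embeddings and durations in seconds.
clusters = [
    {"embeddings": [[1.0, 0.0], [0.9, 0.1]], "duration": 20.0},
    {"embeddings": [[0.0, 1.0]], "duration": 10.0},
    {"embeddings": [[0.8, 0.2]], "duration": 2.0},
    {"embeddings": [[-1.0, -1.0]], "duration": 1.0},
]
speaker_labels = reassign_short_clusters(clusters)
print(speaker_labels)
```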

3.3.2 LSTM-based Similarity Measurement with Spectral Clustering

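We use the same LSTM-based model as [LinLSTM]. The model consists of two BiLSTM layers and two fully-connected layers with a sigmoid function. Speaker embeddings are also extracted from uniformly segmented speech, with a segment length of 1.28s and a shift of 0.64s. Next, the speaker embedding sequence $[\mathbf{x}_1, \dots, \mathbf{x}_n]$ is concatenated with the repeated embedding $\mathbf{x}_i$ to form the input, and the model produces the i-th row of the affinity matrix $\mathbf{S}$:

$$\mathbf{S}_i = [S_{i,1}, S_{i,2}, \dots, S_{i,n}] = f_{\mathrm{LSTM}}\left([\mathbf{x}_i; \mathbf{x}_1], [\mathbf{x}_i; \mathbf{x}_2], \dots, [\mathbf{x}_i; \mathbf{x}_n]\right)$$

where $f_{\mathrm{LSTM}}$ is the LSTM-based neural network and n is set to 64 in our experiments. More details can be found in [DIHARDII-LSTM].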

The model is trained on the mixed training set for 100 epochs and fine-tuned on DEV402 for 100 epochs. The model is optimized with BCE loss and the Adam optimizer with a learning rate of 0.001. After obtaining the affinity matrix $\mathbf{S}$, we employ spectral clustering (SC) to get the final diarization results.
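
The way each row of the affinity matrix is produced can be sketched as below. The sketch only illustrates the input construction: `score_fn` stands in for the trained LSTM-based network, and the stand-in `cosine_scorer` used here (cosine similarity rescaled to [0, 1]) is purely an assumption for demonstration, as are all function names.

```python
import math


def build_row_input(embeddings, i):
    """Sequence for row i: embedding i concatenated with every embedding in turn."""
    return [embeddings[i] + x_j for x_j in embeddings]


def affinity_matrix(embeddings, score_fn):
    """Stack the rows produced by score_fn into an n-by-n affinity matrix."""
    return [score_fn(build_row_input(embeddings, i)) for i in range(len(embeddings))]


def cosine_scorer(pair_sequence):
    """Stand-in for the network: rescaled cosine similarity of the two halves."""
    scores = []
    for pair in pair_sequence:
        half = len(pair) // 2
        a, b = pair[:half], pair[half:]
        dot = sum(x * y for x, y in zip(a, b))
        norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
        scores.append((dot / norm + 1.0) / 2.0)
    return scores


# Toy sequence of three 2-dim embeddings; segments 0 and 2 share a speaker.
toy_embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
S = affinity_matrix(toy_embeddings, cosine_scorer)
print(S[0])
```

Unlike a fixed cosine scorer, the trained network sees the whole sequence of pairs at once, so each score can depend on its temporal context.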

3.4 Overlap Detection

The overlap detection model and training process are the same as those of the VAD model except for the training data, labels, and resolutions. We train the overlap detection model on DEV402, and we average the outputs from only two models with different S, since the overlapped regions are much shorter than the speech regions in the VAD task. The label is 1 for overlapped speech and 0 otherwise. After an overlapped region is detected, we replace the label of this region with the two speakers closest to it. The threshold for the overlap decision is set to 0.8. The input is 8s chunked wav, and the acoustic feature is 80-dim log Mel-filterbank energies with a frame length of 25ms and a frameshift of 10ms.
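
The relabeling of detected overlapped regions can be sketched as follows. This is a simplified illustration: it works frame by frame rather than on whole regions, measures closeness as distance in frames, and uses hypothetical function names and a toy single-speaker labeling.

```python
def two_closest_speakers(frame_labels, t):
    """Return up to two distinct speakers nearest in time to frame t."""
    found = []
    n = len(frame_labels)
    for offset in range(n):
        for idx in (t - offset, t + offset):
            if 0 <= idx < n:
                speaker = frame_labels[idx]
                if speaker is not None and speaker not in found:
                    found.append(speaker)
                    if len(found) == 2:
                        return found
    return found


def apply_overlap(frame_labels, overlap_mask):
    """Replace the label of each overlapped frame with its two closest speakers."""
    result = []
    for t, (speaker, is_overlap) in enumerate(zip(frame_labels, overlap_mask)):
        if is_overlap:
            result.append(set(two_closest_speakers(frame_labels, t)))
        else:
            result.append({speaker} if speaker is not None else set())
    return result


# Toy clustering output (one speaker per frame, None for non-speech)
# and a toy overlap decision covering the speaker change.
frame_labels = ["A", "A", "A", "B", "B", None]
overlap_mask = [0, 0, 1, 1, 0, 0]
relabeled = apply_overlap(frame_labels, overlap_mask)
print(relabeled)
```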

ID  Model                          Submission  VAL46 DER [%]  VAL46 JER [%]  Test DER [%]  Test JER [%]
-   Baseline                       -           -              -              17.99         38.72
-   AHC + Oracle VAD               -           2.61           21.93          -             -
-   LSTM-SC + Oracle VAD           -           3.16           28.04          -             -
1   AHC                            1           4.42           26.42          6.77          27.66
2   LSTM-SC                        -           4.97           31.04          -             -
3   AHC + OD                       -           3.98           26.08          -             -
4   LSTM-SC + OD                   -           4.58           30.70          -             -
5   AHC + 2-spk TS-VAD as OD       4           3.96           25.82          5.45          27.55
6   LSTM-SC + 2-spk TS-VAD as OD   -           4.56           30.51          -             -
7   System 3 + TS-VAD              -           4.53           28.39          -             -
8   System 5 + TS-VAD              -           4.51           28.33          -             -
-   Fusion (3+4+7)                 2           3.93           25.68          5.36          29.10
-   Fusion (3+4+5+7)               3           4.02           27.11          5.82          29.78
-   Fusion (3+4+5+6+8)             5           4.10           30.97          5.07          29.16
Table 3: The performance of different speaker diarization systems on VAL46 and the VoxSRC21 test set. Since the VAD model used in the 1st submission was not well trained, the improvement in DER of the other systems over it on the test set may come from both overlap detection (OD) and VAD.

3.5 TS-VAD

3.5.1 Data Simulation

We simulate two 900-hour pre-training datasets. One is simulated from LibriSpeech, and the other is simulated from DEV402. To obtain a simulated dataset that is similar to the VoxConverse dataset, we first extract the labels from DEV402, which form a matrix of size $T \times N$, where $T$ is the number of frames and $N$ is the number of speakers. Then, for the label of each speaker, we fill the active regions with a single speaker’s speech from LibriSpeech or DEV402. Finally, we sum the speech of all speakers to form one simulated recording. Note that the non-speech regions in the labels are removed first. Figure 1 shows an example of the data simulation process.
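
The simulation procedure can be sketched as follows. This is a toy illustration, not our data pipeline: for brevity one label row corresponds to one audio sample (real labels are frame-level and would be expanded to samples), source speech is looped if it is too short, and the function names are hypothetical.

```python
def remove_non_speech(labels):
    """Drop label rows in which no speaker is active."""
    return [row for row in labels if any(row)]


def simulate_recording(labels, speaker_audio):
    """Fill each speaker's active regions with source speech and sum the tracks.

    labels: T rows of N binary activity flags.
    speaker_audio: N lists of samples, one source speaker per label column.
    """
    num_steps = len(labels)
    num_speakers = len(labels[0])
    mixture = [0.0] * num_steps
    for n in range(num_speakers):
        source = speaker_audio[n]
        cursor = 0  # position in this speaker's source speech
        for t in range(num_steps):
            if labels[t][n]:
                mixture[t] += source[cursor % len(source)]
                cursor += 1
    return mixture


# Toy label matrix (T=4, N=2) with one silent row, and toy source "speech".
toy_labels = [[1, 0], [1, 1], [0, 0], [0, 1]]
toy_audio = [[0.25, 0.5], [1.0, 2.0]]
speech_only = remove_non_speech(toy_labels)
mixture = simulate_recording(speech_only, toy_audio)
print(mixture)
```

Because the turn-taking pattern is copied from real labels, the simulated recording keeps the overlap statistics of the target domain while the speech content comes from another source.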

3.5.2 Training details

TS-VAD has achieved excellent performance in the CHiME-6 [tsvad] and DIHARD III [dihard3tsvad] challenges. Unlike the previous methods, which use i-vectors, we employ a ResNet-based x-vector as the target-speaker embedding.

The TS-VAD model is also similar to the VAD model, except that the feature maps produced by the ResNet need to be concatenated with a target-speaker embedding. The concatenated features are then fed to the BiLSTM layers and fully-connected layers. Since TS-VAD training needs large computing resources, we only train a model with a single S for statistical pooling. The number of target-speaker embeddings is set to 8. The parameters of the front-end ResNet34 are initialized from our speaker embedding model.

The model is first trained on the simulated LibriSpeech data for 10 epochs with the front-end ResNet34 frozen, and then it is trained for another 10 epochs with all parameters updated. Next, we transfer this model to VoxConverse data by training on the simulated DEV402 data for 10 epochs. Finally, we fine-tune the model on DEV402 for 200 epochs and validate it on VAL46. The learning rate is 0.0001 when training on simulated data and 0.00001 during the fine-tuning stage. The model is optimized with BCE loss and the Adam optimizer. The input is 16s chunked wav, and the acoustic feature is 80-dim log Mel-filterbank energies with a frame length of 25ms and a frameshift of 10ms.

3.5.3 Inference

For inference, the non-speech regions are first removed by VAD, and the wavs are split into 16s chunks. Next, speaker embeddings are extracted given the results from a clustering-based method. We only consider speakers with 16s or more of speech. If the number of speakers is less than 8, we pad with zero vectors as dummy embeddings. If it is greater than 8, we discard the embeddings of the speakers with the shortest speech, but their labels are kept in the final results. The threshold for the speaker decision is set to 0.8.
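
The selection and padding of target-speaker embeddings can be sketched as follows. This is a minimal illustration with hypothetical names and a hypothetical input format (a mapping from speaker name to total speech duration and embedding). It is not our inference code.

```python
def select_target_embeddings(speakers, max_targets=8, min_duration=16.0, dim=128):
    """Pick up to max_targets speakers with enough speech and pad with zero vectors.

    speakers: dict mapping speaker name -> (duration in seconds, embedding).
    Returns the kept speaker names and exactly max_targets embeddings.
    """
    eligible = [(dur, name) for name, (dur, _) in speakers.items() if dur >= min_duration]
    eligible.sort(reverse=True)  # longest speech first
    kept = [name for _, name in eligible[:max_targets]]
    embeddings = [speakers[name][1] for name in kept]
    # Zero vectors fill the unused target-speaker slots.
    embeddings += [[0.0] * dim for _ in range(max_targets - len(kept))]
    return kept, embeddings


# Toy clustering result with 4-dim embeddings; speaker C has too little speech.
toy_speakers = {
    "A": (120.0, [0.1, 0.2, 0.3, 0.4]),
    "B": (30.0, [0.5, 0.6, 0.7, 0.8]),
    "C": (5.0, [0.9, 1.0, 1.1, 1.2]),
}
kept, targets = select_target_embeddings(toy_speakers, dim=4)
print(kept, len(targets))
```

Speakers that are not selected, such as C above, keep the labels assigned by the clustering-based system in the final output.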

3.6 2-speaker TS-VAD for Overlap Detection

The training data, model configuration, and training process are the same as for the TS-VAD in Sec. 3.5, except that the number of target speakers is 2. For each recording, we select at most 5 speakers with the longest speech for inference, which provides at most $\binom{5}{2} = 10$ pairs of target-speaker embeddings. After obtaining the speaker decision for each pair of speakers with a threshold of 0.8, we update the results with the detected overlapped speech regions. This 2-speaker TS-VAD method can be applied to any data regardless of the number of speakers. In this challenge, we only modify the overlapped speech regions, but the single-speaker regions could also be iteratively refined according to the output of this 2-speaker TS-VAD model. We leave these experiments to future work.
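
The pair enumeration and the overlap decision can be sketched as follows. The sketch assumes that a frame is marked as overlapped when the posteriors of both target speakers exceed the threshold. The function names and toy values are hypothetical.

```python
from itertools import combinations


def target_pairs(durations, max_speakers=5):
    """Pair up the max_speakers speakers with the longest speech."""
    top = sorted(durations, key=durations.get, reverse=True)[:max_speakers]
    return list(combinations(top, 2))


def overlap_frames(post_a, post_b, threshold=0.8):
    """Mark frames where both target speakers are judged active."""
    return [a > threshold and b > threshold for a, b in zip(post_a, post_b)]


# Toy recording with six speakers (total speech in seconds); F is dropped.
durations = {"A": 300.0, "B": 200.0, "C": 150.0, "D": 90.0, "E": 40.0, "F": 10.0}
pairs = target_pairs(durations)
print(len(pairs))

# Toy posteriors from the 2-speaker model for one pair over four frames.
mask = overlap_frames([0.9, 0.95, 0.3, 0.85], [0.1, 0.9, 0.9, 0.81])
print(mask)
```

Since the number of pairs is bounded by the cap on selected speakers, the inference cost stays fixed no matter how many speakers the clustering stage finds.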

3.7 Data Augmentation

We perform online data augmentation [cai2020fly] with the MUSAN and RIRs corpora. For background additive noise, we use ambient noise, music, television, and babble noise. For reverberation, we perform convolution with 40,000 simulated room impulse responses from small and medium rooms. Data augmentation is applied to all models that take acoustic features as input.

3.8 System Fusion

To further improve the performance and robustness, we fuse our systems by DOVER-Lap [doverlap].

4 Experimental Results

Table 3 shows the results on both VAL46 and the challenge test set. For the clustering-based system, the AHC method achieves a DER of 4.42% on VAL46 and 6.77% on the test set, which is our first submission. We employed our best VAD model for all systems on VAL46, but our first submission used a poorly trained VAD model, so the comparison with it may not correctly reveal the improvements brought by OSD.

For the 2nd submission, we fused systems 3, 4, and 7 using DOVER-Lap with custom weights tuned on VAL46, and we obtained a DER of 3.93% on VAL46 and 5.36% on the test set.

For the 3rd submission, we fused systems 3, 4, 5, and 7 using DOVER-Lap with custom weights tuned on VAL46. However, the performance on the test set degrades: the DER rises from 5.36% to 5.82%. We did not know whether it was system 5 that reduced the performance, so we directly submitted system 5, and it shows a DER of 5.45%. Therefore, the reason may be that we tuned the weights too aggressively, so that the fused system over-fitted to VAL46.

Finally, we fused systems 3, 4, 5, 6, and 8 with rank-based weighting and achieved a DER of 5.07% on the test set, which ranked 1st at VoxSRC 2021.

5 Conclusions

In this report, we describe our system for VoxSRC 2021. To achieve better performance, we mainly focus on overlapped speech detection. We employ overlap detection and TS-VAD to reduce the missed speaker error. In addition, we propose a 2-speaker TS-VAD framework to detect the overlapped speech between each pair of speakers. Our experiments show that detecting the overlapped speech regions can significantly improve performance.