1 Introduction

In recent years, we have developed our Speaker Diarization (SD) system [1, 2, 3]. Last year, we also tailored an off-line system for the First DIHARD Speech Diarization Challenge, where we participated in both Track 1 and Track 2 of the challenge.
The Second DIHARD Challenge brought us an opportunity to extend our system and to try combining results from different sources. Besides i-vector [6, 7] extraction, we also extract x-vectors. The main novelty in our system is an early fusion of i-vectors and x-vectors into xi-vectors, together with a modified segmentation. We also use the Kaldi toolkit (http://kaldi-asr.org/) with its diarization recipe as an additional system, both for comparison and for a domain-specific combination of results. This year, we decided to participate only in Track 1 of the challenge, where reference speech labeling is available and no speech activity detection (SAD) step is needed.
Our main advantage in the First DIHARD Challenge was the application of a Neural Network (NN) based domain classifier, which allows the system to automatically identify the domain of each recording and to set the system's configuration accordingly. The same applies to the Second DIHARD Challenge.
2 Speaker Diarization System
Our system follows x-vector- and i-vector-based approaches [9, 10, 8]. A diagram of our diarization system is shown in Figure 1. The general structure is the same as in our previous system for the First DIHARD Challenge.
This section provides a description of the main steps of the diarization process. The domain classifier and related domain-dependent settings are described in section 3.
2.1 Segmentation

First, the entire conversation is split into individual speech regions by breaking it at non-speech events; these non-speech regions are excluded from subsequent processing. As a second step, the speech regions are further segmented according to the probability of speaker change given by our Speaker Change Detector, which is based on a Convolutional Neural Network (SCD-CNN).
The SCD-CNN was trained as a regressor on spectrograms of the acoustic signal; the process is described in our previous paper for the First DIHARD Challenge. The signal between two detected speaker changes is considered to be one segment.
To ensure that each segment contains sufficient information about the speaker, we set the minimum duration of each segment to 0.5 s. Shorter segments are discarded from the clustering stage and the decision about the speaker is left for the resegmentation step (see section 2.5).
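The segmentation logic above can be sketched as follows. The frame step, the 0.5 detection threshold on the change probability, and the function name are illustrative assumptions; the 0.5 s minimum segment duration comes from the text.

```python
import numpy as np

def split_into_segments(change_probs, frame_step=0.01, threshold=0.5, min_dur=0.5):
    """Split one speech region into segments at detected speaker changes.

    `change_probs` is a hypothetical per-frame speaker-change probability
    (e.g. the SCD-CNN output). Returns (start, end) times in seconds;
    segments shorter than `min_dur` are discarded, mirroring the 0.5 s
    minimum described above (their frames are left to resegmentation).
    """
    change_frames = np.flatnonzero(change_probs >= threshold)
    boundaries = [0] + change_frames.tolist() + [len(change_probs)]
    segments = []
    for a, b in zip(boundaries[:-1], boundaries[1:]):
        start, end = a * frame_step, b * frame_step
        if end - start >= min_dur:
            segments.append((start, end))
    return segments
```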
2.2 Feature Extraction
We use the same signal processing pipeline as for the First DIHARD Challenge: Linear Frequency Cepstral Coefficients (LFCCs). As a newly added step, Cepstral Mean Normalization (CMN) is applied to compensate for channel variations.
2.3 Segment Description
Each segment is represented by the concatenation of the x-vector and i-vector for the same segment; we call this an xi-vector. The i- and x-vectors were obtained using a Kaldi recipe (https://github.com/kaldi-asr/kaldi/tree/master/egs/callhome_diarization/, v1 and v2). To exploit the ability of x/i-vectors to represent small amounts of data and to minimize the presence of more than one speaker in a segment, longer segments are split into intervals of at most 2 s, with 1 s overlaps.
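The splitting of long segments into overlapping 2 s windows can be sketched as below; the handling of the final, possibly shorter window is an assumption, while the 2 s maximum and 1 s shift come from the text.

```python
def sliding_windows(seg_start, seg_end, max_len=2.0, shift=1.0):
    """Split a segment longer than `max_len` seconds into overlapping
    windows of at most 2 s, advanced by 1 s (i.e. 1 s overlap)."""
    if seg_end - seg_start <= max_len:
        return [(seg_start, seg_end)]
    windows = []
    start = seg_start
    while start + max_len < seg_end:
        windows.append((start, start + max_len))
        start += shift
    windows.append((start, seg_end))  # final (possibly shorter) window
    return windows
```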
For generating i-vectors, we trained a UBM and a total-variability transformation matrix. A Time Delay Neural Network (TDNN) is used as the x-vector extractor, and x-vectors are extracted from the affine component of its second-to-last layer.
For whitening the xi-vectors, we subtract the mean of the development set's xi-vectors. During the diarization process, we use a conversation-dependent Principal Component Analysis (PCA), computed on the data of the current conversation, to reduce the dimension of the xi-vectors to one of three corpus-dependent values (see Tab. 1).
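A minimal numpy sketch of the whitening and conversation-dependent PCA step. The SVD-based PCA and the function name are illustrative implementation choices; the target dimension is passed in, since the per-corpus values are taken from Table 1.

```python
import numpy as np

def whiten_and_reduce(xi_vectors, dev_mean, pca_dim):
    """Whiten xi-vectors by subtracting the development-set mean, then
    project onto PCA directions estimated from the current conversation's
    own data, reducing to `pca_dim` dimensions."""
    X = xi_vectors - dev_mean                 # whitening: remove global mean
    # PCA directions from the conversation data (centered for the SVD)
    _, _, Vt = np.linalg.svd(X - X.mean(axis=0), full_matrices=False)
    return X @ Vt[:pca_dim].T
```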
The following corpora were used as training data: LibriSpeech (http://www.openslr.org/12/), VoxCeleb2, TedLium3, and the following ELRA corpora: Speecon database UK English (ELRA-S0215) and US English (ELRA-S0233). Additional data augmentation (additive noise, music, babble, and reverberation) was used on the LibriSpeech and TedLium3 corpora.
2.4 Clustering

The general clustering approach remains mostly unchanged from what we used in the First DIHARD Challenge.
As in the previous year’s challenge, the number of speakers in each recording is unknown in advance: the DIHARD corpus consists of several distinct domains where the number of speakers ranges from 1 to 10. Thus, we have chosen to primarily use the agglomerative hierarchical clustering (AHC) algorithm.
The clustering process is based on the average cosine distance between xi-vectors. We use a fixed distance threshold as the stopping condition. Additionally, we leverage our knowledge from the development set by also specifying a minimum and maximum number of clusters for each conversation, based on the number of speakers typically observed for the domain. So, we force the final number of clusters to be within this set range. The range and the clustering stopping condition – the distance threshold – were both established on a per-corpus basis using the development data (see Section 6).
In the Second DIHARD Challenge, there are three corpora in the development set with an overwhelming majority of two-speaker conversations. For these domains, we use a Probabilistic Linear Discriminant Analysis (PLDA) model to evaluate the distance between the xi-vectors. We then apply k-medoids clustering with a constant number of clusters across the whole domain. For training the PLDA model, we used the same datasets as listed in Subsection 2.3. The between-class dimension is equal to the feature dimension.
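A generic k-medoids sketch over a precomputed distance matrix (e.g. distances derived from negated PLDA scores); the PLDA scoring itself is not reproduced here, and the PAM-style medoid update is an illustrative assumption.

```python
import numpy as np

def k_medoids(D, k, n_iter=50, seed=0):
    """Minimal k-medoids clustering on a precomputed distance matrix `D`
    of shape (n, n). Returns a label in [0, k) for each item."""
    rng = np.random.default_rng(seed)
    n = D.shape[0]
    medoids = rng.choice(n, size=k, replace=False)
    for _ in range(n_iter):
        labels = np.argmin(D[:, medoids], axis=1)
        new_medoids = medoids.copy()
        for j in range(k):
            members = np.flatnonzero(labels == j)
            if len(members) == 0:
                continue
            # pick the member minimizing total distance to its cluster
            within = D[np.ix_(members, members)].sum(axis=1)
            new_medoids[j] = members[np.argmin(within)]
        if np.array_equal(new_medoids, medoids):
            break
        medoids = new_medoids
    return np.argmin(D[:, medoids], axis=1)
```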
2.5 Resegmentation

Finally, we refine the diarization results via resegmentation. The previous results are based on relatively long signal windows, so the boundaries between speaker segments are not precise. Therefore, we compute a GMM for each speaker cluster from all feature vectors assigned to that speaker. Likelihoods for all speaker GMMs are then evaluated and filtered by a Gaussian window (length 75 ms, shift 50 ms) to smooth out peaks. The number of GMM components ranges between 1 and 64, depending on the amount of the speaker's data. The entire conversation is then reclassified according to the GMM likelihoods on a frame-by-frame basis.
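The smoothing-and-reclassification part of the resegmentation can be sketched as below. The GMM training itself is omitted, and the Gaussian window length and width in frames are illustrative stand-ins for the 75 ms / 50 ms values above.

```python
import numpy as np

def resegment(loglik, win_len=7, sigma=2.0):
    """Frame-by-frame reclassification from per-speaker GMM log-likelihoods.

    `loglik` has shape (n_frames, n_speakers). Each speaker's likelihood
    curve is smoothed with a Gaussian window before taking the per-frame
    argmax, mirroring the filtering described above."""
    half = win_len // 2
    t = np.arange(-half, half + 1)
    w = np.exp(-0.5 * (t / sigma) ** 2)
    w /= w.sum()
    smoothed = np.stack(
        [np.convolve(loglik[:, s], w, mode="same") for s in range(loglik.shape[1])],
        axis=1,
    )
    return smoothed.argmax(axis=1)
```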
3 Domain Classification
The DIHARD II corpus consists of data taken from several different domains with very diverse characteristics, including the number of speakers, the level of noise, and audio quality in general. To improve the results of the diarization system, we decided to use the supervised information about each domain given by the organizers (the possible number of speakers in the domain) and to tune domain-specific settings on the development data (mainly the threshold for AHC).
We designed the domain classifier as a hierarchical two-stage classifier. The first stage distinguishes recordings with one speaker from multi-speaker data. The second stage is applied when the first stage indicates multiple speakers: it evaluates the posterior probability that the input conversation belongs to each of the 11 corpora in the DIHARD II development set.
The same NN architecture is used for both classifiers; they differ only in the last layer, where the first-stage NN uses a single neuron for the binary classification and the second-stage NN uses a softmax layer with 11 classes.
The NNs receive a single i-vector, calculated over the entire conversation, as input. A separate i-vector extractor was trained for the domain classifiers: the LFCC features are the same as in Section 2.2, but frames marked as non-speech by the SAD are also included. The UBM has 512 diagonal components, and the final i-vector dimension is 100.
The NNs have one hidden layer with 2048 neurons and a tanh activation function. Dropout with coefficient 0.9 was used during training. The network was implemented in TensorFlow; the Adam optimizer was used with 10 epochs and a batch size of 32, with the remaining hyper-parameters left at their default values.
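At inference time, the second-stage classifier described above reduces to the following forward pass (one 2048-unit tanh layer, softmax over 11 classes). The weights here are placeholders, and the training details (dropout, Adam) are deliberately not reproduced in this sketch.

```python
import numpy as np

def domain_posteriors(ivec, W1, b1, W2, b2):
    """Forward pass of the second-stage domain classifier: a 100-dim
    conversation i-vector -> 2048 tanh units -> softmax over 11 corpora.
    Expected shapes: W1 (100, 2048), W2 (2048, 11)."""
    h = np.tanh(ivec @ W1 + b1)
    logits = h @ W2 + b2
    e = np.exp(logits - logits.max())  # numerically stable softmax
    return e / e.sum()
```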
Both NNs were trained on the development data plus 10 randomly chosen recordings from the LibriSpeech corpus (cut to 10 min length).
Because the evaluation dataset contains two unknown corpora with very different characteristics, we apply a threshold to the classifier output. The positive-detection threshold was set to 0.6 for both classifiers. If there is no positive detection in the second-stage NN, the conversation is treated as an "unknown domain". The accuracy of the first-stage NN was 100% on a held-out part of the development data; the accuracy of the second-stage NN was 82%.
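The unknown-domain fallback can be expressed in a few lines; the corpus names and the function name are illustrative, while the 0.6 threshold comes from the text.

```python
def pick_domain(posteriors, corpora, threshold=0.6):
    """Apply the positive-detection threshold to the second-stage
    posteriors: return the best-matching corpus name, or "unknown"
    when no class clears the threshold."""
    best = max(range(len(posteriors)), key=lambda i: posteriors[i])
    return corpora[best] if posteriors[best] >= threshold else "unknown"
```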
3.1 Domain-specific settings
Thanks to our domain classifier, we were able to use a different system configuration for each of the 11 development set corpora and for unknown data. Here we describe the general approach selected for each domain; the specific experimentally chosen parameters are listed in Table 1.
LibriVox: All recordings contain only one speaker, so no diarization is needed and we simply use the information given by the reference SAD. Here we rely on the first stage of the domain classifier.
SEEDLingS, SCOTUS, RT-04S, SLX, VAST and YouthPoint: For these corpora, we used the AHC approach with cosine distance, as described in section 2.4.
ADOS, DCIEM and MIXER6: These corpora almost always contain exactly 2 speakers per conversation. For this reason, we could simply use k-medoids clustering into 2 clusters with PLDA scores.
CIR: For this corpus, our system gives the best results on development data for k-medoids clustering into 4 clusters with PLDA scores.
Unknown: For unrecognized evaluation data, we chose to use AHC with 2–6 target clusters.
Table 1: Domain-specific settings.

| Corpus | Clustering | No. spk | Thr AHC | PCA dim |
4 Kaldi Diarization System
The Kaldi SAD segmentation provides chunks of speech between significant non-speech events and subsequently divides these chunks into sub-segments with a constant length of 1.5 s and an overlap of 0.75 s (the minimum length of a segment is 0.5 s).
X-vectors or i-vectors are computed on the segmented data and passed to a PLDA model to compute the similarity between segments. The x/i-vectors are whitened before the PLDA estimation by subtracting the mean and transforming with an LDA matrix.
The vectors of segments are then clustered according to the AHC, with the stopping threshold set on development data. This threshold was found for the entire development set – this system does not treat different domains of the DIHARD II corpus differently.
Additionally, we create xi-vectors by concatenating the x-vector and i-vector for each segment. The whitening transformation is likewise obtained by concatenating the means and PCA transformation matrices belonging to the x- and i-vectors; this independent treatment works better than computing the whitening transformation on the concatenated xi-vectors directly.
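The concatenated whitening transform described above can be sketched as a block-diagonal construction; the function name and the row-vector convention `whitened = (v - mean) @ P.T` are assumptions of this sketch.

```python
import numpy as np

def concat_whitening(mean_x, P_x, mean_i, P_i):
    """Build the xi-vector whitening transform by concatenating the
    x-vector and i-vector means and placing their PCA matrices on a
    block diagonal, so each part is whitened independently.
    Returns (mean, P) with whitened = (v - mean) @ P.T."""
    mean = np.concatenate([mean_x, mean_i])
    dx, di = P_x.shape[0], P_i.shape[0]
    P = np.zeros((dx + di, mean_x.size + mean_i.size))
    P[:dx, : mean_x.size] = P_x       # x-vector block
    P[dx:, mean_x.size:] = P_i        # i-vector block
    return mean, P
```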
5 Late System Combination
The Kaldi diarization system does not use the information from the domain classifier, and its settings are very general. Therefore, we used the Kaldi system instead of our speaker diarization (SD) system whenever the domain classifier marked a conversation as "unknown domain".
We also used the Kaldi system for the two most problematic corpora (SEEDLingS, VAST). On average, our system slightly outperforms the Kaldi system on the development data for these corpora; however, the DERs of its individual conversations have a higher variance than those from Kaldi. We refer to this per-domain system selection as late system combination.
6 Experiments and Results

This section describes our experiments on the development set of the Second DIHARD Challenge, as well as our final results on the evaluation set. The experiments mainly served for finding the optimal system configuration for each of the individual corpora. For details of the DIHARD II corpus [16, 18], see the evaluation plan.
The system performance was evaluated in terms of Diarization Error Rate (DER), as defined by NIST. On the development set, we calculated this on a per-recording basis using NIST's md-eval.pl script (https://github.com/usnistgov/SCTK/blob/master/src/md-eval).
DER and Jaccard Error Rate (JER) on the evaluation set were given by the official scoring system.
Unlike usual practice, DIHARD Challenge submissions were scored with no forgiveness collar around speaker boundaries, and overlapping speech was included in the evaluation.
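For reference, the NIST DER used above sums missed speech, false-alarm speech, and speaker-confusion time, normalized by the total scored speech time; with no collar and overlap regions scored, all three terms contribute:

```latex
\mathrm{DER} = \frac{T_{\text{miss}} + T_{\text{FA}} + T_{\text{spk.\,conf.}}}{T_{\text{total speech}}}
```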
Table 2 presents a comparison of the contributions of i-, x-, and xi-vectors, using an earlier version of our system. Based on these preliminary results, the rest of our work used xi-vectors only.
For this challenge, we extended our previous diarization system with various enhancements; this section discusses their benefits. Table 5 compares the results of our SD system with several alterations: a system without the SCD-CNN (only Kaldi SAD segmentation), a system with de-noised test data (using speech enhancement, https://github.com/staplesinLA/denoising_DIHARD18; no de-noising was applied to the training data), and a system with reference information about overlapping speech. The latter information is used in the resegmentation step: for such parts of the data, the second most likely speaker is also detected.
Table 5: Results of the SD system and its alterations.

| Corpus | SD | without SCD | with de-noise | with ref. overlap |
Table 6 shows the segmentation coverage and purity achieved on the development set for the SCD-CNN and the Kaldi SAD segmentation. Our SCD-CNN method outperforms the Kaldi SAD segmentation in both coverage and purity, except in two cases. This result is expected, as SAD segmentation does not take speaker changes within a single speech-activity segment into account. Purity is the more important measure for the diarization task, as it tells us whether segments contain speech from only one speaker.
Based on these results, we chose to apply the SCD-CNN in our final system, and we omitted the de-noising. Unfortunately, despite the clear benefits of detecting overlapping speech, we were not able to train a real overlap detector with reasonable accuracy on the DIHARD II data.
7 Conclusion

In this paper, we presented a new version of our diarization system and its results for the Second DIHARD Diarization Challenge. Compared to our previous system, we applied xi-vectors and modified the SCD-based segmentation step to take advantage of the ability of x/i-vectors to represent short segments. Using a domain classifier, as in the previous challenge, we were able to use a different system configuration for each subset of the data. For comparison, we applied a Kaldi recipe for diarization and combined the results into a single system. Additionally, we investigated the potential gains of detecting overlapping speech, de-noising, and various segmentation methods. Our best Track 1 submission achieved a DER of 23.47% and a JER of 48.99%.
Acknowledgements

This research was supported by the Ministry of Culture of the Czech Republic, project No. DG16P02B048. Access to computing and storage facilities provided by the project CESNET LM2015042 is greatly appreciated.
References

[1] Z. Zajíc, M. Kunešová, and V. Radová, "Investigation of Segmentation in i-Vector Based Speaker Diarization of Telephone Speech," in Specom. Budapest: Springer, 2016, pp. 411–418.
[2] M. Hrúz and Z. Zajíc, "Convolutional Neural Network for Speaker Change Detection in Telephone Speaker Diarization System," in ICASSP. New Orleans: IEEE, 2017, pp. 4945–4949.
[3] Z. Zajíc, M. Hrúz, and L. Müller, "Speaker Diarization Using Convolutional Neural Network for Statistics Accumulation Refinement," in Interspeech, Stockholm, 2017, pp. 3562–3566.
[4] Z. Zajíc, M. Kunešová, J. Zelinka, and M. Hrúz, "ZCU-NTIS Speaker Diarization System for the DIHARD 2018 Challenge," in Interspeech, Hyderabad, 2018, pp. 2788–2792.
[5] N. Ryant, K. Church, C. Cieri, A. Cristia, J. Du, S. Ganapathy, and M. Liberman, "Second DIHARD Challenge Evaluation Plan," Linguistic Data Consortium, Tech. Rep., 2019.
[6] P. Kenny, P. Ouellet, N. Dehak, V. Gupta, and P. Dumouchel, "A Study of Interspeaker Variability in Speaker Verification," IEEE Transactions on Audio, Speech, and Language Processing, vol. 16, no. 5, pp. 980–988, 2008.
[7] L. Machlica and Z. Zajíc, "Factor Analysis and Nuisance Attribute Projection Revisited," in Interspeech, vol. 2, Portland, 2012, pp. 1570–1573.
[8] D. Snyder, D. Garcia-Romero, G. Sell, D. Povey, and S. Khudanpur, "X-Vectors: Robust DNN Embeddings for Speaker Recognition," in ICASSP, 2018, pp. 5329–5333.
[9] G. Sell and D. Garcia-Romero, "Speaker Diarization with PLDA I-vector Scoring and Unsupervised Calibration," in IEEE Spoken Language Technology Workshop, South Lake Tahoe, 2014, pp. 413–417.
[10] M. Senoussaoui, P. Kenny, T. Stafylakis, and P. Dumouchel, "A Study of the Cosine Distance-Based Mean Shift for Telephone Speech Diarization," Audio, Speech and Language Processing, vol. 22, no. 1, pp. 217–227, 2014.
[11] G. Sell, D. Snyder, A. McCree, D. Garcia-Romero, V. Manohar, N. Dehak, D. Povey, S. Watanabe, and S. Khudanpur, "Diarization is Hard: Some Experiences and Lessons Learned for the JHU Team in the Inaugural DIHARD Challenge," in Interspeech, Hyderabad, 2018, pp. 2808–2812.
[12] S. Shum, N. Dehak, E. Chuangsuwanich, D. Reynolds, and J. Glass, "Exploiting Intra-Conversation Variability for Speaker Diarization," in Interspeech, Florence, 2011, pp. 945–948.
[13] J. S. Chung, A. Nagrani, and A. Zisserman, "VoxCeleb2: Deep Speaker Recognition," in Interspeech, Hyderabad, 2018, pp. 1086–1090.
[14] F. Hernandez, V. Nguyen, S. Ghannay, N. Tomashenko, and Y. Estève, "TED-LIUM 3: Twice as Much Data and Corpus Repartition for Experiments on Speaker Adaptation," in Specom. Leipzig: Springer, 2018, pp. 198–208.
[15] S. Ioffe, "Probabilistic Linear Discriminant Analysis," Lecture Notes in Computer Science, vol. 3954, pp. 531–542, 2006.
[16] N. Ryant, K. Church, C. Cieri, A. Cristia, J. Du, S. Ganapathy, and M. Liberman, "DIHARD Corpus," Linguistic Data Consortium, Tech. Rep., 2019.
[17] N. Ryant, K. Church, C. Cieri, A. Cristia, J. Du, S. Ganapathy, and M. Liberman, "The Second DIHARD Diarization Challenge: Dataset, Task, and Baselines," in Interspeech, Graz, 2019.
[18] E. Bergelson, "Bergelson Seedlings HomeBank Corpus," Linguistic Data Consortium, Tech. Rep., 2018.
[19] J. G. Fiscus, N. Radde, J. S. Garofolo, A. Le, J. Ajot, and C. Laprun, "The Rich Transcription 2006 Spring Meeting Recognition Evaluation," Machine Learning for Multimodal Interaction, vol. 4299, pp. 309–322, 2006.
[20] L. Sun, J. Du, C. Jiang, X. Zhang, S. He, B. Yin, and C.-H. Lee, "Speaker Diarization with Enhancing Speech for the First DIHARD Challenge," in Interspeech, 2018, pp. 2793–2797.
[21] H. Bredin, "TristouNet: Triplet Loss for Speaker Turn Embedding," in ICASSP. New Orleans: IEEE, 2017, pp. 5430–5434.