The Speed Submission to DIHARD II: Contributions and Lessons Learned

11/06/2019 · Md Sahidullah, et al.

This paper describes the speaker diarization systems developed for the Second DIHARD Speech Diarization Challenge (DIHARD II) by the Speed team. Besides describing the system, which considerably outperformed the challenge baselines, we also focus on the lessons learned from numerous approaches that we tried for single and multi-channel systems. We present several components of our diarization system, including categorization of domains, speech enhancement, speech activity detection, speaker embeddings, clustering methods, resegmentation, and system fusion. We analyze and discuss the effect of each such component on the overall diarization performance within the realistic settings of the challenge.




I Introduction

The DIHARD II diarization challenge [26] focused on “hard” diarization by providing datasets that are challenging to current state-of-the-art speaker diarization (SD) systems. The intention of DIHARD II is twofold: (i) to support SD research through the creation and distribution of novel data sets, and (ii) to measure and calibrate the performance of systems on these data sets. The challenge consists of four tracks: two for single-channel data and two for multi-channel data. The development and evaluation sets were drawn from different databases, including audiobooks, meeting speech, child language acquisition recordings, dinner parties, and samples of web video. The organizers allowed the use of a large set of existing speech corpora for training.

In this paper, we describe the efforts of the multi-national team Speed in the DIHARD II challenge. We focus on the approaches tried and lessons learned, and present a system with considerably improved performance compared to the baseline provided by the challenge organizers. The main contributions of the Speed team can be summarized as follows:

  • Automatic grouping of pseudo-domains for class-dependent SD was investigated.

  • Different speech activity detection (SAD) methods were assessed.

  • An in-house SD system was developed, outperforming the baseline provided by the organizers.

  • Resegmentation methods were considered and approaches to their combination were proposed.

  • For multichannel data, the suitability of different front-end processing and clustering methods is considered in an attempt to improve the SD performance.

II DIHARD II challenge

The DIHARD II speaker diarization challenge [26] evaluates the task of determining “who spoke when” in a multi-speaker environment based only on audio recordings. As in DIHARD I [29], development and evaluation sets were provided, but participants were free to train their systems on any proprietary or public data. DIHARD II extends the inaugural DIHARD I challenge by adding tracks on multi-channel recordings (from CHiME-5 [4]), by using more refined annotations for the evaluation set, which made it more challenging for systems that over-fit the development data, and by providing a baseline system (the best performing system from DIHARD I [29]) to the participants. DIHARD II has four tracks: two of them (Track 1 and Track 2) are dedicated to single-channel speech, and the other two (Track 3 and Track 4) to multi-channel speech. Within each condition, the two tracks differ in the use of speech activity detection (SAD): Track 1 and Track 3 use ground-truth SAD labels provided by the organizers, whereas for the other two tracks the SAD labels need to be generated by the participants.

DIHARD II uses two evaluation metrics. In addition to the previously used diarization error rate (DER) metric, a new metric called the Jaccard error rate (JER) is introduced. Details about the DIHARD datasets and evaluation methodology are available in [26, 25].
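As a rough illustration of the DER metric, the following is a minimal frame-level sketch. The function name is hypothetical, and two simplifying assumptions are made: hypothesis speakers are mapped to reference speakers greedily (the official scorer uses an optimal mapping) and no forgiveness collar is applied.

```python
import numpy as np

def frame_der(ref, hyp):
    """Frame-level diarization error rate for single-speaker-per-frame labels.

    ref, hyp: integer arrays of per-frame speaker labels, 0 = non-speech.
    Hypothesis speakers are mapped to reference speakers greedily by overlap
    (a simplification of the optimal mapping used by the official scorer)."""
    ref, hyp = np.asarray(ref), np.asarray(hyp)
    # Greedy speaker mapping: each hyp label goes to its best-overlapping ref label.
    mapping = {}
    for h in np.unique(hyp[hyp > 0]):
        overlaps = {r: np.sum((hyp == h) & (ref == r)) for r in np.unique(ref[ref > 0])}
        if overlaps:
            mapping[h] = max(overlaps, key=overlaps.get)
    hyp_mapped = np.array([mapping.get(h, 0) if h > 0 else 0 for h in hyp])

    speech_ref = ref > 0
    speech_hyp = hyp_mapped > 0
    miss = np.sum(speech_ref & ~speech_hyp)            # reference speech not detected
    fa = np.sum(~speech_ref & speech_hyp)              # false alarm speech
    conf = np.sum(speech_ref & speech_hyp & (ref != hyp_mapped))  # wrong speaker
    return (miss + fa + conf) / max(np.sum(speech_ref), 1)
```

Missed speech, false alarms, and speaker confusion are thus the three error components that later sections refer to when analyzing system behavior.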

Fig. 1: Modules of Speed speaker diarization system.

III Diarization system

The speaker diarization system consists of several key modules as shown in Figure 1. In this challenge, we focus on enhancing several of them to improve the SD performance.

Group  Domains                                      # dev  # eval
1      audiobooks, broadcast interview,              59     54
       court room, & maptask
2      child language                                23     38
3      clinical, sociolinguistic (field),            52     61
       & sociolinguistic (lab)
4      meeting, restaurant, & web video              58     41
Total                                               192    194
TABLE I: Summary of the domain grouping showing the number of audio files predicted for each group in the evaluation set.

III-A Grouping domains

The speech data for the single-channel tracks (Track 1 & 2) consists of speech files from 11 different domains, such as audiobooks, web videos, broadcast interviews, and meetings. These domains differ in terms of speech quality, number of speakers, recording environment, amount of overlapped speech, etc. Studies on speaker diarization with diverse domains indicate that domain-dependent processing helps to improve the overall performance [8]. In particular, it is important for selecting domain-dependent thresholds and for adaptation of the automatic speaker verification (ASV) back-end. In order to exploit the advantages of domain-dependent processing, we first attempted to categorize the audio files according to the provided domain labels. We developed an i-vector-based domain classification method, but it shows limited classification accuracy even on the development set: 86.98% when tested with leave-one-out cross validation. The accuracy is limited not only because this is a challenging task but also because some of the domains are similar to each other in terms of speech quality. Considering that misclassification of domains on unseen evaluation data can significantly degrade the diarization performance, we grouped the domains into a reduced number of classes. We use the confusion matrix of the primary domain classification results, the SD performance on individual domains in the development set, as well as the metadata of the domains for this grouping task. The groups are summarized in Table I. This domain grouping also helps to increase the amount of audio data for training class-dependent speech enhancement and SAD methods, which are described in the following two subsections.
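The grouping logic can be sketched as follows. Both function names are hypothetical: the cosine-scoring nearest-centroid classifier is only a stand-in for the actual i-vector back-end, and the union-find merge over the confusion matrix approximates the manual grouping the team performed (which also used per-domain DER and metadata).

```python
import numpy as np

def classify_domain(ivector, domain_means):
    """Nearest-centroid domain classification by cosine similarity.

    ivector: (d,) embedding of a recording.
    domain_means: dict mapping domain name -> (d,) mean i-vector."""
    def cos(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-10))
    return max(domain_means, key=lambda dom: cos(ivector, domain_means[dom]))

def group_confusable(confusion, labels, threshold=0.2):
    """Merge domains whose mutual confusion rate exceeds a threshold.

    confusion: (n, n) row-normalized confusion matrix; labels: domain names.
    Returns a list of groups, built by union-find over confusable pairs."""
    n = len(labels)
    parent = list(range(n))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i
    for i in range(n):
        for j in range(i + 1, n):
            if confusion[i, j] + confusion[j, i] >= threshold:
                parent[find(i)] = find(j)
    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(labels[i])
    return sorted(groups.values())
```

The threshold value is illustrative; in practice the merge decisions were made by inspecting the confusion matrix rather than by a fixed cut-off.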

III-B Speech enhancement

On the front-end side, both for single-channel and multi-channel tracks, we employed a deep neural network (DNN) based speech enhancement algorithm built on the deep feature loss paradigm [11]. The speech enhancement network is fully convolutional; it takes the noisy raw waveform as input and yields the enhanced speech as output. We used a ResNet-inspired architecture [13] with squeeze-and-excitation blocks [15]. In each residual block, the dilated convolutional layer is followed by leaky rectified linear unit (ReLU) activations. To allow the network to learn good inter-channel dependencies, a squeeze-and-excitation self-attention block with leaky ReLU activation and a dense layer with 16 neurons are used next. Skip connections have been implemented in our architecture, differently from [11]. Finally, we employed the same VGG-19 inspired network as in [11] to compute the loss. The Adam optimizer [18] and weight normalization [27] have been used for training.

Since no clean speech references are available in the DIHARD II development dataset, we used synthetic datasets to train our network. We built four different synthetic datasets, one for each of the groups of domains introduced in Table I. Each synthetic dataset was built to be as similar as possible to the corresponding group. We used clean speech utterances from Librispeech [22], ParlamentParla [20], the ST Chinese Mandarin Corpus [34], and Freesound [9]. Other sources, such as cry sounds, have been taken from Freesound and YouTube [37]. Noises have been extracted from the DIHARD development set by using the reference SAD, and used in conjunction with other noise sources such as the MUSAN dataset [31] and Freesound. We generated the synthetic acoustic scenes using Pyroomacoustics [28].

The speech enhancement algorithm has been validated by using the DIHARD II speech enhancement baseline as a term of comparison and the synthetic datasets above as evaluation data. For each domain, separate splits for training, validation, and testing have been used. The signal-to-noise ratio (SNR) and the perceptual evaluation of speech quality (PESQ) were used as evaluation indexes. The results reported in Table II confirm the effectiveness of the proposed approach. We use this enhanced speech for training one of our SAD systems.

System Group 1 Group 2 Group 3 Group 4
DIHARD baseline 2.92/8.44 2.70/5.49 2.69/5.14 2.73/4.97
Proposed algorithm 3.12/9.02 2.80/5.94 2.80/5.95 2.79/5.84
TABLE II: Performance comparison (in terms of PESQ/output SNR [dB]) of proposed and baseline speech enhancement algorithms on the single channel synthetic datasets for the four groups.

III-C Speech activity detection

Speech activity detection (SAD) is modelled as a supervised sequence labeling problem. Let $\mathbf{x} = (x_1, \ldots, x_T)$ be a sequence of feature vectors extracted from an audio recording (e.g., mel-frequency cepstral coefficients or MFCCs), where $T$ is the length of the sequence. Let $\mathbf{y} = (y_1, \ldots, y_T)$ be the corresponding sequence of labels, with $y_t \in \{1, \ldots, K\}$ where $K$ is the number of classes. In the case of speech activity detection, $K = 2$: one class for speech, one for non-speech.

The objective is to find a function that maps a feature sequence x to the corresponding label sequence y. We propose to model this function using a stacked long short-term memory (LSTM) neural architecture trained with a cross-entropy loss. Short fixed-length sub-sequences (a few seconds) of the otherwise longer, variable-length audio files are fed into the model. This allows us to increase the number of training samples and augment their variability.

At test time, audio files are processed using overlapping sliding windows of the same length as used in training. For each time step, this results in several overlapping K-dimensional (softmax-ed) score vectors, which are averaged to obtain the final score of each class. The sequence of speech scores is then post-processed using two thresholds for the detection of the beginning and end of speech regions [10].

In practice, half of the DIHARD II development set was used for training, while hyper-parameters were tuned on the other half. We use the open-source implementation of this SAD approach provided with the pyannote-audio toolkit [24].
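The test-time procedure described above can be sketched as follows: score aggregation over overlapping windows, followed by two-threshold (onset/offset) detection. Function names and threshold values are illustrative, not those of pyannote-audio.

```python
import numpy as np

def aggregate_scores(window_scores, hop, total_len):
    """Average overlapping per-window frame scores into one score per frame.

    window_scores: list of (win_len,) speech-probability arrays, where
    window i starts at frame i * hop."""
    acc = np.zeros(total_len)
    cnt = np.zeros(total_len)
    for i, s in enumerate(window_scores):
        start = i * hop
        acc[start:start + len(s)] += s
        cnt[start:start + len(s)] += 1
    return acc / np.maximum(cnt, 1)

def hysteresis_segments(scores, on=0.7, off=0.3):
    """Two-threshold detection: a speech region starts when the score rises
    above `on` and ends when it falls below `off`. Returns (start, end) frames."""
    segments, start, active = [], None, False
    for t, s in enumerate(scores):
        if not active and s > on:
            active, start = True, t
        elif active and s < off:
            segments.append((start, t))
            active = False
    if active:
        segments.append((start, len(scores)))
    return segments
```

Using two thresholds instead of one prevents short score fluctuations around a single cut-off from fragmenting a speech region.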

III-D Back-end

The DIHARD organizers provided a Kaldi-based x-vector back-end for speaker similarity measurement. We developed a separate back-end that uses different acoustic features and parameters for neural network training.

III-D1 Acoustic features

We use mel-frequency cepstral coefficients (MFCCs) as our primary acoustic feature. We extract -dimensional MFCCs using filters. Unlike the implementation used in the baseline, where window-based cepstral mean normalization (CMN) is performed, we apply utterance-dependent CMN where the global mean is computed from the speech regions. We have also investigated inverted mel-frequency cepstral coefficients (IMFCCs), which capture information complementary to MFCCs [5]. The IMFCCs are extracted in the same manner as MFCCs, except that the warping scale is flipped to give more emphasis to the high-frequency regions.
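The flipped warping scale can be illustrated by mirroring the mel band edges around the centre of the analysed band. This is a sketch of the idea only, not the exact filterbank implementation used in the paper; function names are hypothetical.

```python
import numpy as np

def mel_points(n_filters, fmin, fmax):
    """Equally spaced filterbank edge points on the mel scale, mapped back to Hz."""
    mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    imel = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    return imel(np.linspace(mel(fmin), mel(fmax), n_filters + 2))

def inverted_mel_points(n_filters, fmin, fmax):
    """Flip the mel warping so filter resolution concentrates at high frequencies."""
    pts = mel_points(n_filters, fmin, fmax)
    # Mirror the band edges around the centre of [fmin, fmax].
    return fmin + fmax - pts[::-1]
```

On the mel scale, filter spacing in Hz widens with frequency (fine low-frequency resolution); the inverted scale reverses this, placing the narrow, densely spaced filters at high frequencies instead.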

III-D2 Classifier

We rely on an x-vector system that uses a neural network for discriminative training. The network follows the time-delay neural network (TDNN) architecture, which captures information from a large temporal context of the frame-level speech feature sequences [35]. In addition to the TDNN layers, the x-vector system uses statistics pooling and fully connected layers to form a segment-level speaker classification network. In our x-vector implementation, we use five TDNN layers and three fully connected layers as in [30]. The details of the neural network configuration are shown in Table III. The x-vectors are used with a probabilistic linear discriminant analysis (PLDA) back-end for segment-level speaker similarity measurement.

Layer               Details
TDNN-{1,…,4}        Conv1D (#F=1024, KS={5,3,3,1}, DR={1,2,3,1})
TDNN-5              Conv1D (#F=4096, KS=1, DR=1)
Statistics pooling  Computation of mean and standard deviation
FC-{1,2}            Fully connected layers (#nodes=512)
Softmax             Softmax layer with 7205 outputs
TABLE III: Description of the layers in the x-vector architecture. #F stands for number of filters, KS for kernel size, and DR for dilation rate.

We have implemented the x-vector system with the Keras Python library [7] using the TensorFlow back-end [1]. We use rectified linear units (ReLU) [21] and batch normalization [16] for all five TDNN layers and two fully connected layers. We apply dropout only on the two fully connected layers. Parameters of the neural network are initialized with the Xavier normal method [12]. The neural network is trained using the Adam optimizer [17] without learning rate decay. We train the neural network using short fixed-duration speech segments, in mini-batches over several epochs. We consider the development sets of VoxCeleb1 and VoxCeleb2, with no data augmentation. We extract the speaker embedding from the output of the FC1 layer (before applying ReLU and batch normalization).
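Two pieces of the architecture in Table III can be sketched compactly: the statistics pooling layer that turns a variable-length frame sequence into a fixed-size vector, and the temporal context that a dilated Conv1D (TDNN) layer covers. This is a minimal numpy illustration, not our Keras implementation.

```python
import numpy as np

def statistics_pooling(frames):
    """Map a variable-length (T, d) frame sequence to a fixed 2d-dimensional
    vector by concatenating the per-dimension mean and standard deviation."""
    return np.concatenate([frames.mean(axis=0), frames.std(axis=0)])

def tdnn_context(t, kernel_size, dilation):
    """Input frame indices a dilated Conv1D (TDNN) layer sees at output step t."""
    half = (kernel_size - 1) // 2
    return [t + k * dilation for k in range(-half, half + 1)]
```

For example, TDNN-2 in Table III (KS=3, DR=2) looks at frames {t-2, t, t+2}, so stacking the TDNN layers grows the effective temporal context while keeping each layer small.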

III-E Clustering

The baseline clustering provided by the organizers is based on a rather simple yet effective implementation of agglomerative hierarchical clustering (AHC). The clusters are created from the pair-wise similarity matrix of segment-level x-vectors. The optimal threshold to stop clustering is determined by minimizing the DER on the entire development set.

Starting from the observation that almost half of the overall DER is due to missed speakers, we investigated alternative clustering strategies that might reduce the missed speaker rate, under the assumption that these errors are related to the presence of overlap between speakers. A first attempt was therefore to allow overlaps between clusters and to close gaps between two segments assigned to the same cluster and separated by another speaker for less than  s. Unfortunately, this approach gave a noticeable deterioration on both sets by introducing higher false alarms. Another attempt was to revise the clustering process in a more traditional fashion, where a left-to-right (L2R) clustering is performed first, followed by an AHC; the x-vectors are averaged over the segments generated by the initial L2R step. However, this did not improve the DER either.
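The threshold-stopped AHC at the core of the baseline can be sketched as follows: a naive average-linkage implementation over a similarity matrix. The Kaldi baseline is far more efficient, but the stopping logic is the same; the function name is illustrative.

```python
import numpy as np

def ahc(similarity, threshold):
    """Agglomerative hierarchical clustering on a pairwise similarity matrix.

    Repeatedly merges the most similar pair of clusters (average linkage)
    until the best available similarity drops below `threshold`.
    Returns one cluster label per segment."""
    n = similarity.shape[0]
    clusters = [[i] for i in range(n)]
    while len(clusters) > 1:
        best, bi, bj = -np.inf, -1, -1
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                s = np.mean([similarity[a, b] for a in clusters[i] for b in clusters[j]])
                if s > best:
                    best, bi, bj = s, i, j
        if best < threshold:
            break  # stopping threshold reached: no sufficiently similar pair left
        clusters[bi] += clusters.pop(bj)
    labels = np.empty(n, dtype=int)
    for c, members in enumerate(clusters):
        labels[members] = c
    return labels
```

The stopping threshold implicitly decides the number of speakers, which is why its tuning on the development set (and its mismatch on evaluation data) matters so much in the results below.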

III-F Multi-channel front-end

For the multi-channel front-end, we investigated multiple speech enhancement methods: BeamformIt [2], where an enhanced signal is computed using a simple filter-and-sum beamforming technique, and a combination of BeamformIt followed by the baseline speech enhancement method provided by the organizers [33].

We also investigated a source localization-driven source separation method, similar to [6] and detailed in [3]. The source location was estimated using the classical generalized cross-correlation with phase transform (GCC-PHAT) technique [19]. Given the source location, delay-and-sum (DS) beamforming is performed and a time-frequency mask corresponding to the localized speaker is obtained using a 2-layer bi-LSTM neural network with features derived from the delay-summed signal. Speech separation is done with a speech distortion weighted multi-channel Wiener filter (SDW-MWF) [32] using the speech and noise covariance matrices obtained from the mask. The neural network estimating the mask was trained on simulated data based on the WSJ0-2mix dataset [14]. The dataset contains mixtures of two clean WSJ utterances, which we further reverberate using RIRs generated with a CHiME-5-like microphone array geometry. Noise from the CHiME-5 dataset is also included in the simulated data to make it more realistic. Another attempt to exploit the availability of multiple channels was to average x-vectors across the channels of each device.
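The GCC-PHAT delay estimate underlying the localization step can be sketched as follows. This is a standard textbook formulation, not the exact implementation behind [19]; the function name is illustrative.

```python
import numpy as np

def gcc_phat(sig, ref, fs):
    """Estimate the time delay (in seconds) of `sig` relative to `ref`.

    The cross-power spectrum is whitened (phase transform) so that the
    correlation peak depends on phase alone, which is robust in reverberation."""
    n = len(sig) + len(ref)
    SIG = np.fft.rfft(sig, n=n)
    REF = np.fft.rfft(ref, n=n)
    R = SIG * np.conj(REF)
    R /= np.abs(R) + 1e-12            # phase transform: keep phase, drop magnitude
    cc = np.fft.irfft(R, n=n)
    max_shift = n // 2
    # Re-centre the circular correlation so lag 0 sits in the middle.
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    shift = np.argmax(np.abs(cc)) - max_shift
    return shift / fs
```

With delays estimated between microphone pairs, the time differences of arrival can be intersected geometrically to localize the source before the DS beamforming step.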

III-G Re-segmentation

The final module of the diarization system (see Figure 1), for which we tested different approaches, is re-segmentation. Given the output of the clustering step, re-segmentation aims at refining speech segment boundaries and labels. Two different resegmentation methods were tested, both separately and jointly: one based on Gaussian mixture models (GMMs) and another on long short-term memory (LSTM) recurrent neural networks.


A GMM is used to model every cluster hypothesized at the clustering step. The log-likelihood of each such GMM is calculated at the feature level. To counterbalance the noisy behavior of frame-level log-likelihoods, average smoothing within a sliding window is applied to the log-likelihood curves obtained for each GMM cluster. Each frame is then assigned to the cluster providing the highest smoothed log-likelihood. This technique, previously used in [23], is expected to provide a finer boundary correction.
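This smoothing-and-reassignment step can be sketched as follows, with a single diagonal Gaussian per cluster standing in for the full GMM and an illustrative window length; the function names are hypothetical.

```python
import numpy as np

def smooth(x, win):
    """Moving-average smoothing of a 1-D score curve along the time axis."""
    return np.convolve(x, np.ones(win) / win, mode="same")

def resegment(features, labels, win=51):
    """Re-assign frames to clusters via smoothed per-cluster log-likelihoods.

    features: (T, d) frame-level features; labels: (T,) initial cluster labels.
    A single diagonal Gaussian per cluster stands in for the GMM of each
    hypothesized speaker. Frame-level log-likelihoods are smoothed before
    taking the argmax, so isolated frames do not flip and only boundaries move."""
    clusters = np.unique(labels)
    ll = np.empty((len(clusters), len(features)))
    for k, c in enumerate(clusters):
        X = features[labels == c]
        mu, var = X.mean(axis=0), X.var(axis=0) + 1e-6
        ll_k = -0.5 * (np.log(2 * np.pi * var) + (features - mu) ** 2 / var).sum(axis=1)
        ll[k] = smooth(ll_k, win)
    return clusters[np.argmax(ll, axis=0)]
```

In practice the models are full GMMs re-estimated from the clustering output, and the window length is tuned on the development set.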


Assuming that the output of the clustering step predicts N different speakers, this re-segmentation method uses the same principle as in Section III-C, but with N+1 classes, so that one label is assigned to non-speech and the others to the hypothesized speakers. At test time, using the (unsupervised) output of the clustering step as its unique training file, the neural network is trained for a tunable number of epochs and applied to the very same test file it has been trained on. The resulting sequence of score vectors is post-processed by choosing the class with the maximum score for each frame. To stabilize the choice of this hyper-parameter and make the prediction scores smoother, scores from the previous epochs are averaged when making predictions at a given epoch. While this re-segmentation step does improve the labeling of speech regions, it also has the side effect of increasing false alarms (i.e., non-speech regions classified as speech). Therefore, its output is further post-processed to revert speech/non-speech decisions back to the original SAD output. The technique was previously proposed in .


IV Experimental results

In this section, we present the evaluation results of different approaches for the modules of the diarization system.

IV-A Comparison with baseline SD methods for Track 1

First, we compare the SD performance of the challenge baseline and our implementation on Track 1, which uses ground-truth SAD labels. The main differences between the baseline and our implementation are in the feature configurations and the classifier back-end. We trained our back-end with smaller chunks of 1 s, whereas the baseline system is trained with chunks of more than 4 s. The baseline system also uses data augmentation from the MUSAN and RIR datasets, and it uses domain adaptation by learning centering and whitening parameters from in-domain DIHARD data. Our system is simpler, since we use neither data augmentation nor domain adaptation. For our SD method, we use the same AHC as in the Kaldi baseline system. The comparative results for Track 1 are shown in Table IV. They indicate that our SD system is consistently better than the Kaldi baseline on both development and evaluation data. We observe a larger improvement in the JER metric, with relative improvements of % and % on the development and evaluation sets respectively. We have not observed any improvement with PLDA adaptation. This is most likely because our system is trained with chunks of 1 s, which already fit the DIHARD speech segments. For the Kaldi baseline system, on the other hand, the domain adaptation might have helped to compensate for the duration mismatch.

x-vector system Dev Eval
baseline 23.70/56.20 25.99/59.51
in-house 22.87/49.76 25.33/51.58
TABLE IV: Performance comparison (in terms of DER/JER in %) of the x-vector baseline and in-house implementations on Track 1.

IV-B Comparison with baseline SD methods for Track 2

In Table V, we compare the performance of SD with different implementations of SAD. The results indicate that our LSTM-based SADs yield consistent improvements over the WebRTC-based SAD provided as the challenge baseline. We obtained the best DER on the evaluation set by using the Kaldi baseline back-end with LSTM SAD. Our SD system also shows the lowest JERs on both development and evaluation sets. The different versions of our LSTM-based SAD system correspond to different training and tuning strategies: (v1) using the MUSAN database of noises for noise augmentation; (v2) using silences of the Dev set for noise augmentation and the same set for training and tuning hyper-parameters, which expectedly led to the best performance on the Dev set; (v3) splitting the Dev set into two subsets for training and tuning, with silences of the Dev set used for noise augmentation; and (v4) adding the enhanced speech to the training set. From the results, it can be noted that the system performing best on the Eval set (DER metric) simply learned the specific data and speech annotations provided by the challenge. Although such an approach improved the DER results, it may not generalize well to other types of data.

x-vector system Dev Eval
baseline + WebRTC SAD 38.26/62.59 40.86/66.60
baseline + LSTM SAD (v1) 25.66/56.88 35.81/63.03
baseline + LSTM SAD (v2) 25.01/55.75 43.08/65.77
baseline + LSTM SAD (v3) 28.77/58.32 33.02/61.51
baseline + LSTM SAD (v4) 27.93/57.46 35.44/63.19
in-house + LSTM SAD (v3) 28.97/53.99 34.71/58.21
TABLE V: Performance comparison (in terms of DER/JER in %) of the x-vector baseline and in-house implementations on Track 2.

IV-C Impact of domain grouping

Table VI shows the results of domain grouping for both Track 1 and Track 2. We computed the SD performance for both the Kaldi baseline and our system. We observe that domain grouping improves SD performance compared to the condition without grouping; for example, the DER of the Kaldi baseline on the evaluation set has been reduced to % from %. JER is consistently lower for our system. However, the DERs of the Kaldi baseline are lower than those of our system on the Eval set. This is most likely due to wrong estimation of the domains in the Eval set.

Track x-vector system Dev Eval
1 baseline 23.03/53.38 24.25/56.04
in-house 22.83/49.41 25.34/50.75
2 baseline 28.65/56.04 32.60/59.16
in-house 28.68/53.00 34.39/57.30
TABLE VI: Performance comparison (in terms of DER/JER in %) of the speaker diarization x-vectors systems with domain-grouping for Track 1 and Track 2.

IV-D Comparison of acoustic front-ends

In Table VII, we compare the performance of the MFCC and IMFCC acoustic front-ends and find that IMFCC gives poorer SD performance than MFCC. This is expected, since the high-frequency regions of the DIHARD data are more corrupted by noise. However, we found that a score-level fusion of the two systems (with the weight optimized on the development set) improves the overall performance in all cases. The fused system is our best performing system for Track 1 in terms of both DER and JER.

System Dev Eval
MFCC 22.87/49.76 25.33/51.58
IMFCC 25.47/51.30 27.43/53.33
Fusion 22.85/48.62 24.72/49.95
TABLE VII: Performance comparison (in terms of DER/JER in %) of MFCC, IMFCC and a fused system in Track 1 using the in-house x-vector implementation.

IV-E Effect of resegmentation

Table VIII presents the results obtained after applying the resegmentation techniques described in Section III-G to Track 2, for both development and evaluation sets. These techniques were applied on top of the clustering solutions generated by two systems. The first is based on the provided x-vector baseline, domain grouping as described in Section III-A, and an LSTM-based SAD (v3). Each of the GMM and LSTM resegmentations results in a small but consistent decrease in DER of about 0.4% on the development set. However, when applying them jointly (LSTM-based followed by GMM-based resegmentation), the DER further drops to 27.87%.

A similar trend can be seen for the evaluation set, even more so when LSTM and GMM resegmentations are applied jointly, effectively lowering the DER to 31.03% and resulting in our best overall performing system for Track 2. When, instead of the baseline, we use our in-house x-vector implementation with the same LSTM-based SAD (v3), we notice similar trends of lower DER after resegmentation is performed. However, resegmentation had a negative effect on performance in Track 1. While in Track 2 the missed speech detection that artificially splits same-speaker speech content into multiple clusters can be corrected by resegmentation, despite the negative effect of the false alarms, it does not have such a positive effect in Track 1, since the “oracle” SAD annotation is already provided.

SD system               Method     Dev          Eval
x-vectors (baseline) +  none       28.65/56.04  32.60/59.16
grouping + LSTM SAD     GMM        28.25/55.67  32.02/58.88
                        LSTM       28.21/57.01  31.41/59.60
                        LSTM+GMM   27.87/56.77  31.03/59.22
x-vectors (in-house) +  none       28.77/51.37  33.75/55.54
score fusion            GMM        28.79/54.72  33.55/58.01
                        LSTM       28.16/54.31  32.77/58.87
                        LSTM+GMM   27.86/54.99  32.37/58.67

IV-F Experiments on multi-channel SD

We have performed SD experiments on the multi-channel tracks using the Kaldi x-vector system as the back-end. The results for Track 3 are shown in Table IX. We observe that a better tuning of the threshold on the training set led to a small improvement in system performance. We also found that optimizing the threshold for each recording session gives a marginal improvement over the baseline; for example, the DER is reduced to % from % when an “oracle” threshold is chosen for each session. Table IX also shows the results for two different clustering methods. However, their performance deteriorates considerably on the evaluation set. Most of this deterioration is due to an increase in false alarms, in contrast with a minor reduction of missed speaker and speaker confusion errors. We observe a large performance gap between development and evaluation sets, which indicates that a threshold optimized for the development set fails to generalize to the evaluation set.

System Dev Eval
DIHARD baseline 60.10 50.85
Baseline threshold tuning 60.01 49.97
Session-based oracle threshold 58.28 -
Bridge gap + overlap 62.63 56.61
L2R + AHC 60.20 61.27
TABLE IX: SD performance in terms of DER % on development and evaluation sets for various approaches in Track 3.

We have also evaluated different beamforming strategies and speech enhancement methods for the multi-channel scenario. Table X reports the results on Track 3 in terms of DER. In most cases, performance deteriorates on both development and evaluation sets. The BeamformIt method alone slightly improves performance on the development set but performs considerably worse on the evaluation set. The best system is the combination of BeamformIt with the baseline enhancement, which is only slightly worse than the baseline. Going into more detail, applying the enhanced signals increases both the missed speaker rate and the speaker confusion. Averaging x-vectors over the four channels of a device does not result in any noticeable difference in SD performance, probably because the channels are very close to each other. Note that some methods were not evaluated on the evaluation set due to poor performance or the end of the challenge evaluation period.

System                    Dev    Eval
BeamformIt                59.96  53.00
BeamformIt + enhancement  60.01  50.71
SLOC SDW                  64.09  -
x-vector averaging        60.04  -
TABLE X: SD performance in terms of DER (%) on the development and evaluation sets in Track 3 for different front-end processing. SLOC SDW refers to the localization-driven source separation.

V Lessons learned and future directions

From all the investigations and experiments by the Speed team, we found that SD performance can be systematically improved by improving each module, including the back-end, SAD, and resegmentation, and through the combination and fusion of different methods. From our work on this challenging realistic dataset, we note several important issues that may be helpful to the community and that may require further investigation.

Domain grouping: The way we combined different domains in this work improves the SD performance only marginally. However, we observed large intra-domain variability in terms of SNR, DER, number of speakers, etc. The available domain labels are mostly associated with the audio sources rather than with the individual speech quality. Possibly for this reason, the optimized thresholds for each domain are not considerably different. We hypothesize that the speech files need to be clustered according to speech quality before performing class-dependent SD. Such clustering could also be helpful for the multi-channel tracks, as the speech files are collected under different room reverberation conditions.

Domain adaptation for backend: With our current system, we do not observe any improvement with simple PLDA adaptation by learning centering and whitening parameters from the in-domain data. This contradicts the results obtained with the Kaldi implementation. We speculate that the newly trained system already compensates for the domain mismatch caused by duration variability. We plan to explore more advanced domain adaptation methods, such as supervised domain adaptation and inter-dataset variability compensation.

Speech enhancement: The employment of data-driven speech enhancement algorithms required the adoption of suitably labelled synthetic datasets for supervised DNN training. The synthetic data generation task proved to be very challenging, and it surely deserves more attention, especially in terms of reducing the mismatch between the synthetic data and the datasets used in the challenge. In particular, special care should be devoted to modeling the diverse non-stationary noise sources. From a more general perspective, the success of speech enhancement algorithms in speaker diarization systems inevitably requires well-matched models at the back-end level. For instance, enhancing the speech material used for training the diarization system is expected to improve the overall performance. These aspects have not been adequately investigated by the Speed team so far, and they should be addressed in the future.

Threshold computation: The state-of-the-art SD system used in the DIHARD baseline optimizes the global threshold for speaker clustering on the development set and applies this threshold when computing the diarization labels for the evaluation set. This approach may not lead to optimum performance due to the possible mismatch between the two sets. Clustering audio recordings and applying a cluster-wise threshold could be a tuning technique that improves performance.

Robust feature extraction: We have observed that using different features conveying complementary information can improve SD performance. However, the feature extraction process used in our system lacked additional processing that could improve robustness. We plan to further explore robust audio features for speaker diarization.

VI Conclusion

This paper summarizes the work done by the Speed team for the second DIHARD challenge. We have discussed the different methods explored for improving speaker diarization performance in realistic conditions. Among all the methods explored, we found considerable improvement over the baseline when using LSTM-based speech activity detection. We have also discussed which of the approaches and enhancements we tried did not work, and speculated about the possible reasons. Diarization on the different tracks of the DIHARD challenge turned out to be difficult not only due to the largely varying speech quality but also due to the wide mismatch between development and evaluation sets. We discussed some future directions to be explored in a post-evaluation analysis.


  • [1] M. Abadi et al. (2015) TensorFlow: large-scale machine learning on heterogeneous systems. Software.
  • [2] X. Anguera, C. Wooters, and J. Hernando (2007) Acoustic beamforming for speaker diarization of meetings. IEEE Transactions on Audio, Speech, and Language Processing 15(7), pp. 2011–2021.
  • [3] Anonymous (2019) Analyzing the impact of speaker localization errors on speech separation for automatic speech recognition. In IEEE Automatic Speech Recognition and Understanding Workshop (submitted), Singapore.
  • [4] J. Barker, S. Watanabe, E. Vincent, and J. Trmal (2018) The fifth ‘CHiME’ speech separation and recognition challenge: dataset, task and baselines. In Proc. Interspeech.
  • [5] S. Chakroborty, A. Roy, and G. Saha (2007) Improved closed set text-independent speaker identification by combining MFCC with evidence from flipped filter banks. International Journal of Signal Processing 4(2), pp. 114–122.
  • [6] Z. Chen, X. Xiao, T. Yoshioka, H. Erdogan, J. Li, and Y. Gong (2018) Multi-channel overlapped speech recognition with location guided speech extraction network. In Proc. IEEE SLT, pp. 558–565.
  • [7] F. Chollet et al. (2015) Keras. Software.
  • [8] M. Diez et al. (2018) BUT system for DIHARD speech diarization challenge 2018. In Proc. Interspeech, pp. 2798–2802.
  • [9] Freesound. Website.
  • [10] G. Gelly and J. Gauvain (2015) Minimum word error training of RNN-based voice activity detection. In Proc. Interspeech.
  • [11] F. G. Germain, Q. Chen, and V. Koltun (2018) Speech denoising with deep feature losses. arXiv preprint arXiv:1806.10522.
  • [12] X. Glorot and Y. Bengio (2010) Understanding the difficulty of training deep feedforward neural networks. In Proc. of the Thirteenth International Conference on Artificial Intelligence and Statistics, pp. 249–256.
  • [13] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proc. IEEE CVPR, pp. 770–778.
  • [14] J. R. Hershey, Z. Chen, J. Le Roux, and S. Watanabe (2016) Deep clustering: discriminative embeddings for segmentation and separation. In Proc. ICASSP, pp. 31–35.
  • [15] J. Hu, L. Shen, and G. Sun (2018) Squeeze-and-excitation networks. In Proc. IEEE CVPR, pp. 7132–7141.
  • [16] S. Ioffe and C. Szegedy (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. In Proc. ICML, pp. 448–456.
  • [17] D. P. Kingma and J. Ba (2015) Adam: a method for stochastic optimization. In Proc. ICLR, pp. 1–15.
  • [18] D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980.
  • [19] C. Knapp and G. Carter (1976) The generalized correlation method for estimation of time delay. IEEE Transactions on Acoustics, Speech, and Signal Processing 24(4), pp. 320–327.
  • [20] B. Külebi and A. Öktem (2018) Building an open source automatic speech recognition system for Catalan. In Proc. IberSPEECH 2018, pp. 25–29.
  • [21] V. Nair and G. E. Hinton (2010) Rectified linear units improve restricted Boltzmann machines. In Proc. ICML, pp. 807–814.
  • [22] V. Panayotov, G. Chen, D. Povey, and S. Khudanpur (2015) LibriSpeech: an ASR corpus based on public domain audio books. In Proc. ICASSP, pp. 5206–5210.
  • [23] J. Patino, H. Delgado, and N. Evans (2018) The EURECOM submission to the first DIHARD challenge. In Proc. Interspeech.
  • [24] pyannote contributors. Website.
  • [25] N. Ryant, K. Church, C. Cieri, A. Cristia, J. Du, S. Ganapathy, and M. Liberman (2019) Second DIHARD challenge evaluation plan. Linguistic Data Consortium, Tech. Rep.
  • [26] N. Ryant, K. Church, C. Cieri, A. Cristia, J. Du, S. Ganapathy, and M. Liberman (2019) The second DIHARD diarization challenge: dataset, task, and baselines. In Proc. Interspeech.
  • [27] T. Salimans and D. P. Kingma (2016) Weight normalization: a simple reparameterization to accelerate training of deep neural networks. In Advances in Neural Information Processing Systems, pp. 901–909.
  • [28] R. Scheibler, E. Bezzam, and I. Dokmanić (2018) Pyroomacoustics: a Python package for audio room simulation and array processing algorithms. In Proc. ICASSP, pp. 351–355.
  • [29] G. Sell, D. Snyder, A. McCree, D. Garcia-Romero, J. Villalba, M. Maciejewski, V. Manohar, N. Dehak, D. Povey, S. Watanabe, and S. Khudanpur (2018) Diarization is hard: some experiences and lessons learned for the JHU team in the inaugural DIHARD challenge. In Proc. Interspeech.
  • [30] D. Snyder, D. Garcia-Romero, G. Sell, D. Povey, and S. Khudanpur (2018) X-vectors: robust DNN embeddings for speaker recognition. In Proc. ICASSP, pp. 5329–5333.
  • [31] D. Snyder, G. Chen, and D. Povey (2015) MUSAN: a music, speech, and noise corpus. arXiv preprint arXiv:1510.08484.
  • [32] A. Spriet, M. Moonen, and J. Wouters (2004) Spatially pre-processed speech distortion weighted multi-channel Wiener filtering for noise reduction. Signal Processing 84(12), pp. 2367–2387.
  • [33] L. Sun, J. Du, C. Jiang, X. Zhang, S. He, B. Yin, and C. Lee (2018) Speaker diarization with enhancing speech for the first DIHARD challenge. In Proc. Interspeech, pp. 2793–2797.
  • [34] Surfintech. ST-CMDS-20170001 1 - Free ST Chinese Mandarin Corpus.
  • [35] A. Waibel, T. Hanazawa, G. Hinton, K. Shikano, and K. J. Lang (1989) Phoneme recognition using time-delay neural networks. IEEE Transactions on Acoustics, Speech, and Signal Processing 37(3), pp. 328–339.
  • [36] R. Yin, H. Bredin, and C. Barras (2018) Neural speech turn segmentation and affinity propagation for speaker diarization. In Proc. Interspeech.
  • [37] YouTube. Website.