Speaker diarization is the task of determining ‘who spoke when’ in a multi-speaker environment. It is an essential component of a variety of applications, such as call center services and meeting transcription, hence the attention it has drawn from the research community [11, 1]. Although the process of diarization sounds inherently easy, it poses multiple challenges in practice, limiting its commercial deployment. One of the reasons is that the overall performance depends heavily on both the application scenario and the imposed constraints. For example, diarization of call center audio is mostly about separating just two speakers, often in quite diverse acoustic environments. On the other hand, diarization of meetings is much more challenging, with multiple speakers, reverberation, back-channeling, etc. As such, even determining the number of active speakers still poses a scientific challenge.
The main approach for diarization systems is to first extract noise- and environment-invariant speaker embeddings and then cluster them. Most of the previous work is based on i-vector embeddings [13, 3, 14, 5]. However, the latest research shows a shift from i-vectors to d-vectors [16, 7], i.e. features based on the bottleneck layer output of a DNN trained for speaker recognition. This shift is attributed to the enhanced performance, easier training with more data, and robustness against speaker variability and acoustic conditions. The main difference between i-vectors and d-vectors is that the latter are extracted in a frame-level fashion from the bottleneck layer.
Two approaches to speaker clustering can be distinguished. The first is supervised, where the proposed system is trained on pre-segmented, speaker-labeled utterances. These approaches usually provide a neural network that models the speaker profiles on-the-fly, combined with techniques such as an HMM for the speaker sequence and segmentation. The second approach clusters the speakers in an unsupervised manner, while the algorithm decides the data modality, i.e. the number of speakers [18, 4].
Along these lines, the performance of the Deep Embedded Clustering (DEC) variations [17, 6] is also investigated and compared with k-Means and Spectral Clustering (assuming the number of speakers is known). For an unknown number of speakers, we investigate the performance of the ‘x-Means’ algorithm, where the number of speakers is determined on-the-fly. To isolate the impact of the clustering performance when the number of speakers is known, we use oracle audio segmentations. We also compare the proposed approach against the fully supervised UISRNN system.
This paper is structured as follows: Sec. 2 gives a short overview of the d-vector extraction and the existing clustering approaches. Sec. 3.1 discusses the enhancements in the front-end, i.e. the temporal and median filtering. Sec. 3.2 discusses the refinements of the DEC algorithm, where the loss function for training is revisited. Sec. 4 presents the experimental results on 3 different tasks, i.e. the AMI database, the DIHARD corpus, and the internal meeting data. Finally, our latest findings are recapped and conclusions are presented in Sec. 5.
2.1 Clustering Methods
The baseline clustering system is based on two of the most widely used clustering algorithms, i.e. k-Means and Spectral clustering. k-Means is an unsupervised cluster analysis method that partitions the $n$ samples into $k$ clusters, with $k$ known. The k-Means algorithm and the expectation-maximization algorithm for GMMs are similar in the sense that both use cluster centroids, i.e. ‘means’, to model the data; however, k-Means finds clusters of comparable spatial variance, while the expectation-maximization mechanism allows clusters to have very different ones. The objective function is,

$$\underset{S}{\arg\min} \sum_{i=1}^{k} \sum_{\mathbf{x} \in S_i} \lVert \mathbf{x} - \boldsymbol{\mu}_i \rVert^2 \qquad (1)$$

where $\mathbf{x}$ are the samples, $\boldsymbol{\mu}_i$ are the centroids and $S_i$ the sets corresponding to the respective clusters. Before using the k-Means algorithm, the input samples are whitened and their dimensionality is reduced using PCA. The initial $\boldsymbol{\mu}_i$ can be replaced by the speaker profiles when initializing the process, incorporating prior knowledge into the system.
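As a rough illustration of the k-Means procedure and the objective of Eq. (1), the following plain-Python sketch alternates the assignment and centroid-update steps. The function names (`sq_dist`, `assign`, `update`, `objective`, `kmeans`) and the toy data are hypothetical and are not taken from the system described here. Passing speaker profiles as `init_centroids` mirrors the initialization with prior knowledge mentioned above; whitening and PCA are omitted for brevity.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((u - v) ** 2 for u, v in zip(a, b))


def assign(samples, centroids):
    """Assign every sample to the index of its nearest centroid."""
    return [
        min(range(len(centroids)), key=lambda j: sq_dist(x, centroids[j]))
        for x in samples
    ]


def update(samples, labels, centroids):
    """Recompute each centroid as the mean of its members.

    An empty cluster keeps its previous centroid (an assumption of this sketch).
    """
    new_centroids = []
    for j, old in enumerate(centroids):
        members = [x for x, lab in zip(samples, labels) if lab == j]
        if not members:
            new_centroids.append(list(old))
            continue
        new_centroids.append(
            [sum(m[d] for m in members) / len(members) for d in range(len(old))]
        )
    return new_centroids


def objective(samples, labels, centroids):
    """Sum of squared distances to the assigned centroids, as in Eq. (1)."""
    return sum(sq_dist(x, centroids[lab]) for x, lab in zip(samples, labels))


def kmeans(samples, init_centroids, n_iter=20):
    """Run a fixed number of Lloyd iterations from the given initial centroids."""
    centroids = [list(c) for c in init_centroids]
    for _ in range(n_iter):
        labels = assign(samples, centroids)
        centroids = update(samples, labels, centroids)
    return assign(samples, centroids), centroids


# Toy usage: two well-separated groups, initialized from two "speaker profiles".
toy_samples = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]
toy_labels, toy_centroids = kmeans(toy_samples, [[0.0, 0.0], [10.0, 10.0]])
```

Initializing from profiles removes the random initialization step for that session, but the number of clusters still has to be fixed in advance.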
Spectral clustering is the default state-of-the-art unsupervised clustering algorithm for diarization, providing very good performance. It is based on a similarity matrix $A$, defined as a symmetric matrix where $A_{ij}$ represents the similarity between any two data points with indices $i$ and $j$. The spectral clustering approach employed here applies k-Means to the eigenvectors of the ‘graph Laplacian’ matrix of $A$. The intrinsic dimensionality reduction provides additional robustness to the algorithm.
Both algorithms have drawbacks, such as the random initialization step and the requirement for a preset number of clusters $k$, Eq. (1). The latter constraint is addressed by the ‘x-Means’ algorithm, which starts with a lower bound on $k$ and keeps splitting the clusters until a stopping criterion is reached (or the purity of the clusters reaches a certain level). Herein, the Bayesian Information Criterion (BIC) is used,

$$\mathrm{BIC}(M_j) = \hat{l}_j(D) - \frac{p_j}{2} \log n \qquad (2)$$

where $\hat{l}_j(D)$ is the log-likelihood of the data $D$ according to the $j$-th model, $p_j$ is the number of parameters in model $M_j$ (where $\{M_j\}$ is the family of models) and $n$ is the number of samples. The algorithm keeps splitting the clusters until all clusters are different enough based on BIC. A similar idea (albeit with a completely different approach) has also been proposed for Agglomerative Hierarchical Clustering.
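The BIC-based splitting decision can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the score follows the usual x-Means form (log-likelihood minus half the parameter count times the log of the sample count), and the helper names `bic` and `should_split` as well as the numbers in the usage lines are hypothetical.

```python
import math


def bic(log_likelihood, n_params, n_samples):
    """BIC score of one model: higher is better in this convention."""
    return log_likelihood - 0.5 * n_params * math.log(n_samples)


def should_split(parent, children):
    """Decide whether a cluster should be split in two.

    `parent` and `children` are (log_likelihood, n_params, n_samples) tuples
    for the one-cluster model and the two-cluster model fitted to the same
    samples. The split is kept only if it raises the BIC score.
    """
    return bic(*children) > bic(*parent)


# Toy usage with made-up likelihoods: the two-cluster model fits much better,
# so the extra parameters are worth their penalty.
keep_split = should_split(parent=(-120.0, 2, 100), children=(-100.0, 4, 100))
```

The penalty term grows with the number of parameters, so a split that improves the likelihood only marginally is rejected, which is what stops the recursion.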
2.2 Deep Embedded Clustering – DEC
The motivation behind DEC is to transform the input features, herein speaker embeddings, to a space that is better separable into a given number of clusters. The clusters are iteratively refined based on a target distribution. First, an autoencoder is trained while injecting noise, i.e. dropout in the encoding part. The autoencoder learns a representation of the input features in a space of much lower dimensionality (embeddings), while maintaining the separability of the features. The encoder outputs are used as input to the clustering component, which is iteratively refined by learning from its high-confidence assignments. Specifically, the DEC model is trained with the KL-divergence between the $P$ and $Q$ distributions as the loss function, matching the soft assignment $q_{ij}$ of the embedding $\mathbf{z}_i$ to the cluster $j$ with the target distribution $p_{ij}$, Eq. (4),

$$L_c = \mathrm{KL}(P \,\|\, Q) = \sum_{i} \sum_{j} p_{ij} \log \frac{p_{ij}}{q_{ij}} \qquad (3)$$

and $q_{ij}$, $p_{ij}$ are given by,

$$q_{ij} = \frac{\left(1 + \lVert \mathbf{z}_i - \boldsymbol{\mu}_j \rVert^2\right)^{-1}}{\sum_{j'} \left(1 + \lVert \mathbf{z}_i - \boldsymbol{\mu}_{j'} \rVert^2\right)^{-1}}, \qquad p_{ij} = \frac{q_{ij}^2 / f_j}{\sum_{j'} q_{ij'}^2 / f_{j'}} \qquad (4)$$

where $\boldsymbol{\mu}_j$ is the centroid of the $j$-th cluster and $f_j$ is the soft cluster frequency with $f_j = \sum_i q_{ij}$. The DEC approach, however, presents some problems. First, the training of the autoencoder and the clustering steps are decoupled, according to Eq. (3). This may lead to trivial solutions, especially when the encoded features are not discriminative enough, since there is no gradient back-propagation to the autoencoder. An initial fix has been proposed, adding a second loss term to preserve the local structure of the input features while improving their separability,

$$L = \alpha L_c + \beta L_r, \qquad L_r = \sum_{i} \lVert \mathbf{x}_i - G(\mathbf{x}_i) \rVert^2 \qquad (5)$$

where $G(\cdot)$ is the encoder and decoder mappings combined. Now, the autoencoder is forced to maintain local structure while improving discrimination, thus the features cannot collapse into the ‘trivial’ space.
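To make the clustering part of DEC more concrete, the sketch below computes soft assignments, the sharpened target distribution and their KL-divergence for a handful of one-dimensional embeddings. It assumes the standard DEC formulation (a Student's t kernel with one degree of freedom for the soft assignment, and squared assignments normalized by the soft cluster frequency for the target). The function names and toy values are hypothetical, and the autoencoder and the gradient updates are left out entirely.

```python
import math


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((u - v) ** 2 for u, v in zip(a, b))


def soft_assign(embeddings, centroids):
    """Soft assignment q[i][j] of embedding i to cluster j (Student's t kernel)."""
    q = []
    for z in embeddings:
        weights = [1.0 / (1.0 + sq_dist(z, mu)) for mu in centroids]
        total = sum(weights)
        q.append([w / total for w in weights])
    return q


def target_distribution(q):
    """Target p[i][j]: squared soft assignments divided by the soft cluster
    frequency f[j] = sum_i q[i][j], then renormalized per sample."""
    n_clusters = len(q[0])
    freq = [sum(row[j] for row in q) for j in range(n_clusters)]
    p = []
    for row in q:
        weights = [row[j] ** 2 / freq[j] for j in range(n_clusters)]
        total = sum(weights)
        p.append([w / total for w in weights])
    return p


def kl_divergence(p, q):
    """KL(P || Q) summed over all samples and clusters."""
    return sum(
        p_ij * math.log(p_ij / q_ij)
        for p_row, q_row in zip(p, q)
        for p_ij, q_ij in zip(p_row, q_row)
        if p_ij > 0.0
    )


# Toy usage: four 1-D embeddings around two centroids.
toy_z = [[0.0], [1.0], [4.0], [5.0]]
toy_mu = [[0.5], [4.5]]
toy_q = soft_assign(toy_z, toy_mu)
toy_p = target_distribution(toy_q)
toy_loss = kl_divergence(toy_p, toy_q)
```

Because the target squares the soft assignments, confident assignments are pushed further towards one cluster, which is the ‘learning from high-confidence assignments’ behavior described above.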
However, the two-term loss function does not address another fundamental problem of the algorithm: there is no constraint for avoiding empty clusters, despite the fact that the number of classes is known. A non-trivial solution might be implied by Eq. (3), but this is not adequate. The issue can be further exacerbated since the features are transformed to a low-dimensional space, according to the two loss terms of Eq. (5), without constraints.
2.3 Unbounded Interleaved-State RNN
Recently, an online method called ‘Unbounded Interleaved-State Recurrent Neural Network (UISRNN)’ has been proposed for fully supervised speaker diarization. The algorithm takes d-vectors as input and uses an RNN to keep track of the different speakers (each as a different state) as they are interleaved in the time domain. The RNN is part of a Bayesian framework supporting an unknown number of speakers. Although there are fundamental differences from the other algorithms investigated here, we include results using this algorithm for comparison.
3 Enhancements of Speaker Clustering
3.1 Improving Speaker Embeddings
The d-vectors are extracted using a DNN (a TDNN is used for the extraction, now called ‘x-vectors’), with stacked log-mel filterbank energy coefficients as input features. The output of the network is a one-hot speaker label (or equally the probability of that particular speaker given the current input frame). The d-vectors are the output of the second-to-last DNN layer, which is usually much shorter than the last one. This layer is called the ‘bottleneck’ layer and its size depends on the implementation.
The frame-based nature of the d-vectors leads to noisy frame estimates despite the very long input time-windows. Most approaches using d-vectors aggregate them over the span of a segment by averaging. As such, the length of the input segments is one of the limiting factors for high-quality d-vectors, with shorter segments leading to suboptimal clustering results, Sec. 4.
Herein, we propose a different approach, where the d-vectors are first low-pass filtered and then aggregated by a median filter,

$$\hat{d}_k[n] = \sum_{m} h[m]\, d_k[n-m] \qquad (6)$$

where $d_k[n]$ is the $k$-th coefficient of the $n$-th frame and $h[n]$ is a moving-average FIR filter of length $M$, estimated as $h[n] = \frac{1}{M} \sum_{m=0}^{M-1} \delta[n-m]$, where $\delta[\cdot]$ is the Dirac function. This filtering process results in smoother, less noisy temporal trajectories of the d-vector coefficients. A median value for each segment is then extracted from these temporally smoothed vectors,

$$\bar{d}_k = \underset{n_s \le n \le n_e}{\mathrm{median}}\; \hat{d}_k[n] \qquad (7)$$

with $n_s$, $n_e$ the start/end segment frame index, respectively. The ‘smoothing and median filtering’ approach has been found to outperform the widely used averaging scheme for several reasons. Assuming the d-vectors belonging to the same speaker are similar enough, the variations of adjacent d-vectors can be attributed to the phonetic content or the environmental noise, and as such they can be discarded. Additional robustness is provided by the median filtering, where outliers have a smaller impact on the aggregated values than in averaging.
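The two-stage aggregation can be sketched as follows in plain Python. The filter length `width`, the handling of the first frames (a shortened window) and the function names `smooth` and `aggregate_segment` are assumptions made for this illustration, since the exact filter settings are not given here.

```python
from statistics import median


def smooth(frames, width):
    """Causal moving average over the frame axis, applied per coefficient.

    `frames` is a list of frame-level d-vectors (one list of floats per frame).
    For the first frames the window is shortened to the frames available.
    """
    smoothed = []
    for n in range(len(frames)):
        window = frames[max(0, n - width + 1): n + 1]
        smoothed.append([sum(col) / len(window) for col in zip(*window)])
    return smoothed


def aggregate_segment(frames, start, end, width=3):
    """Smooth the d-vector trajectories, then take the per-coefficient median
    over the frames start..end (inclusive) of one segment."""
    segment = smooth(frames, width)[start:end + 1]
    return [median(col) for col in zip(*segment)]


# Toy usage: a 1-D "d-vector" track with a single outlier frame.
toy_frames = [[1.0], [1.0], [1.0], [10.0], [1.0]]
toy_median = aggregate_segment(toy_frames, 0, 4, width=2)
toy_mean = [sum(f[0] for f in toy_frames) / len(toy_frames)]
```

In this toy track the outlier frame pulls the plain average away from the typical value, while the median of the smoothed trajectory stays at the typical value, which is the robustness argument made above.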
3.2 Improvements on Deep Embedded Clustering
The second contribution of the paper is revisiting the overall loss function and adding a few algorithmic steps to the DEC algorithm.
First, the possibility of empty clusters has to be addressed. The basic assumption of our approach is that the distribution of speaker turns is uniform across all speakers, i.e. all speakers contribute equally to the session. This assumption is not realistic in real meeting environments, but it constrains the solution space enough to avoid empty clusters without affecting overall performance. Under this assumption, the $n$ input samples are uniformly distributed over the $k$ clusters, expressed by an additional loss term $L_u$ that measures how far the cluster assignments are from $U$, where $U$ is the uniform distribution, or equally $u_j = 1/k$. The clusters are now forced to be more balanced, while clusters not following the uniform distribution are penalized.
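One possible way to express such a balancing penalty is sketched below: the KL-divergence between the normalized soft cluster frequencies and the uniform distribution. This specific form is an assumption made for illustration and is not necessarily the exact term used in this work; the function name `uniform_penalty` and the toy assignments are hypothetical.

```python
import math


def uniform_penalty(q):
    """KL-divergence between the average soft assignment per cluster and the
    uniform distribution over clusters.

    `q[i][j]` is the soft assignment of sample i to cluster j (rows sum to 1).
    The penalty is 0 for perfectly balanced clusters and grows as clusters
    become unbalanced, reaching log(k) when a single cluster takes everything.
    """
    n_samples = len(q)
    n_clusters = len(q[0])
    freq = [sum(row[j] for row in q) / n_samples for j in range(n_clusters)]
    # KL(freq || uniform) = sum_j freq_j * log(freq_j / (1 / k))
    return sum(f * math.log(f * n_clusters) for f in freq if f > 0.0)


# Toy usage: balanced assignments versus a collapsed solution with one
# empty cluster.
balanced = uniform_penalty([[1.0, 0.0], [0.0, 1.0]])
collapsed = uniform_penalty([[1.0, 0.0], [1.0, 0.0]])
```

A term of this kind makes an empty cluster the most expensive configuration, which matches the motivation given above even if the exact functional form differs.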
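An additional loss term penalizes the distance of the encoded features from the centroids $\boldsymbol{\mu}_j$. This MSE term is given by:

$$L_{MSE} = \frac{1}{n} \sum_{j=1}^{k} \sum_{\mathbf{z}_i \in S_j} \lVert \mathbf{z}_i - \boldsymbol{\mu}_j \rVert^2 \qquad (9)$$

where $\mathbf{z}_i$ are the encoded features assigned to the cluster $S_j$. This is similar to the k-Means criterion in Eq. (1), but it is now expressed as part of the loss function.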
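The loss function of the revisited DEC algorithm now becomes,

$$L = \alpha L_c + \beta L_r + \gamma L_u + \delta L_{MSE} \qquad (10)$$

where $L_c$ is the clustering loss of Eq. (3), $L_r$ the reconstruction term of Eq. (5), $L_u$ the uniformity term and $L_{MSE}$ the centroid-distance term. Although not presented here, the weights $\alpha$, $\beta$, $\gamma$, $\delta$ can be fine-tuned on some held-out data.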
Finally, an additional k-Means ‘re-calibration’ step is included every few training iterations. The DEC algorithm uses k-Means to initialize the centroids and then runs iteratively based on the loss function. In our experience, the distribution of Eq. (3) can diverge from the target distribution, in which case a reset is necessary. Such a reset on the iteratively refined features ensures that the system cannot diverge to an ‘ill-conditioned’ solution.
4.1 System Setup
We investigate the performance of the proposed components on 3 different tasks: the AMI, the DIHARD and an internal meeting transcription task. The AMI dataset consists of 166 meetings with 4 speakers each, about 100h in total, captured by multiple lapel mics and 2 microphone arrays. We use only the lapel recordings (1 per speaker) for the segmentation and the mixed audio (the 4 channels are mixed into one) for diarization; the AMI dataset provides this audio signal after mixing all the lapel channels together. Each set of speakers is used for 4 meetings. The DIHARD set is a collection of diverse recordings with a varying number of speakers, noise conditions and spoken languages. The task contains two tracks, with/without the transcriptions given. Herein, we utilize only the first track (with the known segmentations). Finally, the third dataset contains two 1h-long internal meetings, i.e. ‘Meeting A’ and ‘Meeting B’, with 6 and 4 participants, respectively. The audio is recorded with a microphone array and processed by a fixed beamformer keeping the top beam, i.e. the most active one in terms of signal energy. This single-channel audio is then processed for diarization. In all but the DIHARD task and for the case of x-Means in Table 3, the number of speakers and the segmentations are considered given. Any silence shorter than the collar is treated as ‘speech’ for training, testing and scoring. We use the ground-truth segmentations provided by the databases for the d-vector aggregation, and the time boundaries are considered as potential speaker turns. For the internal meeting data, we use the segmentations provided by the Microsoft ASR decoder. Therefore, there are (short) silence segments present in the ground-truth segmentations. Finally, there is no overlapping speech detection (about 10% of the speech is considered overlapping), i.e. segments with more than one active speaker are assigned to the speaker already talking. Consequently, the diarization results appear worse in the sense that some segments with overlapping speech are simply ignored.
The d-vector extractor is trained on text-dependent utterances, i.e. wake-up phrases. The d-vector, i.e. the bottleneck layer output, is 200 coefficients long, and the network takes stacked frames of log-mel filterbank energy coefficients as input. There is no overlap between the speakers in the training set and the speakers in the test audio.
We use PCA for dimensionality reduction of the input to the clustering algorithms. The resulting d-vectors are 70 coefficients long. In the case of the AMI task, the same set of speakers is found in 4 separate sessions/meetings. Thus, the diarization output of the first meeting can be used to create speaker profiles, serving as initial centroids for the rest of the meetings, Eq. (1). Finally, we use the original features in the case of Spectral clustering and DEC, since they have an intrinsic dimensionality reduction process.
The autoencoder for the DEC algorithm is built from dense layers, and the loss function, Eq. (10), uses a fixed set of weights for its terms. The network is trained with the Adamax optimizer and a batch size of 64.
The UISRNN model is trained on about 100h of transcribed internal meeting data. The number of speakers varies per meeting. The front-end for the UISRNN system is the same d-vector network as described above. For the UISRNN, we use a beam of 6 and 3 passes to further refine the diarization results.
Finally, scoring is performed with the NIST tool, the standard for diarization evaluations.
First, we investigate how the length of the speech segments affects the diarization performance when training the PCA transformations and estimating the speaker centroids. We present results in Tables 1-2, where segments shorter than a duration threshold are ignored. We use k-Means for the clustering part and the metric is the ‘Clustering Recall’, i.e. the ratio of correctly assigned speech over the available speech (in sec). As mentioned, ‘Meeting A’ contains 6 speakers and ‘Meeting B’ only 4.
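A duration-weighted recall of this kind can be sketched as below. Since cluster indices are arbitrary, the sketch assumes that clusters are mapped to reference speakers by the permutation that maximizes the correctly assigned duration; that mapping rule, the function name `clustering_recall` and the toy segments are assumptions for illustration, and the brute-force search is only practical for a handful of speakers.

```python
from itertools import permutations


def clustering_recall(durations, reference, hypothesis, n_speakers):
    """Ratio of correctly assigned speech (in sec) over all available speech.

    `durations[i]` is the length of segment i in seconds, `reference[i]` its
    true speaker index and `hypothesis[i]` its cluster index, both in
    range(n_speakers). Every one-to-one mapping of clusters to speakers is
    tried and the best one is kept.
    """
    total = sum(durations)
    best = 0.0
    for mapping in permutations(range(n_speakers)):
        correct = sum(
            dur
            for dur, ref, hyp in zip(durations, reference, hypothesis)
            if mapping[hyp] == ref
        )
        best = max(best, correct)
    return best / total


# Toy usage: three segments of 2, 3 and 5 seconds from two speakers.
perfect = clustering_recall([2.0, 3.0, 5.0], [0, 1, 1], [1, 0, 0], 2)
partial = clustering_recall([2.0, 3.0, 5.0], [0, 1, 1], [1, 0, 1], 2)
```

Weighting by duration means that a wrongly assigned long segment costs more than a wrongly assigned short one, which matters when short segments are the ones most likely to be misclustered.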
As expected, the diarization performance improves when these short segments are ignored, Table 1. However, this is not a viable solution since the shorter segments remain unassigned. The ‘temporal and median’ filtering improves the overall performance, reaching close to the best possible performance without ignoring any segments, as shown in the last row of the table. Shorter segments provide embeddings of lower quality, so it is possible to greatly improve the diarization performance by processing the segments differently according to their length. The number of speakers, i.e. clusters, also affects performance: diarization performance on ‘Meeting A’ is worse than on ‘Meeting B’ because there are more speakers (6 speakers vs. 4).
|Table 1: Meeting Task – Clustering Recall (%)|
| |Meeting A|Meeting B|
|+ temp. filtering|88.20|92.52|
The second experiment, in Table 2, investigates the effect of using only the longer segments, where the quality of the aggregated d-vectors is expected to be higher, to train our feature transformations and estimate the speaker centroids. In the case of UISRNN and DEC in Tables 2 and 3, all available segments are used as input, whether the temporal filtering is applied or not. The results for the UISRNN algorithm are obtained after setting the decoding beam to 6 and allowing the algorithm to iteratively refine the results with 3 passes. To make the comparisons fair, we ignore the part of the diarization errors that corresponds to the VAD functionality and report only the recall of the system. Finally, the performance of both DEC versions, i.e. the original and the improved one, is presented. Results are reported for the raw and the processed d-vectors.
As shown, the proposed pre-processing step for the d-vectors greatly improves performance, including an additional relative improvement for the best-performing algorithm, i.e. the ‘Improv. DEC’ (last row of Table 2). The enhanced DEC algorithm yields a 6.5% improvement over the original DEC version and 19% over the Spectral Clustering results (the best baseline clustering algorithm). Note that DEC needs no embedding pre-processing, since it can keep the salient components via the autoencoder. Also, utilizing the longer segments for processing/training has an impact on the overall performance: a difference of around 19.3% can be achieved by using these segments first.
|Table 2: AMI Task – Clustering Error (%)|
| |All Seg.|All Seg+Filt.|
For the DIHARD task, we initialize the autoencoder for the DEC algorithm with the Devel set. However, the initialization of the autoencoder does not significantly affect the overall performance: we also tried initializing the autoencoder on meeting data, with similar results. We report results only on the ‘Track 1’ dataset since the use of VAD is beyond the scope of this paper. The input features are d-vectors pre-processed with the temporal filtering and median aggregation, as in Sec. 3.1. The best published results for the DIHARD task serve as the state-of-the-art reference.
|Table 3: DIHARD Clustering Errors – DER (%)|
The results in Table 3 show a relative improvement over the state-of-the-art (SoA) results (for the Eval. set) when the number of speakers is known. However, the comparison is not entirely fair: the SoA system learns, in a supervised manner, the best thresholds to determine the number of speakers, whereas herein the number of speakers is given, which is an advantage over the SoA system. A fairer comparison would be between the ‘x-Means’ performance and the SoA system, although this is also unfair because no fine-tuning of the x-Means algorithm is done. Also, the ‘x-Means’ results are 30% worse than the ‘k-Means’ ones, mainly due to the unknown number of speakers. Determining how many active speakers are present is part of future work.
In this paper, enhancements in two different components of a diarization system are proposed, i.e. the speaker embeddings and the clustering algorithms. We also investigate how factors such as the segment length affect the overall performance. We show that diarization performance for the longer segments only is 62% relatively better (38.56% for ‘Meeting B’) than when all the segments are included. The proposed approach is able to recover around 31.4% of the optimal performance while including all segments. It is also shown that it is better to use these long segments to train the clustering models and then apply the models to the shorter ones.
Further, we show the proposed enhancement in the DEC algorithm can yield up to 31% improvement in clustering performance. The additional terms in the loss function constrain the system to a smoother clustering behavior. The DEC algorithm is able to filter out non-relevant information via the bottleneck layer of the autoencoder.
Finally, it is shown that having a good estimate of the number of speakers is crucial for the overall performance of the system. Herein, we assume the number of speakers is known in advance, thus obtaining a boost in performance for the DIHARD task when comparing the k-Means with the x-Means algorithm.
We would like to thank Jeremy Wong for his work on UISRNN during his internship with Microsoft. His work is the basis for all the UISRNN-related experiments.
- [1] (2012-Feb.) Speaker diarization: a review of recent research. IEEE Transactions on Audio, Speech, and Language Processing 20 (2).
- [2] (2005) The AMI meeting corpus: a pre-announcement. In MLMI.
- [3] (2011) Front-end factor analysis for speaker verification. IEEE Transactions on Audio, Speech, and Language Processing 19 (4).
- [4] (2017-Aug.) Developing on-line speaker diarization system. In INTERSPEECH.
- [5] (2012) I-vectors and ILP clustering adapted to cross-show speaker diarization. In INTERSPEECH.
- [6] (2017-Sept.) Improved deep embedded clustering with local structure preservation. In IJCAI.
- [7] (2016) End-to-end text-dependent speaker verification. In ICASSP.
- [8] (2007) A tutorial on spectral clustering. Statistics and Computing 17 (4).
- [9] X-means: extending k-means with efficient estimation of the number of clusters. In Proceedings of the 17th International Conf. on Machine Learning.
- [10] (2004) NIST speaker recognition evaluation chronicles. In Odyssey.
- [11] (2009) A study of new approaches to speaker diarization. In INTERSPEECH.
- [12] (2018-Sept.) Diarization is hard: some experiences and lessons learned for the JHU team in the inaugural DIHARD challenge. In INTERSPEECH.
- [13] (2013-Oct.) Unsupervised methods for speaker diarization: an integrated and iterative approach. IEEE Transactions on Audio, Speech, and Language Processing 21 (10).
- [14] (2008) On the use of spectral and iterative methods for speaker diarization. In INTERSPEECH.
- [15] (2006-Sep.) An overview of automatic speaker diarization systems. IEEE Transactions on Audio, Speech, and Language Processing 14 (5).
- [16] (2014) Deep neural networks for small footprint text-dependent speaker verification. In ICASSP.
- [17] (2016-Sept.) Unsupervised deep embedding for clustering analysis. In ICML.
- [18] (2018-Sept.) Neural speech turn segmentation and affinity propagation for speaker diarization. In INTERSPEECH.
- [19] (2019) Low-latency speaker-independent continuous speech separation. In ICASSP.
- [20] (2019-Sept.) Meeting transcription using asynchronous distant microphones. In INTERSPEECH.
- [21] (2018) Fully supervised speaker diarization. arXiv:1810.04719.
- [22] (2017) End-to-end attention based text-dependent speaker verification. arXiv:1701.00562.