1 Introduction
Speaker diarization is the task of determining 'who spoke when' [15] in a multi-speaker environment. It is an essential component for a variety of applications, such as call center services and meeting transcription, hence the attention it has drawn from the research community [11, 1]. Although the process of diarization sounds inherently simple, it poses multiple challenges in practice, limiting its commercial deployment. One of the reasons is that the overall performance depends heavily on both the application scenario and the imposed constraints. For example, diarization of call center audio is mostly about separating just two speakers, often in quite diverse acoustic environments. On the other hand, diarization of meetings is much more challenging, with multiple speakers, reverberation, back-channeling, etc. As such, even determining the number of active speakers still poses a scientific challenge.
The main approach in diarization systems is to first extract noise- and environment-invariant speaker embeddings and then cluster them. Most of the previous work is based on i-vector embeddings [13, 3, 14, 5]. However, the latest research shows a shift from i-vectors to d-vectors [16, 7], i.e. features based on the bottleneck-layer output of a DNN trained for speaker recognition. Reasons for this shift include the enhanced performance, easier training with more data [21], and robustness against speaker variability and acoustic conditions. The main difference between i-vectors and d-vectors is that the latter are extracted in a frame-level fashion from the bottleneck layer. In terms of clustering, most of the literature can be grouped into two approaches. The first uses supervised clustering [21, 18], where the proposed system is trained on pre-segmented, speaker-labeled utterances. These approaches usually provide a neural network modeling the speaker profiles on-the-fly and techniques such as an HMM for the speaker sequence and segmentation. The second approach clusters the speakers in an unsupervised manner, while the algorithm decides the data modality, i.e. the number of speakers [18, 4]. Along these lines, the performance of the Deep Embedded Clustering (DEC) variations [17, 6] is also investigated and compared with k-means and Spectral clustering (assuming the number of speakers is known). For the case of an unknown number of speakers, we investigate the performance of the 'x-means' algorithm [9], where the number of speakers is determined on-the-fly. To highlight the impact of the clustering performance while the number of speakers is known, we use oracle audio segmentations. We also compare the proposed approach against the fully supervised UIS-RNN system.
This paper is structured as follows: in Sec. 2, a short overview of d-vector extraction and the existing clustering approaches is provided. In Sec. 3.1, the enhancements in the front-end are presented, i.e. the temporal filtering and the median aggregation. The refinements of the DEC algorithm are discussed in Sec. 3.2, where the loss function for training is revisited. In Sec. 4, the experimental results on three different tasks, i.e. the AMI database [2], the DIHARD corpus [12], and the internal meeting data, are presented. Finally, our latest findings are recapped and conclusions are presented in Sec. 5.
2 Background
2.1 Clustering Methods
The baseline clustering system is based on two of the most widely used clustering algorithms, i.e. k-means and Spectral clustering. The k-means algorithm is an unsupervised cluster analysis method, partitioning the samples into $K$ clusters, with $K$ known. The k-means and the expectation-maximization algorithm for GMMs are similar in the sense that both use cluster centroids, i.e. 'means', to model the data; however, k-means finds clusters of comparable spatial variance, while the expectation-maximization mechanism allows clusters with very different variances. The objective function is,
$J = \sum_{k=1}^{K} \sum_{x_i \in C_k} \| x_i - \mu_k \|^2$ (1)
where $x_i$ are the samples, $\mu_k$ the centroids and $C_k$ the sets corresponding to the respective clusters. Before applying the k-means algorithm, the input samples are whitened and their dimensionality is reduced using PCA. The centroids $\mu_k$ can be replaced by speaker profiles when initializing the process, incorporating prior knowledge into the system.
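As a minimal illustration of this baseline, the sketch below whitens a set of embeddings with PCA and clusters them with plain k-means; the shapes, component count and farthest-point initialization are illustrative assumptions, not the paper's configuration.

```python
import numpy as np

def whiten_pca(X, n_components):
    """Center the data, keep the top principal components, and scale
    each component to unit variance (whitening)."""
    Xc = X - X.mean(axis=0)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    proj = Xc @ Vt[:n_components].T
    return proj / (s[:n_components] / np.sqrt(len(X) - 1))

def kmeans(X, k, iters=50):
    """Plain k-means minimizing the objective of Eq. (1). Farthest-point
    initialization keeps this sketch deterministic; pre-computed speaker
    profiles could serve as the initial centroids instead."""
    centroids = X[[0]].astype(float)
    for _ in range(1, k):
        d = np.linalg.norm(X[:, None] - centroids[None], axis=-1).min(axis=1)
        centroids = np.vstack([centroids, X[d.argmax()]])
    for _ in range(iters):
        # Assign every sample to its nearest centroid.
        dists = np.linalg.norm(X[:, None] - centroids[None], axis=-1)
        labels = dists.argmin(axis=1)
        # Re-estimate each centroid as the mean of its cluster.
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = X[labels == j].mean(axis=0)
    return labels, centroids
```

In the AMI setup described later, the initial centroids would be the speaker profiles estimated from a previous meeting of the same speakers.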
Spectral clustering is the default state-of-the-art unsupervised clustering algorithm for diarization, providing very good performance. It is based on a similarity matrix $A$, defined as a symmetric matrix where $A_{ij}$ represents the similarity between any two data points with indices $i$ and $j$. The spectral clustering approach employed here applies k-means to the eigenvectors of the 'graph Laplacian' matrix of $A$ [8]. The intrinsic dimensionality reduction provides additional robustness to the algorithm.
Both algorithms have drawbacks, such as the random initialization step and the requirement for a preset number of clusters $K$, Eq. (1). The latter constraint is addressed with the 'x-means' algorithm [9], which may start with a lower bound on $K$ and keep splitting the clusters until a stopping criterion is reached (or the purity of the clusters reaches a certain level). Herein, the Bayesian Information Criterion (BIC) is used [9],
$\mathrm{BIC}(M_j) = \hat{\ell}_j(D) - \frac{p_j}{2} \log N$ (2)
where $\hat{\ell}_j(D)$ is the log-likelihood of the data $D$ according to the $j$-th model, $p_j$ is the number of parameters of model $M_j$ (with $M_j$ in the family of candidate models), and $N$ the number of samples. The algorithm keeps splitting the clusters until all clusters are sufficiently different according to BIC. A similar idea (albeit with a completely different approach) has been proposed in [12], using Agglomerative Hierarchical Clustering.
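As a hedged sketch of the split criterion: under a spherical-Gaussian assumption, as in the original x-means formulation, the BIC of Eq. (2) can be computed as below; the parameter count follows the x-means paper and is an assumption here, not taken from this document.

```python
import numpy as np

def bic_score(X, labels, centroids):
    """BIC of a k-centroid spherical-Gaussian model, Eq. (2):
    maximized log-likelihood minus a (p/2) log N parameter penalty."""
    N, d = X.shape
    k = len(centroids)
    # Pooled maximum-likelihood estimate of the shared spherical variance.
    sq_err = sum(((X[labels == j] - centroids[j]) ** 2).sum() for j in range(k))
    var = sq_err / (N - k)
    loglik = 0.0
    for j in range(k):
        n_j = int((labels == j).sum())
        if n_j == 0:
            continue
        cluster_sq = ((X[labels == j] - centroids[j]) ** 2).sum()
        loglik += (n_j * np.log(n_j / N)              # mixing proportions
                   - 0.5 * n_j * d * np.log(2 * np.pi * var)
                   - 0.5 * cluster_sq / var)
    # Free parameters: k-1 mixing weights, k*d centroid coords, 1 variance.
    p = (k - 1) + k * d + 1
    return loglik - 0.5 * p * np.log(N)
```

In x-means, a parent cluster is split whenever the two-cluster BIC of its samples exceeds the one-cluster BIC.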
2.2 Deep Embedded Clustering – DEC
The motivation behind DEC is to transform the input features, herein speaker embeddings, into a space that is better separable into a given number of clusters. The clusters are iteratively refined based on a target distribution [17]. First, an autoencoder is trained while injecting noise, i.e. dropout in the encoding part. The autoencoder learns a representation of the input features in a space of much lower dimensionality [17] (the embeddings), while maintaining the separability properties of the features. The encoder outputs are used as input to the clustering component and are iteratively refined by learning from their high-confidence assignments. Specifically, the DEC model is trained with the KL-divergence between the $P$ and $Q$ distributions as the loss function, matching the soft assignment $q_{ij}$ of embedding $z_i$ to cluster $j$ with the target distribution $p_{ij}$, Eq. (4),

$L_c = \mathrm{KL}(P \,\|\, Q) = \sum_i \sum_j p_{ij} \log \frac{p_{ij}}{q_{ij}}$ (3)
$q_{ij}$ and $p_{ij}$ are given by,

$q_{ij} = \frac{(1 + \| z_i - \mu_j \|^2)^{-1}}{\sum_{j'} (1 + \| z_i - \mu_{j'} \|^2)^{-1}}, \quad p_{ij} = \frac{q_{ij}^2 / f_j}{\sum_{j'} q_{ij'}^2 / f_{j'}}$ (4)
where $\mu_j$ is the centroid of the $j$-th cluster and $f_j = \sum_i q_{ij}$ is the soft cluster frequency. The DEC approach, however, presents some problems: first, the training of the autoencoder and the clustering steps are decoupled, according to Eq. (3). This may lead to trivial solutions, especially when the encoded features are not discriminative enough, since there is no gradient backpropagation to the autoencoder. An initial fix was proposed in [6], adding a second loss term to preserve the local structure of the input features while improving their separability,
$L = L_r + \gamma L_c, \quad L_r = \sum_i \| x_i - g(f(x_i)) \|^2$ (5)

where $g(f(\cdot))$ is the combined encoder and decoder mapping. Now, the autoencoder is forced to maintain local structure while improving discrimination, so the features cannot collapse into the 'trivial' space.
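The DEC soft assignments, target distribution and KL clustering loss described above can be sketched directly; the Student's-t kernel with one degree of freedom follows the DEC formulation, and the array shapes are toy assumptions.

```python
import numpy as np

def soft_assignments(Z, mu):
    """Student's-t (1 d.o.f.) soft assignment q_ij of embedding z_i
    to centroid mu_j, Eq. (4)."""
    sq = ((Z[:, None] - mu[None]) ** 2).sum(-1)
    q = 1.0 / (1.0 + sq)
    return q / q.sum(axis=1, keepdims=True)

def target_distribution(q):
    """Sharpened target p_ij: square q and normalize by the soft
    cluster frequencies f_j = sum_i q_ij, Eq. (4)."""
    w = q ** 2 / q.sum(axis=0)
    return w / w.sum(axis=1, keepdims=True)

def clustering_loss(p, q):
    """KL(P || Q) clustering loss of Eq. (3)."""
    return float((p * np.log(p / q)).sum())
```

Squaring and renormalizing pushes each row of $P$ toward its most confident cluster, which is what lets the model learn from its own high-confidence assignments.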
However, the two-term loss function does not address another fundamental problem of the algorithm: there is no constraint for avoiding empty clusters, despite the fact that the number of classes is known. A non-trivial solution might be implied by Eq. (3), but this is not adequate. The issue can be exacerbated because the features are transformed into a low-dimensionality space, according to the loss terms, without any constraints.
2.3 Unbounded Interleaved-State RNN
Lately, an online method called 'Unbounded Interleaved-State Recurrent Neural Network' (UIS-RNN) has been proposed in [21] for fully supervised speaker diarization. The algorithm takes d-vectors as input and uses an RNN to keep track of the different speakers (as different states) as they interleave in the time domain. The RNN is part of a Bayesian framework supporting an unknown number of speakers. Although there are fundamental differences from the other algorithms investigated here, we include results from this algorithm for comparison.
3 Enhancements of Speaker Clustering
3.1 Improving Speaker Embeddings
The d-vectors are extracted using a DNN (when a TDNN is used for the extraction, the embeddings are called 'x-vectors' [12]) with stacked log-mel filterbank energy coefficients as input features. The output of the network is a one-hot speaker label [22] (or, equally, the probability of that particular speaker given the current input frame). The d-vectors are the output of the second-to-last DNN layer, which is usually much shorter than the last one. This layer is called the 'bottleneck' layer and its size depends on the implementation.
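As a toy illustration of reading out the bottleneck layer (random weights and illustrative layer sizes, not the paper's network), the d-vector is simply the activation of the last hidden layer rather than the speaker posteriors:

```python
import numpy as np

def forward_dvector(window, hidden_layers, out_W, out_b):
    """Toy DNN forward pass: stacked filterbank frames in, speaker
    posteriors out. The d-vector is the bottleneck (second-to-last
    layer) activation, typically much shorter than the output layer."""
    h = window.ravel()                    # stack the context frames
    for W, b in hidden_layers:
        h = np.maximum(h @ W + b, 0.0)    # ReLU hidden layers
    dvector = h                           # bottleneck output
    logits = dvector @ out_W + out_b
    post = np.exp(logits - logits.max())  # softmax over speaker labels
    return dvector, post / post.sum()
```

At diarization time only the bottleneck output is kept; the speaker-posterior head exists solely to train discriminative embeddings.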
The frame-based nature of the d-vectors leads to noisy frame estimates, despite the relatively long input time windows of audio. Most approaches using d-vectors aggregate them over the span of a segment by averaging. As such, the length of the input segments is one of the limitations for high-quality d-vectors, with shorter segments corresponding to suboptimal clustering results, Sec. 4. Herein, we propose a different approach, where the d-vectors are first low-pass filtered and then aggregated with a median filter,
$\tilde{d}_i[n] = (h * d_i)[n] = \sum_{l=0}^{L-1} h[l] \, d_i[n-l]$ (6)

where $d_i[n]$ is the $i$-th coefficient of the $n$-th frame and $h[n]$ is a moving-average FIR filter of length $L$, i.e. $h[n] = \frac{1}{L} \sum_{l=0}^{L-1} \delta[n-l]$, where $\delta[\cdot]$ is the Dirac delta function. This filtering process results in smoother, less noisy temporal trajectories of the d-vector coefficients. A median value for each segment is then extracted from these temporally smoothed vectors,
$\hat{d}_i = \mathrm{median}\left( \tilde{d}_i[n_s], \ldots, \tilde{d}_i[n_e] \right)$ (7)

where $n_s$ and $n_e$ are the start and end frame indices of the segment, respectively. The 'smoothing and median filtering' approach has been found to outperform the widely used averaging scheme for several reasons. Assuming the d-vectors belonging to the same speaker are similar enough, the variations between adjacent d-vectors can be attributed to the phonetic content or to environmental noise, and as such they can be discarded. Additional robustness is provided by the median filtering, where outliers have a smaller impact on the aggregated values than with averaging.
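A minimal sketch of the proposed aggregation, Eqs. (6)-(7): each coefficient trajectory is low-pass filtered with a moving-average FIR filter and the segment is then summarized by the per-coefficient median (the filter length is an illustrative choice, not a value given in the paper).

```python
import numpy as np

def smooth_and_median(dvectors, filter_len=5):
    """Aggregate the (frames x coeffs) d-vectors of one segment:
    moving-average filter each coefficient trajectory, Eq. (6),
    then take the per-coefficient median, Eq. (7)."""
    h = np.ones(filter_len) / filter_len
    smoothed = np.apply_along_axis(
        lambda traj: np.convolve(traj, h, mode="same"), 0, dvectors)
    # The median down-weights outlier frames compared to averaging.
    return np.median(smoothed, axis=0)
```

Compared with plain averaging, a single outlier frame (e.g. a noise burst) barely moves the aggregated vector, since the median simply skips over the few contaminated frames.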
3.2 Improvements on Deep Embedded Clustering
The second contribution of the paper is revisiting the overall loss function and adding a few algorithmic steps to the DEC algorithm.
First, the possibility of empty clusters has to be addressed. The basic assumption of our approach is that the distribution of speaker turns is uniform across all speakers, i.e. all speakers contribute equally to the session. This assumption is not realistic in real meeting environments, but it constrains the solution space enough to avoid empty clusters without affecting overall performance. Under this assumption, the $N$ input samples are uniformly distributed over the $K$ clusters, expressed by,

$L_u = \mathrm{KL}(F \,\|\, U) = \sum_{j} f_j \log \frac{f_j}{u_j}$ (8)

where $U$ is the uniform distribution, i.e. $u_j = 1/K$. The clusters are now forced to be more balanced, since clusters deviating from the uniform distribution are penalized.
An additional loss term penalizes the distance of the embeddings $z_i$ from their assigned centroids $\mu_{c_i}$. This MSE term is given by,

$L_m = \frac{1}{N} \sum_{i=1}^{N} \| z_i - \mu_{c_i} \|^2$ (9)

This is similar to the k-means criterion in Eq. (1), but it is now expressed as part of the loss function.
The loss function of the revisited DEC algorithm now becomes,

$L = L_r + \gamma L_c + \beta L_u + \delta L_m$ (10)

Although not presented here, the weights $\gamma$, $\beta$ and $\delta$ can be fine-tuned on some held-out data.
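The full loss of Eq. (10) can be sketched as below; the weight values are illustrative placeholders (the paper tunes them on held-out data), and the term implementations follow Eqs. (3), (5), (8) and (9).

```python
import numpy as np

def improved_dec_loss(q, p, X, X_rec, Z, mu, labels,
                      gamma=0.1, beta=0.1, delta=0.1):
    """Total loss of Eq. (10): reconstruction plus weighted clustering,
    uniformity and centroid-distance terms (illustrative weights)."""
    # Reconstruction term keeps the encoder from collapsing, cf. Eq. (5).
    l_rec = ((X - X_rec) ** 2).sum(axis=1).mean()
    # Clustering term: KL(P || Q), Eq. (3).
    l_clust = (p * np.log(p / q)).sum()
    # Uniformity term: KL between the soft cluster frequencies and the
    # uniform distribution penalizes empty clusters, Eq. (8).
    f = q.sum(axis=0) / q.sum()
    l_unif = (f * np.log(f * len(f))).sum()
    # Centroid term: k-means-style MSE to the assigned centroids, Eq. (9).
    l_mse = ((Z - mu[labels]) ** 2).sum(axis=1).mean()
    return float(l_rec + gamma * l_clust + beta * l_unif + delta * l_mse)
```

All four terms are non-negative, so the loss is bounded below by zero; the uniformity term grows quickly as any cluster's soft frequency approaches zero, which is what rules out empty clusters.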
Finally, an additional k-means 'recalibration' step is included every few training iterations. The DEC algorithm uses k-means to initialize the centroids and then runs iteratively based on the loss function. In our experience, the $Q$ distribution, Eq. (3), can diverge from the target distribution $P$, making a reset necessary. Such a reset on the iteratively refined features ensures that the system cannot drift to an 'ill-conditioned' solution.
4 Experiments
4.1 System Setup
We investigate the performance of the proposed components on three different tasks: the AMI corpus [2], the DIHARD corpus [12] and an internal meeting transcription task. The AMI dataset consists of 166 meetings, about 100 h of audio, with 4 speakers each, captured by multiple lapel mics and 2 microphone arrays. We use only the lapel recordings (one per speaker) for the segmentation and the mixed audio for diarization (the AMI dataset provides this signal, with the 4 lapel channels mixed into one). Each set of speakers appears in 4 meetings. The DIHARD set is a collection of diverse recordings with a varying number of speakers, noise conditions and spoken languages. The task contains two tracks, with and without the transcriptions given. Herein, we utilize only the first track (with the known segmentations). Finally, the third dataset contains two 1-hour-long internal meetings, 'Meeting A' and 'Meeting B', with 6 and 4 participants, respectively. The audio is recorded with a microphone array and processed by a fixed beamformer [19], keeping the top beam, i.e. the most active in terms of signal energy. This single-channel audio is then processed for diarization. In all but the DIHARD task, and for the case of x-means in Table 3, the number of speakers and the segmentations are considered given. Any silence shorter than the collar is treated as 'speech' for training, testing and scoring. We use the ground-truth segmentations provided by the databases for the d-vector aggregation, and the time boundaries are considered as potential speaker turns. For the internal meeting data, we use the segmentations provided by the Microsoft ASR decoder; therefore, there are (short) silence segments present in the ground-truth segmentations. Finally, since there is no overlapping speech detection (about 10% of the speech is considered overlapping [20]), segments with more than one active speaker are assigned to the speaker already talking. Consequently, the diarization results appear worse, in the sense that some segments with overlapping speech are simply ignored.
The d-vector network is trained on text-dependent utterances, i.e. wake-up phrases [22]. The d-vector length, i.e. the bottleneck layer, is 200 coefficients, and the network takes stacked frames of log-mel filterbank energy coefficients as input. There is no overlap between the speakers in the training set and the speakers in the test audio.
We use PCA for dimensionality reduction of the input to the clustering algorithms; the resulting d-vectors are 70 coefficients long. In the case of the AMI task, the same set of speakers is found in 4 separate sessions/meetings. Thus, the diarization output of the first meeting can be used to create speaker profiles, serving as initial centroids for the rest of the meetings, Eq. (1). Finally, we use the original features in the case of Spectral clustering and DEC, since they include an intrinsic dimensionality reduction step.
The autoencoder for the DEC algorithm is built from dense layers. The network is trained with the loss function of Eq. (10), using the Adamax optimizer and a batch size of 64.
The UIS-RNN model is trained on about 100 h of transcribed internal meeting data, with the number of speakers varying per meeting. The front-end for the UIS-RNN system is the same d-vector network as described above. For the UIS-RNN, we use a beam of 6 and 3 passes to further refine the diarization results.
Finally, scoring is performed with the NIST tool that is standard for diarization evaluations [10].
4.2 Results
First, we investigate how the length of the speech segments affects the diarization performance when training the PCA transformations and estimating the speaker centroids. We present results in Tables 1 and 2, where segments shorter than a given threshold are ignored. As mentioned, 'Meeting A' contains 6 speakers and 'Meeting B' only 4. We use k-means for the clustering part, and the metric is the 'Clustering Recall', i.e. the ratio of correctly assigned speech over the available speech (in seconds).
The diarization performance improves, as expected, when ignoring these short segments, Table 1. However, this is not a viable solution, since the shorter segments remain unassigned. The 'temporal and median' filtering improves the overall performance, reaching close to the best possible performance (last row of the table) without ignoring any segments. Shorter segments provide embeddings of lower quality, so the diarization performance can be greatly improved by differentiating the segment processing according to segment length. The number of speakers, i.e. clusters, also affects performance: diarization performance on 'Meeting A' is worse than on 'Meeting B' because there are more speakers (6 vs. 4).
Table 1: Meeting Task – Clustering Recall (%)

                                  Meeting A   Meeting B
UIS-RNN                             81.39       94.20
All segments                        82.80       91.88
Ignore segments                     89.53       91.63
Ignore segments                     91.38       92.85
Ignore segments                     93.38       95.05
Ignore segments                     51.95       56.74
All segments + temp. filtering      88.20       92.52
The second experiment, in Table 2, investigates the effect of using only the longer segments, where the quality of the aggregated d-vectors is expected to be higher, to train the feature transformations and estimate the speaker centroids. In the case of UIS-RNN and DEC in Tables 2 and 3, all available segments are used as input, whether the temporal filtering is applied or not. The results for the UIS-RNN algorithm are provided after setting the decoding beam to 6 and allowing the algorithm to iteratively refine the results with 3 passes. To make the comparisons fair, we ignore the part of the diarization errors that corresponds to the VAD functionality and report only the recall of the system. Finally, the performance of both DEC versions, the original and the improved one, is presented, with results reported for both the raw and the processed d-vectors.
As shown, the proposed pre-processing step for the d-vectors greatly improves performance, with an additional relative improvement for the best performing algorithm, 'Impr. DEC' (last row of Table 2). The enhancement of the DEC algorithm yields a 6.5% relative improvement over the original DEC version and 19% over the Spectral clustering results (the best baseline clustering algorithm). Note that no embedding pre-processing is needed here: the DEC algorithm can keep the salient components via the autoencoder. Also, utilizing the longer segments for processing/training makes an impact on the overall performance; a difference of around 19.3% can be achieved by using these segments first.
Table 2: AMI Task – Clustering Error (%)

                All Seg.   All Seg.+Filt.
UIS-RNN          12.52      N/A      N/A      N/A
Orig. DEC        11.41      N/A      N/A      12.43
k-means          17.70      16.32    13.56    16.44
Spectral Cl.     13.52      13.41    10.65    13.20
x-means          19.37      17.90    13.69    17.69
Impr. DEC        10.66      N/A      N/A      11.87
For the DIHARD task, we initialized the autoencoder for the DEC algorithm on the Devel set. However, the initialization of the autoencoder does not significantly affect the overall performance; we also tried initializing the autoencoder on meeting data, with similar results. We report results only on the 'Track 1' dataset, since the use of VAD is beyond the scope of this paper. The input features are d-vectors pre-processed with the temporal filtering and median aggregation of Sec. 3.1. The best published results for the DIHARD task can be found in [12].
Table 3: DIHARD Clustering Errors – DER (%)

                          Devel    Eval
State-of-the-art [12]     18.17    23.99
Original DEC              28.17    30.38
k-means                   17.90    19.77
Spectral                  18.36    19.32
x-means                   19.33    25.99
Improved DEC              18.69    21.40
The results in Table 3 show a relative improvement over the state-of-the-art (SoA) results on the Eval set, when the number of speakers is known. However, the comparison is not entirely fair: the system in [12] learns the best thresholds for determining the number of speakers in a supervised manner, whereas herein the number of speakers is given, an advantage over the SoA system. A fairer comparison is between the 'x-means' performance and the SoA system (although this, too, is not entirely fair, since no fine-tuning of the x-means algorithm was done). Also, the 'x-means' results are 30% worse than the 'k-means' ones, mainly due to the unknown number of speakers. Determining how many active speakers are present is part of our future work.
5 Conclusions
In this paper, enhancements in two different components of a diarization system are proposed, i.e. the speaker embeddings and the clustering algorithms. We also investigate how factors such as the segment length affect the overall performance. We show that diarization performance when keeping only segments longer than a given threshold is 62% relatively better (38.56% for 'Meeting B') than when all segments are included. The proposed approach is able to recover around 31.4% of the optimal performance while including all segments. It is also better to use the long segments to train the clustering models and then apply them to the shorter ones.
Further, we show that the proposed enhancements of the DEC algorithm can yield up to a 31% improvement in clustering performance. The additional terms in the loss function constrain the system to a smoother clustering behavior, and the DEC algorithm is able to filter out non-relevant information via the bottleneck layer of the autoencoder.
Finally, it is shown that a good estimate of the number of speakers is crucial for the overall performance of the system. Herein, we assume the number of speakers is known in advance, obtaining a substantial boost in performance for the DIHARD task when comparing the k-means with the x-means algorithm.
6 Acknowledgments
We would like to thank Jeremy Wong for his work on UISRNN during his internship with Microsoft. His work is the basis for all the UISRNNrelated experiments.
References
 [1] (2012, Feb.) Speaker diarization: a review of recent research. IEEE Transactions on Audio, Speech, and Language Processing 20 (2). Cited by: §1.
 [2] (2005) The AMI meeting corpus: a pre-announcement. In MLMI. Cited by: §1, §4.1.
 [3] (2011) Front-end factor analysis for speaker verification. IEEE Transactions on Audio, Speech, and Language Processing 19 (4). Cited by: §1.
 [4] (2017, Aug.) Developing on-line speaker diarization system. In Interspeech. Cited by: §1.
 [5] (2012) I-vectors and ILP clustering adapted to cross-show speaker diarization. In INTERSPEECH. Cited by: §1.
 [6] (2017, Sept.) Improved deep embedded clustering with local structure preservation. In IJCAI. Cited by: §1, §2.2.
 [7] (2016) End-to-end text-dependent speaker verification. In ICASSP. Cited by: §1.
 [8] (2007) A tutorial on spectral clustering. Statistics and Computing 17 (4), arXiv:0711.0189. Cited by: §2.1.
 [9] (2000) X-means: extending k-means with efficient estimation of the number of clusters. In Proceedings of the 17th International Conf. on Machine Learning. Cited by: §1, §2.1.
 [10] (2004) NIST speaker recognition evaluation chronicles. In Odyssey. Cited by: §4.1.
 [11] (2009) A study of new approaches to speaker diarization. In INTERSPEECH. Cited by: §1.
 [12] (2018, Sept.) Diarization is hard: some experiences and lessons learned for the JHU team in the inaugural DIHARD challenge. In INTERSPEECH. Cited by: §1, §2.1, §4.1, §4.2, Table 3, footnote 1.
 [13] (2013, Oct.) Unsupervised methods for speaker diarization: an integrated and iterative approach. IEEE Transactions on Audio, Speech, and Language Processing 21 (10). Cited by: §1.
 [14] (2008) On the use of spectral and iterative methods for speaker diarization. In INTERSPEECH. Cited by: §1.
 [15] (2006, Sep.) An overview of automatic speaker diarization systems. IEEE Transactions on Audio, Speech, and Language Processing 14 (5). Cited by: §1.
 [16] (2014) Deep neural networks for small footprint text-dependent speaker verification. In ICASSP. Cited by: §1.
 [17] (2016, Sept.) Unsupervised deep embedding for clustering analysis. In ICML. Cited by: §1, §2.2.
 [18] (2018, Sept.) Neural speech turn segmentation and affinity propagation for speaker diarization. In INTERSPEECH. Cited by: §1.
 [19] (2019) Low-latency speaker-independent continuous speech separation. In ICASSP. Cited by: §4.1.
 [20] (2019, Sept.) Meeting transcription using asynchronous distant microphones. In Interspeech. Cited by: footnote 3.
 [21] (2018) Fully supervised speaker diarization. arXiv:1810.04719. Cited by: §1, §2.3.
 [22] (2017) End-to-end attention based text-dependent speaker verification. arXiv:1701.00562. Cited by: §3.1, §4.1.