End-to-end speaker diarization has attracted considerable attention recently. Despite its strengths in handling overlapping speech, its limitations in processing long meetings and large numbers of speakers have hindered large-scale adoption. Considering these limitations, researchers have devoted considerable effort to combining clustering-based approaches with end-to-end diarization in order to take advantage of both systems.
Clustering-based approaches are known to be more robust on large meetings and cross-domain datasets. Hence these methods remain so far irreplaceable in real-world applications of speaker diarization. However, most diarization systems forgo the search for optimally performing clustering methods and readily adopt common practices such as k-means, spectral clustering, and agglomerative hierarchical clustering (AHC).
Despite their simplicity and popularity, these clustering methods may not be the most suitable for clustering speaker-segments. For example, k-means only considers the relative distance to cluster centroids and fails to recognize the topological structure of the distribution. Spectral clustering, like k-means, has difficulty determining the number of classes, which is essential in a speaker diarization system.
In this work we propose to reformulate the clustering of speaker-segments as a network community detection problem. By traversing the network we seek to understand its entire structure. When making a decision for each node, both local and global topology are considered. The movement of a node to another community has an impact on the community structure of the whole network; the objective is to optimize this impact.
The Louvain algorithm has become one of the mainstream community detection algorithms since its proposal. Its popularity has risen rapidly across various areas of study due to its simplicity and effectiveness. However, it has been proven that the Louvain algorithm can result in arbitrarily badly connected communities. For the clustering of speaker-segments, this would lead to the false aggregation of two speakers into one. In this paper we introduce the recently proposed Leiden algorithm, which guarantees that all communities are well-connected.
Dimension reduction also has a positive effect on the clustering of speaker embeddings. A speaker embedding encodes information such as voice, speech content, channel, and environment. The embedding extraction network is trained on a set of hundreds of thousands of speakers. A high-dimensional embedding space allows for larger margins between a large number of distinct classes during training. During inference, however, we only need to distinguish among the few speakers present in a meeting. Hence the information encoded in an embedding may be redundant or even harmful due to the curse of dimensionality. It is therefore of interest to project embeddings onto a lower-dimensional space that best serves to distinguish between the speakers in the given meeting. In this paper we propose to use a novel dimension reduction technique named Uniform Manifold Approximation and Projection (UMAP). This manifold learning technique stems from theoretical foundations in Riemannian geometry and algebraic topology and has proven effective in our task.
The length of speaker-segments also matters. Longer segments may contain multiple speakers, which generates noisy embeddings for clustering. On the other hand, shorter segments result in larger variance in the embeddings, likewise introducing noise to clustering. To deal with this dilemma, a masked filtering technique is presented. The filtering process adopts a “winner takes all” strategy: the dominating speaker is chosen as the target speaker for the segment, and the corresponding embedding is used as the target reference for the extraction of a clean embedding.
Upon determining the community partitions for all speaker-segments, we seek to further improve performance by refining the results within and between consecutive segments: locating precise speaker change points, smoothing frame-level results, and handling overlapping segments. These tasks can be addressed by an end-to-end post-processing component. In this work we extend the end-to-end mechanism for post-processing to effectively integrate the partition results from the community structure.
2.1 Leiden Community Detection
The construction of a community detection network is straightforward. Speaker embeddings extracted from each segment are formulated as nodes in the network, and similarity scores between pairs of embeddings are the edge values. The goal is to find the optimal partition that describes the community structure.
Modularity is selected as the optimization function and is represented as

$$Q = \frac{1}{2m} \sum_{ij} \left[ A_{ij} - \frac{k_i k_j}{2m} \right] \delta(c_i, c_j),$$

where $A$ is the adjacency matrix, $m$ is the number of edges in the whole network, $k_i$ is the degree of vertex $i$, $c_i$ is the community vertex $i$ belongs to, and $\delta(c_i, c_j)$ is the Kronecker delta function, which equals 1 if $c_i = c_j$ and 0 otherwise.
The Leiden algorithm works as follows:
Step 1. For each node whose neighborhood has changed, move it to a different community until no movement of nodes can increase modularity. This results in a partition $\mathcal{P}$.
Step 2. Refine $\mathcal{P}$: within each community in $\mathcal{P}$, start from a singleton partition and locally merge nodes to form $\mathcal{P}_{\mathrm{refined}}$.
Step 3. Create an aggregated network based on the refined partition $\mathcal{P}_{\mathrm{refined}}$.
Step 4. Repeat Steps 1-3 until no further improvements can be made.
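As an illustration of the local moving phase (Step 1), the following pure-Python sketch greedily moves nodes between communities whenever the move increases modularity. It is a simplified toy, not the full Leiden implementation (no refinement or aggregation phases), and the example graph is an assumption.

```python
def modularity(adj, comm):
    """Modularity Q for an adjacency dict {node: set(neighbours)}."""
    m = sum(len(nb) for nb in adj.values()) / 2           # edge count
    deg = {u: len(nb) for u, nb in adj.items()}
    q = 0.0
    for i in adj:
        for j in adj:
            if comm[i] == comm[j]:
                a_ij = 1.0 if j in adj[i] else 0.0
                q += a_ij - deg[i] * deg[j] / (2 * m)
    return q / (2 * m)

def local_move(adj):
    """Step-1-style greedy local moving: start from a singleton
    partition and move each node to the neighbouring community with
    the best modularity gain, until no move improves Q."""
    comm = {u: u for u in adj}                            # singleton partition
    improved = True
    while improved:
        improved = False
        for u in adj:
            base = modularity(adj, comm)
            best_q, best_c = base, comm[u]
            for c in {comm[v] for v in adj[u]}:           # neighbour communities
                old = comm[u]
                comm[u] = c
                q = modularity(adj, comm)
                comm[u] = old
                if q > best_q + 1e-12:
                    best_q, best_c = q, c
            if best_c != comm[u]:
                comm[u] = best_c
                improved = True
    return comm

# Two disconnected triangles: local moving recovers the two communities.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1}, 3: {4, 5}, 4: {3, 5}, 5: {3, 4}}
part = local_move(adj)
print(len(set(part.values())))  # 2
```

Recomputing Q from scratch for every candidate move is quadratic and only acceptable for a toy; real implementations track modularity gains incrementally and maintain a queue of nodes whose neighbourhood has changed, which is exactly the speedup Leiden exploits.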
2.2 Uniform Manifold Approximation
UMAP presumes the existence of a locally connected manifold on which the speaker embeddings are uniformly distributed and aims to preserve the topological structure of this manifold.
Similar to other dimension reduction algorithms, the UMAP mechanism can be summarized in two stages. First, it constructs a suitable weighted k-neighbour graph; then, it estimates a low-dimensional projection of the graph.
Let $X = \{x_1, \dots, x_N\}$ be the embeddings of the speaker-segments in the meeting of interest. The weight between an embedding $x_i$ and its $j$-th nearest neighbour $x_{i_j}$ is given by

$$w(x_i, x_{i_j}) = \exp\!\left( \frac{-\max\bigl(0,\, d(x_i, x_{i_j}) - \rho_i\bigr)}{\sigma_i} \right)$$

for $j = 1, \dots, k$. Here $\rho_i$, the distance from $x_i$ to its nearest neighbour, exemplifies the local-connectivity assumption, and $\sigma_i$ is a normalization factor defining the Riemannian metric on which $x_i$ lies such that the data is uniformly distributed.
A weighted graph can thus be constructed based on the above weight function. Theoretically, the graph is the 1-skeleton of the fuzzy simplicial set associated with the Riemannian metric space ambient to $X$. Let $B$ be the adjacency matrix of this graph.
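For illustration, the following sketch (our own toy code; the distance list is an assumed example) computes the neighbour weights for one point: the local radius rho is the distance to the nearest neighbour, and the normalization sigma is found by binary search so that the weights sum to log2(k), as in the UMAP reference implementation.

```python
import math

def knn_weights(dists, k):
    """Given ascending distances from a point to its k nearest
    neighbours, return fuzzy edge weights exp(-(d - rho) / sigma).
    rho = distance to the nearest neighbour (local connectivity);
    sigma is solved by binary search so the weights sum to log2(k)."""
    rho = dists[0]
    target = math.log2(k)
    lo, hi = 1e-6, 1e6
    for _ in range(100):                       # binary search on sigma
        sigma = (lo + hi) / 2
        s = sum(math.exp(-max(0.0, d - rho) / sigma) for d in dists)
        if s > target:
            hi = sigma                         # weights too large -> shrink sigma
        else:
            lo = sigma
    return [math.exp(-max(0.0, d - rho) / sigma) for d in dists]

w = knn_weights([0.5, 1.0, 1.5, 2.0], k=4)
print(round(w[0], 3))                          # nearest neighbour has weight 1.0
```

Note that the nearest neighbour always receives weight exactly 1 regardless of sigma, which is how the local-connectivity assumption manifests in the weights.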
UMAP is optimized through the fuzzy set cross entropy

$$C(\mu, \nu) = \sum_{a} \left[ \mu(a) \log\frac{\mu(a)}{\nu(a)} + (1 - \mu(a)) \log\frac{1 - \mu(a)}{1 - \nu(a)} \right],$$

where $\mu$ and $\nu$ are the membership functions of the two fuzzy sets. Note that $C(\mu, \nu)$ is written as the sum of two components. The first component inside the summation serves to find local clustering, i.e. high-density regions in the graph; the second component aims to preserve the global topological structure.
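As a concrete check of this objective, here is a small illustrative sketch (our own code, with clipping added to avoid log(0)); the membership lists are made-up examples.

```python
import math

def fuzzy_cross_entropy(mu, nu, eps=1e-12):
    """Fuzzy set cross entropy between membership values mu and nu
    (lists of probabilities over the same set of elements).
    The first term pulls high-weight pairs together (local structure);
    the second pushes low-weight pairs apart (global structure)."""
    c = 0.0
    for a, b in zip(mu, nu):
        a = min(max(a, eps), 1 - eps)          # clip to avoid log(0)
        b = min(max(b, eps), 1 - eps)
        c += a * math.log(a / b) + (1 - a) * math.log((1 - a) / (1 - b))
    return c

mu = [0.9, 0.1, 0.5]
print(fuzzy_cross_entropy(mu, mu))             # 0.0 when the two sets agree
```

The cross entropy is zero exactly when the low-dimensional membership function matches the high-dimensional one, and strictly positive otherwise, which is what the optimization drives toward.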
The t-SNE algorithm was one of the most popular dimension reduction algorithms before UMAP was proposed. However, t-SNE fails to preserve global structure the way UMAP does. In addition, t-SNE is in practice limited to 2- or 3-dimensional target spaces, while UMAP allows reduction to any dimension that is optimal for the data of interest. Therefore, UMAP is more suitable for tasks requiring further processing, such as clustering, after dimension reduction.
2.3 “Winner takes all” Masked Filtering
We also seek to improve the basic unit of community detection: the speaker embeddings. Most clustering-based speaker diarization systems split the audio into segments of 1 to 2 seconds, based on the presumption that a short 2-second segment contains no more than one speaker. As segment length increases, this presumption no longer holds. In our system we boldly increase the length of each segment to 4 seconds, with the understanding that longer segments lead to more robust embeddings.
In order to extract relatively clean embeddings from longer segments, we introduce a “winner-takes-all” masked filtering system as illustrated in Figure 1. The filterbank features of each segment are passed through several D-TDNN layers. The frame-level speaker features are then clustered into 2 classes using k-means, and features from the dominating component are used as the target embedding. The masked prediction network is trained in the same manner as described in .
One may wonder why we bother with masked filtering instead of simply using the Target Embedding in Figure 1 as the output. The mask prediction component serves two purposes. First, frames from the winner speaker may still contain overlapping speech from other speakers, which would bias the target embedding; masked filtering allows us to extract relatively clean embeddings from overlapping speech, as illustrated in our previous work. Second, if there is only one speaker in the entire segment, it is in our best interest to include as much information as possible. By design, the target embedding could ignore up to half of the frames containing useful information, which contradicts our motivation of extracting embeddings from longer segments to minimize variance.
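The winner-takes-all selection step can be sketched as follows (our own illustration; the 2-D frame vectors, deterministic initialization, and iteration count are assumptions, while the actual system operates on D-TDNN frame-level features).

```python
def winner_pool(frames, iters=10):
    """2-class k-means over frame vectors, then mean-pool the
    dominating (larger) cluster as the target embedding."""
    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    # Deterministic init: first frame and the frame farthest from it.
    c0 = frames[0]
    c1 = max(frames, key=lambda f: dist2(f, c0))
    cents = [c0, c1]
    for _ in range(iters):                       # Lloyd iterations
        groups = [[], []]
        for f in frames:
            groups[0 if dist2(f, cents[0]) <= dist2(f, cents[1]) else 1].append(f)
        cents = [
            [sum(col) / len(g) for col in zip(*g)] if g else cents[i]
            for i, g in enumerate(groups)
        ]
    winner = max(groups, key=len)                # the dominating speaker's frames
    return [sum(col) / len(winner) for col in zip(*winner)]

# Six frames near the origin (dominant speaker), three near (10, 10).
frames = [[0.1, 0.0], [0.0, 0.2], [-0.1, 0.1], [0.2, -0.1], [0.0, 0.0],
          [0.1, 0.1], [10.0, 10.1], [9.9, 10.0], [10.1, 9.9]]
emb = winner_pool(frames)
print(all(abs(v) < 1.0 for v in emb))            # pooled from the origin cluster
```

In this toy the pooled embedding stays near the origin, ignoring the minority cluster entirely; in the full system this pooled vector is only the target reference, and the mask prediction network then recovers a clean embedding rather than simply discarding frames.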
2.4 End-to-end Post Processing with Community Partitions
While clustering-based diarization has superior performance in determining the number of speakers, correctly classifying speaker-segments, and handling domain mismatch between training and testing data, we recognize that end-to-end approaches are better suited to processing local information, such as handling overlapping speech and finding precise speaker change points. Therefore we propose to apply an end-to-end post-processing component to the results of community detection.
The end-to-end post-processing architecture is a modification of the EEND-EDA system described in , in which the Encoder-Decoder Attractor (EDA) component stores a flexible number of speakers. As indicated by the experiments, the EDA component is limited in predicting the correct number of speakers. Hence we replace the EDA module with the results of community partitioning and fix the number of speakers to the number of communities. During training and inference, the zero-vector inputs of EDA are replaced by the mean of each community partition, and the generation of new attractors is disallowed. For simplicity of reference, we name this end-to-end post-processing approach based on community partitions EEND-CommPart in later sections.
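The replacement step amounts to computing one attractor per community as the mean of that community's embeddings. A simplified sketch (our own illustration; the labels and 3-dimensional embeddings are made up):

```python
def community_attractors(embeddings, labels):
    """One attractor per community: the mean of its member embeddings.
    These replace the zero-vector inputs of the EDA module, fixing the
    speaker count to the number of communities."""
    by_comm = {}
    for emb, lab in zip(embeddings, labels):
        by_comm.setdefault(lab, []).append(emb)
    return {
        lab: [sum(col) / len(members) for col in zip(*members)]
        for lab, members in sorted(by_comm.items())
    }

embs = [[1.0, 0.0, 0.0], [3.0, 0.0, 0.0], [0.0, 2.0, 2.0]]
labels = [0, 0, 1]
print(community_attractors(embs, labels))
# {0: [2.0, 0.0, 0.0], 1: [0.0, 2.0, 2.0]}
```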
Since EEND-CommPart is aimed at processing local information, its inputs are short segments. In our experiments, for example, the inputs consist of two adjacent segments, making the total length 8 seconds. During training, longer segments are included to increase variety.
3 Experiments & Results
The train and test sets used in the experiments are simulated from the NIST SRE corpus. The training corpus consists of 57,517 utterances from 5,767 speakers in the NIST SRE 04-10 corpora. Performance is evaluated on the SRE10 evaluation set. When evaluating diarization performance, utterances from a random 2-8 speakers are selected to simulate meetings of at least 30 minutes. The overlap ratio varies within a fixed range in each of the simulated meetings.
In the first experiment we evaluate the performance of several diarization systems on simulated long meetings with 2, 4, 6, and 8 participants, respectively. The components described above, namely “winner takes all” masked filtering, UMAP-Leiden community detection, and EEND-CommPart, are included or excluded in different runs in order to see their effects on the overall performance.
In the second experiment we focus on evaluating the performance of several clustering approaches. Speaker-segments ranging from 2-4 seconds are randomly selected.
Specifically, we compare k-means, spectral clustering, AHC, and UMAP-Leiden. We are also interested in how EEND-EDA performs at predicting the number of speakers compared to the clustering approaches.
For k-means and spectral clustering, we use the curve of descending eigenvalues to estimate the number of classes, as described in . The number of classes for AHC is counted once the stopping criterion is met; the stopping criterion is determined on a small development set. Three hyperparameters need to be fine-tuned for the UMAP dimension reduction algorithm: the number of nearest neighbors k, the target dimension, and the expected distance that determines how tightly points are packed together. For the Leiden algorithm, the most important hyperparameter to fine-tune is the resolution.
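The eigenvalue-curve heuristic can be sketched as picking the largest drop in the descending eigenvalue sequence (illustrative code; the eigenvalues are made-up numbers, not measurements from our system).

```python
def estimate_num_classes(eigvals):
    """Estimate the number of classes from eigenvalues sorted in
    descending order: the count of eigenvalues before the largest gap."""
    gaps = [eigvals[i] - eigvals[i + 1] for i in range(len(eigvals) - 1)]
    return gaps.index(max(gaps)) + 1

# A sharp drop after the third eigenvalue suggests three classes.
print(estimate_num_classes([5.0, 4.8, 4.5, 0.3, 0.2]))  # 3
```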
The mask prediction network is trained with the same setup as described in . The only difference is that, instead of taking a rough mean pooling as the target embedding, we conduct a 2-class clustering and run mean pooling only on the dominating class.
3.3 Results & Analysis
As Table 1 suggests, system No.6 has the optimal performance when there are 4 or more speakers. System No.6 includes masked filtering as pre-processing, UMAP-Leiden as clustering, and EEND-CommPart as post-processing. System No.2, EEND-EDA, has the optimal performance for 2-speaker cases, but its performance degrades significantly as the number of speakers increases. System No.6 outperforms the EEND-EDA and k-means clustering baselines by a remarkable margin in the 4-, 6-, and 8-speaker cases.
Comparing system No.1 to No.4, as well as system No.3 to No.5, we can see that UMAP-Leiden introduces remarkable gains in performance. System No.5 adds masked filtering to system No.4, showing that obtaining cleaner embeddings has positive effects. Finally, the results of system No.6 indicate that EEND-CommPart post-processing allows for better handling of overlapping speech and better local refinements.
|System ID|Pre-Processing|Clustering Methods|Post-Processing|DER(%) 2-Spk|DER(%) 4-Spk|DER(%) 6-Spk|DER(%) 8-Spk|
|---|---|---|---|---|---|---|---|
|No 3|Masked Filtering|k-means|/|9.6|23.0|33.8|37.2|
Now we break down the gains attributable to the clustering performance of the UMAP-Leiden algorithm. Table 2 displays the performance of the different clustering methods when the actual number of speakers is 1, 2, 4, 6, 8, and 10, respectively. For each case, 500 clustering tests are simulated; for example, for #Spks=2, randomly shuffled segments from 2 random speakers are selected to construct a clustering test instance, and this is repeated 500 times. The F-score and #Spks prediction accuracy are averaged over the 500 clustering results. #Spks prediction accuracy measures the fraction of runs in which the predicted number of speakers equals the actual number of speakers.
According to Table 2, when the actual number of speakers in the meeting is either 1 or 2, k-means, spectral clustering, and UMAP-Leiden show similarly competitive performance, with spectral clustering slightly better in the 1-speaker case and k-means slightly better in the 2-speaker case. The competitive performance of k-means and spectral clustering on very few speakers may be a major reason that researchers lack the motivation to explore more sophisticated clustering methods for speaker diarization.
As the number of speakers increases, the performance of k-means and spectral clustering drops significantly, largely due to difficulties in estimating the correct number of speakers. The UMAP-Leiden algorithm, on the other hand, performs reasonably well with larger numbers of speakers. We also note that EEND-EDA performs well in the 2-speaker situation but has relatively poor performance at estimating the number of speakers for larger #Spks. This is consistent with the observation from Table 1 that EEND-EDA has the lowest DER for #Spk=2 but its DER increases rapidly with more speakers. This is our main motivation for replacing the EDA module with the results of UMAP-Leiden partitions in our end-to-end post-processing component.
|#Spks|Methods|#Spk Prediction Accuracy|F-score|
|---|---|---|---|
Table 3 compares the computation costs of the different clustering approaches. The community detection approaches, Louvain and Leiden, are the most efficient in running time. Leiden is faster than Louvain because it adopts a fast local moving procedure that visits only nodes whose neighbourhood has changed, whereas Louvain revisits every node in the network on each pass.
In this paper we propose to reformulate speaker diarization as a community detection problem. We introduce Leiden as the community detection algorithm for its optimal performance on the dataset used in our experiments. Other community detection algorithms, such as Infomap and Louvain, have been shown to be slightly better on some datasets; therefore, the choice of algorithm depends on the data of interest. Fortunately, implementations of all the above-mentioned community detection and dimension reduction algorithms are relatively simple, and a quick comparison can be done with minimal effort.
We find ways to extract speaker embeddings from longer segments to reduce variance, without being affected by other speakers present in the segment, by introducing a masked filtering approach. This turns out to be most beneficial in meetings with large portions of overlapping speech and frequent speaker turns. Considering its computation cost, masked filtering can optionally be turned off for more structured and organized meetings. In the future we are interested in exploring more efficient methods to serve this purpose.
Finally, we propose the EEND-CommPart post-processing component to handle overlapping speech and polish local results. The system leverages the edge of end-to-end methods in detailed refinement and the advantage of clustering-based approaches in global speaker counting.
Through this work we show that clustering-based diarization approaches still have substantial room for improvement. We observed remarkable gains simply by replacing k-means with community detection algorithms. We hope this result inspires more studies on clustering methods for speaker diarization.
- (2008) Fast unfolding of communities in large networks. Journal of Statistical Mechanics: Theory and Experiment 2008 (10), P10008.
- (2009) Community detection in graphs. CoRR abs/0906.0612.
- (2019) End-to-end neural speaker diarization with permutation-free objectives. In Interspeech 2019, Graz, Austria, pp. 4300–4304.
- (2019) End-to-end neural speaker diarization with self-attention. In IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2019, Singapore.
- (2017) Speaker diarization using deep neural network embeddings. In ICASSP 2017, New Orleans, LA, USA, pp. 4930–4934.
- (2020) End-to-end speaker diarization for an unknown number of speakers with encoder-decoder based attractors. In Interspeech 2020, Shanghai, China.
- (2021) End-to-end speaker diarization as post-processing. In ICASSP 2021, Toronto, ON, Canada, pp. 7188–7192.
- (2021) Advances in integration of end-to-end neural and clustering-based diarization for real conversational speech. CoRR abs/2105.09040.
- (2021) Integrating end-to-end neural and clustering-based diarization: getting the best of both worlds. In ICASSP 2021, Toronto, ON, Canada, pp. 7198–7202.
- (2019) LSTM based similarity measurement with spectral clustering for speaker diarization. In Interspeech 2019, Graz, Austria, pp. 366–370.
- (2019) Speaker diarization using leave-one-out Gaussian PLDA clustering of DNN embeddings. In Interspeech 2019, Graz, Austria.
- (2018) UMAP: uniform manifold approximation and projection. Journal of Open Source Software 3 (29), p. 861.
- (2001) On spectral clustering: analysis and an algorithm. In Advances in Neural Information Processing Systems 14, NIPS 2001, Vancouver, BC, Canada, pp. 849–856.
- (2010) The NIST year 2010 speaker recognition evaluation plan.
- (2020) Auto-tuning spectral clustering for speaker diarization using normalized maximum eigengap. IEEE Signal Processing Letters 27, pp. 381–385.
- (2009) The map equation. The European Physical Journal Special Topics 178 (1), pp. 13–23.
- (2010) Complex network measures of brain connectivity: uses and interpretations. NeuroImage 52 (3), pp. 1059–1069.
- (2018) From Louvain to Leiden: guaranteeing well-connected communities. CoRR abs/1810.08473.
- (2008) Visualizing data using t-SNE. Journal of Machine Learning Research 9, pp. 2579–2605.
- (2018) Speaker diarization with LSTM. In ICASSP 2018, Calgary, AB, Canada, pp. 5239–5243.
- (2019) Ensemble additive margin softmax for speaker verification. In ICASSP 2019, Brighton, United Kingdom, pp. 6046–6050.
- (2020) Densely connected time delay neural network for speaker verification. In Interspeech 2020, Shanghai, China, pp. 921–925.
- (2021) CAM: context-aware masking for robust speaker verification. In ICASSP 2021, Toronto, ON, Canada, pp. 6703–6707.
- (2021) A real-time speaker diarization system based on spatial spectrum. In ICASSP 2021, Toronto, ON, Canada, pp. 7208–7212.
- (2020) Phonetically-aware coupled network for short duration text-independent speaker verification. In Interspeech 2020, Shanghai, China.