1 Introduction
End-to-end speaker diarization has attracted considerable attention recently [4][3]. Despite its strengths in handling overlapping speech, its limitations in processing long meetings and large numbers of speakers have hindered its large-scale adoption. Considering these limitations, some researchers have devoted considerable effort to combining clustering-based approaches with end-to-end diarization in order to take advantage of both systems [7][9][8].
Clustering-based approaches are known to be more robust on large meetings and cross-domain datasets [4][8]. Hence these methods remain irreplaceable in real-world applications of speaker diarization. However, most diarization systems forgo the search for the best-performing clustering method and readily adopt common practices such as k-means, spectral clustering, and agglomerative hierarchical clustering (AHC) [11][15][10].
Despite their simplicity and popularity, these clustering methods may not be the most suitable for clustering speaker segments. For example, k-means only considers the relative distance to cluster centroids and fails to recognize the topological structure of the distribution. Spectral clustering, like k-means, has difficulty determining the number of classes, which is essential in a speaker diarization system.
In this work we propose to reformulate the clustering of speaker segments as a network community detection problem. By traversing the network we seek to understand its entire structure. When making a decision for each node, both local and global topology are considered. Moving a node to another community affects the community structure of the whole network, and the objective is to optimize this impact.
The Louvain algorithm has become one of the mainstream community detection algorithms since its proposal [1]. Its popularity rose rapidly across various fields of study due to its simplicity and effectiveness [2][17]. However, it has been proven that the Louvain algorithm can produce arbitrarily badly connected communities. For the clustering of speaker segments, this would lead to the false aggregation of two speakers into one. In this paper we introduce the recently proposed Leiden algorithm, which guarantees that all communities are well-connected [18].
Dimension reduction also has a positive effect on the clustering of speaker embeddings. A speaker embedding encodes information such as voice, speech content, channel, and environment [25]. The embedding extraction network is trained on a set of hundreds of thousands of speakers. A high-dimensional embedding space allows for larger margins between the many distinct classes during training. During inference, however, we only need to distinguish among the few speakers present in a meeting. Hence the information encoded in an embedding may be redundant or even harmful due to the curse of dimensionality. It is in our interest to project embeddings onto a lower-dimensional space that best serves to distinguish between the speakers in the given meeting. In this paper we propose to use a novel dimension reduction technique named Uniform Manifold Approximation and Projection (UMAP) [12]. This manifold learning technique stems from theoretical foundations in Riemannian geometry and algebraic topology and has proven effective in our task.

Segmentation is a long-standing challenge for clustering-based diarization [5][20][24]. Longer segments may contain multiple speakers, which could generate noisy embeddings for clustering. On the other hand, shorter segments may result in large variance among embeddings, also introducing noise into clustering. To resolve this dilemma, a masked filtering technique is presented. The filtering process adopts a "winner takes all" strategy: the dominating speaker is chosen as the target speaker for the segment, and the corresponding embedding is used as the target reference for extracting a clean embedding.
Having determined the community partitions for all speaker segments, we seek to further improve performance by refining the results within and between consecutive segments: locating precise speaker change points, smoothing frame-level results, and handling overlapping segments. These tasks can be addressed by an end-to-end post-processing component. In this work we extend the end-to-end mechanism for post-processing to effectively integrate the partition results from the community structure.
2 Methods
2.1 Leiden Community Detection
Constructing the community detection network is straightforward. Speaker embeddings extracted from each segment form the nodes of the network, and similarity scores between pairs of embeddings are the edge weights. The goal is to find the optimal partition that describes the community structure.
Modularity is selected as the optimization function and is represented as

Q = \frac{1}{2m} \sum_{ij} \left[ A_{ij} - \frac{k_i k_j}{2m} \right] \delta(c_i, c_j)

where A is the adjacency matrix, m is the total number of edges in the network, k_i is the degree of vertex i, c_i is the community vertex i belongs to, and \delta is the Kronecker delta function that equals 1 if c_i = c_j and 0 otherwise.
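As an illustration (not part of the proposed system), modularity can be evaluated directly from this definition; the toy graph and variable names below are our own:

```python
import numpy as np

def modularity(A, communities):
    """Compute modularity Q for an undirected graph.

    A           -- symmetric adjacency matrix (numpy array)
    communities -- community label c_i for each node i
    """
    m = A.sum() / 2.0              # total number of edges
    k = A.sum(axis=1)              # degree of each vertex
    n = A.shape[0]
    Q = 0.0
    for i in range(n):
        for j in range(n):
            if communities[i] == communities[j]:   # Kronecker delta
                Q += A[i, j] - k[i] * k[j] / (2 * m)
    return Q / (2 * m)

# Two triangles joined by a single edge: the natural partition scores high.
A = np.zeros((6, 6))
for a, b in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
    A[a, b] = A[b, a] = 1.0
labels = [0, 0, 0, 1, 1, 1]
print(round(modularity(A, labels), 3))   # -> 0.357
```

A partition that cuts across the two triangles yields a negative Q, confirming that modularity rewards densely intra-connected groups.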
The Leiden algorithm works as follows:
Step 1. For each node whose neighbourhood has changed, move it to a different community until no movement of nodes can increase modularity. This results in a partition P.
Step 2. Refine P: within each community in P, start from the singleton partition and locally merge nodes to form a refined partition.
Step 3. Create an aggregated network based on the refined partition.
Step 4. Repeat Steps 1-3 until no further improvement can be made.
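The fast local-moving pass of Step 1 can be sketched as follows. This is our own illustrative simplification (the refinement and aggregation steps are omitted); since the removal cost from a node's current community is the same for every candidate destination, comparing insertion gains is sufficient:

```python
import numpy as np
from collections import deque

def local_move(A):
    """One fast local-moving pass: greedily reassign each queued node to the
    neighbouring community with the largest modularity gain, revisiting only
    nodes whose neighbourhood has changed."""
    n = A.shape[0]
    m = A.sum() / 2.0
    k = A.sum(axis=1)
    comm = list(range(n))           # start from singleton communities

    def gain(i, c):
        # modularity gain of inserting node i into community c
        links = sum(A[i, j] for j in range(n) if comm[j] == c and j != i)
        tot = sum(k[j] for j in range(n) if comm[j] == c and j != i)
        return links / m - k[i] * tot / (2 * m * m)

    queue = deque(range(n))
    while queue:
        i = queue.popleft()
        best_c, best_g = comm[i], gain(i, comm[i])
        for c in {comm[j] for j in range(n) if A[i, j] > 0}:
            g = gain(i, c)
            if g > best_g:
                best_c, best_g = c, g
        if best_c != comm[i]:
            comm[i] = best_c
            # re-queue only neighbours whose neighbourhood has changed
            for j in range(n):
                if A[i, j] > 0 and j not in queue:
                    queue.append(j)
    return comm

# Demo: two triangles joined by one edge separate into two communities.
A = np.zeros((6, 6))
for a, b in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
    A[a, b] = A[b, a] = 1.0
labels = local_move(A)
```

The queue-based bookkeeping is what distinguishes Leiden's fast local move from Louvain's repeated full sweeps over all nodes.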
2.2 Uniform Manifold Approximation
UMAP presumes the existence of a locally connected manifold on which the speaker embeddings are uniformly distributed and aims to preserve the topological structure of this manifold.
Similar to other dimension reduction algorithms, UMAP operates in two stages. First it constructs a suitable weighted k-neighbour graph; then it computes a low-dimensional layout of this graph.
Let X = {x_1, ..., x_n} be the embeddings of the speaker segments in the meeting of interest. The weight between an embedding x_i and each of its k nearest neighbours x_{i_1}, ..., x_{i_k} is given by

w(x_i, x_{i_j}) = \exp\!\left( \frac{ -\max\bigl(0,\; d(x_i, x_{i_j}) - \rho_i\bigr) }{ \sigma_i } \right)

for j = 1, ..., k. Here \rho_i, the distance from x_i to its nearest neighbour, exemplifies the local-connectivity assumption, and \sigma_i is a normalization factor defining the Riemannian metric on which X lies such that the data is uniformly distributed.
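A minimal numpy sketch of this weight computation follows (our own illustration on synthetic data). We set \rho_i to the distance to the nearest neighbour and pick \sigma_i by binary search so that the weights to the k neighbours sum to log2(k), the target used in the reference UMAP implementation:

```python
import numpy as np

def umap_weights(X, k=3, tol=1e-5):
    """Directed k-NN weights w(x_i, x_ij) = exp(-max(0, d - rho_i) / sigma_i)."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    n = X.shape[0]
    W = np.zeros((n, n))
    target = np.log2(k)
    for i in range(n):
        nbrs = np.argsort(D[i])[1:k + 1]        # k nearest, excluding self
        rho = D[i, nbrs[0]]                     # local-connectivity offset
        lo, hi = 1e-8, 1e4                      # binary search for sigma_i
        for _ in range(64):
            sigma = (lo + hi) / 2
            s = np.exp(-np.maximum(0.0, D[i, nbrs] - rho) / sigma).sum()
            if abs(s - target) < tol:
                break
            lo, hi = (sigma, hi) if s < target else (lo, sigma)
        W[i, nbrs] = np.exp(-np.maximum(0.0, D[i, nbrs] - rho) / sigma)
    return W

rng = np.random.default_rng(0)
X = rng.normal(size=(12, 5))
W = umap_weights(X, k=3)
```

Because of the \rho_i offset, every point is connected to its nearest neighbour with weight exactly 1, which is what the local-connectivity assumption guarantees.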
A weighted graph can thus be constructed from the above weight function. Theoretically, this graph is the 1-skeleton of the fuzzy simplicial set associated with the Riemannian metric space ambient to X. Let W be the adjacency matrix of this graph.
UMAP is optimized through the fuzzy set cross entropy given by

C(\mu, \nu) = \sum_{a} \left[ \mu(a) \log\frac{\mu(a)}{\nu(a)} + \bigl(1 - \mu(a)\bigr) \log\frac{1 - \mu(a)}{1 - \nu(a)} \right]

where \mu and \nu are the membership functions of the two fuzzy sets. Note that C is written as the sum of two components. The first, attractive component serves to find local clusterings, or high-density regions in the graph. The second, repulsive component preserves the global topological structure.
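The cross entropy itself is straightforward to evaluate; a minimal sketch (the clipping for numerical stability is our own detail):

```python
import numpy as np

def fuzzy_cross_entropy(mu, nu, eps=1e-12):
    """C(mu, nu) = sum mu*log(mu/nu) + (1-mu)*log((1-mu)/(1-nu)).

    The first (attractive) term pulls pairs with high membership together;
    the second (repulsive) term keeps unrelated pairs apart."""
    mu = np.clip(mu, eps, 1 - eps)
    nu = np.clip(nu, eps, 1 - eps)
    return np.sum(mu * np.log(mu / nu) + (1 - mu) * np.log((1 - mu) / (1 - nu)))

mu = np.array([0.9, 0.1, 0.5])
```

The cost is zero when the low-dimensional memberships match the high-dimensional ones exactly, and grows as they diverge.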
The t-SNE algorithm had been one of the most popular dimension reduction algorithms before UMAP was proposed [19]. However, t-SNE does not preserve global structure the way UMAP does. In addition, t-SNE is in practice limited to reduction into 2- or 3-dimensional spaces, while UMAP allows reduction to any dimension that is optimal for the data of interest. UMAP is therefore more suitable for tasks, such as clustering, that require further processing after dimension reduction.
2.3 “Winner takes all” Masked Filtering
We also seek to improve the basic unit of community detection: the speaker embeddings. Most clustering-based speaker diarization systems split the audio into segments of 1 to 2 seconds, based on the presumption that a short 2-second segment contains no more than one speaker. As the length of the segments increases, this presumption no longer holds. In our system we audaciously increase the length of each segment to 4 seconds, with the understanding that longer segments lead to more robust embeddings.
To extract relatively clean embeddings from longer segments, we introduce a "winner-takes-all" masked filtering system, as illustrated in Figure 1. The filterbank features of each segment are passed through several D-TDNN layers. The frame-level speaker features are then clustered into 2 classes using k-means. Features from the dominating class are used to form the target embedding. The masked prediction network is trained in the same manner as described in [23].
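The winner-takes-all selection can be sketched as follows. This is our own simplified stand-in (a tiny 2-means on synthetic frame features); the D-TDNN feature extractor and the mask prediction network of the full pipeline are omitted:

```python
import numpy as np

def winner_takes_all_embedding(frames, iters=20, seed=0):
    """Cluster frame-level features into 2 classes with a small k-means and
    mean-pool only the dominating (larger) class as the target embedding."""
    rng = np.random.default_rng(seed)
    # initialise the two centroids from two distinct random frames
    cent = frames[rng.choice(len(frames), size=2, replace=False)]
    for _ in range(iters):
        d = np.linalg.norm(frames[:, None, :] - cent[None, :, :], axis=-1)
        assign = d.argmin(axis=1)
        for c in range(2):
            if np.any(assign == c):
                cent[c] = frames[assign == c].mean(axis=0)
    dominant = np.bincount(assign, minlength=2).argmax()
    return frames[assign == dominant].mean(axis=0)

# 30 frames from the dominating speaker plus 10 interfering frames.
rng = np.random.default_rng(1)
dominant = rng.normal(loc=2.0, scale=0.1, size=(30, 8))
interferer = rng.normal(loc=-2.0, scale=0.1, size=(10, 8))
frames = np.concatenate([dominant, interferer])
emb = winner_takes_all_embedding(frames)
```

The pooled embedding lands near the dominating speaker's frames rather than being dragged toward the interferer, which is the point of the strategy.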
One may wonder why we bother with masked filtering instead of simply using the target embedding in Figure 1 as the output. The mask prediction component serves two purposes. First, frames from the winner speaker may still contain overlapping speech from other speakers, which introduces bias into the target embedding; masked filtering allows us to extract relatively clean embeddings from overlapping speech, as illustrated in our previous work [23]. Second, if there is only one speaker in the entire segment, it is in our best interest to include as much information as possible. By design, the target embedding could ignore up to half of the frames containing useful information, which would contradict our motivation of extracting embeddings from longer segments to minimize variance.
2.4 End-to-end Post-Processing with Community Partitions
Although clustering-based diarization performs better at determining the number of speakers, classifying speaker segments correctly, and handling domain mismatch between training and testing data, we recognize that end-to-end approaches are better suited to processing local information, such as handling overlapping speech and finding precise speaker change points. We therefore propose an end-to-end post-processing component applied on top of the community detection results.
The end-to-end post-processing architecture is a modification of the EEND-EDA system described in [6], in which the Encoder-Decoder Attractor (EDA) component stores a flexible number of speakers. As indicated by our experiments, the EDA component is limited in predicting the correct number of speakers. Hence we replace the EDA module with the results of the community partitions and fix the number of speakers to the number of communities. During training and inference, the zero-vector inputs of EDA are replaced by the mean of each community partition, and the generation of new attractors is disallowed. For ease of reference, we name this end-to-end post-processing approach based on community partitions EEND-CommPart in later sections.
Since EEND-CommPart is aimed at processing local information, its inputs are short segments. In our experiments, for example, the inputs consist of two adjacent segments, for a total length of 8 seconds. During training, longer segments are included to increase variety.
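At its core, replacing the EDA zero-vector inputs reduces to per-community mean pooling of the segment embeddings; a minimal sketch (the surrounding EEND network is omitted, and the toy data is our own):

```python
import numpy as np

def community_attractors(embeddings, partition):
    """Build one attractor initialization per community as the mean of the
    embeddings assigned to that community; the number of attractors is
    fixed to the number of communities."""
    labels = sorted(set(partition))
    return np.stack([
        np.mean([e for e, p in zip(embeddings, partition) if p == lab], axis=0)
        for lab in labels
    ])

# Four segment embeddings belonging to two communities.
embeddings = np.array([[1.0, 1.0], [3.0, 3.0], [0.0, 10.0], [0.0, 12.0]])
partition = [0, 0, 1, 1]
attractors = community_attractors(embeddings, partition)
```

These per-community means then stand in for the EDA inputs, with the generation of additional attractors disallowed.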
3 Experiments & Results
3.1 Corpus
The training and test sets used in the experiments are simulated from the NIST SRE corpus. The training corpus consists of 57,517 utterances from 5,767 speakers in the NIST SRE 04-10 corpus. Performance is evaluated on the SRE10 evaluation set. When evaluating diarization performance, utterances from 2 to 8 randomly selected speakers are used to simulate meetings of at least 30 minutes. The overlap ratio varies across the simulated meetings.
3.2 Experiments
In the first experiment we evaluate the performance of several diarization systems on simulated long meetings with 2, 4, 6, and 8 participants, respectively. The components described above ("winner takes all" masked filtering, UMAP-Leiden community detection, and EEND-CommPart) are included or excluded in different runs to observe their effects on overall performance.
In the second experiment we focus on evaluating several clustering approaches. Speaker segments ranging from 2 to 4 seconds are randomly selected. Specifically, we compare k-means, spectral clustering, AHC, and UMAP-Leiden. We are also interested in how well EEND-EDA predicts the number of speakers compared to the clustering approaches.
For k-means and spectral clustering, we use the curve of descending eigenvalues to estimate the number of classes, as described in [13]. The number of classes for AHC is counted once the stopping criterion, determined on a small development set, is met. Three hyperparameters need to be fine-tuned for the UMAP dimension reduction algorithm: the number k of nearest neighbours to consider, the target dimension to reduce to, and the expected distance that determines how tightly points are packed together. For the Leiden algorithm, the most important hyperparameter to fine-tune is the resolution.
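The eigenvalue-gap estimate of the number of classes can be sketched as follows (an idealised, noise-free toy affinity matrix of our own; real affinity matrices only approximate this block structure):

```python
import numpy as np

def estimate_num_speakers(affinity, max_spk=10):
    """Estimate the number of classes from the largest gap in the
    descending eigenvalue curve of the affinity matrix."""
    eigvals = np.sort(np.linalg.eigvalsh(affinity))[::-1]
    gaps = eigvals[:-1] - eigvals[1:]
    return int(np.argmax(gaps[:max_spk])) + 1

# Idealised affinity with three perfectly separated speakers.
blocks = [4, 3, 3]
A = np.zeros((10, 10))
start = 0
for b in blocks:
    A[start:start + b, start:start + b] = 1.0
    start += b
print(estimate_num_speakers(A))   # -> 3
```

For k well-separated speakers the affinity matrix has k large eigenvalues followed by a sharp drop, so the largest gap marks the class count.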
The mask prediction network is trained with the same setup as described in [23]. The only difference is that, instead of taking a rough mean pooling as the target embedding, we conduct a 2-class clustering and apply mean pooling only to the dominating class.
The speaker embeddings are extracted from a D-TDNN architecture [22] trained on the NIST SRE 04-10 training set [14]. AM-Softmax is used as the loss function [21].

3.3 Results & Analysis
As Table 1 shows, system No.6 has the best performance when there are 4 or more speakers. System No.6 includes masked filtering as pre-processing, UMAP-Leiden as clustering, and EEND-CommPart as post-processing. System No.2, EEND-EDA, performs best in the 2-speaker case, but its performance degrades significantly as the number of speakers increases. System No.6 outperforms the EEND-EDA and k-means clustering baselines by a remarkable margin in the 4-, 6-, and 8-speaker cases.
Comparing system No.1 to No.4, as well as system No.3 to No.5, we see that UMAP-Leiden introduces remarkable gains in performance. System No.5 adds masked filtering to system No.4, showing that obtaining cleaner embeddings has a positive effect. Finally, the results for system No.6 indicate that EEND-CommPart post-processing allows for better handling of overlapping speech and better local refinement.
System ID  Pre-Processing    Clustering Methods  Post-Processing  DER(%)
                                                                  2-spk  4-spk  6-spk  8-spk
No.1       /                 k-means             /                10.3   23.9   33.6   40.1
No.2       /                 EEND-EDA            /                 4.7   22.8   42.3   68.5
No.3       Masked Filtering  k-means             /                 9.6   23.0   33.8   37.2
No.4       /                 UMAP-Leiden         /                 7.3   17.1   24.6   28.9
No.5       Masked Filtering  UMAP-Leiden         /                 6.5   15.9   22.3   25.8
No.6       Masked Filtering  UMAP-Leiden         EEND-CommPart     5.2   13.1   18.8   20.2

Table 1: DER (%) of the evaluated system configurations on simulated meetings with 2, 4, 6, and 8 speakers.
We now analyze the gains from the clustering performance of the UMAP-Leiden algorithm in isolation. Table 2 displays the performance of different clustering methods when the actual number of speakers is 1, 2, 4, 6, 8, and 10, respectively. For each case, 500 clustering tests are simulated; for #Spks=2, for example, randomly shuffled segments from 2 random speakers are selected to construct a clustering test instance, and this is repeated 500 times. The F-score and #Spks prediction accuracy are averaged over the 500 clustering results. #Spks prediction accuracy measures the fraction of tests in which the predicted number of speakers equals the actual number of speakers.
According to Table 2, when the actual number of speakers is 1 or 2, k-means, spectral clustering, and UMAP-Leiden deliver similarly competitive performance, with spectral clustering slightly better in the 1-speaker case and k-means slightly better in the 2-speaker case. The competitive performance of k-means and spectral clustering on very few speakers may be a major reason researchers lack the motivation to explore more sophisticated clustering methods for speaker diarization.
As the number of speakers increases, the performance of k-means and spectral clustering drops significantly, largely due to the difficulty of estimating the correct number of speakers. The UMAP-Leiden algorithm, on the other hand, performs reasonably well with larger numbers of speakers. We also note that EEND-EDA performs well in the 2-speaker situation but relatively poorly at estimating the number of speakers for #Spks >= 4. This is consistent with the observation from Table 1 that EEND-EDA has the lowest DER for #Spks=2 but that its DER increases rapidly thereafter. This is our main motivation for replacing the EDA module with the results of the UMAP-Leiden partitions in our end-to-end post-processing component.
#Spks  Methods      #Spks Prediction Accuracy  F-score
1      k-means      0.84                       0.91
       Spectral     0.84                       0.93
       AHC          0.32                       0.64
       UMAP-Leiden  0.71                       0.87
2      k-means      0.95                       0.95
       Spectral     0.95                       0.94
       AHC          0.77                       0.86
       EEND-EDA     0.93                       N/A
       UMAP-Leiden  0.93                       0.92
4      k-means      0.83                       0.86
       Spectral     0.83                       0.88
       AHC          0.70                       0.85
       EEND-EDA     0.50                       N/A
       UMAP-Leiden  0.90                       0.94
6      k-means      0.62                       0.81
       Spectral     0.62                       0.84
       AHC          0.61                       0.84
       EEND-EDA     0.31                       N/A
       UMAP-Leiden  0.85                       0.89
8      k-means      0.45                       0.70
       Spectral     0.45                       0.72
       AHC          0.55                       0.79
       EEND-EDA     0.15                       N/A
       UMAP-Leiden  0.84                       0.87
10     k-means      0.27                       0.67
       Spectral     0.27                       0.69
       AHC          0.46                       0.74
       EEND-EDA     0.03                       N/A
       UMAP-Leiden  0.80                       0.84

Table 2: #Spks prediction accuracy and F-score of different clustering methods.
Table 3 compares the computational costs of the different clustering approaches. The community detection approaches, Louvain and Leiden, are the most efficient in terms of running time. Leiden is faster than Louvain because it adopts a fast local-moving procedure that visits only nodes whose neighbourhood has changed, while Louvain revisits all nodes in the network on every pass.
Methods   Runtime
k-means   2585s
Spectral  597s
AHC       3281s
Louvain   374s
Leiden    171s

Table 3: Runtime of different clustering approaches.
4 Conclusion
In this paper we propose to reformulate speaker diarization as a community detection problem. We adopt Leiden as the community detection algorithm for its optimal performance on the dataset used in our experiments. Other community detection algorithms, such as Infomap [16] and Louvain, have been shown to be slightly better on some datasets; the choice of algorithm therefore depends on the data of interest. Fortunately, implementations of all the above-mentioned community detection and dimension reduction algorithms are relatively simple, so a quick comparison can be done with minimal effort.
By introducing a masked filtering approach, we extract speaker embeddings from longer segments to reduce variance without being affected by other speakers present in the segment. This proves most beneficial in meetings with large proportions of overlapping speech and frequent speaker turns. Considering its computational cost, masked filtering can optionally be turned off for more structured and organized meetings. In the future we are interested in exploring more efficient methods to serve this purpose.
Finally, we propose the EEND-CommPart post-processing component to handle overlapping speech and polish local results. The system leverages the strength of end-to-end methods in detailed refinement and the advantage of clustering-based approaches in global speaker counting.
Through this work we show that clustering-based diarization approaches still have substantial room for improvement. We observed remarkable gains simply by replacing k-means with community detection algorithms. We hope this result inspires more studies on clustering methods for speaker diarization.
References
[1] (2008) Fast unfolding of communities in large networks. Journal of Statistical Mechanics: Theory and Experiment 2008(10), P10008.
[2] (2009) Community detection in graphs. CoRR abs/0906.0612.
[3] (2019) End-to-end neural speaker diarization with permutation-free objectives. In Interspeech 2019, Graz, Austria, pp. 4300-4304.
[4] (2019) End-to-end neural speaker diarization with self-attention. In IEEE Automatic Speech Recognition and Understanding Workshop (ASRU 2019), Singapore.
[5] (2017) Speaker diarization using deep neural network embeddings. In ICASSP 2017, New Orleans, LA, USA, pp. 4930-4934.
[6] (2020) End-to-end speaker diarization for an unknown number of speakers with encoder-decoder based attractors. In Interspeech 2020, Shanghai, China.
[7] (2021) End-to-end speaker diarization as post-processing. In ICASSP 2021, Toronto, ON, Canada, pp. 7188-7192.
[8] (2021) Advances in integration of end-to-end neural and clustering-based diarization for real conversational speech. CoRR abs/2105.09040.
[9] (2021) Integrating end-to-end neural and clustering-based diarization: getting the best of both worlds. In ICASSP 2021, Toronto, ON, Canada, pp. 7198-7202.
[10] (2019) LSTM based similarity measurement with spectral clustering for speaker diarization. In Interspeech 2019, Graz, Austria, pp. 366-370.
[11] (2019) Speaker diarization using leave-one-out Gaussian PLDA clustering of DNN embeddings. In Interspeech 2019, Graz, Austria.
[12] (2018) UMAP: uniform manifold approximation and projection. Journal of Open Source Software 3(29), 861.
[13] (2001) On spectral clustering: analysis and an algorithm. In Advances in Neural Information Processing Systems 14 (NIPS 2001), Vancouver, Canada, pp. 849-856.
[14] (2010) The NIST year 2010 speaker recognition evaluation plan.
[15] (2020) Auto-tuning spectral clustering for speaker diarization using normalized maximum eigengap. IEEE Signal Processing Letters 27, pp. 381-385.
[16] (2009) The map equation. The European Physical Journal Special Topics 178(1), pp. 13-23.
[17] (2010) Complex network measures of brain connectivity: uses and interpretations. NeuroImage 52(3), pp. 1059-1069.
[18] (2018) From Louvain to Leiden: guaranteeing well-connected communities. CoRR abs/1810.08473.
[19] (2008) Visualizing data using t-SNE. Journal of Machine Learning Research 9, pp. 2579-2605.
[20] (2018) Speaker diarization with LSTM. In ICASSP 2018, Calgary, AB, Canada, pp. 5239-5243.
[21] (2019) Ensemble additive margin softmax for speaker verification. In ICASSP 2019, Brighton, United Kingdom, pp. 6046-6050.
[22] (2020) Densely connected time delay neural network for speaker verification. In Interspeech 2020, Shanghai, China, pp. 921-925.
[23] (2021) CAM: context-aware masking for robust speaker verification. In ICASSP 2021, Toronto, ON, Canada, pp. 6703-6707.
[24] (2021) A real-time speaker diarization system based on spatial spectrum. In ICASSP 2021, Toronto, ON, Canada, pp. 7208-7212.
[25] (2020) Phonetically-aware coupled network for short duration text-independent speaker verification. In Interspeech 2020, Shanghai, China.