Online Speaker Diarization with Graph-based Label Generation

by   Yucong Zhang, et al.
Duke University

This paper introduces an online speaker diarization system that can handle long audios with low latency. First, a new variant of agglomerative hierarchical clustering is built to cluster the speakers in an online fashion. Then, a speaker embedding graph is proposed, which we exploit in a graph-based reclustering method to further improve the performance. Finally, a label matching algorithm is introduced to generate consistent speaker labels. We evaluate our system on the DIHARD3 and VoxConverse datasets, both of which contain long audios covering various scenarios. The experimental results show that our online diarization system outperforms the baseline offline system and performs comparably to our offline system.




1 Introduction

Speaker diarization aims at solving the problem of "who spoke when". It is the process of partitioning an input audio into segments according to speaker identity. Speaker diarization involving multiple speakers has a wide range of applications. In particular, the boundaries produced by a diarization system can provide useful information to multi-speaker automatic speech recognition [11, 16] and improve its performance.

Conventional modularized speaker diarization systems usually contain multiple modules [25, 13, 19], including voice activity detection (VAD) [18], speech segmentation, embedding extraction [3, 22, 23] and speaker clustering [6, 21, 24, 14]. Each of these modules has been widely studied to improve the overall performance of the diarization system. For embedding extraction, i-vector [3], x-vector [22] and d-vector [23] are the most frequently used methods. In the clustering stage, embeddings are grouped together according to some similarity metric. The typical clustering methods for speaker diarization are agglomerative hierarchical clustering (AHC) [6, 21] and spectral clustering (SC) [24, 14]. In addition to the key modules mentioned above, a reclustering module may also be employed as post-processing to further improve the performance [25, 13].

Recently, the demand for online speaker diarization systems has surged in many scenarios, e.g., meetings or interviews, but conventional modularized speaker diarization systems cannot be directly applied to the online diarization task, since most clustering algorithms are designed for offline use. An intuitive solution is to re-run clustering over all received speech segments whenever a new segment arrives. However, this is not time-efficient and may cause high latency. Furthermore, the labels generated by the clustering algorithm might not be temporally consistent across all the speech segments. To handle these problems, low-latency online diarization methods have been studied.

Over the past decade, several online speaker diarization systems have been developed. In early designs, a Gaussian Mixture Model (GMM) was trained as a background model, and adaptation methods were applied when a speech segment was assigned to a new speaker [7, 15]. However, those systems usually need pre-trained models, such as GMMs for male speech, female speech and non-speech. Later on, speaker embedding methods were proposed to replace the GMM approach. Some studies use the d-vector as the speaker embedding to represent speaker segments, and the embeddings are then clustered by supervised methods [26, 5]. However, these supervised diarization methods need large amounts of annotated diarization data, which are hard to obtain.

Besides supervised diarization methods, online modularized speaker diarization systems that use adapted i-vectors and adaptive clustering have been proposed [27, 4]. Zhu et al. [27] used principal component analysis (PCA) to transform speaker embeddings into a subspace where the embeddings are more distinguishable. Dimitriadis et al. [4] proposed a variation of the k-means algorithm that refines cluster assignments by repeatedly attempting cluster subdivision, while keeping the most meaningful splits given a stopping criterion. However, the time complexity of both algorithms is linear in the number of speaker segments. Hence, they are not efficient for long audios.

More recently, an online speaker diarization system based on EEND [8] was proposed. It modifies the Encoder-Decoder-Attractor (EDA) architecture of Horiguchi et al. [10] by adding an incremental Transformer encoder module. However, end-to-end online neural diarization systems have two major problems. First, they are restricted by the number of speakers. The online system proposed by Han et al. [8] performs comparably to offline systems when only one or two speakers are involved, but it cannot deal well with more speakers. Second, end-to-end online neural diarization systems need a large amount of in-domain data for training, and in-domain supervised diarization data cannot be easily obtained. Although simulated data can be used for training, the model still needs to be adapted to the specific domain by finetuning on a real dataset. These problems motivate us to build a modularized online speaker diarization system that can deal with more speakers and requires no training data.

Figure 1: The pipeline of our proposed system

In this paper, we propose an online modularized speaker diarization system that can handle long audios without annotated training data. The source code will be released soon and a demo is available online. Our system operates in a frame-wise online fashion and mainly consists of five modules, namely VAD, speaker embedding extraction, embedding clustering, post-processing and label generation. The system is illustrated in Fig. 1. We summarize the key contributions of our online system as follows:

Speaker clustering with chkpt-AHC
We propose an online speaker clustering module, named checkpoint-AHC (chkpt-AHC), which allows speaker clustering to work in an online manner with low latency. High-purity speaker clusters can be derived using this module.

Post-processing with graph-based reclustering
We introduce a speaker embedding graph, which is built and maintained as new speaker embeddings arrive. A reclustering method using this graph is proposed to further refine the speaker clusters.

Label matching
We introduce a novel approach to enable online label generation for long audios, such that the output labels remain consistent over time.

The rest of the paper is organized as follows. Section 2 presents the details of our online diarization system. Section 3 describes the settings of our experiments and shows the results. Conclusions are drawn in Section 4.

2 Proposed Diarization system

2.1 Overview

The key idea of our work is to perform clustering over the history of segments each time a new segment arrives, but to update only the label of the newly arrived segment. The whole pipeline, as shown in Fig. 1, consists of a back-end module that generates the labeling results and a front-end module that performs label matching. For the back-end module, we introduce a new variant of the AHC algorithm and a graph-based reclustering method. The new AHC algorithm, called chkpt-AHC, allows the whole system to process long audios with low latency. Besides, a speaker embedding graph is built as new segments arrive, and a graph-based reclustering method is proposed to further improve the clustering performance. For the front-end module, the problem is that the labels generated in the back-end may not stay consistent when a new speech segment appears, i.e., the speaker permutation problem. Therefore, we employ a label matching module, which applies the Hungarian algorithm to align the speakers between two adjacent time steps.

2.2 Chkpt-AHC

The original AHC algorithm for speaker clustering uses all the speech segments. Every time a new speech segment arrives, the system performs AHC on all the previously recorded speech segments. Hence, its computational cost grows linearly with the length of the audio, leading to high latency for long recordings.

Therefore, in order to reduce the computational cost, we set checkpoints to limit the number of initial speech segments for the AHC algorithm; we call this variant chkpt-AHC. When the number of speech segments is fewer than the pre-defined checkpoint number, the online system performs AHC on all of the speech segments. Otherwise, the intermediate clustering result at the checkpoint number of clusters is recorded as the checkpoint state. The checkpoint state is used to control the maximum number of speech segments to be considered by AHC. When the next speech segment arrives, the clustering process starts from the checkpoint state and continues with the newly arrived segment. With chkpt-AHC employed, the total processing time is reduced significantly, especially for long speech, as shown in Table 2.
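As a minimal sketch of this idea (the class name, cosine similarity, average-linkage merging on centroids and the checkpoint bookkeeping are our own assumptions, not the paper's exact implementation): once more segments than the checkpoint number have been seen, the saved checkpoint state of clusters, rather than the raw segments, becomes the starting point of each new clustering pass.

```python
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

class ChkptAHC:
    """Sketch of checkpoint-AHC (names and linkage are illustrative)."""

    def __init__(self, stop_threshold=0.6, checkpoint=30):
        self.stop_threshold = stop_threshold  # merge only above this similarity
        self.checkpoint = checkpoint          # max clusters kept as checkpoint state
        self.embs = []                        # all embeddings seen so far
        self.state = []                       # checkpoint state: list of index lists

    def _centroid(self, idxs):
        return np.mean([self.embs[i] for i in idxs], axis=0)

    def _best_pair(self, clusters):
        # most similar cluster pair under centroid cosine similarity
        best, pair = -1.0, None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                s = cosine(self._centroid(clusters[i]), self._centroid(clusters[j]))
                if s > best:
                    best, pair = s, (i, j)
        return best, pair

    def add_segment(self, emb):
        self.embs.append(np.asarray(emb, dtype=float))
        # start from the checkpoint state plus the new segment as a singleton
        work = [list(c) for c in self.state] + [[len(self.embs) - 1]]
        snapshot = None
        while len(work) > 1:
            # record the intermediate result once it fits the checkpoint size
            if snapshot is None and len(work) <= self.checkpoint:
                snapshot = [list(c) for c in work]
            best, (i, j) = self._best_pair(work)
            if best < self.stop_threshold:
                break
            work[i] += work[j]
            del work[j]
        # next call resumes from the checkpoint state, not the raw segments
        self.state = snapshot if snapshot is not None else [list(c) for c in work]
        # per-segment labels from the final (post-threshold) clustering
        labels = [0] * len(self.embs)
        for k, cluster in enumerate(work):
            for i in cluster:
                labels[i] = k
        return labels
```

With two well-separated embedding directions, feeding segments one by one yields a consistent two-cluster labeling while only the capped checkpoint state is carried between updates.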

2.3 Graph-based Reclustering

Motivated by [25], we use a high threshold as the stopping criterion of chkpt-AHC to obtain high-purity clusters. As a result, the number of clusters after chkpt-AHC is usually larger than the ground-truth number of speakers. We then choose the speaker clusters based on a duration criterion. However, in the online setting, every cluster is short in the beginning, since cluster durations accumulate over time. Thus, we make some modifications so that the selection can be performed in an online manner. At first, when no cluster has a duration longer than a pre-defined threshold, we pick the single cluster with the longest duration as a speaker cluster and treat all the others as non-speaker clusters. Then, as time progresses, the speaker clusters are picked using the duration criterion. In order to further refine the clustering results, we propose a graph-based approach to determine whether the embeddings in non-speaker clusters belong to a current speaker cluster or a new speaker cluster.
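A minimal sketch of this online selection rule (the function name and the per-cluster duration bookkeeping are our own; the paper gives no pseudocode for this step):

```python
def pick_speaker_clusters(cluster_durations, min_duration):
    """Select speaker clusters by the duration criterion.

    Early on, when no cluster is long enough yet, fall back to the single
    longest cluster so there is always at least one speaker cluster."""
    long_enough = [i for i, d in enumerate(cluster_durations) if d >= min_duration]
    if long_enough:
        return long_enough
    # all clusters are still short: keep only the longest one as a speaker
    return [max(range(len(cluster_durations)), key=lambda i: cluster_durations[i])]
```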

Figure 2: Graphs with different thresholds. (a) Using 0.6 as threshold; (b) using 0.3 as threshold.

We introduce a speaker embedding graph, which is constructed as new speech segments arrive. Each node in the graph represents a speaker embedding. The weight of an edge is the similarity between the two speaker embeddings, measured by some metric. The similarity threshold used to build the graph is lower than the stopping criterion of chkpt-AHC. As Fig. 2 shows, the graph contains more connected nodes when a lower threshold is applied, which allows us to perform more precise refinements. Otherwise, refinement is difficult, since a high threshold leaves many unconnected nodes.

Figure 3: Effects of graph pruning with threshold 0.3. (a) Graph before pruning; (b) graph after pruning.

Since building such a graph is computationally expensive, we prune some of the edges in the speaker embedding graph, as shown in Fig. 3. Intuitively, if a new speaker node is not similar to some node in the graph, the new node is connected to neither that node nor that node's neighbors.
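The incremental construction with pruning might be sketched as follows (the adjacency-dict representation and the rule of skipping a dissimilar node's neighbours are our reading of the description; the pruning is an approximation, since a skipped neighbour could occasionally still be similar):

```python
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def add_to_graph(graph, embs, new_emb, threshold=0.3):
    """Add one embedding as a node; connect it to similar existing nodes.

    graph: dict node_id -> dict of neighbour_id -> edge weight
    embs:  list of embeddings, indexed by node_id
    """
    new_id = len(embs)
    embs.append(np.asarray(new_emb, dtype=float))
    graph[new_id] = {}
    skipped = set()  # neighbours of dissimilar nodes are pruned without testing
    for node in list(graph):
        if node == new_id or node in skipped:
            continue
        w = cosine(embs[node], embs[new_id])
        if w >= threshold:
            graph[new_id][node] = w
            graph[node][new_id] = w
        else:
            skipped.update(graph[node])  # prune this node's neighbours too
    return new_id
```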

With the auxiliary speaker embedding graph, we propose a novel approach to deal with the embeddings in non-speaker clusters. Given a speaker embedding graph G = (V, E), we use V to represent all the nodes in the graph and E to represent all the edges in the graph. We use the cluster likelihood L(v, C) to measure how close a node v is to a cluster of nodes C ⊆ V, where v ∉ C and |C| is the cardinality of cluster C. The cluster likelihood is calculated as follows:

    L(v, C) = (1 / |C|) * Σ_{u ∈ C} w(v, u)

where w(v, u) represents the weight of edge (v, u), reflecting the similarity between node v and node u.

In order to assign the embeddings in non-speaker clusters, we find the corresponding nodes in the graph and calculate the cluster likelihood for all pairs of non-speaker nodes and speaker clusters. Then we assign each non-speaker node v using the formula below:

    k* = argmax_k L(v, C_k)

where k* is the index of the speaker cluster that node v should be assigned to.
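Under an adjacency-dict graph representation (an assumption of ours, with missing edges contributing zero similarity), the cluster likelihood and the assignment rule can be sketched as:

```python
def cluster_likelihood(graph, node, cluster):
    """Average edge weight from `node` to the members of `cluster`;
    a missing (pruned) edge contributes zero similarity."""
    return sum(graph[node].get(u, 0.0) for u in cluster) / len(cluster)

def assign_non_speaker_node(graph, node, speaker_clusters):
    """Return the index of the speaker cluster with the highest likelihood."""
    return max(range(len(speaker_clusters)),
               key=lambda k: cluster_likelihood(graph, node, speaker_clusters[k]))
```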

2.4 Label Matching

Labels for each speech segment are generated according to the different clusters in the back-end module. However, as new speech segments appear, the clustering result might not stay consistent: the labels generated in the back-end may change when new speech segments arrive, but the previously output labels in the front-end are not allowed to change. This leads to a gap between the output labels and the back-end labels, as illustrated in Fig. 4 (a), and motivates us to implement an algorithm that matches the back-end labels to the output labels so as to infer the label of the new speech segment. We re-frame the label matching problem as a maximum weighted bipartite matching problem, which can be solved using the Hungarian algorithm [12]. Each node in the bipartite graph stands for a label, with output labels and back-end labels on separate sides of the graph. The edge weight between two nodes is the frequency with which the two labels appear at the same time. An example of the graph is shown in Fig. 4 (b).

(a) Label matching (b) Bipartite graph
Figure 4: Front-end module

By formulating the problem as a bipartite matching problem, we derive the matching result using the Hungarian algorithm. We then use this result to make inferences on the new speech segment as shown in Fig. 4 (a).
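As a sketch of this matching step (we use `scipy.optimize.linear_sum_assignment` as the Hungarian solver; counting co-occurrences over the shared segment history is our assumption about how the edge weights are built):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_labels(output_labels, backend_labels):
    """Map each back-end label to an output label by maximum-weight
    bipartite matching on label co-occurrence counts."""
    out_ids = sorted(set(output_labels))
    back_ids = sorted(set(backend_labels))
    # weight[i, j]: how often output label i and back-end label j were
    # assigned to the same segment
    weight = np.zeros((len(out_ids), len(back_ids)))
    for o, b in zip(output_labels, backend_labels):
        weight[out_ids.index(o), back_ids.index(b)] += 1
    # the Hungarian algorithm minimises cost, so negate to maximise weight
    rows, cols = linear_sum_assignment(-weight)
    return {back_ids[c]: out_ids[r] for r, c in zip(rows, cols)}
```

For instance, if the back-end permutes two speakers relative to the already-output labels, the matching recovers the permutation: `match_labels([0, 0, 1, 1], [1, 1, 0, 0])` maps back-end label 1 to output label 0 and vice versa.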

System    Mode     Clustering  Reclustering  DIHARD3 Dev  DIHARD3 Eval  VoxConverse Dev  VoxConverse Eval
Baseline  -        -           -             20.71        20.75         -                -
1         Offline  AHC         naive         17.63        16.82         3.94             4.68
2         Online   AHC         naive         20.17        19.68         5.20             6.28
3         Online   chkpt-AHC   naive         20.78        20.05         5.91             6.71
4         Online   chkpt-AHC   graph-based   20.28        19.57         5.80             6.60
Table 1: The DER (%) of the proposed speaker diarization systems. The baseline system is the one introduced by [20] for the DIHARD3 competition, without VB-HMM resegmentation. System 1 is the offline version of our proposed diarization system.

3 Experiments

We conduct our experiments in four stages. First, we build an offline speaker diarization system that uses AHC with a naive reclustering module (system 1). The naive reclustering module uses a threshold to reassign the non-speaker clusters to the speaker clusters: each non-speaker cluster is assigned to one of the speaker clusters via cosine similarity with a threshold. Second, we build an online speaker diarization system that also uses AHC and the naive reclustering module (system 2). Then, we replace AHC with chkpt-AHC and run experiments on all the datasets (system 3). Finally, a graph-based reclustering module is added in place of the naive reclustering approach (system 4). We report the DER as well as the processing time of all the diarization systems for later analysis. For all the datasets, we use an oracle VAD as a data preparation step.

3.1 Datasets

The speaker embedding model is trained on the development set of VoxCeleb2 [17] with 5994 speakers and achieves an equal error rate (EER) of 1.06% on the VoxCeleb1 original test set. For the speaker diarization part, we use DIHARD3 [20] and VoxConverse [2] as our diarization datasets. DIHARD3 contains 254 audios in the development set and 259 audios in the evaluation set; VoxConverse contains 216 audios in the development set and 232 audios in the evaluation set. We use the development sets to tune the parameters and evaluate our systems on the evaluation sets.

3.2 Model configurations

3.2.1 Speaker embedding extraction

In our work, we use the same recipe described in [1]. The input frame is 1s in length with a 0.5s shift. We use a deep CNN based on ResNet [9] to extract the 128-dim speaker embeddings.
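The 1 s / 0.5 s framing can be sketched as a simple sliding-window boundary generator (the function name and the 16 kHz sample rate are our assumptions; the recipe in [1] defines the actual feature pipeline):

```python
def sliding_windows(num_samples, sample_rate=16000, win_sec=1.0, shift_sec=0.5):
    """Return (start, end) sample boundaries of 1 s windows with a 0.5 s
    shift, matching the embedding extractor's input framing."""
    win = int(win_sec * sample_rate)
    hop = int(shift_sec * sample_rate)
    # last window starts so that it still fits entirely inside the audio
    starts = range(0, max(num_samples - win, 0) + 1, hop)
    return [(s, s + win) for s in starts]
```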

Figure 5: Threshold tuning. (a) chkpt-AHC; (b) speaker embedding graph.

3.2.2 Thresholds for speaker clustering and reclustering

The thresholds for chkpt-AHC and the speaker embedding graph are tuned on the corresponding development sets, as shown in Fig. 5. The optimal thresholds for the different datasets are similar, which shows that the thresholds are not closely tied to a particular dataset. As a result, we use 0.6 as the stopping threshold for chkpt-AHC and 0.4 as the threshold to build the speaker embedding graph in the experiments.

3.2.3 Offline system

An offline modularized speaker diarization system is built as system 1. It uses an AHC-based speaker clustering module similar to [25], with the oracle VAD. Our implementation of the offline speaker diarization system achieves a DER of 16.82% on the evaluation set of DIHARD3, outperforming the baseline of the DIHARD3 competition [20]. Moreover, this result is better than those of half of the teams that took part in the DIHARD3 competition.

3.3 Evaluation metrics

We use the diarization error rate (DER) to measure the performance of our system. DER consists of three components: false alarm (FA), speaker confusion and missed detection (Miss). For the VoxConverse dataset, we evaluate our model with a 0.25s forgiveness collar for DER.
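For intuition, a frame-level version of the metric can be sketched as below (this simplification assumes the hypothesis speaker labels are already optimally mapped to the reference, and it ignores the forgiveness collar and overlap handling used in the official scoring):

```python
def frame_der(ref, hyp):
    """Frame-level DER = (Miss + FA + Confusion) / scored speech frames.

    ref, hyp: per-frame speaker labels, with None marking non-speech."""
    miss = sum(1 for r, h in zip(ref, hyp) if r is not None and h is None)
    fa = sum(1 for r, h in zip(ref, hyp) if r is None and h is not None)
    conf = sum(1 for r, h in zip(ref, hyp)
               if r is not None and h is not None and r != h)
    total = sum(1 for r in ref if r is not None)
    return (miss + fa + conf) / total
```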

3.4 Results & Analysis

The diarization results are shown in Table 1. System 2 is an online system that uses all available speech segments for AHC. In system 3, we replace the AHC module with chkpt-AHC but keep the naive reclustering unchanged. It is expected that the performance gets slightly worse, since each clustering pass starts from the previous checkpoint state instead of from the beginning. In system 4, we further add the graph-based reclustering module in place of the naive reclustering module. Compared to system 3, system 4 improves on all the datasets. Although system 4 performs worse than system 1, it still outperforms the baseline offline system on the DIHARD3 datasets. We do not include a baseline for VoxConverse, since we use the oracle VAD.

System DIHARD3 VoxConverse
dev eval dev eval
2 121.8/484.0 113.0/458.8 104.3/338.3 446.1/830.2
3 44.0/484.0 40.0/458.8 27.9/338.3 114.7/830.2
4 48.9/484.0 44.7/458.8 32.5/338.3 142.7/830.2
Table 2: Average processing time (s) / average audio time (s) for different datasets. This experiment is conducted on a single core of Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz.

In terms of time efficiency, we measure the average processing time the system takes to handle one audio in each dataset, calculated as the time to process the whole dataset divided by the total number of audios in the dataset. Table 2 shows that the time cost of system 2 is considerably high. System 4 is slightly more time-consuming than system 3, because it needs extra time to maintain the speaker embedding graph. Nevertheless, compared to system 2, system 4 has a much lower time cost.

4 Conclusions

In this paper, we propose an online modularized diarization system that can handle long audios with low latency. Our proposed system includes the chkpt-AHC, graph-based reclustering and label matching modules. We evaluate the proposed online system on the DIHARD3 and VoxConverse datasets and show that our best online system outperforms the baseline offline system and achieves results comparable to our offline system.


  • [1] W. Cai, J. Chen, J. Zhang, and M. Li (2020) On-the-fly Data Loader and Utterance-level Aggregation for Speaker and Language Recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing 28, pp. 1038–1051.
  • [2] J. S. Chung, J. Huh, A. Nagrani, T. Afouras, and A. Zisserman (2020) Spot the Conversation: Speaker Diarisation in the Wild. In Proc. Interspeech, pp. 299–303.
  • [3] N. Dehak, P. J. Kenny, R. Dehak, P. Dumouchel, and P. Ouellet (2010) Front-end Factor Analysis for Speaker Verification. IEEE Transactions on Audio, Speech, and Language Processing 19 (4), pp. 788–798.
  • [4] D. Dimitriadis and P. Fousek (2017) Developing On-line Speaker Diarization System. In Proc. Interspeech, pp. 2739–2743.
  • [5] E. Fini and A. Brutti (2020) Supervised Online Diarization with Sample Mean Loss for Multi-domain Data. In Proc. ICASSP, pp. 7134–7138.
  • [6] D. Garcia-Romero, D. Snyder, G. Sell, D. Povey, and A. McCree (2017) Speaker Diarization Using Deep Neural Network Embeddings. In Proc. ICASSP, pp. 4930–4934.
  • [7] J. Geiger, F. Wallhoff, and G. Rigoll (2010) GMM-UBM Based Open-set Online Speaker Diarization. In Proc. Interspeech, pp. 2330–2333.
  • [8] E. Han, C. Lee, and A. Stolcke (2021) BW-EDA-EEND: Streaming End-to-end Neural Speaker Diarization for A Variable Number of Speakers. In Proc. ICASSP, pp. 7193–7197.
  • [9] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep Residual Learning for Image Recognition. In Proc. CVPR, pp. 770–778.
  • [10] S. Horiguchi, Y. Fujita, S. Watanabe, Y. Xue, and K. Nagamatsu (2020) End-to-End Speaker Diarization for an Unknown Number of Speakers with Encoder-Decoder Based Attractors. In Proc. Interspeech, pp. 269–273.
  • [11] N. Kanda, C. Boeddeker, J. Heitkaemper, Y. Fujita, S. Horiguchi, K. Nagamatsu, and R. Haeb-Umbach (2019) Guided Source Separation Meets a Strong ASR Backend: Hitachi/Paderborn University Joint Investigation for Dinner Party ASR. In Proc. Interspeech, pp. 1248–1252.
  • [12] H. W. Kuhn (1955) The Hungarian Method for the Assignment Problem. Naval Research Logistics Quarterly 2 (1-2), pp. 83–97.
  • [13] F. Landini, O. Glembek, P. Matějka, J. Rohdin, L. Burget, M. Diez, and A. Silnova (2021) Analysis of the BUT Diarization System for VoxConverse Challenge. In Proc. ICASSP, pp. 5819–5823.
  • [14] Q. Lin, R. Yin, M. Li, H. Bredin, and C. Barras (2019) LSTM Based Similarity Measurement with Spectral Clustering for Speaker Diarization. In Proc. Interspeech, pp. 366–370.
  • [15] K. Markov and S. Nakamura (2008) Improved Novelty Detection for Online GMM Based Speaker Diarization. In Proc. Interspeech, pp. 363–366.
  • [16] I. Medennikov, M. Korenevsky, T. Prisyach, Y. Khokhlov, M. Korenevskaya, I. Sorokin, T. Timofeeva, A. Mitrofanov, A. Andrusenko, I. Podluzhny, et al. (2020) The STC System for the CHiME-6 Challenge. In CHiME 2020 Workshop on Speech Processing in Everyday Environments.
  • [17] A. Nagrani, J. S. Chung, and A. Zisserman (2017) VoxCeleb: A Large-scale Speaker Identification Dataset. arXiv preprint arXiv:1706.08612.
  • [18] T. Ng, B. Zhang, L. Nguyen, S. Matsoukas, X. Zhou, N. Mesgarani, K. Veselý, and P. Matějka (2012) Developing a Speech Activity Detection System for the DARPA RATS Program. In Proc. Interspeech, pp. 1969–1972.
  • [19] T. J. Park, M. Kumar, and S. Narayanan (2021) Multi-Scale Speaker Diarization with Neural Affinity Score Fusion. In Proc. ICASSP, pp. 7173–7177.
  • [20] N. Ryant, P. Singh, V. Krishnamohan, R. Varma, K. Church, C. Cieri, J. Du, S. Ganapathy, and M. Liberman (2020) The Third DIHARD Diarization Challenge. arXiv preprint arXiv:2012.01477.
  • [21] G. Sell and D. Garcia-Romero (2014) Speaker Diarization with PLDA I-vector Scoring and Unsupervised Calibration. In Proc. SLT, pp. 413–417.
  • [22] D. Snyder, D. Garcia-Romero, G. Sell, D. Povey, and S. Khudanpur (2018) X-vectors: Robust DNN Embeddings for Speaker Recognition. In Proc. ICASSP, pp. 5329–5333.
  • [23] L. Wan, Q. Wang, A. Papir, and I. L. Moreno (2018) Generalized End-to-end Loss for Speaker Verification. In Proc. ICASSP, pp. 4879–4883.
  • [24] Q. Wang, C. Downey, L. Wan, P. A. Mansfield, and I. L. Moreno (2018) Speaker Diarization with LSTM. In Proc. ICASSP, pp. 5239–5243.
  • [25] X. Xiao, N. Kanda, Z. Chen, T. Zhou, T. Yoshioka, S. Chen, Y. Zhao, G. Liu, Y. Wu, J. Wu, et al. (2021) Microsoft Speaker Diarization System for the VoxCeleb Speaker Recognition Challenge 2020. In Proc. ICASSP, pp. 5824–5828.
  • [26] A. Zhang, Q. Wang, Z. Zhu, J. Paisley, and C. Wang (2019) Fully Supervised Speaker Diarization. In Proc. ICASSP, pp. 6301–6305.
  • [27] W. Zhu and J. Pelecanos (2016) Online Speaker Diarization Using Adapted I-vector Transforms. In Proc. ICASSP, pp. 5045–5049.