Automatic meeting/conversation analysis is one of the essential technologies required for realizing futuristic speech applications such as communication agents that can follow, respond to, and facilitate our conversation. As an important central task for the meeting analysis, speaker diarization has been extensively studied [1, 2, 3].
Currently, there are two major approaches to the diarization problem: clustering-based approaches [4, 1, 5, 6] such as x-vector clustering, and End-to-End Neural Diarization (EEND) approaches [7, 8, 9]. The clustering-based approaches first segment a recording into short homogeneous blocks and compute a speaker embedding for each block, assuming that only one speaker is active in each block. Then, the speaker embedding vectors are clustered to regroup segments belonging to the same speakers and obtain diarization results [5, 6, 10, 11]. On the other hand, EEND is relatively simple. It receives standard frame-level spectral features and directly outputs a frame-level speaker activity for each speaker. Recent diarization challenges such as DIHARD-III  revealed that these two approaches are complementary to each other, as will be discussed below, and thus many institutes achieved reliable diarization for real conversational data by combining systems based on these approaches, e.g., .
To accomplish reliable diarization for any real conversational speech, the following essential problems have to be addressed:
(1) overlapped speech (i.e., segments where more than one person is speaking),
(2) long-form audio (e.g., duration of more than 10 min.),
(3) an arbitrary number of speakers.
In each of these aspects, the two approaches have different, complementary properties. The clustering-based approaches have been studied for a decade and have been shown to work well with long-form audio containing an arbitrary number of speakers . However, because of the assumption made in the extraction of speaker embeddings, i.e., the single-speaker block assumption, they cannot handle overlapped speech. On the other hand, EEND was first developed to address the overlapped speech problem [7, 8]. Recently, it was extended to handle meetings containing an arbitrary number of speakers by introducing a speaker counting functionality based on an Encoder-Decoder Attractor architecture (EDA) . However, it was experimentally shown to still have difficulty in dealing with meetings containing a realistically large number of speakers, such as more than 3 speakers . In addition, it was shown that it is difficult to directly apply EEND systems to long-form audio (e.g., recordings longer than 10 minutes) . Since the original EEND system was designed to operate in a batch processing mode, it inevitably requires a very large amount of memory when performing inference on long recordings. Aside from the memory issue, the neural networks (NNs) in EEND have difficulty generalizing to unseen, very long sequential data. Block-wise independent processing is also difficult because it poses an inter-block label permutation problem, i.e., an ambiguity of the speaker label assignments between blocks.
Focusing on these different pros and cons of the clustering and EEND approaches, we proposed a simple but effective hybrid diarization approach , called EEND-vector clustering, which combines the best of clustering-based diarization and EEND. The framework allows us to process long-form audio containing overlapped speech and an arbitrary number of speakers. It first splits the long input recording into fixed-length blocks. Then it applies a modified version of EEND to each block to obtain the diarization results for a fixed number of speakers as well as global speaker embedding vectors for each of the speakers. This assumes that, for each short block, the number of speakers is equal to or less than the number of outputs of EEND. Finally, to solve the inter-block label permutation problem, speaker clustering is performed across blocks by using a constrained clustering algorithm. In , the EEND-vector clustering framework was shown to significantly outperform the conventional EEND  and x-vector clustering when processing simulated long recordings of 2 speakers containing various overlap conditions, noise and reverberation, and thus was shown to be more advantageous in addressing the aforementioned problems (1) and (2).
However, it was not clear from our past studies whether the EEND-vector clustering could be generalized to real conversational speech data containing an arbitrary number of speakers (aforementioned problems (1) and (3)). To this end, this paper focuses on (i) application of the EEND-vector clustering to the widely used CALLHOME dataset , which consists of the real conversational speech of 2 to 6 speakers, and (ii) its evaluation in comparison with current state-of-the-art systems such as EDA-EEND , x-vector clustering , and Region-Proposal Network based Speaker Diarization (RPNSD)  that can handle CALLHOME data including overlapped speech. We also (iii) introduce practical techniques to increase robustness against real data such as more robust constrained clustering methods and silent speaker detection.
In the remainder of the paper, we first review the proposed EEND-vector clustering approach (in Sec. 2), and then introduce the practical techniques required to deal with real meeting data (in Sec. 3). Finally, with experiments, we show that the EEND-vector clustering can outperform the other state-of-the-art approaches by a large margin, especially when the number of speakers is large.
2 EEND-vector clustering
2.1 Overall framework
The proposed framework first segments the input recording into blocks and calculates a sequence of the input frame features within each block, as $X_i = (\mathbf{x}_{t,i} \in \mathbb{R}^{K} \mid t = 1, \ldots, T)$, where $i$, $t$ and $T$ are the block index, the frame index in the block and the block size, respectively. $\mathbf{x}_{t,i}$ is the $K$-dimensional input frame feature at the time frame $t$. In the example shown in Fig. 1, the input recording consists of 2 blocks and contains 3 speakers in total. In the following explanation, we assume that we can reasonably fix the maximum number of active speakers in a block, $S_{\mathrm{Local}}$, to 2, for the sake of simplicity. (Footnote 1: In the experiments, we use $S_{\mathrm{Local}} = 3$.)
Based on the assumption/hyper-parameter $S_{\mathrm{Local}}$, the neural network (NN) always estimates diarization results and associated speaker embeddings for 2 speakers in each block. If a speaker is absent (i.e., there is only one active speaker in that block), the network simply estimates diarization results of all zeros for that silent speaker. The diarization results are estimated independently in each block. (Footnote 2: For the details of the NN and its training procedure, please refer to .) Since it is not guaranteed that the diarization results of a certain speaker are always estimated at the same output node, we may have the inter-block label permutation problem in the diarization outputs. We can solve this permutation problem and estimate the correct association of the diarization results among blocks by clustering the speaker embeddings given the total number of speakers in the input recording, $S_{\mathrm{Global}}$ (3 in this case). Note that the speaker embedding extraction process is optimized such that the vectors of the same speaker stay close to each other, while the vectors of different speakers lie far away from each other. Based on the clustering results, we can stitch together the diarization results of the same speaker across blocks to obtain the final diarization output. Note that, while the proposed framework estimates the diarization results for a fixed number of speakers in a block, it can handle a meeting with an arbitrary number of speakers.
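The block-wise processing and the final stitching step can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `split_into_blocks` and `stitch_blocks` are hypothetical, the per-block speech activities are assumed to have been estimated already, and the global cluster index of each local output (or -1 for an output to be ignored) is assumed to come from the clustering step.

```python
import numpy as np


def split_into_blocks(features, block_size):
    """Split a (num_frames, K) feature matrix into consecutive blocks.

    The last block may be shorter than block_size.
    """
    return [features[start:start + block_size]
            for start in range(0, len(features), block_size)]


def stitch_blocks(block_activities, block_labels, num_global_speakers):
    """Stitch per-block activities into one recording-level result.

    block_activities: list of (T_i, S_local) arrays, one per block.
    block_labels: list of length-S_local sequences giving the global
        cluster index of each local output (-1 means "ignore").
    Returns a (sum of T_i, num_global_speakers) array.
    """
    total_frames = sum(len(act) for act in block_activities)
    stitched = np.zeros((total_frames, num_global_speakers))
    offset = 0
    for act, labels in zip(block_activities, block_labels):
        for local_idx, global_idx in enumerate(labels):
            if global_idx >= 0:
                stitched[offset:offset + len(act), global_idx] = act[:, local_idx]
        offset += len(act)
    return stitched


# Toy case mirroring the 2-block, 3-speaker example in the text.
block1 = np.array([[1, 0], [1, 1]])   # local outputs -> global speakers 0, 1
block2 = np.array([[0, 1], [1, 1]])   # local outputs -> global speakers 2, 0
result = stitch_blocks([block1, block2], [[0, 1], [2, 0]], 3)
```

In this toy case, the second output of the second block is mapped back to the same global speaker as the first output of the first block, which is exactly the permutation ambiguity that the clustering step has to resolve.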
To tightly couple the embedding estimation and clustering process, it is beneficial to convey to the clustering algorithm the useful information that the NN always estimates speech activities of two different speakers in each block. To this end, we suggest using a constrained/semi-supervised clustering algorithm, which typically allows us to set a cannot-link constraint between a given pair of embeddings to prevent the pair from being assigned to the same speaker cluster.
2.2 Formulation of NN to jointly perform diarization and speaker embedding estimation
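The NN in Fig. 1 can be formulated as follows. Let us denote the ground-truth diarization label sequence as $Y_i = (\mathbf{y}_{t,i} \mid t = 1, \ldots, T)$ that corresponds to $X_i$. Here, the diarization label $\mathbf{y}_{t,i} = [y_{t,i,s} \in \{0, 1\} \mid s = 1, \ldots, S_{\mathrm{Local}}]$ represents a joint activity for $S_{\mathrm{Local}}$ speakers. For example, $y_{t,i,s} = y_{t,i,s'} = 1$ ($s \neq s'$) indicates that both speakers $s$ and $s'$ spoke at the time frame $t$ in the block $i$. Similarly, let us denote the ground-truth speaker embedding set as $E_i = (\mathbf{e}_{i,s} \mid s = 1, \ldots, S_{\mathrm{Local}})$ that corresponds to $X_i$. $\mathbf{e}_{i,s} \in \mathbb{R}^{C}$ is the $C$-dimensional speaker embedding vector for the $s$-th speaker.
Then, the joint estimation of diarization results and speaker embeddings by the NN in Fig. 1 is formulated as:
$$(\hat{Y}_i, \hat{E}_i) = \mathrm{NN}(X_i),$$
where $\hat{Y}_i$ and $\hat{E}_i$ are the estimates of $Y_i$ and $E_i$, respectively.
The NN can be trained with a multitask loss function composed of a diarization loss, i.e., a binary cross entropy loss, and a speaker embedding loss that encourages the embeddings to have small intra-speaker and large inter-speaker Euclidean distances, as proposed in .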
3 Handling real conversational data
This section summarizes the techniques that we incorporated into the EEND-vector clustering to cope with real conversational data. While we should be able to improve the NN architecture and training procedure, as in many other studies [8, 9], to improve the final performance, we found that improving only the constrained clustering part, which is unique to EEND-vector clustering, can already make a big difference in the final performance and help achieve state-of-the-art performance. The following subsections detail the modifications we newly introduce to the original framework proposed in .
3.1 Improving constrained clustering
In our previous study, we used the COP-Kmeans algorithm, which is an extension of k-means clustering, and showed that it was effective for handling simulated 2-speaker noisy reverberant meeting data. However, we neither compared it with other constrained clustering algorithms nor confirmed whether it is beneficial to use the cannot-link constraint. To thoroughly investigate the effectiveness of the constrained clustering algorithm in the proposed framework, here we incorporate and evaluate other constrained clustering algorithms in addition to COP-Kmeans. While standard k-means clustering generally does not handle non-Gaussian data and/or data containing imbalanced classes well, both of which are common in real conversational speech data, it was shown that Agglomerative Hierarchical Clustering (AHC) does not suffer from such limitations, and thus it has been widely used within clustering-based diarization systems [5, 6]. Also, Spectral Clustering (SC) is often used in clustering-based diarization systems [10, 18], since it can handle non-Gaussian data. In accordance with this development in clustering for diarization, we here introduce the following constrained AHC and constrained SC into the EEND-vector clustering framework.
3.1.1 Constrained Agglomerative Hierarchical Clustering
AHC is a common but effective unsupervised clustering technique used in state-of-the-art diarization systems [5, 6]. Standard AHC requires pairwise distances between all input samples, i.e., speaker embeddings , resulting in a distance matrix. Then, based on the matrix, it generates a cluster hierarchy that is commonly displayed as a tree diagram called a dendrogram. Each input sample starts in its own cluster, and pairs of clusters are merged as it moves up the hierarchy. We stop the cluster merging process either (a) when we obtain a required number of clusters, or (b) when the clusters are too far apart beyond a certain threshold to be merged. In diarization systems, the stopping criterion (a) is used to obtain diarization results given prior knowledge on the number of speakers in a meeting (i.e., oracle-number-of-speaker evaluation), while the stopping criterion (b) is used to obtain diarization results with the estimated number of speakers in a meeting (i.e., estimated-number-of-speaker evaluation).
To incorporate the cannot-link constraint into AHC, it has been proposed to directly modify the distance matrix such that the distance between samples with the cannot-link constraint becomes larger than any other value in the matrix . Within the proposed framework, the distance matrix we obtain based on the estimated speaker embeddings can be expressed as
$$D = \begin{bmatrix} D_{1,1} & \cdots & D_{1,I} \\ \vdots & \ddots & \vdots \\ D_{I,1} & \cdots & D_{I,I} \end{bmatrix}, \tag{1}$$
where each element of the above block matrix, $D_{i,j}$, is a symmetric distance matrix calculated based on the sets of speaker embedding vectors obtained at the $i$-th block, $\hat{E}_i$, and the $j$-th block, $\hat{E}_j$, and $I$ is the total number of blocks. When incorporating the cannot-link constraint, we insert a certain large value into the off-diagonal components of the matrices $D_{i,i}$ for all $i$ from $1$ to $I$. After obtaining the modified distance matrix, we can use it in a standard AHC with the aforementioned stopping criteria.
3.1.2 Constrained Spectral Clustering
SC is another common unsupervised clustering technique used in recent diarization systems [10, 18]. SC has its roots in graph theory, and it requires a pairwise similarity score between all input samples, i.e., speaker embeddings , resulting in a similarity graph. When we have prior knowledge on the number of speakers, $S_{\mathrm{Global}}$, in a meeting, we typically analyze the eigenvectors corresponding to the $S_{\mathrm{Global}}$ smallest eigenvalues of the unnormalized graph Laplacian constructed from the similarity graph. When we would like to estimate the number of clusters, we can use a widely used method called the eigengap heuristic.
There are several ways to incorporate the cannot-link constraint into SC , varying from a simple way  to methods that can control the contribution of the constraint in a soft manner . Here we employ a relatively straightforward method proposed in , which directly modifies the similarity graph such that the graph edge between samples with the cannot-link constraint is forced to $0$. This can be done by constructing the similarity matrix similarly to eq. (1) but with similarity scores, and inserting $0$ into the off-diagonal components of the diagonal (same-block) sub-matrices for all blocks.
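A minimal sketch of the constrained similarity graph, the unnormalized Laplacian and the eigengap heuristic is given below. Several details are assumptions made for illustration only: the function names, the Gaussian kernel used as the similarity score, and its width `sigma`. The final step of spectral clustering (k-means on the rows of the returned eigenvector matrix) is omitted to keep the example deterministic.

```python
import numpy as np


def constrained_affinity(embeddings, sigma=1.0):
    """Similarity matrix with cannot-link edges forced to zero.

    embeddings: (num_blocks, S_local, C) array of speaker embeddings.
    A Gaussian kernel on Euclidean distances is assumed here.
    """
    num_blocks, num_local, dim = embeddings.shape
    flat = embeddings.reshape(-1, dim)
    sq_dist = ((flat[:, None, :] - flat[None, :, :]) ** 2).sum(axis=-1)
    affinity = np.exp(-sq_dist / (2.0 * sigma ** 2))

    # Cannot-link: remove edges between different outputs of one block.
    block_id = np.repeat(np.arange(num_blocks), num_local)
    same_block = block_id[:, None] == block_id[None, :]
    off_diagonal = ~np.eye(len(flat), dtype=bool)
    affinity[same_block & off_diagonal] = 0.0
    return affinity


def spectral_embedding(affinity, num_speakers=None):
    """Eigenvectors of the unnormalized Laplacian L = D - W.

    If num_speakers is None, it is estimated with the eigengap
    heuristic: pick k so that the gap between the k-th and (k+1)-th
    smallest eigenvalues is the largest.
    """
    laplacian = np.diag(affinity.sum(axis=1)) - affinity
    eigvals, eigvecs = np.linalg.eigh(laplacian)  # ascending eigenvalues
    if num_speakers is None:
        gaps = np.diff(eigvals)
        num_speakers = int(np.argmax(gaps)) + 1
    return eigvecs[:, :num_speakers], num_speakers


# Toy example: 2 blocks, 2 outputs per block, 3 speakers in total.
toy = np.array([[[0.0, 0.0], [5.0, 0.0]],
                [[0.0, 5.0], [0.1, 0.1]]])
affinity = constrained_affinity(toy)
rows, estimated_speakers = spectral_embedding(affinity)
```

Each row of `rows` is the spectral representation of one speaker embedding; clustering these rows yields the speaker assignment.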
3.2 Silent speaker detection
Before clustering speaker embedding vectors with a constrained clustering algorithm, it is beneficial to detect and exclude embedding vectors corresponding to silent speakers. This is mainly because, if there are multiple silent speakers in a block, we should not set the cannot-link constraint between their embedding vectors, i.e., they should belong to the same silent speaker cluster. We found that, in many cases, the diarization results for those silent speakers stay very close to $0$ (as we trained the NN to do so). Therefore, we propose to detect a silent speaker by examining whether the mean of the diarization results is sufficiently small: a speaker embedding is judged to be from a silent speaker if $\frac{1}{T} \sum_{t=1}^{T} \hat{y}_{t,i,s} < \tau$, where $\hat{y}_{t,i,s}$ is the estimated activity of the $s$-th output at the time frame $t$ in the block $i$ and $\tau$ is a predetermined threshold.
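The silent speaker test amounts to thresholding the time average of each output. The sketch below assumes the per-block activities are stored as a NumPy array of posteriors in [0, 1]; the function name `detect_silent_speakers` is hypothetical, and the default threshold of 0.05 is the value reported later in the experimental settings.

```python
import numpy as np


def detect_silent_speakers(activities, threshold=0.05):
    """Flag outputs whose mean estimated activity is below the threshold.

    activities: (T, S_local) array of estimated speech activities.
    Returns a boolean array of length S_local (True = silent speaker).
    """
    return activities.mean(axis=0) < threshold


activities = np.array([[0.9, 0.01, 0.02],
                       [0.8, 0.00, 0.03]])
silent = detect_silent_speakers(activities)

# Embeddings of silent outputs would be dropped before clustering.
embeddings = np.array([[1.0, 0.0], [0.2, 0.1], [0.3, 0.3]])
active_embeddings = embeddings[~silent]
```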
4 Experiments
In this section, we evaluate the effectiveness of the proposed EEND-vector clustering in comparison with state-of-the-art conventional methods, using the CALLHOME dataset . We also evaluate clustering algorithms in the proposed EEND-vector clustering framework to highlight the importance of the constrained clustering.
4.1 Data
For the training, we used simulated mixtures created from Switchboard-2 (Phase I & II & III), Switchboard Cellular (Part 1 & 2), and the NIST Speaker Recognition Evaluation (2004 & 2005 & 2006 & 2008) for speech, and the MUSAN corpus for noise with simulated room impulse responses used in , by following the data generation procedure in . We created a 3-speaker meeting-like dataset based on the algorithm proposed in  with . (Footnote 3: Each mixture contains dozens of utterances per speaker with reasonable silence intervals between utterances of the same speaker. This setting means that the average duration of the silence interval is 10 s.)
For evaluation and adaptation, we used the telephone conversation dataset CALLHOME (CH) , i.e., NIST SRE 2000 (LDC2001S97, Disk-8), which has been the most widely used dataset for speaker diarization studies. The CALLHOME dataset contains 500 sessions of multilingual telephone speech. Each session has 2 to 6 speakers, while there are two dominant speakers in each conversation. For evaluation and adaptation purposes, we split the CALLHOME dataset into two subsets according to , and performed adaptation on one subset and evaluation of the proposed method on the other subset.
4.2 Conventional methods to be compared with
The proposed method was compared with state-of-the-art methods, namely, x-vector clustering with PLDA scoring and variational Bayesian resegmentation [4, 1, 6], EDA-EEND  and Region-Proposal Network based Speaker Diarization (RPNSD) . We did not reproduce their results, but simply borrowed the results of x-vector clustering and EDA-EEND from , and that of RPNSD from .
4.3 Settings of the proposed EEND-vector clustering
4.3.1 NN training and hyper-parameters
In this experiment, the assumed maximum number of speakers in each block was set at 3 in the proposed method, i.e., the NN always estimates diarization results and embedding vectors for 3 speakers including silent speaker(s). The block size was varied from 50 s to 15 s to see how the block size affects the diarization performance. To speed up the experiments, we first trained a model with and on the training data for 100 epochs, and adapted the pre-trained model to different block sizes to obtain a model appropriate for each .
For the neural network architecture and training protocol, we basically followed , and followed  for speaker embedding estimation and multi-task training setting. We used self-attention-based six-layer stacked Transformer encoders with eight attention heads as a backbone of our method. The input for the network was the same as , i.e., 345-dimensional log-scaled Mel-filterbank-based features. The threshold to detect the silent speaker was set at 0.05.
4.3.2 Clustering algorithms in the proposed method
We evaluate oracle clustering as well as the following three unconstrained and three constrained clustering algorithms in the framework of the EEND-vector clustering, namely, k-means, AHC, SC, COP-Kmeans, constrained AHC, and constrained SC.
The oracle clustering corresponds to the permutation that yields the diarization result closest to the true one, based on the diarization result estimated by the network.
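One way to realize such an oracle assignment for a single block is an exhaustive search over mappings from local outputs to reference speakers. The sketch below is an assumption about how this could be done, not a description of the evaluation code: the function name `oracle_assignment` is hypothetical, binary decisions are assumed, closeness is measured by the number of mismatched frame-level labels, and the number of reference speakers is assumed to be at least the number of local outputs.

```python
from itertools import permutations

import numpy as np


def oracle_assignment(estimated, reference):
    """Find the local-to-global mapping that best matches the reference.

    estimated: (T, S_local) binary array of estimated activities.
    reference: (T, S_global) binary array of true activities,
        with S_global >= S_local (assumption of this sketch).
    Returns (mapping, num_errors), where mapping[s] is the reference
    speaker assigned to local output s.
    """
    num_local = estimated.shape[1]
    num_global = reference.shape[1]
    best_mapping, best_errors = None, np.inf
    for mapping in permutations(range(num_global), num_local):
        mapped = np.zeros_like(reference)
        mapped[:, list(mapping)] = estimated
        errors = int(np.abs(mapped - reference).sum())
        if errors < best_errors:
            best_mapping, best_errors = mapping, errors
    return best_mapping, best_errors


reference = np.array([[1, 0, 0],
                      [1, 0, 1],
                      [0, 0, 1],
                      [0, 0, 1]])
estimated = np.array([[0, 1],
                      [1, 1],
                      [1, 0],
                      [1, 0]])
mapping, num_errors = oracle_assignment(estimated, reference)
```

The search is factorial in the number of outputs, which is acceptable here because the number of outputs per block is small.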
When using unconstrained clustering algorithms such as k-means, AHC, and SC, there may be some cases where multiple speaker embeddings from a certain block are assigned to the same cluster. In such cases, on the assumption that a certain speaker’s speech activity was erroneously split into more than one output, we heuristically merge the diarization results corresponding to those speaker embeddings by taking the maximum across these diarization results at each time frame .
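The merging heuristic can be illustrated for a single block as follows. The function name `merge_outputs_by_cluster` is hypothetical, and the activities are assumed to be non-negative values stored in a NumPy array.

```python
import numpy as np


def merge_outputs_by_cluster(activities, labels, num_clusters):
    """Merge local outputs that were assigned to the same cluster.

    activities: (T, S_local) array of non-negative speech activities.
    labels: length-S_local sequence of cluster indices.
    Outputs sharing a cluster are combined by a frame-wise maximum.
    """
    merged = np.zeros((activities.shape[0], num_clusters))
    for local_idx, cluster in enumerate(labels):
        merged[:, cluster] = np.maximum(merged[:, cluster],
                                        activities[:, local_idx])
    return merged


# Outputs 0 and 1 fall into cluster 0; output 2 falls into cluster 1.
block = np.array([[0.9, 0.1, 0.0],
                  [0.2, 0.8, 0.0]])
merged = merge_outputs_by_cluster(block, [0, 0, 1], num_clusters=2)
```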
For the constrained AHC, we set at . We performed AHC with average linkage, i.e., cluster merging is based on the average of the distances between all pairs of observations in two clusters. The distance threshold above which clusters are not merged was set at 1.
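[Table 1: DER of the proposed EEND-vector clustering with different clustering methods, with the oracle and the estimated numbers of speakers, by # of speakers in a session.]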
4.4.1 Effect of clustering algorithms
Table 1 shows the diarization error rate (DER) of the proposed EEND-vector clustering under the best-performing condition (), with different clustering methods performed with the oracle number of speakers and the estimated number of speakers. First, we can see that the constrained AHC performs the best and stays closest to the oracle clustering. In comparison with its unconstrained counterpart, it performs significantly better for difficult cases such as sessions with 6 speakers. Hereafter, in this paper, we use constrained AHC as the clustering algorithm in the proposed method.
Except for AHC, it is not perfectly clear whether the incorporation of the cannot-link constraint is advantageous or not. The obtained results are strongly affected by issues such as the imbalanced class problem and poor accuracy in speaker counting, which make it difficult to draw a definitive conclusion.
Table 2: DER (%) with the oracle number of speakers.
| Method (# of speakers in a session) | 2 | 3 | 4 | 5 | 6 | All |
| x-vector clustering | 8.93 | 19.01 | 24.48 | 32.14 | 34.95 | 18.98 |
| EEND-vector clust. () | 7.63 | 13.14 | 13.71 | 22.14 | 28.82 | 12.53 |
| EEND-vector clust. () | 8.08 | 11.27 | 15.01 | 23.14 | 26.56 | 12.22 |
| EEND-vector clust. () | 7.97 | 13.56 | 15.28 | 28.52 | 22.62 | 13.15 |
| EEND-vector clust. () | 8.57 | 14.33 | 17.12 | 30.02 | 23.17 | 14.07 |
Table 3: DER (%) with the estimated number of speakers.
| Method (# of speakers in a session) | 2 | 3 | 4 | 5 | 6 | All |
| x-vector clustering | 15.45 | 18.01 | 22.68 | 31.40 | 34.27 | 19.43 |
| EEND-vector clust. () | 7.18 | 12.50 | 16.91 | 28.04 | 26.76 | 12.98 |
| EEND-vector clust. () | 7.96 | 11.93 | 16.38 | 21.21 | 23.10 | 12.49 |
| EEND-vector clust. () | 8.13 | 12.83 | 16.75 | 31.90 | 23.08 | 13.40 |
| EEND-vector clust. () | 9.77 | 16.27 | 16.12 | 27.21 | 23.16 | 14.84 |
4.4.2 Comparison with state-of-the-art methods
Tables 2 and 3 show the diarization error rate (DER) of the proposed EEND-vector clustering in comparison with state-of-the-art methods, performed with the oracle number of speakers (Table 2) and with the estimated number of speakers (Table 3). They first reveal that, when comparing the best performing configuration of the proposed method () with the other state-of-the-art methods, EEND-vector clustering largely outperforms the others. It works significantly better than EDA-EEND, especially when the number of speakers is large. The proposed method generally works better than x-vector clustering, most probably because it handles overlapped speech well and can estimate speaker embeddings from longer segments. It also outperforms RPNSD, which combines NN-based diarization with clustering in a different manner.
However, if we decrease the size of the processing block , the diarization performance tends to degrade. This issue should be resolved in the future. Although, for telephone speech data such as CALLHOME, speakers tend to speak for a long time and relatively few different speakers appear in relatively long segments, this may not always be the case, especially when dealing with more casual real-life conversations. To deal with such data, we need to use a shorter block size to keep the maximum number of speakers in a block as small as . Although it is not shown here because of space limitations, by examining the DERs more carefully, we found that, as we decrease the block size, missed speech and false alarms do not significantly increase but speaker confusion does. This suggests that if we could improve the accuracy of the speaker embeddings or the clustering algorithm, we could bring about a significant improvement even when the processing block is short, which will be part of our future work.
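For readers unfamiliar with how DER decomposes into missed speech, false alarm and speaker confusion, a simplified frame-level computation is sketched below. It is a generic illustration rather than the scoring procedure used in the experiments: the function name `frame_level_der` is hypothetical, the reference and system speakers are assumed to be already aligned column by column, and details of standard scoring tools such as the speaker mapping search are left out.

```python
import numpy as np


def frame_level_der(reference, hypothesis):
    """Frame-level DER and its three components.

    reference, hypothesis: (T, S) binary arrays whose columns refer to
    the same speakers (alignment is assumed to be done beforehand).
    """
    ref = np.asarray(reference, dtype=bool)
    hyp = np.asarray(hypothesis, dtype=bool)
    n_ref = ref.sum(axis=1)              # active reference speakers
    n_hyp = hyp.sum(axis=1)              # active system speakers
    n_correct = (ref & hyp).sum(axis=1)  # correctly detected speakers

    total = n_ref.sum()
    miss = np.maximum(n_ref - n_hyp, 0).sum() / total
    false_alarm = np.maximum(n_hyp - n_ref, 0).sum() / total
    confusion = (np.minimum(n_ref, n_hyp) - n_correct).sum() / total
    return {"miss": miss, "false_alarm": false_alarm,
            "confusion": confusion,
            "der": miss + false_alarm + confusion}


reference = [[1, 0], [1, 1], [0, 0], [0, 1]]
hypothesis = [[1, 0], [1, 0], [1, 0], [1, 0]]
scores = frame_level_der(reference, hypothesis)
```

In this toy example, one frame contributes a missed speaker, one a false alarm and one a speaker confusion, out of four reference speaker-frames in total.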
5 Conclusions
This paper evaluated the EEND-vector clustering framework on the real conversational speech dataset CALLHOME, and showed that it significantly outperforms the conventional state-of-the-art methods. We also experimentally showed the importance of constrained clustering in our framework. Future work includes investigations to improve the speaker embedding extraction and clustering performance when dealing with short speech segments.
-  X. Anguera, S. Bozonnet, N. Evans, C. Fredouille, G. Friedland, and O. Vinyals, “Speaker diarization: A review of recent research,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 20, no. 2, pp. 356–370, Feb 2012.
-  N. Ryant, K. Church, C. Cieri, A. Cristia, J. Du, S. Ganapathy, and M. Liberman, First DIHARD Challenge Evaluation Plan, 2018, https://zenodo.org/record/1199638.
-  J. Carletta, S. Ashby, S. Bourban, M. Flynn, M. Guillemot, T. Hain, J. Kadlec, V. Karaiskos, W. Kraaij, M. Kronenthal, G. Lathoud, M. Lincoln, A. Lisowska, I. McCowan, W. Post, D. Reidsma, and P. Wellner, “The AMI meeting corpus: A pre-announcement,” in The Second International Conference on Machine Learning for Multimodal Interaction, ser. MLMI’05, 2006, pp. 28–39.
-  D. Snyder, P. Ghahremani, D. Povey, D. Garcia-Romero, Y. Carmiel, and S. Khudanpur, “Deep neural network-based speaker embeddings for end-to-end speaker verification,” in Proc. IEEE Spoken Language Technology Workshop, 2016.
-  G. Sell, D. Snyder, A. McCree, D. Garcia-Romero, J. Villalba, M. Maciejewski, V. Manohar, N. Dehak, D. Povey, S. Watanabe, and S. Khudanpur, “Diarization is hard: Some experiences and lessons learned for the JHU team in the inaugural DIHARD challenge,” in Proc. Interspeech 2018, 2018, pp. 2808–2812. [Online]. Available: http://dx.doi.org/10.21437/Interspeech.2018-1893
-  M. Diez, F. Landini, L. Burget, J. Rohdin, A. Silnova, K. Zmolikova, O. Novotný, K. Veselý, O. Glembek, O. Plchot, L. Mošner, and P. Matějka, “BUT system for DIHARD speech diarization challenge 2018,” in Proc. Interspeech 2018, 2018, pp. 2798–2802. [Online]. Available: http://dx.doi.org/10.21437/Interspeech.2018-1749
-  Y. Fujita, N. Kanda, S. Horiguchi, K. Nagamatsu, and S. Watanabe, “End-to-end neural speaker diarization with permutation-free objectives,” in Proc. Interspeech 2019, 2019, pp. 4300–4304.
-  Y. Fujita, N. Kanda, S. Horiguchi, Y. Xue, K. Nagamatsu, and S. Watanabe, “End-to-end neural speaker diarization with self-attention,” in Proc. IEEE ASRU, 2019, pp. 296–303.
-  S. Horiguchi, Y. Fujita, S. Watanabe, Y. Xue, and K. Nagamatsu, “End-to-end speaker diarization for an unknown number of speakers with encoder-decoder based attractors,” 2020, arXiv:2005.09921.
-  A. Zhang, Q. Wang, Z. Zhu, J. Paisley, and C. Wang, “Fully supervised speaker diarization,” in Proc. 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2019, pp. 6301–6305.
-  X. Li, Y. Zhao, C. Luo, and W. Zeng, “Online speaker diarization with relation network,” 2020, arXiv:2009.08162.
-  N. Ryant, P. Singh, V. Krishnamohan, R. Varma, K. Church, C. Cieri, J. Du, S. Ganapathy, and M. Liberman, “The third DIHARD diarization challenge,” 2021.
-  S. Horiguchi, N. Yalta, P. Garcia, Y. Takashima, Y. Xue, D. Raj, Z. Huang, Y. Fujita, S. Watanabe, and S. Khudanpur, “The Hitachi-JHU DIHARD III system: Competitive end-to-end neural diarization and x-vector clustering systems combined by DOVER-Lap,” 2021.
-  K. Kinoshita, M. Delcroix, and N. Tawara, “Integrating end-to-end neural and clustering-based diarization: Getting the best of both worlds,” in Proc. 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (To appear), 2021.
-  M. Przybocki and A. Martin, 2000 NIST Speaker Recognition Evaluation (LDC2001S97). Philadelphia, Pennsylvania: Linguistic Data Consortium, 2001.
-  Z. Huang, S. Watanabe, Y. Fujita, P. García, Y. Shao, D. Povey, and S. Khudanpur, “Speaker diarization with region proposal network,” in ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2020, pp. 6514–6518.
-  K. Wagstaff, C. Cardie, S. Rogers, and S. S. Schroedl, “Constrained k-means clustering with background knowledge,” in Proc. 18th International Conference on Machine Learning (ICML), 2001.
-  H. Aronowitz, W. Zhu, M. Suzuki, G. Kurata, and R. Hoory, “New advances in speaker diarization,” in Proc. Interspeech 2020, 2020, pp. 279–283.
-  I. Davidson and S. S. Ravi, “Using instance-level constraints in agglomerative hierarchical clustering: theoretical and empirical results,” Data Mining and Knowledge Discovery, vol. 77, no. 18, pp. 257–282, Dec. 2009.
-  U. von Luxburg, “A tutorial on spectral clustering,” Statist. and Comput., vol. 17, no. 4, pp. 395–416, 2007.
-  X. Wang, B. Qian, and I. Davidson, “On constrained spectral clustering and its applications,” Data Mining and Knowledge Discovery, vol. 28, pp. 1–30, Dec. 2014.
-  S. D. Kamvar, D. Klein, and C. D. Manning, “Spectral learning,” in Proc. 18th International Joint Conference on Artificial Intelligence (IJCAI), 2003, pp. 561–566.
-  D. Snyder, G. Chen, and D. Povey, “MUSAN: A music, speech, and noise corpus,” 2015, arXiv:1510.08484.
-  T. Ko, V. Peddinti, D. Povey, M. L. Seltzer, and S. Khudanpur, “A study on data augmentation of reverberant speech for robust speech recognition,” in Proc. 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), March 2017, pp. 5220–5224.