When building an audio-based human-computer interaction (HCI) system, it is important to provide speaker turn information as well as speech transcription information. Speaker diarization locates speaker turns and identifies the speaker in each segment, a task defined as “who spoke when” [1, 2, 3]. Speaker diarization has been widely applied to meetings, call-center telephone conversations, and home environments (CHiME-5) [4, 5, 6, 7].
Online speaker diarization outputs the diarization result as soon as an audio segment arrives, which means no future information is available when analyzing the current segment. In contrast, in offline mode, the whole recording is processed so that all segments can be compared and clustered at the same time. Currently, few speaker diarization systems can be applied in practical scenarios because most of them work well only under specific conditions such as long latency, no overlapping speech, or low noise [9, 10]. An online speaker diarization system with low latency is still an open technical problem.
Current research focuses primarily on speaker models or speaker embeddings, such as Gaussian mixture models (GMMs) [8, 13], i-vectors [14, 15, 16], d-vectors [17, 18], and x-vectors [19, 20, 21, 22, 23]. The issue with these methods is that they cannot directly minimize the diarization error because they are based on unsupervised algorithms. Zhang et al. [12, 24] proposed a supervised online speaker diarization approach, but the method still assumes only one speaker in each segment (no overlap).
Fujita et al. [26, 27] proposed an end-to-end speaker diarization system that directly minimizes the diarization error by training a neural network using Permutation Invariant Training (PIT) with multi-speaker recordings. Their experimental results show that the self-attention based end-to-end speaker diarization (SA-EEND) system [26, 27] outperformed the state-of-the-art i-vector and x-vector clustering methods and a long short-term memory (LSTM) based end-to-end method. Although SA-EEND has achieved significant improvement, it works only in the offline condition due to the self-attention mechanism, which outputs speaker labels only after the whole recording is provided.
This paper first investigates a straightforward online extension of SA-EEND by performing diarization independently for each chunked recording. However, this straightforward online extension degrades the diarization error rate (DER) due to the speaker permutation inconsistency across chunks, especially for short-duration chunks. Therefore, we propose a method called the speaker-tracing buffer, which can track speaker information consistently across chunks by extending the self-attention mechanism to maintain the speaker permutation information determined in previous chunks. More specifically, we select a fixed number of input frames in the previous chunk that carry dominant speaker permutation information based on the diarization output probability. These additional input frames are fed into the self-attention layer to take over the speaker permutation information determined in the previous chunk. Our experimental results show that choosing the buffer frames using the absolute probability difference of the output speaker labels yields the best results compared with other methods. The code of SA-EEND with the speaker-tracing buffer will be available at https://github.com/hitachi-speech/EEND.
2 Analysis of online SA-EEND
2.1 SA-EEND

In SA-EEND, the speaker diarization task is formulated as a probabilistic multi-label classification problem. Given the $T$-length acoustic feature sequence $X = (\mathbf{x}_t \mid t = 1, \dots, T)$, with an $F$-dimensional observation feature vector $\mathbf{x}_t$ at time index $t$, SA-EEND predicts the corresponding speaker label sequence $\hat{Y} = (\hat{\mathbf{y}}_t \mid t = 1, \dots, T)$. Here, speaker label $\hat{\mathbf{y}}_t = [\hat{y}_{t,s} \in \{0, 1\} \mid s = 1, \dots, S]$ represents a joint activity for multiple speakers ($S$) at time $t$. For example, $\hat{y}_{t,s} = \hat{y}_{t,s'} = 1$ ($s \neq s'$) means both speakers $s$ and $s'$ spoke at time $t$. Thus, determining $\hat{Y}$ is the key for determining the speaker diarization information as follows:
$$\hat{Y} = \mathrm{NN}(X), \tag{1}$$
where $\mathrm{NN}(\cdot)$ is a multi-head self-attention based neural network.
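As a minimal sketch of this permutation-invariant multi-label formulation (the function below is illustrative, not the authors' implementation), the PIT objective can be computed by taking the minimum binary cross-entropy over all speaker permutations:

```python
import itertools
import numpy as np

def pit_bce_loss(probs, labels):
    """Permutation-invariant binary cross-entropy (a sketch).

    probs:  (T, S) per-frame speaker activity probabilities from the network.
    labels: (T, S) reference 0/1 speaker activities.
    Returns the loss under the best speaker permutation.
    """
    eps = 1e-12
    best = np.inf
    for perm in itertools.permutations(range(labels.shape[1])):
        p = probs[:, list(perm)]  # permute speaker columns
        bce = -(labels * np.log(p + eps)
                + (1 - labels) * np.log(1 - p + eps)).mean()
        best = min(best, bce)
    return best
```

Because the loss is minimized over permutations, a prediction that is perfect up to a speaker swap incurs almost no penalty.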
Note that the vanilla self-attention layers have to wait for all speech features in an entire recording to be processed before computing the output speaker labels. Thus, this method incurs a very large latency determined by the length of the recording, and is not adequate for an online/real-time speech interface.
2.2 Chunk-wise SA-EEND for online inference
This paper first investigates the use of SA-EEND as shown in Eq. (1) for chunked recordings with chunk size $\Delta$, as follows:
$$\hat{Y}_i = \mathrm{NN}(X_i), \tag{2}$$
where $i$ denotes a chunk index, and $i = 1, \dots, \lceil T / \Delta \rceil$. $X_i$ and $\hat{Y}_i$ denote subsequences of $X$ and $\hat{Y}$ at chunk $i$, respectively. The latency can be suppressed to the chunk size $\Delta$ instead of the entire recording length $T$. We first investigate the influence of chunk size $\Delta$ on the diarization performance.
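The chunk-wise decoding of Eq. (2) can be sketched as follows; `sa_eend` is a stand-in for a trained model and its interface is an assumption:

```python
import numpy as np

def chunked_diarization(features, sa_eend, chunk_size):
    """Run a (hypothetical) SA-EEND model independently on fixed-size chunks.

    features: (T, F) acoustic feature sequence.
    sa_eend:  callable mapping a (t, F) chunk to (t, S) speaker probabilities.
    Returns the concatenated (T, S) output. Note that the speaker order may
    flip across chunks because each chunk is decoded independently.
    """
    outputs = []
    for start in range(0, len(features), chunk_size):
        chunk = features[start:start + chunk_size]
        outputs.append(sa_eend(chunk))
    return np.concatenate(outputs, axis=0)
```

This independence across chunks is precisely what causes the permutation inconsistency analyzed below.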
2.2.1 Model configuration and dataset
Here, two encoder blocks with 256 attention units containing four heads without residual connections were trained. The input features are 23-dimensional log-Mel-filterbanks concatenated with the previous seven frames and the subsequent seven frames, with a 25-ms frame length and a 10-ms frame shift, followed by subsampling with a factor of ten. In other words, a 345-dimensional input feature is fed into the neural network every 100 ms, which means the duration of one chunk unit is 100 ms.
Two datasets are used for this analysis. The first one, CALLHOME, consists of actual two-speaker telephone conversations. Following the steps in , we split CALLHOME into two parts: 155 recordings for adaptation and 148 recordings for evaluation. The average overlap ratio of the test set is . The average duration of recordings in CALLHOME is . The second dataset is the Corpus of Spontaneous Japanese (CSJ), which consists of interviews, natural conversations, etc. We used 54 recordings in this evaluation, and their average overlap ratio is . There are two speakers in each recording, and the average duration is .
2.2.2 Analysis results
In this section, we analyze the relationship between the chunk size in Eq. (2) and the DER. The recordings to be evaluated were first divided according to chunk size and then fed into a SA-EEND system one by one to obtain the diarization result of each chunk. These chunk-wise diarization results were then combined as the final diarization result of the whole recording; we call this the recording-wise DER, calculated on the entire recording. When computing the DER, a collar tolerance was used at the start and the end of each segment. We also evaluated overlapping speech and non-speech regions.
Note that this chunk-wise SA-EEND method does not guarantee that the speaker labels obtained across chunks are consistent, due to the speaker permutation ambiguity underlying the general speaker diarization problem. Thus, the recording-wise DER would be degraded by this across-chunk speaker inconsistency. To measure this degradation, we also computed the oracle DER in each chunk separately (chunk-wise DER), which does not include the across-chunk speaker inconsistency error.
The analytical results are shown in Figure 1 for the CALLHOME and CSJ datasets. In these figures, the x-axis represents the chunk size during inference. Here, one chunk unit corresponds to 100 ms, which means the latency of the system is 1 s when the chunk size is 10 (i.e., 10 × 100 ms = 1 s). The y-axis represents the final DER of the whole dataset. As shown in Figure 1, the recording-wise DER decreased as the chunk size increased for both datasets. When the chunk size was larger than 800, the recording-wise DER tended to converge for CALLHOME. On the other hand, the oracle chunk-wise DER was much smaller and more stable than the recording-wise DER even when the chunk size was small, for both datasets. This indicates that the main degradation of online chunk-wise SA-EEND comes from the across-chunk speaker permutation inconsistency. Based on these findings, the next section explores how to solve this across-chunk speaker permutation issue.
3 Speaker-tracing buffer
In this section, we propose a method called the speaker-tracing buffer, which utilizes previous information as a clue to solve the across-chunk permutation issue.
3.1 Speaker-tracing with buffer
Let $X^{\mathrm{buf}}$ and $Y^{\mathrm{buf}}$ be the $L$-length acoustic feature buffer and the corresponding SA-EEND outputs, respectively, which contain the speaker-tracing information. At the initial stage, $X^{\mathrm{buf}}$ and $Y^{\mathrm{buf}}$ are empty. Our online diarization is performed by referring to and updating this speaker-tracing buffer, as shown in Algorithm 1. The input of the SA-EEND system is the concatenation of the acoustic feature subsequence $X_i$ at the current chunk and the acoustic features in buffer $X^{\mathrm{buf}}$, i.e., $[X^{\mathrm{buf}}; X_i]$. The corresponding output of SA-EEND is $[\hat{Y}^{\mathrm{buf}}; \hat{Y}_i]$. If $Y^{\mathrm{buf}}$ is not empty, the correlation coefficient $\rho_\psi$ between $Y^{\mathrm{buf}}$ and the current buffer output $\hat{Y}^{\mathrm{buf}}_\psi$ at output speaker permutation $\psi$ is calculated as
$$\rho_\psi = \mathrm{CC}\big(Y^{\mathrm{buf}}, \hat{Y}^{\mathrm{buf}}_\psi\big), \tag{3}$$
where $\mathrm{CC}(\cdot,\cdot)$ denotes the correlation coefficient. The permutation $\hat{\psi}$ with the largest correlation coefficient is chosen as follows:
$$\hat{\psi} = \operatorname*{arg\,max}_{\psi \in \mathrm{Perm}(S)} \rho_\psi, \tag{4}$$
where $\mathrm{Perm}(S)$ generates all permutations according to the number of speakers $S$. The corresponding output $\hat{Y}_{i,\hat{\psi}}$ is chosen as the final output of chunk $i$, which maintains a consistent speaker permutation across chunks. The obtained output is stacked with the previously estimated outputs to form the whole recording's output in the end. An example of applying the speaker-tracing buffer to SA-EEND in the first two chunks is shown in Figure 2, where $\Delta$ is equal to 10, the buffer size $L$ is 5, and the number of speakers is 2.
The speaker-tracing buffer for the next chunk is selected from the concatenated input features and corresponding outputs of the current chunk. We consider three selection strategies, as explained in the next section.
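The correlation-based permutation resolution described above can be sketched as follows; this is a minimal illustration with assumed names, not the released implementation:

```python
import itertools
import numpy as np

def resolve_permutation(buf_out_prev, buf_out_new):
    """Pick the speaker permutation of the new output that best matches the
    stored buffer output, using the correlation coefficient.

    buf_out_prev: (L, S) outputs stored in the buffer from earlier chunks.
    buf_out_new:  (L, S) outputs for the same buffered frames, re-decoded
                  together with the current chunk.
    Returns the best permutation as a tuple of speaker-column indices.
    """
    n_speakers = buf_out_prev.shape[1]
    best_perm, best_rho = None, -np.inf
    for perm in itertools.permutations(range(n_speakers)):
        a = buf_out_prev.ravel()
        b = buf_out_new[:, list(perm)].ravel()
        rho = np.corrcoef(a, b)[0, 1]  # Pearson correlation coefficient
        if rho > best_rho:
            best_rho, best_perm = rho, perm
    return best_perm
```

Applying the returned permutation to the current chunk's output keeps the speaker labels consistent with earlier chunks.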
3.2 Selection strategy for speaker-tracing buffer
If the chunk size $\Delta$ is not larger than the pre-defined buffer size $L$, we can simply store all the features in the buffer until the number of stored features reaches the buffer size. Once the number of accumulated features becomes larger than the buffer size $L$, we have to select and store informative features that contain the speaker permutation information from the concatenated inputs and outputs. In this section, three selection rules for updating the buffer are listed. Here we assume that the number of speakers is 2.
Uniform sampling (US): $L$ acoustic features and the corresponding diarization results are randomly extracted based on the uniform distribution.
Deterministic selection (DS) using the absolute difference of the probabilities of the two speakers:
$$\delta_t = |p_{t,1} - p_{t,2}|, \tag{5}$$
where $p_{t,1}$ and $p_{t,2}$ are the probabilities of the first and second speakers at time index $t$. The maximum value of $\delta_t$ ($= 1$) is realized in either case of $(p_{t,1}, p_{t,2}) = (1, 0)$ or $(0, 1)$. This means that we try to find dominant active-speaker frames. The top $L$ samples with the highest $\delta_t$ are selected.
Weighted sampling (WS): This is a combination of uniform sampling and deterministic selection. We randomly select $L$ features, but the probability of selecting the $t$-th feature is proportional to $\delta_t$ in Eq. (5).
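The three selection rules might be sketched as follows for the two-speaker case; the function and its interface are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed for reproducibility

def select_buffer(features, probs, buffer_size, strategy="DS"):
    """Select frames to keep in the speaker-tracing buffer (two speakers).

    features: (N, F) candidate features (old buffer + current chunk).
    probs:    (N, 2) speaker probabilities for the same frames.
    Returns the selected (features, probs), in time order.
    """
    n = len(features)
    if n <= buffer_size:
        return features, probs
    delta = np.abs(probs[:, 0] - probs[:, 1])  # |p_t1 - p_t2|, Eq. (5)
    if strategy == "US":    # uniform sampling
        idx = rng.choice(n, size=buffer_size, replace=False)
    elif strategy == "DS":  # deterministic: top frames by delta
        idx = np.argsort(-delta)[:buffer_size]
    else:                   # WS: sample with probability proportional to delta
        idx = rng.choice(n, size=buffer_size, replace=False,
                         p=delta / delta.sum())
    idx = np.sort(idx)
    return features[idx], probs[idx]
```

DS keeps the frames with the most confident single-speaker activity, while WS trades some of that determinism for diversity.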
4 Experimental results
4.1 Effect of selection strategy
We analyzed the effect of the selection strategy using the same chunk size of 10 and several buffer sizes from 10 to 600. In Table 1, the numbers in the left column (10-500) represent the chunk size for the offline system and the buffer size for the online setting. As shown in Table 1, applying the speaker-tracing buffer improved the performance of online SA-EEND regardless of which selection strategy was used. Among the strategies, WS performed best for both datasets in most cases when $L$ was large (larger than 100). Therefore, we consider WS as the selection strategy for further analysis.
4.2 Effect of buffer and chunk size
Next, we analyzed the effect of buffer and chunk size. The DER results for CALLHOME and CSJ when applying the weighted sampling (WS) selection strategy are shown in Figure 3. Chunk sizes were 10 and 20, corresponding to latencies of 1 s and 2 s, respectively. Regarding the chunk size in Figure 3, all DERs from the large chunk size are better than those from the small chunk size even when the buffer size is the same. As for the buffer size, when the chunk size was the same, the DER decreased as the buffer size increased. These results are in line with our assumption that a larger input size leads to a better result.
4.3 Real-time factor
The real-time factor (RTF) was calculated as the ratio of the sum of the execution times of every chunk to the recording duration; it measures the decoding speed and thus the time performance of the proposed system. To avoid unequal buffer sizes in the first several chunks, we first filled the buffer with dummy values and then calculated the RTF. Our experiment was conducted on an Intel® Xeon® CPU E5-2697A v2 @ 2.60 GHz using one thread. RTFs were 0.40 and 1.07 for buffer sizes of 500 and 1000, respectively, which indicates that the proposed method is acceptable for online applications when the buffer size is smaller than 1000 (RTF < 1).
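The RTF measurement described above can be sketched as follows (a simple illustration, not the authors' benchmarking code):

```python
import time

def real_time_factor(process_chunk, chunks, chunk_duration_s):
    """Real-time factor: total processing time / total audio duration.

    process_chunk:    any callable that processes one chunk.
    chunks:           iterable of audio chunks.
    chunk_duration_s: duration of each chunk in seconds.
    RTF < 1 means the system keeps up with real time.
    """
    total = 0.0
    for chunk in chunks:
        start = time.perf_counter()
        process_chunk(chunk)
        total += time.perf_counter() - start
    return total / (len(chunks) * chunk_duration_s)
```
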
4.4 Comparison with other methods
For a comparison with other methods, we evaluated our proposed method using two real datasets (CALLHOME (CH) and CSJ) and three simulated datasets, as shown in Table 2. The simulated datasets were created using two speakers' segments. The background noise and room impulse responses were from the MUSAN corpus and the Simulated Room Response corpus, following the procedure in . Three kinds of simulated datasets were created with overlap ratios equal to , and , respectively.
For the offline i-vector and x-vector methods, we applied the Kaldi CALLHOME diarization v1 and v2 recipes [31, 19, 32]. They are offline diarization methods that apply probabilistic linear discriminant analysis along with agglomerative clustering, TDNN-based speech activity detection, and the oracle number of speakers. Offline SA-EEND refers to using a chunk size of the entire recording. The system in  that achieved the best performance is applied here, not only for offline SA-EEND but also for online SA-EEND () and all proposed methods.
For the online x-vector system, the speech is first divided into subsequent chunks. Each chunk is judged as speech or silence using an energy-based VAD for the real datasets and the oracle VAD for the simulated datasets. If the percentage of voiced frames in the chunk is below a threshold, the chunk is considered silence and the following steps are skipped. For a voiced chunk, we extract an x-vector and assign it to the first cluster until a dissimilar x-vector arrives according to the probabilistic linear discriminant analysis (PLDA) score; a threshold of 0 is applied as the dissimilarity criterion. Once two clusters exist, each newly arriving x-vector is scored against both clusters with PLDA and assigned to the nearest cluster. For online SA-EEND, the chunk size is set to 10 without applying a speaker-tracing buffer. The proposed method applies the weighted-sampling-based speaker-tracing buffer to SA-EEND.
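The greedy clustering logic of this online x-vector baseline can be sketched as follows; `plda_score` and all names are assumptions standing in for a real PLDA scorer:

```python
import numpy as np

def online_xvector_clustering(xvectors, plda_score, threshold=0.0):
    """Greedy online two-cluster assignment of x-vectors by PLDA score.

    xvectors:   iterable of embedding vectors, in arrival order.
    plda_score: assumed callable giving a similarity score between an
                x-vector and a cluster centroid (higher = more similar).
    Returns one cluster label (0 or 1) per x-vector.
    """
    clusters, labels = [], []
    for x in xvectors:
        if not clusters:                      # first x-vector opens cluster 0
            clusters.append([x]); labels.append(0); continue
        if len(clusters) == 1:
            # stay in cluster 0 until a dissimilar x-vector arrives
            if plda_score(x, np.mean(clusters[0], axis=0)) >= threshold:
                clusters[0].append(x); labels.append(0)
            else:
                clusters.append([x]); labels.append(1)
            continue
        # two clusters exist: assign to the nearest by PLDA score
        scores = [plda_score(x, np.mean(c, axis=0)) for c in clusters]
        k = int(np.argmax(scores))
        clusters[k].append(x); labels.append(k)
    return labels
```
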
As shown in Table 2, compared with the x-vector based online system, the proposed method obtained the best results in the online setting. The proposed method performed even better on CSJ than the offline clustering-based i-vector and x-vector methods. The online method increased the DER by 3.27 and 1.16 points compared with the offline SA-EEND system on the two real datasets when the buffer size is 500 and the latency is 1 s. To explore this increased DER, the DER broken down with a calibration period of 30 s is calculated and shown in Table 3. Compared with offline SA-EEND, the proposed method increases the DER by only 0.52 and 1.07 points for the two datasets after the calibration period. Therefore, we can conclude that our proposed method reduces the latency of the SA-EEND system from the entire duration of the recording to only 1 s with comparable diarization performance.
[Table 2, excerpt: Online x-vector (): 36.94 / 34.94 / 33.19 / 26.90 / 25.45; Online SA-EEND (): 33.18 / 37.31 / 41.41 / 36.93 / 47.57]
[Table 3 column headers: Within 30 s | After 30 s | All]
In this paper, we proposed a speaker-tracing buffer that memorizes previous speaker permutation information, enabling a pre-trained offline SA-EEND system to work online directly. The latency can be reduced to 1 s with comparable diarization performance. Future work will concentrate on handling a flexible number of speakers, as the current method is limited to the two-speaker case.
-  S. E. Tranter and D. A. Reynolds, “An overview of automatic speaker diarization systems,” IEEE Transactions on audio, speech, and language processing, vol. 14, no. 5, pp. 1557–1565, 2006.
-  G. Sell, D. Snyder, A. McCree, D. Garcia-Romero, J. Villalba, M. Maciejewski, V. Manohar, N. Dehak, D. Povey, S. Watanabe et al., “Diarization is hard: Some experiences and lessons learned for the JHU team in the inaugural DIHARD challenge.” in INTERSPEECH, 2018, pp. 2808–2812.
-  N. Ryant, K. Church, C. Cieri, A. Cristia, J. Du, S. Ganapathy, and M. Liberman, “The second dihard diarization challenge: Dataset, task, and baselines,” in INTERSPEECH, 2019, pp. 978–982.
-  J. Barker, S. Watanabe, E. Vincent, and J. Trmal, “The fifth “chime” speech separation and recognition challenge: Dataset, task and baselines,” in INTERSPEECH, 2018.
-  N. Kanda, R. Ikeshita, S. Horiguchi, Y. Fujita, K. Nagamatsu, X. Wang, V. Manohar, N. E. Y. Soplin, M. Maciejewski, S.-J. Chen et al., “The Hitachi/JHU CHiME-5 system: Advances in speech recognition for everyday home environments using multiple microphone arrays,” in The 5th International Workshop on Speech Processing in Everyday Environments (CHiME 2018), Interspeech, 2018.
-  “2000 NIST Speaker Recognition Evaluation,” https://catalog.ldc.upenn.edu/LDC2001S97.
-  A. Martin and M. Przybocki, “The NIST 1999 speaker recognition evaluation—an overview,” Digital signal processing, vol. 10, no. 1-3, pp. 1–18, 2000.
-  J. Geiger, F. Wallhoff, and G. Rigoll, “GMM-UBM based open-set online speaker diarization,” in INTERSPEECH, 2010.
-  T. von Neumann, K. Kinoshita, M. Delcroix, S. Araki, T. Nakatani, and R. Haeb-Umbach, “All-neural online source separation, counting, and diarization for meeting analysis,” in ICASSP, 2019, pp. 91–95.
-  M. Maciejewski, D. Snyder, V. Manohar, N. Dehak, and S. Khudanpur, “Characterizing performance of speaker diarization systems on far-field speech using standard methods,” in ICASSP, 2018, pp. 5244–5248.
-  S. H. Shum, N. Dehak, R. Dehak, and J. R. Glass, “Unsupervised methods for speaker diarization: An integrated and iterative approach,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 21, no. 10, pp. 2015–2028, 2013.
-  A. Zhang, Q. Wang, Z. Zhu, J. Paisley, and C. Wang, “Fully supervised speaker diarization,” in ICASSP, 2019, pp. 6301–6305.
-  K. Markov and S. Nakamura, “Improved novelty detection for online GMM based speaker diarization,” in INTERSPEECH, 2008.
-  S. Madikeri, I. Himawan, P. Motlicek, and M. Ferras, “Integrating online i-vector extractor with information bottleneck based speaker diarization system,” in INTERSPEECH, 2015, pp. 3105–3109.
-  D. Garcia-Romero, D. Snyder, G. Sell, D. Povey, and A. McCree, “Speaker diarization using deep neural network embeddings,” in ICASSP, 2017, pp. 4930–4934.
-  W. Zhu and J. Pelecanos, “Online speaker diarization using adapted i-vector transforms,” in ICASSP, 2016, pp. 5045–5049.
-  Q. Wang, C. Downey, L. Wan, P. A. Mansfield, and I. L. Moreno, “Speaker diarization with LSTM,” in ICASSP, 2018, pp. 5239–5243.
-  L. Wan, Q. Wang, A. Papir, and I. L. Moreno, “Generalized end-to-end loss for speaker verification,” in ICASSP, 2018, pp. 4879–4883.
-  D. Snyder, D. Garcia-Romero, G. Sell, D. Povey, and S. Khudanpur, “X-vectors: Robust DNN embeddings for speaker recognition,” in ICASSP, 2018, pp. 5329–5333.
-  G. Sell and D. Garcia-Romero, “Speaker diarization with plda i-vector scoring and unsupervised calibration,” in SLT, 2014, pp. 413–417.
-  H. Ning, M. Liu, H. Tang, and T. S. Huang, “A spectral clustering approach to speaker diarization,” in Ninth International Conference on Spoken Language Processing, 2006.
-  D. Dimitriadis and P. Fousek, “Developing on-line speaker diarization system.” in INTERSPEECH, 2017, pp. 2739–2743.
-  J. Patino, R. Yin, H. Delgado, H. Bredin, A. Komaty, G. Wisniewski, C. Barras, N. W. Evans, and S. Marcel, “Low-latency speaker spotting with online diarization and detection.” in Odyssey, 2018, pp. 140–146.
-  E. Fini and A. Brutti, “Supervised online diarization with sample mean loss for multi-domain data,” in ICASSP, 2020, pp. 7134–7138.
-  Y. Fujita, N. Kanda, S. Horiguchi, K. Nagamatsu, and S. Watanabe, “End-to-end neural speaker diarization with permutation-free objectives,” in INTERSPEECH, 2019, pp. 4300–4304.
-  Y. Fujita, N. Kanda, S. Horiguchi, Y. Xue, K. Nagamatsu, and S. Watanabe, “End-to-end neural speaker diarization with self-attention,” in ASRU, 2019.
-  Y. Fujita, S. Watanabe, S. Horiguchi, Y. Xue, and K. Nagamatsu, “End-to-end neural diarization: Reformulating speaker diarization as simple multi-label classification,” arXiv preprint arXiv:2003.02966, 2020.
-  K. Maekawa, “Corpus of Spontaneous Japanese: Its design and evaluation,” in ISCA & IEEE Workshop on Spontaneous Speech Processing and Recognition, 2003.
-  D. Snyder, G. Chen, and D. Povey, “MUSAN: A music, speech, and noise corpus,” arXiv preprint arXiv:1510.08484, 2015.
-  T. Ko, V. Peddinti, D. Povey, M. L. Seltzer, and S. Khudanpur, “A study on data augmentation of reverberant speech for robust speech recognition,” in ICASSP, 2017, pp. 5220–5224.
-  D. Snyder, D. Garcia-Romero, D. Povey, and S. Khudanpur, “Deep neural network embeddings for text-independent speaker verification.” in INTERSPEECH, 2017, pp. 999–1003.
-  D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Hannemann, P. Motlicek, Y. Qian, P. Schwarz et al., “The Kaldi speech recognition toolkit,” in ASRU, 2011.
-  S. Ioffe, “Probabilistic linear discriminant analysis,” in ECCV, 2006, pp. 531–542.
-  V. Peddinti, G. Chen, V. Manohar, T. Ko, D. Povey, and S. Khudanpur, “JHU ASPIRE system: Robust LVCSR with TDNNs, iVector adaptation and RNN-LMS,” in ASRU, 2015, pp. 539–546.