Speaker diarization is the task of partitioning speech into segments and labeling them with classes corresponding to speaker identities, namely "who spoke when" [2021review]. It is an important front-end for many applications, such as meeting summarization, telephone conversation analysis and speaker-based indexing [diarization_application1, diarization_application3]. Accordingly, diarization performance is evaluated across many domains, such as broadcast news recordings [broadcast], conversational telephone speech (CTS) [CTS] and meeting conversations [meeting].
Conventional clustering-based methods are widely used in speaker diarization [CD1, CD2]. The core of this technique is extracting and clustering speaker representations, which mainly include i-vectors [ivector] and neural network based embeddings such as d-vectors [dvector] and x-vectors [xvector]. Among these algorithms, the variational Bayesian hidden Markov model with x-vectors (VBx) [BUT1] has achieved superior performance and ranked first in the DIHARD-II Challenge [dihard2]. However, diarization based on speaker clustering cannot handle overlapping speech well because a speech segment can only be assigned to one specific cluster. To solve this problem, end-to-end neural speaker diarization (EEND) [EEND, EENDEDA, EENDSC] approaches were proposed, where the diarization results are directly predicted by neural networks, which allows the system to be optimized by minimizing diarization errors. Target-speaker voice activity detection (TS-VAD) [TSVAD] was then proposed to predict the activity of each speaker at each time frame using speech features along with speaker embeddings. All these techniques are capable of dealing with overlapping speech based on classification networks.
Speech separation (SS) is the task of separating target speech from background interference [sep1]. Most speech separation approaches have been formulated in the time-frequency (T-F) domain. Such models include deep neural networks (DNNs) [sepdnn], recurrent neural networks (RNNs) [seprnn, seprnn1], and generative adversarial networks (GANs) [sepgan]. Recently, time-domain networks, such as the fully-convolutional time-domain audio separation network (Conv-TasNet) [Tasnet] and dual-path RNN [Dualpathrnn], have shown good results in speech separation. However, most of the above algorithms were evaluated on simulated data. For data sets recorded under realistic conditions, separation performance cannot be directly measured due to the lack of clean reference speech. This limits the application of separation to diarization tasks on realistic data sets.
In separation-based speaker diarization (SSD), we can separate the utterances with trained Conv-TasNets and obtain diarization results by detecting the speech segments in the separated streams. However, when dealing with realistic mismatched data, separation performance is often unstable, which leads to worse diarization performance than that obtained with clustering-based speaker diarization (CSD) systems. Therefore, we analyze different failure cases in SSD systems and propose several strategies, including speech duration checking, overlap ratio checking and relative diarization error rate (DER) calculation. We call this improvement the separation guided speaker diarization (SGSD) approach, which enables separation to assist CSD in handling overlap regions. Through these strategies, SGSD can indirectly measure the separation performance on realistic utterances and select between the results of SSD and CSD.
Experiments on the CTS dataset show that the proposed SGSD system can help CSD achieve good performance on overlap regions. Similar works exist in [sepmeeting1, sepmeeting2]. However, our proposed SGSD framework offers a few major differences: (1) different from [sepmeeting1, sepmeeting2], which use BLSTM based separation models, we adopt the more powerful Conv-TasNet separation model, avoiding the assumption that the speaker masks are additive and sum to one for each time-frequency bin, which is not directly applicable to diarization [EENDSC]; (2) we evaluate our methods on a realistic mismatched single-channel dataset whose speaking style differs from our training set, which is more challenging than handling the simulated single-channel data in [sepmeeting1] or the multi-channel dataset with a speaking style similar to the training set in [sepmeeting2]; and (3) due to this more challenging situation, we cannot directly use speech separation to obtain the diarization results. Therefore, different from the multi-task perspective in [sepmeeting1, sepmeeting2], we emphasize enabling speech separation to assist CSD in the proposed SGSD system.
2 The Proposed SGSD Framework
Fig. 2 shows a flowchart of the proposed SGSD system. As can be seen, it contains two individual diarization systems: clustering-based speaker diarization (CSD) and separation-based speaker diarization (SSD). We design several strategies to better select between the results of these two systems, which enables the SSD to assist CSD in dealing with overlap regions. Details of the proposed system are described in the following subsections.
2.1 Clustering-based speaker diarization (CSD)
As shown at the bottom of Fig. 2, we use VBx [BUT1] as our clustering-based speaker diarization (CSD) system. Apart from the conventional processes, which include voice activity detection (VAD), speaker feature extraction and speaker clustering, VBx also employs a VB-HMM to refine the assignment of x-vectors to speaker clusters.
2.2 Separation-based speaker diarization (SSD)
The top of Fig. 2 illustrates the framework of the separation-based speaker diarization (SSD) system. As can be seen in the figure, the SSD system consists of two parts: separation and detection. In the separation part, the original utterance is separated into two streams by a Conv-TasNet [Tasnet] based separation model. Ideally, overlapping speech segments are automatically separated, and single-speaker speech segments are assigned to the streams corresponding to their speaker identities. In this part, the utterance-level permutation invariant training (uPIT) based learning objective is used to optimize the model parameters:
$$\mathcal{L} = \frac{1}{N} \min_{\phi \in \mathcal{P}} \sum_{n=1}^{N} \ell\big(\hat{\mathbf{s}}_{n}, \mathbf{s}_{\phi(n)}\big) \tag{1}$$

where $N$ is the number of speakers (2 in this study), $\ell$ is the error between the network output and the target, $\hat{\mathbf{s}}_{n}$ denotes the $n$-th predicted speech, and $\mathbf{s}_{\phi(n)}$ denotes the reference speech under the permutation $\phi$ (from the permutation set $\mathcal{P}$) that minimizes the training objective $\mathcal{L}$. In training, $\ell$ is calculated from the scale-invariant source-to-noise ratio (Si-SNR), an evaluation metric for source separation:

$$\text{Si-SNR} = 10 \log_{10} \frac{\|\mathbf{s}_{\text{target}}\|^{2}}{\|\mathbf{e}_{\text{noise}}\|^{2}} \tag{2}$$

where $\mathbf{s}_{\text{target}} = \frac{\langle \hat{\mathbf{s}}, \mathbf{s} \rangle \mathbf{s}}{\|\mathbf{s}\|^{2}}$ and $\mathbf{e}_{\text{noise}} = \hat{\mathbf{s}} - \mathbf{s}_{\text{target}}$. $\hat{\mathbf{s}}$ and $\mathbf{s}$ are the estimates and targets, respectively. Since a higher Si-SNR indicates better separation, the negative Si-SNR is used as the error $\ell$ to be minimized. In the detection part, the two separated speech streams are sent to a VAD model to obtain the time labels of the speech segments. Combining all detection results along the time axis then yields the speaker diarization results.
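As a concrete illustration, the uPIT objective with Si-SNR can be sketched as follows (a minimal NumPy sketch for the two-speaker case, not the actual Conv-TasNet training code):

```python
import numpy as np

def si_snr(estimate, target, eps=1e-8):
    """Scale-invariant source-to-noise ratio (in dB) between two 1-D signals."""
    # Project the estimate onto the target to obtain the scaled target component.
    s_target = np.dot(estimate, target) / (np.dot(target, target) + eps) * target
    e_noise = estimate - s_target
    return 10.0 * np.log10(np.dot(s_target, s_target) / (np.dot(e_noise, e_noise) + eps))

def upit_loss(estimates, targets):
    """uPIT loss for N = 2: negative mean Si-SNR under the best speaker permutation."""
    perm_a = si_snr(estimates[0], targets[0]) + si_snr(estimates[1], targets[1])
    perm_b = si_snr(estimates[0], targets[1]) + si_snr(estimates[1], targets[0])
    # The permutation that maximizes Si-SNR minimizes the (negative) loss.
    return -max(perm_a, perm_b) / 2.0
```

Because Si-SNR is invariant to rescaling of the estimate, a well-separated stream scores high regardless of its gain, which is why it is a convenient training objective here.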
2.3 Separation guided speaker diarization (SGSD)
Fig. 1 presents the performance (DER) comparison between the CSD and SSD systems on each utterance from the CTS domain of the DIHARD-III development set. It can be observed that the performance of the SSD system is unstable. On about half of the utterances, the SSD results are better than the CSD results, while on the other utterances the SSD performance is fairly poor, with DERs even greater than 50%. In contrast, the CSD system is relatively stable across all utterances. Therefore, both systems have advantages and disadvantages: the SSD system can handle overlap regions but its stability is poor, while the CSD system is quite stable but cannot process overlapping speech well. The motivation of our method is to exploit the complementarity of these two systems. Moreover, from Fig. 2 we can see that poor separation performance will lead to degraded SSD performance. However, when dealing with realistic mismatched datasets, speech separation performance is unstable, and it cannot be directly measured due to the lack of clean reference speech. Therefore, we need strategies that can indirectly measure the separation performance to better select between the results of the CSD and SSD systems. Inspired by these facts, the proposed SGSD procedure is illustrated in Algorithm 1. First, we obtain the diarization results of the SSD and CSD systems for all utterances.
For some utterances, the separation part of the SSD system does not output correct single-speaker speech. Next, we use the SSD results to measure the performance of speech separation. Then, we select the utterances that are judged to have poor speech separation results. Finally, we adopt the CSD results for the selected utterances and the SSD results for the unselected utterances. In SGSD, we use the diarization results to indirectly measure the separation performance of the SSD system because poor separation results tend to lead to unreasonable diarization results. By analyzing failure cases in the separation part of the SSD system (described in detail in Section 3), we propose several strategies to measure the separation performance in Step 2.
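The selection step can be sketched as follows (illustrative only; `separation_is_poor` stands for any of the strategies of Step 2, and all names are hypothetical):

```python
def sgsd_select(utterances, csd_results, ssd_results, separation_is_poor):
    """Per utterance, keep the SSD result unless separation is judged poor,
    in which case fall back to the stable CSD result."""
    final = {}
    for utt in utterances:
        if separation_is_poor(ssd_results[utt], csd_results[utt]):
            final[utt] = csd_results[utt]  # SSD judged unreliable here
        else:
            final[utt] = ssd_results[utt]  # SSD handles overlap regions
    return final
```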
To better illustrate the strategies, we assume that $N$ denotes the number of speakers, $R_i$ denotes the speech regions of speaker $i$, and $T(R_i)$ denotes the total duration of the regions $R_i$. Specifically, $N = 2$ in the CTS dataset. We propose three strategies, which will be introduced one by one.
Strategy 1: Check whether the duration of the speakers' speech is unbalanced in the SSD results:

$$\frac{\min_i T(R_i)}{\max_i T(R_i)} \geq \alpha \tag{3}$$

where $\alpha$ is an adjustable threshold. If InEq. (3) is unsatisfied, the durations of the two speakers are unbalanced, and the speech separation performance is judged as poor.
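Strategy 1 amounts to a simple balance check on the per-stream speech durations produced by the VAD (a hypothetical helper; the 0.4 default mirrors the 40% threshold used in the experimental setup):

```python
def separation_poor_by_duration(durations, alpha=0.4):
    """Strategy 1: flag poor separation when speaker durations are unbalanced.

    durations: total detected speech duration (seconds) per separated stream.
    Returns True (poor) when the min/max duration ratio falls below alpha.
    """
    return min(durations) / max(durations) < alpha
```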
Strategy 2: Check whether there is an abnormal overlap ratio in the SSD results:

$$\frac{T(R_1 \cap R_2)}{T(R_1 \cup R_2)} \leq \beta \tag{4}$$

The left side of InEq. (4) is the overlap ratio of an utterance [OLR]. If InEq. (4) is unsatisfied when $\beta$ is set to an appropriate value, the overlap ratio in the SSD results is too large, and the separation results are judged as incorrect.
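Strategy 2 can be sketched as follows (a hypothetical helper; it uses one common definition of the overlap ratio, overlapped speech time over total speech time, and assumes the VAD segments within each stream do not overlap each other):

```python
def overlap_ratio(segs_a, segs_b):
    """Overlap ratio: time where both speakers are active over total speech time.

    segs_a, segs_b: lists of (start, end) tuples from the two separated streams.
    """
    overlap = sum(max(0.0, min(ea, eb) - max(sa, sb))
                  for sa, ea in segs_a for sb, eb in segs_b)
    total = sum(e - s for s, e in segs_a) + sum(e - s for s, e in segs_b) - overlap
    return overlap / total if total > 0 else 0.0

def separation_poor_by_overlap(segs_a, segs_b, beta=0.2):
    """Strategy 2: flag poor separation when the overlap ratio exceeds beta."""
    return overlap_ratio(segs_a, segs_b) > beta
```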
Strategy 3: Calculate the deviation degree of the SSD results relative to the CSD results, namely:

$$\frac{\sum_{s=1}^{S} \mathrm{dur}(s) \big( \max(N_{\mathrm{CSD}}(s), N_{\mathrm{SSD}}(s)) - N_{\mathrm{correct}}(s) \big)}{\sum_{s=1}^{S} \mathrm{dur}(s) \, N_{\mathrm{CSD}}(s)} \leq \gamma \tag{5}$$

where $S$ is the number of speech segments evaluated, $\mathrm{dur}(s)$ is the duration of segment $s$, $N_{\mathrm{CSD}}(s)$ and $N_{\mathrm{SSD}}(s)$ denote the number of speakers in speech segment $s$ of the CSD and SSD results respectively, and $N_{\mathrm{correct}}(s)$ is the number of speakers in segment $s$ that are correctly matched between the CSD and SSD results. This is actually the calculation of DER [DER], but with the ground truth replaced by the CSD results. If InEq. (5) is unsatisfied, the SSD results deviate greatly from the stable CSD results, and the corresponding separation performance is judged as poor.
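Strategy 3 can be sketched over frame-level labels (illustrative only; it assumes both outputs have been converted to per-frame sets of active speakers and that an optimal speaker mapping between the two systems has already been applied, and the 0.26 default mirrors the 26% threshold used in the experimental setup):

```python
def relative_der(csd_frames, ssd_frames):
    """Strategy 3: DER-style deviation of SSD from CSD, with CSD as 'reference'.

    csd_frames, ssd_frames: equal-length lists; each element is the set of
    speakers active in that frame according to the corresponding system.
    """
    error = 0
    ref_total = 0
    for ref, hyp in zip(csd_frames, ssd_frames):
        ref_total += len(ref)
        # DER-style numerator: max speaker count minus correctly matched speakers.
        error += max(len(ref), len(hyp)) - len(ref & hyp)
    return error / ref_total if ref_total else 0.0

def separation_poor_by_deviation(csd_frames, ssd_frames, gamma=0.26):
    """Flag poor separation when SSD deviates too much from the stable CSD output."""
    return relative_der(csd_frames, ssd_frames) > gamma
```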
Among them, SGSD with the first strategy (hereinafter SGSD1) and with the second strategy (hereinafter SGSD2) measure the separation performance based on the diarization results of the SSD system itself, while SGSD with the third strategy (hereinafter SGSD3) uses the clustering-based results as a benchmark to measure the separation performance of the SSD system. Through the above three strategies, we can detect the utterances with poor separation performance in the SSD system.
3.1 Experimental conditions
The training set of the separation model in the SSD system was simulated from the Librispeech [librispeech] dataset. We randomly selected two utterances from different speakers in Librispeech and mixed them to obtain a simulated training utterance. In this paper, we simulated about 250 hours of training data. To verify the effectiveness of the proposed method, we conducted experiments on the realistic conversational telephone speech (CTS) data from the development and evaluation sets of the DIHARD-III Challenge [dihard3]. Both sets contain 61 utterances, with each utterance consisting of a 10-minute conversation between two native English speakers. The overlap ratio of the development set and evaluation set is 11.9% and 10.5% respectively, which is quite large for the two-speaker domain. We compared the proposed method with the VBx system [but] and followed the configuration of the DIHARD-II recipe published by the BUT speech team (https://github.com/BUTSpeechFIT/VBx/tree/v1.0). In the separation part of the SSD system, we used the asteroid toolkit [asteroid] to train a Conv-TasNet as the separation model. We trained the model for 75 epochs on 3-second segments. The learning rate was set to 0.001, and the batch size was set to 6. Adam [adam] was adopted as the optimizer. Moreover, WebRTC VAD (https://github.com/wiseman/py-webrtcvad) with a 30 ms hop length was employed in the detection part. In SGSD1, we set the minimum ratio of the durations of the two speakers ($\alpha$) to 40%. In SGSD2, we set the highest overlap ratio ($\beta$) in the SSD results to 20%. In SGSD3, we set the maximum relative deviation ($\gamma$) to 26%. These thresholds were determined on the development set. The diarization error rate (DER) [DER] was used as the evaluation metric in our experiments. It comprises speaker error, false alarm and missed speech. We included all errors in the calculation of DER, and we did not use any forgiveness collar during evaluation.
3.2 Analysis of SSD system under mismatched conditions
Although we used the powerful Conv-TasNet based separation model in the SSD system, the separation performance was still unstable due to the mismatch between the simulated read-speech training set and the realistic conversational test set. To design suitable selection strategies, we analyzed the different cases of separation results in the SSD system. Fig. 3 presents the spectrograms and diarization labels of 10 s speech segments from three selected utterances belonging to different cases. Fig. 3 (a) shows a successful case where the separation result has a clear correspondence with the diarization label; in this case, the DER is quite small (DER = 4.6%). Fig. 3 (b) shows a failure case where the speech segments of different speakers are assigned to the same stream, which leads to a large speaker error (SpkErr = 36.1%). Fig. 3 (c) shows a failure case where the speech segments of one speaker are assigned to both streams, which leads to a large false alarm error (FA = 78.1%). From these cases we can see that, in the SSD system, successful separation yields good diarization results, while failed speech separation leads to a large speaker error (failure case 1) or false alarm error (failure case 2). In addition, we found a strong relationship between the speaker gender combination and the diarization performance of the SSD system, as shown at the bottom of Fig. 1. For utterances with a mixed gender combination (Male-Female), the SSD results are often better than the CSD results. Conversely, for utterances with a same-gender combination (Female-Female or Male-Male), the performance of the SSD system is often very poor. As is well known, same-gender mixtures are more difficult to separate than mixed-gender ones in speech separation [gender]. We can observe the consistency between SSD and SS performance, which also verifies that poor separation results cause the degradation of SSD system performance, as mentioned in Section 2.3.
Moreover, from Fig. 3 we can see that if the separation result belongs to the first failure case, the durations of the two speakers will be very unbalanced (e.g., the ratio of the two speakers' durations is 6.3 in the SSD result for the segment shown in Fig. 3 (b)). This corresponds to Strategy 1 (SGSD1) in Section 2.3. If the separation result belongs to the second failure case, the overlap ratio of the SSD result will be too high (e.g., the overlap ratio is 84.2% in the SSD result for the segment shown in Fig. 3 (c)), which corresponds to Strategy 2 (SGSD2) in Section 2.3.
3.3 Overall comparison
Table 1 compares the performance of detecting the utterances with poor separation performance (i.e., those where the SSD results are worse than the CSD results) among the different SGSD systems on the CTS development and evaluation sets of the DIHARD-III Challenge. "SGSD12" means combining the detection results of SGSD1 and SGSD2, and "SGSD123" means voting over SGSD1, SGSD2 and SGSD3. Several observations can be made from this table. First, comparing the different SGSD systems, SGSD3 achieves the best performance on both the development and evaluation sets, which indicates that the DER between SSD and CSD is a good and robust indicator for measuring speech separation performance. Second, combining the results of SGSD1 and SGSD2 significantly improves the detection performance, which illustrates their complementarity. Third, compared with SGSD3, voting over SGSD1, SGSD2 and SGSD3 leads to worse detection performance due to the detection errors of SGSD1 and SGSD2. However, even with our best system, SGSD3, some utterances with relatively poor speech separation performance are not detected because the differences between the CSD and SSD results on these utterances are small. Generally speaking, these SSD results are not too poor, given their small differences from the stable CSD results. Table 2 compares the DERs of CSD, SGSD3 and Oracle (perfect selection between SSD and CSD using the ground truth). It can be observed that the speaker errors of the CSD results are quite small on both the development and evaluation sets (4.2% and 3.7% respectively), which means the CSD system can achieve good performance on the CTS dataset. However, due to its inability to handle overlapping speech, most of the DER comes from the miss error. Moreover, SGSD3 helps the CSD system handle overlapping speech, as can be seen from the smaller miss error of the SGSD3 results compared with the CSD results.
It is worth noting that SGSD3 achieves quite good results, very close to the oracle results.
Set | Method | MISS (%) | FA (%) | SpkErr (%) | DER (%)
We propose an SGSD approach that enables separation-based processing to assist clustering-based systems in handling overlapping speech regions. To reduce the impact of the instability of separation performance, we design several strategies to select between the results of the CSD and SSD systems. Experiments on the CTS data show that the proposed SGSD can help improve conventional clustering-based systems. In the future, we will explore SGSD approaches under more challenging multi-speaker (more than two speakers) and noisy conditions.