Separation Guided Speaker Diarization in Realistic Mismatched Conditions

07/06/2021 ∙ by Shu-Tong Niu, et al. ∙ Georgia Institute of Technology ∙ USTC

We propose a separation guided speaker diarization (SGSD) approach that fully exploits the complementarity of speech separation and speaker clustering. Since the conventional clustering-based speaker diarization (CSD) approach cannot handle overlapping speech segments well, we investigate separation-based speaker diarization (SSD), which inherently has the potential to handle speaker overlap regions. Our preliminary analysis shows that the state-of-the-art Conv-TasNet based speech separation, which works quite well on simulated data, is unstable on realistic conversational speech because of the large mismatch in speaking style between the simulated read-speech training data and realistic conversations. We therefore let separation-based processing assist CSD in handling overlapping speech segments under realistic mismatched conditions. Specifically, several strategies are designed to select between the results of the SSD and CSD systems based on an analysis of the instability of the SSD system. Experiments on the conversational telephone speech (CTS) data from the DIHARD-III Challenge show that the proposed SGSD system can significantly improve the performance of state-of-the-art CSD systems, yielding relative diarization error rate reductions of 20.2% and 20.8% on the development set and evaluation set, respectively.




1 Introduction

Speaker diarization is the task of partitioning speech into segments and labeling them with classes corresponding to speaker identities, namely determining “who spoke when” [2021review]. It is an important front-end for many applications, such as meeting summarization, telephone conversation analysis and speaker-based indexing [diarization_application1, diarization_application3]. Accordingly, diarization performance is evaluated in many different domains, such as broadcast news recordings [broadcast], conversational telephone speech (CTS) [CTS] and meeting conversations [meeting].

Conventional clustering-based methods are widely used in speaker diarization [CD1, CD2]. The core of this technique is extracting and clustering speaker representations, which mainly include the i-vector and some neural network based embeddings, such as the d-vector [dvector] and the x-vector [xvector]. Among these algorithms, the variational Bayesian hidden Markov model with x-vectors (VBx) [BUT1] has achieved superior performance and ranked first in the DIHARD-II Challenge [dihard2]. However, diarization based on speaker clustering cannot handle overlapping speech well because a speech segment can only be assigned to one specific cluster. To solve this problem, end-to-end neural speaker diarization (EEND) [EEND, EENDEDA, EENDSC] approaches were proposed, in which the diarization results are directly predicted by neural networks, allowing the system to be optimized by minimizing diarization errors. Target-speaker voice activity detection (TS-VAD) [TSVAD] was then proposed to predict the activity of each speaker at each time frame using speech features along with speaker embeddings. All these techniques are capable of dealing with overlapping speech based on classification networks.

Speech separation (SS) is the task of separating target speech from background interference [sep1]. Most speech separation approaches have been formulated in the time-frequency (T-F) domain. Such models include deep neural networks (DNNs) [sepdnn], recurrent neural networks (RNNs) [seprnn, seprnn1], and generative adversarial networks (GANs) [sepgan]. Recently, time-domain networks, such as the fully-convolutional time-domain audio separation network (Conv-TasNet) [Tasnet] and the dual-path RNN [Dualpathrnn], have shown good results in speech separation. However, most of the above algorithms were evaluated on simulated data. For data sets with realistic conditions, the separation performance cannot be directly measured due to the lack of clean speech. This limits the application of separation to diarization tasks based on realistic data sets.

In separation-based speaker diarization (SSD), we separate the utterances with a trained Conv-TasNet and obtain diarization results by detecting the speech segments in the separated speech. However, when dealing with realistic mismatched data, the separation performance is often unstable, which leads to worse diarization performance than that obtained with clustering-based speaker diarization (CSD) systems. Therefore, we analyze different cases in SSD systems and propose several strategies, including speech duration checking, overlap ratio checking and relative diarization error rate (DER) calculation. We call the resulting method the separation guided speaker diarization (SGSD) approach, which enables separation to assist CSD in handling overlap regions. Through these strategies, SGSD can indirectly measure the separation performance on realistic utterances and select between the results of SSD and CSD.

Experiments on the CTS dataset show that the proposed SGSD system can help CSD achieve good performance on overlap regions. Similar works exist in [sepmeeting1, sepmeeting2]. However, our proposed SGSD framework offers a few major differences: (1) unlike the works in [sepmeeting1, sepmeeting2], which use a BLSTM based separation model, we adopt the more powerful Conv-TasNet separation model. It avoids the assumption that the speaker masks are additive and sum to one for each time-frequency bin, which is not directly applicable to diarization [EENDSC]; (2) we evaluate our methods on a realistic mismatched single-channel dataset whose speaking styles differ from those of our training set, which is more challenging than handling the simulated single-channel data in [sepmeeting1] or the multi-channel dataset with speaking styles similar to the training set in [sepmeeting2]; and (3) due to this more challenging situation, we cannot directly use speech separation to obtain the diarization results. Therefore, in contrast to the multi-task perspective in [sepmeeting1, sepmeeting2], we emphasize enabling speech separation to assist CSD in the proposed SGSD system.

Figure 1: Performance (DER) comparison between CSD system and SSD system on CTS domain of DIHARD-III development set.

2 The Proposed SGSD Framework

Fig. 2 shows a flowchart of the proposed SGSD system. As can be seen, it contains two single diarization systems: clustering-based speaker diarization (CSD) and separation-based speaker diarization (SSD). We design several strategies to better select between the results of these two systems, which enables the SSD to assist CSD in dealing with overlap regions. Details of the proposed system are described in the following subsections.

Figure 2: Flowchart of separation guided speaker diarization

2.1 Clustering-based speaker diarization (CSD)

As shown at the bottom of Fig. 2, we use VBx [BUT1] as our clustering-based speaker diarization (CSD) system. Apart from the conventional processes, which include voice activity detection (VAD), speaker feature extraction and speaker clustering, VBx also employs a VB-HMM to refine the assignments of x-vectors to speaker clusters.

2.2 Separation-based speaker diarization (SSD)

The top of Fig. 2 illustrates the framework of the separation-based speaker diarization (SSD) system. As can be seen in the figure, the SSD system contains just two parts: separation and detection. In the separation part, the original utterance is separated into two streams by a Conv-TasNet [Tasnet] based separation model. Ideally, overlapping speech segments are automatically separated, and the single-speaker speech segments are assigned to the streams corresponding to speaker identities. In this part, the utterance-level permutation invariant training (uPIT) based learning objective is used to optimize the model parameters:

$$ J = \frac{1}{N} \sum_{i=1}^{N} \ell\left(\hat{\mathbf{x}}_i, \mathbf{x}_{\phi^*(i)}\right) \tag{1} $$

where $N$ is the number of speakers, which is 2 in this study, $\ell$ is the error between the network output and the target, $\hat{\mathbf{x}}_i$ denotes the $i$-th predicted speech, and $\mathbf{x}_{\phi^*(i)}$ denotes the reference speech under the permutation $\phi^*$ that minimizes the training objective $J$. In training, $\ell$ is calculated by the scale-invariant source-to-noise ratio (Si-SNR), which is an evaluation metric for source separation:

$$ \text{Si-SNR} = 10 \log_{10} \frac{\left\| \mathbf{s}_{\text{target}} \right\|^2}{\left\| \mathbf{e}_{\text{noise}} \right\|^2} \tag{2} $$

where $\mathbf{s}_{\text{target}} = \frac{\langle \hat{\mathbf{s}}, \mathbf{s} \rangle \mathbf{s}}{\left\| \mathbf{s} \right\|^2}$ and $\mathbf{e}_{\text{noise}} = \hat{\mathbf{s}} - \mathbf{s}_{\text{target}}$. $\hat{\mathbf{s}}$ and $\mathbf{s}$ are the estimates and targets respectively. In the detection part, the two separated speech streams are sent to the VAD model to obtain the time labels of the speech segments. The detection results of both streams are then combined along the time axis to obtain the speaker diarization results.
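As a rough illustration of the training objective described above, the sketch below computes Si-SNR for a pair of signals and searches over speaker permutations for the lowest loss. It assumes NumPy, zero-mean normalization of the signals, and the negative Si-SNR as the per-stream error; the names `si_snr` and `upit_loss` are hypothetical and are not taken from any toolkit or from the authors' code.

```python
import itertools

import numpy as np


def si_snr(estimate, target, eps=1e-8):
    """Scale-invariant source-to-noise ratio (dB) between two 1-D signals."""
    # Assumption: remove the mean first, as is common for this metric.
    estimate = estimate - estimate.mean()
    target = target - target.mean()
    # Project the estimate onto the target to get the scaled target component.
    s_target = np.dot(estimate, target) * target / (np.dot(target, target) + eps)
    e_noise = estimate - s_target
    ratio = (np.dot(s_target, s_target) + eps) / (np.dot(e_noise, e_noise) + eps)
    return 10.0 * np.log10(ratio)


def upit_loss(estimates, targets):
    """Negative mean Si-SNR under the best utterance-level permutation."""
    n_spk = len(targets)
    best_loss, best_perm = None, None
    for perm in itertools.permutations(range(n_spk)):
        # One permutation is applied to the whole utterance, not per frame.
        loss = -np.mean([si_snr(estimates[i], targets[perm[i]]) for i in range(n_spk)])
        if best_loss is None or loss < best_loss:
            best_loss, best_perm = loss, perm
    return best_loss, best_perm


# Toy check: the estimates are rescaled, slightly noisy copies in swapped order.
rng = np.random.default_rng(0)
a = rng.standard_normal(8000)
b = rng.standard_normal(8000)
estimates = [0.5 * b + 0.01 * rng.standard_normal(8000),
             2.0 * a + 0.01 * rng.standard_normal(8000)]
loss, perm = upit_loss(estimates, [a, b])
```

In the toy check, the permutation search recovers the swapped order even though the estimates are rescaled, which is the property that makes the objective independent of output-stream ordering and gain.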

2.3 Separation guided speaker diarization (SGSD)

Fig. 1 presents the performance (DER) comparison between the CSD and SSD systems on each utterance from the CTS domain of the DIHARD-III development set. It can be observed that the performance of the SSD system is unstable. On about half of the utterances, the SSD results are better than the CSD results, while on the other utterances the SSD performance is fairly poor, with DERs even greater than 50%. In contrast, the CSD system is relatively stable across all utterances. Both systems therefore have advantages and disadvantages: the SSD system can handle overlap regions but its stability is poor, while the CSD system is stable but cannot process overlapping speech well. The motivation of our method is to exploit the complementarity of these two systems. Moreover, from Fig. 2 we can see that poor separation performance will lead to the degradation of SSD performance. However, on realistic mismatched datasets, the speech separation performance is unstable and cannot be directly measured due to the lack of clean speech. Therefore, we need strategies that indirectly measure the separation performance to better select between the results of the CSD and SSD systems. Inspired by these facts, the proposed SGSD procedure is illustrated in Algorithm 1. First, we get the diarization results of the SSD and CSD systems for all utterances.

Step1: Results generation
  Obtain the SSD and CSD results for all utterances;
Step2: Performance measuring
  Use the SSD results to measure the SS performance;
Step3: Utterances capture
  Capture the utterances with poor SS performance;
Step4: Results selection
  Use the CSD results for the selected utterances;
  Use the SSD results for the other utterances.

Algorithm 1 SGSD Procedure

For some utterances, the separation part of the SSD system does not output the correct single-speaker speech. Next, we use the SSD results to measure the performance of speech separation. Then, we select the utterances which are judged to have poor speech separation results. Finally, we adopt the CSD results for the selected utterances and the SSD results for the unselected utterances. In SGSD, we use the diarization results to indirectly measure the separation performance in the SSD system because poor separation results tend to lead to unreasonable diarization results. By analyzing some failure cases in the separation part of the SSD system (described in detail in Section 3), we propose several strategies to measure the separation performance in Step 2.

To better illustrate the strategies, we assume that $N$ denotes the number of speakers, $R_i$ denotes the regions of speaker $i$ and $D(R)$ denotes the duration of the region $R$. Specifically, $N = 2$ in the CTS dataset. We propose three strategies, which are introduced one by one.

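Strategy 1: Check if the duration of speakers’ speech is unbalanced in SSD results:

$$ \frac{\min\left(D(R_1), D(R_2)\right)}{\max\left(D(R_1), D(R_2)\right)} \ge \theta_1 \tag{3} $$

where $D(R_i)$ is the speech duration of speaker $i$ and $\theta_1$ is the threshold, which can be adjusted. If InEq. (3) is unsatisfied, it means that the duration of the two speakers is unbalanced, and the performance of speech separation will be judged as poor.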
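A minimal sketch of the duration-balance check in Strategy 1 follows. It assumes that each speaker's SSD output is available as a list of non-overlapping `(start, end)` intervals in seconds and that the check compares the ratio of the shorter to the longer total duration against the threshold (40% is the value reported in Section 3.1). The function names are hypothetical.

```python
def total_duration(regions):
    """Total duration (seconds) of non-overlapping (start, end) intervals."""
    return sum(end - start for start, end in regions)


def is_balanced(regions_a, regions_b, min_ratio=0.40):
    """Return True when the two speakers' speech durations are balanced."""
    dur_a = total_duration(regions_a)
    dur_b = total_duration(regions_b)
    longer = max(dur_a, dur_b)
    if longer == 0.0:
        # No speech detected at all: treat the separation as unreliable.
        return False
    return min(dur_a, dur_b) / longer >= min_ratio


# Toy example: speaker A talks for 20 s, speaker B for only 2 s.
speaker_a = [(0.0, 10.0), (20.0, 30.0)]
speaker_b = [(10.0, 12.0)]
balanced = is_balanced(speaker_a, speaker_b)
```

Under this reading, an utterance for which `is_balanced` returns `False` would be flagged as having poor separation and handed to the CSD system.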

Strategy 2: Check if there is an abnormal overlap ratio in SSD results:

$$ \frac{D\left(R_1 \cap R_2\right)}{D\left(R_1 \cup R_2\right)} \le \theta_2 \tag{4} $$

The left side of InEq. (4) is the overlap ratio of an utterance [OLR]. If InEq. (4) is unsatisfied when $\theta_2$ is set to an appropriate value, it implies that the overlap ratio in the SSD results is too large, and the separation results will be judged as incorrect.
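The overlap-ratio check in Strategy 2 can be sketched as below. The sketch assumes the overlap ratio is the duration in which both speakers are active divided by the duration in which at least one speaker is active, that each speaker's regions are non-overlapping `(start, end)` intervals in seconds, and that 20% (the value reported in Section 3.1) is the upper limit. All function names are hypothetical.

```python
def overlap_duration(regions_a, regions_b):
    """Duration (seconds) during which both interval lists are active."""
    a, b = sorted(regions_a), sorted(regions_b)
    i = j = 0
    total = 0.0
    while i < len(a) and j < len(b):
        start = max(a[i][0], b[j][0])
        end = min(a[i][1], b[j][1])
        if end > start:
            total += end - start
        # Advance whichever interval finishes first.
        if a[i][1] <= b[j][1]:
            i += 1
        else:
            j += 1
    return total


def overlap_ratio(regions_a, regions_b):
    """Overlapped speech duration divided by total speech duration."""
    dur_a = sum(end - start for start, end in regions_a)
    dur_b = sum(end - start for start, end in regions_b)
    both = overlap_duration(regions_a, regions_b)
    speech = dur_a + dur_b - both  # duration of the union of both speakers
    return both / speech if speech > 0 else 0.0


def overlap_is_plausible(regions_a, regions_b, max_ratio=0.20):
    """Return True when the overlap ratio does not exceed the threshold."""
    return overlap_ratio(regions_a, regions_b) <= max_ratio


# Toy example: 2 s of overlap within 12 s of speech.
ratio = overlap_ratio([(0.0, 10.0)], [(8.0, 12.0)])
```

A very high ratio, as in the second failure case discussed in Section 3.2, suggests that one speaker's speech has leaked into both output streams.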

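Strategy 3: Calculate the deviation degree of the SSD results relative to the CSD results, namely:

$$ \frac{\sum_{s=1}^{S} D(s) \left( \max\left(N_{\text{CSD}}(s), N_{\text{SSD}}(s)\right) - N_{\text{correct}}(s) \right)}{\sum_{s=1}^{S} D(s)\, N_{\text{CSD}}(s)} \le \theta_3 \tag{5} $$

where $S$ is the number of speaker segments in which both the CSD results and the SSD results contain the same speaker (or speakers), $D(s)$ is the duration of segment $s$, and $\theta_3$ is the threshold. $N_{\text{CSD}}(s)$ and $N_{\text{SSD}}(s)$ denote the number of speakers in speech segment $s$ of the CSD and SSD results respectively. $N_{\text{correct}}(s)$ is the number of speakers in speech segment $s$ that are correctly matched between the CSD and SSD results. This is actually the calculation of DER [DER], but with the ground truth replaced by the CSD results. If InEq. (5) is unsatisfied, it means the SSD results deviate greatly from the stable CSD results and the corresponding separation performance is poor.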
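The relative-DER idea in Strategy 3 can be sketched at frame level as follows. This is an illustrative simplification, not the authors' implementation: it assumes equal-length frames, represents each system's output as one 0/1 activity track per speaker, and resolves the speaker correspondence between SSD streams and CSD clusters by trying every permutation and keeping the lowest error. The function name is hypothetical.

```python
from itertools import permutations


def relative_der(reference, hypothesis):
    """DER-style deviation of `hypothesis` from `reference`.

    Both arguments are lists with one 0/1 frame-activity list per speaker.
    Here the CSD output plays the role of the reference.
    """
    n_spk = len(reference)
    n_frames = len(reference[0])
    ref_total = sum(sum(track) for track in reference)
    if ref_total == 0:
        raise ValueError("reference contains no speech")
    best = None
    for perm in permutations(range(n_spk)):
        errors = 0
        for t in range(n_frames):
            n_ref = sum(reference[k][t] for k in range(n_spk))
            n_hyp = sum(hypothesis[k][t] for k in range(n_spk))
            # Speakers active in both systems under this speaker mapping.
            n_correct = sum(reference[k][t] * hypothesis[perm[k]][t]
                            for k in range(n_spk))
            errors += max(n_ref, n_hyp) - n_correct
        rate = errors / ref_total
        if best is None or rate < best:
            best = rate
    return best


# Toy example with six frames and two speakers.
csd = [[1, 1, 1, 0, 0, 0],
       [0, 0, 0, 1, 1, 1]]
ssd_good = [[0, 0, 0, 1, 1, 1],   # same content, streams in swapped order
            [1, 1, 1, 1, 0, 0]]   # plus one frame of detected overlap
ssd_bad = [[1, 1, 1, 1, 1, 1],    # both streams active everywhere
           [1, 1, 1, 1, 1, 1]]
deviation_good = relative_der(csd, ssd_good)
deviation_bad = relative_der(csd, ssd_bad)
```

With the 26% threshold reported in Section 3.1, `ssd_good` would be kept and `ssd_bad` would be replaced by the CSD result. Note that a small non-zero deviation is expected whenever SSD detects overlap that CSD cannot represent.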

Among them, SGSD with the first strategy (hereinafter referred to as SGSD1) and the second strategy (hereinafter referred to as SGSD2) measure the separation performance based on the diarization results of the SSD system itself, and SGSD with the third strategy (hereinafter referred to as SGSD3) uses the clustering-based method as a benchmark to measure the separation performance of SSD system. Through the above three strategies, we can detect the utterances with poor separation performance in SSD system.
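Putting the pieces together, the selection logic of Algorithm 1 might look like the sketch below. The per-utterance result objects, the `overlap_ratio` field and the `separation_is_poor` callback are hypothetical placeholders; any of the three strategies could be plugged in as the callback.

```python
def sgsd_select(ssd_results, csd_results, separation_is_poor):
    """Choose the SSD or CSD result for each utterance.

    ssd_results / csd_results: dicts mapping utterance id to a diarization
    result (any object). separation_is_poor: callable taking (ssd, csd) and
    returning True when the separation is judged to have failed.
    """
    final = {}
    for utt_id, ssd in ssd_results.items():
        csd = csd_results[utt_id]
        if separation_is_poor(ssd, csd):
            # Steps 3 and 4: fall back to the stable clustering-based result.
            final[utt_id] = ("CSD", csd)
        else:
            final[utt_id] = ("SSD", ssd)
    return final


# Toy example using a precomputed overlap ratio as the only evidence.
ssd = {"utt_a": {"overlap_ratio": 0.12}, "utt_b": {"overlap_ratio": 0.84}}
csd = {"utt_a": {"overlap_ratio": 0.0}, "utt_b": {"overlap_ratio": 0.0}}
picked = sgsd_select(ssd, csd, lambda s, c: s["overlap_ratio"] > 0.20)
```

Keeping the detector as a callback makes it straightforward to swap between SGSD1, SGSD2 and SGSD3, or to combine them.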

3 Experiments

3.1 Experimental conditions

The training set of the separation model in the SSD system was simulated on the Librispeech [librispeech] dataset. We randomly selected two utterances from different speakers in the Librispeech dataset and mixed them to obtain a simulated training utterance. In this paper, we simulated about 250 hours of training data. To verify the effectiveness of the proposed method, we conducted experiments on the realistic conversational telephone speech (CTS) dataset from the development set and evaluation set of the DIHARD-III Challenge [dihard3]. Both sets contain 61 utterances, with each utterance consisting of a 10-minute conversation between two native English speakers. The overlap ratios of the development set and evaluation set are 11.9% and 10.5% respectively, which is quite large for the two-speaker domain. We compared the proposed method with the VBx system [but] and followed the configuration used in the DIHARD-II recipe published by the BUT speech team. In the separation part of the SSD system, we used the asteroid toolkit [asteroid] to train a Conv-TasNet as the separation model. We trained the model for 75 epochs on 3-second long segments. The learning rate was set to 0.001, and the batch size was set to 6. Adam [adam] was adopted as the optimizer. Moreover, WebRTC VAD with a 30 ms hop length was employed in the detection part. In SGSD1, we set the minimum ratio of the duration of two speakers ($\theta_1$) to 40%. In SGSD2, we set the highest overlap ratio ($\theta_2$) in the SSD results to 20%. In SGSD3, we set the maximum relative deviation ($\theta_3$) to 26%. These thresholds were determined based on the development set. Diarization error rate (DER) [DER] was used as the evaluation metric in our experiments. It comprises speaker error, false alarm and missed speech. We included all errors in the calculation of DER. In addition, we did not set any forgiveness collar during evaluation.
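To make the detection step of the SSD system concrete, the sketch below turns per-frame VAD decisions for the two separated streams into diarization labels, using the 30 ms hop mentioned above. The VAD decisions are assumed to be given as 0/1 lists; the sketch does not call the WebRTC VAD itself, and the function names and `spk0`/`spk1` labels are hypothetical.

```python
def frames_to_segments(vad_flags, hop=0.03):
    """Convert per-frame 0/1 VAD decisions into (start, end) times in seconds."""
    segments = []
    start = None
    for idx, active in enumerate(vad_flags):
        if active and start is None:
            start = idx  # a speech segment begins at this frame
        elif not active and start is not None:
            segments.append((round(start * hop, 3), round(idx * hop, 3)))
            start = None
    if start is not None:
        # Speech runs until the end of the stream.
        segments.append((round(start * hop, 3), round(len(vad_flags) * hop, 3)))
    return segments


def streams_to_diarization(stream_flags, hop=0.03):
    """Merge the segments of all separated streams along the time axis."""
    labels = []
    for spk, flags in enumerate(stream_flags):
        for start, end in frames_to_segments(flags, hop):
            labels.append((start, end, f"spk{spk}"))
    return sorted(labels)


# Toy example: five frames per stream, with overlap in the second frame.
diarization = streams_to_diarization([[1, 1, 0, 0, 1],
                                      [0, 1, 1, 0, 0]])
```

Because each stream is labeled independently, overlapping speech appears naturally as segments from different speakers that intersect in time.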

(a) Successful case (DER = 4.6%)
(b) Failure case 1 (DER = 42.5%)
(c) Failure case 2 (DER = 82.4%)
Figure 3: Spectrograms and diarization labels of speech segments from three selected utterances in the SSD system (the rectangles mark the regions which were falsely separated). Case (a): SS successfully separates a two-speaker mixed utterance. Cases (b) and (c): two SS failures.

3.2 Analysis of SSD system under mismatched conditions

Although we used the powerful Conv-TasNet based separation model in the SSD system, the separation performance was still unstable due to the mismatch between the simulated reading-style training set and the realistic conversational-style test set. In order to propose suitable selection strategies, we analyzed the different cases of separation results in the SSD system. Fig. 3 presents the spectrograms and diarization labels of 10 s speech segments from three selected utterances which belong to different cases. Fig. 3 (a) shows a successful case where the separation result has a clear relationship with the diarization label. In this case, the DER is quite small (DER = 4.6%). Fig. 3 (b) shows the failure case where the speech segments of different speakers are assigned to the same stream, which leads to a large speaker error (SpkErr = 36.1%). Fig. 3 (c) shows the failure case where the speech segments of one speaker are assigned to two streams, which leads to a large false alarm error (FA = 78.1%). From these cases we can see that, in the SSD system, successful separation yields good diarization results, while failed speech separation leads to a large speaker error (failure case 1) or false alarm error (failure case 2). In addition, we found a strong relationship between the speaker gender combination and diarization performance in the SSD system, as shown at the bottom of Fig. 1. For utterances with a different-gender combination (Male-Female), the SSD results are often better than the CSD results. Conversely, for utterances with a same-gender combination (Female-Female or Male-Male), the performance of the SSD system is often very poor. It is known that same-gender mixtures are more difficult to separate than different-gender mixtures in speech separation [gender]. We can thus observe a consistency between SSD and SS performance, which also verifies that poor separation results cause the degradation of SSD system performance, as mentioned in Section 2.3.

Moreover, from Fig. 3 we can see that if the separation result belongs to the first failure case, the durations of the two speakers will be very unbalanced (e.g., the ratio of the two speakers’ durations is 6.3% in the SSD result for the segment shown in Fig. 3 (b)). This corresponds to Strategy 1 (SGSD1) in Section 2.3. If the separation result belongs to the second failure case, the overlap ratio of the SSD result will be too high (e.g., the overlap ratio is 84.2% in the SSD result for the segment shown in Fig. 3 (c)), which corresponds to Strategy 2 (SGSD2) in Section 2.3.

3.3 Overall comparison

Table 1 compares the performance of detecting the utterances with poor separation performance (i.e., those whose SSD results are worse than the CSD results) among different SGSD systems on the CTS development and evaluation sets from the DIHARD-III Challenge. “SGSD12” means combining the detection results of SGSD1 and SGSD2. “SGSD123” means voting on SGSD1, SGSD2 and SGSD3. From this table we can make several observations. First, among the different SGSD systems, SGSD3 achieves the best performance on both the development set and the evaluation set, which indicates that the DER between SSD and CSD is a good and robust indicator for measuring the speech separation performance. Second, combining the results of SGSD1 and SGSD2 significantly improves the detection performance, which illustrates their complementarity. Third, compared with SGSD3, voting on SGSD1, SGSD2 and SGSD3 leads to worse detection performance due to the detection errors of SGSD1 and SGSD2. However, even with our best system, SGSD3, some utterances with relatively poor speech separation performance are not detected because the differences between the CSD results and the SSD results on these utterances are small. Generally speaking, these SSD results are not too poor, precisely because they differ little from the stable CSD results.

Table 2 compares the DERs of CSD, SGSD3 and Oracle (perfect detection / selection between SSD and CSD with ground truth). It can be observed that the speaker errors of the CSD results are quite small on both the development set and the evaluation set (4.2% and 3.7% respectively), which means the CSD system achieves good performance on the CTS dataset. However, because it cannot handle overlapping speech, most of its DER comes from the miss error. Moreover, SGSD3 helps the CSD system handle overlapping speech, as can be seen from the smaller miss error of the SGSD3 results compared with the CSD results. It is worth noting that SGSD3 achieves results that are very close to the oracle results.

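| Method  | Dev Recall | Dev Precision | Dev Acc | Eval Recall | Eval Precision | Eval Acc |
|---------|------------|---------------|---------|-------------|----------------|----------|
| SGSD1   | 0.35       | 1.0           | 0.69    | 0.43        | 0.92           | 0.72     |
| SGSD2   | 0.55       | 1.0           | 0.79    | 0.36        | 1.0            | 0.71     |
| SGSD3   | 0.90       | 0.93          | 0.92    | 0.86        | 0.92           | 0.90     |
| SGSD12  | 0.79       | 1.0           | 0.90    | 0.71        | 0.95           | 0.85     |
| SGSD123 | 0.79       | 1.0           | 0.90    | 0.71        | 1.0            | 0.87     |

Table 1: Detection comparison among different SGSD systems on the CTS domain of development set and evaluation set from DIHARD-III Challenge.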
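The detection metrics in Table 1 can be illustrated with the sketch below, where the positive class is "this utterance has poor separation". The text does not spell out how SGSD12 and SGSD123 are formed, so the sketch assumes a logical OR for "combining" and a majority vote for "voting"; both choices, the function names and the toy flags are assumptions made for illustration only.

```python
def detection_metrics(predicted, actual):
    """Recall, precision and accuracy for boolean per-utterance flags."""
    tp = sum(p and a for p, a in zip(predicted, actual))
    fp = sum(p and not a for p, a in zip(predicted, actual))
    fn = sum(not p and a for p, a in zip(predicted, actual))
    tn = sum(not p and not a for p, a in zip(predicted, actual))
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    accuracy = (tp + tn) / len(actual)
    return recall, precision, accuracy


def combine_any(*detectors):
    """Assumed reading of 'combining': flag if any detector flags."""
    return [any(flags) for flags in zip(*detectors)]


def majority_vote(*detectors):
    """Assumed reading of 'voting': flag if more than half of them flag."""
    return [sum(flags) > len(flags) / 2 for flags in zip(*detectors)]


# Toy flags for six utterances (True = separation judged poor).
actual = [True, True, True, False, False, False]
d1 = [True, False, False, False, False, False]
d2 = [False, True, False, False, False, False]
d3 = [True, True, True, True, False, False]

metrics_12 = detection_metrics(combine_any(d1, d2), actual)
metrics_123 = detection_metrics(majority_vote(d1, d2, d3), actual)
metrics_3 = detection_metrics(d3, actual)
```

In this toy setting the OR of two high-precision detectors raises recall without hurting precision, while the majority vote inherits the misses of the two weaker detectors, which mirrors the qualitative pattern discussed for Table 1.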
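| Set  | Method | MISS (%) | FA (%) | SpkErr (%) | DER (%) |
|------|--------|----------|--------|------------|---------|
| Dev  | CSD    | 12.0     | 0.0    | 4.2        | 16.22   |
| Dev  | SGSD3  | 7.6      | 2.6    | 2.7        | 12.95   |
| Dev  | Oracle | 7.5      | 2.6    | 2.6        | 12.75   |
| Eval | CSD    | 10.5     | 0.0    | 3.7        | 14.20   |
| Eval | SGSD3  | 6.4      | 2.6    | 2.2        | 11.24   |
| Eval | Oracle | 6.4      | 2.4    | 2.2        | 10.94   |

Table 2: Performance comparison among CSD system, SGSD system with Strategy 3 (denoted as SGSD3) and oracle system (denoted as Oracle).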
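As a quick arithmetic check on Table 2, the relative DER reduction of SGSD3 over CSD can be computed directly from the DER column. The helper name below is hypothetical; the input values are copied from the table.

```python
def relative_reduction(baseline, improved):
    """Relative reduction in percent when going from baseline to improved."""
    return 100.0 * (baseline - improved) / baseline


# DER values (%) taken from Table 2.
dev_reduction = relative_reduction(16.22, 12.95)
eval_reduction = relative_reduction(14.20, 11.24)

print(f"Dev:  {dev_reduction:.1f}% relative DER reduction")
print(f"Eval: {eval_reduction:.1f}% relative DER reduction")
```

Both reductions come out slightly above 20%, with most of the gain attributable to the lower miss error on overlapping speech.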

4 Conclusion

We propose an SGSD approach that enables separation-based processing to assist clustering-based systems in handling overlapping speech regions. To reduce the impact of the instability of separation performance, we design several strategies to select between the results of the CSD and SSD systems. Experiments on the CTS data show that the proposed SGSD approach improves on conventional clustering-based systems. In the future, we will explore SGSD approaches under more challenging multi-speaker (more than 2 speakers) and noisy conditions.