A study of semi-supervised speaker diarization system using GAN mixture model

10/24/2019 ∙ by Monisankha Pal, et al.

We propose a new speaker diarization system based on a recently introduced unsupervised clustering technique, namely the generative adversarial network mixture model (GANMM). The proposed system uses x-vectors as the front-end representation. Spectral embedding is used for dimensionality reduction, followed by k-means initialization during GANMM pre-training. GANMM performs unsupervised speaker clustering by efficiently capturing complex data distributions. Experimental results on the AMI meeting corpus show that the proposed semi-supervised diarization system matches or exceeds the performance of competitive baselines. On an evaluation set containing fifty sessions with varying durations, the best achieved average diarization error rate (DER) is 17.11%, which is comparable to the x-vector baseline.




1 Introduction

Speaker diarization addresses the problem of determining “who spoke when” in an audio recording. It has wide-ranging applications related to speaker indexing of data [1], and forms an integral component of speech [2] and speaker recognition [3] pipelines. A generic speaker diarization system includes (a) a speech activity detection (SAD) module, which separates speech from non-speech parts, (b) speaker segmentation, where the input audio is segmented into speaker-homogeneous chunks, which enables extraction of speaker-discriminative embeddings such as speaker factors [4], i-vectors [5, 6], x-vectors [7, 8], CNN and LSTM based embeddings [9, 10] and d-vectors [11, 12] from those audio chunks, and (c) speaker clustering, which determines the number of speakers in an audio stream and labels each segment with distinct speaker labels (and possibly, identities).

Many recent works on deep neural network based embedding extraction [10, 7, 11] have advanced speaker diarization research with significant performance improvements. They have effectively replaced the previous embedding approaches based on i-vectors for diarization [13, 14]. The largely popular x-vector embeddings, which have proven to be more effective than traditional i-vectors, especially for short-duration speech, have become the de facto standard for speaker recognition [15] and diarization [7].

However, speaker clustering has mostly been based on unsupervised algorithms over the years. These algorithms include the Gaussian mixture model; agglomerative hierarchical clustering (AHC) based on similarity measures like the Bayesian information criterion [17], generalized log-likelihood ratio [7], and information bottleneck (IB) [18]; mean shift [6]; k-means [19]; spectral clustering; integer linear programming [20]; and links [21]. Most recently, a few supervised speaker clustering methods such as UIS-RNN [22] and affinity propagation [23] have also been proposed for diarization. Despite the success of the above clustering techniques, speaker diarization remains a challenging task in many real-world applications due to the wide heterogeneity and variability in audio.

The recent successes of generative adversarial networks [24] (GANs) in capturing complex data distributions by encoding rich latent structures have attracted a lot of attention. However, it is difficult to train GANs due to the mode collapse problem [25]. To address this problem, many variants of GAN such as the Wasserstein GAN (WGAN) [26], multi-generator GAN [27], and mixture GAN [28] have been proposed. More recently, the GAN mixture model (GANMM), a novel adversarial architecture with a mixture of generators and discriminators and a classifier trained in an expectation maximization (EM) fashion, was introduced [29]. This model was shown to be very effective for image and character data clustering [29]. Although the performance of speech recognition and speaker verification systems has improved dramatically due to deep learning approaches, most of the existing clustering techniques for speaker diarization do not yet take full advantage of them. Therefore, it is worthwhile to explore the potential of neural network based clustering for speaker diarization. In our present work, we adopt and further develop the GANMM framework for audio based speaker clustering.

Figure 1: Schematic diagram of the proposed speaker diarization system.

This work uses x-vector embeddings as a feature representation on short overlapping speech segments. Prior to GANMM clustering, we extract spectral embeddings from the x-vectors to reduce the dimensionality and use k-means clustering as initialization for pre-training the GAN models. Training the mixture model is performed through the $\epsilon$-EM procedure. The expectation step comprises learning a classifier to separate the clusters. In the maximization step, each mixture model is trained using the clustered data. Experiments conducted on the AMI corpus confirm the validity of our proposed system. To the best of our knowledge, this is the first attempt at using GANs for speaker clustering in an unsupervised manner within a speaker diarization framework.

2 Proposed speaker diarization system

2.1 Overview of the system

An overview of the proposed system is presented in Fig. 1. Below, we describe each of the components in detail.

2.2 Segmentation

In order to isolate potential confounds due to miss and false alarm errors in speech/non-speech detection, we focus solely on the speaker confusion part of the diarization error rate (DER). We use oracle SAD for the proposed as well as the baseline systems. After removing the non-speech parts of an audio session, x-vectors are extracted from segments of duration 1.5 sec with 1 sec overlap. While this denser segmentation might help in clustering, each segment may contain more than one speaker. Motivated by the success of spectral clustering in speaker diarization [11], we project the extracted x-vector embeddings into a lower-dimensional subspace using eigendecomposition to produce spectral embeddings.
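The uniform segmentation and the eigendecomposition-based projection described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the plain cosine affinity matrix, and the choice of keeping the top `n_dims` eigenvectors are assumptions, and any affinity refinement used in [11] is omitted.

```python
import numpy as np


def window_bounds(total_sec, win_sec=1.5, overlap_sec=1.0):
    """Start/end times of uniform windows (1.5 s long, 1 s overlap, so a 0.5 s hop)."""
    hop = win_sec - overlap_sec
    n = int(np.floor((total_sec - win_sec) / hop + 1e-9)) + 1
    n = max(n, 0)  # no window fits if the audio is shorter than one window
    starts = np.arange(n) * hop
    return np.stack([starts, starts + win_sec], axis=1)


def spectral_embedding(xvectors, n_dims):
    """Project embeddings onto the top eigenvectors of a cosine affinity matrix.

    Assumption: a plain cosine affinity is used; the original system may differ.
    """
    unit = xvectors / np.linalg.norm(xvectors, axis=1, keepdims=True)
    affinity = unit @ unit.T                     # symmetric cosine similarities
    eigvals, eigvecs = np.linalg.eigh(affinity)  # eigenvalues in ascending order
    top = np.argsort(eigvals)[::-1][:n_dims]     # indices of the largest eigenvalues
    return eigvecs[:, top]                       # one low-dimensional row per segment
```

Each row of the returned matrix is the low-dimensional representation of one segment, which would then be passed to the clustering stage.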

2.3 Speaker clustering: GANMM training

We employ GAN models in learning mixture models to capture underlying clusters of complex data. In the following sections, we describe various implementation details involved in GANMM training.

2.3.1 Mitigating early convergence

Given an audio stream, we assume the extracted embeddings $X$ to follow a probability distribution $p(X)$, with hidden variables $Z$ following a probability distribution $q(Z)$. The log-likelihood with model parameters $\theta$ can be expressed as

$$\ln p(X \mid \theta) = \mathcal{L}(q, \theta) + \mathrm{KL}(q \,\|\, p), \tag{1}$$

where the lower bound on the log-likelihood can be written as $\mathcal{L}(q, \theta) = \sum_{Z} q(Z) \ln \frac{p(X, Z \mid \theta)}{q(Z)}$ and $\mathrm{KL}(q \,\|\, p)$ is the KL-divergence between $q(Z)$ and the posterior $p(Z \mid X, \theta)$. In the conventional EM procedure, the E-step matches the hidden variable (clusters in this case) distribution with the current posterior probability. The M-step determines the parameters by maximizing the resulting lower bound. However, the key issue with GAN models is that they can fit the current guess of the hidden variables too well. Hence, the model maximizing the likelihood takes an extreme value and the whole procedure converges very early with $\mathrm{KL}(q \,\|\, p) = 0$ [29]. To mitigate this early convergence, an error term $\epsilon$ is introduced in the E-step with $\mathrm{KL}(q \,\|\, p) = \epsilon$. Convergence is guaranteed by keeping $\epsilon$ within the bound derived in [29]. In the GANMM setup, an imperfect classifier is trained at the $\epsilon$-E step with generator outputs, and the M-step ensures convergence by eliminating this error. Further details are outlined in [29].

2.3.2 $\epsilon$-EM procedure

This involves the following:
$\epsilon$-E-step: (1) Produce samples from the generators of the current GANMM model and gather them to construct a data set $\mathcal{D} = \{(x_i, y_i)\}$, where $x_i$ is the generated data point from the $y_i$-th mixture and $y_i$ is the generator index. (2) Train an inaccurate classifier $C$ using the data set $\mathcal{D}$. (3) Classify each training data point of a particular session by the classifier into one of the $K$ clusters using the maximum probability value.
M-step: Train each mixture of GAN model ($G_k$, $D_k$) with the clustered data for one generator iteration and several discriminator iterations. Here, as the cluster data is directly fed to the discriminator, the discriminator symbol $D_k$ is synonymous with the clustered data. We execute the $\epsilon$-EM steps until convergence. Fig. 2 illustrates the general architecture of GANMM.

2.3.3 GANMM model

To this end, the analogy is that each generator $G_k$ maps random noise $z$ (sampled from a prior $p_z$) to the data space and thus induces a single distribution; as a whole, the $K$ generators induce a mixture of distributions in the data space. One discriminator $D_k$ is supplied for each $G_k$, postulating that it serves one cluster and distinguishes between samples from $G_k$ and real training data. In contrast, the classifier $C$ performs multi-class classification of training instances into one of the $K$ clusters. The minimax game between these three networks in GANMM combines a WGAN discriminator cost and a WGAN generator cost for each mixture component with a standard cross-entropy loss for the classifier, in which $C_k(x)$ is the probability that $x$ is generated by $G_k$. It is to be noted that each generator is deemed to produce samples of a specific mode.
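For orientation, one plausible way to write an objective of this kind is sketched below. It is an assumption pieced together from the verbal description (WGAN critic and generator terms plus a cross-entropy classifier term), not the formulation from [29]: the symbols $X_k$ (real data currently assigned to cluster $k$), $p_z$ (noise prior), $C_k$ (the classifier's probability for cluster $k$), and the weight $\lambda$ are all hypothetical notation.

```latex
\min_{\{G_k\},\, C} \; \max_{\{D_k\}} \;
\sum_{k=1}^{K} \Big(
    \mathbb{E}_{x \sim X_k} \big[ D_k(x) \big]
  - \mathbb{E}_{z \sim p_z} \big[ D_k(G_k(z)) \big]
  - \lambda \, \mathbb{E}_{z \sim p_z} \big[ \log C_k(G_k(z)) \big]
\Big)
```

In this sketch, each critic $D_k$ maximizes the gap between its scores on real and generated data, each generator $G_k$ minimizes the negated critic score of its samples, and the last term is the cross-entropy that rewards the classifier for attributing a sample from $G_k$ to cluster $k$.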

2.3.4 Pre-training

The rationale for pre-training before $\epsilon$-EM training is that if we start $\epsilon$-EM with random parameter initialization, it may result in a bad clustering in which all training data is assigned to one cluster. For the pre-training, we prepare data by randomly shuffling the training instances and assigning them to $K$ clusters. For a better initialization, we first perform k-means clustering with the number of speakers given by the ground truth. The GAN model for each cluster is pre-trained for 500 epochs.
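The k-means initialization used to partition a session before pre-training can be sketched as below. This is a minimal NumPy version of Lloyd's algorithm under stated assumptions: the function names, the random choice of initial centres, and the fixed iteration count are illustrative and not taken from the paper.

```python
import numpy as np


def kmeans_init(points, n_clusters, n_iters=50, seed=0):
    """Plain Lloyd's k-means; returns hard labels and cluster centres."""
    rng = np.random.default_rng(seed)
    # Assumption: initial centres are distinct data points drawn at random.
    centers = points[rng.choice(len(points), size=n_clusters, replace=False)]
    for _ in range(n_iters):
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for k in range(n_clusters):
            members = points[labels == k]
            if len(members) > 0:  # keep the old centre if a cluster becomes empty
                centers[k] = members.mean(axis=0)
    return labels, centers


def split_by_cluster(points, labels, n_clusters):
    """One array of points per cluster, e.g. to pre-train one GAN per cluster."""
    return [points[labels == k] for k in range(n_clusters)]
```

The list returned by `split_by_cluster` would supply the per-cluster training data for the pre-training stage.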

Figure 2: GANMM architecture with $K$ generators, $K$ discriminators, and one classifier. Here, $k$ indexes the clusters.

2.3.5 Class imbalance problem

At the beginning of the $\epsilon$-E step, it is possible that the classifier trained on GAN-generated samples assigns an imbalanced number of instances to the clusters. This imbalance might be amplified over the iterations, with some clusters receiving fewer data points than others. To combat this issue, we first ensure that every GAN model generates the same amount of data to train the classifier. Furthermore, for each cluster (say $k$), data is augmented from the rest of the clusters (i.e., $j \neq k$) by taking the instances with the highest posterior probability for the $k$-th cluster according to the classifier. We empirically define the amount of data to be augmented and reduce it over the iterations for convergence. The full algorithm is presented in Algorithm 1.
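The augmentation step described above can be sketched as follows. The function names and the linear decay schedule are assumptions made for illustration; the text only states that the amount of augmented data is set empirically and reduced over the iterations.

```python
import numpy as np


def augment_clusters(posteriors, n_extra):
    """Per-cluster index arrays: own members plus the best-scoring outsiders.

    posteriors: (N, K) array of classifier outputs for N training points.
    """
    n_clusters = posteriors.shape[1]
    hard = posteriors.argmax(axis=1)  # maximum-probability assignment
    clusters = []
    for k in range(n_clusters):
        own = np.flatnonzero(hard == k)
        others = np.flatnonzero(hard != k)
        # rank points from other clusters by their posterior for cluster k
        ranked = others[np.argsort(-posteriors[others, k], kind="stable")]
        clusters.append(np.concatenate([own, ranked[:n_extra]]))
    return clusters


def n_extra_at(t, n0, n_iters):
    """Hypothetical schedule: decay linearly from n0 at t=0 to 0 at t=n_iters."""
    return max(0, int(round(n0 * (1.0 - t / n_iters))))
```

With `n_extra` shrinking to zero, the clusters eventually contain only the points that the classifier assigns to them.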

2.4 GANMM testing

During inference, the test audio session is uniformly segmented to extract x-vectors, followed by projection onto a lower dimension using spectral embedding. We use the previously trained classifier to predict the cluster decision for each segment, which is then converted to frame level. Finally, a median filter with kernel size 361 is applied to smooth the frame-level decisions.
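The final smoothing step can be sketched with NumPy alone. The edge padding and the function name are assumptions; only the kernel size of 361 comes from the text.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def median_smooth(frame_labels, kernel_size=361):
    """Sliding median over integer frame-level cluster labels."""
    if kernel_size % 2 == 0:
        raise ValueError("kernel_size must be odd")
    labels = np.asarray(frame_labels)
    half = kernel_size // 2
    # Assumption: repeat the first/last label so the output keeps the input length.
    padded = np.pad(labels, half, mode="edge")
    windows = sliding_window_view(padded, kernel_size)
    # With an odd window the median is always one of the labels in the window.
    return np.median(windows, axis=1).astype(labels.dtype)
```

A short burst of a different label that is much narrower than the kernel is removed, which suppresses spurious speaker changes lasting only a few frames.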

3 Experimental Evaluation

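Require: $\alpha$: learning rate, $m$: batch size, $K$: number of clusters, $c$: clipping parameter, $T$: number of $\epsilon$-EM iterations, $T_{pre}$: number of pre-training iterations, $N_t$: number of augmented data points at iteration $t$, $n_{critic}$: number of critic iterations for each generator iteration
// Pre-training
Do k-means and divide the training data into $K$ clusters
for i = 1 to $T_{pre}$ do
     for k = 1 to $K$ do
         train the $k$-th GAN ($G_k$, $D_k$) on the data of cluster $k$
     end for
end for
// $\epsilon$-EM procedure
t = 0
for i = 1 to $T$ do
     t = t + 1
     // $\epsilon$-E-step
     build the generated data set from samples of all generators
     train the classifier $C$ on the generated data set
     assign the training data to the $K$ clusters using the classifier
     // M-step
     for k = 1 to $K$ do
         add $N_t$ (reduced along the iterations) instances from the rest of the clusters with the highest posterior for cluster $k$ by $C$ to cluster $k$
         train the $k$-th GAN ($G_k$, $D_k$) on the data of cluster $k$
     end for
end for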
Algorithm 1 GAN mixture model clustering algorithm Default values: = , = 50, = 5, = 0.01

3.1 Data set

We evaluate our proposed algorithm on the popular augmented multiparty interaction (AMI) meeting corpus (http://groups.inf.ed.ac.uk/ami/download/). It consists of about 100 hours of meeting recordings in English, recorded at multiple sites (Edinburgh, Idiap, TNO, Brno). For our experiments, we randomly chose ten meetings, each of duration 10-20 min, as the development set to tune the parameters. Another fifty meetings, equally distributed among varying lengths (10-20 min, 20-30 min, 30-40 min, 40-50 min, 50-60 min), were randomly selected as the evaluation set for benchmarking the proposed system against two baseline diarization systems. Meetings from all the recording sites are present in both the development and evaluation sets. The proposed system relies solely on each meeting to perform diarization, and no separate training data is required.

3.2 Implementation details

We use the pre-trained CALLHOME x-vector model (https://kaldi-asr.org/models/m6) available in the Kaldi recipe [30] for x-vector extraction. It is to be noted that in this work, unless explicitly mentioned, we have used the ground truth to perform SAD and to determine the number of speakers in a session, and we do the same for both the baselines. The discriminator and generator networks in GANMM are feed-forward neural networks with one hidden layer that contains 64 nodes. We use the ReLU activation function in the hidden layer and linear activation functions in the input and output layers. The classifier also contains one hidden layer with 64 nodes. We use the development data for early stopping during GANMM training and for choosing the smoothing window size.

Since our proposed diarization system uses a uniform segmentation approach and x-vectors as embeddings, we have implemented two methods for diarization as our baselines: Information Bottleneck (IB) [18] and x-vector embedding with PLDA scoring and AHC clustering (x-vector) [7]. To report the performance of the different systems, we use the NIST diarization error rate (DER) [31] as the performance metric.
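Because the experiments use oracle SAD, the reported DER reduces to its speaker-confusion component. A simplified frame-level version of that component can be sketched as below. This is an illustration only, not the NIST scoring tool: it ignores collars and overlapped speech, the function name is hypothetical, and the brute-force search over label mappings is practical only for a small number of speakers.

```python
import numpy as np
from itertools import permutations


def speaker_confusion_rate(ref, hyp):
    """Fraction of frames mislabeled under the best one-to-one label mapping."""
    ref = np.asarray(ref)
    hyp = np.asarray(hyp)
    ref_ids = np.unique(ref)
    hyp_ids = np.unique(hyp)
    # overlap[i, j]: frames where reference speaker i coincides with cluster j
    overlap = np.array([[np.sum((ref == r) & (hyp == h)) for h in hyp_ids]
                        for r in ref_ids])
    # pad to a square matrix so surplus speakers or clusters stay unmatched
    size = max(overlap.shape)
    square = np.zeros((size, size), dtype=int)
    square[:overlap.shape[0], :overlap.shape[1]] = overlap
    best = max(sum(square[i, p[i]] for i in range(size))
               for p in permutations(range(size)))
    return 1.0 - best / len(ref)
```

For example, `speaker_confusion_rate([0, 0, 0, 1, 1, 1], [5, 5, 5, 7, 7, 5])` maps cluster 5 to speaker 0 and cluster 7 to speaker 1, leaving one of six frames in error.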

3.3 Results and Discussion

3.3.1 Results on development set

We begin with experiments on the development set for the baselines, and then apply several refinements to our basic x-vector based GANMM system (P1) to arrive at our final proposed system (P4). The experimental results are reported in Table 1. Note that for all the results presented in this table, it is assumed that the systems have access to the oracle SAD and number of speakers. From Table 1, it is seen that the basic x-vector based GANMM (P1) does not perform well compared to the baselines. This is due to poor cluster initialization and redundancy in capturing speaker-specific information in the high-dimensional embedding space. However, when k-means initialization for pre-training is introduced in P1 (P2), it outperforms IB. We then incorporate only spectral embedding in our P1 system (P3). The significant improvement from P1 to P3 shows that spectral embedding, which retains speaker-specific information more efficiently, is one of the key components of our diarization system. Finally, by employing both spectral embedding on the extracted x-vectors and k-means initialization for GANMM pre-training (P4), we obtain a further improvement in performance. It is worthwhile to mention that, unlike the x-vector baseline, no supervised PLDA was used in any of our experiments. We use our best performing system (P4) for the remaining experiments in this work.


System            IB      x-vector   P1      P2      P3      P4
Avg. DER (in %)   29.17   23.03      36.59   27.30   19.70   18.90

Table 1: Avg. DER results for the proposed systems and the two baselines on the development set.

3.3.2 Results on evaluation set

Diarization performances in terms of mean and standard deviation of DER on the evaluation set are presented in Table 2. We show results assuming oracle SAD with the known number of speakers and with the estimated number of speakers in columns 2 and 3, respectively. For a fair comparison between the proposed and x-vector systems, we implemented a separate diarization system: spectral embedding on the extracted x-vectors with cosine similarity scoring and AHC clustering (referred to here as x-vector (spectral)). From Table 2, the small gain in avg. DER of the x-vector system over the proposed system is probably due to the extra supervised PLDA steps on the x-vectors. On the other hand, spectral embedding on x-vectors brings a performance improvement for the x-vector system. However, the DER difference between P4 and either of the x-vector systems is not statistically significant ($p > 0.05$ by t-test). Moreover, the proposed method provides superior performance over the IB baseline. From Table 2, column 3, we observe that diarization performance degrades for all the systems when the estimated number of speakers is used for clustering. In this paper, the number of speakers in a particular session is determined by eigen-gap analysis on the affinity matrix constructed from the cosine similarity between segment x-vectors [11]. However, for the x-vector and x-vector (spectral) systems, we use thresholding on the PLDA scores to perform AHC clustering for an unknown number of speakers. We noticed that the performance degradation is mostly due to under-estimation of the number of speakers for some sessions. We achieve comparable performance for our proposed system as compared to x-vector and x-vector (spectral), and significantly better performance than IB.
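The eigen-gap analysis used to estimate the number of speakers can be sketched as follows. This is a minimal version under stated assumptions: it uses the raw cosine affinity and the largest difference between consecutive eigenvalues, whereas the procedure in [11] may refine the affinity matrix or use a different gap criterion; the function name and the `max_speakers` cap are hypothetical.

```python
import numpy as np


def estimate_num_speakers(xvectors, max_speakers=10):
    """Pick the speaker count at the largest gap in the affinity spectrum."""
    unit = xvectors / np.linalg.norm(xvectors, axis=1, keepdims=True)
    affinity = unit @ unit.T                       # cosine similarity matrix
    eigvals = np.sort(np.linalg.eigvalsh(affinity))[::-1]  # descending order
    limit = min(max_speakers, len(eigvals) - 1)
    gaps = eigvals[:limit] - eigvals[1:limit + 1]  # gaps[i] = lambda_i - lambda_{i+1}
    return int(np.argmax(gaps)) + 1
```

When two speakers have similar embeddings, their eigenvalues merge and the largest gap appears earlier, which is one way the under-estimation mentioned above can arise.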

3.3.3 Further analysis

We next check the effectiveness of our proposed system in a variety of practical conditions. The average DER of all the audio files, split according to session duration, is shown in Table 3. For 20-30 min and 30-40 min audio sessions, the proposed system performs better than x-vector, while it degrades for the rest of the audio files. Therefore, we can say that clustering with GANMM results in performance comparable to that of supervised methods for the short-duration sessions.

For in-depth analysis and to check the effectiveness of our proposed method in more challenging practical scenarios, we first chose meetings from the evaluation set in which the majority of speech segments are of short duration (< 2.5 sec and < 3 sec). We report the mean DER of the selected sessions in Table 4, column 2. It is clear from the table that the proposed system yields performance competitive with the x-vector baseline system for both cases when the fraction of short-duration segments within a session is > 70%. Both the proposed and x-vector systems are robust to short speech segments.

To show the effectiveness of the proposed diarization system in minority speaker detection, we first select meetings from the evaluation set that have at least one speaker who speaks for less than 10% of the whole meeting duration. We then compute the minority speaker error, which we define as the fraction of speaker error (in seconds) within the speech from the minority speaker over the total session duration. We report the average minority speaker error (in %) over all sessions in Table 4, column 3. It is evident from the table that our proposed system is slightly more robust than the x-vector baseline in non-dominant speaker detection.

Finally, to evaluate the diarization performance of the proposed GANMM-based system in a scenario with a larger number of speakers, we chose the ICSI meeting corpus [32]. The results are shown in Table 4, column 4. For a small number (3-5) of speakers, the proposed system is comparable to the x-vector system; however, its performance deteriorates for a larger number of speakers. Further analysis is required before the proposed diarization system is used for applications with a large number of speakers.


                      Mean, std. dev. DER (in %)         Mean, std. dev. DER (in %)
                      (Oracle SAD, known #speakers)      (Oracle SAD, estimated #speakers)
System                Mean     Std. dev.                 Mean     Std. dev.
IB                    25.43    15.66                     26.57    16.04
x-vector              15.91    7.88                      20.14    11.36
x-vector (spectral)   15.62    11.41                     20.02    12.98
P4                    17.11    10.57                     21.56    12.28

Table 2: Results on the evaluation set for the baseline systems and the final version of the proposed system.


System     10-20 min   20-30 min   30-40 min   40-50 min   50-60 min
x-vector   16.54       20.33       17.57       10.68       14.05
P4         18.39       17.82       12.52       18.77       18.12

Table 3: Performance (DER, %) of x-vector and proposed system on evaluation set split according to session duration.


           Avg. DER (in %) for short     Avg. minority     Avg. DER (in %) across
           speech segments               speaker error     number of speakers
System     < 2.5 sec     < 3 sec         (in %)            3, 4 and 5     8, 9 and 10
x-vector   15.92         18.84           1.74              24.80          29.28
P4         17.61         18.92           1.51              26.38          34.02

Table 4: Performance analysis of x-vector and proposed system in challenging scenarios.

4 Conclusions

In this work, we propose the use of the GAN mixture model, a novel deep generative model, for speaker clustering within the speaker diarization framework. While the basic x-vector based GANMM is shown to perform poorly, substantial improvement is observed after employing k-means initialization based GANMM pre-training and spectral embedding on the extracted x-vectors. The proposed system results in a relative 33% DER improvement over the IB baseline and performance comparable to the x-vector baseline. Furthermore, to the best of our knowledge, this is one of the first approaches exploring the use of a GAN mixture model for speaker clustering in the context of speaker diarization. In addition, the proposed diarization system exhibits promising performance in several challenging practical conditions. Future work could investigate variants of GANs in the mixture model in a multi-tasking fashion to further improve diarization performance.
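As a quick check of the relative improvement quoted above, assuming it is computed from the evaluation-set means with a known number of speakers reported in Table 2 (IB: 25.43%, P4: 17.11%):

```latex
\frac{\mathrm{DER}_{\mathrm{IB}} - \mathrm{DER}_{\mathrm{P4}}}{\mathrm{DER}_{\mathrm{IB}}}
= \frac{25.43 - 17.11}{25.43}
= \frac{8.32}{25.43}
\approx 0.327 \approx 33\%
```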


  • [1] Marijn Huijbregts, Segmentation, diarization and speech transcription: surprise data unraveled, Citeseer, 2008.
  • [2] Petr Cerva, Jan Silovsky, Jindrich Zdansky, Jan Nouza, and Ladislav Seps, “Speaker-adaptive speech recognition using speaker diarization for improved transcription of large spoken archives,” Speech Commun., vol. 55, no. 10, pp. 1033–1046, 2013.
  • [3] Yi Liu, Yao Tian, Liang He, and Jia Liu, “Investigating various diarization algorithms for speaker in the wild (SITW) speaker recognition challenge,” in Proc. Interspeech, 2016, pp. 853–857.
  • [4] Fabio Castaldo, Daniele Colibro, Emanuele Dalmasso, Pietro Laface, and Claudio Vair, “Stream-based speaker segmentation using speaker factors and eigenvoices,” in Proc. ICASSP, 2008, pp. 4133–4136.
  • [5] Stephen H Shum, Najim Dehak, Réda Dehak, and James R Glass, “Unsupervised methods for speaker diarization: An integrated and iterative approach,” IEEE Trans. Audio, Speech, Lang. Process., vol. 21, no. 10, pp. 2015–2028, 2013.
  • [6] Mohammed Senoussaoui, Patrick Kenny, Themos Stafylakis, and Pierre Dumouchel, “A study of the cosine distance-based mean shift for telephone speech diarization,” IEEE/ACM Trans. Audio, Speech, and Lang. Process., vol. 22, no. 1, pp. 217–227, 2014.
  • [7] Daniel Garcia-Romero, David Snyder, Gregory Sell, Daniel Povey, and Alan McCree, “Speaker diarization using deep neural network embeddings,” in Proc. ICASSP, 2017, pp. 4930–4934.
  • [8] Gregory Sell et al., “Diarization is hard: Some experiences and lessons learned for the JHU team in the inaugural DIHARD challenge,” in Proc. Interspeech, 2018, pp. 2808–2812.
  • [9] Pawel Cyrta, Tomasz Trzciński, and Wojciech Stokowiec, “Speaker diarization using deep recurrent convolutional neural networks for speaker embeddings,” in International Conference on Information Systems Architecture and Technology. Springer, 2017, pp. 107–117.
  • [10] Hervé Bredin, “Tristounet: triplet loss for speaker turn embedding,” in Proc. ICASSP, 2017, pp. 5430–5434.
  • [11] Quan Wang, Carlton Downey, Li Wan, Philip Andrew Mansfield, and Ignacio Lopez Moreno, “Speaker diarization with LSTM,” in Proc. ICASSP, 2018, pp. 5239–5243.
  • [12] Aonan Zhang, Quan Wang, Zhenyao Zhu, John Paisley, and Chong Wang, “Fully supervised speaker diarization,” arXiv preprint arXiv:1810.04719, 2018.
  • [13] Gregory Sell and Daniel Garcia-Romero, “Speaker diarization with PLDA i-vector scoring and unsupervised calibration,” in 2014 IEEE Spoken Language Technology Workshop (SLT), 2014, pp. 413–417.
  • [14] Gregory Sell, Daniel Garcia-Romero, and Alan McCree, “Speaker diarization with i-vectors from DNN senone posteriors,” in Proc. Interspeech, 2015.
  • [15] David Snyder, Daniel Garcia-Romero, Gregory Sell, Daniel Povey, and Sanjeev Khudanpur, “X-vectors: Robust DNN embeddings for speaker recognition,” in Proc. ICASSP, 2018, pp. 5329–5333.
  • [16] Zbynek Zajíc, Marek Hrúz, and Ludek Müller, “Speaker diarization using convolutional neural network for statistics accumulation refinement,” in Proc. Interspeech, 2017, pp. 3562–3566.
  • [17] Shih-Sian Cheng, Hsin-Min Wang, and Hsin-Chia Fu, “BIC-based speaker segmentation using divide-and-conquer strategies with application to speaker diarization,” IEEE Trans. Audio, Speech, and Lang. Process., vol. 18, no. 1, pp. 141–157, 2010.
  • [18] Deepu Vijayasenan, Fabio Valente, and Hervé Bourlard, “An information theoretic approach to speaker diarization of meeting data,” IEEE Trans. Audio, Speech, and Lang. Process., vol. 17, no. 7, pp. 1382–1393, 2009.
  • [19] Dimitrios Dimitriadis and Petr Fousek, “Developing on-line speaker diarization system,” in Proc. Interspeech, 2017, pp. 2739–2743.
  • [20] Mickael Rouvier and Sylvain Meignier, “A global optimization framework for speaker diarization,” in Odyssey, 2012.
  • [21] Philip Andrew Mansfield, Quan Wang, Carlton Downey, Li Wan, and Ignacio Lopez Moreno, “Links: A high-dimensional online clustering method,” arXiv preprint arXiv:1801.10123, 2018.
  • [22] Aonan Zhang, Quan Wang, Zhenyao Zhu, John Paisley, and Chong Wang, “Fully supervised speaker diarization,” in Proc. ICASSP, 2019, pp. 6301–6305.
  • [23] Ruiqing Yin, Hervé Bredin, and Claude Barras, “Neural speech turn segmentation and affinity propagation for speaker diarization,” Proc. Interspeech, pp. 1393–1397, 2018.
  • [24] Ian Goodfellow et al., “Generative adversarial nets,” in Proc. NIPS, 2014, pp. 2672–2680.
  • [25] Luke Metz, Ben Poole, David Pfau, and Jascha Sohl-Dickstein, “Unrolled generative adversarial networks,” arXiv preprint arXiv:1611.02163, 2016.
  • [26] Martin Arjovsky, Soumith Chintala, and Léon Bottou, “Wasserstein generative adversarial networks,” in Proc. ICML, 2017, pp. 214–223.
  • [27] Quan Hoang, Tu Dinh Nguyen, Trung Le, and Dinh Phung, “Multi-generator generative adversarial nets,” arXiv preprint arXiv:1708.02556, 2017.
  • [28] Sanjeev Arora, Rong Ge, Yingyu Liang, Tengyu Ma, and Yi Zhang, “Generalization and equilibrium in generative adversarial nets (GANs),” in Proc. ICML. JMLR. org, 2017, pp. 224–232.
  • [29] Yang Yu and Wen-Ji Zhou, “Mixture of GANs for clustering,” in Proc. IJCAI, 2018, pp. 3047–3053.
  • [30] Daniel Povey et al., “The Kaldi speech recognition toolkit,” Tech. Rep., IEEE Signal Processing Society, 2011.
  • [31] Jonathan G Fiscus, Jerome Ajot, Martial Michel, and John S Garofolo, “The Rich Transcription 2006 spring meeting recognition evaluation,” in International Workshop on Machine Learning for Multimodal Interaction. Springer, 2006, pp. 309–322.
  • [32] Adam Janin, Don Baron, Jane Edwards, Dan Ellis, David Gelbart, Nelson Morgan, Barbara Peskin, Thilo Pfau, Elizabeth Shriberg, Andreas Stolcke, et al., “The ICSI meeting corpus,” in Proc. ICASSP, 2003, vol. 1, pp. I–I.