1 Introduction
Speaker diarization addresses the problem of determining “who spoke when” in an audio recording. It has wide-ranging applications in speaker indexing of data [1], and forms an integral component of speech [2] and speaker recognition [3] pipelines. A generic speaker diarization system includes (a) a speech activity detection (SAD) module, which separates speech from non-speech parts; (b) speaker segmentation, which divides the input audio into speaker-homogeneous chunks and enables extraction of speaker-discriminative embeddings such as speaker factors [4], i-vectors [5, 6], x-vectors [7, 8], CNN- and LSTM-based embeddings [9, 10], and d-vectors [11, 12] from those chunks; and (c) speaker clustering, which determines the number of constituent speakers in an audio stream and labels each segment with a distinct speaker label (and possibly, identity).
Many recent works on deep neural network based embedding extraction [7, 10, 11] have advanced speaker diarization research with significant performance improvements. They have effectively replaced the previous embedding approaches based on i-vectors for diarization [13, 14]. The widely popular x-vector embeddings, which have proven more effective than traditional i-vectors especially for short-duration speech, have become the de facto standard for speaker recognition [15] and diarization [7]. However, speaker clustering has over the years been based mostly on unsupervised algorithms. These algorithms include the Gaussian mixture model
[16]; agglomerative hierarchical clustering (AHC) based on similarity measures such as the Bayesian information criterion [17], the generalized log-likelihood ratio [7], and the information bottleneck (IB) [18]; mean shift [6]; k-means [11, 19]; integer linear programming [20]; and links [21]. Most recently, a few supervised speaker clustering methods, such as UIS-RNN [22] and affinity propagation [23], have also been proposed for diarization. Despite the success of the above clustering techniques, speaker diarization still remains a challenging task in many real-world applications due to the wide heterogeneity and variability in audio.

The recent success of generative adversarial networks (GANs) [24] in capturing complex data distributions by encoding rich latent structures has attracted a lot of attention. However, GANs are difficult to train due to the mode collapse problem [25]. To address this problem, many GAN variants, such as the Wasserstein GAN (WGAN) [26], the multi-generator GAN [27], and the mixture GAN [28], have been proposed. More recently, the GAN mixture model (GANMM), a novel adversarial architecture with a mixture of generators and discriminators, and a classifier, trained in an expectation maximization (EM) fashion, was introduced [29]. This model was shown to be very effective for clustering image and character data [29]. Although the performance of speech recognition and speaker verification systems has improved dramatically thanks to deep learning approaches, most existing clustering techniques for speaker diarization do not yet take full advantage of them. Therefore, it is worthwhile to explore the potential of neural network based clustering for speaker diarization. In the present work, we adopt and further develop the GANMM framework for audio-based speaker clustering.
This work uses x-vector embeddings as a feature representation on short overlapping speech segments. Prior to GANMM clustering, we extract spectral embeddings from the x-vectors to reduce the dimensionality, and use k-means clustering as initialization for pretraining the GAN models. The mixture model is trained through an EM procedure: the expectation step comprises learning a classifier to separate the clusters, and in the maximization step, each mixture model is trained using the clustered data. Experiments conducted on the AMI corpus confirm the validity of our proposed system. To the best of our knowledge, this is the first attempt at using GANs for speaker clustering in an unsupervised manner within a speaker diarization framework.
2 Proposed speaker diarization system
2.1 Overview of the system
An overview of the proposed system is presented in Fig. 1. Below, we describe each of the components in detail.
2.2 Segmentation
In order to isolate potential confounds due to miss and false alarm errors in speech/non-speech detection, we focus solely on the speaker confusion part of the diarization error rate (DER). We use oracle SAD for the proposed as well as the baseline systems. After removing the non-speech parts in an audio session, x-vectors are extracted from each segment of duration 1.5 sec with 1 sec overlap. While this denser segmentation might help clustering, each segment may contain more than one speaker. Motivated by the success of spectral clustering in speaker diarization [11], we apply spectral embedding to the extracted x-vectors, projecting them into a lower-dimensional subspace via eigendecomposition.
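The projection step can be sketched as follows. This is a minimal illustration, not the paper's exact recipe: the cosine-similarity affinity construction, the normalized-Laplacian formulation, and the target dimensionality are our assumptions.

```python
import numpy as np

def spectral_embeddings(xvectors, dim):
    """Project segment x-vectors to a low-dimensional spectral embedding.

    Builds a cosine-similarity affinity matrix, forms the symmetric
    normalized graph Laplacian, and keeps the eigenvectors of its
    smallest eigenvalues as the new representation.
    """
    # Cosine-similarity affinity between segments (non-negative weights)
    X = xvectors / np.linalg.norm(xvectors, axis=1, keepdims=True)
    A = np.clip(X @ X.T, 0.0, None)
    np.fill_diagonal(A, 0.0)

    # Symmetric normalized Laplacian: L = I - D^{-1/2} A D^{-1/2}
    d = A.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(d, 1e-12))
    L = np.eye(len(A)) - d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]

    # Eigendecomposition; the smallest eigenvalues carry cluster structure
    eigvals, eigvecs = np.linalg.eigh(L)
    return eigvecs[:, :dim]

# Toy usage: 10 segments of 512-dim x-vectors -> 3-dim spectral embeddings
rng = np.random.default_rng(0)
emb = spectral_embeddings(rng.standard_normal((10, 512)), dim=3)
```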
2.3 Speaker clustering: GANMM training
We employ GAN models to learn mixture models that capture the underlying clusters of complex data. In the following sections, we describe the various implementation details involved in GANMM training.
2.3.1 Mitigating early convergence
Given an audio stream, we assume the extracted embeddings X to follow a probability distribution p(x), with hidden variables Z (the cluster assignments) following a distribution q(Z). The log-likelihood with model parameters θ can be expressed as

    ln p(X | θ) = L(q, θ) + KL(q ‖ p),    (1)

where the lower bound can be written as L(q, θ) = Σ_Z q(Z) ln [ p(X, Z | θ) / q(Z) ] and KL(q ‖ p) is the KL-divergence between q(Z) and the posterior p(Z | X, θ). In the conventional EM procedure, the E-step matches the hidden-variable (cluster, in this case) distribution to the current posterior probability. The M-step determines the parameters by maximizing the resulting log-likelihood. However, the key issue with GAN models is that they can fit the current guess of the hidden variables too well. Hence, the model maximizing the likelihood takes an extreme value and the whole procedure converges very early, with KL(q ‖ p) = 0 [29]. To mitigate this early convergence, an error term ε is introduced in the E-step so that KL(q ‖ p) = ε; convergence is guaranteed by keeping ε decreasing over the iterations. In the GANMM setup, an imperfect classifier C is trained at the E-step on generator outputs, and the M-step ensures convergence by eliminating this error. Further details are outlined in [29].

2.3.2 EM procedure
This involves the following:
E-step:
(1) Produce samples from the generators of the current GANMM model and gather them to construct a data set X̃ = {(x̃_j, k)}, where x̃_j is a data point generated by the k-th mixture and k serves as its label. (2) Train an inaccurate classifier C using the data set X̃. (3) Classify each training data point of a particular session into one of the K clusters using C, taking the maximum probability value.
M-step: Train each mixture GAN (G_k, D_k) with its clustered data for one generator iteration and several discriminator iterations. Here, since the clustered data is fed directly to the discriminator, each discriminator index corresponds to one cluster of the data. We execute the E- and M-steps until convergence. Fig. 2 illustrates the general architecture of GANMM.
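The E-/M-step loop above can be sketched schematically. In this runnable sketch, each "generator" is a Gaussian sampler around a cluster mean and the "classifier" is a nearest-mean rule; these are purely stand-ins for GANMM's per-cluster GANs and neural classifier.

```python
import numpy as np

rng = np.random.default_rng(0)

def em_cluster(data, n_clusters, n_iters=10):
    """Schematic GANMM-style EM loop with Gaussian stand-in generators."""
    # Pretraining stand-in: pick random data points as initial cluster means
    means = data[rng.choice(len(data), n_clusters, replace=False)]
    for _ in range(n_iters):
        # E-step (1): draw the same number of samples from every "generator"
        samples = [m + 0.1 * rng.standard_normal((50, data.shape[1]))
                   for m in means]
        # E-step (2): "train" the classifier on generated data
        # (here: nearest mean of the generated samples)
        class_means = np.stack([s.mean(axis=0) for s in samples])
        # E-step (3): assign each training point to its most likely cluster
        dists = ((data[:, None, :] - class_means[None, :, :]) ** 2).sum(-1)
        labels = dists.argmin(axis=1)
        # M-step: refit each "generator" on the data assigned to it
        for k in range(n_clusters):
            if np.any(labels == k):
                means[k] = data[labels == k].mean(axis=0)
    return labels

# Toy usage: two well-separated blobs
data = np.vstack([rng.standard_normal((40, 2)),
                  rng.standard_normal((40, 2)) + 8.0])
labels = em_cluster(data, n_clusters=2)
print(np.unique(labels))
```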
2.3.3 GANMM model
To this end, the analogy is that each generator G_k maps random noise z (sampled from a prior P_z) to the data space and thus induces a single distribution P_{G_k}; as a whole, the generators induce a mixture of distributions P_model in the data space. One discriminator D_k is supplied for each G_k, postulating that it serves one cluster and distinguishes between generated samples and the real training data of that cluster. In contrast, the classifier C performs multi-class classification of training instances into one of the K clusters. The minimax game between these three networks in GANMM can be formulated as
    min_{G, C} max_{D} Σ_{k=1}^{K} ( E_{x ∼ P_k}[D_k(x)] − E_{z ∼ P_z}[D_k(G_k(z))] ) − Σ_{k=1}^{K} E_{z ∼ P_z}[ log C_k(G_k(z)) ],    (2)

where the first two terms represent the WGAN discriminator cost and the second term alone the WGAN generator cost, and C_k(x) is the probability that x was generated by G_k. The last term in (2) is a standard cross-entropy loss. It is to be noted that each generator is deemed to produce samples of a specific mode.
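The three cost terms can be illustrated numerically. The helper functions below are a sketch under the WGAN sign conventions stated above; the critic scores and classifier outputs in the toy usage are arbitrary example values.

```python
import numpy as np

def wgan_discriminator_cost(d_real, d_fake):
    """WGAN critic objective for one mixture: maximized over D_k."""
    return np.mean(d_real) - np.mean(d_fake)

def wgan_generator_cost(d_fake):
    """WGAN generator objective for one mixture: G_k tries to raise D_k(G_k(z))."""
    return -np.mean(d_fake)

def classifier_cross_entropy(probs, gen_index):
    """Cross-entropy pushing classifier C to recognise which generator
    produced each sample; probs[i, k] plays the role of C_k(x_i)."""
    return -np.mean(np.log(probs[:, gen_index] + 1e-12))

# Toy usage: scalar critic scores and a uniform 3-way classifier
d_real = np.array([0.9, 1.1, 1.0])
d_fake = np.array([0.2, 0.1, 0.3])
probs = np.full((3, 3), 1.0 / 3.0)
total = (wgan_discriminator_cost(d_real, d_fake)
         + wgan_generator_cost(d_fake)
         + classifier_cross_entropy(probs, gen_index=0))
```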
2.3.4 Pretraining
The rationale for pretraining before EM training is that starting EM from a random parameter initialization may result in a bad clustering, with all training data assigned to one cluster. For the pretraining, we prepare data by randomly shuffling the training instances and assigning them to clusters. For a better initialization, we first perform k-means clustering with the number of speakers given by the ground truth. The GAN model for each cluster is pretrained for 500 epochs.
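The k-means initialization can be sketched with plain Lloyd iterations; the iteration count and seeding scheme below are our assumptions, not the paper's.

```python
import numpy as np

def kmeans_init(embeddings, n_speakers, n_iters=20, seed=0):
    """Plain k-means (Lloyd's algorithm) used to assign initial clusters
    for GANMM pretraining; the oracle speaker count fixes k."""
    rng = np.random.default_rng(seed)
    centers = embeddings[rng.choice(len(embeddings), n_speakers,
                                    replace=False)]
    for _ in range(n_iters):
        # Assign each embedding to its nearest center
        d = ((embeddings[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = d.argmin(axis=1)
        # Recompute centers from the current assignments
        for k in range(n_speakers):
            if np.any(labels == k):
                centers[k] = embeddings[labels == k].mean(axis=0)
    return labels

# Toy usage: two degenerate blobs are separated into two clusters
emb = np.vstack([np.zeros((5, 3)), np.ones((5, 3)) * 10])
labels = kmeans_init(emb, n_speakers=2)
```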
2.3.5 Class imbalance problem
At the beginning of the E-step, the classifier trained on GAN-generated samples may assign clusters with imbalanced numbers of instances. This imbalance can grow over the iterations, with some clusters receiving less data than others. To combat this issue, we first ensure that every GAN model generates the same amount of data to train the classifier. Furthermore, each cluster k is augmented with the data from the remaining clusters that has the highest posterior probability for the k-th cluster according to the classifier. We set the amount of data to be augmented empirically and reduce it along the iterations to aid convergence. The full algorithm is presented in Algorithm 1.
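The augmentation step can be sketched as follows; the `min_size` threshold (which would be reduced over the EM iterations) and the hard reassignment of top-posterior points are illustrative assumptions.

```python
import numpy as np

def rebalance(labels, posteriors, min_size):
    """Augment under-populated clusters with the out-of-cluster points
    whose classifier posterior for that cluster is highest.

    posteriors[i, k] is the classifier's probability that point i
    belongs to cluster k; min_size controls how much data is moved.
    """
    labels = labels.copy()
    n_clusters = posteriors.shape[1]
    for k in range(n_clusters):
        members = np.flatnonzero(labels == k)
        deficit = min_size - len(members)
        if deficit <= 0:
            continue
        # Candidates outside cluster k, ranked by their posterior for k
        outsiders = np.flatnonzero(labels != k)
        ranked = outsiders[np.argsort(-posteriors[outsiders, k])]
        labels[ranked[:deficit]] = k
    return labels

# Toy usage: cluster 1 starts empty and is filled with its best candidate
labels = np.array([0, 0, 0, 0])
post = np.array([[0.9, 0.1], [0.6, 0.4], [0.2, 0.8], [0.7, 0.3]])
balanced = rebalance(labels, post, min_size=1)
```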
2.4 GANMM testing
During inference, the test audio session is uniformly segmented to extract x-vectors, followed by projection onto a lower dimension using spectral embedding. We use the previously trained classifier to predict the cluster decision for each segment, which is then converted to frame-level decisions. Finally, a median filter with kernel size 361 is applied to smooth the frame-level decisions.
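The smoothing step can be sketched as a sliding median over integer frame labels; the edge-padding choice is our assumption, and the toy example uses a kernel of 3 rather than 361 for readability.

```python
import numpy as np

def median_smooth(frame_labels, kernel=361):
    """Median-filter frame-level speaker decisions; an odd kernel size
    of 361 frames matches the paper. Edges are handled by repeating the
    boundary labels."""
    half = kernel // 2
    padded = np.pad(frame_labels, half, mode="edge")
    out = np.empty_like(frame_labels)
    for i in range(len(frame_labels)):
        out[i] = np.median(padded[i:i + kernel])
    return out

# Toy usage with a tiny kernel: isolated spurious labels are removed
noisy = np.array([0, 0, 1, 0, 0, 2, 2, 0, 2, 2])
print(median_smooth(noisy, kernel=3))   # [0 0 0 0 0 2 2 2 2 2]
```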
3 Experimental Evaluation
3.1 Data set
We evaluate our proposed algorithm on the popular augmented multi-party interaction (AMI) meeting corpus (http://groups.inf.ed.ac.uk/ami/download/). It consists of about 100 hours of meeting recordings in English, recorded at multiple sites (Edinburgh, Idiap, TNO, Brno). For our experiments, we randomly chose ten meetings, each of duration 10–20 min, as the development set to tune the parameters. Another fifty meetings, equally distributed among varying lengths (10–20 min, 20–30 min, 30–40 min, 40–50 min, 50–60 min), were randomly selected as the evaluation set for benchmarking the proposed system against two baseline diarization systems. Meetings from all the recording sites are present in both the development and evaluation sets. The proposed system relies solely on each meeting to perform diarization, and no separate training data is required.
3.2 Implementation details
We use the pretrained CALLHOME x-vector model (https://kaldi-asr.org/models/m6) available in the Kaldi recipe [30] for x-vector extraction. It is to be noted that in this work, unless explicitly mentioned, we use the ground truth to perform SAD and to calculate the number of speakers in a session, and do the same for both baselines. The discriminator and generator networks in GANMM are feed-forward neural networks with one hidden layer that contains 64 nodes. We use the ReLU activation function in the hidden layer and linear activation functions in the input and output layers. The classifier also contains one hidden layer with 64 nodes. We use the development data for early stopping during GANMM training and for choosing the smoothing window size.
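The stated network topology can be sketched as follows; the weight scales and toy dimensions are illustrative assumptions, and training is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp(in_dim, out_dim, hidden=64):
    """One-hidden-layer feed-forward net matching the stated GANMM
    topology: 64 ReLU hidden units, linear input/output layers.
    Weights here are random; training is omitted."""
    W1 = rng.standard_normal((in_dim, hidden)) * 0.1
    b1 = np.zeros(hidden)
    W2 = rng.standard_normal((hidden, out_dim)) * 0.1
    b2 = np.zeros(out_dim)

    def forward(x):
        h = np.maximum(x @ W1 + b1, 0.0)   # ReLU hidden layer
        return h @ W2 + b2                 # linear output
    return forward

# Generator: noise -> embedding space; discriminator: embedding -> score
generator = mlp(in_dim=16, out_dim=3)
discriminator = mlp(in_dim=3, out_dim=1)
score = discriminator(generator(rng.standard_normal((8, 16))))
```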
Since our proposed diarization system uses a uniform segmentation approach and x-vectors as embeddings, we implemented two diarization methods as our baselines: information bottleneck (IB) [18], and x-vector embeddings with PLDA scoring and AHC clustering (denoted x-vector) [7]. To report the performance of the different systems, we use the NIST diarization error rate (DER) [31] as the performance metric.
3.3 Results and Discussion
3.3.1 Results on development set
We begin with experiments on the development set for the baselines, and then apply several refinements (P1–P4) to our basic x-vector based GANMM (P1) to arrive at our final proposed system (P4). The experimental results are reported in Table 1. Note that for all results in this table, the systems are assumed to have access to the oracle SAD and the number of speakers. From Table 1, it is seen that the basic x-vector based GANMM (P1) does not perform well compared to the baselines. This is due to poor cluster initialization and redundancy in capturing speaker-specific information in the high-dimensional embedding space. However, when k-means initialization for pretraining is introduced into P1 (P2), it outperforms IB. We then incorporate spectral embedding alone into our P1 system (P3). The significant improvement from P1 to P3 shows that spectral embedding, which retains speaker-specific information more efficiently, is one of the key components of our diarization system. Finally, by employing both spectral embedding on the extracted x-vectors and k-means initialization for GANMM pretraining (P4), we obtain a further improvement in performance. It is worthwhile to mention that, unlike the x-vector baseline, no supervised PLDA was used in any of our experiments. We use our best performing system (P4) for the remaining experiments in this work.


Table 1: Average DER (in %) on the development set.

System        IB     x-vector  P1     P2     P3     P4
Avg. DER (%)  29.17  23.03     36.59  27.30  19.70  18.90

3.3.2 Results on evaluation set
Diarization performance in terms of mean and standard deviation of DER on the evaluation set is presented in Table 2. We show results assuming oracle SAD with the known number of speakers and the estimated number of speakers in columns 2 and 3, respectively. For a fair comparison between the proposed and x-vector systems, we implemented a separate diarization system: spectral embedding on the extracted x-vectors with cosine similarity scoring and AHC clustering (denoted x-vector (spectral)). From Table 2, the slightly lower average DER of the x-vector system compared to the proposed one is probably attributable to its extra supervised PLDA step on the x-vectors. On the other hand, spectral embedding on the x-vectors brings a performance improvement for the x-vector system. However, the DER difference between P4 and either of the x-vector systems is not statistically significant (p > 0.05 by t-test). Moreover, the proposed method provides superior performance over the IB baseline. From column 3 of Table 2, we observe that diarization performance degrades for all systems when the estimated number of speakers is used for clustering. In this paper, the number of speakers in a particular session is determined by eigengap analysis on the affinity matrix constructed from the cosine similarities between segment x-vectors [11]. However, for the x-vector and x-vector (spectral) systems, we use thresholding on the PLDA scores to perform AHC clustering with an unknown number of speakers. We noticed that the performance degradation is mostly due to underestimation of the number of speakers in some sessions. Our proposed system achieves performance comparable to x-vector and x-vector (spectral), and significantly better than IB.

3.3.3 Further analysis
We next check the effectiveness of our proposed system in a variety of practical conditions. Average DERs for all the audio files, split according to session duration, are shown in Table 3. For 20–30 min and 30–40 min audio sessions, the proposed system performs better than x-vector, and degrades for the rest of the audio files. Therefore, we can say that clustering with GANMM yields performance comparable to supervised methods for short-duration sessions.
For in-depth analysis, and to check the effectiveness of our proposed method in more challenging practical scenarios, we first chose meetings from the evaluation set that have a majority of short-duration (≤ 2.5 sec and ≤ 3 sec) speech segments. We report the mean DER of the selected sessions in Table 4, column 2. It is clear from the table that the proposed system yields performance competitive with the x-vector baseline for both cases, when the fraction of short-duration segments within a session is above 70%. Both the proposed and x-vector systems are robust to short speech segments.
To show the effectiveness of the proposed diarization system in minority speaker detection, we first select meetings from the evaluation set that have at least one speaker who speaks for less than 10% of the whole meeting duration. We then compute the minority speaker error, which we define as the fraction of speaker error (in seconds) within the speech from the minority speaker over the total session duration. We report the average minority speaker error (in %) over all sessions in Table 4, column 3. It is evident from the table that our proposed system is slightly more robust than the x-vector baseline in non-dominant speaker detection.
Finally, to evaluate the diarization performance of the proposed GANMM-based system in a scenario with a larger number of speakers, we chose the ICSI meeting corpus [32]. The results are shown in Table 4, column 4. For a small number (3–5) of speakers, the proposed system is comparable to the x-vector system; however, its performance deteriorates for a larger number of speakers. Further analysis is required before the proposed diarization system is used for applications with a large number of speakers.


Table 2: Mean ± standard deviation of DER (in %) on the evaluation set.

System               Known #speakers  Estimated #speakers
IB                   25.43 ± 15.66    26.57 ± 16.04
x-vector             15.91 ± 7.88     20.14 ± 11.36
x-vector (spectral)  15.62 ± 11.41    20.02 ± 12.98
P4                   17.11 ± 10.57    21.56 ± 12.28



Table 3: Average DER (in %) on the evaluation set, split by session duration.

System    10–20 min  20–30 min  30–40 min  40–50 min  50–60 min
x-vector  16.54      20.33      17.57      10.68      14.05
P4        18.39      17.82      12.52      18.77      18.12



Table 4: Performance in challenging conditions: mean DER (in %) for sessions dominated by short segments, average minority speaker error (in %), and DER (in %) on the ICSI corpus.

System    DER (≤ 2.5 sec)  DER (≤ 3 sec)  Minority spk. error  ICSI (3–5 spk)  ICSI (>5 spk)
x-vector  15.92            18.84          1.74                 24.80           29.28
P4        17.61            18.92          1.51                 26.38           34.02

4 Conclusions
In this work, we propose the GAN mixture model, a novel deep generative model, for speaker clustering within the speaker diarization framework. While the basic x-vector based GANMM is shown to perform poorly, substantial improvement is observed after employing k-means initialization for GANMM pretraining and spectral embedding on the extracted x-vectors. The proposed system yields a relative 33% DER improvement over the IB baseline and performance favorably comparable to the x-vector baseline. Furthermore, to the best of our knowledge, this is one of the first approaches exploring the use of a GAN mixture model for speaker clustering in the context of speaker diarization. In addition, the proposed diarization system exhibits promising performance in several challenging practical conditions. Future work could investigate variants of GANs in the mixture model in a multitasking fashion to further improve diarization performance.
References
 [1] Marijn Huijbregts, “Segmentation, diarization and speech transcription: Surprise data unraveled,” Ph.D. thesis, 2008.
 [2] Petr Cerva, Jan Silovsky, Jindrich Zdansky, Jan Nouza, and Ladislav Seps, “Speakeradaptive speech recognition using speaker diarization for improved transcription of large spoken archives,” Speech Commun., vol. 55, no. 10, pp. 1033–1046, 2013.
 [3] Yi Liu, Yao Tian, Liang He, and Jia Liu, “Investigating various diarization algorithms for speaker in the wild (SITW) speaker recognition challenge,” in Proc. Interspeech, 2016, pp. 853–857.
 [4] Fabio Castaldo, Daniele Colibro, Emanuele Dalmasso, Pietro Laface, and Claudio Vair, “Streambased speaker segmentation using speaker factors and eigenvoices,” in Proc. ICASSP, 2008, pp. 4133–4136.
 [5] Stephen H Shum, Najim Dehak, Réda Dehak, and James R Glass, “Unsupervised methods for speaker diarization: An integrated and iterative approach,” IEEE Trans. Audio, Speech, Lang. Process., vol. 21, no. 10, pp. 2015–2028, 2013.
 [6] Mohammed Senoussaoui, Patrick Kenny, Themos Stafylakis, and Pierre Dumouchel, “A study of the cosine distancebased mean shift for telephone speech diarization,” IEEE/ACM Trans. Audio, Speech, and Lang. Process., vol. 22, no. 1, pp. 217–227, 2014.
 [7] Daniel GarciaRomero, David Snyder, Gregory Sell, Daniel Povey, and Alan McCree, “Speaker diarization using deep neural network embeddings,” in Proc. ICASSP, 2017, pp. 4930–4934.
 [8] Gregory Sell et al., “Diarization is hard: Some experiences and lessons learned for the JHU team in the inaugural DIHARD challenge,” in Proc. Interspeech, 2018, pp. 2808–2812.

 [9] Pawel Cyrta, Tomasz Trzciński, and Wojciech Stokowiec, “Speaker diarization using deep recurrent convolutional neural networks for speaker embeddings,” in International Conference on Information Systems Architecture and Technology. Springer, 2017, pp. 107–117.
 [10] Hervé Bredin, “Tristounet: Triplet loss for speaker turn embedding,” in Proc. ICASSP, 2017, pp. 5430–5434.
 [11] Quan Wang, Carlton Downey, Li Wan, Philip Andrew Mansfield, and Ignacio Lopez Moreno, “Speaker diarization with LSTM,” in Proc. ICASSP, 2018, pp. 5239–5243.
 [12] Aonan Zhang, Quan Wang, Zhenyao Zhu, John Paisley, and Chong Wang, “Fully supervised speaker diarization,” arXiv preprint arXiv:1810.04719, 2018.
 [13] Gregory Sell and Daniel GarciaRomero, “Speaker diarization with PLDA ivector scoring and unsupervised calibration,” in 2014 IEEE Spoken Language Technology Workshop (SLT), 2014, pp. 413–417.
 [14] Gregory Sell, Daniel GarciaRomero, and Alan McCree, “Speaker diarization with ivectors from DNN senone posteriors,” in Proc. Interspeech, 2015.
 [15] David Snyder, Daniel GarciaRomero, Gregory Sell, Daniel Povey, and Sanjeev Khudanpur, “Xvectors: Robust DNN embeddings for speaker recognition,” in Proc. ICASSP, 2018, pp. 5329–5333.
 [16] Zbynek Zajíc, Marek Hrúz, and Ludek Müller, “Speaker diarization using convolutional neural network for statistics accumulation refinement,” in Proc. Interspeech, 2017, pp. 3562–3566.
 [17] ShihSian Cheng, HsinMin Wang, and HsinChia Fu, “BICbased speaker segmentation using divideandconquer strategies with application to speaker diarization,” IEEE Trans. Audio, Speech, and Lang. Process., vol. 18, no. 1, pp. 141–157, 2010.
 [18] Deepu Vijayasenan, Fabio Valente, and Hervé Bourlard, “An information theoretic approach to speaker diarization of meeting data,” IEEE Trans. Audio, Speech, and Lang. Process., vol. 17, no. 7, pp. 1382–1393, 2009.
 [19] Dimitrios Dimitriadis and Petr Fousek, “Developing online speaker diarization system,” in Proc. Interspeech, 2017, pp. 2739–2743.
 [20] Mickael Rouvier and Sylvain Meignier, “A global optimization framework for speaker diarization,” in Odyssey, 2012.
 [21] Philip Andrew Mansfield, Quan Wang, Carlton Downey, Li Wan, and Ignacio Lopez Moreno, “Links: A highdimensional online clustering method,” arXiv preprint arXiv:1801.10123, 2018.
 [22] Aonan Zhang, Quan Wang, Zhenyao Zhu, John Paisley, and Chong Wang, “Fully supervised speaker diarization,” in Proc. ICASSP, 2019, pp. 6301–6305.
 [23] Ruiqing Yin, Hervé Bredin, and Claude Barras, “Neural speech turn segmentation and affinity propagation for speaker diarization,” in Proc. Interspeech, 2018, pp. 1393–1397.
 [24] Ian Goodfellow et al., “Generative adversarial nets,” in Proc. NIPS, 2014, pp. 2672–2680.
 [25] Luke Metz, Ben Poole, David Pfau, and Jascha SohlDickstein, “Unrolled generative adversarial networks,” arXiv preprint arXiv:1611.02163, 2016.
 [26] Martin Arjovsky, Soumith Chintala, and Léon Bottou, “Wasserstein generative adversarial networks,” in Proc. ICML, 2017, pp. 214–223.
 [27] Quan Hoang, Tu Dinh Nguyen, Trung Le, and Dinh Phung, “Multigenerator generative adversarial nets,” arXiv preprint arXiv:1708.02556, 2017.
 [28] Sanjeev Arora, Rong Ge, Yingyu Liang, Tengyu Ma, and Yi Zhang, “Generalization and equilibrium in generative adversarial nets (GANs),” in Proc. ICML. JMLR. org, 2017, pp. 224–232.
 [29] Yang Yu and WenJi Zhou, “Mixture of GANs for clustering,” in Proc. IJCAI, 2018, pp. 3047–3053.
 [30] Daniel Povey et al., “The Kaldi speech recognition toolkit,” Tech. Rep., IEEE Signal Processing Society, 2011.

 [31] Jonathan G Fiscus, Jerome Ajot, Martial Michel, and John S Garofolo, “The Rich Transcription 2006 spring meeting recognition evaluation,” in International Workshop on Machine Learning for Multimodal Interaction. Springer, 2006, pp. 309–322.
 [32] Adam Janin, Don Baron, Jane Edwards, Dan Ellis, David Gelbart, Nelson Morgan, Barbara Peskin, Thilo Pfau, Elizabeth Shriberg, Andreas Stolcke, et al., “The ICSI meeting corpus,” in Proc. ICASSP, 2003, vol. 1, pp. I–I.