1 Introduction
Speaker diarization [1], the task of determining “who spoke when” in a multi-speaker audio stream, has a wide range of applications, from information retrieval and meeting annotation to the analysis of face-to-face and telephone conversations. Recent speaker diarization systems [2, 3] segment the input audio stream into uniform speaker-homogeneous segments, extract fixed-length speaker embeddings from those segments, and then cluster these embeddings by speaker.
Among speaker embeddings, i-vectors [4, 5], produced using generative modeling, were the first employed for speaker diarization. Recently, embeddings extracted from discriminatively trained deep neural networks (DNNs), such as d-vectors [6, 7] and x-vectors [2, 3], have shown superior performance over i-vectors. These embeddings are partitioned into speaker clusters using clustering algorithms such as Gaussian mixture models [4], mean-shift [5], agglomerative hierarchical clustering (AHC) [2], k-means [8][6, 9] and links [10]. All the aforementioned approaches are unsupervised in determining the number of speakers and the speaker labels of a given audio session. Recently, a few supervised clustering approaches such as UIS-RNN [7] and affinity propagation [11] have also been proposed for diarization.
While performance on tasks such as speech and speaker recognition has improved significantly due to supervised deep learning approaches, most existing diarization systems are yet to take full advantage of similar techniques. DNN-based deep clustering approaches are popular in computer vision [12]. While appealing, they have not been readily applied to speaker diarization, probably due to their lack of interpretability and the unknown number of speakers in a given audio session. Recently, deep embedded clustering on d-vectors was introduced for speaker diarization [13]. Building on these advances, clustering with dimension reduction through a nonlinear neural transformation of embeddings, trained with a clustering-specific loss, could be beneficial for audio diarization systems.
A latent space image clustering method using a generative adversarial network (GAN) along with an encoder network (ClusterGAN) was recently proposed in [14]. Here, the encoder network performs the inverse mapping, i.e., it backprojects the data into the latent space. Two main advantages of GAN-based latent space clustering are interpretability and interpolation in the latent space [14]. In our work, we adopt and modify this network for speaker clustering within the speaker diarization framework. Our proposed work differs from [14] in two main ways: (a) instead of random one-hot encoded variables, we use the original speaker labels of the training data, so the GAN generator input is a mixture of continuous random variables and discrete one-hot encoded speaker labels; (b) instead of images (spectrograms), x-vector embeddings of short audio segments are used as the real data input to the GAN discriminator. The GAN and encoder networks are jointly trained along with a clustering-specific loss.
2 Background
In recent years, the primary focus of research in image clustering has been to use DNNs to nonlinearly transform the input feature space into a latent space where the data are easier to separate. Current deep clustering methods on image data include autoencoder-based approaches [15] and generative-model-based approaches such as variational deep embedding [16] and information maximizing GAN (InfoGAN) [17], among others. All these algorithms comprise three essential components: a deep neural network architecture, a network loss, and a clustering-specific loss. The network loss refers to the reconstruction loss of an autoencoder, the variational loss of a variational autoencoder, or the adversarial loss of a GAN. It is used to learn feasible latent features and avoid trivial solutions. The clustering-specific loss can be a cluster assignment loss such as the k-means loss [18], cluster assignment hardening loss [15], spectral clustering loss [19] or agglomerative clustering loss [20], or a cluster regularization loss such as the locality preserving loss, group sparsity loss or cluster classification loss [12]. These losses are used to learn suitable cluster-friendly representations from the data. In this work, we exploit both the network loss and the clustering loss in the clustering module for speaker diarization.
3 Proposed speaker diarization system
3.1 Overview
The overall methodology of the proposed speaker diarization system is shown in Fig. 1. The proposed system begins with the extraction of the popular time-delay neural network (TDNN) speaker embeddings [2], i.e., x-vectors, followed by latent space clustering. We discuss each module in the diarization pipeline below.
3.2 Segmentation
Our approach starts with a uniform temporal segmentation into 1.5 sec segments with 1 sec overlap. Each speech segment is embedded into a fixed-dimensional x-vector of dimension 512. These TDNN-based speaker embeddings have achieved state-of-the-art performance in speaker verification and diarization [2]. The x-vectors are then fed as inputs to the ClusterGAN network.
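As an illustration of the uniform segmentation described above, the following minimal sketch lists the segment boundaries for a window of 1.5 sec and an overlap of 1 sec (i.e., a 0.5 sec shift). The function name `sliding_segments` and the truncation of the final segment at the end of the audio are assumptions made for this example; the text does not specify how trailing audio is handled.

```python
def sliding_segments(duration, window=1.5, overlap=1.0):
    """Return (start, end) times in seconds of uniform overlapping segments."""
    if not 0 <= overlap < window:
        raise ValueError("overlap must be non-negative and smaller than window")
    shift = window - overlap  # 0.5 s hop for the values quoted in the text
    segments = []
    index = 0
    while True:
        start = index * shift
        if start >= duration:
            break
        # Assumption: the last segment is truncated at the end of the audio.
        end = min(start + window, duration)
        segments.append((start, end))
        if end >= duration:
            break
        index += 1
    return segments


segments = sliding_segments(3.0)
```

Under these assumptions, a 3 sec recording yields four segments starting at 0, 0.5, 1.0 and 1.5 sec, each of which would be mapped to one 512-dimensional x-vector.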
3.3 ClusterGAN training
The motivation behind using ClusterGAN on x-vectors is to nonlinearly transform them into a lower-dimensional embedding space which is more separable. Although the idea of using a mixture of continuous and discrete latent variables as the input to the GAN generator was inspired by InfoGAN [17], ClusterGAN is better suited for clustering than InfoGAN [14]. ClusterGAN comprises three components: the generator ($G$), the discriminator ($D$) and the encoder ($E$), as shown in Fig. 2.
3.3.1 Adversarial training
GANs are a recent class of deep generative models inspired by game theory, in which the $G$ and $D$ networks engage in a two-player minimax game [21]. The generator is a mapping from the latent space to the data space, $G: \mathcal{Z} \rightarrow \mathcal{X}$. It takes noise $z$ sampled from $P_z$ and generates samples to fool the discriminator. The discriminator is a mapping from the data space to a real value, $D: \mathcal{X} \rightarrow \mathbb{R}$. It takes real data $x$ sampled from $P_x$ and tries to discriminate between the real and generated fake samples. We employ the improved Wasserstein GAN (IWGAN) [22] for our GAN network. The objective function of this adversarial game is:
$$\min_{G}\max_{D}\; \mathbb{E}_{x \sim P_x}\left[D(x)\right] - \mathbb{E}_{z \sim P_z}\left[D(G(z))\right] - \lambda\,\mathrm{GP}$$ (1)
where $\lambda$ is the gradient penalty coefficient and GP is the gradient penalty term [22].
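For reference, the gradient penalty term of [22] is commonly written as below. This is the standard form from that work rather than a formula stated in the present text, and the symbols $\hat{x}$, $\epsilon$ and $P_{\hat{x}}$ are notation assumed here for illustration.

```latex
\mathrm{GP} = \mathbb{E}_{\hat{x} \sim P_{\hat{x}}}
  \left[ \left( \lVert \nabla_{\hat{x}} D(\hat{x}) \rVert_2 - 1 \right)^2 \right],
\qquad
\hat{x} = \epsilon\, x + (1 - \epsilon)\, G(z), \quad \epsilon \sim U[0, 1]
```

The penalty is evaluated on points $\hat{x}$ interpolated between real and generated samples, and it grows as the norm of the discriminator gradient at those points departs from one.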
3.3.2 Sampling from discrete-continuous mixtures
In order to perform clustering in the latent space, we have to backproject the data into the latent space. The latent space distribution in traditional GANs is typically chosen to be a Gaussian or uniform distribution. Although such distributions contain useful information about the input data distribution, they usually lead to poor clusters [23]. To mitigate this problem, it is essential to boost the latent space with categorical variables that create a non-smooth geometry. However, continuity in the latent space is also required for good interpolation, and GANs have good interpolation ability. Therefore, we feed a mixture of continuous ($z_n$) and discrete ($z_c$) variables to the generator by concatenating $z_n$ with $z_c$, i.e., $z = (z_n, z_c)$. In this work, $z_n$ is randomly sampled from a normal distribution whose parameters are kept fixed in all our experiments. We use the original speaker labels of the speech segments from the training data as the one-hot encoded variable $z_c$. The concatenation of $z_n$ with $z_c$ enables clustering in the latent space.
3.3.3 Inverse mapping network
Mapping from the data space to the latent space is a nontrivial problem, since it requires inverting the generator, which is a nonlinear model. Existing works [23, 24] tackle this by solving an optimization problem in $z$ to recover the latent vectors using $z^{*} = \arg\min_{z} \lVert G(z) - x \rVert + \gamma \lVert z \rVert_p$, where the first term is a norm of the reconstruction error, $\gamma$ is a regularization constant and $\lVert \cdot \rVert_p$ denotes the $\ell_p$ norm. However, these approaches are not suitable for clustering since the optimization problem is non-convex [24, 14]. To address this issue, an encoder network $E$ is introduced alongside the GAN for backprojection. We fix $z_c$ and randomly sample $z_n$ from a normal distribution with multiple restarts at each iteration step. Furthermore, to ensure precise recovery of the latent vector $z_n$, we compute the numerical difference between $z_n$ and $\hat{z}_n$, the continuous part of the encoder output. We found empirically that cosine distance is more suitable for this than mean square error. The objective function for this task can be written as:
$$\mathcal{L}_{\cos} = \frac{1}{m}\sum_{i=1}^{m}\left(1 - \frac{z_n^{(i)} \cdot \hat{z}_n^{(i)}}{\lVert z_n^{(i)} \rVert_2\, \lVert \hat{z}_n^{(i)} \rVert_2}\right)$$ (2)
where $m$ is the mini-batch size.
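The generator input of Section 3.3.2 and the cosine-distance term of Eq. (2) can be sketched in a few lines of plain Python. The function names, the noise standard deviation `sigma=0.1` and the toy dimensions are assumptions chosen for illustration only; they are not the settings used in our experiments, and a real implementation would operate on mini-batches of tensors.

```python
import math
import random


def sample_latent(speaker_index, num_speakers, dim_n, sigma, rng):
    """Concatenate Gaussian noise z_n with a one-hot speaker label z_c."""
    z_n = [rng.gauss(0.0, sigma) for _ in range(dim_n)]
    z_c = [1.0 if k == speaker_index else 0.0 for k in range(num_speakers)]
    return z_n + z_c


def cosine_distance(u, v):
    """Return 1 - cos(u, v) for two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def batch_cosine_loss(z_n_batch, z_n_hat_batch):
    """Average cosine distance between sampled and recovered z_n vectors."""
    pairs = list(zip(z_n_batch, z_n_hat_batch))
    return sum(cosine_distance(z, z_hat) for z, z_hat in pairs) / len(pairs)


rng = random.Random(0)  # fixed seed so the sketch is reproducible
# Hypothetical toy setting: speaker 2 of 4, three continuous dimensions.
z = sample_latent(speaker_index=2, num_speakers=4, dim_n=3, sigma=0.1, rng=rng)
```

In this toy setting `z` has seven entries: three noise values followed by the one-hot label. `batch_cosine_loss` is zero when every recovered vector points in the same direction as the sampled one, regardless of its scale.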
3.3.4 Clustering-specific loss
To learn cluster-friendly representations, we incorporate a cluster classification loss, in the form of a cross-entropy (CE) loss, while training. The softmax layer output of the $E$ network is used for computing the cross-entropy loss. This loss encourages the latent embeddings to cluster and hence increases their discriminative information. We minimize this cross-entropy loss as:
$$\mathcal{L}_{CE} = -\frac{1}{m}\sum_{i=1}^{m}\sum_{k=1}^{K} p\left(z_{c,k}^{(i)}\right)\log p\left(\hat{z}_{c,k}^{(i)}\right)$$ (3)
where $m$ is the mini-batch size, $K$ is the number of training speakers, the first term, $p(z_{c,k}^{(i)})$, is the empirical probability that the $i$-th embedding belongs to the $k$-th speaker, and the second term, $p(\hat{z}_{c,k}^{(i)})$, is the predicted probability (by the encoder) that the embedding belongs to the $k$-th speaker.
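As a small worked example of the cross-entropy loss in Eq. (3), the sketch below averages the loss over a toy mini-batch of two embeddings and two speakers. The function name `cross_entropy`, the clipping constant `eps` and the probability values are assumptions for illustration, not part of the proposed system.

```python
import math


def cross_entropy(one_hot_labels, predicted_probs, eps=1e-12):
    """Mean cross-entropy between one-hot labels and softmax outputs."""
    total = 0.0
    for target, pred in zip(one_hot_labels, predicted_probs):
        # Clip probabilities away from zero so that log() stays finite.
        total -= sum(t * math.log(max(p, eps)) for t, p in zip(target, pred))
    return total / len(one_hot_labels)


labels = [[1.0, 0.0], [0.0, 1.0]]    # empirical speaker probabilities
probs = [[0.5, 0.5], [0.25, 0.75]]   # hypothetical encoder softmax outputs
loss = cross_entropy(labels, probs)
```

Here the loss equals $(\log 2 + \log(4/3))/2$, and it falls to zero only when the encoder assigns probability one to the correct speaker of every embedding.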
3.3.5 Joint training
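We train the GAN and encoder networks jointly. The training objective function takes the following form:
$$\min_{G, E}\max_{D}\; \mathbb{E}_{x \sim P_x}\left[D(x)\right] - \mathbb{E}_{z \sim P_z}\left[D(G(z))\right] - \lambda\,\mathrm{GP} + \beta_n\,\mathcal{L}_{\cos} + \beta_c\,\mathcal{L}_{CE}$$ (4)
where the first three terms form the adversarial objective of Eq. (1), $\mathcal{L}_{\cos}$ is the cosine distance loss of Eq. (2) and $\mathcal{L}_{CE}$ is the cross-entropy loss of Eq. (3). The weights $\beta_n$ and $\beta_c$ control the importance of preserving the continuous and discrete latent variables, respectively. Algorithm 1 shows the training steps of ClusterGAN.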
3.4 Testing
After offline training, only the trained encoder model is required to produce the proposed latent embeddings for the input x-vectors of a test audio session. The concatenated latent embeddings ($\hat{z}_n$ and $\hat{z}_c$) are then clustered using k-means to produce the speaker label of each segment.
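The clustering step at test time can be illustrated with a plain k-means sketch over a handful of toy embeddings. The farthest-first seeding, the fixed iteration count and the two-dimensional toy data are assumptions made to keep the example deterministic and short; they do not describe the k-means implementation used in our experiments.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, iterations=20):
    """Lloyd's k-means with deterministic farthest-first seeding."""
    # Seeding (assumption): start from the first point, then repeatedly add
    # the point that is farthest from all centroids chosen so far.
    centroids = [list(points[0])]
    while len(centroids) < k:
        farthest = max(
            points,
            key=lambda p: min(squared_distance(p, c) for c in centroids),
        )
        centroids.append(list(farthest))
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        for i, p in enumerate(points):
            distances = [squared_distance(p, c) for c in centroids]
            labels[i] = distances.index(min(distances))
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Hypothetical latent embeddings of four segments from two speakers.
embeddings = [[0.0, 0.1], [0.1, 0.0], [5.0, 5.1], [5.1, 5.0]]
labels = kmeans(embeddings, k=2)
```

On this toy input the first two segments receive one label and the last two the other. In practice each row would be the concatenated encoder output for one segment, and `k` would be the known or estimated number of speakers.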
4 Experimental evaluation
4.1 Data preparation
We evaluate our proposed algorithm on the AMI meeting corpus and two child-clinician interaction corpora: ADOS [25] and BOSCC [26]. The AMI database consists of 171 meetings recorded at four different sites (Edinburgh, Idiap, TNO, Brno). For our evaluation, we use the official speech recognition partition of the AMI dataset (http://groups.inf.ed.ac.uk/ami/download/). We exclude the TNO meetings from the dev and eval sets, which is common practice in diarization studies [9, 27]. The details of the dataset partition are shown in Table 1.
The ADOS [28] is a diagnostic tool comprising over 10 play-based, conversational tasks. We chose two conversational tasks, Emotions and Social Difficulties and Annoyance, from 272 sessions for our evaluation. BOSCC [29] is a new treatment outcome measure, also composed of play-based, conversational segments. For this study, 24 BOSCC sessions are selected.


Table 1: Details of the AMI dataset partition.
Partition  #Meetings  #Speakers
Train      136        155
Dev        14         17
Eval       12         12

4.2 Experimental framework
4.2.1 Baseline systems
Since our proposed system uses x-vectors as input features, we use Kaldi-based AHC clustering with PLDA scoring on x-vectors [2] (denoted as x-vector in this paper) as our main baseline system. We also report results for x-vectors with k-means clustering (denoted as k-means in this paper) as our second baseline.
4.2.2 Model specifications
In all our systems, x-vectors are extracted using the Voxceleb models (https://kaldi-asr.org/models/m7) available in the Kaldi recipe. Diarization performance of the proposed system is evaluated for two models trained with different amounts of supervised data: P1 and P2. P1 is trained only on the AMI train set, whereas P2 is trained on both the AMI train set and 60 beamformed ICSI [30] sessions with a total of 46 speakers. The generator and discriminator networks in the proposed systems are simple feed-forward neural networks with one and two hidden layers, respectively, each with 512 nodes. The input layer of $G$ consists of $d_n + d_c$ nodes ($d_n$ and $d_c$ are the dimensions of $z_n$ and $z_c$, respectively), where $d_n$ is the same for both the P1 and P2 models and $d_c$ differs between P1 and P2. The output layer of $G$ has 512 nodes, which is the x-vector dimension. The input and output layers of $D$ contain 512 nodes and one node, respectively. The $E$ network, on the other hand, consists of a linear input layer with 512 nodes and a single hidden layer with 512 nodes. The output layer of $E$ is a linear layer with $d_n + d_c$ nodes, of which the first $d_n$ nodes are used directly as $\hat{z}_n$ and the rest are passed through a softmax layer to produce $\hat{z}_c$. For all three networks, the activation function in the hidden layers is ReLU. In the proposed system, we use the original speaker labels from the training data to produce $z_c$ for each segment. The networks are optimized using Adam [31] with a mini-batch size of 64 samples and a fixed learning rate. We fixed the three weighting hyperparameters of the training objective to 1, 2 and 10 by tuning on the AMI dev set. The number of iterations is fixed to 30k based on optimizing DER on the AMI dev set.
4.2.3 Performance metrics
The performance of speaker diarization systems is evaluated using the NIST diarization error rate (DER) [32], which is typically calculated with a 0.25 sec collar. Since the primary focus of this paper is on the effectiveness of the new speaker embeddings in clustering, we use oracle speech activity detection (SAD) for all the experiments in this paper, as in [27, 2, 9]. Therefore, all DER values reported in this work correspond to speaker confusion errors, with no missed or false alarm speech.
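To make the speaker confusion component concrete, the sketch below computes a simplified frame-level confusion rate under the best one-to-one mapping between hypothesized and reference speakers. It is only an illustration under strong assumptions: equal-length frames, no overlapping speech, no collar, and an exhaustive search over mappings that is practical only for a few speakers. It is not the NIST scoring tool, and the function name is hypothetical.

```python
from itertools import permutations


def speaker_confusion_rate(reference, hypothesis):
    """Frame-level speaker confusion under the best one-to-one label mapping."""
    ref_ids = sorted(set(reference))
    hyp_ids = sorted(set(hypothesis))
    # Pad the reference side with None so that surplus hypothesized
    # speakers can remain unmapped.
    size = max(len(ref_ids), len(hyp_ids))
    ref_ids += [None] * (size - len(ref_ids))
    best_correct = 0
    for perm in permutations(ref_ids, len(hyp_ids)):
        mapping = dict(zip(hyp_ids, perm))
        correct = sum(
            1 for r, h in zip(reference, hypothesis) if mapping[h] == r
        )
        best_correct = max(best_correct, correct)
    return 1.0 - best_correct / len(reference)


reference = ["A", "A", "A", "B", "B", "B"]   # oracle speaker per frame
hypothesis = [1, 1, 0, 0, 0, 0]              # cluster index per frame
rate = speaker_confusion_rate(reference, hypothesis)
```

In this toy case the best mapping assigns cluster 1 to speaker A and cluster 0 to speaker B, leaving one of the six frames confused.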
4.3 Results and discussions


Table 2: DER (%) on the AMI dev and eval sets with oracle SAD.
System            Known #speakers    Estimated #speakers
                  Dev      Eval      Dev      Eval
x-vector          11.65    11.34     11.08    10.37
k-means           11.94    11.45     12.64    12.26
P1                10.17    10.10     10.98    11.26
P2                9.67     11.64     10.33    11.56
x-vector + P1     7.45     7.82      8.73     9.11
x-vector + P2     6.98     8.85      7.93     8.92
Sun et al. [9]    –        –         12.22    12.99

4.3.1 Results on AMI dev and eval set
Results for diarization performance on the AMI dev and eval sets are reported in Table 2. We show results for oracle SAD with both a known and an estimated number of speakers. For the x-vector baseline, we threshold the PLDA scores to perform AHC clustering when the number of speakers is unknown. The number of speakers for the k-means and proposed systems is estimated using eigengap analysis of the affinity matrix constructed from the cosine distance of x-vector embeddings, followed by binarization and symmetrization [33]. From Table 2 column 2, we see that for a known number of speakers, the P1 system beats the x-vector (state-of-the-art) and k-means systems on both the AMI dev and eval sets. The performance improves further after fusion with x-vector embeddings ((x-vector + P1) and (x-vector + P2)). Both fused systems outperform all the other systems by a clear margin. The best DERs achieved by our fused systems on the AMI dev and eval sets are 6.98% and 7.82%, respectively. We attribute this to the proposed embeddings carrying information complementary to the x-vector embeddings.
We report the diarization performance of all the systems for an estimated number of speakers in Table 2 column 3. Surprisingly, the x-vector baseline system with thresholding on the PLDA scores for AHC clustering performs slightly better than in the oracle number of speakers condition. In contrast, the performance of the other methods generally degrades with an estimated number of speakers, the only exception being P2 on the eval set (11.64% vs. 11.56%). We also compare the proposed diarization system with the work of Sun et al. [9], evaluated on the same data set. The system proposed in [9] is a 2D self-attentive combination of d-vectors with a spectral clustering backend. As seen in Table 2 column 3, fusing our proposed embeddings with x-vectors and clustering with a k-means backend outperforms the other baseline methods.


Table 3: DER (%) on the ADOS and BOSCC corpora.
System           ADOS     BOSCC
x-vector         14.36    21.69
k-means          12.35    14.73
P1               11.27    14.63
P2               11.08    13.35
x-vector + P1    9.38     13.55
x-vector + P2    9.22     11.17

4.3.2 t-SNE visualization
We show t-SNE visualizations of the x-vector, proposed and fused embeddings of AMI session IS1008a in Fig. 3. The figure shows that the clusters formed by the proposed embeddings are slightly more compact than those formed by the x-vectors. The clusters formed by the fused embeddings, however, are the most compact within a class and the most separated between classes.
4.3.3 Generalization ability
From Table 3, we observe a significant performance improvement for the proposed system over the baselines on both ADOS and BOSCC sessions. In addition, the P2 model, which is trained on more data, achieves better performance than P1 in both the individual and fused scenarios. The improvement is particularly notable compared to the x-vector baseline; we hypothesize that the PLDA model pretrained on Voxceleb presents a significant domain mismatch in this case. Moreover, both the P1 and P2 systems, whether used individually or in fusion with x-vectors, are superior to k-means. The best system (x-vector + P2) achieves relative improvements of 36% and 49% over x-vector on the two databases, respectively.
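The relative improvements quoted above follow directly from the x-vector and (x-vector + P2) rows of Table 3, as the short check below shows. The helper name `relative_improvement` is introduced only for this example.

```python
def relative_improvement(baseline_der, system_der):
    """Relative DER reduction, in percent, of a system over a baseline."""
    return 100.0 * (baseline_der - system_der) / baseline_der


# DER values of x-vector and (x-vector + P2) from the two columns of Table 3.
first = relative_improvement(14.36, 9.22)
second = relative_improvement(21.69, 11.17)
print(round(first), round(second))  # prints: 36 49
```

The unrounded values are approximately 35.8% and 48.5%.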
5 Conclusions
We presented a new deep latent space clustering approach using ClusterGAN to perform speaker diarization. The entire system was trained in a supervised manner along with a clustering-specific loss function. We observed that ClusterGAN-based latent embeddings provide better performance than x-vector embeddings. Further improvement was achieved by fusing the proposed and x-vector embeddings. Experimental results showed a significant DER reduction for the proposed system over the state-of-the-art x-vector diarization system on the AMI, ADOS and BOSCC corpora. Future work could use spectrograms directly, instead of pretrained embeddings, as the GAN discriminator input.
References
 [1] Xavier Anguera, Simon Bozonnet, Nicholas Evans, Corinne Fredouille, Gerald Friedland, and Oriol Vinyals, “Speaker diarization: A review of recent research,” IEEE Trans. Audio, Speech, Lang. Process., vol. 20, no. 2, pp. 356–370, 2012.
 [2] Daniel Garcia-Romero, David Snyder, Gregory Sell, Daniel Povey, and Alan McCree, “Speaker diarization using deep neural network embeddings,” in Proc. ICASSP, 2017, pp. 4930–4934.
 [3] Gregory Sell et al., “Diarization is hard: Some experiences and lessons learned for the JHU team in the inaugural DIHARD challenge,” in Proc. Interspeech, 2018, pp. 2808–2812.
 [4] Stephen H Shum, Najim Dehak, Réda Dehak, and James R Glass, “Unsupervised methods for speaker diarization: An integrated and iterative approach,” IEEE Trans. Audio, Speech, Lang. Process., vol. 21, no. 10, pp. 2015–2028, 2013.
 [5] Mohammed Senoussaoui, Patrick Kenny, Themos Stafylakis, and Pierre Dumouchel, “A study of the cosine distance-based mean shift for telephone speech diarization,” IEEE/ACM Trans. Audio, Speech, and Lang. Process., vol. 22, no. 1, pp. 217–227, 2014.
 [6] Quan Wang, Carlton Downey, Li Wan, Philip Andrew Mansfield, and Ignacio Lopez Moreno, “Speaker diarization with LSTM,” in Proc. ICASSP, 2018, pp. 5239–5243.
 [7] Aonan Zhang, Quan Wang, Zhenyao Zhu, John Paisley, and Chong Wang, “Fully supervised speaker diarization,” in Proc. ICASSP, 2019, pp. 6301–6305.
 [8] Dimitrios Dimitriadis and Petr Fousek, “Developing online speaker diarization system,” in Proc. Interspeech, 2017, pp. 2739–2743.
 [9] Guangzhi Sun, Chao Zhang, and Philip C Woodland, “Speaker diarisation using 2D self-attentive combination of embeddings,” in Proc. ICASSP, 2019, pp. 5801–5805.
 [10] Philip Andrew Mansfield et al., “Links: A high-dimensional online clustering method,” arXiv preprint arXiv:1801.10123, 2018.
 [11] Ruiqing Yin, Hervé Bredin, and Claude Barras, “Neural speech turn segmentation and affinity propagation for speaker diarization,” in Proc. Interspeech, 2018, pp. 1393–1397.
 [12] Elie Aljalbout, Vladimir Golkov, Yawar Siddiqui, Maximilian Strobel, and Daniel Cremers, “Clustering with deep learning: Taxonomy and new methods,” arXiv preprint arXiv:1801.07648, 2018.
 [13] Dimitrios Dimitriadis, “Enhancements for audio-only diarization systems,” arXiv preprint arXiv:1909.00082, 2019.
 [14] Sudipto Mukherjee, Himanshu Asnani, Eugene Lin, and Sreeram Kannan, “ClusterGAN: Latent space clustering in generative adversarial networks,” in Proc. AAAI, 2019, vol. 33, pp. 4610–4617.
 [15] Junyuan Xie, Ross Girshick, and Ali Farhadi, “Unsupervised deep embedding for clustering analysis,” in Proc. ICML, 2016, pp. 478–487.
 [16] Zhuxi Jiang, Yin Zheng, Huachun Tan, Bangsheng Tang, and Hanning Zhou, “Variational deep embedding: An unsupervised and generative approach to clustering,” arXiv preprint arXiv:1611.05148, 2016.
 [17] Xi Chen, Yan Duan, Rein Houthooft, John Schulman, Ilya Sutskever, and Pieter Abbeel, “InfoGAN: Interpretable representation learning by information maximizing generative adversarial nets,” in Proc. NIPS, 2016, pp. 2172–2180.
 [18] Bo Yang, Xiao Fu, Nicholas D Sidiropoulos, and Mingyi Hong, “Towards k-means-friendly spaces: Simultaneous deep learning and clustering,” in Proc. ICML, 2017, pp. 3861–3870.
 [19] Uri Shaham, Kelly Stanton, Henry Li, Boaz Nadler, Ronen Basri, and Yuval Kluger, “Spectralnet: Spectral clustering using deep neural networks,” arXiv preprint arXiv:1801.01587, 2018.
 [20] Jianwei Yang, Devi Parikh, and Dhruv Batra, “Joint unsupervised learning of deep representations and image clusters,” in Proc. CVPR, 2016, pp. 5147–5156.
 [21] Ian Goodfellow et al., “Generative adversarial nets,” in Proc. NIPS, 2014, pp. 2672–2680.
 [22] Ishaan Gulrajani et al., “Improved training of Wasserstein GANs,” in Proc. NIPS, 2017, pp. 5767–5777.
 [23] Zachary C Lipton and Subarna Tripathi, “Precise recovery of latent vectors from generative adversarial networks,” arXiv preprint arXiv:1702.04782, 2017.
 [24] Antonia Creswell and Anil Anthony Bharath, “Inverting the generator of a generative adversarial network,” IEEE Trans. on neural networks and learning systems, 2018.
 [25] Daniel Bone et al., “Spontaneous-speech acoustic-prosodic features of children with autism and the interacting psychologist,” in Proc. Interspeech, 2012.
 [26] Manoj Kumar et al., “A knowledge driven structural segmentation approach for play-talk classification during autism assessment,” in Proc. Interspeech, 2018, pp. 2763–2767.
 [27] Sree Harsha Yella and Andreas Stolcke, “A comparison of neural network feature transforms for speaker diarization,” in Proc. Interspeech, 2015.
 [28] Catherine Lord et al., “The autism diagnostic observation schedule—generic: A standard measure of social and communication deficits associated with the spectrum of autism,” Journal of autism and developmental disorders, vol. 30, no. 3, pp. 205–223, 2000.
 [29] Rebecca Grzadzinski et al., “Measuring changes in social communication behaviors: preliminary development of the brief observation of social communication change (BOSCC),” Journal of autism and developmental disorders, vol. 46, no. 7, pp. 2464–2479, 2016.
 [30] Adam Janin et al., “The ICSI meeting corpus,” in Proc. ICASSP, 2003, vol. 1, pp. I–I.
 [31] Diederik P Kingma and Jimmy Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
 [32] Jonathan G Fiscus et al., “The Rich Transcription 2006 spring meeting recognition evaluation,” in International Workshop on Machine Learning for Multimodal Interaction. Springer, 2006, pp. 309–322.
 [33] Tae Jin Park et al., “The Second DIHARD challenge: System Description for USC-SAIL Team,” in Proc. Interspeech, 2019, pp. 998–1002.