Speaker diarization, the task of determining “who spoke when” in a multi-speaker audio stream, has a wide range of applications, from information retrieval and meeting annotation to the analysis of face-to-face and telephone conversations. Recent speaker diarization systems [2, 3] segment the input audio stream into uniform speaker-homogeneous segments, extract fixed-length speaker embeddings from those segments, and then cluster these embeddings by speaker.
Embeddings known as i-vectors, produced using generative modeling, were the first employed for speaker diarization. Recently, embeddings extracted from discriminatively trained deep neural networks (DNNs), such as d-vectors [6, 7] and x-vectors [2, 3], have shown superior performance over i-vectors. These embeddings are partitioned into speaker clusters using clustering algorithms such as Gaussian mixture models, mean-shift, agglomerative hierarchical clustering (AHC), k-means [6, 9] and Links. All the aforementioned approaches are unsupervised in determining the number of speakers and the speaker labels of a given audio session. Recently, a few supervised clustering approaches, such as UIS-RNN and affinity propagation, have also been proposed for diarization.
While the performance of tasks such as speech and speaker recognition has improved significantly due to supervised deep learning approaches, most existing diarization systems have yet to take full advantage of similar techniques. DNN-based deep clustering approaches are popular in computer vision. While appealing, they have not been readily applied to speaker diarization, probably due to their lack of interpretability and the problem that the number of speakers in a given audio session is unknown. Recently, deep embedded clustering on d-vectors was introduced for speaker diarization. Building on these advances, clustering with dimension reduction through a non-linear neural transformation of the embeddings, trained with a clustering-specific loss, could be beneficial for audio diarization systems.
A latent space image clustering method using a generative adversarial network (GAN) along with an encoder network (ClusterGAN) was proposed recently by Mukherjee et al. Here, the encoder network performs the inverse mapping, i.e., it back-projects the data into the latent space. Two main advantages of GAN-based latent space clustering are interpretability and interpolation in the latent space. In our work, we adopt and modify this network for speaker clustering within the speaker diarization framework. The two main differences between our proposed work and the original ClusterGAN are: (a) instead of random one-hot encoded variables, we use the original speaker labels of the training data, so the GAN generator input is a mixture of continuous random variables and discrete one-hot encoded speaker labels; (b) instead of images (spectrograms), x-vector embeddings of short audio segments are used as the real data input to the GAN discriminator. The GAN and encoder networks are jointly trained along with a clustering-specific loss.
In recent years, the primary focus of research in image clustering has been to use DNNs to non-linearly transform the input feature space into a latent space in which the data are easier to separate. Current deep clustering methods for image data include autoencoder-based approaches and generative-model-based approaches such as variational deep embedding and the information maximizing GAN (InfoGAN), among others. All these algorithms comprise three essential components: a deep neural network architecture, a network loss, and a clustering-specific loss. The network loss refers to the reconstruction loss of an autoencoder, the variational loss of a variational autoencoder, or the adversarial loss of a GAN. It is used to learn feasible latent features and to avoid trivial solutions. The clustering-specific loss can be a cluster assignment loss, such as the k-means loss, cluster assignment hardening loss, spectral clustering loss or agglomerative clustering loss, or a cluster regularization loss, such as the locality preserving loss, group sparsity loss or cluster classification loss. These losses are used to learn cluster-friendly representations from the data. In this work, we exploit both the network loss and the clustering loss in the clustering module for speaker diarization.
3 Proposed speaker diarization system
The overall methodology of the proposed speaker diarization system is shown in Fig. 1. The proposed system begins with the extraction of the popular time-delay neural network (TDNN) speaker embeddings, i.e., x-vectors, followed by latent space clustering. We discuss each module in the diarization pipeline below.
Our approach starts with a uniform temporal segmentation into segments of 1.5 sec with 1 sec overlap. Each speech segment is embedded into a fixed-dimensional x-vector of dimension 512. These TDNN-based speaker embeddings have achieved state-of-the-art performance in speaker verification and diarization. The x-vectors are then fed as inputs to the ClusterGAN network.
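The uniform segmentation described above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function name `sliding_segments` is hypothetical, and truncating the final segment at the end of the recording is an assumption not stated in the text.

```python
def sliding_segments(duration, window=1.5, overlap=1.0):
    """Return (start, end) times in seconds of uniform, overlapping segments."""
    if overlap >= window:
        raise ValueError("overlap must be smaller than the window length")
    shift = window - overlap  # 0.5 s hop for the 1.5 s / 1 s setting
    segments = []
    index = 0
    while True:
        start = index * shift  # multiply instead of accumulating to limit drift
        if start >= duration:
            break
        # Assumption: the last segment is truncated at the end of the audio.
        end = min(start + window, duration)
        segments.append((start, end))
        if end >= duration:
            break
        index += 1
    return segments


segments = sliding_segments(3.0)
```

With a 1.5 sec window and 1 sec overlap, consecutive segments start 0.5 sec apart, so a 3 sec recording yields four segments.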
3.3 ClusterGAN training
The motivation behind using ClusterGAN on x-vectors is to non-linearly transform them into a lower-dimensional embedding space that is more separable. Although the idea of using a mixture of continuous and discrete latent variables as the input to the GAN generator was inspired by InfoGAN, ClusterGAN is better suited for clustering than InfoGAN. ClusterGAN comprises three components: the generator ($G$), the discriminator ($D$) and the encoder ($E$), as shown in Fig. 2.
3.3.1 Adversarial training
GANs are a recent class of deep generative models inspired by a game-theoretic metaphor, in which the generator $G$ and the discriminator $D$ engage in a two-player minimax game. The generator is considered to be a mapping from the latent space to the data space, $G: \mathcal{Z} \rightarrow \mathcal{X}$. It takes noise $z$ sampled from the latent prior and generates samples intended to fool the discriminator. The discriminator is considered to be a mapping from the data space to a real value, $D: \mathcal{X} \rightarrow \mathbb{R}$. It takes real data $x$ sampled from the data distribution and tries to discriminate between the real and the generated (fake) samples. We employ the improved Wasserstein GAN (IWGAN) for our GAN network. The objective function of this adversarial game is the IWGAN objective, in which a gradient penalty term (GP) is added to the Wasserstein loss and weighted by the gradient penalty coefficient $\lambda$.
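For reference, the critic and generator losses of the improved Wasserstein GAN, as commonly written following Gulrajani et al., are sketched below. The symbols $P_x$ (distribution of real x-vectors), $P_z$ (latent prior) and $\hat{x}$ (a random interpolate between a real and a generated sample) are notation assumed here, and the form shown need not match the authors' exact formulation. Both losses are minimized, $L_D$ with respect to the discriminator and $L_G$ with respect to the generator.

```latex
\begin{align}
L_D &= \mathbb{E}_{z \sim P_z}\big[D(G(z))\big]
     - \mathbb{E}_{x \sim P_x}\big[D(x)\big]
     + \lambda\,\mathrm{GP}, \\
\mathrm{GP} &= \mathbb{E}_{\hat{x}}\Big[\big(\lVert \nabla_{\hat{x}} D(\hat{x}) \rVert_2 - 1\big)^2\Big], \\
L_G &= -\,\mathbb{E}_{z \sim P_z}\big[D(G(z))\big].
\end{align}
```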
3.3.2 Sampling from discrete-continuous mixtures
In order to perform clustering in the latent space, we have to back-project the data into the latent space. The latent space distribution in traditional GANs is typically chosen to be Gaussian or uniform. Although such distributions carry useful information about the input data distribution, they usually lead to poor clusters. To mitigate this problem, it is essential to augment the latent space with categorical variables that create a non-smooth geometry. However, continuity in the latent space is also required for good interpolation, which is one of the strengths of GANs. Therefore, we feed the generator a mixture of continuous ($z_n$) and discrete ($z_c$) variables by concatenating $z_n$ with $z_c$. In this work, $z_n$ is randomly sampled from a normal distribution whose parameters are kept the same in all our experiments. We use the original speaker labels of the training speech segments as the one-hot encoded variable $z_c$. The concatenation of $z_n$ with $z_c$ enables clustering in the latent space.
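A minimal sketch of building one generator input from a continuous part and a one-hot speaker label is given below. The function name `sample_latent` and the values of `dim_n` and `sigma` are illustrative assumptions; the dimensions and noise scale actually used are not recoverable from the text.

```python
import random


def sample_latent(speaker_index, num_speakers, dim_n, sigma, rng=random):
    """Concatenate Gaussian noise z_n with a one-hot speaker code z_c."""
    if not 0 <= speaker_index < num_speakers:
        raise ValueError("speaker_index out of range")
    # Continuous part: i.i.d. zero-mean Gaussian noise.
    z_n = [rng.gauss(0.0, sigma) for _ in range(dim_n)]
    # Discrete part: one-hot encoding of the training speaker label.
    z_c = [0.0] * num_speakers
    z_c[speaker_index] = 1.0
    return z_n + z_c


rng = random.Random(0)  # seeded only to make the example repeatable
z = sample_latent(speaker_index=2, num_speakers=4, dim_n=3, sigma=0.1, rng=rng)
```

The resulting vector has `dim_n + num_speakers` entries, with the one-hot block marking which speaker the generated sample should imitate.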
3.3.3 Inverse mapping network
Mapping from the data space to the latent space is a non-trivial problem, since it requires inverting the generator, which is a non-linear model. Existing works [23, 24] tackle this problem by solving a regularized optimization problem over the latent variable to recover the latent vectors, where the regularizer is a norm of the latent vector weighted by a regularization constant. However, these approaches are not suitable for clustering since the optimization problem is non-convex [24, 14]. To address this issue, an encoder network $E$ is introduced alongside the GAN for back-projection. We fix $z_c$ and randomly sample $z_n$ from a normal distribution with multiple restarts at each iteration step. Furthermore, to ensure precise recovery of the latent vector $z_n$, we compute the numerical difference between the latent vector output by the encoder and $z_n$. We found empirically that cosine distance is more suitable for this purpose than mean square error. The objective for this task is therefore the cosine distance between the sampled $z_n$ and its estimate from the encoder, averaged over the mini-batch.
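The latent-recovery loss described above can be illustrated with a small sketch. It assumes the loss is the mean cosine distance (one minus cosine similarity) over the mini-batch; the function names and the `eps` guard against zero-length vectors are hypothetical additions.

```python
import math


def cosine_distance(u, v, eps=1e-12):
    """Return 1 - cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / max(norm_u * norm_v, eps)


def latent_recovery_loss(z_n_batch, z_n_hat_batch):
    """Mean cosine distance between sampled z_n and encoder estimates."""
    if not z_n_batch or len(z_n_batch) != len(z_n_hat_batch):
        raise ValueError("batches must be non-empty and of equal size")
    total = sum(cosine_distance(z, z_hat)
                for z, z_hat in zip(z_n_batch, z_n_hat_batch))
    return total / len(z_n_batch)


loss = latent_recovery_loss([[1.0, 0.0], [1.0, 0.0]],
                            [[1.0, 0.0], [0.0, 1.0]])
```

Unlike mean square error, this measure depends only on the direction of the recovered vector, not on its magnitude.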
3.3.4 Clustering-specific loss
To learn cluster-friendly representations, we incorporate a cluster classification loss, in the form of a cross-entropy (CE) loss, during training. The soft-max layer output of the encoder network $E$ is used for computing this loss, which encourages the latent embeddings to cluster and hence increases their discriminative information. The cross-entropy that we minimize is computed between the empirical probability that an embedding belongs to a given speaker and the probability predicted by the encoder that the embedding belongs to that speaker.
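As an illustration of the clustering-specific loss, the sketch below computes the cross-entropy between a one-hot speaker label and the soft-max of a vector of encoder outputs for a single embedding. The function names and the `eps` clamp are assumptions made for this example.

```python
import math


def softmax(logits):
    """Numerically stable soft-max over a list of real values."""
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(one_hot, logits, eps=1e-12):
    """Cross-entropy between a one-hot target and soft-max(logits)."""
    probs = softmax(logits)
    return -sum(t * math.log(max(p, eps)) for t, p in zip(one_hot, probs))


target = [0.0, 1.0, 0.0, 0.0]
uniform_loss = cross_entropy(target, [0.0, 0.0, 0.0, 0.0])
confident_loss = cross_entropy(target, [0.0, 5.0, 0.0, 0.0])
```

A uniform prediction over four speakers gives a loss of log 4, and the loss shrinks as the encoder assigns more probability to the correct speaker.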
3.3.5 Joint training
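We train the GAN and encoder networks jointly. The training objective combines the adversarial loss with the latent-recovery loss on the continuous variables and the cross-entropy loss on the discrete variables. The weights on these two additional terms control the importance of preserving the continuous and discrete latent variables, respectively. Algorithm 1 shows the training steps of ClusterGAN.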
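One way to write a generator and encoder loss consistent with this description is sketched below. The names $\beta_n$ and $\beta_c$ for the two weights, $L_{\cos}$ for the cosine-distance latent-recovery loss, $L_{\mathrm{CE}}$ for the cross-entropy loss and $P_z$ for the latent prior are assumptions made here; the authors' exact formulation may differ. The discriminator is assumed to keep its IWGAN loss with gradient penalty.

```latex
\begin{equation}
L_{G,E} = -\,\mathbb{E}_{z \sim P_z}\big[D(G(z))\big]
          + \beta_n\, L_{\cos}
          + \beta_c\, L_{\mathrm{CE}}
\end{equation}
```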
After offline training, only the trained encoder model is required to produce the proposed latent embeddings for the input x-vectors of a test audio session. The concatenated latent embeddings ($z_n$ and $z_c$) are then clustered using k-means to produce the speaker label of each segment.
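The final clustering step can be sketched with a small, dependency-free k-means. This is an illustration only: the deterministic farthest-point initialisation is an assumption chosen to make the example repeatable, and the toy embeddings are made up.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=50):
    """Lloyd's algorithm with a deterministic farthest-point initialisation."""
    centroids = [list(points[0])]
    while len(centroids) < k:
        # Next centroid: the point farthest from its nearest centroid so far.
        farthest = max(
            points,
            key=lambda p: min(squared_distance(p, c) for c in centroids),
        )
        centroids.append(list(farthest))
    labels = None
    for _ in range(iterations):
        new_labels = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Toy latent embeddings: two continuous values followed by two soft-max values.
embeddings = [
    [0.1, 0.0, 1.0, 0.0],
    [0.0, 0.1, 1.0, 0.0],
    [0.1, 0.1, 0.0, 1.0],
    [0.0, 0.0, 0.0, 1.0],
]
labels = kmeans(embeddings, k=2)
```

Each returned label is the cluster index of the corresponding segment, which serves as its speaker label within the session.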
4 Experimental evaluation
4.1 Data preparation
We evaluate our proposed algorithm on the AMI meeting corpus and two child-clinician interaction corpora: ADOS and BOSCC. The AMI database consists of 171 meetings recorded at four different sites (Edinburgh, Idiap, TNO, Brno). For our evaluation, we use the official speech recognition partition of the AMI dataset (http://groups.inf.ed.ac.uk/ami/download/). We exclude the TNO meetings from the dev and eval sets, which is common practice in diarization studies [9, 27]. The details of the dataset partition are shown in Table 1.
The ADOS is a diagnostic tool that comprises over 10 play-based conversational tasks. We chose two conversational tasks, "Emotions" and "Social Difficulties and Annoyance", from 272 sessions for our evaluation. BOSCC is a new treatment outcome measure, also composed of play-based conversational segments. For this study, 24 BOSCC sessions are selected.
4.2 Experimental framework
4.2.1 Baseline systems
Since our proposed system uses x-vectors as input features, we use the Kaldi-based AHC clustering with PLDA scoring on x-vectors (denoted as x-vector in this paper) as our main baseline system. We also report results for x-vectors with k-means clustering (denoted as k-means in this paper) as our second baseline.
4.2.2 Model specifications
In all our systems, x-vectors are extracted using the VoxCeleb models (https://kaldi-asr.org/models/m7) available in the Kaldi recipe. Diarization performance of the proposed system is evaluated for two models trained with different amounts of supervised data: P1 and P2. P1 is trained only on the AMI train set, whereas P2 is trained on both the AMI train set and 60 beamformed ICSI sessions with a total of 46 speakers. The generator and discriminator networks in the proposed systems are simple feed-forward neural networks with one and two hidden layers, respectively, each with 512 nodes. The input layer of $G$ consists of $d_n + d_c$ nodes ($d_n$ and $d_c$ are the dimensions of $z_n$ and $z_c$, respectively), where $d_n$ is the same for both the P1 and P2 models and $d_c$ differs between P1 and P2. The output layer of $G$ has 512 nodes, which is the x-vector dimension. The input and output layers of $D$ contain 512 nodes and one node, respectively. The $E$ network, on the other hand, consists of a linear input layer with 512 nodes and a single hidden layer with 512 nodes. The output layer of $E$ is a linear layer with $d_n + d_c$ nodes, of which the first $d_n$ nodes are used directly as $z_n$ and the rest are passed through a soft-max layer to produce $z_c$ for each segment. The networks are optimized using Adam with a mini-batch size of 64 samples. We fixed the three weighting hyperparameters of the training objective to 1, 2 and 10 by tuning on the AMI dev set. The number of iterations is fixed to 30k based on optimizing DER on the AMI dev set.
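The way the encoder output layer is split into a continuous part and a soft-max part can be sketched as follows. The function name `split_encoder_output` and the toy dimensions are hypothetical; the sketch only illustrates the slicing and normalisation described above.

```python
import math


def split_encoder_output(output, dim_n):
    """Split a linear encoder output into z_n and a soft-max-normalised z_c."""
    if not 0 < dim_n < len(output):
        raise ValueError("dim_n must leave at least one node for z_c")
    z_n = list(output[:dim_n])   # first dim_n nodes are used directly
    logits = output[dim_n:]      # remaining nodes go through a soft-max
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    z_c = [e / total for e in exps]
    return z_n, z_c


z_n, z_c = split_encoder_output([0.3, -0.2, 2.0, 0.0, 0.0], dim_n=2)
```

Here the first two values pass through unchanged, while the remaining three are normalised into a probability vector over speakers.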
4.2.3 Performance metrics
The performance of speaker diarization systems is evaluated using the NIST diarization error rate (DER), which is typically calculated with a 0.25 sec collar. Since the primary focus of this paper is the effectiveness of the new speaker embeddings for clustering, we use oracle speech activity detection (SAD) for all the experiments in this paper, as in [27, 2, 9]. Therefore, all DER values reported in this work correspond to speaker confusion errors, with no missed or false-alarm speech.
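With oracle SAD, DER reduces to the fraction of speech time attributed to the wrong speaker under the best one-to-one mapping between hypothesis and reference labels. The sketch below is a simplified frame-level illustration, not the NIST scoring tool: it ignores the collar and overlapping speech, assumes every frame is speech, and searches all mappings exhaustively, which is only practical for a handful of speakers.

```python
from itertools import permutations


def speaker_confusion_rate(reference, hypothesis):
    """Frame-level speaker confusion under the best one-to-one label mapping."""
    if len(reference) != len(hypothesis) or not reference:
        raise ValueError("need two non-empty label sequences of equal length")
    ref_labels = sorted(set(reference))
    hyp_labels = sorted(set(hypothesis))
    # Pad the reference labels with None so that surplus hypothesis
    # clusters can remain unmatched.
    size = max(len(ref_labels), len(hyp_labels))
    padded_ref = ref_labels + [None] * (size - len(ref_labels))
    best_correct = 0
    for perm in permutations(padded_ref, len(hyp_labels)):
        mapping = dict(zip(hyp_labels, perm))
        correct = sum(
            1 for r, h in zip(reference, hypothesis) if mapping[h] == r
        )
        best_correct = max(best_correct, correct)
    return 1.0 - best_correct / len(reference)


reference = list("AAAABBBB")
hypothesis = [1, 1, 1, 0, 0, 0, 0, 0]
rate = speaker_confusion_rate(reference, hypothesis)
```

In this toy case one of eight frames is assigned to the wrong speaker, giving a confusion rate of 12.5%.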
4.3 Results and discussions
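Table 2: DER (%) on the AMI dev and eval sets with oracle SAD.

| System | Known #speakers: Dev | Known #speakers: Eval | Estimated #speakers: Dev | Estimated #speakers: Eval |
|---|---|---|---|---|
| x-vector + P1 | 7.45 | 7.82 | 8.73 | 9.11 |
| x-vector + P2 | 6.98 | 8.85 | 7.93 | 8.92 |
| Sun et al. | – | – | 12.22 | 12.99 |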
4.3.1 Results on AMI dev and eval set
Results for diarization performance on the AMI dev and eval sets are reported in Table 2. We show results for oracle SAD with both a known and an estimated number of speakers. For the x-vector baseline, we use thresholding on the PLDA scores to perform AHC clustering when the number of speakers is unknown. The number of speakers for the k-means and proposed systems is estimated using eigengap analysis of the affinity matrix constructed from the cosine distance of the x-vector embeddings, followed by binarization and symmetrization. From Table 2, column 2, we see that for a known number of speakers, the P1 system outperforms the x-vector (state-of-the-art) and k-means systems on both the AMI dev and eval sets. The performance improves further after incorporating fusion with x-vector embeddings ((x-vector + P1) and (x-vector + P2)). Both fused systems significantly outperform all the other systems. The best DERs achieved using our fused systems on the AMI dev and eval sets are 6.98% and 7.82%, respectively. We attribute this to the fact that our proposed embeddings carry information complementary to that of the x-vector embeddings.
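The eigengap step used to estimate the number of speakers can be sketched as below. This assumes the eigenvalues of the (binarized, symmetrized) affinity matrix have already been computed, and it uses one common convention: sort them in descending order and pick the position of the largest drop. The function name, the `max_speakers` cap and the example eigenvalues are hypothetical.

```python
def estimate_num_speakers(eigenvalues, max_speakers=10):
    """Return the position of the largest gap between sorted eigenvalues."""
    values = sorted(eigenvalues, reverse=True)[:max_speakers + 1]
    if len(values) < 2:
        return 1
    gaps = [values[i] - values[i + 1] for i in range(len(values) - 1)]
    # A large drop after the k-th eigenvalue suggests k dominant clusters.
    return gaps.index(max(gaps)) + 1


num_speakers = estimate_num_speakers([5.1, 4.8, 4.5, 0.4, 0.3, 0.1])
```

In this made-up example the largest drop occurs after the third eigenvalue, so three speakers are estimated.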
We report the diarization performance of all the systems for an estimated number of speakers in Table 2, column 3. Surprisingly, the x-vector baseline system with thresholding on the PLDA scores for AHC clustering performs slightly better than in the oracle number-of-speakers condition. In contrast, the performance of all the other methods degrades when the number of speakers is estimated. We also compare the proposed diarization system with the system of Sun et al., evaluated on the same data set. The system proposed by Sun et al. is a 2D self-attentive combination of d-vectors with a spectral clustering back-end. As seen in Table 2, column 3, the fusion of our proposed embeddings with x-vectors, combined with a k-means clustering back-end, outperforms the other baseline methods.
Table 3: DER (%) on the ADOS and BOSCC corpora.

| System | ADOS | BOSCC |
|---|---|---|
| x-vector + P1 | 9.38 | 13.55 |
| x-vector + P2 | 9.22 | 11.17 |
4.3.2 t-SNE visualization
We show t-SNE visualizations of the x-vector, proposed and fused embeddings of AMI session IS1008a in Fig. 3. The figure shows that the clusters formed by the proposed embeddings are slightly more compact than those formed by the x-vectors. The clusters formed by the fused embeddings, however, are the most compact within a class and the most separated between classes.
4.3.3 Generalization ability
From Table 3, we observe a significant performance improvement for the proposed system over the baselines on both ADOS and BOSCC sessions. In addition, the P2 model, which is trained on more data, achieves better performance than P1 in both the individual and fused scenarios. The improvement over the x-vector baseline is particularly notable. We hypothesize that the PLDA model pre-trained on VoxCeleb presents a significant domain mismatch in this case. Moreover, both the P1 and P2 systems, whether used individually or in fusion with x-vectors, are superior to k-means. The best system (x-vector + P2) achieves relative improvements of 36% and 49% over x-vector on those two databases.
We presented a new deep latent space clustering approach using ClusterGAN to perform speaker diarization. The entire system was trained in a supervised manner along with a clustering-specific loss function. We observed that ClusterGAN-based latent embeddings provide better performance than x-vector embeddings. Further improvement was achieved by fusing the proposed and x-vector embeddings. Experimental results showed a significant DER reduction for the proposed system over the state-of-the-art x-vector diarization system on the AMI, ADOS and BOSCC corpora. Future work could use spectrograms directly, instead of pre-trained embeddings, as the GAN discriminator input.
- [1] Xavier Anguera, Simon Bozonnet, Nicholas Evans, Corinne Fredouille, Gerald Friedland, and Oriol Vinyals, “Speaker diarization: A review of recent research,” IEEE Trans. Audio, Speech, Lang. Process., vol. 20, no. 2, pp. 356–370, 2012.
- [2] Daniel Garcia-Romero, David Snyder, Gregory Sell, Daniel Povey, and Alan McCree, “Speaker diarization using deep neural network embeddings,” in Proc. ICASSP, 2017, pp. 4930–4934.
- [3] Gregory Sell et al., “Diarization is hard: Some experiences and lessons learned for the JHU team in the inaugural DIHARD challenge,” in Proc. Interspeech, 2018, pp. 2808–2812.
- [4] Stephen H Shum, Najim Dehak, Réda Dehak, and James R Glass, “Unsupervised methods for speaker diarization: An integrated and iterative approach,” IEEE Trans. Audio, Speech, Lang. Process., vol. 21, no. 10, pp. 2015–2028, 2013.
- [5] Mohammed Senoussaoui, Patrick Kenny, Themos Stafylakis, and Pierre Dumouchel, “A study of the cosine distance-based mean shift for telephone speech diarization,” IEEE/ACM Trans. Audio, Speech, and Lang. Process., vol. 22, no. 1, pp. 217–227, 2014.
- [6] Quan Wang, Carlton Downey, Li Wan, Philip Andrew Mansfield, and Ignacio Lopez Moreno, “Speaker diarization with LSTM,” in Proc. ICASSP, 2018, pp. 5239–5243.
- [7] Aonan Zhang, Quan Wang, Zhenyao Zhu, John Paisley, and Chong Wang, “Fully supervised speaker diarization,” in Proc. ICASSP, 2019, pp. 6301–6305.
- [8] Dimitrios Dimitriadis and Petr Fousek, “Developing on-line speaker diarization system,” in Proc. Interspeech, 2017, pp. 2739–2743.
- [9] Guangzhi Sun, Chao Zhang, and Philip C Woodland, “Speaker diarisation using 2D self-attentive combination of embeddings,” in Proc. ICASSP, 2019, pp. 5801–5805.
- [10] Philip Andrew Mansfield et al., “Links: A high-dimensional online clustering method,” arXiv preprint arXiv:1801.10123, 2018.
- [11] Ruiqing Yin, Hervé Bredin, and Claude Barras, “Neural speech turn segmentation and affinity propagation for speaker diarization,” in Proc. Interspeech, 2018, pp. 1393–1397.
- [12] Elie Aljalbout, Vladimir Golkov, Yawar Siddiqui, Maximilian Strobel, and Daniel Cremers, “Clustering with deep learning: Taxonomy and new methods,” arXiv preprint arXiv:1801.07648, 2018.
- [13] Dimitrios Dimitriadis, “Enhancements for audio-only diarization systems,” arXiv preprint arXiv:1909.00082, 2019.
- [14] Sudipto Mukherjee, Himanshu Asnani, Eugene Lin, and Sreeram Kannan, “ClusterGAN: Latent space clustering in generative adversarial networks,” in Proc. AAAI, 2019, vol. 33, pp. 4610–4617.
- [15] Junyuan Xie, Ross Girshick, and Ali Farhadi, “Unsupervised deep embedding for clustering analysis,” in Proc. ICML, 2016, pp. 478–487.
- [16] Zhuxi Jiang, Yin Zheng, Huachun Tan, Bangsheng Tang, and Hanning Zhou, “Variational deep embedding: An unsupervised and generative approach to clustering,” arXiv preprint arXiv:1611.05148, 2016.
- [17] Xi Chen, Yan Duan, Rein Houthooft, John Schulman, Ilya Sutskever, and Pieter Abbeel, “InfoGAN: Interpretable representation learning by information maximizing generative adversarial nets,” in Proc. NIPS, 2016, pp. 2172–2180.
- [18] Bo Yang, Xiao Fu, Nicholas D Sidiropoulos, and Mingyi Hong, “Towards k-means-friendly spaces: Simultaneous deep learning and clustering,” in Proc. ICML, 2017, pp. 3861–3870.
- [19] Uri Shaham, Kelly Stanton, Henry Li, Boaz Nadler, Ronen Basri, and Yuval Kluger, “SpectralNet: Spectral clustering using deep neural networks,” arXiv preprint arXiv:1801.01587, 2018.
- [20] Jianwei Yang, Devi Parikh, and Dhruv Batra, “Joint unsupervised learning of deep representations and image clusters,” in Proc. CVPR, 2016, pp. 5147–5156.
- [21] Ian Goodfellow et al., “Generative adversarial nets,” in Proc. NIPS, 2014, pp. 2672–2680.
- [22] Ishaan Gulrajani et al., “Improved training of Wasserstein GANs,” in Proc. NIPS, 2017, pp. 5767–5777.
- [23] Zachary C Lipton and Subarna Tripathi, “Precise recovery of latent vectors from generative adversarial networks,” arXiv preprint arXiv:1702.04782, 2017.
- [24] Antonia Creswell and Anil Anthony Bharath, “Inverting the generator of a generative adversarial network,” IEEE Trans. on Neural Networks and Learning Systems, 2018.
- [25] Daniel Bone et al., “Spontaneous-speech acoustic-prosodic features of children with autism and the interacting psychologist,” in Proc. Interspeech, 2012.
- [26] Manoj Kumar et al., “A knowledge driven structural segmentation approach for play-talk classification during autism assessment,” in Proc. Interspeech, 2018, pp. 2763–2767.
- [27] Sree Harsha Yella and Andreas Stolcke, “A comparison of neural network feature transforms for speaker diarization,” in Proc. Interspeech, 2015.
- [28] Catherine Lord et al., “The autism diagnostic observation schedule-generic: A standard measure of social and communication deficits associated with the spectrum of autism,” Journal of Autism and Developmental Disorders, vol. 30, no. 3, pp. 205–223, 2000.
- [29] Rebecca Grzadzinski et al., “Measuring changes in social communication behaviors: Preliminary development of the brief observation of social communication change (BOSCC),” Journal of Autism and Developmental Disorders, vol. 46, no. 7, pp. 2464–2479, 2016.
- [30] Adam Janin et al., “The ICSI meeting corpus,” in Proc. ICASSP, 2003, vol. 1, pp. I–I.
- [31] Diederik P Kingma and Jimmy Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
- [32] Jonathan G Fiscus et al., “The Rich Transcription 2006 spring meeting recognition evaluation,” in International Workshop on Machine Learning for Multimodal Interaction. Springer, 2006, pp. 309–322.
- [33] Tae Jin Park et al., “The Second DIHARD challenge: System description for USC-SAIL team,” in Proc. Interspeech, 2019, pp. 998–1002.