I Introduction
Speaker clustering (SC) is the task of identifying the unique speakers in a set of audio recordings (each belonging to exactly one speaker) without knowing who and how many speakers are present altogether [beigi2011fundamentals]. Other tasks related to speaker recognition and SC are the following:

Speaker verification (SV): A binary decision task in which the goal is to decide if a recording belongs to a certain person or not.

Speaker identification (SI): A multiclass classification task in which the goal is to decide to which speaker, out of a set of candidate speakers, a certain recording belongs.
SC is also referred to as speaker diarization when a single (usually long) recording involves multiple speakers and thus needs to be automatically segmented prior to clustering. Since SC is a completely unsupervised problem (the number of speakers and of segments per speaker is unknown), it is considered more complex than both SV and SI. The complexity of SC is comparable to that of image segmentation in computer vision, in which the number of regions to be found is typically unknown.
The SC problem is of importance in the domain of audio analysis due to many possible applications, for example in lecture/meeting recording summarization [anguera2012speaker], as a preprocessing step in automatic speech recognition, or as part of an information retrieval system for audio archives [ajmera2003robust]. Furthermore, SC represents a building block for speaker diarization [shum2012use].
The SC problem has been widely studied [jin1997automatic, sadjadi20172016]. A typical pipeline is based on three main steps: i.a) acoustic feature extraction from audio samples, i.b) voice feature aggregation from the lower-level acoustic features by means of a speaker modeling stage, and ii) a clustering technique on top of this feature-based representation.
The voice features after phase i) have traditionally been created from Mel Frequency Cepstral Coefficient (MFCC) acoustic features modeled by a Gaussian Mixture Model (GMM) [campbell2006support], or from i-vectors [dehak2009support, lee2014clustering]. More recently, with the rise of deep learning, the community has been moving towards learned features instead of handcrafted ones, as surveyed by Richardson et al. [richardson2015deep]. Recent examples of deep-feature representations for the SI, SV, and SC problems come, for example, from Lukic et al. [MLSP2017]; convolutional neural networks (CNNs) had already been introduced to the speech processing field by LeCun et al. in the nineties [lecun1995convolutional]. McLaren et al. used a CNN for speaker recognition in order to improve robustness to noisy speech [mclaren2014application]. Chen et al. used a novel deep neural architecture to learn speaker-specific characteristics directly from MFCC features [chen2011learning]. Yella et al. exploited a 3-layer artificial neural network to extract features directly from a hidden layer, which are then used for speaker clustering [yella2014artificial].
However advanced phase i) has become in recent years, the clustering phase ii) still relies on traditional methodologies. For example, Khoury et al. demonstrated good results for speaker clustering using a hierarchical clustering algorithm [khoury2014hierarchical], while Kenny et al. report hierarchical clustering to be unsuitable for the speaker clustering stage in a speaker diarization system [kenny2010diarization]. In [shum2011exploiting], clustering was performed with K-means on dimensionality-reduced i-vectors, which was shown to work better than spectral clustering, as noted in [shum2012use].
In this paper, we therefore improve the results of the speaker clustering task by first using state-of-the-art learned features and then a different and more robust clustering algorithm, dominant sets (DS) [pavan2007dominant]. The motivation driving the choice of dominant sets is the following: a) there is no need for an a priori number of clusters; b) a notion of compactness allows clusters composed of noise to be detected automatically; c) for each cluster, the centrality of each element is quantified (centroids emerge naturally in this context); and d) extensive experiments and the underlying theory show a high robustness to noise [pavan2007dominant]. All the aforementioned properties fit the SC problem well.
The contribution of this paper is threefold: first, we apply the dominant set method for the first time in the SC domain, outperforming the previous state of the art; second, it is the first time that the full TIMIT dataset [timit:1986] is used for SC problems, making this paper a reference baseline in this context and on this dataset; third, we use for the first time a pretrained VGGVox network (https://github.com/anagrani/VGGVox) to extract features for the TIMIT dataset, obtaining good results and demonstrating the capability of this embedding.
The remainder of this paper is organized as follows: in Sec. II the proposed method is explained in detail (Sec. II-A covers the different feature extraction methods, and Sec. II-B introduces the theoretical foundations of DS). In Sec. III the experiments that have been carried out are explained, and in Sec. IV we discuss the results before drawing conclusions in Sec. V together with future perspectives.
II Speaker clustering with dominant sets
Our proposed approach, called SCDS, is based on a two-phase schema (see Fig. 1): in the first phase, features are extracted from each utterance; in the second, the dominant sets are extracted from this feature-based representation. In this section, the specific parts are explained.
II-A Feature extraction
We use two different feature extraction methods in this work that we call CNNT (derived from embeddings based on the TIMIT dataset), and CNNV (based on a model trained on VoxCeleb [Nagrani17]):
II-A1 CNNT features
Features are extracted from the CNN (https://github.com/stdm/ZHAW_deep_voice) described in detail by Lukic et al. [MLSP2016], specifically from the dense layer L7 therein. The network has been trained on 590 speakers of the TIMIT database, whose utterances were fed to the net as spectrograms, and yields 1,000-dimensional feature vectors.
II-A2 CNNV features
Features are extracted from the published VGGVox model trained on the 100,000 utterances of the VoxCeleb dataset [Nagrani17]. Since the domains of VoxCeleb and TIMIT are similar, we expect good performance on the latter. VGGVox is based on the VGG-M convolutional architecture [chatfield2014return], which was previously used for image data, adapted for spectrogram input. We obtain 1,024-dimensional features from the FC7 layer as in the original publication.
II-B Dominant set clustering
Dominant set clustering is a graph-based method that generalizes the problem of finding a maximal clique to edge-weighted graphs. A natural application of this method is partitioning (clustering) a graph into disjoint sets. In this framework, a dataset is modeled as an undirected edge-weighted graph $G = (V, E, \omega)$ with no self loops, in which the nodes $V$ are the items of the dataset (represented by feature vectors). The edges $E \subseteq V \times V$ are the pairwise relations between nodes, and their weight function $\omega : E \to \mathbb{R}_{\geq 0}$ calculates pairwise similarities. The $n \times n$ symmetric adjacency matrix $A = (a_{ij})$ is employed to summarize $G$:
$$a_{ij} = \begin{cases} \omega(i, j) & \text{if } (i, j) \in E \\ 0 & \text{otherwise.} \end{cases}$$
Typically with every clustering method two properties shall hold: the intracluster homogeneity is high while intercluster homogeneity is low. These two properties are important in order to separate and group objects in the best possible way. They are directly reflected in the combinatorial formulation of DS (see [pavan2007dominant] for the details). Pavan and Pelillo propose an intriguing connection between clusters, dominant sets and local solutions of the following quadratic problem [pavan2007dominant]:
$$\begin{aligned} \text{maximize} \quad & x^\top A x \\ \text{subject to} \quad & x \in \Delta^n \end{aligned} \tag{1}$$
where $A$ is the similarity matrix of the graph and $x$ is the so-called characteristic vector, which lies in the $n$-dimensional simplex $\Delta^n$, that is, $\sum_i x_i = 1$ and $x_i \geq 0$ for all $i$. The components of the vector $x$ represent the likelihood of each element belonging to the cluster: the higher the score, the higher the chance of being part of it. If $x$ is a strict local solution of (1), then its support $\{i : x_i > 0\}$ is a dominant set.
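To make the objective concrete, the following minimal sketch evaluates the quadratic form $x^\top A x$ (the standard dominant set objective of Pavan and Pelillo) for two candidate vectors on the simplex. The toy similarity matrix and the helper name `quadratic_form` are illustrative assumptions and do not come from the paper.

```python
def quadratic_form(A, x):
    """Return x^T A x for a square matrix A (list of rows) and a vector x."""
    n = len(x)
    return sum(x[i] * A[i][j] * x[j] for i in range(n) for j in range(n))


# Hypothetical similarities: items 0-2 are mutually similar, item 3 is an outlier.
A = [
    [0.0, 0.9, 0.8, 0.1],
    [0.9, 0.0, 0.85, 0.1],
    [0.8, 0.85, 0.0, 0.1],
    [0.1, 0.1, 0.1, 0.0],
]

barycenter = [0.25, 0.25, 0.25, 0.25]       # mass spread over all items
cohesive = [1.0 / 3, 1.0 / 3, 1.0 / 3, 0.0]  # mass only on the similar items

f_barycenter = quadratic_form(A, barycenter)
f_cohesive = quadratic_form(A, cohesive)
print(f_barycenter, f_cohesive)
```

Concentrating the mass on the cohesive group yields a larger objective value than spreading it uniformly, which is the intuition behind reading local maximizers as clusters.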
In order to extract a DS, a local solution of (1) must be found. A method to solve this problem is to use a result from evolutionary game theory [weibull1997evolutionary] known as replicator dynamics (RD) (see Eq. 2):
$$x_i(t+1) = x_i(t) \, \frac{(A x(t))_i}{x(t)^\top A x(t)} \tag{2}$$
RD is a dynamical system that operates a selection process over the components of the vector $x$. At convergence of Eq. 2, certain components will emerge ($x_i > 0$) while others will go extinct ($x_i = 0$). In practical cases, if these last components of $x$ are not exactly equal to zero, a thresholding ($x_i > \theta$) is performed. The convergence of the process is guaranteed if the matrix $A$ is non-negative and symmetric. The dynamical system starts at the barycenter of the simplex and its components are updated using Eq. 2.
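The selection process can be sketched as follows, assuming the standard discrete-time replicator update $x_i \leftarrow x_i (Ax)_i / (x^\top A x)$ started from the barycenter. The function name, the stopping rule (Euclidean distance between successive iterates below `eps`), the default values and the toy matrix are assumptions made for illustration only.

```python
def replicator_dynamics(A, eps=1e-6, max_iter=10000):
    """Iterate the discrete replicator update from the simplex barycenter."""
    n = len(A)
    x = [1.0 / n] * n  # barycenter of the simplex
    for _ in range(max_iter):
        Ax = [sum(A[i][j] * x[j] for j in range(n)) for i in range(n)]
        xAx = sum(x[i] * Ax[i] for i in range(n))
        if xAx <= 0.0:  # no similarity mass: the update is undefined, stop
            break
        new_x = [x[i] * Ax[i] / xAx for i in range(n)]
        # Distance between two successive iterates.
        dist = sum((new_x[i] - x[i]) ** 2 for i in range(n)) ** 0.5
        x = new_x
        if dist < eps:
            break
    return x


# Hypothetical similarities: items 0-2 form a cohesive group, item 3 does not.
A = [
    [0.0, 0.9, 0.8, 0.1],
    [0.9, 0.0, 0.85, 0.1],
    [0.8, 0.85, 0.0, 0.1],
    [0.1, 0.1, 0.1, 0.0],
]
x = replicator_dynamics(A)
print([round(v, 4) for v in x])
```

On this toy input the component of the weakly connected item shrinks towards zero while the three similar items keep comparable weights, which is the "emerge versus go extinct" behavior described above.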
Deciding upon a cut-off threshold $\theta$ is not obvious. Instead of using a predefined value, we prefer to employ the approach proposed by Vascon et al. [Vascon2013, dodero2013automatic]. The parameter is computed based on the following idea: it decides the minimum degree of participation of an element in a cluster and is relative to the participation of the centroid. The support of each dominant set becomes $\{i : x_i > \theta \cdot \max(x)\}$ with $\theta \in [0, 1]$ (see Sec. IV-E for a sensitivity analysis on the parameters).
At each iteration a dominant set is extracted and its subset of nodes is removed from the graph (this is called the peeling-off strategy). The process iterates on the remaining nodes until all are assigned to a cluster.
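The peeling-off loop can be sketched as below. This is a minimal illustration, not the authors' implementation: it assumes the standard replicator update, a support defined by a cut-off relative to the largest component (`theta * max(x)`), and hypothetical default values for `theta` and `eps`.

```python
def replicator_dynamics(A, eps=1e-6, max_iter=10000):
    """Discrete replicator update started at the simplex barycenter."""
    n = len(A)
    x = [1.0 / n] * n
    for _ in range(max_iter):
        Ax = [sum(a * xj for a, xj in zip(row, x)) for row in A]
        xAx = sum(xi * axi for xi, axi in zip(x, Ax))
        if xAx <= 0.0:
            break
        new_x = [xi * axi / xAx for xi, axi in zip(x, Ax)]
        dist = sum((p - q) ** 2 for p, q in zip(new_x, x)) ** 0.5
        x = new_x
        if dist < eps:
            break
    return x


def peel_dominant_sets(A, theta=0.1, eps=1e-6):
    """Extract clusters one at a time, removing each from the graph."""
    remaining = list(range(len(A)))
    clusters = []
    while remaining:
        # Similarity matrix restricted to the nodes not yet assigned.
        sub = [[A[i][j] for j in remaining] for i in remaining]
        x = replicator_dynamics(sub, eps)
        cutoff = theta * max(x)  # cut-off relative to the centroid
        members = [remaining[k] for k, xk in enumerate(x) if xk > cutoff]
        clusters.append(members)
        remaining = [i for i in remaining if i not in members]
    return clusters


# Hypothetical graph with two groups: {0, 1, 2} and {3, 4}.
A = [
    [0.0, 0.9, 0.8, 0.05, 0.05],
    [0.9, 0.0, 0.85, 0.05, 0.05],
    [0.8, 0.85, 0.0, 0.05, 0.05],
    [0.05, 0.05, 0.05, 0.0, 0.7],
    [0.05, 0.05, 0.05, 0.7, 0.0],
]
clusters = peel_dominant_sets(A)
print(clusters)
```

Note that the number of clusters is never passed in: it is simply the number of iterations the loop needs before no nodes remain.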
II-C Similarity measure
To compute the weights on the edges of graph $G$, we use the cosine distance to construct a similarity function. The cosine distance has been chosen because it showed good performance on SC tasks [khoury2014hierarchical, shum2011exploiting, Nagrani17]. Given two utterances and their $m$-dimensional feature vectors $f_i$ and $f_j$, we apply the following function:
$$\omega(f_i, f_j) = \exp\left\{ -\frac{d(f_i, f_j)}{\sigma} \right\} \tag{3}$$
where $d$ is the cosine distance between the given features, and $\sigma$ is the similarity scaling parameter.
Setting the parameter $\sigma$ is often problematic and typically requires a grid search over a range of plausible values or cross-validation. In this work, we decided to use a principled heuristic from spectral clustering [perona], which proved to work well also in other works [tripodi2016context, ZemeneAP17]. Based on [perona] and [ZemeneAP17], we tested a local scaling parameter $\sigma_i$ for each utterance to be clustered. This means that in (3) our parameter $\sigma$ depends on the local neighborhoods of the given features $f_i$ and $f_j$, and it is determined as follows:
$$\sigma = \sigma_i \cdot \sigma_j, \qquad \sigma_i = \frac{1}{|N_i|} \sum_{k \in N_i} d(f_i, f_k) \tag{4}$$
where $N_i$ represents the nearest neighborhood of element $i$. In our experiments we used the same neighborhood size as in [ZemeneAP17].
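The construction of the affinity matrix can be sketched as follows. The sketch assumes a similarity of the form $\exp(-d / \sigma)$ with $d$ the cosine distance, and a local scaling $\sigma = \sigma_i \sigma_j$ where $\sigma_i$ is the mean distance from item $i$ to its nearest neighbors; these forms, the function names, the neighborhood size `n_neighbors=2` and the toy features are assumptions for illustration. Feature vectors are assumed to be non-zero and there must be at least two of them.

```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity, clamped at 0 against rounding noise."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return max(0.0, 1.0 - dot / (norm_u * norm_v))


def similarity_matrix(features, n_neighbors=2):
    """Affinity matrix with zero diagonal and locally scaled similarities."""
    n = len(features)
    D = [[0.0 if i == j else cosine_distance(features[i], features[j])
          for j in range(n)] for i in range(n)]
    # Local scale: mean distance to the n_neighbors closest other items.
    sigma = []
    for i in range(n):
        nearest = sorted(D[i][j] for j in range(n) if j != i)[:n_neighbors]
        sigma.append(sum(nearest) / len(nearest))
    A = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            scale = max(sigma[i] * sigma[j], 1e-12)  # avoid division by zero
            A[i][j] = A[j][i] = math.exp(-D[i][j] / scale)
    return A


# Hypothetical 2-dimensional embeddings of four utterances (two speakers).
features = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.9]]
A = similarity_matrix(features)
print([[round(v, 3) for v in row] for row in A])
```

The resulting matrix is symmetric, non-negative and has a zero diagonal (no self loops), which matches the requirements stated for the replicator dynamics.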
II-D Cluster labeling
Once all dominant sets are extracted, the final step is to label each partition such that each speaker is in one-to-one correspondence with a cluster. The labels of the data are then used to perform the assignment. We tested two approaches for cluster labeling:
II-D1 Max
a prototype selection method which assigns cluster labels using the class of the element with maximum participation in the characteristic vector [Vascon2013]. Labels are unique: if two different clusters share the same label, the latter one is considered completely mistaken, which increases the error in the evaluation.
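A minimal sketch of the Max rule follows. The data layout (one dictionary of participation scores per cluster) and the use of `None` to flag a cluster whose label was already taken are assumptions chosen for illustration.

```python
def max_labeling(clusters, true_labels):
    """Label each cluster by the class of its most central element.

    clusters: list of dicts mapping item index -> participation score.
    Returns one label per cluster, or None if the label is already used.
    """
    assigned = []
    used = set()
    for weights in clusters:
        centroid = max(weights, key=weights.get)  # highest participation
        label = true_labels[centroid]
        if label in used:
            assigned.append(None)  # duplicate: cluster counted as mistaken
        else:
            used.add(label)
            assigned.append(label)
    return assigned


# Hypothetical example: the third cluster repeats a label used earlier.
true_labels = ["anna", "anna", "bob", "bob", "anna"]
clusters = [{0: 0.6, 1: 0.4}, {2: 0.5, 3: 0.5}, {4: 1.0}]
labels = max_labeling(clusters, true_labels)
print(labels)
```

Because the participation scores are already available from the characteristic vector, this labeling needs no additional optimization step.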
II-D2 Hungarian
with this approach, each cluster is labeled using the Munkres (aka Hungarian) method [hungarian]. The cost $C_{i,j}$ of assigning a cluster $i$ to a particular label $j$ is computed as the number of elements of class $j$ in the cluster $i$. Since the method minimizes the total cost of assignments, the value of $C_{i,j}$ is changed to $\max(C) - C_{i,j}$, where $\max(C)$ is the maximum cost over all the assignments. Minimizing the transformed cost is then equivalent to maximizing the number of correctly labeled elements.
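The cost transformation can be illustrated with the sketch below. To stay self-contained it replaces the Munkres algorithm with an exhaustive search over permutations, which gives the same optimal assignment but is only feasible for a handful of clusters; it also assumes as many clusters as speakers. All names and the toy data are hypothetical.

```python
from itertools import permutations


def overlap_counts(cluster_ids, speaker_ids, n_clusters, n_speakers):
    """C[i][j] = number of items of speaker j that fall in cluster i."""
    C = [[0] * n_speakers for _ in range(n_clusters)]
    for c, s in zip(cluster_ids, speaker_ids):
        C[c][s] += 1
    return C


def min_cost_assignment(cost):
    """Brute-force stand-in for Munkres on a small square cost matrix."""
    n = len(cost)
    best = min(permutations(range(n)),
               key=lambda p: sum(cost[i][p[i]] for i in range(n)))
    return list(best)


cluster_ids = [0, 0, 1, 1, 1, 2]
speaker_ids = [1, 1, 0, 0, 2, 2]
C = overlap_counts(cluster_ids, speaker_ids, 3, 3)

# Turn "maximize overlap" into "minimize cost": cost = max(C) - C.
c_max = max(max(row) for row in C)
cost = [[c_max - v for v in row] for row in C]

assignment = min_cost_assignment(cost)  # assignment[i] = label of cluster i
matched = sum(C[i][assignment[i]] for i in range(3))
print(assignment, matched)
```

In a real experiment with hundreds of speakers, a proper Munkres implementation would be needed, since the number of permutations grows factorially.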
III Experiments
III-A Datasets & data preparation
We evaluate our method on the TIMIT dataset, presented as TIMIT Small and TIMIT Full (see Table I). The dataset is composed of 6,300 phrases (10 phrases per speaker), spoken by 438 males (70%) and 192 females (30%). The speakers come from 8 different regions and have different dialects. The phrases of each speaker have been divided into 2 parts in accordance with previous research [Thilo2009, MLSP2016, MLSP2017]. In our experiments we used the same 40-speaker subset as reported in these earlier works (here called TIMIT Small), and the full TIMIT set composed of 630 speakers. Note that TIMIT Small is disjoint from the training set of CNNT. This dataset is suited to our work because we are not dealing with noise, segmentation or similar diarization problems.
Table I
| Dataset | Acronym | #POIs | #Utt/POI | Utterances |
|---|---|---|---|---|
| TIMIT Small [Thilo2009] | TimitS | 40 | 2 | 80 |
| TIMIT Full [timit:1986] | TimitF | 630 | 2 | 1260 |
III-B Comparison to other methods
The proposed method has been compared with the state of the art [MLSP2017, MLSP2016, Thilo2009] and with other clustering techniques such as spectral clustering (SP), k-means (KM) and hierarchical clustering (HC). Given that our proposed method is completely unsupervised (in particular, there is no a priori knowledge of the number of clusters), we tested some heuristics to estimate $k$ also for the aforementioned algorithms. Specifically, the Eigengap heuristic [von2007tutorial] and the number of clusters found by our method are used. Moreover, we chose affinity propagation (AP) [frey2007clustering] and HDBSCAN [hdbscan] because they do not require an a priori $k$. In order to compare fairly, we tested the competitors with their best settings. Specifically, for HC and KM, cosine distance was the best choice, while for SP we used an RBF kernel with its parameter found through an extensive grid search. The cut on HC has been set such that the error is minimized, as in [MLSP2016]. For AP we used the same similarity measure as SCDS, while for HDBSCAN the Euclidean distance and a minimum number of points per cluster equal to 2 were used.
III-C Evaluation criteria
To evaluate the clustering quality we used three distinct metrics: the misclassification rate (MR) [Kotti], the adjusted Rand index (ARI) [hubert1985comparing] and the average cluster purity (ACP) [solomonoff]. Using different metrics is important because each of them gives a different perspective on the results: MR quantifies how many speaker labels are inferred correctly from the clusters, while ARI and ACP measure the grouping/separation performance on utterances.
Formally, given a one-to-one mapping between clusters and labels (see Sec. II-D), MR is defined as $MR = \frac{1}{N} \sum_{j=1}^{N_s} e_j$, where $N$ is the total number of audio segments to cluster, $N_s$ the number of speakers, and $e_j$ the number of segments of speaker $j$ classified incorrectly. Cluster purity is a measure to determine how pure clusters are. If a cluster is composed of utterances belonging to the same speaker, then it is completely pure; otherwise (i.e., other speakers are in that cluster, too) purity decreases. Formally, average cluster purity is defined as:
$$ACP = \frac{1}{N} \sum_{i=1}^{N_c} p_i \, n_i, \qquad p_i = \sum_{j=1}^{N_s} \frac{n_{ij}^2}{n_i^2}$$
where $N_c$ is the number of clusters, $n_{ij}$ the number of utterances in cluster $i$ spoken by speaker $j$, and $n_i$ is the size of cluster $i$. Finally, the ARI is the normalized version of the Rand index [rand1971objective], with maximum value 1 for perfectly assigned clusters with respect to the expected ones.
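The three metrics can be sketched as follows. The MR counts items whose cluster label differs from their speaker; the ACP uses the commonly cited purity definition (sum of squared per-speaker counts over the squared cluster size, weighted by cluster size); the ARI follows the usual pair-counting form of Hubert and Arabie. These exact forms, the function names and the toy data are assumptions for illustration.

```python
from collections import Counter
from math import comb


def misclassification_rate(cluster_ids, speaker_ids, cluster_to_speaker):
    """Fraction of items whose cluster is mapped to a different speaker."""
    errors = sum(1 for c, s in zip(cluster_ids, speaker_ids)
                 if cluster_to_speaker.get(c) != s)
    return errors / len(speaker_ids)


def average_cluster_purity(cluster_ids, speaker_ids):
    """(1/N) * sum_i p_i * n_i with p_i = sum_j n_ij^2 / n_i^2."""
    sizes = Counter(cluster_ids)
    joint = Counter(zip(cluster_ids, speaker_ids))
    total = sum(n_ij ** 2 / sizes[c] for (c, _), n_ij in joint.items())
    return total / len(speaker_ids)


def adjusted_rand_index(cluster_ids, speaker_ids):
    """Pair-counting ARI; returns 1.0 for identical partitions."""
    n = len(speaker_ids)
    joint = Counter(zip(cluster_ids, speaker_ids))
    sum_ij = sum(comb(v, 2) for v in joint.values())
    sum_a = sum(comb(v, 2) for v in Counter(cluster_ids).values())
    sum_b = sum(comb(v, 2) for v in Counter(speaker_ids).values())
    expected = sum_a * sum_b / comb(n, 2)
    max_index = 0.5 * (sum_a + sum_b)
    if max_index == expected:  # degenerate case, e.g. all singletons
        return 1.0
    return (sum_ij - expected) / (max_index - expected)


# Hypothetical result: one utterance of speaker 1 ends up in cluster 0.
speaker_ids = [0, 0, 1, 1, 2, 2]
cluster_ids = [0, 0, 0, 1, 2, 2]
mapping = {0: 0, 1: 1, 2: 2}

mr = misclassification_rate(cluster_ids, speaker_ids, mapping)
acp = average_cluster_purity(cluster_ids, speaker_ids)
ari = adjusted_rand_index(cluster_ids, speaker_ids)
print(mr, acp, ari)
```

The example shows why the metrics complement each other: a single misplaced utterance costs one sixth in MR, but lowers ACP and ARI by different amounts because they look at the composition of clusters rather than at labels.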
III-D Experimental setup
Our proposed method is evaluated in experiments composed as follows: given a set of audio utterances, features are extracted following one of the methods in Sec. II-A and the affinity matrix is computed as in Sec. II-C. Subsequently, the DS are found on top of this graph-based representation. Labeling is performed on each cluster following the methodology proposed in Sec. II-D. The quality of the clusters is then evaluated using the metrics in Sec. III-C. The summarized results are reported in Tables II and III and discussed in the next section. The two hyperparameters, the relative cut-off and the precision of the replicator dynamics, are set to the same values for all experiments.
IV Results
In this section, the results of a series of analyses are reported, followed by an overall discussion.
IV-A Initialization of k in the competitors
DS does not need an a priori number of clusters, while the standard competitors do. In order to make a fair comparison with the standard approaches (HC, KM and SP), we used three values for $k$: the correct number of clusters to be found, the number of clusters found by DS, and the number of clusters estimated with Eigengap. The three variants appear as separate rows in Tables II and III (the Eigengap variant is marked with #).
Experimental results show that even when the correct number of clusters is provided, SCDS still achieves better results (see tables). This means not only that our method is able to recover a number of clusters close to the right one, but also that it extracts more accurate partitions. When the number of clusters found by DS is given to the other methods, the results obtained are plausible, showing that our method finds a good number of clusters, while with standard heuristics the performance drops strongly.
IV-B Analysis of different feature extraction methods
In the next experiments, we tested the two CNN-based features, CNNT and CNNV. Both provide good features in terms of capacity to discriminate speakers. With the CNNT features, the performance of our method saturates on TIMIT Small (see last rows of Table II) and is almost perfect on TIMIT Full (see last rows of Table III). This is mainly explained by the fact that the network used to extract the CNNT embeddings has been trained on the remaining 590 of the 630 TIMIT speakers [MLSP2016], which biases it towards performing well on the entire dataset.
Surprisingly, the features obtained from VGGVox are generic enough to allow almost the same performance for SCDS. These features are also beneficial for the competitors: all of them perform better in terms of MR/ARI/ACP with CNNV features than with CNNT ones (except for KM).
IV-C Cluster labeling
We tested two methods for labeling clusters for our approach (see Max and Hungarian in Sec. II-D), while for all the other competitors we used only the Hungarian algorithm, since Max is specific to DS. Under all conditions and datasets both labeling methods perform the same (see the two SCDS rows at the end of each results table, one per labeling method). Labeling with the Max method comes for free directly from DS theory, while applying the Hungarian method has its computational cost.
IV-D Metrics comparison
The three metrics (MR, ARI and ACP) should be analyzed in conjunction because they capture different aspects of the result. Having the lowest MR in the final results on both datasets emphasizes that we are correctly labeling clusters and that the number of misclassified samples is extremely low. On the other hand, reaching the highest ARI shows that our method obtains a good partitioning of the data with respect to the expected clusters. Furthermore, having the highest ACP confirms that the clusters extracted with SCDS are mainly composed of utterances from the same speaker.
The proposed method reaches the best scores on all metrics simultaneously. Other methods reach similar performance, in particular on TIMIT Small (e.g., HC and AP), but none of them works as well as ours in the most complex experimental setting used, TIMIT Full with VGGVox features (where no knowledge of the actual voices to be clustered could possibly enter the features and thus the clustering system).
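Table II: TIMIT Small
| Method | CNNT MR | CNNT ARI | CNNT ACP | CNNV MR | CNNV ARI | CNNV ACP |
|---|---|---|---|---|---|---|
| HC | 0.0250 | 0.9259 | 0.9667 | 0.0000 | 1.0000 | 1.0000 |
| HC [MLSP2016] | 0.0500 | - | - | - | - | - |
| Adadelta 20k [MLSP2017] | 0.0500 | - | - | - | - | - |
| Adadelta 30k [MLSP2017] | 0.0500 | - | - | - | - | - |
| SVM [Thilo2009] | 0.0600 | - | - | - | - | - |
| GMM/MFCC [Thilo2009] | 0.1300 | - | - | - | - | - |
| SP | 0.0750 | 0.8422 | 0.9500 | 0.0000 | 1.0000 | 1.0000 |
| KM | 0.0250 | 0.9259 | 0.9667 | 0.0375 | 0.9390 | 0.9750 |
| HC k | 0.0250 | 0.9259 | 0.9667 | 0.0000 | 1.0000 | 1.0000 |
| SP k | 0.0750 | 0.8422 | 0.9500 | 0.0000 | 1.0000 | 1.0000 |
| KM k | 0.0250 | 0.9259 | 0.9667 | 0.0375 | 0.9390 | 0.9750 |
| HC # | 0.4500 | 0.4234 | 0.5500 | 0.6750 | 0.2466 | 0.3250 |
| SP # | 0.4500 | 0.0827 | 0.5500 | 0.6750 | 0.1751 | 0.3038 |
| KM # | 0.4500 | 0.3543 | 0.5267 | 0.6750 | 0.1746 | 0.3193 |
| AP | 0.0500 | 0.8951 | 0.9416 | 0.0000 | 1.0000 | 1.0000 |
| HDBS | 0.1000 | 0.8056 | 0.8833 | 0.0750 | 0.8422 | 0.9083 |
| SCDS | 0.0000 | 1.0000 | 1.0000 | 0.0000 | 1.0000 | 1.0000 |
| SCDS | 0.0000 | 1.0000 | 1.0000 | 0.0000 | 1.0000 | 1.0000 |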
IV-E Sensitivity analysis
Finally, we report the results of a sensitivity analysis on the two free parameters of our method under two metrics (see Fig. 2 and 3): the precision $\epsilon$ of the replicator dynamics (see Eq. 2) and the relative cut-off $\theta$ (see Sec. II-B). The analysis has been carried out on TIMIT Full with VGGVox features because under this setting a certain amount of variability in the results is observed, which makes the analysis informative. The search space for both parameters is bounded by extremal values, chosen for the following reasons. A low value of $\theta$ means that a point belongs to a cluster if and only if its level of participation in the cluster is at least that small fraction of the centrality of the centroid. A high value of $\theta$ instead means that the centroid and the sample must be almost exactly the same. In the first case we create clusters that span widely in terms of the similarities of their elements, while in the latter case we create clusters composed of very similar elements. Regarding the parameter $\epsilon$, a small value requires that two successive steps of Eq. 2 are very close to each other, while a large value allows for a coarse equilibrium point.
Changes in both variables showed that the area in which the performance is stable is large (see the blue area in Fig. 2 and the yellow area in Fig. 3). Only when extremal parameter values are used does the performance drop. The best parameter choice for each feature type (CNNT and CNNV) is shown in Table III as SCDSbest.
Table III: TIMIT Full
| Method | CNNT MR | CNNT ARI | CNNT ACP | CNNV MR | CNNV ARI | CNNV ACP |
|---|---|---|---|---|---|---|
| HC | 0.0770 | 0.8341 | 0.9283 | 0.0571 | 0.8809 | 0.9484 |
| SP | 0.2294 | 0.0432 | 0.8355 | 0.0675 | 0.5721 | 0.9488 |
| KM | 0.1071 | 0.7752 | 0.9071 | 0.1286 | 0.6982 | 0.8730 |
| HC k | 0.0762 | 0.8343 | 0.9280 | 0.0706 | 0.8502 | 0.9295 |
| SP k | 0.2341 | 0.0421 | 0.8332 | 0.0635 | 0.4386 | 0.9427 |
| KM k | 0.1079 | 0.7682 | 0.9007 | 0.1429 | 0.6646 | 0.8485 |
| HC # | 0.9921 | 0.0050 | 0.0079 | 0.9984 | 0.0000 | 0.0016 |
| SP # | 0.9921 | 0.0003 | 0.0075 | 0.9984 | 0.0000 | 0.0016 |
| KM # | 0.9921 | 0.0052 | 0.0076 | 0.9984 | 0.0000 | 0.0016 |
| AP | 0.0753 | 0.8330 | 0.9030 | 0.1396 | 0.7127 | 0.8222 |
| HDBS | 0.1825 | 0.6214 | 0.7825 | 0.3000 | 0.4112 | 0.6527 |
| SCDS | 0.0048 | 0.9897 | 0.9947 | 0.0349 | 0.9167 | 0.9578 |
| SCDS | 0.0048 | 0.9897 | 0.9947 | 0.0349 | 0.9167 | 0.9578 |
| SCDSbest | 0.0032 | 0.9929 | 0.9966 | 0.0024 | 0.9944 | 0.9974 |
IV-F Overall discussion
From a global perspective, the proposed SCDS method performs better than the alternatives on the datasets used, outperforming the state of the art and adapting well even with a model pretrained on a different dataset. This is particularly evident on TIMIT Full, where better performance than the competitors is achieved even when they are given the right number of clusters to be found. It is worth noting that our clustering method has only two parameters to set, both of which are largely insensitive to variation, as shown in the sensitivity analysis.
Interestingly, an analysis of misclassified speakers shows that if a speaker is wrongly clustered by DS, it is also wrongly clustered by all other methods. This suggests that in these cases the extracted features, rather than the clustering approach, may be the cause of the error.
V Conclusions
In this paper, we have proposed a novel pipeline for speaker clustering. The proposed method is based on the dominant set clustering algorithm which has been applied to this domain for the first time. It outperforms the previous state of the art and other clustering techniques.
We proposed a method which is almost parameterless: the two free parameters have little effect on the results, testifying to its stability. Moreover, we successfully used a CNN model pretrained on a different dataset and report reasonable speaker clustering performance on the TIMIT Full dataset for the first time (after the MR reported by Stadelmann and Freisleben using a classical pipeline [Thilo2009]). Now that we have reached a good starting point with noise-free utterances, we can start considering more complex datasets with their more challenging tasks (noise, segmentation, cross-talk, etc.). Future work also includes improving the features using the siamese network proposed by Nagrani et al. [Nagrani17] to extract similarities directly.
Code available at https://github.com/feliksh/SCDS
Acknowledgment
The authors would like to thank Y. Lukic for providing the MLSP features and J. Salamone for contributing to first results.