Speaker clustering (SC) is the task of identifying the unique speakers in a set of audio recordings (each belonging to exactly one speaker) without knowing who and how many speakers are present altogether [beigi2011fundamentals]. Other tasks related to speaker recognition and SC are the following:
Speaker verification (SV): A binary decision task in which the goal is to decide if a recording belongs to a certain person or not.
Speaker identification (SI): A multiclass classification task in which the goal is to decide to which of a set of known speakers a certain recording belongs.
SC is also referred to as speaker diarization when a single (usually long) recording involves multiple speakers and thus needs to be automatically segmented prior to clustering. Since SC is a completely unsupervised problem (the number of speakers and segments per speaker is unknown), it is considered more complex than both SV and SI. Its complexity is comparable to the problem of image segmentation in computer vision, in which the number of regions to be found is typically unknown.
The SC problem is of importance in the domain of audio analysis due to many possible applications, for example in lecture/meeting recording summarization [anguera2012speaker], as a pre-processing step in automatic speech recognition, or as part of an information retrieval system for audio archives [ajmera2003robust]. Furthermore, SC represents a building block for speaker diarization [shum2012use].
The SC problem has been widely studied [jin1997automatic, sadjadi20172016]. A typical pipeline is based on three main steps: i.a) acoustic feature extraction from audio samples, i.b) voice feature aggregation from the lower-level acoustic features by means of a speaker modeling stage, and ii) a clustering technique on top of this feature-based representation.
The voice features after phase i) have traditionally been created based on Mel Frequency Cepstral Coefficient (MFCC) acoustic features modeled by a Gaussian Mixture Model (GMM) [campbell2006support], or i-vectors [dehak2009support, lee2014clustering]. More recently, with the rise of deep learning, the community is moving towards learned features instead of hand-crafted ones, as surveyed by Richardson et al. [richardson2015deep]. Recent examples of deep-feature representations for SI, SV, and SC problems come, for example, from Lukic et al. [MLSP2017], after convolutional neural networks (CNNs) had been introduced to the speech processing field by LeCun et al. as early as the nineties [lecun1995convolutional]. McLaren et al. used a CNN for speaker recognition in order to improve robustness to noisy speech [mclaren2014application]. Chen et al. used a novel deep neural architecture to learn speaker-specific characteristics directly from MFCC features [chen2011learning]. Yella et al. exploited a three-layer artificial neural network to extract features directly from a hidden layer, which are then used for speaker clustering [yella2014artificial].
However advanced phase i) has become in recent years, the clustering phase ii) still relies on traditional methodologies. For example, Khoury et al. demonstrated good results for speaker clustering using a hierarchical clustering algorithm [khoury2014hierarchical], while Kenny et al. report hierarchical clustering to be unsuitable for the speaker clustering stage in a speaker diarization system [kenny2010diarization]. See also [shum2011exploiting, shum2012use].
In this paper, we therefore improve the results of the speaker clustering task by first using state-of-the-art learned features and then applying a different and more robust clustering algorithm, dominant sets (DS) [pavan2007dominant]. The choice of dominant sets is motivated by the following properties: a) no a-priori number of clusters is needed; b) a notion of compactness allows clusters composed of noise to be detected automatically; c) for each cluster, the centrality of each element is quantified (centroids emerge naturally in this context); and d) extensive experiments and the underlying theory indicate a high robustness to noise [pavan2007dominant]. All of these properties fit the SC problem well.
The contribution of this paper is three-fold: first, we apply the dominant set method for the first time in the SC domain, outperforming the previous state of the art; second, it is the first time that the full TIMIT dataset [timit:1986] is used for SC problems, making this paper a reference baseline in this context and on this dataset; third, we use for the first time a pre-trained VGGVox network (https://github.com/a-nagrani/VGGVox) to extract features for the TIMIT dataset, obtaining good results and demonstrating the capability of this embedding.
The remainder of this paper is organized as follows: in Sec. II the proposed method is explained in detail (Sec. II-A describes the different feature extraction methods, and Sec. II-B introduces the theoretical foundations of DS). In Sec. III the experiments that have been carried out are explained, and in Sec. IV we discuss the results before drawing conclusions in Sec. V together with future perspectives.
II Speaker clustering with dominant sets
Our proposed approach, called SCDS, is based on a two-phase schema (see Fig. 1): in the first phase, features are extracted from each utterance; in the second, the dominant sets are extracted from this feature-based representation. In this section, the specific parts are explained.
II-A Feature extraction
We use two different feature extraction methods in this work that we call CNN-T (derived from embeddings based on the TIMIT dataset), and CNN-V (based on a model trained on VoxCeleb [Nagrani17]):
II-A1 CNN-T features
Features are extracted from the CNN (https://github.com/stdm/ZHAW_deep_voice) described in detail by Lukic et al. [MLSP2016], specifically from the dense layer L7 therein. The network has been trained on 590 speakers of the TIMIT database, fed to the net as spectrograms derived from the corresponding utterances, and yields 1,000-dimensional feature vectors.
II-A2 CNN-V features
Features are extracted from the published VGGVox model trained on the 100,000 utterances of the VoxCeleb dataset [Nagrani17]. Since the domains of VoxCeleb and TIMIT are similar, we expect good performance on the latter. VGGVox is based on the VGG-M convolutional architecture [chatfield2014return], which was previously used for image data and has been adapted for spectrogram input. We obtain 1,024-dimensional features from the FC7 layer, as in the original publication.
II-B Dominant Set clustering
Dominant set clustering is a graph-based method that generalizes the problem of finding a maximal clique to edge-weighted graphs. A natural application of this method is partitioning (clustering) a graph into disjoint sets. In this framework, a dataset is modeled as an undirected edge-weighted graph $G = (V, E, w)$ with no self loops, in which the nodes $V$ are the items of the dataset (represented by feature vectors). The edges $E \subseteq V \times V$ are the pairwise relations between nodes and their weight function $w : E \to \mathbb{R}_{\geq 0}$ calculates pairwise similarities. The $n \times n$ symmetric adjacency matrix $A = (a_{ij})$ is employed to summarize $G$:
$$a_{ij} = \begin{cases} w(i,j) & \text{if } (i,j) \in E \\ 0 & \text{otherwise} \end{cases}$$
Typically, two properties should hold for every clustering method: intra-cluster homogeneity is high while inter-cluster homogeneity is low. These two properties are important in order to separate and group objects in the best possible way. They are directly reflected in the combinatorial formulation of DS (see [pavan2007dominant] for the details). Pavan and Pelillo propose an intriguing connection between clusters, dominant sets and local solutions of the following quadratic problem [pavan2007dominant]:
$$\text{maximize } \mathbf{x}^{T} A\, \mathbf{x} \quad \text{subject to } \mathbf{x} \in \Delta^{n} \tag{1}$$
where $A$ is the similarity matrix of the graph and $\mathbf{x}$ is the so-called characteristic vector, which lies in the $n$-dimensional simplex $\Delta^{n}$, that is, $\sum_i x_i = 1$ and $x_i \geq 0$ for all $i$. The components of vector $\mathbf{x}$ represent the likelihood of each element to belong to the cluster: the higher the score, the higher the chance of being part of it. If $\mathbf{x}$ is a strict local solution of (1), then its support $\{i : x_i > 0\}$ is a dominant set.
In order to extract a DS, a local solution of (1) must be found. A method to solve this problem is to use a result from evolutionary game theory [weibull1997evolutionary] known as replicator dynamics (RD) (see Eq. 2):
$$x_i(t+1) = x_i(t)\,\frac{(A\,\mathbf{x}(t))_i}{\mathbf{x}(t)^{T} A\, \mathbf{x}(t)} \tag{2}$$
RD is a dynamical system that operates a selection process over the components of the vector $\mathbf{x}$. At convergence of Eq. 2 ($\|\mathbf{x}(t+1) - \mathbf{x}(t)\|_2 \leq \epsilon$), certain components will emerge ($x_i > 0$) while others will become extinct ($x_i = 0$). In practical cases, if these last components of $\mathbf{x}$ are not exactly equal to zero, a thresholding ($x_i > \theta$) is performed. The convergence of the process is guaranteed if the matrix $A$ is non-negative and symmetric. The dynamical system starts at the barycenter of the simplex and its components are updated using Eq. 2.
Deciding upon a cutoff threshold $\theta$ is not obvious. Instead of using a predefined value, we prefer to employ the approach proposed by Vascon et al. [Vascon2013, dodero2013automatic]. The parameter $\theta$ is computed based on the following idea: it decides the minimum degree of participation of an element in a cluster and is relative to the participation of the centroid. The support of each dominant set becomes $\{i : x_i > \theta \cdot \max(\mathbf{x})\}$ with $\theta \in [0, 1]$ (see Sec. IV-E for a sensitivity analysis on the parameters).
At each iteration a dominant set is extracted and its nodes are removed from the graph (this is called the peeling-off strategy). The process iterates on the remaining nodes until all are assigned to a cluster.
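The extraction procedure described in this section (replicator dynamics started at the barycenter, relative thresholding of the characteristic vector, and peeling off) can be sketched as follows. This is a minimal illustration and not the released implementation: the function names, the default values `theta=0.1` and `eps=1e-6`, and the iteration cap `max_iter` are assumptions made for the example.

```python
import numpy as np

def replicator_dynamics(A, eps=1e-6, max_iter=1000):
    """Discrete replicator dynamics started at the barycenter of the simplex."""
    n = A.shape[0]
    x = np.full(n, 1.0 / n)
    for _ in range(max_iter):          # max_iter is an assumed safety cap
        Ax = A @ x
        denom = x @ Ax
        if denom <= 0:                 # no positive similarity left to select on
            break
        x_new = x * Ax / denom
        converged = np.linalg.norm(x_new - x) <= eps
        x = x_new
        if converged:
            break
    return x

def dominant_sets(A, theta=0.1, eps=1e-6):
    """Peel off dominant sets until every node has a cluster label."""
    A = np.asarray(A, dtype=float)
    labels = np.full(A.shape[0], -1)
    remaining = np.arange(A.shape[0])
    cluster = 0
    while remaining.size > 0:
        sub = A[np.ix_(remaining, remaining)]
        x = replicator_dynamics(sub, eps=eps)
        # relative cut-off: keep elements whose participation exceeds
        # a fraction theta of the most central element (the centroid)
        support = x > theta * x.max()
        support[np.argmax(x)] = True   # guarantees progress even if theta == 1
        labels[remaining[support]] = cluster
        remaining = remaining[~support]
        cluster += 1
    return labels
```

Calling `dominant_sets(A)` on a symmetric, non-negative similarity matrix with zero diagonal returns one integer cluster index per node; the number of clusters is a by-product of the peeling and is not given as input.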
II-C Similarity measure
To compute the weights on the edges of the graph we use the cosine distance to construct a similarity function. The cosine distance has been chosen because it showed good performance on SC tasks [khoury2014hierarchical, shum2011exploiting, Nagrani17]. Given two utterances $i$ and $j$ and their $m$-dimensional feature vectors $f_i$ and $f_j$, we apply the following function:
$$w(i,j) = \exp\left(-\frac{d(f_i, f_j)}{\sigma}\right) \tag{3}$$
where $d$ is the cosine distance between the given features, and $\sigma$ is the similarity scaling parameter.
Setting the parameter $\sigma$ is often problematic and typically requires a grid search over a range of plausible values or a cross-validation. In this work, we decided to use a principled heuristic from spectral clustering [perona] which has proved to work well in other works, too [tripodi2016context, ZemeneAP17]. Based on [perona] and [ZemeneAP17] we tested a local scaling parameter $\sigma_i$ for each utterance to be clustered. This means that in (3) our parameter $\sigma_{i,j}$ depends on the local neighborhoods of the given features $f_i$ and $f_j$ and is determined as follows:
$$\sigma_{i,j} = \sigma_i\, \sigma_j, \qquad \sigma_i = \frac{1}{|N_i|} \sum_{k \in N_i} d(f_i, f_k)$$
where $N_i$ represents the nearest neighborhood of element $i$. In our experiments we used the same neighborhood size $|N_i|$ as in [ZemeneAP17].
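To make the construction of the similarity matrix concrete, the following sketch computes cosine distances between feature vectors and turns them into similarities with a locally scaled exponential. Several details are assumptions made for illustration only: the pairwise scale is taken as the product of two per-utterance scales, each per-utterance scale is the mean cosine distance to the nearest neighbors, the neighborhood size defaults to 7, and the function names are hypothetical. Feature vectors are assumed to be non-zero.

```python
import numpy as np

def cosine_distance_matrix(F):
    """Pairwise cosine distances between the rows of F (non-zero rows assumed)."""
    F = np.asarray(F, dtype=float)
    U = F / np.linalg.norm(F, axis=1, keepdims=True)
    # clip guards against tiny negative values caused by rounding
    return np.clip(1.0 - U @ U.T, 0.0, 2.0)

def local_scaling_similarity(F, n_neighbors=7):
    """Similarity matrix exp(-d_ij / (sigma_i * sigma_j)) with zero diagonal."""
    D = cosine_distance_matrix(F)
    n = D.shape[0]
    k = min(n_neighbors, n - 1)        # neighborhood size (assumed default: 7)
    # after sorting each row, column 0 is a zero distance (the element itself)
    nearest = np.sort(D, axis=1)[:, 1:k + 1]
    sigma = np.maximum(nearest.mean(axis=1), 1e-12)  # avoid division by zero
    A = np.exp(-D / np.outer(sigma, sigma))
    np.fill_diagonal(A, 0.0)           # the graph has no self loops
    return A
```

The resulting matrix is symmetric and non-negative, which are the conditions under which the replicator dynamics are guaranteed to converge.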
II-D Cluster labeling
Once all dominant sets are extracted, the final step is to label each partition such that each speaker is in one-to-one correspondence with a cluster. The labels of the data are then used to perform the assignment. We tested two approaches for cluster labeling:
Max: a prototype selection method that assigns each cluster the class of its element with maximum participation in the characteristic vector [Vascon2013]. Labels are unique; if two different clusters share a label, the latter one is considered completely mistaken, which increases the error in the evaluation.
Hungarian: with this approach, each cluster is labeled using the Munkres (aka Hungarian) method [hungarian]. The cost $c_{ij}$ of assigning cluster $i$ to label $j$ is computed as the number of elements of class $j$ in cluster $i$. Since the method minimizes the total cost of assignments, each $c_{ij}$ is changed to $c_{\max} - c_{ij}$, where $c_{\max}$ is the maximum cost over all the assignments. This turns the minimization problem into a maximization one.
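The Hungarian labeling step can be illustrated with SciPy's `scipy.optimize.linear_sum_assignment`, which solves the underlying assignment problem. The sketch below mirrors the cost conversion described above (counts are subtracted from their maximum so that minimizing the cost maximizes the overlap). The function name `hungarian_labeling` and its return format are hypothetical choices for this example.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def hungarian_labeling(cluster_ids, true_labels):
    """Map each cluster to a distinct label so that the total overlap is maximal."""
    cluster_ids = np.asarray(cluster_ids)
    true_labels = np.asarray(true_labels)
    clusters = np.unique(cluster_ids)
    classes = np.unique(true_labels)
    # counts[i, j] = number of elements of class j in cluster i
    counts = np.zeros((clusters.size, classes.size), dtype=int)
    for i, c in enumerate(clusters):
        for j, l in enumerate(classes):
            counts[i, j] = np.sum((cluster_ids == c) & (true_labels == l))
    # the solver minimizes, so turn large overlaps into small costs
    cost = counts.max() - counts
    rows, cols = linear_sum_assignment(cost)
    return {clusters[r].item(): classes[c].item() for r, c in zip(rows, cols)}
```

If there are more clusters than labels (or vice versa), the solver leaves the surplus rows or columns unassigned, so such clusters simply do not appear in the returned mapping.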
III-A Datasets & data preparation
We evaluate our method on the TIMIT dataset, presented as TIMIT Small and TIMIT Full (see Table I). The dataset is composed of 6,300 phrases (10 phrases per speaker), spoken by 438 males (70%) and 192 females (30%). The speakers come from 8 different regions and have different dialects. The phrases of each speaker have been divided into 2 parts in accordance with previous research [Thilo2009, MLSP2016, MLSP2017]. In our experiments we used the same 40-speaker dataset as reported by these earlier attempts (here called TIMIT Small), and the full TIMIT set composed of 630 speakers. Note that TIMIT Small is disjoint from the training set of CNN-T. This dataset is suited to our work because we are not dealing with noise, segmentation or similar diarization problems.
Table I: TIMIT subsets used for evaluation.
|Dataset|ID|Speakers|Segments per speaker|Total segments|
|TIMIT Small [Thilo2009]|TimitS|40|2|80|
|TIMIT Full [timit:1986]|TimitF|630|2|1260|
III-B Comparison to other methods
The proposed method has been compared with the state of the art [MLSP2017, MLSP2016, Thilo2009] and with other clustering techniques such as spectral clustering (SP), k-means (KM) and hierarchical clustering (HC). Given that our proposed method is completely unsupervised (in particular, there is no a-priori knowledge of the number of clusters $k$), we tested some heuristics to estimate $k$ for the aforementioned algorithms as well. Specifically, the Eigengap heuristic [von2007tutorial] and the number of clusters found by our method are used. Moreover, we chose affinity propagation (AP) [frey2007clustering] and HDBSCAN [hdbscan] because they do not require an a-priori $k$. To compare fairly, we tested all competitors with their best settings. Specifically, for HC and KM cosine distance was the best choice, while for SP we used an RBF kernel whose parameter was found through an extensive grid search. The cut on HC has been set such that the error is minimized, as in [MLSP2016]. For AP we used the same similarity measure as for SCDS, while for HDBSCAN the Euclidean distance and a minimum number of points per cluster equal to 2 were used.
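For readers unfamiliar with the Eigengap heuristic, the idea is to pick the number of clusters at the largest jump between consecutive eigenvalues of a graph Laplacian. The following is a minimal sketch under stated assumptions: it uses the symmetric normalized Laplacian, and both this choice and the function name `eigengap_num_clusters` are illustrative rather than a description of the exact variant used in the experiments.

```python
import numpy as np

def eigengap_num_clusters(A, max_k=None):
    """Estimate k as the position of the largest gap in the Laplacian spectrum."""
    A = np.asarray(A, dtype=float)
    degree = A.sum(axis=1)
    inv_sqrt = np.zeros_like(degree)
    positive = degree > 0
    inv_sqrt[positive] = 1.0 / np.sqrt(degree[positive])
    # symmetric normalized Laplacian: I - D^{-1/2} A D^{-1/2}
    L = np.eye(A.shape[0]) - inv_sqrt[:, None] * A * inv_sqrt[None, :]
    eigenvalues = np.linalg.eigvalsh(L)   # returned in ascending order
    if max_k is None:
        max_k = A.shape[0] - 1
    gaps = np.diff(eigenvalues[:max_k + 1])
    # a large gap after the k-th eigenvalue suggests k clusters
    return int(np.argmax(gaps)) + 1
```

On an affinity matrix with a clear block structure, the first $k$ eigenvalues stay close to zero and the estimate coincides with the number of blocks; on noisy affinities the largest gap may be ambiguous.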
III-C Evaluation criteria
To evaluate the clustering quality we used three distinct metrics: the misclassification rate (MR) [Kotti], the adjusted Rand index (ARI) [hubert1985comparing] and the average cluster purity (ACP) [solomonoff]. Using different metrics is important because each of them gives a different perspective on the results: MR quantifies how many speaker labels are inferred correctly from the clusters, while ARI and ACP measure the grouping/separation performance on utterances.
Formally, given a one-to-one mapping between clusters and labels (see Sec. II-D), MR is defined as $MR = \frac{1}{N} \sum_{j=1}^{N_s} e_j$, where $N$ is the total number of audio segments to cluster, $N_s$ the number of speakers, and $e_j$ the number of segments of speaker $j$ classified incorrectly. Cluster purity is a measure to determine how pure clusters are. If a cluster is composed of utterances belonging to the same speaker, then it is completely pure; otherwise (i.e., other speakers are in that cluster, too) purity decreases. Formally, average cluster purity is defined as:
$$ACP = \frac{1}{N} \sum_{i=1}^{N_c} p_i\, n_i, \qquad p_i = \sum_{j=1}^{N_s} \frac{n_{ij}^2}{n_i^2}$$
where $N_c$ is the number of clusters, $n_{ij}$ the number of utterances in cluster $i$ spoken by speaker $j$, and $n_i$ the size of cluster $i$. Finally, the ARI is the normalized version of the Rand index [rand1971objective], with maximum value 1 for perfectly assigned clusters with respect to the expected ones.
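The two simpler metrics can be computed directly from their definitions: MR is the fraction of segments whose assigned speaker label is wrong, and ACP is the size-weighted average of the per-cluster purities, where the purity of a cluster is the sum of squared speaker fractions. The sketch below is an illustration with hypothetical function names; it assumes that the one-to-one mapping from clusters to speaker labels has already been applied for MR.

```python
import numpy as np

def misclassification_rate(predicted_speakers, true_speakers):
    """Fraction of segments whose predicted speaker differs from the true one."""
    predicted = np.asarray(predicted_speakers)
    truth = np.asarray(true_speakers)
    return float(np.mean(predicted != truth))

def average_cluster_purity(cluster_ids, true_speakers):
    """Size-weighted mean of per-cluster purities (sum of squared speaker shares)."""
    cluster_ids = np.asarray(cluster_ids)
    truth = np.asarray(true_speakers)
    total = 0.0
    for c in np.unique(cluster_ids):
        members = truth[cluster_ids == c]
        _, counts = np.unique(members, return_counts=True)
        n_i = members.size
        purity = np.sum(counts.astype(float) ** 2) / n_i ** 2
        total += purity * n_i
    return float(total / cluster_ids.size)
```

For example, two clusters of two utterances each, one pure and one split between two speakers, have purities 1 and 0.5 and therefore an ACP of 0.75.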
III-D Experimental setup
Our proposed method is evaluated in experiments composed as follows: given a set of audio utterances, features are extracted following one of the methods in Sec. II-A and the affinity matrix is computed as in Sec. II-C. Subsequently, the DS are found on top of this graph-based representation. Labeling is performed on each cluster following the methodology proposed in Sec. II-D. The quality of the clusters is then evaluated using the metrics in Sec. III-C. The summarized results are reported in Tables II and III and discussed in the next section. The hyperparameters $\theta$ and $\epsilon$ are kept fixed across all experiments.
In this section, the results of a series of analyses are reported, followed by an overall discussion.
IV-A Initialization of k in the competitors
DS does not need an a-priori number of clusters, while the standard competitors (HC, KM and SP) do. In order to make a fair comparison with these approaches, we used as $k$: the correct number of clusters to be found (marked with its own symbol in Tables II and III), the number of clusters found by DS (likewise marked in the tables) and the number of clusters estimated with Eigengap (symbol # in the tables).
Experimental results show that even when the correct number of clusters is provided to the competitors, SCDS still achieves better results (see tables). This means not only that our method is able to recover a number of clusters close to the right one, but also that it extracts more accurate partitions. When the number of clusters found by DS is given to the other methods, the results they obtain are plausible, showing that our method finds a good number of clusters, while with standard heuristics the performance drops strongly.
IV-B Analysis of different feature extraction methods
In the next experiments, we tested the two CNN-based features, CNN-T and CNN-V. Both provide good features in terms of their capacity to discriminate speakers. With the CNN-T features, the performance of our method saturates on TIMIT Small (see last rows of Table II) and is almost perfect on TIMIT Full (see last rows of Table III). This is mainly explained by the fact that the network used to extract the CNN-T embeddings has been trained on the remaining 590 of the 630 TIMIT speakers [MLSP2016], which biases it towards performing well on the entire dataset.
Surprisingly, features obtained from VGGVox are so generic that they allow almost the same performance for SCDS. This approach is also beneficial for the competitors: all of them perform better in terms of MR/ARI/ACP with CNN-V features than with CNN-T ones (except for KM).
IV-C Cluster labeling
We tested two methods for labeling clusters for our approach (see Max and Hungarian in Sec. II-D), while for all the other competitors we used only the Hungarian algorithm since Max is a peculiarity of DS. Under all conditions and datasets both labeling methods perform the same (see the last rows of the result tables, which list SCDS with Max and with Hungarian labeling). Labeling with the Max method comes for free directly from DS theory, while applying the Hungarian method incurs an additional computational cost.
IV-D Metrics comparison
The three metrics (MR, ARI and ACP) should be analyzed in conjunction because they capture different aspects of the result. Having the lowest MR in the final results on both datasets emphasizes the fact that we are correctly labeling clusters and that the number of misclassified samples is extremely low. On the other hand, reaching the highest ARI shows that our method obtains a good partitioning of the data with respect to the expected clusters. Furthermore, having the highest ACP confirms that clusters extracted with SCDS are mainly composed of utterances from the same speaker.
The proposed method reaches the best scores on all metrics simultaneously. Other methods do reach similar performance, in particular on TIMIT Small (like HC, AP), but none of them works as well as ours in the most complex experimental setting used, TIMIT Full with VGGVox features (where no knowledge of the actual voices to be clustered could possibly enter the features and thus the clustering system).
Table II: Results on TIMIT Small with CNN-T features and CNN-V features.
IV-E Sensitivity analysis
Finally, we report the results of a sensitivity analysis on the two free parameters of our method under two metrics (see Figs. 2 and 3): the precision $\epsilon$ of the replicator dynamics (see Eq. 2) and the relative cut-off $\theta$ (see Sec. II-B). The analysis has been carried out on TIMIT Full with VGGVox features because under this setting a certain amount of variability in the results is observed, which makes the analysis informative. The search space for each parameter spans a range between two extremal values, chosen for the following reasons. A low value of $\theta$ means that a point belongs to a cluster if and only if its level of participation in the cluster is at least that fraction of the centrality of the centroid. A value of $\theta$ close to 1 instead means that the centroid and the sample must be almost exactly the same. In the first case we create clusters that span widely in terms of the similarities of their elements, while in the latter case we create clusters composed of very similar elements. Regarding the parameter $\epsilon$, a very small value requires that two successive steps of Eq. 2 are very close to each other, while a large value allows for a coarse equilibrium point.
Changes in both variables showed that the area in which the performance is stable is large (see the blue area in Fig. 2 and the yellow area in Fig. 3). Performance drops only when extremal parameter values are used. The best parameter choice for each feature type (CNN-T and CNN-V) is shown in Table III as SCDSbest.
Table III: Results on TIMIT Full with CNN-T features and CNN-V features.
IV-F Overall discussion
From a global perspective, the proposed SCDS method performs better than the alternatives on the datasets used, outperforming the state of the art and responding more adaptively even with a model pre-trained on a different dataset. This is particularly evident on TIMIT Full, where SCDS outperforms the competitors even when they are given the right number of clusters to be found. It is worth noting that our clustering method has only two parameters to set, both of which are insensitive to variation, as shown in the sensitivity analysis.
Interestingly, an analysis of misclassified speakers shows that if a speaker is wrongly clustered by DS, it is also wrongly clustered by all other methods. This suggests that in these cases the extracted features, rather than the clustering approach used, may be the reason for the error.
In this paper, we have proposed a novel pipeline for speaker clustering. The proposed method is based on the dominant set clustering algorithm which has been applied to this domain for the first time. It outperforms the previous state of the art and other clustering techniques.
We proposed a method which is almost parameter-free: the two free parameters have little effect on the results, testifying to its stability. Moreover, we successfully used a CNN model pre-trained on a different dataset and report reasonable speaker clustering performance on the TIMIT Full dataset for the first time (after the MR reported by Stadelmann and Freisleben using a classical pipeline [Thilo2009]). Now that we have reached a good starting point with noise-free utterances, we can start considering more complex datasets with their more challenging tasks (noise, segmentation, cross-talk etc.). Future work also includes improving the features using the siamese network proposed by Nagrani et al. [Nagrani17] to extract similarities directly.
Code available at https://github.com/feliksh/SCDS
The authors would like to thank Y. Lukic for providing the MLSP features and J. Salamone for contributing to first results.