Clustering is a fundamental and widely used unsupervised learning technique for extracting and understanding structural properties of datasets (Hartigan, 1988). The input to a clustering algorithm is a set of objects, often represented in the -dimensional space with a distance or similarity measure . Clustering can also be performed in more abstract spaces, e.g., clustering of strings or graphs. The objective is to assign objects into groups, so that similar objects are placed in the same group and dissimilar objects are placed in different groups. The potential of clustering algorithms to reveal the underlying structure in any given dataset has been exploited in a wide variety of domains, such as image processing, bioinformatics, geoscience, and retail marketing.
The quality of clustering is affected by many factors, such as noise in the data or the suitability of the assumptions of the clustering algorithm. Such factors can, in general, prevent any notion of stability in the clustering result. In particular, two data items that are assigned to the same cluster by a clustering algorithm may end up in different clusters if the algorithm is re-run. This variation in the clustering of a dataset can be seen as stemming from both systematic and random errors. The systematic error is due to the stochastic nature of the used clustering algorithm, i.e., re-clustering of dataset using some algorithms leads to slightly different solutions due to, e.g., different initial conditions. The random error in the clustering is due to the variation in the used data sample, i.e., choosing a slightly different data sample will likely change the clustering solution too. These two sources of error in the clustering solution are not easily separable. Nonetheless, knowledge of the stability of the clustering result is important for the proper interpretation of the structure of the data.
In this paper, we introduce a method for assessing the stability of a clustering result in terms of co-occurrence probability, i.e., the probability that data items co-occur in the same cluster. Specifically, we address the following problem: given a dataset and a clustering algorithm, we want to identify the largest set of items within each cluster in the dataset, so that the co-occurrence probability of the items in these within-cluster sets is guaranteed to be at least , where . Each such set is called a core cluster and the addressed problem is referred to as the core clustering problem. The core clustering method hence reflects the combined effect of the systematic and random errors in the clustering of a dataset.
Example. To better motivate the idea of core clustering, consider the following example, using a synthetic dataset of
points generated by a mixture of three Gaussians with unit variance. Suppose that this dataset represents data from patients, each suffering from one out of three possible conditions, and that the condition of individual patients is unknown by a clinician studying the data.
The clinician is interested in grouping the patients in order to investigate patterns in the data, such as which patients are similar to each other and thus more likely to have the same disease. The clinician employs the k-means++ algorithm(Arthur and Vassilvitskii, 2007) to partition the data. However, although clustering the data yields a result, it is difficult to interpret the validity of the result, i.e., how good the clustering result is. Do the patients in the clusters really belong to these groups?
Our goal with the proposed core clustering algorithm is to answer this question by providing a stable clustering result that also gives a statistical guarantee that the data points (in our example patients) in a given cluster co-occur with a probability of at least , where is a given constant.
One way of evaluating the validity of the clustering solution in this example could be by choosing a slightly different sample of patients from the same population. Repeatedly clustering such samples would allow us to track how often a certain patient is placed in a particular group. The intuition is that if some patients are often placed together in the same particular group, it indicates that these patients are strong members of that group. In contrast, it is difficult to label patients that shift from one group to another during different clusterings.
We will now extend this idea of clustering slightly overlapping samples of data into a method that allows us to reach our goal to provide a statistical guarantee on the co-occurrence of data points in clusters. To this end, we employ the following scheme: First, we run k-means++ and record the original clustering. Next, we take a bootstrap sample from the original dataset and re-run k-means++ for that sample. This step is repeated, e.g., times, each time using a new bootstrap sample. The co-occurrence probability of each pair of points is determined as the fraction of times the points have co-occurred in the same cluster during the process. Hence, we can now identify the stable core clusters as the largest set of points within each original cluster, where the points co-occur with a probability of at least . The number of core clusters is equal to the number of original clusters, and the method of core clustering can be seen as refining the output from a clustering algorithm, such as k-means++ used here, finding points within clusters having a strong cluster membership and by excluding certain unstable points from the clustering solution. The clustering of the dataset discussed in our example is shown in Figure 1, where the clusters obtained from the original clustering are shown using different plot symbols (circles, triangles, and rectangles). The points belonging to the core clusters are shown as filled points, and these points always co-occur in the same cluster with a probability of at least . The unfilled points in the figure do not belong to the core clusters and are termed weak points, since their cluster membership is weak as it changes from one clustering to another in more than percent of the cases.
The fact that the points in the core clusters are guaranteed to co-occur with a certain probability is useful in exploratory data mining, and in the above example they would allow the clinician to cluster patients using different confidence levels. This would correspond to testing the hypothesis that two data items belong to the same cluster with a given probability.
The core clusters also allow the behaviour of the clustering algorithm to be evaluated. If the core clusters are very small, it means that the clustering is unstable and the results are unreliable. Core clustering hence offers direct insight into the suitability of a clustering scheme, reflecting the interaction between the clustering algorithm, its settings and the data.
Since clustering is a fundamental tool in exploratory data analysis, the concept of core clustering could be useful in domains typically using clustering, e.g., in bioinformatics.
Related work. Next, we summarise some related work and position core clustering with respect to existing literature. Clustering can be performed in a variety of ways and a plethora of clustering methods has been proposed. However, a thorough review of clustering methods is outside the scope of this paper. The reader can refer to several available surveys (Kogan et al. (2006); Jain et al. (1999); Guinepain and Gruenwald (2005)). Clustering can be either hard, which is the focus of this work, where each data object strictly belongs to a single cluster, or soft, where each data object may belong to more than one partition. Some widely used algorithms for hard clustering include k-means and variants of it (James et al., 2013; Pelleg and Moore, 2000)
, hierarchical clustering, and density-based clustering. For the case of soft clustering, also known as fuzzy clustering(Baraldi and Blonda, 1999), several approaches exist in the literature, including standard algorithms such as fuzzy c-means and extensions (Bezdek, 1981; Hoeppner et al., 1988; Wu et al., 2003) and the Gustafson-Kessel algorithm (Gustafson and Kessel, 1979). Other adaptations of these algorithms include the probabilistic c-means (Krishnapuram and Keller, 1993) and the probabilistic fuzzy c-means (Tushir and Srivastava, 2010)
algorithms, which are motivated by the fact that both probabilistic and membership coefficients are necessary to perform clustering, respectively to reduce outlier sensitivity and to assign data objects to clusters. Thus, both relative and absolute resemblance to cluster centres are taken into account. Similar probabilistic clustering approaches are based on the idea that the underlying data objects are generated by
different probability distributions(Smyth, 1996; McLachlan and Basford, 1988; Taskar et al., 2001). The models correspond to the clusters of interest and the task is to discover these models. However, none of these clustering algorithms can provide statistical guarantees on the result.
This work can be compared with previous work on assessing the significance of clustering solutions in Ojala (2010) and Ojala (2011), which however tries to answer the question whether the clustering solution as a whole is statistically valid by constructing elaborate null models for the data, as opposed to summarising individual co-occurrence patterns as done in this paper.
In the field of robust clustering, the goal is to perform clustering without the result being overly affected by outliers in the data. See, e.g., García-Escudero et al. (2010) for an overview. The problem studied in this paper is different from robust clustering, since in our case the focus is not on removing outlier data points. In contrast, unstable weak points are identified, which are typically not outliers but instead they are points located on the border of different clusters. Hence, the two problems are rather complementary, since core clustering can be used in conjunction with robust clustering methods, as demonstrated in this paper.
Another important problem is the evaluation of clustering stability, which is a model selection problem that focuses on discovering the number of clusters in the data; for a review see von Luxburg (2010). In this paper, we do not consider clustering stability with respect to different choices of the number of clusters. Instead, we study the problem of finding clusters that are stable upon clustering of resampled versions of the original dataset.
Pairwise comparison of the cluster membership of different points is a natural and popular method used extensively when comparing the similarity of two clustering solutions (e.g., Rand (1971); Fowlkes and Mallows (1983)). For reviews of similarity measures related to the pairwise comparison of clusterings see, e.g., Wagner and Wagner (2007); Pfitzner et al. (2009).
Core clustering is a variant of clustering aggregation (consensus clustering), see, e.g., Ghosh and Acharya (2013). Given a set of different clustering results for some data, the goal of clustering aggregation is to find the clustering with the highest mutual agreement between all of the clusterings in the set. Clustering aggregation can be performed using for instance probabilistic models (e.g., Wang et al. (2011); Topchy et al. (2005)), methods based on on pairwise similarity (e.g., Gionis et al. (2007); Nguyen and Caruana (2007)), or methods considering the pairwise co-occurrences of points (e.g., Fred and Jain (2005)).
Methods in which the co-occurrences of points are used to find stable clustering solutions have been presented in fields such as neuroinformatics (Bellec et al., 2010), bioinformatics (Kellam et al., 2001) and statistics (Steinley, 2008).
The methods for finding stable clusters by considering co-occurrences are essentially variations of the evidence accumulation algorithm proposed by Fred (2001); Fred and Jain (2002) and Fred and Jain (2005). In this method clusters are formed by clustering the co-occurrence matrix of items obtained from different clusterings of the data.
The method of core clustering also belongs to this category of methods and the core clusters are formed such that the agreement between all clusterings is at least . Bootstrapping is used to form the co-occurrence matrix as also done by Bellec et al. (2010). In the present paper we show the utility of the core clustering method for explorative data analysis using several clustering and classification algorithms.
Contributions. The main contributions of this paper can be summarised as follows:
We present a method for assessing the stability of clusters in terms of core clusters and we define and analyse theoretically the corresponding core clustering problem.
An algorithm for solving the core clustering problem is proposed.
It is demonstrated that bootstrapping provides a feasible method for evaluating clustering stability.
Through an empirical evaluation on both synthetic and real-world datasets, and for both clustering and classification problems, the proposed algorithm is shown to support the interpretation of the structure of the datasets.
In the next section, we formalise the core clustering problem, and in Sec. 3 the core clustering algorithm is described and analysed theoretically. In Sec. 4, the setup and results of the empirical investigation are presented. The results are discussed in Sec. 5, while the conclusions are summarised and pointers to future work are provided in Sec. 6.
2 Problem Setting
Let be a dataset of items defined in some space . We denote the th data item in as , and assume that these data items have been drawn i.i.d. from an unknown distribution over .
We now consider the problem of clustering the data items in . In other words, our aim is to assign a cluster label to each item in . In a general case, we can express the process of assigning cluster (or class) labels to data items using a clustering function.
Clustering function. Given a dataset of items , a clustering function is a partial function on , having at least in its support, which outputs a cluster index for a data item in its support; the cluster index is denoted as .
In the case of the k-means++ algorithm, the function simply assigns each item with the index of the closest cluster centroid. For some clustering algorithms is defined only for items in , since there may be no natural way to assign a previously unseen data item into a cluster.
A clustering function can also be used in a supervised setting, where each data item in is associated with a class label, and we can use
to train a classifier function, such as a random forest or a support vector machine (SVM). In this case, the classifiers predict the class label given a data item, and hence, we can view the classifier function as a clustering function. As a result, even though there are differences in terms of operation between an unsupervised clustering algorithm and a supervised classifier, we can treat both cases as an assignment task, where each data item in is assigned with either a cluster index or a class label.
We can now define the co-occurrence probability for two data items as follows:
Co-occurrence probability. Given two data items and in , the co-occurrence probability of and is the probability that they co-occur in the same cluster for a data set consisting of as well as data items drawn at random from according to distribution .
The co-occurrence probability is affected both the systematic and random error in the clustering due to the randomness in the clustering algorithm and the variation in the data, as discussed above, and these effects cannot readily be separated.
Next, we proceed to define the core clusters. Clustering the dataset into clusters using gives a disjoint partition of into distinct sets (clusters) : . The set of items in the clusters can now be refined as follows based on the co-occurrence probabilities to give the core clusters:
Core cluster of . Given a set of items and a constant , the subset is a core cluster of , if is the largest set of items in , where the co-occurrence probability of each pair of items is at least .
In other words, we require that any two points that belong to the same core cluster co-occur in that cluster with a probability of at least .
The determination of which data items that belong to the core clusters hence depends only on the co-occurrence probabilities. Given the co-occurrence probabilities of data items we can now proceed as follows to obtain the core clusters. Let be the original partition of dataset into a set of clusters. The co-occurrences between all pairs of data items in each cluster can be represented by a graph, where an edge is defined between items and , if and only if, their co-occurrence probability in is at least . Within each cluster , we proceed with finding the largest maximal clique , resulting in cliques. For these cliques, every two distinct data items are adjacent, and the condition of Definition 3 holds; the core clusters for are given by the largest maximal cliques .
The core clusters, hence, consist of those points in the “cores” of the original clusters, for which the cluster membership is strong. Core clustering can be thought of as a method of refining the original clustering, while the interpretation of the clusters is unmodified.
We also define weak points as follows.
Weak points. Given a dataset with core clusters }, the set of weak points is .
The weak points are hence those data items that do not belong to the core clusters.
Finally, using the above definitions, we can formulate the main problem studied in this paper.
Core clustering. Given a dataset of items, a clustering function , and a constant , we want to find the largest sets of items within each cluster of given by , where all pairs of items co-occur with a probability of at least in the same cluster.
Our main objective is therefore, given a dataset and a result of a clustering algorithm, to refine or enhance this result by identifying those data points within the clusters for which we can provide probabilistic guarantees that the data items in the same cluster co-occur with a probability of at least .
In this section, we present our algorithm for finding core clusters and prove that the core clustering problem is -hard.
3.1 Algorithm for Determining Core Clusters
As noted above, the determination of core clusters only depends on calculating the co-occurrence probabilities of the data items in the dataset using a given clustering function.
Depending on the knowledge of the distribution of the data points in the dataset, we can different strategies for calculating the co-occurrence probability. If the distribution of the data items is known, we can use this distribution directly. However, in most cases the true distribution is unknown and we resort to bootstrapping the original dataset. Below, we present algorithms for both of these scenarios.
The general algorithm for determining core clusters is presented in detail in Algorithm 1 (Fig. 2). The main steps of the algorithm are as follows:
Given a dataset , determine an initial clustering using a suitable clustering function (line 1 in Fig. 2). This could, for example, correspond to running k-means++ on . The role of the initial clustering is discussed below in more detail.
Calculate the co-occurrence probabilities for all pairs of data items in each of the clusters (line 5). This step can be achieved in multiple ways, and below we present two methods.
The process of finding the core clusters as the largest set of items within the original cluster where pairs of items co-occur at a given level naturally leads to a clique-finding problem when we extend the pairwise co-occurrences of items to comprise the set of all items co-occurring at the given level. Hence, as discussed above, to find the core clusters of we find the largest maximal cliques within each of the original clusters (line 7). There are many choices for an algorithm to find the largest cliques, see e.g., Bron and Kerbosch (1973).
Next, we present two algorithms for determining the co-occurrence probability (step 2 of Algorithm 1); one direct, naïve approach and one based on bootstrapping. For both algorithms we also derive the time complexity of the algorithms and the required number of samples needed to reach a sufficient co-occurrence probability accuracy.
If the data distribution is known, e.g., as in the example presented in the Introduction, the co-occurrence probabilities can be computed to the desired accuracy using Algorithm 2 (Fig. 3) by sampling sets of data points from and applying the clustering function to the set consisting of , , and the sampled data points. The co-occurrence probability is then the fraction of samples in which and are assigned to the same cluster.
In order to derive a measure for the accuracy of the co-occurrence probability, we proceed as follows. The standard deviation
of the co-occurrence probability is given by the binomial distribution,
where is the co-occurrence probability and is the number of pairs sampled. If we use a co-occurrence probability of and want to calculate this to a one standard deviation accuracy of at the threshold, we need at least
samples where the pair of points and co-occur. Using samples hence provides good accuracy. However, as each sample contains only one pair of points of interest, i.e, the pair and , a total of samples is required.
The time complexity of Algorithm 2 is , where is the number of samples, is the time complexity of training the clustering algorithm, i.e., finding the clustering function , and is the size of the dataset.
In practice, one seldom has any knowledge of the true underlying data distribution ; in such case one must sample from the original dataset. Also, the computational complexity of directly sampling from is high, which means that this approach, is not feasible to use, although it is a direct implementation of Definition 2. Instead, a natural and efficient choice for acquiring the samples needed for calculating the co-occurrence probabilities is to use the bootstrap approximation.
In order to calculate the co-occurrence probabilities using the non-parametric bootstrap approximation, data points are sampled with replacement from the dataset and the co-occurrence probabilities are calculated using Algorithm 3 (Fig. 4
). The bootstrap procedure allows estimation of the sampling distribution of a parameter, in this case the co-occurrence probabilities, and simulates the effect of drawing new samples from the population.
Different bootstrapping schemes can be used when calculating the co-occurrence probabilities. For instance, the clustering function could be constructed using out-of-bag samples. Using the bootstrap approach should provide a good approximation of the data distribution in most situations. The bootstrap approximation may fail if the clustering algorithm is sensitive to duplicated data points that necessarily appear in the bootstrap samples. One should also notice that the above discussed method of sampling directly from is, in fact, a variant of parametric bootstrapping, where the data generating process is known.
The time complexity of Algorithm 3 using bootstrap samples is , where is the time complexity of the used clustering algorithm and is the size of the dataset.
We will now determine the number of bootstrap samples required for a given co-occurrence probability accuracy. The probability that a randomly chosen item from a dataset of size appears in a bootstrap sample is
Hence, the probability that a randomly chosen pair of points appear in the bootstrap sample is therefore given by meaning that each bootstrap sample on average covers 40 % of the pairs in the dataset. The standard deviation of the co-occurrence probability when bootstrapping is now given by Equation 1 setting . Using bootstrap samples and a co-occurrence probability of here thus gives us a one standard deviation accuracy of 1.5 %.
Both of the above mentioned schemes for calculating the co-occurrence probabilities are usable in a scenario where there is no natural way to assign a cluster index to a previously unseen data item. As will be shown in the experimental evaluation, using the bootstrap agrees well with using the true underlying distribution , when is known.
The number of samples in Algorithm 2 needed to reach a co-occurrence probability accuracy grows as the number of samples in the dataset increases. In contrast, the number of samples needed for a given accuracy using the bootstrap in Algorithm 3 remains constant regardless of the size of the dataset. Algorithm 3 (non-parametric bootstrap approximation) is computationally much more efficient than Algorithm 2 (direct estimation).
The non-parametric bootstrapping method is hence the preferred method to be used in the calculation of the co-occurrence probabilities.
3.1.3 Initial clustering
The core clusters are always determined with respect to an initial clustering (line 1 of Algorithm 1 (Fig. 2)). By definition, the core clusters are constructed so that the agreement between clusterings of different samples of the data must, for the core clusters, overlap in at least percent of the samples. This means that the data items in the core clusters of any given sample will overlap to percent with any other sample. Hence, the choice of initial reference clustering from among bootstrap samples of the dataset is in practice arbitrary. We therefore suggest using any clustering of the full original dataset as the reference clustering with respect to which the core clusters are determined.
3.2 Complexity of finding core clusters
In this section we show that the problem of determining the core clusters is -hard by proving the following theorem.
Finding the core clusters is -hard.
We show that finding the core clusters is -hard by a reduction to the clique problem, which is a classic -hard problem (Karp, 1972). The clique problem is defined as follows: given an undirected graph with vertexes, find the largest fully connected subgraph in . Consider the problem of finding a core cluster from a dataset of items with parameter given by and the clustering function constructed as follows. Assign all data items to the same cluster with a probability
i.e., for any . Otherwise pick a random pair of items : if there is an edge between the items and in then assign and to a cluster of two items, otherwise assign them to singleton clusters; finally assign the remaining items to singleton clusters. Assume that the initial clustering was (by chance) such that all data items were assigned to the same cluster, hence, there will be one core cluster. Now, if there is no edge in between a pair of items then the co-occurrence probability will be
If there is an edge in between the items and they will co-occur in a cluster with a probability of .
Hence, the co-occurrence probability between a pair of items is at least if and only if there is an edge between the items. Therefore, the solution to the core clustering problem using Algorithm 1 (Fig. 2) gives under this construction of the largest clique in .
4.1 Experimental Setup
In the experiments we investigate the following aspects: (i) the impact of core clustering and (ii) the differences between Algorithm 2 (exact sampling from the known distribution) and Algorithm 3 (bootstrap approximation).
4.1.1 Clustering Algorithms
We use both unsupervised learning methods (clustering algorithms) and supervised learning methods (classifiers) to determine the core clustering of different datasets. All experiments are performed in R(R Core Team, 2014). As clustering algorithms we use one parametric method, k-means++(Arthur and Vassilvitskii, 2007), and one non-parametric method, hierarchical clustering (hclust), as well as two robust clustering methods based on trimming as implemented by the tclust R-package (Fritz et al., 2012); trimmed k-means (tkmeans111The tclust function with parameters restr=eigen, restr.fact = 1 and equal.weights=TRUE) (Cuesta-Albertos et al., 1997) and tclust222The tclust function with parameters restr=eigen, restr.fact = 50 and equal.weights=FALSE (García-Escudero et al., 2008). In the robust clustering methods based on trimming, a given fraction of the most outlying data items are trimmed and are not part of the clustering solution. The points that are trimmed are hence comparable to the weak points in the core clustering method. As classifiers we use Random Forest (RF) and support vector machines (SVM), which both are among the best-performing classifiers, see e.g., Fernández-Delgado et al. (2014).
As datasets in the experiments, we use the synthetic data presented above in the motivating example, five datasets from the UCI Machine Learning Repository (Iris, Wine Glass, Yeast and Breast Cancer Wisconsin (BCW))(Bache and Lichman, 2014), and the 10% KDD Cup 1999 dataset (KDD). The properties of the datasets are described in Table 1. Items with missing values in the datasets were removed. As noted by Chawla and Gionis (2013), the three classes normal, neptune and smurf account for 98.4% of the KDD dataset, so we selected only these three classes. Furthermore, we performed variable selection for this dataset, using a random forest classifier to reduce the number of variables in the dataset to 5 (variables 5, 2, 24, 30 and 36).
4.1.3 Experimental Procedure
Two separate experiments were carried out. In the first experiment, we obtain the core clustering for the synthetic and KDD dataset using Algorithm 2 (Fig. 3), which assumes knowledge of the true underlying data distribution.
For the synthetic dataset, we sample directly from the data generating distribution, a mixture of three Gaussians, which is possible since it is known. For the KDD dataset, we initially choose a random sample of 200 data items to cluster, corresponding to 0.04% of the size of the dataset. Since the KDD dataset is very large in comparison to the small sample we wish to cluster, we approximate the true distribution of data items by randomly drawing samples from the entire KDD dataset. The core clustering of the UCI datasets using the true distribution is not possible as the datasets are too small.
We use bootstrap iterations for calculating the co-occurrence probabilities and a confidence threshold of in all experiments. It should be noted that the goal here was not to maximise quality of clustering or classification accuracy, but to demonstrate the effect of core clustering. All clustering and classification functions were hence used at their default settings. The k-means++ algorithm was run ten times and the clustering solution with the smallest within-cluster sum of squares was chosen.
To evaluate the method of core clustering, we use the external validation metric purity (Manning et al., 2009, Sec. 16.3). Purity ranges from zero to one, with unity denoting a perfect match to the ground truth class structure.
All source code used for the experiments, including an R package
corecluster implementing the core clustering algorithm, is
4.2 Experimental Results
4.2.1 Agreement Between True Distribution and Bootstrap Approximation
The agreement between core clustering using the true distribution (Algorithm 2) and the bootstrap approximation (Algorithm 3) is shown in Table 2. The table shows the confusion matrices for the used clustering algorithms when run on the synthetic and KDD datasets, using Algorithm 2 and Algorithm 3. The confusion matrices show that most of the points fall in either the top left or bottom right corner. The results from the two algorithms agree if the majority of points is located on the diagonal. For all clustering algorithms and both datasets, with the exception of hierarchical clustering and trimmed k-means for the synthetic dataset, only a minor proportion of data items are in discord between using the true distribution and using the bootstrap approximation.
4.2.2 Core Clustering for Datasets
The experimental results for all datasets using all classifiers are shown in Table 3 and Table 4. The quality of the clustering is assessed with purity using the known class labels of the datasets. The table shows the purity using the original clustering (), and using core clustering (). The fraction of weak points () is also provided. The weak points are ignored when calculating purity for core clustering. The robust clustering algorithms trimmed k-means and tclust were set to trim 5% of the points. In some cases, the tclust, tkmeans and random forest algorithms failed to cluster a particular data sample. For Algorithm 2 and Algorithm 3, a new sample was then obtained, and if a clustering solution was not obtained in 5 iterations, this iteration was discarded.
The results in Table 4 systematically show that core clustering improves purity in all cases, with the exception of k-means++ clustering for the KDD dataset, for which the drop in purity is 0.01.
4.2.3 Visualisations of Core Clusterings
Examples of core clusterings obtained using the bootstrap approximation of the synthetic, iris, and BCW datasets using different clustering algorithms are shown in Figure 5. Trimmed points, for tclust and tkmeans, are marked with stars.
Figure (b)b, showing core clustering of the synthetic dataset using hierarchical clustering, can be compared to the core clustering of this dataset using k-means++ shown in Figure 1, also calculated using Algorithm 3. It is clear that a large number of points are weak points when using hierarchical clustering (53% of the points, as also seen in Table 4). The same applies to core clustering using tclust (Figure (c)c), which categorises 85% of the points as weak points. The interpretation is that hierarchical clustering and tclust exhibit a high variability in the clustering outcome on different iterations, which leads to core clusters with small radii. This means, that these algorithms are not well suited for the clustering of this dataset. The core clustering using SVM (Figure (c)c) only categorises 20% of the points as weak points, comparable to the results for k-means++.
The core clustering of the iris dataset using k-means (Figure (d)d) clearly shows that the weak points are located between the two rightmost clusters in the figure. This is also visible when using hierarchical clustering and trimmed k-means. For trimmed k-means some peripheral points have also been excluded.
The core clusterings for the BCW dataset shown in Figures (g)g, (h)h and (i)i vary depending on the clustering algorithm. It is clear, that the k-means++ and tclust algorithms are better suited for clustering this dataset than hierarchical clustering, which must discard 22% of the data items as weak points.
As shown above, the method of finding core clusters is at least as hard as finding the maximum clique, but the search for core clusters is easily parallelisable in terms of the original clusters, i.e., each of the core clusters can be found independently in parallel within each of the original clusters (lines 6–7 in Algorithm 1). Changing the number of replicates used in the bootstrap also affect the running time; the time complexity is quadratic with respect to the number of items in the dataset, as shown in Sections 3.1.1 and 3.1.2. Typical running times on the datasets used in the experiments for this paper with the recommended bootstrap method (Algorithm 3) are presented in Table 5, using R-code with some C++ on a 1.8 GHz Intel Core i7 CPU and 1000 bootstrap iterations. The baseline Algorithm 2, which has been presented for comparison, is substantially slower, on the order of days. Clearly, there is a large variation in the running times between the different clustering functions, due to the underlying implementation of the algorithms. The running time of core clustering is dominated by the time required by the used clustering algorithm.
The experimental results show that core clustering improves the homogeneity of the clusters, i.e., in-cluster consistency increases as weak points are excluded from the core clusters. The agreement between use of the true underlying data distribution and the bootstrap approximation is good. This means that the bootstrap method of calculating the co-occurrence probabilities offers a computationally efficient and feasible way of determining the core clusters.
Core clustering reflects the interaction between the clustering algorithm and the data. For some clustering algorithms the result is not deterministic, which means that factors such as, e.g., choice of initial cluster centres affects the clustering outcome on different runs. However, this can be overcome by using methods that optimise the initial conditions, such as the k-means++ method used here, or by combining the output from multiple runs on the same dataset as done, e.g., in Gionis et al. (2007).
Unstable clustering functions, as exemplified in the results using hierarchical clustering for the synthetic dataset, produce core clusters with small radii and a large number of weak points. In general, the cluster radii are proportional to the value of , i.e., a low means that the co-occurrence probability will be high which in turn means that the core clusters will be small due to this strict criterion. Conversely, a high means that the co-occurrence probability is low, and hence the core clusters will be large. However, it should also be noted that the size of the core clusters depends on the characteristics of the data and on the assumptions of the clustering function.
The core clusters make it possible to detect an unstable clustering algorithm and find the data items for which the cluster membership is uncertain. The benefit of using core clustering is in the statistical guarantee it gives on the co-occurrence of data items in the same cluster, and this helps in the interpretation of the structure of the dataset. Core clustering can be used to gain insight into how the clustering algorithm is using the data, by investigating the size of the core clusters, e.g., using different parameters for the clustering algorithm.
The number of weak points detected in the core clustering of a dataset can be used as the instability measure of a clustering algorithm, which allows the stability of a clustering algorithm in terms of the correct number of clusters in the data to be investigated, see, e.g., von Luxburg (2010). In practice, this means that core clusterings of a dataset using a varying number of clusters each yield a different number of weak points. The core clustering with the fewest number of weak points is the most stable clustering, indicating how many clusters the data might contain.
The method of core clustering can also be viewed in terms of hypothesis testing; the points within the core clusters are guaranteed to co-occur with a given probability. The core clusters hence allows the testing of the hypothesis whether two data items belong to the same cluster, at the desired confidence level. This has implications for using clustering algorithms in explorative data mining tasks. The statistical guarantee given by the core clusters can be particularly useful in certain application areas, e.g., in the medical domain, since core clustering can be used to explore the structure of a dataset with a confidence guarantee on the clusters. Clustering is also used in bioinformatics to identify groups of genes with similar expression patterns. Core clusters could be valuable also in this context.
The method of core clustering is in itself model-free, making no assumptions regarding the structure of the data. The only assumptions are those imposed by the used clustering algorithm. Core clustering can be used in conjunction with any unsupervised or supervised learning algorithm, as shown in this paper. A useful property is that core clustering can be used to make a non-robust clustering algorithm, such as traditional k-means, more robust, as shown in the experiments above.
Existing robust clustering methods try to detect outlying points in the dataset using, e.g., some distance metric. These methods also typically require that the proportion of points to be discarded must be given in advance. In contrast, in core clustering one only needs to specify the confidence level for the co-occurrences of the items in the core clusters. This inclusive criterion can be viewed as being more natural than specifying what fraction of data items to discard.
Usually, outliers in a dataset consist of data items in the periphery of clusters. As noted by Fritz et al. (2012), it is also important to remove “bridge points”, i.e., data items located between clusters. Using the core clustering method, it is precisely the weak points not in the core clusters that are the bridge points. Hence, core clustering can be used to augment any clustering algorithm to make it capable of removing bridge points, without making any model assumptions on the data. Core clustering can be used in conjunction with robust clustering functions, as shown in this paper. In this case, both peripheral outliers and bridge points can be efficiently detected.
In the domain of classifiers, the core clusters can be interpreted as presenting the set of data points for which the classifier output is consistent. Core clustering could be used jointly with the sampling-based method introduced in Henelius et al. (2014) allowing introspection into the way a classifier uses the features of the data when making predictions. This would make it possible to study the interplay between the features in the data in areas where the classification results are robust and in the more uncertain areas represented by the weak points.
The problem of core clustering can also be viewed from a Bayesian perspective. Assume that the dataset obeys a Bayesian mixture model of components Gelman et al. (2004). In this case, the posterior co-occurrence probability of two data items belonging to the same mixture component can be computed using the standard Bayesian machinery.
As noted by (Hastie et al., 2009, Sec. 8.4)
, the bootstrap represents an approximate nonparametric, noninformative posterior distribution for a parameter of interest. Hence, the non-parametric bootstrap approximates the Bayesian co-occurrence probability, if the clustering function gives the maximum likelihood mixture assignment for the Bayesian mixture model. For such a clustering function core clusters can therefore be interpreted as sets of points for which the posterior probability of two items occurring in the same cluster is at least. Obviously, the core clusters depend on the modelling assumptions: if the mixture model fits the data well, the co-occurrence probabilities tend to be higher and the core clusters larger than if the model fits the data poorly.
The advantage of the bootstrap approach over direct Bayesian treatment is that we do not need to know the underlying model, or even if the clustering function gives the maximum likelihood estimate of any Bayesian model. The close relation of the bootstrap method to the Bayesian way of computing co-occurrence probabilities gives additional insight into interpreting the results and the interplay between the data and the modelling assumptions here implicit in the clustering function.
The core clustering method is versatile, yet conceptually simple. The sampling of data items can be performed in many different ways. However, since the nature of the true underlying data distribution is seldom known in real applications, the non-parametric bootstrapping of data items for the calculation of the co-occurrences used here is the most feasible approach and it is also computationally fast.
6 Concluding remarks
This paper presents a conceptually simple and efficient method for finding statistically robust clusters. The method is independent of the used clustering algorithm and is equally usable with both clustering algorithms and classifiers.
As demonstrated in the experiments in this paper, the agreement between the bootstrap approximation and the true distribution is high. Furthermore, different bootstrap schemes can be devised using, e.g., out-of-bag samples for cluster estimation. The bootstrap approximation is usable if the clustering algorithm is not sensitive to the fact that the bootstrap approximation necessarily produces duplicated data points. It would also be possible to mitigate this issue, e.g., by jittering the bootstrapped data, as noted by Hennig (2007). Another approach would be to estimate the distribution of data items parametrically and use the obtained distribution to generate new data sets, e.g., using a parametric bootstrap approach. It is therefore possible to fine-tune the way the co-occurrences are calculated, if needed.
The core clusters are naturally interpreted as the strong points in the original clusters obtained using some clustering algorithm, which makes the interpretation of core clusters straightforward. Core clustering can hence be used to make a non-robust clustering algorithm, such as the traditional k-means or hierarchical clustering, more robust by providing a probabilistic guarantee for the cluster membership of data items co-occurring in the same core cluster.
Core clustering further extends robust clustering algorithms by allowing points lying on the border between clusters to be excluded, as shown in the experimental results in this paper, without model assumptions or use of distance metrics.
Summarising, core clustering can be used to find statistically valid core clusters using any clustering or classification algorithm and any dataset which can be resampled. The method is generic and no additional assumptions, such as distance or similarity measures, are needed.
This work was supported by the Academy of Finland (decision 288814), Tekes (Revolution of Knowledge Work project), and the High-Performance Data Mining for Drug Effect Detection project at Stockholm University, funded by the Swedish Foundation for Strategic Research under grant IIS11-0053.
- Arthur and Vassilvitskii (2007) Arthur, D. and Vassilvitskii, S. k-means++: The advantages of careful seeding. In Proceedings of the eighteenth annual ACM-SIAM symposium on Discrete algorithms, pages 1027–1035. Society for Industrial and Applied Mathematics, 2007.
- Bache and Lichman (2014) Bache, K. and Lichman, M. UCI machine learning repository, 2014. URL http://archive.ics.uci.edu/ml.
Baraldi and Blonda (1999)
Baraldi, A. and Blonda, P.
A survey of fuzzy clustering algorithms for pattern recognition.IEEE Trans Syst Man Cybern B, 29:786–801, 1999.
- Bellec et al. (2010) Bellec, P., Rosa-Neto, P., Lyttelton, O. C., Benali, H., and Evans, A. C. Multi-level bootstrap analysis of stable clusters in resting-state fmri. Neuroimage, 51(3):1126–1139, 2010.
- Bezdek (1981) Bezdek, J. C. Pattern Recognition with Fuzzy Objective Function Algorithms. Kluwer Academic Publishers, Norwell, MA, USA, 1981. ISBN 0306406713.
- Bron and Kerbosch (1973) Bron, C. and Kerbosch, J. Algorithm 457: finding all cliques of an undirected graph. Communications of the ACM, 16(9):575–577, 1973.
Chawla and Gionis (2013)
Chawla, S. and Gionis, A.
k-means-: A unified approach to clustering and outlier detection.In SDM, pages 189–197. SIAM, 2013.
- Cuesta-Albertos et al. (1997) Cuesta-Albertos, J., Gordaliza, A., Matrán, C., et al. Trimmed -means: an attempt to robustify quantizers. The Annals of Statistics, 25(2):553–576, 1997.
- Fernández-Delgado et al. (2014) Fernández-Delgado, M., Cernadas, E., Barro, S., and Amorim, D. Do we need hundreds of classifiers to solve real world classification problems? Journal of Machine Learning Research, 15:3133–3181, 2014.
- Fowlkes and Mallows (1983) Fowlkes, E. B. and Mallows, C. L. A method for comparing two hierarchical clusterings. Journal of the American statistical association, 78(383):553–569, 1983.
- Fred (2001) Fred, A. Finding consistent clusters in data partitions. In International Workshop on Multiple Classifier Systems, pages 309–318. Springer, 2001.
- Fred and Jain (2002) Fred, A. and Jain, A. Data clustering using evidence accumulation. In Pattern Recognition, 2002. Proceedings. 16th International Conference on, volume 4, pages 276–280. IEEE, 2002.
- Fred and Jain (2005) Fred, A. L. and Jain, A. K. Combining multiple clusterings using evidence accumulation. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 27(6):835–850, 2005.
Fritz et al. (2012)
Fritz, H., García-Escudero, L. A., and Mayo-Iscar, A.
tclust: An R package for a trimming approach to cluster analysis.Journal of Statistical Software, 47(12):1–26, 2012. URL http://www.jstatsoft.org/v47/i12/.
- García-Escudero et al. (2008) García-Escudero, L. A., Gordaliza, A., Matrán, C., and Mayo-Iscar, A. A general trimming approach to robust cluster analysis. The Annals of Statistics, pages 1324–1345, 2008.
- García-Escudero et al. (2010) García-Escudero, L. A., Gordaliza, A., Matrán, C., and Mayo-Iscar, A. A review of robust clustering methods. Advances in Data Analysis and Classification, 4(2-3):89–109, 2010.
- Gelman et al. (2004) Gelman, A., Carlin, J. B., Stern, H. S., and Rubin, D. B. Bayesian data analysis. Taylor & Francis, 2 edition, 2004.
- Ghosh and Acharya (2013) Ghosh, J. and Acharya, A. Cluster Ensembles: Theory and Applications. In Aggarwal, C. C. and Reddy, C. K., editors, Data clustering: Algorithms and Applications. CRC Press, 2013.
- Gionis et al. (2007) Gionis, A., Mannila, H., and Tsaparas, P. Clustering aggregation. ACM Transactions on Knowledge Discovery from Data (TKDD), 1(1):4, 2007.
- Guinepain and Gruenwald (2005) Guinepain, S. and Gruenwald, L. Research issues in automatic database clustering. SIGMOD Rec., 34(1):33–38, Mar. 2005. ISSN 0163-5808. doi: 10.1145/1058150.1058157. URL http://doi.acm.org/10.1145/1058150.1058157.
- Gustafson and Kessel (1979) Gustafson, E. E. and Kessel, W. C. Fuzzy clustering with a fuzzy covariance matrix. In In Proceedings of the IEEE Conference on Decision and Control, page 761–766. Piscataway, NJ, USA. IEEE Press, 1979.
- Hartigan (1988) Hartigan, J. Clustering Algorithms. New York: John Wiley and Sons, Inc, 1988.
- Hastie et al. (2009) Hastie, T., Tibshirani, R., and Friedman, J. The Elements of Statistical Learning – Data Mining, Inference and Prediction. Springer, 2 edition, 2009.
- Henelius et al. (2014) Henelius, A., Puolamäki, K., Boström, H., Asker, L., and Papapetrou, P. A peek into the black box: Exploring classifiers by randomization. Data Mining and Knowledge Discovery, 28(5–6):1503–1529, Sept. 2014.
- Hennig (2007) Hennig, C. Cluster-wise assessment of cluster stability. Computational Statistics & Data Analysis, 52(1):258–271, 2007.
- Hoeppner et al. (1988) Hoeppner, F., F., K., Kruse, R., and Runkler, T. Fuzzy Cluster Analysis: Methods for Classification, Data Analysis and Image Recognition. John Wiley and Sons Ltd, New York, NY, USA, 1988.
- Jain et al. (1999) Jain, A. K., Murty, M. N., and Flynn, P. J. Data clustering: A review. ACM Comput. Surv., 31(3):264–323, Sept. 1999. ISSN 0360-0300. doi: 10.1145/331499.331504. URL http://doi.acm.org/10.1145/331499.331504.
- James et al. (2013) James, G., Witten, D., Hastie, T., and Tibshirani, R. An Introduction to Statistical Learning. Springer, 2013.
- Karp (1972) Karp, R. M. Reducibility among combinatorial problems. Springer, 1972.
- Kellam et al. (2001) Kellam, P., Liu, X., Martin, N., Orengo, C., Swift, S., and Tucker, A. Comparing, contrasting and combining clusters in viral gene expression data. In Proceedings of 6th workshop on intelligent data analysis in medicine and pharmocology, pages 56–62. Citeseer, 2001.
- Kogan et al. (2006) Kogan, J., Nicholas, C., and Teboulle, M., editors. A Survey of Clustering Data Mining Techniques, 2006. Springer Berlin Heidelberg. ISBN 978-3-540-28348-5. doi: 10.1007/3-540-28349-8˙2. URL http://dx.doi.org/10.1007/3-540-28349-8_2.
- Krishnapuram and Keller (1993) Krishnapuram, R. and Keller, J. M. A possibilistic approach to clustering. Trans. Fuz Sys., 1(2):98–110, May 1993. ISSN 1063-6706. doi: 10.1109/91.227387. URL http://dx.doi.org/10.1109/91.227387.
- Manning et al. (2009) Manning, C. D., Raghavan, P., and Schütze, H. Introduction to Information Retrieval. Cambridge University Press, 2009.
- McLachlan and Basford (1988) McLachlan, G. J. and Basford, K. E. Mixture models: inference and application to clustering. New York: Marcel Dekker, 1988.
- Nguyen and Caruana (2007) Nguyen, N. and Caruana, R. Consensus clusterings. In Seventh IEEE International Conference on Data Mining (ICDM 2007), pages 607–612. IEEE, 2007.
- Ojala (2010) Ojala, M. Assessing data mining results on matrices with randomization. In Proceedings of the 10th IEEE International Conference on Data Mining (ICDM 2010), pages 959–964, 2010.
- Ojala (2011) Ojala, M. Randomization Algorithms for Assessing the Significance of Data Mining Results. PhD thesis, Aalto University School of Science, 2011.
- Pelleg and Moore (2000) Pelleg, D. and Moore, A. W. X-means: Extending k-means with efficient estimation of the number of clusters. In Proceedings of the Seventeenth International Conference on Machine Learning, ICML ’00, pages 727–734, San Francisco, CA, USA, 2000. Morgan Kaufmann Publishers Inc. ISBN 1-55860-707-2. URL http://dl.acm.org/citation.cfm?id=645529.657808.
- Pfitzner et al. (2009) Pfitzner, D., Leibbrandt, R., and Powers, D. Characterization and evaluation of similarity measures for pairs of clusterings. Knowledge and Information Systems, 19(3):361–394, 2009.
- R Core Team (2014) R Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria, 2014. URL http://www.R-project.org/.
- Rand (1971) Rand, W. M. Objective criteria for the evaluation of clustering methods. Journal of the American Statistical association, 66(336):846–850, 1971.
- Smyth (1996) Smyth, P. Clustering using monte carlo cross-validation. ACM SIGKDD, pages 126–133, 1996.
- Steinley (2008) Steinley, D. Stability analysis in k-means clustering. British Journal of Mathematical and Statistical Psychology, 61(2):255–273, 2008.
Taskar et al. (2001)
Taskar, B., Segal, E., and Koller, D.
Probabilistic classification and clustering in relational data.
Proceedings of the 17th International Joint Conference on Artificial Intelligence - Volume 2, IJCAI’01, pages 870–876, San Francisco, CA, USA, 2001. Morgan Kaufmann Publishers Inc. ISBN 1-55860-812-5, 978-1-558-60812-2. URL http://dl.acm.org/citation.cfm?id=1642194.1642210.
- Topchy et al. (2005) Topchy, A., Jain, A. K., and Punch, W. Clustering ensembles: Models of consensus and weak partitions. IEEE Transactions on pattern analysis and machine intelligence, 27(12):1866–1881, 2005.
- Tushir and Srivastava (2010) Tushir, M. and Srivastava, S. A new kernelized hybrid c-mean clustering model with optimized parameters. Appl. Soft Comput., 10(2):381–389, Mar. 2010. ISSN 1568-4946. doi: 10.1016/j.asoc.2009.08.020. URL http://dx.doi.org/10.1016/j.asoc.2009.08.020.
- von Luxburg (2010) von Luxburg, U. Clustering stability: An overview. Foundations and Trends in Machine Learning, 2(3):235–274, 2010. doi: http://arxiv.org/abs/1007.1075v1.
- Wagner and Wagner (2007) Wagner, S. and Wagner, D. Comparing clusterings: an overview. Universität Karlsruhe, Fakultät für Informatik Karlsruhe, 2007.
- Wang et al. (2011) Wang, H., Shan, H., and Banerjee, A. Bayesian cluster ensembles. Statistical Analysis and Data Mining, 4(1):54–70, 2011.
- Wu et al. (2003) Wu, Z., Xie, W., and Yu, J. Fuzzy clustering with a fuzzy covariance matrix. In In Proceedings of the Fifth International Conference on Computational Intelligence and Multimedia Applications (ICCIMA), pages 1–6, 2003.