Constraint-Based Clustering Selection

09/23/2016 ∙ by Toon Van Craenendonck, et al.

Semi-supervised clustering methods incorporate a limited amount of supervision into the clustering process. Typically, this supervision is provided by the user in the form of pairwise constraints. Existing methods use such constraints in one of the following ways: they adapt their clustering procedure, their similarity metric, or both. All of these approaches operate within the scope of individual clustering algorithms. In contrast, we propose to use constraints to choose between clusterings generated by very different unsupervised clustering algorithms, run with different parameter settings. We empirically show that this simple approach often outperforms existing semi-supervised clustering methods.




I Introduction

Clustering is one of the core tasks in data analysis [19]. It is inherently subjective, as users may prefer very different clusterings of the same data [10, 34]. Semi-supervised clustering [35, 37] aims to deal with this subjectivity by allowing the user to specify background knowledge, often in the form of pairwise constraints that indicate whether two instances should be in the same cluster or not.

Traditional approaches to semi-supervised (or constraint-based) clustering use constraints in one of the following three ways. First, one can modify an existing clustering algorithm to take them into account. This approach is taken in COP-KMeans [35], one of the first clustering algorithms able to deal with pairwise constraints. Second, one can learn a distance metric based on the constraints [37], after which the metric is used within a traditional clustering algorithm. Third, one can combine the above two approaches and develop so-called hybrid methods [7].

Our approach to constraint-based clustering is quite different from existing methods, and does not fit in any of these three categories. It is motivated by the well-known fact that different algorithms may produce very different clusterings of the same data [14], and even within one algorithm, different parameter settings may yield different clusterings. This implies that selecting a clustering algorithm and tuning its parameter settings is crucial to obtain a good clustering.

We propose to use constraints to solve these tasks: to find an appropriate clustering, we first generate a set of clusterings using several unsupervised algorithms, with different hyperparameter settings, and afterwards select from this set the clustering that satisfies the largest number of constraints. Our experiments show that, surprisingly, this simple constraint-based selection approach often yields better clusterings than existing semi-supervised algorithms. This shows that it is more important to use an algorithm of which the inherent bias matches a particular problem, than to modify the optimization criterion of any individual algorithm to take the constraints into account. We also present a method for selecting the most informative constraints first, which further increases the usefulness of our approach.

The remainder of this paper is structured as follows. In Section II we give some background on semi-supervised clustering, and on algorithm and hyperparameter selection for clustering. Section III presents our approach to using pairwise constraints in clustering, which we call COBS (for Constraint-Based Selection). In Section IV we describe how COBS can be extended to actively select informative constraints. We conclude in Section V.

II Background

We first describe related work on semi-supervised clustering. As our approach consists of using constraints to choose an algorithm and tune its parameters, we also discuss related work on meta-learning for clustering, as well as on algorithm and hyperparameter selection.

Semi-supervised clustering algorithms allow the user to incorporate a limited amount of supervision into the clustering procedure. Several kinds of supervision have been proposed, one of the most popular ones being pairwise constraints. Must-link (ML) constraints indicate that two instances should be in the same cluster, cannot-link (CL) constraints that they should be in different clusters. Most existing semi-supervised approaches use such constraints within the scope of an individual clustering algorithm. COP-KMeans [35], for example, modifies the clustering assignment step of K-means: instances are assigned to the closest cluster for which the assignment does not violate any constraints. Similarly, the clustering procedures of DBSCAN [28, 21, 9], EM [29] and spectral clustering [26, 36] have been extended to incorporate pairwise constraints. Another approach to semi-supervised clustering is to learn a distance metric based on the constraints [37, 3, 11]. Xing et al. [37], for example, propose to learn a Mahalanobis distance by solving a convex optimization problem in which the distance between instances with a must-link constraint between them is minimized, while simultaneously separating instances connected by a cannot-link constraint. Hybrid algorithms, such as MPCKMeans [7], combine metric learning with an adapted clustering procedure.

Meta-learning and algorithm selection have been studied extensively in supervised learning [8, 30], but much less in clustering. There is some work on building meta-learning systems that recommend clustering algorithms [12, 15]. However, these systems do not take hyperparameter selection into account, or any form of supervision. More related to ours is the work of Caruana et al. [10]. They generate a large number of clusterings using K-means and spectral clustering, and cluster these clusterings. This meta-clustering is presented to the user as a dendrogram. Here, we also generate a set of clusterings, but afterwards we select from that set the most suitable clustering based on pairwise constraints. The only other work, to our knowledge, that has explored the use of pairwise constraints for algorithm selection is that by Adam and Blockeel [1]. They define a meta-feature based on constraints, and use this feature to predict whether EM or spectral clustering will perform better for a dataset. While their meta-feature attempts to capture one specific property of the desired clusters, i.e. whether they overlap, our approach is more general and allows selection between any clustering algorithms.

Whereas algorithm selection has received little attention in clustering, several methods have been proposed for hyperparameter selection.

One strategy is to run the algorithm with several parameter settings, and select the clustering that scores highest on an internal quality measure [2, 31]. Such measures try to capture the idea of a “good” clustering. A first limitation is that they are not able to deal with the inherent subjectivity of clustering, as they do not take any external information into account. Furthermore, internal measures are only applicable within the scope of individual clustering algorithms, as each of them comes with its own bias [34]. For example, the vast majority of them have a preference for spherical clusters, making them suitable for K-means, but not for e.g. spectral clustering and DBSCAN.

Another strategy for parameter selection in clustering is based on stability analysis [6, 33, 5]. A parameter setting is considered to be stable if similar clusterings are produced with that setting when it is applied to several datasets from the same underlying model. These datasets can for example be obtained by taking subsamples of the original dataset [6, 20]. In contrast to internal quality measures, stability analysis does not require an explicit definition of what it means for a clustering to be good. Most studies on stability focus on selecting parameter settings in the scope of individual algorithms (in particular, often the number of clusters).

Additionally, one can also avoid the need for explicit parameter selection. In self-tuning spectral clustering [39], for example, the affinity matrix is constructed based on local statistics and the number of clusters is estimated using the structure of the eigenvectors of the Laplacian.

A key distinction with COBS is that none of the above methods takes the subjective preferences of the user into account. We will compare our constraint-based selection strategy to some of them in the next section.

III Constraint-based clustering selection

Algorithm and hyperparameter selection are difficult tasks in an entirely unsupervised setting, mainly due to the lack of a well-defined way to estimate the quality of clustering results [14]. We propose to use constraints for this purpose, and estimate the quality of a clustering as the number of constraints that it satisfies. This quality estimate allows us to do a search over unsupervised algorithms and their parameter settings, as described in Algorithm 1. We use a basic grid search, but in principle more advanced optimization strategies could also be used [30, 18]. We assume that we are given a set of must-link constraints ML, where $(i, j) \in ML$ indicates that instances $x_i$ and $x_j$ should be in the same cluster. Similarly, we are given a set of cannot-link constraints CL, where $(i, j) \in CL$ indicates that $x_i$ and $x_j$ should be in different clusters. A clustering $c$ maps instances (through their index) to their cluster label, i.e. $c[i] = k$ indicates that in clustering $c$, $x_i$ is an element of cluster $k$. The indicator function $\mathbb{1}[\cdot]$ has value one if the enclosed expression is true, zero otherwise. We select the “best” solution from a set of clusterings $C$ as the one satisfying the largest number of constraints (in case of a tie, we select randomly from the involved clusterings).

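Input: $D$: a dataset; ML: set of must-link constraints; CL: set of cannot-link constraints
Output: a clustering of $D$
1: Generate a set of clusterings $C$ by varying the hyperparameters of several unsupervised clustering algorithms
2: return $\arg\max_{c \in C} \left( \sum_{(i,j) \in ML} \mathbb{1}[c[i] = c[j]] + \sum_{(i,j) \in CL} \mathbb{1}[c[i] \neq c[j]] \right)$
Algorithm 1 Constraint-based selection (COBS)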
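The selection step of Algorithm 1 can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used in the experiments: the function names `satisfied` and `cobs_select` are hypothetical, clusterings are assumed to be plain lists of cluster labels indexed by instance, and the candidate clusterings are assumed to have been generated beforehand.

```python
import random

def satisfied(clustering, must_link, cannot_link):
    """Count the constraints that a clustering satisfies.

    clustering[i] is the cluster label of instance i;
    must_link and cannot_link are lists of index pairs (i, j).
    """
    ml_ok = sum(1 for i, j in must_link if clustering[i] == clustering[j])
    cl_ok = sum(1 for i, j in cannot_link if clustering[i] != clustering[j])
    return ml_ok + cl_ok

def cobs_select(clusterings, must_link, cannot_link, rng=None):
    """Return the clustering satisfying the most constraints (ties broken randomly)."""
    rng = rng or random.Random(0)
    scores = [satisfied(c, must_link, cannot_link) for c in clusterings]
    best = max(scores)
    tied = [c for c, s in zip(clusterings, scores) if s == best]
    return rng.choice(tied)

# Toy example: three candidate clusterings of six instances.
candidates = [
    [0, 0, 0, 1, 1, 1],
    [0, 0, 1, 1, 2, 2],
    [0, 1, 0, 1, 0, 1],
]
ml = [(0, 1), (3, 4)]
cl = [(0, 5), (2, 3)]
chosen = cobs_select(candidates, ml, cl)
```

In this toy example the first candidate satisfies all four constraints, while the other two satisfy only two each, so the first one is returned.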

COBS is motivated by the following two observations.

First, it is commonly accepted that no single algorithm performs best on all clustering problems: each algorithm comes with its own bias, which may match a particular problem to a greater or lesser degree [14]. Traditional semi-supervised approaches use constraints within the scope of an individual algorithm. By doing so, they can change the bias of the algorithm, but only to a certain extent. For instance, using constraints to learn a Mahalanobis distance allows K-means to find ellipsoidal clusters, rather than spherical ones, but still does not make it possible to find non-convex clusters. In contrast, by using constraints to choose between clusterings generated by very different algorithms, COBS aims to select the most suitable one from a diverse range of biases.

Second, it is also widely known that within a single clustering algorithm the choice of the hyperparameters can strongly influence the clustering result. Consequently, choosing a good parameter setting is crucial. Currently, a user can either do this manually, or use one of the selection strategies discussed in Section II. Both options come with significant drawbacks. Doing parameter tuning manually is time-consuming, given the often large number of combinations one might try. Existing automated selection strategies avoid this manual labor, but can easily fail to select a good setting as they do not take the user’s preferences into account. For COBS, parameters are an asset rather than a burden. They allow generating a large and diverse set of clusterings, from which we can select the most suitable solution with a limited number of pairwise constraints.

Although our approach is very simple, it does not appear to have been studied before, neither as a way to incorporate constraints into clustering, nor as a way to select clustering algorithms and their parameter settings (despite the substantial body of research on both constraint-based clustering and hyperparameter selection).

Research questions

In the remainder of this section, we aim to answer the following questions:

  • How does COBS, for hyperparameter selection only, compare to unsupervised hyperparameter selection methods?

  • How does COBS, for hyperparameter selection only, compare to existing semi-supervised clustering algorithms?

  • How does COBS, for both algorithm and hyperparameter selection, compare to existing semi-supervised algorithms?

  • Can we improve COBS by using semi-supervised algorithms to generate clusterings, instead of unsupervised ones?

Although our selection strategy is also related to meta-clustering [10], an experimental comparison would be difficult as meta-clustering produces a dendrogram of clusterings for the user to explore. The user can traverse this dendrogram to obtain a single clustering, but the outcome of this process is highly subjective. COBS works with pairwise constraints, therefore we compare to other methods that do the same.

Experimental methodology

To answer our research questions we perform experiments with 10 UCI classification datasets, listed in Table I. These have also been used in several other studies on semi-supervised clustering [7, 38]. The optdigits389 dataset is a subset of the UCI handwritten digits dataset, containing only digits 3, 8 and 9 [7, 22]. The classes are assumed to represent the clusters of interest. We evaluate how well the returned clusters coincide with them by computing the Adjusted Rand Index (ARI) [17], which is a commonly used measure for this; 0 means that the clustering is not better than random, 1 is a perfect match. In our experiments with semi-supervised clustering, we always repeat the following steps 25 times and report average results:

  1. Randomly partition the full dataset into 70% (“potential supervision set”) and 30% (“left-out set”).

  2. Generate a fixed number of pairwise constraints (this number is a parameter) by repeatedly selecting two random instances from the supervision set, and adding a must-link constraint if they belong to the same class, and a cannot-link otherwise.

  3. Apply COBS to the full dataset to obtain a clustering.

  4. Evaluate the clustering by calculating the ARI on all objects that were not involved in any constraints.

We avoid including pairs in the evaluation that were among the given constraints, as this would be the equivalent of testing on the training set.
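The constraint generation and evaluation steps above can be sketched as follows. This is an illustrative sketch under stated assumptions: the function names are hypothetical, class labels are given as a plain list, and the ARI is computed directly from its standard contingency-table definition rather than with a library call.

```python
import random
from collections import Counter
from math import comb

def generate_constraints(labels, n_constraints, supervision_fraction=0.7, seed=0):
    """Steps 1-2: sample pairs from a random 70% supervision set."""
    rng = random.Random(seed)
    indices = list(range(len(labels)))
    rng.shuffle(indices)
    pool = indices[: int(supervision_fraction * len(indices))]
    must_link, cannot_link = [], []
    for _ in range(n_constraints):
        i, j = rng.sample(pool, 2)  # two distinct instances
        if labels[i] == labels[j]:
            must_link.append((i, j))
        else:
            cannot_link.append((i, j))
    return must_link, cannot_link

def adjusted_rand_index(labels_true, labels_pred):
    """ARI from pair counts of the contingency table."""
    n = len(labels_true)
    if n < 2:
        return 1.0
    sum_cells = sum(comb(c, 2) for c in Counter(zip(labels_true, labels_pred)).values())
    sum_rows = sum(comb(c, 2) for c in Counter(labels_true).values())
    sum_cols = sum(comb(c, 2) for c in Counter(labels_pred).values())
    expected = sum_rows * sum_cols / comb(n, 2)
    max_index = (sum_rows + sum_cols) / 2
    if max_index == expected:  # degenerate case, e.g. a single cluster on both sides
        return 1.0
    return (sum_cells - expected) / (max_index - expected)

def evaluate(labels_true, clustering, must_link, cannot_link):
    """Step 4: ARI restricted to instances not involved in any constraint."""
    used = {i for pair in must_link + cannot_link for i in pair}
    keep = [i for i in range(len(labels_true)) if i not in used]
    return adjusted_rand_index([labels_true[i] for i in keep],
                               [clustering[i] for i in keep])
```

Because the 30% left-out set never contributes to a constraint, the evaluation set is never empty, even when many constraints are sampled.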

dataset # instances # features # classes
wine 178 13 3
dermatology 358 33 6
iris 147 4 3
ionosphere 350 34 2
breast-cancer-wisconsin 449 32 2
ecoli 336 7 8
optdigits389 1151 64 3
segmentation 2100 19 7
glass 214 10 7
hepatitis 112 19 2
TABLE I: Datasets used in the experiments. Duplicate instances and instances with missing values are removed.

We use K-means, DBSCAN and spectral clustering to generate clusterings in step one of Algorithm 1, as they are common representatives of different types of algorithms (we use implementations from scikit-learn [24]). The hyperparameters are varied in the ranges specified in Table II. In particular, for each dataset we generate 180 clusterings using K-means (for each number of clusters we store the clusterings obtained with 20 random initializations), 351 using spectral clustering and 400 using DBSCAN, yielding a total of 931 clusterings. For discrete parameters, clusterings are generated for the complete range. For continuous parameters, clusterings are generated using 20 evenly spaced values in the specified intervals. For the $\epsilon$ parameter used in DBSCAN, the lower and upper bounds are the minimum and maximum pairwise distances between instances (these bounds are the ones listed in Table II).
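The arithmetic behind these counts can be illustrated with a small grid-construction sketch. The concrete ranges below are assumptions, chosen only so that the grid sizes match the reported 180, 400 and 351 clusterings; the 20 random initializations and the 20 evenly spaced values for continuous parameters come from the text, and the dictionary keys are hypothetical names rather than scikit-learn arguments.

```python
from itertools import product

def linspace(lo, hi, n):
    """Return n evenly spaced values from lo to hi (both included)."""
    step = (hi - lo) / (n - 1)
    return [lo + t * step for t in range(n)]

# ASSUMED ranges (not taken from the paper), picked to reproduce the counts.
K_VALUES = range(2, 11)                  # 9 numbers of clusters (assumption)
SEEDS = range(20)                        # 20 random initializations (from the text)
D_MIN, D_MAX = 0.05, 3.2                 # placeholder min/max pairwise distances
EPS_VALUES = linspace(D_MIN, D_MAX, 20)  # 20 evenly spaced values (from the text)
MIN_PTS_VALUES = range(2, 22)            # 20 discrete values (assumption)
KNN_VALUES = range(2, 21)                # 19 neighborhood sizes (assumption)
SIGMA_VALUES = linspace(0.01, 5.0, 20)   # 20 kernel widths (assumption)

kmeans_grid = [{"n_clusters": K, "seed": s} for K, s in product(K_VALUES, SEEDS)]
dbscan_grid = [{"eps": e, "min_pts": m} for e, m in product(EPS_VALUES, MIN_PTS_VALUES)]
spectral_grid = (
    [{"n_clusters": K, "graph": "knn", "k": k} for K, k in product(K_VALUES, KNN_VALUES)]
    + [{"n_clusters": K, "graph": "gaussian", "sigma": s}
       for K, s in product(K_VALUES, SIGMA_VALUES)]
)

total = len(kmeans_grid) + len(dbscan_grid) + len(spectral_grid)
```

Under these assumed ranges the grids contain 9 × 20 = 180, 20 × 20 = 400 and 9 × (19 + 20) = 351 settings. Other decompositions of 351 are possible, so the spectral ranges in particular should be read as one consistent guess.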

All datasets are normalized by rescaling each feature to a fixed range. We use the Euclidean distance for all unsupervised algorithms.

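Algorithm | Param. | Range | Selection method
K-means | $K$ | | silhouette index
DBSCAN | $\epsilon$ | [minimum, maximum pairwise distance] | DBCV index
DBSCAN | $\mathit{minPts}$ | | DBCV index
spectral clustering | $K$, $k$, $\sigma$ | | self-tuning spectral clustering
TABLE II: Algorithms used, the hyperparameters that were varied, their corresponding ranges and the hyperparameter selection methods used in Q1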

Q1: COBS vs. unsupervised hyperparameter tuning

To evaluate hyperparameter selection for individual algorithms, we use Algorithm 1 with a set of clusterings generated using one particular algorithm (K-means, DBSCAN or spectral). We compare COBS to state of the art unsupervised selection strategies. As there is no single method that can be used for all three algorithms, we use three different approaches, which are briefly described next.

K-means has one hyperparameter: the number of clusters $K$. A popular method to select $K$ in K-means is by using internal clustering quality measures [31, 2]. K-means is run for different values of $K$ (and in this case also for different random seeds), and afterwards the clustering that scores highest on such an internal measure is chosen. In our setup, we generate 20 clusterings for each $K$ by using different random seeds. We select the clustering that scores highest on the silhouette index [27], which was identified as one of the best internal criteria by Arbelaitz et al. [2].

DBSCAN has two parameters: $\epsilon$, which specifies how close points should be to be in the same neighborhood, and $\mathit{minPts}$, which specifies the number of points that are required in the neighborhood to be a core point. Most internal criteria are not suited for DBSCAN, as they assume spherical clusters, and one of the key characteristics of DBSCAN is that it can find clusters with arbitrary shape. One exception is the Density-Based Cluster Validation (DBCV) score [23], which we use in our experiments.

Spectral clustering requires the construction of a similarity graph, which can be done in several ways [32]. If a $k$-nearest neighbor graph is used, $k$ has to be set. For graphs based on a Gaussian similarity function, $\sigma$ has to be set to specify the width of the neighborhoods. The number of clusters $K$ should also be specified. Self-tuning spectral clustering [39] avoids having to specify any of these parameters, by relying on local statistics to compute different $\sigma$ values for each instance, and by exploiting structure in the eigenvectors to determine the number of clusters. This approach is different from the one used for K-means and DBSCAN, as here we do not generate a set of clusterings first, but instead hyperparameters are estimated directly from the data.
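The idea of computing a separate $\sigma$ for each instance can be made concrete with a sketch of a locally scaled affinity matrix. The sketch assumes the usual local-scaling form, in which $\sigma_i$ is the distance from instance $i$ to one of its nearest neighbors and the affinity is $\exp(-d_{ij}^2 / (\sigma_i \sigma_j))$. The function name and the default neighbor rank of 7 are assumptions of this sketch, and the estimation of the number of clusters from the eigenvectors is not shown.

```python
from math import dist, exp

def local_scaling_affinity(points, neighbor_rank=7):
    """Affinity matrix with one scale per instance.

    sigma_i is the distance from point i to its neighbor_rank-th nearest
    neighbor. Assumes no duplicate points, so that every sigma_i is positive.
    """
    n = len(points)
    d = [[dist(p, q) for q in points] for p in points]
    rank = min(neighbor_rank, n - 1)
    # sorted(row)[0] is the zero distance of a point to itself,
    # so index `rank` is the rank-th nearest neighbor.
    sigma = [sorted(row)[rank] for row in d]
    affinity = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                affinity[i][j] = exp(-d[i][j] ** 2 / (sigma[i] * sigma[j]))
    return affinity

points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (5.0, 5.0)]
A = local_scaling_affinity(points, neighbor_rank=1)
```

Instances in dense regions get a small $\sigma_i$ and isolated instances a large one, which is what removes the need for a single global kernel width.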


K-means (Q1: SI, COBS) and MPCKMeans (Q2: SI, NumSat, CVCP)
dataset Q1-SI Q1-COBS Q2-SI Q2-NumSat Q2-CVCP
wine 0.85 0.81 0.86 0.68 0.70
dermatology 0.57 0.84 0.59 0.46 0.42
iris 0.56 0.66 0.62 0.72 0.65
ionosphere 0.27 0.24 0.24 0.19 0.17
breast-cancer-wisconsin 0.73 0.67 0.73 0.73 0.71
ecoli 0.04 0.62 0.70 0.51 0.45
optdigits389 0.49 0.79 0.58 0.28 0.49
segmentation 0.10 0.51 0.38 0.19 0.28
hepatitis 0.19 0.18 0.25 0.18 0.14
glass 0.23 0.2 0.24 0.17 0.20

DBSCAN (Q1: DBCV, COBS) and FOSC (Q2)
dataset Q1-DBCV Q1-COBS Q2-FOSC
wine 0.32 0.36 0.53
dermatology 0.37 0.40 0.76
iris 0.56 0.50 0.80
ionosphere 0.05 0.66 -0.04
breast-cancer-wisconsin 0.65 0.72 0.53
ecoli 0.03 0.44 0.56
optdigits389 0.00 0.27 0.55
segmentation 0.24 0.37 0.54
hepatitis -0.13 0.02 0.23
glass 0.01 0.14 0.15

spectral (Q1: STS, COBS) and COSC (Q2: eigen, NumSat, CVCP)
dataset Q1-STS Q1-COBS Q2-eigen Q2-NumSat Q2-CVCP
wine 0.9 0.89 0.50 0.50 0.68
dermatology 0.21 0.88 0.38 0.38 0.50
iris 0.56 0.81 0.84 0.43 0.60
ionosphere 0.24 0.23 0.22 0.22 0.24
breast-cancer-wisconsin 0.81 0.79 0.83 0.83 0.83
ecoli 0.04 0.65 0.67 0.44 0.61
optdigits389 0.38 0.94 0.54 0.54 0.77
segmentation 0.24 0.49 0.15 0.15 0.26
hepatitis -0.1 0.03 0.13 0.13 0.11
glass 0.17 0.17 0.12 0.12 0.17
TABLE III: We first show the ARIs obtained with unsupervised vs. constraint-based hyperparameter selection (columns marked Q1). Next, we show the ARIs obtained with the semi-supervised variants, with several hyperparameter selection methods (columns marked Q2). For semi-supervised results 50 constraints were used, and the average of 25 runs is shown. SI refers to the silhouette index, STS to self-tuning spectral clustering, FOSC to FOSC-OpticsDend and eigen to the eigengap method.

Results and conclusion

The columns of Table III marked with Q1 compare the ARIs obtained with the unsupervised approaches to those obtained with COBS. The best of these two is underlined for each algorithm and dataset combination. Most of the time the constraint-based selection strategy performs better, and often by a large margin. Note for example the large difference for ionosphere: DBSCAN is able to produce a good clustering, but it is only selected using the constraint-based approach. When the unsupervised selection method performs better, the difference is usually small. We conclude that the internal measures often do not match the actually desired clusters. Constraints provide useful information that can help select a good parameter setting.

Q2: COBS vs. semi-supervised algorithms

It is not too surprising that COBS outperforms unsupervised hyperparameter selection, since it has access to more information. We now compare to semi-supervised algorithms, which have access to the same information.

Existing semi-supervised algorithms

We compare to the following algorithms, as they are semi-supervised variants of the unsupervised algorithms used in our experiments:

  • MPCKMeans [7] is a hybrid semi-supervised extension of K-means. It minimizes an objective that combines the within-cluster sum of squares with the cost of violating constraints. This objective is greedily minimized using a procedure based on K-means. Besides a modified cluster assignment step and the usual cluster center re-estimation step, this procedure also adapts an individual metric associated with each cluster in each iteration. We use the implementation available in the WekaUT package.

  • FOSC-OpticsDend [9] is a semi-supervised extension of OPTICS, which is in turn based on ideas similar to DBSCAN. The first step of this algorithm is to run the unsupervised OPTICS algorithm, and to construct a dendrogram using its output. The FOSC framework is then used to extract a flat clustering from this dendrogram that is optimal w.r.t. the given constraints.

  • COSC [26] is based on spectral clustering, but optimizes for an objective that combines the normalized cut with a penalty for constraint violation. We use the implementation available on the authors’ web page.

In our experiments, the only kind of supervision that is given to the algorithms is in the form of pairwise constraints. In particular, the number of clusters $K$ is assumed to be unknown. In COBS, $K$ is treated as any other hyperparameter. MPCKMeans and COSC, however, require specifying the number of clusters. The following strategies are used to select $K$ based on the constraints:

  • NumSat: We run the algorithms for multiple values of $K$, and select the clustering that violates the smallest number of constraints. In case of a tie, we choose the solution with the lowest number of clusters.

  • CVCP: Cross-Validation for finding Clustering Parameters [25] is a cross-validation procedure for semi-supervised clustering. The set of constraints is divided into independent folds. To evaluate a parameter setting, the algorithm is repeatedly run on the entire dataset given the constraints in all but one fold, keeping aside the remaining fold as a test set. The clustering that is produced given the constraints in the training folds is then considered as a classifier that distinguishes between must-link and cannot-link constraints in the held-out fold. The F-measure is used to evaluate the score of this classifier. The performance of the parameter setting is then estimated as the average F-measure over all test folds. This process is repeated for all parameter settings, and the one resulting in the highest average F-measure is retained. The algorithm is then run with this parameter setting using all constraints to produce the final clustering. We use 5-fold cross-validation.
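The CVCP scoring loop described above can be sketched as follows. This is a simplified illustration rather than the procedure of [25]: `cluster_fn` is a hypothetical callable standing in for a semi-supervised algorithm, folds are formed by simple striding (the independence requirement between folds is not enforced), and treating must-link as the positive class of the F-measure is an assumption of this sketch.

```python
def constraint_f_measure(clustering, must_link, cannot_link):
    """F-measure of a clustering viewed as a must-link vs. cannot-link classifier.

    A pair is predicted "must-link" when both instances share a cluster.
    """
    tp = sum(1 for i, j in must_link if clustering[i] == clustering[j])
    fn = len(must_link) - tp
    fp = sum(1 for i, j in cannot_link if clustering[i] == clustering[j])
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def cvcp_score(cluster_fn, setting, constraints, n_folds=5):
    """Average held-out F-measure of one parameter setting.

    cluster_fn(setting, must_link, cannot_link) -> list of cluster labels
    (hypothetical interface). constraints is a list of (i, j, is_must_link).
    """
    folds = [constraints[f::n_folds] for f in range(n_folds)]
    scores = []
    for f, test in enumerate(folds):
        train = [c for g, fold in enumerate(folds) if g != f for c in fold]
        labels = cluster_fn(
            setting,
            [(i, j) for i, j, is_ml in train if is_ml],
            [(i, j) for i, j, is_ml in train if not is_ml],
        )
        scores.append(constraint_f_measure(
            labels,
            [(i, j) for i, j, is_ml in test if is_ml],
            [(i, j) for i, j, is_ml in test if not is_ml],
        ))
    return sum(scores) / n_folds

def select_setting(cluster_fn, settings, constraints, n_folds=5):
    """Retain the setting with the highest average held-out F-measure."""
    return max(settings, key=lambda s: cvcp_score(cluster_fn, s, constraints, n_folds))
```

After `select_setting` returns, the algorithm would be run once more with the retained setting and all constraints to produce the final clustering.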

We also compare to unsupervised hyperparameter selection strategies for the semi-supervised algorithms. In particular, we use the silhouette index for MPCKMeans, and the eigengap heuristic for COSC [32]. The affinity matrix for COSC is constructed using local scaling as in [26].

Results and conclusion

The columns in Table III marked with Q2 show the ARIs obtained with the semi-supervised algorithms. The best result for each type of algorithm (unsupervised or semi-supervised) is indicated in bold. The table shows that in several cases it is more advantageous to use the constraints to optimize the hyperparameters of the unsupervised algorithm (as COBS does). In other cases, it is better to use the constraints within the algorithm itself, to perform a more informed search (as the semi-supervised variants do). Within the scope of a single clustering algorithm, neither strategy consistently outperforms the other. For example, if we use spectral clustering on the dermatology data, it is better to use the constraints for tuning the hyperparameters of unsupervised spectral clustering (also varying $k$ and $\sigma$ for constructing the affinity matrix) than within COSC, its semi-supervised variant (which uses local scaling for this). In contrast, if we use density-based clustering on the same data, it is better to use constraints in FOSC-OpticsDend (which does not have an $\epsilon$ parameter, and for which $\mathit{minPts}$ is set to 4, a value commonly used in the literature [13, 9]) than to use them to tune the hyperparameters of DBSCAN (varying both $\epsilon$ and $\mathit{minPts}$).

Q3: COBS with multiple unsupervised algorithms

In the previous two subsections, we showed that constraints can be useful to tune the hyperparameters of individual algorithms. Table III also shows, however, that no single algorithm (unsupervised or semi-supervised) performs well on all datasets. This motivates the use of COBS to not only select hyperparameters, but also the clustering algorithm. In this subsection we again use Algorithm 1, but the set of clusterings generated in step 1 now includes clusterings produced by any of the three unsupervised algorithms.

Fig. 1: Performance of COBS vs. semi-supervised algorithms


We compare COBS with existing semi-supervised algorithms in Figure 1. (Due to long runtimes of COSC, we do not report results in combination with CVCP on the two largest datasets, optdigits389 and segmentation.) COBS is able to find relatively good clusterings for the first 8 datasets. While some other approaches also do well on some of these datasets, none of them do so consistently. Compared to each competitor individually, COBS is clearly superior. For example, COSC-EigenGap outperforms COBS on the iris dataset, but performs much worse on several others. COBS performs poorly on glass and hepatitis, as do the other semi-supervised algorithms, although for hepatitis other approaches are able to find better solutions after a larger number of constraints. The overall poor performance on these last two datasets suggests that the class labels do not indicate a natural grouping.

Table IV allows us to assess the quality of the clusterings that are selected by COBS, relative to the quality of the best clustering in the set of generated clusterings. Column 2 shows the highest ARI of all generated clusterings for each dataset. Note that we can only compute this value in an experimental setting, in which we have labels for all elements. In a real clustering application, we cannot simply select the result with the highest ARI. Column 3, then, shows the ARI of the clustering that is actually selected using COBS when it is given 50 constraints. It shows that there still is room for improvement, i.e. a more advanced strategy might get closer to the maxima. Nevertheless, even our simple strategy gets close enough to outperform most other semi-supervised methods. The last column of Table IV shows how often COBS chose a clustering by K-means (’K’), DBSCAN (’D’) and spectral clustering (’S’). It illustrates that the selected algorithm strongly depends on the dataset. For example, for ionosphere COBS selects clusterings generated by DBSCAN, as it is the only algorithm able to produce good clusterings of this dataset. For most other datasets, spectral clustering is preferred.


If any of the unsupervised algorithms is able to produce good clusterings, COBS can select them using a limited number of constraints. If not, COBS performs poorly, but in our experiments none of the algorithms did well in this case. We conclude that it is often better to use constraints to select and tune an unsupervised algorithm, than within a randomly chosen semi-supervised algorithm.

dataset best unsupervised COBS algorithm used
wine 0.93 0.90 K:4/D:0/S:21
dermatology 0.94 0.87 K:12/D:0/S:13
iris 0.88 0.80 K:9/D:0/S:16
ionosphere 0.7 0.65 K:0/D:25/S:0
breast-cancer-wisconsin 0.84 0.77 K:4/D:1/S:20
ecoli 0.75 0.65 K:6/D:0/S:19
optdigits389 0.97 0.96 K:0/D:0/S:25
segmentation 0.59 0.50 K:8/D:2/S:15
hepatitis 0.27 0.01 K:1/D:18/S:6
glass 0.29 0.19 K:14/D:0/S:11
TABLE IV: The ARI of the best clustering that is generated by any of the unsupervised algorithms, the ARI of the clustering that is selected after 50 constraints (averaged over 25 runs), and the algorithms that produced the selected clusterings.

Q4: Using COBS with semi-supervised algorithms

In the previous section we have shown that we can use constraints to do algorithm and hyperparameter selection for unsupervised algorithms. On the other hand, constraints can also be useful when used within an adapted clustering procedure, as traditional semi-supervised algorithms do. This raises the question: can we combine both approaches? In this section, we use the constraints to select and tune a semi-supervised clustering algorithm. In particular, we vary the hyperparameters of the semi-supervised algorithms to generate the set of clusterings from which we select. The varied hyperparameters are the same as those for their unsupervised variants, except for two. First, $\epsilon$ is not varied for FOSC-OpticsDend, as it is not a hyperparameter for that algorithm. Second, in this section we only use $k$-nearest neighbor graphs for (semi-supervised) spectral clustering, as full similarity graphs lead to long execution times for COSC.

Results and conclusions

Column 3 of Table V shows that this strategy does not produce better results. This is caused by using the same constraints twice: once within the semi-supervised algorithms, and once to evaluate the algorithms and select the best-performing one. Obviously, algorithms that overfit the given constraints will get selected in this manner.
The problem could be alleviated by using separate constraints inside the algorithm and for evaluation, but this decreases the number of constraints that can effectively be used for either purpose. Column 4 of Table V shows the average ARIs that are obtained if we use half of the constraints within the semi-supervised algorithms, and half to select one of the generated clusterings afterwards. This works better, but often still not as well as COBS with unsupervised algorithms. Results are only improved for segmentation, hepatitis and glass, the datasets with less clear clustering structure (as indicated by the ARIs).
We conclude that using semi-supervised algorithms within COBS can only be beneficial if the semi-supervised algorithms use different constraints from those used for selection. Even then, when a limited number of constraints is available, using all of them for selection is often the best choice.

dataset COBS-U COBS-SS COBS-SS-split
wine 0.89 0.54 0.80
dermatology 0.85 0.62 0.81
iris 0.77 0.51 0.75
ionosphere 0.64 0.19 0.31
breast-cancer-wisconsin 0.79 0.50 0.69
ecoli 0.67 0.51 0.63
optdigits389 0.92 0.51 0.80
segmentation 0.48 0.45 0.54
hepatitis 0.07 0.09 0.27
glass 0.18 0.18 0.19
TABLE V: ARIs obtained with 50 constraints by COBS with unsupervised algorithms (COBS-U) and with semi-supervised algorithms, with and without splitting the constraint set (COBS-SS and COBS-SS-split). Results are averaged over 25 random constraint sets, except for optdigits389 and segmentation, for which results are averaged over 10 runs.

Note on computational complexity

One might expect COBS to be prohibitively expensive, given the large number of clusterings it needs to generate. This is not the case, for multiple reasons.

First, the runtimes of individual clustering algorithms vary greatly, and in addition to that, some semi-supervised algorithms are much slower than their unsupervised counterparts. As a result, constructing many clusterings with unsupervised algorithms is only slightly more expensive than running the slowest semi-supervised algorithm just once. In our experiments, for the largest dataset we used (segmentation), generating 931 unsupervised clusterings took 560s on a single core, using scikit-learn implementations. A single run of COSC, the semi-supervised variant of spectral clustering, took 200s (using the Matlab implementation available on the authors’ web page). If COSC is run multiple times, for instance with different numbers of clusters (as is done in COSC-NumSat and COSC-CVCP), its runtime quickly exceeds that of COBS.

Second, the runtime of COBS can be reduced in several ways. The cluster generation step can easily be parallelized. For larger datasets, one might consider doing the algorithm and hyperparameter selection on a sample of the data, and afterwards clustering the complete dataset only once with the selected configuration.
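The sample-based shortcut can be illustrated with a small sketch. Everything in it is an assumption made for illustration: `threshold_clustering` is a toy stand-in for a real clustering algorithm with a single hyperparameter, constraints are `(i, j, must_link)` triples, and the constrained instances are always kept in the sample so that every constraint stays usable.

```python
import random


def threshold_clustering(values, threshold):
    """Toy stand-in for a clustering algorithm: split 1-D values at a threshold."""
    return [int(v >= threshold) for v in values]


def n_satisfied(labels, constraints):
    """Count the (i, j, must_link) constraints that a clustering satisfies."""
    return sum((labels[i] == labels[j]) == must_link
               for i, j, must_link in constraints)


def select_threshold_on_sample(values, constraints, thresholds, sample_size, seed=0):
    """Pick the hyperparameter value using only a sample of the data."""
    rng = random.Random(seed)
    # Assumption: constrained instances are always part of the sample.
    constrained = sorted({i for i, _, _ in constraints} | {j for _, j, _ in constraints})
    constrained_set = set(constrained)
    others = [i for i in range(len(values)) if i not in constrained_set]
    n_extra = max(0, min(len(others), sample_size - len(constrained)))
    sample_idx = constrained + rng.sample(others, n_extra)

    # Re-index the constraints so that they refer to positions in the sample.
    position = {idx: pos for pos, idx in enumerate(sample_idx)}
    sample_values = [values[i] for i in sample_idx]
    sample_constraints = [(position[i], position[j], ml) for i, j, ml in constraints]

    # Cluster the sample once per configuration and keep the best one.
    return max(
        thresholds,
        key=lambda t: n_satisfied(threshold_clustering(sample_values, t),
                                  sample_constraints),
    )


values = [0.1, 0.2, 0.3, 5.0, 5.1, 5.2, 0.15, 5.05]
constraints = [(0, 1, True), (3, 4, True), (0, 3, False)]
best = select_threshold_on_sample(values, constraints, [0.0, 1.0, 10.0], sample_size=5)
# The complete dataset is clustered only once, with the selected configuration.
full_labels = threshold_clustering(values, best)
```

The expensive loop over configurations only touches the sample, while the full dataset is clustered a single time at the end.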

Finally, note that the added cost of doing algorithm and parameter selection is no different from its comparable, and commonly accepted, cost in (semi-)supervised learning. The focus is on maximally exploiting the limited amount of supervision, as obtaining labels or constraints is often expensive, whereas computation is cheap.

IV Active COBS

Obtaining constraints can be costly, as they are often specified by human experts. Consequently, several methods have been proposed to actively select the most informative constraints [4, 22, 38]. We first briefly discuss some of these methods, and subsequently present a constraint selection strategy for COBS.

IV-A Related work

Basu et al. [4] were the first to propose an active constraint selection method for semi-supervised clustering. Their strategy is based on the construction of neighborhoods, which are sets of points that are known to belong to the same cluster because must-link constraints are defined between them. These neighborhoods are initialized in the exploration phase: $k$ (the number of clusters) instances with cannot-link constraints between them are sought, by iteratively querying the relation between the existing neighborhoods and the point farthest from these neighborhoods. In the subsequent consolidation phase these neighborhoods are expanded by iteratively querying a random point against the known neighborhoods until a must-link occurs and the right neighborhood is found. Mallapragada et al. [22] extend this strategy by selecting the most uncertain points to query in the consolidation phase, instead of random ones. Note that in these approaches all constraints are queried before the actual clustering is performed.
More recently, Xiong et al. [38] proposed the normalized point-based uncertainty (NPU) framework. Like the approach introduced by Mallapragada et al. [22], NPU incrementally expands neighborhoods and uses an uncertainty-based principle to determine which pairs to query. In the NPU framework, however, data is re-clustered several times, and at each iteration the current clustering is used to determine the next set of pairs to query. NPU can be used with any semi-supervised clustering algorithm, and Xiong et al. [38] use it with MPCKMeans to experimentally demonstrate its superiority to the method of Mallapragada et al. [22].

IV-B Active constraint selection in COBS

Like the approaches in [22] and [38], our constraint selection strategy for COBS is based on uncertainty sampling. Defining this uncertainty is straightforward within COBS, because of the availability of a set of clusterings: a pair is more uncertain if more clusterings disagree on whether it should be in the same cluster or not. Algorithm 2 presents a selection strategy based on this idea. We associate with each clustering a weight that depends on the number of constraints it was right or wrong about. In each iteration we query the pair with the lowest weighted agreement. The agreement of a pair (line 5 of the algorithm) is defined as the absolute value of the difference between the sum of weights of clusterings in which the instances in the pair belong to the same cluster, and the sum of weights of clusterings in which they belong to a different cluster. The weights of clusterings that correctly “predict” the relation between pairs are increased by multiplying with an update factor $m$; weights of other clusterings are decreased by dividing by $m$. As the total number of pairwise constraints is quite large ($n(n-1)/2$, with $n$ the number of instances), we only consider constraints in a small random sample of all possible constraints.

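Input:  $D$: a dataset;  budget: the maximum number of constraints to use;  $m$: weight update factor;  $s$: size of sample of constraints to choose from
Output:  a clustering of $D$
1:  Generate a set of clusterings $C$ by varying the hyperparameters of several unsupervised clustering algorithms
2:  Let $w_c = 1$ for all $c \in C$
3:  Let $S$ be a random sample of $s$ of all possible pairwise constraints
4:  while the number of queried constraints is smaller than budget do
5:     $(x_i, x_j) \leftarrow \arg\min_{(x_i, x_j) \in S} \left| \sum_{c \in C :\, c(x_i) = c(x_j)} w_c - \sum_{c \in C :\, c(x_i) \neq c(x_j)} w_c \right|$, with $c(x)$ the cluster of $x$ in clustering $c$
6:     Query pair $(x_i, x_j)$
7:     for all $c \in C$: multiply $w_c$ with $m$ if $c$ correctly predicted the relation between $x_i$ and $x_j$, divide $w_c$ by $m$ if not
9:  end while
10:  return  the clustering $c \in C$ with the highest weight $w_c$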
Algorithm 2 Active constraint selection for COBS
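The selection loop of Algorithm 2 can be sketched in a few lines of Python. This is an illustrative reading of the description above rather than the authors' code, and it rests on several assumptions: clusterings are given as lists of labels, all weights start at 1, a queried pair is removed from the sample so that it is never asked twice, and `oracle` is a hypothetical callback standing in for the user.

```python
def agreement(pair, clusterings, weights):
    """Weighted agreement of a pair: |weight of 'same' votes - weight of 'different' votes|."""
    i, j = pair
    same = sum(w for labels, w in zip(clusterings, weights) if labels[i] == labels[j])
    different = sum(w for labels, w in zip(clusterings, weights) if labels[i] != labels[j])
    return abs(same - different)


def active_cobs(clusterings, candidate_pairs, oracle, budget, m=2.0):
    """Query the most uncertain pairs and return the highest-weighted clustering.

    clusterings     : list of label lists, one per generated clustering
    candidate_pairs : the sample of pairs (i, j) to choose queries from
    oracle          : callable (i, j) -> True for must-link, False for cannot-link
    budget          : maximum number of constraints to query
    m               : weight update factor
    """
    weights = [1.0] * len(clusterings)  # assumption: equal initial weights
    remaining = list(candidate_pairs)
    queried = []
    while len(queried) < budget and remaining:
        # Pick the pair on which the weighted clusterings disagree the most.
        pair = min(remaining, key=lambda p: agreement(p, clusterings, weights))
        remaining.remove(pair)  # assumption: a pair is never queried twice
        must_link = oracle(*pair)
        queried.append((pair[0], pair[1], must_link))
        # Reward clusterings that predicted the answer, penalize the others.
        for c, labels in enumerate(clusterings):
            predicted_same = labels[pair[0]] == labels[pair[1]]
            if predicted_same == must_link:
                weights[c] *= m
            else:
                weights[c] /= m
    best = max(range(len(clusterings)), key=lambda c: weights[c])
    return clusterings[best], queried
```

Each call to `min` loops over all clusterings for every remaining candidate pair, which is the per-query cost that dominates the online stage.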

IV-C Experiments

We first demonstrate the influence of the weight update factor and sample size, and then compare our approach to active constraint selection with NPU [38].

Effect of weight update factor and sample size

Our constraint selection strategy requires specifying a weight update factor $m$ and a sample size $s$. Figure 2 shows the results for wine and dermatology for various values of $m$. First, the figure shows that the active strategy can significantly improve performance over random selection. Second, it shows that the selection process is not very sensitive to the choice of the update factor. Figure 3 shows the results for various sample sizes. It shows that the sample size has a limited effect on performance for a small number of constraints, but that this effect increases as more constraints are given. In the remainder of this section we use a sample of 1000 constraints (i.e. we try to choose the most useful constraints to ask from 1000 possible queries), and set the weight update factor to 2.

Fig. 2: Active COBS with different weight update factors. The constraint sample size was set to 1000.
Fig. 3: Active COBS with different sample sizes. The weight update factor was set to 2.
Fig. 4: Comparison of active COBS to NPU in combination with different semi-supervised clustering algorithms

Comparison to active selection with NPU

NPU [38] can be used in combination with any semi-supervised clustering algorithm; we use the same ones as in the previous section. We do not include CVCP hyperparameter selection in these experiments, because of its high computational complexity (for these experiments we cannot cluster for several fixed numbers of constraints, as the choice of the next constraints depends on the current clustering). For the same reason we only include the EigenGap parameter selection method for the two largest datasets (optdigits389 and segmentation) in these experiments. The results are shown in Figure 4. For the first 8 datasets, the conclusions are similar to those for the random setting: COBS consistently performs relatively well. Also in the active setting, none of the approaches produces a clustering with a high ARI for glass. For hepatitis, however, MPCKMeans is able to find good clusterings while COBS is not, albeit only after a relatively large number of constraints (hepatitis contains 112 instances). This implies that, although the labels might not represent a natural grouping, the class structure does match the bias of MPCKMeans, and given many constraints the algorithm finds this structure.

Time complexity

We distinguish between the offline and online stages of COBS. In the offline stage, the set of clusterings is generated. As mentioned before, this took 560s on a single core for the largest dataset (segmentation, with 2100 instances). In the online stage, we select the most informative pairs and ask the user about their relation. Execution time is particularly important here, as this stage requires user interaction. In active COBS, selecting the next pair to query is $O(|C| \cdot s)$, as we have to loop through all clusterings ($|C|$, with $C$ the set of generated clusterings) for each constraint in the sample ($s$). For the setup used in our experiments ($|C| = 931$, $s = 1000$), this was always less than 0.02s. Note that this time does not depend on the size of the dataset (as all clusterings are generated beforehand). In contrast, NPU requires re-clustering the data several times during the constraint selection process, which is usually significantly more computationally expensive.


The COBS approach allows for a straightforward definition of uncertainty: pairs of instances are more uncertain if more clusterings disagree on them. Selecting the most uncertain pairs first can significantly increase performance.

V Conclusion

Exploiting constraints has been the subject of substantial research, but all existing methods use them within the clustering process of individual algorithms. In contrast, we propose to use them to choose between clusterings generated by different unsupervised algorithms, run with different parameter settings. We experimentally show that this strategy is superior to all the semi-supervised algorithms we compared it to, which themselves are state of the art and representative of a wide range of algorithms. For the majority of the datasets, it works as well as the best among them, and on average it performs much better. The generated clusterings can also be used to select more informative constraints first, which further improves performance.

In future work, we would like to study several strategies that have been used in supervised learning in the context of semi-supervised clustering. In particular, we want to consider more advanced algorithm and hyperparameter optimization strategies (as in [30]), meta-learning approaches (as in [8]), and combinations of these two (as in [16]).


  • [1] Antoine Adam and Hendrik Blockeel. Dealing with overlapping clustering: A constraint-based approach to algorithm selection. In MetaSel workshop at ECMLPKDD, pages 43–54. CEUR Workshop proceedings, September 2015.
  • [2] Olatz Arbelaitz, Ibai Gurrutxaga, Javier Muguerza, Jesús M. Pérez, and Iñigo Perona. An extensive comparative study of cluster validity indices. Pattern Recognition, 46(1):243–256, 2013.
  • [3] Aharon Bar-Hillel, Tomer Hertz, Noam Shental, and Daphna Weinshall. Learning distance functions using equivalence relations. In ICML, 2003.
  • [4] Sugato Basu and Raymond J. Mooney. Active Semi-Supervision for Pairwise Constrained Clustering. In Proc. of the SIAM International Conference on Data Mining, pages 333–344, 2004.
  • [5] Shai Ben-David, Ulrike von Luxburg, and Dávid Pál. A sober look at clustering stability. In Proceedings of the 19th Annual Conference on Learning Theory, COLT’06, pages 5–19, Berlin, Heidelberg, 2006. Springer-Verlag.
  • [6] Asa Ben-Hur, André Elisseeff, and Isabelle Guyon. A Stability Based Method for Discovering Structure in Clustered Data. In Pacific Symposium on Biocomputing, pages 6–17, 2002.
  • [7] Mikhail Bilenko, Sugato Basu, and Raymond J. Mooney. Integrating constraints and metric learning in semi-supervised clustering. In Proc. of 21st International Conference on Machine Learning, pages 81–88, July 2004.
  • [8] Pavel B. Brazdil, Carlos Soares, and Joaquim Pinto da Costa. Ranking Learning Algorithms: Using IBL and Meta-Learning on Accuracy and Time Results. Machine Learning, 50(3):251–277, 2003.
  • [9] Ricardo J. G. B. Campello, Davoud Moulavi, Arthur Zimek, and Jörg Sander. A framework for semi-supervised and unsupervised optimal extraction of clusters from hierarchies. Data Mining and Knowledge Discovery, 27(3):344–371, 2013.
  • [10] Rich Caruana, Mohamed Elhawary, and Nam Nguyen. Meta clustering. In Proc. of the International Conference on Data Mining, 2006.
  • [11] Jason V. Davis, Brian Kulis, Prateek Jain, Suvrit Sra, and Inderjit S. Dhillon. Information-theoretic metric learning. In Proceedings of the 24th International Conference on Machine Learning, ICML ’07, pages 209–216, New York, NY, USA, 2007. ACM.
  • [12] M.C.P. de Souto, R.B.C. Prudencio, R.G.F. Soares, D.S.A. de Araujo, I.G. Costa, T.B. Ludermir, and A. Schliep. Ranking and selecting clustering algorithms using a meta-learning approach. In IEEE International Joint Conference on Neural Networks, 2008.
  • [13] Martin Ester, Hans-Peter Kriegel, Jörg Sander, and Xiaowei Xu. A density-based algorithm for discovering clusters in large spatial databases with noise. pages 226–231. AAAI Press, 1996.
  • [14] Vladimir Estivill-Castro. Why so many clustering algorithms: a position paper. ACM SIGKDD Explorations Newsletter, 4:65–75, 2002.
  • [15] Daniel Gomes Ferrari and Leandro Nunes de Castro. Clustering algorithm selection by meta-learning systems: A new distance-based problem characterization and ranking combination methods. Information Sciences, 301:181–194, 2015.
  • [16] Matthias Feurer, Aaron Klein, Katharina Eggensperger, Jost Springenberg, Manuel Blum, and Frank Hutter. Efficient and robust automated machine learning. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems 28, pages 2944–2952. Curran Associates, Inc., 2015.
  • [17] Lawrence Hubert and Phipps Arabie. Comparing partitions. Journal of Classification, 2(1):193–218, 1985.
  • [18] Frank Hutter, Holger H. Hoos, and Kevin Leyton-Brown. Sequential Model-Based Optimization for General Algorithm Configuration, pages 507–523. Springer Berlin Heidelberg, Berlin, Heidelberg, 2011.
  • [19] Anil K. Jain. Data clustering: 50 years beyond K-means. Pattern Recognition Letters, 31:651–666, 2010.
  • [20] Tilman Lange, Volker Roth, Mikio L. Braun, and Joachim M. Buhmann. Stability-based validation of clustering solutions. Neural Comput., 16(6):1299–1323, June 2004.
  • [21] Levi Lelis and Jörg Sander. Semi-supervised density-based clustering. In 2009 Ninth IEEE International Conference on Data Mining, pages 842–847, Dec 2009.
  • [22] Pavan K. Mallapragada, Rong Jin, and Anil K. Jain. Active query selection for semi-supervised clustering. In Proc. of the 19th International Conference on Pattern Recognition, 2008.
  • [23] Davoud Moulavi, Pablo A. Jaskowiak, Ricardo J.G.B. Campello, Arthur Zimek, and Jörg Sander. Density-based clustering validation. In Proc. of the 14th SIAM International Conference on Data Mining, 2014.
  • [24] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.
  • [25] Mojgan Pourrajabi, Arthur Zimek, Davoud Moulavi, Ricardo J G B Campello, and Randy Goebel. Model Selection for Semi-Supervised Clustering. In Proc. of the 17th International Conference on Extending Database Technology, 2014.
  • [26] Syama S. Rangapuram and Matthias Hein. Constrained 1-spectral clustering. In Proc. of the 15th International Conference on Artificial Intelligence and Statistics, 2012.
  • [27] Peter J. Rousseeuw. Silhouettes: A graphical aid to the interpretation and validation of cluster analysis. Journal of Computational and Applied Mathematics, 20:53–65, 1987.
  • [28] Carlos Ruiz, Myra Spiliopoulou, and Ernestina Menasalvas. C-DBSCAN: Density-Based Clustering with Constraints. RSFDGr’07: Proc. of the International Conference on Rough Sets, Fuzzy Sets, Data Mining and Granular Computing held in JRS07, 4481:216–223, 2007.
  • [29] Noam Shental, Aharon Bar-Hillel, Tomer Hertz, and Daphna Weinshall. Computing Gaussian mixture models with EM using equivalence constraints. In Advances in Neural Information Processing Systems 16, 2004.
  • [30] Chris Thornton, Frank Hutter, Holger H. Hoos, and Kevin Leyton-Brown. Auto-weka: Combined selection and hyperparameter optimization of classification algorithms. In Proc. of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2013.
  • [31] Lucas Vendramin, Ricardo J G B Campello, and Eduardo R Hruschka. Relative clustering validity criteria: A comparative overview. Statistical Analysis and Data Mining, 3(4):209–235, 2010.
  • [32] Ulrike von Luxburg. A tutorial on spectral clustering. Statistics and Computing, 17(4):395–416, 2007.
  • [33] Ulrike von Luxburg. Clustering stability: An overview. Found. Trends Mach. Learn., 2(3):235–274, March 2010.
  • [34] Ulrike von Luxburg, Robert C. Williamson, and Isabelle Guyon. Clustering: Science or Art? In Workshop on Unsupervised Learning and Transfer Learning, JMLR Workshop and Conference Proceedings 27, 2014.
  • [35] Kiri Wagstaff, Claire Cardie, Seth Rogers, and Stefan Schroedl. Constrained K-means Clustering with Background Knowledge. In Proc. of the Eighteenth International Conference on Machine Learning, pages 577–584, 2001.
  • [36] Xiang Wang, Buyue Qian, and Ian Davidson. On constrained spectral clustering and its applications. Data Mining and Knowledge Discovery, 28(1):1–30, 2014.
  • [37] Eric P. Xing, Andrew Y. Ng, Michael I. Jordan, and Stuart Russell. Distance metric learning, with application to clustering with side-information. In Advances in Neural Information Processing Systems 15, pages 505–512, 2003.
  • [38] Sicheng Xiong, Javad Azimi, and Xiaoli Z. Fern. Active learning of constraints for semi-supervised clustering. IEEE Transactions on Knowledge and Data Engineering, 26(1):43–54, 2014.
  • [39] Lihi Zelnik-manor and Pietro Perona. Self-tuning spectral clustering. In Advances in Neural Information Processing Systems 17, pages 1601–1608, 2004.