I Introduction
Clustering is one of the core tasks in data analysis [19]. It is inherently subjective, as users may prefer very different clusterings of the same data [10, 34]. Semi-supervised clustering [35, 37] aims to deal with this subjectivity by allowing the user to specify background knowledge, often in the form of pairwise constraints that indicate whether two instances should be in the same cluster or not.
Traditional approaches to semi-supervised (or constraint-based) clustering use constraints in one of three ways. First, one can modify an existing clustering algorithm to take them into account. This approach is taken in COP-KMeans [35], one of the first clustering algorithms able to deal with pairwise constraints. Second, one can learn a distance metric based on the constraints [37], after which the metric is used within a traditional clustering algorithm. Third, one can combine the above two approaches in so-called hybrid methods [7].
Our approach to constraint-based clustering is quite different from existing methods, and does not fit in any of these three categories. It is motivated by the well-known fact that different algorithms may produce very different clusterings of the same data [14], and that even within one algorithm, different parameter settings may yield different clusterings. This implies that selecting a clustering algorithm and tuning its parameters is crucial to obtain a good clustering.
We propose to use constraints to solve these tasks: to find an appropriate clustering, we first generate a set of clusterings using several unsupervised algorithms with different hyperparameter settings, and afterwards select from this set the clustering that satisfies the largest number of constraints. Our experiments show that, surprisingly, this simple constraint-based selection approach often yields better clusterings than existing semi-supervised algorithms. This shows that it is more important to use an algorithm whose inherent bias matches a particular problem than to modify the optimization criterion of any individual algorithm to take the constraints into account. We also present a method for selecting the most informative constraints first, which further increases the usefulness of our approach.
The remainder of this paper is structured as follows. In Section II we give some background on semi-supervised clustering, and on algorithm and hyperparameter selection for clustering. Section III presents our approach to using pairwise constraints in clustering, which we call COBS (for Constraint-Based Selection). In Section IV we describe how COBS can be extended to actively select informative constraints. We conclude in Section V.
II Background
We first describe related work on semi-supervised clustering. As our approach consists of using constraints to choose an algorithm and tune its parameters, we also discuss related work on meta-learning for clustering, as well as on algorithm and hyperparameter selection.
Semi-supervised clustering algorithms allow the user to incorporate a limited amount of supervision into the clustering procedure. Several kinds of supervision have been proposed, one of the most popular being pairwise constraints. Must-link (ML) constraints indicate that two instances should be in the same cluster, cannot-link (CL) constraints that they should be in different clusters. Most existing semi-supervised approaches use such constraints within the scope of an individual clustering algorithm. COP-KMeans [35], for example, modifies the cluster assignment step of K-means: instances are assigned to the closest cluster for which the assignment does not violate any constraints. Similarly, the clustering procedures of DBSCAN [28, 21, 9], EM [29] and spectral clustering [26, 36] have been extended to incorporate pairwise constraints. Another approach to semi-supervised clustering is to learn a distance metric based on the constraints [37, 3, 11]. Xing et al. [37], for example, propose to learn a Mahalanobis distance by solving a convex optimization problem in which the distance between instances connected by a must-link constraint is minimized, while instances connected by a cannot-link constraint are simultaneously separated. Hybrid algorithms, such as MPCKMeans [7], combine metric learning with an adapted clustering procedure.
Meta-learning and algorithm selection have been studied extensively in supervised learning [8, 30], but much less in clustering. There is some work on building meta-learning systems that recommend clustering algorithms [12, 15]. However, these systems do not take hyperparameter selection into account, nor any form of supervision. More closely related to ours is the work of Caruana et al. [10]. They generate a large number of clusterings using K-means and spectral clustering, and cluster these clusterings. This meta-clustering is presented to the user as a dendrogram. Here, we also generate a set of clusterings, but afterwards we select from that set the most suitable clustering based on pairwise constraints. The only other work, to our knowledge, that has explored the use of pairwise constraints for algorithm selection is that of Adam and Blockeel [1]. They define a meta-feature based on constraints, and use this feature to predict whether EM or spectral clustering will perform better on a dataset. While their meta-feature attempts to capture one specific property of the desired clusters, i.e. whether they overlap, our approach is more general and allows selection between any clustering algorithms.
Whereas algorithm selection has received little attention in clustering, several methods have been proposed for hyperparameter selection.
One strategy is to run the algorithm with several parameter settings, and select the clustering that scores highest on an internal quality measure [2, 31]. Such measures try to capture the idea of a “good” clustering. A first limitation is that they are not able to deal with the inherent subjectivity of clustering, as they do not take any external information into account. Furthermore, internal measures are only applicable within the scope of individual clustering algorithms, as each of them comes with its own bias [34]. For example, the vast majority of them have a preference for spherical clusters, making them suitable for K-means, but not for e.g. spectral clustering or DBSCAN.
Another strategy for parameter selection in clustering is based on stability analysis [6, 33, 5]. A parameter setting is considered to be stable if similar clusterings are produced with that setting when it is applied to several datasets from the same underlying model. These datasets can for example be obtained by taking subsamples of the original dataset [6, 20]. In contrast to internal quality measures, stability analysis does not require an explicit definition of what it means for a clustering to be good. Most studies on stability focus on selecting parameter settings in the scope of individual algorithms (in particular, often the number of clusters).
Additionally, one can also avoid the need for explicit parameter selection. In self-tuning spectral clustering [39], for example, the affinity matrix is constructed based on local statistics, and the number of clusters is estimated from the structure of the eigenvectors of the Laplacian.
A key distinction with COBS is that none of the above methods takes the subjective preferences of the user into account. We will compare our constraint-based selection strategy to some of them in the next section.
III Constraint-based clustering selection
Algorithm and hyperparameter selection are difficult tasks in an entirely unsupervised setting, mainly due to the lack of a well-defined way to estimate the quality of clustering results [14]. We propose to use constraints for this purpose, and estimate the quality of a clustering as the number of constraints that it satisfies. This quality estimate allows us to do a search over unsupervised algorithms and their parameter settings, as described in Algorithm 1. We use a basic grid search, but in principle more advanced optimization strategies could also be used [30, 18]. We assume that we are given a set of must-link constraints ML, where (i, j) ∈ ML indicates that instances x_i and x_j should be in the same cluster. Similarly, we are given a set of cannot-link constraints CL, where (i, j) ∈ CL indicates that x_i and x_j should be in different clusters. A clustering maps instances (through their index) to their cluster label. The indicator function has value one if the enclosed expression is true, and zero otherwise. We select the “best” solution from a set of clusterings as the one satisfying the largest number of constraints (in case of a tie, we select randomly among the tied clusterings).
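The selection step described above is simple enough to sketch directly. The following is a minimal Python sketch (not the authors' implementation): each candidate clustering is an array of cluster labels, and constraints are index pairs.

```python
import numpy as np

def cobs_select(clusterings, ml, cl, rng=None):
    """Select the clustering that satisfies the most pairwise constraints.

    clusterings: list of 1-D integer label arrays (one per candidate)
    ml, cl: lists of (i, j) index pairs for must-link / cannot-link constraints
    Ties are broken at random, as in the paper.
    """
    rng = np.random.default_rng() if rng is None else rng
    scores = []
    for y in clusterings:
        y = np.asarray(y)
        sat = sum(y[i] == y[j] for i, j in ml)   # satisfied must-links
        sat += sum(y[i] != y[j] for i, j in cl)  # satisfied cannot-links
        scores.append(sat)
    scores = np.asarray(scores)
    best = np.flatnonzero(scores == scores.max())
    return clusterings[rng.choice(best)], int(scores.max())
```

Counting satisfied constraints is linear in the number of constraints per clustering, so scoring even a pool of several hundred clusterings is cheap.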
COBS is motivated by the following two observations.
First, it is commonly accepted that no single algorithm performs best on all clustering problems: each algorithm comes with its own bias, which may match a particular problem to a greater or lesser degree [14]. Traditional semi-supervised approaches use constraints within the scope of an individual algorithm. By doing so, they can change the bias of the algorithm, but only to a certain extent. For instance, using constraints to learn a Mahalanobis distance allows K-means to find ellipsoidal clusters, rather than spherical ones, but still does not make it possible to find non-convex clusters. In contrast, by using constraints to choose between clusterings generated by very different algorithms, COBS aims to select the most suitable one from a diverse range of biases.
Second, it is also widely known that within a single clustering algorithm the choice of hyperparameters can strongly influence the clustering result. Consequently, choosing a good parameter setting is crucial. Currently, a user can either do this manually, or use one of the selection strategies discussed in Section II. Both options come with significant drawbacks. Tuning parameters manually is time-consuming, given the often large number of combinations one might try. Existing automated selection strategies avoid this manual labor, but can easily fail to select a good setting as they do not take the user’s preferences into account. For COBS, parameters are an asset rather than a burden. They allow generating a large and diverse set of clusterings, from which we can select the most suitable solution with a limited number of pairwise constraints.
Although our approach is very simple, it does not appear to have been studied before, neither as a way to incorporate constraints into clustering, nor as a way to select clustering algorithms and their parameter settings (despite the substantial body of research on both constraint-based clustering and hyperparameter selection).
Research questions
In the remainder of this section, we aim to answer the following questions:

Q1: How does COBS, for hyperparameter selection only, compare to unsupervised hyperparameter selection methods?

Q2: How does COBS, for hyperparameter selection only, compare to existing semi-supervised clustering algorithms?

Q3: How does COBS, for both algorithm and hyperparameter selection, compare to existing semi-supervised algorithms?

Q4: Can we improve COBS by using semi-supervised algorithms to generate clusterings, instead of unsupervised ones?
Although our selection strategy is also related to meta-clustering [10], an experimental comparison would be difficult, as meta-clustering produces a dendrogram of clusterings for the user to explore. The user can traverse this dendrogram to obtain a single clustering, but the outcome of this process is highly subjective. COBS works with pairwise constraints; we therefore compare to other methods that do the same.
Experimental methodology
To answer our research questions we perform experiments with 10 UCI classification datasets, listed in Table I. These have also been used in several other studies on semi-supervised clustering [7, 38]. The optdigits389 dataset is a subset of the UCI handwritten digits dataset, containing only the digits 3, 8 and 9 [7, 22]. The classes are assumed to represent the clusters of interest. We evaluate how well the returned clusters coincide with them by computing the Adjusted Rand Index (ARI) [17], which is a commonly used measure for this; 0 means that the clustering is no better than random, 1 is a perfect match. In our experiments with semi-supervised clustering, we always repeat the following steps 25 times and report average results:

Randomly partition the full dataset into 70% (“potential supervision set”) and 30% (“leftout set”).

Generate a number of pairwise constraints (this number is a parameter) by repeatedly selecting two random instances from the supervision set, adding a must-link constraint if they belong to the same class, and a cannot-link constraint otherwise.

Apply COBS to the full dataset to obtain a clustering.

Evaluate the clustering by calculating the ARI on all objects that were not involved in any constraints.
We avoid including pairs in the evaluation that were among the given constraints, as this would be the equivalent of testing on the training set.
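The constraint-generation and evaluation steps above can be sketched as follows. This is a minimal Python sketch of the experimental protocol, not the authors' code; function names are our own.

```python
import numpy as np
from sklearn.metrics import adjusted_rand_score

def make_constraints(labels, candidates, n_constraints, rng):
    """Draw random pairs from the supervision set; each pair becomes a
    must-link if the two instances share a class, a cannot-link otherwise."""
    ml, cl = [], []
    for _ in range(n_constraints):
        i, j = rng.choice(candidates, size=2, replace=False)
        (ml if labels[i] == labels[j] else cl).append((i, j))
    return ml, cl

def evaluate(labels, clustering, ml, cl):
    """ARI computed only on instances not involved in any constraint,
    to avoid 'testing on the training set'."""
    constrained = {i for pair in ml + cl for i in pair}
    keep = [i for i in range(len(labels)) if i not in constrained]
    return adjusted_rand_score(np.asarray(labels)[keep],
                               np.asarray(clustering)[keep])
```

Note that `evaluate` excludes every instance that appears in a constraint, which is slightly stricter than only excluding the constrained pairs themselves.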
Table I: Datasets used in the experiments.

dataset  # instances  # features  # classes 

wine  178  13  3 
dermatology  358  33  6 
iris  147  4  3 
ionosphere  350  34  2 
breast-cancer-wisconsin  449  32  2 
ecoli  336  7  8 
optdigits389  1151  64  3 
segmentation  2100  19  7 
glass  214  10  7 
hepatitis  112  19  2 
We use K-means, DBSCAN and spectral clustering to generate clusterings in step one of Algorithm 1, as they are common representatives of different types of algorithms (we use the implementations from scikit-learn [24]). The hyperparameters are varied in the ranges specified in Table II. In particular, for each dataset we generate 180 clusterings using K-means (for each number of clusters we store the clusterings obtained with 20 random initializations), 351 using spectral clustering and 400 using DBSCAN, yielding a total of 931 clusterings. For discrete parameters, clusterings are generated for the complete range. For continuous parameters, clusterings are generated using 20 evenly spaced values in the specified intervals. For the eps parameter of DBSCAN, the lower and upper bounds of this interval are the minimum and maximum pairwise distances between instances.
All datasets are normalized by rescaling each feature to the range [0, 1]. We use the Euclidean distance for all unsupervised algorithms.
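The generation of the clustering pool can be sketched with scikit-learn, as used in the paper. This is a hypothetical, much smaller grid than the 931-clustering grid of the paper (the exact ranges are in Table II); it only illustrates the mechanics.

```python
import numpy as np
from sklearn.cluster import KMeans, SpectralClustering, DBSCAN
from sklearn.metrics import pairwise_distances
from sklearn.preprocessing import MinMaxScaler

def generate_clusterings(X, k_range=range(2, 11), n_seeds=20, n_eps=20):
    """Generate a pool of unsupervised clusterings over a small grid."""
    X = MinMaxScaler().fit_transform(X)  # rescale features to [0, 1]
    pool = []
    # K-means: vary the number of clusters and the random seed
    for k in k_range:
        for seed in range(n_seeds):
            pool.append(KMeans(n_clusters=k, n_init=1,
                               random_state=seed).fit_predict(X))
    # spectral clustering: vary the number of clusters
    for k in k_range:
        pool.append(SpectralClustering(n_clusters=k,
                                       random_state=0).fit_predict(X))
    # DBSCAN: vary eps between the min and max pairwise distances
    d = pairwise_distances(X)
    for eps in np.linspace(d[d > 0].min(), d.max(), n_eps):
        pool.append(DBSCAN(eps=eps).fit_predict(X))
    return pool
```

In practice one would also vary minPts for DBSCAN and the graph-construction parameters for spectral clustering, as Table II indicates.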
Table II: Hyperparameters varied for each algorithm, and the unsupervised selection method used for comparison.

Algorithm  Param.  Selection method 

K-means  K  silhouette index 
DBSCAN  eps, minPts  DBCV index 
spectral  k, sigma, number of clusters  self-tuning 
Q1: COBS vs. unsupervised hyperparameter tuning
To evaluate hyperparameter selection for individual algorithms, we use Algorithm 1 with a set of clusterings generated using one particular algorithm (K-means, DBSCAN or spectral clustering). We compare COBS to state-of-the-art unsupervised selection strategies. As there is no single method that can be used for all three algorithms, we use three different approaches, which are briefly described next.
K-means has one hyperparameter: the number of clusters K. A popular way to select K is to use internal clustering quality measures [31, 2]: K-means is run for different values of K (in this case also for different random seeds), and afterwards the clustering that scores highest on such an internal measure is chosen. In our setup, we generate 20 clusterings for each K by using different random seeds. We select the clustering that scores highest on the silhouette index [27], which was identified as one of the best internal criteria by Arbelaitz et al. [2].
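This silhouette-based baseline is a few lines with scikit-learn; the following sketch (our own, with an illustrative grid) selects over all (K, seed) combinations exactly as described above.

```python
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def kmeans_silhouette(X, k_range=range(2, 11), n_seeds=20):
    """Unsupervised baseline: keep the K-means clustering with the
    highest silhouette index over all (K, seed) combinations."""
    best_labels, best_score = None, -1.0
    for k in k_range:
        for seed in range(n_seeds):
            labels = KMeans(n_clusters=k, n_init=1,
                            random_state=seed).fit_predict(X)
            score = silhouette_score(X, labels)
            if score > best_score:
                best_labels, best_score = labels, score
    return best_labels, best_score
```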
DBSCAN has two parameters: eps, which specifies how close points should be to be in the same neighborhood, and minPts, which specifies the number of points that are required in a neighborhood for a point to be a core point. Most internal criteria are not suited for DBSCAN, as they assume spherical clusters, whereas one of the key characteristics of DBSCAN is that it can find clusters of arbitrary shape. One exception is the Density-Based Cluster Validation (DBCV) score [23], which we use in our experiments.
Spectral clustering requires the construction of a similarity graph, which can be done in several ways [32]. If a nearest-neighbor graph is used, the number of neighbors k has to be set. For graphs based on a Gaussian similarity function, sigma has to be set to specify the width of the neighborhoods. The number of clusters must also be specified. Self-tuning spectral clustering [39] avoids having to specify any of these parameters, by relying on local statistics to compute a different sigma for each instance, and by exploiting the structure of the eigenvectors to determine the number of clusters. This approach differs from the ones used for K-means and DBSCAN: here we do not generate a set of clusterings first; instead, the hyperparameters are estimated directly from the data.
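The local-statistics part of self-tuning spectral clustering can be sketched as follows: each instance gets its own scale sigma_i, set to the distance to its 7th nearest neighbor (the value used in [39]), and the affinity becomes A_ij = exp(-d_ij² / (sigma_i · sigma_j)). The eigenvector-based estimation of the number of clusters is omitted here for brevity, so this sketch still takes the number of clusters as input.

```python
import numpy as np
from sklearn.cluster import SpectralClustering
from sklearn.metrics import pairwise_distances

def local_scaling_affinity(X, k=7):
    """Locally scaled affinity matrix of self-tuning spectral clustering:
    A_ij = exp(-d_ij^2 / (sigma_i * sigma_j)), with sigma_i the distance
    from x_i to its k-th nearest neighbour."""
    d = pairwise_distances(X)
    sigma = np.sort(d, axis=1)[:, k]  # column 0 is the self-distance
    A = np.exp(-(d ** 2) / np.outer(sigma, sigma))
    np.fill_diagonal(A, 0.0)
    return A
```

The resulting matrix can be plugged into scikit-learn via `SpectralClustering(n_clusters=..., affinity="precomputed")`.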
Table III: ARIs obtained for Q1 (unsupervised selection vs. COBS) and Q2 (semi-supervised algorithms).

K-means (Q1) and MPCKMeans (Q2):

dataset  SI  COBS  SI  NumSat  CVCP 
wine  0.85  0.81  0.86  0.68  0.70 
dermatology  0.57  0.84  0.59  0.46  0.42 
iris  0.56  0.66  0.62  0.72  0.65 
ionosphere  0.27  0.24  0.24  0.19  0.17 
breast-cancer-wisconsin  0.73  0.67  0.73  0.73  0.71 
ecoli  0.04  0.62  0.70  0.51  0.45 
optdigits389  0.49  0.79  0.58  0.28  0.49 
segmentation  0.10  0.51  0.38  0.19  0.28 
hepatitis  0.19  0.18  0.25  0.18  0.14 
glass  0.23  0.20  0.24  0.17  0.20 

DBSCAN (Q1) and FOSC (Q2):

dataset  DBCV  COBS  FOSC 
wine  0.32  0.36  0.53 
dermatology  0.37  0.40  0.76 
iris  0.56  0.50  0.80 
ionosphere  0.05  0.66  0.04 
breast-cancer-wisconsin  0.65  0.72  0.53 
ecoli  0.03  0.44  0.56 
optdigits389  0.00  0.27  0.55 
segmentation  0.24  0.37  0.54 
hepatitis  0.13  0.02  0.23 
glass  0.01  0.14  0.15 

spectral (Q1) and COSC (Q2):

dataset  STS  COBS  eigen  NumSat  CVCP 
wine  0.90  0.89  0.50  0.50  0.68 
dermatology  0.21  0.88  0.38  0.38  0.50 
iris  0.56  0.81  0.84  0.43  0.60 
ionosphere  0.24  0.23  0.22  0.22  0.24 
breast-cancer-wisconsin  0.81  0.79  0.83  0.83  0.83 
ecoli  0.04  0.65  0.67  0.44  0.61 
optdigits389  0.38  0.94  0.54  0.54  0.77 
segmentation  0.24  0.49  0.15  0.15  0.26 
hepatitis  0.10  0.03  0.13  0.13  0.11 
glass  0.17  0.17  0.12  0.12  0.17 
Results and conclusion
The columns of Table III marked with Q1 compare the ARIs obtained with the unsupervised approaches to those obtained with COBS. The better of the two is underlined for each algorithm and dataset combination. Most of the time the constraint-based selection strategy performs better, often by a large margin. Note for example the large difference for ionosphere: DBSCAN is able to produce a good clustering, but it is only selected using the constraint-based approach. When the unsupervised selection method performs better, the difference is usually small. We conclude that the internal measures often do not match the actually desired clusters, and that constraints provide useful information that can help select a good parameter setting.
Q2: COBS vs. semi-supervised algorithms
It is not too surprising that COBS outperforms unsupervised hyperparameter selection, since it has access to more information. We now compare to semi-supervised algorithms, which have access to the same information.
Existing semi-supervised algorithms
We compare to the following algorithms, as they are semi-supervised variants of the unsupervised algorithms used in our experiments:

MPCKMeans [7] is a hybrid semi-supervised extension of K-means. It minimizes an objective that combines the within-cluster sum of squares with the cost of violating constraints. This objective is greedily minimized using a procedure based on K-means. Besides a modified cluster assignment step and the usual cluster center re-estimation step, this procedure also adapts an individual metric associated with each cluster in each iteration. We use the implementation available in the WekaUT package (http://www.cs.utexas.edu/users/ml/risc/code/).

FOSC-OpticsDend [9] is a semi-supervised extension of OPTICS, which is in turn based on ideas similar to DBSCAN. The first step of this algorithm is to run the unsupervised OPTICS algorithm, and to construct a dendrogram from its output. The FOSC framework is then used to extract from this dendrogram a flat clustering that is optimal w.r.t. the given constraints.

COSC [26] is based on spectral clustering, but optimizes an objective that combines the normalized cut with a penalty for constraint violation. We use the implementation available on the authors’ web page (http://www.ml.unisaarland.de/code/cosc/cosc.htm).
In our experiments, the only kind of supervision given to the algorithms is in the form of pairwise constraints. In particular, the number of clusters K is assumed to be unknown. In COBS, K is treated like any other hyperparameter. MPCKMeans and COSC, however, require the number of clusters to be specified. The following strategies are used to select K based on the constraints:

NumSat: We run the algorithms for multiple values of K, and select the clustering that violates the smallest number of constraints. In case of a tie, we choose the solution with the lowest number of clusters.

CVCP: Cross-Validation for finding Clustering Parameters [25] is a cross-validation procedure for semi-supervised clustering. The set of constraints is divided into independent folds. To evaluate a parameter setting, the algorithm is repeatedly run on the entire dataset given the constraints in all but one fold, keeping the remaining fold aside as a test set. The clustering produced from the training folds is then treated as a classifier that distinguishes between must-link and cannot-link constraints in the test fold, and the F-measure is used to score this classifier. The performance of the parameter setting is estimated as the average F-measure over all test folds. This process is repeated for all parameter settings, and the setting with the highest average F-measure is retained; the algorithm is then run with this setting and all constraints to produce the final clustering. We use 5-fold cross-validation.
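The CVCP procedure can be sketched as follows. This is our own illustrative sketch, not the implementation of [25]; `cluster_fn(setting, ml, cl)` is a hypothetical callable that runs the semi-supervised algorithm with the given setting and constraints and returns a label array.

```python
import numpy as np
from sklearn.metrics import f1_score

def cvcp_score(cluster_fn, settings, ml, cl, n_folds=5, rng=None):
    """Score each parameter setting by cross-validation over constraint
    folds; a held-out pair is 'predicted' must-link iff its two instances
    end up in the same cluster. Returns (best setting, all scores)."""
    rng = np.random.default_rng() if rng is None else rng
    pairs = [(i, j, 1) for i, j in ml] + [(i, j, 0) for i, j in cl]  # 1 = ML
    pairs = [pairs[i] for i in rng.permutation(len(pairs))]
    folds = [pairs[f::n_folds] for f in range(n_folds)]
    scores = {}
    for s in settings:
        f1s = []
        for f in range(n_folds):
            train = [p for g in range(n_folds) if g != f for p in folds[g]]
            test = folds[f]
            y = cluster_fn(s, [(i, j) for i, j, t in train if t == 1],
                              [(i, j) for i, j, t in train if t == 0])
            true = [t for _, _, t in test]
            pred = [int(y[i] == y[j]) for i, j, _ in test]
            f1s.append(f1_score(true, pred, zero_division=0))
        scores[s] = float(np.mean(f1s))
    return max(scores, key=scores.get), scores
```

A setting that reproduces the held-out constraints well gets an F-measure close to one, so overfitting the training constraints is penalized.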
Results and conclusion
The columns in Table III marked with Q2 show the ARIs obtained with the semi-supervised algorithms. The best result for each type of algorithm (unsupervised or semi-supervised) is indicated in bold. The table shows that in several cases it is more advantageous to use the constraints to optimize the hyperparameters of the unsupervised algorithm (as COBS does). In other cases, it is better to use the constraints within the algorithm itself, to perform a more informed search (as the semi-supervised variants do). Within the scope of a single clustering algorithm, neither strategy consistently outperforms the other. For example, if we use spectral clustering on the dermatology data, it is better to use the constraints for tuning the hyperparameters of unsupervised spectral clustering (also varying k and sigma for constructing the similarity graph) than within COSC, its semi-supervised variant (which uses local scaling for this). In contrast, if we use density-based clustering on the same data, it is better to use the constraints in FOSC-OpticsDend (which does not have an eps parameter, and for which minPts is set to 4, a value commonly used in the literature [13, 9]) than to use them to tune the hyperparameters of DBSCAN (varying both eps and minPts).
Q3: COBS with multiple unsupervised algorithms
In the previous two subsections, we showed that constraints can be useful to tune the hyperparameters of individual algorithms. Table III also shows, however, that no single algorithm (unsupervised or semi-supervised) performs well on all datasets. This motivates the use of COBS to select not only the hyperparameters, but also the clustering algorithm. In this subsection we again use Algorithm 1, but the set of clusterings generated in step 1 now includes clusterings produced by any of the three unsupervised algorithms.
Results
We compare COBS with existing semi-supervised algorithms in Figure 1 (due to the long runtimes of COSC, we do not report results in combination with CVCP on the two largest datasets, optdigits389 and segmentation). COBS is able to find relatively good clusterings for the first 8 datasets. While some other approaches also do well on some of these datasets, none of them does so consistently. Compared to each competitor individually, COBS is clearly superior. For example, COSC-EigenGap outperforms COBS on the iris dataset, but performs much worse on several others. COBS performs poorly on glass and hepatitis, as do the other semi-supervised algorithms, although for hepatitis other approaches are able to find better solutions after a larger number of constraints. The overall poor performance on these last two datasets suggests that the class labels do not indicate a natural grouping.
Table IV allows us to assess the quality of the clusterings selected by COBS, relative to the quality of the best clustering in the set of generated clusterings. Column 2 shows the highest ARI of all generated clusterings for each dataset. Note that we can only compute this value in an experimental setting, in which we have labels for all elements. In a real clustering application, we cannot simply select the result with the highest ARI. Column 3 then shows the ARI of the clustering that is actually selected by COBS when it is given 50 constraints. It shows that there is still room for improvement, i.e. a more advanced strategy might get closer to the maxima. Nevertheless, even our simple strategy gets close enough to outperform most other semi-supervised methods. The last column of Table IV shows how often COBS chose a clustering by K-means (‘K’), DBSCAN (‘D’) and spectral clustering (‘S’). It illustrates that the selected algorithm strongly depends on the dataset. For example, for ionosphere COBS selects clusterings generated by DBSCAN, as it is the only algorithm able to produce good clusterings of this dataset. For most other datasets, spectral clustering is preferred.
Conclusion
If any of the unsupervised algorithms is able to produce good clusterings, COBS can select them using a limited number of constraints. If not, COBS performs poorly, but in our experiments none of the algorithms did well in this case. We conclude that it is often better to use constraints to select and tune an unsupervised algorithm than to use them within a randomly chosen semi-supervised algorithm.
Table IV: Best ARI among the generated clusterings, ARI of the clustering selected by COBS given 50 constraints, and the algorithms of the selected clusterings (over 25 runs).

dataset  best unsupervised  COBS  algorithm used 

wine  0.93  0.90  K:4/D:0/S:21 
dermatology  0.94  0.87  K:12/D:0/S:13 
iris  0.88  0.80  K:9/D:0/S:16 
ionosphere  0.70  0.65  K:0/D:25/S:0 
breast-cancer-wisconsin  0.84  0.77  K:4/D:1/S:20 
ecoli  0.75  0.65  K:6/D:0/S:19 
optdigits389  0.97  0.96  K:0/D:0/S:25 
segmentation  0.59  0.50  K:8/D:2/S:15 
hepatitis  0.27  0.01  K:1/D:18/S:6 
glass  0.29  0.19  K:14/D:0/S:11 
Q4: Using COBS with semi-supervised algorithms
In the previous section we have shown that we can use constraints to do algorithm and hyperparameter selection for unsupervised algorithms. On the other hand, constraints can also be useful within an adapted clustering procedure, as traditional semi-supervised algorithms demonstrate. This raises the question: can we combine both approaches? In this section, we use the constraints to select and tune a semi-supervised clustering algorithm. In particular, we vary the hyperparameters of the semi-supervised algorithms to generate the set of clusterings from which we select. The varied hyperparameters are the same as those of their unsupervised variants, except for two. First, eps is not varied for FOSC-OpticsDend, as it is not a hyperparameter of that algorithm. Second, in this section we only use nearest-neighbor graphs for (semi-supervised) spectral clustering, as full similarity graphs lead to long execution times for COSC.
Results and conclusions
Column 3 of Table V shows that this strategy does not produce better results. This is caused by using the same constraints twice: once within the semi-supervised algorithms, and once to evaluate the algorithms and select the best-performing one. Obviously, algorithms that overfit the given constraints will get selected in this manner.
The problem could be alleviated by using separate constraints inside the algorithm and for evaluation, but this decreases the number of constraints that can effectively be used for either purpose. Column 4 of Table V shows the average ARIs obtained if we use half of the constraints within the semi-supervised algorithms, and half to select one of the generated clusterings afterwards. This works better, but often still not as well as COBS with unsupervised algorithms. Results are only improved for segmentation, hepatitis and glass, the datasets with less clear clustering structure (as indicated by the ARIs).
We conclude that using semi-supervised algorithms within COBS can only be beneficial if the semi-supervised algorithms use constraints different from those used for selection. Even then, when a limited number of constraints is available, using all of them for selection is often the best choice.
Table V: ARIs of COBS with unsupervised algorithms (COBS-U), with semi-supervised algorithms (COBS-SS), and with the constraints split between clustering and selection (COBS-SS-split).

dataset  COBS-U  COBS-SS  COBS-SS-split 

wine  0.89  0.54  0.80 
dermatology  0.85  0.62  0.81 
iris  0.77  0.51  0.75 
ionosphere  0.64  0.19  0.31 
breast-cancer-wisconsin  0.79  0.50  0.69 
ecoli  0.67  0.51  0.63 
optdigits389  0.92  0.51  0.80 
segmentation  0.48  0.45  0.54 
hepatitis  0.07  0.09  0.27 
glass  0.18  0.18  0.19 
Note on computational complexity
One might expect COBS to be prohibitively expensive, given the large number of clusterings it needs to generate. This is not the case, for multiple reasons.
First, the runtimes of individual clustering algorithms vary greatly, and in addition, some semi-supervised algorithms are much slower than their unsupervised counterparts. As a result, constructing many clusterings with unsupervised algorithms is only slightly more expensive than running the slowest semi-supervised algorithm just once. In our experiments, for the largest dataset we used (segmentation), generating 931 unsupervised clusterings took 560s on a single core, using scikit-learn implementations. A single run of COSC, the semi-supervised variant of spectral clustering, took 200s (using the MATLAB implementation available on the authors’ web page). If COSC is run multiple times, for instance with different numbers of clusters (as is done in COSC-NumSat and COSC-CVCP), its runtime quickly exceeds that of COBS.
Second, the runtime of COBS can be reduced in several ways. The cluster generation step can easily be parallelized. For larger datasets, one might consider doing the algorithm and hyperparameter selection on a sample of the data, and afterwards cluster the complete dataset only once with the selected configuration.
Finally, note that the added cost of doing algorithm and parameter selection is no different from the comparable, and commonly accepted, cost of model selection in (semi-)supervised learning. The focus is on maximally exploiting the limited amount of supervision, as obtaining labels or constraints is often expensive, whereas computation is cheap.
IV Active COBS
Obtaining constraints can be costly, as they are often specified by human experts. Consequently, several methods have been proposed to actively select the most informative constraints [4, 22, 38]. We first briefly discuss some of these methods, and subsequently present a constraint selection strategy for COBS.
IV-A Related work
Basu et al. [4] were the first to propose an active constraint selection method for semi-supervised clustering. Their strategy is based on the construction of neighborhoods: sets of points that are known to belong to the same cluster because must-link constraints are defined between them. These neighborhoods are initialized in the exploration phase: K instances with cannot-link constraints between them are sought (with K the number of clusters), by iteratively querying the relation between the existing neighborhoods and the point farthest from these neighborhoods. In the subsequent consolidation phase, these neighborhoods are expanded by iteratively querying a random point against the known neighborhoods until a must-link occurs and the right neighborhood is found. Mallapragada et al. [22] extend this strategy by selecting the most uncertain points to query in the consolidation phase, instead of random ones. Note that in these approaches all constraints are queried before the actual clustering is performed.
More recently, Xiong et al. [38] proposed the normalized point-based uncertainty (NPU) framework. Like the approach introduced by Mallapragada et al. [22], NPU incrementally expands neighborhoods and uses an uncertainty-based principle to determine which pairs to query. In the NPU framework, however, the data is reclustered several times, and at each iteration the current clustering is used to determine the next set of pairs to query. NPU can be used with any semi-supervised clustering algorithm, and Xiong et al. [38] use it with MPCK-Means to experimentally demonstrate its superiority to the method of Mallapragada et al. [22].
IV-B Active constraint selection in COBS
Like the approaches in [22] and [38], our constraint selection strategy for COBS is based on uncertainty sampling. Defining this uncertainty is straightforward within COBS, because a set of clusterings is available: a pair is more uncertain if more clusterings disagree on whether its instances should be in the same cluster or not. Algorithm 2 presents a selection strategy based on this idea. We associate with each clustering a weight that depends on the number of constraints it was right or wrong about. In each iteration we query the pair with the lowest weighted agreement. The agreement of a pair (line 5 of the algorithm) is defined as the absolute value of the difference between the sum of the weights of the clusterings in which the instances in the pair belong to the same cluster, and the sum of the weights of the clusterings in which they belong to different clusters. The weights of clusterings that correctly “predict” the relation between a pair are increased by multiplying them by an update factor; the weights of the other clusterings are decreased by dividing them by that factor. As the total number of pairwise constraints is quadratic in the number of instances, we only consider constraints in a small random sample of all possible constraints.
IV-C Experiments
We first demonstrate the influence of the weight update factor and sample size, and then compare our approach to active constraint selection with NPU [38].
Effect of weight update factor and sample size
Our constraint selection strategy requires specifying a weight update factor and a sample size. Figure 2 shows the results for wine and dermatology for several values of the update factor. First, the figure shows that the active strategy can significantly improve performance over random selection. Second, it shows that the selection process is not very sensitive to the choice of the update factor. Figure 3 shows the results for several sample sizes. It shows that the sample size has a limited effect on performance for a small number of constraints, but that this effect increases as more constraints are given. In the remainder of this section we use a sample of 1000 constraints (i.e., we try to choose the most useful constraints to ask from 1000 possible queries), and set the weight update factor to 2.
Comparison to active selection with NPU
NPU [38] can be used in combination with any semi-supervised clustering algorithm; we use the same ones as in the previous section. We do not include CVCP hyperparameter selection in these experiments because of its high computational complexity (for these experiments we cannot cluster for several fixed numbers of constraints, as the choice of the next constraints depends on the current clustering). For the same reason, we only include the EigenGap parameter selection method for the two largest datasets (optdigits389 and segmentation). The results are shown in Figure 4. For the first 8 datasets, the conclusions are similar to those for the random setting: COBS consistently performs relatively well. Also in the active setting, none of the approaches produces a clustering with a high ARI for glass. For hepatitis, however, MPCK-Means is able to find good clusterings while COBS is not, albeit only after a relatively large number of constraints (hepatitis contains 112 instances). This implies that, although the labels might not represent a natural grouping, the class structure does match the bias of MPCK-Means, and given many constraints the algorithm finds this structure.
Time complexity
We distinguish between the offline and online stages of COBS. In the offline stage, the set of clusterings is generated. As mentioned before, this took 560s on a single core for the largest dataset (segmentation, with 2100 instances). In the online stage, we select the most informative pairs and ask the user about their relation. Execution time is particularly important here, as this stage requires user interaction. In active COBS, the cost of selecting the next pair to query is proportional to the number of clusterings times the sample size, as we have to loop through all clusterings for each constraint in the sample. For the setup used in our experiments, this always took less than 0.02s. Note that this time does not depend on the size of the dataset (as all clusterings are generated beforehand). In contrast, NPU requires reclustering the data several times during the constraint selection process, which is usually significantly more computationally expensive.
Conclusion
The COBS approach allows for a straightforward definition of uncertainty: pairs of instances are more uncertain if more clusterings disagree on them. Selecting the most uncertain pairs first can significantly increase performance.
V Conclusion
Exploiting constraints has been the subject of substantial research, but all existing methods use them within the clustering process of individual algorithms. In contrast, we propose to use them to choose between clusterings generated by different unsupervised algorithms, run with different parameter settings. We experimentally show that this strategy is superior to the semi-supervised algorithms it is compared to, which are themselves state of the art and representative of a wide range of algorithms. For the majority of the datasets, it works as well as the best among them, and on average it performs much better. The generated clusterings can also be used to select the most informative constraints first, which further improves performance.
In future work, we would like to study several strategies that have been used in supervised learning in the context of semi-supervised clustering. In particular, we want to consider more advanced algorithm and hyperparameter optimization strategies (as in [30]), meta-learning approaches (as in [8]), and combinations of the two (as in [16]).
References
 [1] Antoine Adam and Hendrik Blockeel. Dealing with overlapping clustering: A constraint-based approach to algorithm selection. In MetaSel workshop at ECML/PKDD, pages 43–54. CEUR Workshop Proceedings, September 2015.
 [2] Olatz Arbelaitz, Ibai Gurrutxaga, Javier Muguerza, Jesús M. Pérez, and Iñigo Perona. An extensive comparative study of cluster validity indices. Pattern Recognition, 46(1):243–256, 2013.
 [3] Aharon Bar-Hillel, Tomer Hertz, Noam Shental, and Daphna Weinshall. Learning distance functions using equivalence relations. In ICML, 2003.
 [4] Sugato Basu and Raymond J. Mooney. Active Semi-Supervision for Pairwise Constrained Clustering. In Proc. of the SIAM International Conference on Data Mining, pages 333–344, 2004.
 [5] Shai Ben-David, Ulrike von Luxburg, and Dávid Pál. A sober look at clustering stability. In Proceedings of the 19th Annual Conference on Learning Theory, COLT’06, pages 5–19, Berlin, Heidelberg, 2006. Springer-Verlag.
 [6] Asa Ben-Hur, André Elisseeff, and Isabelle Guyon. A Stability Based Method for Discovering Structure in Clustered Data. In Pacific Symposium on Biocomputing, pages 6–17, 2002.
 [7] Mikhail Bilenko, Sugato Basu, and Raymond J. Mooney. Integrating constraints and metric learning in semi-supervised clustering. In Proc. of the 21st International Conference on Machine Learning, pages 81–88, July 2004.
 [8] Pavel B. Brazdil, Carlos Soares, and Joaquim Pinto da Costa. Ranking Learning Algorithms: Using IBL and Meta-Learning on Accuracy and Time Results. Machine Learning, 50(3):251–277, 2003.
 [9] Ricardo J. G. B. Campello, Davoud Moulavi, Arthur Zimek, and Jörg Sander. A framework for semi-supervised and unsupervised optimal extraction of clusters from hierarchies. Data Mining and Knowledge Discovery, 27(3):344–371, 2013.
 [10] Rich Caruana, Mohamed Elhawary, and Nam Nguyen. Meta clustering. In Proc. of the International Conference on Data Mining, 2006.
 [11] Jason V. Davis, Brian Kulis, Prateek Jain, Suvrit Sra, and Inderjit S. Dhillon. Information-theoretic metric learning. In Proceedings of the 24th International Conference on Machine Learning, ICML ’07, pages 209–216, New York, NY, USA, 2007. ACM.
 [12] M.C.P. de Souto, R.B.C. Prudencio, R.G.F. Soares, D.S.A. de Araujo, I.G. Costa, T.B. Ludermir, and A. Schliep. Ranking and selecting clustering algorithms using a meta-learning approach. In IEEE International Joint Conference on Neural Networks, 2008.
 [13] Martin Ester, Hans-Peter Kriegel, Jörg Sander, and Xiaowei Xu. A density-based algorithm for discovering clusters in large spatial databases with noise. In Proc. of the Second International Conference on Knowledge Discovery and Data Mining, pages 226–231. AAAI Press, 1996.
 [14] Vladimir Estivill-Castro. Why so many clustering algorithms: a position paper. ACM SIGKDD Explorations Newsletter, 4:65–75, 2002.
 [15] Daniel Gomes Ferrari and Leandro Nunes de Castro. Clustering algorithm selection by meta-learning systems: A new distance-based problem characterization and ranking combination methods. Information Sciences, 301:181–194, 2015.
 [16] Matthias Feurer, Aaron Klein, Katharina Eggensperger, Jost Springenberg, Manuel Blum, and Frank Hutter. Efficient and robust automated machine learning. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems 28, pages 2944–2952. Curran Associates, Inc., 2015.
 [17] Lawrence Hubert and Phipps Arabie. Comparing partitions. Journal of Classification, 2(1):193–218, 1985.
 [18] Frank Hutter, Holger H. Hoos, and Kevin Leyton-Brown. Sequential Model-Based Optimization for General Algorithm Configuration, pages 507–523. Springer Berlin Heidelberg, Berlin, Heidelberg, 2011.
 [19] Anil K. Jain. Data clustering: 50 years beyond K-means. Pattern Recognition Letters, 31:651–666, 2010.
 [20] Tilman Lange, Volker Roth, Mikio L. Braun, and Joachim M. Buhmann. Stabilitybased validation of clustering solutions. Neural Comput., 16(6):1299–1323, June 2004.
 [21] Levi Lelis and Jörg Sander. Semi-supervised density-based clustering. In 2009 Ninth IEEE International Conference on Data Mining, pages 842–847, Dec 2009.
 [22] Pavan K. Mallapragada, Rong Jin, and Anil K. Jain. Active query selection for semisupervised clustering. In Proc. of the 19th International Conference on Pattern Recognition, 2008.
 [23] Davoud Moulavi, Pablo A. Jaskowiak, Ricardo J.G.B. Campello, Arthur Zimek, and Jörg Sander. Density-based clustering validation. In Proc. of the 14th SIAM International Conference on Data Mining, 2014.
 [24] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.
 [25] Mojgan Pourrajabi, Arthur Zimek, Davoud Moulavi, Ricardo J. G. B. Campello, and Randy Goebel. Model Selection for Semi-Supervised Clustering. In Proc. of the 17th International Conference on Extending Database Technology, 2014.
 [26] Syama S. Rangapuram and Matthias Hein. Constrained 1-spectral clustering. In Proc. of the 15th International Conference on Artificial Intelligence and Statistics, 2012.
 [27] Peter J. Rousseeuw. Silhouettes: A graphical aid to the interpretation and validation of cluster analysis. Journal of Computational and Applied Mathematics, 20:53–65, 1987.
 [28] Carlos Ruiz, Myra Spiliopoulou, and Ernestina Menasalvas. C-DBSCAN: Density-Based Clustering with Constraints. In RSFDGrC ’07: Proc. of the International Conference on Rough Sets, Fuzzy Sets, Data Mining and Granular Computing held in JRS07, 4481:216–223, 2007.
 [29] Noam Shental, Aharon Bar-Hillel, Tomer Hertz, and Daphna Weinshall. Computing Gaussian mixture models with EM using equivalence constraints. In Advances in Neural Information Processing Systems 16, 2004.
 [30] Chris Thornton, Frank Hutter, Holger H. Hoos, and Kevin Leyton-Brown. Auto-WEKA: Combined selection and hyperparameter optimization of classification algorithms. In Proc. of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2013.
 [31] Lucas Vendramin, Ricardo J G B Campello, and Eduardo R Hruschka. Relative clustering validity criteria: A comparative overview. Statistical Analysis and Data Mining, 3(4):209–235, 2010.
 [32] Ulrike von Luxburg. A tutorial on spectral clustering. Statistics and Computing, 17(4):395–416, 2007.
 [33] Ulrike von Luxburg. Clustering stability: An overview. Found. Trends Mach. Learn., 2(3):235–274, March 2010.
 [34] Ulrike von Luxburg, Robert C. Williamson, and Isabelle Guyon. Clustering: Science or Art? In Workshop on Unsupervised Learning and Transfer Learning, JMLR Workshop and Conference Proceedings 27, 2014.
 [35] Kiri Wagstaff, Claire Cardie, Seth Rogers, and Stefan Schroedl. Constrained K-means Clustering with Background Knowledge. In Proc. of the Eighteenth International Conference on Machine Learning, pages 577–584, 2001.
 [36] Xiang Wang, Buyue Qian, and Ian Davidson. On constrained spectral clustering and its applications. Data Mining and Knowledge Discovery, 28(1):1–30, 2014.
 [37] Eric P. Xing, Andrew Y. Ng, Michael I. Jordan, and Stuart Russell. Distance metric learning, with application to clustering with side-information. In Advances in Neural Information Processing Systems 15, pages 505–512, 2003.
 [38] Sicheng Xiong, Javad Azimi, and Xiaoli Z. Fern. Active learning of constraints for semisupervised clustering. IEEE Transactions on Knowledge and Data Engineering, 26(1):43–54, 2014.
 [39] Lihi Zelnik-Manor and Pietro Perona. Self-tuning spectral clustering. In Advances in Neural Information Processing Systems 17, pages 1601–1608, 2004.