In many learning tasks we have a plethora of unlabelled data in a high-dimensional space, consisting of a series of features that have some interpretation, and a limited number of labelled data, since the latter are expensive to generate. In many cases we do not know the actual contribution of each feature to the learning task, and we often consider dimensionality reduction methods in order to keep only the features most relevant to the given task. However, many dimensionality reduction methods, such as Principal Component Analysis (PCA) [wold1987principal], result in the loss of the original features, which might be of interest, especially if each feature has been specifically designed to have a biological meaning. In unsupervised scenarios a number of authors [pan2007penalized, wang2008variable, xie2008penalized, raftery2006variable, maugis2009variable] have proposed clustering algorithms that keep the initial features intact and assign a weight to each of them based on its contribution to the clustering, potentially resulting in feature selection and sparse clustering.
In the work of [witten2010framework] a general sparse clustering framework is presented which incorporates $L_1$ (lasso regression) and $L_2$ (ridge regression) penalties in order to eliminate the uninformative features and weight the rest based on their contribution to the clustering [witten2009penalized]. Such a method requires the tuning of the sparsity hyper-parameter, which essentially regulates how strongly the $L_1$ penalty is applied. This framework has been applied with K-Means and Hierarchical clustering but can also be applied to semi-supervised scenarios where pairwise constraints are given as an additional input to the algorithm. Such constraints are generated from partly labelled data and indicate which data points should (MUST-LINK) or should not (CANNOT-LINK) belong to the same cluster. Previous work on semi-supervised learning [wagstaff2001constrained, basu2002semi, klein2002instance, xing2003distance, bar2003learning] has indicated that incorporating constraints can result in superior performance of the learning algorithm. This is achieved by guiding the clustering solution either by altering the objective function of the algorithm to include satisfaction of the constraints [demiriz1999semi] or by initialising the centroids, an important step in K-Means clustering as indicated in previous studies [pena1999empirical, he2004initialization, celebi2013comparative], at more appropriate locations of the feature space based on the constraints [basu2002semi]. Another technique is to train a metric that satisfies the constraints, as in the work of [xing2003distance], in which pairwise constraints were used to train a Mahalanobis metric.
Based on the previous work of [witten2010framework] on sparse K-Means clustering, we propose a modification of the objective function of the algorithm to incorporate constraints. We show that in this case we get the best of both worlds, since the constraints result in better clustering performance without affecting the sparsity capabilities of the algorithm. We name this algorithm Pairwise Constrained Sparse K-Means (PCSKM) and we test its performance under different conditions, such as different numbers and kinds of constraints (CANNOT-LINK, MUST-LINK or both). In our previous study [vouros2019empirical] we have shown the superiority of the deterministic initialisation method DK-Means++ over stochastic methods, thus we select this method for the initialisation of the algorithms. In our benchmark we include real world data sets from the UCI repository [asuncion2007uci] and a real world data set from a previous behavioural neuroscience study [vouros2018generalised] which contains a known uninformative feature.
We will use the following notations:
$X$ represents a data set in the form of an $n$ by $p$ matrix, where $n$ is the number of observations and $p$ is the number of features (dimensions), and $x_i$ specifies the i-th element of the data set.
$K$ represents the number of target clusters $C_1, \dots, C_K$, with $n_1, \dots, n_K$ elements in each cluster respectively. $c_k$ specifies the k-th cluster center (centroid). The centroid is the mean of the data points in this cluster.
$\bar{x}$ is the global centroid of the data set (mean of all data).
2.1 The K-Means (LKM) algorithm
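K-Means aims to minimize the sum of squared distances between the data points and their respective centroids, i.e. the within cluster sum of squares (WCSS), as indicated in the objective function of equation 1,
$$\mathrm{WCSS} = \sum_{k=1}^{K} \sum_{x_i \in C_k} \lVert x_i - c_k \rVert^2, \tag{1}$$
which is equivalent to maximizing the between cluster sum of squares (BCSS) given by equation 2,
$$\mathrm{BCSS} = \sum_{k=1}^{K} n_k \lVert c_k - \bar{x} \rVert^2, \tag{2}$$
where $C_k$ is the k-th cluster with $n_k$ elements and centroid $c_k$, and $\bar{x}$ is the global centroid of the data set. The equivalence holds because the sum of WCSS and BCSS is the total sum of squares, which depends only on the data.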
The most common algorithm for minimizing equation 1 is Lloyd's K-Means algorithm, which will be assumed as the default K-Means algorithm for the rest of this paper. This algorithm is described below [jain2010data]:
1. Initialise $K$ centroids using some initialisation method.
2. Assign each data point $x_i$ to the cluster $C_{k^*}$ so that,
$$k^* = \arg\min_{k} \lVert x_i - c_k \rVert^2. \tag{3}$$
3. Recompute the cluster centroids by taking the mean of the data points assigned to them, i.e. for the k-th cluster, if it contains $n_k$ elements its centroid is computed as $c_k = \frac{1}{n_k} \sum_{x_i \in C_k} x_i$.
4. Repeat steps 2 and 3 until convergence, i.e. until there are no more data point reassignments.
The algorithm returns the final clusters (centroids and element assignments).
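The steps above can be sketched in a few lines of NumPy. This is a minimal illustration and not the implementation used in this study: the function names (`lloyd_kmeans`, `wcss`, `bcss`), the `max_iter` safeguard and the choice to keep the old centroid when a cluster becomes empty are assumptions made for the example, and the initial centroids are simply passed in by the caller.

```python
import numpy as np

def lloyd_kmeans(X, init_centroids, max_iter=100):
    """Lloyd's K-Means sketch: alternate assignment and centroid update."""
    X = np.asarray(X, dtype=float)
    centroids = np.array(init_centroids, dtype=float)
    labels = None
    for _ in range(max_iter):  # max_iter is a safeguard added for this sketch
        # Step 2: assign each point to its nearest centroid (squared Euclidean).
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        new_labels = dists.argmin(axis=1)
        # Step 4: stop when no data point changes cluster.
        if labels is not None and np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Step 3: recompute each centroid as the mean of its assigned points.
        for k in range(len(centroids)):
            members = X[labels == k]
            if len(members) > 0:  # assumption: an empty cluster keeps its centroid
                centroids[k] = members.mean(axis=0)
    return centroids, labels

def wcss(X, centroids, labels):
    """Within cluster sum of squares (the quantity K-Means minimizes)."""
    X = np.asarray(X, dtype=float)
    return float(((X - centroids[labels]) ** 2).sum())

def bcss(X, centroids, labels):
    """Between cluster sum of squares, using the global centroid of the data."""
    X = np.asarray(X, dtype=float)
    global_centroid = X.mean(axis=0)
    counts = np.bincount(labels, minlength=len(centroids))
    return float((counts * ((centroids - global_centroid) ** 2).sum(axis=1)).sum())
```

On a toy data set with two well separated pairs of points, `wcss` plus `bcss` equals the total sum of squares around the global centroid, which is why minimizing one is the same as maximizing the other.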
2.2 The Sparse K-Means (SKM) algorithm
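In the work of [witten2010framework] the authors propose the maximisation of a weighted version of the BCSS (refer to equation 2) subject to certain constraints. The proposed objective function is given by equation 4,
$$\max_{C_1, \dots, C_K,\, w} \; \sum_{j=1}^{p} w_j\, \mathrm{BCSS}_j \quad \text{subject to} \quad \lVert w \rVert_2^2 \le 1, \;\; \lVert w \rVert_1 \le s, \;\; w_j \ge 0 \;\; \forall j, \tag{4}$$
where $\mathrm{BCSS}_j$ is the BCSS computed on the j-th feature alone, $w_j$ is the weight of the j-th feature, $\lVert w \rVert_2^2$ is the $L_2$ penalty or ridge regression and $\lVert w \rVert_1$ is the $L_1$ penalty or lasso regression. The minimization of the $L_1$ penalty results in a constant shrinkage of the weights, meaning that some weights will reach $0$ (feature selection). On the other hand, the minimization of the $L_2$ penalty results in a proportional shrinkage of the weights, meaning that the weights will never reach $0$ (feature weighting). $s$ is known as the sparsity parameter and regulates the amount of sparsity, i.e. how many features will receive 0 weight.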
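An iterative algorithm for maximizing the function 4 is given below [witten2010framework]:
1. Initialise $K$ centroids using some initialisation method and the weights $w$. As a further optimization step, when the squared Euclidean distance is used, the weights can be initialised as $w_1 = \dots = w_p = 1/\sqrt{p}$ [witten2010framework].
2. Holding $w$ fixed, optimize 4 with respect to the clusters $C_1, \dots, C_K$ by applying K-Means on the data with each feature $j$ scaled by $\sqrt{w_j}$, i.e. by minimizing the weighted WCSS,
$$\sum_{j=1}^{p} w_j\, \mathrm{WCSS}_j. \tag{5}$$
3. Holding $C_1, \dots, C_K$ fixed, optimize 4 with respect to the weights $w$,
$$w = \frac{S(a_+, \Delta)}{\lVert S(a_+, \Delta) \rVert_2}, \tag{6}$$
where $a_j = \mathrm{BCSS}_j$, $S(x, \Delta) = \mathrm{sign}(x)\,(\lvert x \rvert - \Delta)_+$ is the soft-thresholding operator and $x_+$ denotes the positive part of $x$. Assuming that $a$ has a unique maximum and that $1 \le s \le \sqrt{p}$, $\Delta = 0$ if that results in $\lVert w \rVert_1 \le s$, otherwise a $\Delta > 0$ is chosen to yield $\lVert w \rVert_1 = s$. To find a value for $\Delta$ the Binary Search algorithm is used [witten2010framework].
4. Iterate through steps 2 and 3 until the convergence criterion in equation 7 is met,
$$\frac{\sum_{j=1}^{p} \lvert w_j^{r} - w_j^{r-1} \rvert}{\sum_{j=1}^{p} \lvert w_j^{r-1} \rvert} < \epsilon, \tag{7}$$
where $w^{r}$ refers to the weights of the current iteration, $w^{r-1}$ to the weights of the previous iteration and $\epsilon$ is a small tolerance.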
The algorithm returns the final clusters (centroids and elements) and the weight of each feature.
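The weight update of Sparse K-Means, as described in [witten2010framework], can be sketched as soft-thresholding followed by $L_2$ normalisation, with a binary search for the threshold. The code below is an illustration under stated assumptions rather than the implementation used in this study: the names `soft_threshold`, `update_weights` and `weights_converged`, the number of search steps and the default tolerance are all choices made for the example, and the input `a` is assumed to hold the per-feature BCSS with a unique, strictly positive maximum.

```python
import numpy as np

def soft_threshold(a, delta):
    """Soft-thresholding operator: sign(a) * max(|a| - delta, 0)."""
    return np.sign(a) * np.maximum(np.abs(a) - delta, 0.0)

def update_weights(a, s, n_steps=60):
    """Feature weights for per-feature BCSS `a` and sparsity `s` (1 <= s <= sqrt(p))."""
    a_pos = np.maximum(np.asarray(a, dtype=float), 0.0)  # positive part of a
    w = a_pos / np.linalg.norm(a_pos)
    if w.sum() <= s:
        return w  # a threshold of 0 already satisfies the L1 constraint
    # Fallback: all weight on the best feature (L1 norm 1, always feasible for s >= 1).
    best = np.zeros_like(a_pos)
    best[np.argmax(a_pos)] = 1.0
    lo, hi = 0.0, float(a_pos.max())
    for _ in range(n_steps):  # binary search for the threshold
        delta = (lo + hi) / 2.0
        u = soft_threshold(a_pos, delta)
        norm = np.linalg.norm(u)
        if norm == 0.0:
            hi = delta
            continue
        w = u / norm
        if w.sum() > s:
            lo = delta  # still too dense: increase the threshold
        else:
            hi, best = delta, w  # feasible: remember it and try a smaller threshold
    return best

def weights_converged(w_new, w_old, tol=1e-4):
    """Relative L1 change of the weights; `tol` is an assumed default."""
    w_new, w_old = np.asarray(w_new, float), np.asarray(w_old, float)
    return bool(np.abs(w_new - w_old).sum() / np.abs(w_old).sum() < tol)
```

With a small sparsity value the features with the lowest BCSS receive a weight of exactly zero, while a value of $s$ close to $\sqrt{p}$ leaves the weights proportional to the BCSS of each feature.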
2.3 The Pairwise Constrained K-Means (PCKM) and the Metric Pairwise Constrained K-Means (MPCKM) algorithms
In the study of [bilenko2004integrating] the authors proposed a semi-supervised algorithm called Metric Pairwise Constrained K-Means (MPCKM), which learns a distance metric based on constraints imposed by labelled data points in the data set. The constraints are imposed between pairs of points and can be either MUST-LINK, i.e. the two points must be in the same cluster, or CANNOT-LINK, i.e. the two points must not be in the same cluster [wagstaff2001constrained]. MPCKM integrates a variant of the Pairwise Constrained K-Means (PCKM) algorithm [basu2004active] with metric learning [bar2003learning, xing2003distance]. The PCKM algorithm [basu2004active] incorporates constraints to guide the clustering solution. The constraints are considered soft, meaning that violations are permitted, as opposed to its predecessor, the COP-KMeans algorithm [wagstaff2001constrained], which stops if a constraint violation is unavoidable. Metric learning is the adaptation of a distance metric to satisfy the similarity imposed by the pairwise constraints (supervised similarity [bilenko2004integrating]). These constraints may not result in separable clusters, thus a metric should be learnt that creates distinctive clusters but at the same time satisfies the supervised similarity.
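As an illustration of how pairwise constraints can be derived from partly labelled data, the sketch below builds MUST-LINK and CANNOT-LINK pairs from a small labelled subset. The function name `constraints_from_labels` and the dictionary input format are assumptions made for this example; in practice only a random subset of the possible pairs would typically be used.

```python
from itertools import combinations

def constraints_from_labels(labelled):
    """Build pairwise constraints from a dict {point index: class label}.

    Pairs with the same label become MUST-LINK constraints,
    pairs with different labels become CANNOT-LINK constraints.
    """
    must_link, cannot_link = [], []
    for i, j in combinations(sorted(labelled), 2):
        if labelled[i] == labelled[j]:
            must_link.append((i, j))
        else:
            cannot_link.append((i, j))
    return must_link, cannot_link
```

For example, labelling points 0 and 3 as one class and point 5 as another yields one MUST-LINK pair and two CANNOT-LINK pairs.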
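The PCKM objective function is given by equation 2.3,
$$J_{\mathrm{PCKM}} = \sum_{k=1}^{K} \sum_{x_i \in C_k} \lVert x_i - c_k \rVert^2 + \sum_{(x_i, x_j) \in \mathcal{M}} w_{ij}\, f_M(x_i, x_j)\, \mathbb{1}[l_i \ne l_j] + \sum_{(x_i, x_j) \in \mathcal{C}} \bar{w}_{ij}\, f_C(x_i, x_j)\, \mathbb{1}[l_i = l_j],$$
where $l_i$ denotes the cluster to which the i-th element $x_i$ is assigned, $C_k$ is the k-th cluster with centroid $c_k$, and the second and third terms of the equation contain two functions, $f_M$ and $f_C$, that indicate the severity of violating the imposed MUST-LINK and CANNOT-LINK constraints of the i-th element belonging to the k-th cluster; $\mathbb{1}[\cdot]$ is a boolean function that specifies whether a MUST-LINK ($\mathcal{M}$) or CANNOT-LINK ($\mathcal{C}$) constraint has been violated; $l_i \ne l_j$ specifies violation of a MUST-LINK constraint and $l_i = l_j$ violation of a CANNOT-LINK constraint. The terms $w_{ij}$ and $\bar{w}_{ij}$ provide a way of specifying individual costs for each constraint violation. In the original algorithm [basu2004active] the functions $f_M$ and $f_C$ were equal to $1$. However, specifying appropriate values for the constraint costs can be challenging and requires extensive knowledge about the data set under analysis or the quality of the constraints. Thus in this study we assume that $w_{ij}$ and $\bar{w}_{ij}$ are equal to 1 and we use the distance functions of [bilenko2004integrating], where $f_M(x_i, x_j) = \lVert x_i - x_j \rVert^2$ and $f_C(x_i, x_j) = \lVert x' - x'' \rVert^2 - \lVert x_i - x_j \rVert^2$ ($x'$ and $x''$ are the maximally separated points in the data set). In this way the severity is proportional to the distance of the pair of points. Based on the second term, the severity of the penalty for violating a MUST-LINK constraint between the i-th element and another point distant from the i-th element is higher than if this pair of points were nearby. Analogously, based on the third term, the severity of the penalty for violating a CANNOT-LINK constraint between the i-th element and another point near the i-th element is higher than if this pair of points were far from each other. For minimising 2.3 an algorithm equivalent to K-Means can be used, with the only difference that during the assignment of a data point to the nearest cluster the constraints are also included, resulting in,
$$k^* = \arg\min_{k} \Big( \lVert x_i - c_k \rVert^2 + \sum_{(x_i, x_j) \in \mathcal{M}} w_{ij}\, f_M(x_i, x_j)\, \mathbb{1}[k \ne l_j] + \sum_{(x_i, x_j) \in \mathcal{C}} \bar{w}_{ij}\, f_C(x_i, x_j)\, \mathbb{1}[k = l_j] \Big).$$
MPCKM additionally replaces the squared Euclidean distance with the parameterized term $\lVert x_i - c_k \rVert_A^2 - \log(\det(A))$, with $\lVert x_i - c_k \rVert_A^2 = \sum_{j=1}^{p} a_j (x_{ij} - c_{kj})^2$ for a diagonal matrix $A = \mathrm{diag}(a_1, \dots, a_p)$, where $a_j$ is a weight that parameterizes the Euclidean distance on the j-th dimension and $\log(\det(A))$ is a normalization constant [xing2003distance] that does not allow the weights to grow too large.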
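The constrained assignment step of PCKM can be illustrated with a short sketch. It assumes unit constraint costs and the distance-based severity functions described above (squared distance for a violated MUST-LINK, maximum squared distance minus squared distance for a violated CANNOT-LINK). The names `max_sq_distance` and `pckm_assignment_cost`, and the convention that a label of `-1` marks a point that has not been assigned yet, are assumptions of this example.

```python
import numpy as np

def max_sq_distance(X):
    """Squared distance between the two maximally separated points of X."""
    diffs = X[:, None, :] - X[None, :, :]
    return float((diffs ** 2).sum(axis=2).max())

def pckm_assignment_cost(i, k, X, centroids, labels, must_link, cannot_link, d_max_sq):
    """Cost of assigning point i to cluster k, with unit constraint costs."""
    cost = ((X[i] - centroids[k]) ** 2).sum()
    for a, b in must_link:
        if i in (a, b):
            j = b if a == i else a
            # MUST-LINK violated if the partner already sits in another cluster;
            # the penalty grows with the distance between the two points.
            if labels[j] != -1 and labels[j] != k:
                cost += ((X[i] - X[j]) ** 2).sum()
    for a, b in cannot_link:
        if i in (a, b):
            j = b if a == i else a
            # CANNOT-LINK violated if the partner sits in the same cluster;
            # the penalty is larger when the two points are close together.
            if labels[j] == k:
                cost += d_max_sq - ((X[i] - X[j]) ** 2).sum()
    return float(cost)
```

Because the constraints are soft, a point may still be assigned to a cluster that violates a constraint when the distance term dominates. Note also that the cost depends on the current labels of the constrained partners, so the result of one assignment pass depends on the order in which the points are visited.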
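An iterative algorithm for minimizing the function 2.3 is given below [bilenko2004integrating]:
1. Initialise $K$ centroids using some initialisation method and $A$ as a diagonal matrix. A semi-supervised initialisation method is proposed in [basu2004active] and will be discussed later.
2. Assign each data point $x_i$ to the cluster that minimizes the sum of its parameterized distance to the cluster centroid and the penalties of the constraints that the assignment violates.
3. Recompute the cluster centroids by taking the mean of the data points assigned to them, i.e. for the k-th cluster, if it contains $n_k$ elements its centroid is computed as $c_k = \frac{1}{n_k} \sum_{x_i \in C_k} x_i$.
4. Update the weights $a_j$ of the metric $A$ by minimizing the objective function with respect to $A$ while holding the cluster assignments and the centroids fixed [bilenko2004integrating].
5. Iterate through steps 2 to 4 until convergence. Various criteria can be used for convergence, e.g. a maximum number of iterations is reached or the change in the objective function falls below a minimum value.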
The algorithm returns the final clusters (centroids and elements). The weights correspond to the learnt metric that shapes the feature space accordingly to satisfy the input constraints.
2.4 The proposed Pairwise Constrained Sparse K-Means (PCSKM) algorithm
Based on the work of [witten2010framework] we propose an algorithm for sparse clustering that also takes advantage of pairwise constraints. This algorithm aims to maximise the objective function in equation 2.4 and it is given below,
1. Given a data set, the number of clusters $K$, MUST-LINK and CANNOT-LINK constraints and constraint costs (optional), initialise $K$ centroids using some initialisation method and the weights $w$. As a further optimization step, when the squared Euclidean distance is used, the weights can be initialised as $w_1 = \dots = w_p = 1/\sqrt{p}$ [witten2010framework].
2. Holding $w$ fixed, optimize 2.4 with respect to the clusters by applying PCKM, in place of the K-Means step of Sparse K-Means.
3. Holding the clusters fixed, optimize 2.4 with respect to the weights $w$. The solution follows the same proposition as in Sparse K-Means [witten2010framework].
4. Iterate through steps 2 and 3 until the convergence criterion in equation 7 is met.
The algorithm returns the final clusters (centroids and elements) and the weight of each feature.
One important note is that K-Means and Sparse K-Means are invariant to the order of the elements, meaning that with the same initialisation method (and the same seed if the centroid initialisation method is stochastic) they will produce the same results, while PCSKM does not have this property. This is because Sparse K-Means uses the K-Means algorithm to optimize the WCSS given fixed weights, while PCSKM uses PCKM, where the pairwise constraint penalties are affected by the assignment of each element to a specific cluster. For the same reason PCSKM may also experience oscillations between iterations. However, oscillations occur only after the algorithm has reached a local minimum and have a minimal effect on the clustering solution.
In our first experiment we wanted to assess the performance of PCSKM compared with the unsupervised algorithms Lloyd's K-Means (LKM) and Sparse K-Means (SKM) and the semi-supervised algorithms Pairwise Constrained K-Means (PCKM) and Metric Pairwise Constrained K-Means (MPCKM). We tested the performance of all these algorithms using the deterministic initialisation technique of Density K-Means++ (DKMPP) [nidheesh2017enhanced], which performed best in our previous benchmarking [vouros2019empirical]. We also tested the semi-supervised algorithms with different numbers and types of constraints: only MUST-LINK, only CANNOT-LINK, the same number of MUST-LINK and CANNOT-LINK, and a random selection from both MUST-LINK and CANNOT-LINK. The last configuration was used in the previous benchmarking of MPCKM [bilenko2004integrating]. For this experiment we used real world data sets (fisheriris, ionosphere and glass) from the UCI repository [asuncion2007uci], with the number of clusters equal to the number of labels and the sparsity parameter set to the value that yields the highest performance. We note that the default semi-supervised initialisation procedure of MPCKM has not been considered in this study.
In our second experiment we wanted to assess the feature selection capabilities of our algorithm. We used a reduced data set from the previous rodent study of [vouros2018generalised], where the data consist of 8 features of unknown importance that describe rodent path segments inside the Morris Water Maze experimental procedure. The set contains a total of 441 observations and three classes. The dimensionality was increased with the addition of one more feature, the path segment length. This 9th feature is considered unimportant given that all the segments were created to have approximately the same length.
The performance was tested using an evaluation similar to the one used in [basu2004active, bilenko2004integrating]. We ran 10-fold cross validation using all the data but splitting the labels into training and test sets. The performance on each fold was assessed based on the F-score, an information retrieval measure adapted for evaluating clustering by considering same-cluster pairs, similar to [bilenko2004integrating]. The clustering algorithm was run on the whole data set, but the F-score was calculated only on the test set. Results were averaged over 25 runs (each run with a random selection of constraints based on type) of 10 folds.
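A pairwise F-score of this kind can be sketched as follows. The sketch assumes the usual pairwise definitions (precision is the fraction of same-cluster pairs that share a class, recall is the fraction of same-class pairs that share a cluster); the function name `pairwise_f_score` and the convention of returning 0 when no pair is correctly grouped are assumptions of this example, not details taken from the original evaluation code.

```python
from itertools import combinations

def pairwise_f_score(true_labels, cluster_labels):
    """F-score over all pairs of points, treating same-cluster pairs as positives."""
    tp = fp = fn = 0
    for i, j in combinations(range(len(true_labels)), 2):
        same_class = true_labels[i] == true_labels[j]
        same_cluster = cluster_labels[i] == cluster_labels[j]
        if same_cluster and same_class:
            tp += 1  # pair correctly placed in the same cluster
        elif same_cluster:
            fp += 1  # pair grouped together but from different classes
        elif same_class:
            fn += 1  # pair from the same class but split across clusters
    if tp == 0:
        return 0.0  # assumed convention when no pair is correctly grouped
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

In a setting like the one described above, this function would be applied only to the points of the held-out test fold, using their true labels and the cluster assignments obtained on the whole data set.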
In this study we propose a semi-supervised algorithm with a feature selection mechanism. We show that the performance of this algorithm can be more stable than the performance of other semi-supervised algorithms (PCKM and MPCKM) when different types of constraints are used (refer to Figures 1 and 2). We also show that its feature selection mechanism is not affected by the use of constraints and that, compared to MPCKM, which also assigns a weight to each feature, its weight assignments can be used to indicate informative or uninformative features (refer to Figure 2).
Interestingly, in the Morris Water Maze data set (refer to Figure 2) the SKM algorithm yielded a very good performance, and when only MUST-LINK constraints were used for the semi-supervised algorithms, SKM had the overall best performance. A possible explanation is that the three classes of the data set are relatively distinct in a much lower dimensionality than 9 dimensions. This is indicated in Figure 2, where for low sparsity both SKM and PCSKM assign weight values only to the 3rd and 8th features while the rest of the features are dropped. The constraints in this case might have a negative effect on the clustering.
Regardless of the benefits of our proposed PCSKM algorithm, we should highlight that, similar to SKM, it requires the tuning of an extra parameter, the sparsity (s), apart from the number of clusters (k). Both these parameters can be tuned using the gap statistic as proposed in the study of [brodinova2017robust].
A further continuation of this study would be a more detailed experimentation with different centroid initialisation techniques for the semi-supervised algorithms. Here we experimented only with the DKMPP initialisation technique, which proved powerful in our previous benchmark [vouros2019empirical], but we observe that the performance of MPCKM using this method is far lower than the one reported in the study of [bilenko2004integrating], while the performance of LKM is higher. The most probable reason for this is the semi-supervised initialisation method that the authors used in the aforementioned study. A more in-depth study of how to carefully select constraints, which might result in a good clustering with a small number of constraints, is also left for future work.
A.V. designed the methodology and wrote the manuscript. E.V. contributed to the statistical and algorithmic analysis, provided feedback and co-authored the manuscript.