1 Introduction
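Kmeans [1, 2] is one of the most widely known clustering algorithms. The basic problem it solves is as follows: for a fixed natural number $k$ and dataset $\mathcal{X} \subset \mathbb{R}^d$, return a set of centers $C = \{c_1, \dots, c_k\}$ that solves the kmeans problem:
$$C = \operatorname*{arg\,min}_{C' \subset \mathbb{R}^d,\ |C'| = k} \sum_{x \in \mathcal{X}} \min_{c \in C'} \|x - c\|^2. \tag{1}$$
Then, using this set of centers, return one of $k$ labels for each datum $x \in \mathcal{X}$:
$$\ell(x) = \operatorname*{arg\,min}_{i \in \{1, \dots, k\}} \|x - c_i\|.$$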
Lloyd’s algorithm [3] is a particularly long-lived strategy for locally solving the kmeans problem (cf. Algorithm 1); it suggests randomly selecting a subset of $k$ points from the dataset as initial centers, then alternating between updating the cluster assignments and updating the centers. Lloyd’s algorithm does not have an approximation guarantee. Hence, for certain datasets, it can return centers that yield a large value of the objective function in equation (1) with high probability. Thus, even running several independent copies of the algorithm and choosing the best result could yield poor results.
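The alternation described above can be sketched in a few lines of NumPy. This is a minimal illustration, not a transcription of Algorithm 1: the function name `lloyd`, the iteration cap `n_iter`, and the handling of empty clusters (the old center is kept) are assumptions made for the example.

```python
import numpy as np

def lloyd(X, k, n_iter=100, seed=0):
    """Sketch of Lloyd's algorithm: random initial centers, then alternate
    assignment and centroid updates until the centers stop moving."""
    rng = np.random.default_rng(seed)
    # Initial centers: a uniformly random subset of k data points.
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest center.
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Update step: each center moves to the centroid of its points.
        new_centers = centers.copy()
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:  # assumption: keep the old center if a cluster empties
                new_centers[j] = members.mean(axis=0)
        converged = np.allclose(new_centers, centers)
        centers = new_centers
        if converged:
            break
    # Final assignment and cost (objective of equation (1)) under the returned centers.
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return centers, d2.argmin(axis=1), float(d2.min(axis=1).sum())
```

Because the result depends on the random initial subset, different values of `seed` can give different local solutions, which is the sensitivity the paragraph above describes.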
MacQueen [2] conjectures that exactly solving equation (1) is difficult. He is correct; Mahajan et al. [4] show that even for two-dimensional data, kmeans is NP-hard. Because of this, practitioners instead seek to approximately solve the kmeans problem. Arthur and Vassilvitskii [5] present kmeans++, a randomized $O(\log k)$-approximation algorithm running in $O(nkd)$ time for $n$ points in $d$ dimensions (for their initialization step, which is all that is required for the approximation bound) that works by modifying Lloyd’s algorithm to choose initial centers with unequal weighting (cf. Algorithm 2). Their results are remarkable because the algorithm runs in a practical amount of time. This work inspired others to propose alternative randomized initializations for Lloyd’s algorithm for streaming data [6], parallel implementations [7], and bi-approximations that use extra centers, which can then be reclustered to yield $k$ centers [8, 6].
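As an illustration of choosing initial centers with unequal weighting, the following sketch implements the seeding step of kmeans++ as it is usually described: the first center is uniform, and each later center is drawn with probability proportional to its squared distance to the nearest center chosen so far. The function name `kmeanspp_seed` is hypothetical, and the sketch assumes the data contain at least $k$ distinct points.

```python
import numpy as np

def kmeanspp_seed(X, k, seed=0):
    """Sketch of kmeans++ seeding (squared-distance weighting)."""
    rng = np.random.default_rng(seed)
    n = len(X)
    # First center: a uniformly random data point.
    first = rng.integers(n)
    centers = [X[first]]
    # d2[i] = squared distance from X[i] to its nearest chosen center.
    d2 = ((X - X[first]) ** 2).sum(axis=1).astype(float)
    for _ in range(k - 1):
        # Assumption: d2.sum() > 0, i.e. some point is not yet a center.
        probs = d2 / d2.sum()
        idx = rng.choice(n, p=probs)
        centers.append(X[idx])
        d2 = np.minimum(d2, ((X - X[idx]) ** 2).sum(axis=1))
    return np.array(centers, dtype=float)
```

A point that is already a center has weight zero, so it cannot be selected again.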
In semisupervised learning, there is additional information available about the true labels of some of the data. This information typically takes the form of labels (e.g. the known class of a datum) or pairwise constraints (e.g. must-link or cannot-link constraints between two data points). In recent years, there has been a fair amount of interest in solving problems with these additional constraints. Wagstaff et al. [9] propose the COPKMeans algorithm, which uses a modified assignment step in Lloyd’s algorithm to avoid making cluster assignments that would violate the constraints. Basu et al. [10] focus on using label information in their SeededKMeans and ConstrainedKMeans algorithms. Both algorithms use the centroids of the labeled points as initial centers. Basu et al. [11] use the Expectation-Maximization (EM) algorithm [12] as a modified Lloyd’s algorithm, extending the pairwise supervision algorithms with a step in which the distance measure itself is modified (so that they do not necessarily use Euclidean distance). Finley and Joachims [13] learn pairwise similarities to account for semisupervision.
The structure of the remainder of this paper is as follows. First, we introduce the definitions and notation used in the rest of the paper. Next, we present the main algorithm, where we modify the kmeans++ algorithm for the semisupervised case with labels. We then prove an approximation bound that improves with the amount of supervision. Finally, we include numerical experiments showing the efficacy of the algorithm on simulated and real data under a few performance metrics.
2 Preliminaries
We will present a few definitions to clarify the notation used in the theoretical results. Recall that $\mathcal{X}$ is our data. We will additionally partition $\mathcal{X}$ into $\mathcal{X}_u$ and $\mathcal{X}_s$, the unsupervised and supervised data, respectively.
Definition 2.1 (Clustering).
A clustering, say $C$, is a set of centers that are used for cluster assignments of unlabeled points.
Definition 2.2 (Potential function).
Fix a clustering $C$. Define the potential function $\phi_C$ such that, for $A \subseteq \mathcal{X}$,
$$\phi_C(A) = \sum_{x \in A} \min_{c \in C} \|x - c\|^2.$$
Definition 2.3 (Optimal cluster).
Let $C_{OPT}$ be a clustering solving the kmeans problem from equation (1). Let $c^{*} \in C_{OPT}$. An optimal cluster $A$ is defined such that
$$A = \{x \in \mathcal{X} : \|x - c^{*}\| = \min_{c \in C_{OPT}} \|x - c\|\}.$$
Definition 2.4 ($D$).
Call the current clustering $C$. Define $D$ such that
$$D(x) = \min_{c \in C} \|x - c\|.$$
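The quantities in these definitions are straightforward to compute. The sketch below assumes, following [5], that the potential of a set of points is the sum of their squared distances to the nearest center and that $D(x)$ is the distance from $x$ to its nearest current center; the function names are hypothetical.

```python
import numpy as np

def nearest_sq_dist(X, centers):
    """D(x)^2 for each row x of X: squared distance to the nearest center."""
    diffs = X[:, None, :] - centers[None, :, :]
    return (diffs ** 2).sum(axis=2).min(axis=1)

def potential(X, centers):
    """Potential of the points in X under the given centers: sum of D(x)^2."""
    return float(nearest_sq_dist(X, centers).sum())

X = np.array([[0.0, 0.0], [2.0, 0.0], [10.0, 0.0]])
C = np.array([[1.0, 0.0], [10.0, 0.0]])
print(nearest_sq_dist(X, C))  # squared distances 1, 1 and 0
print(potential(X, C))        # 2.0
```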
3 Semisupervised Kmeans++ Algorithm
We now propose an extension to the kmeans++ algorithm for the semisupervised case. We will call this the sskmeans++ algorithm. Suppose that we want to partition our data into $k$ groups. Let us agree that the semisupervision occurs in the following way:

choose a class uniformly at random;

choose a number of observations uniformly at random from that class;

and label these observations as being from that class.
We optionally allow repetition of these steps to give more partially supervised classes. The modified kmeans algorithm, which is Algorithm 3 followed by Algorithm 1, replaces the initial step of choosing a point at random by choosing points as above, then setting the first center to the centroid of those points. Also, during the probabilistic selection process, we do not allow centers to be chosen from the supervised points. This makes sense because the clusters containing those points are already covered. Note that this is essentially the kmeans++ version of the ConstrainedKMeans algorithm [10].
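The seeding just described can be sketched as follows. This is an illustration under stated assumptions rather than a transcription of Algorithm 3: the function name `sskmeanspp_seed`, the `labeled` dictionary (class id mapped to the row indices of its labeled points), and the fallback to a uniformly random first center when no labels are given are all choices made for the example. The remaining centers are drawn with probability proportional to squared distance to the nearest current center, with supervised points given weight zero.

```python
import numpy as np

def sskmeanspp_seed(X, k, labeled, seed=0):
    """Sketch of semisupervised kmeans++ seeding.

    labeled: dict mapping class id -> integer array of row indices of X
             that are labeled with that class (hypothetical input format).
    """
    rng = np.random.default_rng(seed)
    n = len(X)
    # One center per supervised class: the centroid of its labeled points.
    centers = [X[idx].mean(axis=0) for idx in labeled.values()]
    if not centers:
        # Assumption: with no supervision, fall back to plain kmeans++.
        centers = [X[rng.integers(n)].astype(float)]
    supervised = np.zeros(n, dtype=bool)
    for idx in labeled.values():
        supervised[idx] = True
    C = np.array(centers, dtype=float)
    d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2).min(axis=1)
    d2[supervised] = 0.0  # supervised points can never be chosen as centers
    while len(centers) < k:
        probs = d2 / d2.sum()  # assumes some unsupervised point has d2 > 0
        i = rng.choice(n, p=probs)
        centers.append(X[i])
        # The elementwise minimum keeps the zero weight of supervised points.
        d2 = np.minimum(d2, ((X - X[i]) ** 2).sum(axis=1))
    return np.array(centers, dtype=float)
```

Running Lloyd’s iterations from these centers, with the supervised points held fixed in their classes, would then give the full procedure described in the text.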
4 Theoretical Results
Consider the objective function $\phi$, the potential function associated with a clustering $C$. Arthur and Vassilvitskii [5] prove that the expectation of the potential function after the seeding phase of the kmeans++ algorithm satisfies
$$\mathbb{E}[\phi] \le 8(\ln k + 2)\,\phi_{OPT},$$
where $\phi_{OPT}$ corresponds to the potential using the optimal centers. We will improve this bound for our algorithm by mostly following their analysis, mutatis mutandis.
The sketch of the proof is as follows:

Bound the expectation of the potential for the first cluster (chosen by semisupervision).

Bound the expectation of the potential for clusters with centers chosen via $D^2$ weighting, conditioned on the center being from an uncovered cluster.

Bound the expectation of the potential when choosing a number of new centers at once in a technical result.

Specialize the technical result to our algorithm and get the overall bound.
Consider a collection of data $A$ of size $n$. Suppose we have chosen, uniformly at random, $m$ members of $A$ to form a set $S$ that we consider prelabeled. Taking the mean of these data, say $c(S)$, as the proposed center of $A$, the expectation of the potential function is
$$\mathbb{E}\left[\sum_{x \in A} \|x - c(S)\|^2\right],$$
where the expectation is over the choice of the elements of $S$. We can compute this expectation explicitly.
Lemma 4.1.
If is a subset of of size chosen uniformly at random from all subsets of of size , then
where
is the centroid of (i.e. ), and is the centroid of .
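For small examples, the expectation in this setting can be checked by brute force. The sketch below enumerates every subset $S$ of size $m$ of a point set $A$ of size $n$ and averages $\sum_{x \in A} \|x - c(S)\|^2$. It then compares the result with one closed form consistent with sampling without replacement, namely $\left(1 + \frac{n - m}{m(n - 1)}\right)$ times the potential of $A$ about its own centroid. That closed form is stated here as an assumption derived from the finite-population variance of a sample mean, not as a quotation of the lemma; the function names are hypothetical.

```python
from itertools import combinations
import numpy as np

def expected_potential_exact(A, m):
    """Average of sum_x ||x - c(S)||^2 over all size-m subsets S of A."""
    vals = []
    for S in combinations(range(len(A)), m):
        c = A[list(S)].mean(axis=0)  # centroid of the labeled subset
        vals.append(((A - c) ** 2).sum())
    return float(np.mean(vals))

def finite_population_factor(n, m):
    """Assumed closed-form factor; requires n >= 2 and 1 <= m <= n."""
    return 1.0 + (n - m) / (m * (n - 1))

rng = np.random.default_rng(0)
A = rng.normal(size=(6, 2))
phi_opt = float(((A - A.mean(axis=0)) ** 2).sum())
for m in range(1, 7):
    exact = expected_potential_exact(A, m)
    print(m, np.isclose(exact, finite_population_factor(6, m) * phi_opt))
```

Under this form the factor equals 2 when $m = 1$, matching the uniform-first-center case in [5], and decreases to 1 when $m = n$, so more labeled points give a tighter expected potential for the supervised cluster.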
We present several technical lemmas used above.
Lemma 4.2.
Let and be the centroid of A. Let For any ,
Proof.
Observe
Hence,
which was what was wanted. ∎
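A quick numerical check can make this kind of centroid lemma concrete. The sketch assumes the lemma is the standard decomposition used in [5]: for a point set $A$ with centroid $c(A)$ and any point $z$, $\sum_{x \in A} \|x - z\|^2 = \sum_{x \in A} \|x - c(A)\|^2 + |A| \, \|c(A) - z\|^2$. The function name is hypothetical.

```python
import numpy as np

def centroid_identity_gap(A, z):
    """Left side minus right side of the assumed centroid decomposition."""
    c = A.mean(axis=0)
    lhs = ((A - z) ** 2).sum()
    rhs = ((A - c) ** 2).sum() + len(A) * ((c - z) ** 2).sum()
    return float(lhs - rhs)

rng = np.random.default_rng(0)
A = rng.normal(size=(50, 4))
z = rng.normal(size=4)
print(abs(centroid_identity_gap(A, z)) < 1e-9)
```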
Lemma 4.3.
If is a subset of of size chosen uniformly at random from all subsets of of size , then
Proof.
Observe , the probability that is chosen from for a group of size from objects in . The conclusion follows. ∎
Lemma 4.4.
If is a subset of of size chosen uniformly at random from all subsets of of size , then
Proof.
Let be the indicator random variable that is if and otherwise (i.e. Observe
Observe
since the first case represents the probability that and are chosen together and the second case represents the probability that is chosen, as . The conclusion follows. ∎
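The two probabilities used in these indicator arguments can be verified by enumeration. For a subset of size $m$ drawn uniformly from $n$ objects, a fixed object is included with probability $m/n$, and two fixed distinct objects are both included with probability $\frac{m(m-1)}{n(n-1)}$; these are standard facts about sampling without replacement. The sketch below counts subsets exactly, with a hypothetical function name.

```python
from fractions import Fraction
from itertools import combinations

def inclusion_probabilities(n, m):
    """Exact P(0 in S) and P(0 in S and 1 in S) for a uniform size-m subset S
    of {0, ..., n-1}; requires n >= 2 and 1 <= m <= n."""
    subsets = list(combinations(range(n), m))
    total = len(subsets)
    p_single = Fraction(sum(0 in S for S in subsets), total)
    p_pair = Fraction(sum((0 in S) and (1 in S) for S in subsets), total)
    return p_single, p_pair

print(inclusion_probabilities(5, 2))  # (Fraction(2, 5), Fraction(1, 10))
```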
The first result will handle the semisupervised cluster. Suppose that is the optimal set of cluster centers. Now, we consider the contribution to the potential function of a cluster from when a center is chosen from with weighting. If we can prove a good approximation bound, then we can say that conditioned on choosing centers from uncovered clusters, we will have a good result on average.
Lemma 4.5.
Let be the current (arbitrary) set of cluster centers. Note that is probably not a subset of . Let be any cluster in . Let be a point from chosen at random with weighting. Then,
where the expectation is over the choice of new center .
Proof.
Unchanged from Lemma 3.2 in [5]. ∎
Lemma 4.6.
Fix a clustering Suppose there are uncovered clusters from the optimal clustering Denote the points in these uncovered clusters as (not to be confused with ). Let
be the set of points in covered clusters. Use weighting (excluding supervised data) to add new centers to to form . In a slight abuse of notation, let , and . Then,
where
Proof.
We have that the probability of choosing a point from a fixed set with weighting ignoring supervised points is
Further, note that , since all supervised clusters are covered.
Following the argument in [5] using the above probabilities, we have our result. ∎
Theorem 4.7.
Suppose our story about how the supervision occurs holds. Let For each label that we have supervised exemplars of, add the centroid of the supervised data labeled say to C. Suppose that Let be the number of supervised exemplars with label for Then, we have uncovered clusters. Add new centers using weighting ignoring the supervised points. The expectation of the resulting potential, , is then
Proof.
The end result is a modest improvement over that of [5] that scales with the level of supervision. The final inequality in the proof is tighter than the result stated in the theorem, since the factor of could be lower depending on the contributions of the supervised clusters in the optimal clustering.
5 Numerical Experiments
5.1 Performance Measures
We use several measures for each experiment. First, we use the cost, as estimated by the potential function. For comparing to the theoretical bound, we also use the fraction of optimal cost, where “optimal” is derived by taking the centroids for each class as determined by the ground truth labels. Next, we use the number of Lloyd’s iterations until convergence.
Finally, we use the Adjusted Rand Index (ARI) [14], an index that measures how closely two partitions agree. The ARI is the Rand index (the fraction of pairs of points on which the two partitions agree) after adjusting for chance. It is essentially at chance level at 0, not meaningful below 0, and perfect at its maximal value, unity. Since ground truth labels are available for our datasets, we can compare them to the partitions produced by the algorithms in Section 5.3. Thus, a large ARI value indicates good clustering performance as determined by fidelity to the ground truth partition.
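For reference, the ARI can be computed directly from the contingency counts of the two partitions using the Hubert and Arabie formula [14]. The sketch below is a minimal pure-Python version with a hypothetical function name; it assumes at least two points, and it returns 1.0 in the degenerate case where the normalizing denominator is zero, which is a convention chosen for this example.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(labels_a, labels_b):
    """ARI between two labelings of the same n >= 2 points."""
    n = len(labels_a)
    cells = Counter(zip(labels_a, labels_b))  # contingency table entries
    rows = Counter(labels_a)                  # cluster sizes in partition a
    cols = Counter(labels_b)                  # cluster sizes in partition b
    sum_cells = sum(comb(v, 2) for v in cells.values())
    sum_rows = sum(comb(v, 2) for v in rows.values())
    sum_cols = sum(comb(v, 2) for v in cols.values())
    expected = sum_rows * sum_cols / comb(n, 2)  # agreement expected by chance
    max_index = 0.5 * (sum_rows + sum_cols)
    if max_index == expected:
        return 1.0  # assumed convention for the degenerate case
    return (sum_cells - expected) / (max_index - expected)

print(adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0]))  # 1.0
print(adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1]))  # -0.5
```

The first call shows that the index ignores how clusters are named; the second shows that a partition that disagrees systematically with the reference can score below zero.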
5.2 Data
We showcase our algorithm on three datasets (cf. Figure 2 for depiction). The first, Gaussian Mixture, was inspired by [5, 7]. We drew the centers from a 15-dimensional hypercube of fixed side length. For each center, we drew points from a multivariate Gaussian with that center as its mean and identity covariance. This dataset is remarkable because it is easy to cluster by inspection (at least with larger side length, as in the original papers) yet is difficult for Lloyd’s algorithm when initialized with bad centers. For our chosen side length, it is not easy to cluster by eye. Note that the supervision story (where centroids of the class labels correspond to best centers) is likely to hold for most realizations of the data.
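A dataset of this kind can be generated as sketched below. The number of points per center and the side length used here are placeholder values, and drawing the centers uniformly from the hypercube is an assumption; only the dimension (15), the identity covariance, and the number of groups (24) come from the text.

```python
import numpy as np

def gaussian_mixture(k=24, points_per_center=100, dim=15, side=10.0, seed=0):
    """Sketch of a Gaussian Mixture dataset with identity covariance.

    points_per_center and side are placeholder values (assumptions).
    """
    rng = np.random.default_rng(seed)
    # Assumption: centers are drawn uniformly from the hypercube [0, side]^dim.
    centers = rng.uniform(0.0, side, size=(k, dim))
    noise = rng.standard_normal((k * points_per_center, dim))
    X = np.repeat(centers, points_per_center, axis=0) + noise
    y = np.repeat(np.arange(k), points_per_center)  # ground truth labels
    return X, y, centers
```

Shrinking `side` relative to the unit standard deviation makes the clusters overlap more and the problem harder.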
The next two datasets are real data for which the assumption that the labels match up with minimum cost clusters is not met. The second dataset is the venerable Iris dataset [15], which uses four variables to describe three different classes of flowers. While this dataset is old, it is nonetheless difficult for kmeans to handle from a clustering standpoint. This fact is widely known; indeed, even the Wikipedia page for the Iris dataset has a rendering of kmeans failing on it [16]. We compared the ARI for this dataset and the Gaussian Mixture dataset while varying the ratio of the side length of the hypercube to the standard deviation (with the standard deviation fixed), and we found that the datasets were roughly equivalent for a side length around 3.25. This is under one percent of the side length, and hence a far smaller fraction of the volume, of the norm25 dataset [7] that our Gaussian Mixture dataset is based on. Thus, we observe that the Iris dataset is harder to cluster than the synthetic dataset.
The third dataset, Hyperspectral, is a Naval Space Command HyMap hyperspectral dataset representing an aerial photo of a runway and background terrain of the airport at Dahlgren, Virginia, as originally seen in [17] (cf. Figure 1 for a depiction of the location). Each pixel in the photo is associated with features representing different spectral bands (e.g. visible and infrared). We took the first six principal components to form a dataset with data in $\mathbb{R}^6$, chosen as the minimum number of dimensions needed to capture a target fraction of the total variance. The first two principal components are depicted in Figure 2. The classes are the identities of each pixel (i.e. runway, pine, oak, grass, water, scrub, and swamp). Based on the ARI scores presented in the forthcoming results section, this dataset is only a little easier to cluster than Iris.
5.3 Algorithms
For all three datasets, we apply several algorithms: sskmeans++, ConstrainedKMeans, sskmeans++ (without Lloyd’s), ConstrainedKMeans (without Lloyd’s), and ConstrainedKMeans initialized at the true class centroids. We consider the last algorithm as an approximation to the optimal solution. ConstrainedKMeans and ConstrainedKMeans (without Lloyd’s) use a uniformly weighted random sample of the unsupervised data for the remaining initial centers (after using centroids of the labeled points). The algorithms without Lloyd’s use their respective initialization strategy to choose initial centers, then move straight to class assignment without updating the initial centers. We consider these algorithms as “initialization only” methods for this reason.
5.4 Results
We vary the supervision level from 0% to 100%, where we add supervised classes and sample 5/5/50 datapoints per class to label for Gaussian Mixture, Iris, and Hyperspectral, respectively. Note that this is the percent of clusters which have exemplars and not the percent of all points which are labeled. Also, at 100% supervision, sskmeans++ and ConstrainedKMeans are the same, since there are no additional centers to choose. We did not allow the supervised data to change cluster assignment, so the approximation to the optimal can change with the level of supervision and with the particular supervised data chosen. We set $k$ equal to the true number of groups (24 for Gaussian Mixture, 3 for Iris, and 7 for Hyperspectral). We used 100 Monte Carlo replicates at each level of supervision.
Figure 3 shows the cost as the level of supervision changes. We observe the cost decreases with more supervision. Also, we see the same relative performances of the algorithms, with the ++ version outperforming the benchmark. Observe that the approximation to the optimal solution is the best. Figure 4 depicts the theoretical bound. All algorithms are below the bound (in expectation).
Legend for Figures 3 to 6:
gold: ConstrainedKMeans (without Lloyd’s iterations);
blue: sskmeans++ (without Lloyd’s iterations);
red: ConstrainedKMeans;
green: sskmeans++; and
pink: ConstrainedKMeans initialized at true centroids of labels.
Figure 5 shows the number of iterations before Lloyd’s algorithm converges. We can see that improved selection of initial centers by weighted randomization leads to fewer iterations before convergence. We expected this; Arthur and Vassilvitskii [5] observed a similar phenomenon with no supervision. More supervision did not seem to affect the number of iterations until very high levels (near 100%). For the real-world datasets, we can see that the approximation to the optimal algorithm required more than one iteration to converge, indicating that the centroids of the true class labels do not match the locally minimal cost solutions. This means that the conditions for the supervision in our proofs do not hold for these datasets. Nevertheless, both cost and ARI improve with additional supervision.
Figure 6 shows the ARI for all algorithms. Note that supervision improves the ARI, as expected. Also, sskmeans++ generally outperforms ConstrainedKMeans. The same observation holds for the initialization-only versions as well. Remarkably, the combination of the true centroids and Lloyd’s algorithm is outperformed by the initialization-only methods on the Iris and Hyperspectral datasets at 100% supervision for the ARI metric. This is because the true classes do not correspond to the minimum cost solution, which is what Lloyd’s iterations would improve (apparently at the cost of ARI).
6 Conclusions
In this paper, we present a natural extension of kmeans++ and ConstrainedKMeans. We then prove the corresponding bound on the expectation of the cost under some conditions on the supervision. No assumptions are made about the distribution of the data. Finally, we demonstrate on three datasets that judicious supervision and good heuristics for selecting starting centers improve clustering performance, cost, and iteration count.
Possible future theoretical work includes incorporating the advances set forth in the extensions to the original kmeans++ paper. For example, we could produce semisupervised versions of kmeans|| [7] and kmeans# [6] with commensurately improved bounds. Relaxing the constraints to the pairwise cannot-link and must-link constraints as in [9] is also desirable, because the assumption of exogenously provided hard labels is often untenable. Other assumptions that would be nice to relax are the equal cluster shapes and cluster volumes implicit in kmeans clustering.
Acknowledgements
The authors would like to thank Theodore D. Drivas for helping to test the codes used in the experiments and for consultation on aesthetics.
Funding
This work is partially funded by the National Security Science and Engineering Faculty Fellowship (NSSEFF), the Johns Hopkins University Human Language Technology Center of Excellence (JHU HLT COE), and the XDATA program of the Defense Advanced Research Projects Agency (DARPA) administered through Air Force Research Laboratory contract FA87501220303.
References

[1] Forgy E. Cluster analysis of multivariate data: Efficiency vs. interpretability of classification. Biometrics. 1965;21(3):768–769.
[2] MacQueen J, et al. Some methods for classification and analysis of multivariate observations. In: Proceedings of the fifth Berkeley symposium on mathematical statistics and probability; Vol. 1. Oakland, CA, USA; 1967. p. 281–297.
[3] Lloyd SP. Least squares quantization in PCM. Information Theory, IEEE Transactions on. 1982;28(2):129–137.
[4] Mahajan M, Nimbhorkar P, Varadarajan K. The planar k-means problem is NP-hard. Theoretical Computer Science. 2012;442:13–21.
[5] Arthur D, Vassilvitskii S. k-means++: The advantages of careful seeding. In: Proceedings of the eighteenth annual ACM-SIAM symposium on Discrete algorithms. Society for Industrial and Applied Mathematics; 2007. p. 1027–1035.
[6] Ailon N, Jaiswal R, Monteleoni C. Streaming k-means approximation. In: Advances in Neural Information Processing Systems; 2009. p. 10–18.
[7] Bahmani B, Moseley B, Vattani A, Kumar R, Vassilvitskii S. Scalable k-means++. Proceedings of the VLDB Endowment. 2012;5(7):622–633.
[8] Aggarwal A, Deshpande A, Kannan R. Adaptive sampling for k-means clustering. In: Approximation, randomization, and combinatorial optimization: Algorithms and techniques. Springer; 2009. p. 15–28.
[9] Wagstaff K, Cardie C, Rogers S, Schrödl S, et al. Constrained k-means clustering with background knowledge. In: Proceedings of the Eighteenth International Conference on Machine Learning; Vol. 1; 2001. p. 577–584.
[10] Basu S, Banerjee A, Mooney R. Semi-supervised clustering by seeding. In: Proceedings of 19th International Conference on Machine Learning. Citeseer; 2002.
[11] Basu S, Bilenko M, Mooney RJ. A probabilistic framework for semi-supervised clustering. In: Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining. ACM; 2004. p. 59–68.
[12] Dempster AP, Laird NM, Rubin DB. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society Series B (Methodological). 1977;:1–38.
[13] Finley T, Joachims T. Supervised k-means clustering. Cornell University; 2008. Report No.: 181311621.
[14] Hubert L, Arabie P. Comparing partitions. Journal of Classification. 1985;2(1):193–218.
[15] Fisher RA. The use of multiple measurements in taxonomic problems. Annals of Eugenics. 1936;7(2):179–188.
[16] Wikipedia. Iris flower data set — Wikipedia, the free encyclopedia. 2015 12; Available from: https://en.wikipedia.org/w/index.php?title=Iris_flower_data_set&oldid=678872226.
[17] Priebe CE, Marchette DJ, Healy Jr DM. Integrated sensing and processing for statistical pattern recognition. Modern Signal Processing. 2004;46:223.