Semi-supervised K-means++

by   Jordan Yoder, et al.
Johns Hopkins University

Traditionally, practitioners initialize the k-means algorithm with centers chosen uniformly at random. Randomized initialization with uneven weights (k-means++) has recently been used to improve on this strategy in both cost and run-time. We consider the k-means problem with semi-supervised information, where some of the data are pre-labeled, and we seek to label the rest according to the minimum cost solution. By extending the k-means++ algorithm and analysis to account for the labels, we derive an improved theoretical bound on expected cost and observe improved performance in simulated and real data examples. This analysis provides theoretical justification for a roughly linear-time semi-supervised clustering algorithm.




1 Introduction

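K-means [1, 2] is one of the most widely known clustering algorithms. The basic problem it solves is as follows: for a fixed natural number $k$ and dataset $X = \{x_1, \dots, x_n\} \subset \mathbb{R}^d$, return a set of centers $C = \{c_1, \dots, c_k\}$ that solves the k-means problem:

$$C = \arg\min_{C' \subset \mathbb{R}^d,\ |C'| = k} \ \sum_{x \in X} \min_{c \in C'} \|x - c\|^2. \qquad (1)$$

Then, using this set of centers, return one of $k$ labels for each datum:

$$\ell(x) = \arg\min_{i \in \{1, \dots, k\}} \|x - c_i\|.$$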
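As a minimal sketch of the objective and the labeling rule described above, the following Python computes the k-means cost of a candidate set of centers and assigns each datum to its nearest center. The function names (`sq_dist`, `kmeans_cost`, `assign_labels`) are illustrative choices and do not come from the paper.

```python
from typing import List, Sequence

Point = Sequence[float]


def sq_dist(x: Point, c: Point) -> float:
    """Squared Euclidean distance between two points."""
    return sum((xi - ci) ** 2 for xi, ci in zip(x, c))


def kmeans_cost(data: List[Point], centers: List[Point]) -> float:
    """Sum over the data of the squared distance to the nearest center."""
    return sum(min(sq_dist(x, c) for c in centers) for x in data)


def assign_labels(data: List[Point], centers: List[Point]) -> List[int]:
    """Index of the nearest center for each datum (ties go to the lowest index)."""
    return [
        min(range(len(centers)), key=lambda i: sq_dist(x, centers[i]))
        for x in data
    ]


if __name__ == "__main__":
    data = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
    centers = [(0.0, 0.5), (10.0, 0.5)]
    print(kmeans_cost(data, centers), assign_labels(data, centers))
```

Evaluating the cost of a given set of centers is cheap; the difficulty of the k-means problem lies in searching over all possible sets of centers.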

Lloyd’s algorithm [3] is a particularly long-lived strategy for locally solving the k-means problem (cf. Algorithm 1); it suggests randomly selecting a subset of size $k$ from $X$ as initial centers, then alternating updates to cluster assignments and new centers. Lloyd’s algorithm does not have an approximation guarantee. Hence, for certain datasets, it can return centers that result in a large value of the objective function in equation (1) with high probability. Thus, even running independent copies of the algorithm and choosing the best result could yield poor results.

MacQueen [2] conjectures that exactly solving equation (1) is difficult. He is correct; Mahajan et al. [4] show that even for two-dimensional data, k-means is NP-hard. Because of this, practitioners instead seek to approximately solve the k-means problem. Arthur and Vassilvitskii [5] present k-means++, a randomized approximation algorithm running in $O(nkd)$ time for $n$ points in $d$ dimensions (for their initialization step, which is all that is required for the approximation bound) that works by modifying Lloyd’s algorithm to choose initial centers with unequal weighting (cf. Algorithm 2). Their results are remarkable because the algorithm runs in a practical amount of time. This work inspired others to propose alternative randomized initializations for Lloyd’s algorithm for streaming data [6], parallel implementations [7], and bi-approximations that use more than $k$ centers, which can then be re-clustered to yield $k$ centers [8, 6].

In semi-supervised learning, there is additional information available about the true labels of some of the data. This information typically takes the form of labels (e.g. $\ell(x_i) = 1$) or pairwise constraints (e.g. $\ell(x_i) = \ell(x_j)$ or $\ell(x_i) \neq \ell(x_j)$). In recent years, there has been a fair amount of interest in solving problems with these additional constraints. Wagstaff et al. [9] propose the COP-KMeans algorithm, which uses a modified assignment step in Lloyd’s algorithm to avoid making cluster assignments that would violate the constraints. Basu et al. [10] focus on using label information in their Seeded-KMeans and Constrained-KMeans algorithms. Both algorithms use the centroids of the labeled points as initial centers. Basu et al. [11] use the Expectation-Maximization (EM) algorithm [12] as a modified Lloyd’s algorithm, extending the pairwise supervision algorithms with a step in which the distance measure is updated (so that they do not necessarily use Euclidean distance). Finley and Joachims [13] learn pairwise similarities to account for semi-supervision.

The remainder of this paper is structured as follows. First, we introduce the definitions and notation used throughout. Next, we present the main algorithm, which modifies the k-means++ algorithm for the semi-supervised case with labels. We then prove an approximation bound that improves with the amount of supervision. Finally, we include numerical experiments showing the efficacy of the algorithm on simulated and real data under a few performance metrics.

Input: $X$ ($n$ datapoints)
       $C$ ($k$ initial centers)
Output: $C$ (updated centers)
1 repeat
2       Assign each $x \in X$ to the nearest center $c(x) \in C$.
3       Update each $c \in C$ as the centroid of the points $x$ such that $c(x) = c$.
4 until $C$ has not changed
5 return $C$
Algorithm 1 Lloyd’s k-means algorithm
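
A minimal Python sketch of Algorithm 1 follows. It is only an illustration under stated assumptions: the iteration cap `max_iter` and the rule of keeping the old center when a cluster becomes empty are choices made here, not details given in the paper.

```python
from typing import List, Sequence, Tuple

Point = Sequence[float]


def sq_dist(x: Point, c: Point) -> float:
    """Squared Euclidean distance between two points."""
    return sum((xi - ci) ** 2 for xi, ci in zip(x, c))


def centroid(points: List[Point]) -> Tuple[float, ...]:
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(col) / n for col in zip(*points))


def lloyd(data: List[Point], centers: List[Point], max_iter: int = 100):
    """Alternate assignment and centroid updates until the centers stop changing."""
    centers = [tuple(c) for c in centers]
    for _ in range(max_iter):
        # Assignment step: group each datum with its nearest center.
        clusters = [[] for _ in centers]
        for x in data:
            i = min(range(len(centers)), key=lambda j: sq_dist(x, centers[j]))
            clusters[i].append(x)
        # Update step: recompute each center as the centroid of its cluster.
        # Assumption: an empty cluster keeps its previous center.
        new_centers = [
            centroid(pts) if pts else c for pts, c in zip(clusters, centers)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    return centers


if __name__ == "__main__":
    data = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
    print(lloyd(data, [(0.0, 0.0), (10.0, 0.0)]))
```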
Input: $X$ ($n$ datapoints)
       $k$ (number of centers)
Output: $C$ (set of initial centers)
1 Choose an $x \in X$ uniformly at random.
2 Let $C = \{x\}$
3 while card($C$) $< k$ do
4       Choose a datapoint $x \in X$ with probability proportional to $D^2(x)$.
5       Update $C = C \cup \{x\}$
Algorithm 2 Initialization of centers for k-means++
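
The seeding in Algorithm 2 can be sketched in Python as below, where the selection weight of a datapoint is its squared distance to the nearest center chosen so far. The function name `kmeanspp_init`, the optional `rng` argument, and the uniform fallback when every weight is zero are assumptions made for this illustration.

```python
import random
from typing import List, Optional, Sequence, Tuple

Point = Sequence[float]


def sq_dist(x: Point, c: Point) -> float:
    """Squared Euclidean distance between two points."""
    return sum((xi - ci) ** 2 for xi, ci in zip(x, c))


def kmeanspp_init(
    data: List[Point], k: int, rng: Optional[random.Random] = None
) -> List[Tuple[float, ...]]:
    """Choose k initial centers from the data with squared-distance weighting."""
    rng = rng or random.Random()
    centers = [tuple(rng.choice(data))]  # first center: uniform at random
    # d2[i] is the squared distance from data[i] to its nearest chosen center.
    d2 = [sq_dist(x, centers[0]) for x in data]
    while len(centers) < k:
        if sum(d2) > 0:
            idx = rng.choices(range(len(data)), weights=d2, k=1)[0]
        else:
            # Assumption: if every point coincides with a center, pick uniformly.
            idx = rng.randrange(len(data))
        c = tuple(data[idx])
        centers.append(c)
        d2 = [min(old, sq_dist(x, c)) for old, x in zip(d2, data)]
    return centers


if __name__ == "__main__":
    pts = [(0.0, 0.0), (5.0, 0.0), (0.0, 5.0), (5.0, 5.0)]
    print(kmeanspp_init(pts, 2, random.Random(0)))
```

Because an already chosen point has weight zero, it cannot be selected again while other points still have positive weight.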

2 Preliminaries

We will present a few definitions to clarify the notation used in the theoretical results. Recall that $X$ is our data. We will additionally partition $X$ into $X^{(u)}$ and $X^{(s)}$, the unsupervised and supervised data, respectively.

Definition 2.1 (Clustering).

A clustering, say $C$, is a set of centers that are used for cluster assignments of unlabeled points.

Definition 2.2 (Potential function).

Fix a clustering $C$. Define the potential function $\phi_C$ such that for $A \subseteq X$,

$$\phi_C(A) = \sum_{x \in A} \min_{c \in C} \|x - c\|^2.$$

Definition 2.3 (Optimal cluster).

Let $C^*$ be a clustering solving the k-means problem from equation (1). Let $c^* \in C^*$. An optimal cluster $A$ is defined such that

$$A = \{x \in X : c^* = \arg\min_{c \in C^*} \|x - c\|\}.$$

Definition 2.4 ($D^2$).

Call the current clustering $C$. Define $D^2$ such that

$$D^2(x) = \min_{c \in C} \|x - c\|^2.$$

3 Semi-supervised K-means++ Algorithm

We now propose an extension to the k-means++ algorithm for the semi-supervised case. We will call this the ss-k-means++ algorithm. Suppose that we want to partition our data into $k$ groups. Let us agree that the semi-supervision occurs in the following way:

  1. choose a class $i$ uniformly at random;

  2. choose $m_i$ observations uniformly at random from the optimal cluster $A_i$ of that class;

  3. and label these observations as being from class $i$.

We optionally allow repetition of these steps to give more partially supervised classes. The modified k-means algorithm, which is Algorithm 3 followed by Algorithm 1, replaces the initial step of choosing a point at random by choosing $m_i$ points as above, then setting the first center to the centroid of those points. Also, during the probabilistic selection process, we do not allow centers to be chosen from the supervised points. This makes sense because that cluster is already covered. Note that this is essentially the k-means++ version of the Constrained-KMeans algorithm [10].

Input: $X^{(u)}$ ($n_u$ unlabeled datapoints)
       $X^{(s)}$ ($n_s$ labeled datapoints)
       $L$ (labels corresponding to the data in $X^{(s)}$)
       $k$ (number of centers)
Output: $C$ (set of initial centers)
1 Let $m_i$ be the number of supervised datapoints with label $i$
2 Let $C = \emptyset$
3 for $i = 1, \dots, k$ do
4       if $m_i > 0$ then
5             Let $c_i$ be the centroid of the labeled datapoints with label $i$
6             Update $C = C \cup \{c_i\}$
7 while card($C$) $< k$ do
8       Choose a datapoint $x \in X^{(u)}$ with probability proportional to $D^2(x)$.
9       Update $C = C \cup \{x\}$
Algorithm 3 Initialization of centers for semi-supervised k-means++
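
A rough Python sketch of Algorithm 3 is given below. The names are illustrative, and two behaviors are assumptions made here rather than statements from the paper: with no labeled data the first center is drawn uniformly from the unlabeled points (reducing to k-means++ seeding), and a uniform draw is used if all selection weights are zero.

```python
import random
from collections import defaultdict
from typing import Hashable, List, Optional, Sequence, Tuple

Point = Sequence[float]


def sq_dist(x: Point, c: Point) -> float:
    """Squared Euclidean distance between two points."""
    return sum((xi - ci) ** 2 for xi, ci in zip(x, c))


def centroid(points: List[Point]) -> Tuple[float, ...]:
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(col) / n for col in zip(*points))


def ss_kmeanspp_init(
    unlabeled: List[Point],
    labeled: List[Point],
    labels: List[Hashable],
    k: int,
    rng: Optional[random.Random] = None,
) -> List[Tuple[float, ...]]:
    """Seed with centroids of labeled classes, then fill up to k centers
    from the unlabeled data using squared-distance weighting."""
    rng = rng or random.Random()
    by_label = defaultdict(list)
    for x, lab in zip(labeled, labels):
        by_label[lab].append(x)
    centers = [centroid(pts) for pts in by_label.values()]
    if len(centers) > k:
        raise ValueError("more supervised classes than centers")
    if not centers:
        # Assumption: with no supervision, start as plain k-means++ does.
        centers.append(tuple(rng.choice(unlabeled)))
    d2 = [min(sq_dist(x, c) for c in centers) for x in unlabeled]
    while len(centers) < k:
        if sum(d2) > 0:
            idx = rng.choices(range(len(unlabeled)), weights=d2, k=1)[0]
        else:
            idx = rng.randrange(len(unlabeled))  # assumption: uniform fallback
        c = tuple(unlabeled[idx])
        centers.append(c)
        d2 = [min(old, sq_dist(x, c)) for old, x in zip(d2, unlabeled)]
    return centers
```

The supervised points never enter the weighted draw, which mirrors the restriction that new centers are chosen only from the unlabeled data.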

4 Theoretical Results

Consider the objective function $\phi_C$, the potential function associated with a clustering $C$. Arthur and Vassilvitskii [5] prove that the expectation of the potential function after the seeding phase of the k-means++ algorithm satisfies

$$E[\phi_C(X)] \le 8(\ln k + 2)\,\phi_{C^*}(X),$$

where $\phi_{C^*}$ corresponds to the potential using the optimal centers. We will improve this bound for our algorithm by mostly following their analysis, mutatis mutandis.

The sketch of the proof is as follows:

  1. Bound the expectation of the potential for the first cluster (chosen by semi-supervision).

  2. Bound the expectation of the potential for clusters with centers chosen via $D^2$ weighting, conditioned on the center being from an uncovered cluster.

  3. Bound the expectation of the potential when choosing a number of new centers at once in a technical result.

  4. Specialize the technical result to our algorithm and get the overall bound.

Consider a collection of data $A$ of size $n$. Suppose we have chosen $m$ members of $A$ uniformly at random to form a set $S$ that we consider pre-labeled. Take the mean of these data, say $c_S$, to be the proposed center of $A$; then the expectation of the potential function is

$$E[\phi_{\{c_S\}}(A)] = E\Big[\sum_{x \in A} \|x - c_S\|^2\Big],$$

where the expectation is over the choice of the elements of $S$. We can compute this expectation explicitly.

Lemma 4.1.

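If $S$ is a subset of $A$ of size $m$ chosen uniformly at random from all subsets of $A$ of size $m$, then

$$E[\phi_{\{c_S\}}(A)] = \left(1 + \frac{n - m}{m(n - 1)}\right)\phi_{\{c(A)\}}(A),$$

where $n = |A|$, $c_S$ is the centroid of $S$ (i.e. $c_S = \frac{1}{m}\sum_{x \in S} x$), and $c(A)$ is the centroid of $A$.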


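Proof. Let $\bar{x} = c(A)$. Observe, by Lemma 4.2 with $z = c_S$,

$$E[\phi_{\{c_S\}}(A)] = \phi_{\{\bar{x}\}}(A) + n\,E\|c_S - \bar{x}\|^2.$$

Let us determine $E\|c_S - \bar{x}\|^2$. Observe, since $E[c_S] = \bar{x}$ by Lemma 4.3,

$$E\|c_S - \bar{x}\|^2 = E\|c_S\|^2 - \|\bar{x}\|^2 = \frac{1}{m^2}E\Big[\sum_{x \in S}\sum_{y \in S} x^\top y\Big] - \|\bar{x}\|^2.$$

Applying Lemma 4.4, we have

$$E\|c_S - \bar{x}\|^2 = \frac{1}{mn}\sum_{x \in A}\|x\|^2 + \frac{m - 1}{mn(n - 1)}\sum_{x \neq y} x^\top y - \|\bar{x}\|^2 = \frac{n - m}{mn(n - 1)}\,\phi_{\{\bar{x}\}}(A),$$

where the last step uses $\sum_{x \neq y} x^\top y = n^2\|\bar{x}\|^2 - \sum_{x \in A}\|x\|^2$ and $\phi_{\{\bar{x}\}}(A) = \sum_{x \in A}\|x\|^2 - n\|\bar{x}\|^2$. Substituting into the first display gives the result. ∎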
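
As a sanity check on Lemma 4.1, the following Python sketch enumerates every size-$m$ subset of a small point set and compares the exact average potential with the closed form $\left(1 + \frac{n - m}{m(n - 1)}\right)\phi_{\{c(A)\}}(A)$. That closed form is taken here as an assumption: it is what the variance of a sample mean drawn without replacement gives. The helper names are illustrative, and exact rational arithmetic avoids floating point error.

```python
from fractions import Fraction
from itertools import combinations


def centroid(points):
    """Exact coordinate-wise mean, returned as Fractions."""
    n = len(points)
    return tuple(Fraction(sum(col), n) for col in zip(*points))


def potential(points, center):
    """Sum of squared distances from every point to a single center."""
    return sum(
        sum((xi - ci) ** 2 for xi, ci in zip(x, center)) for x in points
    )


def expected_potential(points, m):
    """Exact expectation over all size-m subsets S of the potential of the
    full point set measured from the centroid of S."""
    subsets = list(combinations(points, m))
    total = sum(potential(points, centroid(s)) for s in subsets)
    return Fraction(total) / len(subsets)


def lemma_factor(n, m):
    """Multiplier 1 + (n - m) / (m (n - 1)) from the closed form above."""
    return 1 + Fraction(n - m, m * (n - 1))


if __name__ == "__main__":
    A = [(0, 0), (1, 3), (4, 1), (2, 2), (5, 7)]
    optimal = potential(A, centroid(A))
    for m in range(1, len(A) + 1):
        print(m, expected_potential(A, m), lemma_factor(len(A), m) * optimal)
```

For $m = 1$ the factor equals 2, matching the single uniformly chosen center of k-means++, and for $m = n$ it equals 1.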


We present several technical lemmas used above.

Lemma 4.2.

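Let $A \subset \mathbb{R}^d$ and $c(A)$ be the centroid of $A$. Let $n = |A|$. For any $z \in \mathbb{R}^d$,

$$\sum_{x \in A}\|x - z\|^2 = \sum_{x \in A}\|x - c(A)\|^2 + n\,\|c(A) - z\|^2.$$

Proof. Expanding $\|x - z\|^2 = \|x - c(A)\|^2 + 2(x - c(A))^\top(c(A) - z) + \|c(A) - z\|^2$ and summing over $x \in A$, the cross term vanishes because $\sum_{x \in A}(x - c(A)) = 0$. This leaves

$$\sum_{x \in A}\|x - z\|^2 = \sum_{x \in A}\|x - c(A)\|^2 + n\,\|c(A) - z\|^2,$$

which was what was wanted. ∎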

Lemma 4.3.

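If $S$ is a subset of $A$ of size $m$ chosen uniformly at random from all subsets of $A$ of size $m$, then

$$E\Big[\sum_{x \in S} x\Big] = \frac{m}{n}\sum_{x \in A} x.$$

Proof. Let $I_x$ be the indicator random variable that is $1$ if $x \in S$ and $0$ otherwise (i.e. $I_x = \mathbb{1}\{x \in S\}$). Observe

$$E\Big[\sum_{x \in S} x\Big] = E\Big[\sum_{x \in A} I_x\,x\Big] = \sum_{x \in A} E[I_x]\,x.$$

Observe $E[I_x] = \frac{m}{n}$, the probability that $x$ is chosen from $A$ for a group of size $m$ from $n$ objects in $A$. The conclusion follows. ∎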

Lemma 4.4.

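If $S$ is a subset of $A$ of size $m$ chosen uniformly at random from all subsets of $A$ of size $m$, then

$$E\Big[\sum_{x \in S}\sum_{y \in S} x^\top y\Big] = \frac{m(m - 1)}{n(n - 1)}\sum_{x \neq y} x^\top y + \frac{m}{n}\sum_{x \in A}\|x\|^2.$$

Proof. Let $I_x$ be the indicator random variable that is $1$ if $x \in S$ and $0$ otherwise (i.e. $I_x = \mathbb{1}\{x \in S\}$). Observe

$$E[I_x I_y] = \begin{cases} \dfrac{m(m - 1)}{n(n - 1)} & \text{if } x \neq y, \\[2mm] \dfrac{m}{n} & \text{if } x = y, \end{cases}$$

since the first case represents the probability that $x$ and $y$ are chosen together and the second case represents the probability that $x$ is chosen, as $I_x^2 = I_x$. The conclusion follows. ∎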
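
The two inclusion probabilities used in these indicator arguments, $\frac{m}{n}$ for a single element and $\frac{m(m - 1)}{n(n - 1)}$ for a pair of distinct elements, can be checked by brute-force enumeration. The sketch below does so for small $n$; the function name is an illustrative choice, and elements 0 and 1 stand in for any fixed $x$ and $y$.

```python
from fractions import Fraction
from itertools import combinations


def inclusion_probabilities(n: int, m: int):
    """Exact probabilities that element 0, and that both elements 0 and 1,
    belong to a uniformly random size-m subset of {0, ..., n-1} (n >= 2)."""
    subsets = list(combinations(range(n), m))
    total = len(subsets)
    p_single = Fraction(sum(1 for s in subsets if 0 in s), total)
    p_pair = Fraction(sum(1 for s in subsets if 0 in s and 1 in s), total)
    return p_single, p_pair


if __name__ == "__main__":
    n, m = 6, 3
    p_single, p_pair = inclusion_probabilities(n, m)
    print(p_single, Fraction(m, n))
    print(p_pair, Fraction(m * (m - 1), n * (n - 1)))
```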

The first result will handle the semi-supervised cluster. Suppose that $C^*$ is the optimal set of cluster centers. Now, we consider the contribution to the potential function of a cluster $A$ from $C^*$ when a center is chosen from $A$ with $D^2$ weighting. If we can prove a good approximation bound, then we can say that conditioned on choosing centers from uncovered clusters, we will have a good result on average.

Lemma 4.5.

Let $C$ be the current (arbitrary) set of cluster centers. Note that $C$ is probably not a subset of $C^*$. Let $A$ be any cluster in $C^*$. Let $a$ be a point from $A$ chosen at random with $D^2$ weighting. Then,

$$E[\phi_{C \cup \{a\}}(A)] \le 8\,\phi_{C^*}(A),$$

where the expectation is over the choice of new center $a$.

Proof. Unchanged from Lemma 3.2 in [5]. ∎

Lemma 4.6.

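Fix a clustering $C$. Suppose there are $u > 0$ uncovered clusters from the optimal clustering $C^*$. Denote the points in these uncovered clusters as $X_u$ (not to be confused with $X^{(u)}$). Let $X_c = X \setminus X_u$ be the set of points in covered clusters. Use $D^2$ weighting (excluding supervised data) to add $t \le u$ new centers to $C$ to form $C'$. In a slight abuse of notation, let $\phi = \phi_C$, $\phi' = \phi_{C'}$, and $\phi^* = \phi_{C^*}$. Then,

$$E[\phi'(X)] \le \left(\phi(X_c) + 8\,\phi^*(X_u)\right)(1 + H_t) + \frac{u - t}{u}\,\phi(X_u),$$

where $H_t = 1 + \frac{1}{2} + \cdots + \frac{1}{t}$ is the harmonic sum.

Proof. We have that the probability of choosing a point from a fixed set $A$ with $D^2$ weighting ignoring supervised points is

$$\frac{\phi(A \cap X^{(u)})}{\phi(X^{(u)})}.$$

Further, note that $X_u \subseteq X^{(u)}$, since all supervised clusters are covered.

Following the argument in [5] using the above probabilities, we have our result. ∎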

Theorem 4.7.

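Suppose our story about how the supervision occurs holds. Let $C = \emptyset$. For each label $i$ that we have supervised exemplars of, add the centroid of the supervised data labeled $i$, say $c_i$, to $C$. Suppose that the supervised labels are $1, \dots, k_s$ with $k_s < k$. Let $m_i$ be the number of supervised exemplars with label $i$ for $i = 1, \dots, k_s$. Then, we have $u = k - k_s$ uncovered clusters. Add $u$ new centers using $D^2$ weighting ignoring the supervised points. The expectation of the resulting potential, $\phi'$, is then

$$E[\phi'(X)] \le 8\left(\ln(k - k_s) + 2\right)\phi^*(X).$$

Proof. Applying Lemma 4.6 with $t = u = k - k_s$, we have

$$E[\phi'(X)] \le \left(E[\phi(X_c)] + 8\,\phi^*(X_u)\right)(1 + H_u).$$

Applying Lemma 4.1 to each supervised optimal cluster $A_i$ of size $n_i$, we have

$$E[\phi(X_c)] \le \sum_{i=1}^{k_s}\left(1 + \frac{n_i - m_i}{m_i(n_i - 1)}\right)\phi^*(A_i).$$

Finally, using the fact that $1 + \frac{n_i - m_i}{m_i(n_i - 1)} \le 2 \le 8$ and $H_u \le 1 + \ln u$, we have our result. ∎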

The end result is a modest improvement over that of [5] that scales with the level of supervision. The final inequality in the proof is tighter than the result stated in the theorem, since the factor of $8$ could be lower depending on the contributions of the supervised clusters in the optimal clustering.
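
To give a feel for how the bound scales with supervision, the sketch below compares the k-means++ factor $8(\ln k + 2)$ from [5] with a semi-supervised factor of the form $8(\ln(k - k_s) + 2)$, where $k_s$ is the number of supervised classes. The second form is an assumption: it is what results from replacing $k$ by the number of uncovered clusters in the argument of [5], and the symbol $k_s$ and the function names are introduced here for illustration only.

```python
import math


def kmeanspp_factor(k: int) -> float:
    """Approximation factor 8 (ln k + 2) for unsupervised k-means++ seeding."""
    return 8.0 * (math.log(k) + 2.0)


def ss_factor(k: int, k_s: int) -> float:
    """Assumed semi-supervised factor 8 (ln(k - k_s) + 2), valid for 0 <= k_s < k."""
    if not 0 <= k_s < k:
        raise ValueError("need 0 <= k_s < k")
    return 8.0 * (math.log(k - k_s) + 2.0)


if __name__ == "__main__":
    k = 24  # number of groups in the Gaussian Mixture experiment
    for k_s in (0, 6, 12, 18, 23):
        print(k_s, round(ss_factor(k, k_s), 3), round(kmeanspp_factor(k), 3))
```

Under this assumed form the factor decreases only logarithmically in the number of uncovered clusters, which is consistent with the improvement being described as modest.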

5 Numerical Experiments

5.1 Performance Measures

We use several measures for each experiment. First, we use the cost, as estimated by the potential function. For comparing to the theoretical bound, we also use the fraction of optimal cost, where “optimal” is derived by taking the centroids for each class as determined by the ground truth labels. Next, we use the number of Lloyd’s iterations until convergence.

Finally, we will use the Adjusted Rand Index (ARI) [14], which is an index that compares how closely two partitions agree. The ARI is the Rand index, the fraction of pairs of points on which two partitions agree, after adjusting for chance. It is essentially chance at 0, meaningless below 0, and perfect at its maximal value, unity. Since ground truth labels are available for our datasets, we can compare them to the partitions yielded from the output of the algorithms in Section 5.3. Thus, a large ARI value indicates good clustering performance as determined by fidelity to the ground truth partition.
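
For concreteness, a small Python sketch of the ARI computed from the contingency counts of two labelings is shown below, following the usual Hubert and Arabie form [14]. The function name and the convention of returning 1.0 when both partitions are trivial are choices made for this illustration; at least two data points are assumed.

```python
from collections import Counter
from math import comb
from typing import Hashable, Sequence


def adjusted_rand_index(
    labels_a: Sequence[Hashable], labels_b: Sequence[Hashable]
) -> float:
    """Adjusted Rand Index between two labelings of the same n >= 2 points."""
    if len(labels_a) != len(labels_b):
        raise ValueError("labelings must have the same length")
    n = len(labels_a)
    # Pairs of points placed together by both partitions, by A, and by B.
    index = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    sum_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = sum_a * sum_b / comb(n, 2)  # agreement expected by chance
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:
        return 1.0  # assumption: degenerate case where both partitions are trivial
    return (index - expected) / (max_index - expected)


if __name__ == "__main__":
    truth = [0, 0, 1, 1]
    print(adjusted_rand_index(truth, [1, 1, 0, 0]))  # same partition, renamed labels
    print(adjusted_rand_index(truth, [0, 1, 0, 1]))  # worse than chance
```

The index depends only on which points are grouped together, so permuting the label names leaves it unchanged.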

5.2 Data

We showcase our algorithm on three datasets (cf. Figure 2 for depiction). The first, Gaussian Mixture, was inspired by both [5, 7]. We drew $k = 24$ centers from a 15-dimensional hypercube of fixed side length. For each center $c$, we drew points from a multivariate Gaussian with mean $c$ and identity covariance. This dataset is remarkable because it is easy to cluster by inspection (at least with larger side length, as in the original papers) yet is difficult for Lloyd’s algorithm when initialized with bad centers. For our chosen side length, it is not easy to cluster by eye. Note that the supervision story (where centroids of the class labels correspond to best centers) is likely to hold for most realizations of the data.
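
A dataset of this kind could be generated along the following lines. This is only a sketch: the side length `side`, the number of points per center `points_per_center`, the seed, and the choice to draw centers uniformly inside the hypercube are all assumptions made for illustration and are not values taken from the experiments.

```python
import random
from typing import List, Tuple


def gaussian_mixture(
    k: int = 24,
    dim: int = 15,
    side: float = 10.0,            # hypothetical side length
    points_per_center: int = 100,  # hypothetical sample size per center
    seed: int = 0,
) -> Tuple[List[Tuple[float, ...]], List[int]]:
    """Draw k centers in a hypercube, then unit-variance Gaussian points
    around each center. Returns the points and their true class labels."""
    rng = random.Random(seed)
    data, labels = [], []
    for label in range(k):
        # Assumption: centers are uniform on [0, side]^dim.
        center = [rng.uniform(0.0, side) for _ in range(dim)]
        for _ in range(points_per_center):
            # Identity covariance: independent N(mu, 1) coordinates.
            data.append(tuple(rng.gauss(mu, 1.0) for mu in center))
            labels.append(label)
    return data, labels


if __name__ == "__main__":
    X, y = gaussian_mixture(points_per_center=5)
    print(len(X), len(X[0]), len(set(y)))
```

Shrinking `side` relative to the unit standard deviation makes the classes overlap more and the clustering problem harder.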

The next two datasets are real data for which the assumption that the labels match up with minimum cost clusters is not met. The second dataset is the venerable Iris dataset [15], which uses four variables to describe three different classes of flowers. While this dataset is old, it is nonetheless difficult for k-means to handle from a clustering standpoint. This fact is widely known; indeed, even the Wikipedia page for the Iris dataset has a rendering of k-means failing on it [16]. We compared the ARI for this dataset and the Gaussian Mixture dataset while varying the ratio of the side length of the hypercube to the standard deviation (with the standard deviation fixed), and we found that the datasets were roughly equivalent for side length around 3.25. This is under one percent of the side length, and a correspondingly tiny fraction of the volume, of the norm25 dataset [7] that our Gaussian Mixture dataset is based on. Thus, we observe that the Iris dataset is harder to cluster than the synthetic dataset.

The third dataset, Hyperspectral, is a Naval Space Command HyMap hyperspectral dataset representing an aerial photo of a runway and background terrain of the airport at Dahlgren, Virginia, as originally seen in [17] (cf. Figure 1 for a depiction of the location). Each pixel in the photo is associated with features representing different spectral bands (e.g. visible and infrared). We took the first six principal components to form a dataset with data in $\mathbb{R}^6$, chosen as the minimum number of dimensions needed to capture a fixed fraction of the total variance. The first two principal components are depicted in Figure 2. The classes are the identities of each pixel (i.e. runway, pine, oak, grass, water, scrub, and swamp). Based on the ARI scores presented in the forthcoming results section, this dataset is only a little easier to cluster than Iris.

Figure 1: Image corresponding to the Hyperspectral dataset as seen in Figure 2 of [17]. Each pixel can be classified according to what it represents.

(a) Gaussian Mixture
(b) Iris
(c) Hyperspectral
Figure 2: First two dimensions of the datasets (one realization for Gaussian Mixture). Because Gaussian Mixture has 13 more dimensions than are shown here, clustering it is considerably easier than this figure would imply. Note, however, that we have overlapping classes (as denoted by the colors) in all datasets.

5.3 Algorithms

For all three datasets, we apply several algorithms: ss-k-means++, Constrained-KMeans, ss-k-means++ (without Lloyd’s), Constrained-KMeans (without Lloyd’s), and the Constrained-KMeans algorithm initialized at the true class centroids. We consider the last of these as an approximation to the optimal solution. Constrained-KMeans and Constrained-KMeans (without Lloyd’s) use a random sample of the unsupervised data weighted uniformly for the remaining initial centers (after using centroids of the labeled points). The algorithms without Lloyd’s use their respective initialization strategy to choose initial centers, then move straight to class assignment without updating the initial centers. We consider these algorithms as “initialization only” methods for this reason.

5.4 Results

We vary the supervision level from 0% to 100%, where we add supervised classes and sample 5/5/50 datapoints per class to label for Gaussian Mixture, Iris, and Hyperspectral, respectively. Note that this is the percent of clusters which have exemplars and not the percent of all points which are labeled. Also, at 100% supervision, ss-k-means++ and Constrained-KMeans are the same, since there are no additional centers to choose. We did not allow the supervised data to change cluster assignment, so the approximation to the optimal can change with the level of supervision and with different supervised data chosen. We set $k$ equal to the true number of groups (24 for Gaussian Mixture, 3 for Iris, and 7 for Hyperspectral). We used 100 Monte Carlo replicates at each level of supervision.
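
One way to draw the supervised subset for a single Monte Carlo replicate is sketched below: a fraction of the classes is selected, and a fixed number of points is labeled in each selected class. The function name, the rounding of the number of supervised classes, and the return format are assumptions made for this illustration.

```python
import random
from typing import Hashable, List, Optional, Sequence, Tuple


def sample_supervision(
    data: List[Sequence[float]],
    labels: List[Hashable],
    class_fraction: float,
    per_class: int,
    rng: Optional[random.Random] = None,
) -> Tuple[list, list, list]:
    """Pick a fraction of the classes, label `per_class` random points in each,
    and return (labeled points, their labels, unlabeled points)."""
    rng = rng or random.Random()
    classes = sorted(set(labels))
    # Assumption: the number of supervised classes is rounded to an integer.
    n_supervised = round(class_fraction * len(classes))
    supervised_classes = rng.sample(classes, n_supervised)
    chosen = set()
    for cls in supervised_classes:
        members = [i for i, lab in enumerate(labels) if lab == cls]
        chosen.update(rng.sample(members, min(per_class, len(members))))
    order = sorted(chosen)
    labeled = [data[i] for i in order]
    labeled_labels = [labels[i] for i in order]
    unlabeled = [x for i, x in enumerate(data) if i not in chosen]
    return labeled, labeled_labels, unlabeled


if __name__ == "__main__":
    pts = [(float(i), 0.0) for i in range(30)]
    labs = [i // 10 for i in range(30)]
    sup, sup_labs, unsup = sample_supervision(pts, labs, 1.0, 5, random.Random(0))
    print(len(sup), len(unsup), sorted(set(sup_labs)))
```

Here `class_fraction` plays the role of the supervision level, so it counts supervised classes rather than labeled points.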

Figure 3 shows the cost as the level of supervision changes. We observe the cost decreases with more supervision. Also, we see the same relative performances of the algorithms, with the ++ version outperforming the benchmark. Observe that the approximation to the optimal solution is the best. Figure 4 depicts the theoretical bound. All algorithms are below the bound (in expectation).

(a) Gaussian Mixture
(b) Iris
(c) Hyperspectral
Figure 3: Cost (value of the potential) shown as a function of the level of supervision for 100 Monte Carlo replicates. Shading indicates two standard deviations. Colors indicate algorithm:
gold: Constrained-KMeans (without Lloyds iterations);
blue: ss-k-means++ (without Lloyds iterations);
red: Constrained-KMeans;
green: ss-k-means++; and
pink: Constrained-KMeans initialized at true centroids of labels.
(a) Gaussian Mixture
(b) Iris
(c) Hyperspectral
Figure 4: Fractional cost (value of the potential over an estimate of the optimal) plotted as a function of the level of supervision for 100 Monte Carlo replicates. Shading around the lines indicates two standard deviations. The shaded region is the region corresponding to the theoretical cost in expectation from Section 4. Colors indicate algorithm:
gold: Constrained-KMeans (without Lloyds iterations);
blue: ss-k-means++ (without Lloyds iterations);
red: Constrained-KMeans;
green: ss-k-means++; and
pink: Constrained-KMeans initialized at true centroids of labels.

Figure 5 shows the number of iterations before Lloyd’s algorithm converges. We can see that improved selection of the initial centers by weighted randomization leads to fewer iterations before convergence. We expected this; Arthur and Vassilvitskii [5] observed a similar phenomenon with no supervision. More supervision did not seem to affect the number of iterations until very high levels (near 100%). For the real world datasets, we can see that the approximation to the optimal algorithm required more than one iteration to converge, indicating that the centroids of the true class labels do not match the locally minimal cost solutions. This means that the conditions for the supervision in our proofs do not hold for these datasets. Nevertheless, both cost and ARI improve with additional supervision.

(a) Gaussian Mixture
(b) Iris
(c) Hyperspectral
Figure 5: Lloyd’s iterations before convergence plotted as a function of the level of supervision for 100 Monte Carlo replicates. Shading indicates two standard deviations. Colors indicate algorithm:
gold: Constrained-KMeans (without Lloyds iterations);
blue: ss-k-means++ (without Lloyds iterations);
red: Constrained-KMeans;
green: ss-k-means++; and
pink: Constrained-KMeans initialized at true centroids of labels.

Figure 6 shows the ARI for all algorithms. Note that supervision improves the ARI, as expected. Also, ss-k-means++ generally outperforms Constrained-KMeans. The same observation holds for the initialization only versions as well. Remarkably, Lloyd’s algorithm initialized at the true centroids is outperformed by the initialization only methods on the Iris and Hyperspectral datasets at 100% supervision for the ARI metric. This is because the true classes do not correspond to the minimum cost solution, which is what Lloyd’s iterations would improve (apparently at the cost of ARI).

(a) Gaussian Mixture
(b) Iris
(c) Hyperspectral
Figure 6: Average ARI shown as a function of the level of supervision for 100 Monte Carlo replicates. Shading indicates two standard deviations. Green beats red. Colors indicate algorithm:
gold: Constrained-KMeans (without Lloyds iterations);
blue: ss-k-means++ (without Lloyds iterations);
red: Constrained-KMeans;
green: ss-k-means++; and
pink: Constrained-KMeans initialized at true centroids of labels.

6 Conclusions

In this paper, we present a natural extension of k-means++ and Constrained-KMeans. Then, we prove the corresponding bound on the expectation of the cost under some conditions on the supervision. No assumptions are made about the distribution of the data. Finally, we demonstrate on three datasets that judicious supervision and good starting center selection heuristics improve clustering performance, cost, and iteration count.

Possible future theoretical work includes incorporating the advances set forth in the extensions to the original k-means++ paper. For example, we could produce semi-supervised versions of k-means# [6] and k-means|| [7] with commensurately improved bounds. Relaxing the constraints to the pairwise cannot-link and must-link constraints as in [9] is also desirable, because the assumption of exogenously provided hard labels is often untenable. It would also be worthwhile to relax the assumptions of equal cluster shape and cluster volume that are implicit in k-means clustering.


The authors would like to thank Theodore D. Drivas for helping to test the codes used in the experiments and for consultation on aesthetics.


This work is partially funded by the National Security Science and Engineering Faculty Fellowship (NSSEFF), the Johns Hopkins University Human Language Technology Center of Excellence (JHU HLT COE), and the XDATA program of the Defense Advanced Research Projects Agency (DARPA) administered through Air Force Research Laboratory contract FA8750-12-2-0303.


  • [1] Forgy E. Cluster analysis of multivariate data: Efficiency vs. interpretability of classification. Biometrics. 1965;21(3):768–769.
  • [2] MacQueen J, et al. Some methods for classification and analysis of multivariate observations. In: Proceedings of the fifth Berkeley symposium on mathematical statistics and probability; Vol. 1. Oakland, CA, USA.; 1967. p. 281–297.
  • [3] Lloyd SP. Least squares quantization in PCM. Information Theory, IEEE Transactions on. 1982;28(2):129–137.
  • [4] Mahajan M, Nimbhorkar P, Varadarajan K. The planar k-means problem is NP-hard. Theoretical Computer Science. 2012;442:13–21.
  • [5] Arthur D, Vassilvitskii S. k-means++: The advantages of careful seeding. In: Proceedings of the eighteenth annual ACM-SIAM symposium on Discrete algorithms. Society for Industrial and Applied Mathematics; 2007. p. 1027–1035.
  • [6] Ailon N, Jaiswal R, Monteleoni C. Streaming k-means approximation. In: Advances in Neural Information Processing Systems; 2009. p. 10–18.
  • [7] Bahmani B, Moseley B, Vattani A, Kumar R, Vassilvitskii S. Scalable k-means++. Proceedings of the VLDB Endowment. 2012;5(7):622–633.
  • [8] Aggarwal A, Deshpande A, Kannan R. Adaptive sampling for k-means clustering. In: Approximation, randomization, and combinatorial optimization: Algorithms and techniques. Springer; 2009. p. 15–28.
  • [9] Wagstaff K, Cardie C, Rogers S, Schrödl S, et al. Constrained k-means clustering with background knowledge. In: Proceedings of the Eighteenth International Conference on Machine Learning; Vol. 1; 2001. p. 577–584.
  • [10] Basu S, Banerjee A, Mooney R. Semi-supervised clustering by seeding. In: Proceedings of 19th International Conference on Machine Learning. Citeseer; 2002.
  • [11] Basu S, Bilenko M, Mooney RJ. A probabilistic framework for semi-supervised clustering. In: Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining. ACM; 2004. p. 59–68.
  • [12] Dempster AP, Laird NM, Rubin DB. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society Series B (Methodological). 1977;39(1):1–38.
  • [13] Finley T, Joachims T. Supervised k-means clustering. Cornell University; 2008. Report No.: 1813-11621.
  • [14] Hubert L, Arabie P. Comparing partitions. Journal of Classification. 1985;2(1):193–218.
  • [15] Fisher RA. The use of multiple measurements in taxonomic problems. Annals of Eugenics. 1936;7(2):179–188.
  • [16] Wikipedia. Iris flower data set - Wikipedia, the free encyclopedia. 2015 12; Available from:
  • [17] Priebe CE, Marchette DJ, Healy Jr DM. Integrated sensing and processing for statistical pattern recognition. Modern Signal Processing. 2004;46:223.