Semi-Supervised Algorithms for Approximately Optimal and Accurate Clustering

03/02/2018 · Buddhima Gamlath, et al. · EPFL

We study k-means clustering in a semi-supervised setting. Given an oracle that returns whether two given points belong to the same cluster in a fixed optimal clustering, we investigate the following question: how many oracle queries are sufficient to efficiently recover a clustering that, with probability at least (1 - δ), simultaneously has a cost of at most (1 + ϵ) times the optimal cost and an accuracy of at least (1 - ϵ)? We show how to achieve such a clustering on n points with O((k^2 n) · m(Q, ϵ^4, δ/(k n))) oracle queries, when the k clusters can be learned with an ϵ' error and a failure probability δ' using m(Q, ϵ',δ') labeled samples, where Q is the set of candidate cluster centers. We show that m(Q, ϵ', δ') is small both for k-means instances in Euclidean space and for those in finite metric spaces. We further show that, for the Euclidean k-means instances, we can avoid the dependency on n in the query complexity at the expense of an increased dependency on k: specifically, we give a slightly more involved algorithm that uses O( k^4/(ϵ^2 δ) + (k^9/ϵ^4) (1/δ) + k · m(Q, ϵ^4/k, δ)) oracle queries. Finally, we show that the number of queries required for (1 - ϵ)-accuracy in Euclidean k-means must linearly depend on the dimension of the underlying Euclidean space, whereas, for finite metric space k-means, this number must at least be logarithmic in the number of candidate centers. This shows that our query complexities capture the right dependencies on the respective parameters.


1 Introduction

Clustering is a fundamental problem that arises in many learning tasks. Given a set X of data points, the goal is to output a k-partition of X according to some optimization criterion. In unsupervised clustering, the data points are unlabeled. The classic k-means problem and other well-studied clustering problems such as k-median fall into this category.

In a general k-means clustering problem, the input comprises a finite set X of points that is to be clustered, a set of candidate centers Q, and a distance metric d giving the distances between each pair of points in X ∪ Q. The goal is to find k cluster centers c_1, …, c_k ∈ Q that minimize the cost, which is the sum of squared distances between each point in X and its closest cluster center. In this case, the clustering C_1, …, C_k is defined by assigning each point to its closest cluster center and breaking ties arbitrarily. Two widely studied special cases are the k-means problem in Euclidean space (where X ⊆ ℝ^r, Q = ℝ^r, and d is the Euclidean distance function) and the k-means problem in finite metric spaces (where (X ∪ Q, d) forms a finite metric space).
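As a concrete illustration of the objective (a minimal sketch of ours, not code from the paper; all names are our own), the following Python snippet computes the k-means cost of a candidate set of centers and the induced clustering:

import numpy as np

def kmeans_cost(X, centers):
    # Pairwise squared Euclidean distances between points and centers: shape (n, k).
    sq_dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    labels = sq_dists.argmin(axis=1)                  # assign each point to its closest center
    cost = sq_dists[np.arange(len(X)), labels].sum()  # sum of squared distances to assigned centers
    return cost, labels

# Tiny example: three well-separated groups in the plane and their natural centers.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.1, size=(20, 2)) for c in [(0, 0), (5, 0), (0, 5)]])
cost, labels = kmeans_cost(X, np.array([[0.0, 0.0], [5.0, 0.0], [0.0, 5.0]]))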

Despite its popularity and success in many settings, there are two known drawbacks of the unsupervised k-means problem:

  1. Finding the centers that satisfy the clustering goal is computationally hard. For example, even the special case of the k-means problem in Euclidean space is NP-hard [9].

  2. There could be multiple possible sets of centers that minimize the cost. However, in practical instances, not all such sets are equally meaningful, and we would like our algorithm to find one that corresponds to the concerns of the application.

Since k-means is NP-hard, it is natural to seek approximation algorithms. For the general k-means problem in Euclidean space, notable approximation results include the local search of Kanungo et al. [15] with an approximation guarantee of 9 + ε and the recent LP-based 6.357-approximation algorithm of Ahmadian et al. [1]. On the negative side, Lee et al. [16] ruled out arbitrarily good approximation algorithms for the k-means problem on general instances. For several special cases, however, there exist PTASes. For example, in the case where k is constant, Har-Peled and Mazumdar [12] and Feldman et al. [10] showed how to get a PTAS using weak coresets, and in the case where the dimension is constant, Cohen-Addad et al. [7] and Friggstad et al. [11] gave PTASes based on a basic local search algorithm. In addition, Awasthi et al. [4] presented a PTAS for k-means, assuming that the input is "clusterable" (satisfies a certain stability criterion).

Even if we leave aside the computational issues with unsupervised k-means, we still have the problem that there can be multiple different clusterings that minimize the cost. To see this, consider the 2-means problem on the set of vertices of an equilateral triangle. In this case, we have three different clusterings that give the same minimum cost, but only one of the clusterings might be meaningful. One way to avoid this issue is to impose strong assumptions on the input. For example, Balcan et al. [5] considered the problem in a restricted setting where any c-approximation to the cost also classifies at least a (1 − ε) fraction of the points correctly.

Ashtiani et al. [3] recently proposed a different approach for addressing the aforementioned drawbacks. They introduced a semi-supervised active clustering framework where the algorithm is allowed to make queries of the form same-cluster(x, y) to a domain expert, and the expert replies whether the points x and y belong to the same cluster in some fixed optimal clustering. Under the additional assumption that the clusters are contained inside balls in Euclidean space that are sufficiently far away from each other, they presented a polynomial-time algorithm that uses same-cluster queries and recovers the clusters with probability at least 1 − δ. Their algorithm finds approximate cluster centers, orders all points by their distances to the cluster centers, and performs binary searches to determine the radii of the balls. Although it recovers the exact clusters, this approach works only when the clusters are contained inside well-separated balls. When the clusters are determined by a general Voronoi partitioning, and thus distances to the cluster boundaries can differ in different directions, this approach fails.

A natural question arising from the work of Ashtiani et al. [3] is whether such strong assumptions on the input structure are necessary. Ailon et al. [2] addressed this concern and considered the problem without any assumptions on the structure of the underlying true clusters. Their main result was a polynomial-time (1 + ε)-approximation scheme for k-means in the same semi-supervised framework as in Ashtiani et al. [3]. However, in contrast to Ashtiani et al. [3], their work gives no assurance on the accuracy of the recovered clustering compared to the true clustering. To achieve their goal, the authors used importance sampling to obtain uniform samples from small clusters that contribute significantly to the cost. Their algorithm uses same-cluster queries, runs in polynomial time, and succeeds with constant probability.

In this work, we investigate the k-means problem in the same semi-supervised setting as Ailon et al. [2], but in addition to approximating the cost, we seek a solution that is also accurate with respect to the true clustering. We assume that the underlying true clustering minimizes the k-means cost, and that there are no points on cluster boundaries (i.e., the margin between each pair of clusters can be arbitrarily small but not zero). This last assumption is what differentiates our setup from that of Ailon et al. [2]. It is reasonable to assume that no point lies on the boundary of two clusters, as otherwise, to achieve constant accuracy, we would have to query at least a constant fraction of the boundary points. Without querying each boundary point, we have no way of determining to which cluster it belongs.

Observe that if we label all the points correctly with respect to the true clustering, the resulting clustering automatically achieves the optimal cost. However, such perfect accuracy is difficult to achieve as there may be points that are arbitrarily close to each other but belong to different clusters. Using only a reasonable number of samples, the best we can hope for is to recover an approximately accurate solution. PAC (Probably Approximately Correct) learning helps us achieve this goal and provides a trade-off between the desired accuracy and the required number of samples.

Suppose that we have a solution where only a small fraction of the input points is incorrectly classified. In this case, one would hope that the cost is also close to the optimal cost. Unfortunately, the extra cost incurred by the incorrectly classified points can be very high depending on their positions, true labels, and the labels assigned to them. Our main concern in this paper is controlling this additional cost.

We show that if we start with a constant-factor approximation for the cost, we can refine the clustering using a PAC learning algorithm. This yields a simple polynomial-time algorithm that, given a k-means instance and parameters ε and δ, with probability at least 1 − δ outputs a clustering that has a cost of at most (1 + ε) times the optimal cost and that classifies at least a (1 − ε) fraction of the points correctly with respect to the underlying true clustering. To do so, the algorithm makes a number of same-cluster queries that depends on k, on log n, and on m(Q, ε^4, δ/(kn)). Here, m(Q, ε′, δ′) is the number of labeled samples sufficient for a PAC learning algorithm to learn the k clusters of a k-means instance with error ε′ and failure probability δ′ in the supervised setting (recall that Q is the set of candidate centers). We further show that our algorithm can be easily adapted to k-median and other similar problems that use the p-th power of distances in place of squared distances for some fixed p ≥ 1. We formally present this result as Theorem 11 in Section 3. In Theorem 1 below, we give an informal statement for the case of k-means.

Theorem 1 (An informal version of Theorem 11).

There exists a semi-supervised learning algorithm that, given a k-means instance, oracle access to same-cluster queries that are consistent with some fixed optimal clustering, and parameters ε, δ > 0, outputs a clustering that, with probability at least 1 − δ, correctly labels (up to a permutation of the labels) at least a (1 − ε) fraction of the points and, simultaneously, has a cost of at most (1 + ε) times the optimal cost. In doing so, the algorithm makes a number of same-cluster queries that is quantified in Theorem 11 in terms of the sample complexity of the underlying PAC learner.

Our algorithm is general and applicable to any family of k-means, k-median, or similar distance-based clustering instances that can be efficiently learned with PAC learning. As shown in Appendix A, these include Euclidean and general finite metric space clustering instances. In contrast, both Ashtiani et al. [3] and Ailon et al. [2] considered only the Euclidean k-means problem. To the best of our knowledge, ours is the first such result applicable to finite metric space k-means and to both Euclidean and finite metric space k-median problems.

Ideally, we want m(Q, ε, δ) to be small. Additionally, the analysis of our algorithm relies on two natural properties of learning algorithms. First, we require the PAC learner to always correctly label all the sampled points. Second, we require it not to 'invent' new labels and to output only labels that it has seen on the samples. We show that learning algorithms with these properties and small m(Q, ε, δ) exist both for k-means instances in Euclidean space and for those in finite metric spaces with no points on the boundaries of the optimal clusters. For Euclidean k-means, m has a linear dependency on the dimension of the space. For the case of finite metric spaces, m has a logarithmic dependency on |Q|, the size of the set of candidate centers. In fact, these learning algorithms are applicable not only to k-means instances but also to instances of other similar center-based clustering problems (where clusters are defined by assigning points to their closest cluster centers).

Our semi-supervised learning algorithm is inspired by the work of Feldman et al. [10] on weak coresets. Their construction of weak coresets first obtains an intermediate clustering using a constant-factor approximation algorithm and then refines each intermediate cluster by taking random samples. In order to get a good guarantee for the cost, their algorithm partitions each cluster into an inner ball that contains the majority of the points and an outer region that contains the remaining points. We proceed similarly to this construction; however, we further partition the outer region into concentric rings and use PAC learning to label the points in the inner ball and in each of the outer rings separately. For Euclidean k-means instances, the number of same-cluster queries needed by the algorithm has a logarithmic dependency on the number of points, which is similar (up to a factor) to that of the algorithm by Ashtiani et al. [3]. The advantage of our algorithm is that it works for a much broader range of k-means instances, whereas the applicability of the algorithm of Ashtiani et al. [3] is restricted to those instances whose clusters are contained in well-separated balls in Euclidean space.

This algorithm is effective in many natural scenarios where the number of clusters k is larger than log n. However, as the size of the k-means instance (i.e., the number of points n) becomes large, the log n factor becomes undesirable. In Euclidean k-means, the number of samples needed by the learning algorithm for an error ε and a failure probability δ does not depend on n. The log n dependency in the final query complexity is exclusively due to repeating the PAC learning step on different partitions of X. To overcome this problem, we present a second algorithm, which is applicable only to Euclidean k-means instances, inspired by the work of Ailon et al. [2]. This time, we start with a (1 + ε)-approximation for the cost and refine it using PAC learning. Unlike our first algorithm, we run the PAC learning step only once on the whole input, and thus we completely eliminate the dependency on n. The disadvantages of this algorithm compared to our first algorithm are its slightly more involved nature and the increased dependency on k in its query complexity. Theorem 2 below formally states this result. The proof follows from the analysis of our algorithm in Section 4.

Theorem 2.

There exists a polynomial-time algorithm that, given a k-means instance in r-dimensional Euclidean space, oracle access to same-cluster queries that are consistent with some fixed optimal clustering, and parameters ε, δ > 0, outputs a clustering that, with probability at least 1 − δ, correctly labels (up to a permutation of the labels) at least a (1 − ε) fraction of the points and, simultaneously, has a cost of at most (1 + ε) times the optimal cost. The number of same-cluster queries made by the algorithm is independent of the number of points and depends only on k, ε, δ, and the sample complexity m(Q, ε^4/k, δ).

For the Euclidean setting, the query complexities of both our algorithms have a linear dependency on the dimension of the Euclidean space. The algorithm of Ashtiani et al. [3] does not have such a dependency because of their strong assumption on the cluster structure, whereas the one by Ailon et al. [2] avoids it because it only approximates the cost. We show that, in our scenario, such a dependency is necessary to achieve the accuracy guarantees of our algorithms. For finite metric space k-means, the query complexity of our general algorithm has a log n dependency. This dependency comes from the repeated application of the learning algorithm on different partitions, and whether we can avoid it is an open problem. However, we show that an Ω(log |Q|) query complexity is necessary for the accuracy guarantee. Formally, we prove the following theorem in Section 5.

Theorem 3.

Let F be a family of k-means instances. Let A be an algorithm that, given a k-means instance in F, oracle access to same-cluster queries for some fixed optimal clustering, and parameters ε, δ > 0, outputs a clustering that, with probability at least 1 − δ, correctly labels (up to a permutation of the cluster labels) at least a (1 − ε) fraction of the points. Then, the following statements hold:

  1. If F is the family of k-means instances in r-dimensional Euclidean space that have no points on the boundaries of optimal clusters, then A must make a number of same-cluster queries that grows at least linearly with the dimension r.

  2. If F is the family of finite metric space k-means instances that have no points on the boundaries of optimal clusters, then A must make Ω(log |Q|) same-cluster queries.

The outline of this extended abstract is as follows. In Section 2, we introduce the notation, formulate the problem, and present the learning theorems that we use in the subsequent sections. In Section 3, we present our first algorithm, which is simple and applicable to general k-means instances that admit efficient learning algorithms, but whose query complexity depends logarithmically on the number of points n. In Section 4, we discuss how to remove this dependency on n for the special case of Euclidean k-means instances and present our second algorithm. In Section 5, we prove the query lower bounds claimed in Theorem 3. In Appendix A, we introduce the basic concepts and tools of PAC learning and explain how to design learning algorithms for Euclidean and finite metric space k-means instances.

2 Preliminaries

In this section, we introduce the basic notation and two common families of k-means instances, and formally define the k-means problem that we address in this work. We also introduce the notion of learnability for families of k-means instances and state two learning theorems that will be used in the later sections.

2.1 k-Means Problem in a Semi-supervised Setting

Let X and Q be two sets of points, and let d be a distance metric on X ∪ Q. We denote a k-means instance by the triple (X, Q, d). Two common families of k-means instances we consider in this work are:

  1. k-means instances in Euclidean space, where X ⊆ ℝ^r, Q = ℝ^r, and d(x, y) = ‖x − y‖ is the Euclidean distance between x and y, and

  2. k-means instances in finite metric spaces, where (X ∪ Q, d) forms a finite metric space.

Let [k] = {1, …, k}. We identify a k-clustering C of X by a labeling function ℓ : X → [k] and a set of centers, c_1, …, c_k ∈ Q, one associated with each label. For each label i ∈ [k] of a clustering C, let C_i = {x ∈ X : ℓ(x) = i} be the set of points whose label is i. For convenience, we may use the labeling function ℓ or the set of clusters {C_1, …, C_k} interchangeably to denote a clustering C.

For a subset S ⊆ X and a point c ∈ Q, define cost(S, c) = Σ_{x ∈ S} d(x, c)^2. For each i ∈ [k], define the center c_i = argmin_{c ∈ Q} cost(C_i, c), i.e., each center is a point in Q that minimizes the sum of squared distances between itself and the points assigned to it. For a k-clustering C, we define its k-means cost as cost(C) = Σ_{i ∈ [k]} cost(C_i, c_i). Let 𝒞_k be the set of all k-clusterings of X. Then, the optimal k-means cost of (X, Q, d) is defined as OPT = min_{C ∈ 𝒞_k} cost(C). We say that a k-clustering C α-approximates the k-means cost if cost(C) ≤ α · OPT.

Let C* be a fixed k-clustering of X that achieves the optimal k-means cost, and let C be any k-clustering of X. Let ℓ* and ℓ be the labeling functions that correspond to C* and C, respectively. We assume that we have oracle access to the labeling function ℓ* of the optimal target clustering up to a permutation of the labels. We can simulate a single query to such an oracle with at most k queries to a same-cluster oracle, as explained in Algorithm 1. A same-cluster oracle answers queries same-cluster(x, y) with 'yes' or 'no' based on whether x and y belong to the same cluster in the fixed optimal clustering C*.

The error of a clustering C with respect to the clustering C* for a k-means instance (X, Q, d) is now defined as err(C, C*) = min_π |{x ∈ X : π(ℓ(x)) ≠ ℓ*(x)}|, where the minimization is over all permutations π of [k]. In other words, err(C, C*) is the minimum number of points incorrectly labeled by the clustering C with respect to the optimal clustering C*, considering all possible permutations of the cluster labels. The reason for defining the error in this manner is that we use a simulated version of ℓ* (which is only accurate up to a permutation of the cluster labels) instead of the true ℓ* to learn cluster labels. We say that a k-clustering C is (1 − ε)-accurate with respect to C* if err(C, C*) ≤ ε|X|.
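To make the error measure concrete, here is a small Python sketch (ours, not from the paper) that computes err(C, C*) by brute force over all permutations of the k labels; it is meant only for small k.

from itertools import permutations

def clustering_error(pred, true, k):
    # pred[i] and true[i] are the labels (in 0..k-1) of point i under C and C*.
    # Returns the minimum number of mismatches over all permutations of the labels.
    best = len(pred)
    for perm in permutations(range(k)):
        mismatches = sum(1 for p, t in zip(pred, true) if perm[p] != t)
        best = min(best, mismatches)
    return best

# A clustering is (1 - eps)-accurate if clustering_error(pred, true, k) <= eps * len(pred).
print(clustering_error([0, 0, 1, 1, 2], [1, 1, 2, 2, 0], k=3))  # prints 0: same clustering up to relabeling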

Input : A point x, oracle access to the same-cluster oracle.
Output : A label for x.
Global : A list of representative points p_1, …, p_l, one per cluster seen so far (initially empty).
1 for i = 1, …, l do
2       if same-cluster(x, p_i) = yes then
3            Return i
4      
5 Append x to the list as p_{l+1}.
6 Return l + 1.
Algorithm 1 Simulating a labeling oracle with the same-cluster oracle.
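For concreteness, the following Python sketch (ours; the oracle is modeled as a callable, and all names are placeholders) mirrors Algorithm 1:

def make_label_oracle(same_cluster):
    # same_cluster(a, b) -> bool answers whether a and b lie in the same optimal cluster.
    representatives = []                      # one stored point per cluster label seen so far

    def label(x):
        for i, p in enumerate(representatives):
            if same_cluster(x, p):            # at most k same-cluster queries per call
                return i
        representatives.append(x)             # x is the first point seen from a new cluster
        return len(representatives) - 1

    return label

# Toy example: the ground-truth cluster of an integer is its value mod 3.
label = make_label_oracle(lambda a, b: a % 3 == b % 3)
print([label(x) for x in [0, 1, 2, 3, 4, 5]])  # labels are consistent up to a permutation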

Given (X, Q, d), parameters ε and δ, and oracle access to ℓ* (simulated via same-cluster queries), our goal is to output a k-clustering C of X that, with probability at least 1 − δ, satisfies cost(C) ≤ (1 + ε) · OPT and err(C, C*) ≤ ε|X|.

2.2 PAC Learning for k-Means

Let F be a family of k-means instances, and let m be a positive integer-valued function. We say such a family is learnable with sample complexity m if there exists a learning algorithm A_L such that the following holds: Let ε be an error parameter and let δ be a failure probability parameter. Let (X, Q, d) be a k-means instance that belongs to F. Let C* be a fixed optimal k-means clustering and let ℓ* be the associated labeling function. Let Y be a fixed subset of X, and let S be a multiset of at least m(Q, ε, δ) independently and uniformly distributed samples from Y. The algorithm A_L, given as input S and ℓ*(x) for all x ∈ S, outputs a function h : Y → [k]. Moreover, with probability at least 1 − δ over the choice of S, the output h agrees with ℓ* on at least a (1 − ε) fraction of the points in Y (i.e., |{x ∈ Y : h(x) = ℓ*(x)}| ≥ (1 − ε)|Y|). This simpler notion of learnability is sufficient for the purpose of this work, although it deviates from general PAC learnability, which concerns samples drawn from arbitrary distributions.

We say that such a learning algorithm has the zero sample error property if its output h assigns the correct label to all the sampled points (i.e., h(x) = ℓ*(x) for all x ∈ S). Furthermore, we say that such a learning algorithm is non-inventive if it does not 'invent' labels that it has not seen. This means that the output h does not assign labels that were not present in the input (sample, label) pairs (i.e., if h(x) = t for some x ∈ Y, then ℓ*(x′) = t for some sample point x′ ∈ S).
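In symbols (our restatement, using the sample multiset S, the target labeling ℓ*, and the learner's output h introduced above), the two properties read:

h(x) = ℓ*(x) for all x ∈ S (zero sample error), and
{h(x) : x ∈ Y} ⊆ {ℓ*(x) : x ∈ S} (non-inventiveness).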

In Section 3, we present a simple algorithm for (1 + ε)-approximate and (1 − ε)-accurate k-means clustering for a family F of k-means instances, assuming that F is learnable with a zero sample error, non-inventive learning algorithm. In the analysis, the zero sample error and non-inventive properties play a key role in the crucial step of bounding the cost of incorrectly labeled points in terms of that of correctly labeled nearby points.

We now present two learning theorems, one for the Euclidean setting and one for the finite metric space setting. Assuming no point lies on a cluster boundary, the theorems state that the labeling function of the optimal clustering is learnable with a zero sample error, non-inventive learning algorithm in both settings. We say that a k-means instance has no boundary points if, in any optimal clustering with clusters C*_1, …, C*_k and respective centers c*_1, …, c*_k, the closest center to any given point is unique (i.e., if x ∈ C*_i, then d(x, c*_i) < d(x, c*_j) for all j ≠ i).

Theorem 4 (Learning k-Means in Euclidean Space).

Let d be the Euclidean distance function, and let F be the family of k-means instances that lie in r-dimensional Euclidean space and have no boundary points. The family F is learnable with a sample complexity m(Q, ε, δ) that is linear in the dimension r (up to logarithmic factors).

Theorem 5 (Learning k-Means in Finite Metric Spaces).

Let (X ∪ Q, d) be a finite metric space, and let F be the family of finite metric space k-means instances that have no boundary points. The family F is learnable with a sample complexity m(Q, ε, δ) that is logarithmic in |Q|, the number of candidate centers (up to lower-order factors).

We prove Theorem 4 and Theorem 5 in Appendix A, where we also introduce the necessary PAC learning concepts and tools.

3 A Simple Algorithm for Cost and Accuracy

Let F be a family of k-means instances that is learnable with sample complexity m using a zero sample error, non-inventive learning algorithm A_L. Let A_C be a constant-factor approximation algorithm (in terms of cost) for k-means, and let A_1 be a polynomial-time algorithm for the 1-means problem (i.e., given S ⊆ X, A_1 finds argmin_{c ∈ Q} cost(S, c) in polynomial time). We present a simple semi-supervised learning algorithm that, given a k-means instance of class F and oracle access to the labeling function ℓ* of a fixed optimal clustering C* of X, outputs a clustering C that, with probability at least 1 − δ, satisfies cost(C) ≤ (1 + ε) · OPT and err(C, C*) ≤ ε|X|. Our algorithm uses A_L, A_C, and A_1 as subroutines, and the number of oracle queries it makes depends on k, on log n, and on the sample complexity m. We show that our algorithm can be easily modified for (1 + ε)-approximate and (1 − ε)-accurate k-median and other similar distance-based clustering problems. Towards the end of this section, we discuss several applications of this result, namely, to Euclidean and finite metric space k-means and k-median problems.

Let us start by applying the learning algorithm A_L to the whole input to learn all the cluster labels. If we get perfect accuracy, the cost will be optimal. A natural question to ask in this case is: what happens to the cost if the learning output has some error? In general, even a single misclassified point can incur an arbitrarily large additional cost. To better understand this, consider the following: Let C*_i and C*_j be two distinct optimal clusters in the target clustering, and let c*_i, c*_j be their respective cluster centers. Let x ∈ C*_i be a point that is incorrectly classified and assigned label j by A_L. Also assume that the number of misclassified points is small enough so that the centers of the clusters output by the learning algorithm are close to those of the optimal clustering. Thus, in the optimal clustering, x incurs a cost of d(x, c*_i)^2, whereas according to the learning outcome, x incurs a cost that is close to d(x, c*_j)^2. In the worst case, d(x, c*_j)^2 can be arbitrarily larger than d(x, c*_i)^2.

Now suppose that, within a small distance of x, there exists some point y ∈ C*_j. In this case, we can bound the cost incurred due to the erroneous label of x using the true cost of y in the target clustering. More specifically, using the triangle inequality, which holds in any metric space, we get d(x, c*_j) ≤ d(x, y) + d(y, c*_j). Furthermore, by the optimality of the target clustering, d(y, c*_j)^2 is exactly the cost that y pays in that clustering. Hence, the cost incurred by the mislabeled point x can be charged to its distance to y plus the true cost of y. To utilize this observation in an algorithmic setting, we need to make sure that, for every point x that is misclassified into cluster j, there exists a correctly classified nearby point y that belongs to the optimal cluster C*_j. Luckily, this is ensured by the combination of the zero sample error and non-inventive properties of A_L. If a point is misclassified into cluster j, the non-inventive property says that A_L must have seen a sample point y from cluster C*_j. The zero sample error property ensures that y is labeled correctly by A_L. To make sure that such correctly labeled points are sufficiently close to their incorrectly labeled counterparts, we run A_L separately on certain suitably bounded partitions of X.
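In symbols (a sketch under the assumption that the nearby point y belongs to the optimal cluster C*_j with center c*_j and lies within some distance D of the misclassified point x):

d(x, c*_j)^2 ≤ (d(x, y) + d(y, c*_j))^2 ≤ 2 d(x, y)^2 + 2 d(y, c*_j)^2 ≤ 2 D^2 + 2 d(y, c*(y))^2,

so the cost charged to x is at most twice D^2 plus twice the cost that y already pays in the target clustering.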

The formal description of our algorithm is given in Algorithm 2. The outline is as follows: First, we run A_C on X and obtain an intermediate clustering X_1, …, X_k. For each i, we run A_1 on X_i to find a suitable center c_i. Next, we partition each intermediate cluster X_i into an inner ball and outer rings centered around c_i. We run the learning algorithm A_L separately on each of these partitions. We choose the inner and outer radii of the rings so that, in each partition, the points that are incorrectly classified by the learning algorithm only incur a small additional cost compared to that of the correctly classified points. The final output is a clustering that is consistent with the learning outputs on each of the partitions. For each output cluster, we associate the output of running A_1 on it as its center. Note that, due to the accuracy requirements, the cluster center to which a point is assigned in the output may not be the cluster center closest to that point in the output. It remains an interesting problem to find an accurate clustering in which every point is always assigned to its nearest cluster center.

Input :  k-means instance (X, Q, d), oracle access to ℓ*, constant-factor approximation algorithm A_C for k-means, 1-means algorithm A_1, zero sample-error, non-inventive learning algorithm A_L with sample complexity m, accuracy parameter ε, and failure probability δ.
Output :  The clustering defined by the labeling ℓ computed below. The respective cluster centers can be found by running A_1 on each output cluster.
1 Let , and let .
2 Run A_C on X and obtain an α-approximate k-means clustering X_1, …, X_k. For each i, run A_1 on X_i and find centers c_i.
3 for i = 1, …, k do
4       Let .
5       Let be all points in that are at most away from .
6       Let be the points in that are between and away from for .
7       Let .
8       for each non-empty  do
9             Sample points independently and uniformly at random.
10             Query the oracle on and let .
11             Run A_L on the sampled points and their oracle labels, and obtain a labeling.
12      
13Output the clustering defined by the following labeling function:
14 for each , ,  do
15       Set .
Algorithm 2 A simple algorithm for (1 + ε)-approximate, (1 − ε)-accurate k-means.
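The partitioning step can be pictured with the following minimal Python sketch (ours; the geometric radius schedule shown is only an illustrative stand-in for the radii prescribed in the algorithm):

import numpy as np

def partition_into_rings(points, center, r0, num_rings):
    # Inner ball of radius r0, then concentric rings with geometrically growing radii r0 * 2**j.
    dists = np.linalg.norm(points - center, axis=1)
    parts = [points[dists <= r0]]                                 # inner ball
    for j in range(1, num_rings + 1):
        lo, hi = r0 * 2 ** (j - 1), r0 * 2 ** j
        parts.append(points[(dists > lo) & (dists <= hi)])        # j-th ring
    return parts

rng = np.random.default_rng(1)
cluster = rng.normal(size=(200, 2))
pieces = partition_into_rings(cluster, center=np.zeros(2), r0=0.5, num_rings=4)
print([len(p) for p in pieces])   # points per part; the learner is run on each part separately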

We now analyze Algorithm 2 and show that, with probability at least 1 − δ, it outputs a (1 + ε)-approximate and (1 − ε)-accurate k-means clustering.

Let n = |X| be the total number of points in the k-means instance. For each intermediate cluster index i, each ring index j, and each label t, let X_{i,j}^t be the set of points that are in the partition X_{i,j} and are labeled t by the output of the learning algorithm A_L. Call a point in X_{i,j}^t bad if it does not belong to the optimal cluster C*_t; otherwise, call it good. Denote the set of bad points by B and let its complement G = X \ B be the set of good points. For each t, let c*_t denote the center of cluster C*_t in C*. For any point x, let c*(x) denote the center of the optimal cluster for x under the clustering C*; thus, c*(x) = c*_{ℓ*(x)}.

Notice that, for all , all the points in belong to one of the ’s. In other words, no point in is more than away from , where . To see this, suppose is a point that is more than away from . Then, which is a contradiction.

Lemma 6.

With probability at least , all non-empty ’s satisfy .

Proof.

Recall that we run with samples. Thus, by definition, , each run of succeeds with probability at least . Since we only run at most times, the claim follows from the union bound. ∎

We continue the rest of the analysis conditioned on the event that all runs of A_L succeed, i.e., that the conclusion of Lemma 6 holds for all non-empty partitions. In proving the subsequent results, we use the following observations.

Observation 7.

No two points in the same partition X_{i,j} are more than D_{i,j} apart, where, according to the definitions in the algorithm, D_{i,j} denotes the outer diameter of the ring that bounds X_{i,j}.

Observation 8.

For , the inner radius of the ring that bounds is . Therefore, we have the following lower bound for the cost of :

Lemma 9.

For all and , if then .

Proof.

If a point x ∈ X_{i,j} is assigned label t, then x is in some optimal cluster denoted by C*_s for some s. Note that if the output of the algorithm A_L on X_{i,j} gives label t to some point, then the non-inventive property of A_L ensures that it has seen at least one sample point y that is in C*_t, and the zero sample-error property ensures that y is labeled correctly by A_L. Thus, y ∈ X_{i,j} and ℓ*(y) = t. Hence, y is a good point with label t, and consequently, we have

where the last inequality follows from Observation 7. ∎

Lemma 10 (Squared Triangle Inequality).

For any and , we have

Proof.

The first inequality follows from the AM-GM inequality because . The second inequality holds because implies . ∎
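A commonly used form of such a squared triangle inequality, consistent with the AM-GM step in the proof above, is the following: for any points x, y, z and any t > 0,

d(x, z)^2 ≤ (1 + t) · d(x, y)^2 + (1 + 1/t) · d(y, z)^2,

which follows from d(x, z) ≤ d(x, y) + d(y, z) together with 2ab ≤ t·a^2 + b^2/t; taking t = 1 gives the weaker bound d(x, z)^2 ≤ 2 d(x, y)^2 + 2 d(y, z)^2.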

Now, let us analyze the cost of the labeling output by Algorithm 2:

Splitting the cost contributions of good and bad points, we get
Applying Lemma 9 together with Lemma 10, for any , we have
From Lemma 6, we have , and it follows that
(1)

Consider the last two terms of Equation (1) individually. For the first summation, we have

In the last inequality, we used the fact that A_C gives an α-approximation for the optimal cost. For the second summation of Equation (1), Observation 8 gives

Here, we have again used the approximation guarantee of A_C in the final inequality.

Choosing the parameters of the learning steps appropriately ensures that each of these two summations is at most a small fraction of the optimal cost, and consequently, we get a final cost of at most (1 + ε) times the optimal cost. Recall that we established this bound conditioned on the success of all runs of the learning algorithm. By Lemma 6, all the runs of the learning algorithm succeed with probability at least 1 − δ, so this condition holds with the same probability. Summing the per-partition error bounds over all partitions yields err(C, C*) ≤ ε|X|. Consequently, Algorithm 2, with probability at least 1 − δ over the choice of the samples, outputs a (1 + ε)-approximate and (1 − ε)-accurate k-means clustering.

In Algorithm 2, instead of an exact algorithm for the 1-means problem, we can also use a PTAS. Using a PTAS to approximate 1-means up to a (1 + ε) factor will only cost an additional (1 + ε) factor in our cost analysis. As a result, we get the same approximation and accuracy guarantees after rescaling ε by a constant factor.

Algorithm 2 queries the labeling oracle ℓ* once per sampled point. Recall that simulating one query to ℓ* takes at most k same-cluster queries. Therefore, the total number of same-cluster queries is at most k times the total number of samples drawn.

Our definition of a learning algorithm in Section 2.2 has nothing to do with whether the input is a k-means instance or a k-median instance, which is similar to k-means except that the cost of a cluster S with respect to a center c is defined as Σ_{x ∈ S} d(x, c). In fact, it applies to any similar clustering scenario where the cost is defined in terms of the p-th power (p ≥ 1) of distances instead of squared distances. The analysis of Algorithm 2 can be adapted to any fixed p once we have a suitable triangle inequality analogous to Lemma 10. For example, when p = 1, we can simply use the trivial inequality d(x, z) ≤ d(x, y) + d(y, z). Thus, for such clustering problems, Algorithm 2, with a slight modification to the choice of radii in Step 2 and a small adjustment to the parameters, will give the same guarantees. Hence, we have the following theorem, which is the formal version of Theorem 1. The proof follows from the analysis of Algorithm 2.
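For a general fixed p ≥ 1 (our illustration, not a statement from the paper), the convexity of the map t ↦ t^p gives the analogous inequality

(a + b)^p ≤ 2^(p − 1) · (a^p + b^p) for all a, b ≥ 0,

so d(x, z)^p ≤ 2^(p − 1) · (d(x, y)^p + d(y, z)^p), which can play the role of Lemma 10 in the adapted analysis.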

Theorem 11.

Let F be a family of k-means (k-median) instances. Suppose that F is learnable with sample complexity m using a zero sample error, non-inventive learning algorithm A_L. Let A_C be a constant-factor approximation algorithm, and let A_1 be a PTAS for the 1-means (1-median) problem. There exists a polynomial-time algorithm that, given an instance in F, oracle access to same-cluster queries for some fixed optimal clustering C*, and parameters ε, δ > 0, outputs a clustering that, with probability at least 1 − δ, is (1 − ε)-accurate with respect to C* and simultaneously has a cost of at most (1 + ε) times the optimal cost. The algorithm uses A_L, A_C, and A_1 as subroutines. The number of same-cluster queries made by the algorithm is

  1. for the k-means setting and

  2. for the k-median setting.

For k-means and k-median instances in Euclidean space and those in finite metric spaces, there exist several constant-factor approximation algorithms (for example, Ahmadian et al. [1] and Kanungo et al. [15]). Solving the 1-means problem in Euclidean space is straightforward: the optimal center of a set is simply its centroid. For the 1-median problem in Euclidean space (the geometric median), no exact algorithm is known, but several PTASes exist (for example, Cohen et al. [6]). In a finite metric space, to solve the 1-means problem, we can simply try all possible centers c ∈ Q in polynomial time, and the same holds for the 1-median setting. Thus, for Euclidean and finite metric space k-means and k-median instances that have no boundary points, Theorem 11, together with Theorem 4 and Theorem 5, gives efficient algorithms for (1 + ε)-approximate, (1 − ε)-accurate semi-supervised clustering.

4 Removing the Dependency on Problem Size in the Query Complexity for Euclidean k-Means

For the family of Euclidean k-means instances, the query complexity of Algorithm 2 has a log n dependency (where n is the number of points in the input k-means instance) due to the repeated use of the learning algorithm A_L: we run A_L once per ring of every intermediate cluster, with a failure probability small enough for a union bound over all these runs. Note that the sample complexity of A_L itself, in the case of Euclidean k-means instances, does not have this dependency.

In this section, we show that we can avoid this dependency on n by using a slightly more involved algorithm, at the cost of increasing the query complexity by an extra factor that is polynomial in k. Nevertheless, this algorithm has superior performance when the size of the input instance (i.e., the number of points) is very large, for example when log n dominates this extra factor.

Recall that, for a set S, cost(S, c) is minimized when c is the centroid of S, denoted by c(S). Define the fractional size of an optimal cluster C*_i as the fraction of points that belong to it, i.e., the ratio |C*_i|/|X|. Suppose we only want to get a good approximation for the cost, and that we know that all the clusters in the target solution have sufficiently large fractional sizes. In this case, naive uniform sampling will likely pick a large number of samples from each of the clusters. This observation, together with Lemma 12, allows us to approximate the centroid and the cost of each cluster to any given accuracy.

Lemma 12 (Lemma 1 and Lemma 2 of Inaba et al. [13]).

Let δ > 0, let m be a positive integer, and let S be a multiset of m i.i.d. samples from the uniform distribution over some finite set C ⊆ ℝ^r. With probability at least 1 − δ, ‖c(S) − c(C)‖^2 ≤ (1/(δm)) · cost(C, c(C))/|C| and cost(C, c(S)) ≤ (1 + 1/(δm)) · cost(C, c(C)), where c(·) denotes the centroid.
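A quick numerical illustration of the flavor of Lemma 12 (a sketch with arbitrary parameters of our choosing): the centroid of a small uniform sample is, with good probability, nearly as good a 1-means center as the true centroid.

import numpy as np

rng = np.random.default_rng(2)
C = rng.normal(size=(10_000, 3))                  # a finite point set C in three dimensions
opt_cost = ((C - C.mean(axis=0)) ** 2).sum()      # 1-means cost of the true centroid

m = 50                                            # sample size
S = C[rng.integers(0, len(C), size=m)]            # m i.i.d. uniform samples from C (with replacement)
sample_cost = ((C - S.mean(axis=0)) ** 2).sum()   # 1-means cost of the sample centroid

print(sample_cost / opt_cost)                     # typically about 1 + O(1/m), as Lemma 12 predicts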

However, the above approach fails when some clusters in the optimal target solution contribute significantly to the cost but have small fractional sizes (because uniform sampling is not guaranteed to pick sufficiently many samples from the small clusters). Ailon et al. [2] circumvented this issue with an algorithm that iteratively approximates the centers of the clusters using a distance-based probability distribution (D²-sampling). We refer to their algorithm in the outline below.

Note that when it comes to accuracy, we can totally disregard clusters with small fractional sizes; we only have to correctly label a sufficiently large fraction of the points in large clusters. With this intuition, we present the outline of our algorithm.

Let (X, Q, d) be a k-means instance in Euclidean space that has no boundary points. For simplicity, we refer to the instance by just X where possible since, for Euclidean k-means, the other two parameters are fixed. We start with a naive uniform sampling step that gives a good approximation for the centers of large clusters. Starting with these centers, we run a slightly modified version of the algorithm of Ailon et al. to approximate the centers of the remaining small clusters. Thus, at this point, we have a clustering with a good cost, and we know which clusters are large. We now run the learning algorithm A_L on the whole input and obtain a labeling of the points. For each point, we assign its final label based on

  1. the label assigned to it by the learning algorithm A_L, and

  2. its proximity to large cluster centers.

In particular, if the output of A_L decides that a point x should be in some large cluster, and if x is sufficiently close to the approximate center of that cluster, we label x according to the learning output; otherwise, we label it according to its nearest approximate center. We show that this approach retains a cost that is close to the cost of the clustering output by the algorithm of Ailon et al. The accuracy guarantee comes from the facts that a large fraction of the points are sufficiently close to the centers of large clusters and that A_L labels most of them correctly with good probability.
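A minimal Python sketch of this final labeling rule (ours; the threshold and all names are illustrative placeholders, not the quantities fixed in the analysis):

import numpy as np

def assign_final_labels(points, learned_labels, centers, large, radius):
    # Keep the learner's label when it names a large cluster whose approximate center is
    # within `radius` of the point; otherwise fall back to the nearest approximate center.
    dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
    final = dists.argmin(axis=1)
    for i, lab in enumerate(learned_labels):
        if lab in large and dists[i, lab] <= radius:
            final[i] = lab
    return final

pts = np.array([[0.2, 0.1], [5.1, 4.9]])
print(assign_final_labels(pts, learned_labels=[1, 1],
                          centers=np.array([[0.0, 0.0], [5.0, 5.0]]),
                          large={0, 1}, radius=1.0))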

We now review the key properties of the algorithm of Ailon et al. [2]. Let γ > 0. We say a k-means instance X is γ-irreducible if no (k − 1)-means clustering gives a (1 + γ)-approximation for the k-means problem, i.e., if OPT_{k−1}(X) and OPT_k(X) denote the optimal (k − 1)-means and k-means costs of X, then X is γ-irreducible if OPT_{k−1}(X) > (1 + γ) · OPT_k(X).