# Speeding Up Constrained k-Means Through 2-Means

For the constrained 2-means problem, we present an $O\left(dn + d\left(\frac{1}{\epsilon}\right)^{O(1/\epsilon)} n\right)$ time algorithm. It generates a collection $U$ of approximate center pairs $(c_1, c_2)$ such that one of the pairs in $U$ induces a (1+ϵ)-approximation for the problem. The existing approximation scheme for the constrained 2-means problem takes $O\left(\left(\frac{1}{\epsilon}\right)^{O(1/\epsilon)} dn\right)$ time, and the existing approximation scheme for the constrained k-means problem takes $O\left(\left(\frac{k}{\epsilon}\right)^{O(k/\epsilon)} dn\right)$ time. Using the method developed in this paper, we point out that every existing approximation scheme for the constrained k-means problem with running time $C(k, n, d, \epsilon)$ can be transformed into a new approximation scheme with time complexity $C(k, n, d, \epsilon)/k^{\Omega(1/\epsilon)}$.


## 1 Introduction

The k-means problem is to partition a set $P$ of $n$ points in $d$-dimensional space $\mathbb{R}^d$ into $k$ subsets $P_1, \ldots, P_k$ such that $\sum_{i=1}^{k}\sum_{p \in P_i} \|p - c_i\|^2$ is minimized, where $c_i$ is the center of $P_i$, and $\|p - q\|$ is the distance between two points $p$ and $q$. The k-means problem is one of the classical NP-hard problems in computer science, and has broad applications as well as theoretical importance. It is NP-hard even for the case $k = 2$. The classical k-means and k-median problems have received a lot of attention in the last decades [28, 8, 12, 19, 25, 1, 9, 21, 16, 30].

Inaba, Katoh, and Imai showed that the k-means problem has an exact algorithm with running time . For the k-means problem, Arthur and Vassilvitskii gave an $O(\log k)$-approximation algorithm. A (1+ϵ)-approximation scheme was derived by de la Vega et al. with time . Kumar, Sabharwal, and Sen presented a (1+ϵ)-approximation algorithm for the k-means problem with running time . Ostrovsky et al. developed a (1+ϵ)-approximation for the k-means problem under the separation condition with running time . Feldman, Monemizadeh, and Sohler gave a (1+ϵ)-approximation scheme for the k-means problem using coresets with running time . Jaiswal, Kumar, and Yadav presented a (1+ϵ)-approximation algorithm for the k-means problem using the $D^2$-sampling method with running time . Jaiswal, Kumar, and Yadav gave a (1+ϵ)-approximation algorithm with running time . Kanungo et al. presented a (9+ϵ)-approximation algorithm for the problem in polynomial time by applying local search. Ahmadian et al. gave a 6.357-approximation algorithm for the k-means problem in Euclidean space. For fixed $d$ and arbitrary $k$, Friggstad, Rezapour, and Salavatipour and Cohen-Addad, Klein, and Mathieu proved that the local search algorithm yields a PTAS for the problem, which runs in time. Cohen-Addad further showed that the running time can be improved to .

The input data of the k-means problem always satisfies local properties. However, in many applications, each cluster of the input data may have to satisfy some additional constraints. The constrained k-means problem appears to have a different structure from the classical k-means problem, which lets each point go to the cluster with the nearest center. Constrained k-means problems have received much attention in the literature, such as the chromatic clustering problem [4, 14], the r-capacity clustering problem, r-gather clustering, fault tolerant clustering, uncertain data clustering, semi-supervised clustering [35, 34], and l-diversity clustering. As given in Ding and Xu, all k-means problems with constraint conditions can be defined as follows.

###### Definition 1

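[Constrained k-means problem] Given a point set $P \subseteq \mathbb{R}^d$, a list of constraints $L$, and a positive integer $k$, the constrained k-means problem is to partition $P$ into $k$ clusters $P_1, \ldots, P_k$ such that all the constraints in $L$ are satisfied and $\sum_{i=1}^{k}\sum_{p \in P_i} \|p - c(P_i)\|^2$ is minimized, where $c(P_i)$ denotes the centroid of $P_i$.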
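As an illustration of the objective in Definition 1, the following minimal sketch evaluates the sum of squared distances to the cluster centroids for a given partition. The helper names (`centroid`, `sq_dist`, `clustering_cost`) are hypothetical and not taken from the paper, and the sketch does not check any constraint list; it only evaluates the cost of a partition that is assumed to be feasible.

```python
from typing import List, Sequence

Point = Sequence[float]


def centroid(points: List[Point]) -> List[float]:
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return [sum(p[j] for p in points) / n for j in range(len(points[0]))]


def sq_dist(p: Point, q: Point) -> float:
    """Squared Euclidean distance between p and q."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def clustering_cost(clusters: List[List[Point]]) -> float:
    """Sum over clusters of squared distances to that cluster's centroid."""
    total = 0.0
    for cluster in clusters:
        c = centroid(cluster)  # computed once per cluster
        total += sum(sq_dist(p, c) for p in cluster)
    return total


# Two clusters with centroids (1, 0) and (10, 2): cost = 1 + 1 + 4 + 4.
clusters = [[(0.0, 0.0), (2.0, 0.0)], [(10.0, 0.0), (10.0, 4.0)]]
print(clustering_cost(clusters))
```

In a constrained instance, the partition itself would be produced by a problem-specific procedure that respects the constraints; only the cost evaluation is shown here.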

In recent years, there has been some progress on the constrained k-means problem. The first polynomial time approximation scheme, with running time , for the constrained k-means problem was shown by Ding and Xu, and a collection of size of candidate approximate centers can be obtained. The fastest existing approximation scheme for the constrained k-means problem takes time [6, 7, 17], which was first derived by Bhattacharya, Jaiswal, and Kumar [6, 7]. Their algorithm gives a collection of size of candidate approximate centers. Feng et al. analyzed the complexity of [6, 7] and gave an algorithm with running time , which outputs a collection of size of candidate approximate centers.

It is known that the 2-means problem is the smallest version of the k-means problem, and remains NP-hard. Obviously, all the approximation algorithms for the k-means problem can be directly applied to get approximation algorithms for the 2-means problem. However, not all the approximation algorithms for the 2-means problem can be generalized to solve the k-means problem. Understanding the characteristics of the 2-means problem will give new insight into the k-means problem. Meanwhile, splitting the input data into two clusters is useful in many interesting applications, such as separating "good" from "bad" data, or "normal" from "abnormal" data.

For the 2-means problem, Inaba, Katoh, and Imai presented a (1+ϵ)-approximation scheme for 2-means with running time . Matoušek gave a deterministic (1+ϵ)-approximation algorithm with running time log. Sabharwal and Sen presented a (1+ϵ)-approximation algorithm with linear running time . Kumar, Sabharwal, and Sen gave a randomized approximation algorithm with running time .

This paper develops a new technique for the constrained 2-means problem. It is based on how balanced the sizes of the clusters are in the constrained 2-means problem. This brings an algorithm with running time $O\left(dn + d\left(\frac{1}{\epsilon}\right)^{O(1/\epsilon)} n\right)$. Our algorithm outputs a collection of size of candidate approximate centers, in which one of them induces a (1+ϵ)-approximation for the constrained 2-means problem. The technique shows a faster way to obtain the first two approximate centers when applied to the constrained k-means problem, and can speed up the existing approximation schemes for constrained k-means with $k$ greater than 2. Using the method developed in this paper, we point out that every existing PTAS for the constrained k-means problem with running time $C(k, n, d, \epsilon)$ can be transformed into a new PTAS with time complexity $C(k, n, d, \epsilon)/k^{\Omega(1/\epsilon)}$. Therefore, we provide a unified approach to speed up the existing approximation schemes for the constrained k-means problem.

This paper is organized as follows. In Section 2, we give some basic notation. In Section 3, we give an overview of the new algorithm for the constrained 2-means problem. In Section 4, we give a much faster approximation scheme for the constrained 2-means problem. In Section 5, we apply the method to the general constrained k-means problem, and show faster approximation schemes.

## 2 Preliminaries

This section introduces the notation used in the algorithm design.

###### Definition 2

Let be a real number in . Let be a set of points in .

• A partition of is -balanced if for .

• A -balanced k-means problem is to partition into such that for all .

###### Definition 3

Let be a set of points in , and .

• Define .

• Define .

###### Definition 4

Let be a set of points in , and be a partition of .

1. Define .

2. Define .

3. Define .

4. Define .

5. Define .

The Chernoff bound (see ) is used in the approximation algorithm when our main result is applied to a concrete model.

###### Theorem 5

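Let $X_1, \ldots, X_n$ be $n$ independent random 0-1 variables, where $X_i$ takes 1 with probability at least $p$ for $i = 1, \ldots, n$. Let $X = \sum_{i=1}^{n} X_i$. Then for any $\delta > 0$, $\Pr(X < (1-\delta)pn) < e^{-\frac{1}{2}\delta^2 pn}$.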

The union bound is expressed by the inequality

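$$\Pr(E_1 \cup E_2 \cup \cdots \cup E_m) \le \Pr(E_1) + \Pr(E_2) + \cdots + \Pr(E_m), \tag{1}$$

where $E_1, \ldots, E_m$ are events that may not be independent. We will use the famous Stirling formula

$$n! \approx \sqrt{2\pi n}\cdot\frac{n^n}{e^n}. \tag{2}$$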
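To give a feel for the accuracy of the Stirling formula (2), the following short sketch compares $n!$ with $\sqrt{2\pi n}\,(n/e)^n$ for a few values of $n$. The function names are hypothetical and chosen only for this illustration.

```python
import math


def stirling(n: int) -> float:
    """Stirling approximation sqrt(2*pi*n) * (n/e)^n of n!."""
    return math.sqrt(2 * math.pi * n) * (n / math.e) ** n


def stirling_ratio(n: int) -> float:
    """Ratio n! / stirling(n); it approaches 1 as n grows."""
    return math.factorial(n) / stirling(n)


for n in (1, 5, 10, 20):
    print(n, stirling_ratio(n))
```

The ratio stays slightly above 1 and shrinks toward 1 as $n$ increases, which is why the approximation is safe to use for the binomial counts that appear in the analysis.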

For two points and in , both and represent their Euclidean distance . For a finite set , is the number of elements in it.

###### Lemma 6

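 For a set $P$ of points in $\mathbb{R}^d$ and any point $x \in \mathbb{R}^d$, $f_2(x, P) = f_2(c(P), P) + |P|\cdot\|c(P) - x\|^2$, where $c(P)$ is the centroid of $P$.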
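Lemma 6 is used later in the form $f_2(c_1, P_1) = f_2(m_1, P_1) + |P_1|\,\|c_1 - m_1\|^2$ (equation (32)). The sketch below checks this centroid identity numerically on a tiny example, assuming that $f_2(x, P)$ denotes the sum of squared distances from the points of $P$ to $x$. The helper names are hypothetical.

```python
import math


def centroid(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return [sum(p[j] for p in points) / n for j in range(len(points[0]))]


def sq_dist(p, q):
    """Squared Euclidean distance between p and q."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def f2(x, points):
    """Assumed meaning of f_2(x, P): sum of squared distances to x."""
    return sum(sq_dist(p, x) for p in points)


P = [(0.0, 0.0), (4.0, 0.0), (2.0, 6.0)]  # centroid is (2, 2)
x = (5.0, 1.0)
m = centroid(P)

lhs = f2(x, P)                            # 26 + 2 + 34
rhs = f2(m, P) + len(P) * sq_dist(m, x)   # 32 + 3 * 10
print(lhs, rhs, math.isclose(lhs, rhs))
```

The identity shows that replacing the true centroid by any other center $x$ increases the cost by exactly $|P|\,\|c(P) - x\|^2$, which is what makes a bound on the distance to the centroid sufficient for a bound on the cost.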

###### Lemma 7

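 Let $P$ be a set of points in $\mathbb{R}^d$. Assume that $S$ is a set of $m$ points obtained by sampling points from $P$ uniformly and independently. Then for any $\delta > 0$, $\|c(S) - c(P)\|^2 < \frac{1}{\delta m}\sigma^2$ with probability at least $1 - \delta$, where $\sigma^2 = \frac{1}{|P|}\sum_{p \in P}\|p - c(P)\|^2$.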
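The sampling lemma can be illustrated empirically. The sketch below assumes the standard form of the statement: for a uniform sample $S$ of $m$ points drawn with replacement from $P$, the event $\|c(S) - c(P)\|^2 > \sigma^2/(\delta m)$ has probability at most $\delta$, where $\sigma^2$ is the average squared distance of the points of $P$ to $c(P)$. The function `failure_rate`, the point set, and all parameter values are hypothetical choices for this illustration.

```python
import random


def centroid(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return [sum(p[j] for p in points) / n for j in range(len(points[0]))]


def sq_dist(p, q):
    """Squared Euclidean distance between p and q."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def failure_rate(points, m, delta, trials, seed=0):
    """Fraction of trials in which the sample centroid misses the bound."""
    rng = random.Random(seed)
    c = centroid(points)
    sigma2 = sum(sq_dist(p, c) for p in points) / len(points)
    threshold = sigma2 / (delta * m)
    failures = 0
    for _ in range(trials):
        sample = rng.choices(points, k=m)  # uniform, with replacement
        if sq_dist(centroid(sample), c) > threshold:
            failures += 1
    return failures / trials


gen = random.Random(1)
P = [(gen.uniform(0, 10), gen.uniform(0, 10)) for _ in range(200)]
rate = failure_rate(P, m=20, delta=0.2, trials=2000)
print(rate, rate <= 0.2)
```

The guarantee is a Markov-type bound, so the observed failure rate is typically far below $\delta$.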

###### Lemma 8

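 Let $P$ be a set of points in $\mathbb{R}^d$, and $P'$ be an arbitrary subset of $P$ with $\alpha|P|$ points for some $0 < \alpha \le 1$. Then $\|c(P) - c(P')\| \le \sqrt{\frac{1-\alpha}{\alpha}}\,\sigma$, where $\sigma^2 = \frac{1}{|P|}\sum_{p \in P}\|p - c(P)\|^2$.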

###### Lemma 9

[6, 7] For any three points $x, y, z \in \mathbb{R}^d$, we have $\|x - z\|^2 \le 2\|x - y\|^2 + 2\|y - z\|^2$.
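Lemmas 8 and 9 can be checked exhaustively on a small point set. The sketch below assumes the forms in which they are applied later in this section: $\|c(P) - c(P')\| \le \sqrt{(1-\alpha)/\alpha}\,\sigma$ for every subset $P'$ with $\alpha|P|$ points, and $\|x - z\|^2 \le 2\|x - y\|^2 + 2\|y - z\|^2$ (the step from (41) to (42)). Function names and the example points are hypothetical.

```python
import math
from itertools import combinations


def centroid(points):
    """Coordinate-wise mean of a non-empty sequence of points."""
    n = len(points)
    return [sum(p[j] for p in points) / n for j in range(len(points[0]))]


def sq_dist(p, q):
    """Squared Euclidean distance between p and q."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def check_subset_centroid_bound(points, tol=1e-9):
    """Check the Lemma 8 bound for every non-empty subset of points."""
    n = len(points)
    c = centroid(points)
    sigma = math.sqrt(sum(sq_dist(p, c) for p in points) / n)
    for size in range(1, n + 1):
        alpha = size / n
        bound = math.sqrt((1 - alpha) / alpha) * sigma
        for subset in combinations(points, size):
            if math.sqrt(sq_dist(centroid(subset), c)) > bound + tol:
                return False
    return True


def check_relaxed_triangle(points, tol=1e-9):
    """Check the Lemma 9 inequality for every ordered triple of points."""
    for x in points:
        for y in points:
            for z in points:
                if sq_dist(x, z) > 2 * sq_dist(x, y) + 2 * sq_dist(y, z) + tol:
                    return False
    return True


P = [(0.0, 0.0), (1.0, 3.0), (4.0, 1.0), (6.0, 5.0), (9.0, 2.0), (3.0, 8.0)]
print(check_subset_centroid_bound(P), check_relaxed_triangle(P))
```

Exhaustive enumeration is only feasible for tiny inputs; the point of the check is to make the direction and scale of the two inequalities concrete.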

## 3 Overview of Our Method

In order to develop a faster algorithm for the constrained 2-means problem, we assume that the input set $P$ has two clusters $P_1$ and $P_2$. We will try to find one subset of size $M$ from each of $P_1$ and $P_2$, where $M$ is an integer large enough to derive an approximate center by Lemma 8. We consider two different cases. The first case is that the two clusters $P_1$ and $P_2$ have balanced sizes. We draw two sets of random samples from $P$. An approximate center for $P_1$ will be generated via one of the subsets of size $M$ of the first sample set, and an approximate center for $P_2$ will be generated via one of the subsets of size $M$ of the second sample set. The two sampling parameters are selected based on the balance condition between the sizes of $P_1$ and $P_2$.

We now discuss the case in which $P_1$ is much larger than $P_2$. We generate a subset that will be used to generate an approximate center $c_1$ for $P_1$. This set can be obtained via random samples from $P$ since $P_1$ is much larger than $P_2$. There are again two cases for finding the other approximate center $c_2$ for $P_2$. The first case is that almost all points of $P_2$ are close to $c_1$. In this case, we just let $c_2$ be the same as $c_1$, which is justified by Lemma 8. The second case is that enough points of $P_2$ are far from $c_1$. This transforms the problem into finding the second approximate center for the second cluster, assuming that the approximate center $c_1$ is good enough for $P_1$.

Phase of the algorithm lets be equal to . Phase extracts from the half of the elements with larger distances to $c_1$ than the other half. It will take phases to search . Each subsequent phase shrinks the search area by a constant factor. This method was used in existing algorithms. As we only have one approximate center for $P_1$, it saves time by a factor in finding the first approximate center. This makes our approximation algorithm run in $O\left(dn + d\left(\frac{1}{\epsilon}\right)^{O(1/\epsilon)} n\right)$ time for the constrained 2-means problem.
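The phase structure described above can be sketched as follows. This is only an illustration under simplifying assumptions: each phase keeps exactly the half of the current points that are farthest from the first approximate center $c_1$, and the sampling and subset enumeration done inside each phase are omitted. The names `farther_half` and `count_phases` are hypothetical.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between p and q."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def farther_half(points, center):
    """Keep the floor(len/2) points with the largest distance to center."""
    ranked = sorted(points, key=lambda p: sq_dist(p, center), reverse=True)
    return ranked[: len(points) // 2]


def count_phases(points, center):
    """Number of halving phases until the search set becomes empty."""
    phases = 0
    current = list(points)
    while current:
        phases += 1
        current = farther_half(current, center)
    return phases


P = [(float(i),) for i in range(8)]  # points 0, 1, ..., 7 on a line
c1 = (0.0,)
print(farther_half(P, c1))
print(count_phases(P, c1))
```

Because the search set shrinks by a constant factor per phase, the number of phases grows only logarithmically in the number of input points.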

## 4 Approximation Algorithm for Constrained 2-means

In this section, an approximation scheme is presented for the constrained 2-means problem. The methods used in this section will be applied to the general constrained k-means problem in Section 5. We define some parameters before describing the algorithm for the constrained 2-means problem.

### 4.1 Setting Parameters

Assume that the real parameter $\epsilon$ is used to control the approximation ratio, and the real parameter $\gamma$ is used to control the failure probability of the randomized algorithm. We define some constants for our algorithm and its analysis. All the parameters that are set up in (3) to (17) in this section are positive real constants.

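$$
\begin{align}
\gamma &= \frac{1}{3}, \tag{3}\\
\delta = \delta_6 &= \frac{1}{10}, \tag{4}\\
d_1 &= \delta_6, \tag{5}\\
d_2 &= 4. \tag{6}
\end{align}
$$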

We select $\delta_2$ to satisfy inequality (7).

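$$(1+2\delta_2)\cdot d_2 \le d_2 + \delta. \tag{7}$$

$$
\begin{align}
d_4 &= \frac{10+18\delta_6}{\delta_2^2}\cdot\left(\frac{1}{\gamma}+\log\frac{1}{\gamma}\right), \tag{8}\\
\gamma^* &= \frac{1}{2}-\delta_6, \tag{9}\\
\gamma_{5,b} &= \frac{\gamma}{12}, \tag{10}\\
\eta &= \frac{\delta_6}{2}, \tag{11}\\
\alpha_1 &= 5, \tag{12}\\
\alpha_2 &= 2\alpha_1\alpha_5+\delta_6, \tag{13}\\
\alpha_6 &= \frac{\gamma^*\cdot d_4}{12}, \tag{14}\\
\alpha_5 &= \frac{4}{1-\delta_6-\frac{4}{\alpha_6}}, \text{ and} \tag{15}\\
M &= \frac{d_4}{\epsilon}. \tag{16}
\end{align}
$$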

We select $\varsigma$ and $\delta_1$ to satisfy inequality (17).

$$(1+\varsigma)(1+2\delta_1)\alpha_2 \le \alpha_2 + \delta/2. \tag{17}$$

###### Lemma 10

The parameters satisfy the following conditions (18) to (21):

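$$
\begin{align}
\alpha_6 &\ge 4, \tag{18}\\
e^{-\frac{\delta_2^2 d_2}{4}M} &\le \frac{\gamma}{6}, \tag{19}\\
e^{-aM} &\le \frac{\gamma}{12} \quad \text{for any positive real } a \text{ and all } \epsilon \le \frac{a d_4}{\ln\frac{12}{\gamma}}, \tag{20}\\
\frac{4}{\alpha_6}+\frac{4}{\alpha_5} &= 1-\delta_6. \tag{21}
\end{align}
$$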

Proof:  Inequality (18): By equations (14), (8) and (9), we have inequalities:

$$\alpha_6 = \frac{1}{12}\cdot\gamma^*\cdot d_4 \ge \frac{1}{12}\cdot\left(\frac{1}{2}-\delta_6\right)\cdot(10\cdot 4\cdot 3) = 4. \tag{22}$$

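Inequality (19): Let $z = \frac{\delta_2^2 d_2}{4}\cdot M$. We have the inequalities:

$$
\begin{align}
z &= \frac{\delta_2^2 d_2}{4}\cdot M \tag{23}\\
&\ge \frac{\delta_2^2 d_2}{4}\cdot\frac{d_4}{\epsilon} \tag{24}\\
&\ge \frac{d_2}{4}\cdot 8\left(\frac{1}{\gamma}+\log\frac{1}{\gamma}\right) \tag{25}\\
&\ge 2d_2\left(\frac{1}{\gamma}+\log\frac{1}{\gamma}\right). \tag{26}
\end{align}
$$

Thus, $e^{-z} \le \frac{\gamma}{6}$.

Inequality (20): By equation (16), we have $e^{-aM} \le \frac{\gamma}{12}$ when $\epsilon \le \frac{a d_4}{\ln\frac{12}{\gamma}}$.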

Equation (21): It follows from equation (15).

### 4.2 Algorithm Description

In this section, an approximation algorithm for the constrained 2-means problem is given. It outputs a collection of centers, and one of them gives a (1+ϵ)-approximation for the constrained 2-means problem.

Algorithm 2-Means

Input: $P$ is a set of points in $\mathbb{R}^d$, and $\epsilon$ is a real parameter to control the accuracy of approximation.

Output: A collection $U$ of pairs of centers $(c_1, c_2)$.

1. Let ;

2. Let ;

3. Let $M$ be defined as in equation (16);

4. Let ;

5. Let ;

6. Let ;

7. Select a set of random samples from ;

8. Select a set of random samples from ;

9. For every two subsets of and of of size ,

10. {

11. Compute the centroid of , and of ;

13. }

14. Select a set of random samples from ;

15. Compute the centroid of ;

16. Let ;

17. Repeat

18. Select a set of random samples from ;

19. For each size subset of copies of

20. {

21. Compute the centroid of ;

23. }

24. Let be the -th largest of ;

25. Let contain all of the points in with ;

26. Let ;

27. Until is empty;

28. Output ;

End of Algorithm
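The enumeration in lines 9 to 13 of the algorithm can be sketched as follows. This is a simplified illustration rather than the algorithm itself: the two sample sets and the subset size are passed in directly, and the step that records each pair of centroids in the output collection is an assumption, since that line is not legible in the listing above. The function name `candidate_center_pairs` is hypothetical.

```python
from itertools import combinations
from math import comb


def centroid(points):
    """Coordinate-wise mean of a non-empty sequence of points."""
    n = len(points)
    return [sum(p[j] for p in points) / n for j in range(len(points[0]))]


def candidate_center_pairs(sample1, sample2, subset_size):
    """Centroid pairs over all subsets of the given size of each sample."""
    pairs = []
    for h1 in combinations(sample1, subset_size):
        c1 = centroid(h1)
        for h2 in combinations(sample2, subset_size):
            # Assumed step: record the pair of centroids as a candidate.
            pairs.append((c1, centroid(h2)))
    return pairs


S1 = [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0), (2.0, 2.0)]
S2 = [(10.0, 0.0), (12.0, 0.0), (10.0, 2.0)]
pairs = candidate_center_pairs(S1, S2, 2)
print(len(pairs), comb(4, 2) * comb(3, 2))
print(pairs[0])
```

The number of candidate pairs is the product of two binomial coefficients, which depends only on the sample sizes and the subset size and not on the number of input points. This is what keeps the size of the output collection independent of $n$.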

###### Definition 11

Let be the approximate center of via the algorithm.

1. Define .

2. Define .

3. Define .

4. Define .

5. Let be the center of for .

6. Let be the center for for

7. For each , let .

8. Let be the multiset with number of . It transforms every element of to .

9. Let .

10. Let be the center of for .

###### Lemma 12

Let be a real number in and be a positive real number with . Then we have .

Proof:  By Taylor formula, we have for some . Thus, we have .

###### Lemma 13

The algorithm 2-Means(.) has the following properties:

1. With probability at least , at least random points are from in , where .

2. If the two clusters and satisfy , then with probability at least , at least random points are from in , where .

3. Lines 7 to 13 of the algorithm 2-Means(.) generate at most pairs of centers.

4. If the clusters and satisfy , then with probability at least , contains no element of , where for all , where .

5. Lines 14 to 27 iterate at most times and generate at most pairs of centers.

Proof: We prove the statements one by one.

Statement 1: Since , we have . Let . With elements from , with probability at most (by inequality (19)), there are less than elements from by Theorem 5.

Statement 2: Let . By line 5 of the algorithm, we have . When elements are selected from , by Theorem 5, with probability at most (by inequality (19) and the range of determined nearby equation (7)), multiset has less than random points from .

Statement 3: After getting and of sizes and , respectively, it takes cases to enumerate their subsets of size . If contains elements from and contains elements from , then it generates pairs of and .

Statement 4: Let . When elements are selected in , the probability that contains no element of is at least by Lemma 12, and equations (16) and (8). Let . We have for all small positive when .

Statement 5: The loop from line 17 to line 27 iterates at most times since . Each iteration of the internal loop from line 19 to line 23 generates pairs of centers.

###### Lemma 14

Assume that only contains elements in ). Then with probability at least (), the approximate center satisfies the inequality

$$\|c(V) - m_i\|^2 \le \frac{\epsilon(1+\eta)}{\alpha_6}\,\sigma_i^2. \tag{27}$$

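Proof:  It follows from Lemma 7. Let $\delta^* = \frac{\gamma}{6}$. This is because

$$
\begin{align}
\delta^*|M| &= \frac{\gamma}{6}|M| \tag{28}\\
&= \frac{\gamma}{6}\cdot\frac{d_4}{\epsilon} \tag{29}\\
&= \frac{1}{\epsilon}\cdot\left(\frac{\gamma}{6}\cdot\frac{12}{\gamma^*}\right)\cdot\frac{\gamma^* d_4}{12} \tag{30}\\
&\ge \frac{1}{\epsilon}\cdot\frac{1}{1+\eta}\cdot\alpha_6. \tag{31}
\end{align}
$$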

Thus, . Therefore, the failure probability is at most by Lemma 7. Let .

We assume that if the unbalanced condition of Statement 4 of Lemma 13 is satisfied, then inequality (27) holds for with and . In other words, inequality (27) holds under the unbalanced condition, since it holds with large probability by Lemma 14 and Statement 4 of Lemma 13.

###### Lemma 15

$f_2(c_1, P_1) \le \left(1 + \frac{\epsilon(1+\eta)}{\alpha_6}\right) f_2(m_1, P_1)$.

Proof:  By Lemma 6 and inequality (27), we have

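$$
\begin{align}
f_2(c_1, P_1) &= f_2(m_1, P_1) + |P_1|\,\|c_1 - m_1\|^2 \tag{32}\\
&\le f_2(m_1, P_1) + |P_1|\cdot\frac{\epsilon(1+\eta)}{\alpha_6}\,\sigma_1^2 \tag{33}\\
&= f_2(m_1, P_1) + \frac{\epsilon(1+\eta)}{\alpha_6}\, f_2(m_1, P_1) \tag{34}\\
&= \left(1 + \frac{\epsilon(1+\eta)}{\alpha_6}\right) f_2(m_1, P_1). \tag{35}
\end{align}
$$

Note that the transition from (33) to (34) is by item 3 of Definition 4.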

We discuss the two different cases. They are based on the size of .

Case 1: .

In this case, we let .

###### Lemma 16

$\|m_2 - m_2^{in}\| \le \sqrt{\frac{\epsilon}{\alpha_1 - \epsilon}}\,\sigma_2$.

Proof:  Since , we have by the condition of Case 1. Let . We have . By Lemma 8, we have

###### Lemma 17

$\|m_2^{in} - c_2\| \le r_2$.

Proof:  By the definition of , we have the following inequalities:

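$$
\begin{align}
\|m_2^{in} - c_2\| &= \left\|\left(\frac{1}{|P_2^{in}|}\sum_{p \in P_2^{in}} p\right) - c_2\right\| \tag{36}\\
&= \left\|\left(\frac{1}{|P_2^{in}|}\sum_{p \in P_2^{in}} p\right) - \left(\frac{1}{|P_2^{in}|}\sum_{p \in P_2^{in}} c_2\right)\right\| \tag{37}\\
&\le \frac{1}{|P_2^{in}|}\left\|\sum_{p \in P_2^{in}} (p - c_2)\right\| \tag{38}\\
&\le \frac{1}{|P_2^{in}|}\sum_{p \in P_2^{in}} \|p - c_1\| \tag{39}\\
&\le r_2. \tag{40}
\end{align}
$$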

###### Lemma 18

.

Proof:

 f2(c2,P2) = f2(m2,P2)+|P2|||m2−c2||2 (41) ≤ f2(m2,P2)+|P2|(2||m2−min2||2+2||min2−c2||2) (42) ≤ f2(m2,P2)+|P2|⎛⎝2(√ϵα1−ϵσ2)2+2r22⎞⎠ (43) = f2(m2,P2)+|P2|(2ϵα1−ϵ)⋅σ22+2|P2|r22 (44) = (1+2ϵα1−ϵ)f