Fair Colorful k-Center Clustering

07/08/2020 · by Xinrui Jia, et al. · EPFL

An instance of colorful k-center consists of points in a metric space that are colored red or blue, along with an integer k and a coverage requirement for each color. The goal is to find the smallest radius ρ such that there exist balls of radius ρ around k of the points that meet the coverage requirements. The motivation behind this problem is twofold. First, from fairness considerations: each color/group should receive a similar service guarantee. Second, from the algorithmic challenges it poses: this problem combines the difficulties of clustering with those of the subset-sum problem. In particular, we show that this combination results in strong integrality gap lower bounds for several natural linear programming relaxations. Our main result is an efficient approximation algorithm that overcomes these difficulties to achieve an approximation guarantee of 3, nearly matching the tight approximation guarantee of 2 for the classical k-center problem, which this problem generalizes.



1 Introduction

In the colorful k-center problem introduced in [5], we are given a set of points in a metric space partitioned into a set R of red points and a set B of blue points, along with parameters k, r, and b. The goal is to find a set S of k centers that minimizes ρ so that balls of radius ρ around each point in S cover at least r red points and at least b blue points. More generally, the points can be partitioned into ω color classes with coverage requirements r_1, …, r_ω. To keep the exposition of our ideas as clean as possible, we concentrate the bulk of our discussion on the version with two colors. In Section 3 we show how our algorithm can be generalized to ω color classes, with an exponential dependence on ω in the running time, in a rather straightforward way, thus giving a polynomial-time algorithm for constant ω.

This generalization of the classic k-center problem has applications in situations where fairness is a concern. For example, if a telecommunications company is required to provide service to at least 90% of the people in a country, it would be cost effective to only provide service in densely populated areas. This is at odds with the ideal that at least some people in every community should receive service. In the absence of color classes, an approximation algorithm could be "unfair" to some groups by treating them entirely as outliers. The inception of fairness in clustering can be found in the recent paper [8] (see also [4, 1]), which uses a related but incomparable notion of fairness. Their notion of fairness requires each individual cluster to have a balanced number of points from each color class, which leads to very different algorithmic considerations and is motivated by other applications, such as "feature engineering".

The other motive for studying the colorful k-center problem derives from the algorithmic challenges it poses. One can observe that it generalizes the k-center problem with outliers, which is equivalent to only having red points and needing to cover at least r of them. This outlier version is already more challenging than the classic k-center problem: only recent results give tight 2-approximation algorithms [6, 12], improving upon the 3-approximation guarantee of [7]. In contrast, such algorithms for the classic k-center problem have been known since the '80s [13, 10]. That the approximation guarantee of 2 is tight, even for classic k-center, was proved in [14].

At the same time, a subset-sum problem with polynomial-sized numbers is embedded within the colorful k-center problem. To see this, consider numbers a_1, …, a_n together with a target t, and let C be a sufficiently large number. Construct an instance of the colorful k-center problem with r = t, b = kC − t, and, for every i ∈ {1, …, n}, a ball of radius one containing a_i red points and C − a_i blue points. These balls are assumed to be far apart, so that any single ball that covers two of these balls must have a very large radius. It is easy to see that the constructed colorful k-center instance has a solution of radius one if and only if there is a size-k subset of the numbers whose sum equals t.
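For contrast, subset-sum with polynomially bounded numbers is itself solvable in polynomial time by dynamic programming, which is also the workhorse of our dense-part algorithm in Section 2. A minimal sketch of the standard size-constrained dynamic program (names are ours):

```python
def size_k_subset_sum(numbers, k, target):
    """Decide whether some subset of exactly k of `numbers` sums to `target`.

    Textbook dynamic program over (subset size, partial sum); the table has
    (k + 1) * (target + 1) cells, so the running time is polynomial whenever
    the numbers, and hence the target, are polynomially bounded.
    """
    reachable = [[False] * (target + 1) for _ in range(k + 1)]
    reachable[0][0] = True
    for a in numbers:
        # iterate sizes and sums backwards so each number is used at most once
        for j in range(k, 0, -1):
            for s in range(target, a - 1, -1):
                if reachable[j - 1][s - a]:
                    reachable[j][s] = True
    return reachable[k][target]
```

By the construction above, deciding whether the constructed clustering instance has a solution of radius one is exactly such a size-constrained subset-sum question.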

We use this connection to subset-sum to show that the standard linear programming (LP) relaxation of the colorful k-center problem has an unbounded integrality gap even after a linear number of rounds of the powerful Lasserre/Sum-of-Squares hierarchy (see Section 4.1). We remark that the standard linear programming relaxation gives a 2-approximation algorithm for the outliers version even without applying lift-and-project methods. Another natural approach for strengthening the standard linear programming relaxation is to add flow-based inequalities specially designed to solve subset-sum problems. However, in Section 4.2, we prove that they do not improve the integrality gap, due to the clustering feature of the problem. This shows that clustering and the subset-sum problem are intricately related in colorful k-center. This interplay makes the problem more complex, and prior to our work only a randomized constant-factor approximation algorithm was known, and only when the points are in the Euclidean plane [5].

Our main result overcomes these difficulties, giving a nearly tight approximation guarantee:

Theorem 1.

There is a 3-approximation algorithm for the colorful k-center problem.

As mentioned above, our techniques can be easily extended to a constant number of color classes, but we restrict the discussion here to two colors.

On a very high level, our algorithm manages to decouple the clustering and the subset-sum aspects. First, our algorithm guesses certain centers of the optimal solution, which it then uses to partition the point set into a "dense" part and a "sparse" part. The dense part is clustered using a subset-sum instance, while the sparse part is clustered using the techniques of Bandyapadhyay, Inamdar, Pai, and Varadarajan [5] (see Section 2.1). Specifically, we use the pseudo-approximation of [5], which satisfies the coverage requirements using balls of at most twice the optimal radius.

While our approximation guarantee is nearly tight, it remains an interesting open problem to give a 2-approximation algorithm or to show that the ratio 3 is tight. One possible direction is to understand the strength of the relaxation obtained by combining the Lasserre/Sum-of-Squares hierarchy with the flow constraints. While we show that individually they do not improve the integrality gap, we believe that their combination can lead to a strong relaxation.

Independent work. Independently and concurrently to our work, the authors of [2] obtained a 4-approximation algorithm for the colorful k-center problem with a constant number of color classes, using different techniques from the ones described in this work. Furthermore, they show that, assuming P ≠ NP, if the number of color classes is allowed to be unbounded, then the colorful k-center problem admits no algorithm guaranteeing a finite approximation. They also show that, assuming the Exponential Time Hypothesis, colorful k-center is inapproximable when the number of color classes grows sufficiently quickly with the input size.

Organization. We begin by introducing notation and definitions and describing the pseudo-approximation algorithm of [5]. We then describe a 2-approximation algorithm for a certain class of well-separated instances; the 3-approximation follows almost immediately. This 2-approximation proceeds in two phases: the first is dedicated to guessing certain centers, while the second processes the dense and sparse sets.

Section 3 explains the generalization to ω color classes. In Section 4 we present our integrality gaps under the Sum-of-Squares hierarchy and under additional constraints derived from a flow network for solving subset-sums.

2 A 3-Approximation Algorithm

In this section we present our 3-approximation algorithm. We briefly describe the pseudo-approximation algorithm of Bandyapadhyay et al. [5], since we use it as a subroutine in our algorithm.

Notation: We assume that our problem instance is normalized to have an optimal radius of one, and we refer to the set of centers in an optimal solution as OPT. The set of all points at distance at most ρ from a point p is denoted by B(p, ρ), and we refer to this set as the ball of radius ρ at p. We write B(p) for B(p, 1). By a ball of OPT we mean B(c) for some c ∈ OPT.

2.1 The Pseudo-Approximation Algorithm

LP1

LP2

Figure 1: The linear programs used in the pseudo-approximation algorithm.

The algorithm of Bandyapadhyay et al. [5] first guesses the optimal radius for the instance (there are at most n² distinct values the optimal radius can take), which we assume by normalization to be one, and considers the natural LP relaxation LP1, depicted on the left in Figure 1. The variable x_j indicates how much point j is fractionally opened as a center, and z_i indicates the amount by which point i is covered by centers.
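For concreteness, the relaxation just described can be written as follows. This is our reconstruction from the description in the text, with B(i) the unit ball around point i, and R and B the red and blue point sets:

```latex
\begin{align*}
\text{(LP1)}\qquad \sum_{j \in B(i)} x_j &\ge z_i && \forall\, i \in P\\
\sum_{j \in P} x_j &\le k\\
\sum_{i \in R} z_i \ge r, \qquad \sum_{i \in B} z_i &\ge b\\
0 \le x_j,\ z_i &\le 1 && \forall\, i, j \in P
\end{align*}
```

A point may only be covered by the fractional centers within distance one of it, at most k centers may be opened in total, and the red and blue coverage requirements must be met fractionally.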

Given a fractional solution to LP1, the algorithm of [5] finds a clustering of the points. The clusters that are produced are of radius two and, with a simple modification (details can be found in Appendix B), can be made to have a special structure that we call a flower:

Definition 2.1.

For p ∈ P, the flower centered at p is the set F(p) = ⋃_{q ∈ B(p)} B(q).

More specifically, given a fractional solution (x, z) to LP1, the clustering algorithm in [5] produces a set S of points and a cluster C_s for every s ∈ S such that:

  1. The set S is a subset of the points with positive x-values.

  2. For each s ∈ S, we have C_s ⊆ F(s), and the clusters are pairwise disjoint.

  3. If we let b_s = |C_s ∩ B| and r_s = |C_s ∩ R| for s ∈ S, then the linear program LP2 (depicted on the right in Figure 1) has a feasible solution of value at least b.

As LP2 has only two non-trivial constraints, any extreme point will have at most two variables attaining strictly fractional values. So at most k + 1 variables of an extreme point are non-zero. The pseudo-approximation of [5] now simply takes those non-zero points as centers. Since each flower is of radius two, this gives a 2-approximation algorithm that opens at most k + 1 centers. (Note that, as the clusters are pairwise disjoint, at least b blue points are covered since the value of the solution is at least b, and at least r red points are covered by the red-coverage constraint of LP2.)

Obtaining a constant-factor approximation algorithm that only opens k centers turns out to be significantly more challenging. Nevertheless, the above techniques form an important subroutine in our algorithm. Given a fractional solution to LP1, we proceed as above to find S and an extreme point to LP2 of value at least b. However, instead of selecting all points with positive value, we, in the case of two fractional values, only select the one whose cluster covers more blue points. This gives us a solution of at most k centers whose clusters cover at least b blue points. Furthermore, the number of red points that are covered is at least r − max_{s ∈ S} r_s, since we disregarded at most one center. As S contains only points with positive x-values (see the first property above) and C_s ⊆ F(s) (see the second property above), we have r_s ≤ max_{p : x_p > 0} |F(p) ∩ R|. We summarize the obtained properties in the following lemma.

Lemma 2.2.

Given a fractional solution to LP1, there is a polynomial-time algorithm that outputs at most k clusters of radius two that cover at least b blue points and at least r − max_{p : x_p > 0} |F(p) ∩ R| red points.
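The rounding step behind this lemma, keeping all integrally opened centers and, among the at most two strictly fractional centers of the extreme point, only the one whose cluster covers more blue points, can be sketched as follows. This is a sketch with names of our own; `y` maps each candidate center to its LP2 value and `blue_count[s]` stands for b_s:

```python
def round_lp2_extreme_point(y, blue_count):
    """Round an extreme point y of LP2 (center -> value in [0, 1]) to at most k
    integral centers: open every center with value 1 and, among the at most two
    strictly fractional centers, open only the one covering the most blue points.
    """
    opened = [s for s, v in y.items() if v == 1]
    fractional = [s for s, v in y.items() if 0 < v < 1]
    # An extreme point of an LP with two non-trivial constraints has at most
    # two strictly fractional coordinates.
    assert len(fractional) <= 2
    if fractional:
        opened.append(max(fractional, key=lambda s: blue_count[s]))
    return opened
```

Discarding the other fractional center is what creates the red-point deficit of at most one flower that the rest of the algorithm must recover.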

We can thus find a 2-approximate solution that covers sufficiently many blue points but may cover fewer red points than necessary. The idea now is that, if the number of red points in any flower is not too large, then we can hope to meet the coverage requirement for the red points by increasing the radius around some opened centers. Our algorithm builds on this intuition to get a 2-approximation algorithm using at most k centers for well-separated instances, as defined below.

Definition 2.3.

An instance of colorful k-center is well-separated if there does not exist a ball of radius three that covers at least two balls of OPT.

Our main result of this section can now be stated as follows:

Theorem 2.

There is a 2-approximation algorithm for well-separated instances.

The above theorem immediately implies Theorem 1, i.e., the 3-approximation algorithm for general instances. Indeed, if the instance is not well-separated, we can find a ball of radius three that covers at least two balls of OPT by trying all points and running the pseudo-approximation of [5] on the remaining uncovered points with k − 2 centers. In the correct iteration, this gives us at most k − 1 centers of radius two, which, when combined with the ball of radius three that covers two balls of OPT, is a 3-approximation.
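This outer reduction can be sketched as a loop over candidate centers of the radius-three ball. The sketch below uses names of our own: `pseudo_approx(pts, k)` is a black box standing in for the algorithm of [5] (returning `None` when no feasible clustering is found), and `ball3(p)` stands for B(p, 3):

```python
def three_approx_outer_loop(points, k, pseudo_approx, ball3):
    """Guess the center p of a radius-three ball covering two balls of OPT,
    remove its points, and run the pseudo-approximation with k - 2 centers
    on the remainder; return the best solution found over all guesses.
    """
    best = None
    for p in points:
        rest = [q for q in points if q not in ball3(p)]
        centers = pseudo_approx(rest, k - 2)  # at most k - 1 centers of radius two
        if centers is not None:
            sol = [p] + centers               # plus the one radius-three ball
            if best is None or len(sol) < len(best):
                best = sol
    return best
```

In the correct iteration the guessed ball covers two balls of OPT, so the remaining points admit a solution with k − 2 optimal centers.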

Our algorithm for well-separated instances proceeds in two phases, with the objective of finding a subset of the points on which the pseudo-approximation algorithm produces subsets of flowers containing not too many red points. In addition, we maintain a partial solution set of centers (some guessed in the first phase), so that we can expand the radius around these centers to recover the deficit of red points caused by closing one of the fractional centers.

2.2 Phase I

In this phase we will guess some balls of OPT that can be used to bound the number of red points in any flower. To achieve this, we define the notion of Gain for any pair of points c and p.

Definition 2.4.

For any c ∈ P and p ∈ B(c), let Gain(c, p) = R ∩ (F(p) ∖ B(c)) be the set of red points added to B(c) by forming a flower centered at p.

Our algorithm in this phase proceeds by guessing three centers of the optimal solution OPT:

[hidealllines=true, backgroundcolor=gray!15] For i = 1, 2, 3, guess the center c_i in OPT and calculate the point p_i such that the number of red points in Gain(c_i, p_i) is maximized over all possible p ∈ B(c_i).

The time it takes to guess c_1, c_2, and c_3 is O(n³), and for each c_i we find the p_i such that |Gain(c_i, p_i)| is maximized by trying all points in B(c_i) (at most n many).

For notation, define and let

The important properties guaranteed by the first phase are summarized in the following lemma.

Lemma 2.5.

Assuming that c_1, c_2, c_3 and p_1, p_2, p_3 are guessed correctly, we have that

  1. the balls of radius one in are contained in and cover blue points and red points; and

  2. the three clusters , and are contained in and cover at least blue points and at least red points.

Proof.

1) We claim that the intersection of any ball of with in is empty, for all . Then the balls in satisfy the statement of (1). To prove the claim, suppose that there is such that for some . Note that , so this implies that , for some . Hence, a ball of radius three around covers both and as , which contradicts that the instance is well-separated.

2) Note that for , , and that and Gain() are disjoint. The balls cover at least blue points and red points, while . ∎

2.3 Phase II

Throughout this section we assume c_1, c_2, c_3 and p_1, p_2, p_3 have been guessed correctly in Phase I, so that the properties of Lemma 2.5 hold. Furthermore, by the selection of the p_i and the definition of Gain, we also have

|Gain(c, p)| ≤ |Gain(c_3, p_3)| for every c ∈ OPT ∖ {c_1, c_2, c_3} and p ∈ B(c). (1)

This bounds the number of red points that any remaining ball of OPT can gain from being extended to a flower. However, to apply Lemma 2.2 we need the number of red points in the whole flower to be bounded. To deal with balls containing many more red points than this bound, we will iteratively remove dense sets from the point set to obtain a subset of sparse points.

Definition 2.6.

When considering a subset of the points , we say that a point is dense if the ball contains strictly more than red points of . For a dense point , we also let contain those points whose intersection contains strictly more than red points of .

We remark that in the above definition, we have in particular that for a dense point . Our iterative procedure now works as follows:

[hidealllines=true, backgroundcolor=gray!15] Initially, let and . While there is a dense point : Add to and update by removing the points .
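The iterative removal can be sketched as follows. This is a simplified sketch with names of our own: `dist` is the metric, `radius` and the threshold `tau` stand for the parameters fixed by the definition of dense points and the Gain bound (1), and we remove the whole neighborhood of a dense point, whereas the paper's removed set is defined more carefully:

```python
def split_dense_sparse(points, is_red, dist, radius, tau):
    """Iteratively peel off dense regions: while some remaining point p has
    strictly more than tau red points within `radius` of it, record p and
    remove its neighborhood.  Returns (dense_points, removed_sets, sparse).
    """
    sparse = set(points)
    dense_points, removed_sets = [], {}
    while True:
        # find any remaining dense point (sorted for determinism)
        p = next((q for q in sorted(sparse)
                  if sum(1 for x in sparse
                         if is_red[x] and dist(q, x) <= radius) > tau), None)
        if p is None:
            break
        ball = {x for x in sparse if dist(p, x) <= radius}
        dense_points.append(p)
        removed_sets[p] = ball
        sparse -= ball
    return dense_points, removed_sets, sparse
```

Each iteration strictly shrinks the remaining point set, so the loop terminates after at most n iterations.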

Let P_dense denote those points that were removed from P_sparse. We will cluster the two sets P_dense and P_sparse separately. Indeed, the following lemma says that a center of OPT either covers points in P_dense or in P_sparse, but not points from both sets. Recall that D(p) denotes the set of points that are removed from P_sparse in the iteration when the dense point p was selected, and so P_dense is the union of the sets D(p).

Lemma 2.7.

For any and any , either or .

Proof.

Let , , and suppose . If , there is a point in the intersection for some . Suppose first that . Then, since , the intersection contains fewer than red points from (recall that contains the points of in at the time was selected). But by the definition of dense points, has more than red points, so has more than red points. This region is a subset of , which contradicts (1). This is shown in Figure 2(a). Now consider the second case when and there is a point in the intersection for some and . Then, by the definition of , has more than red points of . However, this is also a subset of so we reach the same contradiction. See Figure 2(b). ∎

(a)

(b)
Figure 2: The shaded regions are subsets of Gain(c,p), which contain the darkly shaded regions that have red points.

Our algorithm now proceeds by guessing the number k_d of balls of OPT contained in P_dense. We also guess the numbers r_d and b_d of red and blue points, respectively, that these balls cover in P_dense. Note that after guessing k_d, the number of balls of OPT contained in P_sparse is determined. Furthermore, by the first property of Lemma 2.5, these balls cover the remaining blue and red requirements in P_sparse. As there are O(n³) possible values of k_d, r_d, and b_d (each can take a value between 0 and n), we can try all possibilities by increasing the running time by a multiplicative factor of O(n³). Henceforth, we therefore assume that we have guessed those parameters correctly. In that case, we show that we can recover an equally good solution for P_dense and a solution for P_sparse that covers enough blue points and almost enough red points:

Lemma 2.8.

There exist two polynomial-time algorithms A_dense and A_sparse such that if k_d, r_d, and b_d are guessed correctly, then

  • A_dense returns k_d balls of radius one that cover b_d blue points of P_dense and r_d red points of P_dense;

  • A_sparse returns balls of radius two that cover at least the remaining requirement of blue points of P_sparse and nearly the remaining requirement of red points of P_sparse.

Proof.

We first describe and analyze the algorithm A_dense, followed by A_sparse.

The algorithm A_dense for the dense point set P_dense.

By Lemma 2.7, we have that all balls of OPT that cover points in P_dense are centered at points in P_dense. Furthermore, each removed set D(p) contains at most one center of OPT. This is because every q ∈ D(p) is close to the dense point p and so, by the triangle inequality, a ball of radius three around p contains all balls B(c) with c ∈ D(p). Hence, by the assumption that the instance is well-separated, the set D(p) contains at most one center of OPT.

We now reduce our problem to a 2-dimensional subset-sum problem. For each dense point p, form a group consisting of an item for each q ∈ D(p). The item corresponding to q has the 2-dimensional value vector recording the numbers of red and blue points of P_dense covered by B(q). Our goal is to find items, at most one per group, whose 2-dimensional vectors sum up to (r_d, b_d). Such a solution, if it exists, can be found by standard dynamic programming with a table of polynomial size. For completeness, we provide the recurrence and precise details of this standard technique in Appendix A. Furthermore, since the sets D(p) are disjoint by definition, this gives centers that cover b_d blue points and r_d red points in P_dense, as required in the statement of the lemma.
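The dynamic program deferred to Appendix A can be sketched as follows. This is a simplified sketch with names of our own; unlike the paper's table, it does not additionally track the number of chosen items:

```python
def group_subset_sum_2d(groups, target):
    """Pick at most one (red, blue) item per group so that the chosen vectors
    sum to exactly `target` = (R, B).  One table entry per reachable partial
    sum, so the table size is O(#groups * R * B).  Returns a list of
    (group_index, item) pairs, or None if no such selection exists.
    """
    R, B = target
    # states maps a reachable partial sum to one selection achieving it
    states = {(0, 0): []}
    for gi, group in enumerate(groups):
        new_states = dict(states)  # option: take nothing from this group
        for (r, b), chosen in states.items():
            for item in group:
                nr, nb = r + item[0], b + item[1]
                if nr <= R and nb <= B and (nr, nb) not in new_states:
                    new_states[(nr, nb)] = chosen + [(gi, item)]
        states = new_states
    return states.get((R, B))
```

Since the groups correspond to the disjoint sets D(p), a selection summing to (r_d, b_d) translates directly into a set of centers with the required coverage.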

It remains to show that such a solution exists. Let c'_1, …, c'_{k_d} denote the centers of the balls in OPT that cover points in P_dense, and let D(p_1), …, D(p_{k_d}) be the sets such that c'_j ∈ D(p_j) for each j. Notice that by Lemma 2.7 we have that each B(c'_j) is disjoint from P_sparse and contained in P_dense. It follows that the 2-dimensional vector corresponding to the item of such a center records exactly the red and blue points of P_dense that its ball covers. Therefore, the sum of the vectors corresponding to c'_1, …, c'_{k_d} results in the vector (r_d, b_d), where we used that our guesses of k_d, r_d, and b_d were correct.

The algorithm A_sparse for the sparse point set P_sparse.

Assuming that the guesses are correct, we have that P_sparse contains balls of OPT that cover the remaining blue and red requirements. Hence, LP1 has a feasible solution to the instance defined by the point set P_sparse, the remaining number of balls, and the remaining constraints on the numbers of blue and red points to be covered, respectively. Lemma 2.2 then says that we can in polynomial time find balls of radius two such that at least the required number of blue points of P_sparse is covered, and the number of red points covered falls short of the requirement by at most the number of red points in one flower. Here, the flower refers to the flower restricted to the point set P_sparse.

To prove the second part of Lemma 2.8, it is thus sufficient to show that LP1 has a feasible solution where x_p = 0 for all p whose flower contains too many red points of P_sparse. In turn, this follows by showing that, for any such p, no point of B(p) lies in a ball of OPT (since then x_p = 0 in the integral solution corresponding to OPT). Such a feasible solution can be found by adding the constraint x_p = 0 for all such p to LP1.

To see why this holds, suppose towards a contradiction that there is such a p with a point of B(p) in some ball B(c) of OPT. First, since there are no dense points in P_sparse, the number of red points of P_sparse in B(c) is bounded. Therefore the number of red points of P_sparse in F(p) ∖ B(c), i.e., in Gain(c, p), is strictly larger than the bound of (1), a contradiction. ∎

Equipped with the above lemma, we are now ready to finalize the proof of Theorem 2.

Proof of Theorem 2.

Our algorithm guesses the optimal radius and the centers c_1, c_2, c_3 in Phase I, and k_d, r_d, b_d in Phase II. There are at most n² choices of the optimal radius, n choices for each guessed center, and O(n³) choices of the Phase II parameters (each ranging from 0 to n). We can thus try all these possibilities in polynomial time and, since all other steps of our algorithm run in polynomial time, the total running time is polynomial. The algorithm tries all these guesses and outputs the best solution found over all choices. For the correct guesses, we output a solution with k balls of radius at most two. Furthermore, by the second property of Lemma 2.5 and the two properties of Lemma 2.8, we have that

  • the number of blue points covered is at least b; and

  • the number of red points covered is at least r.

We have thus given a polynomial-time algorithm that returns a solution where the balls are of radius at most twice the optimal radius. ∎

3 Constant Number of Colors

Our algorithm extends easily to a constant number ω of color classes with coverage requirements r_1, …, r_ω. We use the LPs in Fig. 3 for a general number of colors, where in LP2 the coefficient of a cluster records the number of points of each color class in that cluster, and S is the set of cluster centers obtained by applying the modified clustering algorithm of Appendix B to instances with ω color classes. LP2 has only ω non-trivial constraints, so any extreme point has at most ω variables attaining strictly fractional values, and a feasible solution attaining objective value at least r_1 will have at most k + ω − 1 positive values. By rounding up to 1 the fractional value of the center whose cluster contains the most points of the first color class, we can cover r_1 points of that class. We would like to be able to close the remaining fractional centers, so we apply a procedure analogous to the one for just two colors.

LP1

LP2

Figure 3: Linear programs for color classes.

We can guess centers of OPT for each of the colors whose coverage requirements remain to be satisfied. Then we bound the number of points of each color that may be found in a cluster by removing dense sets that contain too many points of any one color, and running a dynamic program on the removed sets. The final step is to run the clustering algorithm of [5] on the remaining points, rounding to one the fractional center whose cluster has the most points of the first color class, and closing all other fractional centers.

In particular, we get a running time with a factor exponential in the number of color classes. The remainder of this section gives a formal description of the algorithm for ω color classes.

3.1 Formal Algorithm for ω colors

The following is a natural generalization of Lemma 2.2 and summarizes the main properties of the clustering algorithm of Appendix B for instances with ω color classes.

Lemma 1.

Given a fractional solution to LP1, there is a polynomial-time algorithm that outputs at most k + ω − 1 clusters of radius two that cover at least r_1 points of the first color class, and nearly r_i points of each color class i = 2, …, ω.

Since we may not meet the coverage requirements for ω − 1 of the color classes, it is necessary to guess some balls of OPT for each of those colors, and for each fractional center. In total we guess a constant number of points of OPT per color class, as follows:

[hidealllines=true, backgroundcolor=gray!15] For , for guess the center in and calculate the point such that is maximized over all possible , where

This guessing takes polynomially many rounds for constant ω. It is possible that some of the guessed points coincide, but this does not affect the correctness of the algorithm. In fact, this can only improve the solution, in the sense that the coverage requirements will be met with fewer than k centers. Let t denote the number of distinct points obtained in the correct guess. For notation, define

To be consistent with previous notation, let

The important properties guaranteed by the first phase can be summarized in the following lemma whose proof is the natural extension of Lemma 2.5.

Lemma 2.

Assuming that are guessed correctly, we have that

  1. the balls of radius one in are contained in and cover of points in and points of for ; and

  2. the clusters are contained in and cover at least points of and at least points of .

Now we need to remove points whose neighborhoods contain many points from any one of the color classes, so as to partition the instance into dense and sparse parts. This leads to the following generalized definition of dense points.

Definition 4.

When considering a subset of the points , we say that a point is -dense if . For a -dense point , we also let contain those points such that , for every .

Now we perform a similar iterative procedure as for two colors:

[hidealllines=true, backgroundcolor=gray!15] Initially, let and . While there is a -dense point for any : Add to and update by removing the points .

As in the case of two colors, let P_dense denote the removed points. By naturally extending Lemma 2.7 and its proof, we can ensure that any ball of OPT is completely contained in either P_dense or P_sparse. We guess the number of balls of OPT contained in P_dense, and guess the numbers of points of each color class that these balls cover in P_dense. There are polynomially many possible values for constant ω, and all the possibilities can be tried by increasing the running time by a polynomial multiplicative factor. The number of balls of OPT contained in P_sparse is then determined, and these balls cover at least the remaining requirement of each color class in P_sparse.

Assuming that the parameters are guessed correctly we can show, similar to Lemma 2.8, that the following holds.

Lemma 4.

There exist two polynomial-time algorithms A_dense and A_sparse such that if the parameters are guessed correctly, then

  • A_dense returns balls of radius one that cover the guessed numbers of points of each color class in P_dense;

  • A_sparse returns balls of radius two that cover at least the remaining requirement of the first color class in P_sparse and nearly the remaining requirement of each other color class in P_sparse.

The algorithm A_dense proceeds as in the two-color case, with the modification that the dynamic program is now ω-dimensional. Algorithm A_sparse is also similar, because LP1 has a feasible solution where x_p = 0 for all p whose flower contains too many points of any one color class. Hence, we output a solution with balls of radius at most two, and

  • the number of points of the first color class covered is at least r_1; and

  • the number of points of color class i covered is at least r_i, for all i = 2, …, ω.

This gives a polynomial-time algorithm for colorful k-center with a constant number of color classes.

4 LP Integrality Gaps

In this section, we present two natural ways to strengthen LP1 and show that they both fail to close the integrality gap, providing evidence that clustering and knapsack feasibility cannot be decoupled in the colorful k-center problem. On the one hand, the Sum-of-Squares hierarchy is ineffective for knapsack problems, while on the other hand, adding knapsack constraints to LP1 is insufficient due to the clustering aspect of the problem.

4.1 Sum-of-Squares Integrality Gap

The Sum-of-Squares hierarchy (equivalently, Lasserre [16, 17]) is a method of strengthening linear programs that has been used in constraint satisfaction problems, set cover, and graph coloring, to name just a few examples [3, 9, 18]. We use the same notation for the Sum-of-Squares hierarchy, abbreviated SoS, as in Karlin et al. [15]. For a set V of variables, P(V) is the power set of V and P_t(V) is the collection of subsets of V of size at most t. Their succinct definition of the hierarchy makes use of the shift operator: for two vectors indexed by P(V), the shift operator produces the vector such that

Analogously, for a polynomial we have . In particular, we work with the linear inequalities so that the polytope to be lifted is

Let be a collection of subsets of and a vector in . The matrix is indexed by elements of such that

We can now define the -th SoS lifted polytope.

Definition 4.1.

For any , the -th SoS lifted polytope is the set of vectors such that , , and for all .

A point belongs to the -th SoS polytope if there exists such that for all .

We use a reduction from Grigoriev's SoS lower bound for knapsack [11] to show that the following instance has a fractional solution with small radius that remains feasible for a linear number of rounds of SoS.

Theorem 3 (Grigoriev).

At least rounds of SoS are required to recognize that the following polytope contains no integral solution for odd.

Consider an instance of colorful k-center with two colors in which the points are grouped into an odd number of clusters of radius one. For odd i, cluster i has three red points and one blue point, and for even i, cluster i has one red point and three blue points. A picture is shown in Figure 4. In an optimal integer solution, one center needs to cover at least 2 of these clusters, while a fractional solution satisfying LP1 can fractionally open a center around each cluster of radius one. Hence, LP1 has an unbounded integrality gap, since the clusters can be arbitrarily far apart. This instance takes an odd number of copies of the integrality gap example given in [5].

Figure 4: Integrality gap example for linear rounds of SoS

We can define a simple mapping from a feasible solution for the t-th round of SoS on the system of equations in Theorem 3 to our variables in the t-th round of SoS on LP1 for this instance, demonstrating that the infeasibility of balls of radius one is not recognized. More precisely, we assign a variable to each pair of clusters of radius one as shown in Figure 4, corresponding to opening each cluster in the pair by the same amount. Then a fractional opening of balls of radius one can be mapped to variables that satisfy the polytope in Theorem 3. The remainder of this subsection is dedicated to formally describing the reduction from Theorem 3.
Let denote the set of variables used in the polytope defined in Theorem 3. Let be in the -th round of SoS applied to the system in Theorem 3 so that is indexed by subsets of of size at most . Let , where and , be the set of variables used in LP1 for the instance shown in Figure 4. We define vector with entries indexed by subsets of , and show that is in the -th SoS lifting of LP1. In each ball we pick a representative , , to indicate how much the ball is opened, so we set if , . Otherwise, we set where

We have , and for and , for since satisfies the -th round of SoS. This implies that

is the zero matrix.

To show that