Privacy preserving clustering with constraints

02/07/2018 ∙ by Clemens Rösner, et al. ∙ University of Bonn

The k-center problem is a classical combinatorial optimization problem which asks to find k centers such that the maximum distance of any input point in a set P to its assigned center is minimized. The problem allows for elegant 2-approximations. However, the situation becomes significantly more difficult when constraints are added to the problem. We raise the question of whether general methods can be derived to turn an approximation algorithm for a clustering problem with some constraints into an approximation algorithm that respects one constraint more. Our constraint of choice is privacy: here, we are only allowed to open a center when at least ℓ clients will be assigned to it. We show how to combine privacy with several other constraints.


1 Introduction

Clustering is a fundamental unsupervised learning task: given a set of objects, partition them into clusters such that objects in the same cluster are well matched, while different clusters have something that clearly differentiates them. The three classical clustering objectives studied in combinatorial optimization are k-center, k-median and facility location. Given a point set P, k-center and k-median ask for a set of at most k centers and an assignment of the points in P to the selected centers that minimize an objective. For k-center, the objective is the maximum distance of any point to its assigned center. For k-median, it is the sum of the distances of all points to their assigned center (this is called connection cost). Facility location does not restrict the number of centers. Instead, every center (here called facility) has an opening cost. The goal is to find a set of centers such that the connection cost plus the opening cost of all chosen facilities is minimized. In the unconstrained versions, each point is assigned to its closest center. With the addition of constraints, a different assignment is often necessary in order to satisfy the constraints.
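For concreteness, the three objectives can be evaluated in a few lines of code. The following sketch is our own illustration (Euclidean distance stands in for a general metric, and all names are ours); it assumes the unconstrained setting where each point is assigned to its closest center:

```python
from math import dist  # Euclidean distance as a stand-in for a general metric d

def k_center_cost(points, centers):
    # maximum distance of any point to its closest chosen center
    return max(min(dist(p, c) for c in centers) for p in points)

def k_median_cost(points, centers):
    # sum of distances to the closest chosen center (the "connection cost")
    return sum(min(dist(p, c) for c in centers) for p in points)

def facility_location_cost(points, facilities, opening_cost):
    # connection cost plus the opening costs of all chosen facilities;
    # the number of open facilities is not restricted
    return k_median_cost(points, facilities) + sum(opening_cost[f] for f in facilities)
```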

A lot of research has been devoted to developing approximation algorithms for these three objectives. The earliest success story is that of k-center: Gonzalez [20] as well as Hochbaum and Shmoys [23] gave a 2-approximation algorithm for the problem, while Hsu and Nemhauser [24] showed that finding a better approximation is NP-hard.

Since then, much effort has been put into approximating the other two objectives. Typically, progress is made for facility location first, and transferring new techniques to k-median poses additional challenges. Significant techniques developed over the course of many decades are LP rounding techniques [10, 33], greedy and primal-dual methods [25, 26], local search algorithms [5, 29] and, more recently, the use of pseudo-approximation [32]. The currently best approximation ratio for facility location is 1.488 [31], while the best lower bound is 1.463 [21]. For k-median, the currently best approximation algorithm achieves a ratio of 2.675+ε [8], while the best lower bound is 1+2/e [25].

While the basic approximability of the objectives is well studied, a lot less is known once constraints are added to the picture. Constraints come naturally with many applications of clustering, and as machine learning and unsupervised learning methods become more and more popular, there is an increasing interest in this research topic. One of the troubles with approximation algorithms is that they are often harder to adapt to a different scenario than a simple heuristic for the problem, which was easier to understand and implement in the first place. Indeed, it turns out that adding constraints to clustering often requires fundamentally different techniques for the design of approximation algorithms and can be a new challenge altogether.

A good example for this is the capacity constraint: each center c is now equipped with a capacity u(c), and can only serve u(c) points. This natural constraint is notoriously difficult to cope with; indeed, the standard LP formulations for the problems have an unbounded integrality gap. Local search provides a way out for facility location, leading to 3- and 5-approximations for uniform [1] and non-uniform capacities [6], and preprocessing together with involved rounding proved sufficient for k-center to obtain a 9-approximation [14, 4]. However, the choice of techniques that turned out to work for capacitated clustering problems is still very limited, and indeed no constant factor approximation is known to date for capacitated k-median.

And all the while, new constraints for clustering problems are proposed and studied. In private clustering [2], we demand a lower bound ℓ on the number of points assigned to each center to ensure a certain anonymity. The more general form where each cluster has an individual lower bound is called clustering with lower bounds [3]. Fair clustering [13] assumes that points have a protected feature (like gender), modeled by a color, and asks for clusters that are fair in the sense that the ratios between points of different colors are the same for every cluster. Clustering with outliers [11] assumes that our data contains measurement errors and searches for a solution where a prespecified number of points may be excluded from the cost computation. Other constraints include fault tolerance [27], matroid or knapsack constraints [12], must-link and cannot-link constraints [34], diversity [30] and chromatic clustering constraints [17, 18].

The abundance of constraints and the difficulty of adjusting methods to each of them individually ask for ways to add a constraint to an approximation algorithm in an oblivious way. Instead of adjusting and reproving known algorithms, we would much rather take an algorithm as a black box and ensure that the solution satisfies one more constraint in addition. This is a challenging request. We start the investigation of such add-on algorithms by studying private clustering in more detail. Indeed, we develop a method to add the privacy constraint to approximation algorithms for constrained k-center problems. That means that we use an approximation algorithm as a subroutine and ensure that the final solution additionally respects a given lower bound. The method has to be adjusted depending on the constraint, but it is oblivious to the underlying approximation algorithm used for that constraint.

This works for the basic k-center problem (giving an algorithm for the private k-center problem), but we also show how to use the method when the underlying approximation algorithm is for k-center with outliers, fair k-center, capacitated k-center or fair capacitated k-center. We also demonstrate that our method suffices to approximate strongly private k-center, where we assume a protected feature like in fair clustering, but instead of fairness we now demand that a minimum number of points of each color is assigned to each open center, ensuring anonymity for each class individually.

Our Technique

The general structure of the algorithm is based on standard thresholding [23], i.e., the algorithm tests all possible thresholds and chooses the smallest for which it finds a feasible solution. For each threshold, it starts with the underlying algorithm and computes a non-private solution. Then it builds a suitable network to shift points between clusters so that the lower bounds are satisfied. The approximation ratio of the method depends on the underlying algorithm and on the structure of this network.
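To illustrate the thresholding paradigm in isolation, the following sketch (our own code, not the paper's method) applies it to the unconstrained k-center problem in the spirit of [23]: it tries every candidate threshold in increasing order and greedily checks whether k centers of radius 2τ suffice. Our method replaces this greedy check by the network-based procedure described next.

```python
def k_center_by_thresholding(points, k, d):
    # every possible solution value is a distance between two input points,
    # so it suffices to test these values as thresholds in increasing order
    candidates = sorted({d(p, q) for p in points for q in points})
    for tau in candidates:
        uncovered = set(points)
        centers = []
        while uncovered and len(centers) < k:
            c = next(iter(uncovered))   # open any uncovered point as a center
            centers.append(c)
            uncovered = {p for p in uncovered if d(p, c) > 2 * tau}
        if not uncovered:
            return tau, centers         # smallest feasible threshold: radius at most 2*tau
    return None
```

If τ is at least the optimal radius, any two opened centers are more than 2τ apart and therefore lie in different optimal clusters, so at most k centers are opened; hence the smallest feasible threshold yields a 2-approximation.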

The shifting does not necessarily work right away. If it does not produce a feasible solution, then, using the max-flow min-cut theorem, we obtain a set of points for which we can show that the clustering uses too many clusters (and can thus not satisfy the lower bounds). The algorithm then recomputes the solution in this part. Depending on the objective function, we have to overcome different hurdles to ensure that the recomputation works, in the sense that it (a) makes sufficient progress towards finding a feasible solution and (b) does not increase the approximation factor. The process is then iterated until we find a feasible solution.

Results

We obtain the following results for multiple combinations of privacy with other constraints. Note that our definition of k-center (see Section 2) distinguishes between the set P of points and the set L of possible center locations. This general case is also called the k-supplier problem, while classical k-center often assumes that P = L. Our reductions can handle the general case; whether the resulting algorithm is for k-center or k-supplier thus depends on the invoked underlying algorithm.

  • We obtain a 4-approximation for private k-center with outliers (5 for the supplier version). This matches the best known bounds [2] ([3] for the supplier version; the latter also holds for non-uniform lower bounds).

  • We compute an 11-approximation for private capacitated k-center (i.e., centers have a lower bound and an upper bound), and an 8-approximation for private uniform capacitated k-center (where the upper bounds are uniform, as well). Constant factor approximations for these two problems were previously given in [16]. For the supplier version we obtain a 13-approximation which matches the best known bound [16] (for uniform upper bounds, a better approximation algorithm is known [16]).

  • We achieve constant factor approximations for private fair capacitated/uncapacitated k-center/k-supplier clustering. The approximation factor depends on the balance of the input point set and on the type of upper bounds; it is smallest in the uncapacitated case where for each color the number of points with that color is an integer multiple of the number of points with the rarest color, and largest in the general supplier version with non-uniform upper bounds. To the best of our knowledge, none of these combinations have been studied before.

  • Along the way, we propose constant factor algorithms for general cases of fair clustering. While [13] introduces a pretty general model of fairness, it only derives approximation algorithms for inputs with two colors and a balance of 1/t for an integer t. We achieve ratios of 12 and 13 for the general fair k-center and k-supplier problem, respectively.

  • Finally, we propose the strongly private k-center problem. As in the fair clustering problem, the input has a protected feature like gender, modeled by colors. Instead of a fair clustering, we now aim for anonymity for each color, meaning that we have a lower bound ℓ_c for each color c: each open center needs to be assigned this minimum number of points of each color. To the best of our knowledge, this problem has not been studied before; we obtain constant factor approximations for it and for its supplier version.

Since our method does not require knowledge of the underlying approximation algorithm, the approximation guarantees improve whenever better approximation algorithms for the underlying problems are found. There is also hope that our method can be applied to new, not yet studied constraints without too much adjustment.

Related Work

             | vanilla | capacities, uniform | capacities, non-uniform | outliers | fair subset partition
k-center     | 2 [23]  | 6 [28]              | 9 [4]                   | 2 [9]    | 12 (Thm. 4.2; special cases in [13])
k-supplier   | 3 [23]  | 11 [4]              | 11 [4]                  | 3 [11]   | 13 (Thm. 4.2)

Table 1: An overview of the approximation results that we combine with privacy.

The unconstrained k-center problem can be 2-approximated [20, 23], and it is NP-hard to approximate it better [24]. The k-supplier problem can be 3-approximated [23], and this is also tight.

Capacitated k-center was first approximated with uniform upper bounds [7, 28]. Two decades after the first algorithms for the uniform case, [14] provided the first constant factor approximation for non-uniform capacities. The algorithm was improved and also applied to the k-supplier problem in [4]. In contrast to upper bounds (capacities), lower bounds are less studied. The private k-center problem is introduced and 2-approximated in [2], and non-uniform lower bounds are studied in [3]. The k-center/k-supplier problem with outliers is 3-approximated in [11], alongside approximations for other robust variants of the k-center problem. The approximation factor for the k-center problem with outliers was improved to 2 in [9].

The fair k-center problem was introduced in [13]. The paper describes how to approximate the problem by using an approximation algorithm for a subproblem that we call the fair subset partition problem. Algorithms for this subproblem are derived for two special cases where the number of colors is two, and the points are either perfectly balanced or the number of points of one color is an integer multiple of the number of points of the other color.

These are the constraints for which we make use of known results. We state the best known bounds and their references in Table 1. Approximation algorithms are also known, e.g., for fault tolerant k-center [27] and k-center with matroid or knapsack constraints [12].

Relatively little is known about combinations of constraints. Cygan and Kociumaka [15] give a 25-approximation for the capacitated k-center problem with outliers. Aggarwal et al. [2] give a 4-approximation for the private k-center problem with outliers. Ahmadian and Swamy [3] consider the combination of k-supplier with outliers with (non-uniform) lower bounds and derive a 5-approximation. The paper also studies the k-supplier problem with outliers (without lower bounds), and the min-sum-of-radii problem with lower bounds and outliers. Their algorithms are based on the Lagrangian multiplier preserving primal-dual method due to Jain and Vazirani [26].

Ding et al. [16] study the combination of capacities and lower bounds, as well as capacities, lower bounds and outliers, by generalizing the LP algorithms from [4] and [15] to handle lower bounds. They give results for several variations, including constant factor approximations for private capacitated k-center and private capacitated k-supplier.

Friggstad, Rezapour and Salavatipour [19] consider the combination of uniform capacities and non-uniform lower bounds for facility location and obtain bicriteria approximations.

Outline

In Section 2, we introduce necessary notation. Section 3 then presents our method, applied to the private k-center problem with outliers. We choose the outlier version since it is non-trivial but still intuitive and thus gives a good impression of the application of our method. In Section 4, we adjust the method to approximate private and fair k-center, private and capacitated k-center, and k-center with all three constraints. In Section 5, we consider the strongly private k-center problem. We conclude the paper in Section 6 with some remarks on private facility location.

2 Preliminaries

Let (X, d) be a finite metric space, i.e., X is a finite set and d: X × X → ℝ≥0 is a metric. We use d(x, Y) = min_{y∈Y} d(x, y) for the smallest distance between x ∈ X and a set Y ⊆ X. For two sets Y, Z ⊆ X, we use d(Y, Z) = min_{y∈Y, z∈Z} d(y, z) for the smallest distance between any pair.
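In code, these set distances are one-liners; a small helper sketch (names are ours, and d is assumed to be given as a function):

```python
def dist_point_to_set(d, x, Y):
    # d(x, Y): smallest distance between x and any element of Y
    return min(d(x, y) for y in Y)

def dist_set_to_set(d, Y, Z):
    # d(Y, Z): smallest distance between any pair in Y x Z
    return min(d(y, z) for y in Y for z in Z)
```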

Let P ⊆ X be a subset of X called points and let L ⊆ X be a subset of X called locations. An instance of a private assignment constrained k-center problem consists of P, L, an integer k, a lower bound ℓ ∈ ℕ and possibly more parameters. Given the input, the problem is to compute a set S ⊆ L of centers with |S| ≤ k and an assignment σ: P → S of the points to the selected centers that satisfies |σ^{-1}(c)| ≥ ℓ for every selected center c ∈ S, as well as some specific assignment restriction. The solution shall be chosen such that

max_{p∈P} d(p, σ(p))

is minimized. Different assignment restrictions lead to different constrained private k-center problems. The capacity assignment restriction comes with an upper bound function u: L → ℕ, for which we require u(c) ≥ ℓ for all c ∈ L, and then demands |σ^{-1}(c)| ≤ u(c) for all c ∈ S. When we have u(c) = u for all c ∈ L and some u ∈ ℕ, then we say that the capacities are uniform; otherwise, we say that they are non-uniform. The fairness assignment restriction provides a mapping col of the points to colors and then requires that each cluster has the same ratio between the numbers of points with different colors as the whole point set (see Section 4.2 for specifics). The strongly private k-center problem can also be cast as a k-center problem with an assignment restriction. Again, the input additionally contains a mapping of points to colors. Now the assignment is restricted to ensure that it satisfies the lower bound for the points of each color separately. We even consider the slight generalization where each color c has its own lower bound ℓ_c, and call this problem the strongly private k-center problem.
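The different assignment restrictions are easy to state as feasibility checks. The following sketch is our own illustration (`assign` maps each point to its chosen center, `colors` maps each point to its color); it checks the privacy, capacity, fairness and strong-privacy restrictions defined above:

```python
from collections import Counter
from fractions import Fraction

def cluster_members(assign):
    clusters = {}
    for p, c in assign.items():
        clusters.setdefault(c, []).append(p)
    return clusters

def is_private(assign, ell):
    # every opened center serves at least ell points
    return all(len(m) >= ell for m in cluster_members(assign).values())

def respects_capacities(assign, u):
    # every opened center c serves at most u[c] points
    return all(len(m) <= u[c] for c, m in cluster_members(assign).items())

def is_fair(assign, colors):
    # every cluster reproduces the global color ratios exactly
    total = Counter(colors.values())
    n = len(colors)
    for members in cluster_members(assign).values():
        local = Counter(colors[p] for p in members)
        if any(Fraction(local[col], len(members)) != Fraction(total[col], n) for col in total):
            return False
    return True

def is_strongly_private(assign, colors, ell_of_color):
    # every opened center serves at least ell_of_color[col] points of each color col
    for members in cluster_members(assign).values():
        local = Counter(colors[p] for p in members)
        if any(local[col] < bound for col, bound in ell_of_color.items()):
            return False
    return True
```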

An instance of the private k-center problem with outliers consists of P, L, an integer k, a lower bound ℓ, and a parameter o ∈ ℕ for the maximum number of outliers. The problem is to compute a set S ⊆ L of centers with |S| ≤ k, a set O ⊆ P of outliers with |O| ≤ o, and an assignment σ: P∖O → S of the points that are not outliers to the centers in S. The choice of S, O and σ shall minimize max_{p∈P∖O} d(p, σ(p)).

3 Private k-center with Outliers

Assume that there exists an approximation algorithm A for the k-center problem with outliers with approximation factor α.

Then for instances P, L, k, ℓ, o of the private k-center problem with outliers, we can compute an (α+2)-approximation in polynomial time.

Proof.

Below, we describe an algorithm that uses a threshold graph with threshold τ. We show that for any given τ, the algorithm has polynomial runtime and, if τ is equal to r_OPT, the value of the optimal solution, computes an (α+2)-approximation. Since we know that the value of every solution is equal to the distance between a point and a location, we test all possible distances for τ and return the best feasible clustering returned by any of them. The main proof is the proof of Lemma 3 below, which concludes this proof. ∎

We now describe the procedure for a fixed value of τ.

Assume that there exists an approximation algorithm A for the k-center problem with outliers with approximation factor α. Let P, L, k, ℓ, o be an instance of the private k-center problem with outliers, let τ > 0 and let r_OPT denote the maximum radius of the optimal feasible clustering for P, L, k, ℓ, o. We can in polynomial time compute a feasible clustering with a maximum radius of at most (α+2)τ or determine τ < r_OPT.

Proof.

The algorithm first uses A to compute a solution without the lower bound: Let (S, σ) be an α-approximate solution for the k-center problem with outliers on P, L, k, o. Notice that it can happen that σ induces clusters with fewer than ℓ points.

Let m = |S| (notice that m < k is possible), let O be the set of outliers, and let C_1, …, C_m be the clusters that σ induces, i.e., C_i = σ^{-1}(c_i) for S = {c_1, …, c_m}. Finally, let ρ = max_{p∈P∖O} d(p, σ(p)) be the largest distance of any point to its assigned center. Observe that an optimal solution to the k-center problem with outliers can only have a lower objective value than the optimal solution to our problem, because we only dropped a condition. Therefore, τ ≥ r_OPT implies ρ ≤ ατ. If we have ρ > ατ, we return τ < r_OPT.

We use C_1, …, C_m and O to create a threshold graph G_τ which we use to either reassign points between the clusters to obtain a feasible solution, or to find a set of points for which we can show that every feasible clustering with maximum radius τ uses fewer clusters than our current solution to cover it. In the latter case we compute another α-approximate solution which uses fewer clusters on this part of the input and repeat the process. Note that for τ < r_OPT such a clustering does not necessarily exist, but for τ ≥ r_OPT the optimal clustering provides such a solution with fewer clusters. If we do not find such a clustering with maximum radius at most ατ, we return τ < r_OPT.

We show that every iteration of the process reduces the number of clusters, or keeps the number of clusters and reduces the number of outliers; the pair (number of clusters, number of outliers) thus decreases lexicographically and the process stops after at most k·(o+1) iterations. It may happen that our final solution contains far fewer clusters than the optimal solution (but it will be an approximate solution for the optimal solution with k centers).

We will use a network flow computation to move points from clusters with more than ℓ points to clusters with fewer than ℓ points. Moving a point to another cluster can increase the radius of the cluster. We only want to move points between clusters such that the radius does not increase by too much. More precisely, we only allow a point to be moved to another cluster if the distance between the point and the cluster is at most 2τ. This is ensured by the structure of the network described in the next paragraph. Unless stated otherwise, when we refer to distances between a point and a cluster in the following, we mean the distance between the point and the cluster in its original state, before any points have been reassigned.
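The constant 2 in this moving rule comes from the triangle inequality: if the optimal solution (with radius at most τ) serves two points p and q from the same center c*, then

```latex
\[
  d(p,q) \;\le\; d(p,c^*) + d(c^*,q) \;\le\; \tau + \tau \;=\; 2\tau ,
\]
```

so every point is within distance 2τ of some point of the cluster it could join in a feasible solution, which is exactly what the edges of the network encode.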

Given C_1, …, C_m and O, we create the threshold graph G_τ as follows. G_τ consists of a source s, a sink t, a node v_i for each cluster C_i, a node v_O for the set of outliers and a node w_p for each point p ∈ P. For all i ∈ {1, …, m}, we connect s to v_i if the cluster C_i contains more than ℓ points and set the capacity of (s, v_i) to |C_i| − ℓ. If the cluster C_i contains fewer than ℓ points, we connect v_i with t and set the capacity of (v_i, t) to ℓ − |C_i|. Furthermore, we connect v_i with w_p for all p ∈ C_i and set the capacity of (v_i, w_p) to 1. We also connect s to v_O with capacity |O| and v_O with w_p for all p ∈ O with capacity 1. Whenever a point p and a cluster C_i with p ∉ C_i satisfy d(p, C_i) ≤ 2τ (i.e., there is a point q ∈ C_i that satisfies d(p, q) ≤ 2τ), we connect w_p with v_i with capacity 1.

Formally, the graph G_τ = (V, E) is defined by

V = {s, t, v_O} ∪ {v_i | 1 ≤ i ≤ m} ∪ {w_p | p ∈ P},   (1)
E_s = {(s, v_i) | |C_i| > ℓ} ∪ {(s, v_O)},   (2)
E_t = {(v_i, t) | |C_i| < ℓ},   (3)
E = E_s ∪ E_t ∪ {(v_i, w_p) | p ∈ C_i} ∪ {(v_O, w_p) | p ∈ O} ∪ {(w_p, v_i) | p ∉ C_i, d(p, C_i) ≤ 2τ}.   (4)

We define the capacity function cap: E → ℕ by

cap(e) = |C_i| − ℓ if e = (s, v_i), |O| if e = (s, v_O), ℓ − |C_i| if e = (v_i, t), and 1 otherwise.   (5)

We use G to refer to G_τ when τ is clear from context. We now compute an integral maximum s-t-flow f on G. According to f, we can reassign points to different clusters.
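The following self-contained sketch builds the threshold network exactly as described above and computes an integral maximum flow with a plain Edmonds-Karp procedure (our own code and naming, not the authors' implementation; clusters are nonempty lists of hashable points, and d is the metric):

```python
from collections import deque

def build_threshold_network(clusters, outliers, d, ell, tau):
    cap = {}
    def add_edge(u, v, c):
        cap.setdefault(u, {})[v] = c
    for i, C in enumerate(clusters):
        if len(C) > ell:
            add_edge("s", ("v", i), len(C) - ell)   # surplus C_i may give away
        elif len(C) < ell:
            add_edge(("v", i), "t", ell - len(C))   # deficit that must be filled
        for p in C:
            add_edge(("v", i), ("w", p), 1)
    add_edge("s", ("v", "O"), len(outliers))        # outliers may be absorbed
    for p in outliers:
        add_edge(("v", "O"), ("w", p), 1)
    points = list(outliers) + [p for C in clusters for p in C]
    for p in points:
        for i, C in enumerate(clusters):
            if p not in C and min(d(p, q) for q in C) <= 2 * tau:
                add_edge(("w", p), ("v", i), 1)     # p may move to C_i
    return cap

def max_flow(cap, s="s", t="t"):
    # plain Edmonds-Karp: BFS for augmenting paths in the residual network
    nodes = set(cap) | {v for u in cap for v in cap[u]}
    flow = {}                                        # units pushed on ordered pairs
    def res(u, v):
        return cap.get(u, {}).get(v, 0) - flow.get((u, v), 0) + flow.get((v, u), 0)
    while True:
        parent, queue = {s: None}, deque([s])
        while queue:
            u = queue.popleft()
            for v in nodes:
                if v not in parent and res(u, v) > 0:
                    parent[v] = u
                    queue.append(v)
        if t not in parent:
            return flow                              # no augmenting path left
        path, v = [], t                              # trace the path back to s
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        delta = min(res(u, w) for u, w in path)      # bottleneck capacity
        for u, w in path:
            flow[(u, w)] = flow.get((u, w), 0) + delta
```

Since all capacities are integral, the computed flow is integral; the reassignment follows the edges (("w", p), ("v", i)) with positive net flow, and the solution is feasible exactly when every deficit edge (("v", i), "t") carries net flow ℓ − |C_i|.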

Let f be an integral maximal s-t-flow on G. It is possible to reassign p to C_i for all edges (w_p, v_i) with f((w_p, v_i)) = 1.

The resulting solution has a maximum radius of at most (α+2)τ. If f saturates all edges of the form (v_i, t), then the solution is feasible.

Proof.

Let p ∈ P. The choice of capacity 1 on the edges entering w_p and flow conservation ensure

Σ_{i=1}^{m} f((w_p, v_i)) ≤ 1

for every p ∈ P. Therefore no point has to be reassigned to more than one cluster. Note that for every point p that is reassigned we must have f((v_j, w_p)) = 1 for the node v_j it hangs off (its own cluster, or v_O), and for every edge (w_p, v_i) with f((w_p, v_i)) = 1 the point p is reassigned to C_i.

For any i, let p ∉ C_i be a point which we want to reassign to C_i. Then we must have d(p, C_i) ≤ 2τ, and therefore there must be a point q ∈ C_i with d(p, q) ≤ 2τ. Thus we have

d(p, c_i) ≤ d(p, q) + d(q, c_i) ≤ 2τ + ατ = (α+2)τ.
Now assume that f saturates all edges of the form (v_i, t) and let i ∈ {1, …, m}. If G contains the edge (v_i, t), then it cannot contain the edge (s, v_i), and therefore all incoming edges of v_i are of the form (w_p, v_i). Flow conservation then implies that the number of points reassigned to C_i minus the number of points reassigned away from C_i is equal to f((v_i, t)) = ℓ − |C_i|, which increases the number of points in C_i to ℓ.

If G contains the edge (s, v_i), then it cannot contain the edge (v_i, t), and therefore all outgoing edges of v_i are of the form (v_i, w_p). Flow conservation then implies that the number of points reassigned away from C_i minus the number of points reassigned to C_i is equal to f((s, v_i)) ≤ |C_i| − ℓ, which reduces the number of points in C_i to at least ℓ.

If G contains neither (s, v_i) nor (v_i, t), then the number of points in C_i is equal to ℓ and does not change (the points may change, but their number does not).

In all three cases, C_i contains at least ℓ points after the reassignment. ∎

If f saturates all edges of the form (v_i, t) in G, then we reassign points according to Lemma 3 and return the new clustering.

Otherwise, we look at the residual network R_f of f on G. Let B be the set of nodes in R_f which cannot be reached from s. We say that a cluster C_i belongs to B if v_i ∈ B, and that a point p is adjacent to B if w_p ∈ B and the cluster containing p does not belong to B. Let C_B denote the set of clusters belonging to B and let k_B = |C_B|. We say that a point p belongs to B if the cluster C_i with p ∈ C_i belongs to B. Let P_B and A_B denote the set of points that belong to B and the set of points adjacent to B, respectively.
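Continuing the sketch from the network code above, the set B can be computed by a breadth-first search in the residual network (again our own illustration, reusing the cap/flow dictionaries):

```python
from collections import deque

def unreachable_from_source(cap, flow, s="s"):
    # B: all nodes that cannot be reached from s along residual edges
    nodes = set(cap) | {v for u in cap for v in cap[u]}
    def res(u, v):
        return cap.get(u, {}).get(v, 0) - flow.get((u, v), 0) + flow.get((v, u), 0)
    reached, queue = {s}, deque([s])
    while queue:
        u = queue.popleft()
        for v in nodes:
            if v not in reached and res(u, v) > 0:
                reached.add(v)
                queue.append(v)
    return nodes - reached   # clusters C_i with ("v", i) in this set belong to B
```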

Any clustering on P with maximum radius at most τ that contains at least ℓ points in every cluster uses fewer than k_B clusters to cover all points in P_B ∪ A_B.

Proof.

We first observe that B must have the following properties:

  • w_p ∈ B and f((w_p, v_i)) = 1 implies v_i ∈ B.

  • v_i ∈ B, p ∈ C_i and f((v_i, w_p)) = 1 implies w_p ∈ B.

  • w_p ∈ B for some p ∈ C_i and f((v_i, w_p)) = 0 implies v_i ∈ B.

The first property follows from the fact that f((w_p, v_i)) = 1 creates the residual edge (v_i, w_p), so every vertex that reaches v_i also reaches w_p; hence w_p ∈ B implies v_i ∈ B. The second property follows analogously from the residual edge (w_p, v_i) created by f((v_i, w_p)) = 1. The third property is true because an unsaturated edge (v_i, w_p) remains in the residual network, which means that w_p can be reached from any vertex that reaches v_i.

These properties imply that a reassignment due to Lemma 3 would reassign all points adjacent to B to clusters in C_B, and moreover all reassignments of points in P_B would be to clusters in C_B. Let ℓ_i denote the number of points that would be assigned to C_i ∈ C_B after the reassignment. Then Σ_{C_i ∈ C_B} ℓ_i = |P_B| + |A_B|.

Now we argue that this sum is smaller than k_B·ℓ by observing that each ℓ_i is at most ℓ and at least one ℓ_i is strictly smaller than ℓ.

Let C_i be a cluster with more than ℓ points after the reassignment. Then (s, v_i) is not saturated by f, and v_i can be reached from s in R_f. Therefore, after the reassignment, no cluster belonging to B contains more than ℓ points; in other words, v_i ∈ B implies ℓ_i ≤ ℓ.

Let C_i be a cluster which would still contain fewer than ℓ points after the reassignment. This implies that f does not saturate the edge (v_i, t). Therefore t can be reached from v_i in R_f, and since f is a maximum s-t-flow, v_i cannot be reached from s. We must have v_i ∈ B.

Because we assumed that the reassignment does not satisfy all lower bounds, at least one such cluster has to exist. This implies

Σ_{C_i ∈ C_B} ℓ_i = |P_B| + |A_B| < k_B · ℓ,

which means that the points in P_B ∪ A_B do not suffice to satisfy the lower bound in k_B clusters.

By definition of P_B and A_B, two points p ∈ P_B ∪ A_B and q ∉ P_B ∪ A_B must satisfy d(p, q) > 2τ. Let C be a clustering that abides by the lower bounds and has a maximum radius of at most τ. Then every cluster in C that contains at least one point from P_B ∪ A_B can only contain points from P_B ∪ A_B. Therefore, C must contain fewer than k_B clusters which contain at least one point from P_B ∪ A_B. ∎

If we have τ ≥ r_OPT, then Lemma 3 implies that the optimal solution covers all points in P_B ∪ A_B with fewer than k_B clusters. An α-approximate solution on the point set P_B ∪ A_B with at most k_B − 1 clusters which contains at most o outliers is then α-approximate for this part of the instance.

Unfortunately, we do not know how many outliers an optimal clustering has in P_B ∪ A_B. We therefore involve the outliers in our new computation as well. Let o′ denote the current number of outliers. We obtain the following lemma through a counting argument.

We call a cluster special if it contains at least one point from P_B ∪ A_B or only contains points from O. Let C be a clustering on P with a maximum radius of at most τ on all special clusters that respects the lower bounds, has at most o + o′ outliers and consists of at most k clusters, out of which at most k_B − 1 are special. If C has exactly k_B − 1 special clusters, then C has at most |P_B ∪ A_B ∪ O| − (k_B − 1)·ℓ outliers in P_B ∪ A_B ∪ O.

Proof.

Assume the clustering contains exactly k_B − 1 special clusters. Each of these clusters has to contain at least ℓ points from P_B ∪ A_B ∪ O, and contains no other points. We know

|P_B| + |A_B| < k_B · ℓ.

So there remain at most |P_B ∪ A_B ∪ O| − (k_B − 1)·ℓ unclustered points in P_B ∪ A_B ∪ O. ∎

Now we need to show that such a clustering exists if τ ≥ r_OPT is the case.

If τ ≥ r_OPT, then there exists a clustering on P with a maximum radius of at most τ on all special clusters that respects the lower bounds, has at most o + o′ outliers and consists of at most k clusters, out of which at most k_B − 1 are special.

Proof.

We look at an optimal clustering C*. The only way C* can violate a condition is if it contains k_B or more special clusters. Lemma 3 implies that C* contains fewer than k_B clusters that contain at least one point in P_B ∪ A_B, so any further special clusters contain only points in O. If all clusters in C* are special, we know that C* covers only points in P_B ∪ A_B ∪ O. We arbitrarily select clusters from C* that contain only points in O, declaring all points in them as outliers and closing the corresponding centers, until at most k_B − 1 special clusters remain. This leaves us with clusters which contain at least ℓ points each, and since the removed clusters contain only points from O, this leaves at most o + o′ outliers. Otherwise, if C* contains at least one cluster which is not special, we keep all outliers of C* as outliers. Again we arbitrarily select clusters from C* that contain only points in O, declaring all points in them as outliers and closing the corresponding centers, until exactly k_B − 1 special clusters remain. By creation, there are no unclustered points outside of P_B ∪ A_B ∪ O and exactly k_B − 1 special clusters with radius at most τ. Therefore this clustering contains at most o + o′ outliers and has at most k clusters. ∎

We now use A again to compute new solutions without the lower bound: Let C₁ be an α-approximate solution for the k-center problem with outliers on P_B ∪ A_B ∪ O, L, k_B − 1 and |P_B ∪ A_B ∪ O| − (k_B − 1)·ℓ allowed outliers, and let C₂ be an α-approximate solution for the k-center problem with outliers on P_B ∪ A_B ∪ O, L, k_B and o′ − 1 allowed outliers. Let ρ₁ and ρ₂ denote their maximum radii.

Note that in case τ < r_OPT, it can happen that no such clustering exists or that we obtain ρ_i > ατ for both C₁ and C₂. We then return τ < r_OPT. Otherwise, at least one C_i must exist together with ρ_i ≤ ατ.

If C₁ exists and we have ρ₁ ≤ ατ, we replace the clusters in C_B by C₁ in our current solution and adjust σ accordingly to obtain a solution C′ with fewer clusters and

max_{p∈P∖O′} d(p, σ′(p)) ≤ ατ.   (6)

Otherwise, if C₂ exists, we have ρ₂ ≤ ατ, and either C₁ does not exist or we have ρ₁ > ατ, we analogously replace the clusters in C_B by C₂ to obtain C′.

If we did not return τ < r_OPT, then C′ is a solution for the k-center problem with outliers on P, L, k, o and we have ρ(C′) ≤ ατ.

Proof.

C′ is a solution for the k-center problem with outliers on P, L, k, o with ρ(C′) = max{ρ(C∖C_B), ρ_i}, and since we did not return τ < r_OPT, we must have ρ_i ≤ ατ for the chosen i. ∎

We iterate the previous process with the new clustering until we either determine τ < r_OPT or the reassignment of points according to Lemma 3 yields a feasible solution. Since each iteration reduces the number of clusters, or keeps the same number of clusters and reduces the number of outliers, the process terminates after at most k·(o+1) iterations. ∎

We can compute a 4-approximation for instances of the private k-center problem with outliers and a 5-approximation for instances of the private k-supplier problem with outliers in polynomial time.

Proof.

This follows from Theorem 3 together with the 2-approximation for k-center with outliers in [9] and the 3-approximation for k-supplier with outliers in [11]. ∎

4 Combining Privacy with other Constraints

We want to take the general idea from Section 3 and, instead of outliers, combine privacy with other restrictions on the clusters. Given a specific restriction R and an approximation algorithm A for the k-center problem with restriction R with approximation factor α, we ask: Can we, similarly to Section 3, combine A with the use of a threshold graph to compute an approximation for the private k-center problem with restriction R?

In Section 3 we made use of two properties of a clustering with outliers. In Lemma 3 we used that reassigning points to another cluster never increases the number of outliers, and in Lemma 3 we used that outliers have the somewhat local property that computing a new clustering on the points from a subset of the clusters together with the set of outliers cannot create more outliers among the remaining points.

In this section, we take a look at restriction properties which are similarly local, and show how to combine them with privacy.

4.1 Privacy and Capacities

Assume that there exists an approximation algorithm A for the capacitated k-center problem with approximation factor α. Then we can compute an (α+2)-approximation for the private capacitated k-center problem in polynomial time.

Proof.

Let P, L, k, ℓ, u be an instance of the private capacitated k-center problem.

Analogous to Section 3, we use a threshold graph with threshold τ and show that for any given τ, the algorithm has polynomial runtime and, if τ is equal to r_OPT, the value of the optimal solution, computes an (α+2)-approximation. Since we know that the value of the optimal solution is equal to the distance between a point and a location, we test all possible distances for τ and return the best feasible clustering returned by any of them. The main proof is the proof of Lemma 4.1 below. The lemma then concludes the proof. ∎

We now describe the procedure for a fixed value of τ.

Assume that there exists an approximation algorithm A for the capacitated k-center problem with approximation factor α.

Let P, L, k, ℓ, u be an instance of the private capacitated k-center problem, let τ > 0 and let r_OPT denote the maximum radius of the optimal feasible clustering for P, L, k, ℓ, u. We can in polynomial time compute a feasible clustering with a maximum radius of at most (α+2)τ or determine τ < r_OPT.

Proof.

The algorithm first uses A to compute a solution without the lower bound: Let (S, σ) be an α-approximate solution for the capacitated k-center problem on P, L, k, u.

Again, let m = |S| with S = {c_1, …, c_m}, let C_1, …, C_m be the clusters that σ induces, i.e., C_i = σ^{-1}(c_i), and let ρ = max_{p∈P} d(p, σ(p)) be the largest distance of any point to its assigned center. If we have ρ > ατ, we return τ < r_OPT.

Given C_1, …, C_m, we create, similar to Section 3, a threshold graph G_τ = (V, E) by

V = {s, t} ∪ {v_i | 1 ≤ i ≤ m} ∪ {w_p | p ∈ P},   (7)
E_s = {(s, v_i) | |C_i| > ℓ}, E_t = {(v_i, t) | |C_i| < ℓ},   (8)
E = E_s ∪ E_t ∪ {(v_i, w_p) | p ∈ C_i} ∪ {(w_p, v_i) | p ∉ C_i, d(p, C_i) ≤ 2τ}.   (9)

We define the capacity function cap: E → ℕ by

cap(e) = |C_i| − ℓ if e = (s, v_i), ℓ − |C_i| if e = (v_i, t), and 1 otherwise.   (10)

The only difference to Section 3 is that we do not have any outliers. We use G to refer to G_τ when τ is clear from context. We now compute an integral maximum s-t-flow f on G. According to f, we can reassign points to different clusters. Note that such a reassignment never violates the upper bounds: points only move into deficit clusters, whose size ends up at exactly ℓ, and u(c) ≥ ℓ holds for all c ∈ L.
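For completeness, here is the capacitated variant of the network-building sketch from Section 3 (our own code, with the same conventions as before); the only change is the missing outlier node. The upper bounds need no edges of their own for the reason just stated:

```python
def build_threshold_network_capacitated(clusters, d, ell, tau):
    cap = {}
    def add_edge(u, v, c):
        cap.setdefault(u, {})[v] = c
    for i, C in enumerate(clusters):
        if len(C) > ell:
            add_edge("s", ("v", i), len(C) - ell)   # surplus C_i may give away
        elif len(C) < ell:
            add_edge(("v", i), "t", ell - len(C))   # deficit that must be filled
        for p in C:
            add_edge(("v", i), ("w", p), 1)
    for p in [p for C in clusters for p in C]:
        for i, C in enumerate(clusters):
            if p not in C and min(d(p, q) for q in C) <= 2 * tau:
                add_edge(("w", p), ("v", i), 1)     # p may move to C_i
    return cap
```

The max_flow sketch from Section 3 applies unchanged to this network.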

Analogous to Lemma 3, we obtain the following lemma. Let f be an integral maximal s-t-flow on G. It is possible to reassign p to C_i for all edges (w_p, v_i) with f((w_p, v_i)) = 1.

The resulting solution has a maximum radius of at most (α+2)τ. If f saturates all edges of the form (v_i, t), then the solution is feasible.

In case f saturates all edges of the form (v_i, t), we reassign points according to Lemma 4.1 and return the new clustering.

Otherwise, we look at the residual network R_f of f on G. We define B and k_B as before, i.e., B is the set of nodes in R_f which cannot be reached from s, and k_B is the number of clusters which belong to B. As before, we obtain the following lemma.

Any clustering on P with maximum radius at most τ that respects the lower bounds uses fewer than k_B clusters to cover all points in P_B ∪ A_B.

In case we have τ ≥ r_OPT, this implies that the optimal solution covers all points in P_B ∪ A_B with fewer than k_B clusters. An α-approximate solution on the point set P_B ∪ A_B with at most k_B − 1 clusters which abides only by the upper bounds is then α-approximate for this part of the instance.

We now use A again to compute a new solution without the lower bound: Let C′ be an α-approximate solution for the capacitated k-center problem on P_B ∪ A_B, L, k_B − 1, u, and let ρ′ denote its maximum radius. Note that in case τ < r_OPT, it can happen that no such clustering exists or that we obtain ρ′ > ατ. We then return τ < r_OPT.

Otherwise, we replace the clusters in C_B by C′ in our current solution and adjust σ accordingly to obtain a solution C″ with fewer clusters and

max_{p∈P} d(p, σ″(p)) ≤ ατ.   (11)

In case we did not return τ < r_OPT, C″ is a solution for the capacitated k-center problem on P, L, k, u and we have ρ(C″) ≤ ατ.

We iterate the previous process with the new clustering until we either determine τ < r_OPT or the reassignment of points according to Lemma 4.1 yields a feasible solution. Since the number of clusters is reduced in each iteration, the process terminates after at most k iterations. ∎

We can compute an 11-approximation for instances of the private capacitated k-center problem in polynomial time.

If the upper bounds are uniform, too, then we can compute an 8-approximation.

Proof.

This follows from Theorem 4.1 together with the 9-approximation for capacitated k-center in [4]. For uniform upper bounds, capacitated k-center can be 6-approximated [28], leading to a guarantee of 8. ∎

We can compute a 13-approximation for instances of the private capacitated k-supplier problem in polynomial time.

Proof.

This follows from Theorem 4.1 together with the 11-approximation for capacitated k-supplier in [4]. ∎

4.2 Privacy and Fairness

Fair clustering was introduced in [13]. The idea is that there are one or more protected features of the objects, and that the composition of all clusters should be fair with respect to the protected features. Formally, the protected features are modeled by colors. [13] defines fair clustering problems for the case of two colors, i.e., one binary protected feature.

We consider the general version with an arbitrary number of colors. Thus, in the fair version of the k-center problem, in addition to P, L and k, each point in P is colored. We denote the set of colors by Col and let col: P → Col assign the points to their colors. For a subset P′ ⊆ P and a color c ∈ Col, let P′_c = {p ∈ P′ | col(p) = c}. A clustering is considered fair if the ratios between points with different colors are the same in every cluster, i.e., for every pair of colors c, c′ ∈ Col and every cluster C_i, we have |(C_i)_c| / |(C_i)_{c′}| = |P_c| / |P_{c′}|.

Again, we adjust our method in order to apply it to the fair k-center problem and obtain the following result.

Assume that there exists an approximation algorithm A for the fair k-center problem with approximation factor α. Then for instances P, L, k, ℓ, col of the private and fair k-center problem, we can compute a (3α+2)-approximation in polynomial time.

Proof.

Analogous to Section 3, we use a threshold graph with threshold τ and show that for any given τ, the algorithm has polynomial runtime and, if τ is equal to r_OPT, the value of the optimal solution, computes a (3α+2)-approximation. Since we know that the value of the optimal solution is equal to the distance between a point and a location, we test all possible distances for τ and return the best feasible clustering returned by any of them. The main proof is the proof of Lemma 4.2 below. The lemma then concludes the proof. ∎

We now describe the procedure for a fixed value of τ.

Assume that there exists an approximation algorithm A for the fair k-center problem with approximation factor α.

Let P, L, k, ℓ, col be an instance of the private and fair k-center problem, let τ > 0 and let r_OPT denote the maximum radius of the optimal feasible clustering for P, L, k, ℓ, col. We can in polynomial time compute a feasible clustering with a maximum radius of at most (3α+2)τ or determine τ < r_OPT.

Proof.

The algorithm first uses A to compute a solution without the lower bound: Let (S, σ) be an α-approximate solution for the fair k-center problem on P, L, k, col.

Again, let m = |S| with S = {c_1, …, c_m}, let C_1, …, C_m be the clusters that σ induces, i.e., C_i = σ^{-1}(c_i), and let ρ = max_{p∈P} d(p, σ(p)) be the largest distance of any point to its assigned center.

If we have ρ > ατ, we return τ < r_OPT.

Reassigning a single point to a different cluster can result in both the old and the new cluster no longer being fair. Therefore, we unfortunately cannot simply create a threshold graph and move individual points from one cluster to another.

For every c ∈ Col, let q_c = |P_c| / G, where G denotes the greatest common divisor of the numbers |P_c|; then it is easy to see that in every feasible clustering, every cluster contains a multiple of Λ := Σ_{c∈Col} q_c points, exactly q_c · λ of them with color c for some integer λ.
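A small worked example (the numbers are ours, purely for illustration): with two colors and |P_red| = 6, |P_blue| = 4 we get G = gcd(6, 4) = 2, hence

```latex
\[
  q_{\mathrm{red}} = 6/2 = 3, \qquad q_{\mathrm{blue}} = 4/2 = 2,
  \qquad \Lambda = q_{\mathrm{red}} + q_{\mathrm{blue}} = 5 ,
\]
```

so every cluster of a fair clustering contains a multiple of 5 points, three fifths of them red.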

Instead of moving single points between clusters, we want to move sets which contain exactly q_c points of color c for every c ∈ Col, thus keeping the clustering fair.

A subset F ⊆ P is called a fair subset of P if for every c ∈ Col it contains exactly q_c points with color c, i.e., for all c ∈ Col we have |F_c| = q_c.

We use the fairness of σ to arbitrarily partition each cluster of our solution into fair sets, such that all points in the same set belong to the same cluster in σ. Let F_1, …, F_r denote these sets. By construction, the distance between any two points in the same set is at most 2ατ.
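Splitting one cluster of the fair solution into fair subsets can be done greedily per color; a sketch under the conventions of the earlier code (the fairness of the cluster guarantees that every color class has size exactly r·q[c]):

```python
from collections import defaultdict

def split_into_fair_subsets(cluster, colors, q):
    # q maps each color c to the number q_c of points of that color per fair subset
    by_color = defaultdict(list)
    for p in cluster:
        by_color[colors[p]].append(p)
    r = len(cluster) // sum(q.values())            # number of fair subsets in this cluster
    subsets = [[] for _ in range(r)]
    for c, pts in by_color.items():
        for j in range(r):
            subsets[j].extend(pts[j * q[c]:(j + 1) * q[c]])  # q[c] points of color c each
    return subsets
```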

Given C_1, …, C_m and F_1, …, F_r, we create the threshold graph G_τ = (V, E) similar to Section 3 by

V = {s, t} ∪ {v_i | 1 ≤ i ≤ m} ∪ {u_j | 1 ≤ j ≤ r},   (12)
E_s = {(s, v_i) | |C_i| > ℓ}, E_t = {(v_i, t) | |C_i| < ℓ},   (13)
E = E_s ∪ E_t ∪ {(v_i, u_j) | F_j ⊆ C_i} ∪ {(u_j, v_i) | F_j ⊄ C_i, d(F_j, C_i) ≤ 2τ}.   (14)

We define the capacity function cap: E → ℕ by

cap(e) = ⌊(|C_i| − ℓ)/Λ⌋ if e = (s, v_i), ⌈(ℓ − |C_i|)/Λ⌉ if e = (v_i, t), and 1 otherwise.   (15)

The difference to the threshold graph in Section 3 is that we do not have outliers and include nodes u_j for the fair sets instead of nodes for the points. We also changed the capacities, such that the capacity of an edge of the form (v_i, t) now represents how many additional fair sets C_i needs in order to satisfy the lower bound, while the capacity of an edge of the form (s, v_i) now represents how many fair sets C_i can give away and still contain at least ℓ points.

We use G to refer to G_τ when τ is clear from context. We now compute an integral maximum s-t-flow f on G. According to f, we can reassign fair subsets to different clusters.

Analogous to Lemma 3, we obtain the following lemma. Let f be an integral maximal s-t-flow on G. It is possible to reassign F_j to C_i for all edges (u_j, v_i) with f((u_j, v_i)) = 1.

The resulting solution has a maximum radius of at most (3α+2)τ. If f saturates all edges of the form (v_i, t), then the solution is feasible. Note that, in contrast to Lemma 3, we obtain a new radius of at most (3α+2)τ, because when we add a fair subset F_j to a cluster C_i, the maximum distance of a point in F_j to the point of F_j realizing d(F_j, C_i) is at most 2ατ.
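The constant in this radius bound can be traced as follows (a sketch of the calculation under the stated construction: points of a fair subset F_j share a cluster of the α-approximate solution, so they are within 2ατ of each other, and the edge (u_j, v_i) requires d(F_j, C_i) ≤ 2τ). For any p′ ∈ F_j, with p ∈ F_j and q ∈ C_i realizing d(F_j, C_i):

```latex
\[
  d(p', c_i) \;\le\; d(p', p) + d(p, q) + d(q, c_i)
            \;\le\; 2\alpha\tau + 2\tau + \alpha\tau \;=\; (3\alpha + 2)\,\tau .
\]
```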

In case f saturates all edges of the form (v_i, t), we reassign fair subsets according to Lemma 4.2 and return the new clustering.

Otherwise, we again look at the residual network