Fair k-Center Clustering for Data Summarization

01/24/2019 · by Matthäus Kleindessner, et al. · Rutgers University

In data summarization we want to choose k prototypes in order to summarize a data set. We study a setting where the data set comprises several demographic groups and we are restricted to choose k_i prototypes belonging to group i. A common approach to the problem without the fairness constraint is to optimize a centroid-based clustering objective such as k-center. A natural extension then is to incorporate the fairness constraint into the clustering objective. Existing algorithms for doing so run in time super-quadratic in the size of the data set. This is in contrast to the standard k-center objective that can be approximately optimized in linear time. In this paper, we resolve this gap by providing a simple approximation algorithm for the k-center problem under the fairness constraint with running time linear in the size of the data set and k. If the number of demographic groups is small, the approximation guarantee of our algorithm only incurs a constant-factor overhead. We demonstrate the applicability of our algorithm on both synthetic and real data sets.


1 Introduction

Machine learning algorithms have been rapidly adopted in numerous human-centric domains, from personalized advertising to lending to health care. Fast on the heels of this ubiquity have come a whole host of concerning behaviors from these algorithms: facial recognition has higher accuracy on white, male faces (Buolamwini & Gebru, 2017); DUI arrest help advertisements are shown more regularly to minority profiles (Sweeney, 2013); and criminal recidivism tools are more likely to label African-American low-risk defendants as high-risk (Angwin et al., 2016). There are also several examples of unsavory ML behavior pertaining to unsupervised learning tasks, whether considering the bias evident in word2vec embeddings (Bolukbasi et al., 2016) or the gender imbalance of CEO image search results (Kay et al., 2015). Most of the academic work on fairness in machine learning, however, has investigated how to solve classification tasks subject to various constraints on the behavior of a classifier on different demographic groups (e.g., Hardt et al., 2016; Zafar et al., 2017; Agarwal et al., 2018).

This paper adds to the literature on fair methods for unsupervised learning tasks (see Section 4 for related work). We consider the problem of data summarization (Hesabi et al., 2015) through the lens of algorithmic fairness. The goal of data summarization is to output a small but representative subset of a data set. Think of an image database and a user entering a query that is matched by many images. Rather than presenting the user with all matching images, we only want to show a summary. In such an example, a data summary can be quite unfair to a demographic group. Indeed, Google Images has been found to answer the query “CEO” with a much higher fraction of images of men compared to the real-world fraction of male CEOs (Kay et al., 2015).

One approach to the problem of data summarization is provided by centroid-based clustering, such as $k$-center (formally defined in Section 2) or $k$-medoid (Hastie et al., 2009, Section 14.3.10; sometimes referred to as $k$-median). For a centroid-based clustering objective, an optimal clustering of a data set $X$ can be defined by $k$ points $c_1, \dots, c_k \in X$, called centroids, such that the clusters are formed by assigning every $x \in X$ to its closest centroid. Since the centroids are good representatives of their clusters, the set of centroids can be used as a summary of $X$. This approach of data summarization via clustering is used in numerous domains: in the social sciences (Bartholomew et al., 2008; Cameron & Trivedi, 2010), in psychology (Borgen & Barnett, 1987; Campbell et al., 2008), and in biology (Eisen et al., 1998).

If the data set $X$ comprises several demographic groups $X_1, \dots, X_m$, we may consider a summary to be fair only if the groups are represented fairly: if in the real world 70% of CEOs are male and we want to output ten images for the query “CEO”, then three of the ten images should show women. Formally, this can be encoded with one parameter $k_i$ for every group $X_i$. Our goal is then to minimize the clustering objective under the constraint that $k_i$ many centroids belong to $X_i$. A constraint of this form can also enforce balanced summaries: even if in the real world there are more male CEOs than female ones, we might want to output an equal number of male and female images to reflect that gender is not definitional to the role of CEO.

Centroid-based clustering under such a constraint has been studied in the theoretical computer science literature (see Sections 2 and 4). However, existing approximation algorithms for this problem run in time super-quadratic in the size $n$ of the data set, while the unconstrained $k$-center clustering problem can be approximated in time linear in $n$. Since data summarization is particularly useful for massive data sets, such a slowdown may be practically prohibitive. The contribution of this paper is a simple approximation algorithm for $k$-center clustering under our fairness constraint with running time only linear in $n$ and the number of centers $k$. The improved running time comes at the price of a worse guarantee on the approximation factor when the number $m$ of demographic groups is large. However, note that in practical situations concerning fairness, the number of groups is often quite small (e.g., when the groups encode gender or race). Furthermore, in our extensive numerical simulations we never observed a large approximation factor, even when the number of groups was large (cf. Section 5), indicating the practical usefulness of our algorithm.

Outline of the paper   In Section 2, we formally state the $k$-center and the fair $k$-center problem. In Section 3, we present our algorithm and provide a sketch of its analysis. The full proofs can be found in Appendix A. We discuss related work in Section 4 and present a number of experiments in Section 5. Further experiments can be found in Appendix B. We conclude with a discussion in Section 6.

2 Definition of $k$-Center and Fair $k$-Center

Let $X$ be a finite data set of size $n$ and $d: X \times X \to \mathbb{R}_{\geq 0}$ be a metric on $X$. In particular, we assume $d$ to satisfy the triangle inequality. The standard $k$-center clustering problem is the following minimization problem:

$$\min_{S \subseteq X:\, |S| = k}\; \max_{x \in X}\; \min_{c \in S} d(x, c), \qquad (1)$$

where $k \in \mathbb{N}$ is a given parameter. The elements of $S$ are called centers. Any set of centers defines a clustering of $X$ by assigning every $x \in X$ to its closest center. The $k$-center problem is NP-hard and is also NP-hard to approximate to a factor better than 2 (Gonzalez, 1985; Vazirani, 2001, Chapter 5). The famous greedy strategy of Gonzalez (1985) is a 2-approximation algorithm with running time $O(kn)$ if we assume that $d$ can be evaluated in constant time (this is the case, e.g., if a problem instance is given via the distance matrix $(d(x, y))_{x, y \in X}$). This algorithm randomly selects an element of the data set as first center and then iteratively adds as the next center the point with maximum distance to the current set of centers.
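As a concrete illustration, here is a minimal Python sketch of this greedy strategy (our own code, not from the paper), representing an instance by its distance matrix D with D[x][y] = d(x, y):

import random

def gonzalez_greedy(D, k, seed=0):
    """Greedy 2-approximation for k-center; O(kn) given the n-by-n matrix D."""
    rng = random.Random(seed)
    n = len(D)
    centers = [rng.randrange(n)]          # first center: a random point
    dist = list(D[centers[0]])            # dist[x] = distance of x to current centers
    for _ in range(k - 1):
        c = max(range(n), key=lambda x: dist[x])   # farthest point so far
        centers.append(c)
        dist = [min(dist[x], D[c][x]) for x in range(n)]
    return centers

def kcenter_cost(D, centers):
    """Objective value of (1) for a given set of centers."""
    return max(min(D[c][x] for c in centers) for x in range(len(D)))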

We consider a fair variant of $k$-center as described in Section 1. Our variant also allows the user to specify a given subset $C \subseteq X$ that has to be included in the set of centers (think of the example of the image database and the case that we always want to show five prespecified images as part of the summary). Assuming that $X = \dot{\bigcup}_{i=1}^m X_i$, where $X_1, \dots, X_m$ are the demographic groups, the fair $k$-center problem can be stated as the minimization problem

$$\min_{S \subseteq X:\, |S \cap X_i| = k_i,\; i = 1, \dots, m}\; \max_{x \in X}\; \min_{c \in S \cup C} d(x, c), \qquad (2)$$

where $k_1, \dots, k_m \in \mathbb{N}_0$ with $\sum_{i=1}^m k_i = k$ and $C \subseteq X$ are given. By means of a partition matroid, the fair $k$-center problem can be phrased as a matroid center problem, for which Chen et al. (2016) provide a 3-approximation algorithm using matroid intersection (e.g., Cook et al., 1998). Chen et al. (2016) do not discuss the running time of their algorithm, but it requires sorting all distances between elements in $X$ and hence has running time at least $\Omega(n^2 \log n)$. In our experiments in Section 5 we observe a running time that grows superquadratically in $n$.
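For concreteness, the reduction uses the standard partition matroid construction (our notation; the paper does not spell it out here): the feasible center sets for (2) are exactly the bases of

$$\mathcal{M} = (X, \mathcal{I}), \qquad \mathcal{I} = \left\{ S \subseteq X \,:\, |S \cap X_i| \leq k_i \text{ for } i = 1, \dots, m \right\},$$

that is, the independent sets of maximum size $k = \sum_{i=1}^m k_i$; the prespecified centers $C$ are not counted against any quota and are handled separately.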

Notation   For $l \in \mathbb{N}$, we sometimes use $[l] = \{1, \dots, l\}$.

3 A Linear-time Approximation Algorithm

1:  Input: metric $d$; $k \in \mathbb{N}$; set $C \subseteq X$ of initially given centers
2:  Output: $S \subseteq X$ with $|S| = k$
3:  set $S = \emptyset$
4:  for $i = 1$ to $k$
5:     choose $c \in \arg\max_{x \in X} \min_{c' \in S \cup C} d(x, c')$   (if $S \cup C = \emptyset$, choose $c \in X$ uniformly at random)
6:     set $S = S \cup \{c\}$
7:  return $S$
Algorithm 1 Approximation algorithm for (3)

In this section, we present our approximation algorithm for the minimization problem (2). It is a recursive algorithm with respect to the number of groups $m$. To increase comprehensibility, we first present the case of two groups and then the general case of an arbitrary number of groups.

At several points, we will consider the standard (unfair) $k$-center problem (1) generalized to the case of initially given centers $C \subseteq X$, that is

$$\min_{S \subseteq X:\, |S| \leq k}\; \max_{x \in X}\; \min_{c \in S \cup C} d(x, c). \qquad (3)$$

We can adapt the greedy strategy of Gonzalez (1985) while maintaining its 2-approximation guarantee for (3). For the sake of completeness, we provide the algorithm as Algorithm 1 and state the following lemma:

Lemma 1.

Algorithm 1 is a 2-approximation algorithm for the unfair $k$-center problem (3) with running time $O((k + |C|)\, n)$, assuming $d$ can be evaluated in constant time.

A proof of Lemma 1, similar in structure to a proof in Har-Peled (2011, Section 4.2) for the strategy of Gonzalez (1985) for problem (1), can be found in Appendix A.

3.1 Fair $k$-Center with Two Groups

Assume that $m = 2$. Our algorithm first runs Algorithm 1 for the unfair problem (3) with $k = k_1 + k_2$ and the given set $C$. If we are lucky and Algorithm 1 picks $k_1$ many centers from $X_1$ and $k_2$ many centers from $X_2$, our algorithm terminates. Otherwise, Algorithm 1 picks too many centers from one group, say $X_1$, and too few from $X_2$. We try to decrease the number of centers in $X_1$ by replacing any such center with an element of $X_2$ in its cluster. Once we have made all such available swaps, the remaining clusters with centers in $X_1$ are entirely contained within $X_1$. We then run Algorithm 1 on the union of these clusters with $k_1$ and the centers from $X_2$ as well as $C$ as initial centers, and return both the centers from this second call and those from the initial call, filling up with arbitrary elements of $X_2$ if needed.

This algorithm is formally stated as Algorithm 2. The following theorem states that it is a 5-approximation algorithm and that our analysis is tight: in general, Algorithm 2 does not achieve a better approximation factor.

1:  Input: metric $d$; $k_1, k_2 \in \mathbb{N}_0$ with $k_1 + k_2 = k$; set $C \subseteq X$ of initially given centers; group-membership vector encoding membership in $X_1$ or $X_2$
2:  Output: $S \subseteq X$ with $|S \cap X_1| = k_1$ and $|S \cap X_2| = k_2$
3:  run Algorithm 1 on $(X, d)$ with $k$ and $C$; let $S$ denote its output
4:  
5:  if $|S \cap X_1| = k_1$       # implies $|S \cap X_2| = k_2$
6:     return $S$
7:  # we assume $|S \cap X_1| > k_1$; otherwise we switch the roles of $X_1$ and $X_2$
8:  form clusters $L_c$, $c \in S$, by assigning every $x \in X$ to its closest center in $S$
9:  while $|S \cap X_1| > k_1$ and there exists $L_c$ with center $c \in X_1$ and $L_c \cap X_2 \neq \emptyset$
10:     replace center $c$ with some $x \in L_c \cap X_2$ by setting $S = (S \setminus \{c\}) \cup \{x\}$
11:  
12:  if $|S \cap X_1| = k_1$       # implies $|S \cap X_2| = k_2$
13:     return $S$
14:  let $\tilde{X} = \bigcup_{c \in S \cap X_1} L_c$       # we have $\tilde{X} \subseteq X_1$
15:  run Algorithm 1 on $(\tilde{X}, d)$ with $k_1$ and $C \cup (S \cap X_2)$ as initial centers; let $S'$ denote its output
16:  return $S' \cup (S \cap X_2)$ as well as $k_2 - |S \cap X_2|$ many arbitrary elements from $X_2 \setminus S$
Algorithm 2 Approximation algorithm for (2) when $m = 2$
Theorem 1.

Algorithm 2 is a 5-approximation algorithm for the fair $k$-center problem (2) with $m = 2$, but not a $(5 - \varepsilon)$-approximation algorithm for any $\varepsilon > 0$. It can be implemented in time $O((k + |C|)\, n)$, assuming $d$ can be evaluated in constant time.

Proof.

Here we only present a sketch of the proof. The full proof can be found in Appendix A. For showing that Algorithm 2 is a 5-approximation algorithm, let $r^*$ be the optimal value of (2) and $\hat{r}$ be the optimal value of (3) (for $k = k_1 + k_2$). Clearly, $\hat{r} \leq r^*$. Let $S$ be the set of centers returned by Algorithm 2. It is clear that $S$ comprises $k_1$ many elements from $X_1$ and $k_2$ many elements from $X_2$. We need to show that $d(x, S \cup C) \leq 5 r^*$ for every $x \in X$. Let $S_0$ be the output of Algorithm 1 when called in Line 3 of Algorithm 2. Since Algorithm 1 is a 2-approximation algorithm for (3) according to Lemma 1, we have $d(x, S_0 \cup C) \leq 2 r^*$, $x \in X$. Assume that $|S_0 \cap X_1| > k_1$. It follows from the triangle inequality that after exchanging centers in the while-loop in Line 9 of Algorithm 2 we have $d(x, S \cup C) \leq 4 r^*$, $x \in X$ (a swapped-in center lies within $2 r^*$ of the center it replaces). Assume that still $|S \cap X_1| > k_1$. We only need to show that $d(x, S \cup C) \leq 5 r^*$, $x \in \tilde{X}$. Let $S^*$ be an optimal solution to (2). We split $\tilde{X}$ into two subsets $\tilde{X}_1, \tilde{X}_2$, where $\tilde{X}_1$ comprises all $x \in \tilde{X}$ for which the closest center in $S^* \cup C$ is in $X_1$. Using the triangle inequality we can show that $d(x, S \cup C) \leq 5 r^*$, $x \in \tilde{X}_2$. We partition $\tilde{X}_1$ into at most $k_1$ many clusters corresponding to the closest center in $S^* \cap X_1$. Each of these clusters has diameter not greater than $2 r^*$. If Algorithm 1 in Line 15 of Algorithm 2 chooses one element from each of these clusters, we immediately have $d(x, S \cup C) \leq 2 r^*$, $x \in \tilde{X}_1$. Otherwise, Algorithm 1 chooses an element from $\tilde{X}_2$ or two elements from the same cluster of $\tilde{X}_1$. In both cases, it follows from the greedy choice property of Algorithm 1 that $d(x, S \cup C) \leq 5 r^*$, $x \in \tilde{X}_1$.

A family of examples shows that Algorithm 2 is not a $(5 - \varepsilon)$-approximation algorithm for any $\varepsilon > 0$. ∎
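The following is a condensed, non-authoritative Python sketch of Algorithm 2, reusing the distance-matrix convention from the Gonzalez sketch in Section 2. Here `greedy` plays the role of Algorithm 1 for problem (3): it restricts both candidates and coverage to `pool` and respects initially given centers `C`. All names are ours, and corner cases (ties, quotas exceeding group sizes) are ignored for brevity.

import random

def greedy(D, k, C=(), pool=None, seed=0):
    """Algorithm 1: greedy k-center on `pool` with initial centers C."""
    rng = random.Random(seed)
    pool = list(range(len(D))) if pool is None else list(pool)
    dist = {x: min((D[c][x] for c in C), default=float("inf")) for x in pool}
    S = []
    for _ in range(k):
        if all(v == float("inf") for v in dist.values()):
            c = rng.choice(pool)                  # no centers yet: random start
        else:
            c = max(pool, key=lambda x: dist[x])  # farthest-point choice
        S.append(c)
        for x in pool:
            dist[x] = min(dist[x], D[c][x])
    return S

def fair_two_groups(D, g, k1, k2, C=()):
    """Sketch of Algorithm 2; g[x] in {0, 1} is the group of point x."""
    n, quota = len(D), {0: k1, 1: k2}
    S = greedy(D, k1 + k2, C)
    if sum(g[c] == 0 for c in S) == k1:
        return S
    a = 0 if sum(g[c] == 0 for c in S) > k1 else 1   # over-represented group
    b = 1 - a
    cluster = {x: min(S, key=lambda c: D[c][x]) for x in range(n)}
    for c in [c for c in S if g[c] == a]:            # the while-loop in Line 9
        if sum(g[s] == a for s in S) <= quota[a]:
            break
        swap = [x for x in range(n) if cluster[x] == c and g[x] == b]
        if swap:
            S[S.index(c)] = swap[0]
    if sum(g[s] == a for s in S) == quota[a]:
        return S
    # remaining clusters with a group-a center contain only group-a points
    X_tilde = [x for x in range(n) if cluster[x] in S and g[cluster[x]] == a]
    kept_b = [s for s in S if g[s] == b]
    S_prime = greedy(D, quota[a], C=tuple(C) + tuple(kept_b), pool=X_tilde)
    fill = [x for x in range(n) if g[x] == b and x not in kept_b]
    return S_prime + kept_b + fill[: quota[b] - len(kept_b)]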

3.2 Fair -Center with Arbitrary Number of Groups

The main idea for handling an arbitrary number of groups $m$ is the same as for the case $m = 2$: we first run Algorithm 1. We then exchange centers for elements in their clusters in such a way that the number of centers from a group $X_i$ comes closer to $k_i$, the requested number of centers from $X_i$. If via exchanging centers we can actually hit $k_i$ for every group $X_i$, we are done. Otherwise, we wish that, when no more exchanging is possible, we are left with a subset $\tilde{X}$ of the data set that only comprises elements from $m - 1$ or fewer groups. Denote the set of these groups by $G$. We also wish that for those groups not in $G$ we have picked only the requested number of centers or fewer, so that the groups not in $G$ can be considered “resolved”. If both are true, we can recursively apply our algorithm to $\tilde{X}$ and a smaller number of groups. We might recurse down to the case of only one group, which we can solve with Algorithm 1.

Figure 1: An example illustrating the need for a more sophisticated procedure for exchanging centers in the case of three or more groups compared to the case of only two groups: we would like to exchange a center from the blue group for an element from the red group, but cannot do that directly. Rather, we have to make a series of exchanges.

The difficulty with this idea comes from the exchanging process. Formally, we are given centers $S$ and the corresponding clustering $X = \bigcup_{c \in S} L_c$, where we write $\tilde{X}$ for the union of the clusters with a center in $\bigcup_{i \in G} X_i$, and we want to exchange some centers for an element in their cluster such that there exists a strict subset $G \subsetneq [m]$ of groups with the following properties:

$$\tilde{X} \subseteq \bigcup_{i \in G} X_i, \qquad (4)$$

$$|S \cap X_i| \leq k_i \quad \text{for every } i \notin G. \qquad (5)$$

While in the case of only two groups this can easily be achieved by exchanging centers from the group that has more than the requested number of centers for elements from the other group, as we do in Algorithm 2, it is not immediately clear how to deal with a situation as shown in Figure 1. There are three groups (elements of these groups are shown in blue, green, and red, respectively), and the blue group has one center too many while the red group has one too few. For the current set of centers (elements at the centers of the circles) there does not exist $G$ satisfying (4) and (5). We would like to decrease the number of centers in the blue group and increase the number of centers in the red group, but the clusters with a blue center do not comprise an element from the red group. Hence, we cannot directly exchange a blue center for a red element. Rather, we first have to exchange a blue center for a green element (although this increases the number of green centers beyond the requested number) and then a green center for a red element. An algorithm that can deal with such a situation is Algorithm 3. It exchanges some centers for an element in their cluster and yields $G$ that provably satisfies (4) and (5), as stated by the following lemma. Its proof can be found in Appendix A.

1:  Input: centers $S$ and the corresponding clustering $X = \bigcup_{c \in S} L_c$; $k_1, \dots, k_m \in \mathbb{N}_0$ with $\sum_{i=1}^m k_i = k$; group-membership vector
2:  Output: $S$, where some centers have been replaced with an element in their cluster, and $G \subsetneq [m]$ satisfying (4) and (5)
3:  set $s_i = |S \cap X_i|$ for $i \in [m]$
4:  construct a directed unweighted graph $H$ on $[m]$ as follows: we have $(i, j) \in E(H)$, that is there is a directed edge from $i$ to $j$, if and only if there exists $L_c$ with center $c \in X_i$ and $L_c \cap X_j \neq \emptyset$
5:  compute all shortest paths on $H$
6:  
7:  while $s_i > k_i$ for some $i \in [m]$ and there exist $i, j \in [m]$ such that $s_i > k_i$ and $s_j < k_j$ and there exists a shortest path $P = (i_1, \dots, i_l)$, with $i_1 = i$ and $i_l = j$, that connects $i$ to $j$ in $H$
8:     for $t = 1$ to $l - 1$
9:        find $L_c$ with center $c \in X_{i_t}$ and some $x \in L_c \cap X_{i_{t+1}}$; replace $c$ with $x$ by setting $S = (S \setminus \{c\}) \cup \{x\}$
10:     update $s_1, \dots, s_m$
11:     recompute $H$ and all shortest paths on $H$
12:  
13:  if $s_i = k_i$ for all $i \in [m]$
14:     return $S$ and $G = \emptyset$
15:  else
16:     set $G = \{i \in [m] : \text{there exists } j \in [m] \text{ with } s_j > k_j \text{ and a path from } j \text{ to } i \text{ in } H\}$
17:     return $S$ and $G$
Algorithm 3 Algorithm for exchanging centers & finding $G$
Lemma 2.

Algorithm 3 is well-defined, it terminates, and exchanges centers in such a way that the set $G$ that it returns satisfies $G \subsetneq [m]$ and properties (4) and (5).

Observing that the number of iterations of the while-loop in Line 7 is upper-bounded by $k$, as the proof of Lemma 2 shows, that the number of iterations of the for-loop in Line 8 is upper-bounded by $m$, and that all shortest paths on $H$ can be computed in time $O(m^3)$ (Cormen et al., 2009, Chapter 25), it is not hard to see that Algorithm 3 can be implemented with running time $O(k(mn + m^3))$.
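The graph bookkeeping behind Algorithm 3 can be sketched as follows (a minimal sketch with our own names and data layout, where clusters are given as a {center: members} dict; we use a single-pair BFS per query instead of precomputing all shortest paths, which suffices for the check in Line 7 since H is unweighted; the swap loop itself is omitted):

from collections import deque

def build_H(clusters, g, m):
    """Directed graph on groups: edge i -> j iff a cluster with a group-i
    center contains a point of group j."""
    edges = {i: set() for i in range(m)}
    for c, members in clusters.items():
        for x in members:
            if g[x] != g[c]:
                edges[g[c]].add(g[x])
    return edges

def shortest_path(edges, src, dst):
    """BFS in the unweighted graph H; returns a src -> dst path or None."""
    parent = {src: None}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        if u == dst:
            path = []
            while u is not None:
                path.append(u)
                u = parent[u]
            return path[::-1]
        for v in edges[u]:
            if v not in parent:
                parent[v] = u
                queue.append(v)
    return None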

Using Algorithm 3, it is straightforward to design a recursive approximation algorithm for the fair $k$-center problem (2) as outlined at the beginning of Section 3.2. We state the algorithm as Algorithm 4. Applying, by means of induction, a similar technique as in the proof of Theorem 1 to every (recursive) call of Algorithm 4, we can prove the following:

1:  Input: metric $d$; $k_1, \dots, k_m \in \mathbb{N}_0$ with $\sum_{i=1}^m k_i = k$; set $C \subseteq X$ of initially given centers; group-membership vector
2:  Output: $S \subseteq X$ with $|S \cap X_i| = k_i$ for all $i \in [m]$
3:  run Algorithm 1 on $(X, d)$ with $k$ and $C$; let $S$ denote its output
4:  if $|S \cap X_i| = k_i$ for all $i \in [m]$
5:     return $S$
6:  
7:  form clusters $L_c$, $c \in S$, by assigning every $x \in X$ to its closest center in $S$
8:  apply Algorithm 3 to $S$ and the clustering $(L_c)_{c \in S}$ in order to exchange some centers and obtain $G$
9:  if $G = \emptyset$
10:     return $S$
11:  
12:  let $\tilde{X} = \bigcup_{c \in S \cap (\bigcup_{i \in G} X_i)} L_c$ and recursively call Algorithm 4, where:

  • $\tilde{X}$ plays the role of $X$

  • we assign elements of $C \cup \bigcup_{i \notin G} (S \cap X_i)$ to an arbitrary group in $G$ and hence there are $|G|$ many groups

  • the requested numbers of centers are $(k_i)_{i \in G}$

  • $C \cup \bigcup_{i \notin G} (S \cap X_i)$ plays the role of the initially given centers

let $S'$ denote its output
13:  return $S' \cup \bigcup_{i \notin G} (S \cap X_i)$ as well as $k_i - |S \cap X_i|$ many arbitrary elements from $X_i \setminus S$ for every group $X_i$ not in $G$
Algorithm 4 Approximation alg. for (2) for arbitrary $m$
Theorem 2.

Algorithm 4 is a $(3 \cdot 2^{m-1} - 1)$-approximation algorithm for the fair $k$-center problem (2) with $m$ groups. It can be implemented in time $O(k m^2 n + k m^4)$, assuming $d$ can be evaluated in constant time.

It is not clear to us whether our analysis of Algorithm 4 is tight, that is, whether the approximation factor achieved by Algorithm 4 can indeed be as large as $3 \cdot 2^{m-1} - 1$, or whether the dependence on $m$ is actually less severe (compare with Sections 5 and 6). Although trying hard to find instances for which the approximation factor of Algorithm 4 is large, we never observed a factor anywhere near this bound.

Lemma 3.

Algorithm 4 is not a $(5 - \varepsilon)$-approximation algorithm for any $\varepsilon > 0$ for (2) with $m \geq 2$ groups.

The proofs of Theorem 2 and Lemma 3 are in Appendix A.

4 Related Work

Fairness   By now, there is a huge body of work on fairness in machine learning. For a recent overview of the literature on fair classification see Donini et al. (2018). Our paper adds to the literature on fair methods for unsupervised learning tasks (Chierichetti et al., 2017; Celis et al., 2018a, b, c; Samadi et al., 2018; Schmidt et al., 2018). Note that all these papers, just as ours, assume that the demographic group of each data point is known. We discuss the two works most closely related to our paper.

First, Celis et al. (2018b) also deal with the problem of fair data summarization. They study the same fairness constraint on the summary as we do, that is, the summary must contain $k_i$ many elements from group $X_i$. However, while we aim for a representative summary, where every data point should be close to at least one center in the summary, Celis et al. aim for a diverse summary. Their approach requires the data set to consist of points in $\mathbb{R}^d$, and the diversity of a subset is then measured by the volume of the parallelepiped that it spans (Kulesza & Taskar, 2012). Note that the summarization objective of Celis et al. is different from ours, and in different application domains one or the other may be more appropriate. An advantage of our approach is that it only requires access to a metric on the data set, rather than assuming feature representations of the data points.

The second line of work we discuss centers around the paper of Chierichetti et al. (2017). Their paper proposes a notion of fairness for clustering different from ours. Based on the fairness notion of disparate impact (Feldman et al., 2015) for classification (or the $p\%$-rule; Zafar et al., 2017), the paper by Chierichetti et al. asks that every group be approximately equally represented in each cluster. In their paper, Chierichetti et al. focus on $k$-medoid and $k$-center clustering and the case of two groups. Subsequently, Rösner & Schmidt (2018) study such a fair $k$-center problem for multiple groups, and Schmidt et al. (2018) build upon the work of Chierichetti et al. to devise algorithms for such a fair $k$-means problem. While we certainly consider the fairness notion of Chierichetti et al. (2017), which can be applied to any kind of clustering, to be meaningful in some scenarios, we believe that in certain applications of centroid-based clustering (such as data summarization) our proposed fairness notion provides a more sensible alternative.

Centroid-based clustering   There are many papers proposing heuristics and approximation algorithms for both $k$-center (e.g., Hochbaum & Shmoys, 1986; Mladenović et al., 2003; Ferone et al., 2017) and $k$-medoid (e.g., Charikar et al., 2002; Arya et al., 2004; Li & Svensson, 2013) under various assumptions on the data set and the distance function $d$. There are also numerous papers on versions with constraints, such as lower or upper bounds on the sizes of the clusters (Aggarwal et al., 2010; Cygan et al., 2012; Rösner & Schmidt, 2018).

Most important to mention are the works by Hajiaghayi et al. (2010), Krishnaswamy et al. (2011) and Chen et al. (2016). Hajiaghayi et al. were the first to consider our fairness constraint (for two groups and without a set $C$ that has to be included in the set of centers) for $k$-medoid. They present a local search algorithm and prove it to be a constant-factor approximation algorithm. Their work has been generalized by Krishnaswamy et al., who consider $k$-medoid under the constraint that the set of centers has to form an independent set in a given matroid. This kind of constraint contains our fairness constraint as a special case (for an arbitrary number of groups and an arbitrary set $C$). Krishnaswamy et al. obtain a 16-approximation algorithm for this so-called matroid median problem based on rounding the solution of a linear programming relaxation. Subsequently, Chen et al. study the matroid center problem. Using an algorithm for matroid intersection as a black box, they obtain a 3-approximation algorithm. Note that none of Hajiaghayi et al., Krishnaswamy et al. or Chen et al. discuss the running time of their algorithms beyond arguing them to be polynomial (compare with Section 2).

5 Experiments

In this section, we present a number of experiments. We begin with a motivating example on a small image data set illustrating that a summary produced by Algorithm 1 (i.e., the standard greedy strategy for the unfair $k$-center problem) can be quite unfair. We also compare summaries produced by our algorithm to summaries produced by the method of Celis et al. (2018b). We then investigate the approximation factor of our algorithm on several artificial instances for which we know or can compute the optimal value of the fair $k$-center problem (2) and compare our algorithm to the one for the matroid center problem by Chen et al. (2016), both in terms of approximation factor and running time. Next, on both synthetic and real data, we compare our algorithm in terms of the cost of its output to two baseline heuristics. Finally, we compare our algorithm to Algorithm 1 more systematically. We study the difference in the costs of the outputs of our algorithm and Algorithm 1, a quantity one may refer to as price of fairness, and measure how unfair the output of Algorithm 1 can be. In the following, all boxplots show results of 200 runs of an experiment.

5.1 Motivating Example and Comparison with Celis et al. (2018b)

Figure 2: A data set consisting of 14 images of medical doctors (7 female, 7 male) and four summaries computed by the unfair Algorithm 1, our algorithm, and the algorithm proposed by Celis et al. (2018b) (all three algorithms are randomized algorithms).

Consider the 14 images of medical doctors shown in the first row of Figure 2 (all images were found on https://commons.wikimedia.org, https://pexels.com or https://pixnio.com and are in the public domain). Assume we want to generate a summary of size four of these images. One way to do so is to run Algorithm 1. The first column of the table in Figure 2 shows in each row the summary produced in one run of Algorithm 1 (recall that all algorithms considered here are randomized). These summaries are quite unfair: although there is an equal number of images of female doctors and images of male doctors, all these summaries show three or even four women. To overcome this bias we can apply our algorithm or the method of Celis et al. (2018b), both of which allow us to explicitly state the numbers of women and men that we want in the summary. The second and the third column of the table show summaries produced by these algorithms. It is hard to say which of them produces more useful summaries, and the results ultimately depend on the feature representations of the images (see the next paragraph). To provide further illustration, we present a similar experiment in Figure 11 in Appendix B. Note that we chose very small numbers of images in these experiments solely for the purpose of easy visual digestion.

Figure 3: Left: Approximation factor of Algorithm 4 and the algorithm by Chen et al. (M.C.) on simulated data with computable optimal solution, in seven settings (1)–(7) of $m$, $k_1, \dots, k_m$, and $|C|$. Right: Running time as a function of the data set size $n$.
Figure 4: Approximation factor of our algorithm on simulated data with known optimal solution. Left: Example of the data set. The optimal solution consists of 100 points located at the centers of the visible clusters and has cost 0.5. Right: Approximation factor as a function of the number of groups $m$.

For computing feature representations of the images and for running the algorithm of Celis et al. we used the code provided by them. The feature vector of an image is a histogram based on the image’s SIFT descriptors; see Celis et al. for details. We used the Euclidean metric between these feature vectors as the metric $d$ for Algorithm 1 and our algorithm.

5.2 Approximation Factor and Comparison with Chen et al. (2016)

We implemented the algorithm by Chen et al. (2016) using the generic algorithm for matroid intersection provided in SageMath (http://sagemath.org/). To speed up computation, rather than testing all distance values as thresholds as suggested by Chen et al., we implemented binary search to look for the optimal value.

In the experiment shown in the left part of Figure 3, we study the approximation factor achieved by our algorithm (Alg. 4) and the algorithm by Chen et al. (M.C.) in various settings of the values of $m$, $k_1, \dots, k_m$, and $|C|$. The data set $X$ always consists of 25 vertices of a random graph and is small enough to explicitly compute an optimal solution to the fair $k$-center problem (2). The random graph is constructed according to an Erdős-Rényi model, where any possible edge between two vertices is contained in the graph with a fixed probability. With high probability such a graph is connected (if not, we discard it). We put random weights on the edges, drawn from the uniform distribution on $[0, 1]$, and let the metric $d$ be the shortest-path distance on the graph. We assign every vertex to one of the $m$ groups uniformly at random and randomly choose a subset $C$ of initially given centers. As we can see from the boxplots, the approximation factor achieved by our algorithm is never larger than 2.4. We also see that in each of the seven settings that we consider, the median of the achieved approximation factors (indicated by the red lines in the boxes) is smaller for our algorithm than for the algorithm by Chen et al.
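A sketch of this instance generator, assuming the networkx library is available; the concrete parameter values below are illustrative defaults, not the exact values used in the figures:

import random
import networkx as nx

def random_metric_instance(n=25, p=0.5, m=3, seed=0):
    """Erdos-Renyi graph with uniform edge weights; metric = shortest paths."""
    rng = random.Random(seed)
    G = nx.gnp_random_graph(n, p, seed=seed)
    if not nx.is_connected(G):               # discard disconnected graphs
        return random_metric_instance(n, p, m, seed + 1)
    for u, v in G.edges():
        G[u][v]["weight"] = rng.random()     # uniform weights on [0, 1]
    dist = dict(nx.all_pairs_dijkstra_path_length(G, weight="weight"))
    D = [[dist[u][v] for v in range(n)] for u in range(n)]
    groups = [rng.randrange(m) for _ in range(n)]   # uniform group labels
    return D, groups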

In the experiment shown in the right part of Figure 3, we study the running time of the two algorithms as a function of the size of the data set, which is created analogously to the experiment in the left part. The shown curves are obtained by averaging the running times over 200 runs of the experiment. While our algorithm never runs for more than 0.01 seconds, the algorithm by Chen et al., on average, runs for 240 seconds on the largest data sets considered. Its running time grows superquadratically in $n$, which makes it inappropriate for massive data sets.

In the experiment of Figure 4, we once more study the approximation factor achieved by our algorithm. We place 100 optimal centers and sample points around them such that for every center the farthest point in its cluster is at distance 0.5 from the center (Euclidean distance). One such point set can be seen in the left plot of Figure 4. We randomly assign every point and center to one of the $m$ groups and set $k_i$ to the number of centers that have been assigned to group $X_i$. We let $C = \emptyset$. For increasing values of $m$, the right part of Figure 4 shows boxplots of the approximation factors for our algorithm. Similarly as before, the approximation factor achieved by our algorithm is never larger than 2.6. Most interestingly, the approximation factor increases only very moderately with $m$.

Figure 5: Cost of the output of our algorithm in comparison to the unfair Algorithm 1, and maximum deviation of the numbers of centers from the groups $X_i$ in the output of Algorithm 1 from the requested numbers $k_i$. Left / Middle / Right: the three data sets and settings of Figure 6.

5.3 Comparison with Baseline Approaches

We compare our algorithm, in terms of the cost of an approximate solution, to two baseline heuristics for the fair $k$-center problem (2); both are sketched below. The first one, referred to as Heuristic A, runs Algorithm 1 on each group separately (with $X_i$ in place of $X$ and $k_i$ in place of $k$) and returns the union of the centers obtained for the various groups as output. The second one, Heuristic B, greedily chooses centers similarly to Algorithm 1, but only from those groups for which the requested number of centers has not been reached yet. It is easy to see that the approximation factor achieved by these heuristics can be arbitrarily large on some worst-case instances.
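Minimal sketches of the two baselines, reusing the hypothetical `greedy` helper from the Algorithm 2 sketch in Section 3 (again our own code, not the authors'):

def heuristic_A(D, g, quotas, C=()):
    """Run the greedy of Algorithm 1 on each group separately."""
    S = []
    for i, ki in enumerate(quotas):
        pool = [x for x in range(len(D)) if g[x] == i]
        S += greedy(D, ki, C=C, pool=pool)
    return S

def heuristic_B(D, g, quotas, C=()):
    """One greedy pass, restricted to groups with remaining quota."""
    n = len(D)
    remaining = list(quotas)
    S = []
    dist = {x: min((D[c][x] for c in C), default=float("inf")) for x in range(n)}
    while any(r > 0 for r in remaining):
        cand = [x for x in range(n) if remaining[g[x]] > 0 and x not in S]
        # with no centers yet, all distances are inf and max picks the first candidate
        c = max(cand, key=lambda x: dist[x])
        S.append(c)
        remaining[g[c]] -= 1
        for x in range(n):
            dist[x] = min(dist[x], D[c][x])
    return S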

Figure 6: Cost of the output of our algorithm in comparison to the two heuristics. Left: random graph data. Middle: Adult data set, two groups (gender). Right: Adult data set, five groups (race).

Figure 6 shows boxplots of the costs of the approximate solutions returned by our algorithm and the two heuristics for three data sets: the data set in the left plot consists of the vertices of a random graph constructed similarly as in the experiments of Figure 3. The data set in the middle and in the right plot consists of the first 25000 records of the Adult data set (Dheeru & Karra Taniskidou, 2017). We only use its six numerical features (e.g., age, hours worked per week), normalized to have mean zero and standard deviation one, for representing records, and use the $\ell_1$-distance as the metric $d$. For the experiment shown in the middle plot, we split the data set into two groups according to the sensitive feature gender (there are 16709 males and 8291 females). For the experiment shown in the right plot, we split the data set into five groups according to the feature race (#White=21391, #Asian-Pac-Islander=775, #Amer-Indian-Eskimo=241, #Other=214, #Black=2379). In both cases, we let $C$ be a subset of randomly chosen records. In Figure 10 in Appendix B we present results for other choices of the requested numbers of centers. The two heuristics perform surprisingly well: although they come without any worst-case guarantees, the cost of their solutions is overall comparable to the cost of the output of our algorithm. Still, in four out of seven experiments our algorithm is superior to Heuristic B, which in turn is superior to Heuristic A in all seven experiments.

5.4 Comparison with Unfair Algorithm 1

We compare the cost of the solution produced by our algorithm to the cost of the (potentially) unfair solution provided by Algorithm 1. Of course, we expect the latter to be lower. We also examine how balanced the numbers of centers from the various groups $X_i$ in the output of Algorithm 1 are. Figure 5 shows the results, where the data sets and settings equal the ones in the experiments of Figure 6. Similar experiments with different settings are provided in Figure 12 in Appendix B. Remarkably, the costs of the solutions produced by our algorithm and Algorithm 1 have the same order of magnitude in all experiments, showing that the price of fairness is small. On the other hand, the output of Algorithm 1 can be highly unfair.

6 Discussion

We considered $k$-center clustering under a fairness constraint that is motivated by the application of centroid-based clustering to data summarization. We presented a simple approximation algorithm with running time only linear in the size of the data set $n$ and the number of centers $k$. We proved our algorithm to be a 5-approximation algorithm when the data set consists of two groups. For more than two groups, our analysis yields an upper bound on the approximation factor that increases exponentially with the number of groups. We do not know whether this exponential dependence is necessary or whether our analysis is loose; in our extensive numerical simulations we never observed a large approximation factor. Besides answering this question, in future work it would be interesting to extend our results to other clustering objectives such as $k$-medoid or $k$-means. It would also be interesting to characterize properties of data sets that guarantee that fast algorithms find an optimal fair clustering.

References

  • Agarwal et al. (2018) Agarwal, A., Beygelzimer, A., Dudík, M., Langford, J., and Wallach, H. A reductions approach to fair classification. In International Conference on Machine Learning (ICML), 2018.
  • Aggarwal et al. (2010) Aggarwal, G., Panigrahy, R., Feder, T., Thomas, D., Kenthapadi, K., Khuller, S., and Zhu, A. Achieving anonymity via clustering. ACM Transactions on Algorithms, 6(3):49:1–49:19, 2010.
  • Angwin et al. (2016) Angwin, J., Larson, J., Mattu, S., and Kirchner, L. Propublica—machine bias, 2016. https://www.propublica.org/article/machine-bias-risk-assessments-in-criminal-sentencing.
  • Arya et al. (2004) Arya, V., Garg, N., Khandekar, R., Meyerson, A., Munagala, K., and Pandit, V. Local search heuristics for $k$-median and facility location problems. SIAM Journal on Computing, 33(3):544–562, 2004.
  • Bartholomew et al. (2008) Bartholomew, D. J., Steele, F., Galbraith, J., and Moustaki, I. Analysis of multivariate social science data. Chapman and Hall, 2008.
  • Bolukbasi et al. (2016) Bolukbasi, T., Chang, K.-W., Zou, J., Saligrama, V., and Kalai, A. Man is to computer programmer as woman is to homemaker? Debiasing word embeddings. In Neural Information Processing Systems (NIPS), 2016.
  • Borgen & Barnett (1987) Borgen, F. H. and Barnett, D. C. Applying cluster analysis in counseling psychology research. Journal of Counseling Psychology, 34(4):456, 1987.
  • Buolamwini & Gebru (2017) Buolamwini, J. and Gebru, T. Gender shades: Intersectional accuracy disparities in commercial gender classification. In Conference on Fairness, Accountability, and Transparency (ACM FAT), 2017.
  • Cameron & Trivedi (2010) Cameron, A. C. and Trivedi, P. K. Microeconometrics Using Stata, volume 2. Stata Press, 2010.
  • Campbell et al. (2008) Campbell, R., Greeson, M. R., Bybee, D., and Raja, S. The co-occurrence of childhood sexual abuse, adult sexual assault, intimate partner violence, and sexual harassment: a mediational model of posttraumatic stress disorder and physical health outcomes. Journal of Consulting and Clinical Psychology, 76(2):194–207, 2008.
  • Celis et al. (2018a) Celis, L. E., Huang, L., and Vishnoi, N. K. Multiwinner voting with fairness constraints. In International Joint Conference on Artificial Intelligence (IJCAI), 2018a.
  • Celis et al. (2018b) Celis, L. E., Keswani, V., Straszak, D., Deshpande, A., Kathuria, T., and Vishnoi, N. K. Fair and diverse DPP-based data summarization. In International Conference on Machine Learning (ICML), 2018b. Code available on https://github.com/DamianStraszak/FairDiverseDPPSampling.
  • Celis et al. (2018c) Celis, L. E., Straszak, D., and Vishnoi, N. K. Ranking with fairness constraints. In International Colloquium on Automata, Languages and Programming (ICALP), 2018c.
  • Charikar et al. (2002) Charikar, M., Guha, S., Tardos, E., and Shmoys, D. B. A constant-factor approximation algorithm for the $k$-median problem. Journal of Computer and System Sciences, 65(1):129–149, 2002.
  • Chen et al. (2016) Chen, D. Z., Li, J., Liang, H., and Wang, H. Matroid and knapsack center problems. Algorithmica, 75:27–52, 2016.
  • Chierichetti et al. (2017) Chierichetti, F., Kumar, R., Lattanzi, S., and Vassilvitskii, S. Fair clustering through fairlets. In Neural Information Processing Systems (NIPS), 2017.
  • Cook et al. (1998) Cook, W. J., Cunningham, W. H., Pulleyblank, W. R., and Schrijver, A. Combinatorial Optimization. Wiley, 1998.
  • Cormen et al. (2009) Cormen, T. H., Leiserson, C. E., Rivest, R. L., and Stein, C. Introduction to Algorithms. MIT Press, 3rd edition, 2009.
  • Cygan et al. (2012) Cygan, M., Hajiaghayi, M., and Khuller, S. LP rounding for $k$-centers with non-uniform hard capacities. In Symposium on Foundations of Computer Science (FOCS), 2012.
  • Dheeru & Karra Taniskidou (2017) Dheeru, D. and Karra Taniskidou, E. UCI machine learning repository, 2017. https://archive.ics.uci.edu/ml/datasets/adult.
  • Donini et al. (2018) Donini, M., Oneto, L., Ben-David, S., Shawe-Taylor, J., and Pontil, M. Empirical risk minimization under fairness constraints. In Neural Information Processing Systems (NeurIPS), 2018.
  • Eisen et al. (1998) Eisen, M. B., Spellman, P. T., Brown, P. O., and Botstein, D. Cluster analysis and display of genome-wide expression patterns. Proceedings of the National Academy of Sciences, 95(25):14863–14868, 1998.
  • Feldman et al. (2015) Feldman, M., Friedler, S. A., Moeller, J., Scheidegger, C., and Venkatasubramanian, S. Certifying and removing disparate impact. In ACM International Conference on Knowledge Discovery and Data Mining (KDD), 2015.
  • Ferone et al. (2017) Ferone, D., Festa, P., Napoletano, A., and Resende, M. G. C. A new local search for the $p$-center problem based on the critical vertex concept. In International Conference on Learning and Intelligent Optimization (LION), 2017.
  • Gonzalez (1985) Gonzalez, T. F. Clustering to minimize the maximum intercluster distance. Theoretical Computer Science, 38:293–306, 1985.
  • Hajiaghayi et al. (2010) Hajiaghayi, M., Khandekar, R., and Kortsarz, G. The red-blue median problem and its generalization. In European Symposium on Algorithms (ESA), 2010.
  • Har-Peled (2011) Har-Peled, S. Geometric approximation algorithms. American Mathematical Society, 2011.
  • Hardt et al. (2016) Hardt, M., Price, E., and Srebro, N. Equality of opportunity in supervised learning. In Neural Information Processing Systems (NIPS), 2016.
  • Hastie et al. (2009) Hastie, T., Tibshirani, R., and Friedman, J. The Elements of Statistical Learning — Data Mining, Inference, and Prediction. Springer, 2nd edition, 2009.
  • Hesabi et al. (2015) Hesabi, Z. R., Tari, Z., Goscinski, A., Fahad, A., Khalil, I., and Queiroz, C. Data summarization techniques for big data—a survey. In Handbook on Data Centers, pp. 1109–1152. Springer, 2015.
  • Hochbaum & Shmoys (1986) Hochbaum, D. S. and Shmoys, D. B. A unified approach to approximation algorithms for bottleneck problems. Journal of the ACM, 33(3):533–550, 1986.
  • Kay et al. (2015) Kay, M., Matuszek, C., and Munson, S. A. Unequal representation and gender stereotypes in image search results for occupations. In Conference on Human Factors in Computing Systems (CHI), 2015.
  • Krishnaswamy et al. (2011) Krishnaswamy, R., Kumar, A., Nagarajan, V., Sabharwal, Y., and Saha, B. The matroid median problem. In Symposium on Discrete Algorithms (SODA), 2011.
  • Kulesza & Taskar (2012) Kulesza, A. and Taskar, B. Determinantal point processes for machine learning. Foundations and Trends in Machine Learning, 5:123–286, 2012.
  • Li & Svensson (2013) Li, S. and Svensson, O. Approximating $k$-median via pseudo-approximation. In Symposium on the Theory of Computing (STOC), 2013.
  • Mladenović et al. (2003) Mladenović, N., Labbé, M., and Hansen, P. Solving the $p$-center problem with tabu search and variable neighborhood search. Networks, 42(1):48–64, 2003.
  • Rösner & Schmidt (2018) Rösner, C. and Schmidt, M. Privacy preserving clustering with constraints. In International Colloquium on Automata, Languages, and Programming (ICALP), 2018.
  • Samadi et al. (2018) Samadi, S., Tantipongpipat, U., Morgenstern, J., Singh, M., and Vempala, S. The price of fair PCA: One extra dimension. In Neural Information Processing Systems (NeurIPS), 2018.
  • Schmidt et al. (2018) Schmidt, M., Schwiegelshohn, C., and Sohler, C. Fair coresets and streaming algorithms for fair k-means clustering. arXiv:1812.10854 [cs.DS], 2018.
  • Sweeney (2013) Sweeney, L. Discrimination in online ad delivery. Queue, 11(3):10–29, 2013.
  • Vazirani (2001) Vazirani, V. Approximation Algorithms. Springer, 2001.
  • Zafar et al. (2017) Zafar, M. B., Valera, I., Rodriguez, M. G., and Gummadi, K. P. Fairness constraints: Mechanisms for fair classification. In International Conference on Artificial Intelligence and Statistics (AISTATS), 2017.

Appendix

Appendix A Proofs

Proof of Lemma 1:

It is straightforward to see that Algorithm 1 can be implemented in time $O((k + |C|)\, n)$. We only need to show that it is a 2-approximation algorithm for (3).

Let $S = \{c_1, \dots, c_k\}$ be the output of Algorithm 1, with the centers indexed in the order of their selection, and let $S^*$ be an optimal solution of (3) with objective value $r^*$. Let $x \in X$ be arbitrary. We need to show that $d(x, c) \leq 2 r^*$ for some $c \in S \cup C$. If $x \in S \cup C$, there is nothing to show. So assume $x \notin S \cup C$ and, for the sake of contradiction, that $d(x, S \cup C) > 2 r^*$.

Due to the greedy choice in Line 5 of the algorithm and since $x$ has not been chosen by the algorithm, for every $i \in [k]$ we have

$$d(c_i, \{c_1, \dots, c_{i-1}\} \cup C) \geq d(x, \{c_1, \dots, c_{i-1}\} \cup C) \geq d(x, S \cup C) > 2 r^*.$$

Hence the $k + 1$ points in $\{x\} \cup S$ have pairwise distances greater than $2 r^*$, and each of them has distance greater than $2 r^*$ to every element of $C$. On the other hand, every point in $\{x\} \cup S$ has distance at most $r^*$ to its closest center in $S^* \cup C$; since its distance to every element of $C$ is greater than $r^*$, this closest center must belong to $S^*$. As $|S^*| \leq k$, there must be $y, z \in \{x\} \cup S$, where $y \neq z$, and $\tilde{c} \in S^*$ such that $d(y, \tilde{c}) \leq r^*$ and $d(z, \tilde{c}) \leq r^*$. Since $d$ satisfies the triangle inequality, it follows that $d(y, z) \leq 2 r^*$, contradicting the fact that all pairwise distances in $\{x\} \cup S$ are greater than $2 r^*$. □

Proof of Theorem 1:

Again it is easy to see that Algorithm 2 can be implemented in time $O((k + |C|)\, n)$. We need to prove that it is a 5-approximation algorithm, but not a $(5 - \varepsilon)$-approximation algorithm for any $\varepsilon > 0$:

  1. Algorithm 2 is a 5-approximation algorithm:

    Let $r^*$ be the optimal value of the fair problem (2) and $\hat{r}$ be the optimal value of the unfair problem (3). Clearly, $\hat{r} \leq r^*$. Let $S^*$ with $|S^* \cap X_1| = k_1$ and $|S^* \cap X_2| = k_2$ be an optimal solution to the fair problem (2) with cost $r^*$ and $S$ be the set of centers returned by Algorithm 2. It is clear that Algorithm 2 returns $k_1$ many elements from $X_1$ and $k_2$ many elements from $X_2$, and hence $S = S_1 \,\dot{\cup}\, S_2$ with $S_1 \subseteq X_1$, $|S_1| = k_1$ and $S_2 \subseteq X_2$, $|S_2| = k_2$. We need to show that

    $$d(x, S \cup C) \leq 5 r^* \quad \text{for every } x \in X.$$

    Let $S_0$ be the output of Algorithm 1 when called in Line 3 of Algorithm 2. Since Algorithm 1 is a 2-approximation algorithm for the unfair problem (3) according to Lemma 1, we have

    $$d(x, S_0 \cup C) \leq 2 \hat{r} \leq 2 r^*, \quad x \in X. \qquad (6)$$

    If Algorithm 2 returns