Fair clustering via equitable group representations

06/19/2020
by   Mohsen Abbasi, et al.
The University of Utah

What does it mean for a clustering to be fair? One popular approach seeks to ensure that each cluster contains groups in (roughly) the same proportion in which they exist in the population. The normative principle at play is balance: any cluster might act as a representative of the data, and thus should reflect its diversity. But clustering also captures a different form of representativeness. A core principle in most clustering problems is that a cluster center should be representative of the cluster it represents, by being "close" to the points associated with it. This is so that we can effectively replace the points by their cluster centers without significant loss in fidelity, and indeed is a common "use case" for clustering. For such a clustering to be fair, the centers should "represent" different groups equally well. We call such a clustering a group-representative clustering. In this paper, we study the structure and computation of group-representative clusterings. We show that this notion naturally parallels the development of fairness notions in classification, with direct analogs of ideas like demographic parity and equal opportunity. We demonstrate how these notions are distinct from and cannot be captured by balance-based notions of fairness. We present approximation algorithms for group-representative k-median clustering and couple this with an empirical evaluation on various real-world data sets.



1 Introduction

The growing use of automated decision making has sparked a debate concerning bias, and what it means to be fair in this setting. As a result, an extensive literature exists on algorithmic fairness, and in particular on how to define fairness for problems in supervised learning (Dwork12Fairness; Romei13Multidisciplinary; Feldman2015DisparateImpact; hardt2016equality; arvindtutorial; mitchell2018predictionbased). However, these notions are not readily applicable to unsupervised learning problems such as clustering, because, unlike in the supervised setting, no well-defined notion of ground truth exists in such problems. In 2017, chierichetti2017fair proposed the idea of balance as a notion of fairness in clustering. Given a set of data points with a type assigned to each one, balance asks for a clustering where each cluster has roughly the same proportion of types as the overall population. This definition spawned a flurry of research on efficient algorithms for fair clustering (chierichetti2017fair; kleindessner2019fair; kleindessner2019guarantees; chen2019proportionally; schmidt2018fair; ahmadian2019clustering; rosner2018privacy). Further work has extended this definition, while retaining the same basic principle of proportionality (abraham2019fairness; backurs2019scalable; bercea2018cost; huang2019coresets; bera2019fair; wang2019towards; zikoclustering).

There are two sources of concern with balance as a normative principle. First, the idea that enforcing proportionate clusters leads to fairness would make sense if the objective were to pick one cluster as representative of the entire set. However, this is not a typical goal in clustering. The objective in clustering problems is usually to group similar data points together, where each cluster center is representative of its cluster. This means that, unlike in supervised learning, the labels assigned by a clustering algorithm do not always carry an inherent meaning like being accepted to college or defaulting on a loan. So representativeness of a particular cluster may not always be meaningful, or worse, may incorrectly "represent" the set of points. One example of such a concern arises in redistricting: a partitioning of a region into voting districts that achieves "balance" between voters from different political parties will result in each district having a majority of voters from the same political party, which is in fact a gerrymandering technique called cracking. Second, if we borrow the notion of disparate impact, we would want "protected" classes to have approximately equal representation in the decision space, compared to the majority group. However, enforcing balance does not necessarily guarantee such a requirement. To illustrate why, see the example presented in Figure 1, which shows a balance-preserving k-means clustering on the left for two groups denoted by the colors red and blue, and a regular k-means clustering on the right. Here, the number of red points is larger than the number of blue points. Therefore, each cluster center is chosen close to its respective red subgroup's centroid. As a result, red points are better represented by the chosen centers than blue points.

Figure 1: Balance is preserved in the left-hand figure, but the centers are much more representative of the red points. In contrast, the clustering on the right represents both groups comparably.

1.1 Our Contributions

In this paper we propose a notion of fairness in clustering based on the idea of minimizing gaps in representativeness across groups. We present a number of different ways of measuring representativeness and, interestingly, show that they naturally parallel standard notions of fairness in the supervised learning literature. We establish some basic properties of these measures, as well as showing their incompatibility with each other. We also present bicriteria approximation algorithms for computing fair k-medians under these different notions of fairness, and support this with an experimental study that illustrates both the effectiveness of these measures and their incompatibility with notions of balance.

2 Related Work

Chierichetti et al. (chierichetti2017fair) introduced balance as a fairness constraint in clustering for two groups. In the same setting with a binary group attribute, Backurs et al. (backurs2019scalable) improved the running time of their algorithm for fair k-median. Rösner and Schmidt (rosner2018privacy) proposed a constant-factor approximation algorithm for the fair k-center problem with multiple protected classes. Bercea et al. (bercea2018cost) proposed bicriteria constant-factor approximations for several classical clustering objectives, improving the results of Rösner and Schmidt. Bera et al. (Bera19) generalized previous work by allowing maximum over- and minimum under-representation of groups in clusters, as well as multiple, non-disjoint sensitive types in their framework. Other works have studied the multiple-type setting (wang2019towards), multiple non-disjoint types (huang2019coresets) and cluster-dependent proportions (zikoclustering).

In a different line of work, Ahmadian et al. (ahmadian2019clustering) studied a fair k-center problem in which there is an upper bound on the maximum fraction of a single type within each cluster. Chen et al. (chen2019proportionally) studied a variant of fair clustering in which any sufficiently large group of points (relative to the number of clusters) is entitled to its own cluster center if that center is closer in distance to all of them.

A large body of work in the area of algorithmic fairness has focused on ensuring fair representation of all social groups in the machine learning pipeline (bolukbasi2016man; samadi_price_2018; abbasi2019fairness). Recent work by Mahabadi et al. (mahabadi2020individual) studies the problem of individually fair clustering, under the fairness constraint proposed by Jung et al. (jung2019center): if $r(x)$ denotes the minimum radius such that the ball of radius $r(x)$ centered at a point $x$ contains at least $n/k$ points, then at least one center should be opened within distance $r(x)$ of $x$.

3 Fair Clustering

In this paper we will consider clustering objectives that satisfy the Voronoi property: the optimal assignment for a point is the cluster center nearest to it. This includes the usual clustering formulations like k-center, k-means and k-median. Thus, we can represent a clustering as its set of cluster centers $C$. The cost of a clustering $C$ of a set of points $X$ is given by a function $\mathrm{cost}(X, C)$. For any subset of points $A \subseteq X$, we denote the cost of assigning $A$ to cluster centers in a given clustering $C$ by $\mathrm{cost}(A, C)$. Finally, given a cost function $\mathrm{cost}$ and a set of points $X$, we denote the set of centers in an optimal clustering of $X$ by $C^*_{\mathrm{cost}}(X)$. When the context is clear, we will drop the subscript and merely write this as $C^*(X)$.

Our ideas of fairness in clustering are rooted in the idea of equitable representations. To that end, we introduce different ways to measure the cost of group representation. We can then define a fair clustering.

Definition 1.

(Fair Clustering). Given a set of data points $X$ partitioned into groups $X_1, \ldots, X_m$, a fair clustering is one that minimizes the maximum average (representation) cost across all groups:

$$\min_{C \in \mathcal{C}} \; \max_{1 \le i \le m} \; \frac{\mathrm{cost}(X_i, C)}{|X_i|},$$

where $\mathcal{C}$ is the set of all possible clusterings.
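To make the objective concrete, here is a minimal Python sketch (NumPy assumed; the function names are ours, for illustration only) that evaluates the min-max objective of Definition 1 for a candidate set of centers:

    import numpy as np

    def avg_group_cost(group, centers):
        """Average distance from each point of the group to its nearest center."""
        # dists[p, c] = distance from group point p to center c
        dists = np.linalg.norm(group[:, None, :] - centers[None, :, :], axis=2)
        return dists.min(axis=1).mean()

    def fair_objective(groups, centers):
        """Maximum average representation cost across groups (Definition 1)."""
        return max(avg_group_cost(g, centers) for g in groups)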

3.1 Quality of group representation

We now introduce different ways to measure group representation cost.

3.1.1 Absolute Representation Error

In supervised learning, statistical parity captures the idea that groups should have similar outcomes. Rephrased, it says that groups should be represented equally well in the output. In the case of binary classification, statistical parity requires

$$P(\hat{Y} = 1 \mid A = a) = P(\hat{Y} = 1 \mid A = b)$$

for two groups $a$ and $b$, where $A$ denotes the sensitive attribute. A clustering adaptation of statistical parity would require that cluster centers represent all groups equally well, regardless of their potentially different distributions. More specifically, the average distance between members of a group and their respective cluster centers should look the same across groups. Motivated by this, we introduce the following definition of representation cost.

Definition 2 (AbsError).

The absolute (representation) error of a clustering is

$$\mathrm{AbsError}(X, C) = \sum_{x \in X} d(x, C),$$

where $X$ is a set of points, $C$ is a set of centers, and $d(x, C)$ is an arbitrary distance function between $x$ and the nearest center to it in $C$.

An AbsError-fair clustering is a fair clustering that uses AbsError to measure group representation cost in Definition 1.
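As a small numeric illustration (the data here is made up, and the snippet assumes NumPy), two groups at different distances from a shared center incur very different AbsError totals:

    import numpy as np

    rng = np.random.default_rng(0)
    center = np.zeros(2)                                        # one center at the origin
    group_a = rng.normal(loc=0.0, scale=1.0, size=(100, 2))     # clustered around the center
    group_b = rng.normal(loc=5.0, scale=1.0, size=(20, 2))      # far from the center

    for name, g in [("A", group_a), ("B", group_b)]:
        abs_error = np.linalg.norm(g - center, axis=1).sum()    # AbsError(group, {center})
        print(name, round(abs_error, 1))   # B's per-point error is much larger than A's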

3.1.2 Relative Representation Error

AbsError does not take different group distributions into account. To see how that might be problematic, consider minimizing the maximum value of AbsError for two groups $A$ and $B$ in Figure 2, using three clusters, where $|A| \gg |B|$. The points in group $A$ can be grouped into two clusters with close to zero cost, as shown in the figure. The points in group $B$ lie on three line segments $\ell_1$, $\ell_2$ and $\ell_3$ of equal length, with points distributed uniformly on each segment, so that clustering $B$ by itself with three centers would give it near-zero average cost. Since group $A$ is much larger than group $B$, an optimal unconstrained clustering devotes two centers to $A$'s clusters and only one to $B$'s segments, leaving $B$ with a large average error. An AbsError-fair clustering must instead pull centers away from $A$'s clusters toward $B$'s segments until the two groups' average errors are equal. In such a setting, for a small enough segment length, the total cost for group $A$ increases substantially compared to an unconstrained clustering, while group $B$, which really needs three centers of its own, does not gain a noticeable benefit.

Figure 2: Minimizing the maximum average AbsError across the two groups yields a solution whose total cost is far larger than that of the optimal unconstrained clustering.

In the example above, AbsError-fair clustering fails to achieve a fair and at the same time acceptable clustering because it ignores the fact that the two groups have drastically different distributions. Attention to this form of "base rates" is the motivation behind the introduction of fairness measures like equality of opportunity, which balance error rates rather than outcomes (hardt2016equality).

A clustering adaptation of a notion like equality of opportunity would require two steps: first, comparing the average distance between members of a social group and their respective cluster centers to the corresponding "optimal" value for that group; and second, ensuring that the difference between these two values for the minority group is roughly equal to the corresponding difference for the majority group.

This relative measure of representation error motivates the following definition.

Definition 3 (RelError).

The relative (representation) error of a clustering is given by

$$\mathrm{RelError}(X, C) = \frac{\sum_{x \in X} d(x, C)}{\sum_{x \in X} d(x, C^*(X))},$$

where $X$ is a set of points, $C$ is a set of centers, $C^*(X)$ is an optimal set of centers for $X$ alone, and $d(x, C)$ is an arbitrary distance function between $x$ and the nearest center to it in $C$.

Alternatively, one can capture the relative error via a difference instead of a ratio; we refer to this variant as RelErrorDiff. This is similar to the formulation used by samadi_price_2018 in their work on fair PCA. For technical reasons relating to the difficulty of optimizing differences, we will not discuss it further here.
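A sketch of both variants follows (NumPy assumed). It takes the group's standalone optimal cost, the denominator of Definition 3, as a precomputed input; group_opt_cost is a placeholder for the output of whatever k-median solver is used for that step:

    import numpy as np

    def group_cost(group, centers):
        """Total distance from each point of the group to its nearest center."""
        dists = np.linalg.norm(group[:, None, :] - centers[None, :, :], axis=2)
        return dists.min(axis=1).sum()

    def rel_error(group, centers, group_opt_cost):
        """RelError: the group's cost under the shared centers, relative to
        the cost it would incur when clustered on its own (Definition 3)."""
        return group_cost(group, centers) / group_opt_cost

    def rel_error_diff(group, centers, group_opt_cost):
        """RelErrorDiff: the same comparison via a difference instead of a ratio."""
        return group_cost(group, centers) - group_opt_cost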

Equality of cost in fair clustering

Our definition of fair clustering asks to minimize the maximum representation cost over groups. Another way to think of fair clustering with respect to group representations is to enforce equality of representation costs across groups. Though it may not seem obvious at first glance, this approach to fairness is related to Definition 1. To connect the two definitions, we present an argument similar to that of Samadi et al. (samadi_price_2018). In Observation 5, we describe how and under what conditions minimizing the maximum cost across groups leads to equal costs for them.

Definition 4.

(Homogeneous group). Given a set of data points $X$ and an arbitrary subset $Y \subseteq X$, we call $Y$ homogeneous with respect to $X$ and a given clustering cost function if there is at least one clustering under which $Y$'s average cost is smaller than or equal to $X$'s optimal average cost. Formally, we call $Y$ homogeneous if $\min_{C} \mathrm{cost}(Y, C)/|Y| \le \mathrm{cost}(X, C^*(X))/|X|$.

Observation 5.

Assume we are given a clustering algorithm with a continuous and convex cost function (e.g., soft clustering with k-means), and a set of points $X$ that can be partitioned into two homogeneous groups $A$ and $B$. Minimizing the maximum average cost over the two groups is equivalent to equalizing their average costs.

Proof.

Let $\bar{C}$ denote the clustering returned after minimizing the maximum average cost over the two groups $A$ and $B$. If $\mathrm{cost}(A, \bar{C})/|A| = \mathrm{cost}(B, \bar{C})/|B|$, we're done. So, without loss of generality, let's assume $\mathrm{cost}(A, \bar{C})/|A| > \mathrm{cost}(B, \bar{C})/|B|$. In this case, $\bar{C}$ (which is the global minimum of the function $\max\{\mathrm{cost}(A, \cdot)/|A|,\; \mathrm{cost}(B, \cdot)/|B|\}$) is a local minimum of group $A$'s cost function. Otherwise, since the cost function is continuous, there would be another clustering $C'$ where

$$\frac{\mathrm{cost}(A, C')}{|A|} < \frac{\mathrm{cost}(A, \bar{C})}{|A|} \quad \text{and} \quad \frac{\mathrm{cost}(B, C')}{|B|} < \frac{\mathrm{cost}(A, \bar{C})}{|A|},$$

which means the min-max procedure should have returned $C'$ instead of $\bar{C}$. A convex function has only one local minimum, which is also its global minimum. Therefore, since we assumed the given cost function is convex, $\bar{C}$ is a global minimum of group $A$'s cost function. On the other hand, the clustering $C^*(X)$ is a global minimum of the entire set $X$'s cost function. Therefore, because the two groups are homogeneous:

$$\frac{\mathrm{cost}(X, \bar{C})}{|X|} \;<\; \frac{\mathrm{cost}(A, \bar{C})}{|A|} \;=\; \min_{C} \frac{\mathrm{cost}(A, C)}{|A|} \;\le\; \frac{\mathrm{cost}(X, C^*(X))}{|X|}. \qquad (1)$$

Inequality (1) tells us that the overall average cost under the clustering $\bar{C}$ is smaller than the overall average cost under the clustering $C^*(X)$, which contradicts the optimality of $C^*(X)$.

Since the two groups are homogeneous, continuity of the cost function guarantees that there is at least one clustering where the average costs of the two groups are equal. Therefore, minimizing the maximum average cost over the two groups returns such a clustering with the smallest possible value. ∎

4 Algorithms for fair clustering

We now present algorithms for fair clustering under these measures of fairness. We start with an observation about the difference between optimizing in a "group-blind" manner and optimizing explicitly for group representations. Such observations are generally referred to as the "price" of fair clustering. (We use this terminology because it is commonly used. However, in a broader sense we believe that discussions of fairness in terms of a compromise in quality are misguided and represent a false tradeoff between two fundamentally different values.)

Theorem 6.

Consider an arbitrary clustering algorithm and a set of data points $X$ that can be partitioned into groups $X_1, \ldots, X_m$. If, in the optimal clustering for the entire set, group $X_j$ suffers the largest average cost, then the total cost of fair clustering is no larger than $|X|/|X_j|$ times the optimal solution.

Proof.

Let us denote the fair clustering by $C_f$. By assumption, the average cost of every group under $C_f$ is no larger than the average cost of group $X_j$ under $C^*(X)$:

$$\max_i \frac{\mathrm{cost}(X_i, C_f)}{|X_i|} \;\le\; \max_i \frac{\mathrm{cost}(X_i, C^*(X))}{|X_i|} \;=\; \frac{\mathrm{cost}(X_j, C^*(X))}{|X_j|}.$$

Therefore:

$$\mathrm{cost}(X, C_f) \;=\; \sum_i |X_i| \cdot \frac{\mathrm{cost}(X_i, C_f)}{|X_i|} \;\le\; |X| \cdot \frac{\mathrm{cost}(X_j, C^*(X))}{|X_j|} \;\le\; \frac{|X|}{|X_j|} \cdot \mathrm{cost}(X, C^*(X)). \qquad \square$$

We now show that the analysis in Theorem 6 is tight for all variations of fair clustering introduced in Section 3.1. Consider a relaxed version of the k-means problem, namely linear ℓ-subspace k-clustering. In this problem, the goal is to find k subspaces of rank at most ℓ that minimize the sum of squared distances between the input points and the subspaces (turning; cohen). The cost of a clustering in this problem is the minimum cost of projecting the data points onto such subspaces. We first build the example for fair clustering with AbsError as the cost function, and later make a small adjustment so that it also applies to RelError-fair clustering.

Consider minimizing the maximum cost for two groups $A$ and $B$ in Figure 3, using one cluster center (here, the cluster center is one-dimensional, i.e., simply a direction). Assume there are two points in each group, with group $A$ lying on the $x$ axis and group $B$ placed symmetrically off it. An optimal clustering with no fairness constraint picks the $x$ axis as the subspace, so the average AbsError for group $A$ is zero while group $B$ bears the entire error. However, it is easy to see that in order to minimize the maximum average AbsError over both groups, we should instead pick the direction $h$ between the two groups as the fair subspace. Referring to Observation 5, we know the average AbsError for the two groups is then equal. (Observation 5 applies here because the Frobenius norm is convex.)

As the separation between the two groups grows, the average cost of projecting group $B$ onto the fair solution ($h$) gets arbitrarily close to its corresponding cost in the optimal solution (the $x$ axis), while group $A$'s cost under $h$ approaches that same value. Therefore, the cost of projection for all points in the fair solution asymptotically approaches twice the corresponding cost in the optimal unconstrained solution. (We should note that since the optimal cost for each group by itself is zero, the same reasoning applies if RelErrorDiff is used as the cost function.)

The example above can also be used to prove the tightness of the analysis in Theorem 6 for RelError-fair clustering. However, since division by zero is not defined, we make a small adjustment: instead of having two points in each group, we consider two sets of points, i.e., the points in the AbsError-fair example each become centers of small clusters of points belonging to the same group. We assume these points are close enough to their centers that the within-group distances are negligible. The rest of the analysis is the same as before.

Figure 3: Minimizing the maximum average AbsError yields a solution whose cost approaches twice that of the optimal solution.

Given the result above, it may seem that the different measures all behave similarly on different data sets. However, the examples provided in Figure 4 showcase how different measures of representation cost induce different clusterings for the linear ℓ-subspace k-clustering problem.

Figure 4: Comparing different fairness constraints and their induced clusterings (panels (a)–(d)). AbsError, RelError and RelErrorDiff are shortened to AE, RE and RED respectively.

4.1 Approximation algorithm via LP relaxation

For the fair k-median and k-means problems, we now study the natural linear programming relaxation and develop a rounding algorithm.

4.2 Relaxation for AbsError-Fair clustering

Let $X_1, \ldots, X_m$ be the groups of vertices, and let $X = \bigcup_r X_r$ with $n = |X|$. For $i, j \in X$, the variable $x_{ij}$ is intended to denote if vertex $j$ is assigned to center $i$. These are called assignment variables. We also have variables $y_i$ that are intended to denote if $i$ is chosen as one of the centers (or medians). The LP (called FairLP-AbsError) is now the following:

$$\min \; \lambda$$

subject to

$$\sum_{j \in X_r} \sum_{i} d(i, j)\, x_{ij} \;\le\; \lambda\, |X_r| \quad \text{for all groups } X_r, \qquad (2)$$

$$\sum_i x_{ij} = 1 \;\; \forall j, \qquad x_{ij} \le y_i \;\; \forall i, j, \qquad \sum_i y_i \le k, \qquad x_{ij}, y_i \ge 0.$$

The only new constraint compared to the standard LP for k-median (e.g., charikar2002constant) is the constraint (2) for all groups $X_r$. This is to ensure that we minimize the maximum (average) k-median objective over the groups. To handle k-means clustering, it suffices to replace $d(i, j)$ in the constraint with $d(i, j)^2$. See Remark 9 for details.
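As a concrete illustration, here is a minimal sketch of FairLP-AbsError using the PuLP modeling library (the function name and solver choice are our assumptions; for k-means, pass squared distances):

    import pulp

    def fair_lp_abserror(dist, groups, k):
        """LP relaxation sketch of FairLP-AbsError.

        dist:   n x n matrix with dist[i][j] = d(i, j)
        groups: list of lists of point indices (the X_r)
        k:      number of centers to open (fractionally)
        """
        n = len(dist)
        prob = pulp.LpProblem("FairLP_AbsError", pulp.LpMinimize)
        lam = pulp.LpVariable("lambda", lowBound=0)
        x = pulp.LpVariable.dicts("x", (range(n), range(n)), lowBound=0, upBound=1)
        y = pulp.LpVariable.dicts("y", range(n), lowBound=0, upBound=1)

        prob += lam                                              # minimize max average group cost
        for j in range(n):
            prob += pulp.lpSum(x[i][j] for i in range(n)) == 1   # every point fully assigned
            for i in range(n):
                prob += x[i][j] <= y[i]                          # only to (fractionally) open centers
        prob += pulp.lpSum(y[i] for i in range(n)) <= k          # at most k centers
        for g in groups:                                         # constraint (2), one per group
            prob += pulp.lpSum(dist[i][j] * x[i][j] for j in g for i in range(n)) <= lam * len(g)

        prob.solve(pulp.PULP_CBC_CMD(msg=False))
        return lam.value(), {i: y[i].value() for i in range(n)}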

Theorem 7.

The integrality gap of FairLP-AbsError is $\Omega(k)$.

Proof.

Consider an instance in which we have $k+1$ points in total, each in a different group. Formally, let $X_r = \{r\}$ for all $r \in \{1, \ldots, k+1\}$. Suppose that $d(i, j) = 1$ for all $i \ne j$.

Now, consider the fractional solution in which $y_i = \frac{k}{k+1}$ for all $i$. Also, let $x_{ii} = \frac{k}{k+1}$, and let $x_{ij} = \frac{1}{k+1}$ for some $j \ne i$ (it does not matter which one). It is easy to see that this solution satisfies all the constraints. Moreover, the LP objective value is $\frac{1}{k+1}$.

However, in any integral solution, one of the points is not chosen as a center, and thus the objective value is at least $1$. Thus the integrality gap is $\Omega(k)$. ∎

Theorem 7 makes it hard for an LP approach to give an approximation factor better than $O(m)$, where $m$ is the number of groups (a factor that already follows via a simple algorithm: find an approximate k-median solution on $X$ after giving every point of $X_r$ the weight $1/|X_r|$). However, the LP can still be used to obtain a bi-criteria approximation.

Theorem 8.

Consider a feasible solution $(x, y, \lambda)$ of FairLP-AbsError. For any $\epsilon \in (0, 1)$, there is an algorithm that opens at most $\frac{k}{1-\epsilon}$ centers, while achieving an objective value of at most $\frac{2\lambda}{\epsilon}$.

Proof.

The proof is based on the well-known "filtering" technique (Lin1992approximation; charikar2002constant). Define $B_j$ as the LP's "connection cost" for the point $j$. Formally, $B_j = \sum_i d(i, j)\, x_{ij}$. Now, construct a subset $S$ of the points as follows. Set $S = \emptyset$ and $T = X$ to begin with, and in every step, find the $j \in T$ that has the smallest $B_j$ value (breaking ties arbitrarily) and add it to $S$. Then, remove all $j'$ such that $d(j, j') \le \frac{2 B_{j'}}{\epsilon}$ from the set $T$. Suppose we continue this process until $T$ is empty.

The set $S$ obtained satisfies the following property: for all $u \ne v \in S$, $d(u, v) > \frac{B_u + B_v}{\epsilon}$. This is true because if $u$ was added to $S$ before $v$, then $B_u \le B_v$, and further, $v$ should not have been removed from $T$, which gives the desired bound. The property above implies that the metric balls $\mathrm{Ball}(u, \frac{B_u}{\epsilon})$ for $u \in S$ are all disjoint.

Next, we observe that each such ball contains a total $y$-value of at least $1 - \epsilon$. This is by a simple application of Markov's inequality. By definition, $B_u = \sum_i d(i, u)\, x_{iu}$, and thus the total assignment $x_{iu}$ to centers $i$ with $d(i, u) > \frac{B_u}{\epsilon}$ is at most $\epsilon$. This means that $\sum_{i \in \mathrm{Ball}(u, B_u/\epsilon)} x_{iu} \ge 1 - \epsilon$, and thus $\sum_{i \in \mathrm{Ball}(u, B_u/\epsilon)} y_i \ge 1 - \epsilon$. As the balls are disjoint, we have that $|S|\,(1 - \epsilon) \le \sum_i y_i \le k$.

Now, consider an algorithm that opens all the points of $S$ as centers. By construction, every $j$ is at a distance at most $\frac{2 B_j}{\epsilon}$ from some point in $S$, and thus for any group $X_r$, we have $\sum_{j \in X_r} d(j, S) \le \frac{2}{\epsilon} \sum_{j \in X_r} B_j \le \frac{2\lambda}{\epsilon}\,|X_r|$, thus completing the proof of the theorem. ∎
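The filtering step translates directly into code. A minimal sketch follows (the function name is ours); it consumes the fractional assignment x from the LP above and returns the opened centers:

    def filtering_round(dist, x, eps):
        """Bicriteria rounding sketch from the proof of Theorem 8.

        dist: n x n distance matrix; x[i][j]: fractional assignment of j to i;
        eps:  trade-off parameter in (0, 1). Opens at most k / (1 - eps) centers.
        """
        n = len(dist)
        # B[j] = LP connection cost of point j
        B = [sum(dist[i][j] * x[i][j] for i in range(n)) for j in range(n)]
        S, T = [], set(range(n))
        while T:
            j = min(T, key=lambda t: B[t])                 # smallest connection cost first
            S.append(j)
            # drop j itself and every point already served well by j
            T = {t for t in T if dist[j][t] > 2 * B[t] / eps}
        return S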

Remark 9 (Extension to k-means).

The argument above can be extended easily to obtain similar results for the k-means objective. We simply replace all distances $d(i, j)$ with the squared distances $d(i, j)^2$, the metric ball around each point is defined with respect to the squared distances, and the same approximation factors hold.

LP-based heuristic.

The instance showing the $\Omega(k)$ integrality gap is special in the sense that every group has exactly one point, and thus it is impossible for an integral solution with $k$ centers to achieve a small cost for all of them. We now see that in the case of k-median, there exist randomized rounding strategies that ensure that, in expectation, the connection cost of every group is within a constant factor of the LP objective. (Of course, all the costs need not simultaneously be small, e.g., in our gap instance.)

Definition 10 (Faithful rounding).

A (randomized) rounding procedure for FairLP-AbsError is said to be $c$-faithful if it takes a feasible solution $(x, y, \lambda)$ and produces a feasible integral solution $S$ with the guarantee that for every point $j$, $\mathbb{E}[d(j, S)] \le c \cdot B_j$.

Using a simple dependent rounding procedure (see Chekuri2010dependent; Srinivasan2001distributions), charikar2012dependent showed that there exists a faithful rounding for FairLP-AbsError with $c = 3.25$. We note that some of the other LP rounding schemes (e.g., charikar2002constant) are not faithful. Formally,

Theorem 11 (charikar2012dependent).

There exists a faithful (randomized) rounding algorithm for FairLP-AbsError, with $c = 3.25$.

Corollary 12.

Let $(x, y, \lambda)$ be a solution to FairLP-AbsError. There exists a rounding algorithm that ensures that the expected connection cost of every group $X_r$ is at most $3.25\,\lambda\,|X_r|$.

The corollary follows directly from Theorem 11, by linearity of expectation. While this does not guarantee that the rounding simultaneously produces a small connection cost for all groups, it gives a good heuristic rounding algorithm. In examples where every group has many points well distributed across clusters, the costs tend to concentrate around their expectations, leading to small connection costs for all groups. We will see this via examples in the experimental section.

4.3 Relaxation for RelError-Fair clustering

We now see that the rounding methods introduced in Section 4.2 can also be used for RelError-fair clustering. However, the LP in this case is not quite a relaxation:

$$\min \; \lambda$$

subject to

$$\sum_{j \in X_r} \sum_i d(i, j)\, x_{ij} \;\le\; \lambda \cdot \widetilde{\mathrm{opt}}_r \quad \text{for all groups } X_r, \qquad (3)$$

together with the assignment constraints from before ($\sum_i x_{ij} = 1$, $x_{ij} \le y_i$, $\sum_i y_i \le k$, $x, y \ge 0$).

The constraint (3) now involves a new term, $\widetilde{\mathrm{opt}}_r$, which is an approximation to the optimum k-median objective of the set $X_r$. For our purposes, we do not care how this approximation is achieved – it can be via an LP relaxation (charikar2002constant; Li2013approximating), local search (arya2004local; gupta2008), or any other method. We assume that if $\mathrm{opt}_r$ is the optimum k-median objective for $X_r$, then $\mathrm{opt}_r \le \widetilde{\mathrm{opt}}_r \le \beta \cdot \mathrm{opt}_r$ for some constant $\beta$. (From the works above, we can even think of $\beta$ as being a small constant.)

Lemma 13.

Suppose there is a rounding procedure that takes a solution $(x, y, \lambda)$ to FairLP-RelError and outputs a set $S$ of centers with the property that for some parameter $\alpha$,

$$\sum_{j \in X_r} d(j, S) \;\le\; \alpha \cdot \lambda \cdot \widetilde{\mathrm{opt}}_r \quad \text{for all groups } X_r. \qquad (4)$$

Then, this algorithm provides an $\alpha\beta$ approximation to RelError-fair clustering.

Proof.

Let Opt be the optimum value of the ratio-fair objective on the instance. The main observation is that the LP provides a lower bound on Opt. This is true because any solution to ratio-fair clustering of value Opt yields a feasible integral solution to the LP in which the RHS of the constraint (3) is replaced by $\mathrm{Opt} \cdot \mathrm{opt}_r$. Since $\mathrm{opt}_r \le \widetilde{\mathrm{opt}}_r$, it is also feasible for FairLP-RelError, showing that the optimum LP value $\lambda^*$ is at most Opt.

Next, consider a rounding algorithm that takes the optimum LP solution and produces a set $S$ that satisfies (4) (with $\lambda = \lambda^*$). Then, since $\lambda^* \le \mathrm{Opt}$, we have for every group $X_r$

$$\frac{\sum_{j \in X_r} d(j, S)}{\mathrm{opt}_r} \;\le\; \alpha \cdot \lambda^* \cdot \frac{\widetilde{\mathrm{opt}}_r}{\mathrm{opt}_r} \;\le\; \alpha \cdot \mathrm{Opt} \cdot \frac{\widetilde{\mathrm{opt}}_r}{\mathrm{opt}_r},$$

and using $\widetilde{\mathrm{opt}}_r \le \beta \cdot \mathrm{opt}_r$ completes the proof of the lemma. ∎

Thus, it suffices to develop a rounding procedure for FairLP-RelError. Here, we observe that the rounding from Theorem 8 applies directly (because constraint (3) ensures that $\sum_{j \in X_r} B_j \le \lambda \cdot \widetilde{\mathrm{opt}}_r$ for every group $X_r$), giving us the same bi-criteria guarantee (and the same adjustment under faithful rounding).

Corollary 14 (Corollary to Theorem 8).

For any $\epsilon \in (0, 1)$, there is an efficient algorithm that opens at most $\frac{k}{1-\epsilon}$ centers and achieves a $\frac{2\beta}{\epsilon}$ approximation to the optimum value of the ratio-fair objective.

5 Experiments

In this section, we present two types of experiments. In the first part, we evaluate balance-based approaches to fair clustering with regard to our representation-based notions. In the second part, we evaluate our algorithms for fair clustering and provide an empirical assessment for their performance. We consider four datasets:

  • Synthetic. A synthetic dataset with three features. The first feature is binary ("majority" or "minority") and determines the group an example belongs to. The second and third attributes are generated from one distribution in the majority group and from a different distribution in the minority group. The majority and minority groups are of size 250 and 50, respectively.

  • Iris (https://archive.ics.uci.edu/ml/datasets/iris). The data set consists of 50 samples from each of three species of Iris: Iris setosa, Iris virginica and Iris versicolor. The selected features are the length and width of the petals.

  • Census (https://archive.ics.uci.edu/ml/datasets/adult). The dataset is from the 1994 US Census; the selected attributes are "age", "fnlwgt", "education-num", "capital-gain" and "hours-per-week". The groups of interest are "female" and "male".

  • Bank (https://archive.ics.uci.edu/ml/datasets/Bank+Marketing). The dataset contains records of a phone-based marketing campaign run by a Portuguese banking institution. The selected attributes are "age", "balance" and "duration", and the groups of interest are "married" and "single".

We should note that in all experiments, points were clustered using 3 centers (k = 3).
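For concreteness, here is a sketch of the Census preparation (the column names follow the UCI Adult documentation; the local file path is a placeholder):

    import pandas as pd

    cols = ["age", "workclass", "fnlwgt", "education", "education-num",
            "marital-status", "occupation", "relationship", "race", "sex",
            "capital-gain", "capital-loss", "hours-per-week",
            "native-country", "income"]
    df = pd.read_csv("adult.data", names=cols, skipinitialspace=True)  # placeholder path

    selected = ["age", "fnlwgt", "education-num", "capital-gain", "hours-per-week"]
    features = df[selected].to_numpy()
    groups = [features[(df["sex"] == "Female").to_numpy()],
              features[(df["sex"] == "Male").to_numpy()]]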

5.1 On balance and representations

In this section, we empirically study the effects of enforcing balance on group representations. More specifically, we compare each group's average cost under unconstrained k-median to the corresponding value under the balance constraint. For the balance-fair k-median, we chose the algorithm proposed by Backurs et al. (backurs2019scalable), whose implementation is publicly available. In this experiment, we used the entire Synthetic and Iris datasets, and sampled 300 examples from each of the Census (150 male, 150 female) and Bank (150 married, 150 single) datasets. In Table 1, we present the average costs for all groups within each dataset, in the two clusterings generated by unconstrained k-median and balanced k-median. In all datasets, we observe that enforcing balance amplifies representation disparity across groups and leads to a higher maximum average cost. The effect is especially noticeable in the Synthetic and Iris datasets, where the groups have drastically different distributions. (The algorithm proposed by Backurs et al. works on only two groups; we chose two groups out of three from Iris, and repeating the experiment with the other pairs led to similar results.)

Dataset      Group        Unconstrained   Balanced
Synthetic    majority     0.514           0.430
Synthetic    minority     0.678           3.476
Iris         Setosa       0.169           0.101
Iris         Versicolor   0.256           2.819
Census       female       34492.40        34019.34
Census       male         35083.73        35876.70
Bank         married      627.05          622.87
Bank         single       682.76          694.78

Table 1: Effects of enforcing balance on group representations

5.2 Algorithm evaluation

The empirical results in the last section show that balance-based algorithms do not mitigate representation disparity across groups. Therefore, in this section we propose two heuristic algorithms to compute group-representative k-median clusterings, which we call LS-Fair k-median and LP-Fair k-median:

LS-Fair k-median

Arya et al. proposed a local search algorithm to approximately solve the k-median problem (arya2004local). Their algorithm starts with an arbitrary solution, and repeatedly improves it by swapping a subset of the centers in the current solution with another set of centers not in it. We modify this algorithm to minimize the maximum average cost over all groups. Given a cost function cost and groups $X_1, \ldots, X_m$ with $X = \cup_i X_i$, LS-Fair k-median is presented in Algorithm 1.

C ← an arbitrary set of k centers from X
while there are c ∈ C and c′ ∈ X \ C s.t. max_i cost(X_i, (C \ {c}) ∪ {c′}) / |X_i| < max_i cost(X_i, C) / |X_i| do
     C ← (C \ {c}) ∪ {c′}
return C
Algorithm 1 LS-Fair k-median(X, {X_i}, cost, k)
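A minimal Python sketch of Algorithm 1 over a precomputed distance matrix follows (the function names and the random initialization are our choices, for illustration):

    import itertools
    import random

    def max_group_cost(dist, groups, centers):
        """Maximum, over groups, of the average distance to the nearest open center."""
        return max(sum(min(dist[j][c] for c in centers) for j in g) / len(g)
                   for g in groups)

    def ls_fair_kmedian(dist, groups, k, seed=0):
        """Single-swap local search for LS-Fair k-median (Algorithm 1), as a sketch."""
        n = len(dist)
        random.seed(seed)
        centers = set(random.sample(range(n), k))
        best = max_group_cost(dist, groups, centers)
        improved = True
        while improved:
            improved = False
            for c_out, c_in in itertools.product(list(centers), range(n)):
                if c_in in centers:
                    continue
                cand = (centers - {c_out}) | {c_in}
                val = max_group_cost(dist, groups, cand)
                if val < best:                       # take the improving swap
                    centers, best, improved = cand, val, True
                    break
        return centers, best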

Later in this section we see that LS-Fair k-median works well in practice. However, the following example shows that k-median with the AbsError-fair objective can have local optima that are arbitrarily worse than the global optimum. Let $P$ and $Q$ be two sets that are far apart (think of the distance between any point of $P$ and any point of $Q$ as essentially infinite). Let $P = P_1 \cup P_2$, where $|P_1| = 1$ and $|P_2| = N$, for some integer parameter $N$. Likewise, suppose that $Q = Q_1 \cup Q_2$, of sizes $1$ and $N$ respectively. Suppose that all the elements of $P_2$ (so also $Q_2$) are at distance $1$ from one another. Suppose the distance between $P_1$ and $P_2$ (so also $Q_1$ and $Q_2$) is some large value $D$.

Now, suppose the two groups are $P_1 \cup Q_2$ and $P_2 \cup Q_1$. Let $k = 2$. The optimal solution is to choose one point in $P_2$ and another in $Q_2$. This results in an objective value of $D + N - 1$ for each group.

Consider the solution $S$ that chooses the unique points from $P_1$ and $Q_1$. The k-median objective for both the groups is $N D$, and thus the AbsError-fair objective is $N D$. Now, consider swapping the center at $P_1$ with some point $p \in P_2$. This changes the k-median objective for group 1 from $N D$ to $(N+1) D$ (the point $P_1$ must now travel distance $D$ to reach $p$), and so even though the swap significantly decreases the objective for the second group, the local search algorithm will not perform the swap. The same argument holds for swapping the center at $Q_1$ with a point $q \in Q_2$. It is thus easy to see that $S$ is a locally optimal solution.

However, the ratio between the AbsError-fair objectives of this solution and the optimum is $\frac{N D}{D + N - 1}$, which tends to $N$ as $D \to \infty$. Thus the gap can be as bad as the number of points.
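A quick numeric check of this example (the constants below are illustrative; N is the cluster size and D the separation):

    N, D = 100, 10_000

    # Local optimum: centers at the singletons P1 and Q1.
    # Group 1 = P1 + Q2: P1 pays 0, each of Q2's N points pays D.
    local_obj = N * D                       # both groups pay the same, by symmetry

    # Optimum: one center inside P2 and one inside Q2.
    # Group 1 = P1 + Q2: P1 pays D, Q2 pays ~1 per remaining point.
    opt_obj = D + (N - 1)

    print(local_obj / opt_obj)              # ~ N once D >> N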

LP-Fair k-median

LP-Fair k-median first solves FairLP, as presented in Sections 4.1 and 4.3, and then rounds the solution with the matching idea proposed by Charikar et al. (charikar2012dependent). The rounding is done in four phases:

  1. Filtering: Similar to the filtering technique described in Section 4.1, we construct a subset $S$ of the points, with the small adjustment that after adding a point $j$ to the set $S$, all points $j'$ from the original set with $d(j, j')$ at most a constant multiple of $B_{j'}$ are no longer considered for addition to $S$.

  2. Bundling: For each point $j \in S$, we create a bundle $U_j$ comprised of the centers that exclusively serve $j$. In the rounding procedure, each bundle is treated as a single entity, of which at most one center will be opened. The probability of opening a center from a bundle $U_j$ is $\sum_{i \in U_j} y_i$, which we call the bundle's volume.

  3. Matching: The generated bundles have the nice property that their volume lies between $1/2$ and $1$. So given any two bundles, at least one center from them should be opened. Therefore, while there are at least two unmatched points in $S$, we match the corresponding bundles of the two closest unmatched points in $S$.

  4. Sampling: Given the matching generated in the last phase, we iterate over its pairs and open centers with probabilities given by the bundle volumes, so that $k$ centers are opened in expectation.

The centers picked in the sampling phase are returned as the final centers.
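A much-simplified sketch of the matching and sampling phases follows. It opens at least one center per matched pair using bundle volumes as probabilities, but it does not reproduce the exact dependent-rounding distribution of charikar2012dependent, so treat it as an illustration only:

    import random

    def sample_centers(bundles, volumes, mate, seed=0):
        """Open at most one center per bundle; each matched pair opens at least one.

        bundles: list of lists of candidate center indices
        volumes: bundle volumes (sums of y-values), each in [1/2, 1]
        mate:    mate[b] = index of the bundle matched to b, or None
        """
        random.seed(seed)
        opened, done = [], set()
        for b in range(len(bundles)):
            if b in done:
                continue
            done.add(b)
            m = mate[b]
            if m is not None:
                done.add(m)
            if random.random() < volumes[b]:              # open from this bundle...
                opened.append(random.choice(bundles[b]))
            elif m is not None:                           # ...otherwise from its mate,
                opened.append(random.choice(bundles[m]))  # so the pair opens >= 1 center
        return opened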

Results.

In this experiment, in order to save space, we focus on just the Census and Bank datasets. However, we consider two subsamples of each dataset: 1:1 Census contains 150 female and 150 male examples, 1:5 Census contains 50 female and 250 male examples, 1:1 Bank contains 150 married and 150 single examples, and 1:5 Bank contains 50 married and 250 single examples. In Table 2, we present the average cost for all groups within each sample. "Group optimal" is the optimal average cost for a group when it is clustered by itself with the same number of centers; "k-median" is a group's average cost in a clustering generated by unconstrained k-median performed on all groups together. The other rows in the table show the average group cost for each of the heuristic algorithms, using the various cost functions. In general, the results demonstrate the effectiveness of our algorithms; we emphasize the difference between the 1:1 and 1:5 samples. In the 1:1 case, the groups have the same size and unconstrained k-median treats them roughly the same. But in the 1:5 case, when the groups have different distributions, unconstrained k-median favors the majority group over the other, and the effectiveness of our proposed algorithms is more evident. (Each dataset was sampled 10 times and we report the overall average.)

                      1:1 Census        1:5 Census        1:1 Bank          1:5 Bank
                      female   male     female   male     married  single   married  single
Group optimal         34499    31528    35349    32619    569      686      659      655
k-median              35264    32351    40212    32689    596      730      948      665
AbsError  LS-Fair     34827    33298    37887    36144    627      718      749      740
          LP-Fair     34668    33971    38396    35675    630      717      740      763
RelError  LS-Fair     35390    32341    38099    34702    611      727      745      747
          LP-Fair     35397    32343    38067    33865    613      722      767      743

Table 2: Clustering the Census and Bank datasets using the LS-Fair and LP-Fair algorithms

6 Conclusion

In this work we presented a novel way of thinking about and formulating fairness in clustering tasks, based on group representativeness. Our main contributions are a fairness notion that parallels the development of fairness notions in the classification setting, bicriteria approximation algorithms for k-median under different variations of this notion, and accompanying theoretical bounds. Our results suggest that our formulation provides better-quality representations, especially when the groups are skewed in size.

7 Broader Impact

Clustering is a critical part, and often one of the early steps, of the learning pipeline; preprocessing data for supervised learning is one of its many use cases. It is therefore important to understand how bias might enter the pipeline through clustering, and how one might mitigate it. The underlying assumption in most clustering tasks is that cluster centers act as representatives that summarize the variety of points in their cluster. If there exist pre-defined groups beyond the clusters themselves, it is possible that some groups are not as well represented as others in a clustering. Poor representation of a specific set of data points in clustering may lead to that group being neglected in the rest of the learning pipeline. Our research introduces a new way of understanding and mitigating poor representation of protected social groups in clustering. This is crucial for ensuring equal treatment of all social groups in any learning task that uses clustering as a preprocessing tool.

In all discussions around what it means for something to be "fair", it is important to look at the normative basis for the claim. We argue that representation quality acts as such a normative basis. While it is true that this is not captured by existing formulations, a proper understanding of normative concerns around representation requires a deeper understanding of specific use cases, even if it is just a matter of deciding whether to use AbsError or RelError. In that respect, our work is on the more theoretical end – trying to understand the computational elements of these measures – rather than providing a recommendation for how clustering should be done fairly.

Another concern that our work exposes but does not resolve is that the process of enforcing fair representations may come at the price of making a noticeable fraction of the population bear a larger burden in terms of representation cost. The nature of such a trade-off is not well studied at this point, and it remains both a point of caution and an avenue for further study.
