Fair Correlation Clustering

02/06/2020 ∙ by Sara Ahmadian, et al. ∙ 0

In this paper, we study correlation clustering under fairness constraints. Fair variants of k-median and k-center clustering have been studied recently, and approximation algorithms using a notion called fairlet decomposition have been proposed. We obtain approximation algorithms for fair correlation clustering under several important types of fairness constraints. Our results hinge on obtaining a fairlet decomposition for correlation clustering by introducing a novel combinatorial optimization problem. We show how this problem relates to k-median fairlet decomposition and how this allows us to obtain approximation algorithms for a wide range of fairness constraints. We complement our theoretical results with an in-depth analysis of our algorithms on real graphs where we show that fair solutions to correlation clustering can be obtained with limited increase in cost compared to the state-of-the-art (unfair) algorithms.


1 Introduction

There is a growing literature on fairness in various learning and optimization problems [33, 32, 30, 15, 45, 14, 20]. The goal of this literature is to develop criteria and algorithms to ensure that we can find solutions for optimization/learning problems that are fair with respect to a certain sensitive feature. In the case of clustering, a fundamental unsupervised learning and optimization problem, the study of fairness was initiated by Chierichetti et al. [19]. They formulated the notion of proportional fairness, and developed approximation algorithms for fair k-median and fair k-center under this notion of fairness. Follow-up work generalized their results to other clustering problems, such as k-means and facility location, and to more relaxed notions of fairness [10, 1, 9, 7, 35]. Notably, the important graph clustering problem of correlation clustering has so far not been addressed by this literature.

Correlation clustering is a type of clustering that uses information about similarity as well as dissimilarity relationships among a set of objects in order to cluster them [8]. In contrast to other clustering problems such as k-median, k-means, and k-center, the number of clusters is not predetermined but rather determined based on the outcome of the optimization. This fact, as well as the fact that correlation clustering takes advantage of both similarity and dissimilarity information, makes it a desirable clustering model in many applications [37, 42, 4]. Therefore, it is natural to study this problem under fairness constraints.

The main tool introduced by Chierichetti et al. [19] for solving fair k-center and k-median problems is the notion of fairlets. A fairlet is a small set of elements that satisfies the fairness property. Chierichetti et al. [19] showed that fair k-median and k-center can be solved by first decomposing an instance into fairlets and then solving the clustering problem on the set of centers of these fairlets. To the best of our knowledge, this technique has been used only for metric space clustering problems such as k-center and k-median.

Our main result is a fairlet-based reduction for the graph clustering problem of correlation clustering. In the case of k-center and k-median, the fairlet decomposition problem amounts to solving the same clustering problem on the same instance under the condition that each cluster is a fairlet. The situation for correlation clustering is complicated by the lack of the properties of metric spaces. To tackle this problem, we introduce a novel cost function for the correlation clustering fairlet decomposition, and prove that this cost can be approximated by a median-type clustering cost function for a carefully defined metric space. We prove that given a solution to this fairlet decomposition problem, we can reduce the fair correlation clustering instance to a regular correlation clustering instance through a graph manipulation method. Therefore, we show that any approximation algorithm for fairlet decomposition for k-median yields an approximation algorithm for fair correlation clustering. The loss in the approximation ratio depends on the size of the fairlets.

Furthermore, we prove that in many natural cases, there is a fairlet decomposition with small fairlets. This bounds the approximation ratio of our algorithm for fair correlation clustering. In addition to the theoretical bounds on the approximation ratio of our algorithm, we provide an empirical evaluation based on real data sets, showing that the algorithms often perform much better in practice than their worst-case guarantees and that they yield solutions of costs comparable to that of unfair clustering algorithms while substantially reducing the imbalance of the clusters.

Related work.

Clustering is a fundamental unsupervised machine learning task with a long history (cf. [29]). Our paper spans the areas of correlation clustering, clustering in metric spaces, and fairness in clustering, which are actively growing fields. For brevity, we will only focus on key works in these three areas.

Correlation clustering. Correlation clustering is a widely studied formulation of clustering with both similarity and dissimilarity information [8], with many applications in machine learning [37, 11]. Variants of the problem include complete signed graphs [8, 4] and weighted graphs [21]. We focus on the complete graph case with ±1 weights, which is APX-hard [16] but admits constant-factor algorithms [4, 17]. Distributed and streaming algorithms are also known [41, 3].

Metric space clustering. The most widely studied clustering setting is clustering in metric spaces, which consists of minimizing the $\ell_p$-norm of the distances between points in a cluster and their center. For $p \in \{1, 2, \infty\}$ this corresponds to k-median, k-means, and k-center, respectively, which are NP-hard problems but admit constant-factor approximations [25, 27, 40, 2, 34].

Fairness in clustering. Fairness in machine learning is a new area with a fast-growing literature. Fundamental work in this area is devoted to defining notions of fairness [13, 22, 23, 32] and solving fairness-constrained problems [14, 15, 19, 30, 32, 45, 6, 33, 24, 1, 20, 26].

Chierichetti et al. [19] first introduced a notion of disparate impact for clustering and provided fair k-center algorithms for the case of two colors (or groups). We review this notion in more detail in Section 2. Following this work, the problem has been generalized in many directions, including allowing many colors [43], allowing upper bounds on the fraction of points of a given color [1], and allowing both upper and lower bounds [9, 10]. Backurs et al. [5] designed near-linear algorithms for finding k-median fairlets, and Huang et al. [28] designed core-sets for the problem. Other variants include clustering with diversity constraints [39], proportionality constraints [18], and fair center selection [35]. Fairness has been studied in spectral clustering as well [36, 46].

To the best of our knowledge no prior work has addressed correlation clustering with fairness constraints in the cluster elements distribution. Kalhan [31] recently studied an orthogonal notion of fairness in correlation clustering which consists in bounding the maximum error for a node.

2 Problem Statement

Correlation clustering. Let $G = (V, E)$ be a complete undirected graph on $n = |V|$ vertices and $\sigma : E \to \mathbb{R}$ be a function that assigns a label to each edge. The label $\sigma(e)$ for each $e \in E$ is either positive (indicating that the two endpoints of $e$ are similar) or non-positive (indicating that they are dissimilar). In the unweighted version of the problem, $\sigma(e) \in \{-1, +1\}$ for each $e \in E$. Our focus in this paper is on the unweighted problem, which is the original correlation clustering model defined by Bansal et al. [8], although we will use the weighted version in the proof. Let $E^+$ be the set of positive edges and $E^-$ be the set of non-positive edges. For subsets $S, T \subseteq V$, let $E(S)$ denote the edges inside $S$ and $E(S, T)$ denote the edges between $S$ and $T$.

A clustering $\mathcal{C} = \{C_1, \ldots, C_k\}$ is a partitioning of $V$ into disjoint subsets. The sets of intra-cluster and inter-cluster edges in a clustering $\mathcal{C}$ are defined as $\mathrm{intra}(\mathcal{C}) = \bigcup_i E(C_i)$ and $\mathrm{inter}(\mathcal{C}) = E \setminus \mathrm{intra}(\mathcal{C})$. The correlation clustering cost of $\mathcal{C}$ is defined as:

$$\mathrm{cost}(G, \mathcal{C}) = \sum_{e \in \mathrm{intra}(\mathcal{C}) \cap E^-} |\sigma(e)| + \sum_{e \in \mathrm{inter}(\mathcal{C}) \cap E^+} |\sigma(e)|.$$

In the unweighted version of the problem, this is simply $|\mathrm{intra}(\mathcal{C}) \cap E^-| + |\mathrm{inter}(\mathcal{C}) \cap E^+|$. The goal of correlation clustering is to find a clustering $\mathcal{C}$ that minimizes $\mathrm{cost}(G, \mathcal{C})$. (This is in fact the minimizing disagreements variant of correlation clustering. A maximizing agreements version can be defined similarly. In this paper we focus on minimizing disagreements, since the maximization version admits a trivial randomized approximation that can be made fair.) For unweighted correlation clustering, there are constant-factor approximation algorithms for this problem (with the best known constant being 2.06) [8, 4, 17]. For the weighted case, the best known approximation factor for this problem is $O(\log n)$ [21].
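To make the unweighted cost concrete, the following minimal sketch counts disagreements for a given clustering. The dictionary representation of the edge labels and the function name `correlation_cost` are illustrative assumptions, not part of the paper.

```python
def correlation_cost(sign, clusters):
    """Count disagreements in an unweighted instance.

    sign[(u, v)] with u < v holds the edge label (+1 or -1); this dictionary
    layout is a hypothetical representation chosen for the example.
    clusters is a list of disjoint vertex lists covering all vertices.
    """
    cluster_of = {v: i for i, cluster in enumerate(clusters) for v in cluster}
    cost = 0
    for (u, v), s in sign.items():
        same = cluster_of[u] == cluster_of[v]
        if same and s <= 0:
            cost += 1  # non-positive edge inside a cluster
        elif not same and s > 0:
            cost += 1  # positive edge across clusters
    return cost


# A small complete signed graph on 4 vertices.
sign = {(0, 1): +1, (0, 2): -1, (0, 3): -1, (1, 2): +1, (1, 3): -1, (2, 3): +1}
print(correlation_cost(sign, [[0, 1], [2, 3]]))
```

On this toy graph the clustering {0, 1}, {2, 3} pays only for the positive edge (1, 2) that it cuts.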

Fairness constraints. In the fair version of any clustering problem, each vertex $v$ has a color $c(v)$. Proportional fairness, defined by Chierichetti et al. [19], requires that in every cluster, the number of vertices of each color is proportional to the corresponding number in the whole graph. In particular, in the symmetric case where each color appears the same number of times in the graph, we require the same in each cluster. Ahmadian et al. [1] relaxed this property by requiring that each color constitutes at most an $\alpha$-fraction of each cluster, for a given $\alpha$. Bera et al. [9] further generalized this notion to include lower bounds on the number of vertices of each color in each cluster.
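As a small illustration of the upper-bound notion of fairness, the sketch below checks whether a single cluster respects a cap of an $\alpha$-fraction per color. The function name and the use of `Fraction` for $\alpha$ (to avoid floating-point rounding at the boundary) are assumptions made for this example.

```python
from collections import Counter
from fractions import Fraction


def is_alpha_fair(cluster, color, alpha):
    """Return True if every color makes up at most an alpha-fraction of the cluster.

    color maps each vertex to its color; alpha is a Fraction such as Fraction(1, 2).
    """
    counts = Counter(color[v] for v in cluster)
    return all(count <= alpha * len(cluster) for count in counts.values())


color = {0: "red", 1: "blue", 2: "red", 3: "blue"}
print(is_alpha_fair([0, 1], color, Fraction(1, 2)))  # one red, one blue
print(is_alpha_fair([0, 2], color, Fraction(1, 2)))  # two reds
```

A clustering is feasible under this notion when the check passes for every one of its clusters.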

In this paper, we give a general reduction from the fair correlation clustering to a median fairlet decomposition that works for any of these definitions of fairness, and in fact for a more general class of constraints. As long as the fairlet decomposition problem can be solved with small fairlets (which is the case for the above definitions of fairness, as we will show in Section 4), this will give us an approximation algorithm for the corresponding fair correlation clustering problem.

3 Overview of Results

In this section, we give a high-level overview of our algorithm and the proof. The main ingredient of our algorithm is a general reduction from the given constrained correlation clustering problem (as defined below) to a fairlet decomposition problem. We then show how the cost of a fairlet decomposition can be approximated by a median clustering cost function. This allows us to use previous results on the fair median problem to solve fairlet decomposition for the standard notions of fairness defined in the previous section. Finally, given an approximately optimal fairlet decomposition, we use our reduction to reduce the constrained correlation clustering instance to a standard correlation clustering instance, and apply known algorithms [8, 4, 17] to solve this problem.

Constrained correlation clustering. We start by defining a general class of constrained correlation clustering problems. Consider an unweighted correlation clustering instance $G = (V, E)$ and let $\mathcal{F}$ be a family of subsets of $V$. We treat $\mathcal{F}$ as the family of feasible clusters, and assume it has the following composability property: for every pair of disjoint sets $S, T \in \mathcal{F}$, we have $S \cup T \in \mathcal{F}$. Note that this property is satisfied when $\mathcal{F}$ is the collection of all fair sets under any of the definitions of fairness given in Section 2. The constrained correlation clustering problem is to find a correlation clustering $\mathcal{C}$ with minimum cost such that for all $C_i \in \mathcal{C}$, we have $C_i \in \mathcal{F}$.

Fairlet decomposition.

Next, we define the notion of fairlet decomposition used in our reduction. A fairlet decomposition for a constrained correlation clustering problem is simply a partition $\mathcal{P} = \{P_1, \ldots, P_m\}$ of $V$ into subsets in $\mathcal{F}$, i.e., $P_i \in \mathcal{F}$ for all $i$. We call each $P_i$ a fairlet. The key in our reduction is a cost function fcost that evaluates $\mathcal{P}$'s usefulness in building a correlation clustering of $G$. Here we define this cost function, and in Section 4.1 we show how it can be approximated by the standard k-median clustering cost function in a carefully defined metric space.

Fairlet decomposition cost.

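Consider a fairlet decomposition $\mathcal{P} = \{P_1, \ldots, P_m\}$. For each fairlet $P_i$, we let $\mathrm{fcost}^{\mathrm{in}}(P_i)$ be the number of negative edges inside $P_i$, i.e., $\mathrm{fcost}^{\mathrm{in}}(P_i) = |E^- \cap E(P_i)|$. For fairlets $P_i, P_j$, we let $\mathrm{fcost}^{\mathrm{out}}(P_i, P_j)$ be the number of edges between them with the minority sign, i.e.,

$$\mathrm{fcost}^{\mathrm{out}}(P_i, P_j) = \min\big(|E^+ \cap E(P_i, P_j)|,\; |E^- \cap E(P_i, P_j)|\big).$$

Finally, we let $\mathrm{fcost}^{\mathrm{in}}(\mathcal{P}) = \sum_i \mathrm{fcost}^{\mathrm{in}}(P_i)$, $\mathrm{fcost}^{\mathrm{out}}(\mathcal{P}) = \sum_{i < j} \mathrm{fcost}^{\mathrm{out}}(P_i, P_j)$, and $\mathrm{fcost}(\mathcal{P}) = \mathrm{fcost}^{\mathrm{in}}(\mathcal{P}) + \mathrm{fcost}^{\mathrm{out}}(\mathcal{P})$.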
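The two ingredients of the fairlet cost described above (negative edges inside a fairlet, minority-sign edges between two fairlets) can be computed directly. The sketch below returns them as a pair; how they are aggregated into a single number is left to the caller, and the function name and edge-label dictionary are assumptions made for illustration.

```python
def fairlet_cost(sign, fairlets):
    """Return (inside, between) for a fairlet decomposition.

    inside:  number of non-positive edges within fairlets.
    between: for every pair of fairlets, the number of edges with the minority sign.
    sign[(u, v)] with u < v is the edge label (hypothetical representation).
    """
    def label(u, v):
        return sign[(min(u, v), max(u, v))]

    inside = sum(
        1
        for part in fairlets
        for i, u in enumerate(part)
        for v in part[i + 1:]
        if label(u, v) <= 0
    )
    between = 0
    for a in range(len(fairlets)):
        for b in range(a + 1, len(fairlets)):
            positive = sum(
                1 for u in fairlets[a] for v in fairlets[b] if label(u, v) > 0
            )
            total = len(fairlets[a]) * len(fairlets[b])
            between += min(positive, total - positive)
    return inside, between


sign = {(0, 1): +1, (0, 2): -1, (0, 3): -1, (1, 2): +1, (1, 3): -1, (2, 3): +1}
print(fairlet_cost(sign, [[0, 1], [2, 3]]))
print(fairlet_cost(sign, [[0, 2], [1, 3]]))
```

Grouping similar vertices into the same fairlet keeps the inside term small, which is what makes a decomposition useful for the later reduction.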

Reduced instance.

Given a constrained correlation clustering instance $G$ and a fairlet decomposition $\mathcal{P}$ for $G$, we define a reduced correlation clustering instance $G^{\mathcal{P}}$ as follows: $G^{\mathcal{P}}$ has $|\mathcal{P}|$ vertices, each corresponding to one fairlet in $\mathcal{P}$. We denote the vertex corresponding to the fairlet $P_i$ by $v_i$. The label of the edge between $v_i$ and $v_j$ is the majority sign of the edges in $E(P_i, P_j)$ (with ties broken arbitrarily) multiplied by a weight that is equal to the number of edges in $E(P_i, P_j)$ with the majority sign.

Note that the instance defined above is an instance of weighted correlation clustering, although as we will observe, the edges have weights that are within a constant factor of each other, and therefore the problem can be solved using unweighted correlation clustering algorithms. Given a solution to this problem, it can be expanded into a solution of the original constrained problem. The final algorithm is sketched below.

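1: Find an approximately optimal fairlet decomposition $\mathcal{P}$ of $G$.
2: Compute the reduced instance $G^{\mathcal{P}}$.
3: Let $\mathcal{C}'$ be an approximately optimal (non-constrained) correlation clustering of $G^{\mathcal{P}}$.
4: Output the clustering of $G$ obtained by replacing each vertex $v_i$ in each cluster of $\mathcal{C}'$ with the fairlet $P_i$.
Algorithm 1 Constrained Correlation Clustering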
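Steps 2 and 4 of Algorithm 1 are mechanical and can be sketched as follows. The function names, the dictionary encoding of edge labels, and the tie-breaking rule (ties go to the positive sign) are assumptions for this example; steps 1 and 3 are left to whatever fairlet decomposition and correlation clustering solvers are available.

```python
def reduced_instance(sign, fairlets):
    """Build the weighted signed graph on fairlets.

    The entry for (i, j) is the majority sign of the edges between fairlets i and j,
    multiplied by the number of edges carrying that sign.
    """
    def label(u, v):
        return sign[(min(u, v), max(u, v))]

    weight = {}
    for i in range(len(fairlets)):
        for j in range(i + 1, len(fairlets)):
            positive = sum(
                1 for u in fairlets[i] for v in fairlets[j] if label(u, v) > 0
            )
            negative = len(fairlets[i]) * len(fairlets[j]) - positive
            # Ties are broken toward the positive sign here; any rule is allowed.
            weight[(i, j)] = positive if positive >= negative else -negative
    return weight


def expand(fairlets, reduced_clusters):
    """Replace each reduced vertex by its fairlet to get a clustering of the original graph."""
    return [sorted(v for i in cluster for v in fairlets[i]) for cluster in reduced_clusters]


sign = {(0, 1): +1, (0, 2): -1, (0, 3): -1, (1, 2): +1, (1, 3): -1, (2, 3): +1}
fairlets = [[0, 1], [2, 3]]
print(reduced_instance(sign, fairlets))
print(expand(fairlets, [[0], [1]]))
```

Because every output cluster is a union of fairlets, feasibility of the expanded clustering follows from the composability property of the constraint family.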

To prove that the above algorithm produces an approximately optimal solution to the constrained correlation clustering problem, we need the following lemmas. The first two lemmas prove that a solution of $G$ can be transformed to a solution of $G^{\mathcal{P}}$ and vice versa, and these transformations do not increase the cost by more than the cost of the fairlet decomposition. The third lemma bounds the cost of a fairlet decomposition in terms of the cost of the optimal solution to the constrained correlation clustering problem. The proofs of these lemmas are presented in the Supplemental Material.

Lemma 3.1.

Given a correlation clustering instance , a fairlet decomposition for , and a clustering of , there is a clustering of such that

Lemma 3.2.

Assume is a clustering of , and let be the clustering computed in line 4 of Algorithm 1 for . Then we have

Lemma 3.3.

For any constrained correlation clustering instance , and any constrained clustering of , there is a fairlet decomposition of satisfying .

Putting these together, we have the following:

Theorem 3.4.

Assume we have an -approximation algorithm for finding the minimum cost fairlet decomposition and a -approximation algorithm for solving the unconstrained correlation clustering instance . Then Algorithm 1 produces a -approximation for the constrained correlation clustering instance .

Proof.

Let be an optimal solution to the constrained correlation clustering instance . By Lemma 3.3, the fairlet decomposition problem has a solution of cost at most , and therefore, algorithm for this problem must find a decomposition with . Also, by Lemma 3.1, the instance has a solution of cost at most . Therefore, algorithm can find a clustering of cost at most . Thus, by Lemma 3.2, the cost of the clustering produced by Algorithm 1 is at most . Finally, by the composability property of the constraints, we know that this clustering satisfies the constraints, since each of its clusters is a union of fairlets in . ∎

4 Fairlet Decomposition

In this section, we show how to solve the fairlet decomposition problem by reducing it to a fair clustering problem with the k-median cost function in an appropriately defined metric space. The reduction, presented in Section 4.1, loses a factor that is proportional to the size of the largest fairlet, but as we show in Section 4.2, in the cases where we know how to solve the fairlet decomposition problem, the size of the fairlets can be guaranteed to be small.

4.1 Reduction to median cost

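Consider a correlation clustering instance $G = (V, E)$ and let $d$ be a distance function that defines a metric space on the set of vertices $V$. For a fairlet decomposition $\mathcal{P} = \{P_1, \ldots, P_m\}$, the median cost can be defined as follows: $\mathrm{mcost}(P_i) = \min_{c \in P_i} \sum_{v \in P_i} d(c, v)$ and $\mathrm{mcost}(\mathcal{P}) = \sum_i \mathrm{mcost}(P_i)$. Notice that the problem of finding the fairlet decomposition with minimum $\mathrm{mcost}(\mathcal{P})$ is precisely the fairlet decomposition problem for fair k-median, as studied by [19, 9].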

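We now define a distance function $d$ such that the median cost mcost can approximate the fairlet cost fcost. We first define an embedding $\phi : V \to \{-1, +1\}^n$ as follows. For a vertex $v$, let

$$\phi(v)_u = \begin{cases} +1 & \text{if } u = v, \\ \sigma(u, v) & \text{otherwise.} \end{cases}$$

In other words, $\phi(v)$ is the $v$th row of the adjacency matrix of $G$ after adding a positive self-loop at every vertex. Define $d(u, v)$ for points $u, v \in V$ to be the Hamming distance of their embeddings, i.e., $d(u, v) = |\{w \in V : \phi(u)_w \neq \phi(v)_w\}|$. Intuitively, $d(u, v)$ measures the “cost” of committing to put $u$ and $v$ in one cluster in the correlation clustering instance.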
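The embedding and the induced Hamming distance are easy to compute from the edge labels. The sketch below also evaluates a median cost for one fairlet, under the assumption that the center must be a member of the fairlet; the function names and the dictionary layout are illustrative only.

```python
def embedding(sign, n, v):
    """Row v of the signed adjacency matrix with a positive self-loop added."""
    return [1 if u == v else sign[(min(u, v), max(u, v))] for u in range(n)]


def hamming(sign, n, u, v):
    """Number of coordinates where the embeddings of u and v differ."""
    return sum(a != b for a, b in zip(embedding(sign, n, u), embedding(sign, n, v)))


def median_cost(sign, n, fairlet):
    """Cheapest sum of distances to a center chosen inside the fairlet (an assumption)."""
    return min(sum(hamming(sign, n, c, v) for v in fairlet) for c in fairlet)


sign = {(0, 1): +1, (0, 2): -1, (0, 3): -1, (1, 2): +1, (1, 3): -1, (2, 3): +1}
print(hamming(sign, 4, 0, 1), hamming(sign, 4, 0, 3))
print(median_cost(sign, 4, [0, 1, 2]))
```

Two vertices are close under this distance when they agree on which other vertices they are similar to, which is why grouping them early is cheap.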

We now prove the following two lemmas, which show that the fcost of a fairlet decomposition is close to its mcost with respect to the metric . The proofs of these lemmas are presented in the Supplemental Material.

Lemma 4.1.

For any fairlet decomposition , we have

Lemma 4.2.

Let be any fairlet decomposition and let . Then,

Using the above lemmas, we have the following.

Theorem 4.3.

Assume there is a -approximation algorithm for fairlet decomposition with median costs. Furthermore, assume that this algorithm always produces fairlets of size at most . Then the solution produced by this algorithm is a -approximation to the problem of finding a fairlet decomposition with minimum fcost.

4.2 Algorithms for fairlet decomposition

In this section, we give algorithms for fairlet decomposition with median cost for several notions of fairness. These algorithms, together with Theorems 3.4 and 4.3, imply algorithms for the corresponding fair correlation clustering problems. We focus on three fairness constraints: an upper bound of $\alpha = 1/2$ on the fraction of vertices of each color in each cluster; an upper bound of $\alpha = 1/C$ where $C$ is the number of distinct colors (this corresponds to the proportional fairness property studied in [19]); and an upper bound of $\alpha = 1/t$ for an integer $t$ on the fraction of vertices of each color in each cluster. We give approximation algorithms in these cases (a bicriteria approximation in the third case), with an upper bound (of $3$, $C$, and $2t - 1$, respectively) on the size of fairlets.

Throughout this subsection, when we speak of the cost of a fairlet decomposition, we mean its median cost.

4.2.1 $\alpha = 1/2$

First, we show that there is an optimal fairlet decomposition with small fairlets.

Claim 4.4.

There is an optimal fairlet decomposition $\mathcal{P}$ where $|P_i| \le 3$ for every $P_i \in \mathcal{P}$.

Proof.

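Consider an optimal solution $\mathcal{P}$ with the largest number of fairlets. Suppose there is a part $P_i$ with $|P_i| > 3$. Then let $u, v \in P_i$ be points of the most common color and the second most common color. Construct a new partition $\mathcal{P}'$ from $\mathcal{P}$ by removing $u, v$ from $P_i$ and adding a new part $\{u, v\}$. By the choice of $u, v$, both $P_i \setminus \{u, v\}$ and $\{u, v\}$ still satisfy the fairness constraint. Furthermore, by the triangle inequality $d(u, v) \le d(u, c) + d(v, c)$ where $c$ is the center of $P_i$, and hence, $\mathrm{mcost}(\mathcal{P}') \le \mathrm{mcost}(\mathcal{P})$. Therefore, $\mathcal{P}'$ is also an optimal fairlet decomposition and has one more fairlet than $\mathcal{P}$, contradicting the choice of $\mathcal{P}$. ∎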

We now define a graph $H$ on the points in $V$ as follows: two vertices $u, v$ are connected by two edges in $H$ if they have distinct colors. The weight of both of these edges is $d(u, v)$. Now in order to find a fairlet decomposition, we find the minimum weight 2-factor in $H$. Recall that a 2-factor is a subgraph where each vertex has degree 2. This problem is polynomially solvable [44, Chapter 21]. Note that the optimal cost of the 2-factor with minimum weight can be bounded as follows.

Claim 4.5.

The graph $H$ has a 2-factor of total weight at most $2\,\mathrm{mcost}(\mathcal{P}^*)$, where $\mathcal{P}^*$ is the optimal fairlet decomposition.

Proof.

Using the previous claim, we know that there exists an optimal fairlet decomposition $\mathcal{P}^*$ with $|P_i| \le 3$ for every fairlet $P_i$. Each fairlet in $\mathcal{P}^*$ of size 2 can be turned into a 2-vertex 2-factor of weight equal to twice the cost of the fairlet. For each fairlet $P_i$ of size 3, we can add the edge between the two non-center vertices to obtain a 3-vertex 2-factor. By the triangle inequality, the weight of this 2-factor is at most twice the cost of $P_i$. Therefore, the overall cost of the 2-factor is at most $2\,\mathrm{mcost}(\mathcal{P}^*)$. ∎

It remains to show how to convert a 2-factor solution to a fairlet decomposition with small fairlets. Cycles of length 2 and 3 give fairlets of size 2 and 3, so we only need to argue how to convert cycles of length more than 3 to fairlets. If a cycle has even length, we just return one of the two sets of alternating edges, whichever has lower cost. For cycles of odd length, there must exist three consecutive vertices of pairwise distinct colors. In this case we return these three vertices as one fairlet, and the rest of the cycle can be partitioned into matching edges. The cost of this fairlet decomposition is at most the weight of the original 2-factor. So in this case, we get a 2-approximation for fairlet decomposition with fairlets of size at most 3.
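The cycle-splitting step can be sketched in a few lines. The function below takes one cycle of a 2-factor (as a list of vertices in cyclic order) and returns fairlets of size 2 or 3; the function name and the `dist` callback are assumptions, and computing the 2-factor itself is outside the scope of this sketch.

```python
def cycle_to_fairlets(cycle, color, dist):
    """Split one cycle of a 2-factor into fairlets of size at most 3.

    cycle lists the vertices in cyclic order; adjacent vertices are assumed to
    have distinct colors, as in the graph the 2-factor is computed on.
    """
    m = len(cycle)
    if m <= 3:
        return [list(cycle)]
    if m % 2 == 0:
        # The cycle splits into two alternating perfect matchings; keep the cheaper one.
        first = [[cycle[i], cycle[i + 1]] for i in range(0, m, 2)]
        second = [[cycle[i], cycle[(i + 1) % m]] for i in range(1, m, 2)]

        def total(pairs):
            return sum(dist(u, v) for u, v in pairs)

        return first if total(first) <= total(second) else second
    for i in range(m):
        triple = [cycle[(i + k) % m] for k in range(3)]
        if len({color[v] for v in triple}) == 3:
            # Remove the triple; the remaining path has even length and is paired up.
            rest = [cycle[(i + 3 + k) % m] for k in range(m - 3)]
            return [triple] + [rest[k:k + 2] for k in range(0, m - 3, 2)]
    raise ValueError("no three consecutive vertices with pairwise distinct colors")


color = {0: "r", 1: "b", 2: "r", 3: "b", 4: "g"}
print(cycle_to_fairlets([0, 1, 2, 3, 4], color, lambda u, v: 1))
```

Every pair or triple returned uses only edges of the original cycle, which is why the resulting decomposition costs no more than the 2-factor.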

4.2.2 $\alpha = 1/C$ with $C$ colors

Let us first understand the structure of an optimal solution $\mathcal{P}^*$. Since $\alpha = 1/C$, each part has an equal number of vertices of each color. So in order to find a fair partition, we consider an ordering $c_1, \ldots, c_C$ of the colors, and for each $i$ solve a minimum cost matching problem between points of color $c_i$ and $c_{i+1}$ in the graph $H$. The union of these matchings gives us a partition of $V$ into paths of $C$ vertices. Each such path defines a fairlet; let $\mathcal{P}$ denote this fairlet decomposition. We show the cost of $\mathcal{P}$ is at most twice the cost of the optimal fairlet decomposition.

Claim 4.6.

Let $\mathcal{P}^*$ be the optimal fairlet decomposition. Then, $\mathrm{mcost}(\mathcal{P}) \le 2\,\mathrm{mcost}(\mathcal{P}^*)$.

Proof.

Let us first bound the cost of matching edges between color $c_i$ and $c_{i+1}$ using the parts in $\mathcal{P}^*$. Since each part in $\mathcal{P}^*$ has an equal number of vertices of each color, we can pair up points of colors $c_i$ and $c_{i+1}$ within each part of $\mathcal{P}^*$. Using the triangle inequality, the cost of a matching edge between vertices $u, v$ in a part $P^*_j$ is at most $d(u, c) + d(v, c)$, where $c$ is the center of $P^*_j$. Summing up all the matching weights, we get that the total matching weight is at most $2\,\mathrm{mcost}(\mathcal{P}^*)$, since each edge of the form $(u, c)$ is charged twice, once for the matching between $c_{i-1}$ and $c_i$ and once for the matching between $c_i$ and $c_{i+1}$. ∎

So in this case, we get a 2-approximation with fairlets of size at most $C$. Note that in the motivating application of fair clustering, the color of each vertex corresponds to one possible value of a sensitive feature like race or gender, and therefore the value $C$ tends to be small in such applications.
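The repeated matching idea can be illustrated on a tiny instance. The sketch below chains one matching per consecutive pair of colors into paths, one path per fairlet. To stay dependency-free it uses a brute-force matching over all permutations, which is only sensible for very small inputs; a real implementation would substitute a polynomial-time assignment solver. All names are assumptions for the example.

```python
from itertools import permutations


def min_cost_matching(left, right, dist):
    """Brute-force minimum-cost perfect matching between two equal-sized lists."""
    best = min(
        permutations(right),
        key=lambda perm: sum(dist(u, v) for u, v in zip(left, perm)),
    )
    return dict(zip(left, best))


def repeated_matching_fairlets(points_by_color, dist):
    """points_by_color: equal-sized vertex lists, one per color, in a fixed color order."""
    paths = [[v] for v in points_by_color[0]]
    for next_color_points in points_by_color[1:]:
        tails = [path[-1] for path in paths]
        match = min_cost_matching(tails, next_color_points, dist)
        for path in paths:
            path.append(match[path[-1]])  # extend each path by its matched vertex
    return paths


# Vertices are points on a line; three colors with two points each.
points_by_color = [[0, 10], [11, 1], [2, 12]]
print(repeated_matching_fairlets(points_by_color, lambda u, v: abs(u - v)))
```

Each resulting path contains exactly one vertex of every color, so it satisfies the equal-representation constraint by construction.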

4.2.3 Bicriteria approximation for $\alpha = 1/t$

In this section, we focus on the case where $\alpha = 1/t$, and since we have already solved the case of $t = 2$ in Section 4.2.1, we can assume $t \ge 3$. To solve this case, we start from a solution produced by the bicriteria approximation algorithm of Bera et al. [9] and make the necessary changes to ensure that the maximum fairlet size is $2t - 1$.

Theorem 4.7 (Theorem 1 of [9]).

Given a -approximation algorithm for the k-median problem, we can return a -approximate solution where each part in violates the fairness constraint by an additive factor of at most .

Since median problem is a relaxation of -median problem, using the state-of-the-art algorithm for -median problem[12], we can get a solution to the fairlet decomposition problem with cost at most times the optimal solution where each fairlet ‘almost’ satisfies the fairness constraint. Since we need bounded fairlet sizes, we process the solution and transform it to a solution where fairlet sizes are bounded using the following claim. In the following claim, we show that large fairlets can be broken into smaller fairlets. In addition, it is easy to see that if a large fairlet violates the fairness constraint by an additive factor of at most , then the resulting small fairlets violate the fairness condition by additive factors of at most .

Claim 4.8.

For any set of size that satisfies fairness constraint with , there exists a partition of into sets where each satisfies the fairness constraint and .

Proof.

Let with , then the fairness constraints ensures that there are at most elements of each color. Consider partitioning obtained through the following process: consider an ordering of elements where points of the same color are in consecutive places, assign points to sets in a round robin fashion. So each set gets at least elements and at most elements assigned to it. Since there are at most elements of each color, each set gets at most one point of any color and hence all sets satisfy the fairness constraint as . ∎
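The round-robin argument in this proof can be checked on a small example. The sketch assumes the constraint caps every color at a $1/t$ fraction of the set and splits the set into $\lfloor |S| / t \rfloor$ parts; both the parameter name `t` and the number of parts are assumptions reconstructed from the argument rather than taken verbatim from the claim.

```python
def round_robin_split(big_fairlet, color, t):
    """Split a set that has at most len/t vertices per color into smaller sets.

    Vertices are ordered so that equal colors are consecutive and then dealt out
    in round-robin fashion, so each part receives at most one vertex per color.
    """
    k = len(big_fairlet) // t  # number of parts; requires len(big_fairlet) >= t
    ordered = sorted(big_fairlet, key=lambda v: color[v])
    parts = [[] for _ in range(k)]
    for index, v in enumerate(ordered):
        parts[index % k].append(v)
    return parts


color = {0: "a", 1: "a", 2: "b", 3: "b", 4: "c", 5: "c", 6: "d"}
print(round_robin_split(list(range(7)), color, 3))
```

With 7 vertices and t = 3, each color appears at most twice, the set is dealt into two parts of sizes 4 and 3, and no part contains a repeated color.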

Applying Claim 4.8 to the solution of the algorithm of [9], one can obtain a fairlet decomposition where fairlets have size at most . Using the triangle inequality, we can show that updating the center of each fairlet to the vertex in that fairlet closest to the assigned center at most doubles the cost of the solution. Since the size of each fairlet is at least , the fairness constraint is satisfied with a multiplicative factor of . So the above arguments show that we can find a fairlet decomposition with cost within times the optimal cost where each fairlet is of size at least and at most and the fairness constraint is violated by at most a factor of .

4.3 Solving the reduced instance

Note that the reduced correlation clustering instance is a weighted correlation clustering instance. Even though in general, the best known approximation ratio for weighted correlation clustering is [21], for the fairlet decomposition algorithms presented in this section, we have the following property that allows us to get a better approximation factor: the ratio of the size of the largest fairlet to the smallest one is at most a constant. For fairlet decompositions in Section 4.2.1, 4.2.2, and 4.2.3, this constant is , , and , respectively. Now, notice that the weight of the edge between and in is at least and at most . These weights are all within a constant factor of each other (with the constants being , , and ). Therefore, by removing the weights from the instance and solving the resulting unweighted instance (using the algorithm of [17], for example), we only lose an additional factor equal to this constant.

5 Experiments

Dataset               Unfair Alg. Error    Unfair Alg. Imbalance    Fair Alg. Error
                      Local    Pivot       Local    Pivot           Match + Local    Single    Rand
amazon                0.01     0.039       0.40     0.39            0.064            0.786     0.215
reuters               0.095    0.156       0.65     0.62            0.229            0.754     0.255
reuters               0.180    0.230       0.50     0.43            0.321            0.504     0.502
reuters               0.188    0.286       0.15     0.20            0.199            0.252     0.746
victorian             0.109    0.170       0.55     0.45            0.209            0.753     0.251
victorian             0.182    0.254       0.31     0.25            0.324            0.502     0.499
victorian             0.203    0.248       0.12     0.13            0.237            0.251     0.747
mean over datasets    0.138    0.198       0.38     0.35            0.226            0.459     0.543
Table 1: Error and Imbalance in the two-color case for various datasets and different thresholds for the quantile used for positive edges (repeated dataset rows correspond to different thresholds). Notice how our algorithm Match + Local has cost comparable to Pivot and not much higher than Local while reducing the imbalance from up to 0.65 for the unfair algorithms to 0.

In this section, we present our experiments demonstrating that our algorithm solves the correlation clustering problem with fairness constraints with only a limited loss in the cost when compared to the vanilla (unfair) solution. We describe the datasets used, the algorithms evaluated, the quality measures, and our results.

Datasets. We use publicly-available datasets from the UCI Repository (http://archive.ics.uci.edu/ml) and from the SNAP Datasets (http://snap.stanford.edu/data/).

The datasets represent complete signed graphs from different domains, including both co-purchasing relationships among products, and semantic similarities among texts learned with embedding methods. The graphs are represented by complete signed matrices up to 1600x1600 in size, up to million positive edges, and up to colors. Here we only briefly describe the datasets; more details are in the appendix of the supplemental material.

amazon: Nodes represent products on the Amazon website [38], the color of a node is the item category, and two co-reviewed items are joined by a $+1$ edge (all non-co-reviewed items are joined by $-1$ edges). We use 1000 nodes equally distributed in two popular categories.

reuters and victorian: These datasets are extracted from text data used in previous fair clustering work [1]. The datasets include between and English language texts from each of up to 16 authors. (The datasets are available at archive.ics.uci.edu/ml/datasets/Reuter_50_50 and archive.ics.uci.edu/ml/datasets/Victorian+Era+Authorship+Attribution.) Each node represents a text, and the color represents the author. For each text we obtain a semantic embedding vector with standard methods. We use a threshold on the dot product of the embedding vectors to obtain the edges. Through this operation, we set the top quantile of edges by dot product (as given by the threshold) to $+1$, and the remaining edges are assigned $-1$.

Algorithms. We evaluate Algorithm 1 in two fairness scenarios: with an upper bound of $\alpha = 1/2$ on the fraction of nodes of each color, and with $\alpha = 1/C$ for equal color representation (for the two-color case, the two are equivalent). For the $\alpha = 1/2$ case, in our experiments we simplify the algorithm of Section 4.2.1 to compute a minimum-cost perfect matching (1-factor) instead of a 2-factor decomposition. This can be formally shown to be sufficient for the two-color case, or when all optimal clusters are even sized, and we observe it works well in practice in our experiments. For $\alpha = 1/C$, we implement the repeated matching algorithm in Section 4.2.2 to obtain the fairlets in a similar fashion. After finding the fairlets, we use an in-house correlation clustering solver based on local search (Local). We refer to our algorithms as Match + Local for the $\alpha = 1/2$ case and as Rep. Match + Local for the $\alpha = 1/C$ case.

We also consider the following (unfair) baseline algorithms: the standard Pivot algorithm of Ailon et al. [4] for correlation clustering (for Pivot, we repeat the randomized algorithm 10 times and use the best result), and the (unfair) local search heuristic Local used as part of our algorithm. In addition, as no prior work has addressed the fair correlation clustering problem, we compare our algorithm with two simple fair baselines: the whole graph as one cluster (Single), and a random fairlet decomposition (Rand).

Algorithm             Error    Imbalance for 1/2    Imbalance for equality
Local                 0.249    0.016                0.222
Pivot                 0.320    0.009                0.188
Match + Local         0.252    0                    0.188
Rep. Match + Local    0.306    0                    0
Single                0.5      0                    0
Rand                  0.5      0                    0
Table 2: Experimental results for the victorian dataset, using colors.

Quality measures. For each algorithm, we report the following measures. The (Error) of the correlation cost of the clustering obtained by the algorithm, presented as the ratio of the edges of the graph that are in disagreement with the clustering (i.e., inter-cluster positive edges and intra-cluster negative edges) over the number of edges (i.e., 0 corresponds to a perfect solution, whereas 1 corresponds to a completely incorrect solution). For fairness, we report the Imbalance as the total fraction of nodes that violate the color representation constraint, i.e., nodes for each cluster and color that are above the fraction for the size of the cluster [1]. More precisely, let be a cluster in the solution of an -color constraint instance. The maximum allowed number of points of a certain color in the cluster is . Let be the nodes of color on the graph and be the nodes violating the constraint in . We report as the Imbalance for (i.e., 0 corresponds to no imbalance, and 1 corresponds to complete imbalance). We repeat all algorithms 10 times and report the mean results for all measures.
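The two measures can be sketched as follows. The Error follows the description directly. For the Imbalance, this sketch assumes the allowed count per color in a cluster is the floor of $\alpha$ times the cluster size and that the excess is normalized by the total number of nodes; these details are assumptions chosen so that the value ranges from 0 to 1, and the exact formula used in the experiments may differ.

```python
import math
from collections import Counter
from fractions import Fraction


def error_rate(sign, clusters):
    """Fraction of edges in disagreement with the clustering."""
    cluster_of = {v: i for i, cluster in enumerate(clusters) for v in cluster}
    bad = sum(
        1
        for (u, v), s in sign.items()
        if (cluster_of[u] == cluster_of[v]) != (s > 0)
    )
    return bad / len(sign)


def imbalance(clusters, color, alpha):
    """Fraction of nodes exceeding the per-color cap (assumed floor(alpha * size))."""
    n = sum(len(cluster) for cluster in clusters)
    excess = 0
    for cluster in clusters:
        allowed = math.floor(alpha * len(cluster))
        for count in Counter(color[v] for v in cluster).values():
            excess += max(0, count - allowed)
    return excess / n


color = {0: "r", 1: "r", 2: "b", 3: "b"}
print(imbalance([[0, 1], [2, 3]], color, Fraction(1, 2)))
print(imbalance([[0, 2], [1, 3]], color, Fraction(1, 2)))
```

Under these assumptions, a clustering made of singletons has Imbalance 1, while a clustering in which every cluster respects the cap has Imbalance 0.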

5.1 The two-color case

Table 1 shows the results for the two-color case with $\alpha = 1/2$, i.e., equal representation, over the various datasets. We use Match + Local as the fair algorithm, which has an Imbalance of 0 by construction.

The table shows clearly that our algorithm Match + Local obtains clusters that are fair and has costs comparable to the unfair Pivot baseline (and sometimes better) and slightly worse than the Local baseline. Averaged over all datasets, our algorithm has a cost of 0.226 vs 0.198 for Pivot and 0.138 for Local. On the other hand, our algorithm is significantly better than both the naive one-cluster and random fair clustering algorithms, which, while satisfying fairness, have far higher costs (0.459 and 0.543 on average).

Notice how the Local and the Pivot baseline have very high Imbalance values of up to of the nodes (as they are oblivious to colors), showing the importance of developing novel algorithms for the problem. This result is not surprising. If pairs of nodes of the same color are more likely to be similar, it is expected that many clusters will contain vast majorities of points with a single color.

5.2 The multi-color case

Here, we study the behavior of our algorithms when more than two colors are present in the dataset. We use Match + Local to obtain an $\alpha = 1/2$ fair solution and Rep. Match + Local to obtain an equal representation solution.

We report an overview of our experimental results in Table 2 for the dataset victorian with threshold and colors (note that this is a different graph than that produced with the previous dataset). More experimental results are available in the extended material.

All trends from the previous section are confirmed. The algorithm for the $\alpha = 1/2$ case, Match + Local, is only marginally worse than the best unfair solution, Local, and much better than all other baselines. Our algorithm for the more difficult equal representation case is again better than the Pivot baseline. All unfair algorithms are significantly far from being equally balanced. However, the presence of many colors makes it easier to get closer to the $\alpha = 1/2$ threshold.
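To make the comparison concrete, the short sketch below computes the relative increase in Error over the Local baseline from the values in Table 2. It assumes that the first numeric column of the table is the Error; the helper name `relative_increase` is hypothetical.

```python
# Error values copied from Table 2 (ASSUMPTION: first numeric column is Error).
errors = {
    "Local": 0.249,
    "Pivot": 0.320,
    "Match + Local": 0.252,
    "Rep. Match + Local": 0.306,
}


def relative_increase(algorithm, baseline="Local"):
    """Relative increase in Error of `algorithm` over `baseline`."""
    return errors[algorithm] / errors[baseline] - 1.0


for name in ("Match + Local", "Rep. Match + Local", "Pivot"):
    print(f"{name}: {relative_increase(name):+.1%} vs Local")
```

Under this reading of the table, Match + Local is roughly 1% above Local, while Rep. Match + Local is roughly 23% above Local and still below Pivot.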

Figure 1: Error of our algorithms over that of the unfair Local algorithms for and , on a series of graphs from the victorian dataset, , and using to colors.


We confirm this observation in Figure 1. The figure compares the Error of our algorithms for the $\alpha = 1/2$ and equal representation cases with the Error of the vanilla Local algorithm as the number of colors goes from to . We report the ratio of Match + Local over Local as a solid line, and the ratio for Rep. Match + Local as a dashed line. The error of our algorithm for the $\alpha = 1/2$ case gets closer and closer to the error of the Local output as the number of colors grows. Again, this is because the presence of many colors makes the problem easier for the $\alpha = 1/2$ case. The performance of the algorithm for the equal representation case has a less stable pattern, but it confirms that the algorithm is quite competitive with the unfair solution (between higher error) even for quite a few colors. These results are significantly better than what is expected from a worst-case analysis.

Finally, we observed that using just the Match or Rep. Match fairlets, without the re-clustering part of the algorithm, is not sufficient for obtaining good results: the re-clustering step is needed to obtain clusters that do not have large errors.

6 Conclusions

In this paper, we initiated the study of correlation clustering with fairness constraints. We showed a reduction to the fairlet decomposition problem with a standard $k$-median cost function, for a carefully chosen distance function. Using this reduction, together with existing and new results on the fairlet decomposition problem with a median cost function, we obtained provable constant-factor approximation algorithms for fair correlation clustering under various notions of fairness. Our experimental evaluation shows that these algorithms perform well not only in theory but also in practice.

References

  • [1] S. Ahmadian, A. Epasto, R. Kumar, and M. Mahdian (2019) Clustering without over-representation. In KDD, pp. 267–275. Cited by: §1, §1, §1, §2, §5, §5.
  • [2] S. Ahmadian, A. Norouzi-Fard, O. Svensson, and J. Ward (2017) Better guarantees for $k$-means and Euclidean $k$-median by primal-dual algorithms. In FOCS, pp. 61–72. Cited by: §1.
  • [3] K. Ahn, G. Cormode, S. Guha, A. McGregor, and A. Wirth (2015) Correlation clustering in data streams. In ICML, pp. 2237–2246. Cited by: §1.
  • [4] N. Ailon, M. Charikar, and A. Newman (2008) Aggregating inconsistent information: ranking and clustering. J. ACM 55 (5), pp. 23:1–23:27. Cited by: §1, §1, §2, §3, §5.
  • [5] A. Backurs, P. Indyk, K. Onak, B. Schieber, A. Vakilian, and T. Wagner (2019) Scalable fair clustering. ICML. Cited by: §1.
  • [6] A. Backurs, P. Indyk, K. Onak, B. Schieber, A. Vakilian, and T. Wagner (2019) Scalable fair clustering. In ICML, Cited by: §1.
  • [7] A. Backurs, P. Indyk, K. Onak, B. Schieber, A. Vakilian, and T. Wagner (2019) Scalable fair clustering. In ICML, pp. 405–413. Cited by: §1.
  • [8] N. Bansal, A. Blum, and S. Chawla (2004) Correlation clustering. Mach. Learn. 56 (1-3), pp. 89–113. Cited by: §1, §1, §2, §2, §3.
  • [9] S. K. Bera, D. Chakrabarty, N. Flores, and M. Negahbani (2019) Fair algorithms for clustering. In NeurIPS, Cited by: §1, §1, §2, §4.1, §4.2.3, §4.2.3, Theorem 4.7.
  • [10] I. O. Bercea, M. Gross, S. Khuller, A. Kumar, C. Rösner, D. R. Schmidt, and M. Schmidt (2019) On the cost of essentially fair clusterings. In APPROX-RANDOM, pp. 18:1–18:22. Cited by: §1, §1.
  • [11] M. Bressan, N. Cesa-Bianchi, A. Paudice, and F. Vitale (2019) Correlation clustering with adaptive similarity queries. In NeurIPS, Cited by: §1.
  • [12] J. Byrka, T. Pensyl, B. Rybicki, A. Srinivasan, and K. Trinh (2014) An improved approximation for $k$-median, and positive correlation in budgeted optimization. In SODA, pp. 737–756. Cited by: §4.2.3.
  • [13] T. Calders and S. Verwer (2010) Three naive Bayes approaches for discrimination-free classification. DMKD 21 (2), pp. 277–292. Cited by: §1.
  • [14] L. E. Celis, L. Huang, and N. K. Vishnoi (2018) Multiwinner voting with fairness constraints. In IJCAI, pp. 144–151. Cited by: §1, §1.
  • [15] L. E. Celis, D. Straszak, and N. K. Vishnoi (2018) Ranking with fairness constraints. In ICALP, pp. 28:1–28:15. Cited by: §1, §1.
  • [16] M. Charikar, V. Guruswami, and A. Wirth (2005) Clustering with qualitative information. JCSS 71 (3), pp. 360–383. Cited by: §1.
  • [17] S. Chawla, K. Makarychev, T. Schramm, and G. Yaroslavtsev (2015) Near optimal LP rounding algorithm for correlation clustering on complete and complete $k$-partite graphs. In STOC, pp. 219–228. Cited by: §1, §2, §3, §4.3.
  • [18] X. Chen, B. Fain, C. Lyu, and K. Munagala (2019) Proportionally fair clustering. In ICML, pp. 1032–1041. Cited by: §1.
  • [19] F. Chierichetti, R. Kumar, S. Lattanzi, and S. Vassilvitskii (2017) Fair clustering through fairlets. In NIPS, pp. 5029–5037. Cited by: §1, §1, §1, §1, §2, §4.1, §4.2.
  • [20] F. Chierichetti, R. Kumar, S. Lattanzi, and S. Vassilvitskii (2019) Matroids, matchings, and fairness. In AISTATS, pp. 2212–2220. Cited by: §1, §1.
  • [21] E. D. Demaine, D. Emanuel, A. Fiat, and N. Immorlica (2006) Correlation clustering in general weighted graphs. TCS 361 (2-3), pp. 172–187. Cited by: §1, §2, §4.3.
  • [22] C. Dwork, M. Hardt, T. Pitassi, O. Reingold, and R. Zemel (2012) Fairness through awareness. In ITCS, pp. 214–226. Cited by: §1.
  • [23] M. Feldman, S. A. Friedler, J. Moeller, C. Scheidegger, and S. Venkatasubramanian (2015) Certifying and removing disparate impact. In KDD, pp. 259–268. Cited by: §1.
  • [24] B. Fish, J. Kun, and Á. D. Lelkes (2016) A confidence-based approach for balancing fairness and accuracy. In SDM, pp. 144–152. Cited by: §1.
  • [25] T. F. Gonzalez (1985) Clustering to minimize the maximum intercluster distance. TCS 38, pp. 293–306. Cited by: §1.
  • [26] S. Har-Peled and S. Mahabadi (2019) Near neighbor: who is the fairest of them all?. arXiv:1906.02640. Cited by: §1.
  • [27] D. S. Hochbaum and D. B. Shmoys (1985) A best possible heuristic for the $k$-center problem. MOR 10 (2), pp. 180–184. Cited by: §1.
  • [28] L. Huang, S. H. -C. Jiang, and N. K. Vishnoi (2019) Coresets for clustering with fairness constraints. In NeurIPS, Cited by: §1.
  • [29] A. K. Jain (2010) Data clustering: 50 years beyond $K$-means. Pattern Recognition Letters 31 (8), pp. 651–666. Cited by: §1.
  • [30] M. Joseph, M. Kearns, J. H. Morgenstern, and A. Roth (2016) Fairness in learning: classic and contextual bandits. In NIPS, pp. 325–333. Cited by: §1, §1.
  • [31] S. Kalhan, K. Makarychev, and T. Zhou (2019) Improved algorithms for correlation clustering with local objectives. In NeurIPS, Cited by: §1.
  • [32] T. Kamishima, S. Akaho, H. Asoh, and J. Sakuma (2012) Fairness-aware classifier with prejudice remover regularizer. In PKDD, pp. 35–50. Cited by: §1, §1.
  • [33] T. Kamishima, S. Akaho, and J. Sakuma (2011) Fairness-aware learning through regularization approach. In ICDMW, pp. 643–650. Cited by: §1, §1.
  • [34] T. Kanungo, D. M. Mount, N. S. Netanyahu, C. D. Piatko, R. Silverman, and A. Y. Wu (2004) A local search approximation algorithm for $k$-means clustering. Computational Geometry 28 (2-3), pp. 89–112. Cited by: §1.
  • [35] M. Kleindessner, P. Awasthi, and J. Morgenstern (2019) Fair $k$-center clustering for data summarization. In ICML, pp. 3458–3467. Cited by: §1, §1.
  • [36] M. Kleindessner, S. Samadi, P. Awasthi, and J. Morgenstern (2019) Guarantees for spectral clustering with fairness constraints. In ICML, pp. 3448–3457. Cited by: §1.
  • [37] S. Kushagra, S. Ben-David, and I. Ilyas (2018) Semi-supervised clustering for de-duplication. arXiv:1810.04361. Cited by: §1, §1.
  • [38] J. Leskovec, L. A. Adamic, and B. A. Huberman (2007) The dynamics of viral marketing. TWEB 1 (1), pp. 5. Cited by: §5.
  • [39] J. Li, K. Yi, and Q. Zhang (2010) Clustering with diversity. In ICALP, pp. 188–200. Cited by: §1.
  • [40] S. Li and O. Svensson (2016) Approximating $k$-median via pseudo-approximation. SICOMP 45 (2), pp. 530–547. Cited by: §1.
  • [41] X. Pan, D. Papailiopoulos, S. Oymak, B. Recht, K. Ramachandran, and M. I. Jordan (2015) Parallel correlation clustering on big graphs. In NIPS, pp. 82–90. Cited by: §1.
  • [42] J. Pouget-Abadie, K. Aydin, W. Schudy, K. Brodersen, and V. Mirrokni (2019) Variance reduction in bipartite experiments through correlation clustering. In NeurIPS, Cited by: §1.
  • [43] C. Rösner and M. Schmidt (2018) Privacy preserving clustering with constraints. In ICALP, pp. 96:1–96:14. Cited by: §1.
  • [44] A. Schrijver (2003) Combinatorial optimization: polyhedra and efficiency. Vol. 24, Springer Science & Business Media. Cited by: §4.2.1.
  • [45] K. Yang and J. Stoyanovich (2017) Measuring fairness in ranked outputs. In SSDBM, pp. 22:1–22:6. Cited by: §1, §1.
  • [46] I. M. Ziko, E. Granger, J. Yuan, and I. B. Ayed (2019) Clustering with fairness constraints: a flexible and scalable approach. arXiv:1906.08207. Cited by: §1.