1 Introduction
Algorithms and machine-learned models are increasingly used to assist in decision making on a wide range of issues, from mortgage approval to court sentencing recommendations (Kleinberg et al., 2017). It is clearly undesirable, and in many cases illegal, for models to be biased against particular groups, for instance to discriminate on the basis of race or religion. Ensuring that there is no bias is not as easy as removing the protected categories from the data: even without them being explicitly listed, the correlation between sensitive features and the rest of the training data may still cause the algorithm to be biased. This has led to an emergent literature on computing provably fair outcomes (see the book by Barocas et al. (2019)).
The prominence of clustering in data analysis, combined with its use for data segmentation, feature engineering, and visualization, makes it critical that efficient fair clustering methods be developed. There has been a flurry of recent results in the ML research community proposing algorithms for fair flat clustering, i.e., partitioning a dataset into a set of disjoint clusters, as captured by the k-center, k-median, k-means, and correlation clustering objectives (Ahmadian et al., 2019, 2020b; Backurs et al., 2019; Bera et al., 2019; Bercea et al., 2019; Chen et al., 2019; Chiplunkar et al., 2020; Huang et al., 2019; Jones et al., 2020; Kleindessner et al., 2019a,b). However, the same issues affect hierarchical clustering, which is the problem we study.
The input to the hierarchical clustering problem is a set of data points with pairwise similarity or dissimilarity scores. A hierarchical clustering is a tree whose leaves correspond to the individual data points. Each internal node represents a cluster containing all the points in the leaves of its subtree. Naturally, the cluster at an internal node is the union of the clusters given by its children. Hierarchical clustering is widely used in data analysis (Dubes and Jain, 1980), social networks (Mann et al., 2008; Rajaraman and Ullman, 2011), and image/text organization (Karypis et al., 2000).
Hierarchical clustering is frequently used for flat clustering when the number of clusters is unknown ahead of time: a hierarchical clustering yields a set of clusterings at different granularities that are consistent with one another. Therefore, in any clustering problem where fairness is desired but the number of clusters is unknown, fair hierarchical clustering is useful. As concrete examples, consider a collection of news articles organized by a topic hierarchy, where we wish to ensure that no single source or viewpoint is overrepresented in a cluster; or a hierarchical division of a geographic area, where the sensitive attribute is gender or race and we wish to ensure balance at every level of the hierarchy. Many such problems benefit from fair hierarchical clustering, motivating the study of this problem area.
Our contributions. We initiate an algorithmic study of fair hierarchical clustering. We build on Dasgupta’s seminal formal treatment of hierarchical clustering Dasgupta (2016) and prove our results for the revenue Moseley and Wang (2017), value CohenAddad et al. (2018), and cost Dasgupta (2016) objectives in his framework.
To achieve fairness, we show how to extend the fairlets machinery, introduced by Chierichetti et al. (2017) and extended by Ahmadian et al. (2019), to this problem. We then investigate the complexity of finding a good fairlet decomposition, giving both strong computational lower bounds and polynomial time approximation algorithms.
Finally, we conclude with an empirical evaluation of our approach. We show that ignoring protected attributes when performing hierarchical clustering can lead to unfair clusters. On the other hand, adopting the fairlet framework in conjunction with the approximation algorithms we propose yields fair clusters with a negligible objective degradation.
Related work. Hierarchical clustering has received increased attention over the past few years. Dasgupta (2016) developed a cost function objective for datasets with similarity scores, in which similar points are encouraged to be clustered together lower in the tree. Cohen-Addad et al. (2018) generalized these results into a class of optimization functions that possess other desirable properties, and introduced their own value objective in the dissimilarity score context. In addition to validating their objective on inputs with known ground truth, they gave a theoretical justification for average-linkage, one of the most popular algorithms used in practice, as a constant-factor approximation for value. Contemporaneously, Moseley and Wang (2017) designed a revenue objective function based on the work of Dasgupta for point sets with similarity scores, and showed that average-linkage is a constant-factor approximation for this objective as well. This work was further improved by Charikar et al. (2019), who gave a tighter analysis of average-linkage for Euclidean data for this objective, and by Ahmadian et al. (2020a) and Alon et al. (2020), who improved the approximation ratio in the general case.
In parallel to the new developments in algorithms for hierarchical clustering, there has been tremendous development in the area of fair machine learning. We refer the reader to a recent textbook (Barocas et al., 2019) for a rich overview, and focus here on progress in fair clustering. Chierichetti et al. (2017) first defined fairness for k-median and k-center clustering, and introduced the notion of fairlets to design efficient algorithms. Extensive research has since focused on two topics: adapting the definition of fairness to broader contexts, and designing efficient algorithms for finding good fairlet decompositions. On the first topic, the fairness definition was extended to multiple values of the protected feature (Ahmadian et al., 2019; Bercea et al., 2019; Rösner and Schmidt, 2018). On the second topic, Backurs et al. (2019) proposed a near-linear constant approximation algorithm for finding fairlets for k-median, Kleindessner et al. (2019a) designed a linear time constant approximation algorithm for k-center, Bercea et al. (2019) developed methods for fair k-means, and Ahmadian et al. (2020b) defined approximation algorithms for fair correlation clustering. Concurrently with our work, Chhabra and Mohapatra (2020) introduced a possible approach to ensuring fairness in hierarchical clustering. However, their fairness definition differs from ours (in particular, they do not ensure that all levels of the tree are fair), and the methods they introduce are heuristic, without formal fairness or quality guarantees.
2 Formulation
2.1 Objectives for hierarchical clustering
Let $G = (V, s)$ be an input instance, where $V$ is a set of $n$ data points and $s : V^2 \to \mathbb{R}_{\ge 0}$ is a similarity function over vertex pairs. For two sets $A, B \subseteq V$, we let $s(A, B) = \sum_{a \in A, b \in B} s(a, b)$ and $s(A) = s(A, A)$. For problems where the input is $(V, d)$, with $d$ a distance function, we define $d(A, B)$ and $d(A)$ similarly. We also consider the vertex-weighted versions of the problem, i.e., $(V, s, m)$ (or $(V, d, m)$), where $m : V \to \mathbb{Z}^{+}$ is a weight function on the vertices. The vertex-unweighted version can be interpreted as setting $m(v) = 1$ for all $v \in V$. For $U \subseteq V$, we use the notation $m(U) = \sum_{u \in U} m(u)$.
A hierarchical clustering of is a tree whose leaves correspond to and whose internal vertices represent the merging of vertices (or clusters) into larger clusters until all data merges at the root. The goal of hierarchical clustering is to build a tree to optimize some objective.
To define these objectives formally, we need some notation. Let $T$ be a hierarchical clustering tree of $G$. For two leaves $i$ and $j$, we write $i \vee j$ for their least common ancestor. For an internal vertex $v$ of $T$, let $T[v]$ be the subtree of $T$ rooted at $v$, and let $\mathrm{leaves}(T[v])$ be the leaves of $T[v]$.
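To make the notation concrete, the following sketch encodes a hierarchical clustering as nested 2-tuples whose leaves are point indices; this encoding and the helper names are illustrative assumptions, not prescribed by the paper:

```python
def leaves(tree):
    """Set of leaf indices in the subtree `tree` (an int leaf or a 2-tuple)."""
    if isinstance(tree, int):
        return {tree}
    left, right = tree
    return leaves(left) | leaves(right)

def lca_subtree(tree, i, j):
    """Subtree rooted at the least common ancestor of leaves i and j."""
    while not isinstance(tree, int):
        left, right = tree
        if i in leaves(left) and j in leaves(left):
            tree = left
        elif i in leaves(right) and j in leaves(right):
            tree = right
        else:
            break  # i and j are split at this node, so it is their LCA
    return tree
```

For example, with $T = ((0, 1), (2, 3))$ the LCA of leaves 0 and 1 is the left child, while the LCA of 0 and 2 is the root.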
We consider three different objectives—revenue, value, and cost—based on the seminal framework of Dasgupta (2016), and generalize them to the vertexweighted case.
Revenue. Moseley and Wang (2017) introduced the revenue objective for hierarchical clustering. Here the input instance is of the form $(V, s)$, where $s : V^2 \to \mathbb{R}_{\ge 0}$ is a similarity function.
Definition 1 (Revenue).
The revenue $\mathrm{rev}_G(T)$ of a tree $T$ for an instance $G = (V, s)$, where $s(i, j)$ denotes the similarity between data points $i$ and $j$, is:

$$\mathrm{rev}_G(T) = \sum_{\{i, j\} \subseteq V} s(i, j) \cdot m\big(V \setminus \mathrm{leaves}(T[i \vee j])\big).$$
Note that in this definition, each similarity $s(i, j)$ is scaled by the vertex-weight of the non-leaves, i.e., of the points outside the subtree rooted at $i \vee j$. The goal is to find a tree of maximum revenue. It is known that average-linkage is a $\tfrac{1}{3}$-approximation for vertex-unweighted revenue (Moseley and Wang, 2017); the state-of-the-art is a $0.585$-approximation (Alon et al., 2020).
As part of the analysis, there is an upper bound for the revenue objective CohenAddad et al. (2018); Moseley and Wang (2017), which is easily extended to the vertexweighted setting:
$$\mathrm{rev}_G(T) \le \sum_{\{i, j\} \subseteq V} \big(m(V) - m(i) - m(j)\big) \cdot s(i, j). \qquad (1)$$
Note that in the vertex-unweighted case, the upper bound is just $(n - 2) \sum_{\{i, j\} \subseteq V} s(i, j)$.
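As a sanity check on Definition 1 and bound (1), the following sketch computes the weighted revenue of a tree encoded as nested 2-tuples of point indices, together with the right-hand side of (1); the encoding and function names are illustrative assumptions:

```python
from itertools import combinations

def leaves(tree):
    if isinstance(tree, int):
        return {tree}
    return leaves(tree[0]) | leaves(tree[1])

def lca_leaves(tree, i, j):
    """Leaves under the least common ancestor of i and j."""
    while not isinstance(tree, int):
        left, right = tree
        if {i, j} <= leaves(left):
            tree = left
        elif {i, j} <= leaves(right):
            tree = right
        else:
            break
    return leaves(tree)

def revenue(tree, s, m):
    """rev(T): each s(i,j) scaled by the weight of points OUTSIDE T[i v j]."""
    V = leaves(tree)
    return sum(s[i][j] * sum(m[v] for v in V - lca_leaves(tree, i, j))
               for i, j in combinations(sorted(V), 2))

def revenue_upper_bound(s, m):
    """Right-hand side of (1): sum of (m(V) - m(i) - m(j)) * s(i,j)."""
    n, mV = len(m), sum(m)
    return sum((mV - m[i] - m[j]) * s[i][j] for i, j in combinations(range(n), 2))
```

In the unweighted case (`m = [1] * n`) the bound reduces to $(n-2)$ times the total similarity.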
Value. A different objective was proposed by Cohen-Addad et al. (2018), using distances instead of similarities. Let $G = (V, d)$, where $d$ is a distance (or dissimilarity) function.
Definition 2 (Value).
The value $\mathrm{val}_G(T)$ of a tree $T$ for an instance $G = (V, d)$, where $d(i, j)$ denotes the distance between data points $i$ and $j$, is:

$$\mathrm{val}_G(T) = \sum_{\{i, j\} \subseteq V} d(i, j) \cdot m\big(\mathrm{leaves}(T[i \vee j])\big).$$
As with revenue, we aim to find a hierarchical clustering that maximizes value. Cohen-Addad et al. (2018) showed that both average-linkage and a locally $\epsilon$-densest cut algorithm achieve a $\tfrac{2}{3}$-approximation (up to $\epsilon$ terms) for vertex-unweighted value. They also provided an upper bound for value, much like that in (1), which in the vertex-weighted context is:
$$\mathrm{val}_G(T) \le m(V) \sum_{\{i, j\} \subseteq V} d(i, j). \qquad (2)$$
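Analogously to the revenue case, a minimal sketch of Definition 2 and bound (2), using the same illustrative nested-tuple tree encoding (an assumption for exposition):

```python
from itertools import combinations

def leaves(tree):
    if isinstance(tree, int):
        return {tree}
    return leaves(tree[0]) | leaves(tree[1])

def lca_leaves(tree, i, j):
    while not isinstance(tree, int):
        left, right = tree
        if {i, j} <= leaves(left):
            tree = left
        elif {i, j} <= leaves(right):
            tree = right
        else:
            break
    return leaves(tree)

def value(tree, d, m):
    """val(T): each d(i,j) scaled by the weight of leaves INSIDE T[i v j]."""
    V = leaves(tree)
    return sum(d[i][j] * sum(m[v] for v in lca_leaves(tree, i, j))
               for i, j in combinations(sorted(V), 2))

def value_upper_bound(d, m):
    """Right-hand side of (2): m(V) times the total pairwise distance."""
    n = len(m)
    return sum(m) * sum(d[i][j] for i, j in combinations(range(n), 2))
```

Merging nearby points low in the tree keeps $m(\mathrm{leaves}(T[i \vee j]))$ small exactly where $d(i,j)$ is small, which is the intuition for why average-linkage does well on this objective.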
Cost. The original objective, introduced by Dasgupta (2016) for analyzing hierarchical clustering algorithms, is the cost of the tree.
Definition 3 (Cost).
The cost of a tree $T$ for an instance $G = (V, s)$, where $s(i, j)$ denotes the similarity between data points $i$ and $j$, is:

$$\mathrm{cost}_G(T) = \sum_{\{i, j\} \subseteq V} s(i, j) \cdot \big|\mathrm{leaves}(T[i \vee j])\big|.$$
The objective is to find a tree of minimum cost. From a complexity point of view, cost is a harder objective to optimize. Charikar and Chatziafratis (2017) showed that cost is not constant-factor approximable under the Small Set Expansion hypothesis, and the current best approximations are $O(\sqrt{\log n})$ and require solving SDPs.
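For completeness, Dasgupta's cost can be computed the same way; a sketch under the same assumed nested-tuple tree encoding (vertex-unweighted, as in the original definition):

```python
from itertools import combinations

def leaves(tree):
    if isinstance(tree, int):
        return {tree}
    return leaves(tree[0]) | leaves(tree[1])

def lca_leaves(tree, i, j):
    while not isinstance(tree, int):
        left, right = tree
        if {i, j} <= leaves(left):
            tree = left
        elif {i, j} <= leaves(right):
            tree = right
        else:
            break
    return leaves(tree)

def cost(tree, s):
    """cost(T): each s(i,j) scaled by the number of leaves under the LCA; lower is better."""
    V = leaves(tree)
    return sum(s[i][j] * len(lca_leaves(tree, i, j))
               for i, j in combinations(sorted(V), 2))
```

Note that cost and revenue are two sides of the same quantity: in the unweighted case, $\mathrm{cost}_G(T) + \mathrm{rev}_G(T) = n \sum_{\{i,j\}} s(i,j)$ for every tree $T$, so the two objectives share optima but not approximation guarantees.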
Convention. Throughout the paper we adopt the following convention: $s$ will always denote similarities and $d$ will always denote distances. Thus, the inputs for the cost and revenue objectives will be instances of the form $(V, s, m)$, and inputs for the value objective will be instances of the form $(V, d, m)$. All missing proofs can be found in the Supplementary Material.
2.2 Notions of fairness
Many definitions have been proposed for fairness in clustering. We consider the setting in which each data point in has a color; the color corresponds to the protected attribute.
Disparate impact. This notion captures the idea that decisions (i.e., clusterings) should not be overly favorable to one group over another. It was formalized by Chierichetti et al. (2017) for clustering when the protected attribute takes one of two values, i.e., points have one of two colors. In their setup, the balance of a cluster is the ratio of the minimum to the maximum number of points of any color in the cluster. Given a balance requirement $t \in (0, 1]$, a clustering is fair if and only if each cluster has a balance of at least $t$.
Bounded representation. A generalization of disparate impact, bounded representation focuses on mitigating the imbalance in the representation of protected classes (i.e., colors) in clusters, and was defined by Ahmadian et al. (2019). Given an over-representation cap $\alpha$, a cluster is fair if the fractional representation of each color in the cluster is at most $\alpha$, and a clustering is fair if each cluster has this property. An interesting special case of this notion is when there are $c$ total colors and $\alpha = 1/c$. In this case, we require that every color be equally represented in every cluster; we refer to this as equal representation. These notions enjoy the following useful property:
Definition 4 (Unionclosed).
A fairness constraint is union-closed if for any pair of fair clusters $A$ and $B$, the cluster $A \cup B$ is also fair.
This property is useful in hierarchical clustering: given a tree $T$ and an internal node $v$, if each child cluster of $v$ is fair, then the cluster at $v$ must also be fair.
Definition 5 (Fair hierarchical clustering).
For any fairness constraint, a hierarchical clustering is fair if all of its clusters (besides the leaves) are fair.
Thus, under any union-closed fairness constraint, this definition is equivalent to restricting the bottom-most clustering (besides the leaves) to be fair. Given an objective (e.g., revenue), the goal is then to find a fair hierarchical clustering that optimizes the objective. We focus on the bounded representation fairness notion with $c$ colors and an over-representation cap $\alpha$. However, the main ideas for the revenue and value objectives work under any notion of fairness that is union-closed.
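The fairness notions above translate directly into predicates; a small sketch, where the color encoding and helper names are assumptions for illustration:

```python
from collections import Counter

def leaves(tree):
    if isinstance(tree, int):
        return {tree}
    return leaves(tree[0]) | leaves(tree[1])

def balance(colors):
    """Two-color balance: ratio of the rarer to the more common color."""
    c = Counter(colors)
    r, b = c["red"], c["blue"]
    return min(r, b) / max(r, b) if max(r, b) > 0 else 0.0

def is_alpha_capped(colors, alpha):
    """Bounded representation: every color at most an alpha fraction."""
    c = Counter(colors)
    return all(cnt <= alpha * len(colors) for cnt in c.values())

def tree_is_fair(tree, colors, alpha):
    """Definition 5 for the alpha-cap: every internal cluster is fair.
    (By union-closure it suffices to check the bottom-most clusters;
    we check every internal node here for clarity.)"""
    if isinstance(tree, int):
        return True  # leaves are exempt
    here = [colors[v] for v in leaves(tree)]
    return is_alpha_capped(here, alpha) and all(
        tree_is_fair(child, colors, alpha) for child in tree)
```

For instance, with colors `["red", "blue", "red", "blue"]`, the tree `((0, 1), (2, 3))` is fair for $\alpha = 1/2$, while `((0, 2), (1, 3))` is not, since its left child is monochromatic.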
3 Fairlet decomposition
Definition 6 (Fairlet Chierichetti et al. (2017)).
A fairlet $Y$ is a fair set of points such that there is no partition of $Y$ into $Y_1$ and $Y_2$ with both $Y_1$ and $Y_2$ being fair.
In the bounded representation fairness setting, a set of points is fair if at most an $\alpha$ fraction of its points have the same color; we call such a set an $\alpha$-capped fairlet. For $\alpha = 1/t$ with $t$ an integer, the fairlet size is bounded by a constant depending only on $t$. We refer to the maximum size of a fairlet as $m_f$.
Recall that given a union-closed fairness constraint, if the bottom clustering in the tree is a layer of fairlets (which we call a fairlet decomposition of the original dataset), then the hierarchical clustering tree is fair. This observation gives an immediate two-phase algorithm for finding fair hierarchical clustering trees: (i) find a fairlet decomposition, i.e., partition the input set into clusters that are all fairlets; (ii) build a tree on top of the fairlets. Our goal is to complete both phases in such a way that we optimize the given objective (i.e., revenue or value).
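The two-phase scheme can be sketched end to end. Here phase (i) uses a naive pairing of red and blue points as fairlets, which is valid only for the two-color setting with equal color counts and $\alpha = 1/2$ and is a stand-in for the decomposition algorithms discussed later; phase (ii) runs plain average-linkage over the fairlets:

```python
from itertools import combinations

def naive_fairlets(colors):
    """Phase (i), illustration only: pair the i-th red with the i-th blue point.
    Assumes equally many red and blue points; each pair is a fairlet for alpha = 1/2."""
    reds = [i for i, c in enumerate(colors) if c == "red"]
    blues = [i for i, c in enumerate(colors) if c == "blue"]
    assert len(reds) == len(blues)
    return [[r, b] for r, b in zip(reds, blues)]

def average_linkage(items, dist):
    """Phase (ii): agglomerative merging by smallest average inter-item distance.
    Returns a nested-tuple tree over item indices."""
    clusters = [(i, [i]) for i in range(len(items))]

    def avg(A, B):
        return sum(dist(items[a], items[b]) for a in A for b in B) / (len(A) * len(B))

    while len(clusters) > 1:
        ia, ib = min(combinations(range(len(clusters)), 2),
                     key=lambda p: avg(clusters[p[0]][1], clusters[p[1]][1]))
        (ta, ma), (tb, mb) = clusters[ia], clusters[ib]
        clusters = [c for k, c in enumerate(clusters) if k not in (ia, ib)]
        clusters.append(((ta, tb), ma + mb))
    return clusters[0][0]
```

Here `average_linkage` treats each fairlet as a single item; `dist` can aggregate the original point distances, e.g. $d(F, G) = \sum_{u \in F, v \in G} d(u, v)$.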
In Section 4, we will see that to optimize for the revenue objective, all we need is a fairlet decomposition with bounded fairlet size. However, the fairlet decomposition required for the value objective is more nuanced. We describe this next.
Fairlet decomposition for the value objective
For the value objective, we need the total distance between pairs of points inside each fairlet to be small. Formally, suppose $V$ is partitioned into a fairlet decomposition $\mathcal{Y} = \{Y_1, Y_2, \ldots\}$ such that each $Y_k$ is an $\alpha$-capped fairlet. The cost of this decomposition is defined as:

$$\phi(\mathcal{Y}) = \sum_{Y_k \in \mathcal{Y}} \; \sum_{\{u, v\} \subseteq Y_k} d(u, v). \qquad (3)$$
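Objective (3) is straightforward to evaluate; a sketch, with a fairlet decomposition represented as a list of index lists (an assumed encoding):

```python
from itertools import combinations

def phi(decomposition, d):
    """Cost (3) of a fairlet decomposition: total intra-fairlet pairwise distance."""
    return sum(d[u][v]
               for fairlet in decomposition
               for u, v in combinations(sorted(fairlet), 2))
```

A decomposition that groups nearby points into the same fairlet has small $\phi$, which is exactly what the local search of Section 5 aims for.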
Unfortunately, the problem of finding a fairlet decomposition minimizing $\phi(\mathcal{Y})$ does not admit any constant-factor approximation unless P = NP.
Theorem 7.
Let $t \ge 3$ be an integer. Then there is no polynomial time bounded approximation algorithm for finding $\tfrac{1}{t}$-capped fairlets optimizing $\phi$, unless P = NP.
The proof proceeds by a reduction from the Triangle Partition problem, which asks whether a graph $G = (V, E)$ on $3n$ vertices can be partitioned into three-element sets, each forming a triangle in $G$. Fortunately, for the purpose of optimizing the value objective, it is not necessary to find an approximately optimal decomposition.
4 Optimizing revenue with fairness
This section considers the revenue objective. We obtain an approximation algorithm for this objective in three steps: (i) obtain a fairlet decomposition in which the maximum fairlet size is small; (ii) show that any $\beta$-approximation algorithm to (1), combined with this fairlet decomposition, can be used to obtain a (roughly) $\beta$-approximation for fair hierarchical clustering under the revenue objective; and (iii) use average-linkage, which is known to be a $\tfrac{1}{3}$-approximation to (1). (We note that the recent works Ahmadian et al. (2020a); Alon et al. (2020) on improved approximation algorithms compare against a bound on the optimal solution that differs from (1) and therefore do not fit into our framework.)
First, we address step (ii). Due to space limitations, the proof can be found in Appendix B.
Theorem 8.
Given an algorithm that obtains a $\beta$-approximation to (1), where $\beta \le 1$, and a fairlet decomposition with maximum fairlet size $m_f$, there is a $\beta(1 - o(1))$-approximation for fair hierarchical clustering under the revenue objective, where the $o(1)$ term vanishes as $n$ grows relative to $m_f$.
Prior work showed that average-linkage is a $\tfrac{1}{3}$-approximation to (1) in the vertex-unweighted case; the proof can be easily modified to show that it remains a $\tfrac{1}{3}$-approximation even with vertex weights. This accounts for step (iii) in our process.
Combined with the fairlet decomposition methods for the two-color case (Chierichetti et al., 2017) and for the multi-color case (Supplementary Material) to address step (i), we have the following.
Corollary 9.
There is a polynomial time algorithm that constructs a fair tree that is a $\tfrac{1}{3}(1 - o(1))$-approximation for the revenue objective, where the $o(1)$ term vanishes as $n$ grows relative to the maximum fairlet size $m_f$.
5 Optimizing value with fairness
In this section we consider the value objective. As with revenue, we prove that we can reduce fair hierarchical clustering to the problem of finding a good fairlet decomposition for the proposed fairlet objective (3), and then use any approximation algorithm for weighted hierarchical clustering with the decomposition as its input.
Theorem 10.
We complement this result with an algorithm that finds a good fairlet decomposition in polynomial time under the bounded representation fairness constraint with cap $\alpha$.
Let $\ell_1, \ldots, \ell_c$ be the colors and let $\mathcal{Y} = \{Y_1, Y_2, \ldots\}$ be the fairlet decomposition. Let $n_i$ be the number of points colored $\ell_i$ in $V$, and let $n_{i,k}$ denote the number of points colored $\ell_i$ in the $k$th fairlet $Y_k$.
Theorem 11.
There exists a local search algorithm that finds a fairlet decomposition $\mathcal{Y}$ with small cost $\phi(\mathcal{Y})$ in polynomial time.
We can now use the fact that average-linkage and the locally $\epsilon$-densest cut algorithm give a $\tfrac{2}{3}$- and a $(\tfrac{2}{3} - \epsilon)$-approximation, respectively, for vertex-weighted hierarchical clustering under the value objective. Finally, recall that fairlets are intended to be minimal, and their size depends only on the cap $\alpha$, not on the size of the original input. Therefore, as long as the number of points of each color increases as the input size $n$ grows, the ratio $m_f / n$ goes to $0$. These results, combined with Theorem 10 and Theorem 11, yield Corollary 12.
Corollary 12.
Given bounded size fairlets, the fairlet decomposition computed by local search combined with average-linkage constructs a fair hierarchical clustering that is a $\tfrac{2}{3}(1 - o(1))$-approximation for the value objective. With the locally $\epsilon$-densest cut algorithm of Cohen-Addad et al. (2018), we get a polynomial time algorithm for fair hierarchical clustering that is a $(\tfrac{2}{3} - \epsilon)(1 - o(1))$-approximation under the value objective for any $\epsilon > 0$.
Given that at most a small fraction of the points of each color lies in any one cluster, Corollary 12 extends the state-of-the-art results for value to the $\alpha$-capped, multi-colored constraint. Note that the preconditions are always satisfied, and the extension holds, in the two-color fairness setting and in the multi-colored equal representation fairness setting.
Fairlet decompositions via local search
In this section, we give a local search algorithm to construct a fairlet decomposition, proving Theorem 11. The algorithm is inspired by the $\epsilon$-densest cut algorithm of Cohen-Addad et al. (2018). To start, recall that for a pair of sets $A$ and $B$ we denote by $d(A, B)$ the sum of inter-point distances, $d(A, B) = \sum_{u \in A, v \in B} d(u, v)$. A fairlet decomposition $\mathcal{Y} = \{Y_1, Y_2, \ldots\}$ is a partition of the input such that each color composes at most an $\alpha$ fraction of each $Y_k$.
Our algorithm will recursively subdivide the cluster of all data to construct a hierarchy by finding cuts. To search for a cut, we will use a swap method.
Definition 13 (Local optimality).
Consider any fairlet decomposition $\mathcal{Y} = \{Y_1, Y_2, \ldots\}$ and $\epsilon > 0$. Define a swap of $u \in Y_i$ and $v \in Y_j$ for $i \neq j$ as updating $Y_i$ to $(Y_i \setminus \{u\}) \cup \{v\}$ and $Y_j$ to $(Y_j \setminus \{v\}) \cup \{u\}$. We say $\mathcal{Y}$ is $\epsilon$-locally-optimal if any swap with $u, v$ of the same color reduces the objective value by less than a $(1 + \epsilon)$ factor.
The algorithm constructs an $\epsilon$-locally-optimal fairlet decomposition and runs in polynomial time. Consider any given instance $(V, d)$. Let $d_{\max}$ denote the maximum pairwise distance and $m_f$ the maximum fairlet size. The algorithm begins with an arbitrary decomposition, then swaps pairs of monochromatic points until it terminates with a locally-optimal solution. By construction we have the following.
Claim 14.
Algorithm 1 finds a valid fairlet decomposition.
We prove two things: that Algorithm 1 (approximately) optimizes the objective (3), and that it has a small running time. The following lemma bounds the performance on (3) of the decomposition $\mathcal{Y}$ found by Algorithm 1.
Lemma 15.
Finally, we bound the running time. In practice, the algorithm performs much better than its worst-case analysis would indicate, as we show later in Section 7.
Lemma 16.
Algorithm 1 runs in time polynomial in the input size and $1/\epsilon$.
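A compact sketch of the swap-based local search (Algorithm 1), including the early-stopping heuristic described in Section 7. The acceptance rule, keeping a swap only if it shrinks the objective by more than a $(1+\epsilon)$ factor, is our reading of Definition 13 and should be treated as an assumption:

```python
import random
from itertools import combinations

def phi(decomposition, d):
    """Fairlet objective (3): total intra-fairlet pairwise distance."""
    return sum(d[u][v] for f in decomposition for u, v in combinations(sorted(f), 2))

def local_search(fairlets, colors, d, eps=0.01, max_fails=200, seed=0):
    """Randomly swap same-colored points between fairlets while a swap improves
    phi by more than a (1+eps) factor; stop after max_fails consecutive
    unsuccessful attempts (the early-stopping heuristic of Section 7)."""
    rng = random.Random(seed)
    fairlets = [list(f) for f in fairlets]
    cur, fails = phi(fairlets, d), 0
    while fails < max_fails and cur > 0:
        a, b = rng.sample(range(len(fairlets)), 2)
        i, j = rng.randrange(len(fairlets[a])), rng.randrange(len(fairlets[b]))
        u, v = fairlets[a][i], fairlets[b][j]
        if colors[u] != colors[v]:
            fails += 1
            continue
        fairlets[a][i], fairlets[b][j] = v, u       # try the swap
        new = phi(fairlets, d)
        if new * (1 + eps) < cur:                   # sufficient improvement: keep
            cur, fails = new, 0
        else:                                       # otherwise revert
            fairlets[a][i], fairlets[b][j] = u, v
            fails += 1
    return fairlets
```

Because only same-colored points are ever exchanged, every intermediate decomposition keeps the color counts of the initial one, which is the content of Claim 14.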
6 Optimizing cost with fairness
This section considers the cost objective of Dasgupta (2016). Even without our fairness constraint, the difficulty of approximating cost is clear from its approximation hardness and from the fact that all known solutions require an LP or SDP solver. We obtain the result in Theorem 17; extending this result to other fairness constraints, improving its bound, or making the algorithm practical are open questions.
Theorem 17.
Consider the two-color case. Given a $\gamma$-approximation algorithm for cost and a $\delta$-approximation algorithm for minimum weighted bisection (the problem of finding a partition of the nodes into two equal-sized subsets so that the sum of the weights of the edges crossing the partition is minimized) on inputs of size $n$, there is, for a suitable choice of parameters, a fair approximation algorithm for cost.
With the proper parameterization, we achieve an $O(n^{5/6}\,\mathrm{polylog}(n))$-approximation. We defer the algorithm description, pseudocode, and proofs to the Supplementary Material. While our algorithm is not simple, it is an important (and non-obvious) step to show the existence of such an approximation, which we hope will spur future work in this area.
7 Experiments
This section validates our algorithms from Sections 4 and 5 empirically. We adopt the disparate impact fairness constraint Chierichetti et al. (2017); thus each point is either blue or red. In particular, we would like to:

- Show that running the standard average-linkage algorithm results in highly unfair solutions.
- Demonstrate that demanding fairness in hierarchical clustering incurs only a small loss in the hierarchical clustering objective.
- Show that our algorithms, including fairlet decomposition, are practical on real data.
In Appendix G we consider multiple colors; the same trends as in the two-color case occur.
Name  Sample size  # features  Protected feature  Color (blue, red)  

CensusGender  30162  6  gender  (female, male)  
CensusRace  30162  6  race  (nonwhite, white)  
BankMarriage  45211  7  marital status  (not married, married)  
BankAge  45211  7  age  (, ) 
[Table 2: the ratio of the fair tree's value objective to that of the average-linkage tree, at fairlet initialization (initial) and after local search (final), for sample sizes 400 through 12800 on CensusGender, CensusRace, BankMarriage, and BankAge.]
Datasets. We use two datasets from the UCI data repository (archive.ics.uci.edu/ml/index.php): Census (archive.ics.uci.edu/ml/datasets/census+income) and Bank (archive.ics.uci.edu/ml/datasets/Bank+Marketing). In each dataset, we use the features with numerical values and leave out samples with empty entries. For value, we use the Euclidean distance as the dissimilarity measure. For revenue, we set the similarity to be $s(i, j) = \frac{1}{1 + d(i, j)}$, where $d(i, j)$ is the Euclidean distance. We pick two different protected features for each dataset, resulting in four datasets in total (see Table 1 for details).

- Census dataset: We choose gender and race to be the protected features and call the resulting datasets CensusGender and CensusRace.
- Bank dataset: We choose marital status and age to be the protected features and call the resulting datasets BankMarriage and BankAge.
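The distance and similarity measures used above are easy to reproduce; a minimal sketch, where the $1/(1+d)$ transform is the one assumed for the revenue experiments:

```python
import math

def euclidean(p, q):
    """Dissimilarity used for the value objective."""
    return math.dist(p, q)

def similarity(p, q):
    """Similarity used for the revenue objective: s = 1 / (1 + d)."""
    return 1.0 / (1.0 + euclidean(p, q))
```

The transform is monotone decreasing in distance, so nearby points get similarity close to 1 and far-apart points similarity close to 0.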
In this section, unless otherwise specified, we report results only for the value objective. Results for the revenue objective are qualitatively similar and are omitted here. We do not evaluate our algorithm for the cost objective since it is currently only of theoretical interest.
We subsample points of two colors from the original dataset proportionally, approximately retaining the original color balance. The sample sizes used are those shown in Tables 2 and 3. On each subsample we repeat the experiment multiple times and report the average results. We set the parameter $\epsilon$ of Algorithm 1 to a fixed small constant in all of the experiments.
Implementation. The code is available in the Supplementary Material. In the experiments, we use Algorithm 1 for the fairlet decomposition phase, where the fairlet decomposition is initialized by randomly assigning red and blue points to each fairlet. We apply the average-linkage algorithm to create a tree on the fairlets, and further use average-linkage to create the subtrees inside each fairlet.

The algorithm selects a random pair of blue or red points in different fairlets to swap, and checks whether the swap sufficiently improves the objective. We do not run the algorithm until all pairs have been checked; rather, it stops after a fixed number of failed attempts to swap a random pair. As we observe empirically, this has no material effect on the quality of the overall solution.
Table 3: the fairlet objective value at initialization (initial) and after local search (final).

Samples | 100 | 200 | 400 | 800 | 1600 | 3200 | 6400 | 12800
CensusGender, initial | 2.5e-2 | 1.2e-2 | 6.2e-3 | 3.0e-3 | 1.5e-3 | 7.5e-4 | 3.8e-4 | 1.9e-4
CensusGender, final | 4.9e-3 | 1.4e-3 | 6.9e-4 | 2.5e-4 | 8.5e-5 | 3.6e-5 | 1.8e-5 | 8.0e-6
CensusRace, initial | 6.6e-2 | 3.4e-2 | 1.7e-2 | 8.4e-3 | 4.2e-3 | 2.1e-3 | 1.1e-3 | 5.3e-4
CensusRace, final | 2.5e-2 | 1.2e-2 | 6.2e-3 | 3.0e-3 | 1.5e-3 | 7.5e-4 | 3.8e-4 | 1.9e-5
BankMarriage, initial | 1.7e-2 | 8.2e-3 | 4.0e-3 | 2.0e-3 | 1.0e-3 | 5.0e-4 | 2.5e-4 | 1.3e-4
BankMarriage, final | 5.9e-3 | 2.1e-3 | 9.3e-4 | 4.1e-4 | 1.3e-4 | 7.1e-5 | 3.3e-5 | 1.4e-5
BankAge, initial | 1.3e-2 | 7.4e-3 | 3.5e-3 | 1.9e-3 | 9.3e-4 | 4.7e-4 | 2.3e-4 | 1.2e-4
BankAge, final | 5.0e-3 | 2.2e-3 | 7.0e-4 | 3.7e-4 | 1.3e-4 | 5.7e-5 | 3.0e-5 | 1.4e-5
[Figure 1: (i) the fairlet objective and (ii) the value ratio during the execution of Algorithm 1 on a Census subsample; (iii) average running time on Census for the standard and fair average-linkage algorithms.]
Metrics. We present results for value here; the results for revenue are qualitatively similar. In our experiments we track the following quantities. Let $G$ be the given input instance, and let $T'$ be the output of our fair hierarchical clustering algorithm. We consider the ratio $\mathrm{val}_G(T') / \mathrm{val}_G(T)$, where $T$ is the tree obtained by the standard average-linkage algorithm. We also track the fairlet objective $\phi(\mathcal{Y})$, where $\mathcal{Y}$ is the current fairlet decomposition.
Results. The average-linkage algorithm always constructs unfair trees: on each of the datasets, it produces monochromatic clusters at some level, strengthening the case for fair algorithms.
In Table 2, we show for each dataset the value ratio both at the time of initialization (initial) and after running the local search algorithm (final). We see the change in the ratio as the local search algorithm performs swaps: fairness leads to almost no degradation in the objective value as the number of swaps increases. Table 3 shows the fairlet objective for the initial random decomposition and for the final output fairlets. As we see, Algorithm 1 significantly improves over the initial random fairlet decomposition.
[Table 4: the ratio of the revenue and value achieved by our algorithms to the corresponding upper bounds (1) and (2), on CensusGender, CensusRace, BankMarriage, and BankAge.]
The more the locally-optimal algorithm improves the objective value of (3), the better the performance of the tree built on the fairlets. Figures 1(i) and 1(ii) show the fairlet objective and the tree's value ratio over the course of the swaps performed by Algorithm 1 on a subsample of the Census dataset. The plots show that as the fairlet objective value decreases, the value objective of the resulting fair tree increases. Such correlation is found on subsamples of all sizes.
Next we compare the objective value of the algorithm with the upper bound on the optimum. We report the results for both the revenue and value objectives, using fairlets obtained by local search, in Table 4. On all datasets, we obtain ratios significantly better than the theoretical worst-case guarantees. In Figure 1(iii), we show the average running time on the Census dataset for both the original average-linkage algorithm and the fair average-linkage algorithm. As the sample size grows, the running time scales almost as well as that of current implementations of average-linkage. Thus, with a modest increase in time, we can obtain a fair hierarchical clustering under the value objective.
8 Conclusions
In this paper we extended the notion of fairness to the classical problem of hierarchical clustering under three different objectives (revenue, value, and cost). Our results show that revenue and value are easy to optimize with fairness; while optimizing cost appears to be more challenging.
Our work raises several questions and research directions. Can the approximations be improved? Can we find better upper and lower bounds for fair cost? Are there other important fairness criteria?
References
- Bisect and conquer: hierarchical clustering via max-uncut bisection. In AISTATS.
- Clustering without over-representation. In KDD, pp. 267–275.
- Fair correlation clustering. In AISTATS.
- Hierarchical clustering: a 0.585 revenue approximation. In COLT.
- Scalable fair clustering. In ICML, pp. 405–413.
- Fairness and machine learning. www.fairmlbook.org.
- Fair algorithms for clustering. In NeurIPS, pp. 4955–4966.
- On the cost of essentially fair clusterings. In APPROX-RANDOM, pp. 18:1–18:22.
- Multiwinner voting with fairness constraints. In IJCAI, pp. 144–151.
- Ranking with fairness constraints. In ICALP, pp. 28:1–28:15.
- Hierarchical clustering better than average-linkage. In SODA, pp. 2291–2304.
- Approximate hierarchical clustering via sparsest cut and spreading metrics. In SODA, pp. 841–854.
- Proportionally fair clustering. In ICML, pp. 1032–1041.
- Fair algorithms for hierarchical agglomerative clustering. arXiv:2005.03197.
- Fair clustering through fairlets. In NIPS, pp. 5029–5037.
- Matroids, matchings, and fairness. In AISTATS, pp. 2212–2220.
- How to solve fair k-center in massive data models. In ICML.
- Hierarchical clustering: objective functions and algorithms. In SODA, pp. 378–397.
- A cost function for similarity-based hierarchical clustering. In STOC, pp. 118–127.
- Clustering methodologies in exploratory data analysis. In Advances in Computers, Vol. 19, pp. 113–228.
- A polylogarithmic approximation of the minimum bisection. In FOCS, pp. 105–115.
- Computers and intractability. W. H. Freeman, New York.
- Coresets for clustering with fairness constraints. In NeurIPS, pp. 7587–7598.
- Fair k-centers via maximum matching. In ICML.
- A comparison of document clustering techniques. In Text Mining Workshop at KDD.
- Human decisions and machine predictions. The Quarterly Journal of Economics 133(1), pp. 237–293.
- Fair k-center clustering for data summarization. In ICML, pp. 3448–3457.
- Guarantees for spectral clustering with fairness constraints. In ICML, pp. 3448–3457.
- The use of sparsest cuts to reveal the hierarchical community structure of social networks. Social Networks 30(3), pp. 223–234.
- Approximation bounds for hierarchical clustering: average linkage, bisecting k-means, and local search. In NIPS, pp. 3094–3103.
- Mining of massive datasets. Cambridge University Press.
- Privacy preserving clustering with constraints. In ICALP, pp. 96:1–96:14.
Appendix
Appendix A Approximation algorithms for weighted hierarchical clustering
In this section we first prove that running constant-approximation algorithms on fairlets gives good solutions for the value objective, and then give constant-factor approximation algorithms for weighted hierarchical clustering under both the revenue and value objectives, as mentioned in Corollaries 9 and 12: a weighted version of average-linkage, for both the weighted revenue and value objectives, and a weighted locally $\epsilon$-densest cut algorithm, which works for the weighted value objective. Both proofs are easily adapted from the previous proofs in Cohen-Addad et al. (2018) and Moseley and Wang (2017).
A.1 Running constant-approximation algorithms on fairlets
In this section, we prove Theorem 10, which says that if we run any approximation algorithm for the upper bound on weighted value on the fairlet decomposition, we get a fair tree with minimal loss in the approximation ratio. For the remainder of this section, fix any hierarchical clustering algorithm A that is guaranteed, on any weighted input, to construct a hierarchical clustering whose objective value is at least a β fraction of the upper bound for the weighted value objective. Recall that we extended the value objective to a weighted variant in the Preliminaries Section. Our aim is to show that we can combine A with the fairlet decomposition introduced in the prior section to get a fair hierarchical clustering that is a β(1−ε)-approximation for the value objective, provided the cost of the fairlet decomposition is at most an ε fraction of the total pairwise distance.
In the following definition, we transform the point set into a new set of points that are weighted. We will analyze A on this new set of points. We then show how we can relate this to the objective value of the optimal tree on the original set of points.
Definition 18.
Let Y = {Y_1, …, Y_m} be the fairlet decomposition of the point set V produced by the local search algorithm. Define the weighted instance V_Y as follows:

Each set Y_i has a corresponding point y_i in V_Y.

The weight of y_i is set to w(y_i) = |Y_i|.

For each pair of points y_i, y_j in V_Y with i ≠ j, the distance is d(y_i, y_j) = Σ_{u ∈ Y_i, v ∈ Y_j} d(u, v).
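As an illustration, the construction in Definition 18 can be sketched in a few lines of Python (a minimal sketch; the dictionary-based distance encoding and the function name are our own, not from the paper):

```python
from itertools import combinations

def build_weighted_instance(points, d, fairlets):
    """Collapse each fairlet into a single weighted point (Definition 18).

    `d` maps frozenset({u, v}) -> distance; `fairlets` is a list of lists
    of points. Returns (weights, distances) where weights[i] is |Y_i| and
    distances[(i, j)] sums d(u, v) over u in Y_i, v in Y_j.
    """
    weights = [len(Y) for Y in fairlets]
    distances = {}
    for i, j in combinations(range(len(fairlets)), 2):
        distances[(i, j)] = sum(
            d[frozenset((u, v))] for u in fairlets[i] for v in fairlets[j]
        )
    return weights, distances
```

For example, on four points with all pairwise distances 1 and fairlets {0, 1} and {2, 3}, the collapsed instance has two points of weight 2 at distance 4.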
We begin by observing that the objective value A receives on the instance V_Y is large compared to the total distance in the original instance.
Theorem 19.
On the instance V_Y the algorithm A has a total weighted objective of at least β(1−ε) · n · d(V), where d(V) = Σ_{u,v ∈ V} d(u, v) is the total pairwise distance.
Proof.
Notice that Σ_i w(y_i) = n. Consider the total sum of all the distances in V_Y. This is d(V) minus the total intra-fairlet distance. The upper bound on the optimal solution is n times the total distance in V_Y. Since the intra-fairlet distance is at most ε · d(V), this upper bound is at least (1−ε) · n · d(V). The theorem follows from the fact that the algorithm achieves a weighted value that is at least a β factor of the total weighted distances. ∎
A.2 Weighted hierarchical clustering: constant-factor approximation
For weighted hierarchical clustering with positive integral weights, we define the weighted average-linkage algorithm for a weighted input (V, d, w) (resp. (V, s, w)). Define the average distance between two clusters A and B to be Avg(A, B) = (Σ_{u ∈ A, v ∈ B} d(u, v)) / (w(A) · w(B)) for dissimilarity-based input, and analogously with s(u, v) in the numerator for similarity-based input, where w(A) = Σ_{u ∈ A} w(u). In each iteration, weighted average-linkage seeks to merge the pair of clusters which minimizes this value, if dissimilarity-based, and maximizes this value, if similarity-based.
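A minimal sketch of this merge rule, assuming a dictionary of pairwise distances and integral weights (the function and variable names are ours; a real implementation would maintain a priority queue rather than rescanning all pairs):

```python
def weighted_average_linkage(weights, dist, similarity=False):
    """Greedy weighted average-linkage (a sketch, not the paper's code).

    `weights[i]` is the positive integral weight of point i; `dist[(i, j)]`
    (with i < j) is the pairwise distance (or similarity). Repeatedly merges
    the pair of clusters minimizing (dissimilarity) or maximizing
    (similarity) sum_{u in A, v in B} dist(u, v) / (w(A) * w(B)),
    returning the merge order as a list of frozensets.
    """
    clusters = {i: (frozenset([i]), weights[i]) for i in range(len(weights))}

    def total(A, B):  # sum of pairwise dist between cluster members
        return sum(dist[tuple(sorted((u, v)))] for u in A for v in B)

    merges = []
    while len(clusters) > 1:
        keys = sorted(clusters)
        pairs = [(a, b) for i, a in enumerate(keys) for b in keys[i + 1:]]

        def avg(pair):
            (A, wa), (B, wb) = clusters[pair[0]], clusters[pair[1]]
            return total(A, B) / (wa * wb)

        a, b = (max if similarity else min)(pairs, key=avg)
        (A, wa), (B, wb) = clusters.pop(a), clusters.pop(b)
        clusters[a] = (A | B, wa + wb)
        merges.append(A | B)
    return merges
```

On a dissimilarity instance, the closest pair (lowest average distance) is merged first.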
Lemma 20.
Weighted average-linkage is a 2/3-approximation (resp., 1/3-approximation) for the upper bound on the weighted value (resp., revenue) objective with positive, integral weights.
Proof.
We prove it for weighted value first. This is directly implied by the fact that average-linkage is a 2/3-approximation for the unweighted value objective, as proved in Cohen-Addad et al. [18]. We have already seen in the last subsection that an unweighted input can be converted into a weighted input. Vice versa, we can convert a weighted input into an unweighted input with the same upper bound for the value objective.
In weighted hierarchical clustering we treat each point u with integral weight w(u) as w(u) duplicates of u with distance 0 among themselves; call this set of duplicates S_u. For two weighted points u and v with u ≠ v, let d(u′, v′) = d(u, v) for every u′ ∈ S_u and v′ ∈ S_v. This unweighted instance, composed of the duplicates, has the same upper bound as the weighted instance. Notice that running average-linkage on the unweighted instance will always choose to put all the duplicates in S_u together first, for each u, and then do hierarchical clustering on top of the duplicate sets. Thus running average-linkage on the unweighted input gives a valid hierarchical clustering tree for the weighted input. Since the unweighted value upper bound equals the weighted value upper bound, the approximation ratio is the same.
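The duplication reduction used in this proof can be made concrete as follows (a sketch under our own encoding of weighted instances; the names are not from the paper):

```python
def expand_to_unweighted(weights, dist):
    """Replace each point u of integral weight w(u) by w(u) duplicates at
    mutual distance 0 (the weighted-to-unweighted reduction of Lemma 20).

    Returns (labels, d) where labels[k] is the original point that
    duplicate k copies, and d[(k, l)] is the induced pairwise distance.
    """
    labels = [u for u, w in enumerate(weights) for _ in range(w)]
    d = {}
    for k in range(len(labels)):
        for l in range(k + 1, len(labels)):
            u, v = labels[k], labels[l]
            d[(k, l)] = 0 if u == v else dist[tuple(sorted((u, v)))]
    return labels, d
```

Note that the total cross distance between two duplicate sets S_u and S_v is w(u) · w(v) · d(u, v), which is exactly why the two instances share the same upper bound.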
Now we prove it for weighted revenue. In Moseley and Wang [30], the fact that average-linkage is a 1/3-approximation for unweighted revenue is proved as follows. Given any clustering C, if average-linkage chooses to merge A and B in C, we define a local revenue for this merge:
merge-rev(A, B) = |V ∖ (A ∪ B)| · Σ_{u ∈ A, v ∈ B} s(u, v).
And correspondingly, a local cost:
merge-cost(A, B) = |A| · Σ_{v ∈ B, x ∉ A ∪ B} s(v, x) + |B| · Σ_{u ∈ A, x ∉ A ∪ B} s(u, x).
Summing up the local revenue and cost over all merges gives the upper bound. Moseley and Wang [30] used the properties of average-linkage to prove that at every merge, merge-rev(A, B) ≥ (1/2) · merge-cost(A, B), which guarantees that the total revenue, the summation of merge-rev(A, B) over all merges, is at least 1/3 of the upper bound. For the weighted case, we define
merge-rev(A, B) = (W − w(A) − w(B)) · Σ_{u ∈ A, v ∈ B} s(u, v), where W = Σ_{u ∈ V} w(u),
and
merge-cost(A, B) = w(A) · Σ_{v ∈ B, x ∉ A ∪ B} s(v, x) + w(B) · Σ_{u ∈ A, x ∉ A ∪ B} s(u, x).
And the rest of the proof works in the same way as in Moseley and Wang [30], proving weighted average-linkage to be a 1/3-approximation for weighted revenue. ∎
Next we define the weighted locally-densest cut algorithm. The original algorithm, introduced in Cohen-Addad et al. [18], defines the density of a cut (A, B) to be (Σ_{u ∈ A, v ∈ B} d(u, v)) / (|A| · |B|). It starts with the original set as one cluster; at every step, it seeks the partition of the current set that locally maximizes this value, thus constructing a tree from top to bottom. For a weighted input (V, d, w), we define the density of a cut to be (Σ_{u ∈ A, v ∈ B} d(u, v)) / (w(A) · w(B)), and let W = Σ_{u ∈ V} w(u). For more description of the algorithm, see Algorithm 4 in Section 6.2 in Cohen-Addad et al. [18].
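For intuition, the single-point local search behind a locally densest cut can be sketched as follows (a simplified sketch of the idea behind Algorithm 4 in Cohen-Addad et al. [18], with weights in place of cardinalities; the encoding and stopping-rule details are our own):

```python
def locally_densest_cut(points, weights, dist, eps=0.1):
    """Local search for a cut whose weighted density cannot be improved by
    more than a (1 + eps/n) factor by moving any single point."""
    n = len(points)

    def density(A, B):
        cut = sum(dist[tuple(sorted((u, v)))] for u in A for v in B)
        return cut / (sum(weights[u] for u in A) * sum(weights[v] for v in B))

    A, B = {points[0]}, set(points[1:])  # arbitrary nonempty starting cut
    improved = True
    while improved:
        improved = False
        for u in list(A) + list(B):
            src, dst = (A, B) if u in A else (B, A)
            if len(src) == 1:
                continue  # both sides must stay nonempty
            old = density(A, B)
            src.remove(u); dst.add(u)
            if density(A, B) > (1 + eps / n) * old:
                improved = True  # keep the improving move and rescan
                break
            dst.remove(u); src.add(u)  # undo a non-improving move
    return A, B
```

On an instance with two tight groups and all cross-group distance concentrated between them, the search converges to the cut separating the groups.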
Lemma 21.
The weighted locally-densest cut algorithm is a (2/3 − ε)-approximation for the weighted value objective.
Proof.
Just as in the average-linkage proof, we convert each weighted point u into a set S_u of w(u) duplicates of u. Notice that the converted unweighted hierarchical clustering input has the same upper bound as the weighted hierarchical clustering input, and the locally-densest cut algorithm moves the duplicate sets around as units in the unweighted input, instead of single points as in the original algorithm in Cohen-Addad et al. [18].
Focus on a split of a cluster C into (A, B). Let S_u ⊆ A be a duplicate set. Since the cut is locally densest, moving S_u across the cut cannot significantly increase its density.
Pick a point u′ ∈ S_u; since all points in S_u are identical, the same guarantee holds for moving u′ alone.
Rearranging the terms, we get that the corresponding inequality holds for any single point u′.
The rest of the proof goes exactly the same as the proof in Cohen-Addad et al. [18, Theorem 6.5]. ∎
Appendix B Proof of Theorem 8
Proof.
Let A be the β-approximation algorithm to (1). For a given instance (V, s), let Y = {Y_1, …, Y_m} be a fairlet decomposition of V; let m_f denote the maximum fairlet size. Recall that the fairlet decomposition objective bounds the total intra-fairlet similarity.
We use Y to create a weighted instance V_Y. For each pair y_i, y_j ∈ V_Y, we define s(y_i, y_j) = Σ_{u ∈ Y_i, v ∈ Y_j} s(u, v), and we define w(y_i) = |Y_i|.
We run A on V_Y and let T_Y be the hierarchical clustering obtained by A. To extend this to a tree T on V, we simply place all the points in each fairlet as leaves under the corresponding vertex in T_Y.
We argue that T is a good approximation to the revenue upper bound on V.
Since A obtains a β-approximation to hierarchical clustering on V_Y, we have that the weighted revenue of T_Y is at least a β fraction of the weighted upper bound.
Notice the fact that, for any pair of points u, v in the same fairlet Y_i, the revenue they get in the tree T is s(u, v) · (n − |Y_i|). Then, using the bound on the total intra-fairlet similarity,
the resulting tree is a β(1−ε)-approximation of the upper bound. ∎
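The extension step, replacing each leaf of the fairlet-level tree by the points of its fairlet, can be sketched as follows (using a nested-tuple tree encoding of our own, not the paper's notation):

```python
def extend_tree_to_points(tree, fairlets):
    """Replace each leaf i of a tree over fairlet indices by an internal
    node whose children are the points of fairlet Y_i.

    A tree is either a leaf (an int, a fairlet index) or a tuple of two
    subtrees; the returned tree's leaves are the original data points.
    """
    if isinstance(tree, int):          # leaf: a fairlet index
        return tuple(fairlets[tree])   # hang the fairlet's points below it
    left, right = tree
    return (extend_tree_to_points(left, fairlets),
            extend_tree_to_points(right, fairlets))
```

Since every fairlet keeps the target color ratio, every cluster of the extended tree at or above the fairlet level is fair by construction.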
Appendix C Proofs for locallyoptimal local search algorithm
In this section, we prove that Algorithm 1 gives a good fairlet decomposition for the fairlet decomposition objective (3), and that it has a polynomial running time.
C.1 Proof for a simplified version of Lemma 15
In Subsection C.2, we will prove Lemma 15. For now, we consider a simpler version of Lemma 15 in the context of the disparate impact problem of Chierichetti et al. [15], where we have red and blue points and strive to preserve their ratio in every cluster. Chierichetti et al. [15] provided a valid fairlet decomposition in this context, where each fairlet has at most b blue points and at most r red points. Before going deeper into the analysis, we state the following useful proposition.
Proposition 22.
Let R be the total number of red points and B the total number of blue points. Then the following bound holds.
Proof.
Recall that , and wlog . Since the fractions are positive and we know that . Since we conclude that . Similarly, we conclude that . Therefore .
Thus, . However, since and , , . ∎
Using this, we can prove the following lemma, which is a simplified version of Lemma 15.
Lemma 23.
Proof.
Map each point of V to the fairlet it belongs to; for any set of points, its image is the corresponding set of fairlets. For a fairlet F, let r(F) and b(F) denote the number of red and blue points in F, respectively.
We first bound the total number of intra-fairlet pairs. Since each fairlet contains at most b + r points, the number of intra-fairlet pairs is at most n(b + r)/2.
The While loop can end in two cases: 1) the decomposition is locally-optimal; 2) the objective value is already small enough. Case 2 immediately implies the lemma, thus we focus on case 1. By the definition of the algorithm, we know that for any pair of points in different fairlets that have the same color, the swap does not increase the objective value by a large amount. (The same trivially holds if the pair are in the same fairlet.)
After moving terms and some simplification, we get the following inequality:
(4) 
Then we sum up (4) over every pair of red points (even if they are in the same partition).
Dividing both sides and using the fact above, which holds for all points, we get:
(5) 
For pairs of blue points we sum (4) to similarly obtain:
(6) 
Now we sum up (5) and (6). The LHS becomes:
The other terms give: