The success of machine learning has led to its widespread adoption in many aspects of our daily lives. Automatic prediction and forecasting methods are now used to approve mortgage applications or estimate the likelihood of recidivism. It is thus crucial to design machine learning algorithms that are fair, i.e., that do not suffer from bias against or towards particular population groups. An extensive amount of research over the last few years has focused on two key questions: how to formalize the notion of fairness in the context of common machine learning tasks, and how to design efficient algorithms that conform to those formalizations. See, e.g., the survey by Chouldechova and Roth (2018) for an overview.
In this paper we focus on the second aspect. Specifically, we consider the problem of fair clustering and propose efficient algorithms for solving this problem. Fair clustering, introduced in Chierichetti et al. (2017), generalizes the standard notion of clustering by imposing a constraint that all clusters must be balanced with respect to specific sensitive attributes, such as gender or religion. In the simplest formulation, each input point is augmented with one of two colors (say, red and blue), and the goal is to cluster the data while ensuring that, in each cluster, the fraction of points with the less frequent color is bounded from below by some parameter strictly greater than 0. Chierichetti et al. (2017) proposed polynomial time approximation algorithms for fair variants of classic clustering methods, such as k-center (minimize the maximum distance between points and their cluster centers) and k-median (minimize the average distance between points and their cluster centers). To this end, they introduced the notion of fairlet decomposition: a partitioning of the input pointset into small subsets, called fairlets, such that a good balanced clustering can be obtained by merging fairlets into clusters. Unfortunately, their algorithm for computing a fairlet decomposition has running time that is at least quadratic in the number of the input points. As a result, the algorithm is applicable only to relatively small data sets.
In this paper we address this drawback and propose an algorithm for computing fairlet decompositions with running time that is near-linear in the data size. We focus on the k-median formulation, as k-center clustering is known to be sensitive to outliers. Our algorithms apply to the typical case where the input points lie in a d-dimensional space and the distance is induced by the Euclidean norm. (E.g., all data sets used to evaluate the algorithms in Chierichetti et al. (2017) fall into this category.)
To state the result formally, we need to introduce some notation. Consider a collection P of n points, where each point is colored either red or blue. For a subset of points C, the balance of C is defined as balance(C) = min(r(C)/b(C), b(C)/r(C)), where r(C) and b(C) respectively denote the number of red and blue points in C. Assuming integers r ≤ b, a clustering of P is (r, b)-fair if for every cluster C, balance(C) ≥ r/b. In k-median clustering, the goal is to find k centers and partition the pointset into k clusters centered at the selected centers such that the sum of the distances from each point p to its assigned center (i.e., the center of the cluster to which p belongs) is minimized. In the (r, b)-fair k-median problem, all clusters are additionally required to have balance at least r/b. Our main result is summarized in the following theorem.
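To make the definitions concrete, here is a minimal Python sketch of the balance and fairness checks (the function names and encoding are ours, not the paper's code):

```python
def balance(colors):
    """min(#red/#blue, #blue/#red); 0 if one color is absent."""
    colors = list(colors)
    reds = sum(1 for c in colors if c == "red")
    blues = len(colors) - reds  # assumes every point is red or blue
    if reds == 0 or blues == 0:
        return 0.0
    return min(reds / blues, blues / reds)

def is_fair(clusters, r, b):
    """A clustering is (r, b)-fair if every cluster has balance >= r/b."""
    return all(balance(c) >= r / b for c in clusters)
```

For example, a cluster with two red points and one blue point has balance 1/2, so a clustering consisting of such clusters is (1, 2)-fair but not (1, 1)-fair.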
Let T(n, k) be the running time of a ρ-approximation algorithm for the k-median problem over n points in d dimensions. Then there exists an O(nd log n + T(n, k))-time algorithm that, given a point set P and balance parameters r ≤ b, computes an (r, b)-fair k-median of P whose cost is within a factor of O(ρ · d log n) of the optimal cost of an (r, b)-fair k-median of P.
We complement our theoretical analysis with empirical evaluation. Our experiments show that the quality of the clustering obtained by our algorithm is comparable to that of Chierichetti et al. (2017). At the same time, the empirical runtime of our algorithm scales almost linearly in the number of points, making it applicable to massive data sets (see Figure 1).
Since the original paper of Chierichetti et al. (2017), there have been several followup works studying fair clustering. In particular, Rösner and Schmidt (2018) and Bercea et al. (2018) studied the fair variant of k-center clustering (as opposed to k-median in our case). Furthermore, the latter paper presented a "bi-criteria" approximation algorithm for k-median and k-means under a somewhat different notion of fairness. However, their solution relies on a linear program that is a relaxation of an integer linear program with at least n^2 variables, one for every pair of points. Thus, their algorithm does not scale well with the input size. Similarly, another algorithm, proposed in Bera et al. (2019), requires solving linear programs whose number of variables grows super-linearly with the input size.
The work most relevant to our paper is a recent manuscript by Schmidt et al. (2018), which proposed efficient streaming algorithms for fair k-means (a problem similar to the k-median problem studied here). Specifically, they give a near-linear time streaming algorithm for computing a core-set: a small subset S such that solving fair clustering over S yields an approximate solution for the original point set. In order to compute the final clustering, however, they resort to a variant of the algorithm of Chierichetti et al. (2017), or to another algorithm with a running time exponential in k. Thus, our approach is complementary to theirs. In particular, our algorithm could be applied to the core-set generated by their method in order to reduce the space usage.
We note that the above algorithms guarantee constant approximation factors, as opposed to the logarithmic factor in our paper. As we show in the experimental section, this does not seem to affect the empirical quality of solutions produced by our algorithm. Still, designing a constant factor algorithm with a near-linear running time is an interesting open problem.
Possible settings of r and b.
Chierichetti et al. (2017) gave (r, b)-fairlet decomposition algorithms only for r = 1. This does not allow for computing a full decomposition of the pointset into well-balanced fairlets if the numbers of red and blue points are close but not equal (for instance, if their ratio is 9:10). One way to address this could be to downsample the larger set in order to make the two color classes have the same cardinality, and then compute a (1, 1)-fairlet decomposition. The advantage of our approach is that we do not disregard any part of the input, even a random part. This may potentially lead to much better solutions, partially by allowing clusters that are not perfectly balanced. The general setting of r and b is also considered by Bercea et al. (2018).
Our main contribution is to design a nearly-linear time algorithm for (r, b)-fairlet decomposition for any integer values of r ≤ b. Our algorithm has two steps. First, it embeds the input points into a tree metric called an HST (intuitively, this is done by computing a quadtree decomposition of the point set and then using the distances in the quadtree). In the second step, it solves the fairlet decomposition problem with respect to the new distance function induced by the HST. The distortion of the embedding into the HST accounts for the O(d log n) factor in the approximation guarantee.
Once we have the HST representation of the pointset, the high-level goal is to construct "local" (r, b)-fairlets with respect to the tree. To this end, the algorithm scans the tree in a top-down order. In each node u of the tree, it greedily partitions the points into fairlets so that the number of fairlets whose points belong to subtrees rooted at different children of u is minimized. In particular, we prove that approximately minimizing the number of such fairlets (a task we refer to as the Minimum Heavy Points problem) leads to an approximately optimal (r, b)-fairlet decomposition with respect to the distance over the tree.
Definition 2.2 (Fairlet Decomposition).
Suppose that P is a collection of points such that each point is colored either red or blue. Moreover, suppose that balance(P) ≥ r/b for some integers 1 ≤ r ≤ b. A clustering Y = {Y_1, …, Y_m} of P is an (r, b)-fairlet decomposition if (a) each point belongs to exactly one cluster Y_i, called a fairlet, (b) for each i, |Y_i| ≤ r + b, and (c) for each i, balance(Y_i) ≥ r/b.
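The three conditions can be checked mechanically. The sketch below (our own illustration, encoding points as integer ids with a color list) validates a candidate decomposition:

```python
def is_fairlet_decomposition(colors, fairlets, r, b):
    """colors[i] is 'red' or 'blue'; fairlets is a list of lists of point ids."""
    # (a) each point belongs to exactly one fairlet
    flat = sorted(i for f in fairlets for i in f)
    if flat != list(range(len(colors))):
        return False
    for f in fairlets:
        reds = sum(1 for i in f if colors[i] == "red")
        blues = len(f) - reds
        # (b) fairlets are small: |Y_i| <= r + b
        if len(f) > r + b:
            return False
        # (c) each fairlet is balanced: min/max >= r/b, i.e. min * b >= max * r
        if min(reds, blues) * b < max(reds, blues) * r:
            return False
    return True
```

The integer comparison `min * b >= max * r` avoids floating-point division when testing balance ≥ r/b.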
Probabilistic metric embedding.
A probabilistic metric is defined as a set S of metrics d_1, …, d_s along with a probability distribution π of support size s over S. For a finite metric space (X, d) and a probabilistic metric (S, π), an embedding of (X, d) into (S, π) has distortion c if, for all u, v ∈ X and every d_i ∈ S, d(u, v) ≤ d_i(u, v), and E_{d_i ∼ π}[d_i(u, v)] ≤ c · d(u, v).
Definition 2.3 (k-HST).
A tree T rooted at vertex r_T is a hierarchically well-separated tree (k-HST) if all edges of T have non-negative weights and the following two conditions hold:
The (weighted) distances from any node to all its children are the same.
For each node u, the distance of u to its children is at most a 1/k fraction of the distance of u to its parent.
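The two conditions translate directly into a recursive check. The following sketch (our own encoding of a weighted rooted tree; not from the paper) verifies them:

```python
def is_k_hst(children, weight, root, k):
    """children maps a node to its list of children; weight maps
    (parent, child) to a non-negative edge length."""
    def ok(u, parent_edge):
        kids = children.get(u, [])
        if not kids:
            return True
        ws = [weight[(u, v)] for v in kids]
        if len(set(ws)) != 1:
            return False          # condition 1: equal distances to all children
        if parent_edge is not None and ws[0] * k > parent_edge:
            return False          # condition 2: edges shrink by a factor of k
        return all(ok(v, ws[0]) for v in kids)
    return ok(root, None)
```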
We build on the following result due to Bartal (1996), which gives a probabilistic embedding from R^d into a 2-HST. Our algorithm explicitly computes this embedding, which we describe in more detail in the next section. For a proof of its properties, refer to Indyk (2001) and the references therein. In this paper, we assume that the given pointset has aspect ratio poly(n) (i.e., the ratio between the maximum and minimum distance is at most polynomial in n).
Theorem 2.4 (Bartal (1996)).
Any finite metric space on n points in R^d with aspect ratio poly(n) can be embedded into a probabilistic metric over 2-HST metrics with distortion O(d log n) in O(nd log n) time.
3 High-level Description of Our Algorithm
Our algorithm for the (r, b)-fair k-median problem in Euclidean space follows the high-level approach of Chierichetti et al. (2017): it first computes an approximately optimal (r, b)-fairlet decomposition for the input point set (see Algorithm 1). Then, in the second phase, it clusters the (r, b)-fairlets produced in the first phase into k clusters (see Algorithm 2). Our main contribution is designing a scalable algorithm for the first phase of this approach, namely (r, b)-fairlet decomposition.
Preprocessing phase: embedding into a 2-HST.
An important step in our algorithm is to embed the input pointset into a 2-HST (see Section 2 for more details on HST metrics). To this end, we exploit the following standard construction of a 2-HST using randomly shifted grids.
Suppose that all points in P lie in [0, Δ]^d. We generate a random tree T (which is a 2-HST embedding of P) recursively. We translate the d-dimensional hypercube [−Δ, Δ]^d via a uniformly random shift vector chosen from [0, Δ]^d; let H denote the shifted hypercube. It is straightforward to verify that all points in P are enclosed in H. We then split each dimension of H into two equal pieces to create a grid with 2^d cells. Then we proceed recursively with each non-empty cell to create a hierarchy of nested d-dimensional grids with O(log Δ) levels (each cell in the final level of the recursion either contains exactly one point of P or has side length at most 1). Next, we construct a tree corresponding to the described hierarchy of nested d-dimensional grids as follows. Consider a cell C in the i-th level of the hierarchy (level 0 denotes the initial hypercube H). Let T_1, …, T_m denote the trees constructed recursively for the non-empty cells of C at level i + 1, and denote the root of each tree T_j by r_j. Then we connect the node r_C (corresponding to cell C) to each of r_1, …, r_m with an edge of length proportional to the diameter of C (i.e., proportional to √d · Δ / 2^{i−1}).
Note that the final tree T generated by the above construction is a 2-HST: on each path from the root to a leaf, the lengths of consecutive edges decrease exponentially (by a factor of 2), and the distances from any node to all of its children are the same. Moreover, we assume that Δ = poly(n), so the tree has O(log n) levels.
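A compact sketch of the randomly shifted quadtree construction (our own simplified code; it returns the nested cell structure, from which the weighted 2-HST is obtained by attaching each non-empty subcell to its parent with an edge proportional to the cell's diameter):

```python
import random

def quadtree(points, delta, seed=0):
    """points lie in [0, delta]^d. Shift the cube [-delta, delta]^d by a
    uniform vector in [0, delta]^d; recursively halve each dimension until a
    cell holds at most one point or its side length is at most 1."""
    rng = random.Random(seed)
    d = len(points[0])
    shift = [rng.uniform(0, delta) for _ in range(d)]
    corner = [s - delta for s in shift]  # lower corner of the root cell

    def build(idxs, corner, side):
        if len(idxs) <= 1 or side <= 1:
            return ("leaf", idxs)
        half = side / 2.0
        cells = {}
        for i in idxs:  # bucket the points into the 2^d subcells
            key = tuple(points[i][j] >= corner[j] + half for j in range(d))
            cells.setdefault(key, []).append(i)
        subs = [build(sub, [corner[j] + half * key[j] for j in range(d)], half)
                for key, sub in sorted(cells.items())]
        return ("node", side, subs)

    return build(list(range(len(points))), corner, 2 * delta)
```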
Phase 1: computing an (r, b)-fairlet decomposition.
This phase operates on the probabilistic embedding T of the input into a 2-HST produced by the preprocessing phase. The distortion of the embedding is O(d log n). Additionally, we augment each node u of T with integers r(u) and b(u) denoting the number of red and blue points, respectively, in the subtree rooted at u.
Compute an approximately minimum number of points that are required to be removed from the children of u so that (1) the set of points contained in each child becomes (r, b)-balanced, and (2) the union of the sets of removed points is also (r, b)-balanced. More formally, we solve Question 3.6 approximately (recall that for each child u_i, r_i and b_i respectively denote the number of red and blue points in the subtree rooted at u_i).
Definition 3.5 (Heavy Point).
A point p is heavy with respect to a node u if it belongs to a fairlet Y such that lca(Y) = u. Here, for each fairlet Y, lca(Y) denotes the least common ancestor in T of the points contained in Y.
Question 3.6 (Minimum Heavy Points Problem).
Suppose that u is a node in T with non-empty children u_1, …, u_m. For each u_i, let x_i and y_i be respectively the number of red and blue points that are removed from the subtree rooted at u_i. The goal is to minimize Σ_i (x_i + y_i) such that the following conditions hold:
for each i, the set of points remaining in the subtree rooted at u_i (with r_i − x_i red and b_i − y_i blue points) is (r, b)-balanced, and the collection of all removed points (with Σ_i x_i red and Σ_i y_i blue points) is (r, b)-balanced.
After computing the values x_i, y_i, for each u_i, remove an arbitrary set of x_i red and y_i blue points from the subtree rooted at u_i and add them to the set of heavy points. Then, output an arbitrary (r, b)-fairlet decomposition of the set of heavy points, which is guaranteed to be (r, b)-balanced by Step 1.
For each u_i corresponding to a non-empty child of u, run the decomposition procedure recursively on u_i, whose remaining point set is guaranteed to be (r, b)-balanced by Step 1.
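For intuition, here is a tiny brute-force reference (our own, exponential-time, for illustration only) for the subproblem solved in Step 1, i.e., Question 3.6 on the red/blue counts of the children:

```python
from itertools import product

def min_heavy_points(counts, r, b):
    """counts: list of (red_i, blue_i) per non-empty child. Minimize the total
    number of removed points such that each child stays (r, b)-balanced after
    removal and the removed points are themselves (r, b)-balanced."""
    def balanced(red, blue):
        lo, hi = min(red, blue), max(red, blue)
        return hi * r <= lo * b

    best = None
    red_ranges = [range(red + 1) for red, _ in counts]
    blue_ranges = [range(blue + 1) for _, blue in counts]
    for xs in product(*red_ranges):          # xs[i] red points removed from child i
        for ys in product(*blue_ranges):     # ys[i] blue points removed from child i
            if not balanced(sum(xs), sum(ys)):
                continue
            if all(balanced(red - x, blue - y)
                   for (red, blue), x, y in zip(counts, xs, ys)):
                total = sum(xs) + sum(ys)
                if best is None or total < best:
                    best = total
    return best
```

For example, with two children holding (2 red, 1 blue) and (1 red, 2 blue) points and r = b = 1, one red and one blue point must be promoted to heavy points.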
Here is the main guarantee of our approach in the first step (i.e., (r, b)-fairlet decomposition).
There exists an O(nd log n)-time algorithm that, given a point set P and balance parameters r ≤ b, computes an (r, b)-fairlet decomposition of P whose expected cost is within an O(d log n) factor of the cost of an optimal (r, b)-fairlet decomposition of P.
Phase 2: merging (r, b)-fairlets into k clusters.
In this phase, we essentially follow the same approach as Chierichetti et al. (2017).
Theorem 3.8 (Fairlet to Fair Clustering).
Suppose that Y is a γ-approximate (r, b)-fairlet decomposition of P. Then, ClusterFairlet(Y) returns an O(γ · ρ)-approximate (r, b)-fair k-median clustering of P, where ρ denotes the approximation guarantee of the k-median algorithm invoked in ClusterFairlet.
For a pointset P, let OPT_fair(P) and OPT_fairlet(P) respectively denote the cost of an optimal (r, b)-fair k-median and the cost of an optimal (r, b)-fairlet decomposition of P. It is straightforward to see that the two are related for any set of points (split each cluster of an optimal fair clustering into fairlets and let each fairlet pick the point closest to the cluster center as its center), and in particular, OPT_fairlet(P) = O(OPT_fair(P)).
Let C denote the multiset of the centers of the fairlets in Y, where the center of a fairlet appears once for each point in the fairlet. For a set of points S, let median(S) denote the cost of an optimal k-median clustering of S (note that there is no fairness requirement). Since the fairlet centers belong to P, the optimal k-median cost of the set of fairlet centers is at most the optimal k-median cost of P. Since C contains at most r + b copies of each fairlet center, by assigning all copies of each center to the same cluster as that center in an optimal k-median clustering of the fairlet centers, median(C) is at most r + b times the optimal k-median cost of the fairlet centers.
Since the distance of each point p to the center of its cluster in the final clustering is at most the sum of its distance to the center of its fairlet in Y and the distance of that fairlet center to its assigned center in the clustering of C, we can bound the cost of the final clustering in terms of the costs of Y and of the k-median clustering of C.
4 Fairlet Decomposition: a Top-down Approach on k-HST
In this section, we provide a complete description of the first phase of our (r, b)-fair k-median algorithm (described in Section 3), namely our scalable (r, b)-fairlet decomposition algorithm.
The first step in our algorithm is to embed the input point set P into a k-HST (for a value of k to be determined later in this section). Once we build a k-HST embedding T of the input points P, the question is how to partition the points into (r, b)-fairlets. We assume that each node u of T is augmented with extra information r(u) and b(u), respectively denoting the number of red and blue points in the subtree rooted at u.
To compute the total cost of a fairlet decomposition, it is important to specify the clustering cost model (e.g., k-median, k-center). Here, we define cost(Y) to denote the cost of a fairlet (or cluster) Y with respect to the cost function of k-median clustering: for any subset of points Y, cost(Y) = min_{c ∈ Y} Σ_{p ∈ Y} d(p, c), where d(p, c) denotes the distance of p and c in the underlying metric.
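For concreteness, the k-median cost of a single fairlet can be evaluated by brute force over the candidate centers inside the fairlet (our own sketch, using Euclidean distances):

```python
import math

def fairlet_cost(points, fairlet):
    """cost(Y) = min over centers c in Y of the sum of distances from Y to c."""
    return min(sum(math.dist(points[p], points[c]) for p in fairlet)
               for c in fairlet)
```

For three collinear points at x = 0, 2, 4, the middle point is the best center and the cost is 2 + 0 + 2 = 4.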
In this section, we design a fast fairlet decomposition algorithm with respect to this cost function.
There exists a near-linear time algorithm that, given a k-HST embedding T of the point set P and balance parameters r ≤ b, computes an O(1)-approximate (r, b)-fairlet decomposition of P with respect to the distances on T.
Proof of Theorem 3.7. We first embed the points into a 2-HST T and then run the algorithm guaranteed by Theorem 4.9. By Theorem 2.4, the expected distortion of our embedding into T is O(d log n), and by Theorem 4.9, there exists an algorithm that computes an O(1)-approximate fairlet decomposition of P with respect to the distances in T. Hence, the overall algorithm achieves an O(d log n)-approximation.
Since the embedding can be constructed in O(nd log n) time and the fairlet-decomposition algorithm of Theorem 4.9 runs in near-linear time, the overall algorithm also runs in O(nd log n) time.
Before describing the algorithm promised in Theorem 4.9, we define a modified cost function tcost, which is a simplified variant of cost and is particularly useful for computing the cost over trees. Consider a k-HST embedding of P denoted by T, and assume that Y is a fairlet decomposition of P. Moreover, for a node v, let h(v) denote the height of v in T. For each fairlet Y, tcost(Y) is defined as |Y| times the distance in T from lca(Y) (a node at height h(lca(Y))) to the leaves of its subtree.
In particular, tcost relaxes the task of finding "the best center" in each fairlet and, at the same time, its value is within a small factor of cost.
Suppose that T is a k-HST embedding of the set of points P. For any (r, b)-fairlet Y of P, tcost(Y) is within a constant factor of cost(Y), where the distance function is defined w.r.t. T.
There exists a near-linear time algorithm that, given a k-HST embedding T of the point set P and balance parameters r ≤ b, computes an O(1)-approximate (r, b)-fairlet decomposition of P with respect to the distances on T.
For any tree T with root vertex v, the number of heavy points with respect to v in the (r, b)-fairlet decomposition constructed by FairletDecomposition is at most a constant factor times the minimum number of heavy points in any valid (r, b)-fairlet decomposition of the points in T.
Proof of Lemma 4.11. The proof is by induction on the height of v in T. The base case is when v is a leaf node in T, and the algorithm trivially finds an optimal solution in this case. Suppose that the induction hypothesis holds for all vertices of T at height at most h − 1. Here, we show that the statement holds for the vertices of T at height h as well.
Let OPT denote an optimal (r, b)-fairlet decomposition of the points in T with respect to tcost. Next, we decompose OPT into m + 1 parts: OPT_1, …, OPT_m and OPT_heavy. For each i, OPT_i denotes the set of fairlets in OPT whose lca is in the subtree rooted at the i-th child u_i of v. Moreover, OPT_heavy denotes the set of heavy fairlets with respect to v, and H^OPT denotes the set of heavy points with respect to v in OPT. Lastly, H_i^OPT denotes the set of heavy points with respect to v in OPT that are contained in the subtree rooted at u_i.
Let SOL denote the solution returned by FairletDecomposition. Similarly, we decompose SOL into m + 1 parts: SOL_1, …, SOL_m and SOL_heavy. Moreover, H^SOL denotes the set of heavy points with respect to v in SOL and, for each i, H_i^SOL denotes the set of heavy points with respect to v in SOL that are contained in the subtree rooted at u_i.
For each i, there exists an (r, b)-fairlet decomposition of P_i whose cost exceeds that of OPT_i by at most a bounded amount per affected point, where P_i is the set of points contained in the subtree rooted at u_i.
Consider the fairlet decomposition OPT_i on P_i. A fairlet is affected if it contains a point of H_i^SOL. We define the set of affected points A_i as the union of the points in the affected fairlets and the set H_i^SOL (whose points may not belong to any of the fairlets in OPT_i).
Next, we bound the cost of the fairlet decomposition that is constructed by augmenting the set of fairlets in OPT_i with the set of affected points A_i. Let A denote the set of affected points A_i. We augment the fairlet decomposition in three steps:
In this step, we create as many (r, b)-balanced fairlets as possible using the affected points only. Note that the contribution of each point involved in such fairlets is O(ℓ(v)), where ℓ(v) denotes the distance of v from the leaves in T. Let A' denote the set of affected points that do not join any fairlet at the end of this step. Note that all points in A' are of the same color c.
Next, we add as many points of A' as possible to the existing fairlets while preserving the (r, b)-balanced property. The extra cost incurred by each point of A' that joins a fairlet in this step is at most O(ℓ(v)). Let A'' be the set of points that do not belong to any fairlet by the end of this second step. Note that at the end of this step, if A'' is non-empty, then all fairlets are maximally-balanced c-dominant (a fairlet Y is maximally-balanced c-dominant if (1) in Y, the number of points of color c is larger than the number of points of the other color, (2) the set Y is (r, b)-balanced, and (3) adding a point of color c to Y makes it unbalanced).
Finally, we show that by mixing the points of at most |A''| existing fairlets with the set A'', we can find an (r, b)-balanced fairlet decomposition of the involved points in which the contribution of each such point to the total cost is at most O(ℓ(v)). Since the set of all points we are considering is (r, b)-balanced, not all of the fairlets constructed so far are saturated (i.e., of size exactly r + b). In particular, we show that there exists a set F of at most |A''| non-saturated fairlets whose addition to A'' constitutes an (r, b)-balanced set, where for each fairlet Y ∈ F, Y_c and Y_c̄ respectively denote the set of points of color c and of the other color c̄ in Y. Hence, after picking at most |A''| non-saturated fairlets (i.e., the fairlets in F), the set of points in F together with A'' is (r, b)-balanced, and the cost of this step is at most O(|A''| · ℓ(v)).
Altogether, there exists a fairlet decomposition of P_i whose cost is at most the cost of OPT_i plus O(|A_i| · ℓ(v)).
4.1 Description of Step 1: Minimizing the Number of Heavy Points
In this section, we show that the MinHeavyPoints algorithm invoked by FairletDecomposition finds an O(1)-approximate solution of the Minimum Heavy Points problem.
The high-level overview of MinHeavyPoints is as follows. For any subset of points S, we can compute in O(|S|) time the maximum-size (r, b)-balanced subset of S: w.l.o.g. suppose that R and B respectively denote the sets of red and blue points of S with |R| ≥ |B|. If |R| · r ≤ |B| · b, the collection is already (r, b)-balanced. Otherwise, it suffices to greedily pick maximum-size (r, b)-fairlets (see procedure UnbalancedPoints for the formal algorithm). This simple observation implies a lower bound on the size of any optimal solution of the Minimum Heavy Points problem with respect to a node u, and we use this value to bound the approximation guarantee of the MinHeavyPoints algorithm.
UnbalancedPoints correctly computes the minimum number of points that are required to be removed from S so that the remaining points become (r, b)-balanced. Moreover, the solution returned by the procedure only removes points from a single color class.
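On color counts alone, this reduces to simple integer arithmetic. A sketch (our own interface, not the paper's exact pseudocode):

```python
def unbalanced_points(reds, blues, r, b):
    """Minimum removals so that the remaining set is (r, b)-balanced,
    i.e. max * r <= min * b; removals come from the majority color only.
    Returns (removed_reds, removed_blues)."""
    hi, lo = max(reds, blues), min(reds, blues)
    if hi * r <= lo * b:
        return (0, 0)                 # already (r, b)-balanced
    keep = (lo * b) // r              # largest majority count that stays balanced
    return (hi - keep, 0) if reds >= blues else (0, hi - keep)
```

For instance, with 10 red and 9 blue points and r = b = 1, removing a single red point suffices.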
For each child u_i of u, let (x_i, y_i) be the output of UnbalancedPoints on the subtree rooted at u_i.
Any (r, b)-fairlet decomposition of the points in the subtree rooted at u has at least Σ_i (x_i + y_i) heavy points.
Stage 1: minimum number of heavy points.
If the Σ_i x_i red points together with the Σ_i y_i blue points form an (r, b)-balanced collection, then MinHeavyPoints terminates at the end of stage 1, and the solution returned by MinHeavyPoints achieves the minimum possible number of heavy points. However, in general, the collection with Σ_i x_i red points and Σ_i y_i blue points may not form an (r, b)-balanced collection. Next, we show that we can always pick a bounded number of additional heavy points and keep both all subtrees rooted at children of u and the set of heavy points (r, b)-balanced.
Another structure we will refer to in the rest of this section is saturated (r, b)-fairlets. A fairlet is a saturated (r, b)-fairlet if it has exactly r + b points: r points from one color and b points from the other.
Stage 2: Adding free points.
If the “must-have” heavy points are not (r, b)-balanced, then one color is dominant. For a color class c, a collection of points S is c-dominant if the points of color c form the majority of S. Moreover, the collection is minimally-balanced c-dominant if S is (r, b)-balanced but is no longer (r, b)-balanced once we remove a single point of the color other than c.
Let c be the dominant color in the heavy points, and let c̄ denote the other color. Then, we inspect all children of u, and whenever there exists a child u_i from which a point of color c̄ can be removed while keeping its subtree (r, b)-balanced, we borrow as many points of color c̄ as we can (see the ExtraPoint procedure) until either the set of heavy points becomes (r, b)-balanced or all subtrees rooted at children of u become minimally-balanced c-dominant. It is straightforward to show that only a bounded number of points of color c̄ will be borrowed from the children of u in this phase.
Suppose that the set of heavy points is c-dominant. If the set of heavy points is not (r, b)-balanced at the end of stage 2, then for each i, the set of points in the subtree rooted at u_i is minimally-balanced c-dominant.
Suppose that the set of heavy points is c-dominant. If the set of heavy points is not (r, b)-balanced at the end of stage 2, then for each i, the set of points in the subtree rooted at u_i have an