In a typical clustering problem, there is a set of points in a metric space characterized by a distance function , where is some non-increasing function of similarity or proximity. The goal is to choose a set of at most representative centers, and subsequently construct an assignment that maps each point to one of the chosen centers, thus creating a collection of clusters. In addition, the quantity that really matters for each , is its distance to its corresponding cluster center . This distance represents the quality of service receives. In classical clustering applications would correspond to how similar is to , and in facility-location applications to the distance needs to travel in order to reach its service-provider. Hence, from an individual perspective, each requires to be as small as possible. The most popular objectives in the literature (-center, -median, -means) “boil down” this large collection of values , into an increasing function they try to minimize.
In scenarios where the points correspond to selfish agents, it is natural to assume that they will be mindful of the quality of service other points receive. Specifically, a point may feel that it is being handled unfairly by a solution , if is not close enough to the quality of service a group of other points enjoys. In this context, the points of are exactly those which perceives as similar to itself, hence it arguably believes that it should obtain similar treatment as them. Furthermore, there may exist situations where the sets are not explicitly provided, e.g., because they are private or because individuals have a fuzzy understanding of similarity. However, in that case a central planner could still construct the sets on its own, using some desired interpretation of similarity, and then provide fairness guarantees with respect to these .
As a practical example, consider the following application in an e-commerce site, where the points of correspond to its users. In order to provide relevant recommendations, the website needs to choose a set of representative users, and then assign each point to one of those based on a mapping . The recommendations gets will be based on ’s profile, and in this case the quantity corresponds to how representative is for , and hence how suitable ’s recommendations are. A point may feel unfairly treated, if points that are similar to it (points with small ) get better recommendations and consequently better service (see, e.g., the work of Datta et al.  for studies on similar users receiving different types of job recommendations).
Here we formalize this abstract notion of fairness via two rigorous and related constraints, which we incorporate into the classical -center problem. We focus on -center due to its numerous practical applications, but mostly because of its theoretical simplicity, which allows us to explore in depth the intricacies and the combinatorial structure of this novel notion of individually-fair clustering.
1.1 Formal problem definitions
We are given a set of points in a metric space characterized by the distance function . Moreover, the input includes a positive integer and a value . Finally, for every point we have a similarity set , denoting the group of points perceives as comparable to itself. We also define , and thus have .
The goal in our problems of interest is to choose a set of at most centers, and then find an assignment , such that the -center objective , i.e., , is minimized. Further, we use two different constraints to capture the notion of fairness we aim to study.
Per-Point Fairness (): When we study the problem under this constraint, we want to make sure that for all with , we have:
Here is satisfied if its quality of service is at most times the “best” quality found in . Equivalently, in this case we should guarantee that for all we have:
Aggregate Fairness (): Here for each with , we want to guarantee that:
Hence, feels fairly treated if is at most times the average quality of .
We call our problem -Equitable -Center, and denote it by EqCenter. Moreover, when we study it under constraint (1) we refer to it as EqCenter-PP, and similarly when we use constraint (2) we denote it by EqCenter-AG. Furthermore, both variants are NP-hard, since when for all , the fairness constraints become redundant, and the problems reduce to -center, which is already known to be NP-hard [Hochbaum and Shmoys, 1985].
Constraint (1) provides a stronger notion of fairness, in that each point cares explicitly about every point . Constraint (2) is weaker, in the sense that the points now settle for comparing their quality of service to the average quality obtained by their similarity set. Consequently, a solution for (1) also constitutes a solution for (2), since the minimum of a set of values never exceeds their average, and hence for the same instance the optimal value of EqCenter-AG must be no larger than that of EqCenter-PP. This observation reveals an intriguing trade-off between how strict we want to be in our fairness constraints, and how much we care about the overall objective cost. We further explore this issue in Section 5.
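Both constraints can be checked mechanically for any candidate assignment. The following is a minimal sketch under stated assumptions, not the authors' implementation: the names `satisfies_pp`, `satisfies_ag`, `similar`, and `assign` are hypothetical, `alpha` stands for the fairness parameter, and points with an empty similarity set are assumed to impose no constraint.

```python
def satisfies_pp(points, similar, assign, dist, alpha):
    """Per-point check: the service of j is at most alpha times the best
    service received by any point in similar[j]."""
    for j in points:
        group = similar.get(j, [])
        if not group:  # assumption: an empty similarity set imposes no constraint
            continue
        best = min(dist(p, assign[p]) for p in group)
        if dist(j, assign[j]) > alpha * best:
            return False
    return True


def satisfies_ag(points, similar, assign, dist, alpha):
    """Aggregate check: the service of j is at most alpha times the average
    service received by the points in similar[j]."""
    for j in points:
        group = similar.get(j, [])
        if not group:
            continue
        average = sum(dist(p, assign[p]) for p in group) / len(group)
        if dist(j, assign[j]) > alpha * average:
            return False
    return True


def line_dist(a, b):
    return abs(a - b)


# Hypothetical toy instance: four points on a line, centers at 0 and 6,
# every point assigned to its closest center.
points = [0, 1, 5, 6]
similar = {1: [0, 5]}
closest = {0: 0, 1: 0, 5: 6, 6: 6}
pp_ok = satisfies_pp(points, similar, closest, line_dist, alpha=2)
ag_ok = satisfies_ag(points, similar, closest, line_dist, alpha=2)
```

In this toy instance the closest-center assignment violates the per-point check (point 0 is itself a center and so receives perfect service) while passing the aggregate check, which illustrates why the second constraint is the weaker one.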
1.2 Our contributions and discussion of our results
In Section 2 we investigate the combinatorial structure of our newly introduced fairness constraints. At first, a question that naturally arises is for what values of our problems are well-defined. We call a problem well-defined if it always admits a feasible solution , i.e., and satisfies the corresponding fairness constraint for all . Ideally, we would like our problems to admit feasible solutions for any possible value of . However, we provide the following result, which indicates that absolute equity is not achievable.
For both EqCenter-PP and EqCenter-AG, there exist instances with that do not admit any feasible solution.
We then proceed by showing that for there is always a feasible solution, thus settling the question about the regime of for which our problems are well-defined.
For both EqCenter-PP and EqCenter-AG, every instance with always admits a feasible solution.
Given that is the range we should focus on, we proceed by studying another vital concept, namely the Price of Fairness (PoF) [Bertsimas et al., 2011, Caragiannis et al., 2009]. This notion is a measure of the relative loss in system efficiency when fairness constraints are introduced. Specifically, for a given instance of either EqCenter-PP or EqCenter-AG, PoF is defined as the value of the optimal solution to our fair problem, over the value of the optimal solution to the underlying -center instance, where we drop the fairness constraints from the problem’s requirements. In other words, PoF = (optimal fair value)/(optimal unfair value). We show the following.
There exist instances of EqCenter-PP and EqCenter-AG where PoF is unbounded.
All results of Section 2 are proven for . Observe that the case is trivial, since one can efficiently try each point as a center, see if any yields a feasible solution, and also find the optimal one among the computed feasible solutions. On the other hand, even when and we only have center sets to check, the number of possible assignments for each set of size is .
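The single-center case described above can be sketched as follows. This is an illustrative sketch only: `best_single_center`, `similar`, and the `aggregate` flag are hypothetical names, `alpha` stands for the fairness parameter, and points with an empty similarity set are assumed to be unconstrained.

```python
def best_single_center(points, similar, dist, alpha, aggregate=False):
    """Try every point as the only center; return (center, value) for the
    cheapest feasible choice, or None if no single center is feasible."""
    best = None
    for c in points:
        # With one center the assignment is forced: every point maps to c.
        service = {j: dist(j, c) for j in points}
        feasible = True
        for j in points:
            group = similar.get(j, [])
            if not group:  # assumption: empty similarity sets are unconstrained
                continue
            values = [service[p] for p in group]
            target = sum(values) / len(values) if aggregate else min(values)
            if service[j] > alpha * target:
                feasible = False
                break
        value = max(service.values())
        if feasible and (best is None or value < best[1]):
            best = (c, value)
    return best


# Hypothetical toy instance: three points on a line.
points = [0, 2, 4]
similar = {0: [2], 2: [0, 4], 4: [2]}
dist = lambda a, b: abs(a - b)
pp_result = best_single_center(points, similar, dist, alpha=2)
ag_result = best_single_center(points, similar, dist, alpha=2, aggregate=True)
```

On this toy instance no single center satisfies the per-point check, because whichever point is chosen receives perfect service and one of its neighbors does not, whereas the aggregate check admits a feasible single center.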
In Section 3 we provide an approximation algorithm that covers instances with for both EqCenter-PP and EqCenter-AG. The main body of the algorithm remains the same for the two problems, with minor differences to capture each unique case. Our process of choosing centers constitutes an extension of a result by Khuller and Sussmann. Our procedure gives useful guarantees regarding the distances between chosen centers, a feature that is crucially exploited in the assignment phase of the algorithm, where we carefully construct the mapping . Our result is:
Suppose we are given an instance with for either EqCenter-PP or EqCenter-AG, whose optimal value is . Let also . Our algorithm provides a feasible solution to either problem, for which .
We strongly believe that in realistic applications involving selfish agents , and hence our constructed solution will be a -approximate one. In such cases, an agent’s interest is targeted towards its immediate neighbors in the metric space (the ones most similar to it), rendering the values small. Points that are much further away, hence being substantially dissimilar, tend to be of no concern to the individual. On the other hand, the optimal value is a global measure (the radius of the largest ball around an optimal center) that clearly exceeds sufficiently small local neighborhoods.
Moreover, an assumption found in already existing work on related individually-fair clustering models immediately guarantees that we always produce a -approximate solution. Specifically, Brubach et al. explicitly assume that two points should be considered similar if their distance is at most the value of the optimal unfair solution. (In that paper the similarity between two points is given as a function of their distance , where is an input parameter and is the value of the optimal unfair -center solution. The smaller is, the more similar the two points, with indicating absolute identity and absolute dissimilarity. Given that, the experimental part of this work argues that fair solutions require , and therefore for to be similar (equivalently ) we must have .) If we also make this assumption we get , because each will be equal to the optimal unfair value, and the latter is a lower bound on .
Nonetheless, if a central planner wants to enforce the -approximate outcome without making any assumptions, there is an easy way to do so. The planner can first compute a lower bound for , which will actually be independent of the sets . There are various ways for achieving this, and we describe one in full detail in Section 5. Afterwards, the planner publishes , and the points construct their similarity sets based on it, i.e., they are only allowed to “envy” points within distance at most . Besides guaranteeing a -approximate solution, this strategy also enjoys explainability merits. By publishing the value , the planner informs the agents that even under optimal conditions this is the best service they may receive. Hence, the points focus their attention on close neighbors that may end up becoming their assigned centers, and thus might get better service than them.
Continuing with the description of our results, in Section 3 we also study the PoF behavior of our main algorithms. Specifically, when the similarity sets satisfy certain properties, we show that our algorithms enjoy a bounded price of fairness. If is the optimal value of the underlying -center instance where no fairness constraints are imposed, we prove the following.
When for all we have for some , our algorithm for EqCenter-AG provides a feasible solution with cost at most .
At this point, note that the previously mentioned similarity assumption of Brubach et al.  would guarantee that the conditions of the above two theorems are met, and consequently that the PoF of our returned solutions is bounded. However, if we do not want to use this assumption, it is again easy to ensure that the sets satisfy the desired properties, and thus force a bounded PoF in the end. A way of achieving this is presented in Section 5. Furthermore, we mention that all algorithms of Section 3 are purely combinatorial (e.g., do not require convex programming), and hence very efficient and easily implementable.
In Section 4 we study the assignment problem for EqCenter-PP and EqCenter-AG. To be more precise, if we are given the optimal set of centers , can we find the corresponding optimal assignment ? In a vanilla clustering setting this is trivial, since assigning points to their closest center is easily seen to yield the necessary results. However, as is the case in almost all literature on fair clustering, in the presence of fairness constraints like (1) or (2), such an assignment is not necessarily correct. As a side note, this implies that for a , we might have . Nonetheless, this does not constitute a modeling issue. Recalling the motivational example of a recommendation system for a website, we see that for a client chosen as a representative, assigning to a different representative is an acceptable outcome, as long as all individuals feel fairly treated.
Therefore, since from a theoretical perspective the assignment problem is fundamental in a clustering setting and because in our case it appears highly non-trivial, we choose to address it in order to achieve a deeper understanding of the nature of our novel fairness constraints. In the end, we manage to show that with a slightly intricate iterative algorithm, we can indeed compute the optimal assignment .
Finally, Section 5 contains an extensive experimental evaluation that validates the effectiveness and the efficiency of our proposed algorithms.
1.3 Related work
Due to its tremendous societal significance, fair clustering has been extensively studied during the last few years. The concept of fairness we consider here falls under the broader umbrella of individual fairness. Although all previous work on individual fairness may have a high-level similarity to our model, at its core it is substantially different. Inspired by the seminal work of Dwork et al., the papers of Anderson et al. and Brubach et al. [2020, 2021] interpret individual fairness as follows. For two points , the probability that they are placed in a different cluster should be a decreasing function of their similarity. Unlike our model, this notion provides no guarantee on the gap between and . Alternatively, Mahabadi and Vakilian and Jung et al. define individual fairness as guaranteeing that for each point , there will be a chosen center within distance from it, where is the minimum radius such that . Finally, Kleindessner et al. view individual fairness as ensuring that each point is on average closer to the points in its own cluster than to the points in any other cluster.
The most popular and well-studied notion of fairness in clustering is the demographic one. Herein, the points are partitioned into demographic groups, and what is required is a fair treatment or an equal representation of these demographics in the solution. This research area was initiated by the groundbreaking work of Chierichetti et al. Further work on demographic fairness includes Bercea et al., Bera et al., Esmaeili et al., Huang et al., Backurs et al., Ahmadian et al., Kleindessner et al., Chen et al., and Abbasi et al.
Another work that is closely related to our model is that of Balcan et al. In that paper the authors study a classification problem where there is a set of already-known labels, and the points need to be assigned to those via some stochastic classifier. The points have preferences over the labels, given by some utility function, and the final classification should be envy-free in the standard sense. Our model differs from that of Balcan et al. for two reasons. First, our focus is on a clustering problem, where the labels are not known, there is an underlying metric space, a metric-related objective needs to be minimized, and also the assignment has to be deterministic. Secondly, although the concept of envy-freeness is related to constraint (1), there is the crucial difference of points in our case not envying the resources allocated to other individuals, but rather their final utility. In other words, in the language of Fair Division of Goods, our model is closer to the notion of an equitable allocation [Varian, 1974] rather than an envy-free one.
2 Structural properties of the problem
The purpose of this section is to answer questions regarding the combinatorial nature of our proposed fairness constraints. As mentioned in the introduction, all our results here are for , since is a trivial case. At first, we want to investigate the range of for which our problems always admit a feasible solution. Ideally, an value close to would be the most fair, but as the following theorem suggests, such a guarantee is impossible.
For both EqCenter-PP and EqCenter-AG, there exist instances with that do not admit any feasible solution.
Let be a very large even integer, with also being an even integer. We consider points in a cycle, where for all , and also . The rest of the distances are set to be the shortest path ones, based on those already defined. This is a valid metric space, since it constitutes the shortest path metric resulting from a simple cycle graph of vertices.
To construct the similarity sets, we map each point to another point , such that the function is one-to-one and . Given that, the similarity set of point will be set to be . Now let and . For every odd, set and . In this way, because is even, we map every point of to some other point of . Also, note that for every we will have . For the points , consider them in increasing order of . If is not already mapped to some other point, set and . This is a valid assignment because is assumed to be an even integer. At the end of the above process, we have created a one-to-one mapping between the points of , such that for every we have . This concludes the description of the similarity sets. Finally, this pairing process for and is possible, because both sets include an even number of points. See Figure 1 for an example.
To conclude the description of the input we also assume that . In addition, note that because for all we have , constraints (1) and (2) are equivalent and hence showing infeasibility for this instance covers both EqCenter-PP and EqCenter-AG. Finally, to prove the statement of the theorem, it suffices to show that for all possible choices of centers and all possible corresponding assignments , there will always be a point for which .
At first, notice that there exists no feasible solution that uses just one center. Supposing otherwise, let be the only chosen center. Then there exists only one possible assignment for , and that is . Hence , and the fairness constraint for will never be satisfied.
Now we will show that even solutions that pick two centers cannot admit any feasible assignment. We proceed via a case analysis on .
: Because the points of and alternate in the metric cycle, we know that there exists a such that (in the example of Figure 1 we might have and , ). By the triangle inequality we also get . As for the point , we have:
From ’s perspective, the best case situation regarding its fairness constraint is if gets assigned to its closest center, and gets assigned to its farthest one. Given all the previous inequalities, we see that the best possible service for is , and the worst possible service for is . We next show that even in this ideal situation for , its fairness constraint with will never be satisfied if is significantly large. To see this, note that is an increasing function of and also:
Therefore, for every given , there exists an such that .
: In this case, because is assumed to be significantly large and because the points of alternate in the metric cycle, we can find a point in the shortest path between and , which will be approximately in the middle of the path. Letting such that , we have (in the example of Figure 1 we might have and , ). Regarding the possible assignments for we have:
Again we will focus on the best case situation for , which according to the previous analysis is getting assigned to a center at distance from it, and getting assigned to a center at distance . Therefore, we consider the ratio , and we are going to prove that even in this ideal case for , its fairness constraint for will not be satisfiable if is sufficiently large. At first, because the previous ratio will be an increasing function of . In addition,
The last inequality follows since is a decreasing function, and for we have . Hence, for every there exists an such that .
: Because is assumed to be significantly large and because the points of alternate in the metric cycle, we can find a point in the shortest path between and , which will be approximately in the middle of the path. Letting such that , we have (in Figure 1 we might have and , ). Consider now , and without loss of generality assume that (when the situation is symmetric, with the roles of , switched).
At first, suppose that is a point in the shortest path between and (in the example of Figure 1 and would result in that). Thus, because and , we can focus on the line segment , where the triangle inequality holds with equality. Here we get,
The second case we consider is when is not on the shortest path between and (in Figure 1 take for instance and hence and ). In that scenario, because , we turn our attention to the line segment , where the triangle inequality holds with equality. Here we have
Therefore, in every case we have the following:
Now that we have the bounds (3) for the assignment distance of to both centers, we proceed with the final case analysis.
Suppose that gets assigned to . Then from ’s perspective, the best possible situation is if its own assignment distance is exactly , and gets an assignment distance of . In this case, the ratio is an increasing function of , because . In addition we have:
The last inequality is because is a decreasing function and . Hence, for every , there exists an such that . Thus, even in the ideal situation for , if is larger than its fairness constraint for will be unsatisfiable.
On the other hand, suppose that gets assigned to . Then from ’s perspective, the best possible situation is if it gets an assignment distance of , and has assignment distance exactly . In this case, the ratio is an increasing function of , because . Also:
The last inequality is because is an increasing function and . Hence, for every , there exists an such that . Thus, even in the ideal situation for , if is larger than , ’s fairness constraint for will be unsatisfiable.
The analysis is exhaustive, because the maximum distance between two points in the metric is . Further, we see that if we set , then in every possible scenario there will exist a point whose fairness constraint for will not be satisfiable. ∎
Moving on, we show that for there is always a feasible solution to both our problems, and hence we settle the important question of what is the smallest value of for which EqCenter-PP and EqCenter-AG are well-defined.
Consider a set of points in a metric space with distance function , where . Then there exists an efficient way of finding two distinct points and an assignment , such that for every we have .
At first, choose to be the two points of that are the furthest apart, i.e., . Then, for every set . In other words, given the chosen centers, each point is assigned to the center that is furthest from it in the metric. Let also be the center to which is not assigned. For any , combining the triangle inequality and the fact that , will give us:
Finally, by the way we chose and we also get . ∎
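The construction in this proof is short enough to sketch directly. The code below is an illustrative sketch with hypothetical names (`farthest_pair_assignment`, `euclid`) and a made-up planar point set. The property it demonstrates follows from the triangle inequality: every assignment distance lies between half the distance separating the two centers and that distance itself, so any two assignment distances differ by a factor of at most 2.

```python
import math
from itertools import combinations


def farthest_pair_assignment(points, dist):
    """Pick the two points that are furthest apart as centers, then assign
    every point to whichever of the two centers is further from it."""
    c1, c2 = max(combinations(points, 2), key=lambda pair: dist(*pair))
    assign = {j: (c1 if dist(j, c1) >= dist(j, c2) else c2) for j in points}
    return (c1, c2), assign


def euclid(a, b):
    return math.hypot(a[0] - b[0], a[1] - b[1])


# Hypothetical example: five points in the plane.
points = [(0, 0), (4, 0), (1, 3), (2, -1), (5, 5)]
(c1, c2), assign = farthest_pair_assignment(points, euclid)
services = [euclid(j, assign[j]) for j in points]
```

Note that each chosen center is assigned to the other center rather than to itself, which is exactly what keeps every point's service comparable to that of every other point.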
For both EqCenter-PP and EqCenter-AG, every instance with always admits a feasible solution.
Suppose that as an instance to either problem we are given a set of points together with their associated similarity sets , and . Since , we can use Lemma 8 and get a set of two centers and an assignment function , such that for all we have . In the case of constraint (1), for every and any we have . Furthermore, since any feasible solution for constraint (1) is also a feasible solution for constraint (2), the proof is concluded. ∎
Another structural notion that interests us is that of the Price of Fairness (PoF). For a given instance of either of our problems, PoF is the ratio of the value of the optimal solution to the problem, over the optimal unfair value. The latter is defined as the optimal value of the given instance, when we drop the fairness constraint and simply solve -center. As is the case in most fair clustering literature, we show that in general PoF can be arbitrarily large.
There exist instances of EqCenter-PP and EqCenter-AG where PoF is unbounded.
We will use the example of Figure 2 for both problems. Consider three points on the line, where and . Moreover, let , and assume that as well as .
To begin with, observe that in the absence of the fairness constraints, the optimal solution for -center occurs when and (or and ) are chosen as centers, and its corresponding value is .
Moving forward, we are going to show that the optimal solution for the fair variants has value at least (note that the existence of such a solution is guaranteed by Theorem 9). This implies that PoF is at least , and since this can be arbitrarily large. We proceed with a case analysis.
Initially, consider a solution that uses only one center. If that center is either or , then regardless of feasibility issues the value of the corresponding solution will be . On the other hand, if is chosen as a center, then we will necessarily have as the only possible assignment. Hence, because and , no matter what fairness constraint we have for , it is obvious that it cannot be satisfied. Thus, can never be a center on its own.
Now we consider solutions that use exactly two centers.
Let be the set of chosen centers. If we want to satisfy the fairness constraints for , we should set , because otherwise the assignment distance of will be . Having immediately implies that if there is a feasible solution for this set of centers, its value should be at least .
Let be the set of chosen centers. To begin with, see that an assignment that leads to a value of is not possible. The only mapping that leads to such a solution is . However, this violates both constraints (1) and (2) for , since all points in will have an assignment distance of . Thus, because of the discrete values of the metric space, any feasible solution using as its centers should have value at least . ∎
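For very small instances, the price of fairness can be computed by exhaustive search, which gives a way to sanity-check arguments of this kind. The sketch below does not reproduce the instance of Figure 2. It uses a separate, hypothetical instance with three points on a line at positions 0, 1, and `gap`, where the first two points have each other as their only similar point. All function names are illustrative, and only the per-point constraint is checked (with singleton similarity sets the two constraints coincide).

```python
from itertools import combinations, product


def is_fair_pp(points, similar, service, alpha):
    """Per-point check; points without a similarity set are unconstrained."""
    return all(
        service[j] <= alpha * min(service[p] for p in similar[j])
        for j in points
        if similar.get(j)
    )


def optimal_values(points, similar, dist, k, alpha):
    """Return (optimal fair value, optimal unfair value) by brute force."""
    fair, unfair = None, None
    for size in range(1, k + 1):
        for centers in combinations(points, size):
            # Without fairness, assigning to the closest center is optimal.
            closest = max(min(dist(j, c) for c in centers) for j in points)
            unfair = closest if unfair is None else min(unfair, closest)
            # With fairness, every possible assignment has to be examined.
            for choice in product(centers, repeat=len(points)):
                service = {j: dist(j, c) for j, c in zip(points, choice)}
                if is_fair_pp(points, similar, service, alpha):
                    value = max(service.values())
                    fair = value if fair is None else min(fair, value)
    return fair, unfair


def price_of_fairness(gap, alpha=2, k=2):
    points = [0, 1, gap]
    similar = {0: [1], 1: [0]}  # hypothetical similarity sets
    fair, unfair = optimal_values(points, similar, lambda a, b: abs(a - b), k, alpha)
    return fair / unfair  # assumes the unfair optimum is nonzero
```

In this hypothetical instance the unfair optimum stays at 1 while the fair optimum grows with `gap`, so the ratio can be made as large as desired. The search is exponential in the number of points and is meant only for toy inputs.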
3 Approximation algorithms for EqCenter-PP and EqCenter-AG
Suppose that we are given an instance of EqCenter with , and we are either solving EqCenter-PP or EqCenter-AG. In addition, let and let denote the value of the optimal solution of the corresponding problem.
In this section we provide a procedure that works under an explicitly given value , with . This process will either return a feasible solution with , or an infeasibility message. The latter message indicates with absolute certainty that .
The aforementioned procedure suffices to guarantee the result of Theorem 4. Because is always the distance between two points in , the total number of possible values for it is only polynomial, specifically at most . Hence, we can run the procedure for all such distances that are at least , and in the end keep for the minimum guess for which we did not receive an infeasibility message. If , then our returned solution is guaranteed to have value at most , because is one of the target values we tested. On the other hand, when , the iteration with as the guess cannot return an infeasibility message, and thus it will provide a solution of value at most . As a side note, we mention that we can speed up the runtime of this approach by using binary search over the guesses , instead of a naive brute-force method.
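The outer loop over guesses described above can be sketched as follows. This is a sketch under stated assumptions: `procedure` is a hypothetical stand-in for the procedure of this section and is assumed to return `None` in place of an infeasibility message, while `lower_bound` stands for the threshold below which guesses are skipped. The linear scan is shown, not the binary-search variant.

```python
from itertools import combinations


def search_over_guesses(points, dist, lower_bound, procedure):
    """Run `procedure` on every pairwise distance that is at least
    `lower_bound`, in increasing order, and return the first guess that is
    not rejected together with the solution found for it."""
    guesses = sorted({dist(a, b) for a, b in combinations(points, 2)})
    for guess in guesses:
        if guess < lower_bound:
            continue
        solution = procedure(guess)
        if solution is not None:  # None plays the role of the infeasibility message
            return guess, solution
    return None


# Hypothetical usage with a toy stand-in that rejects every guess below 3.
points = [0, 1, 3, 7]
toy_procedure = lambda guess: "ok" if guess >= 3 else None
result = search_over_guesses(points, lambda a, b: abs(a - b), 2, toy_procedure)
```

Scanning the guesses in increasing order means the first accepted guess is also the minimum accepted one, which matches the selection rule in the text.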
Therefore, apart from the input instance, assume that we are also given a target value with . Our framework begins by choosing an appropriate set of centers . The full details of this step are presented in Algorithm 1. Besides choosing this set , Algorithm 1 also creates a partition of for some , and returns sets for every . The algorithm works by trying to expand the current set as much as possible, via finding points that are within distance from some center of it. If no such point exists, then we never deal with again, and we move on to create by choosing an arbitrary available point for it.
For every , let be the index of the partition set belongs to, i.e., . We also define and . We interpret the centers of as being isolated, since for each the corresponding partition set contains only , i.e., . On the other hand, the centers of are non-isolated, in the sense of having for each . In addition, for every point , let be the center of such that . Finally, we define and .
For every distinct we have .
For every , there exists a different such that .
The sets for all , induce a partition of .
For any , we have for all and all .
Focus on such a , and for the sake of contradiction assume that there exists a and a for which . Let be the center of with .
At first, suppose that during the execution of Algorithm 1 entered before . Having means that when , the algorithm tried to find a point in within distance from but failed. However, at that time was still in , because and entered after . In addition , and thus we reached a contradiction.
Now assume that entered before . This implies that , because . When the algorithm stopped expanding , there was not any point of within distance from a center of . However, at that moment was still in , because and . In addition , and so we once again reach a contradiction. ∎
By using Lemma 14 and the fact that , we immediately get the following.
For every , we have for all .
For every , we have .
After computing the set of centers , our approach proceeds by constructing the appropriate assignment function. This will occur in two steps. The first step takes care of the points in , by choosing a new set of centers , and by constructing an assignment . The second step handles the points of via a mapping . This is well-defined, since . Note that due to Corollary 15, the fairness constraint of a point is only affected by , since . Similarly, due to Corollary 16, the fairness constraint of a is only affected by , since . Therefore, we can study the feasibility of our solution separately on for , and on for .
Algorithm 2 demonstrates the details of the first assignment step. The algorithm operates by trying to “guess” if the optimal solution uses exactly one center inside each for . If it does, so will our algorithm. If not, then our approach will open exactly two centers, and will subsequently construct an assignment that will satisfy the appropriate fairness constraint.
After the execution of Algorithm 2, for every we have that the constructed assignment will 1) satisfy ’s fairness constraint, and 2) guarantee .
At first, due to Observation 13, Algorithm 2 sets the value for each exactly once. In addition, we know that for every , all points of will have their assignment set in the same iteration of Algorithm 2, since and by Corollary 15 we have .
For a point , when is considered by Algorithm 2 there are two possible scenarios. In the first we have