Clustering with fairness constraints: A flexible and scalable approach
This study investigates a general variational formulation of fair clustering, which can integrate fairness constraints with a large class of clustering objectives. Unlike existing methods, our formulation can impose any desired (target) demographic proportions within each cluster. Furthermore, it makes it possible to control the trade-off between fairness and the clustering objective. We derive an auxiliary function (tight upper bound) of our KL-based fairness penalty via its concave-convex decomposition and Lipschitz-gradient property. Our upper bound can be optimized jointly with various clustering objectives, including both prototype-based objectives such as K-means and graph-based objectives such as Normalized Cut. Interestingly, at each iteration, our general fair-clustering algorithm performs an independent update for each assignment variable, while guaranteeing convergence. Therefore, it can be easily distributed for large-scale data sets. Such scalability is important as it makes it possible to explore different trade-off levels between fairness and clustering objectives. Unlike existing fairness-constrained spectral clustering, our formulation does not require storing an affinity matrix and computing its eigenvalue decomposition. Moreover, unlike existing prototype-based methods, our experiments reveal that fairness does not come at a significant cost of the clustering objective. In fact, several of our tests showed that our fairness penalty helped to avoid weak local minima of the clustering objective (i.e., with fairness, we obtained better clustering objectives). We demonstrate the flexibility and scalability of our algorithm with comprehensive evaluations over both synthetic and real-world data sets, many of which are much larger than those used in recent fair-clustering methods.
Machine learning models are impacting most of the automated tasks in our daily life, for instance, in marketing, lending for home loans, education and even sentencing recommendations in law courts kleinberg2017human . However, machine learning models might exhibit biases towards specific demographic groups due to, for instance, the biases that exist within the training data. For example, higher accuracy is found on white, male faces for facial recognition buolamwini2018gender , and recidivism prediction tends to assign an incorrectly high risk score to low-risk African-Americans Julia:2016 . These kinds of issues have recently triggered substantial research interest in designing fair algorithms in the supervised setting Hardt2016 ; Zafar2017 ; Donini2018 . Also, very recently, the community started to investigate fairness constraints in unsupervised learning Chierichetti2017 ; Kleindessner2019 ; Samadi2018 ; Celis2018 . Specifically, Chierichetti et al. Chierichetti2017 pioneered the concept of fair clustering. The problem consists of embedding fairness constraints, which encourage clusters to have balanced demographic groups pertaining to some sensitive attributes (e.g., sex, gender or race), so as to counteract any form of data-inherent bias.
Assume that we are given $N$ data points to be assigned to a set of $K$ clusters, and let $S_k \in \{0,1\}^N$ denote a binary indicator vector whose components take value $1$ when the point is within cluster $k$ and $0$ otherwise. Also suppose that the data contains $J$ different demographic groups, with $V_j \in \{0,1\}^N$ denoting a binary indicator vector of demographic group $j$. The authors of Chierichetti2017 ; Kleindessner2019 suggested evaluating fairness in terms of cluster-balance measures, which take the following form:
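As a sketch of such a balance measure, assume $S_k$ and $V_j$ are the binary indicator vectors of cluster $k$ and demographic group $j$ (the symbol names are assumptions of this sketch) and $t$ denotes the transpose, so that $V_j^t S_k$ counts the points of group $j$ inside cluster $k$. The balance of a cluster can then be written as the smallest ratio between two group counts:

```latex
\mathrm{balance}(S_k) \;=\; \min_{j \neq j'} \frac{V_j^t S_k}{V_{j'}^t S_k} \;\in\; [0, 1] \qquad (1)
```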
The higher this balance measure for each cluster, the fairer the clustering. The overall balance of the clustering is defined by taking the minimum over the balances of all the clusters. This notion of fairness in clusters has triggered a new line of work, introduced recently for iterative prototype-based clustering (e.g., K-center, K-median and K-means) and spectral graph clustering (Ratio Cut) Chierichetti2017 ; Kleindessner2019 ; Schmidt2018 , and raises several interesting questions. How can fairness be embedded in popular clustering objectives? What is the cost of fairness with respect to the clustering objective and computational complexity? Can we control the trade-off between some "acceptable" fairness level and the quality of the clustering objective? Can we impose any desired (target) proportion of the demographics within each cluster, beyond the perfect balance sought by the cluster measures in Eq. (1)?
Chierichetti et al. Chierichetti2017 investigated approximation algorithms, which ensure that the fairness measures in Eq. (1) are within an acceptable range, for K-center and K-median clustering, and for two demographic groups. Schmidt et al. Schmidt2018 extended Chierichetti2017 to the K-means objective. The authors of Chierichetti2017 compute fairlets or micro-clusters, which are groups of points that are fair and cannot be split further into subsets that are also fair. Then, they consider each fairlet as a data point and cluster the fairlets with approximate K-center or K-median algorithms. However, the cost of finding fairlets with perfect matching is quadratic with respect to the number of data points, a complexity that might be worse for more than two demographic groups. The experiments reported in Chierichetti2017 showed that obtaining fair solutions with these approximation algorithms comes at the price of a substantial increase in the clustering objectives. Another important limitation of the algorithms in Chierichetti2017 ; Schmidt2018 is that they are tailored for specific prototype-based objectives. For instance, they are not applicable to the very popular spectral clustering objectives for graph data, e.g., Ratio Cut or Normalized Cut VonLuxburg2007 . Kleindessner et al. Kleindessner2019 embedded fairness constraints (1) into a popular spectral graph clustering objective, specifically, Ratio Cut. They showed that this can be done by embedding linear constraints on the relaxed assignment matrix in the standard spectral relaxation. Therefore, the problem can be solved as a constrained trace optimization by finding the eigenvectors associated with the K smallest eigenvalues of some transformed Laplacian matrix, similarly to well-known versions of constrained spectral clustering yu2004segmentation . Unlike Chierichetti2017 , the authors of Kleindessner2019 showed that fairness-constrained spectral relaxation does not necessarily induce a significant cost in the clustering objective.
However, the constrained spectral relaxation in Kleindessner2019 is tailored specifically to Ratio Cut, and it is unclear how to use it in the context of iterative prototype-based clustering such as K-center, K-median or K-means. Furthermore, it is well known that spectral relaxation has heavy memory and computational loads, as it requires storing an $N \times N$ affinity matrix, with $N$ the number of data points, and computing its eigenvalue decomposition; the complexity is cubic with respect to $N$ for a straightforward implementation and super-quadratic for fast implementations Tian-AAAI . In fact, the experimental tests of Chierichetti2017 ; Kleindessner2019 are on small data sets. In the general context of spectral relaxations and graph partitioning, computational scalability for large-scale problems is an active line of recent work shaham2018spectralnet ; ziko2018scalable ; Vladymyrov-2016 .
Apart from computational scalability and the negative effect of fairness constraints on the clustering objectives, the existing fair-clustering methods have two important limitations. First, the cluster measures in Eq. (1) encourage perfect balance (i.e., a uniform distribution) of the demographic groups. There is no mechanism that imposes any desired (target) proportion of the demographics within each cluster, beyond the uniform distribution. Second, it is not possible to control the trade-off between some tolerable fairness level and the quality of the clustering objective.
We propose a general variational formulation of fair clustering, which can integrate fairness constraints with a large class of clustering objectives, including both prototype-based (e.g., K-means) and graph-based (e.g., Normalized Cut) objectives. Based on the KL divergence between demographic proportions and a prior distribution, our formulation can impose any desired (target) proportions within each cluster (balanced demographics correspond to the particular case of a uniform prior), unlike the existing fair-clustering methods. Moreover, it makes it possible to control the trade-off between fairness and the clustering objective. We derive an auxiliary function (tight upper bound) of our fairness penalty via its concave-convex decomposition and Lipschitz-gradient property. Our bound can be optimized jointly with various clustering objectives. Interestingly, at each iteration, our general fair-clustering algorithm performs an independent update for each assignment variable, while guaranteeing convergence. Therefore, it can be easily distributed for large-scale data sets. This scalability is important as it makes it possible to explore different trade-off levels between fairness and clustering objectives. Unlike the constrained spectral relaxation in Kleindessner2019 , our formulation does not require storing an affinity matrix and computing its eigenvalue decomposition. Furthermore, unlike Chierichetti2017 , our experiments reveal that fairness does not come at a significant cost of the clustering objective. Rather, several of our tests showed that our fairness penalty helped to avoid weak minima of the clustering objective (i.e., with fairness, we obtained lower clustering objectives). We demonstrate the flexibility and scalability of our algorithm with comprehensive evaluations over both synthetic and real-world data sets, many of which are much larger than those used in recent fair-clustering methods Kleindessner2019 ; Chierichetti2017 .
Let $X = \{x_p, \; p = 1, \dots, N\}$ denote a set of $N$ data points to be assigned to a set of $K$ clusters, and $S$ a matrix of binary cluster-assignment vectors $S_k = [s_{p,k}] \in \{0,1\}^N$. Suppose that the data set contains $J$ different demographic groups, with column vector $V_j = [v_{j,p}] \in \{0,1\}^N$ indicating point assignment to group $j$, such that $v_{j,p} = 1$ if data point $p$ is in group $j$ and $0$ otherwise. We propose the following general formulation for optimizing any clustering objective $F(S)$ subject to fairness constraints:
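One way to write this constrained problem, assuming $F(S)$ denotes the clustering objective, $V_j^t S_k$ the number of points of group $j$ in cluster $k$, $1^t S_k$ the cluster size and $\mu_j$ the target proportion of group $j$ (all symbol names are assumptions of this sketch), is:

```latex
\min_{S} \; F(S) \quad \text{s.t.} \quad \frac{V_j^t S_k}{1^t S_k} = \mu_j \quad \forall\, j, k \qquad (2)
```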
where $\mu_j$ is some given (required) proportion of demographic $j$ within each cluster, and $U = [\mu_j]$ is a prior distribution, i.e., $\sum_j \mu_j = 1$. Notice that balanced demographics correspond to the particular case of our formulation in which $U$ is a uniform distribution: $\mu_j = 1/J$ for all $j$. Superscript $t$ indicates the transpose operator. We solve constrained problem (2) with a KL-based penalty approach:
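A sketch of this penalized objective, under the assumed notation $U = [\mu_j]$ for the target proportions, $P_k = [P(j|k)]$ for the demographic proportions inside cluster $k$, and $\lambda$ for the penalty weight:

```latex
E(S) \;=\; F(S) \;+\; \lambda \sum_{k} \mathcal{D}_{\mathrm{KL}}\!\left(U \,\|\, P_k\right),
\qquad P(j|k) \;=\; \frac{V_j^t S_k}{1^t S_k} \qquad (3)
```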
In (3), the second term is the fairness penalty, with $\mathcal{D}_{\mathrm{KL}}(U \| P_k)$ denoting the KL divergence between the given (required) proportions $U$ and the probability $P_k$ of each demographic in cluster $k$. Parameter $\lambda$ controls the trade-off between the clustering objective and the fairness penalty.
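As an illustration of the fairness penalty described above, the following minimal sketch computes the demographic proportions of each cluster and the resulting KL-based penalty. The array layout (`S` as an N-by-K assignment matrix, `V` as a J-by-N group-indicator matrix), the function names and the `eps` smoothing are assumptions made for this example, not part of the original formulation.

```python
import numpy as np

def demographic_proportions(S, V):
    """Return P with P[k, j] = (points of group j in cluster k) / (size of cluster k).

    S: (N, K) hard or soft cluster assignments.
    V: (J, N) binary demographic-group indicators.
    """
    counts = (V @ S).T            # (K, J): V_j^t S_k
    sizes = S.sum(axis=0)         # (K,):   1^t S_k
    return counts / sizes[:, None]

def kl_fairness_penalty(S, V, mu, eps=1e-12):
    """Sum over clusters of KL(U || P_k); eps avoids log(0) (an assumption)."""
    mu = np.asarray(mu, dtype=float)
    P = demographic_proportions(S, V)
    return float(np.sum(mu * (np.log(mu + eps) - np.log(P + eps))))

# Four points, two groups (alternating), two clusters.
V = np.array([[1.0, 0.0, 1.0, 0.0],
              [0.0, 1.0, 0.0, 1.0]])
mu = np.array([0.5, 0.5])
S_fair = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])
S_biased = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]])
penalty_fair = kl_fairness_penalty(S_fair, V, mu)
penalty_biased = kl_fairness_penalty(S_biased, V, mu)
```

In this toy case, the clustering that mixes both groups in every cluster yields a zero penalty, whereas the clustering that isolates each group in its own cluster is heavily penalized.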
Let us relax the integer constraint on our assignment matrix and, instead, use a relaxed variable:
where $S_k \in [0,1]^N$ denotes relaxed cluster-assignment vectors and, for each point $p$, $s_p = [s_{p,k}] \in [0,1]^K$ is the probability simplex assignment vector verifying $1^t s_p = 1$. With this relaxation, expanding the KL term and discarding the constant $\mu_j \log \mu_j$, our problem in (3) becomes:
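Expanding the KL term as a cross-entropy minus a constant, the relaxed problem (referred to as Eq. (11) in the text) can be sketched as follows, with the same assumed notation ($F$ the clustering objective, $\lambda$ the penalty weight, $\mu_j$ the target proportions, $s_p$ the relaxed assignment vector of point $p$):

```latex
\min_{S} \; F(S) \;-\; \lambda \sum_{k} \sum_{j} \mu_j \log \frac{V_j^t S_k}{1^t S_k}
\quad \text{s.t.} \quad 1^t s_p = 1, \;\; s_p \geq 0 \quad \forall\, p \qquad (11)
```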
where, now, $S$ corresponds to the relaxed variables. Observe that, in Eq. (11), the fairness penalty becomes a cross-entropy between the given proportion $\mu_j$ and $P(j|k)$, the proportion of demographic $j$ within cluster $k$. Notice that our fairness penalty decomposes into convex and concave parts:
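Since the logarithm of a ratio is a difference of logarithms, each term of the penalty splits as sketched below (assumed notation: $1^t S_k$ is the relaxed size of cluster $k$ and $V_j^t S_k$ the relaxed count of group $j$ in it). The first part is concave in $S_k$ because the logarithm of a linear function is concave; the second part is convex for the same reason, with the sign flipped:

```latex
-\mu_j \log \frac{V_j^t S_k}{1^t S_k}
\;=\; \underbrace{\mu_j \log\!\left(1^t S_k\right)}_{\text{concave}}
\;\; \underbrace{-\,\mu_j \log\!\left(V_j^t S_k\right)}_{\text{convex}}
```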
This enables us to derive tight bounds (auxiliary functions) for minimizing our general fair-clustering model in Eq. (11), using a Lipschitz-gradient property of the convex part and a first-order bound on the concave part. This is discussed in more detail in the following sections for various clustering objectives.
$A_i(S)$ is an auxiliary function (tight upper bound) of objective $E(S)$ at current solution $S^i$ if it satisfies the following conditions:
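In the standard bound-optimization setting, these two conditions state that the auxiliary function upper-bounds the objective everywhere and touches it at the current solution (the names $E$, $A_i$ and $S^i$ are assumed here for the objective, the auxiliary function and the current solution):

```latex
E(S) \;\le\; A_i(S) \quad \forall\, S, \qquad\qquad E(S^i) \;=\; A_i(S^i)
```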
where $i$ is the iteration index. Bound optimization updates the current solution $S^i$ to the next solution $S^{i+1}$ as the optimum of the auxiliary function: $S^{i+1} = \arg\min_S A_i(S)$. Notice that this guarantees that the original objective function does not increase at each iteration: $E(S^{i+1}) \le A_i(S^{i+1}) \le A_i(S^i) = E(S^i)$. Bound optimizers can be useful in optimization as they transform difficult problems into easier ones Zhang2007 . Examples of well-known bound optimizers include the concave-convex procedure (CCCP) Yuille2001 , expectation maximization (EM) algorithms and submodular-supermodular procedures (SSP) Narasimhan2005 .
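The monotonic-descent property of bound optimization can be illustrated on a toy one-dimensional problem. The sketch below is not the fair-clustering algorithm: it minimizes an arbitrary smooth function (chosen for this example) by repeatedly minimizing a quadratic auxiliary function built from a Lipschitz-gradient constant, which is the same kind of bound used later for the convex part of the fairness penalty.

```python
import numpy as np

def f(x):
    """Toy objective (an assumption for illustration): softplus plus a quadratic."""
    return np.logaddexp(0.0, x) + 0.5 * (x - 2.0) ** 2

def grad_f(x):
    return 1.0 / (1.0 + np.exp(-x)) + (x - 2.0)

# The second derivative of f is at most 0.25 + 1, so L = 1.25 is a valid
# Lipschitz constant for the gradient.
L = 1.25

def bound_optimize(x0, n_iter=50):
    """At each step, minimize A_i(x) = f(x_i) + f'(x_i)(x - x_i) + (L/2)(x - x_i)^2."""
    xs = [x0]
    for _ in range(n_iter):
        x = xs[-1]
        xs.append(x - grad_f(x) / L)   # closed-form minimizer of the quadratic bound
    return xs

trajectory = bound_optimize(x0=-3.0)
values = [f(x) for x in trajectory]
```

Because each auxiliary function upper-bounds `f` and is tight at the current iterate, the sequence `values` never increases.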
Given the current clustering solution $S^i$ at iteration $i$, we have the following tight upper bound (auxiliary function) on the fairness term in (11), up to additive and multiplicative constants, and for current solutions in which each demographic is represented by at least one point in each cluster (i.e., $V_j^t S_k^i \geq 1 \; \forall\, j, k$):
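A form of this bound that is consistent with the derivation outlined in the supplemental proof is sketched below. The first term of $b_{p,k}^i$ comes from the first-order bound on the concave part, the second from the gradient of the convex part, and the entropy-like terms from bounding the Lipschitz quadratic term by a KL divergence. The symbol names ($G_i$, $b_p^i$) and the $1/L$ scaling convention are assumptions of this sketch:

```latex
G_i(S) \;\propto\; \sum_{p=1}^{N} s_p^t \left( b_p^i + \log s_p - \log s_p^i \right),
\qquad
b_{p,k}^i \;=\; \frac{1}{L} \sum_{j} \left( \frac{\mu_j}{1^t S_k^i} - \frac{\mu_j\, v_{j,p}}{V_j^t S_k^i} \right) \qquad (7)
```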
where $L$ is some positive Lipschitz-gradient constant; its admissible range and the detailed proof are given in the supplemental material.
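[Table 1: bounds (auxiliary functions) on the clustering objectives, e.g., K-means and Ncut; see the respective references in Table 1. The table notes define an affinity-measuring kernel and an affinity matrix.]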
Given the current clustering solution $S^i$, the bound $H_i$ on the clustering objective and the bound $G_i$ on the fairness penalty at iteration $i$, we have the following bound for the general fair-clustering objective in (11):
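Assuming the clustering bound takes a point-wise linear form $H_i(S) = \sum_p s_p^t a_p^i$, with $a_p^i$ the vector of per-cluster potentials of point $p$ (an assumed notation for the entries of Table 1), and writing $b_p^i$ for the corresponding fairness potentials, the combined bound can be sketched as:

```latex
A_i(S) \;=\; \sum_{p=1}^{N} s_p^t \left( a_p^i + \lambda\, b_p^i + \log s_p - \log s_p^i \right) \qquad (8)
```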
It is straightforward to check that the sum of auxiliary functions, each corresponding to a term in the objective, is also an auxiliary function of the overall objective.
We can see that our bound in Eq. (8) is the sum of independent functions, each corresponding to a data point $p$. Therefore, our minimization problem in (11) can be tackled by optimizing each term over $s_p$, subject to the simplex constraint and independently of the other terms, while guaranteeing convergence:
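With the assumed notation $a_p^i$ and $b_p^i$ for the clustering and fairness potentials of point $p$ at iteration $i$, each per-point subproblem can be sketched as:

```latex
\min_{s_p \in \nabla_K} \; s_p^t \left( a_p^i + \lambda\, b_p^i + \log s_p - \log s_p^i \right) \quad \forall\, p \qquad (9)
```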
where $\nabla_K$ denotes the $K$-dimensional probability simplex $\nabla_K = \{y \in [0,1]^K \mid 1^t y = 1\}$. Also, notice that in our derived bound, we obtained a convex negative-entropy barrier function $s_p^t \log s_p$, which comes from the convex part in our fairness penalty. This entropy term is very useful as it restricts the domain of each $s_p$ to non-negative values, thereby completely avoiding extra dual variables for the constraints $s_p \geq 0$.
The objective in (9) is a sum of convex functions with the affine simplex constraint $1^t s_p = 1$. As strong duality holds for the convex objective and the affine simplex constraints, the solutions of the Karush-Kuhn-Tucker (KKT) conditions minimize the bound. The KKT conditions yield a closed-form solution for both the primal variables $s_p$ and the dual variables (Lagrange multipliers) corresponding to the simplex constraints. After simplifying the solution, we get the following closed-form update for each data point $p$:
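As a sketch of how such an update arises, assume the per-point objective is $s_p^t (a_p^i + \lambda b_p^i + \log s_p - \log s_p^i)$, where $a_p^i$ and $b_p^i$ are assumed names for the clustering and fairness potentials. Setting the gradient of the Lagrangian to zero gives $a_p^i + \lambda b_p^i + \log s_p - \log s_p^i + 1 + \gamma_p = 0$, with $\gamma_p$ the multiplier of the simplex constraint; solving for $s_p$ and normalizing leads to a multiplicative, softmax-like update (products and exponentials are element-wise):

```latex
s_p^{i+1} \;=\; \frac{s_p^i \odot \exp\!\left(-\left(a_p^i + \lambda\, b_p^i\right)\right)}
{1^t \left[\, s_p^i \odot \exp\!\left(-\left(a_p^i + \lambda\, b_p^i\right)\right) \right]} \qquad (10)
```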
Notice that each closed-form update in (10), which globally optimizes (9), is within the simplex. We give the pseudo-code of the proposed fair-clustering algorithm in Algorithm 1. The algorithm can be used for any specific clustering objective, e.g., K-means or Ncut, among others, by providing the corresponding clustering potentials $a_p^i$. The algorithm consists of an inner and an outer loop, where the inner iterations update $s_p$ using (10) until $S$ does not change, with the clustering term $a_p^i$ fixed from the outer loop. The outer iteration re-computes $a_p^i$ from the updated $S$. The time complexity of each inner iteration is $\mathcal{O}(NKJ)$, with $N$ data points, $K$ clusters and $J$ demographic groups. Also, the updates are independent for each data point and, thus, can be efficiently computed in parallel. In the outer iteration, the time complexity of updating $a_p^i$ depends on the chosen clustering objective. For instance, for K-means, it is $\mathcal{O}(NKM)$, with $M$ the feature dimension, and, for Ncut, it is $\mathcal{O}(N^2 K)$ for a full affinity matrix, or much less for a sparse affinity matrix. Note that $a_p^i$ can be computed efficiently in parallel for all the clusters.
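The inner/outer structure described above can be sketched as follows for the K-means case. This is a minimal illustration, not the authors' implementation: the function names, the random soft initialization, the default value of `L`, the `eps` guards and the exact scaling of the fairness potentials are all assumptions of this sketch.

```python
import numpy as np

EPS = 1e-12  # numerical guard (assumption of this sketch)

def kmeans_potentials(X, S):
    """a[p, k] = ||x_p - c_k||^2, with centers c_k computed from soft assignments S."""
    sizes = np.maximum(S.sum(axis=0), EPS)                         # (K,)
    centers = (S.T @ X) / sizes[:, None]                           # (K, M)
    return ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)    # (N, K)

def fairness_potentials(S, V, mu, L):
    """b[p, k] = (1/L) * sum_j (mu_j / 1^t S_k - mu_j * v_{j,p} / V_j^t S_k)."""
    sizes = np.maximum(S.sum(axis=0), EPS)          # (K,)   1^t S_k
    counts = np.maximum(V @ S, EPS)                 # (J, K) V_j^t S_k
    first = mu.sum() / sizes                        # (K,)
    second = (V * mu[:, None]).T @ (1.0 / counts)   # (N, K)
    return (first[None, :] - second) / L

def closed_form_update(S, a, b, lam):
    """s_p <- s_p * exp(-(a_p + lam * b_p)), renormalized on the simplex."""
    logits = np.log(np.clip(S, EPS, None)) - (a + lam * b)
    logits -= logits.max(axis=1, keepdims=True)     # numerical stability
    S_new = np.exp(logits)
    return S_new / S_new.sum(axis=1, keepdims=True)

def fair_kmeans(X, V, mu, K, lam, L=2.0, n_outer=20, n_inner=50, tol=1e-6, seed=0):
    rng = np.random.default_rng(seed)
    S = rng.dirichlet(np.ones(K), size=X.shape[0])  # soft random initialization
    for _ in range(n_outer):
        a = kmeans_potentials(X, S)                 # clustering term fixed in the inner loop
        for _ in range(n_inner):
            b = fairness_potentials(S, V, mu, L)
            S_new = closed_form_update(S, a, b, lam)
            converged = np.abs(S_new - S).max() < tol
            S = S_new
            if converged:
                break
    return S

X_toy = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]])
V_toy = np.array([[1.0, 0.0, 1.0, 0.0], [0.0, 1.0, 0.0, 1.0]])
S_toy = fair_kmeans(X_toy, V_toy, np.array([0.5, 0.5]), K=2, lam=1.0)
```

Each row of the update depends only on that point's potentials, so the inner loop is a set of independent row-wise operations, which is what makes the scheme easy to parallelize or distribute.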
In this section, we present empirical evaluations of the proposed fair-clustering algorithm on both synthetic and real data sets. We chose two well-known clustering objectives, K-means and Normalized Cut (Ncut) ShiMalik2000 ; VonLuxburg2007 , and incorporated fairness by using the respective bounds (see Table 1) in our bound optimizers. We call these versions Fair K-means and Fair Ncut. Note that our formulation can be used for other objectives in a straightforward manner.
We evaluate the algorithm in terms of the balance of each cluster in Eq. (1), and define the overall balance of the clustering as the minimum balance over all clusters, where a value of 1 means perfect balance between two demographic groups in each cluster. To evaluate target proportions other than equal proportions, and also for the cases where the number of demographic groups is more than two, we propose to use a fairness error, which is the KL divergence in Eq. (3). The fairness error becomes zero when the output clusters follow the given proportions of demographics. We also show the effect on the original clustering objectives for K-means and Ncut with respect to $\lambda$ values and fairness errors. For Ncut, we use a nearest-neighbor affinity matrix, whose entry for a pair of data points is 1 if one point is within the nearest neighbors of the other. In all the experiments, we fixed the Lipschitz constant $L$ and found that this does not increase the objective (see the detailed explanation in the supplemental material). We perform l2-normalization of the feature vector of each data point by dividing it by its l2 norm. To standardize the real datasets, we scale each attribute to have zero mean and unit variance before the l2-normalization of the data.
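The balance evaluation and the preprocessing described above can be sketched as follows. The function names and the integer-label encoding of clusters and groups are assumptions made for this example; the sketch also assumes that no attribute is constant and that no feature vector is all zeros, so that the divisions are well defined.

```python
import numpy as np

def cluster_balance(labels, groups, k):
    """Smallest ratio between the sizes of any two groups inside cluster k."""
    in_cluster = groups[labels == k]
    counts = np.array([(in_cluster == g).sum() for g in np.unique(groups)], dtype=float)
    if counts.min() == 0:
        return 0.0                       # a group is absent: fully unbalanced cluster
    return float(counts.min() / counts.max())

def overall_balance(labels, groups):
    """Minimum balance over all clusters."""
    return min(cluster_balance(labels, groups, k) for k in np.unique(labels))

def preprocess(X, standardize=True):
    """Optionally scale attributes to zero mean / unit variance, then l2-normalize rows."""
    X = np.asarray(X, dtype=float)
    if standardize:
        X = (X - X.mean(axis=0)) / X.std(axis=0)
    return X / np.linalg.norm(X, axis=1, keepdims=True)

groups = np.array([0, 1, 0, 1])
balanced = overall_balance(np.array([0, 0, 1, 1]), groups)   # each cluster mixes both groups
biased = overall_balance(np.array([0, 1, 0, 1]), groups)     # each cluster holds one group
X_norm = preprocess([[1.0, 2.0], [3.0, 5.0], [5.0, 4.0]])
```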
Synthetic datasets. We create two types of synthetic datasets according to the proportions of demographics (see Fig. 1). The Synthetic dataset consists of 2D data points, with each of the two demographic groups having an equal number of points. To experiment with unequal proportions, the Synthetic-unequal dataset is used, with 300 and 100 points within the two demographic groups, respectively. For this dataset, we impose target proportions and run the algorithm to find two clusters with the given proportions.
Real datasets. We use two datasets from the UCI machine learning repository Dua:2019 :
Bank (https://archive.ics.uci.edu/ml/datasets/Bank+Marketing): this dataset contains records of direct marketing campaigns of a Portuguese banking institution, one for each client contacted moro2014data . We use the marital status as the sensitive attribute, which contains three groups: married, single and divorced. We subsampled records from each group. We chose numeric attributes (age, duration, euribor, and number of employees) as features, fixed the number of clusters, and imposed target proportions within each cluster.
Adult (https://archive.ics.uci.edu/ml/datasets/adult): this is a US census record data set from 1994. We subsampled the records and used the gender status as the sensitive attribute, which contains females and males. We chose the six numeric attributes as features, fixed the number of clusters, and imposed target proportions within each cluster.
In this section, we discuss the results of the different experiments that evaluate the proposed algorithm for Fair Ncut and Fair K-means clustering. Note that, for each dataset, we fix the initial seeds and start the algorithm from the K-means solution, although any other initialization, such as K-means++ seeding arthur2007k , can also be used.
Output clusters vs. $\lambda$. In Fig. 1, we plot the output clusters of Fair K-means for increasing values of $\lambda$. When $\lambda = 0$, we get the traditional clustering results of K-means without fairness. The result clearly has biased clusters, each corresponding fully to one of the demographic groups, with a balance measure equal to 0. In the Synthetic dataset, the balance increases with parameter $\lambda$, with perfect balance achieved starting from some value of $\lambda$. We also observe the same trend in the Synthetic-unequal dataset, where the output clusters are found according to the prior demographic distribution.
Clustering objective vs. $\lambda$ vs. fairness error. We assess the effect of incorporating fairness constraints on the original clustering objectives (or energies). In Fig. 3, we plot the clustering objectives (see Table 1) for K-means and Ncut with respect to $\lambda$ and, at the same time, we plot the fairness errors. In the plots, the blue line corresponds to the clustering objectives, and the red line to the fairness errors. We observe that the fairness error becomes very close to zero beyond a certain value of $\lambda$. The same trend is observed in all the datasets. Interestingly, in all the datasets, the clustering objectives also tend to decrease with the fairness errors, for both Fair Ncut and Fair K-means. This shows that our fairness penalty did not negatively affect the clustering objectives; in fact, it helped avoid weak local minima of the clustering objectives. Also, note that, with increased $\lambda$ and decreased fairness error, the clustering energies tend to decrease and stabilize within a small range of objective values for both Ncut and K-means. This makes our algorithm flexible for general fair-clustering problems: we can select a $\lambda$ that achieves an acceptable fairness level while yielding the best possible clustering objective.
We presented a variational, bound-optimization formulation that integrates fairness constraints with various well-known clustering objectives. Our algorithm can impose any desired target demographic proportions, pertaining to a specific sensitive attribute or protected class, within each cluster. The algorithm is scalable and computationally efficient in terms of both the number of data points and the number of demographic groups. It enables parallel updates of cluster assignments for each data point, with a convergence guarantee. Interestingly, our tests showed that our KL-based fairness penalty did not negatively affect the clustering objectives; in fact, it helped avoid weak local minima. Our algorithm is flexible in the sense that it makes it possible to control the trade-off between the clustering objective and fairness: we can choose a trade-off that achieves a given acceptable fairness level while yielding the best possible clustering objective.
In this supplemental document, we present a detailed proof of Proposition 1 (Bound on fairness) in the paper. Recall that, in the paper, we wrote the fairness clustering problem in the following form:
The proposition for the bound on the fairness penalty states the following: Given the current clustering solution $S^i$ at iteration $i$, we have the following tight upper bound (auxiliary function) on the fairness term in (11), up to additive and multiplicative constants, and for current solutions in which each demographic is represented by at least one point in each cluster (i.e., $V_j^t S_k^i \geq 1 \; \forall\, j, k$):
where $L$ is some positive Lipschitz-gradient constant, whose admissible range is discussed at the end of this proof.
Proof: We can expand each term in the fairness penalty in (11), and write it as the sum of two functions, one convex and the other concave:
Let us represent the $N \times K$ assignment matrix $S$ in its equivalent vector form $[s_1, \dots, s_N]$, where $s_p$ is the probability simplex assignment vector for point $p$. As we shall see later, this equivalent simplex-variable representation is convenient for deriving our bound.
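Bound on the concave part:
For the concave part, we can get a tight upper bound (auxiliary function) from its first-order approximation at the current solution $S_k^i$:
where the gradient vector of the concave part is evaluated at the current solution, and all the constant terms are grouped into a single additive constant. Now consider the matrix whose columns are these gradient vectors and its equivalent vector representation, which concatenates the rows of the matrix into a single $NK$-dimensional vector. Summing the bounds in (14) over the clusters $k$ and using the $NK$-dimensional vector representation of both the assignment matrix and the gradient matrix, we get: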
Bound on the convex part:
For the convex part, the upper bound (auxiliary function) can be found by using Lemma 1 and Definition 1 in the Appendix:
where the gradient vector of the convex part is evaluated at the current solution and $L$ is a valid Lipschitz constant for the gradient of the convex part. Similarly to earlier, consider the matrix whose columns are these gradient vectors and its equivalent vector representation. Using these equivalent vector representations for the assignment matrices and the gradient matrix, and summing the bounds in (16) over the clusters $k$, we get:
In our case, a valid Lipschitz constant is given by the maximum eigenvalue of the Hessian matrix of the convex part:
Note that the quadratic term in this bound is defined over the simplex variable $s_p$ of each data point $p$. Thus, we can use Lemma 2 in the Appendix and get the following bound on the convex part:
Total bound on the fairness term:
By taking into account the sum over all the demographics and combining the bounds for the concave and convex parts, we get the following bound for the fairness term:
Note that, for current solutions in which each demographic is represented by at least one point in each cluster (i.e., $V_j^t S_k^i \geq 1 \; \forall\, j, k$), the maximum eigenvalue of the Hessian is bounded above, which in turn bounds the admissible Lipschitz constant $L$. Note that, in our case, the relevant term in the Hessian is typically much smaller than this worst-case bound. Therefore, in practice, setting a suitable positive $L$ does not increase the objective.
A convex function $f$ defined over a convex set $\Omega \subseteq \mathbb{R}^d$ is $L$-smooth if the gradient of $f$ is Lipschitz (with a Lipschitz constant $L > 0$): $\|\nabla f(x) - \nabla f(y)\| \leq L \|x - y\|$ for all $x, y \in \Omega$. Equivalently, there exists a strictly positive $L$ such that the Hessian of $f$ verifies $\nabla^2 f(x) \preceq L I$, where $I$ is the identity matrix.
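Let $\sigma_{max}(f)$ denote the maximum eigenvalue of the Hessian $\nabla^2 f(x)$. Then $\sigma_{max}(f)$ is a valid Lipschitz constant for the gradient of $f$ because $\nabla^2 f(x) \preceq \sigma_{max}(f) I$.
An $L$-Lipschitz gradient implies the following bound on $f(y)$: $f(y) \leq f(x) + \nabla f(x)^t (y - x) + \frac{L}{2} \|y - x\|^2$. This implies that the distance between $f(y)$ and its first-order Taylor approximation at $x$ is between $0$ and $\frac{L}{2} \|y - x\|^2$. Such a distance is the Bregman divergence with respect to the $l_2$ norm.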
Proof: The proof of this lemma is straightforward. It suffices to start from the convexity condition and use the Cauchy-Schwarz inequality and the Lipschitz-gradient condition:
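For any $x$ and $y$ belonging to the $K$-dimensional probability simplex $\nabla_K$, we have the following inequality: $\mathcal{D}_{\mathrm{KL}}(x \| y) \geq \frac{1}{2} \|x - y\|^2$,
where $\mathcal{D}_{\mathrm{KL}}$ is the Kullback-Leibler divergence: $\mathcal{D}_{\mathrm{KL}}(x \| y) = \sum_k x_k \log \frac{x_k}{y_k}$.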
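Proof: Let $g(x) = \sum_k x_k \log x_k$ be the negative entropy. The Hessian of $g$ is a diagonal matrix whose diagonal elements are given by $1/x_k$. Now, because $x_k \leq 1$, we have $1/x_k \geq 1$. Therefore, $g$ is 1-strongly convex: $\nabla^2 g(x) \succeq I$. This is equivalent to: $g(x) \geq g(y) + \nabla g(y)^t (x - y) + \frac{1}{2} \|x - y\|^2$.
The gradient of $g$ is given by: $\nabla g(y) = [1 + \log y_1, \dots, 1 + \log y_K]^t$.
Applying this expression to $y$, notice that $\nabla g(y)^t (x - y) = \sum_k (x_k - y_k) + \sum_k (x_k - y_k) \log y_k$. Using these in expression (24) for $g$, we get: $\mathcal{D}_{\mathrm{KL}}(x \| y) - \sum_k (x_k - y_k) \geq \frac{1}{2} \|x - y\|^2$.
Now, because $x$ and $y$ are in $\nabla_K$, we have $\sum_k x_k = \sum_k y_k = 1$. This yields the result in Lemma 2.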
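The inequality of Lemma 2 can be checked numerically on random points of the simplex. The sketch below is only a sanity check under assumed names (`kl_divergence`, `check_lemma2`) and an arbitrary sampling scheme; it does not replace the proof.

```python
import numpy as np

def kl_divergence(x, y):
    """Kullback-Leibler divergence between two strictly positive simplex vectors."""
    return float(np.sum(x * np.log(x / y)))

def check_lemma2(n_trials=1000, K=5, seed=0):
    """Return the smallest observed value of KL(x || y) - 0.5 * ||x - y||^2."""
    rng = np.random.default_rng(seed)
    gaps = []
    for _ in range(n_trials):
        x = rng.dirichlet(np.ones(K))   # random point on the K-dimensional simplex
        y = rng.dirichlet(np.ones(K))
        gaps.append(kl_divergence(x, y) - 0.5 * float(np.sum((x - y) ** 2)))
    return min(gaps)

smallest_gap = check_lemma2()
```

A non-negative `smallest_gap` (up to floating-point error) is what the lemma predicts for every sampled pair.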
Equality of opportunity in supervised learning. In Neural Information Processing Systems (NeurIPS), pages 3315–3323, 2016.
Conference on Uncertainty in Artificial Intelligence (UAI), pages 404–412, 2005.
Spectralnet: Spectral clustering using deep neural networks. In International Conference on Learning Representations (ICLR), 2018.
International Journal of Computer Vision, 127:477–511, 2019.