Introduction
Machine learning (ML) has become one of the most applicable and influential tools to support critical decision making, such as college admission, job hiring, loan decisions, criminal risk assessment, etc. \cite{makhlouf2021applicability}. Widespread application of ML-based predictive modeling has induced growing concerns regarding social inequities and unfairness in decision-making processes. With fairness being critical to practicing responsible machine learning, fairness-aware learning has been a primary goal in many recent machine learning developments.
Fairness-aware learning can be achieved by intervention at the pre-processing, in-processing (algorithmic), or post-processing stage \cite{friedler2019comparative}. Pre-processing strategies involve the fairness measure in the data preparation step to mitigate potential bias in the input data and produce fair outcomes \cite{kamiran2012data, feldman2015certifying, calmon2017optimized}. In-processing approaches \cite{agarwal2018reductions, celis2019classification, zafar2015fairness} incorporate fairness in the design of the algorithm to generate a fair outcome. Post-processing methods \cite{hardt2016equality, kamiran2010discrimination} manipulate the model outcome to mitigate unfairness.
Fairness is an abstract term with many definitions. The literature on fairness in ML encompasses more than 21 fairness metrics \cite{narayanan2018translation, verma2018fairness} that could be utilized to mitigate the bias in a given model. Yet, one could face the situation of not being able to choose among the various notions of fairness. For one, choosing the proper fairness notion may depend on the context. More importantly, even if a notion is chosen properly, there is no evidence to support that it describes the overall unfairness and is adequate to address the challenge in a given problem. On the other hand, improving algorithms to satisfy a given notion of fairness does not guarantee the lack of unfairness under another notion. In fact, the potential interactions between the various notions have not been studied in the literature; they are the main focus of this paper.
For example, the choice of fairness may depend on the users’ knowledge about the type of disparities influencing the outcome of interest and its corresponding independent variables in a specific application domain. However, formalizing and evaluating the appropriate notion is often inaccessible to practitioners due to the limited awareness of the applicability of fairness notions within a given problem.
While recent literature aims to answer questions such as how to measure fairness and mitigate the algorithmic bias, little is known about the sufficiency of different fairness notions and how one should choose among them. In this paper, we aim to elaborate on the sufficiency of different fairness metrics in a given problem by considering their potential interactions and overlaps. We develop an automated tool to help users decide on the choice of fairness.
Choosing appropriate fairness metrics for auditing and bias mitigation within a specific context is part of the responsibilities of practitioners (who apply existing approaches) or researchers (who develop methodologies). However, this is a troublesome task, and it requires a rigorous exploration of the combinatorial space of fairness notions. The Impossibility Theorem \cite{kleinberg2016inherent} and the trade-offs between fairness notions exacerbate the difficulty of the task, increase the chance of impulsiveness, and may introduce additional bias as a result of a wrong choice.
In order to help data scientists specify the fairness metrics to consider, in this paper we propose an automatic framework for identifying a small subset of fairness metrics that is representative of the other metrics. In summary, we make the following contributions:

We propose the problem of using the correlations between different fairness metrics, for a given context (specified by the training data and a model type), to find a small subset of metrics that represents the other fairness metrics. To the best of our knowledge, we are the first to propose this problem.

We design a sampling-based Monte-Carlo method to estimate the correlations between the fairness metrics.

We develop an efficient approach for sampling models with different fairness metric values, which enables estimating the correlation between the fairness metrics.

We adapt and extend the existing work in order to specify the small subset of representative metrics, using the correlations between fairness metrics.

In order to evaluate our findings, we conduct comprehensive experiments using real-world benchmark datasets, multiple types of classifiers, and a comprehensive set of fairness metrics. Our experimental results verify the effectiveness of our approach for choosing fairness metrics.
In the following, we first provide a brief explanation of the fairness metrics and then propose our technical solution.
Fairness Model
Consider a training dataset consisting of $n$ data points denoted by vectors of $(\mathbf{x}, s, y)$. Let $\mathbf{x}$ be the set of non-sensitive attributes of dimension $d$, $s$ be the set of sensitive attributes, and $y$ be the response variable. In a classification setting, $y \in \{1, \ldots, K\}$, with $K$ being the total number of distinct classes. Let $h$ be the classifier function where $\hat{y} = h(\mathbf{x})$. Let $\mathcal{F} = \{f_1, \ldots, f_\ell\}$ denote the set of (group) fairness metrics. Since fairness metrics are defined for each sensitive group, let $f_j(s)$ be the fairness metric $f_j$ defined for sensitive group $s$. Let $y$ and $\hat{y}$ denote the true and predicted labels in a given test problem, respectively. Most of the existing fairness notions are defined based on the joint distribution of the $s$, $y$, and $\hat{y}$ variables, and fall into one of three well-known categories: Independence ($\hat{y} \perp s$), Separation ($\hat{y} \perp s \mid y$), and Sufficiency ($y \perp s \mid \hat{y}$) \cite{barocas2017fairness}. Various fairness metrics have been proposed in the literature, each based on one of the aforementioned ideas. Following the literature on defining the fairness metrics, and for ease of explanation, let us consider a binary classification task and a binary sensitive attribute (note that the techniques provided in this paper are agnostic to the choice of fairness metrics and machine learning tasks; extensions to higher numbers and cardinalities of the sensitive attributes and class labels are straightforward, as we shall explain in the discussion section). For the purpose of our analysis, we assume $s=0$ and $s=1$ to represent the Privileged and Unprivileged sensitive groups, respectively. The fairness metrics can be derived by expanding the confusion matrix on the outcome of $h$, split according to each sensitive attribute value \cite{kim2020fact}. Let $TP$ (True Positive), $FN$ (False Negative), $FP$ (False Positive), and $TN$ (True Negative) be the elements of a confusion matrix. Given the binary sensitive attribute $s$, a split of the confusion matrix on the Privileged group is denoted by $TP_p$, $FN_p$, $FP_p$, and $TN_p$, and the total number of observations of the Privileged group is denoted by $n_p$ (similarly, with subscript $u$ for the Unprivileged group). Thus, for instance, Statistical Parity would be equivalent to $\frac{TP_p + FP_p}{n_p} = \frac{TP_u + FP_u}{n_u}$, which measures the positive prediction outcome ($\hat{y}=1$) among different sensitive groups without considering their true label. Similarly, Equalized Odds can be expressed as $\frac{FP_p}{FP_p + TN_p} = \frac{FP_u}{FP_u + TN_u}$ and $\frac{TP_p}{TP_p + FN_p} = \frac{TP_u}{TP_u + FN_u}$, which emphasizes the positive prediction outcome and measures false positive and true positive rates among sensitive groups. Note that when $K>2$, the fairness metrics can be defined upon multiple confusion matrices, split according to a combination of class labels.
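To make the confusion-matrix view concrete, the following is a minimal Python sketch of these two metrics. It is an illustration rather than the paper's implementation: function names are ours, we report signed gaps between groups rather than the equalities above, and we assume $s=0$ is the Privileged group and $s=1$ the Unprivileged group.

```python
def group_confusion(y_true, y_pred, s, group):
    """Confusion-matrix counts (TP, FN, FP, TN) restricted to one sensitive group."""
    tp = fn = fp = tn = 0
    for yt, yp, si in zip(y_true, y_pred, s):
        if si != group:
            continue
        if yt == 1 and yp == 1:
            tp += 1
        elif yt == 1 and yp == 0:
            fn += 1
        elif yt == 0 and yp == 1:
            fp += 1
        else:
            tn += 1
    return tp, fn, fp, tn

def statistical_parity_diff(y_true, y_pred, s):
    """Positive-prediction rate of the unprivileged group minus the privileged one."""
    rates = []
    for g in (0, 1):  # 0 = privileged, 1 = unprivileged
        tp, fn, fp, tn = group_confusion(y_true, y_pred, s, g)
        rates.append((tp + fp) / (tp + fn + fp + tn))
    return rates[1] - rates[0]

def equalized_odds_diffs(y_true, y_pred, s):
    """(FPR gap, TPR gap) between the unprivileged and privileged groups."""
    fpr, tpr = [], []
    for g in (0, 1):
        tp, fn, fp, tn = group_confusion(y_true, y_pred, s, g)
        fpr.append(fp / (fp + tn))
        tpr.append(tp / (tp + fn))
    return fpr[1] - fpr[0], tpr[1] - tpr[0]
```

A classifier satisfies the corresponding notion exactly when the reported gap is zero; in practice, the absolute gap is used as the unfairness value.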
Having briefly discussed the fairness metrics, we next propose our framework for identifying a small subset of representative metrics for a given context.
Identifying Representative Fairness Metrics
Fairness is an abstract concept with many definitions from different perspectives and in different contexts. This variety of definitions, coupled with the Impossibility Theorems and the trade-offs between definitions, makes it overwhelmingly complicated for ordinary users and data scientists to select a subset of them to consider. After providing the preliminaries and the set of well-known fairness metrics, in this section we focus on identifying a subset of representative fairness metrics to facilitate the choice of fairness for a given context. In particular, we use the correlation between the fairness metrics as the measure to identify their similarities. Given a universe $\mathcal{F}$ of fairness definitions of interest, we would like to find a subset $\mathcal{R} \subseteq \mathcal{F}$, with a significantly smaller size, that represents all metrics in $\mathcal{F}$. That is, for every $f_i \in \mathcal{F}$, there exists a fairness metric $f_j \in \mathcal{R}$ such that the correlation between $f_i$ and $f_j$ is ``high''.
To this end, we first need to be able to calculate the correlations between the metrics of fairness for a given context. Estimating these correlations is one of the major challenges we shall resolve in this section. While the general ideas proposed in this section are not limited to a specific ML task, without loss of generality, we focus on classification for developing our techniques. First, we note that, given a classifier, one can audit it and compute its (un)fairness with regard to different metrics. This, however, does not provide the relation between the metrics; in other words, it is not clear how trying to resolve unfairness on one metric will impact the other metrics. Next, using the existing systems for achieving fairness does not seem to provide a promising approach: (i) existing fair-ML learning approaches (including the benchmark methods in AIF360 \cite{aif360}) are designed for a subset of fairness metrics; (ii) the ones, including \cite{omnifair}, that claim to cover different metrics are inefficient when multiple metrics are considered simultaneously, due to the exponentially large search space they consider. In summary, the fair-ML approaches are designed to build fair models given a set of metrics, as opposed to finding the relation between different fairness metrics.
Therefore, in the following, we design a Monte-Carlo method \cite{montecarlo, hickernell2013guaranteed} for estimating the underlying correlations and trade-offs between fairness metrics. Monte-Carlo methods turn out to be both efficient and accurate for such approximations.
After identifying the correlations, we use them to find the set of the representative metrics.
Estimating Between-Fairness Correlations
Monte-Carlo methods use repeated sampling and the central limit theorem for solving deterministic problems \cite{durrett2010probability}. At a high level, the Monte-Carlo methods work as follows: first, they generate a large enough set of random samples; then they use these inputs to estimate aggregate results. We use Monte-Carlo methods to estimate correlations between the fairness metrics. The major challenge in developing the Monte-Carlo method is being able to generate a large pool of samples. Every sample is a classifier that provides different values for every fairness metric. We use a sampling oracle that, upon being called, returns the fairness values for a sampled classifier. We shall provide the details of the oracle in the next subsection.

Correlation is a measure of linear association between two variables $v_1$ and $v_2$. When both variables are random, it is referred to as the coefficient of correlation $\rho$. The model most widely employed to calculate $\rho$ is the normal correlation model, which for the case of two variables is based on the bivariate normal distribution \cite{neter1996applied}. Having enough repeated samples ($N$), we can assume the variables $v_1$ and $v_2$ follow the Normal distribution (central limit theorem) with means $\mu_1$ and $\mu_2$, and standard deviations $\sigma_1$ and $\sigma_2$, respectively. In a bivariate normal model, the parameter $\rho$ provides information about the degree of the linear relationship between the two variables $v_1$ and $v_2$, and is calculated as $\rho = \frac{\mathrm{cov}(v_1, v_2)}{\sigma_1 \sigma_2}$. The correlation coefficient takes values between $-1$ and $1$. Note that if the two variables are independent, $\mathrm{cov}(v_1, v_2) = 0$ and subsequently $\rho = 0$. When $\rho > 0$, $v_2$ tends to be large when $v_1$ is large, or small when $v_1$ is small. In contrast, when $\rho < 0$ (i.e., the two variables are negatively correlated), $v_2$ tends to be large when $v_1$ is small, or vice versa. $\rho = 1$ indicates a perfect direct linear relation and $-1$ denotes a perfect inverse relation. Since $\rho$ is unknown, a point estimator of $\rho$ is required; the estimator is often called the Pearson correlation coefficient. In order to estimate the correlations, we use the sampling oracle to sample $N$ classifiers $h_1, \ldots, h_N$, and calculate the fairness values for each. Let $f_j(h_i)$ be the fairness metric $f_j$ of classifier $h_i$; the Pearson correlation coefficient between metrics $f_j$ and $f_k$ is then defined as follows:

$$ r_{j,k} = \frac{\sum_{i=1}^{N} \big(f_j(h_i) - \bar{f}_j\big)\big(f_k(h_i) - \bar{f}_k\big)}{\sqrt{\sum_{i=1}^{N} \big(f_j(h_i) - \bar{f}_j\big)^2}\,\sqrt{\sum_{i=1}^{N} \big(f_k(h_i) - \bar{f}_k\big)^2}} $$
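The Pearson estimator over two columns of the sampled fairness table can be sketched in a few lines of pure Python (an illustrative helper, not the paper's implementation):

```python
import math

def pearson(u, v):
    """Pearson correlation coefficient between two equal-length sequences
    (e.g., two fairness-metric columns of the sampled-classifier table)."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    su = math.sqrt(sum((a - mu) ** 2 for a in u))
    sv = math.sqrt(sum((b - mv) ** 2 for b in v))
    return cov / (su * sv)
```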
In order to reduce the variance of our estimation, we repeat the estimation $m$ times (we use the rule-of-thumb number $m = 30$ in our experiments), where every iteration $t$ returns the correlation estimation $r_t$. Then, the correlation is computed as the average of the estimations over the rounds. That is,

$$ \bar{r} = \frac{1}{m} \sum_{t=1}^{m} r_t \quad (1) $$

Using the central limit theorem, $\bar{r}$ follows the Normal distribution $N\big(\rho, \frac{\sigma_r^2}{m}\big)$. Given a confidence level $\alpha$, the confidence error $e$ identifies the range $[\bar{r} - e, \bar{r} + e]$ where $\Pr\big(\rho \in [\bar{r} - e, \bar{r} + e]\big) = \alpha$. Using the Z-table, while using the sample variance $\sigma_r^2$ to estimate the variance, the confidence error is computed as

$$ e = Z_{\frac{1+\alpha}{2}} \sqrt{\frac{\sigma_r^2}{m}} \quad (2) $$
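The averaging and confidence-error computation of Equations (1) and (2) can be sketched as follows. This is a minimal illustration under the assumption of a 95% confidence level ($Z \approx 1.96$); the function name is ours.

```python
import statistics

def aggregate_rounds(estimates, z=1.96):
    """Average the per-round correlation estimates r_1..r_m (Equation 1)
    and attach a CLT-based confidence error (Equation 2).

    `estimates` holds one Pearson estimate per Monte-Carlo round;
    z = 1.96 corresponds to a ~95% confidence level."""
    m = len(estimates)
    r_bar = sum(estimates) / m
    # Sample standard deviation of the round estimates, scaled by sqrt(m).
    err = z * statistics.stdev(estimates) / m ** 0.5
    return r_bar, err
```

As the number of rounds $m$ grows, the confidence error shrinks proportionally to $1/\sqrt{m}$, which matches the behavior reported in the experiments.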
Developing the Sampling Oracle
Having discussed the estimation of the correlations between the fairness metrics, next we discuss the development details of the sampling oracle. Upon being called, the oracle should draw an iid sample classifier and evaluate it for the different fairness metrics. Considering the set of fairness metrics of interest $\mathcal{F} = \{f_1, \ldots, f_\ell\}$, the output of the sampling oracle for the sample $h_i$ can be viewed as a vector of values $\langle f_1(h_i), \ldots, f_\ell(h_i) \rangle$, where $f_j(h_i)$ is the fairness of the sampled classifier $h_i$ for metric $f_j$. Calling the oracle $N$ times from the correlation estimator forms a table of $N$ samples, where each row contains the fairness values for a sampled classifier (Figure 1).
(Figure 1: the table of sampled classifiers, with one row per sample ID and one column of fairness values per metric.)
Two requirements are important in the development of the sampling oracle. First, since our objective is to find the correlations between the fairness metrics, we would like the samples to provide different values for the fairness metrics; in other words, the samples should provide randomness over the space of fairness values. Second, since the correlation estimator calls the oracle many times before it computes the correlations, we want the oracle to be efficient.
Our strategies to satisfy the design requirements for the sampling oracle are based on two simple observations. First, we note that the performance of a model for a protected group depends on the ratio of samples from that protected group and the distribution of their label values. To better explain this, let us consider a binary classifier and two groups $G_1$ and $G_2$. Clearly, if all samples in the training data belong to $G_1$, then the model is only trained for $G_1$, totally ignoring the other group. As the ratio of samples from $G_2$ increases in the training data, the model trains better for this group. Specifically, since the training error (the average error across training samples) is minimized during the training process, the ratio of $G_1$ to $G_2$ in the training data directly impacts the performance of the model for each of the groups. In addition, the other factor that impacts the prediction of the model for a protected group is the ratio of positive to negative samples. Therefore, as an indirect method to control variation over the space of fairness values, we consider variation over the ratios of the samples from each protected group and label value in the training data. The next observation is that, when sampling to estimate the correlations between the fairness metrics, we are not concerned with the accuracy of the model. This gives us the opportunity to use subsampling, training the model on small subsets of the training data in order to gain efficiency. Using these observations, we propose a Bootstrap resampling approach to generate different subsamples from the dataset for training purposes.
Bootstrapping is a data-driven statistical inference methodology (for standard error and bias estimates, confidence intervals, and hypothesis tests) that can be categorized under the broader class of resampling techniques \cite{efron1994introduction, hesterberg2011bootstrap}. The core idea of Bootstrapping is simple random sampling with replacement, without further assumptions. Let $N$ be the number of drawn Bootstrap samples. Given the training dataset $\mathcal{D}$, we aim to construct smaller representative subsets of $\mathcal{D}$ that we use to train the sampled models.

Consider a binary classification problem ($K = 2$) with two protected groups ($s = 0$: Group 1, $s = 1$: Group 2) to describe our sampling procedure. In order to (indirectly) control the fairness values, we bootstrap different sample ratios from each of the protected groups and label values, as shown in Figure 2. Let $\vec{w} = \{w_1, w_2, w_3, w_4\}$ be the ratios for each of the cells of the table. To generate each sample, we need to draw the vector $\vec{w}$ uniformly from the space of possible values for $\vec{w}$. Given that $\vec{w}$ represents the ratios from each cell, $\sum_{i=1}^{4} w_i = 1$. To make sure the values in $\vec{w}$ are drawn uniformly at random, we first generate four random numbers $u_1, \ldots, u_4$, each drawn uniformly at random from the range $[0, 1]$. Then, we normalize the weights as $w_i = u_i / \sum_{j=1}^{4} u_j$. Having sampled the ratios, we bootstrap $w_i \cdot n'$ samples from the samples of $\mathcal{D}$ that belong to cell $i$ of the table to form the bootstrapped dataset $\mathcal{D}_i$, where $n'$ is the subsample size. Next, the oracle uses the dataset $\mathcal{D}_i$ to train the sampled classifier $h_i$. Having trained the classifier $h_i$, it evaluates the model to compute the value $f_j(h_i)$ for each fairness metric $f_j$, and returns the vector $\langle f_1(h_i), \ldots, f_\ell(h_i) \rangle$. The set of $N$ samples collected from the sampling oracle then forms the table of fairness values shown in Figure 1, which is used to estimate the correlations between the fairness metrics. The pseudocode of our proposed sampling approach is provided in the Appendix.
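One oracle call can be sketched as follows. This is an illustrative version, not the paper's implementation: it assumes the caller supplies a training routine and the fairness-metric functions, and all names (`sample_ratios`, `sampling_oracle`, `train_fn`, `metric_fns`) are hypothetical.

```python
import random

def sample_ratios(n_cells=4):
    """Draw cell ratios w_i uniformly at random and normalize them to sum to 1."""
    u = [random.random() for _ in range(n_cells)]
    total = sum(u)
    return [x / total for x in u]

def sampling_oracle(data, train_fn, metric_fns, subsample_size=200):
    """One oracle call: bootstrap a training set with random group/label cell
    ratios, train a model on it, and return its fairness-metric vector.

    `data` is a list of (x, s, y) triples with binary s and y; `train_fn`
    maps a list of triples to a predict function x -> yhat; each function in
    `metric_fns` maps (y_true, y_pred, s) to a fairness value."""
    # Partition the data into the four (group, label) cells of Figure 2.
    cells = {(g, y): [] for g in (0, 1) for y in (0, 1)}
    for x, s, y in data:
        cells[(s, y)].append((x, s, y))
    # Bootstrap roughly w_i * n' points (with replacement) from each cell.
    ratios = sample_ratios()
    boot = []
    for w, cell in zip(ratios, cells.values()):
        if cell:
            boot.extend(random.choices(cell, k=max(1, round(w * subsample_size))))
    # Train the sampled classifier on the bootstrapped subset and audit it.
    model = train_fn(boot)
    ys = [y for _, _, y in data]
    ss = [s for _, s, _ in data]
    preds = [model(x) for x, _, _ in data]
    return [m(ys, preds, ss) for m in metric_fns]
```

Because each call trains only on a small bootstrapped subset, repeated calls are cheap, and the random cell ratios induce the desired variation over the space of fairness values.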
Finding the Representative Fairness Metrics using Correlations
To discover the representative subsets of fairness metrics that are highly correlated, we utilize the correlation estimations from our proposed Monte-Carlo sampling approach described in the previous sections.
Consider a complete graph of $\ell$ vertices (one per fairness metric $f_i$), where the weight of the edge between nodes $i$ and $j$ is equal to their correlation $\bar{r}_{i,j}$. The goal is to identify subsets of vertices such that the within-subset positive correlations and the between-subset negative correlations are maximized. This problem is proven to be NP-complete. The well-known approximation algorithm proposed in \cite{bansal2004correlation} provides a constant approximation ratio for this problem.
Considering the complete graph of correlations where the edge weights are in $\{+, -\}$, the algorithm first selects one of the nodes as the pivot, uniformly at random. Next, all the nodes that are connected to the pivot with a $+$ edge are added to the cluster of the pivot. The algorithm then removes the already-clustered nodes from the graph and repeats the same process by selecting the next pivot, until all nodes are clustered.
In order to adapt this algorithm for finding the representative fairness metrics, we consider a threshold $\theta$. Then, after selecting the pivot (a fairness metric $f_i$), we connect each fairness metric $f_j$ to the cluster of $f_i$ if $\bar{r}_{i,j} \geq \theta$.
Moreover, in order to find a small subset $\mathcal{R}$, we repeat the algorithm multiple times and return the smallest number of subsets. For every cluster of metrics, the pivot is added to the set of representative metrics $\mathcal{R}$.
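The adapted pivot procedure with the threshold $\theta$ and repeated restarts can be sketched as follows (an illustrative version with our own naming, not the paper's implementation):

```python
import random

def pivot_clusters(metrics, corr, theta=0.7, trials=10, seed=0):
    """Randomized pivot clustering with a correlation threshold.

    `corr[i][j]` is the estimated correlation between metrics i and j.
    The algorithm is repeated `trials` times and the clustering with the
    fewest clusters is kept; each pivot is a representative metric."""
    rng = random.Random(seed)
    best = None
    for _ in range(trials):
        remaining = list(range(len(metrics)))
        clusters = []
        while remaining:
            pivot = rng.choice(remaining)
            # Metrics correlated with the pivot above theta join its cluster.
            cluster = [m for m in remaining
                       if m == pivot or corr[pivot][m] >= theta]
            clusters.append((pivot, cluster))
            remaining = [m for m in remaining if m not in cluster]
        if best is None or len(clusters) < len(best):
            best = clusters
    reps = [metrics[p] for p, _ in best]
    return reps, [[metrics[m] for m in c] for _, c in best]
```

With two strongly intra-correlated groups of metrics and weak cross-correlations, any pivot choice recovers the two groups, and the pivots form the representative set $\mathcal{R}$.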
Experiments
Datasets
Our empirical results are based on the benchmark datasets in the fair-ML literature, available through the AIF360 toolkit \cite{aif360} (https://aif360.readthedocs.io/en/latest/):
COMPAS (ProPublica, https://bit.ly/35pzGFj): published by ProPublica \cite{propublica}, this dataset contains information on juvenile felonies, such as marriage status, race, age, prior convictions, etc. We normalize the data so that it has zero mean and unit variance, and filter the dataset to black and white defendants; after filtering, it contains 5,875 records. We use the two-year violent recidivism record as the true label of recidivism: $y=1$ if the recidivism is greater than zero and $y=0$ otherwise. We consider race as the sensitive attribute.
Adult (UCI repository, https://bit.ly/2GTWz9Z): contains the income of 45,222 individuals extracted from the 1994 census data, with attributes such as age, occupation, education, race, sex, marital-status, native-country, and hours-per-week. We use income (a binary attribute with values $\leq 50K$ and $>50K$) as the true label $y$. The attribute sex is considered as the sensitive attribute.
German Credit Data (UCI repository, https://bit.ly/36x9t8o): includes 1,000 individuals' credit records, containing attributes such as marital status, sex, credit history, employment, and housing status. We consider both sex and age as the sensitive attributes, and credit rating (0 for bad customers and 1 for good customers) as the true label $y$ for each individual.
Bank Marketing (UCI repository, https://archive.ics.uci.edu/ml/datasets/Bank+Marketing): published by \cite{moro2014data}, this data is related to the direct marketing campaigns (phone calls) of a Portuguese banking institution. The classification goal is to predict whether a client will subscribe to a term deposit (variable $y$). The dataset contains 41,188 records and 20 attributes, collected from May 2008 to November 2010. We consider age as the sensitive attribute.
(Figure: using Logistic Regression to illustrate that correlations are data-dependent.)
Performance Evaluation
In order to estimate between-fairness correlations using our proposed Monte-Carlo method, we use 1,000 sampled models for each round and repeat the estimation process 30 times. Our proposed approaches are evaluated using a set of common classifiers: Logistic Regression (Logit), Random Forest (RF), K-nearest Neighbors (KNN), Support Vector Machines (SVM) with a linear kernel, and Neural Networks (NN) with one dense layer. Our findings are transferable to other classifiers.

Correlation estimation quality: We begin our experiments by evaluating the performance of our correlation estimation method. Recall that in every iteration of our Monte-Carlo method, the algorithm uses the sampling oracle to sample classifiers and evaluates the correlations between pairs of fairness notions, for which we use 1K samples. To see the impact of the number of estimation iterations on the estimation variance and confidence error, we varied the number of iterations from 2 to 30. Since the number of estimated pairs is quadratic in the number of metrics, we (arbitrarily) pick one pair of notions and provide the results for it; to be consistent across the experiments, we fixed the same pair for all datasets/models/settings. We confirm that the results and findings for other pairs of notions are consistent with what is presented (complementary results are provided in the appendix). Figure 7 provides the results for the COMPAS (a), Adult (b), Credit (c), and Bank (d) datasets. Looking at the figure, one can confirm the stable estimation and small confidence-error bars, which demonstrate the high accuracy of our estimation. Also, as the number of iterations increases, the estimation variance and confidence error significantly decrease.
Impact of data/model on correlation values: In this paper, we proposed to identify representative fairness metrics for a given context (data and model). The underlying assumption behind this proposal is that correlations are data- and model-dependent. Having evaluated our correlation estimation quality, in this experiment we verify that this assumption is valid. To do so, we first fix the dataset to see if the correlations are model-dependent (Figure 11) and then fix the model to see if the correlations are data-dependent (Figure 15). First, we confirm that the results for other datasets/models are consistent with what is presented here. Looking at Figure 11, it is clear that correlations are model-dependent. In particular, generally speaking, the complex boundaries of the non-linear models (e.g., NN) reduce the correlation between fairness metrics, compared to linear models (e.g., Logit). Similarly, Figure 15 verifies that correlations are data-dependent. This is because different datasets represent different underlying distributions, with different properties impacting the fairness values.
Number of representative metrics: Next, we evaluate the impact of the threshold parameter $\theta$, used for identifying the representative metrics, on the number of representatives $|\mathcal{R}|$. Figure 20 presents the results for various values of the threshold for each ML model on the COMPAS, Adult, Credit, and Bank datasets. We observe that as $\theta$ increases, the number of subsets increases. For non-linear classifiers such as NN, the number of subsets is relatively larger, due to the sensitivity of the non-linear decision boundaries to the subset of samples in the training set; in such situations, the fairness metrics are less correlated. In general, fairness metrics of linear decision boundaries are more correlated. Although a similar overall pattern can be observed from one dataset to another, the numbers of subsets of representatives are different. The results indicate that the estimated correlations are model-dependent.
In our next experiment, Figure 25 presents the number of representative subsets of fairness metrics, $|\mathcal{R}|$, for different datasets, fixing the ML model. We observe that, given a model, as $\theta$ increases the size of $\mathcal{R}$ increases. The non-linear models, as expected, require more subsets. The results indicate that the estimated correlations are data-dependent.
To provide an example of representative subsets of fairness metrics, in Figure 28 we illustrate a graph with nodes representing the fairness metrics, orange nodes indicating the representative metric of each subset, and edges showing the subsets of metrics that are highly correlated. We used a fixed threshold for this plot. The graph confirms that the number of subsets is smaller using the Logit model than using NN, as previously discussed. Similarly, Figure 31 shows the representative metrics for the Credit dataset using the Logit and NN models. Discovery Ratio, Predictive Parity, Equality of Opportunity, and Average Odds Difference are selected as the representative metrics for the Credit dataset when we use the Logit classifier.
Related work
Algorithmic fairness has been studied extensively in recent years \cite{corbett2017algorithmic, kleinberg2018algorithmic}. Various fairness metrics have been defined in the literature to address the inequities of algorithmic decision making from different perspectives; \cite{barocas2017fairness} and \cite{verma2018fairness} define the different fairness notions in detail. The majority of works focus on fairness considerations in the different stages of predictive modeling, including pre-processing \cite{feldman2015certifying, kamiran2012data, calmon2017optimized}, in-processing \cite{calders2010three, zafar2015fairness, asudeh2019designing}, and post-processing \cite{pleiss2017fairness, feldman2015certifying, stoyanovich2018online, hardt2016equality}, to mitigate bias in the outcome. Furthermore, the proposed interventions are tied to a specific notion of fairness: statistical parity \cite{calders2010three}, equality of opportunity \cite{hardt2016equality}, disparate impact \cite{feldman2015certifying}, etc. A few recent works discuss the challenge of choosing the appropriate fairness metric for bias mitigation. \cite{makhlouf2021applicability} surveys notions of fairness and discusses the subjectivity of different notions for a set of real-world scenarios. The challenges of a growing pool of fairness metrics for unfairness mitigation, and some aspects of the relationships between fairness metrics, are highlighted in \cite{castelnovo2021zoo} with respect to the distinctions of individual vs. group and observational vs. causality-based; as a result, the authors strongly promote quantitative research on fairness-metric assessment for bias mitigation. Building on previous works \cite{kleinberg2016inherent, chouldechova2017fair}, \cite{garg2020fairness} provides a comparative analysis using mathematical representations to discuss the trade-offs between some of the common notions of fairness.
Final Remarks
The abundance, trade-offs, and details of fairness metrics are a major challenge for the responsible practice of machine learning by ordinary data scientists. To alleviate the overwhelming task of selecting a subset of fairness measures to consider for a context (a dataset and a model type), we proposed a framework that, given a set of fairness notions of interest, estimates the correlations between them and identifies a subset of notions that represents the others. Our comprehensive experiments on benchmark datasets and different classification models verify the validity of our proposal and the effectiveness of our approach.
References
Appendix A Pseudocodes
Appendix B Complementary Experiment Results
Figures 10 (a) and (b) show similar results for the correlation estimates, using the Credit dataset with $s$ = age, referred to as "Credit-age".
Figures 11 and 12 demonstrate a comprehensive comparison of the number of representative subsets between different models, using all of the datasets (including Credit-age). As we can observe, and as discussed before, the numbers of representative subsets are model-dependent.
Figure 13 illustrates the correlation estimations for the Credit-age dataset. Note that in (c), the correlation estimates for certain pairs are missing when we use NN. The reason for the missing correlations is NA values: when a confusion-matrix count is zero for every sampled model, ratio-based metrics such as the FNR ratio are undefined, and the corresponding disparity-based metrics, such as the FOR disparity, are identically zero, so no correlation can be computed for those pairs.