Metrics and methods for a systematic comparison of fairness-aware machine learning algorithms

Understanding and removing bias from the decisions made by machine learning models is essential to avoid discrimination against unprivileged groups. Despite recent progress in algorithmic fairness, there is still no clear answer as to which bias-mitigation approaches are most effective. Evaluation strategies are typically use-case specific, rely on data with unclear bias, and employ a fixed policy to convert model outputs to decision outcomes. To address these problems, we performed a systematic comparison of a number of popular fairness algorithms applicable to supervised classification. Our study is the most comprehensive of its kind. It utilizes three real and four synthetic datasets, and two different ways of converting model outputs to decisions. It considers the fairness, predictive performance, calibration quality, and speed of 28 different modelling pipelines, corresponding to both fairness-unaware and fairness-aware algorithms. We found that fairness-unaware algorithms typically fail to produce adequately fair models and that the simplest algorithms are not necessarily the fairest ones. We also found that fairness-aware algorithms can induce fairness without material drops in predictive power. Finally, we found that dataset idiosyncrasies (e.g., degree of intrinsic unfairness, nature of correlations) do affect the performance of fairness-aware approaches. Our results allow the practitioner to narrow down the approach(es) they would like to adopt without having to know in advance their fairness requirements.




1. Introduction

The era of “big data” has seen industry seek competitive advantage through the monetisation of data assets and the use of modern machine learning methods in automated decisioning systems. However, there remain key questions on the impact these large-scale industrial algorithmic decisioning processes have on society (O'Neil, 2016; Zarsky, 2012). One of these questions is: are these algorithms, which are often built from historic data with unknown sample and systematic biases, making fair decisions? This question has recently attracted significant attention, highlighted by media scandals on systematic unfairness across many sectors, including criminal recidivism (Chouldechova, 2017), recruitment (Cohen et al., 2019; Mujtaba and Mahapatra, 2019) and credit-worthiness assessments.

These cases have demonstrated that “fairness through unawareness” (Clarke, 2005; Dwork et al., 2012) is ineffective and misleading. This realization spawned a wave of research on new approaches for tackling unfairness in machine learning. Advances in the field now enable practitioners to specify the unprivileged groups of concern and employ fairness interventions at all points of the machine learning training process. Despite this progress, questions remain on the effectiveness of each intervention and how these techniques should be integrated into a single model-building framework. This is due to the number of potential sources of bias, the different and sometimes incompatible measures of fairness (Kleinberg, 2018; Verma and Rubin, 2018; Mehrabi et al., 2019; Yeom and Tschantz, 2019; Kusner et al., 2017; Chiappa and Gillam, 2018), the multitude of fairness interventions available, and the effects of decision policies employed to convert model outputs to decisions. The modeller must also address instances where fairness constraints and the machine learning optimisation criteria form competing goals, and the imposition of fairness has a performance cost.

In this paper, we address each of these problems and perform a systematic comparison of mainstream and convenient-to-use fairness interventions. In particular, we look for:

  • good quality, open-source code,

  • documentation,

  • usability,

  • computational efficiency.

The comparison focusses on the case of binary classification with a single binary protected attribute. Such protected attributes are usually, though not exclusively, ones against which discrimination is prohibited by law (for example, under the UK's 2010 Equality Act).

To facilitate the comparison, we introduce the metric fair efficiency aimed to remove the effects of the decision threshold policy and to quantify the fairness-performance trade-off in a single value. Such a metric allows for a comparison of methods along a single axis, thereby solving one of the key issues in measuring the performance of fairness interventions.

Finally, for comparison’s sake, we propose a novel, simple approach called Fair Feature Selection that selects features taking both their predictive power and fairness into account.

The structure of this paper is as follows: in Section 2 we provide additional background and further motivate our approach. In Section 3 we introduce our novel metric, fair efficiency. Section 4 covers the experimental methodology, with results in Section 5 and discussion in Section 6. Our conclusions are then stated in Section 7. Finally, Fair Feature Selection is described in the Appendix.

2. Related work

Advances in algorithmic fairness and the development of interventions have correspondingly resulted in several fairness evaluation and comparative works. Interventions can broadly be categorised into: preprocessing of data (Kamiran and Calders, 2012; Feldman et al., 2015; Calmon et al., 2017; Zemel et al., 2013), post-processing of the model outputs (Hardt et al., 2016; Pleiss et al., 2017; Kamiran et al., 2012), and incorporating fairness constraints directly into model training (Zhang et al., 2018; Manisha and Gujar, 2018; Goel et al., 2018; Celis et al., 2019; Calders and Verwer, 2010; Agarwal et al., 2018) (i.e., in-training). The incorporation of fairness constraints into model training often takes the form of convex regularizations (Manisha and Gujar, 2018; Beutel et al., 2019; Di Stefano et al., 2020), adversarial learning (Zhang et al., 2018; Hickey et al., 2020) or recasting the model-training process as one of constrained optimization (Zafar et al., 2017, 2017a, 2017b; Olfat and Aswani, 2018). Other approaches utilize generative (or at least partly generative (Nabi and Shpitser, 2018)) models to identify and remove path-specific effects to ensure counterfactual fairness (Kilbertus et al., 2017; Kusner et al., 2017; Chiappa and Gillam, 2018; Nabi and Shpitser, 2018; Xu et al., 2019; Zhang et al., 2017). We do not examine these latter generative approaches in detail. Instead, our study focusses on the more common use case of discriminative modelling in combination with statistical measures of group fairness.

Ref. (Mehrabi et al., 2019) provides a broad survey of bias sources and fairness in machine learning, creating a taxonomy for fairness definitions and analysing domain-specific case studies. A general methodology for exploring potential dataset biases was developed in (Tramér et al., 2017). Ref. (Galhotra et al., 2017) designed test cases to automatically identify sources of discrimination in black-box decisioning procedures. Similarly, (Zehlike et al., 2017) provides a framework to test underlying machine learning algorithms using a variety of datasets, fairness measures, and statistical tests. These works provide testing and evaluation metrics but are not designed for the evaluative comparison of multiple fairness interventions on an equal footing, like this work.

Ref. (Friedler et al., 2019) evaluates the effects of data preprocessing, train-test splits and the formulation of fairness criteria in the algorithmic interventions. However, this work does not explicitly account for the role of the decision-threshold policy that is typically applied to the outputs of models. Furthermore, the comparison between interventions was pointwise in nature and ignored how the metrics changed as the fairness hyper-parameters changed.

3. Evaluating Fairness

3.0.1. Notation

Consider a modelling dataset $\{(x_i, z_i, y_i)\}_{i=1}^n$, where $(x_i, z_i, y_i)$ is the $i$th tuple respectively of: features, binary protected attribute (i.e., $z_i \in \{0, 1\}$), and binary target label (i.e., $y_i \in \{0, 1\}$). When $z_i = 0$ the individual is in an unprivileged group, and when $z_i = 1$ they are in the privileged group. An algorithm predicts a score $s_i$ which is compared to a threshold $t$ to assign binary labels to each dataset tuple, i.e. the predicted label is given by $\hat{y}_i = \mathbb{1}[s_i > t]$, where $\mathbb{1}$ is the indicator function. Our convention is that $\hat{y}_i = 1$ represents the favourable outcome. Examples of favourable outcomes include the approval of a loan application, a job application, a credit-limit increase, bail, or parole.

Finally, a fairness-aware algorithm may have a continuous or binary fairness parameter $\theta_f \in [0, 1]$ which trades off fairness and classification performance. A value of $\theta_f = 0$ is utilized here to correspond to no fairness interventions, while a value of $\theta_f = 1$ corresponds to the maximal strength of such interventions.

3.0.2. Fairness Metrics

Our aim is to measure two quantities: a predictive-performance metric quantifying how well the output scores predict the true labels, and a fairness metric determining how symmetric (i.e., unbiased or unprejudiced) the predicted labels are with respect to the protected attribute. The two classical fairness metrics we use are Equality of Opportunity (EO) (Roemer and Trannoy, 2015) and Disparate Impact (DI), the latter also referred to as “Adverse Impact.”

Equality of Opportunity is defined as:

$$\mathrm{EO} = 1 - |\mathrm{TPR}_u - \mathrm{TPR}_p|,$$

where $\mathrm{TPR}_u$ and $\mathrm{TPR}_p$ are the true positive rates (proportion of correctly predicted positives) of the unprivileged and privileged classes, respectively.

Disparate Impact is defined as:

$$\mathrm{DI} = \frac{P(\hat{y} = 1 \mid z = 0)}{P(\hat{y} = 1 \mid z = 1)},$$

where $P(\hat{y} = 1 \mid z = 0)$ is the probability of a positive prediction for the unprivileged group, and $P(\hat{y} = 1 \mid z = 1)$ is the corresponding probability for the privileged group.

These are two of the most important fairness metrics, with the latter also having particular legislative significance in the United States (Barocas and Selbst, 2016). Typically, a DI value of less than 0.8 is taken as an indicator of unwarranted discrimination and initiates further investigation (the “four-fifths rule”). This rule originates from the Uniform Guidelines on Employee Selection Procedures, whereby a selection process is deemed unfair if the success rate of the disadvantaged group is less than 80% of the success rate of the advantaged group. In many cases, DI is not optimized directly but through a related metric known as Statistical Parity Difference (SPD), defined as the difference in the probability of a positive prediction between the two groups, i.e.

$$\mathrm{SPD} = P(\hat{y} = 1 \mid z = 0) - P(\hat{y} = 1 \mid z = 1).$$
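As an illustration, the three group metrics above can be computed directly from binary predictions. A minimal sketch (function and variable names are our own):

```python
import numpy as np

def group_fairness_metrics(y_true, y_pred, z):
    """Compute DI, SPD and EO from binary predictions.

    Convention as in the text: z == 0 marks the unprivileged group,
    z == 1 the privileged group, and y_pred == 1 the favourable outcome.
    """
    y_true, y_pred, z = (np.asarray(a) for a in (y_true, y_pred, z))
    p_u = y_pred[z == 0].mean()  # P(y_hat = 1 | unprivileged)
    p_p = y_pred[z == 1].mean()  # P(y_hat = 1 | privileged)
    di = p_u / p_p               # disparate impact (four-fifths rule: DI >= 0.8)
    spd = p_u - p_p              # statistical parity difference
    # Per-group true positive rates; EO scaled here so that 1 is optimal.
    tpr_u = y_pred[(z == 0) & (y_true == 1)].mean()
    tpr_p = y_pred[(z == 1) & (y_true == 1)].mean()
    eo = 1.0 - abs(tpr_u - tpr_p)
    return di, spd, eo
```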

3.1. Fair efficiency

One central question naturally arises: how can we benchmark algorithms without a priori knowledge of the appropriate fairness parameter $\theta_f$ and classification threshold $t$? One can either assume a specific combination of $\theta_f$ and $t$. Alternatively, one can utilize a “policy-agnostic” approach, which is facilitated by our novel metric, fair efficiency.

Let us introduce a helper integral,

$$I[\mu] = \int_0^1 \int_0^1 \mu(\theta_f, t)\, \mathrm{d}\theta_f\, \mathrm{d}t,$$

defined for any metric $\mu$ measuring fairness or predictive power. This integral considers all values of $\mu$ corresponding to the full range $[0, 1]$ of all combinations of $\theta_f$ and $t$. For the cases where $\theta_f$ is discrete, the integral over $\theta_f$ is replaced by an appropriate weighted sum over its values. For fairness-unaware algorithms, which do not have a $\theta_f$, $I[\mu]$ can be replaced with a single-point estimate. The metrics $\mu$ must be scaled so that they range from 0 to 1 and so that optimal performance corresponds to a value of 1.

Now, consider a predictive-performance metric $\mu$ (e.g., accuracy) and a fairness metric $\nu$ (e.g., DI or EO). How can we jointly evaluate $\mu$ and $\nu$ across all values of $\theta_f$ and $t$? To accomplish this, we introduce the fair efficiency $F_{\mu\nu}$, defined as the harmonic mean between $I[\mu]$ and $I[\nu]$:

$$F_{\mu\nu} = \frac{2\, I[\mu]\, I[\nu]}{I[\mu] + I[\nu]}.$$

Fair efficiency penalises models that score highly for fairness but are not predictive, and vice versa. If the fairness and performance metrics are scaled so that a value of 0 (1) is maximally unfavourable (favourable), then $F_{\mu\nu}$ follows the same pattern. Specifically, $F_{\mu\nu} = 0$ occurs when a model is either maximally unfair or maximally non-predictive, whereas the optimum $F_{\mu\nu} = 1$ occurs when the model is maximally predictive and fair.
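A sketch of this computation, assuming both metric functions are already scaled to [0, 1] with 1 optimal, and approximating the integrals by averages over a finite (theta_f, t) grid:

```python
import numpy as np

def fair_efficiency(metric_perf, metric_fair, thetas, thresholds):
    """Approximate fair efficiency on a grid.

    metric_perf(theta, t) and metric_fair(theta, t) return values in
    [0, 1] with 1 optimal.  Each metric is averaged over the full
    (theta_f, t) grid, and the two averages are combined with a harmonic
    mean, so a maximally unfair *or* maximally non-predictive model scores 0.
    """
    perf = np.mean([[metric_perf(th, t) for t in thresholds] for th in thetas])
    fair = np.mean([[metric_fair(th, t) for t in thresholds] for th in thetas])
    if perf + fair == 0.0:
        return 0.0
    return 2.0 * perf * fair / (perf + fair)
```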

4. Experimental Setup

The aim of this evaluation is to compare a selection of practical fairness approaches across a diverse range of benchmark datasets. We will use standard metrics such as EO and DI, precision, accuracy, the area under the Receiver Operating Characteristic curve (AUC), as well as our policy-agnostic fair efficiency.

We consider algorithms that are readily available for practitioners to use. This means they are open source, have good documentation, run efficiently, and have adopted, or are easily adapted to, a standard fit/predict API. The latter criteria ensure that model selection and evaluation are easy to conduct in practice. We further require that the available implementations output a continuous score and that the fairness intervention be controlled by a single continuous or binary parameter, $\theta_f$. The former requirement excludes the MetaFairClassifier (Celis et al., 2019; Bellamy et al., 2018), the ART Classifier (Bellamy et al., 2018) and the GerryFairClassifier (Kearns et al., 2018) from the AI Fairness 360 (AIF360 (Bellamy et al., 2018)) library. Other algorithms, such as Fair representations (Zemel et al., 2013; Bellamy et al., 2018), violated the latter requirement or had implementations that were not very usable at the time of testing. Finally, we only evaluate fairness interventions applied at the preprocessing and training stages; post-processing interventions are out of scope for this work.

The first set of algorithms we consider are common fairness-unaware algorithms: logistic regression (Benchmark LR), boosted trees (Benchmark BT), the naive Bayes classifier (Benchmark NB), support vector machines (Benchmark SVM) (Boser et al., 1992), random forests (Breiman, 2001) (Benchmark RF), and boosted tree ensembles such as XGBoost (Chen and Guestrin, 2016) (Benchmark XGB) and LightGBM (Ke et al., 2017) (Benchmark LGB). These algorithms form a benchmark for achievable performance and fairness on our datasets without applying any fairness interventions. They are optimized across a range of hyperparameters in this comparison; e.g., logistic regression is considered in its unregularized form and in combination with l1, l2 and elastic-net regularization.

The fairness preprocessing techniques we consider are combined with a variety of mainstream classifiers. Specifically, we examine instance reweighing (Kamiran and Calders, 2012) and the disparate impact remover (Feldman et al., 2015) (DIRemover). The Reweigher is taken directly from AIF360, while the DIRemover is an implementation equivalent to AIF360's but with a tunable $\theta_f$. Reweighing is a two-state fairness algorithm, in which $\theta_f = 1$ is equivalent to full reweighing and, conversely, $\theta_f = 0$ indicates no reweighing. Although technically binary in nature, we still vary $\theta_f$ across the full range and compute the integrated metrics as described in Section 3.1. In addition, we also consider our novel fair feature-selection method (see Appendix A), to reflect how industry practitioners may attempt to tackle unfairness and bias through appropriate feature selection. The DIRemover debiases data by constructing new features, $\tilde{x}_i$, from completely debiased “fair” features $\bar{x}_i$ and the original inputs $x_i$. The debiased data is then given by $\tilde{x}_i = \theta_f \bar{x}_i + (1 - \theta_f) x_i$.
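The Kamiran–Calders reweighing scheme assigns each (group, label) cell the weight P(z)P(y)/P(z, y), which makes the label statistically independent of the protected attribute under the weighted distribution. A minimal sketch:

```python
import numpy as np

def reweighing_weights(y, z):
    """Instance weights per Kamiran & Calders (2012).

    Each (group z, label y) cell receives weight P(z)P(y)/P(z, y); cells
    that are over-represented relative to independence are down-weighted.
    """
    y, z = np.asarray(y), np.asarray(z)
    w = np.ones(len(y), dtype=float)
    for zv in (0, 1):
        for yv in (0, 1):
            mask = (z == zv) & (y == yv)
            p_joint = mask.mean()
            if p_joint > 0:
                w[mask] = (z == zv).mean() * (y == yv).mean() / p_joint
    return w
```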

The in-training fairness algorithms we consider include (a) SPD-regularized neural networks (Di Stefano et al., 2020) (FR), (b) Lagrange-reduced boosted trees and logistic regression (Agarwal et al., 2018) (LagRed BT and LagRed LR), and (c) logistic regressions with fairness-constraint optimization (Zafar et al., 2017) (DI LR 1 and DI LR 2).

For LagRed BT and LR, the fairness-violation tolerance is mapped to $\theta_f$. When $\theta_f = 1$, no unfairness, as measured by DP or EO, is allowed. Conversely, $\theta_f = 0$ corresponds to an unconstrained Lagrange reduction model. The logistic regressions DI LR 1 and DI LR 2 define unfairness as the covariance between the protected attribute and the signed distance to the decision boundary. We consider two optimization forms: one where we optimize the logistic regression loss with the covariance constrained to a bound (DI LR 1), and another where we minimize the covariance subject to the loss being within a tolerance of the optimal loss (DI LR 2). DI LR 1 does not have a natural $\theta_f$ parameter and so simply captures the trade-off between fairness and performance over all thresholds but not regularisation parameters.
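The boundary-covariance proxy used by DI LR 1/2 can be sketched as follows; for a linear model the signed distance is simply the linear score, and the names here are ours:

```python
import numpy as np

def boundary_covariance(z, signed_dist):
    """Zafar et al.'s proxy for disparate impact: the empirical covariance
    between the protected attribute z and the signed distance to the
    decision boundary.  Driving this toward zero decorrelates decisions
    from group membership."""
    z = np.asarray(z, dtype=float)
    d = np.asarray(signed_dist, dtype=float)
    return np.mean((z - z.mean()) * d)
```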

The neural network (FR) is a 3-layer fully connected network with the number of units in each layer selected during hyper-parameter optimisation; the subscript denotes the layers where fairness-regularization losses are applied.

The complete list of machine learning algorithms and fairness interventions considered can be found in Table 1.

name | type | package
Logistic regression (Benchmark LR) | benchmark | sklearn
Naive Bayes classifier (Gaussian) (Benchmark NB) | benchmark | sklearn
Support vector classifier (RBF kernel) (Benchmark SVM) | benchmark | sklearn
Random forest classifier (Benchmark RF) | benchmark | sklearn
Boosted tree ensemble (Benchmark BT) | benchmark | sklearn
XGBoost (Benchmark XGB) | benchmark | xgboost
LightGBM (Benchmark LGB) | benchmark | lightgbm
Reweighing + random forest (Reweigh RF) | pre-pro (on/off) | aif360
Reweighing + sklearn boosted trees (Reweigh BT) | pre-pro (on/off) | aif360
Reweighing + LightGBM (Reweigh LGB) | pre-pro (on/off) | aif360
Reweighing + logistic regression (Reweigh LR) | pre-pro (on/off) | aif360
Reweighing + Gaussian naive Bayes (Reweigh NB) | pre-pro (on/off) | aif360
Reweighing + support vector classifier (RBF kernel) (Reweigh SVM) | pre-pro (on/off) | aif360
Reweighing + XGBoost (Reweigh XGB) | pre-pro (on/off) | aif360
DIRemover + random forest (DIR RF) | pre-pro | n/a
DIRemover + boosted trees (DIR BT) | pre-pro | n/a
DIRemover + LightGBM (DIR LGB) | pre-pro | n/a
DIRemover + logistic regression (DIR LR) | pre-pro | n/a
DIRemover + Gaussian naive Bayes (DIR NB) | pre-pro | n/a
DIRemover + support vector classifier (RBF kernel) (DIR SVM) | pre-pro | n/a
DIRemover + XGBoost (DIR XGB) | pre-pro | n/a
Fair feature selection | pre-pro | n/a
Neural network + SPD regularisation (FR DI) | in-train | n/a
Lagrange reduced boosted trees (LagRed BT) | in-train | fairlearn
Lagrange reduced logistic regression (LagRed LR) | in-train | fairlearn
DI optimized logistic regression with boundary covariance constraints (DI LR 1) | in-train (on/off) | fair-classification
DI optimized boundary covariance logistic regression with loss constraints (DI LR 2) | in-train | fair-classification
Table 1. Summary of benchmark and fairness algorithms included. “in-train” means that fairness is enforced as part of the training process, and “pre-pro” means fairness is achieved through preprocessing of the training dataset. Package versions: aif360 0.2.2 (Bellamy et al., 2018), scikit-learn 0.21.3 (Pedregosa et al., 2011), fairlearn 0.2.0 (Dudik et al., 2020), xgboost 0.9.0, lightgbm 2.3.0.

4.1. Datasets

Dataset | Rows (k) | Features | Positive label rate | Protected attribute | Target SPD
Titanic | 1.3 | 22 | 0.37 | gender | 0.51
German | 1 | 21 | 0.70 | gender | 0.07
Adult | 48.8 | 14 | 0.24 | gender | 0.19
Adult (race) | 48.8 | 14 | 0.24 | race | 0.10
(synthetic) | 10 | 12 | 0.53 | z | 0.14
(synthetic) | 10 | 16 | 0.51 | z | 0.10
(synthetic) | 10 | 18 | 0.50 | z | 0.14
Table 2. Summary of datasets utilized in our comparison. To give an estimate of the “intrinsic” degree of unfairness of those datasets, we report the SPD of their target. As can be seen, the datasets are more-or-less well balanced label-wise.

We use a collection of 3 real and 4 synthetic datasets, summarised in Table 2. The Titanic dataset contains attributes of the passengers aboard the Titanic, which sank in 1912; the target indicates whether an individual survived. The Adult dataset contains socio-economic records on individuals, the target being whether they earn over $50K a year. The German dataset comprises features describing the financial information of a set of individuals, the target being an indicator of a future credit default.

Our synthetic datasets S-D, S-P, I-D, and I-P are generated according to Figure 1. A linear scheme is used: we first sample the protected attribute $z$ from a Bernoulli distribution, followed by sampling the covariates $x$ in various ways to reflect different kinds of correlation between the training covariates and $z$. Next, a latent log-odds variable $l$ linearly combines the covariates and the protected attribute $z$, with intercept $b$. Finally, we sample $y \sim \mathrm{Bernoulli}(\sigma(l))$, where $\sigma$ is the logistic function. The above procedure gives rise to four different datasets:

  • Simple - Direct (S-D) This dataset has a direct effect of $z$ on $y$, but no mediating effect through the $x$ variables. We sampled $x_j \sim \mathcal{N}(0, 1)$ and set $l = b + w_z z + \sum_j w_j x_j$, where the sum is over the scalar variables $x_j$.

  • Simple - Proxy (S-P) This dataset has no direct effect of $z$ on $y$, with the effect instead being mediated through some of the $x$ variables. We split $x$ into a set of “fair” variables $x^f$ and a set of “unfair” variables $x^u$ correlated with $z$, and set once again $l = b + \sum_j w_j x_j$.

  • Interactions - Direct (I-D) The direct effect of $z$ on $y$ is turned on by a set of binary interaction variables. Splitting $x$ into a set of fair variables $x^f$ and interaction variables $a$ sampled from a Bernoulli distribution, we set $l = b + \sum_j w_j x^f_j + z \sum_k w_k a_k$.

  • Interactions - Proxy (I-P) This case is similar to Simple - Proxy, but the effect of the unfair variables is here turned on by binary interaction variables. For simplicity, we used a single binary interaction variable $a$ sampled from a Bernoulli distribution. We therefore have $l = b + \sum_j w_j x^f_j + a \sum_k w_k x^u_k$.
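The S-D scheme above can be sketched as follows, under assumed, illustrative parameters (Bernoulli(0.5) protected attribute, unit covariate weights, direct-effect coefficient `w_z`); these are not the paper's actual values:

```python
import numpy as np

def sample_s_d(n=10_000, d=12, w_z=2.0, b=0.0, seed=0):
    """Sketch of the Simple-Direct (S-D) generation scheme.

    z ~ Bernoulli(0.5); standard-normal covariates x; latent log-odds
    l = b + w_z * z + sum_j x_j; label y ~ Bernoulli(sigmoid(l)).
    """
    rng = np.random.default_rng(seed)
    z = rng.binomial(1, 0.5, size=n)
    x = rng.normal(size=(n, d))
    logit = b + w_z * z + x.sum(axis=1)
    y = rng.binomial(1, 1.0 / (1.0 + np.exp(-logit)))
    return x, z, y
```

With a positive `w_z`, the privileged group receives the favourable label more often, giving the dataset a controllable degree of intrinsic unfairness.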

Although we do not characterise exactly how the bias is generated in the real datasets, a proxy for its magnitude is given by the target’s SPD (see Table 2). For the Adult dataset, we evaluate the algorithms twice using two distinct protected attributes, race and gender.

Figure 1. Generation schemes for the synthetic datasets. Note that line weights are an illustrative indication that coefficient weights can vary, but do not necessarily represent the weights in the actual datasets.

4.2. Methodology

Dataset | Train prop. | Test prop. | No. reps
Titanic and German | 0.5 | 0.5 | 5
Adult | 0.2 | 0.8 | 3
Synthetic | 0.4 | 0.6 | 4
Table 3. Training/test splits and number of repetitions.

4.2.1. Decision Policies

As shown in Section 3, the fairness metrics DI and EO we will be using here are defined against a binary outcome produced by comparing the probability output of the algorithm with a classification threshold $t$. The choice of $t$ (“the policy”) is important, as it affects both fairness and predictive-performance metrics. In this work, we perform comparisons using three approaches:

  1. “Argmax”: We use a fixed threshold of $t = 0.5$ for all models and datasets.

  2. “PPR”: We determine a threshold such that the positive predictive rate (PPR), i.e., the likelihood of a favourable outcome, matches a pre-determined target value within a fixed tolerance.

  3. Policy Free: We compare algorithms treating all possible classification thresholds equally. This is accomplished using our fair efficiency metric.
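The PPR policy (item 2 above) can be sketched as a scan over candidate thresholds, dropping the model when no threshold reaches the target rate within tolerance; the grid resolution and tolerance here are illustrative:

```python
import numpy as np

def ppr_threshold(scores, target_rate, tol=0.01):
    """Find a threshold whose positive prediction rate is closest to
    `target_rate`; return None if no threshold lands within `tol`
    (in which case the model is dropped from the comparison)."""
    scores = np.asarray(scores)
    candidates = np.linspace(0.0, 1.0, 101)
    rates = np.array([(scores > t).mean() for t in candidates])
    best = int(np.argmin(np.abs(rates - target_rate)))
    if abs(rates[best] - target_rate) > tol:
        return None
    return candidates[best]
```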

For the first two policies we use accuracy and precision to measure predictive performance, whereas for the latter Policy-Free approach we use AUC. Given that AUC is already integrated over $t$, the associated fair efficiency is given by:

$$F_{\mathrm{AUC},\nu} = \frac{2\, I_{\theta_f}[\mathrm{AUC}]\, I[\nu]}{I_{\theta_f}[\mathrm{AUC}] + I[\nu]},$$

where $I_{\theta_f}[\mathrm{AUC}]$ integrates AUC over $\theta_f$ only.

4.2.2. Evaluation flow

To evaluate the approaches, the individual datasets are split into train/test sets multiple times, see Table 3. For each train/test repetition, we fit fairness-aware models as follows:

  1. Optimization of ML-algorithm-specific hyperparameters: Using the training set, the optimal hyper-parameters for the model are determined using 3-fold cross-validation at $\theta_f = 0$, utilizing AUC as the predictive-performance metric.

  2. Model Training:

    • For all fairness-aware algorithms, we sweep $\theta_f$ from 0 to 1 and train models at each value.

    • The models are trained using the optimal hyper-parameters identified in Step 1, where possible.

  3. Evaluation of fairness:

    • For all trained models, the thresholds $t$ are swept from 0 to 1.

    • The predictive-performance and fairness metrics are computed for each $t$ and $\theta_f$.

    • The predictive-performance and fairness integrals $I[\mu]$ and $I[\nu]$ are first calculated on the test set, and then they are combined using a harmonic mean to produce the fair efficiencies. These metrics are independent of the threshold policy and utilize all $\theta_f$ values.

For fair feature selection, we use two fixed numbers of selected features and fit a random forest model on each resulting feature subset. A random forest is used as the estimator for its general utility and ease of fitting, and because it is also used in the other preprocessing intervention pipelines. For the benchmark models we use single-point estimates of fair efficiency.
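For illustration only (the paper's exact criterion is given in its Appendix), one plausible fair-feature-selection score ranks each feature by the harmonic mean of its predictive correlation with the target and one minus its correlation with the protected attribute:

```python
import numpy as np

def fair_feature_scores(X, y, z):
    """Hypothetical fair-feature-selection scoring (not the paper's
    exact criterion): score_j = harmonic mean of |corr(x_j, y)| and
    1 - |corr(x_j, z)|, penalising features that track the protected
    attribute even when they are predictive."""
    X = np.asarray(X, dtype=float)
    scores = []
    for j in range(X.shape[1]):
        pred = abs(np.corrcoef(X[:, j], y)[0, 1])   # predictive power
        fair = 1.0 - abs(np.corrcoef(X[:, j], z)[0, 1])  # fairness
        scores.append(2 * pred * fair / (pred + fair) if pred + fair else 0.0)
    return np.asarray(scores)
```

Features would then be selected in decreasing score order, down to the desired subset size.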

4.2.3. Matching the Decision Policies and Experimental Comparisons

The evaluation flow of the previous section produces many metrics for each $(\theta_f, t)$ combination. These metrics need to be extracted to match the Argmax and PPR policies. The determination of $t$ for each model happens as follows:

  • For Argmax, we use $t = 0.5$.

  • For PPR, we scan across all $\theta_f$ and $t$ combinations to find the threshold whose acceptance rate is closest to the target acceptance rate. This scan is across all training dataset repetitions. If the average acceptance rate across all repetitions is not within the tolerance of the target, then the model has not met the PPR criteria and is dropped. We do not report any further results for these dropped models.

  • For the Policy-Free approach, all values of $t$ are considered.

Given the appropriate $\theta_f$ and $t$ combinations for PPR and Argmax, we perform several comparisons. The fairness preprocessing interventions are initially compared separately in an on/off manner. This approach allows us to examine the maximum fairness gains one can expect from these preprocessing techniques. As the DIRemover has a tunable fairness parameter, we compare this algorithm at maximum $\theta_f$, in conjunction with the benchmark models, to the performance of the benchmark models in isolation. The former case is referred to as the fairness “on” state and the latter fairness “off”. Similarly, we compare the instance reweigher plus a benchmark classifier at maximum and minimum $\theta_f$.

For all fairness-aware interventions, and for the Argmax and PPR policies, we select a value of $\theta_f$ that is consistent with a “fairness budget”. Specifically, for each approach and dataset we identify the $\theta_f$ value corresponding to both maximum accuracy and a DI of at least 0.8 on at least one training dataset repetition. Given these $\theta_f$, we observe the mean DI/EO and performance metrics across the train/test repetitions. For the Policy-Free approach, which is facilitated by our fair efficiency metric, there is no particular choice of $\theta_f$, as all values are considered equally.
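This "fairness budget" selection can be sketched as a filter-then-argmax over per-theta_f results; the record format and the 0.8 floor are our illustrative assumptions:

```python
def select_theta(results, di_floor=0.8):
    """Among (theta_f, accuracy, DI) records, keep those meeting the DI
    floor (four-fifths rule) and return the theta_f with the highest
    accuracy, or None if no setting qualifies."""
    eligible = [r for r in results if r["di"] >= di_floor]
    if not eligible:
        return None
    return max(eligible, key=lambda r: r["accuracy"])["theta_f"]
```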

Finally, unless stated otherwise, the reported metrics are averaged across the train/test repetitions of each dataset.

5. Results

5.1. Argmax and PPR Policies

5.1.1. Benchmark models

Figure 2. Fairness and performance metrics of benchmark models across all datasets under Argmax and PPR threshold policies.

The benchmark models’ precision, accuracy, DI and EO are shown in Figure 2. As can be seen, in most cases the predictive powers of the benchmark models were broadly consistent given a particular dataset and policy. Fairness-wise, while EO was mostly consistent across models and datasets, the variations in DI across datasets were strong. Importantly, in the vast majority of cases, benchmark models failed to surpass the 0.8 DI threshold corresponding to the “four-fifths rule”.

Observing how the different algorithms performed, we notice that simple logistic regression was not, overall, the fairest algorithm. On the contrary, as reported in Table 4 (for the Argmax policy), SVM was the overall fairest benchmark model.

Dataset | Best model (EO) | Best model (DI)
Titanic | Benchmark SVM | Benchmark SVM
German | Benchmark LR | Benchmark RF
Adult (gender) | Benchmark NB | Benchmark NB
Adult (race) | Benchmark RF | Benchmark NB
(synthetic) | Benchmark SVM | Benchmark SVM
(synthetic) | Benchmark SVM | Benchmark SVM
(synthetic) | Benchmark SVM | Benchmark SVM
(synthetic) | Benchmark SVM | Benchmark SVM
Table 4. Fairest benchmark models on each dataset using the Argmax policy.

5.1.2. Preprocessing approaches

Figure 3. DI for preprocessor interventions, in “on/off” comparison, across all datasets for Argmax and PPR on the test datasets.

We begin with a binary comparison (see Section 4.2.3) of the fairness preprocessing techniques, namely, DIRemover and Reweighing. For most combinations of policy, approach, and underlying machine learning algorithm we found a significant uplift in DI when the fairness intervention is applied (see Figure 3). Comparing the uplifts across datasets, underlying methods, and policies, we observe high variance. In general, the Argmax policy allowed for more consistent uplifts across datasets and underlying ML algorithms, whereas the uplifts under the PPR policy exhibited more erratic behaviour. We also observed that the two fairness approaches combined in similarly fruitful ways with our range of underlying ML algorithms and, overall, produced similar levels of uplift.

We once again observed numerous instances where the fairness interventions did not result in DI of at least 0.8 (as recommended by the four-fifths rule). This is particularly problematic for the PPR policy, where the majority of fairness-optimized models remain below this DI threshold.

In contrast to DI, but similarly to the benchmark models, we observed high values of EO (close to 1) for the models without fairness interventions, particularly in combination with the Argmax policy, and the fairness regularizations generally improved this metric further (not shown for brevity). This supports our initial observation that EO, depending on both the target and the prediction, is easier to optimize than DI, and that effective learning of the target assists in making the model fairer.

Finally, we should mention that the uplifts in DI from both approaches came with tolerable accuracy and precision penalties (median drop in absolute value < 0.02).

5.1.3. Overall performance

Figure 4. Accuracy and DI for all fairness approaches under the Argmax policy.

We continue by considering both preprocessing and in-training fairness techniques, utilizing the accuracy and DI measures of performance. We note that in several cases we could not find a decision threshold satisfying the PPR and/or Argmax policies together with the requirement of DI of at least 0.8, with the situation being worse for the PPR policy.

Figure 4 shows the results for the Argmax policy. Observing within datasets, for all datasets except perhaps the simplest, S-D, we do not encounter a strong anti-correlation between fairness and accuracy, as might naively be expected (in an industrial setting). Instead, we notice that one may obtain strong gains in fairness without sacrificing accuracy if multiple combinations of fairness interventions and underlying modelling approaches are trialled. Finally, we observed that on every dataset there is at least one combination that can produce DI of at least 0.8.

Dataset Model Accuracy Precision DI EO
German Reweigh XGB
Titanic DI LR 2
Adult (gender) FR
Adult (race) DI LR 2
S-D LagRed LR
I-P Reweigh LGB
Table 5. Most accurate fairness approaches achieving compliant DI under Argmax. Metrics associated with the maximum-accuracy test-set repetition are reported.
Dataset Model Accuracy Precision DI EO
German FR
Titanic Reweigh LR
Adult (gender) DI LR 2
Adult (race) DI LR 2
Table 6. Most accurate fairness approaches achieving compliant DI under PPR. Metrics associated with the maximum-accuracy test-set repetition are reported.

To conclude this analysis, we identified the most accurate models (including benchmark models) on each dataset for which DI was compliant on both the training and test sets. These results are shown in Tables 5 and 6 for the Argmax and PPR policies, respectively. In both cases, the optimal models were typically members of the constrained-optimization family, i.e., DI LR 1/2 or FR. As expected, no benchmark model ended up being the most accurate under this DI constraint; as mentioned above, benchmark models typically struggled to reach this level of DI. A natural question is how much accuracy one would need to sacrifice to reach a "compliant" level of DI. We compare the benchmark models that give the largest accuracy on a test-dataset repetition to the most accurate "compliant" model. For both PPR and Argmax, we observe small but varied drops in accuracy. The Titanic dataset is an exception, showing larger drops for both PPR and Argmax.
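The selection procedure described here (filter to compliant DI on both splits, then keep the most accurate pipeline per dataset) can be sketched on a results table. All numbers below are made up for illustration and are not the paper's results:

```python
import pandas as pd

# Hypothetical results table: one row per (dataset, pipeline) repetition.
results = pd.DataFrame({
    "dataset":  ["German", "German", "German", "Titanic", "Titanic"],
    "pipeline": ["Benchmark LR", "Reweigh XGB", "FR", "Benchmark XGB", "DI LR 2"],
    "accuracy": [0.76, 0.74, 0.75, 0.81, 0.79],
    "di_train": [0.62, 0.85, 0.90, 0.70, 0.88],
    "di_test":  [0.60, 0.82, 0.86, 0.66, 0.84],
})

# Keep pipelines compliant with the four-fifths rule on both splits,
# then report the most accurate compliant pipeline per dataset
# (the procedure behind Tables 5 and 6).
compliant = results[(results.di_train >= 0.8) & (results.di_test >= 0.8)]
best = compliant.loc[compliant.groupby("dataset")["accuracy"].idxmax()]
```

Note how the benchmark rows drop out at the compliance filter, mirroring the observation that benchmark models rarely reach a compliant DI.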

We also considered the calibration of the models, as measured by the Brier score on the test-data repetitions. In all cases, the models identified as the most calibrated were fairness-unaware (i.e., benchmark models), which always attained the best (lowest) Brier scores. Moreover, on the Adult and Titanic datasets the gap between the most calibrated and least calibrated (i.e., fairness-aware) models was considerable. This highlights that fairness and calibration requirements can often be in conflict, creating potential barriers to industry adoption where calibrated outputs are a necessity.
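The Brier score used here has a simple closed form: the mean squared difference between predicted probability and binary outcome. A minimal sketch:

```python
import numpy as np

def brier_score(y_true, y_prob):
    """Mean squared error between predicted probabilities and binary
    outcomes; lower is better (sklearn's brier_score_loss is equivalent)."""
    return float(np.mean((y_prob - y_true) ** 2))

# Confident, mostly correct probabilities give a low score.
y_true = np.array([1, 0, 1, 0])
y_prob = np.array([0.9, 0.2, 0.8, 0.1])
bs = brier_score(y_true, y_prob)
```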

5.1.4. Computational Cost

Figure 5. Box-plots of fit times across all datasets for benchmark models and models with fairness interventions.

Finally, we consider the computational cost of training these models, tested on an Intel Xeon E5-2680 at 2.80 GHz with 128 GB RAM. Taking the mean training times for each model and dataset across all repetitions considered, we observe a clear pattern in Figure 5. The preprocessing interventions and benchmark models are the fastest to fit, followed by the fairness-loss-regularized models. The most expensive models to fit are those that employ meta-learning approaches with a complex base learner (i.e., LagRed BT, which had the longest mean fit time) and the fairness-constrained learners. In particular, DI LR 2 can often get "stuck" when trying to satisfy its hard optimization constraints, resulting in inefficient outlier runs. This highlights that although these models are consistent in their performance and fairness utility, they can come with a potentially prohibitively high computational cost.
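A minimal sketch of how mean fit times of this kind can be measured; the helper name, model choice, and repetition count are ours:

```python
import time
import numpy as np
from sklearn.linear_model import LogisticRegression

def mean_fit_time(model_factory, X, y, repetitions=5):
    """Average wall-clock fit time across repeated refits,
    mirroring how mean training times are reported per model/dataset."""
    times = []
    for _ in range(repetitions):
        model = model_factory()          # fresh model each repetition
        start = time.perf_counter()
        model.fit(X, y)
        times.append(time.perf_counter() - start)
    return float(np.mean(times))

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
y = (X[:, 0] > 0).astype(int)
t = mean_fit_time(lambda: LogisticRegression(max_iter=1000), X, y)
```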

5.2. Fair efficiency

Figure 6. Fair efficiencies (computed with DI and with EO as the fairness component) for all our datasets and techniques.

We continue by utilising a policy-free approach facilitated by our fair-efficiency metric. In the following, fair efficiency is calculated twice for each model, alternately using DI and EO as the fairness component; the two variants are presented as pairs in the box plots (DI on the left of each pair). Error bars represent variance across the folds of each dataset. Fair efficiencies for all our datasets and techniques are shown in Figure 6.

Comparing results across datasets, fairness metrics, and techniques, several interesting patterns emerge.

Evaluating across techniques, we observe overall consistency, with only a few underperforming outliers: the SVM-based approaches (Benchmark SVM, DI Remover SVM, and Reweigh SVM), the logistic-regression-based approaches (Benchmark LR, DI Remover LR, and Reweigh LR), and the two Lagrange-reduction techniques (LagRed BT/LR). Finally, it is positive to see the benchmark models being competitive with the other, fairness-aware approaches.

Focussing on LagRed BT and LagRed LR, which struggled across the datasets, we believe this is likely due to their behaviour as the tolerance parameter changes. Specifically, for these algorithms the unfairness-tolerance parameter fails to control model fairness in a monotonic fashion (as would ideally be expected).

We observed fairness increasing only over a small range of this parameter. This does not necessarily indicate that the algorithm is ineffective, but rather that it may be harder to align the models' performance and fairness requirements. Comparing fair efficiencies for DI to those for EO, we observe that while the gap was relatively small in most synthetic datasets, the situation is different in three out of four real datasets. In general, the EO-based fair efficiency was higher than the DI-based one, implying that EO is easier to maximize than DI, a finding in agreement with our previous results.

6. Discussion

Having a full view of the results across all decision-threshold policies, we proceed with describing the emerging picture and main points.

First of all, it was easier to obtain high values of EO than of DI. While there is some guidance for DI on which values correspond to an adequately fair model (the four-fifths rule), there is no corresponding rule for EO. What level of EO makes a model fair? If high values suffice, then one may achieve fairness more easily by focussing on EO instead of DI.

The vast majority of models deployed in industry use fairness-unaware algorithms. Our benchmark evaluations revealed non-compliance with the four-fifths rule in many cases, although the overall picture is not dire. On the positive side, the fairness-vs-predictive-power profile of benchmark models (as measured by fair efficiency) was comparable to that of fairness-aware models (see Figure 6). Moreover, for EO, benchmark models performed adequately well, producing high values of EO for both the Argmax and PPR policies (see Figure 2). On the other hand, the situation with regard to DI was less positive, with benchmark models failing to reach a compliant level of DI in almost all cases. This indicates that fairness-unaware algorithms may reasonably be subject to fairness concerns and investigations.

Among the benchmark algorithms, and for our tested datasets under the Argmax policy, SVM had the best overall fairness profile (see Table 4). However, SVM's predictive power was not as high as that of the other algorithms (e.g., see Figures 2 and 6).

On the topic of benchmark models, we would also like to evaluate the widespread (at least in industrial circles) belief that "simpler models are fairer", which may stem partly from the assumption that more complex models are harder to explain. On this basis, the simple logistic-regression benchmark could be assumed to be the fairest, and the less transparent boosted-tree approaches (e.g., Benchmark XGB) the least fair. Our results, however, show this is not the case, as we did not observe a clear anti-correlation between simplicity and fairness. This result is reassuring: complex models need not be precluded by fairness concerns.

How could someone improve the fairness of their model while using a fairness-unaware algorithm? One approach would be to try different decision-threshold policies: one can always adjust the threshold until a satisfactory fairness level is achieved. However, this approach may have material, business-related side effects. For example, in a loan application, it might result in approval rates incompatible with business constraints.
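The threshold-adjustment idea can be sketched as a sweep over candidate thresholds, keeping the one whose induced decisions are fairest. This is illustrative only; the symmetric DI in [0, 1] is our convention, and, as noted above, the winning threshold may still violate approval-rate constraints:

```python
import numpy as np

def fairest_threshold(scores, group, thresholds=None):
    """Sweep decision thresholds over model scores and return the one
    whose induced binary decisions maximize a symmetric DI in [0, 1]."""
    if thresholds is None:
        thresholds = np.unique(scores)
    best_t, best_di = None, -1.0
    for t in thresholds:
        y_pred = (scores >= t).astype(int)
        r0 = y_pred[group == 0].mean()  # unprivileged positive rate
        r1 = y_pred[group == 1].mean()  # privileged positive rate
        if r0 == 0 or r1 == 0:
            continue                     # DI undefined for this threshold
        di = min(r0 / r1, r1 / r0)
        if di > best_di:
            best_di, best_t = di, t
    return best_t, best_di

scores = np.array([0.9, 0.8, 0.3, 0.2, 0.95, 0.7, 0.4, 0.1])
group  = np.array([0, 0, 0, 0, 1, 1, 1, 1])
best_t, best_di = fairest_threshold(scores, group)
```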

On the other hand, our results show that fairness-aware algorithms are effective in finding favourable trade-offs between fairness and performance. As we have seen in Figures 3 and 4, one can gain appreciable amounts of fairness with minimal drops in predictive power. Moreover, it is positive to see that compliant levels of DI are routinely achievable using such algorithms (e.g., see Tables 5 and 6, and Figure 3).

However, which family of fairness-aware algorithms fared better? Were preprocessing or in-training approaches better? Were on-off (binary) approaches adequate, or should users focus on approaches having a continuous configurable parameter?

Considering the results in Tables 5 and 6, we saw that the two in-training approaches (DI LR 1 and DI LR 2) from the fair-classification package provided the most accurate-yet-fair models for about half of the combinations tested across Argmax and PPR. Looking at the scatter plot in Figure 4, we also notice these two techniques being competitive in terms of absolute levels of DI (for the Argmax policy). Moreover, their fair-efficiency profiles were overall competitive, too (see Figure 6).

We notice that both DI LR 1 and DI LR 2 are "in-train" fairness algorithms. Is this family of algorithms better than the preprocessing one? Our results did not reveal any obvious performance pattern, as the other major players in the in-train family, the pair LagRed BT and LagRed LR, performed poorly with respect to fair efficiency (see Figure 6). Performance aside, however, it should be mentioned that many implementations of preprocessing techniques (e.g., the reweigher) are binary (on/off) and, as such, do not offer as much flexibility as some of the fully configurable in-train approaches (i.e., FR DI, LagRed LR, and DI LR 2).

Unsurprisingly, we found that the benchmark models were better calibrated than those with a fairness intervention. The fairness-aware models were poorly calibrated in comparison, and any additional calibration requirements could negatively impact the fairness properties of the models and make them inviable (Kleinberg, 2018); satisfying such requirements may call for post-processing interventions.

The two Lagrange-reduction approaches (LagRed BT and LagRed LR) were found to perform poorly compared to the other interventions considered. With the exception of the Titanic dataset, these models could not meet the PPR requirements. Furthermore, their fair efficiency was typically lower than that of the other models considered. We attribute this to the form of the fairness constraint and our evaluation methods: for these models, fairness peaked over a small range of the tolerance parameter and was low outside this range, thus lowering the associated fair efficiency.

Finally, the similar performance across the synthetic datasets indicates that the tested interventions function independently of the structure of the bias in the data. This is unsurprising, as the interventions focus on metrics that are agnostic of the underlying structure of the data-generation mechanism. Other metrics that do account for these mechanisms, such as the causal losses defined in  (Di Stefano et al., 2020), may offer a better fairness-performance trade-off than those considered in this study.

7. Conclusions

We evaluated 28 different models, comprising 7 fairness-unaware (benchmark) ML algorithms and 20 fairness-aware models driven by 8 fairness approaches. We utilized 3 decision threshold policies, 7 datasets, 2 fairness metrics, and 3 predictive-performance metrics.

This is the most comprehensive comparison in the literature to date, and it uses a new approach to generalising the evaluation metrics via our novel fair-efficiency metric. Fair efficiency is usable within existing experimental frameworks and provides an alternative, policy-free view of the trade-off between performance and fairness, complementing context-specific metrics.

We also introduced a new feature selection technique that was comparable to the other preprocessing techniques considered and in some instances was competitive with in-training techniques.

The comparisons presented here are by no means exhaustive and are limited by our strict inclusion criteria. As of yet, there is no standard "model-fairness API". Disparate APIs and functions add significant overhead to comparing models, and we hope that more algorithms will move into the scope set by this paper as consensus emerges.

Future work can expand the scope of this comparison to include causal techniques, additional fairness metrics, post-processing techniques, model interpretability and common business constraints such as calibration.

Appendix A Fair feature selection

Fair feature selection is a straightforward fairness-induction approach, used here as a baseline against which more evolved algorithms are compared. It reduces prediction bias by selecting features based on how well they predict the target, penalised by how well they predict the protected attribute. Features are ranked in two stages. First, the features are scored against the target using a weighted combination of the ranks calculated with three selection methods: mutual information (MI, weight 0.15), F-test score (FS, weight 0.15), and the gain importance from a random-forest model (GI, weight 1). Next, the target is set to the protected attribute and each feature is scored again using the same weighted combination of MI, FS, and GI. The balance between predictive power and inverse predictive power is controlled by a mixing parameter: at one extreme, feature ranking is based on predictive power for the target only; at the other, it is based only on a feature's inverse power to predict the protected attribute. The top-ranked features are returned. See Algorithm 1 for the pseudo-code.

Require:  Training data, number of output features to return, ranking weights, mixing parameter
1:  Initialize the ranking vector.
2:  for each candidate feature do
3:     Compute its ranking score against the target using MI, F-test score, and GI.
4:     Compute its ranking score against the protected attribute using MI, F-test score, and GI.
5:     Store the weighted combination in the ranking vector.
6:  end for
7:  return  the top-ranked features
Algorithm 1 Fair feature selection
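A possible scikit-learn implementation of Algorithm 1 is sketched below. It is our reconstruction, not the authors' code: impurity-based random-forest importances stand in for gain importance, and `alpha` is our name for the mixing parameter (0 = target predictiveness only, 1 = protected-attribute penalty only):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import f_classif, mutual_info_classif

def _weighted_rank(X, target, weights=(0.15, 0.15, 1.0), seed=0):
    """Weighted combination of per-feature ranks from MI, F-test score,
    and random-forest importance (higher combined rank = more predictive)."""
    mi = mutual_info_classif(X, target, random_state=seed)
    fs, _ = f_classif(X, target)
    gi = RandomForestClassifier(n_estimators=50, random_state=seed) \
        .fit(X, target).feature_importances_
    scores = np.zeros(X.shape[1])
    for w, s in zip(weights, (mi, fs, gi)):
        scores += w * np.argsort(np.argsort(s))  # rank 0 = least predictive
    return scores

def fair_feature_select(X, y, protected, n_features, alpha=0.5):
    """Sketch of Algorithm 1: rank features by predictive power for y,
    penalised by predictive power for the protected attribute."""
    rank_y = _weighted_rank(X, y)
    rank_p = _weighted_rank(X, protected)
    combined = (1 - alpha) * rank_y - alpha * rank_p
    return np.argsort(combined)[::-1][:n_features]

# Synthetic check: feature 0 tracks y, feature 1 tracks the protected
# attribute, feature 2 is noise; feature 1 should be ranked out.
rng = np.random.default_rng(0)
n = 300
y = rng.integers(0, 2, n)
protected = rng.integers(0, 2, n)
X = np.column_stack([
    y + 0.1 * rng.normal(size=n),
    protected + 0.1 * rng.normal(size=n),
    rng.normal(size=n),
])
selected = fair_feature_select(X, y, protected, n_features=2, alpha=0.5)
```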


  • A. Agarwal, A. Beygelzimer, M. Dudik, J. Langford, and H. Wallach (2018) A reductions approach to fair classification. In Proceedings of the 35th International Conference on Machine Learning, J. Dy and A. Krause (Eds.), Proceedings of Machine Learning Research, Vol. 80, pp. 60–69. External Links: Link Cited by: §2, §4.
  • S. Barocas and A. D. Selbst (2016) Big data’s disparate impact. Calif. L. Rev. 104, pp. 671. Cited by: §3.0.2.
  • R. K. E. Bellamy, K. Dey, M. Hind, S. C. Hoffman, S. Houde, K. Kannan, P. Lohia, J. Martino, S. Mehta, A. Mojsilovic, S. Nagar, K. N. Ramamurthy, J. Richards, D. Saha, P. Sattigeri, M. Singh, K. R. Varshney, and Y. Zhang (2018) AI Fairness 360: an extensible toolkit for detecting, understanding, and mitigating unwanted algorithmic bias. External Links: Link Cited by: Table 1.
  • R. K. Bellamy, K. Dey, M. Hind, S. C. Hoffman, S. Houde, K. Kannan, P. Lohia, J. Martino, S. Mehta, A. Mojsilovic, et al. (2018) AI fairness 360: an extensible toolkit for detecting, understanding, and mitigating unwanted algorithmic bias. arXiv preprint arXiv:1810.01943. Cited by: §4.
  • A. Beutel, J. Chen, T. Doshi, H. Qian, A. Woodruff, C. Luu, P. Kreitmann, J. Bischof, and E. H. Chi (2019) Putting fairness principles into practice: challenges, metrics, and improvements. In Proceedings of the 2019 AAAI/ACM Conference on AI, Ethics, and Society, AIES ’19, New York, NY, USA, pp. 453–459. External Links: ISBN 9781450363242, Link, Document Cited by: §2.
  • B. E. Boser, I. M. Guyon, and V. N. Vapnik (1992) A training algorithm for optimal margin classifiers. In Proceedings of the 5th Annual Workshop on Computational Learning Theory (COLT'92), D. Haussler (Ed.), Pittsburgh, PA, USA, pp. 144–152. External Links: Link Cited by: §4.
  • L. Breiman (2001) Random forests. Machine learning 45 (1), pp. 5–32. Cited by: §4.
  • T. Calders and S. Verwer (2010) Three naive bayes approaches for discrimination-free classification. Data Mining and Knowledge Discovery 21, pp. 277–292. External Links: Link Cited by: §2.
  • F. Calmon, D. Wei, B. Vinzamuri, K. Natesan Ramamurthy, and K. R. Varshney (2017) Optimized pre-processing for discrimination prevention. In Advances in Neural Information Processing Systems 30, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (Eds.), pp. 3992–4001. External Links: Link Cited by: §2.
  • L. E. Celis, L. Huang, V. Keswani, and N. K. Vishnoi (2019) Classification with fairness constraints: a meta-algorithm with provable guarantees. In Proceedings of the Conference on Fairness, Accountability, and Transparency, New York, NY, USA, pp. 319–328. External Links: ISBN 9781450361255, Link, Document Cited by: §2, §4.
  • T. Chen and C. Guestrin (2016) XGBoost: a scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD 2016, New York, NY, USA, pp. 785–794. External Links: ISBN 9781450342322, Link, Document Cited by: §4.
  • S. Chiappa and T. Gillam (2018) Path-specific counterfactual fairness. Proceedings of the AAAI Conference on Artificial Intelligence 33. External Links: Document Cited by: §1, §2.
  • A. Chouldechova (2017) Fair prediction with disparate impact: a study of bias in recidivism prediction instruments. Big Data 5 (2), pp. 153–163. Note: PMID: 28632438 External Links: Document, Link, Cited by: §1.
  • K. A. Clarke (2005) The phantom menace: omitted variable bias in econometric research. Conflict Management and Peace Science 22 (4), pp. 341–352. External Links: Document, Link, Cited by: §1.
  • L. Cohen, Z. C. Lipton, and Y. Mansour (2019) Efficient candidate screening under multiple tests and implications for fairness. arXiv preprint arXiv:1905.11361. Cited by: §1.
  • P. G. Di Stefano, J. M. Hickey, and V. Vasileiou (2020) Counterfactual fairness: removing direct effects through regularization. arXiv preprint arXiv:2002.10774. Cited by: §2, §4, §6.
  • M. Dudik, S. Bird, H. Wallach, and K. Walker (2020) Fairlearn: a toolkit for assessing and improving fairness in ai. External Links: Link Cited by: Table 1.
  • C. Dwork, M. Hardt, T. Pitassi, O. Reingold, and R. Zemel (2012) Fairness through awareness. In Proceedings of the 3rd Innovations in Theoretical Computer Science Conference, ITCS 2012, New York, NY, USA, pp. 214–226. External Links: ISBN 9781450311151, Link, Document Cited by: §1.
  • [19] (2010) Equality act. Cited by: footnote 1.
  • M. Feldman, S. A. Friedler, J. Moeller, C. Scheidegger, and S. Venkatasubramanian (2015) Certifying and removing disparate impact. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD 2015, New York, NY, USA, pp. 259–268. External Links: ISBN 9781450336642, Link, Document Cited by: §2, §4.
  • S. A. Friedler, S. Choudhary, C. E. Scheidegger, E. P. Hamilton, S. Venkatasubramanian, and D. Roth (2019) A comparative study of fairness-enhancing interventions in machine learning. In FAT* 2019 - Proceedings of the 2019 Conference on Fairness, Accountability, and Transparency, FAT* 2019 - Proceedings of the 2019 Conference on Fairness, Accountability, and Transparency, pp. 329–338 (English (US)). External Links: Document Cited by: §2.
  • S. Galhotra, Y. Brun, and A. Meliou (2017) Fairness testing: testing software for discrimination. In Proceedings of the 2017 11th Joint Meeting on Foundations of Software Engineering, ESEC/FSE 2017, New York, NY, USA, pp. 498–510. External Links: ISBN 9781450351058, Link, Document Cited by: §2.
  • N. Goel, M. Yaghini, and B. Faltings (2018) Non-discriminatory machine learning through convex fairness criteria. External Links: Link Cited by: §2.
  • M. Hardt, E. Price, and N. Srebro (2016) Equality of opportunity in supervised learning. In Proceedings of the 30th International Conference on Neural Information Processing Systems, NIPS-16, Red Hook, NY, USA, pp. 3323–3331. External Links: ISBN 9781510838819 Cited by: §2.
  • J. M. Hickey, P. G. Di Stefano, and V. Vasileiou (2020) Fairness by explicability and adversarial shap learning. arXiv preprint arXiv:2003.05330. Cited by: §2.
  • F. Kamiran, A. Karim, and X. Zhang (2012) Decision theory for discrimination-aware classification. In 2012 IEEE 12th International Conference on Data Mining, Vol. , pp. 924–929. External Links: Document, ISSN 2374-8486 Cited by: §2.
  • F. Kamiran and T. Calders (2012) Data preprocessing techniques for classification without discrimination. Knowledge and Information Systems 33, pp. 1–33. External Links: Link Cited by: §2, §4.
  • G. Ke, Q. Meng, T. Finley, T. Wang, W. Chen, W. Ma, Q. Ye, and T. Liu (2017) LightGBM: a highly efficient gradient boosting decision tree. In Advances in Neural Information Processing Systems, pp. 3146–3154. Cited by: §4.
  • M. Kearns, S. Neel, A. Roth, and Z. S. Wu (2018) Preventing fairness gerrymandering: auditing and learning for subgroup fairness. In Proceedings of the 35th International Conference on Machine Learning, J. Dy and A. Krause (Eds.), Proceedings of Machine Learning Research, Vol. 80, Stockholmsmässan, Stockholm Sweden, pp. 2564–2572. External Links: Link Cited by: §4.
  • N. Kilbertus, M. Rojas-Carulla, G. Parascandolo, M. Hardt, D. Janzing, and B. Schölkopf (2017) Avoiding discrimination through causal reasoning. Cited by: §2.
  • J. Kleinberg (2018) Inherent trade-offs in algorithmic fairness. In Abstracts of the 2018 ACM International Conference on Measurement and Modeling of Computer Systems, SIGMETRICS-18, New York, NY, USA, pp. 40. External Links: ISBN 9781450358460, Link, Document Cited by: §1, §6.
  • M. J. Kusner, J. Loftus, C. Russell, and R. Silva (2017) Counterfactual fairness. In Advances in Neural Information Processing Systems 30, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (Eds.), pp. 4066–4076. External Links: Link Cited by: §1, §2.
  • P. Manisha and S. Gujar (2018) A neural network framework for fair classifier. arXiv preprint arXiv:1811.00247. Cited by: §2.
  • N. Mehrabi, F. Morstatter, N. Saxena, K. Lerman, and A. Galstyan (2019) A survey on bias and fairness in machine learning. arXiv preprint arXiv:1908.09635. Cited by: §1, §2.
  • D. F. Mujtaba and N. R. Mahapatra (2019) Ethical considerations in ai-based recruitment. In 2019 IEEE International Symposium on Technology and Society (ISTAS), Vol. , pp. 1–7. External Links: Document, ISSN 2158-3404 Cited by: §1.
  • R. Nabi and I. Shpitser (2018) Fair inference on outcomes. External Links: Link Cited by: §2.
  • M. Olfat and A. Aswani (2018) Spectral algorithms for computing fair support vector machines. In Proceedings of the Twenty-First International Conference on Artificial Intelligence and Statistics, A. Storkey and F. Perez-Cruz (Eds.), Proceedings of Machine Learning Research, Vol. 84, Playa Blanca, Lanzarote, Canary Islands, pp. 1933–1942. Cited by: §2.
  • C. O'Neil (2016) Weapons of math destruction: how big data increases inequality and threatens democracy. Crown Publishing Group, USA. External Links: ISBN 0553418815 Cited by: §1.
  • F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay (2011) Scikit-learn: machine learning in Python. Journal of Machine Learning Research 12, pp. 2825–2830. Cited by: Table 1.
  • G. Pleiss, M. Raghavan, F. Wu, J. Kleinberg, and K. Q. Weinberger (2017) On fairness and calibration. In Proceedings of the 31st International Conference on Neural Information Processing Systems, NIPS-17, Red Hook, NY, USA, pp. 5684–5693. External Links: ISBN 9781510860964 Cited by: §2.
  • J. E. Roemer and A. Trannoy (2015) Equality of opportunity. In Handbook of income distribution, Vol. 2, pp. 217–300. Cited by: §3.0.2.
  • F. Tramèr, V. Atlidakis, R. Geambasu, D. Hsu, J. Hubaux, M. Humbert, A. Juels, and H. Lin (2017) FairTest: discovering unwarranted associations in data-driven applications. pp. 401–416. Cited by: §2.
  • S. Verma and J. Rubin (2018) Fairness definitions explained. In Proceedings of the International Workshop on Software Fairness, New York, NY, USA, pp. 1–7. External Links: ISBN 9781450357463, Link, Document Cited by: §1.
  • D. Xu, Y. Wu, S. Yuan, L. Zhang, and X. Wu (2019) Achieving causal fairness through generative adversarial networks. In Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, IJCAI-19, pp. 1452–1458. External Links: Document, Link Cited by: §2.
  • S. Yeom and M. C. Tschantz (2019) Discriminative but not discriminatory: a comparison of fairness definitions under different worldviews. arXiv preprint arXiv:1808.08619v4. Cited by: §1.
  • M. B. Zafar, I. Valera, M. Gomez Rodriguez, and K. P. Gummadi (2017a) Fairness beyond disparate treatment & disparate impact: learning classification without disparate mistreatment. In Proceedings of the 26th International Conference on World Wide Web, WWW '17, Republic and Canton of Geneva, CHE, pp. 1171–1180. External Links: ISBN 9781450349130, Link, Document Cited by: §2.
  • M. B. Zafar, I. Valera, M. G. Rodriguez, K. P. Gummadi, and A. Weller (2017b) From parity to preference-based notions of fairness in classification. In Proceedings of the 31st International Conference on Neural Information Processing Systems, NIPS '17, Red Hook, NY, USA, pp. 228–238. External Links: ISBN 9781510860964 Cited by: §2.
  • M. B. Zafar, I. Valera, M. G. Rogriguez, and K. P. Gummadi (2017) Fairness constraints: mechanisms for fair classification. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, A. Singh and J. Zhu (Eds.), Proceedings of Machine Learning Research, Vol. 54, Fort Lauderdale, FL, USA, pp. 962–970. External Links: Link Cited by: §2, §4.
  • T. Zarsky (2012) Automated prediction: perception, law, and policy. Communications of the ACM 55, pp. 33–35. External Links: Link Cited by: §1.
  • M. Zehlike, C. Castillo, F. Bonchi, R. Baeza-Yates, S. Hajian, and M. Megahed (2017) MEASURES: a platform for data collection and benchmarking in discrimination-aware ML. External Links: Link Cited by: §2.
  • R. Zemel, Y. Wu, K. Swersky, T. Pitassi, and C. Dwork (2013) Learning fair representations. In Proceedings of the 30th International Conference on International Conference on Machine Learning - Volume 28, ICML 13, pp. 325–333. Cited by: §2, §4.
  • B. H. Zhang, B. Lemoine, and M. Mitchell (2018) Mitigating unwanted biases with adversarial learning. In Proceedings of the 2018 AAAI/ACM Conference on AI, Ethics, and Society, AIES 18, New York, NY, USA, pp. 335–340. External Links: ISBN 9781450360128, Link, Document Cited by: §2.
  • L. Zhang, Y. Wu, and X. Wu (2017) A causal framework for discovering and removing direct and indirect discrimination. In Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, IJCAI-17, pp. 3929–3935. External Links: Document, Link Cited by: §2.