A comparative study of fairness-enhancing interventions in machine learning

02/13/2018 ∙ by Sorelle A. Friedler, et al. ∙ THE UNIVERSITY OF UTAH Haverford College Carlos Scheidegger 0

Computers are increasingly used to make decisions that have significant impact on people's lives. Often, these predictions can affect different population subgroups disproportionately. As a result, the issue of fairness has received much recent interest, and a number of fairness-enhanced classifiers and predictors have appeared in the literature. This paper seeks to study the following questions: how do these different techniques fundamentally compare to one another, and what accounts for the differences? Specifically, we seek to bring attention to many under-appreciated aspects of such fairness-enhancing interventions. Concretely, we present the results of an open benchmark we have developed that lets us compare a number of different algorithms under a variety of fairness measures and a large number of existing datasets. We find that although different algorithms tend to prefer specific formulations of fairness preservation, many of these measures strongly correlate with one another. In addition, we find that fairness-preserving algorithms tend to be sensitive to fluctuations in dataset composition (simulated in our benchmark by varying training-test splits), indicating that fairness interventions might be more brittle than previously thought.




1 Introduction

As the use of machine learning to make decisions about people has increased, so has the drive to make fairness-aware machine learning algorithms. A considerable body of research over the past ten years has produced algorithms for accurate yet fair decisions, under varying definitions of fair, for goals such as non-discriminatory hiring, risk assessment for sentencing guidance, and loan allocation. And yet we have not seen extensive deployment of these algorithms in the pertinent domains. The primary obstacle appears to be the difficulty of comparing methods effectively across different evaluation measures and different data sets with consistent data preprocessing and testing methodologies. Such comparisons would not just reveal “best-in-class” methods; they would also suggest which measures are robust and how different algorithms are sensitive to different kinds of preprocessing. As pointed out by Lehr and Ohm (2017), such considerations of the data processing pipeline are not just important for efficient implementation but also have legal ramifications for the resulting automated decision-making process.

In this paper, we present a test-bed to facilitate direct comparisons of algorithms with respect to measures on a variety of datasets. Our open-source framework allows for the easy addition of new methods, measures and data for the purpose of evaluation. We show how to use our test-bed for determining not only which specific algorithm has the best performance under a fairness or accuracy measure, but what types of algorithmic interventions tend to be the most effective. In addition to the impact of these algorithmic choices, we examine the impact of different preprocessing techniques and different measures for accuracy and fairness that have an important, and previously obscured, impact on the results of these algorithms. Our goal is to provide a comprehensive comparative analysis of existing approaches that is currently lacking in the literature.

1.1 Our results

In terms of the techniques, datasets, and measures we evaluate in this paper, we wish to highlight the following findings:

Dependence on preprocessing

Different algorithms tend to have slightly different requirements in terms of input: how are sensitive attributes encoded? Are multiple sensitive attributes supported? Does the algorithm directly support categorical attributes or are attribute transformations required? We find that these can have an impact in accuracy and fairness measures reported in the literature.

Clustering of measures

Even though there has been a proliferation of measures designed to highlight discrimination instances by machine learning algorithms, we find that a large number of these measures tend to strongly correlate with one another. As a result, techniques optimizing for one measure often perform well for a different measure (and similarly for poor performance).

Algorithms make significantly different tradeoffs

The specific mechanisms that different algorithms employ to increase fairness are quite varied, but surprisingly, the actual predictions made by these algorithms tend to vary significantly as well. As a result, no algorithm’s performance (as of the latest state of our benchmark) appears to dominate, either in accuracy or fairness measures.

Algorithms tend to be sensitive to variations in the input

We find surprising variability in fairness measures arising from variations in training-test splits; this appears to not have been previously mentioned in the literature.

2 Background

Fairness-aware machine learning algorithms seek to provide methods under which the predicted outcome of a classifier operating on data about people is fair or non-discriminatory for people based on their protected class status such as race, sex, religion, etc., also known as a sensitive attribute. Broadly, fairness-aware machine learning algorithms have been categorized as those preprocessing techniques designed to modify the input data so that the outcome of any machine learning algorithm applied to that data will be fair, those algorithm modification techniques that modify an existing algorithm or create a new one that will be fair under any inputs, and those postprocessing techniques that take the output of any model and modify that output to be fair Romei and Ruggieri (2013). Many associated metrics for measuring fairness in algorithms have also been explored. These are detailed further in Section 6 and are also surveyed in Žliobaitė (2017). This description of fairness-aware machine learning methods is limited to batch-learning-based interventions. We do not consider interventions that focus on sequential or reinforcement learning such as Jabbari et al. (2017); Joseph et al. (2016a); Joseph et al. (2016b); Ensign et al. (2018a, b).

Preprocessing algorithms

The motivation behind preprocessing algorithms is the idea that training data is the cause of the discrimination that a machine learning algorithm might learn, and so modifying it can keep a learning algorithm trained on it from discriminating. This could be because the training data itself captures historical discrimination or because there are more subtle patterns in the data, such as an under-representation of a minority group, that make errors on that group both more likely and less costly under certain accuracy measures. One such algorithm that we will analyze in this paper is that of Feldman et al. (2015), which modifies each attribute so that the marginal distributions based on the subsets of that attribute with a given sensitive value are all equal; it does not modify the training labels. Additional preprocessing approaches include Calmon et al. (2017); Kamiran and Calders (2012).

Algorithm modifications

Modifications to specific learning algorithms, e.g., in the form of additional constraints, have been by far the most common approach. We study three such methods in this paper. Kamishima et al. (2012) introduce a fairness-focused regularization term and apply it to a logistic regression classifier. Zafar et al. (2017) observe that standard fairness constraints are nonconvex and hard to satisfy directly and introduce a convex relaxation for the purpose of optimization. Calders and Verwer (2010) build separate models for each value of a sensitive attribute and use the appropriate model for inputs with the corresponding value of the attribute.

Another method that combines preprocessing and algorithm modification is the work by Zemel et al. (2013). Their approach is to learn a modified representation of the data that is most effective at classification while still being free of signals pertaining to the sensitive attribute.

Postprocessing techniques

A third approach to building fairness into algorithm design is by modifying the results of a previously trained classifier to achieve the desired results on different groups. Kamiran et al. (2010) designed a strategy to modify the labels of leaves in a decision tree after training in order to satisfy fairness constraints. Recent work by Hardt et al. (2016) and Woodworth et al. (2017) explored the use of post-processing as a way to ensure fairness with respect to error profiles (see Section 6 for more on this).

In this paper we focus on group fairness approaches that aim to ensure non-discrimination across protected groups where the goal is to optimize metrics such as disparate impact. Another line of thought, known as individual fairness, is detailed in Dwork et al. (2012). In this work, we do not study algorithms that seek to optimize individual fairness: our goal is to focus on methods that explicitly deal with group-based discrimination and there are (to the best of our knowledge) no available implementations that optimize for individual fairness.

2.1 Related Work

Three prior efforts are relevant to our work. FairTest Tramèr et al. (2015) (https://github.com/columbia/fairtest) provides a general methodology to explore potential biases or feature associations in a data set, as well as a way to identify regions of the input space where an algorithm might incur unusually high errors. THEMIS Galhotra et al. (2017) (https://github.com/LASER-UMASS/Themis) takes a blackbox decision-making procedure and designs test cases automatically to explore where the procedure might be exhibiting group-based or causal discrimination. Fairness Measures Zehlike et al. (2017) occupies a different point in the design space. Given a particular algorithm that one wishes to evaluate, they provide a framework to test the algorithm on a variety of datasets and fairness measures. This approach on the one hand is more general than our framework, because it works with any algorithm. On the other hand, it is less effective for a comparative evaluation of different algorithms, especially if they have different preprocessing and training methods.

There are other software packages that audit black box software to determine the influence of individual variables. We omit a detailed description of these approaches as they are out of the scope of the investigation presented here. For more information, the reader is referred to the excellent new survey on explainability by Guidotti et al. (2018).

3 Benchmark Structure

Figure 1: The stages of the fairness-aware benchmarking program: data input, preprocessing, benchmarking, and analysis. Intermediate files are saved at each stage of the pipeline to ensure reproducibility.

In order to provide a platform for clear comparison of results across fairness-aware machine learning algorithms, we separate each stage of the learning and analysis process (see Figure 1) and ensure that each algorithm is compared using the same dataset (including the same preprocessing), the same set of training / test splits, and all desired fairness and accuracy measures. Much previous work has combined the preprocessing for a specific dataset with the code for the fairness-aware algorithm, which makes comparisons with other algorithms and other datasets difficult. Similarly, algorithms have often been analyzed only under one or two measures. Here, we emphasize that we distinguish preprocessing, algorithms, and measures, and create a pipeline in which all algorithms are analyzed under a standard preprocessing of datasets and a large set of measures.

In order to encourage easy adoption of this codebase as a platform for future algorithmic analysis, each of these choices is modularized so that adding new datasets, measures, and/or algorithms to the pipeline is as easy as creating a new object. The pipeline will then ensure that all existing algorithms are evaluated under the new dataset and measure. More details and instructions for adding to the code base can be found at the repository (https://github.com/algofairness/fairness-comparison).

4 Data

We perform all experiments based on five real-world data sets that have been previously considered in the fairness-aware machine learning literature and preprocess each consistently depending on the needs of the algorithm. All raw datasets, preprocessing code, and resulting processed datasets are available in the repository (https://github.com/algofairness/fairness-comparison); the preprocessing described here can be reproduced by running: python3 preprocess.py

The real-world datasets come from some of the domains impacted by questions of fairness in machine learning: hiring and promotion, credit-worthiness and loans, and recidivism prediction.


Ricci

The Ricci dataset comes from the case of Ricci v. DeStefano Supreme Court of the United States (2009), a case before the U.S. Supreme Court in which the question at issue was an exam given to determine if firefighters would receive a promotion. The dataset has 118 entries and five attributes, including the sensitive attribute Race. The original promotion decision was made by requiring at least a threshold score on the combined exam outcome Miao (2011). The goal in a fair learning context is to predict this original promotion decision while achieving fairness with respect to the sensitive attribute, Race.

Adult Income

The Adult Income dataset Lichman (2013) contains information about individuals from the 1994 U.S. census. It is pre-split into a training and test set; we use only the training data and re-split it. There are 32,561 instances and 14 attributes, including sensitive attributes race and sex. 2,399 instances with missing data are removed during the preprocessing step. The prediction task is predicting whether an individual makes more or less than $50,000 per year.


German Credit

The German Credit dataset Lichman (2013) contains 1,000 instances and 20 attributes describing individuals along with a classification of each individual as a good or bad credit risk. Sensitive attribute sex is not directly included in the data, but can be derived from the given information. Sensitive attribute age is included, and is discretized into values adult (age at least 25 years old) and youth based on an analysis by Kamiran and Calders (2009) showing this discretization provided for the most discriminatory possibilities.

ProPublica recidivism

The ProPublica data includes data collected about the use of the COMPAS risk assessment tool in Broward County, Florida Angwin et al. (2016). It includes information such as the number of juvenile felonies and the charge degree of the current arrest for 6,167 individuals, along with sensitive attributes race and sex. Data is preprocessed according to the filters given in the original analysis Angwin et al. (2016). Each individual has a binary “recidivism” outcome, which is the prediction task, indicating whether they were rearrested within two years after the first arrest (the charge described in the data).

ProPublica violent recidivism

The violent recidivism version of the ProPublica data Angwin et al. (2016) describes the same scenario as the recidivism data described above, but where the predicted outcome is a rearrest for a violent crime within two years. 4,010 individuals are included after preprocessing is applied, and the sensitive attributes are race and sex.

5 Preprocessing

Each algorithm we will analyze has certain requirements for the type of data it will operate over, and these necessitate different preprocessing techniques. However, in order to provide a consistent comparison across algorithms, it’s important that each algorithm receive the same input. We reconcile these needs by creating types of inputs that multiple algorithms can handle. Algorithms that handle the same input can then be easily compared to each other; algorithms can also be compared across different preprocessing strategies for the same dataset, though these results should be seen to be less definitive.

The first preprocessing step is to modify the input data according to any data-specific needs: removing features that should not be used for classification, removing or imputing any missing data, and potentially removing items or adding derived features. In order to allow the analysis of fairness based on multiple sensitive attributes (e.g., not just ensuring fairness based on race or sex alone, but based on both someone’s race and sex) we also add a combined sensitive attribute (e.g., attribute “race-sex” with values like “White-Woman”) to each dataset that contains multiple sensitive attributes. All algorithms will receive versions of the dataset with this same preprocessing applied.

While some algorithms are able to handle the datasets for training with only the described initial preprocessing (we’ll call this version of the processed data “original”), most algorithms considered here have additional constraints. (Since scikit-learn classifiers only handle numerical data, even for classifiers like decision trees where this is not inherently a requirement, some of the tested algorithms that would otherwise handle the original data require numerical data since the algorithms call scikit-learn.) For algorithms that can only handle numerical training data as input, we modify the data to include one-hot encoded versions of each categorical variable (the numerical version). Some algorithms additionally require that the sensitive attributes be binary (e.g., “White” and “not White” instead of handling multiple racial categorizations); for this version of the data (numerical+binary) we modify the given privileged group to be 1 and all other values to be 0.
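The preprocessing variants described above can be illustrated with a minimal sketch. It is not the benchmark's own code: the toy records, the column names, and the helper functions are hypothetical, and the encoding of the privileged group as 1 and all other values as 0 is an assumption made for illustration.

```python
# Hypothetical toy records; column names and values are illustrative only.
rows = [
    {"race": "White", "sex": "Woman", "age": 34},
    {"race": "Black", "sex": "Man", "age": 51},
    {"race": "Asian", "sex": "Woman", "age": 27},
]


def add_combined_attribute(rows, attrs, sep="-"):
    """Add a combined sensitive attribute such as "race-sex"."""
    name = sep.join(attrs)
    return [{**r, name: sep.join(str(r[a]) for a in attrs)} for r in rows]


def one_hot(rows, column):
    """Replace a categorical column with one 0/1 indicator column per value."""
    values = sorted({r[column] for r in rows})
    encoded = []
    for r in rows:
        new = {k: v for k, v in r.items() if k != column}
        for v in values:
            new[f"{column}_{v}"] = 1 if r[column] == v else 0
        encoded.append(new)
    return encoded


def binarize_sensitive(rows, column, privileged):
    """Encode the privileged group as 1 and every other value as 0 (assumed)."""
    return [{**r, column: 1 if r[column] == privileged else 0} for r in rows]


# "original" data plus the combined sensitive attribute
combined = add_combined_attribute(rows, ["race", "sex"])
# "numerical": every categorical column one-hot encoded
numerical = one_hot(one_hot(rows, "race"), "sex")
# "numerical+binary": sensitive attribute binarized, remaining categoricals one-hot
numerical_binary = one_hot(binarize_sensitive(rows, "race", "White"), "sex")
```

In this sketch the numerical version keeps one indicator column per racial category, while the numerical+binary version collapses them into a single 0/1 column, which is the difference in information that the analysis below examines.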

5.1 Analysis

With these four preprocessed versions of each data set in place, we can compare how a single algorithm performs relative to all versions of the dataset on which it can run. The most common form of input for the algorithms we consider here is numerical, and all these algorithms can additionally handle the numerical+binary version of the dataset. This gives an opportunity to determine the effect, per algorithm and per dataset, of allowing an algorithm access to full information about sensitive attribute categorization or only a binary summary.

Figure 2: Examining the results of the Feldman et al. (2015) algorithm under different preprocessing choices: numerical versus numerical+binary. Each dot plots the result of a single split of the data in terms of the labeled metric under both preprocessing choices. The gray line shows equality between the preprocessing choices. The model used within the Feldman algorithm is listed, and some variants of the algorithm had the tradeoff parameter optimized for either accuracy or disparate impact value.

Figure 2 illustrates this analysis on the impact of the numerical+binary version of the preprocessed data on the Feldman et al. (2015) algorithm. In the left figure we examine the relation between the accuracy under numerical preprocessing and under numerical+binary preprocessing, in which the sensitive attributes are binary-encoded. Each algorithm was run over ten random splits and the result on each split is shown as a single point on the figure. As discussed in Section 7, Feldman et al. use a generic classifier after running a preprocessing “fairness-enhancing” filter on the data, and the different algorithms reflect the different classifiers used. We also automate the tuning of the fairness-accuracy tradeoff parameter for this algorithm (more about parameter tuning specifics can be found in Section 7), for both accuracy and the disparate impact value. As we can see, for most variants of the algorithm the resulting accuracy is independent of the representation, with a notable exception of the SVM variants (where the preprocessing is followed by training with an SVM). In all three SVM variants, the accuracy is consistently higher when using the numerical+binary representation than when using the numerical representation. We speculate that this is because the Feldman et al. algorithm conditions on the sensitive value in its preprocessing of the data, and this step likely preserves more accuracy when a larger number of people are in each sensitive group, as is the case when the unprivileged groups are grouped together in the binary preprocessing variant. This may be compounded by the SVM model because when categorical features are one-hot encoded for input (as required by scikit-learn) the increase in the dimensionality of the data may cause the SVM to be less effective at finding a good classifier.

We can do a similar analysis on the fairness achieved by the methods, as seen in the right side of Figure 2. Again, we compare the fairness measure (in this case DI; see Section 6) achieved for different data representations. First, we see that the fairness achieved varies across runs, an issue we will return to when we discuss measure stability. Second, we notice that there is less difference between the results obtained for different representations (although SVMs still show sensitivity to the representation). In other words, for this algorithm the accuracy is affected by the choice of classifier and representation, but not the fairness achieved.

6 Measures

There are many ways to evaluate the accuracy and fairness of a model. Rather than be exhaustive (an upcoming tutorial puts the number of fairness measures at 21 Narayanan (2018)!), we will focus on representative measures for each aspect. Let $D = (X, S, Y)$ be a dataset where $X$ is the data subset that can be used for training (whether categorical or numerical), $S$ is the sensitive attribute where $S = 1$ is the privileged class, and $Y$ is the binary classification label where $Y = 1$ is the positive outcome and $Y = 0$ is the negative outcome. Let $\hat{Y}$ be the predicted outcomes of some algorithm. We can define accuracy and fairness measures in terms of conditional probabilities of outcome variables ($\hat{Y}$) with respect to variables like $Y$ and $S$.

6.1 Accuracy measures

We consider the standard accuracy measures: the (uniform) accuracy ($P[\hat{Y} = Y]$), the true positive rate (TPR) ($P[\hat{Y} = 1 \mid Y = 1]$, also called sensitivity or recall), and the true negative rate (TNR) ($P[\hat{Y} = 0 \mid Y = 0]$, also called specificity). We also consider the balanced classification rate (BCR), a version of accuracy that is unweighted per class:

Definition 1 (BCR). $\mathrm{BCR} = \frac{1}{2}\left(P[\hat{Y} = 1 \mid Y = 1] + P[\hat{Y} = 0 \mid Y = 0]\right)$

All of these measures lie in the range $[0, 1]$.
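As a concrete illustration of these four accuracy measures, the following minimal sketch computes them from lists of true and predicted binary labels. The function name and the toy labels are hypothetical, and the sketch assumes that both classes occur in the true labels.

```python
def accuracy_measures(y_true, y_pred):
    """Compute accuracy, TPR, TNR, and BCR for binary labels (1 = positive).

    Assumes y_true contains at least one positive and one negative example.
    """
    pairs = list(zip(y_true, y_pred))
    preds_on_positives = [p for t, p in pairs if t == 1]
    preds_on_negatives = [p for t, p in pairs if t == 0]

    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    tpr = sum(p == 1 for p in preds_on_positives) / len(preds_on_positives)
    tnr = sum(p == 0 for p in preds_on_negatives) / len(preds_on_negatives)
    bcr = (tpr + tnr) / 2  # unweighted average of the per-class rates
    return {"accuracy": accuracy, "TPR": tpr, "TNR": tnr, "BCR": bcr}


# Toy example: 4 positives and 6 negatives, with one error in each class.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]
measures = accuracy_measures(y_true, y_pred)
```

Because the classes are imbalanced in this toy example, the uniform accuracy and the BCR differ slightly, which is the reason for reporting both.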

6.2 Fairness measures

Fairness measures can be divided into three broad categories, in all cases conditioned on values of the sensitive attribute $S$. In what follows, we normalize measures to make comparisons easier. In all cases, the measures lie in one of two ranges, and in both cases perfect fairness is achieved at the same value. We note that some of these measures have appeared in the literature not as something to be optimized (to be close to that value) but as a constraint to be satisfied (for example, that the appropriate value must equal it exactly).

6.2.1 Measures based on base rates

Definition 2 (Disparate Impact (DI) Feldman et al. (2015); Zafar et al. (2017)). $\mathrm{DI} = \frac{P[\hat{Y} = 1 \mid S \neq 1]}{P[\hat{Y} = 1 \mid S = 1]}$

This measure is inspired by one of the two tests for disparate impact in the legal literature in the United States Barocas and Selbst (2016). In the cases where there are more than two values for a given sensitive attribute, we consider two variants of DI (which are equivalent in the case when there are only two sensitive values): binary and average. In the binary case, all unprivileged classes are grouped together into a single value (e.g., “non-White”) that is compared as a group to the privileged class (e.g., “White”). In the average case, pairwise DI calculations are done against the privileged class (e.g., “White” compared to “Black”, “White” compared to “Asian”, etc.) and the average of these calculations is taken. This is analogous to the one-vs-all and all-vs-all methodology in multi-class classification.
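The difference between the binary and average variants can be seen in a small sketch. It assumes DI is the ratio of the positive-prediction rate of the unprivileged group(s) to that of the privileged group; the function names and toy data are hypothetical and not taken from the benchmark code.

```python
def positive_rate(y_pred, groups, keep):
    """Fraction of positive predictions among rows whose group satisfies `keep`."""
    selected = [p for p, g in zip(y_pred, groups) if keep(g)]
    return sum(selected) / len(selected)


def di_binary(y_pred, groups, privileged):
    """DI with all unprivileged groups pooled into a single group."""
    unprivileged_rate = positive_rate(y_pred, groups, lambda g: g != privileged)
    privileged_rate = positive_rate(y_pred, groups, lambda g: g == privileged)
    return unprivileged_rate / privileged_rate


def di_average(y_pred, groups, privileged):
    """Mean of the pairwise DI values of each unprivileged group vs. the privileged one."""
    privileged_rate = positive_rate(y_pred, groups, lambda g: g == privileged)
    others = sorted(set(groups) - {privileged})
    ratios = [
        positive_rate(y_pred, groups, lambda g, o=o: g == o) / privileged_rate
        for o in others
    ]
    return sum(ratios) / len(ratios)


# Toy predictions: positive rates are 0.75 (White), 0.25 (Black), 1.0 (Asian).
groups = ["White"] * 4 + ["Black"] * 4 + ["Asian"] * 2
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 1, 1]

binary_value = di_binary(y_pred, groups, "White")
average_value = di_average(y_pred, groups, "White")
```

In this toy example the two variants disagree (about 0.67 for binary versus about 0.83 for average) because pooling weights each unprivileged group by its size while averaging weights each group equally.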

Definition 3 (CV Calders and Verwer (2010)). $\mathrm{CV} = 1 - \left(P[\hat{Y} = 1 \mid S = 1] - P[\hat{Y} = 1 \mid S \neq 1]\right)$

This measure is the same as DI, but where the difference is taken instead of the ratio; such a measure has been used for example to measure gender discrimination in the United Kingdom. A binary grouping strategy (described above for DI) is used in the case where there are more than two sensitive values, and the averaging method can also be used. Note that we do not take the absolute value of the difference so that skew in favor of one group versus another can be detected. We note that requiring $P[\hat{Y} = 1 \mid S = 1] = P[\hat{Y} = 1 \mid S \neq 1]$ is called the demographic parity constraint in the literature.
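A short sketch of the difference-based measure follows. It computes the signed gap between the positive-prediction rates of the privileged and pooled unprivileged groups, together with a normalized variant (one minus the gap) in which equal rates map to 1; the normalization, the function names, and the toy data are assumptions made for illustration.

```python
def positive_rate(y_pred, groups, keep):
    """Fraction of positive predictions among rows whose group satisfies `keep`."""
    selected = [p for p, g in zip(y_pred, groups) if keep(g)]
    return sum(selected) / len(selected)


def signed_gap(y_pred, groups, privileged):
    """Privileged positive rate minus pooled unprivileged positive rate.

    The sign is kept so that skew toward either group remains visible.
    """
    privileged_rate = positive_rate(y_pred, groups, lambda g: g == privileged)
    unprivileged_rate = positive_rate(y_pred, groups, lambda g: g != privileged)
    return privileged_rate - unprivileged_rate


def cv_normalized(y_pred, groups, privileged):
    """Assumed normalization: 1 means equal positive rates (demographic parity)."""
    return 1 - signed_gap(y_pred, groups, privileged)


groups = ["White"] * 4 + ["Black"] * 4 + ["Asian"] * 2
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 1, 1]

gap_white = signed_gap(y_pred, groups, "White")  # positive: favors the privileged group
gap_black = signed_gap(y_pred, groups, "Black")  # negative: favors the other groups
cv_white = cv_normalized(y_pred, groups, "White")
```

Keeping the sign means that the same magnitude of disparity yields values on opposite sides of the parity point depending on which group is favored.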

6.2.2 Measures based on group-conditioned accuracy

In general, we can think of fairness measures based on group-conditioned accuracy as asking whether the error rates for each group are similar. This yields the following definitions.

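Definition 4 (Group-conditioned fairness measures). For each sensitive value $s$:

$s$-Accuracy $= P[\hat{Y} = Y \mid S = s]$

$s$-TPR $= P[\hat{Y} = 1 \mid Y = 1, S = s]$

$s$-TNR $= P[\hat{Y} = 0 \mid Y = 0, S = s]$

$s$-BCR $= \frac{1}{2}\left(s\text{-TPR} + s\text{-TNR}\right)$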


We note that these measures have been studied under different names. For example, error rate balance Chouldechova (2017) is the aim of achieving equal $s$-TPR and $s$-TNR values across sensitive groups. Equalized odds Hardt et al. (2016) is the aim of achieving equal $s$-TPR and $1 - s$-TNR (the false positive rate) across sensitive groups.

The per-group values of any of the above measures can then be aggregated for comparison by taking the mean directly or by taking the mean over comparisons analogous to DI (a ratio against the privileged group's value) and CV (a difference from the privileged group's value). In each of these cases, as we saw above, the unprivileged sensitive values could be grouped together or handled separately in the ratio or difference.
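The following sketch illustrates group-conditioned rates and one way of comparing them against the privileged group. The function names and toy data are hypothetical, the unprivileged groups are handled separately and then averaged, and every group is assumed to contain both positive and negative examples.

```python
def group_rates(y_true, y_pred, groups):
    """Return {group: {"TPR": ..., "TNR": ...}} computed within each group."""
    rates = {}
    for g in sorted(set(groups)):
        rows = [(t, p) for t, p, s in zip(y_true, y_pred, groups) if s == g]
        preds_on_positives = [p for t, p in rows if t == 1]
        preds_on_negatives = [p for t, p in rows if t == 0]
        rates[g] = {
            "TPR": sum(p == 1 for p in preds_on_positives) / len(preds_on_positives),
            "TNR": sum(p == 0 for p in preds_on_negatives) / len(preds_on_negatives),
        }
    return rates


def compare_to_privileged(rates, measure, privileged):
    """Average DI-style ratio and CV-style difference against the privileged group."""
    privileged_value = rates[privileged][measure]
    others = [v[measure] for g, v in rates.items() if g != privileged]
    ratios = [v / privileged_value for v in others]
    differences = [privileged_value - v for v in others]
    return {
        "ratio": sum(ratios) / len(ratios),
        "difference": sum(differences) / len(differences),
    }


groups = ["a"] * 4 + ["b"] * 4
y_true = [1, 1, 0, 0, 1, 1, 0, 0]
y_pred = [1, 1, 0, 1, 1, 0, 0, 0]

rates = group_rates(y_true, y_pred, groups)
tpr_comparison = compare_to_privileged(rates, "TPR", privileged="a")
```

Here group "a" has a higher true positive rate and group "b" a higher true negative rate, so the two groups have equal overall accuracy while the kinds of errors they experience differ.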

6.2.3 Measures based on group-conditioned calibration

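A predictor that outputs a probability $p$ for an event is said to be well-calibrated if the event occurs with frequency $p$ among the cases assigned that probability, i.e. $P[Y = 1 \mid \text{predicted probability} = p] = p$. Motivated by this, we can define fairness measures by group conditioning the calibration function.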

Definition 5 ($s$-Calibration+). $P[Y = 1 \mid \hat{Y} = 1, S = s]$

Definition 6 ($s$-Calibration-). $P[Y = 1 \mid \hat{Y} = 0, S = s]$

Calibration has been introduced previously with the goal of equalizing these values across sensitive values Chouldechova (2017); Kleinberg et al. (2017).
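A minimal sketch of group-conditioned calibration for a binary predictor follows. It assumes that the positive variant is the fraction of truly positive outcomes among those predicted positive within a group, and the negative variant is the fraction of truly positive outcomes among those predicted negative; these definitions, the function name, and the toy data are assumptions for illustration.

```python
def group_calibration(y_true, y_pred, groups, group):
    """Return (calibration_plus, calibration_minus) within one sensitive group.

    calibration_plus:  fraction of true positives among predicted positives.
    calibration_minus: fraction of true positives among predicted negatives.
    Assumes the group has at least one predicted positive and one predicted negative.
    """
    rows = [(t, p) for t, p, s in zip(y_true, y_pred, groups) if s == group]
    truth_when_predicted_pos = [t for t, p in rows if p == 1]
    truth_when_predicted_neg = [t for t, p in rows if p == 0]
    plus = sum(truth_when_predicted_pos) / len(truth_when_predicted_pos)
    minus = sum(truth_when_predicted_neg) / len(truth_when_predicted_neg)
    return plus, minus


groups = ["a"] * 4 + ["b"] * 4
y_true = [1, 1, 0, 0, 1, 1, 0, 0]
y_pred = [1, 1, 0, 1, 1, 0, 0, 0]

calibration_a = group_calibration(y_true, y_pred, groups, "a")
calibration_b = group_calibration(y_true, y_pred, groups, "b")
```

Note that these quantities condition on the prediction rather than on the true label, which is what distinguishes them from the group-conditioned accuracy measures.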

6.3 Analysis

Although there are many variations on these and other measures, we find that many of these are correlated. In some cases, this is not surprising as these measures are definitionally related. For example, DI takes the ratio of two probabilities while CV takes the difference. However, by analyzing resulting measures across many algorithms, we can find correlations that are less obvious. In fact, it appears that there are two main groups of measures, all correlated with each other! In Figure 3 we fix two dataset-algorithm pairs and look at how the different measures of fairness correlate with each other. A first surprising observation is that the various group-conditioned fairness measures are very closely related to each other (the base-rate measures like DI and CV are also closely related for the reason mentioned above). This suggests that we need not focus on the specific group-conditioned fairness measure we use. An unusual exception to this is the group-conditional calibration measure on negative outcomes (s-Calibration-) which is much more closely associated with the base-rate measures than other group-conditioned measures. A second surprising observation is that the accuracy measures are correlated with the group-conditioned fairness measures. This suggests that the discussions of fairness-accuracy tradeoffs are more pertinent with respect to base-rate fairness measures.

Figure 3: Examining the relationships between different measures of fairness. Each figure represents one data set-algorithm pair. For each entry, the algorithm is run for 10 training-testing splits for different parameter choices. The Stahel-Donoho estimator Stahel (1981); Donoho (1982) is then computed for each set of pairs of measurements.

Additionally, there are cases in which we would expect there to be tradeoffs between measures. Recent impossibility results show that, assuming unequal base rates across populations, it is impossible to achieve both calibration and error rate balance (both the same false positive rate and the same false negative rate across groups) Chouldechova (2017); Kleinberg et al. (2017). In Figure 4 we empirically examine this tradeoff. As before, each colored point represents one instance of a train-test split for an algorithm. As Figure 4 shows, there is a clear tradeoff between s-calibration- and s-TPR for each algorithm. Interestingly, different algorithms situate themselves in different parts of the tradeoff line.

Figure 4: An illustration of the tradeoff between s-calibration- and TPR for all algorithms on the Adult dataset. Each dot represents one run out of 10 random train-test splits.

7 Algorithms

We choose a selection of existing fairness-aware algorithms to assess; these are chosen based on availability of source code and with the goal of choosing varying types of fairness interventions (e.g., preprocessing versus algorithm modification). Each algorithm is run on each dataset and each metric is calculated on the predicted results. All algorithm implementations can be found in the repository (https://github.com/algofairness/fairness-comparison), along with all resulting metric calculations (see the results/ directory); the full set of results can be reproduced by running: python3 benchmark.py

Synthesis statistics (such as stability) are then calculated and comparison graphs are produced. The algorithm analysis code can be found in the same repository, and the analysis can be reproduced by running: python3 analysis.py

We analyze the following algorithms along with non-fairness-aware algorithms chosen for a baseline comparison: SVM, decision trees, Gaussian naive Bayes, and logistic regression.

Calders and Verwer (2010)

Calders and Verwer introduce a fairness-aware algorithm modification called Two Naive Bayes. Their approach trains separate models for the values of the sensitive attribute and iteratively assesses the fairness of the combined model under the CV measure, makes small changes to the observed probabilities in the direction of reducing the measure, and retrains their two models. This algorithm can handle both categorical and numerical input data, but requires that the given sensitive attribute be binary. We use the Kamishima et al. (2012) implementation of this algorithm (https://github.com/tkamishima/kamfadm/releases/tag/2012ecmlpkdd). The algorithm has a prior parameter, which we search over a range of values in fixed increments.

Feldman et al. (2015)

Feldman et al. give a preprocessing approach that modifies each attribute so that the marginal distributions based on the subsets of that attribute with a given sensitive value are all equal; it does not modify the training labels. Any algorithm can then be trained on the resulting “repaired” data. The algorithm can handle both categorical and numerical input data, but since we train scikit-learn classifiers based on this preprocessed data, our implementation can only handle numerical input. Both binary and non-binary sensitive attributes can be handled. A tuning parameter $\lambda$ is provided to trade off between fairness and accuracy, where $\lambda = 0$ gives the fairness of a regular non-fairness-aware classifier and $\lambda = 1$ maximizes fairness. A fixed value of $\lambda$ is used as the default, and values of $\lambda$ at regular increments in $[0, 1]$ are included when the algorithm is optimized using a grid search over the parameters. The implementation comes from Feldman et al. (2015) and Adler et al. (2018) (https://github.com/algofairness/BlackBoxAuditing).
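The grid search over a fairness-accuracy tradeoff parameter can be sketched as follows. This is an illustration only: `toy_evaluate` is a hypothetical stand-in for repairing the data at a given strength, training a classifier, and scoring it, and the grid spacing and the parameter range of 0 to 1 are assumptions rather than the benchmark's settings.

```python
def grid(start, stop, steps):
    """Evenly spaced candidate values from start to stop, inclusive."""
    return [start + (stop - start) * i / steps for i in range(steps + 1)]


def toy_evaluate(strength):
    """Hypothetical stand-in for 'repair, train, and score' at a given strength.

    The linear trends below are invented so the example runs; they are not results.
    """
    return {"accuracy": 0.85 - 0.10 * strength, "DI": 0.60 + 0.35 * strength}


def tune(evaluate, candidates, score):
    """Return the candidate whose evaluation maximizes the given score function."""
    return max(candidates, key=lambda value: score(evaluate(value)))


candidates = grid(0.0, 1.0, 10)

# Optimize once for accuracy and once for disparate impact (closeness to 1).
best_for_accuracy = tune(toy_evaluate, candidates, lambda r: r["accuracy"])
best_for_di = tune(toy_evaluate, candidates, lambda r: -abs(1.0 - r["DI"]))
```

Tuning for the two objectives separately, as in this sketch, yields two variants of the same algorithm, which is how the accuracy-optimized and DI-optimized variants in the figures should be read.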

Kamishima et al. (2012)

Kamishima et al. introduce a fairness-focused regularization term and apply it to a logistic regression classifier. Their approach requires numerical input and a binary sensitive attribute. A tuning parameter is provided to trade off between fairness and accuracy, with a fixed default value. When optimizing the parameter we search over a range of values, with a finer grid used for the lower values of that range; these parameter choices are based on the experimental exploration of this parameter given in Kamishima et al. (2012). We use the Kamishima et al. implementation of this algorithm (https://github.com/tkamishima/kamfadm/releases/tag/2012ecmlpkdd).

Zafar et al. (2017)

Zafar et al. re-express fairness constraints (which can be nonconvex) via a convex relaxation. This allows them to maximize accuracy subject to fairness constraints and also to maximize fairness subject to accuracy constraints. They use two parameters: is a parameter that controls the degree of independence of the outcome and the sensitive attribute via a covariance calculation: setting forces complete independence (and therefore fairness). The second parameter fixes the degree of approximation they are willing to tolerate: the algorithm is only required to find an answer that is within a factor of the optimal solution. In their experiments they set and vary as a linear function of the corresponding covariance estimate for an unconstrained classifier. When optimizing, we use values between and in 10 logarithmic steps.
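The covariance calculation mentioned above can be illustrated with a minimal sketch. It assumes a linear decision boundary with weight vector `theta` and takes the proxy to be the empirical covariance between the sensitive attribute and the signed distance to that boundary; the function name and the toy data are hypothetical and this is not the authors' code.

```python
def boundary_covariance(theta, features, sensitive):
    """Empirical covariance between sensitive values and theta . x.

    Assumed form: (1/N) * sum_i (z_i - mean(z)) * (theta . x_i).
    A value near zero means the signed distance to the boundary carries
    little linear information about the sensitive attribute.
    """
    n = len(features)
    z_mean = sum(sensitive) / n
    distances = [sum(t * x for t, x in zip(theta, row)) for row in features]
    return sum((z - z_mean) * d for z, d in zip(sensitive, distances)) / n


theta = [1.0]
sensitive = [0, 0, 1, 1]
# Distances unrelated to the sensitive attribute.
independent = boundary_covariance(theta, [[1.0], [-1.0], [1.0], [-1.0]], sensitive)
# Distances that track the sensitive attribute exactly.
dependent = boundary_covariance(theta, [[-1.0], [-1.0], [1.0], [1.0]], sensitive)
```

Because this quantity is linear in `theta`, bounding its absolute value yields convex constraints, which is the property that makes the relaxation tractable.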

Figure 5: The performance of all algorithms on each dataset with the goal of removing discrimination on a specific attribute. From top to bottom, the algorithms and sensitive attributes considered are: Adult Income on race, German Credit on sex, Ricci on race, ProPublica recidivism on race, and ProPublica violent recidivism on race. Each point is the result of a single algorithm running on a single training / test split - each algorithm is shown for ten such splits.

In Figure 5 we can see a basic summary of the performance of each algorithm considered on each data set. Since each algorithm focuses on creating a fair outcome with respect to a specific attribute in the data, we have chosen a single sensitive attribute to consider per dataset in these overall results. It is clear that there is no one “winner” - no algorithm that is both more fair and more accurate than the others on all datasets. It is also clear that there is tremendous variation even within a single algorithm over the random splits it receives. We examine this point in more detail next.

7.1 Stability

When analyzing algorithms, an additional question we are concerned with is that of stability: will the algorithm still perform well if the training data is slightly different? To assess this, we considered the standard deviation of each metric over 10 random splits. The results are shown in Figure 6 for the Adult Income data set for all algorithms when focusing on non-discrimination in terms of race (left) and sex (right) using numerical+binary preprocessing. These results give perhaps the clearest indication of the quality of an algorithm on a given data set. It is also easy to see that each algorithm occupies a slightly different place on the trade-off between fairness (measured here by DI when taken over binary sensitive attributes) and accuracy. For example, when focusing on non-discrimination in terms of sex, the Zafar et al. algorithm is potentially the best choice in terms of a balance between fairness and accuracy, but the large standard deviation over DI may make it a less desirable option.

Figure 6: The stability of algorithms on the Adult dataset. Each algorithm is tested on ten random train / test splits and a rectangle centered on the mean and with a width and height equal to the standard deviation along that measure is plotted. On the left, the algorithms attempt to remove discrimination in terms of race, and on the right in terms of sex.
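The stability summary used in this section (mean and standard deviation of a metric over random splits) can be sketched as follows. This is a minimal illustration, not the benchmark's code: the `evaluate` callback, the one-third test fraction, the use of the sample standard deviation, and the ratio form of DI are all assumptions.

```python
import random
import statistics


def disparate_impact(predictions, sensitive, unprivileged, privileged):
    """Ratio of positive-prediction rates between two sensitive groups.

    Assumes binary 0/1 predictions, non-empty groups, and a nonzero
    positive rate for the privileged group.
    """
    def positive_rate(group):
        outcomes = [p for p, s in zip(predictions, sensitive) if s == group]
        return sum(outcomes) / len(outcomes)
    return positive_rate(unprivileged) / positive_rate(privileged)


def stability(evaluate, n_rows, n_splits=10, test_fraction=1 / 3, seed=0):
    """Mean and standard deviation of a metric over random train/test splits.

    `evaluate(train_idx, test_idx)` is a hypothetical callback that trains on
    the first index list and returns one metric value on the second.
    """
    rng = random.Random(seed)
    n_test = round(n_rows * test_fraction)
    scores = []
    for _ in range(n_splits):
        indices = list(range(n_rows))
        rng.shuffle(indices)
        test_idx, train_idx = indices[:n_test], indices[n_test:]
        scores.append(evaluate(train_idx, test_idx))
    return statistics.mean(scores), statistics.stdev(scores)
```

Calling `stability` once with an accuracy callback and once with a DI callback gives the center and the side lengths of one rectangle of the kind plotted in Figure 6.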

7.2 Parameters

Many of these fairness-aware learning algorithms provide a parameter to allow a manual trade-off between fairness and accuracy. We automate the search for this balance and present results for all algorithms optimizing accuracy or fairness. This provides an additional means of testing the algorithm, as well as the possibility for further optimization of the tradeoff between the two. In Figure 7 we show the different results based on parameter tuning for the Zafar et al. (2017) algorithm on the Ricci dataset (left) and the Feldman et al. (2015) algorithm on the Adult Income dataset. A clear tradeoff between fairness and accuracy in these algorithms can be seen; the parameters are appropriately allowing exploration of the possible solution space.

Figure 7: The results of the Zafar et al. (2017) algorithm on the Ricci dataset (left) and the Feldman et al. (2015) algorithm on the Adult Income dataset (right) when the provided parameter to trade off between fairness and accuracy is varied. Each split and each parameter value is shown.
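The automated parameter search described in this section can be sketched as a plain grid search with a pluggable objective. The grid values, the toy `evaluate` function, and the choice to score fairness as closeness of DI to 1 are hypothetical; they stand in for training a real fairness-aware algorithm at each setting.

```python
def grid_search(evaluate, grid, score):
    """Return (parameter, metrics) for the grid value with the highest score.

    `evaluate(value)` returns a dict of metrics for one parameter value;
    `score(metrics)` turns that dict into a number to maximize.
    """
    results = [(value, evaluate(value)) for value in grid]
    return max(results, key=lambda item: score(item[1]))


def evaluate(value):
    # Toy stand-in: accuracy falls and DI rises as the parameter grows.
    return {"accuracy": 0.85 - 0.1 * value, "DI": 0.6 + 0.4 * value}


grid = [0.0, 0.25, 0.5, 0.75, 1.0]  # hypothetical grid, not the paper's

best_for_accuracy = grid_search(evaluate, grid, lambda m: m["accuracy"])
best_for_fairness = grid_search(evaluate, grid, lambda m: -abs(1.0 - m["DI"]))
```

Keeping the objective separate from the search makes it straightforward to report both the accuracy-optimized and the fairness-optimized variant of the same algorithm.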

7.3 Multiple sensitive attributes

Figure 8: Here, we show the behavior of four different algorithms when making predictions while accounting for different protected attributes (“repairing” race and sex, as well as a composite attribute). Different algorithms not only behave quite differently from one another, but their performance varies significantly depending on which specific attribute is being considered.

While there are still very few fairness-aware algorithms that can formally handle multiple sensitive attributes directly in the algorithm (Kearns et al. (2017); Hébert-Johnson et al. (2017)), all algorithms discussed can handle them if preprocessed as described earlier so that they are combined into a single sensitive attribute (e.g., race-sex). However, we might expect combining the attributes in this way to degrade performance under some metrics, especially in the case of algorithms that can only handle binary sensitive attributes, or when there are too many combinations for the size of the dataset to provide a large group of people with each new combined sensitive value. Looking at the Adult dataset when fairness-aware algorithms are run focusing on non-discrimination in terms of race, sex, and both, we find varying results for each of the algorithms in Figure 8. Sex is especially predictive on the Adult Income data set, so the DI value for sex is low, even on these fairness-aware algorithms. Race generally receives a higher DI value from these algorithms. When correcting for both at once, all of the algorithms find that the DI value is somewhere in between that for race and that for sex, but the Zafar et al. (2017) algorithm has a much larger variance over race and sex than over either individually.
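Combining several sensitive attributes into a single composite attribute, as described above, can be sketched in a few lines. The row format (a list of dicts), the separator, and the helper names are assumptions for illustration; checking the smallest composite group is one simple way to spot the sparsity problem the text warns about.

```python
from collections import Counter


def combine_sensitive(rows, attributes, sep="-"):
    """Build one composite sensitive value per row, e.g. 'White-Male'."""
    return [sep.join(str(row[name]) for name in attributes) for row in rows]


def smallest_group(combined):
    """Return the rarest composite value and how many rows carry it."""
    return min(Counter(combined).items(), key=lambda item: item[1])


# Hypothetical rows, not taken from the Adult Income data.
rows = [
    {"race": "White", "sex": "Male"},
    {"race": "White", "sex": "Male"},
    {"race": "Black", "sex": "Female"},
]
combined = combine_sensitive(rows, ["race", "sex"])
rarest_value, rarest_count = smallest_group(combined)
```

For algorithms that accept only a binary sensitive attribute, the composite values would still need to be collapsed into two groups, which is one reason performance may degrade in that setting.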

8 Discussion

Besides providing a central point of access to existing fairness-enhancing interventions and classification algorithms, our benchmark also highlights a number of gaps in the current practice and reporting of fairness issues in machine learning. We conclude with the following recommendations for future contributions to the area:

Emphasize preprocessing requirements.

If there are multiple plausible ways in which a dataset can be processed to generate training data for an algorithm, provide performance metrics for more than one of the possible choices. If algorithms are being compared to each other, ensure they are compared based on the same preprocessing.

Avoid proliferation of measures.

A new measure for fairness should only be introduced if it behaves fundamentally differently from existing metrics. Our study indicates that a combination of class-sensitive error rates and either DI or CV is a good minimal working set.

Account for training instability.

Showing the performance of an algorithm in a single training-test split appears to be insufficient. We recommend reporting algorithm success and stability based on a moderate number of randomized training-test splits.