Data-driven algorithms are used to inform decision making in a growing number of socially impactful domains, including criminal justice, finance, education, and hiring (O’Neil, 2016). The increasing influence of these systems has prompted a corresponding increase in public concern over their potential to produce discriminatory outputs. There is ample evidence that such concerns are well founded: algorithmic models used to generate criminal recidivism predictions (Larson et al., 2016; Chouldechova, 2017), employment advertising (Datta et al., 2015), and web product pricing (Valentino-DeVries et al., 2012) have all been shown to discriminate based on protected attributes (e.g. race or gender).
Discriminatory behavior is not a deliberate feature of these systems, but rather the result of biases present in the input data used to train the systems (Calders and Žliobaitė, 2013). A key hurdle for industrial applications of machine learning models is thus to determine whether the raw input data used to train the model contains discriminatory bias. This question is not straightforward: there are many ways to quantify bias, and many subtleties to consider when interpreting the results of such measurements. In light of this difficulty, we present a case study which examines the ability of six different fairness metrics to detect unfair bias in predictions generated by models trained on datasets containing known, artificial bias.
Our contributions are threefold. First, we frame the problem of bias detection as a causal inference problem with observational data. This framing highlights the subtleties that accompany causal studies using observational data, and emphasizes the parallels between those challenges and the difficulties associated with measuring fairness in machine learning settings. Second, we investigate the performance of six different fairness metrics under conditions of varying dataset bias. Specifically, we examine the consequences of making conclusions based on these metrics in the presence of uncertainty about the causal origins of bias in the dataset. Finally, based on these potential consequences we present a set of recommended best practices to guide fairness metric selection.
2. Related Work
The literature on fairness in machine learning is broadly organized around three (often overlapping) goals: to quantify the degree of unfair bias present in data or model predictions (Feldman et al., 2015; Hardt et al., 2016; Zliobaite, 2015b; Dwork et al., 2012; Romei and Ruggieri, 2014), to remove unfair bias from data or model predictions (Feldman et al., 2015; Zemel et al., 2013), and to develop machine learning algorithms which include fairness constraints (Zemel et al., 2013; Kamishima et al., 2012; Zafar et al., 2015; Calders and Verwer, 2010). A primary question for all three approaches is: “How should ‘fairness’ be defined?”
This question is the subject of active debate. Several metrics have been suggested which attempt to mathematically define various competing notions of fairness (Dwork et al., 2012; Feldman et al., 2015; Zliobaite, 2015b, a; Hardt et al., 2016; Chouldechova, 2017); the proliferation of these metrics reflects the many and sometimes mutually exclusive (Kleinberg et al., 2016) interpretations of ‘fairness’ in machine learning contexts. Here we focus on six metrics which appear repeatedly in the literature: Difference in Means (Zliobaite, 2015b), Difference in Residuals (Zliobaite, 2015b), Equal Opportunity (Hardt et al., 2016), Equal Mis-opportunity (Hardt et al., 2016), Disparate Impact (Feldman et al., 2015), and Normalized Mutual Information (Zliobaite, 2015b).
Certain definitions of fairness, and thus certain fairness metrics, are demonstrably inappropriate in particular circumstances. For example, if there is a legitimate reason for a difference in the rate of positive labels between members of different protected classes (e.g. incidence of breast cancer by gender) then statistical parity between model results for the protected classes would be an inappropriate measure. More generally, metrics are inappropriate when they enforce an equality which is inconsistent with ground truth. However, in practical, real-world machine learning settings the only available data may contain an unfairly biased representation of ground truth. This situation presents a conundrum: in order to select an appropriate measure of bias for a given dataset one must first know the bias in that dataset. Here we address this conundrum by considering the consequences of selecting a fairness metric based on a mistaken assessment of the types of bias present in a given dataset.
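As a concrete illustration, the statistical parity check described above amounts to comparing positive prediction rates across groups. The sketch below uses invented data and a helper name of our own; a nonzero gap is only evidence of unfairness if ground truth positive rates should in fact be equal:

```python
import numpy as np

def positive_rate_gap(y_pred, protected):
    """Statistical parity gap: difference in positive prediction
    rates between the non-protected and protected groups."""
    y_pred = np.asarray(y_pred, dtype=float)
    protected = np.asarray(protected, dtype=bool)
    return y_pred[~protected].mean() - y_pred[protected].mean()

# Illustrative predictions for two groups of four people each.
y_pred = [1, 1, 1, 0, 0, 0, 1, 0]
group  = [1, 1, 1, 1, 0, 0, 0, 0]
gap = positive_rate_gap(y_pred, group)
```

If the two groups legitimately differ in their ground truth positive rates (as in the breast cancer example), a gap of this kind is expected and does not indicate discrimination.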
3. Bias Detection through the lens of Observational Causal Inference
We frame the problem of detecting unfair bias in a machine learning setting as a causal inference problem with observational data. This framing serves two purposes: first, it allows us to enumerate distinct causal origins of dataset bias and to evaluate the performance of different fairness metrics in different regions of ‘dataset bias space’. Second, it highlights the relationship between the shortcomings of existing fairness metrics and the difficulties associated with causal inference on observational data. To be clear, in this work we are not applying causal inference techniques to detect bias. Rather, we are evaluating existing fairness metrics from the literature within a framework inspired by observational causal inference.
In this study we distinguish between two types of dataset bias, which we call ‘sample bias’ and ‘label bias’. In this terminology, ‘sample bias’ refers to the case when the sampling process which generates the data is not uniform across protected classes and outcome labels. For example, consider a dataset consisting of applicants to a graduate school. Certain academic disciplines have a large gender disparity in applicants. If department selectivity is correlated with applicant gender disparity, then a dataset of applicants to such a program would contain sample bias, because the sampling process would preferentially generate, for example, men who are more likely to be accepted and women who are more likely to be rejected.
We define ‘label bias’ as the case when there is a causal link between a protected attribute and the class label assigned to an individual which is not warranted by ground truth. Consider a dataset composed of elementary school students, with a dependent variable that indicates whether the student misbehaves. Studies have shown that Black and Latinx children are more likely than White children to receive suspensions or expulsions for similar problem behavior (Skiba et al., 2011), so if our dataset’s dependent variable were based on suspensions it would contain label bias. This taxonomy of bias types is consistent with other classifications presented in the literature, e.g. (Calders and Žliobaitė, 2013); however, we distinguish the definitions presented here by their emphasis on the causal origin of the bias.
Several pitfalls of measuring fairness can be understood through the lens of causal inference. First, consider situations which display Simpson’s paradox (Blyth, 1972), i.e. cases where different levels of data aggregation produce different fairness conclusions. An analysis of graduate admissions data from Berkeley offers one such case (Bickel et al., 1975), in which the aggregate data show an admissions bias against women, but when the data are disaggregated to the department level the bias is reversed. This difficulty stems from the causal influence of the protected class (in this case gender) on the presence of an individual in the set of applicants to each department: women preferentially apply to departments with lower acceptance rates. The fairness question “Does a person’s gender cause him/her to be more likely to be accepted?” is confounded by the causal influence of the person’s gender on the sampling process which generated the dataset. Simpson’s paradox may manifest in fairness measurements whenever there is a causal link between a protected attribute and the sampling process which generates the dataset, i.e. whenever there is sample bias.
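The Berkeley-style reversal is easy to reproduce with synthetic counts. The numbers below are illustrative, not the actual Berkeley figures: within each department women are admitted at a higher rate, yet because women preferentially apply to the more selective department, the aggregate rate favors men:

```python
# Synthetic admissions data illustrating Simpson's paradox.
# rows: (department, gender, applicants, admitted)
data = [
    ("A", "M", 800, 500),   # dept A: less selective, mostly men apply
    ("A", "F", 100, 70),
    ("B", "M", 100, 10),    # dept B: selective, mostly women apply
    ("B", "F", 800, 100),
]

def rate(rows):
    """Admission rate over a set of (dept, gender, apps, admits) rows."""
    apps = sum(a for _, _, a, _ in rows)
    admits = sum(x for _, _, _, x in rows)
    return admits / apps

men = [r for r in data if r[1] == "M"]
women = [r for r in data if r[1] == "F"]

agg_m, agg_f = rate(men), rate(women)   # aggregate: men favored
dept_rates = {                          # per department: women favored
    d: (rate([r for r in men if r[0] == d]),
        rate([r for r in women if r[0] == d]))
    for d in ("A", "B")
}
```

Here the aggregate rates show an apparent bias against women even though each department, considered alone, admits women at the higher rate; the protected attribute influences which department's applicant pool an individual appears in.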
Next, consider the case where a dataset contains label bias, i.e. when there is an unwarranted causal relationship between a protected class and the label assigned to members of that class. Models trained on such a dataset may produce unfairly biased predictions even when the sensitive attribute is omitted, due to collinearities between the protected class and other explanatory variables (Calders and Žliobaitė, 2013). Historic “redlining” practices (Žliobaite et al., 2011; Kamishima et al., 2012), in which zip codes were used as a proxy for race in mortgage lending decisions, are an example of a deliberate application of this effect; however, the same phenomenon occurs even in the absence of malicious intent if machine learning models are applied naively to datasets containing label bias. Detecting label bias requires determining the effect of a protected attribute on a model’s predictions in the presence of other correlated variables. This is a classic causal inference problem, and is plagued by the same complications that accompany causal studies using observational data, e.g. omitted, included, or incomplete variable bias (Romei and Ruggieri, 2014).
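A small simulation (entirely synthetic, not our experimental data) shows how label bias survives the removal of the protected attribute when a correlated proxy remains in the dataset:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Hypothetical setup: the protected attribute is dropped from the
# features, but a strongly correlated proxy (think zip code) remains.
protected = rng.random(n) < 0.5
proxy = protected ^ (rng.random(n) < 0.1)   # agrees with race ~90% of the time
qualified = rng.random(n) < 0.5             # ground truth, independent of race

# Label bias: qualified protected individuals are sometimes
# labeled negative anyway.
label = qualified & ~(protected & (rng.random(n) < 0.4))

# Even a model that never sees 'protected' can condition on the proxy;
# the label disparity is fully visible through the proxy alone.
rate_proxy_true = label[proxy].mean()
rate_proxy_false = label[~proxy].mean()
```

Any classifier fit to (proxy, label) pairs will learn to penalize the proxy group, reproducing the bias without ever observing the protected attribute.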
4. Case Study
Selecting an appropriate fairness metric for a given dataset is a chicken-and-egg problem: the types of bias present in the dataset determine which metric is appropriate, but determining which types of bias are present requires some way to measure bias. Here we approach this problem by evaluating the performance of six different metrics on datasets containing known, artificial bias.
4.1. Experimental Methods
To investigate the performance of the fairness metrics we perform two experiments. We begin with a dataset containing demographic information about a subset of U.S. citizens, with a dependent variable that indicates the likelihood (on a scale from 0 to 1) that each person will sign up for the services of an unspecified state agency. (Contractual limitations prevent us from identifying the client whose data we used for this analysis or from making the data available.) This original dataset is summarized in Table 1.
For Experiment A, we create a new base dataset where ground truth positive rates and class membership are both balanced. This is done by selecting only the white citizens in the original dataset and then randomly re-assigning race labels. For Experiment B, we use the unmodified original dataset as the new base dataset. In both experiments we define the base dataset to be ground truth; note that in Experiment B this means that the ground truth positive rate differs between groups.
Each experiment proceeds by introducing artificial causal bias to the relevant base dataset, splitting the resulting biased data into training and testing subsets, training an elastic net logistic classification model on the training set, scoring the model on the test set, and then applying each fairness metric to the model outputs. This process is repeated for each possible combination of bias types, as summarized in Table 2. Finally, we evaluate each metric on its ability to detect the artificially introduced bias.
Table 1. Summary of the original dataset.

|              | Positive | Negative | Total   |
| race = black | 1,296    | 9,357    | 10,653  |
| race = white | 64,536   | 54,804   | 119,340 |
We note one important detail of this method: in this analysis we are treating the likelihood score from our original dataset as ground truth. This score is itself modeled (prior to our experiments here), and thus subject to causal sample and/or label bias. However, our conclusions are generally applicable to situations where the ground truth contains imbalances similar to those in our base datasets, i.e. balanced classes with equal positive rates as in Experiment A, or imbalanced classes with differing positive rates as in Experiment B.
To introduce causal label bias into the training data we assign different label thresholds based on race. Specifically, in our artificially biased datasets we define:

y_bias = 1 if score > τ_race, else 0, with τ_black > τ_white,

where score is the likelihood score from the original dataset. We also define an unbiased label:

y_unbias = 1 if score > τ, else 0, with a single threshold τ applied to all citizens.
To introduce causal sample bias we preferentially sample white citizens having higher scores, while sampling black citizens uniformly:

p_bias ∝ score for white citizens, and p_bias = c for black citizens,

where p_bias is the probability a given person in the base dataset is included in the artificial training dataset and c is a constant. We also define an unbiased sampling process, p_unbias = c, which includes every person with the same constant probability.
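The two biasing procedures can be sketched as follows. The threshold and sampling constants here are illustrative placeholders, not the values used in our experiments:

```python
import numpy as np

rng = np.random.default_rng(1)

def biased_labels(score, protected, t_prot=0.6, t_other=0.4):
    """Label bias: the protected class needs a higher score to
    receive a positive training label (thresholds illustrative)."""
    return (score > np.where(protected, t_prot, t_other)).astype(int)

def biased_sample(score, protected, c=0.5):
    """Sample bias: non-protected individuals are included with
    probability proportional to score; protected ones uniformly."""
    p = np.where(protected, c, score)
    return rng.random(len(score)) < p

# Illustrative base dataset: uniform scores, balanced classes.
score = rng.random(1000)
protected = rng.random(1000) < 0.5
y_bias = biased_labels(score, protected)   # depressed positive rate for protected
keep = biased_sample(score, protected)     # inclusion mask, skewed toward high-scoring non-protected
```

Applying neither, one, or both procedures to the base dataset yields the four training datasets in Table 2.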
For each experiment, we train four elastic net logistic classification models, one on each of the four datasets in Table 2, and then apply the six metrics described in Section 2 (Difference in Means, Difference in Residuals, Equal Opportunity, Equal Mis-opportunity, Disparate Impact, and Normalized Mutual Information) to the model predictions, where ŷ is the predicted label, y is the training label (y_bias in the case of label bias, y_unbias in the non-label-biased case), and z indicates membership in the protected class (in this case race = black).
Table 2. Combinations of bias types used to construct the four training datasets.

|                | No Label Bias | Label Bias |
| No Sample Bias | Dataset 1     | Dataset 3  |
| Sample Bias    | Dataset 2     | Dataset 4  |
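As a sketch, four of the six metrics can be computed directly from predictions, labels, and protected-class membership. These are our own formulations of the cited definitions, and the function name is ours:

```python
import numpy as np

def fairness_report(y_pred, y_true, z):
    """Sketch implementations of four of the six metrics.
    z = 1 marks the protected class (here race = black)."""
    y_pred = np.asarray(y_pred, dtype=float)
    y_true = np.asarray(y_true)
    z = np.asarray(z, dtype=bool)

    # Difference in Means: gap in positive prediction rates.
    diff_means = y_pred[~z].mean() - y_pred[z].mean()
    # Disparate Impact: ratio of positive prediction rates.
    disparate_impact = y_pred[z].mean() / y_pred[~z].mean()
    # Equal Opportunity: gap in true positive rates.
    equal_opp = (y_pred[~z & (y_true == 1)].mean()
                 - y_pred[z & (y_true == 1)].mean())
    # Equal Mis-opportunity: gap in false positive rates.
    equal_misopp = (y_pred[~z & (y_true == 0)].mean()
                    - y_pred[z & (y_true == 0)].mean())
    return diff_means, disparate_impact, equal_opp, equal_misopp
```

Difference in Residuals requires the model's raw scores rather than thresholded labels, and Normalized Mutual Information is treated separately in Section 5.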
4.2. Results and Discussion
The results of Experiments A and B are presented in Figures 1 and 2, respectively. With the exception of Disparate Impact, there are no established thresholds for determining what level of measured bias constitutes an unfair model. Therefore, we focus on comparisons, both within and between experiments, to illustrate the performance of each metric under different bias and ground truth conditions.
In Experiment A all metrics correctly identify Dataset 1 as least biased, and all metrics except Difference in Residuals correctly identify Dataset 4 as most biased. Comparing Datasets 2 and 3 shows that most metrics display similar sensitivities to both types of bias, with the exception of Difference in Residuals which is more sensitive to label bias, and Equal Mis-opportunity which is more sensitive to sample bias.
In Experiment B all metrics except Difference in Residuals again correctly identify Dataset 1 as least biased; however, all metrics also detect significant bias in Dataset 1 – recall that Dataset 1 contains no artificially introduced bias and is representative of ground truth. Comparing Datasets 2 and 3 shows that in Experiment B most metrics display greater sensitivity to sample bias than to label bias. In fact, comparing Datasets 1 and 3 shows that most metrics display minimal sensitivity to label bias.
Several empirical conclusions emerge from the results of these experiments. First, the inability of all metrics tested here to distinguish between bias and legitimate imbalances in the ground truth positive rate illustrates the importance of considering the expected ground truth. Second, the dependence of metric sensitivity on bias type, and the insensitivity of most metrics to label bias in Experiment B, illustrates the importance of considering the causal origin of the bias in a dataset. Finally, in practical applications fairness metrics are applied to a single dataset, and the resulting value must be interpreted on its own, without comparison against a known unbiased result. However, metric values depend strongly on the imbalances present in ground truth, which makes an individual metric difficult to interpret without some external context.
From these observations we conclude that no single fairness metric is universally applicable. When evaluating fairness in machine learning settings practitioners must carefully consider both the imbalances which may be present in the ground truth they hope to model, and the origins of the bias in the datasets they will use to create those models. We end by proposing a set of best practices to guide practitioners when evaluating which fairness metric to use.
5. Best Practice Guidelines
Having an a priori expectation for the relative ground truth positive rates between classes crucially informs which fairness metrics are appropriate, and how their values should be interpreted. In the case where external legal or moral considerations require that the positive rates be equal, most metrics can be applied and interpreted in a straightforward manner. Conversely, in the case where ground truth positive rates differ between classes, interpreting the results of fairness metrics is difficult.
In the difficult, imbalanced case causal reasoning about the data collection and labelling procedures can inform metric selection. Specifically, if the data collection is susceptible to sample bias then Normalized Mutual Information is a reasonable metric, as it displays good sensitivity to sample bias when ground truth positive rates are imbalanced. Detecting label bias in the imbalanced case is extremely challenging. Additionally, Disparate Impact is particularly ill-suited to the imbalanced case.
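A sketch of Normalized Mutual Information between binary predictions and protected-class membership follows. Normalizing by the geometric mean of the marginal entropies is one common choice; other normalizations appear in the literature:

```python
import numpy as np

def entropy(p):
    """Shannon entropy (nats) of a discrete distribution."""
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

def normalized_mutual_info(a, b):
    """NMI between two binary variables, e.g. predicted labels and
    a protected-class indicator. Assumes neither input is constant
    (otherwise the normalizing entropy is zero)."""
    a, b = np.asarray(a), np.asarray(b)
    # joint distribution over the four (a, b) value pairs
    joint = np.array([[np.mean((a == i) & (b == j)) for j in (0, 1)]
                      for i in (0, 1)])
    pa, pb = joint.sum(axis=1), joint.sum(axis=0)
    mi = sum(joint[i, j] * np.log(joint[i, j] / (pa[i] * pb[j]))
             for i in (0, 1) for j in (0, 1) if joint[i, j] > 0)
    return mi / np.sqrt(entropy(pa) * entropy(pb))
```

The value is 0 when predictions are independent of the protected class and 1 when they are fully determined by it, which gives the metric a natural scale even when the ground truth positive rates are imbalanced.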
Finally, our key recommendation for practitioners is that absent an external source of certainty about ground truth, fairness metrics in machine learning must be interpreted with a healthy dose of human judgement.
- Bickel et al. (1975) Peter J Bickel, Eugene A Hammel, J William O’Connell, and others. 1975. Sex bias in graduate admissions: Data from Berkeley. Science 187, 4175 (1975), 398–404.
- Blyth (1972) Colin R Blyth. 1972. On Simpson’s paradox and the sure-thing principle. J. Amer. Statist. Assoc. 67, 338 (1972), 364–366.
- Calders and Verwer (2010) Toon Calders and Sicco Verwer. 2010. Three naive Bayes approaches for discrimination-free classification. Data Mining and Knowledge Discovery 21, 2 (2010), 277–292.
- Calders and Žliobaitė (2013) Toon Calders and Indrė Žliobaitė. 2013. Why unbiased computational processes can lead to discriminative decision procedures. In Discrimination and Privacy in the Information Society. Springer, 43–57.
- Chouldechova (2017) Alexandra Chouldechova. 2017. Fair prediction with disparate impact: A study of bias in recidivism prediction instruments. arXiv preprint arXiv:1703.00056 (2017).
- Datta et al. (2015) Amit Datta, Michael Carl Tschantz, and Anupam Datta. 2015. Automated experiments on ad privacy settings. Proceedings on Privacy Enhancing Technologies 2015, 1 (2015), 92–112.
- Dwork et al. (2012) Cynthia Dwork, Moritz Hardt, Toniann Pitassi, Omer Reingold, and Richard Zemel. 2012. Fairness through awareness. In Proceedings of the 3rd Innovations in Theoretical Computer Science Conference. ACM, 214–226.
- Feldman et al. (2015) Michael Feldman, Sorelle A Friedler, John Moeller, Carlos Scheidegger, and Suresh Venkatasubramanian. 2015. Certifying and removing disparate impact. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 259–268.
- Hardt et al. (2016) Moritz Hardt, Eric Price, Nati Srebro, and others. 2016. Equality of opportunity in supervised learning. In Advances in Neural Information Processing Systems. 3315–3323.
- Kamishima et al. (2012) Toshihiro Kamishima, Shotaro Akaho, Hideki Asoh, and Jun Sakuma. 2012. Fairness-aware classifier with prejudice remover regularizer. Machine Learning and Knowledge Discovery in Databases (2012), 35–50.
- Kleinberg et al. (2016) Jon Kleinberg, Sendhil Mullainathan, and Manish Raghavan. 2016. Inherent trade-offs in the fair determination of risk scores. arXiv preprint arXiv:1609.05807 (2016).
- Larson et al. (2016) Jeff Larson, Surya Mattu, Lauren Kirchner, and Julia Angwin. 2016. How we analyzed the COMPAS recidivism algorithm. ProPublica (May 2016).
- O’Neil (2016) Cathy O’Neil. 2016. Weapons of math destruction: How big data increases inequality and threatens democracy. Crown Publishing Group (NY).
- Romei and Ruggieri (2014) Andrea Romei and Salvatore Ruggieri. 2014. A multidisciplinary survey on discrimination analysis. The Knowledge Engineering Review 29, 05 (2014), 582–638.
- Skiba et al. (2011) Russell Skiba, Lauren Shure, and Natasha Williams. 2011. What do we know about racial and ethnic disproportionality in school suspension and expulsion. Briefing paper developed for the Atlantic Philanthropies’ race and gender research-to-practice collaborative (2011), 1–34.
- Valentino-DeVries et al. (2012) Jennifer Valentino-DeVries, Jeremy Singer-Vine, and Ashkan Soltani. 2012. Websites Vary Prices, Deals Based on Users’ Information. (24 December 2012). https://www.wsj.com/articles/SB10001424127887323777204578189391813881534
- Zafar et al. (2015) Muhammad Bilal Zafar, Isabel Valera, Manuel Gomez Rodriguez, and Krishna P Gummadi. 2015. Learning fair classifiers. arXiv preprint arXiv:1507.05259 (2015).
- Zemel et al. (2013) Richard S Zemel, Yu Wu, Kevin Swersky, Toniann Pitassi, and Cynthia Dwork. 2013. Learning Fair Representations. ICML (3) 28 (2013), 325–333.
- Zliobaite (2015a) Indre Zliobaite. 2015a. On the relation between accuracy and fairness in binary classification. arXiv preprint arXiv:1505.05723 (2015).
- Zliobaite (2015b) Indre Zliobaite. 2015b. A survey on measuring indirect discrimination in machine learning. arXiv preprint arXiv:1511.00148 (2015).
- Žliobaite et al. (2011) Indre Žliobaite, Faisal Kamiran, and Toon Calders. 2011. Handling conditional discrimination. In 2011 IEEE 11th International Conference on Data Mining (ICDM). IEEE, 992–1001.