Bias and Fairness Audit Toolkit
Recent work has raised concerns about the risk of unintended bias in algorithmic decision making systems, which can treat individuals unfairly based on race, gender, or religion, among other characteristics. While many bias metrics and fairness definitions have been proposed in recent years, there is no consensus on which metric or definition should be used, and there are very few resources available to operationalize them. Therefore, despite recent awareness, auditing for bias and fairness when developing and deploying algorithmic decision making systems is not yet standard practice. We present Aequitas, an open source bias and fairness audit toolkit that is an intuitive and easy to use addition to the machine learning workflow, enabling users to seamlessly test models for several bias and fairness metrics in relation to multiple population sub-groups. We believe Aequitas will facilitate informed and equitable decisions around developing and deploying algorithmic decision making systems for data scientists, machine learning researchers, and policymakers.
The Machine Learning community has been responding to concerns around unintended consequences of bias and unfairness in algorithmic decision making (ADM) systems by developing methods to detect bias and assess algorithmic fairness, methods to avoid bias and disparate impact, and definitions of the tradeoffs among different criteria (Calders and Verwer, 2010; Kamishima et al., 2011; Dwork et al., 2012; Zemel et al., 2013; Feldman et al., 2015; Hardt et al., 2016; Kleinberg et al., 2016; Corbett-Davies et al., 2017; Zafar et al., 2017b, a).
Different notions of bias and fairness have been proposed. For problems where a risk score is generated, the score is considered calibrated, or test-fair, if it has equal precision among different groups (e.g. male vs. female) for every value of the predicted risk score (Kleinberg et al., 2016; Chouldechova, 2017). Another notion of fairness is equalized odds, which requires equal true positive rates and equal false positive rates across groups (Hardt et al., 2016). When the application gives more importance to the positive outcome (the “advantage” outcome), Hardt et al. (2016) propose the notion of equal opportunity, which relaxes equalized odds to require only true positive rate parity.
Kleinberg et al. (2016), Joseph et al. (2016) and Chouldechova (2017) discuss the relationship between model calibration, prevalence (the fraction of data points labeled as positive in each group), and false negative and false positive rates in risk assessment tasks. They show that when a score is calibrated and prevalence differs between groups, it is not possible to have both equal false positive rates and equal false negative rates across groups, i.e., balance for the positive and negative classes.
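One way to see why these criteria conflict is the confusion-matrix identity given by Chouldechova (2017). The sketch below writes it with $p$ for a group's prevalence and FDR, FPR, and FNR for the false discovery, false positive, and false negative rates; the symbols are shorthand chosen for this illustration, and FDR is one minus precision.

```latex
\mathrm{FPR} \;=\; \frac{p}{1-p}\cdot\frac{\mathrm{FDR}}{1-\mathrm{FDR}}\cdot\left(1-\mathrm{FNR}\right)
```

If two groups have the same precision (hence the same FDR) but different prevalence $p$, the right-hand side can only agree across groups when FNR differs, so outside degenerate cases such as a perfect predictor, FPR and FNR cannot both be equal.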
Much of this recent work has focused on a single bias metric, or on analyzing tradeoffs for a pair of measures, applied to a single problem, typically with synthetic datasets, with rare exceptions such as COMPAS (Angwin et al., 2016). There has been very little extensive empirical work on calculating a wide variety of bias metrics and fairness definitions/formalizations on real world problems, especially problems with high social impact. In addition, policymakers today do not have tangible tools to make sure that the systems they use to make critical policy decisions are fair and unbiased.
To overcome these barriers we developed Aequitas, an open source bias audit toolkit that implements several bias metrics and fairness definitions and can be used directly as a Python library, via a command line interface, or via a web application, making it accessible and friendly to a wide range of users (from data scientists to policymakers). We hope Aequitas can help make auditing for bias and fairness a standard procedure when developing or considering deploying algorithmic decision making systems and, consequently, help data scientists and policymakers make more informed and equitable decisions.
Aequitas (https://github.com/dssg/aequitas), our open-source bias and fairness audit toolkit, operationalizes a wide list of bias and fairness definitions and is designed to help data scientists, machine learning researchers and policymakers audit the output of Algorithmic Decision Making Systems to check for fairness and bias across a variety of measures and policy intervention use cases.
We expect Aequitas to be used by two types of users:
Data Scientists and Machine Learning Researchers: who are building ADM systems for use in risk assessment tools. They will use Aequitas to compare bias measures and check for disparities in different models they are building during the process of model building and selection.
Policymakers: who, before “accepting” an ADM system to use in policy decision, will run Aequitas to understand what biases exist in the system and what (if anything) they need to do in order to mitigate those biases.
We can now describe some of the key concepts and the various metrics/definitions that Aequitas implements in its current version.
A traditional binary classification task using supervised learning consists of learning a predictor $\hat{Y}$ that aims to predict the true outcome $Y$ of a given data point from the set of features $X$, based on labeled training data. Many problems in public policy can be formulated as statistical risk assessment problems in which we assign a real valued score $S$ to each entity (data point) and a decision $\hat{Y}$ is made based on the score, typically by selecting a pre-defined number ($k$) of entities that should be classified as positive. After sorting the entities by $S$, the binary predictor is defined as $\hat{Y} = 1$ if $S \geq s_k$, where $s_k$ is the score of the $k$th ordered entity.
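The top-k decision rule described above can be sketched in a few lines of Python. This is an illustration only, not Aequitas code: the function name `top_k_decisions` is hypothetical, and the sketch assumes ties at the threshold are broken by input order so that exactly k entities are selected.

```python
from typing import List


def top_k_decisions(scores: List[float], k: int) -> List[int]:
    """Return 1 for the k highest-scoring entities and 0 for the rest."""
    if not 0 <= k <= len(scores):
        raise ValueError("k must be between 0 and the number of entities")
    # Rank entity indices by score, highest first. Python's sort is stable,
    # so tied scores keep their input order (an assumption of this sketch).
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    decisions = [0] * len(scores)
    for i in ranked[:k]:
        decisions[i] = 1
    return decisions


scores = [0.91, 0.15, 0.62, 0.78, 0.40]
print(top_k_decisions(scores, k=2))  # entities 0 and 3 are selected
```

Selecting a fixed number of entities reflects settings where the intervention budget, rather than a score cutoff, determines how many entities can be acted on.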
Let us now consider a multi-valued attribute $A = \{a_1, a_2, \ldots, a_n\}$ that may or may not be a subset of $X$, for example gender = {female, male, other}. We define a group $g(a_i)$ as a set of entities (data points) that have in common a specific attribute value of $A$, for instance gender=female, corresponding to all the females in the dataset. Table 1 summarizes the preliminary definitions introduced so far.
|Score|$S$|is a real valued score assigned to each entity by the predictor.|
|Decision|$\hat{Y}$|is a binary prediction assigned to a given entity (data point), based on thresholding on the score (e.g. top k).|
|True Outcome|$Y$|is the true binary label of a given entity.|
|Attribute|$A = \{a_1, a_2, \ldots, a_n\}$|is a multi-valued attribute, e.g., gender = {female, male, other}.|
|Group|$g(a_i)$|is a group of all entities that share the same attribute value, e.g., gender=female.|
|Reference Group|$g(a_r)$|is one of the groups of $A$ that is used as reference for calculating bias measures.|
|Labeled Positive|$LP_g$|is the number of entities labeled as positive within a group.|
|Labeled Negative|$LN_g$|is the number of entities labeled as negative within a group.|
|Prevalence|$Prev_g = LP_g / |g|$|is the fraction of entities within a group whose true outcome was positive.|
Given all groups defined by the attribute $A$, and the predictions $\hat{Y}$ and true outcomes $Y$ for every entity of each group, we can now discuss group metrics. We use two metrics (Predicted Prevalence and Predicted Positive Rate) that are only concerned with the distribution of the entities across groups in the set selected for intervention (top k) and therefore do not use the true outcomes (labels). We define distributional group metrics as follows:
Predicted Positive - $PP_g$ is the number of entities within a group where the decision is positive, i.e., $\hat{Y} = 1$.
Total Predicted Positive - $K = \sum_{i=1}^{n} PP_{g(a_i)}$ is the total number of entities predicted positive across the groups defined by $A$.
Predicted Negative - $PN_g$ is the number of entities within a group where the decision is negative, i.e., $\hat{Y} = 0$.
Predicted Prevalence - $PPrev_g = PP_g / |g|$ is the fraction of entities within a group that were predicted as positive.
Predicted Positive Rate - $PPR_g = PP_g / K$ is the fraction of the entities predicted as positive that belong to a certain group.
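A minimal sketch of the distributional group metrics, using only the Python standard library. The function name, the dictionary keys, and the toy data are assumptions made for this example and do not reflect the Aequitas API.

```python
from collections import Counter
from typing import Dict, List


def distributional_metrics(groups: List[str], decisions: List[int]) -> Dict[str, Dict[str, float]]:
    """Per-group predicted positives/negatives, predicted prevalence and PPR."""
    if len(groups) != len(decisions):
        raise ValueError("groups and decisions must have the same length")
    group_size = Counter(groups)
    predicted_positive = Counter(g for g, d in zip(groups, decisions) if d == 1)
    k = sum(predicted_positive.values())  # total predicted positive across groups
    metrics = {}
    for g, size in group_size.items():
        pp = predicted_positive[g]  # Counter returns 0 for groups never selected
        metrics[g] = {
            "PP": pp,
            "PN": size - pp,
            "PPrev": pp / size,  # share of the group that is selected
            "PPR": pp / k if k else float("nan"),  # share of the selected set in the group
        }
    return metrics


# Toy data: one attribute value and one binary decision per entity.
groups = ["female", "female", "female", "male", "male", "male", "male", "male"]
decisions = [1, 0, 0, 1, 1, 1, 0, 0]
for group, values in distributional_metrics(groups, decisions).items():
    print(group, values)
```

In the toy data, one of three females and three of five males are selected, so the two groups differ both in predicted prevalence (1/3 vs. 3/5) and in their share of the selected set (1/4 vs. 3/4).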
We now define group metrics that require the true outcome (label) to be calculated. We focus on type I (false positives) and type II (false negative) errors across different groups. In the context of public policy and social good the goal is to avoid disproportionate errors in specific groups. We use four different error-based group metrics defined as follows:
False Positive - $FP_g$ is the number of entities of the group with $\hat{Y} = 1 \wedge Y = 0$.
False Negative - $FN_g$ is the number of entities of the group with $\hat{Y} = 0 \wedge Y = 1$.
True Positive - $TP_g$ is the number of entities of the group with $\hat{Y} = 1 \wedge Y = 1$.
True Negative - $TN_g$ is the number of entities of the group with $\hat{Y} = 0 \wedge Y = 0$.
False Discovery Rate - $FDR_g = FP_g / PP_g$ is the fraction of false positives of a group within the predicted positives of the group.
False Omission Rate - $FOR_g = FN_g / PN_g$ is the fraction of false negatives of a group within the predicted negatives of the group.
False Positive Rate - $FPR_g = FP_g / LN_g$ is the fraction of false positives of a group within the labeled negatives of the group.
False Negative Rate - $FNR_g = FN_g / LP_g$ is the fraction of false negatives of a group within the labeled positives of the group.
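The error-based metrics for a single group can be computed from its decisions and labels as sketched below. The helper names and the toy data are hypothetical, and returning NaN for an empty denominator is a choice made for this sketch.

```python
import math
from typing import Dict, List


def _ratio(numerator: int, denominator: int) -> float:
    """Divide, returning NaN when the denominator is empty."""
    return numerator / denominator if denominator else math.nan


def error_metrics(decisions: List[int], labels: List[int]) -> Dict[str, float]:
    """Confusion counts and error rates for the entities of one group."""
    pairs = list(zip(decisions, labels))
    fp = sum(1 for d, y in pairs if d == 1 and y == 0)
    fn = sum(1 for d, y in pairs if d == 0 and y == 1)
    tp = sum(1 for d, y in pairs if d == 1 and y == 1)
    tn = sum(1 for d, y in pairs if d == 0 and y == 0)
    return {
        "FP": fp, "FN": fn, "TP": tp, "TN": tn,
        "FDR": _ratio(fp, fp + tp),  # false positives over predicted positives
        "FOR": _ratio(fn, fn + tn),  # false negatives over predicted negatives
        "FPR": _ratio(fp, fp + tn),  # false positives over labeled negatives
        "FNR": _ratio(fn, fn + tp),  # false negatives over labeled positives
    }


# Toy data for one group: three entities selected, three labeled positive.
decisions = [1, 1, 1, 0, 0, 0, 0, 0]
labels = [1, 0, 0, 1, 0, 0, 0, 1]
print(error_metrics(decisions, labels))
```

FDR and FOR condition on the decision, while FPR and FNR condition on the true outcome, which is why the same false positive count can yield different rates.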
Our formulation allows performing bias and fairness analysis on any multi-valued attribute, not just on pre-defined protected attributes. We define bias as a disparity measure across groups when compared with a reference group. This reference group can be selected using different criteria. For instance, one could use the majority group (the one with the largest size) across the groups defined by $A$, the group with the minimum value of the group metric, or the traditional approach of fixing a historically favored group (e.g. race=white). The bias measures are applied on a pairwise basis, comparing groups defined by a given attribute $A$. We define the following bias metrics:
Predicted Positive Rate Disparity is a bias metric that measures the disparity in predicted positive rates between a given group $g(a_i)$ and a reference group $g(a_r)$: $PPR_{disp_{g(a_i)}} = PPR_{g(a_i)} / PPR_{g(a_r)}$
Predicted Positive Group Rate Disparity is a bias metric that measures the disparity in predicted prevalences between a given group and a reference group: $PPrev_{disp_{g(a_i)}} = PPrev_{g(a_i)} / PPrev_{g(a_r)}$
FDR Disparity is a bias metric that measures the disparity in false discovery rates between a given group and a reference group: $FDR_{disp_{g(a_i)}} = FDR_{g(a_i)} / FDR_{g(a_r)}$
FPR Disparity is a bias metric that measures the disparity in false positive rates between a given group and a reference group: $FPR_{disp_{g(a_i)}} = FPR_{g(a_i)} / FPR_{g(a_r)}$
FOR Disparity is a bias metric that measures the disparity in false omission rates between a given group and a reference group: $FOR_{disp_{g(a_i)}} = FOR_{g(a_i)} / FOR_{g(a_r)}$
FNR Disparity is a bias metric that measures the disparity in false negative rates between a given group and a reference group: $FNR_{disp_{g(a_i)}} = FNR_{g(a_i)} / FNR_{g(a_r)}$
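Each of these bias metrics is the ratio of a group metric to the same metric in the reference group, so a single helper covers all six. The sketch below assumes that ratio form; the function name and the example rates are hypothetical.

```python
from typing import Dict


def disparities(metric_by_group: Dict[str, float], reference: str) -> Dict[str, float]:
    """Ratio of each group's metric to the reference group's metric."""
    if reference not in metric_by_group:
        raise KeyError(f"unknown reference group: {reference}")
    ref_value = metric_by_group[reference]
    if ref_value == 0:
        raise ValueError("reference group metric is zero, so the ratio is undefined")
    return {group: value / ref_value for group, value in metric_by_group.items()}


# Hypothetical false positive rates for three groups of one attribute.
fpr_by_group = {"group_a": 0.30, "group_b": 0.15, "group_c": 0.12}
print(disparities(fpr_by_group, reference="group_b"))
```

The reference group always has a disparity of 1, values above 1 mean the group has a higher rate than the reference, and values below 1 mean a lower rate.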
Aequitas uses parity-based measures of impact fairness: our concept of fairness relies on group-wise impact using parity constraints. Our formulation and implementation of fairness is flexible, as it relies on a real valued parameter $\tau \in (0, 1]$ to control the range of disparity values that can be considered fair. One example of such a formulation is the “80% rule”, represented by $\tau = 0.8$. A predictor is only as fair as the maximum disparity across the groups defined by $A$ allows. This notion requires all biases (disparities) to be within the range defined by $\tau$: $\tau \leq Disparity_{g(a_i)} \leq 1/\tau$
We define two types of fairness: unsupervised fairness, based on the disparities of the group metrics that do not use true outcomes, and supervised fairness, based on the disparities of the error-based group metrics that require labels.
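The fairness test can be sketched as a range check on each disparity value. The code assumes the fair range runs from the threshold to its reciprocal, as in the “80% rule” with a threshold of 0.8; the function names and the disparity values are hypothetical.

```python
from typing import Dict


def is_fair(disparity: float, tau: float = 0.8) -> bool:
    """True when the disparity lies within [tau, 1 / tau]."""
    if not 0 < tau <= 1:
        raise ValueError("tau must be in the interval (0, 1]")
    return tau <= disparity <= 1 / tau


def audit(disparity_by_group: Dict[str, float], tau: float = 0.8) -> Dict[str, bool]:
    """Apply the range check to every group of one attribute."""
    return {group: is_fair(value, tau) for group, value in disparity_by_group.items()}


# Hypothetical disparities relative to a reference group (whose value is 1.0).
results = audit({"group_a": 1.6, "group_b": 1.0, "group_c": 0.9})
print(results)
print("attribute is fair:", all(results.values()))
```

Because every group must pass, a single group outside the range is enough to flag the attribute as unfair for that metric.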
Aequitas has the following components:
Aequitas requires the following data provided as input:
a set of predictions: entities and scores given to the entities (by a machine learning model for example)
attributes for each entity (age, gender, etc.)
true outcomes/labels for each entity (optional: to calculate supervised fairness measures)
Attributes and values of interest (gender: male, female for example)
Reference value for each group attribute to calculate bias ratios from (male for gender, for example)
Bias Measures to calculate (FPR, FOR, for example)
Aequitas generates outputs in a few formats:
Database Tables: it calculates all the raw error metrics and disparity measures and stores them in a database table for further analysis
PDF report: A bias audit report generated as a PDF
Visual Interactive Web Report: that allows users to interactively explore different attributes and bias measures. The images we show in this paper are from the interactive dashboard.
To show the utility of Aequitas, we used it to audit several machine learning systems being used to solve problems in public health, economic development, criminal justice, education, and public safety (Saleiro and Ghani, 2018). Here we present a short case study of one of our audits, using a publicly available data set (COMPAS) from criminal justice.
The goal of COMPAS was to identify individuals who are at risk of recidivism in order to support pretrial release decisions. In a recent, widely publicized investigation conducted by a team at ProPublica, Angwin et al. (2016) concluded that it was biased against black defendants.
We audited the predictions using Aequitas. We find that there is indeed unfairness (both unsupervised and supervised) in the model in all three attributes of interest: age, gender, and race.
Figure 1 shows the detailed group metric results. Each row represents a specific attribute-value pair (e.g. gender:female), and each bar (column) represents a group metric of interest (False Positive Rate, for example). Green bars represent groups for which the model does not exhibit bias within that metric. Red bars are those that are unfavorably biased compared to the reference group. In this case we used a fairness threshold and the results show that for every metric considered there is some kind of bias towards specific groups. For instance, PPR results show that COMPAS mostly considers as high risk people aged 18-25, males, and African-Americans, and that, relative to each group's size, younger people, Native Americans, and African-Americans are being selected disproportionately.
To easily visualize the disparities between the different groups, the tool also produces results for the bias measures. We then have to determine which bias measure is relevant for our setting. If the interventions we’re focusing on in our setting are “assistive”, we only need to consider Type II parity – False Omission Rates (FOR) and False Negative Rates (FNR). This is because our interventions are preventative and designed to provide extra assistance to individuals. Providing this assistance to individuals who are false positives will not hurt them but missing individuals could be harmful to them.
If the interventions we’re focusing on in our setting are “punitive”, we need to consider Type I parity: False Discovery Rates (FDR) and False Positive Rates (FPR). This is because providing a punitive intervention to individuals who are false positives will hurt them. Since in the COMPAS setting the predictions are being used to make pretrial release decisions, we care about FPR and FDR parity.
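The rule of thumb for choosing which parity to examine can be written as a small lookup. This sketch only encodes the two cases discussed in the text; the function name and the lowercase keys are hypothetical.

```python
from typing import Dict, List

# Metrics to examine for each intervention type, as described in the text.
RELEVANT_METRICS: Dict[str, List[str]] = {
    "assistive": ["FOR", "FNR"],  # Type II parity: missing someone is the harm
    "punitive": ["FDR", "FPR"],   # Type I parity: a false positive is the harm
}


def relevant_metrics(intervention: str) -> List[str]:
    """Return the error-rate metrics to audit for an intervention type."""
    key = intervention.strip().lower()
    if key not in RELEVANT_METRICS:
        raise ValueError(f"unknown intervention type: {intervention!r}")
    return RELEVANT_METRICS[key]


print(relevant_metrics("punitive"))
```

For a pretrial release setting like COMPAS, the punitive branch applies, which points the audit at FDR and FPR disparities.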
Looking at Figure 2, we can see that COMPAS is fair regarding FDR for race but, as ProPublica found, the FPR for African-Americans is almost twice the FPR for Caucasians. For age we observe the same pattern for the same two metrics: FDR results are fair, but the FPR for <25 is 1.6 times higher than for 25-45. On the other hand, if we consider the distribution of false positive errors by sex, we observe the opposite: the model is fair for FPR, but the FDR for Female is 1.34 times higher than for Male.
In this paper, we presented Aequitas, our open-source fairness audit toolkit that is designed to help data scientists, machine learning researchers, and policymakers audit the output of Algorithmic Decision Making Systems to check for fairness and bias across a variety of measures and use cases. By breaking down the COMPAS predictions using a variety of bias and fairness metrics calculated using different reference groups, we show how Aequitas can help its users surface the specific metrics for which the model is imposing bias on given attribute groups.
The work here is a start at building tools for data scientists and policymakers that help them achieve fairness and equity in enacting policies. In addition to this tool, we also need to develop trainings for both of those audiences to help them understand the impact of these biases and to make informed policy decisions in the presence of ADMs.
Toon Calders and Sicco Verwer. Three naive Bayes approaches for discrimination-free classification. Data Mining and Knowledge Discovery, 21(2):277–292, 2010.