1 Introduction
Machine learning (ML) algorithms are being adopted widely in areas ranging from recommendation systems and ad display, to hiring, loan approvals, and determining recidivism in courts. Despite their potential benefits, these algorithms can still exhibit or amplify existing societal biases (Angwin et al., 2016; Zhang et al., 2017; Obermeyer et al., 2019; Lambrecht and Tucker, 2019). In this paper, we explore the interplay of two potential sources of such algorithmic biases: social bias and data bias. We use the term social bias or unfairness to mean that an algorithm makes decisions in favor of or against individuals in a way that is “inconsistent” or discriminatory across groups with different social identities (e.g., race, gender). A variety of group fairness criteria (e.g., demographic parity, equalized odds) have been proposed for assessing and preventing these forms of unfairness (Mehrabi et al., 2021; Barocas et al., 2017); these criteria typically aim to (approximately) equalize a statistical measure (e.g., selection rate, true positive rate) between different groups. We use the term statistical or data bias, on the other hand, to refer to biases in the datasets used to train the algorithm.
Our work identifies the impacts of statistical data biases on the efficacy of existing fairness criteria in addressing social biases. In particular, existing fairness criteria have been proposed assuming access to unbiased training data when assessing whether an algorithm meets their desired notion of fairness. However, it has been widely observed that existing datasets suffer from the biases and errors of prior decision makers (Blum and Stangl, 2020; Fogliato et al., 2020; Jiang and Nachum, 2020; Kallus and Zhou, 2018; Wick et al., 2019). Any data-driven decision rule is inevitably only as good as the data it is trained on; training an algorithm to meet a fairness constraint is no exception. As a result, a decision maker’s attempts to attain a desired notion of (demographic) fairness by imposing a fairness constraint can be thwarted by statistical data biases. However, as we show both analytically and numerically, existing fairness criteria differ considerably in their robustness against different statistical biases.
We consider a setting in which a firm makes binary decisions (accept/reject) on agents from two demographic groups and , with denoting the disadvantaged group. We assume the training data is statistically biased as a prior decision maker has made errors when either assessing the true qualification state (label) of individuals or when measuring their features. The firm selects its decision rules based on this biased data, potentially subject to one of four fairness constraints: Demographic Parity (DP), True/False Positive Rate Parity (TPR/FPR), or Equalized Odds (EO).
In Section 3, we first assess the sensitivity of different fairness-constrained decision rules to qualification assessment biases (which can be viewed as labeling biases) on the disadvantaged group. We analytically show that some existing fairness criteria (namely, Demographic Parity and True Positive Rate Parity/Equality of Opportunity) exhibit more robustness to such statistical biases compared to others (False Positive Rate Parity and Equalized Odds). That is, despite being trained on biased data, the resulting DP/TPR-constrained decision rules continue to satisfy the desired DP/TPR fairness criteria when implemented on unbiased data. This can be interpreted as a positive byproduct of these fairness criteria, in that (social) fairness criteria are not violated despite statistical data biases.
We also find that even when the fairness criteria are still satisfied, the resulting decision rules differ from those that would have been obtained if the data were unbiased. Specifically, we show that the selection thresholds on both groups increase compared to the unbiased fair thresholds for all constraints (and hence move closer to the “unfair” thresholds due to data biases), but at different rates. As a result, we find that the loss in the firm’s utility as bias increases differs across fairness criteria. We also present similar analyses under qualification assessment biases on the advantaged group, as well as when the statistical biases are due to prior feature measurement errors on the disadvantaged group.
In Section 4, we provide support for our analytical findings through numerical experiments based on three real-world datasets (FICO, Adult, and German credit score datasets). Our numerical experiments, consistent with our analytical findings, highlight the differing robustness of fairness measures when facing statistical data biases, both in terms of satisfiability of the desired fairness criterion and the changes in the firm’s expected payoff. Notably, in contrast to the typically discussed “fairness-accuracy trade-off”, we show that at times the adoption of a fair decision rule can increase a firm’s expected performance compared to an accuracy-maximizing (unfair) algorithm when training datasets are biased. We highlight this observation in Section 4 and provide an intuitive explanation for it by interpreting fairness constraints as having a regularization effect; we also provide an analytical explanation in Appendix E.5.
Together, our findings can serve as an additional guideline when choosing among existing fairness criteria, or when proposing new criteria. They can inform decision makers who suspect that their available dataset suffers from statistical biases (including while data collection/debiasing efforts are in progress), and who can accordingly select robust fairness criteria that would either continue to be met, or be less drastically impacted, in spite of any potential data biases. We provide additional discussion about the potential implications of our findings, limitations, and future directions, in Appendix A.
Related work. The interplay between data biases and fair ML has been a subject of growing interest (Ensign et al., 2018; Neel and Roth, 2018; Bechavod et al., 2019; Kilbertus et al., 2020; Wei, 2021; Blum and Stangl, 2020; Jiang and Nachum, 2020; Kallus and Zhou, 2018; Rezaei et al., 2021; Fogliato et al., 2020; Wick et al., 2019), and our paper falls within this general category. Most of these works differ from ours in that they explore how feedback loops, censored feedback, and/or adaptive data collection lead to statistical data biases, how these exacerbate algorithmic unfairness, how to debias data, and how to build fair algorithms robust to data bias.
Most closely related to our work are (Blum and Stangl, 2020; Jiang and Nachum, 2020; Fogliato et al., 2020; Wick et al., 2019), which study the interplay between labeling biases and algorithmic fairness. Jiang and Nachum (2020) propose to address label biases in the data directly, by assigning appropriately selected weights to different samples in the training dataset. Blum and Stangl (2020) study labeling biases in the qualified disadvantaged group, as well as reweighting techniques for debiasing data. Further, they show that fairness intervention in the form of imposing Equality of Opportunity can in fact improve the accuracy achievable on biased training data. Fogliato et al. (2020) propose a sensitivity analysis framework to examine the fairness of a model obtained from biased data and consider errors in identifying the unqualified advantaged group. Wick et al. (2019) consider errors in identifying both the unqualified advantaged group and qualified disadvantaged group together, and focus on the fairness-accuracy trade-off when applying different approaches to achieve Demographic Parity. In contrast to these works, we contribute a comprehensive assessment of the robustness of different group fairness criteria to several types of labeling bias as well as feature measurement errors. We review additional related work, including work on defining fairness criteria and how to satisfy them (Mehrabi et al., 2021; Barocas et al., 2017; Chouldechova, 2017; Kleinberg et al., 2016; Kamiran and Calders, 2012; Jiang and Nachum, 2020; Agarwal et al., 2018; Zafar et al., 2019; Hardt et al., 2016), in Appendix B.
2 Problem Setting
We analyze an environment consisting of a firm (the decision maker) and a population of agents, as detailed below. Table 1 in Appendix C summarizes the notation.

The agents. Consider a population of agents composed of two demographic groups, distinguished by a sensitive attribute (e.g., race, gender). Let denote the fraction of the population who are in group . Each agent has an observable feature , representing information that is used by the firm in making its decisions; these could be, e.g., exam scores or credit scores.¹ The agent further has a (hidden) binary qualification state , with and denoting those qualified and unqualified to receive favorable decisions, respectively. Let denote the qualification rate in group . In addition, let denote the probability density function (pdf) of the distribution of features for individuals with qualification state from group . We make the following assumption on these feature distributions.

Assumption 1. The pdfs and their CDFs are continuously differentiable, and the pdfs satisfy the strict monotone likelihood ratio property, i.e., is strictly increasing in .

¹We consider one-dimensional features (numerical scores) for ease of exposition in our analysis. Our experiments consider both one-dimensional and multi-dimensional features.
This assumption implies that an individual is more likely to be qualified as its feature (score) increases.
We further define the qualification profile of group as , which captures the likelihood that an agent with feature from group is qualified. For instance, this could capture estimated repay probabilities given the observed credit scores (which may differ across groups). We will let group be the group with a lower likelihood of being qualified at the same feature (i.e., if ), and refer to this group as the disadvantaged group.

As we show in Section 3, the firm’s optimal decision rule can be determined based on the qualification rates and either one of the other problem primitives: the feature distributions or the qualification profiles . These quantities are related to each other as follows:
(1) 
We note that under Assumption 1, is strictly increasing in .
Existing real-world datasets also often provide information on the qualification rates together with either the feature distributions (e.g., the UCI Adult dataset) or the qualification profiles (e.g., the FICO credit score dataset); see Section 4. We will later detail how data biases (in the form of labeling or feature measurement errors) can be viewed as inaccuracies in these measures.
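As a concrete illustration of the relation in (1), the snippet below computes a qualification profile from the qualification rate and a pair of conditional feature distributions via Bayes' rule, and checks that it is increasing, as Assumption 1 guarantees. This is a minimal sketch with Gaussian distributions; all parameter values and function names here are our own illustration, not taken from the paper.

```python
import math

def gauss_pdf(x, mu, sigma):
    """Normal density, standing in for the conditional feature distribution f^y(x)."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def qualification_profile(x, alpha, mu1, mu0, sigma):
    """P(qualified | feature x) via Bayes' rule: alpha*f1 / (alpha*f1 + (1-alpha)*f0)."""
    f1 = gauss_pdf(x, mu1, sigma)
    f0 = gauss_pdf(x, mu0, sigma)
    return alpha * f1 / (alpha * f1 + (1 - alpha) * f0)

# Equal-variance Gaussians with mu1 > mu0 satisfy the strict MLR property,
# so the resulting profile is strictly increasing in x.
xs = [i / 10 for i in range(-30, 31)]
profile = [qualification_profile(x, alpha=0.5, mu1=1.0, mu0=-1.0, sigma=1.0) for x in xs]
assert all(a < b for a, b in zip(profile, profile[1:]))
```

The monotonicity check at the end is exactly the consequence of Assumption 1 noted in the text.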
The firm. A firm makes binary decisions on agents from each group based on their observable features, with and denoting reject and accept decisions, respectively. The firm gains a benefit of from accepting qualified individuals, and incurs a loss of from accepting unqualified individuals. The goal of the firm is to select a (potentially group-dependent) decision rule or policy to maximize its expected payoff. In this paper, we restrict attention to threshold policies , where denotes the indicator function and is the decision threshold for group .² Let denote the firm’s expected payoff under policies , with denoting the payoff from group agents.

²Prior work (Liu et al., 2018; Zhang et al., 2020) shows that threshold policies are optimal under Assumption 1 when selecting fairness-unconstrained policies, and optimal in the fairness-constrained case given additional mild assumptions.
The firm may further impose a (group) fairness constraint on the choice of its decision rule. While our framework is more generally applicable, we focus our analysis on Demographic Parity (DP) and True/False Positive Rate Parity (TPR/FPR).³

³We also study Equalized Odds (EO) (Hardt et al., 2016) in our experiments, which requires both TPR and FPR parity.
Let denote the fairness constraint,⁴ where . These constraints can be expressed as follows: DP: This constraint equalizes the selection rate across groups, and is given by ; TPR: Also known as Equality of Opportunity (Hardt et al., 2016), this constraint equalizes the true positive rate across groups, and can be expressed as ; FPR: False positive rate parity is defined similarly, with .

⁴The choice of hard constraints is for theoretical convenience. In our experiments, we allow for soft constraints of the form .
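The three parity criteria above can be evaluated empirically for a pair of threshold policies as in the sketch below. The helper names and the tiny dataset are illustrative assumptions of ours; the paper states these constraints only in symbolic form.

```python
def group_stats(xs, ys, threshold):
    """Selection rate, TPR, and FPR of the threshold policy 1{x >= threshold}."""
    accepted = [x >= threshold for x in xs]
    sel = sum(accepted) / len(xs)
    pos = [a for a, y in zip(accepted, ys) if y == 1]  # truly qualified agents
    neg = [a for a, y in zip(accepted, ys) if y == 0]  # truly unqualified agents
    return sel, sum(pos) / len(pos), sum(neg) / len(neg)

def violations(data_a, data_b, th_a, th_b):
    """Absolute DP / TPR / FPR gaps between the two groups."""
    sa = group_stats(*data_a, th_a)
    sb = group_stats(*data_b, th_b)
    return {name: abs(u - v) for name, u, v in zip(("DP", "TPR", "FPR"), sa, sb)}

# Tiny illustration: identical groups and thresholds give zero violation everywhere.
data = ([0, 1, 2, 3], [0, 0, 1, 1])
assert violations(data, data, 2, 2) == {"DP": 0.0, "TPR": 0.0, "FPR": 0.0}
```

This is also the form of the soft-constraint violation measure used in the experiments (Section 4), where the gap is compared against a tolerance rather than required to be exactly zero.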
Accordingly, the firm’s optimal choice of decision thresholds can be determined by:
(2) 
Let denote the solution of (2) under fairness constraint , and denote the Maximum Utility (MU) thresholds (i.e., maximizers of the firm’s expected payoff in the absence of a fairness constraint).
Dataset biases. In order to solve (2), the firm relies on historical information and training datasets to obtain estimates of the underlying population characteristics: the qualification rates , the feature distributions , and/or the qualification profiles . However, the estimated quantities , , and/or may differ from the true population characteristics. We refer to the inaccuracies in these estimates as data bias, with the following interpretations:
1. Qualification assessment biases, reflected in the form of errors in the qualification profiles . We note that such biases can also affect the estimates of and . This case is most similar to the labeling biases considered in prior work (Blum and Stangl, 2020; Jiang and Nachum, 2020; Fogliato et al., 2020; Wick et al., 2019).
2. Measurement errors, reflected in the form of shifts in the feature distributions . These can also be viewed as performative shifts (Perdomo et al., 2020), or other distribution shifts over time. This case generalizes the measurement biases considered in Liu et al. (2018).
Note that both the firm’s expected payoff (objective function in (2)) and the fairness criteria are impacted by such data biases. In the next sections, we analyze, both theoretically and numerically, the impacts of qualification assessment biases or feature measurement errors on the firm’s ability to satisfy the desired fairness metric f, as well as on the firm’s expected payoff.
3 Analytical Results
In this section, we analytically assess the robustness of the DP, TPR, and FPR fairness criteria to the different forms of data bias detailed in Section 2. All proofs are included in Appendix E.
We begin by characterizing the decision thresholds that maximize the firm’s expected utility, in the absence of any fairness constraints, and investigate the impacts of data biases on these thresholds.
Lemma 1 (Optimal MU thresholds).
The thresholds maximizing the firm’s utility satisfy . Equivalently, this can be expressed as .
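Lemma 1's characterization (whose exact expression appears in the paper's symbolic statement) can be turned into a numerical procedure. Assuming the standard form of this condition in such models — the MU threshold is the feature value at which the expected payoff of accepting crosses zero, i.e., where the qualification profile equals loss / (benefit + loss) — a bisection search suffices, because the qualification profile is strictly increasing under Assumption 1. This is a sketch under that assumption, not the paper's exact derivation.

```python
def mu_threshold(profile, u_plus, u_minus, lo=-10.0, hi=10.0, tol=1e-9):
    """Bisection for the theta with profile(theta) == u_minus / (u_plus + u_minus).

    `profile` is the group's qualification profile, assumed strictly increasing
    on [lo, hi] (guaranteed by Assumption 1), so the root is unique.
    """
    target = u_minus / (u_plus + u_minus)
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if profile(mid) < target:
            lo = mid  # profile too low: threshold must lie to the right
        else:
            hi = mid
    return (lo + hi) / 2.0
```

For example, with a symmetric logistic profile and equal benefit and loss, the threshold lands at the point where the profile equals 1/2.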
Lemma 2 (Impact of data biases on MU thresholds and firm’s utility).
Let and denote the optimal MU decision thresholds for group , obtained given unbiased data and data with biases on group , respectively. If (i) (qualification profiles are underestimated), or, (ii) (happens e.g. if , ; scores are underestimated), then the decision threshold on group increases, i.e., . The reverse holds if the inequalities above are reversed. In all these cases, the decisions on group are unaffected, i.e., . Further, the firm’s utility decreases in all cases, i.e., .
As intuitively expected, biases against the disadvantaged group (underestimation of their qualification profiles, or scores) lead to an increase in their disadvantage; the reverse is true if the group is perceived more favorably. We also note that the decisions on group remain unaffected by any biases in group ’s data. This implies that if the representation of group is small, the firm has less incentive for investing resources in removing data biases on group . In the remainder of this section, we will show that the coupling introduced between the two groups’ decisions by fairness criteria breaks this independence. A takeaway from this observation is that once a fairness-constrained algorithm couples the decision rules of the two groups, it also makes statistical data debiasing efforts advantageous to both groups, and therefore increases a (fair) firm’s incentives for data debiasing.
The next lemma characterizes the optimal fairnessconstrained decision thresholds.
Lemma 3 (Optimal fair thresholds).
The thresholds maximizing the firm’s expected utility subject to a fairness constraint satisfy .
This characterization is similar to those obtained in prior work (Zhang et al., 2019, 2020), though our derivation technique is different. Using Lemma 3, we can further characterize the thresholds for fairness criteria , as derived in Tables 3 and 4 in Appendix E.4. These form the basis of the next set of results, which shed light on the sensitivity of different fairness criteria to biased training data.
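Lemma 3's coupled fair thresholds can also be approximated numerically. The sketch below is our own brute-force construction (Gaussian groups and a grid search), not the paper's closed-form characterization: it finds the utility-maximizing threshold pair subject to (approximately) equal selection rates, i.e., the DP-constrained optimum. All parameters are illustrative.

```python
import math

def norm_cdf(x, mu, sigma=1.0):
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def selection_rate(theta, alpha, mu1, mu0):
    """P(X >= theta) under the mixture alpha*N(mu1,1) + (1-alpha)*N(mu0,1)."""
    return alpha * (1.0 - norm_cdf(theta, mu1)) + (1.0 - alpha) * (1.0 - norm_cdf(theta, mu0))

def utility(theta, alpha, mu1, mu0, u_plus, u_minus):
    """Expected payoff: u_plus per accepted qualified agent, -u_minus per accepted unqualified one."""
    tp = alpha * (1.0 - norm_cdf(theta, mu1))
    fp = (1.0 - alpha) * (1.0 - norm_cdf(theta, mu0))
    return u_plus * tp - u_minus * fp

def dp_thresholds(group_a, group_b, w_a, u_plus, u_minus):
    """Grid search for the utility-maximizing threshold pair with matched selection rates."""
    grid = [i / 50.0 for i in range(-150, 151)]
    best, best_pair = -float("inf"), None
    for ta in grid:
        target = selection_rate(ta, *group_a)
        # pick the group-b threshold whose selection rate is closest to group a's
        tb = min(grid, key=lambda t: abs(selection_rate(t, *group_b) - target))
        u = w_a * utility(ta, *group_a, u_plus, u_minus) + (1 - w_a) * utility(tb, *group_b, u_plus, u_minus)
        if u > best:
            best, best_pair = u, (ta, tb)
    return best_pair
```

As a sanity check, when the two groups are statistically identical the DP constraint is non-binding and both thresholds coincide with the MU threshold.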
3.1 Impacts of qualification assessment biases
We begin by assessing the sensitivity of fairness-constrained policies to biases in qualification assessment. Specifically, we first consider biases that result in , where is the underestimation rate. This can be viewed as label bias due to a prior decision maker/policy that had a probability of correctly identifying qualified agents from the disadvantaged group . These biases will not be corrected post-decision due to censored feedback: once a qualified individual is labeled 0 and rejected, the firm does not have an opportunity to observe the individual and assess whether this was indeed the correct label. We first analyze the impacts of such biases on the decision thresholds and on the firm’s utility.
Proposition 1 (Impact of qualification assessment biases in the disadvantaged group on DP/TPR/FPR thresholds and firm’s utility).
Assume the qualification profile of group is underestimated so that , where . Let and denote the optimal decision thresholds satisfying fairness constraint , obtained from unbiased data and from data with biases on group given , respectively. Then,
(i) for . Further, is decreasing in .
(ii) The DP and TPR criteria continue to be met at the new thresholds, while FPR is violated.
(iii) The firm’s utility decreases under , i.e., .
(iv) The firm’s utility may increase or decrease under .
Figure 1 illustrates Proposition 1 on a synthetic dataset inspired by the FICO dataset (Hardt et al., 2016).
The proof of this proposition relies on the characterizations of the optimal fairnessconstrained thresholds in Tables 3 and 4, together with identifying the changes in the problem primitives when . In particular, we show that , , and . Intuitively, these changes can be explained as follows: can be viewed as flipping label 1 to label 0 in the training data on group with probability . This leaves the feature distribution of qualified agents unchanged, whereas it adds (incorrect) data on unqualified agents, hence biasing . Such label flipping also decreases estimated qualification rates by a factor . Using these, we show that DP/TPR continue to hold at the biased thresholds given the changes in the statistics they rely on, while FPR is violated.
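The label-flipping intuition above can be checked directly in simulation: flipping each qualified agent's label to 0 with the stated probability shrinks the estimated qualification rate by the same factor, while the feature distribution of the agents still labeled qualified is untouched (they are an unbiased subsample). The parameters below are illustrative, with `beta` playing the role of the correct-labeling probability.

```python
import random

rng = random.Random(0)
alpha = 0.6      # true qualification rate of the disadvantaged group (illustrative)
beta = 0.7       # probability a qualified agent is correctly labeled
n = 200_000

# Draw (feature, label) pairs: qualified agents score higher on average.
samples = []
for _ in range(n):
    y = 1 if rng.random() < alpha else 0
    samples.append((rng.gauss(1.0 if y else -1.0, 1.0), y))

def flip_labels(data, beta, rng):
    """Label bias: each qualified agent keeps label 1 only with probability beta."""
    return [(x, y if y == 0 or rng.random() < beta else 0) for x, y in data]

biased = flip_labels(samples, beta, rng)

# The estimated qualification rate shrinks by (roughly) a factor of beta ...
alpha_hat = sum(y for _, y in biased) / n
assert abs(alpha_hat - beta * alpha) < 0.01

# ... while the feature distribution of agents still labeled 1 is unchanged:
# their sample mean stays near the true qualified mean of 1.0.
mean_kept = sum(x for x, y in biased if y == 1) / sum(y for _, y in biased)
assert abs(mean_kept - 1.0) < 0.02
```

The mislabeled qualified agents instead contaminate the estimated feature distribution of the unqualified, which is the mechanism behind the biased estimates described above.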
We note two main differences of this lemma with Lemma 2 in the unconstrained setting: (1) the biases in group ’s data now lead to underselection of both groups compared to the unbiased case. That is, the introduction of fairness constraints couples the groups in the impact of data biases as well. (2) Perhaps more interestingly, there exist scenarios in which the adoption of a fairness constraint benefits a firm facing biased qualification assessments. (We provide additional intuition for this in Appendix E.5). Note however that the fairness criterion is no longer satisfied in such scenarios.
In addition to these observations, Proposition 1 shows that the DP and TPR fairness criteria are robust to underestimation of qualification profiles of the disadvantaged group, in that the obtained thresholds continue to satisfy the desired notion of fairness. That said, the changes in these thresholds lead to loss of utility for the firm. To better assess the impacts of these changes on the firm’s expected payoff, the next proposition characterizes the sensitivity of DP and TPR thresholds to the error rate .
Proposition 2 (Sensitivity of DP/TPR thresholds to qualification assessment biases).
Consider the same setting as Proposition 1. Then, the rate of change of group ’s thresholds at is given by
Further, the rate of change of group ’s thresholds at is given by
Consistent with Proposition 1, this proposition shows that the thresholds increase relative to the unbiased case when qualification assessments become biased (i.e., decreases from 1).⁵ The proposition also allows us to assess these sensitivities based on other problem parameters:

⁵We note that our proof characterizes the sensitivities at all bias rates ; we highlight the sensitivity at to provide intuition for low levels of bias.
The DP fairness constraint only evaluates the selection rates from each group (regardless of qualifications). Therefore, we see that the increase in group ’s threshold is sharper if is smaller (i.e., the representation of group increases). This is consistent with the following intuition: the impacts of biases are more pronounced when the effects of this group on the firm’s utility increases.
For the TPR fairness constraint on the other hand, the increase in group ’s threshold is impacted by additional problem parameters: it is higher if or is smaller (i.e., the representation or qualification rate of group is high), or if is larger (i.e., the firm’s benefit/loss increases/decreases). The additional impact of qualification rates and firm’s benefit/loss is due to the fact that unlike DP, TPR accounts for the qualification of those selected. Therefore, if qualification rates of group or benefits from accepting qualified individuals are high, the firm will have to make sharper adjustments to the threshold on group if it perceives they are less qualified.
The following corollary further identifies conditions under which DP is more sensitive to qualification assessment biases than TPR when facing the same bias rates.
Corollary 1 (DP can be more sensitive to qualification assessment bias than TPR).
Consider , the rate of change of group ’s thresholds at . There exists a such that for all , we have ; that is, DP is more sensitive to qualification assessment biases than TPR.
Our analysis in this section can be similarly applied to study the impacts of qualification assessment biases on the advantaged group; we detail this analysis in Appendix F. In particular, we consider biases of the form of , with , interpreted as prior errors by a decision maker who mistakenly labeled unqualified individuals from the advantaged group as qualified with probability . In Proposition 4, we show that this time, DP and FPR are robust against these biases, while TPR is in general violated. Notably, DP remains robust against both types of qualification assessment bias. Our experiments in Section 4 further support this observation by showing that DP-constrained thresholds are more robust to label-flipping biases induced in different real-world datasets.
3.2 Impacts of feature measurement errors
We now analyze the sensitivity of fairness-constrained decisions to an alternative form of statistical bias: errors in the feature measurements of the disadvantaged group. Specifically, we consider biases that result in , where is the underestimation rate and is a nondecreasing function in (including constant functions). This type of bias can occur, for instance, if scores are normally distributed and systematically underestimated such that , where .

The next proposition identifies the impacts of such biases on the firm’s utility and decision thresholds.
Proposition 3 (Impact of measurement errors in the disadvantaged group’s features on DP/TPR/FPR thresholds).
Assume the features of group are incorrectly measured, so that , where is a nondecreasing function. Let and denote the optimal decision thresholds satisfying fairness constraint , obtained from unbiased data and data with biases on group with error function , respectively. Then,
(i) If , then for both groups and any function . Further, the TPR constraint is violated at the new thresholds.
(ii) If , then for both groups and any function . Further, the FPR constraint is violated at the new thresholds.
(iii) If , then for both groups and any function . Further, the DP constraint is violated at the new thresholds.
(iv) There exist problem instances in which for any of the three constraints.
We provide a visualization of this proposition in Figure 10 in Appendix D. This proposition shows that, in general, the considered fairness constraints will not remain robust against feature measurement errors. The conditions in parts (i)-(iii) require that the CDF of the biased distribution lies above that of the unbiased distribution, i.e., that the unbiased features first-order stochastically dominate the biased ones. This holds if, e.g., the corresponding features (of qualified agents, unqualified agents, or all agents, respectively) are underestimated, for some . We also note that, in contrast to Proposition 1, the decision threshold can in fact decrease when biases are introduced; we illustrate this in our experiments in Section 4.
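The stochastic-dominance condition in parts (i)-(iii) is easy to verify empirically for an underestimation error of this kind. The sketch below (our own illustration, with an arbitrary nonnegative, nondecreasing error function) checks that the biased scores' empirical CDF lies weakly above the unbiased one everywhere.

```python
import random

rng = random.Random(1)
true_scores = [rng.gauss(0.0, 1.0) for _ in range(20_000)]

def underestimate(x, eps0=0.3, slope=0.1):
    """Subtract a nonnegative, nondecreasing error epsilon(x) from the true score."""
    return x - (eps0 + slope * max(x, 0.0))

biased_scores = [underestimate(x) for x in true_scores]

def ecdf(data, t):
    """Empirical CDF of `data` evaluated at t."""
    return sum(v <= t for v in data) / len(data)

# Every biased score is below its true counterpart, so the biased empirical CDF
# lies (weakly) above the unbiased one at every point checked -- the dominance
# condition in Proposition 3.
assert all(ecdf(biased_scores, t) >= ecdf(true_scores, t) for t in (-2, -1, 0, 1, 2))
```

Because the bias is a pointwise downward shift, the dominance here holds deterministically, sample by sample, not just in distribution.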
4 Numerical Experiments
We now provide numerical support for our analytical results, and additional insights into the robustness of different fairness measures, through experiments on both real-world and synthetic datasets. Details about the datasets, experimental setup, and additional experiments are provided in Appendix D. Our code is available at https://github.com/yl489/SocialBiasMeetsDataBias.
4.1 FICO credit score dataset
We begin with numerical experiments on the FICO dataset preprocessed by Hardt et al. (2016). The FICO credit scores, ranging from 300 to 850, correspond to the one-dimensional feature in our model, and race is the sensitive feature (we focus on the white and black groups). The data provides repay probabilities for each score and group, which correspond to our qualification profile . We take this data to be the unbiased ground truth. (We discuss some implications of this assumption in Appendix A.) To induce qualification assessment biases in the data, we lower the repay probabilities of the black group to model the underestimation of their qualification profiles, and generate training data on this group based on this biased profile. We use to parametrize the bias (with ). Decision rules are found on the biased data and applied to the unbiased data.
Violation of fairness constraints. The leftmost panel in Figure 2 illustrates the fairness violation under each fairness constraint (measured as ) as qualification assessment biases on the disadvantaged group increase. These observations are consistent with Proposition 1. In particular, DP and TPR are both robust to these biases in terms of achieving their notions of fairness, while FPR has an increasing trend in fairness violation. This means that the set of possible decision rules of FPR changes when bias is imposed. Note that though a violation of 0.04 may not be severe, it is more than 300% higher than a violation below 0.01 that FPR can achieve when the data is unbiased. We will also observe more significant violations of FPR on other datasets. Finally, from Figure 2, it may seem that EO also remains relatively robust to bias. This observation is not in general true (as shown in our experiments on other datasets in Section 4.2); however, it can be explained for the FICO dataset by noting how EO’s feasible pairs of decision rules change due to data bias (similar to Figure 1) and how specific problem setup can influence the results. We provide additional discussion in Appendix D.2.
Changes in decision thresholds and firm’s utility. In Figure 2, we also display the change in each group’s threshold. The thresholds for both groups increase as the bias level gets higher, in line with Proposition 1. Notably, maximum utility (the fairness-unconstrained decision rule) would have led the firm to fully exclude the black group even at relatively low bias rates; all fairness-constrained thresholds prevent this from happening. In addition, the threshold’s increase under TPR is less drastic than under DP and EO (consistent with Corollary 1).
From Figure 5 in Appendix D.2, we also observe that the firm’s utility from the white/black group decreases/increases as biases increase. Overall, due to the fact that the white group is the majority in this data, a higher increase in will lead to a greater loss in the firm’s total utility (as is the case for DP/TPR/EO, seen in Figure 2). That said, the total utility may increase under FPR, as pointed out in Proposition 1, since the relative gain from the increase in is larger than the loss from . This increase may even allow the FPR-constrained classifier to attain higher utility than MU when the training data is highly biased.

4.2 Adult dataset and German credit dataset
We next conduct experiments on two additional benchmark datasets from the UCI repository (Dua and Graff, 2017): the Adult dataset [1] and the German credit dataset (Hofmann, 1994). In both these datasets, instead of maximizing utility, the objective is classification accuracy. We first train a logistic regression classifier on the training set using scikit-learn (Pedregosa et al., 2011) with default parameter settings as the base model. Then, we obtain the fair classifier by applying the exponentiated gradient approach (Agarwal et al., 2018) using Fairlearn (Bird et al., 2020). We introduce qualification assessment biases by flipping the qualification states of (1) qualified agents from the disadvantaged group (i.e., female in Adult and age below 30 in German), and (2) unqualified agents from the advantaged group (i.e., male in Adult and age above 30 in German). Results from concurrently flipping the labels of qualified agents from the disadvantaged group and unqualified agents from the advantaged group are shown in Appendix D.3.

Figure 3: Accuracy and fairness violation on Adult and German datasets. The results are averaged across 10 runs for Adult and 30 runs for German; shaded areas indicate plus/minus one standard deviation.
The results are presented in Figure 3. To quantify the trend in fairness violation, we fit a linear regression model to the fairness violation and present all model weights in Table 2 in Appendix D.3. In all three cases, the robustness of each constraint in terms of achieving fairness matches our findings in Propositions 1 and 4. One exception, however, is that TPR remains robust when we flip the labels of the unqualified advantaged group in the German dataset. This is primarily because, while flipping the unqualified individuals in the training set will in general make the classifier accept more of these individuals, the flip has a minor effect on the TPR violation because (1) there is a limited number of unqualified individuals with age above 30 (15.2% of the whole dataset) compared to qualified individuals with age above 30 (43.7% of the whole dataset), and (2) there is little room for the true positive rate values on the test set to increase, since the values for both groups start close to 1 (0.923 and 0.845) with a small difference (0.078) (see Figure 8 in Appendix D.3).

Interestingly, we also observe that fairness-constrained accuracy can be higher than that of the utility-maximizing choices when training data is biased, as seen in the top-row accuracy plots in Figure 3. This can be interpreted as fairness constraints having a regularization effect: since the constraints prevent the classifier from overfitting the biased training set to some extent, the accuracy achieved on the test set with fairness constraints imposed can be higher than that of the unconstrained case.
4.3 Impacts of feature measurement errors
Lastly, we conduct experiments on a synthetic dataset inspired by the FICO credit score data (details in Appendix D.1). To bias the feature measurements, we drop the estimate of the mean of the qualified agents from group relative to the true value . As a result, will be biased relative to its true value, while will remain unchanged. As shown in Figure 4, with these choices, FPR will remain unaffected, while as predicted by Proposition 3, DP/TPR will no longer be satisfied.
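The invariance of FPR under this particular bias can be seen directly: the false positive rate of any fixed threshold depends only on the unqualified agents' feature distribution, which the bias leaves untouched, while TPR and the selection rate shift with the qualified agents' distribution. The Gaussian parameters below are illustrative, not the paper's synthetic setup.

```python
import math

def norm_sf(x, mu, sigma=1.0):
    """P(X >= x) for X ~ N(mu, sigma^2)."""
    return 0.5 * math.erfc((x - mu) / (sigma * math.sqrt(2.0)))

theta = 0.5                   # a fixed decision threshold
mu1_true, mu0 = 1.0, -1.0     # qualified / unqualified mean features
mu1_biased = mu1_true - 0.4   # qualified agents' mean is underestimated

# FPR is computed from the (unchanged) unqualified distribution f^0 ...
fpr_true = norm_sf(theta, mu0)
fpr_biased = norm_sf(theta, mu0)
assert fpr_true == fpr_biased

# ... while TPR is computed from f^1, which the bias shifts downward.
tpr_true = norm_sf(theta, mu1_true)
tpr_biased = norm_sf(theta, mu1_biased)
assert tpr_biased < tpr_true
```

This matches the observation in the text: with only the qualified group's mean biased, FPR is unaffected while DP and TPR are no longer satisfied.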
Finally, Figure 4 also highlights the changes in the decision thresholds and firm’s utility under this type of bias. As noted in Proposition 3, now the thresholds on the disadvantaged group can decrease compared to the unbiased case at low bias rates. Notably, we also observe that the firm’s overall utility is lower under DP (similar to the qualification assessment bias case), but that TPR is more sensitive to bias levels than DP (unlike the qualification assessment bias case). This points to the fact that the choice of a robust fairness constraint has to be made subject to the type of data bias that the decision maker foresees.
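The mechanism behind this can be checked in closed form for the Gaussian feature model: a threshold chosen to equalize true positive rates under a downward-biased estimate of one group's qualified-feature mean fails on the true distribution, while false positive rates (which depend only on the unqualified distribution) are untouched. A small sketch, with all numeric values illustrative rather than the paper's exact parameters:

```python
import math

def tpr(theta, mu1, sigma=10.0):
    """P(feature > theta | qualified) under a Gaussian(mu1, sigma) model."""
    return 1.0 - 0.5 * (1.0 + math.erf((theta - mu1) / (sigma * math.sqrt(2))))

# True qualified-feature means for groups a and b (illustrative values).
mu_a, mu_b = 70.0, 60.0
theta_a = 65.0

# Pick theta_b to equalize TPR using a downward-biased estimate of mu_b.
bias = 5.0
mu_b_hat = mu_b - bias
# For Gaussians with equal sigma, equal TPR means equal standardized thresholds:
theta_b = theta_a - mu_a + mu_b_hat
assert abs(tpr(theta_b, mu_b_hat) - tpr(theta_a, mu_a)) < 1e-12

# On the *true* distribution, the TPR constraint is violated:
gap = abs(tpr(theta_b, mu_b) - tpr(theta_a, mu_a))
```

Here `gap` is strictly positive, matching the Figure 4 observation that DP/TPR are no longer satisfied under feature measurement errors; FPR is unaffected because the unqualified distribution enters neither calculation.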
5 Conclusion
We investigated the robustness of different fairness criteria when an algorithm is trained on statistically biased data. We provided both analytical results and numerical experiments based on three real-world datasets (FICO, Adult, and German credit score). We find that different constraints exhibit different sensitivities to labeling biases and feature measurement errors. In particular, we identified fairness constraints that can remain robust against certain forms of statistical biases (e.g., Demographic Parity and Equality of Opportunity given labeling biases on the disadvantaged group), as well as instances in which the adoption of a fair algorithm can increase the firm's expected utility when training data is biased, providing additional motivation for adopting fair ML algorithms. Our findings offer an additional guideline for choosing among existing fairness criteria when available datasets are biased.
The authors are grateful for support from the NSF program on Fairness in AI in collaboration with Amazon under Award No. IIS-2040800, and from Cisco Research. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the NSF, Amazon, or Cisco.
References
Adult Dataset (1996). UCI Machine Learning Repository.
Agarwal, A., Beygelzimer, A., Dudík, M., Langford, J., and Wallach, H. (2018). A reductions approach to fair classification. In International Conference on Machine Learning, pp. 60–69.
Angwin, J., Larson, J., Mattu, S., and Kirchner, L. (2016). Machine Bias. ProPublica. https://www.propublica.org/article/machine-bias-risk-assessments-in-criminal-sentencing
Barocas, S., Hardt, M., and Narayanan, A. (2017). Fairness in machine learning. NIPS Tutorial.
Bechavod, Y., Ligett, K., Roth, A., Waggoner, B., and Wu, Z. S. (2019). Equal opportunity in online classification with partial feedback. In Advances in Neural Information Processing Systems, pp. 8974–8984.
Bird, S., Dudík, M., Edgar, R., Horn, B., Lutz, R., Milan, V., Sameki, M., Wallach, H., and Walker, K. (2020). Fairlearn: a toolkit for assessing and improving fairness in AI. Technical Report MSR-TR-2020-32, Microsoft.
Blum, A. and Stangl, K. (2020). Recovering from biased data: can fairness constraints improve accuracy? In 1st Symposium on Foundations of Responsible Computing (FORC 2020).
Chouldechova, A. (2017). Fair prediction with disparate impact: a study of bias in recidivism prediction instruments. Big Data 5(2), pp. 153–163.
Dua, D. and Graff, C. (2017). UCI Machine Learning Repository. University of California, Irvine, School of Information and Computer Sciences.
Ensign, D., Friedler, S. A., Neville, S., Scheidegger, C., and Venkatasubramanian, S. (2018). Runaway feedback loops in predictive policing. In Conference on Fairness, Accountability and Transparency, pp. 160–171.
Fogliato, R., Chouldechova, A., and G'Sell, M. (2020). Fairness evaluation in presence of biased noisy labels. In International Conference on Artificial Intelligence and Statistics, pp. 2325–2336.
Hardt, M., Price, E., and Srebro, N. (2016). Equality of opportunity in supervised learning. In Advances in Neural Information Processing Systems 29.
Hofmann, H. (1994). Statlog (German Credit Data). UCI Machine Learning Repository.
Jiang, H. and Nachum, O. (2020). Identifying and correcting label bias in machine learning. In International Conference on Artificial Intelligence and Statistics, pp. 702–712.
Kallus, N. and Zhou, A. (2018). Residual unfairness in fair machine learning from prejudiced data. In International Conference on Machine Learning, pp. 2439–2448.
Kamiran, F. and Calders, T. (2012). Data preprocessing techniques for classification without discrimination. Knowledge and Information Systems 33(1), pp. 1–33.
Kilbertus, N., Gomez Rodriguez, M., Schölkopf, B., Muandet, K., and Valera, I. (2020). Fair decisions despite imperfect predictions. In International Conference on Artificial Intelligence and Statistics, pp. 277–287.
Kleinberg, J., Mullainathan, S., and Raghavan, M. (2016). Inherent trade-offs in the fair determination of risk scores. arXiv preprint arXiv:1609.05807.
Lambrecht, A. and Tucker, C. (2019). Algorithmic bias? An empirical study of apparent gender-based discrimination in the display of STEM career ads. Management Science 65(7), pp. 2966–2981.
Liu, L. T., Dean, S., Rolf, E., Simchowitz, M., and Hardt, M. (2018). Delayed impact of fair machine learning. In International Conference on Machine Learning, pp. 3150–3158.
Mehrabi, N., Morstatter, F., Saxena, N., Lerman, K., and Galstyan, A. (2021). A survey on bias and fairness in machine learning. ACM Computing Surveys (CSUR) 54(6), pp. 1–35.
Neel, S. and Roth, A. (2018). Mitigating bias in adaptive data gathering via differential privacy. In International Conference on Machine Learning, pp. 3720–3729.
Obermeyer, Z., Powers, B., Vogeli, C., and Mullainathan, S. (2019). Dissecting racial bias in an algorithm used to manage the health of populations. Science 366(6464), pp. 447–453.
Pedregosa, F., Varoquaux, G., Gramfort, A., et al. (2011). Scikit-learn: machine learning in Python. Journal of Machine Learning Research 12, pp. 2825–2830.
Perdomo, J., Zrnic, T., Mendler-Dünner, C., and Hardt, M. (2020). Performative prediction. In International Conference on Machine Learning, pp. 7599–7609.
Rezaei, A., Liu, A., Memarrast, O., and Ziebart, B. D. (2021). Robust fairness under covariate shift. In Proceedings of the AAAI Conference on Artificial Intelligence, pp. 9419–9427.
Wei, D. (2021). Decision-making under selective labels: optimal finite-domain policies and beyond. In International Conference on Machine Learning, pp. 11035–11046.
Wick, M., Panda, S., and Tristan, J.-B. (2019). Unlocking fairness: a trade-off revisited. In Advances in Neural Information Processing Systems 32.
Zafar, M. B., Valera, I., Gomez-Rodriguez, M., and Gummadi, K. P. (2019). Fairness constraints: a flexible approach for fair classification. Journal of Machine Learning Research 20(1), pp. 2737–2778.
Zhang, L., Wu, Y., and Wu, X. (2017). A causal framework for discovering and removing direct and indirect discrimination. In IJCAI.
Zhang, X., Khalili, M. M., Tekin, C., and Liu, M. (2019). Group retention when using machine learning in sequential decision making: the interplay between user dynamics and fairness. In Advances in Neural Information Processing Systems 32.
Zhang, X., Tu, R., Liu, Y., Liu, M., Kjellström, H., Zhang, K., and Zhang, C. (2020). How do fair decisions fare in long-term qualification? In Advances in Neural Information Processing Systems 33, pp. 18457–18469.
Appendix
Appendix A Discussion: Societal Impacts, Limitations, and Future Work
We begin by noting that, despite their differing robustness, any fairness-constrained algorithm can couple the decisions between two demographic groups in a way that benefits the disadvantaged group suffering from data biases. This was highlighted in our numerical experiments on the FICO dataset, where a utility-maximizing firm would have fully excluded the black group even at relatively low bias rates, while all fairness-constrained decisions prevented this from happening. In addition, we showed that there exist settings in which a fairness-constrained classifier can outperform utility-maximizing classifiers from the firm's perspective when data is biased. Together, our results provide additional motivation to adopt fair machine learning algorithms: they can simultaneously benefit disadvantaged demographic groups and increase (utility-maximizing) firms' expected profit.
We next discuss some limitations of our findings and potential directions for future inquiry. First, our analysis has focused on a subset of demographic (group) fairness criteria; whether other fairness criteria (including individual fairness criteria) exhibit different sensitivity to statistical data biases remains to be studied. We believe our proposed framework provides a starting point for these investigations. More importantly, we have not focused on the normative value of different fairness criteria, but rather taken these as potential desiderata. Quantifying unfairness and discrimination remains an important open question, and will in general be shaped by cultural, legal, and political perspectives. We hope that our proposed approach to quantifying and assessing the impacts of statistical biases on the efficacy of a desired (normative) notion of fairness can contribute to these conversations.
As noted in our experiments, we have taken existing real-world datasets as ground truth (i.e., as unbiased). This assumption was inevitable given our lack of access to ground-truth data. A main motivation of our work, however, is that existing datasets are likely to suffer from various forms of statistical biases (not only labeling biases and measurement errors, but also biases due to disparate representation, changes in qualification rates over time, etc.). If the existing bias in these datasets is of the same type we have considered (e.g., labeling biases on qualified, disadvantaged agents), some of our findings continue to hold, as the biases we introduce can be viewed as additional bias added on top of the existing ones (e.g., DP/TPR are robust in the face of a range of label-flipping probabilities, and thresholds monotonically increase as labeling bias increases). However, experiments on better benchmarks are indeed desirable.
Investigating other forms of dataset biases, as well as the concurrent presence of multiple forms of bias, also remains an open question. Notably, a third form of statistical bias may be due to long-term changes in qualification rates (reflected as errors in the qualification rates in our model). These may occur if a firm does not account for improvement efforts made by individuals over time; such effects have been studied through the population dynamics models of Zhang et al. (2019, 2020). The robustness of fairness criteria against these forms of biases, and whether they consequently support improvement efforts, is an interesting extension. More broadly, we have looked at the interplay between data biases and algorithmic fairness criteria in a static model. Further investigation of the feedback loops between the two remains an important open challenge.
Appendix B Additional Related Work
A variety of fairness criteria have been proposed with the goal of formalizing desired notions of algorithmic fairness (Mehrabi et al. (2021); Barocas et al. (2017) provide excellent overviews). Our focus in this paper is on four of these (group) fairness criteria: Demographic Parity (DP), True/False Positive Rate Parity (TPR/FPR), and Equalized Odds (EO). This allows us to consider representative fairness criteria from two general categories (Barocas et al., 2017): independence (which requires that decisions be statistically independent from group membership, as is the case in DP) and separation (which requires that decisions be statistically independent from group membership when conditioned on qualification, as is the case in TPR/FPR/EO). We also note that fairness criteria from different categories are in general incompatible with each other (see Barocas et al. (2017); Chouldechova (2017); Kleinberg et al. (2016)); this points to an inherent tradeoff between these different notions of fairness. Our work on assessing the robustness of these criteria to data biases introduces an additional metric against which to compare them.
A variety of approaches have been proposed for achieving a given notion of fairness; they generally fall into three categories: (1) pre-processing, which modifies the training dataset through feature selection or reweighing techniques (e.g., Kamiran and Calders (2012); Jiang and Nachum (2020)); (2) in-processing, which imposes the fairness criteria as constraints at training time (e.g., Agarwal et al. (2018); Zafar et al. (2019)); and (3) post-processing, which adjusts the output of the algorithm based on the sensitive attribute (e.g., Hardt et al. (2016)). We consider the in-processing approach, and formulate the design of a fair classifier as a constrained optimization problem.

Appendix C Summary of Notation
Our notation is summarized in Table 1.
- Demographic groups
- Fraction of the population from each group
- Observable feature
- True qualification state
- Qualification rate of each group
- Feature distribution of agents with a given label from a group
- Feature distribution of all agents from a group
- Qualification profile of a group: the probability that an agent with a given feature from that group is qualified
- Firm's accept/reject decision
- Firm's threshold policy on a group
- Firm's benefit/loss from accepting qualified/unqualified agents
- Firm's expected payoff given the threshold policies
- Fairness measure on a group
- Firm's optimal threshold on a group for maximum utility (fairness unconstrained)
- Firm's optimal threshold on a group under fairness constraint f
Appendix D Additional Experiment Details and Results
D.1 Parameter setup for the synthetic dataset in Figure 1 and Section 4.3
The two groups' population fractions are set to 0.8 and 0.2, and their qualification rates to 0.8 and 0.3, respectively. The feature of each agent is sampled from a Gaussian distribution with mean 70 and standard deviation 10 if qualified, and from a Gaussian distribution with mean 50 and standard deviation 10 if unqualified. A total of 100,000 examples are sampled. The firm's loss/benefit ratio is set to 10.

D.2 FICO credit score experiments
Additional experiment setup details.
We set the firm's loss/benefit ratio to 10; this value is selected because losses due to defaults on loans are typically higher for a bank than the benefits/interest from on-time payments. We focus on the black and white groups in our discussion. Within the FICO dataset, the population fractions of the black and white groups are 0.12 and 0.88, and their qualification (repay) rates are 0.34 and 0.76, respectively. We analyze the effects of bias with and without fairness constraints. Unless otherwise specified, we use soft constraints in our experiments. We measure the fairness violation under each constraint based on its fairness definition, with respect to the level of bias we impose on the data. We also highlight the changes in the decision thresholds, the selection rates of each group, and the firm's utility (both group-wise and total). Note that the FICO dataset comes with the repay probability at each score, so there is no randomness in the experiments when we induce biases.
Experiment results and discussion.
Figure 5 provides additional details on the changes in selection rate, and the change in the utility from each group, in this experiment.
Additional intuition for the observed fairness violations.
As shown in Figure 6, DP and TPR are robust to underestimation of qualification profiles in terms of achieving their notions of fairness. Intuitively, as noted earlier, this can be explained as follows. By its definition, the DP constraint tries to equalize the selection rates of the two groups regardless of the agents' true qualification states. On the other hand, flipping labels from 1 to 0 does not alter the total number of agents from each group at each score. Thus, any pair of decision rules that achieves DP on the original unbiased data will still achieve DP on the biased data. Conversely, decision rules found on the biased data satisfying DP remain fair when applied back to the unbiased data.
For TPR, on the other hand, recall its definition: it equalizes the ratio of the number of accepted qualified agents to the total number of qualified agents. Underestimating the qualification profiles is equivalent to reducing the number of qualified agents at each score by the same fraction, which decreases both quantities in the ratio proportionally. Consequently, similar to DP, the set of decision rules satisfying TPR remains the same after the bias is imposed.
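This proportional-shrinkage argument can be verified numerically: scaling the number of qualified agents at every score by the same factor leaves the TPR of any threshold rule unchanged. A tiny sketch with illustrative (not FICO) counts:

```python
def tpr_at(threshold, qualified_by_score):
    """TPR of a threshold rule, given #qualified agents at each score."""
    total = sum(qualified_by_score.values())
    accepted = sum(n for s, n in qualified_by_score.items() if s >= threshold)
    return accepted / total

counts = {50: 10.0, 60: 20.0, 70: 40.0}   # qualified agents per score (illustrative)
p = 0.3                                   # label-flip probability (1 -> 0)
# In expectation, every score bin shrinks by the same factor (1 - p):
biased = {s: n * (1 - p) for s, n in counts.items()}
for t in (55, 65):
    assert abs(tpr_at(t, counts) - tpr_at(t, biased)) < 1e-12
```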
Figure 6 also shows that FPR has an increasing trend in fairness violation. This means that the set of possible decision rules of FPR changes when bias is imposed. To see why, note that false positive rate computes the ratio of the number of accepted unqualified agents to the total number of unqualified agents. Dropping the qualification profiles increases the two quantities in the ratio at different rates, causing the false positive rates to change at given thresholds.
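By contrast, the flipped qualified agents are added to the unqualified pool, and since the qualified and unqualified score distributions are not proportional, the numerator and denominator of the false positive rate grow at different rates. A small numeric check with illustrative counts:

```python
def fpr_at(threshold, unqualified_by_score):
    """FPR of a threshold rule, given #unqualified agents at each score."""
    total = sum(unqualified_by_score.values())
    accepted = sum(n for s, n in unqualified_by_score.items() if s >= threshold)
    return accepted / total

unq = {50: 40.0, 60: 20.0, 70: 10.0}   # unqualified agents per score (illustrative)
qual = {50: 10.0, 60: 20.0, 70: 40.0}  # qualified agents per score (illustrative)
p = 0.3                                # label-flip probability (1 -> 0)
# Flipped qualified agents join the unqualified pool at each score:
biased = {s: unq[s] + p * qual[s] for s in unq}
# The FPR at a fixed threshold shifts, so the feasible FPR-parity rules change:
assert abs(fpr_at(65, unq) - fpr_at(65, biased)) > 1e-3
```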
Finally, from Figure 6, it may seem that EO also remains relatively robust to bias. This observation does not hold in general (as shown in our experiments on other datasets in Section 4.2); however, it can be explained for the FICO dataset as follows. As EO requires satisfying both TPR and FPR, the set of decision rules achieving EO will be at most the intersection of those of TPR and FPR. As noted above, the set of feasible thresholds is the same on biased and unbiased data when requiring TPR, while it differs for FPR. As a result, the set of possible decision rules for EO changes with bias (here, it becomes smaller). That said, the pairs that remain feasible, for this dataset, are those dominated by the requirement to satisfy TPR, which are themselves robust to biases. (The same effect is also highlighted in Figure 1, which illustrates the changes in the set of feasible thresholds in experiments on synthetic data.)

Moreover, we enforce that the constraints be satisfied exactly during training, and the original unbiased data is used for testing. In contrast, if the constraints are imposed by introducing an extra penalty term in the objective function, the model may retain high accuracy/utility by not perfectly satisfying the fairness constraints, depending on the exact formulation. Furthermore, the constraints may not be satisfied even when there is no bias, due to the gap between training and test data. Nevertheless, our focus is on the trend of how the performance of each constraint reacts to an increasing level of bias (as we discuss further in Section 4.2).
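To make the hard-constraint formulation concrete, an optimal threshold pair can be found by exhaustive search over candidate thresholds, keeping only pairs whose fairness gap is within a tolerance. This is our own simplified sketch (DP constraint, a toy utility model with the function and parameter names ours), not the exact training code used in the experiments:

```python
import itertools

def fair_thresholds(scores, labels, groups, u_plus=1.0, u_minus=10.0, eps=0.01):
    """Grid-search a threshold pair maximizing utility s.t. |DP gap| <= eps."""
    cand = sorted(set(scores))
    best, best_u = None, float("-inf")
    for ta, tb in itertools.product(cand, cand):
        # Accept an agent if its score clears its group's threshold.
        acc = [s >= (ta if g == "a" else tb) for s, g in zip(scores, groups)]
        sel = {}
        for g in ("a", "b"):
            grp = [a for a, gg in zip(acc, groups) if gg == g]
            sel[g] = sum(grp) / len(grp)
        if abs(sel["a"] - sel["b"]) > eps:   # hard DP constraint
            continue
        # Benefit u_plus per accepted qualified agent, loss u_minus otherwise.
        util = sum((u_plus if y == 1 else -u_minus) * a
                   for a, y in zip(acc, labels))
        if util > best_u:
            best_u, best = util, (ta, tb)
    return best, best_u
```

With a small `eps` this is the hard-constraint regime described above; replacing the infeasibility check with a penalty subtracted from `util` gives the soft-constraint variant.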
D.3 Adult and German credit score datasets
Additional experiment setup details.
(i) The Adult dataset (Adult Dataset, 1996): the goal is to predict whether an individual makes more than $50k per year. We choose gender as the sensitive feature, and follow Barocas et al. (2017, Chapter 2) to preprocess the data.
(ii) The German credit dataset (Hofmann, 1994): the goal is to predict whether an individual has a good or bad credit risk. The sensitive feature is age. Following Jiang and Nachum (2020), we binarize age using a cutoff threshold of 30 when deciding group membership. 67% of the data is randomly selected as the training set, and the rest is used as the test set.
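The age binarization and the 67/33 split can be sketched as follows (a minimal illustration with a hypothetical tuple-based row layout; the actual experiments operate on the full feature set):

```python
import random

def binarize_and_split(rows, age_idx, cutoff=30, train_frac=0.67, seed=0):
    """Binarize age into group membership (1 if age > cutoff, else 0)
    and make a random train/test split."""
    rng = random.Random(seed)
    labeled = [row[:age_idx] + (1 if row[age_idx] > cutoff else 0,)
               + row[age_idx + 1:] for row in rows]
    idx = list(range(len(labeled)))
    rng.shuffle(idx)
    cut = int(train_frac * len(labeled))
    train = [labeled[i] for i in idx[:cut]]
    test = [labeled[i] for i in idx[cut:]]
    return train, test
```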
In both experiments, the maximum number of iterations is set such that early stopping is avoided.
Flipping labels on both advantaged and disadvantaged individuals.
The results are illustrated in Figure 7, and are again consistent with our analytical arguments. In particular, DP is robust against both forms of label biases.
Table 2: Linear regression weights showing the relationship between fairness violation and bias level.
Dataset | Constraint | Flip qualified disadvantaged | Flip unqualified advantaged | Flip both
Adult | DP | 0.007 | 0.001 | 0.004
Adult | TPR | 0.019 | 0.544 | 0.572
Adult | FPR | 0.059 | 0.028 | 0.101
Adult | EO | 0.056 | 0.575 | 0.556
German | DP | 0.015 | 0.005 | 0.027
German | TPR | 0.068 | 0.007 | 0.144
German | FPR | 0.187 | 0.003 | 0.350
German | EO | 0.131 | 0.166 | 0.315
D.4 Synthetic dataset with measurement errors
Appendix E Proofs for Section 3
E.1 Proof of Lemma 1
E.2 Proof of Lemma 2
Proof.
The increase in the decision thresholds on the affected group follows from the characterizations in Lemma 1, together with the observation that the relevant functions are both increasing. The thresholds on the other group remain unaffected, since the two groups' threshold optimizations can be decoupled. The reduction in the firm's utility follows from the proof of Lemma 1: the unconstrained optimal threshold is the unique maximizer of the firm's utility, and so the utility drops at the perturbed thresholds.
Finally, the claim that if , where , then , follows directly from Assumption 1. We also note that for the lemma to hold, we in fact only need the underestimation of the feature measurements to happen at the unbiased MU threshold . ∎
e.3 Proof of Lemma 3
Proof.
Given threshold policies, the firm’s problem in (4) can be further simplified to
s.t. 
We first use the constraint to express as a function of , that is, find a function such that . To do so, note that for all three constraints , the function is invertible. Therefore, . Further,
(3) 
It is also easy to show that , , and ; therefore, is an increasing function for all three constraints.
With this conversion, the firm's optimization problem reduces to
The first derivative of the objective function above with respect to is given by
Consider the thresholds at which this first derivative is zero. Since the conversion function is increasing, any alternative feasible profile has either both thresholds smaller or both thresholds larger than these. Together with Assumption 1, the first derivative is positive when the thresholds decrease and negative when they increase. We therefore conclude that this profile maximizes the firm's utility.
That is, the optimal fairnessconstrained thresholds satisfy
completing the proof. ∎
E.4 Derivations of thresholds in Table 3
f  Optimal thresholds in terms of and 

DP  
TPR  
FPR 
f  Optimal thresholds in terms of and 

DP  
TPR 