Through the Data Management Lens: Experimental Analysis and Evaluation of Fair Classification

01/18/2021 ∙ by Maliha Tashfia Islam, et al. ∙ University of Massachusetts Amherst

Classification, a heavily-studied data-driven machine learning task, drives an increasing number of prediction systems involving critical human decisions such as loan approval and criminal risk assessment. However, classifiers often demonstrate discriminatory behavior, especially when presented with biased data. Consequently, fairness in classification has emerged as a high-priority research area. Data management research is showing an increasing presence and interest in topics related to data and algorithmic fairness, including the topic of fair classification. The interdisciplinary efforts in fair classification, with machine learning research having the largest presence, have resulted in a large number of fairness notions and a wide range of approaches that have not been systematically evaluated and compared. In this paper, we contribute a broad analysis of 13 fair classification approaches and additional variants, over their correctness, fairness, efficiency, scalability, and stability, using a variety of metrics and real-world datasets. Our analysis highlights novel insights on the impact of different metrics and high-level approach characteristics on different aspects of performance. We also discuss general principles for choosing approaches suitable for different practical settings, and identify areas where data-management-centric solutions are likely to have the most impact.







1. Introduction

Virtually every aspect of human activity relies on automated systems that use prediction models learned from data: from routine everyday tasks, such as search results and product recommendations (Grbovic et al., 2015), all the way to high-stakes decisions such as mortgage approval (Chen et al., 2020), job applicant filtering (Faliagka et al., 2012), and pre-trial risk assessment of criminals (Larson et al., 2016). But such automated predictions are only as good as the data that drives them. Recent work has shown that inherent biases are common in data (Barocas and Selbst, 2016), and data-driven systems commonly demonstrate unfair and discriminatory behavior (Larson et al., 2016; Berk et al., 2018; Salimi et al., 2019; Valentino-Devries et al., 2012).

It is natural that data management research has shown growing interest in the topic of fairness over applications related to ranking, data synthesis, result diversification, and others (Kuhlman and Rundensteiner, 2020; Stoyanovich et al., 2018; Asudeh et al., 2019; Yan and Howe, 2019; Asudeh and Jagadish, 2020; Galhotra et al., 2020; Ahmadi et al., 2020). However, much of this work does not target prediction systems directly. In fact, a relatively small portion of the fairness literature within the data management community has directly targeted classification (Feldman et al., 2015; Lahoti et al., 2019; Salimi et al., 2019), one of the most important and heavily-studied supervised machine learning tasks that drives many broadly-used prediction systems. In contrast, machine learning research has rapidly produced a large body of work on the problem of improving fairness in classification.

In this paper, we closely study and evaluate existing work on fair classification, across different research communities, with two primary objectives: (1) to highlight data management aspects of existing work, such as efficiency and scalability, which are often overlooked in other communities, and (2) to produce a deeper understanding of tradeoffs and challenges across various approaches, creating guidelines for where data management solutions are more likely to have impact. We proceed to provide more detailed background on the problem of fair classification and existing approaches, we state the scope of our work and contrast with prior evaluation and analysis research, and, finally, we list our contributions.

Background on fair classification. Given the values of the predictive attributes of an entity, the task of a classifier is to predict which class, among a set of predefined classes, that entity belongs to. Classifiers typically focus on maximizing their correctness, which measures how well the predictions match the ground truth. To that end, they try to minimize the prediction error over a subset (validation set) of the available labeled data. Since both the training and the validation sets are drawn from the same source, the trained classifier naturally prioritizes the minimization of prediction error over the over-represented (majority) groups within the source data, and, thus, performs better for entities that belong to those groups. However, this may result in poor prediction performance over the under-represented (minority) groups. Moreover, as all data-driven approaches, classifiers also suffer from the general phenomenon of “garbage-in, garbage-out”: if the data contains some inherent bias, the model will also reflect or even exacerbate it. Thus, traditional learning models may discriminate in two ways: (1) they make more incorrect predictions over the minority groups than the majority groups, and (2) they learn (and replicate) training data biases.

Consider COMPAS, a risk assessment system that predicts recidivism (the tendency to reoffend) in convicted criminals and is used by U.S. courts to classify defendants as high- or low-risk according to their likelihood of recidivating within 2 years of the initial assessment (Flores et al., 2016). COMPAS achieves relatively high accuracy (Dieterich et al., 2016), a well-known metric that quantifies a classifier’s correctness. However, a detailed analysis of COMPAS revealed some very troubling findings: (1) black defendants are almost twice as likely as white defendants to be predicted as high-risk, and (2) white reoffenders are predicted low-risk almost twice as often as black reoffenders (Larson et al., 2016). While COMPAS’ overall accuracy was similar over both groups, its mistakes affected the two groups disproportionately. Furthermore, COMPAS was criticized for exacerbating societal bias due to its usage of historical arrest data in training, despite certain populations being proven to be more policed than others (Rothwell, 2014). This is not an isolated incident: many other cases of classifier discrimination have pointed towards racial (Berk et al., 2018), gender (Salimi et al., 2019), and other forms of discrimination and unfairness (Valentino-Devries et al., 2012).

The pervasiveness of examples of discriminatory behavior in prediction systems indicates that fairness should be an important objective in classification. In recent years, study of fair classification has garnered significant interest across multiple disciplines (Feldman et al., 2015; Salimi et al., 2019; Zafar et al., 2017b; Celis et al., 2019; Hardt et al., 2016), and a multitude of approaches and notions of fairness have emerged (Verma and Rubin, 2018; Narayanan, 2018). We consider two principal dimensions in characterizing the work in this domain: (1) the targeted notion of fairness, and (2) the stage—before, during, or after training—when fairness-enforcing mechanisms are applied.

Fairness notions and mechanisms. Specifying what is fair is non-trivial: the proper definition of fairness is often driven by application-specific and even legal considerations. There is a large number of fairness definitions (Verma and Rubin, 2018; Narayanan, 2018), and new ones continue to emerge. Some fairness notions capture if individuals are treated fairly, while others quantify fair treatment of a group (e.g., people of certain race or gender). Further, some notions measure discrimination through causal association among attributes of interest (e.g., race and prediction), while others study non-causal associations. The mechanism to quantify fairness also varies: some notions rely on observational data, while others apply interventional techniques. To add further complexity, recent studies show that some fairness notions are incompatible with others and cannot be enforced simultaneously (Corbett-Davies et al., 2017).

Fairness-enforcing stage. Existing methods in fair classification operate in one of three possible stages. Pre-processing approaches attempt to repair biases in the data before the data is used to train a classifier (Kamiran and Calders, 2012; Feldman et al., 2015; Calmon et al., 2017; Zhang et al., 2017; Salimi et al., 2019). Data management research in fair classification has typically focused on the pre-processing stage. In contrast, the machine learning community largely explored in-processing approaches, which alter the learning procedure used by the classifier (Zafar et al., 2017b, a; Zhang et al., 2018; Kearns et al., 2018; Celis et al., 2019; Thomas et al., 2019), and post-processing approaches, which alter the classifier predictions to ensure fairness  (Kamiran et al., 2012; Hardt et al., 2016; Pleiss et al., 2017).
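To make the pre-processing idea concrete, the following is a minimal, illustrative sketch (ours, not code from any of the cited papers) of the reweighing mechanism in the spirit of Kamiran and Calders (2012): each (group, label) combination receives a weight chosen so that the sensitive attribute and the label become statistically independent in the weighted training data.

```python
from collections import Counter

def reweighing_weights(S, Y):
    """Weight w(s, y) = P(S=s) * P(Y=y) / P(S=s, Y=y).

    Reweighting each tuple by w(S(t), Y(t)) removes the statistical
    dependency between the sensitive attribute S and the label Y.
    """
    n = len(S)
    count_s = Counter(S)          # marginal counts of sensitive values
    count_y = Counter(Y)          # marginal counts of labels
    count_sy = Counter(zip(S, Y)) # joint counts
    return {
        (s, y): (count_s[s] * count_y[y]) / (n * count_sy[(s, y)])
        for (s, y) in count_sy
    }

# Toy data: the privileged group (S=1) is positively labeled more often.
S = [1, 1, 1, 0, 0, 0, 0, 0]
Y = [1, 1, 0, 1, 0, 0, 0, 0]
weights = reweighing_weights(S, Y)
```

Under these weights, the weighted joint distribution of (S, Y) factorizes into the product of the original marginals, which is exactly the independence condition that demographic-parity-style pre-processing targets.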

Scope of our work. We present a thorough empirical evaluation of 13 fair classification approaches and some of their variants, resulting in 18 different approaches, across the aspects of correctness, fairness, efficiency, scalability, and stability. We selected approaches that target a representative variety of fairness definitions and span all three (pre, in, and post) fairness-enforcing stages. In general, there is no one-size-fits-all solution when it comes to choosing the best fair approach, and the choice is application-specific. Rather, our evaluation has two main objectives: (1) to highlight issues of efficiency and scalability, which are often overlooked in other communities, and (2) to produce a deeper understanding of tradeoffs and challenges across various approaches, creating guidelines for where data management solutions are more likely to have impact. To the best of our knowledge, this is the first study and evaluation of fair classification approaches through a data management lens.

Other evaluation and analysis work on fair classification. Prior work on the evaluation of fair classifiers has had a more narrow scope than ours. Friedler et al. (Friedler et al., 2019) compare variations of 4 fair approaches over 5 fairness metrics, while Jones et al. (Jones et al., 2020) evaluate variations of 6 fair approaches over 3 fairness metrics. Further, these evaluation studies do not examine runtime performance aspects, such as scalability, and do not include post-processing approaches or individual fairness metrics in their analysis.

AI Fairness 360 (Bellamy et al., 2019) is an extensible toolkit that tests 11 fair approaches on 7 fairness metrics, but it is not designed for comparative analysis of approaches and does not cover efficiency, scalability, and stability of classifiers. Other works (Tramèr et al., 2015; Galhotra et al., 2017) provide general frameworks to evaluate approaches on some specific fairness metric, but are not extendable for evaluating over multiple metrics. Lastly, there are surveys that discuss fair approaches available in the literature (Mehrabi et al., 2019; Caton and Haas, 2020), but they do not evaluate them empirically.

Contributions. In this paper, we make the following contributions:

  • We provide a new and informative categorization of 26 existing fairness notions, based on the high-level aspects of granularity, association, methodology, and requirements. We discuss their implications, tradeoffs, and limitations, and justify the choices of metrics for our evaluation. (Section 2)

  • We provide an overview of 13 fair classification approaches and several variants. We select 5 pre-processing (Kamiran and Calders, 2012; Feldman et al., 2015; Calmon et al., 2017; Zhang et al., 2017; Salimi et al., 2019), 5 in-processing (Zafar et al., 2017b, a; Zhang et al., 2018; Kearns et al., 2018; Celis et al., 2019; Thomas et al., 2019), and 3 post-processing approaches (Kamiran et al., 2012; Hardt et al., 2016; Pleiss et al., 2017) for our evaluation. (Section 3)

  • We evaluate a total of 18 variants of fair classification techniques with respect to 4 correctness and 5 fairness metrics over 4 real-world datasets including Adult (Kohavi and Becker, 1994) and COMPAS (Larson et al., 2016). Our evaluation provides interesting insights regarding the trends in fairness-correctness tradeoffs. (Section 4.2)

  • Our runtime evaluation indicates that post-processing approaches are generally most efficient and scalable. However, their efficiency and scalability are due to the simplicity of their mechanism, which limits their capacity of balancing correctness-fairness tradeoffs. In contrast, pre- and in-processing approaches generally incur higher runtimes, but offer more flexibility in controlling correctness-fairness tradeoffs. With respect to scalability, pre-processing approaches (which have been the focus of the related data management literature) tend to be most affected by the number of attributes, while in-processing approaches have worse response to increasing dataset size. (Section 4.3)

  • To evaluate stability, we measure the variance in correctness and fairness over different partitions of the training data. Our findings show that all evaluated approaches are generally stable and high-variance behavior is rare. (Section 4.4)


  • Finally, based on the insights from our evaluation, we discuss general guidelines towards selecting suitable fair classification approaches in different settings, and highlight possible areas where data management solutions can be most impactful. (Section 5)

2. Evaluation Metrics

In this section, we introduce the metrics that we use to measure the correctness and fairness of the evaluated techniques. We start with some basic notations related to the concepts of binary classification. Next, we proceed to describe the two types of evaluation metrics and the rationale behind our choices.

Basic notations.

Let D be an annotated dataset with the schema (𝐗, S, Y), where 𝐗 denotes a set of attributes that describe each tuple or individual in the dataset D, S denotes a sensitive attribute, and Y denotes the annotation (ground-truth class label). Without loss of generality, we assume that S is binary, i.e., S ∈ {0, 1}, where 1 indicates a privileged and 0 indicates an unprivileged group. We use S(t) to denote the particular sensitive attribute assignment of a tuple t ∈ D.

We denote a binary classification task as h : (𝐗, S) → Ŷ, where Ŷ denotes the predicted class label (Ŷ ∈ {0, 1}). Without loss of generality, we interpret Ŷ = 1 as a favorable (positive) prediction and Ŷ = 0 as an unfavorable (negative) prediction. Y(t) and Ŷ(t) denote the ground-truth and predicted class label for a tuple t ∈ D, respectively. We summarize the notations used in the paper in Figure 1.

Notation Description
𝐗 A set of attributes
A, dom(A) A single attribute and its value domain
S A sensitive attribute
Y Attribute denoting the ground-truth class label
D An annotated dataset with the schema (𝐗, S, Y)
h A binary classifier
Ŷ Attribute that denotes the predicted class label
S(t) Value of the sensitive attribute for tuple t
Y(t), Ŷ(t) Ground-truth and predicted class labels for tuple t
Figure 1. Summary of notations.
True Positive (TP) = |{t ∈ D : Y(t) = 1, Ŷ(t) = 1}|   False Positive (FP) = |{t ∈ D : Y(t) = 0, Ŷ(t) = 1}|
False Negative (FN) = |{t ∈ D : Y(t) = 1, Ŷ(t) = 0}|   True Negative (TN) = |{t ∈ D : Y(t) = 0, Ŷ(t) = 0}|
Figure 2. Confusion matrix for predictions of a binary classifier.
Metric Definition Range Interpretation
Accuracy (TP + TN) / (TP + TN + FP + FN) [0, 1] Accuracy = 1: completely correct; Accuracy = 0: completely incorrect
Precision TP / (TP + FP) [0, 1] Precision = 1: completely correct; Precision = 0: completely incorrect
Recall TP / (TP + FN) [0, 1] Recall = 1: completely correct; Recall = 0: completely incorrect
F-score 2 · Precision · Recall / (Precision + Recall) [0, 1] F-score = 1: completely correct; F-score = 0: completely incorrect
Figure 3. List of correctness metrics used in our evaluation.

2.1. Correctness

The correctness of a binary classifier measures how well its predictions match the ground truth. Given a dataset D and a binary classifier h, we profile h’s predictions on D using the statistics depicted in Figure 2, where TP, TN, FP, and FN are the numbers of true positives, true negatives, false positives, and false negatives, respectively.

Among the positive tuples (Y(t) = 1), the true positive rate (TPR = TP / (TP + FN)) is the fraction of tuples that are correctly predicted as positive and the false negative rate (FNR = FN / (TP + FN)) is the fraction of tuples that are incorrectly predicted as negative.

Similarly, among the negative tuples (Y(t) = 0), the true negative rate (TNR = TN / (TN + FP)) is the fraction of tuples that are correctly predicted as negative and the false positive rate (FPR = FP / (TN + FP)) is the fraction of tuples that are incorrectly predicted as positive.

Metrics. In our evaluation, we measure correctness using the metrics in Figure 3, which are widely-accepted and well-studied in the literature (Labatut and Cherifi, 2012). Intuitively, accuracy captures the overall correctness of the predictions made by a classifier; precision captures “preciseness”, i.e., the fraction of positive predictions that are correctly predicted as positive; and recall captures “coverage”, i.e., the fraction of positive tuples that are correctly predicted as positive. The F-score is the harmonic mean of precision and recall. While accuracy is an effective correctness metric for classifiers operating on datasets with a balanced class distribution, it can be misleading when the dataset is imbalanced, a frequently-observed scenario in real-world datasets. In such cases, precision, recall, and F-score, together, provide a better understanding of correctness.
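All four metrics follow directly from the confusion-matrix counts. The sketch below (an illustration of the standard definitions in Figure 3, not code from the paper) returns 0 for a ratio whose denominator is empty:

```python
def correctness_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F-score from the confusion matrix."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_score = (2 * precision * recall / (precision + recall)
               if precision + recall else 0.0)
    return accuracy, precision, recall, f_score
```

On an imbalanced dataset, a classifier that always predicts the majority class scores high on accuracy but 0 on precision, recall, and F-score, which is exactly why the latter three are reported together.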

2.2. Fairness

Fairness in classifier predictions typically targets sensitive attributes, such as gender, race, etc. We highlight ways in which classifier predictions can discriminate through an example.

Figure 4. Prediction statistics over 100 applicants, grouped by gender: 60 male (bottom) and 40 female (top). The numbers of ground-truth positives and negatives are indicated below each segment.
Fairness notion Metric Granularity Association Methodology Additional requirements
group individual causal non-causal observational interventional ground truth prediction probability causality model resolving attribute similarity metric
demographic parity (Dwork et al., 2012) disparate impact (Zafar et al., 2017b), CV score (Calders et al., 2009)
conditional statistical parity (Corbett-Davies et al., 2017) conditional statistical parity
intersectional fairness (Foulds et al., 2020) differential fairness
conditional accuracy equality (Berk et al., 2018) false discovery/omission rate parity
predictive parity (Chouldechova, 2017) false discovery rate parity
overall accuracy equality (Berk et al., 2018) balanced classification rate (Friedler et al., 2019)
treatment equality (Berk et al., 2018) ratio of false negative and false positive

equalized odds (Hardt et al., 2016) true positive/negative rate balance
equal opportunity (Hardt et al., 2016) true negative rate balance
resilience to random bias (Fish et al., 2016) resilience to random bias
preference-based fairness (Zafar et al., 2017c) group benefit
calibration (Chouldechova, 2017) calibration
calibration within groups (Kleinberg et al., 2017) well calibration
positive class balance (Kleinberg et al., 2017) fairness to positive class
negative class balance (Kleinberg et al., 2017) fairness to negative class
causal discrimination (Galhotra et al., 2017) causal discrimination
counterfactual fairness (Kusner et al., 2017) counterfactual effect (Wu et al., 2019)
path-specific fairness (Nabi and Shpitser, 2018) natural direct effects
path-specific counterfactuals (Wu et al., 2019) path-specific effect, counterfactual effect
fair causal inference (Madras et al., 2019) estimation of heterogeneous effects (Hill, 2011)
proxy fairness (Kilbertus et al., 2017b) proxy fairness
unresolved discrimination (Kilbertus et al., 2017b) causal risk difference (Qureshi et al., 2019)
interventional/justifiable fairness (Salimi et al., 2019) ratio of observable discrimination
metric multifairness (Kim et al., 2018) metric multifairness
fairness through awareness (Dwork et al., 2012) fairness through awareness
fairness through unawareness (Kusner et al., 2017) Kusner et al. (Kusner et al., 2017)
Figure 5. List of fairness definitions and their corresponding metrics in the literature. We list key properties of notions such as the level of granularity (group or individual), type of association considered between attributes (causal or non-causal), and technique of measuring fairness (observational or interventional). All notions require knowledge of the sensitive attributes and the predictions made by the classifier. Some fairness definitions rely on additional requirements that are shown in the rightmost four columns. For our evaluation, we choose five fairness metrics (Figure 6) that cover the highlighted definitions. (Demographic parity is also known as statistical parity; true negative rate balance is also known as predictive equality.)
Example 2.1.

Consider a model of university admissions that aims to offer admission to highly-qualified students. The admissions committee automates the admission process by training a binary classifier over historical admissions data. Female students are historically underrepresented at this university, making up 40% of the student body; so, we designate males as the privileged group (S = 1), and females as the unprivileged group (S = 0). After training, the classifier achieves high accuracy and F-score over the training data. Figure 4 summarizes the prediction-related statistics for both groups. Although the classifier is satisfactory in terms of correctness, it is not fair across gender. Specifically, we observe two ways in which females are discriminated against:

  • (Discrimination-1) The fraction of females predicted as highly-qualified (positive) is 23%, which is significantly lower than the fraction of males predicted as highly-qualified (33%).

  • (Discrimination-2) The true positive rate (TPR) for females is significantly lower than that of the males.

Example 2.1 outlines two ways a classifier can be unfair despite having reasonable accuracy in its predictions overall. Discrimination-1 highlights how a group can receive an unfair advantage (or disadvantage) if the proportion of positive and negative predictions differs across groups. On the other hand, Discrimination-2 indicates how predictions can disadvantage a group if the correctness of predictions (e.g., TPR) differs across groups. These disparities are very common in real-world scenarios (Chouldechova and Roth, 2020) and underscore the need for ensuring fairness in classification.

2.2.1. Fairness Notions

Fairness is not entirely objective, and societal requirements and legal principles often demand different characterizations. Fairness is also a relatively new concern within the research community. Consequently, a large number of different fairness definitions have emerged, along with a variety of metrics to quantify them. Figure 5 presents a list of 26 fairness notions and their corresponding metrics that have been studied in the literature. We offer a new categorization of these notions based on their granularity, association, methodology, and requirements they impose:

Granularity. We classify fairness notions into two categories based on the granularity of their target: group fairness characterizes if any demographic group, collectively, is being discriminated against; individual fairness determines if similar individuals are treated similarly, regardless of the values of the sensitive attribute.

Association. All notions characterize fairness by investigating the existence of some association between the sensitive attribute and the prediction. The type of association can be either causal, which analyzes the source of discrimination through the causal relationships among the attributes, or non-causal, which includes observed statistical correlations among the attributes.

Methodology. We identify an important methodological distinction in fairness notions: most definitions are based on measurements over observational data, while others apply interventional methods to generate what-if scenarios and measure fairness based on predictions of those scenarios.

Additional requirements. All notions require information on the sensitive attribute and the classifier predictions. Some notions impose additional requirements, such as causality models that capture the causal relationships, resolving attributes that depend on the sensitive attribute in non-discriminatory ways, similarity metric between individuals, etc.

Metric Definition Fairness notion Range Interpretation
Disparate Impact (DI) (Feldman et al., 2015) DI = P(Ŷ = 1 | S = 0) / P(Ŷ = 1 | S = 1) demographic parity [0, ∞) DI = 1: completely fair; DI → 0 or DI → ∞: completely unfair
True Positive Rate Balance (TPRB) (Hardt et al., 2016) TPRB = TPR(S = 1) − TPR(S = 0) equalized odds [−1, 1] TPRB = 0: completely fair; |TPRB| = 1: completely unfair
True Negative Rate Balance (TNRB) (Hardt et al., 2016) TNRB = TNR(S = 1) − TNR(S = 0) equalized odds [−1, 1] TNRB = 0: completely fair; |TNRB| = 1: completely unfair
Causal Discrimination (CD) (Galhotra et al., 2017) CD = |D′| / |D|, given D′ = {t ∈ D : the prediction for t flips under intervention on S(t)} causal discrimination [0, 1] CD = 0: completely fair; CD = 1: completely unfair
Causal Risk Difference (CRD) (Qureshi et al., 2019) propensity-weighted difference in P(Ŷ = 1) between the privileged and unprivileged groups, given a set R of resolving attributes unresolved discrimination [−1, 1] CRD = 0: completely fair; |CRD| = 1: completely unfair
Figure 6. List of fairness metrics we use to evaluate fair classification approaches. These metrics capture group- and individual-level discrimination; and effectively contrast between causal and non-causal associations, observational and interventional techniques.

2.2.2. Fairness Metrics

While Figure 5 highlights the wide range of proposed fairness notions, Friedler et al. (Friedler et al., 2019) have shown that a large number of metrics (and their notions) strongly correlate with one another, and, thus, are highly redundant. For our evaluation, we carefully selected five fairness metrics (Figure 6) that are most prevalent in the literature and that capture commonly observed discriminations in binary classification (Chouldechova, 2017). Moreover, we ensured that our selected metrics cover all categories in our classification, including group- and individual-level fairness, causal and non-causal associations, and observational and interventional methods (highlighted rows in Figure 5). We proceed to describe our chosen metrics.

Disparate Impact (DI) is a group, non-causal, and observational metric. It quantifies demographic parity (Dwork et al., 2012), a fairness notion that states that positive predictions should be independent of the sensitive attribute. To measure demographic parity, DI computes the ratio of empirical probabilities of receiving positive predictions between the unprivileged and the privileged groups: DI = P(Ŷ = 1 | S = 0) / P(Ŷ = 1 | S = 1).

DI lies in the range [0, ∞). DI = 1 denotes perfect demographic parity. DI < 1 indicates that the classifier favors the privileged group and DI > 1 means the opposite. In Example 2.1, DI = 0.23 / 0.33 ≈ 0.7, which suggests that positive predictions are not independent of gender, as males have a higher probability of receiving positive predictions than females. This is indicative of Discrimination-1: the fraction of females being granted admission is much lower than that of males.
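DI can be estimated directly from empirical positive-prediction rates, as in this small illustration (with hypothetical data, not the paper's):

```python
def disparate_impact(S, y_pred):
    """DI = P(Yhat = 1 | S = 0) / P(Yhat = 1 | S = 1)."""
    unpriv = [yp for s, yp in zip(S, y_pred) if s == 0]
    priv = [yp for s, yp in zip(S, y_pred) if s == 1]
    return (sum(unpriv) / len(unpriv)) / (sum(priv) / len(priv))

# 1 of 4 unprivileged vs. 2 of 2 privileged tuples get positive predictions.
S = [0, 0, 0, 0, 1, 1]
y_pred = [1, 0, 0, 0, 1, 1]
di = disparate_impact(S, y_pred)  # 0.25 / 1.0 = 0.25, far below the fair value 1
```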

True Positive Rate Balance (TPRB) and True Negative Rate Balance (TNRB) are two group, non-causal, and observational metrics. They measure discrimination as the difference in TPR and TNR, respectively, between the privileged and the unprivileged groups: TPRB = TPR(S = 1) − TPR(S = 0) and TNRB = TNR(S = 1) − TNR(S = 0).

Both TPRB and TNRB lie in the range [−1, 1]. These two metrics, together, measure equalized odds (Hardt et al., 2016), which states that prediction statistics (e.g., TPR and TNR) should be similar across the privileged and the unprivileged groups. Perfect equalized odds is achieved when TPRB and TNRB are 0, as the classifier performs equally well for both groups. A positive value in either of the two metrics indicates that the classifier tends to misclassify the unprivileged group more. In Example 2.1, TPRB takes a high positive value, which indicates Discrimination-2: the TPR of females is much lower than that of males.
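A sketch of both metrics (ours, with toy data), following the convention that the privileged-group rate is taken minus the unprivileged-group rate, so that positive values disadvantage the unprivileged group:

```python
def rate_balances(S, y_true, y_pred):
    """Return (TPRB, TNRB): differences in TPR and TNR between the
    privileged (S=1) and unprivileged (S=0) groups."""
    def tpr(g):
        pos = [yp for s, yt, yp in zip(S, y_true, y_pred) if s == g and yt == 1]
        return sum(pos) / len(pos)
    def tnr(g):
        neg = [1 - yp for s, yt, yp in zip(S, y_true, y_pred) if s == g and yt == 0]
        return sum(neg) / len(neg)
    return tpr(1) - tpr(0), tnr(1) - tnr(0)

# Positives are all caught for the privileged group, but half are missed
# for the unprivileged group; negatives are classified equally well.
S      = [1, 1, 1, 1, 0, 0, 0, 0]
y_true = [1, 1, 0, 0, 1, 1, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0]
tprb, tnrb = rate_balances(S, y_true, y_pred)  # TPRB = 0.5, TNRB = 0.0
```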

Causal Discrimination (CD) (Galhotra et al., 2017) is an individual, causal, and interventional metric. It allows us to determine both the classifier’s discrimination with respect to individuals and the causal influence of the sensitive attribute. Specifically, CD is the fraction of tuples for which changing the sensitive attribute causes a change in the prediction, compared to otherwise identical data points. Suppose that D′ is the set of such tuples, defined as D′ = {t ∈ D : h(𝐗(t), S(t)) ≠ h(𝐗(t), 1 − S(t))}; then CD = |D′| / |D|. CD lies in the range [0, 1], and CD = 0 indicates that there exists no data point for which the sensitive attribute is the cause of discrimination.

Example 2.2.

Consider the 7 university applicants shown in Figure 7. To measure CD, we intervene on the sensitive attribute (gender) of each tuple while keeping the rest of the attributes intact, and re-evaluate the classifier on the altered tuples. Suppose that the prediction for exactly one applicant changes from 0 to 1 when her gender is altered from Female to Male, and that the predictions do not change for any other tuples. Then, CD = 1/7 ≈ 0.14, indicating that about 14% of the applicants are directly discriminated against because of their gender.

The formal definition of CD requires interventions on all possible data points in the domain of the attributes, but practical heuristics limit interventions to smaller datasets of interest (Galhotra et al., 2017).
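The interventional procedure can be sketched as follows for any black-box classifier over a binary sensitive attribute (the toy classifier below is hypothetical, for illustration only):

```python
def causal_discrimination(predict, dataset):
    """CD = fraction of tuples whose prediction flips when only the
    sensitive attribute is flipped, all other attributes held fixed."""
    flips = sum(1 for x, s in dataset if predict(x, s) != predict(x, 1 - s))
    return flips / len(dataset)

# Toy classifier: admits when score plus a sensitive-attribute bonus reaches 2,
# i.e., the sensitive attribute directly influences borderline predictions.
predict = lambda x, s: 1 if x + s >= 2 else 0
dataset = [(1, 0), (1, 1), (2, 0), (0, 1)]
cd = causal_discrimination(predict, dataset)  # flips for (1, 0) and (1, 1): CD = 0.5
```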

Causal Risk Difference (CRD) (Qureshi et al., 2019) is a group, causal, and observational metric. It quantifies discrimination by measuring the difference in the probability of positive prediction between the privileged and the unprivileged groups, accounting for the confounding effects of the resolving attributes. CRD is computed from observational data using the following steps:

  • To filter out the confounding effects, CRD first computes a propensity score for each tuple t, as the conditional probability of belonging to the unprivileged group given a set of resolving attributes R: p(t) = P(S(t) = 0 | R(t)).

  • The propensity score is then used to assign each tuple a weight: w(t) = 1 / p(t) for tuples in the unprivileged group, and w(t) = 1 / (1 − p(t)) otherwise. Tuples with propensity score 0.5 receive weight 2.

  • Finally, CRD is expressed as the difference between the weighted probabilities of positive prediction of the two groups:

CRD = Σ{t : S(t) = 1} w(t) · Ŷ(t) / Σ{t : S(t) = 1} w(t) − Σ{t : S(t) = 0} w(t) · Ŷ(t) / Σ{t : S(t) = 0} w(t)

CRD lies in the range [−1, 1], and CRD = 0 implies no discrimination once we account for the effects of the resolving attributes.
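Under an inverse-propensity reading of the weighting step (an assumption on our part; Qureshi et al. (2019) define the exact estimator), the three steps can be sketched as:

```python
from collections import Counter, defaultdict

def causal_risk_difference(S, R, y_pred):
    """Propensity-weighted difference in positive-prediction probability,
    privileged (S=1) minus unprivileged (S=0), controlling for a single
    resolving attribute R. Assumes no R-value has propensity 0 or 1."""
    # Step 1: empirical propensity score p(r) = P(S = 0 | R = r).
    total = Counter(R)
    unpriv = Counter(r for s, r in zip(S, R) if s == 0)
    prop = {r: unpriv[r] / total[r] for r in total}
    # Step 2: inverse-propensity weights (a tuple with p = 0.5 gets weight 2).
    def weight(s, r):
        return 1 / prop[r] if s == 0 else 1 / (1 - prop[r])
    # Step 3: weighted positive-prediction rates per group, then the difference.
    num, den = defaultdict(float), defaultdict(float)
    for s, r, yp in zip(S, R, y_pred):
        w = weight(s, r)
        num[s] += w * yp
        den[s] += w
    return num[1] / den[1] - num[0] / den[0]
```

With a constant resolving attribute, this reduces to the plain (non-causal) difference in positive-prediction rates between the two groups.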

id SAT dept_choice rank gender admitted
t1 1200 Physics 11 Male 0
t2 1350 Mathematics 03 Male 1
t3 1105 Physics 09 Female 1
t4 1410 Mathematics 03 Female 1
t5 1130 Marketing 10 Male 1
t6 1290 Mathematics 12 Female 0
t7 1210 Marketing 11 Male 1
Figure 7. Sample data for 7 university applicants.
Example 2.3.

Consider the applicants in Figure 7. Further, suppose that females tend to apply to the Physics and Mathematics departments, and that these two departments have low acceptance rates. By setting R = {dept_choice}, we get higher propensity scores for tuples in Mathematics and Physics, i.e., those applicants are more likely to be female, and these tuples contribute more to CRD. Based on the data of Figure 7, the weighted probabilities of positive outcomes are nearly equal for the two gender groups, and we get CRD ≈ 0. In this case, CRD indicates that there is no discrimination once the choice of department is accounted for.

Discussion on metric choices. DI, TPRB, and TNRB address group-level and non-causal discrimination. This means they do not capture discrimination against individuals that may be masked in group aggregates, and they do not account for confounding factors in the data. On the other hand, CD captures individual discrimination, and CRD can remove confounding effects by determining whether the apparent discrimination found in the observational data is explainable through resolving attributes, i.e., attributes that depend on or are implicitly influenced by the sensitive attribute in non-discriminatory ways (e.g., choice of department in Example 2.3).

Other causal notions (Figure 5) can also address the limitations of non-causal metrics. However, they typically rely on graphical or mathematical causality models to express the cause-and-effect relationships among attributes. We exclude them from our evaluation because determining such causality models requires making strong assumptions about the problem setting, which is often impractical (Pearl, 2009). Further, we do not include non-causal individual-level metrics, because they require a similarity measure between individuals, which in turn requires domain expertise.

3. Fair Classification Approaches

Stage Approach Fairness notion(s) Key mechanism Evaluated version(s)
Kam-Cal (Kamiran and Calders, 2012) demographic parity Apply weighted resampling over tuples in to remove dependency between and .
Feld (Feldman et al., 2015) demographic parity
Repair each independently s.t. ’s marginal distribution is indistinguishable across sensitive groups. A user-defined parameter specifies degree of repair.
(Full repair with )
(Partial repair with )
Calmon (Calmon et al., 2017) demographic parity
Modify and to reduce dependency between and , while preventing major distortion of the joint data distribution and significant change of the attribute values.
Zha-Wu (Zhang et al., 2017) path-specific fairness
Exploit a (learned) causal model over the attributes to discover (direct and indirect) causal association between and . Modify to remove such causal association.
Salimi (Salimi et al., 2019) justifiable fairness
Mark attributes as admissible ()—allowed to have causal association—or inadmissible ()—prohibited to have causal association—with ; repair to ensure that is conditionally independent of , given . Reduce the repair problem to known problems.
(Weighted maximum satisfiability)
(Matrix factorization)


Zafar (Zafar et al., 2017b, a)
demographic parity
equalized odds
Use tuple ’s distance from the decision boundary as a proxy of . Model fairness violation by the correlation between this distance and over all tuples in . Solve variations of constrained optimization problem that either maximizes prediction accuracy under constraint on maximum fairness violation, or minimizes fairness violation under constraint on maximum allowable accuracy compromise.
(Maximize accuracy under constraint on demographic parity)
(Maximize demographic parity under constraint on accuracy)
(Same as , but use misclassified tuples only)
Zha-Le (Zhang et al., 2018) equalized odds
Learn classifier and adversary together. Enforce fairness by ensuring that cannot infer from and .
Kearns (Kearns et al., 2018)
demographic parity
predictive equality
Use sensitive attribute(s) to construct a set of subgroups. Define fairness constraint s.t. the probability of positive outcomes (demographic parity) or (predictive equality) of each subgroup matches that of the overall population.
Kearnspe (For subgroups where each , ensure that , )
Celis (Celis et al., 2019)
equalized odds
demographic parity
predictive parity
cond. acc. equality
Unify multiple fairness notions in a general framework by converting the fairness constraints to a linear form. Solve the corresponding linear constrained optimization problem s.t. prediction error is minimized under fairness constraints.
Celispp (Enforce predictive parity)
Thomas (Thomas et al., 2019)
demographic parity
equalized odds
equal opportunity
predictive equality
Compute worst possible fairness violation a classifier can incur for a set of parameters and pick parameters for which this worst possible violation is within an allowable threshold.
Thomasdp (Enforce demographic parity)
Thomaseo (Enforce equalized odds)


Kam-Kar (Kamiran et al., 2012) demographic parity
Modify for tuples close to the decision boundary (i.e., subject to low prediction confidence) s.t. the probability of positive outcome is similar across sensitive groups.
Hardt (Hardt et al., 2016) equalized odds
Derive new predictor based on and s.t. and are similar across sensitive groups.
Pleiss (Pleiss et al., 2017)
equal opportunity
predictive equality
Modify for random tuples to equalize (or ) across sensitive groups.
Pleisseop (Equalize )
Figure 8. List of fair approaches, fairness notions they support, and high-level descriptions of the mechanisms they apply to ensure fairness. According to the stage of the classifier pipeline where fairness-enhancing mechanism is applied, these approaches are divided into three groups: (1) pre-processing, (2) in-processing, and (3) post-processing. In the rightmost column, we list the variations of each approach that we consider in our evaluation. We denote in the superscript the fairness notion that a specific variation is designed to support.

Fair classification techniques vary in the fairness notions they target and the mechanisms they employ. We categorize approaches based on the stage when fairness-enforcing mechanisms are applied. (1) Pre-processing approaches attempt to repair biases in the data before training; (2) in-processing approaches modify the learning procedure to include fairness considerations; finally, (3) post-processing approaches modify the predictions made by the classifier. Figure 8 contains an overview of the fair approaches that we choose for our evaluation. We proceed to provide a high-level description of the approaches in each category, highlighting their similarities and differences. (More details are in the Appendix ).

Pre-processing approaches are motivated by the fact that machine learning techniques are data-driven and the predictions of a classifier reflect the trends and biases of the training data. Data management research most naturally fits in this category. These approaches modify the data before training to remove biases, which subsequently ensures that the predictions of a learned classifier satisfy the target notion of fairness. The main advantage of pre-processing is that it is model-agnostic, which allows flexibility in choosing classifiers based on the application requirements. However, since pre-processing happens before training and does not have access to the predictions, these approaches are limited in the number of notions they can support and do not always come with provable guarantees of fairness.

Demographic parity is one of the most widely used fairness notions among pre-processing approaches that enforce non-causal notions (Calders et al., 2009; Kamiran et al., 2010; Kamiran and Calders, 2012; Feldman et al., 2015; Calmon et al., 2017). We evaluate three pre-processing approaches that enforce demographic parity. Kam-Cal (Kamiran and Calders, 2012) resamples the training data with a weighted sampling technique to remove dependencies between the sensitive attribute and the target attribute . In contrast, Calmon (Calmon et al., 2017) and Feld (Feldman et al., 2015) directly modify the data. Calmon modifies both and to reduce dependency between and , while preventing major distortion of the joint data distribution and significant change of the attribute values. Feld argues that a model that learns only from the attributes that are independent of is likely to make predictions that are independent of as well. To this end, Feld modifies in a way that ensures that the marginal distribution of each individual attribute is indistinguishable across the sensitive groups. Feld controls the extent of repair with a parameter . In our evaluation, we choose two values of ( and ) to highlight its impact on the performance.
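The weighted resampling idea behind Kam-Cal can be sketched with the standard reweighing formula w(s, y) = P(s)P(y)/P(s, y), which makes the sensitive attribute and the target label statistically independent in the weighted data. A minimal sketch, not the authors' implementation:

```python
from collections import Counter

def reweigh(S, Y):
    """Kamiran-Calders reweighing: weight each tuple by
    w(s, y) = P(s) * P(y) / P(s, y), so that the sensitive attribute S
    and the label Y are independent in the weighted data."""
    n = len(S)
    ps = Counter(S)            # marginal counts of each sensitive group
    py = Counter(Y)            # marginal counts of each label
    psy = Counter(zip(S, Y))   # joint counts
    return [
        (ps[s] / n) * (py[y] / n) / (psy[(s, y)] / n)
        for s, y in zip(S, Y)
    ]

# Biased toy data: the privileged group "M" gets positive labels more often.
S = ["M", "M", "M", "F", "F", "F"]
Y = [1, 1, 0, 1, 0, 0]
weights = reweigh(S, Y)
```

Over-represented combinations (e.g., privileged-and-positive) receive weights below 1, under-represented ones above 1; if S and Y are already independent, every weight equals 1 and the data is left unchanged.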

We also evaluate two pre-processing approaches that do not target demographic parity. Zha-Wu (Zhang et al., 2017) enforces path-specific fairness by modifying such that all causal influence of over is removed. To this end, it learns a graphical causal model over to discover (direct and indirect) causal associations between and . Salimi (Salimi et al., 2019) enforces justifiable fairness, which prohibits causal dependence between and , except through admissible attributes. Salimi does not depend on the causal model; it translates justifiable fairness to an integrity constraint over , and minimally repairs using tuple insertion and deletion. The repair problem is reduced to two NP-hard problems: weighted maximum satisfiability (MaxSAT) (Borchers and Furman, 1998) and matrix factorization (MatFac) (Lee and Seung, 2001).
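Justifiable fairness can be read as a saturated conditional independence statement over the data. The sketch below only detects a violation on discrete data given as (admissible-attributes, sensitive, label) triples; the repair itself, which Salimi et al. reduce to MaxSAT and matrix factorization, is out of scope here.

```python
from collections import defaultdict

def violates_saturated_ci(data, eps=1e-9):
    """Check the saturated conditional independence "Y independent of S
    given A" on discrete data given as (a, s, y) triples: within every
    stratum of the admissible attributes A, the positive rate P(Y=1)
    must not depend on the sensitive attribute S."""
    strata = defaultdict(lambda: defaultdict(list))
    for a, s, y in data:
        strata[a][s].append(y)
    for by_group in strata.values():
        rates = [sum(ys) / len(ys) for ys in by_group.values()]
        if max(rates) - min(rates) > eps:
            return True  # some admissible stratum treats groups differently
    return False
```

A dataset where the outcome depends only on the admissible attribute passes this check; one where outcomes differ between sensitive groups inside a stratum fails it and would need repair.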

In-processing approaches are most favored by the machine learning community (Zafar et al., 2017b; Celis et al., 2019; Kearns et al., 2018; Zhang et al., 2018), and the majority of fair classification approaches fall under this category. In-processing takes place within the training stage, and fairness is typically added as a constraint to the classifier's objective function (which maximizes correctness). The advantage of in-processing lies precisely in the ability to adjust the classification objective to address fairness requirements directly, and, thus, in the potential to provide guarantees. However, in-processing techniques are model-specific and require re-implementation of the learning algorithms to include the fairness constraints. This hinges on the assumption that the model is replaceable or modifiable, which may not always be the case.

We evaluate five in-processing approaches and their variants. Zafar (Zafar et al., 2017b, a) proposes two approaches to enforce demographic parity and equalized odds, which utilize tuples’ distance from the decision boundary as a proxy of to model fairness violations, and translate the fairness notion to convex functions of the classifier parameters. Zafar then solves the resulting constrained optimization problem that either maximizes prediction accuracy under fairness constraints, or minimizes fairness violation under constraints on accuracy compromise. Zha-Le (Zhang et al., 2018) enforces equalized odds through adversarial learning: a fair classifier is trained, such that an adversary cannot predict from the knowledge of and . Kearns (Kearns et al., 2018) interpolates between group and individual fairness: it guarantees fairness for a large set of subgroups within the training population by constructing constraints that restrict the amount of fairness violation in each group. Celis (Celis et al., 2019) and Thomas (Thomas et al., 2019) provide general frameworks that accommodate a large number of notions. Celis reduces all fairness notions to linear forms and solves the corresponding constrained optimization problem to minimize prediction error under fairness constraints. Given a fairness notion, Thomas utilizes concentration inequalities to compute the worst possible fairness violation a classifier can incur, and then selects classifier parameters for which this violation is within an allowable threshold.
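The decision-boundary covariance proxy at the heart of Zafar can be sketched as a penalized logistic regression trained by gradient descent. This is an illustrative penalty-based variant on synthetic data, not the authors' exact convex-constrained formulation; the regularization weight lam and the data-generating process are assumptions for the sketch.

```python
import numpy as np

def train_logreg(X, y, s, lam=0.0, lr=0.1, iters=3000):
    """Logistic regression with a Zafar-style fairness penalty: the squared
    covariance between the sensitive attribute s and the signed distance
    from the decision boundary (theta . x), added to the log loss."""
    n, d = X.shape
    theta = np.zeros(d)
    sc = s - s.mean()                    # centered sensitive attribute
    for _ in range(iters):
        z = X @ theta
        p = 1.0 / (1.0 + np.exp(-z))
        grad_ll = X.T @ (p - y) / n      # log-loss gradient
        cov = sc @ z / n                 # cov(s, theta . x)
        grad_cov = 2.0 * cov * (X.T @ sc) / n
        theta -= lr * (grad_ll + lam * grad_cov)
    return theta

# Synthetic biased data: feature x2 is correlated with the sensitive attribute.
rng = np.random.default_rng(0)
n = 500
s = rng.integers(0, 2, n).astype(float)
x1 = rng.normal(size=n)
x2 = rng.normal(size=n) + 2.0 * s
X = np.column_stack([np.ones(n), x1, x2])
y = (x1 + x2 + rng.normal(scale=0.5, size=n) > 1.0).astype(float)

def boundary_cov(theta):
    z = X @ theta
    return abs((s - s.mean()) @ z / n)

plain = train_logreg(X, y, s, lam=0.0)
fair = train_logreg(X, y, s, lam=10.0)
```

With the penalty active, the learned boundary's covariance with the sensitive attribute shrinks relative to the unconstrained model, at some cost in accuracy, mirroring the accuracy-fairness tradeoff discussed throughout this section.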

Post-processing approaches enforce fairness by manipulating the predictions made by an already-trained classifier. Like pre-processing, these approaches are also model-agnostic. Their benefit is that they do not require classifier retraining. However, since post-processing is applied in a late stage of the learning process, it offers less flexibility than pre- and in-processing.

We evaluate three post-processing approaches. Kam-Kar (Kamiran et al., 2012) modifies for tuples that are close to the decision boundary (i.e., the classifier has low prediction confidence for them), such that demographic parity is achieved across the sensitive groups. Hardt (Hardt et al., 2016) enforces equalized odds by learning a new predictor derived from and that equalizes and across the sensitive groups. Pleiss (Pleiss et al., 2017) enforces equal opportunity (equal across the sensitive groups) or predictive equality (equal across the sensitive groups) while maintaining consistency between the classifier’s prediction probability for a class and the expected frequency of that class. To achieve this, it modifies for a random subset of tuples within the group with higher (or lower ).
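As a hedged illustration of how such a derived predictor can be computed from held-out labeled data, the sketch below equalizes only TPRs (equal opportunity) by deriving a randomized flipping rate for the higher-TPR group; Hardt et al.'s actual method solves a small linear program that matches FPRs as well. Group codes 0/1 and the toy arrays are assumptions of the sketch.

```python
import numpy as np

def equal_opportunity_flip_rate(y_true, y_pred, s):
    """From held-out labeled data, derive the probability p with which a
    randomized post-processor should flip positive predictions of the
    higher-TPR group to negative so the two groups' TPRs match:
    (1 - p) * tpr_hi == tpr_lo. Sensitive groups assumed coded 0/1."""
    tpr = {}
    for g in (0, 1):
        mask = (s == g) & (y_true == 1)
        tpr[g] = y_pred[mask].mean()
    hi = max(tpr, key=tpr.get)           # group with the higher TPR
    p = 1.0 - tpr[1 - hi] / tpr[hi]
    return hi, p

# Toy validation data: group 1 has TPR 1.0, group 0 has TPR 0.5.
y_true = np.array([1, 1, 1, 1, 0, 0])
y_pred = np.array([1, 1, 1, 0, 0, 1])
s = np.array([1, 1, 0, 0, 1, 0])
hi_group, p = equal_opportunity_flip_rate(y_true, y_pred, s)
```

At deployment, the post-processor flips each positive prediction of group hi_group to negative with probability p, needing no access to the feature attributes, which is exactly why post-processing cannot account for individual similarity.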

Other approaches

Beyond the ones we evaluate, other fair classification approaches exist in the literature. Some are incorporated in the approaches we evaluate (Calders et al., 2009; Kamiran et al., 2010; Calders and Verwer, 2010). Others are empirically inferior (Kamishima et al., 2012), offer weaker guarantees (Agarwal et al., 2018; Quadrianto and Sharmanska, 2017), do not offer a practical solution (Woodworth et al., 2017; Noriega-Campero et al., 2019), or do not apply to the classification setting (Samadi et al., 2018; Louizos et al., 2015; Goh et al., 2016; Lahoti et al., 2020). Some require additional information, such as intermediate attributes (Zemel et al., 2013), a causal model (Kusner et al., 2017; Kilbertus et al., 2017a; Nabi and Shpitser, 2018; Chiappa, 2019; Russell et al., 2017), or a context-specific similarity metric between individuals and human judgments (Dwork et al., 2012; Lahoti et al., 2019), all of which are dataset-specific and hinge on domain knowledge.

4. Evaluation and Analysis

In this section, we present results of our comparative evaluation over 18 variations of fair classification approaches as listed in Figure 8. The objectives of our performance evaluation are: (1) to contrast the effectiveness of fair classification approaches in enforcing fairness and observe correctness-fairness tradeoffs, i.e., the compromise in correctness to achieve fairness (Section 4.2), (2) to contrast efficiency and scalability of the fair classification approaches with varying dataset size and dimensionality (Section 4.3), and (3) to contrast stability (lack of variability) of these approaches over different partitions of the training data (Section 4.4). Our results affirm and extend previous results reported by the evaluated approaches.

Additionally, we present a comparative analysis, focusing on the stage dimension (pre, in, and post). Our analysis highlights findings that explain the behavior of fair approaches in different settings. For example, we find that the impact of enforcing a specific fairness notion can be explained through the score of a fairness-unaware classifier for that notion: larger discrimination by the fairness-unaware classifier indicates that a fair approach that targets that notion will likely incur higher drop in accuracy. Further, we provide novel insights that underscore the strengths and weaknesses across pre-, in-, and post-processing approaches. We find that post-processing approaches are very efficient and scalable, but perform less well in the correctness-fairness dimensions; in contrast, pre-processing and in-processing approaches are generally less scalable with increasing data dimensionality and increasing data size, respectively, but handle the correctness-fairness tradeoff more flexibly.

We begin by providing details about our experimental settings: approaches we evaluate, their implementation details, metrics we use to evaluate the approaches, and the datasets we use. Then we proceed to present our empirical findings.

4.1. Experimental Settings

Approaches. We evaluated 18 variants based on 13 fair classification approaches (Figure 8). We limited our evaluation to variants with available implementations, as each variant typically requires non-trivial extension to the available codebases. Pre-processing approaches require the repaired data to be paired with a classifier to complete the model pipeline, and we used logistic regression as the classifier. This is in line with the evaluations of the original papers, as the use of logistic regression is common across all pre-processing approaches. Finally, to contrast the fairness-aware approaches against a fairness-unaware approach, we trained an unconstrained logistic regression classifier (LR) over each dataset.

System and implementation. We conducted the experiments on a machine equipped with an Intel(R) Core(TM) i5-7200U CPU (2.71 GHz, Quad-Core) and 8 GB RAM, running the Windows 10 (version 1903) operating system. We collected some of the source code from the authors’ public repositories, some by contacting the authors, and the rest from the open-source library AI Fairness 360 (Bellamy et al., 2019) (additional details are in the Appendix). All the approaches are implemented in Python. We implemented the fairness-unaware classifier LR using Scikit-learn (version 0.22.1) in Python 3.6. All of these implementations run in a single-threaded environment, i.e., only one of the available processor cores is used. We implemented the evaluation script in Python 3.6.

Metrics. We evaluated all approaches using four correctness metrics (Figure 3) and five fairness metrics (Figure 6). We normalize fairness metrics to share the same range, scale, and interpretation. We report , which ensures that low fairness with respect to ( and ) is mapped to low values for . Further, we report , , , and ; this way, high discrimination with respect to, say, , maps to low fairness value in . Moreover, requires two parameters: a confidence fraction and an error-bound. We choose a confidence of 99% and error-bound of 1%, which implies that discrimination computed using is within 1% error margin of the actual discrimination with 99% confidence.
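The exact normalizations used in the paper are lost in this extraction, but a common scheme maps every fairness metric into [0, 1] with 1 denoting perfect fairness: min(r, 1/r) for ratio metrics and 1 - |d| for difference metrics. The following is an assumed, generic sketch of that scheme, not the paper's definitions.

```python
import numpy as np

def demographic_parity_ratio(y_pred, s):
    """Disparate-impact style ratio:
    P(Y_hat = 1 | unprivileged) / P(Y_hat = 1 | privileged),
    with groups assumed coded 0 (unprivileged) and 1 (privileged)."""
    return y_pred[s == 0].mean() / y_pred[s == 1].mean()

def normalized_ratio(r):
    """Map a ratio metric into [0, 1]: min(r, 1/r), so discrimination in
    either direction lowers the score and 1 means perfect fairness."""
    return min(r, 1.0 / r)

def normalized_difference(d):
    """Map a difference metric (e.g., a TPR gap) in [-1, 1] to [0, 1],
    with 1 meaning no gap between groups."""
    return 1.0 - abs(d)

# Example: unprivileged positive rate 0.5 vs privileged 1.0.
y_pred = np.array([1, 0, 1, 1])
s = np.array([0, 0, 1, 1])
r = demographic_parity_ratio(y_pred, s)
```

Under such a scheme all metrics share the same range, scale, and interpretation, which is what allows the side-by-side bar plots in Figure 10.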

Datasets. Our evaluation includes 4 real-world datasets, summarized in Figure 9. Each dataset contains varied degrees of real-world biases, allowing for the evaluation of the fair classification approaches against different scenarios. Furthermore, these datasets are well-studied in the fairness literature and are frequently used as benchmarks to evaluate fair classification approaches (Friedler et al., 2019; Jones et al., 2020; Mehrabi et al., 2019).

Adult (Kohavi and Becker, 1994) contains information about individuals from the 1994 US census. It contains records of more than 45,000 individuals and their information over 14 demographic and occupational attributes such as race, sex, education level, marital status, occupation, etc. The target task is to predict the income levels of individuals. Favorable/positive label () denotes high-income (income > $50,000) and unfavorable/negative label () indicates low-income (income ≤ $50,000). The percentage of high-income individuals in Adult is 24%. The dataset reflects historical gender-based income inequality: 11% of the females report high income, compared to 32% of the males. Hence, we choose sex as the sensitive attribute with female as the unprivileged and male as the privileged group.

COMPAS (Larson et al., 2016), compiled by ProPublica, contains criminal assessment information about defendants arrested in 2013-2014 and their assessment scores by the COMPAS recidivism tool (Dieterich et al., 2016). It contains more than 7,200 data points and 11 attributes such as age, sex, prior arrest counts, charges pressed, etc. The target task is to predict whether an individual re-offends within two years of initial assessment. Positive label indicates that an individual does not recidivate and negative label indicates that an individual recidivates. 44% of the individuals in this dataset recidivate and the data contains racial bias: the percentage of re-offenders is much higher in African-Americans (51%), compared to others (39%). Hence, we select race as the sensitive attribute with African-American as the unprivileged and all other races as the privileged group.

German (German Credit Risk, 2020) contains 1,000 instances representing individuals applying for credit or loan to a bank, with attributes age, sex, type of job, credit information, etc. The target task is to predict credit risk. Positive label indicates low credit risk and negative label indicates high credit risk. Over the entire population, 70% are of low credit risk. This percentage is slightly lower for females than males: 65% vs 71%. Hence, we choose sex as the sensitive attribute with female as the unprivileged and male as the privileged group.

Dataset Size (MB) Sensitive groups Target task
Unprivileged Privileged
Adult 5.80 45,222 14 Sex Female Male   Income > 50K
COMPAS 0.30 7,214 11 Race African-American Others   Risk of recidivism
German 0.06 1,000 9 Sex Female Male   Credit risk
Credit 2.50 20,651 26 Sex Female Male   Default on loan
Figure 9. Summary of the datasets. We choose our datasets to be varied in size, number of data points, number of attributes, and different instances of sensitive-attribute-based discrimination. We provide the target prediction tasks in the rightmost column.

Credit (Yeh, 2016) originated from a research aimed at predicting loan defaulting behavior of individuals in Taiwan. It contains information about more than 20,000 individuals over 24 attributes such as education, marital status, history of past payments, etc. The target task is to predict whether an individual defaults on the next payment. Positive label represents timely payment and negative label indicates default. In this dataset, 67% do not default. The dataset is biased against females: 56% of females, compared to 75% of males, do not default. Hence, we choose sex as the sensitive attribute with female as the unprivileged and male as the privileged group.

Train-validation-test setting. The train-test split for each dataset was 70%-30% (using random selection) and we validated each classifier using 3-fold cross validation.

Figure 10. Correctness and fairness scores of the 18 fair classification approaches over (a) Adult, (b) COMPAS, (c) German, and (d) Credit datasets. Higher scores for correctness (fairness) metrics correspond to more correct (fair) outcomes. The bars highlighted in red denote the reverse direction of the remaining discrimination—favoring the unprivileged group more than the privileged group. The arrows () denote the fairness metric(s) each approach is optimized for. The bar plots for LR are overlaid for aiding visual comparison. Calmondp failed to complete on the Credit dataset due to the large number of attributes (26); we display its performance over 22 attributes (the most it could handle).

4.2. Correctness and Fairness

Figure 10 presents our correctness and fairness results over all approaches and metrics across the 4 datasets. Below, we discuss the key findings of this evaluation.

The fairness performance of fairness-unaware approaches influences the relative accuracy of fair approaches. Classifiers typically target accuracy as their optimization objective. Fair approaches, directly or indirectly, modify this objective to target both fairness and accuracy. When a fairness-unaware technique displays significantly different performance across different fairness metrics (e.g., low fairness wrt and high fairness wrt ), this appears to translate to a significant difference in the accuracy of fair approaches that target these fairness metrics (higher accuracy drop for approaches that target , and lower drop for those that target ).

Figure 9(a) demonstrates this scenario for Adult. LR trained on this dataset achieves high fairness in terms of and , but exhibits very low fairness in terms of . We observe that the approaches that optimize (such as Kam-Caldp and Calmondp) demonstrate a much larger accuracy drop than the approaches that target the equalized odds metrics (such as , , and Kearnspe). is an exception as it explicitly controls the allowable accuracy drop. We hypothesize that, in an effort to enforce fairness in terms of , the corresponding approaches shift the decision boundary significantly compared to LR. In contrast, approaches that target and do not need a significant boundary shift, as LR’s performance on these metrics is already high. The post-processing approaches, Hardteo and Pleisseop, appear to be outliers in this observation, but, as we discuss later, their accuracy drop is indicative of the poor correctness-fairness balance that is typical in post-processing. In the other three datasets, LR does not display such differences across these fairness metrics, and we do not observe significant differences in the accuracy performance of fair approaches that target demographic parity vs. equalized odds.

Key takeaway: Fair approaches generally trade accuracy for fairness. The compromise in accuracy is bigger when fairness-unaware approaches achieve low fairness wrt the fairness metric that a fair approach optimizes for, relative to other metrics. The tradeoff is less interpretable for correctness metrics other than accuracy, as classifiers typically do not optimize for them.

There is no single winner. All approaches succeed in improving fairness wrt the metric (and notion) they target. However, they cannot guarantee fairness wrt other notions: their performance wrt those notions is generally unpredictable. This is in line with the impossibility theorem, which states that enforcing multiple notions of fairness is impossible in the general case (Chouldechova, 2017). While we observe that approaches frequently improve on fairness metrics they do not explicitly target, this can depend on the dataset and on correlations across metrics. No approach achieves perfect fairness across all metrics. Thomasdp and Thomaseo come close in the German dataset, but note that this dataset contains low gender-based biases and even LR achieves reasonable fairness scores on all metrics. Note that many techniques exhibit “reverse” discrimination (the red stripes indicate discrimination against the privileged group), but these effects are generally small (a high striped bar indicates high fairness, and, thus, low discrimination in the opposite direction).

Key takeaway: Approaches improve fairness on the metric they target, but their performance on other metrics is unpredictable.

Confounding factors produce different fairness assessments across metrics. Note the interesting contrast between and on Adult (Figure 9(a)). and essentially measure the same type of fairness, but accounts for possible confounding effects. In Adult, LR’s performance difference between and indicates confounding factors that reduce fairness wrt . Specifically, women are strongly correlated with lower-wage occupations and fewer work hours, so when uses occupation and working hours/week as resolving attributes, it produces high fairness scores. We observe that causal approaches, such as and , are particularly adept at maximizing fairness scores in due to taking causal associations into account. Other approaches maximizing can even decrease fairness scores in (e.g., ).

Key takeaway: It is important to understand the impact of confounding factors on these metrics, but we are not arguing here that is a better metric. In fact, arguably, the fact that women are associated with low-wage occupations and low work hours may in itself be a bias we want to measure.

Pre- and in-processing approaches achieve better individual-level fairness than post-processing. Although none of the approaches in our evaluation target individual fairness explicitly, we note that pre- and in-processing tend to produce better scores than post-processing. Even for the Credit dataset (Figure 9(d)), where post-processing techniques improve the score, they do worse than pre- and in-processing on average. This is because post-processing operates on less information than pre- and in-processing: it does not assume knowledge of the attributes in the training data and, thus, does not take the similarity of individuals into account.

Key takeaway: Pre- and in-processing achieve better individual-level fairness than post-processing. This is an inherent limitation of post processing, as it has no knowledge of the attributes in the training data and cannot take individual similarity into account.

Figure 11. Results of efficiency and scalability experiments on the fair approaches. (a) – (c) show runtimes with varying data sizes in Adult dataset and (d) – (f) show runtime with varying number of attributes in Credit dataset. Note that the y-axis is in log scale.

Pre- and in-processing achieve better correctness-fairness balance than post-processing. Post-processing operates at a late stage of the learning process and, by design, does not have access to all the data attributes. As a result, it has less flexibility than pre- and in-processing. Given that post-hoc corrections of predictions are sub-optimal with finite training data (Woodworth et al., 2017), post-processing approaches typically achieve an inferior correctness-fairness balance compared to other approaches. This limitation of post-processing is best highlighted in German (Figure 9(c)), where post-processing achieves on average 5-10% lower accuracy and F-score compared to pre- and in-processing approaches that target the same fairness metrics. Specifically, Pleisseop and Hardteo report the lowest accuracy and F-score across all techniques.

Generally, in-processing also tends to achieve 2-5% higher accuracy than pre-processing, but there is no noticeable pattern across the other correctness metrics. Since in-processing modifies the training objectives directly, it has better control of the accuracy-fairness tradeoff than the other methods. In contrast, there is no direct mapping between the extent of repair during pre-processing and the compromise in accuracy, so pre-processing approaches cannot directly control this tradeoff. However, we cannot conclude that in-processing is always better at balancing correctness and fairness; rather, pre-processing approaches require appropriate tuning of the level of repair to achieve the desired correctness-fairness balance, and this warrants further investigation.

Key takeaway: Pre- and in-processing achieve better correctness and fairness compared to post-processing. In-processing handles the accuracy-fairness tradeoff most effectively, but pre-processing can see gains from proper tuning of the level of repair.

Figure 12. Variance of the fair approaches in terms of accuracy, F-score, DI, TPRB, and CD on arbitrary folds over the Adult dataset.

4.3. Efficiency and Scalability

In this section, we study the runtime behavior of all approaches. We do not present separate variants of the same approach unless they differ significantly in behavior. We compute the total runtime of each approach as pre-processing time (if any) + training time + post-processing time (if any). We subtract from all methods the runtime of LR, so that what we report is the overhead each approach introduces over the fairness-unaware method.

Our first experiment investigates the efficiency and scalability of the fair approaches as the number of data points increases. We used the Adult dataset, as it contains the highest number of data points, and executed a new instance of each approach with a different number of data points (from 1K to 40K) sampled from the dataset. Our second experiment explores the runtime behavior of approaches as the number of attributes increases. We used the Credit dataset, as it contains the highest number of attributes, and executed a new instance of each approach with a different number of attributes (from 2 to 26). We present the results in Figure 11.

Post-processing approaches are generally most efficient and scalable. Post-processing approaches tend to be very efficient, as their mechanisms are less complex compared to pre- and in-processing approaches. As a result, they scale well wrt increasing data sizes and they are not affected by increase in the number of attributes. A few pre- and in-processing techniques like Kam-Caldp and Kearnspe do perform better than post-processing, but this does not hold for most other techniques in their categories.

Key takeaway: Post-processing approaches are more efficient and scalable than pre- and in-processing approaches. Pre- and in-processing approaches generally incur higher runtimes, which depend on their computational complexities.

Causal computations incur sharp runtime penalties. An important observation from Figure 10(a) is that causal mechanisms (such as and ) have significantly higher runtimes compared to other pre-processing approaches. In fact, both variations of reduce to NP-hard problems. Simply put, discovering causal associations in data is more complex than discovering non-causal ones. Calmondp also demonstrates high runtimes, due to its reliance on solving convex optimization problems, and very poor scalability with increasing attributes (Figure 11(d)).

Key takeaway: Causality-based mechanisms incur higher runtimes. Other complex mechanisms also lead to efficiency and scalability challenges.

Pre-processing approaches scale well with increasing data sizes, but tend to scale poorly with increasing number of attributes. As we noted, there is a clear separation between the inherently more complex pre-processing methods (, , and Calmondp) and the rest (Kam-Caldp and ). In fact, Kam-Caldp and perform on par with or better than post-processing techniques in terms of efficiency, and generally better than most in-processing approaches. Generally, pre-processing demonstrates more robust scaling behavior wrt the data size than the number of attributes. In fact, the runtime of several pre-processing approaches appears to grow exponentially with the number of attributes (Figure 10(d)). Calmondp did not converge for more than 22 attributes as its complexity is tied to the number of attributes. Causality-based approaches display similar challenges. There is, however, an interesting contrast between and . The number of constraints in increases rapidly with fewer attributes, resulting in higher runtimes. In contrast to other techniques, its performance improves as the number of attributes grows.

In-processing approaches scale better than pre-processing ones with respect to the number of attributes, but are more affected by the data size. In-processing techniques show a slightly sharper rise in runtime than pre-processing approaches when the data size increases (Figure 10(b)). However, they scale more gracefully than pre-processing with the number of attributes. Their runtime does increase, since a higher number of attributes increases the complexity of the decision boundary in the optimization problem, but it remains generally lower than that of pre-processing, which typically performs data modification on a per-attribute basis.

Key takeaway: Pre-processing approaches are generally more affected by the number of attributes than the data size. In-processing approaches are generally more affected by the data size than the number of attributes.

4.4. Stability

We evaluate the stability of all the approaches through a variance test on their correctness and fairness. We executed each fair approach repeatedly with random folds, using part of the data for training and the rest for testing. We report our findings on the stability of two correctness metrics (accuracy and F-score) and three fairness metrics over Adult (Figure 12); the results are similar for the other accuracy and fairness metrics, and over the other datasets (full results are in the Appendix).
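The variance-test protocol can be sketched as follows; this is a minimal illustration with a toy stand-in classifier, and the fold count, split ratio, and metric set are placeholders rather than the paper's exact settings:

```python
import numpy as np

def variance_test(train_and_score, X, y, n_runs=10, train_frac=2/3, seed=0):
    """Repeatedly split the data at random, retrain, and report per-metric variance."""
    rng = np.random.default_rng(seed)
    n, scores = len(y), []
    for _ in range(n_runs):
        idx = rng.permutation(n)
        cut = int(train_frac * n)
        tr, te = idx[:cut], idx[cut:]
        # train_and_score trains a fair approach and returns a dict of metrics.
        scores.append(train_and_score(X[tr], y[tr], X[te], y[te]))
    return {m: float(np.var([s[m] for s in scores])) for m in scores[0]}

# Toy stand-in: a majority-class predictor whose accuracy we track.
def majority_baseline(X_tr, y_tr, X_te, y_te):
    majority = int(np.mean(y_tr) >= 0.5)
    return {"accuracy": float(np.mean(y_te == majority))}

X = np.arange(100).reshape(100, 1)
y = (np.arange(100) % 2).astype(int)
```

A stable approach produces a small variance for every metric across the random splits; box plots of the collected scores (as in Figure 12) surface both variance and outliers.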

Approaches are generally stable. Most approaches show low variance and very few outliers. Although Hardteo exhibits low variance in accuracy, it has the highest variance in F-score. One approach shows slightly elevated variance in accuracy and one fairness metric, but is stable on the other metrics. In general, post-processing approaches exhibit slightly elevated variance in accuracy, F-score, and one of the fairness metrics.

Key takeaway: All approaches generally exhibit low variance in terms of correctness and fairness over different train-test splits. High-variance behavior is rare, and there is no significant trend across the dimension of pre-, in-, and post-processing.

5. Lessons and Discussion

The goal of our work has been to bring clarity to the vast and diverse landscape of fair classification research. Work on this topic spans multiple disciplines with different priorities and focus, resulting in a wide range of approaches and diverging evaluation goals. Data management research has started making important contributions to this area, and we believe there are many opportunities for impact and synergy. Through our evaluation, we aimed in particular to identify areas and opportunities where data management contributions are best positioned to succeed. We discuss these general guidelines here.

Pre-processing approaches are a natural fit but exhibit scalability challenges. Data management research has primarily focused on the pre-processing stage, as data manipulations create a natural fit. However, our evaluation showed that pre-processing methods tend not to scale robustly with the number of attributes. Research on pre-processing methods should be mindful of problem settings where high data dimensionality may lead to a poor fit. However, this observation also points to an opportunity that plays squarely into the strengths of the data management community, as efforts can focus directly on attacking this scalability challenge. Data management researchers have already made contributions in this direction (e.g., Salimi has a parallel implementation, which was not suitable for our evaluation since the other approaches are single-threaded), and improvements here are likely to lead to more impact.

Similarly, we noted that in-processing techniques generally outperform others in handling the correctness-fairness tradeoff directly. However, pre-processing methods have the potential to improve this balance through appropriate tuning of the data repair levels, and further investigation can help in that regard.

Finally, causality-based approaches produce sophisticated repairs, but impose a significant runtime penalty. Kam-Caldp and Feld use simpler repairs, achieving orders-of-magnitude better runtime performance while maintaining a competitive correctness-fairness tradeoff.

Applicability of pre-processing. Pre-processing has the flexibility of being model-agnostic, and can be used when access to and modification of the model are not possible. However, there can also be practical constraints on modifying training data, as this may violate anti-discrimination laws (Barocas and Selbst, 2016). Additionally, pre-processing repairs data under the assumption that model predictions will follow the ground truth. It cannot enforce fairness notions that target the correctness of predictions across sensitive groups, because it cannot make assumptions about the correctness of predictions after the learning step. This means that metrics such as equalized odds and predictive parity cannot be easily handled in the pre-processing stage. As we saw in our evaluation, fairness as measured by different metrics can diverge, and it is important to consider the application requirements before attacking a problem setting with pre-processing methods.

Synergy with ML research. Our analysis noted that in-processing techniques exhibit better control of the correctness-fairness tradeoff and may be hard to beat in that regard. However, their runtime scales worse with increasing data size than that of pre-processing approaches. Runtime performance is often overlooked in machine learning research, and data management contributions can likely improve in-processing approaches in that regard.

We hope that our analysis helps outline useful perspectives and directions for data management research in fair classification. To the best of our knowledge, ours is the broadest evaluation and analysis of work in this area, and can serve as a useful roadmap for the research community.


  • A. Agarwal, A. Beygelzimer, M. Dudík, J. Langford, and H. M. Wallach (2018) A reductions approach to fair classification. In ICML, Cited by: §3.
  • S. Ahmadi, S. Galhotra, B. Saha, and R. Schwartz (2020) Fair correlation clustering. CoRR abs/2002.03508. External Links: Link, 2002.03508 Cited by: §1.
  • A. Asudeh, H. Jagadish, J. Stoyanovich, and G. Das (2019) Designing fair ranking schemes. In Proceedings of the 2019 International Conference on Management of Data, pp. 1259–1276. Cited by: §1.
  • A. Asudeh and H. Jagadish (2020) Fairly evaluating and scoring items in a data set. Proceedings of the VLDB Endowment 13 (12). Cited by: §1.
  • S. Barocas and A. D. Selbst (2016) Big data’s disparate impact. Calif. L. Rev. 104, pp. 671. Cited by: §1, §5.
  • R. K. Bellamy, K. Dey, M. Hind, S. C. Hoffman, S. Houde, K. Kannan, P. Lohia, J. Martino, S. Mehta, A. Mojsilović, et al. (2019) AI fairness 360: an extensible toolkit for detecting and mitigating algorithmic bias. IBM Journal of Research and Development 63 (4/5), pp. 4–1. Cited by: §1, §4.1.
  • V. Bentkus et al. (2004) On hoeffding’s inequalities. The Annals of Probability 32 (2), pp. 1650–1673. Cited by: §A.2.
  • R. Berk, H. Heidari, S. Jabbari, M. Kearns, and A. Roth (2018) Fairness in criminal justice risk assessments: the state of the art. Sociological Methods & Research, pp. 0049124118782533. Cited by: §1, §1, Figure 5.
  • B. Borchers and J. Furman (1998) A two-phase exact algorithm for max-sat and weighted max-sat problems. Journal of Combinatorial Optimization 2 (4), pp. 299–306. Cited by: §A.1.5, §3.
  • L. Bottou (2010) Large-scale machine learning with stochastic gradient descent. In Proceedings of COMPSTAT’2010, pp. 177–186. Cited by: §A.2.
  • L. Brown (1967) The conditional level of student’s t test. The Annals of Mathematical Statistics 38 (4), pp. 1068–1071. Cited by: §A.2.
  • T. Calders, F. Kamiran, and M. Pechenizkiy (2009) Building classifiers with independency constraints. In 2009 IEEE International Conference on Data Mining Workshops, pp. 13–18. Cited by: Figure 5, §3, §3.
  • T. Calders and S. Verwer (2010) Three naive bayes approaches for discrimination-free classification. Data Mining and Knowledge Discovery 21 (2), pp. 277–292. Cited by: §3.
  • F. Calmon, D. Wei, B. Vinzamuri, K. N. Ramamurthy, and K. R. Varshney (2017) Optimized pre-processing for discrimination prevention. In Advances in Neural Information Processing Systems, pp. 3992–4001. Cited by: §A.1.3, 2nd item, §1, Figure 8, §3.
  • S. Caton and C. Haas (2020) Fairness in machine learning: a survey. arXiv preprint arXiv:2010.04053. Cited by: §1.
  • L. E. Celis, L. Huang, V. Keswani, and N. K. Vishnoi (2019) Classification with fairness constraints: a meta-algorithm with provable guarantees. In Proceedings of the Conference on Fairness, Accountability, and Transparency, pp. 319–328. Cited by: §A.2, 2nd item, §1, §1, Figure 8, §3, §3.
  • S. Chen, Z. Guo, and X. Zhao (2020) Predicting mortgage early delinquency with machine learning methods. European Journal of Operational Research. Cited by: §1.
  • Y. Chen and M. Larbani (2006) Two-person zero-sum game approach for fuzzy multiple attribute decision making problems. Fuzzy Sets and Systems 157 (1), pp. 34–51. Cited by: §A.2.
  • S. Chiappa (2019) Path-specific counterfactual fairness. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 7801–7808. Cited by: §3.
  • A. Chouldechova and A. Roth (2020) A snapshot of the frontiers of fairness in machine learning. Communications of the ACM 63 (5). Cited by: §2.2.
  • A. Chouldechova (2017) Fair prediction with disparate impact: a study of bias in recidivism prediction instruments. Big data 5 (2), pp. 153–163. Cited by: Figure 5, §2.2.2, §4.2.
  • S. Corbett-Davies, E. Pierson, A. Feller, S. Goel, and A. Huq (2017) Algorithmic decision making and the cost of fairness. In Proceedings of the 23rd acm sigkdd international conference on knowledge discovery and data mining, pp. 797–806. Cited by: §1, Figure 5.
  • W. Dieterich, C. Mendoza, and T. Brennan (2016) COMPAS risk scales: demonstrating accuracy equity and predictive parity. Northpointe Inc. Cited by: §1, §4.1.
  • C. Dwork, M. Hardt, T. Pitassi, O. Reingold, and R. Zemel (2012) Fairness through awareness. In Proceedings of the 3rd innovations in theoretical computer science conference, pp. 214–226. Cited by: Figure 5, §2.2.2, §3.
  • E. Faliagka, K. Ramantas, A. Tsakalidis, and G. Tzimas (2012) Application of machine learning algorithms to an online recruitment system. In Proc. International Conference on Internet and Web Applications and Services, Cited by: §1.
  • M. Feldman, S. A. Friedler, J. Moeller, C. Scheidegger, and S. Venkatasubramanian (2015) Certifying and removing disparate impact. In proceedings of the 21th ACM SIGKDD international conference on knowledge discovery and data mining, pp. 259–268. Cited by: §A.1.2, 2nd item, §1, §1, §1, Figure 6, Figure 8, §3.
  • B. Fish, J. Kun, and Á. D. Lelkes (2016) A confidence-based approach for balancing fairness and accuracy. In Proceedings of the 2016 SIAM International Conference on Data Mining, pp. 144–152. Cited by: Figure 5.
  • A. W. Flores, K. Bechtel, and C. T. Lowenkamp (2016) A rejoinder to machine bias: there’s software used across the country to predict future criminals. and it’s biased against blacks. Fed. Probation 80, pp. 38. Cited by: §1.
  • J. R. Foulds, R. Islam, K. N. Keya, and S. Pan (2020) An intersectional definition of fairness. In 2020 IEEE 36th International Conference on Data Engineering (ICDE), pp. 1918–1921. Cited by: Figure 5.
  • S. A. Friedler, C. Scheidegger, S. Venkatasubramanian, S. Choudhary, E. P. Hamilton, and D. Roth (2019) A comparative study of fairness-enhancing interventions in machine learning. In Proceedings of the conference on fairness, accountability, and transparency, pp. 329–338. Cited by: §1, Figure 5, §2.2.2, §4.1.
  • S. Galhotra, Y. Brun, and A. Meliou (2017) Fairness testing: testing software for discrimination. In Proceedings of the 2017 11th Joint Meeting on Foundations of Software Engineering, pp. 498–510. Cited by: §1, Figure 5, Figure 6, §2.2.2, §2.2.2.
  • S. Galhotra, K. Shanmugam, P. Sattigeri, and K. R. Varshney (2020) Fair data integration. CoRR abs/2006.06053. External Links: Link, 2006.06053 Cited by: §1.
  • German Credit Risk (2020) German credit risk - Kaggle. Cited by: §4.1.
  • G. Goh, A. Cotter, M. Gupta, and M. P. Friedlander (2016) Satisfying real-world goals with dataset constraints. In Advances in Neural Information Processing Systems, pp. 2415–2423. Cited by: §3.
  • M. Grbovic, V. Radosavljevic, N. Djuric, N. Bhamidipati, J. Savla, V. Bhagwan, and D. Sharp (2015) E-commerce in your inbox: product recommendations at scale. In Proceedings of the 21th ACM SIGKDD international conference on knowledge discovery and data mining, pp. 1809–1818. Cited by: §1.
  • W. W. Hager and S. K. Mitter (1976) Lagrange duality theory for convex control problems. SIAM Journal on Control and Optimization 14 (5), pp. 843–856. Cited by: §A.2.
  • M. Hardt, E. Price, and N. Srebro (2016) Equality of opportunity in supervised learning. In Advances in neural information processing systems, pp. 3315–3323. Cited by: §A.3.2, 2nd item, §1, §1, Figure 5, Figure 6, §2.2.2, Figure 8, §3.
  • J. L. Hill (2011) Bayesian nonparametric modeling for causal inference. Journal of Computational and Graphical Statistics 20, pp. 217–240. Cited by: Figure 5.
  • G. P. Jones, J. M. Hickey, P. G. Di Stefano, C. Dhanjal, L. C. Stoddart, and V. Vasileiou (2020) Metrics and methods for a systematic comparison of fairness-aware machine learning algorithms. arXiv preprint arXiv:2010.03986. Cited by: §1, §4.1.
  • F. Kamiran, T. Calders, and M. Pechenizkiy (2010) Discrimination aware decision tree learning. In 2010 IEEE International Conference on Data Mining, pp. 869–874. Cited by: §3, §3.
  • F. Kamiran and T. Calders (2012) Data preprocessing techniques for classification without discrimination. Knowledge and Information Systems 33 (1), pp. 1–33. Cited by: §A.1.1, 2nd item, §1, Figure 8, §3.
  • F. Kamiran, A. Karim, and X. Zhang (2012) Decision theory for discrimination-aware classification. In 2012 IEEE 12th International Conference on Data Mining, pp. 924–929. Cited by: §A.3.1, 2nd item, §1, Figure 8, §3.
  • T. Kamishima, S. Akaho, H. Asoh, and J. Sakuma (2012) Fairness-aware classifier with prejudice remover regularizer. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp. 35–50. Cited by: §3.
  • M. Kearns, S. Neel, A. Roth, and Z. S. Wu (2018) Preventing fairness gerrymandering: auditing and learning for subgroup fairness. In Proceedings of the 35th International Conference on Machine Learning, pp. 2564–2572. Cited by: §A.2, 2nd item, §1, Figure 8, §3, §3.
  • N. Kilbertus, M. R. Carulla, G. Parascandolo, M. Hardt, D. Janzing, and B. Schölkopf (2017a) Avoiding discrimination through causal reasoning. In Advances in Neural Information Processing Systems, pp. 656–666. Cited by: §3.
  • N. Kilbertus, M. R. Carulla, G. Parascandolo, M. Hardt, D. Janzing, and B. Schölkopf (2017b) Avoiding discrimination through causal reasoning. In Advances in Neural Information Processing Systems, pp. 656–666. Cited by: Figure 5.
  • M. Kim, O. Reingold, and G. Rothblum (2018) Fairness through computationally-bounded awareness. In Advances in Neural Information Processing Systems, pp. 4842–4852. Cited by: Figure 5.
  • J. Kleinberg, S. Mullainathan, and M. Raghavan (2017) Inherent trade-offs in the fair determination of risk scores. In 8th Innovations in Theoretical Computer Science Conference (ITCS 2017), Cited by: Figure 5.
  • R. Kohavi and B. Becker (1994) UCI machine learning repository. External Links: Link Cited by: 3rd item, §4.1.
  • C. Kuhlman and E. Rundensteiner (2020) Rank aggregation algorithms for fair consensus. Proceedings of the VLDB Endowment 13 (12). Cited by: §1.
  • M. J. Kusner, J. Loftus, C. Russell, and R. Silva (2017) Counterfactual fairness. In Advances in neural information processing systems, pp. 4066–4076. Cited by: Figure 5, §3.
  • V. Labatut and H. Cherifi (2012) Accuracy measures for the comparison of classifiers. arXiv preprint arXiv:1207.3790. Cited by: §2.1.
  • P. Lahoti, A. Beutel, J. Chen, K. Lee, F. Prost, N. Thain, X. Wang, and E. Chi (2020) Fairness without demographics through adversarially reweighted learning. Advances in Neural Information Processing Systems 33. Cited by: §3.
  • P. Lahoti, K. P. Gummadi, and G. Weikum (2019) Operationalizing individual fairness with pairwise fair representations. Proceedings of the VLDB Endowment 13 (4), pp. 506–518. Cited by: §1, §3.
  • J. Larson, S. Mattu, L. Kirchner, and J. Angwin (2016) How we analyzed the compas recidivism algorithm. ProPublica (5 2016) 9. Cited by: 3rd item, §1, §1, §4.1.
  • D. D. Lee and H. S. Seung (2001) Algorithms for non-negative matrix factorization. In Advances in neural information processing systems, pp. 556–562. Cited by: §A.1.5, §3.
  • C. Louizos, K. Swersky, Y. Li, M. Welling, and R. Zemel (2015) The variational fair autoencoder. stat 1050, pp. 3. Cited by: §3.
  • D. Madras, E. Creager, T. Pitassi, and R. Zemel (2019) Fairness through causal awareness: learning causal latent-variable models for biased data. In Proceedings of the Conference on Fairness, Accountability, and Transparency, pp. 349–358. Cited by: Figure 5.
  • N. Mehrabi, F. Morstatter, N. Saxena, K. Lerman, and A. Galstyan (2019) A survey on bias and fairness in machine learning. arXiv preprint arXiv:1908.09635. Cited by: §1, §4.1.
  • R. Nabi and I. Shpitser (2018) Fair inference on outcomes. In Thirty-Second AAAI Conference on Artificial Intelligence, Cited by: Figure 5, §3.
  • A. Narayanan (2018) Translation tutorial: 21 fairness definitions and their politics. In Proc. Conf. Fairness Accountability Transp., New York, USA, Vol. 1170. Cited by: §1, §1.
  • A. Noriega-Campero, M. A. Bakker, B. Garcia-Bulle, and A. Pentland (2019) Active fairness in algorithmic decision making. In Proceedings of the 2019 AAAI/ACM Conference on AI, Ethics, and Society, pp. 77–83. Cited by: §3.
  • J. Pearl (2009) Causality. Cambridge university press. Cited by: §2.2.
  • G. Pleiss, M. Raghavan, F. Wu, J. Kleinberg, and K. Q. Weinberger (2017) On fairness and calibration. In Advances in Neural Information Processing Systems, pp. 5680–5689. Cited by: §A.3.3, 2nd item, §1, Figure 8, §3.
  • N. Quadrianto and V. Sharmanska (2017) Recycling privileged learning and distribution matching for fairness. In Advances in Neural Information Processing Systems, pp. 677–688. Cited by: §3.
  • B. Qureshi, F. Kamiran, A. Karim, S. Ruggieri, and D. Pedreschi (2019) Causal inference for social discrimination reasoning. Journal of Intelligent Information Systems, pp. 1–13. Cited by: Figure 5, Figure 6, §2.2.2.
  • J. Rothwell (2014) How the war on drugs damages black social mobility. The Brookings Institution, published Sept 30. Cited by: §1.
  • C. Russell, M. J. Kusner, J. Loftus, and R. Silva (2017) When worlds collide: integrating different counterfactual assumptions in fairness. In Advances in Neural Information Processing Systems, pp. 6414–6423. Cited by: §3.
  • B. Salimi, L. Rodriguez, B. Howe, and D. Suciu (2019) Interventional fairness: causal database repair for algorithmic fairness. In Proceedings of the 2019 International Conference on Management of Data, pp. 793–810. Cited by: §A.1.5, 2nd item, §1, §1, §1, §1, §1, Figure 5, Figure 8, §3.
  • S. Samadi, U. Tantipongpipat, J. H. Morgenstern, M. Singh, and S. Vempala (2018) The price of fair pca: one extra dimension. In Advances in Neural Information Processing Systems, pp. 10976–10987. Cited by: §3.
  • R. Scheines, P. Spirtes, C. Glymour, C. Meek, and T. Richardson (1998) The tetrad project: constraint based aids to causal model specification. Multivariate Behavioral Research 33 (1), pp. 65–117. Cited by: §A.1.4.
  • X. Shen, S. Diamond, Y. Gu, and S. Boyd (2016) Disciplined convex-concave programming. In 2016 IEEE 55th Conference on Decision and Control (CDC), pp. 1009–1014. Cited by: §A.2.
  • J. Stoyanovich, K. Yang, and H. Jagadish (2018) Online set selection with fairness and diversity constraints. In Proceedings of the EDBT Conference, Cited by: §1.
  • P. S. Thomas, B. C. da Silva, A. G. Barto, S. Giguere, Y. Brun, and E. Brunskill (2019) Preventing undesirable behavior of intelligent machines. Science 366 (6468), pp. 999–1004. Cited by: §A.2, 2nd item, §1, Figure 8, §3.
  • F. Tramèr, V. Atlidakis, R. Geambasu, D. J. Hsu, J. Hubaux, M. Humbert, A. Juels, and H. Lin (2015) Discovering unwarranted associations in data-driven applications with the fairtest testing toolkit. CoRR, abs/1510.02377. Cited by: §1.
  • J. Valentino-Devries, J. Singer-Vine, and A. Soltani (2012) Websites vary prices, deals based on users’ information. Wall Street Journal 10, pp. 60–68. Cited by: §1, §1.
  • V. N. Vapnik and A. Y. Chervonenkis (2015) On the uniform convergence of relative frequencies of events to their probabilities. In Measures of complexity, pp. 11–30. Cited by: footnote 8.
  • S. Verma and J. Rubin (2018) Fairness definitions explained. In 2018 IEEE/ACM International Workshop on Software Fairness (FairWare), pp. 1–7. Cited by: §1, §1.
  • B. Woodworth, S. Gunasekar, M. I. Ohannessian, and N. Srebro (2017) Learning non-discriminatory predictors. In Conference on Learning Theory, pp. 1920–1953. Cited by: §3, §4.2.
  • Y. Wu, L. Zhang, X. Wu, and H. Tong (2019) Pc-fairness: a unified framework for measuring causality-based fairness. In Advances in Neural Information Processing Systems, pp. 3404–3414. Cited by: Figure 5.
  • A. Yan and B. Howe (2019) Fairst: equitable spatial and temporal demand prediction for new mobility systems. In Proceedings of the 27th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems, pp. 552–555. Cited by: Appendix B, §1.
  • I. C. Yeh (2016) UCI machine learning repository. External Links: Link Cited by: §4.1.
  • M. B. Zafar, I. Valera, M. Gomez Rodriguez, and K. P. Gummadi (2017a) Fairness beyond disparate treatment & disparate impact: learning classification without disparate mistreatment. In Proceedings of the 26th international conference on world wide web, pp. 1171–1180. Cited by: §A.2, 2nd item, §1, Figure 8, §3.
  • M. B. Zafar, I. Valera, M. G. Rodriguez, and K. P. Gummadi (2017b) Fairness constraints: mechanisms for fair classification. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, Cited by: §A.2, 2nd item, §1, §1, Figure 5, Figure 8, §3, §3.
  • M. B. Zafar, I. Valera, M. Rodriguez, K. Gummadi, and A. Weller (2017c) From parity to preference-based notions of fairness in classification. In Advances in Neural Information Processing Systems, pp. 229–239. Cited by: Figure 5.
  • R. Zemel, Y. Wu, K. Swersky, T. Pitassi, and C. Dwork (2013) Learning fair representations. In International Conference on Machine Learning, pp. 325–333. Cited by: §3.
  • B. H. Zhang, B. Lemoine, and M. Mitchell (2018) Mitigating unwanted biases with adversarial learning. In Proceedings of the 2018 AAAI/ACM Conference on AI, Ethics, and Society, pp. 335–340. Cited by: §A.2, 2nd item, §1, Figure 8, §3, §3.
  • L. Zhang, Y. Wu, and X. Wu (2017) A causal framework for discovering and removing direct and indirect discrimination. In Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, IJCAI-17, pp. 3929–3935. Cited by: §A.1.4, 2nd item, §1, Figure 8, §3.

Appendix A Description of Fair Approaches

In this section, we provide a detailed discussion of the fair approaches that we evaluate in this paper.

A.1. Pre-processing Approaches

A.1.1. Kam-Cal

Kamiran and Calders (Kamiran and Calders, 2012) introduce a pre-processing approach that targets the notion of demographic parity. We refer to this approach as Kam-Cal. Assuming that the predictions Ŷ reasonably approximate the ground truth Y, Kam-Cal argues that Ŷ is likely to be independent of the sensitive attribute S when the classifier is deployed, if Y and S are independent in the training data. To this end, Kam-Cal samples tuples from the training dataset D to create a modified training dataset D' in a way that ensures that Y and S are independent in D'. The intuition is that a classifier trained on D' is likely to learn this independence and will ensure demographic parity when deployed.

If S and Y are independent in D, then for every sensitive value s and label y, their expected joint probability P_exp(S=s, Y=y) = P(S=s) · P(Y=y) should be sufficiently close to their observed joint probability P_obs(S=s, Y=y). These probabilities (over D) are computed using the following formulas: P(S=s) = |{t ∈ D : t.S = s}| / |D|, P(Y=y) = |{t ∈ D : t.Y = y}| / |D|, and P_obs(S=s, Y=y) = |{t ∈ D : t.S = s ∧ t.Y = y}| / |D|.

If P_obs is different from P_exp, then S and Y are not independent in D. Kam-Cal's goal is to modify D to obtain D' such that the differences between the expected and the observed probabilities are mitigated. To achieve this, Kam-Cal employs a weighted sampling technique that compensates for these differences. The technique involves computing a weight for each tuple in D and then sampling tuples from D, with probability proportional to their weights, to construct D'. The weight of a tuple t with t.S = s and t.Y = y is computed as: W(t) = P_exp(S=s, Y=y) / P_obs(S=s, Y=y).

This weighting scheme guarantees that P_exp and P_obs are sufficiently close over D', which implies that S and Y are independent in D'. Kam-Cal also provides empirical evidence that classifiers trained on D' indeed satisfy demographic parity.
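The weight computation can be sketched as follows; a minimal illustration over arrays s and y of sensitive values and labels (hypothetical names), omitting the subsequent weighted sampling step that materializes D':

```python
import numpy as np

def reweighing_weights(s, y):
    """Weight each tuple by expected over observed joint probability of (S, Y)."""
    s, y = np.asarray(s), np.asarray(y)
    weights = np.ones(len(s), dtype=float)
    for sv in np.unique(s):
        for yv in np.unique(y):
            mask = (s == sv) & (y == yv)
            if mask.any():
                p_exp = np.mean(s == sv) * np.mean(y == yv)  # P(S=s) * P(Y=y)
                p_obs = np.mean(mask)                        # P_obs(S=s, Y=y)
                weights[mask] = p_exp / p_obs
    return weights

# Biased toy data: group 1 receives positive labels far more often than group 0.
s = np.array([0, 0, 0, 0, 1, 1, 1, 1])
y = np.array([0, 0, 0, 1, 1, 1, 1, 0])
w = reweighing_weights(s, y)  # under-represented (s, y) cells get weight > 1
```

Sampling tuples with probability proportional to these weights yields a modified dataset in which the joint distribution of S and Y factorizes, i.e., the two become independent.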

Implementation. We collected the source code for Kam-Cal from the open source AI Fairness 360 library.

A.1.2. Feld

Feldman et al. (Feldman et al., 2015) propose a pre-processing approach that also enforces demographic parity. We refer to this approach as Feld. Feld argues that demographic parity can be ensured if the marginal distribution of each non-sensitive attribute is similar across the privileged and the unprivileged groups in the training data. The basis of their argument is that a model learning from such data is likely to predict based on attributes that are independent of the sensitive attribute, which in turn satisfies demographic parity within the model's predictions. Unlike Kam-Cal, which does not modify attribute values, Feld directly modifies the values of each attribute.

Given the training data, Feld produces a modified dataset where the marginal distribution of each attribute is similar across the privileged and the unprivileged groups. Feld repairs the values of each individual attribute separately to equalize the marginal distributions of the sensitive groups for that attribute. To this end, Feld determines the quantile of each value within its group's marginal distribution, and replaces the value with the median of the values at the corresponding quantile of the groups' original marginal distributions. This repair produces a modified attribute whose distribution is identical across the groups and, thus, ensures that the modified attribute is independent of the sensitive attribute.

Repeating the repair process for all attributes produces the modified dataset. The level of repair is controlled through a hyper-parameter λ ∈ [0, 1], where λ = 0 yields the unmodified dataset and λ = 1 implies that the values within each attribute are completely moved to the median.
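The per-attribute repair can be sketched for a single numeric attribute as follows; a simplified illustration where lam plays the role of the repair hyper-parameter λ and the quantile grid is an arbitrary discretization:

```python
import numpy as np

def quantile_repair(x, group, lam=1.0):
    """Move each group's values toward the cross-group median at equal quantiles."""
    x, group = np.asarray(x, dtype=float), np.asarray(group)
    repaired = x.copy()
    groups = np.unique(group)
    qs = np.linspace(0, 1, 101)  # common quantile grid for comparing distributions
    # Median, across groups, of the value found at each quantile.
    med = np.median([np.quantile(x[group == g], qs) for g in groups], axis=0)
    for g in groups:
        vals = x[group == g]
        # Quantile rank of each value within its own group.
        ranks = np.searchsorted(np.sort(vals), vals, side="right") / len(vals)
        target = np.interp(ranks, qs, med)
        # lam interpolates between the original value and the full median repair.
        repaired[group == g] = (1 - lam) * vals + lam * target
    return repaired

# Two groups whose attribute distributions are shifted by a constant.
x = np.concatenate([np.arange(10.0), np.arange(10.0) + 5])
g = np.array([0] * 10 + [1] * 10)
fully_repaired = quantile_repair(x, g, lam=1.0)
```

With lam=1.0 both groups end up with the same marginal distribution for the attribute; with lam=0.0 the data is returned unchanged.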

Implementation. We collected the source code for Feld from the AI Fairness 360 library. As the preferred value of λ is application-specific, we chose two values in our evaluation to highlight its impact on performance.

A.1.3. Calmon

Calmon et al. (Calmon et al., 2017) propose a pre-processing approach that also enforces demographic parity. We refer to this approach as Calmon. Given the joint distribution associated with the training data, Calmon computes a new distribution that transforms the attribute values and the ground truth such that the dependency between the ground truth and the sensitive attribute is reduced, without significantly distorting the data distribution. The new joint distribution yields the repaired training data.

To compute the new distribution, Calmon constructs the following constraints that must be satisfied: (1) the dependence of the ground truth on the sensitive attribute is below an allowable threshold, (2) the new joint distribution is sufficiently close to the original one, and (3) no attribute value is substantially distorted by the transformation. Calmon then formulates a convex optimization problem that searches for the optimal new distribution subject to these constraints. The resulting distribution maps each tuple from the original data to the modified dataset, and classifiers learned on the modified dataset are expected to satisfy demographic parity.
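As a drastically simplified analogue for a binary outcome, the interplay of constraints (1) and (2) can be illustrated by nudging the per-group positive rates just enough to bound their spread; this heuristic sketch (with hypothetical eps as the allowable threshold) stands in for Calmon's full convex program over the joint distribution:

```python
import numpy as np

def limit_dependence(pos_rate_per_group, eps):
    """Clip per-group positive rates toward their midpoint until spread <= eps."""
    p = np.asarray(pos_rate_per_group, dtype=float)
    lo, hi = p.min(), p.max()
    if hi - lo <= eps:
        return p  # constraint (1) already holds; leave the rates untouched
    mid = (lo + hi) / 2
    # Clipping changes each rate as little as needed, loosely mirroring
    # constraint (2) that the new distribution stays close to the original.
    return np.clip(p, mid - eps / 2, mid + eps / 2)
```

The real method optimizes over the full joint distribution of attributes and labels, which is what makes it a convex program rather than a one-line adjustment.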

Implementation. We collected the source code for Calmon from the AI Fairness 360 library. Further, as Calmon could not operate on more than 22 attributes of the Credit dataset on our system, we dropped the 4 attributes with the least information gain.

A.1.4. Zha-Wu

Zhang, Wu, and Wu (Zhang et al., 2017) propose a pre-processing approach that targets path-specific fairness: a causal notion that uses graphical causal models to ensure that causal effects of the sensitive attribute are not carried to the prediction through any direct or indirect path, i.e., that the sensitive attribute has no causal association with the prediction. We refer to this approach as Zha-Wu. Using the training data, Zha-Wu constructs a graphical causal model to estimate the effect of intervening on the sensitive attribute and determines its causal association with the ground truth. Zha-Wu then repairs the training data minimally such that all causal associations between the sensitive attribute and the ground truth are removed. Classifiers trained on the modified training data are then expected to satisfy path-specific fairness, under the assumption that the distribution of the predictions made by a classifier follows the distribution of the ground truth in the training data.

To repair the data, Zha-Wu first verifies whether the training data violates path-specific fairness. Specifically, the sensitive attribute is a direct or indirect cause of the ground truth if intervening on it changes the expectation of the ground truth. Zha-Wu utilizes the training data to construct the graphical causal model and estimates the effect of intervening on the sensitive attribute as the expected difference in the ground truth when the sensitive attribute changes from privileged to unprivileged. Instead of measuring the causal association through all paths between the sensitive attribute and the ground truth in the causal graph, Zha-Wu can measure this association through specific paths if desired. Path-specific fairness is violated if the expected difference exceeds some threshold.
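The intervention-effect estimate can be illustrated with a standard back-door adjustment over observed frequencies; this sketch assumes hypothetical (s, z, y) triples where z is a valid adjustment set, whereas Zha-Wu derives the adjustment from the causal graph it learns:

```python
from collections import Counter, defaultdict

def adjusted_effect(rows):
    """Estimate E[Y | do(S=1)] - E[Y | do(S=0)], adjusting for covariate Z."""
    n = len(rows)
    pz = Counter(z for s, z, y in rows)  # empirical P(Z=z)
    by = defaultdict(list)
    for s, z, y in rows:
        by[(s, z)].append(y)
    def mean_y(s, z):
        ys = by.get((s, z), [])
        return sum(ys) / len(ys) if ys else 0.0
    # Sum_z P(z) * (E[Y | S=1, Z=z] - E[Y | S=0, Z=z])
    return sum((c / n) * (mean_y(1, z) - mean_y(0, z)) for z, c in pz.items())

# Toy data where the outcome copies the sensitive value in every stratum of Z,
# so the (direct) intervention effect is 1.
rows = [(0, 0, 0), (1, 0, 1), (0, 1, 0), (1, 1, 1)]
```

An estimated effect above the chosen threshold would flag a path-specific fairness violation in this simplified setting.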

Next, Zha-Wu designs an optimization problem to repair the training data such that the direct and indirect causal associations between the sensitive attribute and the ground truth are removed, while the causal model is minimally altered. The modified training dataset is then used to train classifiers that enforce path-specific fairness. Note that an accurate representation of the causal model depends on the training data, and Zha-Wu allows alternative causal models that can be constructed with domain knowledge.

Implementation. We retrieved the source code for Zha-Wu from the authors' website. In accordance with the original paper, we construct the causal networks using the open source software TETRAD (Scheines et al., 1998) and set the violation threshold accordingly.

A.1.5. Salimi

Salimi et al. (Salimi et al., 2019) propose a pre-processing approach that enforces justifiable fairness: a causal fairness notion that prohibits causal dependency between the sensitive attribute and the prediction, except through admissible attributes. We refer to this approach as Salimi. Unlike other causal mechanisms, Salimi does not require access to the causal model. Salimi assumes that the prediction is likely to be fair if a classifier is trained on data where the ground truth satisfies the target fairness notion. To that end, it expresses justifiable fairness as an integrity constraint and repairs the training data to ensure that the constraint holds on the repaired data. Like Kam-Cal, Salimi does not modify attribute values; it repairs the data only by inserting or deleting tuples.

As Salimi does not depend on the causal model, it translates the condition for justifiable fairness into an integrity constraint that must hold over the training data. Salimi partitions all attributes, except the ground truth, into two disjoint sets: admissible (A) and inadmissible (N). A contains the attributes that are allowed to influence or have causal associations with the prediction, while N contains the rest of the attributes. Given A and N, justifiable fairness holds in the training data if the ground truth is independent of the sensitive attribute S conditioned on A. If the probability distribution associated with the training data is uniform (datasets do not always have a uniform probability distribution in practice, and additional pre-processing is required to ensure that), this integrity constraint can be checked through the multi-valued dependency A →→ S.

The goal of Salimi is then to minimally repair D to form a new training dataset D′ such that the multi-valued dependency is satisfied. Salimi leverages techniques from maximum satisfiability (Borchers and Furman, 1998) and matrix factorization (Lee and Seung, 2001) to compute the minimal repair of D that produces the optimal D′ for training classifiers. However, the underlying repair problem is NP-hard, and application-specific knowledge is generally needed to determine the sets of admissible and inadmissible attributes.
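To illustrate the integrity-constraint view (not the authors’ MaxSAT/matrix-factorization repair), the multi-valued dependency A →→ Y can be checked directly, and naively repaired by insertion only; the toy relation and attribute values below are made up:

```python
from itertools import product

# Toy training data as (admissible, inadmissible, ground-truth) triples.
# Attribute names and values are illustrative only.
D = [
    ("clerk", "female", 0),
    ("clerk", "male",   1),
    ("clerk", "female", 1),
    # ("clerk", "male", 0) is absent, so the MVD fails for a = "clerk".
]

def satisfies_mvd(D):
    """Check the MVD A ->-> Y: for each value a of the admissible attributes,
    every combination of an observed inadmissible value and an observed
    ground-truth value must appear together in some tuple."""
    by_a = {}
    for a, i, y in D:
        by_a.setdefault(a, set()).add((i, y))
    for pairs in by_a.values():
        i_vals = {i for i, _ in pairs}
        y_vals = {y for _, y in pairs}
        if pairs != set(product(i_vals, y_vals)):
            return False
    return True

def insert_only_repair(D):
    """Repair by tuple insertion only: add every missing (a, i, y)
    combination (one simple strategy; the paper also allows deletions)."""
    by_a = {}
    for a, i, y in D:
        by_a.setdefault(a, set()).add((i, y))
    repaired = list(D)
    for a, pairs in by_a.items():
        i_vals = {i for i, _ in pairs}
        y_vals = {y for _, y in pairs}
        for i, y in product(i_vals, y_vals):
            if (i, y) not in pairs:
                repaired.append((a, i, y))
    return repaired
```

This brute-force repair is exponential in the worst case, which is why the paper resorts to MaxSAT and factorization techniques for minimal repairs.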

Implementation. We collected the source code for Salimi from the authors via email, as no public repository is available. Following the original paper, we choose race, gender, and marital/relationship status as inadmissible attributes whenever applicable, and the rest of the attributes as admissible. Moreover, Salimi et al. discuss a second variation of Salimi that partially repairs the data, but we do not include it as there are no instructions on how to tune the level of repair. Lastly, although there are experiments in the original paper that discuss techniques to partition the training data and repair the partitions in parallel, our evaluation is limited to a single-threaded implementation.

A.2. In-processing Approaches

A.2.1. Zafar

Zafar et al. (Zafar et al., 2017b, a) propose two in-processing approaches to enforce demographic parity and equalized odds; we refer to them as Zafar-DP and Zafar-EO, respectively. Both approaches translate their corresponding fairness notion into a convex function of the classifier parameters, and compute the optimal parameters that minimize prediction errors while satisfying the notion.

To compute the optimal fair classifier, Zafar first formulates the learning process as a constrained optimization problem. Given the training data D, the task of a classifier is to learn a decision boundary that separates the tuples according to the ground truth. The optimal decision boundary, defined by a set of parameters θ, is the one that minimizes a convex loss function L(θ) that measures the cost of prediction errors. For any tuple with features x, the signed distance d_θ(x) from the decision boundary determines the prediction: specifically, Ŷ = 1 if d_θ(x) ≥ 0. Zafar does not explicitly use d_θ(x) to determine the prediction; rather, they utilize d_θ(x) only to define the fairness constraint.

Zafar-DP (the demographic-parity variant) introduces a proxy constraint for demographic parity, as directly including the notion as a constraint leads to non-convexity in the loss function (non-convex functions are computationally harder to optimize than convex functions). Zafar-DP utilizes the empirical covariance between the sensitive attribute s and the signed distance d_θ(x) as a proxy, arguing that this covariance is approximately zero if the prediction of a classifier is independent of the sensitive attribute. As the covariance is a convex function of θ, it can be used to define the proxy constraint for demographic parity. Formally, the covariance is computed as Cov(s, d_θ(x)) ≈ (1/N) Σ_i (s_i − s̄) d_θ(x_i), where s̄ denotes the mean of s and N is the number of tuples. Given the proxy constraint, Zafar-DP proposes the following two variations that work under different constraint settings:

  • Maximizing accuracy under fairness constraint. This variation computes the optimal classifier by minimizing L(θ) under the condition that |Cov(s, d_θ(x))| stays below a covariance threshold.

  • Maximizing fairness under accuracy constraint. This variation minimizes |Cov(s, d_θ(x))| as much as possible while ensuring that L(θ) stays below a specified threshold. This avoids cases where enforcing the fairness constraint leads to high loss in the first variation.
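The covariance proxy can be sketched with a plain logistic model, using a quadratic penalty in place of Zafar’s hard constraint; the synthetic data, penalty weight, and hyper-parameters are illustrative only:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 4000

# Synthetic data: feature x1 is correlated with the sensitive attribute s.
s = rng.integers(0, 2, n).astype(float)
x1 = s + rng.normal(scale=0.7, size=n)
x2 = rng.normal(size=n)
X = np.column_stack([np.ones(n), x1, x2])
y = (x1 + x2 + rng.normal(scale=0.5, size=n) > 0.5).astype(float)

def sigmoid(z):
    return 1 / (1 + np.exp(-np.clip(z, -30, 30)))

def fit(lam, steps=3000, lr=0.1):
    """Logistic regression where the hard constraint |Cov(s, d_theta(x))| <= c
    is replaced by a quadratic penalty lam * Cov(s, d_theta(x))^2."""
    theta = np.zeros(3)
    s_c = s - s.mean()
    for _ in range(steps):
        d = X @ theta                     # signed distances d_theta(x)
        grad_loss = X.T @ (sigmoid(d) - y) / n
        cov = s_c @ d / n                 # empirical covariance proxy
        grad_cov = s_c @ X / n            # gradient of cov w.r.t. theta
        theta -= lr * (grad_loss + 2 * lam * cov * grad_cov)
    return theta

def dp_gap(theta):
    yhat = X @ theta >= 0                 # predict via the sign of d_theta(x)
    return abs(yhat[s == 1].mean() - yhat[s == 0].mean())

gap_plain = dp_gap(fit(lam=0.0))
gap_fair = dp_gap(fit(lam=100.0))         # penalty weight is illustrative
```

Driving the covariance to zero shrinks the demographic parity gap substantially, at the cost of ignoring the s-correlated feature; the hard-constrained formulations in the paper trade accuracy and fairness off more explicitly.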

Both of the above variations produce a fair classifier that approximately satisfies demographic parity. Similar to the demographic-parity case, Zafar-EO (the equalized-odds variant) introduces a proxy constraint for equalized odds. In particular, Zafar-EO proposes to use the covariance between s and a quantity g_θ(y, x) defined over the misclassified tuples, since this covariance is approximately zero when a classifier satisfies equalized odds. The covariance is computed as Cov(s, g_θ(y, x)) ≈ (1/N) Σ_i (s_i − s̄) g_θ(y_i, x_i), where g_θ(y, x) = y · d_θ(x) if the tuple is misclassified, and 0 otherwise. While this proxy is still not a convex function of θ, Zafar-EO efficiently computes classifier parameters that maximize prediction accuracy under this proxy constraint through a disciplined convex-concave program (Shen et al., 2016).
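The equalized-odds proxy and its covariance with s can be computed directly; the arrays below are random stand-ins for a trained classifier’s signed distances, not outputs of the actual method:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 1000
s = rng.integers(0, 2, n).astype(float)   # sensitive attribute
y = rng.choice([-1.0, 1.0], n)            # ground truth in {-1, +1}
d = rng.normal(size=n)                    # stand-in signed distances d_theta(x)

# g_theta(y, x) = y * d_theta(x) for misclassified tuples, 0 otherwise;
# a tuple is misclassified exactly when y and d_theta(x) disagree in sign.
g = np.where(y * d < 0, y * d, 0.0)

# Empirical covariance proxy for equalized odds.
cov_eo = np.mean((s - s.mean()) * g)
```

Since s and d are independent here, the covariance comes out near zero; a classifier whose errors concentrate in one group would produce a clearly nonzero value.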

Implementation. We collected the source code for Zafar from the authors’ public repository. We set all the hyper-parameters following the instructions specified within the source code (more details are in the authors’ repository).


A.2.2. Zha-Le

Zhang, Lemoine, and Mitchell (Zhang et al., 2018) propose an in-processing approach that can enforce demographic parity, equalized odds, or equal opportunity by leveraging adversarial learning, a technique where a classifier and an adversary with mutually competing goals are trained together. We refer to this approach as Zha-Le. Given the training data D, the goal of the classifier is to maximize the accuracy of the prediction Ŷ, while the adversary attempts to correctly predict the sensitive attribute S using Ŷ (and the ground truth Y). Zha-Le enforces the target notion of fairness by designing the classifier to converge to optimal parameters such that Ŷ does not contain any information about S that the adversary can exploit.

In order to determine the optimal parameters, the classifier minimizes a loss function L_C. The adversary receives both Ŷ and Y if equalized odds or equal opportunity is the target notion; if demographic parity is enforced, the adversary only has access to Ŷ. The loss of the adversary is denoted as L_A. Both the classifier and the adversary apply gradient-based optimizations (Bottou, 2010) to iteratively update their parameters. The adversary updates its parameters in a direction that minimizes L_A, while the classifier updates its parameters in a direction that both decreases L_C and increases L_A. This update process guarantees that the classifier converges to a solution where L_C is minimized while L_A is approximately equal to the entropy of S, i.e., the adversary gains no information about S from Ŷ (and Y). Hence, the optimal classifier satisfies the target fairness notion.
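A minimal numpy sketch of the alternating updates for the demographic-parity setting, using a fixed trade-off weight alpha in place of the projection term from Zhang et al.; all data and hyper-parameters are illustrative:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 4000

# Synthetic data: feature x1 is a proxy for the sensitive attribute s.
s = rng.integers(0, 2, n).astype(float)
x1 = s + rng.normal(scale=0.7, size=n)
x2 = rng.normal(size=n)
X = np.column_stack([np.ones(n), x1, x2])
y = (x1 + x2 + rng.normal(scale=0.5, size=n) > 0.5).astype(float)

def sigmoid(z):
    return 1 / (1 + np.exp(-np.clip(z, -30, 30)))

def train(alpha, steps=8000, lr=0.02, adv_lr=0.5):
    theta = np.zeros(3)   # classifier parameters
    w = np.zeros(2)       # adversary parameters over [1, p]
    for _ in range(steps):
        p = sigmoid(X @ theta)                # classifier output (soft Yhat)
        A = np.column_stack([np.ones(n), p])  # DP setting: adversary sees Yhat only
        q = sigmoid(A @ w)                    # adversary's prediction of s
        w -= adv_lr * A.T @ (q - s) / n       # adversary descends L_A
        grad_lc = X.T @ (p - y) / n           # gradient of classifier loss L_C
        grad_la = X.T @ ((q - s) * w[1] * p * (1 - p)) / n  # dL_A / dtheta
        theta -= lr * (grad_lc - alpha * grad_la)  # descend L_C, ascend L_A
    return theta

def dp_gap(theta):
    yhat = sigmoid(X @ theta) >= 0.5
    return abs(yhat[s == 1].mean() - yhat[s == 0].mean())

gap_plain = dp_gap(train(alpha=0.0))
gap_adv = dp_gap(train(alpha=2.0))    # trade-off weight alpha is illustrative
```

Keeping the adversary’s learning rate high relative to the classifier’s keeps it close to its best response, which is what makes ascending L_A push the predictions toward independence from s rather than mere anti-correlation.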

Implementation. We collected the source code for Zha-Le from the open source AI Fairness 360 library.


A.2.3. Kearns

Kearns et al. (Kearns et al., 2018) propose an in-processing approach that enforces demographic parity and predictive equality, a notion that requires equal false positive rates (FPR) for the privileged and the unprivileged groups. We refer to this approach as Kearns. Kearns approximately enforces the target fairness notion within a large set of subgroups (the number of subgroups must be bounded by the classifier’s VC dimension (Vapnik and Chervonenkis, 2015)) defined using one or more sensitive attributes (or user-specified attributes). To that end, Kearns solves a constrained optimization problem to obtain optimal classifier parameters such that, for every subgroup, the proportion of positive outcomes (demographic parity) or the FPR (predictive equality) is approximately equal to that of the population.
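The subgroup requirement can be illustrated with a simple auditing-style check that flags subgroups whose positive-prediction rate deviates from the population rate; Kearns et al. actually solve this via a learner/auditor zero-sum game, and the data and the eps threshold below are made up:

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(4)
n = 5000

# Two binary sensitive attributes define four subgroups (names illustrative).
race = rng.integers(0, 2, n)
gender = rng.integers(0, 2, n)
yhat = (rng.random(n) < 0.3 + 0.2 * race).astype(float)  # biased predictions

def subgroup_dp_violations(yhat, attrs, eps=0.05):
    """Return the subgroups (as attribute-value combinations) whose
    positive-prediction rate differs from the population rate by more
    than eps, together with the observed gap."""
    base = yhat.mean()
    names = list(attrs)
    violations = {}
    for combo in product(*[np.unique(attrs[k]) for k in names]):
        mask = np.ones(len(yhat), dtype=bool)
        for k, v in zip(names, combo):
            mask &= attrs[k] == v
        if mask.sum() == 0:
            continue  # skip empty subgroups
        gap = abs(yhat[mask].mean() - base)
        if gap > eps:
            violations[combo] = gap
    return violations

viol = subgroup_dp_violations(yhat, {"race": race, "gender": gender})
```

Because the predictions above are skewed by race, every race-defined subgroup deviates from the population rate; enumerating combinations like this scales exponentially in the number of attributes, which is why Kearns et al. treat auditing itself as a learning problem.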

Kearns begins by formulating the learning process and the constraint for the target fairness notion. Let h be a classifier learned over the training data D. Moreover, let G be the set of subgroups for which fairness must be ensured. Each