Ensuring Fairness under Prior Probability Shifts

05/06/2020 ∙ by Arpita Biswas, et al. ∙ Microsoft indian institute of science 0

In this paper, we study the problem of fair classification in the presence of prior probability shifts, where the training set distribution differs from the test set. This phenomenon can be observed in the yearly records of several real-world datasets, such as recidivism records and medical expenditure surveys. If unaccounted for, such shifts can cause the predictions of a classifier to become unfair towards specific population subgroups. While the fairness notion called Proportional Equality (PE) accounts for such shifts, a procedure to ensure PE-fairness was unknown. In this work, we propose a method, called CAPE, which provides a comprehensive solution to the aforementioned problem. CAPE makes novel use of prevalence estimation techniques, sampling and an ensemble of classifiers to ensure fair predictions under prior probability shifts. We introduce a metric, called prevalence difference (PD), which CAPE attempts to minimize in order to ensure PE-fairness. We theoretically establish that this metric exhibits several desirable properties. We evaluate the efficacy of CAPE via a thorough empirical evaluation on synthetic datasets. We also compare the performance of CAPE with several popular fair classifiers on real-world datasets like COMPAS (criminal risk assessment) and MEPS (medical expenditure panel survey). The results indicate that CAPE ensures PE-fair predictions, while performing well on other performance metrics.




1 Introduction

Machine learning techniques are being increasingly applied in making important societal decisions, such as criminal risk assessment, school admission, hiring, sanctioning of loans, etc. Given the impact and sensitivity of such predictions, there is warranted concern regarding implicit discriminatory traits exhibited by such techniques. Such discrimination may be detrimental for certain population subgroups with a specific race, gender, ethnicity, etc, and may even be illegal under certain circumstances [angwin2016machine]. These concerns have spurred vast research in the area of algorithmic fairness [corbett2018measure, dressel2018accuracy, chouldechova2018frontiers, friedler2018comparative, zhang2018achieving, berk2017fairness, kleinberg2018discrimination, barocas2016big, chouldechova2017fair, romei2014multidisciplinary]. Most of these papers aim to establish fairness notions for a group of individuals (differentiated by their race, gender, etc.), and are classified as group fairness notions.

A possible, and less studied, cause of unfairness in predictions involves distributional changes (or drift) between the training and test datasets. Disparities can be introduced when the sub-populations evolve differently over time [barocas2017fairness]. There are important real-world scenarios where a type of distributional change, called prior probability shift, occurs. Informally, a prior probability shift occurs when the fraction of positively labeled instances differs between the training and the test datasets (see Section 2.1 for a formal definition). A concrete example is the COMPAS dataset [url:compas], which contains demographic information and criminal history of defendants, and records whether they recommitted a crime within a certain period of time (positive labels are given to the re-offenders, while others have negative labels). We observe that, among the valid records screened in the year , the fractions of Caucasian and African-American re-offenders were and , respectively. However, in , these fractions were and , respectively. This indicates that the extent of prior probability shift differs among Caucasian and African-American defendants, between the records of and .

If such distributional changes are unaccounted for, a classifier may end up being unfair towards the population subgroups which exhibit prior probability shifts. For example, if the rate of recidivism among a particular sensitive group reduces drastically, then a classifier trained with a higher rate of recidivism can create extreme unfairness towards individuals of that sub-population. In this work, we address these concerns and propose a method to obtain fair predictions under prior probability shifts.

Related Work. A large body of work defines various group fairness notions and provides algorithms to mitigate unfairness. Among these, Proportional Equality (PE) [biswas2019fairness, hunter2000proportional] appears to be the most appropriate fairness notion for addressing prior probability shifts among population subgroups (see Section 2.2 for definition). However, the existing results stop short of providing a procedure to ensure PE-fair predictions. We address this concern by proposing an end-to-end solution.

Apart from PE, there are other group fairness notions, none of which addresses prior probability shifts explicitly, such as Disparate Impact [feldman2015certifying, zafar2017afairness, kamiran2012data, calders2009building], Statistical Parity [corbett2017algorithmic, kamishima2012fairness, zemel2013learning], Equalized Odds [hardt2016equality, kleinberg2017inherent, woodworth2017learning], and Disparate Mistreatment [zafar2017bfairness].

Unfortunately, these fairness constraints are often non-convex, thereby making the optimization problem (maximizing accuracy subject to fairness constraints) difficult to solve efficiently. Several papers provide convex surrogates of the non-convex constraints [goh2016satisfying, zafar2017afairness], find near-optimal, near-feasible solutions [cotter2018training, celis2018classification], or propose techniques to reduce the dependence of the predictions on group information [kamiran2012data, kamiran2012decision, pleiss2017fairness, zhang2018mitigating]. (Note that the group fairness notions require the test set to be of (statistically) significant size for fairness evaluation.) However, most of these solutions assume that the training and test datasets are independently and identically drawn from some common population distribution, and thus suffer in the presence of prior probability shifts (we provide empirical evidence in Section 4).

Our Contributions. To the best of our knowledge, we are the first to propose an end-to-end solution to ensure fair predictions in the presence of prior probability shifts.

  1. We design a system called CAPE (Combinatorial Algorithm for Proportional Equality) in Section 3.

  2. We introduce a metric called Prevalence Difference (PD), which CAPE attempts to minimize in order to ensure PE-fairness. We theoretically establish that the PD metric exhibits several desirable properties (Theorems 1, 2); in particular, we show that maximizing the accuracy of any subgroup is not at odds with minimizing PD. This metric also provides insights into why the predictions of CAPE are fair (Theorem 3). We discuss these in Sections 3.1 and 3.2.

  3. We perform a thorough evaluation of CAPE on synthetic and real-world datasets, and compare it with several other fair classifiers. In Section 4, we provide empirical evidence that CAPE provides PE-fair predictions, while performing well on other fairness metrics.

2 Background and Notations

In this paper, we focus on the binary classification problem, under prior probability shifts. Let be the prediction function, defined in some hypothesis space , where is the -dimensional feature space and is the label space. The goal of a classification problem is to learn the function which minimizes a target loss function, say, misclassification error (variables and denote feature vectors and labels). However, if these predictions are used for societal decision making, it becomes crucial to ensure lower misclassification error not only on average but also within each group defined by sensitive attribute values such as race, gender, ethnicity, etc. Dropping these sensitive attributes blindly from the dataset may not be enough to alleviate discrimination since some non-sensitive features can be closely correlated to the sensitive attributes [zliobaite2016using, corbett2018measure, hardt2016equality]. Hence, most existing solutions assume access to the sensitive attributes. In the presence of such a sensitive attribute with sub-populations, the goal is to learn satisfying certain group-fairness criteria (where denotes the set ). We use variable to denote group membership (one can encode multiple sensitive attributes into ). We assume that the training dataset is drawn from an unknown joint distribution over . The performance of the classifier is measured using a new set of data, referred to as the test dataset , by observing how accurate and fair the predictions are with respect to the true labels.

Next, we focus on an important phenomenon called prior probability shift, which may cause a learned classifier to be unfair in its predictions on a test dataset.

2.1 Prior Probability Shift

Prior probability shift [saerens2002adjusting, moreno2012unifying, kull2014patterns] occurs when the prior class-probability changes between the training and test sets, but the class-conditional probability remains unaltered. Such changes occur within a sub-population in many real-world scenarios: for a given group, the distribution of the features given the label remains constant, but the fraction of positive labels changes between the training and test datasets. If left unaccounted for, this may lead to unfair predictions [barocas2017fairness].

2.2 Proportional Equality

To address the fairness concern under prior probability shifts, a notion called proportional equality (PE) was formalized in  [biswas2019fairness]. A classifier is said to be PE-fair if it has low values for the following expression:

True prevalence is the fraction of population, from the group , labeled positive in the dataset .


Prediction prevalence is the fraction of population, from the group , predicted positive by the classifier for .
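As an illustration of how PE can be evaluated for two groups, the sketch below assumes a ratio form: the absolute gap between the ratio of the two groups' true prevalences and the ratio of their prediction prevalences. This form is an assumption made here, chosen because it is consistent with the prevalences and PE values reported in Tables 2 and 3; the function name is hypothetical.

```python
def proportional_equality(true_prev_a, true_prev_b, pred_prev_a, pred_prev_b):
    """Assumed PE form: |true_a / true_b - pred_a / pred_b|.

    All four arguments are prevalences (fractions of positives) in [0, 1].
    """
    if true_prev_b == 0 or pred_prev_b == 0:
        # The ratio is undefined when a denominator prevalence is zero.
        raise ValueError("PE ratio is undefined for a zero prevalence")
    return abs(true_prev_a / true_prev_b - pred_prev_a / pred_prev_b)


# True test prevalences for COMPAS (Table 2) and the prediction prevalences
# of the third COMPAS row of Table 3.
pe = proportional_equality(0.636, 0.706, 0.284, 0.542)
print(f"PE = {pe:.3f}")
```

Under this assumed form, the value printed for these inputs agrees with the PE entry of the same Table 3 row up to rounding of the reported prevalences.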


However, Biswas and Mukherjee [biswas2019fairness] do not provide any algorithm for ensuring PE-fair predictions. Any such algorithm must deal with the following key challenges:

  1. PE (for a small ) is a non-convex constraint. Thus, it is hard to directly optimize for accuracy subject to this constraint for all .

  2. The definition of PE uses true prevalences of the test datasets , which are unavailable to the classifier during the prediction phase. Thus, an algorithm needs to estimate these prevalences. Techniques from the quantification literature can be leveraged to address this concern, which we describe next.

2.3 Quantification Problem

Quantification learning (or prevalence estimation) is a supervised learning problem, introduced by Forman [forman2005counting]. It aims to predict an aggregated quantity for a set of instances. The goal is to learn a function, called a quantifier, that outputs an estimate of the true prevalence of a finite, non-empty and unlabeled test set. As highlighted by Forman, quantification is not a by-product of classification [gonzalez2017quantification]. In fact, unlike the assumptions made in classification, quantification techniques account for changes in prior probabilities within subgroups, while assuming that the class-conditional distributions remain the same over the training and test datasets. This allows quantifiers to perform better than naïve classify-and-count techniques, as demonstrated by Forman [forman2006quantifying].

Some commonly used algorithms to construct quantifiers are Adjusted Classify and Count (ACC) [forman2006quantifying], Scaled Probability Average (SPA) [bella2010quantification], and HDy [gonzalez2013class]. These algorithms can be used to estimate the prevalence of a group in the test set.

For ease of exposition, we describe a simple quantification technique, ACC. This method learns a binary classifier from the training set and estimates its true positive rate (tpr) and false positive rate (fpr) via k-fold cross-validation. Using this trained model, the algorithm counts the number of cases on which the classifier outputs positive on the test set. Finally, the true fraction of positives (true prevalence) is estimated via the equation $p = (p' - fpr) / (tpr - fpr)$, where $p'$ denotes the fraction of predicted positives. The use of tpr and fpr from the training set can be justified by the assumption that the class-conditional distribution remains the same in the training and test datasets. This simple algorithm turns out to provide good estimates of prevalences under prior probability shifts. However, for our experiments, we use SPA [bella2010quantification], which uses a probability estimator instead of a classifier, and turns out to be more robust to variations while estimating probabilities of a dataset with few samples.
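The ACC correction can be sketched in a few lines. This is a minimal illustration, not the implementation used in the experiments (which rely on SPA): the function name is hypothetical, the classifier outputs are assumed to be given as 0/1 values, and the rates are assumed to have been estimated beforehand on the training set.

```python
def adjusted_classify_and_count(predictions, tpr, fpr):
    """Estimate the true prevalence of positives in an unlabeled test set.

    predictions: 0/1 outputs of the trained classifier on the test set.
    tpr, fpr: true/false positive rates estimated on the training set.
    """
    if len(predictions) == 0:
        raise ValueError("the test set must be non-empty")
    if tpr <= fpr:
        raise ValueError("ACC requires tpr > fpr")
    observed = sum(predictions) / len(predictions)  # plain classify and count
    adjusted = (observed - fpr) / (tpr - fpr)       # correct for classifier errors
    return min(1.0, max(0.0, adjusted))             # clip to a valid fraction


# A test set with true prevalence 0.3 yields, in expectation,
# 0.3 * 0.8 + 0.7 * 0.1 = 0.31 predicted positives when tpr=0.8 and fpr=0.1.
predictions = [1] * 31 + [0] * 69
estimate = adjusted_classify_and_count(predictions, tpr=0.8, fpr=0.1)
print(f"estimated prevalence = {estimate:.3f}")
```

The clipping step is a common practical safeguard, added here as a design choice, because the raw correction can fall outside [0, 1] when the rate estimates are noisy.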

Next, we discuss CAPE, which provides a comprehensive solution to the above problems by combining quantification techniques with an ensemble of trained classifiers.

3 CAPE

In this section, we introduce CAPE (Combinatorial Algorithm for Proportional Equality), for ensuring PE-fair predictions. CAPE takes as input a training dataset and a vector of desired prevalences. CAPE is trained separately for each group, since we hypothesize that the relationship between the non-sensitive features and the outcome variable may differ across groups. Thus, each group would be best served by training classifiers on datasets obtained from the corresponding group. (Training a separate classifier for a small-sized subgroup may be inappropriate. For the datasets we consider, this issue never arises.) Such decoupled classifiers are also considered by Dwork et al. [dwork2017decoupled], but they do not handle prior probability shifts.

The training phase outputs, for each group, the following:

  1. a set of classifiers, each trained using a sample of the training dataset obtained by the sampling module, which takes as input a prevalence parameter and a training set with data points. It randomly selects, with replacement, instances with = and instances with =. Thus, it outputs a sample of size . Each classifier is thus specialized in providing accurate predictions on datasets with particular prevalences.

  2. a quantifier, generated by the quantification module, which is subsequently used in the prediction phase of CAPE to estimate the true prevalence of the test dataset. Separate quantifiers are created for each group since the extent of prior probability shifts may differ across groups.
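The sampling step described above can be sketched as follows. The exact sample sizes are not legible in this text, so the sketch assumes that, for a prevalence parameter p and a training set of n points, roughly p·n positives and (1 − p)·n negatives are drawn with replacement, giving a sample of size n. The function name and the (features, label) data layout are hypothetical.

```python
import random


def sample_with_prevalence(dataset, prevalence, rng=None):
    """Resample `dataset` so that a `prevalence` fraction of it is positive.

    dataset: list of (features, label) pairs with label in {0, 1}.
    Assumption: the output has the same size as the input.
    """
    rng = rng or random.Random(0)
    positives = [row for row in dataset if row[1] == 1]
    negatives = [row for row in dataset if row[1] == 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present in the training set")
    n = len(dataset)
    n_pos = round(prevalence * n)  # number of positive instances to draw
    n_neg = n - n_pos              # number of negative instances to draw
    # random.choices draws with replacement.
    return rng.choices(positives, k=n_pos) + rng.choices(negatives, k=n_neg)


# Ten training points with prevalence 0.5, resampled to prevalence 0.3.
dataset = [((i,), 1 if i < 5 else 0) for i in range(10)]
sample = sample_with_prevalence(dataset, 0.3, random.Random(1))
print(len(sample), sum(label for _, label in sample))
```

One classifier would then be trained on each such sample, one per entry of the prevalence vector.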

During the prediction phase, for each group, an estimate of the prevalence of the test data is obtained using the quantifier (learned in the training phase). This estimate is then used to choose the classifier that minimizes the prevalence difference metric (Section 3.1). Finally, CAPE outputs the predictions of the chosen classifier on the test set.

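Training Phase:

Input: Training dataset, the sampling, classification and quantification modules, and a vector of prevalence parameters.

   Step 1: Partition the training dataset based on the group values,
    giving one training subset for each group.
   Step 2: Create quantifiers, one for each group.
   Step 3: Create a set of classifiers, for each group.
   for all prevalence parameters in the vector do
      Sample the group's training subset at that prevalence and train a classifier on the sample.
   end for
   Output: The quantifier and the set of classifiers for each group.

Prediction Phase:

Input: Test dataset, and the quantifiers and classifiers obtained after the training phase.

   Step 1: Partition the test dataset based on the group values,
    giving one test subset for each group.
   Step 2: Estimate prevalences using the quantifiers built in the training phase.
   Step 3: Choose the best classifier in terms of estimated PD, for each group.
   for all classifiers of the group do
      Compute the estimated PD of the classifier on the group's test subset.
   end for
   Best classifier for the group: the one with the smallest estimated PD.
   Output: The predictions of the best classifier for each group.
Algorithm 1: The CAPE meta-algorithm.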
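Step 3 of the prediction phase can be illustrated for a single group as below. This is a sketch under stated assumptions: classifiers are modeled as plain Python callables returning 0 or 1, the prevalence estimate is assumed to come from a previously built quantifier, and all names are hypothetical.

```python
def choose_by_estimated_pd(classifiers, estimated_prevalence, test_features):
    """Return (index, predictions, estimated PD) of the best classifier.

    classifiers: callables mapping one feature vector to a 0/1 label.
    estimated_prevalence: the quantifier's estimate for this group's test set.
    """
    if not classifiers or not test_features:
        raise ValueError("need at least one classifier and one test instance")
    best = None
    for index, classify in enumerate(classifiers):
        predictions = [classify(x) for x in test_features]
        prediction_prevalence = sum(predictions) / len(predictions)
        # Estimated PD: the true prevalence is replaced by its estimate.
        pd_hat = abs(estimated_prevalence - prediction_prevalence)
        if best is None or pd_hat < best[2]:
            best = (index, predictions, pd_hat)
    return best


# Toy group: one numeric feature, three threshold classifiers.
features = [i / 10 for i in range(1, 10)]
classifiers = [lambda x, t=t: int(x > t) for t in (0.25, 0.5, 0.75)]
index, predictions, pd_hat = choose_by_estimated_pd(classifiers, 0.2, features)
print(index, predictions, round(pd_hat, 3))
```

No test labels are used anywhere in the selection, which is what makes the estimated PD usable at prediction time.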

Note that CAPE provides the flexibility to plug any classification and quantification algorithm into its classification and quantification modules. Key to CAPE is the prevalence difference metric, used in Step 3 of the prediction phase. We formalize the metric and discuss some of its properties in the next section.

3.1 Prevalence Difference Metric

We define the prevalence difference (PD) metric, for each group, as the absolute difference between the true prevalence and the prediction prevalence of the group's dataset (as defined in Equations 1 and 2, respectively). Hereafter, we drop the subscripts and superscripts on these quantities whenever we refer to the population in aggregate.

Note that the true prevalence of the test set cannot be used during the prediction phase. Thus, replacing the true prevalence with its estimate from the quantifier in the definition of PD provides a measure to choose the best classifier for each group. Also, unlike PD, other performance metrics like accuracy, FPR or FNR are not suitable for choosing the best classifier since these metrics require the true labels of the test datasets. We use the PD metric for: (1) choosing the best classifier in the prediction phase and (2) measuring the performance of the predictions, since a high PD value implies the inability to account for prior probability shift for the group.
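For auditing, PD can be computed per group once the true labels are available. The sketch below takes PD to be the absolute difference between the fraction of true positives and the fraction of predicted positives within a group; the function names and the list-based data layout are hypothetical.

```python
def prevalence(labels):
    """Fraction of positive (1) entries in a non-empty list of 0/1 labels."""
    if not labels:
        raise ValueError("labels must be non-empty")
    return sum(labels) / len(labels)


def prevalence_difference(true_labels, predicted_labels):
    """PD: |true prevalence - prediction prevalence| on one dataset."""
    return abs(prevalence(true_labels) - prevalence(predicted_labels))


def pd_by_group(true_labels, predicted_labels, groups):
    """PD computed separately for each value of the sensitive attribute."""
    result = {}
    for g in sorted(set(groups)):
        idx = [i for i, z in enumerate(groups) if z == g]
        result[g] = prevalence_difference(
            [true_labels[i] for i in idx],
            [predicted_labels[i] for i in idx],
        )
    return result


y_true = [1, 1, 0, 0, 1, 0, 0, 0]
y_pred = [1, 0, 0, 0, 1, 1, 1, 0]
groups = [0, 0, 0, 0, 1, 1, 1, 1]
print(pd_by_group(y_true, y_pred, groups))
```

In this toy example the classifier under-predicts positives for group 0 and over-predicts them for group 1, and PD reports the two effects separately rather than as a single parity gap.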

The PD metric is somewhat different from fairness metrics that aim to capture parity between two sub-populations. Such fairness metrics may often require sacrificing the performance on one group to maintain parity with the other group. PD, by contrast, treats the two groups separately, since each group may have gone through a different change of prior probabilities. A high PD indicates a high extent of harm caused by the predictions made for the group. Thus, to audit the impact of a classifier's predictions on a group, it is important to evaluate PD for that group, along with accuracy, FNR and FPR values within each group.

Next, we show that a perfect classifier (100% accurate) attains zero prevalence difference. Additionally, we show that a classifier with high accuracy on any sub-group also attains a very low PD for that subgroup. Empirically, we observe that low PD results in PE-fair predictions.

3.2 Theoretical Guarantees

We first show a simple result: a classifier whose predictions are exactly the ground truth also attains zero prevalence difference, thereby satisfying our proposed metric used for selecting the best classifier. Note that a perfect classifier may not satisfy fairness notions such as disparate impact and statistical parity.

Theorem 1.

A perfect classifier always exhibits zero prevalence difference.


Let us consider a perfect classifier whose predictions are equal to the ground truth, i.e., for every instance, the label predicted by the classifier equals the true label. Thus, for each group, the true prevalence is equal to the prediction prevalence, according to the definitions in Equations 1 and 2. Thus, the prevalence difference is zero. ∎

Theorem 2.

If the overall accuracy of a classifier is $1 - \epsilon$, where $\epsilon$ is a very small number, then the overall prevalence difference is $|FN - FP| / |D|$, where $FN$ and $FP$ denote the numbers of false negatives and false positives, respectively, in the test dataset $D$ with $|D|$ instances. This further implies that the prevalence difference is at most $\epsilon$.


Let denote the predictions of a classifier on a test dataset . Some other notations that we use for the proof, are:
( true positives).
( true negatives).
( false positives).
( false negatives).

Note that . Let and be the true and prediction prevalences. Then, the prevalence difference can be written as:


Let the accuracy of a classifier on a test dataset be () where . Then,


Without loss of generality, let us assume . Thus, Equation 4 can be written as:


Similarly, assuming we obtain


Combining Equation 3, 5 and 6, we get the following:


Thus, when accuracy is greater than $1 - \epsilon$, the prevalence difference is at most $\epsilon$. This completes the proof. ∎
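The relationship behind Theorem 2 can be checked numerically. The sketch below relies on two standard identities: the true prevalence equals (TP + FN)/n and the prediction prevalence equals (TP + FP)/n, so their absolute difference is |FN − FP|/n, which cannot exceed the error rate (FP + FN)/n. The function names are hypothetical.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, tn, fp, fn) for two equal-length lists of 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn


def pd_and_error(y_true, y_pred):
    """Return (PD from prevalences, |FN - FP| / n, error rate)."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    n = len(y_true)
    pd_value = abs((tp + fn) / n - (tp + fp) / n)  # |true - predicted prevalence|
    skew = abs(fn - fp) / n                        # closed form of the same quantity
    error = (fp + fn) / n                          # 1 - accuracy
    return pd_value, skew, error


y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
balanced = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]  # one FN and one FP cancel out
one_sided = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]  # two FNs, no FP
print(pd_and_error(y_true, balanced))
print(pd_and_error(y_true, one_sided))
```

The two toy predictors have the same accuracy but different PD: errors that cancel across classes leave the prevalence intact, while one-sided errors push PD up to the error rate.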

Note that Theorem 2 can also be used to guarantee that highly accurate predictions for a group imply a low PD value for that group. This leads to Corollary 2.1.

Corollary 2.1.

If the accuracy of a classifier for any sub-population is greater than $1 - \epsilon$, then the prevalence difference for that sub-population is at most $\epsilon$.

The following theorem gives insight into why CAPE works. In the subsequent discussion, we drop the dataset parameter from the prevalence and PD notations, since we refer to these values exclusively in the context of the test dataset.

Theorem 3.

Let where and . For a group , and test dataset , let the quantifier be such that , and the classifiers be such that for all , for small and . Then, for the best classifier

the following holds:


For the best classifier , the prevalence difference of a group can be upper bounded using triangle inequality:


Inequality (8) is implied by the assumption on the quantifier’s performance, i.e., . To provide an upper bound for , we pick such that

Since , it is at most away from one of the fractional values in . Therefore,


We use Inequality (9) to provide an upper bound to the expression , using case-by-case analysis.

Case : Assume . This leaves us with three possibilities for the value of :

  1. Assume . Then,

  2. Assume . Now, we bound the desired quantity using the value of . Note that since is the best classifier. Thus, either or .

    1. Assume . Then,

    2. Assume . Then,

  3. Assume . Now, we bound the desired quantity using the value of , and there can be three cases.

    1. Assume . Then,

    2. Assume . Then,

    3. Assume . Then,


Inequalities - establish the following upper bound when ,


Case : . An analysis analogous to Case gives the same inequality as (16). Combining Inequalities (8) and (16), we obtain the desired upper bound of on the quantity . ∎

4 Experimental Evaluation

We first evaluate CAPE on synthetically generated datasets. We then compare it with other fair classifiers on the real-world COMPAS [url:compas] and MEPS [url:meps] datasets, where we observe possible prior-probability shifts.

CAPE is open source, but the link is withheld for anonymity. The performance of CAPE on a wide range of fairness metrics, across all these datasets, reinforces our proposal that CAPE should be used for predictions under prior-probability shifts.

4.1 Datasets

Synthetic: We assume a generative model with features—sensitive attribute , and two additional attributes and —along with the label . We assume that the overall population distribution is generated as . We further consider equal representation of the two population subgroups, i.e., == for each . and are conditionally independent: , and the distributions are considered to be Gaussian () with the following parameters: =, =, =, and =.

We generate instances for the training dataset with equal label distribution, i.e., . However, while generating the test set, the prevalence parameters are different. We generated different types of test datasets, each obtained by varying the prevalences for both subgroups , such that .

COMPAS dataset contains demographic information and criminal history for pre-trial defendants in Broward County, Florida. The goal of learning is to predict whether an individual re-offends. We consider as labels and as the sensitive attribute (= denotes African-Americans, while = denotes Caucasians). We pre-processed the dataset to remove rows containing missing or invalid information. Our training dataset comprises records whose screening dates were in the year (of which are African-Americans), while the test dataset comprises records screened in the year (of which are African-Americans).

Figure 1: Accuracy of .
Figure 2: Prevalence Difference for .
Figure 3: Proportional Equality (PE).
Figure 4: Comparing accuracy, PD and PE metrics on synthetic test datasets with varying prevalences for group =. The prevalence for group = is fixed at . The reported results are averaged over iterations and the standard deviation is of the order .

MEPS comprises surveys carried out on individuals, health care professions, and employers in the United States. The feature measures the total number of trips involved in availing some sort of medical facility. The classification task involves predicting whether . We consider as the sensitive attribute (= denotes ‘Non-Whites’). The surveys for the year is our training set (with data points, of which are ‘Non-Whites’), and the surveys for is our test set (with data points, of which are ‘Non-Whites’).

4.2 Other Algorithms for Comparison

We compare CAPE against an accuracy-maximizing classifier, Max_Acc. It uses the same algorithm used by CAPE in its classification module. On the real-world datasets, we additionally compare CAPE with the following (in-, pre- and post-processing) fair algorithms, implemented in the IBM AI Fairness 360 [aif360-oct-2018] toolkit: Reweighing (Reweigh) [kamiran2012data], Adversarial Debiasing (AD) [zhang2018mitigating], and variants of Meta_fair [celis2018classification], Calibrated Equalized Odds Postprocessing (CEOP) [pleiss2017fairness], and Reject Option Classification (ROC) [kamiran2012decision]. These algorithms target fairness notions other than PE. We evaluate the extent to which these algorithms achieve PE fairness and compare how they perform on a set of other metrics (such as FPR-diff, FNR-diff, Accuracy-diff, and PD). While CAPE can handle multiple sensitive attributes, we choose one sensitive attribute for all the datasets to stay consistent with the implementation in the IBM AIF360 toolkit.

4.3 Parameters and Modules used for CAPE

  • Prevalences: We set .

  • Sampling module: As described in Section 3.

  • Quantification module: Scaled Probability Average [bella2010quantification].

  • Classification module: As the synthetically generated datasets are created using simple generative models, we use generalized logistic regression with regularization. For COMPAS and MEPS, we use a gradient boosted algorithm and -fold cross-validation for hyper-parameter tuning.

4.4 Results

Synthetic dataset: We evaluated CAPE with types of test datasets, each with for . The general trend we observe is that CAPE outperforms Max_Acc whenever there is a significant shift in prior probabilities. We report two interesting sets of results here.

First, we consider test datasets with , and ranging between and . Figure 4 summarizes our findings. Since CAPE accounts for prevalence changes, its accuracy for group = (Figure 1) is consistently higher than that of Max_Acc, except for the dataset with where the accuracies become nearly equal. The prevalence difference for = (Figure 2) is lower for CAPE whenever there is a prior probability shift (i.e., when ). In fact, for Max_Acc, remains across all the test datasets. Thus, for Max_Acc increases linearly as moves away from . Lastly, the predictions of CAPE consistently exhibit a lower value of PE (Figure 3) than those of Max_Acc. This highlights that the predictions of CAPE are fairer than those of the purely accuracy-maximizing Max_Acc.

Second, in Table 1, we report results for scenarios where both and significantly deviate from their corresponding prevalences in the training set. The results are representative of the general trend we observed in the other test datasets: CAPE outperforms Max_Acc on accuracy, PD and PE metrics.

     Test         Accuracy          PD                PE
Z    prevalence   CAPE    Max_Acc   CAPE    Max_Acc   CAPE    Max_Acc
0    0.1          0.940   0.880     0.009   0.094     0.050   0.104
1    0.1          0.930   0.855     0.016   0.110
0    0.2          0.894   0.855     0.017   0.084     0.012   0.140
1    0.8          0.909   0.877     0.006   0.074
0    0.9          0.929   0.851     0.012   0.120     0.003   0.028
1    0.9          0.940   0.879     0.006   0.097
Table 1: Accuracy, PD and PE values on the synthetic datasets when test set is such that , for both groups . Each PE value applies to the pair of rows (Z = 0 and Z = 1) of one test dataset.

Real-world datasets: For COMPAS, the true-prevalence columns of Table 2 highlight that the true prevalences of the training (year ) and test (year ) datasets are significantly different. This is indicative of a possible prior probability shift. The last column shows that the quantification module of CAPE makes a good estimate of the true prevalences of the test dataset.

              Training Data      Test Data          Test Data
Dataset   Z   True Prevalence    True Prevalence    Estimated Prevalence
COMPAS    0   0.327              0.636              0.592
          1   0.486              0.706              0.644
MEPS      0   0.253              0.253              0.273
          1   0.124              0.117              0.123
Table 2: The two true-prevalence columns show possible prior probability shifts in COMPAS and MEPS. The last column highlights the prevalence estimates obtained by the quantification module of CAPE on the test datasets.
Algorithms  FPR                  FNR                  Accuracy             Pred. prev.   PD            PE
            Z=0    Z=1    diff   Z=0    Z=1    diff   Z=0    Z=1    diff   Z=0    Z=1    Z=0    Z=1

COMPAS
-           0.461  0.380  0.081  0.302  0.275  0.027  0.640  0.694  0.054  0.612  0.623  0.024  0.083  0.082
-1          0.271  0.290  0.019  0.451  0.322  0.129  0.614  0.687  0.073  0.448  0.564  0.188  0.142  0.119
            0.132  0.259  0.127  0.629  0.340  0.289  0.552  0.684  0.132  0.284  0.542  0.352  0.163  0.376

            0.283  0.139  0.144  0.493  0.543  0.050  0.583  0.576  0.007  0.425  0.363  0.211  0.343  0.271

--sr        0.977  0.849  0.128  0.102  0.492  0.390  0.579  0.403  0.176  0.927  0.609  0.291  0.097  0.622
--fdr       0.965  0.901  0.064  0.162  0.356  0.194  0.545  0.483  0.062  0.884  0.719  0.248  0.013  0.329
            0.124  0.167  0.043  0.638  0.467  0.171  0.549  0.621  0.072  0.275  0.425  0.361  0.281  0.253

-fpr        0.066  1.000  0.934  0.722  0.000  0.722  0.517  0.706  0.189  0.201  1.000  0.435  0.294  0.699
-fnr        0.000  0.247  0.247  1.000  0.390  0.610  0.364  0.652  0.288  0.000  0.503  0.636  0.203  0.900
-weighted   0.000  0.194  0.194  1.000  0.405  0.495  0.364  0.657  0.292  0.000  0.477  0.636  0.229  0.900
-aod        0.004  0.019  0.015  0.978  0.900  0.078  0.377  0.360  0.017  0.016  0.076  0.620  0.630  0.879
-eod        0.019  0.046  0.027  0.911  0.782  0.129  0.414  0.434  0.020  0.064  0.167  0.572  0.539  0.517

MEPS
-           0.131  0.068  0.063  0.425  0.488  0.063  0.794  0.883  0.089  0.243  0.120  0.010  0.003  0.135
-1          0.175  0.087  0.088  0.347  0.423  0.076  0.781  0.874  0.093  0.296  0.144  0.043  0.027  0.049
            0.004  0.012  0.008  0.910  0.888  0.022  0.766  0.890  0.124  0.037  0.014  0.216  0.103  0.483

            0.276  0.242  0.034  0.250  0.226  0.024  0.731  0.760  0.029  0.396  0.305  0.143  0.188  0.862

--sr        0.322  0.213  0.109  0.210  0.243  0.033  0.706  0.783  0.077  0.440  0.277  0.187  0.160  0.572
--fdr       0.347  0.254  0.102  0.193  0.218  0.025  0.692  0.758  0.066  0.463  0.308  0.210  0.191  0.657
            0.062  0.051  0.011  0.644  0.569  0.075  0.791  0.889  0.098  0.136  0.095  0.117  0.022  0.728

-fpr        0.078  0.000  0.078  0.573  1.000  0.427  0.797  0.883  0.086  0.166  0.000  0.087  0.117
-fnr        0.034  0.022  0.012  0.803  0.704  0.102  0.771  0.899  0.128  0.075  0.054  0.178  0.063  0.771
-weighted   0.032  0.021  0.011  0.816  0.704  0.112  0.770  0.899  0.129  0.070  0.053  0.183  0.064  0.839
-spd        0.233  0.220  0.013  0.284  0.243  0.041  0.754  0.777  0.023  0.355  0.283  0.102  0.166  0.906
-aod        0.329  0.253  0.076  0.205  0.210  0.005  0.702  0.752  0.050  0.447  0.216  0.194  0.199  0.745
-eod        0.336  0.233  0.103  0.194  0.227  0.033  0.700  0.768  0.068  0.455  0.296  0.202  0.179  0.623
Table 3: Comparing CAPE with Max_Acc and other fair classifiers on the COMPAS and MEPS test datasets. "Pred. prev." denotes the prediction prevalence.

For MEPS, we observe a shift only for the group =, between the training set (surveys in the year ) and test set (surveys in ). Since the differences in prevalences are rather small, this dataset is of interest: it allows us to investigate the performance of CAPE when the extent of prior probability shift is small. Though the prevalences estimated by CAPE seem similar to those of the training set, the differences between the estimates and the true prevalences of the test datasets are only and , for = and = respectively, and the estimates are thus good.

Table 3 summarizes the results on COMPAS and MEPS datasets for the two CAPE variants, Max_Acc, and the other fair algorithms described in Section 4.2. Due to lack of space, we elaborate upon the results of the COMPAS dataset only.

- considers the whole test dataset during prediction, while - considers individual instances during prediction (similar to what the other algorithms do). We expect - to perform better than - since the module is expected to perform better for larger test datasets.

- outperforms on , and all the other fairness metrics (FPR-diff, FNR-diff, Accuracy-diff, and PE). The prediction prevalences of ( and ) are close to the true prevalences of the training set ( and ), which highlights the inability of to account for the prior probability shift. One critical observation about - is that FPR-diff= and FNR-diff= which implies that the predictions exhibit equalized odds. In comparison, these differences for are and . In fact, for , is almost twice than , whereas is almost half of . This implies that imposes unfair higher risks of recidivism on African-American defendants, while Caucasian defendants are predicted to have lower risks than they actually do.

The true prevalences of the two subgroups in the test dataset are close to each other (namely, and ). Thus, a classifier aiming to achieve statistical parity is expected to do well on PE. We observe this in Table 3, where -spd (statistical parity difference) has lowest PE (). However, its false positive rates are more than for both subgroups, which is unfair and harmful for both subgroups. This unfairness is also captured by the high value of -spd. We observe that is the lowest for - among all other classifiers. For , -fdr (false discovery rate with the group fairness trade-off parameter set to ), is the only other fair classifier with a lower value. However, the predictions of -fdr have high false positive rates, and low accuracies.

Note that a trivial classifier, which always predicts positive labels, will have FNR-diff=, FPR-diff=, Accuracy-diff=. However, this classifier will have high PD for both groups (= and =), which indicates a substantial skew between the false positives and false negatives. Thus, PD is an important metric that, in addition to accuracy, captures the learning ability of the classifiers.
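The behavior of the trivial classifier can be worked out directly. The sketch below assumes each group contains both positive and negative instances and uses the COMPAS test prevalences from Table 2 as inputs; the function name is hypothetical. Since every instance is predicted positive, the prediction prevalence is 1, so PD equals one minus the true prevalence.

```python
def always_positive_metrics(true_prevalence):
    """Metrics of a classifier that labels every instance positive.

    Assumes 0 < true_prevalence < 1, so both rates are well defined.
    """
    if not 0.0 < true_prevalence < 1.0:
        raise ValueError("true_prevalence must lie strictly between 0 and 1")
    return {
        "fpr": 1.0,                    # every negative is predicted positive
        "fnr": 0.0,                    # no positive is missed
        "accuracy": true_prevalence,   # only the positives are correct
        "pd": 1.0 - true_prevalence,   # prediction prevalence is 1
    }


# True test prevalences of the two COMPAS groups (Table 2).
group_0 = always_positive_metrics(0.636)
group_1 = always_positive_metrics(0.706)
fpr_diff = abs(group_0["fpr"] - group_1["fpr"])
fnr_diff = abs(group_0["fnr"] - group_1["fnr"])
print(fpr_diff, fnr_diff, round(group_0["pd"], 3), round(group_1["pd"], 3))
```

The parity gaps on FPR and FNR vanish by construction, while PD stays large for both groups, which is the contrast drawn in the paragraph above.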

We make a final observation on our experimental results. Since both COMPAS and MEPS are real-world datasets, the distributional changes highlighted in Table 2 may not be due to prior probability shifts alone. Although CAPE is designed to handle only prior probability shifts, the good performance of both CAPE variants on a wide range of metrics for these real-world datasets shows the robustness of our approach.

A possible extension of CAPE includes handling other distributional changes, such as concept drifts, that is, when changes but remains the same.

4.5 Acknowledgment

Arpita Biswas gratefully acknowledges the support of a Google PhD Fellowship Award.