1 Introduction
Predictive risk scores support decision-making in high-stakes settings such as bail and sentencing in the criminal justice system, triage and preventive care in healthcare, and lending decisions in the credit industry [36, 2]. In these areas, where predictive errors can significantly impact the individuals involved, studies of fairness in machine learning have analyzed the possible disparate impact introduced by predictive risk scores primarily in a binary classification setting: predictions determine whether or not someone is detained pretrial, is admitted into critical care, or is extended a loan. But the "human in the loop" with risk assessment tools often has recourse to make decisions about the extent, intensity, or prioritization of resources. That is, in practice, predictive risk scores are used to provide informative rank-orderings of individuals with binary outcomes in the following settings:


In criminal justice, the "risk-needs-responsivity" model emphasizes matching the level of social service interventions to the specific individual's risk of reoffending [5, 3]. Cowgill [16] finds quasi-experimental evidence of judges increasing bail amounts for marginal candidates at low/medium risk thresholds, suggesting that judges informed by the COMPAS risk scores vary the intensity of bail.

In credit, predictions of default risk affect not only loan acceptance/rejection decisions, but also the risk-based setting of interest rates. Fuster et al. [22] embed machine-learned credit scores in an economic pricing model which suggests negative economic welfare impacts on Black and Hispanic borrowers.

In municipal services, predictive analytics tools have been used to direct resources for maintenance, repair, or inspection by prioritizing units via a (bipartite) ranking by risk of failure or contamination [38, 10]. Proposals to use new data sources such as 311 data, which incur the self-selection bias of citizen complaints, may introduce inequities in resource allocation [30].
We describe how the problem of bipartite ranking, that of finding a good ranking function which ranks positively labeled examples above negative ones, better encapsulates how predictive risk scores are used in practice to rank individual units, and how a new metric we propose, the xAUC, can assess ranking disparities.
Most previous work on fairness in machine learning has emphasized disparate impact in terms of confusion-matrix metrics, such as true positive rates and false positive rates, and other desiderata, such as probability calibration of risk scores. Due in part to inherent trade-offs between these performance criteria, some have recommended retaining unadjusted risk scores that achieve good calibration, rather than adjusting for parity across groups, in order to retain as much information as possible and allow human experts to make the final decision [12, 9, 27, 13]. At the same time, unlike other regression-based settings, group-level discrepancies in the prediction loss of risk scores, relative to the true Bayes-optimal score, are not observable, since only binary outcomes are observed.

Our bipartite-ranking-based perspective illuminates a gap between the differing arguments made by ProPublica and Equivant (then Northpointe) regarding the potential bias or disparate impact of the COMPAS recidivism tool. Equivant claims fairness of the risk scores due to calibration, within-group AUC parity ("accuracy equity"), and predictive parity, in response to ProPublica's allegations of bias due to true positive rate/false positive rate disparities for the Low/Not Low risk labels [2, 18]. Our metrics shed light on assessing the across-group accuracy inequities and ranking-error disparities that may be introduced by a potential risk score.
In this paper, we propose and study the cross-ROC curve and the corresponding xAUC metric for assessing disparities induced by a predictive risk score, as such scores are used in broader contexts to inform resource allocation. We relate the xAUC metric to different group- and outcome-based decompositions of a bipartite ranking loss, which emphasizes the importance of assessing cross-group accuracy equity beyond just within-group accuracy equity, and assess the resulting metrics on datasets where fairness has been of concern.
2 Related Work
Our analysis of fairness properties of risk scores in this work is most closely related to the study of “disparate impact” in machine learning, which focuses on disparities in the outcomes of a process across protected classes, without racial animus [4]. Many previous approaches have considered formalizations of disparate impact in a binary classification setting [33, 41, 25]. Focus on disparate impact differs from the study of questions of “individual fairness”, which emphasizes the “similar treatment of similar individuals” [19].
Previously considered methods for fairness in binary classification assess group-level disparities in confusion-matrix-based metrics. Proposals for error rate balance assess or try to equalize true positive rates and/or false positive rates (error rates measured conditional on the true outcome), emphasizing the equitable treatment of those who actually are of the outcome type of interest [25, 41]. Alternatively, one might assess the negative/positive predictive values (NPV/PPV), error rates conditional on the thresholded model prediction [11].
The predominant criterion used for assessing fairness of risk scores, outside of a binary classification setting, is that of calibration. Group-wise calibration requires that

P(Y = 1 | R = r, A = a) = r, for all scores r and groups a,

as in [11]. The impossibility of simultaneously satisfying notions of error rate balance and calibration has been discussed in [29, 11]. Liu et al. [31] show that group calibration is a byproduct of unconstrained empirical risk minimization, and is therefore not a restrictive notion of fairness. Hebert-Johnson et al. [26] note the critique that group calibration does not restrict the variance of a risk score as an unbiased estimator of the Bayes-optimal score.
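To make the criterion concrete, group-wise calibration can be probed empirically by binning scores and comparing each bin's mean outcome to its mean score, separately per group. A minimal sketch on simulated data (hypothetical; `calibration_gap` is an illustrative name, and the simulated scores are calibrated by construction because outcomes are drawn with probability equal to the score):

```python
import random

def calibration_gap(scores, outcomes, n_bins=10):
    """Average |mean(Y) - mean(R)| over nonempty score bins: near 0 when calibrated."""
    bins = [[] for _ in range(n_bins)]
    for r, y in zip(scores, outcomes):
        bins[min(int(r * n_bins), n_bins - 1)].append((r, y))
    gaps = [abs(sum(y for _, y in b) / len(b) - sum(r for r, _ in b) / len(b))
            for b in bins if b]
    return sum(gaps) / len(gaps)

random.seed(7)
# simulate one group whose outcome probability equals its score
scores = [random.random() for _ in range(20000)]
outcomes = [float(random.random() < r) for r in scores]
assert calibration_gap(scores, outcomes) < 0.03

# a shifted score breaks calibration for the same outcomes
shifted = [min(r + 0.3, 1.0) for r in scores]
assert calibration_gap(shifted, outcomes) > 0.1
```

In practice, one would run such a check separately for each level of the sensitive attribute to assess group-wise calibration.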
Other work has considered fairness in ranking settings specifically, with particular attention to applications in information retrieval, such as questions of fair representation in search-engine results. Yang and Stoyanovich [40] assess statistical parity at discrete cut points of a ranking, incorporating position bias inspired by normalized discounted cumulative gain (nDCG) metrics. Celis et al. [7] consider fairness in rankings as constraints on the diversity of group membership in the top k rankings, for any choice of k. Singh and Joachims [39] consider fairness of exposure in rankings under known relevance scores and propose an algorithmic framework producing probabilistic rankings that satisfy fairness constraints on exposure in expectation, under a position-bias model. We focus instead on the bipartite ranking setting, where the area under the curve (AUC) loss emphasizes ranking quality on the entire distribution, whereas other ranking metrics such as nDCG or top-k metrics emphasize only a portion of it.
The problem of bipartite ranking is related to, but distinct from, binary classification [20, 34, 1]. While the bipartite ranking induced by the Bayes-optimal score is analogously Bayes-risk optimal for bipartite ranking (see, e.g., [32]), in general a probability-calibrated classifier does not optimize the bipartite ranking loss. Cortes and Mohri [15] observe that the AUC may vary widely for the same error rate, and that algorithms designed to globally optimize the AUC outperform those optimizing surrogates of the AUC or the error rate. Narasimhan and Agarwal [35] study regret transfer bounds between the related problems of binary classification, bipartite ranking, and outcome-probability estimation.
3 Problem Setup and Notation
We suppose we have data on features X, sensitive attribute A, and binary labeled outcome Y ∈ {0, 1}. We are interested in assessing the downstream impacts of a predictive risk score ρ(X), which may or may not access the sensitive attribute. These risk scores may represent an estimated conditional probability of a positive label, ρ(x) ≈ P(Y = 1 | X = x). For brevity, we also let R denote the random variable corresponding to an individual's risk score. We generally use the conventions that Y = 1 is associated with opportunity or benefit for the individual (e.g., freedom from suspicion of recidivism, creditworthiness) and that, when discussing two groups a and b, group a is a historically disadvantaged group.

Let the conditional cumulative distribution function of the learned score, evaluated at a threshold t given label y and attribute a, be denoted by

F_{y,a}(t) = P(R ≤ t | Y = y, A = a).

We let F̄_{y,a} = 1 − F_{y,a} denote the complement of F_{y,a}. We drop the subscript a to refer to the whole population: F_y(t) = P(R ≤ t | Y = y). Thresholding the score yields a binary classifier, Ŷ = I[R > t]. The classifier's true negative rate (TNR) is F_{0,a}(t), its false positive rate (FPR) is F̄_{0,a}(t), its false negative rate (FNR) is F_{1,a}(t), and its true positive rate (TPR) is F̄_{1,a}(t). Given a risk score, the choice of optimal threshold for a binary classifier depends on the differing costs of false positives and false negatives. We might expect these cost ratios to differ if we consider the use of risk scores to direct punitive measures versus interventional resources.
In the setting of bipartite ranking, the data comprise a pool of m positively labeled examples drawn i.i.d. from a distribution P_1, and n negatively labeled examples drawn i.i.d. from a distribution P_0 [34]. The true rank order is determined by a preference function π(x, x′) which describes the preference ordering of pairs of instances (x, x′): π(x, x′) = 1 if x is preferred to x′, π(x, x′) = 0 if x′ is preferred to x, and π(x, x′) = 1/2 if x and x′ have the same ranking; this can be extended to a stochastic setting. The bipartite ranking problem is the specialization where the goal is to find a score function that ranks positive examples above negative examples with good generalization error, corresponding to the true preference function given by π(x, x′) = 1 if y > y′, π(x, x′) = 0 if y < y′, and π(x, x′) = 1/2 if y = y′.

The rank order may be determined by a score function ρ, which incurs the empirical bipartite ranking error

L̂(ρ) = (1/(mn)) Σ_i Σ_j ( I[ρ(x_i^1) < ρ(x_j^0)] + (1/2) I[ρ(x_i^1) = ρ(x_j^0)] ),

where x_i^1 ranges over the positive examples and x_j^0 over the negative examples.
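The empirical bipartite ranking error is the complement of the empirical AUC reward discussed next; a minimal sketch with hypothetical simulated scores confirms the identity directly:

```python
import random

def empirical_auc(pos, neg):
    """Fraction of correctly ordered positive-negative score pairs (ties as 1/2)."""
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def ranking_error(pos, neg):
    """Fraction of misordered positive-negative score pairs (ties as 1/2)."""
    losses = sum((p < n) + 0.5 * (p == n) for p in pos for n in neg)
    return losses / (len(pos) * len(neg))

random.seed(8)
pos = [random.gauss(0.6, 0.2) for _ in range(300)]  # scores of positive examples
neg = [random.gauss(0.4, 0.2) for _ in range(300)]  # scores of negative examples

# the bipartite ranking loss is one minus the AUC reward
assert abs(ranking_error(pos, neg) + empirical_auc(pos, neg) - 1.0) < 1e-9
```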
The area under the receiver operating characteristic (ROC) curve (AUC), a common (reward) objective for bipartite ranking, is often used as a metric describing the quality of a predictive score, independently of the final threshold used to implement a classifier, and is invariant to different base rates of the outcomes. The ROC curve plots the FPR F̄_0(t) on the x-axis against the TPR F̄_1(t) on the y-axis as t varies over the space of decision thresholds. The AUC is the area under this curve,

AUC = ∫_0^1 F̄_1(F̄_0^{-1}(u)) du.

An AUC of 1/2 corresponds to a completely random classifier; therefore, the difference from 1/2 serves as a metric for the diagnostic quality of a predictive score. We recall the probabilistic interpretation of the AUC: it is the probability that a randomly drawn example from the positive class is correctly ranked by the score above a randomly drawn example from the negative class [24].
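As a numerical check of this interpretation, the sketch below compares the trapezoidal area under an empirical ROC curve with the fraction of correctly ranked positive-negative pairs (hypothetical simulated scores; function names are illustrative):

```python
import random

def empirical_auc_pairwise(pos, neg):
    """AUC as P(R^1 > R^0): the fraction of correctly ranked pairs (ties as 1/2)."""
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def empirical_auc_roc(pos, neg):
    """AUC as the trapezoidal area under the empirical ROC curve."""
    pts = [(0.0, 0.0)]
    for t in sorted(set(pos + neg), reverse=True):  # sweep thresholds high to low
        fpr = sum(n >= t for n in neg) / len(neg)
        tpr = sum(p >= t for p in pos) / len(pos)
        pts.append((fpr, tpr))
    pts.append((1.0, 1.0))
    return sum((x2 - x1) * (y1 + y2) / 2 for (x1, y1), (x2, y2) in zip(pts, pts[1:]))

random.seed(0)
pos = [random.gauss(1.0, 1.0) for _ in range(400)]  # scores of Y = 1 examples
neg = [random.gauss(0.0, 1.0) for _ in range(400)]  # scores of Y = 0 examples
assert abs(empirical_auc_pairwise(pos, neg) - empirical_auc_roc(pos, neg)) < 1e-9
```

The agreement of the two computations is exactly the content of the probabilistic interpretation below.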
Lemma 1 (Probabilistic interpretation of the AUC).
Let R^1 be drawn from the score distribution conditional on Y = 1 and R^0 from the score distribution conditional on Y = 0, independently. Then AUC = P(R^1 > R^0).
It has been noted that the sample estimate of the AUC coincides with the Mann-Whitney U-statistic, and is therefore also closely related to the nonparametric Wilcoxon rank-sum test for differences between distributions [14, 24]. The corresponding null hypothesis, AUC = 1/2, is a necessary condition of distributional equivalence, since P(R^1 > R^0) = 1/2 when R^1 and R^0 are identically distributed (and continuous). Other representations of the AUC are studied in Menon and Williamson [32]; we focus on its use as an accuracy metric for bipartite ranking.

4 The Cross-ROC and Cross-Area Under the Curve (xAUC) Metric
We introduce the cross-ROC curve and the cross-area under the curve (xAUC) metric, which summarize group-level disparities in the misranking errors induced by a score function ρ.
Definition 1 (Cross-Receiver Operating Characteristic curve (xROC)).
The xROC_{ab} curve parametrically plots (F̄_{0,b}(t), F̄_{1,a}(t)) over the space of thresholds t, generating the curve of the TPR of group a on the y-axis vs. the FPR of group b on the x-axis. We define the xAUC(a, b) metric as the area under the xROC_{ab} curve.
Definition 2 (xAUC).

xAUC(a, b) = ∫_0^1 F̄_{1,a}(F̄_{0,b}^{-1}(u)) du.

Analogous to the usual AUC, we provide a probabilistic interpretation of the xAUC metric as the probability of correctly ranking a positive instance of group a above a negative instance of group b under the corresponding group- and outcome-conditional distributions of the score.
Lemma 2 (Probabilistic interpretation of the xAUC).

xAUC(a, b) = P(R_a^1 > R_b^0),

where R_a^1 is drawn from the score distribution conditional on (Y = 1, A = a) and R_b^0 from the distribution conditional on (Y = 0, A = b), independently. For brevity, henceforth, R_g^y is taken to be drawn from the score distribution conditional on (Y = y, A = g), independently of any other such variable. We also drop the group subscript to denote omitting the conditioning on the sensitive attribute (e.g., R^1 is drawn conditional only on Y = 1).
The xAUC accuracy metrics for a binary sensitive attribute measure the probability that a randomly chosen unit from the positive class of group a is ranked higher than a randomly chosen unit from the negative class of group b, under the corresponding group- and outcome-conditional distributions of scores. We let AUC_a denote the within-group AUC for group a, AUC_a = P(R_a^1 > R_a^0).
If the difference between these metrics, the disparity

ΔxAUC = xAUC(a, b) − xAUC(b, a),

is substantial and negative, then we might consider group b to be "advantaged" in some sense when Y = 1 is a positive label or outcome or is associated with greater beneficial resources, and "disadvantaged" if Y = 1 is a negative or harmful label or is associated with punitive measures. When higher scores are associated with opportunity or additional benefits and resources, group b either gains by correctly having its deserving members ranked above the non-deserving members of group a, or by having its non-deserving members incorrectly ranked above the deserving members of group a. The magnitude of the disparity ΔxAUC describes the misranking disparities incurred under this predictive score, while the magnitude of each xAUC measures that particular across-subgroup rank accuracy.

Computing the xAUC is simple: one simply computes the sample statistic

x̂AUC(a, b) = (1 / (n_a^1 n_b^0)) Σ_{i : Y_i = 1, A_i = a} Σ_{j : Y_j = 0, A_j = b} I[R_i > R_j],

where n_a^1 and n_b^0 count the positive examples in group a and the negative examples in group b. Algorithmic routines for computing the AUC quickly by sorting can be used directly to compute the xAUCs.
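A minimal sketch of both routes to the sample statistic, on hypothetical scores (`xauc_pairs` and `xauc_sorted` are illustrative names): the naive computation enumerates all pairs, while the sorted version counts, for each positive score, how many negatives fall below it by binary search.

```python
import random
from bisect import bisect_left, bisect_right

def xauc_pairs(scores_pos_a, scores_neg_b):
    """xAUC(a, b) = P(R_a^1 > R_b^0) over all pairs, ties counted as 1/2."""
    total = sum((p > n) + 0.5 * (p == n)
                for p in scores_pos_a for n in scores_neg_b)
    return total / (len(scores_pos_a) * len(scores_neg_b))

def xauc_sorted(scores_pos_a, scores_neg_b):
    """Same statistic via binary search on the sorted negatives."""
    neg = sorted(scores_neg_b)
    total = 0.0
    for p in scores_pos_a:
        lo, hi = bisect_left(neg, p), bisect_right(neg, p)
        total += lo + 0.5 * (hi - lo)  # negatives strictly below p, plus half the ties
    return total / (len(scores_pos_a) * len(neg))

random.seed(1)
pos_a = [random.gauss(0.6, 0.2) for _ in range(300)]  # group a, Y = 1
neg_b = [random.gauss(0.4, 0.2) for _ in range(400)]  # group b, Y = 0
assert abs(xauc_pairs(pos_a, neg_b) - xauc_sorted(pos_a, neg_b)) < 1e-9
```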

Variants of the xAUC metric
We can decompose the AUC differently and assess different variants of the xAUC:
Definition 3 (Balanced xAUC).

xAUC^1(a) = P(R_a^1 > R^0),    xAUC^0(a) = P(R^1 > R_a^0).

These xAUC disparities compare the misranking error faced by individuals from either group, conditional on a specific outcome: xAUC^0 compares the ranking accuracy faced by those of the negative class (Y = 0) across groups, and xAUC^1 analogously compares those of the positive class (Y = 1).
xAUC metrics as decompositions of AUC
The following proposition shows how the population AUC decomposes as weighted combinations of the xAUC and within-group AUCs, or via the balanced decompositions xAUC^1 and xAUC^0, weighted by the outcome-conditional group probabilities p_g^y = P(A = g | Y = y).
Proposition 3.

AUC = p_a^1 p_a^0 AUC_a + p_b^1 p_b^0 AUC_b + p_a^1 p_b^0 xAUC(a, b) + p_b^1 p_a^0 xAUC(b, a)
    = p_a^1 xAUC^1(a) + p_b^1 xAUC^1(b)
    = p_a^0 xAUC^0(a) + p_b^0 xAUC^0(b).
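These decompositions can be checked numerically on simulated scores: the population AUC equals both the group-weighted combination of within- and cross-group xAUCs and the balanced decomposition over the positive class. A sketch with hypothetical data (the weights are empirical outcome-conditional group frequencies):

```python
import random

def xauc(pos, neg):
    """P(score of a positive > score of a negative), ties counted as 1/2."""
    total = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return total / (len(pos) * len(neg))

random.seed(2)
# hypothetical scores indexed by (group, outcome)
s = {('a', 1): [random.gauss(0.55, 0.20) for _ in range(150)],
     ('a', 0): [random.gauss(0.45, 0.20) for _ in range(250)],
     ('b', 1): [random.gauss(0.70, 0.20) for _ in range(200)],
     ('b', 0): [random.gauss(0.35, 0.20) for _ in range(100)]}

pos = s[('a', 1)] + s[('b', 1)]
neg = s[('a', 0)] + s[('b', 0)]
p1 = {g: len(s[(g, 1)]) / len(pos) for g in 'ab'}  # P(A = g | Y = 1)
p0 = {g: len(s[(g, 0)]) / len(neg) for g in 'ab'}  # P(A = g | Y = 0)

auc = xauc(pos, neg)
# weighted combination of within-group AUCs and cross-group xAUCs
decomp = sum(p1[g] * p0[h] * xauc(s[(g, 1)], s[(h, 0)]) for g in 'ab' for h in 'ab')
# balanced decomposition over the positive class
balanced = sum(p1[g] * xauc(s[(g, 1)], neg) for g in 'ab')
assert abs(auc - decomp) < 1e-9 and abs(auc - balanced) < 1e-9
```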
[Table 1: within-group AUCs, xAUC(a, b), xAUC(b, a), and Brier scores for logistic regression and calibrated RankBoost (RankBoost cal.) on the COMPAS, Framingham, German, and Adult datasets.]
We include standard errors in Table 3 of the appendix.

5 Assessing xAUC
5.1 COMPAS Example
In Fig. 1, we revisit the COMPAS data and assess our xROC curves and xAUC metrics to illustrate ranking disparities that may be induced by risk scores learned from this data. The sensitive attribute in the COMPAS dataset is race, with A = a for black defendants and A = b for white defendants. We define the outcome Y = 1 for non-recidivism within 2 years and Y = 0 for recidivism. Covariates include information on the number of prior arrests and age; we follow the preprocessing of Friedler et al. [21].
We first train a logistic regression model on the original covariate data (we do not use the decile scores directly, in order to perform a more fine-grained analysis), using a 70%/30% train-test split and evaluating metrics on the out-of-sample test set. In Table 1, we report the group-level AUCs, the Brier scores [6] (summarizing calibration), and our xAUC metrics. The Brier score for a probabilistic prediction of a binary outcome is the mean squared error, (1/n) Σ_i (R_i − Y_i)^2. The score is overall well-calibrated (as well as calibrated by group), consistent with analyses elsewhere [11, 18].

We also report the metrics obtained by using a bipartite ranking algorithm, Bipartite RankBoost of Freund et al. [20], and calibrating the resulting ranking score by Platt scaling, displaying the results as "RankBoost cal." We observe essentially similar performance across these metrics, suggesting that the behavior of the xAUC disparities is independent of model specification or complexity, and that methods which directly optimize the population ranking error may still incur these group-level error disparities.
In Fig. 1, we plot the ROC curves and our xROC curves, displaying the averaged ROC curve (interpolated to a fine grid of FPR values) over 50 sampled train-test splits, with 1 standard error shaded in gray. We include standard errors in Table 3 of the appendix. While a simple within-group AUC comparison suggests that the score is overall more accurate for blacks (in fact, the AUC is slightly higher for the black population), computing our xROC curves and xAUC metrics shows that blacks would be disadvantaged by misranking errors. The cross-group accuracy xAUC(a, b) is significantly lower than xAUC(b, a): black innocents are nearly indistinguishable from actually guilty whites. This gap is precisely the cross-group accuracy inequity that simply comparing within-group AUCs does not capture. When we plot kernel density estimates of the score distributions in Fig. 1 from a representative train-test split, we see that indeed the distribution of scores for black innocents has significant overlap with the distribution of scores for guilty whites.

Assessing balanced xROC:
In Fig. 2, we compare the xROC^0 curves with the xROC^1 curves for the COMPAS data. The relative magnitudes of the balanced disparities provide insight into whether the burden of the disparity falls on those who are innocent or guilty. Here, since the xAUC^0 disparity is larger in absolute terms, it seems that the misranking errors result in inordinate benefit of the doubt in errors distinguishing risky whites (Y = 0) from innocent individuals, rather than disparities arising from distinguishing innocent members of either group from generally guilty individuals.
5.2 Assessing xAUC on other datasets
Additionally, in Fig. 3 and Table 1, we evaluate these metrics on multiple datasets where fairness may be of concern, including risk scores learned on the Framingham study, the German credit dataset, and the Adult income prediction dataset (we use logistic regression as well as calibrated bipartite RankBoost) [framingham99, 17]. For the Framingham dataset (cardiac risk scores), the sensitive attribute is gender; Y = 1 denotes 10-year coronary heart disease (CHD) incidence. Fairness considerations might arise if predictions of likelier mortality are associated with greater resources for preventive care or triage. For the German credit dataset, the sensitive attribute is age, dichotomized into younger and older applicants. Creditworthiness (non-default) is denoted by Y = 1, and default by Y = 0. For the Adult income dataset, the sensitive attribute is race, with A = a for black and A = b for white; we use the dichotomized outcome Y = 1 for high income (above $50K) and Y = 0 for low income.
Overall, Fig. 3 shows that these disparities persist across datasets, though they are largest for the COMPAS and Adult datasets. For the Adult dataset, this disparity could result in the misranking of low-income whites above high-income blacks, which could be interpreted as a possibly inequitable withholding of economic opportunity from actually-high-income blacks. The additional datasets also display different phenomena regarding the score distributions and xAUC comparisons, which we include in Fig. 6 of the Appendix.
6 Properties of the xAUC metric
We proceed to characterize the xAUC metric and its interpretation as a measure of cross-group ranking accuracy. For a perfect classifier with AUC = 1, the xAUC metrics are also 1. For a classifier that classifies completely at random, achieving AUC = 1/2, the xAUCs are also 1/2. Notably, the xROC curve implicitly compares the performance of thresholds that are the same across levels of the sensitive attribute, a restriction which tends to hold in applications under legal constraints regulating disparate treatment.
For the sake of example, suppose the risk scores are normally distributed within each group and outcome condition; we can then re-express the xAUC in terms of the CDF of the convolution of the score distributions. For R_a^1 ∼ N(μ_a^1, (σ_a^1)^2) and R_b^0 ∼ N(μ_b^0, (σ_b^0)^2), drawn independently,

xAUC(a, b) = P(R_a^1 > R_b^0) = Φ( (μ_a^1 − μ_b^0) / sqrt((σ_a^1)^2 + (σ_b^0)^2) ).

We might expect μ_a^1 > μ_b^0, such that xAUC(a, b) > 1/2. For a fixed mean discrepancy between the advantaged-guilty and the disadvantaged-innocent (as in the COMPAS example), a decrease in variance increases the xAUC. For fixed variances, an increase in the mean discrepancy increases the xAUC. Thus, scores which are poor estimates of the Bayes-optimal score (and therefore more uncertain) lead to lower xAUCs, and differences across groups in the uncertainty of these scores lead to xAUC disparities.
Note that the xAUC metric compares probabilities of misranking errors conditional on drawing instances from the group- and outcome-conditional distributions. When base rates differ, interpreting this disparity as normatively problematic implicitly assumes equipoise: we want random individuals drawn with equal probability from the white-innocent and black-innocent populations to face similar misranking risks, rather than individuals drawn from the population distribution of offending.
Utility Allocation Interpretation
When risk scores direct the expenditure of resources or benefits, we may interpret xAUC disparities as informative of group-level downstream utility disparities, if we expect resource or utility prioritizations T(R) that are monotonic in the score R. (We consider settings where we expect benefits for individuals to be non-decreasing in the expended resources, due to mechanistic knowledge.) In particular, for any monotonic allocation T, the xAUC measures P(T(R_a^1) > T(R_b^0)). Disparities in this measure suggest a greater probability of confusion, in the sense of less effective utility allocation between the positive and negative classes of different groups.
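Since the xAUC depends on scores only through their ordering, any strictly increasing allocation rule T leaves it unchanged; this invariance is easy to verify directly (a sketch with hypothetical scores and an illustrative allocation rule):

```python
import math, random

def xauc(pos, neg):
    total = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return total / (len(pos) * len(neg))

random.seed(4)
pos_a = [random.gauss(0.6, 0.2) for _ in range(200)]  # group a, Y = 1
neg_b = [random.gauss(0.4, 0.2) for _ in range(200)]  # group b, Y = 0

# a strictly increasing allocation rule mapping scores to resource levels
T = lambda r: 100 / (1 + math.exp(-3 * r))

assert abs(xauc(pos_a, neg_b)
           - xauc([T(r) for r in pos_a], [T(r) for r in neg_b])) < 1e-12
```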
We can also consider an integral representation of the xAUC disparities (as in [32]) as differences between the average rank of positive examples from one group relative to negative examples from the other group.
6.1 Adjusting Scores for Equal xAUC
We study the possibility of post-processing adjustments of a predicted risk score that yield equal xAUCs across groups, noting that the exact nature of the problem domain may pose strong barriers to the implementability or individual-fairness properties of post-processing adjustments.

Without loss of generality, we consider transformations h applied to the scores of group a. When h is monotonic, the within-group AUC is preserved.

Although solving analytically for a fixed point is difficult, empirically we can simply minimize the disparity over parametrized classes of monotonic transformations h_θ, such as the logistic transformation h_θ(r) = 1 / (1 + e^{−(θ_1 r + θ_0)}). We can further restrict the strength of the transformation by restricting the range of the parameters θ.
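A grid-search sketch of this adjustment on simulated scores (hypothetical data; the transformation class and grid are illustrative, not the paper's exact optimization routine):

```python
import math, random

def xauc(pos, neg):
    total = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return total / (len(pos) * len(neg))

def h(theta, r):
    """Monotonic logistic transformation with scale theta (offset fixed at 0)."""
    return 1 / (1 + math.exp(-theta * r))

random.seed(5)
s = {('a', 1): [random.gauss(0.50, 0.22) for _ in range(200)],
     ('a', 0): [random.gauss(0.42, 0.22) for _ in range(200)],
     ('b', 1): [random.gauss(0.70, 0.15) for _ in range(200)],
     ('b', 0): [random.gauss(0.35, 0.15) for _ in range(200)]}

def disparity(theta):
    """xAUC(a, b) - xAUC(b, a) after transforming group a's scores by h(theta, .)."""
    a1 = [h(theta, r) for r in s[('a', 1)]]
    a0 = [h(theta, r) for r in s[('a', 0)]]
    return xauc(a1, s[('b', 0)]) - xauc(s[('b', 1)], a0)

# search the scale parameter for the smallest absolute disparity
best = min((t / 4 for t in range(1, 41)), key=lambda th: abs(disparity(th)))
assert abs(disparity(best)) <= abs(disparity(1.0))  # 1.0 is in the grid

# a monotonic transform leaves group a's within-group AUC unchanged
orig = xauc(s[('a', 1)], s[('a', 0)])
adj = xauc([h(best, r) for r in s[('a', 1)]], [h(best, r) for r in s[('a', 0)]])
assert abs(orig - adj) < 1e-12
```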
In Fig. 4, we plot the unadjusted and adjusted xROC curves (dashed) resulting from a transformation which equalizes the xAUCs; we transform the scores of group a, the disadvantaged group. We optimize the empirical disparity over the space of scale parameters, fixing the offset. In Fig. 5, we plot the complementary CDFs corresponding to evaluating TPRs and FPRs over thresholds, as well as for the adjusted score (red). In Table 2, we show the optimal parameters achieving the lowest disparity, which occurs with relatively little impact on the population AUC, although it reduces the cross-group accuracy of the advantaged group.
Table 2: Optimal adjustment parameters and resulting metrics.

                        COMPAS   Fram.   German   Adult
AUC (original)          0.743    0.771   0.798    0.905
AUC (adjusted)          0.730    0.772   0.779    0.902
θ (optimal)             4.70     3.20    4.71     4.43
xAUC(a, b) (adjusted)   0.724    0.761   0.753    0.895
xAUC(b, a) (adjusted)   0.716    0.758   0.760    0.898
6.2 Fair classification post-processing and the xAUC disparity
One might consider applying the post-processing adjustment of Hardt et al. [25], implementing the group-specific thresholds as group-specific shifts to the score distribution. Note that an equalized-odds adjustment would equalize the TPR/FPR behavior at every threshold; since equalized odds might require randomization between two thresholds, there is no monotonic transform that equalizes the xROC curves at every threshold.

We instead consider the reduction in xAUC disparity from applying the "equality of opportunity" adjustment that only equalizes TPRs. For any specified true positive rate β, consider group-specific thresholds t_a(β), t_b(β) achieving that rate. These thresholds satisfy F̄_{1,a}(t_a(β)) = F̄_{1,b}(t_b(β)) = β. Then t_g(β) = F̄_{1,g}^{-1}(β). The score transformation on group a that achieves equal TPRs at every threshold is h(r) = F̄_{1,b}^{-1}(F̄_{1,a}(r)).
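Empirically, h = F̄_{1,b}^{-1} ∘ F̄_{1,a} amounts to quantile matching between the groups' positive-class score distributions, which can be sketched with empirical CDFs (hypothetical scores; `make_quantile_matcher` is an illustrative helper):

```python
import random
from bisect import bisect_right

def make_quantile_matcher(src_pos, dst_pos):
    """Quantile-match a score from src's positive-class distribution onto dst's
    (h = F_{1,dst}^{-1} o F_{1,src}; equivalent to the complementary-CDF form)."""
    src, dst = sorted(src_pos), sorted(dst_pos)
    def h(r):
        q = bisect_right(src, r) / len(src)               # empirical F_{1,src}(r)
        return dst[min(int(q * len(dst)), len(dst) - 1)]  # empirical quantile of dst
    return h

random.seed(6)
pos_a = [random.gauss(0.5, 0.20) for _ in range(2000)]  # group a, Y = 1
pos_b = [random.gauss(0.7, 0.10) for _ in range(2000)]  # group b, Y = 1
h = make_quantile_matcher(pos_a, pos_b)

t = 0.72  # at any threshold, the adjusted group-a TPR tracks group b's TPR
tpr_b = sum(r > t for r in pos_b) / len(pos_b)
tpr_a_adj = sum(h(r) > t for r in pos_a) / len(pos_a)
assert abs(tpr_a_adj - tpr_b) < 0.05
```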
Proposition 4.
The corresponding xAUC under an equality-of-opportunity adjustment, where h(r) = F̄_{1,b}^{-1}(F̄_{1,a}(r)) is applied to the scores of group a, is

xAUC(a, b) = P(h(R_a^1) > R_b^0) = P(R_b^1 > R_b^0) = AUC_b.

Proof.
For continuous score distributions, F̄_{1,a}(R_a^1) is uniformly distributed on [0, 1], so h(R_a^1) = F̄_{1,b}^{-1}(F̄_{1,a}(R_a^1)) is equal in distribution to R_b^1, and the claim follows.
∎
7 Conclusion
The xAUC metrics are intended to illustrate and help interpret the potential disparate impact of predictive scores. The xROC curve and xAUC metrics provide insight into the disparities that may occur when a predictive risk score is deployed in broader, but practically relevant, settings beyond binary classification.
References
 Agarwal and Roth [2005] S. Agarwal and D. Roth. Learnability of bipartite ranking functions. Proceedings of the 18th Annual Conference on Learning Theory, 2005.
 Angwin et al. [2016] J. Angwin, J. Larson, S. Mattu, and L. Kirchner. Machine bias. Online., May 2016.
 Barabas et al. [2017] C. Barabas, K. Dinakar, J. Ito, M. Virza, and J. Zittrain. Interventions over predictions: Reframing the ethical debate for actuarial risk assessment. Proceedings of Machine Learning Research, 2017.
 Barocas and Selbst [2014] S. Barocas and A. Selbst. Big data’s disparate impact. California Law Review, 2014.
 Bonta and Andrews [2007] J. Bonta and D. Andrews. Risk-need-responsivity model for offender assessment and rehabilitation. 2007.
 Brier [1950] G. W. Brier. Verification of forecasts expressed in terms of probability. Monthly Weather Review, 1950.
 Celis et al. [2018] L. E. Celis, D. Straszak, and N. K. Vishnoi. Ranking with fairness constraints. 45th International Colloquium on Automata, Languages, and Programming (ICALP 2018), 2018.
 Chan et al. [2018] C. Chan, G. Escobar, and J. Zubizarreta. Use of predictive risk scores for early admission to the ICU. MSOM, 2018.
 Chen et al. [2018] I. Chen, F. Johansson, and D. Sontag. Why is my classifier discriminatory? In Advances in Neural Information Processing Systems 31, 2018.

 Chojnacki et al. [2017] A. Chojnacki, C. Dai, A. Farahi, G. Shi, J. Webb, D. T. Zhang, J. Abernethy, and E. Schwartz. A data science approach to understanding residential water contamination in Flint. Proceedings of KDD 2017, 2017.
 Chouldechova [2016] A. Chouldechova. Fair prediction with disparate impact: A study of bias in recidivism prediction instruments. In Proceedings of FATML, 2016.
 Chouldechova et al. [2018] A. Chouldechova, E. PutnamHornstein, D. BenavidesPrado, O. Fialko, and R. Vaithianathan. A case study of algorithmassisted decision making in child maltreatment hotline screening decisions. Conference on Fairness, Accountability, and Transparency, 2018.
 CorbettDavies and Goel [2018] S. CorbettDavies and S. Goel. The measure and mismeasure of fairness: A critical review of fair machine learning. ArXiv preprint, 2018.
 [14] C. Cortes and M. Mohri. Confidence intervals for the area under the ROC curve.
 Cortes and Mohri [2003] C. Cortes and M. Mohri. Auc optimization vs. error rate minimization. Proceedings of the 16th International Conference on Neural Information Processing Systems, 2003.
 Cowgill [2018] B. Cowgill. The impact of algorithms on judicial discretion: Evidence from regression discontinuities. Working paper, 2018.
 Dheeru and Taniskidou [2017] D. Dheeru and E. K. Taniskidou. UCI machine learning repository. http://archive.ics.uci.edu/ml, 2017.
 Dieterich et al. [2016] W. Dieterich, C. Mendoza, and T. Brennan. Compas risk scales: Demonstrating accuracy equity and predictive parity. Technical Report, 2016.
 Dwork et al. [2011] C. Dwork, M. Hardt, T. Pitassi, O. Reingold, and R. Zemel. Fairness through awareness. Proceedings of the 3rd Innovations in Theoretical Computer Science Conference, 2011.
 Freund et al. [2003] Y. Freund, R. Iyer, R. Schapire, and Y. Singer. An efficient boosting algorithm for combining preferences. Journal of Machine Learning Research 4 (2003), 2003.
 Friedler et al. [2019] S. Friedler, C. Scheidegger, S. Venkatasubramanian, S. Choudhary, E. P. Hamilton, and D. Roth. A comparative study of fairnessenhancing interventions in machine learning. ACM Conference on Fairness, Accountability and Transparency (FAT*), 2019.
 Fuster et al. [2018] A. Fuster, P. GoldsmithPinkham, T. Ramadorai, and A. Walther. Predictably unequal? the effects of machine learning on credit markets. SSRN:3072038, 2018.
 Hand [2009] D. Hand. Measuring classifier performance: a coherent alternative to the area under the roc curve. Machine Learning, 2009.
 Hanley and McNeil [1982] J. Hanley and B. McNeil. The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology, 1982.

 Hardt et al. [2016] M. Hardt, E. Price, and N. Srebro. Equality of opportunity in supervised learning. In Advances in Neural Information Processing Systems, pages 3315-3323, 2016.
 Hebert-Johnson et al. [2018] U. Hebert-Johnson, M. Kim, O. Reingold, and G. Rothblum. Multicalibration: Calibration for the (computationally-identifiable) masses. Proceedings of the 35th International Conference on Machine Learning, PMLR 80:1939-1948, 2018.
 Holstein et al. [2019] K. Holstein, J. W. Vaughan, H. D. III, M. Dudík, and H. Wallach. Improving fairness in machine learning systems: What do industry practitioners need? 2019 ACM CHI Conference on Human Factors in Computing Systems (CHI 2019), 2019.
 Jones et al. [2011] J. Jones, N. Shah, C. Bruce, and W. F. Stewart. Meaningful use in practice: Using patientspecific risk in an electronic health record for shared decision making. American Journal of Preventive Medicine, 2011.
 Kleinberg et al. [2017] J. Kleinberg, S. Mullainathan, and M. Raghavan. Inherent trade-offs in the fair determination of risk scores. Proceedings of Innovations in Theoretical Computer Science (ITCS), 2017.
 Kontokosta and Hong [2018] C. E. Kontokosta and B. Hong. Who calls for help? statistical evidence of disparities in citizengovernment interactions using geospatial survey and 311 data from kansas city. Bloomberg Data for Good Exchange Conference, 2018.
 Liu et al. [2018] L. Liu, M. Simchowitz, and M. Hardt. Group calibration is a byproduct of unconstrained learning. ArXiv preprint, 2018.
 Menon and Williamson [2016] A. Menon and R. C. Williamson. Bipartite ranking: a risktheoretic perspective. Journal of Machine Learning Research, 2016.
 Feldman et al. [2015] M. Feldman, S. A. Friedler, J. Moeller, C. Scheidegger, and S. Venkatasubramanian. Certifying and removing disparate impact. Proceedings of KDD 2015, 2015.
 Mohri et al. [2012] M. Mohri, A. Rostamizadeh, and A. Talwalkar. Foundations of Machine Learning. 2012.
 Narasimhan and Agarwal [2013] H. Narasimhan and S. Agarwal. On the relationship between binary classification, bipartite ranking, and binary class probability estimation. Proceedings of NIPS 2013, 2013.
 Rajkomar et al. [2018] A. Rajkomar, M. Hardt, M. D. Howell, G. Corrado, and M. H. Chin. Ensuring fairness in machine learning to advance health equity. Annals of Internal Medicine, 2018.
 Reilly and Evans [2006] B. Reilly and A. Evans. Translating clinical research into clinical practice: Impact of using prediction rules to make decisions. Annals of Internal Medicine, 2006.
 Rudin et al. [2010] C. Rudin, R. J. Passonneau, A. Radeva, H. Dutta, S. Ierome, and D. Isaac. A process for predicting manhole events in Manhattan. Machine Learning, 2010.
 Singh and Joachims [2018] A. Singh and T. Joachims. Fairness of exposure in rankings. Proceedings of KDD 2018, 2018.
 Yang and Stoyanovich [2017] K. Yang and J. Stoyanovich. Measuring fairness in ranked outputs. Proceedings of SSDBM 17, 2017.
 Zafar et al. [2017] M. B. Zafar, I. Valera, M. G. Rodriguez, and K. P. Gummadi. Fairness beyond disparate treatment & disparate impact: Learning classification without disparate mistreatment. Proceedings of WWW 2017, 2017.
Appendix A Analysis
Proof of Lemma 2.
For the sake of completeness, we include the probabilistic derivation of the xAUC, analogous to similar arguments for the AUC [24, 23].
By a change of variables u = F̄_{0,b}(t), considering the mapping t = F̄_{0,b}^{-1}(u) between a threshold and the FPR it achieves for group b, we can rewrite the area integrated over the space of thresholds as

xAUC(a, b) = ∫_0^1 F̄_{1,a}(F̄_{0,b}^{-1}(u)) du.

Recalling the conditional score distributions R_a^1 ∼ F_{1,a} and R_b^0 ∼ F_{0,b}, and that F̄_{0,b}^{-1}(U) for uniform U is distributed as R_b^0, the probabilistic interpretation follows by observing

∫_0^1 F̄_{1,a}(F̄_{0,b}^{-1}(u)) du = E[F̄_{1,a}(R_b^0)] = E[P(R_a^1 > R_b^0 | R_b^0)] = P(R_a^1 > R_b^0).
∎
Proof of Proposition 3.
We show this for the decomposition AUC = p_a^1 p_a^0 AUC_a + p_b^1 p_b^0 AUC_b + p_a^1 p_b^0 xAUC(a, b) + p_b^1 p_a^0 xAUC(b, a); the others follow by applying the same argument. Conditioning an independently drawn positive-negative pair (R^1, R^0) on group membership,

AUC = P(R^1 > R^0) = Σ_{g, g′ ∈ {a, b}} P(A = g | Y = 1) P(A′ = g′ | Y = 0) P(R_g^1 > R_{g′}^0).
∎
Appendix B Additional Empirics
B.1 Balanced xROC curves and score distributions
[Fig. 6: balanced xROC curves and score distributions for the remaining datasets, with group codings by gender (female/male) and race (black/white).]
We compute the analogous xROC decompositions for all datasets. For Framingham and German, the balanced xROC decompositions do not suggest an unequal ranking-disparity burden on the innocent or guilty class in particular. For the Adult dataset, the xAUC^0 disparity is higher than the xAUC^1 disparity, suggesting that the misranking disparity is incurred via low-income whites who are spuriously recognized as high-income (and therefore might be disproportionately extended economic opportunity via favorable loan terms). The Framingham data is obtained from http://biostat.mc.vanderbilt.edu/DataSets.
Framingham, German, and Adult have more peaked (more certain) score distributions for one outcome class, with more uniform distributions for the other; the Adult income dataset exhibits the greatest contrast in variance between the two classes.

Appendix C Standard errors for reported metrics
[Table 3: standard errors for the AUC, xAUC, and Brier metrics of logistic regression and calibrated RankBoost on the COMPAS, Framingham, German, and Adult datasets.]