Quantifying Infra-Marginality and Its Trade-off with Group Fairness

09/03/2019 ∙ by Arpita Biswas, et al. ∙ Microsoft Research ∙ Indian Institute of Science

In critical decision-making scenarios, optimizing accuracy can lead to a biased classifier; hence, past work recommends enforcing group-based fairness metrics in addition to maximizing accuracy. However, doing so exposes the classifier to another kind of bias called infra-marginality. This refers to individual-level bias, where some individuals or subgroups can be worse off than under a classifier that simply optimizes accuracy. For instance, a classifier implementing race-based parity may significantly disadvantage women of the advantaged race. To quantify this bias, we propose a general notion of η-infra-marginality that can be used to evaluate its extent. We prove theoretically that, unlike other fairness metrics, infra-marginality does not have a trade-off with accuracy: high accuracy directly leads to low infra-marginality. This observation is confirmed through empirical analysis on multiple simulated and real-world datasets. Further, we find that maximizing group fairness often increases infra-marginality, suggesting that both group-level fairness and individual-level infra-marginality should be considered. However, measuring infra-marginality requires explicit knowledge of the true distribution of individual-level outcomes. We propose a practical method to measure infra-marginality, and a simple algorithm to maximize group-wise accuracy and thereby avoid infra-marginality.


1 Introduction

Consider a machine learning algorithm being used to make decisions in societally critical domains such as healthcare [irene2018why, goodman2018machine], education [tierney2013fairness], criminal justice [angwin2016machine, berk2017fairness], policing [simoiu2017problem, goel2017combatting], or finance [furletti2002overview]. Since data on past decisions may include historical or societal biases, it is generally believed that optimizing accuracy can result in an algorithm that sustains the same biases and is thus unfair to underprivileged groups [romei2014multidisciplinary, barocas2016big, chouldechova2018frontiers]. Theoretically too, it has been shown that it is not possible to have calibration and fairness simultaneously [kleinberg2017inherent], and several other results show that achieving multiple fairness constraints simultaneously is infeasible [chouldechova2017fair, corbett2017algorithmic].

Given this tradeoff between accuracy and fairness metrics, algorithms typically aim to maximize accuracy while also satisfying different notions of fairness, such as disparate impact [kamiran2012data, feldman2015certifying, zafar2017afairness], statistical parity [kamishima2012fairness, zemel2013learning, corbett2017algorithmic], and equalized odds [hardt2016equality, kleinberg2017inherent, woodworth2017learning]. Most of these fairness notions, however, enforce group-level constraints on pre-specified protected attributes and give no guarantees on fairness with respect to other sensitive attributes. That is, it is difficult to generalize these group-level constraints to be fair with respect to multiple sensitive attributes, which can lead to biases when individuals belong to multiple historically disadvantaged groups. As an example, a classifier that is constrained to be fair on race may end up introducing discrimination against women of the advantaged race, and vice versa. Moreover, being group-wise constraints, they provide no guarantees on fairness for individuals' outcomes within these groups.

In this paper, therefore, we consider a different notion of bias, infra-marginality [simoiu2017problem, corbett2018measure], that can handle multiple sensitive attributes simultaneously and relies on individual-level rather than group-level adjustments. The idea behind this individual-level fairness is that individuals with the same probability of an outcome (say, the same risk probability) receive the same decision, irrespective of their sensitive attributes. Any deviation from this ideal necessarily leads to misclassifying low-risk members as high risk and high-risk members as low risk, which in turn may harm the members of all groups. For instance, consider the medical domain, where doctors assess the severity of a person's illness (risk probability) and prioritize treatment accordingly. It may be acceptable to prioritize patients based on a reliable estimate of a patient's risk, but any deviation from this rule due to a group-fairness constraint on a particular demographic may deprive high-risk people of treatment, some of whom may in turn belong to another disadvantaged demographic. Hence, these group-level adjustments often introduce unintended biases and can be marginally unfair at an individual level; thus the name "infra-marginality" (see Section 3 for a definition). Conceptually, therefore, mitigating infra-marginality has a notable advantage: the fairness of a decision is based directly on the underlying outcome probability of each individual, rather than on post hoc group-level adjustments or constraints.

To remove infra-marginality, Simoiu et al. [simoiu2017problem] propose taking decisions using a single threshold on the true outcome probability, whenever the outcome probability is known. (In the above-mentioned example from the medical domain, this corresponds to selecting a patient, i.e., prioritizing treatment, if and only if the risk (outcome) probability of the patient is above a threshold.) Such a single-threshold classifier implements the high-level idea that legislation should apply equally to all individuals and not be based on group identity. A set of decisions (equivalently, a classifier) that conforms to this ideal is said to have zero infra-marginality (and is thus single-threshold fair) [simoiu2017problem, corbett2018measure].
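As a concrete illustration of the single-threshold rule (our own minimal Python sketch; the function and variable names are illustrative and not part of any cited implementation), the decision for every individual depends only on their outcome probability and one universal threshold:

    import numpy as np

    def single_threshold_decisions(p, tau):
        """Apply one fixed threshold tau to every individual's outcome
        probability p, irrespective of group membership."""
        return (np.asarray(p, dtype=float) >= tau).astype(int)

    # Two patients with the same risk estimate always receive the same decision.
    risk = np.array([0.82, 0.82, 0.30, 0.55])
    print(single_threshold_decisions(risk, tau=0.5))   # -> [1 1 0 1]

By construction, any such set of decisions has zero infra-marginality with respect to the threshold it uses.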

Extending past work on infra-marginality [simoiu2017problem], our main contribution is to quantify the notion and propose a generalized version of infra-marginality which we call η-infra-marginality. It is relevant to note that Simoiu et al. [simoiu2017problem] essentially consider infra-marginality as a "binary" property: either a classifier suffers from infra-marginality or it does not. Extending this construct, the current work defines the degree of infra-marginality. Furthermore, using this definition, we prove theoretically that high accuracy of a classifier directly leads to low infra-marginality. These results also hold group-wise: high accuracy, per group, means low group-wise infra-marginality.

Results on simulated and real-world datasets confirm this observation. First, we construct a set of simulated datasets with varying distributions of outcome per demographic group. Under all simulations, we find that training a machine learning model to maximize accuracy with respect to the outcome also yields low infra-marginality. Specifically, infra-marginality is at most a distance d away from the classification error. Here, the parameter d depends solely on the underlying data set, i.e., it is fixed for the given problem instance and is independent of any classifier one considers. In addition, when we focus on individuals from each sensitive group separately, we find that maximizing group-wise accuracy results in low infra-marginality for both groups. Second, we consider two datasets that are widely used for studying algorithmic fairness, namely the Adult-Income and Medical-Expense datasets, and demonstrate the connection between maximizing accuracy and lowering infra-marginality. Since there is no ground truth for individual outcome probabilities, we create an evaluation testbed by developing classifiers with subsets of features and benchmarking infra-marginality with respect to the outcome probabilities learnt using all features. This benchmark serves as an approximation of the true outcome probability. Even when a classifier is not trained on the true outcome risk, we find a similar positive trend between high accuracy and low infra-marginality.

The close connection of infra-marginality to accuracy also illustrates its inherent tradeoff with group fairness. Since group-wise fairness constraints necessarily reduce accuracy, they exacerbate infra-marginality. Using the meta-fair algorithm [celis2018classification] for training group-fair classifiers, we find that increasing the weight on a group-wise fairness constraint (such as demographic parity or equal false discovery rates) increases infra-marginality. This result is consistent for both simulated and real-world datasets: whenever constraints on accuracy optimization lead to an increase in a group-wise fairness metric, they also lead to an increase in infra-marginality bias. We argue that these results present a difficult choice for ensuring fairness: group-wise constraints may be blind to fairness on other unobserved groups, while lowering infra-marginality increases individual-level fairness but may lead to group-level unfairness. Thus, as a practical measure, we propose maximizing group-wise accuracy for lowering individual-level bias whenever the true outcome probability is measurable.

While these results point to the importance of considering infra-marginality, bounding infra-marginality in practice is a challenge because it depends on knowledge of the true outcome probability for each individual. Such knowledge is often unavailable: the true probabilities typically have to be learnt from datasets that contain only one-sided information (for example, recidivism outcomes only for those individuals who were granted bail, or loan-repayment outcomes only for those who were granted a loan). In such cases, we may not have correct estimates of the risk probabilities of the underlying population, and infra-marginality remains a theoretical concept.

To be useful in practice, we need two conditions: first, that objective measures of the outcomes are available and, second, that the available data is sampled uniformly from the target population of interest (and is not one-sided or based on biased decisions). Happily, outcomes of interest are available in many decision-making scenarios where they can be objectively recorded. For example, in the medical domain, outcomes are often categorical and objective (e.g., cured of a disease or not). Objective outcomes also occur in security-related decisions such as searching for a weapon in a vehicle or screening passengers at an airport for prohibited goods. In all these cases, the outcome of interest is measurable and is not affected by underlying biases, unlike subjective outcomes such as success in school or work, awarding loans, etc. The second condition is more stringent: it requires that we observe the outcomes for a representative sample of the underlying population, unbiased by past decisions. This assumption can be satisfied either by utilizing additional knowledge about the decision-making process or by actively changing decision-making. In Section 6, we discuss potential approaches, such as assuming that a subset of decisions is calibrated to the true outcome risk [goel2017combatting, simoiu2017problem, pierson2018fast], having a random sample of outcome data, randomizing the decision for a fraction of users, or using more advanced strategies such as contextual bandits [agarwal2017bandits].

Overall, our results point to the importance of data design, over statistical adjustments, in achieving fairness in algorithmic decision-making, and to the limitations of depending on an available dataset for ensuring fairness. Rather than making post hoc adjustments to steer an algorithm towards a chosen outcome, or introducing external constraints, it is worthwhile to consider obtaining accurate and unbiased outcome measurements. Under these conditions, our work proposes that the practice of optimizing for group-wise accuracy also leads to low bias (low infra-marginality), and allows improvements in machine learning to translate directly into improvements in fairness.

To summarize, we make the following contributions:

  • Our conceptual contribution is to develop a quantitative measure that characterizes the problem of infra-marginality. Extending past work [simoiu2017problem, corbett2018measure] on infra-marginality in algorithmic fairness, we propose a general η-infra-marginality formulation that enables measuring discrimination under the definition of single-threshold fairness (Section 3).

  • Second, we show that infra-marginality has a striking property: within an additive margin, the more accurate the classifier, the lower is its degree of infra-marginality (Section 4). We provide a rigorous proof of this claim in Theorem 1. This result asserts that, for well-behaved data sets and a given fairness threshold τ, higher accuracy points to lower infra-marginality. Complementarily, an inaccurate classifier will necessarily induce infra-marginality to a certain degree. Moreover, this result holds group-wise (Corollary 1), implying that higher group-wise accuracy reduces the infra-marginality problem per group. We also provide two propositions that identify relevant settings wherein Theorem 1 can be applied.

  • Third, when the classifier's decision threshold is not fixed in advance, we provide an algorithm for learning a classifier that optimizes accuracy subject to infra-marginality constraints, assuming that the true outcome probabilities are available (Section 4.2). We show that the problem reduces to applying a linear constraint and hence we can efficiently find an optimal solution. This is in notable contrast to prominent fairness metrics, which introduce non-convex (fairness) constraints into the underlying learning algorithms.

2 Background and Related Work

Datasets that collect past decisions on individuals and their resultant outcomes often reflect prevailing societal biases [barocas2016big]. These biases correspond to discrimination in decision-making, in recording outcomes, or both. During decision-making, individuals from a specific group may be less preferred for a favorable decision, thus reducing their chances of appearing in a dataset with favorable outcomes. This is especially the case when data recording is one-sided: we observe the outcomes only for those who received a favorable decision, such as being awarded a job, loan or bail. In addition, discrimination can also occur when recording outcomes, e.g., for hard-to-measure outcomes such as defining "successful" job candidates, employees or students. Due to these biases in dataset collection, building a decision-making algorithm by maximizing accuracy on the dataset perpetuates the bias in the selection of individuals or in the measurement of a desirable outcome.

To counter this bias, various group-wise statistical constraints have been proposed for a decision classifier that stipulate equitable treatment of different demographic groups. Demographic or statistical parity says that a fair algorithmic prediction should be independent of the sensitive attribute and thus each demographic group should have the same fraction of favorable decisions [kamishima2012fairness, zemel2013learning, corbett2017algorithmic]. Equalized odds [hardt2016equality, kleinberg2017inherent, woodworth2017learning] says that, for a fair algorithmic prediction, the true positive rates and the false positive rates across sensitive demographics should be as equal as possible. Combined with the accuracy objective, equalized odds implies similarly high accuracy for every sensitive demographic. Disparate impact [kamiran2012data, feldman2015certifying, zafar2017afairness] refers to the impact of policies that affect one sensitive demographic more than another, even though the rules applied are formally neutral; mathematically, the probability of being positively classified should be the same for different sensitive demographics. Chouldechova and Roth [chouldechova2018frontiers], along with Friedler et al. [friedler2018comparative], provide surveys of various fairness notions and algorithms to incorporate them. However, in practice, the output of these group-fair algorithms may yield worse predictions for all groups; for example, while trying to ensure equal false positive rates among two demographic groups, the predictions may end up increasing the false positive rates for both groups.

The difficulty of ensuring fairness for different groups suggests an alternative definition of fairness based on individual-level constraints. Individual fairness [dwork2012fairness, zemel2013learning] proposes that similar individuals should be treated as similarly as possible. Thus, rather than stratifying users by pre-specified groups, individual-level fairness constraints use available data (e.g., demographics) to define a similarity measure between individuals and then enforce equitable treatment for all similar individuals. The question then is how to define a suitable similarity measure. Rather than choosing particular variables on which to define similarity, Simoiu et al. [simoiu2017problem] propose defining similarity between two individuals based on their true probabilities of the outcome. That is, people with the same underlying probability of a favorable outcome are the most similar to each other. Under this view, a classifier exhibits infra-marginality bias if people with the same probability of the outcome are given different decisions [simoiu2017problem, pierson2018fast, corbett2018measure]. Crucially, this definition depends on a measurement of the true outcome probability, but does not restrict the analysis to a few pre-specified groups.

Given these considerations, different fairness notions may be suitable for different settings. When all important marginalized groups are defined (e.g., by law), then group-fairness constraints are more relevant. When true outcome probabilities are known (such as in randomized search/inspection decisions in security applications), then infra-marginality constraints become a better alternative. In this paper, we focus on the latter and describe the value of considering infra-marginality in fairness decisions. In Section 3, we extend past work to propose a general notion of infra-marginality. In Section 4, we prove our main result on the connection between high accuracy and low infra-marginality. In Section 5, we provide extensive empirical results showing the inherent tradeoff between typical group-fairness constraints and infra-marginality.

3 Defining Infra-marginality

In this section, we define and characterize the degree of infra-marginality. This measure quantifies the extent to which a classifier violates the notion of single-threshold fairness, i.e., it quantifies the problem of infra-marginality identified by Simoiu et al. [simoiu2017problem].

3.1 Problem Setup

We consider a binary classification problem over a set of instances, X, wherein label 1 denotes the positive class and label 0 denotes the negative class; e.g., in the search-for-contraband setup, class label 1 would indicate that the individual (instance) is in possession of contraband and, complementarily, label 0 would correspond to the case in which the individual is not carrying contraband. Conforming to the standard framework used in binary classification, we assume that feature vectors (equivalently, data points) x ∈ X are drawn from the data set via a feature distribution. Furthermore, for an instance x, let p_x be the inherent probability of x being in the positive class; in the previous example, p_x is the endowed probability that individual x is carrying contraband.

Under single-threshold fairness, a set of (binary) decisions is deemed to be fair if and only if the decisions are obtained by applying a single threshold on the p_x values of the instances (irrespective of the instances' sensitive attributes, such as race, ethnicity, and gender). In the above-mentioned contraband example, this corresponds to searching an individual x if and only if its p_x is above a (universally fixed) threshold τ.

3.2 Degree of Infra-Marginality

For ease of presentation, we will identify each data point with its feature vector x and use y_x to denote the (outcome) label of the data point x. As mentioned before, p_x is the outcome probability of each x ∈ X. Therefore, for every x, the binary label y_x is equal to one with probability p_x and, otherwise (with probability 1 − p_x), we have y_x = 0. The standard classification exercise corresponds to learning a classifier h that optimizes accuracy with respect to the labels y_x.

For a binary classifier h, we will use A(h) to denote its accuracy with respect to the labels y_x, i.e., A(h) = E[ 𝟙(h(x) = y_x) ]; here, the expectation is with respect to the underlying feature distribution. (That is, the random sample (x, y_x) is drawn from an underlying joint distribution over features and labels and, conditioned on a feature x, we have Pr[y_x = 1] = p_x.) The term 𝟙(h(x) = y_x) is the indicator random variable which denotes whether the output of the classifier, h(x), is equal to y_x or not. Note that high accuracy simply implies a high value of A(h).

To formally address single-threshold fairness, we define a (deterministic) label, ỹ_x, for each data point x ∈ X. Semantically, ỹ_x is the (binary) outcome of an absolutely fair (in the single-threshold sense) decision applied to x. This, by definition, means that the ỹ_x values are obtained by applying the same (fixed) threshold on the p_x values across all instances: for a fixed fairness threshold τ ∈ [0, 1], we have ỹ_x = 1 if p_x ≥ τ, and ỹ_x = 0 if p_x < τ. With this notation in hand, we define the central construct of the present work.

Definition 1 (Infra-Marginality of a Classifier).

With respect to a given threshold τ ∈ [0, 1], the degree of infra-marginality, I_τ(h), of a classifier h is defined as

    I_τ(h) := E[ 𝟙( h(x) ≠ ỹ_x ) ].        (1)

Here, the expectation is with respect to the feature distribution over the data set X. In addition, for each x ∈ X, the label ỹ_x = 1 if p_x ≥ τ, and ỹ_x = 0 otherwise.

A classifier h conforms to the ideal of single-threshold fairness iff I_τ(h) = 0. Furthermore, the quantity I_τ(h) can be interpreted as the extent to which the classifier's outputs (i.e., the decisions h(x)) differ from the ideal labels (i.e., from the single-threshold benchmarks ỹ_x). Indeed, the smaller the value of I_τ(h), the lower is h's infra-marginality.

We will use P and Ỹ, respectively, to denote the distributional forms of the collection of generative probabilities {p_x}_{x ∈ X} and of the labels {ỹ_x}_{x ∈ X}. Formally, P is the discrete distribution that places, on each value v, a probability mass equal to the (feature-distribution-weighted) fraction of data points x with p_x = v; the distribution Ỹ over the labels ỹ_x, along with its cumulative distribution function (cdf), is defined analogously.

Note that the distributions P and Ỹ are supported on the distinct p_x values (across x ∈ X) along with 0 and 1. Write d(P, Ỹ) to denote the weighted distance between the two distributions; specifically, the distance here is computed by normalizing (i.e., weighing) each data point with respect to the feature distribution. Formally,

    d(P, Ỹ) := E_x[ |p_x − ỹ_x| ].        (2)

(If the feature distribution picks data points uniformly at random from X, this is simply the average of the differences between the p_x and ỹ_x values.) We will also refer to d(P, Ỹ), or simply d, as the distance between the generative probabilities {p_x} and the labels {ỹ_x}. Note that this distance is a property of the data set and the underlying feature distribution: it is fixed for the given problem instance and is independent of any classifier one considers.
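As a concrete illustration of Definition 1 and of the distance in Equation (2), the following Python sketch (our own illustrative code, assuming the true outcome probabilities p_x are available for a finite sample and that the feature distribution is uniform over that sample) computes the empirical accuracy, the degree of infra-marginality, and the distance d:

    import numpy as np

    def single_threshold_labels(p, tau):
        """Ideal labels: 1 iff the true outcome probability meets the threshold."""
        return (np.asarray(p, dtype=float) >= tau).astype(int)

    def infra_marginality(h_pred, p, tau):
        """Empirical I_tau(h): fraction of points whose decision differs from
        the single-threshold label (uniform feature distribution)."""
        return np.mean(np.asarray(h_pred) != single_threshold_labels(p, tau))

    def distance_d(p, tau):
        """Empirical d: average |p_x - y_tilde_x|; a property of the data alone."""
        p = np.asarray(p, dtype=float)
        return np.mean(np.abs(p - single_threshold_labels(p, tau)))

    # Toy example with known outcome probabilities.
    rng = np.random.default_rng(0)
    p = rng.uniform(size=1000)                   # true outcome probabilities p_x
    y = rng.binomial(1, p)                       # realized labels y_x ~ Bernoulli(p_x)
    h = (p + rng.normal(0, 0.1, 1000) >= 0.5).astype(int)   # a noisy classifier
    print("accuracy A(h)   :", np.mean(h == y))
    print("infra-marg. I(h):", infra_marginality(h, p, tau=0.5))
    print("distance d      :", distance_d(p, tau=0.5))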

4 Accuracy and Infra-marginality

The key result of this section is the following theorem, which establishes that, as long as the distance d between P and Ỹ is small, an accurate classifier will also have low infra-marginality. Complementarily, under a small distance, an inaccurate classifier will necessarily induce high infra-marginality.

4.1 High accuracy, low infra-marginality

Theorem 1.

For a data set X, let d be the distance between the generative (outcome) probabilities {p_x} and the (single-threshold) labels {ỹ_x}. Then, for any binary classifier h with accuracy A(h), the degree of infra-marginality satisfies

    | I_τ(h) − (1 − A(h)) | ≤ d.

Proof.

Using the definition of the accuracy, A(h), of classifier h, we get

    A(h) = E_x[ p_x · 𝟙(h(x) = 1) + (1 − p_x) · 𝟙(h(x) = 0) ].        (3)

Analogously, the degree of infra-marginality, I_τ(h), can be expressed as

    I_τ(h) = E_x[ ỹ_x · 𝟙(h(x) = 0) + (1 − ỹ_x) · 𝟙(h(x) = 1) ].        (4)

Summing (3) and (4) gives us

    A(h) + I_τ(h) = 1 + E_x[ (p_x − ỹ_x) · ( 𝟙(h(x) = 1) − 𝟙(h(x) = 0) ) ].

Subtracting one from both sides of the previous equality and considering absolute values, we obtain

    | A(h) + I_τ(h) − 1 | ≤ E_x[ |p_x − ỹ_x| ].        (5)

The last inequality follows from the triangle inequality and the fact that | 𝟙(h(x) = 1) − 𝟙(h(x) = 0) | = 1.

Recall that, for each x, we have p_x ∈ [0, 1] and the label ỹ_x is either zero or one; in both of these cases, |p_x − ỹ_x| ≤ 1. Therefore, the desired bound holds:

    | I_τ(h) − (1 − A(h)) | = | A(h) + I_τ(h) − 1 |
                            ≤ E_x[ |p_x − ỹ_x| ]        (via (5))
                            = d.                        (via (2))
∎

This result shows that, for any classifier, high accuracy points to low infra-marginality. Notably, this result applies group-wise: the "inverse" connection holds even if we consider accuracy and infra-marginality separately for different subsets (groups) of the data set X. Specifically, Theorem 1 can be applied, as is, to obtain Corollary 1. Here, say the set of data points is (exogenously) partitioned into two groups (i.e., disjoint subsets) X_1 and X_2. Also, let the outcome probabilities of the two groups be {p_x}_{x ∈ X_1} and {p_x}_{x ∈ X_2}, respectively, along with the single-threshold labels {ỹ_x}_{x ∈ X_1} and {ỹ_x}_{x ∈ X_2}. Let d_1 (respectively, d_2) denote the distance between the outcome probabilities and the single-threshold labels within X_1 (respectively, X_2). Following the above-mentioned notational conventions, the accuracy and infra-marginality measures of a classifier h for group X_i are A_i(h) and I_i(h), respectively. With this notation in hand, we have the following group-wise guarantee.

Corollary 1.

For a data set X, comprised of two groups X_1 and X_2, and any classifier h, the degree of infra-marginality in each group satisfies the following bounds: | I_1(h) − (1 − A_1(h)) | ≤ d_1 and | I_2(h) − (1 − A_2(h)) | ≤ d_2.

Remark: Corollary 1 provides some useful insights towards achieving low infra-marginality. It quantitatively highlights the principle that, for mitigating group-wise infra-marginality, group-wise accuracy can be a better metric than overall accuracy. That is, in relevant settings, aiming for classifiers that maximize the minimum accuracy across groups (i.e., adopting a Rawlsian perspective on accuracy) can lead to fairer decisions than solving for, say, classifiers that enforce the same accuracy across groups or classifiers that maximize overall accuracy. In particular, Corollary 1 ensures that a max-min (Rawlsian) guarantee on accuracy translates into a max-min guarantee on infra-marginality, with additive errors of at most d_1 and d_2, respectively.
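The bound in Theorem 1 (and its group-wise analogue) is easy to check numerically. The short Python snippet below (our own illustration, in the same style as the sketch following Equation (2)) draws arbitrary threshold classifiers and shows that the infra-marginality stays within d of the classification error, up to sampling noise in the realized labels:

    import numpy as np

    rng = np.random.default_rng(1)
    n, tau = 5000, 0.5
    p = rng.beta(2, 2, size=n)                 # true outcome probabilities p_x
    y = rng.binomial(1, p)                     # realized labels y_x
    y_tilde = (p >= tau).astype(int)           # single-threshold labels
    d = np.mean(np.abs(p - y_tilde))           # distance d of Equation (2)

    for _ in range(5):
        h = (p >= rng.uniform(0.2, 0.8)).astype(int)   # an arbitrary threshold classifier
        err = np.mean(h != y)                          # classification error 1 - A(h)
        infra = np.mean(h != y_tilde)                  # degree of infra-marginality
        print(f"error={err:.3f}  infra={infra:.3f}  |infra-error|={abs(infra - err):.3f}  d={d:.3f}")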

The following two propositions identify relevant settings wherein Theorem 1 can be applied. The proofs of these propositions are direct; complete arguments are provided in the supplementary materials. The first proposition addresses data sets in which the probabilities p_x (across data points x ∈ X) are spread out enough and do not sharply peak around a specific value. Formally, we say that a data set is L-Lipschitz if, for any subinterval of [0, 1] of length δ, the number of data points x with p_x in that subinterval is at most L · δ · |X|. Note that, for settings in which the cdf of P is smooth, the maximum slope of the cdf corresponds to the Lipschitz constant of the data set.

Proposition 1.

If a data set is L-Lipschitz and the underlying fairness threshold τ ∈ (0, 1), then the distance d between the outcome probabilities and the single-threshold labels is appropriately bounded in terms of L and τ. Here, we assume that the underlying feature distribution selects instances from X uniformly at random.

Proof (sketch).

Partition [0, 1] into subintervals of equal length. The Lipschitz condition bounds the number of data points whose p_x values fall within each subinterval; summing each subinterval's contribution to E_x[ |p_x − ỹ_x| ] yields the stated bound. ∎

The next proposition observes that if, in P, the probability mass is spread sufficiently far away from the threshold, then again the distance between the outcome probabilities and the single-threshold labels is appropriately bounded. The result shows that Theorem 1 is useful, in particular, when P is a bimodal distribution with the two modes being close to zero and one, respectively. Formally, we will say that a distribution (supported on [0, 1]) is (α, β)-spread, with respect to the threshold τ, iff, under the distribution, the probability mass placed within distance β of the threshold τ is at most α; here, α ∈ [0, 1] and β > 0.

Proposition 2.

If, for a data set, the outcome probability distribution P is (α, β)-spread and the underlying fairness threshold τ ∈ (0, 1), then the distance d between the outcome probabilities and the single-threshold labels is at most α · max{τ, 1 − τ} + (1 − α)(max{τ, 1 − τ} − β). Here, we assume that the underlying feature distribution selects instances from X uniformly at random.

Proof.

If P is (α, β)-spread, then the distance (between this distribution and Ỹ) is maximized when an α fraction of the data points have p_x values arbitrarily close to the threshold τ and the rest of the data points (accounting for the remaining (1 − α) fraction) have p_x values at distance β from the threshold, i.e., equal to τ − β (or τ + β). For the former points |p_x − ỹ_x| ≤ max{τ, 1 − τ}, and for the latter |p_x − ỹ_x| ≤ max{τ, 1 − τ} − β. Hence, the distance between P and Ỹ is upper bounded as follows: d ≤ α · max{τ, 1 − τ} + (1 − α)(max{τ, 1 − τ} − β). ∎

4.2 Optimal Classifiers Under Infra-marginality Constraints

The above subsection characterized infra-marginality for a given, fixed classifier. In practice, however, the classifier (equivalently, its decision threshold) can be chosen to ensure high accuracy and, at the same time, a low degree of infra-marginality. Therefore, we now present an efficient algorithm for finding classifiers with as high an accuracy as possible, under infra-marginality constraints. We address this optimization problem in a setup wherein the data set X is finite and the outcome probabilities, p_x, are known explicitly. In this setup (i.e., given the p_x values), prominent group-wise fairness notions map to non-convex constraints. Hence, maximizing accuracy subject to a group-wise fairness constraint typically requires relaxations (leading to approximate solutions) or heuristics; see, e.g., [celis2018classification] and references therein. We will show that, by contrast, an upper bound on infra-marginality can be expressed as a linear constraint and, hence, we can efficiently find an optimal (with respect to accuracy) classifier that satisfies a specified infra-marginality bound.

Let (x, y_x) be a random sample from the underlying distribution, where x is drawn from the feature distribution over the data set X and the label y_x ∈ {0, 1} satisfies Pr[y_x = 1] = p_x. Recall that, given a universal threshold τ (which imposes the single-threshold fairness criterion), we define the label ỹ_x = 1 if p_x ≥ τ and ỹ_x = 0 otherwise, for each x ∈ X. Let h be any classifier with accuracy A(h) = E[ 𝟙(h(x) = y_x) ]. Furthermore, the degree of infra-marginality of h is I_τ(h) = E[ 𝟙(h(x) ≠ ỹ_x) ].

Given a parameter η ∈ [0, 1], we consider the problem of maximizing the accuracy, over all classifiers, subject to the constraint that the degree of infra-marginality is at most η:

    maximize_h    E[ 𝟙(h(x) = y_x) ]
    subject to    E[ 𝟙(h(x) ≠ ỹ_x) ] ≤ η

Observe that, because Pr[y_x = 1] = p_x and h(x) is binary valued, the objective function can be expressed as

    E[ 𝟙(h(x) = y_x) ] = E_x[ p_x · h(x) + (1 − p_x)(1 − h(x)) ].

Similarly, using the fact that ỹ_x is deterministic and h(x) is binary valued, for infra-marginality we get

    E[ 𝟙(h(x) ≠ ỹ_x) ] = E_x[ ỹ_x (1 − h(x)) + (1 − ỹ_x) h(x) ].

Therefore, the above maximization of accuracy (or, equivalently, minimization of the error rate) subject to the degree of infra-marginality being upper bounded by η, over classifiers h, can be equivalently rewritten as follows. Write p̄ := E_x[p_x] and ȳ := E_x[ỹ_x], and note that p̄, ȳ, and the quantities p_x and ỹ_x do not depend on the classifier h.

    minimize_h    E_x[ (1 − 2p_x) · h(x) ]
    subject to    E_x[ (1 − 2ỹ_x) · h(x) ] ≤ η − ȳ

This is an integer linear program with (binary) decision variables {h(x)}_{x ∈ X}. Now consider its linear programming relaxation obtained by letting h(x) ∈ [0, 1], and denote its optimum by OPT. Using Lagrange multipliers and strong duality, the optimum of the linear relaxation is given by

    OPT = max_{λ ≥ 0}  min_{h(x) ∈ [0,1]}  E_x[ (1 − 2p_x) h(x) ] + λ ( E_x[ (1 − 2ỹ_x) h(x) ] − (η − ȳ) ).

For any fixed λ ≥ 0, the inner minimum is attained by the classifier h_λ(x) = 𝟙( c_λ(x) < 0 ), where c_λ(x) := (1 − 2p_x) + λ(1 − 2ỹ_x). Therefore,

    OPT = max_{λ ≥ 0}  E_x[ min{0, c_λ(x)} ] − λ (η − ȳ).        (6)

With the probabilities p_x in hand, we can solve the dual of the linear programming relaxation to efficiently compute the optimal solution of (6), i.e., compute an optimal value λ* of the Lagrange multiplier (dual variable). Given the optimal λ*, we know that the optimal classifier h* satisfies

    h*(x) = 1    iff    (1 − 2p_x) + λ*(1 − 2ỹ_x) ≤ 0.

In other words,

    h*(x) = 1    iff    p_x + λ* ỹ_x ≥ (1 + λ*)/2.

Note that the optimal solution of the linear programming relaxation actually yields a binary classifier h*, which means that h* is also an optimal solution of the underlying integer linear program.

Remark: It is well known that the Bayes classifier, which gives the decisions h(x) = 𝟙(p_x ≥ 1/2), achieves maximum accuracy over all classifiers. Our derivation shows that, even with infra-marginality constraints, optimal classifiers continue to be single-threshold classifiers. Also, it is relevant to note that the above-mentioned method can be used to efficiently solve the complementary problem of minimizing infra-marginality subject to a (specified) lower bound on accuracy.
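The derivation above translates into a short computational recipe whenever the p_x values are known for a finite data set. The sketch below (our own illustration; it solves the LP relaxation directly with scipy.optimize.linprog instead of working through the dual, and the parameter names are ours) maximizes accuracy subject to an infra-marginality budget η:

    import numpy as np
    from scipy.optimize import linprog

    def max_accuracy_under_infra_marginality(p, tau, eta):
        """Choose decisions h_x in [0, 1] minimizing the expected error subject to
        infra-marginality (w.r.t. threshold tau) at most eta; a uniform feature
        distribution over the finite sample is assumed."""
        p = np.asarray(p, dtype=float)
        n = len(p)
        y_tilde = (p >= tau).astype(float)
        c = (1.0 - 2.0 * p) / n                     # error = mean(p) + c @ h
        a = (1.0 - 2.0 * y_tilde) / n               # infra = mean(y_tilde) + a @ h
        res = linprog(c, A_ub=a[None, :], b_ub=[eta - y_tilde.mean()],
                      bounds=[(0.0, 1.0)] * n, method="highs")
        return np.round(res.x).astype(int)          # solution is (essentially) binary

    rng = np.random.default_rng(2)
    p = rng.beta(2, 5, size=200)
    h = max_accuracy_under_infra_marginality(p, tau=0.3, eta=0.05)
    y_tilde = (p >= 0.3).astype(int)
    print("infra-marginality:", np.mean(h != y_tilde))            # <= eta (up to rounding)
    print("expected error   :", np.mean(p * (1 - h) + (1 - p) * h))

Consistent with the remark above, the decisions returned by this program amount to a single threshold on p_x.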

5 Empirical Evaluation

In this section, our primary aim is to provide empirical evidence for the theoretical guarantees developed above and to answer the following questions:

  • Does high accuracy (low classification error) lead to low infra-marginality for a classifier?

  • How does the relationship between classification error and infra-marginality change when we consider them separately for each protected group?

  • What is the nature of the trade-off between optimizing a classifier for group-fairness measures versus infra-marginality?

5.1 Experimental Setup

We answer these questions by measuring accuracy, infra-marginality and group fairness metrics for machine learning classifiers under a wide range of datasets. Since estimation of infra-marginality depends on knowledge of the true outcome distribution, we first present results on simulated datasets where we control the data generation process. The distributions are chosen to provide a thorough understanding of the relationship between accuracy and degree of infra-marginality. We then present results on two real-world datasets that have been used in prior empirical work on algorithmic fairness: Adult Income [url:adultIncome] (for predicting annual income) and MEPS [url:meps] (for predicting utilization using medical expenditure data). These datasets satisfy the two properties that are required for empirical application of infra-marginality. First, the outcomes measured are numerical quantities that are less likely to be subjectively biased and second, they can be assumed to be a representative sample of the underlying population. The Adult-Income dataset contains a sample of adults in the United States based on the Census data from 1994 and the MEPS dataset is derived from a nationally representative survey sample of people’s healthcare expenditure in the US.

Measuring infra-marginality. On the simulated datasets, we estimate infra-marginality using its definition in Equation (1). On the real-world datasets, we do not have the true outcome probability and thus use an approximation. We assume that the outcome probability p̂_x estimated by a classifier trained on all available features can be treated as a proxy for the true outcome probability. Essentially, this assumption implies that all the relevant variables for estimating the outcome are available in the dataset (and that we have an optimal learning algorithm to estimate the outcome probability). That is, we measure

    Î_τ(h) = E_x[ 𝟙( h(x) ≠ ŷ_x ) ],    where ŷ_x = 𝟙( p̂_x ≥ τ ).        (7)

To assess the sensitivity of the results to settings where the true outcome probability differs from the one estimated by the learnt classifier, we develop the following method: we progressively remove features from a given dataset, train a classifier h_red on the reduced dataset, and compare the infra-marginality computed with respect to the outcome probability estimated from this reduced dataset to the "true" infra-marginality computed with respect to the outcome probability estimated from the full dataset (thus treating the latter as the true outcome probability). That is, for the reduced-feature classifier we use the following measure of infra-marginality:

    Î_τ(h_red) = E_x[ 𝟙( h_red(x) ≠ ŷ_x ) ],    where ŷ_x = 𝟙( p̂_x ≥ τ ) is derived from the full-feature model.        (8)

While we demonstrate the theoretical results using the above assumption on the real-world datasets, in practice we recommend a combination of domain knowledge and active data collection to estimate true outcome probabilities and correspondingly, the true infra-marginality. We discuss these possibilities in Section 6.
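A minimal sketch of this proxy measurement (our own illustration; the model choice, synthetic data, and function names are placeholders rather than the exact pipeline used in our experiments): fit a probability model on the full feature set, treat its scores as a stand-in for the true outcome probability, and measure the infra-marginality of a reduced-feature classifier against the single-threshold labels derived from those scores:

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def proxy_infra_marginality(X_full, X_reduced, y, tau=0.5):
        """Approximate I_tau using learnt probabilities as a proxy for p_x.
        X_full: all available features; X_reduced: the features available to the
        deployed classifier; y: observed binary outcomes."""
        p_hat = LogisticRegression(max_iter=1000).fit(X_full, y).predict_proba(X_full)[:, 1]
        y_tilde_hat = (p_hat >= tau).astype(int)      # proxy single-threshold labels
        h = LogisticRegression(max_iter=1000).fit(X_reduced, y).predict(X_reduced)
        return np.mean(h != y_tilde_hat)

    # Synthetic demo: the second feature is withheld from the deployed classifier.
    rng = np.random.default_rng(3)
    X = rng.normal(size=(2000, 2))
    y = rng.binomial(1, 1.0 / (1.0 + np.exp(-(1.5 * X[:, 0] + 1.5 * X[:, 1]))))
    print(proxy_infra_marginality(X, X[:, :1], y))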

Measuring group-fairness. For each classifier on the simulated and real-world datasets, we compare infra-marginality to prominent group-fairness metrics. We do so using the meta-fair framework proposed by Celis et al. [celis2018classification]. In this framework, a trade-off parameter balances between maximizing accuracy and achieving group fairness: the higher the value of this parameter, the more the training focuses on achieving group fairness. When this parameter is zero, the classifier maximizes accuracy with no group-fairness constraints. For our evaluation, we consider two group-fairness notions:

  • Statistical Parity (SP) or Disparate Impact (DI): The aim is to achieve a low value of the disparity in selection rates across the two groups, where the selection rate for group z is the fraction of individuals within group z who receive a favorable outcome from the classifier.

  • Equal False Discovery Rates (FDR): The aim is to achieve a low value of the disparity in false discovery rates across the two groups, where the false discovery rate for group z is the fraction of individuals who are incorrectly classified, among those in group z who receive a favorable outcome from the classifier.

We use the implementation of the meta-fair algorithm provided in the Python package AI Fairness 360 [aif360-oct-2018]. Finally, we report the classification error rate, i.e., (1 − accuracy), the degree of infra-marginality, and the value of the group (un)fairness metric; a minimal sketch of how the two disparities can be computed is given below.
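The sketch below (our own code; reporting the absolute difference between the two groups is our presentational choice) shows one way to compute the two disparities directly from the predictions, labels, and group membership:

    import numpy as np

    def selection_rate_gap(h_pred, z):
        """Statistical-parity / disparate-impact gap: difference in the fraction of
        favorable decisions between the two groups z in {0, 1}."""
        h_pred, z = np.asarray(h_pred), np.asarray(z)
        return abs(h_pred[z == 0].mean() - h_pred[z == 1].mean())

    def fdr_gap(h_pred, y, z):
        """False-discovery-rate gap: difference, across groups, in the fraction of
        selected individuals (h = 1) whose true label is y = 0."""
        h_pred, y, z = map(np.asarray, (h_pred, y, z))
        fdr = []
        for g in (0, 1):
            selected = (z == g) & (h_pred == 1)
            fdr.append((y[selected] == 0).mean() if selected.any() else 0.0)
        return abs(fdr[0] - fdr[1])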

5.2 Simulation-Based Datasets

For simplicity, we consider datasets with a single sensitive attribute (e.g., race or gender) and two additional attributes (e.g., age, income, etc.) that denote relevant features for an individual. We assume that the sensitive attribute is binary and additional attributes are continuous. We also assume that the outcome is binary and depends on the attributes of an individual. Given a classifier that predicts the outcome based on these features, our central goal is to compare its misclassification rate, infra-marginality and group-fairness.

Specifically, we assume that the sensitive attribute Z ∈ {0, 1} is binary, the two additional attributes X1 and X2 are real-valued, and the outcome Y ∈ {0, 1} is binary. We create datasets using a generative model where the attributes of an individual are simulated based on their sensitive attribute and outcome. The overall population distribution is generated as P(X1, X2, Z, Y) = P(X1, X2 | Z, Y) · P(Z, Y). We further consider X1 and X2 to be conditionally independent given Z and Y, i.e., P(X1, X2 | Z, Y) = P(X1 | Z, Y) · P(X2 | Z, Y), and take these conditional distributions to be Gaussian. Within this framework, we generate five types of datasets, each with an equal label distribution P(Y = 1) = P(Y = 0). For each instance i, we obtain the true outcome probability p_i = P(Y = 1 | X1_i, X2_i, Z_i) by applying Bayes' rule, and from it the single-threshold labels and the distance d. Based on this process, we generate datasets with various outcome probability (p) distributions, as shown in Figure 1 (a code sketch of the generative process is given below).
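Sketch of the generative process (our own illustration; the Gaussian means shown are arbitrary and do not correspond to any of the datasets S1-S5):

    import numpy as np
    from scipy.stats import norm

    def simulate(n, means, sigma=1.0, p_y=0.5, seed=0):
        """means[(z, y)] = (mu1, mu2): Gaussian means of (X1, X2) given Z=z, Y=y."""
        rng = np.random.default_rng(seed)
        Z = rng.integers(0, 2, size=n)
        Y = rng.binomial(1, p_y, size=n)
        mu = np.array([means[(int(z), int(y))] for z, y in zip(Z, Y)])
        X = rng.normal(mu, sigma)                  # X1, X2 conditionally independent given (Z, Y)
        # Bayes' rule: p_i = P(Y = 1 | X1_i, X2_i, Z_i).
        lik1 = norm.pdf(X, np.array([means[(int(z), 1)] for z in Z]), sigma).prod(axis=1)
        lik0 = norm.pdf(X, np.array([means[(int(z), 0)] for z in Z]), sigma).prod(axis=1)
        p = p_y * lik1 / (p_y * lik1 + (1 - p_y) * lik0)
        return X, Z, Y, p

    # Illustrative parameters only (not those of S1-S5).
    means = {(0, 0): (-1.0, -1.0), (0, 1): (1.0, 1.0),
             (1, 0): (-0.3, -0.3), (1, 1): (0.3, 0.3)}
    X, Z, Y, p = simulate(10000, means)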

Figure 1: True outcome probability distributions for datasets S1, S2, S3, and S4 (panels a-d). The x-axis denotes the outcome probability p and the y-axis the corresponding density; the blue and red curves represent the densities for groups Z = 0 and Z = 1, respectively.

  • Dataset S1: The risk distribution of this dataset is shown in Figure 1(a). For any individual, irrespective of their group membership, the outcome probability is drawn from a density that is concentrated near the threshold. Thus, it is difficult to obtain highly accurate classifiers for this kind of outcome distribution, and the distance d between the outcome probabilities and the single-threshold labels is large, both overall and per group.

  • Dataset S2: We then construct a dataset where the outcome distributions for the two sensitive groups are separated from each other: one group (Z=0) has an outcome-probability mode below 0.5, and the other group (Z=1) has a mode above 0.5. The distribution of this dataset is shown in Figure 1(b). Compared to S1, we expect it to be easier for a classifier to achieve high accuracy, but achieving group statistical parity (Disparate Impact) is harder: the two groups have different densities over the outcome probabilities, one concentrated primarily below the threshold and the other above it. Thus, in this dataset, imposing a group-fairness constraint such as disparate impact would necessarily cause more classification error and a higher degree of infra-marginality. The value of d for this dataset is the same for both groups.

  • Dataset S3: This dataset corresponds to an extreme case of S2, where the outcome probability for one of the sensitive groups is concentrated near 0 whereas that for the other sensitive group is concentrated near 1, as shown in Figure 1(c). Here the value of d is small. Achieving high accuracy and low infra-marginality should be extremely easy for a threshold-based classifier, but satisfying group-wise parity remains equally hard.

  • Dataset S4: Here we generate data such that the density over outcome probabilities for one group is spread sufficiently away from the threshold, whereas the density over outcome probabilities for the other group is concentrated around it, as shown in Figure 1(d). The values of d for the two groups are therefore different: small for the former group and larger for the latter. Thus, we expect to see a difference in accuracy by group: a classifier should be able to distinguish individuals from the former group easily, but find it hard to separate individuals with positive and negative outcomes within the latter group. In contrast, any classifier with a single threshold is also expected to satisfy statistical parity (SP).

  • Dataset S5: In this dataset, the feature X1 is drawn from the same Gaussian distribution irrespective of the Z or Y values. However, the feature X2 is drawn such that its values clearly separate individuals with different Y values within one of the groups, while for the other group it is difficult to classify individuals accurately, as shown in Figure 2.

    Figure 2: Scatterplot for a subset of data points from dataset S5.

For dataset S5, the optimal classifier's decision boundary would exploit the feature X2 to separate the outcome classes within the separable group, whereas a classifier ensuring equal misclassification rates for both subgroups would necessarily choose a horizontal decision boundary, say X2 = c. In this case, a large fraction of individuals would receive a different decision than that of the optimal classifier and hence would suffer from higher infra-marginality.

High accuracy, low infra-marginality. Figure 3 illustrates that the degree of infra-marginality is well within the theoretical bound of classification error plus d, as established in Theorem 1. In particular, Figure 3(a) shows that, when a classifier is trained to maximize accuracy on dataset S1, the degree of infra-marginality is bounded by the classification error plus d (overall, as well as group-wise). Interestingly, for dataset S2, the degree of infra-marginality stays very close to the classification error even though the theoretically established bound is high, as shown in Figure 3(b). The value of d for dataset S3 is low, and hence the infra-marginality and error rates are very close to each other (the y-axis of Figure 3(c) is truncated for clarity). The bound on infra-marginality continues to hold in Figures 3(d) and 3(e).

Figure 3: Comparing the classification error, infra-marginality, and group (un)fairness with increasing values of the group-fairness trade-off parameter in the meta-fair algorithm, for datasets S1-S5 (panels a-e).

Tradeoff between infra-marginality and group fairness. Figures 3(a) and 3(b) show that invoking the DI-fairness constraint increases both the classification error and the degree of infra-marginality. Even for dataset S3, DI-fairness increases the classification error, and the corresponding graph looks similar to the previous one (Figure 3(b)).

For S3, we additionally evaluate the effect of using another fairness notion, based on the False Discovery Rate (FDR). Figure 3(c) shows that ensuring FDR-fairness hurts the accuracy and increases infra-marginality. For S4, the FDR-fairness constraint not only increases the classification error but also increases the false discovery rates for both subgroups (as shown in Figure 4), which is an extremely undesirable consequence of ensuring group fairness. We observe a similar trend for dataset S5 in Figure 3(e). Also, we observe that the ratio of false discovery rates does not uniformly improve as the group-fairness parameter increases.

Figure 4: FDRs for two groups increase while ensuring FDR-fairness, for S4.

Group-wise accuracy and infra-marginality. We show results for one of the datasets, S4, by computing the metrics separately for each sensitive group; results on the others are similar. In dataset S4, the values of d for the two groups differ: d is small for the group whose outcome probabilities are spread away from the threshold and larger for the other group. Thus, the infra-marginality and error lines in Figure 5 (obtained using the FDR-fairness constraint) almost coincide for the former group, while these lines are quite far apart for the latter. Moreover, Figure 3(d) shows that, beyond a certain value of the trade-off parameter, group fairness improves at a significant cost in infra-marginality and accuracy.

Figure 5: Comparison of classification error and infra-marginality for dataset S4, computed separately for the two sensitive groups.

Summarizing the results on synthetically generated datasets: across all simulations, the infra-marginality values stay within d of the classifier's error rate. In fact, in almost all executions, the infra-marginality is lowest when the accuracy is highest. So, addressing the first question, we find that low classification error leads to low infra-marginality, even when the d values are large. From our group-wise results on S4, we find a corresponding result: classifiers are highly accurate for the group whose outcome probabilities are spread away from the threshold and thus exhibit a very low degree of infra-marginality for that group, while accuracy is lower for the other group, which consequently suffers higher infra-marginality. Finally, we see that increasing group fairness leads to worse accuracy and, in turn, a higher degree of infra-marginality in all the datasets.

5.3 Case-Study

We now describe our results on the Adult Income and Medical Expense datasets.


  • Medical Expenditure Panel Survey dataset (MEPS) [url:meps]: This data consists of surveys of families and individuals, collecting data on health services used, costs and frequency of services, demographics, etc., of the respondents. The classification task is to predict whether a person would have 'high' utilization, defined as UTILIZATION above a cutoff that is roughly the average utilization for the considered population. The feature 'UTILIZATION' was created to measure the total number of trips requiring some sort of medical care, by summing up the following features: OBTOTV15 (the number of office-based visits), OPTOTV15 (the number of outpatient visits), ERTOT15 (the number of ER visits), IPNGTD15 (the number of inpatient nights), and HHTOTD16 (the number of home health visits). High-utilization respondents constituted a minority of the dataset. The sensitive attribute 'RACE' is constructed as follows: 'Whites' (Z=0) is defined by the features RACEV2X = 1 (White) and HISPANX = 2 (non-Hispanic); everyone else is tagged 'Non-Whites' (Z=1).

  • Adult Income dataset [url:adultIncome]: This dataset contains census information about individuals and is obtained from the UCI Machine Learning Repository [Dua:2017]. It has been extensively used in supervised prediction of the annual salary of individuals (whether or not an individual earns more than $50K per year). High salary is used as the positive outcome label. For our experiment, we consider features such as age-groups, education-levels, race, and sex. The column income represents the labels, containing 1 for individuals with more than $50K annual salary and 0 otherwise. The sex attribute is considered to be sensitive, with the binary group attribute Z distinguishing 'Female' and 'Male' individuals.

Observations. We follow an analysis similar to that on the simulated datasets, by constructing classifiers with different values of the group-fairness parameter and measuring infra-marginality. As described in Section 5.1, we first assume that the true outcome probability can be approximated by the outcome probabilities p̂ learnt by the classifier. For MEPS, we observe a steep increase in infra-marginality and error rate when the DI-fairness constraint is invoked (as shown in Figure 6). This result holds group-wise: when dividing the data by the sensitive group and considering the groups separately, Figure 7 shows that adding the fairness constraint causes a substantial fraction of the decisions to change compared to the classifier that maximizes accuracy within both groups, thus leading to high infra-marginality within both groups.

Figure 6: Comparing classification error, infra-marginality and group (un)fairness with increasing values for the group-fairness trade-off parameter in the meta-fair algorithm. These classifiers are trained using the MEPS dataset.
Figure 7: Groupwise comparison of classification error, infra-marginality and selection rate for MEPS dataset.

Similarly, for the Adult Income dataset, we observe an increase in the degree of infra-marginality and error rate when the FDR-fairness constraint is invoked (as shown in Figure 8). When looking at these metrics group-wise (Figure 9), we see that adding the fairness constraint causes a substantial fraction of the decisions to change (infra-marginality) in both groups. Moreover, it also increases the false discovery rates for both groups; in particular, the FDR becomes especially high for one of the groups.

Figure 8: Comparing classification error, infra-marginality and group (un)fairness with increasing values for the group-fairness trade-off parameter in the meta-fair algorithm. These classifiers are trained using the Adult Income dataset.
Figure 9: Groupwise comparison of classification error, infra-marginality and false discovery rate (FDR) for Adult Income dataset.
Figure 10: Comparison on the Adult Income dataset after removing (one or more) features: (a) reduced dataset after removing education-levels; (b) reduced dataset after removing education-levels and race.

So far, however, we have assumed that p̂ is a proxy for the true outcome probability. We now repeat the experiments on the Adult Income data after leaving out an important feature, education-levels, and then additionally leaving out race. For these reduced datasets, we assume that the true outcome probability can be derived from the full dataset with all features, while the classifier only has access to the reduced features. We observe that, even when all the features are not available, low infra-marginality with respect to the full dataset is associated with low classification error (Figure 10). We also observe that infra-marginality measured with respect to the p̂ values obtained from the reduced dataset is correlated with infra-marginality measured with respect to the p̂ values obtained from the full dataset. This observation supports the claim that, in real-world datasets (where we may not have full information on all relevant features), we can quantify infra-marginality using the p̂ obtained from an accuracy-maximizing classifier.

Summarizing the results on real-world datasets: we observe that, in both the Adult Income and MEPS datasets, low classification error leads to low infra-marginality. Further, we see that increasing the extent of group fairness may lead to worse accuracy and, in turn, a higher degree of infra-marginality. We observe a similar trend in the group-wise results for the Adult Income dataset. Finally, we investigate whether the unavailability of one or more features affects the above observations. We find that, even with fewer features, infra-marginality remains lowest when the classification error is lowest.

6 Concluding Discussion

We provided a metric for the infra-marginality caused by a machine learning classifier and characterized its relationship with accuracy and group-based fairness metrics. In cases where unbiased estimation of the true probability of the outcome, p_x, is possible, our theoretical results indicate the value of considering infra-marginality in addition to prominent group-fairness metrics. Moreover, we showed that optimizing for infra-marginality results in a linear constraint that can be handled efficiently, in contrast to the non-convex constraints that typical group-fairness metrics impose.

However, measuring infra-marginality requires estimating p_x for the decision-making scenario at hand, which is usually not easy. In some settings, such as the search for contraband by stopping cars on the highway [simoiu2017problem], we do obtain a proxy for p_x by making assumptions on the police officers' decision-making. This is also possible in other law-enforcement contexts, such as searching for prohibited items at airport security, wherein one can argue that security officials stop a person only if they detect a suspect object through the X-ray and, thus, the logged data of baggage searches can be assumed to yield an unbiased estimate of p_x. Further, airport security might decide to perform random searches, which can serve as "gold-standard" unbiased estimators of p_x.

When such data is not logged, we suggest actively changing the current decision-making process to introduce some unbiased data that can be used for estimating p_x. This can be done by adding probabilistic decisions, such as randomly deciding to search a person, awarding a loan to a small fraction of applicants, and so on. Decisions need not be fully random: recent work on multi-armed and contextual bandits [agarwal2017bandits] makes it possible to make optimized decisions while still collecting data for unbiased estimation of people's outcome probability, i.e., p_x, irrespective of the decision they received. While this involves considerable effort and collaboration with decision-makers, we believe that the twin benefits of an interpretable single threshold and a straightforward learning algorithm that does not constrain use of the full data for modeling outweigh the implementation costs.
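As a concrete illustration of the randomization idea (our own sketch; this is not the procedure of [agarwal2017bandits]): if the selection probability of every individual is logged, inverse-propensity weighting gives an unbiased estimate of population outcome rates from one-sided observations, which can then feed an estimate of p_x:

    import numpy as np

    def ipw_outcome_rate(selected, outcome, propensity):
        """Unbiased estimate of the population outcome rate from one-sided data:
        outcomes are observed only when selected == 1, but every individual's
        logged selection probability (propensity > 0) lets us reweight."""
        selected, outcome, propensity = map(np.asarray, (selected, outcome, propensity))
        return np.mean(selected * outcome / propensity)

    # Toy check: true outcome rate 0.3; selection randomized with logged probabilities.
    rng = np.random.default_rng(4)
    n = 100_000
    true_outcome = rng.binomial(1, 0.3, size=n)
    propensity = rng.uniform(0.05, 0.9, size=n)          # per-individual selection probability
    selected = rng.binomial(1, propensity)
    observed = np.where(selected == 1, true_outcome, 0)  # outcome recorded only if selected
    print(ipw_outcome_rate(selected, observed, propensity))   # approximately 0.3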

Finally, we acknowledge that there will always be cases when unbiased data collection or active intervention is not possible. This is the case especially in scenarios where outcome data is available only for people who received one of the decisions, but never both. For example, we may observe outcomes for only the people who received a loan or who were let out on bail. As a result, it is hard to know the underlying selection biases that might have impacted the inclusion of people in a particular dataset, and we leave deriving unbiased estimators for these problems as an interesting direction for future work.

Acknowledgments.

Arpita Biswas sincerely acknowledges the support of a Google PhD Fellowship Award. Siddharth Barman gratefully acknowledges the support of a Ramanujan Fellowship (SERB - SB/S2/RJN-128/2015) and a Pratiksha Trust Young Investigator Award.

References