1 Introduction
Consider a machine learning algorithm being used to make decisions in societally critical domains such as healthcare
[irene2018why, goodman2018machine], education [tierney2013fairness], criminal justice [angwin2016machine, berk2017fairness], policing [simoiu2017problem, goel2017combatting] or finance [furletti2002overview]. Since data on past decisions may include historical or societal biases, it is generally believed that optimizing accuracy can result in an algorithm that sustains the same biases and is thus unfair to underprivileged groups [romei2014multidisciplinary, barocas2016big, chouldechova2018frontiers]. Theoretically too, it has been shown that it is not possible to achieve calibration and fairness simultaneously [kleinberg2017inherent], and several other results show that satisfying multiple fairness constraints simultaneously is infeasible [chouldechova2017fair, corbett2017algorithmic]. Given this tradeoff between accuracy and fairness metrics, algorithms typically aim to maximize accuracy while also satisfying different notions of fairness, such as disparate impact [kamiran2012data, feldman2015certifying, zafar2017afairness], statistical parity [kamishima2012fairness, zemel2013learning, corbett2017algorithmic], and equalized odds [hardt2016equality, kleinberg2017inherent, woodworth2017learning]. Most of these fairness notions, however, enforce group-level constraints on pre-specified protected attributes and give no guarantees on fairness with respect to other sensitive attributes. That is, it is difficult to generalize these group-level constraints to be fair with respect to multiple sensitive attributes, which can lead to biases when individuals belong to multiple historically disadvantaged groups. As an example, a classifier that is constrained to be fair on race may end up introducing discrimination against women of the advantaged race, and vice versa. Moreover, being groupwise constraints, they provide no guarantees on fairness for individuals' outcomes within these groups.
In this paper, therefore, we consider a different notion of bias, inframarginality [simoiu2017problem, corbett2018measure], that can handle multiple sensitive attributes simultaneously and operates at the individual level rather than through group-level adjustments. The idea behind this individual-level fairness is that individuals with the same probability of an outcome (say, the same risk probability) receive the same decision, irrespective of their sensitive attributes. Any deviation from this ideal necessarily leads to misclassifying low-risk members as high risk and high-risk members as low risk, which in turn may harm the members of all groups. For instance, consider the medical domain, where doctors assess the severity of a person's illness (risk probability) and prioritize treatment accordingly. It may be acceptable to prioritize patients based on a reliable estimate of a patient's risk, but any deviation from this rule due to a group-fairness constraint on a particular demographic may deprive high-risk people of treatment, some of whom may in turn belong to another disadvantaged demographic. Hence, these group-level adjustments often introduce unintended biases and can be marginally unfair at an individual level; thus the name "inframarginality" (see Section 3 for a definition). Conceptually, therefore, mitigating inframarginality has a notable advantage: the fairness of a decision is based directly on the underlying outcome probability of each individual, rather than on post hoc group-level adjustments or constraints.
To remove inframarginality, Simoiu et al. [simoiu2017problem] propose taking decisions using a single threshold on the true outcome probability, whenever the outcome probability is known. (Footnote 1: In the above-mentioned example from the medical domain, this corresponds to selecting a patient, i.e., prioritizing treatment, if and only if the risk (outcome) probability of the patient is above a threshold.) Such a single-threshold classifier implements the high-level idea that legislation should apply equally to all individuals and not be based on group identity. A set of decisions (equivalently, a classifier) that conforms to this ideal is said to have zero inframarginality (and, thus, to be single-threshold fair) [simoiu2017problem, corbett2018measure].
Extending past work on inframarginality [simoiu2017problem], our main contribution is to quantify the notion. It is relevant to note that Simoiu et al. [simoiu2017problem] essentially treat inframarginality as a binary property: either a classifier suffers from inframarginality or it does not. Extending this construct, the current work defines the degree of inframarginality. Furthermore, using this definition, we prove theoretically that high accuracy of a classifier directly leads to low inframarginality. These results also hold groupwise: high accuracy, per group, implies low groupwise inframarginality.
Results on simulated and real-world datasets confirm this observation. First, we construct a set of simulated datasets with varying distributions of outcome per demographic group. Under all simulations, we find that training a machine learning model to maximize accuracy with respect to the outcome also yields low inframarginality. Specifically, the degree of inframarginality is within distance $\delta$ of the classification error. Here, the parameter $\delta$ depends solely on the underlying data set, i.e., it is fixed for the given problem instance and is independent of any classifier one considers. In addition, when we focus on individuals from each sensitive group separately, we find that maximizing groupwise accuracy results in low inframarginality for both groups. Second, we consider two datasets that are widely used for studying algorithmic fairness, namely the Adult Income and Medical Expense (MEPS) datasets, and demonstrate the connection between maximizing accuracy and lowering inframarginality. Since there is no ground truth for individual outcome probabilities, we create an evaluation testbed by developing classifiers with subsets of features and benchmarking inframarginality with respect to the outcome probabilities learnt using all features. This benchmark serves as an approximation of the true outcome probability. We find that even when a classifier is not trained on the true outcome risk, a similar positive trend holds between high accuracy and low inframarginality.
The close connection of inframarginality to accuracy also illustrates the inherent tradeoff with group fairness. Since groupwise fairness constraints necessarily reduce accuracy, they exacerbate inframarginality. Using the metafair algorithm [celis2018classification] for training group-fair classifiers, we find that increasing the weight on a groupwise fairness constraint (such as demographic parity or equal false discovery rates) increases inframarginality. This result is consistent for both simulated and real-world datasets: whenever constraints on accuracy optimization lead to an increase in a groupwise fairness metric, they also lead to an increase in inframarginality bias. We argue that these results present a difficult choice for ensuring fairness: groupwise constraints may be blind to fairness on other, unobserved groups, while lowering inframarginality increases individual-level fairness but may lead to group-level unfairness. Thus, as a practical measure, we propose maximizing groupwise accuracy for lowering individual-level bias whenever the true outcome probability is measurable.
While these results point to the importance of considering inframarginality, bounding it in practice is a challenge because doing so requires knowledge of the true outcome probability for each individual. This is often not available: the true probabilities must be learnt from datasets that often contain only one-sided information (for example, recidivism outcomes of only those individuals who were granted bail, or loan repayment outcomes of only those who were granted a loan). In such cases, we may not have correct estimates of the risk probabilities of the underlying population, and hence inframarginality remains a theoretical concept.
To be useful in practice, we need two conditions: first, that objective measures of the outcomes are available and, second, that the available data is sampled uniformly from the target population of interest (and not one-sided, or based on biased decisions). Happily, outcomes of interest are objectively recorded in many decision-making scenarios. For example, in the medical domain, outcomes are often categorical and objective (e.g., cured of a disease or not). Objective outcomes also occur in security-related decisions, such as searching a vehicle for a weapon or screening passengers at an airport for prohibited goods.
In all these cases, the outcome of interest is measurable and is not affected by underlying biases, unlike subjective outcomes such as success in school or work, awarding loans, etc.
The second condition is more stringent and requires that we observe the outcomes for a representative sample of the underlying population, unbiased by past decisions. This assumption can be satisfied either by utilizing additional knowledge of the decision-making process or by actively changing decision-making. In Section 6, we discuss potential approaches, such as assuming that a subset of decisions is calibrated to the true outcome risk [goel2017combatting, simoiu2017problem, pierson2018fast], obtaining a random sample of outcome data, randomizing the decision for a fraction of users, or using advanced strategies such as contextual bandits [agarwal2017bandits].
Overall, our results point to the importance of data design over statistical adjustments in achieving fairness in algorithmic decision-making, and to the limitations of depending on an available dataset for ensuring fairness. Rather than making post hoc adjustments to steer an algorithm towards a chosen outcome, or introducing external constraints, it is worthwhile to consider obtaining accurate and unbiased outcome measurements. Under these conditions, our work shows that the practice of optimizing for groupwise accuracy also leads to low bias (low inframarginality), and allows improvements in machine learning to translate directly into improvements in fairness.
To summarize, we make the following contributions:

Our conceptual contribution is to develop a quantitative measure that characterizes the problem of inframarginality. Extending past work [simoiu2017problem, corbett2018measure] on inframarginality in algorithmic fairness, we propose a general formulation that enables a measure of discrimination under the definition of single-threshold fairness (Section 3).

Second, we show that inframarginality has a striking property: within an additive margin, the more accurate the classifier, the lower is its degree of inframarginality (Section 4). We provide a rigorous proof of this claim in Theorem 1. This result asserts that, for well-behaved data sets and a given fairness threshold $t$, higher accuracy points to lower inframarginality. Complementarily, an inaccurate classifier will necessarily induce inframarginality, to a certain degree. Moreover, this result holds groupwise (Corollary 1), implying that higher groupwise accuracy reduces the inframarginality problem per group. We also provide two propositions that identify relevant settings wherein Theorem 1 can be applied.

Third, when the classifier threshold is not fixed, we provide an algorithm for learning a classifier that optimizes accuracy subject to inframarginality constraints, assuming that the true outcome probabilities are available (Section 4.2). We show that the problem reduces to optimization under a linear constraint and hence that we can efficiently find an optimal solution. This is notably in contrast to prominent fairness metrics, which introduce non-convex (fairness) constraints in the underlying learning algorithms.
2 Background and Related Work
Datasets that collect past decisions on individuals and their resultant outcomes often reflect prevailing societal biases [barocas2016big]. These biases correspond to discrimination in decision-making, in recording outcomes, or both. During decision-making, individuals from a specific group may be less preferred for a favorable decision, thus reducing their chances of appearing in a dataset with favorable outcomes. This is especially the case when data recording is one-sided: we observe the outcomes only for those who received a favorable decision, such as being awarded a job, loan or bail. In addition, discrimination can also occur when recording outcomes, e.g., when measuring hard-to-measure outcomes such as defining "successful" job candidates, employees or students. Due to these biases in dataset collection, building a decision-making algorithm by maximizing accuracy on the dataset perpetuates the bias in the selection of individuals or the measurement of a desirable outcome.
To counter this bias, various groupwise statistical constraints have been proposed for a decision classifier that stipulate equitable treatment of different demographic groups. Demographic or statistical parity says that a fair algorithmic prediction should be independent of the sensitive attribute and thus each demographic group should have the same fraction of favorable decisions [kamishima2012fairness, zemel2013learning, corbett2017algorithmic]. Equalized odds [hardt2016equality, kleinberg2017inherent, woodworth2017learning] says that, for a fair algorithmic prediction, the true positive rates and the false positive rates on different sensitive demographics should be as equal as possible. Combined with the accuracy objective, equalized odds implies similar, high accuracy for every sensitive demographic. Disparate impact [kamiran2012data, feldman2015certifying, zafar2017afairness] refers to the impact of policies that affect one sensitive demographic more than another, even though the rules applied are formally neutral. Mathematically, the probability of being positively classified should be the same for different sensitive demographics. Chouldechova and Roth [chouldechova2018frontiers], along with Friedler et al. [friedler2018comparative], provide a survey of various fairness notions and algorithms to incorporate them. However, in practice, the output of these group-fair algorithms may show worse predictions for all groups; for example, while trying to ensure equal false positive rates across two demographic groups, the predictions may end up increasing false positive rates for both groups.
The difficulty of ensuring fairness for different groups suggests an alternative definition of fairness based on individual-level constraints. Individual fairness [dwork2012fairness, zemel2013learning] proposes that similar individuals should be treated as similarly as possible. Thus, rather than stratifying users by pre-specified groups, individual-level fairness constraints use available data (e.g., demographics) to define a similarity measure between individuals and then enforce equitable treatment for all similar individuals. The question, then, is how to define a suitable similarity measure. Rather than choosing variables to define similarity on, Simoiu et al. [simoiu2017problem] propose defining similarity between two individuals based on their true probabilities of the outcome. That is, people with the same underlying probability of a favorable outcome are the most similar to each other. They define a classifier as having inframarginality bias if people with the same probability of the outcome are given different decisions [simoiu2017problem, pierson2018fast, corbett2018measure]. Crucially, this definition depends on a measurement of the true outcome probability, but does not restrict analysis to a few pre-specified groups.
Given these considerations, different fairness notions may be suitable for different settings. When all important marginalized groups are defined (e.g., by law), group-fairness constraints are more relevant. When true outcome probabilities are known (such as in randomized search/inspection decisions in security applications), inframarginality constraints become a better alternative. In this paper, we focus on the latter and describe the value of considering inframarginality in fairness decisions. In Section 3, we extend past work to propose a general notion of inframarginality; in Section 4, we prove our main result on the connection between accuracy and low inframarginality; and in Section 5, we provide extensive empirical results showing the inherent tradeoff between typical group-fairness constraints and inframarginality.
3 Defining Inframarginality
In this section, we define and characterize the degree of inframarginality. This measure quantifies the extent to which a classifier violates the notion of single-threshold fairness, i.e., it quantifies the problem of inframarginality identified by Simoiu et al. [simoiu2017problem].
3.1 Problem Setup
We consider a binary classification problem over a set of instances $X$, wherein label $1$ denotes the positive class and label $0$ denotes the negative class. E.g., in the search-for-contraband setup, class label $1$ would indicate that the individual (instance) is in possession of contraband and, complementarily, label $0$ would correspond to the case in which the individual is not carrying contraband. Conforming to the standard framework used in binary classification, we will assume that feature vectors (equivalently, data points) $x \in X$ are drawn from the data set via a feature distribution. Furthermore, for an instance $x$, let $p_x \in [0, 1]$ be the inherent probability of $x$ being in the positive class; in the previous example, $p_x$ is the endowed probability that individual $x$ is carrying contraband.

Under single-threshold fairness, a set of (binary) decisions is deemed fair if and only if it is obtained by applying a single threshold on the $p_x$s of the instances (irrespective of the instances' sensitive attributes, such as race, ethnicity, and gender). In the above-mentioned contraband example, this corresponds to searching an individual $x$ if and only if $p_x$ is above a (universally fixed) threshold $t \in [0, 1]$.
3.2 Degree of Inframarginality
For ease of presentation, we will identify each data point with its feature vector $x$ and use $y_x \in \{0, 1\}$ to denote the (outcome) label of the data point $x$. As mentioned before, $p_x$ is the outcome probability of each $x \in X$. Therefore, for every $x$, the binary label $y_x$ is equal to one with probability $p_x$ and, otherwise (with probability $1 - p_x$), we have $y_x = 0$. The standard classification exercise corresponds to learning a classifier that optimizes accuracy with respect to the labels $y_x$.

For a binary classifier $C : X \to \{0, 1\}$, we will use $\alpha_C$ to denote its accuracy with respect to the labels $y_x$, i.e., $\alpha_C := \mathbb{E}_x\big[\mathbb{1}\{C(x) = y_x\}\big]$; here the expectation is with respect to the underlying feature distribution (Footnote 2: That is, the random sample $(x, y_x)$ is drawn from an underlying (feature, label) joint distribution and, conditioned on a feature $x$, we have $\Pr[y_x = 1] = p_x$.) and $\mathbb{1}\{C(x) = y_x\}$ is the indicator random variable which denotes whether the output of the classifier, $C(x)$, is equal to $y_x$ or not. Note that high accuracy simply implies a high value of $\alpha_C$.

To formally address single-threshold fairness, we define a (deterministic) label, $f_x \in \{0, 1\}$, for each data point $x$. Semantically, $f_x$ is the (binary) outcome of an absolutely fair (in the single-threshold sense) decision applied to $x$. This, by definition, means that the $f_x$s are obtained by applying the same (fixed) threshold on the $p_x$s across all instances: for a fixed fairness threshold $t \in [0, 1]$, we have $f_x = 1$ if $p_x \ge t$, and $f_x = 0$ if $p_x < t$. With this notation in hand, we define the central construct of the present work.
Definition 1 (Inframarginality of a Classifier).
With respect to a given threshold $t$, the degree of inframarginality, $I_C$, of a classifier $C$ is defined as

(1)  $I_C := \mathbb{E}_x\big[\mathbb{1}\{C(x) \neq f_x\}\big]$

Here, the expectation is with respect to the feature distribution over the data set $X$. In addition, for each $x \in X$, the label $f_x = 1$ if $p_x \ge t$, and $f_x = 0$ otherwise.
A classifier $C$ conforms to the ideal of single-threshold fairness iff $I_C = 0$. Furthermore, the quantity $I_C$ can be interpreted as the extent to which the classifier's outputs (i.e., the $C(x)$s) differ from the ideal labels (i.e., from the single-threshold benchmarks $f_x$). Indeed, the smaller the value of $I_C$, the lower is $C$'s inframarginality.
We will use $P$ and $F$, respectively, to denote the distributional form of the collection of generative probabilities $\{p_x\}_{x \in X}$ and labels $\{f_x\}_{x \in X}$. Formally, $P$ is a discrete distribution wherein a probability mass of $\rho_v$ is placed on each value $v \in [0, 1]$; here, $\rho_v$ is the fraction of data points (under the feature distribution) with the property that $p_x = v$. Similarly, $F$ is the discrete distribution that places on $1$ a mass equal to the fraction of data points with $f_x = 1$, and the remaining mass on $0$.

Note that the distributions $P$ and $F$ are supported on the distinct $p_x$ values (across $x \in X$) along with $0$ and $1$. Write $\delta$ to denote the weighted distance between the two distributions; specifically, the distance here is computed by normalizing (i.e., weighing) with respect to the feature distribution. Formally, (Footnote 3: Note that if the feature distribution picks data points uniformly at random from $X$, then this equation simply implies that the distance between the two distributions is equal to the average of the differences between the $p_x$ and $f_x$ values.)

(2)  $\delta := \mathbb{E}_x\big[\,|p_x - f_x|\,\big]$

We will also refer to $\delta$ as the distance between the generative probabilities $\{p_x\}_x$ and the labels $\{f_x\}_x$. Note that this distance is a property of the data set and the underlying feature distribution: it is fixed for the given problem instance and is independent of any classifier one considers.
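When the outcome probabilities are known, the distance $\delta$ and the degree of inframarginality can be computed directly from their definitions. The following Python sketch illustrates this on hypothetical probabilities, assuming a uniform feature distribution; the function names are ours, chosen for illustration.

```python
# Illustrative sketch (hypothetical data): computing the distance between the
# outcome probabilities and the single-threshold labels, and the degree of
# inframarginality of a decision vector. Uniform feature distribution assumed.

def fair_labels(p, t):
    """Single-threshold fair labels: f_x = 1 iff p_x >= t."""
    return [1 if px >= t else 0 for px in p]

def distance(p, t):
    """Average |p_x - f_x|: the distance between probabilities and labels."""
    f = fair_labels(p, t)
    return sum(abs(px - fx) for px, fx in zip(p, f)) / len(p)

def infra_marginality(decisions, p, t):
    """Fraction of individuals whose decision differs from the fair label."""
    f = fair_labels(p, t)
    return sum(1 for cx, fx in zip(decisions, f) if cx != fx) / len(p)

# Outcome probabilities for six hypothetical individuals, threshold 0.5.
p = [0.1, 0.2, 0.4, 0.6, 0.8, 0.9]
print(distance(p, 0.5))                               # = 1.4/6, roughly 0.233
print(infra_marginality([0, 0, 0, 1, 1, 1], p, 0.5))  # matches fair labels: 0.0
```

A single-threshold decision vector attains zero inframarginality, while any deviation from it is counted individual by individual.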
4 Accuracy and Inframarginality
The key result of this section is the following theorem, which establishes that as long as the distance $\delta$ between $P$ and $F$ is small, an accurate classifier will also have low inframarginality. Complementarily, under a small distance, an inaccurate classifier will necessarily induce high inframarginality.
4.1 High accuracy, low inframarginality
Theorem 1.
For a data set $X$, let $\delta$ be the distance between the generative (outcome) probabilities $\{p_x\}_x$ and the (single-threshold) labels $\{f_x\}_x$. Then, for any binary classifier $C$ with accuracy $\alpha_C$, the degree of inframarginality satisfies $(1 - \alpha_C) - \delta \le I_C \le (1 - \alpha_C) + \delta$.
Proof.
Using the definition of the accuracy, $\alpha_C$, of classifier $C$, and the fact that $C(x)$ is binary valued while $y_x = 1$ with probability $p_x$, we get

(3)  $1 - \alpha_C = \mathbb{E}_x\big[\,|C(x) - p_x|\,\big]$

Analogously, since $C(x)$ and $f_x$ are both binary valued, the degree of inframarginality, $I_C$, can be expressed as

(4)  $I_C = \mathbb{E}_x\big[\,|C(x) - f_x|\,\big]$

Subtracting (3) from (4) and considering absolute values, we obtain

(5)  $\big|I_C - (1 - \alpha_C)\big| = \Big|\mathbb{E}_x\big[\,|C(x) - f_x| - |C(x) - p_x|\,\big]\Big| \le \mathbb{E}_x\big[\,|p_x - f_x|\,\big] = \delta$

The last inequality follows from the triangle inequality, $\big|\,|C(x) - f_x| - |C(x) - p_x|\,\big| \le |p_x - f_x|$, and the definition of $\delta$ in equation (2). ∎
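Theorem 1 can also be checked numerically. The sketch below (hypothetical data, uniform feature distribution) enumerates every binary decision vector on a small data set and verifies that the gap between the expected classification error and the degree of inframarginality never exceeds the distance between the probabilities and the single-threshold labels.

```python
# Exhaustive numerical check of Theorem 1 on hypothetical data: for every
# binary decision vector c, |I_C - (1 - alpha_C)| <= delta, where
# 1 - alpha_C = E|c_x - p_x| and I_C = E|c_x - f_x|.
import itertools

p = [0.05, 0.3, 0.45, 0.55, 0.7, 0.95]   # hypothetical outcome probabilities
t = 0.5                                   # fairness threshold
f = [1 if px >= t else 0 for px in p]     # single-threshold fair labels
delta = sum(abs(px - fx) for px, fx in zip(p, f)) / len(p)

for c in itertools.product([0, 1], repeat=len(p)):   # every binary classifier
    error = sum(abs(cx - px) for cx, px in zip(c, p)) / len(p)   # 1 - alpha_C
    infra = sum(abs(cx - fx) for cx, fx in zip(c, f)) / len(p)   # I_C
    assert abs(infra - error) <= delta + 1e-12
print("bound holds for all", 2 ** len(p), "classifiers")
```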
This result shows that, for any classifier, high accuracy points to low inframarginality. Notably, this result applies groupwise: this "inverse" connection holds even if we consider accuracy and inframarginality separately for different subsets (groups) of the data set $X$. Specifically, Theorem 1 can be applied, as is, to obtain Corollary 1. Here, say the set of data points is (exogenously) partitioned into two groups (i.e., disjoint subsets) $X_1$ and $X_2$. Also, let the outcome probabilities of the two sets be $\{p_x\}_{x \in X_1}$ and $\{p_x\}_{x \in X_2}$, respectively, along with the single-threshold labels $\{f_x\}_{x \in X_1}$ and $\{f_x\}_{x \in X_2}$. Let $\delta_1$ (respectively, $\delta_2$) denote the distance between the probabilities and labels within $X_1$ (respectively, $X_2$). Following the above-mentioned notational conventions, the accuracy and inframarginality measures of a classifier $C$ for group $g \in \{1, 2\}$ are $\alpha_{C,g}$ and $I_{C,g}$, respectively. With this notation in hand, we have the following groupwise guarantee.
Corollary 1.
For a data set comprised of two groups, $X_1$ and $X_2$, and any classifier $C$, the degree of inframarginality in each group satisfies the following bounds: $I_{C,1} \le (1 - \alpha_{C,1}) + \delta_1$ and $I_{C,2} \le (1 - \alpha_{C,2}) + \delta_2$.
Remark: Corollary 1 provides some useful insights towards achieving low inframarginality. It quantitatively highlights the principle that, for mitigating groupwise inframarginality, groupwise accuracy can be a better metric than overall accuracy. That is, in relevant settings, aiming for classifiers that maximize the minimum accuracy across groups (i.e., adopting a Rawlsian perspective on accuracy) can lead to fairer decisions than solving for, say, classifiers that enforce the same accuracy across groups or classifiers that maximize overall accuracy. In particular, Corollary 1 ensures that a max-min (Rawlsian) guarantee on accuracy translates into a max-min guarantee on inframarginality, with additive errors of at most $\delta_1$ and $\delta_2$.
The following two propositions identify relevant settings wherein Theorem 1 can be applied. The proofs of these propositions are direct and are provided in the supplementary materials. The first proposition addresses data sets in which the probabilities $p_x$ (across data points $x$) are spread out and do not sharply peak around a specific value. Formally, we say that a data set is $L$-Lipschitz if, for any interval $[a, b] \subseteq [0, 1]$, the fraction of data points $x$ with $p_x \in [a, b]$ is at most $L\,(b - a)$. Note that, for settings in which the cdf of $P$ is smooth, the maximum slope of the cdf corresponds to the Lipschitz constant of the data set.
Proposition 1.
If a data set is $L$-Lipschitz and the underlying fairness threshold $t \in [0, 1]$, then the distance between the outcome probabilities and the single-threshold labels is at most $\frac{L}{2}\big(t^2 + (1 - t)^2\big) \le \frac{L}{2}$. Here, we assume that the underlying feature distribution selects instances from $X$ uniformly at random.
Proof.
We partition $[0, 1]$ into subintervals of length $\epsilon$ each; specifically, the subintervals are $[k\epsilon, (k+1)\epsilon)$, with integer $0 \le k < 1/\epsilon$. The $L$-Lipschitz condition ensures that the fraction of data points in each subinterval is at most $L\epsilon$. Each point below the threshold contributes $p_x$ to the distance, and each point above it contributes $1 - p_x$; summing these per-interval contributions and letting $\epsilon \to 0$ yields the stated bound.
∎
The next proposition observes that if, in $P$, the probability mass is spread sufficiently far away from the threshold, then again the distance between the outcome probabilities and the single-threshold labels is appropriately bounded. The result shows that Theorem 1 is useful, in particular, when $P$ is a bimodal distribution, with the two modes being close to zero and one, respectively. Formally, we will say that a distribution $P$ (supported on $[0, 1]$) is $(\beta, \epsilon)$-spread, with respect to the threshold $t$, iff, under $P$, the probability mass in the interval $[\epsilon, 1 - \epsilon]$ is at most $\beta$. Here, $\epsilon \in (0, 1/2)$ and $t \in (\epsilon, 1 - \epsilon)$.
Proposition 2.
If, for a data set, the outcome probability distribution $P$ is $(\beta, \epsilon)$-spread and the underlying fairness threshold $t \in (\epsilon, 1 - \epsilon)$, then the distance between the outcome probabilities and the single-threshold labels is at most $(1 - \beta)\,\epsilon + \beta \max\{t, 1 - t\}$. Here, we assume that the underlying feature distribution selects instances from $X$ uniformly at random.

Proof.
If $P$ is $(\beta, \epsilon)$-spread, then the distance (between this distribution and $F$) is maximized when a $\beta$ fraction of the data points have $p_x$ values within $[\epsilon, 1 - \epsilon]$, each contributing at most $\max\{t, 1 - t\}$, and the rest of the data points (accounting for the remaining $1 - \beta$ fraction) have $p_x$ value equal to $\epsilon$ (or $1 - \epsilon$), each contributing at most $\epsilon$. Hence, the distance between $P$ and $F$ is upper bounded as follows: $\delta \le (1 - \beta)\,\epsilon + \beta \max\{t, 1 - t\}$. ∎
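As a quick sanity check of Proposition 2, the following sketch constructs a hypothetical bimodal data set, with most probability mass near zero and one, and verifies that the distance stays below the stated bound (here with $\beta = 0.1$, $\epsilon = 0.05$ and $t = 0.5$).

```python
# Sanity check of Proposition 2 on a hypothetical bimodal data set: the mass
# in [eps, 1 - eps] is beta, so the distance delta is at most
# (1 - beta) * eps + beta * max(t, 1 - t). Uniform feature distribution.

eps, t = 0.05, 0.5
# 45 low-risk points, 45 high-risk points, 10 points at the threshold.
p = [0.02] * 45 + [0.97] * 45 + [0.5] * 10
beta = sum(1 for px in p if eps <= px <= 1 - eps) / len(p)  # mass in [eps, 1-eps]

f = [1 if px >= t else 0 for px in p]
delta = sum(abs(px - fx) for px, fx in zip(p, f)) / len(p)
bound = (1 - beta) * eps + beta * max(t, 1 - t)
assert delta <= bound
print(delta, bound)   # delta is roughly 0.0725, bound roughly 0.095
```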
4.2 Optimal Classifiers Under Inframarginality Constraints
The above subsection characterized inframarginality under a fixed fairness threshold $t$. However, in practice, it is possible to choose the classifier so as to ensure high accuracy along with a low degree of inframarginality. Therefore, we now present an efficient algorithm for finding classifiers with as high an accuracy as possible, under inframarginality constraints. We address this optimization problem in a setup wherein the data set $X$ is finite and the outcome probabilities, $p_x$s, are known explicitly. In this setup (i.e., given the $p_x$s), prominent groupwise fairness notions map to non-convex constraints. Hence, maximizing accuracy subject to a groupwise fairness constraint typically requires relaxations (leading to approximate solutions) or heuristics; see, e.g., [celis2018classification] and references therein. We will show that, by contrast, an upper bound on inframarginality can be expressed as a linear constraint and, hence, we can efficiently find an optimal (with respect to accuracy) classifier that satisfies a specified inframarginality bound.

Let $x$ be a random sample from the feature distribution over the data set $X$, with label $y_x \in \{0, 1\}$. Recall that, given a universal threshold $t$ (which imposes the single-threshold fairness criterion), we define the label $f_x = \mathbb{1}\{p_x \ge t\}$, for each $x$. Let $C$ be any classifier with accuracy $\alpha_C = \mathbb{E}_x\big[\mathbb{1}\{C(x) = y_x\}\big]$. Furthermore, the degree of inframarginality of $C$ is defined as $I_C = \mathbb{E}_x\big[\mathbb{1}\{C(x) \neq f_x\}\big]$.
Given a parameter $\gamma \in [0, 1]$, we consider the problem of maximizing the accuracy, over all classifiers, subject to the constraint that the degree of inframarginality is at most $\gamma$:

maximize  $\alpha_C$
subject to  $I_C \le \gamma$
Observe that, because $y_x = 1$ with probability $p_x$ and $C(x)$ is binary valued, the complement of the objective function (i.e., the error rate) can be expressed as

$1 - \alpha_C \;=\; \mathbb{E}_x\big[\,|C(x) - p_x|\,\big] \;=\; \mathbb{E}_x\big[\,p_x + (1 - 2p_x)\,C(x)\,\big]$

Similarly, using the fact that $f_x$ and $C(x)$ are binary valued, for inframarginality we get

$I_C \;=\; \mathbb{E}_x\big[\,|C(x) - f_x|\,\big] \;=\; \mathbb{E}_x\big[\,f_x + (1 - 2f_x)\,C(x)\,\big]$

Therefore, the above maximization of accuracy (equivalently, minimization of the error rate) subject to the degree of inframarginality being upper bounded by $\gamma$, over classifiers $C$, can be equivalently rewritten as follows. Write $c_x := C(x)$ and note that the quantities $p_x$, $f_x$, $\mathbb{E}_x[p_x]$, and $\mathbb{E}_x[f_x]$ do not depend on the classifier $C$.

minimize  $\mathbb{E}_x\big[(1 - 2p_x)\,c_x\big]$
subject to  $\mathbb{E}_x\big[(1 - 2f_x)\,c_x\big] \le \gamma - \mathbb{E}_x[f_x]$, and $c_x \in \{0, 1\}$ for all $x \in X$
This is an integer linear program with (binary) decision variables $\{c_x\}_{x \in X}$. Now consider its linear programming relaxation, obtained by letting $c_x \in [0, 1]$, and denote its optimum by $\mathrm{LP}^*$. Using Lagrange multipliers and strong duality, the optimum of the linear relaxation is given by

(6)  $\mathrm{LP}^* \;=\; \max_{\lambda \ge 0}\; \min_{c \in [0,1]^X}\; \mathbb{E}_x\big[(1 - 2p_x)\,c_x\big] + \lambda\Big(\mathbb{E}_x\big[(1 - 2f_x)\,c_x\big] - \gamma + \mathbb{E}_x[f_x]\Big)$

For any fixed $\lambda \ge 0$, the inner minimum is attained by the classifier $C_\lambda(x) = \mathbb{1}\big\{p_x + \lambda f_x \ge \frac{1 + \lambda}{2}\big\}$, since $c_x = 1$ is optimal exactly when its coefficient $(1 - 2p_x) + \lambda(1 - 2f_x)$ is nonpositive.

With the probabilities $p_x$ in hand, we can solve the dual of the linear programming relaxation to efficiently compute the optimal solution of (6), i.e., compute an optimal value $\lambda^*$ of the Lagrange multiplier (dual variable). Given the optimal $\lambda^*$, we know that the optimal classifier $C^*$ satisfies

$C^*(x) = 1 \;\iff\; p_x + \lambda^* f_x \ge \frac{1 + \lambda^*}{2}$

In other words, $C^*$ applies the threshold $\frac{1 - \lambda^*}{2}$ to instances with $f_x = 1$ and the threshold $\frac{1 + \lambda^*}{2}$ to instances with $f_x = 0$; since $f_x$ is itself obtained by thresholding $p_x$ at $t$, this amounts to a single threshold on $p_x$, namely $t$ clamped to the interval $\big[\frac{1 - \lambda^*}{2}, \frac{1 + \lambda^*}{2}\big]$. Note that this optimal solution of the linear programming relaxation already yields a binary classifier $C^*$, which means that $C^*$ is also an optimal solution of the underlying integer linear program.
Remark: It is well-known that the Bayes classifier, giving the decisions $C(x) = \mathbb{1}\{p_x \ge 1/2\}$, achieves maximum accuracy over all classifiers. Our derivation shows that, even with inframarginality constraints, optimal classifiers continue to be single-threshold classifiers. Also, it is relevant to note that the above-mentioned method can be used to efficiently solve the complementary problem of minimizing inframarginality subject to a (specified) lower bound on accuracy.
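The optimal classifier derived above admits a compact implementation once the outcome probabilities are known. The Python sketch below (hypothetical data) uses the thresholding rule $C_\lambda(x) = 1$ iff $p_x + \lambda f_x \ge (1 + \lambda)/2$; instead of solving the linear-programming dual exactly, it simply sweeps the multiplier $\lambda$ upward until the inframarginality constraint is met, which is a simplification of, not identical to, the dual-based method in the text.

```python
# Simplified sketch of the constrained classifier of Section 4.2 (hypothetical
# data). For a fixed Lagrange multiplier lam, the accuracy-optimal decisions
# are C(x) = 1 iff p_x + lam * f_x >= (1 + lam) / 2; sweeping lam upward
# trades accuracy for a lower degree of inframarginality.

def lagrangian_classifier(p, f, lam):
    return [1 if px + lam * fx >= (1 + lam) / 2 else 0 for px, fx in zip(p, f)]

def constrained_classifier(p, t, gamma, step=0.01):
    """Smallest swept lam whose decisions have inframarginality <= gamma."""
    f = [1 if px >= t else 0 for px in p]
    lam = 0.0
    while lam <= 1.0:
        c = lagrangian_classifier(p, f, lam)
        infra = sum(1 for cx, fx in zip(c, f) if cx != fx) / len(p)
        if infra <= gamma:
            return c, lam
        lam += step
    return f, 1.0    # at lam = 1 the decisions reproduce the fair labels

p = [0.05, 0.3, 0.45, 0.55, 0.7, 0.95]
c, lam = constrained_classifier(p, t=0.4, gamma=0.0)
print(c)    # decisions now coincide with the single-threshold fair labels
```

With $\gamma$ large, the sweep stops at $\lambda = 0$ and recovers the Bayes classifier; as $\gamma$ shrinks, the decisions move towards the single-threshold fair labels.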
5 Empirical Evaluation
In this section, our primary aim is to provide empirical evidence for the developed theoretical guarantees and to answer the following questions:

Does high accuracy (low classification error) lead to low inframarginality for a classifier?

How does the relationship between classification error and inframarginality change when we consider them separately for each protected group?

What is the nature of the tradeoff between optimizing a classifier for group-fairness measures versus inframarginality?
5.1 Experimental Setup
We answer these questions by measuring accuracy, inframarginality and group-fairness metrics for machine learning classifiers under a wide range of datasets. Since estimation of inframarginality depends on knowledge of the true outcome distribution, we first present results on simulated datasets where we control the data generation process. The distributions are chosen to provide a thorough understanding of the relationship between accuracy and the degree of inframarginality. We then present results on two real-world datasets that have been used in prior empirical work on algorithmic fairness: Adult Income [url:adultIncome] (for predicting annual income) and MEPS [url:meps] (for predicting utilization using medical expenditure data). These datasets satisfy the two properties that are required for empirical application of inframarginality: first, the outcomes measured are numerical quantities that are less likely to be subjectively biased and, second, they can be assumed to be a representative sample of the underlying population. The Adult Income dataset contains a sample of adults in the United States based on the Census data from 1994, and the MEPS dataset is derived from a nationally representative survey of people's healthcare expenditure in the US.
Measuring inframarginality. On the simulated datasets, we estimate inframarginality using its definition in Equation 1. On realworld datasets, we do not have true outcome probability and thus use an approximation. We assume that the estimated outcome probability () using a classifier can be considered as a proxy for the true outcome probability. Essentially, this assumption implies that all the relevant variables for estimating the outcome are available in the dataset (and that we have an optimal learning algorithm to estimate the outcome probability).
(7) 
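The precise expression in Equation (7) is elided in this version. As an illustrative sketch only (not the paper's exact formula), inframarginality can be estimated as the fraction of individuals whose classifier decision disagrees with a single-threshold rule applied to the proxy outcome probabilities, consistent with the later discussion of dataset S5:

```python
def degree_of_inframarginality(p_hat, decisions, threshold=0.5):
    """Estimate inframarginality as the fraction of individuals whose
    decision differs from the single-threshold rule on the proxy
    outcome probability p_hat (an illustrative approximation).

    p_hat     -- estimated outcome probabilities (proxy for the true ones)
    decisions -- 0/1 decisions of the classifier under study
    threshold -- assumed decision threshold of the single-threshold rule
    """
    optimal = [1 if p > threshold else 0 for p in p_hat]
    changed = sum(o != d for o, d in zip(optimal, decisions))
    return changed / len(p_hat)
```

Under this sketch, a classifier that flips the decision for one of four individuals relative to the threshold rule would score 0.25.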
To assess the sensitivity of the results to settings where the true outcome probability differs from the one estimated by the learnt classifier, we develop the following method: we progressively remove features from a given dataset, train a classifier on the reduced dataset, and compare the inframarginality on the outcome probability estimated from this reduced dataset to the "true" inframarginality on the outcome probability estimated from the full dataset (thus treating the latter as the true outcome probability). That is, we use the following measure for inframarginality.
(8) 
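A minimal sketch of this sensitivity check, under the same illustrative disagreement-based measure (the probability values and variable names below are hypothetical): the decisions of a classifier trained on the reduced features are compared against the threshold rule applied to the full-feature probability estimates, which are treated as the "true" ones.

```python
def infra_wrt(p_reference, decisions, threshold=0.5):
    # Fraction of decisions disagreeing with the threshold rule applied
    # to a reference estimate of the outcome probability.
    ref = [1 if p > threshold else 0 for p in p_reference]
    return sum(r != d for r, d in zip(ref, decisions)) / len(ref)

# Hypothetical probability estimates from classifiers trained on the
# full vs. reduced feature sets.
p_full    = [0.92, 0.15, 0.70, 0.35, 0.60]   # treated as the "true" probabilities
p_reduced = [0.80, 0.30, 0.45, 0.40, 0.55]

# Decisions of the reduced-feature classifier (thresholding its own estimates).
decisions = [1 if p > 0.5 else 0 for p in p_reduced]

infra_vs_full    = infra_wrt(p_full, decisions)     # Equation (8)-style comparison
infra_vs_reduced = infra_wrt(p_reduced, decisions)  # proxy comparison, zero by construction
```

Here the reduced-feature classifier agrees with its own threshold rule everywhere, but disagrees with the full-feature rule on one of the five instances.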
While we demonstrate the theoretical results using the above assumption on the real-world datasets, in practice we recommend a combination of domain knowledge and active data collection to estimate the true outcome probabilities and, correspondingly, the true inframarginality. We discuss these possibilities in Section 6.
Measuring group-fairness. For each classifier on the simulated and real-world datasets, we compare inframarginality to prominent group-fairness metrics. We do so using the metafair framework proposed by Celis et al. [celis2018classification]. In this framework, a tradeoff parameter balances between maximizing accuracy and achieving group-fairness. The higher the value of this parameter, the greater the focus on achieving group-fairness; at its lowest setting, the classifier maximizes accuracy with no group-fairness constraints. For our evaluation, we consider two group-fairness notions:

Statistical Parity (SP) or Disparate Impact (DI): The aim is to achieve a low value for the disparity between the groups' selection rates, where the selection rate for a group is the fraction of individuals within that group who receive the favorable outcome from the classifier.

Equal False Discovery Rates (FDR): The aim is to achieve a low value for the disparity between the groups' false discovery rates, where the false discovery rate for a group is the fraction of individuals who are incorrectly classified, among those in the group who received the favorable outcome from the classifier.
We use the implementation of the metafair algorithm provided in the Python package AI Fairness 360 [aif360oct2018]. Finally, we report the classification error rate, i.e., (1 − accuracy), the degree of inframarginality, and the value of the group (un)fairness metric.
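For concreteness, the two notions can be computed as follows on binary decisions and labels (a hedged sketch on toy data; the exact disparity expressions used by the metafair framework may differ):

```python
def selection_rate(decisions, groups, g):
    # Fraction of group-g individuals who received the favorable outcome.
    grp = [d for d, z in zip(decisions, groups) if z == g]
    return sum(grp) / len(grp)

def false_discovery_rate(decisions, labels, groups, g):
    # Among group-g individuals who received the favorable outcome,
    # the fraction whose true label is unfavorable.
    sel = [y for d, y, z in zip(decisions, labels, groups) if z == g and d == 1]
    return sum(1 for y in sel if y == 0) / len(sel)

# Toy data: 0/1 decisions, true labels, and binary group membership.
decisions = [1, 1, 0, 1, 0, 1]
labels    = [1, 0, 0, 1, 1, 1]
groups    = [0, 0, 0, 1, 1, 1]

sr0, sr1 = selection_rate(decisions, groups, 0), selection_rate(decisions, groups, 1)
di_ratio = min(sr0, sr1) / max(sr0, sr1)   # a ratio near 1 indicates statistical parity
fdr_gap  = abs(false_discovery_rate(decisions, labels, groups, 0)
               - false_discovery_rate(decisions, labels, groups, 1))
```

On this toy data the two groups have equal selection rates (DI ratio of 1) yet very different false discovery rates, illustrating why the two notions can pull a classifier in different directions.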
5.2 SimulationBased Datasets
For simplicity, we consider datasets with a single sensitive attribute (e.g., race or gender) and two additional attributes (e.g., age, income, etc.) that denote relevant features of an individual. We assume that the sensitive attribute is binary and the additional attributes are continuous. We also assume that the outcome is binary and depends on the attributes of an individual. Given a classifier that predicts the outcome based on these features, our central goal is to compare its misclassification rate, inframarginality and group-fairness.
Specifically, we assume that the sensitive attribute is binary, the two additional attributes are real-valued, and the outcome is binary. We create datasets using a generative model in which the attributes of an individual are simulated based on their sensitive attribute and outcome. We further assume that the two attributes are conditionally independent given the sensitive attribute and the outcome, and that the conditional distributions are Gaussian. Within this framework, we generate five types of datasets, each with an equal label distribution. For each instance, we obtain the outcome probability by applying Bayes' rule. Based on this process, we generate datasets with various outcome probability distributions, as shown in Figure 1.
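The generative process above can be sketched as follows (the group- and label-conditional means are illustrative placeholders, not the values used to produce Figure 1):

```python
import math
import random

SIGMA = 1.0
# Assumed conditional means of (x1, x2) given (z, y); illustrative values only.
MU = {(z, y): (y + 0.5 * z, y - 0.5 * z) for z in (0, 1) for y in (0, 1)}

def gauss_pdf(x, mu, sigma=SIGMA):
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def sample_instance(rng):
    z = rng.randint(0, 1)       # binary sensitive attribute, equal group sizes
    y = rng.randint(0, 1)       # binary outcome, equal label distribution
    m1, m2 = MU[(z, y)]
    # x1, x2 are conditionally independent Gaussians given (z, y).
    return z, rng.gauss(m1, SIGMA), rng.gauss(m2, SIGMA), y

def outcome_probability(z, x1, x2):
    # Bayes' rule with equal priors on y, using conditional independence of x1, x2.
    lik = {y: gauss_pdf(x1, MU[(z, y)][0]) * gauss_pdf(x2, MU[(z, y)][1])
           for y in (0, 1)}
    return lik[1] / (lik[0] + lik[1])

rng = random.Random(0)
data = [sample_instance(rng) for _ in range(1000)]
probs = [outcome_probability(z, x1, x2) for z, x1, x2, _ in data]
```

Choosing different conditional means per group and label shifts the density of the resulting outcome probabilities, which is how the five dataset types below are obtained.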


Dataset S1: The risk distribution of this dataset is shown in Figure 1(a). For any individual, irrespective of their group membership, the outcome probability is drawn from a density concentrated near the threshold of 0.5. Thus, it is difficult to obtain highly accurate classifiers for this kind of outcome distribution. Also, the distance between and is overall as well as per group.

Dataset S2: We then construct a dataset where the outcome distributions for the two sensitive groups are separated from each other. One group (Z=0) has an outcome probability mode below 0.5, and the other group (Z=1) has an outcome probability mode greater than 0.5. The distribution of this dataset is described in Figure 1(b). Compared to S1, we expect it to be easier for a classifier to achieve high accuracy, but harder to achieve group statistical parity (Disparate Impact). Here, the two groups have different densities over the outcome probabilities, one primarily concentrated below the threshold whereas the other is concentrated above it. Thus, in this dataset, imposing a group-fairness constraint such as disparate impact would necessarily cause more classification error and a higher degree of inframarginality. The value for this dataset is for both the groups.

Dataset S3: This dataset corresponds to an extreme case of S2 where the outcome probability for one of the sensitive groups is concentrated near 0 whereas that for the other sensitive group is concentrated near 1, as shown in Figure 1(c). Also, the value is only . Achieving high accuracy and low inframarginality should be extremely easy for a threshold-based classifier, but satisfying group-wise parity may be equally hard.

Dataset S4: Here we generate data such that the density over outcome probabilities for one group is spread sufficiently away from the threshold of 0.5, whereas the density over outcome probabilities for the other group is concentrated around it, as shown in Figure 1(d). The values for the two groups are also different. Thus, we expect to see a difference in accuracy by group: a classifier should be able to distinguish individuals from the first group easily, but find it hard to separate people with different outcomes within the second group. In contrast, any classifier with a single threshold is also expected to satisfy statistical parity (SP).

Dataset S5: In this dataset, one feature is drawn from the same Gaussian distribution, irrespective of the group or outcome values. However, the other feature is drawn such that its values clearly separate individuals with different outcomes within one group. For the other group, it is difficult to classify accurately, as shown in Figure 2.
For dataset S5, however, the optimal classifier's decision boundary would be a threshold on the separating feature, whereas a classifier ensuring equal misclassification rates for both subgroups would necessarily choose a horizontal line as the decision boundary. In this case, a large fraction of individuals would receive a decision different from that of the optimal classifier, and hence would suffer from higher inframarginality.
High accuracy, low inframarginality.
Figure 3 illustrates that the degree of inframarginality is well within the theoretical bound in terms of the classification error, as established in Theorem 1.
In particular, Figure 3(a) shows that when a classifier is trained to maximize accuracy on dataset S1, the degree of inframarginality is bounded by the classification error, both overall and group-wise. Interestingly, for dataset S2, the degree of inframarginality is very close to the classification error even when the theoretically established bound is high, as shown in Figure 3(b). The bound for dataset S3 is low, and hence the inframarginality and error rates are very close to each other (the y-axis of Figure 3(c) is restricted for clarity). The bound on inframarginality continues to hold in Figures 3(d) and 3(e).
Tradeoff between inframarginality and group fairness. Figures 3(a) and 3(b) show that invoking the DI-fairness constraint increases both the classification error and the degree of inframarginality. Even for dataset S3, DI-fairness increases the classification error, and the corresponding graph looks similar to the previous one (Figure 3(b)).
For S3, we additionally evaluate the effect of using another fairness notion, the False Discovery Rate (FDR). Figure 3(c) shows that ensuring FDR-fairness hurts the accuracy and increases the inframarginality.
For S4, the FDR-fairness constraint not only increases the classification error but also increases the false discovery rates for both subgroups (as shown in Figure 4), which is an extremely undesirable consequence of enforcing group-fairness.
We observe a similar trend for dataset S5 in Figure 3(e). Also, we observe that the ratio of false discovery rates does not uniformly improve on increasing the group-fairness parameter.
Group-wise accuracy and inframarginality. We show results for one of the datasets, S4, by computing the metrics separately for each sensitive group; results on the others are similar.
In dataset S4, the values for the two groups differ. Thus, the inframarginality and error lines in Figure 5 (obtained using the FDR-fairness constraint) almost coincide for one group, while these lines are quite far apart for the other. Moreover, Figure 3(d) shows that, beyond a certain value of the tradeoff parameter, the group-fairness improves at a significant cost to inframarginality and accuracy.
Summarizing the results on synthetically generated datasets: Across all the simulations, we observe that the inframarginality stays within the theoretical bound determined by the classifier's error rate. In fact, in almost all executions, the inframarginality values are low when accuracy is highest. So, addressing the first question, we find that low classification error leads to low inframarginality. From our group-wise results on S4, we find a corresponding result: classifiers are highly accurate for one group and thus exhibit a very low degree of inframarginality, while accuracy for the other group is lower and thus this group suffers higher inframarginality. Finally, we see that increasing group-fairness leads to worse accuracy and, in turn, a higher degree of inframarginality in all the datasets.
5.3 CaseStudy
We now describe our results on the Adult Income and Medical Expense datasets.


Medical Expenditure Panel Survey dataset (MEPS) [url:meps]: This data consists of surveys of families and individuals, collecting data on the health services used, the costs and frequency of services, demographics, etc., of the respondents. The classification task is to predict whether a person will have 'high' utilization (defined as UTILIZATION above a value that is roughly the average utilization for the considered population). The feature 'UTILIZATION' was created to measure the total number of trips requiring some sort of medical care, by summing up the following features: OBTOTV15 (the number of office-based visits), OPTOTV15 (the number of outpatient visits), ERTOT15 (the number of ER visits), IPNGTD15 (the number of inpatient nights), and HHTOTD16 (the number of home health visits). High-utilization respondents constituted a minority of the dataset. The sensitive attribute 'RACE' is constructed as follows: 'Whites' (Z=0) is defined by the features RACEV2X = 1 (White) and HISPANX = 2 (non-Hispanic); everyone else is tagged 'Non-Whites' (Z=1).

Adult Income dataset [url:adultIncome]: This dataset, obtained from the UCI Machine Learning Repository [Dua:2017], contains complete information about individuals. It has been extensively used in supervised prediction of the annual salary of individuals (whether or not an individual earns more than a threshold amount per year), with high salary used as the outcome label. For our experiment, we consider features such as age-groups, education-levels, race, and sex. The column income represents the labels, distinguishing individuals with high annual salary from the rest. The sex attribute is considered sensitive, with one value denoting "Female" individuals and the other denoting "Male".
Observations. We follow an analysis similar to that on the simulated datasets, constructing classifiers with different group-fairness parameters and measuring inframarginality. As described in Section 5.1, we first assume that the true outcome probability can be approximated by the outcome probabilities learnt by the classifier.
For MEPS, we observe a steep increase in inframarginality and error rate when the DI-fairness constraint is invoked (as shown in Figure 6). This result holds group-wise: when dividing the data by the sensitive group and considering each group separately, Figure 7 shows that adding the fairness constraint causes a substantial fraction of the decisions to change compared to the classifier that maximizes accuracy within both groups, thus leading to high inframarginality within both groups.
Similarly, for the Adult Income dataset, we observe an increase in the degree of inframarginality and the error rate when the FDR-fairness constraint is invoked (as shown in Figure 8). When looking at these metrics group-wise (Figure 9), we see that adding the fairness constraint causes a substantial fraction of the decisions to change (inframarginality) in both groups. Moreover, it also increases the false discovery rates for both groups; in particular, the FDR is especially high for one of the groups.
So far, however, we assumed that the learnt outcome probability is a proxy for the true outcome probability. We now repeat the experiments on the Adult Income data after leaving out an important feature, education-levels, and then leaving out race as well. For these reduced datasets, we assume that the true outcome probability can be derived from the full dataset with all features, but the classifier only has access to the reduced features. We observe that, even when all the features are not available, low inframarginality with respect to the full dataset is associated with low classification error (Figure 10). We also observe that the inframarginality with respect to the outcome probabilities obtained using the reduced dataset is correlated with the inframarginality with respect to the outcome probabilities obtained using the full dataset. This observation helps us claim that, in real-world datasets (where we may not have full information on all the features), we can quantify inframarginality using the outcome probabilities obtained from an accuracy-maximizing classifier.
Summarizing the results on real-world datasets: We observe that, in both the Adult Income and MEPS datasets, low classification error leads to low inframarginality. Further, we see that increasing the extent of group-fairness may lead to worse accuracy and, in turn, a higher degree of inframarginality. We observe a similar trend in the group-wise results for the Adult Income dataset. Finally, we investigate whether the unavailability of one or more features would have an impact on the above observations. We find that, even with fewer features, the inframarginality remains lowest when the classification error is lowest.
6 Concluding Discussion
We provided a metric for inframarginality due to a machine learning classifier and characterized its relationship with accuracy and group-based fairness metrics. In cases where unbiased estimation of the true probability of outcome is possible, our theoretical results indicate the value of considering inframarginality in addition to prominent group-fairness metrics. Moreover, we showed that optimizing for inframarginality results in a linear constraint that can be efficiently solved, in contrast to the non-convex constraints that typical group-fairness metrics impose. However, measuring inframarginality requires estimating the true outcome probability for any decision-making scenario, which is not usually easy to obtain. In some settings, such as the search for contraband by stopping cars on the highway [simoiu2017problem], we do obtain a proxy for it by making assumptions about the police officers' decision-making. This is also possible in other law-enforcement contexts, such as searching for prohibited items at airport security, wherein one can argue that security officials stop a person only if they detect a suspect object through the X-ray and, thus, the logged data of baggage searches can be assumed to provide an unbiased estimate of the outcome probability. Further, airport security might decide to perform random searches, which can serve as "gold-standard" unbiased estimators.
When such data is not logged, we suggest actively changing the current decision-making process to introduce some unbiased data that can be used for estimating the true outcome probability. This can be done by adding probabilistic decisions, such as randomly deciding to search a person, awarding a loan to a small fraction of applicants, and so on. Decisions need not be fully random: recent work on multi-armed and contextual bandits [agarwal2017bandits] makes it possible to make optimized decisions while still collecting data for unbiased estimation of people's probability of outcome, irrespective of the decision they received. While this involves considerable effort and collaboration with decision-makers, we believe that the twin benefits of an interpretable single threshold and a straightforward learning algorithm that does not constrain use of the full data for modeling outweigh the implementation costs.
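As a toy illustration of why the randomized slice matters (all quantities and field names here are hypothetical, not taken from any of the cited settings), outcome rates estimated from all policy-selected cases are biased by the policy's selection, whereas rates estimated only from randomly decided cases are not:

```python
import random

rng = random.Random(1)

log = []  # hypothetical decision log: (was_randomized, decision, observed_outcome)
for _ in range(20000):
    true_p = rng.random()                    # individual's true outcome probability
    randomized = rng.random() < 0.1          # a 10% slice of "gold-standard" random decisions
    if randomized:
        decision = rng.random() < 0.5
    else:
        decision = true_p > 0.7              # a biased policy favoring likely-positive cases
    # The outcome is observed only when the favorable decision is given.
    outcome = (rng.random() < true_p) if decision else None
    log.append((randomized, decision, outcome))

naive = [o for r, d, o in log if d]          # all observed outcomes (selection-biased)
gold  = [o for r, d, o in log if r and d]    # outcomes from the randomized slice (unbiased)

naive_rate = sum(naive) / len(naive)         # inflated by the policy's selection
gold_rate  = sum(gold) / len(gold)           # close to the population mean of 0.5
```

The randomized slice recovers the population outcome rate, while the naive estimate over all selected cases overstates it, which is the selection bias the main text warns about.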
Finally, we acknowledge that there will always be cases where unbiased data collection or active intervention is not possible. This is especially the case in scenarios where outcome data is available only for people who received one of the decisions, but never both. For example, we may observe outcomes only for the people who received a loan or who were let out on bail. As a result, it is hard to know the underlying selection biases that might have affected the inclusion of people in a particular dataset, and we leave deriving unbiased estimators for these problems as an interesting direction for future work.
Acknowledgments.
Arpita Biswas sincerely acknowledges the support of a Google PhD Fellowship Award. Siddharth Barman gratefully acknowledges the support of a Ramanujan Fellowship (SERB  SB/S2/RJN128/2015) and a Pratiksha Trust Young Investigator Award.