1 Introduction
As repeatedly demonstrated in the news, medicine, law, and numerous related ML papers [43, 20, 5], societal inequities have the very real risk of being vastly exacerbated if machine learning algorithms do not explicitly address fairness in model formulation and data collection. Numerous philosophical notions of fairness exist (distributive, procedural, etc.) [35, 17, 33], and the appropriateness of each definition may depend on context. Theoretically, in an equitable world of perfect data, implying perfect accuracy across all possible subgroup populations, a uniformly fair classifier could be created. However, with skewed real-world data, [28] has shown that a trade-off exists between fairness and accuracy. We propose an alternative fairness constraint based on Pareto-Efficiency [16] to avoid performance degradation within some subgroups, while striving for increased accuracy for all subgroups.
A popular approach to learning fairness-aware classification models is to enforce strict metric equality constraints with respect to sensitive variables such as race and gender. Such equality constraints are being adopted as law in some locales [21]. However, by definition, enforcing strict equality constraints ensures that classification accuracy is limited by the worst-performing subgroup. For a variety of reasons, including historical injustices [18], sampling bias [8], and selection bias [36], subgroup populations are often not fully or fairly represented in commonly used real-world datasets. The discrepancies are particularly alarming when algorithmic models are used to predict medical and legal outcomes. For example, in a cardiology study of over 4,000 ER patients with cardiac event symptoms [2], no symptoms were found to be predictive of a heart attack in white women. In black males, only an unrelated symptom (diaphoresis) was found to be indicative of a future cardiac event with 95 percent confidence, while in white males relevant features (left arm radiation, pressure/tightness) were detected with high accuracy. To reiterate, a classifier built on this longitudinal and “diverse” dataset to predict an ER cardiac event could conceivably achieve high accuracy for white men, while yielding only trivial accuracy of 50 percent for white women and black men.
A philosophical question for the reader and practitioners is: should a classifier be constructed in such a scenario? Without full historical data and appropriate domain causal knowledge, it may be infeasible to approach *fair* learning [26]. The above example is extreme both in its repercussions and in the skew of the data. However, we argue and demonstrate that such skew is common in datasets frequently used to evaluate fairness-aware learning: UCI Adult, German Credit, and Heart Attack. Figure 1 demonstrates the skew in accuracy by age-gender subgroups in the UCI Heart Attack dataset.
In scenarios where domain practitioners deem the above considerations acceptable for constructing a classifier, we propose an alternative fairness constraint based on Pareto-Efficiency [16] to avoid unintentional degradation in subgroup performance, while striving for increased accuracy. The methodology is applicable to highly skewed data as described in both cardiac examples above. The Pareto-Efficient Fairness (PEF) constraint restricts the choice of ML models to the Pareto frontier to ensure higher accuracy across *fair* model options. In some cases, a Pareto-Efficient definition may be at odds with a strict equality fairness criterion. Figure 2 illustrates cases on a synthetic dataset where extremely unequal models might be Pareto optimal and vice versa. However, PEF avoids this pitfall by limiting the search space on the Pareto frontier to within acceptable fairness bounds.
Our proposed bias loss function achieves Pareto-Efficient performance, which outperforms solutions based on equalizing subgroup performance. Our algorithm iteratively searches for Pareto-optimal subgroup performance by leveraging the benefits of transfer learning to minimize the Group Pareto loss. Using theory from multiple-objective optimization for continuous Pareto fronts, we prove that if the data distribution has high disalignment [28] between the subgroup and the outcome, PEF will discover all Pareto optimal points and converge to a solution that is better than the Bayes optimal solutions for existing fairness-constraint-based algorithms (if it exists). Empirically, we show that our approach achieves an operating point which is better both in terms of global accuracy and individual subgroup accuracy than methods which approximate hard constraints of equality [43] and adversarial multi-task learning [3] on three UCI datasets.
2 Pareto-Efficient Fairness
2.1 Motivation
To motivate the need for Pareto-Efficient Fairness, consider a very simplistic binary classification task $\hat{y} = f_t(x)$, where $x$ is a continuous scalar feature for each example in the dataset $D$. We partition the dataset $D$ into a set of groups such that the examples in groups $a$ and $b$ differ in their values for sensitive variables $S$.
We denote the accuracy of a classifier evaluated on test samples from groups $a$ and $b$ as $Acc_a$ and $Acc_b$ respectively. We thus say that the classifier evaluates to an operating point $(Acc_a, Acc_b)$. We can evaluate various operating points for classifiers by varying values of the threshold $t$ to define the scatter plot of operating points, if the classifiers take the form
$$f_t(x) = \mathbb{1}[x > t] \qquad (1)$$
In Figure 2, we plot these operating points over a given data distribution by varying $t$. If we assumed that the data is drawn from a uniform random distribution over a class-balanced label set for both groups $a$ and $b$, and the label is generated by flipping a fair coin, then the expected operating point for any classifier would be (0.5, 0.5) (denoted by “chance”). In other non-degenerate scenarios, it is possible to maximize the accuracy ($Acc_a$ or $Acc_b$) of a single group, regardless of the accuracies of the other groups. In our case, we denote the operating points that maximize the accuracy of groups $a$ and $b$ by $OP_a$ and $OP_b$ respectively. A strict equality fairness criterion would require that the classifier operate on the $Acc_a = Acc_b$ line.
However, if the objective is to improve the performance of all groups to meet the levels of the highest-performing groups, then choosing points on the $Acc_a = Acc_b$ line might not be desirable. Such objectives are common in policies around affirmative action [14] and in recent works in the fairness literature [6], where even though parity between demographic groups is desired, interventions are made to improve all groups to the level of the highest-performing group. Hence, choosing the Pareto-Efficient point (denoted by “PE”) may be a more desirable solution, as it will increase the accuracy of both groups $a$ and $b$ compared to a solution obtained by enforcing a strict equality constraint.
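The operating-point picture above can be reproduced in a few lines of NumPy. This is a hedged sketch on an assumed synthetic distribution; the noise scales, sample sizes, and threshold grid are our own choices for illustration, not the paper's:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: group a's scalar feature separates the two classes
# more cleanly than group b's, so the groups trace different accuracy
# curves as the threshold t of the classifier 1[x > t] is varied.
def make_group(n, noise):
    y = rng.integers(0, 2, size=n)
    x = y + rng.normal(0.0, noise, size=n)
    return x, y

xa, ya = make_group(1000, 0.5)   # well-separated group
xb, yb = make_group(1000, 1.5)   # noisy group

def operating_point(t):
    # Operating point (Acc_a, Acc_b) of the threshold classifier 1[x > t].
    acc_a = float(np.mean((xa > t).astype(int) == ya))
    acc_b = float(np.mean((xb > t).astype(int) == yb))
    return acc_a, acc_b

points = [operating_point(t) for t in np.linspace(-1.0, 2.0, 61)]
best_a = max(p[0] for p in points)
best_b = max(p[1] for p in points)
```

With such skew, no threshold brings group b near group a's best accuracy, which is exactly the regime where a strict equality constraint drags group a down instead of lifting group b up.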
2.2 Definitions
For completeness, we provide some definitions of model performance metrics which may be used to evaluate fairness on the held-out test data set.
Accuracy: The fraction of test samples which were classified correctly by the model as compared to the ground truth class labels.
False Positive Rate: In the case of a binary classification task with positive and negative classes, the fraction of test samples which were incorrectly assigned a positive label by the model, among the total number of ground-truth negative samples.
False Negative Rate: In the case of a binary classification task with positive and negative classes, the fraction of test samples which were incorrectly assigned a negative label by the model, among the total number of ground-truth positive samples.
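For concreteness, these three metrics translate directly into NumPy; a minimal sketch (the function names are ours):

```python
import numpy as np

def accuracy(y_true, y_pred):
    # Fraction of samples classified correctly versus ground truth.
    return float(np.mean(y_true == y_pred))

def false_positive_rate(y_true, y_pred):
    # Fraction of ground-truth negatives incorrectly labeled positive.
    negatives = (y_true == 0)
    return float(np.sum((y_pred == 1) & negatives) / np.sum(negatives))

def false_negative_rate(y_true, y_pred):
    # Fraction of ground-truth positives incorrectly labeled negative.
    positives = (y_true == 1)
    return float(np.sum((y_pred == 0) & positives) / np.sum(positives))
```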
These simple definitions of model performance have been used as objectives for maximization/minimization under constraints requiring that sensitive-group-level model performance (possibly measured using a different metric than the one used as the objective) be equal.
Parity Loss: If the group-level model performances are unequal, then the sum of absolute deviations of the group-level performances from the overall model performance is defined as the Parity loss. If $m_g$ denotes the model performance of group $g$ and $m$ denotes the overall model performance, then the parity loss is given by
$$L_{Parity} = \sum_{g} \left| m_g - m \right| \qquad (2)$$
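Equation (2) is a one-liner in code; a minimal sketch (the function name is ours):

```python
def parity_loss(group_perf, overall_perf):
    # Sum of absolute deviations of each group-level performance m_g
    # from the overall model performance m.
    return float(sum(abs(m_g - overall_perf) for m_g in group_perf))
```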
There are many fairness definitions proposed [10] in literature, but we present one of the commonly used strictequality constraints below followed by our definition of ParetoEfficient Fairness.
Equality of Odds [19]: We say that a predictor $\hat{Y}$ satisfies equalized odds with respect to the sensitive attribute set $S$ and outcome $Y$ if $\hat{Y}$ and $S$ are independent conditional on $Y$:
$$P(\hat{Y} = 1 \mid S = s, Y = y) = P(\hat{Y} = 1 \mid S = s', Y = y) \quad \forall\, s, s' \in S,\; y \in \{0, 1\} \qquad (3)$$
Pareto-Efficiency: We define Pareto-Efficient points as the set of operating points for which there does not exist another point with better performance (e.g., accuracy) across all the groups.
One possible pitfall of the above definition for fairness is that the points $OP_a$ or $OP_b$ could be selected as a Pareto-Efficient solution. Both are trivially Pareto-Efficient points, since no other point performs better across all groups, but they are trivially unequal across groups.
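The definition above is the standard non-dominated filter; a small sketch (the function name is ours) which also exhibits the stated pitfall, since extremely unequal points survive the filter:

```python
def pareto_efficient(points):
    # Keep operating points not dominated by any other point: q dominates p
    # when q is at least as good on every group and strictly better on one.
    def dominates(q, p):
        return all(qi >= pi for qi, pi in zip(q, p)) and \
               any(qi > pi for qi, pi in zip(q, p))
    return [p for p in points if not any(dominates(q, p) for q in points)]
```

Note that the unequal extremes (0.9, 0.5) and (0.5, 0.9) remain on the front; this is exactly why PEF additionally penalizes the spread of the Pareto errors rather than accepting any Pareto-Efficient point.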
Pareto-Efficient Fairness: We say an operating point is Pareto-Efficient Fair if it is Pareto-Efficient and minimizes the variance of the Pareto error across groups.
Formally, let $T$ be the set of thresholds characterizing Pareto-Efficient points, $m_g(t)$ the performance metric of group $g$ at threshold $t$, $m^*_g = \max_t m_g(t)$ the optimum performance metric across all operating points for group $g$, $\epsilon_g(t) = m^*_g - m_g(t)$ the Pareto error for group $g$, and $\mathrm{Var}_g[\epsilon_g(t)]$ the variance of the Pareto errors across all groups. We intend to find a threshold $t^*$ that characterizes a Pareto-Efficient Fair operating point:
$$t^* = \operatorname*{arg\,min}_{t \in T} L_{Pareto}(t) \qquad (4)$$
$$L_{Pareto}(t) = \mathrm{Var}_g\!\left[\epsilon_g(t)\right] \qquad (5)$$
Since it is empirically difficult to find all Pareto-Efficient thresholds $T$ without sufficient exploration, a simple heuristic is to choose a threshold that minimizes the total absolute Pareto penalty:
$$t^* = \operatorname*{arg\,min}_{t} \; \sum_g \left| \epsilon_g(t) \right| \qquad (6)$$
We combine these two minimization criteria using a Lagrangian factor $\lambda$ as follows:
$$t^* = \operatorname*{arg\,min}_{t} \; \sum_g \left| \epsilon_g(t) \right| + \lambda\, \mathrm{Var}_g\!\left[\epsilon_g(t)\right] \qquad (7)$$
This formulation should not be confused with simple fairness and accuracy tuning parameters, as proposed in [43]. The nuance lies in the fact that $\epsilon_g(t)$ is measured relative to the heuristic pseudo-optimum performance of each group, which is central to the argument of Pareto-Efficient Fairness, thereby encouraging all subgroups to perform at their best possible levels. This is demonstrated when $\lambda \to \infty$ and $\mathrm{Var}_g[\epsilon_g(t)]$ is minimized, which is not the same as equality of odds [19]. Similarly, when $\lambda = 0$, we minimize $\sum_g |\epsilon_g(t)|$, which is different from the unconstrained optimization in [43].
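The combined criterion of equation (7) can be made concrete with a small NumPy sketch; the function name and the toy performance grid in the usage example are our own illustration, not the paper's:

```python
import numpy as np

def pef_threshold_index(perf_by_threshold, lam):
    # perf_by_threshold[t, g] holds the performance m_g(t) of group g at
    # threshold index t. The pseudo-optimum m*_g is the best value each
    # group attains anywhere; the Pareto error is eps_g(t) = m*_g - m_g(t).
    # We pick the threshold minimizing sum_g |eps_g(t)| + lam * Var_g(eps).
    perf = np.asarray(perf_by_threshold, dtype=float)
    m_star = perf.max(axis=0)
    eps = m_star - perf
    objective = np.abs(eps).sum(axis=1) + lam * eps.var(axis=1)
    return int(np.argmin(objective))
```

For a grid like `[[0.9, 0.4], [0.5, 0.8], [0.75, 0.65], [0.88, 0.55]]`, a small `lam` selects the threshold with the lowest total shortfall, while a large `lam` prefers the threshold whose shortfall is evenly spread across groups, which is the role $\lambda$ plays in equation (7).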
Group Pareto Loss: We now generalize the simple example above to any binary classification task. Here, the minimization criterion of the Group Pareto Loss in equation (5) holds, but instead of determining a specific threshold corresponding to a Pareto-Efficient operating point, we minimize the Group Pareto Loss over the parameters of the binary classification model.
The Group Pareto Loss $L_{Pareto}$ is augmented with an appropriate loss weight ($\lambda$) through the Lagrangian dual formulation, similar to [13], along with the standard cross-entropy classification loss $L_{CE}$ [27], to yield the Pareto-Efficient Fairness Loss $L_{PEF}$. Note that the standard cross-entropy classification loss aims to maximize overall performance, whereas the penalty term weighted by $\lambda$ is used to ensure that such maximum overall performance is achieved while minimizing the Pareto loss. In scenarios where the maximum overall performance is achieved when each group's performance is also maximized, we would not need this augmentation. For cases where such Pareto optimal and overall optimal operating points do not coincide, we are able to denote our preference between these two operating points using the penalty term $\lambda$.
$$L_{PEF} = L_{CE} + \lambda\, L_{Pareto} \qquad (8)$$
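A minimal NumPy sketch of the combined loss in equation (8), assuming accuracy as the group performance metric and a hinge-style shortfall from each group's pseudo-optimum as the Pareto penalty (the exact penalty form here is our assumption, not the paper's):

```python
import numpy as np

def cross_entropy(p, y):
    # Mean binary cross-entropy for predicted probabilities p, labels y.
    p = np.clip(p, 1e-7, 1 - 1e-7)
    return float(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))

def pef_loss(p, y, groups, pseudo_opt, lam):
    # L_PEF = L_CE + lam * sum_g max(0, m*_g - acc_g): standard
    # cross-entropy plus a penalty for each group g falling short of its
    # heuristic pseudo-optimal accuracy pseudo_opt[g].
    preds = (p > 0.5).astype(int)
    pareto = 0.0
    for g, m_star in pseudo_opt.items():
        mask = groups == g
        acc_g = float(np.mean(preds[mask] == y[mask]))
        pareto += max(0.0, m_star - acc_g)
    return cross_entropy(p, y) + lam * pareto
```

When every group already matches its pseudo-optimum, the penalty vanishes and the loss reduces to plain cross-entropy; otherwise the shortfall term pushes the model toward the under-performing groups.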
2.3 ParetoEfficient Algorithm
We now put the components mentioned above together and present the Pareto-Efficient bias mitigation algorithm (Algorithm 1). This is an in-processing algorithm which trains a joint model to have favorable fairness properties, as opposed to a post-processing algorithm which fine-tunes a pre-trained model [40]. In order to obtain a heuristic pseudo-optimal group accuracy $m^*_g$ for each group $g$, we train the classifier to minimize $L_{CE}$ on the samples in group $g$ from dataset $D$. Although this may not be the best estimate of a group population's true optimal, we avoid transfer learning during bootstrapping, as it benefits larger groups more than smaller groups [30]. Instead, we propose an iterative approach where $m^*_g$ is updated in each iteration if a better group accuracy is achieved by a jointly trained model. We use these heuristic pseudo-optimal accuracies to jointly train a Pareto-Efficient model on all subgroups to minimize $L_{PEF}$ in every batch. We further strictly ensure that the minibatch is representative of the group distributions by sampling group-wise batch samples proportionately.
Our proposal aims to achieve “potentially optimal” performance for each of the groups while performing better than approximate fairness-constrained classifiers. In Figure 3, we illustrate this by applying Algorithm 1 on a synthetic data distribution with 4 groups, where 2 groups do not perform better than chance, independent of the threshold chosen, and the other 2 groups perform better for higher values of the threshold. Hence, we achieve Pareto-fairness at a high threshold, whereas the point of equal performance (non-Pareto fairness) is achieved at a threshold where all groups perform equal to random-chance performance (more details on the distribution are in Section 5.2). Hence, choosing to minimize the Pareto loss is better than minimizing the Parity loss, which can lead to random accuracy. This approach is similar to current avenues of research where it is acceptable to be aware of the differences [24] between various groups' performance in the dataset and operate in a way to improve them, as opposed to fairness through blindness [23].
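The iterative loop of Algorithm 1 can be sketched abstractly as follows; the callback arguments (`bootstrap_group_acc`, `train_joint_epoch`) are hypothetical stand-ins for the paper's per-group bootstrap and joint PEF training steps:

```python
def pareto_efficient_training(bootstrap_group_acc, train_joint_epoch,
                              groups, n_iters=5):
    # Sketch of the in-processing loop: bootstrap each group's
    # pseudo-optimal accuracy m*_g on that group's own samples, then
    # repeatedly train the joint model against the PEF loss and raise
    # m*_g whenever the jointly trained model beats the current estimate
    # for group g (the transfer-learning benefit exploited iteratively).
    pseudo_opt = {g: bootstrap_group_acc(g) for g in groups}
    for _ in range(n_iters):
        joint_acc = train_joint_epoch(pseudo_opt)   # returns {g: accuracy}
        for g in groups:
            pseudo_opt[g] = max(pseudo_opt[g], joint_acc[g])
    return pseudo_opt
```

The monotone `max` update mirrors the paper's rule that a pseudo-optimum only moves up when the joint model surpasses it, so the targets can never be degraded by joint training.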
3 Properties of ParetoEfficient Fairness
In this section, we outline key theoretical results about the Pareto-Efficient bias mitigation algorithm: its convergence properties, its capacity to discover Pareto curves of subgroup accuracy, and their Pareto efficiency. First, we formalize the inherent disalignment between achieving better accuracy on the target and satisfying the fairness equality constraint for the sensitive group distribution [28]. Let $\eta(x) = P(Y = 1 \mid X = x)$ be the class probability function for the target, $t$ be the threshold on the class probability for binary classification, $\lambda$ be the parameter that defines the trade-off between accuracy and fairness in the fairness-constrained loss function, and $h^*_\lambda$ denote the Bayes-optimal classifier which minimizes the fairness-constrained loss function. We then define the disalignment $d$ between the target and group distribution as the accuracy forgone by the fairness-constrained Bayes-optimal classifier relative to the unconstrained one:
$$d = \mathrm{Acc}(h^*_0) - \mathrm{Acc}(h^*_\lambda) \qquad (9)$$
Theorem 1.
If $d > 0$, then under convexity assumptions, minimizing the Group Pareto loss will converge the classifier to a Pareto-Efficient operating point with accuracies $(Acc^{PE}_1, \ldots, Acc^{PE}_G)$, such that for every operating point $(Acc^{EQ}_1, \ldots, Acc^{EQ}_G)$ obtained by strictly enforcing the equality fairness constraint, we have $Acc^{PE}_g \ge Acc^{EQ}_g$ for all groups $g$.
In the remainder of the section, we present the outline of the proof of the theorem through key lemmas about the convergence, discoverability and efficiency of the ParetoEfficient algorithm from optimization and Pareto optimality theory.
3.1 Convergence
To show that $L_{PEF}$ converges, we use the lemma from [39], which shows that under block separability of the parameters $\theta$, i.e., $f(\theta) = \sum_g f_g(\theta_g)$, where in our case $f_g$ is the Pareto loss for group $g$, backpropagating using a block-level batch gradient descent converges [38].
Lemma 1.
If $f$ is a convex, twice-differentiable loss function, then the sparse lasso minimizer, $\min(f + \lambda\,\Omega(\theta))$, with $\lambda \ge 0$, is also convex.
In our case, $f$ is the cross-entropy loss function, with the sparse lasso penalty term $\lambda\,\Omega(\theta)$ set equal to the Pareto loss penalty term $\lambda\,L_{Pareto}$. ∎
3.2 Discoverability
To show that the converged operating point is Pareto-Efficient, we use the theory of decomposition-based methods [15] that can discover convex and non-convex Pareto curves by scalarizing the multiple objectives $f_1, \ldots, f_G$ into a single objective. We minimize the $p$-norm of the weighted distance of each objective $f_g$ from a Utopian reference point $z^*$, i.e., $\min_\theta \left\| \, w \odot (F(\theta) - z^*) \, \right\|_p$.
However, it can be seen that as $p \to \infty$, the objective function becomes non-differentiable, and hence we want to choose the minimum possible $p$ for which the above statement still holds. While a result for all Pareto curves is still an open problem, we use a significant result from [15] that applies if the Pareto curve is assumed to be continuous.
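A small sketch of this scalarization (the function name and unit weights are ours, used only for illustration):

```python
import numpy as np

def scalarized_objective(objectives, weights, utopia, p):
    # Weighted p-norm distance of the objective vector F(theta) from the
    # Utopian reference point z*; sweeping the weights while minimizing
    # this scalar traces out points on the Pareto front.
    diff = np.asarray(weights) * np.abs(np.asarray(objectives) - np.asarray(utopia))
    return float(np.sum(diff ** p) ** (1.0 / p))
```

With `p = 1` this is the weighted-sum scalarization, which can miss non-convex portions of the front; increasing `p` toward the Chebyshev limit recovers them at the cost of differentiability, which is exactly the trade-off discussed above.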
Lemma 2.
If the Pareto-front geometry is continuous and can be parameterized as a function with a bounded rate of change, i.e., $K$-Lipschitz for a constant $K$, then for a choice of $p \ge K$ we are guaranteed to discover the Pareto front using the above scalarization.
In finite datasets, we usually make the continuous Pareto curve assumption, as the curve is interpolated from observed points. Also, in our case, we know that each $f_g \in [0, 1]$ if accuracy (error) is appropriately scaled. With this tight bound on the performance values, the condition $p \ge K$ becomes trivial to satisfy empirically for low values of $p$, as shown in Figure 4.
3.3 Efficiency
We further show, using a geometric argument, that if the fairness frontier [Proposition 8 in [28]] is defined by the maximum achievable accuracy $\alpha(\lambda)$ for various values of the fairness penalty $\lambda$, then the gain in absolute accuracy obtained by using Pareto-Efficient Fairness is proportional to the absolute gradient of the fairness frontier and depends solely on the inherent distribution of the target and groups:
$$\Delta \mathrm{Acc} \propto \left| \alpha'(\lambda) \right| \qquad (10)$$
Specifically, if the fairness frontier shows that for small concessions of the fairness requirement the limit of achievable accuracy is much higher, then PEF may outperform by choosing such a point on the fairness frontier, as shown in Figure 5. However, this is a necessary but not sufficient condition. PEF would choose such a point only if the new point Pareto-dominates the subgroup accuracies of the operating point without the fairness concession. The amount of concession in Pareto-dominance that we are willing to allow is domain-dependent and can be controlled by tuning the $\lambda$ parameter in the PEF loss function. Hence, PEF performs better in conditions where the fairness frontier is steep around the fairness requirement and potential increases in performance are achievable in a Pareto-dominant manner.
4 Related Work
Existing fairness mitigation algorithms often explicitly define constraints on model subgroup performance, e.g., Equality of Odds [19], and enforce them using Lagrangian relaxation [28, 25, 7] to achieve corpus-level parity across sensitive variables [43]. With the increase in sensitive variables in real case studies, satisfying such strict constraints remains unexplored [4, 9, 11]. [28] has established that such approximate group fairness constraints are not perfectly satisfiable unless the underlying subpopulations demonstrate perfect accuracy with respect to the target. [3] models the problem of debiasing as a multi-task learning problem with a penalty if the shared hidden layers of the neural network can be used to predict the sensitive variable accurately.
[41] argues for preference-based notions of fairness as opposed to ones based on parity. [34] provide fair dimensionality reduction algorithms where bias loss functions are employed. [42] show that on logistic regression and support vector machines, approximate fairness constraints can be enforced at a cost in accuracy.
[31] and [32] prove that equality of odds cannot be achieved by two calibrated models on separate groups unless both models achieve perfect accuracy. The main intuition behind this paper is the hypothesis that similar impossibility regimes exist in real-life scenarios, especially when multiple subgroups exist.
[1] explores the Pareto optimality between overall accuracy and violation of fairness constraints. In our work, however, we focus on the trade-offs between the performance of various comparable subgroups on the Pareto-optimal curve [29, 15]. Relevant to our work, studies of subgroup-specific performance and the use of transfer-learning-like methods have been explored through decoupling in [12], but do not provide theoretical results on Pareto-Efficiency. To the best of our knowledge, this is the first work which extends strong theoretical results of Pareto-Efficiency to achieve better subgroup performance in data distributions with high disalignment between fairness and accuracy.
5 Evaluation
We compare our approach with scaled versions of the group-fairness approach of [43] and the adversarial approach of [3] for subgroups. In [43], the authors optimize for overall accuracy in the constrained setting of ensuring equal false positive rates, but the method is generally applicable to other measures of performance. For our comparison, we implement an objective to maximize overall accuracy along with a Lagrangian relaxation which adds a penalty for each subgroup that deviates from the overall accuracy. In [3], the authors implement bias mitigation as a way of erasing sensitive group membership by backpropagating negative gradients in a multi-headed feed-forward neural network. We evaluate a comparison of the techniques on both the UCI datasets and synthetic toy data. The UCI Census Adult dataset predicts income category based on demographic information, where the sensitive variables are gender and race. The UCI German dataset predicts credit type (binary) from demographic information, where the sensitive variables are age, gender, and personal status. The UCI Heart Attack dataset predicts health status using medical and demographic information, with age and gender considered sensitive variables. The synthetic dataset is constructed to illustrate how the Pareto loss is useful in skewed datasets.
5.1 UCI Datasets
Table 2 shows the Pareto loss, i.e., how much each subgroup deviates from its respective pseudo-optimal, for the UCI Census Adult dataset. We see that our approach achieves zero Pareto loss, while [43] and [3] have non-zero Pareto losses. [43] performs well in terms of lowering the sum of absolute discrepancies of all subgroups' accuracy from the overall accuracy (Parity loss). This is expected, as [43] chooses an operating point closest to equal accuracy when exact equality is not possible. [3] arrives at an operating point which suffers from non-zero Parity and Pareto loss. Table 2 clarifies why our approach arrives at a better operating point: each of the subgroups has better individual accuracy than under all the other approaches, some even better than the baseline. This confirms empirically that our objective function matches (and sometimes exceeds, due to transfer learning) the heuristic pseudo-optimal performance for each subgroup (last row of Table 2). Similar performance improvements were also observed on the UCI German and Heart Attack datasets, and the condensed results are shown in Tables 3 and 4 respectively.
Model | Accuracy | FPR | FNR | Parity Loss | Pareto Loss
--- | --- | --- | --- | --- | ---
Baseline (no bias loss) | 0.630 | 0.253 | 0.747 | 0.199 | 0.016
[43] | 0.619 | 0.283 | 0.712 | 0.167 | 0.133
[3] | 0.648 | 0.224 | 0.769 | 0.226 | 0.077
Pareto-Efficient Loss | 0.678 | 0.165 | 0.830 | 0.250 | 0.000

Model | Subgroup 1 | Subgroup 2 | Subgroup 3 | Subgroup 4 | Pareto Loss
--- | --- | --- | --- | --- | ---
Baseline (no bias loss) | 0.890 | 0.883 | 0.818 | 0.784 | 0.016
[43] | 0.853 | 0.856 | 0.806 | 0.778 | 0.133
[3] | 0.882 | 0.872 | 0.824 | 0.780 | 0.077
Pareto-Efficient Loss | 0.935 | 0.915 | 0.844 | 0.797 | 0.000
Subgroup Pareto Frontier | 0.934 | 0.894 | 0.815 | 0.783 | N/A
5.2 Synthetic Data
We varied the hyperparameters that define the synthetic subgroup distributions and found that, for cases where it is possible for all subgroups to achieve the same level of accuracy, our approach remains similar in subgroup performance to [43] and [3]. However, for the cases where subgroups differ in their “pseudo-optimal” performance, [43] fails to achieve the “Pareto” accuracy for all subgroups and hence results in lower overall accuracy. [3], however, continues to perform well, and our approach shows improvements only in a few scenarios. We provide detailed performance numbers and subgroup distribution parameters for the synthetic cases below.
We use a joint data distribution over binary sensitive variables $A$ and $B$, and $C$, a confounding variable introduced to control the alignment between the target label and the sensitive group distribution. We present the various dependencies of the confounding variable on the sensitive variables, and the corresponding performance, in Table 5, for minimizing the Pareto loss compared to minimizing the Parity loss [43] and adversarial losses [3].
Confounding dependency | Pareto | [43] | [3]
--- | --- | --- | ---
2*a + 1*b | 0 | 0 | 0
2*b - 2*a | 0 | 0.02 | 0.02
4*b | 0 | 0 | 0.002
8*b | 0 | 0.06 | 0
2*a + 2*b + 2*d | 0 | 0 | 0
(a,b): {(0,0): 3, (0,1): 11, (1,0): 4, (1,1): 8} | 0 | 0 | 0
(a,b): {(0,0): 3, (0,1): 1, (1,0): 4, (1,1): 8} | 0 | 0.04 | 0.04
(a,b): {(0,0): 3, (0,1): 11, (1,0): 4, (1,1): 9} | 0 | 0.01 | 0.01
5.2.1 Skewed Dataset Scenario
Below, we show the distribution that was used to generate a skewed dataset, where attempting to achieve strict equality can result in trivial accuracy. We use a joint data distribution over binary sensitive variables $A$ and $B$, and a threshold variable $C$ introduced to classify the target label. Let $A$, $B$ be Bernoulli random variables,
$$C \mid (B = 0) \sim \mathrm{Uniform}(0, 1) \qquad (11)$$
$$C \mid (A = a, B = 1) \sim \mathcal{N}(\mu_a, \sigma^2), \quad \left|\mu_1 - \mu_0\right| = k \qquad (12)$$
with a selected constant $k$.
In this scenario, for the two subgroups with $B = 0$, the confounding variable has uniform random values, and for the two subgroups with $B = 1$, the confounding variable is distributed normally with means at two different values separated by a constant. Suppose we were to train a bias mitigation model which uses only the perceived non-sensitive variable $C$ as the feature to identify the target label, in order to achieve equal performance in terms of accuracy. While trying to increase overall accuracy, there is a potential trade-off where the algorithm could penalize the two subgroups identified by (0_0, 1_0), without any gain in the subgroups identified by (0_1, 1_1). Effectively, this implies that two subgroups are penalized to bring their accuracy close to the uniformly random accuracy of the other two subgroups, as shown in Figure 6. This is counterproductive for each of the subgroups in question. Hence, it is necessary to guard against such scenarios by explicitly accounting for this edge case.
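A hypothetical generator loosely following this description can make the failure mode concrete. This is our own construction, not the paper's exact one: here the label is random and the confounder carries signal only for the B = 1 subgroups, via class-dependent means, so a threshold on C is informative for half the population and pure chance for the other half:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_skewed(n, mean_gap=2.0):
    # For B = 0 the confounder C is uniform (no signal about the label);
    # for B = 1 it is normal with class-dependent means separated by
    # mean_gap, so only those subgroups are learnable from C alone.
    a = rng.integers(0, 2, size=n)
    b = rng.integers(0, 2, size=n)
    y = rng.integers(0, 2, size=n)
    c = np.where(b == 0,
                 rng.uniform(-1.0, 1.0, size=n),
                 rng.normal(y * mean_gap, 0.5, size=n))
    return a, b, c, y

a, b, c, y = sample_skewed(4000)

# A threshold classifier on C predicts y well only where B = 1.
pred = (c > 1.0).astype(int)
acc_b1 = float(np.mean(pred[b == 1] == y[b == 1]))
acc_b0 = float(np.mean(pred[b == 0] == y[b == 0]))
```

Enforcing equal accuracy on this data can only drag the learnable subgroups down toward the chance-level subgroups, which is the degenerate equilibrium the Pareto loss is designed to avoid.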
5.3 Bias Loss Weights
The Pareto loss function contains a single bias loss weight term ($\lambda$, although it can be a vector as appropriate for the domain) which controls the trade-off between achieving overall accuracy and Pareto-Efficient levels for each of the subgroups, with equal penalty per subgroup. Modifying this weight demonstrates the difference between the equalized odds loss and the Pareto loss, as can be seen in Figure 7. As the bias weight is increased, the Pareto loss bias mitigation moves the threshold toward Pareto-Efficient levels with higher overall accuracy, whereas the equalized loss moves toward the operating point where the discrepancy across subgroups is minimized, even though this does not improve the accuracy of at least one of the subgroups.
5.4 Prevalence
The number of samples in each subgroup varies, and as such, some minority population subgroups could be ignored by some bias mitigation techniques. The approach mentioned in [22] defines discrepancies weighted by the population ratio of each of the subgroups. This may have perverse effects on minority subgroups whose performance deviates from either the overall performance or the Pareto-Efficient level: the contribution of the minority subgroup may be ignored while optimizing the loss function. This is clearly seen in Figure 8, where the curve representing the discrepancy from the overall performance is lowered and smoothed when multiplied by the subgroup's population ratio. In some domains, this might not be the intended fairness criterion, and it can be quite deceiving when the majority subgroup, which also guides the overall performance, dominates the sum-of-discrepancy loss too. As such, we guard against the tyranny of the majority by weighting subgroups equally, independent of population prevalence. This can be modified to appropriate weights by the domain practitioner.
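The contrast between prevalence-weighted and equally weighted discrepancies can be sketched in a few lines (the function name is ours; the prevalence-weighted branch follows the weighting described for [22], the equal-weight branch our paper's choice):

```python
import numpy as np

def discrepancy_loss(group_perf, group_sizes, overall, weight_by_prevalence):
    # Sum of each subgroup's deviation from the overall performance,
    # either weighted by population ratio or weighting every subgroup
    # equally regardless of prevalence.
    sizes = np.asarray(group_sizes, dtype=float)
    w = sizes / sizes.sum() if weight_by_prevalence else np.ones_like(sizes)
    return float(np.sum(w * np.abs(np.asarray(group_perf) - overall)))
```

With a 90/10 population split, a minority subgroup performing 0.45 below the overall level contributes as much to the prevalence-weighted loss as a majority subgroup deviating by only 0.05, which is the masking effect described above.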
5.5 Model Capacity
We used 3 layers of feed-forward networks with 256, 128, and 64 fully connected neurons in all our comparisons of the relevant losses. However, we noticed that the difference in the losses varied with the size of the model used, as noted in Table 6. The gains achieved from the Pareto loss are higher when the model capacity increases, as the model is able to capture the subgroup-specific performance updates with a larger number of parameters.

Model size | Subgroup 1 | Subgroup 2 | Subgroup 3 | Subgroup 4
--- | --- | --- | --- | ---
64 | 0.923 (0.929) | 0.898 (0.914) | 0.812 (0.849) | 0.733 (0.809)
  | 0.902 (0.925) | 0.884 (0.909) | 0.820 (0.849) | 0.781 (0.805)
  | 0.935 (0.890) | 0.915 (0.883) | 0.844 (0.818) | 0.797 (0.784)
6 Conclusion
Real-world datasets often display subgroup population skew. To mitigate this in the construction of fairness-aware classifiers, we utilize softer Pareto-Efficiency fairness constraints. When subgroup populations contain deviations in prevalence and underlying distributions, a Pareto-Efficient approach yields better overall and individual subgroup performance when compared to other bias-mitigation algorithms enforcing hard equality constraints. The approach is appropriate for classification problems containing multiple sensitive variables. As demonstrated, the proposed methodology does not degrade performance in terms of accuracy for datasets that do demonstrate balanced distributions across subgroups. In fact, we demonstrated a substantial increase in global accuracy and individual subgroup accuracy on three UCI datasets as compared to existing fairness algorithms with hard equality constraints.
We note that our approach requires computing subgroup pseudo-optimal accuracy on sufficient subgroup samples. Avoiding the pitfall of achieving trivial accuracy with more sensitive subgroups comes at this cost of preprocessing/evaluation time. As the number of sensitive variables grows, the sample size required to establish pseudo-optimal statistics may become infeasible, and individual fairness definitions [33, 37] may be more appropriate.
References
 [1] Alekh Agarwal, Alina Beygelzimer, Miroslav Dudík, John Langford, and Hanna M. Wallach. A reductions approach to fair classification. CoRR, abs/1803.02453, 2018.
 [2] A. Allabban, JE Hollander, and JM Pines. Gender, race and the presentation of acute coronary syndrome and serious cardiopulmonary diagnoses in ed patients with chest pain. In Emergency Medicine Journal, volume 34, pages 653–658, 2017.
 [3] Alex Beutel, Jilin Chen, Zhe Zhao, and Ed Huai hsin Chi. Data decisions and theoretical implications when adversarially learning fair representations. CoRR, abs/1707.00075, 2017.
 [4] P. J. Bickel, E. A. Hammel, and J. W. O’Connell. Sex bias in graduate admissions: Data from berkeley. Science, 187(4175):398–404, 1975.
 [5] Tolga Bolukbasi, Kai-Wei Chang, James Y. Zou, Venkatesh Saligrama, and Adam Kalai. Man is to computer programmer as woman is to homemaker? Debiasing word embeddings. CoRR, abs/1607.06520, 2016.
 [6] Joy Buolamwini and Timnit Gebru. Gender shades: Intersectional accuracy disparities in commercial gender classification. In Sorelle A. Friedler and Christo Wilson, editors, Proceedings of the 1st Conference on Fairness, Accountability and Transparency, volume 81 of Proceedings of Machine Learning Research, pages 77–91, New York, NY, USA, 23–24 Feb 2018. PMLR.
 [7] Robin D. Burke. Multisided fairness for recommendation. CoRR, abs/1707.00093, 2017.
 [8] Abhijnan Chakraborty, Johnnatan Messias, Fabrício Benevenuto, Saptarshi Ghosh, Niloy Ganguly, and Krishna P. Gummadi. Who makes trends? understanding demographic biases in crowdsourced recommendations. CoRR, abs/1704.00139, 2017.
 [9] Silvia Chiappa and Thomas P. S. Gillam. Path-specific counterfactual fairness, 2018.
 [10] Sam Corbett-Davies and Sharad Goel. The measure and mismeasure of fairness: A critical review of fair machine learning. CoRR, abs/1808.00023, 2018.
 [11] Dua Dheeru and Efi Karra Taniskidou. UCI machine learning repository, 2017.
 [12] Cynthia Dwork, Nicole Immorlica, Adam Tauman Kalai, and Max Leiserson. Decoupled classifiers for group-fair and efficient machine learning. In Sorelle A. Friedler and Christo Wilson, editors, Proceedings of the 1st Conference on Fairness, Accountability and Transparency, volume 81 of Proceedings of Machine Learning Research, pages 119–133, New York, NY, USA, 23–24 Feb 2018. PMLR.
 [13] E. E. Eban, M. Schain, A. Mackey, A. Gordon, R. A. Saurous, and G. Elidan. Scalable Learning of Non-Decomposable Objectives. ArXiv e-prints, August 2016.
 [14] Dean Foster and Rakesh Vohra. An economic argument for affirmative action. Rationality and Society, 4(2):176–188, 1992.
 [15] Ioannis Giagkiozis and Peter Fleming. Methods for many-objective optimization: an analysis. November 2012.
 [16] Parke Godfrey, Ryan Shipley, and Jarek Gryz. Algorithms and analyses for maximal vector computation. The VLDB Journal, 16(1):5–28, January 2007.

 [17] Nina Grgic-Hlaca. The case for process fairness in learning: Feature selection for fair decision making. 2016.
 [18] Nina Grgic-Hlaca, Elissa M. Redmiles, Krishna P. Gummadi, and Adrian Weller. Human perceptions of fairness in algorithmic decision making: A case study of criminal risk prediction. In Proceedings of the 2018 World Wide Web Conference, WWW ’18, pages 903–912, Republic and Canton of Geneva, Switzerland, 2018. International World Wide Web Conferences Steering Committee.
 [19] Moritz Hardt, Eric Price, and Nathan Srebro. Equality of opportunity in supervised learning. CoRR, abs/1610.02413, 2016.
 [20] Hoda Heidari, Claudio Ferrari, Krishna P. Gummadi, and Andreas Krause. Fairness behind a veil of ignorance: A welfare analysis for automated decision making. CoRR, abs/1806.04959, 2018.
 [21] James Vacca, Corey D. Johnson, Rafael Salamanca Jr., Vincent J. Gentile, Robert E. Cornegy Jr., Jumaane D. Williams, Ben Kallos, Carlos Menchaca, and Helen K. Rosenthal. A local law in relation to automated decision systems used by agencies, 2018.
 [22] Michael Kearns, Seth Neel, Aaron Roth, and Zhiwei Steven Wu. Preventing fairness gerrymandering: Auditing and learning for subgroup fairness. CoRR, abs/1711.05144, 2017.
 [23] Niki Kilbertus, Adria Gascon, Matt Kusner, Michael Veale, Krishna Gummadi, and Adrian Weller. Blind justice: Fairness with encrypted sensitive attributes. In Jennifer Dy and Andreas Krause, editors, Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 2630–2639, Stockholmsmässan, Stockholm Sweden, 10–15 Jul 2018. PMLR.
 [24] Zachary Lipton, Julian McAuley, and Alexandra Chouldechova. Does mitigating ML's impact disparity require treatment disparity? In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems 31, pages 8125–8135. Curran Associates, Inc., 2018.
 [25] Lydia T. Liu, Sarah Dean, Esther Rolf, Max Simchowitz, and Moritz Hardt. Delayed impact of fair machine learning. CoRR, abs/1803.04383, 2018.
 [26] David Madras, Elliot Creager, Toniann Pitassi, and Richard Zemel. Fairness through causal awareness: Learning causal latentvariable models for biased data. In Proceedings of the Conference on Fairness, Accountability, and Transparency, FAT* ’19, pages 349–358, New York, NY, USA, 2019. ACM.
 [27] Shie Mannor, Dori Peleg, and Reuven Rubinstein. The cross entropy method for classification. In Proceedings of the 22nd International Conference on Machine Learning, ICML ’05, pages 561–568, New York, NY, USA, 2005. ACM.
 [28] Aditya Krishna Menon and Robert C Williamson. The cost of fairness in binary classification. In Proceedings of the 1st Conference on Fairness, Accountability and Transparency, volume 81 of Proceedings of Machine Learning Research, pages 107–118, New York, NY, USA, 23–24 Feb 2018. PMLR.
 [29] K. Miettinen. Nonlinear multiobjective optimization, 1999.
 [30] Nicolas Papernot, Martín Abadi, Úlfar Erlingsson, Ian J. Goodfellow, and Kunal Talwar. Semi-supervised knowledge transfer for deep learning from private training data. CoRR, abs/1610.05755, 2016.
 [31] Geoff Pleiss, Manish Raghavan, Felix Wu, Jon M. Kleinberg, and Kilian Q. Weinberger. On fairness and calibration. CoRR, abs/1709.02012, 2017.
 [32] Manish Raghavan, Aleksandrs Slivkins, Jennifer Wortman Vaughan, and Zhiwei Steven Wu. The externalities of exploration and how data diversity helps exploitation. CoRR, abs/1806.00543, 2018.
 [33] Guy N. Rothblum and Gal Yona. Probably approximately metricfair learning. CoRR, abs/1803.03242, 2018.
 [34] Samira Samadi, Uthaipon Tantipongpipat, Jamie Morgenstern, Mohit Singh, and Santosh Vempala. The price of fair PCA: One extra dimension. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, NIPS’18, pages 10999–11010, USA, 2018. Curran Associates Inc.
 [35] Andrew D. Selbst, Danah Boyd, Sorelle A. Friedler, Suresh Venkatasubramanian, and Janet Vertesi. Fairness and abstraction in sociotechnical systems. In Proceedings of the Conference on Fairness, Accountability, and Transparency, FAT* ’19, pages 59–68, New York, NY, USA, 2019. ACM.
 [36] Till Speicher, Muhammad Ali, Giridhari Venkatadri, Filipe Nunes Ribeiro, George Arvanitakis, Fabrício Benevenuto, Krishna P. Gummadi, Patrick Loiseau, and Alan Mislove. Potential for discrimination in online targeted advertising. In Sorelle A. Friedler and Christo Wilson, editors, Proceedings of the 1st Conference on Fairness, Accountability and Transparency, volume 81 of Proceedings of Machine Learning Research, pages 5–19, New York, NY, USA, 23–24 Feb 2018. PMLR.
 [37] Till Speicher, Hoda Heidari, Nina Grgic-Hlaca, Krishna P. Gummadi, Adish Singla, Adrian Weller, and Muhammad Bilal Zafar. A unified approach to quantifying algorithmic unfairness: Measuring individual & group unfairness via inequality indices. CoRR, abs/1807.00787, 2018.
 [38] Paul Tseng and Sangwoon Yun. A coordinate gradient descent method for nonsmooth separable minimization. Mathematical Programming, 117(1):387–423, Mar 2009.
 [39] Martin Vincent and Niels Richard Hansen. Sparse group lasso and high dimensional multinomial classification. Comput. Stat. Data Anal., 71(C):771–786, March 2014.
 [40] Blake Woodworth, Suriya Gunasekar, Mesrob I. Ohannessian, and Nathan Srebro. Learning nondiscriminatory predictors. In Satyen Kale and Ohad Shamir, editors, Proceedings of the 2017 Conference on Learning Theory, volume 65 of Proceedings of Machine Learning Research, pages 1920–1953, Amsterdam, Netherlands, 07–10 Jul 2017. PMLR.
 [41] Muhammad Bilal Zafar, Isabel Valera, Manuel Gomez-Rodriguez, Krishna P. Gummadi, and Adrian Weller. From parity to preference-based notions of fairness in classification. In NIPS, 2017.

 [42] Muhammad Bilal Zafar, Isabel Valera, Manuel Gomez Rodriguez, and Krishna P. Gummadi. Fairness constraints: Mechanisms for fair classification. In Aarti Singh and Jerry Zhu, editors, Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, volume 54 of Proceedings of Machine Learning Research, pages 962–970, Fort Lauderdale, FL, USA, 20–22 Apr 2017. PMLR.
 [43] Jieyu Zhao, Tianlu Wang, Mark Yatskar, Vicente Ordonez, and Kai-Wei Chang. Men also like shopping: Reducing gender bias amplification using corpus-level constraints. In EMNLP, 2017.
7 Appendix A
7.1 Convergence
We first present a lemma derived for sparse group lasso regularizers [39], which shows the convexity of block-regularized objectives that minimize the loss function under constraints on the model parameters β.
Lemma 3.
If f is a convex, twice-differentiable loss function and Ω is a convex penalty, then the sparse group lasso minimization problem, min_β f(β) + λΩ(β), with λ ≥ 0, is also convex.
Moreover, it has been shown that the convexity argument holds even when Ω is non-differentiable, as long as the block separability of Ω holds, i.e. Ω(β) = Σ_g Ω_g(β_g), where the component Ω_g denotes the g-th subgroup's deviation from its group-optimal performance. Hence, adopting block-level gradient descent, where the gradients are backpropagated only after each batch's block performances are computed, has been shown to converge in [38].
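A minimal sketch of such a block-separable penalty and one sweep of block coordinate gradient descent (our own illustrative code, with a group-lasso-style per-block norm standing in for the per-subgroup deviation terms; none of these function names come from the paper):

```python
import numpy as np

def block_separable_penalty(beta_blocks, weights):
    """Block-separable regularizer: Omega(beta) = sum_g Omega_g(beta_g).

    Illustrated here with a group-lasso-style L2 norm per block; in PEF
    each Omega_g would instead measure subgroup g's deviation from its
    group-optimal performance."""
    return sum(w * np.linalg.norm(b) for w, b in zip(weights, beta_blocks))

def block_coordinate_step(beta_blocks, grad_fn, lr=0.1):
    """One sweep of block coordinate gradient descent [38]: each block
    is updated in turn while the remaining blocks are held fixed."""
    for g in range(len(beta_blocks)):
        beta_blocks[g] = beta_blocks[g] - lr * grad_fn(beta_blocks, g)
    return beta_blocks
```

Convergence of such sweeps relies on the loss being convex and the penalty separating across blocks, exactly the conditions used in the argument above.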
7.2 Discoverability
The above section shows that the Pareto-Efficient Fairness loss indeed converges to a minimum with the use of a convex loss function. However, it remains to be seen that the minimum obtained is a Pareto-optimal operating point. For this, we now provide insights behind the choice of the regularizers in Pareto-Efficient Fairness, based on the theory of multiple-objective optimization [15]. Specifically, we use the theory of decomposition-based methods, which employ a scalarization technique to convert multiple objectives f_1, …, f_k into a single objective using a Weighted Metric method. Here, the distance of each objective from a Utopian reference point z* is measured and a corresponding norm is minimized, i.e. min_x ( Σ_i w_i |f_i(x) − z_i*|^p )^(1/p).
The knowledge of a Utopian reference point is usually based on prior domain knowledge. In our adaptation for Pareto-Efficient Fairness, we initialize z* to the vector of subgroup performances obtained when training exclusively on each subgroup's data alone.
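A minimal sketch of this weighted metric scalarization (our own function and argument names; the actual PEF loss is defined in the main text):

```python
import numpy as np

def weighted_metric_loss(performances, utopia, weights, p=2):
    """Weighted L_p distance of the subgroup performances from the
    Utopian reference point: (sum_i w_i * |f_i - z_i*|**p)**(1/p).

    `utopia` plays the role of z*: in the PEF adaptation it is the
    vector of performances obtained by training exclusively on each
    subgroup's own data."""
    performances = np.asarray(performances, dtype=float)
    utopia = np.asarray(utopia, dtype=float)
    weights = np.asarray(weights, dtype=float)
    return float((weights * np.abs(performances - utopia) ** p).sum() ** (1.0 / p))
```

With all weights set to 1, as in PEF, the loss penalizes every subgroup's deviation from its group-optimal performance symmetrically.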
Using the above Weighted Metric method provides the ability to discover both convex and non-convex Pareto curves, as shown in [15]. Similarly, the L∞ (Tchebycheff) norm has the ability to discover all points on the Pareto front for some weight vector, as stated in the lemma below [29].
Lemma 4.
Let x be a Pareto-optimal solution; then there exists a positive weight vector w such that x is a solution of the weighted Tchebycheff problem min_x max_i w_i |f_i(x) − z_i**|, where the reference point z** is the utopian objective vector.
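The weighted Tchebycheff objective in Lemma 4 can be sketched as follows (illustrative naming, not the paper's code):

```python
import numpy as np

def tchebycheff_loss(performances, utopia, weights):
    """Weighted Tchebycheff (L_inf) scalarization: the largest weighted
    deviation of any objective from the utopian reference point."""
    diffs = np.abs(np.asarray(performances, float) - np.asarray(utopia, float))
    return float(np.max(np.asarray(weights, float) * diffs))
```

Minimizing this objective drives down the single worst subgroup's deviation, which is why it can reach every Pareto-optimal point for a suitable weight vector.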
However, it can be seen that as p → ∞, the objective function becomes non-differentiable, and hence it is of interest to choose the minimum possible p for which the above statement still holds. While a universal result for all Pareto curves is still unknown, a significant result from [15] is presented below for the case where the Pareto curve is known to be continuous.
Lemma 5.
If the Pareto-front geometry is continuous, where f_1, …, f_k denote the objectives to be optimized and the front can be parameterized as f_k = h(f_1, …, f_{k−1}), such that h is Lipschitz continuous with constant K, then for the choice of p ≥ K, the same guarantees of discoverability from the Tchebycheff problem will hold when using the scalarization min_x Σ_i w_i |f_i(x) − z_i*|^p.
In real datasets, the number of points observed on the Pareto front is finite, and hence we usually make the assumption that the Pareto curve is extrapolated from the observed points. Under this assumption, the above lemma holds on the continuous extrapolated Pareto curve. Also, in our case, where we optimize subgroup performance, we know that each f_i ∈ [0, 1] if accuracy (error) or any other performance metric is suitably scaled. With this tight bound on the performance values, the condition to be satisfied, K ≤ 1, becomes trivial to satisfy empirically under the constraints of numerical precision. For all p such that p ≥ 1, we have that x^p ≤ x for x ∈ [0, 1]; this is evident as x^p = x · x^(p−1) ≤ x, for x ∈ [0, 1] and p ≥ 1. Since all boundary conditions are also bounded by the limit of 1, we can safely assume K ≤ 1 and satisfy the condition for most practical purposes, as illustrated in Figure 4. Thus, for any choice of p ≥ K, i.e. p = 1, 2, 3, …, we see that our weighted metric method produces all discoverable points on the Pareto curve, and hence we can be fairly well guaranteed (up to errors of numerical precision) that the minimization procedure will find a point on the Pareto curve. Note that the weights of the weighted metric method in our case are based on the fairness criterion and hence are all set to 1. This further imposes the fairness constraints during the discovery of points on the Pareto curve.
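The bound used in this argument, x^p ≤ x for metrics scaled to [0, 1] and p ≥ 1, is easy to verify numerically:

```python
import numpy as np

# For metrics scaled to [0, 1], raising a deviation to a power p >= 1
# can only shrink it, consistent with the bound used above.
xs = np.linspace(0.0, 1.0, 101)
for p in (1, 2, 3):
    assert np.all(xs ** p <= xs + 1e-12)
```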
7.3 Efficiency
In this subsection, we provide an analysis of when we expect PEF to outperform standard notions of fairness, like equality of opportunity, i.e., when PEF has higher efficiency. [28] defines the fairness frontier, which intuitively measures the tradeoff between utility (accuracy) and fairness inherent to the problem's distribution, rather than one owing to the specific technique used, however sophisticated, by computing the fundamental limits of what accuracy is achievable by any classifier. Specifically, the frontier is computed using a cost-sensitive measure which quantifies the alignment between the Bayes-optimal plug-in classifier thresholds for the outcome and sensitive-attribute distributions [Proposition 8 in [28]].
Lemma 6.
As the absolute gradient of the fairness frontier increases near the desired fairness constraint, the efficiency gained from using PEF is monotonically nondecreasing.
Specifically, if the fairness frontier shows that for small concessions of the fairness requirement, the limit of achievable accuracy is much higher, then there is a possibility that PEF would outperform by choosing such a point on the fairness frontier, as shown in Figure 5. However, this is a necessary but not sufficient condition: PEF would choose such a point only if the new point Pareto-dominates the subgroup accuracies of the operating point without the fairness concession. The amount of concession in Pareto-dominance that we are willing to allow is domain-dependent and can be controlled by tuning the corresponding relaxation parameter in the PEF loss function. Hence, PEF performs better in conditions where the fairness frontier is steep around the fairness requirement and a potential increase in accuracies is achievable in a Pareto-dominant manner.