What is Fair? Exploring Pareto-Efficiency for Fairness Constrained Classifiers

10/30/2019, by Ananth Balashankar, et al.

The potential for learned models to amplify existing societal biases has been broadly recognized. Fairness-aware classifier constraints, which enforce equality of performance metrics across subgroups defined on sensitive attributes such as race and gender, seek to rectify inequity but can yield non-uniform degradation in performance on skewed datasets. In certain domains, imbalanced degradation of performance can constitute another form of unintentional bias. In the spirit of constructing fairness-aware algorithms as a societal imperative, we explore an alternative: Pareto-Efficient Fairness (PEF). Theoretically, we prove that PEF identifies the operating point on the Pareto curve of subgroup performances closest to the fairness hyperplane, maximizing accuracy across multiple subgroups. Empirically, we demonstrate that PEF outperforms strict fairness constraints on several UCI datasets by achieving Pareto levels of accuracy for all subgroups.


1 Introduction

As repeatedly demonstrated in the news, in medicine, in law, and in numerous related ML papers [43, 20, 5], societal inequities run the very real risk of being vastly exacerbated if machine learning algorithms do not explicitly address fairness in model formulation and data collection. Numerous philosophical notions of fairness exist (distributive, procedural, etc.) [35, 17, 33], and the appropriateness of each definition may depend on context. Theoretically, in an equitable world of perfect data, implying perfect accuracy across all possible subgroup populations, a uniformly fair classifier could be created. However, with skewed real-world data, [28] has shown that a tradeoff exists between fairness and accuracy. We propose an alternative fairness constraint based on Pareto-Efficiency [16] to avoid performance degradation within some subgroups, while striving for increased accuracy for all subgroups.

A popular approach to learning fairness-aware classification models is to enforce strict metric-equality constraints with respect to sensitive variables such as race and gender. Such equality constraints are being adopted as law in some locales [21]. However, by definition, enforcing strict equality constraints ensures that classification accuracy is limited by the worst-performing subgroup. For a variety of reasons (historical injustices [18], sampling bias [8], and selection bias [36], among others), subgroup populations are often not fully or fairly represented in commonly used real-world datasets. The discrepancies are particularly alarming when algorithmic models are used to predict medical and legal outcomes. For example, in a cardiology study of over 4000 ER patients with cardiac event symptoms [2], no symptoms were found to be predictive of a heart attack in white women. In black males, only an unrelated symptom (diaphoresis) was found to be indicative of a future cardiac event with 95 percent confidence, while in white males relevant features (left arm radiation, pressure/tightness) were detected with high accuracy. To reiterate: a classifier built on this longitudinal and "diverse" dataset to predict an ER cardiac event could conceivably achieve high accuracy for white men, while yielding only trivial accuracy of 50 percent for white women and black men.

Figure 1: Illustration of group-level skew on the UCI Heart Attack dataset: groups based on age and gender have disparate target distributions, and applying strict equality might unintentionally degrade accuracy without benefiting any group.
Figure 2: Illustration of Pareto-Efficient Fairness on synthetic data: the most accurate strictly-equal classifier lies at the operating point (0.6, 0.6). If accuracy for each of the groups is separately maximized, we would choose the points $P_a$ and $P_b$ respectively. However, PEF will choose, from among the points on the Pareto front, the point PE, which improves the accuracies of both $a$ and $b$ within the fairness relaxation bounds (in gray).

A philosophical question for the reader and practitioners is: should a classifier be constructed in such a scenario at all? Without full historical data and appropriate domain causal knowledge, it may be infeasible to approach *fair* learning [26]. The above example is extreme both in its repercussions and in the skew of the data. However, we argue and demonstrate that such skew is common in datasets frequently used to evaluate fairness-aware learning: UCI Adult, German Credit, and Heart Attack. Figure 1 demonstrates the skew in accuracy across age-gender subgroups in the UCI Heart Attack dataset.

In scenarios where domain practitioners deem the above considerations acceptable for constructing a classifier, we propose an alternative fairness constraint based on Pareto-Efficiency [16] to avoid unintentional degradation in subgroup performance, while striving for increased accuracy. The methodology is applicable to highly skewed data, as in the cardiac example above. The Pareto-Efficient Fairness (PEF) constraint restricts the choice of ML models to the Pareto frontier to ensure higher accuracy across *fair* model options. In some cases, a Pareto-Efficient definition may be at odds with a strict equality fairness criterion. Figure 2 illustrates cases on a synthetic dataset where extremely unequal models might be Pareto optimal and vice versa. However, PEF avoids this pitfall by limiting the search space on the Pareto frontier to within acceptable fairness bounds.

Our proposed bias loss function achieves Pareto-Efficient performance, outperforming solutions based on equalizing subgroup performance. Our algorithm iteratively searches for Pareto-optimal subgroup performance, leveraging the benefits of transfer learning to minimize the Group Pareto loss. Using theory from multiple-objective optimization for continuous Pareto fronts, we prove that if the data distribution has high disalignment [28] between the subgroup and the outcome, PEF will discover all Pareto-optimal points and converge to a solution that is better than the Bayes-optimal solutions of existing fairness-constraint-based algorithms (if one exists). Empirically, we show that our approach achieves an operating point that is better, both in global accuracy and in individual subgroup accuracy, than methods that approximate hard equality constraints [43] and adversarial multi-task learning [3] on three UCI datasets.

2 Pareto-Efficient Fairness

2.1 Motivation

To motivate the need for Pareto-Efficient Fairness, consider a simplistic binary classification task $y = f(x)$, where $x$ is a continuous scalar feature for each example in the dataset $D$. We partition the dataset $D$ into a set of groups such that the examples in groups $a$ and $b$ differ in their values for the sensitive variables $S$.

We denote the accuracy of a classifier evaluated on test samples from groups $a$ and $b$ as $Acc_a$ and $Acc_b$ respectively. We thus say that the classifier evaluates to an operating point $(Acc_a, Acc_b)$. We can trace out the scatter plot of operating points by varying the threshold $t$ of classifiers of the form

$$\hat{y}(x) = \mathbb{1}[x \ge t] \qquad (1)$$

In Figure 2, we plot these operating points over a given data distribution by varying $t$. If the data were drawn from a uniform random distribution over a class-balanced label set for both groups $a$ and $b$, with the label generated by flipping a fair coin, then the expected operating point of any classifier would be (0.5, 0.5) (denoted by "chance"). In other, non-degenerate scenarios, it is possible to maximize the accuracy $Acc_g$ of a single group $g$ regardless of the accuracies of the other groups. In our case, we denote the operating points that maximize the accuracy of groups $a$ and $b$ by $P_a$ and $P_b$ respectively. A strict equality fairness criterion would require that the classifier operate on the $Acc_a = Acc_b$ line.

However, if the objective is to improve the performance of all groups to meet the level of the highest-performing group, then choosing points on the $Acc_a = Acc_b$ line might not be desirable. Such objectives are common in policies around affirmative action [14] and in recent fairness literature [6], where, even though parity between demographic groups is desired, interventions are made to improve all groups to the level of the highest-performing group. Hence, choosing the Pareto-Efficient point (denoted by "PE") may be a more desirable solution, as it increases the accuracy of both groups $a$ and $b$ compared to a solution obtained by enforcing a strict equality constraint.
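To make the geometry of Figure 2 concrete, the following is a minimal sketch (the Gaussian group distributions and all parameters are illustrative assumptions, not taken from the paper) that sweeps the threshold $t$ of the classifier in Equation (1), records the operating points $(Acc_a, Acc_b)$, and flags which of them are Pareto-Efficient.

    import numpy as np

    rng = np.random.default_rng(0)

    def sample_group(n, mu_pos, mu_neg):
        """Balanced binary labels; feature x is Gaussian around a class mean."""
        y = rng.integers(0, 2, n)
        x = np.where(y == 1, rng.normal(mu_pos, 1.0, n), rng.normal(mu_neg, 1.0, n))
        return x, y

    # Hypothetical skew: group b's classes are harder to separate than group a's.
    xa, ya = sample_group(5000, mu_pos=1.5, mu_neg=-1.5)
    xb, yb = sample_group(5000, mu_pos=0.5, mu_neg=-0.5)

    thresholds = np.linspace(-3.0, 3.0, 121)
    ops = np.array([[np.mean((xa >= t).astype(int) == ya),
                     np.mean((xb >= t).astype(int) == yb)] for t in thresholds])

    # An operating point is Pareto-Efficient if no other point improves
    # the accuracy of every group simultaneously.
    pareto_mask = np.array([not np.any(np.all(ops > p, axis=1)) for p in ops])
    print(ops[pareto_mask])  # the Pareto front of (Acc_a, Acc_b) points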

2.2 Definitions

For completeness, we provide some definitions of model performance metrics which may be used to evaluate fairness on the held-out test data set.

Accuracy: The fraction of test samples which were classified correctly by the model as compared to the ground truth class labels.

False Positive Rate: In case of a binary classification task with positive and negative classes, this is the fraction of test samples which were incorrectly assigned positive by the model among the total number of ground truth negative samples.

False Negative Rate: In case of a binary classification task with positive and negative classes, this is the fraction of test samples which were incorrectly assigned negative by the model among the total number of ground truth positive samples.

These simple definitions of model performance have been used as objectives for maximization/minimization under constraints requiring that sensitive-group-level model performance (possibly measured using a different metric than the objective) be equal.

Parity Loss: If the group-level model performances are unequal, then the sum of absolute deviations of the group-level performances from the overall model performance is defined as the Parity loss. If $M_g$ denotes the model performance of group $g$ and $M$ denotes the overall model performance, then the parity loss is given by

$$L_{Parity} = \sum_{g} | M_g - M | \qquad (2)$$
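As a direct transcription of Equation (2), with hypothetical accuracy numbers:

    import numpy as np

    def parity_loss(group_perf, overall_perf):
        """Eq. (2): sum of absolute deviations of each group's performance
        from the overall model performance."""
        return float(np.sum(np.abs(np.asarray(group_perf) - overall_perf)))

    # Hypothetical per-group accuracies and overall accuracy.
    print(parity_loss([0.89, 0.88, 0.82, 0.78], 0.86))  # ~0.17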

There are many fairness definitions proposed in the literature [10], but we present one of the commonly used strict-equality constraints below, followed by our definition of Pareto-Efficient Fairness.

Equality of Odds [19]: We say that a predictor $\hat{Y}$ satisfies equalized odds with respect to the sensitive attribute set $S$ and outcome $Y$, if $\hat{Y}$ and $S$ are independent conditional on $Y$:

$$P(\hat{Y} = 1 \mid S = s, Y = y) = P(\hat{Y} = 1 \mid Y = y) \quad \forall s, y \qquad (3)$$
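Equation (3) can be audited empirically by comparing group-conditional true and false positive rates; below is a minimal sketch (the function name and the 0/1 encoding of predictions are our assumptions).

    import numpy as np

    def equalized_odds_gaps(y_true, y_pred, s):
        """Largest cross-group gaps in TPR and FPR; Eq. (3) holds when both are ~0."""
        y_true, y_pred, s = map(np.asarray, (y_true, y_pred, s))
        tprs = [np.mean(y_pred[(s == g) & (y_true == 1)]) for g in np.unique(s)]
        fprs = [np.mean(y_pred[(s == g) & (y_true == 0)]) for g in np.unique(s)]
        return max(tprs) - min(tprs), max(fprs) - min(fprs)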

Pareto-Efficiency: We define Pareto-Efficient points as the set of operating points for which there exists no other point with better performance (e.g., accuracy) across all the groups.

One possible pitfall of the above definition for fairness is that the points $P_a$ or $P_b$ could be selected as a Pareto-Efficient solution. Both are trivially Pareto-Efficient, since no other point performs better across all groups, yet both are grossly unequal across groups.

Pareto-Efficient Fairness: We say an operating point is Pareto-Efficient Fair if it is Pareto-Efficient and minimizes the variance of the Pareto error across groups.

Formally, let $T^*$ denote the set of thresholds characterizing Pareto-Efficient points, $Acc_g(t)$ the performance metric for group $g$ at threshold $t$, and $Acc_g^* = \max_t Acc_g(t)$ the optimum performance metric across all operating points for group $g$. The Pareto error for group $g$ is

$$e_g(t) = Acc_g^* - Acc_g(t) \qquad (4)$$

Writing $\mathrm{Var}_g[e_g(t)]$ for the variance of the Pareto errors across all groups, we intend to find a threshold that characterizes a Pareto-Efficient Fair operating point:

$$t^{PEF} = \arg\min_{t \in T^*} \mathrm{Var}_g[e_g(t)] \qquad (5)$$
Since it is empirically difficult to find all Pareto-Efficient thresholds $T^*$ without sufficient exploration, a simple heuristic is to choose a threshold that minimizes the total absolute Pareto penalty:

$$t^{PEF} \approx \arg\min_{t} \sum_g | e_g(t) | \qquad (6)$$

We combine these two minimization criteria using a Lagrangian factor $\lambda$ as follows:

$$t^{PEF} = \arg\min_{t} \sum_g | e_g(t) | + \lambda\, \mathrm{Var}_g[e_g(t)] \qquad (7)$$

This formulation should not be confused with simple fairness and accuracy tuning parameters, as proposed in [43]. The nuance is that the objective is expressed relative to the heuristic pseudo-optimum performance of each group, which is central to the argument of Pareto-Efficient Fairness, thereby encouraging all subgroups to perform at their best possible levels. This is demonstrated when $\lambda \to \infty$ and only the variance of the Pareto errors is minimized, which is not the same as equality of odds [19]. Similarly, when $\lambda = 0$, we minimize the total absolute Pareto penalty, which is different from the unconstrained optimization in [43].
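Putting Equations (4)-(7) together, the following sketch selects the threshold over a precomputed grid; acc (a thresholds-by-groups array of group accuracies) and lam (the Lagrangian factor $\lambda$) are names of our own choosing.

    import numpy as np

    def pef_threshold(acc, lam=1.0):
        """Pick the threshold index minimizing Eq. (7):
        total |Pareto error| (Eq. 6) + lam * variance of Pareto errors (Eq. 5),
        where e_g(t) = Acc*_g - Acc_g(t) and Acc*_g is the best accuracy
        observed for group g over all thresholds (the pseudo-optimum)."""
        acc = np.asarray(acc)
        pareto_err = acc.max(axis=0) - acc          # e_g(t), shape (T, G)
        objective = np.abs(pareto_err).sum(axis=1) + lam * pareto_err.var(axis=1)
        return int(np.argmin(objective))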

Group Pareto Loss: We now generalize the simple example above to any binary classification task. Here, the minimization criterion of the Group Pareto Loss in Equation (5) still holds, but instead of determining a specific threshold corresponding to a Pareto-Efficient operating point, we minimize the Group Pareto Loss over the parameters of the binary classification model.

Figure 3: Illustration of the preference for Pareto loss over Parity loss. In this synthetic-data scenario, two groups perform at random accuracy regardless of the threshold chosen, and two other groups reach higher accuracies as the threshold is increased. The Pareto loss captures how far each group is from its corresponding optimal accuracy level; the Parity loss captures the discrepancy between the groups' accuracies. Parity loss is minimized when all groups perform at random accuracy levels, whereas Pareto loss is minimized when all groups achieve their optimal accuracies, and hence is the better alternative.
Figure 4: The Pareto geometry condition required by Lemma 2 for discoverability of all Pareto-optimal points under Pareto-Efficient Fairness is satisfied for low values of $p$.

The Group Pareto Loss $L_{Pareto}$ is augmented with an appropriate loss weight ($\lambda$) through a Lagrangian dual formulation similar to [13], along with the standard cross-entropy classification loss $L_{CE}$ [27], to yield the Pareto-Efficient Fairness Loss $L_{PEF}$. Note that the standard cross-entropy classification loss aims to maximize overall performance, whereas the penalty term weighted by $\lambda$ ensures that such maximum overall performance is achieved while minimizing Pareto loss. In scenarios where the maximum overall performance is achieved when each group's performance is also maximized, we would not need this augmentation. For cases where the Pareto-optimal and overall-optimal operating points do not coincide, we are able to denote our preference between these two operating points using the penalty term $\lambda$.

$$L_{PEF} = L_{CE} + \lambda\, L_{Pareto} \qquad (8)$$
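Accuracy is not differentiable, so a practical rendering of Equation (8) needs a surrogate. The PyTorch sketch below reflects our own assumptions, not the paper's exact formulation: it penalizes each group's excess cross-entropy above its bootstrapped pseudo-optimal level as a stand-in for the Group Pareto loss.

    import torch
    import torch.nn.functional as F

    def pef_loss(logits, labels, groups, loss_star, lam=1.0):
        """Eq. (8) sketch: L_PEF = L_CE + lam * L_Pareto.
        loss_star[g] is group g's pseudo-optimal cross-entropy (assumed to
        come from per-group bootstrap models); the clipped excess over it
        stands in for the group's Pareto error."""
        per_example = F.cross_entropy(logits, labels, reduction="none")
        pareto = logits.new_zeros(())
        for g, l_star in loss_star.items():
            mask = groups == g
            if mask.any():
                pareto = pareto + F.relu(per_example[mask].mean() - l_star)
        return per_example.mean() + lam * pareto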

2.3 Pareto-Efficient Algorithm

We now put the components mentioned above together and present the Pareto-Efficient bias mitigation algorithm (Algorithm 1). This is an in-processing algorithm which trains a joint model to have favorable fairness properties, as opposed to a post-processing algorithm which fine-tunes a pre-trained model [40]. In order to obtain a heuristic pseudo-optimal group accuracy $Acc_g^*$ for each group $g$, we train a classifier to minimize $L_{CE}$ on the samples of group $g$ from the dataset $D$. Although this may not be the best estimate of a group population's true optimum, we avoid transfer learning during bootstrapping, as it benefits larger groups more than smaller groups [30]. Instead, we propose an iterative approach where $Acc_g^*$ is updated in each iteration whenever a better group accuracy is achieved by the jointly trained model. We use these heuristic pseudo-optimal accuracies to jointly train a Pareto-Efficient model on all subgroups to minimize $L_{PEF}$ in every batch. We further ensure that each mini-batch is representative of the group distributions by sampling group-wise batch samples proportionately.

  $G$: set of sensitive groups, $D$: dataset, $D_g$: data of group $g$
  for $g \in G$ do
     train a model on $D_g$ to minimize $L_{CE}$
     $Acc_g^* \leftarrow$ eval(model, $D_g$)
     record $Acc_g^*$ as the pseudo-optimal accuracy of group $g$
  end for
  while not converged do
     sample a group-proportional mini-batch $B$ from $D$
     $w$ = arg min$_w$ $L_{PEF}(B; w, \{Acc_g^*\})$
     $Acc_g^* \leftarrow \max(Acc_g^*,$ eval($w$, $D_g$)$)$, for all $g \in G$
  end while
Algorithm 1 Iterative Pareto-Efficient Bias Mitigation

Our proposal aims to achieve "potentially optimal" performance for each of the groups while performing better than approximately fairness-constrained classifiers. In Figure 3, we illustrate this by applying Algorithm 1 to a synthetic data distribution with 4 groups, where 2 groups do not perform better than chance independent of the threshold chosen, and the other 2 groups perform better at higher values of the threshold. Hence, Pareto-Fairness is achieved at a high threshold, whereas the point of equal performance (non-Pareto fairness) is reached at a threshold where all groups perform at random-chance levels (more details on the distribution are in Section 5.2). Choosing to minimize the Pareto loss is therefore better than minimizing the Parity loss, which can lead to random accuracy. This approach is similar to current avenues of research where it is acceptable to be aware of the differences [24] between various groups' performance in the dataset and operate in a way that improves them, as opposed to fairness through blindness [23].
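A compact rendering of Algorithm 1's outer loop, reusing pef_loss from the sketch above; the integer group ids, the simplified bootstrap (the paper trains a separate model per group), and the full-batch "proportional" sampling are all simplifying assumptions of ours.

    import torch
    import torch.nn.functional as F

    def iterative_pef(model, opt, data_by_group, lam=1.0, epochs=20):
        """data_by_group[g] = (x, y) tensors for sensitive group g (g an int id)."""
        # Bootstrap pseudo-optima; Algorithm 1 would train one model per group here.
        with torch.no_grad():
            loss_star = {g: F.cross_entropy(model(x), y).item()
                         for g, (x, y) in data_by_group.items()}
        for _ in range(epochs):
            # Concatenating every group keeps the batch group-proportional.
            xs = torch.cat([x for x, _ in data_by_group.values()])
            ys = torch.cat([y for _, y in data_by_group.values()])
            gs = torch.cat([torch.full((len(y),), g)
                            for g, (_, y) in data_by_group.items()])
            opt.zero_grad()
            pef_loss(model(xs), ys, gs, loss_star, lam).backward()
            opt.step()
            # Iterative update: tighten a group's pseudo-optimum whenever the
            # jointly trained model beats it.
            with torch.no_grad():
                for g, (x, y) in data_by_group.items():
                    loss_star[g] = min(loss_star[g],
                                       F.cross_entropy(model(x), y).item())
        return model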

3 Properties of Pareto-Efficient Fairness

In this section, we outline key theoretical results about the Pareto-Efficient Bias Mitigation algorithm: its convergence properties, its capacity to discover Pareto curves of subgroup accuracy, and its Pareto efficiency. First, we formalize the inherent disalignment between achieving better accuracy on the target and satisfying the fairness equality constraint for the sensitive group distribution [28]. Let $\eta(x)$ be the class probability function for the target, $t$ be the threshold on class probability for binary classification, $\lambda$ be the parameter that defines the trade-off between accuracy and fairness in the fairness-constrained loss function, and $h_\lambda^*$ denote the Bayes-optimal classifier which minimizes the fairness-constrained loss function. We then define the disalignment $\Delta$ between the target and group distribution as the probability mass on which the fairness-constrained optimum disagrees with the unconstrained Bayes-optimal threshold rule:

$$\Delta = P_x\big( h_\lambda^*(x) \neq \mathbb{1}[\eta(x) \ge t] \big) \qquad (9)$$
Theorem 1.

If $\Delta > 0$, then under convexity assumptions, minimizing the Group Pareto loss will converge the classifier to a Pareto-Efficient operating point of accuracies $(Acc_1^{PEF}, \ldots, Acc_k^{PEF})$, such that for all operating points $(Acc_1^{EQ}, \ldots, Acc_k^{EQ})$ obtained by strictly enforcing the equality fairness constraint, we have $Acc_g^{PEF} \ge Acc_g^{EQ}$ for every group $g$.

In the remainder of the section, we present the outline of the proof of the theorem through key lemmas about the convergence, discoverability and efficiency of the Pareto-Efficient algorithm from optimization and Pareto optimality theory.

3.1 Convergence

To show that $L_{PEF}$ converges, we use the lemma from [39], which shows that under block separability of the penalty over the parameters $\beta$, i.e. $\Omega(\beta) = \sum_g \Omega_g(\beta_g)$, where in our case $\Omega_g$ is the Pareto loss for group $g$, backpropagation using block-level batch gradient descent converges [38].

Lemma 1.

If $f$ is a convex, twice-differentiable loss function, then the sparse-lasso minimization problem $\min_\beta\, f(\beta) + \lambda\,\Omega(\beta)$, with $\Omega$ a block-separable sparse-lasso penalty, is also convex.

In our case, $f$ is the cross-entropy loss function, and the sparse-lasso penalty term is the Pareto loss with penalty weight $\lambda$. ∎

3.2 Discoverability

To show that the converged operating point is Pareto-Efficient, we use the theory of decomposition-based methods [15], which can discover convex and non-convex Pareto curves by scalarizing multiple objectives $f_1, \ldots, f_k$ into a single objective. We minimize the $\ell_p$ norm of the weighted distance of each objective $f_g$ from a Utopian reference point $z^*$, i.e.

$$\min_\theta\ \Big( \sum_g w_g\, | f_g(\theta) - z_g^* |^p \Big)^{1/p}$$

However, as $p \to \infty$, the objective function becomes non-differentiable, and hence we want to choose the minimum possible $p$ for which the above statement still holds. While a result for all Pareto curves is still an open problem, we use a significant result from [15] for the case where the Pareto curve is assumed to be continuous.

Lemma 2.

If the Pareto-front geometry is continuous, so that the front can be parameterized as $f_2 = g(f_1)$ with $|g(f_1)| \le M$ for a constant $M$, then for the choice of $p \ge M$, we are guaranteed to discover the Pareto front using the above scalarization.

In finite datasets, we usually make the continuous Pareto curve assumption, as the curve is interpolated using the observed points. Also, in our case, we know that each $f_g \in [0, 1]$ if accuracy (error) is scaled. With this tight bound on the performance values, the condition $M \le p$ becomes trivial to satisfy empirically for low values of $p$, as shown in Figure 4.
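A quick numeric check of this scalarization, reusing ops and pareto_mask from the threshold-sweep sketch in Section 2.1 (the unit weights and the utopia point are illustrative):

    # Utopian reference point: per-group maxima over the observed operating points.
    z_star = ops.max(axis=0)
    for p in (1, 2, 3):
        scores = (np.abs(ops - z_star) ** p).sum(axis=1) ** (1.0 / p)
        i = int(np.argmin(scores))
        print(f"p={p}: minimizer {ops[i]}, Pareto-Efficient: {bool(pareto_mask[i])}")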

3.3 Efficiency

We further show, using a geometric argument, that for the fairness frontier [Proposition 8 in [28]] defined by $A^*(\epsilon)$, the maximum accuracy achievable at fairness relaxation $\epsilon$, for various values of the fairness penalty, the gain in absolute accuracy obtained by using Pareto-Efficient Fairness is proportional to the absolute gradient of the fairness frontier, and depends solely on the inherent distribution of the target and groups:

$$\text{Gain}_{PEF} \propto \Big| \frac{d A^*(\epsilon)}{d \epsilon} \Big| \qquad (10)$$

Specifically, if the fairness frontier shows that for small concessions of the fairness requirement ($\epsilon$), the limit of achievable accuracy is much higher, then it is possible for PEF to outperform by choosing such a point on the fairness frontier, as shown in Figure 5. However, this is a necessary but not sufficient condition. PEF would choose such a point only if the new point Pareto-dominates the subgroup accuracies of the operating point without the fairness concession. The amount of concession in Pareto dominance that we are willing to allow is domain-dependent and can be controlled by tuning the parameter $\lambda$ in the PEF loss function. Hence, PEF performs better in conditions where the fairness frontier is steep around the fairness requirement and potential increases in performance are achievable in a Pareto-dominant manner.

Figure 5: Relationship between the shape of the fairness frontier and the efficiency gain expected by using PEF. y-axis denotes the maximum achievable overall accuracy for the given fairness constraint (x-axis). Larger values of the fairness constraint do not permit model performances which deviate from the fairness line. If better accuracies are achievable by relaxing the fairness constraint by a small amount, the gain expected by using PEF would also be more.

4 Related Work

Existing fairness mitigation algorithms often explicitly define constraints on model subgroup performance, e.g., Equality of Odds [19], and enforce them using Lagrangian relaxation [28, 25, 7] to achieve corpus-level parity across sensitive variables [43]. With the increase in sensitive variables in real case studies, satisfying such strict constraints remains unexplored [4, 9, 11]. [28] has established that such approximate group fairness constraints cannot be satisfied perfectly unless the underlying sub-populations demonstrate perfect accuracy with respect to the target. [3] models the problem of debiasing as a multi-task learning problem with a penalty if the shared hidden layers of the neural network can be used to predict the sensitive variable accurately.

[41] argues for preference-based notions of fairness as opposed to ones based on parity. [34] provides fair dimensionality-reduction algorithms in which bias loss functions are employed. [42] shows that, for logistic regression and support vector machines, approximate fairness constraints can be enforced at a cost in accuracy.

[31] and [32] prove that equality of odds cannot be achieved by two calibrated models on separate groups unless both models achieve perfect accuracy. The main intuition behind this paper is the hypothesis that similar impossibility regimes exist in real-life scenarios, especially when multiple subgroups exist.

[1] explores the Pareto optimality between overall accuracy and violation of fairness constraints. In our work, however, we focus on the trade-offs between the performances of the various comparable subgroups on the Pareto-optimal curve [29, 15]. Relevant to our work, studies of subgroup-specific performance and the use of transfer-learning-like methods have been explored through decoupling in [12], but they do not provide theoretical results on Pareto-Efficiency. To the best of our knowledge, this is the first work which extends the strong theoretical results of Pareto-Efficiency to achieve better subgroup performance on data distributions with high disalignment between fairness and accuracy.

5 Evaluation

We compare our approach with scaled versions of group fairness [43] and adversarial debiasing [3] for subgroups. In [43], the authors optimize for overall accuracy in the constrained setting of ensuring equal false positive rates, but the method is generally applicable to other measures of performance. For our comparison, we implement an objective to maximize overall accuracy along with a Lagrangian relaxation which adds a penalty for each subgroup that deviates from the overall accuracy. In [3], the authors implement bias mitigation as a way of erasing sensitive group membership by back-propagating negative gradients in a multi-headed feedforward neural network. We compare the techniques on both the UCI datasets and synthetic toy data. The UCI Census Adult task predicts income category from demographic information; the sensitive variables are gender and race. The UCI German task predicts credit type (binary) from demographic information; the sensitive variables are age, gender, and personal status. The UCI Heart Attack task predicts health status from medical and demographic information; age and gender are considered sensitive. The synthetic dataset is constructed to illustrate how the Pareto loss is useful on skewed datasets.

5.1 UCI Datasets

Table 1 shows the Pareto loss, i.e., how much each subgroup deviates from its respective pseudo-optimum, for the UCI Census Adult dataset. We see that our approach achieves zero Pareto loss, while [43] and [3] have non-zero Pareto losses. [43] performs well in terms of lowering the sum of absolute discrepancies of all subgroups' accuracy from the overall accuracy (Parity loss). This is expected, as [43] chooses the operating point closest to equal accuracy when exact equality isn't possible. [3] arrives at an operating point which suffers from non-zero Parity and Pareto loss. Table 2 clarifies why our approach arrives at a better operating point: each of the subgroups has better individual accuracy than under all the other approaches, some even better than the baseline. This confirms empirically that our objective function matches (and sometimes exceeds, due to transfer learning) the heuristic pseudo-optimal performance for each subgroup (last row of Table 2). Similar performance improvements were also observed on the UCI German and Heart Attack datasets; the condensed results are shown in Tables 3 and 4 respectively.

Model Accuracy FPR FNR Parity Loss Pareto Loss
Baseline (no bias loss) 0.630 0.253 0.747 0.199 0.016
[43] 0.619 0.283 0.712 0.167 0.133
[3] 0.648 0.224 0.769 0.226 0.077
Pareto-Efficient Loss 0.678 0.165 0.830 0.250 0.000
Table 1: UCI Adult dataset with bias mitigation algorithms
Model Subgroup 1 2 3 4 Pareto Loss
Baseline (no bias loss) 0.890 0.883 0.818 0.784 0.016
[43] 0.853 0.856 0.806 0.778 0.133
[3] 0.882 0.872 0.824 0.780 0.077
Pareto-Efficient Loss 0.935 0.915 0.844 0.797 0.000
Subgroup Pareto Frontier 0.934 0.894 0.815 0.783 N/A
Table 2: Subgroup accuracy on UCI Adult dataset
Model Accuracy
Baseline (no bias loss) 0.749
[43] 0.696
[3] 0.694
Pareto-Efficient Loss 0.711
Table 3: UCI German dataset with bias mitigation algorithms
Model Accuracy
Baseline (no bias loss) 0.939
[43] 0.870
[3] 0.837
Pareto-Efficient Loss 0.879
Table 4: UCI Heart Attack dataset with bias mitigation algorithms

5.2 Synthetic Data

We varied the hyper-parameters that define the synthetic subgroup distributions and found that, in cases where all subgroups can achieve the same level of accuracy, our approach remains similar in subgroup performance to [43] and [3]. However, in the cases where subgroups differ in their "pseudo-optimal" performance, [43] fails to achieve the "Pareto" accuracy for all subgroups and hence results in lower overall accuracy. [3], however, continues to perform well, and our approach shows improvements only in a few scenarios. We provide detailed performance numbers and subgroup distribution parameters for the synthetic cases below.

We use a joint data distribution over binary sensitive variables $A$ and $B$, and $C$, a confounding variable introduced to control the alignment between the target label and the sensitive group distribution. We present the various dependencies between the confounding variable and the sensitive variables, and the corresponding performance, in Table 5, comparing minimizing the Pareto loss against minimizing the Parity loss [43] and adversarial losses [3].

Confounding dependency | Pareto | [43] | [3]
2*a + 1*b | 0 | 0 | 0
2*b - 2*a | 0 | 0.02 | 0.02
4*b | 0 | 0 | -0.002
8*b | 0 | -0.06 | 0
2*a + 2*b + 2*d | 0 | 0 | 0
(a,b): {(0,0): 3, (0,1): 11, (1,0): 4, (1,1): 8} | 0 | 0 | 0
(a,b): {(0,0): 3, (0,1): 1, (1,0): 4, (1,1): 8} | 0 | -0.04 | -0.04
(a,b): {(0,0): 3, (0,1): 11, (1,0): 4, (1,1): 9} | 0 | -0.01 | -0.01
Table 5: Change in overall accuracy compared to the baseline (no bias loss) on the synthetic dataset

5.2.1 Skewed Dataset Scenario

Below, we show the distribution used to generate a skewed dataset in which attempting to achieve strict equality can result in trivial accuracy. We use a joint data distribution over binary sensitive variables $A$ and $B$, and $C$, a threshold variable introduced to classify the target label $Y$. Let $A$, $B$ be Bernoulli random variables, and let

$$C \mid (B = 1) \sim \mathrm{Uniform}(0, 1) \qquad (11)$$

$$C \mid (B = 0, A = a) \sim \mathcal{N}(\mu_a, \sigma^2), \quad \mu_1 - \mu_0 = k \qquad (12)$$

with a selected constant $k$ separating the means.

Figure 6: Synthetic dataset where Pareto-Efficient fairness is desirable

In this scenario, for the two subgroups with $B = 1$, the confounding variable takes uniformly random values, and for the two subgroups with $B = 0$, the confounding variable is distributed normally, with means at two different values separated by the constant $k$. Suppose we were to train a bias mitigation model which uses only the perceived non-sensitive variable $C$ as the feature to identify the target label $Y$, in order to achieve equal performance in terms of accuracy. While trying to increase overall accuracy, there is a potential trade-off in which the algorithm penalizes the two subgroups identified by $(A, B) \in \{(0, 0), (1, 0)\}$, without any gain for the subgroups identified by $(A, B) \in \{(0, 1), (1, 1)\}$. Effectively, two subgroups are penalized so that their accuracy is brought close to the uniformly random accuracy of the other two subgroups, as shown in Figure 6. This is counter-productive for each of the subgroups in question. Hence, it is necessary to guard against such scenarios by explicitly accounting for this edge case.
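A sketch of a generator in the spirit of Equations (11)-(12); the uniform range, the Normal means, the labeling rule, and the shared evaluation threshold are all illustrative assumptions of ours.

    import numpy as np

    rng = np.random.default_rng(1)

    def skewed_dataset(n=20000, k=4.0):
        """A, B ~ Bernoulli(0.5).
        B=1 subgroups: C is uniform noise, so no threshold on C beats chance.
        B=0 subgroups: C ~ N(k*A, 1) and Y = 1[C > k*A], so a subgroup-specific
        threshold classifies Y well, but no single threshold suits both."""
        A = rng.integers(0, 2, n)
        B = rng.integers(0, 2, n)
        C = np.where(B == 1, rng.uniform(-1.0, k + 1.0, n), rng.normal(k * A, 1.0, n))
        Y = np.where(B == 1, rng.integers(0, 2, n), (C > k * A).astype(int))
        return A, B, C, Y

    A, B, C, Y = skewed_dataset()
    for a in (0, 1):
        for b in (0, 1):
            m = (A == a) & (B == b)
            # One shared threshold (2.0) equalizes all subgroups near chance,
            # while per-subgroup thresholds k*A would classify B=0 groups perfectly.
            acc = np.mean((C[m] > 2.0).astype(int) == Y[m])
            print(f"subgroup A={a}, B={b}: accuracy {acc:.2f}")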

5.3 Bias Loss Weights

The Pareto-loss function contains a single bias loss weight term ($\lambda$, although it can be a vector, as the domain demands) which controls the trade-off between achieving overall accuracy and achieving Pareto-Efficient levels for each of the subgroups, with equal penalty per subgroup. Modifying this weight demonstrates the difference between the equalized-odds loss and the Pareto loss, as can be seen in Figure 7. As the bias weight is increased, Pareto-loss bias mitigation moves the threshold towards Pareto-Efficient levels with higher overall accuracy, whereas the equalized loss moves towards the operating point where the discrepancy across subgroups is minimized, even though it does not improve the accuracy of even one of the subgroups.

Figure 7: Comparison of the effects of bias weights on Pareto loss and Parity loss between strict enforcement (Demographic Parity, DP) and Pareto-Efficient Fairness (PEF). For DP, as the weight $\lambda$ increases, parity loss decreases, but at the cost of a significant increase in Pareto loss. For PEF, increasing the weight reduces Pareto loss without significantly impacting parity loss.
Figure 8: Weighting the fairness loss by prevalence (blue) provides a deceptive notion of fairness: it ignores the minority subgroup and lowers the overall parity and Pareto loss without actually providing any benefit to the minority subgroup. Hence, we adopt an equal weighting of subgroup-level Pareto losses in PEF.

5.4 Prevalence

The number of samples in each subgroup varies, and as a result some minority population subgroups can be ignored by certain bias mitigation techniques. The approach in [22] weights discrepancies by the population ratio of each subgroup. This can have perverse effects on minority subgroups whose performance deviates from either the overall performance or the Pareto-Efficient level: the contribution of the minority subgroup may be ignored while optimizing the loss function. This is clearly seen in Figure 8, where the curve representing the discrepancy from the overall performance is lowered and smoothed when multiplied by the subgroup's population ratio. In some domains this may not be the intended fairness criterion, and it can be quite deceiving when the majority subgroup, which also guides the overall performance, dominates the sum of discrepancy losses too. As such, we guard against this tyranny of the majority by weighting subgroups equally, independent of population prevalence. The weights can be modified appropriately by the domain practitioner.
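A two-line illustration, with hypothetical numbers, of how prevalence weighting can mask a minority subgroup's deficit:

    import numpy as np

    gaps = np.array([0.01, 0.02, 0.01, 0.30])        # per-subgroup performance gaps
    prevalence = np.array([0.40, 0.35, 0.20, 0.05])  # minority subgroup last
    print("equal weighting:     ", gaps.mean())         # 0.085: deficit visible
    print("prevalence weighting:", gaps @ prevalence)   # 0.028: deficit masked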

5.5 Model Capacity

We used 3 layers of feed forward networks with 256, 128 and 64 neurons fully connected in all our comparisons of relevant losses. However, we did notice that the difference in the losses varied with the size of the model used as noted in Table

6. The gains achieved from Pareto-Loss is higher when the model capacity increases, as it is able to capture the subgroup specific performance updates with a larger number of parameters.

Model size Subgroup 1 2 3 4
64 0.923 (0.929) 0.898 (0.914) 0.812 (0.849) 0.733 (0.809)
0.902 (0.925) 0.884 (0.909) 0.820 (0.849) 0.781 (0.805)
0.935 (0.890) 0.915 (0.883) 0.844 (0.818) 0.797 (0.784)
Table 6: Effect of model size on subgroup accuracy (baseline in parentheses) on the UCI Adult dataset

6 Conclusion

Real-world datasets often display subgroup population skew. To mitigate this in the construction of fairness-aware classifiers, we utilize softer Pareto-Efficiency fairness constraints. When subgroup populations differ in prevalence and underlying distributions, a Pareto-Efficient approach yields better overall and individual subgroup performance compared to bias-mitigation algorithms enforcing hard equality constraints. The approach is appropriate for classification problems containing multiple sensitive variables. As demonstrated, the proposed methodology does not degrade accuracy on datasets that do exhibit balanced distributions across subgroups. In fact, we demonstrated a substantial increase in global accuracy and individual subgroup accuracy on three UCI datasets compared to existing fairness algorithms with hard equality constraints.

We note that our approach requires computing subgroup pseudo-optimal accuracy on sufficient subgroup samples. Avoiding the pitfall of trivial accuracy as sensitive subgroups multiply comes at this cost of pre-processing/evaluation time. As the number of sensitive variables grows, the sample size required to establish pseudo-optimal statistics may become infeasible, and individual fairness definitions [33, 37] may be more appropriate.

References

7 Appendix A

7.1 Convergence

We first present the lemma derived for sparse-lasso regularizers [39], which shows the convexity of block-regularized minimizers that minimize the loss function under model-parameter ($\beta$) constraints.

Lemma 3.

If $f$ is a convex, twice-differentiable loss function, then the sparse-lasso minimization problem $\min_\beta\, f(\beta) + \lambda\,\Omega(\beta)$, with $\Omega$ a block-separable sparse-lasso penalty, is also convex.

Moreover, the convexity argument has been shown to hold as long as the block separability of $\Omega$ holds, i.e. $\Omega(\beta) = \sum_g \Omega_g(\beta_g)$, where component $g$ denotes the subgroup's deviation from its group-optimal performance. Hence, adopting block-level gradient descent, where the gradients are backpropagated only after each batch's block performances are computed, has been shown to converge [38].

7.2 Discoverability

The above section shows that the Pareto-Efficient Fairness loss indeed converges to a minimum with the use of a convex loss function. However, it remains to be seen whether the minimum obtained is a Pareto-optimal operating point. For this, we now provide the insight behind the choice of the regularizers in Pareto-Efficient Fairness, based on the theory of multiple-objective optimization [15]. Specifically, we use the theory of decomposition-based methods, which employ a scalarization technique to convert multiple objectives $f_1, \ldots, f_k$ into a single objective using a Weighted Metric method. Here, the distance of each objective from a Utopian reference point $z^*$ is measured and the corresponding $\ell_p$ norm is minimized, i.e.

$$\min_\theta\ \Big( \sum_g w_g\, | f_g(\theta) - z_g^* |^p \Big)^{1/p}$$

The knowledge of a Utopian reference point is usually based on prior domain knowledge. In our adaptation for Pareto-Efficient Fairness, we initialize $z^*$ to the vector of subgroup performances obtained when training exclusively on each subgroup's data alone.

Using the above Weighted Metric method provides the ability to discover both convex and non-convex Pareto curves, as shown in [15]. Similarly, the $\ell_\infty$ norm has the ability to discover all points on the Pareto front for some weight vector, as stated in the lemma below [29].

Lemma 4.

Let $x$ be a Pareto-optimal solution. Then there exists a positive weight vector $w$ such that $x$ is a solution of the weighted Tchebycheff problem $\min_\theta \max_g\ w_g\, | f_g(\theta) - z_g^{**} |$, where the reference point $z^{**}$ is the utopian objective vector.

However, it can be seen that as $p \to \infty$, the objective function becomes non-differentiable, and hence it is of interest to choose the minimum possible $p$ for which the above statement still holds. While a universal result for all Pareto curves is still unknown, a significant result from [15] is presented below for the case where the Pareto curve is known to be continuous.

Lemma 5.

If the Pareto-front geometry is continuous, where the objectives to be optimized can be parameterized as $f_2 = g(f_1)$ such that $|g(f_1)| \le M$ for a constant $M$, then for the choice of $p \ge M$, the same discoverability guarantees as for the Tchebycheff problem hold when using the scalarization $\min_\theta \big( \sum_g w_g\, | f_g(\theta) - z_g^* |^p \big)^{1/p}$.

In real datasets, the number of points observed on the Pareto front is finite, and hence we usually assume that the Pareto curve is extrapolated from the observed points. Under this assumption, the above lemma holds on the continuous extrapolated Pareto curve. Also, in our case, where we optimize subgroup performance, we know that each $f_g \in [0, 1]$ if accuracy (error) or any other performance metric is scaled. With this tight bound on the performance values, the condition $M \le p$ becomes trivial to satisfy empirically under the constraints of numerical precision: since all objective values are bounded by 1, the bound $M$ on the parameterization is itself close to 1, so the condition can safely be satisfied for most practical purposes, as illustrated in Figure 4. Thus, for a choice of $p \ge 1$, i.e. $p = 1, 2, 3, \ldots$, our weighted metric method produces all discoverable points on the Pareto curve, and hence we can be fairly well guaranteed (up to errors of numerical precision) that the minimization procedure finds a point on the Pareto curve. Note that the weights of the weighted metric method in our case are based on the fairness criterion and hence are all set to 1. This further imposes the fairness constraints during the discovery of points on the Pareto curve.

7.3 Efficiency

In this subsection, we provide an analysis of when we expect PEF to outperform standard notions of fairness, like equality of opportunity, i.e., when PEF has higher efficiency. [28] defines the fairness frontier, which intuitively measures the trade-off between utility (accuracy) and fairness inherent in the problem's distribution, rather than owing to the specific technique one uses, no matter how sophisticated, by computing the fundamental limits of the accuracy achievable by any classifier. Specifically, the frontier is computed using a cost-sensitive measure which quantifies the alignment between the Bayes-optimal plug-in classifier thresholds for the outcome and the sensitive attribute distributions [Proposition 8 in [28]].

Lemma 6.

As the absolute gradient of the fairness frontier increases near the desired fairness constraint, the efficiency gained from using PEF is monotonically non-decreasing.

Specifically, if the fairness frontier shows that for small concessions of the fairness requirement ($\epsilon$), the limit of achievable accuracy is much higher, then it is possible for PEF to outperform by choosing such a point on the fairness frontier, as shown in Figure 5. However, this is a necessary but not sufficient condition. PEF would choose such a point only if the new point Pareto-dominates the subgroup accuracies of the operating point without the fairness concession. The amount of concession in Pareto dominance that we are willing to allow is domain-dependent and can be controlled by tuning the parameter $\lambda$ in the PEF loss function. Hence, PEF performs better in conditions where the fairness frontier is steep around the fairness requirement and potential increases in accuracy are achievable in a Pareto-dominant manner.