1 Introduction
Machine learning algorithms play an increasingly important role in decision making in society. When these algorithms are used to make high-impact healthcare decisions, such as deciding whether a skin lesion is benign or malignant, or predicting mortality for intensive care unit patients, it is paramount to guarantee that these decisions are accurate and unbiased with respect to sensitive attributes such as gender or ethnicity. A model that is trained naively may not have these properties by default (barocas2016big). It is therefore desirable in these critical applications to impose some fairness criteria. There are several lines of work on fairness in machine learning, such as Demographic Parity (louizos2015variational; zemel2013learning; feldman2015certifying), Equality of Odds and Equality of Opportunity (hardt2016equality; woodworth2017learning), or Calibration (pleiss2017fairness). These notions of fairness are appropriate in many scenarios, but in domains where quality of service is paramount, such as healthcare, we argue that it is necessary to strive for models that are as close to fair as possible without introducing unnecessary harm to any subgroup (ustun2019fairness).

In this work, we measure discrimination (unfairness) as the difference in predictive risk across sub-populations defined by the sensitive attributes, a notion also explored in calders2010three; dwork2012fairness; feldman2015certifying; chen2018my; ustun2019fairness. We examine the subset of models from our hypothesis class that achieve the best trade-offs between sub-population risks, and select from this set the one with the smallest risk disparity gap. This is in contrast to common post-hoc correction methods such as those proposed in hardt2016equality; woodworth2017learning, where randomness is potentially added to the decisions of all sub-populations. While this type of approach diminishes the accuracy disparity gap, it does so by potentially introducing randomness into the final decision (i.e., with some probability, the classifier output is disregarded and an arbitrary decision is produced based solely on the sensitive label), which degrades performance under most common risk metrics. Since our proposed methodology does not require test-time access to sensitive attributes and can be applied to any standard classification or regression task, it can also be used to reduce risk disparity between outcomes, acting as an adaptive risk-equalization loss compatible with unbalanced classification scenarios.
Main Contributions
We formalize the notion of no-unnecessary-harm fairness using Pareto optimality, a state of resource allocation from which it is impossible to reallocate without making at least one subgroup worse off. We show that finding a Pareto-fair classifier is equivalent to finding a model in our hypothesis class that is both Pareto optimal with respect to the sub-population risks (no unnecessary harm) and minimizes risk disparity. This notion naturally accommodates non-binary sensitive attributes. We analyze Pareto fairness on an illustrative example and compare it to alternative approaches. We provide an algorithm that promotes fair solutions belonging to the Pareto front; the algorithm can be applied to any standard classification or regression task. Finally, we show how our methodology performs on real tasks, namely predicting ICU mortality from hospital notes in the MIMIC-III dataset (johnson2016mimic) and classifying skin lesions in the HAM10000 dataset (tschandl2018ham10000).
2 Problem Statement
Consider that we have access to a dataset containing independent triplet samples $\{(x_i, y_i, a_i)\}_{i=1}^{n}$ drawn from a joint distribution $P(X, Y, A)$, where $X$ are our input features (e.g., images, tabular data, etc.), $Y$ is our target variable, and $A$ indicates group membership or sensitive status (e.g., ethnicity, gender); our input features may or may not explicitly contain $A$. Let $h$ be a classifier from our hypothesis class $\mathcal{H}$, trained to infer $Y$ from $X$, $h: \mathcal{X} \rightarrow \mathcal{Y}$; and let $\ell(h(x), y)$ be a loss function. We define the class-specific risk of classifier $h$ on subgroup $a$ as $r_a(h) = \mathbb{E}_{X,Y \mid A=a}\big[\ell(h(X), Y)\big]$. The risk discrimination gap between two subgroups $a, a'$ is measured as $\Gamma_{a,a'}(h) = |r_a(h) - r_{a'}(h)|$, and we define the pairwise discrimination gap as $\Gamma(h) = \max_{a, a'} \Gamma_{a,a'}(h)$. Our goal is to obtain a classifier that minimizes this gap without causing unnecessary harm to any particular group. To formalize this notion, we define:

Definition 2.1. Pareto front: The set of Pareto front classifiers is defined as
$$\mathcal{P}_{\mathcal{H}} = \big\{ h \in \mathcal{H} \;:\; \nexists\, h' \in \mathcal{H} \text{ such that } r_a(h') \leq r_a(h)\ \forall a \in \mathcal{A} \text{ and } r_{a'}(h') < r_{a'}(h) \text{ for some } a' \big\}.$$
Definition 2.2. Pareto-fair classifier and Pareto-fair vector: A classifier $h^{*} \in \mathcal{P}_{\mathcal{H}}$ is an optimal no-harm classifier if it minimizes the discrimination gap among all Pareto front classifiers, $h^{*} \in \arg\min_{h \in \mathcal{P}_{\mathcal{H}}} \Gamma(h)$. The Pareto-fair vector is defined as the corresponding vector of subgroup risks, $r^{*} = \big(r_a(h^{*})\big)_{a \in \mathcal{A}}$.

The Pareto front defines the best achievable trade-offs between the population risks $r_a(h)$, while the Pareto-fair classifier gives the trade-off with the least disparity. Building on the analysis of chen2018my; domingos2000unified, the risk can be decomposed into bias, variance, and noise terms for some given loss functions. The noise term represents the smallest achievable risk, even for infinitely large datasets. If it differs between sensitive groups, zero discrimination (perfect fairness) can only be achieved by introducing bias or variance, hence doing harm. Figure 1 shows a scenario where the Pareto front does not intersect the equality-of-risk line for the case of two sensitive groups and a binary output variable $Y$. Here the noise levels of the two subgroups differ, and the Pareto-fair vector would not be achieved by either a naive classifier (which minimizes the expected global risk) or a classifier where low-occurrence subgroups are over-sampled (re-balanced naive classifier).
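To make these quantities concrete, the sketch below (our own illustration, not the paper's code; names such as `group_risks` and `pareto_dominates` are hypothetical) computes empirical subgroup risks, the pairwise discrimination gap, and a Pareto-dominance check between two risk vectors.

```python
import numpy as np

def group_risks(losses, groups):
    """Empirical subgroup risk r_a(h): mean loss over samples with A = a.

    losses: per-sample loss values, shape (n,)
    groups: per-sample sensitive-group labels, shape (n,)
    """
    return {a: losses[groups == a].mean() for a in np.unique(groups)}

def discrimination_gap(risks):
    """Pairwise discrimination gap: max over (a, a') of |r_a - r_a'|."""
    values = np.array(list(risks.values()))
    return values.max() - values.min()

def pareto_dominates(r1, r2):
    """True if risk vector r1 Pareto-dominates r2: no group is worse off
    under r1 and at least one group is strictly better off."""
    r1, r2 = np.asarray(r1), np.asarray(r2)
    return bool(np.all(r1 <= r2) and np.any(r1 < r2))

# Toy usage with two groups that have different noise levels.
losses = np.array([0.1, 0.2, 0.4, 0.5])
groups = np.array([0, 0, 1, 1])
risks = group_risks(losses, groups)   # approx {0: 0.15, 1: 0.45}
gap = discrimination_gap(risks)       # approx 0.30
```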
3 Methods
Any loss function that is monotonically increasing in each of the subgroup risks $r_a(h)$ is minimized by a classifier in the Pareto front. Since we want to minimize the discrimination gap, we build an adaptive loss function that shares this property, of the form

$$\mathcal{L}(h; \boldsymbol{\mu}, \beta) = \sum_{a \in \mathcal{A}} \mu_a \, g_{\beta}\big(r_a(h)\big), \qquad (1)$$

with $\mu_a \geq 0$, $\sum_{a} \mu_a = 1$, and $g_{\beta}(\cdot)$ monotonically increasing. It can be shown that, for convex Pareto sets (with respect to the risk vectors $\big(r_a(h)\big)_{a \in \mathcal{A}}$), there exists a choice of weights $\boldsymbol{\mu}$ such that the Pareto-fair vector is the unique solution. In Algorithm 1 we jointly search for $h$, $\boldsymbol{\mu}$, and $\beta$.
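Since Equation (1) and Algorithm 1 are only summarized here, the following PyTorch snippet shows one plausible instantiation of such an adaptive group-weighted objective: per-group risks (here Brier risks) are combined with simplex weights, and the weights are periodically shifted toward the groups with the largest current risk. The softmax re-weighting with temperature `beta` is an illustrative assumption, not the paper's exact Algorithm 1.

```python
import torch
import torch.nn.functional as F

def group_weighted_loss(logits, targets, groups, mu):
    """Adaptive loss of the form sum_a mu_a * r_a(h), where r_a is the
    per-group Brier risk estimated on the current mini-batch."""
    probs = F.softmax(logits, dim=1)
    onehot = F.one_hot(targets, num_classes=logits.shape[1]).float()
    per_sample = ((probs - onehot) ** 2).sum(dim=1)  # per-sample Brier score
    loss = logits.new_zeros(())
    for a, w in enumerate(mu):
        mask = groups == a
        if mask.any():
            loss = loss + w * per_sample[mask].mean()
    return loss

def reweight(group_risks, beta=5.0):
    """Illustrative weight update (not the paper's Algorithm 1): softmax
    over current group risks, so worse-off groups receive larger weights."""
    return torch.softmax(beta * torch.as_tensor(group_risks), dim=0)
```

In this sketch, the weights remain on the probability simplex after every update, so the objective stays a convex combination of subgroup risks, consistent with the constraints stated above.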
4 Experiments and Results
We evaluate our method on mortality prediction and skin lesion classification, and show empirically that it reduces accuracy and risk disparities without unnecessary harm. We compare it against a naive classifier, a re-balanced naive classifier (sub-groups sampled with equal probability), the post-processing framework presented in hardt2016equality, and the disparate mistreatment framework of zafar2017parity.
4.1 Predicting Mortality in Intensive Care Patients
We analyze clinical notes from adult ICU patients at the Beth Israel Deaconess Medical Center (MIMIC-III dataset, johnson2016mimic) to predict patient mortality. We follow the pre-processing methodology outlined in chen2018my and use tf-idf statistics of the most frequent words in the clinical notes as input features. Fairness is measured with respect to age (under/over 55 years old), ethnicity, and outcome. We use a fully connected neural network with two 2048-unit hidden layers, trained with the Brier score (BS) as loss. Table 1 reports the accuracy and BS of all tested methodologies. We observe from the (PF Acc) and (PF BS) columns that our model achieves the smallest accuracy and BS discrepancies. Accuracy disparities can be reduced further by applying the post-processing of hardt2016equality (HPF Acc), at the cost of lower overall performance. Note that, as expected, it is better to apply this post-processing to our method than to the re-balanced naive classifier (HPF Acc vs. HReN Acc). This illustrates the goal of the proposed Pareto-fair paradigm: develop the fairest algorithm with no unnecessary harm, and if (e.g., due to policy) the resulting fairness level needs to be improved further, apply post-processing techniques such as hardt2016equality on top of the Pareto-fair classifier.
Table 1: Accuracy and Brier score (BS), reported as mean ± std per subgroup, for MIMIC-III mortality prediction. Na: naive, ReN: re-balanced naive, PF: Pareto-fair, HReN/HPF: Hardt post-processing applied to ReN/PF.

| Out/Age/Race | Ratio | Na Acc | ReN Acc | Zafar Acc | PF Acc | HReN Acc | HPF Acc | ReN BS | PF BS |
|---|---|---|---|---|---|---|---|---|---|
| A/A/NW | 5.7% | 99.1±0.4% | 86.3±1.5% | 93.0±1.4% | 83.4±2.6% | 76.3±1.9% | 71.6±3.2% | 0.2±0.02 | 0.25±0.03 |
| A/A/W | 13.3% | 98.8±0.5% | 86.3±1.1% | 90.0±1.3% | 83.2±1.5% | 76.7±1.6% | 71.8±1.9% | 0.2±0.01 | 0.25±0.02 |
| A/S/NW | 12.9% | 97.5±0.6% | 76.5±1.7% | 81.8±1.7% | 71.4±3.0% | 76.4±2.2% | 71.3±3.1% | 0.31±0.02 | 0.36±0.03 |
| A/S/W | 56.7% | 97.9±0.3% | 79.0±0.6% | 77.4±0.7% | 74.6±1.6% | 76.2±2.2% | 72.1±1.2% | 0.28±0.01 | 0.34±0.02 |
| D/A/NW | 0.4% | 23.4±9.4% | 76.1±8.5% | 47.7±9.6% | 78.6±6.1% | 66.6±9.9% | 74.1±9.2% | 0.36±0.06 | 0.34±0.04 |
| D/A/W | 0.9% | 32.6±3.7% | 80.1±3.3% | 60.5±6.3% | 83.3±4.2% | 66.4±2.4% | 73.6±4.3% | 0.29±0.04 | 0.28±0.03 |
| D/S/NW | 1.8% | 21.4±2.2% | 66.9±2.4% | 48.2±2.0% | 73.3±2.9% | 64.8±2.1% | 71.2±2.9% | 0.42±0.02 | 0.37±0.03 |
| D/S/W | 8.3% | 23.4±2.2% | 67.4±1.9% | 57.1±2.2% | 72.5±3.6% | 66.2±2.9% | 72.4±3.5% | 0.42±0.02 | 0.37±0.04 |
| Sample Mean | - | 89.5±0.2% | 78.9±0.7% | 78.0±0.7% | 75.7±1.1% | 75.1±1.8% | 71.9±1.2% | 0.29±0.01 | 0.33±0.01 |
| Group Mean | 12.5% | 61.8±1.5% | 77.3±1.3% | 69.4±1.3% | 77.5±0.7% | 71.2±1.1% | 72.3±1.0% | 0.31±0.01 | 0.32±0.01 |
| Discrepancy | 56.3% | 81.1±2.5% | 22.5±2.5% | 49.3±4.4% | 17.1±2.6% | 18.6±3.2% | 13.1±5.6% | 0.24±0.03 | 0.16±0.03 |
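For concreteness, the model and loss described above (tf-idf features fed into a fully connected network with two 2048-unit hidden layers, trained with the Brier score) could be set up roughly as below; the input dimensionality and other details are assumptions for illustration, not the authors' released code.

```python
import torch.nn as nn
import torch.nn.functional as F

class MortalityMLP(nn.Module):
    """Fully connected network with two 2048-unit hidden layers over
    tf-idf note features (input dimension assumed here)."""
    def __init__(self, in_dim=10000, num_classes=2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 2048), nn.ReLU(),
            nn.Linear(2048, 2048), nn.ReLU(),
            nn.Linear(2048, num_classes),
        )

    def forward(self, x):
        return self.net(x)

def brier_score(logits, targets):
    """Multi-class Brier score: squared error between the predicted
    probabilities and the one-hot target, averaged over the batch."""
    probs = F.softmax(logits, dim=1)
    onehot = F.one_hot(targets, num_classes=logits.shape[1]).float()
    return ((probs - onehot) ** 2).sum(dim=1).mean()
```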
4.2 Skin Lesion Classification
The HAM10000 dataset (tschandl2018ham10000) collects over 10,000 dermatoscopic images of skin lesions from a diverse population; lesions are classified into 7 categories. We use a pretrained DenseNet121 (huang2017densely) as our base classifier and measure fairness with respect to the diagnosis class, casting balanced risk minimization as a particular use-case of Pareto fairness. Table 2 shows our empirical results.
Table 2: Per-class accuracy and Brier score (BS) on HAM10000 skin lesion classification. Na: naive, ReN: re-balanced naive, PF: Pareto-fair.

Groups | Ratio | PF Acc | ReN Acc | Na Acc | PF BS | ReN BS | Na BS |
---|---|---|---|---|---|---|---|
akiec | 2.5% | 51.9% | 55.6% | 3.7% | 0.741 | 0.671 | 1.289 |
bcc | 2.7% | 56.7% | 76.7% | 56.7% | 0.549 | 0.341 | 0.613 |
bkl | 7.2% | 59.5% | 58.2% | 36.7% | 0.62 | 0.606 | 0.931 |
df | 0.5% | 66.7% | 33.3% | 0.0% | 0.536 | 0.898 | 1.721 |
nv | 81.0% | 83.5% | 90.9% | 96.0% | 0.241 | 0.128 | 0.054 |
vasc | 1.3% | 71.4% | 85.7% | 0.0% | 0.36 | 0.246 | 1.604 |
mel | 4.8% | 53.8% | 48.1% | 32.7% | 0.586 | 0.657 | 0.941 |
Sample Mean | - | 78.6% | 84.8% | 83.6% | 0.308 | 0.213 | 0.234 |
Group Mean | 14.3% | 63.4% | 64.1% | 32.3% | 0.519 | 0.507 | 1.022 |
Discrepancy | 80.4% | 31.7% | 57.5% | 96.0% | 0.501 | 0.769 | 1.667 |
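The base classifier for this experiment (an ImageNet-pretrained DenseNet121 with its classifier head replaced for the 7 lesion categories) could be instantiated as below; this is a plausible torchvision setup, not the authors' exact training code.

```python
import torch.nn as nn
from torchvision import models

def build_skin_lesion_model(num_classes=7):
    """ImageNet-pretrained DenseNet121 with a new 7-way classifier head
    for the HAM10000 lesion categories."""
    model = models.densenet121(pretrained=True)
    model.classifier = nn.Linear(model.classifier.in_features, num_classes)
    return model
```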
5 Discussion
Here we explore the problem of reducing risk disparity gaps in the most ethical way possible, i.e., minimizing unnecessary harm. We provide an algorithm that can be used with any standard classification or regression loss to bridge risk disparity gaps without introducing unnecessary harm. We show its performance on two real-world case studies; we also take advantage of the fact that our method does not require test-time access to sensitive attributes to frame balanced classification as a fairness problem. In future work, we wish to analyze whether we can automatically identify high-risk sub-populations as part of the learning process and attack risk disparities as they arise, rather than relying on preexisting notions of disadvantaged groups or populations. We believe that no-unnecessary-harm notions of fairness are of great interest for several applications, especially in domains such as healthcare.