presented at NeurIPS 2019 HCML
The notion of individual fairness requires that similar people receive similar treatment. However, this is hard to achieve in practice, since it is difficult to specify the appropriate similarity metric. In this work, we attempt to learn such a similarity metric from human annotated data. We gather a new dataset of human judgments on a criminal recidivism prediction (COMPAS) task. Assuming the human supervision obeys the principle of individual fairness, we leverage prior work on metric learning, evaluate the performance of several metric learning methods on our dataset, and show that the learned metrics outperform the Euclidean and Precision metrics under various criteria. We do not provide a way to directly learn a similarity metric satisfying individual fairness; rather, we offer an empirical study of how such a metric can be derived from human supervision, which future work can use as a tool for understanding human supervision.
Bias in automated decision making systems has raised many concerns. One approach to address these concerns is to enforce individual fairness (Dwork et al., 2012), which requires treating similar people similarly. However, it is not straightforward to quantify the appropriate similarity of individuals. There have been some noteworthy subsequent works on this topic, such as Zemel et al. (2013) and Lahoti et al. (2019). In our paper, we study the problem of learning an individual fairness metric from human annotated data. We leverage the intuition that human judgments might implicitly encode an underlying fairness metric that the judges adhere to. We gather a dataset of human judgments about criminal recidivism risk predictions, and use it to evaluate the performance of several metric learning algorithms.
Algorithmic Fairness. Most prior work on algorithmic fairness has focused on group notions of fairness, which require that protected groups as indicated by sensitive attributes receive similar treatment to others (Zafar et al., 2017b, a; Hardt et al., 2016). On the other hand, individual fairness, introduced by Dwork et al. (2012), requires that similar individuals be treated similarly. To achieve this, one must first define a similarity metric that can be used to compare the individuals. This similarity metric needs to be either given or learned from data. Some recent work on individual fairness (Speicher et al., 2018; Kearns et al., 2017; Liu et al., 2017) can be thought of as taking the former approach, since it implicitly incorporates the similarity metric in the optimization problem. Instead of specifying a similarity metric directly, we attempt to learn it from human annotated data.
Metric Learning. We leverage the rich literature on metric learning, which has been studied and applied in various domains, ranging from image processing (Fei-Fei and Perona, 2005) to recommendation systems (McFee et al., 2012). In recent years, it has increasingly been applied to human annotated data in order to model human notions of similarity (Tamuz et al., 2011). In our work, we also take this approach. We refer the reader to Bellet et al. (2013) and Kulis (2013) for a more in-depth overview of the relevant literature on metric learning.
Learning Fairness Metrics. Among the literature on metric learning from the fairness perspective, the recent work of Ilvento (2019) is closest to ours. They propose an approach for approximating an individual fairness metric from human judgments about the relative distance between inputs. While prior work suggests that humans find it easier to make relative judgments than absolute ones (Stewart et al., 2005), relative judgments are less common in everyday decision making: in recommendation systems, for instance, the collected data record which articles users clicked, not relative comparisons among those articles. In various scenarios, from granting bail to assigning social benefits, people make absolute, not relative, judgments. In real-world applications it may therefore be more realistic to learn people's similarity metrics from their past (absolute) judgments than to require them to make a set of new (relative) judgments. Nevertheless, since humans may exhibit noticeable uncertainty and noise in their predictions, we provide a learning method and an assessment benchmark based on relative comparisons derived from their point estimates, rather than directly on the raw predictions.
Scenario. In our experiments, we focus on the task of predicting criminal recidivism risk. We use a dataset related to the COMPAS tool, a tool used across the United States to help judges make bail decisions by predicting defendants' criminal recidivism risk on a 10-point scale (Angwin et al., 2016). This dataset, gathered by ProPublica (Angwin et al., 2016), contains information about the recidivism risk predicted by the COMPAS tool, as well as the ground truth recidivism outcomes, for 7214 defendants who were arrested in Broward County, Florida, in 2013 and 2014.
Survey Instrument. To gather human judgments, we conducted an online survey in which we asked participants to estimate the likelihood of criminal recidivism of a fixed set of 200 defendants from the ProPublica dataset. To mitigate the effects of order bias (Redmiles et al., 2017), the defendants were shown in random order. For each defendant, participants were shown information about the defendant's demographics and criminal history, in the same format as in Dressel and Farid (2018) and Grgić-Hlača et al. (2019), and were asked to answer the following question: How likely do you think it is that this person will commit another crime within 2 years? [1] Even though the COMPAS tool provides criminal recidivism risk predictions on a 10-point scale, our participants were asked to respond using a 5-point Likert scale. We opted for this design choice in order to minimize the duration of our 200-question survey, since providing answers using 10-point Likert scales has been found to be more time consuming than using 5-point scales (Matell and Jacoby, 1972).

[1] Participants were asked to answer three questions about each defendant. The two remaining questions were "Do you think this person should be granted bail?" and "How confident are you in your answer about granting this person bail?", but the analysis of those responses is beyond the scope of this paper.
Procedure. We recruited participants through the online crowdsourcing platform Prolific (Palan and Schitter, 2018). Using Prolific's advanced pre-screening options, we recruited 29 participants from the US who self-reported having served on a jury. On average, respondents took approximately 71 minutes to complete the survey and were paid a base fee of £8.50. In order to increase response quality, we also provided a performance-based bonus. [2] As additional quality-control measures, we discarded responses of participants who (i) did not answer our 5 attention-check questions correctly, or (ii) completed the survey in less than 45 minutes. After discarding these responses, our final sample consisted of 20 participants. These 20 participants had an average criminal recidivism prediction accuracy of 62.4%, close to the 62.1% and 60.2% that Dressel and Farid (2018) and Grgić-Hlača et al. (2019) reported for their participants on the same task, providing additional evidence of the quality of the responses we gathered.

[2] Performance-based payments have been found to increase the quality of responses in effort-responsive tasks (Vaughan, 2017). Hence, in order to incentivize participants to provide high-quality survey responses, we increased their bonus fee by $0.10 for each correct recidivism prediction, and decreased it by the same amount for each incorrect prediction.
Dataset. The final dataset consists of (i) the criminal recidivism risk scores provided by our 20 respondents for 200 defendants, as well as (ii) the COMPAS tool risk scores for 7214 defendants from the ProPublica dataset. The full dataset and a preview of the survey instrument will be made publicly available once anonymization requirements have been met.
In our experiments, we evaluate the performance of several different Mahalanobis metric learning approaches on our criminal recidivism dataset. To cover a broad range of learning methods, we considered one method from each of the three learning paradigms discussed by Bellet et al. (2013): (i) fully supervised: Large Margin Nearest Neighbor (LMNN; Weinberger et al., 2006); (ii) weakly supervised: Mahalanobis Metric for Clustering (MMC; Xing et al., 2003); and (iii) semi-supervised: Least Squared-residual Metric Learning (LSML; Liu et al., 2012).
LMNN. (Weinberger et al., 2006) The fully supervised LMNN method can be directly applied to the labeled data provided by the COMPAS tool and our respondents. It attempts to minimize the distance between training instances and their neighbors of the same class, while keeping instances of other classes out of the neighborhood. In our implementation, we treat instances with the same rating in our dataset as belonging to the same class. In other words, even though our data consists of Likert scale ratings, this approach treats the ratings as categorical values, thereby losing some information.
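To make the pull/push structure of the LMNN objective concrete, here is a minimal numpy sketch; the function name, the brute-force neighbor search, and the default hyperparameters are our own illustration, not the original implementation:

```python
import numpy as np

def lmnn_loss(X, y, L, k=2, mu=0.5, margin=1.0):
    """Sketch of the LMNN objective under a linear map L (M = L^T L):
    a 'pull' term drawing each point toward its k same-class target
    neighbors, plus a hinged 'push' term keeping differently labeled
    points ('impostors') at least `margin` farther away."""
    Z = X @ L.T                                  # project into the learned space
    n = len(X)
    pull, push = 0.0, 0.0
    for i in range(n):
        d = np.sum((Z - Z[i]) ** 2, axis=1)      # squared distances to all points
        targets = [j for j in np.argsort(d) if j != i and y[j] == y[i]][:k]
        for j in targets:
            pull += d[j]                         # pull target neighbors close
            for m in range(n):
                if y[m] != y[i]:                 # push impostors past the margin
                    push += max(0.0, margin + d[j] - d[m])
    return (1 - mu) * pull + mu * push
```

With well-separated classes and L set to the identity, the push term vanishes and only the small pull term remains.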
MMC. (Xing et al., 2003) The weakly supervised MMC method is designed for scenarios where rich labeled data is not readily available, and instead takes pairwise relative comparisons as input. The algorithm maximizes the sum of pairwise distances between dissimilar pairs while keeping the distances between similar pairs relatively small. The metric learned by MMC can be constrained either to a diagonal form (weighted Euclidean) or a full one. We consider instances with equal ratings in our dataset to be similar pairs, and all others to be dissimilar pairs. Again, as with LMNN, this approach treats our Likert scale data as categorical values, and leads to information loss.
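Concretely, the similar/dissimilar pair construction we feed to MMC can be sketched as follows (the helper name is ours):

```python
import itertools

def mmc_pairs(ratings):
    """Split all index pairs into MMC's similar set S (equal Likert
    ratings) and dissimilar set D (different ratings)."""
    S, D = [], []
    for i, j in itertools.combinations(range(len(ratings)), 2):
        (S if ratings[i] == ratings[j] else D).append((i, j))
    return S, D
```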
LSML. (Liu et al., 2012) Unlike the LMNN and MMC methods, which can only utilize the 200 labeled instances, the semi-supervised metric learning algorithm LSML also allows the use of the remaining ~7000 unlabeled instances from the ProPublica dataset. [3] It learns the metric from a set of relative comparisons of the form "x_i and x_j are more similar than x_k and x_l". We construct the constraint set as C = {(i, j, k, l) : |y_i − y_j| < |y_k − y_l|}, where the y's are the recidivism scores from our dataset. Unlike LMNN and MMC, LSML uses relative comparisons, which allow us to capture more nuanced information available in our Likert scale judgments, such as "2 is closer to 3 than 5".

[3] We ran the algorithm both with and without the 7000 unlabeled COMPAS data points. The results were qualitatively similar, and we report the results of running the algorithm without the unlabeled data.
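A sketch of this constraint construction, in our own notation (the function name and the optional margin t, used later in the sensitivity analysis, are ours):

```python
import itertools

def build_comparisons(y, t=0):
    """Build LSML-style relative comparisons from absolute ratings:
    (i, j, k, l) encodes "i and j are more similar than k and l",
    derived here by comparing Likert-rating gaps. The optional margin
    t requires the dissimilar pair's rating gap to exceed the similar
    pair's gap by more than t."""
    pairs = list(itertools.combinations(range(len(y)), 2))
    return [(i, j, k, l)
            for (i, j), (k, l) in itertools.product(pairs, pairs)
            if abs(y[i] - y[j]) + t < abs(y[k] - y[l])]
```

For n ratings this enumerates on the order of n^4 comparisons, which is why subsampling (discussed below) is needed in practice.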
We implement an adapted version of the algorithm by adding a trade-off coefficient on the logdet regularization term. This adaptation allows us to reduce the weight of the regularizer, thereby increasing the weight given to satisfying the relative comparison constraints. As suggested by Liu et al. (2012), we randomly subsampled training comparisons instead of utilizing the full constraint set.
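For reference, the logdet regularizer that our trade-off coefficient reweights can be written as the LogDet divergence between the learned matrix M and a prior M0; this is a sketch in our notation (LSML typically takes M0 to be the identity), and the function names and the coefficient name alpha are ours:

```python
import numpy as np

def logdet_divergence(M, M0):
    """LogDet divergence D_ld(M, M0) = tr(M M0^-1) - log det(M M0^-1) - d,
    which is zero iff M == M0; it pulls the learned Mahalanobis matrix
    M toward the prior M0."""
    d = M.shape[0]
    P = M @ np.linalg.inv(M0)
    _, logdet = np.linalg.slogdet(P)
    return float(np.trace(P) - logdet - d)

def adapted_objective(comparison_loss, M, M0, alpha=0.01):
    """Adapted LSML objective: relative-comparison loss plus the
    regularizer weighted by the trade-off coefficient alpha."""
    return comparison_loss + alpha * logdet_divergence(M, M0)
```

A small alpha (we use 0.01 in our experiments) shifts the balance toward satisfying the comparison constraints.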
Procedure. Each metric (LMNN, MMC, LSML) is trained on 140 inputs and evaluated on 60 inputs. These 200 inputs were randomly selected. In our evaluation, we repeat this process 10 times and report the average results.
As mentioned in Section 3.1, we chose a subset of 200 defendants with a demographic distribution similar to that of the full set of 7214 defendants assessed by the COMPAS tool. By analysing the collected recidivism predictions, bail decisions and decision confidence, we found that:
1) Bail decisions diverge, especially for defendants judged more likely to recidivate. As Table 1 shows, the fractions of defendants granted bail differ considerably among the 20 human judges when defendants are rated "Likely" or "Most Likely" to recidivate.
| Recidivism Judgment | Most Unlikely | Unlikely | Neither | Likely | Most Likely |
| --- | --- | --- | --- | --- | --- |
| max granted rate | 100% | 100% | 100% | 95.6% | 76.5% |
| min granted rate | 96.8% | 77.9% | 50.0% | 0% | 0% |
| mean granted rate | 99.6% | 96.6% | 84.7% | 43.1% | 18.9% |
2) Decision confidence is calibrated quite differently across the 20 human judges. Treating 'Neither' as a non-recidivism prediction and comparing the predictions with the ground truth, the overall accuracy of each of the 20 judges is no more than 70%; however, for some judges, the accuracy of their confident judgments [4] is significantly higher. The results are shown in Table 2.

[4] Here, confident judgments are defined as the predictions with upper-50% prediction confidence.
| Accuracy | Overall | Judge 1 | Judge 10 | Judge 18 |
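The per-judge accuracy computation behind Table 2 can be sketched as follows; the function name is ours, and reading "upper 50% prediction confidence" as above-median confidence is our interpretation:

```python
import numpy as np

def judge_accuracy(ratings, recidivated, confidence=None):
    """Accuracy of one judge's binarized 5-point Likert predictions
    (1 = Most Unlikely ... 5 = Most Likely): ratings above 3 predict
    recidivism, while 'Neither' (3) and below predict no recidivism.
    If per-question confidence scores are given, restrict to the
    judge's confident judgments, taken here as above-median
    (upper 50%) confidence."""
    ratings = np.asarray(ratings)
    recidivated = np.asarray(recidivated, dtype=bool)
    mask = np.ones(len(ratings), dtype=bool)
    if confidence is not None:
        conf = np.asarray(confidence)
        mask = conf > np.median(conf)
    return float(np.mean((ratings > 3)[mask] == recidivated[mask]))
```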
We compare the three metric learning approaches discussed in Section 3.2 against two baselines: (i) the trivial baseline of the Euclidean metric (i.e., the distance in the raw feature space), and (ii) the non-trivial but naïve precision matrix (i.e., the inverse of the covariance matrix), a standard approach for removing correlation between features.
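Both baseline matrices can be computed directly; this is a minimal sketch with function names of our own choosing:

```python
import numpy as np

def baseline_metrics(X):
    """Return the two baseline Mahalanobis matrices: the identity
    (plain Euclidean distance in feature space) and the precision
    matrix (inverse covariance), which decorrelates the features."""
    d = X.shape[1]
    return np.eye(d), np.linalg.inv(np.cov(X, rowvar=False))

def mahalanobis(x, z, M):
    """Distance between x and z under the metric matrix M."""
    diff = np.asarray(x, dtype=float) - np.asarray(z, dtype=float)
    return float(np.sqrt(diff @ M @ diff))
```

Under the identity matrix, `mahalanobis` reduces to the ordinary Euclidean distance.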
We split our dataset 7:3 into train and test sets; all results are averaged over 10 runs. We evaluate the performance of the learned metrics with respect to the three loss functions defined below:
Relative Comparisons. This loss calculates the percentage of relative comparisons (constructed as described in the previous section) from the test set that violate the constraints under the distance metric d_M. It is similar to the loss defined by Hoffer and Ailon (2015), but we assign equal weight to each instance:

L_rel = (1 / |C|) · Σ_{(i,j,k,l) ∈ C} H( d_M(x_i, x_j) − d_M(x_k, x_l) ),

where H is the Heaviside step function.
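This violation rate can be computed directly for a Mahalanobis matrix M; a sketch in our notation:

```python
import numpy as np

def violation_rate(X, comparisons, M):
    """Fraction of relative comparisons (i, j, k, l), each encoding
    "i and j are more similar than k and l", that the Mahalanobis
    metric M violates, i.e. where d_M(x_i, x_j) >= d_M(x_k, x_l);
    every comparison carries equal weight."""
    def d2(a, b):
        diff = X[a] - X[b]
        return diff @ M @ diff
    bad = sum(1 for i, j, k, l in comparisons if d2(i, j) >= d2(k, l))
    return bad / len(comparisons)
```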
kNN L1. This loss calculates the L1 divergence between a test instance's ground truth label y_i and the weighted rating of its k nearest neighbors under the metric d_M:

L_kNN-L1 = (1 / N) · Σ_i | y_i − Σ_{j ∈ kNN(i)} w_ij · y_j |,

where the normalized weight w_ij is proportional to the inverse of the distance between x_i and each of its k nearest neighbors x_j: w_ij ∝ 1 / d_M(x_i, x_j).
kNN L2. This loss is analogous to kNN L1 but instead uses the L2 distance. Compared to the L1 norm, it penalizes large prediction errors more heavily.
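Both kNN losses can be sketched in one function (our naming; `norm=1` gives kNN L1 and `norm=2` gives kNN L2):

```python
import numpy as np

def knn_loss(X_test, y_test, X_train, y_train, M, k=5, norm=1):
    """Mean prediction error of inverse-distance-weighted k-NN ratings
    under the Mahalanobis metric M; norm=1 is the L1 variant, norm=2
    the L2 variant that penalizes large errors more heavily."""
    errs = []
    for x, y in zip(X_test, y_test):
        diffs = X_train - x
        d = np.sqrt(np.einsum('nd,de,ne->n', diffs, M, diffs))
        nn = np.argsort(d)[:k]
        w = 1.0 / np.maximum(d[nn], 1e-12)   # inverse-distance weights
        w /= w.sum()                         # normalize to sum to one
        errs.append(abs(y - w @ y_train[nn]) ** norm)
    return float(np.mean(errs))
```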
On the collected human survey data, we ran LMNN, MMC, and LSML from Section 3.2 following the procedure described in Section 3.3. The trade-off coefficient of the regularizer in our adapted LSML is set to 0.01, and the number of neighbors for computing kNN L1 and kNN L2 is set to five. [5] The results are shown in Figure 1.

[5] We tried varying these parameters; they do not affect the high-level takeaways of our results.
As Figure 1 shows, the kNN L1 and L2 losses of the learned metrics are only slightly better than those of the Euclidean and Precision baselines, likely because humans' absolute ratings contain some noise. On the relative comparison loss, which incorporates the relations between nearby ratings instead of treating them as categorical values, the learned metrics substantially outperform the Euclidean and Precision baselines, demonstrating the effectiveness of our design.
In this section we evaluate the sensitivity of our adapted LSML method to the margin hyperparameter t, which controls the minimum required difference between the rating gaps of the two pairs in each comparison. To this end, we compare the loss of the learned metric with the Euclidean metric on the much larger COMPAS dataset, instead of the human judgments we gathered. Recall Section 3.2, which describes how we construct the sets of comparison constraints C_t based on the choice of t. The loss based on relative comparisons is then calculated on the resulting C_t.
| 0 | 0.40 ± 0.037 | 0.39 ± 0.041 | 0.38 ± 0.035 |
| 2 | 0.40 ± 0.027 | 0.39 ± 0.032 | 0.37 ± 0.037 |
| 4 | 0.35 ± 0.031 | 0.31 ± 0.045 | 0.30 ± 0.034 |
| 6 | 0.31 ± 0.033 | 0.29 ± 0.060 | 0.29 ± 0.060 |
Our learned metrics outperform the Euclidean distance by a large margin. As t increases, the loss decreases for all three metrics, since the required gap between the similar and dissimilar pairs in each comparison grows, providing less noisy comparisons for learning the metric.
In this work, we conducted a user study in which we gathered a set of human judgments about recidivism risk, which we will make publicly available once anonymization requirements have been met. We initiated work on examining various methods for learning an individual fairness metric from this human annotated data. Surprisingly, we observed similar performance across methods when considering the predictive performance of ratings, though we saw differences when considering the triplet consistency of unseen data. It will be interesting for future work to understand this better and to consider the interplay between metric-learning methods and the consistency of human ratings.
HW acknowledges support from Cambridge Trust CSC Scholarship, AW acknowledges support from the David MacKay Newton research fellowship at Darwin College, The Alan Turing Institute under EPSRC grant EP/N510129/1 & TU/B/000074, and the Leverhulme Trust via the CFI.
Hardt, M., Price, E., and Srebro, N. Equality of opportunity in supervised learning. In Advances in Neural Information Processing Systems, pages 3315–3323, 2016.
In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pages 1828–1836. JMLR.org, 2017.