Removing biased data to improve fairness and accuracy

02/05/2021 ∙ by Sahil Verma, et al. ∙ University of Washington

Machine learning systems are often trained using data collected from historical decisions. If past decisions were biased, then automated systems that learn from historical data will also be biased. We propose a black-box approach to identify and remove biased training data. Machine learning models trained on such debiased data (a subset of the original training data) have low individual discrimination (often 0%) and lower statistical disparity than models trained on the full historical data. We evaluated our methodology in experiments using 6 real-world datasets. Our approach outperformed seven previous approaches in terms of individual discrimination and accuracy.


1. Introduction

Automated decisions can be faster, cheaper, and less subjective than manual ones. In order to automate decisions, an organization can train a machine learning model on historical decisions or other manually-labeled data and use it to make decisions in the future. However, if the training data is biased, that bias will be reflected in the model’s decisions (garbageinout; biased_data_1). This problem has even been noted by the US White House (white_house).

Bias in models can have far-reaching societal consequences, like worsening wealth inequality (biased-loan1), difference in employment rate across genders (biased-hiring1; biased-hiring2), and difference in incarceration rates across races (unfair_arrest_2). Discrimination has been reported in machine-produced decisions for real-life scenarios like parole (propublica-main), credit cards (criticism_1), hiring (biased-hiring1), and predictive policing (NYC_policing). These problems are exacerbated by humans’ unwarranted confidence in machine-produced decisions: people generally deem such decisions fair (weakened_appeal).

Data bias happens mostly due to two phenomena: label bias and selection bias. Label bias occurs when the training labels (which are mostly generated manually) are afflicted by human bias. For example, loan applications and job applications from minority communities have been more frequently denied (Steil:2017-redlining; redliningwiki; redlining-thinkprogress). Training on historical data would perpetuate that injustice. Selection bias occurs when selecting samples from a demographic group inadvertently introduces undesired correlations between the features pertaining to that demographic group and the training labels (Blum2019RecoveringFB; ImageNet:2020; trade_off_2019), e.g., in the selected subsample for a group, most of their loan requests were denied. The propagation of bias has raised significant concerns related to the use of machine learning models in critical decision making like the ones mentioned above.

Bias in machine learning systems is undesirable because it can produce unfair decisions (biased-hiring1; biased-loan3), because biased decisions are less effective and less profitable (less_effective1; less_effective2), and because bias attracts lawsuits and widespread criticism (criticism_1; criticism_2). Biased decisions can be challenged on the basis of disparate treatment and disparate impact laws in the US (Barocas2016BigDD; jolly-ryan-have; Selbst2017DisparateII; Skeem2016RISKRA; Winrow2010TheDB). Similar laws exist in other countries (australia_act). Laws define sensitive attributes (e.g., race, sex, religion) that are illegal to use as the basis of any decision.

We use both the common statistical disparity and individual discrimination as metrics to measure discrimination in machine learning classifiers. (Other definitions of fairness exist; section 5 justifies our choice.) A model is individually fair (dwork_fairness_2011) if it yields similar predictions for similar individuals. Similarity among the individuals and among the predictions is measured by defining a distance function in the input space and the output space, respectively. A similar pair is a pair of individuals whose input space distance is lower than a specified threshold. Ideally, to ensure fairness, the input space similarity metric should not take sensitive features into account, so two individuals who differ only in the sensitive features should always be similar. A discriminatory pair is a similar pair for which the model makes dissimilar predictions. The individual discrimination of a model is the proportion of similar pairs of individuals that receive dissimilar predictions, i.e., the percentage of discriminatory pairs (galhotra_fairness_2017; Udeshi-testing:2018; Aggarwal-testing:2019). An estimate of individual discrimination uses a pool of similar pairs, which may be synthetically generated or sampled from the training data. To be fair, a model’s individual discrimination should be zero or close to it.
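For concreteness, the estimate can be sketched as follows. This is a minimal illustration, not the paper's implementation; it assumes a scikit-learn-style predict interface and a pre-built pool of similar pairs, and the names are ours.

```python
# Minimal sketch: estimating individual discrimination from a pool of similar pairs.
# `model` is assumed to expose a scikit-learn-style predict(); `pairs` is a list of
# (x1, x2) feature-vector tuples where x1 and x2 are similar individuals.
import numpy as np

def individual_discrimination(model, pairs):
    """Fraction of similar pairs that receive different predictions (discriminatory pairs)."""
    x1 = np.array([p[0] for p in pairs])
    x2 = np.array([p[1] for p in pairs])
    y1 = model.predict(x1)
    y2 = model.predict(x2)
    return float(np.mean(y1 != y2))  # 0.0 means the model is individually fair on this pool
```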

Previous approaches to reducing discrimination (see section 5) modify the training data’s features or labels (pre-processing), add fairness regularizers during training (in-processing), or apply post-hoc corrections to the model’s predictions (post-processing).

Our approach is a pre-processing approach that identifies and removes biased datapoints from the training data, leaving all other datapoints unchanged. We conjecture that some training data is more biased than the rest and consequently has more influence on the predictions of a learned model. Influence functions (koh_understanding_2017) measure the influence of a training datapoint on a particular prediction. Our approach (1) generates discriminatory pairs (similar individuals who receive dissimilar predictions) and (2) uses influence functions to sort the training datapoints in order of most to least influential for the dissimilar predictions received by the discriminatory pairs. We hypothesize that the datapoints that are most responsible for the dissimilar predictions are the most biased datapoints. Removing the most influential datapoints yields a debiased dataset. A model trained on the debiased dataset is fairer with less individual discrimination — often 0% — than a model trained on the full dataset.

Our approach works for black-box classification models (for which only train and predict functions are available) (blackbox1; blackbox2), and therefore proprietary models can be debiased as long as their training data is accessible. We performed 8 experiments using 6 real-world datasets (some datasets contain more than one sensitive attribute). Our approach outperforms seven previous approaches in terms of individual discrimination and accuracy and is near the average in terms of statistical disparity.

In summary, our contributions are:


  • We propose a novel black-box approach for improving individual fairness: identify and remove biased decisions in historical data.

  • In two sets of experiments using 6 real-world datasets, the classifiers trained on debiased datasets exhibit nearly 0% individual discrimination. Our approach outperforms seven previous approaches in terms of individual discrimination and accuracy, and it always improves (reduces) statistical disparity.

  • To the best of our knowledge, we are the first to empirically demonstrate an increase in test accuracy (better generalization) in a supervised learning setup with real datasets while reducing discrimination, compared to the case when no fairness technique is used (trade_off_2019; Blum2019RecoveringFB).

  • Our implementation and experimental scripts are open source.

2. Motivating Example

Id Income Wealth Race Decision
#1 1.0 0.1 White 1
#2 0.9 0.7 Black 0
#3 0.8 0.3 White 1
#4 0.1 0.7 Black 0
#5 0.1 0.5 White 0
#6 0.5 0.9 Black 0
#7 1.0 0.8 Black 1
Table 1. A hypothetical dataset of past loan decisions. The second datapoint is a biased decision: a black person with a high income was denied a loan (0), whereas all white people with high incomes were given a loan (1).
Figure 1. Flowchart with the steps in our approach. The left portion shows the steps in Algorithm 1 and the right portion shows the steps in Algorithm 2. The output is a debiased training dataset, which can be used to train a debiased model.

Suppose that a bank wishes to automate the process of loan approval. The bank could train a machine learning model on historical loan decisions, then use the model to improve speed, reduce costs, and reduce subjectivity. The financial sector is a prominent user of machine learning (finance_1; finance_2).

Table 1 shows a hypothetical dataset consisting of historical loan decisions. The bank collects 3 features from each applicant: income, wealth, and race. As shown in the table, income and wealth are numeric features with values lying between 0 and 1, after normalization. Race is a binary feature with ‘white’ and ‘black’ as the two possible values. The outcome is also binary with ‘1’ indicating that a loan was offered and ‘0’ indicating a denial.

Loan approvals should ideally not depend on the applicant’s race, but past loan approvers might have been consciously or unconsciously biased. In fact, that was the case at this bank: #2 was a biased decision in which a black person with a high income (0.9) was denied a loan, whereas all white people with high incomes (≥ 0.66) were given a loan. A classifier trained on a dataset that contains biased datapoints is likely to exhibit individual discrimination.

The bank would like to train a classifier using the unbiased parts of the historical data. Most of the decisions were probably sound (otherwise, the bank would have been out-competed by its rivals). It would be too expensive to filter biased decisions or create a new dataset manually, and those manual activities would also be prone to conscious and unconscious bias.

2.1. Measuring individual discrimination

Estimating a model’s individual discrimination requires the distance functions for the input and output space, and a pool of similar pairs. We consider two individuals similar if their income and wealth are the same, regardless of their race. We consider two decisions similar if they are the same decision. We randomly generate 700 similar pairs (see Section 3 for details). This is 100 times as large as the training dataset; the value “100” is arbitrary.

We trained a model on the dataset of table 1 and evaluated it on the 1400 individuals who form the 700 pairs of similar individuals. The model predicted different outcomes for 26% of the similar pairs (181 discriminatory pairs out of the 700 similar pairs).
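A minimal sketch of this measurement follows, using a logistic-regression stand-in for the bank's classifier. The choice of model and random seed is ours, so the exact percentages will differ from the numbers above.

```python
# Sketch of the motivating example: train a small classifier on Table 1 and estimate
# its individual discrimination on randomly generated similar pairs.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Table 1: income, wealth, race (0 = white, 1 = black), decision
X = np.array([[1.0, 0.1, 0], [0.9, 0.7, 1], [0.8, 0.3, 0], [0.1, 0.7, 1],
              [0.1, 0.5, 0], [0.5, 0.9, 1], [1.0, 0.8, 1]])
y = np.array([1, 0, 1, 0, 0, 0, 1])

model = LogisticRegression().fit(X, y)

# Generate 700 similar pairs: same income and wealth, opposite race.
n_pairs = 100 * len(X)
income = rng.uniform(0, 1, n_pairs)
wealth = rng.uniform(0, 1, n_pairs)
race = rng.integers(0, 2, n_pairs)
A1 = np.column_stack([income, wealth, race])
A2 = np.column_stack([income, wealth, 1 - race])

discm = np.mean(model.predict(A1) != model.predict(A2))
print(f"individual discrimination: {discm:.1%}")
```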

2.2. Finding biased datapoints

Our goal is to find biased training datapoints, such that removing them reduces the model’s individual discrimination. This also improves the model’s accuracy and statistical parity, which is discussed in Section 4. Our approach contains three high-level steps:

  1. Find unfair decisions made by a learned model. In particular, first generate a pool of pairs of similar individuals and find the discriminatory pairs within them. Then, for each discriminatory pair, heuristically determine which of the two individuals was not treated fairly.

  2. Rank the training datapoints according to their contribution to the unfair decisions.

  3. Remove some of the highest-ranked training datapoints and retrain a model with lower individual discrimination.

Identify unfair decisions

There are 181 discriminatory pairs — pairs of similar individuals with dissimilar outcomes. One individual from each pair has been treated unfairly. The biased treatment might be in their favor (getting an undeserved loan) or against them (denied a loan they deserved). Our heuristic is that the individual with the lower classification confidence (i.e., the probability the model assigns to its predicted class) is the one who was treated unfairly. Note that these are the individuals in the generated pool of similar individuals, not in the training dataset.

Identify training datapoints responsible for unfair decisions

We use influence functions (koh_understanding_2017) to find the training datapoints that were most responsible for producing different predictions for the discriminatory pairs. An influence function sorts the training datapoints of a model in order of most to least responsible for a single prediction from the model. An influence function can also sort the datapoints for a set of predictions (the 181 unfair decisions, in our case), by measuring the average influence over all the relevant predictions.
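The following sketch shows one plausible instantiation of this ranking step for a logistic-regression surrogate, following the influence formulation of koh_understanding_2017. The paper's implementation operates on neural networks; the helper names and the use of the model's own predictions as labels for the flagged individuals are our assumptions.

```python
# Sketch: rank training points by their (approximate) influence on the loss at the
# individuals flagged as unfairly treated. Assumes numeric, already-encoded features.
import numpy as np
from sklearn.linear_model import LogisticRegression

def influence_ranking(X_train, y_train, X_flagged, y_flagged, damping=0.01):
    """Return training indices sorted from most to least influential for the flagged points."""
    clf = LogisticRegression(C=1e6).fit(X_train, y_train)  # near-unregularized surrogate
    w = np.append(clf.coef_.ravel(), clf.intercept_[0])
    Xb_train = np.hstack([X_train, np.ones((len(X_train), 1))])    # absorb the bias term
    Xb_flag = np.hstack([X_flagged, np.ones((len(X_flagged), 1))])

    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
    p_train = sigmoid(Xb_train @ w)
    p_flag = sigmoid(Xb_flag @ w)

    # Per-example gradients of the logistic loss: (p - y) * x
    g_train = (p_train - y_train)[:, None] * Xb_train
    g_flag = ((p_flag - y_flagged)[:, None] * Xb_flag).mean(axis=0)

    # Hessian of the average training loss, with damping for invertibility
    H = (Xb_train * (p_train * (1 - p_train))[:, None]).T @ Xb_train / len(Xb_train)
    H += damping * np.eye(H.shape[0])

    # I_up,loss(z_i) = -g_flag^T H^{-1} g_i ; larger magnitude = more influential
    influences = -g_train @ np.linalg.solve(H, g_flag)
    return np.argsort(-np.abs(influences))
```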

We hypothesize that influence functions rank the training datapoints in order of most to least biased datapoints. Note that the discriminatory pairs are only used to measure discrimination and to sort datapoints in the original training dataset; the discriminatory pairs themselves never occur in the training data.

For the dataset of table 1, the ranked datapoints are: #2, #7, #3, #1, #4, #6, #5. According to this ranking, #2 is the most biased decision, followed by #7 and so on.

Remove biased datapoints

When #2 is removed and the model is retrained, it results in only 1 discriminatory pair out of the same 700 similar pairs of individuals, i.e., 0.14% remaining individual discrimination. When both #2 and #7 are removed, the discrimination is 16%. Therefore, removing only #2 is the local minimum for discrimination. Our algorithm returns the dataset with #2 removed. This reduced dataset is called the “debiased dataset”. Note that our approach removes biased datapoints from the training data. The similar pairs are synthetically generated and are only used to estimate individual discrimination and identify biased training points. The similar pairs cannot be removed since they do not occur in the training dataset.

The bank should train its model on the fair decisions in the debiased dataset, rather than on the whole dataset. We choose to remove the biased datapoint and not modify its label because the bias could arise either due to labeling bias or due to selection bias. Note that our approach can remove a datapoint belonging to any demographic group, not only the disadvantaged group as happened in this hypothetical dataset.

All the steps of our approach are shown in Figure 1.

Our approach requires access to the sensitive attribute. Prior work (removingSensitive1; removingSensitive2) has argued for the necessity of having access to the sensitive features in order to identify and address bias in automated models. After identifying and removing the biased decisions, the bank can train their model by omitting the sensitive feature(s) (race in this case) to avoid disparate treatment (jolly-ryan-have).

Our approach empowers the bank to make less biased decisions in the future. (As with any automated decision-making process, the bank should include avenues for challenge and redress.) This avoids violating anti-discrimination laws and is the morally right thing to do.

3. Algorithm

Input: Training dataset D, sensitive attribute S, binary classification model M trained on D, input space similarity threshold Δ
Output: D sorted in decreasing order of contribution to bias

Function SortDataset(D, S, M, Δ):
      PSI ← GenerateSimilarPairs(D, S, Δ)
      // Discriminatory pairs: similar individuals who received dissimilar outcomes
      DP ← {(A1, A2) ∈ PSI : M(A1) ≠ M(A2)}
      // Individuals that M discriminates against
      InfluenceSet ← ∅
      for (A1, A2) ∈ DP do
            // Add to InfluenceSet the individual with lower classification confidence
            if Confidence(M, A1) < Confidence(M, A2) then
                  InfluenceSet ← InfluenceSet ∪ {A1}
            else
                  InfluenceSet ← InfluenceSet ∪ {A2}
      return RankByInfluence(InfluenceSet, D)

// Randomly generates pairs of similar individuals
Function GenerateSimilarPairs(D, S, Δ):
      SimilarPairs ← ∅
      for i ← 1 to 100 · |D| do
            A1 ← sample uniformly at random from the feature space of D
            // Generate A2 s.t. distance(A1, A2) ≤ Δ and A2 differs from A1 in S
            A2 ← SimilarIndividual(A1, S, Δ)
            SimilarPairs ← SimilarPairs ∪ {(A1, A2)}
      return SimilarPairs

// Ranks datapoints in D responsible for discrimination, sorted in decreasing order of influence
Function RankByInfluence(InfluenceSet, D):
      // See (koh_understanding_2017) for implementation
      return D sorted by the average influence of each training datapoint on the model's predictions for InfluenceSet

Algorithm 1. Sort the training data in decreasing order of bias

Input: Training dataset D, sensitive attribute S, input space threshold Δ, Train function
Output: Debiased version of D, which is a subset of D

Function ProduceDebiasedDataset(D, S, Δ, Train):
      // Model trained on full dataset
      M ← Train(D)
      // Sorted biased datapoints (see Algorithm 1)
      SD ← SortDataset(D, S, M, Δ)
      prevDiscm ← ∞
      for i ← 0 to 100 do
            // DropFirst removes the first i% of the sorted dataset SD
            D_i ← DropFirst(SD, i)
            // Model trained on remaining data
            M_i ← Train(D_i)
            discm ← DiscmTest(M_i, D_i, S, Δ)
            // Stop at a local minimum of remaining discrimination
            if discm > prevDiscm then
                  return DropFirst(SD, i − 1)
            prevDiscm ← discm

// Estimates the individual discrimination of model M
Function DiscmTest(M, D, S, Δ):
      PSI ← GenerateSimilarPairs(D, S, Δ)
      // Discriminatory pairs: similar individuals who received dissimilar outcomes
      DP ← {(A1, A2) ∈ PSI : M(A1) ≠ M(A2)}
      return |DP| / |PSI|

Algorithm 2. Produce a debiased dataset

Algorithm 1 sorts a dataset in the order of most biased to least biased datapoint. It has four main parts.

First, Algorithm 1 generates a pool PSI of pairs of similar individuals using GenerateSimilarPairs. We arbitrarily choose the size of the pool to be 100 times the size of the dataset. A larger pool leads to a better estimate of individual discrimination. The individuals and their similar counterparts are automatically generated, sampling uniformly at random from the feature space that is defined by the original training dataset. For a randomly sampled individual A1, SimilarIndividual generates a similar individual A2 whose distance from A1 is less than the threshold Δ, a user-provided parameter that determines whether two individuals are similar. For the input space similarity condition in Section 2, Δ is 0. For example, random sampling for income, wealth, and race in the hypothetical dataset could generate an individual A1 with some particular income, wealth, and race. According to the input space similarity condition, a similar individual for A1 must have the same income and wealth (these are the non-sensitive features). While generating similar individuals, we enforce changing the value of the sensitive attribute (race in this case) to generate similar individuals in different demographic groups. Therefore, A2 has the same income and wealth as A1 but the other value of race. These pairs of similar individuals allow us to estimate the individual discrimination of a model because we can compare their predictions, which should be similar as well. Note that generating pairs of similar individuals based on a given Δ, as opposed to finding pairs of similar individuals in the training dataset, allows us to reliably generate a large pool of pairs.
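A minimal sketch of this pair-generation step for the Δ = 0 case follows. It assumes normalized features and a binary sensitive attribute, and it ignores categorical non-sensitive features for brevity; the function name is ours.

```python
# Sketch: sample individuals uniformly from the feature ranges of the training data and
# pair each one with a copy that differs only in the (binary) sensitive attribute.
import numpy as np

def generate_similar_pairs(X, sensitive_idx, n_pairs, rng=None):
    rng = rng or np.random.default_rng()
    lo, hi = X.min(axis=0), X.max(axis=0)
    A1 = rng.uniform(lo, hi, size=(n_pairs, X.shape[1]))
    A1[:, sensitive_idx] = rng.integers(0, 2, n_pairs)   # binary sensitive attribute
    A2 = A1.copy()
    A2[:, sensitive_idx] = 1 - A1[:, sensitive_idx]       # flip the sensitive attribute
    return A1, A2   # categorical features would need discrete sampling; omitted here
```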

Second, algorithm 1 determines the discriminatory pairs: the pairs of similar individuals who received dissimilar model predictions (a subset of the pool PSI). In the case of binary classification, different labels are considered dissimilar.

Third, for each discriminatory pair, Algorithm 1 determines which individual was misclassified due to model bias. Internally, a classifier computes, for each outcome class, the probability of an individual belonging to that class, and the predicted outcome for that individual is the class with the highest probability. Classification confidence refers to this highest probability. Our heuristic is that the individual with the lower classification confidence is the one who was treated unfairly.

Fourth, RankByInfluence identifies the datapoints in the original (biased) training dataset D responsible for discrimination against these individuals. Given a trained model and a set of datapoints along with their predictions from the model, RankByInfluence ranks the training data of the model from most influential to least influential for those predictions. If the most influential datapoints are removed from the training data and the model is retrained with the same model architecture, the probability of a change in the predictions for the discriminatory pairs is highest. We hypothesize that RankByInfluence returns the training datapoints sorted in order of most to least bias, and our experiments support this. All four steps are shown in the left portion of Figure 1.

None of the above steps is dependent on the number of output categories, so the algorithm is applicable to multi-class classifiers. The notion of similar predictions would need to be adjusted accordingly.

Algorithm 2 first calls SortDataset to rank the datapoints in decreasing order of bias. It then iteratively removes a chunk of the most biased datapoints from the sorted original training dataset, retrains the same model architecture on the remaining datapoints, and estimates the individual discrimination of the retrained model using DiscmTest. When the remaining discrimination of a retrained model reaches a local minimum, Algorithm 2 returns a debiased dataset obtained by dropping the most biased datapoints from the original training data. The size of each chunk can be adjusted as desired (it is 1/100 of the original training data in Algorithm 2).
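A sketch of this outer loop is shown below, assuming helper callbacks for training, for Algorithm 1's ranking, and for the discrimination estimate; the identifiers are ours and this is not the authors' implementation.

```python
# Sketch of the chunked-removal loop: drop the most biased points in steps of step_pct
# percent and stop once the retrained model's individual discrimination stops decreasing.
def debias(X, y, train, sort_by_bias, discm_test, step_pct=1):
    order = sort_by_bias(X, y)              # numpy array of indices, most biased first
    best_discm = discm_test(train(X, y))    # discrimination of the model on the full data
    best_subset = (X, y)
    n = len(X)
    for pct in range(step_pct, 100, step_pct):
        keep = order[int(n * pct / 100):]   # drop the first pct% most biased points
        Xk, yk = X[keep], y[keep]
        d = discm_test(train(Xk, yk))
        if d > best_discm:                  # local minimum reached: stop
            break
        best_discm, best_subset = d, (Xk, yk)
    return best_subset
```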

4. Evaluation

Id Dataset Size # Numerical Attrs. # Categorical Attrs. Sensitive Attr. (S) Training label (binary)
D1 Adult income (UCI-repo-adult) 45222 1 11 Sex Income > $50K
D2 Adult income (UCI-repo-adult) 43131 1 11 Race Income > $50K
D3 German credit (UCI-repo-german) 1000 3 17 Sex Credit worthiness
D4 Student (UCI-repo-student) 649 4 28 Sex Exam score ≥ 11
D5 Recidivism (compas-data) 6150 7 3 Race Ground-truth recidivism
D6 Recidivism (compas-data) 6150 7 3 Race Prediction of recidivism
D7 Credit default (UCI-repo-default) 30000 14 9 Sex Credit worthiness
D8 Salary (sensitive_removal_data) 52 2 3 Sex Salary ≥ $23719
Table 2. Datasets used in the evaluation

We conducted experiments to answer the following research questions:

RQ1: Does our technique reduce individual discrimination?

RQ2: Do previous techniques reduce individual discrimination?

RQ3: How does our technique impact test accuracy?

RQ4: How do previous techniques impact test accuracy?

RQ5: How do the techniques compare in terms of statistical disparity?

RQ6: How sensitive are the techniques to hyperparameter choices?

We compared our pre-processing technique against 7 other techniques: a baseline model trained on the full training dataset (Full), five pre-processing techniques — simple removal of the sensitive attribute (SR), Disparate Impact Removal (DIR) (feldman_certifying_2015) (used at the highest fairness-enforcing level, repair=1), Preferential Sampling (PS) (kamiran_data_2012), Massaging (MA) (kamiran_data_2012), and Learning Fair Representations (LFR) (zemel_learning) — and one in-processing technique, Adversarial Debiasing (AD) (Zhang2018MitigatingUB). The implementations for some of these techniques were taken from IBM AIF360 (Bellamy2018AIF3).

We evaluated the techniques using six real-world datasets that are commonly used in the fairness literature (see table 2).

For all experiments, the machine learning model architecture we used was a neural network with 2 hidden layers. We trained models with 240 different hyperparameter settings (3 × 2 × 2 × 20 = 240; a sketch enumerating the grid appears after this list):


  • In the first hidden layer, the number of neurons is 16, 24, or 32.

  • In the second hidden layer, the number of neurons is 8 or 12.

  • Each model had two choices for batch sizes: the closest powers of 2 to the numbers obtained by dividing the dataset size by 10 and 20, respectively. For example, if the dataset size is 1000, the batch sizes are 128 and 64.

  • Each experiment had 20 choices for random permutations for the full dataset. The choice of random permutation affects the datapoints that form the training and test datasets.
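As a cross-check of the count, the following sketch enumerates the grid (3 × 2 × 2 × 20 = 240 settings). The power-of-2 rounding rule follows the description above; the helper names are ours.

```python
# Sketch: enumerate the 240 hyperparameter settings described above.
import itertools

def nearest_power_of_two(x):
    """Round x to the closest power of 2 (e.g., 100 -> 128, 50 -> 64)."""
    lower = 2 ** (x.bit_length() - 1)
    upper = lower * 2
    return lower if x - lower < upper - x else upper

def hyperparameter_grid(dataset_size, n_permutations=20):
    layer1 = [16, 24, 32]
    layer2 = [8, 12]
    batch_sizes = [nearest_power_of_two(dataset_size // 10),
                   nearest_power_of_two(dataset_size // 20)]
    return list(itertools.product(layer1, layer2, batch_sizes, range(n_permutations)))

grid = hyperparameter_grid(1000)
print(len(grid))   # 240 settings; batch sizes for a dataset of size 1000 are 128 and 64
```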

4.1. Experimental methodology

To create the models for our approach (to answer RQ1 and RQ3), we executed the following methodology:


  1. For each of the 240 choices of hyperparameters:

    1. Split the dataset into the first 80% training and last 20% testing (without randomness, but depending on the data permutation, which is one of the hyperparameters). The dataset is normalized before usage.

    2. Debias the training dataset using Algorithm 2; that is, remove some points from it.

    3. Compute a “debiased model”, which is trained on the debiased training dataset.

  2. Let the “unfair datapoints” for a dataset be the union of the datapoints removed by all the 240 models: that is, any datapoint removed by any debiasing step.

To create the models for other approaches (to answer RQ2 and RQ4), we ran each approach 240 times, once for each choice of hyperparameters. The in-processing technique AD (Zhang2018MitigatingUB) does not take hyperparameters other than the data permutation, so we repeated the process only 20 times, once for each data permutation.

To measure the performance of the models, we executed the following methodology for each trained model:


  1. Measure the model’s individual discrimination using the function DiscmTest of Algorithm 2 (RQ1 and RQ2).

  2. Measure the model’s test accuracy on a debiased test set (RQ3 and RQ4). The debiased test set is computed by removing the unfair points from the test set, which is the last 20% of the dataset.

Note that when evaluating our technique, the points removed from the test set of a model are not affected by that model itself, but only by other models that have those test points in their training set. Thus, there is no leak between the training and test datasets. The fourth hyperparameter, the random permutation, affects the datapoints that form the training and test sets for a model.
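A sketch of how the unfair datapoints and the debiased test set could be assembled across runs is shown below, under the assumption that each run reports the indices it removed; the identifiers are ours.

```python
# Sketch: pool the datapoints removed by any debiasing run ("unfair datapoints") and
# evaluate each model on the test split with those points removed.
import numpy as np

def debiased_test_accuracy(models_and_removed, X, y, test_idx):
    unfair = set()
    for _, removed_idx in models_and_removed:   # removed_idx: training indices dropped by one run
        unfair.update(removed_idx)
    clean_test = [i for i in test_idx if i not in unfair]
    accs = [np.mean(model.predict(X[clean_test]) == y[clean_test])
            for model, _ in models_and_removed]
    return clean_test, accs
```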

We conducted two sets of experiments (sections 4.2 and 4.3) with different similarity conditions in the input space.

4.1.1. Debiasing the test set

Our evaluation methodology uses a debiased test set from which unfair points have been removed. The reason is that a user’s goal is not to obtain a model that performs well on the entire dataset that includes biased decisions, but a model that performs well on fair decisions. Our experimental results indicate that our debiasing technique identifies fair decisions, so we use it for this purpose.

This experimental methodology addresses an observation made by trade_off_2019 that in most previous discrimination mitigation approaches, there is a discrepancy between the algorithm and its evaluation. Previous authors have agreed on the existence of bias in the data and have devised algorithms to mitigate bias in the resulting data or classifier, but they evaluated the accuracy of their approach on the original test set, which is potentially biased. Due to this discrepancy, most previous work suggests that a technique must trade off fairness and accuracy (berk2017convex; Chouldechova2016FairPW; CorbettDavies2017AlgorithmicDM; feldman_certifying_2015; Fish2016ACA; kamiran_data_2012; Kleinberg2016InherentTI; Menon-CF:2018; Zafar2017FairnessBD; calmon_optimized; Calders-Naive-2010; Kamishima:2012), which trade_off_2019 refute.

Using our experimental methodology, the classifier debiased using our approach usually has higher accuracy on the debiased test set than the classifier trained on the full dataset (see sections 4.5 and 4.4). We think of this phenomenon as a classifier with improved generalization, an intuition also shared by berk2017convex, who remark that fairness constraints might act as regularizers and improve generalization.

Another advantage is that using the same test set for a particular set of hyperparameters provides an apples-to-apples comparison of our technique with all the seven baselines. For the Full baseline, the full training set is used, while the test set is still debiased.

4.2. Experiments with input space threshold Δ = 0

In the first set of experiments, we used the following distance functions in the definition of individual discrimination.

Input space similarity condition: We consider two individuals to be similar if they are the same in all non-sensitive features.

Output space similarity condition: We consider two outcomes to be similar if they are the same outcome. Note that for all our experiments, the outcomes are binary.

Generating similar individuals: For the given similarity condition in the input space, GenerateSimilarPairs randomly generates the first individual, and then flips its sensitive feature to generate an individual similar to it (similar procedure as used for the hypothetical dataset in Section 2). For example, the Salary dataset has features sex, rank, age, degree, and experience, out of which age and experience are numerical while the others are categorical features. If the first randomly generated individual had feature values Male, Full, 35, Doctorate, 5, then the similar individual would have feature values Female, Full, 35, Doctorate, 5. In this set of experiments, the number of pairs of similar individuals generated for each dataset was set to 100 times the size of the respective total dataset (e.g., 3,000,000 for the Credit default dataset).

When generating individuals, there is no guarantee that the generated individuals are representative of the actual population. For example, consider a dataset whose features include gender and college alma mater. It might generate a datapoint for a male who graduated from a women’s college. Such individuals are uncommon (Timothy Boatwright of Wellesley College is one example). Therefore, a dataset with many such individuals might or might not be useful in determining whether there is discrimination against graduates of the women’s college. As another example, consider determining whether a basketball coach has discriminated ethnically in selecting team members; generated individuals might not be representative since Bolivian men have an average height of 160cm (5’1"), and Bosnian men have an average height of 184cm (6’). Data selection (as opposed to generation) approaches can guarantee that the individuals are characteristic, but they require accurate characterizations of, or large samples from, the population (which we don’t have access to). Even so, it may not be possible or easy to find many similar pairs of individuals to compare. We acknowledge these limitations. Future work should explore how to obtain similar pairs that are characteristic of real-world populations.

4.3. Experiments with input space threshold Δ > 0

The second set of experiments used these distance functions.

Input space similarity condition: We consider two individuals to be similar if, among the non-sensitive features, they have the same value for all categorical features and are within a 10% range for all numerical features (after normalization).

Output space similarity condition: We consider two outcomes similar if they are the same outcome.

Generating similar individuals: GenerateSimilarPairs randomly generates the first individual A1. It then generates 2 similar individuals for A1 following the input space similarity condition. (Therefore, in this set of experiments, the size of the pool of similar pairs was equal to 200 times the size of the dataset.) A similar individual is generated by maintaining the same values for all categorical features and randomly sampling within the −10% to +10% range of A1's value for all numerical features.
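A sketch of this perturbation-based generation follows, assuming normalized numerical features and a binary sensitive attribute that is flipped as in the general procedure of Section 3; the names are ours.

```python
# Sketch: generate an individual similar to a1 for the Δ > 0 setting: same categorical
# values, numeric values resampled within +/- delta of a1's (normalized) values,
# sensitive attribute flipped.
import numpy as np

def similar_individual(a1, numeric_idx, sensitive_idx, rng, delta=0.10):
    a2 = a1.copy()
    for j in numeric_idx:
        lo, hi = max(0.0, a1[j] - delta), min(1.0, a1[j] + delta)
        a2[j] = rng.uniform(lo, hi)
    a2[sensitive_idx] = 1 - a1[sensitive_idx]   # binary sensitive attribute assumed
    return a2
```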

Individual discrimination Test accuracy Statistical parity difference
Id Full SR DIR PS MA LFR AD Our Full SR DIR PS MA LFR AD Our Full SR DIR PS MA LFR AD Our
D1 19.0 0.0 19.0 0.0064 5.7 0.096 12.0 0.0 80 82 80 81 80 81 85 92 29 13 28 8.7 3.3 3.6 2.3 7.8
D2 11.0 0.0 11.0 0.38 6.0 0.0063 4.3 0.0 83 85 84 84 83 84 87 92 21 13 20 12 7 0.33 7.8 10
D3 6.2 0.0 3.8 0.014 0.083 0.0 8.1 0.0 75 82 74 73 75 62 76 81 1.6 3.1 2.1 14 6.2 0.085 1.5 9.8
D4 0.0015 0.0 0.02 0.0 3.5 0.037 2.3 0.0 96 98 95 92 92 69 92 96 15 5.7 20 12 13 3.8 11 25
D5 0.013 0.0 0.0046 0.11 0.87 0.0 0.34 0.0 73 77 62 48 76 73 76 74 21 26 33 2.3 3.7 0.0 0.96 25
D6 0.045 0.0 0.0046 0.02 0.16 9.8e-4 0.01 0.0 67 81 49 65 78 78 79 84 39 25 49 14 1.1 22 26 23
D7 1.2 0.0 0.04 0.65 1.3 0.0 0.046 0.0 76 78 71 77 75 83 85 80 12 6.1 11 1.6 4.6 0.0 3.4 0.12
D8 0.019 0.0 0.0 0.019 19.0 0.0 33.0 0.0 100 100 100 100 100 50 75 100 33 11 12 22 50 0.0 0.0 0.0
Avg. 4.7 0.0 4.2 0.15 4.6 0.018 7.5 0.0 81 85 76 77 82 72 81 87 21 13 22 11 11 3.7 6.6 13
Table 3. Information about the model with the least remaining discrimination, among 240 hyperparameter settings, when Δ = 0. All numbers are percentages.
Individual discrimination Test accuracy Statistical parity difference
Id Full SR DIR PS MA LFR AD Our Full SR DIR PS MA LFR AD Our Full SR DIR PS MA LFR AD Our
D1 21.0 0.0 21.0 0.73 6.3 4.5 17.0 6.6e-5 82 82 82 82 82 83 85 93 29 13 29 11 1.5 3.3 0.18 9.9
D2 12.0 0.0 12.0 1.8 8.0 0.42 5.2 9.3e-5 85 85 85 85 85 85 88 92 22 13 21 11 5.4 0.18 0.37 13
D3 11.0 0.0 8.7 1.1 1.7 0.56 11.0 0.007 82 82 81 83 85 78 82 82 11 3.1 6.1 3.2 1.2 2.6 5 7.9
D4 0.94 0.0 0.44 0.85 6.2 1.7 3.3 0.0031 98 98 98 99 98 89 96 98 18 5.7 20 2.4 8 16 12 17
D5 0.094 0.0 0.044 0.64 2.4 0.044 0.5 0.0016 77 77 72 61 85 87 80 100 32 26 38 3.9 1.6 21 22 26
D6 0.05 0.0 0.011 0.028 0.5 3.0 0.037 0.0013 74 81 50 76 87 83 81 100 38 25 48 15 0.078 22 26 26
D7 1.4 0.0 0.12 0.77 1.5 2.8 0.24 0.0 78 78 75 78 78 85 85 80 13 6.1 9 2.1 1.2 2 2.3 3.4
D8 0.019 0.0 0.0 0.019 19.0 6.8 33.0 0.0 100 100 100 100 100 100 75 100 11 11 10 10 0.0 12 0.0 10
Avg. 5.8 0.0 5.3 0.74 5.7 2.5 8.8 0.0016 84 85 80 83 87 86 84 93 22 13 23 7.3 2.4 9.9 8.5 14
Table 4. Information about the model with highest test accuracy, among 240 hyperparameter settings, when Δ = 0. All numbers are percentages.
Individual discrimination Test accuracy Statistical parity difference
Id Full SR DIR PS MA LFR AD Our Full SR DIR PS MA LFR AD Our Full SR DIR PS MA LFR AD Our
D1 20.0 0.0 20.0 0.1 6.7 5.2 17.0 2.2e-5 81 81 81 81 80 82 85 92 28 12 27 8.7 0.013 0.013 0.18 7.1
D2 13.0 0.0 13.0 1.8 7.5 0.025 5.3 4.2e-4 84 84 85 85 84 84 88 91 18 7.9 18 5 0.28 0.0024 0.37 7.6
D3 7.6 0.0 9.1 2.0 1.2 0.39 9.1 0.24 78 82 76 76 71 72 80 74 0.32 0.22 0.026 0.25 0.0 0.0 0.19 0.016
D4 3.9 0.0 3.6 1.2 7.1 5.5 4.6 0.0046 93 96 92 93 90 77 89 94 0.27 0.075 0.39 0.075 0.077 0.15 1.1 0.075
D5 0.085 0.0 0.014 0.24 1.5 0.0 0.34 0.0059 67 75 64 44 81 73 76 100 20 17 26 0.019 0.011 0.0 0.96 12
D6 0.049 0.0 0.0083 0.032 0.25 0.041 0.013 0.0021 72 78 48 68 82 71 77 100 33 19 42 9.6 0.0013 18 5.5 16
D7 1.2 0.0 0.071 0.98 1.4 0.0 0.49 0.0 76 77 73 77 78 83 84 77 8.9 3 8.3 0.022 0.14 0.0 0.62 0.12
D8 1.0 0.0 2.2 0.038 25.0 0.0 33.0 0.0 100 100 100 100 100 100 71 100 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
Avg. 5.9 0.0 6.0 0.8 6.3 1.4 8.7 0.032 81 84 77 78 83 80 81 91 14 7.4 15 3 0.065 2.3 1.1 5.4
Table 5. Information about the model with least statistical parity difference, among 240 hyperparameter settings, when Δ = 0. All numbers are percentages.

4.4. Results for input space threshold Δ = 0

The left portion of table 3 reports, for each technique, the best (lowest) individual discrimination among its 240 models. (Figure 2 plots, for each technique, the individual discrimination of all 240 models.) When multiple models have the lowest discrimination, we choose the model with the highest accuracy. Answer to RQ1: For all datasets, our technique achieves 0% remaining discrimination. Answer to RQ2: SR also achieves 0% individual discrimination. This follows from our choice of input space similarity condition. When the sensitive feature is removed, the remaining features are the same for all pairs of similar individuals, and therefore SR gives the same prediction for two individuals with the same features. LFR and PS leave little remaining discrimination. The other techniques (DIR, MA, and AD) are not effective in eliminating individual discrimination.

The center portion of table 3 reports, for each technique, the accuracy of the lowest-discrimination model. Answer to RQ3: Our technique produces models with better accuracy than models trained on the entire training dataset. This agrees with the observation made by trade_off_2019: “it requires no stretch of credulity to imagine that various personal attributes (e.g., race, gender, religion; sometimes termed ‘protected attributes’) have no bearing on a person’s intelligence, capability, potential, qualifications, etc., and consequently no bearing on the ground truth classification labels — such as job qualification status — that might be functions of these qualities. It then follows that enforcing fairness across these attributes should on average increase accuracy.” Answer to RQ4: DIR, PS, and LFR degrade the accuracy; AD and MA affect it little; and SR improves it, though not as much as our technique does. Our approach is always either best or within 3 percentage points of best, and is best on average.

A user may train multiple models (e.g., using different hyperparameter choices) and choose the best one for their application. Table 3 assumes that the user chooses the least-discriminating model. Table 4, by contrast, assumes that the user chooses the most accurate model. And table 5 assumes that the user chooses the model with the least statistical disparity. These are the extremal points on the Pareto frontier; a user might also choose a point between them (pareto_1).

The left and center portions of table 4 report, for each technique, the individual discrimination and accuracy of the most accurate model among the 240 models. We can answer RQ1–RQ4 about these models. Answer to RQ1 and RQ2: Our approach achieves, on average, nearly 0% (0.0016%) remaining discrimination; under the definition of our input space similarity condition, SR achieves 0% discrimination; other techniques are much higher. Answer to RQ3 and RQ4: Our approach achieves by far the highest average accuracy: 93% compared to the Full model’s 84% accuracy and MA’s 87% accuracy.

Individual discrimination Test accuracy Statistical parity difference
Id Full SR DIR PS MA LFR AD Our Full SR DIR PS MA LFR AD Our Full SR DIR PS MA LFR AD Our
D1 19.0 0.0026 19.0 0.042 5.6 0.41 12.0 0.62 81 83 81 85 81 84 85 86 29 15 28 8.9 3.3 3.6 2.3 9.8
D2 11.0 0.00044 11.0 0.45 6.0 0.37 4.3 0.67 79 80 80 80 79 80 85 81 21 8.1 20 12 7 0.59 7.8 12
D3 6.4 1.9 4.2 2.2 2.0 0.066 8.7 1.6 75 81 74 73 76 62 76 82 1.6 1.1 2.1 2.1 3.3 2.7 1.5 7.6
D4 4.7 4.6 4.8 4.7 4.2 0.87 4.5 2.3 99 98 100 98 96 90 98 97 13 12 6.6 13 12 5.4 11 4.4
D5 0.19 0.2 0.0052 0.14 0.97 0.0 0.36 8.1e-5 63 62 60 44 63 70 64 100 26 24 33 2.4 3.7 0.0 8.4 0.72
D6 0.048 0.046 0.0033 0.033 0.21 0.18 0.0088 0.0002 60 67 49 59 72 72 67 99 38 23 47 14 1.1 21 26 26
D7 1.8 1.6 0.19 1.5 1.9 0.0 0.079 0.0 76 76 71 76 75 86 85 83 12 4.7 11 1.4 4.6 0.0 3.4 6.5
D8 1.7 1.3 0.033 1.5 19.0 0.0 33.0 0.0 100 100 100 100 100 57 75 100 0.0 0.0 30 0.0 50 0.0 0.0 30
Avg. 5.6 1.2 4.9 1.3 5.0 0.24 7.9 0.65 79 80 76 76 80 75 79 91 18 11 22 6.7 11 4.2 7.5 12
Table 6. Information about the model with the least remaining discrimination, among 240 hyperparameter settings, when Δ > 0. All numbers are percentages.
Individual discrimination Test accuracy Statistical parity difference
Id Full SR DIR PS MA LFR AD Our Full SR DIR PS MA LFR AD Our Full SR DIR PS MA LFR AD Our
D1 20.0 0.11 21.0 0.5 6.2 4.5 17.0 0.62 82 84 82 85 82 85 85 86 29 13 29 11 1.5 3.3 0.18 12
D2 12.0 0.22 12.0 1.8 7.7 0.81 5.3 0.89 80 81 81 81 81 80 87 82 22 13 21 11 5.4 0.18 0.37 14
D3 11.0 2.1 8.7 2.6 2.7 1.1 11.0 1.6 82 82 81 83 85 78 83 82 11 3.1 6.1 3.2 1.2 2.6 5 5.8
D4 5.1 4.8 4.8 4.9 5.0 2.8 5.1 2.7 100 100 100 100 100 96 100 100 18 5.7 20 2.4 8 16 12 16
D5 0.58 0.57 0.065 0.71 2.6 4.8 0.61 8.1e-5 75 74 74 55 75 87 67 100 32 26 38 3.9 1.6 21 22 28
D6 0.054 0.06 0.01 0.11 0.59 5.4 0.034 0.00021 66 74 50 70 82 81 75 100 38 25 48 15 0.078 22 26 19
D7 2.0 1.7 0.62 1.7 2.1 3.7 0.26 0.0 78 78 75 79 78 87 85 83 13 6.1 9 2.1 1.2 2 2.3 12
D8 1.7 1.3 0.033 1.5 19.0 7.3 33.0 0.0 100 100 100 100 100 100 75 100 11 11 10 10 0.0 12 0.0 10
Avg. 6.6 1.4 5.9 1.7 5.7 3.8 9.0 0.73 82 84 80 81 85 86 82 91 22 13 23 7.3 2.4 9.9 8.5 15
Table 7. Information about the model with highest test accuracy, among 240 hyperparameter settings, when Δ > 0. All numbers are percentages.
Individual discrimination Test accuracy Statistical parity difference
Id Full SR DIR PS MA LFR AD Our Full SR DIR PS MA LFR AD Our Full SR DIR PS MA LFR AD Our
D1 20.0 0.11 20.0 0.1 6.7 5.2 17.0 0.75 81 84 82 85 81 82 85 85 28 12 27 8.7 0.013 0.013 0.18 8.8
D2 13.0 0.22 13.0 1.8 7.5 1.1 5.3 1.2 79 79 80 80 79 79 87 81 18 7.9 18 5 0.28 0.0024 0.37 9
D3 7.8 2.6 9.1 3.0 3.3 1.5 9.5 2.8 79 82 75 77 71 73 80 78 0.32 0.22 0.026 0.25 0.0 0.0 0.19 0.095
D4 6.0 5.3 5.9 5.4 7.2 5.9 6.2 4.2 99 99 99 99 97 79 98 99 0.27 0.075 0.39 0.075 0.077 0.15 1.1 0.075
D5 0.32 0.41 0.021 0.28 1.7 0.0 0.38 9.8e-5 63 64 62 44 72 70 65 100 20 17 26 0.019 0.011 0.0 0.96 0.044
D6 0.052 0.061 0.0088 0.058 0.31 0.46 0.014 0.00023 66 70 48 64 78 70 74 99 33 19 42 9.6 0.0013 18 5.5 17
D7 2.1 1.9 0.46 1.9 2.2 0.0 0.65 0.0 76 75 73 78 78 86 84 81 8.9 3 8.3 0.022 0.14 0.0 0.62 6.5
D8 1.7 1.3 2.2 1.5 25.0 0.0 33.0 0.86 100 100 100 100 100 100 75 100 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
Avg. 6.4 1.5 6.3 1.8 6.7 1.8 9.0 1.2 80 81 77 78 82 79 81 90 14 7.4 15 3 0.065 2.3 1.1 5.2
Table 8. Information about the model with least statistical parity difference, among 240 hyperparameter settings, when Δ > 0. All numbers are percentages.
Figure 2. The individual discrimination for all 240 hyperparameter choices (lower is better). Our approach (rightmost in each boxplot) always achieves 0% discrimination for some (often many) hyperparameter choices, and it has little variance across choices.

Figure 3. The test accuracy for all 240 hyperparameter choices (higher is better). Our approach is best or comparable to the best in terms of both accuracy and its variance, for all experiments except D7.

We also measured the statistical disparity: the absolute difference between the success rate for individuals with one value of the sensitive attribute and the success rate for individuals with the other value. The right portions of tables 3, 4 and 5 report the results. Answer to RQ5: Our technique achieves a lower statistical disparity than the baseline model trained on the full dataset. Compared to the other techniques (which are designed to optimize for statistical parity difference), our technique is in the middle of the pack. For each technique that has a lower statistical disparity than ours, our technique achieves considerably lower individual discrimination and higher accuracy.
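For reference, the statistical parity difference as defined above can be computed as follows; this is a minimal sketch with illustrative names.

```python
# Sketch: absolute difference in favorable-outcome rates between the two groups
# defined by a binary sensitive attribute.
import numpy as np

def statistical_parity_difference(y_pred, sensitive):
    y_pred, sensitive = np.asarray(y_pred), np.asarray(sensitive)
    rate_a = y_pred[sensitive == 0].mean()
    rate_b = y_pred[sensitive == 1].mean()
    return abs(rate_a - rate_b)
```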

If a bias mitigation technique is highly sensitive to hyperparameter choices, then users might need to run it many times to achieve desirable performance, and they may have less confidence in its generalizability.

Figure 2 shows the remaining individual discrimination for all 240 models we trained for each experiment and each technique, and fig. 3 shows the test accuracy for all the 240 models for each experiment and each technique.

Answer to RQ6: Our technique not only usually results in lower remaining discrimination and higher accuracy than previous techniques, it is also much less sensitive to hyperparameter choices (narrower range).

4.5. Results for input space threshold Δ > 0

Tables 6 to 8 report the same statistics for the experiments with Δ > 0. We answer RQ1–RQ4 using Tables 6 and 7.

Answer to RQ1: For both the models with the least discrimination and highest accuracy, our approach achieves very low individual discrimination (0.65% and 0.73% respectively).

Answer to RQ2: For the most accurate model, our approach gets the lowest discrimination among all approaches (0.73%). For the models with the least discrimination, our approach is second and very close to the best performing approach (LFR). Notably, SR has much higher discrimination for this choice of similarity metric. In the previous set of experiments, SR was able to get 0% individual discrimination just by choice of input space similarity.

Answer to RQ3: Our approach achieves by far the best accuracy among all the baselines, for both the least-discriminating and the most accurate models.

Answer to RQ4: Similar to the conclusions from the experiments with Δ = 0, for the least-discriminating model: DIR, PS, and LFR degrade the accuracy; AD, MA, and SR affect it little. For the most accurate model: SR, MA, and LFR improve the accuracy, but not as much as our approach does.

Answer to RQ5: For statistical disparity (shown in the right portions of tables 6 to 8), the conclusions are the same as when Δ = 0.

Answer to RQ6: Due to lack of space, the plots showing the remaining discrimination and test accuracy for the experiments with Δ > 0 are in the appendix (Appendix A). The conclusions are the same as when Δ = 0.

5. Related Work

5.1. Fairness Metrics

More than 20 metrics for fairness have been proposed (verma_fairness_2018). They can be broadly divided into three categories: group fairness, causal fairness, and individual fairness.

Group fairness

Most group fairness metrics declare a model to be fair if it satisfies a specific constraint derived from the confusion matrix (Confusion-Matrix), e.g., predictive parity (Simoiu2016; Chouldechova2016FairPW) (equal probability of being correctly classified into the favorable class for all demographic groups), predictive equality (CorbettDavies2017AlgorithmicDM; Chouldechova2016FairPW) (equal true negative rate for all groups), equality of opportunity (Chouldechova2016FairPW; Hardt2016EqualityOO; kusner_counterfactual) (equal true positive rate for all groups), equalized odds (Hardt2016EqualityOO; Zafar2017FairnessBD) (equal true positive rate and equal true negative rate for all groups), accuracy equality (Simoiu2016) (equal predictive accuracy for all groups), treatment equality (Simoiu2016) (equal ratio of false negatives and false positives for all groups), calibration (Chouldechova2016FairPW; Hardt2016EqualityOO), well-calibration (Kleinberg2016InherentTI), and balance for positive and negative classes (Kleinberg2016InherentTI). Most definitions in the group fairness category require the ground truth, which is unavailable before deploying a model (group_fairness:2020).
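To make a few of these definitions concrete, the following sketch computes the corresponding group gaps from per-group true positive and true negative rates, assuming binary labels and a binary sensitive attribute; the names are ours.

```python
# Sketch: confusion-matrix-based group fairness gaps between two sensitive groups.
import numpy as np

def group_rates(y_true, y_pred, mask):
    """True positive rate and true negative rate within the subgroup selected by mask."""
    yt, yp = y_true[mask], y_pred[mask]
    tpr = np.mean(yp[yt == 1] == 1) if np.any(yt == 1) else float("nan")
    tnr = np.mean(yp[yt == 0] == 0) if np.any(yt == 0) else float("nan")
    return tpr, tnr

def group_fairness_gaps(y_true, y_pred, sensitive):
    y_true, y_pred, sensitive = map(np.asarray, (y_true, y_pred, sensitive))
    tpr0, tnr0 = group_rates(y_true, y_pred, sensitive == 0)
    tpr1, tnr1 = group_rates(y_true, y_pred, sensitive == 1)
    return {
        "equality_of_opportunity_gap": abs(tpr0 - tpr1),               # equal TPR across groups
        "predictive_equality_gap": abs(tnr0 - tnr1),                   # equal TNR across groups
        "equalized_odds_gap": max(abs(tpr0 - tpr1), abs(tnr0 - tnr1)), # both rates equal
    }
```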

The most popular definition from the group fairness category is statistical parity, which states that a decision-producing system is fair if the probability of getting a favorable outcome is equal for members of all demographic groups formed by sensitive features.

Most discrimination-reducing approaches evaluate their methodology on statistical parity or measures close to it (feldman_certifying_2015; kamiran_data_2012; calmon_optimized; zemel_learning; Calders-Naive-2010; Kamishima:2012; friedler_comparative_2018; wadsworth_achieving_2018). Dwork et al. (dwork_fairness_2011) criticize statistical parity by showing how three evils (reduced utility, self-fulfilling prophecy, and subset targeting) can occur even while statistical parity is maintained. Statistical parity is also not applicable in scenarios where the base rates of ground-truth occurrence are different, e.g., criminal justice (Chouldechova2016FairPW; dwork_fairness_2011; Hardt2016EqualityOO; Kleinberg2016InherentTI).

Causal fairness

Causal fairness metrics require a causal model that is used to reason about the effects of certain features on other features and the outcome (causal_1:2019; causal_2:2017). Causal models are represented by a graph with features as nodes and directed edges showing the effects of one feature on another. Learning causal models from the data is not always possible and therefore requires domain knowledge (pearl2009causality). A causal model is consequently an untestable assumption about the data generation process. Papers proposing causal fairness definitions assume a given causal model and evaluate their methodology based on this assumption (kusner_counterfactual; kilbertus_avoiding; Nabi2017FairIO).

Individual fairness

Individual fairness (dwork_fairness_2011; avg_IF; Jung2019ElicitingAE; Lahoti2019_ops_IF) (defined in section 1) states that similar individuals should be treated similarly: they should be given similar predictions. The similarity metric for individuals should only consider the non-sensitive features as that ensures adherence to anti-discrimination laws. This matches a common intuition, does not make assumptions about data generation (sexgottodo:2020) or base rates, and does not require the presence of ground truth labels.

Trading off group and individual fairness

Both group fairness and individual fairness are desirable. It is a policy and political decision which one to prioritize. (A related policy/political question is what forms of affirmative action, if any, are just.) Previous work has largely ignored individual fairness, which we argue is an oversight.

Our experiments show that group and individual fairness must be traded off in a relative sense: maximizing one leads to the other taking on a non-maximal value. However, they do not need to be traded off in an absolute sense: while maximizing one, it is still possible to improve the other. Our technique is a bright spot: in tables 5, 4 and 3, ours is the only technique (out of eight) that always improves test accuracy, individual discrimination, and statistical parity. Its interventions may be acceptable across the political spectrum. This is an exciting new direction for research in fairness in machine learning.

5.2. Fairness Literature

Most previous work in the fairness literature (dunkelau_fairness-aware; mehrabi_survey_2019; friedler_comparative_2018; survey_accountability:2020) can be categorized into discrimination detection and interventional approaches.

Discrimination detection (galhotra_fairness_2017; Udeshi-testing:2018; Aggarwal-testing:2019; fliptest:2020) measures whether a learned model is biased. Our experiments take inspiration from Themis (galhotra_fairness_2017), which measures individual discrimination by generating pairs of similar individuals that differ only in a sensitive attribute, sampling uniformly at random from the feature space defined by the training data.

Interventional approaches aim to improve the fairness of a learned model. They can be further categorized into three groups based on the stage of machine learning pipeline they intervene in:

  1. Pre-processing: Intervention at the stage of training data. Previous work modifies the training data to reduce discrimination and measures it using their preferred fairness metric (Edwards2015CensoringRW; Calders-building-2009; Kamiran2009ClassifyingWD; Kamiran-no-discm-2010; kamiran_data_2012; calmon_optimized; feldman_certifying_2015; Li2014LearningUF; Louizos2015TheVF; Hacker-continuous-2017; Hajian-method-2013; Hajian-rule-2011; Johndrow2017AnAF; kilbertus-blind-18; Lum2016ASF; Luong-knn-2011; McNamara2017ProvablyFR; Thomas2019; Metevier2019; salimi_capuchin:_2019). Merely removing the sensitive feature (the technique called SR in this paper) does not necessarily yield a fair model (sensitive_removal1; sensitive_removal2; sensitive_removal3; sensitive_removal4; sensitive_removal5; dwork_fairness_2011), and access to the sensitive feature is important for assessing disparities (Bogen:2020; Kallus2020AssessingAF; Lipton2018DoesMM). A model can learn to make decisions based on proxy features that are correlated with sensitive attributes, e.g., zip code can encode racial groups. Removing all the proxies would cause a large dip in accuracy (sensitive_removal3).

  2. In-processing: Intervention at the stage of training. Previous work modifies the learning algorithm (Calders-Naive-2010; Kamiran-tree-2010; Russell-WWC-2017; Dwork-DC:2018), modifies the loss function (Kamishima:2012; Kamishima-RA:2011; zemel_learning; Bechavod2017LearningFC), uses an adversarial approach (Beutel2017DataDA; Zhang2018MitigatingUB; wadsworth_achieving_2018), or adds fairness constraints (Zafar2017FairnessBD; Zafar2017FairnessCM; Agarwal-RA:2018; Fish2016ACA; Donini-RM:2018; Celis2018ClassificationWF; Ruoss2020LearningCI).

  3. Post-processing: Intervention at the stage of deployment of a trained model. Previous work modifies the input before passing it to the model (Adler-AI:2018; Hardt2016EqualityOO; Pleiss-FC:2017; kusner_counterfactual; Card2018DeepWA; Woodworth2017LearningNP), or modifies the model’s prediction depending on the input’s sensitive attribute (Kamiran-tree-2010; Kamiran-DT-2012; Pedreshi-DA-2008; Pedreschi-SS:2009; Menon-CF:2018). Since the sensitive attribute is used to affect model predictions directly, this class of approach might be illegal due to disparate treatment laws (jolly-ryan-have; Winrow2010TheDB).

6. Conclusion

Building fair machine learning models is required for adherence to anti-discrimination laws; it leads to more desirable outcomes (e.g., higher profits); and it is the morally right thing to do. Training a machine learning model on biased historical decisions would perpetuate injustice.

We have proposed a novel approach to improve fairness. Our approach heuristically identifies unfair decisions made by a model, uses influence functions to identify the training data (e.g., biased historical decisions) that are most responsible for the unfair decisions, and then removes the biased training points.

Compared to a baseline model that is trained on historical data without removing any datapoints, our technique improves test accuracy, individual discrimination, and statistical disparity. Ours is the only technique (out of eight tested) that improves all three measures, no matter which is chosen as the optimization goal. By contrast, much previous work increases fairness only at the expense of accuracy.

References

Appendix

Appendix A Experimental plots

Figure 4. The individual discrimination for all 240 hyperparameter choices (lower is better). Our approach (rightmost in each boxplot) achieves low discrimination for many hyperparameter choices, and it has little variance across choices for most datasets.
Figure 5. The test accuracy for all 240 hyperparameter choices (higher is better). Our approach is best or comparable to the best in terms of both accuracy and its variance, for all experiments except D2 and D7.