1. Introduction
Machine learning (ML) models that are deployed in the real world can have serious effects on people's lives. In impactful domains such as lending (Hardt et al., 2016), college admissions (Marcinkowski et al., 2020), criminal sentencing (Corbett-Davies et al., 2017; Berk et al., 2018), and healthcare (Gianfrancesco et al., 2018; Zink and Rose, 2019), there is increasing concern that models will behave in unethical ways (Kearns and Roth, 2019). This concern has led ML researchers to propose different measures of fairness for constraining and/or auditing classification models (Dwork et al., 2012). However, in many cases, desired notions of fairness require exponentially many constraints to be satisfied, making the problems of learning fair models, and also checking for fairness, computationally hard (Kearns et al., 2017). For this reason, search heuristics like genetic programming (GP) may be useful for finding approximate solutions to these problems.
This paper is, to our knowledge, the first foray into incorporating fairness constraints into GP. We propose and study two methods for learning fair classifiers via GP-based symbolic classification. Our first proposal is a straightforward one: to add a fairness metric as an objective in multi-objective optimization (Deb et al., 2000). This fairness metric works by defining protected groups within the data, which match individuals having a specific value of one protected attribute, e.g. "female" for a sex attribute. Unfortunately, simple metrics of fairness do not capture fairness over rich subgroups and/or intersections of groups; that is, over multiple protected attributes that intersect in myriad ways. With this in mind, we propose an adaptation of lexicase selection (La Cava et al., 2018) designed to operate over randomized sequences of fairness constraints. This algorithm draws a connection between these numerous fairness constraints and the way in which lexicase samples fitness cases in random sequences for parent selection. We illustrate the ability of lexicase selection to sample the space of group intersections in order to pressure models to perform well on the intersections of groups that are most difficult for the current population. In our experiments, we compare several randomized search heuristics to a recent game-theoretic approach to capturing subgroup fairness. The results suggest that GP methods can produce Pareto-efficient trade-offs between fairness and accuracy, and that random search is a strong benchmark for doing so.
In the following section, we describe how fairness has been approached in the ML community and the challenges that motivate our study. Section 3 describes the algorithms we propose in detail, and Section 4 describes the experiment we conduct on four real-world datasets for which fairness concerns are pertinent. We present resulting measures of performance, statistical comparisons, and example fairness-accuracy trade-offs in Section 5, followed finally by a discussion of what these results entail for future studies.
2. Background
Incorporating notions of fairness into ML is a fairly new idea (Pedreshi et al., 2008), and early work in the field is reviewed in Chouldechova and Roth (2018). Algorithmic unfairness may arise from disparate causes, but often has to do with the properties of the data used to train a model. One major cause of bias is that data are often collected from unequal demographics of a population. In such a scenario, algorithms that minimize average error over all samples will skew towards fitting the majority population, since this leads to lower average error. One way to address this problem is to train separate models for separate demographic populations. In some scenarios, this method can reduce bias, but there are two main caveats, expounded upon in Thomas et al. (2019). First, some application areas explicitly forbid demographic data to be used in prediction, meaning these models could not be deployed. The second, and more general, concern is that we may want to protect several sensitive features of a population (e.g., race, ethnicity, sex, income, medical history, etc.). In those cases, dividing data beforehand is non-trivial, and can severely limit the sample size used to train each model, leading to poor performance.

There is not a single agreed-upon definition of fairness for classification. The definitions put forth can be grouped into two kinds: statistical fairness, in which we ask a classifier to behave approximately equally on average across protected groups according to some metric; and individual fairness, in which we ask a classifier to perform similarly on similar pairs of individuals (Dwork et al., 2012). For this paper, we focus on statistical fairness, especially equality of false positive (FP), false negative (FN), and accuracy rates among groups. We essentially ask that the classifier's errors be distributed among different protected groups as evenly as possible.
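The statistical-fairness notion above (equality of FP and FN rates across protected groups) can be sketched concretely. The following is a minimal illustration in Python; the function name and toy data are ours, not from the paper:

```python
import numpy as np

def group_error_rates(y_true, y_pred, group_mask):
    """FP and FN rates of a classifier restricted to one protected group."""
    y_true = np.asarray(y_true)[group_mask]
    y_pred = np.asarray(y_pred)[group_mask]
    neg, pos = (y_true == 0), (y_true == 1)
    fp_rate = float(np.mean(y_pred[neg] == 1)) if neg.any() else 0.0
    fn_rate = float(np.mean(y_pred[pos] == 0)) if pos.any() else 0.0
    return fp_rate, fn_rate

# toy example: a classifier whose errors all fall on group B
y_true = np.array([0, 0, 1, 1, 0, 0, 1, 1])
y_pred = np.array([0, 0, 1, 1, 1, 1, 0, 0])
groups = {"A": np.arange(8) < 4, "B": np.arange(8) >= 4}
for name, mask in groups.items():
    print(name, group_error_rates(y_true, y_pred, mask))  # A: (0.0, 0.0), B: (1.0, 1.0)
```

A statistically fair classifier, in the sense used here, would make these per-group rates approximately equal.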
Fairness constraints have been proposed for classification algorithms, for example by regularization (Dwork et al., 2012; Berk et al., 2017), model calibration (Hardt et al., 2016), cost-sensitive classification (Agarwal et al., 2018), and evolutionary multi-objective optimization (Quadrianto and Sharmanska, 2017). For the most part, the literature has focused on providing guarantees over a small number of protected groups that represent single attributes, for example race and sex. However, a model that appears fair with respect to several individual groups may actually discriminate over specific intersections or conjunctions of those groups. Kearns et al. (2017) refer to this issue as "fairness gerrymandering". To paraphrase Example 1.1 of their work (Kearns et al., 2017), imagine a classifier that exhibits equivalent error rates according to two protected groups: a race feature taking values in {"black", "white"} and, separately, a sex feature taking values in {"male", "female"}. This seemingly fair classifier could actually be producing 100% of its errors on black males and white females. In such a case the classifier would appear fair according to the individual race and sex groups, but unfair with respect to their conjunction.
If we instead wish to learn a classifier that is fair with respect to both individual groups defined over single attributes and boolean conjunctions of those groups, a combinatorial problem arises: the number of conjunctions grows exponentially in the number of protected attributes, and we have to both learn and check for fairness over all of these groups. It turns out that the problem of auditing a classifier for fairness over boolean conjunctions of groups (as well as other group definitions) is computationally hard in the worst case, as is the classification problem (Kearns et al., 2017).
Kearns et al. (2017) proposed a heuristic solution to the problem of learning a classifier with rich subgroup fairness constraints by formulating it as a two-player game in which one player learns a classifier and the other learns to audit that classifier for fairness. They empirically illustrated the trade-off between fairness violations and model accuracy on four real-world problems (Kearns et al., 2018). In our study, we build upon their work by using their fairness auditor to compare performance of models on the same datasets. In their study, Kearns et al. focused on algorithmic characterization by reporting fairness and accuracy on the training samples. Conversely, we are interested in the generalization performance of the learned classification models; therefore we conduct our comparisons over cross-validated predictions, rather than reporting in-sample results.
Our interest in applying GP to the problem of fair classification is motivated by three observations from this prior work. First, the fact that the learning and auditing problems for rich subgroup fairness are hard in the worst case means that a heuristic method such as GP may be able to provide approximate solutions with high utility, making it worth an empirical analysis. Second, many authors note the inherent trade-off that exists between fairness and accuracy (Hardt et al., 2016; Kearns et al., 2018) and the need for Pareto-efficient solution sets. Multi-objective optimization methods typically used in GP (e.g., NSGA2 (Deb et al., 2000)) are well-suited to handle competing objectives during search. Finally, we note that demographic imbalance, one of the causes of model unfairness, is a problem due to the use of average error for guiding optimization. However, recent semantic selection methods (Liskowski et al., 2015) such as lexicase selection (La Cava et al., 2016) are designed specifically to move away from scalar fitness values that average error over the entire training set. The original motivation behind these GP methods is to prevent the loss of candidate models in the search space that perform well over difficult subsets of the data (La Cava et al., 2016). Furthermore, we hypothesize that lexicase selection may be adapted to preserve models that perform well over structured subgroups of the protected attributes as well.
3. Methods
We start with a dataset of triples, D = {(x_i, x′_i, y_i)}, i = 1, …, N, containing N examples. Our labels are binary classification assignments y ∈ {0, 1}, and x is a vector of d features. In addition to x, we have a vector of sensitive features, x′, that we wish to protect via some fairness constraint. It is worth mentioning that for the purposes of this study, x contains x′, meaning that the learned classifier has access to the sensitive attribute observations in prediction; this is not always the case (e.g. (Thomas et al., 2019)).

We also define protected groups g ∈ G, where each g is an indicator function (we write 1(·) for indicator functions), mapping a set of sensitive features to a group membership. It is useful to define a simple set of protected groups that correspond to the unique levels of each feature in x′. We will call the set of these simple groups G_m. As an example, imagine we have two sensitive features corresponding to race and sex: race ∈ {black, white} and sex ∈ {male, female}. Then G_m would consist of four groups: 1(race = black), 1(race = white), 1(sex = male), and 1(sex = female). We make use of G_m in defining marginal fairness and in Algorithm 1.
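Enumerating the simple groups from the sensitive features is mechanical. A minimal sketch (function and variable names are ours) that reproduces the four-group example above:

```python
def simple_groups(sens):
    """Build one indicator mask ('simple group') per unique level of each
    sensitive feature. `sens` maps feature name -> list of observed values."""
    groups = {}
    for feat, values in sens.items():
        for level in sorted(set(values)):
            groups[f"{feat}={level}"] = [v == level for v in values]
    return groups

sens = {"race": ["black", "white", "black", "white"],
        "sex":  ["female", "female", "male", "male"]}
G_m = simple_groups(sens)
print(sorted(G_m))  # ['race=black', 'race=white', 'sex=female', 'sex=male']
```

Each entry is a boolean membership mask over the samples, which is the form the fairness computations below operate on.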
We use a recent GP technique called FEAT (La Cava et al., 2019; La Cava and Moore, 2019) that evolves feature sets for a linear model, in this case a logistic regression model. More details of this method are given in Section 4. As in other GP methods, FEAT trains a population of individuals, each of which is a classifier c producing binary classifications of the form ŷ = c(x). The fitness of c is its average loss over the training samples, denoted L(c). We refer to the fitness of c over a specific group g of training samples as L(c, g). With these definitions in mind, we can define the fairness of a classifier with respect to a particular group g and fitness measure as:

    fairness(c, g) = |L(c) − L(c, g)|    (1)
FEAT uses logistic loss as its fitness during training, in keeping with its logistic regression pairing. However, we compare fairness on fitted models relative to the FP and FN rate, as in previous work (Kearns et al., 2018; Agarwal et al., 2018).
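Eqn. 1 can be rendered directly in code. The sketch below uses zero-one loss for readability in place of the logistic loss FEAT uses during training; function names are ours:

```python
import numpy as np

def loss(y_true, y_pred, mask=None):
    """Average zero-one loss, optionally restricted to a group mask."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    if mask is not None:
        y_true, y_pred = y_true[np.asarray(mask)], y_pred[np.asarray(mask)]
    return float(np.mean(y_true != y_pred))

def fairness(y_true, y_pred, mask):
    """Eqn. 1: |L(c) - L(c, g)| for the group g given by `mask`."""
    return abs(loss(y_true, y_pred) - loss(y_true, y_pred, mask))

# a classifier whose errors all fall outside the group: |0.5 - 0.0| = 0.5
print(fairness([0, 1, 0, 1], [0, 1, 1, 0], [True, True, False, False]))  # 0.5
```

A value of 0 means the group's loss matches the population loss; larger values indicate a group treated unusually well or unusually poorly.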
3.1. Multi-objective Approach
A straightforward way to incorporate fairness into FEAT is to add it as an objective to a multi-objective optimization algorithm like NSGA2. We use the term marginal fairness to refer to the first-level fairness of a model defined over the simple groups G_m:

    marginal fairness(c) = (1 / |G_m|) Σ_{g ∈ G_m} fairness(c, g)    (2)
A challenge with using fairness as an objective is the presence of a trivial solution: a model that produces all 1 or all 0 classifications has perfect fairness, and will easily remain in the population unless explicitly removed.
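The trivial solution noted above can be demonstrated concretely. The sketch below computes an Eqn. 2-style marginal fairness using the FP rate as the underlying measure; an all-positive classifier has FP rate 1 in every group that contains a negative, so its marginal unfairness is exactly 0. Function names are ours:

```python
import numpy as np

def fp_rate(y_true, y_pred, mask=None):
    """FP rate, optionally restricted to a group; 0 if the group has no negatives."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    if mask is not None:
        y_true, y_pred = y_true[mask], y_pred[mask]
    neg = (y_true == 0)
    return float(np.mean(y_pred[neg])) if neg.any() else 0.0

def marginal_fairness(y_true, y_pred, groups, rate=fp_rate):
    """Eqn. 2 style: average over simple groups of |rate(c) - rate(c, g)|."""
    base = rate(y_true, y_pred)
    return float(np.mean([abs(base - rate(y_true, y_pred, g)) for g in groups]))

y_true = np.array([0, 1, 0, 1])
groups = [np.array([True, True, False, False]),
          np.array([False, False, True, True])]
# the trivial solution: predict 1 everywhere -> perfectly "fair"
print(marginal_fairness(y_true, np.ones(4, dtype=int), groups))  # 0.0
```

This is why such degenerate models must be explicitly removed when fairness is used as a bare objective.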
A major shortcoming of optimizing Eqn. 2 is that it does not pressure classifiers to perform well over group intersections, and it is therefore susceptible to the fairness gerrymandering described in Section 2. Unfortunately, it is not feasible to explicitly audit every classifier in the population over all possible combinations of structured subgroups at each generation. While an approximate, polynomial-time solution has been proposed (Kearns et al., 2017, 2018), we consider it too expensive to compute on the entire set of models at each iteration in practice. For these reasons, we propose an adaptation of lexicase selection (Spector, 2012) to handle this task in the following section.
3.2. Fair Lexicase Selection
Lexicase selection is a parent selection algorithm originally proposed for program synthesis tasks (Helmuth et al., 2014) and later for regression (La Cava et al., 2016). At each parent selection event, lexicase selection filters the population through a newly randomized ordering of "cases", which are typically training samples. An individual may only pass through one of these cases if it has the best fitness in the current pool of individuals, or alternately if it is within ε of the best for ε-lexicase selection. The filtering process stops when one individual is left (and is selected), or when it runs out of cases, in which case a parent is chosen at random from the remaining pool.
Although different methods for defining ε have been proposed, we use the most common one, which defines ε as the median absolute deviation (MAD) of the loss ℓ in the current selection pool: ε = median(|ℓ − median(ℓ)|). (Defining ε relative to the current selection pool is called "dynamic ε-lexicase selection" in (La Cava et al., 2018).)
Lexicase selection has a few properties worth noting that are discussed in depth in (La Cava et al., 2018). First, it takes into account case "hardness": training samples that are very easy to solve apply very little selective pressure to the population, and vice versa. Second, lexicase selection selects individuals on the Pareto front spanned by the cases; this means that, in general, it is able to preserve individuals that only perform well on a small number of hard cases (i.e., specialists (Helmuth et al., 2019)). Third, and perhaps most relevant to rich subgroup fairness, lexicase selection does not require each individual to be run on each case/sample, since selection often chooses a parent before the cases have been exhausted (La Cava et al., 2016). The worst-case complexity of parent selection is O(N · |P|), which only occurs in a semantically homogeneous population.
Because of the third point above, we can ask for lexicase selection to audit classifiers over conjunctions of groups without explicitly constructing those groups beforehand. Instead, in fair lexicase (FLEX, detailed in Alg. 1), we define "cases" to be drawn from the simple groups in G_m. A randomized ordering of these groups, i.e. cases, thereby assesses classifier performance over a conjunction of protected attributes. By defining cases in this way, selective pressure moves dynamically towards subgroups that are difficult to solve. For any given parent selection event, lexicase only needs to sample as many groups as are necessary to winnow the pool to one candidate, which is at most |G_m|. Nonetheless, due to the conditional nature of case orderings, and the variability in case depth and orderings, lexicase effectively samples combinations of protected groups.
An illustration of three example selection events is shown in Figure 1. These events illustrate that FLEX can select on different sequences of groups and different sequence lengths, while also taking into account how easy or hard each group is for the current selection pool.
A downside of FLEX versus the multi-objective approach is that it is less clear how to pressure for both fairness and accuracy among cases. On one hand, selecting for accuracy uniformly over many group definitions could lead to fairness, but it may also preserve markedly unfair, and therefore undesirable, models. We address this issue by allowing both case definitions (group loss and group fairness) to appear with equal probability. This choice explains the random coin flip in Alg. 1.

Algorithm 1: Fair lexicase selection (FLEX).

    Selection(P, G_m):
        parents ← ∅
        do N times:
            parents ← parents ∪ GetParent(P, G_m)      # add selection to parents
        return parents

    GetParent(P, G_m):
        G ← G_m                                        # protected groups
        S ← P                                          # selection pool
        while |G| > 0 and |S| > 1:
            g ← random choice from G                   # pick random group
            if random number < 0.5 then
                f_i ← L(c_i, g) for c_i ∈ S            # loss over group
            else
                f_i ← fairness(c_i, g) for c_i ∈ S     # group fairness
            f* ← min f_i                               # min fitness in pool
            ε ← MAD of fitnesses f
            for c_i ∈ S:
                if f_i > f* + ε then
                    remove c_i from S                  # filter selection pool
            remove g from G
        return random choice from S
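A minimal, self-contained Python rendering of GetParent may clarify the flow. Candidates are represented directly by their prediction vectors; names are illustrative and this is not FEAT's actual API:

```python
import random

import numpy as np

def mad(x):
    """Median absolute deviation, used as the epsilon threshold."""
    x = np.asarray(x, dtype=float)
    return float(np.median(np.abs(x - np.median(x))))

def group_loss(y_true, y_pred, mask):
    """Zero-one loss restricted to one protected group."""
    return float(np.mean(y_true[mask] != y_pred[mask]))

def flex_get_parent(preds, y_true, groups, rng=random):
    """One FLEX selection event (GetParent in Alg. 1)."""
    pool = list(range(len(preds)))   # indices of candidate classifiers
    cases = list(groups)             # the simple groups G_m, as boolean masks
    rng.shuffle(cases)
    while cases and len(pool) > 1:
        g = cases.pop()
        if rng.random() < 0.5:       # coin flip: group loss vs. group fairness
            f = [group_loss(y_true, preds[i], g) for i in pool]
        else:
            f = [abs(float(np.mean(y_true != preds[i]))
                     - group_loss(y_true, preds[i], g)) for i in pool]
        best, eps = min(f), mad(f)
        pool = [i for i, fi in zip(pool, f) if fi <= best + eps]
    return rng.choice(pool)
```

Passing a seeded `random.Random` instance as `rng` makes a selection event reproducible, which is convenient for testing.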
4. Experiments
We conduct our experiment on four datasets used in previous related work (Kearns et al., 2018). These datasets and their properties are detailed in Table 1. Each of these classification problems contains sensitive information for which one would reasonably want to assure fairness. Two of the datasets concern models for admissions decisions (Lawschool and Student); the other two are pertinent to criminal justice and credit assessment: one predicts rates of community crime (Communities), and the other attempts to predict income level (Adult). For each of these datasets we used the same cleaning procedure as this previous work, making use of their repository (available here: github.com/algowatch-penn/GerryFair).
Table 1. Properties of the datasets used in this study.

| Dataset | Source | Outcome | Samples | Features | Sensitive features | Protection types | Number of simple groups (G_m) |
|---|---|---|---|---|---|---|---|
| Communities | UCI | Crime rates | 1994 | 122 | 18 | race, ethnicity, nationality | 1563 |
| Adult | Census | Income | 2020 | 98 | 7 | age, race, sex | 78 |
| Lawschool | ERIC | Bar passage | 1823 | 17 | 4 | race, income, age, gender | 47 |
| Student | Secondary Schools | Achievement | 395 | 43 | 5 | sex, age, relationship status, alcohol consumption | 22 |
We compared eight different modeling approaches in our study, the parameters of which are shown in Table 2. Here we briefly describe the two main algorithms that are used.
GerryFair
First, we used the "Fictitious Play" algorithm from (Kearns et al., 2017, 2018), trained for 100 iterations at 100 different levels of γ, which controls the trade-off between error and fairness. As mentioned earlier, GerryFair treats the problem of learning a fair classifier as a two-player game in which one player, the classifier, attempts to minimize error over weighted training samples, and the other player, the auditor, attempts to find the subgroup within the classifier's predictions that produces the largest fairness violation. Play continues for the maximum number of iterations or until the maximum fairness violation is less than γ. The final learned classifier is an ensemble of linear, cost-sensitive classification models. We make use of the auditor for validating the predictions of all compared models, so it is described in more detail in Section 4.1.
FEAT
Our GP experiments are carried out using the Feature Engineering Automation Tool (FEAT), a GP method in which each individual model consists of a set of programs (i.e., engineered features) that are fed into a logistic regression model (see Figure 2). This allows FEAT to learn a feature space for a logistic regression model, where the number of features is determined via the search process. The features are composed of continuous and boolean functions, including common neural network activation functions, as shown in Table 1 of (La Cava et al., 2019). We chose FEAT for this experiment because it performed well in comparison to other state-of-the-art GP methods on a battery of regression tests (La Cava and Moore, 2019). FEAT is also advantageous in this application to binary classification because it can be paired with logistic regression, which provides probabilistic outputs for classification. These probabilities are necessary for assessing model performance using certain measures such as the average precision score, as we describe later in Eqn. 5.

FEAT trains models according to a common evolutionary strategy. This strategy begins with the construction of models, followed by the selection of parents. The parents are used to produce offspring via mutation and crossover. Depending on the method used, parents and offspring may then compete in a survival step (as in NSGA2), or the offspring may replace the parents (LEX, FLEX). For further details of FEAT we refer the reader to (La Cava and Moore, 2020) and to the github project (github.com/lacava/feat).
We test six different selection/survival methods for FEAT, shown in Table 2. FLEX-NSGA2 is a hybrid of FLEX and NSGA2 in which selection for parents is conducted using FLEX and survival is conducted using the survival step of NSGA2. Each GP method was trained for 100 generations with a population of 100, except for Random, which returns the initial population. These parameters were chosen to approximately match those of GerryFair, and to produce the same number of final models (100). However, since the GP methods are population-based, they train 100 models per generation (except Random), whereas GerryFair only trains two models per iteration (the classifier and the auditor); thus, to a first approximation, we should expect the GP methods aside from Random to require roughly 50 times more computation.
In our experiments, we run 50 repeat trials of each method on each dataset, in which we split the data 50/50 into training and test sets. For each trial, we train models by each method, and then generate predictions on the test set over each returned model. Each trial is run on a single core in a heterogeneous cluster environment, consisting mostly of 2.6GHz processors with a maximum of 8 GB of RAM.
There are inherent tradeoffs between notions of fairness and accuracy that make it difficult to pick a definitive metric by which to compare models (Kleinberg et al., 2016). We compute several metrics of comparison, defined below.
4.1. Auditing Subgroup Fairness
In order to get a richer measure of subgroup fairness for evaluating classifiers, Kearns et al. (2018) developed an auditing method that we employ here for validating classifiers. The auditor uses cost-sensitive classification to estimate the group that most severely violates a fairness measure they propose, which we refer to as a subgroup FP or FN violation. We can define this relative to FP rates as

    FP-Violation(g) = Pr[g(x′) = 1] · |FP(c) − FP(c, g)|    (3)

Here, the probability is taken over the distribution from which the data are drawn. In Eqn. 3, Pr[g(x′) = 1] is estimated by the fraction of samples that group g covers, so that larger groups are more highly weighted; |FP(c) − FP(c, g)| measures fairness analogously to Eqn. 1. This metric can be defined equivalently for FN subgroup violations, and we report both measures in our experiments. The auditing algorithm's objective is to return an estimate of the group with the highest FP- or FN-Violation, and this violation is used as a measure of classifier unfairness.
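The violation in Eqn. 3 is easy to compute for an explicit list of candidate groups. The sketch below is a brute-force illustration with names of our choosing; the actual auditor instead searches for the worst group via cost-sensitive classification, since enumerating all rich subgroups is intractable:

```python
import numpy as np

def fp_rate(y_true, y_pred, mask=None):
    """FP rate, optionally restricted to a group mask; 0 if no negatives."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    if mask is not None:
        y_true, y_pred = y_true[mask], y_pred[mask]
    neg = (y_true == 0)
    return float(np.mean(y_pred[neg])) if neg.any() else 0.0

def fp_violation(y_true, y_pred, group_mask):
    """Eqn. 3, with Pr[g] estimated by the fraction of samples g covers."""
    w = float(np.mean(group_mask))
    return w * abs(fp_rate(y_true, y_pred) - fp_rate(y_true, y_pred, group_mask))

def worst_fp_violation(y_true, y_pred, groups):
    """Unfairness estimate: the largest violation over candidate groups."""
    return max(fp_violation(y_true, y_pred, g) for g in groups)
```

Note how the coverage weight damps violations on tiny groups, whose empirical FP rates are noisy.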
4.2. Measures of Accuracy
In order to compare the accuracy of the classifiers, we used two measures. The first is accuracy, defined as

    accuracy(c) = (1/N) Σ_{i=1}^{N} 1(ŷ_i = y_i)    (4)
The second is the average precision score (APS), which is the mean precision of the model at different classification thresholds t, weighted by the change in recall. (This is a pessimistic version of estimating the area under the precision-recall curve. See https://scikit-learn.org/stable/modules/generated/sklearn.metrics.average_precision_score.html.) APS is defined as

    APS = Σ_t (R_t − R_{t−1}) · P_t    (5)

where R_t is the recall and P_t is the precision of the classifier at threshold t.
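Eqn. 5 can be implemented in a few lines; the sketch below follows the same step-wise definition as scikit-learn's `average_precision_score`, with each distinct score treated as a threshold (function name is ours):

```python
import numpy as np

def average_precision(y_true, scores):
    """Eqn. 5: APS = sum_t (R_t - R_{t-1}) * P_t over descending score thresholds."""
    order = np.argsort(-np.asarray(scores, dtype=float))
    y = np.asarray(y_true)[order]
    tp = np.cumsum(y)                          # true positives at each threshold
    precision = tp / np.arange(1, len(y) + 1)  # P_t
    recall = tp / max(int(tp[-1]), 1)          # R_t (guard against no positives)
    dr = np.diff(np.concatenate([[0.0], recall]))
    return float(np.sum(dr * precision))

print(average_precision([1, 0], [0.9, 0.1]))  # 1.0 (positive ranked first)
```

Because recall only increases at positives, only thresholds that capture a positive contribute to the sum.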
4.3. Comparing AccuracyFairness Tradeoffs
It is well known that there is a fundamental trade-off between the different notions of fairness described here and classifier accuracy (Hardt et al., 2016; Berk et al., 2018; Kleinberg et al., 2016). For this reason, recent work has focused on comparing the Pareto front of solutions between methods (Kearns et al., 2018). For GerryFair, this trade-off is controlled via the γ parameter described in Table 2. For the GP methods, we treat the final population as the solution set to be evaluated.
In order to compare sets of solutions between methods, we compute the hypervolume of the Pareto front (Fonseca et al., 2006) between competing pairs of accuracy objectives (Accuracy, APS) and fairness objectives (FP Subgroup Violation, FN Subgroup Violation). This results in four hypervolume measures of comparison. For two objectives, the hypervolume provides an estimate of the area of the objective space that is covered/dominated by a set of solutions. Thus, the hypervolume allows us to compare how well each method is able to characterize the fairnessaccuracy tradeoff (Chand and Wagner, 2015).
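For two minimized objectives, the hypervolume reduces to a sum of rectangle areas between consecutive Pareto-front points and a reference point. The paper uses the dimension-sweep algorithm of Fonseca et al. (2006); the naive 2-D version below is only for illustration, with names and the reference point (1, 1) chosen by us:

```python
def hypervolume_2d(points, ref):
    """Area dominated by (error, unfairness) points (both minimized),
    bounded above by a reference point such as ref = (1.0, 1.0)."""
    # extract the Pareto front: sort by objective 1, keep strict
    # improvements in objective 2
    front, best2 = [], float("inf")
    for a, b in sorted(points):
        if b < best2:
            front.append((a, b))
            best2 = b
    # sum the disjoint rectangles between front points and the reference
    hv, prev_b = 0.0, ref[1]
    for a, b in front:
        hv += (ref[0] - a) * (prev_b - b)
        prev_b = b
    return hv
```

For example, the two non-dominated points (0.2, 0.5) and (0.5, 0.2) with reference (1.0, 1.0) dominate an area of 0.55, and adding a dominated point leaves the value unchanged.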
Table 2. Algorithm settings.

| Algorithm | Settings |
|---|---|
| GerryFair (Kearns et al., 2017) | iterations = 100, γ: 100 values, ml = logistic regression |
| GerryFairGB | as above, ml = gradient boosting |
| FEAT (La Cava et al., 2019) | generations = 100, pop size = 100, max depth = 6, max dim = 20 |
| Tourn | selection: size-2 tournament selection |
| LEX (La Cava et al., 2016) | selection: lexicase selection |
| FLEX (Alg. 1) | selection: fair lexicase selection |
| NSGA2 (Deb et al., 2000) | NSGA2 selection and survival |
| FLEXNSGA2 | selection: fair lexicase (FLEX), survival: NSGA2 |
| Random | return initial random population |
5. Results
In Figure 3, we show the distributions of the hypervolume of the FP Violation-APS Pareto front across trials and problems for each method. Each subplot shows the test results for each method on a single dataset, with larger values indicating better performance. In general, we observe that the GP-based approaches do quite well compared to GerryFair in terms of finding good trade-offs along the Pareto front. Every GP variant generates a higher median hypervolume measure than GerryFair and GerryFairGB on every problem.
Among GP variants, we observe that Random, LEX and FLEX tend to produce the highest hypervolume measures. Random search works best on the Communities and Student datasets; LEX performs best on Adult, and there is a virtual tie between Random, LEX and FLEX on Lawschool. NSGA2, FLEXNSGA2 and Tourn all perform similarly and generally worse than Random, LEX and FLEX.
The hypervolume performance results are further summarized across problems in Figure 4. Here, each subplot shows the distribution of rankings according to a different hypervolume measurement, shown on the y-axis. The significance of pairwise Wilcoxon tests between methods is shown as asterisks between bar plots. Since all pairwise comparisons are cumbersome to show, the complete pairwise Wilcoxon tests for FP Violation-APS hypervolume are given in Table 3, corresponding to the bottom right subplot of Figure 4.
In general, the differences in performance between methods are significant. We observe that Random search, which has the best rankings across hypervolume measures, significantly outperforms all methods but LEX across problems. LEX and FLEX differ significantly in only one comparison, and the effect size is noticeably small. In addition, Tourn and NSGA2 are not significantly different, while NSGA2 and FLEXNSGA2 are significantly different for two of the four measures.
Since the hypervolume measures give only a coarse-grained view of the Pareto fronts of solutions, we plot the Pareto fronts of specific trials of each method on two problems in Figures 5 and 6. The first figure shows results for the Adult problem, and presents a typical solution set for this problem. It is noteworthy that, despite having 100 models produced by each method, only a fraction of these models produce Pareto-efficient sets on the test data. The small number of Pareto-optimal models under test evaluation suggests that most classifiers are overfit to the training data to some degree, in terms of error rate, unfairness, or both. We also find it interesting that the combined front of solutions to this problem includes models from six different methods. In this way we see the potential for generating complementary, Pareto-optimal models from distinct methods.
By contrast, models for the Student dataset shown in Figure 6 are dominated by one method: Random search. Random produces high hypervolume measures for this problem compared to other methods, and the Pareto fronts in this figure show an example: in this case, Random is able to find three Pareto-optimal classifiers with very low error (high APS) and very low unfairness. These three models dominate all other solutions found by the other methods.
Each method is evaluated on a single core, and the wall clock times of these runs are shown in Figure 7. Random is the quickest to train, followed by the two GerryFair variants. Compared to the generational GP methods, GerryFair exhibits runtimes that are between 2 and 5 times faster. Interestingly, the NSGA2 runs finish most quickly among the GP methods. This suggests that NSGA2 may be biased toward smaller models during optimization.
Table 3. p-values of pairwise Wilcoxon tests for FP Violation-APS hypervolume.

| | FLEX | FLEXNSGA2 | GerryFair | GerryFairGB | LEX | NSGA2 | Random |
|---|---|---|---|---|---|---|---|
| FLEXNSGA2 | 2.6e-16 | | | | | | |
| GerryFair | 1.3e-30 | 7.4e-15 | | | | | |
| GerryFairGB | 4.2e-27 | 9.1e-10 | 3.1e-02 | | | | |
| LEX | 1.5e-01 | 1.7e-20 | 7.9e-32 | 9.5e-27 | | | |
| NSGA2 | 1.7e-09 | 5.7e-03 | 5.6e-21 | 1.9e-14 | 2.1e-13 | | |
| Random | 1.7e-02 | 1.3e-22 | 5.9e-32 | 3.5e-27 | 1.0e+00 | 3.7e-18 | |
| Tourn | 3.5e-06 | 2.3e-03 | 5.2e-20 | 7.3e-16 | 3.7e-11 | 1.0e+00 | 2.4e-12 |
6. Discussion and Conclusion
The purpose of this work is to propose and evaluate methods for training fair classifiers using GP. We proposed two main ideas: first, to incorporate a fairness objective into NSGA2, and second, to modify lexicase selection to operate over subgroups of the protected attributes at each case, rather than on the raw samples. We evaluated these proposals relative to baseline GP approaches, including tournament selection, lexicase selection, and random search, and relative to a game-theoretic approach from the literature. In general we found that the GP-based methods perform quite well in terms of the hypervolume dominated by the Pareto front of accuracy-fairness trade-offs they generate. An additional advantage of this family of methods is that they may generate intelligible models, due to their symbolic nature. However, the typical evolutionary strategies used by GP did not perform significantly better than randomly generated models, except on one tested problem.
Our first idea, incorporating a marginal fairness objective into NSGA2, did not result in model sets that were better than tournament selection. This suggests that the marginal fairness objective (Eq. 2) does not, in and of itself, produce model sets with better subgroup fairness (Eq. 3). An obvious next step would be to incorporate the auditor (Section 4.1) into NSGA2 in order to explicitly minimize the subgroup fairness violation. The downside to this is its computational complexity, since it would require an additional iteration of model training per individual per generation.
Our proposal to modify lexicase selection by regrouping cases in order to promote fairness over subgroups (FLEX) did not significantly change the performance of lexicase selection. It appeared to improve performance on one dataset (Adult), worsen performance on another (Communities), and overall did not perform significantly differently from LEX. This comparison is overshadowed by the performance of random search on these datasets, which gave comparable, and occasionally better, performance than LEX and FLEX at a fraction of the computational cost.
In light of these results, we want to understand why random search is so effective on these problems. There are several possible avenues of investigation. For one, the stability of the unfairness estimate provided by the auditor on training and test sets should be understood, since the group experiencing the largest fairness violation may differ between the two. This difference may make the fairness violation differ dramatically on test data. Unlike typical uses of Pareto optimization in GP literature that seek to control some static aspect of the solution (e.g., its complexity), in application to fairness, the risk of overfitting exists for both objectives. Therefore the robustness of Pareto optimal solutions may suffer. In addition, the study conducted here considered a small number of small datasets, and it is possible that a larger number of datasets would reveal new and/or different insights.
The field of fairness in ML is nascent but growing quickly, and addresses very important societal concerns. Recent results show that the problems of both learning and auditing classifiers for rich subgroup fairness are computationally hard in the worst case. This motivates the analysis of heuristic algorithms such as GP for building fair classifiers. Our experiments suggest that GPbased methods can perform competitively with methods designed specifically for handling fairness. We hope that this motivates further inquiry into incorporating fairness constraints into models using randomized search heuristics such as evolutionary computation.
7. Supplemental Material
The code to reproduce our experiments is available from https://github.com/lacava/fair_gp.
8. Acknowledgments
The authors would like to thank the Warren Center for Data Science and the Institute for Biomedical Informatics at Penn for their discussions. This work is supported by National Institutes of Health grants K99 LM012926-02, R01 LM010098, and R01 AI116794.
References
- A Reductions Approach to Fair Classification. In International Conference on Machine Learning, pp. 60–69.
- A convex framework for fair regression. arXiv:1706.02409.
- Fairness in Criminal Justice Risk Assessments: The State of the Art. Sociological Methods & Research.
- Evolutionary many-objective optimization: A quick-start guide. Surveys in Operations Research and Management Science 20 (2), pp. 35–42.
- The Frontiers of Fairness in Machine Learning. arXiv:1810.08810.
- Algorithmic decision making and the cost of fairness. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 797–806.
- A Fast Elitist Non-dominated Sorting Genetic Algorithm for Multi-objective Optimization: NSGA-II. In Parallel Problem Solving from Nature, PPSN VI, Vol. 1917, pp. 849–858.
- Fairness through awareness. In Proceedings of the 3rd Innovations in Theoretical Computer Science Conference, pp. 214–226.
- An improved dimension-sweep algorithm for the hypervolume indicator. In 2006 IEEE International Conference on Evolutionary Computation, pp. 1157–1163.
- Potential biases in machine learning algorithms using electronic health record data. JAMA Internal Medicine 178 (11), pp. 1544–1547.
- Equality of Opportunity in Supervised Learning.
- Solving Uncompromising Problems with Lexicase Selection. IEEE Transactions on Evolutionary Computation.
- Lexicase selection of specialists. In Proceedings of the Genetic and Evolutionary Computation Conference, pp. 1030–1038.
- Preventing Fairness Gerrymandering: Auditing and Learning for Subgroup Fairness. arXiv:1711.05144.
- An Empirical Study of Rich Subgroup Fairness for Machine Learning. arXiv:1808.08166.
- The Ethical Algorithm: The Science of Socially Aware Algorithm Design. Oxford University Press.
- Inherent trade-offs in the fair determination of risk scores. arXiv:1609.05807.
- A probabilistic and multi-objective analysis of lexicase selection and epsilon-lexicase selection. Evolutionary Computation, pp. 1–28.
- Semantic variation operators for multidimensional genetic programming. In Proceedings of the 2019 Genetic and Evolutionary Computation Conference, GECCO '19, Prague, Czech Republic. arXiv:1904.08577.
- Learning feature spaces for regression with genetic programming. Genetic Programming and Evolvable Machines.
- Learning concise representations for regression by evolving networks of trees. In International Conference on Learning Representations, ICLR.
- Epsilon-Lexicase Selection for Regression. In Proceedings of the Genetic and Evolutionary Computation Conference 2016, GECCO '16, pp. 741–748.
- Comparison of Semantic-aware Selection Methods in Genetic Programming. In Proceedings of the Companion Publication of the 2015 Annual Conference on Genetic and Evolutionary Computation, GECCO Companion '15, pp. 1301–1307.
- Implications of AI (un)fairness in higher education admissions: the effects of perceived AI (un)fairness on exit, voice and organizational reputation. In Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency, FAT* '20, pp. 122–130.
- Discrimination-aware data mining. In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 560–568.
- Recycling privileged learning and distribution matching for fairness. In Advances in Neural Information Processing Systems, pp. 677–688.
- Assessment of problem modality by differential performance of lexicase selection in genetic programming: a preliminary report. In Proceedings of the Fourteenth International Conference on Genetic and Evolutionary Computation Conference Companion, pp. 401–408.
- Preventing undesirable behavior of intelligent machines. Science 366 (6468), pp. 999–1004.
- Fair Regression for Health Care Spending. arXiv:1901.10566.