The widespread adoption of data-driven automated decision-making software with far-reaching societal impact, e.g., for credit scoring Khandani, recidivism prediction Chouldechova, or hiring tasks Schumann, raises concerns about the fairness properties of these tools Barocas. Several fairness verification and bias mitigation approaches for machine learning (ML) systems have been proposed in recent years, e.g., aghaei19; Grari; roh20; ruoss2020learning; Urban; yurochkin20; Zafar, among others. However, most works focus on neural networks roh20; ruoss2020learning; Urban; yurochkin20 or on group-based notions of fairness Grari; Zafar, e.g., demographic parity dwork2012fairness or equalized odds Hardt. These notions of group-based fairness require some form of statistical parity (e.g., between positive outcomes) for members of different protected groups (e.g., gender or race). On the other hand, they do not provide guarantees for individuals or other subgroups. By contrast, in this paper we focus on individual fairness dwork2012fairness, intuitively meaning that similar individuals in the population receive similar outcomes, and on decision tree ensembles breiman-random-forests; friedman2001greedy, which are commonly used for tabular datasets since they are easily interpretable ML models with high accuracy rates.
We propose an approach for verifying individual fairness of decision tree ensembles, as well as for training tree models which maximize both accuracy and fairness. The approach is based on abstract interpretation CC77; rival-yi, a well-known static program analysis technique, and builds upon a framework for training robust decision tree ensembles called Meta-Silvae gat, which in turn leverages a verification tool for robustness properties of decision trees DBLP:conf/aaai/RanzatoZ20. Our approach is fully parametric on a given underlying abstract domain representing input space regions containing similar individuals. We instantiate it with a product of two abstract domains: (a) the well-known abstract domain of hyper-rectangles (or boxes) CC77, which represents exactly the standard notion of similarity between individuals based on the $\ell_\infty$ distance metric, and does not lose precision for the univariate split decision rules of type $x_i \le k$; and (b) a specific relational abstract domain which accounts for one-hot encoded categorical features.
Our Fairness-Aware Tree Training method, called FATT, is designed as an extension of Meta-Silvae gat, a learning methodology for ensembles of decision trees based on a genetic algorithm which is able to train a decision tree for maximizing both its accuracy and its robustness to adversarial perturbations. We demonstrate the effectiveness of FATT in training accurate and fair models on the standard datasets used in the literature on fairness. Overall, the experimental results show that our fair-trained models are on average between and more fair than naturally trained decision tree ensembles at an average cost of of accuracy. Moreover, it turns out that our tree models are orders of magnitude more compact and thus more interpretable. Finally, we show how our models can be used as “hints” for setting the size and shape hyper-parameters (i.e., maximum depth and minimum number of samples per leaf) when training standard decision tree models. As a result, this hint-based strategy is able to produce models that are about more fair and just about less accurate than standard models.
The works most closely related to ours are by Aghaei et al. aghaei19, Raff et al. Raff and Ruoss et al. ruoss2020learning.
By relying on the mixed-integer optimization learning approach by Bertsimas and Dunn Bertsimas17, Aghaei et al. aghaei19 put forward a framework for training fair decision trees for classification and regression. The experimental evaluation shows that this approach mitigates unfairness as modeled by their notions of disparate impact and disparate treatment, at the price of a significantly higher computational cost of training. Their notion of disparate treatment is distance-based and thus akin to individual fairness with respect to the nearest individuals in a given dataset (e.g., the $k$-nearest individuals). In contrast, we consider individual fairness with respect to the nearest individuals in the input space (thus, also individuals that are not necessarily part of a given dataset).
Raff et al. Raff propose a regularization-based approach for training fair decision trees as well as fair random forests. They consider both group fairness and individual fairness with respect to the $k$-nearest individuals in a given dataset, similarly to Aghaei et al. aghaei19. In their experiments they use a subset of the datasets that we consider in our evaluation (i.e., the Adult, German, and Health datasets). Our fair models have higher accuracy than theirs (i.e., between and ) for all but one of these datasets (i.e., the Health dataset). Interestingly, their models (in particular those with worse accuracy than ours) often have accuracy on par with a constant classifier due to the highly unbalanced label distribution of the datasets (cf. Table 1).
Finally, Ruoss et al. ruoss2020learning have proposed an approach for learning individually fair data representations and training neural networks (rather than decision tree ensembles as we do) that satisfy individual fairness with respect to a given similarity notion. We use the same notions of similarity in our experiments (cf. Section 6.1).
Given an input space $X \subseteq \mathbb{R}^d$ of numerical vectors and a finite set of labels $\mathcal{L}$, a classifier is a function $C : X \to \wp_+(\mathcal{L})$, where $\wp_+(\mathcal{L})$ is the set of nonempty subsets of $\mathcal{L}$, which associates at least one label to every input in $X$. A training algorithm takes as input a dataset $D \subseteq X \times \mathcal{L}$ and outputs a classifier $C_D : X \to \wp_+(\mathcal{L})$ which optimizes some objective function, such as the Gini index or the information gain for decision trees.
Categorical features can be converted into numerical ones through one-hot encoding, where a single feature with $k$ possible distinct categories $\{c_1, \ldots, c_k\}$ is replaced by $k$ new binary features with values in $\{0, 1\}$. Then, each value $c_j$ of the original categorical feature is represented by a bit-value assignment to the new binary features in which the $j$-th feature is set to $1$ (and the remaining $k-1$ binary features are set to $0$).
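As an illustration only, one-hot encoding of a single categorical value can be sketched as follows. The function name `one_hot` and the color categories are hypothetical and are not taken from the FATT implementation.

```python
from typing import List, Sequence


def one_hot(value: str, categories: Sequence[str]) -> List[int]:
    """Encode `value` as len(categories) binary features (hypothetical helper)."""
    if value not in categories:
        raise ValueError(f"unknown category: {value!r}")
    # Exactly one feature is set to 1, all the others are set to 0.
    return [1 if value == c else 0 for c in categories]


colors = ["white", "black", "green"]  # assumed example categories
encoded = one_hot("black", colors)
assert encoded == [0, 1, 0]
assert sum(encoded) == 1
```

Every encoded vector sums to one, which is the invariant that a hyper-rectangle alone cannot express.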
Classifiers can be evaluated and compared through several metrics. Accuracy on a test set is a basic metric: given a ground truth test set $T \subseteq X \times \mathcal{L}$, the accuracy of $C$ on $T$ is $\mathrm{acc}_T(C) \triangleq |\{(x,y) \in T \mid C(x) = \{y\}\}| / |T|$. However, according to a growing belief cacm18, accuracy is not enough in machine learning, since robustness to adversarial inputs of an ML classifier may significantly affect its safety and generalization properties carlini; cacm18. Given an input perturbation modeled by a function $P : X \to \wp(X)$, a classifier $C$ is stable DBLP:conf/aaai/RanzatoZ20 on the perturbation $P(x)$ of $x \in X$ when $C$ consistently assigns the same label(s) to every attack ranging in $P(x)$, i.e.,
$$\mathrm{stable}(C, x, P) \;\Leftrightarrow\; \forall x' \in P(x).\ C(x') = C(x).$$
When the sample $x \in X$ has a ground truth label $y_x \in \mathcal{L}$, robustness of $C$ on $x$ boils down to stability together with correct classification $C(x) = \{y_x\}$.
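We consider standard classification decision trees commonly referred to as CARTs (Classification And Regression Trees) BreimanFOS84. A decision tree $t : X \to \wp_+(\mathcal{L})$ is defined inductively. A base tree $t$ is a single leaf $\lambda$ storing a frequency distribution of labels for the samples of the training dataset, hence $\lambda \in [0,1]^{|\mathcal{L}|}$, or, equivalently, $\lambda : \mathcal{L} \to [0,1]$. Some algorithmic rule converts this frequency distribution into a set of labels, typically as $\arg\max_{y \in \mathcal{L}} \lambda(y)$. A composite tree $t$ is $\gamma(\mathit{split}, t_l, t_r)$, where $\mathit{split} : X \to \{\mathbf{tt}, \mathbf{ff}\}$ is a Boolean split criterion for the internal parent node of its left and right subtrees $t_l$ and $t_r$; thus, for all $x \in X$, $t(x)$ is either $t_l(x)$ or $t_r(x)$, depending on the truth value of $\mathit{split}(x)$. Although split rules can be of any type, most decision trees employ univariate hard splits of type $\mathit{split}(x) \triangleq x_i \le k$ for some feature $i \in [1, d]$ and threshold $k \in \mathbb{R}$.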
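The inductive definition above can be mirrored by a small data structure. The following is a minimal sketch under our own assumptions: the class names `Leaf` and `Node`, the field names `if_true` and `if_false`, and the argmax rule with ties kept are illustrative choices, not the representation used by any specific library.

```python
from dataclasses import dataclass
from typing import Dict, Sequence, Set, Union


@dataclass(frozen=True)
class Leaf:
    freq: Dict[str, float]  # label -> frequency observed in training

    def labels(self) -> Set[str]:
        # argmax rule: keep every label with maximal frequency (ties kept)
        best = max(self.freq.values())
        return {y for y, f in self.freq.items() if f == best}


@dataclass(frozen=True)
class Node:
    feature: int       # index i of the univariate split x_i <= k
    threshold: float   # threshold k
    if_true: "Tree"    # subtree followed when x_i <= k holds
    if_false: "Tree"   # subtree followed otherwise


Tree = Union[Leaf, Node]


def classify(t: Tree, x: Sequence[float]) -> Set[str]:
    """Follow the hard splits from the root down to a leaf."""
    while isinstance(t, Node):
        t = t.if_true if x[t.feature] <= t.threshold else t.if_false
    return t.labels()


tree = Node(0, 2.0, Leaf({"yes": 0.8, "no": 0.2}), Leaf({"yes": 0.5, "no": 0.5}))
assert classify(tree, [1.0]) == {"yes"}
assert classify(tree, [3.0]) == {"yes", "no"}
```

Returning a set of labels rather than a single label matches the classifier type used in the text, where a tied leaf yields more than one label.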
Tree ensembles, also known as forests, are sets of decision trees which together contribute to formulate a unique classification output. Training algorithms as well as methods for computing the final output label(s) vary among different tree ensemble models. Random forests (RFs) breiman-random-forests are a major instance of tree ensemble where each tree of the ensemble is trained independently from the other trees on a random subset of the features. Gradient boosted decision trees (GBDT) friedman2001greedy represent a different training algorithm where an ensemble of trees is incrementally built by training each new tree on the basis of the data samples which are misclassified by the previous trees. For RFs, the final classification output is typically obtained through a voting mechanism (e.g., majority voting), while GBDTs are usually trained for binary classification problems and use some binary reduction scheme, such as one-vs-all or one-vs-one, for multi-class classification.
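Majority voting, the typical aggregation rule for RFs, can be sketched as follows. This is an illustrative helper under the assumption that each tree casts a single vote and that ties are resolved by returning all the tied labels; actual RF implementations may break ties differently.

```python
from collections import Counter
from typing import Iterable, Set


def majority_vote(votes: Iterable[str]) -> Set[str]:
    """Return the most voted label(s); assumes at least one vote."""
    counts = Counter(votes)
    top = max(counts.values())
    return {label for label, n in counts.items() if n == top}


assert majority_vote(["a", "b", "a"]) == {"a"}
assert majority_vote(["a", "b"]) == {"a", "b"}  # tie: both labels are returned
```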
3 Individual Fairness
Dwork et al. dwork2012fairness define individual fairness as “the principle that two individuals who are similar with respect to a particular task should be classified similarly”. They formalize this notion as a Lipschitz condition of the classifier, which requires that any two individuals $x, y \in X$ whose distance is $\delta(x,y)$ map to distributions $D_x$ and $D_y$, respectively, such that the statistical distance between $D_x$ and $D_y$ is at most $\delta(x,y)$. The intuition is that the output distributions for $x$ and $y$ are indistinguishable up to their distance $\delta(x,y)$. The distance metric $\delta : X \times X \to \mathbb{R}_{\ge 0}$ is problem specific and satisfies the basic axioms $\delta(x,y) = \delta(y,x)$ and $\delta(x,x) = 0$.
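Following the standard definition of Dwork et al. dwork2012fairness, we consider a classifier $C : X \to \wp_+(\mathcal{L})$ to be fair when $C$ outputs the same set of labels for every pair of individuals $x, y \in X$ which satisfy a similarity relation $S \subseteq X \times X$. Thus, $S$ can be derived from a distance $\delta$ as $(x,y) \in S \Leftrightarrow \delta(x,y) \le \epsilon$, where $\epsilon \in \mathbb{R}$ is a threshold of similarity. In order to estimate a fairness metric for a classifier $C$, we count how often $C$ is fair on sets of similar individuals ranging into a test set $T \subseteq X \times \mathcal{L}$:
$$\mathrm{fair}_{T,S}(C) \triangleq \frac{|\{(x,y) \in T \mid \mathrm{fair}(C,x,S)\}|}{|T|}$$
where $\mathrm{fair}(C,x,S)$ is defined as follows:
Definition 3.1 (Individual Fairness).
A classifier $C$ is fair on an input sample $x \in X$ with respect to a similarity relation $S \subseteq X \times X$, denoted by $\mathrm{fair}(C,x,S)$, when $\forall x' \in X.\ (x,x') \in S \Rightarrow C(x') = C(x)$. ∎
Hence, fairness for a similarity relation $S$ boils down to stability on the perturbation $P_S(x) \triangleq \{x' \in X \mid (x,x') \in S\}$, namely, for all $x \in X$,
$$\mathrm{fair}(C,x,S) \;\Leftrightarrow\; \mathrm{stable}(C,x,P_S) \qquad (2)$$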
Let us remark that fairness is orthogonal to accuracy since it does not depend on the correctness of the label assigned by the classifier, so that training algorithms that maximize accuracy-based metrics do not necessarily achieve fair models. In particular, this is the case of a natural learning algorithm for CART trees and RFs, which locally optimizes split criteria by measuring entropy or Gini impurity, both of which are indicators of the correct classification of training data.
It is also worth observing that fairness is monotonic with respect to the similarity relation, meaning that
$$\mathrm{fair}(C,x,S) \;\wedge\; S' \subseteq S \;\Rightarrow\; \mathrm{fair}(C,x,S') \qquad (3)$$
We will exploit this monotonicity property, since it implies that, on the one hand, fair classification is preserved for smaller similarity relations and, on the other hand, fairness verification and fair training are more challenging for larger similarity relations.
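The fairness metric and its monotonicity can be illustrated by a small empirical sketch. Note the assumption: the definition quantifies over the whole input space, whereas this sketch only inspects a finite pool of candidate individuals, so it can find counterexamples to fairness but cannot prove it (that is the role of the verifier described in the next section). All names (`fair_on`, `fairness_score`, `noise`, `threshold_clf`) and all numeric values are hypothetical.

```python
from typing import Callable, FrozenSet, Sequence, Tuple

Vector = Tuple[float, ...]
Classifier = Callable[[Vector], FrozenSet[str]]
Similarity = Callable[[Vector, Vector], bool]


def fair_on(clf: Classifier, x: Vector, similar: Similarity,
            candidates: Sequence[Vector]) -> bool:
    """No candidate similar to x receives a different output than x."""
    return all(clf(z) == clf(x) for z in candidates if similar(x, z))


def fairness_score(clf: Classifier, test_inputs: Sequence[Vector],
                   similar: Similarity, candidates: Sequence[Vector]) -> float:
    """Fraction of test inputs on which clf is (empirically) fair."""
    return sum(fair_on(clf, x, similar, candidates) for x in test_inputs) / len(test_inputs)


def noise(eps: float) -> Similarity:
    """Similarity derived from the l-infinity distance with threshold eps."""
    return lambda x, z: all(abs(a - b) <= eps for a, b in zip(x, z))


def threshold_clf(x: Vector) -> FrozenSet[str]:
    return frozenset({"pos"}) if x[0] > 0.5 else frozenset({"neg"})


test_inputs = [(0.0,), (0.45,), (0.7,)]
candidates = [(i / 20,) for i in range(21)]  # grid over [0, 1]
small = fairness_score(threshold_clf, test_inputs, noise(0.12), candidates)
large = fairness_score(threshold_clf, test_inputs, noise(0.32), candidates)
assert small >= large  # a larger similarity relation can only lower fairness
```

In this toy setting the score drops from 2/3 to 1/3 when the threshold grows from 0.12 to 0.32, in line with the monotonicity property.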
4 Verifying Fairness
As individual fairness is equivalent to stability, individual fairness of decision trees can be verified by Silva DBLP:conf/aaai/RanzatoZ20, an abstract interpretation-based algorithm for checking stability properties of decision tree ensembles.
4.1 Verification by Silva
Silva performs a static analysis of an ensemble of decision trees in a so-called abstract domain $A$ that approximates properties of real vectors, meaning that each abstract value $a \in A$ represents a set of real vectors $\gamma(a) \in \wp(\mathbb{R}^d)$. Silva approximates an input region $P(x) \in \wp(\mathbb{R}^d)$ for an input vector $x \in \mathbb{R}^d$ by an abstract value $a \in A$ such that $P(x) \subseteq \gamma(a)$ and for each decision tree $t$, it computes an over-approximation of the set of leaves of $t$ that can be reached from some vector in $\gamma(a)$. This is computed by collecting the constraints of split nodes for each root-leaf path, so that each leaf $\lambda$ of $t$ stores the minimum set of constraints $C_\lambda$ which makes $\lambda$ reachable from the root of $t$. It is then checked if this set of constraints $C_\lambda$ can be satisfied by the input abstract value $a \in A$: this check is denoted by $a \models C_\lambda$ and its soundness requirement means that if some input sample $z \in \gamma(a)$ may reach the leaf $\lambda$ then $a \models C_\lambda$ must necessarily hold. When $a \models C_\lambda$ holds the leaf $\lambda$ is marked as reachable from $a$. For example, if then an abstract value such as satisfies while a relational abstract value such as does not. This over-approximation of the set of leaves of $t$ reachable from $a$ allows us to compute a set of labels, denoted by $t^A(a) \in \wp_+(\mathcal{L})$, which is an over-approximation of the set of labels assigned by $t$ to all the input vectors ranging in $\gamma(a)$, i.e., $\bigcup_{z \in \gamma(a)} t(z) \subseteq t^A(a)$ holds. Thus, if $P(x) \subseteq \gamma(a)$ and $t^A(a) = t(x)$ then $t$ is stable on $P(x)$.
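For standard classification trees with hard univariate splits of type $x_i \le k$, we will use the well-known hyper-rectangle abstract domain $\mathrm{HR}$ whose abstract values for vectors $x \in \mathbb{R}^d$ are of type
$$h = \langle x_i \in [l_i, u_i] \rangle_{i=1}^{d}$$
where lower and upper bounds $l_i, u_i \in \mathbb{R} \cup \{-\infty, +\infty\}$ with $l_i \le u_i$ (more on this abstract domain can be found in rival-yi). Thus, $\gamma(h) = \{x \in \mathbb{R}^d \mid \forall i.\ l_i \le x_i \le u_i\}$. The hyper-rectangle abstract domain guarantees that for each leaf constraint $C_\lambda$ and $h \in \mathrm{HR}$, the check $h \models C_\lambda$ is (sound and) complete, meaning that $h \models C_\lambda$ holds iff there exists some input sample in $\gamma(h)$ reaching $\lambda$. This completeness property therefore entails that the set of labels $t^{\mathrm{HR}}(h)$ computed by this analysis coincides exactly with the set of classification labels computed by $t$ for all the samples in $\gamma(h)$, so that for the $\ell_\infty$-based perturbation $P_\infty$ such that $P_\infty(x) = \gamma(h)$, it turns out that $t$ is stable on $P_\infty(x)$ iff $t^{\mathrm{HR}}(h) = t(x)$ holds.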
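The reachability analysis on hyper-rectangles can be sketched in a few lines. This is not Silva's implementation: the tuple-based tree encoding, the function names, and the example tree are assumptions made for illustration. The sketch keeps closed intervals on both branches, which is always sound; it is exact unless a path tests the same feature against the same threshold twice.

```python
from typing import List, Sequence, Set, Tuple

Box = List[Tuple[float, float]]  # closed interval [l_i, u_i] per feature
# Assumed tree encoding: a leaf is a frozenset of labels,
# a node is a tuple (feature, threshold, if_true, if_false) for x_i <= k.


def linf_box(x: Sequence[float], eps: float) -> Box:
    """Hyper-rectangle representing the l-infinity ball of radius eps around x."""
    return [(v - eps, v + eps) for v in x]


def abstract_labels(t, box: Box) -> Set[str]:
    """Labels of all the leaves of t reachable from some point of the box."""
    if isinstance(t, frozenset):
        return set(t)
    i, k, if_true, if_false = t
    lo, hi = box[i]
    labels: Set[str] = set()
    if lo <= k:  # some point of the box satisfies x_i <= k
        labels |= abstract_labels(if_true, box[:i] + [(lo, min(hi, k))] + box[i + 1:])
    if hi > k:   # some point of the box satisfies x_i > k
        labels |= abstract_labels(if_false, box[:i] + [(max(lo, k), hi)] + box[i + 1:])
    return labels


def classify(t, x: Sequence[float]) -> Set[str]:
    while not isinstance(t, frozenset):
        i, k, if_true, if_false = t
        t = if_true if x[i] <= k else if_false
    return set(t)


def is_stable(t, x: Sequence[float], eps: float) -> bool:
    return abstract_labels(t, linf_box(x, eps)) == classify(t, x)


DENY, GRANT = frozenset({"deny"}), frozenset({"grant"})
tree = (0, 0.0, DENY, (1, 1.0, GRANT, DENY))  # hypothetical toy tree
assert is_stable(tree, [0.5, 0.2], 0.3)       # only the "grant" leaf is reachable
assert not is_stable(tree, [0.5, 0.2], 0.6)   # the box now crosses x_0 <= 0
```

Each recursive call refines the box with the split constraint of the branch taken, which corresponds to accumulating the leaf constraints along a root-leaf path.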
In order to analyse a forest $F$ of trees, Silva reduces the whole forest to a single tree $t_F$, by stacking every tree $t \in F$ on top of each other, i.e., each leaf becomes the root of the next tree in $F$, where the ordering of this unfolding operation does not matter. Then, each leaf $\lambda$ of this huge single tree $t_F$ collects all the constraints of the leaves in the path from the root of $t_F$ to $\lambda$. Since this stacked tree $t_F$ suffers from a combinatorial explosion of the number of leaves, Silva deploys a number of optimisation strategies for its analysis. Basically, Silva exploits a best-first search algorithm to look for a pair of input samples in $\gamma(a)$ which are differently labeled, hence showing instability. If one such instability counterexample can be found then instability is proved and the analysis terminates; otherwise stability is proved. Also, Silva allows setting a safe timeout which, when met, stops the analysis and outputs the current sound over-approximation of labels.
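The stacking of trees can be sketched as an exhaustive recursion in which the refined region of a reachable leaf is passed as input region to the next tree. This deliberately omits Silva's best-first search and timeout; the tree encoding (a leaf is a label string, a node is a tuple), the majority-vote aggregation and the example forest are assumptions made for illustration, and the forest is assumed to be nonempty.

```python
from collections import Counter
from typing import FrozenSet, Iterator, List, Set, Tuple

Box = List[Tuple[float, float]]  # closed interval per feature
# Assumed encoding: leaf = label string,
# node = (feature, threshold, if_true, if_false) for the split x_i <= k.


def leaf_regions(t, box: Box) -> Iterator[Tuple[Box, str]]:
    """Yield (refined box, label) for every leaf of t reachable from the box."""
    if isinstance(t, str):
        yield box, t
        return
    i, k, if_true, if_false = t
    lo, hi = box[i]
    if lo <= k:
        yield from leaf_regions(if_true, box[:i] + [(lo, min(hi, k))] + box[i + 1:])
    if hi > k:
        yield from leaf_regions(if_false, box[:i] + [(max(lo, k), hi)] + box[i + 1:])


def forest_outputs(forest, box: Box) -> Set[FrozenSet[str]]:
    """Over-approximate the majority-vote outputs of the forest on the box."""
    outputs: Set[FrozenSet[str]] = set()

    def stack(idx: int, region: Box, votes: List[str]) -> None:
        if idx == len(forest):  # a leaf of the stacked tree has been reached
            counts = Counter(votes)
            top = max(counts.values())
            outputs.add(frozenset(l for l, n in counts.items() if n == top))
            return
        for sub_region, label in leaf_regions(forest[idx], region):
            stack(idx + 1, sub_region, votes + [label])

    stack(0, box, [])
    return outputs


forest = [(0, 0.0, "a", "b"), (0, 1.0, "a", "b"), (1, 0.0, "a", "b")]
stable_box = [(-0.5, 0.8), (-1.0, -0.5)]
unstable_box = [(-0.5, 0.8), (-1.0, 0.5)]
assert forest_outputs(forest, stable_box) == {frozenset({"a"})}
assert len(forest_outputs(forest, unstable_box)) == 2
```

On the first box the first tree alone is unstable, yet the forest is stable because the other two trees outvote it; the forest is reported stable exactly when a single output set is collected.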
4.2 Verification with One-Hot Encoding
As described above, the soundness of Silva guarantees that no true reachable leaf is missed by this static analysis. Moreover, when the input region is defined by the $\ell_\infty$ norm and the static analysis is performed using the abstract domain of hyper-rectangles $\mathrm{HR}$, Silva is also complete, meaning that no false positive (i.e., a false reachable leaf) can occur. However, this is no longer true when dealing with classification problems involving some categorical features.
[Figure: a toy forest consisting of two decision trees.]
The diagram above depicts a toy forest consisting of two trees and , where left/right branches are followed when the split condition is false/true. Here, a categorical feature is one-hot encoded by . Since colors are mutually exclusive, every white individual in the input space, i.e. , will be labeled as by both trees. However, by running the stability analysis on the hyper-rectangle , Silva would mark the classifier as unstable because there exists a sample in whose output is . This is due to the point which is a feasible input sample for the analysis, although it does not represent any actual individual in the input space. In fact, and , so that by a majority voting , thus making unstable (i.e., unfair) on (i.e., on white individuals).
To overcome this issue, we instantiate Silva to an abstract domain which is designed as a reduced product (more details on reduced products can be found in rival-yi) with a relational abstract domain keeping track of the relationships among the multiple binary features introduced by one-hot encoding a categorical feature. More formally, this relational domain maintains the following two additional constraints on the $k$ features $x^c_1, \ldots, x^c_k$ introduced by one-hot encoding a categorical variable $x^c$ with $k$ distinct values:
(1) the possible values for each $x^c_i$ are restricted to $\{0, 1\}$;
(2) the sum of all $x^c_i$ must satisfy $\sum_{i=1}^{k} x^c_i = 1$.
Hence, these conditions guarantee that any abstract value for represents precisely one possible category for . This abstract domain for a categorical variable with distinct values is denoted by . In the example above, any hyper-rectangle is reduced by , so that just two different values and are allowed.
Summing up, the generic abstract value of the reduced hyper-rectangle domain computed by the analyzer Silva for data vectors consisting of numerical variables and categorical variables with distinct values is:
where and .
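The effect of the two one-hot constraints on a hyper-rectangle can be sketched as follows. This is a simplified illustration and not the reduced product implemented in Silva: the function name `reduce_one_hot` is hypothetical, and the reduction is made explicit by enumerating the one-hot vectors compatible with the given intervals.

```python
from typing import List, Sequence, Tuple


def reduce_one_hot(intervals: Sequence[Tuple[float, float]]) -> List[Tuple[int, ...]]:
    """Valid one-hot vectors lying inside the box over the k binary features."""
    k = len(intervals)
    valid = []
    for j in range(k):
        # Candidate: j-th feature set to 1, all the others set to 0,
        # so both the {0, 1} constraint and the sum-to-one constraint hold.
        point = tuple(1 if i == j else 0 for i in range(k))
        if all(lo <= v <= hi for v, (lo, hi) in zip(point, intervals)):
            valid.append(point)
    return valid


# A box over two one-hot features also contains (0, 0) and (1, 1),
# which encode no actual category and are discarded by the reduction.
assert reduce_one_hot([(0, 1), (0, 1)]) == [(1, 0), (0, 1)]
assert reduce_one_hot([(1, 1), (0, 0)]) == [(1, 0)]
```

With two categories, the unconstrained box is reduced to just two admissible points, so spurious samples that represent no actual individual cannot make the analysis report instability.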
5 FATT: Fairness-Aware Training of Trees
Several algorithms for training robust decision trees and ensembles have been put forward Andriushchenko19; calzavara2019B; calzavara20; ChenZBH19; kantchelian; gat.
The robust learning algorithm of gat, called Meta-Silvae, aims at maximizing a tunable weighted linear combination of accuracy and stability metrics. Meta-Silvae relies on a genetic algorithm for evolving a population of trees which are ranked by their accuracy and stability, where tree stability is computed by resorting to the verifier Silva DBLP:conf/aaai/RanzatoZ20. At the end of this genetic evolution, Meta-Silvae returns the best tree(s). It turns out that Meta-Silvae typically outputs compact models which are easily interpretable and are often accurate and stable already with a single decision tree rather than a forest. By exploiting the equivalence (2) between individual fairness and stability and the instantiation of the verifier Silva to the product abstract domain tailored for one-hot encoding, we use Meta-Silvae as a learning algorithm for decision trees, called FATT, that enhances their individual fairness.
While standard learning algorithms for tree ensembles require tuning some hyper-parameters, such as maximum depth of trees, minimum amount of information on leaves and maximum number of trees, Meta-Silvae is able to infer them automatically, so that the traditional tuning process is not needed. Instead, some standard parameters are required by the underlying genetic algorithm, notably, the size of the evolving population, the maximum number of evolving iterations, the crossover and mutation functions holland1984genetic; Srinivas. Moreover, we need to specify the objective function of FATT that, for learning fair decision trees, is given by a weighted sum of the accuracy and individual fairness scores over the training set. It is worth remarking that, given an objective function, the genetic algorithm of FATT converges to an optimal (or suboptimal) solution regardless of the chosen parameters, which just affect the rate of convergence and therefore should be chosen for tuning its speed.
Crossover and mutation functions are two main distinctive features of the genetic algorithm of Meta-Silvae. The crossover function of Meta-Silvae combines two parent trees $t_1$ and $t_2$ by randomly substituting a subtree of $t_1$ with a subtree of $t_2$. Also, Meta-Silvae supports two types of mutation strategies: grow-only, which only allows trees to grow, and grow-and-prune, which also allows pruning the mutated trees. Finally, let us point out that Meta-Silvae allows setting the basic parameters used by genetic algorithms: population size, selection function, number of iterations. In our instantiation of Meta-Silvae to fair learning: the population size is kept fixed to , as the experimental evaluation showed that this provides an effective balance between achieved fairness and training time; the standard roulette wheel algorithm is employed as selection function; the number of iterations is typically dataset-specific.
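Two ingredients of the genetic algorithm, the weighted objective and roulette wheel selection, can be sketched as follows. This is not the Meta-Silvae code: the function names, the default weight `w_acc = 0.5` and the toy population are assumptions made for illustration only.

```python
import random
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def fitness(accuracy: float, fairness: float, w_acc: float = 0.5) -> float:
    """Weighted sum of accuracy and individual fairness (weight is hypothetical)."""
    return w_acc * accuracy + (1.0 - w_acc) * fairness


def roulette_select(population: Sequence[T], scores: Sequence[float],
                    n: int, rng: random.Random) -> List[T]:
    """Draw n individuals with probability proportional to their fitness."""
    # random.Random.choices samples with replacement using relative weights;
    # it raises ValueError if all the weights are zero.
    return rng.choices(population, weights=scores, k=n)


rng = random.Random(0)  # fixed seed only to make the sketch reproducible
trees = ["t1", "t2", "t3"]  # placeholders standing for candidate trees
scores = [fitness(0.80, 0.60), fitness(0.75, 0.95), fitness(0.50, 0.50)]
parents = roulette_select(trees, scores, n=2, rng=rng)
assert len(parents) == 2 and all(p in trees for p in parents)
```

Selection with replacement means that a fit tree may be drawn more than once as a parent, which is the standard behaviour of roulette wheel selection.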
6 Experimental Evaluation
We consider the main standard datasets used in the fairness literature and we preprocess them by following the steps of Ruoss et al. (ruoss2020learning, Section 5) for their experiments on individual fairness for deep neural networks: (1) standardize numerical attributes to zero mean and unit variance; (2) one-hot encode all categorical features; (3) drop rows/columns containing missing values; and (4) split into train and test set. These datasets concern binary classification tasks, although our fair learning naturally extends to multiclass classification with no specific effort. We will make all the code, datasets and preprocessing pipelines of FATT publicly available upon publication of this work.
The Adult income dataset dua2017uci is extracted from the 1994 US Census database. Every sample assigns a yearly income (below or above $50K) to an individual based on personal attributes such as gender, race, and occupation.
The COMPAS dataset contains data collected on the use of the COMPAS risk assessment tool in Broward County, Florida angwin2016machine. Each sample predicts the risk of recidivism for individuals based on personal attributes and criminal history.
The Communities and Crime dataset dua2017uci contains socio-economic, law enforcement, and crime data for communities within the US. Each sample indicates whether a community is above or below the median number of violent crimes per population.
The German Credit dataset dua2017uci contains samples assigning a good or bad credit score to individuals.
The Heritage Health dataset (https://www.kaggle.com/c/hhp) contains physician records and insurance claims. Each sample predicts the ten-year mortality (above or below the median Charlson index) for a patient.
[Table 1: size and distribution of positive samples of the training set and test set of each dataset.]
Table 1 displays size and distribution of positive samples for these datasets. As noticed by ruoss2020learning, some datasets exhibit a highly unbalanced label distribution. For example, for the adult dataset, the constant classifier would achieve test set accuracy and individual fairness with respect to any similarity relation. Hence, we follow ruoss2020learning and we will evaluate and report the balanced accuracy of our FATT classifiers, i.e., $\frac{1}{2}\left(\frac{TP}{TP+FN} + \frac{TN}{TN+FP}\right)$, where $TP$, $FN$, $TN$ and $FP$ denote the numbers of true positives, false negatives, true negatives and false positives, respectively.
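Balanced accuracy for binary classification can be computed as in the following sketch, which also shows why it penalizes a constant classifier on unbalanced data. The function name and the toy labels are ours; the sketch assumes that both classes occur in the ground truth.

```python
from typing import Sequence


def balanced_accuracy(y_true: Sequence[int], y_pred: Sequence[int],
                      positive: int = 1) -> float:
    """Mean of the true positive rate and the true negative rate."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("both classes must occur in y_true")
    return 0.5 * (tp / (tp + fn) + tn / (tn + fp))


y_true = [0, 0, 0, 1]      # toy unbalanced ground truth
constant = [0, 0, 0, 0]    # constant classifier always predicting 0
plain = sum(t == p for t, p in zip(y_true, constant)) / len(y_true)
assert plain == 0.75
assert balanced_accuracy(y_true, constant) == 0.5
```

The constant classifier reaches a plain accuracy of 0.75 on this toy data, while its balanced accuracy is 0.5, i.e., no better than chance.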
6.1 Similarity Relations
We consider four different types of similarity relations, as described by Ruoss et al. (ruoss2020learning, Section 5.1). In the following, let $I \subseteq \mathbb{N}$ denote the set of indexes of features of an individual after one-hot encoding.
noise: Two individuals are similar when a subset of their (standardized) numerical features indexed by a given subset $I' \subseteq I$ differ by at most a given threshold $\epsilon \ge 0$, while all the other features are unchanged: $(x,y) \in S_{noise}$ iff $|x_i - y_i| \le \epsilon$ for all $i \in I'$, and $x_i = y_i$ for all $i \in I \setminus I'$. For our experiments, we consider in the standardized input space, e.g., for adult two individuals are similar if their age difference is at most 3.95 years.
cat: Two individuals are similar if they are identical except for one or more categorical sensitive attributes indexed by $I' \subseteq I$: $(x,y) \in S_{cat}$ iff $x_i = y_i$ for all $i \in I \setminus I'$. For adult and german, we select the gender attribute. For compas, we identify race as sensitive attribute. For crime, we consider two individuals similar regardless of their state. Lastly, for health, neither gender nor age group should affect the final prediction.
noise-cat: Given noise and categorical similarity relations $S_{noise}$ and $S_{cat}$, their union $S_{noise\text{-}cat} \triangleq S_{noise} \cup S_{cat}$ models a relation where two individuals are similar when some of their numerical attributes differ up to a given threshold while the other attributes are equal except some categorical features.
conditional-attribute: Here, similarity is a disjunction of two mutually exclusive cases. Consider a numerical attribute $x_i$, a threshold $\tau \ge 0$ and two noise similarities $S_{n_1}, S_{n_2}$. Two individuals are defined to be similar if their $i$-th attributes are similar for $S_{n_1}$ and are bounded by $\tau$, or these attributes are above $\tau$ and similar for $S_{n_2}$: $S_{cond} \triangleq \{(x,y) \in S_{n_1} \mid x_i \le \tau,\ y_i \le \tau\} \cup \{(x,y) \in S_{n_2} \mid x_i > \tau,\ y_i > \tau\}$. For adult, we consider the median age as threshold , and two noise similarities based on age with thresholds and , which correspond to age differences of and years respectively. For german, we also consider the median age and the same noise similarities on age, that correspond to age differences of and years.
Note that our approach is not limited to supporting these similarity relations. Further domain-specific similarities can be defined and handled by our approach by instantiating the underlying verifier Silva with an appropriately over-approximating abstract domain to retain soundness. Moreover, if the similarity relation can be precisely represented in the chosen abstract domain, we also retain completeness.
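For concreteness, three of the similarity relations can be written as executable predicates on pairs of individuals. This is a sketch under our own assumptions: the function names, the feature indexes and all numeric values are hypothetical, and individuals are assumed to be already standardized and one-hot encoded.

```python
from typing import Sequence, Set


def noise_similar(x: Sequence[float], y: Sequence[float],
                  noisy: Set[int], eps: float) -> bool:
    """Features indexed by `noisy` differ by at most eps, the others are equal."""
    return all(abs(a - b) <= eps if i in noisy else a == b
               for i, (a, b) in enumerate(zip(x, y)))


def cat_similar(x: Sequence[float], y: Sequence[float], sensitive: Set[int]) -> bool:
    """Identical except possibly for the features indexed by `sensitive`."""
    return all(a == b for i, (a, b) in enumerate(zip(x, y)) if i not in sensitive)


def conditional_similar(x: Sequence[float], y: Sequence[float], i: int, tau: float,
                        eps_low: float, eps_high: float) -> bool:
    """Noise on feature i with a threshold that depends on the side of tau."""
    if x[i] <= tau and y[i] <= tau:
        return noise_similar(x, y, {i}, eps_low)
    if x[i] > tau and y[i] > tau:
        return noise_similar(x, y, {i}, eps_high)
    return False  # the two individuals lie on different sides of tau


# Feature 0: standardized numerical attribute; features 1-2: one-hot sensitive attribute.
x, y, z = (0.0, 1, 0), (0.2, 1, 0), (0.0, 0, 1)
assert noise_similar(x, y, {0}, eps=0.3)
assert not noise_similar(x, z, {0}, eps=0.3)
assert cat_similar(x, z, {1, 2})
assert conditional_similar(x, y, 0, tau=0.5, eps_low=0.3, eps_high=0.1)
```

A noise predicate of this form corresponds to a hyper-rectangle around the individual, which is why it can be represented exactly in the abstract domain used for verification.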
Our experimental evaluation compares CART trees and Random Forests with our FATT tree models. CARTs and RFs are trained by scikit-learn. We first run a preliminary phase for tuning the hyper-parameters for CARTs and RFs. In particular, we considered both entropy and Gini index as split criteria, and we checked maximum tree depths ranging from to with step . For RFs, we scanned the maximum number of trees ( to , step ). Cross validation inferred the optimal hyper-parameters, where the datasets have been split in training and validation sets. The hyper-parameters of FATT (i.e., weights of accuracy and fairness in the objective function, type of mutation, selection function, number of iterations) have been chosen by assessing convergence speed, maximum fitness value and variance among fitness in the population during the training phase. FATT trained single decision trees rather than forests, thus providing more compact and interpretable models. It turned out that accuracy and fairness of single FATT trees are already competitive, where individual fairness may exceed for the most challenging similarities. We therefore concluded that ensembles of FATT trees do not introduce statistically significant benefits over single decision trees. Since the training of FATT trees is stochastic, as it relies on random seeds, each experimental test has been repeated 1000 times and the results refer to their median value.
[Table 2: accuracy (%), balanced accuracy (%), and individual fairness (%) of RF and FATT models.]
Table 2 shows a comparison between RFs and FATTs. We show accuracy, balanced accuracy and individual fairness with respect to the noise, cat, and noise-cat similarity relations as computed on the test sets . As expected, FATT trees are slightly less accurate than RFs, on average, which is also reflected in the balanced accuracy, but outperform them in every fairness test. On average, the fairness increment ranges between to among different similarity relations. Table 3 shows the comparison for the conditional-attribute similarity, which applies to the adult and german datasets only. Here, the average fairness increase of FATT models is .
[Table 3: individual fairness (%) with respect to the conditional-attribute similarity.]
Fig. 1 shows the distribution of accuracy and individual fairness for FATT trees over 1000 runs of the FATT learning algorithm. This plot is for fairness with respect to noise-cat similarity, as this is the most challenging relation to train for (this is a consequence of (3)). We can observe a stable behaviour for accuracy, with of the observations lying within one percentile from the median. The results for fairness are analogous, although for compas we report a higher variance of the distribution, where the lowest observed fairness percentage is higher than the corresponding one for RFs. We conjecture that this may depend on the high number of features in the dataset, which makes fair training a challenging task.
[Table 4: model size and average verification time per sample (ms) of RF and FATT models.]
Table 4 compares the size of RF and FATT models, defined as total number of leaves. It turns out that FATT tree models are orders of magnitude smaller and, thus, more interpretable than RFs (while having comparable accuracy and significantly enhanced fairness). Let us also remark that the average verification time per sample for our FATT models is always less than milliseconds.
[Table 5: accuracy (%), fairness (%), and size of FATT, natural CART, and hinted CART models for each dataset.]
Finally, Table 5 compares FATT models with natural CART trees in terms of accuracy, size, and fairness with respect to the noise-cat similarity. While CARTs are approximately more accurate than FATT on average, they are roughly half as fair and more than ten times larger.
It is well known that decision trees often overfit bramer2007 due to their high number of leaves, thus yielding unstable/unfair models. Post-training techniques such as tree pruning are often used to mitigate overfitting kearns1998, although they are deployed when a tree has already been fully trained, so that pruning is often of limited benefit. As a byproduct of our approach, we trained a set of natural CART trees, denoted by Hint in Table 5, which exploit hyper-parameters as “hinted” by FATT training. In particular, in this “hinted” learning of CART trees, the maximum tree depth and the minimum number of samples per leaf are obtained as the tree depth and the minimum number of samples of our best FATT models. Interestingly, the results in Table 5 show that these “hinted” decision trees have roughly the same size as our FATT trees, are approximately more fair than natural CART trees and just less accurate. Overall, it turns out that the general performance of these “hinted” CARTs is halfway between natural CARTs and FATTs, both in terms of accuracy and fairness, while having the same compactness as FATT models.
We believe that this work contributes to pushing forward the use of formal verification methods in decision tree learning. In particular, abstract interpretation, a very well-known program analysis technique, has proved successful for training and verifying decision tree classifiers that are both accurate and fair and that improve on state-of-the-art CART and random forest models, while being much more compact and interpretable. We also showed how information from our FATT trees can be exploited to tune the natural training process of decision trees. As future work we plan to further extend our fairness analysis by considering alternative fairness definitions, such as group or statistical fairness.