As machine learning and algorithmic decision making continue to proliferate throughout industry, their net societal impact has come under increasing scrutiny. In the USA under the Obama administration, a report on big data collection and analysis found that “big data technologies can cause societal harms beyond damages to privacy” [united2014big]. The report feared that algorithmic decisions informed by big data may have harmful biases, further discriminating against disadvantaged groups. This, along with other similar findings, has led to a surge in research around algorithmic fairness and the removal of bias from big data.
The term fairness, with respect to some sensitive feature or set of features, has a range of potential definitions. In this work, impact parity is considered. In particular, this work is concerned with group fairness under the following definitions as taken from [gajane2017formalizing],
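Group Fairness: A predictor $\mathcal{H}: X \rightarrow Y$ achieves fairness with bias $\epsilon$ with respect to groups $A, B \subseteq X$ and $O \subseteq Y$ being any subset of outcomes iff,
$$\left| P\{\mathcal{H}(x_i) \in O \mid x_i \in A\} - P\{\mathcal{H}(x_j) \in O \mid x_j \in B\} \right| \leq \epsilon.$$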
The above definition can also be described as statistical or demographic parity. Group fairness has found widespread application in India and the USA, where affirmative action has been used to address discrimination on the basis of caste, race and gender [weisskopf2004affirmative, dumont1980homo, deshpande2017affirmative].
The above definition does not, unfortunately, have a natural application to regression problems. One way around this would be to alter the definition to bound the absolute difference between the respective marginal distributions over the output space. However, this is a strong requirement and may hinder the model’s ability to model the function space appropriately. Rather, a weaker and potentially more desirable constraint would be to force the expectations of the marginal distributions over the output space to be equal. Statements such as “the average expected outcome for populations $A$ and $B$ is equal” would then be valid.
The second issue encountered is that the generative distributions of groups $A$ and $B$ are generally unknown. In this work, it is assumed that the empirical distributions $p_A$ and $p_B$, as observed from the training set, are equal to or negligibly perturbed from the true generative distributions.
Combining these two caveats we arrive at the following definition,
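Group Fairness in Expectation: A regressor $f(\cdot): X \rightarrow Y$ achieves fairness with respect to groups $A, B \subseteq X$ iff,
$$\mathbb{E}[f(x_a) \mid x_a \in A] - \mathbb{E}[f(x_b) \mid x_b \in B] = \int f(x)\,\big(p_A(x) - p_B(x)\big)\,dx = 0.$$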
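As a minimal illustration of this definition, the sketch below measures the difference between the mean prediction on two groups using the empirical distributions. The function name `gfe_gap` and the toy numbers are hypothetical and are not taken from the paper; they only show that equal means do not require equal distributions.

```python
import numpy as np

def gfe_gap(predictions, in_group_a):
    """Mean prediction on group A minus mean prediction on group B."""
    predictions = np.asarray(predictions, dtype=float)
    in_group_a = np.asarray(in_group_a, dtype=bool)
    return predictions[in_group_a].mean() - predictions[~in_group_a].mean()

# First two points belong to group A, last two to group B (toy data).
membership = np.array([True, True, False, False])

# Different spreads but equal means: satisfies fairness in expectation.
fair_gap = gfe_gap([0.0, 5.0, 1.0, 4.0], membership)

# Group A is systematically scored higher: violates it.
biased_gap = gfe_gap([4.0, 5.0, 0.0, 1.0], membership)
```

A gap of zero corresponds to the constraint being met on the sample; the first example shows that the two groups may still have quite different output distributions.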
There are many machine learning techniques with which Group Fairness in Expectation constraints (GFE constraints) may be incorporated. While constraining kernel regression is introduced in Section 3, the main focus of the paper is examining decision tree regression and respective ensemble methods which build on decision tree regression such as random forests, extra trees and boosted trees due to their widespread use in industry and hence their extensive impact on society.
The main contributions of this paper are,
The use of quadrature approaches to enforce GFE constraints on kernel regression, as outlined in Section 3.
Incorporating these constraints on decision tree regression without affecting the computational or memory requirements, outlined in Sections 5 and 6.
Deriving the expected absolute perturbation bounds due to the incorporation of GFE constraints on decision tree regression in terms of the number of leaves of the tree, outlined in Section 7.
Showing that these fair trees can be combined into random forests, boosted trees and other ensemble approaches while maintaining fairness, as shown in Section 8.
2 Related Work
There have been primarily two branches of research towards the development of fair algorithms. The first is the data alteration approach, which endeavors to modify the original dataset in order to prevent discrimination or bias due to the protected variable [luong2011k, kamiran2009classifying]. The second is an attempt to regularize such that the model is penalized for bias [kamishima2011fairness, berk2017convex, calders2013controlling, calders2010three, raff2017fair].
Fair forests [raff2017fair] introduced the first tree induction algorithm to encourage fairness. They did this by introducing a new gain measure to encourage fairness. However, the issue with adding such regularisation is two-fold. Firstly, discouraging bias via a regularising term does not make any guarantee about the bias of the post trained model. Secondly, it is hard to make any theoretical guarantees about the underlying model or the effect the new regulariser has had on the model.
The approach offered in this work seeks to perform model inference in a constrained space, leveraging basic theory from kernel quadrature, as explained further in the next section, and hence the output marginal distributions are guaranteed to have equal means. By utilizing quadrature methods, it is also possible to derive bounds for the expected absolute perturbation induced by constraining the space. This is shown explicitly in Section 7.
3 Constrained Kernel Regression
Before delving into constraints on decision tree models, we first show how one can create such linear constraints on kernel regression models. While this is potentially useful in and of itself, albeit more derivative as research goes, the focus of the paper remains on decision tree based models purely because they are more widely used by data scientists in industry. This work builds on the earlier contributions of [jidling2017linearly], where the authors examined the incorporation of linear constraints on Gaussian processes (GPs). Gaussian processes are a Bayesian kernel method most popular for regression. For a detailed introduction to Gaussian processes, we refer the reader to [rasmussen2004gaussian]. However, readers unfamiliar with GPs may simply think of a high dimensional Gaussian distribution parameterized by a kernel $K$, with zero mean and unit variance without loss of generality. Given a set of inputs and respective outputs, $\{X, y\}$, split into training and testing sets, $\{X_x, y_x\}$ and $\{X_*, y_*\}$, inference is performed as,
$$\mathbb{E}[y_*] = K_{*x} K_{xx}^{-1} y_x, \qquad \mathbb{V}[y_*] = K_{**} - K_{*x} K_{xx}^{-1} K_{x*},$$
where $K_{xx}$ denotes the kernel matrix between training examples, $K_{*x}$ is the kernel matrix between the test and training examples and $K_{**}$ is the prior variance on the prediction point defined by the kernel matrix. Gaussian processes differ from high dimensional Gaussian distributions as they can model the relationships between points in continuous space, via the kernel function, as opposed to being limited to a finite dimension.
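The inference step described above can be sketched in a few lines. This is an illustrative implementation under assumptions not fixed by the text: a squared exponential kernel with unit prior variance, one-dimensional inputs, and a small jitter term `noise` for numerical stability. The function names are hypothetical.

```python
import numpy as np

def rbf_kernel(a, b, length_scale=1.0):
    """Squared exponential kernel with unit prior variance (1-D inputs)."""
    diff = a[:, None] - b[None, :]
    return np.exp(-0.5 * (diff / length_scale) ** 2)

def gp_posterior(x_train, y_train, x_test, noise=1e-6):
    """Posterior mean and marginal variance of a zero-mean GP at x_test."""
    k_xx = rbf_kernel(x_train, x_train) + noise * np.eye(len(x_train))
    k_sx = rbf_kernel(x_test, x_train)
    k_ss = rbf_kernel(x_test, x_test)
    mean = k_sx @ np.linalg.solve(k_xx, y_train)
    cov = k_ss - k_sx @ np.linalg.solve(k_xx, k_sx.T)
    return mean, np.diag(cov)

x_train = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
y_train = np.sin(x_train)

# Predicting back at the training inputs should nearly interpolate the data
# and report almost no remaining uncertainty.
mean_at_train, var_at_train = gp_posterior(x_train, y_train, x_train)
```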
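An important note is that Gaussian distributions are closed under addition and subtraction, that is to say, the sum or difference of jointly Gaussian variables is also Gaussian. While this may at first appear trivial, it is, in fact, a very useful property. For example, let us assume there are two variables, $x_1$ and $x_2$, drawn from Gaussian distributions with means $\mu_1, \mu_2$ and variances $\sigma_1^2, \sigma_2^2$ respectively. Further, assume that the correlation coefficient $\rho$ describes the interaction between the two variables. Then a new variable $x_3$, which is equal to the difference of $x_1$ and $x_2$, is drawn from a Gaussian distribution with mean and variance,
$$\mu_3 = \mu_1 - \mu_2, \qquad \sigma_3^2 = \sigma_1^2 + \sigma_2^2 - 2\rho\sigma_1\sigma_2.$$
We can thus write all three variables in terms of a single mean vector and covariance matrix,
$$\begin{bmatrix} x_1 \\ x_2 \\ x_3 \end{bmatrix} \sim \mathcal{N}\left( \begin{bmatrix} \mu_1 \\ \mu_2 \\ \mu_1 - \mu_2 \end{bmatrix},\; \Sigma = \begin{bmatrix} \sigma_1^2 & \rho\sigma_1\sigma_2 & \sigma_1^2 - \rho\sigma_1\sigma_2 \\ \rho\sigma_1\sigma_2 & \sigma_2^2 & \rho\sigma_1\sigma_2 - \sigma_2^2 \\ \sigma_1^2 - \rho\sigma_1\sigma_2 & \rho\sigma_1\sigma_2 - \sigma_2^2 & \sigma_1^2 + \sigma_2^2 - 2\rho\sigma_1\sigma_2 \end{bmatrix} \right).$$
Given any two of the above observations, the third can be inferred exactly. We refer to this as a degenerate distribution as $\Sigma$ will naturally be low rank. If we observe that $x_3$ is equal to zero, we are thus constraining the distribution of $x_1$ and $x_2$. This can easily be extended to the relationship between sums and differences of more variables.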
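The degenerate three-variable construction can be checked numerically. The sketch below uses arbitrary illustrative values for the means, standard deviations and correlation (none of them come from the paper), builds the joint covariance of two Gaussian variables and their difference, and conditions on the difference being zero.

```python
import numpy as np

# Illustrative parameters for the two correlated Gaussian variables.
mu1, mu2 = 1.0, 3.0
s1, s2, rho = 1.0, 2.0, 0.25
c12 = rho * s1 * s2

# Joint mean and covariance of (x1, x2, x3) with x3 = x1 - x2.
mean = np.array([mu1, mu2, mu1 - mu2])
cov = np.array([
    [s1**2,       c12,         s1**2 - c12],
    [c12,         s2**2,       c12 - s2**2],
    [s1**2 - c12, c12 - s2**2, s1**2 + s2**2 - 2.0 * c12],
])

# The third variable is a linear function of the first two, so rank is 2.
rank = np.linalg.matrix_rank(cov)

# Condition (x1, x2) on the noiseless observation x3 = 0.
gain = cov[:2, 2] / cov[2, 2]
post_mean = mean[:2] + gain * (0.0 - mean[2])
```

After conditioning, the posterior means of the two variables coincide, which is exactly the mechanism used later to force two group expectations to agree.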
Bayesian quadrature [o1991bayes] is a technique used to incorporate integral observations into the Gaussian process framework. Essentially, quadrature can be derived through an infinite summation and the above relationship between these summations can be exploited [osborne2012active]. An example covariance structure thus looks akin to,
Here $p(x)$ is some probability distribution over the domain of $x$, on which the Gaussian process is defined and against which the quadrature is performed.
Reiterating the motivation of this work: given two generative distributions $p_A(x)$ and $p_B(x)$ from which subpopulations $A$ and $B$ of the data are generated, we wish to constrain the inferred function $f(\cdot)$ such that,
$$\int f(x)\,p_A(x)\,dx = \int f(x)\,p_B(x)\,dx.$$
This constraint can be rewritten as,
$$\int f(x)\,\big(p_A(x) - p_B(x)\big)\,dx = 0,$$
which allows us to incorporate the constraint on $f(\cdot)$ as an observation in the above Gaussian process. Let $d(x) = p_A(x) - p_B(x)$ be the difference between the generative probability distributions of $A$ and $B$; then, by setting the corresponding observation to zero, the covariance matrix becomes,
We will refer to these as equality constrained Gaussian processes. Let us now turn to incorporate these concepts into decision tree regression.
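A minimal sketch of an equality constrained Gaussian process follows, assuming the two generative distributions are replaced by empirical point-mass distributions so that the quadrature integrals reduce to weighted sums. The kernel, length scale, noise level and synthetic data are all illustrative assumptions rather than choices made in the paper.

```python
import numpy as np

def rbf_kernel(a, b, length_scale=0.5):
    diff = a[:, None] - b[None, :]
    return np.exp(-0.5 * (diff / length_scale) ** 2)

rng = np.random.default_rng(0)
n = 40
x = rng.uniform(0.0, 1.0, size=n)
in_a = x + 0.2 * rng.standard_normal(n) > 0.5   # group A skews to larger x
y = 2.0 * x + 0.1 * rng.standard_normal(n)
noise = 0.1

# Weights encoding p_A - p_B as a difference of empirical distributions.
w = np.where(in_a, 1.0 / in_a.sum(), -1.0 / (~in_a).sum())

k = rbf_kernel(x, x)
kw = k @ w                                       # covariance of f(x) with the constraint

# Joint covariance of the noisy observations and the noiseless constraint.
joint = np.block([
    [k + noise * np.eye(n), kw[:, None]],
    [kw[None, :],           np.array([[w @ kw]])],
])
targets = np.append(y, 0.0)                      # constraint observed as exactly zero
cross = np.hstack([k, kw[:, None]])

constrained = cross @ np.linalg.solve(joint, targets)
unconstrained = k @ np.linalg.solve(k + noise * np.eye(n), y)

gap_before = w @ unconstrained                   # mean on A minus mean on B
gap_after = w @ constrained
```

The constrained posterior mean has equal averages over the two groups up to floating point error, while the unconstrained fit inherits the disparity present in the data.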
4 Trees as Kernel Regression
Decision tree regression (DTR) and related approaches offer a white box approach for practitioners who wish to use them. These methods are among the most popular in machine learning practice as they are generally intuitive, even for those without a background in statistics, mathematics or computer science. They are widely applied in machine learning competitions such as Kaggle and feature in most online data science courses. It is their proliferation, especially in businesses without machine learning researchers, that makes them of particular interest.
DTR regresses data by sorting them down binary trees based on partitions of the input domain. The trees are created by recursively partitioning the input domain along axis aligned splits determined by a given metric of the data in each partition, such as information gain or variance reduction. In this work, we will not consider the many possible techniques for learning decision trees, but rather assume that the practitioner has a trained decision tree model. For a more complete description of decision trees, the authors refer the readers to [rokach2008data].
For the purposes of this work, DTR can be described as a partitioning of space such that predictions are made by averaging the observations in the local partition, referred to as the leaves of the tree. As such, DTR has a very natural formulation as a degenerate kernel whereby,
$$k(x, x') = \begin{cases} 1 & \text{if } l(x) = l(x'), \\ 0 & \text{otherwise,} \end{cases}$$
where $l(\cdot)$ is the index of the leaf to which the argument belongs. The kernel matrix hence becomes naturally block diagonal and the classifier / regressor can be written as,
$$\mathbb{E}[f(x_*)] = k_*^\top K_{xx}^{-1} y,$$
with $k_*$ denoting the vector of kernel values between $x_*$ and the observations, $K_{xx}$ denoting the covariance matrix of the observations as defined by the implicit decision tree kernel and $y$ denoting the values of the observations.
It is also worth noting that one can write the decision tree as a two-stage model: first averaging the observations associated with each leaf and then using a diagonal kernel matrix to perform inference. Trivially, the diagonal kernel matrix acts only as a lookup and outputs the leaf average that corresponds to the point being predicted. Let us refer to this approach as the compressed kernel representation and to the block diagonal variant as the explicit kernel representation.
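The two representations can be compared on a toy tree. In the sketch below the split thresholds and data are hypothetical, and a pseudo-inverse is used for the explicit block diagonal kernel because, before any noise is added, a block of ones is rank deficient; that choice is an assumption made here for illustration.

```python
import numpy as np

x = np.array([0.1, 0.2, 0.35, 0.6, 0.7, 0.9])
y = np.array([1.0, 3.0, 5.0, 2.0, 4.0, 9.0])
thresholds = np.array([0.3, 0.8])        # hypothetical axis aligned splits

def leaf_index(points):
    """Index of the leaf (interval between thresholds) containing each point."""
    return np.digitize(points, thresholds)

leaves = leaf_index(x)
x_test = np.array([0.15, 0.5, 0.95])
test_leaves = leaf_index(x_test)

# Explicit representation: kernel is 1 when two points share a leaf.
k_xx = (leaves[:, None] == leaves[None, :]).astype(float)
k_star = (test_leaves[:, None] == leaves[None, :]).astype(float)
explicit_pred = k_star @ np.linalg.pinv(k_xx) @ y

# Compressed representation: average per leaf, then an identity lookup.
leaf_means = np.array([y[leaves == l].mean() for l in range(3)])
compressed_pred = np.eye(3)[test_leaves] @ leaf_means
```

Both routes return the average of the training outputs in the leaf containing each test point.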
5 Fairness Constrained Decision Trees
Borrowing concepts from the previous section on equality constrained Gaussian processes using Bayesian quadrature, decision trees may be constrained in a similar fashion. The first consideration is that we wish the constraint observation to act as a hard equality, that is to say, to be noiseless. In contrast, we are willing to let the observations be perturbed in order to satisfy this hard equality constraint. To achieve this, let us add a constant noise term, $\sigma^2$, to the diagonals of the decision tree kernel matrix. Similar to ordinary least squares regression, the regressor will now minimize the L2-norm of the error induced on the observations, conditioned on the equality constraint, which is noise free. In the explicit kernel representation, this implies the minimum induced noise per observation, whereas in the compressed kernel representation this implies the minimum induced noise per leaf.
An important note is that the constraint is applied to the kernel regressor equations, hence the method is exact for regression trees or when the practitioner is concerned with relative outcomes of various predictions. However, in the case that the observations range between $[0, 1]$, as is the case in classification, we must renormalize the output to $[0, 1]$. This no longer guarantees a minimum L2-norm perturbation and, while potentially still useful, is not the focus of this work.
The second consideration is how to determine the generative probability distributions $p_A$ and $p_B$. Given the frequentist nature of decision trees, it makes sense to consider $p_A$ and $p_B$ as the empirical distributions of subpopulations $A$ and $B$ as described in Section 1. Thus the integral of the empirical distribution $p_A$ over a given leaf $l$ is defined as the proportion of population $A$ observed in the partition associated with leaf $l$.
Figure 1 shows the structure of a decision tree kernel in explicit and compressed kernel representation with linear constraints on its subpopulations.
6 Efficient Algorithm for Equality Constrained Decision Trees
At this point, an equality constrained variant of a decision tree has been described, both in explicit representation and compressed representation. In this section, we will show that equality constraints on a decision tree do not change the computational or memory order of complexity. The motivation for considering the order of complexity is that decision trees are one of the more scalable machine learning models, whereas kernel methods such as Gaussian processes naively scale at $O(n^3)$ in computation and $O(n^2)$ in memory, where $n$ is the number of observations. While the approach presented in this work utilizes concepts from Bayesian quadrature and linearly constrained Gaussian processes, the model’s usefulness would be drastically hindered if it no longer maintained the performance characteristics of the classic decision tree, namely computational cost and memory requirements.
6.1 Efficient Constrained Decision Trees in Compressed Kernel Representation
As Figure 1 shows, the compressed kernel representation of the constrained decision tree creates an arrowhead matrix. It is well known that the inverse of an arrowhead matrix is a diagonal matrix with a rank-1 update. Letting $D$ represent the diagonal principal sub-matrix with diagonal elements equal to one, and $z$ be the vector whose $l$-th element, $z_l$, is equal to the relative difference in the generative population distributions for leaf $l$, the arrowhead inversion properties state that,
with and . Note that the integral of the difference between the two generative distributions when evaluated over the entire domain is equal to zero, as both and must sum to one by definition and hence their differences to zero. Returning to the equation of interest, namely with y as the average value of each leaf of the tree, and subbing in as a vector of zeros with a one indexing the leaf in which the predicted point belongs to and is equal to zero, as it does not contribute to the empirical distributions, we arrive at,
A more explicit derivation is presented in Appendix A. The term $\frac{1}{1+\sigma^2}$ is the effect of the prior under the Gaussian process perspective; however, by post-multiplying by $(1+\sigma^2)$, this prior effect can be removed. While relatively simple to derive, the above equation shows that only an additive update to the predictions is required to ensure group fairness in decision trees. Further, if the same relative population is observed for group $A$ and group $B$ on a single leaf $l$, then $z_l = 0$ and no change is applied to the original inferred prediction, other than the effect of the noise. In fact, the perturbation to a leaf’s expectation grows linearly with the bias in the population of the leaf.
From an efficiency standpoint, only the difference in generative distributions, $z$, needs to be stored, which is an additional $O(L)$ memory requirement, where $L$ is the number of leaves, and the update per leaf can be pre-computed in $O(L)$. These additional memory and computational requirements are negligible compared to the cost of the decision tree itself.
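The additive update can be sketched as follows. This is one reading of the compressed representation rather than a transcription of the paper's equation: it assumes the corner entry of the arrowhead matrix is zero and that the prior scaling has been removed, in which case the update is the minimum L2-norm shift of the leaf averages that makes them orthogonal to the vector of per-leaf population differences. The helper name and the toy numbers are hypothetical.

```python
import numpy as np

def constrain_leaf_means(leaf_means, z):
    """Smallest L2 shift of the leaf averages such that z @ result == 0."""
    leaf_means = np.asarray(leaf_means, dtype=float)
    z = np.asarray(z, dtype=float)
    return leaf_means - z * (z @ leaf_means) / (z @ z)

leaf_means = np.array([1.0, 2.0, 4.0, 7.0])
p_a = np.array([0.1, 0.2, 0.3, 0.4])     # share of group A falling in each leaf
p_b = np.array([0.4, 0.3, 0.2, 0.1])     # share of group B falling in each leaf
z = p_a - p_b                             # per-leaf population difference

fair_means = constrain_leaf_means(leaf_means, z)

# Expected outcome for each group before and after the update.
gap_before = p_a @ leaf_means - p_b @ leaf_means
gap_after = p_a @ fair_means - p_b @ fair_means
```

Each leaf is shifted in proportion to its own entry of the difference vector, so leaves with balanced populations are left untouched, and only that vector plus two scalar products need to be kept alongside the tree.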
6.2 Efficiently Constrained Decision Trees in Explicit Kernel Representation
Let us now turn our attention to the explicit kernel representation case, that is where the in the previous subsection is replaced with the block diagonal matrix equivalent. First let us state the bordering method, a special case of the block diagonal inversion lemma,
with once again. Substituting this into the kernel regression equation once more we find,
where denotes a vector of zeros with ones placed in all elements relating to observations in the same leaf. Expanding the above linear algebra,
Noting that an expression for the inverse of is derived in Appendix B, it is straight forward to show where is iterating over the set of leaves. Note that when for all we arrive at the same value for as we did in the previous subsection. We can continue to apply this result to the other terms of interest,
where is once again the average output observation over leaf . The terms have been labeled , and for shorthand. The three terms, along with , can be computed in linear time with respect to the size of the data, $n$, and can be pre-computed ahead of time, and hence do not affect the computational complexity of a standard decision tree. Once again only and have to be stored for each leaf and hence the additional memory cost is only $O(L)$. As such we can simplify the full expression for the expected outcome as,
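The explicit representation can be checked with a direct dense solve. The sketch below is not the linear-time algorithm described above; it builds the full bordered block diagonal system, assuming a zero corner entry and a border whose entry for each observation is the population difference of its leaf, purely to confirm that the resulting leaf predictions satisfy the constraint. All data values are illustrative.

```python
import numpy as np

leaves = np.array([0, 0, 0, 1, 1, 2, 2, 2, 2])    # leaf of each observation
in_a = np.array([1, 1, 0, 1, 0, 0, 0, 0, 1], dtype=bool)
y = np.array([1.0, 2.0, 3.0, 4.0, 6.0, 7.0, 8.0, 9.0, 10.0])
noise = 0.1
n, num_leaves = len(y), 3

one_hot = np.eye(num_leaves)[leaves]               # n x L leaf membership
z = one_hot[in_a].mean(axis=0) - one_hot[~in_a].mean(axis=0)

# Block diagonal tree kernel with noise on the diagonal.
block = one_hot @ one_hot.T + noise * np.eye(n)
border = one_hot @ z                               # z_l for the leaf of each point

bordered = np.block([
    [block,           border[:, None]],
    [border[None, :], np.zeros((1, 1))],
])
alpha = np.linalg.solve(bordered, np.append(y, 0.0))[:n]

# Prediction for a test point in leaf l sums alpha over that leaf.
# The shrinkage caused by the noise term is deliberately left in place.
leaf_pred = one_hot.T @ alpha
gap_after = z @ leaf_pred

unconstrained = one_hot.T @ np.linalg.solve(block, y)
gap_before = z @ unconstrained
```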
7 Expected Perturbation Bounds
In imposing equality constraints on the models the inferred outputs become perturbed. In this section, the expected magnitude of the perturbation is analyzed for the compressed kernel representation. We define the perturbation due to the equality constraint, not due to the incorporation of the noise, as,
Let us assume that the data has been preprocessed such that the output values, , have zero mean and unit variance. This can be done without loss of generality. The expected absolute perturbation is then,
As is assumed to be zero mean and unit variance, it follows that the expectation of . Further, it is well known that the absolute value of the dot product between two uniformly random sampled unit vectors in is . Substituting this back into the previous equation we arrive at,
Where denotes the Schatten p-norm of the vector. Using the norm inequality, , we can then bound the worst case expected absolute perturbation for a fixed and non-zero total variational distance between the two generative distributions, , as,
This is an interesting result as it implies that if the model is not exploiting biases in the generative distribution evenly across all of the leaves of the tree, that is to say, , then the resulting predictions will receive the greatest expected absolute perturbation when averaged over all possible .
For the explicit kernel representation, the expected absolute perturbation bound can be analysed for the case where each leaf holds an equal number of observations. In such a scenario is equal for all leaves . Substituting this into the equations for and we can find that the bounded expected perturbation is equal to,
For the sake of conciseness the full derivation of the above is left to the reader but follows the same steps as the compressed kernel representation.
8 Combinations of Fair Trees
While it is intuitive that ensembles of trees with GFE constraints preserve the GFE constraint, for the sake of completeness this is now shown more formally. Random forests [breiman2001random], extremely random trees [geurts2006extremely] and tree bagging models [breiman1996bagging] combine tree models by averaging over their predictions. Denoting the predictions of the trees at point $x$ as $f_i(x)$ for each $i \in \{1, \dots, T\}$, where $T$ is the number of trees, we can easily show that the combined difference in expectation marginalised over the space is equal to zero,
It can also be easily shown that modeling residual errors of the trees with other fair trees, such as is the case for boosted tree models [elith2008working], results in fair predictors also. These concepts are not limited to tree methods either and the core concepts set out in this paper of constraining kernel matrices can have applications in models such as deep Gaussian process models [damianou2013deep].
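The averaging argument can be illustrated numerically. The sketch below builds ten one-dimensional trees with hypothetical, deterministically shifted thresholds, applies the minimum-norm additive update to each (an assumption about the form of the per-tree constraint), and averages their predictions as a random forest would. The data generating process is illustrative only.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
x = rng.uniform(0.0, 1.0, size=n)
in_a = rng.uniform(size=n) < x            # group A is more likely for large x
y = 3.0 * x + 0.1 * rng.standard_normal(n)

def fair_tree_predictions(thresholds):
    """Fit leaf averages on fixed splits, then apply the additive fairness update."""
    leaves = np.digitize(x, thresholds)
    one_hot = np.eye(len(thresholds) + 1)[leaves]
    counts = one_hot.sum(axis=0)
    leaf_means = (one_hot.T @ y) / np.maximum(counts, 1.0)
    z = one_hot[in_a].mean(axis=0) - one_hot[~in_a].mean(axis=0)
    fair_means = leaf_means - z * (z @ leaf_means) / (z @ z)
    return fair_means[leaves]

base = np.array([0.25, 0.5, 0.75])
trees = [fair_tree_predictions(base + shift) for shift in np.linspace(-0.1, 0.1, 10)]

# Ensemble prediction is the plain average of the individual fair trees.
forest = np.mean(trees, axis=0)
forest_gap = forest[in_a].mean() - forest[~in_a].mean()
```

Because each tree has a zero gap between group means on the training sample and the gap is linear in the predictions, the averaged ensemble has a zero gap as well.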
9.1 Synthetic Demonstration
The first experiment is merely a visual demonstration to better communicate the validity of the approach. The model used is a random forest regressor with 50 GFE constrained decision tree regressors (with a maximum of four splits) endeavouring to model a linear function, with observations drawn from two beta distributions, $p_A$ and $p_B$ respectively. The parameters of the two beta distributions are,
Figure 3 shows the effect of perturbing the leaves to constrain the expected means of the two populations. The figure shows that the greater the disparity between $p_A$ and $p_B$, the greater the perturbation in the inferred function. Both the compressed and explicit kernel representations lead to very similar plots, so only the compressed kernel representation algorithm has been shown for conciseness.
A downside to group fairness algorithms more generally, as pointed out in [luong2011k], is that candidate systems which impose group fairness can lead to qualified candidates being discriminated against. This can be visually verified as the perturbation pushes down the outcome of many orange points below the total population mean in order to satisfy the constraint. By choosing to incorporate group fairness constraints the practitioner should be aware of these tradeoffs.
9.2 ProPublica Dataset - Racial Biases
Across the USA, judges, probation and parole officers are increasingly using algorithms to aid in their decision making. The ProPublica dataset (https://www.propublica.org/datastore/dataset/compas-recidivism-risk-score-data-and-analysis) contains data about criminal defendants from Florida in the United States. It is the Correctional Offender Management Profiling for Alternative Sanctions (COMPAS) algorithm [dieterich2016compas] which is often used by judges to estimate the probability that a defendant will be a recidivist, a term used to describe re-offenders. However, the algorithm is said to be racially biased against African-Americans [dressel2018accuracy]. In order to highlight the proposed algorithm, we first endeavor to use a random forest to approximate the decile scoring of the COMPAS algorithm and then perturb the tree to remove any racial bias from the system. The goal of this experiment is not to propose a new algorithm to model recidivism risk but rather to show how GFE can be used on a regression task in practice.
The two subpopulations we consider constraining are thus African-American and non-African-American. We encode the COMPAS algorithm's decile score as an integer between zero and ten such that minimizing perturbation is an appropriate objective function. The fact that the decile scores are bounded in $[0, 10]$ was not taken into account. The random forest used 20 decision trees as base estimators and the explicit kernel representation version of the algorithm was used for demonstrative purposes. The features used were sex, age, race, juv_fel_count, juv_misd_count, juv_other_count, priors_count, days_b_screening_arrest, c_jail_in, c_jail_out, c_days_from_compas, r_charge_degree, r_days_from_arrest, vr_charge_degree.
Figure 3 presents the marginal distribution of predictions on a 20% held out test set before and after the GFE constraint was applied. It is visible that the expected outcome for African-Americans is decreased while that for non-African-Americans is increased. Notice that while the means are equal, the structure of the two distributions is quite different, indicating that GFE constraints still allow greater flexibility than stricter group fairness such as that described in Section 1. The root square difference between the predicted points before and after perturbation was 0.8. Importantly, the GFE constraint described in this work was verified numerically with the average outputs recorded as,
Appendix A Arrowhead Matrix Update
We will now show a more explicit derivation for the arrowhead matrix update. Restating the linear algebraic equation to simplify for , letting $D$ represent the diagonal principal submatrix, with diagonal elements equal to $1 + \sigma^2$, and $z$ be the vector whose $l$-th element, $z_l$, is equal to the relative difference in the generative population distributions for leaf $l$, the arrowhead inversion properties state that,
with and . Subbing in as ,
where $e_l$ is the $l$-th column of the identity matrix, and hence an indicator for the leaf to which $x_*$ belongs. Expanding the above into the atomic variables once more, we find,
Finally, with a little cleaning up we arrive at the expression stated in the body of the paper,
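The arrowhead inversion property can be verified numerically. The sketch below uses the standard identity for a diagonal matrix bordered by a single row and column, written with hypothetical names `u` and `rho` for the rank-1 update; the zero corner entry and the rescaling by one plus the noise are assumptions made to connect the identity to the additive update in the body of the paper.

```python
import numpy as np

noise = 0.5
leaf_means = np.array([1.0, 2.0, 4.0, 7.0])
z = np.array([-0.3, -0.1, 0.1, 0.3])      # per-leaf population differences
d = np.full(4, 1.0 + noise)               # diagonal of the principal sub-matrix
corner = 0.0                              # assumed corner entry of the arrowhead

arrow = np.zeros((5, 5))
arrow[:4, :4] = np.diag(d)
arrow[:4, 4] = z
arrow[4, :4] = z
arrow[4, 4] = corner

# Inverse = diagonal part + rank-1 update rho * u u^T.
u = np.append(z / d, -1.0)
rho = 1.0 / (corner - z @ (z / d))
closed_form = np.zeros((5, 5))
closed_form[:4, :4] = np.diag(1.0 / d)
closed_form += rho * np.outer(u, u)

# Apply to the leaf averages with a zero constraint observation, then
# undo the prior scaling to recover the additive update.
weights = closed_form @ np.append(leaf_means, 0.0)
fair_means = (1.0 + noise) * weights[:4]
```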
Appendix B Explicit Block Diagonal Inverse
The explicit inverse of the block diagonal matrix, , used throughout the Efficiently Constrained Decision Trees in Explicit Kernel Representation section is outlined below.
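Let each block, $K_l$, of the block diagonal matrix, of size $n_l \times n_l$, be equal to a matrix of ones plus some noise on the diagonal entries,
$$K_l = \mathbf{1}\mathbf{1}^\top + \sigma^2 I.$$
Due to the invariance of the identity matrix under unitary transformations and as $\mathbf{1}\mathbf{1}^\top$ has only one non-zero eigenvalue, of magnitude $n_l$, the eigenvalues of $K_l$ are,
$$\lambda_1 = n_l + \sigma^2, \qquad \lambda_2 = \dots = \lambda_{n_l} = \sigma^2.$$
Inverting the eigenvalues, we arrive at,
$$K_l^{-1} = \frac{1}{\sigma^2}\left(I - \frac{\mathbf{1}\mathbf{1}^\top}{n_l + \sigma^2}\right).$$
Applying vectors $a$ and $b$ to the matrix inverse in quadratic form leads to,
$$a^\top K_l^{-1} b = \frac{1}{\sigma^2}\left(a^\top b - \frac{(\mathbf{1}^\top a)(\mathbf{1}^\top b)}{n_l + \sigma^2}\right).$$
Note that if $a$ and $b$ pertain to data contained in leaf $l$, then summary statistics on a per leaf basis are sufficient to calculate $a^\top K_l^{-1} b$ exactly.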
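The closed form for a single block can be checked against a dense inverse. The sketch below assumes each block is a matrix of ones plus a constant noise term on the diagonal; the function names and test vectors are hypothetical.

```python
import numpy as np

def leaf_block_inverse(size, noise):
    """Closed-form inverse of ones((size, size)) + noise * I."""
    ones = np.ones((size, size))
    return (np.eye(size) - ones / (size + noise)) / noise

def quadratic_form(a, b, noise):
    """a^T K^{-1} b for the same block, using only dot products and sums."""
    size = len(a)
    return (a @ b - a.sum() * b.sum() / (size + noise)) / noise

size, noise = 5, 0.3
block = np.ones((size, size)) + noise * np.eye(size)
dense_inverse = np.linalg.inv(block)

a = np.array([1.0, -2.0, 0.5, 3.0, 0.0])
b = np.array([0.2, 0.4, -1.0, 2.0, 1.5])
fast = quadratic_form(a, b, noise)
slow = a @ dense_inverse @ b
```

The quadratic form only needs the dot product and the two sums over the leaf, which is why per-leaf summary statistics suffice.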