Kernel Dependence Regularizers and Gaussian Processes with Applications to Algorithmic Fairness

11/11/2019 ∙ by Zhu Li, et al.

Current adoption of machine learning in industrial, societal and economical activities has raised concerns about the fairness, equity and ethics of automated decisions. Predictive models are often developed using biased datasets and thus retain or even exacerbate biases in their decisions and recommendations. Removing the sensitive covariates, such as gender or race, is insufficient to remedy this issue since the biases may be retained due to other related covariates. We present a regularization approach to this problem that trades off predictive accuracy of the learned models (with respect to biased labels) for the fairness in terms of statistical parity, i.e. independence of the decisions from the sensitive covariates. In particular, we consider a general framework of regularized empirical risk minimization over reproducing kernel Hilbert spaces and impose an additional regularizer of dependence between predictors and sensitive covariates using kernel-based measures of dependence, namely the Hilbert-Schmidt Independence Criterion (HSIC) and its normalized version. This approach leads to a closed-form solution in the case of squared loss, i.e. ridge regression. Moreover, we show that the dependence regularizer has an interpretation as modifying the corresponding Gaussian process (GP) prior. As a consequence, a GP model with a prior that encourages fairness to sensitive variables can be derived, allowing principled hyperparameter selection and studying of the relative relevance of covariates under fairness constraints. Experimental results in synthetic examples and in real problems of income and crime prediction illustrate the potential of the approach to improve fairness of automated decisions.




1 Introduction

1.1 Motivation

The current and upcoming pervasive application of machine learning algorithms promises to have an enormous impact on people’s lives. For example, algorithms now decide on the best curriculum vitae to fill a position (Hoffman et al., 2017), determine wages (Dieterich et al., 2016), help in pre-trial risk assessment (Brennan and Ehret, 2009), and evaluate risk of violence (Cunningham and Sorensen, 2006). Concerns have been raised about the lack of fairness, equity and ethics when machine learning is used to treat these types of problems (see, for example, the Handbook on European non-discrimination law and the Paycheck Fairness Act in U.S. federal legislation). Indeed, standard machine learning models are far from being fair, just, or equitable: they will retain and often exacerbate systemic biases present in data. For example, a model trained simply to minimize a loss with respect to human-provided labels which are subject to a cognitive bias cannot be expected to be free from that bias. More nuanced modelling approaches are needed to move towards fair decision-making processes based on machine learning algorithms. New algorithms should also be easy to use, implement and interpret.

1.2 Approaches to Fairness in Machine Learning

Fairness is an elusive concept, and adopts many forms and definitions. The field is vast, and a wide body of literature and approaches exists (Pedreschi et al., 2008; Kamiran and Calders, 2009; Chouldechova and Roth, 2018).

Let us broadly distinguish two classes of fairness: individual and group fairness. On the one hand, individual fairness (Dwork et al., 2012; Joseph et al., 2016; Kim et al., 2018; Heidari et al., 2018) is a notion that can be roughly understood as: “similar individuals should be treated similarly”. An example is Dwork et al. (2012), where it is assumed that there exists a similarity measure among individuals, and the goal is to find a classifier that returns similar outcomes for individuals with high similarity. Joseph et al. (2016) give another formalization of individual fairness, which can be loosely described as: “less qualified individuals should not be favoured”, where the notion of quality is estimated from data. Although practically important, there are certain obstacles that prevent individual fairness from being widely adopted in practice. For example, the approach of Dwork et al. (2012) requires a pre-agreed similarity measure, which may be difficult to define. Also, employing individual fairness requires evaluation on every pair of individuals in the dataset, and for large datasets such computation may be infeasible.

On the other hand, group fairness focuses on the inequality at the group level (where groups may be defined using a sensitive variable such as race or gender). More broadly, outcomes should not differ systematically based on individuals’ protected (sensitive) information or group membership. This problem has been addressed by modifying classification rules (Pedreschi et al., 2008; Ruggieri et al., 2010) or preprocessing the data to remove sensitive dependencies explicitly (Kamiran and Calders, 2009; Luo et al., 2015; Feldman et al., 2015; Ristanoski et al., 2013). Down-weighting sensitive features or directly removing them has also been proposed (Zeng et al., 2016). However, simply removing the sensitive covariates (such as gender, disability, or race, to predict, e.g., monthly income or credit score) is often insufficient, as related variables may still enter the model. Sensitive covariates may be inferred from those related variables, and the bias is retained. Including covariates related to the sensitive variables in the models is called redlining, and induces the problem known as the omitted variable bias (OVB). Alternative approaches seek fair representation learning, i.e. achieving fairness through finding an optimal way to preprocess the data and map it into a latent space where all information about the sensitive variables is removed. After such preprocessing, standard machine learning techniques are employed to build predictive models. Examples of these methods include Zemel et al. (2013); Kamiran and Calders (2012); Adebayo and Kagal (2016); Calmon et al. (2017). Statistical parity approaches, on the other hand, directly impose the independence between predictor and sensitive variables (Calders and Verwer, 2010; Kamishima et al., 2012; Feldman et al., 2015; Pérez-Suay et al., 2017). Various other statistical measures across groups can be considered, e.g. equalized odds, which requires that the false positive and false negative rates be approximately equal across different groups (Kleinberg et al., 2016; Hardt et al., 2016; Chouldechova, 2017; Zafar et al., 2017); other examples are given in Berk et al. (2018). Group fairness is attractive because it is simple to implement, it often leads to convex optimization problems, and it is easy to verify in practice. However, as argued by Dwork et al. (2012), group fairness cannot give guarantees to individuals, as only average fairness to members of each subgroup is attained.

1.3 Regularization for Group Fairness

In this paper, we build on the work of Pérez-Suay et al. (2017), which falls within the framework of group fairness and was the first work to consider the notion of statistical parity with continuous labels. In particular, independence between predictor and sensitive variables is imposed by employing a kernel dependence measure, namely the Hilbert-Schmidt Independence Criterion (HSIC) (Gretton et al., 2005), as a regularizer in the objective function. Regularization is one of the key concepts in modern supervised learning, which allows imposing structural assumptions and inductive biases onto the problem at hand. It ranges from classical notions of sparsity, shrinkage, and model complexity to the more intricate regularization terms which allow building specific assumptions about the predictors into the objective functions, e.g. smoothness on manifolds (Belkin et al., 2006). Such a regularization viewpoint for algorithmic fairness was presented in Kamishima et al. (2012) in the context of classification, and was extended to regression and dimensionality reduction with kernel methods in Pérez-Suay et al. (2017). Our work extends Pérez-Suay et al. (2017) in the following three ways. Firstly, we give a general framework of empirical risk minimization with fairness regularizers and their interpretation. Secondly, we derive a Gaussian Process (GP) formulation of the fairness regularization framework, which allows uncertainty quantification and principled hyperparameter selection. Finally, we introduce a normalized version of the fairness regularizer which makes it less sensitive to the choice of kernel parameters. We demonstrate how the developed fairness regularization framework trades off the model’s predictive accuracy (with respect to potentially biased data) for independence from the sensitive covariates. It is worth noting that, in our setting, the function which produced the labels is not necessarily the function we wish to learn, so that the predictive accuracy is not necessarily a gold-standard criterion.

The paper is structured as follows. The general framework, together with the relevant background, is developed in §2. In §3, we develop a Gaussian Process (GP) interpretation of kernel dependence regularization. We give some instances of fairness-regularized ERM and their interpretation in §4. §5 describes the normalized kernel dependence regularizer. Experimental results are presented in §6. We conclude the work with some remarks and further work in §7.

2 Regression with dependence penalization

2.1 Fairness regularization framework

We build on a pragmatic definition of fairness following (Chouldechova, 2017). We are given a set of inputs, $x_i \in \mathcal{X}$, and the corresponding targets, $y_i \in \mathcal{Y}$, for $i = 1, \ldots, n$. Furthermore, we have observations of sensitive inputs $s_i \in \mathcal{S}$ (sensitive inputs could be treated as a subset of $x_i$). We take $\{x_i\}_{i=1}^n$ to be an iid sample from an $\mathcal{X}$-valued random variable $X$, and similarly for $S$. For simplicity, we will assume that the inputs are vectorial, i.e. $\mathcal{X} \subseteq \mathbb{R}^d$, and that the targets are scalar, i.e. $\mathcal{Y} \subseteq \mathbb{R}$, but the exposition can be trivially extended to non-Euclidean or structured domains which admit positive definite kernel functions. We let $\mathbf{X} \in \mathbb{R}^{n \times d}$ denote the matrix of observed inputs corresponding to explanatory covariates, $\mathbf{S}$ denotes the set of sensitive (protected) variables, $\mathbf{y} \in \mathbb{R}^n$ denotes the vector of observed targets, which we assume are corrupted with historical biases, and $f: \mathcal{X} \to \mathbb{R}$ is the predictor. We will also introduce the following notions of fair predictors in terms of statistical parity. The fitted predictor $f$ is said to be parity-fair to the sensitive input $S$ if and only if $f(X)$ is statistically independent of $S$. Moreover, it is said to be parity-fair in expectation to the sensitive input $S$ if and only if $\mathbb{E}[f(X) \mid S]$ does not depend on $S$.


We note that parity-fairness implies parity-fairness in expectation but that the converse is not true. For example, it may be possible that the conditional variance $\mathrm{Var}[f(X) \mid S]$ still depends on $S$. For a concrete example, consider the case of modelling income where gender is a sensitive variable. Parity-fairness in expectation implies that the mean predicted income does not depend on the gender, but it is possible that, e.g. the variance is larger for one of the genders. This is hence a weaker notion of fairness, as it may still result in predictions where, say, the top 10% earners all have the same gender.
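The gap between the two notions can be illustrated with a small simulation. The following is a minimal sketch under made-up assumptions (a binary sensitive variable, Gaussian predictions, and the specific means and spreads are all hypothetical and not taken from the paper): the predictions have the same conditional mean in both groups, yet one group dominates the top decile.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Hypothetical setup: binary sensitive variable s and predictions whose
# conditional mean is identical across groups but whose spread is not.
s = rng.integers(0, 2, size=n)        # sensitive group label in {0, 1}
z = rng.standard_normal(n)            # noise drawn independently of s
pred = 50.0 + (1.0 + 2.0 * s) * z     # std 1 when s = 0, std 3 when s = 1

mean_0, mean_1 = pred[s == 0].mean(), pred[s == 1].mean()
std_0, std_1 = pred[s == 0].std(), pred[s == 1].std()

# Share of group s = 1 among the top 10% of predictions.
threshold = np.quantile(pred, 0.9)
top_share_1 = (s[pred >= threshold] == 1).mean()

print(f"group means: {mean_0:.2f} vs {mean_1:.2f}")
print(f"group stds:  {std_0:.2f} vs {std_1:.2f}")
print(f"share of s=1 in top 10%: {top_share_1:.2f}")
```

A predictor like this one would pass a check based on group means while still failing full statistical independence, which is why the stronger notion requires a dependence measure that is sensitive to more than the first moment.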

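Fitting a fairness-regularized predictor $f \in \mathcal{H}$, for some hypothesis class $\mathcal{H}$, reduces to optimizing a regularized empirical risk functional (Kamishima et al., 2012; Pérez-Suay et al., 2017):

$$\min_{f \in \mathcal{H}} \; \frac{1}{n} \sum_{i=1}^n V(f(x_i), y_i) + \lambda\, \Omega(f) + \mu\, I(f(X), S), \qquad (1)$$

where $V$ is the loss function, $\Omega(f)$ acts as an overfitting/complexity penalty on $f$, $I(f(X), S)$ measures the statistical dependence between the model and the protected variables, and $\lambda, \mu \geq 0$ are regularization parameters. By setting $\mu = 0$, standard, yet potentially biased, machine learning models are obtained.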

The framework admits many variants depending on the loss function $V$, the regularizer $\Omega$ and the dependence measure $I$. In (Kamishima et al., 2012), a logistic loss was used and $I$ was a simplified version of the mutual information estimator. In (Pérez-Suay et al., 2017), the hypothesis class was a reproducing kernel Hilbert space (RKHS), and the dependence measure was the Hilbert-Schmidt Independence Criterion (HSIC), based on the norm of a particular cross-covariance operator on RKHSs (Gretton et al., 2005), allowing one to deal with several sensitive variables simultaneously. When combined with the framework of kernel ridge regression, a closed-form solution is obtained. In this paper, we extend the latter formalism and introduce a Gaussian process (GP) treatment of the problem. Then we study the HSIC penalization as a modified GP prior, and explore the aspects of HSIC normalization and the interpretability of the hyperparameters inferred under the GP framework. Before that, let us fix notation and review the basics of GP modeling and kernel-based dependence measures.

2.2 GP models

In GP modeling, observations $y_i$ are assumed to arise from a probabilistic model $p(y_i \mid f(x_i), \theta)$, parametrized by the evaluation $f(x_i)$ of a latent function $f$ at the input $x_i$. Here, $\theta$ is an optional hyperparameter used to rescale the log-likelihood, i.e. $\log p(y_i \mid f(x_i), \theta) = \theta \log p(y_i \mid f(x_i)) + \mathrm{const}$. For example, in GP regression, we assume a normal likelihood, i.e. $y_i \mid f(x_i) \sim \mathcal{N}(f(x_i), \sigma^2)$. Equivalently, the latent function is impaired by a Gaussian noise of variance $\sigma^2$, i.e. $y_i = f(x_i) + \epsilon_i$, $\epsilon_i \sim \mathcal{N}(0, \sigma^2)$, independently over $i$. A Gaussian process prior, typically zero-mean (for example, in regression, it is customary to subtract the sample average from the targets $y_i$, and then to assume a zero-mean model), is placed on the latent function $f$, denoted $f \sim \mathcal{GP}(0, k)$, where $k$ is a covariance function with its own hyperparameters. Advantageously, GPs provide a coherent framework to select the likelihood and kernel hyperparameters by maximizing the marginal log-likelihood, or to pursue a Bayesian treatment of hyperparameters. Moreover, they yield a posterior distribution over predictions for new inputs $x_*$, allowing one to quantify uncertainty and return a predictive posterior of the target $y_*$, not just a point estimate. We will denote latent function evaluations over all inputs as $\mathbf{f} = [f(x_1), \ldots, f(x_n)]^\top$.

2.3 Dependence measures with kernels

Consider random variables $A$ and $B$ taking values in general domains $\mathcal{A}$ and $\mathcal{B}$. Given kernel functions $k_A$ and $k_B$ on $\mathcal{A}$ and $\mathcal{B}$ respectively, with RKHSs $\mathcal{H}_A$ and $\mathcal{H}_B$, the cross-covariance operator is defined as a linear operator $\Sigma_{AB}: \mathcal{H}_B \to \mathcal{H}_A$ such that $\langle g, \Sigma_{AB} h \rangle_{\mathcal{H}_A} = \mathrm{Cov}[g(A), h(B)]$, for all $g \in \mathcal{H}_A$, $h \in \mathcal{H}_B$. The Hilbert-Schmidt Independence Criterion (HSIC) measuring dependence between $A$ and $B$ is then given by the squared Hilbert-Schmidt norm of $\Sigma_{AB}$. HSIC can be understood as a maximum mean discrepancy (MMD) (Gretton et al., 2012) between the joint probability measure of $A$ and $B$ and the product of their marginals. Given the dataset $\{(a_i, b_i)\}_{i=1}^n$ with pairs drawn from the joint $P_{AB}$, an empirical estimator of HSIC is defined as (Gretton et al., 2005):

$$\widehat{\mathrm{HSIC}}(A, B) = \frac{1}{n^2}\, \mathrm{Tr}\left(K_A H K_B H\right), \qquad (2)$$

where $K_A$, $K_B$ are the kernel matrices computed on observations $\{a_i\}$ and $\{b_i\}$ using kernels $k_A$ and $k_B$ respectively, and $H = I_n - \frac{1}{n}\mathbf{1}\mathbf{1}^\top$ has the role of centering the data in the feature space. For a broad family of kernels $k_A$ and $k_B$ (including e.g. Gaussian RBF and Matérn family), the population HSIC equals 0 if and only if $A$ and $B$ are statistically independent, cf. (Gretton et al., 2005). Hence, nonparametric independence tests consistent against all departures from independence can be devised using HSIC estimators with such kernels. Note, however, that the selection of the kernel functions and their parameters has a strong impact on the value of the HSIC estimator. As we will see, this is important when HSIC is used as a regularizer, as it generally leads to different predictive models.

Moreover, HSIC is sensitive to the scale appearing in the marginal distributions of $A$ and $B$ and to their units of measurement, and hence needs an appropriate normalization if it is to serve as a dependence measure useful for, e.g., relative dependence comparisons. This problem is well recognized in the literature, and a normalized version of HSIC, called NOCCO (NOrmalized Cross-Covariance Operator), was introduced in (Fukumizu et al., 2008).
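The empirical HSIC estimator described above is short to compute from two Gram matrices. The following is a minimal sketch, assuming Gaussian RBF kernels and the $1/n^2$ normalization of the biased estimator; the helper names `rbf_gram` and `hsic`, the lengthscales, and the toy data are illustrative choices rather than anything prescribed by the paper.

```python
import numpy as np

def rbf_gram(x, lengthscale=1.0):
    """Gaussian RBF Gram matrix for samples stored in the rows of x."""
    x = np.asarray(x, dtype=float).reshape(len(x), -1)
    sq = np.sum(x**2, axis=1)
    sq_dists = sq[:, None] + sq[None, :] - 2.0 * x @ x.T
    return np.exp(-np.maximum(sq_dists, 0.0) / (2.0 * lengthscale**2))

def hsic(K_a, K_b):
    """Biased empirical HSIC, Tr(K_a H K_b H) / n^2 (normalization assumed)."""
    n = K_a.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n   # centering matrix
    return np.trace(K_a @ H @ K_b @ H) / n**2

rng = np.random.default_rng(0)
n = 300
a = rng.standard_normal(n)
b_dep = a**2 + 0.1 * rng.standard_normal(n)   # depends on a, yet uncorrelated with it
b_ind = rng.standard_normal(n)                # independent of a

hsic_dep = hsic(rbf_gram(a), rbf_gram(b_dep))
hsic_ind = hsic(rbf_gram(a), rbf_gram(b_ind))
# Same data, different lengthscale on b: the estimate changes.
hsic_dep_wide = hsic(rbf_gram(a), rbf_gram(b_dep, lengthscale=10.0))

print(f"dependent pair:   {hsic_dep:.5f}")
print(f"independent pair: {hsic_ind:.5f}")
print(f"dependent pair, wide kernel on b: {hsic_dep_wide:.5f}")
```

The last line is the point of the sensitivity remark: the data are unchanged, but the reported dependence differs once the kernel lengthscale is altered.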

3 Interpretations of HSIC penalization

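Consider a particular instantiation of the regularized functional in (1) given by

$$\min_{f \in \mathcal{H}_k} \; \frac{1}{n} \sum_{i=1}^n V(f(x_i), y_i) + \lambda \|f\|^2_{\mathcal{H}_k} + \mu\, \widehat{\mathrm{HSIC}}(f(X), S), \qquad (3)$$

where we adopted the reproducing kernel Hilbert space (RKHS) $\mathcal{H}_k$ of a kernel $k$ on $\mathcal{X}$ as a hypothesis class and added a fairness penalization term consisting of an estimator of HSIC between the predicted response $f(X)$ and the sensitive variable $S$.

With appropriate choices of kernels on $f(X)$ and $S$, the HSIC regularizer captures all types of statistical dependence between $f(X)$ and $S$. However, we will here focus on fairness in expectation as it will give us a convenient link to GP modelling. Fairness in expectation corresponds to adopting a linear kernel on $f(X)$, i.e., $k_f(f(x_i), f(x_j)) = f(x_i) f(x_j)$. Estimator (2) then simplifies to

$$\widehat{\mathrm{HSIC}}(f(X), S) = \frac{1}{n^2}\, \mathbf{f}^\top H L H\, \mathbf{f}, \qquad (4)$$

where $\mathbf{f} = [f(x_1), \ldots, f(x_n)]^\top$, $L$ is the kernel matrix of a kernel $l$ computed on the sensitive inputs $\{s_i\}$, and $H = I_n - \frac{1}{n}\mathbf{1}\mathbf{1}^\top$ is the centering matrix. Given that this fairness penalty term only depends on the unknown function $f$ through its evaluations $\mathbf{f}$ at the training inputs, direct application of the Representer theorem (Kimeldorf and Wahba, 1970) tells us that the optimal solution can be written as $f = \sum_{i=1}^n \alpha_i k(\cdot, x_i)$. Hence, we obtain the so called dual problem

$$\min_{\alpha \in \mathbb{R}^n} \; \frac{1}{n} \sum_{i=1}^n V\big((K\alpha)_i, y_i\big) + \lambda\, \alpha^\top K \alpha + \frac{\mu}{n^2}\, \alpha^\top K H L H K \alpha, \qquad (5)$$

where $K$ is the kernel matrix with entries $k(x_i, x_j)$, so that $\mathbf{f} = K\alpha$. The problem (5) can now be solved for $\alpha$ directly, and in the case of squared loss, it has a closed form solution (Pérez-Suay et al., 2017).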
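For the squared loss, setting the gradient of the dual objective to zero gives a linear system in $\alpha$. The sketch below assumes the objective $\frac{1}{n}\|\mathbf{y} - K\alpha\|^2 + \lambda \alpha^\top K\alpha + \frac{\mu}{n^2}\alpha^\top KHLHK\alpha$, for which $\alpha = (K + n\lambda I + \frac{\mu}{n} HLHK)^{-1}\mathbf{y}$ is a minimizer; the normalization constants, the function names (`fit_fair_krr`, `hsic_linear`), the kernels, and the toy data are assumptions made for illustration and may differ from the conventions of the cited work.

```python
import numpy as np

def rbf_gram(a, lengthscale=1.0):
    """Gaussian RBF Gram matrix for samples stored in the rows of a."""
    a = np.asarray(a, dtype=float).reshape(len(a), -1)
    sq = np.sum(a**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * a @ a.T
    return np.exp(-np.maximum(d2, 0.0) / (2.0 * lengthscale**2))

def centering(n):
    return np.eye(n) - np.ones((n, n)) / n

def fit_fair_krr(K, L, y, lam, mu):
    """Dual coefficients solving (K + n*lam*I + (mu/n) HLH K) alpha = y."""
    n = len(y)
    M = centering(n) @ L @ centering(n)
    A = K + n * lam * np.eye(n) + (mu / n) * M @ K
    return np.linalg.solve(A, y)

def hsic_linear(f, L):
    """HSIC estimate with a linear kernel on the predictions: f' H L H f / n^2."""
    n = len(f)
    H = centering(n)
    return float(f @ H @ L @ H @ f) / n**2

# Toy data (hypothetical): labels carry a bias that is linear in s.
rng = np.random.default_rng(0)
n = 200
s = rng.standard_normal(n)
x = np.column_stack([s + 0.5 * rng.standard_normal(n), rng.standard_normal(n)])
y = np.sin(x[:, 1]) + s + 0.1 * rng.standard_normal(n)
y = y - y.mean()

K = rbf_gram(x, lengthscale=1.5)
L = rbf_gram(s, lengthscale=1.0)
lam = 1e-2

results = {}
for mu in [0.0, 10.0, 1000.0]:
    alpha = fit_fair_krr(K, L, y, lam, mu)
    f = K @ alpha
    results[mu] = (float(np.mean((y - f) ** 2)), hsic_linear(f, L))
    print(f"mu={mu:7.1f}  train MSE={results[mu][0]:.4f}  HSIC={results[mu][1]:.6f}")
```

With $\mu = 0$ the system reduces to ordinary kernel ridge regression; increasing $\mu$ trades training error on the biased labels for a smaller dependence estimate.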

3.1 Modified Gaussian Process Prior

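For a Bayesian interpretation of (3), we here assume that the loss corresponds to the negative conditional log-likelihood in some probabilistic model, i.e. that $V(f(x_i), y_i) = -\log p(y_i \mid f(x_i))$ up to an additive constant, which is true for a wide class of loss functions. Hence, we will write (3) as:

$$\min_{f \in \mathcal{H}_k} \; -\sum_{i=1}^n \log p(y_i \mid f(x_i), \theta) + \frac{1}{2}\|f\|^2_{\mathcal{H}_k} + \frac{\delta}{2}\, \mathbf{f}^\top H L H\, \mathbf{f}, \qquad (6)$$

where we write $\theta = 1/(2n\lambda)$ and $\delta = \mu/(\lambda n^2)$ (note that the objective (3) is rescaled by $1/(2\lambda)$ such that the regularization parameter $\lambda$ now plays the role of rescaling the log-likelihood).

Consider now using an explicit feature mapping $\phi: \mathcal{X} \to \mathbb{R}^D$ (for the moment assumed finite-dimensional) and denoting the feature matrix by $\Phi = [\phi(x_1), \ldots, \phi(x_n)]^\top \in \mathbb{R}^{n \times D}$. We have $\mathbf{f} = \Phi w$ and thus can recast the optimization as the so called primal problem, with some abuse of notation (we write $-\log p(\mathbf{y} \mid \Phi w, \theta)$ to denote the rescaled conditional negative log-likelihood):

$$\min_{w \in \mathbb{R}^D} \; -\log p(\mathbf{y} \mid \Phi w, \theta) + \frac{1}{2} w^\top w + \frac{\delta}{2}\, w^\top \Phi^\top H L H \Phi\, w. \qquad (7)$$

These problems give us an insight about how the two regularization terms interact. It is well known that solutions to regularized ERM over the RKHS $\mathcal{H}_k$ are closely related to GP models using covariance kernel $k$; for a recent overview, cf. (Kanagawa et al., 2018) and references therein. In particular, by inspecting (7), the two regularization terms correspond, up to an additive constant, to a negative log-prior of $w \sim \mathcal{N}\big(0, (I + \delta\, \Phi^\top H L H \Phi)^{-1}\big)$, which in turn gives a prior on the evaluations $\mathbf{f} = \Phi w$. By directly applying the Woodbury-Morrison formula, the covariance matrix in this prior becomes $(K^{-1} + \delta H L H)^{-1}$, compared to $K$ in the standard GP case. Thus, adding an HSIC regularizer corresponds to modifying the prior on function evaluations $\mathbf{f}$. A natural question arises: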

Question 1: can the fairness-regularized ERM in (3) be interpreted as simply modifying the GP prior on the whole function into a fair GP prior?

As the next proposition shows, the answer to Question 1 is positive. The proof is given in Appendix A.

Proposition 1.

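Solution to (6) corresponds to the posterior mode in a Bayesian model using a modified GP prior

$$f \sim \mathcal{GP}(0, \tilde{k}), \qquad \tilde{k}(x, x') = k(x, x') - \mathbf{k}_{\mathbf{X}x}^\top \left(H L H K + \delta^{-1} I\right)^{-1} H L H\, \mathbf{k}_{\mathbf{X}x'},$$

where $\mathbf{k}_{\mathbf{X}x} = [k(x_1, x), \ldots, k(x_n, x)]^\top$ and $\delta$ is the weight of the fairness term in (6), for any training set $\mathbf{X} = \{x_i\}_{i=1}^n$.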

Several important consequences of the GP interpretation will allow us to improve the fair learning process. In particular, the GP treatment allows us to easily derive uncertainty estimates and perform hyperparameter learning using marginal log-likelihood maximization, which is more practical than the typical cross-validation strategy, limited to simple parameterizations. More importantly, appropriate inference of the model (parameters and hyperparameters) thus yields closer insight into the fairness tradeoffs.
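The effect of the fairness term on the prior over training evaluations can be checked numerically. The sketch below assumes a prior precision of the form $K^{-1} + \delta HLH$ and compares it with the equivalent expression $K - K(HLHK + \delta^{-1} I)^{-1} HLHK$ obtained by a matrix inversion identity; the value of `delta`, the kernels, the nugget added to `K`, and the toy data are all hypothetical choices made for illustration.

```python
import numpy as np

def rbf_gram(a, lengthscale=1.0):
    """Gaussian RBF Gram matrix for samples stored in the rows of a."""
    a = np.asarray(a, dtype=float).reshape(len(a), -1)
    sq = np.sum(a**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * a @ a.T
    return np.exp(-np.maximum(d2, 0.0) / (2.0 * lengthscale**2))

rng = np.random.default_rng(1)
n, delta = 40, 2.0                      # delta: assumed weight of the fairness term
x = rng.standard_normal(n)
s = x + 0.3 * rng.standard_normal(n)    # sensitive variable correlated with x

K = rbf_gram(x) + 1e-3 * np.eye(n)      # small nugget keeps K well conditioned
L = rbf_gram(s)
H = np.eye(n) - np.ones((n, n)) / n
M = H @ L @ H

# Two algebraically equivalent forms of the modified prior covariance.
cov_precision_form = np.linalg.inv(np.linalg.inv(K) + delta * M)
cov_kernel_form = K - K @ np.linalg.solve(M @ K + np.eye(n) / delta, M @ K)
max_gap = np.max(np.abs(cov_precision_form - cov_kernel_form))

# Expected dependence E[f' H L H f] / n^2 under each prior.
dep_standard = np.trace(M @ K) / n**2
dep_fair = np.trace(M @ cov_kernel_form) / n**2

print(f"max abs difference between the two forms: {max_gap:.2e}")
print(f"expected dependence, standard prior: {dep_standard:.5f}")
print(f"expected dependence, modified prior: {dep_fair:.5f}")
```

The second form avoids inverting $K$ and only involves Gram matrices, which is what makes the modified covariance usable inside standard GP tooling. Under the modified prior, function draws are on average less dependent on the sensitive variable before any labels are observed.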

3.2 Projections using Cross-Covariance Operators

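We can derive an additional interpretation of the fairness regularizer in terms of cross-covariance operators. Namely, by considering an explicit feature map $\psi$ corresponding to the kernel $l$, and denoting the feature matrix by $\Psi$, i.e. $L = \Psi\Psi^\top$, we see that the fairness regularizer in (5) reads

$$\frac{\mu}{n^2}\, \alpha^\top K H L H K \alpha = \frac{\mu}{n^2}\, w^\top \Phi^\top H \Psi \Psi^\top H \Phi\, w = \mu\, \|\hat{\Sigma}_{SX}\, w\|^2,$$

where $w = \Phi^\top \alpha$ and $\hat{\Sigma}_{SX} = \frac{1}{n} \Psi^\top H \Phi$ is the empirical cross-covariance matrix between feature vectors $\psi(s_i)$ and $\phi(x_i)$. This interpretation also holds in the case of infinite-dimensional RKHSs $\mathcal{H}_k$ and $\mathcal{H}_l$. For an infinite-dimensional version of the primal formulation, we define the sampling operator $S_{\mathbf{x}}: \mathcal{H}_k \to \mathbb{R}^n$, $S_{\mathbf{x}} f = [f(x_1), \ldots, f(x_n)]^\top$. Then the HSIC regularizer becomes

$$\frac{\mu}{n^2}\, \langle f, S_{\mathbf{x}}^* H L H S_{\mathbf{x}} f \rangle_{\mathcal{H}_k},$$

where the adjoint $S_{\mathbf{x}}^*: \mathbb{R}^n \to \mathcal{H}_k$ acts as $S_{\mathbf{x}}^* a = \sum_{i=1}^n a_i k(\cdot, x_i)$. Moreover, if we define similarly the sampling operator $S_{\mathbf{s}}$ for kernel $l$, i.e. $S_{\mathbf{s}}: \mathcal{H}_l \to \mathbb{R}^n$, with $S_{\mathbf{s}} h = [h(s_1), \ldots, h(s_n)]^\top$, then $L = S_{\mathbf{s}} S_{\mathbf{s}}^*$ and $\hat{\Sigma}_{XS} = \frac{1}{n} S_{\mathbf{x}}^* H S_{\mathbf{s}}$, $\hat{\Sigma}_{SX} = \frac{1}{n} S_{\mathbf{s}}^* H S_{\mathbf{x}}$. Here, $\hat{\Sigma}_{XS}$ and $\hat{\Sigma}_{SX}$ are the empirical cross-covariance operators (Fukumizu et al., 2008), i.e.

$$\hat{\Sigma}_{SX} = \frac{1}{n} \sum_{i=1}^n \big(l(\cdot, s_i) - \hat{m}_S\big) \otimes \big(k(\cdot, x_i) - \hat{m}_X\big), \qquad \hat{m}_S = \frac{1}{n}\sum_{i=1}^n l(\cdot, s_i), \quad \hat{m}_X = \frac{1}{n}\sum_{i=1}^n k(\cdot, x_i).$$

Thus, the overall objective can be written as

$$\min_{f \in \mathcal{H}_k} \; -\log p(\mathbf{y} \mid S_{\mathbf{x}} f, \theta) + \frac{1}{2} \big\langle f, \big(I + \delta n^2\, \hat{\Sigma}_{XS} \hat{\Sigma}_{SX}\big) f \big\rangle_{\mathcal{H}_k}, \qquad \delta = \frac{\mu}{\lambda n^2}.$$

Here, $I$ denotes the identity on $\mathcal{H}_k$. Hence, the additional regularization term is up to scaling simply

$$\|\hat{\Sigma}_{SX} f\|^2_{\mathcal{H}_l} = \sum_j \langle e_j, \hat{\Sigma}_{SX} f \rangle^2_{\mathcal{H}_l} = \sum_j \widehat{\mathrm{Cov}}\big(e_j(S), f(X)\big)^2,$$

where $\{e_j\}$ is an arbitrary orthonormal basis of $\mathcal{H}_l$. This gives another insight into the fairness regularizer as an action of the empirical cross-covariance operator between sensitive and remaining inputs $S$ and $X$ on the learned function $f$ (note that this operator is different from the cross-covariance operator defining HSIC in (3) itself, as the latter pertains to cross-covariance between $f(X)$ and $S$). As we shall see, this perspective will also allow us to construct a normalized version of the fairness regularizer in §5.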

4 Instances of dependence-regularized learning

In this section, we give two concrete examples of fair learning and illustrate how the fairness penalty enforces fairness in both the ridge regression setting and the Bayesian learning setting. As before, we denote by $\Sigma_{SX}$ the cross-covariance operator and by $\hat{\Sigma}_{SX}$ its empirical version.

Fair Linear Regression

We start with the simple case of linear regression. We note that the kernel $k$ on $\mathcal{X}$ is then linear, while the kernel $l$ on $\mathcal{S}$ need not be. For simplicity, let us assume that $\mathcal{H}_l$ is finite-dimensional and write its explicit feature map as $\psi: \mathcal{S} \to \mathbb{R}^m$. Thus, we have the following minimization problem:

$$\min_{w \in \mathbb{R}^d} \; \frac{1}{n} \sum_{i=1}^n (y_i - w^\top x_i)^2 + \lambda \|w\|^2 + \mu\, \|\hat{\Sigma}_{SX}\, w\|^2. \qquad (11)$$

The purpose of fair linear regression is to predict $y$ from inputs $x$ while ensuring that the predictions are independent of the sensitive variable $s$. From (11), we see that the HSIC regularizer penalizes the weighted norm of $w$. The weight on each dimension of $w$ is guided by the cross-covariance operator $\hat{\Sigma}_{SX}$. As a result, if a dimension in $x$ has a high covariance with any of the entries in $\psi(s)$, its corresponding coefficient will be shrunk towards zero, leading to a low covariance between $w^\top x$ and $\psi(s)$. This can be illustrated by the following toy case. Since feature spaces are finite-dimensional, we can treat $\hat{\Sigma}_{SX}$ as an $m \times d$ matrix. Say that $m \leq d$ and that $\hat{\Sigma}_{SX}$ has zero off-diagonal entries (i.e. the only non-zero cross-correlations are between the $j$-th dimension of $x$ and the $j$-th dimension of $\psi(s)$). We further enlarge $\hat{\Sigma}_{SX}$ to be a $d \times d$ matrix by appending zeros. We denote the diagonal elements of the enlarged matrix as $c_1, \ldots, c_d$. As a result, $\hat{\Sigma}_{SX}^\top \hat{\Sigma}_{SX}$ is symmetric and diagonal with diagonal elements $c_j^2$. Now the regularization terms in Eq. (11) are simply:

$$\lambda \|w\|^2 + \mu\, \|\hat{\Sigma}_{SX}\, w\|^2 = \sum_{j=1}^d (\lambda + \mu c_j^2)\, w_j^2.$$

Since we aim at minimizing the penalty term $\sum_j (\lambda + \mu c_j^2) w_j^2$, the coefficient $w_j$ is likely to be low if the corresponding feature has high covariance with $\psi(s)$, i.e. high $c_j^2$. In the extreme case where $\mu \to \infty$, $w_j \to 0$ for all the features that have non-zero covariance with $\psi(s)$, i.e. $c_j^2 > 0$. Moreover, if feature $j$ has $c_j = 0$, its coefficient is unaffected by the extra penalization. In practice of course, $\hat{\Sigma}_{SX}$ is rarely diagonal, but the general idea is the same: the regularizer simply takes into account all cross-correlations to determine the penalty on each coefficient.
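The shrinkage mechanism can be seen in a small linear example. The sketch below assumes the objective $\frac{1}{n}\|\mathbf{y} - \mathbf{X}w\|^2 + \lambda\|w\|^2 + \mu\|\hat{\Sigma}_{SX} w\|^2$, whose minimizer solves $(\mathbf{X}^\top\mathbf{X}/n + \lambda I + \mu \hat{\Sigma}_{SX}^\top \hat{\Sigma}_{SX})\, w = \mathbf{X}^\top \mathbf{y}/n$. The function name `fit_fair_linear`, the single linear feature used for the sensitive variable, and the synthetic data are illustrative assumptions, so the cross-covariance matrix here is a single row rather than the diagonal matrix of the toy case in the text.

```python
import numpy as np

def fit_fair_linear(X, Psi, y, lam, mu):
    """Ridge regression with an added penalty mu * ||Sigma_sx w||^2."""
    n, d = X.shape
    H = np.eye(n) - np.ones((n, n)) / n
    Sigma_sx = Psi.T @ H @ X / n                 # m x d empirical cross-covariance
    A = X.T @ X / n + lam * np.eye(d) + mu * Sigma_sx.T @ Sigma_sx
    return np.linalg.solve(A, X.T @ y / n), Sigma_sx

rng = np.random.default_rng(0)
n = 5000
s = rng.standard_normal(n)
# Feature 0 is strongly correlated with s; feature 1 is independent of s.
X = np.column_stack([s + 0.5 * rng.standard_normal(n), rng.standard_normal(n)])
y = X[:, 0] + X[:, 1] + 0.1 * rng.standard_normal(n)
Psi = s[:, None]                                 # hypothetical linear feature map of s

w_std, _ = fit_fair_linear(X, Psi, y, lam=1e-3, mu=0.0)
w_fair, Sigma_sx = fit_fair_linear(X, Psi, y, lam=1e-3, mu=1e4)

print("cross-covariance with s:", np.round(Sigma_sx.ravel(), 3))
print("standard coefficients:  ", np.round(w_std, 3))
print("fair coefficients:      ", np.round(w_fair, 3))
```

The coefficient of the feature that co-varies with the sensitive variable is driven towards zero, while the coefficient of the unrelated feature is left almost untouched.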

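We now turn to the Bayesian perspective. Note that (11) is equivalent to the following Bayesian linear regression model

$$y_i \mid w \sim \mathcal{N}(w^\top x_i, \, n\lambda), \qquad w \sim \mathcal{N}\Big(0, \big(I + \tfrac{\mu}{\lambda}\, \hat{\Sigma}_{SX}^\top \hat{\Sigma}_{SX}\big)^{-1}\Big), \qquad (12)$$

in the sense that the posterior mode of $w$ in (12) coincides with the minimizer of (11). Comparing to the normal Bayesian linear regression, we can see that this version simply modifies the prior on $w$. The same interpretation holds: assuming $\hat{\Sigma}_{SX}^\top \hat{\Sigma}_{SX}$ is diagonal and denoting the $j$-th diagonal element of $\hat{\Sigma}_{SX}^\top \hat{\Sigma}_{SX}$ as $c_j^2$, the prior covariance matrix is a diagonal matrix with $j$-th diagonal element $1/(1 + \tfrac{\mu}{\lambda} c_j^2)$. This means that we modify our prior such that the coefficients corresponding to the features with high $c_j^2$ are shrunk towards zero.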

Fair Kernel Ridge Regression

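We now consider the nonlinear case with RKHSs $\mathcal{H}_k$ and $\mathcal{H}_l$ corresponding to feature maps $\phi$ and $\psi$ respectively. We denote the transformed data as $\Phi$ and $\Psi$ with the corresponding Gram matrices $K = \Phi\Phi^\top$ and $L = \Psi\Psi^\top$. We extend the fair learning problem to the nonlinear case as the following optimization problem:

$$\min_{f \in \mathcal{H}_k} \; \frac{1}{n} \sum_{i=1}^n (y_i - f(x_i))^2 + \lambda \|f\|^2_{\mathcal{H}_k} + \mu\, \|\hat{\Sigma}_{SX} f\|^2_{\mathcal{H}_l}. \qquad (13)$$

The interpretation is similar to the linear case: we penalize more heavily the coefficient of a feature in $\phi(x)$ that has a large covariance with the sensitive features $\psi(s)$.

Let us explore the Bayesian treatment of the nonlinear fair learning problem. In the weight space view, Eq. (13) corresponds to the same model as (12), with $x_i$ replaced by $\phi(x_i)$. However, we can readily derive the GP formulation. For any kernel $k$ where $k(x, x') = \langle \phi(x), \phi(x') \rangle$, the GP model is given by

$$f \sim \mathcal{GP}(0, \tilde{k}), \qquad y_i \mid f(x_i) \sim \mathcal{N}(f(x_i), \sigma^2), \qquad \tilde{k}(x, x') = \big\langle \phi(x), \big(I + \tfrac{\mu}{\lambda}\, \hat{\Sigma}_{XS} \hat{\Sigma}_{SX}\big)^{-1} \phi(x') \big\rangle. \qquad (14)$$

While it is not obvious that $\tilde{k}$ is tractable, as it involves the operator $\hat{\Sigma}_{XS}\hat{\Sigma}_{SX}$, Proposition 1 proves that $\tilde{k}(x, x') = k(x, x') - \mathbf{k}_{\mathbf{X}x}^\top (H L H K + \delta^{-1} I)^{-1} H L H\, \mathbf{k}_{\mathbf{X}x'}$ with $\delta = \mu/(\lambda n^2)$, and hence one can readily employ this kernel as a modified GP prior and make use of the extensive GP modeling toolbox. We also note that we can treat the kernel parameters of $k$ and $l$, as well as $\delta$, simply as parameters of $\tilde{k}$.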

5 Normalized dependence regularizers

We have explained fair kernel learning and introduced its corresponding Gaussian process version. We also provided another view of the dependence penalizer as the weighted norm of the coefficients, where the weights are given by the cross-covariance operator. However, one issue with this framework is that the dependence measure is sensitive to the kernel parameters. For example, if we look at problem (3), the extra penalty term is sensitive to the hyperparameters of the kernels $k$ and $l$. Notice that varying the hyperparameters of $l$ does not affect the other two terms in the objective function, so one could simply adjust them to reduce the HSIC value and hence reduce the objective function value. The unfairness, however, is not reduced. Hence, one needs a parameter-invariant dependence measure to avoid such an issue. As a result, we introduce the normalized fair learning framework in this section.

As shown in Eq. 3, fairness is enforced through using HSIC value as the penalizer. Hence, a naive way of dealing with parameter sensitivity is to use the normalized version of HSIC. This has been extensively studied in (Fukumizu et al., 2008) where the so called NOCCO was proposed. Replacing HSIC with Hilbert-Schmidt norm of NOCCO in Eq. 3, the fair learning is the following optimization problem:


where is the Hilbert-Schmidt norm of NOCCO between and . , and is the regularization parameter used in the same way as in (Fukumizu et al., 2008). Since we are using the linear kernel for and , we have . However, problem (15) does not admit a closed form solution. The reason is that the derivative of is not linear in . Hence, we ask the following question:

Question 2: can we find a normalized fair learning which admits a closed form solution?

It turns out that the cross-covariance view of fair learning provides us a way to answer this question. In (11), we used the empirical cross-covariance operator $\hat{\Sigma}_{SX}$ as the penalizer. To avoid the parameter sensitivity issue, we could use the normalized cross-covariance operator $V_{SX} = \Sigma_{SS}^{-1/2} \Sigma_{SX} \Sigma_{XX}^{-1/2}$ to replace $\Sigma_{SX}$. Let $\hat{V}_{SX}$ be the empirical version of $V_{SX}$; the learning problem is now:


This leads to a closed-form solution as

In the case where $\mathcal{H}_k$ and $\mathcal{H}_l$ are finite dimensional, the above provides a valid solution to the normalized fair learning problem. However, this is not the case if either of $\mathcal{H}_k$ and $\mathcal{H}_l$ is infinite dimensional, since we then face the problem of evaluating the $\Sigma_{XX}^{-1/2}$ and $\Sigma_{SS}^{-1/2}$ terms, which are infinite-dimensional operators.

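To remedy this issue, we notice that the HSIC is potentially sensitive to parameters from $k$ and $l$. During the optimization process, parameters from $k$ are tuned from the data, while the parameters from $l$ are free to adjust. Hence, one could only partially normalize the cross-covariance operator with respect to hyperparameters from $l$ and formulate the following learning problem:

$$\min_{f \in \mathcal{H}_k} \; \frac{1}{n} \sum_{i=1}^n (y_i - f(x_i))^2 + \lambda \|f\|^2_{\mathcal{H}_k} + \mu\, \big\|(\hat{\Sigma}_{SS} + \epsilon I)^{-1/2}\, \hat{\Sigma}_{SX} f\big\|^2_{\mathcal{H}_l}. \qquad (17)$$
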
This gives us a closed-form solution as


where in the second equality we applied the Woodbury matrix inversion lemma (in computing , we use instead to avoid the issue with a non-invertible matrix). As a result, the prediction at the training point for fair learning is


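We provide a justification for (17) via the conditional covariance operator. For any two random variables $X$ and $S$, we define the conditional covariance operator

$$\Sigma_{XX|S} = \Sigma_{XX} - \Sigma_{XS} \Sigma_{SS}^{-1} \Sigma_{SX}.$$

It has been shown in (Fukumizu et al., 2009, Proposition 2) that, for any $g \in \mathcal{H}_k$,

$$\langle g, \Sigma_{XX|S}\, g \rangle_{\mathcal{H}_k} = \inf_{h \in \mathcal{H}_l} \mathbb{E}\big| (g(X) - \mathbb{E}\, g(X)) - (h(S) - \mathbb{E}\, h(S)) \big|^2. \qquad (19)$$

Notice that Eq. (19) is the minimal residual error when we use $h(S)$ to predict $g(X)$, for any $g \in \mathcal{H}_k$. In other words, it is the variance in $g(X)$ that cannot be explained by $S$. Since $\langle g, \Sigma_{XX}\, g \rangle$ represents the variance of $g(X)$, we can treat $\langle g, \Sigma_{XS} \Sigma_{SS}^{-1} \Sigma_{SX}\, g \rangle$ as the maximal amount of variance of $g(X)$ that can be explained by $S$. In (17), the penalty is the empirical counterpart of this quantity with $g = f$, so minimizing this term is equivalent to minimizing the amount of variance in $f(X)$ that can be explained by $S$. This is essentially the same as minimizing the dependence between $f(X)$ and $S$. Furthermore, if $l$ is a universal kernel (e.g. Gaussian kernel, Laplace kernel, etc.; refer to (Sriperumbudur et al., 2011) for more details on universal kernels), Eq. (19) can be rewritten as

$$\langle g, \Sigma_{XX|S}\, g \rangle_{\mathcal{H}_k} = \inf_{h \in L^2(P_S)} \mathbb{E}\big| (g(X) - \mathbb{E}\, g(X)) - (h(S) - \mathbb{E}\, h(S)) \big|^2 = \mathbb{E}\big[\mathrm{Var}(g(X) \mid S)\big], \qquad (20)$$

where $L^2(P_S)$ is the space of all square integrable functions defined on $\mathcal{S}$. In this case, $\langle g, \Sigma_{XS} \Sigma_{SS}^{-1} \Sigma_{SX}\, g \rangle = \mathrm{Var}\big(\mathbb{E}[g(X) \mid S]\big)$ quantifies the maximal amount of variance in $g(X)$ that can be explained by $S$. Note that this quantity is independent of the choice of $l$, which is particularly useful in the normalized fair learning problem. The reason is that, although in defining $\Sigma_{XS} \Sigma_{SS}^{-1} \Sigma_{SX}$ we rely on the hyperparameter from kernel $l$, the quantity itself is independent of that hyperparameter. In other words, varying the hyperparameter of $l$ will not affect its value. This justifies the usage of $\langle f, \Sigma_{XS} \Sigma_{SS}^{-1} \Sigma_{SX}\, f \rangle$ as the penalty term in normalized fair learning.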
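The invariance argument can be checked in a finite-dimensional setting. The sketch below uses explicit features of the sensitive variable as a stand-in for a kernel feature map and compares the unnormalized penalty $\|\hat{\Sigma}_{SX} w\|^2$ with the normalized quantity $w^\top \hat{\Sigma}_{XS} (\hat{\Sigma}_{SS} + \epsilon I)^{-1} \hat{\Sigma}_{SX} w$; rescaling the features plays the role of changing a kernel hyperparameter. The function name `dependence_penalties`, the feature choice, the value of `eps`, and the data are assumptions made for illustration.

```python
import numpy as np

def dependence_penalties(X, Psi, w, eps=1e-8):
    """Return (unnormalized, normalized) dependence penalties for g(x) = w'x."""
    n = X.shape[0]
    Xc = X - X.mean(axis=0)
    Pc = Psi - Psi.mean(axis=0)
    Sigma_sx = Pc.T @ Xc / n
    Sigma_ss = Pc.T @ Pc / n
    v = Sigma_sx @ w
    unnormalized = float(v @ v)
    normalized = float(v @ np.linalg.solve(Sigma_ss + eps * np.eye(Sigma_ss.shape[0]), v))
    return unnormalized, normalized

rng = np.random.default_rng(0)
n = 2000
s = rng.standard_normal(n)
X = np.column_stack([s + 0.5 * rng.standard_normal(n), rng.standard_normal(n)])
w = np.array([1.0, -0.5])
Psi = np.column_stack([s, s**2])          # hypothetical finite-dimensional features of s

unnorm_1, norm_1 = dependence_penalties(X, Psi, w)
unnorm_2, norm_2 = dependence_penalties(X, 10.0 * Psi, w)   # rescaled features

# Variance of g(X) explained by a least-squares fit on the features of s.
g = (X - X.mean(axis=0)) @ w
Pc = Psi - Psi.mean(axis=0)
beta, *_ = np.linalg.lstsq(Pc, g, rcond=None)
explained = np.var(g) - np.mean((g - Pc @ beta) ** 2)

print(f"unnormalized: {unnorm_1:.4f} -> {unnorm_2:.4f} after rescaling")
print(f"normalized:   {norm_1:.4f} -> {norm_2:.4f} after rescaling")
print(f"explained variance from regression: {explained:.4f}")
```

The unnormalized penalty changes by the square of the rescaling factor, whereas the normalized one stays put and matches the variance explained by regressing $g(X)$ on the features of $s$.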

6 Experiments

In this section, we illustrate the performance of the proposed methods on both synthetic and real-data problems, and study the effect of the fairness regularization. We first study performance in simulated toy datasets that allow us to study the error-vs-dependence paths and demonstrate the potential of proposed approaches in controlled scenarios. Secondly, we study the effect of the normalized dependence regularizer as well as the use of the GP formulation in contrast to the ERM framework, i.e. kernel ridge regression, in two real-data fairness problems: crime prediction and income prediction.

6.1 Toy dataset 1

We start by demonstrating the effectiveness of the proposed fairness framework by comparing it to two other baselines based on the fairness literature. The first approach is simple omission of the sensitive variable (OSV), where we use all the features except the prespecified sensitive variable. The second one mimics the ideas of fair representation learning (FRL) Zemel et al. (2013) where the input data is transformed such that it contains as much information as possible from the original data while simultaneously being statistically independent from the sensitive variable. The transformed data is then used for learning. The dataset we consider is as follows: we first sample independently from ; assuming is unobserved, we let the sensitive variable be . Obviously, and are correlated. Let the true function of interest be

where . It is readily checked that is marginally independent of the sensitive variable . We now further assume that the observations include a bias that is based on the sensitive variable

i.e. the observations are on average increased by when and decreased by otherwise. Given data , our task is to find a best fit while preserving fairness in terms of statistical parity. Clearly, simply removing the sensitive variable while training the model is not appropriate as the bias in the observations is correlated with as well and will thus be retained. Alternatively, we may want to fully remove all dependence on from the inputs. This simply corresponds to transforming as follows:


However, this shows the danger of such an approach: we now have an input independent of the sensitive variable, but the true function is marginally independent of $s$ as well, and hence the transformed variable will not be useful for learning. Hence, the fairness regularization on the predictor provides a remedy: it directly penalizes the dependence between the predictor and the sensitive variable rather than between the inputs and the sensitive variable, which does not take into account the learning problem at hand.

We compare the performances between the following approaches: standard kernel ridge regression (KRR) and Gaussian process regression (GPR) without data modification; fairness regularization (both KRR and GPR versions) with different $\mu$ (refer to Eq. 3) values; OSV and FRL. In the case of kernel ridge regression (KRR), we choose the kernel lengthscale and regularization parameter with cross-validation, and in the GP versions, we choose them via maximization of the marginal likelihood. We measure the performance of each model through the coefficient of determination $R^2$ with respect to both the observed responses and the true function values. By definition of $R^2$, we would expect the standard approach to achieve the highest $R^2$ (on biased data) as it utilizes all the available information, whereas FRL would have the lowest score. For fairness regularization, this will depend on the value of $\mu$, i.e. a model with high $\mu$ will have low $R^2$. Looking at Table 1, we do see this pattern. On the other hand, if we consider $R^2$ on the true function values, we see that it tends to increase with higher $\mu$, i.e. higher fairness regularization improves the removal of the bias present in the observed responses from the predictors. As expected, FRL detects no signal on this data, and OSV also leads to a significant drop in $R^2$. In addition, since the GP version allows us to systematically select the hyperparameters, we can see that in most cases, $R^2$ from the GP model will be higher than that of its kernel regression version.

We next report the correlation between the predicted value and $s$. Likewise, we would expect that the standard approach will have the highest correlation while FRL will have the lowest correlation. For fairness regularization, the correlation decreases as $\mu$ increases. Table 2 reports these results. We see that OSV still has a high correlation to the sensitive variable $s$. In contrast, the GPR with the largest $\mu$ considered allows a predictor that is essentially uncorrelated with $s$ while having strong performance.

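| Approach | KRR, $R^2$ wrt. observations | GPR, $R^2$ wrt. observations | KRR, $R^2$ wrt. true function | GPR, $R^2$ wrt. true function |
|---|---|---|---|---|
| Standard | 0.606 ± 0.002 | 0.612 ± 0.002 | 0.332 ± 0.003 | 0.356 ± 0.003 |
| Fairness-regularized | 0.600 ± 0.002 | 0.610 ± 0.001 | 0.358 ± 0.003 | 0.335 ± 0.002 |
| Fairness-regularized | 0.567 ± 0.001 | 0.586 ± 0.009 | 0.341 ± 0.005 | 0.394 ± 0.010 |
| Fairness-regularized | 0.488 ± 0.008 | 0.506 ± 0.012 | 0.466 ± 0.011 | 0.472 ± 0.008 |
| Fairness-regularized | 0.384 ± 0.011 | 0.403 ± 0.005 | 0.321 ± 0.014 | 0.530 ± 0.004 |
| OSV | 0.238 ± 0.007 | 0.196 ± 0.013 | 0.123 ± 0.008 | 0.098 ± 0.019 |
| FRL | -0.021 ± 0.002 | -0.009 ± 0.001 | -0.024 ± 0.002 | -0.011 ± 0.001 |

Table 1: The $R^2$ wrt. observations (left) and wrt. true function values (right).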
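| Approach | KRR | GPR |
|---|---|---|
| Standard | 0.3917 ± 0.0011 | 0.3863 ± 0.0013 |
| Fairness-regularized | 0.4053 ± 0.0019 | 0.3853 ± 0.0024 |
| Fairness-regularized | 0.3337 ± 0.0104 | 0.3257 ± 0.0206 |
| Fairness-regularized | 0.1364 ± 0.0150 | 0.2234 ± 0.0455 |
| Fairness-regularized | 0.1066 ± 0.0078 | 0.0139 ± 0.0031 |
| OSV | 0.2976 ± 0.0053 | 0.3195 ± 0.0058 |
| FRL | -0.0010 ± 0.0012 | -0.0102 ± 0.0013 |

Table 2: The correlation between the predicted values and the sensitive variable.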
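To make the evaluation protocol concrete, the following is a minimal sketch of fairness-regularized kernel ridge regression with a closed-form solution, together with the two reported metrics ($R^2$ and correlation with the sensitive variable). It is not the authors' code: the data generator `make_toy_data`, the symbols `lam` and `eta`, the kernel lengthscales, and the normalization constants in the objective are all assumptions made for illustration.

```python
import numpy as np


def rbf_kernel(a, b, lengthscale):
    """Gaussian (RBF) kernel matrix between the rows of a and b."""
    sq = np.sum(a**2, axis=1)[:, None] + np.sum(b**2, axis=1)[None, :] - 2.0 * a @ b.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * lengthscale**2))


def make_toy_data(n=200, seed=0):
    """Hypothetical biased data: the observations are shifted up or down by s."""
    rng = np.random.default_rng(seed)
    s = rng.integers(0, 2, size=n).astype(float)
    x = rng.normal(size=n) + 0.5 * s          # input correlated with s
    f_true = np.sin(x)                        # unbiased target function
    y = f_true + 0.5 * (2.0 * s - 1.0) + 0.1 * rng.normal(size=n)
    return x[:, None], s[:, None], f_true, y


def fair_krr_fit(K, K_s, y, lam, eta):
    """Dual weights for squared loss + RKHS norm + HSIC penalty.

    Assumed objective (constants chosen for this sketch):
        (1/n)||y - K a||^2 + lam * a'K a + (eta/n^2) * a'K H K_s H K a
    A zero gradient gives (K + n*lam*I + (eta/n) H K_s H K) a = y.
    """
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n       # centering matrix
    A = K + n * lam * np.eye(n) + (eta / n) * H @ K_s @ H @ K
    return np.linalg.solve(A, y)


def r_squared(y_true, y_pred):
    """Coefficient of determination."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1.0 - ss_res / ss_tot


def run_demo(etas=(0.0, 1.0, 100.0)):
    """In-sample metrics for increasing fairness parameter eta."""
    X, S, f_true, y = make_toy_data()
    K = rbf_kernel(X, X, 1.0)
    K_s = rbf_kernel(S, S, 0.5)
    s = S[:, 0]
    results = []
    for eta in etas:
        pred = K @ fair_krr_fit(K, K_s, y, lam=1e-3, eta=eta)
        results.append({
            "eta": eta,
            "r2_obs": r_squared(y, pred),
            "r2_true": r_squared(f_true, pred),
            "cov": np.mean((pred - pred.mean()) * (s - s.mean())),
            "corr": np.corrcoef(pred, s)[0, 1],
        })
    return results


if __name__ == "__main__":
    for row in run_demo():
        print(row)
```

Setting `eta=0` recovers ordinary kernel ridge regression, which mirrors the "Standard" row of the tables. For a binary sensitive variable, the penalty in this sketch is proportional to the squared covariance between predictions and `s`, so that covariance cannot grow as `eta` increases.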

6.2 Toy dataset 2

We next consider a simple simulated dataset following the model from Pérez-Suay et al. (2017):

As in the previous example, even if we omit the sensitive variable, the remaining variables are dependent on it. We use this dataset to study how normalizing the HSIC regularizer affects the trade-off between predictive performance and dependence on the sensitive variable. We compare the following methods on toy dataset 2: Kernel Ridge Regression (KRR), Fair Kernel Learning (FKL), and Normalized Fair Kernel Learning (NFKL). To validate the behavior of the proposed methods, we use the RMSE as the error measure of the predictions. As fairness measures, we use both the HSIC and Mutual Information (MI) estimates between the output predictions and the sensitive variables. We ran repeated trials, with separate sets of points for training the algorithms and for the final test validation. The values of the fairness parameter were logarithmically spaced over a fixed range. The kernel lengthscale and the regularization parameter were chosen by cross-validation over logarithmically spaced grids. For NFKL, one further parameter was kept fixed. Figure 1 shows the averaged results. The standard KRR method (corresponding to no fairness regularization) achieves the best RMSE, but it is also the most unfair in terms of both dependence measures. The proposed fairness regularization approaches mitigate the unfairness of the predictors by trading it off against the RMSE: varying the fairness regularization parameter produces the unfairness/error curves shown in Figure 1. We see that NFKL outperforms FKL, i.e. the normalization of the regularizer substantially improves this trade-off.

Figure 1: (left) RMSE vs. HSIC of the ERM versions for different values of the fairness parameter. (right) RMSE vs. Mutual Information of the ERM versions for different values of the fairness parameter.
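The two axes of such unfairness/error curves can be estimated as in the sketch below. It uses the standard biased empirical HSIC estimator and, as an assumption about what "normalized" means here, a NOCCO-style normalized estimator in the spirit of Fukumizu et al. (2008). The function names, the RBF lengthscale and the regularizer `eps` are illustrative choices, not values from the experiments.

```python
import numpy as np


def rbf_gram(v, lengthscale=1.0):
    """RBF Gram matrix of a one-dimensional sample v."""
    d = v[:, None] - v[None, :]
    return np.exp(-d**2 / (2.0 * lengthscale**2))


def center(K):
    """Doubly center a Gram matrix: H K H with H = I - 11'/n."""
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    return H @ K @ H


def hsic(K, L):
    """Biased empirical HSIC, tr(K H L H) / n^2."""
    n = K.shape[0]
    return np.trace(center(K) @ L) / n**2


def normalized_hsic(K, L, eps=1e-3):
    """NOCCO-style normalized dependence, tr(R_K R_L),
    with R_K = Kc (Kc + n*eps*I)^{-1} and Kc the centered Gram matrix."""
    n = K.shape[0]
    Kc, Lc = center(K), center(L)
    Rk = np.linalg.solve(Kc + n * eps * np.eye(n), Kc)
    Rl = np.linalg.solve(Lc + n * eps * np.eye(n), Lc)
    return np.trace(Rk @ Rl)


def rmse(y_true, y_pred):
    """Root mean squared error."""
    return np.sqrt(np.mean((y_true - y_pred) ** 2))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    s = rng.normal(size=300)                    # sensitive variable
    dependent = s + 0.1 * rng.normal(size=300)  # predictions that leak s
    independent = rng.normal(size=300)          # predictions unrelated to s
    Ks = rbf_gram(s)
    for name, pred in [("dependent", dependent), ("independent", independent)]:
        Kp = rbf_gram(pred)
        print(name, hsic(Kp, Ks), normalized_hsic(Kp, Ks))
```

Both estimates should be markedly larger for predictions that depend on the sensitive variable than for independent ones, which is the behaviour a fairness measure of this kind needs.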

6.3 Crime and income prediction

In the next set of experiments, we empirically compare the performance of fair kernel learning and our proposed GP version on two real datasets:

  • Communities and Crime (Redmond and Baveja, 2002). Here we are concerned with predicting the per capita violent crime rate in different communities in the United States from a set of relevant features, such as median family income or the percentage of people under the poverty line. Race is considered the sensitive variable. The dataset contains instances with features. Some of the features contain many missing values, as some surveys were not conducted in some communities, so these features were removed from the data. This yields the final data matrix. We will use this data to assess the performance of the discriminative versus the GP-based algorithm.

  • Adult Income (Dheeru and Karra Taniskidou, 2017). The Adult dataset contains subjects split into training data and test data. The original data have both continuous and categorical features. The label is binary, indicating whether a subject’s income is higher than 50K or not. Each continuous feature was then discretized into quantiles, each represented by a binary variable. Hence the final dataset has 123 features. The goal is to predict a given subject’s income level while controlling for the sensitive variables: gender and race. We preprocessed the data so that each feature of the predictor variable, as well as the response variable, has zero mean and unit standard deviation.

Fair KRR vs Fair GP

We empirically compared the performance of the two fair kernel learning models: kernel ridge regression and the modified GP version. In the regression setting, we used cross-validation to choose the kernel bandwidth parameter and the penalty parameter. For the sake of a fair comparison, the bandwidth of the kernel on the sensitive variable was kept fixed, while the remaining kernel bandwidth was chosen from candidates given by the median heuristic together with random draws around that value. In addition, further candidate values were drawn from a fixed interval.

For the modified GP case, we fixed the same parameter as in the regression setting while optimizing over the other hyperparameters. In both settings, we varied the penalty hyperparameter for unfairness over an interval, with higher values corresponding to fairer models. Figure 2 shows the results for the two models on the crime and income data. We can see that in most cases the modified GP outperforms kernel ridge regression: it gives a better trade-off between fairness and prediction accuracy because its hyperparameters are optimized in a principled way. Note that the performance gap between fair kernel learning and fair GP is larger on the Adult data than on the Crime data. A possible reason is that the Crime data is much harder to learn (the RMSE is 0.6 at its highest), so the advantage of using a GP to optimize hyperparameters is limited.

Figure 2: Comparisons between fairness-regularized kernel ridge regression and fairness-regularized GP regression on crime and income data. Principled hyperparameter selection due to the proposed GP model allows improved unfairness / prediction error trade-off curves.
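The bandwidth selection used in the regression setting (median heuristic plus random draws around it) can be sketched as follows. The number of draws `n_draws`, the log-uniform perturbation and its width `spread` are hypothetical choices, since the exact sampling scheme is not specified above.

```python
import numpy as np


def median_heuristic(X):
    """Median Euclidean distance over all pairs of distinct rows of X."""
    sq_norms = np.sum(X**2, axis=1)
    sq = sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T
    dist = np.sqrt(np.maximum(sq, 0.0))
    upper = np.triu_indices(X.shape[0], k=1)  # each pair counted once
    return float(np.median(dist[upper]))


def bandwidth_candidates(X, n_draws=5, spread=0.5, seed=0):
    """Median-heuristic bandwidth followed by random draws around it.

    Each draw multiplies the heuristic value by exp(u) with
    u ~ Uniform(-spread, spread); this scheme is an assumption.
    """
    rng = np.random.default_rng(seed)
    sigma = median_heuristic(X)
    factors = np.exp(rng.uniform(-spread, spread, size=n_draws))
    return np.concatenate(([sigma], sigma * factors))


if __name__ == "__main__":
    X = np.random.default_rng(1).normal(size=(100, 3))
    print(bandwidth_candidates(X))
```

The resulting candidates would then be scored by cross-validation in the KRR setting, whereas the GP version replaces this search with marginal likelihood maximization.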

Fair GP with ARD Kernel

In this section, we show empirical evidence on the performance of the modified GP fair learning model with an ARD (Automatic Relevance Determination) kernel on the Communities and Crime dataset. The goal of this experiment is to assess the effect of fair learning on the kernel coefficients of each feature. Specifically, we run two sets of experiments. The first experiment performs predictions with a standard GP, using an ARD RBF kernel, which assigns a separate bandwidth to each feature.
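For reference, a common parameterization of the ARD RBF kernel is given below. The notation is assumed here ($\sigma_d$ for the bandwidth of feature $d$, $D$ for the number of features), and the exact form used in the experiments may differ by constant factors.

```latex
k(\mathbf{x}, \mathbf{x}') = \exp\left( -\sum_{d=1}^{D} \frac{(x_d - x'_d)^2}{2\sigma_d^2} \right)
```

Under this form, a very large $\sigma_d$ makes the kernel nearly insensitive to differences in feature $d$.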

The second experiment is similar to the first, except that we use the modified GP framework. We would like to see how the bandwidths of the sensitive variables change when we impose the fairness regularizer. We list the results in Table 3. The root mean square error and the unfairness of the predictions in the two settings are also reported in the table.

| Sensitive Variable | GP | Fair GP |
|---|---|---|
| Race-Black | 1.809 ± 0.216 | 2.939 ± 0.367 |
| Race-White | 6.728 ± 3.425 | 2.519 ± 0.038 |
| Race-Asian | 17.79 ± 11.96 | 117.9 ± 0.045 |
| Race-Hispanic | 53.90 ± 19.00 | 9.669 ± 1.606 |
| Income-White | 132.2 ± 2.823 | 213.0 ± 0.190 |
| Income-Black | 108.9 ± 88.73 | 389.3 ± 0.026 |
| Income-Indian | 176.4 ± 7.351 | 700.9 ± 0.014 |
| Income-Asian | 17.76 ± 8.051 | 386.2 ± 0.077 |
| Income-Other | 12.63 ± 6.762 | 411.3 ± 0.136 |
| Income-Hispanic | 175.2 ± 4.667 | 404.7 ± 0.020 |
| RMSE | 0.627 ± 0.054 | 0.766 ± 0.036 |
| Unfairness | 0.050 ± 0.001 | 0.0024 ± 0.0001 |

Table 3: The change of bandwidths for sensitive variables with and without fair learning.

We can see that for most sensitive variables, their bandwidths were significantly increased after performing fair learning. This means that in computing the kernel value, those sensitive variables are contributing less, i.e. we treat instances as similar even when their sensitive variables have different values and as a result, the learned function varies less in those dimensions than in others.
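The effect of a larger bandwidth can be checked numerically with a small sketch of an ARD RBF kernel. The parameterization below (squared differences divided by twice the squared bandwidth) is a common convention assumed for illustration, and the function name `ard_rbf_kernel` is hypothetical.

```python
import numpy as np


def ard_rbf_kernel(A, B, bandwidths):
    """ARD RBF kernel with one bandwidth per feature (assumed form):
    k(a, b) = exp(-sum_d (a_d - b_d)^2 / (2 * bandwidths[d]^2))."""
    bandwidths = np.asarray(bandwidths, dtype=float)
    A_s, B_s = A / bandwidths, B / bandwidths   # rescale each feature
    sq = (np.sum(A_s**2, axis=1)[:, None]
          + np.sum(B_s**2, axis=1)[None, :]
          - 2.0 * A_s @ B_s.T)
    return np.exp(-0.5 * np.maximum(sq, 0.0))


if __name__ == "__main__":
    # Two instances that differ only in the first (say, sensitive) feature.
    a = np.array([[0.0, 0.0]])
    b = np.array([[1.0, 0.0]])
    print(ard_rbf_kernel(a, b, [1.0, 1.0])[0, 0])    # small bandwidth
    print(ard_rbf_kernel(a, b, [100.0, 1.0])[0, 0])  # large bandwidth
```

With a bandwidth of 1 the two instances have kernel value exp(-0.5), while with a bandwidth of 100 on the differing feature the value is close to 1, i.e. the instances are treated as nearly identical.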

7 Conclusions

Using machine learning to facilitate and automate data-informed decisions has a huge potential to benefit society and transform people’s lives. However, data used to train machine learning models are not necessarily free from cognitive or other biases, so the discovered patterns may retain or compound discriminatory decisions. We introduced a regularization framework of fairness-aware models where statistical dependence between predictions and the sensitive, protected variables is penalized. The use of kernel dependence measures as fairness regularizers allowed us to obtain simple regression models with closed-form solutions, to derive a probabilistic Gaussian process interpretation, and to derive an appropriate normalization of the regularizers. The latter two developments lead to principled and robust hyperparameter selection. The developed methods show promising performance in synthetic and real-data experiments involving crime and income prediction, allowing one to strike favourable tradeoffs between a method’s predictive performance (on biased data) and its fairness in terms of statistical parity. While we focused on a specific viewpoint on fairness here, considering directly the statistical dependence on a prespecified set of sensitive variables, the construction of machine learning techniques suited for other notions of fairness involving causal associations and conditional dependencies presents an important future research challenge. As there is also a flurry of research on the use of kernel methods in these fields, similar approaches invoking appropriate notions of kernel-based regularizers may be possible.

8 Acknowledgements

A.P.-S. and G.C.-V. are supported by the European Research Council (ERC) under the ERC-CoG-2014 SEDAL Consolidator grant (grant agreement 647423). D.S. is supported in part by The Alan Turing Institute (EP/N510129/1). The authors thank Kenji Fukumizu for fruitful discussions and Alan Chau, Lucian Chan, Qinyi Zhang and Alex Shestopaloff for helpful comments.

References


  • Adebayo and Kagal (2016) Adebayo, J., Kagal, L., 2016. Iterative orthogonal feature projection for diagnosing bias in black-box models. arXiv preprint arXiv:1611.04967.
  • Belkin et al. (2006) Belkin, M., Niyogi, P., Sindhwani, V., Dec. 2006. Manifold regularization: A geometric framework for learning from labeled and unlabeled examples. Journal of Machine Learning Research 7, 2399–2434.
  • Berk et al. (2018) Berk, R., Heidari, H., Jabbari, S., Kearns, M., Roth, A., 2018. Fairness in criminal justice risk assessments: The state of the art. Sociological Methods & Research, 0049124118782533.
  • Brennan and Ehret (2009) Brennan, T., Dieterich, W., Ehret, B., 2009. Evaluating the predictive validity of the COMPAS risk and needs assessment system. Criminal Justice and Behavior 36 (1), 21–40.
  • Calders and Verwer (2010) Calders, T., Verwer, S., 2010. Three naive Bayes approaches for discrimination-free classification. Data Mining and Knowledge Discovery 21 (2), 277–292.
  • Calmon et al. (2017) Calmon, F., Wei, D., Vinzamuri, B., Ramamurthy, K. N., Varshney, K. R., 2017. Optimized pre-processing for discrimination prevention. In: Advances in Neural Information Processing Systems. pp. 3992–4001.
  • Chouldechova (2017) Chouldechova, A., 2017. Fair prediction with disparate impact: A study of bias in recidivism prediction instruments. Big data 5 (2), 153–163.
  • Chouldechova and Roth (2018) Chouldechova, A., Roth, A., 2018. The frontiers of fairness in machine learning. arXiv preprint arXiv:1810.08810.
  • Cunningham and Sorensen (2006) Cunningham, M. D., Sorensen, J. R., 2006. Actuarial models for assessing prison violence risk: revisions and extensions of the risk assessment scale for prison (RASP). Assessment 13 (3), 253–265.
  • Dheeru and Karra Taniskidou (2017) Dheeru, D., Karra Taniskidou, E., 2017. UCI machine learning repository.
  • Dieterich et al. (2016) Dieterich, W., Mendoza, C., Brennan, T., 2016. COMPAS risk scales: Demonstrating accuracy equity and predictive parity. Northpointe Inc.
  • Dwork et al. (2012) Dwork, C., Hardt, M., Pitassi, T., Reingold, O., Zemel, R., 2012. Fairness through awareness. In: Proceedings of the 3rd innovations in theoretical computer science conference. ACM, pp. 214–226.
  • Feldman et al. (2015) Feldman, M., Friedler, S. A., Moeller, J., Scheidegger, C., Venkatasubramanian, S., 2015. Certifying and removing disparate impact. In: Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. pp. 259–268.
  • Fukumizu et al. (2009) Fukumizu, K., Bach, F. R., Jordan, M. I., et al., 2009. Kernel dimension reduction in regression. The Annals of Statistics 37 (4), 1871–1905.
  • Fukumizu et al. (2008) Fukumizu, K., Gretton, A., Sun, X., Schölkopf, B., 2008. Kernel measures of conditional dependence. In: Advances in Neural Information Processing Systems 20. pp. 489–496.
  • Gretton et al. (2012) Gretton, A., Borgwardt, K. M., Rasch, M. J., Schölkopf, B., Smola, A., Mar. 2012. A kernel two-sample test. J. Mach. Learn. Res. 13, 723–773.
  • Gretton et al. (2005) Gretton, A., Herbrich, R., Hyvärinen, A., 2005. Kernel methods for measuring independence. Journal of Machine Learning Research 6, 2075–2129.
  • Hardt et al. (2016) Hardt, M., Price, E., Srebro, N., et al., 2016. Equality of opportunity in supervised learning. In: Advances in neural information processing systems. pp. 3315–3323.
  • Heidari et al. (2018) Heidari, H., Ferrari, C., Gummadi, K., Krause, A., 2018. Fairness behind a veil of ignorance: A welfare analysis for automated decision making. In: Advances in Neural Information Processing Systems. pp. 1265–1276.
  • Hoffman et al. (2017) Hoffman, M., Kahn, L. B., Li, D., 2017. Discretion in hiring. The Quarterly Journal of Economics 133 (2), 765–800.
  • Joseph et al. (2016) Joseph, M., Kearns, M., Morgenstern, J. H., Roth, A., 2016. Fairness in learning: Classic and contextual bandits. In: Advances in Neural Information Processing Systems. pp. 325–333.
  • Kamiran and Calders (2009) Kamiran, F., Calders, T., 2009. Classifying without discriminating. In: 2009 2nd International Conference on Computer, Control and Communication. IEEE, pp. 1–6.
  • Kamiran and Calders (2012) Kamiran, F., Calders, T., 2012. Data preprocessing techniques for classification without discrimination. Knowledge and Information Systems 33 (1), 1–33.
  • Kamishima et al. (2012) Kamishima, T., Akaho, S., Asoh, H., Sakuma, J., 2012. Fairness-aware classifier with prejudice remover regularizer. In: Joint European Conference on Machine Learning and Knowledge Discovery in Databases. Springer, pp. 35–50.
  • Kanagawa et al. (2018) Kanagawa, M., Hennig, P., Sejdinovic, D., Sriperumbudur, B. K., 2018. Gaussian processes and kernel methods: A review on connections and equivalences. arXiv preprint arXiv:1807.02582.
  • Kim et al. (2018) Kim, M., Reingold, O., Rothblum, G., 2018. Fairness through computationally-bounded awareness. In: Advances in Neural Information Processing Systems. pp. 4842–4852.
  • Kimeldorf and Wahba (1970) Kimeldorf, G. S., Wahba, G., 1970. A correspondence between Bayesian estimation on stochastic processes and smoothing by splines. Ann. Math. Statist. 41 (2), 495–502.
  • Kleinberg et al. (2016) Kleinberg, J., Mullainathan, S., Raghavan, M., 2016. Inherent trade-offs in the fair determination of risk scores. arXiv preprint arXiv:1609.05807.
  • Luo et al. (2015) Luo, L., Liu, W., Koprinska, I., Chen, F., 2015. Discrimination-aware association rule mining for unbiased data analytics. In: International Conference on Big Data Analytics and Knowledge Discovery. Springer, pp. 108–120.
  • Pedreschi et al. (2008) Pedreschi, D., Ruggieri, S., Turini, F., 2008. Discrimination-aware data mining. In: Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. KDD ’08. pp. 560–568.
  • Pérez-Suay et al. (2017) Pérez-Suay, A., Laparra, V., Mateo-García, G., Muñoz-Marí, J., Gómez-Chova, L., Camps-Valls, G., 2017. Fair kernel learning. In: Joint European Conference on Machine Learning and Knowledge Discovery in Databases. Springer, pp. 339–355.
  • Redmond and Baveja (2002) Redmond, M., Baveja, A., 2002. A data-driven software tool for enabling cooperative information sharing among police departments. European Journal of Operational Research 141 (3), 660–678.
  • Ristanoski et al. (2013) Ristanoski, G., Liu, W., Bailey, J., 2013. Discrimination aware classification for imbalanced datasets. In: CIKM ’13. ACM, NY, USA, pp. 1529–1532.
  • Ruggieri et al. (2010) Ruggieri, S., Pedreschi, D., Turini, F., May 2010. Data mining for discrimination discovery. ACM Trans. Knowl. Discov. Data 4 (2), 9:1–9:40.
  • Sriperumbudur et al. (2011) Sriperumbudur, B. K., Fukumizu, K., Lanckriet, G. R., 2011. Universality, characteristic kernels and RKHS embedding of measures. Journal of Machine Learning Research 12 (Jul), 2389–2410.
  • Zafar et al. (2017) Zafar, M. B., Valera, I., Gomez Rodriguez, M., Gummadi, K. P., 2017. Fairness beyond disparate treatment & disparate impact: Learning classification without disparate mistreatment. In: Proceedings of the 26th International Conference on World Wide Web. International World Wide Web Conferences Steering Committee, pp. 1171–1180.
  • Zemel et al. (2013) Zemel, R. S., Wu, Y., Swersky, K., Pitassi, T., Dwork, C., 2013. Learning fair representations. In: ICML (3). Vol. 28. pp. 325–333.
  • Zeng et al. (2016) Zeng, J., Ustun, B., Rudin, C., 2016. Interpretable classification models for recidivism prediction. Jour. of the Royal Stat. Soc.: Series A (Statistics in Society).

Appendix A Proof of Proposition 1

Through feature maps and , we map and into the RKHS and with kernel and