General Fair Empirical Risk Minimization

01/29/2019 ∙ Luca Oneto et al. ∙ Istituto Italiano di Tecnologia and Università di Genova

We tackle the problem of algorithmic fairness, where the goal is to avoid the unfair influence of sensitive information, in the general context of regression with possibly continuous sensitive attributes. We extend the framework of fair empirical risk minimization to this general scenario, covering in this way the whole standard supervised learning setting. Our generalized fairness measure reduces to well-known notions of fairness available in the literature. We derive learning guarantees for our method that imply in particular its statistical consistency, both in terms of the risk and the fairness measure. We then specialize our approach to kernel methods and propose a convex fair estimator in that setting. We test the estimator on a commonly used benchmark dataset (Communities and Crime) and on a new dataset collected at the University of Genova, containing information on the academic careers of five thousand students. The latter dataset provides a challenging real-world scenario of unfair behaviour of standard regression methods that benefits from our methodology. The experimental results show that our estimator is effective at mitigating the trade-off between accuracy and fairness requirements.

1 Introduction

The problem of designing learning methods that do not use sensitive information (e.g. an individual's ethnic group, sex, or age) in a discriminatory way is receiving increasing attention, due to its fundamental importance in real-life scenarios; see e.g. [pleiss2017fairness, beutel2017data, hardt2016equality, feldman2015certifying, woodworth2017learning, zafar2017fairness, zafar2017parity, zafar2017fairnessARXIV, kamishima2011fairness, kearns2017preventing, perez2017fair, berk2017convex, adebayo2016iterative, calmon2017optimized, kamiran2009classifying, zemel2013learning, kamiran2012data, kamiran2010classification] and references therein. In this paper we follow a recent line of work [OnetoC060, agarwal2017reductions, zafar2017fairness, menon2018cost, bechavod2018Penalizing, zafar2017fairnessARXIV, kamishima2011fairness, kearns2017preventing, perez2017fair, berk2017convex, alabi2018optimizing, dwork2018decoupled]

in which the fairness constraint is directly taken into account during the learning procedure. An important departure from previous work that we take in this paper is to consider the possibility that the sensitive feature and/or the output (response variable) we wish to predict take real values.

The importance of being able to solve regression tasks, possibly with continuous sensitive features, is highlighted by the following example. At the University of Genova, automatic systems are needed to predict students' performance in order to improve teaching quality and student support systems. In this case, the response variable is the course mark and the sensitive features can be either categorical (e.g. sex or ethnic group) or continuous (e.g. age or financial status).

Common notions of fairness used in the setting of classification with categorical sensitive features are Equal Opportunity and Equalized Odds [hardt2016equality]. They aim to balance the decisions of a classifier among the different sensitive groups and label sets. We show how these notions can be extended to the general supervised learning setting (regression and classification) with general sensitive features (categorical and continuous). We observe that these novel fairness constraints can be incorporated within the Empirical Risk Minimization (ERM) framework. Our method and analysis build upon and extend the Fair ERM (FERM) framework developed in

[OnetoC060]. As the fairness measures used here are more general than those employed in that work, we name our approach General FERM (G-FERM). We show that G-FERM is supported by consistency guarantees both in terms of risk and fairness measure. Specifically, we derive both risk and fairness bounds, which support the statistical consistency of G-FERM. We give a concrete instance of G-FERM in the setting of kernel methods, leading to a form of constrained regularized empirical risk minimization, in which the fairness constraint is obtained by composing a norm constraint with a linear transformation.

Contributions. First, we present new generalized notions of fairness that encompass well-studied notions used for classification and regression with categorical and numerical sensitive features. Second, we derive statistical bounds for G-FERM that imply consistency properties both in terms of the fairness measure and the risk of the selected model. As a third contribution, we instantiate G-FERM in the setting of kernel methods, leading to an efficient convex estimator. We test this estimator on a commonly used benchmark dataset (Communities and Crime) and on a new dataset collected at the University of Genova, containing information on the academic careers of five thousand students. The latter dataset provides a challenging real-world scenario of unfair behaviour of standard regression methods that can be addressed with our methodology. The experimental results show that our estimator is effective at mitigating the trade-off between accuracy and fairness requirements.

Paper Organization. In Section 2 we discuss previous work on fairness, with a particular focus on regression and/or continuous sensitive features. In Section 3 we introduce our notion of fairness, which leads us to G-FERM, and study its statistical properties. In Section 4 we present the kernel-based G-FERM estimator and in Section 5 we report numerical experiments on two real datasets. Finally, in Section 6 we draw conclusions and comment on future research directions.

2 Related works

In the context of fairness, most papers in the literature address the binary classification task with categorical (or even binary) sensitive features [hardt2016equality, zafar2017fairness]; a broad review of classification with categorical sensitive features is provided in [OnetoC060]. This task is indeed very important, because it is strictly related to the possibility of accessing specific benefits (e.g. loans) without being discriminated against on the basis of gender or ethnicity. On the other hand, the set of problems solvable by these methods is limited and does not cover all real-world scenarios.

Focusing on the works able to handle regression tasks, we can divide them by the type of problems they are able to solve and the notion of fairness they exploit. As we will see, with very few exceptions – e.g. [komiyama2017two] – most of the methods in the literature cannot deal with both classification and regression tasks and with both numerical and categorical sensitive features in a unified approach supported by theoretical consistency results. In fact, they introduce task-oriented notions of fairness and/or do not address the statistical consistency of their method with respect to the risk and the fairness measure employed.

The largest family of methods tackles regression problems with a (single) categorical or binary sensitive feature [berk2017convex, calders2013controlling, fitzsimons2018equality, raff2017fair]. For example, in [berk2017convex] a convex approach for regression is proposed, where the authors use a specific definition of fairness designed to produce models that treat similar examples in a similar way, in the sense of the predicted outcome. The authors tackle the problem by introducing a new convex regularizer and by imposing this notion on different regression tasks. Another example is [fitzsimons2018equality], where the authors use an adapted version of Demographic Parity [dwork2012fairness], originally defined for classification, in the context of regression.

Restricting the regression problem to categorical sensitive features only is a serious limitation. In this sense, a few interesting papers present regression methods able to deal with continuous sensitive attributes [komiyama2017two, komiyama2018nonconvex, perez2017fair]. Differently from our approach, the authors impose other definitions of fairness (e.g. Disparate Impact [zafar2017fairness] or even ad hoc new definitions). Moreover, it is important to note that these methods do not naturally extend to the case of non-continuous sensitive attributes.

Considering a larger spectrum of possible methodologies, other methods in the literature can solve regression tasks by imposing some concept of fairness. [nabi2018fair] and [nabi2018learning] tackle the regression problem by exploiting the causal machine learning framework. These methods can potentially handle both continuous and categorical sensitive features, but the authors' analysis considers only the categorical case, leaving the extension to continuous sensitive attributes as possible future work. Another interesting idea, presented in [yona2018probably], is to study fairness as a property of the metric of the feature space. The authors introduce a new definition of metric-related fairness allowing them to solve a regression problem with categorical and continuous sensitive attributes. Finally, learning fair pre-processing rules is another possible way to obtain a fair regression model. For example, in [zemel2013learning], the fair representation of the data can be used in synergy with any classic regression method in order to generate a fair regression model.

3 Learning with Fairness Constraints

In this section, we introduce our framework for learning under fairness constraints. We first recall some notation used throughout this work in Section 3.1. We then present the proposed fairness measures in Section 3.2, which lead us to consider in Section 3.3 a generalized version of the FERM approach [OnetoC060]. Finally in Section 3.4 we discuss the statistical properties of our method.

3.1 Setting

Let $\{(\boldsymbol{x}_i, s_i, y_i)\}_{i=1}^{n}$ be a training set formed by $n$ samples drawn independently from an unknown probability distribution $\mu$ over $\mathcal{X} \times \mathcal{S} \times \mathcal{Y}$, where $\mathcal{X}$ is the input space, $\mathcal{S}$ is the space of the sensitive attribute and $\mathcal{Y}$ is the output space. Both $\mathcal{S}$ and $\mathcal{Y}$ may be finite or continuous; if $\mathcal{Y}$ is a finite set of labels we are dealing with the classification setting, and if $\mathcal{Y} \subseteq \mathbb{R}$ we are dealing with the regression setting.

Let $K$ and $Q$ be positive integers and define two finite partitions $\{\mathcal{S}_k\}_{k=1}^{K}$ of $\mathcal{S}$ and $\{\mathcal{Y}_q\}_{q=1}^{Q}$ of $\mathcal{Y}$. These sets are prescribed by the user: the discretization process is driven by the application at hand, and points in the same interval are regarded as indistinguishable. For example, it does not make sense to state that a group of students at the University of Genova is mistreated because their average grades differ by less than a small fraction of the mark range. We also define, for every $k \in \{1, \dots, K\}$ and $q \in \{1, \dots, Q\}$, the subsets of training points

$$\mathcal{D}_{k,q} = \{(\boldsymbol{x}_i, s_i, y_i) : s_i \in \mathcal{S}_k,\ y_i \in \mathcal{Y}_q\},$$

and let $n_{k,q} = |\mathcal{D}_{k,q}|$.
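As a concrete illustration of this discretization step, the following sketch bins a continuous sensitive attribute and a continuous target into user-chosen intervals and collects the index sets of the resulting groups; all names (s, y, s_bins, y_bins, group_indices) are ours and purely illustrative.

```python
import numpy as np

def group_indices(s, y, s_bins, y_bins):
    """Assign each training point to a (sensitive, output) cell of the
    user-chosen discretization and return the index sets D[k, q]."""
    s, y = np.asarray(s), np.asarray(y)
    # np.digitize maps each value to the interval defined by the bin edges.
    k_idx = np.digitize(s, s_bins) - 1
    q_idx = np.digitize(y, y_bins) - 1
    K, Q = len(s_bins) - 1, len(y_bins) - 1
    return {(k, q): np.where((k_idx == k) & (q_idx == q))[0]
            for k in range(K) for q in range(Q)}

# Toy usage: 3 sensitive-attribute bins, 2 output bins.
rng = np.random.default_rng(0)
s = rng.uniform(0, 1, size=200)      # continuous sensitive attribute
y = rng.uniform(18, 30, size=200)    # continuous output (e.g. grades)
D = group_indices(s, y, s_bins=[0, 1/3, 2/3, 1], y_bins=[18, 24, 30])
```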

We consider a function (or model) $f : \mathcal{Z} \to \mathbb{R}$ chosen from a set $\mathcal{F}$ of possible ones. The functional form of the model may explicitly depend on the sensitive feature or not, based on specific legal requirements in the application at hand [dwork2018decoupled, OnetoC062]. For this reason we write $f(\boldsymbol{z})$, where $\boldsymbol{z} \in \mathcal{Z}$ may contain the sensitive feature (i.e. $\boldsymbol{z} = (\boldsymbol{x}, s)$ and $\mathcal{Z} = \mathcal{X} \times \mathcal{S}$) or not (i.e. $\boldsymbol{z} = \boldsymbol{x}$ and $\mathcal{Z} = \mathcal{X}$). The error (risk) of $f$ is measured by a prescribed loss function $\ell : \mathbb{R} \times \mathcal{Y} \to [0, \infty)$. The risk of a model $f$, denoted $L(f)$, together with its empirical counterpart $\hat{L}(f)$, are defined respectively as

$$L(f) = \mathbb{E}\big[\ell(f(\boldsymbol{z}), y)\big] \qquad \text{and} \qquad \hat{L}(f) = \frac{1}{n} \sum_{i=1}^{n} \ell(f(\boldsymbol{z}_i), y_i).$$

When necessary we will indicate with a subscript the particular loss function used and the associated risk, i.e. $L_{\ell}(f)$.

The purpose of a learning procedure is to find a model that minimizes the risk. Since the probability measure $\mu$ is usually unknown, the risk cannot be computed; however, we can compute the empirical risk, and a natural learning strategy, called Empirical Risk Minimization (ERM), is then to minimize the empirical risk within a prescribed set of functions; see e.g. [shalev2014understanding].

3.2 $\epsilon$-Loss General Fairness

Different definitions of fairness of a classifier or real-valued function exist in the literature, as described in Section 2. It is important to stress that there is not yet a consensus on which definition should be employed to evaluate algorithmic fairness. Moreover, most of the current fairness definitions cannot deal with regression problems (or with continuous sensitive attributes), losing their meaning or not even being definable. In this work we propose a general notion of fairness able to deal with both classification and regression and with both categorical and numerical sensitive features, and which generalizes previously known notions of fairness.

Definition 3.1

A model $f$ is $\epsilon$-general fair ($\epsilon$-GF), with $\epsilon \in [0, 1]$, if it satisfies the following condition

where, for every $k \in \{1, \dots, K\}$ and $q \in \{1, \dots, Q\}$, we have defined the conditional probabilities

This definition says that a model is fair if its predictions are equally distributed independently of the value of the sensitive attribute. It can be further generalized as follows.
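Before stating the generalization, it may help to visualize one concrete way such a condition can be written with the discretization of Section 3.1; this is only an illustration, and the exact aggregation over groups used in Definition 3.1 may differ:

$$\Big|\, \mathbb{P}\big\{ f(\boldsymbol{z}) \in \mathcal{Y}_p \mid s \in \mathcal{S}_k,\ y \in \mathcal{Y}_q \big\} \;-\; \mathbb{P}\big\{ f(\boldsymbol{z}) \in \mathcal{Y}_p \mid y \in \mathcal{Y}_q \big\} \,\Big| \;\le\; \epsilon \qquad \forall\, k, p, q.$$

In words, within each true-output bin $\mathcal{Y}_q$, the probability of predicting into any bin $\mathcal{Y}_p$ should be approximately the same for every sensitive group $\mathcal{S}_k$.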

Definition 3.2

For every $q \in \{1, \dots, Q\}$ let $\ell_q$ be a loss function. For every $k \in \{1, \dots, K\}$ and $q \in \{1, \dots, Q\}$, define the conditional risks

We say that a function $f$ is $\epsilon$-loss general fair ($\epsilon$-LGF), with $\epsilon \in [0, 1]$, if it satisfies the following condition

This definition says that a model is fair if its errors, relative to the loss function, are approximately equally distributed independently of the value of the sensitive attribute. Definition 3.2 includes Definition 3.1 for a particular choice of the loss functions. Moreover, it is possible to link Definition 3.2 to other fairness measures previously used in the literature.
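Again as an illustration only (the exact form in Definition 3.2 may aggregate the groups differently), the conditional risks and the fairness condition can be written, with the notation of Section 3.1, as

$$L^{k,q}(f) \;=\; \mathbb{E}\big[\, \ell_q(f(\boldsymbol{z}), y) \mid s \in \mathcal{S}_k,\ y \in \mathcal{Y}_q \,\big], \qquad \Big| L^{k,q}(f) \;-\; \tfrac{1}{K}\sum_{k'=1}^{K} L^{k',q}(f) \Big| \;\le\; \epsilon \quad \forall\, k, q,$$

so that, within each output bin, the group-conditional errors may differ from their average by at most $\epsilon$. Under this illustration, choosing $\ell_q$ to be the indicator of the event $f(\boldsymbol{z}) \in \mathcal{Y}_p$ recovers the conditional probabilities of Definition 3.1.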

Remark 3.3

If, in the classification setting with a categorical sensitive attribute, we let each loss be the 0-1 loss, then Definition 3.2 reduces to the notion of Equalized Odds [hardt2016equality, OnetoC060]. On the other hand, in the same setting, if we let each loss be the linear loss, then we recover other notions of fairness introduced in [dwork2018decoupled]. With an appropriate choice of the discretization and of the losses in the regression setting, Definition 3.2 reduces to the notion of Mean Distance introduced in [calders2013controlling] and also exploited in [komiyama2017two]. Finally, in the same setting, [komiyama2017two] proposes to use the correlation coefficient, which corresponds to a particular choice in Definition 3.2.
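For instance, under the illustrative form given above, in binary classification with $\mathcal{Y} = \{-1, +1\}$, a categorical sensitive attribute, singleton bins, and the 0-1 loss, the conditional risk becomes a group-conditional error probability,

$$L^{k,q}(f) \;=\; \mathbb{P}\big\{\, f(\boldsymbol{z}) \neq y \mid s = k,\ y = q \,\big\}, \qquad q \in \{-1, +1\},$$

so requiring it to be (approximately) equal across groups for both values of $q$ is exactly (a relaxed form of) Equalized Odds, and restricting to the positive class gives Equal Opportunity [hardt2016equality].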

3.3 General Fair Empirical Risk Minimization

In this paper, we aim at minimizing the risk subject to a fairness constraint. Specifically, we consider the problem

(3.1)

where $\epsilon$ is the amount of unfairness that we are willing to bear. Since the measure $\mu$ is unknown, we replace the deterministic quantities with their empirical counterparts. That is, we replace Problem (3.1) with

(3.2)

where $\hat{\epsilon} \in [0, 1]$ and, for every $k$ and every $q$, we define the empirical conditional risks

We will refer to Problem (3.2) as G-FERM since it generalizes the FERM approach introduced in [OnetoC060].
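As a sketch of the general template only (the exact constraint in Problem (3.2) may differ in its aggregation over groups), the empirical problem has the form of a constrained ERM:

$$\min_{f \in \mathcal{F}} \; \hat{L}(f) \quad \text{s.t.} \quad \Big| \hat{L}^{k,q}(f) - \tfrac{1}{K}\sum_{k'=1}^{K} \hat{L}^{k',q}(f) \Big| \le \hat{\epsilon} \;\; \forall\, k, q, \qquad \hat{L}^{k,q}(f) = \frac{1}{n_{k,q}} \sum_{(\boldsymbol{x}_i, s_i, y_i) \in \mathcal{D}_{k,q}} \ell_q(f(\boldsymbol{z}_i), y_i).$$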

3.4 Statistical Analysis

Let $f^*$ be a solution of Problem (3.1), and let $\hat{f}$ be a solution of Problem (3.2). In this section we will show that these solutions are linked to one another. In particular, if the parameter $\hat{\epsilon}$ is chosen appropriately, we will show that, in a certain sense, the estimator $\hat{f}$ is consistent. Our analysis extends the reasoning in [OnetoC060] to the more general setting presented here.

For this purpose, we require that, for any data distribution, it holds with probability at least $1 - \delta$ with respect to the draw of a dataset that

$$\sup_{f \in \mathcal{F}} \big| L(f) - \hat{L}(f) \big| \;\le\; B(\delta, n, \mathcal{F}), \qquad (3.3)$$

where $B(\delta, n, \mathcal{F})$ goes to zero as $n$ grows to infinity, that is, the class $\mathcal{F}$ is learnable with respect to the loss [shalev2014understanding]. Moreover, $B(\delta, n, \mathcal{F})$ is usually an exponential bound, which means that it grows logarithmically with respect to the inverse of $\delta$.

Remark 3.4

If $\mathcal{F}$ is a compact subset of linear separators in a reproducing kernel Hilbert space, and the loss is Lipschitz in its first argument, then $B(\delta, n, \mathcal{F})$ can be obtained via Rademacher bounds [bartlett2002rademacher]. In this case $B(\delta, n, \mathcal{F})$ goes to zero at least as $\sqrt{1/n}$ as $n$ grows and decreases with $\delta$ as $\sqrt{\ln(1/\delta)}$.
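For concreteness, one standard bound of this type (see [bartlett2002rademacher]; the constants below are the generic ones, not necessarily those used in the remark) states that, with probability at least $1 - \delta$,

$$\sup_{f \in \mathcal{F}} \big| L(f) - \hat{L}(f) \big| \;\le\; 2\, \mathfrak{R}_n(\ell \circ \mathcal{F}) \;+\; \sqrt{\frac{\ln(2/\delta)}{2n}},$$

and for a ball of radius $R$ in an RKHS with a bounded kernel and an $M$-Lipschitz loss one has $\mathfrak{R}_n(\ell \circ \mathcal{F}) = O(MR/\sqrt{n})$, which gives the $\sqrt{1/n}$ and $\sqrt{\ln(1/\delta)}$ behaviour mentioned above.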

We are now ready to state the first result of this section.

Theorem 3.5

Let $\mathcal{F}$ be a learnable set of functions with respect to the loss function $\ell$, let $f^*$ be a solution of Problem (3.1) and let $\hat{f}$ be a solution of Problem (3.2) with

With probability at least $1 - \delta$ it holds simultaneously that

Proof. We first use Eq. (3.3) to conclude that, with probability at least $1 - \delta$,

(3.4)

This inequality in turn implies that, with probability at least $1 - \delta$, it holds that

(3.5)

Now, in order to prove the first statement of the theorem, let us decompose the excess risk as

The inclusion property of Eq. (3.5) implies that, with probability at least $1 - \delta$, $f^*$ is feasible for Problem (3.2). Consequently, with the same probability, it holds that

The first statement now follows from Eq. (3.3). As for the second statement, its proof consists in exploiting the results of Eqns. (3.4) and (3.5) together with a union bound.

A consequence of the first statement of Theorem 3.5 is that, as $n$ tends to infinity, the risk of $\hat{f}$ tends to a value which is not larger than the risk of $f^*$; that is, G-FERM is consistent with respect to the risk of the selected model. The second statement of Theorem 3.5, instead, implies that, as $n$ tends to infinity, $\hat{f}$ tends to be $\epsilon$-fair. In other words, G-FERM is consistent with respect to the fairness of the selected model.

Remark 3.6

In the same setting of Remark 3.4, the bound in Theorem 3.5 behaves as $\sqrt{\ln(1/\delta)/n}$, which is optimal [shalev2014understanding].

Thanks to Theorem 3.5, we can state that $\hat{f}$ is close to $f^*$ both in terms of its risk and its fairness. Nevertheless, our final goal is to find a model in $\mathcal{F}$ which solves the following problem

(3.6)

Note that the quantities in Problem (3.6) cannot be computed, since the underlying data generating distribution is unknown. Moreover, the objective function and the fairness constraint of Problem (3.6) are non-convex.

Theorem 3.5 allows us to address the first issue, since we can safely search for a solution of the empirical counterpart of Problem (3.6), which is given by

(3.7)

where

(3.8)

Unfortunately, Problem (3.7) is still a difficult non-convex, non-smooth problem, and for this reason it is more convenient to solve a convex relaxation. That is, we replace the possibly non-convex loss function in the risk with a convex upper bound (e.g. the square loss) and the losses in the constraint with a relaxation (e.g. the linear loss), which makes the constraint convex. In this way, we look for a solution of the convex G-FERM problem

(3.9)

Note that this approximation of the fairness constraint corresponds to matching first-order moments [OnetoC060].
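Concretely, with the linear loss the empirical conditional risks in the constraint reduce (up to terms that do not depend on $f$) to group-conditional means of the predictions, so the relaxed constraint asks, as a sketch, that

$$\Bigg| \frac{1}{n_{k,q}} \sum_{i \in \mathcal{D}_{k,q}} f(\boldsymbol{z}_i) \;-\; \frac{1}{K} \sum_{k'=1}^{K} \frac{1}{n_{k',q}} \sum_{i \in \mathcal{D}_{k',q}} f(\boldsymbol{z}_i) \Bigg| \;\le\; \hat{\epsilon} \qquad \forall\, k, q,$$

i.e. that the average prediction within each output bin be (almost) the same for every sensitive group.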

The questions that arise here are whether the solution of the relaxed problem is close to that of the original one, how close, and under which assumptions. The following proposition sheds some light on these issues.

Proposition 3.7

If the loss used in the relaxed risk is a convex upper bound of the loss exploited to compute the risk, then the risk is controlled by its relaxed counterpart. Moreover, if the losses in the constraint are approximated by their linear relaxations up to a small error, then also the fairness is well approximated.

The first statement of Proposition 3.7 tells us that the quality of the approximation of the risk depends on the quality of the convex surrogate. The second statement of Proposition 3.7, instead, tells us that if this approximation error is small then the linear-loss-based fairness is close to the GF. This condition is quite natural, empirically verifiable, and it has been exploited in previous work [maurer2004note, OnetoC060]. Moreover, in Section 5 we present experiments showing that this error is small in practice.

The bound in Proposition 3.7 may be tightened by using different non-linear approximations of the GF. However, the linear approximation proposed in this work gives a convex problem and, as we shall see in Section 5, works well in practice.

In summary, the combination of Theorem 3.5 and Proposition 3.7 provides conditions under which a solution of Problem (3.9), which is convex, is close, both in terms of risk and fairness measure, to a solution of Problem (3.6), which is our final goal.

4 G-FERM with Kernel Methods

In this section, we specialize the G-FERM framework to the case in which the underlying space of models is a reproducing kernel Hilbert space (RKHS) [shawe2004kernel, smola2001].

We let $\kappa : \mathcal{Z} \times \mathcal{Z} \to \mathbb{R}$ be a positive definite kernel and let $\varphi : \mathcal{Z} \to \mathbb{H}$ be an induced feature mapping such that $\kappa(\boldsymbol{z}, \boldsymbol{z}') = \langle \varphi(\boldsymbol{z}), \varphi(\boldsymbol{z}') \rangle$ for all $\boldsymbol{z}, \boldsymbol{z}' \in \mathcal{Z}$, where $\mathbb{H}$ is the Hilbert space of square summable sequences. Functions in the RKHS can be parametrized as

$$f(\boldsymbol{z}) = \langle \boldsymbol{w}, \varphi(\boldsymbol{z}) \rangle, \qquad (4.1)$$

for some vector of parameters $\boldsymbol{w} \in \mathbb{H}$. In practice a bias term (threshold) can be added to $f$, but to ease our presentation we do not include it here.

We propose to solve Problem (3.9) in the case that $\mathcal{F}$ is a ball in the RKHS, and we employ a convex loss function to measure the empirical error. Standard choices are the square loss in the case of regression or the hinge loss in the case of binary classification, defined as $(f(\boldsymbol{z}) - y)^2$ and $\max\{0,\, 1 - y f(\boldsymbol{z})\}$, respectively. As for the fairness constraint, we use the linear loss function, which makes the constraint convex. Then, we introduce the means of the feature vectors associated with the training points restricted by the discretization of the sensitive feature and of the real outputs, namely

(4.2)

Using Eq. (4.1) the constraint in Problem (3.9) becomes

(4.3)

which can be written in more compact notation as a norm constraint on the image of $\boldsymbol{w}$ under a linear operator. With this notation, the fairness constraint can be interpreted as the composition of a norm ball with a linear transformation.
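The following numpy sketch illustrates, for a linear kernel (identity feature map), how such a linear operator can be assembled from the group means of Eq. (4.2); the names (X, groups, fairness_operator) and the specific choice of comparing each group mean with the across-group average are ours and only illustrative.

```python
import numpy as np

def fairness_operator(X, groups):
    """Rows are differences between a group-mean feature vector and the
    average of the group means within the same output bin, so that the
    fairness constraint reads max_j |(A @ w)_j| <= eps."""
    rows = []
    # groups: dict mapping (k, q) -> indices of the training points in D[k, q]
    for q in sorted({q for (_, q) in groups}):
        means = [X[idx].mean(axis=0)
                 for (kk, qq), idx in sorted(groups.items())
                 if qq == q and len(idx) > 0]
        avg = np.mean(means, axis=0)
        rows.extend(m - avg for m in means)
    return np.vstack(rows)
```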

In practice, we solve the following Tikhonov regularization problem

(4.4)
s.t.

where $\lambda$ is a positive regularization parameter. Note that, if $\hat{\epsilon} = 0$, the constraint reduces to a linear equality constraint.
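As a sketch of how a problem of this form can be solved with an off-the-shelf convex solver in the linear-kernel case (we use cvxpy here purely for illustration; the experiments in this paper rely on CPLEX, and all names below are ours):

```python
import cvxpy as cp
import numpy as np

def fair_rls(X, y, A, lam, eps):
    """Squared loss plus Tikhonov regularization, subject to the fairness
    constraint |A w| <= eps (componentwise)."""
    n, d = X.shape
    w = cp.Variable(d)
    objective = cp.Minimize(cp.sum_squares(X @ w - y) / n + lam * cp.sum_squares(w))
    constraints = [cp.abs(A @ w) <= eps]
    cp.Problem(objective, constraints).solve()
    return w.value

# Toy usage with random data and a single random fairness direction.
rng = np.random.default_rng(0)
X, y = rng.normal(size=(50, 5)), rng.normal(size=50)
A = rng.normal(size=(1, 5))
w_hat = fair_rls(X, y, A, lam=0.1, eps=0.01)
```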

Problem (4.4) can be kernelized by observing that, thanks to the Representer Theorem [shawe2004kernel]

$$\boldsymbol{w} = \sum_{i=1}^{n} \alpha_i\, \varphi(\boldsymbol{z}_i). \qquad (4.5)$$

The dual of Problem (4.4) may be derived using Fenchel duality; see e.g. [borwein2010convex, Theorem 3.3.5]. We postpone the discussion to future work, since in our experiments we employed an off-the-shelf convex optimization solver (https://www.ibm.com/analytics/cplex-optimizer).

Finally, we note that when $\varphi$ is the identity mapping (i.e. $\kappa$ is the linear kernel on $\mathcal{Z}$) and $\hat{\epsilon} = 0$, the fairness constraint of Problem (4.4) can be implicitly enforced by making a change of representation [OnetoC060].
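A minimal sketch of such a change of representation in the linear case, assuming the fairness directions are collected as the rows of a matrix U (our notation): projecting the inputs onto the orthogonal complement of those directions makes any linear model trained on the projected data satisfy the constraint with $\hat{\epsilon} = 0$.

```python
import numpy as np

def fair_representation(X, U):
    """Project the data onto the null space of U, so that any linear model
    w learned on the projected data yields predictions <w, P x> with
    <P w, u_j> = 0 for every fairness direction u_j (row of U)."""
    P = np.eye(X.shape[1]) - U.T @ np.linalg.pinv(U @ U.T) @ U
    return X @ P
```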

5 Experiments

In this section we present a set of experiments to test the performance of the proposed method, both in terms of error and fairness. We study both the case of categorical and of continuous sensitive features in the context of regression (continuous label). The classification task, as a special case of our proposed framework, has already been studied in [OnetoC060]. For this purpose, we selected two metrics to compare our method with the baselines. Concerning the error, we report the Mean Absolute Percentage Error (MAPE), computed on the test set. Concerning the fairness of the model, we exploit the Differences of GF (DGF), see Definition 3.1, that is the following quantity, again estimated on the test set:

where the expression of each term is given in Eq. (3.8).
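As a reference for how the two test metrics can be computed, here is a short sketch; the MAPE formula is the standard one, while the group-gap computation is only a simplified stand-in for the DGF of Eq. (3.8), not the exact expression used in this paper.

```python
import numpy as np

def mape(y_true, y_pred):
    """Mean Absolute Percentage Error (in percent)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return 100.0 * np.mean(np.abs(y_true - y_pred) / np.abs(y_true))

def group_gap(y_pred, group):
    """Simplified fairness proxy: sum of absolute differences between the
    average prediction of each sensitive group and the overall average."""
    y_pred, group = np.asarray(y_pred), np.asarray(group)
    overall = y_pred.mean()
    return sum(abs(y_pred[group == g].mean() - overall) for g in np.unique(group))
```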

A set of four different algorithms is considered, with two different types of validation procedures. The algorithms are divided into two groups: linear and non-linear kernels. Concerning the linear methods, the baseline is regularized least squares (RLS), which solves Problem (4.4) with no fairness constraint and a linear kernel. Fair RLS is our method in this category, which solves Problem (4.4) with a linear kernel and the fairness constraint. The kernel version of the baseline is KRLS, which solves Problem (4.4) with no fairness constraint and a Gaussian kernel, i.e. $\kappa(\boldsymbol{z}, \boldsymbol{z}') = \exp(-\gamma \|\boldsymbol{z} - \boldsymbol{z}'\|^2)$. In comparison, our proposed algorithm is Fair KRLS, which tackles Problem (4.4) with the fairness constraint and a Gaussian kernel.

We follow two different types of validation procedures. The first one is standard, and we call it Naive Validation (Naive). In particular, we performed a nested 10-fold cross validation (CV) to select the best hyperparameters and to test the final model. This procedure is repeated 30 times, and we report the average performance on the test set alongside its standard deviation. The second validation procedure, the Novel Validation Method (NVM) as in [OnetoC060], is slightly different and more focused on finding the best fair model among the ones with low error. Also in this case, as a general structure, we performed a nested 10-fold CV to test the final model. For the inner part of the nested CV, we employ a two-step procedure. In the first step, the 10-fold CV error for each combination of the hyperparameters is computed. In the second step, we shortlist all the hyperparameter combinations with error close to the best one (in our case, above 90% of the best MAPE). Finally, from this list, we select the hyperparameters with the lowest DGF.
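The second step of the NVM can be sketched as follows; the reading of the threshold ("above 90% of the best MAPE") as keeping every configuration whose CV MAPE is at most the best MAPE divided by 0.9 is our assumption, as are all names.

```python
def select_nvm(results, tol=0.9):
    """results: list of (hyperparams, cv_mape, cv_dgf) triples.
    Shortlist the configurations whose error is close to the best one,
    then return the shortlisted configuration with the lowest DGF."""
    best_mape = min(m for _, m, _ in results)
    shortlist = [r for r in results if r[1] <= best_mape / tol]  # assumed reading of the 90% rule
    return min(shortlist, key=lambda r: r[2])
```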

For the sake of completeness, all the experiments have been performed both with and without the sensitive feature in the model's functional form, i.e. with the sensitive feature available (or not) at test time.

5.1 Datasets

For the purpose of testing the proposed methodology we employed two different regression datasets.

The first one is a classic benchmark dataset for fairness, the Communities and Crime dataset (CRIME, http://archive.ics.uci.edu/ml/datasets/communities+and+crime). CRIME combines socioeconomic data and crime rate data on communities in the United States. In the case of a categorical sensitive feature, following [calders2013controlling], we derived a binary attribute from the percentage of black population, which splits the instances into two groups with different mean crime rates. Concerning the experiments with a continuous sensitive feature, we keep the real value of the percentage of black population, avoiding the binarization step, and discretize it into a uniform set of bins.

The second dataset is new and has been collected at the University of Genova (UNIGE). It is a proprietary and highly sensitive dataset containing data about the past and present students enrolled at UNIGE. In this study we take into consideration students who enrolled in the academic year (a.y.) 2017-2018. The dataset contains roughly five thousand instances, each described by several attributes (both numeric and categorical) about ethnicity, gender, financial status, and previous school experience. The goal is to predict the average grade at the end of the first semester. In the case of a categorical sensitive feature, we consider gender as the sensitive feature. In the case of a continuous sensitive attribute, we select the income of the student as the sensitive feature, discretized following the official separation into five bins used by the tuition system of the University of Genova (details at https://www.studenti.unige.it/tasse/importi/).

CRIME UNIGE
Method MAPE DGF MAPE DGF
Sensitive Feature not included in the model’s functional form.
Naive RLS
NVM RLS
NVM Fair RLS
Naive KRLS
NVM KRLS
NVM Fair KRLS
Sensitive Feature included in the model’s functional form.
Naive RLS
NVM RLS
NVM Fair RLS
Naive KRLS
NVM KRLS
NVM Fair KRLS
Table 1: Results with a categorical sensitive feature.
Figure 1: Two overlapped histograms (White and Black) of the quantities entering the DGF for the CRIME dataset with NVM KRLS and NVM Fair KRLS, when the sensitive feature is not included in the functional form of the model.
CRIME UNIGE
Method
Sensitive Feature not included
in the model’s functional form.
NVM Fair RLS
NVM Fair KRLS
Sensitive Feature included
in the model’s functional form.
NVM Fair RLS
NVM Fair KRLS
Table 2: Values of the approximation error of Proposition 3.7.
0 0.005 0.01
Method MAPE DGF MAPE DGF MAPE DGF
CRIME
Sensitive Feature not included in the model’s functional form.
NVM Fair RLS
NVM Fair KRLS
Sensitive Feature included in the model’s functional form.
NVM Fair RLS
NVM Fair KRLS
UNIGE
Sensitive Feature not included in the model’s functional form.
NVM Fair RLS
NVM Fair KRLS
Sensitive Feature included in the model’s functional form.
NVM Fair RLS
NVM Fair KRLS
Table 3: Results varying the acceptable unfairness with the number of bins fixed.
Dataset 5 10 20
Method MAPE DGF MAPE DGF MAPE DGF
CRIME
Sensitive Feature not included in the model’s functional form.
NVM Fair RLS
NVM Fair KRLS
Sensitive Feature included in the model’s functional form.
NVM Fair RLS
NVM Fair KRLS
UNIGE
Sensitive Feature not included in the model’s functional form.
NVM Fair RLS
NVM Fair KRLS
Sensitive Feature included in the model’s functional form.
NVM Fair RLS
NVM Fair KRLS
Table 4: Results varying the number of bins with the acceptable unfairness fixed.

5.2 Results and Discussion

Results for regression tasks with a categorical sensitive feature are presented in Table 1, where MAPE and DGF are shown for the different datasets (CRIME and UNIGE), algorithms (RLS and KRLS), validation procedures (Naive and NVM), with and without the fairness constraint, and with and without the sensitive feature available at test time.

For both datasets, the advantage of using our method is clear: we obtain fairer models (i.e. lower DGF) at the expense of a slightly higher error (i.e. higher MAPE). Moreover, having the sensitive feature at test time increases model accuracy (i.e. lower MAPE) but reduces fairness (i.e. higher DGF). The improvement is stronger in the kernel case and where the original unfairness of the standard method is higher.

An important question concerns the sensitivity of our method with respect to the parameter $\hat{\epsilon}$ (acceptable unfairness) and the number of discretization bins. Tables 3 and 4 report this analysis. We repeated the same experimental procedure of Table 1 for both datasets (CRIME and UNIGE), both algorithms (RLS and KRLS), and both availabilities of the sensitive feature at test time, with the fairness constraint active and with the NVM. We let $\hat{\epsilon}$ range in {0, 0.005, 0.01} with the number of bins fixed, and let the number of bins range in {5, 10, 20} while keeping $\hat{\epsilon}$ fixed. The results confirm our theoretical insights. Making $\hat{\epsilon}$ larger induces lower MAPE and larger DGF, confirming the trade-off between error and fairness. Concerning the bins, larger numbers of bins correspond to imposing a higher number of constraints, which impacts the MAPE negatively (the more bins, the higher the MAPE). On the other hand, increasing the number of bins makes the final model fairer, with a lower DGF.

Figure 1 shows the different behaviours of the standard non-linear regression model (without fairness constraints, i.e. NVM KRLS) and of our method (NVM Fair KRLS) on the CRIME dataset, specifically when the sensitive feature is not part of the model's functional form. In particular, we report the different elements in the summation which composes the DGF for the White and Black groups. Our method (bottom plot) obtains two probability distributions over the two groups that are more similar to each other than those of the baseline (top plot). This suggests that our method is fairer with respect to the selected sensitive feature.

In Table 2 we collect the values of the approximation error of Proposition 3.7, for both datasets, for both NVM Fair RLS and NVM Fair KRLS, with and without the sensitive feature in the model's functional form. As can be noted, the value remains small and, consequently, our method provides in practice a good convex approximation of the original non-convex optimization problem of Eq. (3.7).

As a final experiment, we empirically demonstrate that it is possible to generate fair models with continuous sensitive features. Table 5 reports the results for NVM KRLS and NVM Fair KRLS for both datasets, with and without the sensitive feature in the functional form of the model. The obtained MAPE and DGF confirm the results described above for categorical sensitive attributes, empirically demonstrating that our methodology is able to tackle regression tasks with both categorical and continuous sensitive features.

CRIME UNIGE
Method MAPE DGF MAPE DGF
Sensitive Feature not included in the model’s functional form.
NVM KRLS
NVM Fair KRLS
Sensitive Feature included in the model’s functional form.
NVM KRLS
NVM Fair KRLS
Table 5: Results with a continuous sensitive feature.

6 Conclusion and Future Work

In this work, we studied the problem of enhancing supervised learning with fairness requirements. We presented a framework based on empirical risk minimization under a novel and generalized fairness constraint. In contrast to previous methods, our approach can handle both regression and classification problems and both continuous and categorical sensitive attributes. Furthermore, we observed that our approach generalizes, and reduces to, known approaches available in the literature. We addressed the statistical properties of the method and considered a convex relaxation of the fairness constraint, which can be linked to the non-convex constraint by means of a data-dependent bound. We instantiated this approach in the setting of kernel methods, for which the convex fairness constraint can be efficiently implemented both implicitly and explicitly. Finally, we provided experimental results on two real-world datasets that indicate the effectiveness of our approach in comparison with baselines which either do not impose the fairness constraint or impose it only during the validation procedure. Future work will be devoted to extending the range of applicability of our method and to studying tighter bounds under specialized conditions.

References