1 Introduction
In recent years there has been a lot of interest in algorithmic fairness in machine learning; see, e.g.,
[dwork2018decoupled, hardt2016equality, zafar2017fairness, zemel2013learning, kilbertus2017avoiding, kusner2017counterfactual, calmon2017optimized, joseph2016fairness, chierichetti2017fair, jabbari2016fair, yao2017beyond, lum2016statistical, zliobaite2015relation] and references therein. The central question is how to enhance supervised learning algorithms with fairness requirements, namely ensuring that sensitive information (e.g. knowledge about the ethnic group of an individual) does not ‘unfairly’ influence the outcome of a learning algorithm. For example, if the learning problem is to decide whether a person should be offered a loan based on her previous credit card scores, we would like to build a model which does not unfairly use additional sensitive information such as race or sex. Several notions of fairness and associated learning methods have been introduced in machine learning in the past few years, including Demographic Parity
[calders2009building], Equal Odds and Equal Opportunities
[hardt2016equality], Disparate Treatment, Disparate Impact, and Disparate Mistreatment [zafar2017fairness]. The underlying idea behind such notions is to balance decisions of a classifier among the different sensitive groups and label sets.
In this paper, we build upon the notion of Equal Opportunity (EO), which defines fairness as the requirement that the true positive rate of the classifier is the same across the sensitive groups. In Section 2
we introduce a generalization of this notion of fairness which constrains the conditional risk of a classifier, associated with positively labeled samples of a group, to be approximately constant with respect to group membership. The risk is measured according to a prescribed loss function and approximation parameter $\epsilon$. When the loss is the misclassification error and $\epsilon = 0$ we recover the notion of EO above. We study the problem of minimizing the expected risk within a prescribed class of functions subject to the fairness constraint. As a natural estimator associated with this problem, we consider a modified version of Empirical Risk Minimization (ERM) which we call Fair ERM (FERM). We derive both risk and fairness bounds, which support that FERM is statistically consistent in a certain sense, which we explain in Section
2.2. Since the FERM approach is impractical due to the nonconvex nature of the constraint, we propose, still in Section 2.2, a surrogate convex FERM problem which relates, under a natural condition, to the original goal of minimizing the misclassification error subject to a relaxed EO constraint. We further observe that our condition can be empirically verified to judge the quality of the approximation in practice. As a concrete example of the framework, in Section 3 we describe how kernel methods such as support vector machines (SVMs) can be enhanced to satisfy the fairness constraint. We observe that a particular instance of the fairness constraint for $\epsilon = 0$ reduces to an orthogonality constraint. Moreover, in the linear case, the constraint translates into a preprocessing step that implicitly imposes the fairness requirement on the data, making any linear model learned with them fair. We report numerical experiments using both linear and nonlinear kernels, which indicate that our method improves on the state of the art in four out of five datasets and is competitive on the fifth dataset. (Footnote 1: Additional technical steps and experiments are presented in the supplementary materials.)
In summary, the contributions of this paper are twofold. First, we outline a general framework for empirical risk minimization under fairness constraints. The framework can be used as a starting point to develop specific algorithms for learning under fairness constraints. As a second contribution, we show how a linear fairness constraint arises naturally in the framework and allows us to develop a novel convex learning method that is supported by consistency properties both in terms of EO and risk of the selected model, performing favorably against state-of-the-art alternatives on a series of benchmark datasets.
Previous Work. Work on algorithmic fairness can be divided into three families. Methods in the first family modify a pretrained classifier in order to increase its fairness properties while maintaining the classification performance as much as possible: [pleiss2017fairness, beutel2017data, hardt2016equality, feldman2015certifying] are examples of these methods, but they provide neither consistency properties nor comparisons with state-of-the-art proposals. Methods in the second family enforce fairness directly during the training step: [agarwal2017reductions, agarwal2018reductions, woodworth2017learning, zafar2017fairness, menon2018cost, zafar2017parity, bechavod2018Penalizing, zafar2017fairnessARXIV, kamishima2011fairness, kearns2017preventing] are examples of this approach; they either provide nonconvex approaches to the solution of the problem, or derive consistency results only for the nonconvex formulation and later resort to a convex approach which is not theoretically grounded. [PrezSuay2017Fair, dwork2018decoupled, berk2017convex, alabi2018optimizing] are other examples of convex approaches, which do not compare with other state-of-the-art solutions and do not provide consistency properties. The exceptions are [dwork2018decoupled], which, contrary to our proposal, does not enforce a fairness constraint directly in the learning phase, and [olfat2018spectral], which proposes a computationally tractable fair SVM starting from a constraint on the covariance matrices. Specifically, the latter leads to a nonconvex constraint which is imposed iteratively with a sequence of relaxations exploiting spectral decompositions.
Finally, the third family of methods implements fairness by modifying the data representation and then employs standard machine learning methods: [adebayo2016iterative, calmon2017optimized, kamiran2009classifying, zemel2013learning, kamiran2012data, kamiran2010classification] are examples of these methods but, again, they provide neither consistency properties nor comparisons with state-of-the-art proposals. Our method belongs to the second family of methods, in that it directly optimizes a fairness constraint related to the notion of EO discussed above. Furthermore, in the case of linear models, our method translates to an efficient preprocessing of the input data, as with methods in the third family. As we shall see, our approach is theoretically grounded and performs favorably against the state of the art. (Footnote 2: A detailed comparison between our proposal and the state of the art is reported in the supplementary materials.)
2 Fair Empirical Risk Minimization
In this section, we present our approach to learning with fairness. We begin by introducing our notation. We let be a sequence of
samples drawn independently from an unknown probability distribution
over , where is the set of binary output labels, represents group membership among two groups (e.g. ‘female’ or ‘male’), and is the input space. (Footnote 3: The extension to multiple groups (e.g. ethnic group) is briefly discussed in the supplementary material.) We note that the input may or may not contain the sensitive feature. We also denote by for and . Let us consider a function (or model) chosen from a set of possible models. The error (risk) of in approximating is measured by a prescribed loss function . The risk of is defined as . When necessary we will indicate with a subscript the particular loss function used, i.e. .
The purpose of a learning procedure is to find a model that minimizes the risk. Since the probability measure is usually unknown, the risk cannot be computed; however, we can compute the empirical risk , where denotes the empirical expectation. A natural learning strategy, called Empirical Risk Minimization (ERM), is then to minimize the empirical risk within a prescribed set of functions.
2.1 Fairness Definitions
In the literature there are different definitions of fairness of a model or learning algorithm [hardt2016equality, dwork2018decoupled, zafar2017fairness], but there is not yet a consensus on which definition is most appropriate. In this paper, we introduce a general notion of fairness which encompasses some previously used notions and allows new ones to be introduced by specifying the loss function used below.
Definition 1.
Let be the risk of the positive labeled samples in the th group, and let . We say that a function is fair if .
This definition says that a model is fair if it commits approximately the same error on the positive class independently of the group membership. That is, the conditional risk is approximately constant across the two groups. Note that if $\epsilon = 0$ and we use the hard loss function, , then Definition 1 is equivalent to the definition of EO proposed by [hardt2016equality], namely
(1) 
This equation means that the true positive rate is the same across the two groups. Furthermore, if we use the linear loss function and set , then Definition 1 gives
(2) 
By reformulating this expression we obtain a notion of fairness that has been proposed by [dwork2018decoupled]
Yet another implication of Eq. (2) is that the output of the model is uncorrelated with respect to the group membership conditioned on the label being positive, that is, for every , we have
Finally, we observe that our approach naturally generalizes to other fairness measures, e.g. equal odds [hardt2016equality], which could be subject of future work. Specifically, we would require in Definition 1 that for both .
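A short worked step may help connect the linear loss to Eq. (2). The notation here is an assumption, since the extracted text does not show the original symbols: we write $s \in \{a, b\}$ for the group, $L^{+,g}_{l}(f)$ for the conditional risk on positively labeled samples of group $g$ under the linear loss, and we take the linear loss to have the form $\ell_l(f(x), y) = (1 - y f(x))/2$.

```latex
\begin{align*}
L^{+,g}_{l}(f)
  &= \mathbb{E}\big[\ell_l(f(x), y) \mid y = 1,\ s = g\big]
   = \tfrac{1}{2}\Big(1 - \mathbb{E}\big[f(x) \mid y = 1,\ s = g\big]\Big), \\
L^{+,a}_{l}(f) - L^{+,b}_{l}(f)
  &= \tfrac{1}{2}\Big(\mathbb{E}\big[f(x) \mid y = 1,\ s = b\big]
     - \mathbb{E}\big[f(x) \mid y = 1,\ s = a\big]\Big).
\end{align*}
```

Under these assumptions, requiring the difference to vanish is the same as requiring equal conditional means of $f(x)$ across the two groups given a positive label, which is consistent with the reading of Eq. (2) as a statement that the model output is uncorrelated with group membership on the positive class.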
2.2 Fair Empirical Risk Minimization
In this paper, we aim at minimizing the risk subject to a fairness constraint. Specifically, we consider the problem
(3) 
where is the amount of unfairness that we are willing to bear. Since the measure is unknown we replace the deterministic quantities with their empirical counterparts. That is, we replace Problem 3 with
(4) 
where . We will refer to Problem 4 as FERM.
We denote by a solution of Problem 3, and by a solution of Problem 4. In this section we will show that these solutions are linked one to another. In particular, if the parameter is chosen appropriately, we will show that, in a certain sense, the estimator is consistent. In order to present our observations, we require that it holds with probability at least that
(5) 
where the bound goes to zero as grows to infinity if the class is learnable with respect to the loss [see e.g. shalev2014understanding, and references therein]. For example, if is a compact subset of linear separators in a Reproducing Kernel Hilbert Space (RKHS), and the loss is Lipschitz in its first argument, then can be obtained via Rademacher bounds [see e.g. bartlett2002rademacher]. In this case goes to zero at least as as grows to infinity, where .
We are ready to state the first result of this section (proof is reported in supplementary materials).
Theorem 1.
A consequence of the first statement of Theorem 1 is that as tends to infinity tends to a value which is not larger than , that is, FERM is consistent with respect to the risk of the selected model. The second statement of Theorem 1, instead, implies that as tends to infinity we have that tends to be fair. In other words, FERM is consistent with respect to the fairness of the selected model.
Thanks to Theorem 1 we can state that is close to both in terms of its risk and its fairness. Nevertheless, our final goal is to find an which solves the following problem
(7) 
Note that the objective function in Problem 7 is the misclassification error of the classifier , whereas the fairness constraint is a relaxation of the EO constraint in Eq. (1). Indeed, the quantity is equal to
(8) 
We refer to this quantity as the difference of EO (DEO).
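The empirical quantities behind Definition 1 and the DEO can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, labels are assumed to take values in {-1, +1}, the hard loss is assumed to count a score of exactly zero as an error, and the linear loss is assumed to have the form (1 - y f(x)) / 2.

```python
import numpy as np


def hard_loss(scores, labels):
    # Misclassification indicator (assumed convention: y * f(x) <= 0 is an error).
    return (labels * scores <= 0).astype(float)


def linear_loss(scores, labels):
    # Assumed form of the linear loss: (1 - y * f(x)) / 2.
    return (1.0 - labels * scores) / 2.0


def positive_group_risk(scores, labels, groups, group, loss):
    # Empirical risk restricted to positively labeled samples of one group.
    mask = (labels == 1) & (groups == group)
    if not mask.any():
        raise ValueError("no positively labeled samples in this group")
    return float(loss(scores[mask], labels[mask]).mean())


def deo(scores, labels, groups, group_a, group_b):
    # Absolute difference of the true positive rates of sign(f) on the two groups.
    rates = []
    for g in (group_a, group_b):
        mask = (labels == 1) & (groups == g)
        if not mask.any():
            raise ValueError("no positively labeled samples in this group")
        rates.append(float((scores[mask] > 0).mean()))
    return abs(rates[0] - rates[1])


scores = np.array([1.0, -1.0, 2.0, 0.5, -0.3, 1.0])
labels = np.array([1, 1, 1, 1, -1, -1])
groups = np.array([0, 0, 1, 1, 0, 1])

risk_a = positive_group_risk(scores, labels, groups, 0, hard_loss)
risk_b = positive_group_risk(scores, labels, groups, 1, hard_loss)
print(abs(risk_a - risk_b), deo(scores, labels, groups, 0, 1))
```

With the hard loss, the gap between the two conditional risks coincides with the DEO, because the true positive rate on a group is one minus its conditional misclassification rate. Passing `linear_loss` instead gives the quantity that is constrained in the convex relaxation.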
Although Problem 7 cannot be solved, by exploiting Theorem 1 we can safely search for a solution of its empirical counterpart
(9) 
Unfortunately Problem 9 is a difficult nonconvex nonsmooth problem, and for this reason it is more convenient to solve a convex relaxation. That is, we replace the hard loss in the risk with a convex loss function (e.g. the Hinge loss ) and the hard loss in the constraint with the linear loss . In this way, we look for a solution of the convex FERM problem
(10) 
The questions that arise here are whether is close to , how close, and under which assumptions. The following proposition sheds some light on these issues (proof is reported in supplementary materials, Section A).
Proposition 1.
If is the Hinge loss then . Moreover, if for the following condition is true
(11) 
then it also holds that
The first statement of Proposition 1 tells us that exploiting instead of is a good approximation if is small. The second statement of Proposition 1, instead, tells us that if the hypothesis of inequality (11) holds, then the linear loss based fairness is close to the EO. Obviously the smaller is, the closer they are. Inequality (11) says that the functions and distribute, on average, in a similar way. This condition is quite natural and it has been exploited in previous work [see e.g. maurer2004note]. Moreover, in Section 4 we present experiments showing that is small.
The bound in Proposition 1 may be tightened by using different nonlinear approximations of EO [see e.g. calmon2017optimized]. However, the linear approximation proposed in this work gives a convex problem and, as we shall see in Section 4, works well in practice.
3 Fair Learning with Kernels
In this section, we specialize the FERM framework to the case in which the underlying space of models is a reproducing kernel Hilbert space (RKHS) [see e.g. shawe2004kernel, smola2001, and references therein]. We let be a positive definite kernel and let be an induced feature mapping such that , for all , where is the Hilbert space of square summable sequences. Functions in the RKHS can be parametrized as
(12) 
for some vector of parameters . In practice a bias term (threshold) can be added to but to ease our presentation we do not include it here.
We solve Problem (10) with a ball in the RKHS and employ a convex loss function . As for the fairness constraint we use the linear loss function, which implies the constraint to be convex. Let be the barycenter in the feature space of the positively labelled points in the group , that is
(13) 
where . Then using Eq. (12) the constraint in Problem (10) takes the form .
In practice, we solve the Tikhonov regularization problem
(14) 
where and is a positive parameter which controls model complexity. In particular, if the constraint in Problem (14) reduces to an orthogonality constraint that has a simple geometric interpretation. Specifically, the vector is required to be orthogonal to the vector formed by the difference between the barycenters of the positive labelled input samples in the two groups.
By the representer theorem [scholkopf2001generalized], the solution to Problem (14) is a linear combination of the feature vectors and the vector . However, in our case is itself a linear combination of the feature vectors (in fact only those corresponding to the subset of positive labeled points) hence is a linear combination of the input points, that is . The corresponding function used to make predictions is then given by . Let be the Gram matrix. The vector of coefficients can then be found by solving
In our experiments below we consider this particular case of Problem (14) and furthermore choose the loss function to be the Hinge loss. The resulting method is an extension of SVM. The fairness constraint and, in particular, the orthogonality constraint when $\epsilon = 0$, can be easily added within standard SVM solvers. (Footnote 4: In the supplementary material we derive the dual of Problem (14) when is the Hinge loss.)
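In the kernel case the linear-loss constraint can be written directly in terms of the Gram matrix. The sketch below is illustrative only: the function name is hypothetical, the Gram matrix is assumed to be indexed as `K[i, j] = kernel(x_i, x_j)`, and labels are assumed to take values in {-1, +1}. It builds a vector `c` such that the inner product of the coefficient vector with `c` equals the difference between the average model output on the positively labeled points of the two groups.

```python
import numpy as np


def fairness_constraint_vector(K, labels, groups, group_a, group_b):
    # c[i] = mean_j K[i, j] over positives of group_a
    #      - mean_j K[i, j] over positives of group_b.
    # For f(x) = sum_i alpha[i] * kernel(x_i, x), alpha @ c is the difference
    # of the mean outputs on the two positive groups.
    pos_a = (labels == 1) & (groups == group_a)
    pos_b = (labels == 1) & (groups == group_b)
    if not pos_a.any() or not pos_b.any():
        raise ValueError("each group needs at least one positively labeled sample")
    return K[:, pos_a].mean(axis=1) - K[:, pos_b].mean(axis=1)


# Small check with a linear kernel, where the feature map is explicit.
rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))
labels = np.array([1, 1, 1, 1, -1, -1, 1, -1])
groups = np.array([0, 0, 1, 1, 0, 1, 1, 0])
K = X @ X.T
alpha = rng.normal(size=8)

c = fairness_constraint_vector(K, labels, groups, 0, 1)
w = X.T @ alpha  # explicit weight vector for the linear kernel
u = (X[(labels == 1) & (groups == 0)].mean(axis=0)
     - X[(labels == 1) & (groups == 1)].mean(axis=0))
print(np.isclose(alpha @ c, w @ u))
```

The orthogonality constraint then reads `alpha @ c == 0`, and the relaxed version bounds `abs(alpha @ c)`. Either form is linear in the coefficients, which is why it can be added to a standard quadratic-programming SVM formulation.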
It is instructive to consider Problem (14) when is the identity mapping (i.e. is the linear kernel on ) and . In this special case we can solve the orthogonality constraint for , where the index is such that , obtaining that . Consequently the linear model rewrites as . In this way, we then see the fairness constraint is implicitly enforced by making the change of representation , with
(15) 
In other words, we are able to obtain a fair linear model without any other constraint and by using a representation that has one feature fewer than the original one. (Footnote 5: The generalization of this argument to kernels for SVM is reported in the supplementary material.)
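The change of representation in the linear case can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' released code: the function name is hypothetical, labels are assumed to take values in {-1, +1}, and the vector `u` is taken to be the difference between the barycenters of the positively labeled points of the two groups, as in the orthogonality constraint. The pivot index is the coordinate of `u` with the largest absolute value, and each remaining feature `j` is replaced by `x_j - x_i * u_j / u_i`.

```python
import numpy as np


def fair_linear_representation(X, labels, groups, group_a, group_b):
    # Difference of the barycenters of the positively labeled points per group.
    pos_a = (labels == 1) & (groups == group_a)
    pos_b = (labels == 1) & (groups == group_b)
    if not pos_a.any() or not pos_b.any():
        raise ValueError("each group needs at least one positively labeled sample")
    u = X[pos_a].mean(axis=0) - X[pos_b].mean(axis=0)

    # Pivot on the coordinate where |u| is largest.
    i = int(np.argmax(np.abs(u)))
    if u[i] == 0:
        raise ValueError("barycenters coincide; there is nothing to remove")
    keep = np.arange(X.shape[1]) != i
    ratio = u[keep] / u[i]

    def transform(Z):
        # Map each row x to the vector with entries x_j - x_i * u_j / u_i, j != i.
        Z = np.asarray(Z, dtype=float)
        return Z[:, keep] - np.outer(Z[:, i], ratio)

    return transform, u, i


rng = np.random.default_rng(1)
X = rng.normal(size=(20, 4))
labels = np.array([1] * 12 + [-1] * 8)
groups = np.array([0, 1] * 10)

transform, u, i = fair_linear_representation(X, labels, groups, 0, 1)
X_fair = transform(X)
pos_a = (labels == 1) & (groups == 0)
pos_b = (labels == 1) & (groups == 1)
print(np.allclose(X_fair[pos_a].mean(axis=0), X_fair[pos_b].mean(axis=0)))
```

After the transformation the two positive barycenters coincide, so any linear model trained on `X_fair` has the same average output on the positively labeled points of both groups in the training sample. The same `transform` should be applied to test points using the `u` and pivot estimated on the training set.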
4 Experiments
In this section, we present numerical experiments with the proposed method on one synthetic and five real datasets. The aim of the experiments is threefold. First, we show that our approach is effective in selecting a fair model, incurring only a moderate loss in accuracy. Second, we provide an empirical study of the properties of the method, which supports our theoretical observations in Section 2. Third, we highlight the generality of our approach by showing that it can be used effectively within other linear models such as Lasso.
We use our approach with $\epsilon = 0$ in order to simplify the hyperparameter selection procedure. For the sake of completeness, a set of results for different values of $\epsilon$ is presented in the supplementary material and we briefly comment on these below. In all the experiments, we collect statistics concerning the classification accuracy and DEO of the selected model. We recall that the DEO is defined in Eq. (8) and is the absolute difference of the true positive rate of the classifier applied to the two groups. In all experiments, we performed a 10-fold cross validation (CV) to select the best hyperparameters. (Footnote 6: The regularization parameter (for both SVM and our method) with values, equally spaced in logarithmic scale between and ; we used either the linear or the RBF kernel (i.e. for two examples and , the RBF kernel is ) with . In our case, of Eq. (14).) For the Arrhythmia, COMPAS, German and Drug datasets, this procedure is repeated times, and we reported the average performance on the test set alongside its standard deviation. For the Adult dataset, we used the provided split of train and test sets. Unless otherwise stated, we employ two steps in the 10-fold CV procedure. In the first step, the value of the hyperparameters with highest accuracy is identified. In the second step, we shortlist all the hyperparameters with accuracy close to the best one (in our case, above
of the best accuracy). Finally, from this list, we select the hyperparameters with the lowest DEO. This novel validation procedure, which we will call NVP, is a sanity check to ensure that fairness cannot be achieved by a simple modification of the hyperparameter selection procedure. The code of our method is available at: https://github.com/jmikko/fair_ERM.
Synthetic Experiment. The aim of this experiment is to study the behavior of our method, in terms of both DEO and classification accuracy, in comparison to standard SVM (with our novel validation procedure). To this end, we generated a synthetic binary classification dataset with two sensitive groups in the following manner. For each group in the class and for the group in the class , we generated examples for training and the same amount for testing. For the group in the class , we generated examples for training and the same number for testing. Each set of examples is sampled from a dimensional isotropic Gaussian distribution with different mean and variance : (i) Group , Label : , ; (ii) Group , Label : , ; (iii) Group , Label : , ; (iv) Group , Label : , . When a standard machine learning method is applied to this toy dataset, the generated model is unfair with respect to the group , in that the classifier tends to negatively classify the examples in this group.
We trained different models, varying the value of the hyperparameter , and using the standard linear SVM and our linear method. Figure 1 (Left) shows the performance of the various generated models with respect to the classification error and DEO on the test set. Note that our method generated models that have a higher level of fairness, while maintaining a good level of accuracy. The grid in the plots emphasizes the fact that both the error and DEO have to be simultaneously considered in the evaluation of a method. Figure 1 (Center and Right) depicts the histogram of the values of (where is the generated model) for test examples with true label equal to for each of the two groups. The results are reported both for our method (Right) and standard SVM (Center). Note that our method generates a model with a similar true positive rate among the two groups (i.e. the areas of the value when the horizontal axis is greater than zero are similar for groups and ). Moreover, due to the simplicity of the toy test, the distribution with respect to the two different groups is also very similar when our model is used.
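The two-step validation procedure (NVP) described earlier in this section can be sketched as a small selection routine. This is an illustration under stated assumptions: the function name and the `accuracy_fraction` parameter are hypothetical, the shortlist threshold used in the paper is not recoverable from this text and is therefore left as an argument, and the per-candidate accuracy and DEO are assumed to have been computed beforehand by cross validation.

```python
def select_with_nvp(candidates, accuracy_fraction):
    """Pick hyperparameters by accuracy shortlist, then lowest DEO.

    candidates: list of (params, cv_accuracy, cv_deo) tuples.
    accuracy_fraction: hypothetical threshold in (0, 1]; a candidate is
        shortlisted when its accuracy is at least this fraction of the best.
    """
    if not candidates:
        raise ValueError("no candidates to select from")
    # Step 1: identify the best cross-validated accuracy.
    best_accuracy = max(acc for _, acc, _ in candidates)
    # Step 2: shortlist candidates whose accuracy is close to the best one.
    shortlist = [c for c in candidates if c[1] >= accuracy_fraction * best_accuracy]
    # Final choice: the shortlisted candidate with the lowest DEO.
    return min(shortlist, key=lambda c: c[2])


# Illustrative values only, not results from the paper.
candidates = [
    ({"C": 0.1}, 0.80, 0.05),
    ({"C": 1.0}, 0.90, 0.20),
    ({"C": 10.0}, 0.89, 0.10),
]
params, accuracy, deo_value = select_with_nvp(candidates, accuracy_fraction=0.95)
print(params, accuracy, deo_value)
```

Setting `accuracy_fraction` to 1 reduces the routine to plain accuracy-based selection (up to ties), which corresponds to the naive baseline that ignores fairness during validation.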
Real Data Experiments. We next compare the performance of our model to a set of different methods on publicly available datasets: Arrhythmia, COMPAS, Adult, German, and Drug. A description of the datasets is provided in the supplementary material. These datasets have been selected from the standard databases of datasets (UCI, mldata and FairnessMeasures). (Footnote 7: FairnessMeasures website: fairnessmeasures.org.) We considered only datasets with a DEO higher than , when the model is generated by an SVM validated with the NVP. For this reason, some of the commonly used datasets have been discarded (e.g. Diabetes, Heart, SAT, PSUChile, and SOEP). We compared our method, both in the linear and in the nonlinear case, against: (i) Naïve SVM, validated with a standard nested 10-fold CV procedure. This method ignores fairness in the validation procedure, simply trying to optimize accuracy; (ii) SVM with the NVP. As noted above, this baseline is the simplest way to inject fairness into the model; (iii) the method of Hardt [hardt2016equality] applied to the best SVM; (iv) the method of Zafar [zafar2017fairness], implemented with the code provided by the authors for the linear case. (Footnote 8: Python code for [zafar2017fairness]: https://github.com/mbilalzafar/fairclassification.) In the linear case, our method exploits the preprocessing presented in Section 3.
Arrhythmia  COMPAS  Adult  German  Drug  
Method  ACC  DEO  ACC  DEO  ACC  DEO  ACC  DEO  ACC  DEO 
not inside  
Naïve Lin. SVM  
Lin. SVM  
Hardt                     
Zafar  
Lin. Ours  
Naïve SVM  
SVM  
Hardt                     
Ours  
inside  
Naïve Lin. SVM  
Lin. SVM  
Hardt  
Zafar  
Lin. Ours  
Naïve SVM  
SVM  
Hardt  
Ours 
Table 1 shows our experimental results for all the datasets and methods, both when the sensitive feature is included in the input and when it is not. This result suggests that our method performs favorably over the competitors in that it decreases DEO substantially with only a moderate loss in accuracy. Moreover, including the sensitive feature in the input increases the accuracy but, for the methods without the specific purpose of producing fair models, decreases the fairness. On the other hand, including the sensitive feature gives our method the ability to improve fairness by exploiting its value also in the prediction phase. This is to be expected, since knowing the group membership increases our information but also leads to behaviours able to influence the fairness of the predictive model. In order to quantify this effect, we present in Figure 2 the results of Table 1 for linear (left) and nonlinear (right) methods, when the error (one minus accuracy) and the DEO are normalized column-wise and when the sensitive feature is included in the input. (Footnote 9: The case when the sensitive feature is not included in the input is reported in the supplementary materials.) In the figure, different symbols and colors refer to different datasets and methods, respectively. The closer a point is to the origin, the better the result is. The best accuracy is, in general, reached by using the Naïve SVM (in red) both for the linear and nonlinear case. This behavior is expected due to the absence of any fairness constraint. On the other hand, Naïve SVM has unsatisfactory levels of fairness. The methods of Hardt [hardt2016equality] (in blue) and Zafar [zafar2017fairness] (in cyan, for the linear case) are able to obtain a good level of fairness, but the price of this fair model is a strong decrease in accuracy. Our method (in magenta) obtains similar or better results concerning the DEO while preserving the performance in accuracy. In particular, in the nonlinear case, our method reaches the lowest levels of DEO with respect to all the methods.
For the sake of completeness, in the nonlinear (bottom) part of Figure 2, we show our method when the parameter is set to (in brown) instead of (in magenta). As expected, the generated models are less fair with a (small) improvement in the accuracy. An in-depth analysis of the role of this parameter is presented in the supplementary materials.
Application to Lasso. Due to the particular proposed methodology, we are able in principle to apply our method to any learning algorithm. In particular, when the algorithm generates a linear model we can exploit the data preprocessing in Eq. (15), to directly impose fairness in the model. Here, we show how it is possible to obtain a sparse and fair model by exploiting the standard Lasso algorithm in synergy with this preprocessing step. For this purpose, we selected the Arrhythmia dataset as the Lasso works well in a high dimensional / small sample setting. We performed the same experiment described above, where we used the Lasso algorithm in place of the SVM. In this case, by Naïve Lasso, we refer to the Lasso when it is validated with a standard nested 10fold CV procedure, whereas by Lasso we refer to the standard Lasso with the NVP outlined above. The method of [hardt2016equality] has been applied to the best Lasso model. Moreover, we reported the results obtained using Naïve Linear SVM and Linear SVM. We also repeated the experiment by using a reduced training set in order to highlight the effect of the sparsity. Table 4 reported in the supplementary material shows the results. It is possible to note that, reducing the training sets, the generated models become less fair (i.e. the DEO increases). Using our method, we are able to maintain a fair model reaching satisfactory accuracy results.
The Value of . Finally, we show experimental results to highlight that the hypotheses of Proposition 1 (Section 2.2) are reasonable in real cases. We know that, if inequality (11) is satisfied, the linear loss based fairness is close to the EO. Specifically, these two quantities are closer when is small. We evaluated for benchmark and toy datasets. The obtained results are in Table 5 of the supplementary material, where has the order of magnitude of in all the datasets. Consequently, our method is able to obtain a good approximation of the DEO.
5 Conclusion and Future Work
We have presented a generalized notion of fairness, which encompasses previously introduced notions and can be used to constrain ERM in order to learn fair classifiers. The framework is appealing both theoretically and practically. Our theoretical observations provide a statistical justification for this approach and our algorithmic observations suggest a way to implement it efficiently in the setting of kernel methods. Experimental results suggest that our approach is promising for applications, generating models with improved fairness properties while maintaining classification accuracy. We close by mentioning directions of future research. On the algorithmic side, it would be interesting to study whether our method can be improved by other relaxations of the fairness constraint beyond the linear loss used here. Applications of the fairness constraint to multiclass classification or to regression tasks would also be valuable. On the theory side, it would be interesting to study how the choice of the parameter affects the statistical performance of our method and to derive the optimal accuracy-fairness trade-off as a function of this parameter.
References
Supplementary Material
Appendix A Proofs

We first use Eq. (5) to conclude that, with probability at least ,
(16)
This inequality in turn implies that, with probability at least , it holds that
(17)
Now, in order to prove the first statement of the theorem, let us decompose the excess risk as
Inequality (17) implies that with probability at least and consequently with probability at least it holds that
The first statement now follows by Eq. (5). As for the second statement, its proof consists in exploiting the results of Eqns. (16) and (17) together with a union bound. ∎

The proof of the first statement follows directly by the inequality . In order to prove the second statement, we first note that
By applying the same reasoning to and by exploiting inequality (11) the result follows. ∎
Appendix B Literature Review of Fairness Methods
In this section, we provide a brief analysis of the different existing methods concerning fairness. We show our findings in Table 2, where the rows represent properties, characteristics and experimental results of different fairness methods. The columns represent the different algorithms and, specifically, the first column is our approach. We think that, at this stage of development of fairness in machine learning, a clear understanding of the differences and similarities among the currently available algorithms is a fundamental step. Table 2 describes, in the first row, the family of the different methods, following the taxonomy defined in this paper (see Section 1). The following rows describe general properties of the methods, for example the convexity of the approach, the convergence of the learning phase, or the consistency with respect to the risk and the fairness notion. The next rows describe the presence of a specific comparison between methods and, finally, the last row reports the availability of the code online.
Ref.  Ours  [adebayo2016iterative]  [calmon2017optimized]  [agarwal2017reductions, agarwal2018reductions]  [woodworth2017learning]  [zafar2017fairness]  [kamiran2009classifying]  [PrezSuay2017Fair]  [zemel2013learning]  [menon2018cost]  [dwork2018decoupled]  [zafar2017parity]  [pleiss2017fairness]  [beutel2017data]  [bechavod2018Penalizing]  [hardt2016equality]  [zafar2017fairnessARXIV]  [berk2017convex]  [kamishima2011fairness]  [feldman2015certifying]  [kamiran2012data]  [kamiran2010classification]  [alabi2018optimizing]  [olfat2018spectral] 

Method Family  2&3  3  3  2  2  2  3  2  3  2  2  2  1    2  1  2  2  2  1  3  3  2  2 
Classification  x  x  x  x  x  x  x  x  x  x  x  x  x  x  x  x  x  x  x  x  x  x  
New Fairness Notions  x  x  x  x  x  x  x  x  x  x  x  x  x  x  x  x  x  x  x  x  x  x  
Use of EO  x  x  x  x  x  x  x  
Convex Approach  x  x  x  x  x  x  x  x  x  x  x  
Convergence Learning  x  x  x  x  x  x  x  x  
Consistency RiskFairness  x  x  x  x  x  
Experimental Results  x  x  x  x  x  x  x  x  x  x  x  x  x  x  x  x  x  x  x  x  x  x  
Epsilon validate  x  
Exp. w.r.t. [hardt2016equality]  x  x  x  x  x  
Exp. w.r.t. [zafar2017fairness]  x  x  x  x  
Exp. w.r.t. [kamiran2012data]  x  
Exp. w.r.t. Baseline in [zafar2017fairness]  x  x  
Exp. w.r.t. [kamiran2009classifying]  x  x  x  
Exp. w.r.t. [kamishima2011fairness]  x  x  x  
Exp. w.r.t. [kamiran2010classification]  x  
Exp. w.r.t. [zemel2013learning]  x  
Code Available  x  x  x  x  x  x  x 
Appendix C Datasets
In the following, the datasets used in Section 4 are presented, outlining their tasks, types of features and sources of data. Table 3 provides a summary of the dataset statistics.

Arrhythmia: from the UCI repository, this database contains 279 attributes concerning the study of H. Altay Guvenir. The aim is to distinguish between the presence and absence of cardiac arrhythmia and to classify it in one of the 16 groups. In our case, we recast the task as a binary classification between "Class 01" (i.e. "Normal") and the other 15 classes (different classes of arrhythmia).

COMPAS (Correctional Offender Management Profiling for Alternative Sanctions): it is a popular commercial algorithm used by judges and parole officers for scoring a criminal defendant’s likelihood of reoffending (recidivism). A 2-year follow-up study has shown that the algorithm is biased in favor of white defendants. This dataset contains variables used by the COMPAS algorithm in scoring defendants, along with their outcomes within 2 years of the decision, for over 10000 criminal defendants in Broward County, Florida. In the original data, 3 subsets are provided. We concentrate on the one that includes only violent recidivism. (Footnote 10: Analysis of the recidivism COMPAS dataset: www.propublica.org/article/howweanalyzedthecompasrecidivismalgorithm.)

Adult: from the UCI repository, this database contains 14 features concerning demographic characteristics of the instances (32561 for training and 12661 for test). The task is to predict whether a person's yearly income is more (or less) than 50,000 USD. For this dataset we used the provided training and test sets.

German: a dataset where the task is to classify people, each described by a set of 20 features (7 numerical, 13 categorical), as good or bad credit risks. The features relate to the economic situation of the person, for example: credit history and amount, savings account and bonds, years in the present employment, property and others. Moreover, a set of features concerns personal information, e.g. age, gender, whether the person is a foreigner, and personal status.

Drug: this dataset contains records for 1885 respondents. Each respondent is described by 12 features: personality measurements, which include NEO-FFI-R (neuroticism, extraversion, openness to experience, agreeableness, and conscientiousness), BIS-11 (impulsivity), and ImpSS (sensation seeking), together with level of education, age, gender, country of residence and ethnicity. All input attributes are originally categorical and are quantified; after quantification, the values of all input features can be considered real-valued. In addition, participants were questioned about their use of 17 legal and illegal drugs and one fictitious drug (Semeron), which was introduced to identify overclaimers. For each drug, the respondents had to select one of the answers: never used the drug, used it over a decade ago, or in the last decade, year, month, week, or day. In this sense, this dataset contains 18 classification problems, each one with seven classes: "Never Used", "Used over a Decade Ago", "Used in Last Decade", "Used in Last Year", "Used in Last Month", "Used in Last Week", and "Used in Last Day". We turn the problem concerning heroin into a binary problem by considering the task "Never Used" versus "Others" (i.e. "Used").
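Table 3: Dataset statistics. For Adult, the two values are the training and test set sizes.

| Dataset | Examples | Features | Sensitive Variable |
|---|---|---|---|
| Arrhythmia | 452 | 279 | Gender |
| COMPAS | 6172 | 10 | Ethnicity |
| Adult | 32561, 12661 | 12 | Gender |
| German | 1700 | 20 | Foreign |
| Drug | 1885 | 11 | Ethnicity |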
Appendix D Varying the Value of $\epsilon$
In this section we present a set of experiments, as a proof of concept, showing that our choice of $\epsilon$ for our method is reasonable, and we study the impact of different values of $\epsilon$ on DEO and accuracy.
We follow the experimental setting of Section 4 on the Drug dataset, running our nonlinear method with several values of $\epsilon$. The results of this experiment are presented in Figure 3, which also shows the results for the Naïve SVM and the Hardt method. As $\epsilon$ increases, our model attains a smaller error at the cost of stronger unfairness (i.e. higher DEO).
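To make the quantity on the fairness axis concrete, the following minimal sketch computes the DEO, assuming it is defined as the absolute difference between the true positive rates of two sensitive groups (consistent with the Equal Opportunity notion recalled in the introduction). The function names and the toy data are illustrative assumptions, not part of the experimental code.

```python
def true_positive_rate(y_true, y_pred, group, g):
    """Fraction of positively labeled members of group g predicted positive."""
    hits = [p == 1 for t, p, s in zip(y_true, y_pred, group) if t == 1 and s == g]
    if not hits:
        raise ValueError(f"group {g!r} has no positively labeled examples")
    return sum(hits) / len(hits)


def deo(y_true, y_pred, group, a, b):
    """Difference of Equal Opportunity (assumed form): |TPR_a - TPR_b|."""
    return abs(true_positive_rate(y_true, y_pred, group, a)
               - true_positive_rate(y_true, y_pred, group, b))


# Toy data (illustrative only): labels in {+1, -1}, two groups "a" and "b".
y_true = [1, 1, 1, -1, 1, 1, -1, 1]
y_pred = [1, 1, -1, -1, 1, -1, 1, -1]
group = ["a", "a", "a", "a", "b", "b", "b", "b"]

deo_value = deo(y_true, y_pred, group, "a", "b")
print(f"DEO = {deo_value:.3f}")
```

A DEO of zero means both groups have the same true positive rate; larger values indicate stronger unfairness in the sense used above.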
Appendix E Visualization of the results of Table 1
In Figure 4 we report the equivalent of Figure 2 for the case when the sensitive feature $s$ is not included in the input $x$. Note that we reach the same conclusions drawn for Table 1 and Figure 2.
Appendix F Approximation of the DEO
In this section, we numerically show the difference between the DEO and our approximation of it. Figure 5 compares the DEO with our approximation of the DEO and with the classification error. We collected these results for the German dataset on the validation set, varying the two hyperparameters $C$ and $\gamma$ (in the nonlinear case). Our approximation of the DEO is empirically close to the original DEO. Note that an accurate approximation of the DEO is particularly important where the error is low, since this is the region from which models are selected.
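The comparison can be illustrated with a small sketch. It assumes that the approximation replaces the hard prediction $\mathrm{sign}(f(x))$ by the linear loss $(1 - y f(x))/2$ evaluated on the positively labeled points of each group; this form of the loss, the function names and the toy scores are all assumptions made for illustration.

```python
def mean(values):
    values = list(values)
    if not values:
        raise ValueError("empty group")
    return sum(values) / len(values)


def positive_scores(scores, y, s, g):
    """Scores f(x_i) of the positively labeled examples in group g."""
    return [f for f, yi, si in zip(scores, y, s) if yi == 1 and si == g]


def deo_hard(scores, y, s, a, b):
    """|TPR_a - TPR_b| computed from hard predictions sign(f(x))."""
    def tpr(g):
        return mean(1.0 if f > 0 else 0.0 for f in positive_scores(scores, y, s, g))
    return abs(tpr(a) - tpr(b))


def deo_linear(scores, y, s, a, b):
    """Same gap measured with the assumed linear loss (1 - y f(x)) / 2, y = +1."""
    def risk(g):
        return mean((1.0 - f) / 2.0 for f in positive_scores(scores, y, s, g))
    return abs(risk(a) - risk(b))


# Toy real-valued scores (illustrative only).
scores = [0.8, 0.4, -0.2, -0.9, 0.6, -0.5, 0.3, -0.7]
y = [1, 1, 1, -1, 1, 1, -1, 1]
s = ["a", "a", "a", "a", "b", "b", "b", "b"]

hard = deo_hard(scores, y, s, "a", "b")
linear = deo_linear(scores, y, s, "a", "b")
print(f"DEO = {hard:.3f}, linear approximation = {linear:.3f}")
```

The two quantities need not coincide, which is why an empirical comparison such as the one in Figure 5 is informative.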
Appendix G Dual Problem for SVM with Fairness Constraint
We follow the usual approach to derive the dual problem for SVMs, which uses the method of Lagrange multipliers [vapnik1998statistical]. We define the Lagrangian function
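$$\mathcal{L}(w,\xi,\alpha,\mu,\beta_1,\beta_2) = \frac{1}{2}\|w\|^2 + C\sum_{i=1}^n \xi_i - \sum_{i=1}^n \alpha_i\big(y_i\langle w,\phi(x_i)\rangle - 1 + \xi_i\big) - \sum_{i=1}^n \mu_i\xi_i + \beta_1\big(\langle w,u\rangle - \epsilon\big) - \beta_2\big(\langle w,u\rangle + \epsilon\big) \tag{18}$$
for the soft-margin SVM with slack variables $\xi$, regularization parameter $C$ and fairness constraint $|\langle w,u\rangle| \leq \epsilon$, where $\alpha$, $\mu$, $\beta_1$ and $\beta_2$ are the Lagrange multipliers and are constrained to be nonnegative. We set the derivative of the Lagrangian with respect to the primal variables $w$ and $\xi$ equal to zero. In the latter case we obtain that
$$\alpha_i + \mu_i = C, \quad i = 1,\dots,n, \tag{19}$$
from which we can remove the variable $\mu_i$ in place of the constraint $\alpha_i \leq C$. In the former case we obtain the expression for $w$,
$$w = \sum_{i=1}^n \alpha_i y_i \phi(x_i) - (\beta_1 - \beta_2)\, u. \tag{20}$$
Using (19) and (20) in (18) we obtain the expression
$$-\frac{1}{2}\Big\|\sum_{i=1}^n \alpha_i y_i \phi(x_i) - (\beta_1 - \beta_2)\, u\Big\|^2 + \sum_{i=1}^n \alpha_i - \epsilon(\beta_1 + \beta_2). \tag{21}$$
The dual problem is then to maximize this quantity subject to the constraints that $\alpha \in [0,C]^n$ and $\beta_1, \beta_2 \geq 0$.
The KKT conditions are
$$\alpha_i\big(y_i\langle w,\phi(x_i)\rangle - 1 + \xi_i\big) = 0, \tag{22}$$
$$(C - \alpha_i)\,\xi_i = 0, \tag{23}$$
$$\beta_1\big(\langle w,u\rangle - \epsilon\big) = 0, \tag{24}$$
$$\beta_2\big(\langle w,u\rangle + \epsilon\big) = 0. \tag{25}$$
By (24) and (25), when $\epsilon > 0$ at most one of the variables $\beta_1$ and $\beta_2$ can be strictly positive, so that $\beta_1 + \beta_2 = |\beta_1 - \beta_2|$. We may then let $\beta = \beta_1 - \beta_2$ and rewrite the objective function as
$$-\frac{1}{2}\Big\|\sum_{i=1}^n \alpha_i y_i \phi(x_i) - \beta u\Big\|^2 + \sum_{i=1}^n \alpha_i - \epsilon|\beta| \tag{26}$$
and optimize over $\alpha$ and $\beta$. It is interesting to study this problem when $\epsilon = 0$. In this case we can easily solve for $\beta$, namely $\beta = \big\langle \sum_{i=1}^n \alpha_i y_i \phi(x_i), u \big\rangle / \|u\|^2$, obtaining the simplified objective
$$-\frac{1}{2}\Big\|\sum_{i=1}^n \alpha_i y_i (I - P)\phi(x_i)\Big\|^2 + \sum_{i=1}^n \alpha_i,$$
where $P$ is the orthogonal projection onto the direction of $u$, that is $P = \frac{u \otimes u}{\|u\|^2}$. This is equivalent to using the standard SVM with the kernel
$$\tilde{\kappa}(x,x') = \langle (I-P)\phi(x), (I-P)\phi(x')\rangle = \kappa(x,x') - \frac{\langle \phi(x),u\rangle \langle \phi(x'),u\rangle}{\|u\|^2}.$$
In particular if $\phi(x) = x$, we obtain
$$\tilde{\kappa}(x,x') = \langle x,x'\rangle - \frac{\langle x,u\rangle \langle x',u\rangle}{\|u\|^2}.$$
This new kernel can then be interpreted as a change of feature mapping $\phi(x) \mapsto (I-P)\phi(x)$.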
As a final remark, we note that for other proper convex loss functions (e.g. square loss or logistic loss) the dual problem can be derived via Fenchel duality [see e.g. Rockafellar1970]. We leave the full details to a future occasion.
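The zero-tolerance case can be checked numerically with a short sketch. It assumes a linear feature map, that $u$ is the difference between the barycenters of the positively labeled points of the two groups, and that the modified kernel takes the form $\langle x,x'\rangle - \langle x,u\rangle\langle x',u\rangle/\|u\|^2$; the helper names and the toy points are hypothetical.

```python
def dot(p, q):
    return sum(pi * qi for pi, qi in zip(p, q))


def barycenter(points):
    """Coordinate-wise mean of a list of vectors."""
    n = len(points)
    return [sum(col) / n for col in zip(*points)]


def project_out(x, u):
    """Apply (I - P) to x, where P projects orthogonally onto span{u}."""
    c = dot(x, u) / dot(u, u)
    return [xi - c * ui for xi, ui in zip(x, u)]


def fair_linear_kernel(x, z, u):
    """Assumed modified kernel: <x, z> - <x, u><z, u> / ||u||^2."""
    return dot(x, z) - dot(x, u) * dot(z, u) / dot(u, u)


# Positively labeled points of the two groups (toy values).
pos_a = [[1.0, 2.0], [3.0, 2.0]]
pos_b = [[0.0, 1.0], [2.0, 1.0]]
u = [ua - ub for ua, ub in zip(barycenter(pos_a), barycenter(pos_b))]

x, z = [3.0, 1.0], [0.0, 2.0]
k_direct = fair_linear_kernel(x, z, u)
k_mapped = dot(project_out(x, u), project_out(z, u))
print(k_direct, k_mapped)
```

Both routes give the same value, and every mapped point is orthogonal to $u$, so any linear model built on the mapped features satisfies the orthogonality constraint by construction.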
Appendix H Multiple Valued Sensitive Features
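Our method presented in Section 3 can be naturally extended to the case in which the sensitive variable takes multiple categorical values, that is $s \in \{g_1,\dots,g_k\}$ for some $k \geq 2$. In particular, when $\epsilon = 0$, the fairness constraint in Problem (4) requires that
$$\hat{L}^{+,g_1}(f) = \hat{L}^{+,g_2}(f) = \dots = \hat{L}^{+,g_k}(f). \tag{27}$$
Furthermore, if the linear loss function is used, these constraints become
$$\langle w, u_{g_j} - u_{g_1}\rangle = 0, \quad j = 2,\dots,k,$$
where we defined, for $g \in \{g_1,\dots,g_k\}$,
$$u_g = \frac{1}{n^{+,g}} \sum_{i \in I^{+,g}} \phi(x_i),$$
with $I^{+,g} = \{i : y_i = 1, s_i = g\}$ and $n^{+,g} = |I^{+,g}|$. Thus, we need to satisfy $k-1$ orthogonality constraints, which try to enforce a balance between the different sensitive groups as measured by the barycenters of the within-group positively labeled points. Similar considerations apply when dealing with multiple sensitive features.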
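As an illustration of the multi-group case, the sketch below builds one constraint direction per non-reference group as the difference between that group's barycenter of positively labeled points and the barycenter of a reference group, then evaluates the resulting inner products for a candidate linear model. The linear feature map, the choice of reference group, the function names and the toy data are assumptions made for the example.

```python
from collections import defaultdict


def positive_barycenters(X, y, s):
    """Barycenter of the positively labeled points of each sensitive group."""
    buckets = defaultdict(list)
    for x, yi, si in zip(X, y, s):
        if yi == 1:
            buckets[si].append(x)
    return {g: [sum(col) / len(pts) for col in zip(*pts)]
            for g, pts in buckets.items()}


def constraint_directions(X, y, s, reference):
    """One direction u_g - u_reference per non-reference group."""
    bary = positive_barycenters(X, y, s)
    ref = bary[reference]
    return {g: [ui - ri for ui, ri in zip(u, ref)]
            for g, u in bary.items() if g != reference}


def constraint_values(w, directions):
    """Inner products <w, u_g - u_reference>; all zero means the constraints hold."""
    return {g: sum(wi * di for wi, di in zip(w, d)) for g, d in directions.items()}


# Toy data with three groups (illustrative only).
X = [[1.0, 0.0], [3.0, 0.0], [9.0, 9.0], [2.0, 1.0], [2.0, 3.0], [2.0, -4.0]]
y = [1, 1, -1, 1, 1, 1]
s = ["g1", "g1", "g1", "g2", "g2", "g3"]

directions = constraint_directions(X, y, s, reference="g1")
fair_w = constraint_values([1.0, 0.0], directions)
unfair_w = constraint_values([0.0, 1.0], directions)
print(directions, fair_w, unfair_w)
```

With three groups this yields two directions; a weight vector orthogonal to both satisfies the balance constraints, while any nonzero inner product quantifies the imbalance for the corresponding group.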
Arrhythmia dataset  

Method  Accuracy  DEO  Selected Features 
Naïve Lin. SVM    
Linear SVM    
Naïve Lasso  
Lasso  
Hardt  
Our Lasso  
Arrhythmia dataset  Training set reduced by 50%  
Method  Accuracy  DEO  Selected Features 
Naïve Lin. SVM    
Linear SVM    
Naïve Lasso  
Lasso  
Hardt  
Our Lasso 
Dataset  

Toytest  0.03 
Toytest Lasso  0.02 
Arrhythmia  0.03 
COMPAS  0.04 
Adult  0.06 
German  0.05 
Drug  0.03 