Fairness-Aware Learning with Restriction of Universal Dependency using f-Divergences

Kazuto Fukuchi, et al., University of Tsukuba. June 25, 2015.

Fairness-aware learning is a novel framework for classification tasks. Like regular empirical risk minimization (ERM), it aims to learn a classifier with a low error rate, while at the same time keeping the classifier's predictions independent of sensitive features such as gender, religion, race, and ethnicity. Existing methods can achieve low dependency on the given samples, but this is not guaranteed on unseen samples. The existing fairness-aware learning algorithms employ different dependency measures, and each algorithm is specifically designed for a particular one. Such diversity makes it difficult to theoretically analyze and compare them. In this paper, we propose a general framework for fairness-aware learning that uses f-divergences and that covers most of the dependency measures employed in the existing methods. We introduce a way to estimate the f-divergences that allows us to give a unified analysis of the upper bound of the estimation error; this bound is tighter than that of the existing convergence rate analysis of divergence estimation. With our divergence estimate, we propose a fairness-aware learning algorithm and perform a theoretical analysis of its generalization error. Our analysis reveals that, under mild assumptions and even with enforcement of fairness, the generalization error of our method is $O(\sqrt{1/n})$, which is the same as that of regular ERM. In addition, and more importantly, we show that, for any f-divergence, the upper bound of the estimation error of the divergence is $O(\sqrt{1/n})$. This indicates that our fairness-aware learning algorithm guarantees low dependency on unseen samples for any dependency measure represented by an f-divergence.


1 Introduction

Recently developed information systems increasingly incorporate machine learning techniques for making important decisions, such as credit scoring, calculating insurance rates, and evaluating employment applications. These decisions can result in unfair treatment if they depend on sensitive information, such as an individual's gender, religion, race, or ethnicity. Fairness-aware learning attempts to solve this problem and has recently received a great deal of attention [16, 3, 20]. In this paper, we consider the use of fairness-aware learning for classification problems.

Let $\mathcal{X}$ and $\mathcal{Y}$ be the domain of the input and the domain of the target, respectively. In ordinary classification algorithms, the learner aims to find a hypothesis $f$ that minimizes misclassifications from a given set of iid samples. In fairness-aware learning, we assume that the input contains a viewpoint $V$, which represents the sensitive information of individuals. The learner aims to find an $f$ that has a low misclassification rate and whose output $f(X)$ has little dependency on the viewpoint $V$. For example, suppose a company wants to make hiring decisions using information collected from job applicants (input $X$), including their age, place of residence, and work experience, but also including their gender, religion, race, and ethnicity (viewpoint $V$). We wish to make hiring decisions based on the potential work performance of the job applicants (target $Y$) via a supervised learning algorithm. We say $f$ is discriminatory if the output of $f$ is dependent on the viewpoint $V$ [15]. Fairness-aware learning attempts to avoid such unfair decisions by minimizing the dependency of the output of $f$ on the viewpoint $V$ [3, 10, 11, 22, 6]. Needless to say, minimization of the misclassification rate and minimization of the dependency are conflicting targets. Therefore, we need to consider the trade-off between misclassification and dependency.

The existing methods resolve this conflict by suppressing the dependency on sensitive viewpoints; this is accomplished by introducing a regularization term [11, 22, 6] or by adding constraints [3, 10, 23] to the objective function of regular empirical risk minimization (see section 2). Typically, such techniques lead to predictions that have less dependency on the sensitive viewpoints of the given samples (empirical dependency). However, predictors with low empirical dependency do not necessarily achieve low dependency on the sensitive viewpoints of unseen samples (generalization dependency). In the hiring-decision example, the hypothesis is trained with information collected from past job applicants. Predictors trained with existing methods might make fair decisions for the job applicants of the past (low empirical dependency); however, fair decisions for future job applicants (low generalization dependency) are not guaranteed. With the exception of the method of Fukuchi and Sakuma [6], the existing methods have no theoretical guarantee of the generalization dependency. In [6], a theoretical analysis provides a probabilistic bound on the generalization dependency, but the analysis is derived for only a specific measure of dependency.

Our contributions.

We perform a unified analysis of fairness-aware learning with more general dependency measures based on the $f$-divergence [1, 4]. The $f$-divergence is a universal class of divergences that can represent most existing divergences, including the total variational distance, the covariance, the Hellinger distance, the $\chi^2$-divergence, and the KL-divergence. Our fairness-aware learning basically follows the framework of empirical risk minimization (ERM). The goal of fairness-aware learning is to obtain predictors with a guaranteed upper bound on the generalization dependency; however, this quantity cannot be evaluated directly because the underlying distribution is not observable. We therefore derive an upper bound of the generalization dependency in terms of the empirical dependency plus two extra terms. Our framework achieves fairness of the resultant predictors by restricting the class of hypotheses to those with low empirical dependency. Thus, the upper bound of the generalization dependency of the predictors can be derived theoretically by using this bound.

The contributions of this study are two-fold. First, we propose a novel generalized procedure for estimating the $f$-divergences for fairness-aware learning. Our estimation method can be regarded as a generalization of [14, 18, 12]. As already stated, we constrain the hypothesis class by the $f$-divergence to guarantee fairness. It is thus important to derive a tighter upper bound of the $f$-divergence to achieve lower generalization dependency. The existing divergence estimation method of [14] provides an upper bound of the $f$-divergence; however, the bound is not suitable for our purpose for two reasons. First, their analysis is specifically derived for the KL-divergence and cannot be extended to general $f$-divergences. Second, their bound is derived for convergence analysis, not as an upper bound of the divergence, and is thus loose for our purpose. Our generalized estimation procedure provides a tighter upper bound of the $f$-divergences by introducing the maximum mean discrepancy. As a result, the estimation error of the $f$-divergence is bounded above by the empirical maximum mean discrepancy and by an $O(\sqrt{1/n})$ term.

Second, we formulate a general ERM framework for fairness-aware learning employing the $f$-divergence. We analyze the generalization error and the generalization dependency of the proposed fairness-aware learning algorithm, and we show that even when fairness is enforced, the generalization error can be bounded above by the Rademacher complexity and an $O(\sqrt{1/n})$ term, as in regular ERM. The generalization dependency can be bounded above by the empirical maximum mean discrepancy term and two other extra terms. Thanks to the theoretical analysis of the generalization dependency, we can theoretically compare the upper bounds of the estimation error across dependency measures. Our analysis reveals that the divergence estimation errors for all of these divergences are of the same order, and that the Hellinger distance achieves the lowest estimation error in terms of the constant factor of the probabilistic error. We also derive a convex formulation of fairness-aware learning that works with any dependency measure represented by an $f$-divergence. The optimization problem can be readily solved by a standard convex optimization solver.

2 Related Works

Within the setup described in the introduction, Calders and Verwer [3] pointed out that eliminating the viewpoint from the given samples is insufficient for achieving low correlation between the output of the hypothesis and the viewpoint; this is because the viewpoint has an indirect influence, since it is not independent of the remaining input features. For example, when we make hiring decisions using information collected from job applicants via supervised learning, even if we train the hypothesis with samples that exclude race and ethnicity, the output of the resultant hypothesis may be indirectly correlated with race or ethnicity, because the addresses of the applicants may be correlated with their race or ethnicity. Such an indirect effect is called the red-lining effect [3].

To remove the red-lining effect, existing works have attempted to construct classifiers that yield fairer hypotheses. Calders and Verwer [3] proposed a naive Bayes classifier with a fairness constraint, which employs the difference between the conditional probabilities of a positive prediction given each value of the viewpoint. Kamiran et al. [10] and Zliobaite et al. [23] discussed various situations in which discrimination can occur, in terms of this difference of conditional probabilities. Dwork et al. [5] introduced a fairness-aware learning framework based on ERM with constraints of statistical parity, defined as the total variational distance between the conditional distributions of the output given each value of the viewpoint. Zemel et al. [22] presented an algorithm to preserve fairness in a classification setting based on statistical parity. Kamishima et al. [11] proposed a fairness-aware maximum likelihood estimation algorithm that penalizes the log-likelihood using the KL-divergence between the joint distribution of the output and the viewpoint and the product of their marginals, which is known as the mutual information between the output and the viewpoint. These fairness-aware learning algorithms do not have a theoretical guarantee for the estimation error of the dependency measures. In addition, the design of these algorithms is tightly coupled with specific dependency measures; they thus have less flexibility for choosing other dependency measures.
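As a concrete reference point for the measures mentioned above, the short sketch below computes the difference of conditional probabilities (the statistical parity gap) and the mutual information (the KL-divergence between the joint distribution and the product of marginals) from binary predictions and a binary viewpoint. The data and function names are illustrative, not taken from any of the cited methods.

```python
import numpy as np

def parity_gap(yhat, v):
    # Difference of conditional probabilities of a positive prediction.
    return abs(yhat[v == 1].mean() - yhat[v == 0].mean())

def mutual_information(yhat, v):
    # Mutual information between prediction and viewpoint, i.e. the
    # KL-divergence between the joint distribution and the product of marginals.
    joint = np.zeros((2, 2))
    for a in (0, 1):
        for b in (0, 1):
            joint[a, b] = np.mean((yhat == a) & (v == b))
    py, pv = joint.sum(1, keepdims=True), joint.sum(0, keepdims=True)
    mask = joint > 0
    return float(np.sum(joint[mask] * np.log(joint[mask] / (py @ pv)[mask])))

rng = np.random.default_rng(0)
v = rng.integers(0, 2, size=1000)                       # binary viewpoint
yhat = (rng.random(1000) < 0.4 + 0.2 * v).astype(int)   # predictions biased by v
print(parity_gap(yhat, v), mutual_information(yhat, v))
```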

The fairness for unseen samples can be measured by the estimation error bound of the dependency measure. Fukuchi and Sakuma [6] first derived a bound on the estimation error of a specific measure, namely the +1/-1 neutrality risk. They proved that the estimation error of this measure is bounded above in probability by the Rademacher complexity of the hypothesis class and an $O(\sqrt{1/n})$ term. Unfortunately, the analysis relies on the +1/-1 neutrality risk and cannot be generalized to other types of dependency measures.

Estimation procedures for $f$-divergences based on iid samples have been studied extensively. For example, for the KL-divergence, methods have been proposed that use nearest-neighbor distances [21] and least-squares estimation of the probability ratio [12]. To estimate $f$-divergences, García-García et al. [7] introduced an estimation procedure that uses loss minimization and sampling. Kanamori et al. [13] presented a divergence estimator of the $f$-divergences based on using the moment matching estimator [17] to estimate the probability ratio. Nguyen et al. [14] used a property of convex conjugate functions to derive an M-estimator of $f$-divergences, and they also derived its convergence rate. In our analysis, we derive an upper bound of the estimation error that yields a tighter bound on the estimation error of the dependency measures than the existing convergence-rate analysis.

3 Problem Formulation

Let $\mathcal{X}$ and $\mathcal{Y}$ be the domain of the input and the domain of the target, respectively. We assume that the learner obtains a set of $n$ iid samples drawn from an unknown probability measure defined on the product of these domains. In addition, we assume the input consists of the viewpoint and various other features. Given the iid samples, the learner seeks to find a hypothesis $f$ from a class $\mathcal{F}$ of measurable functions that minimizes both the misclassification rate and the dependence on the viewpoint. We denote by $(X_i, Y_i)$, for $i = 1, \ldots, n$, the random variables of the samples, and by $V_i$ the random variable of the corresponding viewpoint.

The misclassification of the hypothesis $f$ is evaluated by the generalization risk, which is defined as $R(f) = \mathbb{E}[\ell(f(X), Y)]$ for a loss function $\ell$. The goal of the learner is to find the hypothesis $f^*$ such that

$f^* \in \operatorname{argmin}_{f \in \mathcal{F}} R(f).$    (1)

The generalization risk cannot be evaluated directly because the sample distribution is unknown. Instead of the generalization risk, empirical risk minimization (ERM) finds a hypothesis that minimizes the empirical risk

$\hat{R}(f) = \frac{1}{n} \sum_{i=1}^{n} \ell(f(x_i), y_i).$    (2)

Minimization of the empirical risk results in a relatively low generalization risk, and the generalization risk of the resultant hypothesis converges towards that of the optimal hypothesis as the number of samples increases; this has been shown theoretically [2].

3.1 A Generalized Class for Dependency Measures

For the evaluation of the dependency of the output $f(X)$ on the viewpoint $V$, we define a general class of dependency measures. Since the joint distribution of $f(X)$ and $V$ equals the product of their marginals if and only if $f(X)$ and $V$ are statistically independent, we can evaluate this dependency by evaluating the difference between these two probability measures. To measure the difference between two probability measures, we use the $f$-divergences. Suppose $\mu$ and $\nu$ are two probability measures on a compact domain, where $\mu$ is absolutely continuous with respect to $\nu$. The class of $f$-divergences, also known as the Ali–Silvey distances [1, 4], takes the form

$D_\phi(\mu \,\|\, \nu) = \int \phi\!\left(\frac{d\mu}{d\nu}\right) d\nu,$    (3)

where $\phi$ is a convex and lower semicontinuous function such that $\phi(1) = 0$. The $f$-divergence reduces to one of the existing divergences according to the choice of $\phi$; for example, it becomes the total variational distance if $\phi(t) = \frac{1}{2}|t - 1|$, the Hellinger distance if $\phi(t) = (\sqrt{t} - 1)^2$, the $\chi^2$-divergence if $\phi(t) = (t - 1)^2$, or the KL-divergence if $\phi(t) = t \ln t$. See fig. 1(a). Given the $f$-divergences, we define the measure of the dependency between $f(X)$ and $V$ as

$D_\phi(f) = D_\phi\!\left(\mu_{(f(X), V)} \,\|\, \mu_{f(X)} \otimes \mu_V\right),$    (4)

where $\mu_{(f(X), V)}$ is the joint distribution of $(f(X), V)$ and $\mu_{f(X)} \otimes \mu_V$ is the product of the marginals. Without loss of generality, we assume that the subdifferential of $\phi$ at $1$ contains $0$. This can be ensured by replacing $\phi(t)$ with $\phi(t) - c(t - 1)$ for a subgradient $c$ of $\phi$ at $1$, which does not change the value of the $f$-divergence for any finite $c$. We will focus on convex functions $\phi$ that are differentiable on $(0, \infty)$ except possibly at $1$. Note that this includes most of the divergences above, including the total variational distance, the Hellinger distance, the $\chi^2$-divergence, and the KL-divergence.
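To make the role of the generator $\phi$ concrete, the following sketch evaluates the plug-in form of eq. 3 for discrete distributions under the standard generators listed above. The generators and the toy joint/product vectors are illustrative; the paper's normalizations may differ slightly.

```python
import numpy as np

# Standard generators phi for common f-divergences (illustrative choices).
PHI = {
    "total_variation": lambda t: 0.5 * np.abs(t - 1.0),
    "hellinger":       lambda t: (np.sqrt(t) - 1.0) ** 2,
    "chi_squared":     lambda t: (t - 1.0) ** 2,
    "kl":              lambda t: t * np.log(t),
}

def f_divergence(mu, nu, phi):
    """Plug-in evaluation of D_phi(mu || nu) = sum_x nu(x) * phi(mu(x) / nu(x))
    for discrete distributions given as probability vectors."""
    mu, nu = np.asarray(mu, float), np.asarray(nu, float)
    ratio = mu / nu                  # requires nu(x) > 0 wherever mu(x) > 0
    return float(np.sum(nu * phi(ratio)))

# Example: dependency of a binary prediction on a binary viewpoint, eq. 4 style:
# the joint of (f(X), V) versus the product of the marginals, both flattened.
joint   = np.array([0.30, 0.20, 0.10, 0.40])
product = np.array([0.28, 0.22, 0.12, 0.38])
for name, phi in PHI.items():
    print(name, f_divergence(joint, product, phi))
```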

3.2 Fairness-Aware Learning with Generalized Dependency Measures

In fairness-aware learning, the learner attempts to minimize both $R(f)$ and $D_\phi(f)$. However, since the minimizer of $R(f)$ does not always satisfy $D_\phi(f) = 0$, there exists a trade-off between $R(f)$ and $D_\phi(f)$. We thus consider a subset of $\mathcal{F}$ parameterized by $\eta \ge 0$, defined as follows:

$\mathcal{F}_\eta = \{ f \in \mathcal{F} : D_\phi(f) \le \eta \}.$    (5)

Thus, the goal of fairness-aware learning is to achieve the hypothesis that satisfies

$f^*_\eta \in \operatorname{argmin}_{f \in \mathcal{F}_\eta} R(f).$    (6)

Again, since the generalization risk cannot be evaluated directly, the learner minimizes the empirical risk as

$\hat{f}_\eta \in \operatorname{argmin}_{f \in \mathcal{F}_\eta} \hat{R}(f).$    (7)

The objective of fairness-aware learning is to solve the optimization problem of eq. 7. Unfortunately, $D_\phi(f)$ again cannot be evaluated directly since the underlying distribution is unobservable. In section 4, we introduce a novel estimation procedure for the $f$-divergences that makes evaluation of $D_\phi(f)$ tractable. Then, we prove an upper bound on $D_\phi(f)$ in terms of its empirical estimate given a finite number of samples. In section 5, the objective function of fairness-aware learning is redefined using this empirical estimate of $D_\phi(f)$.
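To make the constrained problem of eq. 7 concrete, here is a minimal, self-contained sketch. It uses a logistic surrogate for the empirical risk (2) and, purely for illustration, a squared-covariance proxy in place of the $f$-divergence constraint of eq. 5; the paper's actual constraint uses the estimated divergence developed in sections 4 and 5 (a convex version is sketched after section 5.2). All names and data below are synthetic.

```python
import numpy as np
from scipy.optimize import minimize

def emp_risk(w, X, y):
    # Logistic surrogate of the empirical risk (2) for a linear hypothesis.
    return np.mean(np.logaddexp(0.0, -y * (X @ w)))

def dependency_proxy(w, X, v):
    # Squared covariance between the score <w, x> and the viewpoint v,
    # standing in for the f-divergence dependency measure.
    scores = X @ w
    return np.mean((scores - scores.mean()) * (v - v.mean())) ** 2

rng = np.random.default_rng(1)
n, d = 300, 4
v = rng.choice([-1.0, 1.0], size=n)                        # sensitive viewpoint V
X = np.c_[rng.normal(size=(n, d - 1)), v]                   # viewpoint leaks into X
y = np.sign(X[:, 0] + 0.5 * v + 0.3 * rng.normal(size=n))   # labels in {-1, +1}

eta = 1e-3                                                  # fairness level as in (5)
constraint = {"type": "ineq", "fun": lambda w: eta - dependency_proxy(w, X, v)}
res = minimize(emp_risk, np.zeros(d), args=(X, y), method="SLSQP",
               constraints=[constraint])
w_fair = res.x                                              # approximate solution of (7)
```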

4 Divergence Estimation

In this section, we introduce a procedure that estimates $D_\phi(f)$ by minimizing the maximum mean discrepancy (MMD), and we determine a non-asymptotic bound on its estimation error. This procedure covers the existing $f$-divergence and KL-divergence estimation algorithms proposed by Nguyen et al. [14], Ruderman et al. [18], and Kanamori et al. [13].

4.1 Estimation of the Divergence by Minimizing the Maximum Mean Discrepancy

To estimate $D_\phi(f)$, we first empirically estimate the probability ratio between the two probability measures appearing in eq. 4, and then we empirically evaluate $D_\phi(f)$ by using the estimated probability ratio. Since reweighting one of the two measures by the true probability ratio recovers the other, the minimizer of the discrepancy between the reweighted measure and its target is expected to be close to the probability ratio. As a measure of this discrepancy, we use the maximum mean discrepancy. Let $\mathcal{G}$ be a set of functions on the output–viewpoint domain. Let $X'$ be an independent copy of $X$, and let $V'$ be the viewpoint of $X'$. Then, the MMD with $\mathcal{G}$ between the two measures is defined as

(8)
(9)

If a candidate function is equal to the probability ratio, the corresponding MMD is zero. However, the converse does not always hold; ensuring it requires that $\mathcal{G}$ is a set of functions induced by a universal kernel [19]. Therefore, the ability of the MMD to detect the discrepancy depends on the choice of $\mathcal{G}$. The U-statistic [8] gives an unbiased estimator of eq. 9 as

(10)

The estimator of the probability ratio is obtained by minimizing the empirical MMD in eq. 10. (An efficient computation of the empirical MMD is shown in appendix A.) We can add a regularization term to the empirical MMD to ensure the consistency of the estimator:

(11)

After the ratio estimate is obtained by solving eq. 11, the $f$-divergence is empirically evaluated as

(12)

The estimation procedure is equivalent to that of [14] for a particular choice of the regularization term and its parameter. If, in addition to the regularization term, we add a normalization constraint to eq. 11, the estimation procedure becomes the same as that of [18]. An appropriate choice of the function class and the regularizer yields the estimation procedure of [13].
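As a concrete illustration of the ratio-estimation-plus-plug-in idea behind eqs. 10–12, the sketch below matches kernel means over an RKHS in the spirit of appendix A. The Gaussian kernel, the regularization strength, and the construction of product-measure samples by permuting the viewpoint are all illustrative choices rather than the paper's prescription.

```python
import numpy as np

def gauss_kernel(A, B, sigma=1.0):
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2 * sigma ** 2))

rng = np.random.default_rng(2)
n = 400
v = rng.choice([-1.0, 1.0], size=n)
score = 0.7 * v + rng.normal(size=n)             # f(X) correlated with viewpoint V

Z_joint = np.c_[score, v]                         # samples from the joint measure
Z_prod  = np.c_[score, rng.permutation(v)]        # samples mimicking the product measure

# Estimate the ratio r on the product samples by kernel mean matching: minimize
# || (1/n) sum_i r_i k(z_i, .) - (1/n) sum_j k(z'_j, .) ||^2 plus a ridge term.
K   = gauss_kernel(Z_prod, Z_prod)
kap = gauss_kernel(Z_prod, Z_joint).mean(axis=1)
lam = 1e-2
r_hat = np.linalg.solve(K / n + lam * np.eye(n), kap)    # regularized closed form
r_hat = np.clip(r_hat, 1e-6, None)

# Plug-in estimate of the f-divergence (eq. 12), here with the KL generator.
phi_kl = lambda t: t * np.log(t)
D_hat = float(np.mean(phi_kl(r_hat)))
print("estimated dependency:", D_hat)
```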

4.2 Analysis of Estimation Error

In this subsection, we show the upper bound on the estimation error

(13)

Surprisingly, the upper bound of the estimation error does not depend on the complexity of the class of functions $\mathcal{G}$. In what follows, we use $a$ and $b$ to denote the lower and upper bounds of the probability ratio, and we denote the true probability ratio as $r^*$.

The following theorem states the probabilistic upper bound on the estimation error. Let $\hat r$ be the probability ratio estimated from the obtained set of samples. Suppose that the class of functions $\mathcal{G}$ of the MMD contains the composition of an element of the subdifferential of $\phi$ with the candidate ratios, scaled by a constant, and that $a \le \hat r \le b$ almost surely, where $0 < a$ and $b < \infty$. Then, with probability at least $1 - \delta$,

(14)

where $c_\phi$ is a constant determined by $\phi$, $a$, and $b$. The proof of this theorem is found in appendix C. As proved in the theorem of section 4.2, the $f$-divergence can be bounded above by the empirical $f$-divergence, the empirical MMD, and an $O(\sqrt{1/n})$ term. We minimize the error between the $f$-divergence and the empirical $f$-divergence by minimizing the empirical MMD. In addition, the error bound does not depend on the complexity of $\mathcal{G}$. This implies that, in order to guarantee the upper bound on the $f$-divergence, we should choose $\mathcal{G}$ large enough to contain the required composed functions. A $\mathcal{G}$ that is too large, however, can lead to an overestimate of the $f$-divergence.

The convergence rate of the absolute estimation error shown by [14] depends on the convergence rate of an empirical process over the function class. In contrast, the upper bound proved by the theorem in section 4.2 does not contain a complexity term of the function class, such as the Rademacher complexity, the covering entropy, or the bracketing entropy, and is thus tighter than the convergence-rate bound.

5 Fairness-Aware Learning with a Divergence Estimation

In this section, we provide an algorithm for solving eq. 7 that incorporates the introduced estimation procedure for the $f$-divergences. We then show that the algorithm can be formulated as a convex optimization problem.

5.1 Algorithm for Fairness-Aware Learning with -Divergence Estimation

Following the estimation procedure described in section 4, we define the optimization problem of our fairness-aware learning as

(15)

where the constraint uses the constant defined in the theorem of section 4.2. As indicated by that theorem, the constraint in eq. 15 implies an upper bound on $D_\phi(f)$, which guarantees that the $f$-divergence of the resultant hypothesis of eq. 15 is less than $\eta$.

Let us consider the effect of the choice of $\phi$ on the estimation error of the divergence. The order of the upper bound of the estimation error shown in the theorem of section 4.2 does not depend on the choice of $\phi$. Nevertheless, the choice of $\phi$ changes the constant $c_\phi$. For fixed bounds $a$ and $b$ on the probability ratio, fig. 1 shows the shape of $\phi$ and the value of $c_\phi$ for various functions $\phi$. As shown in fig. 1(b), the smallest $c_\phi$ is that for the Hellinger distance; thus, of these four divergences, it yields the tightest bound for the same bounds on the probability ratio.

Figure 1: (a) The shape of $\phi$ and (b) the value of the constant $c_\phi$ from the theorem in section 4.2 for various choices of $\phi$, under fixed bounds $a$ and $b$ on the probability ratio.

5.2 Optimization

A necessary condition for the convexity of eq. 15 is linearity of the hypothesis with respect to its parameters. Under mild assumptions, the problem can be made convex for any choice of $\phi$ by a simple reformulation. We assume that the hypothesis is linear with respect to its parameters, and that the function on the RKHS is given as an inner product between a parameter vector and the canonical feature map. The optimization problem in eq. 15 can then be rearranged as

(16)

Since the ratio estimate does not appear in the objective function of the original optimization problem eq. 15, we rearrange the problem so that the optimization with respect to the ratio appears only in the constraint. Following the derivation of the dual problem in [14], we obtain the dual form of the constraint as

(17)

where

(18)

and the remaining quantities are defined from the kernel and the samples. With an appropriate substitution, we can rewrite the optimization problem in eq. 16 as

(19)

where

(20)

From the definition of this function, the constraint involves an indicator function. After a suitable substitution, we relax the indicator function as follows:

(21)

This optimization problem is convex, and its solution is equivalent to that of eq. 15. We prove these claims with the following corollary and theorem. If the assumption of section 5.2 holds and the relaxed term is convex with respect to the parameters, the optimization problem in eq. 21 is a convex optimization problem. Moreover, the solution of the optimization problem in eq. 21 is equivalent to the solution of eq. 19. The proofs of the corollary and the theorem can be found in appendix C.
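To make the convexity claim tangible, here is a minimal sketch of a fairness-constrained convex program for a linear hypothesis, solvable by a standard convex solver (cvxpy here). The covariance-type constraint is a simple convex surrogate standing in for the relaxed dual-form divergence constraint of eqs. 17–21; it is not the paper's exact formulation, and all data below are synthetic.

```python
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(3)
n, d = 300, 4
v = rng.choice([-1.0, 1.0], size=n)
X = np.c_[rng.normal(size=(n, d - 1)), v]
y = np.sign(X[:, 0] + 0.5 * v + 0.3 * rng.normal(size=n))

w = cp.Variable(d)
# Convex surrogate of the empirical risk (2): logistic loss for a linear hypothesis.
loss = cp.sum(cp.logistic(-cp.multiply(y, X @ w))) / n
# Convex surrogate of the dependency constraint: |covariance(score, viewpoint)|.
v_c = v - v.mean()
dependency = cp.abs(cp.sum(cp.multiply(v_c, X @ w)) / n)
eta = 0.05

prob = cp.Problem(cp.Minimize(loss), [dependency <= eta])
prob.solve()                       # handled by any standard convex solver
w_fair = w.value
```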

6 Generalization Error Analysis

We consider the generalization error bound of the learned hypothesis obtained by the algorithm described in section 5. In our analysis, we use two types of Rademacher complexity, which measure the complexity of a class of functions and are defined as

(22)

where the $\sigma_i$ are independent Rademacher variables, that is, uniformly distributed over $\{-1, +1\}$.
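For intuition, the following is a small Monte Carlo sketch of the empirical Rademacher complexity of a norm-bounded linear class; for that class the supremum has a closed form, so the estimate only averages over random sign draws. The class, bound, and data are illustrative.

```python
import numpy as np

def empirical_rademacher_linear(X, B=1.0, n_draws=2000, seed=0):
    # Empirical Rademacher complexity of {x -> <w, x> : ||w||_2 <= B};
    # sup over the class equals B * || (1/n) sum_i sigma_i x_i ||_2.
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    vals = []
    for _ in range(n_draws):
        sigma = rng.choice([-1.0, 1.0], size=n)           # Rademacher variables
        vals.append(B * np.linalg.norm(X.T @ sigma / n))
    return float(np.mean(vals))

X = np.random.default_rng(1).normal(size=(500, 10))
print(empirical_rademacher_linear(X))   # decays roughly like O(sqrt(1/n))
```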

In the generalization error analysis, since our fairness-aware learning algorithm has a probabilistic error in the divergence estimation, we consider the set of hypotheses defined as

(23)

where the threshold is defined as in the theorem of section 4.2. That theorem shows that, with probability at least $1 - \delta$, the learned hypothesis belongs to this set. Hence, an application of the theorem of [2], which appears in appendix B, yields the generalization error bound for our algorithm. Let $f^*_\eta$ denote the optimal hypothesis in the restricted class, and let $\hat f_\eta$ be the hypothesis learned from the obtained set of samples. Suppose that the loss is bounded for every hypothesis in the class. Then, with probability at least $1 - \delta$,

(24)

Since the restricted hypothesis class is a subset of the unrestricted one, its Rademacher complexity is no larger. Therefore, the convergence rate of the algorithm constrained by the $f$-divergence is no worse than that of the algorithm without the constraint.

While our algorithm guarantees an upper bound on the $f$-divergence, it may reduce the classification performance compared to the classifier learned by regular ERM. Accordingly, let us consider the gap between the generalization risks of the optimal hypotheses with and without the restriction on the $f$-divergence:

(25)

This gap represents the reduction in classification performance caused by restricting the $f$-divergence. Since it cannot be evaluated directly, we define an estimator of eq. 25 as

(26)

Our interest is in deriving the convergence rate of this estimator. Under boundedness assumptions on the loss, the following theorem shows the convergence rate of the estimator: with probability at least $1 - \delta$,

(27)

The proof of this theorem appears in appendix C.

7 Conclusions

In this paper, we considered fairness-aware learning for a classification problem, with the aim of learning a classifier whose predictions have both a low misclassification rate and a low dependence on the viewpoint. Our contributions are as follows: (1) We propose a novel generalized procedure for estimating the $f$-divergences for fairness-aware learning. Our generalized estimation procedure provides a tighter upper bound on the estimation error by introducing the maximum mean discrepancy. (2) We formulate a general ERM framework for fairness-aware learning that is based on the empirical estimation procedure for the $f$-divergences and that can guarantee an upper bound on the generalization dependency. Furthermore, we provide an analysis of the generalization error of the proposed fairness-aware learning algorithm.

References

  • Ali and Silvey [1966] SM Ali and Samuel D Silvey. A general class of coefficients of divergence of one distribution from another. Journal of the Royal Statistical Society. Series B (Methodological), pages 131–142, 1966.
  • Bartlett et al. [2005] Peter L Bartlett, Olivier Bousquet, and Shahar Mendelson. Local rademacher complexities. Annals of Statistics, pages 1497–1537, 2005.
  • Calders and Verwer [2010] Toon Calders and Sicco Verwer. Three naive bayes approaches for discrimination-free classification. Data Mining and Knowledge Discovery, 21(2):277–292, September 2010.
  • Csiszár [1963] Imre Csiszár. Eine informationstheoretische ungleichung und ihre anwendung auf den beweis der ergodizität von markoffschen ketten. Publications of the Mathematical Institute of Hungarian Academy of Sciences, 8:85–108, 1963.
  • Dwork et al. [2012] Cynthia Dwork, Moritz Hardt, Toniann Pitassi, Omer Reingold, and Richard Zemel. Fairness through awareness. In Proceedings of the 3rd Innovations in Theoretical Computer Science Conference, pages 214–226. ACM, 2012.
  • Fukuchi and Sakuma [2014] Kazuto Fukuchi and Jun Sakuma. Neutralized empirical risk minimization with generalization neutrality bound. In Machine Learning and Knowledge Discovery in Databases - European Conference, ECML PKDD 2014, Nancy, France, September 15-19, 2014. Proceedings, Part I, pages 418–433, 2014.
  • García-García et al. [2011] Dario García-García, Ulrike von Luxburg, and Raúl Santos-Rodríguez. Risk-based generalizations of f-divergences. In Proceedings of the 28th International Conference on Machine Learning, ICML 2011, Bellevue, Washington, USA, June 28 - July 2, 2011, pages 417–424, 2011.
  • Hoeffding [1948] W. Hoeffding. A class of statistics with asymptotically normal distribution. Annals of Mathematical Statistics, 19:293–325, 1948.
  • Hoeffding [1963] Wassily Hoeffding. Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58(301):13–30, 1963.
  • Kamiran et al. [2010] Faisal Kamiran, Toon Calders, and Mykola Pechenizkiy. Discrimination aware decision tree learning. In Data Mining (ICDM), 2010 IEEE 10th International Conference on, pages 869–874. IEEE, 2010.
  • Kamishima et al. [2012] Toshihiro Kamishima, Shotaro Akaho, Hideki Asoh, and Jun Sakuma. Fairness-aware classifier with prejudice remover regularizer. In in Proceedings of the ECML/PKDD2012, Part II, volume LNCS 7524, pages 35–50. Springer, 2012.
  • Kanamori et al. [2009] Takafumi Kanamori, Shohei Hido, and Masashi Sugiyama. A least-squares approach to direct importance estimation. The Journal of Machine Learning Research, 10:1391–1445, 2009.
  • Kanamori et al. [2012] Takafumi Kanamori, Taiji Suzuki, and Masashi Sugiyama. f-divergence estimation and two-sample homogeneity test under semiparametric density-ratio models. Information Theory, IEEE Transactions on, 58(2):708–720, 2012.
  • Nguyen et al. [2010] XuanLong Nguyen, Martin J Wainwright, and Michael I Jordan. Estimating divergence functionals and the likelihood ratio by convex risk minimization. Information Theory, IEEE Transactions on, 56(11):5847–5861, 2010.
  • Pedreschi et al. [2009] Dino Pedreschi, Salvatore Ruggieri, and Franco Turini. Measuring discrimination in socially-sensitive decision records. In Proceedings of the SIAM International Conference on Data Mining, SDM, pages 581–592. SIAM, 2009.
  • Pedreshi et al. [2008] Dino Pedreshi, Salvatore Ruggieri, and Franco Turini. Discrimination-aware data mining. In Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 560–568. ACM, 2008.
  • Qin [1998] Jing Qin. Inferences for case-control and semiparametric two-sample density ratio models. Biometrika, 85(3):619–630, 1998.
  • Ruderman et al. [2012] Avraham Ruderman, Darío García-garcía, James Petterson, and Mark D Reid. Tighter variational representations of f-divergences via restriction to probability measures. In Proceedings of the 29th International Conference on Machine Learning (ICML-12), pages 671–678, 2012.
  • Steinwart [2001] Ingo Steinwart. On the influence of the kernel on the consistency of support vector machines. Journal of Machine Learning Research, 2:67–93, 2001.
  • Sweeney [2013] Latanya Sweeney. Discrimination in online ad delivery. Queue, 11(3):10:10–10:29, March 2013. ISSN 1542-7730. doi: 10.1145/2460276.2460278.
  • Wang et al. [2009] Qing Wang, Sanjeev R Kulkarni, and Sergio Verdú. Divergence estimation for multidimensional densities via k-nearest-neighbor distances. Information Theory, IEEE Transactions on, 55(5):2392–2405, 2009.
  • Zemel et al. [2013] Richard S. Zemel, Yu Wu, Kevin Swersky, Toniann Pitassi, and Cynthia Dwork. Learning fair representations. In ICML (3), pages 325–333, 2013.
  • Zliobaite et al. [2011] Indre Zliobaite, Faisal Kamiran, and Toon Calders. Handling conditional discrimination. In Data Mining (ICDM), 2011 IEEE 11th International Conference on, pages 992–1001. IEEE, 2011.


Appendix A Maximum Mean Discrepancy with Functions on a Reproducing Kernel Hilbert Space

Since evaluating the empirical MMD requires solving a maximization problem over the function class, it can be computationally expensive if an iterative algorithm is needed. However, if the elements of the class of functions are represented as inner products with a parameter, which includes functions in a reproducing kernel Hilbert space (RKHS), the empirical MMD can be calculated efficiently. Let $k$ be a universal kernel, let $\mathcal{H}$ be the RKHS induced by $k$, and let $\Phi$ be the canonical feature map induced by $k$. Suppose that the function class is the unit ball of $\mathcal{H}$. Then, the empirical MMD is equivalent to

(28)
Proof.

From the definition of the MMD and the feature map $\Phi$, we have

(29)
(30)

Since the supremum over the unit ball is achieved when the direction of the function is the same as that of the difference of the (weighted) empirical mean embeddings, we get the claim. ∎

For simplicity of notation, we introduce shorthand for the kernel evaluations and the ratio values at the sample points. Then, eq. 28 can be rearranged as

(31)

Let $K$ be the Gram matrix whose entries are the kernel evaluations at the sample points, let $k$ be the vector of the corresponding cross terms, and let $r$ be the vector representation of the ratio values at the samples. The matrix representation of the minimization of eq. 31 is obtained as

(32)

The minimizer of eq. 31 with respect to $r$ can be obtained in closed form if $K$ is a positive definite matrix.
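For concreteness, a minimal sketch of this closed-form step is given below. The toy matrices are illustrative, and the small ridge term is an added numerical safeguard rather than part of eq. 32.

```python
import numpy as np

def solve_ratio(K, k, ridge=1e-8):
    # Minimizer of a quadratic of the form r^T K r - 2 k^T r: solve K r = k.
    # The ridge keeps the system well conditioned if K is only PSD numerically.
    return np.linalg.solve(K + ridge * np.eye(K.shape[0]), k)

K = np.array([[2.0, 0.5], [0.5, 1.0]])   # toy positive definite Gram matrix
k = np.array([1.0, 0.3])
r = solve_ratio(K, k)
print(r, K @ r - k)                       # residual is ~0 at the minimizer
```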

Appendix B Generalization Error Bound of Bartlett et al. [2]

Bartlett et al. [2] proved the following theorem for the generalization error bound, based on Bousquet's inequality. [Bartlett et al. [2]] Let $f^*$ be the optimal hypothesis in the class, and let $\hat f$ be a hypothesis learned from the obtained set of samples. Suppose that the loss is bounded for every hypothesis in the class. Then, with probability at least $1 - \delta$,

(33)

where the remaining quantities are as defined in [2].

Appendix C Proofs

C.1 Proof of the Theorem in Section 4.2

In order to prove the theorem in section 4.2, we first prove the following lemmas. Suppose that the probability ratio lies in $[a, b]$ almost surely, where $0 < a$ and $b < \infty$; then

(34)
Proof.

Since the derivative of $\phi$ is non-decreasing due to the convexity of $\phi$, we have

(35)

From the assumption that the subdifferential of $\phi$ at $1$ contains zero, the derivative of $\phi$ is non-negative above $1$ and non-positive below $1$, which results in the corresponding terms in eq. 35 being positive and negative, respectively. By this fact and eq. 35, we have

(36)

Combining eqs. 36 and 35 gives the claim. ∎

Suppose that the probability ratio lies in $[a, b]$ almost surely, where $0 < a$ and $b < \infty$; then

(37)
Proof.

From the assumption that the subdifferential of $\phi$ at $1$ contains zero, the derivative of $\phi$ is non-positive below $1$ and non-negative above $1$, which yields that $\phi$ is non-increasing on $(0, 1)$ and non-decreasing on $(1, \infty)$. Therefore, $\sup_{t \in [a, b]} \phi(t) = \max\{\phi(a), \phi(b)\}$, which gives the claim. ∎

Proof of the theorem in section 4.2.

The error is decomposed as

(38)

From the definition of the subdifferential, we have

(39)
(40)