1 Introduction
Recently developed information systems increasingly incorporate machine learning techniques to make important decisions, such as scoring credit, calculating insurance rates, and evaluating employment applications. These decisions can result in unfair treatment if they depend on sensitive information, such as an individual's gender, religion, race, or ethnicity. Fairness-aware learning attempts to solve this problem and has recently received a great deal of attention
[16, 3, 20]. In this paper, we consider the use of fairness-aware learning for classification problems. Let $\mathcal{X}$ and $\mathcal{Y}$ be the domain of the input and the domain of the target, respectively. In ordinary classification algorithms, the learner aims to find a hypothesis that minimizes misclassifications from a given set of i.i.d. samples. In fairness-aware learning, we assume that the input contains a viewpoint, which represents the sensitive information of individuals. The learner aims to find a hypothesis $h$ that has a low misclassification rate and whose output depends as little as possible on the viewpoint. For example, suppose a company wants to make hiring decisions using information collected from job applicants (the input), including their age, place of residence, and work experience, but also including their gender, religion, race, and ethnicity (the viewpoint). We wish to make hiring decisions based on the potential work performance of the job applicants (the target
) via a supervised learning algorithm.
We say $h$ is discriminatory if the output of $h$ depends on the viewpoint [15]. Fairness-aware learning attempts to avoid such unfair decisions by minimizing the dependency of the output of $h$ on the viewpoint [3, 10, 11, 22, 6]. Needless to say, minimizing the misclassification rate and minimizing this dependency are conflicting objectives; we therefore need to consider the trade-off between misclassification and dependency. Existing methods resolve this conflict by suppressing the dependency on sensitive viewpoints, either by introducing a regularization term [11, 22, 6] or by adding constraints [3, 10, 23] to the objective function of regular empirical risk minimization (see section 2). Typically, such techniques lead to predictions that depend less on the sensitive viewpoints of the given samples (empirical dependency). However, predictors with low empirical dependency do not necessarily achieve low dependency on the sensitive viewpoints of unseen samples (generalization dependency). In the hiring-decision example, the hypothesis is trained with information collected from past job applicants. Predictors trained with existing methods might make fair decisions for past job applicants (low empirical dependency), but fair decisions for future job applicants (low generalization dependency) are not guaranteed. With the exception of the method of Fukuchi and Sakuma [6], existing methods have no theoretical guarantee on the generalization dependency. In [6], a theoretical analysis provides a probabilistic bound on the generalization dependency, but the analysis is derived for only one specific dependency measure.
Our contributions.
We perform a unified analysis of fairness-aware learning with more general dependency measures based on the $f$-divergence [1, 4]. The $f$-divergence is a universal class of divergences that can represent most existing divergences, including the total variation distance, the covariance, the Hellinger distance, the $\chi^2$-divergence, and the KL-divergence. Our fairness-aware learning basically follows the framework of empirical risk minimization (ERM). The goal of fairness-aware learning is to obtain predictors with a guaranteed upper bound on the generalization dependency; however, this quantity cannot be evaluated directly because the underlying distribution is not observable. We thus derive an upper bound on the generalization dependency consisting of the empirical dependency plus two extra terms. Our framework achieves fairness of the resultant predictors by restricting the class of hypotheses to those with low empirical dependency; the upper bound on the generalization dependency of the predictors then follows from this bound.
The contributions of this study are twofold. First, we propose a novel generalized procedure for estimating $f$-divergences for fairness-aware learning. Our estimation method can be regarded as a generalization of [14, 18, 12]. As stated above, we constrain the hypothesis class by the $f$-divergence to guarantee fairness; it is thus important to derive a tighter upper bound on the $f$-divergence in order to achieve lower generalization dependency. The existing divergence estimation method of [14] provides an upper bound on the divergence; however, that bound is not suitable for our purpose, for two reasons. First, their analysis is derived specifically for the KL-divergence and cannot be extended to general $f$-divergences. Second, their bound is derived for convergence analysis, not as an upper bound on the divergence, and is therefore loose for our purpose. Our generalized estimation procedure provides a tighter upper bound on $f$-divergences by introducing the maximum mean discrepancy. As a result, the estimation error of the $f$-divergence is bounded above by the empirical maximum mean discrepancy plus an $O(1/\sqrt{n})$ term.
Second, we formulate a general ERM framework for fairness-aware learning employing the $f$-divergence. We analyze the generalization error and generalization dependency of the proposed fairness-aware learning algorithm, and we show that even when fairness is enforced, the generalization error can be bounded above by the Rademacher complexity and an $O(1/\sqrt{n})$ term, as in regular ERM. The generalization dependency can be bounded above by the empirical maximum mean discrepancy term and two extra terms. Thanks to this theoretical analysis of the generalization dependency, we can theoretically compare the upper bounds on the estimation error across dependency measures. Our analysis reveals that the estimation errors for all of these divergences are of the same order, and that the Hellinger distance achieves the lowest estimation error in terms of the constant factor of the probabilistic error. We also derive a convex formulation of fairness-aware learning that works with any dependency measure represented by an $f$-divergence; the resulting optimization problem can be readily solved by a standard convex optimization solver.
2 Related Work
Within the setup described in the introduction, Calders and Verwer [3] pointed out that eliminating the viewpoint from the given samples is insufficient for achieving low correlation between the output of the hypothesis and the viewpoint; the viewpoint retains an indirect influence because it is not independent of the rest of the input. For example, when we make hiring decisions using information collected from job applicants via supervised learning, even if we train with samples that exclude race and ethnicity, the output of the resultant hypothesis may still be indirectly correlated with race or ethnicity, because the addresses of the applicants may be correlated with their race or ethnicity. Such an indirect effect is called the redlining effect [3].
To remove the redlining effect, existing works have attempted to construct classifiers that yield fairer hypotheses. Calders and Verwer [3] proposed a naive Bayes classifier with a fairness constraint, which employs the difference between the conditional probabilities of the outcome given each value of the viewpoint. Kamiran et al. [10] and Zliobaite et al. [23] discussed various situations in which discrimination can occur, in terms of the difference of conditional probabilities. Dwork et al. [5] introduced a fairness-aware ERM framework with constraints of statistical parity, defined as the total variation distance between the conditional distributions of the output given the viewpoint values. Zemel et al. [22] presented an algorithm for preserving fairness in a classification setting based on statistical parity. Kamishima et al. [11] proposed a fairness-aware maximum likelihood estimation algorithm that penalizes the log-likelihood with the KL-divergence between the joint distribution of the output and the viewpoint and the product of their marginals (this KL-divergence is the mutual information between the output and the viewpoint). These fairness-aware learning algorithms have no theoretical guarantee on the estimation error of the dependency measures. In addition, their designs are tightly coupled with specific dependency measures, so they offer little flexibility in the choice of dependency measure. The fairness of predictions on unseen samples can be measured by the estimation error bound of the dependency measure. Fukuchi and Sakuma [6] first derived a bound on the estimation error of one specific measure, the $\pm 1$ neutrality risk. They proved that the estimation error of this measure is bounded above, in probability, by the Rademacher complexity of the hypothesis class plus an $O(1/\sqrt{n})$ term. Unfortunately, their analysis relies on this particular neutrality risk and cannot be generalized to other types of dependency measures.
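As a hedged numeric sketch of the dependency measures discussed above (all names and data here are illustrative, not from the paper), statistical parity can be measured as the total variation distance between the distributions of a classifier's output conditioned on the two values of a binary sensitive viewpoint:

```python
import numpy as np

# Illustrative only: statistical parity as the total variation distance
# between the conditional output distributions given each viewpoint value.

rng = np.random.default_rng(1)
n = 1000
v = rng.integers(0, 2, size=n)                       # sensitive viewpoint
pred = (rng.random(n) < 0.5 + 0.2 * v).astype(int)   # biased predictions

def total_variation(p, q):
    """Total variation distance between two finite distributions."""
    return 0.5 * float(np.abs(np.asarray(p) - np.asarray(q)).sum())

# Empirical conditional distributions of the output given the viewpoint.
p0 = np.bincount(pred[v == 0], minlength=2) / np.sum(v == 0)
p1 = np.bincount(pred[v == 1], minlength=2) / np.sum(v == 1)
parity_gap = total_variation(p0, p1)
print(parity_gap)  # larger value = stronger dependence on the viewpoint
```

A small `parity_gap` indicates low empirical dependency; as the surrounding discussion notes, this alone does not guarantee low dependency on unseen samples.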
Estimation procedures for divergences based on i.i.d. samples have been studied extensively. For example, for the KL-divergence, methods have been proposed that use nearest-neighbor distances [21] and least-squares estimation of the probability ratio [12]. To estimate $f$-divergences, García-García et al. [7] introduced an estimation procedure that uses loss minimization and sampling. Kanamori et al. [13] presented an estimator of $f$-divergences based on using the moment matching estimator [17] to estimate the probability ratio. Nguyen et al. [14] used a property of convex conjugate functions to derive an M-estimator of $f$-divergences, and they also derived its convergence rate. In our analysis, we derive an upper bound on the estimation error that is tighter than the bound on the estimation error of dependency measures implied by the existing convergence rate analysis.
3 Problem Formulation
Let $\mathcal{X}$ and $\mathcal{Y}$ be the domain of the input and the domain of the target, respectively. We assume that the learner obtains a set of $n$ i.i.d. samples drawn from an unknown probability measure defined on a measurable space. In addition, we assume the input consists of the viewpoint, which represents the sensitive information, and various other features. Given the i.i.d. samples, the learner seeks a hypothesis $h$ from a class $\mathcal{H}$ of measurable functions that minimizes both the misclassification rate and the dependence on the viewpoint. We denote by $(X_i, Y_i)$, $i = 1, \dots, n$, the random variables of the samples, and by $V_i$ the random variable of the corresponding viewpoint. The misclassification of a hypothesis $h$ is evaluated by the generalization risk, defined as $R(h) = \mathbb{E}[\ell(h(X), Y)]$ for a loss function $\ell$. The goal of the learner is to find a hypothesis $h^*$ such that
$h^* \in \operatorname{arg\,min}_{h \in \mathcal{H}} R(h)$.  (1)
The generalization risk cannot be evaluated directly because the sample distribution is unknown. Instead of the generalization risk, empirical risk minimization (ERM) finds a hypothesis that minimizes the empirical risk
$\hat{R}(h) = \frac{1}{n} \sum_{i=1}^{n} \ell(h(x_i), y_i)$.  (2)
Minimization of the empirical risk results in a relatively low generalization risk, and the generalization risk of the resultant hypothesis converges towards that of the optimal hypothesis as the number of samples increases; this has been shown theoretically [2].
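The ERM principle of eq. 2 can be sketched numerically. The following is a minimal illustration (data, class, and names are hypothetical): the empirical 0-1 risk is minimized over a small class of one-dimensional threshold classifiers by grid search, standing in for the minimization over $\mathcal{H}$.

```python
import numpy as np

# A minimal sketch of empirical risk minimization (eq. 2) on toy 1-D data,
# with a hypothesis class of threshold classifiers h_t(x) = 1{x >= t}.

rng = np.random.default_rng(0)
n = 200
x = rng.normal(size=n)
y = (x + 0.3 * rng.normal(size=n) >= 0).astype(int)  # noisy labels

def empirical_risk(t, x, y):
    """Fraction of misclassified samples for threshold t (0-1 loss)."""
    pred = (x >= t).astype(int)
    return float(np.mean(pred != y))

# ERM over a finite grid of thresholds stands in for the min over H.
grid = np.linspace(-2, 2, 81)
risks = [empirical_risk(t, x, y) for t in grid]
t_hat = float(grid[int(np.argmin(risks))])
print(t_hat, min(risks))
```

With enough samples, the minimizer of the empirical risk lands near the threshold with the lowest generalization risk, which is the convergence property cited from [2].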
3.1 A Generalized Class for Dependency Measures
To evaluate the dependency of the output of $h$ on the viewpoint $V$, we define a general class of dependency measures. Since the joint distribution of $h(X)$ and $V$ equals the product of their marginals if and only if $h(X)$ and $V$ are statistically independent, we can evaluate this dependency by evaluating the difference between these two probability measures. To measure the difference between two probability measures, we use $f$-divergences. Suppose $P$ and $Q$ are two probability measures on a compact domain, where $P$ is absolutely continuous with respect to $Q$. The class of $f$-divergences, also known as the Ali–Silvey distances [1, 4], takes the form
$D_f(P, Q) = \int f\!\left(\frac{dP}{dQ}\right) dQ$,  (3)
where $f$ is a convex and lower semi-continuous function such that $f(1) = 0$. (The $f$-divergence reduces to one of the familiar divergences for particular choices of $f$: the total variation distance for $f(t) = \frac{1}{2}|t-1|$, the Hellinger distance for $f(t) = (\sqrt{t}-1)^2$, the $\chi^2$-divergence for $f(t) = (t-1)^2$, and the KL-divergence for $f(t) = t \ln t$; see fig. 1(a).) Having defined the $f$-divergence, we define the measure of the dependency between $h(X)$ and $V$ as follows:
$D_f(h) = D_f\big(P_{h(X), V},\; P_{h(X)} \otimes P_V\big)$.  (4)
Without loss of generality, we assume that the subdifferential of $f$ at $t = 1$ contains $0$. This can always be arranged by replacing $f(t)$ with $f(t) + c(t-1)$ for a suitable finite $c$, which does not change the value of the $f$-divergence. We focus on convex functions $f$ that are differentiable on $(0, \infty)$ except possibly at $t = 1$. Note that this includes most of the familiar divergences, including the total variation distance, the Hellinger distance, the $\chi^2$-divergence, and the KL-divergence.
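The definitions above can be checked numerically on discrete distributions. The sketch below (with standard choices of $f$, stated as facts we are certain of) evaluates eq. 3 and verifies the shift invariance just mentioned: replacing $f(t)$ by $f(t) + c(t-1)$ leaves the divergence unchanged because $\sum_x q(x)\,(p(x)/q(x) - 1) = 0$.

```python
import numpy as np

# Numeric sketch of the f-divergence (eq. 3) for discrete distributions,
# plus a check that adding c*(t - 1) to f does not change the divergence.

def f_divergence(p, q, f):
    p, q = np.asarray(p, float), np.asarray(q, float)
    t = p / q                       # probability ratio dP/dQ
    return float(np.sum(q * f(t)))

tv        = lambda t: 0.5 * np.abs(t - 1)     # total variation
hellinger = lambda t: (np.sqrt(t) - 1) ** 2   # (squared) Hellinger
chi2      = lambda t: (t - 1) ** 2            # chi-squared
kl        = lambda t: t * np.log(t)           # KL-divergence

p = [0.2, 0.3, 0.5]
q = [0.4, 0.4, 0.2]

kl_val = f_divergence(p, q, kl)

# Shift invariance: f(t) + c*(t - 1) yields the same divergence value.
c = 3.7
kl_shifted = f_divergence(p, q, lambda t: kl(t) + c * (t - 1))
print(kl_val, kl_shifted)
```

For this `p` and `q`, the total variation value is $\frac{1}{2}\sum_x |p(x)-q(x)| = 0.3$, and every divergence vanishes when the two distributions coincide.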
3.2 Fairness-Aware Learning with Generalized Dependency Measures
In fairness-aware learning, the learner attempts to minimize both the generalization risk $R(h)$ and the dependency $D_f(h)$. However, since the two objectives cannot in general be minimized simultaneously, there is a trade-off between them. We thus consider a subset of $\mathcal{H}$ parameterized by $\eta \ge 0$, defined as follows:
$\mathcal{H}_\eta = \{ h \in \mathcal{H} : D_f(h) \le \eta \}$.  (5)
Thus, the goal of fairness-aware learning is to find the hypothesis $h^*$ that satisfies
$h^* \in \operatorname{arg\,min}_{h \in \mathcal{H}_\eta} R(h)$.  (6)
Again, since the generalization risk cannot be evaluated directly, the learner instead minimizes the empirical risk:
$\hat{h} \in \operatorname{arg\,min}_{h \in \mathcal{H}_\eta} \hat{R}(h)$.  (7)
The objective of fairness-aware learning is to solve the optimization problem of eq. 7. Unfortunately, $D_f(h)$ likewise cannot be evaluated directly, since the underlying distribution is unobservable. In section 4, we introduce a novel estimation procedure for $f$-divergences that enables empirical evaluation of $D_f(h)$. We then prove an upper bound on $D_f(h)$ in terms of its empirical estimate given a finite number of samples. In section 5, the objective function of fairness-aware learning is redefined using this empirical estimate.
4 Divergence Estimation
In this section, we introduce a procedure that estimates $D_f(h)$ by minimizing the maximum mean discrepancy (MMD), and we establish a non-asymptotic bound on the estimation error. This procedure covers the existing $f$-divergence and KL-divergence estimation algorithms proposed by Nguyen et al. [14], Ruderman et al. [18], and Kanamori et al. [13].
4.1 Estimation of the Divergence by Minimizing the Maximum Mean Discrepancy
To estimate $D_f(h)$, we first empirically estimate the probability ratio $r^* = dP/dQ$ between the joint measure $P = P_{h(X), V}$ and the product measure $Q = P_{h(X)} \otimes P_V$, and then we empirically evaluate $D_f(h)$ using the estimated probability ratio. Since $\mathbb{E}_Q[r^*(W)\,g(W)] = \mathbb{E}_P[g(W)]$ holds for the true ratio $r^*$ and any integrable $g$, a minimizer of the discrepancy between $r\,dQ$ and $dP$ is expected to be close to the probability ratio. As a measure of this discrepancy, we use the maximum mean discrepancy. Let $\mathcal{G}$ be a set of functions. Let $X'$ be an independent copy of $X$, and let $V'$ be the viewpoint of $X'$; then $(h(X), V')$ is distributed according to the product measure $Q$. The MMD with $\mathcal{G}$ between $r\,dQ$ and $dP$ is defined as
$\mathrm{MMD}(\mathcal{G}; r) = \sup_{g \in \mathcal{G}} \big( \mathbb{E}_P[g(W)] - \mathbb{E}_Q[r(W)\,g(W)] \big)$  (8)
$\phantom{\mathrm{MMD}(\mathcal{G}; r)} = \sup_{g \in \mathcal{G}} \big( \mathbb{E}[g(h(X), V)] - \mathbb{E}[r(h(X), V')\,g(h(X), V')] \big)$.  (9)
If $r$ is equal to the probability ratio $r^*$, then $\mathrm{MMD}(\mathcal{G}; r) = 0$. The converse, however, does not always hold; for it to hold, $\mathcal{G}$ must be a set of functions induced by a universal kernel [19]. The ability of the MMD to detect the discrepancy therefore depends on the choice of $\mathcal{G}$. The theory of U-statistics [8] gives an unbiased estimator of the expectation difference in eq. 9 for each $g$ as
$\widehat{\mathrm{MMD}}(g; r) = \frac{1}{n} \sum_{i=1}^{n} g(h(x_i), v_i) - \frac{1}{n(n-1)} \sum_{i \ne j} r(h(x_i), v_j)\, g(h(x_i), v_j)$,  (10)
and the empirical MMD takes the supremum of this quantity over $\mathcal{G}$.
The estimator $\hat{r}$ of the probability ratio is obtained by minimizing the empirical MMD. (The efficient computation of the empirical MMD is shown in appendix A.) We can add a regularization term, written generically as $\lambda\,\Omega(r)$ below, to the empirical MMD to ensure the consistency of the estimator:
$\hat{r} \in \operatorname{arg\,min}_{r} \Big( \sup_{g \in \mathcal{G}} \widehat{\mathrm{MMD}}(g; r) + \lambda\,\Omega(r) \Big)$.  (11)
Once $\hat{r}$ is obtained by solving eq. 11, the $f$-divergence is empirically evaluated as
$\hat{D}_f(h) = \frac{1}{n(n-1)} \sum_{i \ne j} f\big(\hat{r}(h(x_i), v_j)\big)$.  (12)
The estimation procedure is equivalent to that of [14] for a particular choice of the function class and of the regularization parameter. If, in addition to the regularization term, we add a normalization constraint to eq. 11, the estimation procedure becomes the same as that of [18]. An appropriate choice of the function class and of $f$ yields the estimation procedure of [13].
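As a rough illustration of the plug-in idea behind eqs. 10–12 (not the paper's estimator), on a discrete domain the ratio between the joint distribution of (output, viewpoint) and the product of its marginals can be estimated directly from counts, and the divergence evaluated by the plug-in average; all data here are synthetic:

```python
import numpy as np

# Illustrative only: estimate r = dP/dQ (joint over product of marginals)
# by counting on a discrete domain, then plug into the f-divergence.

rng = np.random.default_rng(2)
n = 5000
v = rng.integers(0, 2, size=n)
out = (rng.random(n) < 0.4 + 0.3 * v).astype(int)   # output depends on v

joint = np.zeros((2, 2))
for o, s in zip(out, v):
    joint[o, s] += 1
joint /= n
prod = np.outer(joint.sum(axis=1), joint.sum(axis=0))  # product of marginals

ratio = joint / prod                                   # estimated dP/dQ

kl = lambda t: t * np.log(t)
# Plug-in estimate of the KL dependency measure (mutual information here).
d_hat = float(np.sum(prod * kl(ratio)))
print(d_hat)
```

On continuous or high-dimensional domains such direct counting is unavailable, which is exactly why the MMD-based ratio estimation of eq. 11 is needed.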
4.2 Analysis of Estimation Error
In this subsection, we derive an upper bound on the estimation error
$\big| D_f(h) - \hat{D}_f(h) \big|$.  (13)
Surprisingly, the upper bound on the estimation error does not depend on the complexity of the class of functions $\mathcal{G}$. In what follows, we denote the true probability ratio by $r^*$.
The following theorem states a probabilistic upper bound on the estimation error. Let $\hat{r}$ be the probability ratio estimated from the obtained set of samples. Suppose that the function class $\mathcal{G}$ of the MMD contains a function built from an element of the subdifferential of $f$ evaluated at the estimated ratio, and that the probability ratio is bounded almost surely. Then, with probability at least $1 - \delta$,
(14)
where the constant depends on $f$ and on the bounds on the probability ratio. The proof of this theorem is found in appendix C. As the theorem in section 4.2 shows, the $f$-divergence can be bounded above by the empirical divergence, the empirical MMD, and an $O(1/\sqrt{n})$ term. We minimize the error between the divergence and the empirical divergence by minimizing the empirical MMD. In addition, the error bound does not depend on the complexity of $\mathcal{G}$. This implies that, in order to guarantee the upper bound on the divergence, we should choose $\mathcal{G}$ large enough to satisfy the assumption of the theorem. A large $\mathcal{G}$, however, can lead to an overestimate of the divergence.
The convergence rate, i.e., the bound on the absolute value of the estimation error, shown by [14] depends on the convergence rate of the empirical process with respect to $\mathcal{G}$. In contrast, the upper bound proved in section 4.2 contains no complexity term of $\mathcal{G}$, such as the Rademacher complexity, the covering entropy, or the bracketing entropy, and is thus tighter than the convergence rate.
5 Fairness-Aware Learning with Divergence Estimation
In this section, we provide an algorithm for solving eq. 7 that includes the introduced estimation procedure for the divergences. We will then show that the algorithm can be formulated as a convex optimization problem.
5.1 Algorithm for Fairness-Aware Learning with Divergence Estimation
Following the estimation procedure described in section 4, we define the optimization problem of our fairness-aware learning as
(15)
where the constant in the constraint is larger than the bound defined in section 4.2. As indicated by the theorem in section 4.2, this constraint guarantees that the divergence of the hypothesis resulting from eq. 15 is less than the prescribed threshold.
Let us consider the effect of the choice of $f$ on the estimation error of the divergence. The upper bound on the estimation error shown in section 4.2 does not depend on the choice of $f$; nevertheless, the choice of $f$ changes the constant factor in the bound. For a representative setting of the parameters, fig. 1 shows the shape of $f$ and the value of the constant for various functions $f$. As shown in fig. 1(b), the smallest constant is that for the Hellinger distance, and thus, of these four divergences, it yields the tightest bound on the probability ratio.
5.2 Optimization
The necessary condition for convexity of eq. 15 is the linearity of the functions with respect to their parameters. Under mild assumptions, the problem can be made convex for any choice of $f$ by a simple reformulation. We form the hypothesis as a model that is linear with respect to its parameters; under this assumption, the function on the RKHS is likewise given as a linear function of its parameters. The optimization problem in eq. 15 can then be rearranged as
(16) 
Since the auxiliary function of the MMD does not appear in the objective of the original optimization problem, eq. 15, we transform the problem so that the optimization over it appears only in the constraint. Following the derivation of the dual problem in [14], we obtain the dual form of the constraint as
(17) 
where
(18) 
With the notation above, we can rewrite the optimization problem in eq. 16 as
(19) 
where
(20) 
From the preceding definition, the constraint can be expressed through an indicator function. We then relax the indicator function as follows:
(21) 
This optimization problem is convex, and its solution is equivalent to that of eq. 15. We prove these claims with the following corollary and theorem. If the assumption of section 5.2 holds, and the loss is convex with respect to the parameters, then the optimization problem in eq. 21 is a convex optimization problem. Moreover, the solution of the optimization problem in eq. 21 is equivalent to the solution of eq. 19. The proofs of the corollary and the theorem can be found in appendix C.
6 Generalization Error Analysis
We consider the generalization error bound of the learned hypothesis obtained by the algorithm described in section 5. In our analysis, we use two types of Rademacher complexity, which measure the complexity of a class of functions and are defined as
(22) 
where the $\sigma_i$ are independent Rademacher variables, that is, $\Pr(\sigma_i = +1) = \Pr(\sigma_i = -1) = 1/2$.
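As an aside, the empirical Rademacher complexity of a small finite function class can be approximated by Monte Carlo sampling over the sign variables. The sketch below assumes the standard definition $\hat{\mathfrak{R}}(\mathcal{F}) = \mathbb{E}_\sigma\big[\sup_{f \in \mathcal{F}} \frac{1}{n}\sum_i \sigma_i f(x_i)\big]$, which may differ in constants from the (possibly local) variants used in the paper:

```python
import numpy as np

# Monte Carlo approximation of the empirical Rademacher complexity of a
# finite class of threshold functions f_t(x) = sign(x - t).
# Generic illustration; the paper's variants may differ in constants.

rng = np.random.default_rng(3)
n = 100
x = rng.uniform(-1, 1, size=n)

thresholds = np.linspace(-1, 1, 21)
F = np.sign(x[None, :] - thresholds[:, None])   # shape (|F|, n)

m = 2000  # Monte Carlo draws of the Rademacher vector
sigma = rng.choice([-1.0, 1.0], size=(m, n))
# For each sigma draw, take the sup over the class of the correlation.
sups = np.max(sigma @ F.T / n, axis=1)
r_hat = float(np.mean(sups))
print(r_hat)  # on the order of sqrt(log|F| / n) for a finite class
```

A smaller complexity yields a tighter generalization bound, which is why restricting the hypothesis class, as done below, cannot worsen the complexity term.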
In the generalization error analysis, since our fairness-aware learning algorithm incurs a probabilistic estimation error, we consider the set of hypotheses defined as
(23)
where the constant is defined as in section 4.2. The theorem in section 4.2 shows that, with probability at least $1 - \delta$, the learned hypothesis belongs to this set. Hence, applying the theorem of [2], which appears in appendix B, yields the generalization error bound for our algorithm. Let $h^*$ be the optimal hypothesis in the restricted class, and let $\hat{h}$ be the hypothesis learned from the obtained set of samples. Under boundedness assumptions on the loss, with probability at least $1 - \delta$,
(24) 
Since the restricted hypothesis class is a subset of $\mathcal{H}$, its Rademacher complexity is no larger than that of $\mathcal{H}$. Therefore, the generalization error bound of the algorithm constrained by the divergence is no larger than that of the algorithm without the constraint.
While our algorithm guarantees an upper bound on the divergences, it reduces the classification performance, as compared to the classifier learned by ERM. Accordingly, let us consider the generalization error of the optimal hypotheses with and without the restriction on the divergences:
(25) 
This error represents the reduction in the classification performance caused by restricting the divergences. Since the error cannot be directly evaluated, we define the estimator of eq. 25 as
(26) 
Our interest is in deriving the convergence rate of this estimator; the following theorem gives this rate. Under boundedness assumptions on the loss and the hypothesis class, with probability at least $1 - \delta$,
(27) 
The proof of this theorem appears in appendix C.
7 Conclusions
In this paper, we considered fairness-aware learning for classification, with the aim of learning a classifier whose predictions have both a low misclassification rate and low dependence on the viewpoint. Our contributions are as follows: (1) We propose a novel generalized procedure for estimating $f$-divergences for fairness-aware learning. Our generalized estimation procedure provides a tighter upper bound on the estimation error by introducing the maximum mean discrepancy. (2) We formulate a general ERM framework for fairness-aware learning that is based on the empirical estimation procedure for $f$-divergences and that guarantees an upper bound on the generalization dependency. Furthermore, we provide an analysis of the generalization error of the proposed fairness-aware learning algorithm.
References
 Ali and Silvey [1966] SM Ali and Samuel D Silvey. A general class of coefficients of divergence of one distribution from another. Journal of the Royal Statistical Society. Series B (Methodological), pages 131–142, 1966.
 Bartlett et al. [2005] Peter L Bartlett, Olivier Bousquet, and Shahar Mendelson. Local rademacher complexities. Annals of Statistics, pages 1497–1537, 2005.
 Calders and Verwer [2010] Toon Calders and Sicco Verwer. Three naive bayes approaches for discrimination-free classification. Data Mining and Knowledge Discovery, 21(2):277–292, September 2010.
 Csiszár [1963] Imre Csiszár. Eine informationstheoretische ungleichung und ihre anwendung auf den beweis der ergodizität von markoffschen ketten. Publications of the Mathematical Institute of Hungarian Academy of Sciences, 8:85–108, 1963.
 Dwork et al. [2012] Cynthia Dwork, Moritz Hardt, Toniann Pitassi, Omer Reingold, and Richard Zemel. Fairness through awareness. In Proceedings of the 3rd Innovations in Theoretical Computer Science Conference, pages 214–226. ACM, 2012.
 Fukuchi and Sakuma [2014] Kazuto Fukuchi and Jun Sakuma. Neutralized empirical risk minimization with generalization neutrality bound. In Machine Learning and Knowledge Discovery in Databases  European Conference, ECML PKDD 2014, Nancy, France, September 1519, 2014. Proceedings, Part I, pages 418–433, 2014.
 García-García et al. [2011] Dario García-García, Ulrike von Luxburg, and Raúl Santos-Rodríguez. Risk-based generalizations of f-divergences. In Proceedings of the 28th International Conference on Machine Learning, ICML 2011, Bellevue, Washington, USA, June 28 – July 2, 2011, pages 417–424, 2011.
 Hoeffding [1948] Wassily Hoeffding. A class of statistics with asymptotically normal distribution. Annals of Mathematical Statistics, 19:293–325, 1948.
 Hoeffding [1963] Wassily Hoeffding. Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58(301):13–30, 1963.
 Kamiran et al. [2010] Faisal Kamiran, Toon Calders, and Mykola Pechenizkiy. Discrimination aware decision tree learning. In Data Mining (ICDM), 2010 IEEE 10th International Conference on, pages 869–874. IEEE, 2010.
 Kamishima et al. [2012] Toshihiro Kamishima, Shotaro Akaho, Hideki Asoh, and Jun Sakuma. Fairness-aware classifier with prejudice remover regularizer. In Proceedings of the ECML/PKDD 2012, Part II, volume LNCS 7524, pages 35–50. Springer, 2012.
 Kanamori et al. [2009] Takafumi Kanamori, Shohei Hido, and Masashi Sugiyama. A leastsquares approach to direct importance estimation. The Journal of Machine Learning Research, 10:1391–1445, 2009.
 Kanamori et al. [2012] Takafumi Kanamori, Taiji Suzuki, and Masashi Sugiyama. f-divergence estimation and two-sample homogeneity test under semiparametric density-ratio models. Information Theory, IEEE Transactions on, 58(2):708–720, 2012.
 Nguyen et al. [2010] XuanLong Nguyen, Martin J Wainwright, and Michael I Jordan. Estimating divergence functionals and the likelihood ratio by convex risk minimization. Information Theory, IEEE Transactions on, 56(11):5847–5861, 2010.
 Pedreschi et al. [2009] Dino Pedreschi, Salvatore Ruggieri, and Franco Turini. Measuring discrimination in sociallysensitive decision records. In Proceedings of the SIAM International Conference on Data Mining, SDM, pages 581–592. SIAM, 2009.
 Pedreshi et al. [2008] Dino Pedreshi, Salvatore Ruggieri, and Franco Turini. Discriminationaware data mining. In Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 560–568. ACM, 2008.
 Qin [1998] Jing Qin. Inferences for casecontrol and semiparametric twosample density ratio models. Biometrika, 85(3):619–630, 1998.
 Ruderman et al. [2012] Avraham Ruderman, Darío García-García, James Petterson, and Mark D Reid. Tighter variational representations of f-divergences via restriction to probability measures. In Proceedings of the 29th International Conference on Machine Learning (ICML-12), pages 671–678, 2012.
 Steinwart [2001] Ingo Steinwart. On the influence of the kernel on the consistency of support vector machines. Journal of Machine Learning Research, 2:67–93, 2001.
 Sweeney [2013] Latanya Sweeney. Discrimination in online ad delivery. Queue, 11(3):10:10–10:29, March 2013. ISSN 1542-7730. doi: 10.1145/2460276.2460278.
 Wang et al. [2009] Qing Wang, Sanjeev R Kulkarni, and Sergio Verdú. Divergence estimation for multidimensional densities via nearest-neighbor distances. Information Theory, IEEE Transactions on, 55(5):2392–2405, 2009.
 Zemel et al. [2013] Richard S. Zemel, Yu Wu, Kevin Swersky, Toniann Pitassi, and Cynthia Dwork. Learning fair representations. In ICML (3), pages 325–333, 2013.
 Zliobaite et al. [2011] Indre Zliobaite, Faisal Kamiran, and Toon Calders. Handling conditional discrimination. In Data Mining (ICDM), 2011 IEEE 11th International Conference on, pages 992–1001. IEEE, 2011.
Appendix A Maximum Mean Discrepancy with Functions on a Reproducing Kernel Hilbert Space
Since we need to solve a maximization problem to evaluate the empirical MMD, its evaluation can be costly and may require an iterative algorithm. However, if the elements of the function class are represented by inner products of parameter vectors, which includes functions in a reproducing kernel Hilbert space (RKHS), the empirical MMD can be computed efficiently. Let $k$ be a universal kernel, let $\mathcal{H}_k$ be the RKHS induced by $k$, and let $\phi$ be the canonical feature map induced by $k$. Suppose that the functions in the class take the form of inner products with $\phi$. Then, the empirical MMD is equivalent to
(28) 
Proof of appendix A.
From the definition of the MMD and the reproducing property, we have
(29)  
(30) 
Since the supremum is achieved when the parameter vector points in the direction of the empirical mean discrepancy, we obtain the claim. ∎
For simplicity of notation, we abbreviate the feature-map evaluations. Then, eq. 28 can be rearranged as
(31)
Let $M$ be the matrix and $b$ the vector collecting the corresponding terms, and let $w$ be the vector representation of the parameters. The matrix representation of the minimization of eq. 31 is obtained as
(32)
The minimizer of eq. 31 with respect to $w$ is easily obtained when $M$ is a positive definite matrix.
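Assuming the reduction yields the standard quadratic form $\frac{1}{2} w^\top M w - b^\top w$ with $M$ positive definite (the concrete $M$ and $b$ of eq. 32 come from kernel evaluations; the ones below are stand-ins), the minimizer is the solution of the linear system $M w = b$:

```python
import numpy as np

# Sketch of the final minimization step: a quadratic objective
# (1/2) w^T M w - b^T w with M positive definite has the unique
# minimizer w* = M^{-1} b.  M and b here are synthetic stand-ins.

rng = np.random.default_rng(4)
A = rng.normal(size=(5, 5))
M = A @ A.T + 5.0 * np.eye(5)   # positive definite by construction
b = rng.normal(size=5)

w_star = np.linalg.solve(M, b)  # unique minimizer since M is PD

def objective(w):
    return 0.5 * w @ M @ w - b @ w

# Any perturbation of the solution cannot decrease the objective.
for _ in range(10):
    w = w_star + 0.1 * rng.normal(size=5)
    assert objective(w) >= objective(w_star) - 1e-12
print(objective(w_star))
```

Solving the linear system directly avoids the iterative maximization mentioned at the start of this appendix, which is the point of the RKHS representation.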
Appendix B Generalization Error Bound of Bartlett et al. [2]
Appendix C Proofs
C.1 Proof of the Theorem in Section 4.2
In order to prove the theorem in section 4.2, we first prove the following lemmas. Suppose that the probability ratio is bounded almost surely between two constants; then
(34)
Proof.
Suppose that the probability ratio is bounded almost surely between two constants; then
(37)
Proof.
From the assumption that the subdifferential of $f$ at $t = 1$ contains zero, $f$ is nonincreasing on $t < 1$ and nondecreasing on $t > 1$. Therefore, the supremum of $f$ over an interval of ratio values is attained at one of its endpoints, which gives the claim. ∎
Proof of section 4.2.
The error is decomposed as
(38) 
From the definition of the subdifferential, we have
(39)  
(40)  