I Introduction
Machine learning models have been reported to produce unfair predictions for minorities when applied to assist consequential decision making: they are biased against Black defendants in recidivism prediction [3], against female applicants in job hiring [2] and against female employees in facial verification [23]. How to learn fair prediction models has become a pressing problem for government [20], industry [13, 33] and academia [6, 10]. Many fair learning methods have been developed, including label processing [37, 25], feature processing [14, 36], model regularization [12, 35] and model postprocessing [18, 15]; some achieve promising performance with very efficient tradeoffs between model accuracy and model fairness.
We note that most fair machine learning methods require direct access to individuals' demographic data, e.g., they need an individual's race to mitigate racial bias. However, the use of such data is increasingly restricted to protect user privacy. In 2018, Europe launched the General Data Protection Regulation (GDPR, https://eugdpr.org/theregulation/), which prohibits "processing of personal data revealing racial or ethnic origin" and allows users to request "erasure of personal data" from the data controller. Besides, the privacy research community has long worked on hiding sensitive personal data from data analytics [1, 28].
Fairness and privacy are thus caught in a dilemma: most fair learners need access to demographic data, while the use of these data is restricted for privacy protection. Debates are arising [38, 34]: should the law permit the use of private demographic data for the sake of fair learning? Is direct access to such data technically necessary? Very few scientific studies have addressed these questions.
In this paper, we propose a distributed fair machine learning framework that does not require direct access to demographic data. We assume user data are distributed over a data center and a third party – the former holds the nonprivate data and is responsible for learning fair models; the latter holds the demographic data and can assist learning via private communications with the center that do not reveal user demographics.
Based on the framework, we present a principled strategy to design private fair learners: the center first constructs a random but fair hypothesis space via private communications with the third party; then, the center learns an accurate model in that space using standard methods. Our insight is that (i) model fairness is ensured by the fair hypothesis space and (ii) model accuracy is guaranteed by random projection theory [4, 16].
Applying the strategy, we exemplify how to redesign four existing nonprivate fair learners into private ones: fair ridge regression [7], fair logistic regression [21], fair kernel regression [30] and fair PCA [32, 29]. We show the redesigned learners consistently outperform their counterparts in both fairness and accuracy across three real-world data sets. Finally, we theoretically analyze the proposed fair machine learning framework. We prove upper bounds on both its model fairness and model accuracy, and show that their tradeoff can be balanced (and controlled) via a threshold hyperparameter.
The rest of the paper is organized as follows: Section II introduces background and related work; Section III introduces notations; Section IV presents the proposed framework and exemplifies the design of four private fair learners; Section V presents theoretical analysis on the framework; Section VI shows experimental results and discussions; Section VII shows the conclusion; Appendix contains all proofs.
II Related Work
II-A Fairness Measure
Several fairness notions have been proposed in the literature, such as statistical disparity [14], equal odds [18], individual fairness [12], causal fairness [24] and envy-free fairness [5]. In this paper, we focus on statistical disparity, since it is the most common and perhaps the most disputed. We propose to measure model fairness by the covariance between the prediction and the demographic variable, which we find extremely easy to use while giving a very efficient accuracy-fairness tradeoff. Similar measures have been used in the literature, such as the mutual information [21], correlation [30] or independence [36] between these two variables, but none of these works provide theoretical analysis of the measure used. In this paper, we theoretically analyze the covariance measure; we prove that low covariance implies low statistical disparity.
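As an illustration, the covariance measure is cheap to estimate from a finite sample. The following is a minimal sketch; the function name and interface are ours, not from the paper:

```python
import numpy as np

def covariance_disparity(predictions, sensitive):
    """Empirical covariance between model predictions and a binary
    demographic variable; values near zero indicate low disparity."""
    predictions = np.asarray(predictions, dtype=float)
    sensitive = np.asarray(sensitive, dtype=float)
    return float(np.mean((predictions - predictions.mean())
                         * (sensitive - sensitive.mean())))

# A prediction that tracks the sensitive attribute has high covariance;
# a constant prediction has zero covariance.
```

This estimator is what the third party would evaluate privately in the framework described later.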
II-B Fair Learning with Restricted Access to Demographic Data
Several lines of study relate to restricted access to demographic data, but none directly addresses the problem.
A traditional fair learning method is to simply remove the demographic feature from the model, which is a natural way to protect privacy. However, this approach does not guarantee fairness, due to the redlining effect [9]. Some studies do not use demographic data as a model feature but still use it in other ways during learning. For example, [25] uses k-NN to detect unfair labels; it does not use demographic data to measure instance similarity, but still uses it to measure label disparity in neighborhoods.
Specific discussions on the restricted use of demographic data appear in [38, 34], but scientific investigations and solutions are lacking. Recently, Kilbertus et al. [22] propose to encrypt demographic data before learning. This is a promising solution, but encryption comes with extra costs in time and protocol complexity. Our framework pursues another direction based on random projection; it is cheaper and easier to implement. Hashimoto et al. [19] propose a fair learning method that automatically infers group membership and minimizes disparity across the inferred groups; this method is also promising, as it requires no access to demographic data at all. However, it focuses on a less common fairness notion (distributive justice) and on online learning. In contrast, we focus on the common disparity measure and the offline setting (although our framework extends to the online setting). Besides, we hypothesize that one can obtain fairer models with even limited access to demographic data than with no access at all.
Finally, studies on individual fairness do not require access to demographic data. For example, one can achieve fairness by learning a Lipschitz continuous prediction model [12]. Here, we focus on achieving group fairness.
III Notations
In this section, we introduce the basic notations that will be used throughout the paper. More will be introduced later.
We describe a random individual by a triple $(s, x, y)$, where $s$ is a sensitive demographic feature, $x$ is a vector of nonsensitive features and $y$ is the label. For example, when studying gender bias in hiring, $s$ is an applicant's gender, $x$ is the nonsensitive feature vector (e.g., education, working hours) and $y$ indicates whether the applicant is hired. We index observed individuals by subscripts, e.g., $(s_i, x_i, y_i)$ is the $i$-th individual in a (training) sample set. Let $f(x)$ be a prediction model, which does not take $s$ as input but can use it for training.
IV A Distributed Fair Learning Framework
Algorithm 1 takes as inputs the number of random hypotheses, the generator variance, the fairness threshold, the data center (DC) and the third party (TP); its output model is a linear combination of the returned fair hypotheses: (1)
In this section, we present the proposed fair learning framework and exemplify how to design private fair learners with it.
We assume the scenario in Figure 1: there is a data center and a third party, over which a training set is distributed. The center holds the nonsensitive features and labels and focuses on learning a fair model; the party holds the demographic data and can assist learning via private communications with the center that reveal no demographic information.
Our strategy for designing fair learners is shown in Algorithm 1. It has two phases: (i) steps 1 to 4 construct a random and fair hypothesis space spanned by the returned hypotheses; (ii) step 5 learns an accurate model in that space.
Specifically, the center first generates a set of random hypotheses from Gaussian distributions (step 1), gets their predictions on the training set (step 2) and sends these predictions to the third party (step 3). The party estimates the correlation between its demographic data and each hypothesis's predictions; if a correlation is small enough, the center is informed that the corresponding hypothesis is fair (step 4). Finally, the center learns an accurate model in the span of all fair random hypotheses; the model will be both fair and accurate. Note that, throughout the process, the demographic data is never revealed to the center and hence its privacy is protected.
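The two-phase protocol above can be sketched in code. This is an illustrative rendering under the assumption of linear hypotheses; function names, default values and the least-squares fitting step are our choices, not necessarily the paper's:

```python
import numpy as np

rng = np.random.default_rng(0)

def fair_random_span(X, s, m=200, sigma=1.0, tau=0.05):
    """Screening phase sketch: the center generates m random linear
    hypotheses; the third party (which holds s) returns the ones whose
    predictions have absolute covariance with s at most tau."""
    n, d = X.shape
    W = rng.normal(0.0, sigma, size=(m, d))   # step 1: random hypotheses
    P = X @ W.T                               # step 2: predictions, n x m
    # steps 3-4: third party screens predictions against demographics
    s_c = s - s.mean()
    cov = (P - P.mean(axis=0)).T @ s_c / n    # covariance per hypothesis
    fair = np.abs(cov) <= tau
    return W[fair], P[:, fair]

def learn_in_span(P_fair, y, lam=1e-3):
    """Step 5 sketch: fit combination weights over the fair hypotheses'
    predictions by ridge-regularized least squares (center only)."""
    k = P_fair.shape[1]
    return np.linalg.solve(P_fair.T @ P_fair + lam * np.eye(k), P_fair.T @ y)
```

The third party only ever sees predictions, and the center only ever sees which hypothesis indices passed the screen, matching the privacy argument above.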
Next, we exemplify how to apply Algorithm 1 to redesign four existing nonprivate fair learners into private ones. These four learners are chosen because they are fundamental and cover different settings: linear vs. nonlinear, regression vs. classification, and predictive learning vs. feature learning. More sophisticated learners may be designed in similar ways.
For ease of discussion, we write the training instances as a sample matrix, the labels as the associated label vector, and the returned hypotheses as a matrix. Since the learned model lies in the span of the returned hypotheses, we write it as their linear combination:
(2)
IV-A Distributed Fair Ridge Regression (DFRR)
Calders et al. [7] develop a fair ridge regression (FRR). It minimizes the squared loss on the training sample while additionally penalizing the prediction disparity across demographic groups, where two index sets partition the sample into the two groups (e.g., female and male). Their objective function is
where the penalty term is the prediction disparity. Evaluating this penalty requires simultaneous access to the nonsensitive features and the demographic data; thus the method cannot be directly applied in our private learning framework.
We propose a distributed fair ridge regression (DFRR) based on Algorithm 1. Our objective function is
(3) 
Minimizing the above objective over the combination weights gives
(4) 
The general argument for (4) is that we can first solve for one set of variables (by a standard method such as least squares) and then solve for the other. This argument will be used repeatedly in the sequel.
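Under assumption (1) that the model is a linear combination of the fair hypotheses, the ridge step reduces to a standard least-squares problem. The following is a sketch in our own notation (with $P$ the matrix of the fair hypotheses' predictions on the sample, $\alpha$ the combination weights, $y$ the labels and $\lambda$ the ridge coefficient; these symbols are ours, not necessarily the paper's):

```latex
\min_{\alpha}\; \|P\alpha - y\|_2^2 + \lambda \|\alpha\|_2^2,
\qquad
\hat{\alpha} = \left(P^{\top} P + \lambda I\right)^{-1} P^{\top} y .
```

The effective model is then the corresponding combination of the fair hypotheses, matching the two-step argument for (4): first solve the least-squares problem, then recover the model.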
IV-B Distributed Fair Kernel Ridge Regression (DFKRR)
Pérez-Suay et al. [30] develop a fair kernel ridge regression (FKRR). It minimizes the squared loss in an RKHS while additionally penalizing the correlation between the prediction and the demographic feature. Its objective function is
where the penalty is the correlation between the (centered) prediction and the (centered) demographic variable. This method also needs simultaneous access to the nonsensitive features and the demographic data.
We present a distributed fair kernel ridge regression (DFKRR) method based on Algorithm 1. Our high-level objective is
(5) 
Unlike the standard representer assumption that the model is expressed by the kernel functions of the training instances, we first assume the model is expressed by the fair random hypotheses as in (1), and that each hypothesis is in turn linearly expressed by the kernel functions, i.e.,
(6) 
where the coefficients are random and associated with the corresponding hypothesis. A similar construction has been used in [17].
Based on (5), we can generate a random hypothesis (and its predicted labels) by randomly generating a set of associated coefficients. Note that these random coefficients are known, while the combination weights are unknown. Minimizing the objective gives
(7) 
where the Gram matrix is computed on the training sample and the coefficient matrix stacks the random coefficients of the hypotheses, one hypothesis per row.
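This construction can be sketched numerically: each random hypothesis is a random combination of kernel functions centered at the training instances, and the span weights are fit by regularized least squares. The RBF kernel, names and defaults below are our assumptions, not the paper's implementation:

```python
import numpy as np

def rbf_kernel(A, B, gamma=0.5):
    """Gaussian (RBF) kernel matrix between the rows of A and B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def dfkrr_fit(X, C_fair, y, lam=1e-2, gamma=0.5):
    """Each random hypothesis h_j(x) = sum_i c_ji k(x_i, x) is a row of
    C_fair; fit the span weights alpha by regularized least squares on
    the hypotheses' sample predictions P = K C_fair^T."""
    K = rbf_kernel(X, X, gamma)
    P = K @ C_fair.T
    k = P.shape[1]
    alpha = np.linalg.solve(P.T @ P + lam * np.eye(k), P.T @ y)
    return alpha, K

def dfkrr_predict(X_new, X, C_fair, alpha, gamma=0.5):
    """Prediction is the alpha-weighted combination of the kernel hypotheses."""
    return rbf_kernel(X_new, X, gamma) @ (C_fair.T @ alpha)
```

In the full protocol, only the rows of C_fair whose predictions pass the third party's screen would be kept before fitting.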
IV-C Distributed Fair Logistic Regression (DFGR)
Kamishima et al. [21] develop a fair logistic regression (FGR). It maximizes the likelihood of the labels while additionally penalizing the mutual information between the model prediction and the demographic feature. Its objective function is
where the penalty measures the mutual information and can be estimated from data. This method also requires simultaneous access to the nonsensitive features and the demographic data.
We propose a distributed fair logistic regression (DFGR) based on Algorithm 1. Our high-level objective function is
(8) 
where the likelihood is constructed in the same way as in logistic regression, with the additional assumption that the model has the form (1).
Minimizing (8) by Newton's method, we obtain the update
(9) 
where
(10) 
and
(11) 
where the mean vector and the diagonal weight matrix are the standard quantities in logistic regression.
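The Newton update over the fair hypotheses' predictions can be sketched as follows. This is our illustration under the assumption that the fairness screening has already produced the prediction matrix P; the ridge penalty, names and defaults are our choices:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def dfgr_newton(P, y, lam=0.1, iters=20):
    """Newton/IRLS for logistic regression over the fair hypotheses'
    predictions P (n x k). mu and the diagonal weights R = mu(1-mu) are
    the standard logistic-regression quantities; lam is a small ridge
    term (our addition) that keeps the Hessian well conditioned."""
    n, k = P.shape
    alpha = np.zeros(k)
    for _ in range(iters):
        mu = sigmoid(P @ alpha)
        R = mu * (1.0 - mu)                      # Hessian weights
        H = P.T @ (P * R[:, None]) + lam * np.eye(k)
        g = P.T @ (mu - y) + lam * alpha         # gradient of penalized NLL
        alpha -= np.linalg.solve(H, g)
    return alpha
```

Each iteration solves the standard weighted least-squares system, so existing logistic-regression code applies unchanged once the fair span is fixed.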
IV-D Distributed Fair PCA
Samadi et al. [32] develop a fair PCA that minimizes the reconstruction error while equalizing this error across demographic groups. Let one sample matrix contain the instances of one group, another sample matrix contain the instances of the other group, and consider a projection matrix. Their objective (to minimize) is
where the loss measures the reconstruction error. The authors show that the optimal projection gives equal reconstruction errors across groups.
Olfat et al. [29] propose another fair PCA method that minimizes the prediction disparity in the projected space, i.e.,
where the disparity is measured with respect to a prediction model applied after the projection.
Note that both methods need access to the nonsensitive features and the demographic data.
We propose a distributed fair PCA (DFPCA) method based on Algorithm 1. Let the optimization variable be a projection vector. Our optimization problem is the same as in PCA, i.e.,
(12) 
where the matrix is the sample covariance matrix. Our additional assumption is that the projection vector is linearly expressed by the fair random hypotheses, i.e.,
(13) 
Solving problem (12) gives
(14)
which implies the solution is the leading (generalized) eigenvector.
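The generalized eigenproblem behind (14) can be illustrated numerically. In our notation, W stacks the fair hypotheses as rows and the projection vector is constrained to their span; symbols, names and the small regularizer are our assumptions:

```python
import numpy as np

def dfpca_leading(W, X):
    """Leading projection direction constrained to the span of the fair
    hypotheses (rows of W): maximize variance v^T C v with v = W^T b,
    i.e., solve (W C W^T) b = lambda (W W^T) b and return the unit v."""
    Xc = X - X.mean(axis=0)
    C = Xc.T @ Xc / len(X)                     # sample covariance matrix
    A = W @ C @ W.T
    B = W @ W.T + 1e-10 * np.eye(len(W))       # tiny ridge for stability
    evals, evecs = np.linalg.eig(np.linalg.solve(B, A))
    b = np.real(evecs[:, np.argmax(np.real(evals))])
    v = W.T @ b
    return v / np.linalg.norm(v)
```

When W spans the whole space, this reduces to ordinary PCA; the fairness screening restricts the span and hence the achievable projections.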
V Theoretical Analysis
Here we present the theoretical properties of Algorithm 1.
V-A Preliminaries
Consider a random instance. We say a hypothesis is fair with respect to the demographic variable if the covariance between its prediction and the variable is at most the threshold. Note that, by construction, all hypotheses returned in Algorithm 1 are fair in this sense.
We will show that this notion of fairness implies a popular fairness measure called statistical parity (SP) [27], defined as
(15) 
To establish the implication, we employ the following generalized covariance inequality [26, Theorem 2].
Lemma 1.
Let $X$ and $Y$ be two positively or negatively quadrant dependent random variables. Let $H$ be their joint CDF and $F$, $G$ their marginal CDFs, respectively. Let
(16)
be their Hoeffding covariance, where
(17)
If the covariance is bounded, then
(18) 
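For context, the Hoeffding covariance in (16) is the classical identity, rendered here in our notation with $H$, $F$, $G$ as above:

```latex
\operatorname{Cov}(X, Y)
= \int_{-\infty}^{\infty}\!\int_{-\infty}^{\infty}
  \big( H(x, y) - F(x)\,G(y) \big)\, dx\, dy .
```

Under positive (resp. negative) quadrant dependence, the integrand is everywhere nonnegative (resp. nonpositive), so a small absolute covariance forces the gap between the joint CDF and the product of marginals to be small, which is what the bound in (18) quantifies.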
In the following, we will first present theoretical properties on model fairness and then on model error. Note that all results are presented in the context of Algorithm 1.
V-B Theoretical Properties on Model Fairness
Our first result shows that fairness (low covariance) implies bounded statistical parity.
Lemma 2.
Suppose the prediction and the demographic variable are PQD or NQD with bounded covariance. If a hypothesis is fair with respect to the demographic variable, then its statistical parity is bounded in proportion to the threshold.
Our second result suggests that a hypothesis spanned by fair hypotheses remains fair; this is the insight that motivates the study. More specifically, we show that the model in (1) is fair because it is spanned by fair hypotheses.
Lemma 3.
In (1), the model is fair with respect to the demographic variable.
Combining the above results, we immediately have
Theorem 4.
In Algorithm 1, if the model prediction and the demographic variable are positively or negatively quadrant dependent, then the statistical parity of the output model is bounded.
This theorem implies one can obtain a fair model through several paths. First, one can choose a small threshold, which reduces the prediction disparity proportionally. Another way is to return fewer hypotheses, but this does not seem very efficient, as (i) it has a lower reduction rate and (ii) it is already implied by choosing a small threshold (which returns fewer hypotheses).
One may also choose small combination weights. In our proposed methods, this is done indirectly by regularizing the model; in experiments, we observe this is more effective than regularizing the weights directly.
Finally, we see a model may be fairer if the demographic distribution is more balanced, i.e., the upper bound on the disparity is minimized when the two groups are equally likely. However, such a distribution is typically formed by nature and cannot be easily modified.
Our next result gives more insight into the number of returned hypotheses, and suggests that it should not be too small.
Lemma 5.
Let a random hypothesis be drawn as in Algorithm 1. Then
(19)
where both expectations are over the randomness of the hypothesis, and the covariance is over the randomness of the instance. Further, if the hypothesis is linear and generated from a Gaussian, then
(20)
where the constants are determined by the data covariance matrix.
Lemma 5 implies that the probability of returning a hypothesis increases as the threshold increases, and the rate can be larger if the hypothesis is linear; as the threshold approaches infinity the probability approaches one, which means all hypotheses will be returned. The lemma also implies that a smaller generator variance implies a larger return probability.
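The monotonicity in the threshold can be checked with a small Monte-Carlo simulation; this is an illustrative sketch with names and defaults of our choosing:

```python
import numpy as np

def expected_return_fraction(X, s, tau, m=2000, seed=0):
    """Monte-Carlo estimate of the probability that a random linear
    hypothesis passes the fairness screen |cov| <= tau (Lemma 5 setting):
    draw m random hypotheses and count the fraction returned."""
    rng = np.random.default_rng(seed)
    W = rng.normal(size=(m, X.shape[1]))
    P = X @ W.T
    cov = np.abs((P - P.mean(axis=0)).T @ (s - s.mean()) / len(s))
    return float(np.mean(cov <= tau))
```

On any fixed sample, the estimated return fraction is nondecreasing in the threshold, in line with the lemma's implication.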
V-C Theoretical Properties on Model Generalization Error
To derive an error bound for the algorithm, our backbone technique is random projection theory [16]. It states that data distances are likely to be preserved in a randomly projected space, and thus a model's prediction error (which depends on such distances) is also likely to be preserved.
To apply the theory, we assume the hypotheses are linear and interpret the returned hypotheses as the basis of a randomly projected space, i.e., the returned hypotheses' predictions on an instance form its feature vector in the projected space.
We also assume that Step 4 applies a soft threshold policy. Let a reference hypothesis have zero covariance with the demographic variable. The soft policy returns any candidate hypothesis with a probability that decays in its distance from this reference, scaled by a constant. As such, each returned hypothesis in (1) is first drawn from a zero-mean Gaussian (Step 1) and then selected by a Gaussian centered at the reference (Step 4). Therefore, we can say each returned hypothesis in (1) is generated from a Gaussian centered at the fair reference hypothesis. Without loss of generality, we assume this Gaussian has unit variance. Our first result extends the data distortion bound in [4] from zero-mean Gaussians to nonzero-mean Gaussians.
Lemma 6.
Let a point be given and let a projection matrix have each projection vector drawn from a normal distribution with nonzero mean. Let the projected point be the image of the point under this matrix. Then, for any distortion level,
(21)
where the additional term depends on the mean of the distribution.
Compared to the original bound, our new bound has an additional term that depends on the mean. It is smaller when the mean is smaller; when the mean is zero, the term vanishes and we recover the original bound.
Based on Lemma 6, we derive the following error bound.
Theorem 7.
Suppose Algorithm 1 adopts the soft threshold policy, and let the expected and the empirical error of the learned model be given. If the model is linear, then, with probability at least the stated confidence level,
(22) 
where
(23) 
with the remaining constants defined as above.
An important parameter in the error bound is the number of returned hypotheses. To facilitate discussion, we can loosen the bound and obtain
Remark 8.
We see the error bound decreases exponentially as the number of returned hypotheses increases, suggesting one choose a large threshold to get accurate models. Note this is opposite to Theorem 4, which suggests choosing a small threshold to get fair models. We thus see that a tradeoff between accuracy and fairness is established (and controlled) via this parameter; in practice, we can adjust it by adjusting the threshold.
VI Experiment
In this section, we evaluate the proposed distributed and private fair learning methods on three real-world data sets and compare them with their existing nonprivate counterparts. To facilitate reproduction of the presented results, we publish our experimental data sets and random index sets at https://uwyomachinelearning.github.io/ and the code of our implemented methods at https://github.com/HuiHu1/DistributedPrivateFairLearning.
VI-A Data Preparation
We experimented on three public data sets commonly used for evaluating algorithmic fairness: the Community Crime data set (https://archive.ics.uci.edu/ml/datasets/communities+and+crime), the COMPAS data set (https://www.kaggle.com/danofer/compass) and the Credit Card data set (https://archive.ics.uci.edu/ml/datasets/default+of+credit+card+clients).
The Community set contains 1993 communities described by 101 features; the community crime rate is the label; we treated a community as minority if its fraction of African-American residents is greater than 0.5. The COMPAS set contains 18317 records described by 40 features; risk of recidivism is the label; we removed incomplete records and ended up with 16000 records and 15 features; similar to [8], we treated race as the sensitive feature. The Credit set contains 30000 users described by 23 features; default payment is the label; similar to [32], we selected education degree as the sensitive feature.
VI-B Experiment Design
On each data set, we randomly chose a subset of instances for training and used the rest for testing. We evaluated each method over 50 random trials and report its average performance.
We compared each proposed distributed private fair learner with its existing nondistributed, nonprivate counterpart, i.e.,
- Distributed Fair Ridge Regression (DFRR) vs. Fair Ridge Regression (FRR) [7]
- Distributed Fair Logistic Regression (DFGR) vs. Fair Logistic Regression (FGR) [21]
- Distributed Fair Kernel Regression (DFKRR) vs. Fair Kernel Regression (FKRR) [30]
- Distributed Fair PCA (DFPCA) vs. Fair PCA (FPCA) [32, 29]
We also compared with a popular fair learner, LFR [36]. For competing methods, we use the default hyperparameters (or grid search over the default candidate values) identified in previous studies; in experiments, we observe these configurations generally achieve the best performance. For our proposed DFGR, the learning rate was set to 0.001.
We used five evaluation metrics: statistical parity (SP) [27], normed disparate (ND) [27], classifier error, error parity (EP) and error disparate (ED). Let the classifier errors in the two demographic groups be given respectively. We define
(26)
and
(27)
Small SP, ND, EP and ED imply fair models; a small classifier error implies an accurate model.
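Sample versions of SP, the classifier error, EP and ED can be computed as below. This is our sketch of the definitions; in particular, the exact error-disparate formula in (27) is not recoverable from the text, so the ratio form used here is an assumption:

```python
import numpy as np

def fairness_metrics(y_true, y_pred, s):
    """Sample estimates: statistical parity gap (SP), overall classifier
    error, error parity (EP, gap of the two group errors) and a
    ratio-style error disparity (ED); the ED formula is our assumption."""
    y_true, y_pred, s = (np.asarray(a) for a in (y_true, y_pred, s))
    g0, g1 = (s == 0), (s == 1)
    sp = abs(y_pred[g0].mean() - y_pred[g1].mean())
    err = float(np.mean(y_pred != y_true))
    e0 = float(np.mean(y_pred[g0] != y_true[g0]))
    e1 = float(np.mean(y_pred[g1] != y_true[g1]))
    ep = abs(e0 - e1)
    ed = 0.0 if max(e0, e1) == 0 else 1.0 - min(e0, e1) / max(e0, e1)
    return {"SP": sp, "error": err, "EP": ep, "ED": ed}
```

All four quantities are zero for a perfectly fair and accurate classifier, matching the "smaller is better" reading of the tables.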
VI-C Results and Discussions
Our experimental results on the three data sets are presented in Tables I, II and III, respectively. For the proposed learners, we set the fairness threshold to 0.01, 0.1 and 0.25 on the three sets, respectively. Our discussion focuses on Table I.
First, we observe that the proposed distributed and private fair learners consistently outperform their nonprivate counterparts. Take ridge regression as an example: DFRR not only achieves a much lower SP than FRR (0.05 vs. 0.31) but also a lower classifier error (0.106 vs. 0.110) and error parity (0.17 vs. 0.23). Another example is PCA, where DFPCA achieves a lower SP than FPCA (0.03 vs. 0.08), a lower classifier error (0.14 vs. 0.15) and a lower error parity (0.15 vs. 0.19). Similar observations hold on the other two data sets. This suggests two things: (1) the proposed distributed fair learning framework is effective; (2) the proposed private fair learners can achieve a more efficient tradeoff between fairness and accuracy than the state-of-the-art nonprivate counterparts.
Our second observation is that the performance gap between private and nonprivate fair learners is larger for linear models (ridge regression and PCA) than for nonlinear models (logistic and kernel regression). This is partly consistent with the theoretical guarantees we prove for linear models. As to why our framework gives less improvement on nonlinear models, we do not yet have a principled hypothesis.
Finally, we see previous fair PCA methods do not improve fairness in classification tasks. Comparatively, our proposed distributed and private fair PCA significantly reduces SP and classifier error, making itself competitive for fair classification.
Table I: Results on the Community Crime data set.
Method  Statistical Parity  Normed Disparate  Classifier Error  Error Parity  Error Disparate
FRR [7]  .3062±.0452  .2457±.0128  .1102±.0128  .2260  .7321
DFRR  .0466±.0117  .1691±.1081  .1064±.0092  .1727  .6866
FKRR [30]  .0968±.0722  .1274±.0105  .1208±.0054  .1250  .2515
DFKRR  .0695±.0181  .1060±.0081  .1216±.0143  .1152  .2510
FGR [21]  .0898±.0971  .1154±.0308  .1166±.0189  .1424  .5723
DFGR  .0650±.0198  .1097±.0872  .1202±.0690  .1212  .5190
FPCA1 [32]  .0859±.0479  .3546±.0225  .1731±.0089  .1895  .5557
FPCA2 [29]  .0755±.0293  .3319±.0186  .1476±.0122  .1851  .6091
DFPCA  .0289±.0502  .2263±.0306  .1351±.0111  .1502  .6507
LFR [36]  .0738±.0377  .2240±.0194  .1264±.0068  .1319  .5431
Table II: Results on the COMPAS data set.
Method  Statistical Parity  Normed Disparate  Classifier Error  Error Parity  Error Disparate
FRR [7]  .0515±.0042  .2361±.0414  .2276±.0040  .0317  .1081
DFRR  .0078±.0041  .1758±.0987  .2302±.0045  .0139  .0543
FKRR [30]  .0041±.0013  .1194±.0237  .2190±.0089  .0027  .0122
DFKRR  .0034±.0015  .1147±.0688  .2152±.0093  .0017  .0078
FGR [21]  .0408±.0162  .2842±.0319  .2428±.0917  .0222  .0865
DFGR  .0374±.0645  .1852±.0973  .2617±.0509  .0104  .0385
FPCA1 [32]  .2806±.0182  .3028±.0232  .3204±.1032  .0429  .1190
FPCA2 [29]  .1719±.0317  .2901±.1027  .2390±.0278  .0394  .1472
DFPCA  .0081±.0046  .2019±.1011  .2279±.0046  .0167  .0690
LFR [36]  .0182±.0211  .2201±.0318  .2496±.0044  .0044  .0190
Table III: Results on the Credit Card data set.
Method  Statistical Parity  Normed Disparate  Classifier Error  Error Parity  Error Disparate
FRR [7]  .0994±.0016  .3109±.0186  .2340±.0058  .0523  .1882
DFRR  .0118±.0006  .2038±.0627  .2283±.0062  .0250  .1003
FKRR [30]  .0079±.0011  .1170±.0117  .2001±.0054  .0374  .1643
DFKRR  .0085±.0015  .0957±.0286  .1823±.0092  .0306  .1151
FGR [21]  .0779±.0571  .1283±.0987  .2412±.0469  .0253  .0951
DFGR  .0494±.0601  .1221±.0890  .2244±.0382  .0105  .0442
FPCA1 [32]  .1716±.0149  .1458±.0234  .4025±.0382  .0941  .2277
FPCA2 [29]  .0981±.0164  .1307±.0193  .3224±.0045  .0663  .1859
DFPCA  .0344±.0061  .1249±.0915  .2304±.0041  .0316  .1230
LFR [36]  .0288±.0132  .1552±.0133  .2835±.0051  .0374  .1423
VI-D Other Analysis
We first examined the performance of the proposed distributed and private fair logistic regression on the Community Crime data set. The performance versus the fairness threshold, averaged over 50 random trials with m = 5000, is shown in Figure 2. We see that as the threshold decreases, the classifier error increases and SP decreases. This means the model is fairer but less accurate, which is consistent with the implications of Theorems 4 and 7 (considering that a larger threshold implies more returned hypotheses, according to Lemma 5; the implication of this lemma is verified in Figure 3).
Finally, we examined the PQD/NQD assumption in Theorem 4. Figure 4 shows the covariance between the prediction and the demographic variable for DFRR over 20 random trials on two data sets. We see the covariance is positive in most cases, which is consistent with the assumption that the two variables are quadrant dependent.
VII Conclusion
In this paper, we propose a distributed fair machine learning framework that protects the privacy of demographic data, together with a principled strategy for designing private fair learners under this framework. We exemplify how to apply the strategy to redesign four nonprivate fair learners into private ones, and show that our redesigns consistently outperform their nonprivate counterparts across three real-world data sets. Finally, we theoretically analyze the framework and prove that its output models are both fair and accurate.
References
[1] (2000) Privacy-preserving data mining. Vol. 29, ACM.
[2] (2018) Amazon reportedly killed an AI recruitment system because it couldn't stop the tool from discriminating against women. Fortune.
[3] (2016) Machine bias: there's software used across the country to predict future criminals. And it's biased against blacks. ProPublica.
[4] (2006) An algorithmic theory of learning: robust concepts and random projection. Machine Learning.
[5] (2018) Envy-free classification.
[6] (2017) Fairness in machine learning. NIPS Tutorial.
[7] (2013) Controlling attribute effect in linear regression. In ICDM.
[8] (2017) Fair prediction with disparate impact: a study of bias in recidivism prediction instruments. Big Data 5(2), pp. 153–163.
[9] (2018) The measure and mismeasure of fairness: a critical review of fair machine learning. CoRR.
[10] (2018) Bias detectives: the researchers striving to make algorithms fair. Nature 558(7710), pp. 357–357.
[11] (2004) Nonparametric tests for positive quadrant dependence. Journal of Financial Econometrics.
[12] (2012) Fairness through awareness. In ACM Innovations in Theoretical Computer Science Conference.
[13] (2018) Microsoft improves biased facial recognition technology. Fortune.
[14] (2015) Certifying and removing disparate impact. In KDD.
[15] (2016) A confidence-based approach for balancing fairness and accuracy. In SDM.
[16] (2002) On generalization bounds, projection profile, and margin distribution. In ICML.
[17] (2002) Adaptive scaling for feature selection in SVMs. In NIPS.
[18] (2016) Equality of opportunity in supervised learning. In NIPS.
[19] (2018) Fairness without demographics in repeated loss minimization. In ICML.
[20] (2016) Preparing for the future of artificial intelligence. Executive Office of the President.
[21] (2012) Fairness-aware classifier with prejudice remover regularizer. In ECML-PKDD.
[22] (2018) Blind justice: fairness with encrypted sensitive attributes. In ICML.
[23] (2012) Face recognition performance: role of demographic information. IEEE Transactions on Information Forensics and Security.
[24] (2017) Counterfactual fairness. In NIPS.
[25] (2011) k-NN as an implementation of situation testing for discrimination discovery and prevention. In KDD.
[26] (2003) On some inequalities for positively and negatively dependent random variables with applications. Publicationes Mathematicae Debrecen 63(4), pp. 511–522.
[27] (2017) Provably fair representations. CoRR.
[28] (2011) Differentially private data release for data mining. In KDD.
[29] (2018) Convex formulations for fair principal component analysis. CoRR.
[30] (2017) Fair kernel learning. In ECML-PKDD.
[31] (2015) Mixed data kernel copulas. Empirical Economics.
[32] (2018) The price of fair PCA: one extra dimension. In NIPS.
[33] (2018) IBM helps eliminate bias in facial recognition training, but other faults may remain. PaymentsJournal.
[34] (2017) Fairer machine learning in the real world: mitigating discrimination without collecting sensitive data. Big Data & Society 4(2).
[35] (2017) Fairness constraints: mechanisms for fair classification. In AISTATS.
[36] (2013) Learning fair representations. In ICML.
[37] (2017) Anti-discrimination learning: a causal modeling-based framework. International Journal of Data Science and Analytics 4(1), pp. 1–16.
[38] (2016) Using sensitive personal data may be necessary for avoiding discrimination in data-driven decision models. Artificial Intelligence and Law 24(2), pp. 183–201.
VIII Appendix
VIII-A Proof of Lemma 3
We prove that a model is fair if it is spanned by a set of fair hypotheses. Indeed, by the linearity of covariance,
(28) 
where the last inequality is the Cauchy–Schwarz inequality. Further,
(29) 
where the inequality follows from the fact that the returned hypotheses are fair. Combining (28) and (29) proves the lemma.
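The chain in (28)–(29) can be written out explicitly. In our own notation, with the model $f=\sum_{j=1}^{\tilde m}\alpha_j h_j$ over $\tilde m$ returned hypotheses and fairness threshold $\tau$ (symbols ours):

```latex
\operatorname{Cov}\big(f(x), s\big)
= \sum_{j=1}^{\tilde m} \alpha_j \operatorname{Cov}\big(h_j(x), s\big)
\;\le\; \|\alpha\|_2 \Big( \sum_{j=1}^{\tilde m} \operatorname{Cov}\big(h_j(x), s\big)^2 \Big)^{1/2}
\;\le\; \|\alpha\|_2 \, \sqrt{\tilde m}\, \tau .
```

The first step is linearity of covariance, the second is Cauchy–Schwarz as in (28), and the last uses the fairness of each returned hypothesis as in (29).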
VIII-B Proof of Lemma 5
First, note that the expected number of returned hypotheses is
(30)
We will bound the probability on the right side. For convenience, we write the covariance between a hypothesis's prediction and the demographic variable in shorthand.
The first result follows directly from the Markov inequality.
To prove the second result, we use the Chebyshev inequality. It states that, over the randomness of the hypothesis,
(31)
We will refine (31). We first show that the expected covariance is zero. This is true because the hypothesis is linear, i.e.,
(32)
where the coefficient vector is constant. Then, taking expectations on both sides, we have
(33)
where the last equality holds because the hypothesis is drawn from a zero-mean normal distribution.
Next, we derive the variance of the covariance:
(34)
VIII-C Proof of Lemma 2
Suppose the prediction and the demographic variable are PQD or NQD random variables with bounded covariance. If the hypothesis is fair with respect to the demographic variable, then
(35)
where the probabilities of the two demographic groups appear as constants.
Now we refine the bound. Note that
(37)
Plugging in and rearranging terms, we have
(38)
Plugging this back into (36) and dividing both sides by the group probabilities, we have
(39)
The left side is the statistical parity. Thus the lemma is proved.
VIII-D Proof of Lemma 6
The proof sketch is similar to [4]. Since each element of the projection vector is Gaussian, each coordinate of the projection is also Gaussian, and thus by definition the squared norm of the projection follows a chi-squared distribution whose degrees of freedom equal the projection dimension. Define a scaled variable; it is also chi-squared distributed, with the following moment generating function
(40)
By the Markov inequality,
(41)
where the last equality is obtained by an appropriate choice of the free parameter.
By a similar argument (with another choice of the parameter), we have
(42)
Combining the above two results via a union bound, we have
(43)
The range of the distortion level follows from the range of the free parameter. The lemma is proved.
VIII-E Proof Sketch of Theorem 7
The original generalization error bound is developed using a distortion bound that assumes a zero-mean distribution of the projection vectors. Here, we instead use the new distortion bound in Lemma 6.
Recall that an instance has a projection in the random space. If the model is linear, consider its projection as well. By arguments similar to [16, Lemma 3.2],
(44)
where the coefficient involves the constant from Lemma 6.
Then, by arguments similar to [16, Lemma 3.4], the classification error caused by the random projection satisfies
(45)
with probability at least the stated level over the randomness of the projection.
with probability at least over the randomness of .