It has been reported that machine learning models can produce unfair predictions for minority groups when applied to assist consequential decision making, e.g., they are biased against Black defendants in recidivism prediction, against female applicants in job hiring, and against female employees in facial verification. Learning fair prediction models has become a pressing problem for governments, industry [13, 33] and academia [6, 10]. Many fair learning methods have been developed, including label processing [37, 25], feature processing [14, 36], model regularization [12, 35] and model post-processing [18, 15] – some have achieved promising performance with efficient trade-offs between model accuracy and model fairness.
We note that most fair machine learning methods require direct access to individuals' demographic data, e.g., they need an individual's race information to mitigate racial bias. However, use of such data is increasingly restricted to protect user privacy. In 2018, Europe launched the General Data Protection Regulation (GDPR) (https://eugdpr.org/the-regulation/), which prohibits "processing of personal data revealing racial or ethnic origin" and allows users to request "erasure of personal data" from the data controller. Besides, the privacy research community has long worked on hiding sensitive personal data from data analytics [1, 28].
We thus see fairness and privacy caught in a dilemma: most fair learners need access to demographic data, while use of these data is restricted for privacy protection. Debates are arising [38, 34]: Should the law permit the use of private demographic data for the sake of fair learning? Is it technically necessary to have direct access to such data? Very few scientific studies address these questions.
In this paper, we propose a distributed fair machine learning framework that does not require direct access to demographic data. We assume user data are distributed over a data center and a third party – the former holds the non-private data and is responsible for learning fair models; the latter holds the demographic data and can assist learning via private communications with the center that do not reveal user demographics.
Based on the framework, we present a principled strategy to design private fair learners: the center first constructs a random but fair hypothesis space via private communications with the third party; then, the center learns an accurate model in that space using standard methods. Our insight is that (i) model fairness is ensured by the fair hypothesis space and (ii) model accuracy is supported by random projection theory [4, 16].
Applying the strategy, we exemplify how to redesign four existing non-private fair learners into private ones, including fair ridge regression, fair logistic regression, fair kernel regression and fair PCA [32, 29]. We show the redesigned learners consistently outperform their counterparts in both fairness and accuracy across three real-world data sets.
Finally, we theoretically analyze the proposed fair machine learning framework. We prove upper bounds on both its model fairness and model accuracy, and show their trade-off can be balanced (and controlled) via a threshold hyper-parameter.
The rest of the paper is organized as follows: Section II introduces background and related work; Section III introduces notations; Section IV presents the proposed framework and exemplifies the design of four private fair learners; Section V presents theoretical analysis of the framework; Section VI shows experimental results and discussions; Section VII concludes the paper; the Appendix contains all proofs.
II Related Work
II-A Fairness Measure
Several fairness notions have been proposed in the literature, such as statistical disparity, equal odds, individual fairness, causal fairness and envy-free fairness. In this paper, we focus on statistical disparity, since it is the most common and perhaps the most debated.
In this paper, we propose to measure model fairness using the covariance between the prediction and the demographic variable, as we find it easy to use while yielding an efficient accuracy-fairness trade-off. Similar measures have been used in the literature, such as the mutual information, correlation or independence between these two variables, but none of those works provides theoretical analysis of the measure used. In this paper, we theoretically analyze the covariance measure; we prove that low covariance implies low statistical disparity.
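As a concrete illustration, the covariance measure is a one-liner to estimate from data. The function name and toy data below are hypothetical, not from the paper:

```python
import numpy as np

def fairness_covariance(predictions, demographic):
    """Empirical covariance between predictions and a demographic variable.

    A value near zero suggests the predictions carry little group information.
    """
    p = np.asarray(predictions, dtype=float)
    s = np.asarray(demographic, dtype=float)
    return float(np.mean((p - p.mean()) * (s - s.mean())))

# A predictor that copies the group label is maximally unfair ...
s = np.array([0, 0, 1, 1])
assert fairness_covariance(s, s) == 0.25
# ... while a constant predictor has zero covariance with the group.
assert fairness_covariance(np.ones(4), s) == 0.0
```

Note that zero covariance is a necessary but not sufficient condition for independence between prediction and group; the paper's analysis relates it to statistical disparity under quadrant-dependence assumptions.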
II-B Fair Learning with Restricted Access to Demographic Data
Several lines of study relate to restricted access to demographic data, but do not directly address the problem.
A traditional fair learning method is to simply remove the demographic feature from the model – a natural way to protect privacy. However, this approach does not guarantee fairness due to the redlining effect. Some studies do not use demographic data as a feature of the model, but use it in other ways during learning. For example, one method uses k-NN to detect unfair labels; it does not use demographic data to measure instance similarity, but still uses it to measure label disparity in neighborhoods.
Specific discussions on the restricted use of demographic data appear in [38, 34], but scientific investigations and solutions are lacking. Recently, Kilbertus et al. proposed to encrypt demographic data before learning. This is a promising solution, but encryption also comes with extra costs in time and protocol complexity. Our framework pursues another direction based on random projection; it is cheaper and easier to implement. Hashimoto et al. proposed a fair learning method that automatically infers group membership and minimizes disparity across it; this method is also promising as it does not require access to demographic data at all. However, it focuses on a less common fairness notion called distributive justice and on online learning. In contrast, we focus on the common disparity measure and the offline setting (although our framework is extendable to the online setting). Besides, we hypothesize that one can obtain fairer models with even limited access to demographic data than with no access at all.
Finally, studies on individual fairness do not require access to demographic data. For example, one can achieve fairness by learning a Lipschitz continuous prediction model . Here, we focus on achieving group fairness.
III Notations

In this section, we introduce the basic notations that will be used throughout the paper. More will be introduced later.
We describe a random individual by a triple consisting of a sensitive demographic feature, a vector of non-sensitive features and a label. For example, when studying gender bias in hiring, the sensitive feature is an applicant's gender, the non-sensitive feature vector contains attributes such as education and working hours, and the label indicates whether the applicant is hired or not. We index observed individuals by subscript to denote individuals in a (training) sample set.
The prediction model does not take the sensitive feature as input, but can use it for training.
IV A Distributed Fair Learning Framework
[Algorithm 1: inputs include the number of random hypotheses, the generator variance and the fairness threshold; the parties are the data center (DC) and the third party (TP). TP estimates the correlation between each hypothesis's predictions and the demographic data, and returns a hypothesis to DC if its correlation is below the threshold.]
In this section, we present the proposed fair learning framework and exemplify how to design private fair learners with it.
We assume the scenario in Figure 1: there is a data center and a third party, over which a training set is distributed. The center holds the non-sensitive features and labels and focuses on learning a fair model; the party holds the demographic data and can assist learning via private communications with the center that reveal no demographic information.
Our strategy for designing fair learners is shown in Algorithm 1. It has two phases: (i) steps 1 to 4 construct a random and fair hypothesis space; (ii) step 5 learns an accurate model in that space.
Specifically, the center first generates random hypotheses from Gaussian distributions (step 1), obtains their predictions on the training set (step 2) and sends these predictions to the third party (step 3). The party estimates the correlation between its demographic data and each hypothesis's predictions; if a correlation is small enough, the center is informed that the corresponding hypothesis is fair (step 4). Finally, the center learns an accurate model in the span of all fair random hypotheses – the model will be both fair and accurate. Note that, throughout the process, the demographic data are never revealed to the center, so their privacy is protected.
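The five steps can be sketched as a minimal two-party protocol. This is an illustration under assumed linear hypotheses and an absolute-covariance threshold; all function names, parameters and the toy data are hypothetical, not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def third_party_filter(predictions, demographics, tau):
    """TP side: return indices of hypotheses whose prediction/demographic
    covariance is below the fairness threshold tau. Only predictions are
    shared; the demographic vector never leaves the third party."""
    s = demographics - demographics.mean()
    covs = np.abs(predictions.T @ s) / len(s)   # |cov| per hypothesis
    return np.flatnonzero(covs <= tau)

def data_center_learn(X, y, third_party, m=200, lam=1e-2):
    """DC side: draw m random linear hypotheses, keep the fair ones,
    then fit ridge regression in their span."""
    d = X.shape[1]
    W = rng.normal(size=(d, m))            # step 1: random hypotheses
    P = X @ W                              # step 2: predictions on training set
    fair = third_party(P)                  # steps 3-4: TP returns fair indices
    Pf = P[:, fair]                        # predictions of fair hypotheses
    # step 5: ridge regression over the fair span
    alpha = np.linalg.solve(Pf.T @ Pf + lam * np.eye(Pf.shape[1]), Pf.T @ y)
    return W[:, fair] @ alpha              # final model weights

# toy run
X = rng.normal(size=(100, 5))
s = (rng.random(100) > 0.5).astype(float)
y = X @ rng.normal(size=5)
w = data_center_learn(X, y, lambda P: third_party_filter(P, s, tau=0.05))
assert w.shape == (5,)
```

The design choice to share only hypothesis predictions (never raw demographics) is what keeps the protocol private in this sketch.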
Next, we exemplify how to apply Algorithm 1 to redesign four existing non-private fair learners into private ones. These four learners are chosen because they are fundamental and cover different settings: linear vs. non-linear, regression vs. classification, and predictive learning vs. feature learning. More sophisticated learners may be designed in similar ways.
For ease of discussion, we write the training instances as a sample matrix with an associated label vector, and collect the returned fair hypotheses into a matrix.
IV-A Distributed Fair Ridge Regression (DFRR)
Calders et al. developed a fair ridge regression (FRR). It minimizes squared loss on the training sample, while additionally penalizing prediction disparity across demographic groups. Consider the index sets of two demographic groups (e.g., female and male). Their objective function is
where the penalty term is the prediction disparity. This term requires simultaneous access to the non-sensitive and demographic data; thus the method cannot be directly applied in our private learning framework.
We propose a distributed fair ridge regression (DFRR) based on Algorithm 1. Our objective function is
Minimizing the above objective for the coefficients gives a closed-form solution. The general argument for (4) is that we can first solve for the model predictions (by a standard method such as least squares), and then solve for the span coefficients. This argument will be used repeatedly in the sequel.
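The reduction to a standard problem can be illustrated numerically. Assuming (hypothetically) that the fair hypotheses' training predictions are collected in a matrix, the span coefficients solve an ordinary ridge problem:

```python
import numpy as np

rng = np.random.default_rng(1)

# P: predictions of the fair random hypotheses on the training set (n x k),
# y: labels. DFRR then reduces to ordinary ridge regression in alpha, the
# coefficients of the final model in the span of the fair hypotheses.
n, k, lam = 50, 8, 1e-3
P = rng.normal(size=(n, k))
y = rng.normal(size=n)

# closed-form ridge solution for the span coefficients
alpha = np.linalg.solve(P.T @ P + lam * np.eye(k), P.T @ y)

# sanity check: alpha satisfies the normal equations of the ridge objective
residual = P.T @ (P @ alpha - y) + lam * alpha
assert np.allclose(residual, 0.0, atol=1e-8)
```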
IV-B Distributed Fair Kernel Ridge Regression (DFKRR)
Perez-Suay et al. developed a fair kernel ridge regression (FKRR). It minimizes squared loss in an RKHS while additionally penalizing the correlation between prediction and demographic feature. Its objective function is
where the penalty is the correlation between the (centered) prediction and the (centered) demographic variable. This method also needs simultaneous access to both data sources.
We present a distributed fair kernel regression (DFKRR) method based on Algorithm 1. Our high-level objective is
Unlike the standard representer assumption, we first assume the model is expressed by the fair random hypotheses as in (1), and that each hypothesis is in turn linearly expressed by kernel expansions over the training instances with random coefficients. A similar argument has been used in prior work.
Based on (5), we can generate a random hypothesis (and its predicted labels) by randomly generating its associated coefficients. Note the random expansion coefficients are known, while the span coefficients are unknown. Minimizing the objective gives
where the Gram matrix collects kernel evaluations between training instances and the second matrix collects the random expansion coefficients.
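A rough sketch of the kernel variant follows. It assumes an RBF kernel and skips the third-party filtering step for brevity; all names and parameters are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(2)

def rbf_gram(X, gamma=0.5):
    """RBF Gram matrix K[i, j] = exp(-gamma * ||x_i - x_j||^2)."""
    sq = np.sum(X**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2 * X @ X.T
    return np.exp(-gamma * d2)

# Each random hypothesis is a kernel expansion with random coefficients,
# so its training predictions are K @ a for a random coefficient vector a.
n, m, lam = 40, 30, 1e-2
X = rng.normal(size=(n, 3))
y = np.sin(X[:, 0])
K = rbf_gram(X)
A = rng.normal(size=(n, m))        # random expansion coefficients
P = K @ A                          # predictions of the m random hypotheses

# (in the full protocol, the third party would filter the columns of P;
# here all columns are kept for brevity)
beta = np.linalg.solve(P.T @ P + lam * np.eye(m), P.T @ y)
train_pred = P @ beta
assert train_pred.shape == (n,)
```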
IV-C Distributed Fair Logistic Regression (DFGR)
Kamishima et al. developed a fair logistic regression (FGR). It maximizes the likelihood of the label while additionally penalizing the mutual information between model prediction and demographic feature. Its objective function is
where the penalty measures the mutual information and can be estimated from data. This method also requires simultaneous access to both data sources.
We propose a distributed fair logistic regression (DFGR) based on Algorithm 1. Our high-level objective function is
where the likelihood is constructed in the same way as in standard logistic regression, with the additional assumption that the model has the form (1).
Minimizing (8) by Newton's method, we iteratively update the span coefficients using the gradient and the diagonal weight matrix – both standard quantities in logistic regression.
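The Newton (IRLS) iteration over the fair span can be sketched as follows. A small ridge term is added for numerical stability (an assumption of this sketch, not part of the paper's derivation), and the data are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(3)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-np.clip(z, -30, 30)))

n, k, lam = 80, 6, 1e-2
P = rng.normal(size=(n, k))          # fair hypotheses' training predictions
w_true = rng.normal(size=k)
y = (P @ w_true > 0).astype(float)   # synthetic separable labels

beta = np.zeros(k)
for _ in range(25):
    p = sigmoid(P @ beta)
    grad = P.T @ (p - y) + lam * beta        # gradient of penalized likelihood
    W = np.diag(p * (1 - p))                 # IRLS diagonal weight matrix
    hess = P.T @ W @ P + lam * np.eye(k)     # Hessian with ridge stabilizer
    beta -= np.linalg.solve(hess, grad)      # Newton step

acc = np.mean((P @ beta > 0) == y)
assert acc > 0.85
```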
IV-D Distributed Fair PCA (DFPCA)
Samadi et al. developed a fair PCA that minimizes reconstruction error while equalizing this error across demographic groups. Consider the sample matrices of the instances in each of two groups and a projection matrix. Their objective (to minimize) is
where the loss measures reconstruction error. The authors show the optimal projection gives equal reconstruction errors across groups.
Olfat et al. proposed another fair PCA method that minimizes prediction disparity in the projected space, i.e., an objective involving a prediction model and the projection matrix.
Note that both methods need simultaneous access to the sample and demographic data.
We propose a distributed fair PCA (DFPCA) method based on Algorithm 1. Consider a projection vector. Our optimization problem is the same as in PCA, i.e.,
where the objective involves the covariance matrix. Our additional assumption is that the projection vector is linearly expressed by the fair random hypotheses, i.e.,
Solving problem (12) gives the coefficient vector as the leading (generalized) eigenvector.
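The constrained eigenproblem can be sketched numerically. Assuming (hypothetically) the fair hypotheses are linear with a weight matrix W, constraining the projection vector to their span turns PCA into a generalized eigenproblem:

```python
import numpy as np

rng = np.random.default_rng(4)

# DFPCA sketch: constrain the projection vector to the span of the (fair)
# random hypotheses W, i.e. v = W @ a, and maximize projected variance.
# This yields the generalized eigenproblem (W^T C W) a = lam (W^T W) a.
n, d, m = 200, 10, 6
X = rng.normal(size=(n, d)) @ np.diag(np.linspace(1, 3, d))
C = np.cov(X, rowvar=False)          # sample covariance matrix
W = rng.normal(size=(d, m))          # random (assumed fair) hypotheses

A = W.T @ C @ W
B = W.T @ W
# reduce the generalized problem to a standard one via B^{-1/2}
evalB, evecB = np.linalg.eigh(B)
B_inv_sqrt = evecB @ np.diag(evalB**-0.5) @ evecB.T
evals, evecs = np.linalg.eigh(B_inv_sqrt @ A @ B_inv_sqrt)
a = B_inv_sqrt @ evecs[:, -1]        # leading generalized eigenvector
v = W @ a                            # fair projection direction

assert np.isclose(np.linalg.norm(v), 1.0)
```

The unit norm of the recovered direction follows from the whitening by the inverse square root of the span's Gram matrix.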
V Theoretical Analysis
Here we present the theoretical properties of Algorithm 1.
V-A Preliminaries

Consider a random instance. We say a hypothesis is threshold-fair with respect to the demographic variable if the covariance between its prediction and that variable is bounded by the threshold. Note this means that, in Algorithm 1, all returned hypotheses are threshold-fair.
We will show that threshold-fairness implies a bound on a popular fairness measure called statistical parity (SP), defined as
To establish the implication, we will employ the following generalized covariance inequality [26, Theorem 2].
Consider two positively or negatively quadrant dependent random variables, with a joint CDF and respective marginal CDFs. Their Hoeffding covariance is defined via the difference between the joint CDF and the product of the marginals. If this quantity is bounded, then
In the following, we will first present theoretical properties on model fairness and then on model error. Note that all results are presented in the context of Algorithm 1.
V-B Theoretical Properties on Model Fairness
Our first result shows that threshold-fairness implies a bound on statistical parity.
Our second result suggests that a hypothesis spanned by fair hypotheses remains fair – this is the insight that motivates this study. More specifically, we show the model in (1) is fair (at a scaled threshold) because it is spanned by threshold-fair hypotheses.
In (1), the model is threshold-fair w.r.t. the demographic variable.
Combining the above results, we immediately have the following. In Algorithm 1, if the prediction and the demographic variable are positively or negatively quadrant dependent, then statistical parity is bounded.
This theorem implies one can obtain a fair model through several paths. First, we can choose a small threshold, which reduces prediction disparity at a certain rate. Another way is to choose a small number of hypotheses, but this does not seem very efficient as (i) it has a lower reduction rate and (ii) it can be implied by choosing a small threshold (which also returns fewer hypotheses).
One may also choose a small coefficient norm. In our proposed methods, this is done indirectly via regularization; in experiments, we observe this is more effective than direct regularization.
Finally, we see a model may be fairer if the demographic distribution is more balanced, i.e., the upper bound is minimized when the two groups each have probability 0.5. However, such a distribution is typically formed by nature and cannot be easily modified.
Our following result gives more insight into the number of returned hypotheses, and suggests it should not be too small.
Consider a random hypothesis. Then the expected covariance admits the stated bound, where both expectations are taken over the randomness of the hypothesis and the covariance is defined over the randomness of the instance. Further, if the hypothesis is linear and generated from a Gaussian, then a refined bound holds in terms of the entries of the instance covariance matrix.
Lemma 5 implies that the expected number of returned hypotheses increases as the threshold increases, and the rate can be larger if the hypotheses are linear; as the threshold approaches infinity, all hypotheses will be returned. The lemma also implies that a smaller generator variance yields a larger return probability.
V-C Theoretical Properties on Model Generalization Error
To derive an error bound for the algorithm, our backbone technique is random projection theory. It states that pairwise distances are likely to be preserved in a randomly projected space, and thus a model's prediction error (which depends on such distances) is also likely to be preserved.
To apply the theory, we assume the hypotheses are linear and interpret the returned hypotheses as the basis of a randomly projected space, i.e., their predictions form the features of an instance in the projected space.
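The distance-preservation property underlying this argument can be checked empirically. A minimal sketch with assumed dimensions (not the paper's setup):

```python
import numpy as np

rng = np.random.default_rng(5)

# Johnson-Lindenstrauss-style check: distances between points are roughly
# preserved under a random Gaussian projection (with 1/sqrt(k) scaling).
d, k = 500, 200
x, z = rng.normal(size=d), rng.normal(size=d)
R = rng.normal(size=(d, k)) / np.sqrt(k)   # random projection matrix

orig = np.linalg.norm(x - z)
proj = np.linalg.norm((x - z) @ R)
assert abs(proj / orig - 1.0) < 0.2        # distortion is small w.h.p.
```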
We also assume step 4 applies a soft threshold policy: a hypothesis is returned with a probability that decays in its covariance with the demographic variable, following a Gaussian acceptance rule with a constant scale. As such, each returned hypothesis in (1) is first drawn from a zero-mean Gaussian (step 1) and then selected by a shifted-mean Gaussian (step 4); we can therefore say each returned hypothesis is generated from a Gaussian with non-zero mean. Without loss of generality, we assume this Gaussian has unit variance.
Our first result extends the data distortion bound in the literature from zero-mean Gaussians to non-zero-mean Gaussians.
Consider any point and a projection matrix whose projection vectors are drawn from a (possibly non-zero-mean) normal distribution, and consider the projection of the point by this matrix. Then the following distortion bound holds.
Compared to the original bound, our new bound has an additional mean-dependent term. The term is smaller when the mean is smaller; if the mean is zero, the term vanishes and we recover the original bound.
Based on Lemma 6, we derive the following error bound.
Suppose Algorithm 1 adopts the soft threshold policy, and consider the expected and empirical errors of the model. If the model is linear, then with probability at least the stated level,
An important quantity in the error bound is the mean-dependent term. To facilitate discussion, we can loosen the bound as follows: under the stated condition in Theorem 7, there exist positive constants such that
We see the error bound decreases exponentially as the number of returned hypotheses increases, suggesting one choose a large number to obtain accurate models. Note this is opposite to Theorem 4, which suggests choosing a small number to obtain fair models. Thus a trade-off between accuracy and fairness is established (and controlled) via this parameter. In practice, we can adjust it by adjusting the threshold.
VI Experiments

In this section, we evaluate the proposed distributed and private fair learning methods on three real-world data sets, and compare them with their existing non-private counterparts. To facilitate reproduction of the presented results, we publish our experimental data sets and random index sets at https://uwyomachinelearning.github.io/ and the code of our implemented methods at https://github.com/HuiHu1/Distributed-Private-Fair-Learning.
VI-A Data Preparation
We experimented on three public data sets commonly used for evaluating algorithmic fairness: the Communities and Crime data set (https://archive.ics.uci.edu/ml/datasets/communities+and+crime), the COMPAS data set (https://www.kaggle.com/danofer/compass) and the Credit Card data set (https://archive.ics.uci.edu/ml/datasets/default+of+credit+card+clients).
The Community set contains 1993 communities described by 101 features; community crime rate is the label; we treated a community as minority if its fraction of African-American residents is greater than 0.5. The COMPAS set contains 18317 records described by 40 features; risk of recidivism is the label; we removed incomplete data and ended up with 16000 records and 15 features; following prior work, we treated race as the sensitive feature. The Credit set contains 30000 users described by 23 features; default payment is the label; following prior work, we selected education degree as the sensitive feature.
VI-B Experiment Design
On each data set, we randomly chose a subset of instances for training and used the rest for testing. We evaluated each method over 50 random trials and report its average performance.
We compared each proposed distributed private fair learner with its existing non-distributed, non-private counterpart, i.e.,

- Distributed Fair Ridge Regression (DFRR) vs. Fair Ridge Regression (FRR)
- Distributed Fair Logistic Regression (DFGR) vs. Fair Logistic Regression (FGR)
- Distributed Fair Kernel Regression (DFKRR) vs. Fair Kernel Regression (FKRR)
- Distributed Fair PCA (DFPCA) vs. Fair PCA (FPCA)
We also compared with the popular fair learner LFR. For competing methods, we used their default hyper-parameters (or grid-searched over the default candidate values) identified in previous studies; we observe these configurations generally achieve the best performance. For our proposed DFGR, the learning rate was set to 0.001.
We used five evaluation metrics: statistical parity (SP), normed disparate (ND), classifier error, error parity (EP) and error disparate (ED). Consider the classifier errors in the two demographic groups respectively; EP and ED are defined from these group errors.
Small SP, ND, EP and ED imply a fair model; small classifier error implies an accurate model.
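One plausible empirical instantiation of these metrics is sketched below. The paper's exact formulas may differ; here ND and ED are written so that smaller values indicate fairer models, matching the text, and the toy data are hypothetical:

```python
import numpy as np

def metrics(pred, y, s):
    """SP, ND, classifier error, EP and ED for binary predictions/groups."""
    rates, errs = {}, {}
    for g in (0, 1):
        m = s == g
        rates[g] = pred[m].mean()                 # positive-prediction rate
        errs[g] = np.mean(pred[m] != y[m])        # group error rate
    sp = abs(rates[0] - rates[1])                 # statistical parity gap
    hi, lo = max(rates.values()), min(rates.values())
    nd = 1.0 - lo / hi if hi > 0 else 0.0         # normed disparate (0 = fair)
    err = np.mean(pred != y)                      # overall classifier error
    ep = abs(errs[0] - errs[1])                   # error parity
    ehi, elo = max(errs.values()), min(errs.values())
    ed = 1.0 - elo / ehi if ehi > 0 else 0.0      # error disparate (0 = fair)
    return sp, nd, err, ep, ed

pred = np.array([1, 1, 0, 0, 1, 0])
y    = np.array([1, 0, 0, 1, 1, 0])
s    = np.array([0, 0, 0, 1, 1, 1])
sp, nd, err, ep, ed = metrics(pred, y, s)
assert np.isclose(sp, 1/3) and np.isclose(nd, 0.5) and ep == 0.0
```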
VI-C Results and Discussions
Our experimental results on the three data sets are presented in Tables I, II and III respectively. For the proposed learners, we set the threshold to 0.01, 0.1 and 0.25 on the three sets respectively. Our discussions focus on Table I.
First, we observe the proposed distributed and private fair learners consistently outperform their non-private counterparts. Take ridge regression as an example: DFRR not only achieves much lower SP than FRR (0.05 vs. 0.31), but also lower classifier error (0.106 vs. 0.110) and error parity (0.17 vs. 0.23). Another example is PCA, where DFPCA achieves lower SP than FPCA (0.03 vs. 0.08), lower classifier error (0.14 vs. 0.15) and lower error parity (0.15 vs. 0.19). Similar observations hold on the other two data sets. This implies two things: (1) the proposed distributed fair learning framework is effective; (2) the proposed private fair learners achieve a more efficient trade-off between fairness and accuracy than their state-of-the-art non-private counterparts.
Our second observation is that the performance gap between private and non-private fair learners is larger for linear models (ridge regression and PCA) than for non-linear models (logistic and kernel regression). This is partly consistent with the theoretical guarantees we proved for linear models. As to why our framework gives less improvement on non-linear models, we do not yet have a principled hypothesis.
Finally, we see that previous fair PCA methods do not improve fairness in classification tasks. In comparison, our proposed distributed and private fair PCA significantly reduces SP and classifier error, making it competitive for fair classification.
[Tables I, II and III: results on the Community Crime, COMPAS and Credit Card data sets respectively. Columns: Method, Statistical Parity, Normed Disparate, Classifier Error, Error Parity, Error Disparate.]
VI-D Other Analysis
We first examined the performance of the proposed distributed and private fair logistic regression on the Community Crime data set. The performance versus different thresholds, averaged over 50 random trials with m = 5000, is shown in Figure 2. We see that as the threshold decreases, the classifier error increases and SP decreases. This means the model is fairer but less accurate, which is consistent with the implications of Theorems 4 and 7 (noting that a larger threshold implies more returned hypotheses, per Lemma 5; the implication of this lemma is verified in Figure 3).
Finally, we examined the quadrant-dependence assumption in Theorem 4. Figure 4 shows the empirical covariance of DFRR over 20 random trials on two data sets. We see the covariance is positive in most cases, which is consistent with the assumption that the prediction and the demographic variable are quadrant dependent.
VII Conclusion

In this paper, we propose a distributed fair machine learning framework for protecting the privacy of demographic data. We present a principled strategy to design private fair learners under this framework, and exemplify how to apply it to redesign four non-private fair learners into private ones; the redesigned learners consistently outperform their non-private counterparts across three real-world data sets. Finally, we theoretically analyze the framework and prove its output models are both fair and accurate.
-  (2000) Privacy-preserving data mining. Vol. 29, ACM. Cited by: §I.
-  (2018) Amazon reportedly killed an ai recruitment system because it couldn’t stop the tool from discriminating against women. In Fortune, Cited by: §I.
-  (2016) Machine bias: there's software used across the country to predict future criminals. And it's biased against blacks. In ProPublica, Cited by: §I.
-  (2006) An algorithmic theory of learning: robust concepts and random projection. Machine Learning. Cited by: §I, §V-C, §VIII-D.
-  (2018) Envy-free classification. Cited by: §II-A.
-  (2017) Fairness in machine learning. NIPS Tutorial. Cited by: §I.
-  (2013) Controlling attribute effect in linear regression. In ICDM, Cited by: §I, §IV-A, 1st item, TABLE I, TABLE II, TABLE III.
-  (2017) Fair prediction with disparate impact: a study of bias in recidivism prediction instruments. Big data 5 (2), pp. 153–163. Cited by: §VI-A.
-  (2018) The measure and mismeasure of fairness: a critical review of fair machine learning. CoRR. Cited by: §II-B.
-  (2018) Bias detectives: the researchers striving to make algorithms fair. Nature 558 (7710), pp. 357–357. Cited by: §I.
-  (2004) Nonparametric tests for positive quadrant dependence. Journal of Financial Econometrics. Cited by: footnote 2.
-  (2012) Fairness through awareness. In ACM Innovations in Theoretical Computer Science Conference, Cited by: §I, §II-A, §II-B.
Microsoft improves biased facial recognition technology. Fortune. Cited by: §I.
-  (2015) Certifying and removing disparate impact. In KDD, Cited by: §I, §II-A.
-  (2016) A confidence-based approach for balancing fairness and accuracy. In SDM, Cited by: §I.
-  (2002) On generalization bounds, projection profile, and margin distribution. In ICML, Cited by: §I, §V-C, §VIII-E, §VIII-E, §VIII-E.
Adaptive scaling for feature selection in svms. In NIPS, Cited by: §IV-B.
Equality of opportunity in supervised learning. In NIPS, Cited by: §I, §II-A.
-  (2018) Fairness without demographics in repeated loss minimization. In ICML, Cited by: §II-B.
Preparing for the future of artificial intelligence. Executive Office of the President. Cited by: §I.
-  (2012) Fairness-aware classifier with prejudice remover regularizer. In ECMLPKDD, Cited by: §I, §II-A, §IV-C, 2nd item, TABLE I, TABLE II, TABLE III.
-  (2018) Blind justice: fairness with encrypted sensitive attributes. In ICML, Cited by: §II-B.
-  (2012) Face recognition performance: role of demographic information. IEEE Transactions on Information Forensics and Security. Cited by: §I.
-  (2017) Counterfactual fairness. In NIPS, Cited by: §II-A.
-  (2011) K-nn as an implementation of situation testing for discrimination discovery and prevention. In KDD, Cited by: §I, §II-B.
On some inequalities for positively and negatively dependent random variables with applications. PUBLICATIONES MATHEMATICAE-DEBRECEN 63 (4), pp. 511–522. Cited by: §V-A.
-  (2017) Provably fair representations. CoRR. Cited by: §V-A, §VI-B.
-  (2011) Differentially private data release for data mining. In KDD, Cited by: §I.
Convex formulations for fair principal component analysis. CoRR. Cited by: §I, §IV-D, 4th item, TABLE I, TABLE II, TABLE III.
-  (2017) Fair kernel learning. In ECMLPKDD, Cited by: §I, §II-A, §IV-B, 3rd item, TABLE I, TABLE II, TABLE III.
-  (2015) Mixed data kernel copulas. Empirical Economics. Cited by: footnote 2.
-  (2018) The price of fair pca: one extra dimension. In NIPS, Cited by: §I, §IV-D, 4th item, §VI-A, TABLE I, TABLE II, TABLE III.
-  (2018) IBM helps eliminate bias in facial recognition training, but other faults may remain. PaymentsJournal. Cited by: §I.
-  (2017) Fairer machine learning in the real world: mitigating discrimination without collecting sensitive data. Big Data & Society 4 (2), pp. 2053951717743530. Cited by: §I, §II-B.
-  (2017) Fairness constraints: mechanisms for fair classification. In AISTATS, Cited by: §I.
-  (2013) Learning fair representations. In ICML, Cited by: §I, §II-A, §VI-B, TABLE I, TABLE II, TABLE III.
Anti-discrimination learning: a causal modeling-based framework. International Journal of Data Science and Analytics 4(1), pp. 1–16. Cited by: §I.
-  (2016) Using sensitive personal data may be necessary for avoiding discrimination in data-driven decision models. Artificial Intelligence and Law 24 (2), pp. 183–201. Cited by: §I, §II-B.
VIII Appendix

VIII-A Proof of Lemma 3
We prove that the model is fair if it is spanned by a set of fair hypotheses. Indeed, by the linearity of covariance, the covariance of the model's prediction decomposes over the hypotheses, where the last inequality follows from the Cauchy–Schwarz inequality. Further,
VIII-B Proof of Lemma 5
First note that the expected number of returned hypotheses is proportional to the probability on the right side, which we will now bound.
The first result is a direct consequence of Markov's inequality.
To prove the second result, we use Chebyshev's inequality. It states that, over the randomness of the hypothesis,
We now refine (31). We first compute the expectation, using the fact that the hypothesis is linear, so its covariance with the demographic variable is a linear function of the hypothesis's weights with a constant coefficient vector. Taking expectations on both sides, the expectation vanishes because the weights are drawn from a zero-mean normal distribution.
Next, we derive the variance term.
VIII-C Proof of Lemma 2
Suppose the prediction and the demographic variable are PQD or NQD random variables with bounded covariance. If the hypothesis is threshold-fair w.r.t. the demographic variable, then
where and .
Now we refine the bound. Note that
Plugging in and rearranging terms, we have
Plugging this back into (36) and dividing both sides by the normalizing constant, we have
The left side is the statistical parity. Thus the lemma is proved.
VIII-D Proof of Lemma 6
The proof sketch is similar to prior work. Since each element of the projection is Gaussian, the projected coordinate is also Gaussian, and thus by definition its squared norm follows a chi-squared distribution with the corresponding degrees of freedom. Define a scaled variable; it is also chi-squared, with the following moment generating function
By Markov's inequality, an exponential tail bound follows, where the last equality is obtained by optimizing the free parameter. By a similar argument (optimizing in the other direction), we obtain the matching lower-tail bound.
Combining the above two results via a union bound, we obtain the stated inequality, where the range of the parameter follows from the argument above. The lemma is proved.
VIII-E Proof Sketch of Theorem 7
The original generalization error bound is developed using a distortion bound that assumes a zero-mean distribution of projection vectors. Here, we use the new distortion bound in Lemma 6.
Recall an instance and its projection in a random space; if the model is linear, consider its projection as well. Then, by arguments similar to [16, Lemma 3.2], a margin-preservation bound holds with a modified coefficient. Further, by arguments similar to [16, Lemma 3.4], the classification error caused by random projection satisfies the stated bound with high probability over the randomness of the projection.