1 Introduction
Conventional machine learning models, though able to reach high accuracy on average, often incur high loss on minority groups at the same time. Fairness issues have been studied in various tasks, such as medical treatment and criminal justice
(Kearns et al., 2019; Nabi et al., 2018; Mary et al., 2019), credit loans (Liu et al., 2018), machine translation (Hashimoto et al., 2018), recommender systems (Beutel et al., 2019; Singh and Joachims, 2019), etc. The reasons for improving a model's fairness are not limited to morality or legality; they also involve economic considerations. Hashimoto et al. (2018) prove that sequential ERM amplifies demographic disparity: the minority group, which suffers higher loss, becomes more likely to quit using the service, thereby providing no further feedback data. This negative feedback loop shrinks the user base, hence reducing the income or welfare provided by the machine learning system over time.

Specifically for recommender systems, a line of work has shown that real-world logging policies often collect missing-not-at-random (MNAR) (Yuan et al., 2019) data (or selective labels). For example, users tend to reveal ratings only for items they like, so the observed user feedback (usually described by the click-through rate, CTR) can be substantially better than the feedback not yet observed. In Fig. 1, the average CTR estimated from the observed outcomes is ; however, if all unobserved outcomes are , the true CTR is . This gap between the factual and counterfactual ratings in the MNAR situation has been empirically validated by (Steck, 2013). In this work, we show that MNAR also undermines the fairness of recommendation over time, by degrading the minority group's utility.
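The specific numbers in Fig. 1 were lost in extraction, but the effect itself is easy to reproduce. The following sketch (all quantities hypothetical, not from the paper) simulates an MNAR log in which liked items are revealed far more often, so the CTR estimated from observed outcomes overshoots the true CTR:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: 1000 user-item events with true binary click outcomes.
y_true = rng.binomial(1, 0.3, size=1000)       # true CTR is about 0.3

# MNAR observation: positive outcomes are far more likely to be revealed.
p_obs = np.where(y_true == 1, 0.8, 0.1)
observed = rng.binomial(1, p_obs) == 1

ctr_observed = y_true[observed].mean()         # naive estimate from logged data
ctr_true = y_true.mean()                       # would require all outcomes

# The naive estimate is biased upward because likes are over-represented
# among the observed events.
```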
To obtain an unbiased estimate of the true CTR without revealing all feedback, running randomized controlled trials (RCTs)
(Haynes et al., 2012) has so far been the first choice, since RCTs satisfy the missing-at-random (MAR) (Seaman et al., 2013) condition:

Definition 1.
The missing data are missing at random (MAR) if the conditional probability of the observation pattern, represented by $O$ here, depends only on the observed outcomes:

$P(O \mid Y_{\text{obs}}, Y_{\text{mis}}) = P(O \mid Y_{\text{obs}})$.  (1)
However, the cost of RCTs is unacceptable in many scenarios: e.g., we cannot randomly assign drugs to volunteers, or randomly display ads to users. The vast majority of existing works (Shalit et al., 2017; Rosenfeld et al., 2017; Bonner and Vasile, 2018) leverage small RCTs together with large biased logged data to correct the MNAR bias, through multi-task learning or domain-adaptation techniques. They claim that small RCTs, though not capable of revealing each individual's preference (Rosenfeld et al., 2017), are sufficient for a low-variance estimate of the average over all users, from which a model can be obtained that generalizes well to counterfactual events. However, the notion of "small" is not yet well defined, so there is no guideline for choosing the number of RCTs to trade gain against cost. Besides, although RCTs are intuitively regarded as fair to each user, we argue that learning from the RCTs' average yields an unfair policy on heterogeneous demographics, i.e. when users vary over populations defined by their preferences.

In this work, we aim to mitigate users' utility disparity, as well as to improve the global utility of all users in recommendation. Our object of interest is an RCTs-free method that achieves both goals. We begin by introducing the Gini index (Gini, 1997) as a disparity measure in Section 4. In Section 5, we consider two heterogeneous situations to analyze the fairness of RCTs, accounting for the least favorable group's risk. We then propose our RCTs-free Counterfactual Robust Risk Minimization (CRRM) framework in Section 6, and test its performance in Section 7. The proofs of all propositions and theorems can be found in the appendices.
Contributions. Our main contributions are threefold:
1. We demonstrate the unfairness of conventional ERM-based recommendation under MNAR data, and bridge the gap between protecting the least favorable group and ensuring fairness with respect to the Gini index;
2. We are the first to theoretically analyze the fairness of RCTs-based methods, by considering the least favorable group's risk over multiple demographic conditions;
3. We develop an RCTs-free CRRM framework and derive its motivating generalization error bound. Our method is shown to improve both fairness and accuracy, on both synthetic tasks and real-world data.
2 Related work
Fair machine learning. Fairness in machine learning is an emerging topic in the recent literature. The most direct principle of algorithmic fairness is to protect sensitive tags, features or labels, such as race, gender, age, etc. This formulation of fairness has been investigated in (Hardt et al., 2016; Mary et al., 2019), where the conditional independence between the decision made by the algorithm and the variables to protect is estimated or guaranteed. In other respects, Rawlsian fairness (Rawls, 2009) has been introduced to substantiate the equal-opportunity principle in a bandit learning scheme (Joseph et al., 2016; Jabbari et al., 2017), where an action is preferred only if its true quality is better. However, as far as we know, few works have addressed the fairness issue in the MNAR scenario.
For scenarios where the demographics are unknown, previous works have concentrated on protecting subgroups with latent demographic structure (Kearns et al., 2017), or on learning a distributionally robust model with uniform performance across unknown demographics (Duchi and Namkoong, 2018; Hashimoto et al., 2018). Although these approaches are proven to protect the minority group, no evident improvement in global utility for all users is witnessed in practice.
Correcting MNAR bias. An importance sampling approach called the inverse propensity score (IPS) was initially proposed to correct sampling bias (Rosenbaum and Rubin, 1983), followed by a line of work including doubly robust (DR) (Dudík et al., 2011; Jiang and Li, 2015), weighted doubly robust (Thomas and Brunskill, 2016), joint-learning doubly robust (Wang et al., 2019) and self-normalized IPS (Swaminathan and Joachims, 2015; Schnabel et al., 2016) estimators, for recommendation on MNAR data utilizing additional small RCTs. Besides, methods adapted from domain adaptation and representation learning techniques (Johansson et al., 2016; Bonner and Vasile, 2018) are used for learning from both the MAR and MNAR event sets. However, these works hardly mention their methods' fairness or effectiveness over heterogeneous demographics.
3 Problem Setup
Let user $u \in \mathcal{U}$ and item $i \in \mathcal{I}$; an event is then a pair $(u, i)$. The outcome of an event is $Y_{u,i}$, where

$Y_{u,i} = 1$ if user $u$ clicks item $i$, and $Y_{u,i} = 0$ otherwise.  (2)
There are $|\mathcal{U}| \cdot |\mathcal{I}|$ possible events in total, considering the different combinations of users and items, where $|\cdot|$ denotes the cardinality of a set; we set and . The true risk of a predicted outcome matrix $\hat{Y}$ can be written as

$R(\hat{Y}) = \frac{1}{|\mathcal{U}|\,|\mathcal{I}|} \sum_{u \in \mathcal{U}} \sum_{i \in \mathcal{I}} \delta_{u,i}(Y, \hat{Y})$  (3)
where $Y$ is assumed drawn from a true outcome matrix, and the loss $\delta_{u,i}(\cdot,\cdot)$ is usually a surrogate such as mean squared error or logistic loss; for simplicity we denote it by $\delta_{u,i}$.
When some outcomes are not observed, calculating Eq.(3) becomes infeasible. Previous works therefore attempt to minimize only the factual empirical risk, namely ERM:

$\hat{R}_{\text{naive}}(\hat{Y}) = \frac{1}{|\{(u,i): O_{u,i}=1\}|} \sum_{(u,i): O_{u,i}=1} \delta_{u,i}(Y, \hat{Y})$  (4)
in which the Bernoulli random variable $O_{u,i}$ indicates whether the pair $(u,i)$ is observed. The observation of an outcome is hence constrained by the missing pattern $O$, as shown in Fig. 2. As a result, $\hat{R}_{\text{naive}}$ is a biased estimate of the true risk:

$\mathbb{E}_O[\hat{R}_{\text{naive}}(\hat{Y})] \neq R(\hat{Y})$.  (5)
3.1 Correcting sampling bias by propensity scores
To deal with the bias shown in Eq.(5), an unbiased inverse propensity score (IPS) estimator (Rosenbaum and Rubin, 1983) can be built as

$\hat{R}_{\text{IPS}}(\hat{Y} \mid P) = \frac{1}{|\mathcal{U}|\,|\mathcal{I}|} \sum_{u \in \mathcal{U}} \sum_{i \in \mathcal{I}} \frac{O_{u,i}\, \delta_{u,i}(Y, \hat{Y})}{P_{u,i}}$  (6)

where $P_{u,i} = P(O_{u,i} = 1)$ is the propensity score; hence it is easy to derive that $\mathbb{E}_O[\hat{R}_{\text{IPS}}(\hat{Y} \mid P)] = R(\hat{Y})$.
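As a sanity check on the IPS correction, the following sketch (synthetic data with known, outcome-dependent propensities; all names are ours, not the paper's) compares the naive ERM estimate with the IPS estimate against the true risk:

```python
import numpy as np

rng = np.random.default_rng(1)
n_users, n_items = 200, 50

# Hypothetical true binary outcomes and a fixed prediction matrix.
y = rng.binomial(1, 0.3, size=(n_users, n_items)).astype(float)
y_hat = np.full_like(y, 0.3)

loss = (y - y_hat) ** 2                 # squared-error surrogate loss
true_risk = loss.mean()                 # Eq.(3): average over all events

# Known propensities for an MNAR pattern: likes are observed more often.
p = np.where(y == 1, 0.8, 0.2)
o = rng.binomial(1, p).astype(float)    # observation indicators O_{u,i}

naive_risk = loss[o == 1].mean()        # biased ERM-style estimate
ips_risk = (o * loss / p).mean()        # IPS estimate, unbiased in expectation
```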
The propensity score $P_{u,i}$ here can be approximated via a simplified naive Bayes (NB) estimator (Marlin and Zemel, 2009):

$P(O_{u,i} = 1 \mid Y_{u,i} = r) = \frac{P(Y = r \mid O = 1)\, P(O = 1)}{P(Y = r)}$  (7)
assuming that these probabilities do not depend on the specific user $u$ and item $i$. Though $P(Y = r \mid O = 1)$ and $P(O = 1)$ can be counted from the MNAR data, $P(Y = r)$ has to be estimated via MAR data, which is often achieved by taking the average of the RCTs' outcomes

$\bar{y} = \frac{1}{n} \sum_{j=1}^{n} y_j$,  (8)

where $y_1, \dots, y_n$ are the $n$ outcomes collected by the randomized trials.
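A minimal sketch of this naive Bayes estimation pipeline, with hypothetical data and variable names of our own (the true observation probabilities are 0.8 for likes and 0.1 otherwise, so the recovered propensities should sit near those values):

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical MNAR log: binary ratings, positives observed far more often.
y = rng.binomial(1, 0.3, size=20000)
o = rng.binomial(1, np.where(y == 1, 0.8, 0.1)).astype(bool)
y_mnar = y[o]                            # outcomes recorded by the logging policy

# Small MAR sample (e.g. from an RCT), used only to estimate P(Y = r).
y_mar = rng.binomial(1, 0.3, size=500)

p_o = o.mean()                           # P(O = 1): countable from MNAR data
propensity = {}
for r in (0, 1):
    p_y_given_o = (y_mnar == r).mean()   # P(Y = r | O = 1): from MNAR data
    p_y = (y_mar == r).mean()            # P(Y = r): needs the MAR sample
    propensity[r] = p_y_given_o * p_o / p_y   # naive Bayes, Eq.(7)
```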
Using an inaccurate propensity $\hat{P}_{u,i}$, the bias of the IPS estimator is given by (Schnabel et al., 2016) as

$\text{bias}\big(\hat{R}_{\text{IPS}}(\hat{Y} \mid \hat{P})\big) = \frac{1}{|\mathcal{U}|\,|\mathcal{I}|} \Big| \sum_{u,i} \delta_{u,i}(Y, \hat{Y}) \Big(1 - \frac{P_{u,i}}{\hat{P}_{u,i}}\Big) \Big|$.  (9)
For notational simplicity, we omit the argument of from now on. In this work, we delve deeper into this bias, accounting for the bias across different populations. We assume that each individual belongs to a group , and that the group's mean preference is represented by
(10) 
From this viewpoint, the is an efficient empirical estimate of the global preference over all groups. However, we ask: when the demographics are heterogeneous, i.e. the group means differ, is the estimate from RCTs still fair and economical? To ensure fairness, we care about the loss suffered by the least favorable group, rather than only the average over all users. Besides, measuring the utility disparity is also of interest.
Table 1: Notation.

,  true and predicted outcome matrix
random variables: user, item and event
Bernoulli random variable: event missing
realization of
,  true and predicted outcome of
risk of predicted outcome of
true risk of
empirical risk of
,  uniform and non-uniform policy
,  MAR and MNAR event sets
,  factual and counterfactual event sets
4 Measuring Utility Disparity by Gini Index
The conventional approach to updating a policy for a recommender system is to repeatedly learn from the MNAR logged data by minimizing the empirical risk. In this section, we show that the policy obtained by this process fails to offer fair service to users, taking each user's click-through rate (CTR) as their utility. We then propose the Gini index for measuring this disparity, and reveal the connection between improving Gini fairness and protecting the least favorable group's utility.
4.1 Motivating example
We build a tiny system with 5 users and 5 items for simulation, and compare the user-wise and item-wise numbers of clicks under and , shown in Fig. 3. Unfair treatment of users by can be witnessed in Fig. 2(a). Compared with , though the total number of clicks achieved by is slightly higher (2916 vs. 2484), the click rates of users #1 and #2 are much lower than their counterparts under . That is, achieves good performance on average at the expense of a deteriorating experience for the minority. Besides, Fig. 2(b) illustrates a devastating reduction in diversity under , which is also unacceptable in recommendation. Details of the simulation's implementation can be found in Appendix F.1.
4.2 Gini fairness and the minority’s utility
The example above demonstrates the necessity of measuring users' utility disparity. To this end, we introduce the Gini index, drawn from economics, to measure fairness in recommendation. The Gini index , with 0 indicating perfect equality and 1 perfect inequality, measures the distribution of income across percentiles of a population.
To compute , the Lorenz curve (Mankiw, 2006) plots the population percentile, ordered by utility, on the x-axis and the cumulative utility on the y-axis. Denote the cumulative utility up to the -th group as . From Fig. 4, we can compute the area under the Lorenz curve by summing over all groups

(11)

therefore the Gini index can be obtained as

$G = 1 - 2B$, where $B$ is the area under the Lorenz curve.  (12)
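A small self-contained implementation of this computation, using the trapezoid-rule area under the discrete Lorenz curve (our own helper, not the paper's code):

```python
import numpy as np

def gini_index(utilities):
    """Gini index from the Lorenz curve: 0 means perfect equality,
    values approaching 1 mean one group holds nearly all utility."""
    u = np.sort(np.asarray(utilities, dtype=float))
    n = u.size
    # Lorenz curve: cumulative share of total utility, starting at 0.
    lorenz = np.concatenate([[0.0], np.cumsum(u) / u.sum()])
    # Area B under the Lorenz curve via the trapezoid rule on n equal steps.
    area = ((lorenz[:-1] + lorenz[1:]) / 2.0).sum() / n
    return 1.0 - 2.0 * area

g_equal = gini_index([1.0, 1.0, 1.0, 1.0])   # perfectly equal utilities
g_skewed = gini_index([0.0, 0.0, 0.0, 1.0])  # one group takes everything
```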
We give the following Proposition 1 which identifies the connection between controlling Gini index and protecting the least favorable group.
Proposition 1 (Least favorable group’s utility and Gini index).
Assume that , controlling the worst case to protect the minority group can improve fairness, by virtue of controlling the supremum of the as
(13) 
if and only if the equality holds.
Based on the above analysis, the least favorable group's average utility can serve as a surrogate metric for fairness, since the 's upper bound is restricted by . By the Lorenz curves of the motivating example in Fig. 5, has a much smaller Gini index than , and the minimum utility earned under is larger than under as well.
Another concern is the global utility , namely the utility enjoyed by all users, which is also important for evaluating a policy's effectiveness:
(14) 
because it determines the total income of the system. Two principles then follow for developing a recommender system:
We will use the above two metrics for comparison in the experiment section.
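The two metrics can be read as the worst-case and average user utility; a trivial sketch with made-up click rates (the names mU and gU follow the tables later, everything else is ours):

```python
import numpy as np

def utility_metrics(user_ctr):
    """Two evaluation principles as scalar metrics: protect the least
    favorable user (minimum utility, mU) and keep the global utility
    high (average utility, gU)."""
    u = np.asarray(user_ctr, dtype=float)
    return u.min(), u.mean()

# Hypothetical per-user click-through rates.
m_u, g_u = utility_metrics([0.1, 0.4, 0.5, 0.6])
```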
5 Fairness of RCTs Considering Heterogeneous Users Demographics
Recall that our goal is to control the utility disparity across users, which can be realized by controlling the least favorable group's utility , as mentioned above in Eq.(13). We proceed in two steps. First, we show how the empirical mean influences the minority's risk through the IPS estimator. Then, we analyze this risk under several demographic assumptions, i.e. homogeneous and heterogeneous. As a result, we uncover the limitations, in terms of fairness, of randomized trials, specifically when the from RCTs is used as trustworthy knowledge for debiasing MNAR data in the literature.
5.1 Risk of the least favorable group
Let ; then the least favorable group is , and the event set related to it is . The bias of the IPS estimator on , similar to Eq.(9), can be derived as
(15) 
Proposition 2 (Bias of the least favorable group’s risk by IPS).
The bias of IPS risk suffered by the least favorable group , has a lower bound as
(16) 
Roughly speaking, the above inequality demonstrates that the properties of matter. In particular, even under a uniform policy, the bias for each group depends on the preference demographics, as shown next.
5.2 Homogeneity: same mean and same variance
Intuitively, when randomly assigning items to users under the MAR condition, the is a fair and efficient estimate for debiasing. However, we argue that this holds only under homogeneous and weakly heterogeneous demographics.
Assumption 1 (Homogeneous demographics).
Users' preferences have the same mean and the same variance, namely the .
With the above assumption, the law of large numbers readily gives the consistency of , i.e. . Recalling the inequality in Eq.(16), the right-hand term approaches zero in this situation because , which holds for the other groups as well. That is why, in our motivating example, the uniform achieves good fairness. However, such homogeneity hardly ever occurs in practice. A real-world recommender system often serves hundreds of millions of users, so it is reasonable to assume that the user demographics are highly heterogeneous. We discuss two cases next: weak and strong heterogeneity.

5.3 Weak heterogeneity: same mean but different variance
We begin our analysis with a simple case, without loss of generality, in which users come from two distributions: and , with probability from the first and from the second. Formally, this is a common case in robust statistics (Maronna et al., 2019) called a contaminated distribution:

(17)

For simplicity, we assume both are Gaussian distributions (we will extend this to the more general sub-Gaussian family later). Based on Eq.(8) and Eq.(17), it is easy to derive that

(18)
We first consider the setting where the groups' preferences have the same mean but different variances (SMDV), namely , and . We argue that only if the variance is controllable do the RCTs under SMDV demographics lead to an asymptotically normal and fair :
Theorem 1 (Asymptotic Normality under SMDV).
Let , under SMDV assumption, if
(19) 
then the empirical mean satisfies asymptotic normality as
(20) 
As in the homogeneous case, with weak heterogeneity and a mild condition on the variance, the also leads to a fair result.
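A quick simulation of the contaminated-distribution setting (illustrative group parameters of our own, not the paper's) shows both regimes: under SMDV the RCT mean concentrates on the shared mean, while under DMDV (treated in the next subsection) it concentrates on the majority-weighted average and says little about the minority group:

```python
import numpy as np

rng = np.random.default_rng(3)

def rct_mean(means, stds, probs, n):
    """Empirical mean of one simulated RCT of size n whose outcomes come
    from a contaminated (mixture) distribution of Gaussian groups."""
    g = rng.choice(len(probs), size=n, p=probs)
    return rng.normal(np.array(means)[g], np.array(stds)[g]).mean()

n, trials = 100, 2000

# SMDV: same mean (0.3), different variances -> estimates stay centered.
smdv = [rct_mean([0.3, 0.3], [0.05, 0.30], [0.9, 0.1], n) for _ in range(trials)]

# DMDV: different means -> the RCT average sits near the majority-weighted
# mean (0.9 * 0.5 + 0.1 * 0.1 = 0.46), far from the minority's mean (0.1).
dmdv = [rct_mean([0.5, 0.1], [0.05, 0.05], [0.9, 0.1], n) for _ in range(trials)]
```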
5.4 Strong heterogeneity: with different mean and different variance
Although the above analysis proves that RCTs do lead to a fair estimator in theory, the fundamental challenge is that, whether for homogeneity or weak heterogeneity, the same-mean assumption seems too strong. This urges us to explore a more general setting: when the demographics have different means and different variances (DMDV), how well does the perform, in terms of bias and fairness? Accordingly, we next consider DMDV groups under a sub-Gaussian assumption.
Theorem 2 (Tail bound under strong heterogeneity).
Given the , where , then
(21) 
with probability . Specifically, when the sub-Gaussian parameters are all the same, i.e. , we have
(22) 
Moreover, for the bias of the IPS estimator on the least favorable group, we have
(23) 
The above bound shows that inaccurate propensities obtained by the NB estimator, with the from RCTs as the outcome prior, contribute to the gap between the average and on . Though the is unbiased, its performance on is worse than on the other groups, since the matters for the lower bound of . We argue that this challenge also arises for multi-task learning and domain-adaptation methods, because small RCTs are only competent for estimating the average, rather than representing all groups.
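The sub-Gaussian machinery used here can be checked numerically. Preferences bounded in [0, 1] are sub-Gaussian with parameter 1/2 by Hoeffding's lemma, which gives a two-sided tail bound for the empirical mean; the sketch below (illustrative parameters of our own) verifies that the empirical tail frequency sits under the bound:

```python
import numpy as np

rng = np.random.default_rng(6)

# Bounded preferences in [0, 1]: sub-Gaussian with sigma = 1/2, so the
# mean of n i.i.d. draws obeys P(|mean - mu| > t) <= 2 exp(-2 n t^2).
n, trials, t = 50, 20000, 0.15
x = rng.uniform(0, 1, size=(trials, n))
deviations = np.abs(x.mean(axis=1) - 0.5)

empirical = (deviations > t).mean()     # observed tail frequency
bound = 2 * np.exp(-2 * n * t * t)      # two-sided Hoeffding bound
```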
This challenge inspires us to develop machinery that can correct the MNAR bias without RCTs, while achieving better fairness in recommendation.
6 Beyond Randomized Trials
To ensure fairness on known demographics, disparity metrics are used by (Wang et al., 2019) to measure the outcome disparity conditioned on predefined groups:
(24) 
such that a fairer model can be obtained by minimizing it. However, the demographics of users' preferences in recommendation are usually unknown, so defining groups and directly optimizing Eq.(24) is difficult.
To deal with unknown demographics, the distributionally robust optimization (DRO) technique has been employed to control the least favorable group's risk (Hashimoto et al., 2018), by minimizing an upper bound of the risk defined on an uncertainty set of distributions. Some concerns with it are that (1) the importance-sampling scheme introduces additional variance into the optimization; (2) the result is sensitive to the distance metric used to define the uncertainty set; and (3) DRO often degenerates to ERM (Hu et al., 2018) in many scenarios. In this section, we propose an easier-to-use and effective counterfactual robust risk minimization framework to reduce utility disparity while coping with unknown demographics. We first present the objective function and the algorithm of the framework, then give its theoretical generalization error bound.
6.1 Counterfactual robust risk minimization
One would expect that disparity could be controlled on the least favorable group by constructing a synthetic set from

(25)

where is the group that causes the steepest increase in disparity. By optimizing on it, our model is encouraged to maintain uniform performance across groups. We formalize this intuition as a tractable optimization problem by presenting a robust risk
(26) 
where the is a predictor parametrized by . In this work, we pick matrix factorization (MF) (Koren et al., 2009); hence the input is an embedding vector looked up from the parameters , and the worst-case input is restricted to a ball of radius around the original input . The intuition is that for each event , we find its counterfactual counterpart , and then try to minimize the disparity between them. The objective function of our CRRM framework is defined by
(27) 
For observed events, we minimize both the robust risk and the empirical risk ; for unobserved events, as is intractable, only is applied. The empirical risk can be any surrogate loss for supervised learning, e.g. mean squared error. The robust risk is made tractable by a power-iteration approach (Miyato et al., 2018), originally used to improve local distributional smoothness in semi-supervised learning.
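A sketch of the power-iteration idea for the inner maximization (our own simplified version in the style of Miyato et al. (2018), not the authors' Algorithm 1): finite differences of a user-supplied gradient function approximate the curvature direction along which the loss grows fastest inside the ball.

```python
import numpy as np

rng = np.random.default_rng(4)

def worst_case_perturbation(x, loss_grad, radius, n_power=1, xi=1e-6):
    """Approximate the direction inside an L2 ball of the given radius
    that most increases the loss around x.

    loss_grad(x) is assumed to return the gradient of the model's loss
    at x; each power-iteration step applies a finite-difference surrogate
    for a Hessian-vector product to a random direction."""
    d = rng.normal(size=x.shape)
    for _ in range(n_power):
        d = d / (np.linalg.norm(d) + 1e-12)
        d = loss_grad(x + xi * d) - loss_grad(x)
    d = d / (np.linalg.norm(d) + 1e-12)
    return radius * d

# Toy quadratic loss 0.5 * x^T A x: the worst direction is the dominant
# eigenvector of A, here the first coordinate axis.
A = np.diag([10.0, 1.0])
grad = lambda x: A @ x
r = worst_case_perturbation(np.array([1.0, 1.0]), grad, radius=0.1, n_power=3)
```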
Algorithm 1 details our CRRM framework, where the trainable embeddings for users and items are and . The input hyperparameters include the regularization weight , counterfactual risk weight , robust radius , perturbation radius , learning rate , number of power-iteration steps and total number of training epochs , although few of them need tuning in practice. Their settings are discussed in the experiments section.
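Putting the pieces together, the following schematic training loop illustrates how a factual ERM update can be combined with a robust term on a perturbed user embedding. This is a sketch under simplifications of our own (a one-step gradient-ascent surrogate replaces the full power iteration, only observed events are used, and all data and hyperparameter values are hypothetical), not the authors' released Algorithm 1:

```python
import numpy as np

rng = np.random.default_rng(5)
n_users, n_items, k = 30, 20, 4

# Trainable MF embeddings and a hypothetical MNAR log of (user, item, click).
U = rng.normal(0, 0.3, (n_users, k))
V = rng.normal(0, 0.3, (n_items, k))
obs = [(u, i, float(rng.random() < 0.3))
       for u in range(n_users) for i in range(n_items) if rng.random() < 0.2]

lr, lam, gamma, eps = 0.05, 1e-4, 0.5, 0.1   # illustrative hyperparameters

def pred(u, i):
    return 1.0 / (1.0 + np.exp(-U[u] @ V[i]))

def factual_loss():
    return float(np.mean([(pred(u, i) - y) ** 2 for u, i, y in obs]))

loss_before = factual_loss()
for epoch in range(50):
    for u, i, y in obs:
        g = pred(u, i) - y                   # logistic-loss gradient wrt score
        gu = g * V[i] + lam * U[u]           # factual ERM gradients
        gv = g * U[u] + lam * V[i]
        # Robust term: perturb the user embedding inside an eps-ball along
        # the loss-ascent direction (cheap surrogate for the inner max).
        d = g * V[i]
        d = eps * d / (np.linalg.norm(d) + 1e-12)
        g_adv = 1.0 / (1.0 + np.exp(-(U[u] + d) @ V[i])) - y
        gu = gu + gamma * g_adv * V[i]
        U[u] -= lr * gu
        V[i] -= lr * gv

loss_after = factual_loss()
```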
6.2 Bounding the risk over counterfactuals
As mentioned above, CRRM balances the predictor's performance between a synthetic and , by penalizing samples with a small factual risk but a large robust risk. We explain this formally by deriving CRRM's theoretical generalization error bound next.
Definition 2 (CRRM for Recommendation).
Given a hypothesis space of predictions and CRRM's objective function from Eq.(27), the algorithm selects the hypothesis that optimizes:
(28) 
To illustrate the validity and principle of the CRRM approach, we state the following bound, where only a finite hypothesis set is considered for conciseness.
Theorem 3 (Generalization Error Bound of CRRM).
For any finite hypothesis space of predictions , with the empirical loss bounded , the true risk of our CRR minimizer from is bounded, with probability , by:
(29) 
where and we assume .
This generalization error bound is constructive: it motivates a general principle for designing fair and effective estimators for learning from MNAR data with the proposed CRRM. In particular, it not only emphasizes minimizing the gap between and to reduce bias, but also points out the large variance incurred by a small factual risk. As the bound shows, it can become loose due to a small and a relatively large , so the fairness controller even contributes to reducing variance, hence picking the hypothesis with a tighter bound on the true risk .
Table 2: Results on the synthetic task under different sparsity settings (mU: minimum user utility; gU: global utility).

Sparsity |    0.1%     |    0.5%     |     1%
Metric   |  mU  |  gU  |  mU  |  gU  |  mU  |  gU
LR       | 0.27 | 0.53 | 0.04 | 0.47 | 0.05 | 0.51
MF       | 0.32 | 0.50 | 0.29 | 0.52 | 0.20 | 0.52
IPS      | 0.39 | 0.53 | 0.34 | 0.52 | 0.27 | 0.52
SNIPS    | 0.40 | 0.54 | 0.32 | 0.50 | 0.30 | 0.51
CRRM     | 0.47 | 0.60 | 0.36 | 0.60 | 0.23 | 0.57
7 Experiments
In this section, we perform experiments to explore the performance of the proposed CRRM, in terms of both global utility and Gini fairness. Furthermore, additional experiments on real-world data sets (Yahoo (https://webscope.sandbox.yahoo.com/) and Coat) are conducted to compare CRRM with other methods on commonly used measurements, e.g. MSE and AUC. We consider logistic regression (LR), matrix factorization (MF-naive), inverse-propensity-score-based MF (MF-IPS) (Schnabel et al., 2016), self-normalized IPS (MF-SNIPS) (Swaminathan and Joachims, 2015) and MF incorporated in our CRRM framework (MF-CRRM).

7.1 Synthetic task
The experimental setup is similar to the motivating example in Section 4.1, but we add the heterogeneous demographics setting, i.e. each user's average preference differs, and extend the scale to , . For experiments on the non-uniform , in each round the policy selects the top-1 item for each user and collects the users' feedback; it then learns from the logged data and produces the next round of top-1 recommendations.
How does demographic heterogeneity influence the properties of under RCTs and the ERM-based policy? We compare the performance of under and , with varying heterogeneity, on the synthetic task. The results are shown in Fig. 6, where the x-axis is the ratio of observed events. With homogeneous demographics, in Fig. 5(a), the obtained by the ERM policy is much higher than the global average, and converges to the global average with high confidence. However, with the strongly heterogeneous demographics defined in Section 5.4, Fig. 5(b) displays high variance of the . Unlike in the homogeneous setting, the blue band converges much more slowly as the sampling ratio increases. Besides, the orange line () also deteriorates compared with before. These results demonstrate that heterogeneous demographics can cause the RCTs-based IPS to fail, because of the high variance incurred by , and can also harm the ERM-based in practice.
Does CRRM improve fairness and generalization? Results under strong demographic heterogeneity are reported in Table 2, where is the minimum utility enjoyed by any user. We test our CRRM method against LR, naive MF and IPS-based MF, and observe the significant superiority of our method. MF-CRRM performs substantially better on both fairness and effectiveness in general, as it attains the best and almost the best under each sparsity setting. By contrast, MF-IPS, with propensity scores learned from MAR data amounting to of the MNAR data, does not beat naive MF, due to the high variance caused by under heterogeneous demographics.
Table 3: MSE and AUC on the real-world data sets.

          |     COAT      |     YAHOO
          |  MSE  |  AUC  |  MSE  |  AUC
MF-naive  | 0.656 | 0.681 | 0.340 | 0.687
MF-IPS    | 0.714 | 0.687 | 0.332 | 0.666
MF-SNIPS  | 0.605 | 0.684 | 0.325 | 0.667
MF-CRRM   | 0.277 | 0.717 | 0.216 | 0.713
Table 4: Gini index (GI) and global utility (gU) on the real-world data sets.

          |     COAT      |     YAHOO
          |  GI   |  gU   |  GI   |  gU
MF-naive  | 0.351 | 0.486 | 0.549 | 0.274
MF-IPS    | 0.333 | 0.508 | 0.558 | 0.264
MF-SNIPS  | 0.341 | 0.501 | 0.559 | 0.264
MF-CRRM   | 0.316 | 0.538 | 0.538 | 0.290
7.2 Realworld data sets
Further experiments are conducted on two real-world data sets, each of which includes a small test set in which users rated randomly displayed items. Following the settings of previous works, we train all models on the MNAR training set and test them on the MAR test set. More details of the experimental setup are in Appendix F.2.
Yahoo! R3 Data Set. This is a user-song rating data set (Marlin and Zemel, 2009), whose MNAR training set has over 300K ratings for songs that were self-selected by users. The test set contains ratings by 5400 users who were asked to rate 10 randomly selected songs.
Coat Shopping Data Set. This data set was collected and used in (Schnabel et al., 2016); it contains 290 users and 300 items. Each user self-selected 24 items, and 16 additional items were randomly drawn for them to rate.
Results. Results are reported in Table 3 and Table 4, where our CRRM substantially outperforms the other baselines on both data sets, in both the effectiveness metrics (MSE, AUC) and the fairness metric . Unlike the IPS-based methods, CRRM has a much lower cost since it needs no additional RCTs. It achieves lower variance by controlling both the robust risk and the factual risk, thereby facilitating generalization without RCTs, as presented in Theorem 3. Note that on the YAHOO data set, for computational efficiency, our CRRM only samples from to optimize the in each epoch, which sheds light on CRRM's promising prospects for real-world recommender systems.
8 Discussion & Conclusion
In this work, we focus on mitigating MNAR bias and unfair policies, both ubiquitous in recommendation. Previous works tend to leverage the empirical outcome mean from small RCTs to correct the MNAR bias, e.g. the IPS family. However, our study of the bias of these RCTs-based IPS estimators, introducing the more general different-mean demographics, identifies that the risk estimator sacrifices the least favorable group to ensure average performance.
We therefore propose an easy-to-use, RCTs-free counterfactual robust risk minimization framework, in order to circumvent the cost of RCTs and mitigate the unfairness of previous works. The key insight of CRRM is to balance the robust and factual risks, thereby both reducing bias and promoting fairness, as explained by the generalization bound in Theorem 3. Moreover, by random sampling when the alternative event set is large, our algorithm incurs only a little additional computation, and is thus scalable to large recommender systems.
We conjecture that cutting-edge sampling approaches (Wang et al., 2020), rather than random sampling over all events, can be plugged into our framework for further improvement. Besides, other existing recommendation models, e.g. deep factorization machines (Guo et al., 2017) and graph neural networks (Fan et al., 2019), can be retrofitted with our CRRM for even better results.

Acknowledgements
References
Fairness in recommendation ranking through pairwise comparisons. arXiv preprint arXiv:1903.00780.
Causal embeddings for recommendation. In Proceedings of the 12th ACM Conference on Recommender Systems, pp. 104–112.
Learning models with uniform performance via distributionally robust optimization. arXiv preprint arXiv:1810.08750.
Doubly robust policy evaluation and learning. arXiv preprint arXiv:1103.4601.
Metapath-guided heterogeneous graph neural network for intent recommendation.
An introduction to probability theory and its applications. Vol. 2, John Wiley & Sons.
Concentration and dependency ratios. Rivista di politica economica 87, pp. 769–792.
DeepFM: a factorization-machine based neural network for CTR prediction. arXiv preprint arXiv:1703.04247.
Equality of opportunity in supervised learning. In Advances in Neural Information Processing Systems, pp. 3315–3323.
Fairness without demographics in repeated loss minimization. arXiv preprint arXiv:1806.08010.
Test, learn, adapt: developing public policy with randomised controlled trials. Cabinet Office.
Probability inequalities for sums of bounded random variables. In The Collected Works of Wassily Hoeffding, pp. 409–426.
Does distributionally robust supervised learning give robust classifiers? In Proceedings of the 35th International Conference on Machine Learning, Vol. 80, pp. 2029–2037.
Fairness in reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning, Vol. 70, pp. 1617–1626.
Doubly robust off-policy value evaluation for reinforcement learning. arXiv preprint arXiv:1511.03722.
Learning representations for counterfactual inference. In International Conference on Machine Learning, pp. 3020–3029.
Rawlsian fairness for machine learning. arXiv preprint arXiv:1610.09559 1 (2).
Preventing fairness gerrymandering: auditing and learning for subgroup fairness. arXiv preprint arXiv:1711.05144.
Average individual fairness: algorithms, generalization and experiments. arXiv preprint arXiv:1905.10607.
Matrix factorization techniques for recommender systems. Computer (8), pp. 30–37.
Delayed impact of fair machine learning. arXiv preprint arXiv:1803.04383.
Principles of Macroeconomics. Cengage Learning.
Collaborative prediction and ranking with non-random missing data. In Proceedings of the Third ACM Conference on Recommender Systems, pp. 5–12.
Robust Statistics: Theory and Methods (with R). John Wiley & Sons.
Fairness-aware learning for continuous attributes and treatments. In International Conference on Machine Learning, pp. 4382–4391.
Virtual adversarial training: a regularization method for supervised and semi-supervised learning. IEEE Transactions on Pattern Analysis and Machine Intelligence 41 (8), pp. 1979–1993.
Learning optimal fair policies. arXiv preprint arXiv:1809.02244.
A Theory of Justice. Harvard University Press.
The central role of the propensity score in observational studies for causal effects. Biometrika 70 (1), pp. 41–55.
Predicting counterfactuals from large historical data and small randomized trials. In Proceedings of the 26th International Conference on World Wide Web Companion, pp. 602–609.
Recommendations as treatments: debiasing learning and evaluation. arXiv preprint arXiv:1602.05352.
What is meant by "missing at random"? Statistical Science, pp. 257–268.
Estimating individual treatment effect: generalization bounds and algorithms. In Proceedings of the 34th International Conference on Machine Learning, Vol. 70, pp. 3076–3085.
Policy learning for fairness in ranking. arXiv preprint arXiv:1902.04056.
Evaluation of recommendations: rating-prediction and ranking. In Proceedings of the 7th ACM Conference on Recommender Systems, pp. 213–220.
The self-normalized estimator for counterfactual learning. In Advances in Neural Information Processing Systems, pp. 3231–3239.
Data-efficient off-policy policy evaluation for reinforcement learning. In International Conference on Machine Learning, pp. 2139–2148.
Doubly robust joint learning for recommendation on data missing not at random. In Proceedings of the 36th International Conference on Machine Learning, Vol. 97, pp. 6638–6647.
Less is better: unweighted data subsampling via influence function. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI).
Improving ad click prediction by considering non-displayed events. In Proceedings of the 28th ACM International Conference on Information and Knowledge Management, pp. 329–338.
Alan Turing and the central limit theorem. The American Mathematical Monthly 102 (6), pp. 483–494.
Appendix A Proof of Proposition 1
Proposition 1 (Least favorable group’s utility and Gini index).
Assume that , controlling the worst case to protect the minority group can improve fairness, by virtue of controlling the supremum of the as
(A.1) 
if and only if the equality holds.
Proof.
(A.2)  
(A.3)  
(A.4)  
(A.5) 
where the last inequality holds because
(A.6) 
∎
Appendix B Proof of Proposition 2
Proposition 2 (Bias of the least favorable group’s risk).
The bias of IPS risk suffered by the least favorable group , has a lower bound as
(B.1) 
Proof.
It is easy to derive that
(B.2)  
(B.3) 
Thus, we concentrate on the term in Eq.(15):
(B.4)  
(B.5)  
(B.6) 
The second inequality, Eq.(B.5), comes from the fact that the least favorable group 's true average preference must be less than the empirical average. Combining the result above with Eq.(15) yields the claim. ∎
Appendix C Proof of Theorem 1
Definition C.1 (Triangular Array (TA)).
A triangular array of random variables is like
(C.1) 
For each row , the random variables are assumed independent, with , and . We denote this triangular array as henceforth.
Definition C.2 (Lindeberg’s condition (Zabell, 1995)).
For random variables , we have , where the variance of is finite: . We define a triangular array as , which satisfies Lindeberg's condition if, for
(C.2) 
we have
(C.3) 
Lemma C.1 (Lindeberg’s condition under SMDV).
Given , and , if the variance is controllable, namely
(C.4) 
then the satisfies Lindeberg’s condition.
Proof.
Assume without loss of generality that , and let be a random variable distributed like each ; then for ,
(C.5)  
(C.6)  
(C.7)  
(C.8) 
in which the second inequality holds because it is easy to derive that . When , the last two terms converge to zero because of the condition in Eq.(C.4). ∎
Theorem 1 (Asymptotic Normality under SMDV).
Let , under SMDV assumption, if
(C.9) 
then the empirical mean satisfies asymptotic normality as
(C.10) 
Proof.
It is easy to derive that the summation over the -th row of is . Based on the condition in Eq.(C.9), applying Lemma C.1 shows that satisfies Lindeberg's condition, namely
(C.11) 
By the Lindeberg-Feller Central Limit Theorem (CLT) (Feller, 2008), the summation over each row of converges to a zero-mean, unit-variance Gaussian distribution, namely
(C.12) 
∎
Appendix D Proof of Theorem 2
First, we introduce Hoeffding’s lemma on bounded random variables (Hoeffding, 1994) without proof.
Lemma D.1 (Hoeffding’s lemma (Hoeffding, 1994)).
If with probability 1, and , then is sub-Gaussian with parameter .
Theorem 2 (Tail bound under strong heterogeneity).
Given the , where , then
(D.1) 
with probability . Specifically when the subGaussian parameters are all same, i.e. , we have
(D.2) 
Moreover, For the bias of IPS estimator on the least favorable group, we have