On the Fairness of Randomized Trials for Recommendation With Heterogeneous Demographics and Beyond

01/25/2020 ∙ by Zifeng Wang, et al.

Observed events in recommendation are consequences of decisions made by a policy, so the data are selectively labeled, i.e. Missing Not At Random (MNAR), which often heavily biases estimates of the true outcome risk. A common way to correct MNAR bias is to run small Randomized Controlled Trials (RCTs), in which an additional uniform policy randomly assigns items to each user. In this work, we study the fairness of RCTs under both homogeneous and heterogeneous demographics, in particular analyzing the bias suffered by the least favorable group in the latter setting. Given the limitations of RCTs, we propose a novel Counterfactual Robust Risk Minimization (CRRM) framework that is entirely free of expensive RCTs, and derive its theoretical generalization error bound. Finally, empirical experiments on synthetic tasks and real-world data sets substantiate our method's superiority in both fairness and generalization.







1 Introduction

Conventional machine learning models, though able to reach high accuracy on average, often incur high loss on minority groups at the same time. Fairness issues have been studied in various tasks, such as medical treatment and criminal justice (Kearns et al., 2019; Nabi et al., 2018; Mary et al., 2019), credit loans (Liu et al., 2018), machine translation (Hashimoto et al., 2018), recommender systems (Beutel et al., 2019; Singh and Joachims, 2019), etc. The reasons for improving a model's fairness are not limited to morality or legality, but also involve economic considerations. Hashimoto et al. (2018) prove that sequential ERM amplifies demographic disparity: the minority group that suffers higher loss becomes more likely to quit the service, thereby providing no further feedback data. This negative feedback loop shrinks the user base, reducing the income or welfare delivered by the machine learning system over time.

Figure 1: Missing ratings in recommender system.

Specifically for recommender systems, a line of work has shown that real-world logging policies collect missing-not-at-random (MNAR) (Yuan et al., 2019) data (or selective labels). For example, users tend to reveal ratings only for items they like, so the observed feedback (usually described by click-through rate, CTR) can look substantially better than the feedback not yet observed. In Fig.1, the average CTR estimated from the observed outcomes alone overstates the true CTR computed over all outcomes, observed and unobserved. This gap between factual and counterfactual ratings in the MNAR situation has been empirically validated by Steck (2013). In this work, we show that MNAR also undermines the fairness of recommendation over time, by degrading the minority group's utility.
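The CTR gap under MNAR data can be reproduced in a few lines. The sketch below uses invented toy numbers (not the paper's figures), assuming a setting where liked items are far more likely to be observed:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy setting: true binary outcomes for 100 users x 100 items.
Y = rng.binomial(1, 0.3, size=(100, 100))

# MNAR observation: liked items (Y = 1) are far more likely to be revealed.
p_obs = np.where(Y == 1, 0.9, 0.2)
O = rng.binomial(1, p_obs)

true_ctr = Y.mean()           # average over ALL events
naive_ctr = Y[O == 1].mean()  # average over OBSERVED events only

print(f"true CTR  = {true_ctr:.2f}")
print(f"naive CTR = {naive_ctr:.2f}")  # optimistically biased upward
```

Because the observation probability depends on the outcome itself, the naive average over observed events is systematically optimistic; making the two observation probabilities equal removes the bias.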

To obtain an unbiased estimate of the true CTR without revealing all feedback, running randomized controlled trials (RCTs) (Haynes et al., 2012) has so far been the first choice, as RCT data satisfy the missing-at-random (MAR) (Seaman et al., 2013) condition:

Definition 1.

The missing data are missing-at-random (MAR) if the conditional probability of the observation pattern, represented by the indicator O, satisfies P(O | X, Y) = P(O | X), i.e. missingness is independent of the outcome Y given the covariates X.
However, the cost of RCTs is unacceptable in many scenarios: we cannot randomly assign drugs to volunteers, or randomly display ads to users. The vast majority of existing works (Shalit et al., 2017; Rosenfeld et al., 2017; Bonner and Vasile, 2018) leverage small RCTs together with large biased logged data to correct MNAR bias, through multi-task learning or domain adaptation techniques. They claim that small RCTs, though incapable of revealing each individual's preference (Rosenfeld et al., 2017), are sufficient for a low-variance estimate of the average over all users, from which a model that generalizes well to counterfactual events can be obtained. However, the notion of "small" is not yet well defined, so there is no guideline for choosing the number of RCT samples to trade gain against cost. Besides, although RCTs are intuitively regarded as fair to each user, we argue that learning from the RCT average yields an unfair policy on heterogeneous demographics, i.e. when users vary across populations defined by their preferences.

In this work, we aim to mitigate users' utility disparity while improving the global utility of all users in recommendation. Our object of interest is an RCTs-free method that achieves both goals. We begin by introducing the Gini index (Gini, 1997) as a disparity measure in section 4. In section 5, we consider two heterogeneous situations to analyze the fairness of RCTs, accounting for the least favorable group's risk. We then propose our RCTs-free Counterfactual Robust Risk Minimization (CRRM) framework in section 6, and test its performance in section 7. Proofs of all propositions and theorems can be found in the appendices.

Contributions. Our main contributions are three-fold:

1. We demonstrate the unfairness of conventional ERM-based recommendation under MNAR data, and bridge the gap between protecting the least favorable group and ensuring fairness with respect to the Gini index;

2. We are the first to theoretically analyze the fairness of RCTs-based methods, by considering the least favorable group's risk under multiple demographic conditions;

3. We develop an RCTs-free CRRM framework and derive the generalization error bound that motivates it. Our method is shown to improve both fairness and accuracy, on both synthetic tasks and real-world data.

2 Related work

Fair machine learning. Fairness in machine learning is an emerging topic in the recent literature. The most direct principle of algorithmic fairness is to protect sensitive tags, features or labels, such as race, gender, age, etc. This formulation has been investigated in (Hardt et al., 2016; Mary et al., 2019), where the conditional independence between the algorithm's decision and the variables to protect is estimated or guaranteed. In other respects, Rawlsian fairness (Rawls, 2009) has been introduced to substantiate the equal-opportunity principle in a bandit learning scheme (Joseph et al., 2016; Jabbari et al., 2017), where an action is preferred only if its true quality is better. However, as far as we know, few works have addressed fairness in the MNAR scenario.

For scenarios where the demographics are unknown, previous works have concentrated on protecting subgroups with latent demographic structure (Kearns et al., 2017), or on learning a distributionally robust model for uniform performance without demographics (Duchi and Namkoong, 2018; Hashimoto et al., 2018). Although these approaches provably protect the minority group, no evident improvement in global utility for all users is witnessed in practice.

Correcting MNAR bias. An importance sampling approach called the inverse propensity score (IPS) was initially proposed to correct sampling bias (Rosenbaum and Rubin, 1983), followed by a line of work including doubly robust (DR) (Dudík et al., 2011; Jiang and Li, 2015), weighted doubly robust (Thomas and Brunskill, 2016), joint-learning doubly robust (Wang et al., 2019), and self-normalized IPS (Swaminathan and Joachims, 2015; Schnabel et al., 2016) estimators for recommendation on MNAR data, all utilizing additional small RCTs. Besides, methods adapted from domain adaptation and representation learning (Johansson et al., 2016; Bonner and Vasile, 2018) learn from both the MAR and MNAR event sets. However, these works hardly address their methods' fairness and effectiveness over heterogeneous demographics.

3 Problem Setup

Let user u ∈ U and item i ∈ I; an event is the pair (u, i). The outcome of an event is Y_{u,i} ∈ {0, 1}, where Y_{u,i} = 1 indicates positive feedback (e.g. a click) and Y_{u,i} = 0 otherwise.

There are |U| × |I| possible events in total, considering all combinations of users and items, where |·| denotes the cardinality of a set; we write D = U × I for the set of all events. The true risk of a predicted outcome matrix Ŷ can be written as

R(Ŷ) = (1 / |D|) Σ_{(u,i) ∈ D} δ(Y_{u,i}, Ŷ_{u,i}),    (3)

where Y is assumed drawn from a true outcome matrix, and δ is usually a surrogate loss such as mean squared error or logistic loss; for simplicity we denote δ(Y_{u,i}, Ŷ_{u,i}) as δ_{u,i}.

Figure 2: A demonstration of MNAR problem.

When some outcomes are not observed, calculating Eq.(3) becomes infeasible. Previous works therefore minimize only the factual empirical risk, namely ERM:

R̂_naive(Ŷ) = (1 / N_O) Σ_{(u,i): O_{u,i} = 1} δ_{u,i},    (4)

in which the Bernoulli random variable O_{u,i} indicates whether a pair (u, i) is observed, and N_O is the number of observed events. The observation of an outcome is hence constrained by the missing pattern O, shown in Fig.2, so R̂_naive is a biased estimate of the true risk:

E_O[ R̂_naive(Ŷ) ] ≠ R(Ŷ).    (5)
3.1 Correcting sampling bias by propensity scores

To deal with the bias shown in Eq.(5), an unbiased inverse propensity score (IPS) estimator (Rosenbaum and Rubin, 1983) can be built as

R̂_IPS(Ŷ) = (1 / |D|) Σ_{(u,i) ∈ D} δ_{u,i} O_{u,i} / P_{u,i},    (6)

where P_{u,i} is the propensity score satisfying P_{u,i} = P(O_{u,i} = 1); hence it is easy to derive that E_O[ R̂_IPS(Ŷ) ] = R(Ŷ).
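Under the assumption of known propensities, the IPS correction can be sketched as follows (the toy data and the fixed constant predictor are our own illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(1)

Y = rng.binomial(1, 0.3, size=(200, 200)).astype(float)  # true outcomes
delta = (Y - 0.3) ** 2          # per-event loss of a constant predictor 0.3

P = np.where(Y == 1, 0.9, 0.2)  # true propensities (MNAR: depend on Y)
O = rng.binomial(1, P)          # observation indicators

true_risk = delta.mean()
naive_risk = delta[O == 1].mean()              # biased: over-weights Y = 1
ips_risk = (O * delta / P).sum() / delta.size  # inverse-propensity weighting

print(true_risk, naive_risk, ips_risk)
```

Since E[O_{u,i}] = P_{u,i}, each weighted term has expectation δ_{u,i}, so the IPS estimate is unbiased for the true risk, whereas the naive average is not.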

The propensity score P_{u,i} can be approximated via a simplified naive Bayes (NB) estimator (Marlin and Zemel, 2009):

P(O_{u,i} = 1 | Y_{u,i} = y) = P(Y = y | O = 1) P(O = 1) / P(Y = y),    (7)

assuming that P(O_{u,i} = 1 | Y_{u,i} = y) = P(O = 1 | Y = y), i.e. the propensity depends on (u, i) only through the outcome. Though P(Y = y | O = 1) and P(O = 1) can be counted from MNAR data, P(Y = y) has to be estimated from MAR data, which is often achieved by taking the average of the RCTs' outcomes:

ȳ = (1 / |D_MAR|) Σ_{(u,i) ∈ D_MAR} Y_{u,i}.    (8)
Using this approximation, the bias of the IPS estimator with inaccurate propensities P̂_{u,i} is given by Schnabel et al. (2016) as

bias(R̂_IPS) = (1 / |D|) | Σ_{(u,i) ∈ D} δ_{u,i} (1 − P_{u,i} / P̂_{u,i}) |.    (9)

For notational simplicity, we omit the argument of the risk estimators from now on. In this work, we delve deeper into this bias, accounting for the bias across different populations. We assume that each individual belongs to a group g ∈ G, and denote the group mean preference by ȳ_g.

From this viewpoint, ȳ is an efficient empirical estimate of the global preference over all groups. However, we question whether, when demographics are heterogeneous, i.e. the ȳ_g differ, the ȳ obtained by RCTs is still a fair and economical estimate. For ensuring fairness, we care about the loss suffered by the least favorable group, rather than only the average over all; measuring the utility disparity is also in our interest.
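As a concrete (hypothetical) illustration of the NB propensity estimator above, the snippet below counts P(Y | O = 1) and P(O = 1) from simulated MNAR logs, borrows P(Y) from a small uniformly-sampled (MAR-like) subset, and recovers the outcome-dependent propensities by Bayes' rule:

```python
import numpy as np

rng = np.random.default_rng(2)

Y = rng.binomial(1, 0.3, size=(500, 500))
P_true = np.where(Y == 1, 0.9, 0.2)  # ground-truth propensities (unknown in practice)
O = rng.binomial(1, P_true)

# Counted from the MNAR logs:
p_o = O.mean()                   # P(O = 1)
p_y1_given_o = Y[O == 1].mean()  # P(Y = 1 | O = 1)

# Estimated from a small MAR (RCT-like) sample of events:
mar_idx = rng.choice(Y.size, size=2000, replace=False)
p_y1 = Y.ravel()[mar_idx].mean()  # P(Y = 1): the RCT outcome average

# Bayes' rule: P(O = 1 | Y = y) = P(Y = y | O = 1) P(O = 1) / P(Y = y)
p_o_given_y1 = p_y1_given_o * p_o / p_y1
p_o_given_y0 = (1 - p_y1_given_o) * p_o / (1 - p_y1)
print(p_o_given_y1, p_o_given_y0)
```

The recovered values approach the ground-truth 0.9 and 0.2; note that any error in the RCT average P(Y = 1) propagates directly into the propensities, which is exactly the sensitivity the following sections analyze.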

Y, Ŷ — true and predicted outcome matrices
U, I, (U, I) — random variables: user, item and event
O — Bernoulli random variable: event missing
o — realization of O
Y_{u,i}, Ŷ_{u,i} — true and predicted outcome of (u, i)
δ_{u,i} — risk of the predicted outcome of (u, i)
R — true risk of Ŷ
R̂ — empirical risk of Ŷ
π_unif, π — uniform and non-uniform policy
D_MAR, D_MNAR — MAR and MNAR event sets
D_F, D_CF — factual and counterfactual event sets
Table 1: Main notation.

4 Measuring Utility Disparity by Gini Index

The conventional approach to updating a recommender system's policy is to repeatedly learn from the MNAR logged data by minimizing empirical risk. In this section, we identify that the policy obtained by this process fails to offer fair service to users, when taking a user's click-through rate (CTR) as their utility. We then propose the Gini index for measuring this disparity, and reveal the connection between improving Gini fairness and protecting the least favorable group's utility.

(a) Click count per user
(b) Display count of items
Figure 3: Analysis of simulation results of the motivating example.

4.1 Motivating example

We build a tiny system with 5 users and 5 items for simulation, and compare the user-wise and item-wise numbers of clicks under the uniform and the ERM-based policy, shown in Fig.3. Unfair treatment of users by the ERM-based policy can be witnessed in Fig.3(a). Although its total click count is slightly higher than the uniform policy's (2916 vs. 2484), the click rates of users #1 and #2 are much lower than their counterparts under the uniform policy. That is, the ERM-based policy achieves good average performance at the expense of deteriorating the minorities' experience. Besides, Fig.3(b) illustrates a devastating diversity reduction by the ERM-based policy, which is also unacceptable in recommendation. Details of the simulation's implementation can be found in Appendix F.1.

4.2 Gini fairness and the minority’s utility

The example above underscores the need to measure users' utility disparity. To this end, we introduce the Gini index, drawn from economics, to measure fairness in recommendation. The Gini index, with 0 indicating perfect equality and 1 perfect inequality, measures the distribution of income across percentiles of a population.

Figure 4: Computing the Gini index via the Lorenz curve.

To compute the Gini index G, the Lorenz curve (Mankiw, 2006) plots the population percentile, ordered by utility, on the x-axis and the cumulative share of utility on the y-axis. Denote the cumulative utility up to the k-th group as S_k. From Fig.4, the area B under the Lorenz curve can be computed by summing over all groups, and the Gini index is then obtained as twice the area between the line of equality and the Lorenz curve:

G = 1 − 2B.    (12)
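A minimal sketch of this computation (trapezoidal approximation of the area under the Lorenz curve; the function name is ours):

```python
import numpy as np

def gini(utilities):
    """Gini index: twice the area between the line of equality
    and the Lorenz curve, via the trapezoid rule."""
    u = np.sort(np.asarray(utilities, dtype=float))
    # Lorenz curve points: cumulative utility share after each group.
    cum = np.concatenate(([0.0], np.cumsum(u))) / u.sum()
    area_under_lorenz = ((cum[:-1] + cum[1:]) / 2.0).sum() / len(u)
    return 1.0 - 2.0 * area_under_lorenz

print(gini([1, 1, 1, 1]))   # perfect equality  -> 0.0
print(gini([0, 0, 0, 10]))  # one user takes all -> 0.75 (max is (n-1)/n for n groups)
```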
We give the following Proposition 1 which identifies the connection between controlling Gini index and protecting the least favorable group.

Proposition 1 (Least favorable group’s utility and Gini index).

Assume that , controlling the worst-case to protect the minority group can improve fairness, by virtue of controlling the supremum of the as


if and only if the equality holds.

(a) Policy
(b) Policy
Figure 5: The Lorenz curve of cumulative utility obtained by each policy, e.g. in (a) the marked point means the tail two users obtain the indicated share of total utility, and likewise in (b). The value of the Gini index is defined by the indicated area, therefore one policy is much fairer than the other.

Based on the above analysis, the least favorable group's average utility can serve as a surrogate metric for fairness, since it restricts the Gini index's upper bound. By the Lorenz curves of the motivating example in Fig.5, the uniform policy has a much lower Gini index than the ERM-based policy, and the minimum utility earned under it is larger as well.

Another concern is the global utility, namely the average utility enjoyed by all users:

gU = (1 / |U|) Σ_{u ∈ U} U_u,    (14)

where U_u denotes user u's utility. It is also important for evaluating a policy's effectiveness, because it decides the total income of the system. Two principles for developing a recommender system follow:

  • Reduce Gini index (Eq. (12)) for fairness

  • Improve global utility (Eq. (14)) for effectiveness

We will use the above two metrics for comparison in the experiment section.
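These two metrics are straightforward to compute from per-user click statistics. The sketch below (names mU/gU follow the experiment tables later in the paper; the data are invented) takes per-user CTR as utility:

```python
import numpy as np

def utility_metrics(clicks, displays):
    """Per-user CTR as utility; return (mU, gU):
    mU = the worst-off user's utility, gU = the average utility."""
    ctr = clicks / np.maximum(displays, 1)  # guard against zero displays
    return ctr.min(), ctr.mean()

clicks = np.array([10, 40, 55, 60, 80])
displays = np.array([100, 100, 100, 100, 100])
mU, gU = utility_metrics(clicks, displays)
print(mU, gU)
```

A fair and effective policy pushes mU up (principle 1, via the Gini bound) without sacrificing gU (principle 2).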

5 Fairness of RCTs Considering Heterogeneous Users Demographics

Recall that our goal is to control the utility disparity among users, which can be realized by controlling the least favorable group's utility, as mentioned above in Eq.(13). We proceed in two steps. First, we show how the empirical mean ȳ influences the minority's risk through the IPS estimator. Then, we analyze this risk under several demographic assumptions, i.e. homogeneous and heterogeneous. As a result, we delve deeper into the fairness limitations of randomized trials, specifically when the ȳ from RCTs is used as trustworthy knowledge for debiasing MNAR data in the literature.

5.1 Risk of the least favorable group

Let the least favorable group be the group with the highest risk, with its related event set defined accordingly. The bias of the IPS estimator on this group, similar to Eq.(9), can be derived as

Proposition 2 (Bias of the least favorable group’s risk by IPS).

The bias of IPS risk suffered by the least favorable group , has a lower bound as


Roughly speaking, the above inequality demonstrates that the properties of ȳ matter. In particular, even under a uniform policy, the bias for each group depends on the preference demographics, as shown next.

5.2 Homogeneity: same mean and same variance

Intuitively, when items are randomly assigned to users under the MAR condition, ȳ is a fair and efficient estimate for debiasing. However, we argue that this only holds under homogeneous and weakly heterogeneous demographics.

Assumption 1 (Homogeneous demographics).

Users' preferences have the same mean and the same variance across groups.

Under the above assumption, by the law of large numbers, the consistency of ȳ follows immediately. Recalling the inequality in Eq.(16), the right-hand term approaches zero in this situation, and the same holds for the other groups. That is why, in our motivating example, the uniform policy reaches good fairness. However, such homogeneity hardly occurs in practice. A real-world recommender system often serves hundreds of millions of users, so it is reasonable to assume that the demographics are highly heterogeneous. We discuss two cases next: weak and strong heterogeneity.

5.3 Weak heterogeneity: with same mean but different variance

We begin our analysis with a simple case, without loss of generality, in which users come from two distributions F_1 and F_2, with probability 1 − ε of being drawn from the first and ε from the second. Formally, this is the contaminated distribution common in robust statistics (Maronna et al., 2019):

F = (1 − ε) F_1 + ε F_2.    (17)

For simplicity, we assume both are Gaussian distributions (we will extend this to the more general sub-Gaussian family later). Based on Eq.(8) and Eq.(17), it is easy to derive the distribution of ȳ.

We first consider the setting in which the groups' preferences have the same mean but different variance (SMDV), namely μ_1 = μ_2 and σ_1 ≠ σ_2. We argue that only if the variance is controlled do RCTs under SMDV demographics lead to an asymptotically normal and fair ȳ:

Theorem 1 (Asymptotic Normality under SMDV).

Let , under SMDV assumption, if


then the empirical mean satisfies asymptotic normality as


As in the homogeneous case, with weak heterogeneity and a mild condition on the variance, ȳ also leads to a fair result.
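Theorem 1's claim can be checked numerically. A sketch under assumed toy parameters (the values of ε, μ and the σ's are ours) draws RCT outcomes from an SMDV contaminated mixture and verifies that the empirical mean concentrates at the common mean:

```python
import numpy as np

rng = np.random.default_rng(3)

eps, mu = 0.2, 0.3          # contamination ratio and common mean (assumed)
sigma1, sigma2 = 0.05, 0.5  # same mean, different variance (SMDV)

def rct_mean(n):
    """Empirical mean of n RCT outcomes drawn from (1 - eps) F1 + eps F2."""
    from_f2 = rng.random(n) < eps
    y = np.where(from_f2, rng.normal(mu, sigma2, n), rng.normal(mu, sigma1, n))
    return y.mean()

means = np.array([rct_mean(5000) for _ in range(200)])
print(means.mean(), means.std())  # centred on mu, with small spread
```

With bounded variances the spread shrinks at the usual 1/sqrt(n) rate; the strong-heterogeneity case of the next subsection breaks the "centred on a common mean" part, not the concentration itself.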

5.4 Strong heterogeneity: with different mean and different variance

Although the above analysis proves that RCTs do lead to a fair estimator in theory, the fundamental challenge is that, for both homogeneity and weak heterogeneity, the same-mean assumption seems too strong. This urges us to explore the more general setting of different mean and different variance (DMDV) demographics, and to ask how well ȳ performs there in terms of bias and fairness. Accordingly, we turn to DMDV groups under a sub-Gaussian assumption next.

Theorem 2 (Tail bound under strong heterogeneity).

Given the , where , then


with high probability. Specifically, when the sub-Gaussian parameters are all the same, we have


Moreover, for the bias of the IPS estimator on the least favorable group, we have


The above bound identifies that inaccurate propensities obtained by the NB estimator, with ȳ from RCTs as the outcome prior, contribute to the gap between the global average and the least favorable group's mean. Though ȳ is unbiased globally, its quality on the least favorable group is worse than on the others, since the group's deviation from the global mean enters the lower bound of the bias. We argue that this challenge appears for other multi-task learning or domain adaptation methods as well, because small RCTs are only competent for estimating the average, rather than representative of all groups.

This challenge inspires us to develop a method that corrects MNAR bias without RCTs, while achieving better fairness in recommendation.

6 Beyond Randomized Trials

To ensure fairness on known demographics, disparity metrics are used by Wang et al. (2019) to measure the outcome disparity conditioned on predefined groups, such that a fairer model can be obtained by minimizing them. However, the demographics of users' preferences in recommendation are usually unknown, so defining groups and optimizing Eq.(24) directly is difficult.

To deal with unknown demographics, the distributionally robust optimization (DRO) technique has been employed to control the least favorable group's risk (Hashimoto et al., 2018), by minimizing an upper bound of the risk defined on an uncertainty set of distributions. Some concerns are that (1) the importance sampling scheme introduces additional variance into optimization; (2) the result is sensitive to the distance metric used to define the uncertainty set; and (3) DRO often degenerates to ERM (Hu et al., 2018) in many scenarios. In this section, we propose an easier-to-use and effective counterfactual robust risk minimization framework that reduces utility disparity while coping with unknown demographics. We first present the framework's objective function and algorithm, then give its theoretical generalization error bound.

1:  Input: Event sets (factual and counterfactual); batch sizes; power iteration steps; number of epochs; hyper-parameters (regularization weight, counterfactual risk weight, robust radius, perturbation radius, learning rate).
2:  Initialize model parameters
3:  for each epoch do
4:     /* Build the empirical risk term */
5:     /* Build the counterfactual robust risk term */
6:     Randomly generate a unit vector
7:     for each power iteration step do
8:        /* Power iteration */
9:     end for
10:    /* Learning from the joint CRR */
11: end for
Algorithm 1 Mini-batch SGD learning for CRRM

6.1 Counterfactual robust risk minimization

One would expect that disparity could be controlled on the least favorable group by constructing a synthetic counterfactual set from the factual events, taking the group that causes the steepest disparity increase. By optimizing on it, our model is encouraged to maintain uniform performance across groups. We formalize this intuition as a tractable optimization regime by introducing a robust risk, in which the predictor f_θ is parametrized by θ. In this work, we pick matrix factorization (MF) (Koren et al., 2009), so the input is an embedding vector looked up from the parameters, and the worst-case input is restricted to a ball of given radius around the original input. The intuition is that for each event, we find its counterfactual counterpart and then try to minimize the disparity between them. The objective function of our CRRM framework combines the empirical risk with this robust risk.


For observed events, we minimize both the robust risk and the empirical risk; for unobserved events, as the empirical risk is intractable, only the robust risk applies. The empirical risk can be any surrogate loss for supervised learning, e.g. mean squared error. The robust risk is made tractable by a power iteration approach (Miyato et al., 2018), which was originally used for improving local distributional smoothness in semi-supervised learning.

Algorithm 1 details our CRRM framework, where trainable embeddings are kept for users and items. The input hyper-parameters include the regularization weight, counterfactual risk weight, robust radius, perturbation radius, learning rate, power iteration steps and total learning epochs, although few of them need tuning in practice. Their setups are discussed in the experiments section.
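The power-iteration step can be sketched for the MF predictor f(u, v) = u·v (a hypothetical minimal version in the spirit of Miyato et al. (2018); variable names and radii are ours). For this predictor, the disparity between an event and its perturbed counterpart is d(r) = ((u + r)·v − u·v)² = (r·v)², whose gradient with respect to r is 2(r·v)v, so one iteration already aligns r with the worst-case direction ±v:

```python
import numpy as np

rng = np.random.default_rng(4)

def worst_case_perturbation(u, v, xi=1e-3, eps=0.1, n_iter=1):
    """Approximate argmax over ||r|| <= eps of (f(u + r, v) - f(u, v))^2
    for f(u, v) = u @ v, via power iteration on the gradient of the
    disparity d(r) = (r @ v)^2, evaluated at the small probe xi * r."""
    r = rng.standard_normal(u.shape)
    r /= np.linalg.norm(r)
    for _ in range(n_iter):
        g = 2.0 * (xi * r @ v) * v  # gradient of d at xi * r
        r = g / np.linalg.norm(g)   # keep only the direction
    return eps * r                  # scale to the eps-ball

u, v = rng.standard_normal(8), rng.standard_normal(8)
r_adv = worst_case_perturbation(u, v)
cos = r_adv @ v / (np.linalg.norm(r_adv) * np.linalg.norm(v))
print(abs(cos))  # ~1.0: the perturbation aligns with v
```

For deeper predictors the analytic gradient would be replaced by automatic differentiation; the resulting r defines the counterfactual counterpart whose disparity the robust risk penalizes.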

6.2 Bounding the risk over counterfactuals

As aforementioned, CRRM balances the predictor's performance between a synthetic counterfactual set and the factual set, by penalizing samples with small factual risk but large robust risk. We formally explain this by deriving CRRM's theoretical generalization error bound next.

Definition 2 (CRRM for Recommendation).

Given a hypothesis space of predictions and CRRM's objective function from Eq.(27), the algorithm selects the hypothesis that optimizes:


To illustrate the validity and principle of the CRRM approach, we state the following bound, where only a finite hypothesis set is considered for the sake of conciseness.

Theorem 3 (Generalization Error Bound of CRRM).

For any finite hypothesis space of predictions, with the empirical loss bounded, the true risk of our CRR minimizer is bounded with high probability by:

This generalization error bound is constructive, as it motivates a general principle for designing fair and effective estimators that learn from MNAR data with the proposed CRRM. In particular, it not only emphasizes minimizing the gap between the factual and counterfactual risks to reduce bias, but also points out the large variance incurred by a small factual risk. As the bound shows, it can become loose when the factual risk is small while the robust risk remains relatively large, so the fairness controller even contributes to reducing variance, helping pick hypotheses with a tighter bound on the true risk.

(a) Homogeneous
(b) Heterogeneous
Figure 6: Estimates of ȳ under two demographic settings, left: homogeneous, right: heterogeneous. Green dotted lines are the maximum and minimum of ȳ_g among all groups, and the red line is the average over all.
Sparsity 0.1% 0.5% 1%
Metric mU gU mU gU mU gU
LR 0.27 0.53 0.04 0.47 0.05 0.51
MF 0.32 0.50 0.29 0.52 0.20 0.52
IPS 0.39 0.53 0.34 0.52 0.27 0.52
SNIPS 0.40 0.54 0.32 0.50 0.30 0.51
CRRM 0.47 0.60 0.36 0.60 0.23 0.57
Table 2: and vary under data sparsity.

7 Experiments

In this section, we perform experiments to explore the performance of the proposed CRRM, in terms of both global utility and Gini fairness. Furthermore, additional experiments on real-world data sets (Yahoo, available at https://webscope.sandbox.yahoo.com/, and Coat) compare CRRM with other methods under commonly used measurements, e.g. MSE and AUC. We consider logistic regression (LR), matrix factorization (MF-naive), inverse propensity score based MF (MF-IPS) (Schnabel et al., 2016), self-normalized IPS (MF-SNIPS) (Swaminathan and Joachims, 2015), and MF incorporated into our CRRM framework (MF-CRRM).

7.1 Synthetic task

The experimental setup is similar to the motivating example in Section 4.1, but we add the heterogeneous demographics setting, i.e. each user's average preference is different, and extend the scale of users and items. For experiments on the non-uniform policy, in each round the policy selects the top-1 item for each user and collects the users' feedback; it then learns from the logged data and gives the next round of top-1 recommendations.

How does demographic heterogeneity influence the properties of ȳ under RCTs and an ERM-based policy? We compare ȳ under the uniform and ERM-based policies with varying heterogeneity in the synthetic task. The results are shown in Fig.6, where the x-axis is the observed events ratio. With homogeneous demographics, Fig.6(a) shows that ȳ obtained by the ERM policy is much higher than the global average, whereas the RCT estimate converges to the global average with high confidence. However, with the strongly heterogeneous demographics defined in section 5.4, Fig.6(b) displays a high variance of ȳ. Unlike in the homogeneous setting, the blue band converges much more slowly as the sampling ratio increases. Besides, the orange line (the ERM policy) also performs worse than before. The result demonstrates that heterogeneous demographics can cause RCTs-based IPS to fail because of the high variance incurred by ȳ, and also harm ERM-based policies in practice.

Does CRRM improve fairness and generalization? Results under strong demographic heterogeneity are reported in Table 2, where mU is the minimum utility enjoyed by any user and gU the global utility. We test our CRRM method against LR, naive MF and IPS-based MF, and observe a significant advantage for our method. MF-CRRM performs substantially better on both fairness and effectiveness in general, gaining the best results under almost every sparsity setting. By contrast, MF-IPS, whose propensity scores are learned from a small amount of MAR data relative to the MNAR data, does not win over naive MF, due to the high variance caused by ȳ under heterogeneous demographics.

MF-naive 0.656 0.681 0.340 0.687
MF-IPS 0.714 0.687 0.332 0.666
MF-SNIPS 0.605 0.684 0.325 0.667
MF-CRRM 0.277 0.717 0.216 0.713
Table 3: Test MSE and AUC on real-world data sets.
MF-naive 0.351 0.486 0.549 0.274
MF-IPS 0.333 0.508 0.558 0.264
MF-SNIPS 0.341 0.501 0.559 0.264
MF-CRRM 0.316 0.538 0.538 0.290
Table 4: Test and on real-world data sets.

7.2 Real-world data sets

Further experiments are conducted on two real-world data sets, each of which includes a small test set in which users rated randomly displayed items. Following previous works' settings, we train all models on the MNAR training set and test them on the MAR test set. More details of the experimental setup are in Appendix F.2.

Yahoo! R3 Data Set. This is a user-song rating data set (Marlin and Zemel, 2009), whose MNAR training set has over 300K ratings for songs self-selected by users. The test set contains ratings by 5400 users who were asked to rate 10 randomly selected songs.

Coat Shopping Data Set. This data set was collected and used in (Schnabel et al., 2016); it contains 290 users and 300 items. Each user self-selected 24 items to rate, and an additional 16 items were randomly drawn for them to rate.

Results. Results are reported in Table 3 and Table 4, where our CRRM substantially wins over the other baselines on both data sets, in the effectiveness metrics (MSE, AUC) as well as the fairness metric. Unlike the IPS-based methods, CRRM has a much lower cost since it needs no additional RCTs. It achieves lower variance by controlling both the robust risk and the factual risk, thus facilitating generalization without RCTs, as presented in Theorem 3. It should be noted that on the Yahoo data set, for computational efficiency, our CRRM only samples a subset of the counterfactual events to optimize the robust risk in each epoch, which sheds light on CRRM's promising prospects in real-world recommender systems.

8 Discussion & Conclusion

In this work, we focus on mitigating MNAR bias and unfair policies, which are ubiquitous in recommendation. Previous works tend to leverage the empirical outcome mean from small RCTs to correct MNAR bias, e.g. the IPS family. However, our study of the bias of these RCTs-based IPS estimators, introducing the more general different-mean demographics, identifies that the risk estimator sacrifices the least favorable group to ensure average performance.

We therefore propose an easy-to-use, RCTs-free counterfactual robust risk minimization framework, in order to circumvent the cost of RCTs and mitigate the unfairness of previous approaches. The key insight of CRRM is to balance the robust and factual risks, thus both reducing bias and facilitating fairness, as explained by the generalization bound in Theorem 3. Moreover, by random sampling when the alternative event set is large, our algorithm incurs only a little additional computation, making it scalable to large recommender systems.

We conjecture that cutting-edge sampling approaches (Wang et al., 2020), rather than random sampling over all events, can be plugged into our framework for further improvement. Besides, other existing recommendation models, e.g. deep factorization machines (Guo et al., 2017) and graph neural networks (Fan et al., 2019), can be retrofitted with our CRRM for even better results.



  • A. Beutel, J. Chen, T. Doshi, H. Qian, L. Wei, Y. Wu, L. Heldt, Z. Zhao, L. Hong, E. H. Chi, et al. (2019) Fairness in recommendation ranking through pairwise comparisons. arXiv preprint arXiv:1903.00780. Cited by: §1.
  • S. Bonner and F. Vasile (2018) Causal embeddings for recommendation. In Proceedings of the 12th ACM Conference on Recommender Systems, pp. 104–112. Cited by: §1, §2.
  • J. Duchi and H. Namkoong (2018) Learning models with uniform performance via distributionally robust optimization. arXiv preprint arXiv:1810.08750. Cited by: §2.
  • M. Dudík, J. Langford, and L. Li (2011) Doubly robust policy evaluation and learning. arXiv preprint arXiv:1103.4601. Cited by: §2.
  • S. Fan, J. Zhu, X. Han, C. Shi, L. Hu, B. Ma, and Y. Li (2019) Metapath-guided heterogeneous graph neural network for intent recommendation. Cited by: §8.
  • W. Feller (2008) An introduction to probability theory and its applications. Vol. 2, John Wiley & Sons. Cited by: Appendix C.
  • C. Gini (1997) Concentration and dependency ratios. Rivista di politica economica 87, pp. 769–792. Cited by: §1.
  • H. Guo, R. Tang, Y. Ye, Z. Li, and X. He (2017) DeepFM: a factorization-machine based neural network for ctr prediction. arXiv preprint arXiv:1703.04247. Cited by: §8.
  • M. Hardt, E. Price, N. Srebro, et al. (2016) Equality of opportunity in supervised learning. In Advances in neural information processing systems, pp. 3315–3323. Cited by: §2.
  • T. B. Hashimoto, M. Srivastava, H. Namkoong, and P. Liang (2018) Fairness without demographics in repeated loss minimization. arXiv preprint arXiv:1806.08010. Cited by: §1, §2, §6.
  • L. Haynes, B. Goldacre, D. Torgerson, et al. (2012) Test, learn, adapt: developing public policy with randomised controlled trials— cabinet office. Cited by: §1.
  • W. Hoeffding (1994) Probability inequalities for sums of bounded random variables. In The Collected Works of Wassily Hoeffding, pp. 409–426. Cited by: Lemma D.1, Appendix D.
  • W. Hu, G. Niu, I. Sato, and M. Sugiyama (2018) Does distributionally robust supervised learning give robust classifiers? In Proceedings of the 35th International Conference on Machine Learning, Vol. 80, pp. 2029–2037. Cited by: §6.
  • S. Jabbari, M. Joseph, M. Kearns, J. Morgenstern, and A. Roth (2017) Fairness in reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 1617–1626. Cited by: §2.
  • N. Jiang and L. Li (2015) Doubly robust off-policy value evaluation for reinforcement learning. arXiv preprint arXiv:1511.03722. Cited by: §2.
  • F. Johansson, U. Shalit, and D. Sontag (2016) Learning representations for counterfactual inference. In International conference on machine learning, pp. 3020–3029. Cited by: §2.
  • M. Joseph, M. Kearns, J. Morgenstern, S. Neel, and A. Roth (2016) Rawlsian fairness for machine learning. arXiv preprint arXiv:1610.09559 1 (2). Cited by: §2.
  • M. Kearns, S. Neel, A. Roth, and Z. S. Wu (2017) Preventing fairness gerrymandering: auditing and learning for subgroup fairness. arXiv preprint arXiv:1711.05144. Cited by: §2.
  • M. Kearns, A. Roth, and S. Sharifi-Malvajerdi (2019) Average individual fairness: algorithms, generalization and experiments. arXiv preprint arXiv:1905.10607. Cited by: §1.
  • Y. Koren, R. Bell, and C. Volinsky (2009) Matrix factorization techniques for recommender systems. Computer (8), pp. 30–37. Cited by: §6.1.
  • L. T. Liu, S. Dean, E. Rolf, M. Simchowitz, and M. Hardt (2018) Delayed impact of fair machine learning. arXiv preprint arXiv:1803.04383. Cited by: §1.
  • N. G. Mankiw (2006) Principles of macroeconomics. Cengage Learning. Cited by: §4.2.
  • B. M. Marlin and R. S. Zemel (2009) Collaborative prediction and ranking with non-random missing data. In Proceedings of the third ACM conference on Recommender systems, pp. 5–12. Cited by: §3.1, §7.2.
  • R. A. Maronna, R. D. Martin, V. J. Yohai, and M. Salibián-Barrera (2019) Robust statistics: theory and methods (with R). John Wiley & Sons. Cited by: §5.3.
  • J. Mary, C. Calauzenes, and N. El Karoui (2019) Fairness-aware learning for continuous attributes and treatments. In International Conference on Machine Learning, pp. 4382–4391. Cited by: §1, §2.
  • T. Miyato, S. Maeda, M. Koyama, and S. Ishii (2018) Virtual adversarial training: a regularization method for supervised and semi-supervised learning. IEEE transactions on pattern analysis and machine intelligence 41 (8), pp. 1979–1993. Cited by: §6.1.
  • R. Nabi, D. Malinsky, and I. Shpitser (2018) Learning optimal fair policies. arXiv preprint arXiv:1809.02244. Cited by: §1.
  • J. Rawls (2009) A theory of justice. Harvard university press. Cited by: §2.
  • P. R. Rosenbaum and D. B. Rubin (1983) The central role of the propensity score in observational studies for causal effects. Biometrika 70 (1), pp. 41–55. Cited by: §2, §3.1.
  • N. Rosenfeld, Y. Mansour, and E. Yom-Tov (2017) Predicting counterfactuals from large historical data and small randomized trials. In Proceedings of the 26th International Conference on World Wide Web Companion, pp. 602–609. Cited by: §1.
  • T. Schnabel, A. Swaminathan, A. Singh, N. Chandak, and T. Joachims (2016) Recommendations as treatments: debiasing learning and evaluation. arXiv preprint arXiv:1602.05352. Cited by: §2, §3.1, §7.2, §7.
  • S. Seaman, J. Galati, D. Jackson, and J. Carlin (2013) What is meant by “missing at random”? Statistical Science, pp. 257–268. Cited by: §1.
  • U. Shalit, F. D. Johansson, and D. Sontag (2017) Estimating individual treatment effect: generalization bounds and algorithms. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 3076–3085. Cited by: §1.
  • A. Singh and T. Joachims (2019) Policy learning for fairness in ranking. arXiv preprint arXiv:1902.04056. Cited by: §1.
  • H. Steck (2013) Evaluation of recommendations: rating-prediction and ranking. In Proceedings of the 7th ACM conference on Recommender systems, pp. 213–220. Cited by: §1.
  • A. Swaminathan and T. Joachims (2015) The self-normalized estimator for counterfactual learning. In Advances in Neural Information Processing Systems, pp. 3231–3239. Cited by: §2, §7.
  • P. Thomas and E. Brunskill (2016) Data-efficient off-policy policy evaluation for reinforcement learning. In International Conference on Machine Learning, pp. 2139–2148. Cited by: §2.
  • X. Wang, R. Zhang, Y. Sun, and J. Qi (2019) Doubly robust joint learning for recommendation on data missing not at random. In Proceedings of the 36th International Conference on Machine Learning, Vol. 97, pp. 6638–6647. Cited by: §2.
  • Z. Wang, H. Zhu, Z. Dong, X. He, and S. Huang (2020) Less is better: unweighted data subsampling via influence function. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI). Cited by: §8.
  • B. Yuan, J. Hsia, M. Yang, H. Zhu, C. Chang, Z. Dong, and C. Lin (2019) Improving ad click prediction by considering non-displayed events. In Proceedings of the 28th ACM International Conference on Information and Knowledge Management, pp. 329–338. Cited by: §1.
  • S. L. Zabell (1995) Alan Turing and the central limit theorem. The American Mathematical Monthly 102 (6), pp. 483–494. Cited by: Definition C.2.

Appendix A Proof of Proposition 1

Proposition 1 (Least favorable group’s utility and Gini index).

Assume that , controlling the worst-case to protect the minority group can improve fairness, by virtue of controlling the supremum of the as


if and only if the equality holds.


where the last inequality holds because


Appendix B Proof of Proposition 2

Proposition 2 (Bias of the least favorable group’s risk).

The bias of IPS risk suffered by the least favorable group , has a lower bound as


It is easy to derive that


Hence, we concentrate on the following term in Eq.(15):


The second inequality, Eq.(B.5), comes from the fact that the least favorable group’s true average preference must be less than its empirical average. Combining the result above with Eq.(15) yields the result. ∎
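The bias mechanism behind Proposition 2 can be checked numerically. The sketch below is a generic illustration of MNAR bias and inverse-propensity-scoring (IPS) correction, not the paper's exact estimator or group structure; all variable names and parameter values are hypothetical choices for the demo.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy MNAR setting: binary outcomes r with observation propensities p
# that depend on the outcome itself (Missing Not At Random).
n = 200_000
r = rng.binomial(1, 0.7, size=n).astype(float)   # true outcomes, mean 0.7
p = np.where(r == 1, 0.8, 0.2)                   # positives observed more often
o = rng.binomial(1, p).astype(float)             # observation indicator

truth = r.mean()
naive = (o * r).sum() / o.sum()                  # average over observed entries only
ips = (o * r / p).mean()                         # IPS estimate with true propensities

print(round(truth, 3), round(naive, 3), round(ips, 3))
```

The naive average over observed entries is noticeably biased upward, while the IPS estimate is close to the truth; with misspecified propensities the IPS estimate would itself be biased, which is the regime the proposition's lower bound concerns.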

Appendix C Proof of Theorem 1

Definition C.1 (Triangular Array (TA)).

A triangular array of random variables takes the form


For each row , the random variables are assumed independent, with , and . We denote this triangular array as henceforth.

Definition C.2 (Lindeberg’s condition (Zabell, 1995)).

For random variables , we have , where the variance of is finite, . We define a triangular array as , which satisfies Lindeberg’s condition if, for


we have

Lemma C.1 (Lindeberg’s condition under SMDV).

Given , and , if the variance is controllable, namely


then the satisfies Lindeberg’s condition.


Assume without loss of generality that , and let be a random variable distributed like every ; then for ,


in which the second inequality holds because it is easy to derive that . When , the last two terms converge to zero, because of the condition set in Eq.(C.4). ∎

Theorem 1 (Asymptotic Normality under SMDV).

Let , under SMDV assumption, if


then the empirical mean satisfies asymptotic normality as


It is easy to derive that the summation over the -th row of is . Based on the condition set by Eq.(C.9), applying Lemma C.1 yields the result that satisfies Lindeberg’s condition, namely


From the Lindeberg-Feller Central Limit Theorem (CLT) (Feller, 2008), the summation over each row of converges to a zero-mean, unit-variance Gaussian distribution, namely


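The asymptotic normality claimed by Theorem 1 can be illustrated with a quick simulation of the Lindeberg-Feller CLT for a triangular array with heterogeneous (but bounded) variances. This is a generic numerical check, not the paper's SMDV setting; the scale distribution and sample sizes are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(1)

# Triangular-array row: independent mean-zero variables with heterogeneous
# scales. Bounded scales keep the variances controllable, so Lindeberg's
# condition holds and the standardized row sum should be approximately N(0, 1).
n, reps = 5000, 2000
scales = rng.uniform(0.5, 2.0, size=n)

X = rng.uniform(-1, 1, size=(reps, n)) * scales   # Var(X_i) = scales[i]**2 / 3
s_n = np.sqrt((scales ** 2 / 3).sum())            # std of the row sum
Z = X.sum(axis=1) / s_n                           # standardized row sums

print(round(Z.mean(), 2), round(Z.std(), 2))
```

Across the 2000 replications the standardized sums have mean close to 0 and standard deviation close to 1, consistent with convergence to a zero-mean, unit-variance Gaussian.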
Appendix D Proof of Theorem 2

First, we introduce Hoeffding’s lemma on bounded random variables (Hoeffding, 1994) without proof.

Lemma D.1 (Hoeffding’s lemma (Hoeffding, 1994)).

If with probability 1, and , then is sub-Gaussian with parameter .
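Hoeffding's lemma is what drives the exponential tail bounds used below: bounded variables are sub-Gaussian, so the sample mean concentrates at rate exp(-2nt²/(b−a)²) (Hoeffding's inequality). The following sketch checks this bound empirically for i.i.d. uniform variables; the specific n, t, and replication count are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)

# Bounded i.i.d. variables in [a, b] = [0, 1]. Hoeffding's inequality:
#   P(mean - mu >= t) <= exp(-2 n t^2 / (b - a)^2)
n, reps, t = 100, 100_000, 0.1
X = rng.uniform(0, 1, size=(reps, n))
mu = 0.5

emp_tail = ((X.mean(axis=1) - mu) >= t).mean()    # empirical tail frequency
bound = np.exp(-2 * n * t ** 2)                   # (b - a) = 1 here

print(emp_tail, round(bound, 4))
```

The empirical tail frequency sits well below the Hoeffding bound, as expected (the bound is loose for uniform variables, whose actual variance is smaller than the worst case (b−a)²/4).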

Theorem 2 (Tail bound under strong heterogeneity).

Given the , where , then


with probability . Specifically, when the sub-Gaussian parameters are all the same, i.e., , we have


Moreover, for the bias of the IPS estimator on the least favorable group, we have