1 Introduction
What should we expect of recommender systems? Recommenders are pivotal in connecting users to relevant content, items or information throughout the web, but with both users and content producers, sellers or information providers relying on these systems, it is important that we understand who is being supported and who is not. In this paper we focus on the risk of a recommender system underranking groups of items [21, 8, 39]. For example, if a social network underranked posts by a given demographic group, that could limit the group’s visibility on the service.
While there has been an explosion in fairness metrics for classification [23, 19, 13, 17], with researchers fleshing out when each metric is appropriate [32], there has been far less coalescence of thinking for recommender systems. Part of the challenge in studying fairness in recommender systems is that they are complex. They often consist of multiple models [47, 24], must balance multiple goals [36, 50], and are difficult to evaluate due to extreme and skewed sparsity [8] and numerous dynamics [33, 3]. All of these issues are hardly resolved in the recommender system community, and present additional challenges in improving recommender fairness.
One challenging split in recommender systems is between treating recommendation as a pointwise prediction problem and applying those predictions for ranked list construction. Pointwise recommenders make a prediction about user interest for each item and then a ranking of recommendations is determined based on those predictions. This setup is pervasive in practice [34, 38, 24, 16, 36], but significant research goes into bridging the gap between pointwise predictions and ranking construction [1, 6]. Fairness falls into a similar dilemma. Recent research developed fairness metrics centered around pointwise accuracy [8, 49], but this does not indicate much about the resulting ranking that the user actually sees. In contrast, [52, 44, 45, 11] explore what constitutes a fair ranking, but focus on unpersonalized rankings where relevancy is known for all items and in most cases require a postprocessing algorithm with item group memberships, which is often not possible in practice [10].
Further, evaluation of recommender systems is notoriously difficult due to the ever changing dynamics in the system. What a user was interested in yesterday they may not be interested in tomorrow, and we only know a user’s preferences if we recommend them an item. As a result, metrics are often biased (in the statistical sense) by the previous recommender system [3], and while a large body of research works to do unbiased offline evaluation [43, 42], this is very difficult due to the large item space, extreme sparsity of feedback, and evolving users and items. These issues only become more salient when trying to measure recommender system fairness, and even more so when trying to evaluate complete rankings.
We address all of these challenges through a pairwise recommendation fairness metric. Using easy-to-run, randomized experiments we are able to get unbiased estimates of user preferences. Based on these observed pairwise preferences, we are able to measure the fairness of even a pointwise recommender system, and we show that these metrics directly correspond to ranking performance. Further, we offer a novel regularization term that we show improves the ultimate ranking fairness of a pointwise recommender, as seen in Figure 1. We test this on a large-scale recommender system in production and show the practical benefits and tradeoffs both theoretically and empirically. In summary, we make the following contributions:
Pairwise Fairness: We propose a set of novel metrics for measuring the fairness of a recommender system based on pairwise comparisons. We show that this pairwise fairness metric directly corresponds to ranking performance and analyze its relation with pointwise fairness metrics.

Pairwise Regularization: We offer a regularization approach to improve the model performance for the given fairness metric that even works with pointwise models.

Real-world Experiments: We apply our approach in a large-scale production recommender system and demonstrate that it produces significant improvements in pairwise fairness.
2 Related Work
This research lies at the intersection of and builds on a myriad of research from the recommender systems community and the machine learning fairness community.
Recommender Systems. There is a large research community focused on recommender systems with a wide variety of interests. Historically, much of the research built on collaborative filtering approaches, with a focus on ratings prediction spurred by the Netflix Prize [34]; another line of work has fleshed out pairwise models of user preferences [28, 14], particularly for information retrieval in search engines. As a core component of many industrial applications of machine learning, a significant amount of work has been published on production recommenders, such as ad click prediction [38, 24] to be used for ranking ads. These systems often follow cascading patterns with sequences of models being used [47, 24]. More recently, there has been strong growth in using state-of-the-art neural network techniques to improve recommender accuracy [25, 16, 36].
Building real-world recommenders faces a variety of challenges. Two that relate to the challenges in fairness are the temporal dynamics [33, 48, 26, 9] and biased training data [29, 15, 3]. These issues do not just make training difficult but also evaluation of recommender performance [42].
Machine Learning Fairness. The machine learning fairness community has primarily focused on fairness in classification, with a myriad of definitions being proposed [19, 13, 23, 17]. Group-fairness-based definitions, where a model’s treatment of two groups of examples is compared, have become the most prevalent structure, but even there researchers have shown tension between different definitions [32, 40]. We primarily follow the equality of opportunity intuition from Hardt et al. [23], where we are concerned with differences in accuracy across groups. Our metric builds most closely on the AUC-based fairness metrics for classification and regression proposed by Dixon et al. [18] and expanded in [12] to be framed as different Mann-Whitney U-tests.
Recommender System Fairness. There has been a small amount of work on fairness in ranking and recommendation, but with each piece of work taking a significantly different perspective. Zehlike et al. [52] laid out goals for fair ranking, but do not touch upon recommender systems, where data is far sparser. Similarly, Singh and Joachims [44] take a full-ranking view of fairness but are able to apply this to recommender systems through a postprocessing algorithm for model predictions; follow-up work [45] moves this into model training. All of this work [52, 44, 45, 11] focuses on an unpersonalized information retrieval setting where relevance labels are known for each item; we focus on personalized recommendations where data sparsity and biases must be handled. In contrast, [8, 49] focus on collaborative filtering pointwise accuracy differences across groups but do not connect these metrics to resulting rankings.
More distant is research on statistical parity in recommenders, which argues that in some applications items should be shown at the same rate across groups [55]. Diversity [52, 31, 39, 46], filter bubbles [4], and feedback loops [27], while related to machine learning fairness, are not the focus of this paper.
Fairness Optimization. Many approaches have been proposed to address fairness issues. Postprocessing can provide elegant solutions [23, 44], but often requires knowing group memberships for all examples, which are rarely known for demographic data. Rather, numerous approaches have been developed for optimizing fairness metrics during classifier training, such as constraint-based optimization [2, 22], adversarial learning [53, 35, 20, 7, 37, 54], and regularization over model predictions [30, 51, 5, 10]. We build on these regularization approaches for improving the fairness properties of our recommender system.
3 Pairwise Fairness for Recommendation
We begin now with a description of our recommender system, the fairness concerns, and our metrics for them.
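Symbol  Definition
$q$  Query consisting of user and context features
$\mathcal{R}_q$  The set of relevant items for $q$
$y, \hat{y}$  User click feedback and prediction of it
$z, \hat{z}$  Post-click engagement and prediction of it
$f_\theta(q, v)$  Predictions $(\hat{y}, \hat{z})$ for $q$ on item $v$
$g(\hat{y}, \hat{z})$  Monotonic ranking function from predictions
$c_q(j, j')$  Comparison between items based on $f_\theta$ and $g$
$\ell_q(j)$  Position of $j$ in ranked list of $\mathcal{R}_q$
$s_j$  Binary sensitive attribute for item $j$
$\mathcal{D}$  Total dataset of $(q, v, y, z)$ tuples
$\mathcal{P}$  Dataset of comparisons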
3.1 Recommendation Environment
We consider a production recommender system that is recommending a personalized list of items to users. We consider a cascading recommender [47, 24, 16], with a set of retrieval systems [15] followed by a ranking system [16, 36]. We assume that the retrieval systems return a set $\mathcal{R}_q$ of $M$ relevant items from the total corpus of $N$ items, where $M \ll N$. The ranking model must then score and rank the items in $\mathcal{R}_q$ to get a final ranking of items. Herein, we focus primarily on the role of the ranker.
Whenever a recommendation is made, the system observes user features for the user and a set of context features, such as timing or device information; together we will refer to this as the query $q$. In addition, for each item $j$ we observe a feature vector $v_j$; this can include sparse representations or learned embeddings for the item as well as any other properties tied to the item. The ranker performs its ranking based on estimates of user feedback, which could be clicks, ratings [34], dwell-time on articles [50], later purchase of items, etc. For our system, we will estimate whether the user clicked on the item, $y$, as well as user engagement after a click on the item, $z$, such as dwell-time, purchases, or ratings. As such, our dataset $\mathcal{D}$ consists of historical examples $(q, v, y, z)$. (Note, because $z$ is user engagement after a click, if no click occurs, $z = 0$.) $\mathcal{D}$ only contains examples that have been recommended to the user previously.
The ranker is a model $f_\theta$ parameterized by $\theta$; the model is trained to predict the click and the user engagement, $f_\theta(q, v) \rightarrow (\hat{y}, \hat{z})$. Finally, a ranking of the items is produced by a monotonic scoring function $g(\hat{y}, \hat{z})$ and the user is shown the top items from the relevant items $\mathcal{R}_q$ ordered by $g$.
3.2 Motivating Fairness Concerns
As discussed previously, a wide variety of fairness concerns have been highlighted in the literature. In this work, we primarily focus on the risk to groups of items from being underrecommended [21, 8, 39, 45]. For example, if a social network underranked posts by a given demographic group, that could limit the group’s visibility and thus engagement on the service. If a comment section of a website was personalized, and if a demographic group of users’ comments were underranked, then that demographic would have less of a voice on the website. In a more abstract sense, we assume that each item $j$ has a binary sensitive attribute $s_j \in \{0, 1\}$. We will work to measure if items from one group are systematically underranked.
Although not our primary focus, these issues could align with user group concerns if a group of items is preferred by a particular user group. This framework could be explicitly extended to incorporate user groups as well. If each user has a sensitive attribute, we can compute all of the following metrics over each user group and compare performance across the groups. For example, if we are concerned that a social network is underranking items about a particular topic to a particular demographic, we could compare the degree of underranking of that topic’s content across demographic groups.
3.3 Pairwise Fairness Metric
While the above fairness goals may seem worthwhile, we must make precise what it means for an item to be “underranked.” Here, we draw on the intuition of Hardt et al. [23] for equality of odds, where the fairness of a classifier is quantified by comparing its false positive rate and/or false negative rate across groups. Stated differently, given an item’s label is positive, what is the probability the classifier will predict it to be positive? This works well in classification because the model’s prediction can be compared to a predefined threshold.
In recommender systems, it is less clear what a positive prediction is, even if we restrict our analysis to clicks ($y$) and ignore engagement ($z$). For example, if an item is clicked, $y = 1$, and the predicted probability of click is $\hat{y} = 0.6$, is this a positive prediction? It can be perceived as an underprediction of $0.4$, but it may still be the top-ranked item if all other items have a predicted $\hat{y} < 0.6$. As such, understanding errors in the pointwise predictions requires comparing the predictions of items for the same query.
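We begin with defining a pairwise accuracy: what is the probability that a clicked item is ranked above another relevant unclicked item, for the same query:
$$\text{PairwiseAccuracy} \triangleq P\big(g(f_\theta(q, v_j)) > g(f_\theta(q, v_{j'})) \mid y_{q,j} > y_{q,j'},\; j, j' \in \mathcal{R}_q\big) \tag{1}$$
With this definition, we have a sense of how often the ranking system ranks the clicked item well. For succinctness, we will use $c_q(j, j') \triangleq \mathbb{1}[g(f_\theta(q, v_j)) > g(f_\theta(q, v_{j'}))]$ to represent the comparison between the predictions for item $j$ and $j'$ on query $q$; we will hide the term $j, j' \in \mathcal{R}_q$, but we only consider comparisons among relevant items for all following definitions.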
As with much of the rest of fairness research, we are concerned with the relative performance across groups, not the absolute performance. As such, we can compare:
$$P(c_q(j, j') \mid y_{q,j} > y_{q,j'},\; s_j = 0) \quad \text{and} \quad P(c_q(j, j') \mid y_{q,j} > y_{q,j'},\; s_j = 1).$$
That is, is the PairwiseAccuracy for items from one group higher or lower than the PairwiseAccuracy for items from the other group? (We focus on $s_j$ rather than $s_{j'}$ as item $j$ is the clicked item and we are concerned with ranking the clicked item well. We will incorporate $s_{j'}$ in later metrics.)
While this is an intuitive metric, it is problematic in that it ignores entirely user engagement and thus possibly runs the risk of promoting clickbait that users don’t ultimately value. As such, we can follow the approach of conditioning on other dependent signals as in [41, 10].
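Definition 1 (Pairwise Fairness).
A model $f_\theta$ with ranking formula $g$ is considered to obey pairwise fairness if the likelihood of a clicked item being ranked above another relevant unclicked item is the same across both groups, conditioned on the items having been engaged with the same amount:
$$P(c_q(j, j') \mid y_{q,j} > y_{q,j'},\; s_j = 0,\; z_{q,j} = \tilde{z}) = P(c_q(j, j') \mid y_{q,j} > y_{q,j'},\; s_j = 1,\; z_{q,j} = \tilde{z}), \quad \forall \tilde{z}. \tag{2}$$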
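As a rough illustration of how the quantity in Definition 1 could be estimated from logged pairs, the following Python sketch counts, per group of the clicked item and per engagement bucket, how often the clicked item outscored the unclicked one. The record layout (`score_clicked`, `score_unclicked`, `group_clicked`, `bucket`) and both function names are hypothetical and chosen only for this example; they are not taken from the system described here.

```python
from collections import defaultdict


def pairwise_accuracy_by_group(pairs):
    """Estimate pairwise accuracy per (group of clicked item, engagement bucket).

    Each pair is a dict with hypothetical keys:
      score_clicked, score_unclicked: ranking scores of the two items
      group_clicked: sensitive attribute (0 or 1) of the clicked item
      bucket: discretized engagement observed on the clicked item
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for p in pairs:
        key = (p["group_clicked"], p["bucket"])
        total[key] += 1
        # A pair counts as correct when the clicked item outscores the unclicked one.
        if p["score_clicked"] > p["score_unclicked"]:
            correct[key] += 1
    return {key: correct[key] / total[key] for key in total}


def pairwise_fairness_gaps(accuracy):
    """Per-bucket accuracy gap between group 0 and group 1 (0.0 means parity)."""
    buckets = {bucket for (_, bucket) in accuracy}
    return {
        b: accuracy[(0, b)] - accuracy[(1, b)]
        for b in buckets
        if (0, b) in accuracy and (1, b) in accuracy
    }


example_pairs = [
    {"score_clicked": 0.9, "score_unclicked": 0.2, "group_clicked": 0, "bucket": "low"},
    {"score_clicked": 0.3, "score_unclicked": 0.5, "group_clicked": 0, "bucket": "low"},
    {"score_clicked": 0.4, "score_unclicked": 0.6, "group_clicked": 1, "bucket": "low"},
    {"score_clicked": 0.1, "score_unclicked": 0.7, "group_clicked": 1, "bucket": "low"},
]
example_accuracy = pairwise_accuracy_by_group(example_pairs)
example_gaps = pairwise_fairness_gaps(example_accuracy)
```

A gap of zero in every bucket would correspond to the equality required by the definition; in practice one would also want confidence intervals, which this sketch omits.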
This definition gives us an aggregate notion of ranker accuracy for items from each group.
While this is valuable, it does not distinguish between types of misorderings. This can be problematic in systematically underexposing items from one group [44]. For illustration, consider the following two examples where in both cases there are three items from each group and in the first case assume that is clicked and in the second case assume that is clicked. If in the first case the system gives a ranking and in the second case the systems gives , we see that the overall pairwise accuracy is the same in both cases, , but in the second case, even when an item from group was of interest (clicked), all group items ranked below group items. Both are problematic in ranking the clicked item low, but the second is more problematic in systematically preferring one group to the other, independent of user preferences.
To deal with this, we can split the above pairwise fairness definition into two separate criteria: pairwise accuracy between items in the same group and pairwise accuracy between items from different groups; we will refer to these metrics as intragroup pairwise accuracy and intergroup pairwise accuracy, respectively:
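$$\text{Intra-Group Acc.} \triangleq P(c_q(j, j') \mid y_{q,j} > y_{q,j'},\; s_j = s_{j'},\; z_{q,j} = \tilde{z}) \tag{3}$$
$$\text{Inter-Group Acc.} \triangleq P(c_q(j, j') \mid y_{q,j} > y_{q,j'},\; s_j \neq s_{j'},\; z_{q,j} = \tilde{z}) \tag{4}$$
From these we can define Intra-Group Pairwise Fairness and Inter-Group Pairwise Fairness criteria.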
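Definition 2 (Intra-Group Pairwise Fairness).
A model $f_\theta$ with ranking formula $g$ is considered to obey intra-group pairwise fairness if the likelihood of a clicked item being ranked above another relevant unclicked item from the same group is the same independent of group, conditioned on the items having been engaged with the same amount:
$$P(c_q(j, j') \mid y_{q,j} > y_{q,j'},\; s_j = s_{j'} = 0,\; z_{q,j} = \tilde{z}) = P(c_q(j, j') \mid y_{q,j} > y_{q,j'},\; s_j = s_{j'} = 1,\; z_{q,j} = \tilde{z}), \quad \forall \tilde{z}. \tag{5}$$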
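Definition 3 (Inter-Group Pairwise Fairness).
A model $f_\theta$ with ranking formula $g$ is considered to obey inter-group pairwise fairness if the likelihood of a clicked item being ranked above another relevant unclicked item from the opposite group is the same independent of group, conditioned on the items having been engaged with the same amount:
$$P(c_q(j, j') \mid y_{q,j} > y_{q,j'},\; s_j = 0,\; s_{j'} = 1,\; z_{q,j} = \tilde{z}) = P(c_q(j, j') \mid y_{q,j} > y_{q,j'},\; s_j = 1,\; s_{j'} = 0,\; z_{q,j} = \tilde{z}), \quad \forall \tilde{z}. \tag{6}$$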
The Intra-Group Pairwise Fairness to some degree acts similarly to the overall Pairwise Fairness notion as it indicates the ability of the recommender system to rank well the item of interest to the user. The Inter-Group Pairwise Fairness gives us further insight into whether mistakes in ranking are at the cost of the group as a whole.
We can see this more clearly by decomposing the overall pairwise accuracy as follows:
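$$\begin{aligned} P(c_q(j, j') \mid y_{q,j} > y_{q,j'},\; s_j = s,\; z_{q,j} = \tilde{z}) ={} & P(s_{j'} = s \mid y_{q,j} > y_{q,j'},\; s_j = s,\; z_{q,j} = \tilde{z}) \, P(c_q(j, j') \mid y_{q,j} > y_{q,j'},\; s_j = s_{j'} = s,\; z_{q,j} = \tilde{z}) \\ & + P(s_{j'} \neq s \mid y_{q,j} > y_{q,j'},\; s_j = s,\; z_{q,j} = \tilde{z}) \, P(c_q(j, j') \mid y_{q,j} > y_{q,j'},\; s_j = s,\; s_{j'} \neq s,\; z_{q,j} = \tilde{z}) \end{aligned} \tag{7}$$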
That is, we find we can break up the pairwise comparisons into two sets, intragroup and intergroup comparisons, and that the overall pairwise accuracy is a weighted sum of the intergroup accuracy and intragroup accuracy, where the weights are determined by the probability of seeing a pair of that form (intergroup or intragroup) with the corresponding click and engagement. Together, these metrics give us a better sense of the fairness of the recommender system.
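The weighted-sum relationship described above can be checked numerically. The Python sketch below takes a list of comparison outcomes for a fixed engagement bucket and returns the overall accuracy for one group of clicked items together with the intra-group and inter-group accuracies and their weights. The tuple layout `(correct, s_clicked, s_unclicked)` and the function name are assumptions made for this example.

```python
def decompose_pairwise_accuracy(pairs, group):
    """Split the pairwise accuracy for clicked items of `group` into its parts.

    pairs: list of (correct, s_clicked, s_unclicked) tuples, where `correct`
    is 1 if the clicked item was ranked above the unclicked item, else 0.
    Returns (overall, w_intra, acc_intra, w_inter, acc_inter).
    """
    rows = [p for p in pairs if p[1] == group]
    if not rows:
        raise ValueError("no pairs with a clicked item from this group")
    intra = [p[0] for p in rows if p[2] == group]
    inter = [p[0] for p in rows if p[2] != group]
    overall = sum(p[0] for p in rows) / len(rows)
    # Weights are the empirical probabilities of each pair type.
    w_intra = len(intra) / len(rows)
    w_inter = len(inter) / len(rows)
    acc_intra = sum(intra) / len(intra) if intra else 0.0
    acc_inter = sum(inter) / len(inter) if inter else 0.0
    return overall, w_intra, acc_intra, w_inter, acc_inter


example = [(1, 0, 0), (0, 0, 0), (1, 0, 1), (1, 0, 1), (0, 0, 1), (1, 1, 0)]
overall, w_intra, acc_intra, w_inter, acc_inter = decompose_pairwise_accuracy(example, 0)
recombined = w_intra * acc_intra + w_inter * acc_inter
```

In this toy data the clicked group-0 item wins 3 of 5 comparisons overall, which equals 0.4 times the intra-group accuracy of 0.5 plus 0.6 times the inter-group accuracy of 2/3.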
3.4 Measurement
While the above definitions offer a goal of how we would like a recommender system to perform, measuring the degree to which a recommender system meets these goals presents unique challenges. As discussed in the introduction, users and items in recommender systems are highly dynamic, and we typically only observe user feedback on previously recommended items, which makes metrics vulnerable to bias in the previous recommender system.
However, for all three fairness definitions given above, we would like to have unbiased estimates of user preferences between pairs of items. In order to do this we run randomized experiments over a small percentage of queries to the recommender system. The experimental description below is all assumed to operate over the subset of queries in the experimental slice.
For the experimental queries we will show the user a pair of items in positions two and three of the recommended slate; this prevents any position bias [3] where items that the recommender system ranks low are less likely to be clicked than items ranked high, irrespective of the particular items. Because the definitions above are all over arbitrary pairs of items from the set of relevant items for the given query, for each query two items are chosen at random from $\mathcal{R}_q$ and their ordering in positions two and three is also randomized.
Among all queries in the experimental slice only a small fraction will have clicks on one of the items in the randomized item pair. Whenever an item in the randomized item pair is clicked, we record the query, pair, which item was clicked, and the subsequent engagement $z$. With this, we can compute all of the probabilities in the fairness definitions above. In practice we discretize $z$ into buckets for easier comparison.
Note, as can be seen through this experiment, we cannot know the engagement we would have observed if the unclicked item had been clicked. This motivates our current metric design of conditioning on $z$ rather than estimating the accuracy of $\hat{z}$, since we can only know $z$ for one item in the pair.
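The sampling and bucketing steps of the experiment can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the item identifiers, and the bucket edges are hypothetical, and the real system's slate construction is not described in enough detail here to reproduce.

```python
import bisect
import random


def sample_randomized_pair(relevant_items, rng):
    """Pick two distinct relevant items uniformly at random.

    random.Random.sample returns the items in random selection order, so the
    returned order can be used directly as the order in slate positions two
    and three.
    """
    first, second = rng.sample(list(relevant_items), 2)
    return first, second


def engagement_bucket(z, edges):
    """Map an engagement value to a bucket index given sorted bucket edges."""
    return bisect.bisect_right(edges, z)


rng = random.Random(0)  # seeded only to make the example repeatable
items = ["item_a", "item_b", "item_c", "item_d"]
pair = sample_randomized_pair(items, rng)

edges = [1.0, 5.0, 20.0]  # hypothetical edges giving four engagement levels
buckets = [engagement_bucket(z, edges) for z in (0.0, 5.0, 100.0)]
```

Each logged record would then hold the query, the sampled pair, which item was clicked, and the bucket of the observed engagement on the clicked item.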
Discussion
These metrics connect the performance of the ranking model to the end fairness properties of the resulting ranking. One underlying assumption is that the retrieval system that determines the set of relevant items, $\mathcal{R}_q$, is in some sense “fair.” We believe further research is needed to understand both what it means for a retrieval system to be “fair” and how any degree of bias in the retrieval system propagates through the ranking system to affect the end ranking experience.
4 Theoretical Analysis
While hopefully the above definitions are clear and well motivated, we find that upon further inspection they obey a number of fascinating properties.
4.1 Ranking Interpretation
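While the metrics have thus far primarily been described similar to pairwise accuracy, they can be interpreted through the lens of ranking. That is, the recommender system sorts $\mathcal{R}_q$ according to $f_\theta$ and $g$. We will use $\ell_q(j)$ to denote the position of item $j$ in the sorted list of items:
$$\ell_q(j) = 1 + \sum_{j' \in \mathcal{R}_q,\; j' \neq j} \big(1 - c_q(j, j')\big) \tag{8}$$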
From this perspective, we find that we can connect pairwise fairness to fairness with respect to the ranked position:
Theorem 1.
If a recommender system achieves pairwise fairness then the expected position of a clicked item with engagement is the same across groups.
Proof.
This falls out of the definition of pairwise accuracy and pairwise fairness:
∎
As such, we see that we can interpret pairwise recommender fairness as equivalent to the notion that the position of a clicked and engaged with item should not depend on the group membership on average, aligning with position bias highlighted in [11, 45]. (This analysis is similar to probabilistic interpretations in traditional pairwise IR [14], but now in the context of recommender system fairness.)
The intergroup and intragroup pairwise accuracies also connect to the rank position of the clicked item. That is, we can decompose:
Here too, we see that the overall ranked position can be decomposed into the position within the ranked list from the same group and position among the ranked list from the other group. However, because of the possibly varying distributions of the number of comparisons of each type, we believe it makes sense to focus on each of these terms as probabilities.
4.2 Relation to Pointwise Metrics
While our pairwise fairness metric aligns with previously stated goals of fair ranking, we find that it lies in tension with traditional pointwise metrics. For example, recommender systems are often evaluated in terms of calibration or RMSE [34], and these metrics have been espoused as important fairness metrics in classification [17] and in recommendation [8, 49]. We show here that these pointwise metrics are insufficient for guaranteeing pairwise fairness.
For the following proofs we consider a simplified case where and . This can be thought of ranking items by predicted click through rate (pCTR). For each group we denote its average label by .
Calibration
We begin with examining the relationship between calibration and pairwise fairness. A pCTR model for labels $y$ is considered calibrated if and only if:
$$\mathbb{E}[\, y \mid \hat{y} = \tilde{y} \,] = \tilde{y}, \quad \forall \tilde{y}. \tag{9}$$
That is, among examples receiving a particular prediction, the average label for those examples needs to be equal to the predicted value. In the context of fairness, this would be evaluated over examples from one group.
Lemma 1.
A calibrated model is insufficient for guaranteeing pairwise ranking fairness.
Proof.
In order to prove this, we offer an example of a calibrated model that does not obey pairwise ranking fairness. Let’s assume that we learn a model that predicts for all examples with an item from group . This model is by definition calibrated per group.
If we have two groups, and , where then for all items with and all items with . As such, and . From this it is clear that intergroup fairness does not hold. We assume ties are split randomly, giving us . Based on Eq. (7), we find that as long as then overall pairwise fairness does not hold. ∎
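The counterexample in this proof can be made concrete with a small Python sketch. The click labels below are invented purely for illustration (group 0 with a 30% click rate, group 1 with 10%); the model predicts each group's average label for every item in that group, which makes it calibrated per group, and the inter-group accuracy is then computed for each group of clicked item.

```python
def group_mean(labels):
    """Average click label of a group; the constant prediction for that group."""
    return sum(labels) / len(labels)


def inter_group_accuracy(pred_by_group, clicked_group):
    """Probability that a clicked item outranks an unclicked item of the other group.

    Every item in a group gets the same score, so the outcome is the same for
    every such pair; exact ties are treated as split randomly.
    """
    other_group = 1 - clicked_group
    clicked_score = pred_by_group[clicked_group]
    unclicked_score = pred_by_group[other_group]
    if clicked_score > unclicked_score:
        return 1.0
    if clicked_score == unclicked_score:
        return 0.5
    return 0.0


# Hypothetical labels: group 0 has a higher average label than group 1.
labels = {
    0: [1, 1, 1, 0, 0, 0, 0, 0, 0, 0],
    1: [1, 0, 0, 0, 0, 0, 0, 0, 0, 0],
}
pred_by_group = {g: group_mean(v) for g, v in labels.items()}

acc_group0 = inter_group_accuracy(pred_by_group, 0)
acc_group1 = inter_group_accuracy(pred_by_group, 1)
```

Clicked items from the group with the higher average label always win their inter-group comparisons, while clicked items from the other group always lose, even though the predictions match the average label within each group.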
This problem is highly similar to the issue pointed out by [44] with respect to ranking and exposure, and we see here that it holds true even among pairwise comparisons.
Squared Error
Another pervasive metric in recommender systems is mean squared error (MSE) [34]. This metric, and modifications of it, have been proposed for evaluating the fairness of collaborative filtering systems [8, 49]. While this may be worthwhile to encourage accuracy across groups, we find that it too is insufficient for guaranteeing pairwise fairness.
Lemma 2.
Equal MSE across groups is insufficient for guaranteeing pairwise ranking fairness.
Proof.
As above we will demonstrate an example of a model that achieves equal MSE across groups but does not obey pairwise fairness. Again, let us assume that we learn a model that predicts . We assume we have two groups, and , where and . We see that:
Through simply substituting in in the definition above we can see that .
Just as in the proof above, because , we find for all items with and all items with . As such, and . Again, it is clear that intergroup fairness does not hold. If we split ties randomly, such that , and as long as , then we again find that overall pairwise fairness does not hold either. ∎
As a result, while matching MSE across groups is intuitively valuable for fairness, it is insufficient for making any claims about the end ranking. Looking at the example in the proof, it is clear that this is in part due to the fact that MSE does not distinguish between over and underprediction. However, even taking that into account, MSE ignores the relative ranking and thus it is hard to determine what an improvement of in MSE means for ranking accuracy.
5 Pairwise Regularization to Improve Fairness
With an understanding of our fairness goals, we now ask: how can we learn a recommender system that achieves these fairness properties? As discussed previously, most production recommenders are pointwise recommenders trained to predict $y$ and $z$, so we would like a modeling approach that does not require throwing out existing techniques.
To encourage fairness during training, we build on the regularization approach first proposed by Zafar et al. [51] and expanded upon by Beutel et al. [10]. In particular, Beutel et al. [10] optimized for equality of opportunity in classification [23] by minimizing the correlation between the group membership and the model’s predictions among examples with the same label. In our setting, we are concerned with the relative ordering, so we must modify this objective.
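We assume that our model is trained with a loss $\mathcal{L}$; for example, if squared error were used then $\mathcal{L}(f_\theta(q, v), (y, z)) = (\hat{y} - y)^2 + (\hat{z} - z)^2$. Further, we assume that we know $g$ and that it is differentiable. Given this, we train our model with the following objective:
$$\min_\theta \Big( \sum_{(q, v, y, z) \in \mathcal{D}} \mathcal{L}\big(f_\theta(q, v), (y, z)\big) \Big) + \big|\mathrm{Corr}_{\mathcal{P}}(A, B)\big| \tag{10}$$
Here, $\mathcal{D}$ is our original training data and $\mathcal{P}$ is the experimental data from Section 3.4 consisting of pairs of tuples $((q, v_j, y, z), (q, v_{j'}, y', z'))$. The second term, the absolute correlation, is computed as the correlation between two terms, $A$ and $B$, both random variables over pairs from $\mathcal{P}$:
$$A = \big(g(f_\theta(q, v_j)) - g(f_\theta(q, v_{j'}))\big)(y - y') \tag{11}$$
$$B = (s_j - s_{j'})(y - y') \tag{12}$$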
That is, the pairwise regularizer calculates the correlation between the residual between the clicked and unclicked item and the group membership of the clicked item. As a result, the model is penalized if its ability to predict which item was clicked is better for one group than the other.
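A minimal Python sketch of such a regularizer follows, under one assumed concrete form: the first term is the score difference between the two items, signed by which item was clicked, and the second is the difference in group membership, signed the same way. The tuple layout and function names are hypothetical. Plain Python is used only to show the arithmetic; a training setup would need the same computation expressed in a differentiable framework.

```python
import math


def pearson_correlation(a, b):
    """Sample Pearson correlation; returns 0.0 if either input is constant."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / math.sqrt(var_a * var_b)


def pairwise_regularizer(pairs):
    """Absolute correlation between score residuals and group membership.

    pairs: list of (score_j, score_jp, y_j, y_jp, s_j, s_jp) tuples, one per
    randomized pair, where y marks the clicked item and s is the group.
    """
    residual = [(gj - gjp) * (yj - yjp) for gj, gjp, yj, yjp, _, _ in pairs]
    group = [(sj - sjp) * (yj - yjp) for _, _, yj, yjp, sj, sjp in pairs]
    return abs(pearson_correlation(residual, group))


# Residual depends entirely on the clicked item's group: maximal penalty.
biased = [
    (0.2, 0.6, 1, 0, 1, 0),
    (0.3, 0.7, 1, 0, 1, 0),
    (0.8, 0.4, 1, 0, 0, 1),
    (0.9, 0.5, 1, 0, 0, 1),
]
# Residual is unrelated to the clicked item's group: no penalty.
balanced = [
    (0.8, 0.4, 1, 0, 1, 0),
    (0.8, 0.4, 1, 0, 0, 1),
    (0.3, 0.6, 1, 0, 1, 0),
    (0.3, 0.6, 1, 0, 0, 1),
]
penalty_biased = pairwise_regularizer(biased)
penalty_balanced = pairwise_regularizer(balanced)
```

In the first toy set, clicked items from one group are always scored below their unclicked counterparts and clicked items from the other group always above, so the penalty is close to 1; in the second, the residuals are identical across groups and the penalty vanishes.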
To make sure there is sufficient data for a meaningful calculation, we rebalance to have approximately half of the data with the clicked item belonging to group and the other half with the clicked item belonging to group . Further data restrictions can be applied for alternative goals. If we are concerned with intragroup pairwise fairness, we can restrict to the set of pairs where , and if we are concerned with a large differences in engagement , we can create buckets of where all pairs in the set resulted in engagement . This approach is general enough to be used with pointwise recommenders as well as pairwise recommenders.
As in [10], this approach does not provably achieve pairwise fairness but we follow it due to its strong empirical performance and ease of use, crucial for production applications.
6 Experiments
To understand our pairwise fairness metric and our proposed modeling improvements, we study the performance of a large-scale, production recommender system. We offer analysis of the state-of-the-art production model’s performance as well as how our modeling changes affect the system.
6.1 Experimental Setup
As described in Section 3.1, we study a cascading recommender system where multiple retrieval systems return the set of relevant items for a given query, followed by a ranking model. Here we evaluate the ranking model’s performance. The ranking model is a multilayer neural network trained in a pointwise fashion with multiple heads to predict the probability of a click, $\hat{y}$, and a set of user engagement signals after the click, which we refer to in the aggregate by $\hat{z}$; this is a similar setup to [25, 36]. This model is continuously trained on a dataset of interactions with previous recommendations.
We study the performance of the ranker with respect to a sensitive subgroup of items, comparing the performance of this subgroup to the rest of the data, denoted by “not subgroup.” The subgroup represents approximately 0.2% of all items, but it is a subgroup that we feel is important to analyze for recommendation fairness. As mentioned previously, we only know the group membership for a small percent of items; this prevents using serving-time approaches to improve the pairwise fairness metrics. Following the description in Section 3.4, we gather a dataset of random pairs of relevant items shown to the user and record when the user clicks on one of the items. We use a random half of this dataset for the pairwise regularization and the other half for evaluating the model.
We compare two versions of the model: (1) the production model trained without any attention to fairness, (2) a test model, trained with the same architecture but with the pairwise regularization to optimize for intergroup pairwise fairness. (As this is a live system with both data and training dynamics, we present a model chosen at random from a set of test models.) As we will see below, we focus on intergroup pairwise fairness as this is the area we find needing more improvement.
Due to the sensitive nature, we cannot report absolute accuracy measures. Rather, we report the relative performance between the subgroup and the rest of the data. That is, we aggregate the pairwise accuracy measures across engagement levels through a simple average, and we report the relative ratio of the average accuracy for the “not subgroup” divided by the average accuracy for the subgroup. All plots group engagement into four levels and maintain the same y-axis scaling so that relative comparisons can be made across them.
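The reported ratio can be computed as in the short Python sketch below. The per-level accuracies are invented placeholders, since absolute accuracies are not reported, and the function name is hypothetical.

```python
def relative_advantage(acc_not_subgroup, acc_subgroup):
    """Ratio of average pairwise accuracy: "not subgroup" over subgroup.

    Both arguments hold one pairwise accuracy per engagement level; the levels
    are combined with a simple (unweighted) average before taking the ratio.
    """
    if not acc_subgroup or len(acc_not_subgroup) != len(acc_subgroup):
        raise ValueError("expected one accuracy per engagement level for each group")
    avg_not = sum(acc_not_subgroup) / len(acc_not_subgroup)
    avg_sub = sum(acc_subgroup) / len(acc_subgroup)
    return avg_not / avg_sub


# Placeholder accuracies for four engagement levels (not real measurements).
ratio = relative_advantage([0.6, 0.7, 0.8, 0.9], [0.5, 0.6, 0.7, 0.7])
```

With these placeholder values the ratio is 1.2, which would be read as a 20% advantage for the “not subgroup” items; a ratio of 1.0 corresponds to parity.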
6.2 Baseline Performance
We begin with an analysis of the production system’s performance. As discussed in Section 3.3, we analyze the system’s performance in terms of: (1) pairwise fairness, (2) intragroup pairwise fairness, and (3) intergroup pairwise fairness.
As the overall pairwise fairness in Figure 2(a) shows, the production system underranks items from the subgroup when the subsequent level of engagement is low, but interestingly slightly overranks items from the subgroup when the subsequent level of engagement is high. In total, we find that the nonsubgroup items have an 8.3% advantage overall (that is, the pairwise accuracy of “not subgroup” divided by the pairwise accuracy of “subgroup” is 1.083).
Second, we examine the performance within each group, that is, the intragroup pairwise accuracy. As can be seen in Figure 3(a), across all levels of engagement the model has more difficulty selecting the clicked item when comparing subgroup items than when comparing nonsubgroup items. In total, this puts the nonsubgroup items at a 14.9% advantage in intragroup pairwise fairness. We have found that this is in part due to the subgroup being small while there is far more diversity among the nonsubgroup items, making comparisons easier. When further filtering the subgroup comparisons to remove highly similar item comparisons, we find no meaningful difference in performance between the subgroup and the nonsubgroup.
While both of the above results suggest some deficiencies, we find that the story is significantly more dramatic when looking at the intergroup pairwise accuracy. As seen in Figure 1(a), across all levels of engagement we find that the subgroup items are significantly underranked relative to the nonsubgroup items. Overall, we find that the nonsubgroup items have a 35.6% advantage. Further, we see that the pairwise accuracy for nonsubgroup items in intergroup pairs is notably higher than in intragroup pairs, suggesting that the model is taking advantage of this difference in items. This suggests that subgroup items, even when of interest to the user, are ranked under nonsubgroup items. Because of the implication on subgroup experience and the more dramatic nature of the results, we focus herein on improving the intergroup pairwise fairness.
6.3 Fairness Improvements
As described above, we apply the pairwise regularization from Section 5 over intergroup pairs of examples so as to optimize for intergroup pairwise fairness.
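As a rough illustration of what such a penalty might look like, the sketch below computes the absolute correlation between the score residual (clicked minus unclicked) and the group membership of the clicked item over a batch of intergroup pairs. This is an assumed form, not necessarily the exact formulation of Section 5; the function names and the `weight` parameter are hypothetical. In real training the penalty would be written in an automatic-differentiation framework so that it can be minimized alongside the original loss; plain Python is used here only to show the arithmetic.

```python
import math
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation; returns 0.0 when either input has zero variance."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0.0 or var_y == 0.0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def pairwise_fairness_penalty(clicked_scores: Sequence[float],
                              unclicked_scores: Sequence[float],
                              clicked_in_subgroup: Sequence[bool]) -> float:
    """Absolute correlation between the clicked-minus-unclicked residual and
    whether the clicked item belongs to the subgroup (assumed penalty form)."""
    residuals = [c - u for c, u in zip(clicked_scores, unclicked_scores)]
    membership = [1.0 if m else 0.0 for m in clicked_in_subgroup]
    return abs(pearson(residuals, membership))


def regularized_loss(original_loss: float, penalty: float,
                     weight: float = 1.0) -> float:
    """Original training loss plus the weighted fairness penalty."""
    return original_loss + weight * penalty


# Toy batch: residuals are large when a nonsubgroup item is clicked and
# negative when a subgroup item is clicked, so the penalty is close to 1.
penalty = pairwise_fairness_penalty(
    clicked_scores=[0.9, 0.8, 0.3, 0.4],
    unclicked_scores=[0.4, 0.4, 0.6, 0.6],
    clicked_in_subgroup=[False, False, True, True],
)
print(penalty, regularized_loss(original_loss=0.7, penalty=penalty, weight=0.5))
```

The penalty is zero when the residual is uncorrelated with the clicked item's group, which corresponds to the model ordering intergroup pairs equally well regardless of which group the clicked item comes from.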
We see in Figure 1 the effect of pairwise regularization on the intergroup pairwise fairness, the metric that it is most aligned with. While the regularization decreases the pairwise accuracy of the nonsubgroup items, it effectively closes the gap in the intergroup pairwise fairness metric, resulting in only a 2.6% advantage for nonsubgroup items in intergroup pairwise fairness, down from 35.6%. Further, while the decrease in pairwise accuracy for the nonsubgroup items may appear discouraging, the pairwise accuracy for the nonsubgroup in the test model is approximately on par with the pairwise accuracy metrics we see in intragroup comparisons for nonsubgroup items, suggesting the model is no longer taking advantage of the difference in items.
While not our immediate goal, we also examine how improving the intergroup fairness affects the overall pairwise fairness. As we see in Figure 2, there is a visible improvement in the pairwise accuracy for subgroup items and the gap between the groups is largely closed. Quantitatively, we observe that the relative benefit to the nonsubgroup items decreases to 2.5%, down from 8.3%. Intragroup accuracy is not optimized by our pairwise regularization configuration, and as expected we see little change in intragroup accuracy (Figure 3, with a 16.7% advantage for nonsubgroup items).
Interestingly, in most of our live experiments using models trained with pairwise regularization we found overall engagement metrics were neutral relative to the production system. Given the subgroup is a tiny fraction of the overall system, it is reassuring to see that the above fairness benefits do not come at a cost to overall performance.
Together, this shows that pairwise regularization is effective in improving the fairness properties of the ranker.
6.4 How are improvements achieved?
While the results are compelling, we do further analysis to understand how the regularization is able to close fairness gaps. To do this, we examine the exposure of items from each group compared to the user preferences, similar in principle to a coarse pairwise calibration analysis.
To understand the user preferences, we measure the percentage of intergroup pairs for which users prefer (click on) the subgroup item versus the nonsubgroup item. This presents a base click-through rate (CTR) for each group, similar to the analysis in Section 4.2. As we see in Figure 3(a), across nearly all levels of engagement, the subgroup items are less likely to be clicked when juxtaposed with a nonsubgroup item; interestingly, high-engagement interactions show a nearly even balance of likelihood of a click across the groups.
To understand how the model performs compared to this base CTR, we measure exposure: the probability of the model ranking one group’s item above that of the other group, irrespective of the user preference. (This is a slight modification of exposure as defined by Singh and Joachims [44], using a probabilistic form over intergroup comparisons.) To be precise:
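In symbols, the prose definition above can be sketched as follows. The notation is assumed here for illustration and is not taken from the source: $\hat{y}(q, v)$ denotes the model's ranking score for item $v$ given query $q$, $s_j$ denotes the group of item $v_j$, and the probability is taken over intergroup pairs $(v_j, v_{j'})$ shown to users.

```latex
\mathrm{Exposure}_s \;=\;
P\bigl(\hat{y}(q, v_j) > \hat{y}(q, v_{j'}) \;\bigm|\; s_j = s,\; s_{j'} \neq s\bigr)
```

Note that the conditioning involves only the group memberships of the two items and not which item was clicked, matching the phrase "irrespective of the user preference".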
As we see in Figure 3(b), the production model exposes each group at approximately the same rate as the group’s base CTR.
As we can see in Figure 3(c), the exposure of items from each group changes significantly when the model is trained with the pairwise regularization. Even at lower levels of engagement, items from the subgroup are ranked higher at a significantly higher rate than the base CTR. This suggests that the regularizer has the effect of showing subgroup items at a higher rate than is natural so as to make sure users interested in subgroup items are recommended them. This aligns with Lemma 1, which suggests a general tension between calibration and pairwise fairness. We believe further research on this relationship, and more generally on how to improve model accuracy, can help alleviate this tension, but for the time being we find this to be a reasonable tradeoff.
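The comparison between base CTR and exposure described in this section can be sketched as below. The `IntergroupPair` record and the toy scores are hypothetical and invented for illustration; each record is assumed to hold one subgroup item and one nonsubgroup item shown together, the model's score for each, and which of the two the user clicked.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class IntergroupPair:
    """A subgroup item and a nonsubgroup item shown together (hypothetical)."""
    subgroup_score: float     # model score for the subgroup item
    nonsubgroup_score: float  # model score for the nonsubgroup item
    subgroup_clicked: bool    # True if the user clicked the subgroup item


def base_ctr_subgroup(pairs: Sequence[IntergroupPair]) -> float:
    """Share of intergroup pairs in which the user clicked the subgroup item."""
    return sum(1 for p in pairs if p.subgroup_clicked) / len(pairs)


def exposure_subgroup(pairs: Sequence[IntergroupPair]) -> float:
    """Share of intergroup pairs in which the model scores the subgroup item
    higher, regardless of which item the user clicked."""
    ranked_higher = sum(1 for p in pairs
                        if p.subgroup_score > p.nonsubgroup_score)
    return ranked_higher / len(pairs)


# Toy scores for the same four user interactions under two models.
baseline = [
    IntergroupPair(0.2, 0.6, False),
    IntergroupPair(0.7, 0.4, True),
    IntergroupPair(0.3, 0.9, False),
    IntergroupPair(0.5, 0.8, False),
]
regularized = [
    IntergroupPair(0.5, 0.6, False),
    IntergroupPair(0.8, 0.4, True),
    IntergroupPair(0.7, 0.6, False),
    IntergroupPair(0.5, 0.8, False),
]
print(base_ctr_subgroup(baseline))       # user preference for subgroup items
print(exposure_subgroup(baseline))       # exposure under the baseline scores
print(exposure_subgroup(regularized))    # exposure under the regularized scores
```

In this toy example the baseline exposure equals the base CTR (both 0.25), while the regularized scores expose subgroup items in half of the pairs. This mirrors, in miniature, the pattern described for Figures 3(b) and 3(c): exposure above the base CTR trades some calibration for a lower chance of burying subgroup items that users would have clicked.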
7 Conclusion
In this work we have provided a tractable way to get unbiased measurements of recommender system ranking fairness. We are able to do this through pairwise experiments to observe user preferences. Based on this experimental data, we can evaluate and decompose recommender system fairness to see if a model systematically misranks or underranks items from a particular group. We show that this measure aligns with ranking fairness definitions but is not covered by pointwise fairness measures. We ultimately offer a novel pairwise regularization approach to improve recommender system fairness during training, and show that it significantly improves fairness metrics in a large-scale production system.
Acknowledgements: The authors would like to thank Ben Packer, Xuezhi Wang, and Andrew Cotter for their helpful comments during the preparation of this paper.
References
 Adams and Zemel [2011] R. P. Adams and R. S. Zemel. Ranking via sinkhorn propagation. arXiv preprint arXiv:1106.1925, 2011.
 Agarwal et al. [2018a] A. Agarwal, A. Beygelzimer, M. Dudík, J. Langford, and H. Wallach. A reductions approach to fair classification. arXiv preprint arXiv:1803.02453, 2018a.
 Agarwal et al. [2018b] A. Agarwal, I. Zaitsev, X. Wang, C. Li, M. Najork, and T. Joachims. Estimating position bias without intrusive interventions. arXiv preprint arXiv:1812.05161, 2018b.
 Bakshy et al. [2015] E. Bakshy, S. Messing, and L. A. Adamic. Exposure to ideologically diverse news and opinion on facebook. Science, 348(6239):1130–1132, 2015.
 Bechavod and Ligett [2017] Y. Bechavod and K. Ligett. Penalizing unfairness in binary classification. arXiv preprint arXiv:1707.00044, 2017.
 Bello et al. [2018] I. Bello, S. Kulkarni, S. Jain, C. Boutilier, E. Chi, E. Eban, X. Luo, A. Mackey, and O. Meshi. Seq2slate: Reranking and slate optimization with rnns. arXiv preprint arXiv:1810.02019, 2018.
 Beutel et al. [2017a] A. Beutel, J. Chen, Z. Zhao, and E. H. Chi. Data decisions and theoretical implications when adversarially learning fair representations. arXiv preprint arXiv:1707.00075, 2017a.
 Beutel et al. [2017b] A. Beutel, E. H. Chi, Z. Cheng, H. Pham, and J. Anderson. Beyond globally optimal: Focused learning for improved recommendations. In WWW, pages 203–212, 2017b.
 Beutel et al. [2018] A. Beutel, P. Covington, S. Jain, C. Xu, J. Li, V. Gatto, and E. H. Chi. Latent cross: Making use of context in recurrent recommender systems. In WSDM, pages 46–54, 2018.
 Beutel et al. [2019] A. Beutel, J. Chen, T. Doshi, H. Qian, A. Woodruff, C. Luu, P. Kreitmann, J. Bischof, and E. H. Chi. Putting fairness principles into practice: Challenges, metrics, and improvements. arXiv preprint arXiv:1901.04562, 2019.
 Biega et al. [2018] A. J. Biega, K. P. Gummadi, and G. Weikum. Equity of attention: Amortizing individual fairness in rankings. In The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval, SIGIR 2018, Ann Arbor, MI, USA, July 08-12, 2018, pages 405–414, 2018.
 Borkan et al. [2019] D. Borkan, L. Dixon, J. Sorensen, N. Thain, and L. Vasserman. Nuanced metrics for measuring unintended bias with real data for text classification. 2019.

 Calders and Verwer [2010] T. Calders and S. Verwer. Three naive bayes approaches for discrimination-free classification. Data Mining and Knowledge Discovery, 21(2):277–292, 2010.
 Cao et al. [2007] Z. Cao, T. Qin, T.-Y. Liu, M.-F. Tsai, and H. Li. Learning to rank: from pairwise approach to listwise approach. In Proceedings of the 24th international conference on Machine learning, pages 129–136. ACM, 2007.
 Chen et al. [2018] M. Chen, A. Beutel, P. Covington, S. Jain, F. Belletti, and E. Chi. Top-k off-policy correction for a reinforce recommender system. arXiv preprint arXiv:1812.02353, 2018.
 Covington et al. [2016] P. Covington, J. Adams, and E. Sargin. Deep neural networks for youtube recommendations. In RecSys, pages 191–198, 2016.
 Crowson et al. [2016] C. S. Crowson, E. J. Atkinson, and T. M. Therneau. Assessing calibration of prognostic risk scores. Statistical methods in medical research, 25(4):1692–1706, 2016.
 Dixon et al. [2018] L. Dixon, J. Li, J. Sorensen, N. Thain, and L. Vasserman. Measuring and mitigating unintended bias in text classification. 2018.
 Dwork et al. [2012] C. Dwork, M. Hardt, T. Pitassi, O. Reingold, and R. Zemel. Fairness through awareness. In Proceedings of the 3rd innovations in theoretical computer science conference, pages 214–226. ACM, 2012.
 Edwards and Storkey [2015] H. Edwards and A. Storkey. Censoring representations with an adversary. arXiv preprint arXiv:1511.05897, 2015.
 Ekstrand et al. [2018] M. D. Ekstrand, M. Tian, M. R. I. Kazi, H. Mehrpouyan, and D. Kluver. Exploring author gender in book rating and recommendation. In RecSys, pages 242–250, 2018.
 Goh et al. [2016] G. Goh, A. Cotter, M. R. Gupta, and M. P. Friedlander. Satisfying real-world goals with dataset constraints. In Advances in Neural Information Processing Systems, pages 2415–2423, 2016.
 Hardt et al. [2016] M. Hardt, E. Price, and N. Srebro. Equality of opportunity in supervised learning. In Advances in neural information processing systems, pages 3315–3323, 2016.
 He et al. [2014] X. He, J. Pan, O. Jin, T. Xu, B. Liu, T. Xu, Y. Shi, A. Atallah, R. Herbrich, S. Bowers, et al. Practical lessons from predicting clicks on ads at facebook. In Proceedings of the Eighth International Workshop on Data Mining for Online Advertising, pages 1–9. ACM, 2014.
 He et al. [2017] X. He, L. Liao, H. Zhang, L. Nie, X. Hu, and T.-S. Chua. Neural collaborative filtering. In Proceedings of the 26th International Conference on World Wide Web, pages 173–182, 2017.
 Hidasi et al. [2015] B. Hidasi, A. Karatzoglou, L. Baltrunas, and D. Tikk. Session-based recommendations with recurrent neural networks. arXiv preprint arXiv:1511.06939, 2015.
 Jiang et al. [2019] R. Jiang, S. Chiappa, T. Lattimore, A. György, and P. Kohli. Degenerate feedback loops in recommender systems. 2019.
 Joachims [2002] T. Joachims. Optimizing search engines using clickthrough data. In KDD, pages 133–142, 2002.
 Joachims et al. [2017] T. Joachims, A. Swaminathan, and T. Schnabel. Unbiased learning-to-rank with biased feedback. In WSDM, pages 781–789, 2017.
 Kamishima et al. [2011] T. Kamishima, S. Akaho, and J. Sakuma. Fairness-aware learning through regularization approach. In 2011 IEEE 11th International Conference on Data Mining Workshops (ICDMW), pages 643–650. IEEE, 2011.
 Kleinberg and Raghavan [2018] J. Kleinberg and M. Raghavan. Selection problems in the presence of implicit bias. arXiv preprint arXiv:1801.03533, 2018.
 Kleinberg et al. [2016] J. Kleinberg, S. Mullainathan, and M. Raghavan. Inherent tradeoffs in the fair determination of risk scores. arXiv preprint arXiv:1609.05807, 2016.
 Koren [2009] Y. Koren. Collaborative filtering with temporal dynamics. In KDD, pages 447–456, 2009.
 Koren et al. [2009] Y. Koren, R. Bell, and C. Volinsky. Matrix factorization techniques for recommender systems. Computer, (8):30–37, 2009.
 Louizos et al. [2015] C. Louizos, K. Swersky, Y. Li, M. Welling, and R. Zemel. The variational fair autoencoder. arXiv preprint arXiv:1511.00830, 2015.
 Ma et al. [2018] J. Ma, Z. Zhao, X. Yi, J. Chen, L. Hong, and E. H. Chi. Modeling task relationships in multi-task learning with multi-gate mixture-of-experts. In KDD, pages 1930–1939, 2018.
 Madras et al. [2018] D. Madras, E. Creager, T. Pitassi, and R. Zemel. Learning adversarially fair and transferable representations. arXiv preprint arXiv:1802.06309, 2018.
 McMahan et al. [2013] H. B. McMahan, G. Holt, D. Sculley, M. Young, D. Ebner, J. Grady, L. Nie, T. Phillips, E. Davydov, D. Golovin, et al. Ad click prediction: a view from the trenches. In KDD, pages 1222–1230, 2013.
 Mehrotra et al. [2018] R. Mehrotra, J. McInerney, H. Bouchard, M. Lalmas, and F. Diaz. Towards a fair marketplace: Counterfactual evaluation of the tradeoff between relevance, fairness & satisfaction in recommendation systems. In CIKM, pages 2243–2251, 2018.
 Pleiss et al. [2017] G. Pleiss, M. Raghavan, F. Wu, J. Kleinberg, and K. Q. Weinberger. On fairness and calibration. In Advances in Neural Information Processing Systems, pages 5680–5689, 2017.
 Ritov et al. [2017] Y. Ritov, Y. Sun, and R. Zhao. On conditional parity as a notion of non-discrimination in machine learning. arXiv preprint arXiv:1706.08519, 2017.
 Schnabel et al. [2016a] T. Schnabel, A. Swaminathan, P. I. Frazier, and T. Joachims. Unbiased comparative evaluation of ranking functions. In Proceedings of the 2016 ACM International Conference on the Theory of Information Retrieval, pages 109–118. ACM, 2016a.
 Schnabel et al. [2016b] T. Schnabel, A. Swaminathan, A. Singh, N. Chandak, and T. Joachims. Recommendations as treatments: Debiasing learning and evaluation. arXiv preprint arXiv:1602.05352, 2016b.
 Singh and Joachims [2018] A. Singh and T. Joachims. Fairness of exposure in rankings. In KDD, pages 2219–2228, 2018.
 Singh and Joachims [2019] A. Singh and T. Joachims. Policy learning for fairness in ranking. arXiv preprint arXiv:1902.04056, 2019.
 Stoyanovich et al. [2018] J. Stoyanovich, K. Yang, and H. Jagadish. Online set selection with fairness and diversity constraints. In Proceedings of the EDBT Conference, 2018.
 Wang et al. [2011] L. Wang, J. Lin, and D. Metzler. A cascade ranking model for efficient ranked retrieval. In SIGIR, pages 105–114, 2011.
 Wu et al. [2017] C.-Y. Wu, A. Ahmed, A. Beutel, A. J. Smola, and H. Jing. Recurrent recommender networks. In WSDM, pages 495–503, 2017.
 Yao and Huang [2017] S. Yao and B. Huang. Beyond parity: Fairness objectives for collaborative filtering. In Advances in Neural Information Processing Systems, pages 2921–2930, 2017.
 Yi et al. [2014] X. Yi, L. Hong, E. Zhong, N. N. Liu, and S. Rajan. Beyond clicks: dwell time for personalization. In Proceedings of the 8th ACM Conference on Recommender systems, pages 113–120. ACM, 2014.
 Zafar et al. [2015] M. B. Zafar, I. Valera, M. G. Rodriguez, and K. P. Gummadi. Fairness constraints: Mechanisms for fair classification. arXiv preprint arXiv:1507.05259, 2015.
 Zehlike et al. [2017] M. Zehlike, F. Bonchi, C. Castillo, S. Hajian, M. Megahed, and R. Baeza-Yates. FA*IR: A fair top-k ranking algorithm. In CIKM, pages 1569–1578. ACM, 2017.
 Zemel et al. [2013] R. Zemel, Y. Wu, K. Swersky, T. Pitassi, and C. Dwork. Learning fair representations. In ICML, pages 325–333, 2013.
 Zhang et al. [2018] B. H. Zhang, B. Lemoine, and M. Mitchell. Mitigating unwanted biases with adversarial learning. CoRR, abs/1801.07593, 2018.

 Zhu et al. [2018] Z. Zhu, X. Hu, and J. Caverlee. Fairness-aware tensor-based recommendation. In CIKM, pages 1153–1162, 2018.