Recommender systems are bridging users and relevant products, services and peers on the Web. By leveraging past behavioural data, such automated systems aim to understand users’ preferences and predict their future interests Ricci et al. (2015). Notable examples are integrated in platforms from different contexts, including e-commerce (Amazon, eBay), multimedia (YouTube, Netflix), and education (Coursera, Udemy). The future success of these platforms depends also on the effectiveness of the underlying recommender system.
The increasing adoption of recommender systems in online platforms has spurred investigations on issues of bias in their internal mechanisms. One aspect that has received attention so far is the tendency of recommender systems to emphasize a “rich-get-richer” effect in favor of a few popular items Nikolov et al. (2019). Such a phenomenon leads to a loop where recommender systems trained on data non-uniformly distributed across items tend to suggest popular items more than niche items, even when the latter would be of interest. Hence, popular items gain more visibility and become more likely to be selected. The awareness of this type of bias might even lead providers to bribe users, so that they rate or increase the ratings given to their items, thus allowing these items to get more visibility Ramos et al. (2020); Saúde et al. (2017). The training data will thus become more and more imbalanced towards popular items (Figure 1; all figures in this manuscript are best seen in color).
Recommender systems suggesting what is popular have proven to be competitive baselines in terms of accuracy Jannach et al. (2015). However, it has been recognized that other beyond-accuracy aspects, such as whether recommendations are novel and cover the catalog well, may positively impact the overall recommendation quality Bobadilla et al. (2013). In this view, popularity bias can lead to issues, such as filter bubbles, which may hamper user interest and beyond-accuracy aspects Ciampaglia et al. (2018); Cañamares and Castells (2018); Mehrotra et al. (2018). Since trading such qualities for item popularity is unlikely to be accepted, debiasing popularity can help to reach a better trade-off between accuracy and beyond-accuracy goals, improving the quality of recommendations on the whole Kaminskas and Bridge (2017).
Existing frameworks and procedures for popularity debiasing Kamishima et al. (2014); Hou et al. (2018); Abdollahpouri et al. (2018, 2017) are often based on bias metrics that do not account for user preferences, thus being assessed only on the level of popularity of items in a recommended list. It should be noted that popularity cannot be an objective concept and it strongly depends on user preferences and on how data has been collected. It follows that popularity metrics and debiasing procedures need to account for user preferences and the visibility that is given to the items thanks to recommendations, creating a bridge between these perspectives and beyond-accuracy objectives.
In this paper, we tackle this challenge with a new popularity-debiasing framework. Two novel metrics quantify how equally a recommender treats items along the popularity tail. The first metric encourages similar probabilities of being recommended among items, which is important when platform owners may be interested in equally suggesting items (e.g., loan platforms). The second metric takes into account the ground-truth user preference, and encourages true positive rates of items to be equal. This becomes useful in contexts where the platform owners may desire to preserve the imbalance across items in data, while avoiding any further distortion on recommendations caused by algorithmic bias.
Then, we empirically show that two widely-adopted classes of recommendation algorithms (i.e., point-wise and pair-wise) are biased towards item popularity with respect to the proposed metrics. To limit this side effect, we propose an approach based on (i) a data sampling that balances the input samples where the observed item is more (less) popular than the unobserved item, and (ii) a regularization term that minimizes the correlation between user-item relevance and item popularity. Experiments show that the proposed approach provides a more equal treatment of items along the popularity tail. With a minimum loss in accuracy, it also leads to important gains in novelty and catalog coverage, known to provide benefits to the overlying platform.
2 Related Work
This research relates to and builds on literature from the recommender system and machine learning communities.
2.1 Popularity Bias in Recommendation
Treating popularity bias has often required a multi-objective setting that investigates any accuracy loss resulting from taking popularity into consideration. Therefore, the final goal has been to strike a balance between accuracy (e.g., precision, recall) and pre-conceptualized bias metrics (e.g., average recommended item popularity, average percentage and coverage of tail items) Abdollahpouri et al. (2017, 2019). Bias metrics tend to measure distribution skews for a complete system rather than for individual items; often require knowing head/tail memberships for all items, which is indeed arbitrary and highly variable across datasets; and evaluate bias without considering the ground truth of the user interest. We seek to address this gap with two bias metrics tailored for individual items: the first one enforces ranking probabilities for items to be the same, and the second one encourages true positive rates of items to be the same.
Under this setting, pre-processing operations alter training data to reduce the impact of imbalance. For instance, Park and Tuzhilin (2008) split the item set into head and tail, and separately cluster their ratings. Tail recommendations leverage ratings from the corresponding cluster, while the head ones use ratings from individual items. In Jannach et al. (2015), the authors sample input user-item pairs where the observed item is less popular than the unobserved one. The work in Chen et al. (2018) suggests picking unobserved items following a probability distribution based on popularity. Differently, our data sampling balances cases where the observed item is more (less) popular than the unobserved item of the input sample, to better fit with our regularization.
In-processing countermeasures modify an existing algorithm to simultaneously consider relevance and popularity, doing a joint optimization or using one criterion as a constraint for the other. The authors in Oh et al. (2011) consider the individual user popularity tendency to recommend tail items. In Kamishima et al. (2014), the authors propose to enhance the statistical independence between a recommendation and its popularity. Similarly, the work in Abdollahpouri et al. (2017) pushes RankALS algorithm optimization towards recommended lists that balance accuracy and head/tail recommendations. In Hou et al. (2018), common neighbours between two items with given popularity are retrieved, then a balanced common-neighbour similarity index is obtained, after pruning popular neighbours. Our work differs from prior literature both in expressiveness and in the algorithm itself. Furthermore, we experimented with learning-to-rank recommendation approaches, using point-wise and pair-wise as use cases.
Post-processing countermeasures seek to re-rank a recommended list according to certain constraints. The work in Abdollahpouri et al. (2019, 2018) presents two approaches for controlling item exposure; the first one adapts the xQuAD algorithm for balancing the trade-off between accuracy and mid-tail item coverage; the second one multiplies the relevance score for a given item with a weight inversely proportional to the item popularity; items are re-ranked according to the weighted scores. In Jannach et al. (2015), user-specific weights help to balance accuracy and popularity. Post-processing countermeasures can provide elegant solutions, but often require knowing head/tail memberships for all items, may be really sensitive to predicted relevance distribution, and may influence the platform efficiency.
2.2 Bias Mitigation in Machine Learning
Existing literature has primarily focused on biases in classification, with definitions mostly targeting fairness Singh and Joachims (2018). In our study, we primarily reformulate the statistical parity and the equality of opportunity notions Hardt et al. (2016); Tolan (2019) for the item popularity bias problem, as they were conceived to measure differences in accuracy across users in fairness-aware classification. Compared to previous works, our metrics measure bias on individual items rather than fairness on classes of protected users, and do not require any intrinsic notion of group membership for an item. Moreover, our metrics align with the probability of being recommended in a top-k list rather than being classified to a given label.
Many approaches have been proposed to address bias and fairness issues in machine learning. Notable examples of in-processing approaches mostly target fairness among user groups and include three main categories: constraint-based optimization Agarwal et al. (2018); Goh et al. (2016), adversarial learning Beutel et al. (2017); Edwards and Storkey (2015), and regularization over predictions Beutel et al. (2019a); Kamishima et al. (2011); Beutel et al. (2019b). Our approach builds on and reformulates the latter class of strategies to fit with the popularity debiasing task. Differently from Beutel et al. (2019a, b), we relax the assumption of knowing group memberships of input samples, targeting individual items regardless of their head/tail membership. In our setting, we are concerned with relative differences in relevance and in popularity rather than in predicted labels and in group labels, which differently drives optimization during training and allows flexible data sampling strategies. Furthermore, as we tackle a popularity bias task rather than an unfairness mitigation task, our design choices lead to consider processes and model facets so far under-explored.
In this section, we formalize the main concepts underlying our study, including the recommender system and the new popularity-bias metrics we introduce.
3.1 Recommender System Formalization
Given a set of users $U$ and a set of items $I$, we assume that users have expressed their interest for a subset of items in $I$. The collected feedback from observed user-item interactions can be abstracted to a set of $(u, i)$ pairs implicitly obtained from user activity or $(u, i, v)$ triplets explicitly constructed, with $v \in V$. Elements in $V$ may be either ratings or frequencies (e.g., play counts). We denote the user-item matrix by $R \in \mathbb{R}^{|U| \times |I|}$, with $R(u,i) = 1$ for implicit feedback or $R(u,i) = v$ for explicit feedback or frequencies to indicate the (level of) preference of $u$ for $i$, and $R(u,i) = 0$ otherwise.
Given this input, the recommender system’s task is to predict unobserved user-item relevance scores, and deliver a set of ranked items. To this end, we assume that a function estimates relevance scores of unobserved entries in $R$ for a given user, and the recommender system uses them for ranking the items. Formally, it can be abstracted as learning $\tilde{R}(u,i) = f_\theta(u,i)$, where $\tilde{R}(u,i)$ denotes the predicted relevance, $\theta$ denotes model parameters, and $f$ denotes the function that maps model parameters to the predicted score.
We assume that each user/item is internally represented through a $D$-sized numerical vector. More precisely, the model includes a user-vector matrix $W$ and an item-vector matrix $X$. We also assume that the function $f$ is parametrized by $\theta$ and depends on the specific recommendation algorithm under consideration. The higher the predicted relevance $\tilde{R}(u,i)$ is, the higher the relevance of item $i$ for user $u$ is. To rank items, they are sorted by decreasing relevance, and the top-$k$ items are recommended.
3.2 Popularity Bias Metric Formalization
As we deal with item popularity biases, we define key concepts on what we consider as popularity bias and how we measure it throughout the paper.
Item Statistical Parity (ISP). Recommender systems may be inherently influenced by item popularity. More popular items remain more popular since they are more likely to appear at the top of the recommended list. This inadvertently leads to few mainstream items being recommended and makes it harder for items at the tail end of the popularity spectrum to attract users. In this scenario, we assume that the recommender systems powering online environments should equally cover items along the popularity tail. For instance, this notion may be useful when platform owners manage recommendations of individuals (e.g., people recommendations) or of particularly delicate elements (e.g., loans).
To measure this property, we reformulate the statistical parity principle introduced in fairness-aware machine learning Tolan (2019) for the popularity debiasing task. This implies controlling item statistical parity, i.e., equalizing the outcomes across individual items. We operationalize this concept for items by computing the ratio between the number of users each item is recommended to and the number of users who can receive that item in the recommended list. We then encourage such a ratio to be similar among items. Considering that only top-k items are recommended, we assume that the outcome for an item is the probability of being ranked in the top-k list. We define such probability as:

$$p(i \mid k, \theta) = \frac{\sum_{u \in U} \phi(u, i \mid k, \theta)}{\left|\{u \in U : R(u,i) = 0\}\right|} \tag{1}$$

where $\phi(u, i \mid k, \theta) = 1$ if item $i$ is ranked in the top-k for user $u$ by the model with parameters $\theta$, and $0$ otherwise, and $R(u,i) = 0$ indicates that user $u$ has not interacted with item $i$. In other words, the numerator counts the number of users who are recommended item $i$ in the top-k, while the denominator counts the number of users who have never interacted with item $i$, and thus may receive $i$ as a recommendation. Last, we compute the inverse of the Gini index (i.e., its complement to 1) as a measure of distribution equality of $p(i \mid k, \theta)$ across items. The Gini index is a well-known scale-independent and bounded measure of inequality that ranges between 0 and 1. Higher values represent higher inequality. It is used as follows:

$$\mathrm{ISP}@k = 1 - \mathrm{Gini}\left(\{p(i \mid k, \theta) : i \in I\}\right) \tag{2}$$

If there is a perfectly equal distribution of recommendations across items, then $\mathrm{ISP}@k = 1$ and statistical parity is met. $\mathrm{ISP}@k$ decreases and gets closer to 0 when the distribution of the recommendations is more unequal. For example, this occurs if most of the items never appear in the recommended lists. In the extreme, where the same items appear in all the recommendations, $\mathrm{ISP}@k$ is very close to 0. Thus, $\mathrm{ISP}@k$ lies between 0 and 1, and the greater it is, the more equally the recommendations are distributed.
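The ISP computation described above can be sketched in a few lines of Python. This is a minimal illustration under assumptions, not the authors' implementation: the function and argument names are hypothetical, the user population is taken to be the users who received a list, and the rank-based Gini estimator is one common choice that the text does not specify.

```python
from typing import Dict, List, Set


def gini(values: List[float]) -> float:
    """Gini index of non-negative values (0 means perfectly equal)."""
    n = len(values)
    total = sum(values)
    if n == 0 or total == 0:
        return 0.0  # assumption: an empty or all-zero input counts as equal
    ordered = sorted(values)
    # Rank-weighted formula on ascending values, with 1-based ranks.
    weighted = sum((2 * (r + 1) - n - 1) * v for r, v in enumerate(ordered))
    return weighted / (n * total)


def item_statistical_parity(
    top_k: Dict[str, List[str]],   # user -> recommended top-k items
    train: Dict[str, Set[str]],    # user -> items already interacted with
    items: List[str],
) -> float:
    """Complement of the Gini index over per-item recommendation rates."""
    probs = []
    for item in items:
        # Numerator: users who receive the item in their top-k list.
        recommended = sum(1 for recs in top_k.values() if item in recs)
        # Denominator: users who never interacted with the item.
        eligible = sum(1 for u in top_k if item not in train.get(u, set()))
        probs.append(recommended / eligible if eligible else 0.0)
    return 1.0 - gini(probs)
```

With this estimator the Gini index of a catalog of n items is at most (n - 1) / n, so the score only approaches 0 for large catalogs.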
In some cases, platform owners may not want to equalize the recommendations along the entire popularity tail. High statistical parity could lead to situations in which even items of very low interest get recommended as often as items of higher interest. This motivated us to complement our measurement of popularity bias with another representative metric.
Item Equal Opportunity (IEO). Instead of equalizing the recommendations themselves, we can equalize some statistics of the algorithm’s effectiveness (e.g., true positive rate across items). In many applications, platform owners may care more about preserving and retaining a certain degree of item popularity, while checking that no further distortions are emphasized by algorithmic bias on recommendation distributions. For instance, guaranteeing a certain degree of popularity into recommendations resulted in higher user acceptance in contexts like tourism Cremonesi et al. (2014). In this view, an unbiased algorithm would recommend each item proportionally to its representation in the ground-truth user preference.
We operationalize the concept of equal opportunity across items by encouraging the true positive rates of different items to be the same. Its formulation builds upon the corresponding concept from the fairness-aware machine learning domain Hardt et al. (2016). Specifically, we define the true positive rate as the probability of being ranked within top-k, given the ground-truth that the item is relevant for the user in the test set. This is denoted by $p(i \mid k, \theta, y = 1)$, where $y = 1$ defines that items are relevant for users in the ground truth. We formalize it as follows:

$$p(i \mid k, \theta, y = 1) = \frac{\sum_{u \in U} \phi(u, i \mid k, \theta) \, y(u,i)}{\sum_{u \in U} y(u,i)} \tag{3}$$

where $\phi(u, i \mid k, \theta) = 1$ if item $i$ is ranked in the top-k for user $u$ and $0$ otherwise, and $y(u,i) = 1$ if item $i$ is relevant for user $u$ in the ground truth and $0$ otherwise. While in this work we focus on an offline evaluation setting and the test set represents our ground truth, in case of online evaluation (e.g., A/B testing) $y(u,i) = 1$ if the user accepted the recommendation. The numerator counts the number of users who consider item $i$ as relevant in the test set and receive item $i$ in the top-k. The denominator counts the number of users who consider item $i$ as relevant in the test set. Last, we compute the inverse of the Gini index (i.e., its complement to 1) across these probabilities, as follows:

$$\mathrm{IEO}@k = 1 - \mathrm{Gini}\left(\{p(i \mid k, \theta, y = 1) : i \in I\}\right) \tag{4}$$

If there is a perfect equality of being recommended when items are known to be of interest, then $\mathrm{IEO}@k = 1$. Conversely, $\mathrm{IEO}@k$ decreases and gets closer to 0 when the probability of being recommended is high for only a few items of interest in the test set. This is the case when most of the niche items never appear in the recommended lists, even if they are of interest (i.e., algorithmic bias emphasized the popularity phenomenon). Thus, $\mathrm{IEO}@k$ ranges between 0 and 1, and the greater it is, the less the popularity bias is emphasized.
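The IEO computation can be sketched in the same spirit. Again this is a hedged illustration: the names are hypothetical, the Gini estimator is an assumed choice, and skipping items with no relevant user in the test set (whose true positive rate is undefined) is an assumption the text does not state.

```python
from typing import Dict, List, Set


def gini(values: List[float]) -> float:
    """Gini index of non-negative values (0 means perfectly equal)."""
    n = len(values)
    total = sum(values)
    if n == 0 or total == 0:
        return 0.0  # assumption: an empty or all-zero input counts as equal
    ordered = sorted(values)
    weighted = sum((2 * (r + 1) - n - 1) * v for r, v in enumerate(ordered))
    return weighted / (n * total)


def item_equal_opportunity(
    top_k: Dict[str, List[str]],   # user -> recommended top-k items
    test: Dict[str, Set[str]],     # user -> items relevant in the ground truth
    items: List[str],
) -> float:
    """Complement of the Gini index over per-item true positive rates."""
    rates = []
    for item in items:
        relevant_users = [u for u, rel in test.items() if item in rel]
        if not relevant_users:
            continue  # assumption: items without test feedback are skipped
        # Users for whom the item is relevant AND appears in their top-k.
        hits = sum(1 for u in relevant_users if item in top_k.get(u, []))
        rates.append(hits / len(relevant_users))
    return 1.0 - gini(rates)
```

Unlike ISP, this score does not penalize a popular item for being recommended often, as long as niche items are also recommended to the users who actually want them.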
While it is the responsibility of scientists to bring forth the discussion about metrics for popularity bias, and possibly to design algorithms to control them by tuning parameters, it should be noted that it is ultimately up to the stakeholders to select the metrics and the trade-offs most suitable for their context.
4 Exploratory Analysis on Point- and Pair-wise Recommendation
In this section, we show that representative algorithms from two of the most widely-adopted learning-to-rank families Zhang et al. (2019), namely point-wise and pair-wise, produce biased recommendations with respect to ISP and IEO.
Since there is no standard benchmark framework for popularity bias assessment in recommendation, we adopt two public datasets with diverse item distribution skews (Fig. 1). This analysis treats ratings as positive feedback to indicate that users are interested in the items they rated. Being oriented to learning-to-rank contexts, our analysis and the proposed debiasing approach can be applied to rating or frequency matrices as well.
MovieLens1M (ML1M) Harper and Konstan (2016) contains 998,131 ratings applied to 3,705 movies by 6,040 users of the online service MovieLens. The sparsity of the user-item matrix is 0.95. Each user rated at least 20 movies.
COCO600k (COCO) Dessì et al. (2018) contains 617,588 ratings applied to 30,399 courses by 37,040 learners of an online platform. The sparsity of the user-item matrix is 0.99. Each learner rated at least 10 courses.
4.2 Recommendation Algorithms and Protocols
We consider four different methods, and investigate the recommendations they generate. Two of them are baseline recommenders (Random and MostPop) with opposite behavior with respect to item popularity: Random is insensitive to popularity and uniformly recommends items, while MostPop ignores tail items, and suggests the same few popular items to everyone. Even though comparing an algorithm against Random and MostPop has been previously studied Jannach et al. (2015); Boratto et al. (2019), there is no evidence on how the new bias metrics model their outcomes. The other two algorithms, NeuMF He and Chua (2017) and BPR Rendle and Freudenthaler (2014), belong to the point-wise and the pair-wise family, respectively. They were chosen due to their performance and wide adoption as a key block of several point- and pair-wise methods Zhang et al. (2018); Deng et al. (2019); Xue et al. (2017). Our approach makes it easy to re-run our analyses on additional algorithms.
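Point-wise approaches generally estimate model parameters $\theta$ by minimizing the margin between the relevance $f_\theta(u,i)$ predicted for an observed item $i$ and the true relevance $R(u,i)$, given interactions in $R$:

$$\min_{\theta} \sum_{u \in U} \sum_{i \in I_u} \mathcal{L}^{pnt}\left(f_\theta(u,i), R(u,i)\right) \tag{5}$$

where $I_u$ is the set of items of interest for $u$ and $\mathcal{L}^{pnt}$ measures the point-wise margin. Conversely, pair-wise approaches estimate parameters $\theta$ by maximizing the margin between the relevance $f_\theta(u,i)$ predicted for an observed item $i$ and the relevance $f_\theta(u,j)$ predicted for an unobserved item $j$, given interactions in $R$:

$$\max_{\theta} \sum_{u \in U} \sum_{i \in I_u^{+}} \sum_{j \in I_u^{-}} \mathcal{L}^{pair}\left(f_\theta(u,i), f_\theta(u,j)\right) \tag{6}$$

where $I_u^{+}$ and $I_u^{-}$ are the sets of items of interest and not of interest for $u$, and $\mathcal{L}^{pair}$ measures the pair-wise margin.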
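To make the two optimization families concrete, the sketch below shows one common per-sample instantiation of each: binary cross-entropy for the point-wise case and the BPR log-sigmoid criterion for the pair-wise case. The function names are hypothetical and these specific loss forms are assumptions of this sketch, since the text only describes the margins in general terms. Both are written as quantities to minimize, so maximizing the pair-wise margin corresponds to minimizing the second loss.

```python
import math


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def pointwise_loss(score: float, label: int) -> float:
    """Binary cross-entropy between sigmoid(score) and a 0/1 label."""
    p = sigmoid(score)
    eps = 1e-12  # avoids log(0) for saturated predictions
    return -(label * math.log(p + eps) + (1 - label) * math.log(1 - p + eps))


def pairwise_loss(score_observed: float, score_unobserved: float) -> float:
    """BPR-style loss: small when the observed item outscores the other."""
    margin = score_observed - score_unobserved
    return -math.log(sigmoid(margin) + 1e-12)
```

The point-wise loss looks at one user-item pair at a time, whereas the pair-wise loss depends only on the difference between two scores for the same user.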
We performed a temporal train-test split with the most recent 20% of ratings per user in the test set and the remaining 80% oldest ones in the training set. Embedding matrices are initialized with values uniformly distributed in the range [0, 1]. Each model is served with batches of samples. For NeuMF, for each user $u$, we created negative samples for each positive sample. For BPR, we created triplets $(u, i, j)$ per observed item $i$; the unobserved item $j$ is randomly selected. Such parameters were tuned in order to find a balance between training effectiveness and training efficiency. In our study, we are more interested in better understanding beyond-accuracy characteristics of algorithms, so the further accuracy improvements that can probably be achieved through hyper-parameter tuning would not substantially affect the outcomes of our analyses.
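The per-user temporal split can be sketched as follows. This is a minimal illustration, assuming ratings are available as `(user, item, timestamp)` tuples; the function name and the rounding of the per-user test size are assumptions, as the text only states the 80/20 proportion.

```python
from collections import defaultdict
from typing import List, Tuple


def temporal_split(
    ratings: List[Tuple[str, str, int]],  # (user, item, timestamp)
    test_ratio: float = 0.2,
) -> Tuple[List[Tuple[str, str]], List[Tuple[str, str]]]:
    """Put each user's most recent ratings in the test set."""
    by_user = defaultdict(list)
    for user, item, ts in ratings:
        by_user[user].append((ts, item))

    train, test = [], []
    for user, events in by_user.items():
        events.sort()  # oldest first
        n_test = int(round(len(events) * test_ratio))  # assumed rounding rule
        cut = len(events) - n_test
        train += [(user, item) for _, item in events[:cut]]
        test += [(user, item) for _, item in events[cut:]]
    return train, test
```

Splitting per user rather than globally ensures every user contributes both training and test interactions, provided the user has enough ratings.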
4.3 Ranking Accuracy and Beyond-Accuracy Observations
First, we evaluated the ranking accuracy, considering Normalized Discounted Cumulative Gain (NDCG) as a support metric (i.e., the higher it is, the better the ranking); the results are shown in Figure 2. The performance achieved by MostPop (straight orange line) on the full test set seemed to be highly competitive, especially in ML1M. This might reveal that the user-item feedback underlying the test set is unbalanced towards popular items, and it can bias evaluation metrics in favor of popular items. We thus examined an alternative experimental configuration, which considers a subset of the original test set where all items have the same amount of test feedback instances Bellogín et al. (2017). NDCG scores decreased under the latter evaluation setup for BPR (dashed green line), NeuMF (dashed red line), and MostPop (dashed orange line) across cutoffs. This observation confirms that all algorithms tended to be considered accurate because they mostly suggest popular items.
The fact that MostPop performs close to BPR and NeuMF may imply that a recommender system optimized for ranking accuracy would not by default result in recommending sets with low popularity bias estimates. We conjecture that optimizing for accuracy, without explicitly considering popularity bias, favors the latter. Motivated by the patterns uncovered on ranking accuracy, we analyzed the bias metrics introduced in Section 3, namely Item Statistical Parity (ISP) and Item Equal Opportunity (IEO). From Figure 3, we can observe that BPR, NeuMF, and MostPop failed to achieve good levels of item statistical parity (top two plots). The statistical parity of NeuMF and BPR is significantly lower than that of Random, which maximizes statistical parity by default. Moreover, the results on equal opportunity (bottom two plots) show that the algorithms lead to low IEO. Low values of IEO may uncover situations where (i) the true positive rate for popular items is high (i.e., the recommender’s error for them is low) and (ii) the true positive rate for all the remaining unpopular items is very low or even zero.
Observation 1. Both point- and pair-wise optimization procedures reinforce disparate statistical parity and unequal opportunities across items. Such observed inequalities are stronger for pair-wise optimization, under highly sparse datasets, at low cutoffs.
Low ISP and IEO values reveal a high degree of bias towards popularity, which may hamper the quality of a recommendation (i.e., an algorithm might fail to learn user-item preferences for niche items, even if they are known to be of interest). Due to variable social dynamics, information cascades, and highly subjective notions, it would not be feasible to come up with a plain definition of quality. Therefore, this study explored the quality of a recommended list as a trade-off between ranking accuracy and beyond-accuracy metrics Kaminskas and Bridge (2017). These qualities are of particular importance in real-life systems, since users are most likely to consider only a small set of top-k recommendations. It is therefore crucial to make sure that this set is as interesting and engaging as possible (while we bring forth the discussion about such metrics, it is ultimately up to the stakeholders to select the metrics and the trade-offs most suitable for their principles and context). Figure 4 depicts novelty and item coverage over ML1M and COCO, obtained by applying the formulas defined in Kaminskas and Bridge (2017). Both metrics range in [0,1]. Higher values mean that the novelty and the coverage are higher, respectively. The novelty of an item recommended by BPR and NeuMF is generally higher than the one measured for MostPop (top plots) on both datasets, but the presented values are still far from the maximum value of 1. While it can be easily noted that the two datasets obtain very different novelty values, this does not mean that one leads to a much better performance than the other. These results reflect the characteristics of the data, with COCO having a much larger number of users and items (i.e., having a lot of users means that an item can be easily novel for someone). Therefore, on COCO, even a difference of 1% in novelty would become relevant, given the huge number of users it has.
Similar patterns and considerations came up on item coverage for BPR and NeuMF.
Observation 2. The higher the item statistical parity and equal opportunity are, the newer and wider the recommendations are, especially in sparse data. This pattern comes at the cost of a loss in accuracy that is negligible if a balanced test set is considered.
4.4 Internal Mechanics Analysis
Motivated by our findings, we next explored internal mechanics of the considered recommendation algorithms to better understand how disparate statistical parity and unequal opportunity across items are internally emphasized.
Throughout training, each algorithm optimized an objective function aimed at improving its ability to predict a high relevance for items known to be of interest for users in the training set. The fact that MostPop’s NDCG is close to that of BPR and NeuMF, and that a low value of IEO was achieved by the considered models, led us to further investigate this ability in relation to the popularity of the observed item. Therefore, we analyzed the performance of each recommender in terms of pair-wise accuracy while predicting relevance for short- and mid-tail observed items. We randomly sampled four sets of triplets $(u, i, j)$. Each triplet in the first set included an observed short-tail item as $i$ and an unobserved short-tail item as $j$. Triplets in the second set relied on observed short-tail items as $i$ and unobserved mid-tail items as $j$. The third set included observed mid-tail items as $i$ and unobserved short-tail items as $j$. The fourth set had observed mid-tail items as $i$ and unobserved mid-tail items as $j$. Short-tail and mid-tail popularity thresholds were set up according to popularity percentiles, as reported in Figure 1. For each set, we computed the recommender’s accuracy on predicting a higher relevance for the observed item than the unobserved one.
| Observed Item | Unobserved Item | ML-1M | | COCO | |
|---|---|---|---|---|---|
| Short Tail | Short Tail | 0.86 | 0.89 | 0.84 | 0.93 |
| Short Tail | Mid Tail | 0.98 | 0.97 | 0.97 | 0.99 |
| Mid Tail | Short Tail | 0.53 | 0.71 | 0.61 | 0.89 |
| Mid Tail | Mid Tail | 0.89 | 0.91 | 0.91 | 0.98 |
From Table 1, we observed that the pair-wise accuracy achieved by BPR and NeuMF strongly depends on the popularity of the observed item $i$ and the unobserved item $j$. Specifically, recommenders failed more frequently in giving higher relevance to observed mid-tail items, especially when they were compared against unobserved short-tail items. Conversely, recommenders performed significantly better when observed short-tail items were compared against unobserved mid-tail items.
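The pair-wise accuracy used in this analysis can be sketched as the share of triplets in which the observed item receives a strictly higher score than the unobserved one. The function name and the `score` callable are hypothetical stand-ins for a trained model's relevance function.

```python
from typing import Callable, List, Tuple


def pairwise_accuracy(
    triplets: List[Tuple[str, str, str]],   # (user, observed, unobserved)
    score: Callable[[str, str], float],     # predicted relevance of an item
) -> float:
    """Fraction of triplets where the observed item outscores the other."""
    if not triplets:
        return 0.0
    correct = sum(1 for u, i, j in triplets if score(u, i) > score(u, j))
    return correct / len(triplets)
```

Running this function separately on each of the four triplet sets (short/short, short/mid, mid/short, mid/mid) gives one accuracy value per table row for a given model and dataset.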
Observation 3. Observed mid-tail items, even when of interest, are more likely to receive less relevance with respect to short-tail items. This effect is stronger when the feedback data is less sparse.
We conjecture that this result might depend on the fact that, in the presence of popularity influence, the differences in relevance scores across items can play a key role in pair-wise accuracy. Thus, Figure 5 depicts the distribution of the user-item relevance scores obtained for observed short-tail items and observed mid-tail items in the training data. For each user $u$, we randomly sampled pairs of items, each including a short-tail item and a mid-tail item that user $u$ interacted with in the training set. Then, we computed the short-tail item and the mid-tail item relevance for user $u$, and we repeated the process across the user population to build two probability distributions. It can be observed that the distributions are significantly different in all the setups, and that mid-tail observed items tend to get lower relevance. This should be considered as an undesired behavior of the algorithm, which under-considers observed mid-tail items regardless of the user’s real interest.
Observation 4. User-item relevance distributions over observed short-tail items and mid-tail items are significantly different. Observed short-tail items are more likely to obtain more relevance than observed mid-tail items, and thus be over-represented in top-k lists.
Most of the observations seen so far are rooted in the fact that each recommendation algorithm emphasized a direct relation between item relevance and item popularity, which also emerged throughout the training procedure (Figure 6). This effect makes the recommender system less accurate when mid-tail items are considered, even when they are known to be of interest. It is interesting to ask whether minimizing such a correlation might have a positive impact on popularity debiasing and beyond-accuracy metrics, while retaining ranking accuracy.
5 The Proposed Debiasing Procedure
With an understanding of some point- and pair-wise internal mechanics, we investigate how we can devise a recommender system that limits their deficiencies while generating recommendations less skewed towards popular items. To this end, we propose a debiasing procedure that aims at minimizing both (i) the loss function targeted by the considered recommender (e.g., Eq. 5 or 6), and (ii) the correlation between the prediction residual and the popularity of the observed item in input. Even though we relied on BPR and NeuMF in our experiments, our approach can be seamlessly applied to other algorithms from the same families.
Correlation-based regularization approaches have proven to be empirically effective in several domains Beutel et al. (2019a, b). Differently from prior work, the popularity debiasing task requires relaxing the assumption of knowing group memberships of input samples (i.e., we target individual items regardless of their head or mid membership). Our task inspects relative differences in relevance and in popularity rather than differences in predicted labels and in group labels. Further, we do not rely on any arbitrary split between head- and mid-tail items. Lastly, as we tackle a popularity debiasing perspective, the design choices we made lead to examine training processes and model facets so far under-explored.
Our debiasing procedure relies on pre- and in-processing operations, which extend the common data preparation and model training procedures of a recommender (Figure 7). Specifically, with minimum differences between point- and pair-wise approaches, the proposed approach includes the following steps:
Input Sample Mining (sam). Under a point-wise recommendation task, $T$ negative pairs $(u, j)$ are created for each observed user-item interaction $(u, i)$. The observed interaction $(u, i)$ is replicated $T$ times to ensure that our correlation-based regularization can work. On the other hand, a pair-wise recommender task implies that, for each user $u$, $T$ triplets $(u, i, j)$ per observed user-item interaction $(u, i)$ are generated. In both cases, the unobserved item $j$ is selected among the items less popular than $i$ for half of the input samples, and among the items more popular than $i$ for the other half of the input samples. These operations enable our regularization, as the input samples equally represent elements subjected to correlation computing. We denote the set of input samples as $Y$.
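The balanced sampling step can be sketched for the pair-wise case as follows. This is an illustration under assumptions: the function and parameter names are hypothetical, popularity is given as a plain item-to-count mapping, and the fallback to the opposite pool when no less (or more) popular candidate exists is a choice of this sketch that the text does not describe.

```python
import random
from typing import Dict, List, Set, Tuple


def sample_triplets(
    observed: Dict[str, Set[str]],   # user -> items the user interacted with
    popularity: Dict[str, int],      # item -> number of interested users
    n_per_item: int = 2,             # samples per observed interaction
    seed: int = 0,
) -> List[Tuple[str, str, str]]:
    """Alternate less-popular and more-popular unobserved items."""
    rng = random.Random(seed)
    triplets = []
    for user, user_items in observed.items():
        candidates = [j for j in popularity if j not in user_items]
        for i in sorted(user_items):
            less = [j for j in candidates if popularity[j] < popularity[i]]
            more = [j for j in candidates if popularity[j] > popularity[i]]
            for t in range(n_per_item):
                # Even draws use less popular items, odd draws more popular.
                pool, other = (less, more) if t % 2 == 0 else (more, less)
                pool = pool or other  # assumed fallback for extreme items
                if pool:
                    triplets.append((user, i, rng.choice(pool)))
    return triplets
```

An even `n_per_item` gives an exact half-and-half balance whenever both pools are non-empty. For the point-wise case, each triplet could be unpacked into a positive pair and a negative pair.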
Regularized Optimization (reg). Input samples in are fed into a base recommendation algorithm in batches of size
to set up an iterated stochastic gradient descent. Regardless of the family of the algorithm, the optimization approach follows a regularized paradigm derived from the standard point- or pair-wise optimization approach. The regularized loss function can be formalized as follows:
where is a parameter that expresses the trade-off between the accuracy loss and the regularization loss. With , we yield the accuracy loss, not taking the regularization loss into account. Conversely, with , the accuracy loss is discarded and only the regularization loss is minimized.
The accuracy loss term depends on the class of the involved recommender system. For instance, it could be either Eq. 5 for point-wise recommenders or Eq. 6 for pair-wise recommenders. This aspect will make our debiasing procedure easily applicable to other algorithms, with no changes on their original implementation. Lastly, is introduced in this paper to define a regularization loss aimed at minimizing the correlation between () the residual prediction and () the observed item popularity, as:
where indicates the function used to compute the correlation across two distributions (predicted residuals) and (observed item popularities):
where identifies the observed item at position into the current batch , and represents the ratio of users interested in item in the training dataset, i.e., the popularity of the observed item. The model is thus penalized if its ability to predict a higher relevance for an observed item is better when it is more popular than the unobserved item. The proposed regularization is defined in a way that it can be applied on a wide range of rating prediction and learning-to-rank approaches.
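As an illustration of the regularized optimization described above, the following is a minimal sketch rather than the implementation used in our experiments. It assumes that the trade-off is a convex combination of the two terms, that the correlation is the absolute Pearson correlation, and that the per-sample residuals and popularities of a batch are already available as arrays. The names `pearson_corr`, `regularized_loss`, and `weight` are hypothetical.

```python
import numpy as np

def pearson_corr(a, b, eps=1e-12):
    """Pearson correlation between two 1-D arrays (eps avoids division by zero)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    a_c = a - a.mean()
    b_c = b - b.mean()
    denom = np.sqrt((a_c ** 2).sum() * (b_c ** 2).sum())
    return float((a_c * b_c).sum() / (denom + eps))

def regularized_loss(acc_loss, residuals, popularity, weight):
    """Combine the accuracy loss with a correlation-based penalty.

    Assumed form: (1 - weight) * acc_loss + weight * |corr(residuals, popularity)|.
    weight = 0 keeps only the accuracy loss; weight = 1 keeps only the penalty.
    """
    reg_loss = abs(pearson_corr(residuals, popularity))
    return (1.0 - weight) * acc_loss + weight * reg_loss

# Toy batch: residuals grow with popularity, so the penalty is close to 1.
residuals = [0.2, 0.4, 0.6, 0.8]
popularity = [0.1, 0.2, 0.3, 0.4]
print(regularized_loss(0.5, residuals, popularity, weight=0.5))
```

In a gradient-based framework, the same expression would be written with differentiable tensor operations so that the penalty contributes to the parameter updates.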
Following a common machine-learning training procedure, operations in sam are performed after every epoch, while the regularized optimization in reg is computed for every batch of the current epoch, until convergence.
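The input sample mining step for the pair-wise case can be sketched as follows. This is a simplified illustration under stated assumptions: the function name `mine_triplets`, the parameter `n_neg`, and the fallback to the other pool when one popularity side is empty are all hypothetical choices, not details specified above.

```python
import random
from collections import defaultdict

def mine_triplets(interactions, popularity, n_neg=4, seed=0):
    """Build (user, observed, unobserved) triplets for pair-wise training.

    For each observed interaction, negatives alternate between items less
    popular and items more popular than the observed one, so that both sides
    are equally represented when the popularity order allows it.
    """
    rng = random.Random(seed)
    seen = defaultdict(set)
    for user, item in interactions:
        seen[user].add(item)

    triplets = []
    for user, pos in interactions:
        candidates = [j for j in popularity if j not in seen[user]]
        less = [j for j in candidates if popularity[j] < popularity[pos]]
        more = [j for j in candidates if popularity[j] > popularity[pos]]
        for k in range(n_neg):
            pool = less if k % 2 == 0 else more
            if not pool:                      # assumption: fall back to the other side
                pool = more if k % 2 == 0 else less
            if not pool:
                continue                      # no valid unobserved item for this user
            triplets.append((user, pos, rng.choice(pool)))
    return triplets

popularity = {"a": 0.9, "b": 0.5, "c": 0.1, "d": 0.05}
print(mine_triplets([("u1", "b")], popularity, n_neg=4))
```

Following the procedure described above, such a function would be called again after every epoch, so that each epoch sees a fresh set of unobserved items.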
6 Experimental Evaluation
In this section, we empirically evaluate the proposed approach over standard accuracy, beyond-accuracy, and popularity debiasing objectives. We conducted the evaluation under the same experimental setup described for the exploratory analysis, including the same datasets (Section 4.1), train-test protocols (Section 4.2), and metrics (Section 4.3). We aim to answer four key research questions:
RQ1. What are the effects of our debiasing components, separately and jointly?
RQ2. What is the impact of our treatment on internal mechanics?
RQ3. At what degree of debiasing does an algorithm achieve the best recommendation quality?
RQ4. How does our approach perform compared with other state-of-the-art debiasing solutions?
6.1 Effects of Debiasing Components (RQ1)
In this subsection, we run ablation experiments to assess (i) the influence of the new data sampling strategy and the new regularized loss on the model performance, and (ii) whether combining these two treatments can improve the trade-off between ranking accuracy and popularity bias metrics.
To answer these questions, we compare the base algorithm (base) against an instance of the same algorithm trained on data created through the proposed sampling strategy only (sam), the base algorithm optimized through the proposed regularized loss function only (reg), and the base algorithm combining both our treatments (sam+reg). The regularized optimization for the last two setups was configured with , which gave us the best trade-off during the experiments in Section 6.3. The results are presented and discussed below.
From Figure 8, we can observe that all the newly introduced configurations (green, orange, and red lines) lose accuracy with respect to the base algorithm (blue line) when we consider the full test set (solid lines). However, the gap in accuracy between the base and the regularized models shrinks when we consider the same number of test ratings for all the items (dashed lines). We argue that, as large gaps in recommendation accuracy on the full test set reflect only a spurious bias in the metric and the underlying test set (see Bellogín et al. (2017) for a demonstration), the real impact of our treatments on accuracy should be assessed on the balanced test set. In the latter case, the gap in accuracy across models is negligible. NDCG varies across datasets. On ML1M, with NeuMF, combining our data sampling and regularized loss slightly improves accuracy, while applying the treatments separately yields no significant difference with respect to the base algorithm. On COCO, the loss in accuracy is smaller, with reg outperforming sam+reg under NeuMF.
Figure 9 shows that our data sampling (sam: orange line) and our combination of data sampling and regularized loss (sam+reg: red line) positively impact the ISP and IEO metrics, while our regularized loss alone (reg: green line) keeps a bias comparable to that of the original algorithm (base: blue line). Furthermore, there is no statistical difference in ISP between sam (orange line) and sam+reg (red line). It follows that the regularized loss does not directly increase ISP. On the other hand, sam+reg can significantly increase IEO with respect to sam, better equalizing opportunities across items.
Observation 5. Combining our data sampling and regularization leads to higher ISP and IEO w.r.t. applying them separately. The loss in ranking accuracy is negligible with respect to the original algorithm, if a balanced test set across items is considered.
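The balanced test set referred to above contains the same number of test ratings for every item. A minimal sketch of how such a set could be drawn is given below; the function name `balanced_test_set`, the parameter `n_per_item`, and the choice of dropping items with too few test ratings are assumptions made for illustration only.

```python
import random
from collections import defaultdict

def balanced_test_set(test_ratings, n_per_item, seed=0):
    """Keep exactly n_per_item test ratings for each item.

    Items with fewer than n_per_item test ratings are dropped (an assumption
    of this sketch), so every remaining item weighs equally in the metric.
    """
    rng = random.Random(seed)
    by_item = defaultdict(list)
    for user, item in test_ratings:
        by_item[item].append((user, item))

    balanced = []
    for item, rows in by_item.items():
        if len(rows) >= n_per_item:
            balanced.extend(rng.sample(rows, n_per_item))
    return balanced

ratings = [("u1", "a"), ("u2", "a"), ("u3", "a"), ("u1", "b"), ("u2", "b"), ("u1", "c")]
print(balanced_test_set(ratings, n_per_item=2))
```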
6.2 Impact on Internal Mechanics (RQ2)
In this subsection, we run ablation experiments on ML1M and COCO to assess whether our debiasing approach can effectively reduce (i) the gap between short-tail and mid-tail relevance distributions, and (ii) the gap in pair-wise accuracy between short-tail and mid-tail items.
To address the first point, we compute and plot the relevance score distributions for observed short-tail and mid-tail items in Figure 10. Orange lines are calculated on user-(short-tail-item) pairs, and green lines are calculated on user-(mid-tail-item) pairs, with the same procedure followed during the exploratory analysis (Section 4.3). The proposed approach effectively reduces the gap between the relevance score distributions when compared with the results in Figure 5. This supports our intuition and the resulting debiasing approach.
For the second point, we compute the pair-wise accuracy for observed short-tail and mid-tail items in Table 2. The (mid-tail, short-tail) setup experienced a statistically significant improvement in pair-wise accuracy. Conversely, as mid-tail items become better ranked, pair-wise accuracy on the setups involving observed short-tail items slightly decreased. The improvement is generally higher for the pair-wise algorithm (BPR) and the less sparse dataset (ML1M). To assess the impact of our approach in cases where the algorithm does not show any biased performance across short- and mid-tail items, we included NeuMF trained on COCO in our evaluation (last column). In this situation, our approach led to a decrease in performance over all the observed/unobserved item setups. Therefore, it should be applied only when the gap is considerable.
| Observed Item | Unobserved Item | ML-1M | | COCO | |
|---|---|---|---|---|---|
| Short Tail | Any | 0.88 (-0.05) | 0.91 (-0.04) | 0.92 (-0.02) | 0.84 (-0.14) |
| Mid Tail | Any | 0.78 (+0.04) | 0.85 (+0.01) | 0.89 (+0.06) | 0.82 (-0.14) |
| Short Tail | Short Tail | 0.77 (-0.11) | 0.87 (-0.05) | 0.89 (+0.00) | 0.85 (-0.11) |
| Short Tail | Mid Tail | 0.93 (-0.06) | 0.95 (-0.04) | 0.95 (-0.04) | 0.83 (-0.16) |
| Mid Tail | Short Tail | 0.68 (+0.10) | 0.80 (+0.06) | 0.82 (+0.13) | 0.82 (-0.10) |
| Mid Tail | Mid Tail | 0.89 (-0.02) | 0.90 (-0.04) | 0.94 (-0.04) | 0.81 (-0.18) |
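The per-setup values in Table 2 can be computed along the lines of the sketch below. It assumes that pair-wise accuracy is the fraction of (observed, unobserved) pairs in which the observed item receives the higher predicted relevance, and that a set of short-tail items is given. The function name `pairwise_accuracy_by_tail` and its arguments are hypothetical.

```python
from collections import defaultdict

def pairwise_accuracy_by_tail(scores, pairs, short_tail):
    """Pair-wise accuracy grouped by the tail of observed and unobserved items.

    scores:     dict mapping (user, item) to the predicted relevance
    pairs:      iterable of (user, observed_item, unobserved_item)
    short_tail: set of items considered short-tail; all others are mid-tail
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for user, obs, unobs in pairs:
        key = ("short" if obs in short_tail else "mid",
               "short" if unobs in short_tail else "mid")
        totals[key] += 1
        if scores[(user, obs)] > scores[(user, unobs)]:
            hits[key] += 1
    return {key: hits[key] / totals[key] for key in totals}

scores = {("u", "h1"): 0.9, ("u", "h2"): 0.8, ("u", "m1"): 0.7, ("u", "m2"): 0.2}
pairs = [("u", "h1", "m2"), ("u", "m1", "h2"), ("u", "m1", "m2")]
print(pairwise_accuracy_by_tail(scores, pairs, short_tail={"h1", "h2"}))
```

In this toy example the (mid, short) setup scores 0.0 because the observed mid-tail item is ranked below the unobserved short-tail one, which is the kind of gap the treatment targets.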
Observation 6. Our correlation-based regularization, jointly with the enhanced data sampling, leads to a reduction of the gap in relevance score of items along the popularity tail. This is stronger for pair-wise approaches and less sparse datasets.
6.3 Linking Regularization Weight and Recommendation Qualities (RQ3)
We investigate how the recommender performs when we vary the regularization weight in the new proposed loss function. With this experiment, we seek to inspect to what degree the influence of popularity may be debiased to achieve the best quality of recommendation, according to ranking accuracy and beyond-accuracy objectives. For conciseness, we only report experimental results on ML1M, but results on COCO showed similar patterns.
We vary the regularizer weight and plot the results on accuracy, popularity bias, and beyond-accuracy metrics in Figure 11. The x-axis coordinates indicate the value of , while the y-axis shows the value measured for the corresponding metric at that value of . It can be observed that the regularization procedure shows quite stable performance at various . Specifically, at the cost of a loss in NDCG on the full test set, our approach ensures comparable or even better NDCG values on the balanced test set, large gains in ISP and IEO, higher novelty, and a wider coverage of the catalog. The exception is catalog coverage under BPR. To balance ranking accuracy and other metrics, setting is a reasonable choice.
Observation 7. Debiasing popularity with our approach positively impacts recommendation quality. Higher ISP and IEO, higher novelty, and a wider coverage are achieved at the cost of a small loss in NDCG, if the approach is evaluated on a balanced test set.
6.4 Comparison with Other Debiasing Approaches (RQ4)
We next compare the proposed sam+reg debiasing approach with representative state-of-the-art alternatives to assess (i) how the proposed model performs in comparison with other approaches, and (ii) how they manage the trade-off between popularity bias and recommendation quality. We highlight that we do not aim to show that an in-processing procedure beats a post-processing procedure (or vice versa), also because the two could be combined. Our goal here is to assess how far an in-processing strategy is from a post-processing strategy in reaching good trade-offs. We leave the joint employment of in- and post-processing as future work, to focus on the validation of our approach. We compare the trade-off achieved by the proposed regularized approach sam+reg against the one obtained by:
Pop-Weighted Abdollahpouri et al. (2018). It re-ranks the output of the original algorithm according to a weight-based strategy. The relevance returned by the original algorithm for a given item is multiplied by a weight inversely proportional to the popularity of that item, before re-ranking.
Binary-xQuad Abdollahpouri et al. (2019). For each user, it iteratively builds the re-ranked list by balancing the contribution of the relevance score returned by the original algorithm and of the diversity level related to short-tail and mid-tail item sets. It includes only the best mid-tail item it can. The split between short-tail and mid-tail was performed based on the percentiles, as shown in Figure 1.
Smooth-xQuad Abdollahpouri et al. (2019). It follows the same strategy as Binary-xQuad, but it takes into account the likelihood that an item should be selected, based on the ratio of items in the user profile that belong to the short- and mid-tails.
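To make the simplest of these baselines concrete, the weighting idea behind Pop-Weighted can be sketched as follows. This is an illustration of the description given above, not the cited authors' implementation: the weight 1/popularity, the function name `pop_weighted_rerank`, and the assumption of non-negative relevance scores are choices made for this sketch.

```python
def pop_weighted_rerank(relevance, popularity, k=10, eps=1e-9):
    """Re-rank one user's candidates by relevance times an inverse-popularity weight.

    relevance:  dict mapping item to the (non-negative) score of the base algorithm
    popularity: dict mapping item to its popularity in the training data
    """
    weighted = {item: score / (popularity[item] + eps)
                for item, score in relevance.items()}
    return sorted(weighted, key=weighted.get, reverse=True)[:k]

relevance = {"a": 0.9, "b": 0.6, "c": 0.5}
popularity = {"a": 0.9, "b": 0.3, "c": 0.1}
print(pop_weighted_rerank(relevance, popularity, k=3))  # ['c', 'b', 'a']
```

Being a post-processing step, such a re-ranker leaves the trained model untouched, which is why it could in principle be stacked on top of an in-processing treatment.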
To answer these questions, we report accuracy, popularity bias, and beyond-accuracy metrics for all the considered approaches in Table 3. The best-performing approach per metric and algorithm is shown in bold. The same value of is used for all the approaches to favor comparability.
From the top part of the table, it can be observed that the proposed sam+reg debiasing strategy experienced the largest loss in NDCG when the full test set was considered. Conversely, it achieved NDCG comparable with Pop-Weighted on the balanced test set for both BPR and NeuMF. Moreover, highly sparse datasets such as COCO reduced the gap between sam+reg and the other strategies. This may also be caused by the skewed popularity tail in COCO, where it is harder to find input samples where the observed item is less popular than the unobserved item.
Looking more closely at popularity bias metrics, it can be observed that our sam+reg strategy largely improves ISP on both datasets. On ML1M, Pop-Weighted exhibited the highest bias on statistical parity. Binary- and Smooth-xQuad achieved scores comparable with each other, but lower than sam+reg. On COCO, ISP improved for NeuMF, but not for BPR. Smaller improvements of our proposal were achieved on ISP on both datasets. Similar patterns were observed for IEO, with sam+reg achieving higher values. Our proposal appeared highly competitive on both novelty and catalog coverage, especially on ML1M. This came at the cost of a higher NDCG loss on the full test set. Conversely, it achieved comparable NDCG scores on the balanced test set.
6.5 Discussion
In this section, we discuss the results and connect the insights coming from the individual experiments. Having observed some differences in performance, we also turn to the implications and limitations of our results.
The increasing adoption of recommender systems is requiring platform owners to consider issues of bias in their internal mechanics, which may represent a key factor for the future success of the overlying platform. The outcomes of our exploratory analysis in Section 4.3 highlighted that two widely adopted classes of algorithms, point-wise and pair-wise, exhibit an algorithmic bias against unpopular items; thus, these items end up being under-recommended even when of interest, reducing novelty and coverage in recommendations. The results presented on two fundamental optimization paradigms, which constitute a key block of several state-of-the-art recommenders, can have implications beyond the algorithms presented in this study.
Our results provide more evidence on the popularity perspective in recommendation, which has primarily focused on popularity bias in traditional algorithms in the contexts of movies, music, books, social networks, hotels, games, and research articles Boratto et al. (2019); Pampín et al. (2015); Jannach et al. (2015, 2016); Collins et al. (2018). We extended existing knowledge by linking observations within internal mechanics to the level of popularity bias experienced by the recommender, uncovering a clear correlation between item relevance and item popularity (Section 4.4). The methodology and the lessons learned throughout our exploration represent one of the first attempts at linking internal mechanics to popularity bias and beyond-accuracy metrics.
Combining our input sample strategy with the correlation-based loss resulted in lower popularity bias at the cost of a decrease in ranking accuracy, which confirms the trade-off experienced by other debiasing procedures Abdollahpouri et al. (2019); Jannach et al. (2015). However, trading ranking accuracy for debiasing popularity proved beneficial for recommendation quality (Section 6.3). This study additionally brings forth the discussion about the impact of popularity debiasing on beyond-accuracy goals, which can better guide stakeholders to ultimately select the trade-offs based on their context, going beyond ranking accuracy and popularity analysis. Lastly, as the algorithms we analyzed in Section 6.4 showed very different trade-offs and patterns, our comparison among debiasing procedures makes a first step in supporting stakeholders while choosing the most suitable debiasing strategy for their recommendation scenario.
Throughout this study, some limitations emerged at different levels of the pipeline. The most representative ones are related to the following aspects:
Limitations of data. Our analysis was conducted with feedback extracted from the provided ratings. It does not account for the behaviour of users who interacted with items without necessarily providing ratings for them. However, as we deal with learning-to-rank tasks, the debiasing approach can be applied to cases where the matrix does not include ratings (e.g., binary data or frequencies). Learning-to-rank tasks usually require defining what is of interest to the user and what is not (e.g., by applying thresholds to ratings or frequencies). Our approach does not make any assumption on the feedback type, which is a design choice of the algorithm under consideration.
Limitations of recommendation techniques. While we have tested representatives of two key families of recommendation algorithms, there are many types of algorithms that we have not considered. However, our methodology makes it easy to re-run our analyses on additional algorithms. Our experiments highlighted that the proposed debiasing works well under pair-wise optimization, while it leads to lower gains on point-wise approaches. Finally, although we focused on learning-to-rank tasks in recommendation, our debiasing procedure can still be applied to algorithms originally devised for rating prediction, when they are optimized for ranking accuracy.
Limitations of evaluation protocol. Our data cannot distinguish whether the differences in measured performance are due to actual differences in the recommender’s ability or differences in the evaluation protocol’s effectiveness at measuring popularity bias. Furthermore, there is no evidence on the real impact of the debiased recommendation on the user acceptance, which requires online evaluation studies, after offline algorithm testing.
Limitations of metrics. There are many widely used metrics that can be used to evaluate the quality of recommendations. In our specific context, we focus our results on popularity, novelty, and coverage. We also measured NDCG as a proxy of recommendation utility. We remark that, while it is the responsibility of scientists to operationalize trade-offs, the metrics and the target level of a trade-off are selected by the stakeholders.
As this study aims to show, the scope of our debiasing procedure extends to aspects of beyond-accuracy importance, which can be shaped by adjusting the popularity of the recommendations. As recommender systems become more deeply embedded in online platforms, it becomes increasingly necessary for platform owners to investigate and consider strategies similar to ours.
7 Conclusions
In this paper, we first propose two new bias metrics designed specifically for measuring popularity bias in the recommendation task. Then, we empirically show that representative learning-to-rank algorithms based on point- and pair-wise optimization are vulnerable to imbalanced item data, and tend to generate biased recommendations with respect to the proposed bias metrics. To counteract this bias, we propose a debiasing approach that incorporates a new data sampling strategy and a new regularized loss. Finally, we conduct extensive experiments to measure the trade-off between popularity bias, ranking accuracy, and beyond-accuracy metrics. Based on results, we conclude that:
Predicted user-item relevance distributions for observed short- and mid-tail items are statistically different; the first one exhibits higher relevance values.
Pair-wise accuracy on observed mid-tail items is lower than for observed short-tail items; mid-tail items are under-ranked regardless of user interest.
The combination of our sampling strategy and our regularized loss leads to a lower gap in pair-wise accuracy between short- and mid-tail observed items; higher statistical parity, equal opportunity, and beyond-accuracy estimates can be achieved by the treated recommender system.
The treated models exhibit accuracy comparable to that of the original model when the same number of test ratings is used for each item, which has been shown to be a proper testing setup when popularity bias is considered Bellogín et al. (2017).
Compared to state-of-the-art alternatives, our treated model comparably reduces popularity bias while achieving competitive beyond-accuracy scores and accuracy, generalizing well across populations and domains.
In our next steps, we are interested in investigating temporal- and relevance-aware bias metrics, which respectively take the item popularity or relevance at a given time into account when treating popularity bias. We will also investigate the possibility of defining one-time post-processing mitigators that optimize accuracy-only pre-trained embeddings for popularity bias reduction at small learning rates. Moreover, we are interested in inspecting the interplay between system-level and user-level tendencies to prefer more (less) popular items and in linking the resulting observations to beyond-accuracy objectives.
This work has been partially supported by the Agència per a la Competitivitat de l’Empresa, ACCIÓ, under “Fair and Explainable Artificial Intelligence (FX-AI)” Project.
- Controlling popularity bias in learning-to-rank recommendation. In Proc. of the Eleventh ACM Conference on Recommender Systems, pp. 42–46.
- Popularity-aware item weighting for long-tail recommendation. arXiv preprint arXiv:1802.05382.
- Managing popularity bias in recommender systems with personalized re-ranking. arXiv preprint arXiv:1901.07555.
- A reductions approach to fair classification. arXiv preprint arXiv:1803.02453.
- Statistical biases in information retrieval metrics for recommender systems. Information Retrieval Journal 20 (6), pp. 606–634.
- Fairness in recommendation ranking through pairwise comparisons. arXiv preprint arXiv:1903.00780.
- Putting fairness principles into practice: challenges, metrics, and improvements. arXiv preprint arXiv:1901.04562.
- Data decisions and theoretical implications when adversarially learning fair representations. arXiv preprint arXiv:1707.00075.
- Recommender systems survey. Knowledge-based systems 46, pp. 109–132.
- The effect of algorithmic bias on recommender systems for massive open online courses. In European Conference on Information Retrieval, pp. 457–472.
- Should I follow the crowd?: a probabilistic analysis of the effectiveness of popularity in recommender systems. In The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval, pp. 415–424.
- Missing data modeling with user activity and item popularity in recommendation. In Asia Information Retrieval Symposium, pp. 113–125.
- How algorithmic popularity bias hinders or promotes quality. Scientific reports 8 (1), pp. 15951.
- Position bias in recommender systems for digital libraries. In International Conference on Information, pp. 335–344.
- Recommending without short head. In Proc. of the 23rd International Conference on World Wide Web, pp. 245–246.
- DeepCF: a unified framework of representation learning and matching function learning in recommender system. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 61–68.
- COCO: semantic-enriched collection of online courses at scale with experimental use cases. In World Conference on Information Systems and Technologies, pp. 1386–1396.
- Censoring representations with an adversary. arXiv preprint arXiv:1511.05897.
- Satisfying real-world goals with dataset constraints. In Advances in Neural Information Processing Systems, pp. 2415–2423.
- Equality of opportunity in supervised learning. In Advances in neural information processing systems, pp. 3315–3323.
- The movielens datasets: history and context. ACM transactions on interactive intelligent systems (TIIS) 5 (4), pp. 19.
- Neural factorization machines for sparse predictive analytics. In Proceedings of the 40th International ACM SIGIR conference on Research and Development in Information Retrieval, pp. 355–364.
- Balancing popularity bias of object similarities for personalised recommendation. European Physical Journal 91 (3), pp. 47.
- Biases in automated music playlist generation: a comparison of next-track recommending techniques. In Proc. of the 2016 Conference on User Modeling Adaptation and Personalization, pp. 281–285.
- What recommenders recommend: an analysis of recommendation biases and possible countermeasures. User Modeling and User-Adapted Interaction 25 (5), pp. 427–491.
- Diversity, serendipity, novelty, and coverage: a survey and empirical analysis of beyond-accuracy objectives in recommender systems. ACM Transactions on Interactive Intelligent Systems (TiiS) 7 (1), pp. 2.
- Correcting popularity bias by enhancing recommendation neutrality. In RecSys Posters.
- Fairness-aware learning through regularization approach. In 2011 IEEE 11th International Conference on Data Mining Workshops, pp. 643–650.
- Towards a fair marketplace: counterfactual evaluation of the trade-off between relevance, fairness & satisfaction in recommendation systems. In Proc. of the ACM International Conference on Information and Knowledge Management, pp. 2243–2251.
- Quantifying biases in online information exposure. Journal of the Association for Information Science and Technology 70 (3), pp. 218–229.
- Novel recommendation based on personal popularity tendency. In 2011 IEEE 11th International Conference on Data Mining, pp. 507–516.
- Evaluating the relative performance of collaborative filtering recommender systems. Journal of Universal Computer Science 21 (13), pp. 1849–1868.
- The long tail of recommender systems and how to leverage it. In Proc. of ACM Conference on Recommender Systems, pp. 11–18.
- On the negative impact of social influence in recommender systems: a study of bribery in collaborative hybrid algorithms. Information Processing & Management 57 (2), pp. 102058.
- Improving pairwise learning for item recommendation from implicit feedback. In Proceedings of the 7th ACM international conference on Web search and data mining, pp. 273–282.
- Recommender systems: introduction and challenges. In Recommender systems handbook, pp. 1–34.
- Reputation-based ranking systems and their resistance to bribery. In 2017 IEEE International Conference on Data Mining, ICDM 2017, New Orleans, LA, USA, November 18-21, 2017, V. Raghavan, S. Aluru, G. Karypis, L. Miele, and X. Wu (Eds.), pp. 1063–1068.
- Fairness of exposure in rankings. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 2219–2228.
- Fair and unbiased algorithmic decision making: current state and future challenges. arXiv preprint arXiv:1901.04730.
- Deep matrix factorization models for recommender systems. In IJCAI, pp. 3203–3209.
- CoupledCF: learning explicit and implicit user-item couplings in recommendation for deep collaborative filtering. In IJCAI International Joint Conference on Artificial Intelligence.
- Deep learning based recommender system: a survey and new perspectives. ACM Computing Surveys (CSUR) 52 (1), pp. 1–38.