Reducing Popularity Bias in Recommendation Over Time

06/27/2019 ∙ by Himan Abdollahpouri, et al. ∙ University of Colorado Boulder

Many recommendation algorithms suffer from popularity bias: a small number of popular items being recommended too frequently, while other items get insufficient exposure. Research in this area so far has concentrated on a one-shot representation of this bias, and on algorithms to improve the diversity of individual recommendation lists. In this work, we take a time-sensitive view of popularity bias, in which the algorithm assesses its long-tail coverage at regular intervals, and compensates in the present moment for omissions in the past. In particular, we present a temporal version of the well-known xQuAD diversification algorithm adapted for long-tail recommendation. Experimental results on two public datasets show that our method is more effective in terms of the long-tail coverage and accuracy tradeoff compared to some other existing approaches.






1. Introduction

Recommender systems have an important role in e-commerce and information sites, helping users find new items. One obstacle to the effectiveness of recommenders is the problem of popularity bias (Bellogín et al., 2017; Abdollahpouri, 2019): collaborative filtering recommenders typically emphasize popular items (those with more ratings) over other “long-tail” items (Park and Tuzhilin, 2008) that may only be popular among small groups of users. Although popular items are often good recommendations, they are also likely to be well-known. So delivering only popular items will not enhance new item discovery and will ignore the interests of users with niche tastes. It also may be unfair to the producers of less popular or newer items since they are rated by fewer users.

Most of the research addressing popularity bias has concentrated on a one-shot representation of this bias, and improving the diversity of individual recommendation lists. In other words, researchers have developed algorithms that change the recommendation lists for each user without any knowledge about how the recommender system has performed for other users. Although such approaches improve the long-tail recommendation to some extent, they miss an important opportunity for long-tail promotion: the ability of the system to compensate for previous omissions by adjusting its future output.

Figure 1 shows lists generated by two different long-tail-promoting recommendation algorithms for two different users, U1 and U2, arriving at two different times, t1 and t2. Popular items have a white background; long-tail items, a grey background. The outputs are superficially similar: each user gets two popular items and three long-tail items, and the averages over the popularity values of the items (shown in parentheses) are the same. However, one algorithm repeatedly generates the same three long-tail items, while the other produces a greater range of such items. When we are measuring long-tail coverage, the number of items and their average popularity in each list is not as important as an aggregate measure of how many different items are shown across all users. This distinction cannot be grasped by looking only at individual recommendation lists in isolation; the whole recommendation set must be evaluated. In this paper, we capture this distinction through a modified version of the search diversification algorithm xQuAD (Santos et al., 2010), which improves long-tail recommendation over time by adapting based on its prior performance.

Figure 1. Lists of recommended items given to different users at different times by different algorithms. The grey background indicates long-tail items.

2. Approach

Result diversification has been studied in the context of information retrieval, especially for web search engines, whose related goal is to find a ranking of documents that together provide complete coverage of the aspects underlying a query (Santos et al., 2015). EXplicit Query Aspect Diversification (xQuAD) (Santos et al., 2010) accounts for the various aspects associated with an under-specified query. Items are selected iteratively by estimating how well a given document satisfies an uncovered aspect.

In adapting this approach, we seek to recognize the difference among users in their interest in long-tail items. Uniformly increasing the diversity of items in recommendation lists may work poorly for some users. We propose a variant that adds a personalization factor to the scoring function, based on each user’s historical interest in long-tail items.

We build on the xQuAD model to control popularity bias in recommendation algorithms over time. We assume that for a given user $u$ from the user set $U$, a ranked recommendation list $R$ of items from the item set $V$ has already been generated by a base recommendation algorithm. The task of the modified xQuAD method is to produce a new re-ranked list $S$ ($|S| < |R|$) that manages popularity bias while still being accurate.

In the approach for improving long-tail recommendation introduced in (Abdollahpouri et al., 2019), which is also based on xQuAD, the new list $S$ is built iteratively according to the score

$$(1 - \lambda)\, P(v \mid u) + \lambda\, P(v, S' \mid u)$$

where $P(v \mid u)$ is the likelihood of user $u$ being interested in item $v$, independent of the items on the list so far, as predicted by the base recommender. The second term, $P(v, S' \mid u)$, denotes the likelihood of user $u$ being interested in an item $v$ if the category of that item (long-tail vs. short-head) is not in the current recommendation list $S$.

In this work, however, since we want to take a historical view of long-tail recommendation, we do not look at $S$ to decide whether an item category has already been covered. Instead, we look at the entire recommendation history up to the current time $t$. We denote that list by $H_t$; it contains the items recommended from $t_0$ to $t$. So, we replace $S$ with $H_t$.

Intuitively, the first term in the xQuAD equation incorporates ranking accuracy, while the second term promotes diversity between different categories of items (here, short head and long tail). The parameter $\lambda$ controls how strongly popularity bias is weighted in general. The item with the highest score is added to the output list $S$, and the process is repeated until $S$ has achieved the desired length.

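The marginal likelihood $P(v, H_t' \mid u)$ over both item categories, long-tail ($\Gamma$) and short-head ($\Gamma'$), can be computed by:

$$P(v, H_t' \mid u) = \sum_{c \in \{\Gamma, \Gamma'\}} P(v, H_t' \mid c)\, P(c \mid u) \tag{1}$$

Following (Santos et al., 2010), we assume that the remaining items are independent of the current contents of $H_t$ and that the items are independent of each other, given the short-head and long-tail categories. Under these assumptions, we can compute $P(v, H_t' \mid c)$ in Eq. 1 as

$$P(v, H_t' \mid c) = P(v \mid c) \prod_{i \in H_t} \bigl(1 - P(i \mid c, H_t)\bigr) \tag{2}$$

By substituting Equation 2 into Equation 1, we obtain

$$P(v, H_t' \mid u) = \sum_{c \in \{\Gamma, \Gamma'\}} P(c \mid u)\, \mathbb{1}_{\{v \in c\}} \prod_{i \in H_t} \bigl(1 - P(i \mid c, H_t)\bigr) \tag{3}$$

where $\mathbb{1}_{\{v \in c\}} = P(v \mid c)$ is equal to 1 if $v \in c$ and 0 otherwise.

The final scoring function, therefore, is as follows:

$$\mathrm{score}(v) = (1 - \lambda)\, P(v \mid u) + \lambda \sum_{c \in \{\Gamma, \Gamma'\}} P(c \mid u)\, \mathbb{1}_{\{v \in c\}} \prod_{i \in H_t} \bigl(1 - P(i \mid c, H_t)\bigr) \tag{4}$$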


We measure $P(i \mid c, H_t)$ in two different ways to produce two different algorithms:

  • Binary: Use the same function as $\mathbb{1}_{\{v \in c\}}$, an indicator function equal to 1 when item $i$ in list $H_t$ already covers category $c$ and 0 otherwise. We call this method Time Binary xQuAD; the original xQuAD algorithm calculates this value in the same way, using the list $S$.

  • Time Smooth: Another method that we introduce in this paper is to compute the fraction of category $c$ items included in the list $H_t$. We call the method that measures $P(i \mid c, H_t)$ in this way Time Smooth xQuAD.

The likelihood $P(c \mid u)$ is the measure of user preference over different item categories. In other words, it measures how much each user is interested in short-head items versus long-tail items. We calculate this likelihood as the fraction of items in the user profile that belong to category $c$.
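The per-user category preference described above can be sketched as follows. This is a minimal illustration rather than the authors' code: the function name `category_preference`, the dictionary keys, and the representation of a user profile as a list of item ids are assumptions made for the example.

```python
from typing import Dict, Iterable, Set


def category_preference(profile: Iterable[int], long_tail: Set[int]) -> Dict[str, float]:
    """Estimate a user's preference for each category as a profile fraction.

    `profile` holds the ids of the items the user has rated; `long_tail` is the
    set of long-tail item ids. Every other item is treated as short-head.
    """
    items = list(profile)
    if not items:
        raise ValueError("the user profile must contain at least one item")
    n_tail = sum(1 for item in items if item in long_tail)
    p_tail = n_tail / len(items)
    return {"long_tail": p_tail, "short_head": 1.0 - p_tail}


if __name__ == "__main__":
    # A user who rated four items, one of which is in the long tail.
    print(category_preference([1, 2, 3, 5], long_tail={3}))
```

Because there are only two categories, the two values always sum to 1.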

In order to add the next item to $S$, we compute a re-ranking score for each item in $R \setminus S$ according to Eq. 4. For an item $v \in c$, if $H_t$ does not cover $c$, then an additional positive term will be added to the estimated user preference $P(v \mid u)$, increasing the item's chance of selection and balancing accuracy and popularity bias.

In Binary xQuAD, the product term is equal to 1 only if the items recommended so far have not yet covered the category $c$. Binary xQuAD is, therefore, optimizing for a minimal re-ranking of the original list by including the best long-tail item it can, but not seeking diversity beyond that.
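The re-ranking step can be sketched in a few lines of Python. This is a rough illustration under several assumptions that the text does not settle: the recommendation history is held fixed while one user's list is built and is extended by the caller after each epoch; the smooth variant replaces the product term with one minus the fraction of the history that belongs to the category; and base scores lie in [0, 1]. The names `coverage_terms` and `rerank` are hypothetical.

```python
from typing import Dict, List, Set, Tuple


def coverage_terms(history: List[int], long_tail: Set[int], mode: str) -> Tuple[float, float]:
    """Return (long-tail, short-head) stand-ins for the product term of the score."""
    n_total = len(history)
    n_tail = sum(1 for item in history if item in long_tail)
    n_head = n_total - n_tail
    if mode == "binary":
        # The term is 1 only while the category has never appeared in the history.
        return (1.0 if n_tail == 0 else 0.0, 1.0 if n_head == 0 else 0.0)
    if n_total == 0:
        return 1.0, 1.0
    # Smooth (assumed form): one minus the share of the history in the category.
    return 1.0 - n_tail / n_total, 1.0 - n_head / n_total


def rerank(base_scores: Dict[int, float], long_tail: Set[int], p_tail: float,
           history: List[int], lam: float, k: int, mode: str = "smooth") -> List[int]:
    """Re-rank one user's candidates: accuracy term plus a history-aware bonus."""
    tail_term, head_term = coverage_terms(history, long_tail, mode)

    def score(item: int) -> float:
        # Only the item's own category contributes, because the indicator
        # zeroes out the other category in the sum.
        if item in long_tail:
            bonus = p_tail * tail_term
        else:
            bonus = (1.0 - p_tail) * head_term
        return (1.0 - lam) * base_scores[item] + lam * bonus

    return sorted(base_scores, key=score, reverse=True)[:k]


if __name__ == "__main__":
    scores = {1: 0.9, 2: 0.8, 3: 0.7}  # item 3 is the only long-tail item
    print(rerank(scores, {3}, p_tail=0.5, history=[1, 2], lam=0.5, k=2, mode="binary"))
```

With a fixed history the coverage terms are constant for the whole list, so the iterative selection reduces to a single sort. If the history were instead updated after every selected item, the terms would have to be recomputed inside a greedy loop.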

3. Datasets and Preparation

We tested our proposed algorithm on two public datasets. The first is the well-known MovieLens 1M dataset (Harper and Konstan, 2015). The second is the Epinions dataset, which is gathered from a consumer opinion site where users can review items (Massa and Avesani, 2007). Following the data reduction procedure in (Abdollahpouri et al., 2017), we removed users who had fewer than 20 ratings from the Epinions dataset (MovieLens already has this characteristic). We also removed distant long-tail items from each dataset using a limit of 20 ratings, chosen to be consistent with the cut-off for users.
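The filtering step might look like the sketch below. It assumes ratings are available as `(user, item)` pairs and that users are filtered first and items second, in a single pass each; the text does not state the order or whether the filter is repeated until stable. The function name `filter_ratings` is hypothetical.

```python
from collections import Counter
from typing import List, Tuple

Rating = Tuple[str, str]  # (user id, item id)


def filter_ratings(ratings: List[Rating], min_user: int = 20, min_item: int = 20) -> List[Rating]:
    """Drop users with too few ratings, then items with too few ratings."""
    user_counts = Counter(user for user, _ in ratings)
    kept = [(u, i) for u, i in ratings if user_counts[u] >= min_user]
    # Item counts are taken after the user filter (an assumption of this sketch).
    item_counts = Counter(item for _, item in kept)
    return [(u, i) for u, i in kept if item_counts[i] >= min_item]


if __name__ == "__main__":
    toy = [("u1", "a"), ("u1", "b"), ("u2", "a"), ("u3", "a"), ("u3", "b")]
    # Small thresholds so the toy data survives the filter.
    print(filter_ratings(toy, min_user=2, min_item=2))
```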

After filtering, the MovieLens dataset has 6,040 users who rated 3,043 movies with a total of 995,492 ratings, a reduction of about 0.4%. Applying the same criteria to the Epinions dataset decreases the data to 220,117 ratings given by 8,144 users to 5,195 items, a reduction of around 66%. We split the items in both datasets into two categories, long-tail ($\Gamma$) and short-head ($\Gamma'$), such that short-head items make up 80% of the ratings while long-tail items have the rest. We plan to consider other divisions of the popularity distribution in future work. For MovieLens, the short-head items were those with more than 506 ratings. In Epinions, a short-head item needed only to have more than 73 ratings.
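One way to derive such a split is to sort items by rating count and take the most popular ones until they account for the target share of ratings. The sketch below follows that idea; the function name `split_by_popularity` and the handling of the boundary item (it joins the short head if the share has not yet been reached) are assumptions.

```python
from collections import Counter
from typing import Iterable, Set, Tuple


def split_by_popularity(rated_items: Iterable[str], head_share: float = 0.8) -> Tuple[Set[str], Set[str]]:
    """Split items into (short_head, long_tail) by cumulative share of ratings.

    `rated_items` contains one entry per rating, naming the rated item.
    """
    counts = Counter(rated_items)
    total = sum(counts.values())
    short_head: Set[str] = set()
    covered = 0
    for item, n in counts.most_common():  # most-rated items first
        if covered >= head_share * total:
            break
        short_head.add(item)
        covered += n
    long_tail = set(counts) - short_head
    return short_head, long_tail


if __name__ == "__main__":
    ratings = ["a"] * 6 + ["b"] * 3 + ["c"]
    print(split_by_popularity(ratings))
```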

Our temporal xQuAD algorithms operate over a series of different time epochs. To evaluate these algorithms, we split the test set into a fixed number of epochs (50 in this experiment). Investigating the effect of the number of epochs is in our plan for future work. Note that we do not need to split the data based on real time stamps, because we are not trying to learn the time-sensitive properties of users' preferences. Rather, we are only interested in simulating a succession of epochs over which the algorithm can adjust. We choose the users for each epoch at random from the test set; the number chosen per epoch depends on the total number of users in the test set.
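A simulated epoch assignment could be sketched as below. The assumption here, which the text does not confirm, is that every test user is assigned to exactly one epoch and that epochs are of near-equal size; `assign_epochs` and its `seed` parameter are hypothetical names.

```python
import random
from typing import List, Sequence


def assign_epochs(users: Sequence[int], n_epochs: int, seed: int = 0) -> List[List[int]]:
    """Shuffle the test users and deal them into `n_epochs` near-equal groups."""
    rng = random.Random(seed)  # local generator keeps the split reproducible
    shuffled = list(users)
    rng.shuffle(shuffled)
    # Strided slicing hands out users round-robin, so sizes differ by at most one.
    return [shuffled[k::n_epochs] for k in range(n_epochs)]


if __name__ == "__main__":
    for number, group in enumerate(assign_epochs(range(10), n_epochs=5)):
        print(number, group)
```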

4. Evaluation

The experiments compare six algorithms. Since we are concerned with ranking performance, we chose as our baseline algorithm RankALS, a pair-wise learning-to-rank algorithm. We also include the regularized long-tail diversification algorithm from (Abdollahpouri et al., 2017) (indicated as Reg in the figures) and two other non-temporal re-ranking approaches for long-tail recommendation (Binary xQuAD, indicated as Binary, and Smooth xQuAD, indicated as Smooth in the figures) from (Abdollahpouri et al., 2019) against which to compare our work. The temporal versions of Binary xQuAD and Smooth xQuAD described above are labeled as Time Binary and Time Smooth in the figures. We used the output from RankALS as input for the four re-ranking variants described above. We compute lists of length 100 from RankALS and pass these to the re-ranking algorithms to compute the final list of 10 recommendations for each user. (We used the implementation of RankALS in LibRec 2.0 for all experiments.)

In order to evaluate the effectiveness of algorithms in mitigating popularity bias we use four different metrics:

Average Recommendation Popularity (ARP): This measure, from (Yin et al., 2012), calculates the average popularity of the recommended items in each list.
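As a rough illustration, ARP can be computed by averaging item popularity within each list and then averaging across lists. The sketch assumes that popularity means an item's rating count and that empty lists are skipped; the function and argument names are hypothetical.

```python
from typing import Dict, List


def average_recommendation_popularity(rec_lists: Dict[str, List[int]],
                                      popularity: Dict[int, int]) -> float:
    """Mean over users of the mean popularity of the items in each user's list."""
    per_list = [
        sum(popularity[item] for item in items) / len(items)
        for items in rec_lists.values()
        if items  # skip empty lists to avoid dividing by zero
    ]
    return sum(per_list) / len(per_list)


if __name__ == "__main__":
    lists = {"u1": [1, 2], "u2": [3]}
    rating_counts = {1: 10, 2: 20, 3: 30}
    print(average_recommendation_popularity(lists, rating_counts))
```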

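Long-tail Coverage Ratio (LCR): This metric measures the ratio of covered long-tail items out of all long-tail items:

$$\mathrm{LCR} = \frac{\left| \left( \bigcup_{u} L_u \right) \cap \Gamma \right|}{|\Gamma|}$$

where $L_u$ is the list generated for user $u$ and $\Gamma$ is the set of long-tail items.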

This function is related to the Aggregate Diversity metric of (Adomavicius and Kwon, 2012) but it looks only at the long-tail part of the item catalog.

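Cumulative LCR (CLCR): Since we are modeling the problem of long-tail recommendation in a temporal way, we also want to see how the LCR changes after each epoch. Note that the cumulative measure is not the cumulative sum of different LCR values; it is calculated at the end of each epoch using the entire set of recommendations generated up to that epoch (i.e., from $t_0$ to $t$). More formally:

$$\mathrm{CLCR}_t = \frac{\left| \left( \bigcup_{k = t_0}^{t} \bigcup_{u \in U_k} L_u \right) \cap \Gamma \right|}{|\Gamma|}$$

where $U_k$ denotes the users who receive recommendations in epoch $k$, $L_u$ is the list generated for user $u$, and $\Gamma$ is the set of long-tail items.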
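The difference between the per-epoch and cumulative measures can be made concrete with a short sketch. It assumes each epoch is represented as a list of recommendation lists; the function names are hypothetical.

```python
from typing import List, Set


def long_tail_coverage(rec_lists: List[List[int]], long_tail: Set[int]) -> float:
    """LCR: share of long-tail items that appear in at least one list."""
    shown = set().union(*rec_lists)
    return len(shown & long_tail) / len(long_tail)


def cumulative_long_tail_coverage(epochs: List[List[List[int]]], long_tail: Set[int]) -> List[float]:
    """CLCR after each epoch, computed over everything recommended so far."""
    seen: Set[int] = set()
    values = []
    for rec_lists in epochs:
        for items in rec_lists:
            seen.update(items)
        values.append(len(seen & long_tail) / len(long_tail))
    return values


if __name__ == "__main__":
    tail = {3, 4, 5, 6}
    epochs = [[[1, 3], [2, 3]], [[1, 3], [2, 4]]]
    print([long_tail_coverage(e, tail) for e in epochs])  # per-epoch LCR
    print(cumulative_long_tail_coverage(epochs, tail))    # CLCR after each epoch
```

In this toy run the per-epoch LCR values are 0.25 and 0.5, while the CLCR after the second epoch is 0.5 rather than their sum of 0.75, because item 3 is counted only once.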


In addition to the aforementioned long-tail diversity metrics, we also evaluate the accuracy of the ranking algorithms in order to examine the diversity-accuracy trade-offs. For this purpose we use the standard Normalized Discounted Cumulative Gain (NDCG) measure of ranking accuracy.
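For reference, NDCG@k with binary relevance can be computed as in the sketch below. Whether the experiments use binary or graded relevance is not stated, so the binary form here is an assumption, as is the function name.

```python
import math
from typing import List, Set


def ndcg_at_k(ranked: List[int], relevant: Set[int], k: int = 10) -> float:
    """NDCG@k with binary gains: 1 for a relevant item, 0 otherwise."""
    dcg = sum(
        1.0 / math.log2(position + 2)  # position 0 gets discount log2(2) = 1
        for position, item in enumerate(ranked[:k])
        if item in relevant
    )
    # The ideal ranking places all relevant items at the top of the list.
    ideal = sum(1.0 / math.log2(position + 2) for position in range(min(len(relevant), k)))
    return dcg / ideal if ideal > 0 else 0.0


if __name__ == "__main__":
    print(ndcg_at_k([3, 1], relevant={1}))
```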

The parameter $\lambda$ in Equation 4 has been chosen experimentally for each algorithm and each dataset and the best value selected. For the Epinions dataset, the values for Reg, Binary, Smooth, Time Binary and Time Smooth are 0.05, 0.1, 0.0001, 0.0006 and 0.0002, respectively. For the MovieLens dataset, these values are 0.05, 0.1, 0.1, 0.1 and 0.05, respectively.

Figure 2. The epoch-wise ARP and cumulative LCR (CLCR)
| Algorithm | MovieLens: Average LCR | MovieLens: Average NDCG@10 | MovieLens: Average ARP | Epinions: Average LCR | Epinions: Average NDCG@10 | Epinions: Average ARP |
|---|---|---|---|---|---|---|
| ALS | 0.00059 | 0.262 | 1844 | 0.00000 | 0.0299 | 549 |
| Reg | 0.00306 | 0.261 | 1831 | 0.00171 | 0.0243* | 447 |
| Binary | 0.00800 | 0.260 | 1827 | 0.00047 | 0.0299 | 548 |
| Smooth | 0.00800 | 0.259 | 1827 | 0.00309 | 0.0295 | 540 |
| Time Binary | 0.00060 | 0.262 | 1843 | 0.00043 | 0.0299 | 546 |
| Time Smooth | 0.01530 | 0.260 | 1820 | 0.00344 | 0.0298 | 542 |

Table 1. Experimental results. Values not significantly different from ALS are in italics. Bold values are the best results and are significant improvements over the next best algorithm. * indicates a result significantly worse than the baseline.

5. Results

Figure 2 shows how long-tail coverage changed over the course of the experiment. The top (ARP) figures show the average recommendation popularity in each epoch. The bottom (CLCR) figures show the total long-tail coverage considered from the initial epoch to the present one. In the MovieLens dataset, we see that, while the ARP score is fairly similar across algorithms, Time Smooth xQuAD far out-paces the others in covering more unique items – it is, of course, specifically designed to incorporate new items at each epoch.

For Epinions, the plots are quite different. The first notable feature is how good (low) the ARP results are for the Reg algorithm. However, the CLCR results reveal how misleading this metric is. Other algorithms cover many more of the long-tail items.

Another interesting result is how the CLCR for the Time Binary algorithm jumps up at the first epoch and then remains relatively stable. The reason, as we discovered, is that there are not enough high-quality long-tail items (items with an average rating above 3) in this dataset and, therefore, each original ranked list of 100 items for every user contains very few of them. Once these items have been chosen in the first epoch, they are not picked again by the algorithm due to its binary nature. The Time Smooth algorithm incorporates long-tail items more slowly because they are not rated as highly and are therefore not scored highly by the base recommendation algorithm.

Additional results are shown in Table 1. On the MovieLens dataset, all algorithms have very similar NDCG, meaning they produce lists with similar ranking quality. There is therefore no cost associated with the improved CLCR results shown in Figure 2. ARP values are also similar. LCR values, however, are quite different for these algorithms, with Time Smooth xQuAD incorporating by far the most long-tail items in each list.

The table also includes results for the Epinions dataset. On this dataset, the algorithms behave differently due to the extreme long-tail. For example, the Reg algorithm has the best overall ARP as Figure 2 would suggest. However, its LCR is worse than Smooth xQuAD and Time Smooth xQuAD. It is concentrating its long-tail promotion on a small number of long-tail items, and not covering as much of the catalog. It is also not adding many items to recommendation lists, so its impact on overall recommendation outcomes is limited. The NDCG results for Epinions also show no statistically-significant loss for any of the re-ranking algorithms.

6. Related Work

Recommending serendipitous items from the long tail is generally considered to be a key function of recommendation (Anderson, 2006), as these are items that users are less likely to know about. Long-tail items are also important for a fuller understanding of users' preferences. Systems that use active learning to explore each user's profile will typically need to present more long-tail items. These are the ones that the user is less likely to know about, and where users' preferences are more likely to be diverse (Resnick et al., 2013).

Item popularity and its impact on recommendation quality have been explored by some researchers (Brynjolfsson et al., 2006; Park and Tuzhilin, 2008). These authors tried to improve the performance of the recommender system in terms of accuracy and precision, given the long tail in the ratings. Our work, instead, focuses on reducing popularity bias and balancing the representation of items across the popularity distribution.

A regularization-based approach to improving long tail recommendations is found in (Abdollahpouri et al., 2017). One limitation with that work is that it is restricted to factorization models where the long-tail preference can be encoded in terms of the latent factors. This algorithm does not account for differential user tolerance towards long-tail items. A re-ranking approach can be applied to any algorithm, and in our implementation, we also take personalization of long-tail promotion into account.

There is substantial research in recommendation diversity, where the goal is to avoid recommending too many similar items (Zhou et al., 2010; Castells et al., 2011; Zhang and Hurley, 2008), including some research on personalized diversity where the amount of diversification is dependent on the user’s tolerance (Eskandanian et al., 2017; Wasilewski and Hurley, 2018). Another similar work to ours is (Vargas et al., 2012) where authors used a modified version of xQuAD for intent-oriented diversification of search results and recommendations. Another work that also used xQuAD in recommendation is (Liu and Burke, 2018) where the authors used it to improve recommendation fairness in a microlending scenario.

Our work is different from these previous diversification approaches in that it is not dependent on the characteristics of items, but rather on the relative popularity of items. In addition, our work takes the performance of the recommender system at previous times into account in order to compensate for previous omissions. Another relevant work to ours is (Abdollahpouri et al., 2019) where authors used xQuAD for long-tail recommendation. However, in that work the long-tail compensation is considered as a one-shot action.

Temporal diversity and novelty have also been explored in (Lathia et al., 2010), where the authors investigated how different algorithms perform in terms of the diversity of the recommended item lists over time. Our work is one approach to improving the temporal novelty of recommendations, although our focus is more on the coverage of the item catalog than on differences across lists.

7. Conclusion and Future work

Many recommendation algorithms have the problem of popularity bias: a few items are recommended frequently while the majority of the items do not appear at all. Research on popularity bias has concentrated on improving individual recommendation lists for each user without taking into account what has been recommended to others. In this work, we showed that a purely list-wise approach is less effective for long-tail promotion than one that accounts for what the system has already recommended. We presented a temporal approach based on the xQuAD diversification algorithm from information retrieval to improve long-tail recommendation over time.

Experimental results showed that our approach is capable of recommending more unique long-tail items than the other baselines while maintaining comparable ranking accuracy. In future work, we intend to investigate the effect of epoch size on the performance of the algorithms. In addition, we want to make the algorithm more dynamic by modifying the parameter $\lambda$ in each epoch, depending on performance. Given the differences in algorithm performance across these datasets, we are also interested in developing and fitting parameterized models of long-tail distributions based on the characteristics of the data to go beyond the simple short-head / long-tail division employed here.


  • Abdollahpouri (2019) Himan Abdollahpouri. 2019. Popularity Bias in Ranking and Recommendation. In AAAI/ACM Conference on AI, Ethics, and Society (AIES'19), January 27–28, 2019, Honolulu, HI, USA. ACM.
  • Abdollahpouri et al. (2017) Himan Abdollahpouri, Robin Burke, and Bamshad Mobasher. 2017. Controlling Popularity Bias in Learning to Rank Recommendation. In Proceedings of the 11th ACM conference on Recommender systems. ACM, 42–46.
  • Abdollahpouri et al. (2019) Himan Abdollahpouri, Robin Burke, and Bamshad Mobasher. 2019. Managing Popularity Bias in Recommender Systems with Personalized Re-ranking. In Florida AI Research Symposium (FLAIRS). ACM, To appear.
  • Adomavicius and Kwon (2012) G. Adomavicius and Y.O. Kwon. 2012. Improving aggregate recommendation diversity using ranking-based techniques. Knowledge and Data Engineering, IEEE Transactions on 24, 5 (2012), 896–911.
  • Anderson (2006) Chris Anderson. 2006. The long tail: Why the future of business is selling more for less. Hyperion.
  • Bellogín et al. (2017) Alejandro Bellogín, Pablo Castells, and Iván Cantador. 2017. Statistical biases in Information Retrieval metrics for recommender systems. Information Retrieval Journal 20, 6 (2017), 606–634.
  • Brynjolfsson et al. (2006) Erik Brynjolfsson, Yu Jeffrey Hu, and Michael D Smith. 2006. From niches to riches: Anatomy of the long tail. Sloan Management Review (2006), 67–71.
  • Castells et al. (2011) Pablo Castells, Saúl Vargas, and Jun Wang. 2011. Novelty and diversity metrics for recommender systems: choice, discovery and relevance. In Proceedings of International Workshop on Diversity in Document Retrieval (DDR). ACM Press, 29–37.
  • Eskandanian et al. (2017) Farzad Eskandanian, Bamshad Mobasher, and Robin Burke. 2017. A Clustering Approach for Personalizing Diversity in Collaborative Recommender Systems. In Proceedings of the 25th Conference on User Modeling, Adaptation and Personalization. ACM, 280–284.
  • Harper and Konstan (2015) F Maxwell Harper and Joseph A Konstan. 2015. The MovieLens Datasets: History and Context. ACM Transactions on Interactive Intelligent Systems (TiiS) 5, 4 (2015), 19.
  • Lathia et al. (2010) Neal Lathia, Stephen Hailes, Licia Capra, and Xavier Amatriain. 2010. Temporal diversity in recommender systems. In Proceedings of the 33rd international ACM SIGIR conference on Research and development in information retrieval. ACM, 210–217.
  • Liu and Burke (2018) Weiwen Liu and Robin Burke. 2018. Personalizing Fairness-aware Re-ranking. arXiv preprint arXiv:1809.02921 (2018). Presented at the 2nd FATRec Workshop held at RecSys 2018, Vancouver, CA.
  • Massa and Avesani (2007) Paolo Massa and Paolo Avesani. 2007. Trust-aware recommender systems. In Proceedings of the 2007 ACM conference on Recommender systems. ACM, 17–24.
  • Park and Tuzhilin (2008) Yoon-Joo Park and Alexander Tuzhilin. 2008. The long tail of recommender systems and how to leverage it. In Proceedings of the 2008 ACM conference on Recommender systems. ACM, 11–18.
  • Resnick et al. (2013) Paul Resnick, R Kelly Garrett, Travis Kriplean, Sean A Munson, and Natalie Jomini Stroud. 2013. Bursting your (filter) bubble: strategies for promoting diverse exposure. In Proceedings of the 2013 conference on Computer supported cooperative work companion. ACM, 95–100.
  • Santos et al. (2010) Rodrygo LT Santos, Craig Macdonald, and Iadh Ounis. 2010. Exploiting query reformulations for web search result diversification. In Proceedings of the 19th international conference on World wide web. ACM, 881–890.
  • Santos et al. (2015) Rodrygo LT Santos, Craig Macdonald, Iadh Ounis, et al. 2015. Search result diversification. Foundations and Trends® in Information Retrieval 9, 1 (2015), 1–90.
  • Vargas et al. (2012) Saúl Vargas, Pablo Castells, and David Vallet. 2012. Explicit relevance models in intent-oriented information retrieval diversification. In Proceedings of the 35th international ACM SIGIR conference on Research and development in information retrieval. ACM, 75–84.
  • Wasilewski and Hurley (2018) Jacek Wasilewski and Neil Hurley. 2018. Intent-aware Item-based Collaborative Filtering for Personalised Diversification. In Proceedings of the 26th Conference on User Modeling, Adaptation and Personalization. ACM, 81–89.
  • Yin et al. (2012) Hongzhi Yin, Bin Cui, Jing Li, Junjie Yao, and Chen Chen. 2012. Challenging the long tail recommendation. Proceedings of the VLDB Endowment 5, 9 (2012), 896–907.
  • Zhang and Hurley (2008) M. Zhang and N. Hurley. 2008. Avoiding monotony: improving the diversity of recommendation lists. In Proceedings of the 2008 ACM conference on Recommender systems. ACM, 123–130.
  • Zhou et al. (2010) T. Zhou, Z. Kuscsik, J.G. Liu, M. Medo, J.R. Wakeling, and Y.C. Zhang. 2010. Solving the apparent diversity-accuracy dilemma of recommender systems. Proceedings of the National Academy of Sciences 107, 10 (2010), 4511–4515.