Collaborative Filtering (CF) is a well-known recommendation technique that uses historical data on interactions between users and items (i.e., ratings provided by the users on different items) to generate personalized recommendations for the users. Recommendations generated by CF generally suffer from bias against certain groups of users or items (Yin et al., 2012; Yao and Huang, 2017). Bias in recommendation output can originate from different sources: 1) it may stem from underlying biases in the input data: Figure 1 shows the distribution of the rating data in the MovieLens dataset (see section 4 for more details on this dataset), where a few popular items receive a large proportion of the ratings while the majority of other items do not receive much attention from the users; or 2) it may be due to algorithmic bias, where recommendation algorithms propagate the existing bias in the data (Jannach et al., 2013) and, in some cases, intensify it by recommending popular items even to users who are not interested in them (Abdollahpouri and Mansoury, 2020).
The algorithmic bias can be intensified over time when users interact with the given recommendations, which are biased towards popular items, and these interactions are added to the data. Users receiving recommendation lists may select (e.g., by rating or clicking) some of the recommended items, and the system will add those items to their profiles as part of their interaction history. In this way, recommendations and user profiles form a feedback loop (D’Amour et al., 2020; Chaney et al., 2018): the users and the system are in a process of mutual dynamic evolution, where users’ profiles get updated over time via the recommendations generated by the recommender system, and the effectiveness of the recommender system is in turn affected by the users’ profiles.
The study of feedback loops in machine learning, and particularly in recommender systems, has recently received more attention from researchers (D’Amour et al., 2020; Jiang et al., 2019; Chaney et al., 2018; Schmit and Riquelme, 2017; Sinha et al., 2016; Sun et al., 2019). D’Amour et al. (2020) analyzed the long-term fairness of machine-learning-based decision-making systems through simulation studies in an agent-based environment, covering three domains: bank loans, allocation of attention, and college admission. Their analysis showed that the common single-step analysis does not reveal the dynamic behavior of the system, and it highlighted the need to explore the long-term effects of decision-making systems. In another work, also based on a simulation using synthetic data, Chaney et al. (2018) showed that the feedback loop causes homogenization of the user experience and a shift in item consumption. Homogenization in their study was measured as the ratio of commonly rated items in a target user’s profile and her nearest neighbor’s profile, and they showed that homogenization leads to lower utility for the users.
In this paper, we investigate the effect of the feedback loop on amplifying bias in recommender systems. We study popularity bias amplification and the impact of this effect on other aspects of a recommender system, including declining aggregate diversity, shifting the representation of the users’ taste, and homogenization of the users. In particular, we show that the impact of the feedback loop is generally stronger for the users who belong to the minority group. For the experiments, we simulate the users’ interactions with recommender systems over time in an offline setting. The concept of time here is not chronological; rather, it refers to consecutive interactions of users with the recommendations in different iterations. That is, in each iteration, each user’s profile is updated by adding selected items from the recommendation list generated in the previous iteration. We performed the simulation using three recommendation algorithms on a movie dataset.
2. Feedback loop simulation
The ideal scenario for investigating the effect of the feedback loop on amplifying bias in recommender systems is to perform online testing on a real-world platform with a steady stream of data. However, due to the lack of access to real-world platforms for experimentation, we simulate the recommender system process in an offline setting. To do so, we simulate the recommendation process over time by iteratively generating recommendation lists for the users and updating their profiles by adding the selected items from those recommendation lists based on an acceptance probability. Given the rating data $D$ as an $m \times n$ matrix formed by the ratings provided by $m$ users on $n$ different items, the mechanism for simulating the feedback loop is to generate recommendation lists for the users in each iteration and update their profiles based on the delivered recommendations. The following steps show this mechanism:
1) Given $D^t$ as the rating data in iteration $t$, we split $D^t$ into training and test sets, with 80% for $D^t_{train}$ and 20% for $D^t_{test}$.
2) We build the recommendation model on $D^t_{train}$ to generate the recommendation lists $R^t$ for all users.
3) For each user $u$ and the recommendation list $R_u$ generated for $u$, we follow the acceptance probability concept proposed in (Abdollahpouri et al., 2019) to decide which item from the recommendation list the user might select. The acceptance probability assigns a probability value to each item in $R_u$, where more relevant (higher-ranked) items are assigned a higher probability of being selected. Formally, Equation 1 defines the acceptance probability of each item $i$ in $R_u$ in terms of $\alpha$, a negative value ($\alpha < 0$) that controls the probability assigned to each recommended item, and $rank_i$, the rank of item $i$ in $R_u$. Equation 1 is only a selection probability and does not assign a potential rating that a user might give to the selected item. This is particularly important if we also want to include rating-based algorithms such as UserKNN in our simulation, as we have done in this paper. To estimate the rating a user might give to the selected item, we follow the Item Response Theory used in (Sinha et al., 2016; Ho and Quinn, 2008). More formally, the estimated rating $\hat{R}_{ui}$ is computed from $\mu_u$, the average of the ratings in $u$’s profile, $\sigma_u$, the standard deviation of the ratings in $u$’s profile, $\mu_i$, the average of the ratings assigned to $i$, and $\eta_{ui}$, a noise term drawn from a Gaussian distribution. In order to estimate an integer rating value in the range $[a, b]$, where $a$ and $b$ are the minimum and maximum rating values, respectively, we use the following equation (Sinha et al., 2016):
$$\hat{R}_{ui} = \max(\min(\mathrm{round}(\hat{R}_{ui}), b), a) \qquad (3)$$
After estimating $\hat{R}_{ui}$, we add $i$ to $u$’s profile if $i$ is not already in $u$’s profile, and we repeat this process for all users to form $D^{t+1}$.
The steps 1 through 3 are repeated in each iteration.
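The profile update in step 3 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes the acceptance probability has an exponential form in the rank, that each recommended item is accepted independently, and that the rating estimate adds the user mean, the user standard deviation times the item mean, and unit-variance Gaussian noise. None of these forms is spelled out in the text above, and all function and parameter names are hypothetical.

```python
import math
import random


def acceptance_probability(rank, alpha=-0.1):
    """Assumed exponential form: a smaller rank (higher position) gives a higher probability."""
    return math.exp(alpha * rank)


def estimate_rating(mu_u, sigma_u, mu_i, a=1, b=5, rng=random):
    """Assumed IRT-style estimate, rounded and clipped to the integer range [a, b]."""
    eta = rng.gauss(0.0, 1.0)          # noise term; unit variance is an assumption
    raw = mu_u + sigma_u * mu_i + eta  # assumed functional form
    return max(min(round(raw), b), a)


def update_profile(profile, rec_list, item_means, alpha=-0.1, rng=random):
    """Return a copy of `profile` (item -> rating) extended with accepted recommendations."""
    ratings = list(profile.values())
    mu_u = sum(ratings) / len(ratings)
    sigma_u = math.sqrt(sum((r - mu_u) ** 2 for r in ratings) / len(ratings))
    updated = dict(profile)
    for rank, item in enumerate(rec_list, start=1):
        if item in updated:
            continue  # items already in the profile are not added again
        if rng.random() < acceptance_probability(rank, alpha):
            updated[item] = estimate_rating(mu_u, sigma_u, item_means[item], rng=rng)
    return updated
```

Passing a seeded `random.Random` instance as `rng` makes a simulation run reproducible; applying `update_profile` to every user yields the rating data for the next iteration.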
3. Modeling Feedback
As we mentioned in section 1, recommendation algorithms suffer from popularity bias. In this section, we formally model the propagation of this bias due to the feedback loop phenomenon. Let $P(D^t)$ and $P(R^t)$ be the average popularity (i.e., the expected values) of the items in the rating data and of the recommended items in iteration $t$, respectively. Then
$$P(R^t) = (1 + \delta^t)\, P(D^t) \qquad (4)$$
where $\delta^t$ is the percent increase of the popularity of the recommendations compared to that of the rating data in iteration $t$. Now, assuming that, out of all the recommendations given to the users, we add $K$ interactions ($K > 0$) to the profiles of the users, the size of the rating data in the next iteration would be $|D^t| + K$ and its average popularity will be
$$P(D^{t+1}) = \frac{|D^t|\, P(D^t) + K\, P(R^t)}{|D^t| + K},$$
which can be simplified as
$$P(D^{t+1}) = P(D^t) + \frac{K\, \delta^t}{|D^t| + K}\, P(D^t),$$
which means the average popularity of the items in the rating data is now increased by $\frac{K\, \delta^t}{|D^t| + K}\, P(D^t)$. Based on Equation 4, by definition, the average popularity of the recommended items in each iteration is the average popularity of the rating data in the same iteration plus a positive value, and since $P(D^{t+1})$ has increased compared to $P(D^t)$, $P(R^{t+1})$ will also be higher than $P(R^t)$ due to transitivity. In other words, in each iteration $t$, $P(R^{t+1}) > P(R^t)$, indicating the popularity propagation/intensification from one iteration to the next one.
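To make the argument concrete, the following sketch iterates the popularity update numerically. It takes the average popularity of the recommendations to be (1 + delta) times that of the rating data, and the new average to be the interaction-weighted mean of the old data and the added recommendations. Keeping delta and the number of added interactions constant across iterations is a simplifying assumption made only for illustration, and the names and starting values are hypothetical.

```python
def simulate_popularity(p_data, size, delta, k, iterations):
    """Track the average popularity of the rating data and of the recommendations."""
    history = []
    for _ in range(iterations):
        p_rec = (1.0 + delta) * p_data  # recommendations are more popular than the data
        history.append((p_data, p_rec))
        # Weighted mean after adding k interactions drawn from the recommendations.
        p_data = (size * p_data + k * p_rec) / (size + k)
        size += k
    return history


if __name__ == "__main__":
    for t, (p_d, p_r) in enumerate(simulate_popularity(0.10, 1000, 0.5, 100, 5)):
        print(f"iteration {t}: data={p_d:.4f} recommendations={p_r:.4f}")
```

With any positive delta and k, both sequences increase from one iteration to the next, which is the amplification described above.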
In this section, we describe the data and the algorithms we used in our experiments along with the empirical results.
We performed our experiments on the MovieLens 1M dataset (Harper and Konstan, 2015), a movie rating dataset collected by the GroupLens research group. We picked this dataset particularly because it has information about both the users (such as gender) and the items (such as genre). In this dataset, 6,040 users provided 1,000,209 ratings (4,331 males provided 753,769 ratings and 1,709 females provided 246,440 ratings) on 3,706 movies. The ratings are in the range of 1-5 and the density of the dataset is 4.468%. Also, each movie is assigned either a single genre or a combination of several genres. Overall, there are 18 unique genres in this dataset.
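The reported density follows directly from these counts, as the short check below shows. Density is taken here as the number of ratings divided by the number of user-item pairs; the variable names are only illustrative.

```python
n_users, n_items, n_ratings = 6040, 3706, 1000209
male_users, male_ratings = 4331, 753769
female_users, female_ratings = 1709, 246440

# Fraction of the user-item matrix that is filled, as a percentage.
density_percent = 100.0 * n_ratings / (n_users * n_items)

# Share of all ratings contributed by each group.
male_share = male_ratings / n_ratings
female_share = female_ratings / n_ratings
```

The group counts add up to the totals, and males contribute roughly three quarters of all ratings, which is why they are treated as the majority group in the analysis.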
We performed a comprehensive evaluation of the effect of the feedback loop on amplifying bias in recommender systems using three different recommendation algorithms: user-based collaborative filtering (UserKNN) (Resnick et al., 1994), Bayesian personalized ranking (BPR) (Rendle et al., 2009), and MostPopular. BPR is a factorization model that works on binary data, and UserKNN is a neighborhood model that works on explicit rating data. MostPopular recommends the most popular items to everyone (more precisely, the popular items that a user has not seen yet). We set the number of factors in BPR and the number of neighbors in UserKNN to 50 to achieve the best performance in terms of precision. For our simulation, we performed steps 1-3 in section 2 for 20 iterations.
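As an illustration of the simplest of the three algorithms, MostPopular can be sketched as below. This is a minimal version, not the implementation used in the experiments: popularity is taken to be the number of users who rated an item, and the list size `n` is a hypothetical parameter.

```python
from collections import Counter


def most_popular_recommendations(profiles, n=10):
    """profiles maps each user to the collection of items they have rated.

    Returns, for each user, the n most-rated items that the user has not seen yet.
    """
    popularity = Counter(item for items in profiles.values() for item in items)
    ranked = [item for item, _ in popularity.most_common()]
    return {
        user: [item for item in ranked if item not in items][:n]
        for user, items in profiles.items()
    }
```

Because every user draws from the same ranked list, the recommendations differ between users only through the items each user has already rated.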
4.3.1. Popularity bias amplification
As we formally showed in section 3, recommendation models can intensify the popularity bias in input data over time due to the feedback loop. Figure 2 (left) shows the effect of such a loop on the average popularity of recommendation lists over time (i.e. in different iterations). As shown in this plot, even though these algorithms start with different average popularity values due to their inherent nature, they all show an ascending pattern in terms of the average popularity over different iterations. The curve for BPR seems to have a steeper slope compared to the other algorithms indicating a stronger bias propagation of this algorithm. The exact reason for these performance differences across different algorithms needs further investigation and we leave it for future work.
Figure 2 (right) shows the aggregate diversity (aka catalog coverage) of recommendation algorithms: the percentage of items which appear at least once in the recommendation lists across all users. As a recommender system concentrates more on popular items, it will necessarily cover fewer items in its recommendations and that effect is clear here, especially for BPR, which starts out with a relatively high aggregate diversity.
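The two quantities plotted in Figure 2 can be computed along the following lines. This is a sketch under one assumption: the popularity of an item is taken to be the fraction of users who rated it, since the exact definition is not given in this section. The function names are hypothetical, and aggregate diversity is returned as a fraction rather than a percentage.

```python
def average_popularity(rec_lists, profiles):
    """Mean popularity over all recommended items (with repetition across users)."""
    n_users = len(profiles)
    counts = {}
    for items in profiles.values():
        for item in items:
            counts[item] = counts.get(item, 0) + 1
    recommended = [item for recs in rec_lists.values() for item in recs]
    return sum(counts.get(item, 0) / n_users for item in recommended) / len(recommended)


def aggregate_diversity(rec_lists, catalog):
    """Fraction of catalog items that appear in at least one recommendation list."""
    covered = set().union(*rec_lists.values())
    return len(covered & set(catalog)) / len(catalog)
```

Tracking both values after every iteration of the simulation gives one curve per algorithm for each panel.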
This bias amplification over different iterations could lead to two other problems: 1) shifting the representation of the user’s taste over time, and 2) the domination of one group of users (the majority group) over another (the minority group) which eventually could diminish the differences between the groups and create homogenization.
4.3.2. Shifting users’ taste representation
One consequence of the feedback loop is a shift in the representation of the users’ taste revealed in user profiles. We define the users’ interest toward various movie genres based on the rated items in their profiles, which creates a genre distribution over the rating data. This genre distribution is calculated as the ratio of the movies associated with each genre over all genres in the users’ profiles. In the MovieLens dataset, some movies are assigned multiple genres; in those cases, we assign equal probability to each genre. For example, if an item has genres $g_1$ and $g_2$, the probability of each of $g_1$ and $g_2$ is 0.5. Taking the genre distribution in the first iteration as the initial preferences represented in the system, we are interested in investigating how the initial representation of users’ taste changes over time due to the feedback loop. For this purpose, in each iteration $t$, we calculate the Kullback-Leibler divergence (KLD) between the initial genre distribution and the genre distribution in iteration $t$ for each user. A higher KLD value indicates higher deviation from the initial preference.
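The genre distribution and its divergence from the initial distribution can be sketched as follows. Two details are assumptions rather than something stated in the text: a small additive smoothing term keeps the divergence finite when a genre has zero mass, and the direction of the divergence (which distribution is passed first) is a choice left to the caller. Profiles are assumed to be non-empty, and the function names are hypothetical.

```python
import math


def genre_distribution(profile_items, item_genres, all_genres):
    """Each movie splits one unit of mass equally among its genres; the result sums to 1."""
    mass = {g: 0.0 for g in all_genres}
    for item in profile_items:
        genres = item_genres[item]
        for g in genres:
            mass[g] += 1.0 / len(genres)
    total = sum(mass.values())
    return {g: m / total for g, m in mass.items()}


def kl_divergence(p, q, eps=1e-9):
    """KL(p || q) over the genres of p, with additive smoothing (an assumption)."""
    genres = list(p)
    p_s = [p[g] + eps for g in genres]
    q_s = [q[g] + eps for g in genres]
    zp, zq = sum(p_s), sum(q_s)
    return sum((a / zp) * math.log((a / zp) / (b / zq)) for a, b in zip(p_s, q_s))
```

Computing the distribution once from the initial profile and once from the profile at each later iteration, then passing the pair to `kl_divergence`, gives a per-user deviation that can be averaged over users.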
Figure 3 (left) shows the deviation of users’ taste from their initial preferences. For all recommendation algorithms, we observe that the deviation of users’ profiles from their initial preferences increases over time. It is worth noting that the change in users’ preferences shown in this figure is the change in the representation of users’ preferences in the system, not the change in users’ intrinsic preferences. One consequence of this change in representation is that recommendation models may not be able to capture the users’ true preferences when generating recommendations for them.
Shifting the users’ taste representation could happen in two situations: when the recommendations given to the users are more diverse than what the users are interested in (i.e., exploration), or when the recommendations are over-concentrated on a few items while the users’ profiles are more diverse. In the latter case, since all users are exposed to a limited number of items over time, their profiles all converge towards a common range of preferences.
Figure 3 (right) shows the distance between the representations of the preferences of males (the majority group) and females (the minority group) over time. In each iteration $t$, given the genre distributions separately extracted from the males’ and females’ ratings as $Q_M^t$ and $Q_F^t$, respectively, we calculate the KLD between $Q_M^t$ and $Q_F^t$, which measures the distance between the preferences of males and females. As shown in the plot, the KLD value decreases dramatically over time for all algorithms, showing strong homogenization of users’ preferences.
Now, an interesting question is which user group is dominating the other. To answer this question, we separately compare the genre preferences of males and females with the preferences of the whole population. Given $Q_P$ as the initial genre preferences of all users (the population), in each iteration $t$ we calculate the KLD between the males’ genre distribution $Q_M^t$ and $Q_P$, and between the females’ genre distribution $Q_F^t$ and $Q_P$.
Figure 4 (left) shows, separately for males and females, the KLD between each group’s preferences and the initial preferences of the population in different iterations. We can see that, for all algorithms, the representation of females’ preferences approaches the representation of the initial preferences of the population. However, this value is slightly increasing for males, showing that they become more distant from the initial preferences of the population. We believe the reason is that male users account for the majority of the ratings in the data and hence, initially, the population is closer to the male profiles. Over time, since the recommended items are more likely to be those rated by males (as males have rated more items), adding them to the users’ profiles causes the female profiles to get closer to the initial population, which was dominated by the male users.
Figure 4 (right) shows the deviation from the representation of the initial preferences of each user in the system, separately for males and females. For all algorithms, the deviation for females is significantly higher than for males, demonstrating the severity of the impact of the feedback loop on the minority group (females in our experiment).
5. Discussion and Future Work
There are several interesting threads of research that could be built on this work. Firstly, in some recommendation domains such as music, it is very common for a user to listen to the same song repeatedly. Therefore, the restriction we imposed on the selection algorithm in this paper regarding the items that were already in the users’ profiles (those items were not added to the users’ profiles in the next iteration) could be lifted and, instead, the rating for such an item would be updated in each iteration.
Secondly, different strategies for user grouping could be used. Here, we used a pre-defined label for users (i.e. gender) to create user groups. One could group the users based on their average profile size, average popularity of their rated items, or some other statistical characteristics that might be of importance for any particular reason.
Thirdly, different algorithms that control the popularity bias problem (Antikacioglu and Ravi, 2017; Mansoury et al., 2020; Kamishima et al., 2014) could be investigated in terms of how they mitigate the bias amplification in the feedback loop. Our hypothesis is that, since these algorithms reduce the popularity bias in each iteration, according to Equation 4 their bias amplification over time would also be smaller than that of the standard algorithms.
Finally, the selection technique in Equation 1 that we used in this paper leverages the ranking position of each item in the list to determine whether it would be selected by the user. Other selection policies, such as top-1 (selecting the first item in the list) or even random selection, could be studied.
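The two alternative policies mentioned here are simple to state in code. The sketch below is only illustrative: the function names and the shared signature are hypothetical, chosen so that either policy could be swapped into a simulation in place of the rank-based selection.

```python
import random


def select_top1(rec_list, rng=random):
    """Top-1 policy: always select the first item (rng is unused, kept for a uniform signature)."""
    return rec_list[:1]


def select_random(rec_list, rng=random):
    """Random policy: select one item uniformly at random, ignoring its rank."""
    return [rng.choice(rec_list)] if rec_list else []
```

Top-1 makes the selection fully determined by the ranking, while random selection removes the influence of position within the list, so the two bracket the rank-based policy from either side.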
In this paper, we investigated the effect of the feedback loop on bias amplification in recommender systems through an offline simulation. We formally and empirically showed that different recommendation algorithms amplify the existing bias over successive iterations of user interaction. We then showed that this bias amplification leads to other issues in recommender systems, such as declining aggregate diversity, shifting the representation of the users’ preferences (i.e., their profiles), and homogenization of the user groups. In particular, for the two user groups of males and females, we observed that the bias amplification for females, who happen to be the minority group in terms of both their population and their number of ratings, was stronger than that for males. These results emphasize the importance of algorithmic solutions for tackling popularity bias and increasing diversity in the recommendations, since even a small bias in the current state of a recommender system could be greatly amplified over time if it is not addressed properly.
- Abdollahpouri et al. (2019). Beyond personalization: research directions in multistakeholder recommendation. arXiv preprint arXiv:1905.01986.
- Abdollahpouri and Mansoury (2020). Multi-sided exposure bias in recommendation. In KDD Workshop on Industrial Recommendation Systems.
- Antikacioglu and Ravi (2017). Post processing recommender systems for diversity. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 707–716.
- Chaney et al. (2018). How algorithmic confounding in recommendation systems increases homogeneity and decreases utility. In Proceedings of the 12th ACM Conference on Recommender Systems, pp. 224–232.
- D’Amour et al. (2020). Fairness is not static: deeper understanding of long term fairness via simulation studies. In Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency, pp. 525–534.
- Harper and Konstan (2015). The movielens datasets: history and context. ACM Transactions on Interactive Intelligent Systems 5 (4), pp. 1–19.
- Ho and Quinn (2008). Improving the presentation and interpretation of online ratings data with model-based figures. The American Statistician 62 (4).
- Jannach et al. (2013). What recommenders recommend–an analysis of accuracy, popularity, and sales diversity effects. In International conference on user modeling, adaptation, and personalization, pp. 25–37.
- Jiang et al. (2019). Degenerate feedback loops in recommender systems. In Proceedings of the 2019 AAAI/ACM Conference on AI, Ethics, and Society, pp. 383–390.
- Kamishima et al. (2014). Correcting popularity bias by enhancing recommendation neutrality. In RecSys Posters.
- Mansoury et al. (2020). FairMatch: a graph-based approach for improving aggregate diversity in recommender systems. In Proceedings of the 28th ACM Conference on User Modeling, Adaptation and Personalization, pp. 154–162.
- Rendle et al. (2009). BPR: bayesian personalized ranking from implicit feedback. In Proceedings of the 25th conference on uncertainty in artificial intelligence, pp. 452–461.
- Resnick et al. (1994). GroupLens: an open architecture for collaborative filtering of netnews. In ACM conference on Computer supported cooperative work, pp. 175–186.
- Schmit and Riquelme (2017). Human interaction with recommendation systems: on bias and exploration. stat 1050 (1).
- Sinha et al. (2016). Deconvolving feedback loops in recommender systems. In Advances in neural information processing systems, pp. 3243–3251.
- Sun et al. (2019). Debiasing the human-recommender system feedback loop in collaborative filtering. In Companion Proceedings of The 2019 World Wide Web Conference, pp. 645–651.
- Yao and Huang (2017). Beyond parity: fairness objectives for collaborative filtering. In Advances in Neural Information Processing Systems, pp. 2921–2930.
- Yin et al. (2012). Challenging the long tail recommendation. arXiv preprint arXiv:1205.6700.