1. Introduction and Background
Recommender systems are powerful information filters that guide users to items of interest within a gigantic and rapidly expanding pool of candidates, and they appear in more and more scenarios in our lives (Davidson et al., 2010; Schedl et al., 2015; Guo et al., 2018; Li et al., 2008). Users of industrial recommender systems are normally recommended a list of items at one time. Ideally, such list-wise recommendation should provide diverse and relevant options to the users. For efficiency reasons, many industrial recommender systems implement list-wise recommendation as top-$N$ recommendation, which selects the first $N$ items from an ordered list. The ordered list is generated by a ranking function, which is learned from labeled data to optimize accuracy and produces a ranking score for each individual item. Such top-$N$ recommendation focuses on the relevance of each individual item independently and overlooks the mutual influence between items. As observed in (McNee et al., 2006), recommending a list of items by such a method leads to sub-optimal performance of recommender systems, for the following two reasons. On the one hand, ranking by relevance is likely to select multiple similar items for the list. However, it is highly possible that at most one of such similar items is needed by a user, while the others are redundant and waste the chance of being displayed to the user. We take a real-world example from a mainstream App Store. As shown in Figure 1(a), a top-$N$ recommendation list consists of multiple apps in the categories of social community and video, because the ranking function learns that the user likes to chat with people and watch videos. However, recommending multiple apps with the same functionality wastes display opportunities and also degrades the user experience. A more reasonable recommendation list should take diversity into account, as presented in Figure 1(b).
On the other hand, focusing on relevance of items may lead to information isolation for the users (Pariser, 2011), which results in leaving fewer opportunities for exploring new items (Nguyen et al., 2014). To address this problem, diversity (Ziegler et al., 2005; Zhang and Hurley, 2008; Bradley and Smyth, 2001; Adomavicius and Kwon, 2012) has been imposed as a complement of accuracy, to model the mutual influence between items and therefore improve the effectiveness of list-wise recommendations.
A multitude of approaches have been presented to generate diverse recommendations. Industrial recommender systems are very complicated frameworks, so we have to consider the ease of implementation and the risk of an online launch when we deploy new models or components. Therefore, when considering diversity in recommendation, we aim to propose a component that is compatible with all the existing components, instead of replacing some of them. That is to say, we target “diversity” as a re-ranking model, which can be easily deployed as a follow-up component after any existing ranking function. Several existing models treat “diversity” as a re-ranking model, assuming a ranking of items is available. For example, the Maximal Marginal Relevance (MMR) method in (Carbonell and Goldstein, 1998) selects one item at a time from a ranking list, considering both the relevance and the pair-wise similarity. Probabilistic models based on the Determinantal Point Process (DPP) in (Chen et al., 2017; Kulesza and Taskar, 2011) consider the list-wise similarity among items through a kernel matrix, which consists of relevance and pair-wise similarity. Compared with the MMR-based model, the DPP-based model can improve the diversity more efficiently without degrading the accuracy (Chen et al., 2017). However, we observe that the DPP-based model makes an unrealistic assumption: different users are assumed to have the same propensity to diversity. We find evidence in both the literature and real-world data that different individuals have different propensities to diversity.
We analyze user behaviors from the same App Store and present the result in Figure 2. The figure shows the distribution of users' entropy over their download histories. The x-axis represents the entropy value of the app categories in a user's download history, while the y-axis represents the normalized share of users with this entropy value in the whole population of 10,000 users. It suggests that users' tastes vary significantly. Users with large entropy values have a variety of interests over different categories of apps, while users with small entropy values focus on a few categories of apps. This implies that different individuals have different propensities to diversity.
Having looked for evidence in industrial applications, we now turn to the literature. The work in (Chen et al., 2013) demonstrates that the personality traits of users significantly correlate with their behaviors in recommender systems. The works in (Wu et al., 2018; Chen et al., 2016) take into account each user's personality traits based on a large-scale user survey, and show that different users have different propensities to diversity. Specifically, users with narrow tastes may expect more similar items in the recommendation list, while users who have a variety of interests may expect more diverse items. Moreover, some researchers have proposed methods that utilize users' behaviors to build personalized diversified recommendations. In (Di Noia et al., 2014), the proposed algorithms focus on the users' propensity to diversity based on the different attributes of items, and re-rank the recommendation list by MMR methods. A pre-filtering approach proposed in (Eskandanian et al., 2017) clusters the users into four groups according to each user's inclination to diversity, and then applies the user-based collaborative filtering algorithm to each group. These methods demonstrate the effectiveness of personalized diversity with offline experiments on some public datasets.
However, the methods in (Di Noia et al., 2014; Eskandanian et al., 2017) personalize the diversity propensity on four user clusters instead of on individuals, and the hyper-parameter of each user cluster needs to be grid searched. As indicated by Figure 2, it is more reasonable to personalize the diversity propensity on individual users, as users' propensity to diversity varies significantly. Unfortunately, these methods cannot be extended straightforwardly from the granularity of user clusters to the much finer granularity of individual users, as searching for a hyper-parameter for each user is impractical. Note that the number of users in an industrial recommender system is normally tens or hundreds of millions.
In this paper, we propose a personalized DPP model to improve the diversity of recommendation list, where the personalized granularity is of individual users. The hyper-parameter for each user is factorized to two factors: one is formulated by information entropy of a user’s interaction history, while the other is commonly shared across all the users and tunable.
We summarize the main contributions of our study:
We propose a personalized re-ranking model for improving diversity of recommendation list, and it can be easily deployed as a follow-up component after ranking function.
The re-ranking model employs personalized DPP, where the personalization granularity is on individual users, instead of on user clusters as in the literature.
We conduct the experimental evaluations on an offline benchmark to show the superiority of our proposed re-ranking model.
We deploy our proposed re-ranking model in a live recommender system and demonstrate significant improvements in both diversity and accuracy over baselines in an online A/B test.
The rest of the paper is organized as follows: in Section 2, we elaborate our re-ranking model in detail. We present our system architecture in live recommender systems in Section 3. Experimental setting and offline/online results are shown and discussed in Section 4. Finally, we give the conclusion in Section 5.
2. Re-ranking Model
2.1. DPP-based Re-ranking
As studied in (Chen et al., 2017), the DPP-based model is more effective and more efficient than other models such as the MMR-based model. Therefore, we choose to investigate how to apply a DPP-based re-ranking model in our recommender system. In this section, we present the DPP-based re-ranking model and discuss its limitation, which motivates our personalized DPP-based re-ranking model in the next section.
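A DPP $\mathcal{P}$ on an item set $Z$ is a probability distribution on the powerset of $Z$. That is, for every subset $Y \subseteq Z$, $\mathcal{P}$ assigns a probability $\mathcal{P}(Y)$, such that $\sum_{Y \subseteq Z} \mathcal{P}(Y) = 1$. It is stated in (Wilhelm et al., 2018) that finding the set $\arg\max_{Y \subseteq Z} \mathcal{P}(Y)$ is a way of selecting a relevant and diverse subset of items from the whole item set $Z$. Furthermore, $\mathcal{P}$ can be compactly parameterized by a positive semi-definite kernel matrix $L$, such that $\mathcal{P}(Y) \propto \det(L_Y)$, where $\det(L)$ is the determinant of matrix $L$ and $L_Y$ is the submatrix of $L$ restricted to the rows and columns indexed by $Y$. Therefore, finding the set $\arg\max_{Y \subseteq Z} \mathcal{P}(Y)$ is equivalent to finding the set $\arg\max_{Y \subseteq Z} \det(L_Y)$.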
The positive semi-definite kernel matrix $L$ is defined as follows:
where $r_i$ denotes the relevance score of item $i$ generated from the ranking function, $S$ denotes a user-defined similarity matrix among the items, and $\theta$ is the hyper-parameter that trades off relevance and diversity.
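As a hedged illustration of how a kernel can combine relevance scores, a similarity matrix and a trade-off hyper-parameter, the following sketch uses one assumed form, $L = D(\theta S + (1-\theta) I)D$ with $D = \mathrm{diag}(r)$. This form is an assumption chosen for illustration and is not taken from the paper; the function name `build_kernel` and the toy values are hypothetical. With this form, a larger `theta` strengthens the repulsion between similar items, and `theta = 0` reduces the kernel to pure relevance.

```python
import numpy as np


def build_kernel(relevance, similarity, theta):
    """Build an illustrative DPP kernel (assumed form, not the paper's equation).

    L = D (theta * S + (1 - theta) * I) D, where D = diag(relevance).
    If S is positive semi-definite and theta is in [0, 1], L is PSD as well.
    """
    r = np.asarray(relevance, dtype=float)
    S = np.asarray(similarity, dtype=float)
    if not 0.0 <= theta <= 1.0:
        raise ValueError("theta must lie in [0, 1]")
    mixed = theta * S + (1.0 - theta) * np.eye(len(r))
    # Scale row i by r_i and column j by r_j.
    return r[:, None] * mixed * r[None, :]


# Toy example: items 0 and 1 are near-duplicates, item 2 is unrelated.
relevance = [1.0, 0.9, 0.5]
similarity = [[1.0, 0.9, 0.0],
              [0.9, 1.0, 0.0],
              [0.0, 0.0, 1.0]]
L_rel = build_kernel(relevance, similarity, theta=0.0)  # relevance only
L_div = build_kernel(relevance, similarity, theta=1.0)  # full repulsion
```

In this assumed form the diagonal entries stay at $r_i^2$ for every `theta`, so the hyper-parameter only changes how strongly similar items suppress each other.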
As discussed before, we need to select a set $Y^{*}$ of $N$ items from the whole item set $Z$, such that
$$Y^{*} = \arg\max_{Y \subseteq Z,\ |Y| = N} \det(L_Y).$$
It is known to be an NP-hard problem (Wilhelm et al., 2018) with complexity to find the optimal set. To make the DPP-based re-ranking model applicable in industrial recommender systems, we choose an efficient and effective approximation algorithm, Fast Greedy MAP Inference (Chen et al., 2018), to perform re-ranking within an acceptable latency. Such an approximation algorithm solves this combinatorial optimization problem approximately in . Although a theoretical lower bound is not provided in (Chen et al., 2018), an online A/B test is conducted to demonstrate its superiority.
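To make the greedy idea concrete, here is a minimal sketch of a naive greedy MAP selection: at each step it adds the item that maximizes the determinant of the selected sub-kernel. It is deliberately the plain, slow variant that recomputes a determinant per candidate, not the incremental Fast Greedy MAP Inference algorithm of (Chen et al., 2018). The function name `greedy_map` and the toy kernel are illustrative assumptions.

```python
import numpy as np


def greedy_map(L, n):
    """Naive greedy MAP inference for a DPP with kernel L.

    Repeatedly adds the item that gives the largest log-determinant of the
    selected sub-kernel, and stops after n items or when no remaining item
    keeps the determinant positive.
    """
    L = np.asarray(L, dtype=float)
    selected = []
    remaining = list(range(L.shape[0]))
    while remaining and len(selected) < n:
        best_item, best_logdet = None, -np.inf
        for i in remaining:
            idx = selected + [i]
            sign, logdet = np.linalg.slogdet(L[np.ix_(idx, idx)])
            if sign > 0 and logdet > best_logdet:
                best_item, best_logdet = i, logdet
        if best_item is None:
            break
        selected.append(best_item)
        remaining.remove(best_item)
    return selected


# Toy kernel: items 0 and 1 are highly similar, item 2 is less relevant
# but unrelated to the other two.
L = np.array([[1.00, 0.81, 0.00],
              [0.81, 0.81, 0.00],
              [0.00, 0.00, 0.25]])
top2 = greedy_map(L, 2)
```

On this toy kernel the second pick is item 2 rather than the more relevant item 1, because the pair {0, 1} has determinant 0.1539 while the pair {0, 2} has determinant 0.25.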
2.2. Personalized DPP
In DPP, $\theta$ is a tunable hyper-parameter that balances the trade-off between relevance and diversity. DPP assumes that every individual has the same propensity to diversity, as the same $\theta$ value is applied when constructing the kernel matrix $L$, which is shared when performing re-ranking for all users. However, as we discussed in Section 1, different individuals have different propensities to diversity, so personalization is needed in DPP.
A straightforward way to implement personalization in DPP is to set a unique hyper-parameter $\theta_u$ for each user $u$. Unfortunately, this approach is not practical, since the number of hyper-parameters $\theta_u$ is too large for them to be tuned individually. In this paper, we present an effective and efficient method to achieve personalized DPP (pDPP for short). We factorize the user-wise hyper-parameter $\theta_u$ into two factors as
$$\theta_u = f_u \cdot \theta_0,$$
where $\theta_0$ is a tunable hyper-parameter shared across all the users to trade off relevance and diversity (it has the same functionality as $\theta$ in DPP) and $f_u$ is a user-wise factor representing the diversity propensity of user $u$.
Next, we elaborate the intuition behind the definition of $f_u$. As explained with the real-world example in Section 1, users' diversity propensity is reflected in their historical behavior. As one of the possible choices, we use the Shannon entropy of the distribution over the genres of the items a user has interacted with (our formulation can be extended easily by including other features of items):
$$H(u) = -\sum_{g \in G} P(g \mid u) \log P(g \mid u),$$
where $G$ is the set of genres and $P(g \mid u)$ denotes the probability of user $u$ being interested in genre $g$, namely, of one of user $u$'s interacted items being of genre $g$. As shown in (Di Noia et al., 2014), a user with higher $H(u)$ has a higher propensity to diversity and vice versa. Following this intuition, we define $f_u$ as the normalized $H(u)$. Formally, we propose to use a min-max normalization, as follows:
$$f_u = \frac{H(u) - H_{\min} + l}{H_{\max} - H_{\min} + l},$$
where $H_{\max}$ represents the maximal entropy value over all the users and $H_{\min}$ denotes the minimal value. The hyper-parameter $l$ controls the personalization degree of $f_u$ (and therefore of $\theta_u$). As shown in Figure 3, a larger $l$ value yields less personalized $f_u$ values among the users; e.g., when $l \to \infty$, it can be seen that $f_u \to 1$ and pDPP reduces to DPP. In practice, we choose to use two special cases: when $l = 0$, $f_u$ is the standard min-max normalized $H(u)$; and when $l = H_{\min}$, $f_u$ is the max normalized $H(u)$, i.e., $H(u) / H_{\max}$.
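The personalization step can be sketched in a few lines. The sketch assumes an offset min-max form, $f = (H - H_{\min} + l) / (H_{\max} - H_{\min} + l)$, which is consistent with the special cases described in the text (standard min-max normalization, max normalization, and no personalization for very large offsets). The function names, the toy histories and the value of `theta0` are illustrative assumptions, not values from the paper.

```python
import math
from collections import Counter


def genre_entropy(genres):
    """Shannon entropy of the empirical genre distribution of one user."""
    counts = Counter(genres)
    total = sum(counts.values())
    return -sum((c / total) * math.log(c / total) for c in counts.values())


def propensity_factor(h, h_min, h_max, l=0.0):
    """Offset min-max normalization of an entropy value (assumed form).

    l = 0 gives standard min-max normalization, l = h_min gives h / h_max,
    and a very large l pushes every factor towards 1.
    """
    denom = h_max - h_min + l
    if denom <= 0:
        return 1.0  # all users share the same entropy: no personalization
    return (h - h_min + l) / denom


# Toy interaction histories (hypothetical users and genres).
histories = {
    "u1": ["game", "game", "game", "game"],      # narrow taste
    "u2": ["game", "video", "social", "tool"],   # broad taste
    "u3": ["game", "game", "video", "video"],    # in between
}
entropies = {u: genre_entropy(g) for u, g in histories.items()}
h_min, h_max = min(entropies.values()), max(entropies.values())

theta0 = 0.6  # shared hyper-parameter; the value is only an example
theta_u = {u: propensity_factor(h, h_min, h_max) * theta0
           for u, h in entropies.items()}
```

Because the entropy is normalized, the base of the logarithm does not affect the resulting factors. Here the narrow user receives a factor of 0, the broad user a factor of 1, and the third user a factor of 0.5.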
To summarize, pDPP is a personalized version of DPP without introducing extra hyper-parameters for tuning. Though the formulation is simple, the experiment results in Section 4 demonstrate its effectiveness.
3. System Implementation
3.1. Framework Modifications
An overview of a recommender system with pDPP re-ranking model is shown in Figure 4. We first present the modules without considering the re-ranking model (which is surrounded in green box) and then illustrate how to adapt these modules with pDPP.
The architecture of a recommender system consists of three modules. (1) The offline training module processes user-item interaction data, extracts features (user features, item features and context features), trains the model and uploads it. (2) The online prediction module receives users' requests and returns a list of items. There are usually two steps in this module, namely retrieval and ranking. Since there may be millions of items, it is impossible to score every item within the required latency (often tens of milliseconds). The retrieval step returns a short list of items (often hundreds or thousands) that are suitable for the user under the current context. After the candidate set has been reduced, the ranking step computes relevance scores for individual items using the offline trained model. (3) The nearline updating module updates user features, item features and even the offline trained models with real-time interaction data.
Our proposed pDPP re-ranking model can be integrated into the above architecture easily. Next, we will elaborate how to adapt the three modules in the framework, to deploy this re-ranking model.
In offline training module, computes value for individual user and uploads such values to online .
In online prediction module, given the relevance scores of candidate items computed by any ranking function and the personalized value from online , pDPP re-ranking model generates the final recommendation list, considering both relevance and diversity.
In nearline updating module, personalized values are updated based on the real-time user-item interaction data, and the updated values are sent to online .
Developing accurate ranking function is an essential research topic and attracts many researchers from both academia and industry. As can be seen, our pDPP re-ranking model is compatible with any advanced ranking function, without any modification on such ranking function.
3.2. Practical Issues
To help readers better understand and implement our model in their recommender systems, we summarize several practical issues which should be noticed in real-world applications.
In research work such as (Chen et al., 2018), the kernel matrix $L$ is pre-computed and stored in memory, as shown in Algorithm 1. However, such a method cannot be applied in a real-world recommender system, for the following two reasons. Firstly, the relevance scores $r_i$, computed by a ranking function, are personalized and updated in real time. Such an industrial-style ranking function assigns different relevance scores to the same item for different users, and furthermore, the relevance score of a user-item pair may be updated within a few seconds as the user features change. Secondly, our pDPP model uses a personalized factor $f_u$ when constructing $L$, so that different users have different kernel matrices. Pre-computing and storing such matrices would require a huge amount of time and storage. For these two reasons, we compute the personalized kernel matrix for a user on-the-fly when this user triggers a request to our recommender system.
In our experiments, we tried two different approaches to construct the similarity matrix $S$: one utilizes item features and the other uses user-item interaction information. The method based on user-item interactions performs slightly worse than the other. The reason may be that user-item interactions are usually very sparse, which makes the item representations based on such information not very reliable. No matter which approach is used, we find that the performance is better when we normalize $S$ in .
Cold start problem is one of the common challenges in recommender systems. In our system, we set if is a new user. Moreover, users with only few interactions are also regarded as new users by our system. We make such a decision because is a relatively safe value for exploration, while balancing the trade-off between relevance and diversity.
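The practical points above (building the kernel per request, deriving similarity from item features, and falling back to a default propensity for new users) can be combined into one hedged sketch. Everything named here is hypothetical: `f_store` stands in for whatever online store holds the personalized values, `default_f` is a placeholder because the fallback value is not reproduced here, the similarity is assumed to be 1 for items of the same category and 0 otherwise, and the kernel form $D(\theta S + (1-\theta) I)D$ is an illustrative assumption. The selection loop is a naive greedy one, not the fast algorithm of (Chen et al., 2018).

```python
import numpy as np


def greedy_rerank(scores, categories, theta, n):
    """Re-rank candidates with a kernel built on the fly for one request."""
    r = np.asarray(scores, dtype=float)
    cats = np.asarray(categories)
    # Assumed feature-based similarity: 1 if same category, else 0.
    S = (cats[:, None] == cats[None, :]).astype(float)
    # Assumed kernel form; larger theta means stronger diversification.
    L = r[:, None] * (theta * S + (1.0 - theta) * np.eye(len(r))) * r[None, :]
    selected = []
    while len(selected) < min(n, len(r)):
        best, best_logdet = None, -np.inf
        for i in range(len(r)):
            if i in selected:
                continue
            idx = selected + [i]
            sign, logdet = np.linalg.slogdet(L[np.ix_(idx, idx)])
            if sign > 0 and logdet > best_logdet:
                best, best_logdet = i, logdet
        if best is None:
            break
        selected.append(best)
    return selected


def handle_request(user_id, scores, categories, f_store, theta0, n, default_f):
    """Look up the user's propensity factor, with a cold-start fallback."""
    f_u = f_store.get(user_id, default_f)  # new users get the default
    return greedy_rerank(scores, categories, f_u * theta0, n)


scores = [0.9, 0.8, 0.7, 0.6]
categories = ["video", "video", "social", "video"]
f_store = {"broad_user": 1.0, "narrow_user": 0.0}  # hypothetical values

broad = handle_request("broad_user", scores, categories, f_store, 0.9, 2, 0.5)
narrow = handle_request("narrow_user", scores, categories, f_store, 0.9, 2, 0.5)
fresh = handle_request("new_user", scores, categories, f_store, 0.9, 2, 0.5)
```

In this toy request the broad-taste user receives one video app and one social app, while the narrow-taste user receives the two highest-scored video apps.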
4. Experimental Evaluation
To demonstrate the superiority of our pDPP-based re-ranking model, we first design offline experiments on two datasets to compare the relevance and diversity of the recommendation results of our model with those of the baselines. Furthermore, we deploy our model in a live recommender system to validate its effectiveness in an industrial application. In this section, we present the experiment details and analyze the results of the offline and online evaluations, respectively.
4.1. Offline Evaluation
For offline evaluation, we prepare two datasets. Besides MovieLens, which is a benchmark in recommendation research community, we also collect user-item interaction log from our commercial App Store. To help reproduce our experiment result, we firstly describe how we process such two datasets.
MovieLens 1M Dataset (http://grouplens.org/datasets/movielens/1m/) contains 1,000,209 anonymous ratings of approximately 3,900 movies by 6,040 users. Following the conventional pre-processing in research work such as (Chen et al., 2017), we eliminate the movies rated by fewer than 10 users and the users who rated fewer than 20 movies. We randomly split the ratings into two parts, where 70% of the ratings are used for training and 30% for testing. A sample with a rating greater than or equal to 4 is treated as positive, otherwise as negative. We use item-based collaborative filtering as the ranking function to predict the relevance score of each item. The similarity matrix is built based on the genres of movies, which is to say, if movie and are of the same genre.
Company Dataset is collected from our commercial App Store. This dataset contains approximately 100,000 download records from about 80,000 users in 8 consecutive days. The size of the whole item set is 7,000. Samples in the first 7 days are used for training, while samples in the last day are for testing. We consider all download records as the positive samples, and the others (i.e., the apps that are in the item set but not downloaded by a user) are negative ones. The similarity matrix is generated by the category of apps, i.e., if app and are of the same category.
Two baselines are compared. The first baseline uses only the ranking function for relevance and disregards the re-ranking model for diversity; it is referred to as BASE. The other baseline is the standard DPP for re-ranking, which is denoted as DPP. Although MMR is a popular method, we omit it here due to its inferiority compared to DPP (Chen et al., 2017). Our personalized DPP model for re-ranking is denoted as pDPP. Note that the ranking function utilized in BASE, DPP and pDPP is kept consistent for a fair comparison (in the MovieLens 1M Dataset, item-based CF serves as the ranking function, while in the Company Dataset, a popular deep learning model is used). The tunable trade-off hyper-parameters of DPP and pDPP are found by grid search.
4.1.3. Evaluation Metrics
To compare the models comprehensively, we evaluate them from both relevance and diversity aspects of their recommendation results. Precision is utilized to measure the relevance, which is defined as
where denotes recommendation list of user , denotes download apps of user in test set.
To measure the diversity, we adopt intra-list distance (ILD) (Zhang and Hurley, 2008), which is defined as
and is the and of the first item in the recommendation list.
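As a hedged illustration of the two offline metrics, the sketch below uses their common textbook forms: precision at $N$ as the fraction of the first $N$ recommended items that appear in the user's test set, and intra-list distance at $N$ as the average pairwise distance among the first $N$ items, with distance assumed to be one minus similarity. These forms, the function names and the toy data are assumptions for illustration and may differ in detail from the definitions used in the paper; in practice both values would be averaged over users.

```python
def precision_at_n(recommended, relevant, n):
    """Fraction of the first n recommended items found in the relevant set."""
    top = recommended[:n]
    if not top:
        return 0.0
    hits = sum(1 for item in top if item in relevant)
    return hits / len(top)


def ild_at_n(recommended, similarity, n):
    """Average pairwise distance (1 - similarity) among the first n items."""
    top = recommended[:n]
    if len(top) < 2:
        return 0.0
    dists = [1.0 - similarity[i][j]
             for pos, i in enumerate(top)
             for j in top[pos + 1:]]
    return sum(dists) / len(dists)


# Toy data: items 0 and 1 are near-duplicates, item 2 is unrelated.
similarity = [[1.0, 0.9, 0.0],
              [0.9, 1.0, 0.0],
              [0.0, 0.0, 1.0]]
recommended = [0, 2, 1]   # one user's re-ranked list
relevant = {0, 1}         # the items this user has in the test set

p_at_2 = precision_at_n(recommended, relevant, 2)
ild_at_2 = ild_at_n(recommended, similarity, 2)
ild_at_3 = ild_at_n(recommended, similarity, 3)
```

The example shows the tension between the two metrics: placing the unrelated item 2 second raises the intra-list distance of the top two items to 1.0 but lowers their precision to 0.5.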
4.1.4. Experimental Results
Both offline experiments are performed for multiple times to ensure the results are statistically accurate. During each experiment, we randomly shuffle the data for training and test on the MovieLens Dataset and conduct consecutive experiments within different dates for Company Dataset. Experiment results on MovieLens 1M Dataset with are shown in Table 1. Due to space limit, we omit the results with other values, but they are analogous.
DPP model aims to balance the trade-off between accuracy and diversity, for which we can focus more on diversity (i.e., metric) by enlarging , but on the other hand, the relevance performance (i.e., metric) will be degraded. Compared with BASE, DPP achieves the same accuracy but better diversity. We select a reasonable range for , to avoid degrading the accuracy significantly, i.e., . Among such values, makes DPP performs the best in terms of while enables DPP achieves the best .
As expected, pDPP outperforms all the baselines in terms of and , which demonstrates the superiority of modelling different propensity of diversity for individual users. Specially, pDPP performs best in terms of and and slightly decreases the performance of compared to pDPP.
The results on Company Dataset are shown in Table 2. Similar to the experiments on MovieLens 1M Dataset, we select an appropriate range as . Adding DPP-based re-ranking model based on BASE, the DPP models improve the diversity while sacrificing the performance of relevance. Compared with the BASE and DPP models, pDPP models(with 0.6 as the ) gain the best performance on and . Remarkably, retaining exactly the same accuracy as BASE, pDPP methods improve the diversity significantly. Specifically, between pDPP family, the model with achieves better diversity than the one with while keeps the same accuracy performance.
4.2. Online Evaluation
Having shown in the offline evaluation that pDPP is superior in balancing the trade-off between accuracy and diversity, we deploy it in a live recommender system to verify its effectiveness in an industrial application.
4.2.1. Experiment setting
For online evaluation, we conduct online A/B test. We compare three different families of models: BASE, DPP and pDPP. We randomly split all the users into hundreds of bins, each of which consists of more than 100,000 users. A bin of users are served by each of the three compared models. In our live recommender system, the hyper-parameter of DPP and of pDPP are set to as the performance of is the best when in offline evaluation (as presented in Table 2).
4.2.2. Evaluation Metrics
To compare the performance of these methods, we evaluate them on two accuracy metrics and one diversity metric. The first accuracy metric that we measure is download ratio (), defined as
Beyond that, we also measure the engagement of users. More specifically, we study average number of downloads () per user, as
Besides these two accuracy metrics, we adopt intra-list distance to evaluate the diversity, the same as in the offline evaluation.
4.2.3. Online Performance
The results of A/B online test are shown in Table 3. Considering the commercial concerns, we only present the relative improvement of DPP and pDPP over BASE model in terms of , and .
We can observe that both DPP and pDPP perform significantly better than BASE in terms of all three evaluation metrics. This suggests that improving diversity is able to boost the recommendation performance. Between DPP and pDPP, we observe that pDPP is superior to DPP, which indicates that a personalized propensity to diversity is more suitable than an identical propensity setting. We observe that the improvement of pDPP over DPP is not as significant as that in the offline evaluation. Through detailed analysis, we find that about 35% of the users have only one download record in their behavior history, so it is hard to define their propensity to diversity under such circumstances, which may be one reason for the not-so-significant improvement. However, the daily turnover of our App Store is millions of dollars, so even such a not-so-significant lift in and brings extra millions of dollars each year.
5. Conclusion
A recommender system that focuses only on accuracy may perform sub-optimally, as it overemphasizes the accuracy of each individual item and ends up presenting similar items. Diversity, which has been studied as a way to present users with more varied items, can be viewed as modeling the mutual influence among items. Therefore, combining accuracy and diversity in a recommender system is a reasonable and convincing way to improve performance. Furthermore, different users have different propensities to diversity, which calls for personalized diversity. In this paper, we propose a personalized re-ranking model for improving the diversity of the recommendation list based on personalized DPP. This re-ranking model can be easily deployed as a follow-up component after any existing ranking function. The offline experiments on two real-world datasets and the online comparison through A/B testing in an industrial recommender system demonstrate the effectiveness of our proposed re-ranking model.
- Adomavicius and Kwon (2012) Improving aggregate recommendation diversity using ranking-based techniques. IEEE Transactions on Knowledge and Data Engineering 24 (5), pp. 896–911.
- Bradley and Smyth (2001) Improving recommendation diversity. In Proceedings of the Twelfth Irish Conference on Artificial Intelligence and Cognitive Science, Maynooth, Ireland, pp. 85–94.
- Carbonell and Goldstein (1998) The use of MMR and diversity-based reranking for reordering documents and producing summaries.
- Chen et al. (2018) Fast greedy MAP inference for determinantal point process to improve recommendation diversity. In Advances in Neural Information Processing Systems, pp. 5622–5633.
- Chen et al. (2017) Improving the diversity of top-N recommendation via determinantal point process. In Large Scale Recommendation Systems Workshop at the Conference on Recommender Systems (RecSys). http://arxiv.org/abs/1709.05135.
- Chen et al. (2013) How personality influences users' needs for recommendation diversity? In CHI'13 Extended Abstracts on Human Factors in Computing Systems, pp. 829–834.
- Chen et al. (2016) Personality and recommendation diversity. In Emotions and Personality in Personalized Services, pp. 201–225.
- Davidson et al. (2010) The YouTube video recommendation system. In Proceedings of the Fourth ACM Conference on Recommender Systems, pp. 293–296.
- Di Noia et al. (2014) An analysis of users' propensity toward diversity in recommendations. In Proceedings of the 8th ACM Conference on Recommender Systems, pp. 285–288.
- Eskandanian et al. (2017) A clustering approach for personalizing diversity in collaborative recommender systems. In Proceedings of the 25th Conference on User Modeling, Adaptation and Personalization, pp. 280–284.
- Guo et al. (2018) DeepFM: an end-to-end wide & deep learning framework for CTR prediction. arXiv preprint arXiv:1804.04950.
- Kulesza and Taskar (2011) k-DPPs: fixed-size determinantal point processes. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pp. 1193–1200.
- Li et al. (2008) Research of information recommendation system based on reading behavior. In 2008 International Conference on Machine Learning and Cybernetics, Vol. 3, pp. 1626–1631.
- McNee et al. (2006) Being accurate is not enough: how accuracy metrics have hurt recommender systems. In CHI'06 Extended Abstracts on Human Factors in Computing Systems, pp. 1097–1101.
- Nguyen et al. (2014) Exploring the filter bubble: the effect of using recommender systems on content diversity. In Proceedings of the 23rd International Conference on World Wide Web, pp. 677–686.
- Panniello et al. (2014) Comparing context-aware recommender systems in terms of accuracy and diversity. User Modeling and User-Adapted Interaction 24 (1-2), pp. 35–65.
- Pariser (2011) The filter bubble: what the Internet is hiding from you. Penguin UK.
- Schedl et al. (2015) Music recommender systems. In Recommender Systems Handbook, pp. 453–492.
- Wilhelm et al. (2018) Practical diversified recommendations on YouTube with determinantal point processes. In Proceedings of the 27th ACM International Conference on Information and Knowledge Management, pp. 2165–2173.
- Wu et al. (2018) Personalizing recommendation diversity based on user personality. User Modeling and User-Adapted Interaction 28 (3), pp. 237–276.
- Zhang and Hurley (2008) Avoiding monotony: improving the diversity of recommendation lists. In Proceedings of the 2008 ACM Conference on Recommender Systems, pp. 123–130.
- Ziegler et al. (2005) Improving recommendation lists through topic diversification. In Proceedings of the 14th International Conference on World Wide Web, pp. 22–32.