Diversity-Promoting Deep Reinforcement Learning for Interactive Recommendation

Interactive recommendation, which models the explicit interactions between users and the recommender system, has attracted a lot of research attention in recent years. Most previous interactive recommender systems focus only on optimizing recommendation accuracy while overlooking other important aspects of recommendation quality, such as the diversity of recommendation results. In this paper, we propose a novel recommendation model, named Diversity-promoting Deep Reinforcement Learning (D^2RL), which encourages diversity in interactive recommendation. More specifically, we adopt a Determinantal Point Process (DPP) model to generate diverse yet relevant item recommendations. A personalized DPP kernel matrix is maintained for each user, constructed from two parts: a fixed similarity matrix capturing item-item similarity, and the relevance of items dynamically learnt through an actor-critic reinforcement learning framework. We performed extensive offline experiments as well as simulated online experiments on real-world datasets to demonstrate the effectiveness of the proposed model.







D2RLIR : an improved and diversified ranking function in interactive recommendation systems based on deep reinforcement learning

Recently, interactive recommendation systems based on reinforcement lear...

Knowledge-guided Deep Reinforcement Learning for Interactive Recommendation

Interactive recommendation aims to learn from dynamic interactions betwe...

Learning from Atypical Behavior: Temporary Interest Aware Recommendation Based on Reinforcement Learning

Traditional robust recommendation methods view atypical user-item intera...

Practical Data Poisoning Attack against Next-Item Recommendation

Online recommendation systems make use of a variety of information sourc...

Recommending Accurate and Diverse Items Using Bilateral Branch Network

Recommender systems have played a vital role in online platforms due to ...

Deep Reinforcement Learning based Recommendation with Explicit User-Item Interactions Modeling

Recommendation is crucial in both academia and industry, and various tec...

Bandit Learning for Diversified Interactive Recommendation

Interactive recommender systems that enable the interactions between use...

1 Introduction

Recommender systems have been widely used in various application scenarios. In practice, existing recommender systems are mainly developed based on collaborative filtering methods, e.g., matrix factorization [Shi et al.2014]. Recently, with the rapid development of deep learning techniques, e.g., Convolutional Neural Networks (CNN) and Recurrent Neural Networks (RNN), a lot of deep learning based recommender systems have been proposed [Zhang et al.2018]. Notably, most recommender systems are developed in an off-line manner, learning users' preferences on items from the historical interaction data between users and items. These methods, however, usually ignore the interactions between users and the recommender system. Therefore, they cannot exploit users' immediate and long-term feedback on the recommendation results to improve the performance of the recommender system.

From a different perspective, the interactive recommender system [Steck et al.2015] explicitly models the interactions between users and the recommender system. In this scenario, the recommender system is treated as an agent, which generates recommendations to users according to some learnt policy. Based on users' feedback on the recommendation results, the agent consequently updates its recommendation policy, so as to improve users' satisfaction in the long run. Moreover, in the interaction process, the interactive recommender system can also explore users' potential preferences on items, which may not be revealed by users' historical behaviour data.

Some of the existing interactive recommender systems are developed based on the contextual bandit framework, e.g., LinUCB [Li et al.2010] and Thompson sampling [Kawale et al.2015]. Recently, some research works build interactive recommender systems using the deep reinforcement learning framework, considering the state transitions of the environment (i.e., users) [Zheng et al.2018, Zhao et al.2018a, Zhao et al.2018b, Chen et al.2019]. These recommendation methods mainly focus on optimizing the recommendation accuracy. However, they do not enforce diversity on the recommendation results. Therefore, the items in a recommendation list generated by these methods may be very similar to each other, which often degrades user experience and satisfaction.

Differing from existing interactive recommendation methods, this paper proposes a novel recommendation model, namely Diversity-promoting Deep Reinforcement Learning (D^2RL), for diversified interactive recommendation. More specifically, to generate diverse yet relevant item recommendations, we sequentially sample from a Determinantal Point Process (DPP) model, which is an elegant probabilistic model for set diversity [Kulesza et al.2012]. For each user, a personalized DPP kernel matrix is dynamically built from two parts: a fixed similarity matrix capturing item-item similarity, and the relevance of items learnt through deep reinforcement learning. To dynamically learn the relevance of items, we adopt the actor-critic deep reinforcement learning framework to explore and exploit the user's preferences on items. In D^2RL, the actor network generates an action vector, i.e., the parameter of the DPP kernel matrix, which represents the user's dynamic personal preferences. The critic network evaluates the quality of the learnt recommendation policy, considering both current and future rewards for the recommended items. To demonstrate the effectiveness of the proposed D^2RL model, we have conducted extensive experiments on real world datasets, as well as simulated online experiments. The experimental results indicate that D^2RL achieves superior performance under different settings.

To summarize, the major contributions of this work are as follows:

  • We propose a novel diversity-promoting recommendation model, i.e., D^2RL, which is able to generate diverse yet relevant recommendations in interactive recommendation.

  • In order to test the online performance of D^2RL, we propose a novel online simulator which simulates user behaviours by considering both user preferences and the diversity of user behaviours.

  • Extensive experiments in both offline and online settings demonstrate the superior performance of D^2RL in comparison with several state-of-the-art methods.

2 Related Work

Contextual bandit has often been used for building interactive recommender systems. For instance, Li et al. proposed a contextual bandit algorithm named LinUCB to sequentially recommend articles to users based on the contextual information of users and articles [Li et al.2010]. Kawale et al. proposed to integrate Thompson sampling with online Bayesian probabilistic matrix factorization for interactive recommendation [Kawale et al.2015]. Tang et al. developed a parameter-free bandit approach which applied online bootstrap to learn the recommendation model in an online manner [Tang et al.2015]. Moreover, in [Wang et al.2017], a factorization-based bandit approach was proposed to solve the online interactive recommendation problem. Another group of works apply deep reinforcement learning for solving the interactive recommendation problem. For example, Zheng et al. proposed a deep reinforcement learning framework for news recommendation [Zheng et al.2018]. They utilized deep Q-learning to explicitly model the future rewards of recommendation results and also developed an effective exploration strategy to improve the recommendation accuracy. Zhao et al. proposed a deep recommendation framework (i.e., DEERS) that leveraged reinforcement learning to consider both users' positive and negative feedback for recommendation [Zhao et al.2018b]. Moreover, they also developed a page-wise recommendation model (i.e., DeepPage) which employed deep reinforcement learning to optimize the display of items on a webpage [Zhao et al.2018a]. In a recent work [Chen et al.2019], Chen et al. proposed a tree-structured policy gradient method to mitigate the large action space problem of deep reinforcement learning based recommendation methods.

Promoting the diversity of recommendation results has received increasing research attention [Kunaver and Požrl2017]. The maximal marginal relevance (MMR) model [Carbonell and Goldstein1998] was one of the pioneering works for promoting diversity in information retrieval tasks. In [Zhang and Hurley2008], Zhang and Hurley proposed a trust-region based optimization method to maximize the diversity of the recommendation list while maintaining an acceptable level of matching quality. In [Lathia et al.2010], Lathia et al. designed and evaluated a hybrid mechanism to maximize temporal recommendation diversity. In [Zhao et al.2012], Zhao et al. utilized purchase interval information to increase the diversity of recommended items. In [Qin and Zhu2013], Qin and Zhu proposed an entropy regularizer to promote recommendation diversity. This entropy regularizer was incorporated into the contextual combinatorial bandit framework to diversify online recommendation results [Qin et al.2014]. Moreover, Sha et al. proposed a combinatorial optimization approach to combine the relevance of items, coverage of the user's interests, and diversity between items for recommendation [Sha et al.2016]. Puthiya Parambath et al. developed a new criterion to capture both relevance and diversity for recommendation [Puthiya Parambath et al.2016]. This criterion was optimized by an efficient greedy algorithm. Recently, in [Antikacioglu and Ravi2017], a maximum-weight degree-constrained subgraph selection method was proposed to post-process the recommendations generated by collaborative filtering models, so as to increase recommendation diversity. In [Cheng et al.2017], the diversified recommendation problem was formulated as a supervised learning task, and a diversified collaborative filtering model was introduced to solve the optimization problem. In [Chen et al.2018], a fast greedy maximum a posteriori (MAP) inference algorithm for determinantal point processes was proposed to improve recommendation diversity.

3 The Proposed Recommendation Model

In this section, we first introduce some background on the determinantal point process model, and then present the details of the proposed D^2RL recommendation model.

3.1 Determinantal Point Process

DPP is an elegant probabilistic model that has been successfully used to promote diversity in various machine learning tasks [Kulesza et al.2012]. In this work, we propose to employ DPP to promote diversity in interactive recommendation. Specifically, for a recommendation task, a DPP on the discrete item set $\mathcal{I} = \{1, 2, \ldots, N\}$ is a probability distribution on the powerset $2^{\mathcal{I}}$ of $\mathcal{I}$ (i.e., the set of all subsets of $\mathcal{I}$). If it assigns non-zero probability to the empty set $\emptyset$, there exists a positive semi-definite (PSD) kernel matrix $L \in \mathbb{R}^{N \times N}$, such that for each subset $Y$ of $\mathcal{I}$, the probability of $Y$ is defined as follows:

$$P(Y) = \frac{\det(L_Y)}{\det(L + I)}, \qquad (1)$$

where $L_Y$ is $L$ restricted to those rows and columns which are indexed by $Y$, and $I$ is the identity matrix.
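To make the definition concrete, here is a minimal pure-Python sketch of Eq. (1); the `det` helper and the toy 3-item kernel are illustrative, not from the paper. It shows the defining property that a DPP assigns higher probability to sets of dissimilar items:

```python
def det(m):
    """Determinant via Gaussian elimination with partial pivoting."""
    n = len(m)
    a = [row[:] for row in m]
    d = 1.0
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(a[r][c]))
        if abs(a[p][c]) < 1e-12:
            return 0.0
        if p != c:
            a[c], a[p] = a[p], a[c]
            d = -d
        d *= a[c][c]
        for r in range(c + 1, n):
            f = a[r][c] / a[c][c]
            for k in range(c, n):
                a[r][k] -= f * a[c][k]
    return d

def dpp_prob(L, Y):
    """P(Y) = det(L_Y) / det(L + I) for a subset Y of item indices."""
    n = len(L)
    L_Y = [[L[i][j] for j in Y] for i in Y]
    L_I = [[L[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    return det(L_Y) / det(L_I)

# Toy kernel: items 0 and 1 are near-duplicates, item 2 is dissimilar.
L = [[1.0, 0.9, 0.1],
     [0.9, 1.0, 0.1],
     [0.1, 0.1, 1.0]]

assert dpp_prob(L, [0, 2]) > dpp_prob(L, [0, 1])  # the diverse pair is more likely
```

Here det(L_{0,2}) = 0.99 while det(L_{0,1}) = 0.19, so the dissimilar pair gets roughly five times the probability under the same normalizer.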

In practice, the kernel matrix $L$ can be constructed from low-rank matrices. For example, $L$ can be written as a Gram matrix $L = BB^{\top}$, where $B \in \mathbb{R}^{N \times d}$ and $d \le N$. The rows of $B$ are vectors representing the properties of items. As revealed in [Kulesza et al.2012], each row vector can be empirically constructed as the product of the item relevance score $r_i \ge 0$ and the item feature vector $f_i \in \mathbb{R}^{d}$, i.e., $b_i = r_i f_i$. Hence, an element of the kernel matrix can be written as $L_{ij} = r_i r_j f_i^{\top} f_j$. Moreover, if normalization is applied to the feature vectors, i.e., $\|f_i\|_2 = 1$, the cosine similarity between two items $i$ and $j$ can be calculated as $S_{ij} = f_i^{\top} f_j$. Hence, $L$ can be re-written as follows:

$$L = \mathrm{Diag}(r) \cdot S \cdot \mathrm{Diag}(r), \qquad (2)$$

where $S$ is the item similarity matrix, and $\mathrm{Diag}(r)$ is a diagonal matrix whose $i$-th diagonal element is $r_i$. Following [Chen et al.2018], we modify the DPP kernel matrix as follows:

$$L = \mathrm{Diag}(\exp(\alpha r)) \cdot S \cdot \mathrm{Diag}(\exp(\alpha r)), \qquad (3)$$

where $\alpha = \theta / (2(1 - \theta))$ and $\theta \in [0, 1]$. $\theta$ is a hyper-parameter used to trade off between relevance and diversity. For simplicity, we model the relevance score of each item using a linear function of its features as follows:

$$r_i = f_i^{\top} w, \qquad (4)$$

where $w \in \mathbb{R}^{d}$ is the model parameter.

As the item similarity matrix $S$ is fixed once the item feature vectors are given, according to Eq. (3) and (4), $w$ becomes the only model parameter that needs to be learnt to determine the DPP kernel. Once the DPP kernel is determined, different inference algorithms [Kulesza and Taskar2011, Chen et al.2018] can be applied to find the optimal diverse yet relevant item set for recommendation. Intuitively, $w$ is a vector representing users' preferences. In the interactive recommendation setting, users' preferences can be dynamically learnt through the interactions between users and the recommender system. In this paper, we propose to dynamically learn $w$ using a deep reinforcement learning framework.
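The inference step can be approximated greedily: starting from the empty set, repeatedly add the item that most increases det(L_Y). The naive sketch below recomputes the determinant at every step; it is a stand-in illustrating the idea, not the optimized algorithm of [Chen et al.2018], which incrementally updates a Cholesky factor instead:

```python
def det(m):
    """Cofactor-expansion determinant; fine for the small subsets used here."""
    if len(m) == 1:
        return m[0][0]
    return sum((-1) ** j * m[0][j] * det([row[:j] + row[j + 1:] for row in m[1:]])
               for j in range(len(m)))

def greedy_map(L, k):
    """Greedily pick k items maximizing det(L_Y) over the chosen subset Y."""
    chosen = []
    for _ in range(k):
        best, best_gain = None, 0.0
        for i in range(len(L)):
            if i in chosen:
                continue
            Y = chosen + [i]
            gain = det([[L[a][b] for b in Y] for a in Y])
            if gain > best_gain:
                best, best_gain = i, gain
        if best is None:  # no item yields a positive determinant
            break
        chosen.append(best)
    return chosen

# Toy kernel: items 0 and 1 are near-duplicates, item 2 is dissimilar.
L = [[1.0, 0.9, 0.1],
     [0.9, 1.0, 0.1],
     [0.1, 0.1, 1.0]]
assert greedy_map(L, 2) == [0, 2]  # skips the near-duplicate item 1
```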

3.2 Deep Reinforcement Learning Framework

In the interactive recommendation setting of D^2RL, the recommender system (i.e., the agent) interacts with users (i.e., the environment) by sequentially recommending sets of diverse yet relevant items to maximize its cumulative reward in the long run. The interactive recommendation process can be modeled as a Markov Decision Process (MDP). We define the key components of the MDP of D^2RL as follows:

  • State Space: A state $s_t$ is determined by the recent items that the user has interacted with before time step $t$, together with the user's personal features;

  • Action Space: An action $a_t$ is the parameter $w$ used to construct the DPP kernel matrix, as shown in Eq. (4), which captures the user's dynamic preferences;

  • Reward: Once the recommender system chooses an action $a_t$ based on its current state $s_t$, a set of diverse yet relevant items can be sampled from the determined DPP kernel and recommended to the user. The user then provides her feedback on the recommended items, e.g., clicking or not clicking on them. The recommender system receives this feedback as the immediate reward $r_t$ to update its recommendation policy;

  • Discount Factor: The discount factor $\gamma \in [0, 1]$ measures the present value of future rewards. When $\gamma$ is set to 0, the recommender system considers only the immediate reward and ignores future rewards. On the other hand, when $\gamma$ is set to 1, the recommender system counts all future rewards as well as the immediate reward.
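The role of $\gamma$ above is exactly its role in the discounted return the agent maximizes, $G_t = \sum_{k \ge 0} \gamma^k r_{t+k}$. A tiny sketch with a made-up per-epoch click-reward sequence:

```python
def discounted_return(rewards, gamma):
    """G = r_0 + gamma*r_1 + gamma^2*r_2 + ..., computed back-to-front."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

clicks = [1.0, 0.0, 1.0]  # hypothetical rewards over three epochs
assert discounted_return(clicks, 0.0) == 1.0  # gamma=0: immediate reward only
assert discounted_return(clicks, 1.0) == 2.0  # gamma=1: all rewards count equally
```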

Figure 1: The proposed deep reinforcement learning framework.

The proposed D^2RL model is built on the Actor-Critic reinforcement learning framework [Sutton and Barto2018]. Figure 1 shows the overall structure of D^2RL. Next, we detail the actor and the critic components of D^2RL.

  • Actor Network: Given the current state $s_t$, the actor network employs deep neural networks to generate an action $a_t$. As shown in Figure 1, the state $s_t$ is determined by the user's personal features and her recent interests. Specifically, the user's personal features are fed into two fully-connected ReLU neural network layers to obtain more abstract user features. Meanwhile, at time step $t$, we collect the features of the recent items that the user has interacted with before $t$, and stack them into a feature matrix. Intuitively, we treat this matrix as an "image" of the feature space, and use a CNN (consisting of two convolutional layers) to capture the sequential patterns of the user's interests by extracting the local features of the "image" [Tang and Wang2018]. The output of the CNN module is then fed into two fully-connected ReLU neural network layers to obtain higher-level interest features. We then concatenate the interest features with the abstract user features to form the state representation $s_t$.

    After that, $s_t$ is fed into several fully-connected layers to generate the action $a_t$. In the first two neural network layers, we use ReLU as the activation function, and we use the Tanh activation function in the output layer to keep $a_t$ bounded. As we are interested in optimizing the DPP kernel to generate diverse recommendations, the action $a_t$ is essentially the parameter $w$ used to build the DPP kernel at time $t$. Once the kernel is determined, we utilize the fast greedy MAP inference algorithm [Chen et al.2018] to sequentially sample a set of diverse yet relevant items as recommendations to the user.

  • Critic Network: The critic network is designed to approximate the Q-value function $Q(s_t, a_t)$, which estimates the quality of the learnt DPP-based recommendation policy. As shown in Figure 1, the inputs of the critic network are the action representation $a_t$ and the state representation $s_t$, which considers both the user profile and the user's recent interests. $s_t$ and $a_t$ are then fed into several fully-connected ReLU neural network layers to predict the Q-value. We denote the critic network by $Q(s, a \mid \theta^{Q})$ with network parameters $\theta^{Q}$.

1:  Randomly initialize the actor network $\pi(s \mid \theta^{\pi})$ and the critic network $Q(s, a \mid \theta^{Q})$ with weights $\theta^{\pi}$ and $\theta^{Q}$;
2:  Initialize the target networks $\pi'$ and $Q'$ with weights $\theta^{\pi'} \leftarrow \theta^{\pi}$ and $\theta^{Q'} \leftarrow \theta^{Q}$;
3:  Initialize the replay buffer;
4:  for each episode do
5:     Receive an initial observed state $s_1$;
6:     for $t = 1, \ldots, T$ do
7:        Derive action $a_t$ from the actor network $\pi(s_t \mid \theta^{\pi})$;
8:        Apply the exploration policy to $a_t$;
9:        Construct the DPP kernel matrix using Eq. (3), (4);
10:        Find the diverse item set using the fast greedy MAP inference algorithm;
11:        Compute reward $r_t$ and observe new state $s_{t+1}$;
12:        Store the transition sample $(s_t, a_t, r_t, s_{t+1})$ in the replay buffer;
13:        Sample a mini-batch of transitions from the replay buffer with the replay sampling technique;
14:        Update the critic network by minimizing the loss in Eq. (5);
15:        Update the weights of the actor network using the sampled gradient in Eq. (6);
16:        Update the weights of the target networks using Eq. (7);
17:     end for
18:  end for
Algorithm 1 DDPG algorithm for D^2RL

The Deep Deterministic Policy Gradient (DDPG) [Lillicrap et al.2015] algorithm is used to train the D^2RL framework. We also employ the target network technique in D^2RL. In DDPG, the critic network is learnt using the Bellman equation as in Q-learning. The network parameter $\theta^{Q}$ is updated by minimizing the following loss function:

$$\mathcal{L}(\theta^{Q}) = \frac{1}{N} \sum_{i} \big( y_i - Q(s_i, a_i \mid \theta^{Q}) \big)^2, \qquad (5)$$

where $y_i = r_i + \gamma Q'\big(s_{i+1}, \pi'(s_{i+1} \mid \theta^{\pi'}) \mid \theta^{Q'}\big)$, $N$ is the size of the sampled mini-batch, and $\pi'$ and $Q'$ are the target actor network and target critic network, respectively. The actor network is updated by using the sampled policy gradient as follows:

$$\nabla_{\theta^{\pi}} J \approx \frac{1}{N} \sum_{i} \nabla_{a} Q(s, a \mid \theta^{Q}) \big|_{s = s_i, a = \pi(s_i)} \, \nabla_{\theta^{\pi}} \pi(s \mid \theta^{\pi}) \big|_{s = s_i}. \qquad (6)$$

Moreover, the network parameters of the target networks $\pi'$ and $Q'$ are updated using the soft updating rule as follows:

$$\theta^{Q'} \leftarrow \tau \theta^{Q} + (1 - \tau)\, \theta^{Q'}, \qquad \theta^{\pi'} \leftarrow \tau \theta^{\pi} + (1 - \tau)\, \theta^{\pi'}, \qquad (7)$$

where $\tau \ll 1$. The details of the DDPG-based learning algorithm are summarized in Algorithm 1.
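The soft update of Eq. (7) is just an element-wise interpolation of parameter vectors; a minimal sketch (the $\tau$ value and toy weights are illustrative):

```python
def soft_update(target, source, tau):
    """theta' <- tau * theta + (1 - tau) * theta', applied element-wise."""
    return [tau * s + (1.0 - tau) * t for t, s in zip(target, source)]

theta_target = [0.0, 0.0]   # target network weights
theta = [1.0, 2.0]          # online network weights
theta_target = soft_update(theta_target, theta, tau=0.01)
# The target network drifts slowly toward the online network.
assert abs(theta_target[0] - 0.01) < 1e-12 and abs(theta_target[1] - 0.02) < 1e-12
```

Keeping $\tau$ small makes the target values in Eq. (5) change slowly, which stabilizes the critic's regression target.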

4 Experiments

In this section, we first evaluate the performance of the proposed D^2RL model by performing offline experiments on two real world datasets. Then, we conduct simulations to evaluate the online performance of D^2RL.

4.1 Experimental Setting

4.1.1 Datasets

The experiments are conducted on two public datasets: MovieLens-100K and MovieLens-1M (https://grouplens.org/datasets/movielens/). MovieLens-100K consists of 100,000 ratings given by 943 users to 1,682 movies. MovieLens-1M contains 1,000,209 ratings given by 6,040 users to 3,706 movies. There are 18 movie categories in both datasets, and each movie may belong to more than one category. In each dataset, we kept the ratings larger than 3 as positive feedback. We sorted all the positive feedback in time order, and used the first 80% of the positive feedback for model training and the remaining 20% for model testing. Moreover, we also removed the users that have fewer than 10 positive feedback records in the training data. Table 1 summarizes the statistics of the experimental datasets.

Dataset # users # items # Interactions
Movielens-100K 716 1,374 45,447
Movielens-1M 5,218 3,467 510,940
Table 1: Statistics of the experimental datasets.
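The preprocessing described above can be sketched as follows. The in-memory rating tuples are made up for illustration (real MovieLens files would be parsed first), and the removal of users with fewer than 10 positives is omitted for brevity:

```python
# (user, item, rating, timestamp) tuples; values are illustrative only.
ratings = [
    (1, 10, 5, 100), (1, 11, 4, 200), (1, 12, 2, 300),
    (2, 10, 4, 150), (2, 13, 5, 250), (2, 14, 4, 350),
]

# Keep ratings larger than 3 as positive feedback, sorted in time order.
positive = sorted((r for r in ratings if r[2] > 3), key=lambda r: r[3])

# First 80% of the positive feedback for training, remaining 20% for testing.
split = int(len(positive) * 0.8)
train, test = positive[:split], positive[split:]

assert len(positive) == 5 and len(train) == 4 and len(test) == 1
```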

4.1.2 Evaluated Algorithms

We compare the proposed D^2RL method with the following recommendation models: (1) BPRMF [Rendle et al.2009]: a matrix factorization method that uses pairwise learning to optimize the recommendation accuracy of top-ranked items. (2) LMF [Johnson2014]: a logistic matrix factorization recommendation approach that exploits users' implicit feedback by modeling the user-item interaction probability. (3) Caser [Tang and Wang2018]: a deep learning based sequential recommendation method which uses CNN to extract the sequential patterns of users' behaviors. (4) CUCB [Qin et al.2014]: a contextual bandit based interactive recommendation method, which promotes recommendation diversity by employing an entropy regularizer.

For BPRMF and LMF, we empirically set the dimensionality of the latent space to 30, considering both recommendation accuracy and efficiency. The regularization parameters and the learning rates are chosen from sets of candidate values. For Caser, we set the dimensionality of the embedding layer to 30, the Markov order to 5, and the target number to 2. In addition, we use the latent features of users and items learnt by BPRMF as the user and item features in the interactive recommendation methods, i.e., CUCB and D^2RL. In CUCB, the regularization parameter of the entropy regularizer is likewise selected from a set of candidate values. For D^2RL, we set the number of recent items in each state to 5, and the parameter $\theta$ that balances relevance and diversity in the DPP kernel matrix is tuned per dataset. We utilize Adam to optimize the actor and critic networks, and set the learning rate to 0.001. In addition, we set the discount factor $\gamma$ of D^2RL to 0.95.

1:  Initialize the actor network by minimizing the distance between its output and the embedding of the next interacted item;
2:  Initialize the critic network by fixing the actor network;
3:  Train the actor-critic network following DDPG until convergence;
4:  Initialize the user’s state with the last interacted items in the training data:
5:  Initialize the list of displayed items with the last interacted items in the training data;
6:  Collect candidate items by excluding the user's interacted items in the training data;
7:  Load the user’s interacted items in testing data;
8:  for  do
9:     Generate action by the actor network ;
10:     Construct DPP kernel matrix and find the diverse item set by the fast greedy inference algorithm;
11:     for  do
12:        if  then
13:           Remove the first item in , and append to ;
14:        end if
15:     end for
16:     Compute the recommendation performance using Eq. (8) and (9);
17:     Generate new state with ;
18:     ;
19:  end for
Algorithm 2 Offline Evaluation for D^2RL

4.1.3 Metrics

The performance of the recommendation algorithms is evaluated from two aspects, i.e., accuracy and diversity. The recommendation accuracy at each recommendation epoch $t$ is defined as follows:

$$\mathrm{Precision}@t = \frac{|R_t \cap T_u|}{|R_t|}, \qquad (8)$$

where $R_t$ is the set of items recommended to the user at epoch $t$, $|\cdot|$ is the cardinality of a set, and $T_u$ denotes the set of items the user has interacted with in the testing data. The diversity of the recommendation results at epoch $t$ is measured by the intra-list distance (ILD) [Zhang and Hurley2008] metric, which is defined as follows:

$$\mathrm{ILD}@t = \frac{2}{|R_t| (|R_t| - 1)} \sum_{i, j \in R_t,\, i \neq j} \big( 1 - \mathrm{sim}(i, j) \big), \qquad (9)$$

where $\mathrm{sim}(i, j)$ denotes the similarity between two items $i$ and $j$. For a fair comparison, we define $\mathrm{sim}(i, j)$ as the Jaccard similarity between the categories of the two items. In the experiments, we empirically set the size of $R_t$ to 5.
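Equations (8) and (9) translate directly into code. A short sketch follows; the item category sets are illustrative:

```python
def precision(recommended, ground_truth):
    """Eq. (8): fraction of recommended items found in the test interactions."""
    return len(set(recommended) & set(ground_truth)) / len(recommended)

def jaccard(cats_a, cats_b):
    """Jaccard similarity between two category sets."""
    return len(cats_a & cats_b) / len(cats_a | cats_b)

def ild(recommended, categories):
    """Eq. (9): average pairwise distance (1 - Jaccard category similarity)."""
    n = len(recommended)
    total = sum(1.0 - jaccard(categories[a], categories[b])
                for i, a in enumerate(recommended)
                for b in recommended[i + 1:])
    return 2.0 * total / (n * (n - 1))

categories = {1: {"action"}, 2: {"action", "comedy"}, 3: {"drama"}}
assert precision([1, 2, 3], [2, 3, 4]) == 2 / 3
assert ild([1, 3], categories) == 1.0  # disjoint categories: maximally diverse
```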

4.1.4 Setup

We adopt two different settings to evaluate the recommendation algorithms: offline evaluation on public datasets and online evaluation with a simulator.

1:  Observe all the previously interacted items of the user;
2:  for  do
3:     Compute the user’s interaction probability with using pre-trained LMF;
4:     if  then
5:        ;  
6:     end if
7:     if  then
8:        Set the reward , ;
9:     else
10:        Set the reward ;
11:     end if
12:  end for
Algorithm 3 Simulator for Online User Behavior

In offline evaluation, we compare D^2RL with all the other baseline methods. For each non-interactive recommendation approach (i.e., BPRMF, LMF, and Caser), we first learn the recommendation model using the training data, and then rank all candidate items by the learned model. At each recommendation epoch, we choose a set of top-ranked items from the ranked candidates as recommendations, and these items are removed from the candidate set once they have been recommended. The recommendation accuracy is measured by comparing the recommended items with the ground truth in the testing data. Note that the non-interactive recommendation methods do not update the learned models in the recommendation process. In CUCB, however, the bandit parameter is updated using the user's feedback on the recommendation results. For the offline training of D^2RL, we first initialize the actor network by minimizing an L2 loss between its output and the embedding of the next interacted item, using state-action transition samples randomly drawn from the replay buffer. After that, we fix the actor network for some time and update the critic network alone. When both the actor network and the critic network are well initialized, they follow DDPG (Algorithm 1) to update iteratively. Once the two networks converge, we fix the learned network parameters in the recommendation process. The details of the offline evaluation process for D^2RL are summarized in Algorithm 2.

For online evaluation, we build a simulator to compare D^2RL with CUCB. The online interactive recommendation process of D^2RL follows Algorithm 1, and the feedback on recommended items is generated by a simulator, as summarized in Algorithm 3. In this simulator, we first use the LMF [Johnson2014] model pre-trained on the training data to predict the user's preference on the recommended items. Then, we use MMR [Carbonell and Goldstein1998] to simulate the user's feedback probability by considering both the user's preference and the diversity of the recommended items. The MMR parameter $\lambda$ simulates a person's personal trade-off between relevance and diversity, and a higher $\lambda$ simulates a person who prefers more relevant and less diverse recommendations. The parameter $\delta$ is a decision threshold, and positive feedback is provided if the user's feedback probability is greater than $\delta$. We empirically set the MMR parameter $\lambda$ randomly for each user, and the threshold $\delta$ to 0.5. A good interactive recommender system should be able to learn from the simulated user behaviours and make accurate yet diverse recommendations. For online evaluation, the accuracy is evaluated by comparing the recommended results with the simulated feedback.
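The simulator's feedback rule can be sketched as follows: an MMR-style score combines the predicted preference with dissimilarity to already-displayed items, and feedback is positive when the score exceeds the threshold. The relevance values, $\lambda$, and the similarity function below are illustrative stand-ins, not the paper's learned quantities:

```python
def mmr_score(relevance, item, displayed, sim, lam):
    """MMR: lam * relevance - (1 - lam) * max similarity to displayed items."""
    max_sim = max((sim(item, d) for d in displayed), default=0.0)
    return lam * relevance - (1.0 - lam) * max_sim

def simulate_feedback(relevance, item, displayed, sim, lam, threshold=0.5):
    """Reward 1 (positive feedback) if the MMR score exceeds the threshold."""
    return 1.0 if mmr_score(relevance, item, displayed, sim, lam) > threshold else 0.0

# Hypothetical: item "b" is a near-duplicate of the displayed item "a".
sim = lambda x, y: 0.9 if {x, y} == {"a", "b"} else 0.1
assert simulate_feedback(1.0, "c", ["a"], sim, lam=0.7) == 1.0  # novel item clicked
assert simulate_feedback(1.0, "b", ["a"], sim, lam=0.7) == 0.0  # duplicate ignored
```

The two assertions show why the simulated user rewards diversity: with equal relevance, only the item dissimilar to the already-displayed list receives positive feedback.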

Figure 2: Offline evaluation results on Movielens-100K.
Figure 3: Offline evaluation results on Movielens-1M.

4.2 Experimental Results

4.2.1 Offline Performance

The offline performance of the evaluated recommendation algorithms on the Movielens-100K and Movielens-1M datasets is summarized in Figure 2 and Figure 3, respectively. It can be observed from Figure 3(a) that D^2RL consistently outperforms all the baselines in terms of recommendation accuracy at all recommendation epochs on the Movielens-1M dataset. Moreover, from Figure 3(b), we can also note that D^2RL consistently generates more diverse recommendations than the baselines at all recommendation epochs.

However, the performance of D^2RL is not as superior on the Movielens-100K dataset. As shown in Figure 2(a), D^2RL sometimes achieves poorer recommendation accuracy at certain recommendation epochs compared with the other baselines. One potential reason is that the critic network is not able to accurately estimate the quality of the learned recommendation policy. From Figure 2(b), we can note that D^2RL achieves comparable performance with the baselines in terms of recommendation diversity. In the offline setting, the user's feedback on the recommendation results is fixed, which limits the effectiveness of interactive recommender systems in capturing users' dynamic preferences on items. To remedy this problem, we conduct simulated online experiments to study the effectiveness of D^2RL, based on the Movielens-100K and Movielens-1M datasets.

4.2.2 Simulated Online Performance

The simulated online performance of D^2RL and CUCB on the Movielens-100K and Movielens-1M datasets is summarized in Figure 4 and Figure 5, respectively. As shown in Figure 4(a), the online recommendation accuracy of both D^2RL and CUCB fluctuates, but D^2RL consistently achieves better recommendation accuracy than CUCB at all recommendation epochs. Moreover, from Figure 4(b), we can also observe that the online recommendation results generated by D^2RL are much more diverse than those generated by CUCB. On average, D^2RL outperforms CUCB by 52.13% and 92.79% in terms of Precision and Diversity, respectively, over all recommendation epochs in the simulated experiments on the Movielens-100K dataset.

The superior performance of D^2RL can also be observed on the Movielens-1M dataset. As shown in Figure 5(a), both D^2RL and CUCB achieve stable recommendation accuracy, and D^2RL significantly and consistently outperforms CUCB at all recommendation epochs. Moreover, Figure 5(b) shows that the recommendation diversity of D^2RL fluctuates at first and then quickly becomes steady, remaining significantly higher than that of CUCB. On average, D^2RL outperforms CUCB by 76.04% and 156.58% in terms of Precision and Diversity, respectively, over all recommendation epochs in the simulated experiments on the Movielens-1M dataset.

Figure 4: Online simulation performance on Movielens-100K.
Figure 5: Online simulation performance on Movielens-1M.

5 Conclusion

In this paper, we have proposed a novel interactive recommendation model, namely Diversity-promoting Deep Reinforcement Learning (D^2RL). Specifically, D^2RL uses an actor-critic reinforcement learning framework to model the interactions between users and the recommender system. Moreover, the determinantal point process (DPP) model is incorporated in the recommendation process to promote the diversity of recommendation results. Both empirical experiments on real world datasets and simulated online experiments have been performed to demonstrate the effectiveness of D^2RL. Future work will focus on the following directions. Firstly, we would like to directly learn the DPP kernel matrix from the data. Secondly, we are also interested in developing more efficient DPP inference algorithms for generating relevant yet diverse recommendations.


  • [Antikacioglu and Ravi2017] Arda Antikacioglu and R Ravi. Post processing recommender systems for diversity. In KDD’17, 2017.
  • [Carbonell and Goldstein1998] Jaime Carbonell and Jade Goldstein. The use of mmr, diversity-based reranking for reordering documents and producing summaries. In SIGIR’98, 1998.
  • [Chen et al.2018] Laming Chen, Guoxin Zhang, and Eric Zhou. Fast greedy map inference for determinantal point process to improve recommendation diversity. In NIPS’18, 2018.
  • [Chen et al.2019] Haokun Chen, Xinyi Dai, Han Cai, Weinan Zhang, Xuejian Wang, Ruiming Tang, Yuzhou Zhang, and Yong Yu. Large-scale interactive recommendation with tree-structured policy gradient. In AAAI’19, 2019.
  • [Cheng et al.2017] Peizhe Cheng, Shuaiqiang Wang, Jun Ma, Jiankai Sun, and Hui Xiong. Learning to recommend accurate and diverse items. In WWW’17, 2017.
  • [Johnson2014] Christopher C Johnson. Logistic matrix factorization for implicit feedback data. NIPS’14, 27, 2014.
  • [Kawale et al.2015] Jaya Kawale, Hung H Bui, Branislav Kveton, Long Tran-Thanh, and Sanjay Chawla. Efficient thompson sampling for online matrix factorization recommendation. In NIPS’15, 2015.
  • [Kulesza and Taskar2011] Alex Kulesza and Ben Taskar. k-dpps: Fixed-size determinantal point processes. In ICML’11, 2011.
  • [Kulesza et al.2012] Alex Kulesza, Ben Taskar, et al. Determinantal point processes for machine learning. Foundations and Trends® in Machine Learning, 5(2–3), 2012.
  • [Kunaver and Požrl2017] Matevž Kunaver and Tomaž Požrl. Diversity in recommender systems–a survey. Knowledge-Based Systems, 123, 2017.
  • [Lathia et al.2010] Neal Lathia, Stephen Hailes, Licia Capra, and Xavier Amatriain. Temporal diversity in recommender systems. In SIGIR’10, 2010.
  • [Li et al.2010] Lihong Li, Wei Chu, John Langford, and Robert E Schapire. A contextual-bandit approach to personalized news article recommendation. In WWW’10, 2010.
  • [Lillicrap et al.2015] Timothy P Lillicrap, Jonathan J Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971, 2015.
  • [Puthiya Parambath et al.2016] Shameem A Puthiya Parambath, Nicolas Usunier, and Yves Grandvalet. A coverage-based approach to recommendation diversity on similarity graph. In RecSys’16, 2016.
  • [Qin and Zhu2013] Lijing Qin and Xiaoyan Zhu. Promoting diversity in recommendation by entropy regularizer. In IJCAI’13, 2013.
  • [Qin et al.2014] Lijing Qin, Shouyuan Chen, and Xiaoyan Zhu. Contextual combinatorial bandit and its application on diversified online recommendation. In SDM’14, 2014.
  • [Rendle et al.2009] Steffen Rendle, Christoph Freudenthaler, Zeno Gantner, and Lars Schmidt-Thieme. Bpr: Bayesian personalized ranking from implicit feedback. In UAI’09, 2009.
  • [Sha et al.2016] Chaofeng Sha, Xiaowei Wu, and Junyu Niu. A framework for recommending relevant and diverse items. In IJCAI’16, 2016.
  • [Shi et al.2014] Yue Shi, Martha Larson, and Alan Hanjalic. Collaborative filtering beyond the user-item matrix: A survey of the state of the art and future challenges. ACM Computing Surveys, 47(1):3, 2014.
  • [Steck et al.2015] Harald Steck, Roelof van Zwol, and Chris Johnson. Interactive recommender systems: Tutorial. In RecSys’15. ACM, 2015.
  • [Sutton and Barto2018] Richard S Sutton and Andrew G Barto. Reinforcement learning: An introduction. MIT press, 2018.
  • [Tang and Wang2018] Jiaxi Tang and Ke Wang. Personalized top-n sequential recommendation via convolutional sequence embedding. In WSDM’18. ACM, 2018.
  • [Tang et al.2015] Liang Tang, Yexi Jiang, Lei Li, Chunqiu Zeng, and Tao Li. Personalized recommendation via parameter-free contextual bandits. In SIGIR’15, 2015.
  • [Wang et al.2017] Huazheng Wang, Qingyun Wu, and Hongning Wang. Factorization bandits for interactive recommendation. In AAAI’17, 2017.
  • [Zhang and Hurley2008] Mi Zhang and Neil Hurley. Avoiding monotony: improving the diversity of recommendation lists. In RecSys’08, 2008.
  • [Zhang et al.2018] Shuai Zhang, Lina Yao, and Aixin Sun. Deep learning based recommender system: A survey and new perspectives. ACM Computing Surveys, 2018.
  • [Zhao et al.2012] Gang Zhao, Mong Li Lee, Wynne Hsu, and Wei Chen. Increasing temporal diversity with purchase intervals. In SIGIR’12, 2012.
  • [Zhao et al.2018a] Xiangyu Zhao, Long Xia, Liang Zhang, Zhuoye Ding, Dawei Yin, and Jiliang Tang. Deep reinforcement learning for page-wise recommendations. In RecSys’18, 2018.
  • [Zhao et al.2018b] Xiangyu Zhao, Liang Zhang, Zhuoye Ding, Long Xia, Jiliang Tang, and Dawei Yin. Recommendations with negative feedback via pairwise deep reinforcement learning. In KDD’18, 2018.
  • [Zheng et al.2018] Guanjie Zheng, Fuzheng Zhang, Zihan Zheng, Yang Xiang, Nicholas Jing Yuan, Xing Xie, and Zhenhui Li. Drn: A deep reinforcement learning framework for news recommendation. In WWW’18, 2018.