1 Introduction
Recommender systems have been widely used in various application scenarios. In practice, existing recommender systems are mainly developed based on collaborative filtering methods, e.g., matrix factorization [Shi et al.2014]. Recently, with the rapid development of deep learning techniques, e.g., Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs), many deep learning based recommender systems have been proposed [Zhang et al.2018]. Most of these recommender systems are developed in an offline manner, learning users' preferences on items from the historical interaction data between users and items. These methods, however, usually ignore the interactions between users and the recommender system, and therefore cannot exploit users' immediate and long-term feedback on the recommendation results to improve performance.

From a different perspective, the interactive recommender system [Steck et al.2015] explicitly models the interactions between users and the recommender system. In this scenario, the recommender system is treated as an agent, which generates recommendations for users according to some learnt policy. Based on users' feedback on the recommendation results, the agent consequently updates its recommendation policy, so as to improve user satisfaction in the long run. Moreover, in the interaction process, the interactive recommender system can also explore users' potential preferences on items, which may not be revealed by users' historical behaviour data.
Some of the existing interactive recommender systems are developed based on the contextual bandit framework, e.g., LinUCB [Li et al.2010, Kawale et al.2015]. Recently, some research works build interactive recommender systems using the deep reinforcement learning framework, considering the state transitions of the environment (i.e., users) [Zheng et al.2018, Zhao et al.2018a, Zhao et al.2018b, Chen et al.2019]. These recommendation methods mainly focus on optimizing the recommendation accuracy. However, they do not enforce diversity in the recommendation results. Therefore, the items in a recommendation list generated by these methods may be very similar to each other, which often reduces user experience and user satisfaction.

Differing from existing interactive recommendation methods, this paper proposes a novel recommendation model, namely Diversity-promoting Deep Reinforcement Learning (DRL), for diversified interactive recommendation. More specifically, to generate diverse yet relevant item recommendations, we perform sequential sampling from a Determinantal Point Process (DPP) model, which is an elegant probabilistic model of set diversity [Kulesza et al.2012]. For each user, a personalized DPP kernel matrix is dynamically built from two parts: a fixed similarity matrix capturing item-item similarity, and the relevance of items learnt through deep reinforcement learning. To dynamically learn the relevance of items, we adopt the actor-critic deep reinforcement learning framework to explore and exploit the user's preferences on items. In DRL, the actor network generates an action vector, i.e., the parameter of the DPP kernel matrix, which represents the user's dynamic personal preferences. The critic network evaluates the quality of the learnt recommendation policy, considering both current and future rewards of the recommended items. To demonstrate the effectiveness of the proposed DRL model, we have conducted extensive experiments on real-world datasets, as well as simulated online experiments. The experimental results indicate that DRL has superior performance under different settings.

To summarize, the major contributions of this work are as follows:

We propose a novel diversity-promoting recommendation model, i.e., DRL, which is able to generate diverse yet relevant recommendations in interactive recommendation.

In order to test the online performance of DRL, we propose a novel online simulator which can simulate user behaviours by considering both user preferences and the diversity of user behaviours.

Extensive experiments with both offline and online settings demonstrate the superior performance of DRL in comparison with several state-of-the-art methods.
2 Related Work
Contextual bandit has often been used for building interactive recommender systems. For instance, Li et al. proposed a contextual bandit algorithm named LinUCB to sequentially recommend articles to users based on the contextual information of users and articles [Li et al.2010]. Kawale et al. proposed to integrate Thompson sampling with online Bayesian probabilistic matrix factorization for interactive recommendation [Kawale et al.2015]. Tang et al. developed a parameter-free bandit approach which applied online bootstrap to learn the recommendation model in an online manner [Tang et al.2015]. Moreover, in [Wang et al.2017], a factorization-based bandit approach was proposed to solve the online interactive recommendation problem. Another group of works apply deep reinforcement learning to the interactive recommendation problem. For example, Zheng et al. proposed a deep reinforcement learning framework for news recommendation [Zheng et al.2018]. They utilized deep Q-learning to explicitly model the future rewards of recommendation results and also developed an effective exploration strategy to improve the recommendation accuracy. Zhao et al. proposed a deep recommendation framework (i.e., DEERS) that leveraged reinforcement learning to consider both users' positive and negative feedback for recommendation [Zhao et al.2018b]. Moreover, they also developed a page-wise recommendation model (i.e., DeepPage) which employed deep reinforcement learning to optimize the display of items on a webpage [Zhao et al.2018a]. In a recent work [Chen et al.2019], Chen et al. proposed a tree-structured policy gradient method to mitigate the large action space problem of deep reinforcement learning based recommendation methods.
Promoting the diversity of recommendation results has received increasing research attention [Kunaver and Požrl2017]. The maximal marginal relevance (MMR) model [Carbonell and Goldstein1998] was one of the pioneering works on promoting diversity in information retrieval tasks. In [Zhang and Hurley2008], Zhang and Hurley proposed a trust-region based optimization method to maximize the diversity of the recommendation list while maintaining an acceptable level of matching quality. In [Lathia et al.2010], Lathia et al. designed and evaluated a hybrid mechanism to maximize temporal recommendation diversity. In [Zhao et al.2012], Zhao et al. utilized purchase interval information to increase the diversity of recommended items. In [Qin and Zhu2013], Qin and Zhu proposed an entropy regularizer to promote recommendation diversity. This entropy regularizer was later incorporated into the contextual combinatorial bandit framework to diversify online recommendation results [Qin et al.2014]. Moreover, Sha et al. proposed a combinatorial optimization approach that combines the relevance of items, the coverage of the user's interests, and the diversity between items for recommendation [Sha et al.2016]. Puthiya Parambath et al. developed a new criterion capturing both relevance and diversity for recommendation, which was optimized by an efficient greedy algorithm [Puthiya Parambath et al.2016]. Recently, in [Antikacioglu and Ravi2017], a maximum-weight degree-constrained subgraph selection method was proposed to post-process the recommendations generated by collaborative filtering models, so as to increase recommendation diversity. In [Cheng et al.2017], the diversified recommendation problem was formulated as a supervised learning task, and a diversified collaborative filtering model was introduced to solve the resulting optimization problems. In [Chen et al.2018], a fast greedy maximum a posteriori (MAP) inference algorithm for determinantal point processes was proposed to improve recommendation diversity.

3 The Proposed Recommendation Model
In this section, we first introduce some background about the determinantal point process model and then present the details of the proposed DRL recommendation model.
3.1 Determinantal Point Process
DPP is an elegant probabilistic model that has been successfully used to promote diversity in various machine learning tasks [Kulesza et al.2012]. In this work, we propose to employ DPP to promote diversity in interactive recommendation. Specifically, for a recommendation task, a DPP on the discrete item set $\mathcal{I}$ of $n$ items is a probability distribution over the powerset of $\mathcal{I}$ (i.e., the set of all subsets of $\mathcal{I}$). If it assigns nonzero probability to the empty set $\emptyset$, there exists a positive semidefinite (PSD) kernel matrix $\mathbf{L} \in \mathbb{R}^{n \times n}$, such that for each subset $Y$ of $\mathcal{I}$, the probability of $Y$ is defined as follows:

$P(Y) = \dfrac{\det(\mathbf{L}_Y)}{\det(\mathbf{L} + \mathbf{I})},$ (1)

where $\mathbf{L}_Y$ is $\mathbf{L}$ restricted to those rows and columns which are indexed by $Y$, and $\mathbf{I}$ is the identity matrix.
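To make Eq. (1) concrete, the following sketch computes DPP subset probabilities for a toy catalogue and checks that they sum to one. The kernel here is a randomly generated placeholder, not one learned from data:

```python
import numpy as np
from itertools import chain, combinations

# Toy PSD kernel built as a Gram matrix over 4 hypothetical items.
rng = np.random.default_rng(0)
B = rng.normal(size=(4, 3))
L = B @ B.T

def dpp_probability(L, Y):
    """Probability that a DPP with kernel L draws exactly the subset Y (Eq. 1)."""
    L_Y = L[np.ix_(Y, Y)]                # rows/columns of L indexed by Y
    I = np.eye(L.shape[0])
    return np.linalg.det(L_Y) / np.linalg.det(L + I)

# Probabilities over all subsets sum to one (det of the empty minor is 1).
subsets = chain.from_iterable(combinations(range(4), k) for k in range(5))
total = sum(dpp_probability(L, list(Y)) for Y in subsets)
```

This normalization property is exactly what the denominator $\det(\mathbf{L} + \mathbf{I})$ provides.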
In practice, the kernel matrix $\mathbf{L}$ can be constructed from low-rank matrices. For example, $\mathbf{L}$ can be written as a Gram matrix $\mathbf{L} = \mathbf{B}\mathbf{B}^{\top}$, where $\mathbf{B} \in \mathbb{R}^{n \times k}$ and $k \leq n$. The rows of $\mathbf{B}$ are vectors representing the properties of items. As revealed in [Kulesza et al.2012], each row vector $\mathbf{b}_i$ can be empirically constructed as the product of the item relevance score $r_i$ and the item feature vector $\mathbf{f}_i$, i.e., $\mathbf{b}_i = r_i \mathbf{f}_i$. Hence, an element of the kernel matrix can be written as $\mathbf{L}_{ij} = r_i r_j \langle \mathbf{f}_i, \mathbf{f}_j \rangle$. Moreover, if normalization is applied to the feature vectors, i.e., $\lVert \mathbf{f}_i \rVert = 1$, the cosine similarity between two items $i$ and $j$ can be calculated as $\mathbf{S}_{ij} = \langle \mathbf{f}_i, \mathbf{f}_j \rangle$. Hence, $\mathbf{L}$ can be rewritten as follows:

$\mathbf{L} = \mathrm{Diag}(\mathbf{r}) \cdot \mathbf{S} \cdot \mathrm{Diag}(\mathbf{r}),$ (2)

where $\mathbf{S}$ is the item similarity matrix, and $\mathrm{Diag}(\mathbf{r})$ is a diagonal matrix whose $i$-th diagonal element is $r_i$. Following [Chen et al.2018], we modify the DPP kernel matrix as follows:
$\mathbf{L} = \mathrm{Diag}(e^{\alpha \mathbf{r}}) \cdot \mathbf{S} \cdot \mathrm{Diag}(e^{\alpha \mathbf{r}}),$ (3)

where $\alpha = \theta / (2(1 - \theta))$ and $\theta \in [0, 1]$ is a hyperparameter used to trade off between relevance and diversity. For simplicity, we model the relevance score of each item using a linear function of its features as follows:

$r_i = \mathbf{f}_i^{\top} \mathbf{w},$ (4)

where $\mathbf{w}$ is the model parameter.
As the item similarity matrix $\mathbf{S}$ is fixed once the item feature vectors are given, according to Eqs. (3) and (4), $\mathbf{w}$ becomes the only model parameter that needs to be learnt to determine the DPP kernel. Once the DPP kernel is determined, different inference algorithms [Kulesza and Taskar2011, Chen et al.2018] can be applied to find an optimal diverse yet relevant item set for recommendation. Intuitively, $\mathbf{w}$ is a vector representing the user's preferences. In the interactive recommendation setting, users' preferences can be dynamically learnt through the interactions between users and the recommender system. In this paper, we propose to dynamically learn $\mathbf{w}$ using a deep reinforcement learning framework.
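As a sketch of how Eqs. (2)-(4) fit together, the snippet below builds the modified kernel from normalized item features and a preference vector. The feature matrix and the parameter `w` are random placeholders standing in for learned quantities:

```python
import numpy as np

rng = np.random.default_rng(1)
n_items, dim = 6, 4
F = rng.normal(size=(n_items, dim))
F /= np.linalg.norm(F, axis=1, keepdims=True)   # unit norm, so F @ F.T is cosine sim
w = rng.normal(size=dim)                        # stand-in for the learned action w

def build_kernel(F, w, theta=0.8):
    """DPP kernel of Eq. (3): Diag(exp(alpha*r)) . S . Diag(exp(alpha*r))."""
    S = F @ F.T                                 # item-item similarity matrix
    r = F @ w                                   # Eq. (4): linear relevance scores
    alpha = theta / (2.0 * (1.0 - theta))       # relevance/diversity trade-off
    d = np.exp(alpha * r)
    return d[:, None] * S * d[None, :]          # equivalent to Diag(d) @ S @ Diag(d)

L = build_kernel(F, w)
```

Because $\mathbf{S}$ is a Gram matrix and the diagonal scaling preserves positive semidefiniteness, the resulting kernel remains a valid DPP kernel for any $\mathbf{w}$.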
3.2 Deep Reinforcement Learning Framework
In the interactive recommendation setting of DRL, the recommender system (i.e., the agent) interacts with users (i.e., the environment) by sequentially recommending sets of diverse yet relevant items, so as to maximize its cumulative reward in the long run. The interactive recommendation process can be modeled as a Markov Decision Process (MDP). We define the key components of the MDP of DRL as follows:
State Space: A state $s_t$ is determined by the recent items that the user has interacted with before time step $t$, together with the user's personal features;

Action Space: An action $a_t$ is the parameter $\mathbf{w}$ used to construct the DPP kernel matrix, as shown in Eq. (4), which captures the user's dynamic preferences;

Reward: Once the recommender system chooses an action $a_t$ based on its current state $s_t$, a set of diverse yet relevant items can be sampled from the resulting DPP kernel and recommended to the user. The user then provides her feedback on the recommended items, e.g., clicking or not clicking on them. The recommender system receives this feedback as the immediate reward $r_t$ and uses it to update its recommendation policy;

Discount factor: The discount factor $\gamma \in [0, 1]$ measures the present value of future rewards. When $\gamma$ is set to 0, the recommender system only considers the immediate rewards and ignores the future rewards. On the other hand, when $\gamma$ is set to 1, the recommender system counts all the future rewards as well as the immediate rewards.
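The effect of the discount factor can be illustrated by computing the discounted return of a short reward sequence; this helper is a generic sketch, not part of the DRL implementation:

```python
def discounted_return(rewards, gamma=0.95):
    """Present value of a reward sequence: sum_k gamma**k * rewards[k]."""
    g = 0.0
    for r in reversed(rewards):     # backward accumulation avoids explicit powers
        g = r + gamma * g
    return g
```

With `gamma=0` only the first reward counts, while `gamma=1` sums all rewards equally, matching the two extremes described above.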
The proposed DRL model is built on the actor-critic reinforcement learning framework [Sutton and Barto2018]. Figure 1 shows the overall structure of DRL. Next, we detail the actor and critic components of DRL.

Actor Network: Given the current state $s_t$, the actor network employs deep neural networks to generate an action $a_t$. As shown in Figure 1, the state $s_t$ is determined by the user's personal features and her recent interests. Specifically, the user's personal features are fed into two fully-connected ReLU neural network layers to obtain more abstract user features. Meanwhile, at time step $t$, we collect the features of the recent items that the user has interacted with before $t$, and stack them into a feature matrix. Intuitively, we treat this matrix as an "image" of the feature space, and use a CNN (consisting of two convolutional layers) to capture the sequential patterns of the user's interests by extracting the local features of the "image" [Tang and Wang2018]. The output of the CNN module is then fed into two fully-connected ReLU neural network layers to obtain higher-level interest features. We then concatenate the interest features with the abstract user features to form the state representation $s_t$. After that, $s_t$ is fed into several fully-connected layers to generate the action $a_t$. In the first two of these layers, we use ReLU as the activation function, and in the output layer we use the Tanh activation function to keep $a_t$ bounded. As we are interested in optimizing the DPP kernel to generate diverse recommendations, the action $a_t$ is essentially the parameter $\mathbf{w}$ used to build the DPP kernel at time $t$. Once $a_t$ is determined, we utilize the fast greedy MAP inference algorithm [Chen et al.2018] to sequentially sample a set of diverse yet relevant items as recommendations to the user.
Critic Network: The critic network is designed to approximate the Q-value function $Q(s_t, a_t)$, which estimates the quality of the learnt DPP-based recommendation policy. As shown in Figure 1, the inputs of the critic network are the action representation $a_t$ and the state representation $s_t$, which captures both the user profile and the user's recent interests. $s_t$ and $a_t$ are then fed into several fully-connected ReLU neural network layers to predict the Q-value. We denote the critic network by $Q(s_t, a_t; \theta^{\mu})$ and its network parameters by $\theta^{\mu}$.
The Deep Deterministic Policy Gradient (DDPG) algorithm [Lillicrap et al.2015] is used to train the DRL framework. We also employ the target network technique in DRL. In DDPG, the critic network is learned using the Bellman equation as in Q-learning. The network parameter $\theta^{\mu}$ is updated by minimizing the following loss function:

$\mathcal{L}(\theta^{\mu}) = \frac{1}{N} \sum_{i} \left( y_i - Q(s_i, a_i; \theta^{\mu}) \right)^2,$ (5)

where $y_i = r_i + \gamma Q'\big(s_{i+1}, \pi'(s_{i+1}; \theta^{\pi'}); \theta^{\mu'}\big)$, $N$ is the size of the sampled minibatch, and $\pi'$ and $Q'$ are the target actor network and target critic network, respectively. The actor network $\pi(s; \theta^{\pi})$ is updated by using the sampled policy gradient as follows:

$\nabla_{\theta^{\pi}} J \approx \frac{1}{N} \sum_{i} \nabla_{a} Q(s, a; \theta^{\mu}) \big|_{s = s_i, a = \pi(s_i)} \, \nabla_{\theta^{\pi}} \pi(s; \theta^{\pi}) \big|_{s = s_i}.$ (6)

Moreover, the parameters $\theta^{\pi'}$ and $\theta^{\mu'}$ of the target networks $\pi'$ and $Q'$ are updated using the following soft updating rule, with $\tau \ll 1$:

$\theta' \leftarrow \tau \theta + (1 - \tau) \theta'.$ (7)

The details of the DDPG based learning algorithm are summarized in Algorithm 1.
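Two DDPG ingredients used above, the TD target of Eq. (5) and the soft update of Eq. (7), can be sketched as follows. Networks are abstracted as plain callables and parameter vectors; all names are illustrative rather than the paper's implementation:

```python
import numpy as np

def td_target(r, s_next, target_actor, target_critic, gamma=0.95):
    """Critic regression target: y = r + gamma * Q'(s', pi'(s'))."""
    a_next = target_actor(s_next)          # target actor picks the next action
    return r + gamma * target_critic(s_next, a_next)

def soft_update(target_params, online_params, tau=0.01):
    """Eq. (7): theta' <- tau * theta + (1 - tau) * theta'."""
    return tau * online_params + (1.0 - tau) * target_params
```

The small `tau` keeps the target networks slowly tracking the online networks, which stabilizes the bootstrapped targets in Eq. (5).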
The details of the DDPG based learning algorithm are summarized in Algorithm 1.
4 Experiments
In this section, we first evaluate the performance of the proposed DRL model by performing offline experiments on two real-world datasets. Then, we conduct simulations to evaluate the online performance of DRL.
4.1 Experimental Setting
4.1.1 Datasets
The experiments are conducted on two public datasets: MovieLens100K and MovieLens1M (https://grouplens.org/datasets/movielens/). MovieLens100K consists of 100,000 ratings given by 943 users to 1,682 movies. MovieLens1M contains 1,000,209 ratings given by 6,040 users to 3,706 movies. There are 18 movie categories in both datasets, and each movie may belong to more than one category. In each dataset, we keep the ratings larger than 3 as positive feedback. We sort all the positive feedback in time order, and use the first 80% for model training and the remaining 20% for model testing. Moreover, we also remove the users that have fewer than 10 positive interactions in the training data. Table 1 summarizes the statistics of the experimental datasets.
Dataset  # Users  # Items  # Interactions 

Movielens100K  716  1,374  45,447 
Movielens1M  5,218  3,467  510,940 
4.1.2 Evaluated Algorithms
We compare the proposed DRL method with the following recommendation models: (1) BPRMF [Rendle et al.2009]: This is a matrix factorization method that uses pairwise learning to optimize the recommendation accuracy of top-ranked items. (2) LMF [Johnson2014]: This is a logistic matrix factorization recommendation approach that exploits users' implicit feedback by modeling the user-item interaction probability. (3) Caser [Tang and Wang2018]: This is a deep learning based sequential recommendation method which uses a CNN to extract the sequential patterns of users' behaviours. (4) CUCB [Qin et al.2014]: This is a contextual bandit based interactive recommendation method, which promotes recommendation diversity by employing an entropy regularizer.
For BPRMF and LMF, we empirically set the dimensionality of the latent space to 30, considering both recommendation accuracy and efficiency. The regularization parameters and the learning rates are chosen from sets of candidate values. For Caser, we set the dimensionality of the embedding layer to 30, the Markov order to 5, and the target number to 2. In addition, we use the latent features of users and items learnt by BPRMF as the user and item features in the interactive recommendation methods, i.e., CUCB and DRL. In CUCB, the regularization parameter of the entropy regularizer is selected from a set of candidate values. For DRL, we set the number of recent items in each state to 5. The parameter $\theta$ that balances relevance and diversity in the DPP kernel matrix is likewise tuned over a set of candidate values. We utilize Adam to optimize the actor and critic networks, and set the learning rate to 0.001. In addition, we set the discount factor $\gamma$ of DRL to 0.95.
4.1.3 Metrics
The performance of the recommendation algorithms is evaluated from two aspects, i.e., accuracy and diversity. The recommendation accuracy at each recommendation epoch $t$ is defined as follows:

$\mathrm{Precision}@t = \dfrac{|R_t \cap T_u|}{|R_t|},$ (8)

where $R_t$ is the set of items recommended to the user at epoch $t$, $|\cdot|$ is the cardinality of a set, and $T_u$ denotes the set of items the user has interacted with in the testing data. The diversity of the recommendation results at epoch $t$ is measured by the intra-list distance (ILD) [Zhang and Hurley2008] metric, which is defined as follows:

$\mathrm{ILD}@t = \dfrac{1}{|R_t|(|R_t| - 1)} \sum_{i \in R_t} \sum_{j \in R_t, j \neq i} \big(1 - \mathbf{S}_{ij}\big),$ (9)

where $\mathbf{S}_{ij}$ denotes the similarity between two items $i$ and $j$. For fair comparison, we define $\mathbf{S}_{ij}$ as the Jaccard similarity between the categories of items $i$ and $j$. In the experiments, we empirically set the size of $R_t$ to 5.
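The two metrics above can be sketched directly from their definitions; `categories` below is a hypothetical mapping from item id to its category labels:

```python
def precision_at_epoch(recommended, ground_truth):
    """Eq. (8): fraction of recommended items found in the test interactions."""
    R = set(recommended)
    return len(R & set(ground_truth)) / len(R)

def intra_list_distance(recommended, categories):
    """Eq. (9): average pairwise (1 - Jaccard similarity) over the list."""
    R = list(recommended)
    dists = []
    for idx, i in enumerate(R):
        for j in R[idx + 1:]:
            ci, cj = set(categories[i]), set(categories[j])
            jaccard = len(ci & cj) / len(ci | cj)
            dists.append(1.0 - jaccard)
    return sum(dists) / len(dists) if dists else 0.0
```

Averaging over unordered pairs is equivalent to the normalization in Eq. (9), since each pair appears twice in the double sum.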
4.1.4 Setup
We adopt two different settings to evaluate the recommendation algorithms: offline evaluation on public datasets and online evaluation with a simulator.
In offline evaluation, we compare DRL with all the baseline methods. For each non-interactive recommendation approach (i.e., BPRMF, LMF, and Caser), we first learn the recommendation model using the training data, and then rank all candidate items by the learned model. At each recommendation epoch, we choose a set of top-ranked items from the ranked candidates as recommendations, and these items are removed from the candidate set once they have been recommended. The recommendation accuracy is measured by comparing the recommended items with the ground truth in the testing data. Note that the non-interactive recommendation methods do not update the learned models during the recommendation process. In CUCB, however, the bandit parameter is updated using the user's feedback on the recommendation results. For the offline training of DRL, we first initialize the actor network by minimizing an L2 loss between its output and the embedding of the next interacted item, using state-action transition samples randomly drawn from the replay buffer. After that, we fix the actor network for some time and update the critic network alone. When both the actor network and the critic network are well initialized, they are updated iteratively following DDPG (Algorithm 1). Once the two networks converge, we fix the learned network parameters in the recommendation process. The details of the offline evaluation process for DRL are summarized in Algorithm 2.
For online evaluation, we build a simulator to compare DRL with CUCB. The online interactive recommendation process of DRL follows Algorithm 1, and the feedback on the recommended items is generated by a simulator, as summarized in Algorithm 3. In this simulator, we first use the LMF model [Johnson2014], pre-trained on the training data, to predict the user's preference on the recommended items. Then, we use MMR [Carbonell and Goldstein1998] to simulate the user's feedback probability by considering both the user's preference and the diversity of the recommended items. The MMR parameter $\lambda$ simulates a person's trade-off between relevance and diversity, and a higher $\lambda$ simulates a person who prefers more relevant and less diverse recommendations. The parameter $\epsilon$ is a decision threshold, and positive feedback is provided if the user's feedback probability is greater than $\epsilon$. We empirically set the MMR parameter randomly for each user within a fixed range, and set the threshold $\epsilon$ to 0.5. A good interactive recommender system should be able to learn from the simulated user behaviours and make accurate yet diverse recommendations. For online evaluation, the accuracy is evaluated by comparing the recommended results with the simulated feedback.
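An MMR-style feedback rule of the kind described above can be sketched as follows. This is only an illustration of the idea; the exact scoring used by the paper's simulator (Algorithm 3) may differ, and `prefs`, `sim`, `lam`, and `eps` are assumed names:

```python
import numpy as np

def simulate_feedback(prefs, sim, lam=0.7, eps=0.5):
    """MMR-style simulated feedback: each item's score combines its predicted
    preference (e.g., from a pre-trained LMF model) with a penalty for being
    similar to the other recommended items; positive feedback when score > eps."""
    n = len(prefs)
    feedback = []
    for i in range(n):
        # Penalize the item by its maximum similarity to the rest of the list.
        others = [sim[i, j] for j in range(n) if j != i]
        score = lam * prefs[i] - (1.0 - lam) * max(others, default=0.0)
        feedback.append(score > eps)
    return feedback
```

With a high `lam` the simulated user rewards relevance almost regardless of redundancy, while a low `lam` makes redundant recommendations fail the threshold, mimicking a diversity-seeking user.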
4.2 Experimental Results
4.2.1 Offline Performance
The offline performance of the evaluated recommendation algorithms on the Movielens100K and Movielens1M datasets is summarized in Figure 2 and Figure 3, respectively. It can be observed from Figure 3(a) that DRL consistently outperforms all the baselines in terms of recommendation accuracy at all recommendation epochs on the Movielens1M dataset. Moreover, from Figure 3(b), we can also note that DRL consistently generates more diverse recommendations than the baselines at all recommendation epochs.

However, the performance of DRL is less superior on the Movielens100K dataset. As shown in Figure 2(a), DRL sometimes achieves poorer recommendation accuracy than the other baselines at certain recommendation epochs. One potential reason is that the critic network is not able to accurately estimate the quality of the learned recommendation policy. From Figure 2(b), we can note that DRL achieves comparable performance with the baselines in terms of recommendation diversity. In the offline setting, the user's feedback on the recommendation results is fixed, which limits the effectiveness of interactive recommender systems in capturing users' dynamic preferences on items. To remedy this problem, we conduct simulated online experiments based on the Movielens100K and Movielens1M datasets to study the effectiveness of DRL.
4.2.2 Simulated Online Performance
The simulated online performance of DRL and CUCB on the Movielens100K and Movielens1M datasets is summarized in Figure 4 and Figure 5, respectively. As shown in Figure 4(a), the online recommendation accuracy of both DRL and CUCB fluctuates across recommendation epochs, but DRL consistently achieves better recommendation accuracy than CUCB at all recommendation epochs. Moreover, from Figure 4(b), we can also observe that the online recommendation results generated by DRL are much more diverse than those generated by CUCB. On average, DRL outperforms CUCB by 52.13% and 92.79% in terms of Precision and Diversity, respectively, over all recommendation epochs in the simulated experiments on the Movielens100K dataset.

The superior performance of DRL can also be observed on the Movielens1M dataset. As shown in Figure 5(a), both DRL and CUCB achieve stable recommendation accuracy, and DRL significantly and consistently outperforms CUCB at all recommendation epochs. Moreover, Figure 5(b) shows that the recommendation diversity of DRL fluctuates at first and then quickly becomes steady, remaining significantly higher than that of CUCB. On average, DRL outperforms CUCB by 76.04% and 156.58% in terms of Precision and Diversity, respectively, over all recommendation epochs in the simulated experiments on the Movielens1M dataset.
5 Conclusion
In this paper, we have proposed a novel interactive recommendation model, namely Diversity-promoting Deep Reinforcement Learning (DRL). Specifically, DRL uses the actor-critic reinforcement learning framework to model the interactions between users and the recommender system. Moreover, the determinantal point process (DPP) model is incorporated into the recommendation process to promote the diversity of the recommendation results. Both empirical experiments on real-world datasets and simulated online experiments have been performed to demonstrate the effectiveness of DRL. Future work will focus on the following directions. First, we would like to directly learn the DPP kernel matrix from data. Second, we are also interested in developing more efficient DPP inference algorithms for generating relevant yet diverse recommendations.
References
 [Antikacioglu and Ravi2017] Arda Antikacioglu and R Ravi. Post processing recommender systems for diversity. In KDD’17, 2017.
 [Carbonell and Goldstein1998] Jaime Carbonell and Jade Goldstein. The use of MMR, diversity-based reranking for reordering documents and producing summaries. In SIGIR'98, 1998.
 [Chen et al.2018] Laming Chen, Guoxin Zhang, and Eric Zhou. Fast greedy MAP inference for determinantal point process to improve recommendation diversity. In NIPS'18, 2018.
 [Chen et al.2019] Haokun Chen, Xinyi Dai, Han Cai, Weinan Zhang, Xuejian Wang, Ruiming Tang, Yuzhou Zhang, and Yong Yu. Large-scale interactive recommendation with tree-structured policy gradient. In AAAI'19, 2019.
 [Cheng et al.2017] Peizhe Cheng, Shuaiqiang Wang, Jun Ma, Jiankai Sun, and Hui Xiong. Learning to recommend accurate and diverse items. In WWW’17, 2017.
 [Johnson2014] Christopher C Johnson. Logistic matrix factorization for implicit feedback data. NIPS’14, 27, 2014.
 [Kawale et al.2015] Jaya Kawale, Hung H Bui, Branislav Kveton, Long Tran-Thanh, and Sanjay Chawla. Efficient Thompson sampling for online matrix factorization recommendation. In NIPS'15, 2015.
 [Kulesza and Taskar2011] Alex Kulesza and Ben Taskar. k-DPPs: Fixed-size determinantal point processes. In ICML'11, 2011.
 [Kulesza et al.2012] Alex Kulesza, Ben Taskar, et al. Determinantal point processes for machine learning. Foundations and Trends® in Machine Learning, 5(2–3), 2012.
 [Kunaver and Požrl2017] Matevž Kunaver and Tomaž Požrl. Diversity in recommender systems – a survey. Knowledge-Based Systems, 123, 2017.
 [Lathia et al.2010] Neal Lathia, Stephen Hailes, Licia Capra, and Xavier Amatriain. Temporal diversity in recommender systems. In SIGIR’10, 2010.
 [Li et al.2010] Lihong Li, Wei Chu, John Langford, and Robert E Schapire. A contextual-bandit approach to personalized news article recommendation. In WWW'10, 2010.
 [Lillicrap et al.2015] Timothy P Lillicrap, Jonathan J Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971, 2015.
 [Puthiya Parambath et al.2016] Shameem A Puthiya Parambath, Nicolas Usunier, and Yves Grandvalet. A coverage-based approach to recommendation diversity on similarity graph. In RecSys'16, 2016.
 [Qin and Zhu2013] Lijing Qin and Xiaoyan Zhu. Promoting diversity in recommendation by entropy regularizer. In IJCAI’13, 2013.
 [Qin et al.2014] Lijing Qin, Shouyuan Chen, and Xiaoyan Zhu. Contextual combinatorial bandit and its application on diversified online recommendation. In SDM’14, 2014.
 [Rendle et al.2009] Steffen Rendle, Christoph Freudenthaler, Zeno Gantner, and Lars Schmidt-Thieme. BPR: Bayesian personalized ranking from implicit feedback. In UAI'09, 2009.
 [Sha et al.2016] Chaofeng Sha, Xiaowei Wu, and Junyu Niu. A framework for recommending relevant and diverse items. In IJCAI’16, 2016.
 [Shi et al.2014] Yue Shi, Martha Larson, and Alan Hanjalic. Collaborative filtering beyond the useritem matrix: A survey of the state of the art and future challenges. ACM Computing Surveys, 47(1):3, 2014.
 [Steck et al.2015] Harald Steck, Roelof van Zwol, and Chris Johnson. Interactive recommender systems: Tutorial. In RecSys’15. ACM, 2015.
 [Sutton and Barto2018] Richard S Sutton and Andrew G Barto. Reinforcement learning: An introduction. MIT press, 2018.
 [Tang and Wang2018] Jiaxi Tang and Ke Wang. Personalized top-n sequential recommendation via convolutional sequence embedding. In WSDM'18. ACM, 2018.
 [Tang et al.2015] Liang Tang, Yexi Jiang, Lei Li, Chunqiu Zeng, and Tao Li. Personalized recommendation via parameter-free contextual bandits. In SIGIR'15, 2015.
 [Wang et al.2017] Huazheng Wang, Qingyun Wu, and Hongning Wang. Factorization bandits for interactive recommendation. In AAAI’17, 2017.
 [Zhang and Hurley2008] Mi Zhang and Neil Hurley. Avoiding monotony: improving the diversity of recommendation lists. In RecSys’08, 2008.
 [Zhang et al.2018] Shuai Zhang, Lina Yao, and Aixin Sun. Deep learning based recommender system: A survey and new perspectives. ACM Computing Surveys, 2018.
 [Zhao et al.2012] Gang Zhao, Mong Li Lee, Wynne Hsu, and Wei Chen. Increasing temporal diversity with purchase intervals. In SIGIR’12, 2012.
 [Zhao et al.2018a] Xiangyu Zhao, Long Xia, Liang Zhang, Zhuoye Ding, Dawei Yin, and Jiliang Tang. Deep reinforcement learning for page-wise recommendations. In RecSys'18, 2018.
 [Zhao et al.2018b] Xiangyu Zhao, Liang Zhang, Zhuoye Ding, Long Xia, Jiliang Tang, and Dawei Yin. Recommendations with negative feedback via pairwise deep reinforcement learning. In KDD’18, 2018.
 [Zheng et al.2018] Guanjie Zheng, Fuzheng Zhang, Zihan Zheng, Yang Xiang, Nicholas Jing Yuan, Xing Xie, and Zhenhui Li. DRN: A deep reinforcement learning framework for news recommendation. In WWW'18, 2018.