I Introduction
Thanks to increasingly popular online services, such as online shopping, online news and online social networks, it has become quite convenient to acquire items (goods, books, videos, news, etc.) via the Internet or mobile devices. Despite the great convenience, the overwhelming number of items in these systems also poses a significant challenge for users to find the items that match their interests. Recommendation is a widely used solution and various families of techniques have been proposed, such as content-based filtering [1], matrix factorization based methods [2, 3, 4, 5], logistic regression, factorization machines and their variants [6, 7, 8], deep learning models [9, 10, 11, 12] and multi-armed bandits [13, 14, 15, 16, 17]. However, these studies suffer from two serious limitations.
Firstly, most of them consider the recommendation procedure as a static process, i.e., they assume the user's underlying preference stays unchanged. However, it is very common that a user's preference is dynamic w.r.t. time, i.e., a user's preference on previous items will affect her choice of the next items. Hence, it would be more reasonable to model recommendation as a sequential decision making process. We show some evidence observed in publicly available datasets (MovieLens and Yahoo! Music) to support this view. In the two datasets, the sequential behaviors of users are recorded, and we are interested in what happens when a user consecutively receives satisfying or unsatisfying recommendations. Though the datasets do not record any recommendation procedure, we can simulate one according to the users' ratings; namely, consecutive "positive" ("negative") ratings simulate a user consecutively receiving satisfying (unsatisfying) recommendations. As presented in Figure 1, we observe that a user tends to give a higher (lower) rating if she has consecutively received more satisfying (unsatisfying) items, as shown by the green (red) line, where the blue dotted line denotes the average rating for reference. This suggests that a user becomes more pleased (displeased) as she consecutively receives more satisfying (unsatisfying) recommendations, and therefore tends to give a higher (lower) rating to the current recommendation. Hence, the user's dynamic preference suggests that a good recommender should be modeled as a sequential decision making process.
Secondly, the aforementioned studies are trained by maximizing the immediate rewards of recommendations, which merely concentrates on whether the recommended items are clicked or consumed, but ignores the long-term contributions those items can make. However, items with small immediate rewards but large long-term benefits are also crucial [18]. We take an example in news recommendation [19] to explain this. When a user requests news to read, two candidate pieces of news may lead to the same immediate reward, i.e., the user will click and read the two pieces with equal probability, where one is about a thunderstorm alert and the other is about the basketball player Kobe Bryant. In this example, after reading the news about the thunderstorm, the user is probably not willing to read news about this issue anymore; on the other hand, the user will possibly read more about the NBA or basketball after reading the news about Kobe. This suggests that recommending the news about Kobe will introduce larger long-term rewards. Hence, when recommending items to users, both the immediate and long-term rewards should be taken into consideration.
Recently, Reinforcement Learning (RL) [20], which has shown great potential in various challenging scenarios that require both dynamic modeling and long-term planning, such as game playing [21, 22], real-time ads bidding [23, 24] and neural network architecture search [25, 26], has been introduced into recommender systems [18, 27, 28, 29, 19, 30, 31, 32, 33].
In the early stage, model-based RL techniques were proposed to model the recommendation procedure, such as POMDP [18] and Q-learning [27]. However, these methods are inapplicable to complicated recommendation scenarios where the number of candidate items is large, because a time-consuming dynamic programming step is required to update the model. Later, model-free RL techniques were utilized in recommender systems, by both academia and industry. Such techniques can be divided into two categories: value-based [29, 19] and policy-based [28, 32, 33]. Value-based approaches compute the Q-values of all available actions for a given state, and the one with the maximum Q-value is selected as the best action. Due to the evaluation over all actions, these approaches may become very inefficient if the action space is large. Policy-based approaches [28, 32, 33] instead generate a continuous parameter vector as the representation of an action, which can be utilized to generate the recommendation and to update the Q-value evaluator. Thanks to the continuous representations, the inefficiency drawback can be overcome. However, these studies [28, 32, 33] still share one common limitation: the user state is learned via a conventional fully connected neural network, which does not explicitly and carefully model the interactions between users and items.
In this paper, to break the limitations stated above, we propose a deep reinforcement learning based recommendation framework with explicit user-item interaction modeling (DRR). The "Actor-Critic" type framework DRR incorporates a state representation module, which explicitly models the complex dynamic user-item interactions to pursue better recommendation performance. Specifically, the embeddings of users and items from the historical interactions are fed into a carefully designed multi-layer network, which explicitly models the interactions between users and items, to produce a continuous state representation of the user in terms of her underlying sequential behaviors. This network, named the state representation module, plays two important roles in our framework. On the one hand, it is utilized to generate a ranking action that calculates the recommendation scores for ranking. On the other hand, the state representation together with the generated action is the input of the Critic network, which estimates the Q-value, i.e., the quality of the action in the current state. Based on this evaluation, the Actor (policy) network can be updated. We note that both the Actor and Critic networks are carefully designed by modeling the interactions between users and items explicitly. Extensive experiments on four real-world datasets demonstrate that the proposed method yields superior performance over the state-of-the-art methods. The main contributions of this paper can be summarized as follows:

We propose a deep reinforcement learning based recommendation framework, DRR. Unlike conventional studies, DRR adopts an "Actor-Critic" structure and treats recommendation as a sequential decision making process, taking both the immediate and long-term rewards into consideration.

Under the DRR framework, three different network structures are proposed, which can explicitly model the interactions between users and items.

Extensive experiments are carried out on four real-world datasets, and the results demonstrate that the proposed methods indeed outperform the state-of-the-art competitors.
The rest of this paper is organized as follows. Related work and background are presented in Section II. The preliminary knowledge is presented in Section III. The proposed methods are introduced in Section IV. Experimental details and results are discussed in Section V. Finally, we conclude this paper and discuss some future work in Section VI.
II Related Work
II-A Non-RL based Recommendation Techniques
Various kinds of recommendation techniques have been proposed in the past few decades to improve the performance of recommender systems, including content-based filtering [1], matrix factorization based methods [2, 3, 4, 5], logistic regression, factorization machines and their variants [6, 7, 8], and, more recently, deep learning models [9, 10, 11, 12].
At the beginning of this century, content-based filtering [1] was proposed to recommend items by considering the content similarity between items. Later, collaborative filtering (CF) was put forward and extensively studied. The rationale behind CF is that users with similar behaviors tend to prefer the same items, and items consumed by similar users tend to receive similar ratings. However, conventional CF based methods tend to suffer from data sparsity, because the similarity calculated from sparse data can be very unreliable. Matrix factorization (MF), as an advanced CF technique, plays an important role in recommender systems. MF models [2, 3, 4, 5] characterize both items and users by vectors in the same space, which are inferred from the observed user-item interactions. Regarding recommendation as a binary classification problem, logistic regression and its variants [6] are also utilized in recommender systems. However, logistic regression based models are hard to generalize to feature interactions that never or rarely appear in the training data. Factorization machines (FM) [7] model pairwise feature interactions as the inner product of latent vectors between features and show promising results. As an extension to FM, Field-aware FM (FFM) [8] enables each feature to have multiple latent vectors to interact with different fields. Recently, deep learning models [9, 10, 11, 12] have been applied to model the complicated feature interactions for recommendation.
As a distinct direction, contextual multi-armed bandits are also utilized to model the interactive nature of recommender systems [13, 14, 15, 16, 17]. Li et al. apply Thompson Sampling (TS) and Upper Confidence Bound (UCB) to balance the trade-off between exploration and exploitation in [13] and [14], respectively. The authors of [16] propose a dynamic context drift model to address the time-varying problem. To integrate the latent vectors of items and users with some exploration, the authors of [15, 17] combine matrix factorization with multi-armed bandits.
However, all these methods suffer from two limitations. First, they consider the recommendation procedure as a static process, i.e., they assume the user's underlying preference stays static, and they aim to learn the user's preference as precisely as possible. Second, they are learned to maximize the immediate rewards of recommendations, but ignore the long-term benefits that the recommendations can bring.
II-B RL based Recommendation Techniques
As model-based RL techniques [18, 27] are inapplicable in recommendation scenarios due to their high time complexity, most researchers turn to model-free RL techniques. The model-free RL techniques can be divided into two categories: policy-based and value-based.
Policy-based approaches [28, 32, 33] aim to generate a policy whose input is a state and whose output is an action. These works apply deterministic policies, which generate an action directly. Dulac-Arnold et al. [33] resolve the large action space problem by modeling the state in a continuous item embedding space and selecting the items via a neighborhood method. However, as the underlying algorithm is essentially a continuous-action algorithm, its performance may suffer from the gap between the continuous and discrete action spaces. In [28, 32], the policy network outputs a continuous action representation, and the recommendation is generated by ranking the items with their scores, which are computed by a predefined function taking the action representation and the item embeddings as input. However, one common limitation of these studies is that they do not carefully learn the state representation.
For value-based approaches [29, 19], the action with the maximum Q-value over all possible actions is selected as the best action. Zhao et al. [29] take both the user's positive feedback and negative feedback into consideration when modeling the user state. A Dueling Q-network is utilized in [19] to model the Q-value of a state-action pair; moreover, a minor update with exploration by dueling bandit gradient descent is proposed. However, such value-based approaches need to evaluate the Q-values of all actions under a specific state, which is very inefficient when the number of actions is large.
To make RL based recommendation techniques suitable for large-scale scenarios, in this paper we propose the DRR framework, which carefully and explicitly models the interactions between users and items to learn the state representation.
III Preliminaries
The essential underlying model of reinforcement learning is the Markov Decision Process (MDP). An MDP is defined as a tuple (S, A, P, R, γ), where S is the state space and A is the action space, P: S × A × S → [0, 1] is the state transition function, R: S × A → R is the reward function, and γ ∈ [0, 1] is the discount rate. The objective of an agent in an MDP is to find an optimal policy π* which maximizes the expected cumulative rewards from any state s, i.e., V^π(s) = E_π[ Σ_{k=0}^{∞} γ^k r_{t+k} | s_t = s ], or equivalently maximizes the expected cumulative rewards from any state-action pair (s, a), i.e., Q^π(s, a) = E_π[ Σ_{k=0}^{∞} γ^k r_{t+k} | s_t = s, a_t = a ]. Here E_π is the expectation under policy π, t is the current timestep, and r_{t+k} is the immediate reward at the future timestep t + k.
We model the recommendation procedure as a sequential decision making problem, in which the recommender (i.e., agent) interacts with users (i.e., environment) to suggest a list of items sequentially over the timesteps, so as to maximize the cumulative rewards of the whole recommendation procedure. More specifically, the recommendation procedure is modeled by an MDP as follows.

States S. A state s ∈ S is the representation of the user's positive interaction history with the recommender, as well as her demographic information (if it exists in the dataset).

Actions A. An action a ∈ A is a continuous parameter vector. Each item i, with embedding e_i (which can be generated by MF or VAE, for instance), has a ranking score, defined as the inner product of the action and the item embedding, i.e., a^T e_i. Then the top ranked items are recommended.

Transitions P. The state s is modeled as the representation of the user's positive interaction history. Hence, once the user's feedback is collected, the state transition is determined.

Reward R. Given the recommendation based on the action a and the user state s, the user will provide her feedback, i.e., click, not click, rating, etc. The recommender receives an immediate reward r(s, a) according to the user's feedback.

Discount rate γ. γ ∈ [0, 1] is a factor measuring the present value of long-term rewards. In the case of γ = 0, the recommender considers only immediate rewards and ignores long-term rewards. On the other hand, when γ = 1, the recommender treats immediate rewards and long-term rewards as equally important.
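The inner-product ranking described in the Actions item can be sketched as follows. This is a minimal numpy illustration; the toy embeddings, dimensionality and function name are our own inventions for the example, not part of the paper:

```python
import numpy as np

def rank_items(action, item_embeddings, top_n=1):
    """Score every candidate item by the inner product with the
    continuous action vector and return the top-n item indices."""
    scores = item_embeddings @ action          # one score per item
    order = np.argsort(-scores)                # descending by score
    return order[:top_n], scores

# Toy example: 4 candidate items with 3-dimensional embeddings.
items = np.array([[1.0, 0.0, 0.0],
                  [0.0, 1.0, 0.0],
                  [0.5, 0.5, 0.0],
                  [0.0, 0.0, 1.0]])
action = np.array([1.0, 0.2, 0.0])             # continuous parameter vector a
top, scores = rank_items(action, items, top_n=2)
# top contains the indices of the two highest-scoring items
```

The key point is that the action is not an item: it is a vector that induces a ranking over however many candidate items exist, which is what makes the approach scale to large item sets.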
Figure 2 illustrates the recommender-user interactions in the MDP formulation. Considering the current user state and the immediate reward to the previous action, the recommender takes an action. Note that in our model, an action corresponds neither to recommending a single item nor to recommending a list of items. Instead, an action is a continuous parameter vector. Taking such an action, the parameter vector is used to determine the ranking scores of all the candidate items, by computing inner products with the item embeddings. All the candidate items are ranked according to the computed scores and the top-N items are recommended to the user. Receiving the recommendation, the user provides her feedback to the recommender and the user state is updated accordingly. The recommender receives rewards according to the user's feedback. Without loss of generality, a recommendation procedure is a T-timestep trajectory (s_1, a_1, r_1, s_2, a_2, r_2, ..., s_T, a_T, r_T). (If a recommendation episode terminates in fewer than T timesteps, the length of the episode is the actual value.)
IV The Proposed DRR Framework
As discussed in Section I, conventional recommendation techniques suffer from a lack of sequential modeling, from ignoring the long-term rewards, or both. To address these drawbacks, we propose a deep reinforcement learning based recommendation framework (DRR) built on the Actor-Critic learning scheme. Moreover, different from some recent RL studies, we carefully and explicitly build a state representation module to model the interactions between users and items. Next, we first elaborate on the Actor network, the Critic network and the state representation module, which are the three key ingredients of our framework; then the training and evaluation procedures are presented to show how to learn and use the DRR framework.
IV-A Three Key Ingredients in DRR
IV-A1 The Actor network
The Actor network, also called the policy network, is depicted in the left part of Figure 3. For a given user, the network generates an action based on her state s. Let us explain the network from the input to the output. In DRR, the user state, represented by the embeddings of her n latest positively interacted items, is the input. These embeddings are fed into a state representation module (introduced in detail later) to produce a summarized representation for the user. For instance, at timestep t, the state can be defined as in Eq. (1):

(1) s_t = f(H_t)

where f(·) stands for the state representation module, H_t = {i_1, ..., i_n} denotes the embeddings of the n latest positive interactions, and s_t is a dense vector. When the recommender agent recommends an item i_{n+1}, if the user provides positive feedback, then in the next timestep the state is updated to s_{t+1} = f(H_{t+1}), where H_{t+1} = {i_2, ..., i_n, i_{n+1}}; otherwise, H_{t+1} = H_t. The reasons to define the state in such a manner are twofold: (i) a superior recommender system should cater to the users' taste, i.e., the items the users like; (ii) the latest records represent the users' recent interests more precisely.
Finally, through two ReLU layers and one Tanh layer, the state representation is transformed into an action as the output of the Actor network. Specifically, the action is defined as a ranking function represented by a continuous parameter vector a_t. Using the action, the ranking score of item i is defined as:

(2) score_{t,i} = a_t^T e_i

Then, the top ranked item (w.r.t. the ranking scores) is recommended to the user. Note that the widely used ε-greedy exploration technique is adopted here.
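A minimal sketch of such an Actor forward pass follows, using plain numpy. The layer sizes, random initialization scheme and the noise-based exploration stand-in are assumptions for illustration, not the paper's exact configuration:

```python
import numpy as np

rng = np.random.default_rng(0)

def init_actor(state_dim, hidden_dim, action_dim):
    """Randomly initialize the three layers described in the text
    (sizes and initialization are illustrative assumptions)."""
    return {
        "W1": rng.normal(0, 0.1, (state_dim, hidden_dim)),
        "W2": rng.normal(0, 0.1, (hidden_dim, hidden_dim)),
        "W3": rng.normal(0, 0.1, (hidden_dim, action_dim)),
    }

def actor_forward(params, state):
    """Two ReLU layers followed by a Tanh layer, producing the
    continuous action (ranking) vector."""
    h1 = np.maximum(0.0, state @ params["W1"])
    h2 = np.maximum(0.0, h1 @ params["W2"])
    return np.tanh(h2 @ params["W3"])

def explore(action, epsilon=0.1):
    """Exploration stand-in: with probability epsilon, replace the
    policy's action by a random action vector."""
    if rng.random() < epsilon:
        return rng.uniform(-1.0, 1.0, size=action.shape)
    return action

params = init_actor(state_dim=6, hidden_dim=8, action_dim=3)
a = actor_forward(params, rng.normal(size=6))
a_explore = explore(a)
```

Because the final layer is Tanh, every component of the action lies in [-1, 1], which keeps the ranking scores bounded for normalized item embeddings.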
IV-A2 The Critic network
The Critic part in DRR, shown in the middle part of Figure 3, is a Deep Q-Network [21], which leverages a deep neural network parameterized by θ^Q to approximate the true state-action value function Q^π(s, a), namely the Q-value function. The Q-value function reflects the merit of the action policy generated by the Actor network. Specifically, the input of the Critic network is the user state generated by the state representation module and the action generated by the policy network, and the output is the Q-value, a scalar. According to the Q-value, the parameters of the Actor network are updated in the direction of improving the performance of action a, i.e., boosting Q(s, a). Based on the deterministic policy gradient theorem [34], we can update the Actor by the sampled policy gradient shown in Eq. (3):

(3) ∇_{θ^π} J ≈ (1/N) Σ_i ∇_a Q(s, a; θ^Q) |_{s=s_i, a=π(s_i)} ∇_{θ^π} π(s; θ^π) |_{s=s_i}

where J is the expectation of all possible Q-values that follow the policy π. Here the minibatch strategy is utilized and N denotes the batch size. Moreover, the Critic network is updated by the temporal-difference learning approach [20], i.e., minimizing the mean squared error shown in Eq. (4):

(4) L(θ^Q) = (1/N) Σ_i ( y_i − Q(s_i, a_i; θ^Q) )^2

where y_i = r_i + γ Q'(s_{i+1}, π'(s_{i+1}; θ^{π'}); θ^{Q'}). The target network technique [35] is also adopted in the DRR framework, where θ^{Q'} and θ^{π'} are the parameters of the target Critic and Actor networks.
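The temporal-difference target and Critic loss of Eq. (4) can be illustrated numerically. This is a sketch with made-up reward and Q-value numbers; the networks and gradient steps themselves are omitted:

```python
import numpy as np

def td_targets(rewards, next_q_values, gamma=0.9):
    """y_i = r_i + gamma * Q'(s_{i+1}, pi'(s_{i+1})), as in Eq. (4)."""
    return rewards + gamma * next_q_values

def critic_loss(q_values, targets):
    """Mean squared TD error minimized by the Critic."""
    return float(np.mean((targets - q_values) ** 2))

rewards = np.array([1.0, -0.5])     # immediate rewards r_i
next_q = np.array([2.0, 1.0])       # target-network Q-values for s_{i+1}
y = td_targets(rewards, next_q, gamma=0.9)
loss = critic_loss(y, y)            # ~0 when predictions match the targets
```

In practice the Critic's predictions Q(s_i, a_i; θ^Q) replace the first argument of `critic_loss`, and the loss is minimized by gradient descent on θ^Q.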
IV-A3 The State Representation Module
As noted above, the state representation module plays an important role in both the Actor and Critic networks. Hence, it is crucial to design a good structure to model the state. In [10, 11], it has been shown that modeling the feature interactions explicitly can boost the performance of a recommendation system. Inspired by these studies, we design the state representation module by explicitly modeling the interactions between users and items. Specifically, we develop three structures, elaborated next.

DRR-p. Inspired by [10, 11], we propose a product based neural network for the state representation module, depicted in Figure 4 (the legend in Figures 4, 5 and 6 is the same as in Figure 3). The structure, named DRR-p, utilizes a product operator to capture the pairwise local dependency between items. The structure clones the representations of the items from H. In addition, it computes the pairwise interactions between the items using the element-wise product operator. As a result, new feature vectors are produced, which are concatenated with the cloned vectors to form the state representation. We note that in the element-wise product part, a weight is also learned for each item to indicate its importance. Hence, in DRR-p the state representation module can be formally stated as follows:

(5) f_{ij} = (w_i e_i) ⊙ (w_j e_j)
(6) s = f(H) = concat{ e_1, ..., e_n, f_{12}, ..., f_{(n−1)n} }

where ⊙ denotes the element-wise product, w_i is a scalar indicating the importance of item i, and f_{ij} is a vector modeling the interaction between items i and j. Denoting the embedding dimensionality by d, the dimensionality of s is (n + n(n−1)/2) · d.
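The DRR-p computation can be sketched as below, under our reading of the cloned-vectors-plus-pairwise-products construction. The per-item weights are plain fixed scalars here, whereas in the paper they are learned parameters:

```python
import numpy as np

def drr_p_state(item_embs, weights):
    """DRR-p sketch: concatenate the (cloned) item embeddings with all
    pairwise element-wise products (w_i e_i) * (w_j e_j)."""
    n = len(item_embs)
    parts = [e for e in item_embs]                      # cloned item vectors
    for i in range(n):
        for j in range(i + 1, n):
            parts.append((weights[i] * item_embs[i]) *
                         (weights[j] * item_embs[j]))   # pairwise interaction
    return np.concatenate(parts)

embs = np.array([[1.0, 2.0], [3.0, 4.0], [0.5, 0.5]])  # n = 3 items, d = 2
w = np.array([1.0, 1.0, 2.0])                          # importance weights
s = drr_p_state(embs, w)
# dimensionality: (n + n(n-1)/2) * d = (3 + 3) * 2 = 12
```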

DRR-u. Though DRR-p can model the pairwise local dependency between items, the user-item interactions are neglected. To remedy this, we design another structure, shown in Figure 5 and referred to as DRR-u. In DRR-u, the user embedding e_u is also incorporated: in addition to the local dependency between items, the pairwise user-item interactions are taken into account. Formally, the state representation module can be expressed as:

(7) s = f(H; e_u) = concat{ w_u e_u ⊙ w_1 e_1, ..., w_u e_u ⊙ w_n e_n, f_{12}, ..., f_{(n−1)n} }

The dimensionality of s is also (n + n(n−1)/2) · d.

DRR-ave. In the DRR-p and DRR-u structures, the interactions between users and items are exploited and modeled. For these two structures, it is not difficult to see that the positions of items in H matter, e.g., the state representations of H and a permutation of H are different. When n is large, we expect the positions of items to really matter, because H denotes a long-term sequence; whereas memorizing the positions of items may lead to overfitting if the sequence is a short-term one. Hence, we design another structure that eliminates the position effects, depicted in Figure 6. As an average pooling layer is adopted, we call this structure DRR-ave. As shown in Figure 6, the embeddings of the items in H are first transformed by a weighted average pooling layer. Then, the resulting vector is leveraged to model the interactions with the user. Finally, the embedding of the user, the interaction vector, and the average pooling result of the items are concatenated into a single vector to form the state representation. Formally, the DRR-ave structure can be expressed as:

(8) g = avg{ w_1 e_1, ..., w_n e_n }
(9) s = f(H; e_u) = concat{ e_u, e_u ⊙ g, g }

Here avg{·} indicates the weighted average pooling layer. The dimensionality of s in DRR-ave is 3d.
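The DRR-ave construction above can be sketched with made-up numbers as follows; the pooling weights are fixed here for illustration, while in the paper they are learned:

```python
import numpy as np

def drr_ave_state(user_emb, item_embs, weights):
    """DRR-ave sketch: weighted average pooling over item embeddings,
    user-item interaction via element-wise product, then concatenation."""
    g = np.average(item_embs, axis=0, weights=weights)  # pooled item vector
    interaction = user_emb * g                          # user-item term
    return np.concatenate([user_emb, interaction, g])

user = np.array([1.0, -1.0])                            # user embedding, d = 2
embs = np.array([[2.0, 2.0], [4.0, 0.0]])               # n = 2 item embeddings
s = drr_ave_state(user, embs, weights=np.array([0.5, 0.5]))
# state dimensionality: 3d = 6, independent of n
```

Note that, unlike DRR-p, the state dimensionality here does not grow with n, and permuting the items in H leaves the state unchanged, which is exactly the position-invariance argued for above.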
IV-B Training Procedure of the DRR Framework
Next, we introduce how to train the DRR framework. We first present the overall idea and then discuss the detailed algorithm. As mentioned above, DRR utilizes the users' interaction history with the recommender agent as training data. During the procedure, the recommender takes an action a_t following the current recommendation policy after observing the user (environment) state s_t, then obtains the feedback (reward) r_t from the user, and the user state is updated to s_{t+1}. According to the feedback, the recommender updates its recommendation policy. In this work, we utilize the deep deterministic policy gradient (DDPG) [35] algorithm to train the proposed DRR framework, as detailed in Algorithm 1.
Specifically, at timestep t, the training procedure mainly includes two phases, i.e., transition generation (lines 7-12) and model updating (lines 13-17). In the first stage, the recommender observes the current state s_t calculated by the proposed state representation module, then generates an action a_t according to the current policy with ε-greedy exploration, and recommends an item i_t according to the action by Eq. (2) (lines 8-9). Subsequently, the reward r_t is calculated based on the user's feedback to the recommended item i_t, and the user state is updated (lines 10-11). Finally, the recommender agent stores the transition (s_t, a_t, r_t, s_{t+1}) into the replay buffer (line 12).
In the second stage, model updating, the recommender samples a minibatch of transitions using the widely used prioritized experience replay [36] sampling technique (line 13), which is essentially an importance sampling strategy. Then, the recommender updates the parameters of the Actor network and the Critic network according to Eq. (3) and Eq. (4), respectively (lines 14-16). Finally, the recommender updates the target networks' parameters with the soft replace strategy.
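Two pieces of the training loop above can be sketched in plain Python: the replay buffer (with uniform sampling as a simple stand-in for prioritized experience replay) and the soft replace of the target-network parameters. The value of tau and the scalar transitions are illustrative assumptions:

```python
import random
from collections import deque

def soft_update(target, source, tau=0.001):
    """Soft replace: theta' <- tau * theta + (1 - tau) * theta'
    for every parameter, keeping the target networks slow-moving."""
    return {k: tau * source[k] + (1.0 - tau) * target[k] for k in target}

# Replay buffer holding (s, a, r, s') transitions (scalars as a toy case).
buffer = deque(maxlen=10000)
buffer.append((0.1, 0.5, 1.0, 0.2))
batch = random.sample(list(buffer), k=1)   # uniform-sampling stand-in

target = {"w": 0.0}
source = {"w": 1.0}
updated = soft_update(target, source, tau=0.1)
```

The soft update keeps the TD targets in Eq. (4) stable: the target Critic and Actor trail the learned networks instead of copying them at every step.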
IV-C Evaluation
In this subsection, we discuss how to evaluate the models with an environment simulator. The most straightforward way to evaluate RL based models is to conduct online experiments on recommender systems where the recommender directly interacts with users. However, the underlying commercial risk and the costly deployment on the platform make this impractical. Therefore, throughout the testing phase, we conduct the evaluation of the proposed models on public offline datasets, and propose two ways to evaluate the models: offline evaluation and online evaluation.
IV-C1 Offline evaluation
Intuitively, the offline evaluation of the trained models tests the recommendation performance with the learned policy, as described in Algorithm 2. Specifically, for a given session, the recommender only recommends the items that appear in this session, rather than the ones in the whole item space. The reason is that we only have ground truth feedback for the items in the session in the recorded offline log. At each timestep, the recommender agent takes an action according to the learned policy, and recommends an item based on the action by Eq. (2) (lines 4-5). After that, the recommender observes the reward according to the feedback on the recommended item by Eq. (10) (lines 5-6). Then the user state is updated and the recommended item is removed from the candidate set (lines 7-8). The offline evaluation procedure can be treated as a re-ranking of the candidate set, which iteratively selects an item w.r.t. the action generated by the Actor network in the DRR framework. Moreover, the model parameters are not updated during the offline evaluation.
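The iterative re-ranking loop described above can be sketched as follows. The policy and state-update functions are passed in as stand-ins (here trivial lambdas); the embeddings and names are our own for the example:

```python
import numpy as np

def offline_rerank(item_embs, policy, state_update, init_state, steps):
    """Iteratively pick the best-scoring item from the remaining
    session candidates and update the state, without repeats."""
    candidates = list(range(len(item_embs)))
    recommended, state = [], init_state
    for _ in range(min(steps, len(candidates))):
        action = policy(state)
        scores = {i: float(item_embs[i] @ action) for i in candidates}
        best = max(scores, key=scores.get)     # top-ranked remaining item
        recommended.append(best)
        candidates.remove(best)                # no repeated recommendations
        state = state_update(state, best)      # reward lookup omitted here
    return recommended

embs = np.array([[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]])
order = offline_rerank(embs,
                       policy=lambda s: np.array([1.0, 0.1]),  # fixed action
                       state_update=lambda s, i: s,
                       init_state=None, steps=3)
```

With a real learned policy, the action (and hence the induced ordering) changes after every recommendation as the state evolves, which is what distinguishes this from a one-shot ranking.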
IV-C2 Online evaluation with environment simulator
As aforementioned, it is risky and costly to directly deploy RL based models on recommender systems. Therefore, we conduct online evaluation with an environment simulator. In this paper, we pretrain a PMF [37] model as the environment simulator, i.e., to predict the feedback for items that the user has never rated before. The online evaluation procedure follows Algorithm 1, i.e., the parameters are continuously updated during the online evaluation stage. Its major difference from Algorithm 1 is that the feedback for a recommended item is obtained from the environment simulator. Moreover, before each recommendation session starts in the simulated online evaluation, we reset the parameters back to those of the policy learned in the training stage, for a fair comparison.
V Experiment
V-A Datasets and Evaluation Metrics
We adopt the following publicly available datasets from the real world to conduct the experiments:

MovieLens (100k) (https://grouplens.org/datasets/movielens/100k/). A benchmark dataset comprising 0.1 million ratings from users for movies on the MovieLens website.

Yahoo! Music (R3) (https://webscope.sandbox.yahoo.com/). This dataset contains over 0.36 million ratings of songs collected from two different sources. The first source consists of ratings provided by users during normal interactions with Yahoo! Music services. The second source consists of ratings of randomly selected songs collected during an online survey by Yahoo! Research. We normalize the ratings to discrete values from 1 to 5.

MovieLens (1M) (https://grouplens.org/datasets/movielens/1m/). A benchmark dataset including 1 million ratings from the MovieLens website.

Jester (2) (http://eigentaste.berkeley.edu/dataset/). This dataset contains over 1.7 million real-valued ratings (-10.0 to +10.0) of jokes from an online joke recommender system.
Note that, except for Jester, the ratings in the datasets are discrete values from 1 to 5; the statistics of the datasets are given in Table I. MovieLens (100k) and MovieLens (1M) are abbreviated as ML (100k) and ML (1M), respectively.
          | ML (100k) | Yahoo! Music | ML (1M)   | Jester
# users   | 943       | 15,400       | 6,040     | 63,978
# items   | 1,682     | 1,000        | 3,952     | 150
# ratings | 100,000   | 365,740      | 1,000,209 | 1,761,439
We conduct both offline and simulated online evaluation on these four datasets. For the offline evaluation, we utilize Precision@k and NDCG@k as the metrics to measure the performance of the proposed models. For the simulated online evaluation, we leverage the total accumulated rewards as the metric.
V-B Compared Methods
We compare the proposed methods with some representative baseline methods. For the offline evaluation, we compare against conventional methods, including Popularity, PMF [37] and SVD++ [38], and an RL based method, DRR-n. The online evaluation baselines contain the state-of-the-art multi-armed bandit methods LinUCB [39] and HLinUCB [40], as well as DRR-n.

Popularity recommends the most popular item, i.e., the item with the highest average rating or the item with the largest number of positive ratings, from the currently available items at each timestep. (To get a better popularity based result, we test both strategies and report the better one.)

PMF performs a matrix decomposition like SVD, but only takes the non-zero elements into account.

SVD++ combines the strengths of the latent factor model and the neighborhood model.

LinUCB selects an arm (item) according to the estimated upper confidence bound of the potential reward.

HLinUCB further learns hidden features for each arm to model the potential reward.

DRR-n simply utilizes the concatenation of the item embeddings to represent the user state, as widely used in previous studies. Although it falls under the DRR framework, we treat this method as a baseline to assess the effectiveness of our proposed state representation module.
V-C Experimental Settings
For each dataset, we randomly divide the ratings into two parts: 80% are used for training, while the other 20% are for evaluation. For MovieLens (100k), Yahoo! Music and MovieLens (1M), the higher ratings on the 1 to 5 scale are regarded as positive, while for Jester, the positive ratings are those above a threshold. The number of latest positively rated items, n, is set empirically. We perform PMF to pretrain the embeddings of the users and items. Moreover, in each episode, we do not recommend repeated items, i.e., we remove the already recommended ones from the candidate set. The discount rate γ is 0.9. We utilize the Adam optimizer for all the RL based methods, with norm regularization to prevent overfitting. As for the reward function, we normalize the ratings into the range [-1, 1] and utilize the normalized values as the feedback of the corresponding recommendations. For instance, at timestep t, the recommender agent recommends an item i to user u (denoted as action a_t in state s_t), and the rating R_{u,i} comes from the interaction logs if user u actually rated item i, or from a value predicted by the simulator otherwise. Therefore, the reward function is defined as follows:
(10) r_t = (R_{u,i} − 3) / 2 for the 1 to 5 rating scale, and r_t = R_{u,i} / 10 for Jester

where the first case applies to MovieLens (100k), Yahoo! Music and MovieLens (1M), and the second to Jester, so that both map the ratings linearly into [-1, 1]. All the baseline methods are carefully tuned for a fair comparison. We model the recommendation procedure as an interaction episode with length T, and the hyperparameter T is tuned for different datasets (detailed in Section V-E).
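Under the linear normalization into [-1, 1] described above, the reward computation can be written as follows; the dataset tags are our own illustrative labels, and the linear maps are the ones implied by the stated rating ranges:

```python
def reward(rating, dataset):
    """Map a logged or simulated rating into [-1, 1].

    A sketch assuming the linear normalization implied by the text:
    the 1-5 scale is shifted by its midpoint and halved, while
    Jester's [-10, 10] scale is divided by 10.
    """
    if dataset in ("ml100k", "yahoo", "ml1m"):  # 1-5 rating scale
        return (rating - 3.0) / 2.0
    return rating / 10.0                        # Jester: [-10, 10]
```

Both maps send the scale's extremes to -1 and +1 and its midpoint to 0, so neutral ratings contribute no reward signal.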
V-D Results and Analysis
V-D1 Offline Evaluation Results and Analysis
The offline evaluation results are summarized in Table II to Table V, where the best results are marked in bold. In the offline evaluation, we compare the proposed methods to representative offline learning methods. The results show that the proposed methods under the DRR framework outperform the baselines on most of the datasets, which demonstrates their effectiveness.
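For reference, the two metrics reported below can be computed as in the following sketch (a standard binary-relevance formulation, not necessarily the exact evaluation code used here):

```python
import math

def precision_at_k(rels, k):
    """Precision@k: fraction of the top-k ranked items that are relevant.
    `rels` holds 0/1 relevance labels in ranked order."""
    return sum(rels[:k]) / k

def ndcg_at_k(rels, k):
    """NDCG@k with binary gains: DCG of the ranking divided by the DCG
    of the ideal (relevance-sorted) ranking."""
    def dcg(xs):
        return sum(x / math.log2(i + 2) for i, x in enumerate(xs))
    ideal = dcg(sorted(rels, reverse=True)[:k])
    return dcg(rels[:k]) / ideal if ideal > 0 else 0.0
```

Precision@k ignores ranking order within the cut-off, while NDCG@k rewards placing relevant items earlier, which is why the two metrics can disagree on the same ranking.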
Specifically, as aforementioned, we propose three different network structures in the state representation module to model the explicit interactions between users and items under the DRR framework, namely DRRp, DRRu and DRRave. From the results in Table II to Table V, we find that all three methods outperform the baselines in most cases. Moreover, DRRn, which simply concatenates the item embeddings to represent the state, performs worse than the proposed DRRp, DRRu and DRRave. From these observations we can draw two conclusions: (i) the proposed methods indeed have the capability of long-term planning and dynamic adaptation, which are ignored by conventional methods; (ii) the proposed state representation module captures the dynamic interactions between users and items well, and the state should not be built by simply concatenating item embeddings through fully connected layers as DRRn does, which may result in information loss.
Comparing DRRp, DRRu and DRRave, we can see that DRRave outperforms DRRu, and DRRu is superior to DRRp on the four datasets in most cases. The reasons are as follows: 1) DRRu performs better than DRRp because it not only captures the interactions among the user's historical items, but also seizes the personalization information through the user-item interactions. 2) DRRave performs best for two reasons: (i) it also captures the personalization information through user-item interactions; (ii) as noted in Section IV, the average pooling eliminates the position effects in the interaction history.
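As a concrete sketch of the DRRave idea, average pooling makes the state invariant to the order of the history items, and the element-wise user-item product injects personalization. The following is a simplified NumPy rendering, not the exact module (which additionally applies learned weighting layers):

```python
import numpy as np

def drr_ave_state(user_emb, hist_item_embs):
    """Simplified DRR-ave state: average-pool the embeddings of the
    recently positively rated items, then concatenate the user
    embedding, the element-wise user-item interaction, and the
    pooled item vector."""
    avg = hist_item_embs.mean(axis=0)          # position-independent pooling
    return np.concatenate([user_emb, user_emb * avg, avg])
```

Because of the mean pooling, permuting the history items leaves the state unchanged, which is precisely the position-effect elimination discussed above.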
Table II
Model  Precision@5  Precision@10  NDCG@5  NDCG@10

Popularity  0.6933  0.6012  0.9104  0.9008 
PMF  0.6988  0.6194  0.9095  0.8968 
SVD++  0.7034  0.6255  0.9125  0.8991 
DRRn  0.7185  0.6387  0.9147  0.9004 
DRRp  0.7263  0.6448  0.9076  0.9015 
DRRu  0.7417  0.6536  0.9183  0.9062 
DRRave  0.7887  0.6935  0.9255  0.9046 
Table III
Model  Precision@5  Precision@10  NDCG@5  NDCG@10

Popularity  0.3826  0.3805  0.8870  0.8811 
PMF  0.3835  0.3817  0.8837  0.8802 
SVD++  0.3857  0.3821  0.8887  0.8813 
DRRn  0.3844  0.3819  0.8876  0.8810 
DRRp  0.3850  0.3822  0.8883  0.8815 
DRRu  0.3864  0.3827  0.8889  0.8819 
DRRave  0.3917  0.3839  0.9004  0.8949 
Table IV
Model  Precision@5  Precision@10  NDCG@5  NDCG@10

Popularity  0.7141  0.6181  0.8906  0.8738 
PMF  0.7072  0.6193  0.8901  0.8746 
SVD++  0.7142  0.6258  0.9009  0.8776 
DRRn  0.7151  0.6221  0.8902  0.8751 
DRRp  0.7346  0.6366  0.8909  0.8753 
DRRu  0.7375  0.6385  0.8912  0.8763 
DRRave  0.7693  0.6594  0.9112  0.8980 
Table V
Model  Precision@5  Precision@10  NDCG@5  NDCG@10

Popularity  0.6167  0.6012  0.8932  0.8703 
PMF  0.6171  0.6015  0.8740  0.8676 
SVD++  0.6184  0.6027  0.8819  0.8614 
DRRn  0.6178  0.6021  0.8915  0.8724 
DRRp  0.6181  0.6029  0.8934  0.8753 
DRRu  0.6217  0.6043  0.8974  0.8805 
DRRave  0.6278  0.6076  0.9124  0.9079 
V-D2 Simulated Online Evaluation Results and Analysis
The results of the simulated online evaluation are summarized in Table VI, where the best results are marked in bold. In this experiment, we only compare with the baseline methods that support online learning, namely LinUCB, HLinUCB and DRRn. Again, we find that the proposed methods deliver higher rewards than all the baselines.
On the one hand, this suggests that the proposed RL-based methods model dynamic adaptation and long-term rewards better than the multi-armed bandit based methods LinUCB and HLinUCB. On the other hand, it indicates that the proposed state representation structures are superior to the naive fully-connected network in DRRn. Again, we observe that DRRave performs best among the three proposed interaction modeling structures.
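The simulated online protocol can be sketched as the episode loop below, which accumulates rewards and removes already-recommended items from the candidate set. The `agent` and `simulator` interfaces are hypothetical stand-ins, not the exact ones used in the experiments.

```python
def run_episode(agent, simulator, candidates, episode_len):
    """Run one interaction episode and return the total reward
    (the quantity compared in the simulated online evaluation).

    Hypothetical interfaces:
      agent.initial_state()          -> state
      agent.act(state, pool)         -> item chosen from `pool`
      agent.update(state, item, r)   -> next state
      simulator.rate(item)           -> logged or predicted reward
    """
    pool = set(candidates)
    state = agent.initial_state()
    total = 0.0
    for _ in range(episode_len):
        item = agent.act(state, pool)
        reward = simulator.rate(item)
        total += reward
        pool.discard(item)                 # no repeated recommendations
        state = agent.update(state, item, reward)
    return total
```

Removing each recommended item from the pool mirrors the no-repeat rule stated in the experimental settings.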
Table VI
Model  ML (100k)  Yahoo! Music  ML (1M)  Jester

LinUCB  1,958  30,462.5  30,174  141,358.4 
HLinUCB  1,475  32,725  32,785.5  147,105.5 
DRRn  2,654.5  35,382.5  35,860  165,844.5 
DRRp  2,832  37,328.5  36,653  177,414.2 
DRRu  2,869  42,174.5  37,615  183,517.6 
DRRave  3,251.5  49,095  40,588  194,860.7 
V-E Parameter Study
In this subsection, we investigate how the episode length affects the performance of the proposed methods. Figure 7 shows the results (due to the space limit, we only present the performance of DRRave; DRRp and DRRu exhibit similar behavior). From the left part of Figure 7, we observe that the performance on MovieLens first increases and then decreases as the episode length grows, with the peak at . A similar trend can be found for Yahoo! Music in the right part of Figure 7, where the performance peaks at . The reason may be the trade-off between exploitation and exploration. When the episode length is small, the user cannot fully interact with the recommender agent, i.e., the exploration is insufficient. As we enlarge the episodes, the recommender agent can explore (interact with users) adequately and capture the user's preference, so the performance improves. However, if the episodes are too long, the recommender focuses on exploiting locally, while the set of items the user prefers is limited; the performance therefore declines since we do not recommend repeated items. Hence, we should carefully trade off exploration and exploitation by setting a suitable episode length.
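The parameter study above amounts to sweeping the episode length and keeping the value with the highest average reward, as in this sketch (the `evaluate` callback and the candidate lengths are hypothetical stand-ins for running the recommender at a given length):

```python
def sweep_episode_length(evaluate, lengths=(5, 10, 20, 30, 50)):
    """Evaluate each candidate episode length and return the best one
    together with all measured rewards. `evaluate(T)` is assumed to
    return the average episode reward obtained with length T."""
    rewards = {T: evaluate(T) for T in lengths}
    best = max(rewards, key=rewards.get)
    return best, rewards
```

With a reward curve that first rises and then falls, as observed in Figure 7, the sweep recovers the interior peak rather than an endpoint.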
V-F Case Study
In this subsection, we present an example to show the different recommendation behaviors of LinUCB and DRRave on the MovieLens dataset. Specifically, we randomly pick a user (ID 11) and conduct the recommendation procedure with LinUCB and DRRave respectively. To compare their reactions to the same recommendation scenario, we fix the first three recommended items and observe what happens next. The recommended items and the corresponding rewards are reported in Table VII.
From Table VII, we can see that LinUCB and DRRave react differently after two consecutive negative recommendations (Eraser and First Knight). Specifically, LinUCB keeps exploring without considering recommending a "safe" item to please the user. In contrast, DRRave stops exploring and recommends a risk-free movie, Dead Man Walking, which belongs to the same genre as Chasing Amy, an item that received positive feedback from the user at timestep 1. This observation demonstrates the superiority of the proposed DRRave over LinUCB.
Table VII
timestep  LinUCB  DRRave

1  Chasing Amy (1)  Chasing Amy (1) 
2  Eraser (0.5)  Eraser (0.5) 
3  First Knight (1)  First Knight (1) 
4  The Deer Hunter (0.5)  Dead Man Walking (1) 
5  Event Horizon (1)  Braveheart (0.5) 
6  The Net (0)  The Usual Suspects (0.5) 
7  Striptease (0.5)  Psycho (0.5) 
VI Conclusion
In this paper, we propose a deep reinforcement learning based framework, DRR, to perform the recommendation task. Unlike conventional studies, DRR treats recommendation as a sequential decision making process and adopts an "Actor-Critic" learning scheme, which takes both immediate and long-term rewards into account. In DRR, a state representation module is incorporated and three instantiating structures are designed, which explicitly model the interactions between users and items. Extensive experiments on four real-world datasets demonstrate the superiority of the proposed DRR method over state-of-the-art competitors.
References
 [1] R. J. Mooney and L. Roy, "Content-based book recommending using learning for text categorization," in ACM DL, 2000, pp. 195–204.
 [2] M. Deshpande and G. Karypis, "Item-based top-N recommendation algorithms," ACM Trans. Inf. Syst., vol. 22, no. 1, pp. 143–177, 2004.
 [3] Y. Koren, R. M. Bell, and C. Volinsky, "Matrix factorization techniques for recommender systems," IEEE Computer, vol. 42, no. 8, pp. 30–37, 2009.
 [4] G. Linden, B. Smith, and J. York, "Amazon.com recommendations: Item-to-item collaborative filtering," IEEE Internet Computing, vol. 7, no. 1, pp. 76–80, 2003.
 [5] J. Wang, A. P. De Vries, and M. J. Reinders, "Unifying user-based and item-based collaborative filtering approaches by similarity fusion," in SIGIR. ACM, 2006, pp. 501–508.
 [6] H. B. McMahan, G. Holt, D. Sculley, M. Young, D. Ebner, J. Grady, L. Nie, T. Phillips, E. Davydov, D. Golovin, S. Chikkerur, D. Liu, M. Wattenberg, A. M. Hrafnkelsson, T. Boulos, and J. Kubica, "Ad click prediction: a view from the trenches," in KDD 2013, Chicago, IL, USA, August 11-14, 2013, pp. 1222–1230.
 [7] S. Rendle, "Factorization machines," in ICDM, Sydney, Australia, 14-17 December 2010, pp. 995–1000.
 [8] Y. Juan, Y. Zhuang, W. Chin, and C. Lin, "Field-aware factorization machines for CTR prediction," in RecSys, Boston, MA, USA, September 15-19, 2016, pp. 43–50.
 [9] W. Zhang, T. Du, and J. Wang, "Deep learning over multi-field categorical data - A case study on user response prediction," in ECIR 2016, Padua, Italy, March 20-23, 2016, pp. 45–57.
 [10] Y. Qu, H. Cai, K. Ren, W. Zhang, Y. Yu, Y. Wen, and J. Wang, "Product-based neural networks for user response prediction," in ICDM 2016, December 12-15, 2016, Barcelona, Spain, pp. 1149–1154.
 [11] H. Guo, R. Tang, Y. Ye, Z. Li, and X. He, "DeepFM: A factorization-machine based neural network for CTR prediction," in IJCAI 2017, Melbourne, Australia, August 19-25, 2017, pp. 1725–1731.
 [12] H. Cheng, L. Koc, J. Harmsen, T. Shaked, T. Chandra, H. Aradhye, G. Anderson, G. Corrado, W. Chai, M. Ispir, R. Anil, Z. Haque, L. Hong, V. Jain, X. Liu, and H. Shah, "Wide & deep learning for recommender systems," CoRR, vol. abs/1606.07792, 2016.
 [13] O. Chapelle and L. Li, "An empirical evaluation of Thompson sampling," in NIPS, Granada, Spain, 2011, pp. 2249–2257.
 [14] L. Li, W. Chu, J. Langford, and R. E. Schapire, "A contextual-bandit approach to personalized news article recommendation," in WWW 2010, Raleigh, North Carolina, USA, April 26-30, 2010, pp. 661–670.
 [15] H. Wang, Q. Wu, and H. Wang, "Factorization bandits for interactive recommendation," in AAAI, February 4-9, 2017, San Francisco, California, USA, pp. 2695–2702.
 [16] C. Zeng, Q. Wang, S. Mokhtari, and T. Li, "Online context-aware recommendation with time varying multi-armed bandit," in SIGKDD, San Francisco, CA, USA, August 13-17, 2016, pp. 2025–2034.
 [17] X. Zhao, W. Zhang, and J. Wang, "Interactive collaborative filtering," in CIKM'13, San Francisco, CA, USA, October 27 - November 1, 2013, pp. 1411–1420.
 [18] G. Shani, D. Heckerman, and R. I. Brafman, "An MDP-based recommender system," Journal of Machine Learning Research, vol. 6, pp. 1265–1295, 2005.
 [19] G. Zheng, F. Zhang, Z. Zheng, Y. Xiang, N. J. Yuan, X. Xie, and Z. Li, "DRN: A deep reinforcement learning framework for news recommendation," in WWW 2018, Lyon, France, April 23-27, 2018, pp. 167–176.
 [20] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. MIT Press, Cambridge, 1998, vol. 1, no. 1.
 [21] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. A. Riedmiller, A. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg, and D. Hassabis, "Human-level control through deep reinforcement learning," Nature, vol. 518, no. 7540, pp. 529–533, 2015.
 [22] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. van den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, S. Dieleman, D. Grewe, J. Nham, N. Kalchbrenner, I. Sutskever, T. P. Lillicrap, M. Leach, K. Kavukcuoglu, T. Graepel, and D. Hassabis, "Mastering the game of go with deep neural networks and tree search," Nature, vol. 529, no. 7587, pp. 484–489, 2016.
 [23] H. Cai, K. Ren, W. Zhang, K. Malialis, J. Wang, Y. Yu, and D. Guo, "Real-time bidding by reinforcement learning in display advertising," in WSDM 2017, Cambridge, United Kingdom, February 6-10, 2017, pp. 661–670.
 [24] J. Jin, C. Song, H. Li, K. Gai, J. Wang, and W. Zhang, "Real-time bidding with multi-agent reinforcement learning in display advertising," CoRR, vol. abs/1802.09756, 2018.
 [25] H. Cai, T. Chen, W. Zhang, Y. Yu, and J. Wang, "Efficient architecture search by network transformation," in AAAI, New Orleans, Louisiana, USA, February 2-7, 2018.
 [26] B. Zoph and Q. V. Le, "Neural architecture search with reinforcement learning," CoRR, vol. abs/1611.01578, 2016.
 [27] N. Taghipour and A. A. Kardan, "A hybrid web recommender system based on Q-learning," in Proceedings of the 2008 ACM Symposium on Applied Computing (SAC), Fortaleza, Ceara, Brazil, March 16-20, 2008, pp. 1164–1168.
 [28] X. Zhao, L. Zhang, Z. Ding, D. Yin, Y. Zhao, and J. Tang, "Deep reinforcement learning for list-wise recommendations," CoRR, vol. abs/1801.00209, 2018.
 [29] X. Zhao, L. Zhang, Z. Ding, L. Xia, J. Tang, and D. Yin, "Recommendations with negative feedback via pairwise deep reinforcement learning," CoRR, vol. abs/1802.06501, 2018.
 [30] L. Xia, J. Xu, Y. Lan, J. Guo, W. Zeng, and X. Cheng, "Adapting markov decision process for search result diversification," in SIGIR, Shinjuku, Tokyo, Japan, August 7-11, 2017, pp. 535–544.
 [31] Z. Wei, J. Xu, Y. Lan, J. Guo, and X. Cheng, "Reinforcement learning to rank with markov decision process," in SIGIR, Shinjuku, Tokyo, Japan, August 7-11, 2017, pp. 945–948.
 [32] Y. Hu, Q. Da, A. Zeng, Y. Yu, and Y. Xu, "Reinforcement learning to rank in e-commerce search engine: Formalization, analysis, and application," CoRR, vol. abs/1803.00710, 2018.
 [33] G. Dulac-Arnold, R. Evans, P. Sunehag, and B. Coppin, "Reinforcement learning in large discrete action spaces," CoRR, vol. abs/1512.07679, 2015.
 [34] D. Silver, G. Lever, N. Heess, T. Degris, D. Wierstra, and M. A. Riedmiller, "Deterministic policy gradient algorithms," in ICML 2014, Beijing, China, 21-26 June 2014, pp. 387–395.
 [35] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra, "Continuous control with deep reinforcement learning," CoRR, vol. abs/1509.02971, 2015.
 [36] T. Schaul, J. Quan, I. Antonoglou, and D. Silver, "Prioritized experience replay," arXiv preprint arXiv:1511.05952, 2015.
 [37] A. Mnih and R. R. Salakhutdinov, "Probabilistic matrix factorization," in NIPS, 2008, pp. 1257–1264.
 [38] Y. Koren, "Factorization meets the neighborhood: a multifaceted collaborative filtering model," in KDD. ACM, 2008, pp. 426–434.
 [39] L. Li, W. Chu, J. Langford, and R. E. Schapire, "A contextual-bandit approach to personalized news article recommendation," in WWW. ACM, 2010, pp. 661–670.
 [40] H. Wang, Q. Wu, and H. Wang, "Learning hidden features for contextual bandits," in CIKM. ACM, 2016, pp. 1633–1642.