Thanks to the growth of online services such as online shopping, online news and online social networks, it has become quite convenient to acquire items (goods, books, videos, news, etc.) via the Internet or mobile devices. Despite this convenience, the overwhelming number of items in these systems poses a significant challenge for users who want to find the items that match their interests. Recommendation is a widely used solution, and various families of techniques have been proposed, such as content-based collaborative filtering, matrix factorization based methods [2, 3, 4, 5], logistic regression, factorization machines and their variants [6, 7, 8], deep learning models [9, 10, 11, 12] and multi-armed bandits [13, 14, 15, 16, 17]. However, these studies suffer from two serious limitations.
Firstly, most of them consider the recommendation procedure as a static process, i.e., they assume that the user’s underlying preference remains unchanged. However, a user’s preference is commonly dynamic w.r.t. time, i.e., a user’s preference for previous items affects her choice of the next items. Hence, it is more reasonable to model recommendation as a sequential decision making process. We show some evidence observed in publicly available datasets (MovieLens and Yahoo! Music) to support this view. In the two datasets, the sequential behaviors of users are recorded, and we are interested in what happens if a user consecutively receives satisfying or unsatisfying recommendations. Though the datasets do not record any recommendation procedure, we can simulate one from the users’ ratings: consecutive “positive” (“negative”) ratings simulate a user consecutively receiving satisfying (unsatisfying) recommendations. As presented in Figure 1, a user tends to give a higher (lower) rating if she has consecutively received more satisfying (unsatisfying) items, as shown by the green (red) line, where the blue dotted line denotes the average rating for reference. This suggests that a user becomes more pleased (displeased) as she consecutively receives more satisfying (unsatisfying) recommendations, and therefore tends to give a higher (lower) rating to the current recommendation. Hence, the user’s dynamic preference suggests that recommendation should be modeled as a sequential decision making process.
Secondly, the aforementioned studies are trained by maximizing the immediate rewards of recommendations, which concentrates merely on whether the recommended items are clicked or consumed and ignores the long-term contributions that the items can make. However, items with small immediate rewards but large long-term benefits are also crucial. We take an example from news recommendation to explain this. When a user requests news to read, two pieces of news may lead to the same immediate reward, i.e., the user will click and read either piece with equal probability, where one is about a thunderstorm alert and the other is about the basketball player Kobe Bryant. In this example, after reading the news about the thunderstorm, the user is probably not willing to read more news on this issue; on the other hand, the user will possibly read more about the NBA or basketball after reading the news about Kobe. This suggests that recommending the news about Kobe will bring more long-term rewards. Hence, when recommending items to users, both the immediate and long-term rewards should be taken into consideration.
Recently, Reinforcement Learning (RL), which has shown great potential in various challenging scenarios that require both dynamic modeling and long-term planning, such as game playing [21, 22], real-time ads bidding [23, 24] and neural network structure searching [25, 26], has been introduced into recommender systems [18, 27, 28, 29, 19, 30, 31, 32, 33].
In the early stage, model-based RL techniques were proposed to model the recommendation procedure, such as POMDP and Q-learning. However, these methods are inapplicable to complicated recommendation scenarios in which the number of candidate items is large, because a time-consuming dynamic programming step is required to update the model. Later, model-free RL techniques were utilized in recommender systems, in both academia and industry. Such techniques can be divided into two categories: value-based [29, 19] and policy-based [28, 32, 33]. Value-based approaches compute the Q-values of all available actions for a given state, and the action with the maximum Q-value is selected as the best action. Because all actions must be evaluated, these approaches may become very inefficient if the action space is too large. Policy-based approaches instead generate a continuous parameter vector as the representation of an action [28, 32, 33], which can be utilized in generating the recommendation and updating the Q-value evaluator. Thanks to the continuous representations, the inefficiency can be overcome. However, these studies [28, 32, 33] still share one common limitation: the user state is learnt via a conventional fully connected neural network, which does not explicitly and carefully model the interactions between users and items.
In this paper, to overcome the limitations stated above, we propose a deep reinforcement learning based recommendation framework with explicit user-item interactions modeling (DRR). The “Actor-Critic” type framework DRR incorporates a state representation module, which explicitly models the complex dynamic user-item interactions to pursue better recommendation performance. Specifically, the embeddings of users and items from the historical interactions are fed into a carefully designed multi-layer network, which explicitly models the interactions between users and items, to produce a continuous state representation of the user in terms of her underlying sequential behaviors. This network is named the state representation module, and it plays two important roles in our framework. On the one hand, its output is used to generate a ranking action, from which the recommendation scores for ranking are calculated. On the other hand, the state representation together with the generated action is the input of the Critic network, which aims to estimate the Q-value, i.e., the quality of the action in the current state. Based on this evaluation, the Actor (policy) network can be updated. We note that both the Actor and Critic networks are carefully designed by modeling the interactions between users and items explicitly. Extensive experiments on four real-world datasets demonstrate that the proposed method yields performance superior to the state-of-the-art methods. The main contributions of this paper can be summarized as follows:
We propose a deep reinforcement learning based recommendation framework DRR. Unlike the conventional studies, DRR adopts an “Actor-Critic” structure and treats the recommendation as a sequential decision making process, which takes both the immediate and long-term rewards into consideration.
Under the DRR framework, three different network structures are proposed, which can explicitly model the interactions between users and items.
Extensive experiments are carried out on four real-world datasets, and the results demonstrate that the proposed methods indeed outperform the state-of-the-art competitors.
The rest of this paper is organized as follows. Related work and background are presented in Section II. The preliminary knowledge is presented in Section III. The proposed methods are introduced in Section IV. Experimental details and results are discussed in Section V. Finally, we conclude this paper and discuss some future work in Section VI.
II Related Work
II-A Non-RL based Recommendation Techniques
Various kinds of recommendation techniques have been proposed in the past few decades to improve the performance of recommender systems, including content-based filtering, matrix factorization based methods [2, 3, 4, 5], logistic regression, factorization machines and their variants [6, 7, 8], and, more recently, deep learning models [9, 10, 11, 12].
At the beginning of this century, content-based filtering was proposed to recommend items by considering the content similarity between items. Later, collaborative filtering (CF) was put forward and extensively studied. The rationale behind CF is that users with similar behaviors tend to prefer the same items, and items consumed by similar users tend to have the same rating. However, conventional CF based methods tend to suffer from data sparsity, because the similarity calculated from sparse data can be very unreliable. Matrix factorization (MF), as an advanced CF technique, plays an important role in recommender systems. MF models [2, 3, 4, 5] characterize both items and users by vectors in the same space, which are inferred from the observed user-item interactions. Regarding recommendation as a binary classification problem, logistic regression and its variants are also utilized in recommender systems. However, logistic regression based models generalize poorly to feature interactions that never or rarely appear in the training data. Factorization machines (FM) model pairwise feature interactions as the inner product of the latent vectors of the features and show promising results. As an extension to FM, Field-aware FM (FFM) enables each feature to have multiple latent vectors to interact with different fields. Recently, deep learning models [9, 10, 11, 12] have been applied to model the complicated feature interactions for recommendation.
Li et al. apply Thompson Sampling (TS) and Upper Confidence Bound (UCB), respectively, to balance the trade-off between exploration and exploitation. A dynamic context drift model has also been proposed to address the time-varying problem. To integrate the latent vectors of items and users with some exploration, the authors of [15, 17] combine matrix factorization with multi-armed bandits.
However, all these methods suffer from two limitations. First, they consider the recommendation procedure as a static process, i.e., they assume that the user’s underlying preference remains static and aim to learn this preference as precisely as possible. Second, they are trained to maximize the immediate rewards of recommendations, but ignore the long-term benefits that the recommendations can bring.
II-B RL based Recommendation Techniques
As model-based RL techniques [18, 27] are inapplicable in recommendation scenarios due to their high time complexity, most researchers have turned to model-free RL techniques. The model-free RL techniques can be divided into two categories: policy-based and value-based.
Policy-based approaches [28, 32, 33] aim to generate a policy whose input is a state and whose output is an action. These works apply deterministic policies, which generate an action directly. Dulac-Arnold et al. resolve the large action space problem by modeling the state in a continuous item embedding space and selecting the items via a neighborhood method. However, as the underlying algorithm is essentially a continuous-action algorithm, its performance may be hurt by the gap between the continuous and discrete action spaces. In [28, 32], the policy network outputs a continuous action representation, and the recommendation is generated by ranking the items by their scores, which are computed by a pre-defined function that takes the action representation and the item embeddings as input. However, one common limitation of these studies is that they do not carefully learn the state representation.
In value-based approaches [29, 19], the action with the maximum Q-value over all the possible actions is selected as the best action. Zhao et al. take both the user’s positive feedback and negative feedback into consideration when modeling the user state. A dueling Q-network has also been utilized to model the Q-value of a state-action pair, together with a minor update with exploration by dueling bandit gradient descent. However, such value-based approaches need to evaluate the Q-values of all the actions under a specific state, which is very inefficient when the number of actions is large.
To make RL based recommendation techniques suitable for large-scale scenarios, in this paper we propose the DRR framework, which carefully and explicitly models the interactions between users and items to learn the state representation.
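The essential underlying model of reinforcement learning is the Markov Decision Process (MDP). An MDP is defined as a tuple $(\mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R}, \gamma)$, where $\mathcal{S}$ is the state space, $\mathcal{A}$ is the action space, $\mathcal{P}$ is the state transition function, $\mathcal{R}$ is the reward function, and $\gamma$ is the discount rate. The objective of an agent in an MDP is to find an optimal policy $\pi$ ($\pi: \mathcal{S} \to \mathcal{A}$) which maximizes the expected cumulative rewards from any state $s \in \mathcal{S}$, i.e., $V^{*}(s) = \max_{\pi} \mathbb{E}_{\pi}\{\sum_{k=0}^{\infty} \gamma^{k} r_{t+k} \mid s_t = s\}$, or equivalently maximizes the expected cumulative rewards from any state-action pair $s \in \mathcal{S}, a \in \mathcal{A}$, i.e., $Q^{*}(s, a) = \max_{\pi} \mathbb{E}_{\pi}\{\sum_{k=0}^{\infty} \gamma^{k} r_{t+k} \mid s_t = s, a_t = a\}$. Here $\mathbb{E}_{\pi}$ is the expectation under policy $\pi$, $t$ is the current timestep and $r_{t+k}$ is the immediate reward at a future timestep $t+k$.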
We model the recommendation procedure as a sequential decision making problem, in which the recommender (i.e., agent) interacts with users (i.e., environment) to suggest a list of items sequentially over the timesteps, by maximizing the cumulative rewards of the whole recommendation procedure. More specifically, the recommendation procedure is modeled by an MDP, as follows.
States $\mathcal{S}$. A state $s$ is the representation of the user’s positive interaction history with the recommender, as well as her demographic information (if it exists in the datasets).
Actions $\mathcal{A}$. An action $a$ is a continuous parameter vector with the same dimensionality as the item embeddings. Each item has a ranking score, which is defined as the inner product of the action and the item embedding (the item embedding can be generated by MF or VAE). The top-ranked items are then recommended.
Transitions $\mathcal{P}$. The state is modeled as the representation of the user’s positive interaction history. Hence, once the user’s feedback is collected, the state transition is determined.
Reward $\mathcal{R}$. Given the recommendation based on the action $a$ and the user state $s$, the user provides her feedback, i.e., click, not click, rating, etc. The recommender receives an immediate reward $R(s, a)$ according to the user’s feedback.
Discount rate $\gamma$. $\gamma \in [0, 1]$ is a factor measuring the present value of long-term rewards. In the case of $\gamma = 0$, the recommender considers only immediate rewards and ignores long-term rewards. On the other hand, when $\gamma = 1$, the recommender treats immediate rewards and long-term rewards as equally important.
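The role of the discount rate can be illustrated with a small numeric sketch. The function name `discounted_return` and the example reward sequence are assumptions made for illustration only; they do not come from the paper.

```python
def discounted_return(rewards, gamma):
    """Return sum_k gamma**k * rewards[k] for a finite reward sequence."""
    total = 0.0
    # Accumulate backwards so each step applies one multiplication by gamma.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


# Hypothetical rewards observed at timesteps t, t+1, t+2.
rewards = [1.0, 0.0, 1.0]

immediate_only = discounted_return(rewards, gamma=0.0)   # only r_t counts
equal_weight = discounted_return(rewards, gamma=1.0)     # plain sum of rewards
in_between = discounted_return(rewards, gamma=0.9)       # later rewards are down-weighted
print(immediate_only, equal_weight, in_between)
```

With a discount rate of 0 the return reduces to the immediate reward, and with a discount rate of 1 it is the undiscounted sum, which matches the two extreme cases described above.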
Figure 2 illustrates the recommender-user interactions in the MDP formulation. Considering the current user state and the immediate reward for the previous action, the recommender takes an action. Note that in our model, an action corresponds neither to recommending an item nor to recommending a list of items. Instead, an action is a continuous parameter vector. Once such an action is taken, the parameter vector is used to determine the ranking scores of all the candidate items by taking the inner product with the item embeddings. All the candidate items are ranked according to the computed scores, and the Top-N items are recommended to the user. Given the recommendation, the user provides her feedback to the recommender, and the user state is updated accordingly. The recommender receives rewards according to the user’s feedback. Without loss of generality, a recommendation procedure is a $T$-timestep trajectory $(s_0, a_0, r_0, s_1, a_1, r_1, \ldots, s_{T-1}, a_{T-1}, r_{T-1}, s_T)$. (If a recommendation episode terminates in fewer than $T$ timesteps, then the length of the episode is the actual value.)
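The scoring and Top-N step described above can be sketched as follows. This is a minimal illustration, not the authors’ implementation: the function name `rank_items`, the item identifiers and the two-dimensional embeddings are all hypothetical.

```python
def rank_items(action, item_embeddings, top_n):
    """Rank candidate items by the inner product of the action and each embedding."""
    scores = {
        item_id: sum(a * e for a, e in zip(action, embedding))
        for item_id, embedding in item_embeddings.items()
    }
    # Highest score first; keep only the Top-N items.
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:top_n], scores


# Hypothetical action vector and candidate item embeddings.
action = [0.5, -1.0]
items = {"item_a": [1.0, 0.0], "item_b": [0.0, 1.0], "item_c": [1.0, -1.0]}

top_items, scores = rank_items(action, items, top_n=2)
print(top_items, scores)
```

Because the action lives in the same space as the item embeddings, the cost of producing an action does not grow with the number of candidate items; only the final scoring pass does.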
IV The Proposed DRR Framework
As mentioned in Section I, conventional recommendation techniques lack sequential modeling, ignore the long-term rewards, or both. To address these drawbacks, we propose a deep reinforcement learning based recommendation framework (DRR) based on the Actor-Critic learning scheme. Also, different from some recent RL studies, we carefully and explicitly build a state representation module to model the interactions between the users and items. Next, we first elaborate on the Actor network, the Critic network and the state representation module, which are the three key ingredients of our framework; then the training and evaluation procedures are presented to show how to learn and use the DRR framework.
IV-A Three Key Ingredients in DRR
IV-A1 The Actor network
The Actor network, also called the policy network, is depicted in the left part of Figure 3. For a given user, the network is responsible for generating an action $a$ based on her state $s$. Let us explain the network from the input to the output. In DRR, the user state, denoted by the embeddings of her $n$ latest positively interacted items, is the input. The embeddings are then fed into a state representation module (introduced in detail later) to produce a summarized representation $s$ for the user. For instance, at timestep $t$, the state can be defined as in Eq. (1):

$$s_t = f(H_t), \tag{1}$$

where $f(\cdot)$ stands for the state representation module, $H_t = \{i_1, \ldots, i_n\}$ denotes the embeddings of the latest positive interaction history, and each item embedding is a $k$-dimensional vector. When the recommender agent recommends an item $i_t$, if the user provides positive feedback, then in the next timestep the state is updated to $s_{t+1} = f(H_{t+1})$, where $H_{t+1} = \{i_2, \ldots, i_n, i_t\}$; otherwise, $H_{t+1} = H_t$. The reasons for defining the state in such a manner are twofold: (i) a superior recommender system should cater to the users’ taste, i.e., what items the users like; (ii) the latest records represent the users’ recent interests more precisely.
Finally, by two ReLU layers and one Tanh layer, the state representation $s$ is transformed into an action $a = \pi_\theta(s)$ as the output of the Actor network. In particular, the action $a$ is defined as a ranking function represented by a continuous parameter vector. Using the action, the ranking score of the item $i_t$ is defined as:

$$\mathrm{score}_t = i_t \, a^{\top}. \tag{2}$$

Then the top-ranked item (w.r.t. the ranking scores) is recommended to the user. Note that the widely used $\epsilon$-greedy exploration technique is adopted here.
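As a rough illustration of the Actor described above, the following sketch runs a forward pass through two ReLU layers and one Tanh layer and then picks an item with epsilon-greedy exploration. The layer widths, the weight initialization, the random seed and all function names are assumptions for illustration; the paper does not specify them in this section, and applying epsilon-greedy at the item-selection step is one possible reading of the text.

```python
import math
import random


def dense(x, weights, bias):
    """Affine layer: `weights` holds one row of input weights per output unit."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def init_layer(rng, n_in, n_out):
    """Small random weights and zero biases (an arbitrary initialization)."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def actor_forward(state, layers):
    """Two ReLU layers followed by one Tanh layer."""
    (w1, b1), (w2, b2), (w3, b3) = layers
    hidden = [max(0.0, v) for v in dense(state, w1, b1)]
    hidden = [max(0.0, v) for v in dense(hidden, w2, b2)]
    return [math.tanh(v) for v in dense(hidden, w3, b3)]


def select_item(action, candidates, epsilon, rng):
    """Epsilon-greedy: a random candidate with probability epsilon, else the top score."""
    item_ids = sorted(candidates)
    if rng.random() < epsilon:
        return rng.choice(item_ids)
    return max(item_ids, key=lambda i: sum(a * e for a, e in zip(action, candidates[i])))


rng = random.Random(0)
state_dim, hidden_dim, k = 6, 8, 2          # hypothetical sizes
layers = [
    init_layer(rng, state_dim, hidden_dim),
    init_layer(rng, hidden_dim, hidden_dim),
    init_layer(rng, hidden_dim, k),
]
state = [0.2, -0.1, 0.4, 0.0, 0.3, -0.5]    # stand-in for a state representation
action = actor_forward(state, layers)
candidates = {"item_a": [1.0, 0.0], "item_b": [0.0, 1.0]}
chosen = select_item(action, candidates, epsilon=0.1, rng=rng)
print(action, chosen)
```

The Tanh output layer keeps every component of the action in [-1, 1], so the ranking scores are bounded by the norms of the item embeddings.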
IV-A2 The Critic network
The Critic part in DRR, shown in the middle part of Figure 3, is a Deep Q-Network, which leverages a deep neural network parameterized as $Q_\omega(s, a)$ to approximate the true state-action value function $Q^{\pi}(s, a)$, namely, the Q-value function. The Q-value function reflects the merits of the action policy generated by the Actor network. Specifically, the input of the Critic network is the user state $s$ generated by the user state representation module and the action $a$ generated by the policy network, and the output is the Q-value, which is a scalar. According to the Q-value, the parameters of the Actor network are updated in the direction of improving the performance of action $a$, i.e., boosting $Q_\omega(s, a)$. Based on the deterministic policy gradient theorem, we can update the Actor by the sampled policy gradient shown in Eq. (3):

$$\nabla_\theta J(\pi_\theta) \approx \frac{1}{N} \sum_{t} \nabla_a Q_\omega(s, a)\big|_{s = s_t,\, a = \pi_\theta(s_t)} \, \nabla_\theta \pi_\theta(s)\big|_{s = s_t}, \tag{3}$$

where $J(\pi_\theta)$ is the expectation of all possible Q-values that follow the policy $\pi_\theta$. Here the mini-batch strategy is utilized and $N$ denotes the batch size. Moreover, the Critic network is updated by the temporal-difference learning approach, i.e., by minimizing the mean squared error shown in Eq. (4):

$$L = \frac{1}{N} \sum_{i} \big(y_i - Q_\omega(s_i, a_i)\big)^2, \tag{4}$$

where $y_i = r_i + \gamma \, Q_{\omega'}\big(s_{i+1}, \pi_{\theta'}(s_{i+1})\big)$. The target network technique is also adopted in the DRR framework, where $\omega'$ and $\theta'$ are the parameters of the target Critic and Actor networks, respectively.
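The temporal-difference update of the Critic can be made concrete with a small numeric sketch. It assumes the standard DDPG form of the target, i.e., the reward plus the discounted Q-value of the target Critic evaluated at the target Actor’s action; the function names and the numbers are hypothetical.

```python
def td_targets(rewards, next_q_values, gamma):
    """Targets y_i = r_i + gamma * Q'(s_{i+1}, pi'(s_{i+1})).

    `next_q_values` are assumed to come from the target Critic evaluated at
    the action proposed by the target Actor for each next state.
    """
    return [r + gamma * q_next for r, q_next in zip(rewards, next_q_values)]


def critic_loss(targets, q_values):
    """Mean squared error between the targets and the current Critic outputs."""
    return sum((y - q) ** 2 for y, q in zip(targets, q_values)) / len(targets)


# A hypothetical mini-batch of two transitions.
rewards = [1.0, -0.5]
next_q_values = [2.0, 0.0]
current_q_values = [2.0, 0.5]

targets = td_targets(rewards, next_q_values, gamma=0.9)
loss = critic_loss(targets, current_q_values)
print(targets, loss)
```

Using separate target networks for the bootstrap term keeps the regression target from moving at every gradient step, which is the usual motivation for this technique.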
IV-A3 The State Representation Module
As noted above, the state representation module plays an important role in both the Actor network and the Critic network. Hence, it is crucial to design a good structure to model the state. It has been shown in [10, 11] that modeling the feature interactions explicitly can boost the performance of a recommendation system. Inspired by these studies, we design the state representation module by explicitly modeling the interactions between the users and items. Specifically, we develop three structures, which are elaborated next.
DRR-p. Inspired by [10, 11], we propose a product based neural network for the state representation module, which is depicted in Figure 4 (the legend in Figures 4, 5 and 6 is the same as in Figure 3). The structure is named DRR-p, and it utilizes a product operator to capture the pairwise local dependency between items. We can see that the structure clones the representations of the items from the positive interaction history $H = \{i_1, \ldots, i_n\}$. In addition, it computes the pairwise interactions between the items by using the element-wise product operator. As a result, new feature vectors are yielded, which are concatenated with the cloned vectors to form the state representation. We note that in the element-wise product part, a weight is also learned for each item to indicate its importance. Hence, in DRR-p the state representation module can be formally stated as follows:

$$s_t = \big[\, i_1, \ldots, i_n,\ \{\, w_a i_a \otimes w_b i_b \mid 1 \le a < b \le n \,\} \,\big], \tag{5}$$

where $\otimes$ denotes the element-wise product, $w_a$ is a scalar indicating the importance of item $i_a$, and $w_a i_a \otimes w_b i_b$ is a $k$-dimensional vector which models the interactions between items $i_a$ and $i_b$. With $n$ items of embedding dimensionality $k$, the dimensionality of $s_t$ is $k \, (n + n(n-1)/2)$.
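The DRR-p construction can be sketched in a few lines. This is an illustrative reading of the description above, assuming that each unordered pair of distinct items contributes one product vector; the function name `drr_p_state`, the embeddings and the weights are hypothetical, and in the actual model the weights are learned.

```python
from itertools import combinations


def drr_p_state(item_embeddings, weights):
    """Concatenate cloned item vectors with weighted pairwise element-wise products."""
    weighted = [[w * v for v in emb] for emb, w in zip(item_embeddings, weights)]
    # Part 1: the cloned item representations.
    state = [v for emb in item_embeddings for v in emb]
    # Part 2: one product vector per unordered pair of distinct items (assumption).
    for a, b in combinations(range(len(item_embeddings)), 2):
        state.extend(x * y for x, y in zip(weighted[a], weighted[b]))
    return state


# Hypothetical history of n = 3 items with k = 2 dimensional embeddings.
items = [[1.0, 2.0], [0.5, -1.0], [2.0, 0.0]]
weights = [1.0, 2.0, 0.5]

state_p = drr_p_state(items, weights)
print(len(state_p), state_p)
```

Under this reading, the output has k * (n + n(n-1)/2) entries, so its size grows quadratically with the length of the history.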
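DRR-u. Though DRR-p can model the pairwise local dependency between items, the user-item interactions are neglected. To remedy this, we design another structure, shown in Figure 5, which is referred to as DRR-u. In DRR-u, the user embedding $u$ is also incorporated. In addition to the local dependency between items, the pairwise user-item interactions are also taken into account. Formally, the state representation module can be expressed as:

$$s_t = \big[\, \{\, u \otimes w_a i_a \mid a = 1, \ldots, n \,\},\ \{\, w_a i_a \otimes w_b i_b \mid 1 \le a < b \le n \,\} \,\big], \tag{6}$$

where $\otimes$ denotes the element-wise product and $w_a$ is the scalar importance weight of item $i_a$. With $n$ items of embedding dimensionality $k$, the dimensionality of $s_t$ is also $k \, (n + n(n-1)/2)$.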
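A corresponding sketch for DRR-u is given below. It assumes that the cloned item vectors of DRR-p are replaced by user-item element-wise products, which is consistent with the statement that the dimensionality is unchanged; the function name and all numbers are hypothetical.

```python
from itertools import combinations


def drr_u_state(user_embedding, item_embeddings, weights):
    """Concatenate user-item products with weighted pairwise item-item products."""
    weighted = [[w * v for v in emb] for emb, w in zip(item_embeddings, weights)]
    state = []
    # Part 1: one user-item interaction vector per history item.
    for emb in weighted:
        state.extend(u * v for u, v in zip(user_embedding, emb))
    # Part 2: one item-item interaction vector per unordered pair (assumption).
    for a, b in combinations(range(len(weighted)), 2):
        state.extend(x * y for x, y in zip(weighted[a], weighted[b]))
    return state


user = [2.0, 1.0]                                # hypothetical user embedding
items = [[1.0, 2.0], [0.5, -1.0], [2.0, 0.0]]    # n = 3, k = 2
weights = [1.0, 2.0, 0.5]

state_u = drr_u_state(user, items, weights)
print(len(state_u), state_u)
```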
DRR-ave. In the DRR-p and DRR-u structures, the interactions between users and items can be exploited and modeled. For these two structures, it is not difficult to see that the positions of items in the positive interaction history $H$ matter, e.g., two histories that contain the same items in different orders yield different state representations. When the number of history items $n$ is large, we expect the positions of items to matter, because $H$ denotes a long-term sequence; whereas memorizing the positions of items may lead to overfitting if the sequence is a short-term one. Hence, we design another structure that eliminates the position effects, which is depicted in Figure 6. As an average pooling layer is adopted, we call the structure DRR-ave. We can see from Figure 6 that the embeddings of the items in $H$ are first transformed by a weighted average pooling layer. Then, the resulting vector is leveraged to model the interactions with the input user. Finally, the embedding of the user, the interaction vector, and the average pooling result of the items are concatenated into a vector that serves as the state representation. Formally, the DRR-ave structure can be expressed as:

$$s_t = \big[\, u,\ u \otimes g(H),\ g(H) \,\big], \tag{7}$$

Here $u$ is the user embedding, $\otimes$ denotes the element-wise product, and $g(\cdot)$ indicates the weighted average pooling layer. With embedding dimensionality $k$, the dimensionality of $s_t$ in DRR-ave is $3k$.
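The DRR-ave structure can be sketched as below. The normalization of the weighted average by the sum of the weights is an assumption, as are the function names and the numbers; in the actual model the pooling weights are learned.

```python
def weighted_average_pool(item_embeddings, weights):
    """Weighted average over items, normalized by the weight sum (assumption)."""
    total = sum(weights)
    k = len(item_embeddings[0])
    return [
        sum(w * emb[d] for emb, w in zip(item_embeddings, weights)) / total
        for d in range(k)
    ]


def drr_ave_state(user_embedding, item_embeddings, weights):
    """Concatenate the user, the user-pooled interaction, and the pooled items."""
    pooled = weighted_average_pool(item_embeddings, weights)
    interaction = [u * p for u, p in zip(user_embedding, pooled)]
    return list(user_embedding) + interaction + pooled


user = [1.0, 2.0]                      # hypothetical user embedding, k = 2
items = [[1.0, 2.0], [3.0, 4.0]]       # hypothetical history of n = 2 items
weights = [1.0, 1.0]

state_ave = drr_ave_state(user, items, weights)
print(state_ave)
```

The output length is 3k regardless of the history length n. If a single set of weights is shared across positions, reordering the history does not change the state, which is one way to read the claim that position effects are eliminated.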
IV-B Training Procedure of the DRR Framework
Next, we introduce how to train the DRR framework. We first present the overall idea and then discuss the detailed algorithm. As mentioned above, DRR utilizes the users’ interaction history with the recommender agent as training data. During the procedure, the recommender takes an action $a_t$ following the current recommendation policy after observing the user (environment) state $s_t$, then obtains the feedback (reward) $r_t$ from the user, and the user state is updated to $s_{t+1}$. According to the feedback, the recommender updates its recommendation policy. In this work, we utilize the deep deterministic policy gradient (DDPG) algorithm to train the proposed DRR framework, as detailed in Algorithm 1.
Specifically, in timestep $t$, the training procedure mainly includes two stages, i.e., transition generation (lines 7-12) and model updating (lines 13-17). In the first stage, the recommender observes the current state $s_t$, which is calculated by the proposed state representation module, then generates an action $a_t$ according to the current policy with $\epsilon$-greedy exploration, and recommends an item according to the action by Eq. (2) (lines 8-9). Subsequently, the reward $r_t$ is calculated based on the user’s feedback on the recommended item, and the user state is updated (lines 10-11). Finally, the recommender agent stores the transition $(s_t, a_t, r_t, s_{t+1})$ in the replay buffer (line 12).
In the second stage, model updating, the recommender samples a minibatch of transitions with the widely used prioritized experience replay sampling technique (line 13), which is essentially an importance sampling strategy. Then, the recommender updates the parameters of the Actor network and the Critic network according to Eq. (3) and Eq. (4), respectively (lines 14-16). Finally, the recommender updates the target networks’ parameters with the soft replace strategy.
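The bookkeeping in the two stages above can be sketched as follows. This is a simplified stand-in, not Algorithm 1 itself: sampling is proportional to a stored priority but omits the importance-sampling correction of full prioritized experience replay, and the class name, the `tau` mixing coefficient and all values are assumptions.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-size buffer with priority-proportional sampling (simplified)."""

    def __init__(self, capacity, seed=0):
        self.transitions = deque(maxlen=capacity)
        self.priorities = deque(maxlen=capacity)
        self.rng = random.Random(seed)

    def store(self, state, action, reward, next_state, priority=1.0):
        # Oldest entries are dropped automatically once capacity is reached.
        self.transitions.append((state, action, reward, next_state))
        self.priorities.append(priority)

    def sample(self, batch_size):
        # Transitions with larger priority are drawn more often (with replacement).
        return self.rng.choices(
            list(self.transitions), weights=list(self.priorities), k=batch_size
        )


def soft_update(target_params, online_params, tau):
    """Soft replace: target <- tau * online + (1 - tau) * target."""
    return [tau * o + (1.0 - tau) * t for t, o in zip(target_params, online_params)]


buffer = ReplayBuffer(capacity=2)
buffer.store([0.0], [0.1], 1.0, [0.5])
buffer.store([0.5], [0.2], -1.0, [0.5], priority=3.0)
buffer.store([0.5], [0.3], 0.5, [0.9])           # evicts the first transition
batch = buffer.sample(batch_size=4)

new_target = soft_update([0.0, 0.0], [1.0, 2.0], tau=0.1)
print(len(batch), new_target)
```

A small `tau` makes the target networks trail the online networks slowly, which keeps the temporal-difference targets stable between updates.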
In this subsection, we discuss how to evaluate the models, including evaluation with an environment simulator. The most straightforward way to evaluate RL based models is to conduct online experiments on recommender systems, where the recommender directly interacts with users. However, the underlying commercial risk and the costly deployment on the platform make this impractical. Therefore, in the testing phase, we evaluate the proposed models on public offline datasets and propose two ways to evaluate the models: the offline evaluation and the online evaluation.
IV-C1 Offline evaluation
Intuitively, the offline evaluation of the trained models tests the recommendation performance of the learned policy, as described in Algorithm 2. Specifically, for a given session, the recommender only recommends the items that appear in this session (the candidate set), rather than those in the whole item space. The reason is that we only have the ground truth feedback for the items in the session in the recorded offline log. At each timestep, the recommender agent takes an action according to the learned policy and recommends an item based on the action by Eq. (2) (lines 4-5). After that, the recommender observes the reward according to the feedback on the recommended item by Eq. (10) (lines 5-6). Then the user state is updated and the recommended item is removed from the candidate set (lines 7-8). The offline evaluation procedure can be treated as a reranking of the candidate set by iteratively selecting an item w.r.t. the action generated by the Actor network in the DRR framework. Moreover, the model parameters are not updated in the offline evaluation.
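The reranking view of the offline evaluation can be sketched as a short loop. This is an illustrative sketch rather than Algorithm 2: the `policy` callable, the toy state update (replace the state by the embedding of a positively rated item) and the rule that a reward above zero counts as positive feedback are all assumptions.

```python
def dot(x, y):
    return sum(a * b for a, b in zip(x, y))


def offline_evaluate(policy, state, session_items, logged_rewards, steps):
    """Iteratively rerank the session's candidate set with a fixed policy."""
    candidates = dict(session_items)            # only items seen in this session
    recommended, rewards = [], []
    for _ in range(min(steps, len(candidates))):
        action = policy(state)                  # policy parameters stay fixed
        item = max(candidates, key=lambda i: dot(action, candidates[i]))
        reward = logged_rewards[item]           # ground truth from the offline log
        if reward > 0:                          # assumed positive-feedback rule
            state = candidates[item]            # toy state update
        del candidates[item]                    # never recommend an item twice
        recommended.append(item)
        rewards.append(reward)
    return recommended, rewards


# Hypothetical session with three logged items.
session_items = {"a": [2.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
logged_rewards = {"a": 1.0, "b": -0.5, "c": 0.5}

order, rewards = offline_evaluate(
    policy=lambda s: s,                         # identity policy, for illustration only
    state=[1.0, 0.2],
    session_items=session_items,
    logged_rewards=logged_rewards,
    steps=3,
)
print(order, rewards)
```

The returned order can then be scored with ranking metrics, since every recommended item has a logged ground-truth rating.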
IV-C2 Online evaluation with environment simulator
As mentioned above, it is risky and costly to directly deploy RL based models on recommender systems. Therefore, we conduct the online evaluation with an environment simulator. In this paper, we pretrain a PMF model as the environment simulator, i.e., to predict the feedback for an item that the user has never rated before. The online evaluation procedure follows Algorithm 1, i.e., the parameters are continuously updated during the online evaluation stage. Its major difference from Algorithm 1 is that the feedback on a recommended item is provided by the environment simulator. Moreover, for a fair comparison, before each recommendation session starts in the simulated online evaluation, we reset the Actor and Critic parameters to the values learned in the training stage.
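A minimal sketch of such a simulator is shown below. It assumes a PMF-style prediction given by the inner product of pretrained user and item latent vectors, clipped to the rating scale; the function names, the clipping bounds and the latent vectors are hypothetical, and a logged rating is preferred when one exists.

```python
def predict_rating(user_vec, item_vec, low, high):
    """PMF-style prediction: inner product of latent vectors, clipped to the scale."""
    score = sum(u * v for u, v in zip(user_vec, item_vec))
    return max(low, min(high, score))


def simulated_feedback(item_id, user_vec, item_vecs, logged_ratings, low=1.0, high=5.0):
    """Use the logged rating when available, otherwise the simulator's prediction."""
    if item_id in logged_ratings:
        return logged_ratings[item_id]
    return predict_rating(user_vec, item_vecs[item_id], low, high)


# Hypothetical pretrained latent vectors and one logged rating.
user_vec = [1.0, 2.0]
item_vecs = {"a": [0.5, 0.5], "b": [1.5, 1.0], "c": [5.0, 5.0]}
logged_ratings = {"a": 4.0}

feedback_a = simulated_feedback("a", user_vec, item_vecs, logged_ratings)
feedback_b = simulated_feedback("b", user_vec, item_vecs, logged_ratings)
feedback_c = simulated_feedback("c", user_vec, item_vecs, logged_ratings)
print(feedback_a, feedback_b, feedback_c)
```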
V-A Datasets and Evaluation Metrics
We adopt the following publicly available datasets from the real world to conduct the experiments:
MovieLens (100k) (https://grouplens.org/datasets/movielens/100k/). A benchmark dataset comprising 0.1 million ratings from users on the movies recommended on the MovieLens website.
Yahoo! Music (R3) (https://webscope.sandbox.yahoo.com/). This dataset contains over 0.36 million ratings of songs collected from two different sources. The first source consists of ratings provided by users during normal interactions with Yahoo! Music services. The second source consists of ratings of randomly selected songs collected during an online survey by Yahoo! Research. We normalize the ratings to discrete values from 1 to 5.
MovieLens (1M) (https://grouplens.org/datasets/movielens/1m/). A benchmark dataset comprising 1 million ratings from the MovieLens website.
Jester (2) (http://eigentaste.berkeley.edu/dataset/). This dataset contains over 1.7 million real-valued ratings (-10.0 to +10.0) of jokes in an online joke recommender system.
Note that, except for Jester, the ratings in the datasets are discrete values from 1 to 5. The statistics of the datasets are given in Table I. MovieLens (100k) and MovieLens (1M) are abbreviated as ML (100k) and ML (1M), respectively.
TABLE I: Statistics of the datasets (columns: ML (100k), Yahoo! Music, ML (1M), Jester).
We conduct both offline and simulated online evaluation on these four datasets. For the offline evaluation, we utilize Precision@k and NDCG@k as the metrics to measure the performance of the proposed models. For the simulated online evaluation, we leverage the total accumulated rewards as the metric.
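For reference, the two offline metrics can be computed as in the sketch below. It uses the common definitions with a logarithmic position discount; the exact variants used in the experiments (for example, the gain function in NDCG) are not stated in this section, so the details here are assumptions, as are the example items.

```python
import math


def precision_at_k(recommended, relevant, k):
    """Fraction of the top-k recommended items that are relevant."""
    return sum(1 for item in recommended[:k] if item in relevant) / k


def ndcg_at_k(recommended, relevance, k):
    """NDCG@k with linear gains; items missing from `relevance` have gain 0."""
    gains = [relevance.get(item, 0.0) for item in recommended[:k]]
    dcg = sum(g / math.log2(rank + 2) for rank, g in enumerate(gains))
    ideal = sorted(relevance.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(rank + 2) for rank, g in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0


# Hypothetical ranking and ground truth.
recommended = ["a", "b", "c"]
relevance = {"a": 1.0, "c": 1.0}

p_at_2 = precision_at_k(recommended, set(relevance), k=2)
ndcg_at_3 = ndcg_at_k(recommended, relevance, k=3)
ndcg_ideal = ndcg_at_k(["a", "c", "b"], relevance, k=3)
print(p_at_2, ndcg_at_3, ndcg_ideal)
```

Precision@k ignores the order within the top k, whereas NDCG@k rewards placing relevant items earlier, which is why both are usually reported together.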
V-B Compared Methods
We compare the proposed methods with several representative baseline methods. For the offline evaluation, we compare against conventional methods, including Popularity, PMF and SVD++, and an RL based method, DRR-n. The online evaluation baselines comprise the state-of-the-art multi-armed bandit methods LinUCB and HLinUCB, as well as DRR-n.
Popularity recommends the most popular item, i.e., the item with the highest average rating or the item with the largest number of positive ratings, from the currently available items to the user at each timestep. (To get a better result for popularity based recommendation, we test both strategies and report the better one.)
PMF performs a matrix decomposition like SVD, but takes only the non-zero elements into account.
SVD++ combines the strengths of the latent factor model and the neighborhood model.
LinUCB selects an arm (item) according to the estimated upper confidence bound of the potential reward.
HLinUCB further learns hidden features for each arm to model the potential reward.
DRR-n simply utilizes the concatenation of the item embeddings to represent the user state, which is widely used in previous studies. Although it falls under the DRR framework, we treat this method as a baseline to assess the effectiveness of our proposed state representation module.
V-C Experimental Settings
We randomly divide each dataset into two parts: 80% of the ratings are used for training, while the other 20% are used for evaluation. Moreover, for MovieLens (100k), Yahoo! Music and MovieLens (1M), the positive ratings are and , while for Jester, the positive ones are those higher than . The number of latest positively rated items , which is empirically set to . We perform PMF to pretrain the -dimensional embeddings of the users and items. Moreover, in each episode, we do not recommend repeated items, i.e., we remove the items already recommended from the candidate set. The discount rate $\gamma$ is 0.9. We utilize the Adam optimizer for all the RL based methods, with -norm regularization to prevent overfitting. As for the reward function, we empirically normalize the ratings into the range [-1, 1] and use the normalized values as the feedback of the corresponding recommendations. For instance, in timestep $t$, the recommender agent recommends an item to the user (denoted as action $a$ in state $s$), and the rating comes from the interaction logs if the user actually rated the item, or from a value predicted by the simulator otherwise. Therefore, the reward function can be defined as follows:
where the first setting is for MovieLens (100k), Yahoo! Music and MovieLens (1M), and the second one is for Jester. All the baseline methods are carefully tuned for a fair comparison. We model the recommendation procedure as an interaction episode of length $T$, and the hyper-parameter $T$ is tuned for different datasets (detailed in Section V.E).
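One normalization consistent with the description above is a linear map from the rating scale to [-1, 1]. The sketch below is an assumption about the form of the reward, not a reproduction of the paper’s reward equation; the function name `normalize_rating` is hypothetical, and the scales are the 1 to 5 and -10 to +10 ranges stated for the datasets.

```python
def normalize_rating(rating, low, high):
    """Linearly map a rating in [low, high] to the range [-1, 1]."""
    return 2.0 * (rating - low) / (high - low) - 1.0


# Ratings on the 1-5 scale (MovieLens, Yahoo! Music).
five_star_rewards = [normalize_rating(r, 1.0, 5.0) for r in (1, 2, 3, 4, 5)]

# Ratings on the -10 to +10 scale (Jester).
jester_rewards = [normalize_rating(r, -10.0, 10.0) for r in (-10.0, 0.0, 5.0, 10.0)]

print(five_star_rewards, jester_rewards)
```

Under this map a neutral rating yields a reward of zero, so disliked items contribute negative reward to the accumulated return.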
V-D Results and Analysis
V-D1 Offline Evaluation Results and Analysis
The offline evaluation results are summarized from Table II to Table V respectively, where the best results are marked in bold type. In the offline evaluation, we compare the proposed methods to some representative offline learning methods. The results suggest that the proposed methods under the DRR framework outperform the baselines on most of datasets, which demonstrates the effectiveness of our proposed methods.
Specifically, as mentioned above, we propose three different network structures in the state representation module, namely DRR-p, DRR-u and DRR-ave, to model the explicit interactions between users and items under the DRR framework. From the results in Tables II to V, we find that all three methods outperform the baselines in most cases. Moreover, DRR-n, which simply concatenates the item embeddings to represent the state , performs worse than the proposed DRR-p, DRR-u and DRR-ave. From these observations, we draw two conclusions: (i) the proposed methods indeed have the capability of long-term scheduling and dynamic adaptation, which are ignored by conventional methods; (ii) the proposed state representation module captures the dynamic interactions between users and items well, and the state should not simply be formed by concatenating the item embeddings and feeding them to fully connected layers, as DRR-n does, which may result in information loss.
Comparing DRR-p, DRR-u and DRR-ave, we see that DRR-ave outperforms DRR-u, and DRR-u is superior to DRR-p on the four datasets in most cases. The reasons are as follows: 1) DRR-u performs better than DRR-p because DRR-u not only captures the interactions among the user’s historical items, but also captures personalization information through the user-item interactions. 2) DRR-ave performs the best for two reasons: (i) it captures personalization information through user-item interactions; (ii) as noted in Section IV, by using average pooling, it eliminates the position effects in .
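The position argument in point (ii) can be checked numerically. The sketch below is only an illustration under stated assumptions: the item embeddings are random placeholders, the pooling is an unweighted mean (the pooling layer defined in Section IV may differ), and the helper names `concat_state` and `average_state` are hypothetical.

```python
import math
import random

random.seed(0)
NUM_ITEMS, DIM = 5, 8

# Placeholder embeddings for the latest positively rated items (not learned values).
items = [[random.gauss(0.0, 1.0) for _ in range(DIM)] for _ in range(NUM_ITEMS)]


def concat_state(item_embeddings):
    """DRR-n style state: item embeddings concatenated in order."""
    return [x for emb in item_embeddings for x in emb]


def average_state(item_embeddings):
    """Average-pooled state: element-wise mean over the items."""
    n = len(item_embeddings)
    return [sum(column) / n for column in zip(*item_embeddings)]


def same_vector(a, b):
    """Element-wise comparison with a small floating-point tolerance."""
    return len(a) == len(b) and all(
        math.isclose(x, y, rel_tol=1e-9, abs_tol=1e-12) for x, y in zip(a, b)
    )


# Present the same items in reverse order.
reordered = list(reversed(items))

concat_changes = not same_vector(concat_state(items), concat_state(reordered))
average_changes = not same_vector(average_state(items), average_state(reordered))
```

Here `concat_changes` is true while `average_changes` is false: reordering the same items alters the concatenated state but leaves the pooled state untouched, which is the sense in which average pooling removes position effects.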
V-D2 Simulated Online Evaluation Results and Analysis
The results of the simulated online evaluation are summarized in Table VI, where the best results are marked in bold. In this experiment, we only compare with the baseline methods that can perform online learning, namely LinUCB, HLinUCB and DRR-n. Again, we find that the proposed methods deliver higher rewards than all the baselines.
On the one hand, this suggests that the proposed RL-based methods model dynamic adaptation and long-term rewards better than the multi-armed bandit based methods LinUCB and HLinUCB. On the other hand, it indicates that the proposed state representation structures are superior to the naive fully connected network in DRR-n. As in the offline evaluation, DRR-ave performs the best among the three proposed interaction modeling structures.
|Model||ML (100k)||Yahoo! Music||ML (1M)||Jester|
V-E Parameter Study
In this subsection, we investigate how the episode length affects the performance of the proposed methods. Figure 7 shows the results (due to the space limit, we only present the performance of DRR-ave; DRR-p and DRR-u show similar behavior). From the left part of Figure 7, we observe that the performance on MovieLens first increases and then decreases as the length of the episode is gradually increased, and the peak appears at . A similar trend can be found for Yahoo! Music in the right part of Figure 7, where the performance peaks at . The reason may be the trade-off between exploitation and exploration. When the episode length is small, the user cannot fully interact with the recommender agent, i.e., the exploration is insufficient. As we lengthen the episodes, the recommender agent can explore (interact with users) adequately, i.e., it captures the user’s preference, so the performance improves. However, if the episodes are too long, the recommender focuses on exploiting locally, but the number of items the user prefers is limited; since we do not recommend repeated items to the user, the performance declines. Hence, we should balance exploration and exploitation by setting a suitable value for .
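The second half of this argument (limited preferred items plus no repeated recommendations) can be illustrated with a toy calculation. The sketch below is not the paper's experiment or metric: it assumes an idealized policy that already knows which items the user likes, a reward of +1 for a liked item and -1 otherwise, and hypothetical names throughout. It does not model the exploration phase, so it only reproduces the decline for long episodes.

```python
def average_reward(episode_length, liked, candidates):
    """Average per-step reward when liked items are recommended first, without repeats."""
    remaining = list(candidates)
    steps = min(episode_length, len(remaining))
    total = 0.0
    for _ in range(steps):
        # Idealized policy: pick a liked item if one is still available.
        pick = next((c for c in remaining if c in liked), remaining[0])
        remaining.remove(pick)  # no repeated recommendations within an episode
        total += 1.0 if pick in liked else -1.0
    return total / steps if steps else 0.0


candidates = range(20)   # 20 candidate items
liked = set(range(5))    # the user likes only 5 of them

short_episode = average_reward(5, liked, candidates)    # 1.0
medium_episode = average_reward(10, liked, candidates)  # 0.0
long_episode = average_reward(20, liked, candidates)    # -0.5
```

Once the five liked items are used up, every additional step can only add a disliked item, so the average reward falls as the episode grows.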
V-F Case Study
In this subsection, we present an example to show how LinUCB and DRR-ave recommend differently on the MovieLens dataset. Specifically, we randomly pick a user (ID 11) and conduct the recommendation procedure with LinUCB and DRR-ave respectively. To compare their reactions to the same recommendation scenario, we fix the first three recommended items and observe what happens next. The recommended items and the corresponding rewards are reported in Table VII.
From Table VII, we can see that LinUCB and DRR-ave react differently after two consecutive negative recommendations (Eraser and First Knight). Specifically, LinUCB keeps exploring without considering recommending a “safe” item to please the user. In contrast, DRR-ave stops exploring and recommends a risk-free movie, Dead Man Walking, which belongs to the same genre as Chasing Amy, the movie that gained positive feedback from the user at timestep 1. This observation illustrates the advantage of the proposed DRR-ave over LinUCB.
|Timestep||LinUCB||DRR-ave|
|1||Chasing Amy (1)||Chasing Amy (1)|
|2||Eraser (-0.5)||Eraser (-0.5)|
|3||First Knight (-1)||First Knight (-1)|
|4||The Deer Hunter (-0.5)||Dead Man Walking (1)|
|5||Event Horizon (-1)||Braveheart (0.5)|
|6||The Net (0)||The Usual Suspects (-0.5)|
|7||Striptease (-0.5)||Psycho (0.5)|
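To make the comparison in Table VII concrete, the per-step rewards can be accumulated over the seven timesteps. In the sketch below the reward lists are copied from Table VII and the discount rate 0.9 is the one reported in the experimental settings; the helper name `discounted_return` is hypothetical, and the calculation is an illustration rather than a metric reported by the authors.

```python
# Rewards per timestep, copied from Table VII.
LINUCB_REWARDS = [1, -0.5, -1, -0.5, -1, 0, -0.5]
DRR_AVE_REWARDS = [1, -0.5, -1, 1, 0.5, -0.5, 0.5]


def discounted_return(rewards, gamma=1.0):
    """Sum of rewards, with the reward at step t weighted by gamma**t."""
    return sum(r * gamma ** t for t, r in enumerate(rewards))


# Plain cumulative reward over the episode.
linucb_total = discounted_return(LINUCB_REWARDS)    # -2.5
drr_ave_total = discounted_return(DRR_AVE_REWARDS)  # 1.0

# Discounted return with the discount rate used in the experiments.
linucb_discounted = discounted_return(LINUCB_REWARDS, gamma=0.9)
drr_ave_discounted = discounted_return(DRR_AVE_REWARDS, gamma=0.9)
```

The first three steps are identical for both methods, so the whole gap of 3.5 in cumulative reward comes from timesteps 4 to 7, after the two consecutive negative recommendations.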
VI Conclusion
In this paper, we propose DRR, a deep reinforcement learning based framework for the recommendation task. Unlike conventional studies, DRR treats recommendation as a sequential decision making process and adopts an “Actor-Critic” learning scheme, which can take both the immediate and long-term rewards into account. DRR incorporates a state representation module, for which three instantiation structures are designed to explicitly model the interactions between users and items. Extensive experiments on four real-world datasets demonstrate the superiority of the proposed DRR method over state-of-the-art competitors.
References
-  R. J. Mooney and L. Roy, “Content-based book recommending using learning for text categorization,” in ACM DL, 2000, pp. 195–204.
-  M. Deshpande and G. Karypis, “Item-based top-N recommendation algorithms,” ACM Trans. Inf. Syst., vol. 22, no. 1, pp. 143–177, 2004.
-  Y. Koren, R. M. Bell, and C. Volinsky, “Matrix factorization techniques for recommender systems,” IEEE Computer, vol. 42, no. 8, pp. 30–37, 2009.
-  G. Linden, B. Smith, and J. York, “Amazon.com recommendations: Item-to-item collaborative filtering,” IEEE Internet Computing, vol. 7, no. 1, pp. 76–80, 2003.
-  J. Wang, A. P. De Vries, and M. J. Reinders, “Unifying user-based and item-based collaborative filtering approaches by similarity fusion,” in SIGIR. ACM, 2006, pp. 501–508.
-  H. B. McMahan, G. Holt, D. Sculley, M. Young, D. Ebner, J. Grady, L. Nie, T. Phillips, E. Davydov, D. Golovin, S. Chikkerur, D. Liu, M. Wattenberg, A. M. Hrafnkelsson, T. Boulos, and J. Kubica, “Ad click prediction: a view from the trenches,” in KDD 2013, Chicago, IL, USA, August 11-14, 2013, 2013, pp. 1222–1230.
-  S. Rendle, “Factorization machines,” in ICDM, Sydney, Australia, 14-17 December 2010, 2010, pp. 995–1000.
-  Y. Juan, Y. Zhuang, W. Chin, and C. Lin, “Field-aware factorization machines for CTR prediction,” in RecSys, Boston, MA, USA, September 15-19, 2016, 2016, pp. 43–50.
-  W. Zhang, T. Du, and J. Wang, “Deep learning over multi-field categorical data - A case study on user response prediction,” in ECIR 2016, Padua, Italy, March 20-23, 2016. Proceedings, 2016, pp. 45–57.
-  Y. Qu, H. Cai, K. Ren, W. Zhang, Y. Yu, Y. Wen, and J. Wang, “Product-based neural networks for user response prediction,” in ICDM 2016, December 12-15, 2016, Barcelona, Spain, 2016, pp. 1149–1154.
-  H. Guo, R. Tang, Y. Ye, Z. Li, and X. He, “DeepFM: A factorization-machine based neural network for CTR prediction,” in IJCAI 2017, Melbourne, Australia, August 19-25, 2017, 2017, pp. 1725–1731.
-  H. Cheng, L. Koc, J. Harmsen, T. Shaked, T. Chandra, H. Aradhye, G. Anderson, G. Corrado, W. Chai, M. Ispir, R. Anil, Z. Haque, L. Hong, V. Jain, X. Liu, and H. Shah, “Wide & deep learning for recommender systems,” CoRR, vol. abs/1606.07792, 2016.
-  O. Chapelle and L. Li, “An empirical evaluation of Thompson sampling,” in NIPS, Granada, Spain, 2011, pp. 2249–2257.
-  L. Li, W. Chu, J. Langford, and R. E. Schapire, “A contextual-bandit approach to personalized news article recommendation,” in WWW 2010, Raleigh, North Carolina, USA, April 26-30, 2010, 2010, pp. 661–670.
-  H. Wang, Q. Wu, and H. Wang, “Factorization bandits for interactive recommendation,” in AAAI, February 4-9, 2017, San Francisco, California, USA, 2017, pp. 2695–2702.
-  C. Zeng, Q. Wang, S. Mokhtari, and T. Li, “Online context-aware recommendation with time varying multi-armed bandit,” in SIGKDD, San Francisco, CA, USA, August 13-17, 2016, 2016, pp. 2025–2034.
-  X. Zhao, W. Zhang, and J. Wang, “Interactive collaborative filtering,” in CIKM’13, San Francisco, CA, USA, October 27 - November 1, 2013, 2013, pp. 1411–1420.
-  G. Shani, D. Heckerman, and R. I. Brafman, “An MDP-based recommender system,” Journal of Machine Learning Research, vol. 6, pp. 1265–1295, 2005.
-  G. Zheng, F. Zhang, Z. Zheng, Y. Xiang, N. J. Yuan, X. Xie, and Z. Li, “DRN: A deep reinforcement learning framework for news recommendation,” in WWW 2018, Lyon, France, April 23-27, 2018, 2018, pp. 167–176.
-  R. S. Sutton and A. G. Barto, Reinforcement learning: An introduction. MIT press Cambridge, 1998, vol. 1, no. 1.
-  V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. A. Riedmiller, A. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg, and D. Hassabis, “Human-level control through deep reinforcement learning,” Nature, vol. 518, no. 7540, pp. 529–533, 2015.
-  D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. van den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, S. Dieleman, D. Grewe, J. Nham, N. Kalchbrenner, I. Sutskever, T. P. Lillicrap, M. Leach, K. Kavukcuoglu, T. Graepel, and D. Hassabis, “Mastering the game of Go with deep neural networks and tree search,” Nature, vol. 529, no. 7587, pp. 484–489, 2016.
-  H. Cai, K. Ren, W. Zhang, K. Malialis, J. Wang, Y. Yu, and D. Guo, “Real-time bidding by reinforcement learning in display advertising,” in WSDM 2017, Cambridge, United Kingdom, February 6-10, 2017, 2017, pp. 661–670.
-  J. Jin, C. Song, H. Li, K. Gai, J. Wang, and W. Zhang, “Real-time bidding with multi-agent reinforcement learning in display advertising,” CoRR, vol. abs/1802.09756, 2018.
-  H. Cai, T. Chen, W. Zhang, Y. Yu, and J. Wang, “Efficient architecture search by network transformation,” in AAAI, New Orleans, Louisiana, USA, February 2-7, 2018, 2018.
-  B. Zoph and Q. V. Le, “Neural architecture search with reinforcement learning,” CoRR, vol. abs/1611.01578, 2016.
-  N. Taghipour and A. A. Kardan, “A hybrid web recommender system based on Q-learning,” in Proceedings of the 2008 ACM Symposium on Applied Computing (SAC), Fortaleza, Ceara, Brazil, March 16-20, 2008, 2008, pp. 1164–1168.
-  X. Zhao, L. Zhang, Z. Ding, D. Yin, Y. Zhao, and J. Tang, “Deep reinforcement learning for list-wise recommendations,” CoRR, vol. abs/1801.00209, 2018.
-  X. Zhao, L. Zhang, Z. Ding, L. Xia, J. Tang, and D. Yin, “Recommendations with negative feedback via pairwise deep reinforcement learning,” CoRR, vol. abs/1802.06501, 2018.
-  L. Xia, J. Xu, Y. Lan, J. Guo, W. Zeng, and X. Cheng, “Adapting Markov decision process for search result diversification,” in SIGIR, Shinjuku, Tokyo, Japan, August 7-11, 2017, 2017, pp. 535–544.
-  Z. Wei, J. Xu, Y. Lan, J. Guo, and X. Cheng, “Reinforcement learning to rank with Markov decision process,” in SIGIR, Shinjuku, Tokyo, Japan, August 7-11, 2017, 2017, pp. 945–948.
-  Y. Hu, Q. Da, A. Zeng, Y. Yu, and Y. Xu, “Reinforcement learning to rank in e-commerce search engine: Formalization, analysis, and application,” CoRR, vol. abs/1803.00710, 2018.
-  G. Dulac-Arnold, R. Evans, P. Sunehag, and B. Coppin, “Reinforcement learning in large discrete action spaces,” CoRR, vol. abs/1512.07679, 2015.
-  D. Silver, G. Lever, N. Heess, T. Degris, D. Wierstra, and M. A. Riedmiller, “Deterministic policy gradient algorithms,” in ICML 2014, Beijing, China, 21-26 June 2014, 2014, pp. 387–395.
-  T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra, “Continuous control with deep reinforcement learning,” CoRR, vol. abs/1509.02971, 2015.
-  T. Schaul, J. Quan, I. Antonoglou, and D. Silver, “Prioritized experience replay,” arXiv preprint arXiv:1511.05952, 2015.
-  A. Mnih and R. R. Salakhutdinov, “Probabilistic matrix factorization,” in NIPS, 2008, pp. 1257–1264.
-  Y. Koren, “Factorization meets the neighborhood: a multifaceted collaborative filtering model,” in KDD. ACM, 2008, pp. 426–434.
-  L. Li, W. Chu, J. Langford, and R. E. Schapire, “A contextual-bandit approach to personalized news article recommendation,” in WWW. ACM, 2010, pp. 661–670.
-  H. Wang, Q. Wu, and H. Wang, “Learning hidden features for contextual bandits,” in CIKM. ACM, 2016, pp. 1633–1642.