1. Introduction
With the wide use of mobile applications such as TikTok, Pandora radio and Instagram feeds, interactive recommender systems (IRS) have received much attention in recent years (Zhao et al., 2013; Wang et al., 2017). Unlike traditional recommender systems (He et al., 2017; Wang et al., 2006; Koren et al., 2009), where recommendation is treated as a one-step prediction task, recommendation in IRS is formulated as a multi-step decision-making process. In each step, the system delivers an item to the user and may receive feedback from her, which in turn informs the next recommendation decision in a sequential manner. The recommendation-feedback interaction is repeated until the end of the user's visit session. The goal of the IRS is to explore users' new interests, as well as to exploit the learned preferences to provide accurate predictions, so as to optimize the outcome of the entire recommendation sequence (Zhao et al., 2013, 2018b).
One way to implement IRS and balance exploration and exploitation is multi-armed bandit (MAB) methods (Li et al., 2010; Zeng et al., 2016; Wang et al., 2017). In MAB-based models, the user preference is often modeled by a linear function that is continuously learned through the interactions with a proper exploration-exploitation trade-off. However, these MAB-based models presume that the underlying user preference remains unchanged during the recommendation process, i.e., they do not model the dynamic transitions of user preferences (Zhao et al., 2013). A key advantage of modern IRS is the ability to learn the dynamic transitions of the user's preference and to optimize the long-term utility.
Recently, some researchers have incorporated deep reinforcement learning (DRL) models into interactive recommender systems (Zheng et al., 2018; Zhao et al., 2018b; Hu et al., 2018; Chen et al., 2019b; Zhao et al., 2018a), due to the great potential of DRL in decision making and long-term planning in dynamic environments (Silver et al., 2016). Mahmood and Ricci (2007) first proposed to use model-based RL techniques, in which dynamic programming algorithms such as policy iteration are utilized. Some recent works use model-free frameworks to tackle IRS tasks, e.g., deep Q-network (DQN) (Zhao et al., 2018b) and deep deterministic policy gradient (DDPG) (Hu et al., 2018).
Nevertheless, employing DRL in real-world interactive recommender systems is still challenging. A common approach for training recommendation models is to make direct use of offline logged data, but this suffers from the estimation bias problem (Chen et al., 2019a) under the real-time interaction setting. In contrast, the ideal setting for learning the optimal recommendation policy is to train the agent online. However, due to the huge item search space at each recommendation step and the trial-and-error nature of RL algorithms, DRL methods normally face a sample efficiency problem (Zou et al., 2019), i.e., learning such a policy requires a huge amount of interaction data with real users before reaching the best policy, which may degrade the user experience and damage the system profit (Zhang et al., 2016a). Therefore, it is crucial to improve the sample efficiency of existing DRL models given only a limited amount of interaction data.

Fortunately, there is rich prior knowledge from external sources that may help deal with the above problems, such as textual reviews, visual images or item attributes (Cheng et al., 2016). Among these, the knowledge graph (KG), a well-known structured knowledge base, represents various relations as the attributes of items and links items if they have common attributes, which has shown great effectiveness for representing the correlations between items (Wang et al., 2018). The associations between items provided by a KG are very suitable for recommendation scenarios. For example, if a user likes the movie Inception, the information behind this may be that her favorite director is Nolan. With such links among actions (items) on the graph, one user-item interaction record can reveal the user's preference on multiple connected items. In addition, the information contained in the semantic space of the entire knowledge graph is also helpful for extracting user interest during the recommendation process. Since there have been many successful works applying open-sourced KGs (such as DBpedia, NELL, and Microsoft Satori) to traditional recommendation systems (Yu et al., 2014; Zhang et al., 2016b; Wang et al., 2018), we believe it is reasonably promising to leverage KG for DRL-based methods in IRS scenarios.

In this paper, we make the first attempt to leverage KG for reinforcement learning in interactive recommender systems, trying to address the aforementioned limitations of existing DRL methods. We propose KGQR (Knowledge Graph enhanced Q-learning framework for interactive Recommendation), a novel architecture that extends DQN. Specifically, we integrate graph learning and sequential decision making as a whole to exploit the knowledge in the KG and the interaction patterns in the IRS. On one hand, to alleviate data sparsity, user feedback is modeled to propagate via the structure information of the KG, so that the user's preference can be transited among correlated items (in the KG). In this way, one interaction record can affect multiple connected items, and thus the sample efficiency is improved. On the other hand, by aggregating the semantic correlations among items in the KG, the item embedding and the user's preference are effectively represented, which leads to more accurate Q-value approximation and hence better recommendation performance. In addition, we devise a method to deal with action selection in such a large space. Rather than enumerating the whole item set, at each step the candidate set for recommendation is dynamically generated from the local graph of the KG, by considering the neighborhood of the user's high-scored interacted items. This candidate selection method forces the deep Q-network to fit on the samples that the KG considers more useful through the structure of item correlations, hence making better use of the limited learning samples for the RL algorithm.
To the best of our knowledge, this is the first work to introduce KG into RL-based methods for interactive recommender systems. The contributions of our work can be summarized as follows.


We propose a novel end-to-end deep reinforcement learning based framework, KGQR, for interactive recommendation, which addresses the sparsity issue. By leveraging prior knowledge in the KG for both candidate selection and the learning of user preferences from sparse user feedback, KGQR improves the sample efficiency of RL-based IRS models.

The dynamic user preference can be represented more precisely by modeling the semantic correlations of items in the KG with graph neural networks.

Extensive experiments have been conducted on two real-world datasets, demonstrating that KGQR achieves better performance than state-of-the-art methods with much fewer user-item interactions, which indicates high sample efficiency.
2. Related Work
Traditional KG Enhanced Recommendation. Traditional KG enhanced recommendation models can be classified into three categories: path-based methods, embedding-based methods and hybrid methods. In path-based methods (Yu et al., 2014; Shi et al., 2015; Zhao et al., 2017), the KG is often treated as a heterogeneous information network (HIN), in which specific meta-paths/meta-graphs are manually designed to represent different patterns of connections. The performance of these methods heavily depends on the handcrafted meta-paths, which are hard to design. In embedding-based methods, the entity embeddings extracted from the KG via knowledge graph embedding (KGE) algorithms (such as TransE (Bordes et al., 2013), TransD (Ji et al., 2015), TransR (Lin et al., 2015)) are utilized to better represent items in recommendation. Zhang et al. (2016b) propose Collaborative Knowledge Base Embedding (CKE) to jointly learn the latent representations in collaborative filtering as well as items' semantic representations from the knowledge base, including KG, texts, and images. MKR (Wang et al., 2019b) associates the embedding learning on KG with the recommendation task via cross & compress units. KSR (Huang et al., 2018) extends the GRU-based sequential recommender by integrating it with a knowledge-enhanced Key-Value Memory Network. In hybrid methods, researchers combine the above two categories to learn user/item embeddings by exploiting high-order information in the KG. RippleNet (Wang et al., 2018) is a memory-network-like model that propagates users' potential preferences along the links in the KG. Inspired by the development of graph neural networks (Kipf and Welling, 2016; Hamilton et al., 2017; Veličković et al., 2017), KGAT (Wang et al., 2019a) applies the graph attention network (Veličković et al., 2017) framework on a collaborative knowledge graph to learn the user, item and entity embeddings in an end-to-end manner.

However, most of these methods treat recommendation as a one-step prediction task and cannot model the iterative interactions with users. Besides, they all greedily optimize the user's immediate feedback and do not take the user's long-term utility into consideration.
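As a side illustration of the embedding-based family mentioned above, the translation principle behind TransE can be sketched in a few lines (the vectors are toy values, not learned embeddings):

```python
# Sketch of the TransE scoring idea: a triple (h, r, t) is considered
# plausible when h + r is close to t, so the score is the negative
# distance between h + r and t. Toy vectors, not learned embeddings.
import numpy as np

def transe_score(h, r, t):
    # higher (closer to 0) means a more plausible triple
    return -np.linalg.norm(h + r - t)

head = np.array([0.1, 0.2])
relation = np.array([0.3, 0.1])
tail = np.array([0.4, 0.3])
print(transe_score(head, relation, tail))  # close to 0: a well-fit triple
```

In training, such scores are pushed apart between observed triples and corrupted (negative) triples, which is what yields entity embeddings usable for recommendation.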
Reinforcement Learning in IRS. RL-based recommendation methods model the interactive recommendation process as a Markov Decision Process (MDP), and can be divided into model-based and model-free methods. As one of the model-based techniques, Mahmood and Ricci (2007) utilize policy iteration to search for the optimal recommendation policy, where an action is defined to be an item and a state is represented as an n-gram of items. Policy iteration needs to go through the whole state space in each iteration, with complexity exponential in the number of items. Therefore, it cannot handle large state and action spaces.

Recently, most works on RL-based recommendation prefer model-free techniques, including policy gradient (PG)-based, DQN-based and DDPG-based methods. PG-based methods, as well as DQN-based ones, treat recommending an item as an action. PG-based methods (Chen et al., 2019b) learn a stochastic policy as a distribution over the whole item space and sample an item according to this distribution. DQN-based methods (Zhao et al., 2018b; Zheng et al., 2018; Zou et al., 2019) learn a Q-value for each item and select the item with the maximum Q-value. Zheng et al. (2018) combine DQN and Dueling Bandit Gradient Descent (DBGD) (Grotov and de Rijke, 2016) to conduct online news recommendation. Zou et al. (2019) integrate both the instant engagement (such as click and order) and the long-term engagement (such as dwell time and revisit) when modeling versatile user behaviors. In DDPG-based works, an action is often defined as a continuous ranking vector. Dulac-Arnold et al. (2015) represent the discrete actions in a continuous vector space, pick a proto-action in the continuous hidden space according to the policy, and then choose a valid item via a nearest-neighbor method. In (Hu et al., 2018; Zhao et al., 2018a), the policies compute the ranking score of an item by calculating a predefined function (such as an inner product) of the generated action vector and the item embedding.

Nevertheless, all existing RL-based recommendation models suffer from the low sample efficiency issue and need to pretrain user/item embeddings from history, which means they cannot handle the cold-start problem well. A significant difference between our approach and the existing models is that we are the first to propose a framework that combines the semantic and structural information of KG with IRS to break these limitations.
3. Problem Formulation
In the feed streaming recommendation scenario, the interaction between the recommender system and the user is a multi-step process that lasts for a period of time. At each timestep $t$, according to the observations on past interactions, the recommendation agent delivers an item $i_t$ to the user, and receives feedback (e.g., click, purchase or skip) from her. This process continues until the user leaves the recommender system. Under such circumstances, the interactive recommendation process can be formulated as a Markov Decision Process (MDP). The ultimate goal of the recommender system is to learn a recommendation policy $\pi$, which maximizes the cumulative utility over the whole interactive recommendation process:

(1)  $\pi^* = \arg\max_{\pi} \mathbb{E}\left[ \sum_{t=1}^{T} r(s_t, i_t) \right]$

Here $s_t$ is a representation abstracted from the user's positively interacted items that denotes the user's preference at timestep $t$; $r(s_t, i_t)$ is the user's immediate feedback to the recommended item $i_t$ at state $s_t$ according to some internal reward function $r$, abbreviated as $r_t$.
Notation | Description
$\mathcal{U}$, $\mathcal{I}$ | Set of users and items in IRS.
$\mathcal{G}$ | Knowledge graph.
$\mathcal{E}$, $\mathcal{R}$ | Set of entities and relations in $\mathcal{G}$.
$\mathcal{H}_t$ | Recorded user's positively interacted items at timestep $t$.
$s_t$ | Dense representation of the user's preference at timestep $t$.
$r_t$ | The user's reward at timestep $t$.
$T$ | Episode length.
$e$ | Dense representation of an entity.
$\mathcal{A}_t$ | Candidate action space at timestep $t$.
$\theta_s$ | Parameters of the state representation network.
$\theta_q$ / $\theta_{q'}$ | Parameters of the online Q-network / target Q-network.
$\mathcal{D}$ | Replay buffer.
To achieve this goal, traditional recommendation methods usually adopt a greedy strategy and only optimize the one-step reward, i.e., at each timestep $t$ they optimize the immediate reward $r_t$. Different from them, DRL algorithms take the long-term impact into consideration and explicitly model the long-run performance: at timestep $t$ they instead optimize $\sum_{k=0}^{T-t} \gamma^{k} r_{t+k}$, where $\gamma \in [0, 1]$ is the discount factor that controls how strongly long-term rewards are accumulated.
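The contrast above can be made concrete with a toy computation of the discounted return (the reward sequence and the value of gamma are hypothetical):

```python
# A toy computation of the discounted return optimized by DRL methods:
# rewards are accumulated from timestep t onward, each discounted by gamma.
def discounted_return(rewards, gamma=0.9):
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# A greedy strategy sees only the first reward (1.0); the long-run
# objective also credits the later reward, discounted by gamma^2.
print(discounted_return([1.0, 0.0, 1.0], gamma=0.9))
```
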
In general, we can use the Q-value to evaluate the value of an action (i.e., recommending an item) taken at a given state, defined as

(2)  $Q^{\pi}(s_t, i_t) = \mathbb{E}_{\pi}\left[ \sum_{k=0}^{T-t} \gamma^{k} r_{t+k} \,\middle|\, s_t, i_t \right]$

which is a weighted sum of the expected rewards of all future steps starting from the current state and following the policy $\pi$ to take actions. Then, following the optimal Bellman equation (Bellman, 1952), the optimal $Q^{*}(s_t, i_t)$, having the maximum expected reward achievable, is:

(3)  $Q^{*}(s_t, i_t) = \mathbb{E}_{s_{t+1}}\left[ r_t + \gamma \max_{i'} Q^{*}(s_{t+1}, i') \,\middle|\, s_t, i_t \right]$

Since the state and action spaces are usually enormous, we normally estimate the Q-value of each state-action pair $(s_t, i_t)$ via a parameterized deep neural network, i.e., $Q(s_t, i_t; \theta)$.
As mentioned in Section 1, learning this Q-function from scratch requires numerous interactions with real users due to the low data efficiency problem that is common in RL algorithms. However, unlike in basic RL settings, in RS scenarios the KG can provide complementary and distinguishable information for each item through their latent knowledge-level connections in the graph. Thus, with such prior knowledge of the environment and the actions, the Q-function can be learned more efficiently,

(4)  $Q(s_t, i_t; \theta \mid \mathcal{G})$

Here $\mathcal{G}$ is the knowledge graph comprised of subject-property-object triple facts, e.g., the triple (Inception, film.director, Nolan) denotes that Nolan is the director of Inception. It is often presented as $\mathcal{G} = \{(h, r, t) \mid h, t \in \mathcal{E}, r \in \mathcal{R}\}$, where $\mathcal{E}$ and $\mathcal{R}$ denote the set of entities and relations in $\mathcal{G}$, respectively. Usually, an item can be linked to an entity in the knowledge graph, e.g., the movie The Godfather from the MovieLens dataset has a corresponding entity entry in DBpedia. We will introduce how we design the knowledge enhanced DRL framework for IRS in the following sections; the key notations used in this paper are summarized in Table 1.
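The triple structure described above can be sketched with a few hypothetical facts, viewing the KG as an adjacency structure so that two items sharing an attribute become connected through it:

```python
# Minimal sketch (hypothetical triples): storing subject-property-object
# facts and viewing them as an adjacency structure, so that two items
# sharing an attribute become connected through it.
from collections import defaultdict

triples = [
    ("Inception", "film.director", "Christopher_Nolan"),
    ("Interstellar", "film.director", "Christopher_Nolan"),
    ("Inception", "film.genre", "Science_Fiction"),
]

graph = defaultdict(list)
for h, r, t in triples:
    graph[h].append((r, t))
    graph[t].append((r, h))  # undirected view for preference propagation

# Inception and Interstellar are 2-hop neighbors via their shared director,
# so feedback on one can inform the other.
print(sorted(graph["Christopher_Nolan"]))
```
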
4. KGQR Methodology
The overview of our proposed framework is shown in Figure 1. Generally, our KGQR model contains four main components: the graph convolution module, the state representation module, the candidate selection module and the Q-learning network module. In the interactive recommendation process, at each timestep $t$, the IRS sequentially recommends items to users, and correspondingly updates its subsequent recommendation strategy based on the user's feedback $r_t$. At a specific timestep $t$ during one recommendation session, according to the interaction history $\mathcal{H}_t$ combined with the knowledge graph $\mathcal{G}$, the IRS models the user's preference $s_t$ via the graph convolution module and the state representation module. The details of these two representation learning modules will be discussed in Section 4.1. Then the IRS selects the highest-scored item in the candidate set through the Q-network and recommends it to the user. We will introduce the candidate selection module and the deep Q-network module in Section 4.2 and Section 4.3, respectively.
4.1. KG Enhanced State Representation
In the IRS scenario, it is impossible to observe the user's state directly; what we can directly observe is the recorded user-system interaction history $\mathcal{H}_t$. As the state is one of the key parts of the MDP, the design of the state representation module is critical for learning the optimal recommendation strategy.
4.1.1. Graph convolutional embedding layer
Generally, the state representation in IRS is abstracted from the user's clicked items (without loss of generality, we take the "click" behavior as the user's positive feedback in the running example), since the positive items carry the key information about what the user prefers (Zhao et al., 2018b). Given the user's history, we first convert the clicked item set $\mathcal{H}_t$ into embedding vectors in $\mathbb{R}^d$, where $d$ is the dimension of the embeddings. Since we have already linked items with entities in the KG, we can take advantage of the semantic and correlation information among items in the KG for better item embeddings.
In order to distill the structural and semantic knowledge of the graph into a low-dimensional dense node representation, different graph embedding approaches can be applied. In addition to harvesting the semantic information, we want to explicitly link these items so that one interaction record can affect multiple items. Thus, a graph convolutional network (GCN) (Kipf and Welling, 2016) is used in our work to recursively propagate embeddings along the connectivity of items and learn the embeddings of all entities on the graph $\mathcal{G}$.
The computation of a node's representation in a single graph convolutional embedding layer is a two-step procedure: aggregation and integration. These two procedures can naturally be extended to multiple hops, and we use the superscript $(l)$ to identify the $l$-th hop. In each layer, we first aggregate the representations of the neighboring nodes of a given node $v$:

(5)  $e_{N(v)}^{(l)} = \frac{1}{|N(v)|} \sum_{u \in N(v)} e_u^{(l)}$

where $N(v)$ is the set of neighboring nodes of $v$. Notice that here we consider the classic average aggregator as an example; other aggregators such as the concat aggregator (Hamilton et al., 2017), the neighbor aggregator, or an attention mechanism (GAT) (Veličković et al., 2017) can also be used.

Second, we integrate the neighborhood representation with $v$'s own representation as

(6)  $e_v^{(l+1)} = \sigma\left( W^{(l)} \left[ e_v^{(l)}; e_{N(v)}^{(l)} \right] + b^{(l)} \right)$

where $W^{(l)}$ and $b^{(l)}$ are trainable parameters of the $l$-th hop neighborhood aggregator and $\sigma$ is the nonlinear activation function. In Equation (6), we assume the neighborhood representation and the target entity representation are integrated via a multi-layer perceptron. After $L$ graph convolutional embedding layers, each clicked item $i$ is converted into $e_i^{(L)}$.

4.1.2. Behavior aggregation layer
Since interactive recommendation is a sequential decision-making process, at each step the model requires the current observation of the user as input and provides a recommended item as output. It is natural to use autoregressive models such as recurrent neural networks (RNN) to represent the state based on the observation-action sequence (Hausknecht and Stone, 2015; Narasimhan et al., 2015). Thus, we use an RNN with a gated recurrent unit (GRU) as the network cell (Cho et al., 2014) to aggregate the user's historical behaviors and distill the user's state $s_t$. The update function of a GRU cell is defined as

(7)
$z_t = \sigma(W_z x_t + U_z h_{t-1} + b_z)$
$g_t = \sigma(W_g x_t + U_g h_{t-1} + b_g)$
$\tilde{h}_t = \tanh(W_h x_t + U_h (g_t \odot h_{t-1}) + b_h)$
$h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t$

where $x_t$ denotes the input vector, $z_t$ and $g_t$ denote the update gate and reset gate vectors respectively, and $\odot$ is the element-wise product operator. The update of the hidden state $h_t$ is a linear interpolation of the previous hidden state $h_{t-1}$ and a new candidate hidden state $\tilde{h}_t$. The hidden state is the representation of the current user state, which is then fed into the Q-network, i.e.,

(8)  $s_t = h_t$
For simplicity, the set of all network parameters for computing $s_t$, including the parameters of the graph convolutional layers and the parameters of the GRU cell, is denoted as $\theta_s$.
In Figure 2(a), we illustrate the knowledge enhanced state representation module elaborated above. The upper part is the recurrent neural network that takes the clicked item's embedding at each timestep as the input vector and outputs the hidden state of the current step as the state representation. The item embeddings, which are the input to the GRU, are learned by performing graph convolutions on the KG, as shown in the bottom part.
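As a rough sketch of the two modules above (not the authors' implementation; the layer sizes, the single-hop convolution, the average aggregator and the toy neighborhood lists are all assumptions), the pipeline from entity embeddings to the state vector might look like:

```python
# Hedged sketch: a one-hop average-aggregator graph convolution over
# entity embeddings (Eqs. 5-6), followed by a GRU over the clicked-item
# sequence (Eqs. 7-8). All sizes and neighborhood lists are toy values.
import torch
import torch.nn as nn

class GraphConvLayer(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.linear = nn.Linear(2 * dim, dim)  # integrates [self ; neighbors]

    def forward(self, emb, neighbors):
        # emb: (num_entities, dim); neighbors[v]: tensor of neighbor indices
        agg = torch.stack([emb[n].mean(dim=0) for n in neighbors])  # aggregation
        return torch.tanh(self.linear(torch.cat([emb, agg], dim=-1)))  # integration

class StateRepresentation(nn.Module):
    def __init__(self, num_entities, dim):
        super().__init__()
        self.emb = nn.Embedding(num_entities, dim)
        self.gcn = GraphConvLayer(dim)
        self.gru = nn.GRU(dim, dim, batch_first=True)

    def forward(self, clicked_items, neighbors):
        conv = self.gcn(self.emb.weight, neighbors)  # KG-enhanced embeddings
        seq = conv[clicked_items].unsqueeze(0)       # (1, T, dim)
        _, h = self.gru(seq)                         # behavior aggregation
        return h.squeeze(0).squeeze(0)               # user state s_t

# Toy usage: 5 entities; each entity's neighborhood is just itself.
model = StateRepresentation(num_entities=5, dim=8)
neighbors = [torch.tensor([v]) for v in range(5)]
s = model(torch.tensor([0, 2, 4]), neighbors)
print(tuple(s.shape))  # (8,)
```

The key design choice mirrored here is that the GRU consumes embeddings that have already absorbed KG neighborhood information, so a single click propagates signal to correlated items.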
4.2. Neighborbased Candidate Selection
Generally, the clicked items share some inherent semantic characteristics, e.g., movies of a similar genre (Wang et al., 2018). Since a user is usually not interested in all items, we can focus on selecting potential candidates for restricted retrieval based on this semantic information in the KG. Specifically, we utilize the KG to filter out irrelevant items (i.e., actions) and dynamically obtain the potential candidates. The restricted retrieval focuses the data samples on the areas that are more useful, as suggested by the structure of item correlations. Thus, these potential candidates not only reduce the large search space, but also improve the sample efficiency of policy learning.
More specifically, we perform a sampling strategy based on the $k$-hop neighborhood in the KG. At each timestep $t$, the user's historically interacted items serve as the seed set $\mathcal{E}_t^{0}$. The $k$-hop neighborhood set starting from the seed entities is denoted as

(9)  $\mathcal{E}_t^{k} = \left\{ t' \mid (h, r, t') \in \mathcal{G} \text{ and } h \in \mathcal{E}_t^{k-1} \right\}, \quad k = 1, 2, \dots$

Then, the candidate action set for the current user state is defined as

(10)  $\mathcal{A}_t = \mathcal{I} \cap \left( \bigcup_{k=1}^{K} \mathcal{E}_t^{k} \right)$

with a user-defined cardinality. The shaded part of "candidate selection" in Figure 2(b) denotes the actions selected with the information from the KG. All candidate items then obtain their embeddings through the graph convolutional layers.
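The neighborhood-growing step above can be sketched as follows (the graph, the seed set and the cardinality cap are toy assumptions):

```python
# Illustrative sketch of neighbor-based candidate selection: grow the
# k-hop neighborhood of the user's positively interacted items (the seed
# set) on the KG and keep only entities linked to recommendable items.
def k_hop_candidates(graph, seeds, k, item_set, max_size=100):
    """graph: dict mapping an entity to the set of its neighbors."""
    frontier, visited = set(seeds), set(seeds)
    for _ in range(k):
        frontier = {n for e in frontier for n in graph.get(e, ())} - visited
        visited |= frontier
    # keep only recommendable items; drop the seeds themselves to avoid
    # repeated recommendations within an episode
    candidates = sorted(e for e in visited - set(seeds) if e in item_set)
    return candidates[:max_size]

# Toy graph: movies "a" and "b" share the attribute entity "x".
graph = {"a": {"x"}, "x": {"a", "b"}, "b": {"x"}}
print(k_hop_candidates(graph, seeds=["a"], k=2, item_set={"a", "b"}))  # ['b']
```
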
4.3. Learning Deep Q-Network
After modeling the user's state $s_t$ and obtaining the candidate set $\mathcal{A}_t$, we need to design the Q-network to combine this information and improve the recommendation policy for the interactive recommendation process. Here we implement a deep Q-network (DQN) with dueling-Q (Wang et al., 2016) and double-Q (Van Hasselt et al., 2016) techniques to model the expected long-term user satisfaction from the current user state, as well as to learn the optimal strategy.
4.3.1. Deep Q-network.
We adopt the dueling technique to reduce the approximation variance and stabilize training (Wang et al., 2016). That is, we use two networks to compute the value function and the advantage function respectively, as shown in Figure 2. The Q-value is then computed as

(11)  $Q(s_t, i_t; \theta_q) = V(s_t; \theta_v) + \left( A(s_t, i_t; \theta_a) - \frac{1}{|\mathcal{A}_t|} \sum_{i' \in \mathcal{A}_t} A(s_t, i'; \theta_a) \right)$

Here the value function and the advantage function are approximated by multi-layer perceptrons; $\theta_v$ and $\theta_a$ are the parameters of the value function and the advantage function respectively, and we denote $\theta_q = \{\theta_v, \theta_a\}$.
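A minimal sketch of the dueling head (the layer sizes and the mean-centered combination are illustrative assumptions):

```python
# Sketch of the dueling head: a value stream V(s) and an advantage stream
# A(s, i) combined into Q(s, i). Layer sizes are illustrative assumptions.
import torch
import torch.nn as nn

class DuelingQ(nn.Module):
    def __init__(self, state_dim, item_dim, hidden=64):
        super().__init__()
        self.value = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1))
        self.advantage = nn.Sequential(
            nn.Linear(state_dim + item_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1))

    def forward(self, state, item_embs):
        # state: (state_dim,); item_embs: (num_candidates, item_dim)
        v = self.value(state)                                  # V(s), shape (1,)
        s = state.expand(item_embs.size(0), -1)
        a = self.advantage(torch.cat([s, item_embs], dim=-1))  # A(s, i)
        return (v + (a - a.mean())).squeeze(-1)  # one Q-value per candidate

q_net = DuelingQ(state_dim=8, item_dim=8)
q = q_net(torch.randn(8), torch.randn(10, 8))
print(tuple(q.shape))  # (10,)
```

Note that the advantage stream scores every candidate item against the same state, which is what makes scoring a dynamically generated candidate set cheap.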
4.3.2. Model training.
With the proposed framework, we can train the parameters of the model through a trial-and-error process. During the interactive recommendation process, at timestep $t$ the recommender agent obtains the user's state $s_t$ from the observations about her, and recommends an item $i_t$ via an $\epsilon$-greedy policy (i.e., with probability $1-\epsilon$ choosing the item in the candidate set with the maximum Q-value, and with probability $\epsilon$ choosing a random item). The agent then receives the reward $r_t$ from the user's feedback and stores the experience $(s_t, i_t, r_t, s_{t+1})$ in the replay buffer $\mathcal{D}$. From $\mathcal{D}$, we sample minibatches of experiences and minimize the mean-square loss to improve the Q-network, defined as

(12)  $\ell(\theta_q) = \mathbb{E}_{(s_t, i_t, r_t, s_{t+1}) \sim \mathcal{D}}\left[ \left( y_t - Q(s_t, i_t; \theta_q) \right)^2 \right]$

Here $y_t$ is the target value based on the optimal $Q^*$. According to Equation (3), $y_t$ is defined as

(13)  $y_t = r_t + \gamma \max_{i' \in \mathcal{A}_{t+1}} Q(s_{t+1}, i'; \theta_q)$

To alleviate the overestimation problem of the original DQN, we also utilize a target network along with the online network (i.e., the double DQN architecture (Van Hasselt et al., 2016)). The online network backpropagates and updates its weights at each training step. The target network is a duplicate of the online network and updates its parameters with a training delay. The target value for the online network update is then changed to

(14)  $y_t = r_t + \gamma\, Q\!\left(s_{t+1}, \arg\max_{i' \in \mathcal{A}_{t+1}} Q(s_{t+1}, i'; \theta_q);\ \theta_{q'}\right)$

where $\theta_{q'}$ denotes the parameters of the target network, which are updated according to a soft assignment:

(15)  $\theta_{q'} \leftarrow \tau \theta_q + (1 - \tau)\, \theta_{q'}$

where the interpolation parameter $\tau$ is also called the update frequency.
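The double-DQN target and the soft update can be sketched as follows (gamma and tau values are hyperparameter assumptions, and the two networks are assumed to share one architecture):

```python
# Hedged sketch of the double-DQN target and the soft target-network
# update; gamma and tau are hyperparameters chosen for illustration.
import torch

def ddqn_target(r, next_q_online, next_q_target, gamma=0.9):
    # The online network selects the best next action; the target network
    # evaluates it, which alleviates Q-value overestimation.
    a_star = next_q_online.argmax()
    return r + gamma * next_q_target[a_star]

def soft_update(target_net, online_net, tau=0.01):
    # theta' <- tau * theta + (1 - tau) * theta'
    for tp, op in zip(target_net.parameters(), online_net.parameters()):
        tp.data.mul_(1.0 - tau).add_(tau * op.data)

y = ddqn_target(1.0, torch.tensor([0.2, 0.9]), torch.tensor([0.5, 0.3]))
print(round(float(y), 4))  # 1.27
```

Note how the two value vectors disagree on the best action: the online network prefers index 1, so the target evaluates that action even though the target network scores index 0 higher.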
To summarize, the training procedure of our KGQR is presented in Algorithm 1. Note that this paper mainly focuses on the way of incorporating KG into DRL methods for IRS. Thus we study the most typical DQN model as a running example. Our method can be seamlessly incorporated into other DRL models such as policy gradient (PG) (Chen et al., 2019b), DDPG (Hu et al., 2018) etc.
5. Experiment
We conduct experiments on two real-world datasets to evaluate our proposed KGQR framework. We aim to study the following research questions (RQs):


RQ1: How does KGQR perform compared with state-of-the-art interactive recommendation methods?

RQ2: Does KGQR improve sample efficiency?

RQ3: How do different components (i.e., KG-enhanced state representation, GCN-based task-specific representation learning, neighbor-based candidate selection) affect the performance of KGQR?
5.1. Experimental Settings
5.1.1. Datasets
We adopt two real-world benchmark datasets for evaluation and describe them below.

 Book-Crossing (http://www2.informatik.uni-freiburg.de/~cziegler/BX/):

is a book rating dataset from the Book-Crossing community. The ratings range from 0 to 10. This dataset is linked with Microsoft Satori, and the sub-KG is released by (Wang et al., 2018).
 MovieLens-20M (https://grouplens.org/datasets/movielens/):

is a benchmark dataset consisting of 20 million ratings given by users to movies on the MovieLens website. The ratings range from 1 to 5. It is also linked with Microsoft Satori, and the sub-KG is released by (Wang et al., 2019c).
For the Book-Crossing dataset, we follow the processing of (Wang et al., 2018) and convert the original ratings into two categories: 1 for high ratings and 0 for the others. For the MovieLens-20M dataset, we keep the users with at least 200 interactions. The statistics of the two datasets are presented in Table 2.
We choose these two typical datasets since our work focuses on incorporating KG into RLbased models for IRS. The experiments on more datasets with rich domain information such as news or images will be left as future work.
                         Book-Crossing  MovieLens-20M
User-Item Interaction
  #Users                        17,860         16,525
  #Linked Items                 14,910         16,426
  #Interactions                139,746      6,711,013
Knowledge Graph
  #Entities                     77,903        101,769
  #Relation Types                   25             32
  #Triples                     151,500        489,758
5.1.2. Simulator
Due to the interactive nature of our problem, online experiments, where the recommender system interacts with users and learns the policy directly from the users' feedback, would be ideal. However, as mentioned in Section 1, the trial-and-error strategy for training a policy in an online fashion would degrade the user's experience, as well as the system profit. Thus, the community has formed a protocol (Hu et al., 2018; Chen et al., 2019b; Zhao et al., 2018a; Dulac-Arnold et al., 2015; Wang et al., 2017) of building an environment simulator based on offline datasets for evaluation.
Following the experiment protocol in (Chen et al., 2019b), our mimic environment simulator takes into account the instinctive feedback as well as the sequential nature of user behavior. We perform matrix factorization to train 20-dimensional embeddings of the users and items. We then normalize the ratings of each dataset into the range [-1, 1] and use them as the users' instinctive feedback, and combine a sequential reward with this instinctive reward. For instance, if the recommender system recommends an item $i_t$ to a user $u$ at timestep $t$, the final reward function comes as

(16)  $r_t = r_{\mathrm{MF}}(u, i_t) + \alpha\,(c_p - c_n)$

where $r_{\mathrm{MF}}(u, i_t)$ is the predicted rating given by the simulator, $c_p$ and $c_n$ are the consecutive positive and negative counts representing the sequential pattern, and $\alpha$ is a trade-off between the instinctive feedback and the sequential nature. In our experiment, $\alpha$ is chosen from {0.0, 0.1, 0.2}, following the empirical experience in (Chen et al., 2019b).
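Assuming the combination described above takes an additive form with trade-off alpha (the exact functional form in the simulator may differ), the reward computation can be sketched as:

```python
# One plausible form of the simulator reward (hedged: the exact functional
# form may differ): instinctive predicted rating plus a sequential term
# weighted by the trade-off alpha.
def simulator_reward(predicted_rating, consec_pos, consec_neg, alpha=0.1):
    # consecutive positive feedback amplifies the reward,
    # consecutive negative feedback dampens it
    return predicted_rating + alpha * (consec_pos - consec_neg)

print(simulator_reward(0.8, consec_pos=3, consec_neg=0, alpha=0.1))
```
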
For each dataset, we randomly divide the users into two parts: 80% of the users are used for training the parameters of our model, and the other 20% are used for testing the model performance. Due to this train/test splitting style, the users in our test set never appear in the training set. That is to say, the experiment is a cold-start scenario: there is no user click history at the beginning. To handle the cold start, the recommender system collects the most popular items among the training users and recommends a popular item to a test user at the first step. Then, according to the user's feedback, the recommender system recommends items to the user interactively. Besides, we remove the recommended items from the candidate set to avoid repeated recommendations within one episode. The episode length is $T = 32$ for both datasets in our experiment.
5.1.3. Evaluation Metrics
Three evaluation metrics are used.
Average Reward. As an IRS aims to maximize the reward of the whole episode, a straightforward evaluation measure is the average reward over all interactions of the test users:

(17)  $\text{Reward} = \frac{1}{|\mathcal{U}_{\text{test}}|} \sum_{u \in \mathcal{U}_{\text{test}}} \frac{1}{T} \sum_{t=1}^{T} r_t^{u}$

We also report the precision and recall during the $T$ timesteps of the interactions, which are widely used metrics in traditional recommendation tasks.

Average Cumulative Precision@$T$.

(18)  $\text{Precision@}T = \frac{1}{|\mathcal{U}_{\text{test}}|} \sum_{u \in \mathcal{U}_{\text{test}}} \frac{1}{T} \sum_{t=1}^{T} rel_t^{u}$

Average Cumulative Recall@$T$.

(19)  $\text{Recall@}T = \frac{1}{|\mathcal{U}_{\text{test}}|} \sum_{u \in \mathcal{U}_{\text{test}}} \sum_{t=1}^{T} \frac{rel_t^{u}}{\#\text{rel}^{u}}$

We define $rel_t = 1$ if the instinctive feedback of the recommended item given by the simulator is higher than a predefined threshold (0.5 in Book-Crossing and 3.5 in MovieLens-20M), and $rel_t = 0$ otherwise. $\#\text{rel}$ is the total number of positive instinctive feedbacks among all items, i.e., the number of items with $rel = 1$ based on the simulator.
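The cumulative metrics reduce to simple ratios per episode, as in this sketch (the feedback sequence and relevant-item count are hypothetical):

```python
# Minimal sketch of the cumulative metrics over one T-step episode,
# given the per-step binary relevance of the recommended items.
def precision_at_T(relevance):
    return sum(relevance) / len(relevance)

def recall_at_T(relevance, num_relevant_total):
    return sum(relevance) / num_relevant_total

clicks = [1, 0, 1, 1]  # hypothetical feedback over T = 4 steps
print(precision_at_T(clicks))   # 0.75
print(recall_at_T(clicks, 30))  # 0.1
```
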
Significance test. The Wilcoxon signed-rank test has been performed to evaluate whether the differences between KGQR and the other baselines are significant.
5.1.4. Baselines
We compare KGQR with 7 representative baseline methods in the IRS scenario, where GreedySVD and GRU4Rec are traditional recommendation methods, LinearUCB and HLinearUCB are based on multi-armed bandits, and DDPG, DDPG-R and DQN-R are DRL-based methods.

 GreedySVD:

is a well-known collaborative filtering method via singular value decomposition (Koren, 2008). In interactive scenarios, we train the model after each interaction with users and recommend the item with the highest predicted rating to the user.
 GRU4Rec:

is a representative RNN-based sequential recommendation algorithm (Hidasi et al., 2016) to predict what the user will click at the next timestep based on the browsing histories.
 LinearUCB:

is a multi-armed bandit algorithm (Li et al., 2010) which selects items according to the estimated upper confidence bound (UCB) of the potential reward based on contextual information about the users and items.
 HLinearUCB:

is a contextual bandit algorithm combined with extra hidden features (Wang et al., 2017).
 DDPG:

is a DDPG-based method (Dulac-Arnold et al., 2015) which represents the discrete actions in a continuous vector space. The actor selects a proto-action in the continuous space and then chooses the item with the maximum Q-value from the candidate items selected via nearest neighbors (KNN) according to the proto-action. In this approach, a larger $k$ boosts the performance but also brings computational cost, indicating a trade-off between performance and efficiency. In our experiment, $k$ is set to {1, 0.1N, N}, where $N$ is the total number of items.
 DDPG-R:

is a DDPG-based method (Zhao et al., 2018a) where the actor learns a ranking vector. This vector is used to compute the ranking score of each item, by performing a product operation between the vector and the item embedding. The item with the highest ranking score is then recommended to the user.
 DQN-R:

is a DQN-based method (Zheng et al., 2018) where the recommender system learns a Q-function to estimate the Q-values of all actions at a given state, and then recommends the item with the highest Q-value at the current state.
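The proto-action trick used by the DDPG baseline above can be sketched as follows (the array shapes, random data and scoring interface are illustrative assumptions):

```python
# Hedged sketch of the proto-action trick in the DDPG baseline: the actor
# emits a continuous vector, its k nearest item embeddings form the valid
# candidates, and the one with the largest Q-value is recommended.
import numpy as np

def knn_action(proto, item_embs, q_values, k):
    d = np.linalg.norm(item_embs - proto, axis=1)  # distance to every item
    knn = np.argsort(d)[:k]                        # k nearest valid items
    return knn[np.argmax(q_values[knn])]           # max Q-value among them

rng = np.random.default_rng(0)
items = rng.normal(size=(100, 4))   # item embeddings
q = rng.normal(size=100)            # per-item Q-values at the current state
a = knn_action(rng.normal(size=4), items, q, k=10)
print(int(a))  # index of the recommended item
```

This makes the performance/efficiency trade-off concrete: with k = 1 the method trusts the proto-action entirely, while k = N degenerates to scoring every item.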
Note that traditional knowledge-enhanced recommendation methods such as CKE (Zhang et al., 2016b), RippleNet (Wang et al., 2018), and KGAT (Wang et al., 2019a) are not suitable for the online interactive recommendation tested in this paper. Our recommendation process is online sequential recommendation under a cold-start setting, meaning there is no data about the test user at the beginning: user preferences are modeled in real time through the interaction process, and recommendations are made under the current situation. These traditional models cannot handle this cold-start problem, so we do not compare our proposed model with them.
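The online cold-start protocol described above can be sketched as a simple session loop (the callables `recommend`, `user_feedback`, and `update` are hypothetical stand-ins for the policy, the simulated user, and the online learning step, not the paper's API):

```python
def run_session(recommend, user_feedback, update, episode_len):
    """Minimal sketch of the online interactive protocol: the test user is
    cold-start (no offline history), so the state consists purely of
    feedback gathered within the current session."""
    history = []            # (item, feedback) pairs observed so far
    total_reward = 0.0
    for _ in range(episode_len):
        item = recommend(history)          # decide from the in-session state
        feedback = user_feedback(item)     # user clicks / rates the item
        history.append((item, feedback))   # the state grows within the session
        update(history)                    # online model update, no offline data
        total_reward += feedback
    return total_reward
```

This is exactly why the offline knowledge-enhanced baselines do not apply: at the first step `history` is empty, and every subsequent decision must be derived from within-session feedback alone.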
5.1.5. Parameter Settings
In KGQR, we set the same maximal hop number for both datasets. We tried larger hop numbers and found that they bring exponential growth of computational cost with only limited performance improvement. The dimension of the item embedding is fixed to 50 for all models. For the baseline methods, the item embedding is pretrained by matrix factorization on the training set. For KGQR, we pretrain the KG embedding with TransE (Bordes et al., 2013), and the embedding is then updated while learning the deep Q-network. All other parameters are randomly initialized from a uniform distribution. The policy network in all RL-based methods uses two fully-connected layers with ReLU activations. The hyperparameters of all models are chosen by grid search, including the learning rate, the norm-regularization weight, and the discount factor. All trainable parameters are optimized by the Adam optimizer (Kingma and Ba, 2015) in an end-to-end manner. We use PyTorch (Paszke et al., 2019) to implement the pipelines and train the networks on an NVIDIA GTX 1080Ti GPU. We repeat each experiment 5 times with different random seeds for KGQR and all baselines.

| Dataset | Method | Reward | Precision@32 | Recall@32 | Reward | Precision@32 | Recall@32 | Reward | Precision@32 | Recall@32 |
|---|---|---|---|---|---|---|---|---|---|---|
| BookCrossing | Greedy SVD | 0.0890 | 0.3947 | 0.0031 | 0.1637 | 0.4052 | 0.0032 | 0.2268 | 0.4133 | 0.0033 |
| | GRU4Rec | 0.5162 | 0.8611 | 0.0070 | 1.3427 | 0.8595 | 0.0070 | 2.1797 | 0.8625 | 0.0070 |
| | LinearUCB | 0.0885 | 0.3956 | 0.0032 | 0.1640 | 0.4049 | 0.0032 | 0.2268 | 0.4133 | 0.0033 |
| | HLinearUCB | 0.1346 | 0.3819 | 0.0031 | 0.3566 | 0.3841 | 0.0031 | 0.6064 | 0.3713 | 0.0031 |
| | DDPG-R | 0.5521 | 0.9115 | 0.0074 | 1.1412 | 0.8800 | 0.0072 | 2.2057 | 0.9270 | 0.0076 |
| | DDPG(K=1) | 0.3159 | 0.7302 | 0.0059 | 0.7312 | 0.7990 | 0.0065 | 0.8409 | 0.7472 | 0.0061 |
| | DDPG(K=0.1N) | 0.7312 | 0.9907 | 0.0080 | 2.0750 | 0.9813 | 0.0080 | 3.3288 | 0.9758 | 0.0079 |
| | DDPG(K=N) | 0.7639 | 0.9927 | 0.0081 | 2.2729 | 0.9942 | 0.0081 | 3.7179 | 0.9915 | 0.0081 |
| | DQN-R | 0.7634 | 0.9936 | 0.0081 | 2.2262 | 0.9907 | 0.0080 | 3.6118 | 0.9881 | 0.0080 |
| | KGQR | 0.8307* | 0.9945* | 0.0081 | 2.3451* | 0.9971* | 0.0081 | 3.7661* | 0.9966* | 0.0081* |
| MovieLens-20M | Greedy SVD | 0.4320 | 0.6569 | 0.0049 | 0.6915 | 0.6199 | 0.0048 | 0.9042 | 0.5932 | 0.0046 |
| | GRU4Rec | 0.7822 | 0.8382 | 0.0074 | 1.5267 | 0.8253 | 0.0072 | 2.3500 | 0.8316 | 0.0073 |
| | LinearUCB | 0.2307 | 0.3790 | 0.0029 | 0.6147 | 0.5821 | 0.0046 | 0.8017 | 0.5614 | 0.0044 |
| | HLinearUCB | 0.0995 | 0.3852 | 0.0029 | 0.0172 | 0.3841 | 0.0028 | 0.2265 | 0.3774 | 0.0027 |
| | DDPG-R | 0.2979 | 0.4917 | 0.0034 | 1.4952 | 0.7626 | 0.0055 | 2.3003 | 0.6977 | 0.0045 |
| | DDPG(K=1) | 0.5755 | 0.7293 | 0.0059 | 1.0854 | 0.7165 | 0.0058 | 1.6912 | 0.7371 | 0.0061 |
| | DDPG(K=0.1N) | 0.6694 | 0.8167 | 0.0070 | 1.1578 | 0.8165 | 0.0069 | 2.2212 | 0.8215 | 0.0068 |
| | DDPG(K=N) | 0.8071 | 0.9606 | 0.0082 | 2.1544 | 0.9446 | 0.0081 | 3.6071 | 0.9533 | 0.0082 |
| | DQN-R | 0.8863 | 0.9680 | 0.0086 | 2.3025 | 0.9667 | 0.0081 | 3.4036 | 0.9089 | 0.0071 |
| | KGQR | 0.9213* | 0.9726* | 0.0086* | 2.4242* | 0.9722* | 0.0083* | 3.7695* | 0.9713* | 0.0084* |
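The TransE pretraining used for the KG embeddings (Section 5.1.5) rests on a simple translational principle, sketched below; the function names and toy 2-d vectors are ours, not the actual pretraining code:

```python
import numpy as np

def transe_energy(h, r, t):
    """TransE (Bordes et al., 2013) models a relation as a translation:
    a true triple (h, r, t) should satisfy h + r ~ t, so its energy
    ||h + r - t|| should be small."""
    return float(np.linalg.norm(h + r - t))

def margin_loss(pos, neg, margin=1.0):
    """Margin-based ranking loss used to pretrain KG embeddings: push the
    energy of a corrupted (negative) triple above that of the true triple
    by at least `margin`."""
    return max(0.0, margin + transe_energy(*pos) - transe_energy(*neg))
```

In KGQR these pretrained entity embeddings are only an initialization; they keep being updated by the deep Q-network's task signal.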
5.2. Overall Performance (RQ1)
Table 3 reports the performance comparison results. We make the following observations: (i) KGQR consistently obtains the best performance across all environment settings on both datasets. For instance, compared to RL-based recommendation methods, i.e., DQN-based (DQN-R) and DDPG-based (DDPG, DDPG-R) ones, KGQR improves over the strongest baseline in terms of Reward by 3.2% and 5.3% on BookCrossing and MovieLens-20M, respectively. For traditional evaluation metrics, KGQR improves Precision@32 by 0.5% and 1.9% on the two datasets, respectively. This demonstrates that leveraging the prior knowledge in the KG significantly improves recommendation performance. (ii) In most conditions, non-RL methods, including conventional and MAB-based methods, perform worse than RL-based methods. Two reasons account for this significant performance gap. On the one hand, except for GRU4Rec, the capacity of the non-RL methods is limited because they model user preference without considering sequential information. On the other hand, they all focus on the immediate reward of an item and do not fold the value of the rest of the sequence into the current decision, which makes them perform even worse in environments that weight future rewards more heavily. (iii) Among the RL-based baselines, DQN-R and DDPG(K=N) achieve much better performance than the other DDPG-based methods. When K = N (N being the total number of items), DDPG can be seen as a greedy policy that always picks the item with the max Q-value. We also notice that the training of the DDPG-based methods is not stable; e.g., their training curves sometimes experience a sudden drop. This may be because the continuous proto-action picked by the actor is inconsistent with the final discrete action the critic is trained on, and such inconsistency between actor and critic can result in inferior performance.
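The long-term objective behind observation (ii) can be made concrete with a small sketch of the discounted return; this is a generic illustration of the RL objective, not the paper's implementation:

```python
def discounted_return(rewards, gamma):
    """Return sum_t gamma^t * r_t, the objective RL-based recommenders
    optimize. gamma close to 1 values the whole session; gamma = 0
    recovers the myopic one-step objective of the non-RL baselines."""
    g = 0.0
    # Accumulate backwards: g_t = r_t + gamma * g_{t+1}.
    for r in reversed(rewards):
        g = r + gamma * g
    return g
```

For example, a myopic reward sequence [1, 0, 0, 0] beats [0, 1, 1, 1] when gamma = 0 but loses at gamma = 0.9, which is why methods optimizing only immediate reward degrade as the environment weights future rewards more.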
| Model | BookCrossing | | | | MovieLens-20M | | | |
|---|---|---|---|---|---|---|---|---|
| | 0.5 | 1.0 | 1.5 | 2.0 | 0.5 | 1.0 | 1.5 | 2.0 |
| DDPG-R | 0.46M | 2.49M | – | – | 2.28M | 2.62M | 3.82M | – |
| DDPG | 0.20M | 0.49M | 1.46M | 2.44M | 0.04M | 0.07M | 1.76M | 2.42M |
| DQN-R | 0.20M | 1.06M | 3.09M | 4.83M | 0.06M | 0.08M | 2.47M | 4.27M |
| KGQR | 0.06M | 0.17M | 0.24M | 0.40M | 0.04M | 0.06M | 0.16M | 0.33M |
5.3. Sample Efficiency (RQ2)
One motivation for exploiting the KG is to improve sample efficiency in RL-based recommendation, i.e., to reduce the amount of interaction data needed to achieve a given performance. In Figure 3 and Table 4, we analyze the number of interactions each DRL-based model needs to reach the same test reward, in the environment defined in Eq. (16). As can be observed, our proposed KGQR achieves the same performance as the other RL-based methods using the fewest interactions. More specifically, to reach a test reward of 2.0, KGQR needs only 17.3% and 13.6% of the interactions of the second most sample-efficient baseline (DDPG) on the two datasets. This empirically validates that sample efficiency is improved by utilizing the semantic and correlation information of items in the KG. A detailed analysis of the components contributing to sample efficiency is presented in Section 5.4.2.
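The measurement behind Table 4 can be sketched as a threshold crossing over logged training checkpoints (the checkpoint format is our assumption, not the paper's logging code):

```python
def interactions_to_reach(reward_curve, threshold):
    """Sample efficiency as reported in Table 4: the number of environment
    interactions consumed before the test reward first reaches `threshold`.
    `reward_curve` is a list of (num_interactions, test_reward) checkpoints
    in training order; returns None if the threshold is never reached."""
    for n_interactions, reward in reward_curve:
        if reward >= threshold:
            return n_interactions
    return None
```

The dashes in Table 4 (e.g., DDPG-R on BookCrossing at thresholds 1.5 and 2.0) correspond to the `None` case: the model never reached that reward within the training budget.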
| Component | KGQR-KG | KGQR-GCN-CS | KGQR-CS | KGQR |
|---|---|---|---|---|
| KG-emb | | ✓ | ✓ | ✓ |
| GCN-prop | | | ✓ | ✓ |
| CS | | | | ✓ |
| Model | BookCrossing | | MovieLens-20M | |
|---|---|---|---|---|
| | Reward | Precision@32 | Reward | Precision@32 |
| KGQR-KG | 2.2262 | 0.9907 | 2.3025 | 0.9667 |
| KGQR-GCN-CS | 2.2181 | 0.9819 | 2.2402 | 0.9621 |
| KGQR-CS | 2.2836 | 0.9939 | 2.3689 | 0.9698 |
| KGQR | 2.3451 | 0.9971 | 2.4242 | 0.9722 |
5.4. Analysis (RQ3)
In this section, we further analyze the effectiveness of the different components of the proposed framework. KGQR has three components utilizing the KG that may affect performance: KG-enhanced item representation, GCN propagation in the state representation (Section 4.1), and neighbor-based candidate selection (Section 4.2). To study the effectiveness of each component, we evaluate four KGQR variants: KGQR-KG (i.e., DQN-R), KGQR-CS, KGQR-GCN-CS, and KGQR. The mapping between variants and components is presented in Table 5. In the ablation study, we consider the environment defined in Eq. (16); Table 6 shows the performance of the four variants.
5.4.1. Recommendation performance
Effect of KG-enhanced item representation. In KGQR-KG, the item embeddings are pretrained by an MF model on the historical interaction data, while in KGQR-GCN-CS they are retrieved from the KG, pretrained with TransE. The marginal difference between the performance of KGQR-KG and KGQR-GCN-CS therefore indicates that the information in the KG contributes about as much as the historical interaction data, which suggests the suitability of a KG for cold-start scenarios (i.e., when no historical interaction data exists).
Effect of GCN propagation in state representation. Comparing KGQR-CS with KGQR-GCN-CS in Table 6, the improvement of KGQR-CS indicates that the signal from the RL-based recommendation task guides the update of the KG embedding through GCN propagation, so that the items in the KG are represented more suitably for the specific recommendation task at hand.
Effect of neighbor-based candidate selection. The comparison between KGQR-CS and KGQR validates the effectiveness of neighbor-based candidate selection: the candidate selection module leverages the local structure of the interacted items in the KG to filter out irrelevant items, and this restricted retrieval improves the final recommendation performance.
To study the influence of candidate size, we vary it over {1000, 2000, 3000, 5000, 10000} and present the recommendation performance in Figure 4. Performance first grows as the candidate size increases, since a small candidate set limits the possible choices of the recommendation algorithm. However, further increasing the candidate size degrades performance, because it reintroduces irrelevant items that the neighbor-based selection would otherwise have filtered out in advance. Such irrelevant items have very little chance of being recommended and collecting feedback, so the recommendation algorithm cannot learn them well, which eventually harms overall performance.
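Neighbor-based candidate selection can be sketched as a bounded breadth-first expansion over the KG from the user's positively rated items; the adjacency dict `adj` is a hypothetical stand-in for the paper's KG data structure:

```python
def khop_candidates(adj, clicked_items, hops):
    """Restrict the action space to items reachable within `hops` steps in
    the KG from the user's interacted items, instead of scoring the whole
    catalogue. `adj` maps an item id to the ids of its KG neighbors."""
    candidates = set(clicked_items)
    frontier = set(clicked_items)
    for _ in range(hops):
        nxt = set()
        for item in frontier:
            nxt.update(adj.get(item, ()))
        frontier = nxt - candidates   # only newly reached items expand further
        candidates |= frontier
    return candidates
```

Growing `hops` (or, equivalently, the candidate cap) enlarges this set, which reproduces the trade-off observed in Figure 4: too small a set limits choice, too large a set readmits irrelevant items.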
5.4.2. Sample efficiency
Effect of KG-enhanced state representation. Comparing the number of interactions KGQR-KG and KGQR need to reach the same test reward in Figure 3, we notice that in both environments the use of the KG with task-specific representation learning improves sample efficiency. This observation supports our motivation that propagating user preference through correlated items via GCN helps alleviate the sample-efficiency issue.
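This propagation effect can be illustrated with a single simplified graph-convolution step (mean aggregation plus a linear map and ReLU, in the spirit of Kipf and Welling (2016); this is our sketch, not the paper's exact layer):

```python
import numpy as np

def gcn_layer(embs, adj, W):
    """One simplified GCN step: each item's new representation mixes its own
    embedding with the mean of its neighbors', so feedback on one item also
    refines the representations of correlated items."""
    agg = np.zeros_like(embs)
    for i in range(embs.shape[0]):
        neigh = adj.get(i, [])
        # Average the item's own embedding with its neighbors' embeddings.
        stack = np.vstack([embs[i]] + [embs[j] for j in neigh])
        agg[i] = stack.mean(axis=0)
    return np.maximum(agg @ W, 0.0)   # linear transform + ReLU
```

Because connected items share aggregated information, one observed click updates the representations of its KG neighbors as well, which is the mechanism behind the reduced interaction counts in Figure 3.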
Effect of neighbor-based candidate selection. Besides the performance improvements, candidate selection also significantly improves sample efficiency, as shown in Figure 3 (compare the purple and red lines).
6. Conclusion
In this work, we proposed KGQR, a knowledge graph enhanced Q-learning framework for interactive recommendation. To the best of our knowledge, it is the first work leveraging a KG in RL-based interactive recommender systems, which to a large extent addresses the sample complexity issue and significantly improves performance. Specifically, we narrow down the action space by utilizing the structure of the knowledge graph, which effectively addresses the large-action-space issue, and we propagate user preference among correlated items in the graph to deal with the extremely sparse user feedback in IRS. Both designs improve sample efficiency, a common issue of previous works. Comprehensive experiments in a carefully designed simulation environment based on two real-world datasets demonstrate that our model achieves significantly better performance with higher sample efficiency than state-of-the-art methods.
For future work, we plan to investigate KGQR on news and image recommendation tasks with other DRL frameworks such as policy gradient and DDPG. We also plan to deploy KGQR in an online commercial recommender system. Further, we are interested in introducing a more expressive sequential model to represent the dynamics of user preferences, e.g., by taking into account the user's propensity toward the different relations revealed by her click history.
Acknowledgement
The corresponding author Weinan Zhang thanks the support of the "New Generation of AI 2030" Major Project (2018AAA0100900) and NSFC (61702327, 61772333, 61632017, 81771937). The work is also sponsored by the Huawei Innovation Research Program.
References
 Bellman (1952) Richard Bellman. 1952. On the theory of dynamic programming. Proceedings of the National Academy of Sciences of the United States of America 38, 8 (1952), 716.
 Bordes et al. (2013) Antoine Bordes, Nicolas Usunier, Alberto Garcia-Duran, Jason Weston, and Oksana Yakhnenko. 2013. Translating embeddings for modeling multi-relational data. In NeurIPS'13. 2787–2795.
 Chen et al. (2019b) Haokun Chen, Xinyi Dai, Han Cai, Weinan Zhang, Xuejian Wang, Ruiming Tang, Yuzhou Zhang, and Yong Yu. 2019b. Largescale interactive recommendation with treestructured policy gradient. In AAAI’19, Vol. 33. 3312–3320.
 Chen et al. (2019a) Minmin Chen, Alex Beutel, Paul Covington, Sagar Jain, Francois Belletti, and Ed H Chi. 2019a. Topk offpolicy correction for a REINFORCE recommender system. In WSDM’19. ACM, 456–464.

 Cheng et al. (2016) Heng-Tze Cheng, Levent Koc, Jeremiah Harmsen, Tal Shaked, Tushar Chandra, Hrishi Aradhye, Glen Anderson, Greg Corrado, Wei Chai, Mustafa Ispir, et al. 2016. Wide & deep learning for recommender systems. In Proceedings of the 1st Workshop on Deep Learning for Recommender Systems. ACM, 7–10.
 Cho et al. (2014) Kyunghyun Cho, Bart Van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078 (2014).
 DulacArnold et al. (2015) Gabriel DulacArnold, Richard Evans, Hado van Hasselt, Peter Sunehag, Timothy Lillicrap, Jonathan Hunt, Timothy Mann, Theophane Weber, Thomas Degris, and Ben Coppin. 2015. Deep reinforcement learning in large discrete action spaces. arXiv preprint arXiv:1512.07679 (2015).
 Grotov and de Rijke (2016) Artem Grotov and Maarten de Rijke. 2016. Online learning to rank for information retrieval: SIGIR 2016 Tutorial. In SIGIR’16. ACM, 1215–1218.
 Hamilton et al. (2017) Will Hamilton, Zhitao Ying, and Jure Leskovec. 2017. Inductive representation learning on large graphs. In NeurIPS'17. 1024–1034.
 Hausknecht and Stone (2015) Matthew Hausknecht and Peter Stone. 2015. Deep recurrent qlearning for partially observable mdps. In 2015 AAAI Fall Symposium Series.
 He et al. (2017) Xiangnan He, Lizi Liao, Hanwang Zhang, Liqiang Nie, Xia Hu, and TatSeng Chua. 2017. Neural collaborative filtering. In WebConf’17. International World Wide Web Conferences Steering Committee, 173–182.
 Hidasi et al. (2016) Balázs Hidasi, Alexandros Karatzoglou, Linas Baltrunas, and Domonkos Tikk. 2016. Sessionbased recommendations with recurrent neural networks. ICLR’16.
 Hu et al. (2018) Yujing Hu, Qing Da, Anxiang Zeng, Yang Yu, and Yinghui Xu. 2018. Reinforcement learning to rank in ecommerce search engine: Formalization, analysis, and application. In SIGKDD’18. ACM, 368–377.
 Huang et al. (2018) Jin Huang, Wayne Xin Zhao, Hongjian Dou, JiRong Wen, and Edward Y Chang. 2018. Improving sequential recommendation with knowledgeenhanced memory networks. In SIGIR’18. ACM, 505–514.
 Ji et al. (2015) Guoliang Ji, Shizhu He, Liheng Xu, Kang Liu, and Jun Zhao. 2015. Knowledge graph embedding via dynamic mapping matrix. In IJCNLP’15. 687–696.
 Kingma and Ba (2015) Diederik P Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. ICLR’15.
 Kipf and Welling (2016) Thomas N Kipf and Max Welling. 2016. Semisupervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907 (2016).
 Koren (2008) Yehuda Koren. 2008. Factorization meets the neighborhood: a multifaceted collaborative filtering model. In SIGKDD’08. ACM, 426–434.
 Koren et al. (2009) Yehuda Koren, Robert Bell, and Chris Volinsky. 2009. Matrix factorization techniques for recommender systems. Computer 8 (2009), 30–37.
 Li et al. (2010) Lihong Li, Wei Chu, John Langford, and Robert E Schapire. 2010. A contextualbandit approach to personalized news article recommendation. In WebConf’10. ACM, 661–670.
 Lin et al. (2015) Yankai Lin, Zhiyuan Liu, Maosong Sun, Yang Liu, and Xuan Zhu. 2015. Learning entity and relation embeddings for knowledge graph completion. In AAAI’15.
 Mahmood and Ricci (2007) Tariq Mahmood and Francesco Ricci. 2007. Learning and adaptivity in interactive recommender systems. In Proceedings of the ninth international conference on Electronic commerce. ACM, 75–84.
 Narasimhan et al. (2015) Karthik Narasimhan, Tejas Kulkarni, and Regina Barzilay. 2015. Language understanding for textbased games using deep reinforcement learning. EMNLP‘15.
 Paszke et al. (2019) Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. 2019. PyTorch: An imperative style, high-performance deep learning library. In NeurIPS'19. 8024–8035.
 Shi et al. (2015) Chuan Shi, Zhiqiang Zhang, Ping Luo, Philip S Yu, Yading Yue, and Bin Wu. 2015. Semantic path based personalized recommendation on weighted heterogeneous information networks. In CIKM’15. ACM, 453–462.
 Silver et al. (2016) David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. 2016. Mastering the game of Go with deep neural networks and tree search. nature 529, 7587 (2016), 484.
 Van Hasselt et al. (2016) Hado Van Hasselt, Arthur Guez, and David Silver. 2016. Deep reinforcement learning with double qlearning. In AAAI’16.
 Veličković et al. (2017) Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Lio, and Yoshua Bengio. 2017. Graph attention networks. arXiv preprint arXiv:1710.10903 (2017).
 Wang et al. (2017) Huazheng Wang, Qingyun Wu, and Hongning Wang. 2017. Factorization bandits for interactive recommendation. In AAAI’17.
 Wang et al. (2018) Hongwei Wang, Fuzheng Zhang, Jialin Wang, Miao Zhao, Wenjie Li, Xing Xie, and Minyi Guo. 2018. RippleNet: Propagating user preferences on the knowledge graph for recommender systems. In CIKM’18. ACM, 417–426.
 Wang et al. (2019b) Hongwei Wang, Fuzheng Zhang, Miao Zhao, Wenjie Li, Xing Xie, and Minyi Guo. 2019b. MultiTask Feature Learning for Knowledge Graph Enhanced Recommendation. In WebConf’19. ACM, 2000–2010.
 Wang et al. (2019c) Hongwei Wang, Miao Zhao, Xing Xie, Wenjie Li, and Minyi Guo. 2019c. Knowledge graph convolutional networks for recommender systems. In WebConf’19. ACM, 3307–3313.
 Wang et al. (2006) Jun Wang, Arjen P De Vries, and Marcel JT Reinders. 2006. Unifying userbased and itembased collaborative filtering approaches by similarity fusion. In SIGIR’06. ACM, 501–508.
 Wang et al. (2019a) Xiang Wang, Xiangnan He, Yixin Cao, Meng Liu, and TatSeng Chua. 2019a. KGAT: Knowledge Graph Attention Network for Recommendation. SIGKDD’19.
 Wang et al. (2016) Ziyu Wang, Tom Schaul, Matteo Hessel, Hado Van Hasselt, Marc Lanctot, and Nando De Freitas. 2016. Dueling network architectures for deep reinforcement learning. ICML’16.
 Yu et al. (2014) Xiao Yu, Xiang Ren, Yizhou Sun, Quanquan Gu, Bradley Sturt, Urvashi Khandelwal, Brandon Norick, and Jiawei Han. 2014. Personalized entity recommendation: A heterogeneous information network approach. In WSDM’14. ACM, 283–292.
 Zeng et al. (2016) Chunqiu Zeng, Qing Wang, Shekoofeh Mokhtari, and Tao Li. 2016. Online contextaware recommendation with time varying multiarmed bandit. In SIGKDD’16. ACM, 2025–2034.
 Zhang et al. (2016b) Fuzheng Zhang, Nicholas Jing Yuan, Defu Lian, Xing Xie, and WeiYing Ma. 2016b. Collaborative knowledge base embedding for recommender systems. In SIGKDD’16. ACM, 353–362.

 Zhang et al. (2016a) Weinan Zhang, Ulrich Paquet, and Katja Hofmann. 2016a. Collective noise contrastive estimation for policy transfer learning. In AAAI'16.
 Zhao et al. (2017) Huan Zhao, Quanming Yao, Jianda Li, Yangqiu Song, and Dik Lun Lee. 2017. Meta-graph based recommendation fusion over heterogeneous information networks. In SIGKDD'17. ACM, 635–644.
 Zhao et al. (2018a) Xiangyu Zhao, Long Xia, Liang Zhang, Zhuoye Ding, Dawei Yin, and Jiliang Tang. 2018a. Deep reinforcement learning for pagewise recommendations. In ACM RecSys’18. ACM, 95–103.
 Zhao et al. (2018b) Xiangyu Zhao, Liang Zhang, Zhuoye Ding, Long Xia, Jiliang Tang, and Dawei Yin. 2018b. Recommendations with negative feedback via pairwise deep reinforcement learning. In SIGKDD’18. ACM, 1040–1048.
 Zhao et al. (2013) Xiaoxue Zhao, Weinan Zhang, and Jun Wang. 2013. Interactive collaborative filtering. In CIKM’13. ACM, 1411–1420.
 Zheng et al. (2018) Guanjie Zheng, Fuzheng Zhang, Zihan Zheng, Yang Xiang, Nicholas Jing Yuan, Xing Xie, and Zhenhui Li. 2018. DRN: A deep reinforcement learning framework for news recommendation. In WebConf’18. International World Wide Web Conferences Steering Committee, 167–176.
 Zou et al. (2019) Lixin Zou, Long Xia, Zhuoye Ding, Jiaxing Song, Weidong Liu, and Dawei Yin. 2019. Reinforcement Learning to Optimize Longterm User Engagement in Recommender Systems. arXiv preprint arXiv:1902.05570 (2019).