Reinforcement Knowledge Graph Reasoning for Explainable Recommendation

by   Yikun Xian, et al.
Gerard de Melo
Rutgers University

Recent advances in personalized recommendation have sparked great interest in the exploitation of rich structured information provided by knowledge graphs. Unlike most existing approaches that only focus on leveraging knowledge graphs for more accurate recommendation, we perform explicit reasoning with knowledge for decision making so that the recommendations are generated and supported by an interpretable causal inference procedure. To this end, we propose a method called Policy-Guided Path Reasoning (PGPR), which couples recommendation and interpretability by providing actual paths in a knowledge graph. Our contributions include four aspects. We first highlight the significance of incorporating knowledge graphs into recommendation to formally define and interpret the reasoning process. Second, we propose a reinforcement learning (RL) approach featuring an innovative soft reward strategy, user-conditional action pruning and a multi-hop scoring function. Third, we design a policy-guided graph search algorithm to efficiently and effectively sample reasoning paths for recommendation. Finally, we extensively evaluate our method on several large-scale real-world benchmark datasets, obtaining favorable results compared with state-of-the-art methods.



page 1

page 2

page 3

page 4


Neural-Symbolic Reasoning over Knowledge Graph for Multi-stage Explainable Recommendation

Recent work on recommender systems has considered external knowledge gra...

Heterogeneous Relational Reasoning in Knowledge Graphs with Reinforcement Learning

Path-based relational reasoning over knowledge graphs has become increas...

Inferring Substitutable and Complementary Products with Knowledge-Aware Path Reasoning based on Dynamic Policy Network

Inferring the substitutable and complementary products for a given produ...

DeepPath: A Reinforcement Learning Method for Knowledge Graph Reasoning

We study the problem of learning to reason in large scale knowledge grap...

Multi-Hop Knowledge Graph Reasoning with Reward Shaping

Multi-hop reasoning is an effective approach for query answering (QA) ov...

Knowledge Aware Conversation Generation with Explainable Reasoning on Augmented Graphs

Two types of knowledge, triples from knowledge graphs and texts from uns...

Search from History and Reason for Future: Two-stage Reasoning on Temporal Knowledge Graphs

Temporal Knowledge Graphs (TKGs) have been developed and used in many di...
This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1. Introduction

Equipping recommendation systems with the ability to leverage knowledge graphs (KG) not only facilitates better exploitation of various structured information to improve the recommendation performance, but also enhances the explainability of recommendation models due to the intuitive ease of understanding relationships between entities (Zhang and Chen, 2018). Recently, researchers have explored the potential of knowledge graph reasoning in personalized recommendation. One line of research focuses on making recommendations using knowledge graph embedding models, such as TransE (Bordes et al., 2013) and node2vec (Grover and Leskovec, 2016)

. These approaches align the knowledge graph in a regularized vector space and uncover the similarity between entities by calculating their representation distance

(Zhang et al., 2016b). However, pure KG embedding methods lack the ability to discover multi-hop relational paths. Ai et al. (Ai et al., 2018) proposed to enhance the collaborative filtering (CF) method over KG embedding for personalized recommendation, followed by a soft matching algorithm to find explanation paths between users and items. However, one issue of this strategy is that the explanations are not produced according to the reasoning process, but instead are later generated by an empirical similarity matching between the user and item embeddings. Hence, their explanation component is merely trying to find a post-hoc explanation for the already chosen recommendations.

Another line of research investigates path-based recommendation. For example, Gao et al. (Gao et al., 2018) proposed the notion of meta-paths to reason over KGs. However, the approach has difficulty in coping with numerous types of relations and entities in large real-world KGs, and hence it is incapable of exploring relationships between unconnected entities. Wang et al. (Wang et al., 2019) first developed a path embedding approach for recommendation over KGs that enumerates all the qualified paths between every user–item pair, and then trained a sequential RNN model from the extracted paths to predict the ranking score for the pairs. The recommendation performance is further improved, but it is not practical to fully explore all the paths for each user–item pair in large-scale KGs.

We believe that an intelligent recommendation agent should have the ability to conduct explicit reasoning over knowledge graphs to make decisions, rather than merely embed the graph as latent vectors for similarity matching. In this paper, we consider knowledge graphs as a versatile structure to maintain the agent’s knowledge about users, items, other entities and their relationships. The agent starts from a user and conducts explicit multi-step path reasoning over the graph, so as to discover suitable items in the graph for recommendation to the target user. The underlying idea is that if the agent draws its conclusion based on an explicit reasoning path, it will be easy to interpret the reasoning process that leads to each recommendation. Thus, the system can provide causal evidence in support of the recommended items. Accordingly, our goal is not only to select a set of candidate items for recommendation, but also to provide the corresponding reasoning paths in the graph as interpretable evidence for why a given recommendation is made. As an example illustrated in Figure 1, given user , the algorithm is expected to find candidate items and , along with their reasoning paths in the graph, e.g., and .

In this paper, we propose an approach that overcomes the shortcomings of previous work. Specifically, we cast the recommendation problem as a deterministic Markov Decision Process (MDP) over the knowledge graph. We adopt a Reinforcement Learning (RL) approach, in which an agent starts from a given user, and learns to navigate to the potential items of interest, such that the path history can serve as a genuine explanation for why the item is recommended to the user.

The main challenges are threefold. First, it is non-trivial to measure the correctness of an item for a user, so careful consideration is needed regarding the terminal conditions and RL rewards. To solve the problem, we design a soft reward strategy based on a multi-hop scoring function that leverages the rich heterogeneous information in the knowledge graph. Second, the size of the action space depends on the out-degrees in the graph, which can be very large for some nodes, so it is important to conduct an efficient exploration to find promising reasoning paths in the graph. In this regard, we propose a user-conditional action pruning strategy to decrease the size of the action spaces while guaranteeing the recommendation performance. Third, the diversity of both items and paths must be preserved when the agent is exploring the graph for recommendation, so as to avoid being trapped in limited regions of items. To achieve this, we design a policy-guided search algorithm to sample reasoning paths for recommendation in the inference phase. We conduct several case studies on the reasoning paths to qualitatively evaluate the diversity of explainations for recommendation.

The major contributions of this paper can be outlined as follows.

  1. We highlight the significance of incorporating rich heterogeneous information into the recommendation problem to formally define and interpret the reasoning process.

  2. We propose an RL-based approach to solve the problem, driven by our soft reward strategy, user-conditional action pruning, and a multi-hop scoring strategy.

  3. We design a beam search-based algorithm guided by the policy network to efficiently sample diverse reasoning paths and candidate item sets for recommendation.

  4. We extensively evaluate the effectiveness of our method on several Amazon e-commerce domains, obtaining strong results as well as explainable reasoning paths.

The source code is available online.111Link to the source code:

Figure 1. Illustration of the Knowledge Graph Reasoning for Explainable Recommendation (KGRE-Rec) problem.

2. Related Work

2.1. Collaborative Filtering

Collaborative Filtering (CF) has been one of the most fundamental approaches for recommendation. Early approaches to CF consider the user–item rating matrix and predict ratings via user-based (Resnick et al., 1994; Konstan et al., 1997) or item-based (Sarwar et al., 2001; Linden et al., 2003)

collaborative filtering methods. With the development of dimension reduction methods, latent factor models such as matrix factorization gained widespread adoption in recommender systems. Specific techniques include singular value decomposition

(Koren et al., 2009), non-negative matrix factorization (Lee and Seung, 2001) and probabilistic matrix factorization (Mnih and Salakhutdinov, 2008)

. For each user and item, these approaches essentially learn a latent factor representation to calculate the matching score of the user–item pairs. Recently, deep learning and neural models have further extended collaborative filtering. These are broadly classified into two sub-categories: the similarity learning approach and the representation learning approach. Similarity learning adopts fairly simple user/item embeddings (e.g., one-hot vectors) and learns a complex prediction network as a similarity function to compute user–item matching scores

(He et al., 2017b). In contrast, the representation learning approach learns much richer user/item representations but adopts a simple similarity function (e.g., inner product) for score matching (Zhang et al., 2017). However, researchers have noticed the difficulty of explaining the recommendation results in latent factor or latent representation models, making explainable recommendation (Zhang et al., 2014; Zhang and Chen, 2018) an important research problem for the community.

Figure 2. Pipeline of our Policy-Guided Path Reasoning method for recommendation. The algorithm aims to learn a policy that navigates from a user to potential items of interest by interacting with the knowledge graph environment. The trained policy is then adopted for the path reasoning phase to make recommendations to the user.

2.2. Recommendation with Knowledge Graphs

Some previous efforts have made recommendations to users with the help of knowledge graph embeddings (Nickel et al., 2011; Bordes et al., 2013). One research direction leverages knowledge graph embeddings as rich content information to enhance the recommendation performance. For example, Zhang et al. (Zhang et al., 2016b) adopted knowledge base embeddings to generate user and item representations for recommendation, while Huang et al. (Huang et al., 2018) employed memory networks over knowledge graph entity embeddings for recommendation. Wang et al. (Wang et al., 2018b) proposed a ripple network approach for embedding-guided multi-hop KG-based recommendation. Another research direction attempts to leverage the entity and path information in the knowledge graph to make explainable decisions. For example, Ai et al.  (Ai et al., 2018) incorporated the learning of knowledge graph embeddings for explainable recommendation. However, their explanation paths are essentially post-hoc explanations, as they are generated by soft matching after the corresponding items have been chosen. Wang et al. (Wang et al., 2019) proposed an RNN based model to reason over KGs for recommendation. However, it requires enumerating all the possible paths between each user–item pair for model training and prediction, which can be impractical for large-scale knowledge graphs.

2.3. Reinforcement Learning

Reinforcement Learning has attracted substantial interest in the research community. In recent years, there have been a series of widely noted successful applications of deep RL approaches (e.g., AlphaGo (Silver et al., 2016)), demonstrating their ability to better understand the environment, and enabling them to infer high-level causal relationships. There have been attempts to invoke RL in recommender systems in a non-KG setting, such as for ads recommendation (Theocharous et al., 2015), news recommendation (Zheng et al., 2018) and post-hoc explainable recommendation (Wang et al., 2018a). At the same time, researchers have also explored RL in KG settings for other tasks such as question answering (QA) (Xiong et al., 2017; Das et al., 2017; Lin et al., 2018), which formulates multi-hop reasoning as a sequential decision making problem. For example, Xiong et al. (Xiong et al., 2017) leveraged reinforcement learning for path-finding, and Das et al. (Das et al., 2017) proposed a system called MINERVA that trains a model for multi-hop KG question answering. Lin et al. (Lin et al., 2018) proposed models for end-to-end RL-based KG question answering with reward shaping and action dropout. However, to the best of our knowledge, there is no previous work utilizing RL in KGs for the task of recommendation, especially when the KG has an extremely large action space for each entity node as the number of path hops grow.

3. Methodology

In this section, we first formalize a new recommendation problem called Knowledge Graph Reasoning for Explainable Recommendation. Then we present our approach based on reinforcement learning over knowledge graphs to solve the problem.

3.1. Problem Formulation

In general, a knowledge graph with entity set and relation set is defined as , where each triplet represents a fact of the relation from head entity to tail entity . In this paper, we consider a special type of knowledge graph for explainable recommendation, denoted by . It contains a subset of a User entities and a subset of Item entities , where and . These two kinds of entities are connected through relations . We give a relaxed definition of -hop paths over the graph as follows.

Definition 3.1 ().

(-hop path) A -hop path from entity to entity is defined as a sequence of entities connected by relations, denoted by , where represents either or , .

Now, the problem of Knowledge Graph Reasoning for Explainable Recommendation (KGRE-Rec) can be formalized as below.

Definition 3.2 ().

(KGRE-Rec Problem) Given a knowledge graph , user and integers and , the goal is to find a recommendation set of items such that each pair is associated with one reasoning path , and is the number of recommendations.

In order to simultaneously conduct item recommendation and path finding, we consider three aspects that result in a good solution to the problem. First, we do not have pre-defined targeted items for any user, so it is not applicable to use a binary reward indicating whether the user interacts with the item or not. A better design of the reward function is to incorporate the uncertainty of how an item is relevant to a user based on the rich heterogeneous information given by the knowledge graph. Second, out-degrees of some entities may be very large, which degrades the efficiency of finding paths from users to potential item entities. Enumeration of all possible paths between each user and all items is unfeasible on very large graphs. Thus, the key challenge is how to effectively perform edge pruning and efficiently search relevant paths towards potential items using the reward as a heuristic. Third, for every user, the diversity of reasoning paths for recommended items should be guaranteed. It is not reasonable to always stick to a specific type of reasoning path to provide explainable recommendations. One naive solution is post-hoc recommendation, which first generates candidate items according to some similarity measure, followed by a separate path finding procedure from the user to candidate items within the graph. The major downsides of this are that the recommendation process fails to leverage the rich heterogeneous meta-data in the knowledge graph, and that the generated paths are detached from the actual decision-making process adopted by the recommendation algorithm, which remains uninterpretable.

In the following sections, we introduce our Policy-Guided Path Reasoning method (PGPR) for explainable recommendation over knowledge graphs. It solves the problem through reinforcement learning by making recommendations while simultaneously searching for paths in the context of rich heterogeneous information in the KG. As illustrated in Figure 2, the main idea is to train an RL agent that learns to navigate to potentially “good” items conditioned on the starting user in the knowledge graph environment. The agent is then exploited to efficiently sample reasoning paths for each user leading to the recommended items. These sampled paths naturally serve as the explanations for the recommended items.

3.2. Formulation as Markov Decision Process

The starting point of our method is to formalize the KGRE-Rec problem as a Markov Decision Process (MDP) (Sutton and Barto, 2018). In order to guarantee path connectivity, we add two special kinds of edges to the graph . The first one are reverse edges, i.e., if , then , which are used for our path definition. The second are self-loop edges, associated with the no operation (NO-OP) relation, i.e., if , then .


The state at step is defined as a tuple , where is the starting user entity, is the entity the agent has reached at step , and is the history prior to step . We define the -step history as the combination of all entities and relations in the past steps, i.e., . Conditioned on some user , the initial state is represented as . Given some fixed horizon , the terminal state is .


The complete action space of state is defined as all possible outgoing edges of entity excluding history entities and relations. Formally, . Since the out-degree follows a long-tail distribution, some nodes have much larger out-degrees compared with the rest of nodes. It is fairly space-inefficient to maintain the size of the action space based on the largest out-degree. Thus, we introduce a user-conditional action pruning strategy that effectively keeps the promising edges conditioned on the starting user based on a scoring function. Specifically, the scoring function maps any edge to a real-valued score conditioned on user . Then, the user-conditional pruned action space of state , denoted by , is defined as:


where is a pre-defined integer that upper-bounds the size of the action space. The details of this scoring function will be discussed in the next section.


Given any user, there is no pre-known targeted item in the KGRE-Rec problem, so it is unfeasible to consider binary rewards indicating whether the agent has reached a target or not. Instead, the agent is encouraged to explore as many “good” paths as possible. Intuitively, in the context of recommendations, a “good” path is one that leads to an item that a user will interact with, with high probability. To this end, we consider to give a soft reward only for the terminal state

based on another scoring function . The terminal reward is defined as


where the value of is normalized to the range of . is also introduced in the next section.


Due to the graph properties, a state is determined by the position of the entity. Given a state and an action , the transition to the next state is:


One exception is that the initial state

is stochastic, which is determined by the starting user entity. For simplicity, we assume the prior distribution of users follows a uniform distribution so that each user is equally sampled at the beginning.


Based on our MDP formulation, our goal is to learn a stochastic policy that maximizes the expected cumulative reward for any initial user :


We solve the problem through REINFORCE with baseline (Sutton and Barto, 2018) by designing a policy network and a value network that share the same feature layers. The policy network takes as input the state vector

and binarized vector

of pruned action space and emits the probability of each action, with zero probability for actions not in . The value network maps the state vector to a real value, which is used as the baseline in REINFORCE. The structures of the two networks are defined as follows:


Here, are the learned hidden features of the state, is the Hadamard product, which is used to mask invalid actions here, and

is a non-linear activation function, for which we use an Exponential Linear Unit (ELU). State vectors

are represented as the concatenation of the embeddings , and history . For the binarized pruned action space , we set the maximum size among all pruned action spaces. The model parameters for both networks are denoted as . Additionally, we add a regularization term that maximizes the entropy of the policy in order to encourage the agent to explore more diverse paths. Finally, the policy gradient is defined as:


where is the discounted cumulative reward from state to the terminal state .

3.3. Multi-Hop Scoring Function

Now we present the scoring function for the action pruning strategy and the reward function. We start with some relevant concepts.

One property of the knowledge graph

is that given the type of a head entity and a valid relation, the type of tail entity is determined. We can extend this property by creating a chain rule of entity and relation types:

. If the types of entity and all relations are given, the types of all other entities are uniquely determined. According to this rule, we introduce the concept of patterns as follows.

Definition 3.3 ().

(-hop pattern) A sequence of relations is called a valid -hop pattern for two entities if there exists a set of entities whose types are uniquely determined such that forms a valid -hop path over .

One caveat with pattern is the direction of each relation, provided that we allow reverse edges in the path. For entities and relation , represents either or in the path. We refer to the relation as a forward one if and , or as a backward one if and .

In order to define the scoring functions for action pruning and reward, we consider a special case of patterns with both forward and backward relations.

Definition 3.4 ().

(1-reverse -hop pattern) A -hop pattern is 1-reverse, denoted by , if are forward and are backward.

In other words, paths with a 1-reverse -hop pattern have the form of . Note that the pattern contains all backward relations when , and all forward relations when .

Now we define a general multi-hop scoring function of two entities given 1-reverse -hop pattern as follows.


where is the dot product operation, , are -dimensional vector representations of the entity and relation , and is the bias for entity . When ,

, the scoring function simply computes the cosine similarity between two vectors:


When ,, the scoring function computes the similarity between two entities via translational embeddings (Bordes et al., 2013):


For , the scoring function in Equation 9 quantifies the similarity of two entities based on a 1-reverse -hop pattern.

Scoring Function for Action Pruning

We assume that for user entity and another entity of other type, there exists only one 1-reverse -hop pattern for some integer . For entity , we denote as the smallest such that is a valid pattern for entities . Therefore, the scoring function for action pruning is defined as .

Scoring Function for Reward

We simply use the 1-hop pattern between user entity and item entity, i.e., . The scoring function for reward design is defined as .

Learning Scoring Function

A natural question that arises is how to train the embeddings for each entity and relation. For any pair of entities with valid -hop pattern , we seek to maximize the conditional probability of , which is defined as


However, due to the huge size of the entity set , we adopt a negative sampling technique (Mikolov et al., 2013) to approximate :


The goal is to maximize the objective function , defined as:


where is 1 if is a valid pattern for entities and 0 otherwise.

3.4. Policy-Guided Path Reasoning

Input : , , ,
Output : path set , probability set , reward set
1 Initialize , , ;
2 for  to  do
3       Initialize , , ;
4       forall , ,  do
5             path ;
6             Set ;
7             Get user-conditional pruned action space from environment given state ;
8             and ;
9             Actions ;
10             forall  do
11                   Get from environment given action ;
12                   Save new path to ;
13                   Save new probability to ;
14                   Save new reward to ;
16             end forall
18       end forall
20 end for
21Save if the path ends with an item;
22 return filtered ;
Algorithm 1 Policy-Guided Path Reasoning
CDs & Vinyl Clothing Cell Phones Beauty
Entities Description Number of Entities
User User in recommender system 075,258 39,387 27,879 22,363
Item Product to be recommended to users 064,443 23,033 10,429 12,101
Feature A product feature word from reviews 202,959 21,366 22,493 22,564
Brand Brand or manufacturer of the product 001,414 01,182 00,955 02,077
Category Category of the product 000,770 01,193 00,206 00,248
Relations Description Number of Relations per Head Entity
Purchase User Item
Mention User Feature
Described_by Item Feature
Belong_to Item Category
Produced_by Item Brand
Also_bought Item Item
Also_viewed Item another Item
Bought_together Item another Item
Table 1. Descriptions and statistics of four Amazon e-commerce datasets: CDs & Vinyl, Clothing, Cell Phones and Beauty.

The final step is to solve our recommendation problem over the knowledge graph guided by the trained policy network. Recall that given a user , the goal is to find a set of candidate items and the corresponding reasoning paths . One straightforward way is to sample paths for each user according to the policy network . However, this method cannot guarantee the diversity of paths, because the agent guided by the policy network is likely to repeatedly search the same path with the largest cumulative rewards. Therefore, we propose to employ beam search guided by the action probability and reward to explore the candidate paths as well as the recommended items for each user. The process is described as Algorithm 1. It takes as input the given user , the policy network , horizon , and predefined sampling sizes at each step, denoted by . As output, it delivers a candidate set of -hop paths for the user with corresponding path generative probabilities and path rewards . Note that each path ends with an item entity associated with a path generative probability and a path reward.

For the acquired candidate paths, there may exist multiple paths between the user and item . Thus, for each pair of in the candidate set, we select the path from with the highest generative probability based on as the one to interpret the reasoning process of why item is recommended to . Finally, we rank the selected interpretable paths according to the path reward in and recommend the corresponding items to the user.

4. Experiments

In this section, we extensively evaluate the performance of our PGPR method on real-world datasets. We first introduce the benchmarks for our experiments and the corresponding experimental settings. Then we quantitatively compare the effectiveness of our model with other state-of-the-art approaches, followed by ablation studies to show how parameter variations influence our model.

4.1. Data Description

All experiments are conducted on the Amazon e-commerce datasets collection (He and McAuley, 2016a), consisting of product reviews and meta information from The datasets include four categories: CDs and Vinyl, Clothing, Cell Phones and Beauty. Each category is considered as an individual benchmark that constitutes a knowledge graph containing 5 types of entities and 7 types of relations. The description and statistics of each entity and relation can be found in Table 1. Note that once the type of head entity and relation are provided, the type of tail entity is uniquely determined. In addition, as shown in Table 1, we find that Mention and Described_by account for a very large proportion among all relations. These two relations are both connected to the Feature entity, which may contain redundant and less meaningful words. We thus adopt TF-IDF to eliminate less salient features in the preprocessing stage: For each dataset, we keep the frequency of feature words less than 5,000 with TF-IDF score . We adopt the same data split rule as in previous work (Zhang et al., 2017), which randomly sampled 70% of user purchases as the training data and took the rest 30% as test. The objective in the KGRE-Rec problem is to recommend items purchased by users in the test set together with reasoning paths for each user–item pair.

Dataset CDs & Vinyl Clothing Cell Phones Beauty
Measures () NDCG Recall HR Prec. NDCG Recall HR Prec. NDCG Recall HR Prec. NDCG Recall HR Prec.
BPR 2.009 2.679 8.554 1.085 0.601 1.046 1.767 0.185 1.998 3.258 5.273 0.595 2.753 4.241 8.241 1.143
BPR-HFT 2.661 3.570 9.926 1.268 1.067 1.819 2.872 0.297 3.151 5.307 8.125 0.860 2.934 4.459 8.268 1.132
VBPR 0.631 0.845 2.930 0.328 0.560 0.968 1.557 0.166 1.797 3.489 5.002 0.507 1.901 2.786 5.961 0.902
TransRec 3.372 5.283 11.956 1.837 1.245 2.078 3.116 0.312 3.361 6.279 8.725 0.962 3.218 4.853 0.867 1.285
DeepCoNN 4.218 6.001 13.857 1.681 1.310 2.332 3.286 0.229 3.636 6.353 9.913 0.999 3.359 5.429 9.807 1.200
CKE 4.620 6.483 14.541 1.779 1.502 2.509 4.275 0.388 3.995 7.005 10.809 1.070 3.717 5.938 11.043 1.371
JRL 5.378 7.545 16.774 2.085 1.735 2.989 4.634 0.442 4.364 7.510 10.940 1.096 4.396 6.949 12.776 1.546
PGPR (Ours) 5.590 7.569 16.886 2.157 2.858 4.834 7.020 0.728 5.042 8.416 11.904 1.274 5.449 8.324 14.401 1.707
Table 2. Overall recommendation effectiveness of our method compared to other baselines on four Amazon datasets. The results are reported in percentage (%) and are calculated based on the top-10 predictions in the test set. The best results are highlighted in bold and the best baseline results are marked with a star ().

4.2. Experimental Setup

Baselines & Metrics

We compare our results against previous state-of-the-art methods. BPR (Rendle et al., 2009) is a Bayesian personalized ranking model that learns latent embeddings of users and items. BPR-HFT (McAuley and Leskovec, 2013) is a Hidden Factors and Topics (HFT) model that incorporates topic distributions to learn latent factors from reviews of users or items. VBPR (He and McAuley, 2016b) is the Visual Bayesian Personalized Ranking method that builds upon the BPR model but incorporates visual product knowledge. TransRec (He et al., 2017a) invokes translation-based embeddings for sequential recommendation. It learns to map both user and item representations in a shared embedding space through personalized translation vectors. DeepCoNN

or Deep Cooperative Neural Networks

(Zheng et al., 2017) are a review-based convolutional recommendation model that learns to encode both users and products with reviews assisting in rating prediction. CKE or Collaborative Knowledge base Embedding (Zhang et al., 2016a) is a modern neural recommender system based on a joint model integrating matrix factorization and heterogeneous data formats, including textual contents, visual information and a structural knowledge base to infer the top- recommendations results. JRL (Zhang et al., 2017) is a start-of-the-art joint representation learning model for top- recommendation that utilizes multimodal information including images, text and ratings into a neural network. Note that we did not include (Wang et al., 2019) as a baseline because we are unable to enumerate all the possible paths between user–item pairs due to the large scale of our datasets.

All models are evaluated in terms of four representative top- recommendation measures: Normalized Discounted Cumulative Gain (NDCG), Recall, Hit Ratio (HR) and Precision (Prec.). These ranking metrics are computed based on the top-10 predictions for every user in the test set.

Implementation Details

The default parameter settings across all experiments are as follows. For the KGRE-Rec problem, we set the maximum path length to 3 based on the assumption that shorter paths are more reliable for users to interpret the reasons of recommendation. For models’ latent representations, the embeddings of all entities and relations are trained based on the 1-hop scoring function defined in Equation 11, and the embedding size is set to 100. On the RL side, the history vector is represented by the concatenation of embeddings of and , so the state vector is of size 400. The maximum size of the pruned action space is set to 250, i.e., there are at most 250 actions for any state. To encourage the diversity of paths, we further adopt action dropout on the pruned action space with a rate of . The discount factor is . For the policy/value network, , , and

. For all four datasets, our model is trained for 50 epochs using Adam optimization. We set a learning rate of

and a batch size of 64 for the CDs & Vinyl dataset, and a learning rate of and batch size of 32 for the other datasets. The weight of the entropy loss is . In the path reasoning phase, we set the sampling sizes at each step to for CDs & Vinyl, and for the other three datasets.

4.3. Quantitative Analysis

In this experiment, we quantitatively evaluate the performance of our model on the recommendation problem compared to other baselines on all four Amazon datasets. We follow the default setting as described in the previous section.

The results are reported in Table 2. Overall, our PGPR method consistently outperforms all other baselines on all datasets in terms of NDCG, Hit Rate, Recall and Precision. For example, it obtains a 3.94% NDCG improvement over the best baseline (JRL) on the CDs & Vinyl dataset, a significant improvement of 64.73% on Clothing, 15.53% on Cell Phone, and 23.95% on Beauty. Similar trends can be observed for Recall, Hit Rate and Precision on all datasets. This shows that searching for reasoning paths over the knowledge graph provides substantial benefits for product recommendation and hence—even in the absence of interpretability concerns—is a promising technique for recommendation over graphs. Our path reasoning process is guided by the learned policy network that incorporates rich heterogeneous information from knowledge graphs and captures multi-hop interactions among entities (e.g., user mentions feature, feature is described by item, item belongs to category, etc.). This also largely contributes to the promising recommendation performance of our method. Besides, we notice that directly applying TransE for recommendation (Ai et al., 2018) slightly outperforms ours. It can be regarded as a single-hop latent matching method, but the post-hoc explanations do not necessarily reflect the true reason of generating a recommendation. In contrast, our methods generate recommendations through an explicit path reasoning process over knowledge graphs, so that the explanations directly reflect how the decisions are generated, which makes the system transparent.

Dataset # Valid Paths/User # Items/User # Paths/Item
CDs & Vinyl
Cell Phone
Table 3. Results of number of valid paths per user, number of unique items per user and number of paths per item.
Figure 3. Recommendation effectiveness of our model under different sizes of pruned action spaces on the Clothing dataset. The results using multi-hop scoring function are also reported.
Figure 4. Recommendation effectiveness of our model under different sizes of pruned action spaces on the Beauty dataset. The results using multi-hop scoring function are also reported.

Furthermore, we examine the efficiency of our method in finding valid reasoning paths. A path is deemed valid if it starts from a user and ends at an item entity within three hops (i.e., at most four entities in a path). As shown in Table 3, we respectively report the average number of valid paths per user, the average number of unique items per user, and the average number of supportive paths per item. We observe two interesting facts. First, the success rate of our method to find valid paths is around 50%, which is calculated as the number of valid paths out of all sampled paths (200 paths for CDs & Vinyl dataset and 125 for others). Especially for the Cell Phone dataset, almost all paths are valid. Considering that the number of all possible paths from each user to items is very large and the difficulty of our recommendation problem is particularly high, these results suggest that our method performs very well in regard to path finding properties. Second, each recommended item is associated with around 1.6 reasoning paths. This implies that there are multiple reasoning paths that can serve as supportive evidence for each recommendation. One could hence consider providing more than one of these if users request further details.

4.4. Influence of Action Pruning Strategy

In this experiment, we evaluate how the performance of our model varies with different sizes of pruned action spaces.

Recall that conditioned on the starting user, the action space is pruned according to the scoring function defined in Equation 9, where actions with larger scores are more likely to be preserved. In other words, larger action spaces contain more actions that are less relevant to the user. This experiment aims to show whether larger action spaces are helpful in exploring more reasoning paths to find potential items. We experiment on two selected datasets, Beauty and Clothing, and follow the default setting from the previous section, except that the size of the pruned action space is varied from 100 to 500 with a step size of 50. The results on two datasets are plotted in Figures 3 and 4, respectively. The best baseline method JRL is also reported in the figures for comparison. Its performance does not depend on the action space.

There are two interesting observations in the results. First, our model outperforms JRL under most choices of pruned action space sizes. Take the Clothing dataset as an example. As shown in Figure 3, for any metric among NDCG, Recall, Hit Rate and Precision, the blue curve of our method is consistently above the red curve of JRL by a large margin for all sizes ranging from 100 to 500. The results further demonstrate the effectiveness of our method compared to other baselines. Second, the performance of our model is slightly influenced by the size of the pruned action space. As shown in both figures, the common trend is that a smaller pruned action space leads to better performance. This means that the scoring function is a good indicator for filtering proper actions conditioned on the starting user. Another possible reason is that larger action spaces require more exploration in RL, but for fair comparison, we set the same parameters such as learning rate and training steps across all different choices of action space, which may lead to suboptimal solutions in cases of larger action spaces.

Dataset Clothing Beauty
Sizes NDCG Recall HR Prec. NDCG Recall HR Prec.
2.858 4.834 7.020 0.728 5.449 8.324 14.401 1.707
2.918 4.943 7.217 0.749 5.555 8.470 14.611 1.749
2.538 4.230 6.177 0.636 4.596 6.773 12.130 1.381
2.988 5.074 7.352 0.767 5.749 8.882 15.268 1.848
2.605 4.348 6.354 0.654 4.829 7.138 12.687 1.458
3.051 5.207 7.591 0.791 5.863 9.108 15.599 1.905
2.700 4.525 6.575 0.679 4.968 7.365 13.168 1.519
3.081 5.271 7.673 0.797 5.926 9.166 15.667 1.920
2.728 4.583 6.733 0.693 5.067 7.554 13.423 1.559
Table 4. Influence of sampling sizes at each level on the recommendation quality. The best results are highlighted in bold and the results under the default setting are underlined. All numbers in the table are given in percentage (%).

4.5. Multi-Hop Scoring Function

Besides action pruning and the reward definition, the scoring function is also used as a part of the objective function in training the knowledge graph representation. By default, we employ a 1-hop scoring function for representation learning. In this experiment, we explore whether multi-hop scoring functions can further improve the recommendation performance of our method.

In particular, the 2-hop scoring function from Equation 9 is: for any valid 2-hop pattern between entities and . This function is plugged into the objective function in Equation 14 for training entity and relation embeddings. All further settings are adopted from the previous action space experiment. We also plot the results in Figures 3 and 4 with an additional green curve representing our new model trained by the 2-hop scoring function. Surprisingly, we find that our 2-hop PGPR method further outperforms the default model (blue curve). This improvement mainly stems from the effectiveness of the multi-hop scoring function, which captures interactions between entities with longer distance. For example, if a user purchases an item and the item belongs to a category, the 2-hop scoring function enhances the relevance between the User entity and the Category entity through the 2-hop pattern {Purchase, Belong_to}.

Dataset Clothing Beauty
History NDCG Recall HR Prec. NDCG Recall HR Prec.
0-step 1.972 3.117 4.492 0.462 3.236 4.407 8.026 0.888
1-step 2.858 4.834 7.020 0.728 5.449 8.324 14.401 1.707
2-step 2.786 4.702 6.865 0.710 5.342 8.181 14.168 1.669
Table 5. Results for different history representations of state. All numbers in the table are given in percentage (%).
Figure 5. All 3-hop path patterns found in the results.

4.6. Sampling Size in Path Reasoning

In this experiment, we study how the sampling size for path reasoning influences the recommendation performance of our method.

We carefully design 9 different combinations of sampling sizes given a path length of 3. As listed in the first column of Table 4, each tuple means that the we sample top actions at step as described in Algorithm 1. For fair comparison, the total number of sampling paths () is fixed to 120 (except for the first case). We experiment on the Clothing and Beauty datasets and follow the default settings of other parameters. The recommendation results in terms of NDCG, Recall, Hit Rate and Precision are reported in Table 4. Interestingly, we observe that the first two levels of sampling sizes play a more significant role in finding good paths. For example, in the cases of , our model performs much better than in the rest of cases. One explanation is that the first two selections of actions largely determine what kinds of items can be reached. After the first two steps are determined, the policy network tends to converge to selecting the optimal action leading to a good item. On the other hand, our model is quite stable if the sample sizes at the first two levels are large, which offers a good guidance for parameter tuning.

4.7. History Representations

Finally, we examine how different representations of state history influence our method. We consider three alternatives for : no history (0-step), last entity with relation (1-step), and last two entities with relations (2-step). Other settings are the same as in the previous experiments. As shown in Table 5, we find that the worst results are obtained in the 0-step case, which suggests that a state representation without history cannot provide sufficient information for the RL agent to learn a good policy. Apart from this, the performance of using 2-step history is slightly worse than that of 1-step history. One possible reason is that additional history information is redundant and even misleads the algorithm in the decision-making process.

5. Case Study on Path Reasoning

To intuitively understand how our model interprets the recommendation, we give a case study here based on the results generated in the previous experiments. We first study the path patterns discovered by our model during the reasoning process, followed by various cases for recommendation.

Path patterns

For a fixed path length of 3, we find that our method managed to discover 15 different path patterns, which are plotted in an aggregated form in Figure 5. Interestingly, the pattern user item user item is one kind of collaborative filtering.

Case study

Figure 6. Real cases of recommendation reasoning paths.

As shown in Figure 6, we provide several real-world examples of the reasoning paths generated by our method to illustrate how to interpret recommendations through paths.

The first example (Case 1) comes from the Beauty dataset, where a user purchased an item “shampoo” that was described by two feature words “nourish” and “lightening”. Meanwhile, another item “conditioner” also contained these two features in some review. Therefore, our model recommended “conditioner” to this user. In the second example (Case 2), there are two users who both mentioned the feature words “run” and “comfort” in their reviews, so our method made a decision based on the purchase history of one user, by recommending the item “running shoes” purchased by the user to the other user. In the third example (Case 3), a user bought an item “iPhone” and also viewed another item “charger line”. Considering that other users who purchased “phone case” would also buy “charger line”, our method accordingly recommended “iPhone case” to the user. The last example (Case 4) depicts that one user purchased an item “neck chain”, which belonged to the category of “Hello Kitty”. Thus, our method recommended the user another item “key chain” that was also in the same category “Hello Kitty”. We conclude that our PGPR method not only achieves promising recommendation results, but also is able to efficiently find diverse reasoning paths for the recommendations.

6. Conclusions and Future Work

We believe that future intelligent agents should have the ability to perform explicit reasoning over knowledge for decision making. In this paper, we propose RL-based reasoning over knowledge graphs for recommendation with interpretation. To achieve this, we develop a method called Policy-Guided Path Reasoning (PGPR). Based on our proposed soft reward strategy, user-conditional action pruning strategy, and a multi-hop scoring approach, our RL-based PGPR algorithm is not only capable of reaching outstanding recommendation results, but also exposes its reasoning procedure for explainability. We conduct extensive experiments to verify the performance of our approach compared with several state-of-art baselines. It should be noted that our PGPR approach is a flexible graph reasoning framework and can be extended to many other graph-based tasks such as product search and social recommendation, which will be explored in the future. We can also extend our PGPR approach to model time-evolving graphs so as to provide dynamic decision support.


  • (1)
  • Ai et al. (2018) Qingyao Ai, Vahid Azizi, Xu Chen, and Yongfeng Zhang. 2018. Learning Heterogeneous Knowledge Base Embeddings for Explainable Recommendation. Algorithms (2018).
  • Bordes et al. (2013) Antoine Bordes, Nicolas Usunier, Alberto Garcia-Duran, Jason Weston, and Oksana Yakhnenko. 2013. Translating embeddings for modeling multi-relational data. In Advances in neural information processing systems. 2787–2795.
  • Das et al. (2017) Rajarshi Das, Shehzaad Dhuliawala, Manzil Zaheer, Luke Vilnis, Ishan Durugkar, Akshay Krishnamurthy, Alex Smola, and Andrew McCallum. 2017. Go for a Walk and Arrive at the Answer: Reasoning Over Paths in Knowledge Bases with Reinforcement Learning. In NIPS-17 Workshop on Automatic Knowledge-Base Construction.
  • Gao et al. (2018) Li Gao, Hong Yang, Jia Wu, Chuan Zhou, Weixue Lu, and Yue Hu. 2018. Recommendation with multi-source heterogeneous information.

    Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence

  • Grover and Leskovec (2016) Aditya Grover and Jure Leskovec. 2016. node2vec: Scalable feature learning for networks. In Proceedings of the 22nd ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 855–864.
  • He et al. (2017a) Ruining He, Wang-Cheng Kang, and Julian McAuley. 2017a. Translation-based recommendation. In Proceedings of the Eleventh ACM Conference on Recommender Systems. ACM, 161–169.
  • He and McAuley (2016a) Ruining He and Julian McAuley. 2016a. Ups and downs: Modeling the visual evolution of fashion trends with one-class collaborative filtering. In proceedings of the 25th international conference on World Wide Web. 507–517.
  • He and McAuley (2016b) Ruining He and Julian McAuley. 2016b. VBPR: Visual Bayesian Personalized Ranking from Implicit Feedback. In AAAI.
  • He et al. (2017b) Xiangnan He, Lizi Liao, Hanwang Zhang, Liqiang Nie, Xia Hu, and Tat-Seng Chua. 2017b. Neural Collaborative Filtering. In WWW. 173–182.
  • Huang et al. (2018) Jin Huang, Wayne Xin Zhao, Hongjian Dou, Ji-Rong Wen, and Edward Y Chang. 2018. Improving sequential recommendation with knowledge-enhanced memory networks. In The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval. ACM, 505–514.
  • Konstan et al. (1997) Joseph A Konstan, Bradley N Miller, David Maltz, Jonathan L Herlocker, Lee R Gordon, and John Riedl. 1997. GroupLens: applying collaborative filtering to Usenet news. Commun. ACM 40, 3 (1997), 77–87.
  • Koren et al. (2009) Yehuda Koren, Robert Bell, and Chris Volinsky. 2009. Matrix factorization techniques for recommender systems. Computer 8 (2009), 30–37.
  • Lee and Seung (2001) Daniel D Lee and H Sebastian Seung. 2001. Algorithms for non-negative matrix factorization. In Advances in neural information processing systems. 556–562.
  • Lin et al. (2018) Xi Victoria Lin, Richard Socher, and Caiming Xiong. 2018. Multi-Hop Knowledge Graph Reasoning with Reward Shaping. In

    Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, EMNLP 2018, Brussels, Belgium, October 31-November 4, 2018

  • Linden et al. (2003) Greg Linden, Brent Smith, and Jeremy York. 2003. recommendations: Item-to-item collaborative filtering. IEEE Internet computing 1 (2003), 76–80.
  • McAuley and Leskovec (2013) Julian McAuley and Jure Leskovec. 2013. Hidden factors and hidden topics: understanding rating dimensions with review text. In Proceedings of the 7th ACM conference on Recommender systems. ACM, 165–172.
  • Mikolov et al. (2013) Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems. 3111–3119.
  • Mnih and Salakhutdinov (2008) Andriy Mnih and Ruslan R Salakhutdinov. 2008. Probabilistic matrix factorization. In Advances in neural information processing systems. 1257–1264.
  • Nickel et al. (2011) Maximilian Nickel, Volker Tresp, and Hans-Peter Kriegel. 2011. A Three-Way Model for Collective Learning on Multi-Relational Data.. In ICML. 809–816.
  • Rendle et al. (2009) Steffen Rendle, Christoph Freudenthaler, Zeno Gantner, and Lars Schmidt-Thieme. 2009. BPR: Bayesian personalized ranking from implicit feedback. In Proceedings of the 25th conference on uncertainty in artificial intelligence. AUAI Press, 452–461.
  • Resnick et al. (1994) Paul Resnick, Neophytos Iacovou, Mitesh Suchak, Peter Bergstrom, and John Riedl. 1994. GroupLens: an open architecture for collaborative filtering of netnews. In Proceedings of the 1994 ACM conference on Computer supported cooperative work. ACM, 175–186.
  • Sarwar et al. (2001) Badrul Sarwar, George Karypis, Joseph Konstan, and John Riedl. 2001. Item-based collaborative filtering recommendation algorithms. In Proceedings of the 10th international conference on World Wide Web. ACM, 285–295.
  • Silver et al. (2016) David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. 2016. Mastering the game of Go with deep neural networks and tree search. nature 529, 7587 (2016), 484.
  • Sutton and Barto (2018) Richard S Sutton and Andrew G Barto. 2018. Reinforcement learning: An introduction. MIT press.
  • Theocharous et al. (2015) Georgios Theocharous, Philip S Thomas, and Mohammad Ghavamzadeh. 2015. Personalized Ad Recommendation Systems for Life-Time Value Optimization with Guarantees.. In IJCAI. 1806–1812.
  • Wang et al. (2018b) Hongwei Wang, Fuzheng Zhang, Jialin Wang, Miao Zhao, Wenjie Li, Xing Xie, and Minyi Guo. 2018b. Ripplenet: Propagating user preferences on the knowledge graph for recommender systems. In CIKM. ACM, 417–426.
  • Wang et al. (2018a) Xiting Wang, Yiru Chen, Jie Yang, Le Wu, Zhengtao Wu, and Xing Xie. 2018a. A Reinforcement Learning Framework for Explainable Recommendation. In 2018 IEEE International Conference on Data Mining (ICDM). IEEE, 587–596.
  • Wang et al. (2019) Xiang Wang, Dingxian Wang, Canran Xu, Xiangnan He, Yixin Cao, and Tat-Seng Chua. 2019. Explainable Reasoning over Knowledge Graphs for Recommendation. AAAI (2019).
  • Xiong et al. (2017) Wenhan Xiong, Thien Hoang, and William Yang Wang. 2017. Deeppath: A reinforcement learning method for knowledge graph reasoning. Proceedings of Conference on Empirical Methods in Natural Language Processing (2017), 564–573.
  • Zhang et al. (2016b) Fuzheng Zhang, Nicholas Jing Yuan, Defu Lian, Xing Xie, and Wei-Ying Ma. 2016b. Collaborative Knowledge Base Embedding for Recommender Systems. In KDD.
  • Zhang et al. (2016a) Wei Zhang, Quan Yuan, Jiawei Han, and Jianyong Wang. 2016a. Collaborative Multi-Level Embedding Learning from Reviews for Rating Prediction. In IJCAI.
  • Zhang et al. (2017) Yongfeng Zhang, Qingyao Ai, Xu Chen, and W Bruce Croft. 2017. Joint representation learning for top-n recommendation with heterogeneous information sources. In Proceedings of the 2017 ACM on Conference on Information and Knowledge Management. ACM, 1449–1458.
  • Zhang and Chen (2018) Yongfeng Zhang and Xu Chen. 2018. Explainable Recommendation: A Survey and New Perspectives. arXiv preprint arXiv:1804.11192 (2018).
  • Zhang et al. (2014) Yongfeng Zhang, Guokun Lai, Min Zhang, Yi Zhang, Yiqun Liu, and Shaoping Ma. 2014.

    Explicit Factor Models for Explainable Recommendation based on Phrase-level Sentiment Analysis.

    SIGIR (2014), 83–92.
  • Zheng et al. (2018) Guanjie Zheng, Fuzheng Zhang, Zihan Zheng, Yang Xiang, Nicholas Jing Yuan, Xing Xie, and Zhenhui Li. 2018. DRN: A Deep Reinforcement Learning Framework for News Recommendation. In Proceedings of the 2018 World Wide Web Conference on World Wide Web. International World Wide Web Conferences Steering Committee, 167–176.
  • Zheng et al. (2017) Lei Zheng, Vahid Noroozi, and Philip S. Yu. 2017. Joint deep modeling of users and items using reviews for recommendation. In WSDM.