Go for a Walk and Arrive at the Answer: Reasoning Over Paths in Knowledge Bases using Reinforcement Learning

11/15/2017 ∙ by Rajarshi Das, et al. ∙ University of Massachusetts Amherst The University of Texas at Austin Carnegie Mellon University 0

Knowledge bases (KB), both automatically and manually constructed, are often incomplete --- many valid facts can be inferred from the KB by synthesizing existing information. A popular approach to KB completion is to infer new relations by combinatory reasoning over the information found along other paths connecting a pair of entities. Given the enormous size of KBs and the exponential number of paths, previous path-based models have considered only the problem of predicting a missing relation given two entities or evaluating the truth of a proposed triple. Additionally, these methods have traditionally used random paths between fixed entity pairs or more recently learned to pick paths between them. We propose a new algorithm MINERVA, which addresses the much more difficult and practical task of answering questions where the relation is known, but only one entity. Since random walks are impractical in a setting with combinatorially many destinations from a start node, we present a neural reinforcement learning approach which learns how to navigate the graph conditioned on the input query to find predictive paths. Empirically, this approach obtains state-of-the-art results on several datasets, significantly outperforming prior methods.



There are no comments yet.


page 1

page 2

page 3

page 4

Code Repositories


Meandering In Networks of Entities to Reach Verisimilar Answers Edit Add topics

view repo
This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

Automated reasoning, the ability of computing systems to make new inferences from observed evidence, has been a long standing goal of artificial intelligence. We are interested in automated reasoning on large knowledge bases (KB) with rich and diverse semantics (Suchanek et al., 2007; Bollacker et al., 2008; Carlson et al., 2010). KBs are highly incomplete (Min et al., 2013)

, and facts not directly stored in a KB can often be inferred from those that are, creating exciting opportunities and challenges for automated reasoning. For example, consider the small knowledge graph in figure 

1. We can infer the (unobserved fact) home stadium of Colin Kaepernick from the following reasoning path: . Our goal is to automatically learn such reasoning paths in KBs. We frame the learning problem as one of query answering, that is to say, answering questions of the form (Colin Kaepernick, PlaysInLeague, ).

From its early days, the focus of automated reasoning approaches has been to build systems which can learn crisp symbolic logical rules (McCarthy, 1960; Nilsson, 1991)

. Symbolic representations have also been integrated with machine learning especially in statistical relational learning 

(Muggleton et al., 1992; Getoor & Taskar, 2007; Kok & Domingos, 2007; Lao et al., 2011)

, but due to poor generalization performance, these approaches have largely been superceded by distributed vector representations. Learning embedding of entities and relations using tensor factorization or neural methods has been a popular approach (Nickel et al., 2011; Bordes et al., 2013; Socher et al., 2013; inter alia), but these methods cannot capture chains of reasoning expressed by KB paths. Neural multi-hop models

(Neelakantan et al., 2015; Guu et al., 2015; Toutanova et al., 2016) address the aforementioned problems to some extent by operating on KB paths in vector space. However, these models take as input a set of paths which are gathered by performing random walks independent of the query relation. Additionally, models such as Neelakantan et al. (2015); Das et al. (2017) use the same set of initially collected paths to answer a diverse set of query types (e.g. MarriedTo, Nationality, WorksIn etc.).

This paper presents a method for efficiently searching the graph for answer-providing paths using reinforcement learning (RL) conditioned on the input question, eliminating any need for precomputed paths. Given a massive knowledge graph, we learn a policy, which, given the query , starts from and learns to walk to the answer node by choosing to take a labeled relation edge at each step, conditioning on the query relation and entire path history. This formulates the query-answering task as a reinforcement learning (RL) problem where the goal is to take an optimal sequence of decisions (choices of relation edges) to maximize the expected reward (reaching the correct answer node). We call the RL agent minerva for ”Meandering In Networks of Entities to Reach Verisimilar Answers.”

Our RL-based formulation has many desirable properties. First, minerva has the built-in flexibility to take paths of variable length, which is important for answering harder questions that require complex chains of reasoning (Shen et al., 2017). Secondly, minerva needs no pretraining and trains on the knowledge graph from scratch with reinforcement learning; no other supervision or fine-tuning is required representing a significant advance over prior applications of RL in NLP. Third, our path-based approach is computationally efficient, since by searching in a small neighborhood around the query entity it avoids ranking all entities in the KB as in prior work. Finally, the reasoning paths found by our agent automatically form an interpretable provenance for its predictions.


Figure 1: A small fragment of a knowledge base represented as a knowledge graph. Solid edges are observed and dashed edges are part of queries. Note how each query (e.g. Nationality, PlaysInLeague, PLayerHomeStadium) can be answered by traversing the graph via “logical” paths between entity ‘Colin Kaepernick’ and the corresponding answer.

The main contributions of the paper are: (a) We present agent minerva, which learns to do query answering by walking on a knowledge graph conditioned on an input query, stopping when it reaches the answer node. The agent is trained using reinforcement learning, specifically policy gradients (§ 2). (b) We evaluate minerva on several benchmark datasets and compare favorably to Neural Theorem Provers (NTP) (Rocktäschel & Riedel, 2017) and Neural LP (Yang et al., 2017), which do logical rule learning in KBs, and also state-of-the-art embedding based methods such as DistMult (Yang et al., 2015) and ComplEx (Trouillon et al., 2016). (c) We also extend minerva to handle partially structured natural language queries and test it on the WikiMovies dataset (§ 4.3) (Miller et al., 2016).

We also compare to DeepPath (Xiong et al., 2017) which uses reinforcement learning to pick paths between entity pairs. The main difference is that the state of their RL agent includes the answer entity since it is designed for the simpler task of predicting if a fact is true or not. As such their method cannot be applied directly to our more challenging query answering task where the second entity is unknown and must be inferred. Nevertheless, minerva outperforms DeepPath on their benchmark nell-995 dataset when compared in their experimental setting (§ 4.1).

2 Task and Model

We formally define the task of query answering in a KB. Let denote the set of entities and be the set of binary relations. Then a KB is a collection of facts stored as triplets where and . Query answering seeks to answer questions of the form , e.g. Toronto, locatedIn, . We would also like to clearly point out the difference between query answering and the task of fact prediction. Fact prediction involves predicting if a fact is true or not, e.g. (Toronto, locatedIn, Canada)?. This task is easier than predicting the correct entity as the answer in query answering since the latter require finding the answer entity among many possible entities.

Next we describe how we reduce the problem of query answering in a KB to a finite horizon sequential decision making problem and solve it using reinforcement learning. We begin by representing the environment as a deterministic Markov decision process on a knowledge graph

derived from the KB (§2.1). Our RL agent is given an input query of the form . Starting from vertex corresponding to in the knowledge graph , the agent learns to traverse the environment/graph to mine the answer and stop when it determines the answer (§ 2.2). The agent is trained using policy gradient more specifically by reinforce (Williams, 1992) with control variates (§ 2.3). Let us begin by describing the environment.

2.1 Environment - States, Actions, Transitions and Rewards

Our environment is a finite horizon, deterministic and partially observed Markov decision process that lies on a knowledge graph derived from the KB. Recall that a KB is collection of facts stored as triplets where and . From the KB, a knowledge graph can be constructed where the entities are represented as the nodes and relation as labeled edge between them. Formally, a knowledge graph is a directed labeled multigraph , where and denote the vertices and edges of the graph respectively. Note that and . Also, following previous approaches (Bordes et al., 2013; Neelakantan et al., 2015; Xiong et al., 2017), we add the inverse relation of every edge, i.e. for an edge , we add the edge to the graph. (If the set of binary relations does not contain the inverse relation , it is added to as well.) On this graph we will now specify a deterministic partially observable Markov decision process, which is a 5-tuple , each of which we elaborate below.

States. The state space consists of all possible query-answers cartesian product with the set of entities. Intuitively, we want a state to encode the query (), the answer (), and a location of exploration (current node of the entity). Thus overall a state is represented by and the state space consists of all valid combinations.

Observations. The complete state of the environment is not observable, but only its current location of exploration and query can be observed but not the answer, i.e. only is observed. Formally the observation function is defined as .

Actions. The set of possible actions from a state consists of all outgoing edges of the vertex in . Formally . Basically, this means an agent at each state has option to select which outgoing edge it wishes to take having the knowledge of the label of the edge and destination vertex .

During implementation, we unroll the computation graph up to a fixed number of time steps . We augment each node with a special action called ‘no_op’ which goes from a node to itself. Some questions are easier to answer and needs lesser steps of reasoning than others. This design decision allows the agent to remain at a node for any number of time steps. This is especially helpful when the agent has managed to reach a correct answer at a time step and can continue to stay at the ‘answer node’ for the rest of the time steps. Alternatively, we could have allowed the agent to take a special ‘stop’ action, but we found the current setup to work sufficiently well. As mentioned before, we also add the inverse relation of a triple, i.e. for the triple , we add the triple to the graph. We found this important because this actually equips our agent to undo a potentially wrong decision as it can retract back to the current node in the next step.

Transition. The environment evolves deterministically by just updating the state to the new vertex pointed by the edge selected by the agent through its action. The query and answer remains the same. Formally, the transition function is defined by , where and .

Rewards. We only have a terminal reward of +1 if the current location is the correct answer at the end and 0 otherwise. To elaborate, if is the final state, then we receive a reward of +1 if else 0.=, i.e. .

2.2 Policy Network

To solve the finite horizon deterministic partially observable Markov decision process described above, we aim to design a randomized history-dependent policy , where and history

is just the sequence of observations and actions taken. We restrict ourselves to the function class expressed by long short-term memory network (LSTM)

(Hochreiter & Schmidhuber, 1997) for learning the randomized history-dependent policy.

An agent based on LSTM encodes the history as a continuous vector . We also have embedding matrix and for the binary relations and entities respectively. The history embedding for is updated according to LSTM dynamics:


where and denote the vector representation for action/relation at time and observation/entity at time respectively and denote vector concatenation. To elucidate, , i.e. the embedding of the relation corresponding to label of the edge the agent chose at time and if i.e. the embedding of the entity corresponding to vertex the agent is at time .

Based on the history embedding , the policy network makes the decision to choose an action from all available actions conditioned on the query relation. Recall that each possible action represents an outgoing edge with information of the edge relation label and destination vertex/entity . So embedding for each is , and stacking embeddings for all the outgoing edges we obtain the matrix

. The network taking these as inputs is parameterized as a two-layer feed-forward network with ReLU nonlinearity which takes in the current history representation

and the embedding for the query relation

and outputs a probability distribution over the possible actions from which a discrete action is sampled. In other words,

Note that the nodes in do not have a fixed ordering or number of edges coming out from them. The size of matrix is , so the decision probabilities lies on simplex of size

. Also the procedure above is invariant to order in which edges are presented as desired and falls in purview of neural networks designed to be permutation invariant

Zaheer et al. (2017). Finally, to summarise, the parameters of the LSTM, the weights , , the corresponding biases (not shown above for brevity), and the embedding matrices form the parameters of the policy network.

2.3 Training

For the policy network () described above, we want to find parameters that maximizes the expected reward:

where we assume there is a true underlying distribution . To solve this optimization problem, we employ REINFORCE (Williams, 1992) as follows:

  • [leftmargin=9mm, itemsep=2pt]

  • The first expectation is replaced with empirical average over the training dataset.

  • For the second expectation, we approximate by running multiple rollouts for each training example. The number of rollouts is fixed and for all our experiments we set this number to 20.

  • For variance reduction, a common strategy is to use an additive control variate baseline

    (Hammersley, 2013; Fishman, 2013; Evans & Swartz, 2000)

    . We use a moving average of the cumulative discounted reward as the baseline. We tune the weight of this moving average as a hyperparameter. Note that in our experiments we found that learnt baseline performed similarly, but we finally settled for cumulative discounted reward as the baseline owing to its simplicity.

  • To encourage the policy to sample more diverse paths rather than sticking with a few, we add an entropy regularization term to our cost function after multiplying it by a constant . We treat as a hyperparameter to control the exploration exploitation trade-off.

Experimental Details We choose the relation and embedding dimension size as 200. The action embedding is formed by concatenating the entity and relation embedding. We use a 3 layer LSTM with dimension size of 400. The hidden layer size of MLP (weights and ) is set to 400. We use Adam (Kingma & Ba, 2014) with the default parameters in REINFORCE for the update. The best hyperparameter values can be found in appendix.

3 Data

Dataset #entities #relations #facts #queries
countries 272 2 1158 24
umls 135 49 5,216 661
kinship 104 26 10686 1074
wn18rr 40,945 15 86,835 3134
nell-995 75,492 200 154,213 3992
fb15k-237 14,505 237 272,115 20,466
WikiMovies 43,230 9 196,453 9952
Table 1: Statistics of various datasets used in experiments.
Task Metric Model
ComplEx NTP NTP- minerva
S1 99.370.4 90.8315.4 100.00.0 100.00.0
S2 AUC-PR 87.952.8 87.411.7 93.040.4 910.01
S3 48.446.3 56.6817.6 77.2617.0 930.01
Table 2: Performance on countries dataset. minerva significantly outperforms baselines in the challenging S3 task.

We test our model on the following query answering datasets. (a) countries (Bouchard et al., 2015), (b) Alyawarra kinship (kinship), (c) Unified Medical Language Systems (umls) (Kok & Domingos, 2007) (d) wn18rr (Dettmers et al., 2017), (e) nell-995, (f) fb15k-237 (g) WikiMovies (Miller et al., 2016). We also test on a synthetic grid world dataset released by Yang et al. (2017) to test the ability of the model to learn rules of long length.

The countries dataset is carefully designed to explicitly test the logical rule learning and reasoning capabilities of link prediction models. The dataset has 3 tasks (S1-3 in table  2) each requiring reasoning steps of increasing length and difficulty (see Rocktäschel & Riedel (2017) for more details about the tasks). We also test our model on existing large and challenging KG datasets ((d) - (f)). wn18rr is created from the original wordnet18 dataset by removing test triples which can be answered trivially, making the datasets more realistic and challenging. Additionally, we test our model on a question answering dataset - WikiMovies (Miller et al., 2016) where the query is in natural language but the answers can be found in an accompanying KB. Table 1 report the various statistics of the datasets.

4 Experiments

4.1 Knowledge Graph Query Answering

This section describes the experimental results on the various knowledge graph query answering datasets. During inference, we do beam search with a beam width of 40 and rank entities by the probability of the trajectory the model took to reach the entity.

Model umls kinship
NeuralLP 0.70 0.73
minerva 0.91 0.93
Table 3: hits@10 on umls and kinship

countries, kinship, umls. We first test minerva on the countries dataset which is explicitly designed to test the ability of models to learn logical rules. It contains countries, regions and subregions as entities. The queries are of the form LocatedIn(c, ?) and the answer is a region. For example, LocatedIn(Egypt, ?) with the answer as Africa. Our experimental settings and scores are directly comparable to NTP and ComplEx (Trouillon et al., 2016). NTP- is a NTP model trained with an additional objective function of ComplEx. We also compare minerva against Neural LP (Yang et al., 2017) on the umls and kinship datasets.

Task DeepPath minerva athleteplaysinleague 0.960 0.970 worksfor 0.711 0.825 organizationhiredperson 0.742 0.851 athleteplayssport 0.957 0.985 teamplayssport 0.738 0.846 personborninlocation 0.795 0.793 athletehomestadium 0.890 0.895 organizationheadquarteredincity 0.790 0.946 athleteplaysforteam 0.750 0.824 tableMAP scores for different query relations on the nell-995 dataset. Note that in this comparison, minerva refers to only a single learnt model for all query relations which is competitive with individual DeepPath models trained separately for each query relation. figureGrid world experiment: We significantly outperform NeuralLP for longer path lengths.

The evaluation metric we report is

hits@k - which is the percentage of correct entities ranked in top-k. For the countries dataset, we report the area under the precision-recall curve for comparing with the baselines. Following the design of the countries dataset, for task S1 and S2, we set the maximum path length and for S3, we set .

Table  2 shows that minerva outperforms all the baseline models except on the task S2 of countries, where the ensemble model ntp- outperforms it, albeit with a higher variance across runs. Our gains are much more prominent in task S3, which is the hardest among all the tasks. We similarly outperform NeuralLP on the umls and kinship datasets.

Nell-995 We also compare minerva to DeepPath. For a fair comparison, we only rank the answer entities against the negative examples in the dataset used in their experiments222We are grateful to Xiong et al. (2017) for releasing the negative examples used in their experiments. and report the mean average precision (MAP) scores for each query relation. DeepPath feeds the paths its agent gathers as input features to the path ranking algorithm (PRA) (Lao et al., 2011)

, which trains a per-relation classifier. But unlike them, we train one model which learns for all query relations. If our agent is not able to reach the correct entity or one of the negative entities, the corresponding query gets a score of negative infinity. As show in table  

4.1, we outperform them or achieve comparable performance for all the query relations For this experiment, we set the maximum length .

Model hits@1 hits@3 hits@10
ConvE 0.306 0.360 0.411
DistMult 0.389 0.439 0.491
ComplEx 0.411 0.458 0.507
minerva 0.413 0.456 0.513
Table 4: Performance on wn18rr

wn18rr Next we test minerva on another large KB dataset – wn18rr. On this dataset, we compare with three recently proposed latent factorization model – (a) ConvE (Dettmers et al., 2017), (b) DistMult (Yang et al., 2015), (c) ComplEx (Trouillon et al., 2016). We report hits at various and we compare favorably with the state-of-the-art results of ComplEx in all settings (table  4). For this experiment, we also set the maximum length .

Model hits@10
ConvE 0.458
DistMult 0.568
ComplEx 0.419
minerva 0.456

tablePerformance on fb15k-237

FB15k-237 We test minerva on yet another popular KB dataset – FB15K-237. The baselines are the same as before, however our implementation of DistMult gave a score of 56.8 hits@10 which, to our knowledge, is the highest score reported on this dataset333We are aware of the high variance of DistMult scores reported on FB15k-237 by several papers, but to ensure fairness we report the high scores our in-house implementation achieved..The performance of ConvE and ComplEx are taken from Dettmers et al. (2017). Even though minerva performs comparably to ConvE and ComplEx, the results are significantly behind the performance of DistMult.

Upon delving more into the structure of knowledge graph derived from FB15K-237, we found few interesting characteristics of the dataset. As a prelude, we would like to describe a long existing concept in graph theory – clustering coefficient (Holland & Leinhardt, 1971; Watts & Strogatz, 1998). Clustering coefficient () of a graph measures whether groups of nodes form ‘tightly knit’ communities - i.e. whether groups of nodes tend to cluster together. A high implies the presence of higher number of densely connected groups of nodes. For instance, if we consider three nodes and , a high means with high probability whenever three nodes are connected as , it implies nodes are also connected forming a triangle. Intuitively, minerva can use such closed shapes to learn paths such as () to predict the answer of the query, i.e. the third node (). The clustering coefficient also extends from triangles to cliques of arbitrary size (Watts & Strogatz, 1998). Figure 4.1 plots for various datasets. We find that FB15k-237 has the least clustering coefficient (0.19) among all datasets. This means that the dataset has sparse neighborhoods and hence minerva finds it difficult to learn logical rules. We also check the frequency of occurrence of various unique paths (types). We define a path type as the sequence of relations (ignoring the entities) in a path. Intuitively, a predictive path which generalizes across queries will occur many number of times in the graph. Figure 4.1 shows the plot. As we can see, the characteristics of FB15k-237 is quite different from other datasets. Path types do not repeat that often, making it hard for minerva to learn paths which generalizes. We also provide further analysis of the types of various query relation in FB15k-237 in the appendix.

figureNetwork avg. cluster coefficient of various datasets figureCount of number of unique path types of length 3 which occur more than ‘x’ times in various datasets. In Kinship and NELL-995, there are more than 10 path types which occur more than 10 times, however for FB15k-237, we see a sharp decrease as ‘x’ becomes higher.

4.2 Grid World Path Finding

As we empirically find and also noted by previous work (Rocktäschel & Riedel, 2017; Das et al., 2017; Yang et al., 2017), often the reasoning chains required to answer queries in KB is not too long (restricted to 3 or 4 hops). To test if our model can learn long reasoning paths, we test our model on a synthetic 16-by-16 grid world dataset created by Yang et al. (2017), where the task is to navigate to a particular cell (answer entity) starting from a random cell (start entity) by following a set of directions (query relation). The KB consists of atomic triples of the form ((2,1), North, (1,1)) – entity (1,1) is north of entity (2,1). The queries consists of a sequence of directions (e.g. North, SouthWest, East). The queries are classified into classes based on the path lengths. Figure 2 shows the accuracy on varying path lengths. Compared to Neural LP, minerva is much more robust for queries which require longer path lengths showing a very little degrade in performance for even the longest path length in the dataset.

4.3 Partially Structured Queries

Model Accuracy
Memory Network 78.5
QA system 93.5
Key-Value Memory Network 93.9
Neural LP 94.6
minerva 96.7
Table 5: Performance on WikiMovies

Queries in KB datasets are structured in the form of triples. However, this is unsatisfactory since for most real applications, the queries appear in natural language. As a first step in this direction, we extend minerva to take in “partially structured” queries. We use the WikiMovies dataset (Miller et al., 2016) which contains questions in natural language albeit generated by templates created by human annotators. An example question from the dataset is “Which is a film written by Herb Freed?”. WikiMovies also has an accompanying KB which can be used to answer all the questions.

We link the entity occurring in the question to the KB via simple string matching. To form the vector representation of the query relation, we design a simple question encoder which computes the average of the embeddings of the question words. The word embeddings are learned from scratch and we do not use any pretrained embeddings. We compare our results with those reported in Yang et al. (2017) (table  5). We got the best result using , suggesting that WikiMovies is not the best testbed for multihop reasoning, but this experiment is a promising first step towards the realistic setup of having textual queries and knowledge bases.

5 Analysis

Figure 2: Based on the query relation our agent assigns different probabilities to different actions. The dashed edges in the top row denote query relation. Examples in the bottom row are from the WikiMovies dataset and hence the questions are partially structured.

Effectiveness of Remembering Path History. minerva encodes the history of decisions it has taken in the past using LSTMs. To test the importance of remembering the sequence of decisions, we did an ablation study in which the agent chose the next action based on only local information i.e. current entity and query and did not have access to the history . For the kinship dataset, we observe a 27% points decrease in hits@1 (0.184 v/s 0.46) and 13% decrease in hits@10 (0.63 v/s 0.76). For grid-world, it is also not surprising that we see a big drop in performance. The final accuracy is 0.23 for path lengths 2-4 and 0.04 for lengths 8-10. For nell, the performance dropped from 0.576 to 0.564 and for FB15k-237 the hits@10 performance dropped from 0.456 to 0.408.

NO-OP and Inverse Relations. At each step, minerva can choose to take a no-op edge and remain at the same node. This gives the agent the flexibility of taking paths of variable lengths. Some questions are easier to answer than others and require lesser steps of reasoning and if the agent reaches the answer early, it can choose to remain there. Example (i) in table  6 shows such an example. Similarly inverse relation gives the agent the ability to recover from a potentially wrong decision it has taken before. Example (ii) shows such an example, where the agent took a incorrect decision at the first step but was able to revert the decision because of the presence of inverted edges.

Query based Decision Making. At each step before making a decision, our agent conditions on the query relation. Figure 2 shows examples, where based on the query relation, the probabilities are peaked on different actions. For example, when the query relation is WorksFor, minerva assigns a much higher probability of taking the edge CoachesTeam than AthletePlaysInLeague. We also see similar behavior on the WikiMovies dataset where the query consists of words instead of fixed schema relation.

Inference Time. minerva is efficient at inference time since it has to essentially search for answer entities in its local neighborhood, whereas previous methods rank all the entities in the dataset. For instance, on the test dataset of wn18rr, the wall clock running time of minerva is 63 seconds whereas that of a GPU implementation of DistMult is 211 seconds (with the maximum batch size).

(i) Can learn general rules:
(S1) LocatedInX, Y LocatedInX, Z & LocatedInZ, Y
(S2) LocatedInX, Y NeighborOfX, Z & LocatedInZ, Y
(S3) LocatedInX, Y NeighborOfX, Z & NeighborOfZ, W & LocatedInW, Y
(ii) Can learn shorter path: Richard F. Velky ?
Richard F. Velky Schaghticokes Schaghticokes Schaghticokes
(iii) Can recover from mistakes: Donald Graham ?
Donald Graham TNT Post Donald Graham Wash Post
Table 6: A few example of paths found by minerva on the countries and nell. minerva can learn general rules as required by the countries dataset (example (i)). It can learn shorter paths if necessary (example (ii)) and has the ability to correct a previously taken decision (example (iii))


6 Related Work

Learning vector representations of entities and relations using tensor factorization (Nickel et al., 2011, 2012; Bordes et al., 2013; Riedel et al., 2013; Nickel et al., 2014; Yang et al., 2015) or neural methods (Socher et al., 2013; Toutanova et al., 2015; Verga et al., 2016) has been a popular approach to reasoning with a knowledge base. However, these methods cannot capture more complex reasoning patterns such as those found by following inference paths in KBs. Multi-hop link prediction approaches (Lao et al., 2011; Neelakantan et al., 2015; Guu et al., 2015; Toutanova et al., 2016; Das et al., 2017) address the problems above, but the reasoning paths that they operate on are gathered by performing random walks independent of the type of query relation. Lao et al. (2011)

further filters paths from the set of sampled paths based on the restriction that the path must end at one of the target entities in the training set and are within a maximum length. These constraints make them query dependent but they are heuristic in nature. Our approach eliminates any necessity to pre-compute paths and learns to efficiently search the graph conditioned on the input query relation.

Inductive Logic Programming (ILP)

(Muggleton et al., 1992) aims to learn general purpose predicate rules from examples and background knowledge. Early work in ILP such as FOIL (Quinlan, 1990), PROGOL (Muggleton, 1995) are either rule-based or require negative examples which is often hard to find in KBs (by design, KBs store true facts). Statistical relational learning methods (Getoor & Taskar, 2007; Kok & Domingos, 2007; Schoenmackers et al., 2010) along with probabilistic logic (Richardson & Domingos, 2006; Broecheler et al., 2010; Wang et al., 2013) combine machine learning and logic but these approaches operate on symbols rather than vectors and hence do not enjoy the generalization properties of embedding based approaches.

Neural Theorem Provers (NTP) (Rocktäschel & Riedel, 2017) and Neural LP (Yang et al., 2017) are two recent methods in learning logical rules that can be trained end-to-end with gradient based learning. NTPs are constructed by Prolog’s backward chaining inference method. It operates on vectors rather than symbols, thereby providing a success score for each proof path. However, since a score can be computed between any two vectors, the computation graph becomes quite large because of such soft-matching during substitution step of backward chaining. For tractability, it resides to heuristics such as only keeping the top-K scoring proof paths, but it loses any guarantee of computing exact gradients. Also the efficacy of NTPs has yet to be shown on large KBs. Neural LP introduces a differential rule learning system using operators defined in TensorLog (Cohen, 2016) and has a LSTM based controller and a differentiable memory component (Graves et al., 2014; Sukhbaatar et al., 2015) and the rule scores are calculated via attention. Even though, differentiable memory allows the network to be trained end to end, it necessitates accessing the entire memory which can be computationally expensive. RL approaches which can make hard selection of memory (Zaremba & Sutskever, 2015) are computationally attractive. minerva uses a similar hard selection of relation edges to walk on the graph. More importantly, minerva outperforms both these methods on their respective benchmark datasets.

DeepPath (Xiong et al., 2017) uses RL based approaches to find paths in KBs. However, the state of their MDP requires the target entity to be known in advance and hence their path finding strategy is dependent on knowing the answer entity. minerva does not need any knowledge of the target entity and instead learns to find the answer entity among all entities. DeepPath, additionally feeds its gathered paths to Path Ranking Algorithm (Lao et al., 2011), whereas minerva is a complete system trained to do query answering. DeepPath also uses fixed pretrained embeddings for its entity and relations. Lastly, on comparing minerva with DeepPath in their experimental setting on the NELL dataset, we match their performance or outperform them.

minerva is also similar to methods for learning to search for structured prediction (Collins & Roark, 2004; Daumé III & Marcu, 2005; Daumé III et al., 2009; Ross et al., 2011; Chang et al., 2015). These methods are based on imitating a reference policy (oracle) which make near-optimal decision at every step. In our problem setting, it is unclear what a good reference policy would be. For example, a shortest path oracle between two entities would be bad, since the answer providing path should depend on the query relation.

7 Conclusion

We explored a new way of automated reasoning on large knowledge bases in which we use the knowledge graphs representation of the knowledge base and train an agent to walk to the answer node conditioned on the input query. We achieve state-of-the-art results on multiple benchmark knowledge base completion tasks and we also show that our model is robust and can learn long chains-of-reasoning. Moreover it needs no pretraining or initial supervision. Future research directions include applying more sophisticated RL techniques and working directly on textual queries and documents.


This work was supported in part by the Center for Data Science and the Center for Intelligent Information Retrieval, in part by DARPA under agreement number FA8750-13-2-0020, in part by Defense Advanced Research Agency (DARPA) contract number HR0011-15-2-0036, in part by the National Science Foundation (NSF) grant numbers DMR-1534431 and IIS-1514053 and in part by the Chan Zuckerberg Initiative under the project “Scientific Knowledge Base Construction. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright notation thereon. Any opinions, findings and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect those of the sponsor.


  • Bollacker et al. (2008) Kurt Bollacker, Colin Evans, Praveen Paritosh, Tim Sturge, and Jamie Taylor. Freebase: A collaboratively created graph database for structuring human knowledge. In ICDM, 2008.
  • Bordes et al. (2013) Antoine Bordes, Nicolas Usunier, Alberto Garcia-Duran, Jason Weston, and Oksana Yakhnenko. Translating embeddings for modeling multi-relational data. In NIPS, 2013.
  • Bouchard et al. (2015) Guillaume Bouchard, Sameer Singh, and Theo Trouillon. On approximate reasoning capabilities of low-rank vector spaces. AAAI Spring Symposium, 2015.
  • Broecheler et al. (2010) Matthias Broecheler, Lilyana Mihalkova, and Lise Getoor. Probabilistic similarity logic. In UAI, 2010.
  • Carlson et al. (2010) Andrew Carlson, Justin Betteridge, Bryan Kisiel, Burr Settles, Estevam R. Hruschka, Jr., and Tom M. Mitchell. Toward an Architecture for Never-ending Language Learning. In AAAI, 2010.
  • Chang et al. (2015) Kai-Wei Chang, Akshay Krishnamurthy, Alekh Agarwal, Hal Daume, and John Langford. Learning to search better than your teacher. In ICML, 2015.
  • Cohen (2016) William Cohen. Tensorlog: A differentiable deductive database. arXiv:1605.06523, 2016.
  • Collins & Roark (2004) Michael Collins and Brian Roark.

    Incremental parsing with the perceptron algorithm.

    In ACL, 2004.
  • Das et al. (2017) Rajarshi Das, Arvind Neelakantan, David Belanger, and Andrew McCallum.

    Chains of reasoning over entities, relations, and text using recurrent neural networks.

    In EACL, 2017.
  • Daumé III & Marcu (2005) Hal Daumé III and Daniel Marcu. Learning as search optimization: Approximate large margin methods for structured prediction. In ICML, 2005.
  • Daumé III et al. (2009) Hal Daumé III, John Langford, and Daniel Marcu. Search-based structured prediction. Machine learning, 2009.
  • Dettmers et al. (2017) Tim Dettmers, Pasquale Minervini, Pontus Stenetorp, and Sebastian Riedel. Convolutional 2d knowledge graph embeddings. arXiv:1707.01476, 2017.
  • Evans & Swartz (2000) Michael Evans and Timothy Swartz. Approximating integrals via Monte Carlo and deterministic methods. OUP Oxford, 2000.
  • Fishman (2013) George Fishman. Monte Carlo: concepts, algorithms, and applications. Springer Science & Business Media, 2013.
  • Getoor & Taskar (2007) Lise Getoor and Ben Taskar. Introduction to statistical relational learning. MIT press, 2007.
  • Graves et al. (2014) Alex Graves, Greg Wayne, and Ivo Danihelka. Neural turing machines. arXiv:1410.5401, 2014.
  • Guu et al. (2015) Kelvin Guu, John Miller, and Percy Liang. Traversing knowledge graphs in vector space. In EMNLP, 2015.
  • Hammersley (2013) John Hammersley. Monte carlo methods. Springer Science & Business Media, 2013.
  • Hochreiter & Schmidhuber (1997) Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural computation, 1997.
  • Holland & Leinhardt (1971) Paul W Holland and Samuel Leinhardt. Transitivity in structural models of small groups. Comparative Group Studies, 1971.
  • Kingma & Ba (2014) Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv:1412.6980, 2014.
  • Kok & Domingos (2007) Stanley Kok and Pedro Domingos. Statistical predicate invention. In ICML, 2007.
  • Lao et al. (2011) Ni Lao, Tom Mitchell, and William Cohen. Random walk inference and learning in a large scale knowledge base. In EMNLP, 2011.
  • McCarthy (1960) John McCarthy. Programs with common sense. RLE and MIT Computation Center, 1960.
  • Miller et al. (2016) Alexander Miller, Adam Fisch, Jesse Dodge, Amir-Hossein Karimi, Antoine Bordes, and Jason Weston. Key-value memory networks for directly reading documents. EMNLP, 2016.
  • Min et al. (2013) Bonan Min, Ralph Grishman, Li Wan, Chang Wang, and David Gondek. Distant supervision for relation extraction with an incomplete knowledge base. In HLT-NAACL, 2013.
  • Muggleton (1995) Stephen Muggleton. Inverse entailment and progol. New generation computing, 1995.
  • Muggleton et al. (1992) Stephen Muggleton, Ramon Otero, and Alireza Tamaddoni-Nezhad. Inductive logic programming. Springer, 1992.
  • Neelakantan et al. (2015) Arvind Neelakantan, Benjamin Roth, and Andrew McCallum. Compositional vector space models for knowledge base completion. In ACL, 2015.
  • Nickel et al. (2011) Maximilian Nickel, Volker Tresp, and Hans-Peter Kriegel. A three-way model for collective learning on multi-relational data. In ICML, 2011.
  • Nickel et al. (2012) Maximilian Nickel, Volker Tresp, and Hans-Peter Kriegel. Factorizing yago: scalable machine learning for linked data. In WWW, 2012.
  • Nickel et al. (2014) Maximilian Nickel, Xueyan Jiang, and Volker Tresp. Reducing the rank in relational factorization models by including observable patterns. In NIPS, 2014.
  • Nilsson (1991) Nils J Nilsson. Logic and artificial intelligence. Artificial intelligence, 1991.
  • Quinlan (1990) J Ross Quinlan. Learning logical definitions from relations. Machine learning, 1990.
  • Richardson & Domingos (2006) Matthew Richardson and Pedro Domingos. Markov logic networks. Machine learning, 2006.
  • Riedel et al. (2013) Sebastian Riedel, Limin Yao, Andrew McCallum, and Benjamin M. Marlin. Relation extraction with matrix factorization and universal schemas. In NAACL, 2013.
  • Rocktäschel & Riedel (2017) Tim Rocktäschel and Sebastian Riedel. End-to-end differentiable proving. In NIPS, 2017.
  • Ross et al. (2011) Stéphane Ross, Geoffrey J Gordon, and Drew Bagnell.

    A reduction of imitation learning and structured prediction to no-regret online learning.

    In AISTATS, 2011.
  • Schoenmackers et al. (2010) Stefan Schoenmackers, Oren Etzioni, Daniel Weld, and Jesse Davis. Learning first-order horn clauses from web text. In EMNLP, 2010.
  • Shen et al. (2017) Yelong Shen, Po-Sen Huang, Jianfeng Gao, and Weizhu Chen. Reasonet: Learning to stop reading in machine comprehension. In KDD, 2017.
  • Socher et al. (2013) Richard Socher, Danqi Chen, Christopher D Manning, and Andrew Ng. Reasoning with neural tensor networks for knowledge base completion. In NIPS, 2013.
  • Suchanek et al. (2007) Fabian Suchanek, Gjergji Kasneci, and Gerhard Weikum. Yago: A core of semantic knowledge. In WWW, 2007.
  • Sukhbaatar et al. (2015) Sainbayar Sukhbaatar, Jason Weston, and Rob Fergus. End-to-end memory networks. In NIPS, 2015.
  • Toutanova et al. (2015) Kristina Toutanova, Danqi Chen, Patrick Pantel, Hoifung Poon, Pallavi Choudhury, and Michael Gamon. Representing text for joint embedding of text and knowledge bases. In EMNLP, 2015.
  • Toutanova et al. (2016) Kristina Toutanova, Victoria Lin, Wen-tau Yih, Hoifung Poon, and Chris Quirk. Compositional learning of embeddings for relation paths in knowledge base and text. In ACL, 2016.
  • Trouillon et al. (2016) Théo Trouillon, Johannes Welbl, Sebastian Riedel, Éric Gaussier, and Guillaume Bouchard. Complex embeddings for simple link prediction. In ICML, 2016.
  • Verga et al. (2016) Patrick Verga, David Belanger, Emma Strubell, Benjamin Roth, and Andrew McCallum. Multilingual relation extraction using compositional universal schema. In NAACL, 2016.
  • Wang et al. (2013) William Yang Wang, Kathryn Mazaitis, and William W Cohen. Programming with personalized pagerank: a locally groundable first-order probabilistic logic. In CIKM, 2013.
  • Watts & Strogatz (1998) Duncan J Watts and Steven H Strogatz. Collective dynamics of ‘small-world’networks. nature, 1998.
  • Williams (1992) Ronald J Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning, 1992.
  • Xiong et al. (2017) Wenhan Xiong, Thien Hoang, and William Yang Wang. Deeppath: A reinforcement learning method for knowledge graph reasoning. In EMNLP, 2017.
  • Yang et al. (2015) Bishan Yang, Wen-tau Yih, Xiaodong He, Jianfeng Gao, and Li Deng. Embedding entities and relations for learning and inference in knowledge bases. In ICLR, 2015.
  • Yang et al. (2017) Fan Yang, Zhilin Yang, and William W Cohen. Differentiable learning of logical rules for knowledge base reasoning. In NIPS, 2017.
  • Zaheer et al. (2017) Manzil Zaheer, Satwik Kottur, Siamak Ravanbakhsh, Barnabas Poczos, Ruslan Salakhutdinov, and Alexander Smola. Deep sets. In NIPS, 2017.
  • Zaremba & Sutskever (2015) Wojciech Zaremba and Ilya Sutskever. Reinforcement learning neural turing machines. arXiv:1505.00521, 2015.

8 Appendix

(i) M to 1
Los Angeles Rams American Football
The Walking Dead USA
(ii) 1 to M
CEO Merck & Co.
Traffic collision Albert Camus
Harmonica Greg Graffin
Table 7: Few example facts belonging to m to 1, 1 to m relations in FB15k-237


Relation tail/head
/people/marriage_union_type/unions_of_this_type./people/marriage/location_of_ceremony 129.75
/organization/role/leaders./organization/leadership/organization 65.15
/location/country/second_level_divisions 49.18
/user/ktrueman/default_domain/international_organization/member_states 36.5
/base/marchmadness/ncaa_basketball_tournament/seeds./base/marchmadness/ncaa_tournament_seed/team 33.6
Table 8: Few example 1-to-M relations from FB15k-237 with high cardinality ratio of tail to head.

8.1 Analysis of query relations of FB15k-237

We further did some query analysis on the FB15k-237 dataset. Following Bordes et al. (2013), we categorized the query relations into (M)any to 1, 1 to M and 1 to 1 relations. An example of a M to 1 relation would be ‘/people/profession’ (What is the profession of ‘X’?). An example of 1 to M relation would be ‘/people/cause_of_death/people’. An example query of that relation would be (Traffic collision, /people/cause_of_death/people, ?) ‘Who were killed in traffic collision accidents?’. Another example would be /music/instrument/instrumentalists (Who plays the music instrument ‘X’?). From a query answering point of view, the answers to this question is a list of entities. However, during evaluation time, the model is evaluated based on whether it is able to predict the one target entity which is in the query triple. Also since minerva outputs the end points of the paths as target entities, it is sometimes possible that the particular target entity of the triple does not have a path from the source entity (however there are paths to other ‘correct’ answer entities). Table 7 shows few other examples of relations belonging to different classes.

Following Bordes et al. (2013), we classify a relation as 1-to-M if the ratio of cardinality of tail to head entities is greater than 1.5 and as M-to-1 if it is lesser than 0.67. In the validation set of FB15k-237, 54% of the queries are 1-to-M, whereas only 26% are M-to-1. Contrasting it with nell-995, 27% are 1-to-M and 36% are M-to-1 or umls or kinship where only 18% and 32% of the relations are 1-to-M. Table 8 shows relations from FB15k-237 dataset which have tail-to-head ratio. The average ratio for 1-to-M relations in FB15k-237 is 13.39 (substantially higher than 1.5). As explained before, the current evaluation scheme is unfair when it comes to 1-to-M relations and the high percentage of 1-to-M relations in FB15k-237 also explains the sub optimal performance of minerva.

8.2 Hyperparameters

In our experiments, we tune our model over two hyper parameters, viz., which is the entropy regularization constant and which is the moving average constant for the REINFORCE baseline. The table 9 lists the best hyper parameters for all the datasets.

Dataset Path Length
UMLS 0.05 0 2
KINSHIP 0.05 0.05 2
Countries S1 0.01 0.1 2
Countries S2 0.02 0.02 2
Countries S3 0.01 0.01 3
WN18RR 0.05 0.05 3
NELL-995 0.05 0.02 3
FB15K-237 0.02 0.05 3
Table 9: Best hyper parameters