NLProlog: Reasoning with Weak Unification for Question Answering in Natural Language

06/14/2019 ∙ by Leon Weber, et al. ∙ GFZ, Humboldt-Universität zu Berlin, UCL

Rule-based models are attractive for various tasks because they inherently lead to interpretable and explainable decisions and can easily incorporate prior knowledge. However, such systems are difficult to apply to problems involving natural language, due to its linguistic variability. In contrast, neural models can cope very well with ambiguity by learning distributed representations of words and their composition from data, but lead to models that are difficult to interpret. In this paper, we describe a model combining neural networks with logic programming in a novel manner for solving multi-hop reasoning tasks over natural language. Specifically, we propose to use a Prolog prover which we extend to utilize a similarity function over pretrained sentence encoders. We fine-tune the representations for the similarity function via backpropagation. This leads to a system that can apply rule-based reasoning to natural language, and induce domain-specific rules from training data. We evaluate the proposed system on two different question answering tasks, showing that it outperforms two baselines, BiDAF (Seo et al., 2016a) and FastQA (Weissenborn et al., 2017b), on a subset of the WikiHop corpus and achieves competitive results on the MedHop data set (Welbl et al., 2017).




1 Introduction

We consider the problem of multi-hop reasoning on natural language data. For instance, consider the statements “Socrates was born in Athens” and “Athens belongs to Greece”, and the question “Where was Socrates born?”. There are two possible answers following from the given statements, namely “Athens” and “Greece”. While the answer “Athens” follows directly from “Socrates was born in Athens”, the answer “Greece” requires the reader to combine both statements, using the knowledge that a person born in a city $X$, located in a country $Y$, is also born in $Y$. This step of combining multiple pieces of information is referred to as multi-hop reasoning (Welbl et al., 2017). In the literature, such multi-hop reading comprehension tasks are frequently solved via end-to-end differentiable (deep learning) models (Sukhbaatar et al., 2015; Peng et al., 2015; Seo et al., 2016b; Raison et al., 2018; Henaff et al., 2016; Kumar et al., 2016; Graves et al., 2016; Dhingra et al., 2018). Such models are capable of dealing with the linguistic variability and ambiguity of natural language by learning word and sentence-level representations from data. However, in such models, explaining the reasoning steps leading to an answer and interpreting the model parameters to extrapolate new knowledge is a very challenging task (Doshi-Velez and Kim, 2017; Lipton, 2018; Guidotti et al., 2019). Moreover, such models tend to require large amounts of training data to generalise correctly, and incorporating background knowledge is still an open problem (Rocktäschel et al., 2015; Weissenborn et al., 2017a; Rocktäschel and Riedel, 2017; Evans and Grefenstette, 2017).

In contrast, rule-based models are easily interpretable, naturally produce explanations for their decisions, and can generalise from smaller quantities of data. However, these methods are not robust to noise and can hardly be applied to domains where data is ambiguous, such as vision and language (Moldovan et al., 2003; Rocktäschel and Riedel, 2017; Evans and Grefenstette, 2017).

In this paper, we introduce NLProlog, a system combining a symbolic reasoner and a rule-learning method with distributed sentence and entity representations to perform rule-based multi-hop reasoning on natural language input. (NLProlog and our evaluation code are publicly available.) NLProlog generates partially interpretable and explainable models, and allows for easy incorporation of prior knowledge. It can be applied to natural language without the need to convert it to an intermediate logic form. At the core of NLProlog is a backward-chaining theorem prover, analogous to the backward-chaining algorithm used by Prolog reasoners (Russell and Norvig, 2010b), where comparisons between symbols are replaced by a differentiable similarity function between their distributed representations (Sessa, 2002). To this end, we use end-to-end differentiable sentence encoders, which are initialized with pretrained sentence embeddings (Pagliardini et al., 2017) and then fine-tuned on a downstream task. The differentiable fine-tuning objective enables us to learn domain-specific logic rules, such as transitivity of the relation is in, from natural language data. We evaluate our approach on two challenging multi-hop Question Answering data sets, namely MedHop and WikiHop (Welbl et al., 2017).

Our main contributions are the following: i) We show how backward-chaining reasoning can be applied to natural language data by using a combination of pretrained sentence embeddings, a logic prover, and fine-tuning via backpropagation, ii) We describe how a Prolog reasoner can be enhanced with a differentiable unification function based on distributed representations (embeddings), iii) We evaluate the proposed system on two different Question Answering (QA) datasets, and demonstrate that it achieves competitive results in comparison with strong neural QA models while providing interpretable proofs using learned rules.

2 Related Work

Our work touches in general on weak-unification based fuzzy logic (Sessa, 2002) and focuses on multi-hop reasoning for QA, the combination of logic and distributed representations, and theorem proving for question answering.

Multi-hop Reasoning for QA.

One prominent approach for enabling multi-hop reasoning in neural QA models is to iteratively update a query embedding by integrating information from embeddings of context sentences, usually using an attention mechanism and some form of recurrence (Sukhbaatar et al., 2015; Peng et al., 2015; Seo et al., 2016b; Raison et al., 2018). These models have achieved state-of-the-art results in a number of reasoning-focused QA tasks. Henaff et al. (2016) employ a differentiable memory structure that is updated each time a new piece of information is processed. The memory slots can be used to track the state of various entities, which can be considered as a form of temporal reasoning. Similarly, the Neural Turing Machine (Graves et al., 2016) and the Dynamic Memory Network (Kumar et al., 2016), which are built on differentiable memory structures, have been used to solve synthetic QA problems requiring multi-hop reasoning. Dhingra et al. (2018) modify an existing neural QA model to additionally incorporate coreference information provided by a coreference resolution model. De Cao et al. (2018) build a graph connecting entities and apply Graph Convolutional Networks (Kipf and Welling, 2016) to perform multi-hop reasoning, which leads to strong results on WikiHop. Zhong et al. (2019) propose a new neural QA architecture that combines coarse-grained and fine-grained reasoning to achieve very strong results on WikiHop.

All of the methods above perform reasoning implicitly as a sequence of opaque differentiable operations, making the interpretation of the intermediate reasoning steps very challenging. Furthermore, it is not obvious how to leverage user-defined inference rules during the reasoning procedure.

Combining Rule-based and Neural Models.

In the Artificial Intelligence literature, integrating symbolic and sub-symbolic representations is a long-standing problem (Besold et al., 2017). Our work is closely related to the integration of Markov Logic Networks (Richardson and Domingos, 2006) and Probabilistic Soft Logic (Bach et al., 2017) with word embeddings, which was applied to Recognizing Textual Entailment (RTE) and Semantic Textual Similarity (STS) tasks (Garrette et al., 2011, 2014; Beltagy et al., 2013, 2014), improving over purely rule-based and neural baselines.

An area in which neural multi-hop reasoning models have been investigated is Knowledge Base Completion (KBC) (Das et al., 2016; Cohen, 2016; Neelakantan et al., 2015; Rocktäschel and Riedel, 2017; Das et al., 2017; Evans and Grefenstette, 2018). While QA could be in principle modeled as a KBC task, the construction of a Knowledge Base (KB) from text is a brittle and error prone process, due to the inherent ambiguity of natural language.

Closely related to our approach are Neural Theorem Provers (NTPs) (Rocktäschel and Riedel, 2017): given a goal, its truth score is computed via a continuous relaxation of the backward-chaining reasoning algorithm, using a differentiable unification operator. Since the number of candidate proofs grows exponentially with the length of proofs, NTPs cannot scale even to moderately sized knowledge bases, and are thus not applicable to natural language problems in their current form. We solve this issue by using an external prover and pretrained sentence representations to efficiently discard all proof trees producing proof scores lower than a given threshold, significantly reducing the number of candidate proofs.

Theorem Proving for Question Answering.

Our work is not the first to apply theorem proving to QA problems. Angeli et al. (2016) employ a system based on Natural Logic to search a large KB for a single statement that entails the candidate answer. This is different from our approach, as we aim to learn a set of rules that combine multiple statements to answer a question.

Systems like Watson (Ferrucci et al., 2010) and COGEX (Moldovan et al., 2003) utilize an integrated theorem prover, but require a transformation of the natural language sentences to logical atoms. In the case of COGEX, this improves the accuracy of the underlying system by 30%, and increases its interpretability. While this work is similar in spirit, we greatly simplify the preprocessing step by replacing the transformation of natural language to logic with the simpler approach of transforming text to triples by using co-occurrences of named entities.

Fader et al. (2014) propose OpenQA, a system that utilizes a mixture of handwritten and automatically obtained operators that are able to parse, paraphrase and rewrite queries, which allows them to perform large-scale QA on KBs that include Open IE triples. While this work shares the same goal – answering questions using facts represented by natural language triples – we choose to address the problem of linguistic variability by integrating neural components, and focus on the combination of multiple facts by learning logical rules.

3 Background

In the following, we briefly introduce the backward chaining algorithm and unification procedure (Russell and Norvig, 2016) used by Prolog reasoners, which lie at the core of NLProlog. We consider Prolog programs that consist of a set of rules in the form of Horn clauses:

$$h(f^h_1, \ldots, f^h_n) \Leftarrow p_1(f^1_1, \ldots, f^1_m) \wedge \ldots \wedge p_B(f^B_1, \ldots, f^B_l),$$

where $h, p_b$ are predicate symbols, and $f^i_j$ are either function (denoted in lower case) or variable (upper case) symbols. The domain of function symbols is denoted by $\mathcal{F}$, and the domain of predicate symbols by $\mathcal{P}$. $h(f^h_1, \ldots, f^h_n)$ is called the head and $p_1(f^1_1, \ldots, f^1_m) \wedge \ldots \wedge p_B(f^B_1, \ldots, f^B_l)$ the body of the rule. We call $B$ the body size of the rule, and rules with a body size of zero are named atoms (short for atomic formula). If an atom does not contain any variable symbols, it is termed a fact.

For simplicity, we only consider function-free Prolog in our experiments, i.e. Datalog (Gallaire and Minker, 1978) programs where all function symbols have arity zero and are called entities and, similarly to related work (Sessa, 2002; Julián-Iranzo et al., 2009), we disregard negation and disjunction. However, in principle NLProlog also supports functions with higher arity.

A central component in a Prolog reasoner is the unification operator: given two atoms, it tries to find variable substitutions that make both atoms syntactically equal. For example, the atoms country(Greece, Europe) and country(X, Y) result in the following variable substitutions after unification: {X/Greece, Y/Europe}.

Prolog uses backward chaining for proving assertions. Given a goal atom $g$, this procedure first checks whether $g$ is explicitly stated in the KB, in which case it can be proven. If it is not, the algorithm attempts to prove it by applying suitable rules, thereby generating subgoals that are proved next. To find applicable rules, it attempts to unify $g$ with the heads of all available rules. If this unification succeeds, the resulting variable substitutions are applied to the atoms in the rule body: each of those atoms becomes a subgoal, and each subgoal is recursively proven using the same strategy.

For instance, the application of the rule country(X, Y) ⇐ born_in(Y, X) to the goal country(Obama, USA) would yield the subgoal born_in(USA, Obama). Then the process is repeated for all subgoals until no subgoal is left to be proven. The result of this procedure is a set of rule applications and variable substitutions referred to as a proof. Note that the number of possible proofs grows exponentially with the proof depth, as every rule might be used in the proof of each subgoal. Pseudocode for weak unification can be found in Appendix A; we refer the reader to Russell and Norvig (2010a) for an in-depth treatment of the unification procedure.
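The backward-chaining procedure described in this section can be sketched in a few lines of Python. This is only an illustration with exact (not weak) unification over function-free atoms, not the NLProlog implementation. The tuple encoding of atoms, the helper names (`is_var`, `walk`, `unify`, `rename`, `prove`), the lower-case entity names and the `budget` parameter that bounds the number of rule applications are all assumptions made for this example.

```python
import itertools

_fresh = itertools.count()  # used to give each rule application fresh variable names


def is_var(term):
    # Prolog convention assumed here: variables start with an upper-case letter.
    return term[:1].isupper()


def walk(term, subst):
    # Follow variable bindings until an unbound variable or a constant is reached.
    while is_var(term) and term in subst:
        term = subst[term]
    return term


def unify(a, b, subst):
    # Atoms are tuples (predicate, arg1, ...); predicates must match exactly.
    if a[0] != b[0] or len(a) != len(b):
        return None
    subst = dict(subst)
    for x, y in zip(a[1:], b[1:]):
        x, y = walk(x, subst), walk(y, subst)
        if x == y:
            continue
        if is_var(x):
            subst[x] = y
        elif is_var(y):
            subst[y] = x
        else:
            return None  # two different constants
    return subst


def rename(rule):
    # Rename the variables of a rule so that separate applications do not clash.
    head, body = rule
    n = next(_fresh)

    def fresh(atom):
        return (atom[0],) + tuple(f"{t}_{n}" if is_var(t) else t for t in atom[1:])

    return fresh(head), [fresh(atom) for atom in body]


def prove(goals, rules, subst, budget):
    # Yield one substitution per proof; facts are rules with an empty body.
    if not goals:
        yield subst
        return
    first, rest = goals[0], goals[1:]
    for rule in rules:
        if rule[1] and budget == 0:
            continue  # no rule applications left
        head, body = rename(rule)
        new_subst = unify(first, head, subst)
        if new_subst is not None:
            yield from prove(body + rest, rules, new_subst,
                             budget - 1 if body else budget)


rules = [
    (("born_in", "socrates", "athens"), []),
    (("located_in", "athens", "greece"), []),
    (("born_in", "X", "Z"), [("born_in", "X", "Y"), ("located_in", "Y", "Z")]),
]
query = [("born_in", "socrates", "W")]
answers = sorted({walk("W", s) for s in prove(query, rules, {}, budget=1)})
print(answers)
```

With a budget of one rule application, the query yields both the direct answer and the answer that requires combining the two facts.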

4 NLProlog

Applying a logic reasoner to QA requires transforming the natural language paragraphs to logical representations, which is a brittle and error-prone process.

Our aim is reasoning with natural language representations in the form of triples, where entities and relations may appear under different surface forms. For instance, the textual mentions is located in and lies in express the same concept. We propose replacing the exact matching between symbols in the Prolog unification operator with a weak unification operator (Sessa, 2002), which allows us to unify two different symbols $s_1, s_2$ by comparing their representations using a differentiable similarity function $\sim_\theta$ with values in $[0, 1]$ and parameters $\theta$.

With the weak unification operator, the comparison between two logical atoms results in a unification score obtained by aggregating the individual similarity scores. Inspired by fuzzy logic t-norms (Gupta and Qi, 1991), aggregation operators are e.g. the minimum or the product of all scores. The result of backward-chaining with weak unification is a set of proofs, each associated with a proof score measuring the truth degree of the goal with respect to a given proof. Similarly to backward chaining, where only successful proofs are considered, in NLProlog the final proof success score is obtained by taking the maximum over the success scores of all found proofs. NLProlog combines inference based on the weak unification operator and distributed representations, to allow reasoning over sub-symbolic representations, such as embeddings, obtained from natural language statements.

Each natural language statement is first translated into a triple, where the first and third element denote the entities involved in the sentence, and the second element denotes the textual surface pattern connecting the entities. All elements in each triple, both the entities and the textual surface pattern, are then embedded into a vector space. These vector representations are used by the similarity function $\sim_\theta$ for computing similarities between two entities or two textual surface patterns and, in turn, by the backward chaining algorithm with the weak unification operator for deriving a proof score for a given assertion. Note that the resulting proof score is fully end-to-end differentiable with respect to the model parameters $\theta$: we can train NLProlog using gradient-based optimisation by back-propagating the prediction error to $\theta$. Fig. 1 shows an outline of the model, its components and their interactions.

4.1 Triple Extraction

To transform the support documents to natural language triples, we first detect entities by performing entity recognition with spaCy (Honnibal and Montani, 2017). From these, we generate triples by extracting all entity pairs that co-occur in the same sentence, and we use the sentence as the predicate, blinding the entities. For instance, the sentence “Socrates was born in Athens and his father was Sophronicus” is converted into the following triples: i) (Socrates, ENT1 was born in ENT2 and his father was Sophronicus, Athens), ii) (Socrates, ENT1 was born in Athens and his father was ENT2, Sophronicus), and iii) (Athens, Socrates was born in ENT1 and his father was ENT2, Sophronicus). We also experimented with various Open Information Extraction frameworks (Niklaus et al., 2018): in our experiments, such methods had very low recall, which led to significantly lower accuracy values.
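The blinding step can be illustrated with a short Python sketch. It assumes that the entity mentions of a sentence have already been detected (for example by a named entity recogniser) and are passed in as plain strings; the function name `extract_triples` and its signature are hypothetical and not taken from the NLProlog code base.

```python
from itertools import combinations


def extract_triples(sentence, entities):
    """Return one (entity1, blinded sentence, entity2) triple per co-occurring entity pair."""
    # Keep only the entities that occur in the sentence, ordered by first occurrence.
    present = sorted((e for e in entities if e in sentence), key=sentence.index)
    triples = []
    for first, second in combinations(present, 2):
        # Blind the two entities of the pair; all other mentions stay in the pattern.
        pattern = sentence.replace(first, "ENT1", 1).replace(second, "ENT2", 1)
        triples.append((first, pattern, second))
    return triples


sentence = "Socrates was born in Athens and his father was Sophronicus"
triples = extract_triples(sentence, ["Socrates", "Athens", "Sophronicus"])
for triple in triples:
    print(triple)
```

On the example sentence, this sketch reproduces the three triples listed above. It replaces only the first occurrence of each mention, which is a simplification for sentences that mention an entity more than once.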

4.2 Similarity Computation

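Embedding representations of the symbols in a triple are computed using an encoder $e_\theta : \mathcal{F} \cup \mathcal{P} \to \mathbb{R}^d$ parameterized by $\theta$, where $\mathcal{F}, \mathcal{P}$ denote the sets of entity and predicate symbols, and $d$ denotes the embedding size. The resulting embeddings are used to induce the similarity function $\sim_\theta$, given by their cosine similarity scaled to $[0, 1]$:

$$s_1 \sim_\theta s_2 = \frac{1}{2}\left(1 + \frac{e_\theta(s_1)^\top e_\theta(s_2)}{\lVert e_\theta(s_1)\rVert \cdot \lVert e_\theta(s_2)\rVert}\right) \qquad (1)$$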
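As a small illustration, a cosine similarity rescaled from $[-1, 1]$ to $[0, 1]$ can be computed as follows. The sketch assumes that the encoder output is available as a plain list of floats with non-zero norm; the function name `similarity` is chosen for this example only.

```python
import math


def similarity(u, v):
    """Cosine similarity of two non-zero vectors, rescaled from [-1, 1] to [0, 1]."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 0.5 * (1.0 + dot / (norm_u * norm_v))


print(similarity([1.0, 0.0], [1.0, 0.0]))   # identical direction
print(similarity([1.0, 0.0], [0.0, 1.0]))   # orthogonal vectors
print(similarity([1.0, 0.0], [-1.0, 0.0]))  # opposite direction
```

Identical directions map to 1, orthogonal vectors to 0.5 and opposite directions to 0, so the score can be used directly as a fuzzy truth degree.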



In our experiments, for encoding textual surface patterns, we use a sentence encoder composed of a static pre-trained component, namely Sent2vec (Pagliardini et al., 2017), and a Multi-Layer Perceptron (MLP) with one hidden layer and Rectified Linear Unit (ReLU) activations (Jarrett et al., 2009). For encoding predicate symbols and entities, we use a randomly initialised embedding matrix. During training, both the MLP and the embedding matrix are learned via backpropagation, while the sentence encoder is kept fixed.

Additionally, we introduce a third lookup table and MLP for the predicate symbols of rules and goals. The main reason for this choice is that the semantics of goal and rule predicates may differ from the semantics of fact predicates, even if they share the same surface form. For instance, the query (X, parent, Y) can be interpreted either as (X, is the parent of, Y) or as (X, has parent, Y), which are semantically dissimilar.

Figure 1: Overview of NLProlog – all components are depicted as ellipses, while inputs and outputs are drawn as squares. Phrases with red background are entities and blue ones are predicates.

4.3 Training the Encoders

We train the encoder parameters $\theta$ on a downstream task via gradient-based optimization. Specifically, we train NLProlog with backpropagation using a learning from entailment setting (Muggleton and Raedt, 1994), in which the model is trained to decide whether a Prolog program $R$ entails the truth of a candidate triple $c \in C$, where $C$ is the set of candidate triples. The objective is a model that assigns high probabilities $p(c \mid R; \theta)$ to true candidate triples, and low probabilities to false triples. During training, we minimize the following loss:

$$L(\theta) = -\log p(a \mid R; \theta) - \log\left(1 - \max_{c \in C \setminus \{a\}} p(c \mid R; \theta)\right), \qquad (2)$$

where $a \in C$ is the correct answer. For simplicity, we assume that there is only one correct answer per example, but an adaptation to multiple correct answers would be straightforward, e.g. by taking the minimum of all answer scores.
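The following Python sketch evaluates a loss of this kind for a single training example. It assumes the loss is the negative log-probability of the correct answer plus the negative log of one minus the highest-scoring incorrect candidate; the function name `entailment_loss`, the dictionary interface and the clipping constant `eps` are assumptions made for this illustration. In an actual training loop the scores would be differentiable tensors rather than floats.

```python
import math


def entailment_loss(scores, answer, eps=1e-12):
    """scores maps each candidate to its proof score in [0, 1]; answer is the correct candidate."""
    p_answer = scores[answer]
    # Highest score among the incorrect candidates (0 if there are none).
    p_wrong = max((p for c, p in scores.items() if c != answer), default=0.0)
    # eps avoids log(0); it is a numerical safeguard added for this sketch.
    return -math.log(max(p_answer, eps)) - math.log(max(1.0 - p_wrong, eps))


good = entailment_loss({"Greece": 0.9, "Italy": 0.2}, "Greece")
bad = entailment_loss({"Greece": 0.3, "Italy": 0.8}, "Greece")
print(good, bad)
```

The loss is small when the correct candidate has a high proof score and all incorrect candidates have low scores, and grows when either condition is violated.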

To estimate $p(c \mid R; \theta)$, we enumerate all proofs for the triple $c$ up to a given depth $D$, where $D$ is a user-defined hyperparameter. This search yields a number of proofs, each with a success score $S_i$. We set $p(c \mid R; \theta)$ to be the maximum of such proof scores:

$$p(c \mid R; \theta) = S_{\max} = \max_i S_i \in [0, 1].$$

Note that the final proof score $p(c \mid R; \theta)$ only depends on the proof with maximum success score $S_{\max}$. Thus, we propose to first conduct the proof search by using a prover utilizing the similarity function $\sim_{\theta_t}$ induced by the current parameters $\theta_t$, which allows us to compute the maximum proof score $S_{\max}$. The score for each proof is given by the aggregation, using either the minimum or the product function, of the weak unification scores, which in turn are computed via the differentiable similarity function $\sim_\theta$. It follows that $p(c \mid R; \theta)$ is end-to-end differentiable, and can be used for updating the model parameters $\theta$ via Stochastic Gradient Descent.
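The two-level aggregation (minimum or product within a proof, maximum across proofs) can be sketched as follows. The function names and the representation of a proof as a non-empty list of weak unification scores are assumptions made for this example.

```python
from math import prod


def proof_success(unification_scores, aggregation="min"):
    """Aggregate the weak unification scores of one proof into its success score."""
    if aggregation == "min":
        return min(unification_scores)
    if aggregation == "product":
        return prod(unification_scores)
    raise ValueError(f"unknown aggregation: {aggregation}")


def goal_score(proofs, aggregation="min"):
    """Score of a goal: the maximum success score over all proofs found for it."""
    # Assumption of this sketch: a goal without any proof receives score 0.
    return max((proof_success(p, aggregation) for p in proofs), default=0.0)


proofs = [[0.9, 0.8], [0.95, 0.5]]
print(goal_score(proofs, "min"))
print(goal_score(proofs, "product"))
```

Because only the best proof contributes to the final score, gradients flow only through the unification scores of that proof.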

4.4 Runtime Complexity of Proof Search

The worst-case complexity of vanilla logic programming is exponential in the depth of the proof (Russell and Norvig, 2010a). In our case, this is a particular problem because weak unification requires the prover to attempt unification between all entity and predicate symbols.

To keep things tractable, NLProlog only attempts to unify symbols with a similarity greater than some user-defined threshold $\lambda$. Furthermore, in the search step for one statement $q$, $\lambda$ is set to $\max(\lambda, S)$ for the rest of the search whenever a proof for $q$ with success score $S$ is found. Due to the monotonicity of the employed aggregation functions, this allows us to prune the search tree without losing the guarantee of finding the proof yielding the maximum success score $S_{\max}$, provided that $S_{\max} \geq \lambda$. We found this optimization to be crucial to make the proof search scale on the considered data sets.
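The effect of the dynamic threshold can be illustrated with a simplified sketch in which each candidate proof is a flat list of unification scores that is consumed step by step. A real prover interleaves this pruning with the tree search itself; the function name `best_proof_score` and the flat representation are assumptions made for this example.

```python
def best_proof_score(candidate_proofs, threshold, aggregate=min):
    """Return the best success score at or above the threshold, or None if no proof reaches it."""
    best = None
    for steps in candidate_proofs:
        running = 1.0
        pruned = False
        for score in steps:
            # aggregate is min or a product; both can only lower the running score.
            running = aggregate(running, score)
            if running < threshold:
                pruned = True  # this proof can no longer beat the current bound
                break
        if not pruned:
            best = running
            # Raise the threshold so that later proofs must be at least as good.
            threshold = max(threshold, running)
    return best


candidates = [[0.9, 0.6], [0.7, 0.95], [0.8, 0.9, 0.85]]
print(best_proof_score(candidates, threshold=0.5))
print(best_proof_score(candidates, threshold=0.9))
```

Since the running score never increases as further unification steps are added, abandoning a partial proof below the threshold cannot discard the best proof.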

4.5 Rule Learning

In NLProlog, the reasoning process depends on rules that describe the relations between predicates. While it is possible to write down rules involving natural language patterns, this approach does not scale. Thus, we follow Rocktäschel and Riedel (2017) and use rule templates to perform Inductive Logic Programming (ILP) (Muggleton, 1991), which allows NLProlog to learn rules from training data. In this setting, a user has to define a set of rules with a given structure as input. Then, NLProlog can learn the rule predicate embeddings from data by minimizing the loss function in Eq. 2 using gradient-based optimization methods.

For instance, to induce a rule that can model transitivity, we can use a rule template of the form $p_1(X, Z) \Leftarrow p_2(X, Y) \wedge p_3(Y, Z)$, and NLProlog will instantiate multiple rules with randomly initialized embeddings for $p_1$, $p_2$, and $p_3$, and fine-tune them on a downstream task. The exact number and structure of the rule templates is treated as a hyperparameter.

Unless explicitly stated otherwise, all experiments were performed with the same set of rule templates containing two rules for each of the forms $q(X, Y) \Leftarrow p_2(X, Y)$, $p_1(X, Y) \Leftarrow p_2(Y, X)$ and $p_1(X, Z) \Leftarrow p_2(X, Y) \wedge p_3(Y, Z)$, where $q$ is the query predicate. The number and structure of these rule templates can be easily modified, allowing the user to incorporate additional domain-specific background knowledge, such as born_in(X, Z) ⇐ born_in(X, Y) ∧ located_in(Y, Z).

5 Evaluation

We evaluate our method on two QA datasets, namely MedHop, and several subsets of WikiHop (Welbl et al., 2017). These data sets are constructed in such a way that it is often necessary to combine information from multiple documents to derive the correct answer.

In both data sets, each data point consists of a query $p(e, X)$, where $e$ is an entity, $X$ is a variable representing the entity that needs to be predicted, $C$ is a list of candidate entities, $a \in C$ is an answer entity and $p$ is the query predicate. Furthermore, every query is accompanied by a set of support documents which can be used to decide which of the candidate entities is the correct answer.

5.1 MedHop

MedHop is a challenging multi-hop QA data set, and contains only a single query predicate. The goal in MedHop is to predict whether two drugs interact with each other, by considering the interactions between proteins that are mentioned in the support documents. Entities in the support documents are mapped to database identifiers. To compute better entity representations, we reverse this mapping and replace all mentions with the drug and protein names gathered from DrugBank (Wishart et al., 2006) and UniProt (Apweiler et al., 2004).

5.2 Subsets of WikiHop

To further validate the effectiveness of our method, we evaluate on different subsets of WikiHop (Welbl et al., 2017), each containing a single query predicate. We consider the predicates publisher, developer, country, and record_label, because their semantics ensure that the annotated answer is unique and they contain a relatively large number of questions that are annotated as requiring multi-hop reasoning. For the predicate publisher, this yields 509 training and 54 validation questions, for developer 267 and 29, for country 742 and 194, and for record_label 2305 and 283. As the test set of WikiHop is not publicly available, we report scores for the validation set.

5.3 Baselines

Following Welbl et al. (2017), we use two neural QA models, namely BiDAF (Seo et al., 2016a) and FastQA (Weissenborn et al., 2017b), as baselines for the considered WikiHop predicates. We use the implementation provided by the Jack QA framework (Weissenborn et al., 2018) with the same hyperparameters as used by Welbl et al. (2017), and train a separate model for each predicate. (We also experimented with the AllenNLP implementation of BiDAF, obtaining comparable results.) To ensure that the performance of the baseline is not adversely affected by the relatively small number of training examples, we also evaluate the BiDAF model trained on the whole WikiHop corpus. In order to compensate for the fact that both models are extractive QA models which cannot make use of the candidate entities, we additionally evaluate modified versions which transform both the predicted answer and all candidates to vectors using the wiki-unigrams model of Sent2vec (Pagliardini et al., 2017). We then return the candidate entity which has the highest cosine similarity to the predicted entity. We use the normalized version of MedHop for training and evaluating the baselines, since we observed that denormalizing it (as for NLProlog) severely harmed performance. Furthermore, on MedHop, we equip the models with word embeddings that were pretrained on a large biomedical corpus (Pyysalo et al., 2013).

5.4 Hyperparameter Configuration

On MedHop we optimize the embeddings of predicate symbols of rules and query triples, as well as of entities. WikiHop has a large number of unique entity symbols, which makes learning their embeddings prohibitive. Thus, we only train the predicate symbols of rules and query triples on this data set. For MedHop we use bigram Sent2vec embeddings trained on a large biomedical corpus, and for WikiHop the wiki-unigrams model of Sent2vec. All experiments were performed with the same set of rule templates as described in Section 4.5, and with a fixed similarity threshold $\lambda$ and maximum proof depth $D$. We use Adam (Kingma and Ba, 2014) with default parameters.

5.5 Results

The results for the development portions of WikiHop and MedHop are shown in Table 1. For all predicates but developer, NLProlog strongly outperforms all tested neural QA models, while achieving the same accuracy as the best performing QA model on developer. We evaluated NLProlog on the hidden test set of MedHop and obtained an accuracy of 29.3%, which is 6.1 pp better than FastQA and 18.5 pp worse than BiDAF. (Note that these numbers are taken from Welbl et al. (2017) and were obtained with different implementations of BiDAF and FastQA.) As the test set is hidden, we cannot diagnose the exact reason for the inconsistency with the results on the development set, but observe that FastQA suffers from a similar drop in performance.

Model                    MedHop   publisher   developer   country   record_label
BiDAF                    42.98    66.67       65.52       53.09     68.90
 + Sent2Vec              -        75.93       68.97       61.86     75.62
 + Sent2Vec + wikihop    -        74.07       62.07       66.49     78.09
FastQA                   52.63    62.96       62.07       57.21     70.32
 + Sent2Vec              -        75.93       58.62       64.95     78.09
NLProlog                 65.78    83.33       68.97       77.84     79.51
 - rules                 64.33    83.33       68.97       74.23     74.91
 - entity MLP            37.13    68.52       41.38       72.16     64.66
Table 1: Accuracy scores in percent for different predicates on the development set of the respective predicates. +/- denote independent modifications to the base algorithm.

5.6 Importance of Rules

Exemplary proofs generated by NLProlog for the predicates record_label and country can be found in Fig. 2.

To study the impact of the rule-based reasoning on the predictive performance, we perform an ablation experiment in which we train NLProlog without any rule templates. The results can be found in the bottom half of Table 1. On three of the five evaluated data sets, performance decreases markedly when no rules can be used and does not change on the remaining two data sets. This indicates that reasoning with logic rules is beneficial in some cases and does not hurt performance in the remaining ones.

5.7 Impact of Entity Embeddings

In a qualitative analysis, we observed that in many cases multi-hop reasoning was performed via aligning entities and not by applying a multi-hop rule. For instance, the proof of the statement country(Oktabrskiy Big Concert Hall, Russia) visualized in Figure 2, is performed by making the embeddings of the entities Oktabrskiy Big Concert Hall and Saint Petersburg sufficiently similar. To gauge the extent of this effect, we evaluate an ablation in which we remove the MLP on top of the entity embeddings. The results, which can be found in Table 1, show that fine-tuning entity embeddings plays an integral role, as the performance degrades drastically. Interestingly, the observed performance degradation is much worse than when training without rules, suggesting that much of the reasoning is actually performed by finding a suitable transformation of the entity embeddings.

Figure 2: Example proof trees generated by NLProlog, showing a combination of multiple rules. Entities are shown in red and predicates in blue. Note, that entities do not need to match exactly. The first and third proofs were obtained without the entity MLP (as described in Section 5.7), while the second one was obtained in the full configuration of NLProlog.

5.8 Error Analysis

We performed an error analysis for each of the WikiHop predicates. To this end, we examined all instances in which one of the neural QA models (with Sent2Vec) produced a correct prediction and NLProlog did not, and labeled them with pre-defined error categories. There were 55 such instances. One category of errors was due to NLProlog unifying the wrong entities, mainly because of an over-reliance on heuristics, such as predicting a record label if it is from the same country as the artist. In a second category, NLProlog produced a correct prediction, but another candidate was defined as the answer. In a third category, the error was due to predicate unification, i.e. NLProlog identified the correct entities, but the sentence did not express the target relation. Furthermore, we performed an evaluation on all problems of the studied WikiHop predicates that were unanimously labeled as containing the correct answer in the support texts by Welbl et al. (2017). On this subset, the micro-averaged accuracy of NLProlog increases in absolute terms, while the accuracy of both BiDAF and FastQA augmented with Sent2Vec decreases. We conjecture that this might be due to NLProlog’s reliance on explicit reasoning, which could make it less susceptible to spurious correlations between the query and supporting text.

6 Discussion and Future Work

We proposed NLProlog, a system that is able to perform rule-based reasoning on natural language, and can learn domain-specific rules from data. To this end, we proposed to combine a symbolic prover with pretrained sentence embeddings, and to train the resulting system using backpropagation. We evaluated NLProlog on two different QA tasks, showing that it can learn domain-specific rules and produce predictions which outperform those of the two strong baselines BiDAF and FastQA in most cases.

While we focused on a subset of First Order Logic in this work, the expressiveness of NLProlog could be extended by incorporating a different symbolic prover. For instance, a prover for temporal logic (Orgun and Ma, 1994) would make it possible to model temporal dynamics in natural language. We are also interested in incorporating future improvements of symbolic provers, triple extraction systems and pretrained sentence representations to further enhance the performance of NLProlog. Additionally, it would be interesting to study the behavior of NLProlog in the presence of multiple WikiHop query predicates.


Leon Weber and Jannes Münchmeyer acknowledge the support of the Helmholtz Einstein International Berlin Research School in Data Science (HEIBRiDS). We would like to thank the anonymous reviewers for the constructive feedback. We gratefully acknowledge the support of NVIDIA Corporation with the donation of a Titan X Pascal GPU used for this research.


Appendix A Algorithms

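function unify(x, y, θ, S)
    Input:
        x: function f(…) | atom p(…) | variable | list x1 :: x2 :: … :: xn
        y: function f′(…) | atom p′(…) | variable | list y1 :: y2 :: … :: ym
        θ: current substitutions, default = {}
        S: current success score, default = 1.0
    Output: (Unifying substitution θ′ or failure, Updated success score S′)
    if θ = failure then return (failure, 0)
    else if S < λ then return (failure, 0)
    else if x = y then return (θ, S)
    else if x is Var then return unify_var(x, y, θ, S)
    else if y is Var then return unify_var(y, x, θ, S)
    else if x is f(x1, …, xn), y is f′(y1, …, yn), and f ∼ f′ ≥ λ then
        S′ := S ∧ (f ∼ f′)
        return unify(x1 :: … :: xn, y1 :: … :: yn, θ, S′)
    else if x is p(x1, …, xn), y is p′(y1, …, yn), and p ∼ p′ ≥ λ then
        S′ := S ∧ (p ∼ p′)
        return unify(x1 :: … :: xn, y1 :: … :: yn, θ, S′)
    else if x is x1 :: … :: xn and y is y1 :: … :: yn then
        (θ′, S′) := unify(x1, y1, θ, S)
        return unify(x2 :: … :: xn, y2 :: … :: yn, θ′, S′)
    else if x is empty list and y is empty list then return (θ, S)
    else return (failure, 0)

function unify_var(v, o, θ, S)
    if {v/val} ∈ θ then return unify(val, o, θ, S)
    else if {o/val} ∈ θ then return unify(v, val, θ, S)
    else return ({v/o} + θ, S)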
Algorithm 1 The weak unification algorithm in NLProlog without occurs check
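For function-free atoms, weak unification can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the NLProlog implementation: atoms are tuples `(predicate, arg1, ...)`, variables are strings starting with an upper-case letter, the similarity function `sim` is supplied by the caller, and the helper names `is_var`, `walk` and `weak_unify` are hypothetical.

```python
def is_var(term):
    # Assumed convention: variables start with an upper-case letter.
    return term[:1].isupper()


def walk(term, subst):
    # Follow variable bindings until an unbound variable or a constant is reached.
    while is_var(term) and term in subst:
        term = subst[term]
    return term


def weak_unify(x, y, subst, score, sim, threshold, aggregate=min):
    """Return (substitution, score) on success and (None, 0.0) on failure."""
    if len(x) != len(y) or score < threshold:
        return None, 0.0
    subst = dict(subst)
    for position, (a, b) in enumerate(zip(x, y)):
        is_argument = position > 0  # position 0 holds the predicate symbol
        if is_argument:
            a, b = walk(a, subst), walk(b, subst)
        if a == b:
            continue
        if is_argument and is_var(a):
            subst[a] = b
        elif is_argument and is_var(b):
            subst[b] = a
        else:
            # Two different symbols: compare them by similarity instead of failing.
            s = sim(a, b)
            if s < threshold:
                return None, 0.0
            score = aggregate(score, s)
            if score < threshold:
                return None, 0.0
    return subst, score


# Toy similarity table standing in for the learned similarity function.
table = {("is_located_in", "lies_in"): 0.9, ("athens", "athen"): 0.8}


def toy_sim(a, b):
    return table.get((a, b), table.get((b, a), 0.0))


goal = ("is_located_in", "X", "greece")
fact = ("lies_in", "athens", "greece")
print(weak_unify(goal, fact, {}, 1.0, toy_sim, threshold=0.5))
```

In the example, the two predicates differ on the surface but are similar enough to unify, so the variable is bound and the success score drops from 1.0 to the similarity of the predicates.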