1 Introduction
Automated reasoning, the task of conducting logical inference algorithmically, has been a longstanding focus of general artificial intelligence, with wide application in knowledge base completion, natural language understanding, question answering, and agent planning. In particular, with the recent advent of web-scale knowledge graphs such as Freebase [1], DBpedia [2], and Google's Knowledge Vault [3], and due to their high incompleteness [4], there has been rising interest in solutions to the knowledge base completion task.
For many years, reasoning has been tackled as the task of building systems capable of inferring new crisp symbolic logical rules [5, 6]. However, those traditional methods are too brittle to be applied to noisy, automatically created knowledge bases. With the recent revival of interest in artificial neural networks, neural link prediction models have been applied widely to the completion of knowledge graphs. These methods
[7, 8, 9, 10, 11, 12, 13] rely heavily on subsymbolic representations of entities and relations, learned through maximization of a scoring objective function over valid factual triples. Thus, the current success of such deep models hinges primarily on the power of those subsymbolic, continuous, real-valued representations in encoding the similarity/relatedness of entities and relations. Recent attempts have focused on neural multi-hop reasoners [14, 15, 16, 17, 18] to equip models to deal with more complex reasoning where multi-hop inference is required. More recently, a Neural Theorem Prover [19] has been proposed in an attempt to take advantage of both symbolic and subsymbolic reasoning. Despite their success, the main restriction common to machine-learning-based reasoners is that they are unable to recognize and generalize to analogous situations or tasks. This inherent limitation follows from both the representation functions used and the learning process. The major issue is the mere reliance of these models on representations of entities learned during training or in a pretraining phase and stored in a lookup table. Consequently, these models have difficulty dealing with out-of-vocabulary entities. Although the small-scale out-of-vocabulary problem has been addressed in part in the natural language processing domain by taking advantage of character-level embeddings
[20], learning embeddings on the fly by leveraging text descriptions or spelling [21], copy mechanisms [22], or pointer networks [23], these solutions are still insufficient for transfer purposes. An even greater source of concern is that reasoning in most of the above subsymbolic approaches hinges more on the notion of similarity and geometric proximity of real-valued vectors (induction) as opposed to performing transitive reasoning (deduction) over them.
Inspired by these observations, we take a different approach in this work by investigating the emulation of deductive symbolic reasoning using memory networks. Memory networks [24] are a class of learning models capable of conducting multiple computational steps over an explicit memory component before returning an answer. They have recently been applied successfully to a range of natural language processing tasks such as question answering [25, 26], language modeling [25], and dialogue tasks [27, 28]. They use memory to store the context or knowledge bases of facts explicitly and perform inference over it using multi-hop recurrent attention. End-to-end memory networks (MemN2N) [25]
are a less-supervised, more general version of these networks, applicable to settings where labeled supporting memories are not available. They are very similar to the original memory networks, except that the supporting memory slots are not predetermined as labels for the model. More specifically, the memory inputs useful for finding the correct answer are first retrieved through an attention mechanism, and an output vector is calculated as the weighted sum of the memory output representations. To apply multi-hop attention over the memory before emitting the response, the above process is repeated recursively K times, replacing the query vector with the sum of the query and output vectors obtained from the previous step. Finally, the output vector and the final query representation from the last hop pass through a final weight matrix and a softmax to produce the output label. We have selected such networks because we believe they are a primary candidate to perform well for deductive logical entailment. Their sequential nature corresponds, conceptually, to the sequential process underlying some deductive reasoning algorithms. The attention modeling corresponds to pulling in only the relevant information (logical axioms) necessary for the next reasoning step. Their success in natural language inference is also promising: while natural language inference does not follow a formal logical semantics, logical deductive entailment is nevertheless akin to some aspects of natural language reasoning. Besides, as attention can be traced over the run of a memory network, we furthermore gain insight into the "reasoning" underlying the network output, as we can see which pieces of the memory (i.e., the input knowledge graph) are taken into account at each step.
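To make the memory read concrete, the following minimal sketch (in NumPy, with toy dimensions and random values standing in for learned parameters) shows a single hop: attention over the memory slots, followed by the weighted sum that forms the output vector.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax."""
    e = np.exp(x - x.max())
    return e / e.sum()

def memory_hop(u, mem_in, mem_out):
    """One hop of end-to-end memory network attention.

    u:       (d,)   internal query state
    mem_in:  (n, d) input representations of the n memory slots
    mem_out: (n, d) output representations of the same slots
    Returns attention weights p over slots and the read vector o.
    """
    p = softmax(mem_in @ u)  # match the query against every slot
    o = p @ mem_out          # weighted sum of output representations
    return p, o

rng = np.random.default_rng(0)
u = rng.normal(size=4)
p, o = memory_hop(u, rng.normal(size=(6, 4)), rng.normal(size=(6, 4)))
u_next = u + o  # the query state fed into the next hop
```

The attention weights p can be inspected per hop, which is what lets us trace which memory slots (axioms) the network attends to at each reasoning step.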
This paper contributes a recipe involving a simple but effective normalization of knowledge base triples before learning their representation within an end-to-end memory network. To perform logical inference at a more abstract level, and thereby facilitate the transfer of reasoning expertise from one knowledge graph to another, the normalization maps entities and predicates in a knowledge graph to a generic vocabulary. Facts in additional knowledge bases are normalized using the same vocabulary, so that the network does not overfit to entity and predicate names in a specific knowledge base. This emulates symbolic reasoning with neural embeddings: the actual names (as strings) of entities from the underlying logic, such as variables, constants, functions, and predicates, are insubstantial for logical entailment, in the sense that a consistent renaming across a theory does not change the set of entailed formulas (under the same renaming). Thanks to the term-agnostic nature of our representation, we are able to create a reasoning system capable of reasoning over an unseen vocabulary in the test phase.
Our approach combines the best of two worlds: the transferability of classical deductive symbolic reasoning and the robustness of neural subsymbolic reasoning. This combination supports cross-knowledge-graph reasoning, obviating the need for supervised retraining on the task of interest, or for unsupervised pretraining on external data to learn representations, when encountering a new knowledge graph.
Our contributions are threefold: (i) We present the construction of memory networks for emulating symbolic deductive reasoning. (ii) We propose an optimization of this architecture using a normalization approach to enhance its transfer capability; we show that in an unnormalized setting, memory networks fail to perform well across knowledge graphs. (iii) We examine the efficacy of our model for cross-domain and cross-knowledge-graph deductive reasoning. We also show the robustness of our model to noisy train/test sets and its scalability (in terms of reduced time and space complexity) to large datasets.
This paper is structured as follows. In Section 2 we discuss related research efforts, including a brief history of attempts to integrate logical reasoning into connectionist approaches. In Sections 3 and 4, we concretely present the deep learning architecture we used. In Sections 5 and 6, we present an experimental evaluation of our approach and analyze our findings. We conclude and discuss future work in Section 7.
2 Related Work
Research into how artificial neural networks can be used to perform logical deductive reasoning tasks is often referred to as the study of neural-symbolic integration. It can be traced back at least to the landmark 1943 article by McCulloch and Pitts [29]
in which it was shown how propositional logic formulas can be represented using a simple neural network model with threshold activation functions. A comprehensive and recent state-of-the-art survey can be found in [30]; hence we will only mention the essentials for understanding the context our work is placed in. Most of the body of work on neural-symbolic integration concerns propositional logic only (see, e.g., [31]), and indeed relationships, both theoretical and practical in nature, between propositional logics and subsymbolic systems are relatively easy to come by, an observation to which John McCarthy referred as the "propositional fixation" of artificial neural networks [32]. Examples include Knowledge-Based Artificial Neural Networks [33] and the closely related propositional core method [34, 35]. Early attempts to go beyond propositional logic included the SHRUTI system [36, 37] which, however, uses a non-standard connectionist architecture and thus had severe limitations as far as learning was concerned. Approaches using standard artificial neural network architectures with proven good learning capabilities for first-order predicate logic [38]
or first-order logic programming [39, 40] were by their very design unable to scale beyond toy examples. In the past few years, deep learning as a subsymbolic machine learning paradigm has surpassed expectations as to the speed of progress in machine-learning-based problem solving, and it is a reasonable assumption that these developments have not yet met their natural limit. Consequently, they are being looked upon as promising for trying to overcome the symbolic-subsymbolic divide [41, 42, 43, 44, 19, 45, 46] (this list is not exhaustive). Even more work exists on inductive logical inference, e.g., [47, 19]
, which is not what we deal with in this work. Concretely, on the issue of doing logical reasoning using deep networks, we want to mention the following selected contributions. Tensor-based approaches have been used [19, 45, 46], following [48, 9]; however, these approaches are restricted in terms of logical expressibility and/or to toy examples and limited evaluations. [44] perform knowledge graph reasoning using RDF(S) [49, 50], based on knowledge graph embeddings; however, evaluation and training are done on the same knowledge graph, i.e., there is no learning of the general logical deduction calculus, and consequently no transfer thereof to new data. [43] considers OWL RL reasoning [50, 51]; however, again training and evaluation are done on the same knowledge bases, i.e., no transfer is possible and no general deduction calculus is acquired during training. In short, to the best of our knowledge, to date there is no subsymbolic reasoning work able to transfer learned reasoning capability from one knowledge graph to an unseen one. In fact, since previous works have focused on conducting reasoning on the unseen part of the same knowledge graph, they have tried to gain generalization ability through induction and robustness to missing edges [52], as opposed to deduction. Induction queries include those triples (s, p, o) where there is at least one missing link in every path from s to o in the knowledge graph. Likewise, recent years have seen some progress on zero-shot relation learning in the subsymbolic reasoning domain [14, 53, 54]. Zero-shot learning refers to the ability of a model to infer a relation between a pair of entities where that relation has not been seen in the training set [55]. This generalization capability is still quite limited and fundamentally different from our work in terms of both methodology and purpose.
3 Knowledge Graph Reasoning
In order to explain what we are setting out to do, let us first reframe the deductive reasoning (or entailment) problem as a classification task. Any given logic L comes with an entailment relation ⊨ ⊆ T_L × F_L, where F_L is the set of all logical formulas (or axioms) over L, and T_L is the set of all theories (i.e., sets of logical formulas) over L. If (T, F) ∈ ⊨, written T ⊨ F, then we say that F is entailed by T. Reframed as a classification task, we can ask whether a given pair (T, F) should be classified as a valid entailment (i.e., T ⊨ F holds), or as the opposite (i.e., T ⊭ F). Applying a deep learning approach to this, we would like to train a Deep Neural Network (DNN) on sets of examples (T, F), such that the DNN learns to correctly classify them as valid or invalid inferences. Of course, we would have to restrict our attention to finite theories, which is usually done in computational logic anyway.

3.1 Problem: Lack of Transferability
We wish to train a model whose learnings transfer to new theories within the same logic. That way, our results will demonstrate that the reasoning principles (inference rules) which underlie the logic have been learned; if we were to train a model such that it learns only to reason over one theory, that could hardly be demonstrated. One of the key obstacles we face, however, is to understand how to represent training and test data so that they can be used in standard deep learning settings. Logical theories are highly structured, and it is essentially this structure which determines logical entailments; indeed, some entailment algorithms can be understood in a straightforward way as a type of syntax rewriting system. At the same time, the actual names (as strings) of entities from the underlying logic, such as variables, constants, functions, and predicates, are insubstantial for logical entailment, in the sense that a consistent renaming across a theory does not change the set of entailed formulas (under the same renaming).
For use with standard deep learning approaches, formulas, or even theories, have to be represented in the real coordinate space as vectors (points), matrices, or tensors (multidimensional arrays); in deep learning, such a representation is commonly called an embedding. A plethora of embeddings for knowledge graphs have been proposed [56, 57, 13, 58, 11]; however, we are not aware of an existing embedding which adheres to the principles which seem important for the deductive reasoning scenario. Indeed, the prominent use case explored for knowledge graph embeddings is not deductive in nature; rather, it concerns the problem of the discovery or suggestion of additional links or edges in the graph, together with appropriate edge labels. In this link discovery setting, the actual labels for nodes or edges in the graph, and as such their common-sense meanings, are likely important, and most existing embeddings reflect this. For deductive reasoning, however, the names of entities are insubstantial, i.e., ideally should not be captured by an embedding. Another inherent problem in the use of such representations across knowledge graphs is the out-of-vocabulary problem. Formally speaking, such methods define a matrix E ∈ R^{d×|V|} to store a d-dimensional real-valued vector for each word in the vocabulary V. Given the word lookup table E, the embedding for a word w can be obtained by multiplying the lookup table with w's one-hot vector representation o_w, as e_w = E · o_w. The word lookup table can be initialized with vectors learned in an unsupervised task or during training of the reasoner. In any case, it is obvious that word lookup tables cannot generate vector representations for unseen terms, and it is impractical to store the vectors of all words when the vocabulary is huge [20].
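A toy illustration of the lookup-table mechanism and its out-of-vocabulary failure, with a hypothetical three-word vocabulary and random embeddings:

```python
import numpy as np

vocab = {"person": 0, "knows": 1, "alice": 2}  # hypothetical training vocabulary
d = 5
E = np.random.default_rng(1).normal(size=(d, len(vocab)))  # lookup table, d x |V|

def embed(word):
    """e_w = E . o_w: select the column of E via the word's one-hot vector."""
    one_hot = np.zeros(len(vocab))
    one_hot[vocab[word]] = 1.0
    return E @ one_hot

# A known word maps to its column of E:
assert np.allclose(embed("alice"), E[:, vocab["alice"]])

# An unseen term simply has no column in the table:
oov_failed = False
try:
    embed("bob")
except KeyError:
    oov_failed = True
assert oov_failed
```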
On the other hand, using standard graph embeddings [59] also appears to be insufficient, because structural aspects, such as the particular importance of nodes or edge labels from the RDF/RDFS (Resource Description Framework Schema) namespaces to the deductive reasoning process, would not get sufficient attention. Similarly, memory networks usually rely on word-level embedding lookup tables, learned with the underlying rationale that words that occur in similar supervised scenarios should be represented by similar vectors in the real coordinate space. That is why they are known to have difficulties dealing with out-of-vocabulary terms: a word lookup table cannot provide a representation for the unseen, and thus cannot really be applied to natural language inference over new sets of words [21]. For us this poses a challenge in the transfer to new knowledge bases.
We thus seek embeddings which are agnostic to the terms (i.e., strings) used as primitives in the knowledge base. One option may be to pursue variants of the copy mechanism and of pointer networks [60, 23] to refer to unknown words in the memory when generating responses. Despite the success of these methods in handling a few unknown words absent during training, transferability and the ability of these models to generalize to a completely new vocabulary remain wide-open research questions. Furthermore, these approaches are fundamentally geared towards generative settings and are therefore not suitable for classification problems. Another option is utilizing character-level embeddings (ideal for open-vocabulary word representation) [20] to compose representations of characters into words. However, character-level embeddings are likewise an inadequate solution in our case, because of the importance of having a word-agnostic embedding. Therefore, the entity representation limitations of memory networks need to be overcome in order to make them applicable to deductive logical entailment.
3.2 Solution: Normalized Embedding
To build such an embedding, we build on existing approaches for the embedding of structured data and modify them for our purposes. We expect that some type of normalization will be required before embedding, and that this normalization will have two different aspects. On the one hand, normalization as usually done before invoking logical reasoning algorithms will help control the structural complexity of the formulas which constitute theories and entailments. On the other hand, we explore syntactic normalization, by which we mean a renaming of primitives from the logical language (variables, constants, functions, predicates) to a set of predefined entity names which are used across different theories. By randomly assigning the mapping for the renaming, the network's learning will be based on the structural information within the theories, and not on the actual names of the primitives, which should be insubstantial for the entailment task. Note that the normalization does not only play the role of "forgetting" irrelevant label names; it also makes it possible to transfer learning from one knowledge graph to another. Indeed, for the approach to work, the network should be trained with many knowledge graphs and then subsequently tested on completely new ones which have not been encountered during training. Our preliminary results show how this simple but very effective normalization phase can, surprisingly, lead to a word-agnostic reasoning system capable of conducting reasoning over unseen knowledge graphs containing new vocabulary.
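The renaming step can be sketched as follows. This is a minimal version with assumptions made explicit: the paper assigns the renaming randomly, whereas this sketch assigns placeholder names in order of first appearance for determinism, and the namespace list here is only illustrative.

```python
# Terms in the RDF/RDFS namespaces are kept verbatim: they drive entailment.
RDF_RDFS = {"rdf:type", "rdfs:subClassOf", "rdfs:subPropertyOf",
            "rdfs:domain", "rdfs:range"}

def normalize(triples, pool_size=1000):
    """Map every non-RDF/RDFS term to a generic name e0, e1, ... shared
    across knowledge graphs, so the learner never sees the original labels."""
    mapping = {}

    def rename(term):
        if term in RDF_RDFS:
            return term
        if term not in mapping:
            if len(mapping) >= pool_size:
                raise ValueError("graph exceeds the normalization pool size")
            mapping[term] = f"e{len(mapping)}"
        return mapping[term]

    return [tuple(rename(t) for t in tr) for tr in triples], mapping

kg = [("ex:Dog", "rdfs:subClassOf", "ex:Animal"),
      ("ex:rex", "rdf:type", "ex:Dog")]
norm, mapping = normalize(kg)
assert norm == [("e0", "rdfs:subClassOf", "e1"), ("e2", "rdf:type", "e0")]
```

Because every knowledge graph is renamed into the same pool of placeholder names, a network trained on many normalized graphs can only exploit structure, not labels.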
4 Model Architecture
Our model architecture is an adaptation of the end-to-end memory network proposed by [25], with some fundamental alterations necessary for abstract reasoning. A high-level view of our model, shown in Figure 1, is as follows. It takes a discrete set of normalized triples x_1, ..., x_n that are to be stored in the memory and a query q, and outputs "yes" or "no" as an answer, determining whether q can be inferred from the current knowledge graph statements or not. Each normalized x_i and q contains symbols coming from a general dictionary of normalized words shared among all of the knowledge graphs in both training and test sets. The model writes all triples to the memory and then calculates continuous embeddings for the triples and for q. Through multiple hops of attention over those continuous representations, the model then classifies the query. The model is trained by backpropagating the error from the output to the input through multiple memory accesses, and the embeddings are learned through these accesses. We discuss the components of the architecture in more detail below.
4.1 Model Description
The design of the model is based on the MemN2N [25] end-to-end memory network. The model is augmented with an external memory component storing the embeddings of the normalized triples in our knowledge graph. This external memory is defined as an n × d tensor, where n denotes the number of triples in the knowledge graph and d is the dimensionality of the embeddings. The knowledge base is stored as memory vectors via two continuous representations m_i and c_i, obtained from input and output embedding matrices A and C of size d × V, where V is the size of the vocabulary. Similarly, the query q is embedded via a matrix B to obtain an internal state u. In each reasoning step, the memory slots useful for finding the correct answer should have their contents retrieved. To enable this, we use an attention mechanism for u over the memory input representations m_i, taking an inner product followed by a softmax:
p_i = softmax(u^T m_i)    (1)

where Equation 1 calculates a probability vector p over the memory inputs. The output vector o is then computed as the weighted sum of the transformed memory contents c_i with respect to their corresponding probabilities:

o = Σ_i p_i c_i    (2)
This describes the computation within a single hop. The internal state of the query vector is updated for the next hop using:

u^{k+1} = u^k + o^k    (3)
The process repeats K times, where K is the number of memory units (hops) in the network. The output of the K-th memory unit is used to predict a label by passing o^K and u^K through a final weight matrix W and a softmax:

â = softmax(W(o^K + u^K))    (4)
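Equations (1)-(4) can be sketched end to end as follows (a NumPy sketch with random stand-ins for the learned matrices; the classification is binary yes/no as in our setting):

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax."""
    e = np.exp(x - x.max())
    return e / e.sum()

def memn2n_forward(u, M_in, M_out, W, hops=3):
    """u: (d,) embedded query; M_in, M_out: (n, d) memory representations
    m_i and c_i; W: (2, d) answer matrix for the yes/no classification."""
    for _ in range(hops):
        p = softmax(M_in @ u)  # (1) attention over memory slots
        o = p @ M_out          # (2) weighted sum of memory outputs
        u = u + o              # (3) next-hop internal state u^{k+1}
    # After the loop, u equals u^K + o^K, so this matches equation (4).
    return softmax(W @ u)      # (4) answer distribution

rng = np.random.default_rng(42)
d, n = 8, 12
a_hat = memn2n_forward(rng.normal(size=d),
                       rng.normal(size=(n, d)),
                       rng.normal(size=(n, d)),
                       rng.normal(size=(2, d)))
assert a_hat.shape == (2,) and abs(a_hat.sum() - 1.0) < 1e-9
```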
Figure 1 illustrates the model for a given number of hops K. The parameters to be learned by backpropagation are the matrices A, B, C, and W.

4.2 Memory Content
A knowledge graph is a collection of facts stored as triples (s, p, o), where s and o are subject and object, respectively, while p is a predicate (relation) binding s and o together. Every entity in the knowledge graph is represented by a unique Uniform Resource Identifier (URI). We normalize these triples by systematically renaming all URIs which are not in the RDF/RDFS namespaces, as discussed previously. Each such URI is mapped to an arbitrary string from a predefined set of normalized entity names, whose size is taken as a training hyperparameter giving an upper bound on the largest number of entities in a knowledge graph the system will be able to handle. Note that URIs in the RDF/RDFS namespaces are not renamed, as they are important for the deductive reasoning process. Consequently, each normalized knowledge graph is a collection of facts stored as triples over this generic vocabulary.
It is important to note that each symbol is mapped into an element of the normalized vocabulary regardless of its position in the triple, i.e., whether it is a subject, an object, or a predicate. Yet the position of an element within the triple is an important feature to consider. Inspired by [25], we therefore employ a positional encoding (PE) to encode the position of each element within the triple. This gives memory slots of the form m_i = Σ_{j=1}^{3} l_j ∘ A x_{ij}, where ∘ denotes element-wise multiplication and l_j is a column vector with the structure l_{kj} = (1 − j/J) − (k/d)(1 − 2j/J) (assuming 1-based indexing), with J = 3 (the number of elements in the triple) and d the size of the embedding vectors in the memory input embedding matrix A. Each memory slot thus represents the position-weighted summation of the elements of a triple. By using positional encoding, we ensure that the order of the elements affects the encoding of each memory slot m_i. This representation, which is used for the query triple, memory inputs, and memory outputs, is core to everything we do.
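A sketch of the positional encoding under the formula above (J = 3 elements per triple; the element embeddings here are random stand-ins for the embedded triple elements A x_{ij}):

```python
import numpy as np

def position_weights(J, d):
    """l[k-1, j-1] = (1 - j/J) - (k/d) * (1 - 2*j/J), 1-based j and k."""
    k = np.arange(1, d + 1)[:, None]   # embedding dimension index
    j = np.arange(1, J + 1)[None, :]   # position within the triple
    return (1 - j / J) - (k / d) * (1 - 2 * j / J)   # shape (d, J)

def encode_triple(elem_embs):
    """m_i = sum_j l_j * (embedding of element j); elem_embs is (J, d)."""
    J, d = elem_embs.shape
    l = position_weights(J, d)                       # (d, J)
    return (l.T * elem_embs).sum(axis=0)             # (d,)

rng = np.random.default_rng(0)
x = rng.normal(size=(3, 5))            # one triple, embedding size d = 5
m = encode_triple(x)
# Reversing subject and object changes the slot encoding:
assert not np.allclose(m, encode_triple(x[::-1]))
```

Without the position weights the sum would be a bag of elements, and (s, p, o) and (o, p, s) would be indistinguishable in memory.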
5 Experimental Setups
5.1 Candidate Logic
There is a plethora of logics which could be used for our investigation. We exclude propositional logics because they seem to be easier to capture using connectionist architectures, while at the same time the methods used for dealing with them in a subsymbolic manner mostly do not seem to transfer to non-propositional logics. Here we use RDF Schema (RDFS). The Resource Description Framework (RDF) [49, 50] is an established and widely used W3C standard for expressing knowledge graphs. The standard comes with a formal semantics which defines an entailment relation. As a logic, RDFS is of very low expressivity, and reasoning algorithms are very straightforward. One way to frame it is that there is a small set of thirteen entailment rules, fixed across all knowledge graphs, which are expressible in Datalog. These thirteen rules can be used to entail new facts. The completion of a knowledge graph is in general infinite because, by definition, there is an infinite set of facts (related to RDFS encodings of lists) which are always entailed; however, for practical reasons this is ignored by established RDFS reasoning systems, i.e., for all practical purposes we can consider completions of knowledge graphs to be finite.
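As an illustration of how mechanical RDFS entailment is, the following sketch forward-chains just two of the thirteen rules (rdfs9, class membership propagation, and rdfs11, subclass transitivity) to a fixed point; the full rule set works the same way:

```python
def rdfs_closure(triples):
    """Tiny forward-chaining sketch for two RDFS rules:
    rdfs9:  (x, rdf:type, C), (C, rdfs:subClassOf, D)        => (x, rdf:type, D)
    rdfs11: (C, rdfs:subClassOf, D), (D, rdfs:subClassOf, E) => (C, rdfs:subClassOf, E)
    """
    kg = set(triples)
    changed = True
    while changed:                       # iterate to a fixed point
        changed = False
        new = set()
        sub = {(s, o) for s, p, o in kg if p == "rdfs:subClassOf"}
        for s, p, o in kg:
            if p == "rdf:type":
                new |= {(s, "rdf:type", d) for c, d in sub if c == o}
            elif p == "rdfs:subClassOf":
                new |= {(s, "rdfs:subClassOf", e) for d, e in sub if d == o}
        if not new <= kg:
            kg |= new
            changed = True
    return kg

kg = rdfs_closure([("e1", "rdf:type", "e2"),
                   ("e2", "rdfs:subClassOf", "e3"),
                   ("e3", "rdfs:subClassOf", "e4")])
assert ("e1", "rdf:type", "e4") in kg    # derived via two rule applications
```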
5.2 Dataset
Due to the novel nature of the problem at hand, there is no dataset available for testing the capability of our approach. The good news, however, is that there is a plethora of knowledge graphs [61] that we could use to create our own dataset. The Linked Data Cloud website (https://lodcloud.net/) lists over 1,200 interlinked RDF(S) datasets, which constitute knowledge graphs suitable for our setting, some of which are of substantial size. We have collected data from this website as well as from the Data Hub website (https://datahub.io/) to create our training set (https://github.com/mdksarker/KGCmpldataset). Our training set (the OWL-Centric dataset) comprises a set of knowledge graphs of size 1000 triples, sampled from around 20 ontologies (as listed in Table 1). In order to test our model's ability to generalize to a completely different domain, we have collected another dataset, the OWL-Centric test set. Furthermore, to ensure our evaluation set represents real-world RDF data and meets the quality requirements of linked data [62], we have followed [63] in collecting data for our Linked Data dataset. Despite our best attempts, however, we could not find a public RDF dump or SPARQL endpoint for some of the datasets mentioned there. In addition, to test the capability of our model to conduct long chains of reasoning, we have created a synthetic dataset using the rdfs:subClassOf and rdfs:subPropertyOf predicates; it covers reasoning chains of length up to 10.
For each knowledge graph we have created a finite set of inferred triples using the Apache Jena API (https://jena.apache.org/). These inferred triples comprise our positive class instances. For generating invalid instances we follow two methods. In the first scenario, we generate invalid triples by random permutation of entities, filtering out those triples which are already in the knowledge base or in the set of valid triples. In the second scenario, which serves as a final quality check against including trivial invalid triples in our dataset, we create invalid instances with the aid of the rdf:type predicate. More specifically, for each valid triple in the dataset, we replace one of its elements (chosen randomly) with another random element which qualifies for that position in the triple based on its type-of relationships. Here as well, we add the newly generated triple only if it is not already part of the original knowledge base or the valid facts. Indeed, through random selection of one of the hyponyms of hypernyms of the entity of interest, we ensure that our created dataset is challenging enough. Datasets created by this strategy are denoted by superscript "a" in Table 3. Some important statistics of our created datasets are summarized in Tables 1 and 2. More specifically, Table 2 gives the number of knowledge graphs in each of our datasets and the average number of facts and entities per knowledge graph. It also gives the average number of classes, individuals, relations, and axiomatic triples for each knowledge graph (in percentages).
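The first corruption scheme can be sketched as follows (the retry budget and entity pool are illustrative; the second, type-aware scheme additionally constrains the replacement entity by its rdf:type):

```python
import random

def corrupt(triple, kg, valid, entities, rng):
    """Replace one randomly chosen element of a valid triple with a random
    entity, keeping only corruptions that are neither known facts nor
    entailed (valid) triples."""
    for _ in range(100):                  # retry budget (illustrative)
        pos = rng.randrange(3)
        cand = list(triple)
        cand[pos] = rng.choice(entities)
        cand = tuple(cand)
        if cand != triple and cand not in kg and cand not in valid:
            return cand
    return None                           # give up for pathological graphs

kg = {("e0", "rdf:type", "e1"), ("e1", "rdfs:subClassOf", "e2")}
valid = {("e0", "rdf:type", "e2")}
rng = random.Random(7)
neg = corrupt(("e0", "rdf:type", "e2"), kg, valid,
              ["e0", "e1", "e2", "e3"], rng)
assert neg is not None and neg not in kg and neg not in valid
```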
5.3 Training Details
Training has been done over 10 epochs using the Adam optimizer, with a batch size of 100 triples. For the final batch of queries of each knowledge graph, we use zero-padding up to the maximum batch size of 100. The capacity of our external memory is 1000, which is also the maximum size of our knowledge bases. Our model has been trained using a fixed learning rate with learning rate decay. We have used a linear start for the first epoch, during which the softmax is removed from each memory layer except the final one. L2 norm clipping with a maximum of 40 is applied to the gradients. Both the memory input embeddings and the memory output embeddings are vectors of size d. The embedding matrices A, B, and C are therefore of size d × 3033, where 3033 is the size of the normalized vocabulary. Unless otherwise mentioned, we have used K = 10 for all of our experiments. Adjacent weight sharing is used, where the output embedding of one layer is the input embedding of the next (A^{k+1} = C^k); similarly, the answer prediction weight matrix is tied to the final output embedding and the query embedding is equal to the first-layer input embedding (B = A^1). All weights are initialized from a Gaussian distribution.

6 Experimental Results
6.1 Quantitative Results
In this section, we highlight some of our findings along with the experimental results of our proposed approach. The evaluation metrics we report are the averages of precision, recall, and F-measure over all the knowledge graphs in the test set, obtained for both the valid and invalid sets of triples. In particular, we also report the recall for the negative class, also called specificity, to interpret the results more carefully by counting the number of true negatives. Additionally, as mentioned in the training details, we zero-pad each batch of query triples of size less than 100. This implies the need to introduce another class label for such zero paddings, in both the training and test phases. In our evaluation, however, we have not considered the zero-padding class in the calculation of precision, recall, and F-measure. Throughout our evaluations, though, we have observed some misclassification from/to this class, so we also report accuracy to take such mistakes into account.
To the best of our knowledge there is no architecture capable of conducting deductive reasoning on completely unseen knowledge graphs. We therefore consider the non-normalized embedding version of our memory network as a baseline. Our technique shows a clear, significant advantage over the baseline, as shown in Table 3. A further, even more important benefit of our normalization model is its training time. This considerable difference in time complexity results from the remarkable size difference of the embedding matrices in the original and normalized cases: the embedding matrices learned for the normalized OWL-Centric dataset are far smaller than those for the non-normalized one (and the non-normalized vocabulary of Linked Data is prohibitively big). This yields a remarkable decrease in training time and space complexity and consequently helps the scalability of our memory networks. In the case of the OWL-Centric dataset, for instance, the space required for saving the normalized model is 80 times less than for the intact model (after compression). Likewise, the normalized model is almost 40 times faster to train than the non-normalized one for this dataset: our normalized model trained for just over a day on OWL-Centric data achieves better accuracy, whereas training on the same non-normalized dataset took more than a week on a 12-core machine. Hence, the importance of our normalized representation learning cannot be overstated.
To further get an idea of how our model performs on different data sources, we have applied our approach to multiple datasets with various characteristics. The results across all variations are given in Table 3. From this table we can see that, apart from the strikingly good performance compared to the baseline, there are a number of other interesting points. Our model gets even better results on the Linked Data task while trained on the OWL-Centric dataset; the reasons for this performance gain are not yet wholly understood. Another interesting observation is the poor performance of our algorithm when trained on the OWL-Centric dataset and applied to a tricky version of the Linked Data. In that case our model classified most of the triples into the "yes" class, which led to a low specificity (recall for the "no" class) of 16%. This is to be expected, because in the challenging "no" version of our dataset the negative instances bear close resemblance to positive ones, making differentiation harder. Training the model on the tricky OWL-Centric dataset improved specificity by a substantial margin (more than three times). In the case of the synthetic data, although performance is not ideal, we nevertheless believe it is acceptable. An evident explanation for this performance decrease, compared to the other data sources, is the significant difference in reasoning patterns and nature between the training and test sets: our training so far has been done only on real-world datasets and not on peculiar synthetic data. Training the model on synthetic data is not the focus of this study.
Dataset  Hop 1  Hop 2  Hop 3  Hop 4  Hop 5  Hop 6  Hop 7  Hop 8  Hop 9  Hop 10
OWL-Centric^a  8%  67%  24%  0.01%  0%  0%  0%  0%  0%  0%
Linked Data^b  31%  50%  19%  0%  0%  0%  0%  0%  0%  0%
Linked Data^c  34%  46%  20%  0%  0%  0%  0%  0%  0%  0%
OWL-Centric^d  5%  64%  30%  1%  0%  0%  0%  0%  0%  0%
Synthetic Data  0.03%  1.42%  1%  1.56%  3.09%  6.03%  11.46%  20.48%  31.25%  23.65%

^a Training Set  ^b LemonUby Ontology  ^c Agrovoc Ontology  ^d Completely Different Domain
Further experiments were needed to analyze the reasoning depth acquired by our network. Fundamentally, we conjecture that the reasoning depth acquired by the network corresponds both to (1) the number of layers in the deep network, and (2) the ratio of deep versus shallow reasoning required to perform the deductive reasoning. Let us explain this. Forward-chaining reasoners (which are the standard for RDF(S), OWL EL, and OWL RL reasoning, and can also be used for Datalog) iteratively apply inference rules in order to derive new entailed facts; in subsequent iterations, the previously derived facts need to be taken into account. The number of sequential applications of the inference rules required to obtain a given logical consequence can be understood as a measure of the "depth" of the deductive entailment. To gain a better understanding of what our model has learned, we have mimicked this behavior of symbolic reasoners when creating our test set. To do so, we started from our input knowledge graph at hop 0. We then produced, subsequently, the knowledge graphs of hop 1, hop 2, and so on, until no new triples were added (i.e., until the set of newly derived triples was empty), by applying the RDFS inference patterns from the W3C website (https://www.w3.org/TR/rdf11mt/#rdfsentailment and https://www.w3.org/TR/rdf11mt/#entailmentrulesinformative). Consequently, our hop 0 dataset contains the original graph, with the inferred-axioms field simply replaced by the triples of the original graph. Hop 1 contains the RDFS axiomatic triples in the inferred-axioms field. The real inference steps start at hop k with k ≥ 2. It is worth noting that, in the process of creating this data, our reasoning tool encountered several errors during reasoning because of missing entities in some triples. There are many such missing/unknown/inapplicable entities in real-world knowledge graphs, emphasizing the need for more robust subsymbolic reasoners.
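The hop construction described above can be sketched as a small forward-chaining loop. For brevity, this sketch applies only two of the W3C RDFS entailment patterns (rdfs11, subClassOf transitivity, and rdfs9, type propagation along subClassOf), omits the axiomatic triples of hop 1, and indexes hops by inference iteration rather than by our exact hop numbering:

```python
def entailment_hops(graph):
    """Forward chaining: iteratively apply rdfs11 and rdfs9 over all known
    triples; hops[k] holds the triples first derived at iteration k."""
    hops, known = [], set(graph)
    while True:
        new = set()
        for (a, p1, b) in known:
            for (c, p2, d) in known:
                if p1 == p2 == "rdfs:subClassOf" and b == c:
                    new.add((a, "rdfs:subClassOf", d))      # rdfs11
                if p1 == "rdf:type" and p2 == "rdfs:subClassOf" and b == c:
                    new.add((a, "rdf:type", d))             # rdfs9
        new -= known
        if not new:                      # fixpoint reached: no new triples
            return hops
        hops.append(new)
        known |= new

graph = {("ex:Cat", "rdfs:subClassOf", "ex:Mammal"),
         ("ex:Mammal", "rdfs:subClassOf", "ex:Animal"),
         ("ex:Tom", "rdf:type", "ex:Cat")}
hops = entailment_hops(graph)
# "ex:Tom rdf:type ex:Animal" only appears at the second iteration,
# because it needs a fact derived at the first one
```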
Table 4 summarizes our results in this setup. The poor performance on hop 0 is not unexpected, since our training set does not include any original triples. Unsurprisingly, we also observe that the result over our synthetic data, generated with subClassOf and subPropertyOf predicates, is poor. That is because of the huge gap between the distribution of our training data over reasoning hops and the reasoning hop length distribution of the synthetic data; Table 5 provides further evidence for this. From Tables 4 and 5, one can see how the distribution of the training set affects the learning capability of our model. Apart from our observations, previous studies [64, 19, 16, 65] also corroborate that the reasoning chain length required to answer a query in a real-world KB is limited to 3 or 4. Therefore, a synthetic toy training set will have to be built for further analysis of the reasoning depth capability of our model in future work.
Furthermore, a naive expectation of the trained network would be that each layer performs a step equivalent to one inference rule application. If this were the case, then the number of layers would limit the entailment depth the network could acquire; we therefore assessed this assumption experimentally. We ran 10 experiments (K = 1 to 10) to assess the effect of changing the number of computational hops on our results over the OWL-Centric dataset. Interestingly, our experimental results suggest that our model achieves almost the same performance with K = 1 and, more interestingly, that the F-measure remains constant as K increases step by step from 1 to 10. This shows that multi-hop reasoning can be done in a one-hop attention of memory networks (as our training set requires only 2-3 hop reasoning), while increasing the number of hops does not hurt performance. This demonstrates the robustness of the proposed method against changes to its structure, and also suggests that each attention hop of our memory network is able to conduct more than one inference rule application step (deductive reasoning hop).
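The K-hop recurrence varied in these experiments is the standard end-to-end memory network one [25]; a toy numpy sketch with random values, illustrating the control flow rather than our trained model:

```python
import numpy as np

def memn2n_hops(memory, query, K, W):
    """K attention hops over a memory matrix (n_slots x d): at each hop,
    attend over memory with a softmax, read a weighted sum, and update the
    query state, as in Sukhbaatar et al. (2015)."""
    u = query
    for _ in range(K):
        scores = memory @ u
        p = np.exp(scores - scores.max())
        p /= p.sum()                      # attention over memory slots
        o = memory.T @ p                  # read vector
        u = W @ u + o                     # hop-to-hop state update
    return u

rng = np.random.default_rng(0)
memory = rng.normal(size=(5, 8))          # 5 memory slots, dimension 8
query = rng.normal(size=8)
W = np.eye(8)
state = memn2n_hops(memory, query, K=3, W=W)
```

Varying K here corresponds to varying the number of stacked computational hops; the finding above is that the final classification quality is insensitive to this choice on our data.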
6.2 General Embeddings Visualization
In order to gain some insight into the behavior of our normalized embedding model, we have plotted t-Distributed Stochastic Neighbor Embedding (t-SNE) [66] and Principal Component Analysis (PCA) two-dimensional visualizations of the embeddings computed for the RDF(S) words and all normalized words in the knowledge graphs, shown in Figures 2, 3, and 4 respectively. The embeddings have been fetched from the matrix B (the embedding query lookup table) in computational hop 1 of our model trained on the OWL-Centric dataset. Words are positioned in the plots according to the semantic relationships implied by their embeddings. As anticipated, all the normalized words tend to form one cluster, as opposed to creating multiple separated clusters, as shown in Figures 3 and 4. We found the PCA plot more insightful. The PCA projection illustrates the ability of our model to automatically organize RDF(S) concepts and learn the relationships between them implicitly, given that during training we have not provided any supervised information about what each RDF(S) element means. For instance, rdfs:domain and rdfs:range are located very close together and far from the normalized entities. Similarly, rdf:subject, rdf:predicate, and rdf:object are very similar in the vector space, as are rdfs:seeAlso and rdfs:isDefinedBy. Likewise, rdfs:Container, rdf:Bag, rdf:Seq, and rdf:Alt are in the vicinity of each other. rdf:langString is the only RDF domain entity that lies inside the circle of normalized entities. We believe that this is because rdf:langString's domain and range are strings, so it has only co-occurred with normalized instances in the knowledge graph; thus its vector is very close to the vectors of normalized entities. Another possible explanation might be the low frequency of rdf:langString in our training set, which could contribute to an overly general representation for this word.
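Such projections can be produced with standard tooling; a minimal sketch using scikit-learn, with a random matrix standing in for the learned lookup table B:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(50, 32))   # stand-in for the learned matrix B

# linear projection onto the two directions of highest variance
pca_2d = PCA(n_components=2).fit_transform(embeddings)

# nonlinear neighborhood-preserving projection
tsne_2d = TSNE(n_components=2, perplexity=10,
               init="pca", random_state=0).fit_transform(embeddings)
# pca_2d and tsne_2d are (50, 2) arrays ready for a scatter plot
```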
6.3 Ablation Study
We perform an ablation study in which we remove the positional encoding from the embeddings and compare the results, in order to assess its impact. The idea behind positional encoding is to take the order of the elements in each triple into account; without it, we use bag-of-words representations that ignore the ordering of elements within each triple. The results of our experiments are listed in Table 6. As anticipated, removing the positional encoding decreases performance in terms of accuracy for all of our experiments. Indeed, through a more detailed analysis of the results for our first model, we found that it classifies all of our zero-paddings into the negative class, which explains the huge gap between accuracy and F-measure for that model. Nevertheless, surprisingly, but still not hard to appreciate, removing the positional encoding does not substantially decrease performance for some of our experiments. This is not entirely surprising in light of the fact that orderless representations have shown tremendous success in the natural language processing domain, even where order naturally matters.
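The positional encoding ablated here follows end-to-end memory networks [25], where the j-th word vector of a length-J memory item is weighted elementwise by l_{jk} = (1 - j/J) - (k/d)(1 - 2j/J) before summation; a sketch (the triple embeddings are random stand-ins):

```python
import numpy as np

def positional_encoding(J, d):
    """l[j, k] = (1 - j/J) - (k/d) * (1 - 2j/J), with j, k 1-indexed,
    as in Sukhbaatar et al. (2015)."""
    j = np.arange(1, J + 1)[:, None]
    k = np.arange(1, d + 1)[None, :]
    return (1 - j / J) - (k / d) * (1 - 2 * j / J)

def encode_triple(word_vectors):
    """Position-weighted bag of words: elementwise weights, then sum."""
    J, d = word_vectors.shape
    return (positional_encoding(J, d) * word_vectors).sum(axis=0)

# embeddings of (subject, predicate, object), dimension 16
triple = np.random.default_rng(0).normal(size=(3, 16))
m = encode_triple(triple)
```

Dropping the weights (a plain sum) makes the representation order-invariant, which is exactly the ablated bag-of-words variant.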
6.4 Limitations
Our work clearly has some limitations. One limitation of our initial approach is that our setting puts a global limit on the size of the knowledge graphs a trained system will be able to handle, and the required training time can be expected to grow super-linearly in the size of the knowledge graphs. We will test the scalability limits of our approach in the future, but we are confident that we can reasonably handle knowledge graphs with hundreds of thousands of triples. The contribution of this work, however, is the fundamental capability of our model to perform deductive entailment across knowledge graphs, and thus we have not focused heavily on scalability aspects; our future work will concentrate more on scalability issues.
Additionally, based on our analysis, the reasoning hop length of our real-world datasets is either 2 or 3; this is also the case for the real-world datasets used in previous studies. This distribution of training data constrains the capability of our model for learning longer reasoning paths. However, given the high reasoning capacity of memory networks, we are confident that the model would be capable of reasoning chains on the order of tens once trained on sufficiently complex data. Typically, one would expect the number of entailed facts as a function of the number of inference rule applications to follow a long-tail distribution, which means that "deep" entailments would be underrepresented in training data, and this may cause a network to not actually acquire deep inference skills. In future work, we will experiment with different synthetic training sets, possibly overrepresenting "deep" entailments, to counter this problem.
Keeping the above limitations in mind, our future goal is to create a synthetic dataset with longer reasoning paths for training our model. We would also like to explore the scalability of our approach in the future.
7 Conclusions
In this paper, we have shown how emulating symbolic reasoning through subsymbolic reasoning can lead to a scalable and efficient model capable of transferring its reasoning ability from one domain to another without any re-training, pre-training, or fine-tuning on the new domain. To achieve this goal, we have introduced a normalization technique for representation learning in memory networks. We empirically show that our proposed model comfortably beats its non-normalized counterpart. Apart from knowledge graph reasoning, our approach would lend itself well to hybrid subsymbolic-symbolic reasoning systems in planning, cognitive systems, and robot control. Our study also provides considerable insight not only into representation learning for rare or out-of-vocabulary words in general, but also into transfer learning, zero-shot learning, and domain adaptation in the reasoning domain.
8 Acknowledgements
This work is supported by the Ohio Federal Research Network project Human-Centered Big Data.
References
 [1] Kurt Bollacker, Colin Evans, Praveen Paritosh, Tim Sturge, and Jamie Taylor. Freebase: a collaboratively created graph database for structuring human knowledge. In Proceedings of the 2008 ACM SIGMOD international conference on Management of data, pages 1247–1250. ACM, 2008.
 [2] Jens Lehmann, Robert Isele, Max Jakob, Anja Jentzsch, Dimitris Kontokostas, Pablo N Mendes, Sebastian Hellmann, Mohamed Morsey, Patrick Van Kleef, Sören Auer, et al. DBpedia – a large-scale, multilingual knowledge base extracted from Wikipedia. Semantic Web, 6(2):167–195, 2015.
 [3] Xin Dong, Evgeniy Gabrilovich, Geremy Heitz, Wilko Horn, Ni Lao, Kevin Murphy, Thomas Strohmann, Shaohua Sun, and Wei Zhang. Knowledge vault: A web-scale approach to probabilistic knowledge fusion. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 601–610. ACM, 2014.
 [4] Bonan Min, Ralph Grishman, Li Wan, Chang Wang, and David Gondek. Distant supervision for relation extraction with an incomplete knowledge base. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 777–782, 2013.
 [5] John McCarthy. Programs with common sense. RLE and MIT computation center, 1960.
 [6] Nils J Nilsson. Logic and artificial intelligence. Artificial intelligence, 47(1-3):31–56, 1991.
 [7] Maximilian Nickel, Volker Tresp, and Hans-Peter Kriegel. Factorizing YAGO: scalable machine learning for linked data. In Proceedings of the 21st international conference on World Wide Web, pages 271–280. ACM, 2012.
 [8] Sebastian Riedel, Limin Yao, Andrew McCallum, and Benjamin M Marlin. Relation extraction with matrix factorization and universal schemas. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 74–84, 2013.
 [9] Richard Socher, Danqi Chen, Christopher D Manning, and Andrew Ng. Reasoning with neural tensor networks for knowledge base completion. In Advances in neural information processing systems, pages 926–934, 2013.
 [10] Kai-Wei Chang, Scott Wen-tau Yih, Bishan Yang, and Chris Meek. Typed tensor decomposition of knowledge bases for relation extraction. 2014.
 [11] Bishan Yang, Wen-tau Yih, Xiaodong He, Jianfeng Gao, and Li Deng. Embedding entities and relations for learning and inference in knowledge bases. arXiv preprint arXiv:1412.6575, 2014.
 [12] Kristina Toutanova, Danqi Chen, Patrick Pantel, Hoifung Poon, Pallavi Choudhury, and Michael Gamon. Representing text for joint embedding of text and knowledge bases. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1499–1509, 2015.
 [13] Théo Trouillon, Johannes Welbl, Sebastian Riedel, Éric Gaussier, and Guillaume Bouchard. Complex embeddings for simple link prediction. In International Conference on Machine Learning, pages 2071–2080, 2016.
 [14] Arvind Neelakantan, Benjamin Roth, and Andrew McCallum. Compositional vector space models for knowledge base completion. arXiv preprint arXiv:1504.06662, 2015.
 [15] Baolin Peng, Zhengdong Lu, Hang Li, and Kam-Fai Wong. Towards neural network-based reasoning. arXiv preprint arXiv:1508.05508, 2015.
 [16] Rajarshi Das, Arvind Neelakantan, David Belanger, and Andrew McCallum. Chains of reasoning over entities, relations, and text using recurrent neural networks. arXiv preprint arXiv:1607.01426, 2016.
 [17] Dirk Weissenborn. Separating answers from queries for neural reading comprehension. arXiv preprint arXiv:1607.03316, 2016.
 [18] Yelong Shen, Po-Sen Huang, Jianfeng Gao, and Weizhu Chen. ReasoNet: Learning to stop reading in machine comprehension. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1047–1055. ACM, 2017.
 [19] Tim Rocktäschel and Sebastian Riedel. End-to-end differentiable proving. In Advances in Neural Information Processing Systems, pages 3788–3800, 2017.
 [20] Wang Ling, Tiago Luís, Luís Marujo, Ramón Fernandez Astudillo, Silvio Amir, Chris Dyer, Alan W Black, and Isabel Trancoso. Finding function in form: Compositional character models for open vocabulary word representation. arXiv preprint arXiv:1508.02096, 2015.
 [21] Dzmitry Bahdanau, Tom Bosc, Stanisław Jastrzębski, Edward Grefenstette, Pascal Vincent, and Yoshua Bengio. Learning to compute word embeddings on the fly. arXiv preprint arXiv:1706.00286, 2017.
 [22] Mihail Eric and Christopher D Manning. A copy-augmented sequence-to-sequence architecture gives good performance on task-oriented dialogue. arXiv preprint arXiv:1701.04024, 2017.
 [23] Dinesh Raghu, Nikhil Gupta, et al. Hierarchical pointer memory network for task oriented dialogue. arXiv preprint arXiv:1805.01216, 2018.
 [24] Jason Weston, Sumit Chopra, and Antoine Bordes. Memory networks. CoRR, abs/1410.3916, 2014.
 [25] Sainbayar Sukhbaatar, Jason Weston, Rob Fergus, et al. End-to-end memory networks. In Advances in neural information processing systems, pages 2440–2448, 2015.
 [26] Felix Hill, Antoine Bordes, Sumit Chopra, and Jason Weston. The goldilocks principle: Reading children’s books with explicit memory representations. arXiv preprint arXiv:1511.02301, 2015.
 [27] Antoine Bordes, Y-Lan Boureau, and Jason Weston. Learning end-to-end goal-oriented dialog. arXiv preprint arXiv:1605.07683, 2016.
 [28] Jesse Dodge, Andreea Gane, Xiang Zhang, Antoine Bordes, Sumit Chopra, Alexander Miller, Arthur Szlam, and Jason Weston. Evaluating prerequisite qualities for learning end-to-end dialog systems. arXiv preprint arXiv:1511.06931, 2015.
 [29] Warren S McCulloch and Walter Pitts. A logical calculus of the ideas immanent in nervous activity. The bulletin of mathematical biophysics, 5(4):115–133, 1943.
 [30] Tarek R Besold, Artur d'Avila Garcez, Sebastian Bader, Howard Bowman, Pedro Domingos, Pascal Hitzler, Kai-Uwe Kühnberger, Luis C Lamb, Daniel Lowd, Priscila Machado Vieira Lima, et al. Neural-symbolic learning and reasoning: A survey and interpretation. arXiv preprint arXiv:1711.03902, 2017.
 [31] Artur S. d'Avila Garcez, Luis C Lamb, and Dov M Gabbay. Neural-symbolic cognitive reasoning. Springer Science & Business Media, 2008.
 [32] John McCarthy. Epistemological challenges for connectionism. Behavioral and Brain Sciences, 11(1):44–44, 1988.
 [33] Geoffrey G Towell and Jude W Shavlik. Knowledge-based artificial neural networks. Artificial intelligence, 70(1-2):119–165, 1994.
 [34] Pascal Hitzler, Steffen Hölldobler, and Anthony Karel Seda. Logic programs and connectionist networks. Journal of Applied Logic, 2(3):245–272, 2004.
 [35] Steffen Hoelldobler and Yvonne Kalinke. Ein massiv paralleles modell für die logikprogrammierung. In WLP, pages 89–92, 1994.
 [36] Lokendra Shastri. Advances in shruti—a neurally motivated model of relational knowledge representation and rapid inference using temporal synchrony. Applied Intelligence, 11(1):79–108, 1999.
 [37] Lokendra Shastri. Shruti: A neurally motivated architecture for rapid, scalable inference. In Perspectives of Neural-Symbolic Integration, pages 183–203. Springer, 2007.
 [38] Helmar Gust, Kai-Uwe Kühnberger, and Peter Geibel. Learning models of predicate logical theories with neural networks based on topos theory. In Perspectives of Neural-Symbolic Integration, pages 233–264. Springer, 2007.
 [39] Sebastian Bader, Pascal Hitzler, and Steffen Hölldobler. Connectionist model generation: A first-order approach. Neurocomputing, 71(13-15):2420–2432, 2008.
 [40] Sebastian Bader, Pascal Hitzler, Steffen Hölldobler, Andreas Witzel, et al. A fully connectionist model generator for covered first-order logic programs. In IJCAI, pages 666–671, 2007.
 [41] Masataro Asai and Alex Fukunaga. Classical planning in deep latent space: Bridging the subsymbolic-symbolic boundary. arXiv preprint arXiv:1705.00154, 2017.
 [42] Ivan Donadello, Luciano Serafini, and Artur d’Avila Garcez. Logic tensor networks for semantic image interpretation. arXiv preprint arXiv:1705.08968, 2017.
 [43] Patrick Hohenecker and Thomas Lukasiewicz. Ontology reasoning with deep neural networks. arXiv preprint arXiv:1808.07980, 2018.
 [44] Bassem Makni and James Hendler. Deep learning for noisetolerant rdfs reasoning.
 [45] Luciano Serafini and Artur S d’Avila Garcez. Learning and reasoning with logic tensor networks. In Conference of the Italian Association for Artificial Intelligence, pages 334–348. Springer, 2016.
 [46] Luciano Serafini and Artur d’Avila Garcez. Logic tensor networks: Deep learning and logical reasoning from data and knowledge. arXiv preprint arXiv:1606.04422, 2016.
 [47] Dai Quoc Nguyen, Dat Quoc Nguyen, Tu Dinh Nguyen, and Dinh Phung. A convolutional neural network-based model for knowledge base completion and its application to search personalization. Semantic Web, (Preprint):1–14.
 [48] Edward Grefenstette, Karl Moritz Hermann, Mustafa Suleyman, and Phil Blunsom. Learning to transduce with unbounded memory. In Advances in Neural Information Processing Systems, pages 1828–1836, 2015.
 [49] World Wide Web Consortium et al. RDF 1.1 concepts and abstract syntax. 2014.
 [50] Pascal Hitzler, Markus Krötzsch, and Sebastian Rudolph. Foundations of semantic web technologies. Chapman and Hall/CRC, 2009.
 [51] Pascal Hitzler, Markus Krötzsch, Bijan Parsia, Peter F Patel-Schneider, and Sebastian Rudolph. OWL 2 web ontology language primer. W3C recommendation, 27(1):123, 2009.
 [52] Kelvin Guu, John Miller, and Percy Liang. Traversing knowledge graphs in vector space. arXiv preprint arXiv:1506.01094, 2015.
 [53] Wenhan Xiong, Thien Hoang, and William Yang Wang. DeepPath: A reinforcement learning method for knowledge graph reasoning. arXiv preprint arXiv:1707.06690, 2017.
 [54] Tim Rocktäschel, Sameer Singh, and Sebastian Riedel. Injecting logical background knowledge into embeddings for relation extraction. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1119–1129, 2015.
 [55] Antoine Bordes, Jason Weston, Ronan Collobert, Yoshua Bengio, et al. Learning structured embeddings of knowledge bases. In AAAI, volume 6, page 6, 2011.
 [56] Antoine Bordes, Nicolas Usunier, Alberto Garcia-Duran, Jason Weston, and Oksana Yakhnenko. Translating embeddings for modeling multi-relational data. In Advances in neural information processing systems, pages 2787–2795, 2013.
 [57] Yankai Lin, Zhiyuan Liu, Maosong Sun, Yang Liu, and Xuan Zhu. Learning entity and relation embeddings for knowledge graph completion. In AAAI, volume 15, pages 2181–2187, 2015.
 [58] Zhen Wang, Jianwen Zhang, Jianlin Feng, and Zheng Chen. Knowledge graph embedding by translating on hyperplanes. In AAAI, volume 14, pages 1112–1119, 2014.
 [59] Hongyun Cai, Vincent W Zheng, and Kevin Chang. A comprehensive survey of graph embedding: problems, techniques and applications. IEEE Transactions on Knowledge and Data Engineering, 2018.
 [60] Andrea Madotto, Chien-Sheng Wu, and Pascale Fung. Mem2Seq: Effectively incorporating knowledge bases into end-to-end task-oriented dialog systems. arXiv preprint arXiv:1804.08217, 2018.
 [61] Michelle Cheatham, Adila Krisnadhi, Reihaneh Amini, Pascal Hitzler, Krzysztof Janowicz, Adam Shepherd, Tom Narock, Matt Jones, and Peng Ji. The geolink knowledge graph. Big Earth Data, pages 1–13, 2018.
 [62] Krzysztof Janowicz, Pascal Hitzler, Benjamin Adams, Dave Kolas, II Vardeman, et al. Five stars of linked data vocabulary use. Semantic Web, 5(3):173–176, 2014.
 [63] Stella Sam, Pascal Hitzler, and Krzysztof Janowicz. On the quality of vocabularies for linked dataset papers published in the semantic web journal, 2018.
 [64] Rajarshi Das, Shehzaad Dhuliawala, Manzil Zaheer, Luke Vilnis, Ishan Durugkar, Akshay Krishnamurthy, Alex Smola, and Andrew McCallum. Go for a walk and arrive at the answer: Reasoning over paths in knowledge bases using reinforcement learning. arXiv preprint arXiv:1711.05851, 2017.
 [65] Fan Yang, Zhilin Yang, and William W Cohen. Differentiable learning of logical rules for knowledge base reasoning. In Advances in Neural Information Processing Systems, pages 2319–2328, 2017.
 [66] Laurens van der Maaten and Geoffrey Hinton. Visualizing data using t-SNE. Journal of machine learning research, 9(Nov):2579–2605, 2008.