Reasoning over RDF Knowledge Bases using Deep Learning

11/09/2018 ∙ by Monireh Ebrahimi, et al.

Semantic Web knowledge representation standards, and in particular RDF and OWL, often come endowed with a formal semantics which is considered to be of fundamental importance for the field. Reasoning, i.e., the drawing of logical inferences from knowledge expressed in such standards, is traditionally based on logical deductive methods and algorithms which can be proven to be sound, complete, and terminating, i.e., correct in a very strong sense. For various reasons, though, in particular the scalability issues arising from the ever-increasing amounts of Semantic Web data available and the inability of deductive algorithms to deal with noise in the data, it has been argued that alternative means of reasoning should be investigated which promise high scalability and better robustness. From this perspective, deductive algorithms can be considered the gold standard regarding correctness against which alternative methods need to be tested. In this paper, we show that it is possible to train a Deep Learning system on RDF knowledge graphs, such that it is able to perform reasoning over new RDF knowledge graphs, with high precision and recall compared to the deductive gold standard.




1 Introduction

Automated reasoning, the attempt to conduct logical reasoning algorithmically, has been a long-standing focus of general artificial intelligence, with wide application in knowledge base completion, natural language understanding, question answering, agent planning, etc. In particular, with the recent advent of web-scale knowledge graphs such as Freebase [1], DBpedia [2], and Google's Knowledge Vault [3], and due to their high incompleteness [4], there has been rising interest in solutions to the knowledge base completion task.

For many years, reasoning has been tackled as the task of building systems capable of inferring new crisp symbolic logical rules [5, 6]. However, those traditional methods are too brittle to be applied to noisy, automatically created knowledge bases. With the recent revival of interest in artificial neural networks, neural link prediction models have been applied widely to the completion of knowledge graphs. These methods [7, 8, 9, 10, 11, 12, 13] rely heavily on subsymbolic representations of entities and relations learned through maximization of a scoring objective function over valid factual triples. Thus, the current success of such deep models hinges primarily on the power of those subsymbolic continuous real-valued representations in encoding the similarity/relatedness of entities and relations. Recent attempts have focused on neural multi-hop reasoners [14, 15, 16, 17, 18] to equip models to deal with more complex reasoning where multi-hop inference is required. More recently, a Neural Theorem Prover [19] has been proposed in an attempt to take advantage of both symbolic and sub-symbolic reasoning.

Despite their success, the main restriction common to machine learning-based reasoners is that they are unable to recognize and generalize to analogous situations or tasks. This inherent limitation follows from both the representation functions used and the learning process. The major issue stems from these models' sole reliance on representations of entities learned during training or in a pre-training phase and stored in a lookup table. Consequently, these models have difficulty dealing with out-of-vocabulary entities. Although the small-scale out-of-vocabulary problem has been addressed in part in the natural language processing domain by taking advantage of character-level embeddings [20], learning embeddings on the fly by leveraging text descriptions or spelling [21], copy mechanisms [22], or pointer networks [23], these solutions are still insufficient for transfer purposes. An even greater source of concern is that reasoning in most of the above sub-symbolic approaches hinges more on the notion of similarity and geometric proximity of real-valued vectors (induction) than on performing transitive reasoning (deduction) over them.

Inspired by these observations, we take a different approach in this work by investigating the emulation of deductive symbolic reasoning using memory networks. Memory networks [24] are a class of learning models capable of conducting multiple computational steps over an explicit memory component before returning an answer. They have recently been applied successfully to a range of natural language processing tasks such as question answering [25, 26], language modeling [25], and dialogue tasks [27, 28]. They use memory to store the context or knowledge base of facts explicitly and perform inference over it using multi-hop recurrent attention. End-to-end memory networks (MemN2N) [25] are a less-supervised, more general version of these networks, applicable to settings where labeled supporting memories are not available. They are very similar to the original memory networks, except that the supporting memory slots are not pre-determined as labels for the model. More specifically, the memory inputs useful for finding the correct answer are first retrieved through an attention mechanism, and an output vector is calculated as the weighted sum of the memory output representations. To apply multi-hop attention over the memory before outputting the response, this process is repeated recursively K times, replacing the query vector with the sum of the query and the output vector obtained in the previous step. Finally, the output vector and the final query representation from the last hop pass through a final weight matrix multiplication and a softmax to produce the output label. We have selected such networks because we believe they are a primary candidate to perform well for deductive logical entailment. Their sequential nature corresponds, conceptually, to the sequential process underlying some deductive reasoning algorithms. The attention modeling corresponds to pulling in only the relevant information (logical axioms) needed for the next reasoning step. Their success in natural language inference is also promising: while natural language inference does not follow a formal logical semantics, logical deductive entailment is nevertheless akin to some aspects of natural language reasoning. Besides, as attention can be traced over the run of a memory network, we furthermore gain insight into the "reasoning" underlying the network output, since we can see which pieces of the memory (i.e., of the input knowledge graph) are taken into account at each step.

This paper contributes a recipe involving a simple but effective knowledge base triple normalization applied before learning triple representations within an end-to-end memory network. To perform logical inference at a more abstract level, and thereby facilitate the transfer of reasoning expertise from one knowledge graph to another, the normalization maps entities and predicates in a knowledge graph to a generic vocabulary. Facts in additional knowledge bases are normalized using the same vocabulary, so that the network does not overfit to entity and predicate names in a specific knowledge base. This emulates symbolic reasoning with neural embeddings: the actual names (as strings) of entities from the underlying logic, such as variables, constants, functions, and predicates, are insubstantial for logical entailment, in the sense that a consistent renaming across a theory does not change the set of entailed formulas (under the same renaming). Thanks to this term-agnostic representation, we are able to create a reasoning system capable of performing reasoning over an unseen vocabulary in the test phase.

Our approach combines the best of two worlds: the transferability of classical deductive symbolic reasoning and the robustness of neural sub-symbolic reasoning. This combination supports cross-knowledge graph reasoning, obviating the need for supervised retraining on the task of interest, or for unsupervised pretraining on external data to learn representations, whenever a new knowledge graph is encountered.

Our contributions are threefold: (i) We present the construction of memory networks for emulating symbolic deductive reasoning. (ii) We propose an optimization of this architecture using a normalization approach to enhance its transfer capability; we show that in an unnormalized setting, memory networks fail to perform well across knowledge graphs. (iii) We examine the efficacy of our model for cross-domain and cross-knowledge graph deductive reasoning. We also show the robustness of our model to noisy training and test sets, and its scalability (in terms of reduced time and space complexity) to large datasets.

This paper is structured as follows. In Section 2 we discuss related research efforts, including a brief history of attempts to integrate logical reasoning in connectionist approaches. In Sections 3 and 4, we concretely present the deep learning architecture we used. In Sections 5 and 6, we present an experimental evaluation of our approach and analyze our findings. We conclude and discuss future work in Section 7.

2 Related Work

The research into methods for using artificial neural networks to perform logical deductive reasoning tasks is often referred to as the study of neural-symbolic integration. It can be traced back at least to the landmark 1943 article by McCulloch and Pitts [29], in which it was shown how propositional logic formulas can be represented using a simple neural network model with threshold activation functions. A comprehensive and recent state-of-the-art survey can be found in [30], and hence we will only mention the essentials for understanding the context our work is placed in.

Most of the body of work on neural-symbolic integration concerns propositional logic only (see, e.g., [31]), and indeed relationships, both theoretical and practical in nature, between propositional logics and subsymbolic systems are relatively easy to come by, an observation to which John McCarthy referred as the "propositional fixation" of artificial neural networks [32]. Examples include Knowledge-Based Artificial Neural Networks [33] and the closely related propositional core method [34, 35]. Early attempts to go beyond propositional logic included the SHRUTI system [36, 37], which, however, uses a non-standard connectionist architecture and thus had severe limitations as far as learning was concerned. Approaches using standard artificial neural network architectures with proven good learning capabilities for first-order predicate logic [38] or first-order logic programming [39, 40] were by their very design unable to scale beyond toy examples.

In the past few years, deep learning as a subsymbolic machine learning paradigm has surpassed expectations as to the speed of progress in machine learning-based problem solving, and it is a reasonable assumption that these developments have not yet met their natural limit. Consequently, deep learning approaches are being looked upon as promising for overcoming the symbolic-subsymbolic divide [41, 42, 43, 44, 19, 45, 46]; this list is not exhaustive. Even more work exists on inductive logical inference, e.g. [47, 19], which is not what we deal with in this work. Concretely, on the issue of doing logical reasoning using deep networks, we want to mention the following selected contributions. Tensor-based approaches have been used [19, 45, 46], following [48, 9]; however, these approaches are restricted in terms of logical expressibility and/or to toy examples and limited evaluations. [44] performs knowledge graph reasoning using RDF(S) [49, 50], based on knowledge graph embeddings; however, evaluation and training are done on the same knowledge graph, i.e., there is no learning of a general logical deduction calculus, and consequently no transfer thereof to new data. [43] considers OWL RL reasoning [50, 51]; however, again, training and evaluation are done on the same knowledge bases, i.e., no transfer is possible and no general deduction calculus is acquired during training.

In short, to the best of our knowledge, to date there is no sub-symbolic reasoning work which is able to transfer its learned reasoning capability from one knowledge graph to an unseen one. In fact, since previous works have focused on conducting reasoning on the unseen part of the same knowledge graph, they have tried to gain generalization ability through induction and robustness to missing edges [52], as opposed to deduction. Induction queries are those triples (s, p, o) where there is at least one missing link in every path from s to o in the knowledge graph. Likewise, recent years have seen some progress on zero-shot relation learning in the sub-symbolic reasoning domain [14, 53, 54]. Zero-shot learning refers to the ability of a model to infer, for a pair of entities, a relation that has not been seen in the training set [55]. This generalization capability is still quite limited and fundamentally different from our work in terms of both methodology and purpose.

3 Knowledge Graph Reasoning

In order to explain what we are setting out to do, let us first re-frame the deductive reasoning (or entailment) problem as a classification task. Any given logic L comes with an entailment relation ⊨ ⊆ T × F, where F is the set of all logical formulas (or axioms) over L, and T is the set of all theories (or sets of logical formulas) over L. If T ⊨ F, then we say that F is entailed by T. Re-framed as a classification task, we can ask whether a given pair (T, F) should be classified as a valid entailment (i.e., T ⊨ F holds), or as the opposite (i.e., T ⊭ F). Applying a deep learning approach to this, we would like to train a Deep Neural Network (DNN) on sets of examples (T_i, F_i), such that the DNN learns to correctly classify them as valid or invalid inferences. Of course, we would have to restrict our attention to finite theories, which is usually done in computational logic anyway.
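This re-framing can be made concrete with a toy sketch (not the paper's code; the single transitivity rule and the class names are illustrative only): each training example is a (theory, query) pair labeled 1 if the query lies in the deductive closure of the theory, and 0 otherwise.

```python
# Toy illustration of entailment as binary classification.
# Theories are sets of triples; the only inference rule here is
# transitivity of a hypothetical "subClassOf" predicate.

def closure(theory):
    """Deductive closure under transitivity of 'subClassOf'."""
    facts = set(theory)
    changed = True
    while changed:
        changed = False
        for (a, p, b) in list(facts):
            for (c, q, d) in list(facts):
                if p == q == "subClassOf" and b == c and (a, p, d) not in facts:
                    facts.add((a, p, d))
                    changed = True
    return facts

def label(theory, query):
    """1 if the query is entailed by the theory, 0 otherwise."""
    return 1 if query in closure(theory) else 0

theory = [("Cat", "subClassOf", "Mammal"), ("Mammal", "subClassOf", "Animal")]
print(label(theory, ("Cat", "subClassOf", "Animal")))   # 1 (entailed)
print(label(theory, ("Animal", "subClassOf", "Cat")))   # 0 (not entailed)
```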

3.1 Problem: Lack of Transferability

We wish to train a model whose learning will transfer to new theories within the same logic. That way, our results will demonstrate that the reasoning principles (inference rules) which underlie the logic have been learned. If we were to train a model such that it learns only to reason over one theory, that could hardly be demonstrated. One of the key obstacles we face, however, is understanding how to represent training and test data so that they can be used in standard deep learning settings. Logical theories are highly structured, and it is essentially this structure which determines logical entailments; indeed, some entailment algorithms can be understood in a straightforward way as a type of syntax rewriting system. At the same time, the actual names (as strings) of entities from the underlying logic, such as variables, constants, functions, and predicates, are insubstantial for logical entailment, in the sense that a consistent renaming across a theory does not change the set of entailed formulas (under the same renaming).

For use with standard deep learning approaches, formulas, or even theories, have to be represented in the real coordinate space as vectors (points), matrices, or tensors (multidimensional arrays); in deep learning, such a representation is commonly called an embedding. A plethora of embeddings for knowledge graphs have been proposed [56, 57, 13, 58, 11]; however, we are not aware of an existing embedding which adheres to the principles that seem important for the deductive reasoning scenario. Indeed, the prominent use case explored for knowledge graph embeddings is not deductive in nature; rather, it concerns the discovery or suggestion of additional links or edges in the graph, together with appropriate edge labels. In this link discovery setting, the actual labels of nodes or edges in the graph, and as such their commonsense meanings, are likely important, and most existing embeddings reflect this. However, for deductive reasoning the names of entities are insubstantial, i.e., they should ideally not be captured by an embedding. Another inherent problem in the use of such representations across knowledge graphs is the out-of-vocabulary problem. Formally speaking, such methods define a matrix W of size d × |V| to store a d-dimensional real-valued vector for each word in the vocabulary V. Given the word lookup table W, the embedding for a word w can be obtained by multiplying the word lookup table with w's one-hot vector representation v_w, as e_w = W v_w. The word lookup table can be initialized with vectors from an unsupervised task or during training of the reasoner. In any case, it is obvious that word lookup tables cannot generate vector representations for unseen terms, and it is impractical to store the vectors of all words when the vocabulary size is huge [20].
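The lookup-table mechanism and its out-of-vocabulary failure mode can be sketched in a few lines (the vocabulary and dimensions here are made up for illustration):

```python
import numpy as np

# Word lookup table: the embedding of a word is the product of the
# embedding matrix W (d x |V|) and the word's one-hot vector.
# Unseen words have no column in W -- the out-of-vocabulary problem.

vocab = {"subClassOf": 0, "Cat": 1, "Mammal": 2}
d, V = 4, len(vocab)
rng = np.random.default_rng(0)
W = rng.normal(size=(d, V))          # word lookup table

def embed(word):
    one_hot = np.zeros(V)
    one_hot[vocab[word]] = 1.0       # raises KeyError for unseen words
    return W @ one_hot

assert np.allclose(embed("Cat"), W[:, 1])   # just selects a column of W
```

Multiplying by a one-hot vector simply selects a column of W, which is why no representation exists for a term outside the vocabulary.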
On the other hand, using standard graph embeddings [59] also appears to be insufficient, because structural aspects, such as the particular importance to the deductive reasoning process of nodes and edge labels from the RDF/RDFS (RDF Schema) namespaces, would not receive sufficient attention. Similarly, memory networks usually rely on word-level embedding lookup tables, learned with the underlying rationale that words which occur in similar supervised scenarios should be represented by similar vectors in the real coordinate space. That is why they are known to have difficulties dealing with out-of-vocabulary terms: a word lookup table cannot provide a representation for the unseen, and thus cannot really be applied to natural language inference over new sets of words [21]. For us this poses a challenge in the transfer to new knowledge bases.

We thus seek embeddings which are agnostic to the terms (i.e., strings) used as primitives in the knowledge base. One option may be to pursue variants of the copy mechanism and of pointer networks [60, 23] to refer to the unknown words in the memory when generating responses. Despite the success of these methods in handling a few unknown words absent during training, transferability and the ability of these models to generalize to a completely new vocabulary remain widely open research questions. Furthermore, these approaches are fundamentally appropriate for generative settings and are therefore not suitable for classification problems. Another option is utilizing character-level embeddings (ideal for open-vocabulary word representation) [20] to compose representations of characters into words. However, using character-level embeddings is an inelegant solution in our case because of the importance of having a word-agnostic embedding. Therefore, the entity representation limitations of memory networks need to be overcome in order to make them applicable to deductive logical entailment.

3.2 Solution: Normalized Embedding

To build such an embedding, we build on existing approaches for the embedding of structured data and modify them for our purposes. We expect that some type of normalization will be required before embedding, and that this normalization will have two different aspects. On the one hand, normalization as usually done before invoking logical reasoning algorithms will help control the structural complexity of the formulas which constitute theories and entailments. On the other hand, we will explore syntactic normalization, by which we mean a renaming of primitives from the logical language (variables, constants, functions, predicates) to a set of predefined entity names which are used across different theories. By randomly assigning the mapping for the renaming, the network's learning will be based on the structural information within the theories, and not on the actual names of the primitives, which should be insubstantial for the entailment task. Note that the normalization does not only play the role of "forgetting" irrelevant label names, but also makes it possible to transfer learning from one knowledge graph to another. Indeed, for the approach to work, the network should be trained on many knowledge graphs, and then subsequently tested on completely new ones which have not been encountered during training. Our preliminary results show how this simple but very effective normalization phase can, perhaps surprisingly, lead to a word-agnostic reasoning system capable of conducting reasoning over unseen knowledge graphs containing new vocabulary.

4 Model Architecture

Our model architecture is an adaptation of the end-to-end memory network proposed by [25], with some fundamental alterations necessary for abstract reasoning. A high-level view of our model, shown in Figure 1, is as follows. It takes a discrete set of normalized triples that are to be stored in the memory, together with a query q, and outputs "yes" or "no" as an answer, determining whether q can be inferred from the current knowledge graph statements or not. The normalized triples and q contain symbols coming from a general dictionary of normalized words shared among all of the knowledge graphs in both the training and test sets. The model writes all triples to the memory and then calculates continuous embeddings for the triples and for q. Through multiple hops of attention over those continuous representations, the model then classifies the query. The model is trained by back-propagating the error from the output to the input through multiple memory accesses, and the embeddings are learned through these accesses. We discuss the components of the architecture in more detail below.

4.1 Model Description

Figure 1: Proposed model diagram for K=1

The design of the model is based on the MemN2N [25] end-to-end memory network. The model is augmented with an external memory component storing the embeddings of the normalized triples in our knowledge graph. This external memory is defined as an n × d tensor, where n denotes the number of triples in the knowledge graph and d is the dimensionality of the embeddings. The knowledge base is stored in the memory as two continuous representations m_i and c_i, obtained from the input and output embedding matrices A and C, each of size d × |V|, where |V| is the size of the vocabulary. Similarly, the query q is embedded via a matrix B to obtain an internal state u. In each reasoning step, the memory slots useful for finding the correct answer should have their contents retrieved. To enable this, we use an attention mechanism for u over the memory input representations m_i, taking an inner product followed by a softmax:

    p_i = softmax(u^T m_i)    (1)

Equation 1 is used to calculate a probability vector p over the memory inputs; the output vector o is then computed as the weighted sum of the transformed memory contents c_i with respect to their corresponding probabilities:

    o = Σ_i p_i c_i    (2)

This describes the computation within a single hop. The internal state of the query vector gets updated for the next hop using:

    u^{k+1} = u^k + o^k    (3)

The process repeats K times, where K is the number of memory units (hops) in the network. The output of the K-th memory unit is used to predict a label by passing the sum of o^K and u^K through a final weight matrix W and a softmax:

    â = softmax(W(o^K + u^K))    (4)

Figure 1 illustrates the model for K = 1. The parameters to be learned by backpropagation are the matrices A, B, C, and W.
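The multi-hop forward pass above can be sketched compactly in numpy (our reading of Eqs. 1-4, not the authors' code; all dimensions and weights are made up for illustration):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def memn2n_forward(m, c, u, W, K=3):
    """m, c: (n, d) input/output memory embeddings; u: (d,) query state."""
    for _ in range(K):
        p = softmax(m @ u)        # Eq. 1: attention over memory slots
        o = p @ c                 # Eq. 2: weighted sum of output embeddings
        u = u + o                 # Eq. 3: update internal state
    return softmax(W @ u)         # Eq. 4: u now equals o^K + u^K

rng = np.random.default_rng(1)
n, d = 5, 8
m, c = rng.normal(size=(n, d)), rng.normal(size=(n, d))
u = rng.normal(size=d)
W = rng.normal(size=(2, d))       # two output labels: "yes" / "no"
probs = memn2n_forward(m, c, u, W)
assert probs.shape == (2,) and abs(probs.sum() - 1.0) < 1e-9
```

Note that after the loop the state u already equals u^K + o^K, so the final projection matches Eq. 4.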

4.2 Memory Content

A knowledge graph is a collection of facts stored as triples (s, p, o), where s and o are the subject and object, respectively, while p is a predicate (relation) binding s and o together. Every entity in the knowledge graph is represented by a unique Uniform Resource Identifier (URI). We normalize these triples by systematically renaming all URIs which are not in the RDF/RDFS namespaces, as discussed previously. Each such URI is mapped to an arbitrary string from a predefined set {e_1, ..., e_N}, where N is taken as a training hyper-parameter giving an upper bound for the largest number of entities in a knowledge graph the system will be able to handle. Note that URIs in the RDF/RDFS namespaces are not renamed, as they are important for the deductive reasoning process. Consequently, each normalized knowledge graph will be a collection of facts stored as triples over this generic vocabulary.
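A minimal sketch of this normalization (the URIs, pool size, and token format are hypothetical; the paper only requires a consistent, random renaming that leaves RDF/RDFS terms untouched):

```python
import random

# Rename every URI outside the RDF/RDFS namespaces to a generic token
# from a fixed pool shared across all knowledge graphs.

RESERVED_PREFIXES = ("http://www.w3.org/1999/02/22-rdf-syntax-ns#",
                     "http://www.w3.org/2000/01/rdf-schema#")

def normalize(triples, pool_size=1000, seed=0):
    pool = [f"e{i}" for i in range(pool_size)]
    random.Random(seed).shuffle(pool)          # random but consistent mapping
    mapping, out = {}, []
    for triple in triples:
        normalized = []
        for term in triple:
            if term.startswith(RESERVED_PREFIXES):
                normalized.append(term)        # keep rdf:/rdfs: vocabulary
            else:
                if term not in mapping:
                    mapping[term] = pool[len(mapping)]
                normalized.append(mapping[term])
        out.append(tuple(normalized))
    return out

kg = [("http://ex.org/Cat",
       "http://www.w3.org/2000/01/rdf-schema#subClassOf",
       "http://ex.org/Mammal")]
print(normalize(kg))
```

Because the mapping is consistent within a graph but random across graphs, the network can only exploit structure, never entity names.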

It is important to note that each symbol is mapped into an element of the generic vocabulary regardless of its position in the triple, i.e., whether it is a subject, an object, or a predicate. Yet the position of an element within a triple is an important feature to consider. Inspired by [25], we thus employ a positional encoding to encode the position of each element within the triple. This gives m_i = Σ_j l_j ∘ A x_{ij}, where ∘ denotes element-wise multiplication and l_j is a column vector with entries l_{kj} = (1 − j/J) − (k/d)(1 − 2j/J) (assuming 1-based indexing), with J = 3 being the number of elements in the triple and d the size of the embedding vectors in the memory input embedding matrix A. Each memory slot m_i thus represents the position-weighted summation of the elements of a triple. By using positional encoding (PE), we ensure that the order of the elements affects the encoding of each memory slot m_i. This representation, which is used for the query triple, memory inputs, and memory outputs, is core to everything we do.

5 Experimental Setups

OWL-Centric: Amino Acid Ontology schema, Biological Pathway Exchange (BioPAX) schema, COmmon Semantic MOdel (COSMO), dbpedia-schema, Descriptions and Situations, Disease, Dolce, Dublin Core schema, Gene, General Formal Ontology (GFO), Human Phenotype, Institutional Ontology, Metadata for Ontology Description and Publication, Ontology for Biomedical Investigations, Phenotypic Quality, University of Lehigh benchmark, Xenopus Anatomy and Development, Yet Another More Advanced Top-level Ontology (YAMATO).

Linked Data: AGROVOC Linked Dataset, Amsterdam Museum Linked Open Data, The Apertium Bilingual Dictionaries on the Web of Data, A Curated and Evolving Linguistic Linked Dataset (Asit), EARTh: an Environmental Application Reference Thesaurus in the Linked Open Data Cloud, lemonUby - a large, interlinked, syntactically-rich lexical resource for ontologies, Linked European Television Heritage data, Linked Web APIs Dataset: Web APIs meet Linked Data.

OWL-Centric Test Set: Animal Health Surveillance Ontology, Cryptographic ontology of Semantic interoperability for rapid integration and deployment, Drug Abuse Ontology, Drug Target Ontology, General Ontology for Linguistic Description (GOLD), Identification Ontology, Inline Hockey League pattern ontology, Knowledge Processing Ontology for Robots, Minimal category of list ontology, Provenance and Plans Ontology, SAREF: the Smart Appliances REFerence ontology, Tatian Corpus of Deviating Examples (T-CODEX) ontology.

Table 1: List of ontologies used to create our datasets

Test Dataset          | #KG   | Base: #Facts, #Ent., %Class, %Indv, %R., %Axiom. | Inferred: #Facts, #Ent., %Class, %Indv, %R., %Axiom. | Invalid: #Facts
OWL-Centric           | 2464  | 996, 832, 14, 19, 3, 0                           | 494, 832, 14, 0.01, 1, 20                            | 462
Linked Data           | 20527 | 999, 787, 3, 22, 5, 0                            | 124, 787, 3, 0.006, 1, 85                            | 124
OWL-Centric Test Set  | 21    | 622, 400, 36, 41, 3, 0                           | 837, 400, 36, 3, 1, 12                               | 476
Synthetic Data        | 2     | 752, 506, 52, 0, 1, 0                            | 126356, 506, 52, 0, 1, 0.07                          | 700

Table 2: Statistics of various datasets used in experiments

5.1 Candidate Logic

There is a plethora of logics which could be used for our investigation. We exclude propositional logics because they seem to be easier to capture using connectionist architectures, while at the same time the methods used for dealing with them in a subsymbolic manner mostly do not seem to transfer to non-propositional logics. Here we use RDF Schema (RDFS). The Resource Description Framework (RDF) [49, 50] is an established and widely used W3C standard for expressing knowledge graphs. The standard comes with a formal semantics which defines an entailment relation. As a logic, RDFS is of very low expressivity, and reasoning algorithms are very straightforward. One way to frame it is that there is a small set of thirteen entailment rules, fixed across all knowledge graphs, which are expressible in Datalog. These thirteen rules can be used to entail new facts. The completion of a knowledge graph is in general infinite because, by definition, there is an infinite set of facts (related to RDFS encodings of lists) which are always entailed; for practical reasons, however, this is ignored by established RDFS reasoning systems, i.e., for all practical purposes we can consider completions of knowledge graphs to be finite.
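To make the flavor of these rules concrete, here is a toy forward-chaining sketch of two of the thirteen RDFS entailment rules, rdfs11 (subclass transitivity) and rdfs9 (type propagation); the full rule set is defined in the RDF 1.1 Semantics specification, and the prefixes and example URIs here are abbreviations for illustration:

```python
SUBCLASS, TYPE = "rdfs:subClassOf", "rdf:type"

def rdfs_closure(graph):
    """Repeatedly apply rdfs9 and rdfs11 until no new facts appear."""
    facts = set(graph)
    changed = True
    while changed:
        changed = False
        new = set()
        for (s, p, o) in facts:
            for (s2, p2, o2) in facts:
                if p == p2 == SUBCLASS and o == s2:
                    new.add((s, SUBCLASS, o2))     # rdfs11: transitivity
                if p == TYPE and p2 == SUBCLASS and o == s2:
                    new.add((s, TYPE, o2))         # rdfs9: type propagation
        if not new <= facts:
            facts |= new
            changed = True
    return facts

g = {("ex:Cat", SUBCLASS, "ex:Mammal"),
     ("ex:Mammal", SUBCLASS, "ex:Animal"),
     ("ex:tom", TYPE, "ex:Cat")}
inferred = rdfs_closure(g) - g
print(sorted(inferred))
```

A production reasoner (e.g. Apache Jena, as used below) implements all thirteen rules plus the axiomatic triples, but the fixed-point iteration shown here is the same basic idea.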

5.2 Dataset

Due to the novel nature of the problem at hand, there is no dataset available for testing the capability of our approach. The good news, however, is that there is a plethora of knowledge graphs [61] that we could use to create our own dataset. The Linked Data Cloud website lists over 1,200 interlinked RDF(S) datasets, which constitute knowledge graphs suitable for our setting, some of which are of substantial size. We have collected data from this website as well as from the Data Hub website to create our training set. Our training set (the OWL-Centric dataset) is comprised of knowledge graphs of 1000 triples each, sampled from around 20 ontologies (as listed in Table 1). In order to test our model's ability to generalize to a completely different domain, we collected another dataset, called the OWL-Centric test set. Furthermore, to ensure that our evaluation set represents real-world RDF data and meets the quality requirements of linked data [62], we followed [63] in collecting data for our Linked Data dataset. Despite our best attempts, though, we could not find a public RDF dump or SPARQL endpoint for some of the datasets mentioned in that paper. In addition, to test the capability of our model to conduct long chains of reasoning, we created a synthetic dataset using the rdfs:subClassOf and rdfs:subPropertyOf predicates. It covers reasoning chains of length up to 10.

For each knowledge graph, we created a finite set of inferred triples using the Apache Jena API. These inferred triples comprise our positive class instances. For generating invalid instances, we follow two methods. In the first scenario, we generate invalid triples by random permutation of entities, filtering out those triples which are already in the knowledge base or in the set of valid triples. In the second scenario, which serves as a final quality check against including trivial invalid triples in our dataset, we create invalid instances with the aid of the rdf:type predicate. More specifically, for each valid triple in the dataset, we replace one of its elements (chosen randomly) with another random element which qualifies for that position based on its type-of relationships. Here as well, we add the newly generated triple only if it is not already part of the original knowledge base or the valid facts. Indeed, through random selection of one of the hyponyms of the hypernyms of the entity of interest, we ensure that our created dataset is challenging enough. The datasets created by this strategy are denoted by superscript "a" in Table 3. Some important statistics of our created datasets are summarized in Tables 1 and 2. More specifically, Table 2 lists the number of knowledge graphs in each of our datasets and the average numbers of facts and entities per knowledge graph. It also shows the average numbers of classes, individuals, relations, and axiomatic triples for each knowledge graph (as percentages).
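The first corruption strategy can be sketched as follows (a simplified illustration, not the authors' code; the entity names are made up, and the type-based second strategy is omitted):

```python
import random

# Create invalid triples by randomly replacing the subject or object of a
# valid triple, filtering out anything already in the knowledge base or in
# the set of valid (inferred) triples.

def corrupt(valid, kb, entities, n, seed=0):
    rng = random.Random(seed)
    known = set(kb) | set(valid)
    negatives = set()
    while len(negatives) < n:
        s, p, o = rng.choice(list(valid))
        if rng.random() < 0.5:
            s = rng.choice(entities)     # corrupt the subject ...
        else:
            o = rng.choice(entities)     # ... or the object
        if (s, p, o) not in known:
            negatives.add((s, p, o))
    return list(negatives)

kb = [("a", "subClassOf", "b"), ("b", "subClassOf", "c")]
valid = [("a", "subClassOf", "c")]
negs = corrupt(valid, kb, ["a", "b", "c", "d"], n=2)
assert all(t not in set(kb) | set(valid) for t in negs)
```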

5.3 Training Details

Training was done over 10 epochs using the Adam optimizer, with a batch size of 100 triples. For the final batch of queries of each knowledge graph, we used zero-padding up to the maximum batch size of 100. The capacity of our external memory is 1000, which is also the maximum size of our knowledge bases. Our model was trained using a learning rate of … and a learning rate decay of … . We used linear start for 1 epoch, during which the softmax is removed from each memory layer except the final one. L2 norm clipping with a maximum of 40 was applied to the gradients. Both the memory input embeddings and the memory output embeddings are vectors of size …; the embedding matrices A, B, and C are therefore of size …, where 3033 is the size of the normalized vocabulary. Unless otherwise mentioned, we used K=10 hops for all of our experiments. Adjacent weight sharing was used, whereby the output embedding of one layer is the input embedding of the next; similarly, the answer prediction weight matrix is copied from the final output embedding, and the query embedding equals the first-layer input embedding. All weights are initialized from a Gaussian distribution with mean … and standard deviation … .
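The hyperparameters stated above can be collected into a single configuration sketch. Field names are ours, and values that were lost from the text (learning rate, decay, embedding size, Gaussian parameters) are deliberately omitted rather than guessed:

```python
# Hyperparameter summary for the memory-network training described above.
# Only values explicitly stated in the text are filled in.
TRAIN_CONFIG = {
    "epochs": 10,
    "optimizer": "Adam",
    "batch_size": 100,           # query batches zero-padded to this size
    "memory_capacity": 1000,     # = maximum knowledge-base size
    "grad_clip_l2": 40.0,        # max L2 norm for gradient clipping
    "hops": 10,                  # K, number of memory hops
    "weight_tying": "adjacent",  # C^k = A^(k+1); W and B tied to C^K and A^1
    "linear_start_epochs": 1,    # softmax removed from all but final layer
    "vocab_size": 3033,          # normalized vocabulary
    "init": "gaussian",          # mean/std not reported here
}
```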

6 Experimental Results

| Training Dataset | Test Dataset | Valid: P | Valid: R (Sens.) | Valid: F | Invalid: P | Invalid: R (Spec.) | Invalid: F | Accuracy |
|---|---|---|---|---|---|---|---|---|
| OWL-Centric Dataset | Linked Data | 93 | 98 | 96 | 98 | 93 | 95 | 96 |
| OWL-Centric Dataset (90%) | OWL-Centric Dataset (10%) | 88 | 91 | 89 | 90 | 88 | 89 | 90 |
| OWL-Centric Dataset | OWL-Centric Test Set b | 79 | 62 | 68 | 70 | 84 | 76 | 69 |
| OWL-Centric Dataset | Synthetic Data | 65 | 49 | 40 | 52 | 54 | 42 | 52 |
| OWL-Centric Dataset | Linked Data a | 54 | 98 | 70 | 91 | 16 | 27 | 86 |
| OWL-Centric Dataset a | Linked Data a | 62 | 72 | 67 | 67 | 56 | 61 | 91 |
| OWL-Centric Dataset (90%) a | OWL-Centric Dataset (10%) a | 79 | 72 | 75 | 74 | 81 | 77 | 80 |
| OWL-Centric Dataset | OWL-Centric Test Set ab | 58 | 68 | 62 | 62 | 50 | 54 | 58 |
| OWL-Centric Dataset a | OWL-Centric Test Set ab | 77 | 57 | 65 | 66 | 82 | 73 | 73 |
| OWL-Centric Dataset | Synthetic Data a | 70 | 51 | 40 | 47 | 52 | 38 | 51 |
| OWL-Centric Dataset a | Synthetic Data a | 67 | 23 | 25 | 52 | 80 | 62 | 50 |
| *Baseline:* | | | | | | | | |
| OWL-Centric Dataset | Linked Data | 73 | 98 | 83 | 94 | 46 | 61 | 43 |
| OWL-Centric Dataset (90%) | OWL-Centric Dataset (10%) | 84 | 83 | 84 | 84 | 84 | 84 | 82 |
| OWL-Centric Dataset | OWL-Centric Test Set b | 62 | 84 | 70 | 80 | 40 | 48 | 61 |
| OWL-Centric Dataset | Synthetic Data | 35 | 41 | 32 | 48 | 55 | 45 | 48 |

  • a: More tricky "no"s & balanced dataset.

  • b: Completely different domain.

Table 3: Experimental results of the proposed model

6.1 Quantitative Results

In this section, we highlight some of our findings along with the experimental results of our proposed approach. The evaluation metrics we report are precision, recall, and F-measure averaged over all knowledge graphs in the test set, obtained for both the valid and the invalid set of triples. In particular, we also report the recall of the negative class, also called specificity, in order to interpret the results more carefully by counting the number of true negatives. Additionally, as mentioned in the training details, we zero-pad each batch of query triples of size less than 100. This requires introducing an additional class label for the zero-paddings, in both the training and the test phase. In our evaluation, however, we do not consider the zero-padding class in the calculation of precision, recall, and F-measure. Throughout our evaluations we have nevertheless observed some misclassifications from and to this class, so we also report accuracy in order to take such mistakes into account.
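The per-class evaluation with the padding class excluded can be sketched as follows; this is an illustrative helper, not the authors' evaluation code:

```python
def prf_excluding_padding(y_true, y_pred, positive, padding="PAD"):
    """Precision, recall, and F1 for one class, ignoring slots whose
    gold label is the padding class. With positive="no" the returned
    recall is the specificity reported in the tables."""
    pairs = [(t, p) for t, p in zip(y_true, y_pred) if t != padding]
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return prec, rec, f1
```

Accuracy, by contrast, is computed over all slots, so misclassifications from or into the padding class still show up there.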

To the best of our knowledge, there is no prior architecture capable of conducting deductive reasoning over a completely unseen knowledge graph. We have therefore considered the non-normalized embedding version of our memory network as a baseline. As Table 3 shows, our technique has a clear and significant advantage over this baseline. A further, even more important benefit of our normalization model is its training time. This considerable difference in time complexity results from the remarkable difference in the size of the embedding matrices in the original and the normalized case: the embedding matrices to be learned for the normalized OWL-Centric dataset are far smaller than those for the non-normalized one (and those for the non-normalized Linked Data would be prohibitively big). This has led to a remarkable decrease in training time and space complexity, and has consequently helped the scalability of our memory networks. In the case of the OWL-Centric dataset, for instance, the space required to save the normalized model is 80 times smaller than for the intact model (after compression). Likewise, the normalized model is almost 40 times faster to train than the non-normalized one on this dataset: trained for just over a day on the OWL-Centric data, our normalized model achieves better accuracy, whereas training on the same non-normalized dataset took more than a week on a 12-core machine. Hence, the importance of our novel normalized representation learning cannot be overemphasized.
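The normalization idea can be sketched as follows: every IRI outside the RDF(S) vocabulary is mapped to a generic positional token, so the embedding matrices grow with the small normalized vocabulary rather than with the full entity vocabulary. The token scheme and the (abridged) vocabulary set are illustrative assumptions:

```python
# Abridged; the full RDF(S) vocabulary would be listed here.
RDFS_VOCAB = {
    "rdf:type", "rdfs:subClassOf", "rdfs:subPropertyOf",
    "rdfs:domain", "rdfs:range",
}

def normalize_graph(triples):
    """Replace each non-RDF(S) term with a generic token <e0>, <e1>, ...
    assigned in order of first appearance. RDF(S) terms are kept so the
    model can learn their semantics; everything else becomes anonymous."""
    mapping = {}
    out = []
    for triple in triples:
        norm = []
        for term in triple:
            if term in RDFS_VOCAB:
                norm.append(term)
            else:
                if term not in mapping:
                    mapping[term] = f"<e{len(mapping)}>"
                norm.append(mapping[term])
        out.append(tuple(norm))
    return out, mapping
```

Because the tokens are reused across knowledge graphs, a model trained on normalized graphs transfers to unseen graphs whose entities it has never embedded.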

To further examine how our model performs on different data sources, we applied our approach to multiple datasets with various characteristics. The results across all variations are given in Table 3. From this table we can see that, apart from the strikingly good performance compared to the baseline, there are a number of other interesting points. Our model obtains even better results on the Linked Data task although it was trained on the OWL-Centric dataset; the reasons for this performance gain are not yet wholly understood. Another interesting observation is the poor performance of our algorithm when trained on the OWL-Centric dataset and applied to the tricky version of the Linked Data: in that case our model classified most of the triples into the "yes" class, which led to a low specificity (recall of the "no" class) of 16%. This is to be expected, because in the challenging-"no"s version of our dataset the negative instances bear close resemblance to the positive ones, making differentiation harder. Training the model on the tricky OWL-Centric dataset improved specificity by a substantial margin (more than threefold). In the case of the synthetic data, although performance is not ideal, we believe it is still acceptable. An evident explanation for this performance decrease relative to the other data sources is the significant difference in reasoning patterns and in the nature of the training and test sets: our training so far has only been done on real-world datasets and not on the peculiar synthetic data, and training the model on synthetic data is not the focus of this study.

| Test Dataset (cells: P/R/F) | Hop 0 | Hop 1 | Hop 2 | Hop 3 | Hop 4 | Hop 5 | Hop 6 | Hop 7 | Hop 8 | Hop 9 | Hop 10 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Linked Data a | 0/0/0 | 80/99/88 | 89/97/93 | 77/98/86 | – | – | – | – | – | – | – |
| Linked Data b | 2/0/0 | 82/91/86 | 89/98/93 | 79/100/88 | – | – | – | – | – | – | – |
| OWL-Centric c | 19/5/9 | 31/75/42 | 78/80/78 | 48/47/44 | 4/34/6 | – | – | – | – | – | – |
| Synthetic | 32/46/33 | 31/87/38 | 66/55/44 | 25/45/32 | 29/46/33 | 26/46/33 | 25/46/33 | 25/46/33 | 24/43/31 | 25/43/31 | 22/36/28 |

  • a: LemonUby Ontology

  • b: Agrovoc Ontology

  • c: Completely different domain

Table 4: Experimental results over each reasoning hop
| Dataset | Hop 1 | Hop 2 | Hop 3 | Hop 4 | Hop 5 | Hop 6 | Hop 7 | Hop 8 | Hop 9 | Hop 10 |
|---|---|---|---|---|---|---|---|---|---|---|
| OWL-Centric a | 8% | 67% | 24% | 0.01% | 0% | 0% | 0% | 0% | 0% | 0% |
| Linked Data b | 31% | 50% | 19% | 0% | 0% | 0% | 0% | 0% | 0% | 0% |
| Linked Data c | 34% | 46% | 20% | 0% | 0% | 0% | 0% | 0% | 0% | 0% |
| OWL-Centric d | 5% | 64% | 30% | 1% | 0% | 0% | 0% | 0% | 0% | 0% |
| Synthetic Data | 0.03% | 1.42% | 1% | 1.56% | 3.09% | 6.03% | 11.46% | 20.48% | 31.25% | 23.65% |
  • a: Training set

  • b: LemonUby Ontology

  • c: Agrovoc Ontology

  • d: Completely different domain

Table 5: Data distribution per knowledge graph over each reasoning hop

Further experiments were needed to analyze the reasoning depth acquired by our network. Fundamentally, we conjecture that the reasoning depth acquired by the network corresponds both to (1) the number of layers in the deep network and (2) the ratio of deep versus shallow reasoning required to perform the deductive reasoning. Let us explain this. Forward-chaining reasoners (which are the standard for RDF(S), OWL EL, and OWL RL reasoning, and can also be used for Datalog) iteratively apply inference rules in order to derive new entailed facts; in subsequent iterations, the previously derived facts need to be taken into account. The number of sequential applications of the inference rules required to obtain a given logical consequence can thus be understood as a measure of the "depth" of the deductive entailment. To gain a better understanding of what our model has learned, we have mimicked this behavior of symbolic reasoners in creating our test set. We first start from the input knowledge graph at hop 0. We then subsequently produce knowledge graphs G1, G2, … by applying the RDFS inference patterns from the W3C specification, until no new triples are added (i.e., until Gi+1 \ Gi is empty). Consequently, our hop-0 dataset contains the original graph, i.e., the inferred axioms are replaced by the triples of the original graph; hop 1 contains the RDFS axiomatic triples in the inferred-axioms field; and the real inference steps start at hop i with i ≥ 2. It is worth noting that, in the process of creating this data, our reasoning tool encountered several errors caused by missing entities in some triples. There are many such missing, unknown, or inapplicable entities in real-world knowledge graphs, which emphasizes the need for more robust sub-symbolic reasoners.
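The hop-wise construction can be sketched with a small fixpoint loop. For brevity only two of the W3C RDFS entailment patterns are implemented (rdfs11, subclass transitivity, and rdfs9, type propagation along subclass); the function records the hop at which each triple is first derived:

```python
def rdfs_closure_by_hop(graph):
    """Apply a subset of the RDFS inference rules iteratively until no
    new triples appear, returning a list of triple sets: hops[0] is the
    original graph, hops[i] the triples first derived at iteration i."""
    known = set(graph)
    hops = [set(graph)]           # hop 0: the original graph
    while True:
        current = set(known)
        new = set()
        for (s, p, o) in current:
            if p == "rdfs:subClassOf":
                for (s2, p2, o2) in current:
                    # rdfs11: subclass transitivity
                    if p2 == "rdfs:subClassOf" and s2 == o:
                        t = (s, "rdfs:subClassOf", o2)
                        if t not in known:
                            new.add(t)
                    # rdfs9: propagate rdf:type along subclass
                    if p2 == "rdf:type" and o2 == s:
                        t = (s2, "rdf:type", o)
                        if t not in known:
                            new.add(t)
        if not new:
            break
        hops.append(new)
        known |= new
    return hops
```

On the chain A ⊑ B ⊑ C with x of type A, the first iteration derives A ⊑ C and x:B, and the second derives x:C, matching the "depth" notion used above.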

Table 4 summarizes our results in this setup. The poor performance at hop 0 is not unexpected, since our training set does not include any original triples. Unsurprisingly, we also observe that the results on our synthetic data, generated with the subclass-of and subproperty-of predicates, are poor. This is because of the large gap between the distribution of our training data over reasoning hops and the reasoning-chain-length distribution of the synthetic data; Table 5 provides further evidence for this. From Tables 4 and 5, one can see how the distribution of the training set affects the learning capability of our model. Beyond our own observations, previous studies [64, 19, 16, 65] also corroborate that the reasoning chain length required to answer a query over a real-world KB is limited to 3 or 4. Therefore, a synthetic training set will have to be built for further analysis of the reasoning-depth capability of our model in future work.

Furthermore, a naive expectation of the trained network would be that each layer performs an operation equivalent to one inference rule application. If this were the case, then the number of layers would limit the entailment depth the network could acquire; however, we had to assess this assumption experimentally. We therefore ran 10 experiments (K=1 to 10) to assess the effect of the number of computational hops on our results over the OWL-Centric dataset. Interestingly, the experimental results suggest that our model achieves almost the same performance with K=1 and, more interestingly, that the F-measure remains constant as K increases step by step from 1 to 10. This shows that multi-hop reasoning can be done within a single attention hop of the memory network (as only 2-3 reasoning hops are needed over our training set), while increasing the number of hops does not hurt performance. This demonstrates the robustness of the proposed method against changes to its structure, and also suggests that each attention hop of our memory network can conduct more than one inference rule application step (deductive reasoning hop).

6.2 General Embeddings Visualization

In order to gain some insight into the behavior of our normalized embedding model, we have plotted two-dimensional t-Distributed Stochastic Neighbor Embedding (t-SNE) [66] and Principal Component Analysis (PCA) visualizations of the embeddings computed for the RDF(S) words and for all normalized words in the knowledge graphs, shown in Figures 2, 3, and 4, respectively. The embeddings have been fetched from the matrix B (the query embedding lookup table) in computational hop 1 of our model trained on the OWL-Centric dataset. Words are positioned in the plot according to the semantic relationships implied by their embeddings. As anticipated, all the normalized words tend to form one cluster rather than multiple separated clusters, as shown in Figures 3 and 4. We found the PCA plot more insightful. The PCA projection illustrates the ability of our model to automatically organize RDF(S) concepts and to implicitly learn the relationships between them, given that during training we have not provided any supervised information about what each RDF(S) element means. For instance, rdfs:domain and rdfs:range are located very close together and far from the normalized entities. Similarly, rdf:subject, rdf:predicate, and rdf:object are very similar in the vector space, as are rdfs:seeAlso and rdfs:isDefinedBy; likewise, rdfs:Container, rdf:Bag, rdf:Seq, and rdf:Alt are in the vicinity of each other. rdf:langString is the only entity of the RDF domain that lies inside the circle of normalized entities. We believe this is because rdf:langString's domain and range are strings, so it has only co-occurred with normalized instances in the knowledge graphs, and its vector is therefore very close to the vectors of the normalized entities. Another possible explanation is the low frequency of rdf:langString in our training set, which could contribute to an overly general representation of this word.

(a) Embeddings for RDF(S) namespace
(b) Embeddings for the whole general vocabulary
Figure 2: t-SNE projection
Figure 3: PCA projection of embeddings for the whole general vocabulary

6.3 Ablation Study

| Training Dataset | Test Dataset | Valid: P | Valid: R | Valid: F | Invalid: P | Invalid: R | Invalid: F | Accuracy |
|---|---|---|---|---|---|---|---|---|
| OWL-Centric Dataset | Linked Data | 94 | 97 | 95 | 97 | 93 | 95 | 28 |
| OWL-Centric Dataset (90%) | OWL-Centric Dataset (10%) | 85 | 92 | 88 | 92 | 83 | 87 | 76 |
| OWL-Centric Dataset | OWL-Centric Test Set a | 73 | 80 | 75 | 80 | 67 | 71 | 61 |
| OWL-Centric Dataset | Synthetic Data | 52 | 43 | 46 | 51 | 60 | 54 | 51 |

  • a: Completely different domain.

Table 6: Ablation Study: No Positional Encoding

We perform an ablation study in which we remove the positional encoding from the embeddings and compare the results to assess its impact. The idea behind positional encoding is to take the order of the elements within each triple into account; without it, we are using a bag-of-words representation that ignores the ordering of the elements in each triple. The results of these experiments are listed in Table 6. As anticipated, removing the positional encoding decreases performance in terms of accuracy across all of our experiments. Indeed, a more detailed analysis of the first model revealed that it classifies all zero-paddings into the negative class, which explains the huge gap between its accuracy and its F-measure. Nevertheless, and perhaps surprisingly, removing the positional encoding does not substantially decrease performance in some of our experiments. This is not hard to appreciate in light of the fact that orderless representations have repeatedly shown success in natural language processing, even where order naturally matters.
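The positional encoding ablated here is, we assume, the position-wise weighting from end-to-end memory networks, l_kj = (1 - j/J) - (k/d)(1 - 2j/J), where J is the number of words, d the embedding size, and j, k are 1-based. A small sketch (plain lists instead of tensors):

```python
def positional_encoding(J, d):
    """Weight matrix L with L[j-1][k-1] = (1 - j/J) - (k/d)(1 - 2j/J),
    for word positions j = 1..J and embedding dimensions k = 1..d."""
    return [[(1 - j / J) - (k / d) * (1 - 2 * j / J)
             for k in range(1, d + 1)]
            for j in range(1, J + 1)]

def encode(word_vectors):
    """Element-wise weight each word vector by its position and sum.
    Dropping L and summing the vectors directly gives the bag-of-words
    variant used in the ablation, which is order-invariant."""
    J, d = len(word_vectors), len(word_vectors[0])
    L = positional_encoding(J, d)
    return [sum(L[j][k] * word_vectors[j][k] for j in range(J))
            for k in range(d)]
```

With this weighting, reversing the word order of a triple generally changes the encoded vector, whereas the plain sum does not.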

6.4 Limitations

Our work clearly has some limitations. One limitation of our initial approach is that our setting puts a global limit on the size of the knowledge graphs a trained system will be able to handle, and the required training time can be expected to grow super-linearly with the size of the knowledge graphs. We will test the scalability limits of our approach in the future, but we are confident that we can reasonably handle knowledge graphs with hundreds of thousands of triples. The contribution of this work, however, concerns the fundamental capability of our model to perform deductive entailment across knowledge graphs, and thus we have not focused heavily on scalability; our future work will address scalability in more depth.

Additionally, based on our analysis, the reasoning chain length of our real-world datasets is either 2 or 3; this is also the case for the real-world datasets used in previous studies. This distribution of the training data constrains the capability of our model to learn longer reasoning paths. Given the high reasoning capacity of memory networks, however, we are confident that the model would be capable of chains on the order of tens of hops if trained on sufficiently complex data. Typically, one would expect the number of entailed facts as a function of the number of inference rule applications to follow a long-tail distribution, which means that "deep" entailments are underrepresented in training data; this may cause a network not to actually acquire deep inference skills. In future work, we will experiment with different synthetic training sets, possibly overrepresenting "deep" entailments, to counter this problem.

Keeping the above limitations in mind, our future goals are to create a synthetic dataset with longer reasoning paths for training our model, and to explore the scalability limits of our approach.

7 Conclusions

In this paper, we have shown how emulating symbolic reasoning through sub-symbolic reasoning can lead to a scalable and efficient model capable of transferring its reasoning ability from one domain to another without any retraining, pre-training, or fine-tuning on the new domain. To achieve this, we have introduced a normalization technique for representation learning in memory networks, and we empirically show that the proposed model comfortably beats its non-normalized counterpart. Beyond knowledge graph reasoning, our approach lends itself well to use in hybrid sub-symbolic/symbolic reasoning systems for planning, cognitive systems, and robot control. Our study also provides considerable insight not only into representation learning for rare or out-of-vocabulary words in general, but also into transfer learning, zero-shot learning, and domain adaptation in the reasoning domain.

8 Acknowledgements

This work is supported by the Ohio Federal Research Network project Human-Centered Big Data.