AAAI 2020: Rule-Guided Compositional Representation Learning on Knowledge Graphs
Representation learning on a knowledge graph (KG) embeds the entities and relations of the KG into low-dimensional continuous vector spaces. Early KG embedding methods only pay attention to the structured information encoded in triples, which limits their performance because of the structural sparseness of KGs. Some recent attempts consider path information to expand the structure of KGs but lack explainability in the process of obtaining the path representations. In this paper, we propose a novel Rule and Path-based Joint Embedding (RPJE) scheme, which takes full advantage of the explainability and accuracy of logic rules, the generalization of KG embedding, and the supplementary semantic structure of paths. Specifically, logic rules of different lengths (the number of relations in the rule body) in the form of Horn clauses are first mined from the KG and encoded for representation learning. Then, the rules of length 2 are applied to compose paths accurately, while the rules of length 1 are explicitly employed to create semantic associations among relations and constrain relation embeddings. In addition, the confidence level of each rule is considered in optimization to guarantee the availability of applying the rule to representation learning. Extensive experimental results illustrate that RPJE outperforms other state-of-the-art baselines on the KG completion task, which also demonstrates the superiority of utilizing logic rules as well as paths for improving the accuracy and explainability of representation learning.
Knowledge graphs (KGs) such as Freebase [Bollacker, Gottlob, and Flesca2008], DBpedia [Lehmann et al.2015] and NELL [Mitchell et al.2018] are knowledge bases which store factual triples consisting of entities with their relations. They have achieved rapid development and extensive applications for various research fields, such as zero-shot recognition [Wang, Ye, and Gupta2018], question answering [Hao et al.2018] and recommender systems [Zhang et al.2016].
Typical KGs contain billions of triples but remain incomplete. Specifically, 75% of the 3 million person entities in Freebase lack a nationality [West et al.2014] and 60% of the person entities in DBpedia lack a place of birth [Krompaß, Baier, and Tresp2015]. This incompleteness makes it hard to introduce KGs into some applications; for example, a question answering system based on an incomplete KG may have no correct answer to return. Moreover, the symbolic nature of triple facts in the form of (head entity, relation, tail entity) makes it challenging to expand large-scale KGs.
In recent years, representation learning on KGs (also known as KG embedding), such as TransE [Bordes et al.2013], TransH [Wang et al.2014] and TransR [Lin et al.2015b], has become popular. It aims to embed entities and relations of KGs into a continuous vector space while retaining the inherent structure and latent semantic information of KGs [Wang et al.2017], which benefits large-scale KG completion.
The above embedding models consider only single triples. In fact, multi-step paths in KGs often play a pivotal role in providing extra relationships between entity pairs. For instance, a path between two entities in the KG can be used to expand the knowledge about the triple fact that corresponds to it.
Both Lin et al. [Lin et al.2015a] and Guu et al. [Guu, Miller, and Liang2015] succeeded in learning entity and relation embeddings on paths. In their work, the relation embeddings are initialized randomly and the path representations are composed via addition, multiplication or recurrent neural networks (RNNs) over the relations along the paths. Since the path representations are obtained purely by numerical calculations in a latent space, such approaches suffer from error propagation and limited accuracy of the path embeddings, which in turn affects the whole representation learning.
An effective way to apply the extra semantic information in KG embedding is to employ the logic rules in view of their accuracy and explainability. KALE [Guo et al.2016] and RUGE [Guo et al.2018] both convert the rules into complex formulae modeled by t-norm fuzzy logics to transfer the knowledge in rules into the learned embeddings. Nevertheless, logic rules could maintain their original explainability only in the symbolic form and such rules actually focus on the semantic associations as well as the constraints of various relations, which have not been well exploited for the triples and paths in KG embedding.
This paper proposes a novel rule-guided compositional representation learning approach named Rule and Path-based Joint Embedding (RPJE), which uses Horn rules to compose paths and to associate relations at the semantic level, improving the precision of learning KG embeddings on paths and enhancing the explainability of representation learning. For the path shown in Figure 1, the rule bodies of two Horn rules of length 2 are matched with two segments of the path, so the rules can be applied iteratively to compose the entire path into a single triple. Then, the rule body of a rule of length 1 is matched with the corresponding relation. As a result, the embeddings of the two relations associated by this rule are further constrained to be closer in the latent space, with confidence level 0.9.
In experiments, we evaluate the proposed model RPJE on four benchmark datasets: FB15K, FB15K-237, WN18 and NELL-995. Experimental results illustrate that our approach achieves superior performance on the KG completion task and significantly outperforms state-of-the-art baselines, which verifies the benefit of combining rules with paths in KG embedding. The main contributions of this work are:
To the best of our knowledge, this is the first attempt to integrate logic rules with paths for KG embedding, endowing our model with both the explainability from semantic level and the generalization from data level.
Our proposed model RPJE considers various types of rules to inject prior knowledge into KG embedding. It uses the encoded length-2 rules to start path composition, rather than starting from randomly initialized vectors, to obtain accurate path representations, and it uses length-1 rules to create semantic associations among relations.
We conduct extensive experiments on KG completion, and our model achieves promising performance. The study of different confidence thresholds shows that considering the confidence of rules guarantees the effectiveness of using rules and makes the model robust across a range of confidence thresholds.
KG embedding models:
In recent years, many works have learned distributed representations for entities and relations in KGs. They fall into three major categories: (1) Translational distance models. Inspired by the translation invariance principle from word embedding [Mikolov, Yih, and Zweig2013], TransE [Bordes et al.2013] regards relations as translating operations between head and tail entities, i.e., h + r ≈ t should be satisfied when the triple (h, r, t) holds. (2) Tensor decomposition models. DistMult [Yang et al.2015] and ComplEx [Trouillon et al.2016] both utilize tensor decomposition to represent each relation as a matrix and each entity as a vector. (3) Neural network models. In NTN [Socher et al.2013], a 3-way tensor and two transfer matrices are encoded into a multilayer neural network. Among these methods, TransE and many of its variants, such as TransH [Wang et al.2014], TransR [Lin et al.2015b] and TransG [Xiao, Huang, and Zhu2016], have become promising approaches for capturing the semantics of KG symbols. However, the above methods merely consider the facts immediately observed in KGs and ignore extra prior knowledge that could enhance KG embedding.
Path enhanced models:
Paths in KGs have gained increasing attention in KG embedding because multi-hop paths can provide relationships between seemingly unconnected entities. The Path Ranking Algorithm (PRA) [Lao, Mitchell, and Cohen2011] is one of the early studies; it searches paths by random walks in KGs and regards the paths as features for a per-target-relation binary classifier. Neelakantan et al. [Neelakantan, Roth, and McCallum2015] proposed a compositional vector space model with a recurrent neural network to model relational paths for knowledge graph completion. Guu et al. [Guu, Miller, and Liang2015] introduced additive and multiplicative interactions between relation matrices in the path. Lin et al. [Lin et al.2015a] proposed PTransE to obtain the path embeddings by composing all the relations in each path. DPTransE [Zhang et al.2018] jointly builds interactions between the latent features and graph features of KGs to offer precise and discriminative embeddings. However, all these techniques obtain the path representations by calculating over relation embeddings along the paths, which limits accuracy and lacks explainability.
Rule extraction and rule enhanced models:
Logic rules are explainable and contain rich semantic information, and they have shown their power in knowledge inference. Inductive Logic Programming (ILP) algorithms such as XHAIL can learn first-order logic (FOL) or even ASP-style rules. However, it is difficult to mine rules from KGs with ILP algorithms because of the open world assumption of KGs (absent information cannot be taken as counterexamples). Therefore, several rule mining methods have been developed to extract rules efficiently from large-scale KGs, including AMIE [Galárraga et al.2013], AMIE+ [Galárraga et al.2015], RLvLR [Omran, Wang, and Wang2018] and CARL [Tanon et al.2018].
To improve both the precision and generalization of KG completion, some recent studies attempt to incorporate rules into KG embedding models [Wang, Wang, and Guo2015]. Minervini et al. [Minervini et al.2017] imposed equivalence and inversion constraints on the relation embeddings. However, this approach considers only two types of constraints between relations rather than general rules, and such constraints might not be available for every KG. In both KALE [Guo et al.2016] and RUGE [Guo et al.2018], the triples are represented as atoms and the rules are modeled by t-norm fuzzy logics, which convert them into complex formulae formed by atoms with logical connectives. However, these two methods quantify the rules within the embedding, which decreases the explainability and accuracy of the rules. By eliminating the complex process of modeling rules as formulae, we explicitly and directly employ Horn rules to deduce the path embeddings and to create semantic associations between relations.
We integrate paths with logic rules to provide more semantic information in our model. The overall framework of the proposed scheme is shown in Figure 2. First, we extract the paths and mine the Horn rules from the KG, where the rules are grouped into length-1 rules and length-2 rules (§3.1). Then, we apply the length-2 rules to iteratively compose paths and the length-1 rules to create semantic associations between some relation pairs (§3.2). Furthermore, vector initialization transforms the entities and relations from the symbolic space into the vector space for training the KG embeddings. Finally, compositional representation learning optimizes an objective specific to triples, paths and associated relation pairs (§3.3, §3.4).
Horn rules can be mined automatically by any KG rule extraction algorithm or tool. In this paper, we first mine rules together with their confidence levels from KGs; a rule with a higher confidence level is more likely to hold. We limit the maximum length of rules to 2 for the efficiency of mining valid rules. Thus, rules are classified into two types according to their length: (1) Length-1 rules, each of which associates the relation in the rule body with the relation in the rule head. (2) Length-2 rules, which can be utilized to compose paths. Some examples of both types are provided in Table 1.
|Length-1 rules with confidence levels|
|0.86 (extracted from NELL-995)|
|1 (extracted from FB15K)|
|Length-2 rules with confidence levels|
|1 (extracted from NELL-995)|
|0.81 (extracted from FB15K)|
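As a minimal sketch of how mined rules and their confidence levels might be held in memory and split by length, consider the following. The class name `HornRule`, the function `split_rules` and all relation names are hypothetical placeholders, and the thresholding step is an assumption about where filtering happens, not the authors' implementation.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class HornRule:
    head: str                # relation in the rule head
    body: Tuple[str, ...]    # relations in the rule body (1 or 2 of them)
    confidence: float        # confidence level reported by the rule miner

    @property
    def length(self) -> int:
        # Rule length = number of relations in the rule body.
        return len(self.body)


def split_rules(rules: List[HornRule], threshold: float):
    """Keep rules at or above the threshold, then split them by length."""
    kept = [r for r in rules if r.confidence >= threshold]
    length1 = [r for r in kept if r.length == 1]
    length2 = [r for r in kept if r.length == 2]
    return length1, length2


# Hypothetical rules; the relation names are placeholders, not mined rules.
rules = [
    HornRule("r_head_a", ("r_body_a",), 0.86),
    HornRule("r_head_b", ("r_body_b1", "r_body_b2"), 0.81),
    HornRule("r_head_c", ("r_body_c1", "r_body_c2"), 0.60),
]
length1, length2 = split_rules(rules, threshold=0.7)
```

With a threshold of 0.7, the rule with confidence 0.60 is dropped and one rule of each length remains.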
Remark: The inverse version of each relation is always added in path-based approaches to constrain each path along one direction and improve the graph connectivity [Zhang et al.2018]. Therefore, given a triple (h, r, t), a reconstructed triple (t, r⁻¹, h) is defined to express the inverse relationship between entity t and entity h.
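The inverse-relation augmentation described in the remark can be sketched as follows. The `_inv` suffix used to name an inverse relation and the function name `add_inverse_triples` are illustrative assumptions.

```python
def add_inverse_triples(triples, suffix="_inv"):
    """Return the triples plus one reconstructed inverse triple for each.

    The naming scheme (relation name + suffix) is a hypothetical convention.
    """
    augmented = []
    for head, relation, tail in triples:
        augmented.append((head, relation, tail))
        # Reconstructed triple expressing the inverse relationship.
        augmented.append((tail, relation + suffix, head))
    return augmented


# Toy triples with placeholder entity and relation names.
triples = [("a", "r1", "b"), ("b", "r2", "c")]
augmented = add_inverse_triples(triples)
```

After augmentation, every edge can be traversed in both directions, so a path can always be written along one direction.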
The Horn rules extracted from KGs are utilized in two modules for compositional representation learning: path composition by the length-2 rules and relation pair association by the length-1 rules.
We first run the path extraction procedure of PTransE [Lin et al.2015a] on the KGs, where each path between an entity pair is extracted together with its reliability, which is obtained by the path-constraint resource allocation mechanism. We generate each path set by selecting the paths between the entity pair whose reliability is over 0.01. To compose paths, the atoms in the body of each length-2 rule must form a sequential path. We therefore define a chain rule as a rule in which the entity pair linked by the chain in the rule body is also connected by the relation in the rule head. However, the rules mined by most existing open-source rule mining systems cannot be utilized directly because they are not chain rules. Therefore, we encode each rule so that its rule body forms a directed path, and we remove the rules that cannot be converted into this form in any case. In total, there are 8 different rule conversion modes, as provided in Table 2. For instance, an original rule can be encoded by first converting one atom of its rule body and then exchanging the two atoms in the rule body to obtain a chain rule, which can be further abbreviated to a relation-level form. Then, a path containing the sequence of relations in the encoded rule body can be composed into the relation in the rule head.
|The original rules||Encoded rules|
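The encoding of a mined length-2 rule into a chain rule can be sketched as below, assuming atoms are given as (relation, subject variable, object variable) tuples and that an atom may be rewritten with an inverse relation, named here with a hypothetical `_inv` suffix. The search tries both atom orders and both orientations of each atom (2 × 2 × 2 = 8 combinations); whether these coincide with the eight conversion modes of Table 2 is an assumption.

```python
def invert(relation):
    """Name of the inverse relation (hypothetical '_inv' suffix convention)."""
    return relation[:-4] if relation.endswith("_inv") else relation + "_inv"


def orientations(atom):
    """An atom r(x, y) and its equivalent inverse form r_inv(y, x)."""
    relation, x, y = atom
    return [(relation, x, y), (invert(relation), y, x)]


def encode_chain_rule(head, body):
    """Rewrite head(a, b) <= atom1 AND atom2 as head <= r1 o r2, if possible.

    Returns (head_relation, (r1, r2)) where r1(a, e) and r2(e, b) form a
    directed chain from a to b, or None if no such chain exists.
    """
    head_relation, a, b = head
    first, second = body
    for atom1, atom2 in ((first, second), (second, first)):
        for r1, x1, y1 in orientations(atom1):
            for r2, x2, y2 in orientations(atom2):
                is_chain = x1 == a and y1 == x2 and y2 == b
                if is_chain and y1 not in (a, b):
                    return head_relation, (r1, r2)
    return None


# Placeholder rule: r3(a, b) <= r2(e, b) AND r1(e, a)
encoded = encode_chain_rule(("r3", "a", "b"), [("r2", "e", "b"), ("r1", "e", "a")])
# A rule whose body cannot form a chain from a to b is rejected.
rejected = encode_chain_rule(("r3", "a", "b"), [("r1", "a", "c"), ("r2", "d", "b")])
```

In the first example, the atom r1(e, a) is converted to its inverse form and the two body atoms are exchanged, giving the chain r1_inv followed by r2.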
To make the best of the encoded rules, we traverse the paths and conduct the composition operation iteratively at the semantic level until no relations can be composed by rules, since two relations are composed at a time and the composition result might itself be composed in the next step. Two scenarios arise in the practical path composition procedure: (1) The optimal scenario, in which all the relations in a path can be iteratively composed by the length-2 rules and finally joined together as a single relation between the entity pair. (2) The general scenario, in which some relations cannot be composed by the length-2 rules; in this case we adopt a numerical operation such as addition on the embeddings of these relations. In addition, when more than one rule matches the path simultaneously, for example when two rules are both activated, the rule with the highest confidence is selected to compose the path. The result of this composition procedure is also taken as the path embedding of the path p.
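The iterative composition procedure can be sketched as follows. The rule table format (a dictionary from a pair of adjacent relations to candidate heads with confidences) and the tie-breaking order are assumptions made for illustration; choosing the single highest-confidence match across the whole path at each step is one possible reading of the selection rule.

```python
def compose_path(path, rules):
    """Iteratively compose adjacent relations with length-2 rules.

    path:  list of relation names along the path.
    rules: dict mapping (r1, r2) -> list of (head_relation, confidence).
    Returns the remaining relation sequence and the confidences of the
    rules that were applied.
    """
    relations = list(path)
    used_confidences = []
    while len(relations) > 1:
        best = None  # (confidence, position, head_relation)
        for i in range(len(relations) - 1):
            pair = (relations[i], relations[i + 1])
            for head, confidence in rules.get(pair, []):
                if best is None or confidence > best[0]:
                    best = (confidence, i, head)
        if best is None:
            break  # no rule matches any adjacent pair
        confidence, i, head = best
        relations[i:i + 2] = [head]  # replace the matched pair by the head
        used_confidences.append(confidence)
    return relations, used_confidences


# Hypothetical encoded rules with placeholder relation names.
rules = {
    ("r1", "r2"): [("r4", 0.9), ("r5", 0.7)],
    ("r4", "r3"): [("r6", 0.8)],
}
full, full_conf = compose_path(["r1", "r2", "r3"], rules)
partial, partial_conf = compose_path(["r1", "r2", "r7"], rules)
```

The first path illustrates the optimal scenario, where everything collapses into a single relation. The second illustrates the general scenario, where the leftover relations would be combined numerically in the embedding space.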
For the length-1 rules, a relation is likely to be semantically more similar to the relation that it directly implies when the rule holds. Such rules need to be encoded for representation learning. Hence, in the training process, the embeddings of a pair of relations that appear together in a length-1 rule are constrained to be closer than the embeddings of two relations that do not match any rule.
Following the strategy of translation-based algorithms, for each triple we define three energy functions that respectively model the correlations of the direct triple, as in typical translation-based methods, of the path pair using the length-2 rules, and of the relation pair employing the length-1 rules.
The first energy function takes a lower score if the triple holds. The second energy function evaluates the similarity between a path and the relation; it involves the reliability of the path between the entity pair, which is calculated in the same way as in PTransE [Lin et al.2015a], the composition result of the path, which is obtained by the path composition procedure explained in §3.2, and the set of confidence levels of all the length-2 rules employed in composing the path. The third energy function indicates the similarity between the relation and another relation, and is assigned a lower score if the other relation is implied by the relation through a length-1 rule. Here h, r and t are the embeddings of the head entity, relation and tail entity, respectively.
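One plausible instantiation of the three energy functions is sketched below. It assumes an L1 dissimilarity, addition for relations left uncomposed, and that the path reliability and the confidences of the applied rules enter the path energy as multiplicative weights. These choices, and all function names, are assumptions for illustration rather than the exact form of Eqs. 1 to 3.

```python
def vec_add(u, v):
    return [a + b for a, b in zip(u, v)]


def vec_sub(u, v):
    return [a - b for a, b in zip(u, v)]


def l1_norm(v):
    return sum(abs(x) for x in v)


def triple_energy(h, r, t):
    """Translation-based energy: small when h + r is close to t."""
    return l1_norm(vec_sub(vec_add(h, r), t))


def path_embedding(relations, embeddings):
    """Add the embeddings of the relations left after rule composition."""
    total = [0.0] * len(embeddings[relations[0]])
    for relation in relations:
        total = vec_add(total, embeddings[relation])
    return total


def path_energy(path_vec, r, reliability, rule_confidences):
    """Distance between the composed path and r, weighted (by assumption)
    by the path reliability and the confidences of the applied rules."""
    weight = reliability
    for confidence in rule_confidences:
        weight *= confidence
    return weight * l1_norm(vec_sub(path_vec, r))


def relation_pair_energy(r, r_e):
    """Distance between a relation and a relation associated with it."""
    return l1_norm(vec_sub(r, r_e))


# Toy 2-dimensional embeddings with placeholder relation names.
embeddings = {"r4": [1.0, 0.0], "r7": [0.0, 0.5]}
p = path_embedding(["r4", "r7"], embeddings)
e_triple = triple_energy([0.0, 1.0], [1.0, 0.0], [1.0, 1.0])
e_path = path_energy(p, [1.0, 0.0], reliability=0.5, rule_confidences=[0.5])
e_pair = relation_pair_energy([1.0, 0.0], [0.5, 0.0])
```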
Under the open world assumption [Drumond, Rendle, and Schmidt-Thieme2012], we introduce a pairwise ranking loss function to formalize the optimization objective of RPJE for training.
In Eq. 4, the objective sums, over the positive triples, the loss of the direct triple, the weighted losses of all the paths linking the entity pair, and the weighted losses of all the relations deduced from the relation on the basis of the length-1 rules. The three loss terms are margin-based loss functions built on the energy functions in Eqs. 1, 2 and 3; they measure the effectiveness of representation learning with regard to the direct triple, the path pair and the relation pair, respectively.
In Eqs. 5, 6 and 7, the hinge function returns the maximum of 0 and its argument, and each loss has its own positive margin hyper-parameter. The weight of triples is fixed to 1, and two hyper-parameters respectively weight the influence of the paths and of the relation pair embedding constraint. The confidence level of the length-1 rule associating a relation pair enters the corresponding loss; the confidence levels of all the rules are treated as penalty coefficients in optimization. The positive set contains all the positive triples observed in the KG. Following the negative sampling method in [Bordes, Weston, and Bengio2014], the negative set contains the triples reconstructed by randomly replacing the entities and relations of positive triples, with the triples that already exist in the positive set removed.
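The margin-based ranking losses and their weighted combination can be sketched as follows. The argument `pos_weight`, which scales the positive energy by a rule confidence, and the names `alpha_path` and `alpha_rel` for the two weighting hyper-parameters are assumptions for illustration; the exact placement of the confidence in the loss is not confirmed by the text.

```python
def hinge(x):
    """Maximum of 0 and x."""
    return max(0.0, x)


def margin_loss(pos_energy, neg_energies, margin, pos_weight=1.0):
    """Pairwise ranking loss of one positive item against its negatives.

    pos_weight is a hypothetical penalty coefficient (e.g. a rule confidence)
    applied to the positive energy.
    """
    return sum(hinge(margin + pos_weight * pos_energy - neg)
               for neg in neg_energies)


def joint_loss(triple_loss, path_losses, relation_losses,
               alpha_path, alpha_rel):
    """Triple loss (weight 1) plus weighted path and relation-pair losses."""
    return (triple_loss
            + alpha_path * sum(path_losses)
            + alpha_rel * sum(relation_losses))


# Toy energies chosen so that the arithmetic is exact.
l_triple = margin_loss(0.5, [2.0, 1.0], margin=1.0)
l_pair = margin_loss(0.5, [1.0], margin=1.0, pos_weight=0.5)
total = joint_loss(l_triple, [0.25], [0.125, 0.125],
                   alpha_path=1.0, alpha_rel=2.0)
```

In the toy example, the negative with energy 2.0 already lies beyond the margin and contributes nothing, while the negative with energy 1.0 contributes 0.5.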
To solve the optimization, we utilize mini-batch stochastic gradient descent (SGD). For training efficiency, paths are limited to at most 3 steps.
We evaluate our model on four typical datasets: FB15K and FB15K-237, both extracted from the large-scale Freebase [Bollacker, Gottlob, and Flesca2008], WN18, extracted from WordNet [Miller1995], and NELL-995, extracted from NELL [Mitchell et al.2018]. Note that FB15K-237 contains no inverse relations, which makes it hard to learn embeddings from its mutually independent relations, so FB15K and FB15K-237 are always regarded as two distinct datasets. Statistics of the datasets are shown in Table 3. We evaluate our approach and the baselines on the KG completion task, which is formulated as link prediction and relation prediction: link prediction aims to complete a triple with one entity missing, while relation prediction aims to predict the relation given the head and tail entities.
Our scheme can readily incorporate any rule mining tool. We choose AMIE+ [Galárraga et al.2015] for its convenience and speed in mining rich rules with an adjustable confidence threshold on different databases. The confidence thresholds of rules are selected in the range of [0,1] with step size 0.1 to search for the best performance of rules on each dataset. Table 4 lists the statistics of the encoded rules mined from FB15K, FB15K-237, WN18 and NELL-995 under various confidence thresholds.
|Datasets||Rule Types||Various Confidence Thresholds|
Three principal assessment metrics are used: the mean rank of correct entities (MR), the mean reciprocal rank of correct entities (MRR) and the proportion of test triples for which the correct entity is ranked in the top n predictions (Hits@n). A better evaluation result has lower MR, higher MRR and higher Hits@10. Moreover, the “filtered” setting eliminates the reconstructed triples that can be observed in the KG, whereas the “raw” setting does not. To compute these metrics, we define a score function for the reconstructed triples.
As shown in Eq.9, the length-2 rules are also utilized for composing paths in the testing process. We rank the scores in descending order.
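Given the rank of the correct entity for each test triple, the three metrics and the difference between the raw and filtered settings can be sketched as follows. The function names and the toy candidate lists are illustrative assumptions; the candidates are assumed to be already sorted from best to worst by the score function.

```python
def rank_of_correct(ranked_candidates, correct, known_true=frozenset()):
    """1-based rank of `correct` in a best-first candidate list.

    Candidates in `known_true` (other correct answers observed in the KG)
    are skipped, which gives the filtered setting; an empty set gives the
    raw setting.
    """
    rank = 1
    for candidate in ranked_candidates:
        if candidate == correct:
            return rank
        if candidate not in known_true:
            rank += 1
    raise ValueError("correct candidate is not in the ranked list")


def evaluate(ranks, n=10):
    """Mean rank, mean reciprocal rank and Hits@n over a list of ranks."""
    count = len(ranks)
    mean_rank = sum(ranks) / count
    mean_reciprocal_rank = sum(1.0 / r for r in ranks) / count
    hits_at_n = sum(1 for r in ranks if r <= n) / count
    return mean_rank, mean_reciprocal_rank, hits_at_n


# Toy example with placeholder entity names.
raw_rank = rank_of_correct(["x", "y", "z"], "z")
filtered_rank = rank_of_correct(["x", "y", "z"], "z", known_true={"x"})
mr, mrr, hits = evaluate([1, 2, 4], n=2)
```

In the toy example, filtering out the known-true candidate "x" improves the rank of "z" from 3 to 2.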
To verify the performance of our approach, we select several related state-of-the-art models for KG completion, covering three types of baselines: (1) Embedding methods only considering triple facts: TransE [Bordes et al.2013], TransH [Wang et al.2014], TransR [Lin et al.2015b], STransE [Nguyen et al.2016], TransG [Xiao, Huang, and Zhu2016], TEKE [Wang and Li2016], R-GCN+ [Michael et al.2018], KB-LRN [Garciaduran and Niepert2017], ConvE [Dettmers et al.2018]. (2) Path-based models: PTransE [Lin et al.2015a] and DPTransE [Zhang et al.2018]. (3) Rule enhanced models: KALE [Guo et al.2016] and RUGE [Guo et al.2018]. We use the best results presented in their original papers and also run PTransE and RUGE from their source code.
To guarantee a fair comparison, we adopt the following evaluation settings: (1) 100 mini-batches are created on each dataset. (2) The entity and relation embeddings are initialized randomly and limited to unit vectors. (3) Following the same configurations as many prevailing baselines, the learning rate is chosen as 0.001, the embedding dimension is set to 50 for WN18, since only 18 relations exist in WN18, and 100 for the other three datasets, and the number of training epochs is set to 500. In addition, we employ a grid search to select the other hyper-parameters: we tune the margin and the two weight coefficients, and select the best models on the validation sets.
In this subsection, we experimentally examine the impact of two important parameters in our scheme, namely the confidence levels of the rules and the number of path steps. Our model is robust to noisy rules because rules with low confidence are filtered out by an appropriate confidence threshold. For instance, with the confidence threshold 0.7, a rule with confidence 0.96 is used in path composition, but a rule with confidence 0.6 is removed. Selecting a confidence threshold is a trade-off between higher confidence and more rules. We investigate the influence on performance by varying the confidence threshold in the range of [0.1, 1.0] with step 0.1. From Figure 3, we can observe that the confidence thresholds of 0.7 and 0.8 achieve the best trade-offs. Furthermore, RPJE outperforms PTransE for confidence thresholds in the broad range of [0.4, 1.0], which illustrates that the rules exploited in our model are effective as long as the confidence threshold is selected in a moderate range. In particular, RPJE-min obtains worse performance with lower confidence thresholds because the larger number of incorrect rules employed reduces the accuracy of representation learning.
Additionally, we compare the performance under path length limits of 2 and 3 steps. Figure 3 illustrates the link prediction results on FB15K for different confidence thresholds achieved by RPJE-S2 (2-step paths), RPJE-S3 (3-step paths), RPJE-min (RPJE ignoring the confidence of rules) and PTransE. These results verify the contribution of the confidence levels to representation learning. Under the same configurations, RPJE employing paths of at most 2 steps consistently outperforms RPJE with paths of at most 3 steps. The reason might be that longer paths cause lower accuracy in path composition, which will be studied in future work. Therefore, we select the confidence threshold 0.7 and paths of at most 2 steps as the best setting in the following results.
|Models||Head Entities Prediction (Hits@10)||Tail Entities Prediction (Hits@10)|
In this section, we first evaluate the proposed RPJE against a variety of baselines on link prediction and relation prediction on FB15K. From Table 5, it can be observed that: (1) RPJE outperforms the other baselines, and most of the improvements are statistically significant. This demonstrates that RPJE learns more reasonable embeddings for KGs by using logic rules in conjunction with paths. (2) In particular, RPJE outperforms PTransE on each metric, which indicates the benefit of introducing logic rules to provide higher accuracy in path composition and to learn better path embeddings. (3) Compared to the rule-based baselines KALE and RUGE, RPJE obtains improvements of 56.0%/6.3% on MRR and 18.5%/4.4% on Hits@10 (filtered), which demonstrates the effectiveness of explicitly employing rules to preserve more semantic information and of further integrating paths.
Table 6 shows the evaluation results of predicting entities by various types of relations. We can observe that: (1) RPJE outperforms all baselines significantly and consistently across all the relation categories. Compared to the best-performing baseline TransG, RPJE achieves an average improvement of 4.2% in head entity prediction and 6.5% in tail entity prediction. (2) More interestingly, on the two toughest tasks, predicting head entities of N-1 relations and predicting tail entities of 1-N relations, our approach achieves its largest improvements over the best baselines, approximately 12.6% and 13.4%, respectively.
The results of relation prediction are shown in Table 7. Three typical models representing the three types of baselines are implemented. The results illustrate that RPJE outperforms the baselines on all metrics. This verifies that paths provide extra relationships for entity pairs and that rules further create semantic associations between relations, which improves the relation embeddings and benefits relation prediction.
Furthermore, we conduct experiments on the dataset FB15K-237. Since FB15K-237 was constructed relatively recently, only a minority of existing works report evaluation results on this dataset and can be selected as baselines. As shown in Table 8, RPJE obtains the best performance, with approximately 29.5% improvement over PTransE on MRR and 26.8% improvement over KB-LRN on Hits@10. Although no inverse relations can be observed in FB15K-237, Horn rules provide significant supplements for building semantic associations between relations.
We also test the models on the datasets WN18 and NELL-995. Three types of typical models are selected as baselines. Few rules can be mined from WN18 because of its extremely limited number of relations, and only very sparse paths can be extracted from NELL-995 because entities far outnumber relations in this dataset. Even so, our model RPJE achieves consistent and significant improvements over the other baselines, as shown in Table 9. This illustrates the superiority of our approach for representation learning on various large-scale KGs. On the other hand, the performance gains on FB15K are larger than those on WN18, because more rules provide more semantic information for RPJE to use. As can be expected, RPJE obtains better performance on datasets that yield more rules and paths.
To verify the effectiveness of the different components of RPJE, we conduct an ablation study of link prediction on FB15K by removing, in turn, the paths together with the length-2 rules (-PaRu2) and the rules of length 1 (-Ru1) from the integrated model. More specifically, -PaRu2 removes the path-related energy function and loss in Eqs. 2 and 6, and -Ru1 removes the relation-pair energy function and loss in Eqs. 3 and 7. As shown in Table 10, removing either component leads to performance degradation, and removing the paths and the rules of length 2 changes the results most significantly.
As shown in Figure 4, for a relation prediction task with a given head entity and tail entity, the result is obtained by our model RPJE. In particular, this result can be explained by RPJE with paths and rules: for the 2-step path, a rule with confidence 0.81 is activated to compose the path into the prediction result, which also provides the confidence level 0.81 for this result. For the 3-step path, an intermediate composition result is obtained by applying the length-2 rules and is further employed to reach the relation prediction result via embeddings on the reconstructed path that contains the intermediate relation.
In this paper, we proposed a novel model, RPJE, that learns KG embeddings by integrating triple facts, Horn rules and paths in a unified framework to enhance the accuracy and the explainability of representation learning. The experimental results on KG completion verified that rules with confidence levels are significant for improving the accuracy of composing paths and enhancing the association between relations.
For future work, we will investigate other potential composition operations, such as Long Short-Term Memory (LSTM) networks with an attention mechanism, which may benefit long paths. We will also explore pushing the embedding information back from RPJE to rule learning with a well-designed closed-loop system.
This work was partially supported by the National Natural Science Foundation of China (No. 61772054), the NSFC Key Project (No. 61632001), the Natural Science Foundation of China (No. 61572493) and the Fundamental Research Funds for the Central Universities.
[Wang et al.2014] Wang, Z.; Zhang, J.; Feng, J.; and Chen, Z. 2014. Knowledge graph embedding by translating on hyperplanes. In AAAI 2014, 1112–1119.