AAAI 2020 - InteractE: Improving Convolution-based Knowledge Graph Embeddings by Increasing Feature Interactions
Most existing knowledge graphs suffer from incompleteness, which can be alleviated by inferring missing links based on known facts. One popular way to accomplish this is to generate low-dimensional embeddings of entities and relations, and use these to make inferences. ConvE, a recently proposed approach, applies convolutional filters on 2D reshapings of entity and relation embeddings in order to capture rich interactions between their components. However, the number of interactions that ConvE can capture is limited. In this paper, we analyze how increasing the number of these interactions affects link prediction performance, and utilize our observations to propose InteractE. InteractE is based on three key ideas: feature permutation, a novel feature reshaping, and circular convolution. Through extensive experiments, we find that InteractE outperforms state-of-the-art convolutional link prediction baselines on FB15k-237. Further, InteractE achieves an MRR score that is 9%, 7.5%, and 23% better than ConvE on the FB15k-237, WN18RR and YAGO3-10 datasets respectively. The results validate our central hypothesis that increasing feature interaction is beneficial to link prediction performance. We make the source code of InteractE available to encourage reproducible research.
Knowledge graphs (KGs) are structured representations of facts, where nodes represent entities and edges represent relationships between them. This can be represented as a collection of triples $(s, r, o)$, each representing a relation $r$ between a "subject-entity" $s$ and an "object-entity" $o$. Some real-world knowledge graphs include Freebase [freebase], WordNet [wordnet], YAGO [yago], and NELL [nell]. KGs find application in a variety of tasks, such as relation extraction [distant_supervision2009], question answering [qa_kg_1, qa_kg_2], recommender systems [kb-recommender] and dialog systems [kg_in_dialog].
However, most existing KGs are incomplete [kg_incomp1]. The task of link prediction alleviates this drawback by inferring missing facts based on the known facts in a KG. A popular approach for solving this problem involves learning a low-dimensional representation for all entities and relations and utilizing them to predict new facts. In general, most existing link prediction methods learn to embed KGs by optimizing a score function which assigns higher scores to true facts than invalid ones. These score functions can be classified as translation distance based [transe, transg, transh] or semantic matching based [hole, analogy].
Recently, neural networks have also been utilized to learn the score function [neural_tensor_network, chandrahas2017, conve]. The motivation behind these approaches is that shallow methods like TransE [transe] and DistMult [distmult] are limited in their expressiveness. As noted in [conve], the only way to remedy this is to increase the size of their embeddings, which leads to an enormous increase in the number of parameters and hence limits their scalability to larger knowledge graphs.
Convolutional Neural Networks (CNN) have the advantage of using multiple layers, thus increasing their expressive power, while at the same time remaining parameter-efficient. [conve] exploit these properties and propose ConvE - a model which applies convolutional filters on stacked 2D reshapings of entity and relation embeddings. Through this, they aim to increase the number of interactions between components of these embeddings.
In this paper, we conclusively establish that increasing the number of such interactions is beneficial to link prediction performance, and show that the number of interactions that ConvE can capture is limited. We propose InteractE, a novel CNN based KG embedding approach which aims to further increase the interaction between relation and entity embeddings. Our contributions are summarized as follows:
We propose InteractE, a method that augments the expressive power of ConvE through three key ideas – feature permutation, "checkered" feature reshaping, and circular convolution.
We provide a precise definition of an interaction, and theoretically analyze InteractE to show that it increases interactions compared to ConvE. Further, we establish a correlation between the number of heterogeneous interactions (refer to Def. 4.2) and link prediction performance.
Through extensive evaluation on various link prediction datasets, we demonstrate InteractE’s effectiveness (Section 9).
We have made available the source code of InteractE and datasets used in the paper as supplementary material.
Non-Neural: Starting with TransE [transe], there have been multiple proposed approaches that use simple operations like dot products and matrix multiplications to compute a score function. Most approaches embed entities as vectors, whereas for relations, both vector [transe, hole] and matrix [distmult, analogy] representations have been explored. For modeling uncertainty of learned representations, Gaussian distributions [gaussian_kg, transg] have also been utilized. Methods like TransE [transe] and TransH [transh] utilize a translational objective for their score function, while DistMult [distmult] and ComplEx [complex] use a bilinear diagonal based model.
Neural Network based: Recently, Neural Network (NN) based score functions have also been proposed. Neural Tensor Network [neural_tensor_network] combines entity and relation embeddings by a relation-specific tensor which is given as input to a non-linear hidden layer for computing the score. [kg_incomp1, chandrahas2017] also utilize a Multi-Layer Perceptron for modeling the score function.
Convolution based: Convolutional Neural Networks (CNN) have also been employed for embedding Knowledge Graphs. ConvE [conve] uses convolutional filters over reshaped subject and relation embeddings to compute an output vector and compares this with all other entities in the knowledge graph. [sacn_paper] propose ConvTransE, a variant of the ConvE score function. They eschew 2D reshaping in favor of directly applying convolution on the stacked subject and relation embeddings. Further, they propose SACN, which utilizes weighted graph convolution along with ConvTransE.
ConvKB [convkb] is another convolution based method which applies convolutional filters of width 1 on the stacked subject, relation and object embeddings for computing score. As noted in [sacn_paper], although ConvKB was claimed to be superior to ConvE, its performance is not consistent across different datasets and metrics. Further, there have been concerns raised about the validity of its evaluation procedure (https://openreview.net/forum?id=HkgEQnRqYQ&noteId=HklyVUAX2m). Hence, we do not compare against it in this paper. A survey of all variants of existing KG embedding techniques can be found in [survey2016nickel, survey2017].
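KG Link Prediction: Given a Knowledge Graph (KG) $\mathcal{G} = (\mathcal{E}, \mathcal{R}, \mathcal{T})$, where $\mathcal{E}$ and $\mathcal{R}$ denote the set of entities and relations, and $\mathcal{T}$ denotes the triples (facts) of the form $\{(s, r, o)\} \subset \mathcal{E} \times \mathcal{R} \times \mathcal{E}$, the task of link prediction is to predict new facts $(s', r, o')$ such that $s', o' \in \mathcal{E}$ and $r \in \mathcal{R}$, based on the existing facts in KG. Formally, the task can be modeled as a ranking problem, where the goal is to learn a function $\psi(s, r, o) : \mathcal{E} \times \mathcal{R} \times \mathcal{E} \to \mathbb{R}$ which assigns higher scores to true or likely facts than invalid ones.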
Most existing KG embedding approaches define an encoding for all entities and relations, i.e., $e_s, e_r \;\; \forall s \in \mathcal{E}, r \in \mathcal{R}$. Then, a score function $\psi(s, r, o)$ is defined to measure the validity of triples. Table 1 lists some of the commonly used score functions. Finally, to learn the entity and relation representations, an optimization problem is solved for maximizing the plausibility of the triples in the KG.
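To make the notion of a score function concrete, here is a minimal sketch of two widely used shallow scorers, TransE (translation distance) and DistMult (bilinear diagonal), in their commonly published forms. The function names, the choice of the L1 norm for TransE, and the toy embeddings are illustrative assumptions and are not taken from this paper.

```python
import numpy as np

def transe_score(e_s, e_r, e_o):
    """Translation-distance score: -||e_s + e_r - e_o||_1 (higher is more plausible)."""
    return -np.linalg.norm(e_s + e_r - e_o, ord=1)

def distmult_score(e_s, e_r, e_o):
    """Bilinear-diagonal score: sum_i e_s[i] * e_r[i] * e_o[i]."""
    return float(np.sum(e_s * e_r * e_o))

# Toy embeddings (assumption: 8 dimensions, random values).
rng = np.random.default_rng(0)
e_s, e_r = rng.normal(size=8), rng.normal(size=8)
e_true = e_s + e_r            # an object that satisfies the translation exactly
e_false = rng.normal(size=8)  # an unrelated object

# A true fact should outscore a corrupted one under the translation objective.
print(transe_score(e_s, e_r, e_true), transe_score(e_s, e_r, e_false))
```

Note that the DistMult form is symmetric in subject and object, which is one reason such shallow scorers are considered limited in expressiveness.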
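ConvE: In this paper, we build upon ConvE [conve], which models interaction between entities and relations using 2D Convolutional Neural Networks (CNN). The score function used is defined as follows:
$$\psi(s, r, o) = f(\mathrm{vec}(f([\overline{e_s}; \overline{e_r}] \star w)) W)\, e_o,$$
where $\overline{e_s} \in \mathbb{R}^{d_w \times d_h}$, $\overline{e_r} \in \mathbb{R}^{d_w \times d_h}$ denote 2D reshapings of $e_s \in \mathbb{R}^{d_w d_h \times 1}$, $e_r \in \mathbb{R}^{d_w d_h \times 1}$, and $(\star)$ denotes the convolution operation. The 2D reshaping enhances the interaction between entity and relation embeddings, which has been found to be helpful for learning better representations [hole].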
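Let $e_s, e_r \in \mathbb{R}^d$, where $d = d_w d_h$, be an entity and a relation embedding respectively, and let $w \in \mathbb{R}^{k \times k}$ be a convolutional kernel of size $k$. Further, we define that a matrix $M_k \in \mathbb{R}^{k \times k}$ is a $k$-submatrix of another matrix $N$ if $\exists\, i, j$ such that $M_k = N_{i:i+k,\; j:j+k}$. We denote this by $M_k \subseteq N$.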
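Definition 4.1 (Reshaping Function). A reshaping function $\phi : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}^{m \times n}$ transforms embeddings $e_s$ and $e_r$ into a matrix $\phi(e_s, e_r)$, where $m \times n = 2d$. For conciseness, we abuse notation and represent $\phi(e_s, e_r)$ by $\phi$. We define three types of reshaping functions.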
Stack ($\phi_{stk}$) reshapes each of $e_s$ and $e_r$ into a matrix of shape $(m/2) \times n$, and stacks them along their height to yield an $m \times n$ matrix (Fig. 2a). This is the reshaping function used in [conve].
Alternate ($\phi_{alt}^{\tau}$) reshapes $e_s$ and $e_r$ into matrices of shape $(m/2) \times n$, and stacks $\tau$ rows of $e_s$ and $e_r$ alternately. In other words, as we decrease $\tau$, the "frequency" with which rows of $e_s$ and $e_r$ alternate increases. We denote $\phi_{alt}^{1}$ as $\phi_{alt}$ for brevity (Fig. 2b).
Chequer ($\phi_{chk}$) arranges $e_s$ and $e_r$ such that no two adjacent cells are occupied by components of the same embedding (Fig. 2c).
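The three layouts can be sketched in a few lines of NumPy. This is an illustrative reading of the descriptions above, not the authors' implementation: the function names are hypothetical, the output height is assumed to be even with the alternation period dividing half of it, and "adjacent" is assumed to mean sharing an edge (a chessboard pattern).

```python
import numpy as np

def stack(e_s, e_r, m, n):
    """Stack layout: e_s fills the top m/2 rows, e_r the bottom m/2 rows."""
    return np.vstack([e_s.reshape(m // 2, n), e_r.reshape(m // 2, n)])

def alternate(e_s, e_r, m, n, tau=1):
    """Alternate layout: blocks of tau rows from e_s and e_r take turns."""
    s, r = e_s.reshape(m // 2, n), e_r.reshape(m // 2, n)
    blocks = []
    for start in range(0, m // 2, tau):
        blocks += [s[start:start + tau], r[start:start + tau]]
    return np.vstack(blocks)

def chequer(e_s, e_r, m, n):
    """Chequer layout: chessboard pattern, so edge-adjacent cells differ in source."""
    rows, cols = np.indices((m, n))
    from_s = (rows + cols) % 2 == 0
    out = np.empty((m, n), dtype=e_s.dtype)
    out[from_s] = e_s   # filled in row-major order
    out[~from_s] = e_r
    return out

# Mark components of e_s with 1 and components of e_r with 0 to see each layout.
e_s, e_r = np.ones(8), np.zeros(8)
for layout in (stack(e_s, e_r, 4, 4), alternate(e_s, e_r, 4, 4), chequer(e_s, e_r, 4, 4)):
    print(layout, end="\n\n")
```

With the largest valid alternation period (half the height), the alternate layout reduces to the stack layout, which is a convenient sanity check.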
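Definition 4.2 (Interaction). An interaction is defined as a triple $(x, y, M_k)$, such that $M_k \subseteq \phi(e_s, e_r)$ is a $k$-submatrix of the reshaped input embeddings; $x, y \in M_k$ and are distinct components of $e_s$ or $e_r$. The number of interactions $\mathcal{N}(\phi, k)$ is defined as the cardinality of the set of all possible triples. Note that $\phi$ can be replaced with $\Omega(\phi)$ for some padding function $\Omega$.
An interaction $(x, y, M_k)$ is called heterogeneous if $x$ and $y$ are components of $e_s$ and $e_r$ respectively, or vice-versa. Otherwise, it is called homogeneous. We denote the number of heterogeneous and homogeneous interactions as $\mathcal{N}_{het}(\phi, k)$ and $\mathcal{N}_{homo}(\phi, k)$ respectively. For example, in a $3 \times 3$ matrix $M_3$, if there are 5 components of $e_s$ and 4 of $e_r$, then the number of heterogeneous and homogeneous interactions are: $\mathcal{N}_{het} = 2(5 \times 4) = 40$, and $\mathcal{N}_{homo} = 2\left[\binom{5}{2} + \binom{4}{2}\right] = 32$. Please note that the sum of total number of heterogeneous and homogeneous interactions in a reshaping function is constant and is equal to $2\binom{k^2}{2}$, i.e., $\mathcal{N}_{het}(\phi, k) + \mathcal{N}_{homo}(\phi, k) = 2\binom{k^2}{2}$.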
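As a quick check of this counting rule, the sketch below counts interactions inside a single $k \times k$ window. It assumes, as an interpretation of the definition, that pairs are ordered, so each unordered pair of distinct cells contributes two interactions; `window_interactions` is a hypothetical helper name.

```python
import numpy as np

def window_interactions(from_s):
    """Return (heterogeneous, homogeneous) counts for one k x k window.

    `from_s` is a boolean array: True where the cell holds a component of the
    entity embedding, False where it holds a component of the relation embedding.
    """
    a = int(np.sum(from_s))    # cells coming from the entity embedding
    b = from_s.size - a        # cells coming from the relation embedding
    het = 2 * a * b                        # ordered mixed pairs
    homo = a * (a - 1) + b * (b - 1)       # ordered same-source pairs
    return het, homo

# A 3 x 3 chessboard window: 5 cells from one embedding and 4 from the other.
rows, cols = np.indices((3, 3))
het, homo = window_interactions((rows + cols) % 2 == 0)
print(het, homo, het + homo)
```

Under this convention the two counts always add up to $k^2(k^2 - 1)$ for a window, so increasing heterogeneous interactions necessarily reduces homogeneous ones.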
Recent methods [distmult, hole] have demonstrated that expressiveness of a model can be enhanced by increasing the possible interactions between embeddings. ConvE [conve] also exploits the same principle albeit in a limited way, using convolution on 2D reshaped embeddings. InteractE extends this notion of capturing entity and relation feature interactions using the following three ideas:
Feature Permutation: Instead of using one fixed order of the input, we utilize multiple permutations to capture more possible interactions.
Checkered Reshaping: We substitute simple feature reshaping of ConvE with checkered reshaping and prove its superiority over other possibilities.
Circular Convolution: Compared to the standard convolution, circular convolution allows more feature interactions to be captured, as depicted in Figure 3. The convolution is performed in a depth-wise manner [depthwise_convolution] on different input permutations.
In this section, we provide a detailed description of the various components of InteractE. The overall architecture is depicted in Fig. 1. InteractE learns a $d$-dimensional vector representation ($e_s, e_r \in \mathbb{R}^d$) for each entity and relation in the knowledge graph, where $d = d_w d_h$.
To capture a variety of heterogeneous interactions, InteractE first generates $t$ random permutations of both $e_s$ and $e_r$, denoted by $\mathcal{P}_t = [(e_s^1, e_r^1); \ldots; (e_s^t, e_r^t)]$. Note that with high probability, the sets of interactions within $\phi(e_s^i, e_r^i)$ for different $i$ are disjoint. This is evident because the number of distinct interactions across all possible permutations is very large. So, for $t$ different permutations, we can expect the total number of interactions to be approximately $t$ times the number of interactions for one permutation.
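A minimal sketch of the permutation step follows. It assumes the index permutations are drawn once and then reused for every input, which seems the natural reading but is not stated explicitly here; `make_permutations` and `permute` are hypothetical names.

```python
import numpy as np

def make_permutations(d, t, seed=0):
    """Draw t fixed index permutations for entity and relation embeddings."""
    rng = np.random.default_rng(seed)
    idx_s = np.stack([rng.permutation(d) for _ in range(t)])
    idx_r = np.stack([rng.permutation(d) for _ in range(t)])
    return idx_s, idx_r          # each has shape (t, d)

def permute(e_s, e_r, idx_s, idx_r):
    """Return the t permuted copies of each embedding, shape (t, d) each."""
    return e_s[idx_s], e_r[idx_r]

idx_s, idx_r = make_permutations(d=6, t=3)
e_s, e_r = np.arange(6.0), np.arange(6.0) + 10
perm_s, perm_r = permute(e_s, e_r, idx_s, idx_r)
print(perm_s.shape, perm_r.shape)
```

Each row keeps the same components but places them in different cells after reshaping, which is what exposes new pairs to the same small kernel.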
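Next, we apply the reshaping operation $\phi(e_s^i, e_r^i)$ to each permuted pair, $\forall i \in \{1, \ldots, t\}$, and define $\phi(\mathcal{P}_t) = [\phi(e_s^1, e_r^1); \ldots; \phi(e_s^t, e_r^t)]$. ConvE [conve] uses $\phi_{stk}(\cdot)$ as a reshaping function, which has limited interaction capturing ability. On the basis of Proposition 7.3, we choose to utilize $\phi_{chk}(\cdot)$ as the reshaping function in InteractE, which captures maximum heterogeneous interactions between entity and relation features.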
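Motivated by our analysis in Proposition 7.4, InteractE uses circular convolution, which further increases interactions compared to the standard convolution. This has been successfully applied for tasks like image recognition [omnidirectionalwang2018]. Circular convolution on a 2-dimensional input $I \in \mathbb{R}^{m \times n}$ with a filter $w \in \mathbb{R}^{k \times k}$ is defined as:
$$[I \star w]_{p,q} = \sum_{i=-\lfloor k/2 \rfloor}^{\lfloor k/2 \rfloor} \; \sum_{j=-\lfloor k/2 \rfloor}^{\lfloor k/2 \rfloor} I_{[p-i]_m,\,[q-j]_n} \, w_{i,j},$$
where $[x]_n$ denotes $x$ modulo $n$ and $\lfloor \cdot \rfloor$ denotes the floor function.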
InteractE stacks each reshaped permutation as a separate channel. For convolving permutations, we apply circular convolution in a depth-wise manner [depthwise_convolution]. Although different sets of filters can be applied for each permutation, in practice we find that sharing filters across channels works better as it allows a single set of kernel weights to be trained on more input instances.
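The wrap-around behaviour can be illustrated with a small NumPy sketch. It assumes an odd kernel size and an index convention in which output cell $(p, q)$ reads input cell $((p - i) \bmod m, (q - j) \bmod n)$ for kernel offset $(i, j)$; the function names are hypothetical, and the same filter is shared across channels as described above.

```python
import numpy as np

def circular_conv2d(x, w):
    """Circular convolution of an m x n input with an odd k x k filter."""
    h = w.shape[0] // 2
    out = np.zeros_like(x, dtype=float)
    for i in range(-h, h + 1):
        for j in range(-h, h + 1):
            # np.roll wraps around: rolled[p, q] == x[(p - i) % m, (q - j) % n]
            out += w[i + h, j + h] * np.roll(x, shift=(i, j), axis=(0, 1))
    return out

def depthwise_circular_conv(channels, w):
    """Apply one shared filter to every channel (one reshaped permutation each)."""
    return np.stack([circular_conv2d(c, w) for c in channels])

# A single active cell in the corner spreads to the opposite edges via wrap-around.
x = np.zeros((4, 4))
x[0, 0] = 1.0
out = circular_conv2d(x, np.ones((3, 3)))
print(out)
```

With zero padding the corner cell would only influence its immediate in-bounds neighbours; here it also reaches the last row and last column, which is the extra interaction the text refers to.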
The output of each circular convolution is flattened and concatenated into a vector. InteractE then projects this vector to the embedding space ($\mathbb{R}^d$). Formally, the score function used in InteractE is defined as follows:
$$\psi(s, r, o) = g(\mathrm{vec}(f(\phi(\mathcal{P}_t) \circledast w)) W)\, e_o,$$
where $\circledast$ denotes depth-wise circular convolution, $\mathrm{vec}(\cdot)$ denotes vector concatenation, $e_o$ represents the object entity embedding matrix and $W$ is a learnable weight matrix. Functions $f$ and $g$ are chosen to be ReLU and sigmoid respectively. For training, we use the standard binary cross entropy loss with label smoothing.
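The training objective can be sketched as follows. The text only says "binary cross entropy with label smoothing"; the particular smoothing rule below (shrink the hard labels by a factor and spread the removed mass uniformly over all candidate entities) is an assumption, as are the function names and the smoothing strength.

```python
import numpy as np

def smooth_labels(y, eps):
    """Soften 0/1 targets: (1 - eps) * y + eps / number_of_candidates."""
    return (1.0 - eps) * y + eps / y.shape[-1]

def bce_loss(p, y):
    """Mean binary cross entropy between probabilities p and targets y."""
    p = np.clip(p, 1e-7, 1 - 1e-7)   # avoid log(0)
    return float(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))

# One (subject, relation) query scored against 4 candidate objects.
y = np.array([1.0, 0.0, 0.0, 0.0])          # only candidate 0 is a true object
p = np.array([0.999, 0.001, 0.001, 0.001])  # a very confident prediction
print(bce_loss(p, y), bce_loss(p, smooth_labels(y, eps=0.1)))
```

Smoothed targets penalize extreme confidence, so the second loss is larger than the first for this prediction.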
In this section, we analyze multiple variants of 2D reshaping with respect to the number of interactions they induce. We also examine the advantages of using circular padded convolution over the standard convolution.
For simplicity, we restrict our analysis to the case where the output of the reshaping function is a square matrix, i.e., $m = n$. Note that our results can be extended to the general case as well. Proofs of all propositions herein are included in the supplementary material.
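Proposition 7.1. For any kernel $w$ of size $k$ and for all $n$ above a threshold (one bound applies if $k$ is odd and another if $k$ is even), the following statement holds:
$$\mathcal{N}_{het}(\phi_{alt}, k) \ge \mathcal{N}_{het}(\phi_{stk}, k)$$
Proposition 7.2. For any kernel $w$ of size $k$ and for all $\tau < \tau'$ ($\tau, \tau' \in \mathbb{N}$), the following statement holds:
$$\mathcal{N}_{het}(\phi_{alt}^{\tau}, k) \ge \mathcal{N}_{het}(\phi_{alt}^{\tau'}, k)$$
Proposition 7.3. For any kernel $w$ of size $k$ and for all reshaping functions $\phi : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}^{n \times n}$, the following statement holds:
$$\mathcal{N}_{het}(\phi_{chk}, k) \ge \mathcal{N}_{het}(\phi, k)$$
Proposition 7.4. Let $\Omega_0$, $\Omega_c : \mathbb{R}^{n \times n} \to \mathbb{R}^{(n+p) \times (n+p)}$ denote zero padding and circular padding functions respectively, for some $p > 0$. Then for any reshaping function $\phi$,
$$\mathcal{N}_{het}(\Omega_c(\phi), k) \ge \mathcal{N}_{het}(\Omega_0(\phi), k)$$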
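The four statements above can be spot-checked numerically on a small instance. The sketch below is not a proof and rests on several assumptions: heterogeneous interactions are counted as ordered pairs within each window, padded zero cells belong to neither embedding, padding width is half the kernel size on every side, and the example uses an $8 \times 8$ layout with a $3 \times 3$ kernel. The names `layout` and `n_het` are hypothetical.

```python
import numpy as np

def layout(kind, n, tau=1):
    """n x n source map: +1 marks an entity component, -1 a relation component."""
    rows, cols = np.indices((n, n))
    if kind == "stk":
        from_s = rows < n // 2
    elif kind == "alt":                      # requires tau to divide n // 2
        from_s = (rows // tau) % 2 == 0
    else:                                    # "chk"
        from_s = (rows + cols) % 2 == 0
    return np.where(from_s, 1, -1)

def n_het(src, k, pad=None):
    """Heterogeneous interactions over all k x k windows (pad: None, 'zero', 'circular')."""
    if pad is not None:
        mode = "wrap" if pad == "circular" else "constant"
        src = np.pad(src, k // 2, mode=mode)  # constant mode pads with 0
    m, n = src.shape
    total = 0
    for i in range(m - k + 1):
        for j in range(n - k + 1):
            win = src[i:i + k, j:j + k]
            total += 2 * int((win == 1).sum()) * int((win == -1).sum())
    return total

n, k = 8, 3
counts = {
    "stk": n_het(layout("stk", n), k),
    "alt (tau=2)": n_het(layout("alt", n, tau=2), k),
    "alt (tau=1)": n_het(layout("alt", n, tau=1), k),
    "chk": n_het(layout("chk", n), k),
    "chk, zero pad": n_het(layout("chk", n), k, pad="zero"),
    "chk, circular pad": n_het(layout("chk", n), k, pad="circular"),
}
for name, value in counts.items():
    print(f"{name:>18}: {value}")
```

On this instance the counts increase from stack to alternate to chequer, and circular padding yields more heterogeneous interactions than zero padding, in line with the ordering the propositions describe.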
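Table 2 (InteractE row): link prediction results on FB15k-237, WN18RR, and YAGO3-10.
| Method | FB15k-237 MRR | FB15k-237 MR | FB15k-237 H@10 | FB15k-237 H@1 | WN18RR MRR | WN18RR MR | WN18RR H@10 | WN18RR H@1 | YAGO3-10 MRR | YAGO3-10 MR | YAGO3-10 H@10 | YAGO3-10 H@1 |
| InteractE (Proposed Method) | .354 | 172 | .535 | .263 | .463 | 5202 | .528 | .430 | .541 | 2375 | .687 | .462 |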
In our experiments, following [conve, rotate], we evaluate on the three most commonly used link prediction datasets. Summary statistics of the datasets are presented in Table 3.
FB15k-237 [toutanova] is an improved version of the FB15k [transe] dataset where all inverse relations are deleted to prevent direct inference of test triples by reversing training triples.
WN18RR [conve] is a subset of WN18 [transe] derived from WordNet [wordnet], with deleted inverse relations similar to FB15k-237.
YAGO3-10 is a subset of YAGO3 [yago] which consists of entities with at least 10 relations. Triples consist of descriptive attributes of people.
We use the filtered setting, i.e., while evaluating on test triples, we filter out all the valid triples from the candidate set, which is generated by either corrupting the head or tail entity of a triple. The performance is reported on the standard evaluation metrics: Mean Reciprocal Rank (MRR), Mean Rank (MR), Hits@1, and Hits@10. We report results averaged across multiple runs. The variance is very low on all the metrics and is hence omitted.
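The filtered ranking protocol and the reported metrics can be sketched as below. The function names and the toy scores are illustrative assumptions; ties are resolved optimistically here (only strictly higher scores count against the target), which is one of several possible conventions.

```python
import numpy as np

def filtered_rank(scores, target, known_true):
    """Rank of `target` among candidates after removing other known-true entities."""
    scores = scores.astype(float).copy()
    others = np.array([e for e in known_true if e != target], dtype=int)
    scores[others] = -np.inf                     # filter out other valid answers
    return int(np.sum(scores > scores[target])) + 1

def summarize(ranks, ks=(1, 10)):
    """Aggregate ranks into MRR, MR and Hits@k."""
    ranks = np.asarray(ranks, dtype=float)
    metrics = {"MRR": float(np.mean(1.0 / ranks)), "MR": float(np.mean(ranks))}
    for k in ks:
        metrics[f"Hits@{k}"] = float(np.mean(ranks <= k))
    return metrics

# Scores for 4 candidate entities; entities 0 and 2 are both valid answers.
scores = np.array([0.9, 0.8, 0.7, 0.1])
rank = filtered_rank(scores, target=2, known_true={0, 2})
print(rank, summarize([1, 2, 4]))
```

Without filtering, entity 2 would be ranked third because another correct answer (entity 0) scores higher; the filtered setting avoids penalizing the model for that.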
In our experiments, we compare InteractE against a variety of baselines which can be categorized as:
Non-neural: Methods that use simple vector based operations for computing score. For instance, DistMult [distmult], ComplEx [complex], KBGAN [kbgan], KBLRN [kblrn] and RotatE [rotate].
Neural: Methods which leverage a non-linear neural network based architecture in their scoring function. This includes R-GCN [r_gcn], ConvE [conve], ConvTransE [sacn_paper], and SACN [sacn_paper].
In this section, we attempt to answer the questions below:
How does InteractE perform in comparison to the existing approaches? (Section 9.1)
What is the effect of different feature reshaping and circular convolution on link prediction performance? (Section 9.2)
How does the performance of our model vary with the number of feature permutations? (Section 9.3)
What is the performance of InteractE on different relation types? (Section 9.4)
In order to evaluate the effectiveness of InteractE, we compare it against the existing knowledge graph embedding methods listed in Section 8.3. The results on three standard link prediction datasets are summarized in Table 2. The scores of all the baselines are taken directly from the values reported in the papers [conve, rotate, sacn_paper, kbgan, kblrn]. Since our model builds on ConvE, we specifically compare against it, and find that InteractE outperforms ConvE on all metrics for FB15k-237 and WN18RR and on three out of four metrics on YAGO3-10. On average, InteractE obtains an improvement of 9%, 7.5%, and 23% on FB15k-237, WN18RR, and YAGO3-10 on MRR over ConvE. This validates our hypothesis that increasing heterogeneous interactions helps improve performance on link prediction. For YAGO3-10, we observe that the MR obtained from InteractE is worse than that of ConvE, although it outperforms ConvE on all other metrics. A similar trend has been observed in [conve, rotate].
Compared to other baseline methods, InteractE outperforms them on FB15k-237 across all the metrics and on 3 out of 4 metrics on the YAGO3-10 dataset. The below-par performance of InteractE on WN18RR can be attributed to the fact that this dataset is more suitable for shallow models, as it has very low average relation-specific in-degree. This is consistent with the observations of [conve].
In this section, we empirically test the effectiveness of the different reshaping techniques we analyzed in Section 7. For this, we evaluate different variants of InteractE on validation data of FB15k-237 and WN18RR with the number of feature permutations set to 1. We omit the analysis on YAGO3-10 given its large size. The results are summarized in Figure 4. We find that the performance with stacked reshaping is the worst, and it improves when we replace it with alternate reshaping. This observation is consistent with our findings in Proposition 7.1. Further, we find that MRR improves on decreasing the value of $\tau$ in alternate reshaping, which empirically validates Proposition 7.2. Finally, we observe that checkered reshaping gives the best performance across all reshaping functions for most scenarios, thus justifying Proposition 7.3.
We also compare the impact of using circular and standard convolution on link prediction performance. The MRR scores are reported in Figure 4. The results show that circular convolution is consistently better than the standard convolution. This also verifies our statement in Proposition 7.4. Overall, we find that increasing interaction helps improve performance on the link prediction task, thus validating the central thesis of our paper.
In this section, we analyze the effect of increasing the number of feature permutations on InteractE’s performance on validation data of FB15k-237, WN18RR, and YAGO3-10. The overall results are summarized in Figure 5. We observe that as the number of permutations increases, MRR remains the same on FB15k-237 but improves on the WN18RR and YAGO3-10 datasets. However, it degrades as the number of permutations is increased beyond a certain limit. We hypothesize that this is due to over-parameterization of the model. Moreover, since the number of relevant interactions is finite, increasing the number of permutations could become redundant beyond a limit.
In this section, we analyze the performance of InteractE on different relation categories of FB15k-237. We chose FB15k-237 for analysis over other datasets because of its larger and more diverse set of relations. Following [kg_relation_cat], we classify the relations based on the average number of tails per head and heads per tail into four categories: one-to-one, one-to-many, many-to-one, and many-to-many. The results are presented in Table 4. Overall, we find that InteractE is effective at modeling complex relation types like one-to-many and many-to-many, whereas RotatE captures simple relations like one-to-one better. This indicates that an increase in interaction allows the model to capture more complex relationships.
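The relation categorization can be sketched as follows. The text specifies only that relations are grouped by average tails per head and heads per tail; the cut-off of 1.5 used below is a commonly used convention and is an assumption here, as are the function name and the toy triples.

```python
from collections import defaultdict

def categorize_relations(triples, threshold=1.5):
    """Map each relation to one-to-one, one-to-many, many-to-one or many-to-many."""
    tails, heads = defaultdict(set), defaultdict(set)
    for s, r, o in triples:
        tails[(r, s)].add(o)   # distinct tails seen for each (relation, head)
        heads[(r, o)].add(s)   # distinct heads seen for each (relation, tail)

    def averages(groups):
        sizes = defaultdict(list)
        for (r, _), members in groups.items():
            sizes[r].append(len(members))
        return {r: sum(v) / len(v) for r, v in sizes.items()}

    tph, hpt = averages(tails), averages(heads)  # tails per head, heads per tail
    names = {(False, False): "one-to-one", (True, False): "one-to-many",
             (False, True): "many-to-one", (True, True): "many-to-many"}
    return {r: names[(tph[r] >= threshold, hpt[r] >= threshold)] for r in tph}

triples = [
    ("paris", "capital_of", "france"), ("rome", "capital_of", "italy"),
    ("ann", "has_child", "bob"), ("ann", "has_child", "cat"),
    ("bob", "born_in", "oslo"), ("cat", "born_in", "oslo"),
]
categories = categorize_relations(triples)
print(categories)
```

Per-category metrics are then obtained by evaluating only the test triples whose relation falls into each category.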
In this paper, we propose InteractE, a novel knowledge graph embedding method which alleviates the limitations of ConvE by capturing additional heterogeneous feature interactions. InteractE is able to achieve this by utilizing three central ideas, namely feature permutation, checkered feature reshaping, and circular convolution. Through extensive experiments, we demonstrate that InteractE achieves a consistent improvement on link prediction performance on multiple datasets. We also theoretically analyze the effectiveness of the components of InteractE, and provide empirical validation of our hypothesis that increasing heterogeneous feature interaction is beneficial for link prediction with ConvE. This work demonstrates a possible scope for improving existing knowledge graph embedding methods by leveraging rich heterogeneous interactions.