TransINT: Embedding Implication Rules in Knowledge Graphs with Isomorphic Intersections of Linear Subspaces

07/01/2020 ∙ by So Yeon Min, et al. ∙ MIT ∙ IBM

Knowledge Graphs (KG), composed of entities and relations, provide a structured representation of knowledge. For easy access to statistical approaches on relational data, multiple methods to embed a KG into f(KG) ∈ R^d have been introduced. We propose TransINT, a novel and interpretable KG embedding method that isomorphically preserves the implication ordering among relations in the embedding space. Given implication rules, TransINT maps sets of entities (tied by a relation) to continuous sets of vectors that are inclusion-ordered isomorphically to relation implications. With a novel parameter sharing scheme, TransINT enables automatic training on missing but implied facts without rule grounding. On a benchmark dataset, we outperform the best existing state-of-the-art rule integration embedding methods by significant margins in link prediction and triple classification. The angles between the continuous sets embedded by TransINT provide an interpretable way to mine semantic relatedness and implication rules among relations.




1 Introduction

Learning distributed vector representations of multi-relational knowledge is an active area of research Bordes et al. (2013); Nickel et al. (2011); Kazemi and Poole (2018b); Wang et al. (2014); Bordes et al. (2011). These methods map components of a KG (entities and relations) to elements of $\mathbb{R}^d$ and capture statistical patterns, regarding vectors close in distance as representing similar concepts. One focus of current research is to bring logical rules to KG embeddings Guo et al. (2016); Wang et al. (2015a); Wei et al. (2015). While existing methods impose hard geometric constraints and embed asymmetric orderings of knowledge Nickel and Kiela (2017); Vendrov et al. (2015); Vilnis et al. (2018), many of them only embed hierarchy (unary Is_a relations), and cannot embed binary or n-ary relations in KGs. On the other hand, other methods that integrate binary and n-ary rules Guo et al. (2016); Fatemi et al. (2018); Rocktäschel et al. (2015); Demeester et al. (2016) do not bring significant enough performance gains.

We propose TransINT, a new and extremely powerful KG embedding method that isomorphically preserves the implication ordering among relations in the embedding space. Given pre-defined implication rules, TransINT restricts entities tied by a relation to be embedded to vectors in a particular region of $\mathbb{R}^d$, with regions included in one another isomorphically to the order of relation implication. For example, we map any entities tied by is_father_of to vectors in a region that is part of the region for is_parent_of; thus, we automatically know that if John is a father of Tom, he is also his parent, even if such a fact is missing in the KG. Such embeddings are constructed by sharing and rank-ordering the bases of the linear subspaces to which the vectors are required to belong.

Mathematically, a relation can be viewed as sets of entities tied by a constraint Stoll (1979). We take such a view on KG’s, since it gives consistency and interpretability to model behavior. We show that angles between embedded relation sets can identify semantic patterns and implication rules - an extension of the line of thought as in word/ image embedding methods such as Mikolov et al. (2013), Frome et al. (2013) to relational embedding.

The main contributions of our work are: (1) A novel KG embedding such that implication rules in the original KG are guaranteed to unconditionally, not approximately, hold. (2) Our model suggests possibilities of learning semantic relatedness between groups of objects. (3) We significantly outperform state-of-the-art rule integration embedding methods, Guo et al. (2016) and Fatemi et al. (2018), on two benchmark datasets, FB122 and NELL Sport/ Location.

Figure 1: Two equivalent ways of expressing relations. (a): relations defined in a hypothetical KG. (b): relations defined in a set-theoretic perspective (Definition 1). Because is_father_of $\Rightarrow$ is_parent_of, the set for is_father_of is a subset of that for is_parent_of (Definition 2).

2 TransINT

In this section, we describe the intuition and justification of our method. We first define relations as sets, and revisit TransH Wang et al. (2014) as mapping relations to sets in $\mathbb{R}^d$. Finally, we propose TransINT. We mark the definitions and theorems that we propose or introduce; otherwise, we use existing definitions and cite them.

2.1 Sets as Relations

We define relations as sets and implication as inclusion of sets, as in set-theoretic logic.

Definition (Relation Set): Let $r_i$ be a binary relation and $x, y$ entities. Then, a set $R_i$ such that $r_i(x, y)$ if and only if $(x, y) \in R_i$ always exists Stoll (1979). We call $R_i$ the relation set of $r_i$.

For example, consider the distinct relations in Figure 1a, and their corresponding sets in Figure 1b; Is_Father_Of(Tom, Harry) is equivalent to $(Tom, Harry) \in R_{Is\_Father\_Of}$.

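Definition (Logical Implication): For two relations, $r_1$ implies $r_2$ (or $r_1 \Rightarrow r_2$) iff $\forall x, y$, $(x, y) \in R_1 \Rightarrow (x, y) \in R_2$, or equivalently, $R_1 \subset R_2$.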

For example, Is_Father_Of $\Rightarrow$ Is_Parent_Of. (In Figure 1b, $R_{Is\_Father\_Of} \subset R_{Is\_Parent\_Of}$.)
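The two definitions above can be illustrated with ordinary Python sets. This is only a minimal sketch on a made-up toy KG: the entity names, the `relation_sets` dictionary, and the helper names `holds` and `implies` are hypothetical and are not part of the paper.

```python
# Hypothetical toy KG: each relation set holds the (head, tail) pairs it ties.
relation_sets = {
    "is_father_of": {("Tom", "Harry")},
    "is_mother_of": {("Sue", "Harry")},
    "is_parent_of": {("Tom", "Harry"), ("Sue", "Harry")},
}

def holds(relation, head, tail):
    """r(x, y) is true iff (x, y) is a member of the relation set of r."""
    return (head, tail) in relation_sets[relation]

def implies(r1, r2):
    """r1 => r2 iff the relation set of r1 is included in that of r2."""
    return relation_sets[r1] <= relation_sets[r2]

father_implies_parent = implies("is_father_of", "is_parent_of")
parent_implies_father = implies("is_parent_of", "is_father_of")
```

In this toy KG the implication holds in one direction only, which mirrors the subset relation drawn in Figure 1b.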

Figure 2: Two perspectives of viewing TransH in $\mathbb{R}^3$; the order of operations can be flipped. (The orange dot is the origin, to emphasize that translated vectors are equivalent.) (a): projection first, then difference - first projecting $\vec{h}$ and $\vec{t}$ onto $H_j$, and then requiring $\vec{h}_\perp + \vec{r}_j \approx \vec{t}_\perp$. (b): difference first, then projection - first subtracting $\vec{h}$ from $\vec{t}$, and then projecting the difference ($\vec{t} - \vec{h}$) onto $H_j$ and requiring $(\vec{t} - \vec{h})_\perp \approx \vec{r}_j$. All such $\vec{t} - \vec{h}$ belong to the red line, which is unique once $\vec{t} - \vec{h}$ is translated to the origin.

2.2 Background: TransH

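Given a fact triple $(h, r_j, t)$ in a KG (e.g. (Harry, is_father_of, Tom)), TransH maps each entity to a vector, and each relation $r_j$ to a relation-specific hyperplane $H_j$ and a fixed vector $\vec{r}_j$ on $H_j$ (Figure 2a). For each fact triple $(h, r_j, t)$, TransH wants

$$\vec{h}_\perp + \vec{r}_j \approx \vec{t}_\perp \qquad (1)$$

where $\vec{h}_\perp, \vec{t}_\perp$ are the projections of $\vec{h}, \vec{t}$ onto $H_j$ (Figure 2a).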

Revisiting TransH

We interpret TransH from a novel perspective. An equivalent way to put Eq. 1 is to change the order of subtraction and projection (Figure 2b):

$$\text{Projection of } (\vec{t} - \vec{h}) \text{ onto } H_j \approx \vec{r}_j. \qquad (2)$$
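The equivalence of the two views follows from the linearity of projection. The snippet below is a small numerical check under assumed conventions: the hyperplane passes through the origin and is described by a unit normal `w` (the usual TransH parameterization), and the helper name `project_onto_hyperplane` is hypothetical.

```python
import numpy as np

def project_onto_hyperplane(v, w):
    """Project v onto the hyperplane through the origin with unit normal w."""
    return v - np.dot(w, v) * w

rng = np.random.default_rng(0)
w = rng.normal(size=3)
w /= np.linalg.norm(w)  # unit normal of the hyperplane
h, t = rng.normal(size=3), rng.normal(size=3)

# View (a): project h and t first, then take the difference.
projection_first = project_onto_hyperplane(t, w) - project_onto_hyperplane(h, w)
# View (b): take the difference first, then project it.
difference_first = project_onto_hyperplane(t - h, w)

views_agree = bool(np.allclose(projection_first, difference_first))
```

Because both orders give the same vector, requiring it to be close to the fixed relation vector is the same constraint in either view.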

This means that all entity vectors $(\vec{h}, \vec{t})$ whose difference $\vec{t} - \vec{h}$ belongs to the red line are considered to be tied by relation $r_j$ (Figure 2b); the red line is the set of all vectors whose projection onto $H_j$ is the fixed vector $\vec{r}_j$. Thus, upon a deeper look, TransH actually embeds a relation set in the KG (Figure 1b) to a particular set in $\mathbb{R}^d$. We call such sets relation spaces for now; in other words, the relation space of some relation $r_i$ is the space where the $\vec{t} - \vec{h}$ of each triple $(h, r_i, t)$ can exist. We formally visit it later in Section 3.1. Thus, in TransH, a relation $r_i(h, t)$ in the KG corresponds to $\vec{t} - \vec{h}$ lying in the relation space of $r_i$ in $\mathbb{R}^d$.

Figure 3: Two perspectives of viewing TransINT. (a): TransINT as TransH with additional constraints - by intersecting the $H$’s and projecting the $\vec{r}$’s. The dotted orange lines are the projection constraint. (b): TransINT as a mapping of sets (relations in KGs) into linear subspaces (viewing TransINT in the relation space (Figure 2b)). The blue line, the red line, and the green plane are respectively the relation spaces of is_father_of, is_mother_of and is_parent_of - where the $\vec{t} - \vec{h}$’s of $(h, t)$ tied by these relations can exist. The blue and the red lines lie on the green plane - is_parent_of’s relation space includes the other two.

2.3 TransINT

We propose TransINT, which, given pre-defined implication rules, guarantees isomorphic ordering of relations in the embedding space. Like TransH, TransINT embeds a relation $r_j$ to a (subspace, vector) pair ($H_j$, $\vec{r}_j$). However, TransINT modifies the relation embeddings ($H_j$, $\vec{r}_j$) so that the relation spaces (i.e. the red line of Figure 2b) are ordered by implication; we do so by intersecting the $H$’s and projecting the $\vec{r}$’s (Figure 3a). We explain with familial relations as a running example.

Intersecting the $H$’s

TransINT assigns distinct hyperplanes $H_{is\_father\_of}$ and $H_{is\_mother\_of}$ to is_father_of and is_mother_of. However, because is_parent_of is implied by the aforementioned relations, we assign

$$H_{is\_parent\_of} = H_{is\_father\_of} \cap H_{is\_mother\_of}.$$

TransINT’s $H_{is\_parent\_of}$ is not a hyperplane but a linear subspace of rank 2 (Figure 3a), unlike in TransH where all $H$’s are hyperplanes (whose ranks are 1).

Projecting the $\vec{r}$’s

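TransINT constrains the $\vec{r}$’s with projections (Figure 3a’s dotted orange lines). First, $\vec{r}_{is\_father\_of}$ and $\vec{r}_{is\_mother\_of}$ are required to have the same projection onto $H_{is\_parent\_of}$. Second, $\vec{r}_{is\_parent\_of}$ is that same projection onto $H_{is\_parent\_of}$.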

We connect the two above constraints to ordering relation spaces. Figure 3b graphically illustrates that is_parent_of’s relation space (green plane) includes those of is_father_of (blue line) and is_mother_of (red line). More generally, TransINT requires the following.

For distinct relations $r_i, r_j$, require the following if and only if $r_i \Rightarrow r_j$:
Intersection Constraint: $H_j = H_i \cap H_j$.
Projection Constraint: the projection of $\vec{r}_i$ onto $H_j$ is $\vec{r}_j$.
Here $H_i, H_j$ are distinct and $\vec{r}_i, \vec{r}_j$ are distinct.

We prove that these two constraints guarantee that an ordering isomorphic to implication holds in the embedding space: ($r_i \Rightarrow r_j$) iff ($r_i$’s rel. space $\subset$ $r_j$’s rel. space), or equivalently, ($R_i \subset R_j$) iff ($r_i$’s rel. space $\subset$ $r_j$’s rel. space).
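The familial running example can be checked numerically in three dimensions. The sketch below is an illustration under assumed values, not the authors' implementation: the normals `a_f` and `a_m`, the concrete relation vectors, and the helper `projection_matrix` are all hypothetical choices that happen to satisfy the two constraints.

```python
import numpy as np

def projection_matrix(normals):
    """Projection onto the subspace orthogonal to the columns of `normals`."""
    A = np.asarray(normals, dtype=float)
    return np.eye(A.shape[0]) - A @ np.linalg.inv(A.T @ A) @ A.T

a_f = np.array([1.0, 0.0, 0.0])  # normal of H_father (hypothetical)
a_m = np.array([0.0, 1.0, 0.0])  # normal of H_mother (hypothetical)

P_father = projection_matrix(a_f[:, None])
P_mother = projection_matrix(a_m[:, None])
# Intersection Constraint: H_parent is the intersection of H_father and H_mother.
P_parent = projection_matrix(np.column_stack([a_f, a_m]))

# Projection Constraint: r_father and r_mother share the projection r_parent.
r_parent = np.array([0.0, 0.0, 2.0])
r_father = r_parent + np.array([0.0, 0.5, 0.0])   # lies on H_father
r_mother = r_parent + np.array([-0.3, 0.0, 0.0])  # lies on H_mother

# Points of is_father_of's relation space: r_father plus any multiple of a_f.
rng = np.random.default_rng(0)
samples = [r_father + s * a_f for s in rng.normal(size=5)]
in_father_space = all(np.allclose(P_father @ x, r_father) for x in samples)
in_parent_space = all(np.allclose(P_parent @ x, r_parent) for x in samples)
```

Every sampled point of the is_father_of relation space also satisfies the is_parent_of condition, which is the inclusion the two constraints are meant to guarantee.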

3 TransINT’s Isomorphic Guarantee

In this section, we formally state TransINT’s isomorphic guarantee. We denote all matrices with capital letters (e.g. $A$) and vectors with arrows on top (e.g. $\vec{b}$).

3.1 Projection and Relation Space

In $\mathbb{R}^d$, there is a bijection between each linear subspace $H$ and a projection matrix $P$; Strang (2006). A random point $\vec{a} \in \mathbb{R}^d$ is projected onto $H$ iff multiplied by $P$; i.e. $P\vec{a} \in H$. In the rest of the paper, we denote $P$ (or $P_i$) as the projection matrix onto a linear subspace $H$ (or $H_i$). Now, we formally define a general concept that subsumes relation space (Figure 3b).

Definition: Let $H$ be a linear subspace and $P$ its projection matrix. Then, given $\vec{k}$ on $H$, the set of vectors that become $\vec{k}$ when projected onto $H$, or the solution space of $P\vec{x} = \vec{k}$, is denoted as $Sol(P, \vec{k})$.

With this definition, the relation space (Figure 3b) is $Sol(P_i, \vec{r}_i)$, where $P_i$ is the projection matrix of $H_i$ (the subspace for relation $r_i$); it is the set of points $\vec{t} - \vec{h}$ such that $P_i(\vec{t} - \vec{h}) = \vec{r}_i$.

3.2 Isomorphic Guarantees

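Main Theorem 1 (Isomorphism): Let $\{(H_i, \vec{r}_i)\}_n$ be the (subspace, vector) embeddings assigned to relations $\{R_i\}_n$ by the Intersection Constraint and the Projection Constraint; $P_i$ the projection matrix of $H_i$. Then, $(\{Sol(P_i, \vec{r}_i)\}_n, \subset)$ is isomorphic to $(\{R_i\}_n, \subset)$.

In actual optimization, TransINT requires something less strict than $P_i(\vec{t} - \vec{h}) = \vec{r}_i$:

$$\| P_i(\vec{t} - \vec{h}) - \vec{r}_i \|_2 < \epsilon,$$

for some non-negative and small $\epsilon$. This bounds $\vec{t} - \vec{h}$ to regions with thickness $2\epsilon$, centered around $Sol(P_i, \vec{r}_i)$ (Figure 4). We prove that isomorphism still holds with this weaker requirement.

Definition: Given any $P$ and $\vec{k}$, the solution space of $\| P\vec{x} - \vec{k} \|_2 < \epsilon$ (where $\epsilon \geq 0$) is denoted as $Sol_\epsilon(P, \vec{k})$.

Main Theorem 2 (Margin-aware Isomorphism): $\forall \epsilon \geq 0$, $(\{Sol_\epsilon(P_i, \vec{r}_i)\}_n, \subset)$ is isomorphic to $(\{R_i\}_n, \subset)$.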
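To make the exact and margin-aware solution spaces concrete, here is a small membership test. It is a sketch under assumptions: the function name `in_sol`, the choice of projecting onto the third axis of a 3-dimensional space, and the sample points are all hypothetical; the exact case is tested with a floating-point tolerance.

```python
import numpy as np

def in_sol(P, k, x, eps=None):
    """Membership in the exact solution space of P x = k (eps=None),
    or in the relaxed region ||P x - k||_2 < eps (eps given)."""
    residual = float(np.linalg.norm(P @ x - k))
    if eps is None:
        return bool(np.isclose(residual, 0.0))
    return residual < eps

# Hypothetical subspace: the third coordinate axis of a 3-dimensional space.
P = np.diag([0.0, 0.0, 1.0])
k = np.array([0.0, 0.0, 2.0])

x_exact = np.array([5.0, -1.0, 2.0])   # projects exactly onto k
x_near = np.array([5.0, -1.0, 2.05])   # slightly off, within a margin of 0.1
x_far = np.array([5.0, -1.0, 3.0])     # outside the margin
```

Here `x_near` fails the exact test but passes the relaxed one with `eps=0.1`, while `x_far` fails both; this is the "thick" region around the exact relation space described above.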

Figure 4: Fig. 3(b)’s relation spaces when $\| P_i(\vec{t} - \vec{h}) - \vec{r}_i \|_2 < \epsilon$ is required. (a): Each relation space now becomes a region with thickness $2\epsilon$, centered around Figure 3(b)’s relation space. (b): Relationship between the angle and the area of overlap between two relation spaces. With respect to the green region, the nearly perpendicular cylinder overlaps much less with it than the other cylinder, whose angle is much closer.

4 Initialization and Training

The intersection and projection constraints can be imposed with parameter sharing.

4.1 Parameter Sharing Initialization

From initialization, we bind parameters so that they satisfy the two constraints. For each entity $e_j$, we assign a $d$-dimensional vector $\vec{e}_j$. To each $R_i$, we assign $(H_i, \vec{r}_i)$ (or $(P_i, \vec{r}_i)$) with parameter sharing. Please see Appendix B for the definitions of head/parent/child relations. We first construct the $H$’s.

Intersection constraint

Each subspace can be uniquely defined by its orthogonal subspace. We define the orthogonal subspaces of the $H$’s top-down. To every head relation $R_h$, assign a $d$-dimensional vector $\vec{a}_h$ as an orthogonal subspace for $H_h$, making $H_h$ a hyperplane. Then, to each $R_i$ that is not a head, additionally assign a new $d$-dimensional vector $\vec{a}_i$ linearly independent of the bases of all of its parents. Then, the basis of the orthogonal subspace for $H_i$ becomes $[\vec{a}_h, ..., \vec{a}_p, \vec{a}_i]$ where $\vec{a}_h, ..., \vec{a}_p$ are the vectors assigned to $R_i$’s parent relations. Projection matrices can be uniquely constructed given the bases $[\vec{a}_h, ..., \vec{a}_p, \vec{a}_i]$ Strang (2006). Now, we initialize the $\vec{r}$’s.
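The top-down construction can be sketched with NumPy as follows. This is an illustrative sketch rather than the authors' code: the helper `projection_from_normals`, the dimension `d = 5`, and the random vectors are assumptions; the projector onto the orthogonal complement is built with the pseudo-inverse, which is one standard way to do it.

```python
import numpy as np

def projection_from_normals(normals):
    """Projection matrix onto the subspace orthogonal to the given vectors."""
    A = np.column_stack(normals)
    return np.eye(A.shape[0]) - A @ np.linalg.pinv(A)

rng = np.random.default_rng(0)
d = 5
a_head = rng.normal(size=d)   # vector assigned to a head relation
a_child = rng.normal(size=d)  # new vector assigned to a non-head relation

P_head = projection_from_normals([a_head])            # a hyperplane
P_child = projection_from_normals([a_head, a_child])  # shares a_head with its parent

# Sharing a_head makes the child's subspace lie inside the parent's hyperplane.
nested = bool(np.allclose(P_child @ P_head, P_child))
dim_head = int(round(np.trace(P_head)))    # trace of a projector = subspace dimension
dim_child = int(round(np.trace(P_child)))
```

Since the child's orthogonal basis contains the parent's vector, the inclusion of subspaces holds by construction and does not have to be learned.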

Projection Constraint

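To the head relation $R_h$, pick a random vector and assign $\vec{r}_h$ from it. To each non-head $R_i$ whose parent is $R_p$, assign $\vec{r}_i$ from $\vec{r}_p$ and some random vector. This results in the Projection Constraint holding for any (parent, child) pair.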

Parameters to be trained

Such initialization leaves the following parameters given a KG with entities $e_j$’s and relations $r_i$’s: (1) a $d$-dimensional vector ($\vec{a}_h$) for the head relation, (2) a $d$-dimensional vector ($\vec{a}_i$) for each non-head relation, (3) a $d$-dimensional vector for each head and non-head relation, from which its $\vec{r}$ is constructed, (4) a $d$-dimensional vector $\vec{e}_j$ for each entity $e_j$. TransH and TransINT both assign two $d$-dimensional vectors for each relation and one $d$-dimensional vector for each entity; thus, TransINT has the same number of parameters as TransH.

4.2 Training

We construct negative examples (wrong fact triplets) and train with a margin-based loss, following the same protocols as in TransE and TransH.

Training Objective

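We adopt the same loss function as in TransH. For each fact triplet $(h, r_i, t)$, we define the score function

$$f(h, r_i, t) = \| P_i(\vec{t} - \vec{h}) - \vec{r}_i \|_2$$

and train a margin-based loss $L$:

$$L = \sum_{(h, r_i, t) \in G} \max\big(0,\; f(h, r_i, t) + \gamma - f(h', r_i', t')\big),$$

where $G$ is the set of all triples in the KG, $\gamma$ is the margin, and $(h', r_i', t')$ is a negative triple made from corrupting $(h, r_i, t)$. We minimize this objective with stochastic gradient descent.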
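A margin-based objective of this kind can be sketched as below. The details are assumptions for illustration: the score is taken to be the plain L2 norm of the projected difference minus the relation vector (the source text does not show whether it is squared), and the names `score` and `margin_loss` as well as the example vectors are hypothetical.

```python
import numpy as np

def score(P, r, h, t):
    """Assumed score: ||P (t - h) - r||_2; lower means more plausible."""
    return float(np.linalg.norm(P @ (t - h) - r))

def margin_loss(pos_scores, neg_scores, margin):
    """Sum of hinge terms max(0, margin + f(positive) - f(negative))."""
    pos, neg = np.asarray(pos_scores), np.asarray(neg_scores)
    return float(np.sum(np.maximum(0.0, margin + pos - neg)))

# Hypothetical relation: subspace = first two axes, relation vector r on it.
P = np.diag([1.0, 1.0, 0.0])
r = np.array([1.0, 0.0, 0.0])
h = np.zeros(3)
t_true = np.array([1.0, 0.0, 5.0])       # P (t - h) equals r exactly
t_corrupted = np.array([4.0, 4.0, 0.0])  # corrupted tail of a negative triple

pos = score(P, r, h, t_true)
neg = score(P, r, h, t_corrupted)
loss = margin_loss([pos], [neg], margin=1.0)
```

The hinge term is zero once the negative triple scores worse than the positive one by at least the margin, so only violating pairs contribute gradient.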

Automatic Grounding of Positive Triples

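Without any special treatment, our initialization guarantees that training for a particular $(h, r_i, t)$ also automatically executes training with $(h, r_j, t)$ for any $r_i \Rightarrow r_j$, at all times. For example, by traversing (Tom, is_father_of, Harry) in the KG, the model automatically also traverses (Tom, is_parent_of, Harry) and (Tom, is_family_of, Harry), even if they are missing in the KG. This is because $P_j P_i = P_j$ with the given initialization (Section 4.1.1) and thus,

$$\| P_j(\vec{t} - \vec{h}) - \vec{r}_j \|_2 = \| P_j\big(P_i(\vec{t} - \vec{h}) - \vec{r}_i\big) \|_2 \leq \| P_i(\vec{t} - \vec{h}) - \vec{r}_i \|_2.$$

In other words, training $\| P_i(\vec{t} - \vec{h}) - \vec{r}_i \|_2$ towards less than $\epsilon$ automatically guarantees training $\| P_j(\vec{t} - \vec{h}) - \vec{r}_j \|_2$ towards less than $\epsilon$. This eliminates the need to manually create missing triples that are true by implication rule.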
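The bound behind automatic grounding can be checked numerically. The sketch below assumes one implying relation with subspace projector `P_i` and one implied relation whose subspace is nested inside it; the helper names, the dimension, and the way the relation vectors are drawn are hypothetical, chosen only so that the Intersection and Projection Constraints hold.

```python
import numpy as np

def projection_from_normals(normals):
    """Projection matrix onto the subspace orthogonal to the given vectors."""
    A = np.column_stack(normals)
    return np.eye(A.shape[0]) - A @ np.linalg.pinv(A)

def score(P, r, h, t):
    """||P (t - h) - r||_2 for one (head, tail) pair."""
    return float(np.linalg.norm(P @ (t - h) - r))

rng = np.random.default_rng(1)
d = 6
a_i, a_extra = rng.normal(size=d), rng.normal(size=d)
P_i = projection_from_normals([a_i])           # subspace of the implying relation
P_j = projection_from_normals([a_i, a_extra])  # nested subspace of the implied relation
r_i = P_i @ rng.normal(size=d)                 # relation vector lying on H_i
r_j = P_j @ r_i                                # Projection Constraint

pairs = [(rng.normal(size=d), rng.normal(size=d)) for _ in range(100)]
bound_holds = all(
    score(P_j, r_j, h, t) <= score(P_i, r_i, h, t) + 1e-12 for h, t in pairs
)
```

For every random pair, the score under the implied relation never exceeds the score under the implying relation, so lowering the latter during training also lowers the former.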

5 Experiments

We evaluate TransINT on two standard benchmark datasets - Freebase 122 Bordes et al. (2013) and NELL sport/location Wang et al. (2015b) - and compare against KALE Guo et al. (2016) and SimplE+ Fatemi et al. (2018) respectively, state-of-the-art methods that integrate rules into KG embeddings, in the trans- and bilinear families respectively. We perform link prediction and triple classification tasks on Freebase 122, and link prediction only on NELL sport/location (because SimplE+ only reported performance on link prediction). All code for the experiments was implemented in PyTorch Paszke et al. (2019). (Footnote 1: Repository for all of our code:)

5.1 Link Prediction on Freebase 122 and NELL Sport/ Location

We compare link prediction results with KALE on Freebase 122 (FB122) and with SimplE+ on NELL Sport/Location. The task is to predict the gold entity given a fact triple with missing head or tail - if ($h$, $r$, $t$) is a fact triple in the test set, predict $h$ given ($r$, $t$) or predict $t$ given ($h$, $r$). We follow TransE, KALE, and SimplE+’s protocol. For each test triple ($h$, $r$, $t$), we rank the similarity score $f(e, r, t)$ when $h$ is replaced with $e$ for every entity $e$ in the KG, and identify the rank of the gold head entity $h$; we do the same for the tail entity $t$. Aggregated over all test triples, we report for FB122: (i) the mean reciprocal rank (MRR), (ii) the median of the ranks (MED), and (iii) the proportion of ranks no larger than $N$ (HITS@N), which are the same metrics reported by KALE. For NELL Sport/Location, we follow the protocol of SimplE+ and do not report MED. A lower MED and a higher MRR and HITS@N are better.

TransH, KALE, and SimplE+ adopt a “filtered” setting that addresses cases where entities that are correct, albeit not gold, are ranked before the gold entity. For example, if the gold triple is (Tom, is_parent_of, John) and we rank every entity for being the head of (?, is_parent_of, John), it is possible that Sue, John’s mother, gets ranked before Tom. To avoid this, the “filtered setting” ignores corrupted triplets that exist in the KG when counting the rank of the gold entity. (The setting without this is called the “raw setting”.)
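The raw and filtered ranks and the three metrics can be sketched in plain Python. This is a minimal illustration, not the evaluation code of any of the cited papers: it assumes lower scores are better, and the entity names, scores, and the helper names `rank_of_gold` and `summarize` are made up.

```python
from statistics import median

def rank_of_gold(scores, gold, known_correct):
    """Rank of `gold` by ascending score, ignoring other known-correct entities.
    Passing an empty `known_correct` gives the raw rank."""
    gold_score = scores[gold]
    better = [e for e, s in scores.items()
              if s < gold_score and e != gold and e not in known_correct]
    return 1 + len(better)

def summarize(ranks, n):
    """MRR, median rank, and the proportion of ranks no larger than n."""
    return {
        "MRR": sum(1.0 / r for r in ranks) / len(ranks),
        "MED": median(ranks),
        "HITS@N": sum(r <= n for r in ranks) / len(ranks),
    }

# Hypothetical scores for the query (?, is_parent_of, John); lower is better.
scores = {"Sue": 0.1, "Tom": 0.2, "Ann": 0.5, "Bob": 0.9}
raw_rank = rank_of_gold(scores, "Tom", known_correct=set())
filtered_rank = rank_of_gold(scores, "Tom", known_correct={"Sue"})
metrics = summarize([1, 2, 5], n=3)
```

In the raw setting Sue pushes the gold entity Tom to rank 2; once Sue is filtered out as another correct answer, Tom is ranked first.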

TransINT’s hyperparameters are: learning rate, margin, embedding dimension, and learning rate decay, which is applied every 10 epochs to the learning rate. We find the optimal configuration by grid search over every combination of candidate values for these four hyperparameters. We create 100 mini-batches of the training set (following the protocol of KALE) and train for a maximum of 1000 epochs with early stopping based on the best median rank. Furthermore, we try training with and without normalizing each of the entity vectors, relation vectors, and relation subspace bases after every batch of training.

5.1.1 Experiment on Freebase 122

We compare our performance with that of KALE and previous methods (TransE, TransH, TransR) that were compared against it, using the same dataset (FB122). FB122 is a subset of FB15K Bordes et al. (2013) accompanied by 47 implication and transitive rules; it consists of 122 Freebase relations on “people”, “location”, and “sports” topics. Out of the 47 rules in FB122, 9 are transitive rules (e.g. person/nationality(x,y) $\wedge$ country/official_language(y,z) $\Rightarrow$ person/languages(x,z)) to be used for KALE. However, since TransINT only deals with implication rules, we do not take advantage of them, unlike KALE.

We also place ourselves at some intentional disadvantages against KALE to assess TransINT’s robustness to the absence of negative example grounding. In constructing negative examples for the margin-based loss, KALE uses both rules (by grounding) and its own scoring scheme to avoid false negatives. While grounding with FB122 is not a burdensome task, it is known to be very inefficient and difficult for extremely large datasets Ding et al. (2018). Thus, it is a great advantage for a KG model to perform well without grounding of training/test data. We evaluate TransINT in two settings - with and without rule grounding. We call them respectively $\text{TransINT}^{G}$ (grounding) and $\text{TransINT}^{NG}$ (no grounding).

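| Model | Raw MRR | Raw MED | Raw Hits@3 (%) | Raw Hits@5 (%) | Raw Hits@10 (%) | Filtered MRR | Filtered MED | Filtered Hits@3 (%) | Filtered Hits@5 (%) | Filtered Hits@10 (%) |
|---|---|---|---|---|---|---|---|---|---|---|
| TransE | 0.262 | 10.0 | 33.6 | 42.5 | 50.0 | 0.480 | 2.0 | 58.9 | 64.2 | 70.2 |
| TransH | 0.249 | 12.0 | 31.9 | 40.7 | 48.6 | 0.460 | 3.0 | 53.7 | 59.1 | 66.0 |
| TransR | 0.261 | 15.0 | 28.9 | 37.4 | 45.9 | 0.523 | 2.0 | 59.9 | 65.2 | 71.8 |
| KALE | 0.294 | 9.0 | 36.9 | 44.8 | 51.9 | 0.523 | 2.0 | 61.7 | 66.4 | 72.8 |
| TransINT (grounding) | 0.339 | 6.0 | 40.1 | 49.1 | 54.6 | 0.655 | 1.0 | 70.4 | 75.1 | 78.7 |
| TransINT (no grounding) | 0.323 | 8.0 | 38.3 | 46.6 | 53.8 | 0.620 | 1.0 | 70.1 | 74.1 | 78.3 |

Table 1: Results for Link Prediction on FB122. For KALE, we report the best performance by any of KALE-PRE, KALE-Joint, KALE-TRIP (3 variants of KALE proposed by Guo et al. (2016)).

| Model | Sport MRR (Filtered) | Sport MRR (Raw) | Sport Hits@1 (%) | Sport Hits@3 (%) | Sport Hits@10 (%) | Location MRR (Filtered) | Location MRR (Raw) | Location Hits@1 (%) | Location Hits@3 (%) | Location Hits@10 (%) |
|---|---|---|---|---|---|---|---|---|---|---|
| Logical Inference | - | - | 28.8 | - | - | - | - | 27.0 | - | - |
| SimplE | 0.230 | 0.174 | 18.4 | 23.4 | 32.4 | 0.190 | 0.189 | 13.0 | 21.0 | 31.5 |
| SimplE+ | 0.404 | 0.337 | 33.9 | 44.0 | 50.8 | 0.440 | 0.434 | 43.0 | 44.0 | 45.0 |
| TransINT (grounding) | 0.450 | 0.361 | 37.6 | 50.2 | 56.2 | 0.550 | 0.535 | 51.2 | 56.8 | 61.1 |
| TransINT (no grounding) | 0.431 | 0.362 | 36.7 | 48.7 | 52.1 | 0.536 | 0.534 | 51.1 | 53.3 | 59.0 |

Table 2: Results for Link Prediction on NELL sport/location.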

We report link prediction results in Table 1; since we use the same train/test/validation sets, we directly copy the baseline results from Guo et al. (2016). While the filtered setting gives better performance (as expected), the trend is generally similar between raw and filtered. TransINT outperforms all other models by large margins in all metrics, even without grounding; especially in the filtered setting, the Hits@N gap between TransINT without grounding and KALE is around 4 to 6 times that between KALE and the best Trans baseline (TransR).

Also, while TransINT with grounding performs higher than TransINT without grounding in all settings/metrics, the gap between them is much smaller than that between TransINT without grounding and KALE, showing that TransINT robustly brings state-of-the-art performance even without grounding. The results suggest two possibilities in a more general sense. First, the emphasis on true positives could be as important as, or more important than, avoiding false negatives. Even without manual grounding, TransINT has automatic grounding of positive training instances enabled (Section 4.1.1) due to model properties, and this could be one of its success factors. Second, hard constraints on parameter structures can bring a performance boost incomparable to that of regularization or joint learning, which are softer constraints.

5.1.2 Experiment on NELL Sport/ Location

We compare TransINT against SimplE+, a state-of-the-art method that outperforms ComplEx Trouillon et al. (2016) and SimplE Kazemi and Poole (2018a), on NELL (Sport/Location) for link prediction. NELL Sport/Location is a subset of NELL Mitchell et al. (2015) accompanied by implication rules - a complete list of them is available in Appendix C. Since we use the same train/test/validation sets, we directly copy the baseline results (Logical Inference, SimplE, SimplE+) from Fatemi et al. (2018). The results are shown in Table 2. Again, TransINT with and without grounding both significantly outperform the other methods in all metrics. The general trends are similar to the results for FB122; again, the performance gap between the two TransINT settings is much smaller than that between TransINT without grounding and SimplE+.

5.2 Triple Classification on Freebase 122

The task is to classify whether an unobserved instance $(h, r, t)$ is correct or not, where the test set consists of positive and negative instances. We use the same protocol and test set provided by KALE; for each test instance, we evaluate its similarity score $f(h, r, t)$ and classify it as “correct” if $f(h, r, t)$ is below a certain threshold, a hyperparameter to be additionally tuned for this task. We report mean average precision (MAP), the mean of classification precision over all distinct relations ($r$’s) of the test instances. We use the same experiment settings/training details as in link prediction, other than additionally finding the optimal threshold. Triple classification results are shown in Table 3. Again, TransINT with and without grounding both significantly outperform all other baselines. We also separately analyze MAP for relations that are/are not affected by the implication rules (those that appear/do not appear in the rules), shown in parentheses in Table 3 in the order (influenced relations/uninfluenced relations). We can see that both TransINT settings have MAP higher than the overall MAP of KALE, even when they have the penalty of being evaluated only on uninfluenced relations; this shows that TransINT generates better embeddings even for relations not affected by rules. Furthermore, we comment on the role of negative example grounding; we can see that grounding does not help performance on unaffected relations (i.e. 0.752 vs 0.761), but greatly boosts performance on those affected by rules (0.839 vs 0.709). While TransINT does not necessitate negative example grounding, grounding does improve the quality of embeddings for relations affected by rules.

| TransE | TransH | TransR | KALE | TransINT (grounding) | TransINT (no grounding) |
|---|---|---|---|---|---|
| 0.634 | 0.641 | 0.619 | 0.677 | 0.781 (0.839/0.752) | 0.743 (0.709/0.761) |

Table 3: Results for Triple Classification on FB122, in Mean Average Precision (MAP).

6 Semantics Mining with Overlap Between Embedded Regions

Traditional embedding methods that map an object (e.g. words, images) to a singleton vector learn soft tendencies between embedded vectors with cosine similarity, or angular distance between two embeddings. TransINT extends such a line of thought to semantic relatedness between groups of objects, with angles between relation spaces. In Fig. 4b, one can observe that the closer the angle between two embedded regions, the larger the overlap in area. For entities $h$ and $t$ to be tied by both relations $r_1, r_2$, $\vec{t} - \vec{h}$ has to belong to the intersection of their relation spaces. Thus, we hypothesize the following over any two relations $r_1, r_2$ that are not explicitly tied by the pre-determined rules:

| Category | Relation | Angle | imb |
|---|---|---|---|
| Not Disjoint: Relatedness | /people/person/nationality | 22.7 | 1.18 |
| Not Disjoint: Implication | /people/person/place_lived/location | 46.7 | 3.77 |
| Disjoint | /people/cause_of_death/people | 76.6 | n/a |
| Disjoint | /sports/sports_team/colors | 83.5 | n/a |

Table 4: Examples of relations’ angles and imb with respect to /people/person/place_of_birth

Let $V_1$ be the set of $\vec{t} - \vec{h}$’s in $r_1$’s relation space (denoted as $Rel_1$) and $V_2$ that of $r_2$’s.

(1) The angle between $Rel_1$ and $Rel_2$ represents the semantic “disjointness” of $r_1, r_2$; the more disjoint two relations, the closer their angle to 90°. When the angle between $Rel_1$ and $Rel_2$ is small, (2) if the majority of $V_1$ belongs to the overlap of $V_1$ and $V_2$ but not vice versa, $r_1$ implies $r_2$; (3) if the majority of $V_1$ and $V_2$ both belong to their overlap, $r_1$ and $r_2$ are semantically related.

(2) and (3) consider the imbalance of membership in overlapped regions. Exact calculation of this involves specifying an appropriate $\epsilon$ (Fig. 3). As a proxy for deciding whether an element of $V_1$ (denote $v_1$) belongs in the overlapped region, we can consider the distance between $v_1$ and its projection to $Rel_2$; the further away $v_1$ is from the overlap, the larger the projected distance. Call the mean of such distances from $V_1$ to $Rel_2$ $d_{12}$ and the reverse $d_{21}$. The imbalance in $d_{12}, d_{21}$ can be quantified with a factor that is minimized to 1 when $d_{12} = d_{21}$ and increases as $d_{12}, d_{21}$ become more imbalanced; we call this factor imb.
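The quantities in these hypotheses can be sketched for the simplest case in which each relation space is a line. Everything here is an assumption made for illustration: the relation spaces are parameterized as a point plus a direction, the function names are hypothetical, and the particular imbalance formula is only one possible choice that equals 1 for balanced distances and grows otherwise; it is not confirmed to be the factor used in the paper.

```python
import numpy as np

def angle_between_lines_deg(w1, w2):
    """Angle in [0, 90] degrees between lines with direction vectors w1, w2."""
    cos = abs(np.dot(w1, w2)) / (np.linalg.norm(w1) * np.linalg.norm(w2))
    return float(np.degrees(np.arccos(np.clip(cos, 0.0, 1.0))))

def mean_distance_to_line(points, r, w):
    """Mean distance from each point to the line {r + s * w}."""
    w = w / np.linalg.norm(w)
    offsets = np.asarray(points, dtype=float) - r
    residuals = offsets - np.outer(offsets @ w, w)
    return float(np.mean(np.linalg.norm(residuals, axis=1)))

def imbalance(d12, d21):
    """Assumed imbalance factor: 1 when d12 == d21, larger otherwise."""
    return 0.5 * (d12 / d21 + d21 / d12)

x_axis = np.array([1.0, 0.0, 0.0])
diagonal = np.array([1.0, 1.0, 0.0])
angle = angle_between_lines_deg(x_axis, diagonal)
d_example = mean_distance_to_line([[0, 1, 0], [0, 3, 0]], np.zeros(3), x_axis)
```

With such helpers, a small angle combined with a large imbalance would point towards implication, and a small angle with a balanced factor towards relatedness, following hypotheses (2) and (3).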

For hypothesis (1), we verified that the vast majority of relation pairs have angles near 90°, with the mean and median respectively 83.0° and 85.4°; only 1% of all relation pairs had angles less than 50°. We observed that relation pairs with angles less than 20° were those that can be inferred by transitively applying the pre-determined implication rules. Relation pairs with somewhat larger angles had strong tendencies of semantic relatedness or implication; such tendency drastically weakened as the angle grew further. Table 4 shows the angle and imb of relations with respect to /people/person/place_of_birth, whose trend agrees with our hypotheses. Finally, we note that such an analysis could be possible with TransH as well, since their method too maps $\vec{t} - \vec{h}$’s to lines (Fig. 2b).

In all of link prediction, triple classification, and semantics mining, TransINT’s theme of assigning optimal regions to bound entity sets is unified and consistent. Furthermore, the integration of rules into the embedding space is geometrically coherent with KG embeddings alone. These two qualities were missing in existing works such as TransE, KALE, and SimplE+.

7 Related Work

Our work is related to two strands of work. The first strand is Order Embeddings Vendrov et al. (2015) and their extensions Vilnis et al. (2018); Athiwaratkun and Wilson (2018), which are significantly limited in that only unary relations and their hierarchies can be modeled. While Nickel and Kiela (2017) also approximately embed unary partial ordering, their focus is on achieving reasonably competent results with unsupervised learning of rules in low dimensions, while ours is on achieving state-of-the-art results in a supervised setting.

The second strand is those that enforce the satisfaction of common sense logical rules for binary and $n$-ary relations in the embedded KG. Wang et al. (2015a) explicitly constrain the resulting embedding to satisfy logical implications and type constraints via linear programming, but only require this during inference, not learning. On the other hand, Guo et al. (2016), Rocktäschel et al. (2015), Fatemi et al. (2018) induce embeddings to follow a set of logical rules during learning, but their approaches involve soft induction instead of hard constraints, resulting in rather insignificant improvements. Our work combines the advantages of both Wang et al. (2015a) and works that impose rules during learning. Finally, Demeester et al. (2016) model unary relations only and Minervini et al. (2017) transitivity only; their contributions are fundamentally different from ours.

8 Conclusion

We presented TransINT, a new KG embedding method such that relation sets are mapped to continuous sets in $\mathbb{R}^d$, inclusion-ordered isomorphically to implication rules. Our method is extremely powerful, outperforming existing state-of-the-art methods on benchmark datasets by significant margins. We further proposed an interpretable criterion for mining semantic similarity and implication rules among sets of entities with TransINT.


References

  • B. Athiwaratkun and A. G. Wilson (2018) Hierarchical density order embeddings. arXiv:1804.09843.
  • A. Bordes, N. Usunier, A. Garcia-Durán, J. Weston, and O. Yakhnenko (2013) Translating embeddings for modeling multi-relational data. In Proceedings of the 26th International Conference on Neural Information Processing Systems - Volume 2, NIPS’13, USA, pp. 2787–2795.
  • A. Bordes, J. Weston, R. Collobert, and Y. Bengio (2011) Learning structured embeddings of knowledge bases. In AAAI.
  • T. Demeester, T. Rocktäschel, and S. Riedel (2016) Lifted rule injection for relation embeddings. In EMNLP.
  • B. Ding, Q. Wang, B. Wang, and L. Guo (2018) Improving knowledge graph embedding using simple constraints. In ACL.
  • B. Fatemi, S. Ravanbakhsh, and D. Poole (2018) Improved knowledge graph embedding using background taxonomic information. In AAAI.
  • A. Frome, G. S. Corrado, J. Shlens, S. Bengio, J. Dean, M. Ranzato, and T. Mikolov (2013) DeViSE: a deep visual-semantic embedding model. In NIPS.
  • S. Guo, Q. Wang, L. Wang, B. Wang, and L. Guo (2016) Jointly embedding knowledge graphs and logical rules. In EMNLP.
  • S. M. Kazemi and D. Poole (2018a) SimplE embedding for link prediction in knowledge graphs. In Advances in Neural Information Processing Systems 31, S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett (Eds.), pp. 4284–4295.
  • S. M. Kazemi and D. Poole (2018b) SimplE embedding for link prediction in knowledge graphs. In Advances in Neural Information Processing Systems 31, S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett (Eds.), pp. 4284–4295.
  • T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean (2013) Distributed representations of words and phrases and their compositionality. In NIPS.
  • P. Minervini, T. Demeester, T. Rocktäschel, and S. Riedel (2017) Adversarial sets for regularising neural link predictors. CoRR abs/1707.07596.
  • T. M. Mitchell, W. Cohen, E. Hruschka, P. Talukdar, J. Betteridge, A. Carlson, B. D. Mishra, M. Gardner, B. Kisiel, J. Krishnamurthy, et al. (2015) Never-ending learning. In Twenty-Ninth AAAI Conference on Artificial Intelligence.
  • M. Nickel and D. Kiela (2017) Poincaré embeddings for learning hierarchical representations. In NIPS.
  • M. Nickel, V. Tresp, and H. Kriegel (2011) A three-way model for collective learning on multi-relational data. In ICML.
  • A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Kopf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, and S. Chintala (2019) PyTorch: an imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems 32, H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett (Eds.), pp. 8024–8035.
  • T. Rocktäschel, S. Singh, and S. Riedel (2015) Injecting logical background knowledge into embeddings for relation extraction. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Denver, Colorado, pp. 1119–1129.
  • R. R. Stoll (1979) Set theory and logic. Courier Corporation.
  • G. Strang (2006) Linear algebra and its applications. Thomson, Brooks/Cole, Belmont, CA. ISBN 0030105676, 9780030105678, 0534422004, 9780534422004.
  • T. Trouillon, J. Welbl, S. Riedel, É. Gaussier, and G. Bouchard (2016) Complex embeddings for simple link prediction. In Proceedings of the 33rd International Conference on International Conference on Machine Learning - Volume 48, ICML’16, pp. 2071–2080.
  • I. Vendrov, J. R. Kiros, S. Fidler, and R. Urtasun (2015) Order-embeddings of images and language. CoRR abs/1511.06361.
  • L. Vilnis, X. Li, S. Murty, and A. McCallum (2018) Probabilistic embedding of knowledge graphs with box lattice measures. arXiv preprint arXiv:1805.06627.
  • Q. Wang, B. Wang, and L. Guo (2015a) Knowledge base completion using embeddings and rules. In IJCAI.
  • Q. Wang, B. Wang, and L. Guo (2015b) Knowledge base completion using embeddings and rules. In IJCAI.
  • Z. Wang, J. Zhang, J. Feng, and Z. Chen (2014) Knowledge graph embedding by translating on hyperplanes. In AAAI.
  • Z. Wei, J. Zhao, K. Liu, Z. Qi, Z. Sun, and G. Tian (2015) Large-scale knowledge base completion: inferring via grounding network sampling over selected instances. In CIKM.

Appendix A Proof For TransINT’s Isomorphic Guarantee

Here, we provide the proofs for Main Theorems 1 and 2, together with the concepts needed to follow them. Definitions and theorems that we propose are marked as such; for all others, we use existing definitions and cite them.

A.1 Linear Subspace and Projection

We explain in detail the elements of $\mathbb{R}^d$ that were discussed intuitively in the main text. In this and later sections, lemmas and definitions that we newly introduce are marked; those not marked are accompanied by a reference for their proof. We denote all matrices with capital letters (e.g., $A$) and vectors with arrows on top (e.g., $\vec{b}$).

A.1.1 Linear Subspace and Rank

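The linear subspace given by $A(\vec{x} - \vec{b}) = 0$ ($A$ is a $d \times d$ matrix and $\vec{b} \in \mathbb{R}^d$) is the set of $\vec{x} \in \mathbb{R}^d$ that are solutions to the equation; its rank is the number of constraints that $A(\vec{x} - \vec{b}) = 0$ imposes. For example, in $\mathbb{R}^3$, a hyperplane is a set of $\vec{x} = [x_1, x_2, x_3] \in \mathbb{R}^3$ such that $a x_1 + b x_2 + c x_3 - d = 0$ for some scalars $a, b, c, d$; because its vectors are bound by one equation (that is, its “$A$” contains only one effective equation), a hyperplane’s rank is 1 (equivalently, $\mathrm{rank}(A) = 1$). On the other hand, a line in $\mathbb{R}^3$ imposes 2 constraints, and its rank is 2 (equivalently, $\mathrm{rank}(A) = 2$).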

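Consider two linear subspaces $H_1, H_2$, given by $A_1(\vec{x} - \vec{b}_1) = 0$ and $A_2(\vec{x} - \vec{b}_2) = 0$ respectively. Then,

$$H_1 \subset H_2 \;\Leftrightarrow\; \left( A_1(\vec{x} - \vec{b}_1) = 0 \Rightarrow A_2(\vec{x} - \vec{b}_2) = 0 \right)$$

by definition. In the rest of the paper, we denote by $H_i$ the linear subspace given by some $A_i(\vec{x} - \vec{b}_i) = 0$.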
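As a small illustration of the rank examples above, the following sketch counts the effective constraints of a linear system by computing the rank of its coefficient matrix with exact Gaussian elimination. The helper `rank` and the two example matrices are hypothetical and chosen only for illustration; they are not taken from the paper.

```python
from fractions import Fraction


def rank(A):
    """Rank of a matrix, computed by exact Gaussian elimination."""
    rows = [[Fraction(x) for x in row] for row in A]
    r = 0  # number of pivot rows found so far
    for c in range(len(rows[0])):
        # Find a row at or below position r with a non-zero entry in column c.
        pivot = next((i for i in range(r, len(rows)) if rows[i][c] != 0), None)
        if pivot is None:
            continue
        rows[r], rows[pivot] = rows[pivot], rows[r]
        # Eliminate column c from all rows below the pivot row.
        for i in range(r + 1, len(rows)):
            f = rows[i][c] / rows[r][c]
            rows[i] = [a - f * b for a, b in zip(rows[i], rows[r])]
        r += 1
    return r


# A plane in R^3: three rows, but only one effective equation
# (row 2 is a multiple of row 1, row 3 is zero).
A_plane = [[1, 2, 3], [2, 4, 6], [0, 0, 0]]

# A line in R^3: two effective equations (row 3 is the sum of rows 1 and 2).
A_line = [[1, 0, 0], [0, 1, 0], [1, 1, 0]]

plane_rank = rank(A_plane)
line_rank = rank(A_line)
print(plane_rank, line_rank)
```

Redundant rows do not raise the rank, which matches the remark that only the effective equations count.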

Figure 5: Projection matrices of subspaces that include each other.

A.1.2 Properties of Projection


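For all $\vec{x}$ on $H$, projecting $\vec{x}$ onto $H$ is still $\vec{x}$; the converse is also true.

Lemma 1 Let $P$ be the projection matrix onto $H$. Then $P\vec{x} = \vec{x} \Leftrightarrow \vec{x} \in H$ [Strang].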


Projection decomposes any vector $\vec{x}$ into two orthogonal components, $P\vec{x}$ and $(I - P)\vec{x}$. Thus, for any projection matrix $P$, $I - P$ is also a projection matrix that is orthogonal to $P$ (i.e., $P(I - P) = 0$) [Strang].

Lemma 2 Let $P$ be a projection matrix. Then $I - P$ is also a projection matrix such that $P(I - P) = 0$ [Strang].

The following lemma also follows.

Lemma 3 $\|P\vec{x}\| \le \|P\vec{x} + (I - P)\vec{x}\| = \|\vec{x}\|$ for all $\vec{x} \in \mathbb{R}^d$ [Strang].
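The three projection properties in this subsection can be checked on a concrete example. The sketch below builds an exact orthogonal projection matrix onto a plane in $\mathbb{R}^3$ and verifies that (i) a point already on the plane is left unchanged while a point off the plane is moved, (ii) $I - P$ is a projection that is orthogonal to $P$, and (iii) projecting never lengthens a vector. The helpers `projector`, `matmul`, `apply`, and `sq_norm`, as well as the chosen plane and points, are hypothetical and serve only as illustration; the plane passes through the origin, which is an assumption of this sketch.

```python
from fractions import Fraction


def projector(basis, dim):
    """Exact orthogonal projection matrix onto span(basis), via Gram-Schmidt."""
    ortho = []
    for v in basis:
        w = [Fraction(x) for x in v]
        for u in ortho:
            coef = sum(a * b for a, b in zip(w, u)) / sum(a * a for a in u)
            w = [a - coef * b for a, b in zip(w, u)]
        if any(w):  # skip vectors that are linearly dependent on earlier ones
            ortho.append(w)
    P = [[Fraction(0)] * dim for _ in range(dim)]
    for u in ortho:
        n = sum(a * a for a in u)
        for i in range(dim):
            for j in range(dim):
                P[i][j] += u[i] * u[j] / n
    return P


def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def apply(P, x):
    return [sum(p * xi for p, xi in zip(row, x)) for row in P]


def sq_norm(x):
    return sum(a * a for a in x)


dim = 3
P = projector([[1, 1, 0], [0, 1, 1]], dim)  # plane H through the origin
I = [[Fraction(int(i == j)) for j in range(dim)] for i in range(dim)]
Q = [[I[i][j] - P[i][j] for j in range(dim)] for i in range(dim)]  # I - P
zero = [[Fraction(0)] * dim for _ in range(dim)]

x_in = [1, 2, 1]   # lies on H: (1,1,0) + (0,1,1)
x_out = [1, 0, 0]  # does not lie on H

invariant_on_H = apply(P, x_in) == x_in
moved_off_H = apply(P, x_out) != x_out
complement_is_orthogonal_projection = matmul(Q, Q) == Q and matmul(P, Q) == zero
never_longer = sq_norm(apply(P, x_out)) <= sq_norm(x_out)
print(invariant_on_H, moved_off_H, complement_is_orthogonal_projection, never_longer)
```

Exact rational arithmetic is used so that the equalities can be tested without floating-point tolerances.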

Projection onto an included space

If one subspace includes another, the order of projecting a point onto them does not matter. For example, in Figure 3, a point can first be projected onto the larger subspace and then onto the smaller one; alternatively, it can first be projected onto the smaller subspace and then onto the larger one, where it stays at the same point. Thus, the order of applying projections onto subspaces that include one another does not matter.

If we generalize, we obtain the following two lemmas (Figure 5):

Lemma 4 For every two subspaces $H_1, H_2$ with projection matrices $P_1, P_2$, $H_1 \subset H_2$ if and only if $P_1 P_2 = P_2 P_1 = P_1$.
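The claim that projections onto nested subspaces can be applied in either order can be checked on a small example. The sketch below takes a line contained in a plane and confirms that both products of the two projection matrices equal the projection onto the line; for a line that is not contained in the plane, the product differs. The helpers `projector` and `matmul` and the chosen subspaces are hypothetical, and all subspaces are assumed to pass through the origin.

```python
from fractions import Fraction


def projector(basis, dim):
    """Exact orthogonal projection matrix onto span(basis), via Gram-Schmidt."""
    ortho = []
    for v in basis:
        w = [Fraction(x) for x in v]
        for u in ortho:
            coef = sum(a * b for a, b in zip(w, u)) / sum(a * a for a in u)
            w = [a - coef * b for a, b in zip(w, u)]
        if any(w):
            ortho.append(w)
    P = [[Fraction(0)] * dim for _ in range(dim)]
    for u in ortho:
        n = sum(a * a for a in u)
        for i in range(dim):
            for j in range(dim):
                P[i][j] += u[i] * u[j] / n
    return P


def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


dim = 3
P_line = projector([[1, 1, 0]], dim)              # line inside the plane
P_plane = projector([[1, 1, 0], [0, 1, 1]], dim)  # plane containing that line
P_other = projector([[1, 0, 0]], dim)             # line NOT inside the plane

# Nested case: either order of projection collapses to the smaller subspace.
nested_commutes = (
    matmul(P_line, P_plane) == P_line and matmul(P_plane, P_line) == P_line
)

# Non-nested case: the product is no longer the projection onto the line.
non_nested_differs = matmul(P_other, P_plane) != P_other
print(nested_commutes, non_nested_differs)
```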

proof) By Lemma 1, if , then . On the other hand, if , then there is some such that . Thus,

Because projection matrices are symmetric [Strang],

Lemma 5 For two subspaces and vector ,

proof) is equivalent to .

By Lemma 4, if . Since , .

Partial ordering

If one subspace strictly includes another, projection is uniquely defined from the lower-rank subspace to the higher-rank subspace, but not the other way around. For example, in Figure 3, a point in the full space (rank 0) is always projected onto the rank-1 subspace at a single point. Similarly, a point on the rank-1 subspace is always projected onto the rank-2 subspace at a single point. However, “inverse projection” from the rank-2 subspace back to the rank-1 subspace is not defined, because several distinct points on the rank-1 subspace project to the same point on the rank-2 subspace. This is the key intuition for the isomorphism, which we prove in the next section.
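The asymmetry described above, where projection is well defined in one direction but has no inverse, can be seen in a short sketch. Two distinct points that differ only along the normal of a plane are projected onto the plane and land on the same point, so the projected point alone cannot tell us which one we started from. The helpers `projector` and `apply`, the plane, and the points are hypothetical and chosen only for illustration.

```python
from fractions import Fraction


def projector(basis, dim):
    """Exact orthogonal projection matrix onto span(basis), via Gram-Schmidt."""
    ortho = []
    for v in basis:
        w = [Fraction(x) for x in v]
        for u in ortho:
            coef = sum(a * b for a, b in zip(w, u)) / sum(a * a for a in u)
            w = [a - coef * b for a, b in zip(w, u)]
        if any(w):
            ortho.append(w)
    P = [[Fraction(0)] * dim for _ in range(dim)]
    for u in ortho:
        n = sum(a * a for a in u)
        for i in range(dim):
            for j in range(dim):
                P[i][j] += u[i] * u[j] / n
    return P


def apply(P, x):
    return [sum(p * xi for p, xi in zip(row, x)) for row in P]


P = projector([[1, 1, 0], [0, 1, 1]], 3)  # plane through the origin
normal = [1, -1, 1]                       # orthogonal to both basis vectors

point_a = [1, 0, 0]
point_b = [a + 2 * n for a, n in zip(point_a, normal)]  # shifted along the normal

image_a = apply(P, point_a)
image_b = apply(P, point_b)

# Two different starting points share one image, so no inverse map exists.
same_image = image_a == image_b
distinct_points = point_a != point_b
print(same_image, distinct_points)
```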

A.2 Proof for Isomorphism

Now, we prove that TransINT’s two constraints (section 2.3) guarantee isomorphic ordering in the embedding space.

Two posets are isomorphic if their sizes are the same and there exists an order-preserving mapping between them. Thus, any two posets , are isomorphic if and

Main Theorem 1 (Isomorphism): Let be the (subspace, vector) embeddings assigned to relations by the Intersection Constraint and the Projection Constraint; the projection matrix of . Then, is isomorphic to .

proof) Since each is distinct and each is assigned exactly one , .


Now, let’s show

Because the , intersection and projection constraints are true iff , enough to show that the two constraints hold iff .

First, let’s show . From the Intersection Constraint, . By Lemma 5, . From the Projection Constraint, . Thus,


Now, let’s show the converse; enough to show that if , then the intersection and projection constraints hold true.

If ,

both have to be true. For any , or equivalently, if for some , then the second equation becomes , which can be only compatible with the first equation if , since any vector’s projection onto a subspace is unique. (Projection Constraint)

Now that we know , by Lemma 5, (Intersection Constraint).




, the two posets are isomorphic.

In actual implementation and training, TransINT requires something less strict than :

for some non-negative and small . This bounds to regions with thickness , centered around (Figure 4). We prove that isomorphism still holds with this weaker requirement.

Definition Given a projection matrix , we call the solution space of as .

Main Theorem 2 (Margin-aware Isomorphism): For all non-negative scalars , is isomorphic to .
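To make the margin-aware statement more concrete, the following sketch checks one direction of it on a toy example. It rests on several assumptions that are not confirmed by the text here: the relaxed requirement is taken to be $\|P\vec{x} - \vec{r}\| \le \epsilon$, with $\vec{x}$ standing for the difference between tail and head embeddings; the Intersection Constraint is read as nesting of the two subspaces; and the Projection Constraint is read as $\vec{r}_b = P_b \vec{r}_a$. The names `P_a`, `P_b`, `r_a`, `r_b`, and `within_margin` are hypothetical.

```python
from fractions import Fraction


def projector(basis, dim):
    """Exact orthogonal projection matrix onto span(basis), via Gram-Schmidt."""
    ortho = []
    for v in basis:
        w = [Fraction(x) for x in v]
        for u in ortho:
            coef = sum(a * b for a, b in zip(w, u)) / sum(a * a for a in u)
            w = [a - coef * b for a, b in zip(w, u)]
        if any(w):
            ortho.append(w)
    P = [[Fraction(0)] * dim for _ in range(dim)]
    for u in ortho:
        n = sum(a * a for a in u)
        for i in range(dim):
            for j in range(dim):
                P[i][j] += u[i] * u[j] / n
    return P


def apply(P, x):
    return [sum(p * xi for p, xi in zip(row, x)) for row in P]


def within_margin(P, r, x, eps):
    """True if ||P x - r|| <= eps (assumed form), compared via squared norms."""
    diff = [a - b for a, b in zip(apply(P, x), r)]
    return sum(d * d for d in diff) <= Fraction(eps) ** 2


dim = 3
P_a = projector([[1, 1, 0], [0, 1, 1]], dim)  # larger subspace (plane)
P_b = projector([[1, 1, 0]], dim)             # smaller subspace (line in the plane)
r_a = [1, 2, 1]                               # a vector lying on the plane
r_b = apply(P_b, r_a)                         # assumed Projection Constraint

eps = Fraction(1, 10)

# x_exact = r_a + 5 * (1, -1, 1): solves P_a x = r_a exactly.
x_exact = [6, -3, 6]
# x_near = x_exact + (0, 1/20, 1/20): off by a small in-plane perturbation.
x_near = [Fraction(6), Fraction(-59, 20), Fraction(121, 20)]

exact_a = within_margin(P_a, r_a, x_exact, 0)
exact_b = within_margin(P_b, r_b, x_exact, 0)
near_a = within_margin(P_a, r_a, x_near, eps)
near_b = within_margin(P_b, r_b, x_near, eps)

# The reverse direction fails: r_b satisfies the second relation but not the first.
reverse_b = within_margin(P_b, r_b, r_b, eps)
reverse_a = within_margin(P_a, r_a, r_b, eps)
print(exact_a, exact_b, near_a, near_b, reverse_b, reverse_a)
```

Under these assumptions, every point within the margin of the first relation is also within the margin of the second, with the same $\epsilon$, while the reverse inclusion does not hold. This is the asymmetry that an implication ordering needs.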

proof) Enough to show that and are isomorphic for all .

First, let’s show

By Main Theorem 1 and Lemma 4,

Thus, for all vectors ,

Thus, if , then .


Now, let’s show the converse. Assume for some . Then,