1 Introduction
Knowledge graph is a multirelational graph whose nodes represent entities and edges denote relationships between entities. Knowledge graphs store facts about people, places and world from various sources. Those facts are kept as triples (head entity, relation, tail entity) and denoted as . A large number of knowledge graphs, such as Freebase (Bollacker et al., 2008), DBpedia (Auer et al., 2007), NELL (Carlson et al., 2010) and YAGO3 (Mahdisoltani, Biega, and Suchanek, 2013), have been built over the years and successfully applied to many domains such as search, recommendation and question answering. Knowledge graph has become an increasingly crucial component in machine intelligence systems. Although these knowledge graphs have already contained millions of entities and facts, they are far from complete compared to existing facts and newly added knowledge from the real world. Therefore knowledge graph completion is important topic and drawn great attentions from academic and industrial researchers.
Knowledge graph embedding represents entities and relations in continuous vector spaces. It is widely used for knowledge graph completion and could be roughly categorized into two classes
(Wang et al., 2017): translational distance models and semantic matching models. Translational distance models measure the plausibility of a fact as the distance between two entities. It is usually done with relation dependent translational operations. Started from a simple and effective approach called TransE (Bordes et al., 2013), many knowledge graph embedding methods have been proposed, such as TransH (Wang et al., 2014), TransR (Lin et al., 2015) to the latest RotatE (Sun et al., 2019). Another thread of research focuses on similaritybased scoring functions and measures fact plausibility by matching latent semantics of entities and relations embodied in their vector space representations, for example, DistMult (Yang et al., 2014), ConvE (Dettmers et al., 2018), ConvKB (Nguyen et al., 2017), CapsE (Nguyen et al., 2019) and QuatE (Zhang et al., 2019).Though great progress has been made, 1toN, Nto1, and NtoN relation predictions (Bordes et al., 2013; Wang et al., 2014) still remain challenging issues. In Figure 1, relation “profession” demonstrates a NtoN relation example and the corresponding edges are highlighted as green. Assuming triple (SergeiRachmaninoff, Profession, Pianist) is unknown and we want to evaluate the plausibility of this triple, the model needs to rank it against all triples by replacing “SergeiRachmaninoff” or “Pianist” with other entities in the knowledge graph. Entity “SergeiRachmaninoff” connected to multiple entities as head entity via relation “profession”, while “Pianist” as a tail entity also can reach multiple entities through relation “profession”. It makes the prediction hard because the mapping from certain entityrelation pair can lead to multiple different entities.
In this work, a novel translational knowledge graph embedding with graph context is proposed to alleviate the 1toN, Nto1 and NtoN issues. The proposed approach includes two parts. First we extend the RotatE modeling from 2D complex domain to high dimension space for better modeling capacity. Orthogonal transform embedding (OTE ) employs orthogonal transforms to represent relations in knowledge graph. In addition, the entity embedding is divided into groups. Each group is modeled and scored independently. The final score is the summation of all group scores. Hence, each group could address different aspects of entityrelation pair and alleviate 1toN, Nto1 and NtoN issues. Second, graph context is used to integrate graph structure information in the knowledge graph. Considering the triple (SergeiRachmaninoff, Profession, Pianist): from the incomplete knowledge graph, human can find useful context information, such as (SergeiRachmaninoff, role, Piano) and (SergeiRachmaninoff, Profession, Composer) in Figure 1. In this work, each node embedding in knowledge graph is augmented with two graph context representations, computed from the neighboring outgoing and incoming nodes respectively. Each context representation is computed based on the embeddings of the neighbouring nodes and the corresponding relations connecting to these neighbouring nodes. These context representations are used as part of the distance scoring function to measure the plausibility of the triples during training and inference. We show that OTE with graph context modeling performs consistently better than RotatE on the standard benchmark FB15k237 and WN18RR datasets.
In summary, our main contributions include:

A new translational model, orthogonal transform embedding OTE, is proposed to extend RotatE from 2D space to high dimensional space.

A directed graph context modeling method is proposed to integrate knowledge graph structure, including both neighboring entity nodes and relation edges;

Experimental results of OTE on standard benchmark FB15k237 and WN18RR datasets show consistent improvements over RotatE, the state of art translational distance embedding model, especially on FB15k237 with many high indegree connection nodes.
2 Related work
2.1 Knowledge Graph Embedding
Translational distance model is also known as additive models, since it projects head and tail entities into the same embedding space and the difference between two entity embeddings is used to describe the plausibility of the given triple. TransE (Bordes et al., 2013) is the first and most representative translational distance model. A series of work is conducted along this line such as TransH (Wang et al., 2014), TransR (Lin et al., 2015) and TransD (Ji et al., 2015) etc. RotatE (Sun et al., 2019) further extends the computation into complex domain and is the stateofart within this category. On the other hand, semantic matching models usually take multiplicative score functions to compute the plausibility of the given triple, such as DistMult (Yang et al., 2014), ComplEx (Trouillon et al., 2016), ConvE (Dettmers et al., 2018), T͡uckER (Balazevic, Allen, and Hospedales, 2019) and QuatE (Zhang et al., 2019). ConvKB (Nguyen et al., 2017) and CapsE (Nguyen et al., 2019) are further integrating head, relation and tail embeddings from the beginning of modeling and the triple as a whole is measured together using convolutional model or capsule networks.
Those embedding methods focus on modeling the individual triple and achieve good performance for knowledge graph completion. However, they ignore knowledge graph structure and don’t take the advantage of context from neighbouring nodes and connections. This issue inspired the usage of graph neural networks
(Kipf and Welling, 2016; Veličković et al., 2017) for graph context modeling. Encoderdecoder framework is adopted in (Schlichtkrull et al., 2017; Shang et al., 2018; Bansal et al., 2019). The knowledge graph structure is first encoded via graph neural networks and the output with rich structure information is passed to the following graph embedding model for prediction. The models could be endtoend based that the graph encoder and graph embedding decoder are training together. It could also be separated that the graph encoder output is only used to initialise the entity embedding in the graph embedding model (Nathani et al., 2019).2.2 Orthogonal Transform
Orthogonal transform is considered to be more stable and efficient for neural networks (Saxe, McClelland, and Ganguli, 2013; Vorontsov et al., 2017)
. However, to optimize a linear transform with orthogonal property reserved is not straightforward. Soft constraints could be enforced during optimization to encourage the learnt linear transform close to be orthogonal.
(Bansal, Chen, and Wang, 2018) extensively compared different orthogonal regularizations and find regularizations make the training faster and more stable in different tasks. On the other hand, some work has been done to achieve strict orthogonal during optimization by applying special gradient update scheme. Harandi and Fernando (2016) proposed a Stiefel layer to guarantee fully connected layers to be orthogonal by using Reimannian gradients. Huang et al. (2017)consider the estimation of orthogonal matrix as an optimization over multiple dependent stiefel manifolds problem and solve it via eigenvalue decomposition on a proxy parameter matrix.
Vorontsov et al. (2017)applied hard constraint on orthogonal transform update via Cayley transform. In this work, we construct the orthogonal matrix via Gram Schmidt process and the gradient is calculated automatically through autograd mechanism in PyTorch
(Paszke et al., 2017).3 Preliminaries
3.1 RotatE as an orthogonal transform
The proposed translational models used in this study are partially inspired by work RotatE (Sun et al., 2019). In RotatE, the embedding translation is done via Hadamard production (elementwise) defined on the complex domain. Given a triple , the corresponding embedding are , , , where and , , and is the embedding dimension. For each dimension , and are corresponding real and image components, and the translational computation is conducted as an orthogonal transform as below:
where is a 2D orthogonal matrix derived from .
Though RotatE shows good performance on graph embedding, it is defined in complex domain and limited it’s modeling ability. A natural extension is to apply similar operation on a higher dimension space and it is the motivation behind OTE described in section 4.1.
3.2 Gram Schmidt process
In this paper, we use GramSchmidt process to orthogonalize a linear transform into an orthogonal transform. The GramSchmidt process takes a set of tensor
for and generates an orthogonal set that spans the same dimensional subspace of as .(1)  
(2) 
where , is the norm of vector and denotes the inner product of and .
Orthogonal matrix has many desired properties for neural network based machine learning methods. For example, it is able to get the corresponding reverse matrix by transposing itself; it also preserves energy that the
norm of vector is kept before and after transform. It is thought to be more stable and efficient in many neural networks based models (Bansal, Chen, and Wang, 2018). In this study, we are more interested in its property to obtain reverse matrix by transpose and it makes the reverse project operation discussed in section 4.1 possible.4 Methods
We consider knowledge graph as a directed graph throughout this section, where is a set of entity nodes with , and is a set of relation edges with . Facts are stored in a knowledge graph as a collection of triples . Each triple has a head entity and tail entity . Relation connects two entities with direction from head to tail. As discussed in the introduction section, 1toN, Nto1 and NtoN relations (Bordes et al., 2013; Wang et al., 2014) are the main issues in the current systems. They are addressed in the proposed approach by: 1) OTE to handle the mapping from one entityrelation pair to different entities ; 2) Directed graph context to integrate knowledge graph structure information to reduce the uncertainty.
4.1 Orthogonal Transform Embedding (OTE)
Head, relation and tail entity are represented as , , in OTE, where , , and is the dimension of the entity embedding. The entity embedding is further considered as concatenation of subvectors, e.g., , where and . is a collection of linear transform matrix , and . For each subembedding , we assume the projection from and to is calculated as below:
(3) 
where is Gram Schmidt process applied to square matrix and the output transform is an orthogonal matrix derived from . is the concatenation of all subvector from equation 3, e.g., . Equation 3 defines a simple transition between node embeddings via relation embedding. The norm of is preserved before and after transform. Improving the training stability through orthogonal matrix is not priority in this study due to the shallowness of the model. Instead, it might limit the modeling ability. Hence, a scalar tensor is introduced to match entity embeddings with different norms and Equation 3 is rewritten as
(4) 
The corresponding distance scoring function is defined as
(5) 
The reverse project from tail to head can be obtained by simply transposing the and reversing the sign of as below,
(6) 
where means the translational project function from tail entity and relation pair to the head entity.
The Gram Schmidt process is employed as part of computation graph in our model. is calculated every time in the neural networks forward computation to get orthogonal matrix , while the corresponding gradient is calculated and propagated back to via autograd computation within PyTorch during the backward computation. It eliminates the need of special gradient update schemes employed in previous hard constraint based orthogonal transform estimations (Harandi and Fernando, 2016; Vorontsov et al., 2017). One potential implementation issue is that
might not be an invertible matrix and it makes the compute
problematic. In our experiments, we initialize to make sure they are with full rank. During training, we also keep checking the determinant of and we found the update is fairly stable that we don’t observe any issue for experiments with dimensions varied from 5 to 100 we explored.It can be easily proved that OTE has the ability to model and infer all three types of relation patterns: symmetry/antisymmetry, inversion, and composition as RotatE does. The proof is listed in Appendix A.
4.2 Directed Graph Context
4.2.1 Head Relation Pair Context
As discussed in the introduction section, the knowledge graph structure context could provide valuable information for link prediction task. In order to measure the plausibility of the questionable triple , the difference between and can be measured directly. In the other hand, assuming is a valid triple in the training set, if the context pair ( is similar to , it is a good indicator that the questionable triple could be a valid one too. Hence, we can compare projects from other heads of with project from the questionable triple . All connected (head) entityrelation pairs of are considered as its graph context and denoted as . The questionable triple gets a high score if we could find any of those valid projects from are close to .
However, it is expensive to keep all context projects in and compare them individually. In this work, an approximated approach is taken. For each tail entity node , a head context representation is defined as the average from all corresponding context pairs in as below:
(7) 
The similarity between pair from questionable triple and head context of is measured as
(8) 
There is no new parameter introduced for the graph context modeling, since the message passing is done via OTE entityrelation project . The graph context can be easily applied to other translational embedding algorithms, such as RotatE and TransE etc, by replacing OTE.
4.2.2 Tail Relation Pair Context
Equation 5 only considers the difference between project of pair to , and it is natural also considered difference between project of pair to when we measure the plausibility of triple .
(9) 
Similarly, we can compute the tail context representation for every head entity node . The corresponding distance score between pair from questionable triple and tail context of is defined as below:
(10) 
We further combine all 4 distance scores discussed above as new distance score for the graph context orthogonal transform embedding (GCOTE) training and inference
(11) 
Figure 1 demonstrates the computation of graph context for the questionable triple (SergeiRachmaninoff,profession,Pianist). Edges for relation “profession” are colored as green. Entities marked with are head entities to entity “Pianist”, and these entities and corresponding relations to connect “Pianist” form the head graph context of “Pianist”. The pairs in the head graph context are employed to calculate of “Pianist”. While entities with are tail entities for entity “SergeiRachmaninoff”. Those entites and corresponding relations are the tail graph context of entity “SergeiRachmaninoff”. Similarly, they are used to update of entity “SergeiRachmaninoff”.
The generation of graph structure context defined in Equation 7 can be considered as a variant of one layer GCN (Kipf and Welling, 2016). Compared with previous proposed GNN based methods (Schlichtkrull et al., 2017; Shang et al., 2018; Bansal et al., 2019), our approach has 3 differences. First, the proposed method is based on directed graph instead of nondirected graph; second, in message passing phase, the proposed method employs entityrelation project in OTE to pass information from neighbouring nodes to the target node, while GCN or GAT has a separate matrix to transform the node embedding in message passing phase; third, the output of graph model is used for scoring function directly instead of as input for the following embedding method.
5 Experiments
5.1 Datasets
Two common used benchmark datasets (FB15k237, and WN18RR) are employed in this study to evaluate the performance of link prediction.
5.1.1 FB15k237
5.1.2 Wn18rr
WN18RR (Dettmers et al., 2018) is derived from WN18 (Bordes et al., 2013), which is a subset of WordNet. WN18 consists of 18 relations and 40,943 entities. However, many text triples obtained by inverting triples from the training set. Thus WN18RR dataset (Dettmers et al., 2018) is created to ensure that the evaluation dataset does not have test leakage due to redundant inverse relation. In summary, WN18RR dataset contains 93,003 triples with 40,943 entities and 11 relation types.
Each dataset is split into three sets for: training, validation and testing, which is same with the setting of (Sun et al., 2019). The statistics of two data sets are summarized at table 1. Only triples in the training set are used to compute graph context.
Dataset  FB15k237  WN18RR 

Entities  14,541  40,943 
Relations  237  11 
Train Edges  272,115  86,835 
Val. Edges  17,535  3,034 
Test Edges  20,466  3,134 
5.2 Evaluation Protocol:
Following the evaluation protocol used in (Dettmers et al., 2018; Sun et al., 2019), each test triple is measured into two scenarios: head focused and tail focused . For each case, the test triple is ranked among all triples with masked entity replaced by entities in knowledge graph. Those true triples observed in either train/validation/test set except the test triple will be excluded during evaluation. Top 1,3 and 10 (Hits@1, Hits@3, Hits@10), mean rank (MR) and the mean reciprocal rank (MRR) are reported in the experiments.
5.3 Experimental Setup
The OTE model hyperparameters are determined by a grid search during the training, including learning rate, embedding size and subembedding dimension . The best models are selected by early stopping on the validation set. The system is first trained with OTE or RotatE embedding and then the corresponding graph context based model is fine tuned on the pretrained model.
For different datasets, we have found that the following settings work well: for FB15k237, embedding size is set to 400, subembedding dimension to 20 and learning rates to and for pretraining and finetuning stages respectively ; for WN18RR dataset, embedding size is set to 400, subembedding dimension to 4 and learning rate to and for pretraining and finetuning stages. For OTE alone model, no finetuneing step is applied.
We use the adaptive moment (Adam) algorithm
(Kingma and Ba, 2014) to train the model. Our models are implemented by PyTorch and run on NVIDIA Tesla P40 Graphics Processing Units. The graph neural networks implementation is based on Geometric PyTorch (Fey and Lenssen, 2019). The pretraining OTE takes 5 hours with 240,000 steps and finetuning GCOTE takes 23 hours with 60,000 steps. Though, it takes more computation for graph context based model during training, the inference could be efficient if both head and tail context representations are precomputed and saved for each entity in the knowledge graph. ^{1}^{1}1We will release the source code after review.Selfadversarial negative sampling loss (Sun et al., 2019) is used to optimize the embedding in this work,
(12)  
where is a fixed margin,
is sigmoid function,
is negative triple, and negative sampling weight defined in (Sun et al., 2019).5.4 Experimental Results
5.4.1 Main Results
FB15k237  WN18RR  
Model  MR  MRR  @1  @3  @10  MR  MRR  @1  @3  @10 
TransE  357  .294      .465  3384  .226      .501 
RotatE  177  .338  .241  .375  .533  3340  .476  .428  .492  .571 
DistMult  254  .241  .155  .263  .419  5110  .43  .39  .44  .49 
ComplEx  339  .247  .158  .275  .428  5261  .44  .41  .46  .51 
ConvE  244  .325  .237  .356  .501  4187  .43  .40  .44  .52 
QuatE  87  .348  .248  .382  .550  2314  .488  .438  .508  .582 
TurkER    .358  .266  .392  .544    .470  .443  .482  .526 
RGCN+    .249  .151  .264  .417           
SACN    .352  .261  .385  .536    .47  .43  .48  .54 
A2N    .317  .232  .348  .486    .45  .42  .46  .51 
OTE  174  .351  .258  .388  .537  2968  .485  .437  .502  .587 
GCOTE  154  .361  .267  .396  .550  2715  .491  .442  .511  .583 
In Table 2, the GCOTE is compared with a number of strong baselines. For translational distance models, TransE (Bordes et al., 2013) and it’s latest development RotatE (Sun et al., 2019) are compared; For semantic matching models, results from DistMult (Yang et al., 2014), ComplEx (Trouillon et al., 2016), ConvE (Dettmers et al., 2018), TuckER (Balazevic, Allen, and Hospedales, 2019) and QuatE (Zhang et al., 2019) are reported; Three systems with graph context information: RGCN+ (Schlichtkrull et al., 2017), SACN (Shang et al., 2018) and A2N (Bansal et al., 2019) are also included. All results except OTE and GCOTE come from the corresponding literature.
On the FB15k237 data set, the GCOTE outperforms all models for all metrics except MR from QuatE. The MRR is improved from 0.338 in RotatE, to 0.361, or 7% improvement.
On the WN18RR data set, the GCOTE also achieves slightly better MRR compared with the state of the art models RotatE and QuatE. 0.015 MRR improvement is observed from RotatE to GCOTE.
In FB15k237, there are average 18.7 relations per node, while this number drops to 2.1 edges per node in WN18RR. FB15k237 has richer graph structure context compared with WN18RR. The results indicate that the proposed method GCOTE is more effective at data set with rich structure context information.
5.4.2 Ablation Study
In Table 3, an ablation study is conducted to examine the impact of each modification. The results from FB15k237 validation set are reported. The embedding dimension for “RotatE S” and “RotatE L” are 400 and 2000 respectively^{2}^{2}2It is equivalent to the summation of dimensions from real and image part in RotatE setting.. “RotatE L” has the same configure as the model reported in (Sun et al., 2019). All other models are with embedding dimension 400. We examine the impact from model variation, subvector dimension (for OTE) and graph context.
First, when the embedding dimension increases from 400 to 2000, the MRR of RotatE is increased from 0.33 to 0.343 (row 1 and 2). Second, for OTE models, when the subembedding dimension increases from 2 to 20, the MRR is improved from 0.327 to 0.355, though both models are with the same entity embedding dimension (row 34). OTE with subembedding dimension 20 outperforms both RotatE models listed in the table. Increasing the subembedding dimension helps to improve the modeling ability. It is more effective than increasing the embedding dimension size along (RotatE S v.s. L). Third, two variations of OTE are examined. Row 5 “OTE  scalar” is OTE model without diagonal scalar tensor as Equation 3. The MRR degrades slightly to 0.352. It indicates that preserving vector norm in the strict orthogonal transform limits the model ability in the shallow structure network. Row 6 “LNE” is a model using normal linear transform instead of orthogonal version. Hence, two types of relation transforms are estimated for projects from head to tail and tail to head respectively. The result is slightly worse than OTE and with significant more parameters. Last, row 7 and 8 show that graph context improve the performance further for the translational distance models without adding extra parameters. MRRs are increased by and for RotatE and OTE respectively.
model  MRR  @10  #param  
RotatE S    .330  .515  5.9 
RotatE L    .340  .530  29.3 
OTE  2  .327  .511  6.1 
OTE  20  .355  .540  7.8 
OTE  scalar  20  .352  .535  7.7 
LNE  20  .354  .538  9.6 
GCRotatE L    .354  .546  29.3 
GCOTE  20  .367  .555  7.8 
In Figure 2, the impact of subembedding dimension to the OTE performance is demonstrated. The blue line shows MRR value for each subembedding dimension and green bars are the corresponding value. Both values increase and slowly saturated around 20. When the size gets bigger, the performance get worse though more parameters are used. The similar experiments are also conducted on WN18RR data set and we find the best subembedding dimension is 4 on that data set.
5.4.3 Error Analysis
We split the triples in the FB15k237 evaluation set into different categories to study the errors on 1N, N1 and NN relations. The results (in H@10) are presented in Table 4. Assume and are the number of and pairs appeared in triples from the training set respectively. A triple from the validation set is considered as one of the categories defined below:
In Table 4, results from two models, RotatE L and GCOTE, are compared for each category. “num.” is the number of triples in the validation set belonging to the corresponding category; “H” or “T” is the experiment to predict head entity or tail entity given other entity and relation. “A” is the average result from both “H” and “T” experiments.
It is clear from the table that first prediction to entity in “N” side is harder than prediction to entity in the “1” side. It is within our expectation. For example, given triple (SergeiRachmaninoff, Gender, male), it is easy to get high rank for tail entities “male” and “female”, since they are quite different from other entities in the knowledge graph; but it is difficult to get a high rank for entity “SergeiRachmaninoff”, since it has to compete with many other person names in the data set. Entity in the “N” side is easier to get confused with others and get a low rank. Second, GCOTE improves for both “H” and “T” experiments at all categories compared with “RotatE L”. It is also clear that the proposed approach has more gain on hard cases, say prediction at entities on the “N” side. It shows the graph context and orthogonal transform embedding do help on those hard cases.
RotatE L  GCOTE  
type  num.  H  T  A  H  T  A 
1N  2255  .710  .169  .440  .718  .204  .461 
N1  5460  .156  .850  .503  .209  .863  .536 
NN  9763  .490  .631  .561  .508  .651  .579 
6 Conclusions
In this work, a novel translational distance based approach for knowledge graph link prediction is proposed. It includes twofolds.
First, an orthogonal transform relation based graph embedding method OTE is proposed. OTE extends the modeling in RotatE from 2D complex domain to high dimension space with orthogonal relation matrix generated from Gram Schmidt process. The results show increasing the modeling dimension is more effective than increasing the embedding dimensionality.
Second, graph structure context is explicitly modeled via two directed context representations. Each node embedding in knowledge graph is augmented with two context representations, computed from the neighboring outgoing and incoming nodes respectively. These context representations are used as part of the distance scoring function to measure the plausibility of the triples during training and inference.
The proposed approach effectively improves prediction accuracy on the difficult N1, 1N and NN issues within the knowledge graph link prediction tasks. The experimental results show that it achieves good performance in two common data sets compared with the baseline RotatE, especially on data set (FB15k237) with rich structure information.
References
 Auer et al. (2007) Auer, S.; Bizer, C.; Kobilarov, G.; Lehmann, J.; Cyganiak, R.; and Ives, Z. 2007. Dbpedia: A nucleus for a web of open data. In The semantic web. Springer. 722–735.
 Balazevic, Allen, and Hospedales (2019) Balazevic, I.; Allen, C.; and Hospedales, T. 2019. TuckER: Tensor factorization for knowledge graph completion. In EMNLP.
 Bansal et al. (2019) Bansal, T.; Juan, D.C.; Ravi, S.; and McCallum, A. 2019. A2N: Attending to neighbors for knowledge graph inference. In ACL.
 Bansal, Chen, and Wang (2018) Bansal, N.; Chen, X.; and Wang, Z. 2018. Can we gain more from orthogonality regularizations in training deep networks? In NeurIPS.
 Bollacker et al. (2008) Bollacker, K.; Evans, C.; Paritosh, P.; Sturge, T.; and Taylor, J. 2008. Freebase: a collaboratively created graph database for structuring human knowledge. In Proceedings of the 2008 ACM SIGMOD international conference on Management of data, 1247–1250. ACM.
 Bordes et al. (2013) Bordes, A.; Usunier, N.; GarciaDuran, A.; Weston, J.; and Yakhnenko, O. 2013. Translating embeddings for modeling multirelational data. In NIPS, 2787–2795.
 Carlson et al. (2010) Carlson, A.; Betteridge, J.; Kisiel, B.; Settles, B.; Hruschka Jr, E. R.; and Mitchell, T. M. 2010. Toward an architecture for neverending language learning. In AAAI, volume 5, 3. Atlanta.
 Dettmers et al. (2018) Dettmers, T.; Minervini, P.; Stenetorp, P.; and Riedel, S. 2018. Convolutional 2D knowledge graph embeddings. In AAAI.
 Fey and Lenssen (2019) Fey, M., and Lenssen, J. E. 2019. Fast graph representation learning with PyTorch Geometric. In ICLR Workshop on Representation Learning on Graphs and Manifolds.

Harandi and
Fernando (2016)
Harandi, M., and Fernando, B.
2016.
Generalized backpropagation, étude de cas: Orthogonality.
ArXiv.  Huang et al. (2017) Huang, L.; Liu, X.; Lang, B.; Yu, A. W.; Wang, Y.; and Li, B. Q. 2017. Orthogonal weight normalization: Solution to optimization over multiple dependent stiefel manifolds in deep neural networks. In AAAI.
 Ji et al. (2015) Ji, G.; He, S.; Xu, L.; Liu, K.; and Zhao, J. 2015. Knowledge graph embedding via dynamic mapping matrix. In ACL.
 Kingma and Ba (2014) Kingma, D. P., and Ba, J. 2014. Adam: A method for stochastic optimization. In ICLR.
 Kipf and Welling (2016) Kipf, T. N., and Welling, M. 2016. Semisupervised classification with graph convolutional networks. In ICLR.
 Lin et al. (2015) Lin, Y.; Liu, Z.; Sun, M.; Liu, Y.; and Zhu, X. 2015. Learning entity and relation embedings for knowledge graph completion. In AAAI, volume 15, 2181–2187.
 Mahdisoltani, Biega, and Suchanek (2013) Mahdisoltani, F.; Biega, J.; and Suchanek, F. M. 2013. Yago3: A knowledge base from multilingual wikipedias. In CIDR.
 Nathani et al. (2019) Nathani, D.; Chauhan, J.; Sharma, C.; and Kaul, M. 2019. Learning attentionbased embeddings for relation prediction in knowledge graphs. In ACL.
 Nguyen et al. (2017) Nguyen, D. Q.; Nguyen, T. D.; Nguyen, D. Q.; and Phung, D. 2017. A novel embedding model for knowledge base completion based on convolutional neural network. arXiv preprint arXiv:1712.02121.
 Nguyen et al. (2019) Nguyen, D. Q.; Vu, T.; Nguyen, T. D.; Nguyen, D. Q.; and Phung, D. 2019. A Capsule Networkbased Embedding Model for Knowledge Graph Completion and Search Personalization. In NAACL, 2180–2189.
 Paszke et al. (2017) Paszke, A.; Gross, S.; Chintala, S.; Chanan, G.; Yang, E.; DeVito, Z.; Lin, Z.; Desmaison, A.; Antiga, L.; and Lerer, A. 2017. Automatic differentiation in pytorch.
 Saxe, McClelland, and Ganguli (2013) Saxe, A. M.; McClelland, J. L.; and Ganguli, S. 2013. Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. In ICLR.
 Schlichtkrull et al. (2017) Schlichtkrull, M. S.; Kipf, T. N.; Bloem, P.; van den Berg, R.; Titov, I.; and Welling, M. 2017. Modeling relational data with graph convolutional networks. In ESWC.
 Shang et al. (2018) Shang, C.; Tang, Y.; Huang, J.; Bi, J.; He, X.; and Zhou, B. 2018. Endtoend structureaware convolutional networks for knowledge base completion. In AAAI.
 Sun et al. (2019) Sun, Z.; Deng, Z.H.; Nie, J.; and Tang, J. 2019. Rotate: Knowledge graph embedding by relational rotation in complex space. In ICLR.
 Toutanova and Chen (2015) Toutanova, K., and Chen, D. 2015. Observed versus latent features for knowledge base and text inference. In Proceedings of the 3rd Workshop on Continuous Vector Space Models and their Compositionality, 57–66.
 Trouillon et al. (2016) Trouillon, T.; Welbl, J.; Riedel, S.; Gaussier, É.; and Bouchard, G. 2016. Complex embeddings for simple link prediction. In ICML, 2071–2080.
 Veličković et al. (2017) Veličković, P.; Cucurull, G.; Casanova, A.; Romero, A.; Liò, P.; and Bengio, Y. 2017. Graph attention networks. In ICLR.
 Vorontsov et al. (2017) Vorontsov, E.; Trabelsi, C.; Kadoury, S.; and Pal, C. J. 2017. On orthogonality and learning recurrent networks with long term dependencies. In ICML.

Wang et al. (2014)
Wang, Z.; Zhang, J.; Feng, J.; and Chen, Z.
2014.
Knowledge graph embedding by translating on hyperplanes.
In AAAI, volume 14, 1112–1119.  Wang et al. (2017) Wang, Q.; Mao, Z.; Wang, B.; and Guo, L. 2017. Knowledge graph embedding: A survey of approaches and applications. IEEE Transactions on Knowledge and Data Engineering 29:2724–2743.
 Yang et al. (2014) Yang, B.; Yih, W.t.; He, X.; Gao, J.; and Deng, L. 2014. Embedding entities and relations for learning and inference in knowledge bases. In ICLR.
 Zhang et al. (2019) Zhang, S.; Tay, Y.; Yao, L.; and Liu, Q. 2019. Quaternion knowledge graph embedding. CoRR abs/1904.10281.
Appendix A Disussion on the Ability of Pattern Modeling and Inference
It can be proved that OTE can infer all three types of relation patterns, e.g., symmetry/antisymmetry, inversion and composition patterns.
a.1 Symmetry/antisymmetry:
If and hold, we have
In other words, if is a symmetry matrix and no scale is applied, the relation is symmetry relation.
If the relation is antisymmetry, e.g., and , we just need to one of the is not symmetry matrix or .
a.2 Reversion:
If and hold, we have
In other words, if , the relation is inverse relation of .
a.3 Composition:
If , and hold, we have
It means if is equal to then relation is composition of relation and .