Knowledge graphs represent structured collections of facts describing the world in the form of typed relationships between entities (hogan2020knowledge). These collections of facts have been used in a wide range of applications, including web search, question answering, recommender systems, cancer research, machine translation, and even entertainment (eder2012knowledge; bordes2014question; zhang2016collaborative; saleem2014big; moussallem2019; malyshev2018getting). However, most knowledge graphs on the web are far from complete (nickel2015review). The task of identifying missing links in knowledge graphs is referred to as link prediction. Knowledge Graph Embedding (KGE) models have been particularly successful at tackling the link prediction task, among many others (nickel2015review).
KGE research has mainly focused on the two smallest normed division algebras, the real numbers ($\mathbb{R}$) and the complex numbers ($\mathbb{C}$), neglecting the benefits of the larger normed division algebras, namely the quaternions ($\mathbb{H}$) and the octonions ($\mathbb{O}$). While yang2015embedding introduced the trilinear product of real-valued embeddings of triples (h, r, t) as a scoring function for link prediction, trouillon2016complex showed the usefulness of the Hermitian product of complex-valued embeddings: in contrast to real-valued embeddings, this product is not symmetric and can be used to model antisymmetric relations, since $\mathrm{Re}(\langle \mathbf{e}_h, \mathbf{e}_r, \overline{\mathbf{e}_t} \rangle) \neq \mathrm{Re}(\langle \mathbf{e}_t, \mathbf{e}_r, \overline{\mathbf{e}_h} \rangle)$. To further increase the expressivity, zhang2019quaternion proposed learning quaternion-valued embeddings due to their benefits over complex-valued embeddings. Recently, zhang2021beyond showed that replacing a fully-connected layer with a hypercomplex multiplication layer in a neural network leads to significant parameter efficiency without degrading predictive performance in many tasks, including natural language inference, machine translation and text style transfer.
nguyen2017novel; dettmers2018convolutional; balavzevic2019hypernetwork; demir2021convolutional showed that convolutions are another effective means to increase expressivity: the sparse connectivity property of the convolution operator endows models with parameter efficiency, unlike models that simply increase the embedding size, which does not scale to large knowledge graphs (dettmers2018convolutional). Different configurations of the number of feature maps and the shape of kernels in the convolution operation are often explored to find the best ratio between expressiveness and the size of the parameter space.
We investigate the use of convolutions on hypercomplex embeddings by proposing four models: QMult and OMult can be considered hypercomplex extensions of DistMult (yang2015embedding) in $\mathbb{H}$ and $\mathbb{O}$, respectively. In contrast to the state of the art (zhang2019quaternion), we address the scaling effect of multiplication in $\mathbb{H}$ and $\mathbb{O}$ by applying the batch normalization technique. Through batch normalization, QMult and OMult are able to control the rate of normalization and benefit from its implicit regularization effect (ioffe2015batch). Importantly, lu2020dense suggest that using solely unit-quaternion-based rotations between head entity and relation limits the capacity to model various types of relations. ConvQ and ConvO build upon QMult and OMult by including the convolution operator in a way inspired by the residual learning framework (he2016deep): they forge QMult and OMult with a 2D convolution operation and an affine transformation via the Hadamard product, respectively. By virtue of this architecture, we show that ConvQ can degenerate into QMult, ComplEx or DistMult if such degeneration is necessary to further minimize the training loss (see Equations 6 and 10).
Experiments suggest that our models often achieve state-of-the-art performance on seven benchmark datasets (WN18, FB15K, WN18RR, FB15K-237, YAGO3-10, Kinship and UMLS). The superiority of our models over the state of the art increases as the size and complexity of the knowledge graph grow. Our results also indicate that the generalization performance of models can be further increased by applying ensemble learning.
2 Related Work
In the last decade, a plethora of KGE approaches have been successfully applied to tackle various tasks (nickel2015review; cai2018comprehensive; ji2020survey). In this section, we give a brief chronological overview of selected KGE
approaches. RESCAL computes a three-way factorization of a third-order adjacency tensor representing the input knowledge graph to compute scores for triples (nickel2011three). RESCAL captures various types of relations in the input KG but is limited in its scalability, as it has quadratic complexity in the factorization rank (trouillon2017knowledge). DistMult can be regarded as an efficient extension of RESCAL with a diagonal matrix per relation to reduce the complexity (yang2015embedding). DistMult performs poorly on antisymmetric relations while performing well on symmetric relations (trouillon2016complex). ComplEx extends DistMult by learning representations in a complex vector space (trouillon2016complex). ComplEx is able to infer both symmetric and antisymmetric relations via a Hermitian inner product of embeddings that involves the conjugate-transpose of one of the two input vectors. lacroix2018canonical design two novel regularizers along with a data augmentation technique and propose ComplEx-N3, which can be seen as ComplEx with the N3 regularization. ConvE applies a 2D convolution operation to model the interactions between entities and relations (dettmers2018convolutional). ConvKB extends ConvE by omitting the reshaping operation in the encoding of representations in the convolution operation (nguyen2017novel). Similarly, HypER extends ConvE by applying relation-specific convolution filters as opposed to filters derived from concatenated subject and relation vectors (balavzevic2019hypernetwork). TuckER employs the Tucker decomposition on the binary tensor representing the input knowledge graph triples (balavzevic2019tucker). RotatE models predicates as rotations from subjects to objects in the complex space via the element-wise Hadamard product (sun2019rotate). By these means, RotatE performs well on composition relations, where other approaches perform poorly.
QuatE applies the quaternion multiplication followed by an inner product to compute scores of triples (zhang2019quaternion).
3 Link Prediction & Hypercomplex Numbers
Let $\mathcal{E}$ and $\mathcal{R}$ represent the sets of entities and relations, respectively. Then, a Knowledge Graph (KG) $\mathcal{G} \subseteq \mathcal{E} \times \mathcal{R} \times \mathcal{E}$ can be formalised as a set of triples where each triple $(h, r, t)$ contains two entities $h, t \in \mathcal{E}$ and a relation $r \in \mathcal{R}$. The link prediction problem is formalised by learning a scoring function $\phi : \mathcal{E} \times \mathcal{R} \times \mathcal{E} \to \mathbb{R}$ ideally characterized by $\phi(h, r, t) > \phi(x, y, z)$ if $(h, r, t)$ is true and $(x, y, z)$ is not (dettmers2018convolutional).
The quaternions are a 4-dimensional normed division algebra (hamilton1844lxxviii; baez2002octonions). A quaternion number is defined as $q = a + b\,\mathbf{i} + c\,\mathbf{j} + d\,\mathbf{k}$, where $a, b, c, d$ are real numbers and $\mathbf{i}, \mathbf{j}, \mathbf{k}$ are imaginary units satisfying Hamilton's rule: $\mathbf{i}^2 = \mathbf{j}^2 = \mathbf{k}^2 = \mathbf{i}\mathbf{j}\mathbf{k} = -1$. Let $q_1 = a_1 + b_1\mathbf{i} + c_1\mathbf{j} + d_1\mathbf{k}$ and $q_2 = a_2 + b_2\mathbf{i} + c_2\mathbf{j} + d_2\mathbf{k}$ be two quaternions; their inner product is defined as $q_1 \cdot q_2 = a_1 a_2 + b_1 b_2 + c_1 c_2 + d_1 d_2$.
The quaternion multiplication of $q_1$ and $q_2$ is defined as
$$q_1 \otimes q_2 = (a_1 a_2 - b_1 b_2 - c_1 c_2 - d_1 d_2) + (a_1 b_2 + b_1 a_2 + c_1 d_2 - d_1 c_2)\,\mathbf{i} + (a_1 c_2 - b_1 d_2 + c_1 a_2 + d_1 b_2)\,\mathbf{j} + (a_1 d_2 + b_1 c_2 - c_1 b_2 + d_1 a_2)\,\mathbf{k}.$$
The quaternion multiplication is also known as the Hamilton product (zhang2021beyond). For a $d$-dimensional quaternion vector $\mathbf{q} = \mathbf{a} + \mathbf{b}\,\mathbf{i} + \mathbf{c}\,\mathbf{j} + \mathbf{d}\,\mathbf{k}$ with $\mathbf{a}, \mathbf{b}, \mathbf{c}, \mathbf{d} \in \mathbb{R}^d$, the inner product and multiplication are defined accordingly in an elementwise fashion. The octonions are an 8-dimensional algebra where an octonion number is defined as $x = x_0 + x_1 \mathbf{e}_1 + x_2 \mathbf{e}_2 + \dots + x_7 \mathbf{e}_7$, where $\mathbf{e}_1, \dots, \mathbf{e}_7$ are imaginary units (baez2002octonions). Their product ($\otimes$), inner product ($\cdot$) and vector operations are defined analogously to the quaternions.
The quaternion multiplication subsumes real-valued multiplication and enjoys a parameter saving, requiring roughly $1/4$ of the parameters of a comparable real-valued matrix multiplication (parcollet2018quaternion; parcollet2019quaternion; zhang2021beyond). Leveraging such properties of quaternions in neural networks has shown promising results in numerous tasks (zhang2021beyond; zhang2019quaternion; chen2020quaternion). In turn, octonion multiplication in neural networks and learning octonion-valued knowledge graph embeddings have not yet been fully explored.
4 Convolutional Hypercomplex Embeddings
dettmers2018convolutional suggest that indegree and PageRank can be used to quantify the difficulty of predicting missing links in a KG. Their results indicate that the superiority of ConvE over DistMult and ComplEx becomes more apparent as the complexity of the knowledge graph increases, i.e., as its indegree and PageRank increase (see Table 6 in dettmers2018convolutional). In turn, zhang2019quaternion show that learning quaternion-valued embeddings via multiplicative interactions can be a more effective means of predicting missing links than learning real- or complex-valued embeddings. Although learning quaternion-valued embeddings through multiplicative interactions yields promising results, the only way to further increase the expressiveness of such models is to increase the embedding dimensionality, which does not scale to larger knowledge graphs (dettmers2018convolutional). Increasing parameter efficiency while retaining effectiveness is a desired property in many applications (zhang2021beyond; trouillon2016complex; trouillon2017knowledge).
Motivated by the findings of the aforementioned works, we investigate the composition of convolution operations with hypercomplex multiplications. The rationale behind this composition is to increase expressiveness without increasing the number of parameters, a nontrivial endeavor that is the keystone of embedding models (trouillon2016complex). The sparse connectivity property of the convolution operation endows models with parameter efficiency, which helps them scale to larger knowledge graphs. Additionally, different configurations of the number of kernels and their shapes can be explored to find the best ratio between expressiveness and the number of parameters. Although increasing the number of feature maps increases the number of parameters, we benefit from the parameter sharing property of convolutions (goodfellow2016deep).
Inspired by the early works DistMult and ConvE, we dub our approaches QMult, OMult, ConvQ, and ConvO, where “Q” represents the quaternion variant and “O” the octonion variant. Given a triple $(h, r, t)$, QMult computes a triple score through the quaternion multiplication of the head entity embedding $\mathbf{e}_h$ and the relation embedding $\mathbf{e}_r$, followed by the inner product with the tail entity embedding $\mathbf{e}_t$:
$$\text{QMult}(h, r, t) = (\mathbf{e}_h \otimes \mathbf{e}_r) \cdot \mathbf{e}_t,$$
where $\mathbf{e}_h, \mathbf{e}_r, \mathbf{e}_t \in \mathbb{H}^d$. Similarly, OMult performs the octonion multiplication followed by the inner product:
$$\text{OMult}(h, r, t) = (\mathbf{e}_h \otimes \mathbf{e}_r) \cdot \mathbf{e}_t,$$
where $\mathbf{e}_h, \mathbf{e}_r, \mathbf{e}_t \in \mathbb{O}^d$. Computing scores of triples in this setting can be illustrated in two consecutive steps: (1) rotating $\mathbf{e}_h$ through $\mathbf{e}_r$ by applying the quaternion/octonion multiplication, and (2) squishing $(\mathbf{e}_h \otimes \mathbf{e}_r)$ and $\mathbf{e}_t$ onto the real number line by taking their inner product. During training, the angle between $(\mathbf{e}_h \otimes \mathbf{e}_r)$ and $\mathbf{e}_t$ is minimized provided that $(h, r, t)$ is true.
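To make the two steps concrete, the following is a minimal NumPy sketch of the QMult score. The `(4, d)` array layout and the function names are our illustrative choices, not the actual model implementation:

```python
import numpy as np

def hamilton_product(q1, q2):
    # q1, q2: arrays of shape (4, d) holding the (a, b, c, d) components
    # of d-dimensional quaternion vectors.
    a1, b1, c1, d1 = q1
    a2, b2, c2, d2 = q2
    return np.stack([
        a1 * a2 - b1 * b2 - c1 * c2 - d1 * d2,  # real part
        a1 * b2 + b1 * a2 + c1 * d2 - d1 * c2,  # i part
        a1 * c2 - b1 * d2 + c1 * a2 + d1 * b2,  # j part
        a1 * d2 + b1 * c2 - c1 * b2 + d1 * a2,  # k part
    ])

def qmult_score(e_h, e_r, e_t):
    """QMult score: Hamilton product of head and relation embeddings,
    followed by the inner product with the tail embedding."""
    return float(np.sum(hamilton_product(e_h, e_r) * e_t))
```

For instance, with 1-dimensional embeddings, $\mathbf{e}_h = \mathbf{i}$ and $\mathbf{e}_r = \mathbf{j}$ give $\mathbf{e}_h \otimes \mathbf{e}_r = \mathbf{k}$, so a tail embedding of $\mathbf{k}$ yields the maximal score.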
Motivated by the response of John T. Graves to W. R. Hamilton (“If with your alchemy you can make three pounds of gold, why should you stop there?”, baez2002octonions), we combine convolution operations with QMult and OMult, defining $\text{ConvQ}(h, r, t) = \big(v(h, r) \circ (\mathbf{e}_h \otimes \mathbf{e}_r)\big) \cdot \mathbf{e}_t$ with $\mathbf{e}_h, \mathbf{e}_r, \mathbf{e}_t \in \mathbb{H}^d$, and ConvO analogously in $\mathbb{O}^d$,
where $v(h, r)$ (respectively its octonion counterpart) is defined as $v(h, r) = \mathrm{ReLU}\big(\mathrm{vec}\big(\mathrm{ReLU}\big(\mathrm{conv}([\mathbf{e}_h; \mathbf{e}_r]; \omega)\big)\big)\,\mathbf{W} + \mathbf{b}\big),$
where $\mathrm{ReLU}$, $\mathrm{vec}(\cdot)$, $\mathrm{conv}(\cdot\,; \omega)$, $\omega$ and $(\mathbf{W}, \mathbf{b})$ denote the rectified linear unit function, a flattening operation, the convolution operation, the kernel in the convolution and an affine transformation, respectively.
Connection to ComplEx and DistMult.
During training, ConvQ can reduce its range of scoring functions to that of QMult if such a reduction is necessary to further decrease the training loss. In Equations 6–10, we elucidate the reduction of ConvQ into QMult and ComplEx:
Equation 6 corresponds to QMult provided that $v(h, r) = \mathbf{1}$, i.e., the convolution part yields a vector of ones. ConvQ can be further reduced to ComplEx by setting the imaginary parts $\mathbf{j}$ and $\mathbf{k}$ of $\mathbf{e}_h$, $\mathbf{e}_r$ and $\mathbf{e}_t$ to zero:
Computing the quaternion multiplication of the two resulting quaternion-valued vectors corresponds to Equation 8: $\mathbf{e}_h \otimes \mathbf{e}_r = (\mathbf{a}_h \circ \mathbf{a}_r - \mathbf{b}_h \circ \mathbf{b}_r) + (\mathbf{a}_h \circ \mathbf{b}_r + \mathbf{b}_h \circ \mathbf{a}_r)\,\mathbf{i}$.
The resulting quaternion-valued vector is scaled with $v(h, r)$ via the Hadamard product (Equation 9).
Through taking the inner product of the former vector with $\mathbf{e}_t$, we obtain $\langle v(h, r), \mathbf{a}_h \circ \mathbf{a}_r - \mathbf{b}_h \circ \mathbf{b}_r, \mathbf{a}_t \rangle + \langle v(h, r), \mathbf{a}_h \circ \mathbf{b}_r + \mathbf{b}_h \circ \mathbf{a}_r, \mathbf{b}_t \rangle$ (Equation 10),
where $\langle \cdot, \cdot, \cdot \rangle$ corresponds to the multi-linear inner product. Equation 10 corresponds to ComplEx provided that $v(h, r) = \mathbf{1}$. In the same way, ConvQ can be reduced to DistMult by setting all imaginary parts $\mathbf{i}, \mathbf{j}, \mathbf{k}$ to zero for $\mathbf{e}_h$, $\mathbf{e}_r$ and $\mathbf{e}_t$, yielding $\langle v(h, r), \mathbf{a}_h \circ \mathbf{a}_r, \mathbf{a}_t \rangle$.
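The degeneration into DistMult can be checked numerically: with all imaginary parts set to zero (and the convolution part treated as a vector of ones), the Hamilton product collapses to the real elementwise product, so the score equals DistMult's trilinear product. A small sketch with illustrative names:

```python
import numpy as np

def hamilton_product(q1, q2):
    # Components (a, b, c, d) of d-dimensional quaternion vectors.
    a1, b1, c1, d1 = q1
    a2, b2, c2, d2 = q2
    return np.stack([
        a1 * a2 - b1 * b2 - c1 * c2 - d1 * d2,
        a1 * b2 + b1 * a2 + c1 * d2 - d1 * c2,
        a1 * c2 - b1 * d2 + c1 * a2 + d1 * b2,
        a1 * d2 + b1 * c2 - c1 * b2 + d1 * a2,
    ])

rng = np.random.default_rng(42)
a_h, a_r, a_t = rng.normal(size=(3, 5))
zeros = np.zeros(5)

# Quaternion embeddings with all imaginary parts set to zero.
e_h = np.stack([a_h, zeros, zeros, zeros])
e_r = np.stack([a_r, zeros, zeros, zeros])
e_t = np.stack([a_t, zeros, zeros, zeros])

quaternion_score = np.sum(hamilton_product(e_h, e_r) * e_t)
distmult_score = np.sum(a_h * a_r * a_t)  # trilinear product <a_h, a_r, a_t>
```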
Connection to residual learning.
The residual learning framework facilitates the training of deep neural networks. A simple residual learning block consists of two weight layers denoted by $\mathcal{F}(\mathbf{x})$ and an identity mapping of the input $\mathbf{x}$ (see Figure 2 in he2016deep). Increasing the depth of a neural model by stacking residual learning blocks has led to significant improvements in many domains. In our setting, $\mathcal{F}(\mathbf{x})$ and $\mathbf{x}$ correspond to $v(h, r)$ and $\mathbf{e}_h \otimes \mathbf{e}_r$, respectively: we replaced the identity mapping of the input with the hypercomplex multiplication and, to scale the output, replaced the elementwise vector addition with the Hadamard product. By virtue of this inclusion, ConvQ and ConvO are endowed with the ability to control the impact of $v(h, r)$ on predicted scores, as shown in Equation 10. Ergo, the gradients of the loss (see Equation 12) w.r.t. head entity and relation embeddings can be propagated in two ways, namely via $v(h, r)$ or via the hypercomplex multiplication. Moreover, the number of feature maps and the shape of kernels can be used to find the best ratio between expressiveness and the number of parameters. Hence, the expressiveness of models can be adjusted without necessarily increasing the embedding size. Although increasing the number of feature maps increases the number of parameters in the model, we benefit from the parameter sharing property of convolutions.
5 Experimental Setup
We used seven datasets: WN18RR, FB15K-237, YAGO3-10, FB15K, WN18, UMLS and Kinship. An overview of the datasets is provided in Table 1. The latter four datasets are included for the sake of the completeness of our evaluation. dettmers2018convolutional suggest that indegree and PageRank can be used to indicate the difficulty of performing link prediction on an input KG. In our experiments, we are particularly interested in link prediction results on complex KGs. As commonly done, we augment the datasets by adding reciprocal triples $(t, r^{-1}, h)$ (dettmers2018convolutional; balavzevic2019hypernetwork; balavzevic2019tucker). For the link prediction experiments based on only tail entity rankings (see Table 5), we omit this data augmentation on the test set, as similarly done by bansal2019a2n.
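The reciprocal augmentation can be sketched as follows (the `_inverse` relation-name suffix is our illustrative convention for the fresh reciprocal relation):

```python
def add_reciprocal_triples(triples):
    """Mirror every (h, r, t) as (t, r_inverse, h), where r_inverse is a
    fresh relation symbol derived from r."""
    augmented = list(triples)
    for h, r, t in triples:
        augmented.append((t, r + "_inverse", h))
    return augmented
```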
Overview of datasets in terms of entities, relations, average node degree plus/minus standard deviation.
5.2 Training and optimization
We apply the same training strategy as dettmers2018convolutional: following the KvsAll training procedure (we adopt the terminology of ruffinelli2019you), for a given pair (h, r) we compute scores $\phi(h, r, t)$ for all $t \in \mathcal{E}$ and apply the logistic sigmoid function $\sigma(\phi(h, r, t))$ to obtain predicted probabilities. Models are trained to minimize the binary cross entropy loss
$$L = -\frac{1}{|\mathcal{E}|} \sum_{i=1}^{|\mathcal{E}|} \big( y_i \log(\hat{y}_i) + (1 - y_i) \log(1 - \hat{y}_i) \big),$$
where $\hat{\mathbf{y}}$ and $\mathbf{y}$ denote the vector of predicted probabilities and the binary label vector, respectively.
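A minimal sketch of the KvsAll objective for one (h, r) pair, assuming precomputed scores over all entities (all names here are illustrative):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def kvsall_bce(scores, labels, eps=1e-12):
    """Binary cross entropy between sigmoid(scores) and a multi-hot
    label vector marking all known-true tails for the (h, r) pair."""
    p = sigmoid(scores)
    return float(-np.mean(labels * np.log(p + eps)
                          + (1.0 - labels) * np.log(1.0 - p + eps)))

# For a pair (h, r), `scores` would hold phi(h, r, t) for every entity t.
scores = np.array([9.0, -9.0, -9.0, 8.0])
labels = np.array([1.0, 0.0, 0.0, 1.0])
loss = kvsall_bce(scores, labels)
```

Confident scores aligned with the labels give a loss near zero, while scores aligned with flipped labels give a large loss.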
We employ the Adam optimizer (kingma2014adam), dropout (srivastava2014dropout), label smoothing and batch normalization (ioffe2015batch), as similarly done in the literature (balavzevic2019hypernetwork; balavzevic2019tucker; dettmers2018convolutional; demir2021convolutional). Moreover, we selected the hyperparameters of our approaches by random search based on validation set performance (balavzevic2019tucker). Notably, we did not search for a good random seed but fixed the seed of the random number generator to 1 throughout our experiments.
We employ the standard filtered Mean Reciprocal Rank (MRR) and hits at N (H@N) metrics for link prediction (dettmers2018convolutional; balavzevic2019hypernetwork). For each test triple $(h, r, t)$, we construct its reciprocal $(t, r^{-1}, h)$ and add it into the test set, which is a common technique to decrease the computational cost during testing (dettmers2018convolutional). Then, for each test triple $(h, r, t)$, we compute the score of $(h, r, x)$ for all $x \in \mathcal{E}$ and calculate the filtered rank of the triple having $x = t$. Then we compute the MRR: $\mathrm{MRR} = \frac{1}{|\mathcal{G}_{\text{test}}|} \sum_{(h, r, t) \in \mathcal{G}_{\text{test}}} \frac{1}{\mathrm{rank}_{(h, r, t)}}$. Consequently, given a test triple, we compute the ranks of missing entities based on the ranks of both head and tail entities, as similarly done in balavzevic2019hypernetwork; balavzevic2019tucker; dettmers2018convolutional. For the sake of completeness, we also report link prediction performance based on only tail rankings, i.e., without including triples with reciprocal relations in the test data, as similarly done by bansal2019a2n.
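The filtered ranking procedure can be sketched as follows (a simplified version with illustrative names; ties are broken optimistically by counting only strictly higher scores):

```python
import numpy as np

def filtered_rank(scores, true_idx, known_true_idxs):
    """1-based rank of the true entity among all candidates, with every
    *other* known-true entity removed from the candidate list first."""
    true_score = scores[true_idx]
    rank = 1
    for i, s in enumerate(scores):
        if i == true_idx or i in known_true_idxs:
            continue  # skip the target itself and filtered entities
        if s > true_score:
            rank += 1
    return rank

def mean_reciprocal_rank(ranks):
    return float(np.mean([1.0 / r for r in ranks]))
```

Filtering matters whenever another true answer outscores the target: that candidate is removed instead of unfairly pushing the target's rank down.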
5.4 Implementation Details and Reproducibility
We implemented and evaluated our approach in the framework provided by balavzevic2019tucker; balazevic2019multi. To alleviate the hardware requirements for reproducibility, we provide hyperparameter optimization, training and evaluation scripts along with pretrained models on the project page. Experiments were conducted on a single NVIDIA GeForce RTX 3090.
Table 2 reports link prediction results on the WN18RR, FB15K-237 and YAGO3-10 datasets. Overall, the superior performance of our approaches becomes more apparent as the size and complexity of the knowledge graphs grow. On the smallest benchmark dataset (WN18RR), QMult, OMult, ConvQ and ConvO outperform many approaches, including DistMult, ConvE and ComplEx, in all metrics; however, QuatE, TuckER and RotatE yield the best performances. On the second-largest benchmark dataset (FB15K-237), ConvO outperforms all state-of-the-art approaches in 3 out of 4 metrics. Additionally, QMult and ConvQ outperform all state-of-the-art approaches except TuckER in terms of MRR, H@1 and H@3. On the largest benchmark dataset (YAGO3-10), QMult, OMult, ConvQ and ConvO outperform all competing approaches in all metrics, with QMult and OMult reaching the best and second-best performances, whereas ConvO does not perform particularly well compared to our other approaches. Across the three datasets, ConvO outperforms QMult, OMult and ConvQ in 8 out of 12 metrics, whereas QMult yields the better performance on YAGO3-10. Overall, these results suggest that the superiority of learning hypercomplex embeddings becomes more apparent as the size and complexity of the input knowledge graph increase, as measured by indegree (see Table 1) and PageRank (see Table 6 in dettmers2018convolutional). In Table 3, we compare some of the best-performing approaches on WN18RR, FB15K-237 and YAGO3-10 in terms of the number of trainable parameters. Results indicate that our approaches yield competitive performances (if not better) on all benchmark datasets.
6.1 Ensemble Learning
6.2 Impact of Tail Entity Rankings
During our experiments, we observed that models often predict missing tail entities more accurately than missing head entities, which was also observed in bansal2019a2n. Table 5 indicates that MRR performance based on only tail entity rankings is on average higher in absolute terms than MRR based on head and tail entity rankings on FB15K-237, while no such difference was observed on WN18RR.
[Table: per-relation MRR for head prediction (h, r, x) and tail prediction (x, r, t), comparing RotatE, QMult, ConvQ, OMult, ConvO and an ensemble, by relation name and relation type.]
6.3 Link Prediction Per Relation and Direction
We reevaluate the link prediction performances of some of the best-performing models from Table 2 in Tables 6 and 7. allen2021interpreting distinguish three types of relations: type S relations are specialization relations such as hypernym, type C relations denote so-called generalized context-shifts and include has_part, and type R relations include so-called highly-related relations such as similar_to. Our results show that our approaches accurately rank missing tail and head entities for type R relations. For instance, our approaches perfectly rank (i.e., achieve an MRR of 1.0 for) missing entities of the symmetric relations verb_group and similar_to. However, the direction of entity prediction has a significant impact on the results for non-symmetric type C relations. For instance, the MRR performances of QMult, ConvQ, OMult and ConvO vary by up to 0.63 (absolute) for the relation member_of_domain_region. The low performances on hypernym (type S) may stem from the fact that there are 184 triples in the test split of WN18RR where hypernym occurs with entities of which at least one did not occur in the training split. Models often perform poorly on type C relations but considerably better on type R relations, corroborating the findings of allen2021interpreting.
6.4 Batch vs. Unit Normalization
We investigate the effect of using batch normalization instead of the unit normalization previously proposed by zhang2019quaternion. Table 8 indicates that the scaling effect of hypercomplex multiplications can be effectively alleviated by using batch normalization. Replacing unit normalization with batch normalization allows our models to benefit (1) from its implicit regularization effect and (2) from its numerical stability, while controlling the rate of normalization (ioffe2015batch).
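The two normalization schemes compared here can be contrasted in a toy sketch (a single-batch, NumPy-only illustration with the learnable scale and shift left at their defaults; this is not the trained-model behavior):

```python
import numpy as np

def unit_normalize(x):
    """Rescale each embedding (row) to unit norm, as in unit
    normalization: the scale of every embedding is forced to 1."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def batch_normalize(x, gamma=1.0, beta=0.0, eps=1e-5):
    """Standardize each dimension over the batch; gamma and beta are
    the learnable scale and shift that let the model control the rate
    of normalization (here left at their defaults)."""
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

rng = np.random.default_rng(0)
batch = 100.0 * rng.normal(size=(8, 4))  # large-scale activations
bn = batch_normalize(batch)
un = unit_normalize(batch)
```

Unit normalization fixes every embedding's norm exactly, whereas batch normalization standardizes each dimension across the batch and, through gamma and beta, leaves the effective scale learnable.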
6.5 Convergence on YAGO3-10
The convergence plot indicates that the incurred binary cross entropy losses decrease significantly within the first 100 epochs. Thereafter, ConvQ and ConvO appear to converge, as their losses no longer fluctuate, whereas the training losses of QMult and OMult continue to fluctuate.
6.6 Link Prediction Results on Previous Benchmark Datasets
Table 9 reports the results on WN18 and FB15K, showing that our approaches ConvQ and ConvO outperform state-of-the-art approaches in 6 out of 8 metrics on these datasets.
Our approaches often outperform many state-of-the-art approaches on all datasets. QMult and OMult outperform many state-of-the-art approaches, including DistMult and ComplEx. These results indicate that scoring functions based on hypercomplex multiplications are more effective than scoring functions based on real and complex multiplications. This observation corroborates the findings of zhang2019quaternion. ConvO often performs slightly better than ConvQ on all datasets. Additionally, QMult and OMult perform particularly well on YAGO3-10. These results may stem from the fact that ConvQ and ConvO may benefit from initializing parameters with the correct variance, as highlighted in hanin2018start. Overall, the superior performance of our models stems from (1) hypercomplex embeddings and (2) the inclusion of convolution operations. Our models are allowed to degenerate into ComplEx or DistMult if necessary (see Section 4). The inclusion of the convolution operation followed by an affine transformation permits finding a good ratio between expressiveness and the number of parameters.
In this study, we presented effective compositions of convolution operations with hypercomplex multiplications in the quaternion and octonion algebras to address the link prediction problem. Experimental results showed that QMult and OMult, which perform hypercomplex multiplications on hypercomplex-valued embeddings of entities and relations, are effective methods to tackle the link prediction problem. ConvQ and ConvO forge QMult and OMult with convolution operations followed by an affine transformation. By virtue of this novel composition, ConvQ and ConvO facilitate finding a good ratio between expressiveness and the number of parameters. Experiments suggest that (1) generalizing real- and complex-valued models such as DistMult and ComplEx to the hypercomplex space is beneficial, particularly for larger knowledge graphs, (2) the scaling effect of hypercomplex multiplication can be tackled more effectively with batch normalization than with unit normalization, and (3) ensembling can be applied to further increase generalization performance.
In future work, we plan to investigate the generalization of our approaches to temporal knowledge graphs as well as translation-based models on hypercomplex vector spaces.