Multi-relational Poincaré Graph Embeddings
Hyperbolic embeddings have recently gained attention in machine learning due to their ability to represent hierarchical data more accurately and succinctly than their Euclidean analogues. However, multi-relational knowledge graphs often exhibit multiple simultaneous hierarchies, which current hyperbolic models do not capture. To address this, we propose a model that embeds multi-relational graph data in the Poincaré ball model of hyperbolic space. Our Multi-Relational Poincaré model (MuRP) learns relation-specific parameters to transform entity embeddings by Möbius matrix-vector multiplication and Möbius addition. Experiments on the hierarchical WN18RR knowledge graph show that our multi-relational Poincaré embeddings outperform their Euclidean counterpart and existing embedding methods on the link prediction task, particularly at lower dimensionality.
Hyperbolic space can be thought of as a continuous analogue of discrete trees, making it suitable for modelling hierarchical data structures Sarkar (2011); De Sa et al. (2018). Various types of hierarchical data have recently been embedded in hyperbolic space Nickel and Kiela (2017, 2018); Gulcehre et al. (2019); Tifrea et al. (2019), requiring relatively few dimensions and achieving promising results on downstream tasks. This demonstrates the advantage of modelling tree-like structures in spaces with constant negative curvature (hyperbolic) over zero-curvature spaces (Euclidean). More recently, tools needed to construct hyperbolic neural networks have been developed Ganea et al. (2018a); Bécigneul and Ganea (2019), facilitating the use of hyperbolic embeddings in downstream tasks.
Certain data structures, such as knowledge graphs, often exhibit multiple hierarchies simultaneously. For example, lion is near the top of the animal food chain but near the bottom in a tree of taxonomic mammal types Miller (1995). Despite the widespread use of hyperbolic geometry in representation learning, the only existing approach to embedding hierarchical multi-relational graph data in hyperbolic space Suzuki et al. (2019) does not outperform Euclidean models. The difficulty with representing multi-relational data in hyperbolic space lies in finding a way to represent entities (nodes), shared across relations, such that they form a different hierarchy under different relations, e.g. nodes near the root of the tree under one relation may be leaf nodes under another. Further, many state-of-the-art approaches to modelling multi-relational data, such as DistMult Yang et al. (2015), ComplEx Trouillon et al. (2016), and TuckER Balažević et al. (2019) (i.e. bilinear models), rely on inner product as a similarity measure and there is no clear correspondence to the Euclidean inner product in hyperbolic space Tifrea et al. (2019) by which these models can be converted. Existing translational approaches that use Euclidean distance to measure similarity, such as TransE Bordes et al. (2013) and STransE Nguyen et al. (2016), can be converted to the hyperbolic domain, but do not currently compete with the bilinear models in terms of predictive performance. However, it has recently been shown in the closely related field of word embeddings Allen and Hospedales (2019) that the difference (i.e. relation) between word pairs that form analogies manifests as a vector offset, justifying a translational approach to modelling relations.
In this paper, we propose MuRP, a theoretically inspired method to embed hierarchical multi-relational data in the Poincaré ball model of hyperbolic space. By considering the surface area of a hypersphere of increasing radius centered at a particular point, Euclidean space can be seen to “grow” polynomially, whereas in hyperbolic space the equivalent growth is exponential De Sa et al. (2018). Therefore, moving outwards from the root of a tree, there is more “room” to separate leaf nodes in hyperbolic space than in Euclidean. MuRP learns relation-specific parameters that transform entity embeddings by Möbius matrix-vector multiplication and Möbius addition Ungar (2001). The model outperforms not only its Euclidean counterpart, but also current state-of-the-art models on the link prediction task on the hierarchical WN18RR dataset. We also show that our Poincaré embeddings require far fewer dimensions than Euclidean embeddings to achieve comparable performance. We visualize the learned embeddings and analyze the properties of the Poincaré model compared to its Euclidean analogue, such as convergence rate, performance per relation, and influence of embedding dimensionality.
Multi-relation link prediction A knowledge graph is a multi-relational graph representation of a collection $\mathcal{F}$ of facts (or triples) of the form $(e_s, r, e_o) \in \mathcal{E} \times \mathcal{R} \times \mathcal{E}$, where $\mathcal{E}$ denotes the set of entities and $\mathcal{R}$ denotes the set of binary relations between them. The presence of $(e_s, r, e_o) \in \mathcal{F}$ indicates that subject entity $e_s$ is related to object entity $e_o$ by relation $r$. In a multi-relational graph representation of $\mathcal{F}$, nodes correspond to entities and typed directed edges represent relations, i.e. nodes for $e_s$ and $e_o$ are linked by a directed edge of type $r$ if and only if $(e_s, r, e_o) \in \mathcal{F}$. Given a set of facts $\mathcal{F}$, the task of multi-relational link prediction is to predict which triples are true. A perfect encoding of $\mathcal{F}$ would simply recall known facts. However, knowledge graphs are typically incomplete, so the aim is to infer other facts that are true but missing from $\mathcal{F}$. Typically, a score function $\phi: \mathcal{E} \times \mathcal{R} \times \mathcal{E} \to \mathbb{R}$ is learned, that assigns a score $s = \phi(e_s, r, e_o)$ to each triple, indicating the strength of prediction that a particular triple corresponds to a true fact. A non-linearity, such as the logistic sigmoid function, is often used to convert the score to a predicted probability $p = \sigma(s) \in [0, 1]$ of the triple being true.
Knowledge graph relations exhibit multiple properties, such as symmetry, asymmetry, and transitivity. Certain knowledge graph relations, such as “hypernym” and “has_part”, induce a hierarchical structure over entities, suggesting that embedding them in hyperbolic rather than Euclidean space may lead to improved representations Sarkar (2011); Nickel and Kiela (2017, 2018); Ganea et al. (2018b); Tifrea et al. (2019). Based on this intuition, we focus on embedding multi-relational knowledge graph data in hyperbolic space.
Hyperbolic geometry of the Poincaré ball The Poincaré ball model is one of five isometric models of hyperbolic geometry Cannon et al. (1997), each offering different perspectives for performing mathematical operations in hyperbolic space. The isometry means there exists a one-to-one distance-preserving mapping from the metric space of one model $(\mathcal{X}, d_{\mathcal{X}})$ onto that of another $(\mathcal{Y}, d_{\mathcal{Y}})$, where $\mathcal{X}, \mathcal{Y}$ are sets and $d_{\mathcal{X}}, d_{\mathcal{Y}}$ distance functions, or metrics, providing a notion of equivalence between the models.
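The Poincaré ball $(\mathbb{B}^d_c, g^{\mathbb{B}})$ of radius $1/\sqrt{c}$, $c > 0$, is a $d$-dimensional manifold $\mathbb{B}^d_c = \{\mathbf{x} \in \mathbb{R}^d : c\|\mathbf{x}\|^2 < 1\}$ equipped with the Riemannian metric $g^{\mathbb{B}}$, which is conformal to the Euclidean metric $g^{\mathbb{E}} = \mathbf{I}_d$ (i.e. angle-preserving with respect to the Euclidean space Ganea et al. (2018a)) with the conformal factor $\lambda^c_{\mathbf{x}} = 2/(1 - c\|\mathbf{x}\|^2)$, i.e. $g^{\mathbb{B}} = (\lambda^c_{\mathbf{x}})^2 g^{\mathbb{E}}$. The distance between two points $\mathbf{x}, \mathbf{y} \in \mathbb{B}^d_c$ is measured along a geodesic (i.e. shortest path between the points, see Figure 1(a)) and is given by:
$$d_{\mathbb{B}}(\mathbf{x}, \mathbf{y}) = \frac{2}{\sqrt{c}} \tanh^{-1}\left(\sqrt{c}\,\|{-\mathbf{x}} \oplus_c \mathbf{y}\|\right), \tag{1}$$
where $\|\cdot\|$ denotes the Euclidean norm and $\oplus_c$ represents Möbius addition Ungar (2001); Ganea et al. (2018a):
$$\mathbf{x} \oplus_c \mathbf{y} = \frac{(1 + 2c\langle \mathbf{x}, \mathbf{y} \rangle + c\|\mathbf{y}\|^2)\,\mathbf{x} + (1 - c\|\mathbf{x}\|^2)\,\mathbf{y}}{1 + 2c\langle \mathbf{x}, \mathbf{y} \rangle + c^2\|\mathbf{x}\|^2\|\mathbf{y}\|^2}, \tag{2}$$
with $\langle \cdot, \cdot \rangle$ being the Euclidean inner product.
Each point $\mathbf{x} \in \mathbb{B}^d_c$ has a tangent space $T_{\mathbf{x}}\mathbb{B}^d_c$, a $d$-dimensional vector space, that is a local first-order approximation of the manifold $\mathbb{B}^d_c$ around $\mathbf{x}$, which for the Poincaré ball is a $d$-dimensional Euclidean space, i.e. $T_{\mathbf{x}}\mathbb{B}^d_c = \mathbb{R}^d$. The exponential map $\exp^c_{\mathbf{x}}: T_{\mathbf{x}}\mathbb{B}^d_c \to \mathbb{B}^d_c$ allows one to move on the manifold from $\mathbf{x}$ in the direction of a vector $\mathbf{v} \in T_{\mathbf{x}}\mathbb{B}^d_c$, tangential to $\mathbb{B}^d_c$ at $\mathbf{x}$. The inverse is the logarithmic map $\log^c_{\mathbf{x}}: \mathbb{B}^d_c \to T_{\mathbf{x}}\mathbb{B}^d_c$. For the Poincaré ball, these are defined Ganea et al. (2018a) as:
$$\exp^c_{\mathbf{x}}(\mathbf{v}) = \mathbf{x} \oplus_c \left( \tanh\left( \sqrt{c}\, \frac{\lambda^c_{\mathbf{x}} \|\mathbf{v}\|}{2} \right) \frac{\mathbf{v}}{\sqrt{c}\,\|\mathbf{v}\|} \right), \tag{3}$$
$$\log^c_{\mathbf{x}}(\mathbf{y}) = \frac{2}{\sqrt{c}\,\lambda^c_{\mathbf{x}}} \tanh^{-1}\left( \sqrt{c}\,\|{-\mathbf{x}} \oplus_c \mathbf{y}\| \right) \frac{-\mathbf{x} \oplus_c \mathbf{y}}{\|{-\mathbf{x}} \oplus_c \mathbf{y}\|}. \tag{4}$$
Ganea et al. (2018a) show that matrix-vector multiplication in hyperbolic space (Möbius matrix-vector multiplication) can be obtained by projecting a point $\mathbf{x} \in \mathbb{B}^d_c$ onto the tangent space at $\mathbf{0} \in \mathbb{B}^d_c$ with $\log^c_{\mathbf{0}}(\mathbf{x})$, performing matrix multiplication by $\mathbf{M} \in \mathbb{R}^{k \times d}$ in the Euclidean tangent space, and projecting back to $\mathbb{B}^k_c$ via the exponential map at $\mathbf{0}$, i.e.:
$$\mathbf{M} \otimes_c \mathbf{x} = \exp^c_{\mathbf{0}}\left(\mathbf{M} \log^c_{\mathbf{0}}(\mathbf{x})\right). \tag{5}$$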
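The operations in Equations 1 to 5 can be sketched in a few lines of plain Python. This is a minimal illustration rather than the authors' PyTorch implementation: the function names (`mobius_add`, `poincare_distance`, `exp0`, `log0`, `mobius_matvec`) are hypothetical, points are plain lists of floats, and only the maps at the origin are implemented, since those are the ones Equation 5 needs.

```python
import math


def dot(x, y):
    return sum(a * b for a, b in zip(x, y))


def norm(x):
    return math.sqrt(dot(x, x))


def mobius_add(x, y, c=1.0):
    """Mobius addition of x and y in the ball of curvature parameter c (Equation 2)."""
    xy, x2, y2 = dot(x, y), dot(x, x), dot(y, y)
    coef_x = 1 + 2 * c * xy + c * y2
    coef_y = 1 - c * x2
    denom = 1 + 2 * c * xy + c * c * x2 * y2
    return [(coef_x * xi + coef_y * yi) / denom for xi, yi in zip(x, y)]


def poincare_distance(x, y, c=1.0):
    """Geodesic distance between x and y (Equation 1)."""
    diff = mobius_add([-xi for xi in x], y, c)
    return 2 / math.sqrt(c) * math.atanh(math.sqrt(c) * norm(diff))


def exp0(v, c=1.0):
    """Exponential map at the origin: Equation 3 with x = 0, where lambda = 2."""
    n = norm(v)
    if n == 0:
        return list(v)
    scale = math.tanh(math.sqrt(c) * n) / (math.sqrt(c) * n)
    return [scale * vi for vi in v]


def log0(y, c=1.0):
    """Logarithmic map at the origin: Equation 4 with x = 0."""
    n = norm(y)
    if n == 0:
        return list(y)
    scale = math.atanh(math.sqrt(c) * n) / (math.sqrt(c) * n)
    return [scale * yi for yi in y]


def mobius_matvec(M, x, c=1.0):
    """Mobius matrix-vector multiplication (Equation 5); M is a list of rows."""
    u = log0(x, c)
    return exp0([dot(row, u) for row in M], c)
```

With $c = 1$, the distance from the origin to a point at Euclidean radius 0.5 is $2\tanh^{-1}(0.5) = \ln 3$, which gives a quick sanity check of the distance function.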
Embedding hierarchical data in hyperbolic space has recently gained popularity in representation learning. Nickel and Kiela (2017) first embedded the transitive closure of the WordNet noun hierarchy (in which each node in a directed graph is connected not only to its children, but to every descendant, i.e. all nodes to which there exists a directed path from the starting node) in the Poincaré ball, showing that low-dimensional hyperbolic embeddings can significantly outperform higher-dimensional Euclidean embeddings in terms of both representation capacity and generalization ability. The same authors subsequently embedded hierarchical data in the Lorentz model of hyperbolic geometry Nickel and Kiela (2018).
Ganea et al. (2018a) introduced Hyperbolic Neural Networks, connecting hyperbolic geometry with deep learning. They build on the definitions for Möbius addition, Möbius scalar multiplication, exponential and logarithmic maps of Ungar (2001) to derive expressions for linear layers, bias translation and application of non-linearity in the Poincaré ball. Hyperbolic analogues of several other algorithms have been developed since, such as Poincaré Glove Tifrea et al. (2019) and Hyperbolic Attention Networks Gulcehre et al. (2019). More recently, Gu et al. (2019) note that data can be non-uniformly hierarchical and learn embeddings on a product manifold with components of different curvature: spherical, hyperbolic and Euclidean. To our knowledge, only Riemannian TransE Suzuki et al. (2019) seeks to embed multi-relational data in hyperbolic space, but the Riemannian translation method fails to outperform Euclidean baselines.
Bilinear models typically represent relations as linear transformations acting on entity vectors. An early model, RESCAL Nickel et al. (2011), optimizes a score function $\phi(e_s, r, e_o) = \mathbf{e}_s^\top \mathbf{M}_r \mathbf{e}_o$, containing the bilinear product between the subject entity embedding $\mathbf{e}_s$, a full rank relation matrix $\mathbf{M}_r$ and the object entity embedding $\mathbf{e}_o$. RESCAL is prone to overfitting due to the number of parameters per relation being quadratic relative to the number per entity. DistMult Yang et al. (2015) is a special case of RESCAL with diagonal relation matrices, reducing parameters per relation and controlling overfitting. However, due to its symmetry, DistMult cannot model asymmetric relations. ComplEx Trouillon et al. (2016) extends DistMult to the complex domain, enabling asymmetry to be modelled. TuckER Balažević et al. (2019) performs a Tucker decomposition of the tensor of triples, which enables information sharing between different relations via the core tensor. The authors show each of the linear models above to be a special case of TuckER.
Translational models regard a relation as a translation (or vector offset) from the subject to the object entity embeddings. These models include TransE Bordes et al. (2013) and its many successors, e.g. FTransE Feng et al. (2016), STransE Nguyen et al. (2016). The score function for translational models typically considers Euclidean distance between the translated subject entity embedding and the object entity embedding.
A set of entities can form different hierarchies under different relations. In the WordNet knowledge graph Miller (1995), the “hypernym”, “has_part” and “member_meronym” relations each induce different hierarchies over the same set of entities. For example, the noun chair is a parent node to different chair types (e.g. folding_chair, armchair) under the relation “hypernym” and both chair and its types are parent nodes to parts of a typical chair (e.g. backrest, leg) under the relation “has_part”. An ideal embedding model should capture all hierarchies simultaneously.
Score function Bilinear multi-relational models measure similarity between the subject entity embedding (after relation-specific transformation) and an object entity embedding using the Euclidean inner product Nickel et al. (2011); Yang et al. (2015); Trouillon et al. (2016); Balažević et al. (2019). However, a clear correspondence to the Euclidean inner product does not exist in hyperbolic space Tifrea et al. (2019). The Euclidean inner product can be expressed as a function of Euclidean distances and norms, i.e. $\langle \mathbf{x}, \mathbf{y} \rangle = \frac{1}{2}\left(-d_{\mathbb{E}}(\mathbf{x}, \mathbf{y})^2 + \|\mathbf{x}\|^2 + \|\mathbf{y}\|^2\right)$, where $d_{\mathbb{E}}(\mathbf{x}, \mathbf{y}) = \|\mathbf{x} - \mathbf{y}\|$. Noting this, in Poincaré Glove, Tifrea et al. (2019) absorb squared norms into biases and replace the Euclidean with the Poincaré distance to obtain the hyperbolic version of Glove Pennington et al. (2014).
Separately, it has recently been shown in the closely related field of word embeddings that statistics pertaining to analogies naturally contain linear structures Allen and Hospedales (2019), explaining why similar linear structure appears amongst word embeddings of Word2Vec Mikolov et al. (2013a, b); Levy and Goldberg (2014). Analogies are word relationships of the form “$w_a$ is to $w_a^*$ as $w_b$ is to $w_b^*$”, such as “man is to woman as king is to queen”, and are in principle not restricted to two pairs (e.g. “…as brother is to sister”). It can be seen that analogies have much in common with relations in multi-relational graphs, as a difference between pairs of words (or entities) common to all pairs, e.g. if $(e_s, r, e_o)$ and $(e_s', r, e_o')$ hold, then we could say “$e_s$ is to $e_o$ as $e_s'$ is to $e_o'$”. Of particular relevance is the demonstration that the common difference, i.e. relation, between the word pairs (e.g. (man, woman) and (king, queen)) manifests as a common vector offset Allen and Hospedales (2019), justifying the previously heuristic translational approach to modelling relations.
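Inspired by these two ideas, we define the basis score function for multi-relational graph embedding:
$$\phi(e_s, r, e_o) = -d\left(\mathbf{e}_s^{(r)}, \mathbf{e}_o^{(r)}\right)^2 + b_s + b_o = -d\left(\mathbf{R}\mathbf{e}_s, \mathbf{e}_o + \mathbf{r}\right)^2 + b_s + b_o, \tag{6}$$
where $d$ is a distance function, $\mathbf{e}_s, \mathbf{e}_o \in \mathbb{R}^d$ are the embeddings and $b_s, b_o \in \mathbb{R}$ scalar biases of the subject and object entities $e_s$ and $e_o$ respectively. $\mathbf{R} \in \mathbb{R}^{d \times d}$ is a diagonal relation matrix and $\mathbf{r} \in \mathbb{R}^d$ a translation vector (i.e. vector offset) of relation $r$. $\mathbf{e}_s^{(r)} = \mathbf{R}\mathbf{e}_s$ and $\mathbf{e}_o^{(r)} = \mathbf{e}_o + \mathbf{r}$ represent the subject and object entity embeddings after applying the respective relation-specific transformations, a stretch by $\mathbf{R}$ to $\mathbf{e}_s$ and a translation by $\mathbf{r}$ to $\mathbf{e}_o$.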
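Hyperbolic model Taking the hyperbolic analogue of Equation 6, we define the score function for our Multi-Relational Poincaré (MuRP) model as:
$$\phi_{\mathrm{MuRP}}(e_s, r, e_o) = -d_{\mathbb{B}}\left(\mathbf{h}_s^{(r)}, \mathbf{h}_o^{(r)}\right)^2 + b_s + b_o = -d_{\mathbb{B}}\left(\exp^c_{\mathbf{0}}\left(\mathbf{R} \log^c_{\mathbf{0}}(\mathbf{h}_s)\right),\ \mathbf{h}_o \oplus_c \mathbf{r}_h\right)^2 + b_s + b_o, \tag{7}$$
where $\mathbf{h}_s, \mathbf{h}_o \in \mathbb{B}^d_c$ are hyperbolic embeddings of the subject and object entities $e_s$ and $e_o$ respectively, and $\mathbf{r}_h \in \mathbb{B}^d_c$ is a hyperbolic translation vector of relation $r$. The relation-adjusted subject entity embedding $\mathbf{h}_s^{(r)} = \exp^c_{\mathbf{0}}(\mathbf{R} \log^c_{\mathbf{0}}(\mathbf{h}_s))$ is obtained by Möbius matrix-vector multiplication: the original subject entity embedding $\mathbf{h}_s$ is projected to the tangent space of the Poincaré ball at $\mathbf{0}$ with $\log^c_{\mathbf{0}}$, transformed by the diagonal relation matrix $\mathbf{R}$, and then projected back to the Poincaré ball by $\exp^c_{\mathbf{0}}$. The relation-adjusted object entity embedding $\mathbf{h}_o^{(r)} = \mathbf{h}_o \oplus_c \mathbf{r}_h$ is obtained by Möbius addition of the relation vector $\mathbf{r}_h$ to the object entity embedding $\mathbf{h}_o$. Since the relation matrix $\mathbf{R}$ is diagonal, the number of parameters of MuRP increases linearly with the number of entities and relations, making it scalable to large knowledge graphs. To obtain the predicted probability of a fact being true, we apply the logistic sigmoid to the score, i.e. $\sigma(\phi_{\mathrm{MuRP}}(e_s, r, e_o))$.
To directly compare the properties of hyperbolic embeddings with the Euclidean, we implement the Euclidean version of Equation 6 with $d$ set to the Euclidean distance, i.e. $d(\mathbf{e}_s^{(r)}, \mathbf{e}_o^{(r)}) = \|\mathbf{e}_s^{(r)} - \mathbf{e}_o^{(r)}\|$. We refer to this model as the Multi-Relational Euclidean (MuRE) model.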
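The two score functions (Equations 6 and 7) can be sketched side by side as below. This is an illustrative sketch, not the released implementation: the names `mure_score` and `murp_score` are hypothetical, the diagonal relation matrix is stored as a plain list `R_diag` of its diagonal entries, and all vectors are Python lists.

```python
import math


def _dot(x, y):
    return sum(a * b for a, b in zip(x, y))


def _mobius_add(x, y, c):
    xy, x2, y2 = _dot(x, y), _dot(x, x), _dot(y, y)
    denom = 1 + 2 * c * xy + c * c * x2 * y2
    return [((1 + 2 * c * xy + c * y2) * xi + (1 - c * x2) * yi) / denom
            for xi, yi in zip(x, y)]


def _poincare_distance(x, y, c):
    diff = _mobius_add([-xi for xi in x], y, c)
    return 2 / math.sqrt(c) * math.atanh(math.sqrt(c) * math.sqrt(_dot(diff, diff)))


def _exp0(v, c):
    n = math.sqrt(_dot(v, v))
    if n == 0:
        return list(v)
    return [math.tanh(math.sqrt(c) * n) / (math.sqrt(c) * n) * vi for vi in v]


def _log0(y, c):
    n = math.sqrt(_dot(y, y))
    if n == 0:
        return list(y)
    return [math.atanh(math.sqrt(c) * n) / (math.sqrt(c) * n) * yi for yi in y]


def mure_score(e_s, e_o, R_diag, r, b_s, b_o):
    """Euclidean score: -||R e_s - (e_o + r)||^2 + b_s + b_o."""
    subj = [Ri * ei for Ri, ei in zip(R_diag, e_s)]   # stretch the subject
    obj = [ei + ri for ei, ri in zip(e_o, r)]         # translate the object
    return -sum((a - b) ** 2 for a, b in zip(subj, obj)) + b_s + b_o


def murp_score(h_s, h_o, R_diag, r_h, b_s, b_o, c=1.0):
    """Hyperbolic score: stretch in the tangent space at 0, Mobius-translate the object."""
    u = _log0(h_s, c)
    subj = _exp0([Ri * ui for Ri, ui in zip(R_diag, u)], c)
    obj = _mobius_add(h_o, r_h, c)
    return -_poincare_distance(subj, obj, c) ** 2 + b_s + b_o


def sigmoid(s):
    """Logistic sigmoid, turning a score into a predicted probability."""
    return 1 / (1 + math.exp(-s))
```

With an identity relation matrix and a zero translation, both scores reduce to $b_s + b_o$ when subject and object embeddings coincide, which is a convenient check that the two transformations are wired up consistently.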
Geometric intuition We see from Equation 6 that the biases $b_s, b_o$ determine the radius of a hypersphere decision boundary centered at $\mathbf{e}_s^{(r)}$. Entities $e_s$ and $e_o$ are predicted to be related by $r$ if relation-adjusted $\mathbf{e}_o^{(r)}$ falls within a hypersphere of radius $\sqrt{b_s + b_o}$ (see Figure 1(b)). Since biases are subject and object entity-specific, each subject-object pair induces a different decision boundary. The relation-specific parameters $\mathbf{R}$ and $\mathbf{r}$ determine the position of the relation-adjusted embeddings, but the radius of the entity-specific decision boundary is independent of the relation. The score function in Equation 6 resembles the score functions of existing translational models Bordes et al. (2013); Feng et al. (2016); Nguyen et al. (2016), with the main difference being the entity-specific biases, which can be seen to change the geometry of the model. Rather than considering an entity as a point in space, each bias defines an entity-specific sphere of influence surrounding the center given by the embedding vector (see Figure 1(c)). The overlap between spheres measures relatedness between entities. We can thus think of each relation as moving the spheres of influence in space, so that only the spheres of subject and object entities that are connected under that relation overlap.
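The decision boundary can be made explicit with a short derivation. It assumes, for illustration only, that a triple is predicted true when its predicted probability exceeds $1/2$ and that $b_s + b_o \geq 0$; here $\mathbf{e}_s^{(r)}, \mathbf{e}_o^{(r)}$ denote the relation-adjusted embeddings and $b_s, b_o$ the entity biases of Equation 6.

```latex
\begin{align*}
\sigma\big(\phi(e_s, r, e_o)\big) > \tfrac{1}{2}
  &\iff \phi(e_s, r, e_o) > 0 \\
  &\iff -d\big(\mathbf{e}_s^{(r)}, \mathbf{e}_o^{(r)}\big)^2 + b_s + b_o > 0 \\
  &\iff d\big(\mathbf{e}_s^{(r)}, \mathbf{e}_o^{(r)}\big) < \sqrt{b_s + b_o}.
\end{align*}
```

Under these assumptions the boundary is a hypersphere around the relation-adjusted subject embedding whose radius depends only on the two biases, not on the relation.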
To train both models, we generate $k$ negative samples for each true triple $(e_s, r, e_o)$, where we corrupt either the subject or the object entity with a randomly chosen entity from the set of all entities $\mathcal{E}$. Both models are trained to minimize the Bernoulli negative log-likelihood loss:
$$\mathcal{L}(y, p) = -\frac{1}{N} \sum_{i=1}^{N} \left( y^{(i)} \log\left(p^{(i)}\right) + \left(1 - y^{(i)}\right) \log\left(1 - p^{(i)}\right) \right), \tag{8}$$
where $p$ is the predicted probability, $y$ is the binary label indicating whether a sample is positive or negative and $N$ is the number of training samples.
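A minimal sketch of the negative sampling step and the loss in Equation 8 is given below. The helper names `corrupt` and `bernoulli_nll` are hypothetical, the 50/50 choice between corrupting subject and object is an assumption (the text only says "either"), and the probability clamp `eps` is added purely for numerical safety.

```python
import math
import random


def corrupt(triple, entities, k, rng):
    """Return k negative triples made by replacing the subject or the object.

    Assumption: subject and object are corrupted with equal probability, and
    no check is made that a corrupted triple is not itself a true fact.
    """
    s, r, o = triple
    negatives = []
    for _ in range(k):
        e = rng.choice(entities)
        if rng.random() < 0.5:
            negatives.append((e, r, o))   # corrupt the subject
        else:
            negatives.append((s, r, e))   # corrupt the object
    return negatives


def bernoulli_nll(labels, probs, eps=1e-12):
    """Bernoulli negative log-likelihood averaged over N samples (Equation 8)."""
    total = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1 - eps)     # keep log() finite
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(labels)
```

For example, a positive and a negative sample that are both assigned probability 0.5 give a loss of $\ln 2$.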
For fairness of comparison, we optimize the Euclidean model using stochastic gradient descent (SGD) and the hyperbolic model using Riemannian stochastic gradient descent (RSGD) Bonnabel (2013). We note that the Riemannian equivalent of adaptive optimization methods has recently been developed Bécigneul and Ganea (2019), but leave replacing SGD and RSGD with their adaptive equivalent to future work. To compute the Riemannian gradient $\nabla_R \mathcal{L}$, the Euclidean gradient $\nabla_E \mathcal{L}$ is multiplied by the inverse of the Poincaré metric tensor:
$$\nabla_R \mathcal{L} = \frac{1}{(\lambda^c_{\theta})^2} \nabla_E \mathcal{L}. \tag{9}$$
Instead of the Euclidean update step $\theta \leftarrow \theta - \eta \nabla_R \mathcal{L}$, a first order approximation of the true Riemannian update, we use the exponential map at $\theta$, $\exp^c_{\theta}$, to project the gradient $\nabla_R \mathcal{L} \in T_{\theta}\mathbb{B}^d_c$ onto its corresponding geodesic on the Poincaré ball and compute the Riemannian update:
$$\theta \leftarrow \exp^c_{\theta}\left(-\eta \nabla_R \mathcal{L}\right), \tag{10}$$
where $\eta$ denotes the learning rate.
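One RSGD step (Equations 9 and 10) can be sketched as follows. The function name `rsgd_step` is hypothetical and the code is a plain-Python illustration, not the optimizer from the released code; a practical implementation would probably also guard against parameters drifting numerically onto the boundary of the ball, which is not handled here.

```python
import math


def _dot(x, y):
    return sum(a * b for a, b in zip(x, y))


def _mobius_add(x, y, c):
    xy, x2, y2 = _dot(x, y), _dot(x, x), _dot(y, y)
    denom = 1 + 2 * c * xy + c * c * x2 * y2
    return [((1 + 2 * c * xy + c * y2) * xi + (1 - c * x2) * yi) / denom
            for xi, yi in zip(x, y)]


def rsgd_step(theta, euclid_grad, lr, c=1.0):
    """Update theta by moving along the geodesic given by the Riemannian gradient."""
    # Conformal factor lambda at theta.
    lam = 2 / (1 - c * _dot(theta, theta))
    # Equation 9: rescale the Euclidean gradient by the inverse metric tensor.
    riem_grad = [g / lam ** 2 for g in euclid_grad]
    # Tangent vector to follow: the negative gradient scaled by the learning rate.
    v = [-lr * g for g in riem_grad]
    n = math.sqrt(_dot(v, v))
    if n == 0:
        return list(theta)
    # Equation 10 via the exponential map at theta (Equation 3).
    scale = math.tanh(math.sqrt(c) * lam * n / 2) / (math.sqrt(c) * n)
    return _mobius_add(theta, [scale * vi for vi in v], c)
```

Because the exponential map passes the step length through $\tanh$, the updated point lies inside the ball for moderate step sizes, whereas a plain Euclidean step could leave it.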
To evaluate both Poincaré and Euclidean models, we first test their performance on the knowledge graph link prediction task using standard WN18RR and FB15k-237 datasets:
FB15k-237 Toutanova et al. (2015) is a subset of Freebase, a database of real world facts, created from FB15k Bordes et al. (2013) by removing the inverse of many relations from validation and test sets to make the dataset more challenging. FB15k-237 contains 14,541 entities and 237 relations.
WN18RR Dettmers et al. (2018) is a subset of WordNet, a hierarchical database of relations between words, created in the same way as FB15k-237 from WN18 Bordes et al. (2013). WN18RR contains 40,943 entities and 11 relations.
We evaluate each triple from the test set as in Bordes et al. (2013): we generate $n_e$ (where $n_e$ denotes number of entities in the dataset) evaluation triples for each test triple by keeping the subject entity $e_s$ and relation $r$ fixed and replacing the object entity $e_o$ with all possible entities, and similarly keeping $e_o$ and $r$ fixed and varying $e_s$. The scores obtained for each evaluation triple are ranked. All true triples are removed from the evaluation triples apart from the current test triple, i.e. the commonly used filtered setting Bordes et al. (2013). We evaluate our models using the evaluation metrics standard across the link prediction literature: mean reciprocal rank (MRR) and hits@$k$, $k \in \{1, 3, 10\}$. Mean reciprocal rank is the average, over all test triples, of the inverse of the rank assigned to the true triple among its evaluation triples. Hits@$k$ measures the percentage of times the true triple appears in the top $k$ ranked evaluation triples.
We implement both models in PyTorch and make our code publicly available (https://github.com/ibalazevic/multirelational-poincare). We select the learning rate by MRR on the validation set, separately for each dataset; on each of WN18RR and FB15k-237, the same value is best for both models. We initialize all embeddings near the origin where distances are small in hyperbolic space, similar to Nickel and Kiela (2017). We set the batch size to 128 and use a fixed number of negative samples per true triple. In all experiments, we keep the curvature of MuRP fixed, since preliminary experiments showed that any material change reduced performance.
Table 1 shows the results obtained for both datasets. As expected, MuRE performs slightly better on the non-hierarchical FB15k-237 dataset, whereas MuRP performs better on WN18RR, which contains hierarchical relations (as shown in Section 5.3). Both MuRE and MuRP outperform previous state-of-the-art models on WN18RR on all metrics apart from hits@1, where MuRP obtains the second best overall result. In fact, even at relatively low embedding dimensionality, this is maintained, demonstrating the ability of hyperbolic models to succinctly represent multiple hierarchies. On FB15k-237, MuRE is outperformed only by TuckER Balažević et al. (2019), a model capable of multi-task learning between relations, which is highly advantageous on that dataset due to a large number of relations compared to WN18RR and thus relatively little data per relation in some cases.
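The filtered ranking protocol and the two metrics described above can be sketched as follows. The function names `filtered_rank` and `mrr_and_hits` are hypothetical; candidates are identified by their index in a list of scores, and ties are resolved in favour of the true triple, which is an assumption the text does not specify.

```python
def filtered_rank(scores, true_idx, known_idxs):
    """Rank of the true candidate after filtering out other known-true candidates.

    scores     : one score per candidate entity (higher means more plausible)
    true_idx   : index of the entity that completes the current test triple
    known_idxs : indices of other entities that also form true triples
    """
    true_score = scores[true_idx]
    rank = 1
    for i, s in enumerate(scores):
        if i == true_idx or i in known_idxs:
            continue                      # filtered setting: skip other true triples
        if s > true_score:
            rank += 1
    return rank


def mrr_and_hits(ranks, ks=(1, 3, 10)):
    """Mean reciprocal rank and hits@k (as fractions) over a list of ranks."""
    mrr = sum(1 / r for r in ranks) / len(ranks)
    hits = {k: sum(r <= k for r in ranks) / len(ranks) for k in ks}
    return mrr, hits
```

In a full evaluation, each test triple would contribute two ranks, one from varying the object and one from varying the subject.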
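Table 1: Link prediction results on WN18RR and FB15k-237.
Model | WN18RR MRR | WN18RR Hits@10 | WN18RR Hits@3 | WN18RR Hits@1 | FB15k-237 MRR | FB15k-237 Hits@10 | FB15k-237 Hits@3 | FB15k-237 Hits@1 |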
TransE Bordes et al. (2013) | ||||||||
DistMult Yang et al. (2015) | ||||||||
ComplEx Trouillon et al. (2016) | ||||||||
Neural LP Yang et al. (2017) | ||||||||
MINERVA Das et al. (2018) | ||||||||
ConvE Dettmers et al. (2018) | ||||||||
ComplEx-N3 Lacroix et al. (2018) | ||||||||
M-Walk Shen et al. (2018) | ||||||||
TuckER Balažević et al. (2019) | ||||||||
RotatE Sun et al. (2019) | ||||||||
MuRE | ||||||||
MuRE | ||||||||
MuRP | ||||||||
MuRP |
Effect of dimensionality We compare the MRR achieved by MuRE and MuRP on WN18RR for embeddings of different dimensionalities $d$. As expected, the difference between MRRs is greatest at lower embedding dimensionality (see Figure 2(a)).
Convergence rate Figure 2(b) shows the MRR per epoch for MuRE and MuRP on the WN18RR training and validation sets, showing that MuRP also converges faster.
Performance per relation Since not every relation in WN18RR induces a hierarchical structure over the entities, we report the Krackhardt hierarchy score (Khs) Krackhardt (2014) of the entity graph formed by each relation to obtain a measure of the hierarchy induced by each relation. The score is defined only for directed networks and measures the proportion of node pairs $(x, y)$ where there exists a directed path $x \to y$, but not $y \to x$ (see Appendix A for further details). The score takes a value of one for all directed acyclic graphs, and zero for cycles and cliques. We also report the length of the longest path (i.e. tree depth) for hierarchical relations as both need to be considered. To gain insight as to which relations benefit most from embedding entities in hyperbolic space, we compare Hits@10 per relation of MuRE and MuRP for entity embeddings of low dimensionality. From Table 2 we see that both models achieve comparable performance on non-hierarchical, symmetric relations with the Krackhardt hierarchy score 0, such as “similar_to” and “verb_group”, whereas MuRP generally outperforms MuRE on hierarchical relations. We also see that the difference between the performances of MuRE and MuRP is generally larger for relations that form deeper trees, fitting the hypothesis that hyperbolic space is of most benefit for modelling hierarchical relations.
Computing the Krackhardt hierarchy score for FB15k-237, we find that of the relations have , however, the average of longest path lengths over those relations is with only relations having paths longer than 2, meaning that the vast majority of relational sub-graphs consist of directed edges between pairs of nodes, rather than a tree.
Relation Name | MuRE | MuRP | Khs | Longest Path | |
hypernym | |||||
has_part | |||||
member_meronym | |||||
also_see | |||||
synset_domain_topic_of | |||||
instance_hypernym | |||||
member_of_domain_region | |||||
member_of_domain_usage | |||||
derivationally_related_form | |||||
similar_to | |||||
verb_group |
Biases vs embedding vector norms We plot the norms versus the biases for MuRP and MuRE in Figure 3. This shows an overall correlation between embedding vector norm and bias (or radius of the sphere of influence) for both MuRE and MuRP. This makes sense intuitively, as the sphere of influence increases to “fill out the space” in regions that are less cluttered, i.e. further from the origin.
Spatial layout In Figure 4, we show a 40-dimensional subject embedding for the word asia and a random subset of 1500 object embeddings for the hierarchical WN18RR relation “has_part”, projected to 2 dimensions so that distances and angles of object entity embeddings relative to the subject entity embedding are preserved (see Appendix B for details of the projection method). We show subject and object entity embeddings before and after relation-specific transformation. For both MuRE and MuRP, we see that applying the relation-specific transformation separates true object entities from false ones. However, in the Poincaré model, where distances increase further from the origin, embeddings are moved further towards the boundary of the disk, where, loosely speaking, there is more space to separate and therefore distinguish them.
Quality of learned embeddings Here we analyze the false positives and false negatives predicted by both models. MuRP predicts 15 false positives and 0 false negatives, whereas MuRE predicts only 2 false positives and 1 false negative, so seemingly performs better. However, inspecting the false positives predicted by MuRP, we find they are all countries on the Asian continent (e.g. sri_lanka, palestine, malaysia, sakartvelo, thailand), so are actually correct, but missing from the dataset. MuRE’s predicted false positives (philippines and singapore) are both also correct but missing, whereas the false negative (bahrain) is indeed falsely predicted. We note that this suggests current evaluation methods may be unreliable.
We introduce a novel, theoretically inspired, translational method for embedding multi-relational graph data in the Poincaré ball model of hyperbolic geometry. Our multi-relational Poincaré model MuRP learns relation-specific parameters to transform entity embeddings by Möbius matrix-vector multiplication and Möbius addition. We show that MuRP outperforms its Euclidean counterpart MuRE and existing models on the link prediction task on the hierarchical WN18RR knowledge graph dataset, and requires far lower dimensionality to achieve comparable performance to its Euclidean analogue. We analyze various properties of the Poincaré model compared to its Euclidean analogue and provide insight through a visualization of the learned embeddings.
Future work may include investigating the impact of recently introduced Riemannian adaptive optimization methods compared to Riemannian SGD. Also, given not all relations in a knowledge graph are hierarchical, we may look into combining the Euclidean and hyperbolic models to produce mixed-curvature embeddings that best fit the curvature of the data.
We thank Rik Sarkar, Ivan Titov, Jonathan Mallinson and Eryk Kopczyński for helpful comments on this manuscript. Ivana Balažević and Carl Allen were supported by the Centre for Doctoral Training in Data Science, funded by EPSRC (grant EP/L016427/1) and the University of Edinburgh.
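Das et al. (2018). Go for a Walk and Arrive at the Answer: Reasoning over Paths in Knowledge Bases Using Reinforcement Learning. In International Conference on Learning Representations, 2018.
Dettmers et al. (2018). Convolutional 2D Knowledge Graph Embeddings. In Association for the Advancement of Artificial Intelligence, 2018.
Pennington et al. (2014). GloVe: Global Vectors for Word Representation. In Empirical Methods in Natural Language Processing, 2014.
Appendix A: Krackhardt hierarchy score Let $\mathbf{R} \in \{0, 1\}^{n \times n}$ be the binary reachability matrix of a directed graph $G$ with $n$ nodes, with $R_{i,j} = 1$ if there exists a directed path from node $i$ to node $j$ and $0$ otherwise. The Krackhardt hierarchy score of $G$ Krackhardt (2014) is defined as:
$$\mathrm{Khs}_G = \frac{\sum_{i=1}^{n} \sum_{j=1}^{n} \mathbb{1}\left(R_{i,j} = 1 \wedge R_{j,i} = 0\right)}{\sum_{i=1}^{n} \sum_{j=1}^{n} \mathbb{1}\left(R_{i,j} = 1\right)}. \tag{11}$$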
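The score in Equation 11 can be computed directly from an edge list, as in the sketch below. The function names are hypothetical, reachability is obtained with a simple Warshall-style transitive closure (cubic in the number of nodes, so suitable only for small graphs), and returning 0.0 for a graph with no edges is a convention chosen here, not something the definition prescribes.

```python
def reachability(n, edges):
    """Boolean reachability matrix of a directed graph with nodes 0..n-1."""
    reach = [[False] * n for _ in range(n)]
    for i, j in edges:
        reach[i][j] = True
    # Warshall's algorithm: allow node k as an intermediate step.
    for k in range(n):
        for i in range(n):
            if reach[i][k]:
                for j in range(n):
                    if reach[k][j]:
                        reach[i][j] = True
    return reach


def krackhardt_hierarchy_score(n, edges):
    """Share of reachable ordered pairs (i, j) for which j cannot reach i."""
    reach = reachability(n, edges)
    one_way = 0
    reachable = 0
    for i in range(n):
        for j in range(n):
            if reach[i][j]:
                reachable += 1
                if not reach[j][i]:
                    one_way += 1
    return one_way / reachable if reachable else 0.0
```

A small tree such as `[(0, 1), (0, 2), (1, 3)]` scores 1, while a directed 3-cycle scores 0, matching the two extremes described in the text.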
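Appendix B: 2D projection To project high-dimensional embeddings to 2 dimensions for visualization purposes, we use the following method to compute dimensions $x$ and $y$ for projection $\mathbf{e}_i'$ of entity $e_i$:
$e'_{i,x} = \left\langle \frac{\mathbf{e}_s}{\|\mathbf{e}_s\|}, \mathbf{e}_i \right\rangle$, $i \in \{1, \dots, n_o\}$, where $\mathbf{e}_s$ is the original high-dimensional subject entity embedding and $n_o$ is the number of object entity embeddings.
$e'_{i,y} = \sqrt{\|\mathbf{e}_i\|^2 - (e'_{i,x})^2}$.
This projects the reference subject entity embedding onto the $x$-axis ($e'_{s,y} = 0$) and all object entity embeddings are positioned relative to it, according to their $x$ component aligned with the subject entity and their “remaining” $y$ component.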
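The projection can be sketched as below, under the assumption that the first coordinate is the component of each embedding along the unit subject direction and the second is whatever norm remains. The function name `project_2d` is hypothetical and embeddings are plain lists of floats.

```python
import math


def project_2d(e_s, embeddings):
    """Project embeddings to 2D relative to the subject embedding e_s.

    x is the component along the unit vector e_s / ||e_s||; y is the remaining
    norm, so each embedding keeps its norm and its Euclidean distance to e_s.
    """
    s_norm = math.sqrt(sum(a * a for a in e_s))
    unit = [a / s_norm for a in e_s]
    projected = []
    for e in embeddings:
        x = sum(u * a for u, a in zip(unit, e))
        sq_norm = sum(a * a for a in e)
        y = math.sqrt(max(sq_norm - x * x, 0.0))   # clamp tiny negative round-off
        projected.append((x, y))
    return projected
```

Projecting the subject embedding itself gives a point on the x-axis at distance $\|\mathbf{e}_s\|$ from the origin, up to floating-point error.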