# Hyperbolic Entailment Cones for Learning Hierarchical Embeddings

Learning graph representations via low-dimensional embeddings that preserve relevant network properties is an important class of problems in machine learning. We here present a novel method to embed directed acyclic graphs. Following prior work, we first advocate for using hyperbolic spaces which provably model tree-like structures better than Euclidean geometry. Second, we view hierarchical relations as partial orders defined using a family of nested geodesically convex cones. We prove that these entailment cones admit an optimal shape with a closed form expression both in the Euclidean and hyperbolic spaces. Moreover, they canonically define the embedding learning process. Experiments show significant improvements of our method over strong recent baselines both in terms of representational capacity and generalization.

## 1 Introduction

Producing high quality feature representations of data such as text or images is a central point of interest in artificial intelligence. A large line of research focuses on embedding discrete data such as graphs (Grover & Leskovec, 2016; Goyal & Ferrara, 2017) or linguistic instances (Mikolov et al., 2013; Pennington et al., 2014; Kiros et al., 2015) into continuous spaces that exhibit certain desirable geometric properties. This class of models has reached state-of-the-art results for various tasks and applications, such as link prediction in knowledge bases (Nickel et al., 2011; Bordes et al., 2013) or in social networks (Hoff et al., 2002), text disambiguation (Ganea & Hofmann, 2017), word hypernymy (Shwartz et al., 2016), textual entailment (Rocktäschel et al., 2015) or taxonomy induction (Fu et al., 2014).

Popular methods typically embed symbolic objects in low dimensional Euclidean vector spaces using a strategy that aims to capture semantic information such as functional similarity. Symmetric distance functions are usually minimized between representations of correlated items during the learning process. Popular examples are word embedding algorithms trained on corpus co-occurrence statistics, which have been shown to strongly relate semantically close words and their topics (Mikolov et al., 2013; Pennington et al., 2014).

However, in many fields (e.g. Recommender Systems, Genomics (Billera et al., 2001), Social Networks), one has to deal with data whose latent anatomy is best defined by non-Euclidean spaces such as Riemannian manifolds (Bronstein et al., 2017). Here, the Euclidean symmetric models suffer from not properly reflecting complex data patterns such as the latent hierarchical structure inherent in taxonomic data. To address this issue, the emerging trend of geometric deep learning is concerned with non-Euclidean manifold representation learning.

In this work, we are interested in geometrical modeling of hierarchical structures, directed acyclic graphs (DAGs) and entailment relations via low dimensional embeddings. Starting from the same motivation, the order embeddings method (Vendrov et al., 2015) explicitly models the partial order induced by entailment relations between embedded objects. Formally, a vector $x \in \mathbb{R}^n$ represents a more general concept than any other embedding from the Euclidean entailment region $\mathcal{O}_x := \{y \mid y_i \geq x_i, \ \forall\, 1 \leq i \leq n\}$. A first concern is that the capacity of order embeddings grows only linearly with the embedding space dimension. Moreover, the regions $\mathcal{O}_x$ suffer from heavy intersections, implying that their disjoint volumes rapidly become bounded (for example, in $n$ dimensions, no $n+1$ distinct regions $\mathcal{O}_x$ can simultaneously have unbounded disjoint sub-volumes). As a consequence, representing wide (with high branching factor) and deep hierarchical structures in a bounded region of the Euclidean space would cause many points to end up undesirably close to each other. This also implies that Euclidean distances would no longer be capable of reflecting the original tree metric.

Fortunately, the hyperbolic space does not suffer from the aforementioned capacity problem because the volume of any ball grows exponentially with its radius, instead of polynomially as in the Euclidean space. This exponential growth property enables hyperbolic spaces to embed any weighted tree while almost preserving its metric (Gromov, 1987; Bowditch, 2006; Sarkar, 2011); see the end of Section 2.2 for a rigorous formulation. The tree-likeness of hyperbolic spaces has been extensively studied (Hamann, 2017). Moreover, hyperbolic spaces are used to visualize large hierarchies (Lamping et al., 1995), to efficiently forward information in complex networks (Krioukov et al., 2009; Cvetkovski & Crovella, 2009) or to embed heterogeneous, scale-free graphs (Shavitt & Tankel, 2008; Krioukov et al., 2010; Bläsius et al., 2016).

From a machine learning perspective, recently, hyperbolic spaces have been observed to provide powerful representations of entailment relations (Nickel & Kiela, 2017). The latent hierarchical structure surprisingly emerges as a simple reflection of the space's negative curvature. However, the approach of (Nickel & Kiela, 2017) suffers from a few drawbacks: first, their loss function causes most points to collapse on the border of the Poincaré ball, as exemplified in Figure 3. Second, the hyperbolic distance alone (being symmetric) is not capable of encoding asymmetric relations needed for entailment detection, thus a heuristic score is chosen to account for concept generality or specificity encoded in the embedding norm.

We draw inspiration from hyperbolic embeddings (Nickel & Kiela, 2017) and order embeddings (Vendrov et al., 2015). Our contributions are as follows:

• We address the aforementioned issues of (Nickel & Kiela, 2017) and (Vendrov et al., 2015). We propose to replace the entailment regions of order-embeddings by a more efficient and generic class of objects, namely geodesically convex entailment cones. These cones are defined on a large class of Riemannian manifolds and induce a partial ordering relation in the embedding space.

• The optimal entailment cones satisfying four natural properties surprisingly exhibit canonical closed-form expressions in both Euclidean and hyperbolic geometry that we rigorously derive.

• An efficient algorithm for learning hierarchical embeddings of directed acyclic graphs is presented. This learning process is driven by our entailment cones.

• Experimentally, we learn high quality embeddings and improve over experimental results in (Nickel & Kiela, 2017) and (Vendrov et al., 2015) on hypernymy link prediction for word embeddings, both in terms of capacity of the model and generalization performance.

We also compute an analytic closed-form expression for the exponential map in the $n$-dimensional Poincaré ball, allowing us to perform full Riemannian optimization (Bonnabel, 2013) in the Poincaré ball, as opposed to the approximate optimization method used by (Nickel & Kiela, 2017).

## 2 Mathematical preliminaries

We now briefly visit some key concepts needed in our work.

#### Notations.

We always use $\|\cdot\|$ to denote the Euclidean norm of a point (in both hyperbolic and Euclidean spaces). We also use $\langle \cdot, \cdot \rangle$ to denote the Euclidean scalar product.

### 2.1 Differential geometry

For a rigorous reasoning about hyperbolic spaces, one needs to use concepts in differential geometry, some of which we highlight here. For an in-depth introduction, we refer the reader to (Spivak, 1979) and (Hopper & Andrews, 2010).

#### Manifold.

A manifold $\mathcal{M}$ of dimension $n$ is a set that can be locally approximated by the Euclidean space $\mathbb{R}^n$. For instance, the sphere and the torus embedded in $\mathbb{R}^3$ are 2-dimensional manifolds, also called surfaces, as they can locally be approximated by $\mathbb{R}^2$. The notion of manifold is a generalization of the notion of surface.

#### Tangent space.

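For $x \in \mathcal{M}$, the tangent space $T_x\mathcal{M}$ of $\mathcal{M}$ at $x$ is defined as the $n$-dimensional vector space approximating $\mathcal{M}$ around $x$ to first order. It can be defined as the set of vectors $v$ that can be obtained as $v := c'(0)$, where $c : (-\varepsilon, \varepsilon) \to \mathcal{M}$ is a smooth path in $\mathcal{M}$ such that $c(0) = x$.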

#### Riemannian metric.

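A Riemannian metric $g$ on $\mathcal{M}$ is a collection $(g_x)_x$ of inner products $g_x : T_x\mathcal{M} \times T_x\mathcal{M} \to \mathbb{R}$ on each tangent space $T_x\mathcal{M}$, depending smoothly on $x$. Although it defines the geometry of $\mathcal{M}$ locally, it induces a global distance function $d : \mathcal{M} \times \mathcal{M} \to \mathbb{R}_+$ by setting $d(x, y)$ to be the infimum of all lengths of smooth curves joining $x$ to $y$ in $\mathcal{M}$, where the length $\ell$ of a curve $\gamma : [0, 1] \to \mathcal{M}$ is defined as

$$\ell(\gamma) = \int_0^1 \sqrt{g_{\gamma(t)}\big(\gamma'(t), \gamma'(t)\big)}\, dt. \tag{1}$$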

#### Riemannian manifold.

A smooth manifold $\mathcal{M}$ equipped with a Riemannian metric $g$ is called a Riemannian manifold. Subsequently, due to their metric properties, we will only consider such manifolds.

#### Geodesics.

A geodesic (straight line) between two points $x, y \in \mathcal{M}$ is a smooth curve of minimal length joining $x$ to $y$ in $\mathcal{M}$. Geodesics define shortest paths on the manifold. They are a generalization of lines in the Euclidean space.

#### Exponential map.

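The exponential map $\exp_x : T_x\mathcal{M} \to \mathcal{M}$ around $x$, when well-defined, maps a small perturbation of $x$ by a vector $v \in T_x\mathcal{M}$ to a point $\exp_x(v) \in \mathcal{M}$, such that $t \in [0, 1] \mapsto \exp_x(tv)$ is a geodesic joining $x$ to $\exp_x(v)$. In Euclidean space, we simply have $\exp_x(v) = x + v$. The exponential map is important, for instance, when performing gradient descent over parameters lying in a manifold (Bonnabel, 2013).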

#### Conformality.

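A metric $\tilde{g}$ on $\mathcal{M}$ is said to be conformal to $g$ if it defines the same angles, i.e. for all $x \in \mathcal{M}$ and $u, v \in T_x\mathcal{M} \setminus \{0\}$,

$$\frac{\tilde{g}_x(u, v)}{\sqrt{\tilde{g}_x(u, u)}\sqrt{\tilde{g}_x(v, v)}} = \frac{g_x(u, v)}{\sqrt{g_x(u, u)}\sqrt{g_x(v, v)}}. \tag{2}$$

This is equivalent to the existence of a smooth function $\lambda : \mathcal{M} \to (0, \infty)$ such that $\tilde{g}_x = \lambda_x^2 g_x$, which is called the conformal factor of $\tilde{g}$ (w.r.t. $g$).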

### 2.2 Hyperbolic geometry

The hyperbolic space of dimension $n \geq 2$ is a fundamental object in Riemannian geometry. It is (up to isometry) uniquely characterized as a complete, simply connected Riemannian manifold with constant negative curvature (Cannon et al., 1997). The other two model spaces of constant sectional curvature are the flat Euclidean space (zero curvature) and the hyper-sphere (positive curvature).

The hyperbolic space has five models which are often insightful to work in. They are isometric to each other and conformal to the Euclidean space (Cannon et al., 1997; Parkkonen, 2013). We prefer to work in the Poincaré ball model for the same reasons as (Nickel & Kiela, 2017) and, additionally, because we can derive a closed form expression of geodesics and exponential map.

#### Poincaré metric tensor.

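The Poincaré ball model $(\mathbb{D}^n, g^{\mathbb{D}})$ is defined by the manifold $\mathbb{D}^n = \{x \in \mathbb{R}^n : \|x\| < 1\}$ equipped with the following Riemannian metric

$$g^{\mathbb{D}}_x = \lambda_x^2\, g^E, \quad \text{where } \lambda_x := \frac{2}{1 - \|x\|^2}, \tag{3}$$

and $g^E$ is the Euclidean metric tensor, whose components are those of the identity matrix $\mathbf{I}_n$, of the standard space $\mathbb{R}^n$ with the usual Cartesian coordinates.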

As the above model is a Riemannian manifold, its metric tensor is fundamental in order to uniquely define most of its geometric properties like distances, inner products (in tangent spaces), straight lines (geodesics), curve lengths or volume elements. In the Poincaré ball model, the Euclidean metric is changed by a simple scalar field, hence the model is conformal (i.e. angle preserving), yet distorts distances.

#### Induced distance and norm.

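It is known (Nickel & Kiela, 2017) that the induced distance between two points $x, y \in \mathbb{D}^n$ is given by

$$d_{\mathbb{D}}(x, y) = \cosh^{-1}\left(1 + 2\,\frac{\|x - y\|^2}{(1 - \|x\|^2)\,(1 - \|y\|^2)}\right). \tag{4}$$

The Poincaré norm is then defined as:

$$\|x\|_{\mathbb{D}} := d_{\mathbb{D}}(0, x) = 2\tanh^{-1}(\|x\|). \tag{5}$$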
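As a quick numerical illustration of Eqs. 4 and 5, the following minimal Python sketch evaluates the induced distance and the Poincaré norm using only the standard library. The function names are illustrative choices, not taken from any released implementation, and points are assumed to be plain lists of floats with Euclidean norm strictly below 1.

```python
import math

def dot(x, y):
    """Euclidean scalar product of two equally long sequences."""
    return sum(a * b for a, b in zip(x, y))

def poincare_distance(x, y):
    """Induced distance of Eq. 4 between two points of the open unit ball."""
    diff = [a - b for a, b in zip(x, y)]
    num = 2.0 * dot(diff, diff)
    den = (1.0 - dot(x, x)) * (1.0 - dot(y, y))
    return math.acosh(1.0 + num / den)

def poincare_norm(x):
    """Poincare norm of Eq. 5, i.e. the distance from the origin."""
    return 2.0 * math.atanh(math.sqrt(dot(x, x)))

if __name__ == "__main__":
    p = [0.3, 0.4]
    print(poincare_distance([0.0, 0.0], p), poincare_norm(p))
```

Both printed values should agree, since Eq. 5 is Eq. 4 evaluated at the origin.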

#### Geodesics and exponential map.

We derive parametric expressions of unit-speed geodesics and of the exponential map in the Poincaré ball. Geodesics in $\mathbb{D}^n$ are all intersections of the Euclidean unit ball with (degenerate) Euclidean circles orthogonal to the unit sphere (equations are derived below). We know from the Hopf-Rinow theorem that the hyperbolic space is complete as a metric space. This guarantees that $\mathbb{D}^n$ is geodesically complete. Thus, the exponential map is defined for each point $x \in \mathbb{D}^n$ and any $v \in T_x\mathbb{D}^n$. To derive its closed form expression, we first prove the following.

###### Theorem 1.

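(Unit-speed geodesics) Let $x \in \mathbb{D}^n$ and $v \in T_x\mathbb{D}^n$ such that $g^{\mathbb{D}}_x(v, v) = 1$. The unit-speed geodesic $\gamma_{x,v}$ with $\gamma_{x,v}(0) = x$ and $\dot{\gamma}_{x,v}(0) = v$ is given by

$$\gamma_{x,v}(t) = \frac{\big(\lambda_x \cosh(t) + \lambda_x^2 \langle x, v \rangle \sinh(t)\big)\, x + \lambda_x \sinh(t)\, v}{1 + (\lambda_x - 1)\cosh(t) + \lambda_x^2 \langle x, v \rangle \sinh(t)}. \tag{6}$$

###### Proof.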

See appendix B. ∎

###### Corollary 1.1.

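(Exponential map) The exponential map at a point $x \in \mathbb{D}^n$, namely $\exp_x : T_x\mathbb{D}^n \to \mathbb{D}^n$, is given by

$$\exp_x(v) = \frac{\lambda_x\left(\cosh(\lambda_x\|v\|) + \left\langle x, \frac{v}{\|v\|}\right\rangle \sinh(\lambda_x\|v\|)\right)}{1 + (\lambda_x - 1)\cosh(\lambda_x\|v\|) + \lambda_x\left\langle x, \frac{v}{\|v\|}\right\rangle \sinh(\lambda_x\|v\|)}\, x + \frac{\frac{1}{\|v\|}\sinh(\lambda_x\|v\|)}{1 + (\lambda_x - 1)\cosh(\lambda_x\|v\|) + \lambda_x\left\langle x, \frac{v}{\|v\|}\right\rangle \sinh(\lambda_x\|v\|)}\, v. \tag{7}$$

###### Proof.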

See appendix C. ∎
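The closed form of Eq. 7 translates almost line by line into code. The sketch below is a minimal standard-library Python version under the assumption that points and tangent vectors are lists of floats; the names `exp_map` and `poincare_distance` are illustrative. A useful sanity check is that the point reached by $\exp_x(v)$ lies at hyperbolic distance $\lambda_x\|v\|$ from $x$, which is the Riemannian norm of $v$.

```python
import math

def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

def poincare_distance(x, y):
    """Induced distance of Eq. 4."""
    diff = [a - b for a, b in zip(x, y)]
    den = (1.0 - dot(x, x)) * (1.0 - dot(y, y))
    return math.acosh(1.0 + 2.0 * dot(diff, diff) / den)

def exp_map(x, v):
    """Exponential map of Eq. 7 at x applied to the tangent vector v."""
    v_norm = math.sqrt(dot(v, v))
    if v_norm == 0.0:
        return list(x)                      # exp_x(0) = x
    lam = 2.0 / (1.0 - dot(x, x))           # conformal factor lambda_x
    t = lam * v_norm                        # Riemannian length of v
    xv = dot(x, v) / v_norm                 # <x, v / ||v||>
    denom = 1.0 + (lam - 1.0) * math.cosh(t) + lam * xv * math.sinh(t)
    coef_x = lam * (math.cosh(t) + xv * math.sinh(t)) / denom
    coef_v = math.sinh(t) / (v_norm * denom)
    return [coef_x * a + coef_v * b for a, b in zip(x, v)]

if __name__ == "__main__":
    x, v = [0.1, -0.2], [0.3, 0.05]
    y = exp_map(x, v)
    lam_x = 2.0 / (1.0 - dot(x, x))
    print(poincare_distance(x, y), lam_x * math.sqrt(dot(v, v)))
```

The two printed numbers should match up to floating-point error, and the result always stays inside the unit ball.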

We also derive the following fact (useful for future proofs).

###### Corollary 1.2.

Given any arbitrary geodesic in $\mathbb{D}^n$, all its points are coplanar with the origin $O$.

###### Proof.

See appendix D. ∎

#### Angles in hyperbolic space.

It is natural to extend the Euclidean notion of an angle to any geodesically complete Riemannian manifold. For any points A, B, C on such a manifold, the angle $\angle ABC$ is the angle between the initial tangent vectors of the geodesics connecting B with A, and B with C, respectively. In the Poincaré ball, the angle between two tangent vectors $u, v \in T_x\mathbb{D}^n$ is given by

$$\cos(\angle(u, v)) = \frac{g^{\mathbb{D}}_x(u, v)}{\sqrt{g^{\mathbb{D}}_x(u, u)}\sqrt{g^{\mathbb{D}}_x(v, v)}} = \frac{\langle u, v \rangle}{\|u\|\,\|v\|}. \tag{8}$$

The second equality holds since $g^{\mathbb{D}}$ is conformal to $g^E$.

#### Hyperbolic trigonometry.

The notions of angles and geodesics allow us to define triangles in the Poincaré ball. Then, the classic theorems in Euclidean geometry have hyperbolic formulations (Parkkonen, 2013). In the next section, we will use the following theorems.

Let $A, B, C \in \mathbb{D}^n$. Denote $\angle B := \angle ABC$ and let $c = d_{\mathbb{D}}(B, A)$ be the length of the hyperbolic segment BA (and similarly for the other angles and sides). Then, the hyperbolic laws of cosines and sines hold respectively:

$$\cos(\angle B) = \frac{\cosh(a)\cosh(c) - \cosh(b)}{\sinh(a)\sinh(c)}, \tag{9}$$

$$\frac{\sin(\angle A)}{\sinh(a)} = \frac{\sin(\angle B)}{\sinh(b)} = \frac{\sin(\angle C)}{\sinh(c)}. \tag{10}$$

#### Embedding trees in hyperbolic vs Euclidean space.

Finally, we briefly explain why hyperbolic spaces are better suited than Euclidean spaces for embedding trees. However, note that our method is applicable to any DAG.

(Gromov, 1987) introduces a notion of $\delta$-hyperbolicity in order to characterize how 'hyperbolic' a metric space is. For instance, the Euclidean space $\mathbb{R}^n$ for $n \geq 2$ is not $\delta$-hyperbolic for any $\delta \geq 0$, while the Poincaré ball $\mathbb{D}^n$ is $\log(1 + \sqrt{2})$-hyperbolic. This is formalized in the following theorem (section 6.2 of (Gromov, 1987), proposition 6.7 of (Bowditch, 2006)):

Theorem: For any $\delta > 0$, any $\delta$-hyperbolic metric space $(X, d_X)$ and any set of points $x_1, \ldots, x_n \in X$, there exists a finite weighted tree $(T, d_T)$ and an embedding $f : T \to X$ such that for all $i, j$,

$$\left|d_T\big(f^{-1}(x_i), f^{-1}(x_j)\big) - d_X(x_i, x_j)\right| = O(\delta \log(n)). \tag{11}$$

Conversely, any tree can be embedded with arbitrary low distortion into the Poincaré disk (with only 2 dimensions), whereas this is not true for Euclidean spaces even when an unbounded number of dimensions is allowed (Sarkar, 2011; De Sa et al., 2018).

The difficulty in embedding trees having a branching factor of at least 2 in a quasi-isometric manner comes from the fact that their number of nodes increases exponentially with depth. The exponential volume growth of hyperbolic metric spaces gives them enough capacity to embed trees quasi-isometrically, unlike the Euclidean space.

## 3 Entailment Cones in the Poincaré Ball

In this section, we define “entailment” cones that will be used to embed hierarchical structures in the Poincaré ball. They generalize and improve over the idea of order embeddings (Vendrov et al., 2015).

#### Convex cones in a complete Riemannian manifold.

We are interested in generalizing the notion of a convex cone to any geodesically complete Riemannian manifold $\mathcal{M}$ (such as hyperbolic models). In a vector space, a convex cone $S$ (at the origin) is a set that is closed under non-negative linear combinations

$$v_1, v_2 \in S \implies \alpha v_1 + \beta v_2 \in S \quad (\forall \alpha, \beta \geq 0). \tag{12}$$

The key idea for generalizing this concept is to make use of the exponential map at a point $x \in \mathcal{M}$:

$$\exp_x : T_x\mathcal{M} \to \mathcal{M}, \qquad T_x\mathcal{M} = \text{tangent space at } x. \tag{13}$$

We can now take any cone $S \subseteq T_x\mathcal{M}$ in the tangent space at a fixed point $x$ and map it into a set $\mathfrak{S}_x \subset \mathcal{M}$, which we call the $S$-cone at $x$, via

$$\mathfrak{S}_x := \exp_x(S), \qquad S \subseteq T_x\mathcal{M}. \tag{14}$$

Note that, in the above definition, we desire that the exponential map be injective. We already know that it is a local diffeomorphism. Thus, we restrict the tangent space in Eq. 14 to the ball $\mathcal{B}^n(0, r)$, where $r$ is the injectivity radius of $\mathcal{M}$ at $x$. Note that for hyperbolic space models the injectivity radius of the tangent space at any point is infinite, thus no restriction is needed.

#### Angular cones in the Poincaré ball.

We are interested in special types of cones in $\mathbb{D}^n$ that can extend in all space directions. We want to avoid heavy cone intersections and to have capacity that scales exponentially with the space dimension. To achieve this, we want the definition of cones to exhibit the four intuitive properties detailed below. Subsequently, solely based on these necessary conditions, we formally prove that the optimal cones in the Poincaré ball have a closed form expression.

1) Axial symmetry. For any $x \in \mathbb{D}^n \setminus \{0\}$, we require circular symmetry with respect to a central axis of the cone $\mathfrak{S}_x$. We define this axis to be the spoke through $x$, starting at $x$:

$$A_x := \left\{x' \in \mathbb{D}^n : x' = \alpha x, \ \frac{1}{\|x\|} > \alpha \geq 1\right\}. \tag{15}$$

Then, we fix any tangent vector $\bar{x} \in T_x\mathbb{D}^n$ with the same direction as $x$. One can verify using Corollary 1.1 that $\bar{x}$ generates the axis-oriented geodesic as:

$$A_x = \exp_x\big(\{y \in \mathbb{R}^n : y = \alpha \bar{x}, \ \alpha > 0\}\big). \tag{16}$$

We next define the angle $\angle(v, \bar{x})$ for any tangent vector $v \in T_x\mathbb{D}^n$ as in Eq. 8. Then, the axial symmetry property is satisfied if we define the angular cone at $x$ to have a non-negative aperture $2\psi(x) \geq 0$ as follows:

$$S^{\psi(x)}_x := \{v \in T_x\mathbb{D}^n : \angle(v, \bar{x}) \leq \psi(x)\}, \qquad \mathfrak{S}^{\psi(x)}_x := \exp_x\big(S^{\psi(x)}_x\big). \tag{17}$$

We further define the conic border (face):

$$\partial S^{\psi(x)}_x := \{v : \angle(v, \bar{x}) = \psi(x)\}, \qquad \partial \mathfrak{S}^{\psi(x)}_x := \exp_x\big(\partial S^{\psi(x)}_x\big). \tag{18}$$

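2) Rotation invariance. We want the definition of cones to be independent of the angular coordinate of the apex $x$, i.e. to only depend on the (Euclidean) norm of $x$:

$$\psi(x) = \psi(x') \quad (\forall x, x' \in \mathbb{D}^n \setminus \{0\} \text{ s.t. } \|x\| = \|x'\|). \tag{19}$$

This implies that there exists a function $\tilde{\psi}$ of the norm such that for all $x \in \mathbb{D}^n \setminus \{0\}$ we have $\psi(x) = \tilde{\psi}(\|x\|)$.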

3) Continuous cone aperture functions. We require the aperture $2\psi(x)$ of our cones to be a continuous function. Using Eq. 19, this is equivalent to the continuity of $\tilde{\psi}$. This requirement seems reasonable and will be helpful in order to prove uniqueness of the optimal entailment cones. When optimization-based training is employed, it is also necessary that this function be differentiable. Surprisingly, we will show below that the optimal functions $\tilde{\psi}$ are actually smooth, even when only requiring continuity.

4) Transitivity of nested angular cones. We want cones to determine a partial order in the embedding space. The difficult property is transitivity. We are interested in defining a cone width function $\psi(x)$ such that the resulting angular cones satisfy the transitivity property of partial order relations, i.e. they form a nested structure as follows:

$$\forall x, x' \in \mathbb{D}^n \setminus \{0\} : \quad x' \in \mathfrak{S}^{\psi(x)}_x \implies \mathfrak{S}^{\psi(x')}_{x'} \subseteq \mathfrak{S}^{\psi(x)}_x. \tag{20}$$

#### Closed form expression of the optimal ψ.

We now analyze the implications of the above necessary properties. Surprisingly, the optimal form of the function $\psi$ admits an interesting closed-form expression. We will see below that mathematically $\psi$ cannot be defined on the entire open ball $\mathbb{D}^n$. Towards these goals, we first prove the following.

###### Lemma 2.

If transitivity holds, then

$$\forall x \in \mathrm{Dom}(\psi) : \quad \psi(x) \leq \frac{\pi}{2}. \tag{21}$$

###### Proof.

See appendix E. ∎

Note that so far we removed the origin $0$ of $\mathbb{D}^n$ from our definitions. However, the above surprising lemma implies that we cannot define a useful cone at the origin. To see this, we first note that the origin should "entail" the entire space $\mathbb{D}^n$, i.e. $\mathfrak{S}_0 = \mathbb{D}^n$. Second, similarly to property 3, we desire the cone at 0 to be a continuous deformation of the cones of any sequence of points $(x_n)_{n \geq 0}$ in $\mathbb{D}^n \setminus \{0\}$ that converges to 0. Formally, $\lim_{n \to \infty} \mathfrak{S}_{x_n} = \mathfrak{S}_0$ when $\lim_{n \to \infty} x_n = 0$. However, this is impossible because Lemma 2 implies that the cone at each point $x_n$ can only cover at most half of $\mathbb{D}^n$. We further prove the following:

###### Theorem 3.

If transitivity holds, then the function

$$h : (0, 1) \cap \mathrm{Dom}(\tilde{\psi}) \to \mathbb{R}_+, \qquad h(r) := \frac{r}{1 - r^2}\sin(\tilde{\psi}(r)), \tag{22}$$

is non-increasing.

###### Proof.

See appendix F. ∎

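The above theorem implies that a non-zero $\tilde{\psi}$ cannot be defined on the entire $(0, 1)$ because $\lim_{r \to 0} h(r) = 0$ for any function $\tilde{\psi}$. As a consequence, we are forced to restrict $\mathrm{Dom}(\tilde{\psi})$ to some $[\epsilon, 1)$, i.e. to leave the open ball $\mathcal{B}^n(0, \epsilon)$ outside of the domain of $\psi$. Then, Theorem 3 implies that

$$\forall r \in [\epsilon, 1) : \quad \sin(\tilde{\psi}(r))\,\frac{r}{1 - r^2} \leq \sin(\tilde{\psi}(\epsilon))\,\frac{\epsilon}{1 - \epsilon^2}. \tag{23}$$

Since we are interested in cones with an aperture as large as possible (to maximize model capacity), it is natural to set all terms $h(r)$ equal to $K := h(\epsilon)$, i.e. to make $h$ constant:

$$\forall r \in [\epsilon, 1) : \quad \sin(\tilde{\psi}(r))\,\frac{r}{1 - r^2} = K, \tag{24}$$

which gives both a restriction on $\epsilon$ (in terms of $K$):

$$K \leq \frac{\epsilon}{1 - \epsilon^2} \iff \epsilon \in \left[\frac{2K}{1 + \sqrt{1 + 4K^2}},\ 1\right), \tag{25}$$

as well as a closed form expression for $\psi$:

$$\psi : \mathbb{D}^n \setminus \mathcal{B}^n(O, \epsilon) \to (0, \pi/2), \qquad x \mapsto \arcsin\!\left(K\,\frac{1 - \|x\|^2}{\|x\|}\right), \tag{26}$$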

which is also a sufficient condition for transitivity to hold:

###### Theorem 4.

If $\psi$ is defined as in Eqs. 25-26, then transitivity holds.

The above theorem has a proof similar to that of Thm. 3.
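To make Eqs. 25 and 26 concrete, here is a minimal Python sketch that computes the smallest admissible radius $\epsilon$ for a given $K$ and the resulting half-aperture $\psi(x)$. The function names and the value $K = 0.1$ used in the demo are illustrative assumptions, and the small tolerance in the domain check is only there to absorb floating-point rounding.

```python
import math

def min_radius(K):
    """Smallest epsilon allowed by Eq. 25 for a given constant K > 0."""
    return 2.0 * K / (1.0 + math.sqrt(1.0 + 4.0 * K * K))

def half_aperture(x, K):
    """Half-aperture psi(x) of Eq. 26 for a point x outside the ball of radius epsilon."""
    r = math.sqrt(sum(a * a for a in x))
    if r < min_radius(K) - 1e-12 or r >= 1.0:
        raise ValueError("x must satisfy epsilon <= ||x|| < 1")
    # Clip to 1 so that rounding at ||x|| = epsilon cannot push arcsin out of range.
    return math.asin(min(1.0, K * (1.0 - r * r) / r))

if __name__ == "__main__":
    K = 0.1  # illustrative value
    eps = min_radius(K)
    for r in (eps, 0.3, 0.6, 0.9):
        print(round(r, 4), half_aperture([r, 0.0], K))
```

The half-aperture equals $\pi/2$ at $\|x\| = \epsilon$ and shrinks monotonically as the apex moves towards the border of the ball, which is what the constancy of $h$ in Eq. 24 requires.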

So far, we have obtained a closed form expression for hyperbolic entailment cones. However, we still need to understand how they can be used during embedding learning. For this goal, we derive an equivalent (and more practical) definition of the cone $\mathfrak{S}^{\psi(x)}_x$:

###### Theorem 5.

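For any $x, y \in \mathbb{D}^n \setminus \mathcal{B}^n(O, \epsilon)$, we denote the angle between the half-lines $(xy$ and $(0x$ as

$$\Xi(x, y) := \pi - \angle Oxy. \tag{27}$$

Then, this angle equals

$$\Xi(x, y) = \arccos\!\left(\frac{\langle x, y \rangle (1 + \|x\|^2) - \|x\|^2 (1 + \|y\|^2)}{\|x\| \cdot \|x - y\| \sqrt{1 + \|x\|^2 \|y\|^2 - 2\langle x, y \rangle}}\right). \tag{28}$$

Moreover, we have the following equivalent expression of the Poincaré entailment cones satisfying Eq. 26:

$$\mathfrak{S}^{\psi(x)}_x = \left\{y \in \mathbb{D}^n \ \middle|\ \Xi(x, y) \leq \arcsin\!\left(K\,\frac{1 - \|x\|^2}{\|x\|}\right)\right\}. \tag{29}$$

###### Proof.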

See appendix G. ∎
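Theorem 5 reduces cone membership to a single angle comparison, which the following minimal Python sketch spells out. The names `xi` and `in_cone` and the value of `K` are illustrative assumptions; the cosine is clipped to $[-1, 1]$ because rounding can push it slightly outside that range, and the case $y = x$ (zero denominator) is not handled.

```python
import math

def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

def xi(x, y):
    """Angle Xi(x, y) of Eq. 28 for two distinct points of the Poincare ball."""
    xy, nx2, ny2 = dot(x, y), dot(x, x), dot(y, y)
    diff = [a - b for a, b in zip(x, y)]
    num = xy * (1.0 + nx2) - nx2 * (1.0 + ny2)
    den = math.sqrt(nx2) * math.sqrt(dot(diff, diff)) \
        * math.sqrt(1.0 + nx2 * ny2 - 2.0 * xy)
    return math.acos(max(-1.0, min(1.0, num / den)))

def in_cone(x, y, K):
    """Membership test of Eq. 29: does y lie in the entailment cone with apex x?"""
    r = math.sqrt(dot(x, x))
    psi = math.asin(min(1.0, K * (1.0 - r * r) / r))
    return xi(x, y) <= psi

if __name__ == "__main__":
    apex = [0.5, 0.0]
    print(in_cone(apex, [0.8, 0.0], K=0.1))   # point further out on the axis
    print(in_cone(apex, [0.2, 0.0], K=0.1))   # point between the origin and the apex
```

Points on the axis $A_x$ give $\Xi = 0$ and are always inside the cone, while points between the origin and the apex give $\Xi = \pi$ and are always outside.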

Examples of 2-dimensional Poincaré cones corresponding to apex points located at different radii from the origin are shown in Figure 2. This figure also shows that transitivity is satisfied for some points on the border of the hypercones.

#### Euclidean entailment cones.

One can easily adapt the above proofs to derive entailment cones in the Euclidean space $\mathbb{R}^n$. The only adaptations are: i) replace the hyperbolic cosine law by the usual Euclidean cosine law, ii) geodesics are straight lines, and iii) the exponential map is given by $\exp_x(v) = x + v$. Thus, one similarly obtains that $h(r) = r \sin(\tilde{\psi}(r))$ is non-increasing (as in Theorem 3), that the optimal values of $\psi$ are obtained for constant $h$ being equal to $K \leq \epsilon$, and that

$$\mathfrak{S}^{\psi(x)}_x = \{y \in \mathbb{R}^n \mid \Xi(x, y) \leq \psi(x)\}, \tag{30}$$

where $\Xi(x, y)$ now becomes

$$\Xi(x, y) = \arccos\!\left(\frac{\|y\|^2 - \|x\|^2 - \|x - y\|^2}{2\,\|x\| \cdot \|x - y\|}\right), \tag{31}$$

for all $x, y \in \mathbb{R}^n \setminus \mathcal{B}^n(O, \epsilon)$. From a learning perspective, there is no need to be concerned about the Riemannian optimization described in Section 4.2, as the usual Euclidean gradient step is used in this case.
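The Euclidean cones admit the same kind of membership test. The sketch below is a minimal Python illustration of Eqs. 30 and 31; it assumes the half-aperture $\psi(x) = \arcsin(K / \|x\|)$, which is what a constant $h(r) = r\sin(\tilde{\psi}(r)) = K$ gives. Function names and the value of `K` are illustrative, and $y = x$ is not handled.

```python
import math

def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

def euclidean_xi(x, y):
    """Angle Xi(x, y) of Eq. 31: angle at x between the ray from the origin and the segment to y."""
    diff = [a - b for a, b in zip(x, y)]
    d2 = dot(diff, diff)
    num = dot(y, y) - dot(x, x) - d2
    den = 2.0 * math.sqrt(dot(x, x)) * math.sqrt(d2)
    return math.acos(max(-1.0, min(1.0, num / den)))

def in_euclidean_cone(x, y, K):
    """Membership test of Eq. 30, assuming psi(x) = arcsin(K / ||x||) with ||x|| >= K."""
    psi = math.asin(min(1.0, K / math.sqrt(dot(x, x))))
    return euclidean_xi(x, y) <= psi

if __name__ == "__main__":
    apex = [1.0, 0.0]
    print(in_euclidean_cone(apex, [2.0, 0.1], K=0.5))
    print(in_euclidean_cone(apex, [1.0, 1.0], K=0.5))
```

Here the first point deviates only slightly from the axis and falls inside the cone, whereas the second one is perpendicular to the axis as seen from the apex and falls outside.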

## 4 Learning with entailment cones

We now describe how embedding learning is performed.

### 4.1 Max-margin training on angles

We learn hierarchical word embeddings from a dataset of entailment relations $(u, v)$, also called hypernym links, defining that $u$ entails $v$, or, equivalently, that $v$ is a subconcept of $u$ (we prefer this notation over the one in (Nickel & Kiela, 2017)).

We choose to model the embedding entailment relation $(u, v)$ as $v$ belonging to the entailment cone $\mathfrak{S}^{\psi(u)}_u$.

Our model is trained with a max-margin loss function similar to the one in (Vendrov et al., 2015):

$$\mathcal{L} = \sum_{(u, v) \in P} E(u, v) + \sum_{(u', v') \in N} \max\big(0, \gamma - E(u', v')\big), \tag{32}$$

for some margin $\gamma > 0$, where $P$ and $N$ define samples of positive and negative edges respectively. The energy $E(u, v)$ measures the penalty of a wrongly classified pair $(u, v)$, which in our case measures how far the point $v$ is from belonging to $\mathfrak{S}^{\psi(u)}_u$, expressed as the smallest angle of a rotation of center $u$ bringing $v$ into $\mathfrak{S}^{\psi(u)}_u$:

$$E(u, v) := \max\big(0, \Xi(u, v) - \psi(u)\big), \tag{33}$$

where $\Xi(u, v)$ is defined in Eqs. 28 and 31. Note that (Vendrov et al., 2015) use $\|\max(0, v - u)\|^2$. This loss function encourages positive samples to satisfy $E(u, v) = 0$ and negative ones to satisfy $E(u, v) \geq \gamma$. The same loss is used both in the hyperbolic and Euclidean cases.
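The loss of Eqs. 32 and 33 is simple enough to write out directly. The following minimal Python sketch takes the angles $\Xi(u, v)$ and half-apertures $\psi(u)$ as precomputed numbers, so that it applies unchanged to the hyperbolic and Euclidean cases; the function names and the sample values are illustrative assumptions.

```python
def energy(xi_uv, psi_u):
    """Energy of Eq. 33: angular amount by which v falls outside the cone at u."""
    return max(0.0, xi_uv - psi_u)

def max_margin_loss(pos_energies, neg_energies, margin):
    """Loss of Eq. 32 from precomputed energies of positive and negative pairs."""
    pos_term = sum(pos_energies)
    neg_term = sum(max(0.0, margin - e) for e in neg_energies)
    return pos_term + neg_term

if __name__ == "__main__":
    # Hypothetical (Xi, psi) values, in radians, for two positive and two negative pairs.
    positives = [energy(0.10, 0.30), energy(0.50, 0.30)]
    negatives = [energy(0.20, 0.30), energy(1.00, 0.30)]
    print(max_margin_loss(positives, negatives, margin=0.1))
```

A positive pair contributes nothing once $v$ is inside the cone at $u$, and a negative pair contributes nothing once it is at least a margin $\gamma$ outside.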

### 4.2 Full Riemannian optimization

As the parameters of the model live in the hyperbolic space, the back-propagated gradient is a Riemannian gradient. Indeed, if $u$ is in the Poincaré ball, and if we compute the usual (Euclidean) gradient $\nabla_u \mathcal{L}$ of our loss, then

$$u \leftarrow u - \eta \nabla_u \mathcal{L} \tag{34}$$

makes no sense as an operation in the Poincaré ball, since the subtraction operation is not defined in this manifold. Instead, one should compute the Riemannian gradient $\nabla^R_u \mathcal{L}$ indicating a direction in the tangent space $T_u\mathbb{D}^n$, and should move $u$ along the corresponding geodesic in $\mathbb{D}^n$ (Bonnabel, 2013):

$$u \leftarrow \exp_u\big(-\eta \nabla^R_u \mathcal{L}\big), \tag{35}$$

where the Riemannian gradient is obtained by rescaling the Euclidean gradient by the inverse of the metric tensor. As our metric is conformal, i.e. $g^{\mathbb{D}} = \lambda^2 g^E$ where $g^E$ is the Euclidean metric (see Eq. 3), this leads to a simple formulation

$$\nabla^R_u \mathcal{L} = (1/\lambda_u)^2\, \nabla_u \mathcal{L}. \tag{36}$$

Previous work (Nickel & Kiela, 2017) optimizing word embeddings in the Poincaré ball used the retraction map $R_x(v) := x + v$ as a first order approximation of $\exp_x(v)$. Note that since we derived a closed-form expression of the exponential map in the Poincaré ball (Corollary 1.1), we are able to perform full Riemannian optimization in this model of the hyperbolic space.
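The two update rules can be compared side by side in a few lines. The sketch below is a minimal Python illustration of one optimization step using either the exponential map (Eqs. 35 and 36 with Eq. 7) or the retraction $R_x(v) = x + v$. Function names, the learning rate, and the safety clipping of the retraction result to norm at most $1 - 10^{-5}$ are assumptions made for illustration.

```python
import math

def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

def riemannian_grad(u, euclid_grad):
    """Eq. 36: rescale the Euclidean gradient by (1 / lambda_u)^2."""
    scale = ((1.0 - dot(u, u)) / 2.0) ** 2
    return [scale * g for g in euclid_grad]

def exp_map(x, v):
    """Exponential map of Eq. 7."""
    v_norm = math.sqrt(dot(v, v))
    if v_norm == 0.0:
        return list(x)
    lam = 2.0 / (1.0 - dot(x, x))
    t, xv = lam * v_norm, dot(x, v) / v_norm
    denom = 1.0 + (lam - 1.0) * math.cosh(t) + lam * xv * math.sinh(t)
    cx = lam * (math.cosh(t) + xv * math.sinh(t)) / denom
    cv = math.sinh(t) / (v_norm * denom)
    return [cx * a + cv * b for a, b in zip(x, v)]

def exponential_step(u, euclid_grad, lr):
    """Full Riemannian update of Eq. 35."""
    rg = riemannian_grad(u, euclid_grad)
    return exp_map(u, [-lr * g for g in rg])

def retraction_step(u, euclid_grad, lr, max_norm=1.0 - 1e-5):
    """First-order update u - lr * grad_R, clipped back into the ball (assumed safeguard)."""
    rg = riemannian_grad(u, euclid_grad)
    new_u = [a - lr * g for a, g in zip(u, rg)]
    n = math.sqrt(dot(new_u, new_u))
    return [a * max_norm / n for a in new_u] if n > max_norm else new_u

if __name__ == "__main__":
    u, grad = [0.5, 0.0], [1.0, 0.0]
    print(exponential_step(u, grad, lr=0.1), retraction_step(u, grad, lr=0.1))
```

For small steps the two updates nearly coincide, which is consistent with the retraction being a first order approximation of the exponential map.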

## 5 Experiments

We evaluate the representational and generalization power of hyperbolic entailment cones and of other baselines using data that exhibits a latent hierarchical structure. We follow previous work (Nickel & Kiela, 2017; Vendrov et al., 2015) and use the full transitive closure of the WordNet noun hierarchy (Miller et al., 1990). Our binary classification task is link prediction for unseen edges in this directed acyclic graph.

#### Dataset splitting. Train and evaluation settings.

We remove the tree root since it carries little information and only has trivial edges to predict. Note that this implies that we co-embed the resulting subgraphs together to prevent overlapping embeddings (see smaller examples in Figure 3). The remaining WordNet dataset contains 82,114 nodes and 661,127 edges in the full transitive closure. We split it into train - validation - test sets as follows. We first compute the transitive reduction of this directed acyclic graph, i.e. “basic” edges that form the minimal edge set for which the original transitive closure can be fully recovered. These edges are hard to predict, so we will always include them in the training set. The remaining “non-basic” edges (578,477) are split into validation (5%), test (5%) and train (fraction of the rest).
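The split into "basic" and "non-basic" edges can be illustrated on a toy hierarchy. The sketch below is a minimal Python version that assumes its input is the full transitive closure of a DAG, given as a set of `(u, v)` pairs; under that assumption an edge is basic exactly when no intermediate node `w` has both `(u, w)` and `(w, v)` in the closure. The function name and the toy data are illustrative.

```python
def transitive_reduction(closure):
    """Return the basic edges of a DAG given its full transitive closure."""
    successors = {}
    for u, v in closure:
        successors.setdefault(u, set()).add(v)
    basic = set()
    for u, v in closure:
        # (u, v) is implied by transitivity if some successor w of u also reaches v.
        implied = any(v in successors.get(w, ()) for w in successors[u] if w != v)
        if not implied:
            basic.add((u, v))
    return basic

if __name__ == "__main__":
    closure = {
        ("animal", "mammal"), ("mammal", "dog"), ("animal", "dog"),
        ("mammal", "cat"), ("animal", "cat"),
    }
    basic = transitive_reduction(closure)
    print(sorted(basic))
    print(sorted(closure - basic))  # non-basic edges, candidates for validation and test
```

In this toy example the edges from `animal` to `dog` and `cat` are recoverable from the others, so they are the non-basic ones.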

We augment both the validation and the test parts with sets of negative pairs as follows: for each true (positive) edge $(u, v)$, we randomly sample five $(u', v)$ and five $(u, v')$ negative corrupted pairs that are not edges in the full transitive closure. These are then added to the respective negative set. Thus, ten times as many negative pairs as positive pairs are used. They are used to compute standard classification metrics associated with these datasets: precision, recall, F1. For the training set, negative pairs are dynamically generated as explained below.
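The corruption procedure for evaluation negatives could be sketched as follows. This minimal Python version uses rejection sampling; the function name, the exclusion of self-pairs, and the fact that duplicates are allowed are assumptions, and the loop would not terminate if a node had no valid corruption at all.

```python
import random

def sample_negatives(u, v, nodes, closure, k, rng):
    """Sample k corrupted pairs (u', v) and k corrupted pairs (u, v') outside the closure."""
    def corrupt(replace_head):
        out = []
        while len(out) < k:
            w = rng.choice(nodes)
            pair = (w, v) if replace_head else (u, w)
            # Reject self-pairs (assumption) and true edges of the transitive closure.
            if pair[0] != pair[1] and pair not in closure:
                out.append(pair)
        return out
    return corrupt(True) + corrupt(False)

if __name__ == "__main__":
    nodes = ["animal", "mammal", "dog", "cat", "plant"]
    closure = {
        ("animal", "mammal"), ("animal", "dog"), ("animal", "cat"),
        ("mammal", "dog"), ("mammal", "cat"),
    }
    rng = random.Random(0)  # fixed seed for reproducibility
    print(sample_negatives("mammal", "dog", nodes, closure, k=5, rng=rng))
```

With `k=5` this yields ten negatives per positive edge, matching the ratio described above.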

We make the task harder in order to understand the generalization ability of various models when differing amounts of transitive closure edges are available during training. We generate four training sets that include 0%, 10%, 25%, or 50% of the non-basic edges, selected randomly. We then train separate models using each of these four sets after being augmented with the basic edges.

#### Baselines.

We compare against the strong hierarchical embedding methods of Order embeddings (Vendrov et al., 2015) and Poincaré embeddings (Nickel & Kiela, 2017). Additionally, we also use Simple Euclidean embeddings, i.e. the Euclidean version of the method presented in (Nickel & Kiela, 2017) (one of their baselines). We note that Poincaré and Simple Euclidean embeddings were trained using a symmetric distance function, and thus cannot be directly used to evaluate asymmetric entailment relations. Thus, for these baselines we use the heuristic scoring function proposed in (Nickel & Kiela, 2017):

$$\mathrm{score}(u, v) = \big(1 + \alpha(\|u\| - \|v\|)\big)\, d(u, v) \tag{37}$$

and tune the parameter $\alpha$ on the validation set. For all the other methods (our proposed cones and order embeddings), we use the energy penalty $E(u, v)$, e.g. Eq. 33 for hyperbolic cones. This scoring function is then used at test time for binary classification as follows: if it is lower than a threshold, we predict an edge; otherwise, we predict a non-edge. The optimal threshold is chosen to achieve maximum F1 on the validation set by passing over the sorted array of scores of positive and negative validation pairs.
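The threshold search described above amounts to one pass over the sorted validation scores. The sketch below is a minimal Python illustration that also includes the heuristic score of Eq. 37; the function names are illustrative, the distance is passed in as a callable, and predicting an edge when the score is less than or equal to the returned threshold is an assumed convention for handling ties.

```python
import math

def heuristic_score(u, v, alpha, dist):
    """Eq. 37, with `dist` any symmetric distance function on embeddings."""
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(a * a for a in v))
    return (1.0 + alpha * (norm_u - norm_v)) * dist(u, v)

def best_f1_threshold(pos_scores, neg_scores):
    """Return (threshold, F1) maximizing F1 when predicting an edge iff score <= threshold."""
    scored = sorted([(s, 1) for s in pos_scores] + [(s, 0) for s in neg_scores])
    best_thr, best_f1 = None, 0.0
    tp = fp = 0
    for i, (s, label) in enumerate(scored):
        tp += label
        fp += 1 - label
        if i + 1 < len(scored) and scored[i + 1][0] == s:
            continue  # evaluate only once per distinct score value
        if tp == 0:
            continue
        precision, recall = tp / (tp + fp), tp / len(pos_scores)
        f1 = 2.0 * precision * recall / (precision + recall)
        if f1 > best_f1:
            best_thr, best_f1 = s, f1
    return best_thr, best_f1

if __name__ == "__main__":
    print(best_f1_threshold([0.0, 0.1, 0.4], [0.3, 0.5, 0.9, 1.0]))
```

Sorting dominates the cost, so the search is $O(m \log m)$ in the number $m$ of validation pairs.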

#### Training details.

For all methods except Order embeddings, we observe that initialization is very important. Being able to properly disentangle embeddings from different subparts of the graph in the initial learning stage is essential in order to train high-quality models. We conjecture that initialization is hard because these models are trained to minimize highly non-convex loss functions. In practice, we obtain our best results when initializing the embeddings corresponding to the hyperbolic cones using the Poincaré embeddings pre-trained for 100 epochs. The embeddings for the Euclidean cones are initialized using Simple Euclidean embeddings, also pre-trained for 100 epochs. For the Simple Euclidean embeddings and Poincaré embeddings, we find the burn-in strategy of (Nickel & Kiela, 2017) to be essential for a good initial disentanglement. We also observe that the Poincaré embeddings are heavily collapsed to the unit ball border (as also pictured in Fig. 3) and so we rescale them by a factor of 0.7 before starting the training of the hyperbolic cones.

Each model is trained for 200 epochs after the initialization stage, except for order embeddings, which were trained for 500 epochs. During training, 10 negative edges are generated per positive edge by randomly corrupting one of its end points. We use a batch size of 10 for all models. For both cone models we use the same margin $\gamma$.

All Euclidean models and baselines are trained using stochastic gradient descent. For the hyperbolic models, we do not find significant empirical improvements when using full Riemannian optimization instead of approximating it with a retraction map as done in (Nickel & Kiela, 2017). We thus use the retraction approximation since it is faster. For the cone models, we always project outside of the ball $\mathcal{B}^n(O, \epsilon)$ centered on the origin during learning, as constrained by Eq. 26 and its Euclidean version. For both we use the same $\epsilon$. A learning rate of 1e-4 is used for both Euclidean and hyperbolic cone models.

#### Results and discussion.

Table 1 shows the obtained results. For a fair comparison, we use models with the same number of dimensions. We focus on the low dimensional setting (5 and 10 dimensions), which is more informative. It can be seen that our hyperbolic cones are better than all the baselines in all settings, except in the 0% setting, for which order embeddings are better. However, once a small percentage of the transitive closure edges becomes available during training, we observe significant F1 improvements of our method. Moreover, hyperbolic cones have the largest growth when transitive closure edges are added at train time. We further note that, while mathematically not justified (indeed, mathematically, hyperbolic embeddings cannot be considered as Euclidean points), if embeddings of our proposed Euclidean cones model are initialized with the Poincaré embeddings instead of the Simple Euclidean ones, then they perform on par with the hyperbolic cones.

## 6 Conclusion

Learning meaningful graph embeddings is relevant for many important applications. Hyperbolic geometry has proven to be powerful for embedding hierarchical structures. We here take one step forward and propose a novel model based on geodesically convex entailment cones and show its theoretical and practical benefits. We empirically discover that the performance of strong embedding methods can vary a lot with the percentage of the taxonomy observable during training, and demonstrate that our proposed method benefits the most from an increasing size of the training data. As future work, it would be interesting to understand if the proposed entailment cones can be used to embed more complex data such as sentences or images.

Our code is publicly available.

## Acknowledgements

We would like to thank Maximilian Nickel, Colin Evans, Chris Waterson, Marius Pasca, Xiang Li and Vered Shwartz for helpful discussions about related work and evaluation settings.

This research is funded by the Swiss National Science Foundation (SNSF) under grant agreement number 167176. Gary Bécigneul is also funded by the Max Planck ETH Center for Learning Systems.

## References

• Billera et al. (2001) Billera, L. J., Holmes, S. P., and Vogtmann, K. Geometry of the space of phylogenetic trees. Advances in Applied Mathematics, 27(4):733–767, 2001.
• Bläsius et al. (2016) Bläsius, T., Friedrich, T., Krohmer, A., and Laue, S. Efficient embedding of scale-free graphs in the hyperbolic plane. In LIPIcs-Leibniz International Proceedings in Informatics, volume 57. Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik, 2016.
• Bonnabel (2013) Bonnabel, S. Stochastic gradient descent on riemannian manifolds. IEEE Transactions on Automatic Control, 58(9):2217–2229, 2013.
• Bordes et al. (2013) Bordes, A., Usunier, N., Garcia-Duran, A., Weston, J., and Yakhnenko, O. Translating embeddings for modeling multi-relational data. In Advances in neural information processing systems, pp. 2787–2795, 2013.
• Bowditch (2006) Bowditch, B. H. A course on geometric group theory. 2006.
• Bronstein et al. (2017) Bronstein, M. M., Bruna, J., LeCun, Y., Szlam, A., and Vandergheynst, P. Geometric deep learning: going beyond euclidean data. IEEE Signal Processing Magazine, 34(4):18–42, 2017.
• Cannon et al. (1997) Cannon, J. W., Floyd, W. J., Kenyon, R., Parry, W. R., et al. Hyperbolic geometry. Flavors of geometry, 31:59–115, 1997.
• Cvetkovski & Crovella (2009) Cvetkovski, A. and Crovella, M. Hyperbolic embedding and routing for dynamic graphs. In INFOCOM 2009, IEEE, pp. 1647–1655. IEEE, 2009.
• De Sa et al. (2018) De Sa, C., Gu, A., Ré, C., and Sala, F. Representation tradeoffs for hyperbolic embeddings. arXiv preprint arXiv:1804.03329, 2018.
• Fu et al. (2014) Fu, R., Guo, J., Qin, B., Che, W., Wang, H., and Liu, T. Learning semantic hierarchies via word embeddings. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), volume 1, pp. 1199–1209, 2014.
• Ganea & Hofmann (2017) Ganea, O.-E. and Hofmann, T. Deep joint entity disambiguation with local neural attention. arXiv preprint arXiv:1704.04920, 2017.
• Goyal & Ferrara (2017) Goyal, P. and Ferrara, E. Graph embedding techniques, applications, and performance: A survey. arXiv preprint arXiv:1705.02801, 2017.
• Gromov (1987) Gromov, M. Hyperbolic groups. In Essays in group theory, pp. 75–263. Springer, 1987.
• Grover & Leskovec (2016) Grover, A. and Leskovec, J. node2vec: Scalable feature learning for networks. In Proceedings of the 22nd ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 855–864. ACM, 2016.
• Hamann (2017) Hamann, M. On the tree-likeness of hyperbolic spaces. Mathematical Proceedings of the Cambridge Philosophical Society, pp. 1–17, 2017.
• Hoff et al. (2002) Hoff, P. D., Raftery, A. E., and Handcock, M. S. Latent space approaches to social network analysis. Journal of the american Statistical association, 97(460):1090–1098, 2002.
• Hopper & Andrews (2010) Hopper, C. and Andrews, B. The ricci flow in riemannian geometry, 2010.
• Kiros et al. (2015) Kiros, R., Zhu, Y., Salakhutdinov, R. R., Zemel, R., Urtasun, R., Torralba, A., and Fidler, S. Skip-thought vectors. In Advances in neural information processing systems, pp. 3294–3302, 2015.
• Krioukov et al. (2009) Krioukov, D., Papadopoulos, F., Boguñá, M., and Vahdat, A. Greedy forwarding in scale-free networks embedded in hyperbolic metric spaces. ACM SIGMETRICS Performance Evaluation Review, 37(2):15–17, 2009.
• Krioukov et al. (2010) Krioukov, D., Papadopoulos, F., Kitsak, M., Vahdat, A., and Boguñá, M. Hyperbolic geometry of complex networks. Physical Review E, 82(3):036106, 2010.
• Lamping et al. (1995) Lamping, J., Rao, R., and Pirolli, P. A focus+ context technique based on hyperbolic geometry for visualizing large hierarchies. In Proceedings of the SIGCHI conference on Human factors in computing systems, pp. 401–408. ACM Press/Addison-Wesley Publishing Co., 1995.
• Mikolov et al. (2013) Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., and Dean, J. Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems, pp. 3111–3119, 2013.
• Miller et al. (1990) Miller, G. A., Beckwith, R., Fellbaum, C., Gross, D., and Miller, K. J. Introduction to wordnet: An on-line lexical database. International journal of lexicography, 3(4):235–244, 1990.
• Nickel & Kiela (2017) Nickel, M. and Kiela, D. Poincaré embeddings for learning hierarchical representations. In Advances in Neural Information Processing Systems, pp. 6341–6350, 2017.
• Nickel et al. (2011) Nickel, M., Tresp, V., and Kriegel, H.-P. A three-way model for collective learning on multi-relational data. 2011.
• Parkkonen (2013) Parkkonen, J. Hyperbolic geometry. 2013.
• Pennington et al. (2014) Pennington, J., Socher, R., and Manning, C. D. Glove: Global vectors for word representation. In EMNLP, volume 14, pp. 1532–43, 2014.
• Robbin & Salamon (2011) Robbin, J. W. and Salamon, D. A. Introduction to differential geometry. ETH, Lecture Notes, preliminary version, January, 2011.
• Rocktäschel et al. (2015) Rocktäschel, T., Grefenstette, E., Hermann, K. M., Kočiský, T., and Blunsom, P. Reasoning about entailment with neural attention. arXiv preprint arXiv:1509.06664, 2015.
• Sarkar (2011) Sarkar, R. Low distortion delaunay embedding of trees in hyperbolic plane. In International Symposium on Graph Drawing, pp. 355–366. Springer, 2011.
• Shavitt & Tankel (2008) Shavitt, Y. and Tankel, T. Hyperbolic embedding of internet graph for distance estimation and overlay construction. IEEE/ACM Transactions on Networking (TON), 16(1):25–36, 2008.
• Shwartz et al. (2016) Shwartz, V., Goldberg, Y., and Dagan, I. Improving hypernymy detection with an integrated path-based and distributional method. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), volume 1, pp. 2389–2398, 2016.
• Spivak (1979) Spivak, M. A comprehensive introduction to differential geometry. volume four. 1979.
• Vendrov et al. (2015) Vendrov, I., Kiros, R., Fidler, S., and Urtasun, R. Order-embeddings of images and language. arXiv preprint arXiv:1511.06361, 2015.

## Appendix A Geodesics in the Hyperboloid Model

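The hyperboloid model is $(\mathbb{H}^n, \langle\cdot,\cdot\rangle_1)$, where $\mathbb{H}^n := \{x \in \mathbb{R}^{n,1} : \langle x, x\rangle_1 = -1,\ x_0 > 0\}$. The hyperboloid model can be viewed extrinsically as embedded in the pseudo-Riemannian manifold Minkowski space $(\mathbb{R}^{n,1}, \langle\cdot,\cdot\rangle_1)$ and inducing its metric. The Minkowski metric tensor $g^{\mathbb{R}^{n,1}}$ of signature $(n,1)$ has the components

$$g^{\mathbb{R}^{n,1}} = \begin{bmatrix} -1 & 0 & \dots & 0 \\ 0 & 1 & \dots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \dots & 1 \end{bmatrix}$$

The associated inner-product is $\langle x, y\rangle_1 := -x_0 y_0 + \sum_{i=1}^{n} x_i y_i$. Note that the hyperboloid model is a Riemannian manifold because the quadratic form associated with the induced metric is positive definite on its tangent spaces.

In the extrinsic view, the tangent space at $x \in \mathbb{H}^n$ can be described as $T_x\mathbb{H}^n = \{v \in \mathbb{R}^{n,1} : \langle v, x\rangle_1 = 0\}$. See Robbin & Salamon (2011); Parkkonen (2013).

Geodesics of $\mathbb{H}^n$ are given by the following theorem (Eq (6.4.10) in Robbin & Salamon (2011)):

###### Theorem 6.

Let $x \in \mathbb{H}^n$ and $v \in T_x\mathbb{H}^n$ such that $\langle v, v\rangle_1 = 1$. The unique unit-speed geodesic $\phi_{x,v}$ with $\phi_{x,v}(0) = x$ and $\dot{\phi}_{x,v}(0) = v$ is

$$\phi_{x,v}(t) = x\cosh(t) + v\sinh(t). \tag{38}$$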
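As a numerical illustration of Theorem 6, the sketch below evaluates Eq. 38 and checks that the curve stays on the hyperboloid and has unit speed. The helper names `minkowski_dot`, `hyperboloid_geodesic` and `hyperboloid_distance` are hypothetical, and the distance formula $\operatorname{arcosh}(-\langle x, y\rangle_1)$ is the standard one for the hyperboloid model rather than something stated in this appendix.

```python
import numpy as np

def minkowski_dot(a, b):
    """Minkowski inner product with signature (-, +, ..., +)."""
    return -a[0] * b[0] + np.dot(a[1:], b[1:])

def hyperboloid_geodesic(x, v, t):
    """Eq. 38: unit-speed geodesic through x with initial velocity v."""
    return x * np.cosh(t) + v * np.sinh(t)

def hyperboloid_distance(a, b):
    """Standard hyperboloid distance; the clamp absorbs rounding below 1."""
    return np.arccosh(max(1.0, -minkowski_dot(a, b)))

# A point on the hyperboloid in R^{2,1}: <x, x>_1 = -1 and x_0 > 0.
x = np.array([np.sqrt(1.0 + 0.3**2 + 0.4**2), 0.3, 0.4])

# Build a unit tangent vector: project an arbitrary vector onto the
# tangent space at x, then normalise it in the Minkowski norm.
u = np.array([0.0, 1.0, -2.0])
v = u + minkowski_dot(x, u) * x
v = v / np.sqrt(minkowski_dot(v, v))

times = [0.5, 1.0, 2.0]
points = [hyperboloid_geodesic(x, v, t) for t in times]
# Deviation from the constraint <p, p>_1 = -1 along the curve.
constraint_errors = [abs(minkowski_dot(p, p) + 1.0) for p in points]
# Unit speed means the distance travelled after time t is t.
travelled = [hyperboloid_distance(x, p) for p in points]
```

Because $\langle x, x\rangle_1 = -1$, $\langle x, v\rangle_1 = 0$ and $\langle v, v\rangle_1 = 1$, the constraint reduces to $-\cosh^2(t) + \sinh^2(t) = -1$, which is what `constraint_errors` measures.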

## Appendix B Proof of Theorem 1

###### Proof.

From theorem 6, appendix A, we know the expression of the unit-speed geodesics of the hyperboloid model . We can use the Egregium theorem to project the geodesics of to the geodesics of . We can do that because we know an isometry between the two spaces:

$$\psi(x) := (\lambda_x - 1,\ \lambda_x x), \qquad \psi^{-1}(x_0, x') = \frac{x'}{1 + x_0} \tag{39}$$

Formally, let with . Also, let be the unique unit-speed geodesic in with and . Then, by Egregium theorem, is also a unit-speed geodesic in . From theorem 6, we have that , for some . One derives their expression:

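$$\begin{aligned} x' &= \psi\circ\gamma(0) = (\lambda_x - 1,\ \lambda_x x) \\ v' &= \dot{\phi}(0) = \left.\frac{\partial \psi(y_0, y)}{\partial y}\right|_{\gamma(0)} \dot{\gamma}(0) = \begin{bmatrix} \lambda_x^2 \langle x, v\rangle \\ \lambda_x^2 \langle x, v\rangle x + \lambda_x v \end{bmatrix} \end{aligned} \tag{40}$$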

Inverting once again, , one gets the closed-form expression for stated in the theorem. ∎
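The construction in this proof can be followed numerically. The sketch below maps a point and a tangent vector of the Poincaré ball to the hyperboloid with Eqs. 39 and 40, follows the geodesic of Eq. 38, and maps back. All function names are hypothetical, $\lambda_x = 2/(1-\|x\|^2)$ is assumed to be the conformal factor of the Poincaré ball, and the distance used for the check is the standard Poincaré formula rather than an expression taken from this appendix.

```python
import numpy as np

def to_hyperboloid(x):
    """psi from Eq. 39: Poincaré ball -> hyperboloid."""
    lam = 2.0 / (1.0 - np.dot(x, x))
    return np.concatenate(([lam - 1.0], lam * x))

def to_ball(p):
    """psi^{-1} from Eq. 39: hyperboloid -> Poincaré ball."""
    return p[1:] / (1.0 + p[0])

def poincare_geodesic(x, v, t):
    """gamma_{x,v}(t), obtained by composing Eqs. 40, 38 and 39."""
    lam = 2.0 / (1.0 - np.dot(x, x))
    xv = np.dot(x, v)
    x_h = to_hyperboloid(x)
    v_h = np.concatenate(([lam**2 * xv], lam**2 * xv * x + lam * v))  # Eq. 40
    return to_ball(x_h * np.cosh(t) + v_h * np.sinh(t))              # Eq. 38

def poincare_distance(a, b):
    """Standard distance in the Poincaré ball."""
    diff = np.dot(a - b, a - b)
    denom = (1.0 - np.dot(a, a)) * (1.0 - np.dot(b, b))
    return np.arccosh(1.0 + 2.0 * diff / denom)

x = np.array([0.3, 0.2])
direction = np.array([1.0, -1.0])
lam_x = 2.0 / (1.0 - np.dot(x, x))
# Unit speed in the Poincaré metric means lam_x * ||v|| = 1.
v = direction / (lam_x * np.linalg.norm(direction))

end_point = poincare_geodesic(x, v, 1.5)
travelled = poincare_distance(x, end_point)
```

If the isometry and the pushed-forward velocity are correct, `travelled` should match the elapsed time 1.5 up to rounding, and the curve should start at `x` for `t = 0`.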

One can sanity check that indeed the formula from theorem 1 satisfies the conditions:

## Appendix C Proof of Corollary 1.1

###### Proof.

Denote . Using the notations from Thm. 1, one has . Using Eq. 3 and 6, one derives the result. ∎

## Appendix D Proof of Corollary 1.2

###### Proof.

For any geodesic , consider the plane spanned by the vectors and . Then, from Thm. 1, this plane contains all the points of , i.e.

$$\{\gamma_{x,v}(t) : t \in \mathbb{R}\} \subseteq \{a x + b v : a, b \in \mathbb{R}\} \tag{41}$$
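The inclusion in Eq. 41 can be checked numerically in three dimensions. In this sketch the geodesic is evaluated by composing Eqs. 38 to 40 instead of using the closed form of Theorem 1, and the function name `poincare_geodesic`, the sample point and the sample direction are hypothetical choices.

```python
import numpy as np

def poincare_geodesic(x, v, t):
    """Poincaré geodesic obtained via the hyperboloid (Eqs. 38-40)."""
    lam = 2.0 / (1.0 - np.dot(x, x))
    xv = np.dot(x, v)
    x_h = np.concatenate(([lam - 1.0], lam * x))
    v_h = np.concatenate(([lam**2 * xv], lam**2 * xv * x + lam * v))
    p = x_h * np.cosh(t) + v_h * np.sinh(t)
    return p[1:] / (1.0 + p[0])

x = np.array([0.1, 0.2, 0.3])
direction = np.array([0.5, -1.0, 0.25])
lam_x = 2.0 / (1.0 - np.dot(x, x))
v = direction / (lam_x * np.linalg.norm(direction))  # unit Poincaré speed

# Sample the geodesic on both sides of x and stack the samples with x and v.
samples = np.stack([poincare_geodesic(x, v, t) for t in np.linspace(-3.0, 3.0, 13)])
rank = np.linalg.matrix_rank(np.vstack([x, v, samples]), tol=1e-9)
```

A rank of 2 means that every sampled point is a linear combination of `x` and `v`, as Eq. 41 states.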

## Appendix E Proof of Lemma 2

###### Proof.

Assume the contrary and let s.t. . We will show that transitivity implies that

$$\forall x' \in \partial\mathfrak{S}_x^{\psi(x)} : \ \psi(\|x'\|) \le \frac{\pi}{2} \tag{42}$$

If the above is true, by moving on any arbitrary (continuous) curve on the cone border that ends in , one will get a contradiction due to the continuity of .

We now prove the remaining fact, namely Eq. 42. Let any arbitrary . Also, let be any arbitrary point on the geodesic half-line connecting with starting from (i.e. excluding the segment from to ). Moreover, let be any arbitrary point on the spoke through radiating from , namely (notation from Eq. 15). Then, based on the properties of hyperbolic angles discussed before (based on Eq. 8), the angles and are well-defined.

From Cor. 1.2 we know that the points are coplanar. We denote this plane by . Furthermore, the metric of the Poincaré ball is conformal with the Euclidean metric. Given these two facts, we derive that

$$\angle yx'z + \angle zx'x = \angle(yx'x) = \pi \tag{43}$$

thus

$$\min(\angle yx'z,\ \angle zx'x) \le \frac{\pi}{2} \tag{44}$$

It only remains to prove that

$$\angle yx'z \ge \psi(x') \quad \& \quad \angle zx'x \ge \psi(x') \tag{45}$$

Indeed, assume w.l.o.g. that . Since , there exists a point in the plane such that

$$\angle Oxt < \angle Oxy \quad \& \quad \psi(x') \ge \angle tx'z > \angle yx'z \tag{46}$$

Then, clearly, , and also , which contradicts the transitivity property (Eq. 20). ∎

## Appendix F Proof of Theorem 3

###### Proof.

We first need to prove the following fact:

###### Lemma 7.

Transitivity implies that for all , :

$$\sin(\psi(\|x'\|))\,\sinh(\|x'\|_{\mathbb{D}}) \le \sin(\psi(\|x\|))\,\sinh(\|x\|_{\mathbb{D}}). \tag{47}$$

###### Proof.

We will use the exact same figure and notations of points as in the proof of lemma 2. In addition, we assume w.l.o.g that

$$\angle yx'z \le \frac{\pi}{2} \tag{48}$$

Further, let be the intersection of the spoke through with the border of . Following the same argument as in the proof of lemma 2, one proves Eq. 45 which gives:

$$\angle yx'z \ge \psi(x') \tag{49}$$

In addition, the angle at between the geodesics and can be written in two ways:

$$\angle Ox'x = \angle yx'z \tag{50}$$

Since , one proves

$$\angle Oxx' = \pi - \angle x'xb = \pi - \psi(x) \tag{51}$$

We apply hyperbolic law of sines (Eq. 10) in the hyperbolic triangle :

$$\frac{\sin(\angle Oxx')}{\sinh(d_{\mathbb{D}}(O, x'))} = \frac{\sin(\angle Ox'x)}{\sinh(d_{\mathbb{D}}(O, x))} \tag{52}$$

Putting together Eqs. 48, 49, 50, 51 and 52, and using the fact that $\sin(\cdot)$ is an increasing function on $[0, \frac{\pi}{2}]$, we derive the conclusion of this helper lemma. ∎

We now return to the proof of our theorem. Consider any arbitrary with . Then, we claim that it is enough to prove that

$$\exists\, x \in \mathbb{D}^n,\ x' \in \partial\mathfrak{S}_x^{\psi(x)} \ \text{ s.t. } \ \|x\| = r,\ \|x'\| = r' \tag{53}$$

Indeed, if the above is true, then one can use the fact 5, i.e.

$$\sinh(\|x\|_{\mathbb{D}}) = \sinh\left(\ln\left(\frac{1+r}{1-r}\right)\right) = \frac{2r}{1-r^2} \tag{54}$$

and apply lemma 7 to derive

$$h(r') \le h(r) \tag{55}$$

which is enough for proving the non-increasing property of the function $h$.
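The identity in Eq. 54 follows from $\sinh(\ln a) = (a - 1/a)/2$ with $a = (1+r)/(1-r)$, and it is easy to confirm numerically. In the sketch below the function name `poincare_norm` and the sample radii are illustrative choices.

```python
import math

def poincare_norm(r):
    """Hyperbolic distance from the origin to a point of Euclidean norm r."""
    return math.log((1.0 + r) / (1.0 - r))

radii = [0.1, 0.5, 0.9, 0.99]
# Left-hand and right-hand sides of Eq. 54 for each radius.
lhs = [math.sinh(poincare_norm(r)) for r in radii]
rhs = [2.0 * r / (1.0 - r**2) for r in radii]
relative_gaps = [abs(a - b) / b for a, b in zip(lhs, rhs)]
```

Both sides grow without bound as `r` approaches 1, which is why the comparison uses a relative rather than an absolute gap.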

We are only left to prove the fact 53. Let any arbitrary s.t. . Also, consider any arbitrary geodesic that takes values on the cone border, i.e. . We know that

$$\|\gamma_{x,v}(0)\| = \|x\| = r \tag{56}$$

and that this geodesic "ends" on the ball's border $\partial\mathbb{D}^n$, i.e.

$$\left\|\lim_{t\to\infty}\gamma_{x,v}(t)\right\| = 1 \tag{57}$$

Thus, because the function is continuous, we obtain that for any there exists an s.t. . By setting we obtain the desired result. ∎

## Appendix G Proof of Theorem 5

###### Proof.

For any , the axial symmetry property implies that . Applying the hyperbolic cosine law in the triangle and writing the above angle inequality in terms of the cosines of the two angles, one gets

$$\cos\angle Oxy = \frac{-\cosh(\|y\|_{\mathbb{D}}) + \cosh(\|x\|_{\mathbb{D}})\cosh(d_{\mathbb{D}}(x,y))}{\sinh(\|x\|_{\mathbb{D}})\sinh(d_{\mathbb{D}}(x,y))} \tag{58}$$

Eq. 28 is then derived from the above by an algebraic reformulation. ∎
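The cosine-law expression of Eq. 58 can be compared against a direct computation of the same angle. The sketch below uses a technique that is not part of the proof: it translates $x$ to the origin with a Möbius addition, which is an isometry of the Poincaré ball, and then reads off the Euclidean angle at the origin, which is valid because the metric is conformal and geodesics through the origin are straight lines. The function names and sample points are hypothetical.

```python
import numpy as np

def poincare_distance(a, b):
    """Standard distance in the Poincaré ball."""
    diff = np.dot(a - b, a - b)
    denom = (1.0 - np.dot(a, a)) * (1.0 - np.dot(b, b))
    return np.arccosh(1.0 + 2.0 * diff / denom)

def mobius_add(a, b):
    """Möbius addition a (+) b in the Poincaré ball."""
    ab, na, nb = np.dot(a, b), np.dot(a, a), np.dot(b, b)
    return ((1.0 + 2.0 * ab + nb) * a + (1.0 - na) * b) / (1.0 + 2.0 * ab + na * nb)

def cos_angle_cosine_law(x, y):
    """cos of the angle O-x-y from Eq. 58."""
    origin = np.zeros_like(x)
    nx = poincare_distance(origin, x)   # ||x||_D
    ny = poincare_distance(origin, y)   # ||y||_D
    dxy = poincare_distance(x, y)
    return (-np.cosh(ny) + np.cosh(nx) * np.cosh(dxy)) / (np.sinh(nx) * np.sinh(dxy))

def cos_angle_direct(x, y):
    """Same angle after translating x to the origin: O maps to -x, y maps to w."""
    w = mobius_add(-x, y)
    return np.dot(-x, w) / (np.linalg.norm(x) * np.linalg.norm(w))

x = np.array([0.5, 0.0])
y = np.array([0.5, 0.5])
from_cosine_law = cos_angle_cosine_law(x, y)
from_translation = cos_angle_direct(x, y)
```

For these two points both computations give $\cos\angle Oxy = 0.1/(0.5\sqrt{0.4}) \approx 0.3162$.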