
# Hyperbolic Neural Networks

Hyperbolic spaces have recently gained momentum in the context of machine learning due to their high capacity and tree-likeness properties. However, the representational power of hyperbolic geometry is not yet on par with Euclidean geometry, mostly because of the absence of corresponding hyperbolic neural network layers. This makes it hard to use hyperbolic embeddings in downstream tasks. Here, we bridge this gap in a principled manner by combining the formalism of Möbius gyrovector spaces with the Riemannian geometry of the Poincaré model of hyperbolic spaces. As a result, we derive hyperbolic versions of important deep learning tools: multinomial logistic regression, feed-forward and recurrent neural networks such as gated recurrent units. This allows us to embed sequential data and perform classification in the hyperbolic space. Empirically, we show that, even if hyperbolic optimization tools are limited, hyperbolic sentence embeddings either outperform or are on par with their Euclidean variants on textual entailment and noisy-prefix recognition tasks.

## 1 Introduction

It is common in machine learning to represent data as being embedded in the Euclidean space $\mathbb{R}^n$. The main reason for such a choice is simply convenience, as this space has a vectorial structure, closed-form formulas of distance and inner-product, and is the natural generalization of our intuition-friendly, visual three-dimensional space. Moreover, embedding entities in such a continuous space allows one to feed them as input to neural networks, which has led to unprecedented performance on a broad range of problems, including sentiment detection (kim2014convolutional), machine translation (bahdanau2014neural), textual entailment (rocktaschel2015reasoning) or knowledge base link prediction (nickel2011three; bordes2013translating).

Despite the success of Euclidean embeddings, recent research has proven that many types of complex data (e.g. graph data) from a multitude of fields (e.g. Biology, Network Science, Computer Graphics or Computer Vision) exhibit a highly non-Euclidean latent anatomy (bronstein2017geometric). In such cases, the Euclidean space does not provide the most powerful or meaningful geometrical representations. For example, (de2018representation) shows that arbitrary tree structures cannot be embedded with arbitrary low distortion (i.e. almost preserving their metric) in the Euclidean space with unbounded number of dimensions, but this task becomes surprisingly easy in the hyperbolic space with only 2 dimensions where the exponential growth of distances matches the exponential growth of nodes with the tree depth.

The adoption of neural networks and deep learning in these non-Euclidean settings has been rather limited until very recently, the main reason being the non-trivial or impossible principled generalizations of basic operations (e.g. vector addition, matrix-vector multiplication, vector translation, vector inner product) as well as, in more complex geometries, the lack of closed form expressions for basic objects (e.g. distances, geodesics, parallel transport). Thus, classic tools such as multinomial logistic regression (MLR), feed forward (FFNN) or recurrent neural networks (RNN) did not have a correspondence in these geometries.

How should one generalize deep neural models to non-Euclidean domains? In this paper we address this question for one of the simplest, yet useful, non-Euclidean domains: spaces of constant negative curvature, i.e. hyperbolic. Their tree-likeness properties have been extensively studied (gromov1987hyperbolic; hamann_2017; ungar2008gyrovector) and used to visualize large taxonomies (lamping1995focus+) or to embed heterogeneous complex networks (krioukov2010hyperbolic). In machine learning, recently, hyperbolic representations greatly outperformed Euclidean embeddings for hierarchical, taxonomic or entailment data (nickel2017poincar; de2018representation; ganea2018hyperbolic). Disjoint subtrees from the latent hierarchical structure surprisingly disentangle and cluster in the embedding space as a simple reflection of the space’s negative curvature. However, appropriate deep learning tools are needed to embed feature data in this space and use it in downstream tasks. For example, implicitly hierarchical sequence data (e.g. textual entailment data, phylogenetic trees of DNA sequences or hierarchical captions of images) would benefit from suitable hyperbolic RNNs.

The main contribution of this paper is to bridge the gap between hyperbolic and Euclidean geometry in the context of neural networks and deep learning by generalizing in a principled manner both the basic operations as well as multinomial logistic regression (MLR), feed-forward (FFNN), simple and gated (GRU) recurrent neural networks (RNN) to the Poincaré model of the hyperbolic geometry. We do it by connecting the theory of gyrovector spaces and generalized Möbius transformations introduced by (albert2008analytic; ungar2008gyrovector) with the Riemannian geometry properties of the manifold. We smoothly parametrize basic operations and objects in all spaces of constant negative curvature using a unified framework that depends only on the curvature value. Thus, we show how Euclidean and hyperbolic spaces can be continuously deformed into each other. On a series of experiments and datasets we showcase the effectiveness of our hyperbolic neural network layers compared to their "classic" Euclidean variants on textual entailment and noisy-prefix recognition tasks. We hope that this paper will open exciting future directions in the nascent field of Geometric Deep Learning.

## 2 The Geometry of the Poincaré Ball

### 2.1 Basics of Riemannian geometry

We briefly introduce basic concepts of differential geometry largely needed for a principled generalization of Euclidean neural networks. For more rigorous and in-depth expositions, see (spivak1979comprehensive; hopper2010ricci).

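An $n$-dimensional manifold $\mathcal{M}$ is a space that can locally be approximated by $\mathbb{R}^n$: it is a generalization to higher dimensions of the notion of a 2D surface. For $x\in\mathcal{M}$, one can define the tangent space $T_x\mathcal{M}$ of $\mathcal{M}$ at $x$ as the first order linear approximation of $\mathcal{M}$ around $x$. A Riemannian metric $g=(g_x)_{x\in\mathcal{M}}$ on $\mathcal{M}$ is a collection of inner-products $g_x:T_x\mathcal{M}\times T_x\mathcal{M}\to\mathbb{R}$ varying smoothly with $x$. A Riemannian manifold $(\mathcal{M},g)$ is a manifold $\mathcal{M}$ equipped with a Riemannian metric $g$. Although a choice of a Riemannian metric $g$ on $\mathcal{M}$ only seems to define the geometry locally on $\mathcal{M}$, it induces global distances by integrating the length (of the speed vector living in the tangent space) of a shortest path between two points:

$$d(x,y)=\inf_{\gamma}\int_0^1\sqrt{g_{\gamma(t)}\big(\dot\gamma(t),\dot\gamma(t)\big)}\,dt, \tag{1}$$

where $\gamma\in\mathcal{C}^\infty([0,1],\mathcal{M})$ is such that $\gamma(0)=x$ and $\gamma(1)=y$. A smooth path $\gamma$ of minimal length between two points $x$ and $y$ is called a geodesic, and can be seen as the generalization of a straight-line in Euclidean space. The parallel transport $P_{x\to y}:T_x\mathcal{M}\to T_y\mathcal{M}$ is a linear isometry between tangent spaces which corresponds to moving tangent vectors along geodesics and defines a canonical way to connect tangent spaces. The exponential map $\exp_x$ at $x$, when well-defined, gives a way to project back a vector $v$ of the tangent space $T_x\mathcal{M}$ at $x$, to a point $\exp_x(v)\in\mathcal{M}$ on the manifold. This map is often used to parametrize a geodesic $\gamma$ starting from $\gamma(0):=x\in\mathcal{M}$ with unit-norm direction $\dot\gamma(0):=v\in T_x\mathcal{M}$ as $t\mapsto\exp_x(tv)$. For geodesically complete manifolds, such as the Poincaré ball considered in this work, $\exp_x$ is well-defined on the full tangent space $T_x\mathcal{M}$. Finally, a metric $\tilde g$ is said to be conformal to another metric $g$ if it defines the same angles, i.e.

$$\frac{\tilde g_x(u,v)}{\sqrt{\tilde g_x(u,u)}\sqrt{\tilde g_x(v,v)}}=\frac{g_x(u,v)}{\sqrt{g_x(u,u)}\sqrt{g_x(v,v)}}, \tag{2}$$

for all $x\in\mathcal{M}$, $u,v\in T_x\mathcal{M}\setminus\{0\}$. This is equivalent to the existence of a smooth function $\lambda:\mathcal{M}\to\mathbb{R}$, called the conformal factor, such that $\tilde g_x=\lambda_x^2 g_x$ for all $x\in\mathcal{M}$.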

### 2.2 Hyperbolic space: the Poincaré ball

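The hyperbolic space has five isometric models that one can work with (cannon1997hyperbolic). Similarly as in (nickel2017poincar) and (ganea2018hyperbolic), we choose to work in the Poincaré ball. The Poincaré ball model $(\mathbb{D}^n,g^{\mathbb{D}})$ is defined by the manifold $\mathbb{D}^n=\{x\in\mathbb{R}^n:\|x\|<1\}$ equipped with the following Riemannian metric:

$$g^{\mathbb{D}}_x=\lambda_x^2\,g^E,\quad\text{where } \lambda_x:=\frac{2}{1-\|x\|^2}, \tag{3}$$

$g^E=I_n$ being the Euclidean metric tensor. Note that the hyperbolic metric tensor is conformal to the Euclidean one. The induced distance between two points $x,y\in\mathbb{D}^n$ is known to be given by

$$d_{\mathbb{D}}(x,y)=\cosh^{-1}\left(1+2\frac{\|x-y\|^2}{(1-\|x\|^2)(1-\|y\|^2)}\right). \tag{4}$$

Since the Poincaré ball is conformal to Euclidean space, the angle between two vectors $u,v\in T_x\mathbb{D}^n\setminus\{0\}$ is given by

$$\cos(\angle(u,v))=\frac{g^{\mathbb{D}}_x(u,v)}{\sqrt{g^{\mathbb{D}}_x(u,u)}\sqrt{g^{\mathbb{D}}_x(v,v)}}=\frac{\langle u,v\rangle}{\|u\|\|v\|}. \tag{5}$$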

### 2.3 Gyrovector spaces

In Euclidean space, natural operations inherited from the vectorial structure, such as vector addition, subtraction and scalar multiplication are often useful. The framework of gyrovector spaces provides an elegant non-associative algebraic formalism for hyperbolic geometry just as vector spaces provide the algebraic setting for Euclidean geometry (albert2008analytic; ungar2001hyperbolic; ungar2008gyrovector).

In particular, these operations are used in special relativity, where they allow one to add speed vectors belonging to the Poincaré ball of radius $c$ (the celerity, i.e. the speed of light) so that they remain in the ball, hence not exceeding the speed of light.

We will make extensive use of these operations in our definitions of hyperbolic neural networks.

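For $c\geq 0$, denote by $\mathbb{D}^n_c:=\{x\in\mathbb{R}^n\mid c\|x\|^2<1\}$ (our notation differs from that of (ungar2001hyperbolic), where the author uses $s=1/\sqrt{c}$). Note that if $c=0$, then $\mathbb{D}^n_c=\mathbb{R}^n$; if $c>0$, then $\mathbb{D}^n_c$ is the open ball of radius $1/\sqrt{c}$. If $c=1$ then we recover the usual ball $\mathbb{D}^n$.

The Möbius addition of $x$ and $y$ in $\mathbb{D}^n_c$ is defined as

$$x\oplus_c y:=\frac{(1+2c\langle x,y\rangle+c\|y\|^2)\,x+(1-c\|x\|^2)\,y}{1+2c\langle x,y\rangle+c^2\|x\|^2\|y\|^2}. \tag{6}$$

In particular, when $c=0$, one recovers the Euclidean addition of two vectors in $\mathbb{R}^n$. Note that without loss of generality, the case $c>0$ can be reduced to $c=1$. Unless stated otherwise, we will use $\oplus$ as $\oplus_1$ to simplify notations. For general $c>0$, this operation is neither commutative nor associative. However, it satisfies $x\oplus_c 0=0\oplus_c x=x$. Moreover, for any $x,y\in\mathbb{D}^n_c$, we have $(-x)\oplus_c x=x\oplus_c(-x)=0$ and $(-x)\oplus_c(x\oplus_c y)=y$ (left-cancellation law). The Möbius subtraction is then defined by the use of the following notation: $x\ominus_c y:=x\oplus_c(-y)$. See (vermeer2005geometric, section 2.1) for a geometric interpretation of the Möbius addition.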
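The properties listed above can be checked numerically. The following is a minimal sketch in plain Python, assuming points are represented as lists of floats; the helper names (`dot`, `norm`, `neg`, `mobius_add`) are illustrative and not taken from the authors' code.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def norm(u):
    return math.sqrt(dot(u, u))

def neg(x):
    return [-a for a in x]

def mobius_add(x, y, c=1.0):
    """Möbius addition of Eq. (6); x and y are lists of floats in the ball."""
    xy, x2, y2 = dot(x, y), dot(x, x), dot(y, y)
    den = 1 + 2 * c * xy + c * c * x2 * y2
    return [((1 + 2 * c * xy + c * y2) * a + (1 - c * x2) * b) / den
            for a, b in zip(x, y)]

x, y = [0.3, 0.1], [-0.2, 0.4]
xy = mobius_add(x, y)            # x (+) y
yx = mobius_add(y, x)            # y (+) x: a different point of the ball
back = mobius_add(neg(x), xy)    # left-cancellation: should recover y
```

Here `xy` and `yx` differ (the operation is not commutative), while `back` agrees with `y` up to floating-point error, and setting `c=0.0` reduces the function to ordinary vector addition.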

#### Möbius scalar multiplication.

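For $c>0$, the Möbius scalar multiplication of $x\in\mathbb{D}^n_c\setminus\{0\}$ by $r\in\mathbb{R}$ is defined as

$$r\otimes_c x:=\frac{1}{\sqrt{c}}\tanh\left(r\tanh^{-1}(\sqrt{c}\|x\|)\right)\frac{x}{\|x\|}, \tag{7}$$

and $r\otimes_c 0:=0$. Note that similarly as for the Möbius addition, one recovers the Euclidean scalar multiplication when $c$ goes to zero: $\lim_{c\to 0}r\otimes_c x=rx$. This operation satisfies desirable properties such as $n\otimes_c x=x\oplus_c\dots\oplus_c x$ ($n$ additions), $(r+r')\otimes_c x=r\otimes_c x\oplus_c r'\otimes_c x$ (scalar distributivity), $(rr')\otimes_c x=r\otimes_c(r'\otimes_c x)$ (scalar associativity) and $|r|\otimes_c x/\|r\otimes_c x\|=x/\|x\|$ (scaling property). Here $\otimes_c$ has priority over $\oplus_c$, in the sense that $a\otimes_c b\oplus_c c:=(a\otimes_c b)\oplus_c c$ and $a\oplus_c b\otimes_c c:=a\oplus_c(b\otimes_c c)$.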

#### Distance.

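If one defines the generalized hyperbolic metric tensor $g^c$ as the metric conformal to the Euclidean one, with conformal factor $\lambda^c_x:=2/(1-c\|x\|^2)$, then the induced distance function on $(\mathbb{D}^n_c,g^c)$ is given by

$$d_c(x,y)=\frac{2}{\sqrt{c}}\tanh^{-1}\left(\sqrt{c}\,\|-x\oplus_c y\|\right), \tag{8}$$

where the notation $-x\oplus_c y$ should always be read as $(-x)\oplus_c y$ and not $-(x\oplus_c y)$.

Again, observe that $\lim_{c\to 0}d_c(x,y)=2\|x-y\|$, i.e. we recover Euclidean geometry in the limit (the factor $2$ comes from the conformal factor $\lambda_x=2/(1-\|x\|^2)$, which is a convention setting the curvature to $-1$). Moreover, for $c=1$ we recover $d_{\mathbb{D}}$ of Eq. (4).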
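As a sanity check of the two statements above, the sketch below compares Eq. (8) with Eq. (4) at $c=1$ and evaluates the small-$c$ limit. It is plain Python with illustrative helper names (`dist_c`, `dist_poincare`), which are assumptions and not part of the original work.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def norm(u):
    return math.sqrt(dot(u, u))

def mobius_add(x, y, c=1.0):
    """Möbius addition of Eq. (6)."""
    xy, x2, y2 = dot(x, y), dot(x, x), dot(y, y)
    den = 1 + 2 * c * xy + c * c * x2 * y2
    return [((1 + 2 * c * xy + c * y2) * a + (1 - c * x2) * b) / den
            for a, b in zip(x, y)]

def dist_c(x, y, c=1.0):
    """Eq. (8): distance expressed through Möbius addition (requires c > 0)."""
    sc = math.sqrt(c)
    return 2 / sc * math.atanh(sc * norm(mobius_add([-a for a in x], y, c)))

def dist_poincare(x, y):
    """Eq. (4): closed-form distance in the unit Poincaré ball."""
    diff2 = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.acosh(1 + 2 * diff2 / ((1 - dot(x, x)) * (1 - dot(y, y))))

x, y = [0.3, 0.1], [-0.2, 0.4]
d_mobius = dist_c(x, y, c=1.0)
d_closed = dist_poincare(x, y)
d_flat = dist_c(x, y, c=1e-8)    # close to 2 * ||x - y||
```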

#### Hyperbolic trigonometry.

Similarly as in the Euclidean space, one can define the notions of hyperbolic angles or gyroangles (when using the $\oplus_c$), as well as hyperbolic law of sines in the generalized Poincaré ball $(\mathbb{D}^n_c,g^c)$. We make use of these notions in our proofs. See Appendix A.

### 2.4 Connecting Gyrovector spaces and Riemannian geometry of the Poincaré ball

In this subsection, we present how geodesics in the Poincaré ball model are usually described with Möbius operations, and push one step further the existing connection between gyrovector spaces and the Poincaré ball by finding new identities involving the exponential map, and parallel transport.

In particular, these findings provide us with a simpler formulation of Möbius scalar multiplication, yielding a natural definition of matrix-vector multiplication in the Poincaré ball.

#### Riemannian gyroline element.

The Riemannian gyroline element is defined for an infinitesimal $dx$ as $ds:=(x+dx)\ominus_c x$, and its size is given by (ungar2008gyrovector, section 3.7):

$$\|ds\|=\|(x+dx)\ominus_c x\|=\frac{\|dx\|}{1-c\|x\|^2}. \tag{9}$$

What is remarkable is that it turns out to be identical, up to a scaling factor of $1/2$, to the usual line element $2\|dx\|/(1-c\|x\|^2)$ of the Riemannian manifold $(\mathbb{D}^n_c,g^c)$.

#### Geodesics.

The geodesic connecting points $x,y\in\mathbb{D}^n_c$ is shown in (albert2008analytic; ungar2008gyrovector) to be given by:

$$\gamma_{x\to y}(t):=x\oplus_c(-x\oplus_c y)\otimes_c t,\quad\text{with } \gamma_{x\to y}:\mathbb{R}\to\mathbb{D}^n_c \text{ s.t. } \gamma_{x\to y}(0)=x \text{ and } \gamma_{x\to y}(1)=y. \tag{10}$$

Note that when $c$ goes to $0$, geodesics become straight-lines, recovering Euclidean geometry. In the remainder of this subsection, we connect the gyrospace framework with Riemannian geometry.

###### Lemma 1.

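For any $x\in\mathbb{D}^n_c$ and $v\in T_x\mathbb{D}^n_c$ s.t. $g^c_x(v,v)=1$, the unit-speed geodesic starting from $x$ with direction $v$ is given by:

$$\gamma_{x,v}(t)=x\oplus_c\left(\tanh\left(\sqrt{c}\,\frac{t}{2}\right)\frac{v}{\sqrt{c}\|v\|}\right),\quad\text{where } \gamma_{x,v}:\mathbb{R}\to\mathbb{D}^n_c \text{ s.t. } \gamma_{x,v}(0)=x \text{ and } \dot\gamma_{x,v}(0)=v. \tag{11}$$

###### Proof.

One can use Eq. (10) and reparametrize it to unit-speed using Eq. (8). Alternatively, direct computation and identification with the formula in (ganea2018hyperbolic, Thm. 1) would give the same result. Using Eq. (8) and Eq. (11), one can sanity-check that $d_c(\gamma(0),\gamma(t))=t$ for all $t\in[0,1]$. ∎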

#### Exponential and logarithmic maps.

The following lemma gives the closed-form derivation of exponential and logarithmic maps.

###### Lemma 2.

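For any point $x\in\mathbb{D}^n_c$, the exponential map $\exp^c_x:T_x\mathbb{D}^n_c\to\mathbb{D}^n_c$ and the logarithmic map $\log^c_x:\mathbb{D}^n_c\to T_x\mathbb{D}^n_c$ are given for $v\neq 0$ and $y\neq x$ by:

$$\exp^c_x(v)=x\oplus_c\left(\tanh\left(\sqrt{c}\,\frac{\lambda^c_x\|v\|}{2}\right)\frac{v}{\sqrt{c}\|v\|}\right),\qquad \log^c_x(y)=\frac{2}{\sqrt{c}\,\lambda^c_x}\tanh^{-1}\left(\sqrt{c}\,\|-x\oplus_c y\|\right)\frac{-x\oplus_c y}{\|-x\oplus_c y\|}. \tag{12}$$

###### Proof.

Following the proof of (ganea2018hyperbolic, Cor. 1.1), one gets $\exp^c_x(v)=\gamma_{x,\frac{v}{\lambda^c_x\|v\|}}(\lambda^c_x\|v\|)$. Using Eq. (11) gives the formula for $\exp^c_x$. Algebraic check of the identity $\log^c_x(\exp^c_x(v))=v$ concludes. ∎

The above maps have more appealing forms when $x=0$, namely for $v\in T_0\mathbb{D}^n_c\setminus\{0\}$, $y\in\mathbb{D}^n_c\setminus\{0\}$:

$$\exp^c_0(v)=\tanh(\sqrt{c}\|v\|)\frac{v}{\sqrt{c}\|v\|},\qquad \log^c_0(y)=\tanh^{-1}(\sqrt{c}\|y\|)\frac{y}{\sqrt{c}\|y\|}. \tag{13}$$

Moreover, we still recover Euclidean geometry in the limit $c\to 0$, as $\lim_{c\to 0}\exp^c_x(v)=x+v$ is the Euclidean exponential map, and $\lim_{c\to 0}\log^c_x(y)=y-x$ is the Euclidean logarithmic map.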
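The closed forms of Eq. (12) are short enough to implement directly. The sketch below, in plain Python with hypothetical helper names (`lam`, `exp_map`, `log_map`), assumes $c>0$ and treats the zero vector as a special case; it is meant as an illustration of the formulas rather than a reference implementation.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def norm(u):
    return math.sqrt(dot(u, u))

def mobius_add(x, y, c=1.0):
    """Möbius addition of Eq. (6)."""
    xy, x2, y2 = dot(x, y), dot(x, x), dot(y, y)
    den = 1 + 2 * c * xy + c * c * x2 * y2
    return [((1 + 2 * c * xy + c * y2) * a + (1 - c * x2) * b) / den
            for a, b in zip(x, y)]

def lam(x, c):
    """Conformal factor 2 / (1 - c ||x||^2)."""
    return 2 / (1 - c * dot(x, x))

def exp_map(x, v, c=1.0):
    """Exponential map of Eq. (12); returns x itself when v = 0."""
    sc, nv = math.sqrt(c), norm(v)
    if nv == 0:
        return list(x)
    s = math.tanh(sc * lam(x, c) * nv / 2) / (sc * nv)
    return mobius_add(x, [s * a for a in v], c)

def log_map(x, y, c=1.0):
    """Logarithmic map of Eq. (12); returns the zero vector when y = x."""
    sc = math.sqrt(c)
    w = mobius_add([-a for a in x], y, c)
    nw = norm(w)
    if nw == 0:
        return [0.0] * len(x)
    s = 2 / (sc * lam(x, c)) * math.atanh(sc * nw) / nw
    return [s * a for a in w]

x, v = [0.3, 0.1], [0.2, -0.5]
y = exp_map(x, v, c=1.0)          # a point of the ball
v_back = log_map(x, y, c=1.0)     # should recover v
```

With these definitions, the distance $d_c(x,\exp^c_x(v))$ equals the Riemannian norm $\lambda^c_x\|v\|$ of $v$, and for very small $c$ the map `exp_map(x, v, c)` approaches `x + v`.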

#### Möbius scalar multiplication using exponential and logarithmic maps.

We studied the exponential and logarithmic maps in order to gain a better understanding of the Möbius scalar multiplication (Eq. (7)). We found the following:

###### Lemma 3.

The quantity $r\otimes_c x$ can actually be obtained by projecting $x$ in the tangent space at 0 with the logarithmic map, multiplying this projection by the scalar $r$ in $T_0\mathbb{D}^n_c$, and then projecting it back on the manifold with the exponential map:

$$r\otimes_c x=\exp^c_0(r\log^c_0(x)),\quad\forall r\in\mathbb{R},\ x\in\mathbb{D}^n_c. \tag{14}$$

In addition, we recover the well-known relation between geodesics connecting two points and the exponential map:

$$\gamma_{x\to y}(t)=x\oplus_c(-x\oplus_c y)\otimes_c t=\exp^c_x(t\log^c_x(y)),\quad t\in[0,1]. \tag{15}$$

This last result enables us to generalize scalar multiplication in order to define matrix-vector multiplication between Poincaré balls, one of the essential building blocks of hyperbolic neural networks.
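Lemma 3 lends itself to a quick numerical check. The sketch below, in plain Python with illustrative names (`exp0`, `log0`, `mobius_scalar`, `geodesic`), implements Eq. (7) and Eq. (13) independently, so that Eq. (14) can be compared on both sides, and evaluates the geodesic of Eq. (10).

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def norm(u):
    return math.sqrt(dot(u, u))

def mobius_add(x, y, c=1.0):
    """Möbius addition of Eq. (6)."""
    xy, x2, y2 = dot(x, y), dot(x, x), dot(y, y)
    den = 1 + 2 * c * xy + c * c * x2 * y2
    return [((1 + 2 * c * xy + c * y2) * a + (1 - c * x2) * b) / den
            for a, b in zip(x, y)]

def exp0(v, c=1.0):
    """Exponential map at the origin, Eq. (13)."""
    sc, nv = math.sqrt(c), norm(v)
    if nv == 0:
        return [0.0] * len(v)
    return [math.tanh(sc * nv) / (sc * nv) * a for a in v]

def log0(y, c=1.0):
    """Logarithmic map at the origin, Eq. (13)."""
    sc, ny = math.sqrt(c), norm(y)
    if ny == 0:
        return [0.0] * len(y)
    return [math.atanh(sc * ny) / (sc * ny) * a for a in y]

def mobius_scalar(r, x, c=1.0):
    """Möbius scalar multiplication, Eq. (7), with r (x) 0 := 0."""
    sc, nx = math.sqrt(c), norm(x)
    if nx == 0:
        return [0.0] * len(x)
    s = math.tanh(r * math.atanh(sc * nx)) / (sc * nx)
    return [s * a for a in x]

def geodesic(x, y, t, c=1.0):
    """Point at parameter t on the geodesic from x to y, Eq. (10)."""
    w = mobius_add([-a for a in x], y, c)
    return mobius_add(x, mobius_scalar(t, w, c), c)

x, y, r = [0.3, 0.1], [-0.2, 0.4], 2.5
direct = mobius_scalar(r, x)                     # left side of Eq. (14)
via_maps = exp0([r * a for a in log0(x)])        # right side of Eq. (14)
```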

#### Parallel transport.

Finally, we connect parallel transport (from $T_0\mathbb{D}^n_c$) to gyrovector spaces with the following theorem, which we prove in appendix B.

###### Theorem 4.

In the manifold $(\mathbb{D}^n_c,g^c)$, the parallel transport w.r.t. the Levi-Civita connection of a vector $v\in T_0\mathbb{D}^n_c$ to another tangent space $T_x\mathbb{D}^n_c$ is given by the following isometry:

$$P^c_{0\to x}(v)=\log^c_x\left(x\oplus_c\exp^c_0(v)\right)=\frac{\lambda^c_0}{\lambda^c_x}\,v. \tag{16}$$

As we’ll see later, this result is crucial in order to define and optimize parameters shared between different tangent spaces, such as biases in hyperbolic neural layers or parameters of hyperbolic MLR.
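The two expressions in Eq. (16) can be compared numerically. The following plain-Python sketch uses illustrative helper names (`transport_from_origin` is a hypothetical name for the right-hand side, which simply rescales $v$ by $\lambda^c_0/\lambda^c_x = 1-c\|x\|^2$) and also checks that the Riemannian norm is preserved.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def norm(u):
    return math.sqrt(dot(u, u))

def mobius_add(x, y, c=1.0):
    """Möbius addition of Eq. (6)."""
    xy, x2, y2 = dot(x, y), dot(x, x), dot(y, y)
    den = 1 + 2 * c * xy + c * c * x2 * y2
    return [((1 + 2 * c * xy + c * y2) * a + (1 - c * x2) * b) / den
            for a, b in zip(x, y)]

def lam(x, c):
    """Conformal factor 2 / (1 - c ||x||^2); equals 2 at the origin."""
    return 2 / (1 - c * dot(x, x))

def exp0(v, c=1.0):
    """Exponential map at the origin, Eq. (13)."""
    sc, nv = math.sqrt(c), norm(v)
    if nv == 0:
        return [0.0] * len(v)
    return [math.tanh(sc * nv) / (sc * nv) * a for a in v]

def log_map(x, y, c=1.0):
    """Logarithmic map of Eq. (12)."""
    sc = math.sqrt(c)
    w = mobius_add([-a for a in x], y, c)
    nw = norm(w)
    if nw == 0:
        return [0.0] * len(x)
    s = 2 / (sc * lam(x, c)) * math.atanh(sc * nw) / nw
    return [s * a for a in w]

def transport_from_origin(x, v, c=1.0):
    """Right-hand side of Eq. (16): (lambda_0 / lambda_x) * v."""
    return [2 / lam(x, c) * a for a in v]

x, v, c = [0.3, 0.1], [0.4, -0.7], 0.5
rescaled = transport_from_origin(x, v, c)
via_log = log_map(x, mobius_add(x, exp0(v, c), c), c)   # middle term of Eq. (16)
```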

## 3 Hyperbolic Neural Networks

Neural networks can be seen as being made of compositions of basic operations, such as linear maps, bias translations, pointwise non-linearities and a final sigmoid or softmax layer. We first explain how to construct a softmax layer for logits lying in a Poincaré ball. Then, we explain how to transform a mapping between two Euclidean spaces as one between Poincaré balls, yielding matrix-vector multiplication and pointwise non-linearities in the Poincaré ball. Finally, we present possible adaptations of various recurrent neural networks to the hyperbolic domain.

### 3.1 Hyperbolic multiclass logistic regression

In order to perform multi-class classification on the Poincaré ball, one needs to generalize multinomial logistic regression (MLR) also called softmax regression to the Poincaré ball.

#### Reformulating Euclidean MLR.

Let’s first reformulate Euclidean MLR from the perspective of distances to margin hyperplanes, as in (lebanon2004hyperplane, Section 5). This will allow us to easily generalize it.

Given $K$ classes, one learns a margin hyperplane for each such class using softmax probabilities:

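$$\forall k\in\{1,\dots,K\},\quad p(y=k\mid x)\propto\exp\left(\langle a_k,x\rangle-b_k\right),\quad\text{where } b_k\in\mathbb{R},\ x,a_k\in\mathbb{R}^n. \tag{17}$$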

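Note that any affine hyperplane in $\mathbb{R}^n$ can be written with a normal vector $a$ and a scalar shift $b$:

$$H_{a,b}=\{x\in\mathbb{R}^n:\langle a,x\rangle-b=0\},\quad\text{where } a\in\mathbb{R}^n\setminus\{0\} \text{ and } b\in\mathbb{R}. \tag{18}$$

As in (lebanon2004hyperplane, Section 5), we note that $\langle a,x\rangle-b=\mathrm{sign}(\langle a,x\rangle-b)\,\|a\|\,d(x,H_{a,b})$. Using Eq. (17):

$$p(y=k\mid x)\propto\exp\left(\mathrm{sign}(\langle a_k,x\rangle-b_k)\,\|a_k\|\,d(x,H_{a_k,b_k})\right),\quad b_k\in\mathbb{R},\ x,a_k\in\mathbb{R}^n. \tag{19}$$

As it is not immediately obvious how to generalize the Euclidean hyperplane of Eq. (18) to other spaces such as the Poincaré ball, we reformulate it as follows:

$$\tilde H_{a,p}=\{x\in\mathbb{R}^n:\langle -p+x,a\rangle=0\}=p+\{a\}^\perp,\quad\text{where } p\in\mathbb{R}^n,\ a\in\mathbb{R}^n\setminus\{0\}. \tag{20}$$

This new definition relates to the previous one as $\tilde H_{a,p}=H_{a,\langle a,p\rangle}$. Rewriting Eq. (19) with $b=\langle a,p\rangle$:

$$p(y=k\mid x)\propto\exp\left(\mathrm{sign}(\langle -p_k+x,a_k\rangle)\,\|a_k\|\,d(x,\tilde H_{a_k,p_k})\right),\quad\text{with } p_k,x,a_k\in\mathbb{R}^n. \tag{21}$$

It is now natural to adapt the previous definition to the hyperbolic setting by replacing $+$ by $\oplus_c$: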

###### Definition 3.1 (Poincaré hyperplanes).

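For $p\in\mathbb{D}^n_c$, $a\in T_p\mathbb{D}^n_c\setminus\{0\}$, let $\{a\}^\perp:=\{z\in T_p\mathbb{D}^n_c:g^c_p(z,a)=0\}=\{z\in T_p\mathbb{D}^n_c:\langle z,a\rangle=0\}$. Then, we define Poincaré hyperplanes as

$$\tilde H^c_{a,p}:=\{x\in\mathbb{D}^n_c:\langle\log^c_p(x),a\rangle_p=0\}=\exp^c_p(\{a\}^\perp)=\{x\in\mathbb{D}^n_c:\langle -p\oplus_c x,a\rangle=0\}. \tag{22}$$

The last equality is shown in appendix C. $\tilde H^c_{a,p}$ can also be described as the union of images of all geodesics in $\mathbb{D}^n_c$ orthogonal to $a$ and containing $p$. Notice that our definition matches that of hypergyroplanes, see (ungar2014analytic, definition 5.8). A 3D hyperplane example is depicted in Fig. 1.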

Next, we need the following theorem, proved in appendix D:

###### Theorem 5.
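
$$d_c(x,\tilde H^c_{a,p}):=\inf_{w\in\tilde H^c_{a,p}}d_c(x,w)=\frac{1}{\sqrt{c}}\sinh^{-1}\left(\frac{2\sqrt{c}\,|\langle -p\oplus_c x,a\rangle|}{(1-c\|-p\oplus_c x\|^2)\,\|a\|}\right). \tag{23}$$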

#### Final formula for MLR in the Poincaré ball.

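Putting together Eq. (21) and Thm. 5, we get the hyperbolic MLR formulation. Given $K$ classes and $k\in\{1,\dots,K\}$, $p_k\in\mathbb{D}^n_c$, $a_k\in T_{p_k}\mathbb{D}^n_c\setminus\{0\}$:

$$p(y=k\mid x)\propto\exp\left(\mathrm{sign}(\langle -p_k\oplus_c x,a_k\rangle)\sqrt{g^c_{p_k}(a_k,a_k)}\;d_c(x,\tilde H^c_{a_k,p_k})\right),\quad\forall x\in\mathbb{D}^n_c, \tag{24}$$

or, equivalently

$$p(y=k\mid x)\propto\exp\left(\frac{\lambda^c_{p_k}\|a_k\|}{\sqrt{c}}\sinh^{-1}\left(\frac{2\sqrt{c}\,\langle -p_k\oplus_c x,a_k\rangle}{(1-c\|-p_k\oplus_c x\|^2)\,\|a_k\|}\right)\right),\quad\forall x\in\mathbb{D}^n_c. \tag{25}$$

Notice that when $c$ goes to zero, this goes to $p(y=k\mid x)\propto\exp(4\langle -p_k+x,a_k\rangle)=\exp((\lambda^0_{p_k})^2\langle -p_k+x,a_k\rangle)$, recovering the usual Euclidean softmax.

However, at this point it is unclear how to perform optimization over $a_k$, since it lives in $T_{p_k}\mathbb{D}^n_c$ and hence depends on $p_k$. The solution is that one should write $a_k=P^c_{0\to p_k}(a'_k)=(\lambda^c_0/\lambda^c_{p_k})\,a'_k$, where $a'_k\in T_0\mathbb{D}^n_c=\mathbb{R}^n$, and optimize $a'_k$ as a Euclidean parameter.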
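To make Eq. (25) and the reparametrization $a_k=(\lambda^c_0/\lambda^c_{p_k})a'_k$ concrete, here is a minimal plain-Python sketch of the forward pass. The function names (`hyperbolic_mlr_logit`, `hyperbolic_mlr`) and the list-based layout of the parameters are assumptions made for illustration; the authors' TensorFlow code may be organized differently.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def norm(u):
    return math.sqrt(dot(u, u))

def mobius_add(x, y, c=1.0):
    """Möbius addition of Eq. (6)."""
    xy, x2, y2 = dot(x, y), dot(x, x), dot(y, y)
    den = 1 + 2 * c * xy + c * c * x2 * y2
    return [((1 + 2 * c * xy + c * y2) * a + (1 - c * x2) * b) / den
            for a, b in zip(x, y)]

def hyperbolic_mlr_logit(x, p, a_prime, c=1.0):
    """Argument of the exponential in Eq. (25) for one class.

    p is the hyperplane offset (a point of the ball); a_prime is the
    Euclidean parameter a'_k living in the tangent space at the origin.
    """
    sc = math.sqrt(c)
    lam_p = 2 / (1 - c * dot(p, p))
    a = [2 / lam_p * t for t in a_prime]        # a_k = (lambda_0 / lambda_p) a'_k
    w = mobius_add([-t for t in p], x, c)       # -p (+)_c x
    na = norm(a)
    arg = 2 * sc * dot(w, a) / ((1 - c * dot(w, w)) * na)
    return lam_p * na / sc * math.asinh(arg)

def hyperbolic_mlr(x, ps, a_primes, c=1.0):
    """Class probabilities: softmax over the per-class logits."""
    logits = [hyperbolic_mlr_logit(x, p, a, c) for p, a in zip(ps, a_primes)]
    m = max(logits)                             # subtract max for stability
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

x = [0.4, -0.3]
ps = [[0.1, 0.2], [-0.3, 0.0], [0.0, 0.5]]
a_primes = [[1.0, -0.5], [0.2, 0.7], [-1.0, 0.3]]
probs = hyperbolic_mlr(x, ps, a_primes, c=1.0)
```

A point lying on a class hyperplane (for instance $x=p_k$) gets logit zero for that class, and for very small $c$ the logit approaches $4\langle x-p_k,a'_k\rangle$, in line with the limit stated above.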

### 3.2 Hyperbolic feed-forward layers

In order to define hyperbolic neural networks, it is crucial to define a canonically simple parametric family of transformations, playing the role of linear mappings in usual Euclidean neural networks, and to know how to apply pointwise non-linearities. Inspiring ourselves from our reformulation of Möbius scalar multiplication in Eq. (14), we define:

###### Definition 3.2 (Möbius version).

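For $f:\mathbb{R}^n\to\mathbb{R}^m$, we define the Möbius version of $f$ as the map from $\mathbb{D}^n_c$ to $\mathbb{D}^m_c$ by:

$$f^{\otimes_c}(x):=\exp^c_0\left(f(\log^c_0(x))\right), \tag{26}$$

where $\exp^c_0:T_{0_m}\mathbb{D}^m_c\to\mathbb{D}^m_c$ and $\log^c_0:\mathbb{D}^n_c\to T_{0_n}\mathbb{D}^n_c$.

Note that similarly as for other Möbius operations, we recover the Euclidean mapping in the limit $c\to 0$ if $f$ is continuous, as $\lim_{c\to 0}f^{\otimes_c}(x)=f(x)$. This definition satisfies a few desirable properties too, such as: $(f\circ g)^{\otimes_c}=f^{\otimes_c}\circ g^{\otimes_c}$ for $f:\mathbb{R}^m\to\mathbb{R}^l$ and $g:\mathbb{R}^n\to\mathbb{R}^m$ (morphism property), and $f^{\otimes_c}(x)/\|f^{\otimes_c}(x)\|=f(x)/\|f(x)\|$ for $f(x)\neq 0$ (direction preserving). It is then straightforward to prove the following result: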

###### Lemma 6 (Möbius matrix-vector multiplication).

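If $M:\mathbb{R}^n\to\mathbb{R}^m$ is a linear map, which we identify with its matrix representation, then $\forall x\in\mathbb{D}^n_c$, if $Mx\neq 0$ we have

$$M^{\otimes_c}(x)=\frac{1}{\sqrt{c}}\tanh\left(\frac{\|Mx\|}{\|x\|}\tanh^{-1}(\sqrt{c}\|x\|)\right)\frac{Mx}{\|Mx\|}, \tag{27}$$

and $M^{\otimes_c}(x)=0$ if $Mx=0$. Moreover, if we define the Möbius matrix-vector multiplication of $M\in\mathcal{M}_{m,n}(\mathbb{R})$ and $x\in\mathbb{D}^n_c$ by $M\otimes_c x:=M^{\otimes_c}(x)$, then we have $(MM')\otimes_c x=M\otimes_c(M'\otimes_c x)$ for $M\in\mathcal{M}_{l,m}(\mathbb{R})$ and $M'\in\mathcal{M}_{m,n}(\mathbb{R})$ (matrix associativity), $(rM)\otimes_c x=r\otimes_c(M\otimes_c x)$ for $r\in\mathbb{R}$ and $M\in\mathcal{M}_{m,n}(\mathbb{R})$ (scalar-matrix associativity) and $M\otimes_c x=Mx$ for all $M\in\mathcal{O}_n(\mathbb{R})$ (rotations are preserved).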
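Eq. (27) and the properties of Lemma 6 can be illustrated with a few lines of plain Python. In this sketch matrices are lists of rows and the names `matvec`, `matmul` and `mobius_matvec` are hypothetical helpers, not identifiers from the original code.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def norm(u):
    return math.sqrt(dot(u, u))

def matvec(M, x):
    """Ordinary matrix-vector product; M is a list of rows."""
    return [dot(row, x) for row in M]

def matmul(A, B):
    """Ordinary matrix product of two lists of rows."""
    cols = list(zip(*B))
    return [[dot(row, col) for col in cols] for row in A]

def mobius_matvec(M, x, c=1.0):
    """Möbius matrix-vector multiplication of Eq. (27); 0 when Mx = 0."""
    sc = math.sqrt(c)
    mx = matvec(M, x)
    nmx, nx = norm(mx), norm(x)
    if nmx == 0:
        return [0.0] * len(mx)
    s = math.tanh(nmx / nx * math.atanh(sc * nx)) / (sc * nmx)
    return [s * a for a in mx]

x = [0.3, 0.1]
R = [[0.0, -1.0], [1.0, 0.0]]            # rotation by 90 degrees
A = [[2.0, 0.0], [1.0, 3.0]]
B = [[0.5, 1.0], [-1.0, 0.2]]

rotated = mobius_matvec(R, x)                               # equals R x
lhs = mobius_matvec(matmul(A, B), x)                        # (AB) (x)_c x
rhs = mobius_matvec(A, mobius_matvec(B, x))                 # A (x)_c (B (x)_c x)
```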

#### Pointwise non-linearity.

If $\varphi:\mathbb{R}^n\to\mathbb{R}^n$ is a pointwise non-linearity, then its Möbius version $\varphi^{\otimes_c}$ can be applied to elements of the Poincaré ball.

#### Bias translation.

The generalization of a translation in the Poincaré ball is naturally given by moving along geodesics. But should we use the Möbius sum $x\oplus_c b$ with a hyperbolic bias $b$ or the exponential map $\exp^c_x(b')$ with a Euclidean bias $b'$? These views are unified with parallel transport (see Thm 4). Möbius translation of a point $x\in\mathbb{D}^n_c$ by a bias $b\in\mathbb{D}^n_c$ is given by

$$x\leftarrow x\oplus_c b=\exp^c_x\left(P^c_{0\to x}(\log^c_0(b))\right)=\exp^c_x\left(\frac{\lambda^c_0}{\lambda^c_x}\log^c_0(b)\right). \tag{28}$$

We recover Euclidean translations in the limit $c\to 0$. Note that bias translations play a particular role in this model. Indeed, consider multiple layers of the form $f_k(x)=\varphi_k(M_kx)$, each of which having Möbius version $f_k^{\otimes_c}(x)=\varphi_k^{\otimes_c}(M_k\otimes_c x)$. Then their composition can be re-written $f_k^{\otimes_c}\circ\dots\circ f_1^{\otimes_c}=\exp^c_0\circ f_k\circ\dots\circ f_1\circ\log^c_0$. This means that these operations can essentially be performed in Euclidean space. Therefore, it is the interposition between those with the bias translation of Eq. (28) which differentiates this model from its Euclidean counterpart.

#### Concatenation of multiple input vectors.

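If a vector $x\in\mathbb{R}^{n+p}$ is the (vertical) concatenation of two vectors $x_1\in\mathbb{R}^n$, $x_2\in\mathbb{R}^p$, and $M\in\mathcal{M}_{m,n+p}(\mathbb{R})$ can be written as the (horizontal) concatenation of two matrices $M_1\in\mathcal{M}_{m,n}(\mathbb{R})$ and $M_2\in\mathcal{M}_{m,p}(\mathbb{R})$, then $Mx=M_1x_1+M_2x_2$. We generalize this to hyperbolic spaces: if we are given $x_1\in\mathbb{D}^n_c$, $x_2\in\mathbb{D}^p_c$, $x=(x_1\ x_2)^T\in\mathbb{D}^n_c\times\mathbb{D}^p_c$, and $M,M_1,M_2$ as before, then we define $M\otimes_c x:=M_1\otimes_c x_1\oplus_c M_2\otimes_c x_2$. Note that when $c$ goes to zero, we recover the Euclidean formulation, as $\lim_{c\to 0}M\otimes_c x=\lim_{c\to 0}M_1\otimes_c x_1\oplus_c M_2\otimes_c x_2=M_1x_1+M_2x_2=Mx$. Moreover, hyperbolic vectors $x\in\mathbb{D}^n_c$ can also be "concatenated" with real features $y\in\mathbb{R}$ by doing: $M\otimes_c x\oplus_c y\otimes_c b$ with learnable $b\in\mathbb{D}^m_c$ and $M\in\mathcal{M}_{m,n}(\mathbb{R})$.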

### 3.3 Hyperbolic RNN

#### Naive RNN.

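A simple RNN can be defined by $h_{t+1}=\varphi(Wh_t+Ux_t+b)$ where $\varphi$ is a pointwise non-linearity, typically $\tanh$, sigmoid, ReLU, etc. This formula can be naturally generalized to the hyperbolic space as follows. For parameters $W\in\mathcal{M}_{m,n}(\mathbb{R})$, $U\in\mathcal{M}_{m,d}(\mathbb{R})$, $b\in\mathbb{D}^m_c$, we define:

$$h_{t+1}=\varphi^{\otimes_c}\left(W\otimes_c h_t\oplus_c U\otimes_c x_t\oplus_c b\right),\quad h_t\in\mathbb{D}^n_c,\ x_t\in\mathbb{D}^d_c. \tag{29}$$

Note that if the inputs $x_t$ are Euclidean, one can write $\tilde x_t:=\exp^c_0(x_t)$ and use the above formula, since $\exp^c_{W\otimes_c h_t}\left(P^c_{0\to W\otimes_c h_t}(Ux_t)\right)=W\otimes_c h_t\oplus_c\exp^c_0(Ux_t)=W\otimes_c h_t\oplus_c U\otimes_c\tilde x_t$.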
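A single step of Eq. (29) can be sketched in plain Python as follows. Since $\oplus_c$ is not associative, the sketch assumes that the sum is evaluated left to right; this grouping, the function names (`mobius_fn`, `hyp_rnn_cell`) and the list-of-rows matrix layout are all illustrative assumptions.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def norm(u):
    return math.sqrt(dot(u, u))

def mobius_add(x, y, c):
    """Möbius addition of Eq. (6)."""
    xy, x2, y2 = dot(x, y), dot(x, x), dot(y, y)
    den = 1 + 2 * c * xy + c * c * x2 * y2
    return [((1 + 2 * c * xy + c * y2) * a + (1 - c * x2) * b) / den
            for a, b in zip(x, y)]

def exp0(v, c):
    sc, nv = math.sqrt(c), norm(v)
    return [0.0] * len(v) if nv == 0 else [math.tanh(sc * nv) / (sc * nv) * a for a in v]

def log0(y, c):
    sc, ny = math.sqrt(c), norm(y)
    return [0.0] * len(y) if ny == 0 else [math.atanh(sc * ny) / (sc * ny) * a for a in y]

def mobius_matvec(M, x, c):
    """Eq. (27), written through Eq. (26): exp0(M log0(x))."""
    return exp0([dot(row, log0(x, c)) for row in M], c)

def mobius_fn(phi, x, c):
    """Möbius version of a pointwise non-linearity, Eq. (26)."""
    return exp0([phi(a) for a in log0(x, c)], c)

def hyp_rnn_cell(h, x, W, U, b, c=1.0, phi=math.tanh):
    """One step of Eq. (29), grouping the Möbius sums left to right."""
    pre = mobius_add(mobius_matvec(W, h, c), mobius_matvec(U, x, c), c)
    pre = mobius_add(pre, b, c)
    return mobius_fn(phi, pre, c)

W = [[0.5, -0.3], [0.8, 0.1]]
U = [[0.2, 0.4, -0.6], [-0.1, 0.3, 0.9]]
b = [0.05, 0.02]
h = [0.0, 0.0]                                   # initial hidden state
for x_t in ([0.3, 0.05, -0.1], [0.1, -0.2, 0.4], [-0.5, 0.2, 0.1]):
    h = hyp_rnn_cell(h, x_t, W, U, b, c=1.0)
```

The hidden state stays inside the ball of radius $1/\sqrt{c}$ at every step, and for very small $c$ a step is numerically close to the Euclidean update $\tanh(Wh_t+Ux_t+b)$.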

#### GRU architecture.

One can also adapt the GRU architecture:

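$$\begin{aligned} r_t&=\sigma(W^r h_{t-1}+U^r x_t+b^r), & z_t&=\sigma(W^z h_{t-1}+U^z x_t+b^z),\\ \tilde h_t&=\varphi(W(r_t\odot h_{t-1})+Ux_t+b), & h_t&=(1-z_t)\odot h_{t-1}+z_t\odot\tilde h_t, \end{aligned} \tag{30}$$

where $\odot$ denotes pointwise product. First, how should we adapt the pointwise multiplication by a scaling gate? Note that the definition of the Möbius version (see Eq. (26)) can be naturally extended to maps $f:\mathbb{R}^n\times\mathbb{R}^p\to\mathbb{R}^m$ as $f^{\otimes_c}:(h,h')\in\mathbb{D}^n_c\times\mathbb{D}^p_c\mapsto\exp^c_0\left(f(\log^c_0(h),\log^c_0(h'))\right)$. In particular, choosing $f(h,h'):=\sigma(h)\odot h'$ yields $f^{\otimes_c}(h,h')=\exp^c_0\left(\sigma(\log^c_0(h))\odot\log^c_0(h')\right)=\mathrm{diag}(\sigma(\log^c_0(h)))\otimes_c h'$, where, if $x$ has $n$ coordinates, $\mathrm{diag}(x)$ denotes the diagonal matrix of size $n$ with the $x_i$ on its diagonal. Hence we adapt $r_t\odot h_{t-1}$ to $\mathrm{diag}(r_t)\otimes_c h_{t-1}$ and the reset gate $r_t$ to:

$$r_t=\sigma\log^c_0\left(W^r\otimes_c h_{t-1}\oplus_c U^r\otimes_c x_t\oplus_c b^r\right), \tag{31}$$

and similarly for the update gate $z_t$. Note that as the argument of $\sigma$ in the above is unbounded, $r_t$ and $z_t$ can a priori take values onto the full range $(0,1)$. Now the intermediate hidden state becomes:

$$\tilde h_t=\varphi^{\otimes_c}\left((W\mathrm{diag}(r_t))\otimes_c h_{t-1}\oplus_c U\otimes_c x_t\oplus_c b\right), \tag{32}$$

where Möbius matrix associativity simplifies $W\otimes_c(\mathrm{diag}(r_t)\otimes_c h_{t-1})$ into $(W\mathrm{diag}(r_t))\otimes_c h_{t-1}$. Finally, we propose to adapt the update-gate equation as

$$h_t=h_{t-1}\oplus_c\mathrm{diag}(z_t)\otimes_c\left(-h_{t-1}\oplus_c\tilde h_t\right). \tag{33}$$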

Note that when $c$ goes to zero, one recovers the usual GRU. Moreover, if $z_t=0$ or $z_t=1$, then $h_t$ becomes $h_{t-1}$ or $\tilde h_t$ respectively, similarly as in the usual GRU. This adaptation was obtained by adapting (tallec2018can): in that work, the authors re-derive the update-gate mechanism from a first principle called time-warping invariance. We adapted their derivation to the hyperbolic setting by using the notion of gyroderivative (birman2001hyperbolic) and proving a gyro-chain-rule (see appendix E).
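Eqs. (31) to (33) can be assembled into a single cell. The plain-Python sketch below is one possible reading of these equations: the Möbius sums are grouped left to right, the parameters are passed in a dictionary with hypothetical keys (`Wr`, `Ur`, `br`, ...), and $\mathrm{diag}(g)\otimes_c h$ is computed as $\exp^c_0(g\odot\log^c_0(h))$.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def norm(u):
    return math.sqrt(dot(u, u))

def mobius_add(x, y, c):
    xy, x2, y2 = dot(x, y), dot(x, x), dot(y, y)
    den = 1 + 2 * c * xy + c * c * x2 * y2
    return [((1 + 2 * c * xy + c * y2) * a + (1 - c * x2) * b) / den
            for a, b in zip(x, y)]

def exp0(v, c):
    sc, nv = math.sqrt(c), norm(v)
    return [0.0] * len(v) if nv == 0 else [math.tanh(sc * nv) / (sc * nv) * a for a in v]

def log0(y, c):
    sc, ny = math.sqrt(c), norm(y)
    return [0.0] * len(y) if ny == 0 else [math.atanh(sc * ny) / (sc * ny) * a for a in y]

def mobius_matvec(M, x, c):
    return exp0([dot(row, log0(x, c)) for row in M], c)

def diag_scale(g, h, c):
    """diag(g) (x)_c h, i.e. exp0(g * log0(h)) taken coordinate-wise."""
    return exp0([gi * li for gi, li in zip(g, log0(h, c))], c)

def affine(W, h, U, x, b, c):
    """W (x)_c h (+)_c U (x)_c x (+)_c b, grouped left to right."""
    s = mobius_add(mobius_matvec(W, h, c), mobius_matvec(U, x, c), c)
    return mobius_add(s, b, c)

def sigmoid(t):
    return 1 / (1 + math.exp(-t))

def gru_update(h_prev, h_tilde, z, c):
    """Update-gate equation, Eq. (33)."""
    delta = mobius_add([-a for a in h_prev], h_tilde, c)
    return mobius_add(h_prev, diag_scale(z, delta, c), c)

def hyp_gru_cell(h, x, P, c=1.0, phi=math.tanh):
    """One hyperbolic GRU step following Eqs. (31)-(33)."""
    r = [sigmoid(t) for t in log0(affine(P["Wr"], h, P["Ur"], x, P["br"], c), c)]
    z = [sigmoid(t) for t in log0(affine(P["Wz"], h, P["Uz"], x, P["bz"], c), c)]
    gated = diag_scale(r, h, c)                       # diag(r_t) (x)_c h_{t-1}
    pre = affine(P["W"], gated, P["U"], x, P["b"], c)
    h_tilde = exp0([phi(a) for a in log0(pre, c)], c)  # Möbius version of phi
    return gru_update(h, h_tilde, z, c)

P = {"Wr": [[0.5, -0.3], [0.8, 0.1]], "Ur": [[0.2, 0.4], [-0.1, 0.3]], "br": [0.05, 0.02],
     "Wz": [[-0.4, 0.6], [0.3, 0.2]], "Uz": [[0.7, -0.2], [0.1, 0.5]], "bz": [-0.03, 0.04],
     "W": [[0.9, 0.1], [-0.2, 0.6]], "U": [[0.3, -0.5], [0.4, 0.2]], "b": [0.01, -0.06]}
h_next = hyp_gru_cell([0.1, -0.2], [0.3, 0.05], P, c=1.0)
```

Setting the update gate to all ones in `gru_update` returns $\tilde h_t$, and setting it to all zeros returns $h_{t-1}$, matching the remark above.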

## 4 Experiments

We evaluate our method on two tasks. The first is natural language inference, or textual entailment. Given two sentences, a premise (e.g. "Little kids A. and B. are playing soccer.") and a hypothesis (e.g. "Two children are playing outdoors."), the binary classification task is to predict whether the second sentence can be inferred from the first one. This defines a partial order in the sentence space. We test hyperbolic networks on the biggest real dataset for this task, SNLI (bowman2015large). It consists of 570K training, 10K validation and 10K test sentence pairs. Following (vendrov2015order), we merge the "contradiction" and "neutral" classes into a single class of negative sentence pairs, while the "entailment" class gives the positive pairs.

We conjecture that the improvements of hyperbolic neural networks are more significant when the underlying data structure is closer to a tree. To test this, we design a proof-of-concept task of detection of noisy prefixes, i.e. given two sentences, one has to decide if the second sentence is a noisy prefix of the first, or a random sentence. We thus build synthetic datasets PREFIX-Z% (for Z being 10, 30 or 50) as follows: for each random first sentence of random length at most 20 and one random prefix of it, a second positive sentence is generated by randomly replacing Z% of the words of the prefix, and a second negative sentence of same length is randomly generated. Word vocabulary size is 100, and we generate 500K training, 10K validation and 10K test pairs.
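The construction of the PREFIX-Z% data can be sketched as follows. The description above leaves several details open, so this plain-Python sketch makes explicit assumptions: sentence and prefix lengths are drawn uniformly with a minimum of 1, the number of replaced words is `round(Z% * prefix length)`, and a replaced word may coincide with the original one. The function name `make_prefix_pair` is hypothetical.

```python
import random

def make_prefix_pair(rng, z_percent, vocab_size=100, max_len=20):
    """Return one positive and one negative example for PREFIX-Z%.

    Each example is (first_sentence, second_sentence, label), where words
    are integer ids in [0, vocab_size).
    """
    first = [rng.randrange(vocab_size) for _ in range(rng.randint(1, max_len))]
    prefix = first[: rng.randint(1, len(first))]

    # Positive: the prefix with Z% of its positions overwritten by random words.
    n_noisy = round(len(prefix) * z_percent / 100)
    positive = list(prefix)
    for i in rng.sample(range(len(prefix)), n_noisy):
        positive[i] = rng.randrange(vocab_size)

    # Negative: a fully random sentence of the same length as the positive one.
    negative = [rng.randrange(vocab_size) for _ in prefix]
    return (first, positive, 1), (first, negative, 0)

rng = random.Random(0)
pairs = [make_prefix_pair(rng, z_percent=30) for _ in range(1000)]
```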

#### Models architecture.

Our neural network layers can be used in a plug-n-play manner exactly like standard Euclidean layers. They can also be combined with Euclidean layers. However, optimization w.r.t. hyperbolic parameters is different (see below) and based on Riemannian gradients which are just rescaled Euclidean gradients when working in the conformal Poincaré model (nickel2017poincar). Thus, back-propagation can be applied in the standard way.

In our setting, we embed the two sentences using two distinct hyperbolic RNNs or GRUs. The sentence embeddings are then fed together with their squared distance (hyperbolic or Euclidean, depending on their geometry) to a FFNN (Euclidean or hyperbolic, see Sec. 3.2) which is further fed to an MLR (Euclidean or hyperbolic, see Sec. 3.1) that gives probabilities of the two classes (entailment vs neutral). We use cross-entropy loss on top. Note that hyperbolic and Euclidean layers can be mixed, e.g. the full network can be hyperbolic and only the last layer be Euclidean, in which case one has to use $\log_0$ and $\exp_0$ functions to move between the two manifolds in a correct manner as explained for Eq. 26.

#### Optimization.

Our models have both Euclidean (e.g. weight matrices in both Euclidean and hyperbolic FFNNs, RNNs or GRUs) and hyperbolic parameters (e.g. word embeddings or biases for the hyperbolic layers). We optimize the Euclidean parameters with Adam (kingma2014adam) (learning rate 0.001). Hyperbolic parameters cannot be updated with an equivalent method that keeps track of gradient history due to the absence of a Riemannian Adam. Thus, they are optimized using full Riemannian stochastic gradient descent (RSGD) (bonnabel2013stochastic; ganea2018hyperbolic). We also experiment with projected RSGD (nickel2017poincar), but optimization was sometimes less stable. We use a different constant learning rate for word embeddings (0.1) and other hyperbolic weights (0.01) because words are updated less frequently.
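A single full RSGD update for a point of the Poincaré ball can be sketched as below: the Euclidean gradient is rescaled by $1/(\lambda^c_x)^2$ to obtain the Riemannian gradient, and the step is taken along the exponential map of Eq. (12). This is a plain-Python illustration under those assumptions; the function names and the toy objective are hypothetical and not taken from the authors' code.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def norm(u):
    return math.sqrt(dot(u, u))

def mobius_add(x, y, c):
    xy, x2, y2 = dot(x, y), dot(x, x), dot(y, y)
    den = 1 + 2 * c * xy + c * c * x2 * y2
    return [((1 + 2 * c * xy + c * y2) * a + (1 - c * x2) * b) / den
            for a, b in zip(x, y)]

def exp_map(x, v, c):
    """Exponential map of Eq. (12); returns x when v = 0."""
    sc, nv = math.sqrt(c), norm(v)
    if nv == 0:
        return list(x)
    lam_x = 2 / (1 - c * dot(x, x))
    s = math.tanh(sc * lam_x * nv / 2) / (sc * nv)
    return mobius_add(x, [s * a for a in v], c)

def rsgd_step(x, egrad, lr, c=1.0):
    """x <- exp_x(-lr * riemannian_grad), with riemannian_grad = egrad / lambda_x^2."""
    lam_x = 2 / (1 - c * dot(x, x))
    rgrad = [g / lam_x ** 2 for g in egrad]
    return exp_map(x, [-lr * g for g in rgrad], c)

def fit_point(target, steps=2000, lr=0.1, c=1.0):
    """Toy use: minimize ||x - target||^2 over the ball, starting at the origin."""
    x = [0.0] * len(target)
    for _ in range(steps):
        egrad = [2 * (a - t) for a, t in zip(x, target)]
        x = rsgd_step(x, egrad, lr, c)
    return x

fitted = fit_point([0.5, 0.2])
```

Because the update goes through the exponential map, every iterate stays inside the ball without an explicit projection; the projected variant mentioned above would instead take a Euclidean step with the rescaled gradient and project back onto the ball.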

#### Numerical errors.

Gradients of the basic operations defined above (e.g. $\oplus_c$, exponential map) are not defined when the hyperbolic argument vectors are on the ball border, i.e. $\sqrt{c}\|x\|=1$. Thus, we always project results of these operations in the ball of radius , where . Numerical errors also appear when hyperbolic vectors get closer to 0, thus we perturb them with an before they are used in any of the above operations. Finally, arguments of the $\tanh$ function are clipped between to avoid numerical errors, while arguments of $\tanh^{-1}$ are clipped to at most .

#### Hyperparameters.

For all methods, baselines and datasets, we use $c=1$, word and hidden state embedding dimension of 5 (we focus on the low dimensional setting that was shown to already be effective (nickel2017poincar)), batch size of 64. We ran all methods for a fixed number of epochs. For all models, we experiment with both identity (no non-linearity) or $\tanh$ non-linearity in the RNN/GRU cell, as well as identity or ReLU after the FFNN layer and before MLR. As expected, for the fully Euclidean models, $\tanh$ and ReLU respectively surpassed the identity variant by a large margin. We only report the best Euclidean results. Interestingly, for the hyperbolic models, using only identity for both non-linearities works slightly better and this is likely due to two facts: i) our hyperbolic layers already contain non-linearities by their nature, ii) $\tanh$ is limiting the output domain of the sentence embeddings, but the hyperbolic specific geometry is more pronounced at the ball border, i.e. at the hyperbolic "infinity", compared to the center of the ball.

For the results shown in Tab. 1, we run each model (baseline or ours) exactly 3 times and report the test result corresponding to the best validation result from these 3 runs. We do this because the highly non-convex spectrum of hyperbolic neural networks sometimes results in convergence to poor local minima, suggesting that initialization is very important.

#### Results.

Results are shown in Tab. 1. Note that the fully Euclidean baseline models might have an advantage over hyperbolic baselines because more sophisticated optimization algorithms such as Adam do not have a hyperbolic analogue at the moment. We first observe that all GRU models outperform their RNN variants. Hyperbolic RNNs and GRUs have the most significant improvement over their Euclidean variants when the underlying data structure is more tree-like, e.g. for PREFIX-10%, for which the tree relation between sentences and their prefixes is more prominent, we reduce the error by a factor of for hyperbolic vs Euclidean RNN, and by a factor of for hyperbolic vs Euclidean GRU. As soon as the underlying structure diverges more and more from a tree, the accuracy gap decreases: for example, for PREFIX-50% the noise heavily affects the representational power of hyperbolic networks. Also, note that on SNLI our methods perform similarly to their Euclidean variants. Moreover, hyperbolic and Euclidean MLR are on par when used in conjunction with hyperbolic sentence embeddings, suggesting further empirical investigation is needed for this direction (see below).

We also observe that, in the hyperbolic setting, accuracy tends to increase when the norms of sentence embeddings start increasing, and gets better as their norms converge towards 1 (the ball border for $c=1$). Unlike in the Euclidean case, this behavior happens only after a few epochs and suggests that the model should first adjust the angular layout in order to disentangle the representations, before increasing their norms to fully exploit the strong clustering property of the hyperbolic geometry. Similar behavior was observed in the context of embedding trees by (nickel2017poincar). Details in appendix F.

#### MLR classification experiments.

For the sentence entailment classification task we do not see a clear advantage of hyperbolic MLR compared to its Euclidean variant. A possible reason is that, when trained end-to-end, the model might place positive and negative embeddings in a manner that is already well separated by a classic MLR. We therefore further investigate MLR on the task of subtree classification. Using an open source implementation of [21], we pre-trained Poincaré embeddings of the WordNet noun hierarchy (82,115 nodes). We then choose one node in this tree (see Table 2) and classify all other nodes, solely based on their embeddings, as being part of the subtree rooted at this node or not. All nodes in such a subtree are divided into positive training nodes (80%) and positive test nodes (20%). The same splitting procedure is applied to the remaining WordNet nodes, which are divided into a negative training set and a negative test set. Three variants of MLR are then trained on top of the pre-trained Poincaré embeddings [21] to solve this binary classification task: hyperbolic MLR, Euclidean MLR applied directly to the hyperbolic embeddings, and Euclidean MLR applied after mapping all embeddings to the tangent space at 0 using the $\log_0$ map. We use different embedding dimensions: 2, 3, 5 and 10. For the hyperbolic MLR, we use full Riemannian SGD with a learning rate of 0.001. For the two Euclidean models we use the Adam optimizer with the same learning rate. During training, we always sample the same number of negative and positive nodes in each minibatch of size 16; thus positive nodes are frequently resampled. All methods are trained for 30 epochs and the final F1 score is reported (there are no hyperparameters to validate, so we do not require a validation set). This procedure is repeated for four subtrees of different sizes.
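The third baseline maps embeddings to the tangent space at 0 before applying a Euclidean classifier. A minimal NumPy sketch of that mapping is given below, using the standard closed form of the logarithmic map at the origin of the Poincaré ball. The function names, the clipping constants and the toy embeddings are assumptions made for illustration and do not come from the authors' TensorFlow code.

```python
import numpy as np

def log_map_zero(y, c=1.0, eps=1e-12):
    """Map points y of the Poincaré ball (curvature -c) to the tangent space at 0.

    Uses log_0(y) = artanh(sqrt(c) * ||y||) * y / (sqrt(c) * ||y||), row by row.
    """
    sqrt_c = np.sqrt(c)
    norm = np.maximum(np.linalg.norm(y, axis=-1, keepdims=True), eps)
    # Clip so that arctanh stays finite for points numerically on the border.
    scaled = np.minimum(sqrt_c * norm, 1.0 - 1e-7)
    return np.arctanh(scaled) / (sqrt_c * norm) * y

def exp_map_zero(v, c=1.0, eps=1e-12):
    """Inverse mapping: send tangent vectors at 0 back into the ball."""
    sqrt_c = np.sqrt(c)
    norm = np.maximum(np.linalg.norm(v, axis=-1, keepdims=True), eps)
    return np.tanh(sqrt_c * norm) / (sqrt_c * norm) * v

# Toy 2-dimensional embeddings (hypothetical values, one point per row).
embeddings = np.array([[0.3, 0.4], [0.0, 0.9], [0.0, 0.0]])
tangent = log_map_zero(embeddings)
```

The rows of `tangent` can then be fed to any Euclidean classifier. Directions are preserved and only the radial coordinate is rescaled, so points near the ball border are pushed far from the origin in the tangent space.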

Quantitative results are presented in Table 2. The hyperbolic MLR outperforms its Euclidean variants in almost all settings, sometimes by a large margin. Moreover, to provide further understanding, we plot the 2-dimensional embeddings and the trained separation hyperplanes (geodesics in this case) in Figure 2. These plots indicate that respecting the hyperbolic geometry is very important for a good classification model.

## 5 Conclusion

We showed how classic Euclidean deep learning tools such as MLR, FFNNs, RNNs or GRUs can be generalized in a principled manner to all spaces of constant negative curvature by combining Riemannian geometry with the elegant theory of gyrovector spaces. Empirically, we found that our models outperform or are on par with corresponding Euclidean architectures on sequential data with implicit hierarchical structure. We hope to trigger exciting future research towards a better understanding of the hyperbolic non-convexity spectrum and the development of other non-Euclidean deep learning methods.

Our data and TensorFlow [1] code are publicly available.

## Acknowledgements

We thank Igor Petrovski for useful pointers regarding the implementation.

This research is funded by the Swiss National Science Foundation (SNSF) under grant agreement number 167176. Gary Bécigneul is also funded by the Max Planck ETH Center for Learning Systems.

## References

• [1] Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et al. Tensorflow: A system for large-scale machine learning. 2016.
• [2] Abraham Albert Ungar. Analytic hyperbolic geometry and Albert Einstein’s special theory of relativity. World Scientific, 2008.
• [3] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. In Proceedings of the International Conference on Learning Representations (ICLR), 2015.
• [4] Graciela S Birman and Abraham A Ungar. The hyperbolic derivative in the poincaré ball model of hyperbolic geometry. Journal of mathematical analysis and applications, 254(1):321–333, 2001.
• [5] S. Bonnabel. Stochastic gradient descent on riemannian manifolds. IEEE Transactions on Automatic Control, 58(9):2217–2229, Sept 2013.
• [6] Antoine Bordes, Nicolas Usunier, Alberto Garcia-Duran, Jason Weston, and Oksana Yakhnenko. Translating embeddings for modeling multi-relational data. In Advances in neural information processing systems (NIPS), pages 2787–2795, 2013.
• [7] Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. A large annotated corpus for learning natural language inference. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 632–642. Association for Computational Linguistics, 2015.
• [8] Michael M Bronstein, Joan Bruna, Yann LeCun, Arthur Szlam, and Pierre Vandergheynst. Geometric deep learning: going beyond euclidean data. IEEE Signal Processing Magazine, 34(4):18–42, 2017.
• [9] James W Cannon, William J Floyd, Richard Kenyon, Walter R Parry, et al. Hyperbolic geometry. Flavors of geometry, 31:59–115, 1997.
• [10] Christopher De Sa, Albert Gu, Christopher Ré, and Frederic Sala. Representation tradeoffs for hyperbolic embeddings. arXiv preprint arXiv:1804.03329, 2018.
• [11] Octavian-Eugen Ganea, Gary Bécigneul, and Thomas Hofmann. Hyperbolic entailment cones for learning hierarchical embeddings. In Proceedings of the thirty-fifth international conference on machine learning (ICML), 2018.
• [12] Mikhael Gromov. Hyperbolic groups. In Essays in group theory, pages 75–263. Springer, 1987.
• [13] Matthias Hamann. On the tree-likeness of hyperbolic spaces. Mathematical Proceedings of the Cambridge Philosophical Society, pages 1–17, 2017.
• [14] Christopher Hopper and Ben Andrews. The Ricci flow in Riemannian geometry. Springer, 2010.
• [15] Yoon Kim. Convolutional neural networks for sentence classification. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1746–1751. Association for Computational Linguistics, 2014.
• [16] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In Proceedings of the International Conference on Learning Representations (ICLR), 2015.
• [17] Dmitri Krioukov, Fragkiskos Papadopoulos, Maksim Kitsak, Amin Vahdat, and Marián Boguná. Hyperbolic geometry of complex networks. Physical Review E, 82(3):036106, 2010.
• [18] John Lamping, Ramana Rao, and Peter Pirolli. A focus+ context technique based on hyperbolic geometry for visualizing large hierarchies. In Proceedings of the SIGCHI conference on Human factors in computing systems, pages 401–408. ACM Press/Addison-Wesley Publishing Co., 1995.
• [19] Guy Lebanon and John Lafferty. Hyperplane margin classifiers on the multinomial manifold. In Proceedings of the international conference on machine learning (ICML), page 66. ACM, 2004.
• [20] Maximilian Nickel, Volker Tresp, and Hans-Peter Kriegel. A three-way model for collective learning on multi-relational data. In Proceedings of the international conference on machine learning (ICML), volume 11, pages 809–816, 2011.
• [21] Maximillian Nickel and Douwe Kiela. Poincaré embeddings for learning hierarchical representations. In Advances in Neural Information Processing Systems (NIPS), pages 6341–6350, 2017.
• [22] Tim Rocktäschel, Edward Grefenstette, Karl Moritz Hermann, Tomáš Kočiskỳ, and Phil Blunsom. Reasoning about entailment with neural attention. In Proceedings of the International Conference on Learning Representations (ICLR), 2015.
• [23] Michael Spivak. A comprehensive introduction to differential geometry. Publish or perish, 1979.
• [24] Corentin Tallec and Yann Ollivier. Can recurrent neural networks warp time? In Proceedings of the International Conference on Learning Representations (ICLR), 2018.
• [25] Abraham A Ungar. Hyperbolic trigonometry and its application in the poincaré ball model of hyperbolic geometry. Computers & Mathematics with Applications, 41(1-2):135–147, 2001.
• [26] Abraham Albert Ungar. A gyrovector space approach to hyperbolic geometry. Synthesis Lectures on Mathematics and Statistics, 1(1):1–194, 2008.
• [27] Abraham Albert Ungar. Analytic hyperbolic geometry in n dimensions: An introduction. CRC Press, 2014.
• [28] Ivan Vendrov, Ryan Kiros, Sanja Fidler, and Raquel Urtasun. Order-embeddings of images and language. In Proceedings of the International Conference on Learning Representations (ICLR), 2016.
• [29] J Vermeer. A geometric interpretation of ungar’s addition and of gyration in the hyperbolic plane. Topology and its Applications, 152(3):226–242, 2005.

## Appendix A Hyperbolic Trigonometry

#### Hyperbolic angles.

For $A, B, C \in \mathbb{D}_c^n$, we denote by $\angle A := \angle BAC$ the angle between the two geodesics starting from $A$ and ending at $B$ and $C$ respectively. This angle can be defined in two equivalent ways: i) either using the angle between the initial velocities of the two geodesics as given by Eq. 5, or ii) using the formula