Learning Continuous Hierarchies in the Lorentz Model of Hyperbolic Geometry

by Maximilian Nickel, et al.

We are concerned with the discovery of hierarchical relationships from large-scale unstructured similarity scores. For this purpose, we study different models of hyperbolic space and find that learning embeddings in the Lorentz model is substantially more efficient than in the Poincaré-ball model. We show that the proposed approach allows us to learn high-quality embeddings of large taxonomies which yield improvements over Poincaré embeddings, especially in low dimensions. Lastly, we apply our model to discover hierarchies in two real-world datasets: we show that an embedding in hyperbolic space can reveal important aspects of a company's organizational structure as well as reveal historical relationships between language families.


1 Introduction

Hierarchical structures are ubiquitous in knowledge representation and reasoning. For example, starting with Linnaeus, taxonomies have long been used in biology to categorize and understand the relationships between species (Mayr, 1968). In social science, hierarchies are used to understand interactions in humans and animals or to analyze organizational structures such as companies and governments (Dodds et al., 2003). In comparative linguistics, evolutionary trees are used to describe the origin of languages (Campbell, 2013), while ontologies are used to provide rich categorizations of entities in semantic networks (Antoniou & Van Harmelen, 2004). Hierarchies are also known to provide important information for learning and classification (Silla & Freitas, 2011). In cognitive development, the results of Inhelder & Piaget (1964) suggest that the classification structure in children’s thinking is hierarchical in nature.

Hierarchies can therefore provide important insights into systems of concepts. However, explicit information about such hierarchical relationships is unavailable for many domains. In this paper, we therefore consider the problem of discovering concept hierarchies from unstructured observations, specifically in the following setting:

  1. We focus on discovering pairwise hierarchical relations between concepts, where all superior and subordinate concepts are observed.

  2. We aim to infer concept hierarchies only from pairwise similarity measurements, which are relatively easy and cheap to obtain in many domains.

Examples of hierarchy discovery that adhere to this setting include the creation of taxonomies from similarity judgments (e.g., genetic similarity of species or cognate similarity of languages) and the recovery of organizational hierarchies and dominance relations from social interactions.

To infer hierarchies from similarity judgments, we propose to model such relationships as a combination of two separate aspects: relatedness and generality. Concept A is a parent (a superior) to concept B if both concepts are related and A is more general than B. By separating these aspects, we can then discover concept hierarchies via hyperbolic embeddings. In particular, we build upon ideas of Poincaré embeddings (Nickel & Kiela, 2017) to learn continuous representations of hierarchies. Due to its geometric properties, hyperbolic space can be thought of as a continuous analogue to discrete trees. By embedding concepts in such a way that their similarity order is preserved, we can then identify (soft) hierarchical relationships from the embedding: relatedness is captured via the distance in the embedding space, while generality is captured via the norm of the embeddings.

To learn high-quality embeddings, we propose a new optimization approach based on the Lorentz model of hyperbolic space. The Lorentz model allows for an efficient closed-form computation of the geodesics on the manifold. This facilitates the development of an efficient optimizer that directly follows these geodesics, rather than relying on a first-order approximation as in Nickel & Kiela (2017). It also allows us to avoid numerical instabilities that arise from the Poincaré distance. As we will show experimentally, this optimization method leads to a substantially improved embedding quality, especially in low dimensions. At the same time, we retain the attractive properties of hyperbolic embeddings, i.e., learning continuous representations of hierarchies via gradient-based optimization while scaling to large datasets.

The remainder of this paper is organized as follows. In Section 2, we discuss related work regarding hyperbolic and ordered embeddings. In Section 3, we introduce our model and algorithm to compute the embeddings. In Section 4, we evaluate the efficiency of our approach on large taxonomies. Furthermore, we evaluate the ability of our model to discover meaningful hierarchies on real-world datasets.

2 Related Work

Hyperbolic geometry has recently received attention in machine learning and network science due to its attractive properties for modeling data with latent hierarchies.

Krioukov et al. (2010) showed that typical properties of complex networks (e.g., heterogeneous degree distributions and strong clustering) can be explained by assuming an underlying hyperbolic geometry and, moreover, developed a framework to model networks based on these properties. Furthermore, Kleinberg (2007) and Boguñá et al. (2010) proposed hyperbolic embeddings for greedy shortest-path routing in communication networks. Asta & Shalizi (2015) used hyperbolic embeddings of graphs to compare the global structure of networks. Sun et al. (2015) proposed to learn representations of non-metric data in pseudo-Riemannian space-time, which is closely related to hyperbolic space.

Most similar to our work are the recently proposed Poincaré embeddings (Nickel & Kiela, 2017), which learn hierarchical representations of symbolic data by embedding them into an $n$-dimensional Poincaré ball. The main focus of that work was to model the link structure of symbolic data efficiently, i.e., to find low-dimensional embeddings via exploiting the hierarchical structure of hyperbolic space. Here, we build upon this idea and extend it in various ways. First, we propose a new model to compute hyperbolic embeddings in the Lorentz model of hyperbolic geometry. This allows us to develop an efficient Riemannian optimization method that scales well to large datasets and provides better embeddings, especially in low dimensions. Second, we consider inferring hierarchies from real-valued similarity scores, which generalize binary adjacency matrices as considered by Nickel & Kiela (2017). Third, in addition to preserving similarity (e.g., local link structure), we also focus on recovering the correct hierarchical relationships from the embedding.

Concurrently with the present work, De Sa et al. (2018) analyzed the representation trade-offs for hyperbolic embeddings and proposed a new combinatorial embedding approach as well as a new approach to Multi-Dimensional Scaling (MDS) in hyperbolic space. Furthermore, Ganea et al. (2018) extended Poincaré embeddings using geodesically convex cones to model asymmetric relations.

Another related method is Order Embeddings (Vendrov et al., 2015), which was proposed to learn visual-semantic hierarchies over words, sentences, and images from ordered input pairs. In contrast, we are concerned with learning hierarchical embeddings from less supervision: namely, from unordered (symmetric) input pairs that provide no direct information about the partial ordering in the hierarchy.

Further work on embedding order structures includes Stochastic Triplet Embeddings (Van Der Maaten & Weinberger, 2012), Generalized Non-Metric MDS (Agarwal et al., 2007), and Crowd Kernels (Tamuz et al., 2011). In the context of word embeddings, Vilnis & McCallum (2015) proposed Gaussian Embeddings to learn improved representations. By mapping words to densities, this model is capable of capturing uncertainty, asymmetry, and (hierarchical) entailment relations.

To discover structural forms (e.g., trees, grids, chains) from data, Kemp & Tenenbaum (2008) proposed a model for making probabilistic inferences over a space of graph grammars. Recently, Lake et al. (2018) proposed an alternative approach to this work based on structural sparsity. Additionally, hierarchical clustering has a long history in machine learning and data mining (Duda et al., 1973). Bottom-up agglomerative clustering assigns each data point to its own cluster and then iteratively merges the two closest points according to a given distance measure (e.g., single link, average link, max link). As such, hierarchical clustering provides a hierarchical partition of the input space. In contrast, we are concerned with discovering direct hierarchical relationships between the input data points.

3 Methods

In the following, we describe our approach for learning continuous hierarchies from unstructured observations.

(a) Geodesics in the Poincaré disk.
(b) Lorentz model of hyperbolic geometry.
Figure 1: (a) Geodesics in the Poincaré disk model of hyperbolic space. Due to the negative curvature of the space, geodesics between points are arcs that are perpendicular to the boundary of the disk. For curved arcs, midpoints are closer to the origin of the disk (p1) than the associated points, e.g. (p3, p5). (b) Points (p, q) lie on the surface of the upper sheet of a two-sheeted hyperboloid. Points (u, v) are the mapping of (p, q) onto the Poincaré disk using Equation 11.

3.1 Hyperbolic Geometry & Poincaré Embeddings

Hyperbolic space is the unique, complete, simply connected Riemannian manifold with constant negative sectional curvature. There exist multiple equivalent models for hyperbolic space (meaning that there exist transformations between the different models that preserve all geometric properties, including isometry), and one can choose whichever model is best suited for a given task. Nickel & Kiela (2017) based their approach for learning hyperbolic embeddings on the Poincaré ball model, due to its conformality and convenient parameterization. The Poincaré ball model is the Riemannian manifold $\mathcal{P}^n = (\mathcal{B}^n, g_p)$, where $\mathcal{B}^n = \{x \in \mathbb{R}^n : \|x\| < 1\}$ is the open $n$-dimensional unit ball and where

$$g_p(x) = \left(\frac{2}{1 - \|x\|^2}\right)^2 g_e,$$

with $g_e$ denoting the Euclidean metric tensor. The distance function on $\mathcal{P}^n$ is then defined as

$$d_p(x, y) = \operatorname{arcosh}\left(1 + 2\,\frac{\|x - y\|^2}{(1 - \|x\|^2)(1 - \|y\|^2)}\right). \tag{1}$$

It can be seen from Equation 1 that the distance within the Poincaré ball changes smoothly with respect to the norm of $x$ and $y$. This locality property of the distance is key for learning continuous embeddings of hierarchies. For instance, by placing the root node of a tree at the origin of $\mathcal{B}^n$, it would have relatively small distance to all other nodes, as its norm is zero. On the other hand, leaf nodes can be placed close to the boundary of the ball, as the distance between points grows quickly with a norm close to one.
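The locality property can be checked numerically. The following is a minimal sketch, assuming the standard Poincaré distance $\operatorname{arcosh}(1 + 2\|x - y\|^2 / ((1 - \|x\|^2)(1 - \|y\|^2)))$; the function name and the three example points are hypothetical and chosen only for illustration.

```python
import numpy as np

def poincare_distance(x, y):
    """Distance between two points of the open unit ball (standard Poincaré formula)."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    sq_dist = np.sum((x - y) ** 2)
    denom = (1.0 - np.sum(x ** 2)) * (1.0 - np.sum(y ** 2))
    return np.arccosh(1.0 + 2.0 * sq_dist / denom)

# Hypothetical toy points: a "root" at the origin and two "leaves" near the boundary.
root = np.zeros(2)
leaf_a = np.array([0.95, 0.0])
leaf_b = np.array([0.0, 0.95])

d_root_leaf = poincare_distance(root, leaf_a)
d_leaf_leaf = poincare_distance(leaf_a, leaf_b)
print("root-leaf:", d_root_leaf, "leaf-leaf:", d_leaf_leaf)
```

In this toy configuration the two leaves are much farther from each other than either is from the root, although all Euclidean distances are of comparable size. This is the behavior that makes the origin a natural location for general concepts.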

3.2 Riemannian Optimization in the Lorentz Model

In the following, we propose a new method to compute hyperbolic embeddings based on the Lorentz model of hyperbolic geometry. The main advantage of this parameterization is that it allows us to perform Riemannian optimization very efficiently. An additional advantage is that its distance function (see Equation 5) avoids numerical instabilities that arise from the fraction in the Poincaré distance.

3.2.1 The Lorentz Model of Hyperbolic Space

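In the following, let $x, y \in \mathbb{R}^{n+1}$, and let

$$\langle x, y \rangle_{\mathcal{L}} = -x_0 y_0 + \sum_{i=1}^{n} x_i y_i \tag{2}$$

denote the Lorentzian scalar product. The Lorentz model of $n$-dimensional hyperbolic space is then defined as the Riemannian manifold $\mathcal{L}^n = (\mathcal{H}^n, g_\ell)$, where

$$\mathcal{H}^n = \{x \in \mathbb{R}^{n+1} : \langle x, x \rangle_{\mathcal{L}} = -1,\ x_0 > 0\} \tag{3}$$

denotes the upper sheet of a two-sheeted $n$-dimensional hyperboloid and where

$$g_\ell(x) = \operatorname{diag}(-1, 1, \ldots, 1). \tag{4}$$

The associated distance function on $\mathcal{L}^n$ is then given as

$$d_\ell(x, y) = \operatorname{arcosh}\left(-\langle x, y \rangle_{\mathcal{L}}\right). \tag{5}$$

Furthermore, it holds for any point $x = (x_0, x') \in \mathbb{R}^{n+1}$ that

$$x \in \mathcal{H}^n \iff x_0 = \sqrt{1 + \|x'\|^2}. \tag{6}$$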
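As an illustration of these definitions, the sketch below implements the Lorentzian scalar product, the lifting of spatial coordinates $x'$ onto the hyperboloid via $x_0 = \sqrt{1 + \|x'\|^2}$, and the distance $\operatorname{arcosh}(-\langle x, y \rangle_{\mathcal{L}})$. The function names and the clamp inside the distance (a guard against rounding below one) are assumptions of this sketch rather than part of the model definition.

```python
import numpy as np

def lorentz_inner(x, y):
    """Lorentzian scalar product: -x_0 * y_0 + sum_{i >= 1} x_i * y_i."""
    return -x[0] * y[0] + np.dot(x[1:], y[1:])

def lift_to_hyperboloid(x_spatial):
    """Complete spatial coordinates x' to a point (x_0, x') on the upper sheet."""
    x_spatial = np.asarray(x_spatial, dtype=float)
    x0 = np.sqrt(1.0 + np.sum(x_spatial ** 2))
    return np.concatenate(([x0], x_spatial))

def lorentz_distance(x, y):
    """Geodesic distance arcosh(-<x, y>_L); the clamp guards against rounding below 1."""
    return np.arccosh(np.maximum(-lorentz_inner(x, y), 1.0))

# Hypothetical example points in the 2-dimensional Lorentz model.
origin = lift_to_hyperboloid([0.0, 0.0])
p = lift_to_hyperboloid([0.3, -1.2])
q = lift_to_hyperboloid([2.0, 0.5])
print(lorentz_inner(p, p), lorentz_distance(p, q))
```

Note that the distance involves no division, in contrast to the Poincaré distance, whose denominator approaches zero for points near the boundary of the ball.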


3.2.2 Riemannian Optimization

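To derive the Riemannian SGD (RSGD) algorithm for the Lorentz model, we will first review the necessary concepts of Riemannian optimization. A Riemannian manifold $(\mathcal{M}, g)$ is a real, smooth manifold $\mathcal{M}$ equipped with a Riemannian metric $g$. Furthermore, for each $x \in \mathcal{M}$, let $T_x\mathcal{M}$ denote the associated tangent space. The metric $g$ then induces an inner product $\langle \cdot, \cdot \rangle_x : T_x\mathcal{M} \times T_x\mathcal{M} \to \mathbb{R}$. Geodesics are the generalizations of straight lines to Riemannian manifolds, i.e., constant speed curves that are locally distance minimizing. The exponential map $\exp_x : T_x\mathcal{M} \to \mathcal{M}$ maps a tangent vector $v \in T_x\mathcal{M}$ onto $\mathcal{M}$ such that $\exp_x(v) = y$, where $\gamma$ is the geodesic with $\gamma(0) = x$, $\gamma(1) = y$, and $\dot{\gamma}(0) = v$. For a complete manifold $\mathcal{M}$, the exponential map is defined for all points $x \in \mathcal{M}$.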

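Furthermore, let $f : \mathcal{M} \to \mathbb{R}$ be a smooth real-valued function over parameters $\theta \in \mathcal{M}$. In Riemannian optimization, we are then interested in solving problems of the form

$$\min_{\theta \in \mathcal{M}} f(\theta). \tag{7}$$

Following Bonnabel (2013), we minimize Equation 7 using Riemannian SGD. In RSGD, updates to the parameters $\theta$ are computed via

$$\theta_{t+1} = \exp_{\theta_t}\left(-\eta\, \operatorname{grad} f(\theta_t)\right), \tag{8}$$

where $\operatorname{grad} f(\theta_t) \in T_{\theta_t}\mathcal{M}$ denotes the Riemannian gradient and $\eta$ denotes the learning rate.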

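For the Lorentz model, the tangent space is defined as follows: For a point $x \in \mathcal{L}^n$, the tangent space $T_x\mathcal{L}^n$ consists of all vectors orthogonal to $x$, where orthogonality is defined with respect to the Lorentzian scalar product. Hence,

$$T_x\mathcal{L}^n = \{v : \langle x, v \rangle_{\mathcal{L}} = 0\}.$$

Furthermore, let $v \in T_x\mathcal{L}^n$. The exponential map $\exp_x : T_x\mathcal{L}^n \to \mathcal{L}^n$ is then defined as

$$\exp_x(v) = \cosh\left(\|v\|_{\mathcal{L}}\right) x + \sinh\left(\|v\|_{\mathcal{L}}\right) \frac{v}{\|v\|_{\mathcal{L}}}, \tag{9}$$

where $\|v\|_{\mathcal{L}} = \sqrt{\langle v, v \rangle_{\mathcal{L}}}$ denotes the norm of $v$ in $T_x\mathcal{L}^n$.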
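The closed-form exponential map can be sketched in a few lines. The code below assumes the standard form $\exp_x(v) = \cosh(\|v\|_{\mathcal{L}})\,x + \sinh(\|v\|_{\mathcal{L}})\,v / \|v\|_{\mathcal{L}}$; the function names, the `eps` threshold for zero-length steps, and the example vectors are hypothetical.

```python
import numpy as np

def lorentz_inner(x, y):
    """Lorentzian scalar product: -x_0 * y_0 + sum_{i >= 1} x_i * y_i."""
    return -x[0] * y[0] + np.dot(x[1:], y[1:])

def exp_map(x, v, eps=1e-12):
    """Follow the geodesic that starts at x with initial velocity v for unit time."""
    norm_v = np.sqrt(max(lorentz_inner(v, v), 0.0))
    if norm_v < eps:  # zero-length step: stay at x
        return x.copy()
    return np.cosh(norm_v) * x + np.sinh(norm_v) * v / norm_v

x = np.array([1.0, 0.0, 0.0])    # the point (1, 0, ..., 0) on the hyperboloid
v = np.array([0.0, 0.6, -0.8])   # tangent at x because <x, v>_L = 0
y = exp_map(x, v)
print(y, lorentz_inner(y, y))
```

Two properties are worth checking on such an example: the result stays on the hyperboloid, and its distance to the starting point equals the Lorentzian norm of the tangent vector (here one).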

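To compute parameter updates as in Equation 8, we additionally need the Riemannian gradient of $f$ at $\theta$. For this purpose, we first compute the direction of steepest descent from the Euclidean gradient $\nabla f(\theta)$ via

$$h = g_\ell^{-1} \nabla f(\theta). \tag{10}$$

Since $g_\ell$ is an involutory matrix (i.e., $g_\ell^{-1} = g_\ell$), the inverse in Equation 10 is trivial to compute. To derive the Riemannian gradient from $h$, we then use the orthogonal projection $\operatorname{proj}_\theta$ from the ambient Euclidean space onto the tangent space of the current parameter. This projection is computed as

$$\operatorname{proj}_\theta(h) = h + \langle \theta, h \rangle_{\mathcal{L}}\, \theta,$$

since $\langle \theta, \theta \rangle_{\mathcal{L}} = -1$ (Robbin & Salamon, 2017). Using this projection together with Equations 8 to 10, we can then estimate the parameters $\theta$ using RSGD as in Algorithm 1. We initialize the embeddings close to the origin of $\mathcal{H}^n$ by sampling the spatial coordinates $x'$ from a uniform distribution over a small interval around zero and by setting $x_0$ according to Equation 6.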

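Algorithm 1: Riemannian Stochastic Gradient Descent

Input: learning rate $\eta$, number of epochs $T$
for $t = 1, \ldots, T$:
    $h_t \leftarrow g_\ell^{-1} \nabla f(\theta_t)$
    $\operatorname{grad} f(\theta_t) \leftarrow \operatorname{proj}_{\theta_t}(h_t)$
    $\theta_{t+1} \leftarrow \exp_{\theta_t}\left(-\eta\, \operatorname{grad} f(\theta_t)\right)$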
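The three steps of the optimizer (rescaling by the inverse metric, projection onto the tangent space, and the exponential map) can be combined into a short loop. The following is a minimal sketch under stated assumptions, not the authors' implementation: it uses a full gradient instead of a stochastic one, and the toy objective $f(\theta) = -\langle \theta, t \rangle_{\mathcal{L}}$ (the hyperbolic cosine of the distance to a fixed target $t$), the target point, and all function names are hypothetical.

```python
import numpy as np

def lorentz_inner(x, y):
    return -x[0] * y[0] + np.dot(x[1:], y[1:])

def lift(x_spatial):
    """Place spatial coordinates on the hyperboloid via x_0 = sqrt(1 + ||x'||^2)."""
    x_spatial = np.asarray(x_spatial, dtype=float)
    return np.concatenate(([np.sqrt(1.0 + np.sum(x_spatial ** 2))], x_spatial))

def exp_map(x, v, eps=1e-12):
    norm_v = np.sqrt(max(lorentz_inner(v, v), 0.0))
    if norm_v < eps:
        return x.copy()
    return np.cosh(norm_v) * x + np.sinh(norm_v) * v / norm_v

def riemannian_grad(theta, euclid_grad):
    h = euclid_grad.copy()
    h[0] = -h[0]  # h = g_l^{-1} * grad, with g_l = diag(-1, 1, ..., 1)
    return h + lorentz_inner(theta, h) * theta  # projection onto the tangent space

def rsgd(theta, euclid_grad_fn, lr, epochs):
    """Repeat: Euclidean gradient -> Riemannian gradient -> step along the geodesic."""
    for _ in range(epochs):
        grad = riemannian_grad(theta, euclid_grad_fn(theta))
        theta = exp_map(theta, -lr * grad)
    return theta

# Toy objective (an assumption of this sketch): f(theta) = -<theta, target>_L,
# which is minimized on the hyperboloid at theta = target.
target = lift([1.5, -0.5])

def toy_euclid_grad(theta):
    g = -target.copy()
    g[0] = target[0]  # d f / d theta_0 = target_0, d f / d theta_i = -target_i
    return g

theta_final = rsgd(lift([0.0, 0.0]), toy_euclid_grad, lr=0.1, epochs=200)
print(theta_final, target)
```

Because every update follows the exponential map, the iterates remain on the hyperboloid up to rounding error, so no retraction or re-normalization step is needed in this sketch.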

3.2.3 Equivalence of models

The Lorentz and Poincaré disk models both have specific strengths: the Poincaré disk provides a very intuitive method for visualizing and interpreting hyperbolic embeddings. The Lorentz model, on the other hand, is well suited for Riemannian optimization. Due to the equivalence of both models, we can exploit their individual strengths simultaneously: points in the Lorentz model can be mapped into the Poincaré ball via the diffeomorphism $p : \mathcal{H}^n \to \mathcal{P}^n$, where

$$p(x_0, x_1, \ldots, x_n) = \frac{(x_1, \ldots, x_n)}{x_0 + 1}. \tag{11}$$

Furthermore, points in $\mathcal{P}^n$ can be mapped into $\mathcal{H}^n$ via

$$p^{-1}(x_1, \ldots, x_n) = \frac{\left(1 + \|x\|^2,\ 2x_1, \ldots, 2x_n\right)}{1 - \|x\|^2}.$$

We will therefore learn the embeddings via Algorithm 1 in the Lorentz model and visualize the embeddings by mapping them into the Poincaré disk using Equation 11. See also Figure 1 for an illustration of the Lorentz model and its connections to the Poincaré disk.
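The mapping between the two models is easy to verify numerically. The sketch below assumes the usual forms $p(x) = (x_1, \ldots, x_n) / (x_0 + 1)$ and $p^{-1}(u) = (1 + \|u\|^2, 2u) / (1 - \|u\|^2)$, together with the standard distance functions of both models; the function names and test points are hypothetical.

```python
import numpy as np

def lorentz_to_poincare(x):
    """Map a point (x_0, x_1, ..., x_n) on the hyperboloid into the unit ball."""
    return x[1:] / (x[0] + 1.0)

def poincare_to_lorentz(u):
    """Inverse map from the open unit ball back onto the hyperboloid."""
    sq_norm = np.sum(u ** 2)
    return np.concatenate(([1.0 + sq_norm], 2.0 * u)) / (1.0 - sq_norm)

def lorentz_distance(x, y):
    inner = -x[0] * y[0] + np.dot(x[1:], y[1:])
    return np.arccosh(np.maximum(-inner, 1.0))

def poincare_distance(u, v):
    num = 2.0 * np.sum((u - v) ** 2)
    den = (1.0 - np.sum(u ** 2)) * (1.0 - np.sum(v ** 2))
    return np.arccosh(1.0 + num / den)

# Hypothetical points in the Poincaré disk and their images on the hyperboloid.
u = np.array([0.5, -0.2])
v = np.array([-0.1, 0.7])
x, y = poincare_to_lorentz(u), poincare_to_lorentz(v)
print(lorentz_distance(x, y), poincare_distance(u, v))
```

Both distances agree up to rounding, which is why embeddings can be optimized in one model and inspected in the other without distortion.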

3.3 Inferring Concept Hierarchies from Similarity

Nickel & Kiela (2017) embedded unweighted undirected graphs in hyperbolic space. In the following, we extend this approach to a more general setting, i.e., inferring continuous hierarchies from pairwise similarity measurements.

Let $\mathcal{C} = \{c_i\}_{i=1}^{m}$ be a set of concepts and $\mathcal{K} = \{X_{ij}\}_{i,j=1}^{m}$ be a dataset of pairwise similarity scores between these concepts. We also assume that the concepts can be organized according to an unobserved hierarchy $(\mathcal{C}, \preceq)$, where $\preceq$ defines a partial order over the elements of $\mathcal{C}$. Since a partial order is a reflexive, anti-symmetric, and transitive binary relation, it is well suited to define hierarchical relations over $\mathcal{C}$. If $c_i \preceq c_j$ or $c_j \preceq c_i$, then the concepts $c_i$, $c_j$ are comparable (e.g., located in the same subtree). Otherwise they are incomparable (e.g., located in different subtrees). For concepts $c_i \preceq c_j$, we will refer to $c_i$ as the superior and to $c_j$ as the subordinate node.

Given this setting, our goal is then to recover the partial order $(\mathcal{C}, \preceq)$ from $\mathcal{K}$. For this purpose, we separate the semantics of the partial order relation $c_i \preceq c_j$ into two distinct aspects: first, whether two concepts $c_i$, $c_j$ are comparable (denoted by $c_i \sim c_j$) and, second, whether concept $c_i$ is more general than $c_j$ (denoted by $c_i < c_j$). Combining both aspects provides us with the usual interpretation of partial order.

By explicitly distinguishing between the aspects of comparability and generality, we can then make the following structural assumptions on $\mathcal{K}$ to infer hierarchies from pairwise similarities: 1) Comparable (and related) concepts are more similar to each other than incomparable concepts (i.e., $X_{ij} \geq X_{ik}$ if $c_i \sim c_j$ and $c_i \not\sim c_k$); and 2) general concepts are similar to more concepts than less general ones. Both are mild assumptions given that the similarity scores describe concepts that are organized in a latent hierarchy. For instance, 1) simply follows from the assumption that concepts in the same subtree of the ground-truth hierarchy are more similar to each other than to concepts in different subtrees. This is also used in methods that use path-lengths in taxonomies to measure the semantic similarity of concepts (e.g., see Resnik et al., 1999).

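It follows from assumption 1) that we want to preserve the similarity orderings in the embedding space in order to predict comparability. In particular, let $u_i$ denote the embedding of $c_i$ and let $\mathcal{N}(i, j) = \{k : X_{ij} > X_{ik}\} \cup \{j\}$ denote the set of concepts that are less similar to $c_i$ than $c_j$ (including $c_j$). Based only on pairwise similarities in $\mathcal{K}$, it is difficult to make global decisions about the likelihood that $c_i \sim c_j$ is true. However, it follows from assumption 1) that we can make local ranking decisions, i.e., we can infer that $c_i \sim c_j$ is the most likely among all $c_k \in \mathcal{N}(i, j)$. For this purpose, let

$$\phi(i, j) = \operatorname*{arg\,min}_{k \in \mathcal{N}(i, j)} d(u_i, u_k)$$

be the nearest neighbor of $c_i$ in the set $\mathcal{N}(i, j)$. We then learn embeddings $\Theta = \{u_i\}_{i=1}^{m}$ by optimizing

$$\max_{\Theta} \sum_{i, j} \log \Pr\left(\phi(i, j) = j \mid \Theta\right), \tag{12}$$

where

$$\Pr\left(\phi(i, j) = j \mid \Theta\right) = \frac{e^{-d(u_i, u_j)}}{\sum_{k \in \mathcal{N}(i, j)} e^{-d(u_i, u_k)}}.$$

For computational efficiency, we follow Jean et al. (2015) and randomly subsample $\mathcal{N}(i, j)$ on large datasets.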

Equation 12 is a ranking loss that aims to preserve the neighborhood structures in $\mathcal{K}$. For each pair of concepts $(i, j)$, this loss induces embeddings where $(i, j)$ is closer in the embedding space than pairs $(i, k)$ that are less similar. Since we compute the embedding in a metric space, we also retain transitive relations approximately. We can therefore identify the comparability of concepts by their distance in the embedding.
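To make the ranking objective concrete, the sketch below evaluates it on a toy problem. It rests on two assumptions that are stated here rather than taken as given: the candidate set for a pair $(i, j)$ consists of $j$ plus all concepts $k$ with $X_{ij} > X_{ik}$, and the probability that $j$ is the nearest neighbor is a softmax over negative embedding distances. The function name and the toy matrices are hypothetical, and the loss is written as a negative log-likelihood so that lower is better.

```python
import numpy as np

def ranking_loss(sim, dist):
    """Negative log-likelihood form of the ranking objective (lower is better).

    sim:  (m, m) array of pairwise similarity scores.
    dist: (m, m) array of pairwise distances in the embedding.
    """
    m = sim.shape[0]
    total = 0.0
    for i in range(m):
        for j in range(m):
            if i == j:
                continue
            # Candidate set: j itself plus all concepts less similar to i than j is.
            less_similar = [k for k in range(m) if k != i and sim[i, j] > sim[i, k]]
            cand = np.array([j] + less_similar)
            logits = -dist[i, cand]
            # Numerically stable log-softmax of the entry belonging to j (index 0).
            shift = np.max(logits)
            log_norm = shift + np.log(np.sum(np.exp(logits - shift)))
            total -= logits[0] - log_norm
    return total

# Toy data: concepts 0 and 1 are similar to each other, concept 2 is unrelated.
sim = np.array([[0.0, 3.0, 1.0], [3.0, 0.0, 1.0], [1.0, 1.0, 0.0]])
dist_good = np.array([[0.0, 1.0, 4.0], [1.0, 0.0, 4.0], [4.0, 4.0, 0.0]])
dist_bad = np.array([[0.0, 4.0, 1.0], [4.0, 0.0, 1.0], [1.0, 1.0, 0.0]])

loss_good = ranking_loss(sim, dist_good)
loss_bad = ranking_loss(sim, dist_bad)
print(loss_good, loss_bad)
```

The embedding whose distances respect the similarity ordering receives the lower loss. On large datasets the inner candidate set would be subsampled instead of enumerated exhaustively as in this sketch.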

Moreover, by optimizing Equation 12 in hyperbolic space, we are also able to infer the generality of concepts from their embeddings. According to assumption 2), we can assume that general objects will be close to many different concepts. Since Equation 12 optimizes the local similarity ranking for all concepts, we can also assume that this ordering is preserved. We can see from Equation 1 that points with a small distance to many different points are located close to the center. We can therefore identify the generality of a concept $c_i$ simply via the norm of its embedding $\|u_i\|$.

We have now cast the problem of hierarchy discovery as a simple embedding problem whose objective is to preserve local similarity orderings.

4 Evaluation

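| Taxonomy | Nodes | Edges | Depth |
|---|---|---|---|
| WordNet Nouns | 82,115 | 769,130 | 19 |
| WordNet Verbs | 13,542 | 35,079 | 12 |
| EuroVoc (en) | 7,084 | 10,547 | 5 |
| ACM | 2,299 | 6,526 | 5 |
| MeSH | 28,470 | 191,849 | 15 |

Table 1: Taxonomy Statistics. The number of edges refers to the full transitive closure of the respective taxonomy.

| Metric | Model | Nouns 2 | Nouns 5 | Nouns 10 | Verbs 2 | Verbs 5 | Verbs 10 | EuroVoc 2 | EuroVoc 5 | EuroVoc 10 | ACM 2 | ACM 5 | ACM 10 | MeSH 2 | MeSH 5 | MeSH 10 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| MR | Poincaré | 90.7 | 4.9 | 4.02 | 10.71 | 1.39 | 1.35 | 2.83 | 1.25 | 1.23 | 4.14 | 1.8 | 1.71 | 61.11 | 14.05 | 12.8 |
| MR | Lorentz | 22.8 | 3.18 | 2.95 | 3.64 | 1.26 | 1.23 | 1.63 | 1.24 | 1.17 | 3.05 | 1.67 | 1.63 | 38.99 | 14.13 | 12.42 |
| MR | Δ% | 74.8 | 35.1 | 36.2 | 66.0 | 9.6 | 8.9 | 42.4 | 6.1 | 3.4 | 26.3 | 7.2 | 4.8 | 36.2 | -0.5 | 2.9 |
| MAP | Poincaré | 11.8 | 82.8 | 86.5 | 36.5 | 91.0 | 91.2 | 64.3 | 94.0 | 94.4 | 69.3 | 94.1 | 94.8 | 19.5 | 76.3 | 79.4 |
| MAP | Lorentz | 30.5 | 92.3 | 92.8 | 57.9 | 93.5 | 93.3 | 87.1 | 95.8 | 96.5 | 82.9 | 96.6 | 97.0 | 34.8 | 77.7 | 79.9 |
| MAP | Δ% | 61.3 | 10.3 | 6.8 | 58.6 | 2.7 | 2.3 | 35.6 | 1.6 | 2.0 | 19.6 | 2.7 | 2.3 | 43.9 | 1.8 | 0.6 |
| ρ | Poincaré | 13.8 | 57.2 | 58.5 | 11.0 | 54.1 | 55.1 | 37.5 | 57.5 | 61.4 | 59.8 | 63.5 | 62.9 | 42.2 | 69.9 | 74.9 |
| ρ | Lorentz | 41.0 | 58.9 | 59.5 | 47.9 | 55.5 | 56.6 | 54.5 | 61.7 | 67.5 | 65.9 | 65.9 | 65.9 | 64.5 | 71.4 | 76.3 |

Table 2: Evaluation of Taxonomy Embeddings. Column headers give the taxonomy (Nouns and Verbs refer to WordNet) followed by the embedding dimensionality (2, 5, or 10). MR = Mean Rank, MAP = Mean Average Precision, ρ = Spearman rank-order correlation. Δ% indicates the relative improvement of optimization in the Lorentz model.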

4.1 Embedding Taxonomies

In the following experiments, we evaluate the performance of the Lorentz model for embedding large taxonomies. For this purpose, we compare its embedding quality to Poincaré embeddings (Nickel & Kiela, 2017) on the following real-world taxonomies:

WordNet® (Miller & Fellbaum, 1998) is a large lexical database which, amongst other relations, provides hypernymy (is-a) relations. In our experiments, we embedded the noun and verb hierarchy of WordNet.

EuroVoc is a multilingual thesaurus maintained by the European Union. It contains keywords organized in 21 domains and 127 sub-domains. In our experiments, we used the English section of EuroVoc (available at http://eurovoc.europa.eu).

ACM: The ACM computing classification system is a hierarchical ontology which is used by various ACM journals to organize subjects by area.

MeSH: Medical Subject Headings (MeSH; Rogers, 1963) is a medical thesaurus which is created, maintained and provided by the U.S. National Library of Medicine. In our experiments we used the 2018 MeSH hierarchy.

Statistics for all taxonomies are provided in Table 1.

In our evaluation, we closely follow the setting of Nickel & Kiela (2017): First, we embed the undirected transitive closure of these taxonomies, such that the hierarchical structure is not directly visible from the observed edges but has to be inferred. To measure the quality of the embedding, we compute for each observed edge $(u, v)$ the corresponding distance $d(u, v)$ in the embedding and rank it among the distances of all unobserved edges for $u$, i.e., among $\{d(u, v') : (u, v') \notin \mathcal{D}\}$, where $\mathcal{D}$ denotes the set of observed edges. We then report the mean rank (MR) and mean average precision (MAP) of this ranking.
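The ranking-based evaluation can be sketched as follows. This is an illustrative reading of the protocol, not the evaluation code used for the reported numbers: the function name, the tie handling (strictly closer unobserved nodes count against the rank), and the toy graphs are assumptions of this sketch.

```python
import numpy as np

def rank_metrics(dist, adj):
    """Mean rank and mean average precision of observed edges.

    dist: (m, m) array of embedding distances.
    adj:  (m, m) boolean array, True where an edge is observed.
    """
    ranks, avg_precisions = [], []
    m = dist.shape[0]
    for u in range(m):
        pos = np.flatnonzero(adj[u])
        pos = pos[pos != u]
        if pos.size == 0:
            continue  # nodes without observed edges are skipped
        neg_mask = ~adj[u]
        neg_mask[u] = False
        neg_d = dist[u, neg_mask]
        # Rank of each observed edge among the unobserved edges of u (1 = best).
        r = np.array([1 + np.sum(neg_d < dist[u, v]) for v in pos])
        ranks.extend(r.tolist())
        # Precision at the k-th closest observed neighbor: k / (k + negatives before it).
        r_sorted = np.sort(r)
        k = np.arange(1, r_sorted.size + 1)
        avg_precisions.append(np.mean(k / (k + r_sorted - 1)))
    return float(np.mean(ranks)), float(np.mean(avg_precisions))

def line_distances(positions):
    """Toy 'embedding': points on a line with absolute differences as distances."""
    p = np.asarray(positions, dtype=float)
    return np.abs(p[:, None] - p[None, :])

# Toy graph 1: a chain 0 - 1 - 2 plus an isolated node 3, embedded consistently.
adj_chain = np.zeros((4, 4), dtype=bool)
for a, b in [(0, 1), (1, 2)]:
    adj_chain[a, b] = adj_chain[b, a] = True
mr_good, map_good = rank_metrics(line_distances([0, 1, 2, 10]), adj_chain)

# Toy graph 2: a single edge 0 - 1 whose endpoints are embedded far apart.
adj_single = np.zeros((4, 4), dtype=bool)
adj_single[0, 1] = adj_single[1, 0] = True
mr_bad, map_bad = rank_metrics(line_distances([0, 5, 1, 20]), adj_single)
print(mr_good, map_good, mr_bad, map_bad)
```

A mean rank of one and a MAP of one correspond to an embedding in which every observed neighbor is closer than every unobserved node.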

In addition, we also evaluate how well the norm of the embeddings (i.e., our indicator for generality) correlates with the ground-truth ranks in the embedded taxonomy. Since different subtrees can have very different depths, we normalize the rank of each concept by the depth of its subtree and measure the Spearman rank-order correlation of the normalized rank with the norm of the embedding. We compute the normalized rank in the following way: Let $sp(c)$ denote the shortest path to the root node from $c$, and let $lp(c)$ denote the longest path from $c$ to any of its children (since all taxonomies in our experiments are DAGs, it is possible to compute the longest path in the graph). The normalized rank of $c$ is then given as

$$\operatorname{rank}(c) = \frac{sp(c)}{sp(c) + lp(c)}.$$
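The normalized rank can be computed with two memoized recursions over the DAG. The sketch below assumes that the normalized rank takes the form $sp(c) / (sp(c) + lp(c))$, with $sp$ the shortest path length to the root and $lp$ the longest path length to any descendant; the toy taxonomy and all function names are hypothetical.

```python
from functools import lru_cache

# Hypothetical toy taxonomy, stored as parent -> list of children (a DAG).
children = {
    "entity": ["animal", "plant"],
    "animal": ["dog", "cat"],
    "plant": [],
    "dog": ["puppy"],
    "cat": [],
    "puppy": [],
}
parents = {c: [] for c in children}
for parent, kids in children.items():
    for kid in kids:
        parents[kid].append(parent)

@lru_cache(maxsize=None)
def shortest_path_to_root(c):
    """Number of edges on the shortest path from c up to a node without parents."""
    if not parents[c]:
        return 0
    return 1 + min(shortest_path_to_root(p) for p in parents[c])

@lru_cache(maxsize=None)
def longest_path_to_descendant(c):
    """Number of edges on the longest path from c down to any descendant."""
    if not children[c]:
        return 0
    return 1 + max(longest_path_to_descendant(k) for k in children[c])

def normalized_rank(c):
    sp, lp = shortest_path_to_root(c), longest_path_to_descendant(c)
    return sp / (sp + lp) if sp + lp > 0 else 0.0

for concept in children:
    print(concept, normalized_rank(concept))
```

Under this form the root receives rank zero and every leaf receives rank one regardless of how deep its subtree is, which is what makes ranks comparable across subtrees of different depth.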

To learn the embeddings in the Lorentz model, we employ the Riemannian optimization method as described in Section 3.2. For Poincaré embeddings, we use the official open-source implementation (source code available at https://github.com/facebookresearch/poincare-embeddings). Both methods were cross-validated over identical sets of hyperparameters.

Table 2 shows the results of our evaluation. It can be seen that both methods are very efficient in embedding these large taxonomies. However, the Lorentz model shows consistently higher-quality embeddings, especially in low dimensions. In terms of mean rank, the relative improvement of the two-dimensional Lorentz embeddings over the Poincaré embeddings amounts to 74.8% on the WordNet noun hierarchy and 42.4% on EuroVoc. Similar improvements can be observed on all taxonomies. Furthermore, on the most complex taxonomy (WordNet nouns), the 10-dimensional Lorentz embeddings already outperform the best numbers reported in Nickel & Kiela (2017), which went up to 200 dimensions. This suggests that the full Riemannian optimization approach can be very helpful for obtaining good embeddings. This is especially the case in low dimensions, where it is harder for the optimization procedure to escape from local minima.

4.2 Enron Email Corpus

Figure 2: Embedding of the Enron email corpus. Panels: (a) Embedding of the Enron communication graph; (b) Org. hierarchy (Levels 0 to 5, covering the roles CEO / President, COO / Vice President / Director, In-House Lawyer, Manager / Trader, and Specialist / Analyst); (c) Rank-order correlation. Abbreviations in parentheses indicate organizational role: CEO = Chief Executive Officer, COO = Chief Operating Officer, P = President, VP = Vice President, D = Director, M = Manager, T = Trader. Blue lines indicate edges in the graph. Node size indicates node degree.

In addition to the taxonomies in Section 4.1, we are interested in discovering hierarchies from real-world graphs that have not been generated from a clean DAG. For this purpose, we embed the communication graph of the Enron email corpus (Priebe et al., 2006), which consists of 125,409 emails that have been exchanged between 184 email addresses and 150 unique users. (This dataset has been created by Priebe et al. (2006) from the full Enron email corpus, which has been released into the public domain by the Federal Energy Regulatory Commission (FERC).) From this data, we construct a graph where weighted edges represent the total number of emails that have been exchanged between two users. The dataset also includes the organizational roles for 130 users, based largely on the information collected by Shetty & Adibi (2005).

Figure 2(a) shows the two-dimensional embedding of this graph. It can be seen that the embedding captures important properties of the organizational structure. First, the nodes are approximately arranged according to the organizational hierarchy of the company. Executive roles such as CEOs, COOs, and (vice) presidents are embedded close to the origin, while other employees (e.g., traders and analysts) are located closer to the boundary. Figure 2(c) shows the Spearman correlation of the norms of the embedding with the organizational rank. It can be seen that the norm correlates well with the ground-truth ranking and is on par with or better than commonly used centrality measures on graphs. We also observe that the embedding provides a meaningful clustering of users. For instance, the lower left of the disk shows a cluster of traders. Above that cluster (i.e., closer to the origin) are managers (e.g., John F.) and vice presidents (e.g., Fletcher S., Kevin P.) who have been associated with the trading arm of Enron. This illustrates that, in addition to the notion of rank in a hierarchy, the embedding also provides insight into the similarity of nodes within the hierarchy.

4.3 Historical Linguistics Data

Figure 3: Embedding of the IELex lexical cognate data.

The field of historical linguistics is concerned with the history of languages, their relations, and their origins. An important concept for determining the relations between languages is that of so-called cognates, i.e., words that are shared across different languages (but not borrowed) and which indicate common ancestry in the history of languages. To be classified as cognate, words must have similar meaning and systematic sound correspondences. Languages are assumed to be related if they share a large number of cognates.

The goal of our experiments was to discover the historical relationships between languages (which are assumed to follow a hierarchical tree structure) by embedding cognate similarity data. For this purpose, we used the lexical cognate data provided by Bouckaert et al. (2012), which consists of 103 Indo-European languages and 6280 cognate sets in total. Since the number of cognate sets grew over time, not all languages are annotated with all possible sets. For this reason, we computed the cognate similarity between two languages in the following way. Let $c(l_1, l_2)$ denote the number of common cognates in languages $l_1, l_2$. Furthermore, let $a(l_1, l_2)$ denote the number of cognate annotations for $l_1, l_2$. We then compute the cognate similarity of $l_1, l_2$ simply as

$$\operatorname{sim}(l_1, l_2) = \frac{c(l_1, l_2)}{a(l_1, l_2)}.$$

Figure 3 shows a two-dimensional embedding of these cognate similarity scores. It can be seen that the embedding allows us to discover a meaningful hierarchy that corresponds well with the assumed origin of languages. First, the embedding shows clear clusters of high-level language families such as Celtic, Romance, Germanic, Balto-Slavic, Hellenic, Indic, and Iranian. Moreover, each of these clusters displays meaningful internal hierarchies such as (Gothic → Old High German → German), (Old Prussian → Old Church Slavonic → Bulgarian), (Latin → Italian), or (Ancient Greek → Greek). Closer to the center of the disc, we also find a number of ancient languages. For instance, Oscan and Umbrian are two extinct sister languages of Latin and are located above the Romance cluster. Similarly, Avestan and Vedic-Sanskrit are two ancient languages that separated early in the pre-historic era before 1800 BCE (Baldi, 1983). After separation, Avestan developed in ancient Persia while Vedic-Sanskrit developed independently in ancient India. In the embedding, both languages are close to the center and to each other. Furthermore, Avestan is close to the Iranian cluster while Vedic-Sanskrit is close to the Indic cluster.

5 Conclusion

We introduced a new method for learning continuous concept hierarchies from unstructured observations. We exploited the properties of hyperbolic geometry in such a way that we can discover hierarchies from pairwise similarity scores – under the assumption that concepts in the same subtree of the ground-truth hierarchy are more similar to each other than to concepts in different subtrees. To learn the embeddings, we developed an efficient Riemannian optimization approach based on the Lorentz model of hyperbolic space. Due to the more principled optimization approach, we were able to substantially improve the quality of the embeddings compared to the method proposed by Nickel & Kiela (2017) – especially in low dimensions. We further showed on two real-world datasets that our method can discover meaningful hierarchies from nothing but pairwise similarity information.


Acknowledgements

The authors thank Joan Bruna, Martín Arjovsky, Eryk Kopczyński, and Laurens van der Maaten for helpful discussions and suggestions.


References

  • Agarwal et al. (2007) Agarwal, S., Wills, J., Cayton, L., Lanckriet, G., Kriegman, D., and Belongie, S. Generalized non-metric multidimensional scaling. In Artificial Intelligence and Statistics, pp. 11–18, 2007.
  • Antoniou & Van Harmelen (2004) Antoniou, G. and Van Harmelen, F. Web ontology language: Owl. In Handbook on ontologies, pp. 67–92. Springer, 2004.
  • Asta & Shalizi (2015) Asta, D. M. and Shalizi, C. R. Geometric network comparisons. In Meila, M. and Heskes, T. (eds.), Proceedings of the Thirty-First Conference on Uncertainty in Artificial Intelligence, UAI, pp. 102–110, 2015.
  • Baldi (1983) Baldi, P. An introduction to the Indo-European languages. SIU Press, 1983.
  • Boguñá et al. (2010) Boguñá, M., Papadopoulos, F., and Krioukov, D. Sustaining the internet with hyperbolic mapping. Nature communications, 1:62, 2010.
  • Bonnabel (2013) Bonnabel, S. Stochastic gradient descent on Riemannian manifolds. IEEE Trans. Automat. Contr., 58(9):2217–2229, 2013.
  • Bouckaert et al. (2012) Bouckaert, R., Lemey, P., Dunn, M., Greenhill, S. J., Alekseyenko, A. V., Drummond, A. J., Gray, R. D., Suchard, M. A., and Atkinson, Q. D. Mapping the origins and expansion of the indo-european language family. Science, 337(6097):957–960, 2012.
  • Campbell (2013) Campbell, L. Historical linguistics. Edinburgh University Press, 2013.
  • De Sa et al. (2018) De Sa, C., Gu, A., Ré, C., and Sala, F. Representation tradeoffs for hyperbolic embeddings. arXiv preprint arXiv:1804.03329, 2018.
  • Dodds et al. (2003) Dodds, P. S., Watts, D. J., and Sabel, C. F. Information exchange and the robustness of organizational networks. Proceedings of the National Academy of Sciences, 100(21):12516–12521, 2003.
  • Duda et al. (1973) Duda, R. O., Hart, P. E., Stork, D. G., et al. Pattern classification, volume 2. Wiley New York, 1973.
  • Ganea et al. (2018) Ganea, O.-E., Bécigneul, G., and Hofmann, T. Hyperbolic entailment cones for learning hierarchical embeddings. arXiv preprint arXiv:1804.01882, 2018.
  • Inhelder & Piaget (1964) Inhelder, B. and Piaget, J. The growth of logic in the child. Routledge & Paul, 1964.
  • Jean et al. (2015) Jean, S., Cho, K., Memisevic, R., and Bengio, Y. On using very large target vocabulary for neural machine translation. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing, volume 1, pp. 1–10, 2015.
  • Kemp & Tenenbaum (2008) Kemp, C. and Tenenbaum, J. B. The discovery of structural form. Proceedings of the National Academy of Sciences, 105(31):10687–10692, 2008.
  • Kleinberg (2007) Kleinberg, R. Geographic routing using hyperbolic space. In INFOCOM 2007. 26th IEEE International Conference on Computer Communications, pp. 1902–1909. IEEE, 2007.
  • Krioukov et al. (2010) Krioukov, D., Papadopoulos, F., Kitsak, M., Vahdat, A., and Boguñá, M. Hyperbolic geometry of complex networks. Physical Review E, 82(3):036106, 2010.
  • Lake et al. (2018) Lake, B. M., Lawrence, N. D., and Tenenbaum, J. B. The emergence of organizing structure in conceptual representation. Cognitive Science, 2018.
  • Mayr (1968) Mayr, E. The role of systematics in biology. Science, 159(3815):595–599, 1968.
  • Miller & Fellbaum (1998) Miller, G. and Fellbaum, C. WordNet: An electronic lexical database, 1998.
  • Nickel & Kiela (2017) Nickel, M. and Kiela, D. Poincaré embeddings for learning hierarchical representations. In Advances in Neural Information Processing Systems 30, pp. 6338–6347, 2017.
  • Priebe et al. (2006) Priebe, C. E., Conroy, J. M., Marchette, D. J., and Park, Y. Enron data set, 2006. URL http://cis.jhu.edu/~parky/Enron/enron.html.
  • Resnik et al. (1999) Resnik, P. et al. Semantic similarity in a taxonomy: An information-based measure and its application to problems of ambiguity in natural language. J. Artif. Intell. Res. (JAIR), 11:95–130, 1999.
  • Robbin & Salamon (2017) Robbin, J. W. and Salamon, D. A. Introduction to differential geometry. ETH, Lecture Notes, preliminary version, October, 2017.
  • Rogers (1963) Rogers, F. Medical subject headings. Bulletin of the Medical Library Association, 51:114–116, 1963.
  • Shetty & Adibi (2005) Shetty, J. and Adibi, J. Enron employee status report, 2005. URL http://www.isi.edu/~adibi/Enron/EnronEmployeeStatus.xls.
  • Silla & Freitas (2011) Silla, C. N. and Freitas, A. A. A survey of hierarchical classification across different application domains. Data Mining and Knowledge Discovery, 22(1-2):31–72, 2011.
  • Sun et al. (2015) Sun, K., Wang, J., Kalousis, A., and Marchand-Maillet, S. Space-time local embeddings. In Advances in Neural Information Processing Systems 28, pp. 100–108, 2015.
  • Tamuz et al. (2011) Tamuz, O., Liu, C., Belongie, S., Shamir, O., and Kalai, A. T. Adaptively learning the crowd kernel. In Proceedings of the 28th International Conference on Machine Learning, pp. 673–680, 2011.
  • Van Der Maaten & Weinberger (2012) Van Der Maaten, L. and Weinberger, K. Stochastic triplet embedding. In Machine Learning for Signal Processing (MLSP), 2012 IEEE International Workshop on, pp. 1–6. IEEE, 2012.
  • Vendrov et al. (2015) Vendrov, I., Kiros, R., Fidler, S., and Urtasun, R. Order-embeddings of images and language. arXiv preprint arXiv:1511.06361, 2015.
  • Vilnis & McCallum (2015) Vilnis, L. and McCallum, A. Word representations via Gaussian embedding. In International Conference on Learning Representations (ICLR), 2015.