1 Introduction
Hierarchical structures are ubiquitous in knowledge representation and reasoning. For example, starting with Linnaeus, taxonomies have long been used in biology to categorize and understand the relationships between species (Mayr, 1968). In social science, hierarchies are used to understand interactions in humans and animals or to analyze organizational structures such as companies and governments (Dodds et al., 2003). In comparative linguistics, evolutionary trees are used to describe the origin of languages (Campbell, 2013), while ontologies are used to provide rich categorizations of entities in semantic networks (Antoniou & Van Harmelen, 2004). Hierarchies are also known to provide important information for learning and classification (Silla & Freitas, 2011). In cognitive development, the results of Inhelder & Piaget (1964) suggest that the classification structure in children’s thinking is hierarchical in nature.
Hierarchies can therefore provide important insights into systems of concepts. However, explicit information about such hierarchical relationships is unavailable for many domains. In this paper, we therefore consider the problem of discovering concept hierarchies from unstructured observations, specifically in the following setting:

We focus on discovering pairwise hierarchical relations between concepts, where all superior and subordinate concepts are observed.

We aim to infer concept hierarchies only from pairwise similarity measurements, which are relatively easy and cheap to obtain in many domains.
Examples of hierarchy discovery that adhere to this setting include the creation of taxonomies from similarity judgments (e.g., genetic similarity of species or cognate similarity of languages) and the recovery of organizational hierarchies and dominance relations from social interactions.
To infer hierarchies from similarity judgments, we propose to model such relationships as a combination of two separate aspects: relatedness and generality. Concept A is a parent (a superior) to concept B if both concepts are related and A is more general than B. By separating these aspects, we can then discover concept hierarchies via hyperbolic embeddings. In particular, we build upon ideas of Poincaré embeddings (Nickel & Kiela, 2017) to learn continuous representations of hierarchies. Due to its geometric properties, hyperbolic space can be thought of as a continuous analogue to discrete trees. By embedding concepts in such a way that their similarity order is preserved, we can then identify (soft) hierarchical relationships from the embedding: relatedness is captured via the distance in the embedding space, while generality is captured via the norm of the embeddings.
To learn high-quality embeddings, we propose a new optimization approach based on the Lorentz model of hyperbolic space. The Lorentz model allows for an efficient closed-form computation of the geodesics on the manifold. This facilitates the development of an efficient optimizer that directly follows these geodesics, rather than using a first-order approximation as in Nickel & Kiela (2017). It also allows us to avoid numerical instabilities that arise from the Poincaré distance. As we will show experimentally, this optimization method leads to a substantially improved embedding quality, especially in low dimensions. Simultaneously, we retain the attractive properties of hyperbolic embeddings, i.e., learning continuous representations of hierarchies via gradient-based optimization while scaling to large datasets.
The remainder of this paper is organized as follows. In Section 2, we discuss related work regarding hyperbolic and ordered embeddings. In Section 3, we introduce our model and algorithm to compute the embeddings. In Section 4, we evaluate the efficiency of our approach on large taxonomies. Furthermore, we evaluate the ability of our model to discover meaningful hierarchies on real-world datasets.
2 Related Work
Hyperbolic geometry has recently received attention in machine learning and network science due to its attractive properties for modeling data with latent hierarchies.
Krioukov et al. (2010) showed that typical properties of complex networks (e.g., heterogeneous degree distributions and strong clustering) can be explained by assuming an underlying hyperbolic geometry and, moreover, developed a framework to model networks based on these properties. Furthermore, Kleinberg (2007) and Boguñá et al. (2010) proposed hyperbolic embeddings for greedy shortest-path routing in communication networks. Asta & Shalizi (2015) used hyperbolic embeddings of graphs to compare the global structure of networks. Sun et al. (2015) proposed to learn representations of non-metric data in pseudo-Riemannian space-time, which is closely related to hyperbolic space.
Most similar to our work are the recently proposed Poincaré embeddings (Nickel & Kiela, 2017), which learn hierarchical representations of symbolic data by embedding them into an $n$-dimensional Poincaré ball. The main focus of that work was to model the link structure of symbolic data efficiently, i.e., to find low-dimensional embeddings by exploiting the hierarchical structure of hyperbolic space. Here, we build upon this idea and extend it in various ways. First, we propose a new model to compute hyperbolic embeddings in the Lorentz model of hyperbolic geometry. This allows us to develop an efficient Riemannian optimization method that scales well to large datasets and provides better embeddings, especially in low dimensions. Second, we consider inferring hierarchies from real-valued similarity scores, which generalize binary adjacency matrices as considered by Nickel & Kiela (2017). Third, in addition to preserving similarity (e.g., local link structure), we also focus on recovering the correct hierarchical relationships from the embedding.
Simultaneously to the present work, De Sa et al. (2018) analyzed the representation tradeoffs for hyperbolic embeddings and proposed a new combinatorial embedding approach as well as a new approach to Multi-Dimensional Scaling (MDS) in hyperbolic space. Furthermore, Ganea et al. (2018) extended Poincaré embeddings using geodesically convex cones to model asymmetric relations.
Another related method is Order Embeddings (Vendrov et al., 2015), which was proposed to learn visual-semantic hierarchies over words, sentences, and images from ordered input pairs. In contrast, we are concerned with learning hierarchical embeddings from less supervision: namely, from unordered (symmetric) input pairs that provide no direct information about the partial ordering in the hierarchy.
Further work on embedding order structures includes Stochastic Triplet Embeddings (Van Der Maaten & Weinberger, 2012), Generalized Non-Metric MDS (Agarwal et al., 2007), and Crowd Kernels (Tamuz et al., 2011). In the context of word embeddings, Vilnis & McCallum (2015) proposed Gaussian Embeddings to learn improved representations. By mapping words to densities, this model is capable of capturing uncertainty, asymmetry, and (hierarchical) entailment relations.
To discover structural forms (e.g., trees, grids, chains) from data, Kemp & Tenenbaum (2008) proposed a model for making probabilistic inferences over a space of graph grammars. Recently, Lake et al. (2018) proposed an alternative approach to this work based on structural sparsity. Additionally, hierarchical clustering has a long history in machine learning and data mining (Duda et al., 1973). Bottom-up agglomerative clustering assigns each data point to its own cluster and then iteratively merges the two closest clusters according to a given distance measure (e.g., single link, average link, max link). As such, hierarchical clustering provides a hierarchical partition of the input space. In contrast, we are concerned with discovering direct hierarchical relationships between the input data points.
3 Methods
In the following, we describe our approach for learning continuous hierarchies from unstructured observations.
3.1 Hyperbolic Geometry & Poincaré Embeddings
Hyperbolic space is the unique, complete, simply connected Riemannian manifold with constant negative sectional curvature. There exist multiple equivalent models for hyperbolic space (equivalent meaning that there exist transformations between the different models that preserve all geometric properties, including isometry), and one can choose whichever model is best suited for a given task. Nickel & Kiela (2017) based their approach for learning hyperbolic embeddings on the Poincaré ball model, due to its conformality and convenient parameterization. The Poincaré ball model is the Riemannian manifold $\mathcal{P}^n = (\mathcal{B}^n, g_p)$, where $\mathcal{B}^n = \{x \in \mathbb{R}^n : \|x\| < 1\}$ is the open $n$-dimensional unit ball and where
$$g_p(x) = \left(\frac{2}{1 - \|x\|^2}\right)^2 g_e .$$
Here, $g_e$ denotes the Euclidean metric tensor. The distance function on $\mathcal{P}^n$ is then defined as
$$d_p(x, y) = \operatorname{arcosh}\left(1 + 2\,\frac{\|x - y\|^2}{(1 - \|x\|^2)(1 - \|y\|^2)}\right). \tag{1}$$
It can be seen from Equation 1 that the distance within the Poincaré ball changes smoothly with respect to the norm of $x$ and $y$. This locality property of the distance is key for learning continuous embeddings of hierarchies. For instance, by placing the root node of a tree at the origin of $\mathcal{B}^n$, it would have a relatively small distance to all other nodes, as its norm is zero. On the other hand, leaf nodes can be placed close to the boundary of the ball, as the distance between points grows quickly as their norm approaches one.
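To make the locality property concrete, the following minimal sketch evaluates the Poincaré distance of Equation 1 in plain Python. The function name and the example points are chosen here purely for illustration and are not part of the original method description.

```python
import math

def poincare_distance(x, y):
    """Distance of Equation 1 between two points inside the open unit ball."""
    sq_norm_x = sum(a * a for a in x)
    sq_norm_y = sum(b * b for b in y)
    sq_diff = sum((a - b) ** 2 for a, b in zip(x, y))
    # The denominator shrinks towards zero as either point nears the boundary,
    # which is what makes distances between boundary points grow quickly.
    ratio = sq_diff / ((1.0 - sq_norm_x) * (1.0 - sq_norm_y))
    return math.acosh(1.0 + 2.0 * ratio)

# Illustrative points: a "root" at the origin and two "leaves" near the boundary.
root = (0.0, 0.0)
leaf_a = (0.9, 0.0)
leaf_b = (0.0, 0.9)

d_root_leaf = poincare_distance(root, leaf_a)
d_leaf_leaf = poincare_distance(leaf_a, leaf_b)
print(d_root_leaf, d_leaf_leaf)
```

In this toy configuration the two leaves are farther from each other than either is from the root, even though all three Euclidean gaps are of similar size.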
3.2 Riemannian Optimization in the Lorentz Model
In the following, we propose a new method to compute hyperbolic embeddings based on the Lorentz model of hyperbolic geometry. The main advantage of this parameterization is that it allows us to perform Riemannian optimization very efficiently. An additional advantage is that its distance function (see Equation 5) avoids numerical instabilities that arise from the fraction in the Poincaré distance.
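3.2.1 The Lorentz Model of Hyperbolic Space
In the following, let $x, y \in \mathbb{R}^{n+1}$, and let
$$\langle x, y \rangle_{\mathcal{L}} = -x_0 y_0 + \sum_{i=1}^{n} x_i y_i \tag{2}$$
denote the Lorentzian scalar product. The Lorentz model of $n$-dimensional hyperbolic space is then defined as the Riemannian manifold $\mathcal{L}^n = (\mathcal{H}^n, g_\ell)$, where
$$\mathcal{H}^n = \{x \in \mathbb{R}^{n+1} : \langle x, x \rangle_{\mathcal{L}} = -1,\ x_0 > 0\} \tag{3}$$
denotes the upper sheet of a two-sheeted $n$-dimensional hyperboloid and where
$$g_\ell(x) = \operatorname{diag}(-1, 1, \ldots, 1). \tag{4}$$
The associated distance function on $\mathcal{L}^n$ is then given as
$$d_\ell(x, y) = \operatorname{arcosh}\left(-\langle x, y \rangle_{\mathcal{L}}\right). \tag{5}$$
Furthermore, it holds for any point $x = (x_0, x') \in \mathbb{R}^{n+1}$ that
$$x \in \mathcal{H}^n \iff x_0 = \sqrt{1 + \|x'\|^2}. \tag{6}$$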
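The definitions of this subsection can be sketched in a few lines of plain Python. The helper names (`lorentz_product`, `lift`, `lorentz_distance`) are assumptions made for this example, and the clamp inside the distance is a defensive choice added here to guard against rounding errors; it is not described in the text.

```python
import math

def lorentz_product(x, y):
    """Lorentzian scalar product: minus the product of the first coordinates plus the rest."""
    return -x[0] * y[0] + sum(a * b for a, b in zip(x[1:], y[1:]))

def lift(spatial):
    """Build a point on the hyperboloid from its last n coordinates (Equation 6)."""
    x0 = math.sqrt(1.0 + sum(a * a for a in spatial))
    return (x0,) + tuple(spatial)

def lorentz_distance(x, y):
    """arcosh of the negated Lorentzian scalar product (Equation 5)."""
    # Assumption: clamp to 1 so rounding cannot push the argument outside arcosh's domain.
    return math.acosh(max(1.0, -lorentz_product(x, y)))

origin = lift((0.0, 0.0))   # the point (1, 0, 0)
x = lift((0.3, -1.2))
y = lift((2.0, 0.5))
print(lorentz_product(x, x))   # approximately -1, as required by the hyperboloid constraint
print(lorentz_distance(x, y))
```

Because the first coordinate is fully determined by the remaining ones, only the last n coordinates need to be stored or sampled; the distance itself involves no division, in contrast to the Poincaré distance.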
3.2.2 Riemannian Optimization
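To derive the Riemannian SGD (RSGD) algorithm for the Lorentz model, we will first review the necessary concepts of Riemannian optimization. A Riemannian manifold $(\mathcal{M}, g)$ is a real, smooth manifold $\mathcal{M}$ equipped with a Riemannian metric $g$. Furthermore, for each $x \in \mathcal{M}$, let $\mathcal{T}_x\mathcal{M}$ denote the associated tangent space. The metric $g$ then induces an inner product $\langle \cdot, \cdot \rangle_x : \mathcal{T}_x\mathcal{M} \times \mathcal{T}_x\mathcal{M} \to \mathbb{R}$. Geodesics $\gamma$ of $\mathcal{M}$ are the generalizations of straight lines to Riemannian manifolds, i.e., constant speed curves that are locally distance minimizing. The exponential map $\exp_x : \mathcal{T}_x\mathcal{M} \to \mathcal{M}$ maps a tangent vector $v \in \mathcal{T}_x\mathcal{M}$ onto $\mathcal{M}$ such that $\exp_x(v) = y$, $\gamma(0) = x$, $\gamma(1) = y$, and $\dot{\gamma}(0) = \frac{\partial}{\partial t}\gamma(0) = v$. For a complete manifold $\mathcal{M}$, the exponential map is defined for all points $x \in \mathcal{M}$.
Furthermore, let $f : \mathcal{M} \to \mathbb{R}$ be a smooth real-valued function over parameters $\theta \in \mathcal{M}$. In Riemannian optimization, we are then interested in solving problems of the form
$$\min_{\theta \in \mathcal{M}} f(\theta). \tag{7}$$
Following Bonnabel (2013), we minimize Equation 7 using Riemannian SGD. In RSGD, updates to the parameters $\theta$ are computed via
$$\theta_{t+1} = \exp_{\theta_t}\left(-\eta \operatorname{grad} f(\theta_t)\right), \tag{8}$$
where $\operatorname{grad} f(\theta_t) \in \mathcal{T}_{\theta_t}\mathcal{M}$ denotes the Riemannian gradient and $\eta$ denotes the learning rate.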
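For the Lorentz model, the tangent space is defined as follows: For a point $x \in \mathcal{L}^n$, the tangent space $\mathcal{T}_x\mathcal{L}^n$ consists of all vectors orthogonal to $x$, where orthogonality is defined with respect to the Lorentzian scalar product. Hence,
$$\mathcal{T}_x\mathcal{L}^n = \{v : \langle x, v \rangle_{\mathcal{L}} = 0\}.$$
Furthermore, let $v \in \mathcal{T}_x\mathcal{L}^n$. The exponential map $\exp_x : \mathcal{T}_x\mathcal{L}^n \to \mathcal{L}^n$ is then defined as
$$\exp_x(v) = \cosh(\|v\|_{\mathcal{L}})\, x + \sinh(\|v\|_{\mathcal{L}})\, \frac{v}{\|v\|_{\mathcal{L}}}, \tag{9}$$
where $\|v\|_{\mathcal{L}} = \sqrt{\langle v, v \rangle_{\mathcal{L}}}$ denotes the norm of $v$ in $\mathcal{T}_x\mathcal{L}^n$.
To compute parameter updates as in Equation 8, we additionally need the Riemannian gradient of $f$ at $\theta$. For this purpose, we first compute the direction of steepest descent from the Euclidean gradient $\nabla f(\theta)$ via
$$h = g_\ell^{-1} \nabla f(\theta). \tag{10}$$
Since $g_\ell$ is an involutory matrix (i.e., $g_\ell^{-1} = g_\ell$), the inverse in Equation 10 is trivial to compute. To derive the Riemannian gradient from $h \in \mathbb{R}^{n+1}$, we then use the orthogonal projection $\operatorname{proj}_\theta : \mathbb{R}^{n+1} \to \mathcal{T}_\theta\mathcal{L}^n$ from the ambient Euclidean space onto the tangent space of the current parameter. This projection is computed as
$$\operatorname{proj}_x(u) = u + \langle x, u \rangle_{\mathcal{L}}\, x,$$
since $\langle x, x \rangle_{\mathcal{L}} = -1$ (Robbin & Salamon, 2017). Using this projection together with Equation 9, we can then estimate the parameters $\theta$ using RSGD as in Algorithm 1. We initialize the embeddings close to the origin of $\mathcal{L}^n$ by sampling the coordinates $x'$ from a uniform distribution over a small interval around zero and by setting $x_0$ according to Equation 6.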
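Algorithm 1: Riemannian Stochastic Gradient Descent
Input: Learning rate $\eta$, number of epochs $T$.
for $t = 1, \ldots, T$ do
    $h_t \leftarrow g_\ell^{-1} \nabla f(\theta_t)$
    $\operatorname{grad} f(\theta_t) \leftarrow \operatorname{proj}_{\theta_t}(h_t)$
    $\theta_{t+1} \leftarrow \exp_{\theta_t}\left(-\eta \operatorname{grad} f(\theta_t)\right)$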
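A single RSGD update in the Lorentz model can be sketched as follows. This is an illustrative plain-Python version under stated assumptions, not the implementation used in the experiments: the toy objective $f(\theta) = -\langle \theta, \text{target} \rangle_{\mathcal{L}}$, the target point, the step count, and the guard for vanishing steps are all chosen here for the example. The projection used is $h + \langle \theta, h \rangle_{\mathcal{L}}\, \theta$, which is orthogonal to $\theta$ under the Lorentzian scalar product whenever $\langle \theta, \theta \rangle_{\mathcal{L}} = -1$.

```python
import math

def lorentz_product(x, y):
    return -x[0] * y[0] + sum(a * b for a, b in zip(x[1:], y[1:]))

def exp_map(x, v):
    """Exponential map of Equation 9; v must lie in the tangent space at x."""
    norm_v = math.sqrt(max(0.0, lorentz_product(v, v)))
    if norm_v < 1e-12:  # assumption: treat a vanishing step as no movement
        return list(x)
    return [math.cosh(norm_v) * xi + math.sinh(norm_v) * vi / norm_v
            for xi, vi in zip(x, v)]

def rsgd_step(theta, euclid_grad, lr):
    # Equation 10: the metric diag(-1, 1, ..., 1) is its own inverse,
    # so applying it only flips the sign of the first coordinate.
    h = [-euclid_grad[0]] + list(euclid_grad[1:])
    # Project h onto the tangent space at theta: h + <theta, h>_L * theta.
    coeff = lorentz_product(theta, h)
    riem_grad = [hi + coeff * ti for hi, ti in zip(h, theta)]
    # Equation 8: follow the geodesic in the direction of the negative gradient.
    return exp_map(theta, [-lr * gi for gi in riem_grad])

# Hypothetical toy objective: f(theta) = -<theta, target>_L, which equals
# cosh(d(theta, target)) on the hyperboloid and is minimized at theta = target.
target = [math.sqrt(1.0 + 1.0 + 0.25), 1.0, -0.5]
theta = [1.0, 0.0, 0.0]  # origin of the hyperboloid
for _ in range(200):
    grad = [target[0]] + [-t for t in target[1:]]  # Euclidean gradient of f
    theta = rsgd_step(theta, grad, lr=0.1)

final_distance = math.acosh(max(1.0, -lorentz_product(theta, target)))
print(final_distance)
```

Every iterate stays on the hyperboloid because the update moves along a geodesic rather than along a straight line in the ambient space, so no separate retraction step is needed in this sketch.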
3.2.3 Equivalence of models
The Lorentz and Poincaré disk models both have specific strengths: the Poincaré disk provides a very intuitive method for visualizing and interpreting hyperbolic embeddings. The Lorentz model, on the other hand, is well-suited for Riemannian optimization. Due to the equivalence of both models, we can exploit their individual strengths simultaneously: points in the Lorentz model can be mapped into the Poincaré ball via the diffeomorphism $p : \mathcal{H}^n \to \mathcal{P}^n$, where
$$p(x_0, x_1, \ldots, x_n) = \frac{(x_1, \ldots, x_n)}{x_0 + 1}. \tag{11}$$
Furthermore, points in $\mathcal{P}^n$ can be mapped into $\mathcal{H}^n$ via
$$p^{-1}(x_1, \ldots, x_n) = \frac{(1 + \|x\|^2, 2x_1, \ldots, 2x_n)}{1 - \|x\|^2}.$$
We will therefore learn the embeddings via Algorithm 1 in the Lorentz model and visualize the embeddings by mapping them into the Poincaré disk using Equation 11. See also Figure 1 for an illustration of the Lorentz model and its connections to the Poincaré disk.
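The mapping between the two models can be checked numerically. The sketch below assumes the map of Equation 11 divides the last n coordinates by $x_0 + 1$ and uses its standard inverse $(1 + \|p\|^2, 2p_1, \ldots, 2p_n) / (1 - \|p\|^2)$; function names and test points are hypothetical.

```python
import math

def lorentz_to_poincare(x):
    """Equation 11: divide the last n coordinates by x_0 + 1."""
    return [xi / (x[0] + 1.0) for xi in x[1:]]

def poincare_to_lorentz(p):
    """Inverse map: (1 + ||p||^2, 2 p_1, ..., 2 p_n) / (1 - ||p||^2)."""
    sq_norm = sum(a * a for a in p)
    return [(1.0 + sq_norm) / (1.0 - sq_norm)] + [2.0 * a / (1.0 - sq_norm) for a in p]

def poincare_distance(x, y):
    sq_diff = sum((a - b) ** 2 for a, b in zip(x, y))
    denom = (1.0 - sum(a * a for a in x)) * (1.0 - sum(b * b for b in y))
    return math.acosh(1.0 + 2.0 * sq_diff / denom)

def lorentz_distance(x, y):
    inner = -x[0] * y[0] + sum(a * b for a, b in zip(x[1:], y[1:]))
    return math.acosh(max(1.0, -inner))

# Two illustrative points in the Poincaré disk and their images on the hyperboloid.
p, q = [0.5, -0.2], [-0.1, 0.7]
x, y = poincare_to_lorentz(p), poincare_to_lorentz(q)

print(lorentz_to_poincare(x))                            # recovers p up to rounding
print(poincare_distance(p, q), lorentz_distance(x, y))   # the two distances agree
```

The agreement of the two distances illustrates why embeddings can be trained in one model and visualized in the other without changing their geometry.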
3.3 Inferring Concept Hierarchies from Similarity
Nickel & Kiela (2017) embedded unweighted undirected graphs in hyperbolic space. In the following, we extend this approach to a more general setting, i.e., inferring continuous hierarchies from pairwise similarity measurements.
Let $\mathcal{C} = \{c_i\}_{i=1}^{m}$ be a set of concepts and $X \in \mathbb{R}^{m \times m}$ be a dataset of pairwise similarity scores between these concepts. We also assume that the concepts can be organized according to an unobserved hierarchy $(\mathcal{C}, \preceq)$, where $\preceq$ defines a partial order over the elements of $\mathcal{C}$. Since a partial order is a reflexive, antisymmetric, and transitive binary relation, it is well suited to define hierarchical relations over $\mathcal{C}$. If $c_i \preceq c_j$ or $c_j \preceq c_i$, then the concepts $c_i$, $c_j$ are comparable (e.g., located in the same subtree). Otherwise they are incomparable (e.g., located in different subtrees). For concepts $c_i \prec c_j$, we will refer to $c_i$ as the superior and to $c_j$ as the subordinate node.
Given this setting, our goal is then to recover the partial order $(\mathcal{C}, \preceq)$ from $X$. For this purpose, we separate the semantics of the partial order relation into two distinct aspects: First, whether two concepts $c_i$, $c_j$ are comparable (denoted by $c_i \sim c_j$) and, second, whether concept $c_i$ is more general than $c_j$ (denoted by $c_i \sqsubset c_j$). Combining both aspects provides us with the usual interpretation of partial order.
By explicitly distinguishing between the aspects of comparability and generality, we can then make the following structural assumptions on $X$ to infer hierarchies from pairwise similarities: 1) Comparable (and related) concepts are more similar to each other than incomparable concepts (i.e., $X_{ij} \geq X_{ik}$ if $c_i \sim c_j \wedge c_i \not\sim c_k$); and 2) general concepts are similar to more concepts than less general ones. Both are mild assumptions given that the similarity scores describe concepts that are organized in a latent hierarchy. For instance, 1) simply follows from the assumption that concepts in the same subtree of the ground-truth hierarchy are more similar to each other than to concepts in different subtrees. This is also used in methods that use path lengths in taxonomies to measure the semantic similarity of concepts (e.g., see Resnik et al., 1999).
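It follows from assumption 1) that we want to preserve the similarity orderings in the embedding space in order to predict comparability. In particular, let $u_i$ denote the embedding of $c_i$ and let $\mathcal{N}(i, j) = \{k : X_{ij} > X_{ik}\} \cup \{j\}$ denote the set of concepts that are less similar to $c_i$ than $c_j$ (including $c_j$). Based only on pairwise similarities in $X$, it is difficult to make global decisions about the likelihood that $c_i \sim c_j$ is true. However, it follows from assumption 1) that we can make local ranking decisions, i.e., we can infer that $c_i \sim c_j$ is the most likely among all $c_k \in \mathcal{N}(i, j)$. For this purpose, let
$$\phi(i, j) = \operatorname*{arg\,min}_{k \in \mathcal{N}(i, j)} d(u_i, u_k)$$
be the nearest neighbor of $c_i$ in the set $\mathcal{N}(i, j)$. We then learn embeddings $\Theta = \{u_i\}_{i=1}^{m}$ by optimizing
$$\max_{\Theta} \sum_{i, j} \log \Pr\left(\phi(i, j) = j \mid \Theta\right), \tag{12}$$
where
$$\Pr\left(\phi(i, j) = j \mid \Theta\right) = \frac{e^{-d(u_i, u_j)}}{\sum_{k \in \mathcal{N}(i, j)} e^{-d(u_i, u_k)}}.$$
For computational efficiency, we follow Jean et al. (2015) and randomly subsample $\mathcal{N}(i, j)$ on large datasets.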
Equation 12 is a ranking loss that aims to preserve the neighborhood structures in $X$. For each pair of concepts $(i, j)$, this loss induces embeddings where $(i, j)$ is closer in the embedding space than pairs $(i, k)$ that are less similar. Since we compute the embedding in a metric space, we also retain transitive relations approximately. We can therefore identify the comparability of concepts by their distance in the embedding.
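The ranking objective of Equation 12 can be sketched for a single pair of concepts as a softmax over negative distances, restricted to the candidates that are less similar to concept i than concept j is (plus j itself). The code below is an illustrative plain-Python version; the helper names, the toy similarity matrix, the hand-picked embeddings, and the explicit exclusion of i from its own candidate set are assumptions of this example.

```python
import math

def lift(spatial):
    """Place a point on the hyperboloid given its last n coordinates (Equation 6)."""
    return (math.sqrt(1.0 + sum(a * a for a in spatial)),) + tuple(spatial)

def lorentz_distance(x, y):
    inner = -x[0] * y[0] + sum(a * b for a, b in zip(x[1:], y[1:]))
    return math.acosh(max(1.0, -inner))

def candidate_set(sim, i, j):
    """Concepts that are less similar to i than j is, together with j itself."""
    return [k for k in range(len(sim)) if k != i and (k == j or sim[i][k] < sim[i][j])]

def log_prob(sim, emb, i, j):
    """Log-probability that j is the nearest neighbor of i within candidate_set(sim, i, j)."""
    logits = [-lorentz_distance(emb[i], emb[k]) for k in candidate_set(sim, i, j)]
    peak = max(logits)  # log-sum-exp trick for numerical stability
    log_norm = peak + math.log(sum(math.exp(v - peak) for v in logits))
    return -lorentz_distance(emb[i], emb[j]) - log_norm

# Hypothetical toy data: concepts 0 and 1 are similar, concept 2 is unrelated.
sim = [[1.0, 0.9, 0.1],
       [0.9, 1.0, 0.2],
       [0.1, 0.2, 1.0]]
good = [lift((0.1, 0.0)), lift((0.2, 0.1)), lift((-2.0, 1.5))]  # 0 and 1 close together
bad = [lift((0.1, 0.0)), lift((-2.0, 1.5)), lift((0.2, 0.1))]   # 0 and 1 far apart

print(log_prob(sim, good, 0, 1), log_prob(sim, bad, 0, 1))
```

The full objective would sum `log_prob` over all pairs and maximize it with respect to the embeddings; on large datasets the candidate set would be randomly subsampled rather than enumerated in full.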
Moreover, by optimizing Equation 12 in hyperbolic space, we are also able to infer the generality of concepts from their embeddings. According to assumption 2), we can assume that general objects will be close to many different concepts. Since Equation 12 optimizes the local similarity ranking for all concepts, we can also assume that this ordering is preserved. We can see from Equation 1 that points with a small distance to many different points are located close to the center. We can therefore identify the generality of a concept $c_i$ simply via the norm of its embedding $\|u_i\|$.
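Reading off generality from an embedding then amounts to sorting concepts by norm. The sketch below assumes embeddings are stored as hyperboloid points and measures the norm after mapping into the Poincaré ball; since that norm grows monotonically with the norm of the last n Lorentz coordinates, either choice gives the same ordering. The concept names and coordinates are made up for illustration.

```python
import math

def lift(spatial):
    """Place a point on the hyperboloid given its last n coordinates."""
    return (math.sqrt(1.0 + sum(a * a for a in spatial)),) + tuple(spatial)

def poincare_norm(x):
    """Norm of a hyperboloid point x = (x_0, x_1, ..., x_n) after mapping it into the Poincaré ball."""
    return math.sqrt(sum(a * a for a in x[1:])) / (x[0] + 1.0)

# Hypothetical embeddings, chosen by hand purely for illustration.
embeddings = {
    "dog": lift((2.5, 1.0)),
    "entity": lift((0.05, 0.0)),
    "animal": lift((0.8, 0.3)),
}

# Smaller norm = closer to the center = more general.
by_generality = sorted(embeddings, key=lambda name: poincare_norm(embeddings[name]))
print(by_generality)
```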
We have now cast the problem of hierarchy discovery as a simple embedding problem whose objective is to preserve local similarity orderings.
4 Evaluation
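Table 1: Statistics for all taxonomies.

| Taxonomy | Nodes | Edges | Depth |
|---|---|---|---|
| WordNet Nouns | 82,115 | 769,130 | 19 |
| WordNet Verbs | 13,542 | 35,079 | 12 |
| EuroVoc (en) | 7,084 | 10,547 | 5 |
| ACM | 2,299 | 6,526 | 5 |
| MeSH | 28,470 | 191,849 | 15 |

Table 2: Evaluation of taxonomy embeddings in 2, 5, and 10 dimensions. MR denotes mean rank, MAP denotes mean average precision, and ρ denotes the Spearman rank-order correlation. Rows marked Δ% give the relative improvement of Lorentz over Poincaré.

| | | WordNet Nouns | | | WordNet Verbs | | | EuroVoc | | | ACM | | | MeSH | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | Dimension | 2 | 5 | 10 | 2 | 5 | 10 | 2 | 5 | 10 | 2 | 5 | 10 | 2 | 5 | 10 |
| MR | Poincaré | 90.7 | 4.9 | 4.02 | 10.71 | 1.39 | 1.35 | 2.83 | 1.25 | 1.23 | 4.14 | 1.8 | 1.71 | 61.11 | 14.05 | 12.8 |
| | Lorentz | 22.8 | 3.18 | 2.95 | 3.64 | 1.26 | 1.23 | 1.63 | 1.24 | 1.17 | 3.05 | 1.67 | 1.63 | 38.99 | 14.13 | 12.42 |
| | Δ% | 74.8 | 35.1 | 36.2 | 66.0 | 9.6 | 8.9 | 42.4 | 6.1 | 3.4 | 26.3 | 7.2 | 4.8 | 36.2 | 0.5 | 2.9 |
| MAP | Poincaré | 11.8 | 82.8 | 86.5 | 36.5 | 91.0 | 91.2 | 64.3 | 94.0 | 94.4 | 69.3 | 94.1 | 94.8 | 19.5 | 76.3 | 79.4 |
| | Lorentz | 30.5 | 92.3 | 92.8 | 57.9 | 93.5 | 93.3 | 87.1 | 95.8 | 96.5 | 82.9 | 96.6 | 97.0 | 34.8 | 77.7 | 79.9 |
| | Δ% | 61.3 | 10.3 | 6.8 | 58.6 | 2.7 | 2.3 | 35.6 | 1.6 | 2.0 | 19.6 | 2.7 | 2.3 | 43.9 | 1.8 | 0.6 |
| ρ | Poincaré | 13.8 | 57.2 | 58.5 | 11.0 | 54.1 | 55.1 | 37.5 | 57.5 | 61.4 | 59.8 | 63.5 | 62.9 | 42.2 | 69.9 | 74.9 |
| | Lorentz | 41.0 | 58.9 | 59.5 | 47.9 | 55.5 | 56.6 | 54.5 | 61.7 | 67.5 | 65.9 | 65.9 | 65.9 | 64.5 | 71.4 | 76.3 |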
4.1 Embedding Taxonomies
In the following experiments, we evaluate the performance of the Lorentz model for embedding large taxonomies. For this purpose, we compare its embedding quality to Poincaré embeddings (Nickel & Kiela, 2017) on the following real-world taxonomies:

WordNet® (Miller & Fellbaum, 1998) is a large lexical database which, amongst other relations, provides hypernymy (is-a) relations. In our experiments, we embedded the noun and verb hierarchy of WordNet.

EuroVoc is a multilingual thesaurus maintained by the European Union. It contains keywords organized in 21 domains and 127 subdomains. In our experiments, we used the English section of EuroVoc (available at http://eurovoc.europa.eu).

ACM: The ACM computing classification system is a hierarchical ontology which is used by various ACM journals to organize subjects by area.

MeSH: Medical Subject Headings (MeSH; Rogers, 1963) is a medical thesaurus which is created, maintained, and provided by the U.S. National Library of Medicine. In our experiments, we used the 2018 MeSH hierarchy.
Statistics for all taxonomies are provided in Table 1.
In our evaluation, we closely follow the setting of Nickel & Kiela (2017): First, we embed the undirected transitive closure of these taxonomies, such that the hierarchical structure is not directly visible from the observed edges but has to be inferred. To measure the quality of the embedding, we compute for each observed edge $(u, v)$ the corresponding distance $d(u, v)$ in the embedding and rank it among the distances of all unobserved edges for $u$, i.e., among $\{d(u, v') : (u, v') \notin \mathcal{D}\}$, where $\mathcal{D}$ denotes the set of observed edges. We then report the mean rank (MR) and mean average precision (MAP) of this ranking.
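The ranking protocol can be sketched as follows. This is an illustrative reading of the description above rather than the evaluation script used for the reported numbers: the function name, the data layout (a full distance matrix plus a dictionary of observed neighbors), and the tie handling are assumptions of this example.

```python
def mean_rank_and_map(dist, edges):
    """dist[u][v]: embedding distance; edges: dict mapping u to its set of observed neighbors."""
    ranks, avg_precisions = [], []
    for u, positives in edges.items():
        if not positives:
            continue
        others = [v for v in range(len(dist)) if v != u]
        unobserved = [v for v in others if v not in positives]
        for v in positives:
            # Rank of the observed edge (u, v) among the unobserved edges of u (1 = best).
            ranks.append(1 + sum(dist[u][w] < dist[u][v] for w in unobserved))
        # Average precision of the distance-based ranking of all other nodes for u.
        ordered = sorted(others, key=lambda v: dist[u][v])
        hits, precisions = 0, []
        for position, v in enumerate(ordered, start=1):
            if v in positives:
                hits += 1
                precisions.append(hits / position)
        avg_precisions.append(sum(precisions) / len(precisions))
    return sum(ranks) / len(ranks), sum(avg_precisions) / len(avg_precisions)

# Toy example: four nodes placed on a line at positions 0, 1, 2, 3.
dist = [[abs(a - b) for b in range(4)] for a in range(4)]
edges = {0: {1}, 3: {1}}
mean_rank, mean_ap = mean_rank_and_map(dist, edges)
print(mean_rank, mean_ap)  # 1.5 0.75
```

For node 0 the observed neighbor is also the closest node (rank 1, average precision 1), whereas for node 3 one unobserved node is closer than the observed neighbor (rank 2, average precision 0.5).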
In addition, we also evaluate how well the norm of the embeddings (i.e., our indicator for generality) correlates with the ground-truth ranks in the embedded taxonomy. Since different subtrees can have very different depths, we normalize the rank of each concept by the depth of its subtree and measure the Spearman rank-order correlation of the normalized rank with the norm of the embedding. We compute the normalized rank in the following way: Let $\operatorname{sp}(c)$ denote the length of the shortest path to the root node from $c$, and let $\operatorname{lp}(c)$ denote the length of the longest path from $c$ to any of its children. (Since all taxonomies in our experiments are DAGs, it is possible to compute the longest path in the graph.) The normalized rank of $c$ is then given as
$$\operatorname{rank}(c) = \frac{\operatorname{sp}(c)}{\operatorname{sp}(c) + \operatorname{lp}(c)}.$$
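A possible way to compute such a normalized rank on a small DAG is sketched below. It assumes that the rank is the ratio of the shortest-path length from the root to the sum of that length and the longest downward path length, so that the root receives 0 and leaves receive 1. The function name, the adjacency format, and the toy taxonomy are hypothetical.

```python
def normalized_ranks(children, root):
    """children: dict mapping each node to a list of its direct children in a DAG."""
    # Shortest path length from the root to every node via breadth-first search.
    sp = {root: 0}
    queue = [root]
    for node in queue:  # the list grows while we iterate, which gives BFS order
        for child in children.get(node, []):
            if child not in sp:
                sp[child] = sp[node] + 1
                queue.append(child)

    # Longest downward path length from each node, memoized in lp.
    lp = {}
    def longest(node):
        if node not in lp:
            lp[node] = max((1 + longest(c) for c in children.get(node, [])), default=0)
        return lp[node]

    ranks = {}
    for node in sp:
        total = sp[node] + longest(node)
        # Assumption: an isolated root (no path in either direction) gets rank 0.
        ranks[node] = sp[node] / total if total > 0 else 0.0
    return ranks

# Hypothetical toy taxonomy.
taxonomy = {"entity": ["animal", "plant"], "animal": ["dog"], "dog": [], "plant": []}
ranks = normalized_ranks(taxonomy, "entity")
print(ranks)
```

Note that "plant" and "dog" both receive rank 1 although they sit at different depths, which is the effect of normalizing by the depth of the subtree.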
To learn the embeddings in the Lorentz model, we employ the Riemannian optimization method as described in Section 3.2. For Poincaré embeddings, we use the official open-source implementation (source code available at https://github.com/facebookresearch/poincare-embeddings). Both methods were cross-validated over identical sets of hyperparameters.
Table 2 shows the results of our evaluation. It can be seen that both methods are very efficient in embedding these large taxonomies. However, the Lorentz model consistently yields higher-quality embeddings, especially in low dimensions. In terms of mean rank, the relative improvement of the two-dimensional Lorentz embeddings over the Poincaré embeddings amounts to 74.8% on the WordNet noun hierarchy and 42.4% on EuroVoc. Similar improvements can be observed on all taxonomies. Furthermore, on the most complex taxonomy (WordNet nouns), the 10-dimensional Lorentz embedding already outperforms the best numbers reported in Nickel & Kiela (2017) (which went up to 200 dimensions). This suggests that the full Riemannian optimization approach can be very helpful for obtaining good embeddings. This is especially the case in low dimensions, where it is harder for the optimization procedure to escape from local minima.
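The quoted relative improvements can be reproduced from the two-dimensional mean ranks in Table 2, up to the rounding of the tabulated values. The short check below assumes the improvement is measured relative to the Poincaré mean rank; the function name is chosen for this example.

```python
def relative_improvement(baseline, improved):
    """Percentage by which `improved` reduces `baseline` (lower mean rank is better)."""
    return 100.0 * (baseline - improved) / baseline

# Two-dimensional mean ranks (Poincaré, Lorentz) taken from Table 2.
wordnet_nouns = relative_improvement(90.7, 22.8)
eurovoc = relative_improvement(2.83, 1.63)
print(round(wordnet_nouns, 2), round(eurovoc, 2))
```

Both values agree with the reported 74.8% and 42.4% to within 0.1 percentage points; small deviations are expected because the mean ranks in the table are themselves rounded.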
4.2 Enron Email Corpus
In addition to the taxonomies in Section 4.1, we are interested in discovering hierarchies from real-world graphs that have not been generated from a clean DAG. For this purpose, we embed the communication graph of the Enron email corpus (Priebe et al., 2006), which consists of 125,409 emails that have been exchanged between 184 email addresses and 150 unique users. (This dataset has been created by Priebe et al. (2006) from the full Enron email corpus, which has been released into the public domain by the Federal Energy Regulatory Commission (FERC).) From this data, we construct a graph where weighted edges represent the total number of emails that have been exchanged between two users. The dataset also includes the organizational roles for 130 users, based largely on the information collected by Shetty & Adibi (2005).
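The graph construction described above can be sketched as a simple count over sender and recipient pairs. The input format, the user names, and the decision to ignore self-addressed messages are assumptions of this example and are not taken from the dataset documentation.

```python
from collections import Counter

def build_email_graph(messages):
    """Count the emails exchanged between each unordered pair of users."""
    weights = Counter()
    for sender, recipient in messages:
        if sender != recipient:  # assumption: ignore self-addressed mail
            # frozenset makes the edge undirected: (a, b) and (b, a) share one key.
            weights[frozenset((sender, recipient))] += 1
    return weights

# Hypothetical messages given as (sender, recipient) pairs.
messages = [("alice", "bob"), ("bob", "alice"), ("alice", "carol"), ("bob", "bob")]
graph = build_email_graph(messages)
print(graph[frozenset(("alice", "bob"))])  # 2
```

The resulting edge weights play the role of the pairwise similarity scores from Section 3.3.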
Figure 2 shows the two-dimensional embedding of this graph. It can be seen that the embedding captures important properties of the organizational structure. First, the nodes are approximately arranged according to the organizational hierarchy of the company. Executive roles such as CEOs, COOs, and (vice) presidents are embedded close to the origin, while other employees (e.g., traders and analysts) are located closer to the boundary. Figure 2 also shows the Spearman correlation of the norms of the embedding with the organizational rank. It can be seen that the norm correlates well with the ground-truth ranking and is on par with or better than commonly used centrality measures on graphs. We also observe that the embedding provides a meaningful clustering of users. For instance, the lower left of the disk shows a cluster of traders. Above that cluster (i.e., closer to the origin) are managers (e.g., John F.) and vice presidents (e.g., Fletcher S., Kevin P.) who have been associated with the trading arm of Enron. This illustrates that, in addition to the notion of rank in a hierarchy, the embedding also provides insight into the similarity of nodes within the hierarchy.
4.3 Historical Linguistics Data
The field of historical linguistics is concerned with the history of languages, their relations, and their origins. An important concept for determining the relations between languages is that of so-called cognates, i.e., words that are shared across different languages (but not borrowed) and which indicate common ancestry in the history of languages. To be classified as cognates, words must have similar meaning and systematic sound correspondences. Languages are assumed to be related if they share a large number of cognates.
The goal of our experiments was to discover the historical relationships between languages (which are assumed to follow a hierarchical tree structure) by embedding cognate similarity data. For this purpose, we used the lexical cognate data provided by Bouckaert et al. (2012), which consists of 103 Indo-European languages and 6280 cognate sets in total. Since the number of cognate sets grew over time, not all languages are annotated with all possible sets. For this reason, we computed the cognate similarity between two languages in the following way. Let $c(\ell_1, \ell_2)$ denote the number of common cognates in languages $\ell_1, \ell_2$. Furthermore, let $a(\ell)$ denote the number of cognate annotations for $\ell$. We then compute the cognate similarity of $\ell_1, \ell_2$ simply as
$$\operatorname{sim}(\ell_1, \ell_2) = \frac{c(\ell_1, \ell_2)}{\min(a(\ell_1), a(\ell_2))}.$$
Figure 3 shows a two-dimensional embedding of these cognate similarity scores. It can be seen that the embedding allows us to discover a meaningful hierarchy that corresponds well with the assumed origin of languages. First, the embedding shows clear clusters of high-level language families such as Celtic, Romance, Germanic, Balto-Slavic, Hellenic, Indic, and Iranian. Moreover, each of these clusters displays meaningful internal hierarchies such as (Gothic → Old High German → German), (Old Prussian → Old Church Slavonic → Bulgarian), (Latin → Italian), or (Ancient Greek → Greek). Closer to the center of the disc, we also find a number of ancient languages. For instance, Oscan and Umbrian are two extinct sister languages of Latin and are located above the Romance cluster. Similarly, Avestan and Vedic Sanskrit are two ancient languages that separated early in the prehistoric era before 1800 BCE (Baldi, 1983). After separation, Avestan developed in ancient Persia while Vedic Sanskrit developed independently in ancient India. In the embedding, both languages are close to the center and to each other. Furthermore, Avestan is close to the Iranian cluster while Vedic Sanskrit is close to the Indic cluster.
5 Conclusion
We introduced a new method for learning continuous concept hierarchies from unstructured observations. We exploited the properties of hyperbolic geometry in such a way that we can discover hierarchies from pairwise similarity scores, under the assumption that concepts in the same subtree of the ground-truth hierarchy are more similar to each other than to concepts in different subtrees. To learn the embeddings, we developed an efficient Riemannian optimization approach based on the Lorentz model of hyperbolic space. Due to the more principled optimization approach, we were able to substantially improve the quality of the embeddings compared to the method proposed by Nickel & Kiela (2017), especially in low dimensions. We further showed on two real-world datasets that our method can discover meaningful hierarchies from nothing but pairwise similarity information.
Acknowledgments
The authors thank Joan Bruna, Martín Arjovsky, Eryk Kopczyński, and Laurens van der Maaten for helpful discussions and suggestions.
References
 Agarwal et al. (2007) Agarwal, S., Wills, J., Cayton, L., Lanckriet, G., Kriegman, D., and Belongie, S. Generalized non-metric multidimensional scaling. In Artificial Intelligence and Statistics, pp. 11–18, 2007.
 Antoniou & Van Harmelen (2004) Antoniou, G. and Van Harmelen, F. Web ontology language: Owl. In Handbook on ontologies, pp. 67–92. Springer, 2004.
 Asta & Shalizi (2015) Asta, D. M. and Shalizi, C. R. Geometric network comparisons. In Meila, M. and Heskes, T. (eds.), Proceedings of the Thirty-First Conference on Uncertainty in Artificial Intelligence, UAI, pp. 102–110, 2015.
 Baldi (1983) Baldi, P. An introduction to the Indo-European languages. SIU Press, 1983.
 Boguñá et al. (2010) Boguñá, M., Papadopoulos, F., and Krioukov, D. Sustaining the internet with hyperbolic mapping. Nature communications, 1:62, 2010.
 Bonnabel (2013) Bonnabel, S. Stochastic gradient descent on Riemannian manifolds. IEEE Trans. Automat. Contr., 58(9):2217–2229, 2013.
 Bouckaert et al. (2012) Bouckaert, R., Lemey, P., Dunn, M., Greenhill, S. J., Alekseyenko, A. V., Drummond, A. J., Gray, R. D., Suchard, M. A., and Atkinson, Q. D. Mapping the origins and expansion of the Indo-European language family. Science, 337(6097):957–960, 2012.
 Campbell (2013) Campbell, L. Historical linguistics. Edinburgh University Press, 2013.
 De Sa et al. (2018) De Sa, C., Gu, A., Ré, C., and Sala, F. Representation tradeoffs for hyperbolic embeddings. arXiv preprint arXiv:1804.03329, 2018.
 Dodds et al. (2003) Dodds, P. S., Watts, D. J., and Sabel, C. F. Information exchange and the robustness of organizational networks. Proceedings of the National Academy of Sciences, 100(21):12516–12521, 2003.
 Duda et al. (1973) Duda, R. O., Hart, P. E., Stork, D. G., et al. Pattern classification, volume 2. Wiley New York, 1973.
 Ganea et al. (2018) Ganea, O.-E., Bécigneul, G., and Hofmann, T. Hyperbolic entailment cones for learning hierarchical embeddings. arXiv preprint arXiv:1804.01882, 2018.
 Inhelder & Piaget (1964) Inhelder, B. and Piaget, J. The growth of logic in the child. Routledge & Paul, 1964.

 Jean et al. (2015) Jean, S., Cho, K., Memisevic, R., and Bengio, Y. On using very large target vocabulary for neural machine translation. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing, volume 1, pp. 1–10, 2015.
 Kemp & Tenenbaum (2008) Kemp, C. and Tenenbaum, J. B. The discovery of structural form. Proceedings of the National Academy of Sciences, 105(31):10687–10692, 2008.
 Kleinberg (2007) Kleinberg, R. Geographic routing using hyperbolic space. In INFOCOM 2007. 26th IEEE International Conference on Computer Communications. IEEE, pp. 1902–1909. IEEE, 2007.
 Krioukov et al. (2010) Krioukov, D., Papadopoulos, F., Kitsak, M., Vahdat, A., and Boguñá, M. Hyperbolic geometry of complex networks. Physical Review E, 82(3):036106, 2010.
 Lake et al. (2018) Lake, B. M., Lawrence, N. D., and Tenenbaum, J. B. The emergence of organizing structure in conceptual representation. Cognitive Science, 2018.
 Mayr (1968) Mayr, E. The role of systematics in biology. Science, 159(3815):595–599, 1968.
 Miller & Fellbaum (1998) Miller, G. and Fellbaum, C. Wordnet: An electronic lexical database, 1998.
 Nickel & Kiela (2017) Nickel, M. and Kiela, D. Poincaré embeddings for learning hierarchical representations. In Advances in Neural Information Processing Systems 30, pp. 6338–6347, 2017.
 Priebe et al. (2006) Priebe, C. E., Conroy, J. M., Marchette, D. J., and Park, Y. Enron data set, 2006. URL http://cis.jhu.edu/~parky/Enron/enron.html.
 Resnik et al. (1999) Resnik, P. et al. Semantic similarity in a taxonomy: An information-based measure and its application to problems of ambiguity in natural language. J. Artif. Intell. Res. (JAIR), 11:95–130, 1999.
 Robbin & Salamon (2017) Robbin, J. W. and Salamon, D. A. Introduction to differential geometry. ETH, Lecture Notes, preliminary version, October, 2017.
 Rogers (1963) Rogers, F. Medical subject headings. Bulletin of the Medical Library Association, 51:114–116, 1963.
 Shetty & Adibi (2005) Shetty, J. and Adibi, J. Enron employee status report, 2005. URL http://www.isi.edu/~adibi/Enron/EnronEmployeeStatus.xls.
 Silla & Freitas (2011) Silla, C. N. and Freitas, A. A. A survey of hierarchical classification across different application domains. Data Mining and Knowledge Discovery, 22(1-2):31–72, 2011.
 Sun et al. (2015) Sun, K., Wang, J., Kalousis, A., and Marchand-Maillet, S. Space-time local embeddings. In Advances in Neural Information Processing Systems 28, pp. 100–108, 2015.
 Tamuz et al. (2011) Tamuz, O., Liu, C., Belongie, S., Shamir, O., and Kalai, A. T. Adaptively learning the crowd kernel. In Proceedings of the 28th International Conference on Machine Learning, pp. 673–680, 2011.
 Van Der Maaten & Weinberger (2012) Van Der Maaten, L. and Weinberger, K. Stochastic triplet embedding. In Machine Learning for Signal Processing (MLSP), 2012 IEEE International Workshop on, pp. 1–6. IEEE, 2012.
 Vendrov et al. (2015) Vendrov, I., Kiros, R., Fidler, S., and Urtasun, R. Order-embeddings of images and language. arXiv preprint arXiv:1511.06361, 2015.
 Vilnis & McCallum (2015) Vilnis, L. and McCallum, A. Word representations via gaussian embedding. In International Conference on Learning Representations (ICLR), 2015.