1 Introduction
Concept hierarchies, i.e., systems of is-a relationships, are ubiquitous in knowledge representation and reasoning. Understanding is-a relationships between concepts is of special interest in many scientific fields, as it enables high-level abstractions for reasoning and provides structural insight into the concept space of a domain. A prime example is biology, in which taxonomies have a long history ranging from Linnaeus et al. (1758) up to recent efforts such as the Gene Ontology and the UniProt taxonomy (Ashburner et al., 2000; Gene Ontology Consortium, 2016; Apweiler et al., 2004). Similarly, in medicine, ontologies like MeSH and ICD-10 are used to organize medical concepts such as diseases, drugs, and treatments (Rogers, 1963; Simms, 1992). In Artificial Intelligence, concept hierarchies provide valuable information for a wide range of tasks such as automated reasoning, few-shot learning, transfer learning, textual entailment, and semantic similarity (Resnik, 1993; Lin, 1998; Berners-Lee et al., 2001; Dagan et al., 2010; Bowman et al., 2015; Zamir et al., 2018). In addition, is-a relationships are the basis of complex knowledge graphs such as DBpedia (Auer et al., 2007) and Yago (Suchanek et al., 2007; Hoffart et al., 2013), which have found important applications in text understanding and question answering. Creating and inferring concept hierarchies has, for these reasons, been a long-standing task in fields such as natural language processing, the Semantic Web, and artificial intelligence. Early approaches such as
WordNet (Miller et al., 1990; Miller and Fellbaum, 1998) and Cyc (Lenat, 1995) focused on the manual construction of high-quality ontologies. To increase scalability and coverage, the focus in recent efforts such as Probase (Wu et al., 2012) and WebIsADb (Seitner et al., 2016) has shifted towards automated construction. In this work, we consider the task of inferring concept hierarchies from large text corpora in an unsupervised way. For this purpose, we combine Hearst patterns with recently introduced hyperbolic embeddings (Nickel and Kiela, 2017, 2018), which provides important advantages for this task. First, as Roller et al. (2018) showed recently, Hearst patterns provide important constraints for hypernymy extraction from distributional contexts. However, it is also well known that Hearst patterns are prone to missing and wrong extractions, as words must co-occur in exactly the right pattern to be detected successfully. For this reason, we first extract potential is-a relationships from a corpus using Hearst patterns and build a directed weighted graph from these extractions. We then embed this Hearst graph in hyperbolic space to infer missing hypernymy relations and to remove wrong extractions. By using hyperbolic space for the embedding, we can exploit the following important advantages:

 Consistency

Hyperbolic entailment cones (Ganea et al., 2018) allow us to enforce transitivity of is-a relations in the entire embedding space. This improves the taxonomic consistency of the model, as it enforces that (x, isa, z) holds whenever (x, isa, y) and (y, isa, z). To improve optimization properties, we also propose a new method to compute hyperbolic entailment cones in the Lorentz model of hyperbolic space.
 Efficiency

Hyperbolic space allows for very low-dimensional embeddings of graphs with latent hierarchies and heavy-tailed degree distributions. To embed large Hearst graphs, which exhibit both properties (e.g., see Figure 2), this is an important advantage. In our experiments, we show that hyperbolic embeddings allow us to decrease the embedding dimension by over an order of magnitude while outperforming SVD-based methods.
 Interpretability

In hyperbolic embeddings, similarity is captured via distance, while hierarchy is captured through the norm of the embeddings. In addition to semantic similarity, this allows us to gain further insights from the embedding, such as the generality of terms.
Figure 1 shows an example of a two-dimensional embedding of the Hearst graph that we use in our experiments. Although we use higher dimensionalities for our final embeddings, the visualization serves as a good illustration of the hierarchical structure that is obtained through the embedding.
2 Related Work
Hypernym detection
Detecting is-a relations from text is a long-standing task in natural language processing. A popular approach is to exploit high-precision lexico-syntactic patterns, as first proposed by Hearst (1992). These patterns may be predefined or learned automatically (Snow et al., 2005; Shwartz et al., 2016). However, it is well known that such pattern-based methods suffer significantly from missing extractions, as terms must occur in exactly the right configuration to be detected (Shwartz et al., 2016; Roller et al., 2018). Recent works improve coverage by leveraging search engines (Kozareva and Hovy, 2010) or by exploiting web-scale corpora (Seitner et al., 2016), but these approaches come with significant precision trade-offs.
To overcome the sparse extractions of pattern-based methods, focus has recently shifted to distributional approaches, which provide rich representations of lexical meaning. These methods alleviate the sparsity issue but require specialized similarity measures to distinguish different lexical relationships. To date, most measures are inspired by the Distributional Inclusion Hypothesis (DIH; Geffet and Dagan 2005), which hypothesizes that for a subsumption relation (cat, isa, mammal), the subordinate term (cat) should appear in a subset of the contexts in which the superior term (mammal) occurs. Unsupervised methods for hypernymy detection based on distributional approaches include WeedsPrec (Weeds et al., 2004), invCL (Lenci and Benotto, 2012), SLQS (Santus et al., 2014), and DIVE (Chang et al., 2018). Distributional representations that are based on positional or dependency-based contexts may also capture crude Hearst-pattern-like features (Levy et al., 2015; Roller and Erk, 2016). Shwartz et al. (2017) showed that such contexts play an important role in the success of distributional methods. Recently, Roller et al. (2018) performed a systematic study of unsupervised distributional and pattern-based approaches. Their results showed that pattern-based methods are able to outperform DIH-based methods on several challenging hypernymy benchmarks. Key aspects of good performance were the extraction of patterns from large text corpora and the use of embedding methods to overcome the sparsity issue. Our work builds on these findings by replacing their embeddings with ones with a naturally hierarchical geometry.
Taxonomy induction
Although detecting hypernymy relationships is an important and difficult task, these systems alone do not produce rich taxonomic graph structures (Camacho-Collados, 2017), and complete taxonomy induction may be seen as a parallel and complementary task.
Many works in this area take a taxonomic graph as the starting point and consider a variety of methods for growing or discovering areas of the graph. For example, Snow et al. (2006)
train a classifier to predict the likelihood of an edge in WordNet, and suggest new undiscovered edges, while
Kozareva and Hovy (2010) propose an algorithm which repeatedly crawls for new edges using a web search engine and an initial seed taxonomy. Cimiano et al. (2005) considered learning ontologies using Formal Concept Analysis. Similar works consider noisy graphs discovered from Hearst patterns and provide algorithms for pruning edges until a strict hierarchy remains (Velardi et al., 2005; Kozareva and Hovy, 2010; Velardi et al., 2013). Maedche and Staab (2001) proposed a method to learn ontologies in a Semantic Web context.
Embeddings
Recently, a variety of graph embedding techniques have been proposed for representing and recovering hierarchical structure. Order embeddings (Vendrov et al., 2016) represent text and images with embeddings where the ordering over individual dimensions forms a partially ordered set. Hyperbolic embeddings treat words as points in non-Euclidean geometries and may be viewed as a continuous generalization of tree structures (Nickel and Kiela, 2017, 2018). Extensions have considered how distributional co-occurrences may be used to augment order embeddings (Li et al., 2018) and hyperbolic embeddings (Dhingra et al., 2018). Other recent works have focused on the often complex, overlapping structure of word classes and induced hierarchies using box-lattice structures (Vilnis et al., 2018) and Gaussian word embeddings (Athiwaratkun and Wilson, 2018). Compared to many of the purely graph-based works, these methods generally require extensive supervision of hierarchical structure and cannot learn taxonomies using only unstructured noisy data. Recently, Tifrea et al. (2018) proposed an extension of GloVe (Pennington et al., 2014) to hyperbolic space. Our experimental results in Section 4, which show substantial gains over the results reported by Tifrea et al. (2018) for hypernymy prediction, underline the importance of selecting the right distributional context for this task.
3 Methods
In the following, we discuss our method for unsupervised learning of concept hierarchies. We first discuss the extraction and construction of the Hearst graph, followed by a description of the Hyperbolic Embeddings.
3.1 Hearst Graph
Pattern
X which is a (example | class | kind | …) of Y
X (and | or) (any | some) other Y
X which is called Y
X is JJS (most)? Y
X a special case of Y
X is an Y that
X is a !(member | part | given) Y
!(features | properties) Y such as X, X, …
(Unlike | like) (most | all | any | other) Y, X
Y including X, X, …

Table 1: Hearst patterns used in this work. "JJS" denotes a superlative adjective, "|" alternation, and "!" exclusion of the listed words.
The main idea introduced by Hearst (1992) is to exploit certain lexico-syntactic patterns to detect is-a relationships in natural language. For instance, patterns like "NP_y such as NP_x" or "NP_x and other NP_y" often indicate a hypernymy relationship (x, isa, y). By treating unique noun phrases as nodes in a large directed graph, we may construct a Hearst graph using only unstructured text and very limited prior knowledge in the form of patterns. Table 1 lists the only patterns that we use in this work. Formally, let P denote the set of is-a relationships that have been extracted from a text corpus, and let w(x, y) denote how often we have extracted the relationship (x, isa, y). We then represent the extracted patterns as a weighted directed graph G = (N, P, w), where N is the set of all extracted terms.
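As a concrete illustration, the construction can be sketched with toy sentences and two simplified regular-expression patterns standing in for the matcher of Table 1 (the sentences, patterns, and the build_hearst_graph helper are all illustrative, not the actual extraction pipeline):

```python
import re
from collections import Counter

# Toy corpus; in practice the input is a large corpus such as CommonCrawl,
# tokenized and POS-tagged so that noun phrases can be matched.
sentences = [
    "animals such as cats are popular pets",
    "cats and other animals are kept as pets",
    "animals such as cats can be trained",
]

# Two simplified patterns standing in for Table 1: "Y such as X" and
# "X and other Y" both indicate the relationship (X, isa, Y).
patterns = [
    (re.compile(r"(\w+) such as (\w+)"), lambda m: (m.group(2), m.group(1))),
    (re.compile(r"(\w+) and other (\w+)"), lambda m: (m.group(1), m.group(2))),
]

def build_hearst_graph(sentences):
    """Return the weight function w: w[(x, y)] counts extractions of (x, isa, y)."""
    w = Counter()
    for sentence in sentences:
        for regex, to_pair in patterns:
            for m in regex.finditer(sentence):
                w[to_pair(m)] += 1
    return w

w = build_hearst_graph(sentences)  # w[("cats", "animals")] == 3
```

Because matching is local to a sentence, this step parallelizes trivially over both sentences and patterns, which is what makes the MapReduce-style aggregation mentioned below possible.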
Hearst patterns afford a number of important advantages in terms of data acquisition: they are embarrassingly parallel across both sentences and distinct Hearst patterns, and counts are easily aggregated in any MapReduce setting (Dean and Ghemawat, 2004). Our own experiments, and those of Seitner et al. (2016), demonstrate that this approach can be scaled to large corpora such as CommonCrawl (http://commoncrawl.org). As Roller et al. (2018)
showed, pattern matches also provide important contextual constraints which boost signal compared to methods based on the Distributional Inclusion Hypothesis.
However, naïvely using Hearst patterns can easily result in a graph that is extremely sparse: pattern matches naturally follow a long-tailed distribution that is skewed by the occurrence probabilities of the constituent words (see Figure 2), and many true relationships are unlikely to ever appear in a corpus (e.g., "long-tailed macaque isa entity"). This may be alleviated with generous, low-precision patterns (Seitner et al., 2016), but the resulting graph will contain many false positives, inconsistencies, and cycles. For example, our own Hearst graph contains the cycle (area, isa, spot), (spot, isa, commercial), (commercial, isa, promotion), (promotion, isa, area), which is caused by the polysemy of spot (location, advertisement) and area (location, topical area).
3.2 Hyperbolic Embeddings
Roller et al. (2018)
showed that low-rank embedding methods, such as Singular Value Decomposition (SVD), alleviate the aforementioned sparsity issues but still produce cyclic and inconsistent predictions. In the following, we discuss how hyperbolic embeddings allow us to improve consistency via strong hierarchical priors in the geometry.
First, we briefly review the necessary concepts of hyperbolic embeddings. In contrast to Euclidean or spherical space, there exist multiple equivalent models for hyperbolic space, e.g., the Poincaré ball, the Lorentz model, the Poincaré upper half plane, and the Beltrami-Klein model. Since there exist transformations between these models that preserve all geometric properties (including isometry), we can choose whichever is best suited for a given task. In the following, we first discuss hyperbolic embeddings based on the Poincaré-ball model, which is defined as follows: the Poincaré-ball model is the Riemannian manifold P^n = (B^n, d_p), where B^n = {x ∈ R^n : ||x|| < 1} is the open n-dimensional unit ball and where d_p is the distance function

(1)  d_p(x, y) = arcosh( 1 + 2 ||x − y||² / ((1 − ||x||²)(1 − ||y||²)) )
Hyperbolic space has a natural hierarchical structure and, intuitively, can be thought of as a continuous version of trees. This property becomes evident in the Poincaré-ball model: it can be seen from Equation 1 that the distance within the Poincaré ball changes smoothly with respect to the norm of a point x. Points that are close to the origin of the ball are relatively close to all other points, while points that are close to the boundary are relatively far apart. (This can be seen by considering how the Euclidean distance in B^n is scaled by the norms of the respective points.) This locality property of the distance is key for learning continuous embeddings of hierarchical structures and corresponds to the behavior of shortest paths in trees.
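This locality behavior is easy to verify numerically with a direct implementation of Equation 1 (a sketch; the test points are arbitrary):

```python
import math

def poincare_distance(x, y):
    """The Poincaré-ball distance of Equation 1."""
    sq_norm = lambda v: sum(vi * vi for vi in v)
    sq_diff = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.acosh(1 + 2 * sq_diff / ((1 - sq_norm(x)) * (1 - sq_norm(y))))

# The same Euclidean displacement costs far more near the boundary than near
# the origin: this is the locality property described above.
near_origin = poincare_distance([0.0, 0.0], [0.1, 0.0])
near_boundary = poincare_distance([0.8, 0.0], [0.9, 0.0])
```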
Hyperbolic Entailment Cones
Ganea et al. (2018) used the hierarchical properties of hyperbolic space to define entailment cones in an embedding. The main idea of hyperbolic entailment cones (HECs) is to define, for each possible point u in the space, an entailment region in the form of a hyperbolic cone C_u. Points that are located inside the cone C_u are assumed to be children of u. The width of each cone is determined by the norm of the associated base point: the closer u is to the origin, i.e., the more general the base point is, the larger the width of C_u becomes and the more points are subsumed in the entailment cone. Figure 3 shows entailment cones for different points in P². To model the possible entailment (v, isa, u), we can then use the energy function

(2)  E(u, v) = max(0, Ξ(u, v) − ψ(u))

In Equation 2, ψ(u) denotes the half-aperture of the cone associated with point u, and Ξ(u, v) denotes the angle at u between the cone axis (the geodesic from the origin through u) and the geodesic from u to v. If E(u, v) = 0, i.e., if the angle between u and v is smaller than the half-aperture of C_u, it holds that v ∈ C_u. If E(u, v) > 0, the energy can be interpreted as the smallest angle of a rotation bringing v into the cone associated with u.
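For illustration, the cone energy can be sketched in the Poincaré ball using the closed-form expressions of Ganea et al. (2018) for the half-aperture and the angle; the constant K = 0.1 below is an arbitrary choice, not a tuned value:

```python
import math

K = 0.1  # cone-width constant from Ganea et al. (2018); the value is illustrative

def half_aperture(x):
    """psi(x) = arcsin(K (1 - ||x||^2) / ||x||) in the Poincaré ball."""
    nx = math.sqrt(sum(v * v for v in x))
    return math.asin(min(1.0, K * (1 - nx * nx) / nx))

def xi_angle(x, y):
    """Angle at x between the cone axis (origin through x) and the geodesic to y."""
    dot = sum(a * b for a, b in zip(x, y))
    nx2 = sum(a * a for a in x)
    ny2 = sum(b * b for b in y)
    dxy = math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))
    num = dot * (1 + nx2) - nx2 * (1 + ny2)
    den = math.sqrt(nx2) * dxy * math.sqrt(1 + nx2 * ny2 - 2 * dot)
    return math.acos(max(-1.0, min(1.0, num / den)))

def energy(x, y):
    """E(x, y) = max(0, Xi(x, y) - psi(x)); zero iff y lies inside the cone at x."""
    return max(0.0, xi_angle(x, y) - half_aperture(x))

# A point further out along the same ray is entailed (energy 0); a point
# closer to the origin, i.e. a more general one, is not.
```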
Given a Hearst graph G = (N, P, w), we then compute embeddings of all terms in the following way: let u_x denote the embedding of term x ∈ N, and let Θ = {u_x}_{x ∈ N}. To minimize the overall energy of the embedding, we solve the optimization problem

(3)  min_Θ  Σ_{(x,y) ∈ P} E(u_y, u_x) + Σ_{(x',y') ∈ P'} max(0, γ − E(u_{y'}, u_{x'}))

where P' denotes a set of negative examples that is created by randomly corrupting the observed pairs in P and γ > 0 is a margin hyperparameter. The goal of Equation 3 is therefore to find a joint embedding of all terms that best explains all observed Hearst patterns.
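One common way to turn the cone energy into a trainable objective, in the spirit of the optimization problem above, is a max-margin loss over observed and corrupted pairs (a sketch; the margin value is illustrative):

```python
def cone_loss(pos_energies, neg_energies, margin=0.01):
    """Max-margin objective: observed pairs should have zero cone energy,
    corrupted (negative) pairs an energy of at least `margin` (illustrative)."""
    pos = sum(pos_energies)
    neg = sum(max(0.0, margin - e) for e in neg_energies)
    return pos + neg
```

In practice, pos_energies and neg_energies would be the cone energies E(u_y, u_x) of a minibatch of observed and randomly corrupted Hearst pairs.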
Lorentz Entailment Cones
The optimization problem in Equation 3 is agnostic of the hyperbolic manifold on which the optimization is performed. Ganea et al. (2018) developed hyperbolic entailment cones in the Poincaré-ball model. However, as Nickel and Kiela (2018) pointed out, the Poincaré-ball model is not optimal from an optimization perspective, as it is prone to numerical instabilities when points approach the boundary of the ball. Instead, Nickel and Kiela (2018) proposed to perform optimization in the Lorentz model and to use the Poincaré ball only for analysis and visualization. Here, we follow this approach and develop entailment cones in the Lorentz model of hyperbolic space. The Lorentz model is defined as follows: let x, y ∈ R^{n+1}, and let

⟨x, y⟩_L = −x_0 y_0 + Σ_{i=1}^{n} x_i y_i

denote the Lorentzian scalar product. The Lorentz model of n-dimensional hyperbolic space is then the Riemannian manifold L^n = (H^n, d_ℓ), where

H^n = {x ∈ R^{n+1} : ⟨x, x⟩_L = −1, x_0 > 0}

denotes the upper sheet of a two-sheeted n-dimensional hyperboloid and where the associated distance function on H^n is given as

d_ℓ(x, y) = arcosh(−⟨x, y⟩_L)
Due to the equivalence of both models, we can define a mapping between both spaces that preserves all geometric properties, including isometry. Points in the Lorentz model can be mapped into the Poincaré ball via the diffeomorphism p : H^n → B^n, where

(4)  p(x_0, x_1, …, x_n) = (x_1, …, x_n) / (x_0 + 1)

Furthermore, points in B^n can be mapped to H^n via

p^{-1}(y_1, …, y_n) = (1 + ||y||², 2y_1, …, 2y_n) / (1 − ||y||²)
See also Figure 4 for an illustration of the Lorentz model and its connections to the Poincaré ball.
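Both mappings are straightforward to implement, and a round trip together with the hyperboloid constraint ⟨x, x⟩_L = −1 provides a quick sanity check (the test point is arbitrary):

```python
def lorentz_to_poincare(x):
    """The map p : H^n -> B^n of Equation 4."""
    return [v / (x[0] + 1) for v in x[1:]]

def poincare_to_lorentz(y):
    """The inverse map p^{-1} : B^n -> H^n."""
    ny2 = sum(v * v for v in y)
    return [(1 + ny2) / (1 - ny2)] + [2 * v / (1 - ny2) for v in y]

# Sanity check: the image satisfies the hyperboloid constraint <x, x>_L = -1.
x = poincare_to_lorentz([0.3, -0.2])
lor = -x[0] * x[0] + sum(v * v for v in x[1:])
```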
To define entailment cones in the Lorentz model, it is necessary to derive the half-aperture ψ and the angle Ξ on H^n. Both quantities can be derived easily from the hyperbolic law of cosines and the mapping between H^n and B^n. In particular, it holds that

ψ(u) = arcsin( 2K / √(u_0² − 1) )

Ξ(u, v) = arccos( (v_0 + u_0 ⟨u, v⟩_L) / ( √(u_0² − 1) √(⟨u, v⟩_L² − 1) ) )

where K is the constant that determines the cone width. Due to space restrictions, we refer to the supplementary material for the full derivation.
Training
To solve Equation 3, we follow Nickel and Kiela (2018) and perform stochastic optimization via Riemannian SGD (RSGD; Bonnabel 2013). In RSGD, updates to the parameters θ are computed via

(5)  θ_{t+1} = exp_{θ_t}( −η grad f(θ_t) )

where grad f(θ_t) denotes the Riemannian gradient and η denotes the learning rate. In Equation 5, the Riemannian gradient of f at θ is computed via

grad f(θ) = proj_θ( g_L^{-1} ∇f(θ) )

where ∇f(θ) denotes the Euclidean gradient of f and where

proj_θ(u) = u + ⟨θ, u⟩_L θ    and    g_L^{-1} = diag(−1, 1, …, 1)

denote the projection from the ambient space R^{n+1} onto the tangent space T_θ H^n and the inverse of the metric tensor, respectively. Finally, the exponential map for H^n is computed via

exp_θ(v) = cosh(||v||_L) θ + sinh(||v||_L) v / ||v||_L

where ||v||_L = √⟨v, v⟩_L and v ∈ T_θ H^n.
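A single RSGD update in the Lorentz model can be sketched as follows, assuming the Euclidean gradient of the loss at the current point is already given (in practice it comes from the cone energies of a minibatch of pairs):

```python
import math

def lorentz_dot(u, v):
    """Lorentzian scalar product <u, v>_L."""
    return -u[0] * v[0] + sum(a * b for a, b in zip(u[1:], v[1:]))

def rsgd_step(theta, egrad, lr):
    """One RSGD update: rescale the Euclidean gradient by the inverse metric,
    project onto the tangent space at theta, and follow the exponential map."""
    h = [-egrad[0]] + list(egrad[1:])                 # g_L^{-1} * Euclidean grad
    c = lorentz_dot(theta, h)
    v = [hi + c * ti for hi, ti in zip(h, theta)]     # projection onto T_theta H^n
    step = [-lr * vi for vi in v]
    n = math.sqrt(max(lorentz_dot(step, step), 0.0))  # tangent norm ||step||_L
    if n < 1e-12:
        return list(theta)
    return [math.cosh(n) * ti + math.sinh(n) * si / n
            for ti, si in zip(theta, step)]
```

Because the exponential map lands exactly on the hyperboloid, no projection back onto the manifold is needed after the update.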
As suggested by Nickel and Kiela (2018), we initialize the embeddings close to the origin of H^n by sampling the coordinates x_1, …, x_n from the uniform distribution U(−0.001, 0.001) and by setting x_0 to √(1 + ||(x_1, …, x_n)||²), which ensures that the sampled points are located on the surface of the hyperboloid.
4 Experiments
To evaluate the efficacy of our method, we evaluate on several commonly used hypernymy benchmarks (as described in Roller et al. (2018)) as well as in a reconstruction setting (as described in Nickel and Kiela (2017)). Following Roller et al. (2018), we compare to the following methods for unsupervised hypernymy detection:
           | Detection (AP)                          | Direction (Acc.)          | Graded (ρ)
           | Bless | Eval | Leds | Shwartz | WBless  | Bless | WBless | BiBless  | Hyperlex
Cosine     | .12   | .29  | .71  | .31     | .53     | .00   | .54    | .52      | .14
WeedsPrec  | .19   | .39  | .87  | .43     | .68     | .63   | .59    | .45      | .43
invCL      | .18   | .37  | .89  | .38     | .66     | .64   | .60    | .47      | .43
SLQS       | .15   | .35  | .60  | .38     | .69     | .75   | .67    | .51      | .16
p(x, y)    | .49   | .38  | .71  | .29     | .74     | .46   | .69    | .62      | .62
ppmi(x, y) | .45   | .36  | .70  | .28     | .72     | .46   | .68    | .61      | .60
sp(x, y)   | .66   | .45  | .81  | .41     | .91     | .96   | .84    | .80      | .51
spmi(x, y) | .76   | .48  | .84  | .44     | .96     | .96   | .87    | .85      | .53
HypeCones  | .81   | .50  | .89  | .50     | .98     | .94   | .90    | .87      | .59

Table 2: Results for hypernymy detection, directionality, and graded entailment.
Pattern-Based Models
Let P be the set of pairs extracted via Hearst patterns from our corpus, w(x, y) be the count of how many times the pair (x, y) occurs in P, and W = Σ_{(x,y) ∈ P} w(x, y) be the total number of extractions. We then consider the following pattern-based methods:
Count Model (p)
This model simply outputs the count or, equivalently, the extraction probability of Hearst patterns, i.e., p(x, y) = w(x, y) / W.
PPMI Model (ppmi)
To correct for skewed occurrence probabilities, the PPMI model predicts hypernymy relations based on the positive pointwise mutual information over the Hearst pattern corpus. Let p⁻(x) = Σ_y p(x, y) and p⁺(y) = Σ_x p(x, y); then

ppmi(x, y) = max(0, log( p(x, y) / (p⁻(x) p⁺(y)) ))
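The count and PPMI scores can be sketched directly from a table of extraction counts (the toy counts below are hypothetical):

```python
import math
from collections import Counter

# Hypothetical extraction counts w(x, y) from a small Hearst graph
w = Counter({("cat", "animal"): 6, ("dog", "animal"): 2,
             ("cat", "pet"): 2, ("rose", "plant"): 6})
W = sum(w.values())  # total number of extractions

def p(x, y):
    """Count model: extraction probability of the pair (x, y)."""
    return w[(x, y)] / W

def ppmi(x, y):
    """Positive pointwise mutual information over the Hearst pattern counts."""
    if w[(x, y)] == 0:
        return 0.0
    px = sum(v for (a, _), v in w.items() if a == x) / W  # p^-(x)
    py = sum(v for (_, b), v in w.items() if b == y) / W  # p^+(y)
    return max(0.0, math.log(p(x, y) / (px * py)))
```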
SVD Count (sp)
To account for missing relations, we also compare against low-rank embeddings of the Hearst corpus using Singular Value Decomposition (SVD). Specifically, let M be the matrix with entries M_{xy} = p(x, y), and let M = UΣV^⊤ be its singular value decomposition. With Σ_r denoting the truncation of Σ to the r largest singular values, we then set

sp(x, y) = u_x^⊤ Σ_r v_y

where u_x and v_y denote the rows of U and V associated with x and y, respectively.
SVD PPMI (spmi)
We also evaluate against the SVD of the PPMI matrix, which is identical to sp(x, y) with the exception that M_{xy} = ppmi(x, y) instead of M_{xy} = p(x, y). Roller et al. (2018) showed that this method provides state-of-the-art results for unsupervised hypernymy detection.
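The SVD-based scores reduce to a truncated reconstruction of the score matrix; a sketch with a toy matrix (the numbers are hypothetical, and in practice M is a large sparse matrix over all extracted terms):

```python
import numpy as np

# Toy score matrix with M[i, j] = p(x_i, y_j); substituting the PPMI values
# for the probabilities gives the spmi variant.
M = np.array([[0.375, 0.125],
              [0.125, 0.0],
              [0.0, 0.375]])

U, S, Vt = np.linalg.svd(M, full_matrices=False)

def sp(ix, iy, r):
    """Low-rank score u_x^T Sigma_r v_y, keeping the r largest singular values."""
    return float(U[ix, :r] * S[:r] @ Vt[:r, iy])

# With full rank the scores reproduce M exactly; a smaller r smooths the
# matrix and assigns nonzero scores to unobserved (missing) pairs.
```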
Hyperbolic Embeddings (HypeCones)
We embed the Hearst graph into hyperbolic space as described in Section 3.2. At evaluation time, we predict the likelihood of (x, isa, y) using the model energy E(u_y, u_x).
Distributional Models
The distributional models in our evaluation are based on the DIH, i.e., the idea that the contexts in which a narrow term may appear (e.g., cat) should be a subset of the contexts in which a broader term (e.g., animal) may appear.
WeedsPrec
The first distributional model we consider is WeedsPrec (Weeds et al., 2004), which captures the weight of the features of a term x that are included in the feature set of the more general term y:

WeedsPrec(x, y) = Σ_i x_i · 1[y_i > 0] / Σ_i x_i

where x and y denote the distributional context vectors of the two terms.
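A sketch of WeedsPrec on toy context vectors (the weights are hypothetical; in practice the vectors hold, e.g., PPMI values for each context word):

```python
# Toy context vectors over a fixed vocabulary of context words
cat = [2.0, 1.0, 1.0, 0.0]
animal = [0.5, 1.0, 1.5, 2.0]

def weeds_prec(x, y):
    """WeedsPrec(x, y): weighted fraction of x's contexts in which y also occurs."""
    included = sum(xi for xi, yi in zip(x, y) if yi > 0)
    return included / sum(x)

# cat's contexts are fully covered by animal's, but not vice versa,
# as the DIH predicts for a hyponym-hypernym pair.
```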
invCL
Lenci and Benotto (2012) introduce the idea of distributional exclusion by also measuring the degree to which the broader term contains contexts not used by the narrower term. The degree of inclusion is denoted as

CL(x, y) = Σ_i min(x_i, y_i) / Σ_i x_i

To measure the inclusion of x in y and the non-inclusion of y in x, invCL is then computed as

invCL(x, y) = √( CL(x, y) · (1 − CL(y, x)) )
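Both quantities can be sketched in a few lines on hypothetical toy context vectors:

```python
import math

# Toy context vectors (hypothetical weights over a fixed context vocabulary)
cat = [2.0, 1.0, 1.0, 0.0]
animal = [0.5, 1.0, 1.5, 2.0]

def clarke_de(x, y):
    """Degree of inclusion CL(x, y) of x's contexts within y's."""
    return sum(min(xi, yi) for xi, yi in zip(x, y)) / sum(x)

def inv_cl(x, y):
    """invCL(x, y): inclusion of x in y combined with non-inclusion of y in x."""
    return math.sqrt(clarke_de(x, y) * (1 - clarke_de(y, x)))
```

The asymmetry is the point: the hyponym-first direction inv_cl(cat, animal) scores higher than the reverse.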
SLQS
The SLQS model is based on the informativeness hypothesis (Santus et al., 2014; Shwartz et al., 2017), i.e., the idea that general words appear mostly in uninformative contexts, as measured by entropy. SLQS depends on the median entropy of a term's top k contexts,

E_x = median_{c ∈ topk(x)} [ H(c) ]

where H(c) is the Shannon entropy of context c across all terms. SLQS is then defined as

SLQS(x, y) = 1 − E_x / E_y
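A sketch of SLQS, assuming the entropies of each term's top contexts have already been computed (all values below are hypothetical):

```python
import math
from statistics import median

def entropy(context_probs):
    """Shannon entropy H(c) of a context, given p(term | c) over all terms."""
    return -sum(p * math.log2(p) for p in context_probs if p > 0)

def slqs(top_entropies_x, top_entropies_y):
    """SLQS(x, y) = 1 - E_x / E_y, where E_* is the median entropy of a term's
    top contexts; values close to 1 indicate that y is the more general term."""
    return 1 - median(top_entropies_x) / median(top_entropies_y)

# A uniform context distribution is maximally uninformative
h_uniform = entropy([0.25, 0.25, 0.25, 0.25])
```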
Corpora and Preprocessing
We construct our Hearst graph using the same data, patterns, and procedure as described in Roller et al. (2018): Hearst patterns are extracted from the concatenation of GigaWord and Wikipedia. The corpus is tokenized, lemmatized, and POS-tagged using CoreNLP 3.8.0 (Manning et al., 2014). The full set of Hearst patterns is provided in Table 1. These include prototypical Hearst patterns, like "animals [such as] big cats," as well as broader patterns like "New Year [is the most important] holiday." Noun phrases were allowed to match limited modifiers and produced additional hits for the head of the noun phrase. The final corpus contains circa 4.5M matched pairs, 431K unique pairs, and 243K unique terms.
           | Animals                     | Plants                      | Vehicles
           | All    | Missing | Transitive | All    | Missing | Transitive | All   | Missing | Transitive
p(x, y)    | 350.18 | 512.28  | 455.27     | 271.38 | 393.98  | 363.73     | 43.12 | 82.57   | 66.10
ppmi(x, y) | 350.47 | 512.28  | 455.38     | 271.40 | 393.98  | 363.76     | 43.20 | 82.57   | 66.16
sp(x, y)   | 56.56  | 77.10   | 11.22      | 43.40  | 64.70   | 17.88      | 9.19  | 26.98   | 14.84
spmi(x, y) | 58.40  | 102.56  | 12.37      | 40.61  | 71.81   | 14.80      | 9.62  | 17.96   | 3.03
HypeCones  | 25.33  | 37.60   | 4.37       | 17.00  | 31.53   | 6.36       | 5.12  | 10.28   | 2.74
Δ (%)      | 56.6   | 51.2    | 61.1       | 58.1   | 51.3    | 57.0       | 44.3  | 42.8    | 9.6

Table 3: Mean-rank results for reconstructing WordNet subtrees; Δ (%) denotes the relative improvement of HypeCones over the best SVD-based model.
Hypernymy Tasks
We consider three distinct subtasks for evaluating the performance of these models for hypernymy prediction:

Detection: Given a pair of words (x, y), determine if y is a hypernym of x.

Direction: Given a pair (x, y), determine if x is more general than y or vice versa.

Graded Entailment: Given a pair of words (x, y), determine the degree to which x is a y.
For detection, we evaluate all models on five commonly used benchmark datasets: Bless (Baroni and Lenci, 2011), Leds (Baroni et al., 2012), Eval (Santus et al., 2015), Shwartz (Shwartz et al., 2016), and WBless (Weeds et al., 2014). In addition to positive hypernymy relations, these datasets include negative samples in the form of random pairs, co-hyponymy, antonymy, meronymy, and adjectival relations. For directionality and graded entailment, we also use the BiBless (Kiela et al., 2015) and Hyperlex (Vulic et al., 2016) datasets. We refer to Roller et al. (2018) for an in-depth discussion of these datasets.
Table 2 shows the results for all tasks on these datasets. It can be seen that our proposed approach provides substantial gains on the detection and directionality tasks and, overall, achieves state-of-the-art results on seven of these nine benchmarks. In addition, our method clearly outperforms other embedding-based approaches on Hyperlex, although it cannot fully match the performance of the count-based methods. As Roller et al. (2018)
noted, this might be an artifact of the evaluation metric, as count-based methods benefit from their sparse predictions in this setting.
It can also be seen that our method outperforms Poincaré GloVe for the task of hypernymy prediction: compared to the Spearman's ρ on HyperLex and the accuracy on WBless reported by Tifrea et al. (2018), our method achieves substantially better results on both tasks. This illustrates the importance of the distributional constraints that are provided by the Hearst patterns.
An additional benefit of our approach is the efficiency of the embedding. For all tasks, we have used a 20-dimensional embedding for HypeCones, while the best results for the SVD-based methods were achieved with 300 dimensions. This reduction in parameters by over an order of magnitude clearly highlights the efficiency of hyperbolic embeddings for representing hierarchical structures.
Reconstruction
In the following, we compare embedding- and pattern-based methods on the task of reconstructing an entire subtree of WordNet, i.e., the animals, plants, and vehicles taxonomies, as proposed by Kozareva and Hovy (2010). In addition to predicting the existence of single hypernymy relations, this allows us to evaluate the performance of these models for inferring full taxonomies and to perform an ablation for the prediction of missing and transitive relations. We follow previous work (Bordes et al., 2013; Nickel and Kiela, 2017) and report, for each observed relation in WordNet, the rank of its score relative to the scores of the ground-truth negative edges. In Table 3, All refers to the ranking of all edges in the subtree, Missing to edges that are not included in the Hearst graph G, and Transitive to missing transitive edges in G (i.e., the edges (x, isa, z) implied by observed edges (x, isa, y) and (y, isa, z)).
It can be seen that our method clearly outperforms the SVD- and count-based models, typically with a substantial relative improvement over the best non-hyperbolic model. Furthermore, our ablation shows that HypeCones improves the consistency of the embedding due to its transitivity property. For instance, in our Hearst graph the relation (male horse, isa, equine) is missing. However, since we correctly model that (male horse, isa, horse) and (horse, isa, equine), by transitivity we also infer (male horse, isa, equine), which SVD fails to do.
5 Conclusion
In this work, we have proposed a new approach for inferring concept hierarchies from large text corpora. For this purpose, we combine Hearst patterns with hyperbolic embeddings, which allows us to set appropriate constraints on the distributional contexts and to improve consistency in the embedding space. By computing a joint embedding of all terms that best explains the extracted Hearst patterns, we can then exploit these properties for improved hypernymy prediction. The natural hierarchical structure of hyperbolic space also allows us to learn very efficient embeddings that substantially reduce the required dimensionality compared to SVD-based methods. To improve optimization, we have furthermore proposed a new method to compute entailment cones in the Lorentz model of hyperbolic space. Experimentally, we show that our embeddings achieve state-of-the-art performance on a variety of commonly used hypernymy benchmarks.
References
 Apweiler et al. (2004) Rolf Apweiler, Amos Bairoch, Cathy H Wu, Winona C Barker, Brigitte Boeckmann, Serenella Ferro, Elisabeth Gasteiger, Hongzhan Huang, Rodrigo Lopez, Michele Magrane, et al. 2004. Uniprot: the universal protein knowledgebase. Nucleic acids research, 32(suppl_1):D115–D119.
 Ashburner et al. (2000) Michael Ashburner, Catherine A Ball, Judith A Blake, David Botstein, Heather Butler, J Michael Cherry, Allan P Davis, Kara Dolinski, Selina S Dwight, Janan T Eppig, et al. 2000. Gene ontology: tool for the unification of biology. Nature genetics, 25(1):25.
 Athiwaratkun and Wilson (2018) Ben Athiwaratkun and Andrew Gordon Wilson. 2018. Hierarchical density order embeddings. In Proceedings of the International Conference on Learning Representations.
 Auer et al. (2007) Sören Auer, Christian Bizer, Georgi Kobilarov, Jens Lehmann, Richard Cyganiak, and Zachary Ives. 2007. Dbpedia: A nucleus for a web of open data. In The semantic web, pages 722–735. Springer.
 Baroni et al. (2012) Marco Baroni, Raffaella Bernardi, Ngoc-Quynh Do, and Chung-chieh Shan. 2012. Entailment above the word level in distributional semantics. In Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 23–32. Association for Computational Linguistics.
 Baroni and Lenci (2011) Marco Baroni and Alessandro Lenci. 2011. How we BLESSed distributional semantic evaluation. In Proceedings of the 2011 Workshop on GEometrical Models of Natural Language Semantics, pages 1–10, Edinburgh, UK.
 Berners-Lee et al. (2001) Tim Berners-Lee, James Hendler, and Ora Lassila. 2001. The semantic web. Scientific American, 284(5):34–43.
 Bonnabel (2013) Silvere Bonnabel. 2013. Stochastic gradient descent on Riemannian manifolds. IEEE Trans. Automat. Contr., 58(9):2217–2229.
 Bordes et al. (2013) Antoine Bordes, Nicolas Usunier, Alberto GarciaDuran, Jason Weston, and Oksana Yakhnenko. 2013. Translating embeddings for modeling multirelational data. In Advances in neural information processing systems, pages 2787–2795.
 Bowman et al. (2015) Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. 2015. A large annotated corpus for learning natural language inference. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 632–642, Lisbon, Portugal. Association for Computational Linguistics.
 Camacho-Collados (2017) Jose Camacho-Collados. 2017. Why we have switched from building full-fledged taxonomies to simply detecting hypernymy relations. arXiv preprint arXiv:1703.04178.

 Chang et al. (2018) Haw-Shiuan Chang, Ziyun Wang, Luke Vilnis, and Andrew McCallum. 2018. Distributional inclusion vector embedding for unsupervised hypernymy detection. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 485–495, New Orleans, Louisiana. Association for Computational Linguistics.
 Cimiano et al. (2005) Philipp Cimiano, Andreas Hotho, and Steffen Staab. 2005. Learning concept hierarchies from text corpora using formal concept analysis. Journal of Artificial Intelligence Research, 24:305–339.
 Dagan et al. (2010) Ido Dagan, Bill Dolan, Bernardo Magnini, and Dan Roth. 2010. Recognizing textual entailment: Rational, evaluation and approaches–erratum. Natural Language Engineering, 16(1):105–105.
 Dean and Ghemawat (2004) Jeffrey Dean and Sanjay Ghemawat. 2004. Mapreduce: Simplified data processing on large clusters. In OSDI’04: Sixth Symposium on Operating System Design and Implementation, pages 137–150, San Francisco, CA.
 Dhingra et al. (2018) Bhuwan Dhingra, Christopher Shallue, Mohammad Norouzi, Andrew Dai, and George Dahl. 2018. Embedding text in hyperbolic spaces. In Proceedings of the Twelfth Workshop on GraphBased Methods for Natural Language Processing (TextGraphs12), pages 59–69, New Orleans, Louisiana, USA. Association for Computational Linguistics.
 Ganea et al. (2018) Octavian-Eugen Ganea, Gary Bécigneul, and Thomas Hofmann. 2018. Hyperbolic entailment cones for learning hierarchical embeddings. arXiv preprint arXiv:1804.01882.
 Geffet and Dagan (2005) Maayan Geffet and Ido Dagan. 2005. The distributional inclusion hypotheses and lexical entailment. In Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics, pages 107–114. Association for Computational Linguistics.
 Gene Ontology Consortium (2016) Gene Ontology Consortium. 2016. Expansion of the gene ontology knowledgebase and resources. Nucleic acids research, 45(D1):D331–D338.
 Hearst (1992) Marti A. Hearst. 1992. Automatic acquisition of hyponyms from large text corpora. In Proceedings of the 14th Conference on Computational Linguistics, Volume 2, pages 539–545. Association for Computational Linguistics.
 Hoffart et al. (2013) Johannes Hoffart, Fabian M. Suchanek, Klaus Berberich, and Gerhard Weikum. 2013. YAGO2: A spatially and temporally enhanced knowledge base from wikipedia. Artif. Intell., 194:28–61.
 Kiela et al. (2015) Douwe Kiela, Laura Rimell, Ivan Vulic, and Stephen Clark. 2015. Exploiting image generality for lexical entailment detection. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics (ACL 2015), pages 119–124. ACL.
 Kozareva and Hovy (2010) Zornitsa Kozareva and Eduard Hovy. 2010. A semisupervised method to learn and construct taxonomies using the web. In Proceedings of the 2010 conference on empirical methods in natural language processing, pages 1110–1118. Association for Computational Linguistics.
 Lenat (1995) Douglas B. Lenat. 1995. Cyc: a largescale investment in knowledge infrastructure. Communications of the ACM, 38(11):33–38.
 Lenci and Benotto (2012) Alessandro Lenci and Giulia Benotto. 2012. Identifying hypernyms in distributional semantic spaces. In Proceedings of the First Joint Conference on Lexical and Computational Semantics - Volume 1: Proceedings of the main conference and the shared task, and Volume 2: Proceedings of the Sixth International Workshop on Semantic Evaluation, pages 75–79. Association for Computational Linguistics.
 Levy et al. (2015) Omer Levy, Steffen Remus, Chris Biemann, and Ido Dagan. 2015. Do supervised distributional methods really learn lexical inference relations? In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 970–976.
 Li et al. (2018) Xiang Li, Luke Vilnis, and Andrew McCallum. 2018. Improved representation learning for predicting commonsense ontologies. In International Conference on Machine Learning Workshop on Deep Structured Prediction.
 Lin (1998) Dekang Lin. 1998. An information-theoretic definition of similarity. In Proceedings of the 14th International Conference on Machine Learning, volume 98, pages 296–304.
 Linnaeus et al. (1758) Carolus Linnaeus et al. 1758. Systema naturae, Vol. 1.
 Maedche and Staab (2001) Alexander Maedche and Steffen Staab. 2001. Ontology learning for the semantic web. IEEE Intelligent systems, 16(2):72–79.
 Manning et al. (2014) Christopher Manning, Mihai Surdeanu, John Bauer, Jenny Finkel, Steven Bethard, and David McClosky. 2014. The Stanford CoreNLP natural language processing toolkit. In Proceedings of 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pages 55–60.
 Miller and Fellbaum (1998) George Miller and Christiane Fellbaum. 1998. WordNet: An electronic lexical database.
 Miller et al. (1990) George A Miller, Richard Beckwith, Christiane Fellbaum, Derek Gross, and Katherine J Miller. 1990. Introduction to WordNet: An online lexical database. International Journal of Lexicography, 3(4):235–244.
 Nickel and Kiela (2017) Maximilian Nickel and Douwe Kiela. 2017. Poincaré embeddings for learning hierarchical representations. In Advances in Neural Information Processing Systems 30, pages 6338–6347. Curran Associates, Inc.
 Nickel and Kiela (2018) Maximilian Nickel and Douwe Kiela. 2018. Learning continuous hierarchies in the Lorentz model of hyperbolic geometry. In Proceedings of the Thirty-fifth International Conference on Machine Learning.
 Pennington et al. (2014) Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543.
 Resnik (1993) Philip Stuart Resnik. 1993. Selection and information: a classbased approach to lexical relationships. IRCS Technical Reports Series, page 200.
 Rogers (1963) FB Rogers. 1963. Medical subject headings. Bulletin of the Medical Library Association, 51:114–116.
 Roller and Erk (2016) Stephen Roller and Katrin Erk. 2016. Relations such as hypernymy: Identifying and exploiting hearst patterns in distributional vectors for lexical entailment. arXiv preprint arXiv:1605.05433.
 Roller et al. (2018) Stephen Roller, Douwe Kiela, and Maximilian Nickel. 2018. Hearst patterns revisited: Automatic hypernym detection from large text corpora. arXiv preprint arXiv:1806.03191.
 Santus et al. (2014) Enrico Santus, Alessandro Lenci, Qin Lu, and S Schulte im Walde. 2014. Chasing hypernyms in vector spaces with entropy. In 14th Conference of the European Chapter of the Association for Computational Linguistics, pages 38–42. EACL (European chapter of the Association for Computational Linguistics).
 Santus et al. (2015) Enrico Santus, Frances Yung, Alessandro Lenci, and Chu-Ren Huang. 2015. EVALution 1.0: an evolving semantic dataset for training and evaluation of distributional semantic models. In Proceedings of the 4th Workshop on Linked Data in Linguistics: Resources and Applications, pages 64–69.
 Seitner et al. (2016) Julian Seitner, Christian Bizer, Kai Eckert, Stefano Faralli, Robert Meusel, Heiko Paulheim, and Simone Paolo Ponzetto. 2016. A large database of hypernymy relations extracted from the web. In Proceedings of the Tenth International Conference on Language Resources and Evaluation, LREC 2016, Portorož, Slovenia, May 23-28, 2016.
 Shwartz et al. (2016) Vered Shwartz, Yoav Goldberg, and Ido Dagan. 2016. Improving hypernymy detection with an integrated path-based and distributional method. arXiv preprint arXiv:1603.06076.
 Shwartz et al. (2017) Vered Shwartz, Enrico Santus, and Dominik Schlechtweg. 2017. Hypernyms under siege: Linguistically-motivated artillery for hypernymy detection. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, pages 65–75, Valencia, Spain. Association for Computational Linguistics.
 Simms (1992) GO Simms. 1992. The ICD-10 classification of mental and behavioural disorders: clinical descriptions and diagnostic guidelines, volume 1. World Health Organization.
 Snow et al. (2005) Rion Snow, Daniel Jurafsky, and Andrew Y Ng. 2005. Learning syntactic patterns for automatic hypernym discovery. In Advances in neural information processing systems, pages 1297–1304.
 Snow et al. (2006) Rion Snow, Daniel Jurafsky, and Andrew Y. Ng. 2006. Semantic taxonomy induction from heterogenous evidence. In Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics, pages 801–808. Association for Computational Linguistics.
 Suchanek et al. (2007) Fabian M. Suchanek, Gjergji Kasneci, and Gerhard Weikum. 2007. Yago: a core of semantic knowledge. In Proceedings of the 16th International Conference on World Wide Web, WWW 2007, Banff, Alberta, Canada, May 8-12, 2007, pages 697–706.
 Tifrea et al. (2018) Alexandru Tifrea, Gary Bécigneul, and Octavian-Eugen Ganea. 2018. Poincaré GloVe: Hyperbolic word embeddings. arXiv preprint arXiv:1810.06546.
 Velardi et al. (2013) Paola Velardi, Stefano Faralli, and Roberto Navigli. 2013. Ontolearn reloaded: A graphbased algorithm for taxonomy induction. Computational Linguistics, 39(3):665–707.
 Velardi et al. (2005) Paola Velardi, Roberto Navigli, Alessandro Cuchiarelli, and R Neri. 2005. Evaluation of ontolearn, a methodology for automatic learning of domain ontologies. Ontology Learning from Text: Methods, evaluation and applications, 123(92).
 Vendrov et al. (2016) Ivan Vendrov, Ryan Kiros, Sanja Fidler, and Raquel Urtasun. 2016. Order-embeddings of images and language. In Proceedings of the International Conference on Learning Representations (ICLR), volume abs/1511.06361.
 Vilnis et al. (2018) Luke Vilnis, Xiang Li, Shikhar Murty, and Andrew McCallum. 2018. Probabilistic embedding of knowledge graphs with box lattice measures. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 263–272. Association for Computational Linguistics.
 Vulic et al. (2016) Ivan Vulic, Daniela Gerz, Douwe Kiela, Felix Hill, and Anna Korhonen. 2016. Hyperlex: A largescale evaluation of graded lexical entailment. arXiv preprint arXiv:1608.02117.
 Weeds et al. (2014) Julie Weeds, Daoud Clarke, Jeremy Reffin, David Weir, and Bill Keller. 2014. Learning to distinguish hypernyms and co-hyponyms. In Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers, pages 2249–2259. Dublin City University and Association for Computational Linguistics.
 Weeds et al. (2004) Julie Weeds, David Weir, and Diana McCarthy. 2004. Characterising measures of lexical distributional similarity. In Proceedings of the 20th international conference on Computational Linguistics, page 1015. Association for Computational Linguistics.
 Wu et al. (2012) Wentao Wu, Hongsong Li, Haixun Wang, and Kenny Q Zhu. 2012. Probase: A probabilistic taxonomy for text understanding. In Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data, pages 481–492. ACM.
 Zamir et al. (2018) Amir R Zamir, Alexander Sax, William Shen, Leonidas J Guibas, Jitendra Malik, and Silvio Savarese. 2018. Taskonomy: Disentangling task transfer learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3712–3722.