1 Introduction
Entity typing classifies textual mentions of entities according to their semantic class. The task has progressed from finding company names (Rau, 1991), to recognizing coarse classes (person, location, organization, and other; Tjong Kim Sang and De Meulder, 2003), to fine-grained inventories of about one hundred types, with finer-grained types proving beneficial in applications such as relation extraction (Yaghoobzadeh et al., 2017) and question answering (Yavuz et al., 2016). The trend towards larger inventories has culminated in ultra-fine and open entity typing with thousands of classes (Choi et al., 2018; Zhou et al., 2018).
However, large type inventories pose a challenge for the common approach of casting entity typing as a multi-label classification task (Yogatama et al., 2015; Shimaoka et al., 2016), since exploiting inter-type correlations becomes more difficult as the number of types increases. A natural solution for dealing with a large number of types is to organize them in a hierarchy, ranging from general, coarse types such as “person” near the top, to more specific, fine types such as “politician” in the middle, to even more specific, ultra-fine entity types such as “diplomat” at the bottom (see Figure 2). By virtue of such a hierarchy, a model learning about diplomats will be able to transfer this knowledge to related entities such as politicians.
Prior work integrated hierarchical entity type information by formulating a hierarchy-aware loss (Ren et al., 2016; Murty et al., 2018; Xu and Barbosa, 2018) or by representing words and types in a joint Euclidean embedding space (Shimaoka et al., 2017; Abhishek et al., 2017). Noting that it is impossible to embed arbitrary hierarchies in Euclidean space, Nickel and Kiela (2017) propose hyperbolic space as an alternative and show that hyperbolic embeddings accurately encode hierarchical information. Intuitively (and as explained in more detail in Section 2), this is because distances in hyperbolic space grow exponentially as one moves away from the origin, just like the number of elements in a hierarchy grows exponentially with its depth.
While the intrinsic advantages of hyperbolic embeddings are well-established, their usefulness in downstream tasks is, so far, less clear. We believe this is due to two difficulties: First, incorporating hyperbolic embeddings into a neural model is non-trivial since training involves optimization in hyperbolic space. Second, it is often not clear what the best hierarchy for the task at hand is.
In this work, we address both of these issues. Using ultra-fine entity typing (Choi et al., 2018) as a test bed, we first show how to incorporate hyperbolic embeddings into a neural model (Section 3). Then, we examine the impact of the hierarchy, comparing hyperbolic embeddings of an expert-generated ontology to those of a large, automatically generated one (Section 4). As our experiments on two different datasets show (Section 5), hyperbolic embeddings improve entity typing in some but not all cases, suggesting that their usefulness depends both on the type inventory and its hierarchy. In summary, we make the following contributions:

We develop a fine-grained entity typing model that embeds both entity types and entity mentions in hyperbolic space.

We compare two different entity type hierarchies, one created by experts (WordNet) and one generated automatically, and find that their adequacy depends on the dataset.

We study the impact of replacing the Euclidean geometry with its hyperbolic counterpart in an entity typing model, finding that the improvements of the hyperbolic model are noticeable on ultra-fine types.
2 Background: Poincaré Embeddings
Hyperbolic geometry studies non-Euclidean spaces with constant negative curvature. Two-dimensional hyperbolic space can be modelled as the open unit disk, the so-called Poincaré disk, in which the unit circle represents infinity, i.e., as a point approaches infinity in hyperbolic space, its norm approaches one in the Poincaré disk model. In the general $n$-dimensional case, the disk model becomes the Poincaré ball $\mathcal{B}^n = \{x \in \mathbb{R}^n : \|x\| < 1\}$ (Chamberlain et al., 2017), where $\|\cdot\|$ denotes the Euclidean norm. In the Poincaré model, the distance between two points $u, v \in \mathcal{B}^n$ is given by:
$$d(u, v) = \operatorname{arcosh}\!\left(1 + 2\,\frac{\|u - v\|^2}{(1 - \|u\|^2)(1 - \|v\|^2)}\right) \qquad (1)$$
If we consider the origin $O$ and two points $u$ and $v$ moving towards the outside of the disk, i.e., $\|u\|, \|v\| \to 1$, the distance $d(u, v)$ tends to $d(u, O) + d(O, v)$. That is, the path between $u$ and $v$ converges to a path through the origin. This behaviour can be seen as the continuous analogue of a (discrete) tree-like hierarchical structure, where the shortest path between two sibling nodes goes through their common ancestor.
As an alternative intuition, note that the hyperbolic distance between points grows exponentially as the points move away from the center. This mirrors the exponential growth of the number of nodes in trees with increasing depth, thus making hyperbolic space a natural fit for representing trees and hence hierarchies (Krioukov et al., 2010; Nickel and Kiela, 2017).
By embedding hierarchies in the Poincaré ball so that items near the top of the hierarchy are placed near the origin and lower items near infinity (intuitively, embedding the “vertical” structure), and so that items sharing a parent in the hierarchy are close to each other (embedding the “horizontal” structure), we obtain Poincaré embeddings (Nickel and Kiela, 2017). More formally, this means that the norm of an embedding represents depth in the hierarchy, and the distance between embeddings represents the similarity of the respective items.
Figure 5 shows the results of embedding the WordNet noun hierarchy in two-dimensional Euclidean space (left) and the Poincaré disk (right). In the hyperbolic model, the types tend to be located near the boundary of the disk. In this region the space grows exponentially, which allows related types to be placed near one another and far from unrelated ones. The actual distance in this model is not the one visualized in the figure but the one given by Equation 1.
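The distance in Equation 1, and the tree-like behaviour described above, can be illustrated with a few lines of NumPy (the sample points are arbitrary toy values):

```python
import numpy as np

def poincare_distance(u, v):
    """Hyperbolic distance between two points of the Poincare ball (Equation 1)."""
    sq_dist = np.sum((u - v) ** 2)
    denom = (1.0 - np.sum(u ** 2)) * (1.0 - np.sum(v ** 2))
    return np.arccosh(1.0 + 2.0 * sq_dist / denom)

origin = np.zeros(2)
a = np.array([0.9, 0.0])   # a point near the boundary ("deep" in the hierarchy)
b = np.array([0.0, 0.9])   # a sibling direction, also near the boundary

# Distances grow rapidly towards the boundary of the disk ...
d_shallow = poincare_distance(origin, np.array([0.5, 0.0]))
d_deep = poincare_distance(origin, np.array([0.99, 0.0]))

# ... and the path between two near-boundary points approaches the path
# through the origin, like the path between siblings via their ancestor:
d_direct = poincare_distance(a, b)
d_via_origin = poincare_distance(a, origin) + poincare_distance(origin, b)
```

Note that the Euclidean distance between the origin and the two probe points differs only by 0.49, while the hyperbolic distance grows without bound as the norm approaches one.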
3 Entity Typing in Hyperbolic Space
3.1 Task Definition
The task we consider is: given a context sentence containing an entity mention, predict the correct type labels that describe the mention from a type inventory, which includes more than 10,000 types (Choi et al., 2018). The mention can be a named entity, a nominal, or a pronoun. The ground-truth type set may contain multiple types, making the task a multi-label classification problem.
3.2 Objective
We aim to analyze the effects of hyperbolic and Euclidean spaces when modeling the hierarchical information present in the type inventory, for the task of fine-grained entity typing. Since hyperbolic geometry is naturally equipped to model hierarchical structures, we hypothesize that this enhanced representation will result in superior performance. With the goal of examining the relation between the metric space and the hierarchy, we propose a regression model. We learn a function that maps feature representations of a mention and its context onto a vector space such that instances are embedded close to their target types.
The ground-truth type set contains a varying number of types per instance. In our regression setup, however, we aim to predict a fixed number of labels for all instances. This imposes a strong upper bound on the performance of our proposed model. Nonetheless, as the strict accuracy of state-of-the-art methods for the Ultra-Fine dataset is below 40% (Choi et al., 2018; Xiong et al., 2019), the evaluation we perform is still informative in qualitative terms, and enables us to gain better intuitions with regard to embedding hierarchical structures in different metric spaces.
3.3 Method
Given the encoded feature representation of a mention and its context, our goal is to learn a mapping function into a target vector space. We intend to approximate the embeddings of the type labels, previously projected into that space. Subsequently, we perform a search for the nearest type embeddings of the projected representation in order to assign the categorical labels corresponding to the mention within that context. Figure 8 presents an overview of the model.
The label distribution of the dataset is diverse and fine-grained. Each instance is annotated with three levels of granularity, namely coarse, fine, and ultra-fine, and on the development and test sets there are, on average, five labels per item. This poses a challenging problem for learning and predicting with only one projection. As a solution, we propose three different projection functions, each of them fine-tuned to predict labels of a specific granularity.
We hypothesize that the complexity of the projection increases as the granularity becomes finer, given that the target label space per granularity grows. Inspired by Sanh et al. (2019), we arrange the three projections in a hierarchical manner that reflects these difficulties: the coarse projection task is set at the bottom layer of the model, and the more complex (finer) projections at higher layers. With the projected embedding of each layer, we aim to introduce an inductive bias into the next projection that helps to guide it into the correct region of the space. Nevertheless, we use shortcut connections so that the top layers also have access to the encoder representation.
3.4 Mention and Context Representations
To encode the context containing the mention, we apply the encoder scheme of Choi et al. (2018), which is based on Shimaoka et al. (2016). We replace the location embedding of the original encoder with a word position embedding to reflect the relative distance between each word and the entity mention. This modification induces a bias on the attention layer to focus less on the mention and more on the context. Finally, we apply a standard BiLSTM and a self-attentive encoder (McCann et al., 2017) on top to obtain the context representation.
For the mention representation, we derive features from a character-level CNN, concatenate them with the GloVe word embeddings (Pennington et al., 2014) of the mention, and combine them with a similar self-attentive encoder. The final representation is obtained by concatenating the mention and context representations.
3.5 Projecting into the Ball
To learn a projection function that embeds our feature representation in the target space, we apply a variation of the reparameterization technique introduced in Dhingra et al. (2018). The reparameterization involves computing a direction vector $v$ and a norm magnitude $p$ from the feature representation $x$ as follows:

$$\bar{v} = \varphi_{\mathrm{dir}}(x), \qquad v = \frac{\bar{v}}{\|\bar{v}\|}, \qquad p = \sigma(\varphi_{\mathrm{norm}}(x)) \qquad (2)$$

where $\varphi_{\mathrm{dir}}$ and $\varphi_{\mathrm{norm}}$ can be arbitrary functions whose parameters are optimized during training, and $\sigma$ is the sigmoid function, which ensures that the resulting norm satisfies $p \in (0, 1)$. The reparameterized embedding is defined as $u = p\,v$, which lies in $\mathcal{B}^n$. By making use of this simple technique, the embeddings are guaranteed to lie in the Poincaré ball. This avoids the need to correct the gradient or to use Riemannian SGD (Bonnabel, 2011); instead, it allows the use of any optimization method common in deep learning, such as Adam (Kingma and Ba, 2014).

We parameterize the direction function $\varphi_{\mathrm{dir}}$ as a multi-layer perceptron (MLP) with a single hidden layer, rectified linear units (ReLU) as non-linearity, and dropout. We do not apply the ReLU function after the output layer, in order to allow negative values as components of the direction vector. For the norm magnitude function $\varphi_{\mathrm{norm}}$ we use a single linear layer.

3.6 Optimization of the Model
We aim to find projection functions that embed the instance representations close to their respective target types in a given vector space. As target space we use the Poincaré ball $\mathcal{B}^n$ and compare it with the Euclidean unit ball. Both are metric spaces and are therefore equipped with a distance function, namely the hyperbolic distance defined in Equation 1 and the Euclidean distance, respectively, which we intend to minimize. Moreover, since the Poincaré model is a conformal model of hyperbolic space, i.e., the angles between Euclidean and hyperbolic vectors are equal, the cosine distance can be used as well.
We propose to minimize a combination of the distance defined by each metric space and the cosine distance to approximate the embeddings. Although formally this combination is not a distance metric, since it does not satisfy the triangle inequality, it provides a very strong signal for approximating the target embeddings, accounting for the main concepts modeled in the representation: relatedness, captured via the distance and orientation in the space, and generality, via the norm of the embeddings.
To mitigate the instability in the derivative of the hyperbolic distance, we follow the approach proposed in Sala et al. (2018) and minimize the square of the distance, which does have a continuous derivative in $\mathcal{B}^n$. Thus, in the Poincaré model we minimize, for two points $u, y \in \mathcal{B}^n$:
$$\mathcal{L}_{\mathcal{B}}(u, y) = d(u, y)^2 + \beta\, d_{\cos}(u, y) \qquad (3)$$
Whereas in the Euclidean space, for two points $u, y$, we minimize:
$$\mathcal{L}_{\mathcal{E}}(u, y) = \|u - y\|^2 + \gamma\, d_{\cos}(u, y) \qquad (4)$$
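The projection of Section 3.5 and the two objectives can be sketched together as follows; this is a minimal NumPy sketch, where the function names are our own and the weighting factors `beta` and `gamma` stand in for the compensation hyperparameters (the paper's exact implementation may differ):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def reparameterize(v_bar, norm_logit):
    """Map an unconstrained direction vector and a scalar logit to a point
    strictly inside the unit ball: unit direction times sigmoid-squashed norm."""
    direction = v_bar / np.linalg.norm(v_bar)
    norm = sigmoid(norm_logit)          # guaranteed to lie in (0, 1)
    return norm * direction

def cosine_distance(u, y):
    return 1.0 - np.dot(u, y) / (np.linalg.norm(u) * np.linalg.norm(y))

def poincare_distance(u, y):
    sq_dist = np.sum((u - y) ** 2)
    denom = (1.0 - np.sum(u ** 2)) * (1.0 - np.sum(y ** 2))
    return np.arccosh(1.0 + 2.0 * sq_dist / denom)

def hyperbolic_loss(u, y, beta=1.0):
    # squared hyperbolic distance (continuous derivative) plus weighted cosine term
    return poincare_distance(u, y) ** 2 + beta * cosine_distance(u, y)

def euclidean_loss(u, y, gamma=1.0):
    return np.sum((u - y) ** 2) + gamma * cosine_distance(u, y)

# e.g. the (here untrained) outputs of the direction and norm heads:
u = reparameterize(np.array([3.0, 4.0]), 0.0)   # norm sigmoid(0) = 0.5, inside the ball
```

In a full model, `v_bar` and `norm_logit` would be the outputs of the MLP and linear layer described above, and the loss would be minimized with Adam over mini-batches.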
The hyperparameters $\beta$ and $\gamma$ are added to compensate for the bounded image of the cosine distance function in $[0, 2]$.

4 Hierarchical Type Inventories
In this section, we investigate two methods for deriving a hierarchical structure for a given type inventory. First, we introduce the datasets on which we perform our study, since we exploit some of their characteristics to construct a hierarchy.
4.1 Data
Split  Coarse     Fine       Ultra-fine
Train  2,416,593  4,146,143  3,997,318
Dev    1,918      1,289      7,594
Test   1,904      1,318      7,511
We focus our analysis on the Ultra-Fine entity typing dataset introduced in Choi et al. (2018). Its design goals were to increase the diversity and coverage of entity type annotations. It contains 10,331 target types defined as free-form noun phrases and divided into three levels of granularity: coarse, fine, and ultra-fine. The data consist of 6,000 crowdsourced examples and approximately 6M training samples in the open-source version, automatically extracted with distant supervision by entity linking and nominal head-word extraction (Choi et al. (2018) additionally use the licensed Gigaword corpus to build part of the dataset, resulting in about 25.2M training samples). Our evaluation is done on the original crowdsourced dev/test splits.
To gain a better understanding of the proposed model under different geometries, we also experiment on the OntoNotes dataset Gillick et al. (2014) as it is a standard benchmark for entity typing.
4.2 Deriving the Hierarchies
The two methods we analyze to derive a hierarchical structure from the type inventory are the following.
Knowledge base alignment: Hierarchical information can be provided explicitly, by aligning the type labels to a knowledge base schema. In this case the types follow the tree-like structure of an ontology curated by experts. In the Ultra-Fine dataset, the type vocabulary (i.e., noun phrases) is extracted from WordNet (Miller, 1992). Nouns in WordNet are organized into a deep hierarchy defined by hypernym, or “IS A”, relationships. By aligning the type labels to the hypernym structure existing in WordNet, we obtain a type hierarchy. In this case, all paths lead to the root type entity. In the OntoNotes dataset, the annotations follow a pre-established, much smaller, hierarchical taxonomy based on “IS A” relations as well.
Type co-occurrences: Although in practical scenarios hierarchical information may not always be available, the distribution of types has an implicit hierarchy that can be inferred automatically. If we model the ground-truth labels as nodes of a graph, its adjacency matrix can be drawn and weighted by considering the co-occurrences on each instance. That is, if two types are annotated as true types for the same training instance, we add an edge between them. To weight the edge we explore two variants: the frequency of observed instances where this correlation holds, and the pointwise mutual information (PMI) as a measure of the association between the two types (we adapt PMI in order to satisfy the condition of non-negativity). By mining the type co-occurrences present in the dataset as an affinity score, the hierarchy can be inferred. This method alleviates the need for a type inventory explicitly aligned to an ontology or for predefined label correlations.
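A minimal sketch of how such a weighted co-occurrence graph can be built from ground-truth label sets (names and toy data are our own; clamping negative PMI values to zero is one plausible way to meet the non-negativity condition, not necessarily the paper's exact adaptation):

```python
import math
from collections import Counter
from itertools import combinations

def cooccurrence_edges(instances, use_pmi=False):
    """Weighted edges between types that co-occur in the same ground-truth set."""
    type_counts = Counter()
    pair_counts = Counter()
    for labels in instances:
        type_counts.update(labels)
        pair_counts.update(combinations(sorted(labels), 2))
    n = len(instances)
    edges = {}
    for (a, b), c in pair_counts.items():
        if use_pmi:
            pmi = math.log((c / n) / ((type_counts[a] / n) * (type_counts[b] / n)))
            edges[(a, b)] = max(pmi, 0.0)  # clamped so weights are non-negative
        else:
            edges[(a, b)] = c              # raw co-occurrence frequency
    return edges

# toy ground-truth label sets with hypothetical types
instances = [{"person", "politician"}, {"person", "politician"}, {"person", "artist"}]
freq_edges = cooccurrence_edges(instances)
pmi_edges = cooccurrence_edges(instances, use_pmi=True)
```

The resulting weighted graph is then handed to the embedding library, which places frequently co-occurring types close to each other.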
To embed the target type representations into the different metric spaces, we make use of the Hype library (Nickel and Kiela, 2018), available at https://github.com/facebookresearch/poincareembeddings/. This library allows us to embed graphs into low-dimensional continuous spaces with different metrics, such as hyperbolic or Euclidean, ensuring that related objects are closer to each other in the space. The learned embeddings capture notions of both similarity, through the relative distance among embeddings, and hierarchy, through the distance to the origin, i.e., the norm. The projection of the hierarchy derived from WordNet is depicted in Figure 5.
5 Experiments
We perform experiments on the Ultra-Fine (Choi et al., 2018) and OntoNotes (Gillick et al., 2014) datasets to evaluate which kind of hierarchical information is better suited for entity typing, and under which geometry the hierarchy can be better exploited.
5.1 Setup
For evaluation, we run experiments on the Ultra-Fine dataset with our model projecting onto hyperbolic space, and compare to the same setting in Euclidean space. The type embeddings are created based on the following hierarchical structures derived from the dataset: the type vocabulary aligned to the WordNet hierarchy (WordNet), type co-occurrence frequency (freq), pointwise mutual information among types (pmi), and finally, the combination of WordNet’s transitive closure of each type with the co-occurrence frequency graph (WordNet + freq).
We compare our model to the multi-task model of Choi et al. (2018) trained on the open-source version of their dataset (MultiTask). The final type predictions consist of the closest neighbor from the coarse and fine projections, and the three closest neighbors from the ultra-fine projection. We report Loose Macro-averaged and Loose Micro-averaged F1 metrics, computed from the precision/recall scores over the same three granularities established by Choi et al. (2018). For all models we optimize Macro-averaged F1 on coarse types on the validation set, and evaluate on the test set. All experiments project onto a target space of 10 dimensions. The complete set of hyperparameters is detailed in the Appendix.
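The nearest-neighbor decoding described above can be sketched as follows; the embeddings and type names are toy values, and we use plain Euclidean distance for brevity, whereas the hyperbolic model ranks neighbors with the distance of Equation 1:

```python
import numpy as np

def nearest_types(projection, type_embeddings, type_names, k):
    """Return the k type labels whose embeddings are closest to the projected point."""
    dists = np.linalg.norm(type_embeddings - projection, axis=1)
    return [type_names[i] for i in np.argsort(dists)[:k]]

# hypothetical type space
names = ["person", "politician", "diplomat"]
embs = np.array([[0.1, 0.0], [0.5, 0.5], [0.8, 0.6]])

# one neighbor for the coarse and fine projections, three for the ultra-fine one
coarse_pred = nearest_types(np.array([0.0, 0.1]), embs, names, k=1)
ultra_pred = nearest_types(np.array([0.7, 0.6]), embs, names, k=3)
```

The final prediction for an instance is the union of the labels returned by the three projection heads.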
6 Results and Discussion
6.1 Comparison of the Hierarchies
Results on the test set are reported in Table 4. Comparing the different strategies to derive the hierarchies, we can see that freq and pmi substantially outperform MultiTask on the ultra-fine granularity, with relative improvements in both Macro F1 and Micro F1 for the hyperbolic model. Both hierarchies also show substantially better performance than the WordNet hierarchy on this granularity (in both MaF1 and MiF1 for pmi vs. WordNet with the hyperbolic model), indicating that these structures, created solely from the dataset statistics, better reflect the type distribution in the annotations. With freq and pmi, types that frequently co-occur in the training set are located closer to each other, improving the nearest-neighbor-based prediction.
All the hierarchies show very low performance on fine when compared to the MultiTask model. This reveals a weakness of our regression setup: on the test set there are 1,998 instances but only 1,318 fine labels as ground truth (see Table 1). By forcing a prediction at the fine level for all instances, precision decreases notably. We provide more details in Section 6.3.
The combined hierarchy WordNet + freq achieves marginal improvements on the coarse and fine granularities, while it degrades performance on ultra-fine when compared to freq.
By imposing a hierarchical structure over the type vocabulary, we can infer types that are located higher up in the hierarchy from the predictions of the lower ones. To analyze this, we add the closest coarse label to the ultra-fine prediction of each instance. Results are reported in Table (b). The improvements are noticeable in the Macro score (most prominently on freq), whereas Micro decreases. Since we are adding types to the prediction, this technique improves recall and penalizes precision. Macro is computed at the entity level, while Micro provides an overall score, showing that per instance the prediction tends to be better. The improvements can be observed on freq and pmi, given that their predictions for ultra-fine types are better.
6.2 Comparison of the Spaces
When comparing performance with respect to the metric spaces, the hyperbolic models for pmi and freq outperform all other models on the ultra-fine granularity. Compared to its Euclidean counterpart, the hyperbolic pmi model brings considerable improvements in both Macro and Micro F1. This can be explained by the exponential growth of this space towards the boundary of the ball, combined with a representation that reflects the type co-occurrences in the dataset. Figure 9 shows a histogram of the distribution of ground-truth types as closest neighbors of the prediction.
In both the Euclidean and hyperbolic models, the type embeddings for coarse and fine labels are located closer to the origin of the space. In this region, the two spaces behave much more similarly in terms of distance calculation, and this similarity is reflected in the results as well.
The low performance on coarse of the hyperbolic model with WordNet can be explained by the fact that entity is the root node of the hierarchy and is therefore located close to the center of the space. Elements placed in the vicinity of the origin have a norm close to zero, so their distances to other types tend to be shorter (they do not grow exponentially). This often misleads the model into assigning entity as the coarse type. See Table 5c for an example.
This issue is alleviated with WordNet + freq. Nevertheless, it appears again when using the ultra-fine prediction to infer the coarse label, as shown by the drop in performance in Table (b): both Macro F1 and Micro F1 decrease.
6.3 Error analysis
We perform an error analysis on samples from the development set and predictions from two of our proposed hyperbolic models. We show three examples in Table 5. Overall we can see that predictions are reasonable, suggesting synonyms or related words.
In the proposed regression setup, we predict a fixed number of labels per instance. This scheme has drawbacks, as shown in example a), where all types predicted by the freq model are correct but we cannot predict more, and in b), where we predict additional related types that are not part of the annotations.
In examples b) and c) we see how the freq model predicts the coarse type correctly, whereas the model that uses the WordNet hierarchy predicts group and entity, since these labels are considered more general (organization IS A group) and are thus located closer to the origin of the space.
To analyse precision and recall more accurately, we compare our model to the one of Shimaoka et al. (2016) (AttNER) and the multi-task model of Choi et al. (2018) (multi). We show the results for macro-averaged metrics in Table 6. Our model achieves higher recall but lower precision. Nonetheless, we are able to outperform AttNER with a regression model, even though they apply a classifier to the task.

Model   Dev               Test
        P     R     F1    P     R     F1
AttNER  53.7  15.0  23.5  54.2  15.2  23.7
freq    24.8  25.9  25.4  25.6  26.8  26.2
multi   48.1  23.2  31.3  47.1  24.2  32.0
6.4 Analysis Case: OntoNotes
Model  Sp  Coarse      Fine        Ultra
           Ma    Mi    Ma    Mi    Ma   Mi
Onto   Hy  83.0  81.9  24.0  23.9  2.0  2.0
       Eu  82.2  82.2  28.8  28.7  2.4  2.4
Freq   Hy  81.7  81.8  27.1  27.1  4.2  4.2
       Eu  81.7  81.7  30.6  30.6  3.8  3.8
To better understand the effects of the hierarchy and the metric spaces, we also perform an evaluation on OntoNotes (Gillick et al., 2014). We compare the original hierarchy of the dataset (Onto) and one derived from the type co-occurrence frequencies extracted from the data augmented by Choi et al. (2018) with this type inventory. The results for the three granularities are presented in Table 7.
The freq model with hyperbolic geometry achieves the best performance on the ultra-fine granularity, in accordance with the results on the Ultra-Fine dataset. In this case the improvements of the frequency-based hierarchy over the Onto model are less remarkable, given that the type inventory is much smaller and the annotations follow a hierarchy in which there is only one possible path from every label to its coarse type.
The low results on the ultra-fine granularity are due to the reduced multiplicity of the annotated types (see Table 10): most instances have only one or two types, setting very restrictive upper bounds for this setup.
7 Related Work
Type inventories for the task of fine-grained entity typing (Ling and Weld, 2012; Gillick et al., 2014; Yosef et al., 2012) have grown in size and complexity (Del Corro et al., 2015; Murty et al., 2017; Choi et al., 2018). Systems have tried to incorporate hierarchical information on the type distribution in different manners. Shimaoka et al. (2017) encode the hierarchy through a sparse matrix. Xu and Barbosa (2018) model the relations through a hierarchy-aware loss function. Ma et al. (2016) and Abhishek et al. (2017) learn embeddings for labels and feature representations in a joint space in order to facilitate information sharing between them. Our work resembles Xiong et al. (2019), who derive hierarchical information in an unrestricted fashion, through type co-occurrence statistics from the dataset. These models operate under Euclidean assumptions. Instead, we impose a hyperbolic geometry to enrich the hierarchical information.
Hyperbolic spaces have been applied mostly to the modeling of complex and social networks (Krioukov et al., 2010; Verbeek and Suri, 2016). In the field of Natural Language Processing, they have been employed to learn embeddings for question answering (Tay et al., 2018), in neural machine translation (Gulcehre et al., 2019), and to model language (Leimeister and Wilson, 2018; Tifrea et al., 2019). We build upon the work of Nickel and Kiela (2017) on modeling the hierarchical link structure of symbolic data, and adapt it with the parameterization method proposed by Dhingra et al. (2018) to cope with feature representations of text.

8 Conclusions
Incorporating hierarchical information from large type inventories into neural models has become critical for improving performance. In this work we analyze expert-generated and data-driven hierarchies, and the geometrical properties provided by the choice of vector space, in order to model this information. Experiments on two different datasets show consistent improvements of hyperbolic embeddings over Euclidean baselines on very fine-grained labels when the hierarchy reflects the annotated type distribution.
Acknowledgments
We would like to thank the anonymous reviewers for their valuable comments and suggestions, and we also thank Ana Marasović, Mareike Pfeil, Todor Mihaylov and MarkChristoph Müller for their helpful discussions. This work has been supported by the German Research Foundation (DFG) as part of the Research Training Group Adaptive Preparation of Information from Heterogeneous Sources (AIPHES) under grant No. GRK 1994/1 and the Klaus Tschira Foundation, Heidelberg, Germany.
References
 Abhishek et al. (2017) Abhishek Abhishek, Ashish Anand, and Amit Awekar. 2017. Finegrained entity type classification by jointly learning representations and label embeddings. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, pages 797–807, Valencia, Spain. Association for Computational Linguistics.
 Bonnabel (2011) Silvère Bonnabel. 2011. Stochastic gradient descent on Riemannian manifolds. IEEE Transactions on Automatic Control, 58.
 Chamberlain et al. (2017) Benjamin Paul Chamberlain, James Clough, and Marc Peter Deisenroth. 2017. Neural embeddings of graphs in hyperbolic space. CoRR, abs/1705.10359.
 Choi et al. (2018) Eunsol Choi, Omer Levy, Yejin Choi, and Luke Zettlemoyer. 2018. Ultra-fine entity typing. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 87–96, Melbourne, Australia. Association for Computational Linguistics.
 Del Corro et al. (2015) Luciano Del Corro, Abdalghani Abujabal, Rainer Gemulla, and Gerhard Weikum. 2015. FINET: Context-aware fine-grained named entity typing. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 868–878, Lisbon, Portugal. Association for Computational Linguistics.
 Dhingra et al. (2018) Bhuwan Dhingra, Christopher Shallue, Mohammad Norouzi, Andrew Dai, and George Dahl. 2018. Embedding text in hyperbolic spaces. In Proceedings of the Twelfth Workshop on GraphBased Methods for Natural Language Processing (TextGraphs12), pages 59–69, New Orleans, Louisiana, USA. Association for Computational Linguistics.
 Gillick et al. (2014) Dan Gillick, Nevena Lazic, Kuzman Ganchev, Jesse Kirchner, and David Huynh. 2014. Context-Dependent Fine-Grained Entity Type Tagging. ArXiv e-prints.
 Gulcehre et al. (2019) Caglar Gulcehre, Misha Denil, Mateusz Malinowski, Ali Razavi, Razvan Pascanu, Karl Moritz Hermann, Peter Battaglia, Victor Bapst, David Raposo, Adam Santoro, and Nando de Freitas. 2019. Hyperbolic attention networks. In International Conference on Learning Representations.
 Kingma and Ba (2014) Diederik P. Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. International Conference on Learning Representations, abs/1412.6980.
 Krioukov et al. (2010) Dmitri Krioukov, Fragkiskos Papadopoulos, Maksim Kitsak, Amin Vahdat, and Marián Boguñá. 2010. Hyperbolic geometry of complex networks. Physical review. E, Statistical, nonlinear, and soft matter physics, 82:036106.
 Leimeister and Wilson (2018) Matthias Leimeister and Benjamin J. Wilson. 2018. Skip-gram word embeddings in hyperbolic space. CoRR, abs/1809.01498.

 Ling and Weld (2012) Xiao Ling and Daniel S. Weld. 2012. Fine-grained entity recognition. In Proceedings of the Twenty-Sixth AAAI Conference on Artificial Intelligence, AAAI’12, pages 94–100. AAAI Press.
 Ma et al. (2016) Yukun Ma, Erik Cambria, and SA GAO. 2016. Label embedding for zero-shot fine-grained named entity typing. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, pages 171–180, Osaka, Japan.
 McCann et al. (2017) Bryan McCann, James Bradbury, Caiming Xiong, and Richard Socher. 2017. Learned in translation: Contextualized word vectors. In Advances in Neural Information Processing Systems, pages 6297–6308.
 Miller (1992) George A. Miller. 1992. WordNet: A lexical database for English. In Speech and Natural Language: Proceedings of a Workshop Held at Harriman, New York, February 23-26, 1992.
 Murty et al. (2017) Shikhar Murty, Patrick Verga, Luke Vilnis, and Andrew McCallum. 2017. Finer grained entity typing with TypeNet. In 31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.
 Murty et al. (2018) Shikhar Murty, Patrick Verga, Luke Vilnis, Irena Radovanovic, and Andrew McCallum. 2018. Hierarchical losses and new resources for fine-grained entity typing and linking. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 97–109, Melbourne, Australia. Association for Computational Linguistics.
 Nickel and Kiela (2017) Maximilian Nickel and Douwe Kiela. 2017. Poincaré embeddings for learning hierarchical representations. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 6341–6350. Curran Associates, Inc.

 Nickel and Kiela (2018) Maximillian Nickel and Douwe Kiela. 2018. Learning continuous hierarchies in the Lorentz model of hyperbolic geometry. In Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 3779–3788, Stockholmsmässan, Stockholm Sweden. PMLR.
 Pennington et al. (2014) Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543, Doha, Qatar. Association for Computational Linguistics.
 Rau (1991) Lisa F. Rau. 1991. Extracting company names from text. In Proceedings of the Seventh IEEE Conference on Artificial Intelligence Applications, volume 1, pages 29–32. IEEE.
 Ren et al. (2016) Xiang Ren, Wenqi He, Meng Qu, Clare R. Voss, Heng Ji, and Jiawei Han. 2016. Label noise reduction in entity typing by heterogeneous partial-label embedding. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '16, pages 1825–1834, New York, NY, USA. ACM.
 Sala et al. (2018) Frederic Sala, Chris De Sa, Albert Gu, and Christopher Re. 2018. Representation tradeoffs for hyperbolic embeddings. In Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 4460–4469, Stockholmsmässan, Stockholm, Sweden. PMLR.
 Sanh et al. (2019) Victor Sanh, Thomas Wolf, and Sebastian Ruder. 2019. A hierarchical multi-task approach for learning embeddings from semantic tasks. In AAAI.
 Shimaoka et al. (2016) Sonse Shimaoka, Pontus Stenetorp, Kentaro Inui, and Sebastian Riedel. 2016. An attentive neural architecture for fine-grained entity type classification. In Proceedings of the 5th Workshop on Automated Knowledge Base Construction, pages 69–74, San Diego, CA. Association for Computational Linguistics.
 Shimaoka et al. (2017) Sonse Shimaoka, Pontus Stenetorp, Kentaro Inui, and Sebastian Riedel. 2017. Neural architectures for fine-grained entity type classification. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, pages 1271–1280, Valencia, Spain. Association for Computational Linguistics.
 Tay et al. (2018) Yi Tay, Luu Anh Tuan, and Siu Cheung Hui. 2018. Hyperbolic representation learning for fast and efficient neural question answering. In Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining, WSDM ’18, pages 583–591, New York, NY, USA. ACM.
 Tifrea et al. (2019) Alexandru Tifrea, Gary Bécigneul, and Octavian-Eugen Ganea. 2019. Poincaré GloVe: Hyperbolic word embeddings. In International Conference on Learning Representations.
 Tjong Kim Sang and De Meulder (2003) Erik F. Tjong Kim Sang and Fien De Meulder. 2003. Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition. In Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003 - Volume 4, CONLL '03, pages 142–147, Stroudsburg, PA, USA. Association for Computational Linguistics.
 Verbeek and Suri (2016) Kevin Verbeek and Subhash Suri. 2016. Metric embedding, hyperbolic space, and social networks. Computational Geometry, 59:1–12.
 Xiong et al. (2019) Wenhan Xiong, Jiawei Wu, Deren Lei, Mo Yu, Shiyu Chang, Xiaoxiao Guo, and William Yang Wang. 2019. Imposing label-relational inductive bias for extremely fine-grained entity typing. In Proceedings of NAACL-HLT 2019.
 Xu and Barbosa (2018) Peng Xu and Denilson Barbosa. 2018. Neural fine-grained entity type classification with hierarchy-aware loss. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 16–25, New Orleans, Louisiana. Association for Computational Linguistics.
 Yaghoobzadeh et al. (2017) Yadollah Yaghoobzadeh, Heike Adel, and Hinrich Schütze. 2017. Noise mitigation for neural entity typing and relation extraction. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, pages 1183–1194, Valencia, Spain. Association for Computational Linguistics.
 Yavuz et al. (2016) Semih Yavuz, Izzeddin Gur, Yu Su, Mudhakar Srivatsa, and Xifeng Yan. 2016. Improving semantic parsing via answer type inference. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 149–159, Austin, Texas. Association for Computational Linguistics.
 Yogatama et al. (2015) Dani Yogatama, Daniel Gillick, and Nevena Lazic. 2015. Embedding methods for fine grained entity type classification. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), pages 291–296, Beijing, China. Association for Computational Linguistics.
 Yosef et al. (2012) Mohamed Amir Yosef, Sandro Bauer, Johannes Hoffart, Marc Spaniol, and Gerhard Weikum. 2012. HYENA: Hierarchical type classification for entity names. In Proceedings of COLING 2012: Posters, pages 1361–1370, Mumbai, India.
 Zhou et al. (2018) Ben Zhou, Daniel Khashabi, Chen-Tse Tsai, and Dan Roth. 2018. Zero-shot open entity typing as type-compatible grounding. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2065–2076, Brussels, Belgium. Association for Computational Linguistics.
Appendix A
A.1 Hyperparameters
Both the hyperbolic and the Euclidean models were trained with the following hyperparameters:
Parameter  Value 

Word embedding dim  300 
Max mention tokens  5 
Max mention chars  25 
Context length (per side)  10 
Char embedding dim  50 
Position embedding dim  25 
Context LSTM dim  200 
Attention dim  100 
Mention dropout  0.5 
Context dropout  0.2 
Max gradient norm  10 
Projection hidden dim  500 
Optimizer  Adam 
Learning rate  0.001 
Batch size  1024 
Epochs  50 
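For reference, the table above can be collected into a single configuration object. This is an illustrative sketch only: the key names and the dict-based layout are our own and are not taken from the original training code.

```python
# Hyperparameters from Table A.1, gathered into one config dict.
# Key names are illustrative; the original scripts may organize these differently.
HPARAMS = {
    "word_embedding_dim": 300,     # pretrained word vectors
    "max_mention_tokens": 5,       # mention truncated to 5 tokens
    "max_mention_chars": 25,       # character-level mention length
    "context_length_per_side": 10, # tokens of context left and right
    "char_embedding_dim": 50,
    "position_embedding_dim": 25,
    "context_lstm_dim": 200,
    "attention_dim": 100,
    "mention_dropout": 0.5,
    "context_dropout": 0.2,
    "max_gradient_norm": 10,       # gradients clipped to this norm
    "projection_hidden_dim": 500,
    "optimizer": "adam",
    "learning_rate": 0.001,
    "batch_size": 1024,
    "epochs": 50,
}
```

Since the same values are used for both the hyperbolic and the Euclidean variants, a single shared config like this suffices for reproducing either setup.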
A.2 Dataset statistics
Split  Samples  Coarse  Fine  Ultra-fine

Train  6,240,105  2,148,669  2,664,933  3,368,607
Dev  1,998  1,612  947  1,860
Test  1,998  1,598  964  1,864

Split  Samples  Coarse  Fine  Ultra-fine

Train  793,487  828,840  735,162  301,006
Dev  2,202  2,337  869  76
Test  8,963  9,455  3,521  417