Fine-Grained Entity Typing in Hyperbolic Space

06/06/2019 · Federico López et al.

How can we represent hierarchical information present in large type inventories for entity typing? We study the ability of hyperbolic embeddings to capture hierarchical relations between mentions in context and their target types in a shared vector space. We evaluate on two datasets and investigate two different techniques for creating a large hierarchical entity type inventory: from an expert-generated ontology and by automatically mining type co-occurrences. We find that the hyperbolic model yields improvements over its Euclidean counterpart in some, but not all cases. Our analysis suggests that the adequacy of this geometry depends on the granularity of the type inventory and the way hierarchical relations are inferred.




1 Introduction

Entity typing classifies textual mentions of entities according to their semantic class. The task has progressed from finding company names Rau (1991), to recognizing coarse classes (person, location, organization, and other; Tjong Kim Sang and De Meulder, 2003), to fine-grained inventories of about one hundred types, with finer-grained types proving beneficial in applications such as relation extraction Yaghoobzadeh et al. (2017) and question answering Yavuz et al. (2016). The trend towards larger inventories has culminated in ultra-fine and open entity typing with thousands of classes Choi et al. (2018); Zhou et al. (2018).

Figure 2: Examples of annotations and hierarchical type inventory with co-occurrence frequencies.

However, large type inventories pose a challenge for the common approach of casting entity typing as a multi-label classification task Yogatama et al. (2015); Shimaoka et al. (2016), since exploiting inter-type correlations becomes more difficult as the number of types increases. A natural solution for dealing with a large number of types is to organize them in a hierarchy, ranging from general, coarse types such as "person" near the top, to more specific, fine types such as "politician" in the middle, to even more specific, ultra-fine entity types such as "diplomat" at the bottom (see Figure 2). By virtue of such a hierarchy, a model learning about diplomats will be able to transfer this knowledge to related entities such as politicians.

Prior work integrated hierarchical entity type information by formulating a hierarchy-aware loss Ren et al. (2016); Murty et al. (2018); Xu and Barbosa (2018) or by representing words and types in a joint Euclidean embedding space Shimaoka et al. (2017); Abhishek et al. (2017). Noting that it is impossible to embed arbitrary hierarchies in Euclidean space, Nickel and Kiela (2017) propose hyperbolic space as an alternative and show that hyperbolic embeddings accurately encode hierarchical information. Intuitively (and as explained in more detail in Section 2), this is because distances in hyperbolic space grow exponentially as one moves away from the origin, just like the number of elements in a hierarchy grows exponentially with its depth.

While the intrinsic advantages of hyperbolic embeddings are well-established, their usefulness in downstream tasks is, so far, less clear. We believe this is due to two difficulties: First, incorporating hyperbolic embeddings into a neural model is non-trivial since training involves optimization in hyperbolic space. Second, it is often not clear what the best hierarchy for the task at hand is.

In this work, we address both of these issues. Using ultra-fine grained entity typing Choi et al. (2018) as a test bed, we first show how to incorporate hyperbolic embeddings into a neural model (Section 3). Then, we examine the impact of the hierarchy, comparing hyperbolic embeddings of an expert-generated ontology to those of a large, automatically-generated one (Section 4). As our experiments on two different datasets show (Section 5), hyperbolic embeddings improve entity typing in some but not all cases, suggesting that their usefulness depends both on the type inventory and its hierarchy. In summary, we make the following contributions:

  1. We develop a fine-grained entity typing model that embeds both entity types and entity mentions in hyperbolic space.

  2. We compare two different entity type hierarchies, one created by experts (WordNet) and one generated automatically, and find that their adequacy depends on the dataset.

  3. We study the impact of replacing the Euclidean geometry with its hyperbolic counterpart in an entity typing model, finding that the improvements of the hyperbolic model are noticeable on ultra-fine types.

2 Background: Poincaré Embeddings

Figure 5: Type inventory of the Ultra-Fine dataset aligned to the WordNet noun hierarchy and projected on two dimensions in different spaces: (a) Euclidean space; (b) hyperbolic space.

Figure 8: Overview of the proposed model to predict types of a mention within its context: (a) projection layers; (b) incorporation of hierarchical information.

Hyperbolic geometry studies non-Euclidean spaces of constant negative curvature. Two-dimensional hyperbolic space can be modelled as the open unit disk, the so-called Poincaré disk, in which the unit circle represents infinity; i.e., as a point approaches infinity in hyperbolic space, its norm approaches one in the Poincaré disk model. In the general $n$-dimensional case, the disk model becomes the Poincaré ball $\mathbb{B}^n = \{x \in \mathbb{R}^n : \|x\| < 1\}$ Chamberlain et al. (2017), where $\|\cdot\|$ denotes the Euclidean norm. In the Poincaré model the distance between two points $u, v \in \mathbb{B}^n$ is given by:

$$d(u, v) = \operatorname{arcosh}\left(1 + 2\,\frac{\|u - v\|^2}{(1 - \|u\|^2)(1 - \|v\|^2)}\right) \qquad (1)$$

If we consider the origin $O$ and two points $u$ and $v$ moving towards the outside of the disk, i.e. $\|u\|, \|v\| \to 1$, the distance $d(u, v)$ tends to $d(u, O) + d(O, v)$. That is, the path between $u$ and $v$ converges to a path through the origin. This behaviour can be seen as the continuous analogue of a (discrete) tree-like hierarchical structure, where the shortest path between two sibling nodes passes through their common ancestor.

As an alternative intuition, note that the hyperbolic distance between points grows exponentially as points move away from the center. This mirrors the exponential growth of the number of nodes in trees with increasing depths, thus making hyperbolic space a natural fit for representing trees and hence hierarchies Krioukov et al. (2010); Nickel and Kiela (2017).
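As a concrete illustration, the Poincaré distance of Equation 1 can be computed directly; a minimal NumPy sketch (function name ours):

```python
import numpy as np

def poincare_distance(u, v):
    """Distance between two points of the Poincare ball (Equation 1)."""
    sq_diff = np.sum((u - v) ** 2)
    denom = (1.0 - np.sum(u ** 2)) * (1.0 - np.sum(v ** 2))
    return np.arccosh(1.0 + 2.0 * sq_diff / denom)

origin = np.zeros(2)
u = np.array([0.95, 0.0])   # near the boundary
v = np.array([0.0, 0.95])   # near the boundary, in a "sibling" direction
# The direct distance between two near-boundary points is close to the
# length of the path through the origin, mirroring tree geodesics:
print(poincare_distance(u, v))
print(poincare_distance(u, origin) + poincare_distance(origin, v))
```

Note that moving either point closer to the unit circle blows up the distance, even if the Euclidean gap between them stays fixed.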

By embedding hierarchies in the Poincaré ball so that items near the top of the hierarchy are placed near the origin and lower items near infinity (intuitively, embedding the “vertical” structure), and so that items sharing a parent in the hierarchy are close to each other (embedding the “horizontal” structure), we obtain Poincaré embeddings Nickel and Kiela (2017). More formally, this means that embedding norm represents depth in the hierarchy, and distance between embeddings the similarity of the respective items.

Figure 5 shows the results of embedding the WordNet noun hierarchy in two-dimensional Euclidean space (left) and the Poincaré disk (right). In the hyperbolic model, the types tend to be located near the boundary of the disk. In this region the space grows exponentially, which allows related types to be placed near one another and far from unrelated ones. The actual distance in this model is not the one visualized in the figure but the one given by Equation 1.

3 Entity Typing in Hyperbolic Space

3.1 Task Definition

The task we consider is: given a context sentence $s$ containing an entity mention $m$, predict the correct type labels $t \subseteq T$ that describe $m$, where the type inventory $T$ includes more than 10,000 types Choi et al. (2018). The mention can be a named entity, a nominal, or a pronoun. The ground-truth type set may contain multiple types, making the task a multi-label classification problem.

3.2 Objective

We aim to analyze the effects of hyperbolic and Euclidean spaces when modeling hierarchical information present in the type inventory, for the task of fine-grained entity typing. Since hyperbolic geometry is naturally equipped to model hierarchical structures, we hypothesize that this enhanced representation will result in superior performance. With the goal of examining the relation between the metric space and the hierarchy, we propose a regression model. We learn a function that maps feature representations of a mention and its context onto a vector space such that the instances are embedded closer to their target types.

The ground-truth type set contains a varying number of types per instance. In our regression setup, however, we aim to predict a fixed number of labels for all instances. This imposes a strong upper bound on the performance of our proposed model. Nonetheless, as the strict accuracy of state-of-the-art methods on the Ultra-Fine dataset is below 40% Choi et al. (2018); Xiong et al. (2019), the evaluation we perform is still informative in qualitative terms, and enables us to gain better intuitions with regard to embedding hierarchical structures in different metric spaces.

3.3 Method

Given the encoded feature representation, noted as $x$, of a mention $m$ and its context $c$, our goal is to learn a mapping function $f: x \mapsto V$, where $V$ is the target vector space. We intend to approximate the embeddings of the type labels, previously projected into the space $V$. Subsequently, we perform a search for the nearest type embeddings of the projected representation $f(x)$ in order to assign the categorical labels corresponding to the mention within that context. Figure 8 presents an overview of the model.
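The nearest-neighbour decoding step can be sketched as follows (the function names and the toy 2-d type embeddings are ours, for illustration only):

```python
import numpy as np

def poincare_distance(u, v):
    """Hyperbolic distance between two points inside the unit ball."""
    sq_diff = np.sum((u - v) ** 2)
    denom = (1.0 - np.sum(u ** 2)) * (1.0 - np.sum(v ** 2))
    return np.arccosh(1.0 + 2.0 * sq_diff / denom)

def nearest_types(projected, type_embeddings, type_labels, k=1):
    """Return the k labels whose type embeddings lie closest (in
    hyperbolic distance) to the projected mention representation."""
    dists = [poincare_distance(projected, e) for e in type_embeddings]
    order = np.argsort(dists)[:k]
    return [type_labels[i] for i in order]

# Toy inventory: coarser types near the origin, finer ones near the boundary.
types = ["person", "politician", "diplomat"]
embs = [np.array([0.1, 0.0]), np.array([0.5, 0.1]), np.array([0.8, 0.2])]
print(nearest_types(np.array([0.45, 0.05]), embs, types, k=1))  # ['politician']
```

In practice the search runs over the full projected type inventory rather than three hand-placed points.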

The label distribution of the dataset is diverse and fine-grained. Each instance is annotated with three levels of granularity, namely coarse, fine and ultra-fine, and on the development and test sets there are, on average, five labels per item. This poses a challenging problem for learning and predicting with only one projection. As a solution, we propose three different projection functions, one per granularity, each of them fine-tuned to predict labels of that specific granularity.

We hypothesize that the complexity of the projection increases as the granularity becomes finer, given that the target label space grows with each granularity. Inspired by Sanh et al. (2019), we arrange the three projections in a hierarchical manner that reflects these difficulties: the coarse projection task is set at the bottom layer of the model, and more complex (finer) interactions at higher layers. With the projected embedding of each layer, we aim to introduce an inductive bias into the next projection that helps to guide it to the correct region of the space. Nevertheless, we use shortcut connections so that the top layers retain access to the encoder representation.
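A minimal sketch of this layered arrangement (the dimensions, the toy linear layers standing in for the projections, and the concatenation-based shortcut are our assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)

def layer(in_dim, out_dim):
    """A toy nonlinear layer standing in for each projection function."""
    W = rng.standard_normal((out_dim, in_dim)) * 0.1
    return lambda x: np.tanh(W @ x)

enc_dim, emb_dim = 16, 10
f_coarse = layer(enc_dim, emb_dim)
f_fine = layer(enc_dim + emb_dim, emb_dim)   # shortcut to encoder + coarse output
f_ultra = layer(enc_dim + emb_dim, emb_dim)  # shortcut to encoder + fine output

x = rng.standard_normal(enc_dim)                 # encoder representation
e_coarse = f_coarse(x)
e_fine = f_fine(np.concatenate([x, e_coarse]))   # coarse output biases fine
e_ultra = f_ultra(np.concatenate([x, e_fine]))   # fine output biases ultra-fine
print(e_coarse.shape, e_fine.shape, e_ultra.shape)  # (10,) (10,) (10,)
```

The shortcut concatenation is one simple way to give each finer projection both the raw encoder features and the coarser prediction as input.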

3.4 Mention and Context Representations

To encode the context containing the mention, we apply the encoder scheme of Choi et al. (2018), which is based on Shimaoka et al. (2016). We replace the location embedding of the original encoder with a word position embedding that reflects the relative distance between the $i$-th word and the entity mention. This modification biases the attention layer to focus less on the mention and more on the context. Finally, we apply a standard Bi-LSTM and a self-attentive encoder McCann et al. (2017) on top to obtain the context representation $c$.

For the mention representation we derive features from a character-level CNN, concatenate them with the GloVe word embeddings Pennington et al. (2014) of the mention, and combine them with a similar self-attentive encoder. The mention representation is denoted as $m$. The final representation $x$ is the concatenation of mention and context representations.

3.5 Projecting into the Ball

To learn a projection function that embeds our feature representation $x$ in the target space, we apply a variation of the re-parameterization technique introduced in Dhingra et al. (2018). The re-parameterization involves computing a direction vector $v$ and a norm magnitude $p$ from $x$ as follows:

$$\bar{v} = \varphi_{\mathrm{dir}}(x), \qquad v = \frac{\bar{v}}{\|\bar{v}\|}, \qquad p = \sigma(\varphi_{\mathrm{norm}}(x))$$

where $\varphi_{\mathrm{dir}}$ and $\varphi_{\mathrm{norm}}$ can be arbitrary functions whose parameters are optimized during training, and $\sigma$ is the sigmoid function, which ensures that the resulting norm $p \in (0, 1)$. The re-parameterized embedding is defined as $e = p\,v$, which lies in the unit ball.

By making use of this simple technique, the embeddings are guaranteed to lie in the Poincaré ball. This avoids the need to correct the gradients or to resort to Riemannian SGD Bonnabel (2011). Instead, it allows the use of any optimization method common in deep learning, such as Adam Kingma and Ba (2014).

We parameterize the direction function $\varphi_{\mathrm{dir}}$ as a multi-layer perceptron (MLP) with a single hidden layer, using rectified linear units (ReLU) as nonlinearity, and dropout. We do not apply the ReLU function after the output layer, in order to allow negative values as components of the direction vector. For the norm magnitude function $\varphi_{\mathrm{norm}}$ we use a single linear layer.
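A minimal sketch of the re-parameterization, with plain linear maps standing in for the MLP and the linear layer described above (all names ours):

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def project_into_ball(r, W_dir, w_norm):
    """Re-parameterization: a unit direction vector scaled by a magnitude
    in (0, 1), so the result is guaranteed to lie inside the unit ball."""
    v = W_dir @ r                  # direction (an MLP in the full model)
    v = v / np.linalg.norm(v)      # normalize to unit length
    p = sigmoid(w_norm @ r)        # scalar norm in (0, 1)
    return p * v                   # embedding with norm < 1

r = rng.standard_normal(8)             # toy encoder output
W_dir = rng.standard_normal((10, 8))   # stands in for the direction MLP
w_norm = rng.standard_normal(8)        # stands in for the norm linear layer
e = project_into_ball(r, W_dir, w_norm)
print(np.linalg.norm(e) < 1.0)  # True
```

Because the sigmoid output is strictly below one, no gradient correction or projection step back onto the ball is ever needed during training.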

3.6 Optimization of the Model

We aim to find projection functions that embed the instance representations closer to their respective target types in a given vector space $V$. As target space we use the Poincaré ball and compare it with the Euclidean unit ball. Both are metric spaces and are therefore equipped with a distance function, namely the hyperbolic distance defined in Equation 1 and the Euclidean distance, respectively, which we intend to minimize. Moreover, since the Poincaré model is a conformal model of hyperbolic space, i.e. the angles between Euclidean and hyperbolic vectors are equal, the cosine distance can be used as well.

We propose to minimize a combination of the distance defined by each metric space and the cosine distance to approximate the embeddings. Although formally this combination is not a distance metric, since it does not satisfy the triangle inequality, it provides a very strong signal for approximating the target embeddings, accounting for the main concepts modeled in the representation: relatedness, captured via the distance and the orientation in the space, and generality, via the norm of the embeddings.

To mitigate the instability in the derivative of the hyperbolic distance, we follow the approach proposed in De Sa et al. (2018) and minimize the square of the distance, which does have a continuous derivative in the ball. Thus, in the Poincaré model, for two points $x, y$ we minimize:

$$\mathcal{L}_{hy}(x, y) = d_{hy}(x, y)^2 + \beta_1\, d_{cos}(x, y)$$

whereas in Euclidean space we minimize:

$$\mathcal{L}_{eu}(x, y) = d_{eu}(x, y)^2 + \beta_2\, d_{cos}(x, y)$$

The hyperparameters $\beta_1$ and $\beta_2$ are added to compensate for the bounded image of the cosine distance function in $[0, 2]$.
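A sketch of this combined objective (the additive weighting with a single beta per space is our reading of the text; the exact form and names are assumptions):

```python
import numpy as np

def cosine_distance(x, y):
    return 1.0 - np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

def poincare_sq_distance(x, y):
    """Squared hyperbolic distance: smooth derivative inside the ball."""
    sq_diff = np.sum((x - y) ** 2)
    denom = (1.0 - np.sum(x ** 2)) * (1.0 - np.sum(y ** 2))
    return np.arccosh(1.0 + 2.0 * sq_diff / denom) ** 2

def hyperbolic_loss(x, y, beta=1.0):
    # distance captures relatedness and generality; cosine captures orientation
    return poincare_sq_distance(x, y) + beta * cosine_distance(x, y)

def euclidean_loss(x, y, beta=1.0):
    return np.sum((x - y) ** 2) + beta * cosine_distance(x, y)

x = np.array([0.30, 0.10])   # projected instance
y = np.array([0.31, 0.12])   # nearby target type embedding
print(hyperbolic_loss(x, y) < hyperbolic_loss(x, -y))  # True: closer target, lower loss
```

Squaring the hyperbolic distance before adding the cosine term is what removes the derivative instability near zero distance.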

4 Hierarchical Type Inventories

In this section, we investigate two methods for deriving a hierarchical structure for a given type inventory. First, we introduce the datasets on which we perform our study since we exploit some of their characteristics to construct a hierarchy.

4.1 Data

Split Coarse Fine Ultra-fine
Train 2,416,593 4,146,143 3,997,318
Dev 1,918 1,289 7,594
Test 1,904 1,318 7,511
Table 1: Type instances in the dataset grouped by split and granularity.

We focus our analysis on the Ultra-Fine entity typing dataset introduced by Choi et al. (2018). Its design goal was to increase the diversity and coverage of entity type annotations. It contains 10,331 target types defined as free-form noun phrases and divided into three levels of granularity: coarse, fine and ultra-fine. The data consist of 6,000 crowdsourced examples and approximately 6M training samples in the open-source version (Choi et al. (2018) use the licensed Gigaword corpus to build part of the dataset, resulting in about 25.2M training samples), automatically extracted with distant supervision via entity linking and nominal head-word extraction. Our evaluation is done on the original crowdsourced dev/test splits.

To gain a better understanding of the proposed model under different geometries, we also experiment on the OntoNotes dataset Gillick et al. (2014) as it is a standard benchmark for entity typing.

4.2 Deriving the Hierarchies

The two methods we analyze to derive a hierarchical structure from the type inventory are the following.

Knowledge base alignment: Hierarchical information can be provided explicitly, by aligning the type labels to a knowledge base schema. In this case the types follow the tree-like structure of the ontology curated by experts. On the Ultra-Fine dataset, the type vocabulary (i.e. noun phrases) is extracted from WordNet Miller (1992). Nouns in WordNet are organized into a deep hierarchy, defined by hypernym or “IS A” relationships. By aligning the type labels to the hypernym structure existing in WordNet, we obtain a type hierarchy. In this case, all paths lead to the root type entity. In the OntoNotes dataset the annotations follow a pre-established, much smaller, hierarchical taxonomy based on “IS A” relations, as well.

Type co-occurrences: Although in practical scenarios hierarchical information may not always be available, the distribution of types has an implicit hierarchy that can be inferred automatically. If we model the ground-truth labels as nodes of a graph, its adjacency matrix can be weighted by the co-occurrences in each instance. That is, if two types are annotated as true for the same training instance, we add an edge between them. To weigh the edge we explore two variants: the frequency of observed instances where this co-relation holds, and the pointwise mutual information (PMI) as a measure of the association between the two types (we adapt PMI in order to satisfy the condition of non-negativity). By mining type co-occurrences present in the dataset as an affinity score, the hierarchy can be inferred. This method alleviates the need for a type inventory explicitly aligned to an ontology or pre-defined label correlations.
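Mining these edge weights from annotated instances can be sketched as follows (clipping PMI at zero is one possible way to enforce non-negativity; names ours):

```python
import math
from collections import Counter
from itertools import combinations

def cooccurrence_weights(instances):
    """Edge weights for the type co-occurrence graph: raw frequency and a
    non-negative PMI variant (negative values clipped to zero)."""
    type_counts = Counter()
    pair_counts = Counter()
    for types in instances:
        type_counts.update(set(types))
        for a, b in combinations(sorted(set(types)), 2):
            pair_counts[(a, b)] += 1
    n = len(instances)
    freq, pmi = {}, {}
    for (a, b), c in pair_counts.items():
        freq[(a, b)] = c
        p_ab = c / n
        p_a, p_b = type_counts[a] / n, type_counts[b] / n
        pmi[(a, b)] = max(0.0, math.log(p_ab / (p_a * p_b)))
    return freq, pmi

data = [["person", "politician"],
        ["person", "politician", "diplomat"],
        ["event", "meeting"]]
freq, pmi = cooccurrence_weights(data)
print(freq[("person", "politician")])  # 2
```

The resulting weighted graph can then be fed directly to a hierarchy-embedding tool in place of an expert-curated ontology.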

To embed the target type representations into the different metric spaces we use the Hype library Nickel and Kiela (2018). This library allows us to embed graphs into low-dimensional continuous spaces under different metrics, such as hyperbolic or Euclidean, ensuring that related objects are placed closer to each other. The learned embeddings capture notions of both similarity, through the relative distances among embeddings, and hierarchy, through the distance to the origin, i.e. the norm. The projection of the hierarchy derived from WordNet is depicted in Figure 5.

5 Experiments

(a) Results on the same three granularities analyzed by Choi et al. (2018).
(b) Comparison to previous coarse results.
Table 4: Results on the test set for different hierarchies and spaces. The best results of our models are marked in bold. In (b) we report the effect of adding the closest coarse label to the ultra-fine prediction, compared with the coarse results in (a).

We perform experiments on the Ultra-Fine Choi et al. (2018) and OntoNotes Gillick et al. (2014) datasets to evaluate which kind of hierarchical information is better suited for entity typing, and under which geometry the hierarchy can be better exploited.

5.1 Setup

For evaluation we run experiments on the Ultra-Fine dataset with our model projecting onto the hyperbolic space, and compare to the same setting in Euclidean space. The type embeddings are created based on the following hierarchical structures derived from the dataset: the type vocabulary aligned to the WordNet hierarchy (WordNet), type co-occurrence frequency (freq), pointwise mutual information among types (pmi), and finally, the combination of WordNet’s transitive closure of each type with the co-occurrence frequency graph (WordNet + freq).

We compare our model to the multi-task model of Choi et al. (2018) trained on the open-source version of their dataset (MultiTask). The final type predictions consist of the closest neighbor from the coarse and fine projections, and the three closest neighbors from the ultra-fine projection. We report Loose Macro-averaged and Loose Micro-averaged F1 metrics computed from the precision/recall scores over the same three granularities established by Choi et al. (2018). For all models we optimize Macro-averaged F1 on coarse types on the validation set, and evaluate on the test set. All experiments project onto a target space of 10 dimensions. The complete set of hyperparameters is detailed in the Appendix.
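Loose macro averages precision and recall per instance before combining, while loose micro pools counts over all instances; a minimal sketch (function name ours):

```python
def loose_macro_micro_f1(golds, preds):
    """Loose macro: average per-instance P/R, then F1.
    Loose micro: pool intersection/prediction/gold counts, then F1."""
    p_sum = r_sum = 0.0
    tp = pred_total = gold_total = 0
    for g, p in zip(golds, preds):
        g, p = set(g), set(p)
        inter = len(g & p)
        p_sum += inter / len(p) if p else 0.0
        r_sum += inter / len(g) if g else 0.0
        tp += inter
        pred_total += len(p)
        gold_total += len(g)
    n = len(golds)
    f1 = lambda p, r: 2 * p * r / (p + r) if p + r else 0.0
    return f1(p_sum / n, r_sum / n), f1(tp / pred_total, tp / gold_total)

golds = [["person", "politician"], ["event"]]
preds = [["person"], ["event", "meeting"]]
ma, mi = loose_macro_micro_f1(golds, preds)
print(round(ma, 3), round(mi, 3))  # 0.75 0.667
```

The two metrics diverge when errors are unevenly distributed across instances, which is why both are reported.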

6 Results and Discussion

6.1 Comparison of the Hierarchies

a) Example: Rin, Kohaku and Sesshomaru — "Rin befriends Kohaku, the demonslayer Sango's younger brother, while Kohaku acts as her guard when Naraku is using her for bait to lure Sesshomaru into battle."
   Annotation: event, conflict, war, fight, battle, struggle, dispute, group_action
   Prediction (freq): event, conflict, war, fight, battle
   Prediction (WordNet): event, conflict, difference, engagement, assault

b) Example: "The UN mission in Afghanistan dispatched its own investigation, expressing concern about reports of civilian casualties and calling for them to be properly cared for."
   Annotation: organization, team, mission
   Prediction (freq): organization, team, mission, activity, operation
   Prediction (WordNet): group, institution, branch, delegation, corporation

c) Example: "Brazilian President Luiz Inacio Lula da Silva and Turkish Prime Minister Recep Tayyip Erdogan talked about designing a strategy different from sanctions at a meeting Monday, Amorim said."
   Annotation: event, meeting, conference, gathering, summit
   Prediction (freq): event, meeting, conference, film, campaign
   Prediction (WordNet): entity, meeting, gathering, structure, court

Table 5: Qualitative analysis of instances taken from the development set. The predictions are generated with the hyperbolic models of freq and WordNet. Correct predictions are marked in blue color.

Results on the test set are reported in Table 4. Comparing the different strategies for deriving the hierarchies, we see that freq and pmi substantially outperform MultiTask on the ultra-fine granularity in both Macro F1 and Micro F1 with the hyperbolic model. Both hierarchies also perform substantially better than the WordNet hierarchy on this granularity, indicating that these structures, created solely from the dataset statistics, better reflect the type distribution in the annotations. With freq and pmi, types that frequently co-occur in the training set are located closer to each other, improving nearest-neighbor-based prediction.

All the hierarchies show very low performance on fine when compared to the MultiTask model. This exposes a weakness of our regression setup: on the test set there are 1,998 instances but only 1,318 fine labels as ground truth (see Table 1). By forcing a prediction at the fine level for all instances, precision decreases notably. More details are given in Section 6.3.

The combined hierarchy WordNet + freq achieves marginal improvements on coarse and fine granularities, while it degrades the performance on ultra-fine when compared to freq.

By imposing a hierarchical structure over the type vocabulary, we can infer types located higher up in the hierarchy from the predictions of lower ones. To analyze this, we add the closest coarse label to the ultra-fine prediction of each instance. Results are reported in Table 4(b). The improvements are noticeable on the Macro score, most markedly on freq, whereas Micro decreases. Since we are adding types to the prediction, this technique improves recall and penalizes precision. Macro is computed on the entity level, while Micro provides an overall score, showing that per instance the prediction tends to be better. The improvements can be observed on freq and pmi given that their predictions over ultra-fine types are better.

6.2 Comparison of the Spaces

Figure 9: Histogram of ground-truth type neighbor positions for ultra-fine predictions in Hyperbolic and Euclidean spaces on the test set.

Comparing performance across the metric spaces, the hyperbolic models for pmi and freq outperform all other models on the ultra-fine granularity. Compared to its Euclidean counterpart, the hyperbolic pmi model brings considerable improvements in both Macro and Micro F1. This can be explained by the exponential growth of this space towards the boundary of the ball, combined with a representation that reflects the type co-occurrences in the dataset. Figure 9 shows a histogram of the distribution of ground-truth types as closest neighbors of the prediction.

On both Euclidean and hyperbolic models, the type embeddings for coarse and fine labels are located closer to the origin of the space. In this region, the spaces show a much more similar behavior in terms of the distance calculation, and this similarity is reflected on the results as well.

The low performance of the hyperbolic WordNet model on coarse can be explained by the fact that entity is the root node of the hierarchy and is therefore located close to the center of the space. Elements placed in the vicinity of the origin have a norm close to zero, so their distance to other types tends to be shorter (it does not grow exponentially). This often misleads the model into assigning entity as the coarse type. See Table 5c for an example.

This issue is alleviated with WordNet + freq. Nevertheless, it appears again when using the ultra-fine prediction to infer the coarse label: both Macro F1 and Micro F1 decrease, as shown in Table 4(b).

6.3 Error analysis

We perform an error analysis on samples from the development set and predictions from two of our proposed hyperbolic models. We show three examples in Table 5. Overall we can see that predictions are reasonable, suggesting synonyms or related words.

In the proposed regression setup, we predict a fixed number of labels per instance. This scheme has drawbacks, as shown in example a), where all types predicted by the freq model are correct but we cannot predict any more, and in example b), where we predict additional related types that are not part of the annotations.

In examples b) and c) we see how the freq model predicts the coarse type correctly whereas the model that uses the WordNet hierarchy predicts group and entity since these labels are considered more general (organization IS A group) thus located closer to the origin of the space.

To analyse precision and recall more accurately, we compare our model to the one of Shimaoka et al. (2016) (AttNER) and the multi-task model of Choi et al. (2018) (multi). We show the results for macro-averaged metrics in Table 6. Our model achieves higher recall but lower precision. Nonetheless, we outperform AttNER with a regression model even though they apply a classifier to the task.

Model     Dev P   Dev R   Dev F1   Test P   Test R   Test F1
AttNER    53.7    15.0    23.5     54.2     15.2     23.7
freq      24.8    25.9    25.4     25.6     26.8     26.2
multi     48.1    23.2    31.3     47.1     24.2     32.0
Table 6: Combined performance over the three granularities. AttNER and multi results are taken from Choi et al. (2018).

6.4 Analysis Case: OntoNotes

Hierarchy   Space        Coarse Ma / Mi   Fine Ma / Mi   Ultra Ma / Mi
Onto        Hyperbolic   83.0 / 81.9      24.0 / 23.9    2.0 / 2.0
Onto        Euclidean    82.2 / 82.2      28.8 / 28.7    2.4 / 2.4
Freq        Hyperbolic   81.7 / 81.8      27.1 / 27.1    4.2 / 4.2
Freq        Euclidean    81.7 / 81.7      30.6 / 30.6    3.8 / 3.8
Table 7: Macro and micro F1 results on OntoNotes.

To better understand the effects of the hierarchy and the metric spaces, we also perform an evaluation on OntoNotes Gillick et al. (2014). We compare the original hierarchy of the dataset (Onto) with one derived from the type co-occurrence frequencies extracted from the data augmented by Choi et al. (2018) with this type inventory. The results for the three granularities are presented in Table 7.

The freq model in hyperbolic geometry achieves the best performance for the ultra-fine granularity, in accordance with the results on the Ultra-Fine dataset. In this case, the improvements of the frequency-based hierarchy over the Onto model are less pronounced, given that the type inventory is much smaller and the annotations follow a hierarchy in which there is only one possible path from every label to its coarse type.

The low results on the ultra-fine granularity are due to the reduced multiplicity of the annotated types (see Table 10): most instances have only one or two types, setting very restrictive upper bounds for this setup.

7 Related Work

Type inventories for the task of fine-grained entity typing Ling and Weld (2012); Gillick et al. (2014); Yosef et al. (2012) have grown in size and complexity Del Corro et al. (2015); Murty et al. (2017); Choi et al. (2018). Systems have tried to incorporate hierarchical information about the type distribution in different manners. Shimaoka et al. (2017) encode the hierarchy through a sparse matrix. Xu and Barbosa (2018) model the relations through a hierarchy-aware loss function. Ma et al. (2016) and Abhishek et al. (2017) learn embeddings for labels and feature representations in a joint space in order to facilitate information sharing among them. Our work resembles Xiong et al. (2019) in that they also derive hierarchical information in an unrestricted fashion, through type co-occurrence statistics from the dataset. All these models operate under Euclidean assumptions; instead, we impose a hyperbolic geometry to enrich the hierarchical information.

Hyperbolic spaces have been applied mostly to the modeling of complex and social networks Krioukov et al. (2010); Verbeek and Suri (2016). In Natural Language Processing, they have been employed to learn embeddings for question answering Tay et al. (2018), in neural machine translation Gulcehre et al. (2019), and to model language Leimeister and Wilson (2018); Tifrea et al. (2019). We build upon the work of Nickel and Kiela (2017) on modeling the hierarchical link structure of symbolic data and adapt it with the parameterization method proposed by Dhingra et al. (2018) to cope with feature representations of text.

8 Conclusions

Incorporating hierarchical information from large type inventories into neural models has become critical for improving performance. In this work we analyze expert-generated and data-driven hierarchies, and the geometrical properties provided by the choice of the vector space, in order to model this information. Experiments on two different datasets show consistent improvements of hyperbolic embeddings over Euclidean baselines on very fine-grained labels when the hierarchy reflects the annotated type distribution.


We would like to thank the anonymous reviewers for their valuable comments and suggestions, and we also thank Ana Marasović, Mareike Pfeil, Todor Mihaylov and Mark-Christoph Müller for their helpful discussions. This work has been supported by the German Research Foundation (DFG) as part of the Research Training Group Adaptive Preparation of Information from Heterogeneous Sources (AIPHES) under grant No. GRK 1994/1 and the Klaus Tschira Foundation, Heidelberg, Germany.


Appendix A Appendix

a.1 Hyperparameters

Both hyperbolic and Euclidean models were trained with the following hyperparameters.

Parameter Value
Word embedding dim 300
Max mention tokens 5
Max mention chars 25
Context length (per side) 10
Char embedding dim 50
Position embedding dim 25
Context LSTM dim 200
Attention dim 100
Mention dropout 0.5
Context dropout 0.2
Max gradient norm 10
Projection hidden dim 500
Optimizer Adam
Learning rate 0.001
Batch size 1024
Epochs 50
Table 8: Model hyperparameters.

a.2 Dataset statistics

Split Samples Coarse Fine Ultra-fine
Train 6,240,105 2,148,669 2,664,933 3,368,607
Dev 1,998 1,612 947 1,860
Test 1,998 1,598 964 1,864
Table 9: Amount of samples with at least one label of the granularity organized by split on Ultra-Fine Dataset.
Split Samples Coarse Fine Ultra
Train 793,487 828,840 735,162 301,006
Dev 2,202 2,337 869 76
Test 8,963 9,455 3,521 417
Table 10: Samples and label distribution by split on OntoNotes Dataset.