BoningKnife: Joint Entity Mention Detection and Typing for Nested NER via prior Boundary Knowledge

07/20/2021 ∙ by Huiqiang Jiang, et al. ∙ Microsoft, Peking University

While named entity recognition (NER) is a key task in natural language processing, most approaches only target flat entities, ignoring nested structures, which are common in many scenarios. Most existing nested NER methods traverse all sub-sequences, which is both expensive and inefficient, and they make little use of boundary knowledge, which is significant for nested entities. In this paper, we propose a joint entity mention detection and typing model via prior boundary knowledge (BoningKnife) to better handle nested NER extraction and recognition tasks. BoningKnife consists of two modules, MentionTagger and TypeClassifier. MentionTagger leverages boundary knowledge beyond just entity start/end to improve the handling of nesting levels and longer spans, while generating high quality mention candidates. TypeClassifier utilizes a two-level attention mechanism to decouple different nested level representations and better distinguish entity types. We jointly train both modules, sharing a common representation and a new dual-info attention layer, which leads to an improved representation focus on entity-related information. Experiments over different datasets show that our approach outperforms previous state-of-the-art methods and achieves 86.41, 85.46, and 94.24 F1 scores on ACE2004, ACE2005, and NNE, respectively.




1 Introduction

Named Entity Recognition (NER) is a fundamental task in natural language processing (NLP), which aims to extract and recognize named entities, like person names, organizations, geopolitical entities, etc., in unstructured text. However, in addition to flat entity mentions, nested or overlapping entities are commonplace in natural language. Such nested entities bring richer entity knowledge and semantics and can be critical to facilitate various downstream NLP tasks and real-world applications. As an example of their frequency, nested entities account for 35.19%, 30.80%, and 66.14% of mentions in standard datasets like ACE2004 Doddington et al. (2004), ACE2005 Walker et al. (2006), and NNE Ringland et al. (2019), respectively.

Nonetheless, the standard method for classic NER treats the problem as a sequence labeling task, which has difficulty recognizing entities with nested structures directly Alex et al. (2007); Lu and Roth (2015); Katiyar and Cardie (2018). With that in mind, various approaches to recognizing nested entities have been proposed, from hyper-graph based methods Lu and Roth (2015); Wang and Lu (2018); Marinho et al. (2019), which design expressive tagging schemas, to span-based methods, which classify the categories of sub-sequences Sohrab and Miwa (2018); Luan et al. (2019); Xia et al. (2019); Fisher and Vlachos (2019). In order to improve output quality, most recent approaches to nested NER adopt structures that require the enumeration or heuristic traversal of all sub-sequences, which is inefficient, and they lack effective use of boundary information, which is very significant for nested entities. Recently, Zheng et al. (2019) and Tan et al. (2020) explored using boundary knowledge to enhance recognition of nested entities. But both focus only on entity start/end information, which faces limitations in handling long entity spans and the interaction of entity start/end(s), and lacks region information. Moreover, decoupling the different nested levels of entity information remains a big problem in the nested NER task. For example, the spans "the leader of the Hezbollah in Syrian-occupied Lebanon", "the Hezbollah in Syrian-occupied Lebanon", and "Syrian-occupied Lebanon" all share the same end token (see Fig 1). Shared contextual representations tend to focus on the outermost entity type (in the above case, PER).

Figure 1: An example of nested mentions in ACE2004.

In this paper, we propose a novel joint entity mention detection and typing model via prior boundary knowledge, BoningKnife, which can carve entity boundaries accurately and tease out type information more precisely. Our model consists of two main components, MentionTagger and TypeClassifier, which are jointly trained with a common encoder representation and a shared dual-info attention layer. MentionTagger performs mention detection by better leveraging boundary knowledge beyond just entity start/end to better handle nesting levels and longer spans. This improved representation of boundary knowledge both addresses limitations of previous systems and allows the generation of high quality mention candidates, which are critical for the overall system efficiency. TypeClassifier then utilizes a new two-level attention mechanism to decouple different nested level representations and better distinguish entity types. Moreover, the offshoots of MentionTagger entity token detection are further leveraged in the dual-info attention layer to improve joint training performance.

Experimental results on three datasets show that our approach achieves significant improvements over state-of-the-art methods across multiple nested NER datasets. Further analysis and case studies demonstrate the effectiveness of each component in the model and its different sub-task strategies of mention detection and typing attention layers. Moreover, our approach also achieves higher efficiency without drop in quality.

2 Related Work

Traditionally, most approaches formalize the NER task as a sequence labeling problem, which assigns a single label to each token in a sentence. Shen et al. (2003); Zhang et al. (2004); Zhou (2006) adopt bottom-up methods, which perform entity recognition from inner to outer mentions, following hand-crafted rules.

Lu and Roth (2015) introduce the idea of using a graph structure to connect tokens with multiple entities, while Muis and Lu (2018); Wang and Lu (2018); Katiyar and Cardie (2018); Wang and Lu (2019) propose hypergraphs and different methods to utilize graph information for nested NER.

Transition-based models, which assemble a shift-reduce structure to detect a nested entity, have also been proposed. Wang et al. (2018) build a forest structure based on shift-reduce parsing. Marinho et al. (2019) use a stack structure to construct the transition-shift-reduce model. Ju et al. (2018) propose a dynamically stacked multiple LSTM-CRF model to recognize entities in an inside-out manner until no outer entity is extracted.

Span-based methods are another class of methods to recognize nested entities by classifying sub-sequences Xia et al. (2019). Luan et al. (2019) propose a graph-based model which leverages entity linking to improve NER performance. Fisher and Vlachos (2019) introduce a merge-and-label method which uses nested entity hierarchy features. Strakova et al. (2019) view the nested NER task as a seq2seq generation problem, in which the input is a list of sentences and the output is a list of target entities. Shibuya and Hovy (2020) introduce an improved CRF model that recursively decodes entities from outside to inside.

However, most previous methods need to traverse all sub-sequences and lack boundary knowledge, which is significant for nested entities. To mitigate such shortcomings, Zheng et al. (2019) define a boundary detection task to generate a mention candidate set based on the entity start/end, followed by typing all mentions in the candidate set. Tan et al. (2020) split boundary information into two sub-tasks (entity start and entity end), before classifying candidates. While both works use boundary knowledge, they focus only on entity start/end, which does not fully represent boundary information and leads to issues such as not handling long spans well. BoningKnife jointly trains entity mention detection and typing modules and utilizes an extended representation of boundary knowledge to address such limitations.

3 Problem and Methodology

Figure 2: Framework of the proposed BoningKnife, including Encoder Layer, MentionTagger, and TypeClassifier.

In this section, we define the nested NER task, and then elaborate on our proposed solution. Fig 2 illustrates the framework of the proposed BoningKnife. Specifically, we jointly train a tagger and a classifier, where the former (MentionTagger) extracts potential mention spans and generates mention candidates by leveraging an improved representation of boundary knowledge, and the latter (TypeClassifier) classifies mention candidates into predefined entity types.

3.1 Problem Statement

Let $X \in \mathcal{X}$ denote sentence data and $Y \in \mathcal{Y}$ denote entity label data, where $\mathcal{X}$ and $\mathcal{Y}$ are the vector spaces of sentences and labels. Given a sentence $x = \{w_1, w_2, \dots, w_n\}$, where $n$ is the length of the sentence and $w_i$ represents the $i$-th token of $x$, the objective of the NER task is to extract all semantic elements $(s, e, t)$, where $s$ and $e$ are the element start/end indices, $t \in \mathcal{E}$ is the predefined label corresponding to the element, and $\mathcal{E}$ is the entity space.

Essentially, nested NER aims to learn a space representation $P \in \mathbb{R}^{n \times n \times c}$ (only for the upper triangular matrix), where $c$ is the number of entity categories plus non-entity, and each value in the matrix represents the span type probability $P(t, (s, e) \mid x)$. We decompose the target probability into the product of two conditional probabilities (detection and typing) with a latent parameter (span):

$$P\big(t, (s, e) \mid x\big) = P\big((s, e) \mid x\big)\, P\big(t \mid (s, e), x\big) \tag{1}$$

In this formula, we discard the terms with a small span probability $P((s, e) \mid x)$ in order to reduce the amount of computation.
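The effect of discarding low-probability spans can be illustrated with a minimal Python sketch. The function names, the toy detection scores, and the threshold of 0.5 are illustrative assumptions rather than values from the paper; the sketch only shows how the upper-triangular span space shrinks once a detection threshold is applied before typing.

```python
from typing import Dict, List, Tuple

Span = Tuple[int, int]  # (start, end), inclusive token indices


def all_spans(n: int) -> List[Span]:
    """Enumerate the upper-triangular span space of an n-token sentence."""
    return [(s, e) for s in range(n) for e in range(s, n)]


def prune(span_prob: Dict[Span, float], threshold: float) -> List[Span]:
    """Keep only the spans whose detection probability reaches the threshold."""
    return [span for span, p in span_prob.items() if p >= threshold]


def joint_prob(p_span: float, p_type_given_span: float) -> float:
    """Detection probability times typing probability for a single span."""
    return p_span * p_type_given_span


n = 4
spans = all_spans(n)                    # n * (n + 1) / 2 spans in total
toy_prob = {span: 0.05 for span in spans}  # hypothetical detection scores
toy_prob[(0, 0)] = 0.9
toy_prob[(1, 3)] = 0.8
candidates = prune(toy_prob, threshold=0.5)
```

Only the spans kept in `candidates` would be passed on to typing, so the cost of the second step depends on the number of candidates rather than on the full span space.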

3.2 Encoder Layer

Figure 3: Mask matrix of Dual-info attention (window size is 2). Left: global mask. Right: mention-focus mask, which fuses the "entity detection" sub-task and local context.

Because of the importance of context to entities, it is necessary to infuse information from different nested entities into one token. We propose a Dual-info attention structure to obtain entity semantic knowledge from both the token itself and others. This Dual-info attention representation is the input of two main sub-components of the model.

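For the attention architecture, we use a pre-LayerNorm residual connection and a multi-head attention mechanism:

$$\mathrm{Attention}(H, M) = H + \mathrm{MultiHead}\big(\mathrm{LayerNorm}(H), M\big) \tag{2}$$

where $H$ is the input token representation and $M$ is the attention mask matrix.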

Dual-info Attention consists of Global Masked attention and Mention-focus Masked attention layers. Fig 3 shows an example of the mask matrices of Dual-info attention based on BERT Devlin et al. (2018). We use $h_i$, the BERT representation of the $i$-th token in the sentence, to compute the Dual-info attention representation $\tilde{h}_i$.


Global Masked Attention considers every token from the same sentence, which makes the representation more contextual, while Mention-focus Masked Attention uses entity detection from MentionTagger (Sec 3.3) and local context to construct attention weights. For tokens not in mention candidates, we encode them from the representation of mention candidates’ tokens and local context. Otherwise, we encode mention candidates’ tokens by calculating the attention weighted sum of all tokens except itself. This approach tries to emphasise information related to: entity to entity, token to entity, and entity types. The ablation experiments (Sec 5.1) and attention discussion (Sec 5.3) further showcase its effects.
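One possible reading of the two masks described above can be sketched in Python. This is a hedged illustration, not the authors' implementation: the function names are hypothetical, `True` means "may attend", and letting a non-entity token see itself as part of its local window is an assumption.

```python
from typing import List


def global_mask(n: int) -> List[List[bool]]:
    """Every token may attend to every token in the sentence."""
    return [[True] * n for _ in range(n)]


def mention_focus_mask(is_entity: List[bool], window: int = 2) -> List[List[bool]]:
    """Build a mention-focus mask from per-token entity detection decisions.

    Row i lists the positions token i may attend to:
    - an entity token sees all tokens except itself;
    - a non-entity token sees entity tokens plus its local window.
    """
    n = len(is_entity)
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if is_entity[i]:
                mask[i][j] = i != j
            else:
                mask[i][j] = is_entity[j] or abs(i - j) <= window
    return mask


# Toy sentence of five tokens where only the first token is detected as an entity token.
toy_mask = mention_focus_mask([True, False, False, False, False], window=1)
```

In this toy case the first row excludes only the diagonal entry, while the last row keeps the entity token at position 0 and the neighbours within the window.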

3.3 MentionTagger

The mention tagger module aims to extract entity mention candidates and compute their corresponding probabilities in a sentence.

It is onerous to learn the mention detection matrix in $\mathbb{R}^{n \times n}$ space directly. A basic idea is to traverse all sub-sequences based on a shared representation, building a high dimensional matrix by concatenating the representations of each entity's start/end tokens. However, this method misses the interaction between start and end tokens, and increases the risk of over-fitting in the training stage.

The other extreme is to treat entity boundary information as two completely independent variables, like recent mention detection models Zheng et al. (2019); Tan et al. (2020) that only consider entity start/end tokens, which lack enough region contextual information. MentionTagger circumvents these two problems and utilizes three types of boundary information: entity start/end token, entity token itself, and mention region; which we term prior boundary knowledge. To infuse prior boundary distribution, we use three sub-tasks (start/end detection, entity detection, and mention detection) in training the tagger module.

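Start/End detection Inspired by the biaffine model Dozat and Manning (2017), we use two MLPs ($\mathrm{MLP}_{s}$ and $\mathrm{MLP}_{e}$) to get a low dimension start/end representation and compute the span representation for span $(i, j)$. We also use a start/end detection sub-task to enhance the start/end representation and apply two other MLPs ($\mathrm{MLP}'_{s}$ and $\mathrm{MLP}'_{e}$) to project $h^{s}_{i}$ / $h^{e}_{j}$ to the category space. For span $(i, j)$, these are:

$$h^{s}_{i} = \mathrm{MLP}_{s}(\tilde{h}_{i}), \qquad h^{e}_{j} = \mathrm{MLP}_{e}(\tilde{h}_{j})$$

where $\tilde{h}_{i}$ and $\tilde{h}_{j}$ are the Dual-info Attention representations of the span $(i, j)$'s start/end tokens.

The span vector $r_{i,j}$ is generated from the low dimension start/end representations $h^{s}_{i}$ and $h^{e}_{j}$:

$$r_{i,j} = (h^{s}_{i})^{\top} U\, h^{e}_{j} + W\,[h^{s}_{i}; h^{e}_{j}] + b$$

where $U$, $W$, and $b$ are self-learned parameters.

$$p^{s}_{i} = \mathrm{softmax}\big(\mathrm{MLP}'_{s}(h^{s}_{i})\big), \qquad p^{e}_{j} = \mathrm{softmax}\big(\mathrm{MLP}'_{e}(h^{e}_{j})\big)$$

where $p^{s}_{i}$ and $p^{e}_{j}$ are the output probabilities of the start/end detection sub-task.

$$\mathcal{L}_{s} = \mathrm{CE}(y^{s}, p^{s}), \qquad \mathcal{L}_{e} = \mathrm{CE}(y^{e}, p^{e})$$

where $y^{s}$ and $y^{e}$ are the ground truth labels of the start/end detection sub-task, $\mathrm{CE}$ denotes the cross entropy loss, and $\mathcal{L}_{s}$, $\mathcal{L}_{e}$ are the training losses of the start/end sub-task, respectively.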
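To make the biaffine interaction between start and end representations concrete, here is a minimal pure-Python sketch in the style of Dozat and Manning (2017). It is an illustration under assumptions: the function name and toy parameters are hypothetical, and for brevity the score is a single scalar rather than the span vector used by the model.

```python
from typing import List


def biaffine_score(
    h_start: List[float],
    h_end: List[float],
    U: List[List[float]],
    W: List[float],
    b: float,
) -> float:
    """Bilinear term h_start^T U h_end plus a linear term over [h_start; h_end] plus a bias."""
    bilinear = sum(
        h_start[i] * U[i][j] * h_end[j]
        for i in range(len(h_start))
        for j in range(len(h_end))
    )
    linear = sum(w * x for w, x in zip(W, h_start + h_end))
    return bilinear + linear + b


# Toy low-dimension start/end representations and hand-picked parameters.
score = biaffine_score(
    h_start=[1.0, 2.0],
    h_end=[3.0, 4.0],
    U=[[1.0, 0.0], [0.0, 1.0]],
    W=[1.0, 1.0, 1.0, 1.0],
    b=0.5,
)
```

The bilinear term is what lets the start and end tokens interact, which plain concatenation of the two representations does not provide.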

Entity detection Notice that using only the start/end information does not define a span boundary. A high start probability and a high end probability do not mean that the probability of the span is high. For example, in the sentence "Joe went to school.", the probabilities of the start token "Joe" and the end token "school" are both high, but the probability of the span "Joe went to school" being an entity is low. Applying the entity detection information (whether or not a token belongs to at least one entity) can help address this problem. We reduce the span probability for widely spaced start/end pairs by accumulating the entity detection probability values.
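The "Joe went to school" case can be sketched numerically. The accumulation rule used here (multiplying the per-token entity probabilities inside the span) and all probability values are assumptions made for illustration; the paper's exact formula is not reproduced.

```python
import math
from typing import List


def span_score(
    p_start: List[float],
    p_end: List[float],
    p_entity: List[float],
    start: int,
    end: int,
) -> float:
    """Start/end probabilities damped by the entity probabilities of all tokens in the span."""
    inside = math.prod(p_entity[start : end + 1])
    return p_start[start] * p_end[end] * inside


tokens = ["Joe", "went", "to", "school"]
p_start = [0.90, 0.05, 0.05, 0.70]   # hypothetical start probabilities
p_end = [0.90, 0.05, 0.05, 0.90]     # hypothetical end probabilities
p_entity = [0.95, 0.05, 0.05, 0.90]  # hypothetical entity token probabilities

whole_sentence = span_score(p_start, p_end, p_entity, 0, 3)
joe_only = span_score(p_start, p_end, p_entity, 0, 0)
```

Although "Joe" and "school" both look like good boundaries, the low entity probabilities of "went" and "to" pull the score of the long span down, while the single-token span "Joe" keeps a high score.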

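For the token index $i$ in a sentence, we have:

$$p^{ed}_{i} = \mathrm{softmax}\big(\mathrm{MLP}_{ed}(\tilde{h}_{i})\big)$$

where $\tilde{h}_{i}$ is the Dual-info Attention representation of the $i$-th token and $p^{ed}_{i}$ means the output probability of token $w_i$ belonging to a span.

In this sub-task, the entity detection loss function $\mathcal{L}_{ed}$ is defined as:

$$\mathcal{L}_{ed} = \mathrm{CE}(y^{ed}, p^{ed})$$

where $y^{ed}$ are the ground truth labels of the entity detection sub-task and $\mathrm{CE}$ denotes the cross entropy loss.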

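Mention detection Using the span representation $r_{i,j}$ and the entity detection probability $p^{ed}$, we compute the mention detection probability for all sub-sequences in the sentence. For the span $(i, j)$, the output probability $p^{md}_{i,j}$ of the mention detection sub-task is obtained from $r_{i,j}$ together with the accumulated entity detection probabilities of the tokens inside the span.

The mention detection loss function is calculated as follows:

$$\mathcal{L}_{md} = \mathrm{CE}(y^{md}, p^{md})$$

where $y^{md}$ are the ground truth labels of the mention detection sub-task and $\mathrm{CE}$ denotes the cross entropy loss. After getting the mention probability of each mention pair $(i, j)$, we use a threshold hyper-parameter $\theta$ to generate the mention candidate set $C$.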

MentionTagger not only outputs a mention candidate set as the input of TypeClassifier, but the offshoots of its internal entity token detection are also fed to the shared Dual-info attention layer as a mention-focus mask matrix, improving its representation of entity semantic information.

3.4 TypeClassifier

After obtaining mention candidates from MentionTagger, TypeClassifier aims to predict the probability of entity types for each candidate. We utilize a mention decoupling layer (MDL) to focus on the current mention semantic information, including two-level attention and a four-level representation (see Fig 2), for each span.

We apply a learnable $d$-dimensional position embedding over the Dual-info Attention encoder embedding $\tilde{H}$ to obtain the position-wise token representation $H^{p}$, and consider $H^{p}$ as the input to the two-level attention component.

Two-level attention combines mention-level attention (to make semantic information focus on internal spans) and neighbor-level attention (which emphasizes the knowledge from span boundaries and contextual tokens). Both use the same attention architecture but have different mask matrices. Mention-level attention can only see the mention tokens, while neighbor-level attention can only see the remaining tokens. This attention is defined as:

$$H^{ml} = \mathrm{Attention}(H^{p}, M^{ml}), \qquad H^{nl} = \mathrm{Attention}(H^{p}, M^{nl})$$

where $M^{ml}$ and $M^{nl}$ are the different mask matrices of the mention-level and neighbor-level attention, and $\mathrm{Attention}$ is the attention architecture mentioned in eq (2).
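The split between what the two attention levels may see can be sketched as follows. This is a hedged illustration for a single candidate span: the function name is hypothetical, `True` means "visible", and applying the neighbor window symmetrically around the span is an assumption.

```python
from typing import List, Tuple


def two_level_masks(
    n: int, start: int, end: int, max_window: int = 128
) -> Tuple[List[bool], List[bool]]:
    """Return (mention_mask, neighbor_mask) over n tokens for the span [start, end].

    The mention-level mask exposes only tokens inside the span; the
    neighbor-level mask exposes only tokens outside it, limited to a
    window of max_window tokens on each side.
    """
    mention = [start <= j <= end for j in range(n)]
    neighbor = [
        (not mention[j]) and (start - max_window <= j <= end + max_window)
        for j in range(n)
    ]
    return mention, neighbor


# Six-token toy sentence, candidate span covering tokens 2..3, window of one token.
mention_mask, neighbor_mask = two_level_masks(6, start=2, end=3, max_window=1)
```

The two masks never overlap, which is what allows the representation of a nested mention to be separated from the representation of its surrounding context.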

Four-level representation For each span, we align and combine four fine-grained representations, thereby improving the model's understanding of entity boundaries and entity semantics. Taking the span $(i, j)$ as an example, the detailed representation encompasses:

  • Sentence-level Representation

  • Position-wise Token Representation

  • Mention-level Representation

  • Neighbor-level Representation

These are built from the [CLS], [SEP], start, end, previous mention, and next mention token representations.

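These four different features are combined into TypeClassifier's representation $r^{tc}_{i,j}$ of the span. In the same way, the predicted classification probability $p^{tc}_{i,j}$ is output through the Softmax function:

$$p^{tc}_{i,j} = \mathrm{softmax}\big(f_{tc}(r^{tc}_{i,j})\big)$$

where $f_{tc}$ denotes the classification layer. The corresponding loss function is:

$$\mathcal{L}_{tc} = \mathrm{CE}(y^{tc}, p^{tc})$$

where $y^{tc}$ are the ground truth entity type labels and $\mathrm{CE}$ denotes the cross entropy loss.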
3.5 Optimization Objective

In MentionTagger, we jointly optimize the above detection sub-tasks. All losses are based on cross entropy loss. To balance the different losses, we apply a focal-loss-style self-adjusting weight strategy: for larger losses, the weights are correspondingly scaled up to improve the learning process.




Here, each sub-task is assigned a score that judges its degree of training, an idea borrowed from focal loss Lin et al. (2017), and the resulting weights are normalized across the sub-tasks.
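A focal-loss-style weighting of several sub-task losses can be sketched as below. This is not the paper's formula: using `exp(-loss)` as the "degree of training" score, the exponent `gamma`, and the function name are all assumptions chosen to show the general behaviour (larger losses receive larger normalized weights).

```python
import math
from typing import List


def focal_style_weights(losses: List[float], gamma: float = 2.0) -> List[float]:
    """Give larger normalized weights to sub-tasks whose current loss is larger."""
    # exp(-loss) lies in (0, 1] and approaches 1 as a sub-task becomes well trained.
    scores = [math.exp(-loss) for loss in losses]
    raw = [(1.0 - s) ** gamma for s in scores]
    total = sum(raw)
    if total == 0.0:
        # All losses are zero: fall back to uniform weights.
        return [1.0 / len(losses)] * len(losses)
    return [r / total for r in raw]


# Hypothetical losses for the start/end, entity, and mention detection sub-tasks.
sub_task_losses = [2.0, 0.5, 0.1]
weights = focal_style_weights(sub_task_losses)
weighted_loss = sum(w * l for w, l in zip(weights, sub_task_losses))
```

With these toy values the first sub-task dominates the weighted sum, so the optimizer spends most of its effort on the sub-task that is furthest from convergence.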

We train MentionTagger and TypeClassifier jointly, alternating between them as a multi-task process, where the shared representation layer and the entity detection prediction results used in Dual-info attention strengthen the connections between the two components. The overall optimization goal is:


4 Experiments

4.1 Datasets

We evaluate our model on three nested NER datasets, ACE2004 Doddington et al. (2004), ACE2005 Walker et al. (2006), and NNE Ringland et al. (2019), using the same splits as previous work Lu and Roth (2015); Wang and Lu (2018); Lin et al. (2019); Ringland et al. (2019). Table 1 shows that the proportions of nested entities in the datasets range from 30.80% to 66.14%. ACE2004 and ACE2005 include 7 entity types, while NNE has 114 entity types. We report precision, recall, and micro-F1 metrics for all experiments.

| | ACE2004 | ACE2005 | NNE |
| --- | --- | --- | --- |
| Documents | 443 | 464 | 2,312 |
| Sentences | 8,507 | 9,311 | 49,208 |
| Mentions | 27,753 | 31,102 | 279,795 |
| Entity overlaps | 9,767 | 9,579 | 185,054 |
| Overlap ratio | 35.19% | 30.80% | 66.14% |
| % of overlaps over all sub-sequences | 1.11% | 1.30% | 1.47% |

Table 1: Statistics of the ACE2004, ACE2005, and NNE nested NER datasets.
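The overlap ratios in Table 1 follow directly from the mention and overlap counts. The short Python check below repeats that arithmetic; the dictionary layout and variable names are illustrative only.

```python
# (mentions, entity overlaps) per dataset, copied from Table 1.
stats = {
    "ACE2004": (27753, 9767),
    "ACE2005": (31102, 9579),
    "NNE": (279795, 185054),
}

# Overlap ratio = entity overlaps / mentions, expressed as a percentage.
overlap_ratio = {
    name: round(100 * overlaps / mentions, 2)
    for name, (mentions, overlaps) in stats.items()
}
```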

4.2 Baseline and Experimental Settings

We compare BoningKnife with a set of representative models and the recent state of the art.

  • Wang and Lu (2018), a graph-based model using LSTM to learn a feature encoder;

  • Xia et al. (2019), which is a detect-classify model without boundary knowledge;

  • Luan et al. (2019), a graph-based model which leverages entity linking to improve NER;

  • Fisher and Vlachos (2019), a merge-and-label model with hierarchical features;

  • Strakova et al. (2019), which treats the nested NER task as a seq2seq problem;

  • Shibuya and Hovy (2020), which extracts nested entities recursively with CRF;

  • Tan et al. (2020), which combines entity start/end probabilities.

Moreover, we utilize BERT itself as a lower-bound baseline for the BERT-based models. NNE results for Shibuya and Hovy (2020) are obtained by running their public code.

We perform a random search strategy for hyperparameter optimization and select the best settings on the development sets. We initialize the loss weight parameter as

in MentionTagger and set the max neighbor window size to 128 in the neighbor-level attention block to control memory size. The hidden size of the low dimension start/end representations is 84, the number of attention heads is 16, and the window size of the mention-focus attention mask matrix is 2. Excluding BERT, our model has around 24M parameters. We employ AdamW Loshchilov and Hutter (2019) as the optimizer during training. Experiments are repeated 5 times with different random seeds on each corpus.

4.3 Results and Discussion

| Model | ACE2004 P(%) | ACE2004 R(%) | ACE2004 F1(%) | ACE2005 P(%) | ACE2005 R(%) | ACE2005 F1(%) |
| --- | --- | --- | --- | --- | --- | --- |
| Wang and Lu (2018) | 78.0 | 72.4 | 75.1 | 76.8 | 72.3 | 74.5 |
| Xia et al. (2019) [ELMO] | 81.7 | 77.4 | 79.5 | 79.0 | 77.3 | 78.2 |
| Luan et al. (2019) | - | - | 84.7 | - | - | 82.9 |
| BERT merge outmost & inmost | 80.55 | 79.23 | 79.88 | 78.12 | 82.71 | 80.35 |
| Fisher and Vlachos (2019) | - | - | - | 82.7 | 82.1 | 82.4 |
| Strakova et al. (2019) | - | - | 84.4 | - | - | 84.3 |
| Shibuya and Hovy (2020) | 85.23 | 84.72 | 84.97 | 83.30 ± 0.22 | 84.69 ± 0.37 | 83.99 ± 0.27 |
| Tan et al. (2020) | 85.8 | 84.8 | 85.3 | 83.8 | 83.9 | 83.9 |
| BoningKnife | 85.98 ± 0.36 | 86.86 ± 0.39 | 86.41 ± 0.24 | 84.77 ± 0.31 | 86.16 ± 0.43 | 85.46 ± 0.32 |

Table 2: Results of the proposed BoningKnife and prior state-of-the-art methods over the ACE2004/2005 test sets. '-' denotes results not reported.

| Model | P(%) | R(%) | F1(%) |
| --- | --- | --- | --- |
| Wang et al. (2018) | 77.4 | 70.1 | 73.6 |
| BERT merge outmost & inmost | 80.43 | 74.94 | 77.59 |
| Wang and Lu (2018) | 91.8 | 91.1 | 91.4 |
| Shibuya and Hovy (2020) | 93.03 | 93.34 | 93.19 |
| BoningKnife | 93.74 (± 0.33) | 94.75 (± 0.24) | 94.24 (± 0.05) |

Table 3: Results of the proposed model and baselines over the NNE dataset.

Tables 2 and 3 report the results of our model and the different baselines on the ACE2004/ACE2005 and NNE datasets. It can be seen that our proposed method outperforms all previous state-of-the-art methods, reaching 86.41, 85.46, and 94.24 in average micro-F1 score, on ACE2004, ACE2005, and NNE respectively.

Compared with the latest boundary-enhanced method Tan et al. (2020), our method achieves 1.11 and 1.56 absolute point gains in F1 on ACE2004 and ACE2005 (Tan et al. (2020) do not report results over NNE and did not release their code for further experiments). The boost comes mainly from recall improvements (2.06 to 2.26 points). MentionTagger is able to produce more precise mention candidates, which allows TypeClassifier to focus on distinguishing entity types, instead of filtering out non-viable candidates. The improved precision of MentionTagger is further evidenced in Table 6.

On the NNE corpus, BoningKnife achieves a 94.24 F1-score, an improvement of 1.05 points over the previous SOTA. We hypothesize that, as the NNE dataset has deeper nesting levels, the approach of Shibuya and Hovy (2020) suffers from error transmission in its recursive decoding process.
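The gains quoted above can be re-derived from the numbers in Tables 2 and 3. The snippet below only repeats that subtraction; the dictionary names are illustrative.

```python
# F1 and recall values copied from Tables 2 and 3.
boningknife = {
    "ACE2004": {"R": 86.86, "F1": 86.41},
    "ACE2005": {"R": 86.16, "F1": 85.46},
    "NNE": {"F1": 94.24},
}
tan_2020 = {
    "ACE2004": {"R": 84.8, "F1": 85.3},
    "ACE2005": {"R": 83.9, "F1": 83.9},
}
shibuya_hovy_2020_nne_f1 = 93.19

f1_gain = {
    name: round(boningknife[name]["F1"] - tan_2020[name]["F1"], 2)
    for name in tan_2020
}
recall_gain = {
    name: round(boningknife[name]["R"] - tan_2020[name]["R"], 2)
    for name in tan_2020
}
nne_gain = round(boningknife["NNE"]["F1"] - shibuya_hovy_2020_nne_f1, 2)
```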

5 Analysis

5.1 Ablation & Flat/Nested Performance

| Method | Performance change |
| --- | --- |
| w/o EntityDetection sub-task | (-1.06) |
| w/o Start/End sub-task | (-0.97) |
| w/o Neighbor-level attention | (-0.41) |
| w/o Two-level attention | (-0.83) |
| w/o Mention-focus attention | (-0.28) |
| w/o MentionTagger stage | (-0.89) |

Table 4: Ablation study of the proposed BoningKnife over the ACE 2004 test set, where numbers in parentheses denote performance change. "w/o MentionTagger stage" replaces MentionTagger with using entity start/end to generate mention candidates, similarly to Tan et al. (2020).

To validate the contributions and effectiveness of different components in the proposed model, we perform an ablation study over the model variants listed in Table 4.

Table 4 highlights the performance contributions of each component in our proposed model; removing any of them generally leads to substantial performance drops. It can be seen that quality decreases significantly when removing either MentionTagger or its sub-tasks (entity token and start/end detection), which indicates the proposed model makes effective use of boundary knowledge (for example, to better handle long entity spans). Without the proposed two-level attention in TypeClassifier, it becomes harder for the model to separate nested information and assign the proper type for nested entities, even more so than when removing only the neighbor-level component of the two-level attention. This further demonstrates the benefits of the two-level structure and its ability to combine clear boundary and local context information. Lastly, while the effects of removing mention-focus mask attention are less prominent, they are still noticeable, and removing this component leads to slower overall model convergence.

Furthermore, Table 5 reports the Flat/Nested performance across datasets. It can be seen that BoningKnife excels on nested entities while remaining competitive on flat results, which is further evidence of the overall effectiveness of the model in leveraging boundary knowledge.

| Method | ACE2004 Flat | ACE2004 Nested | ACE2005 Flat | ACE2005 Nested | NNE Flat | NNE Nested |
| --- | --- | --- | --- | --- | --- | --- |
| BoningKnife | 84.32 | 87.10 | 84.54 | 86.23 | 84.45 | 94.73 |
| w/o MentionTagger | 83.56 | 86.27 | 84.16 | 85.45 | 83.89 | 93.65 |
| Shibuya and Hovy (2020) | 84.45 | 85.14 | 84.86 | 83.13 | 84.26 | 93.67 |

Table 5: Flat/Nested F1-scores over the ACE2004, ACE2005, and NNE test sets.

5.2 Time Complexity

| Method | ACE2004 P | ACE2004 Time (s) | ACE2005 P | ACE2005 Time (s) | NNE P | NNE Time (s) |
| --- | --- | --- | --- | --- | --- | --- |
| BoningKnife | 89.46 | 479 | 87.22 | 649 | 95.48 | 7841 |
| w/o MentionTagger | 31.54 | 839 | 32.58 | 1190 | 30.71 | 32758 |
| TimeRatio | | 1.75X | | 1.83X | | 4.18X |

Table 6: Comparison of mention precision and training time cost per epoch, between the E2E system using MentionTagger and the mention strategy from Tan et al. (2020).

Similarly to Tan et al. (2020), our method substantially improves the time complexity over typical span-based methods by generating high-quality candidates, which greatly reduces complexity and training time. Span-based models, which require traversing all sub-sequences, have $O(mn^2)$ time complexity, where $n$ is the sentence length and $m$ is the count of tags. Efficiently reducing the number of candidates is key in a two-step system like BoningKnife, as the time complexity of span classification is determined by the number of candidates in its input. To measure the speedup from our approach due to its improved candidate generation and provide a comparison with Tan et al. (2020), we run two experiments: i) the complete BoningKnife system and ii) BoningKnife with MentionTagger replaced by the mention strategy in Tan et al. (2020). The experiments were run on an Ubuntu 16.04.6 server with an Intel Xeon CPU E5-2690v3 @ 2.60GHz and one P100 GPU.

Table 6 reports the comparison between both experiments. We can see that our method provides significant speedup over the simpler modeling of boundary knowledge approach, especially with deeper nesting levels. BoningKnife is 1.75x, 1.83x, and 4.18x faster in ACE2004, ACE2005, and NNE, respectively, while still achieving higher quality.
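The TimeRatio row of Table 6 is the per-epoch training time without MentionTagger divided by the time of the full system. A short Python check of that arithmetic, with illustrative variable names:

```python
# Training time per epoch in seconds, copied from Table 6.
time_full = {"ACE2004": 479, "ACE2005": 649, "NNE": 7841}
time_without_tagger = {"ACE2004": 839, "ACE2005": 1190, "NNE": 32758}

# Speedup of the full system over the variant without MentionTagger.
time_ratio = {
    name: round(time_without_tagger[name] / time_full[name], 2)
    for name in time_full
}
```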

5.3 Case Study and Attention Weight Visualization

Sentence: The Coventry University researchers who report the findings in the British journal of sports medicine say anxiety and depression are common among those so injured, possibly as a result of pain and impaired mobility.
(a) The Coventry University researchers who report the findings in the British journal of sports medicine 1.000 0.999 1.000 0.942 1.000 PER PER
(b) Coventry University 0.595 1.000 0.945 0.502 0.998 ORG ORG
(c) who 1.000 0.999 1.000 0.997 1.000 PER PER
(d) those so injured 1.000 0.483 0.953 0.045 1.000 PER PER
(e) those so injured, possibly as a result of pain and impaired mobility 1.000 0.000 0.000 0.573 0.000 Non-entity Non-entity
(f) . 0.000 0.000 0.000 0.501 0.000 Non-entity Non-entity
Table 7: An example where BoningKnife leverages prior boundary knowledge to better predict nested entity type. Some of the reported probabilities come from the ablation experiment "w/o ED subtask".

Table 7 shows an example of BoningKnife predictions in ACE 2004. Span (d), "those so injured", is a correct mention, but the probability of its end token "injured" is small. For S/E based methods like Tan et al. (2020); Zheng et al. (2019), this span would likely be discarded, but our method identifies it correctly. Compared with using start/end information only, the entity token detection knowledge reduces the number of high probability mentions (like (e) and (f)) that are inconsistent with the prior information, while not discarding very long entities, like mention (a).

(a) Global Mask
(b) Mention-focus Mask
Figure 4: Visualization of the Dual-info Attention weights for the case study sentence (Table 7).

Fig 4 shows the Dual-info attention weights for the sentence in Table 7. The global attention weights all focus on common keywords like "university" and "the", while the mention-focus attention weights focus on specific token neighbors, like "report" focusing on "researchers" and "findings", which improves their semantic information. In addition, tokens focus on the relevant entity tokens, like "who" focusing on "those", a word of the same entity type, instead of on itself as in global attention.

6 Conclusion

In this paper we propose a novel joint entity mention detection and typing model via prior boundary knowledge for the nested NER task. The proposed method effectively incorporates prior boundary knowledge information to generate high quality mention candidates, which greatly improves efficiency of the whole system. By introducing a Dual-info attention layer at the mention classification stage, it facilitates mention decoupling and more accurate mention classification at different levels. Experiments show that our system, BoningKnife, achieves state-of-the-art results on three standard benchmark datasets; and an ablation study further demonstrates the effectiveness of its components.


References

  • Doddington et al. [2004] George R Doddington, Alexis Mitchell, Mark A Przybocki, Lance A Ramshaw, Stephanie M Strassel, and Ralph M Weischedel. The automatic content extraction (ACE) program-tasks, data, and evaluation. In LREC, volume 2, page 1. Lisbon, 2004.
  • Walker et al. [2006] Christopher Walker, Stephanie Strassel, Julie Medero, and Kazuaki Maeda. ACE 2005 multilingual training corpus. Linguistic Data Consortium, Philadelphia, 57, 2006.
  • Ringland et al. [2019] Nicky Ringland, Xiang Dai, Ben Hachey, Sarvnaz Karimi, Cécile Paris, and James R. Curran. NNE: A dataset for nested named entity recognition in English newswire. Proc. Annu. Meeting Assoc. for Computational Linguistics (ACL), 2019.
  • Alex et al. [2007] Beatrice Alex, Barry Haddow, and Claire Grover. Recognising nested named entities in biomedical text. In Proceedings of the Workshop on BioNLP 2007: Biological, Translational, and Clinical Language Processing, pages 65–72. Association for Computational Linguistics, 2007.
  • Lu and Roth [2015] Wei Lu and Dan Roth. Joint mention extraction and classification with mention hypergraphs. In Proc. Conf. Empirical Methods Natural Language Process. (EMNLP), 2015.
  • Katiyar and Cardie [2018] Arzoo Katiyar and Claire Cardie. Nested named entity recognition revisited. In Proc. Conf. North American Assoc. for Computational Linguistics: Human Language Technologies (NAACL-HLT), pages 861–871, 2018.
  • Wang and Lu [2018] Bailin Wang and Wei Lu. Neural segmental hypergraphs for overlapping mention recognition. Proc. Conf. Empirical Methods Natural Language Process. (EMNLP), 2018.
  • Marinho et al. [2019] Zita Marinho, Alfonso Mendes, Sebastiao Miranda, and David Nogueira. Hierarchical nested named entity recognition. In Proceedings of the 2nd Clinical Natural Language Processing Workshop, pages 28–34, 2019.
  • Sohrab and Miwa [2018] Mohammad Golam Sohrab and Makoto Miwa. Deep exhaustive model for nested named entity recognition. In Proc. Conf. Empirical Methods Natural Language Process. (EMNLP), pages 2843–2849, 2018.
  • Luan et al. [2019] Yi Luan, Dave Wadden, Luheng He, Amy Shah, Mari Ostendorf, and Hannaneh Hajishirzi. A general framework for information extraction using dynamic span graphs. Proc. Conf. North American Assoc. for Computational Linguistics (NAACL), 2019.
  • Xia et al. [2019] Congying Xia, Chenwei Zhang, Tao Yang, Yaliang Li, Nan Du, Xian Wu, Wei Fan, Fenglong Ma, and Philip Yu. Multi-grained named entity recognition. Proc. Annu. Meeting Assoc. for Computational Linguistics (ACL), 2019.
  • Fisher and Vlachos [2019] Joseph Fisher and Andreas Vlachos. Merge and label: A novel neural network architecture for nested NER. Proc. Annu. Meeting Assoc. for Computational Linguistics (ACL), 2019.
  • Zheng et al. [2019] Changmeng Zheng, Yi Cai, Jingyun Xu, Ho-fung Leung, and Guandong Xu. A boundary-aware neural model for nested named entity recognition. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 357–366, Hong Kong, China, November 2019. Association for Computational Linguistics. doi: 10.18653/v1/D19-1034.
  • Tan et al. [2020] Chuanqi Tan, Wei Qiu, Mosha Chen, Rui Wang, and Fei Huang. Boundary enhanced neural span classification for nested named entity recognition. Proc. AAAI Conference on Artificial Intelligence (AAAI), 2020.
  • Shen et al. [2003] Dan Shen, Jie Zhang, Guodong Zhou, Jian Su, and Chew-Lim Tan. Effective adaptation of a hidden Markov model-based named entity recognizer for biomedical domain. In Proc. Annu. Meeting Assoc. for Computational Linguistics (ACL), pages 49–56. Association for Computational Linguistics, 2003.
  • Zhang et al. [2004] Jie Zhang, Dan Shen, Guodong Zhou, Jian Su, and Chew-Lim Tan. Enhancing HMM-based biomedical named entity recognition by studying special phenomena. Journal of Biomedical Informatics, 37(6):411–422, 2004.
  • Zhou [2006] GD Zhou. Recognizing names in biomedical texts using mutual information independence model and SVM plus sigmoid. International Journal of Medical Informatics, 75(6):456–467, 2006.
  • Muis and Lu [2018] Aldrian Obaja Muis and Wei Lu. Labeling gaps between words: Recognizing overlapping mentions with mention separators. Proc. Conf. Empirical Methods Natural Language Process. (EMNLP), 2018.
  • Wang and Lu [2019] Bailin Wang and Wei Lu. Combining spans into entities: A neural two-stage approach for recognizing discontiguous entities. Proc. Conf. Empirical Methods Natural Language Process. (EMNLP), 2019.
  • Wang et al. [2018] Bailin Wang, Wei Lu, Yu Wang, and Hongxia Jin. A neural transition-based model for nested mention recognition. Proc. Conf. Empirical Methods Natural Language Process. (EMNLP), 2018.
  • Ju et al. [2018] Meizhi Ju, Makoto Miwa, and Sophia Ananiadou. A neural layered model for nested named entity recognition. In Proc. Conf. North American Assoc. for Computational Linguistics: Human Language Technologies (NAACL-HLT), pages 1446–1459, 2018.
  • Strakova et al. [2019] Jana Strakova, Milan Straka, and Jan Hajic. Neural architectures for nested NER through linearization. Proc. Annu. Meeting Assoc. for Computational Linguistics (ACL), 2019.
  • Shibuya and Hovy [2020] Takashi Shibuya and Eduard Hovy. Nested named entity recognition via second-best sequence learning and decoding. Transactions of the Association for Computational Linguistics, 8:605–620, 2020. doi: 10.1162/tacl_a_00334.
  • Devlin et al. [2018] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. Proc. Conf. North American Assoc. for Computational Linguistics (NAACL), 2018.
  • Dozat and Manning [2017] Timothy Dozat and Christopher D. Manning. Deep biaffine attention for neural dependency parsing. Proc. Int. Conf. Learning Representations (ICLR), 2017.
  • Lin et al. [2017] Tsung-Yi Lin, Priya Goyal, Ross B. Girshick, Kaiming He, and Piotr Dollár. Focal loss for dense object detection. CoRR, abs/1708.02002, 2017.
  • Lin et al. [2019] Hongyu Lin, Yaojie Lu, Xianpei Han, and Le Sun. Sequence-to-nuggets: Nested entity mention detection via anchor-region networks. Proc. Annu. Meeting Assoc. for Computational Linguistics (ACL), 2019.
  • Loshchilov and Hutter [2019] Ilya Loshchilov and Frank Hutter. Fixing weight decay regularization in Adam. Proc. Int. Conf. Learning Representations (ICLR), abs/1711.05101, 2019.