Named Entity Recognition (NER) is a fundamental task in natural language processing (NLP), which aims to extract and recognize named entities, such as person names, organizations, and geopolitical entities, in unstructured text. However, in addition to flat entity mentions, nested or overlapping entities are commonplace in natural language. Such nested entities bring richer entity knowledge and semantics and can be critical to facilitate various downstream NLP tasks and real-world applications. As an example of their frequency, nested entities account for 35.19%, 30.80%, and 66.14% of mentions in standard datasets like ACE2004 Doddington et al. (2004), ACE2005 Walker et al. (2006), and NNE Ringland et al. (2019), respectively.
Nonetheless, the standard method for classic NER treats the problem as a sequence labeling task, which has difficulty recognizing entities with nested structures directly Alex et al. (2007); Lu and Roth (2015); Katiyar and Cardie (2018). With that in mind, various approaches to recognizing nested entities have been proposed, ranging from hypergraph-based methods Lu and Roth (2015); Wang and Lu (2018); Marinho et al. (2019), which design expressive tagging schemas, to span-based methods Sohrab and Miwa (2018); Luan et al. (2019); Xia et al. (2019); Fisher and Vlachos (2019), which classify the categories of sub-sequences. In order to improve output quality, most recent approaches to nested NER adopt structures that require the enumeration or heuristic traversal of all sub-sequences, which leads to inefficiency, and they do not make effective use of boundary information, which is very significant for nested entities. Recently, Zheng et al. (2019) and Tan et al. (2020) explored using boundary knowledge to enhance the recognition of nested entities. However, both focus only on entity start/end information, which limits their handling of long entity spans and of the interaction between entity starts and ends, and neither uses region information. Moreover, decoupling the different nested levels of entity information remains a big problem in the nested NER task. For example, the spans "the leader of the Hezbollah in Syrian-occupied Lebanon", "the Hezbollah in Syrian-occupied Lebanon", and "Syrian-occupied Lebanon" all share the same end token (see Fig 1). Shared contextual representations tend to focus on the outermost entity type (in the above case, PER).
In this paper, we propose a novel joint entity mention detection and typing model via prior boundary knowledge, BoningKnife, which can carve entity boundaries accurately and tease out type information more precisely. Our model consists of two main components, MentionTagger and TypeClassifier, which are jointly trained with a common encoder representation and a shared dual-info attention layer. MentionTagger performs mention detection by better leveraging boundary knowledge beyond just entity start/end to better handle nesting levels and longer spans. This improved representation of boundary knowledge both addresses limitations of previous systems and allows the generation of high quality mention candidates, which are critical for the overall system efficiency. TypeClassifier then utilizes a new two-level attention mechanism to decouple different nested level representations and better distinguish entity types. Moreover, the offshoots of MentionTagger entity token detection are further leveraged in the dual-info attention layer to improve joint training performance.
Experimental results show that our approach achieves significant improvements over state-of-the-art methods on three nested NER datasets. Further analysis and case studies demonstrate the effectiveness of each component in the model and of its different sub-task strategies for the mention detection and typing attention layers. Moreover, our approach also achieves higher efficiency without a drop in quality.
2 Related Work
Traditionally, most approaches formalize the NER task as a sequence labeling problem, which assigns a single label to each token in a sentence. Shen et al. (2003); Zhang et al. (2004); Zhou (2006) adopt bottom-up methods, which perform entity recognition from inner to outer mentions, following hand-crafted rules.
Lu and Roth (2015) introduces the idea of using a graph structure to connect tokens with multiple entities, while Muis and Lu (2018); Wang and Lu (2018); Katiyar and Cardie (2018); Wang and Lu (2019) propose hypergraphs and different methods to utilize graph information for nested NER.
Transition-based models, which assemble a shift-reduce structure to detect a nested entity, have also been proposed. Wang et al. (2018) builds a forest structure based on shift-reduce parsing. Marinho et al. (2019) uses a stack structure to construct the transition-shift-reduce model. Ju et al. (2018) proposes a dynamically stacked multiple LSTM-CRF model that recognizes entities in an inside-out manner until no outer entity is extracted.
Span-based methods are another class of methods, which recognize nested entities by classifying sub-sequences Xia et al. (2019). Luan et al. (2019) proposes a graph-based model which leverages entity linking to improve NER performance. Fisher and Vlachos (2019) introduces a merge-and-label method which uses nested entity hierarchy features. Strakova et al. (2019) views the nested NER task as a seq2seq generation problem, in which the input is a list of sentences and the output is a list of target entities. Shibuya and Hovy (2020) introduces an improved CRF model that recursively decodes entities from outside to inside.
However, most previous methods need to traverse all sub-sequences and lack boundary knowledge, which is significant for nested entities. To mitigate such shortcomings, Zheng et al. (2019) defines a boundary detection task to generate a mention candidate set based on the entity start/end, followed by typing all mentions in the candidate set. Tan et al. (2020) splits boundary information into two sub-tasks (entity start and entity end) before classifying candidates. While both works use boundary knowledge, they focus only on entity start/end, which does not fully represent boundary information and leads to issues such as not handling long spans well. BoningKnife jointly trains entity mention detection and typing modules and utilizes an extended representation of boundary knowledge to address such limitations.
3 Problem and Methodology
In this section, we define the nested NER task, and then elaborate on our proposed solution. Fig 2 illustrates the framework of the proposed model (BoningKnife). Specifically, we jointly train a tagger and a classifier, where the former (MentionTagger) extracts potential mention spans and generates mention candidates by leveraging an improved representation of boundary knowledge, and the latter (TypeClassifier) classifies mention candidates into predefined entity types.
3.1 Problem Statement
Let $X$ denote sentence data and $Y$ denote entity label data, where $X \in \mathcal{X}$ and $Y \in \mathcal{Y}$, and $\mathcal{X}$ and $\mathcal{Y}$ are the vector spaces of sentences and labels. Given a sentence $x = (t_1, \dots, t_n)$, where $n$ is the length of the sentence and $t_i$ represents the $i$-th token of $x$, the objective of the NER task is to extract all semantic elements $(s, e, l) \in \mathcal{E}$, where $s$ and $e$ are the element start/end indices, $l$ is the predefined label corresponding to the element, and $\mathcal{E}$ is the entity space.
Essentially, nested NER aims to learn a space representation in $\mathbb{R}^{n \times n \times c}$ (only for the upper triangular matrix), where $c$ is the number of entity categories plus non-entity, and each value in the matrix represents the span type probability. We decompose the target probability into the product of two conditional probabilities (detection and typing) with a latent parameter (span).
In the formula, we discard the term with a small span probability in order to reduce the amount of calculations.
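As a sketch of the decomposition described above (the notation here is an assumption, not the original formula): let $m_{s,e} \in \{0, 1\}$ be a hypothetical latent indicator that the span from $s$ to $e$ is a mention, and let $\theta$ be a hypothetical probability threshold. One way to write the idea is:

```latex
P(l \mid x, s, e)
  = \sum_{m \in \{0,1\}} P(m_{s,e} = m \mid x)\, P(l \mid m_{s,e} = m, x)
  \approx P(m_{s,e} = 1 \mid x)\, P(l \mid m_{s,e} = 1, x),
\qquad \text{computed only when } P(m_{s,e} = 1 \mid x) \ge \theta .
```

Under this reading, the first factor corresponds to detection and the second to typing, and spans whose detection probability falls below $\theta$ are never passed to the typing step.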
3.2 Encoder Layer
Because of the importance of context to entities, it is necessary to infuse information from different nested entities into one token. We propose a Dual-info attention structure to obtain entity semantic knowledge from both the token itself and others. This Dual-info attention representation is the input of two main sub-components of the model.
For the attention architecture, we use a pre-LayerNorm residual connection and multi-head attention mechanism.
where is the attention mask matrix.
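A minimal NumPy sketch of a masked multi-head attention block with a pre-LayerNorm residual connection may help make the description concrete. It is an illustration under assumptions, not the authors' implementation: the function names, the weight shapes, the absence of learned LayerNorm parameters, and the absence of a feed-forward sub-layer are all choices made here.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each token vector; learned scale/shift are omitted in this sketch.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def masked_mha(h, mask, wq, wk, wv, wo, num_heads):
    """h: (n, d) token representations; mask: (n, n), 1 = may attend, 0 = blocked.

    Returns the residual output (n, d) and the attention weights (heads, n, n).
    Every row of `mask` is assumed to allow at least one position.
    """
    n, d = h.shape
    dh = d // num_heads
    x = layer_norm(h)  # pre-LayerNorm: normalize before attention

    def split(t):
        # (n, d) -> (heads, n, dh)
        return t.reshape(n, num_heads, dh).transpose(1, 0, 2)

    q, k, v = split(x @ wq), split(x @ wk), split(x @ wv)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(dh)
    scores = np.where(mask[None, :, :] > 0, scores, -1e9)  # apply the mask
    attn = softmax(scores, axis=-1)
    ctx = (attn @ v).transpose(1, 0, 2).reshape(n, d)
    return h + ctx @ wo, attn  # residual connection around the attention block
```

The same function can be reused with different mask matrices, which is how the masks discussed in this section would plug in.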
Dual-info Attention consists of Global Masked attention and Mention-focus Masked attention layers. Fig 3 shows the masked matrix example of Dual-info attention based on Devlin et al. (2018). We use , the BERT representation of the -th token in the sentence, to compute the Dual-info attention representation .
Global Masked Attention considers every token from the same sentence, which makes the representation more contextual, while Mention-focus Masked Attention uses entity detection from MentionTagger (Sec 3.3) and local context to construct attention weights. For tokens not in mention candidates, we encode them from the representation of mention candidates’ tokens and local context. Otherwise, we encode mention candidates’ tokens by calculating the attention weighted sum of all tokens except itself. This approach tries to emphasise information related to: entity to entity, token to entity, and entity types. The ablation experiments (Sec 5.1) and attention discussion (Sec 5.3) further showcase its effects.
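The two masks described above can be sketched as follows. This is one reading of the prose, with assumptions stated plainly: `is_entity_token` stands for thresholded entity detection output from MentionTagger, "local context" is taken to be a symmetric window around the token (including the token itself), and the function names are hypothetical.

```python
import numpy as np

def global_mask(n):
    # Every token may attend to every token in the same sentence.
    return np.ones((n, n), dtype=int)

def mention_focus_mask(is_entity_token, window=2):
    """Build an (n, n) mask where row i lists the positions token i may attend to.

    - Non-entity tokens attend to entity tokens plus their local window.
    - Entity tokens attend to all tokens except themselves.
    """
    ent = np.asarray(is_entity_token, dtype=bool)
    n = len(ent)
    idx = np.arange(n)
    local = np.abs(idx[:, None] - idx[None, :]) <= window
    mask = np.zeros((n, n), dtype=int)
    mask[~ent] = (ent[None, :] | local)[~ent]
    mask[ent] = (~np.eye(n, dtype=bool))[ent]
    return mask
```

Note that a one-token sentence whose only token is an entity token would produce an all-zero row, so a real implementation would need a fallback for that case.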
3.3 MentionTagger
The mention tagger module aims to extract entity mention candidates and compute their corresponding probabilities in a sentence.
It is onerous to learn the mention detection matrix over all spans directly. A basic idea is to traverse all sub-sequences based on a shared representation, building a high dimensional matrix that concatenates the representations of each entity's start and end tokens. However, this method misses the interaction between start and end tokens, and increases the risk of over-fitting in the training stage.
The other extreme is to treat entity boundary information as two completely independent variables, like recent mention detection models Zheng et al. (2019); Tan et al. (2020) that only consider entity start/end tokens, which lack enough region contextual information. MentionTagger circumvents these two problems and utilizes three types of boundary information: entity start/end token, entity token itself, and mention region; which we term prior boundary knowledge. To infuse prior boundary distribution, we use three sub-tasks (start/end detection, entity detection, and mention detection) in training the tagger module.
Start/End detection Inspired by the biaffine model Dozat and Manning (2017), we use two MLPs ( and ) to get a low dimension start/end representation and compute the span representation for span . We also use a start/end detection sub-task to enhance start/end representation and apply two other MLPs( and ) to project / to the category space. For span , these are:
where and are the Dual-info Attention representation of the span ’s start/end token.
And the span vector generated from the low dimension start/end representation ,
where , , are self-learned parameters.
where and are the output probability of the start/end detection sub-task.
where are the ground truth labels of the start/end detection sub-task and , are the training losses of the start/end sub-task, respectively.
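To illustrate the biaffine-style span representation mentioned above, the following NumPy sketch computes a vector for every (start, end) pair from low-dimensional start and end representations. The parameter names `U`, `W`, and `b`, their shapes, and the function name are assumptions chosen for this example; the original equations are not reproduced here.

```python
import numpy as np

def biaffine_span_repr(start_repr, end_repr, U, W, b):
    """Biaffine combination of start/end token representations.

    start_repr, end_repr: (n, d) low-dimensional start/end representations.
    U: (d, k, d) bilinear tensor, W: (2 * d, k) linear map, b: (k,) bias.
    Returns an (n, n, k) array; entry [i, j] is the vector for span (i, j).
    """
    d = start_repr.shape[1]
    # Bilinear interaction between start token i and end token j.
    bilinear = np.einsum("id,dkf,jf->ijk", start_repr, U, end_repr)
    # Linear term on the concatenation [start_i; end_j], using the two halves of W.
    linear = (start_repr @ W[:d])[:, None, :] + (end_repr @ W[d:])[None, :, :]
    return bilinear + linear + b
```

The bilinear term is what lets the start and end tokens interact, which a plain concatenation of the two representations would not provide. Only entries with i <= j correspond to valid spans.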
Entity detection Notice that using only the start/end information does not define a span boundary: a high-probability start and a high-probability end do not imply that the span between them has a high probability. For example, in the sentence "Joe went to school.", the probabilities of the start token "Joe" and the end token "school" are both high, but the probability of the span "Joe went to school" being an entity is low. Applying entity detection information (a verdict on whether or not each token belongs to at least one entity) can help address this problem. We reduce the probability of spans with large spacing by accumulating the entity detection probability values.
For the token index in a sentence, we have:
where means the output probability of token belonging to a span.
In this sub-task, the entity detection loss function is defined as:
where are the ground truth labels of the entity detection sub-task.
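One plausible reading of "accumulating the entity detection probability values" is to multiply the per-token probabilities over a span, so that a span containing unlikely tokens is penalized. The sketch below implements that reading with prefix sums of log-probabilities; the product form and the function name are assumptions, and the paper's exact accumulation may differ.

```python
import numpy as np

def span_entity_scores(p_token, eps=1e-12):
    """Accumulate per-token entity probabilities over every span (i, j), i <= j.

    p_token: length-n sequence, p_token[t] = probability that token t belongs
    to at least one entity. Returns an (n, n) upper-triangular score matrix.
    """
    p = np.asarray(p_token, dtype=float)
    n = len(p)
    # Prefix sums of log-probabilities give each span's product in O(1).
    prefix = np.concatenate([[0.0], np.cumsum(np.log(p + eps))])
    scores = np.zeros((n, n))
    for i in range(n):
        for j in range(i, n):
            scores[i, j] = np.exp(prefix[j + 1] - prefix[i])
    return scores
```

With illustrative probabilities such as 0.9, 0.1, 0.1, 0.9 for "Joe went to school", the single-token span "Joe" keeps a high score while the full span is pushed close to zero, matching the intuition in the example above.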
Mention detection Using the span representation and entity detection probability , we compute the mention detection probability for all sub-sequences in the sentence. For the span , these are:
where means the output probability of the mention detection sub-task.
The mention detection loss function is calculated as follows:
where are the ground truth labels of the mention detection sub-task. After getting the mention probability of each mention pair , we use a threshold hyper-parameter to generate the mention candidate .
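The thresholding step can be sketched as a filter over the upper triangle of the mention probability matrix. The function name and the default threshold of 0.5 are placeholders chosen for illustration; the paper treats the threshold as a tuned hyper-parameter.

```python
import numpy as np

def mention_candidates(p_mention, threshold=0.5):
    """Return (start, end, probability) for every span with start <= end
    whose mention probability reaches the threshold.

    p_mention: (n, n) matrix; only the upper triangle is meaningful.
    """
    keep = np.triu(p_mention >= threshold)  # ignore entries with start > end
    return [(int(i), int(j), float(p_mention[i, j])) for i, j in np.argwhere(keep)]
```

Only the returned spans would be passed on to the typing step, which is where the reduction in the number of classified sub-sequences comes from.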
MentionTagger not only outputs a mention candidate set as the input of TypeClassifier, but the offshoots of its internal entity token detection are fed to the shared Dual-info attention layer as a mention-focus mask matrix , improving its representation of entity semantic information.
3.4 TypeClassifier
After obtaining mention candidates from MentionTagger, TypeClassifier aims to predict the probability of entity types for each candidate. We utilize a mention decoupling layer (MDL) to focus on the current mention semantic information, including two-level attention and a four-level representation (see Fig 2), for each span.
We apply a dimensional position embedding over the Dual-info Attention encoder embedding , as position-wise token representation , and consider as the input to the two-level attention component.
where is a learnable position embedding.
Two-level attention combines mention-level attention (to make semantic information focus on internal spans) and neighbor-level attention (which emphasizes the knowledge from span boundaries and contextual tokens). The two levels use different mask matrices: mention-level attention can only see the mention tokens, while neighbor-level attention can only see the remaining tokens. This attention is defined as:
where are the difference mask matrices, and is the attention architecture mentioned in eq (2).
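The two mask matrices can be sketched as below for a single candidate span. This is an illustration under assumptions: every query row shares the same visibility pattern, the neighbor window is applied symmetrically around the span, and the function name is hypothetical.

```python
import numpy as np

def two_level_masks(n, start, end, max_window=128):
    """Masks for one candidate span [start, end] in a sentence of n tokens.

    mention_mask:  only tokens inside the span are visible.
    neighbor_mask: only tokens outside the span are visible, limited to
                   max_window tokens on each side of the span.
    """
    idx = np.arange(n)
    inside = (idx >= start) & (idx <= end)
    near = (idx >= start - max_window) & (idx <= end + max_window)
    mention_mask = np.tile(inside, (n, 1)).astype(int)
    neighbor_mask = np.tile(~inside & near, (n, 1)).astype(int)
    return mention_mask, neighbor_mask
```

The two masks are disjoint by construction, which is what keeps the inner-mention view separate from the context view. If a span covers the whole sentence, the neighbor mask is empty and would need special handling.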
Four-level representation For each span, we align and combine four fine-grained representations, thereby improving the model’s understanding of entity boundaries and entity semantics. Taking the span , the detail representation encompasses:
Position-wise Token Representation,
where are the [CLS], [SEP], start, end, previous mention, and next mention token representations, respectively.
These four different features are combined into TypeClassifier's representation:
In the same way, the predicted classification probability is output through the Softmax function.
The corresponding loss function is:
3.5 Optimization Objective
In MentionTagger, we jointly optimize the above detection sub-tasks. All losses are based on cross-entropy loss. To balance the different losses, we apply a focal-loss-style self-adjusting weight strategy: for larger losses, the weights are correspondingly scaled up to improve the learning process.
where is a score to judge degree of sub-task training, which borrows from focal loss Lin et al. (2017) and is a normalization version of , .
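A focal-loss-style weighting of several sub-task losses could look like the sketch below. The specific choices are assumptions made for illustration: `exp(-loss)` is used as the "degree of training" score, `gamma` is the focusing exponent from focal loss, and the weights are normalized to sum to one. The paper's exact score and normalization are not reproduced here.

```python
import numpy as np

def focal_style_weights(losses, gamma=2.0, eps=1e-8):
    """Self-adjusting weights: sub-tasks with larger loss get larger weight.

    losses: current (non-negative) loss value of each sub-task.
    """
    losses = np.asarray(losses, dtype=float)
    score = np.exp(-losses)                 # near 1 when a sub-task is well trained
    raw = (1.0 - score) ** gamma + eps      # focal-style modulating factor
    return raw / raw.sum()                  # normalize across sub-tasks

def combined_loss(losses, gamma=2.0):
    # Weights are treated as constants; in a training framework they would
    # be detached from the gradient computation.
    w = focal_style_weights(losses, gamma)
    return float(np.dot(w, np.asarray(losses, dtype=float)))
```

For example, `focal_style_weights([0.1, 1.0, 2.0])` assigns the smallest weight to the first sub-task and the largest to the third, which matches the behaviour described in the text.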
We jointly train MentionTagger and TypeClassifier as a multi-task process alternately, where the shared representation layer and the entity detection prediction results in Dual-info attention enhance the connections between the two components. The overall optimization goal is:
We evaluate our model on three nested NER datasets, ACE2004 Doddington et al. (2004), ACE2005 Walker et al. (2006), and NNE Ringland et al. (2019), using the same splits as previous work Lu and Roth (2015); Wang and Lu (2018); Lin et al. (2019); Ringland et al. (2019). (The ACE2004 / ACE2005 splits are as in https://statnlp-research.github.io/publications/, and the NNE split is as in https://github.com/nickyringland/nested_named_entities.) Table 1 shows that the proportion of nested entities in the datasets ranges from 30.80% to 66.14%. ACE2004 and ACE2005 include 7 entity types, while NNE has 114 entity types. We report precision, recall, and micro-F1 metrics for all experiments.
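For reference, micro-averaged precision, recall, and F1 over typed spans can be computed as in the sketch below. It assumes exact-match scoring, where a predicted entity counts as correct only if its start, end, and type all match a gold entity; the function name and the tuple layout are choices made for this example.

```python
def micro_prf(gold, pred):
    """gold, pred: lists (one item per sentence) of sets of (start, end, type).

    Counts are pooled over all sentences before computing the ratios,
    which is what makes the scores micro-averaged.
    """
    tp = sum(len(g & p) for g, p in zip(gold, pred))
    n_pred = sum(len(p) for p in pred)
    n_gold = sum(len(g) for g in gold)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_gold if n_gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```

Because entities are compared as sets of spans, nested mentions that share a boundary token are scored independently of each other.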
|% of overlaps over all sub-sequences||1.11%||1.30%||1.47%|
4.2 Baseline and Experimental Settings
We compare BoningKnife with a set of representative models and the recent state of the art.
Wang and Lu (2018), a graph-based model using LSTM to learn a feature encoder;
Xia et al. (2019), which is a detect-classify model without boundary knowledge;
Luan et al. (2019), a graph-based model which leverages entity linking to improve NER;
Fisher and Vlachos (2019), a merge-and-label model with hierarchical features;
Strakova et al. (2019), which treats the nested NER task as a seq2seq problem;
Shibuya and Hovy (2020), which extracts nested entities recursively with CRF;
Tan et al. (2020), which combines entity start/end probabilities.
Moreover, we utilize BERT itself as a lower-bound baseline for the BERT-based models. (NNE results for Shibuya and Hovy (2020) are obtained using their public code at https://github.com/yahshibu/nested-ner-2019-bert.)
We perform a random search strategy for hyperparameter optimization and select the best settings on the development sets. We initialize the loss weight parameter asin MentionTagger and set the max neighbor window size to 128 in neighbor-level attention block to control memory size. The hidden sizes of low dimension start/end representation are 84, the number of attention heads is 16, and the windows size of mention-focus attention mask matrix is 2. Except for , our model has around 24M parameters. We employ AdamW Loshchilov and Hutter (2019) as optimizer during training. Experiments are repeated 5 times for different random seeds on each corpus.
4.3 Results and Discussion
|Model||ACE2004 P(%)||ACE2004 R(%)||ACE2004 F1(%)||ACE2005 P(%)||ACE2005 R(%)||ACE2005 F1(%)|
|Wang and Lu (2018)||78.0||72.4||75.1||76.8||72.3||74.5|
|Xia et al. (2019) [ELMO]||81.7||77.4||79.5||79.0||77.3||78.2|
|Luan et al. (2019)||-||-||84.7||-||-||82.9|
|BERT merge outmost & inmost||80.55||79.23||79.88||78.12||82.71||80.35|
|Fisher and Vlachos (2019)||-||-||-||82.7||82.1||82.4|
|Strakova et al. (2019)||-||-||84.4||-||-||84.3|
|Shibuya and Hovy (2020)||85.23||84.72||84.97||83.30 ± 0.22||84.69 ± 0.37||83.99 ± 0.27|
|Tan et al. (2020)||85.8||84.8||85.3||83.8||83.9||83.9|
|BoningKnife||85.98 ± 0.36||86.86 ± 0.39||86.41 ± 0.24||84.77 ± 0.31||86.16 ± 0.43||85.46 ± 0.32|
|Wang et al. (2018)||77.4||70.1||73.6|
|merge outmost & inmost||80.43||74.94||77.59|
|Wang and Lu (2018)||91.8||91.1||91.4|
|Shibuya and Hovy (2020)||93.03||93.34||93.19|
|(± 0.33)||(± 0.24)||(± 0.05)|
Tables 2 and 3 report the results of our model and the different baselines on the ACE2004/ACE2005 and NNE datasets. It can be seen that our proposed method outperforms all previous state-of-the-art methods, reaching 86.41, 85.46, and 94.24 in average micro-F1 score, on ACE2004, ACE2005, and NNE respectively.
Compared with the latest boundary-enhanced method (Tan et al., 2020), our method achieves 1.11 and 1.56 absolute point gains on ACE2004 and ACE2005. (Unfortunately, Tan et al. (2020) does not report results on NNE and did not release their code for further experiments.) The boost comes mainly from recall improvements (2.06 to 2.26 points). MentionTagger is able to produce more precise mention candidates, which allows TypeClassifier to focus on distinguishing entity types, instead of filtering out non-viable candidates. The improved precision of MentionTagger is further evidenced in Table 6.
On the NNE corpus, BoningKnife achieves a 94.24 F1-score, an improvement of 1.05 points over the previous SOTA. We hypothesize that, as the NNE dataset has deeper nesting levels, the approach of Shibuya and Hovy (2020) leads to error transmission in its recursive encoding process.
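The gains quoted above can be re-derived from the reported scores. The short check below uses only numbers stated in this section and its tables; the dictionary layout is just a convenience for the example.

```python
# Reported scores: (recall, F1) on ACE2004 and ACE2005, F1 only on NNE.
boningknife = {"ACE2004": (86.86, 86.41), "ACE2005": (86.16, 85.46)}
tan_2020 = {"ACE2004": (84.8, 85.3), "ACE2005": (83.9, 83.9)}
nne_f1 = {"BoningKnife": 94.24, "previous best": 93.19}

# Absolute F1 and recall gains over Tan et al. (2020), rounded to two decimals.
f1_gain = {k: round(boningknife[k][1] - tan_2020[k][1], 2) for k in boningknife}
recall_gain = {k: round(boningknife[k][0] - tan_2020[k][0], 2) for k in boningknife}
nne_gain = round(nne_f1["BoningKnife"] - nne_f1["previous best"], 2)

print(f1_gain, recall_gain, nne_gain)
```

The recall gains are larger than the F1 gains on both ACE datasets, which is consistent with the statement that the boost comes mainly from recall.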
5.1 Ablation & Flat/Nested Performance
|- w/o Entity detection sub-task||(-1.06)|
|- w/o Start/End sub-task||(-0.97)|
|- w/o Neighbor-level attention||(-0.41)|
|- w/o Two-level attention||(-0.83)|
|- w/o Mention-focus attention||(-0.28)|
|- w/o MentionTagger stage (replaced by generating mention candidates from entity start/end, similarly to Tan et al. (2020))||(-0.89)|
To validate the contributions and effectiveness of the different components in the proposed model, we perform an ablation study over the model variants listed in Table 4.
Table 4 highlights the performance contribution of each component in our proposed model; removing any of them generally leads to a substantial performance drop. It can be seen that quality decreases significantly when removing either MentionTagger or its sub-tasks (entity token and start/end detection), which indicates that the proposed model makes effective use of boundary knowledge (for example, to better handle long entity spans). Without the proposed two-level attention in TypeClassifier, it becomes harder for the model to separate nested information and assign the proper type to nested entities, even more so than when removing only the neighbor-level component of the two-level attention. This further demonstrates the benefits of the two-level structure and its ability to combine clear boundary and local context information. Lastly, while the effects of removing mention-focus mask attention are less prominent, they are still noticeable, and removing this component leads to slower overall model convergence. Furthermore, Table 5 reports the Flat/Nested performance across datasets. It can be seen that BoningKnife excels on nested entities while remaining competitive on flat ones, which further evidences the overall effectiveness of the model in leveraging boundary knowledge.
|Shibuya and Hovy (2020)||84.45||85.14||84.86||83.13||84.26||93.67|
5.2 Time Complexity
|P||Time (s)||P||Time (s)||P||Time (s)|
Table 6: Comparison of mention precision and training time cost per epoch between the E2E system using MentionTagger and the one using the mention strategy from Tan et al. (2020).
Similarly to Tan et al. (2020), our method substantially improves the time complexity over typical span-based methods by generating high-quality candidates, which greatly reduces complexity and training time. Span-based models, which require traversing all sub-sequences, have $O(c \cdot n^2)$ time complexity, where $n$ is the sentence length and $c$ is the count of tags. Efficiently reducing the number of candidates is key in a two-step system like BoningKnife, as the span classification time complexity is determined by the number of candidates in its input. To measure the speedup from our approach due to its improved candidate generation and provide a comparison with Tan et al. (2020), we run two experiments: i) the complete BoningKnife system and ii) BoningKnife with MentionTagger replaced by the mention strategy in Tan et al. (2020). The experiments were run on an Ubuntu 16.04.6 server with an Intel Xeon CPU E5-2690v3 @ 2.60GHz and one P100 GPU.
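To make the quadratic growth concrete, the sketch below counts the sub-sequences an exhaustive span-based model would have to classify and compares that with a reduced candidate set. The function names are hypothetical and the candidate count in the usage note is an arbitrary illustration, not a measurement.

```python
def enumerate_spans(n):
    # All contiguous sub-sequences (start, end) with start <= end.
    return [(i, j) for i in range(n) for j in range(i, n)]

def num_subsequences(n):
    # Closed form for the number of spans in a sentence of n tokens.
    return n * (n + 1) // 2

def classification_cost(num_spans, num_tags):
    # Number of (span, tag) scores a classifier has to produce.
    return num_spans * num_tags
```

For a 40-token sentence there are `num_subsequences(40)` = 820 spans, so with 7 entity types plus non-entity an exhaustive classifier scores 6560 (span, tag) pairs, whereas a tagger that proposes, say, 10 candidates leaves only 80.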
Table 6 reports the comparison between both experiments. We can see that our method provides significant speedup over the simpler modeling of boundary knowledge approach, especially with deeper nesting levels. BoningKnife is 1.75x, 1.83x, and 4.18x faster in ACE2004, ACE2005, and NNE, respectively, while still achieving higher quality.
5.3 Case Study and Attention Weight Visualization
|Sentence: The Coventry University researchers who report the findings in the British journal of sports medicine say anxiety and depression are common among those so injured, possibly as a result of pain and impaired mobility.|
|(a)||The Coventry University researchers who report the findings in the British journal of sports medicine||1.000||0.999||1.000||0.942||1.000||PER||PER|
|(d)||those so injured||1.000||0.483||0.953||0.045||1.000||PER||PER|
|(e)||those so injured, possibly as a result of pain and||1.000||0.000||0.000||0.573||0.000||Non-entity||Non-entity|
Table 7 shows an example of a BoningKnife prediction on ACE2004. Span (d), "those so injured", is a correct mention, but the probability of the end token "injured" is small. For S/E based methods like Tan et al. (2020); Zheng et al. (2019), this span would likely be discarded, but in our method it is correctly identified. Compared with using S/E information only, the entity token detection knowledge reduces the number of high probability mentions (like (e) and (f)) that are inconsistent with the prior information, while not discarding very long entities, like mention (a).
Fig 4 shows the Dual-info attention weights for the sentence in Table 7. The global attention weights all focus on common keywords like "university" and "the", while the mention-focus attention focuses on specific token neighbors, such as "report" focusing on "researchers" and "findings", which improves their semantic information. Additional tokens also focus on the relevant entity tokens, such as "who" focusing on "those", a word of the same entity type, instead of on itself as in global attention.
In this paper we propose a novel joint entity mention detection and typing model via prior boundary knowledge for the nested NER task. The proposed method effectively incorporates prior boundary knowledge information to generate high quality mention candidates, which greatly improves efficiency of the whole system. By introducing a Dual-info attention layer at the mention classification stage, it facilitates mention decoupling and more accurate mention classification at different levels. Experiments show that our system, BoningKnife, achieves state-of-the-art results on three standard benchmark datasets; and an ablation study further demonstrates the effectiveness of its components.
- Doddington et al.  George R Doddington, Alexis Mitchell, Mark A Przybocki, Lance A Ramshaw, Stephanie M Strassel, and Ralph M Weischedel. The automatic content extraction (ace) program-tasks, data, and evaluation. In Lrec, volume 2, page 1. Lisbon, 2004. URL http://www.lrec-conf.org/proceedings/lrec2004/pdf/5.pdf.
- Walker et al.  Christopher Walker, Stephanie Strassel, Julie Medero, and Kazuaki Maeda. Ace 2005 multilingual training corpus. Linguistic Data Consortium, Philadelphia, 57, 2006.
- Ringland et al.  Nicky Ringland, Xiang Dai, Ben Hachey, Sarvnaz Karimi, Cécile Paris, and James R. Curran. NNE: A dataset for nested named entity recognition in english newswire. Proc. Annu. Meeting Assoc. for Computational Linguistics (ACL), 2019. URL https://www.aclweb.org/anthology/P19-1510.
- Alex et al.  Beatrice Alex, Barry Haddow, and Claire Grover. Recognising nested named entities in biomedical text. In Proceedings of the Workshop on BioNLP 2007: Biological, Translational, and Clinical Language Processing, pages 65–72. Association for Computational Linguistics, 2007. URL https://www.aclweb.org/anthology/W07-1009.
- Lu and Roth  Wei Lu and Dan Roth. Joint mention extraction and classification with mention hypergraphs. In Proc. Conf. Empirical Methods Natural Language Process. (EMNLP), 2015. URL https://www.aclweb.org/anthology/D15-1102.
- Katiyar and Cardie  Arzoo Katiyar and Claire Cardie. Nested named entity recognition revisited. In Proc. Conf. North American Assoc. for Computational Linguistics: Human Language Technologies (NAACL-HLT), pages 861–871, 2018. URL https://www.aclweb.org/anthology/N18-1079.
- Wang and Lu  Bailin Wang and Wei Lu. Neural segmental hypergraphs for overlapping mention recognition. Proc. Conf. Empirical Methods Natural Language Process. (EMNLP), 2018. URL https://www.aclweb.org/anthology/D18-1019.
- Marinho et al.  Zita Marinho, Alfonso Mendes, Sebastiao Miranda, and David Nogueira. Hierarchical nested named entity recognition. In Proceedings of the 2nd Clinical Natural Language Processing Workshop, pages 28–34, 2019. URL https://www.aclweb.org/anthology/W19-1904.
- Sohrab and Miwa  Mohammad Golam Sohrab and Makoto Miwa. Deep exhaustive model for nested named entity recognition. In Proc. Conf. Empirical Methods Natural Language Process. (EMNLP), pages 2843–2849, 2018. URL https://www.aclweb.org/anthology/D18-1309.
- Luan et al.  Yi Luan, Dave Wadden, Luheng He, Amy Shah, Mari Ostendorf, and Hannaneh Hajishirzi. A general framework for information extraction using dynamic span graphs. Proc. Conf. North American Assoc. for Computational Linguistics (NAACL), 2019. URL https://www.aclweb.org/anthology/N19-1308.
- Xia et al.  Congying Xia, Chenwei Zhang, Tao Yang, Yaliang Li, Nan Du, Xian Wu, Wei Fan, Fenglong Ma, and Philip Yu. Multi-grained named entity recognition. Proc. Annu. Meeting Assoc. for Computational Linguistics (ACL), 2019. URL https://www.aclweb.org/anthology/P19-1138.
- Fisher and Vlachos  Joseph Fisher and Andreas Vlachos. Merge and label: A novel neural network architecture for nested ner. Proc. Annu. Meeting Assoc. for Computational Linguistics (ACL), 2019. URL https://www.aclweb.org/anthology/P19-1585.
- Zheng et al.  Changmeng Zheng, Yi Cai, Jingyun Xu, Ho-fung Leung, and Guandong Xu. A boundary-aware neural model for nested named entity recognition. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 357–366, Hong Kong, China, November 2019. Association for Computational Linguistics. doi: 10.18653/v1/D19-1034. URL https://www.aclweb.org/anthology/D19-1034.
- Tan et al.  Chuanqi Tan, Wei Qiu, Mosha Chen, Rui Wang, and Fei Huang. Boundary enhanced neural span classification for nested named entity recognition. Proc. AAAI Conf. Artificial Intelligence (AAAI), 2020.
- Shen et al.  Dan Shen, Jie Zhang, Guodong Zhou, Jian Su, and Chew-Lim Tan. Effective adaptation of a hidden markov model-based named entity recognizer for biomedical domain. In Proc. Annu. Meeting Assoc. for Computational Linguistics (ACL), pages 49–56. Association for Computational Linguistics, 2003. URL https://www.aclweb.org/anthology/W03-1307.
- Zhang et al.  Jie Zhang, Dan Shen, Guodong Zhou, Jian Su, and Chew-Lim Tan. Enhancing hmm-based biomedical named entity recognition by studying special phenomena. Journal of biomedical informatics, 37(6):411–422, 2004. URL https://www.ncbi.nlm.nih.gov/pubmed/15542015.
- Zhou  GD Zhou. Recognizing names in biomedical texts using mutual information independence model and svm plus sigmoid. International Journal of Medical Informatics, 75(6):456–467, 2006. URL http://nlp.suda.edu.cn/~gdzhou/publication/zhougd2006_IJMI_BiomedicalNamedEntityRecognition.pdf.
- Muis and Lu  Aldrian Obaja Muis and Wei Lu. Labeling gaps between words: Recognizing overlapping mentions with mention separators. Proc. Conf. Empirical Methods Natural Language Process. (EMNLP), 2018. URL https://www.aclweb.org/anthology/D17-1276.
- Wang and Lu  Bailin Wang and Wei Lu. Combining spans into entities: A neural two-stage approach for recognizing discontiguous entities. Proc. Conf. Empirical Methods Natural Language Process. (EMNLP), 2019. URL https://www.aclweb.org/anthology/D19-1644.
- Wang et al.  Bailin Wang, Wei Lu, Yu Wang, and Hongxia Jin. A neural transition-based model for nested mention recognition. Proc. Conf. Empirical Methods Natural Language Process. (EMNLP), 2018. URL https://www.aclweb.org/anthology/D18-1124.
- Ju et al.  Meizhi Ju, Makoto Miwa, and Sophia Ananiadou. A neural layered model for nested named entity recognition. In Proc. Conf. North American Assoc. for Computational Linguistics: Human Language Technologies (NAACL-HLT), pages 1446–1459, 2018. URL https://www.aclweb.org/anthology/N18-1131.
- Strakova et al.  Jana Strakova, Milan Straka, and Jan Hajic. Neural architectures for nested ner through linearization. Proc. Annu. Meeting Assoc. for Computational Linguistics (ACL), 2019. URL https://www.aclweb.org/anthology/P19-1527.
- Shibuya and Hovy  Takashi Shibuya and Eduard Hovy. Nested named entity recognition via second-best sequence learning and decoding. Transactions of the Association for Computational Linguistics, 8:605–620, 2020. doi: 10.1162/tacl_a_00334. URL https://aclanthology.org/2020.tacl-1.39.
- Devlin et al.  Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. Proc. Conf. North American Assoc. for Computational Linguistics (NAACL), 2018. URL https://www.aclweb.org/anthology/N19-1423.
- Dozat and Manning  Timothy Dozat and Christopher D. Manning. Deep biaffine attention for neural dependency parsing. Proc. Int. Conf. Learning Representations (ICLR), 2017.
- Lin et al.  Tsung-Yi Lin, Priya Goyal, Ross B. Girshick, Kaiming He, and Piotr Dollár. Focal loss for dense object detection. CoRR, abs/1708.02002, 2017. URL http://arxiv.org/abs/1708.02002.
- Lin et al.  Hongyu Lin, Yaojie Lu, Xianpei Han, and Le Sun. Sequence-to-nuggets: Nested entity mention detection via anchor-region networks. Proc. Annu. Meeting Assoc. for Computational Linguistics (ACL), 2019. URL https://www.aclweb.org/anthology/P19-1511.
- Loshchilov and Hutter  Ilya Loshchilov and Frank Hutter. Fixing weight decay regularization in adam. Proc. Int. Conf. Learning Representations (ICLR), abs/1711.05101, 2019. URL http://arxiv.org/abs/1711.05101.