Named Entity Recognition as Dependency Parsing

05/14/2020 ∙ by Juntao Yu, et al. ∙ Google ∙ Queen Mary University of London

Named Entity Recognition (NER) is a fundamental task in Natural Language Processing, concerned with identifying spans of text expressing references to entities. NER research is often focused on flat entities only (flat NER), ignoring the fact that entity references can be nested, as in [Bank of [China]] (Finkel and Manning, 2009). In this paper, we use ideas from graph-based dependency parsing to provide our model with a global view on the input via a biaffine model (Dozat and Manning, 2017). The biaffine model scores pairs of start and end tokens in a sentence, which we use to explore all spans, so that the model is able to predict named entities accurately. We show that the model works well for both nested and flat NER through evaluation on 8 corpora, achieving SoTA performance on all of them, with accuracy gains of up to 2.2 percentage points.




1 Introduction

‘Nested Entities’ are named entities containing references to other named entities, as in [Bank of [China]], in which both [China] and [Bank of China] are named entities. Such nested entities are frequent in data sets like ace 2004, ace 2005 and genia (e.g., 17% of NEs in genia are nested; Finkel and Manning, 2009), although the more widely used data sets such as conll 2002, 2003 and ontonotes only contain so-called flat named entities, and nested entities are ignored.

The current SoTA models all adopt a neural network architecture without hand-crafted features, which makes them more adaptable to different tasks, languages and domains (Lample et al., 2016; Chiu and Nichols, 2016; Peters et al., 2018; Devlin et al., 2019; Ju et al., 2018; Sohrab and Miwa, 2018; Straková et al., 2019). In this paper, we introduce a method to handle both types of NEs in one system by adopting ideas from the biaffine dependency parsing model of Dozat and Manning (2017). For dependency parsing, the system predicts a head for each token and assigns a relation to the head-child pairs. In this work, we reformulate NER as the task of identifying start and end indices, as well as assigning a category to the span defined by these pairs. Our system uses a biaffine model on top of a multi-layer BiLSTM to assign scores to all possible spans in a sentence. After that, instead of building dependency trees, we rank the candidate spans by their scores and return the top-ranked spans that comply with the constraints for flat or nested NER. We evaluated our system on three nested NER benchmarks (ace 2004, ace 2005, genia) and five flat NER corpora (conll 2002 (Dutch, Spanish), conll 2003 (English, German), and ontonotes). The results show that our system achieved SoTA results on all three nested NER corpora, and on all five flat NER corpora, with substantial gains of up to 2.2 absolute percentage points compared to the previous SoTA. We provide the code as open source.1

1 The code is available at
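The reformulation above has a simple combinatorial reading: every pair of token indices with start ≤ end defines a candidate span, so nested entities are simply different pairs over the same tokens. A minimal sketch (token indices purely illustrative):

```python
# Every (start, end) token pair with start <= end is a candidate span,
# so nested entities such as "Bank of China" and "China" are just two
# different index pairs over the same sentence.
tokens = ["Bank", "of", "China"]
spans = [(s, e) for s in range(len(tokens)) for e in range(s, len(tokens))]

assert (0, 2) in spans   # "Bank of China"
assert (2, 2) in spans   # the nested "China"
assert len(spans) == len(tokens) * (len(tokens) + 1) // 2   # n(n+1)/2 spans
```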

2 Related Work

Flat Named Entity Recognition. The majority of flat NER models are based on a sequence labelling approach. Collobert et al. (2011) introduced a neural NER model that uses CNNs to encode tokens, combined with a CRF layer for classification. Many other neural systems followed this approach but instead used LSTMs to encode the input and a CRF for the prediction (Lample et al., 2016; Ma and Hovy, 2016; Chiu and Nichols, 2016). These models were later extended to use context-dependent embeddings such as ELMo (Peters et al., 2018). Clark et al. (2018) successfully used cross-view training (CVT) paired with multi-task learning; this method yields impressive gains for a number of NLP applications, including NER. Devlin et al. (2019) introduced BERT, a bidirectional transformer architecture for training language models. BERT and its derivatives provided better language models, which translated again into higher scores for NER.

Lample et al. (2016) cast NER as transition-based dependency parsing using a Stack-LSTM. They compare with an LSTM-CRF model, which turns out to be a very strong baseline. Their transition-based system uses two transitions (shift and reduce) to mark the named entities and handles only flat NER, while our system has been designed to handle both nested and flat entities.

Nested Named Entity Recognition. Early work on nested NER, motivated particularly by the GENIA corpus, includes Shen et al. (2003); Alex et al. (2007); Finkel and Manning (2009). Finkel and Manning (2009) also proposed a constituency parsing-based approach. In recent years, we have seen an increasing number of neural models targeting nested NER as well. Ju et al. (2018) suggested an LSTM-CRF model to predict nested named entities; their algorithm iterates until no further entities are predicted. Lin et al. (2019) tackle the problem in two steps: they first detect the entity head, and then they infer the entity boundaries as well as the category of the named entity. Straková et al. (2019) tag nested named entities with a sequence-to-sequence model, exploring combinations of context-based embeddings such as ELMo, BERT, and Flair. Zheng et al. (2019) use a boundary-aware network to solve nested NER. Similar to our work, Sohrab and Miwa (2018) exhaustively enumerate all possible spans up to a defined length by concatenating the LSTM outputs for the start and end positions and then using this to calculate a score for each span. Apart from the different network and word embedding configurations, the main difference between their model and ours is therefore the use of the biaffine model. Due to the biaffine model, we get a global view of the sentence, while Sohrab and Miwa (2018) concatenate the outputs of the LSTMs at possible start and end positions up to a fixed length. Dozat and Manning (2017) demonstrated that the biaffine mapping performs significantly better than just the concatenation of pairs of LSTM outputs.

3 Methods

Our model is inspired by the dependency parsing model of Dozat and Manning (2017). We use both word embeddings and character embeddings as input, feed them into a BiLSTM, and finally pass the output to a biaffine classifier.

Figure 1 shows an overview of the architecture. To encode words, we use both BERT and fastText embeddings (Bojanowski et al., 2016). For BERT, we follow the recipe of Kantor and Globerson (2019) to obtain the context-dependent embeddings for a target token, using 64 surrounding tokens on each side. For the character-based word embeddings, we use a CNN to encode the characters of the tokens. The concatenation of the word embeddings and character-based word embeddings is fed into a BiLSTM to obtain the word representations (x).

Figure 1: The network architectures of our system.

After obtaining the word representations from the BiLSTM, we apply two separate FFNNs to create different representations (h_s and h_e) for the start and end of the spans. Using different representations for the start/end of the spans allows the system to learn to identify the start and end of the spans separately. This improves accuracy compared to a model that directly uses the outputs of the LSTM, since the contexts of the start and end of an entity are different. Finally, we employ a biaffine model over the sentence to create an l × l × c scoring tensor r_m, where l is the length of the sentence and c is the number of NER categories + 1 (for non-entity). We compute the score for a span i by:

r_m(i) = h_s(i)^T U_m h_e(i) + W_m (h_s(i) ⊕ h_e(i)) + b_m

where s_i and e_i are the start and end indices of the span i, U_m is a d × c × d tensor, W_m is a 2d × c matrix and b_m is the bias.
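As an illustration of the scoring step, the following is a minimal NumPy sketch of a biaffine span scorer. The random h_s, h_e and all dimension sizes are stand-ins for the trained BiLSTM/FFNN outputs, not the released implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

l, d, c = 6, 4, 5   # sentence length, FFNN size, NER categories + 1 (non-entity)

# Stand-ins for the per-token start/end representations from the two FFNNs.
h_s = rng.normal(size=(l, d))
h_e = rng.normal(size=(l, d))

# Biaffine parameters: a d x c x d tensor, a 2d x c matrix, and a bias.
U = rng.normal(size=(d, c, d))
W = rng.normal(size=(2 * d, c))
b = rng.normal(size=(c,))

# Bilinear term for every (start, end) pair: bilinear[i, j, m].
bilinear = np.einsum('ia,amb,jb->ijm', h_s, U, h_e)

# Linear term over the concatenated start/end representations h_s(i) ⊕ h_e(j).
hs = np.broadcast_to(h_s[:, None, :], (l, l, d))
he = np.broadcast_to(h_e[None, :, :], (l, l, d))
pair = np.concatenate([hs, he], axis=-1)      # l x l x 2d

scores = bilinear + pair @ W + b              # the l x l x c scoring tensor r_m
assert scores.shape == (l, l, c)
```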

The tensor r_m provides scores for all possible spans that could constitute a named entity, under the constraint that s_i ≤ e_i (the start of an entity is before its end). We assign each span a NER category y′(i):

y′(i) = argmax r_m(i)

We then rank all the spans that have a category other than ”non-entity” by their category scores (r_m(i)) in descending order and apply the following post-processing constraints. For nested NER, an entity is selected as long as it does not clash with the boundaries of higher-ranked entities. We say that an entity i clashes with the boundaries of another entity j if s_i < s_j ≤ e_i < e_j or s_j < s_i ≤ e_j < e_i; e.g., in "the Bank of China", the entity "the Bank of" clashes with the boundaries of the entity "Bank of China", hence only the span with the higher category score will be selected. For flat NER, we apply one more constraint: any entity that contains or is inside an entity ranked before it will not be selected. The learning objective of our named entity recognizer is to assign the correct category (including non-entity) to each valid span. Hence it is a multi-class classification problem, and we optimise our models with softmax cross-entropy:

p(i_c) = exp(r_m(i_c)) / Σ_ĉ exp(r_m(i_ĉ))

loss = −(1/N) Σ_i Σ_c y(i_c) log p(i_c)
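The ranking and boundary-clash constraints above can be sketched as a greedy decoder. This is an illustrative reading of the procedure, not the released implementation; in particular, category index 0 standing for "non-entity" is an assumption of this sketch:

```python
import numpy as np

def decode(scores, flat=False):
    """Greedy span selection: rank entity spans by their best category
    score and keep each span unless it clashes with an already-selected
    span. `scores` is the l x l x c tensor; category 0 = "non-entity"
    (an assumption of this sketch)."""
    l = scores.shape[0]
    candidates = []
    for s in range(l):
        for e in range(s, l):                 # only spans with s <= e
            cat = int(scores[s, e].argmax())
            if cat != 0:                      # skip non-entity spans
                candidates.append((float(scores[s, e, cat]), s, e, cat))
    candidates.sort(reverse=True)             # highest category score first

    selected = []
    for score, s, e, cat in candidates:
        clash = False
        for _, s2, e2, _ in selected:
            # partial overlap clashes for both nested and flat NER
            if s < s2 <= e < e2 or s2 < s <= e2 < e:
                clash = True
            # for flat NER, containment clashes as well
            if flat and ((s <= s2 and e2 <= e) or (s2 <= s and e <= e2)):
                clash = True
        if not clash:
            selected.append((score, s, e, cat))
    return [(s, e, cat) for _, s, e, cat in selected]

# Demo: "Bank of China" (span (0, 2)) outranks the nested "China" (2, 2).
demo = np.zeros((3, 3, 2))   # 3 tokens, categories: 0 = non-entity, 1 = entity
demo[0, 2, 1] = 5.0          # "Bank of China"
demo[2, 2, 1] = 3.0          # nested "China"
assert decode(demo) == [(0, 2, 1), (2, 2, 1)]   # nested NER keeps both
assert decode(demo, flat=True) == [(0, 2, 1)]   # flat NER keeps the outer span
```

The nested/flat distinction is thus confined to the post-processing: the scorer itself is identical in both settings.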

Parameter Value
BiLSTM size 200
BiLSTM layer 3
BiLSTM dropout 0.4
FFNN size 150
FFNN dropout 0.2
BERT size 1024
BERT layer last 4
fastText embedding size 300
Char CNN size 50
Char CNN filter widths [3,4,5]
Char embedding size 8
Embeddings dropout 0.5
Optimiser Adam
learning rate 1e-3
Table 1: Major hyperparameters for our models.

4 Experiments

Data Set. We evaluate our system on both nested and flat NER. For the nested NER task, we use the ace 2004, ace 2005 and genia (Kim et al., 2003) corpora; for flat NER, we test our system on the conll 2002 (Tjong Kim Sang, 2002), conll 2003 (Tjong Kim Sang and De Meulder, 2003) and ontonotes corpora.

For ace 2004 and ace 2005, we follow the same settings as Lu and Roth (2015) and Muis and Lu (2017) and split the data 80%/10%/10% into train, development and test sets respectively. To make a fair comparison, we also used the same documents as Lu and Roth (2015) for each split.

For genia, we use the genia v3.0.2 corpus. We preprocess the dataset following the same settings as Finkel and Manning (2009) and Lu and Roth (2015) and use a 90%/10% train/test split. Since we do not have a development set for this evaluation, we train our system for 50 epochs and evaluate the final model.

For conll 2002 and conll 2003, we evaluate on all four languages (English, German, Dutch and Spanish). Following Lample et al. (2016), we train our system on the concatenation of the train and development sets.

For ontonotes, we evaluate on the English corpus and follow Strubell et al. (2017) in using the same train, development and test split as used in the CoNLL 2012 shared task for coreference resolution (Pradhan et al., 2012).

Evaluation Metric. We report recall, precision and F1 scores for all evaluations. A named entity is considered correct only when both its boundary and its category are predicted correctly.
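This span-level metric can be sketched as follows; the entity labels and the (start, end, category) encoding are illustrative:

```python
def span_prf(gold, pred):
    """Span-level P/R/F1: a predicted entity counts as correct only if
    its boundaries and its category exactly match a gold entity.
    Entities are (start, end, category) triples."""
    gold, pred = set(gold), set(pred)
    tp = len(gold & pred)                       # exact-match true positives
    p = tp / len(pred) if pred else 0.0
    r = tp / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1
```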

Hyperparameters. We use a unified setting for all of the experiments; Table 1 shows the hyperparameters of our system.

Model P R F1
ace 2004
Katiyar and Cardie (2018) 73.6 71.8 72.7
Wang et al. (2018) - - 73.3
Wang and Lu (2018) 78.0 72.4 75.1
Straková et al. (2019) - - 84.4
Luan et al. (2019) - - 84.7
Our model 87.3 86.0 86.7
ace 2005
Katiyar and Cardie (2018) 70.6 70.4 70.5
Wang et al. (2018) - - 73.0
Wang and Lu (2018) 76.8 72.3 74.5
Lin et al. (2019) 76.2 73.6 74.9
Fisher and Vlachos (2019) 82.7 82.1 82.4
Luan et al. (2019) - - 82.9
Straková et al. (2019) - - 84.3
Our model 85.2 85.6 85.4
genia
Katiyar and Cardie (2018) 79.8 68.2 73.6
Wang et al. (2018) - - 73.9
Ju et al. (2018) 78.5 71.3 74.7
Wang and Lu (2018) 77.0 73.3 75.1
Sohrab and Miwa (2018)* 93.2 64.0 77.1
Lin et al. (2019) 75.8 73.9 74.8
Luan et al. (2019) - - 76.2
Straková et al. (2019) - - 78.3
Our model 81.8 79.3 80.5
Table 2: State-of-the-art comparison on the ace 2004, ace 2005 and genia corpora for nested NER. *In Sohrab and Miwa (2018), the last 10% of the training set is used as a development set; we include their result mainly because their system is similar to ours.
Model P R F1
ontonotes
Chiu and Nichols (2016) 86.0 86.5 86.3
Strubell et al. (2017) - - 86.8
Clark et al. (2018) - - 88.8
Fisher and Vlachos (2019) - - 89.2
Our model 90.8 91.8 91.3
conll 2003 English
Chiu and Nichols (2016) 91.4 91.9 91.6
Lample et al. (2016) - - 90.9
Strubell et al. (2017) - - 90.7
Devlin et al. (2019) - - 92.8
Straková et al. (2019) - - 93.4
Our model 93.6 93.3 93.5
conll 2003 German
Lample et al. (2016) - - 78.8
Straková et al. (2019) - - 85.1
Our model 88.2 84.7 86.4
conll 2003 German revised*
Akbik et al. (2018) - - 88.3
Our model 92.3 88.2 90.2
conll 2002 Spanish
Lample et al. (2016) - - 85.8
Straková et al. (2019) - - 88.8
Our model 90.5 90.0 90.3
conll 2002 Dutch
Lample et al. (2016) - - 81.7
Akbik et al. (2019) - - 90.4
Straková et al. (2019) - - 92.7
Our model 94.2 92.7 93.5
Table 3: State-of-the-art comparison on the conll 2002, conll 2003 and ontonotes corpora for flat NER. *The revised version was provided by the shared task organisers in 2006 with more consistent annotations. We confirmed with the authors of Akbik et al. (2018) that they used the revised version.

5 Results on Nested NER

Using the constraints for nested NER, we first evaluate our system on the nested named entity corpora: ace 2004, ace 2005 and genia. Table 2 shows the results. Both ace 2004 and ace 2005 contain 7 NER categories and have a relatively high ratio of nested entities (about a third of the named entities are nested). Our results outperform the previous SoTA systems by 2% (ace 2004) and 1.1% (ace 2005), respectively. genia differs from ace 2004 and ace 2005 in that it uses five medical categories, such as DNA or RNA. For the genia corpus, our system achieved an F1 score of 80.5% and improved the SoTA by 2.2% absolute. Our hypothesis is that for genia the high accuracy gain is due to our structured prediction approach, and that sequence-to-sequence models rely more on the language model embeddings, which are less informative for categories such as DNA and RNA. Our system achieved SoTA results on all three corpora for nested NER, which demonstrates well the advantages of a structured prediction over a sequence labelling approach.

6 Results on Flat NER

We evaluate our system on five corpora for flat NER (conll 2002 (Dutch, Spanish), conll 2003 (English, German) and ontonotes). Unlike most systems, which treat flat NER as a sequence labelling task, our system predicts named entities by considering all possible spans and ranking them. The ontonotes corpus consists of documents from 7 different domains and is annotated with 18 fine-grained named entity categories. Predicting named entities for this corpus is more difficult than for conll 2002 and conll 2003, which use coarse-grained named entity categories (only 4 categories). Sequence labelling models usually perform better on the conll 2003 English corpus (see Table 3), e.g. the systems of Chiu and Nichols (2016) and Strubell et al. (2017). In contrast, our system is less sensitive to the domain and the granularity of the categories. As shown in Table 3, our system achieved an F1 score of 91.3% on the ontonotes corpus, which is very close to our system's performance on the conll 2003 corpus (93.5%). On the multilingual data, our system achieved F1 scores of 86.4% for German, 90.3% for Spanish and 93.5% for Dutch. Our system outperforms the previous SoTA results by a large margin of 2.1%, 1.5%, 1.3% and 0.8% on the ontonotes, Spanish, German and Dutch corpora respectively, and is slightly better than the SoTA on the English data set. In addition, we also tested our system on the revised version of the German data to compare with the model of Akbik et al. (2018); our system again achieved a substantial gain of 1.9% over their system.

7 Conclusion

In this paper, we reformulated NER as a structured prediction task and adopted a SoTA dependency parsing approach for nested and flat NER. Our system uses contextual embeddings as input to a multi-layer BiLSTM. We employ a biaffine model to assign scores to all spans in a sentence, and further constraints are used to predict nested or flat named entities. We evaluated our system on eight named entity corpora. The results show that our system achieves SoTA results on all eight corpora, demonstrating that advanced structured prediction techniques lead to substantial improvements for both nested and flat NER.


This research was supported in part by the DALI project, ERC Grant 695662.


  • B. Alex, B. Haddow, and C. Grover (2007) Recognising nested named entities in biomedical text. In Proc. of BioNLP, pp. 65–72. Cited by: §2.
  • P. Bojanowski, E. Grave, A. Joulin, and T. Mikolov (2016) Enriching word vectors with subword information. arXiv preprint arXiv:1607.04606. Cited by: §3.
  • J. P. Chiu and E. Nichols (2016) Named entity recognition with bidirectional lstm-cnns. Transactions of the Association for Computational Linguistics 4, pp. 357–370. Cited by: §1, §2.
  • J. Devlin, M. Chang, K. Lee, and K. Toutanova (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Annual Conference of the North American Chapter of the Association for Computational Linguistics, Cited by: §1.
  • T. Dozat and C. Manning (2017) Deep biaffine attention for neural dependency parsing. In Proceedings of 5th International Conference on Learning Representations (ICLR), Cited by: Named Entity Recognition as Dependency Parsing.
  • J. R. Finkel and C. D. Manning (2009) Nested named entity recognition. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, Singapore, pp. 141–150. External Links: Link Cited by: Named Entity Recognition as Dependency Parsing, §1, §2.
  • M. Ju, M. Miwa, and S. Ananiadou (2018) A neural layered model for nested named entity recognition. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), New Orleans, Louisiana, pp. 1446–1459. External Links: Link, Document Cited by: §1.
  • B. Kantor and A. Globerson (2019) Coreference resolution with entity equalization. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, pp. 673–677. External Links: Document Cited by: §3.
  • J.-D. Kim, T. Ohta, Y. Tateisi, and J. Tsujii (2003) GENIA corpus—a semantically annotated corpus for bio-textmining. Bioinformatics 19 (suppl 1), pp. i180–i182. Cited by: §4.
  • G. Lample, M. Ballesteros, S. Subramanian, K. Kawakami, and C. Dyer (2016) Neural architectures for named entity recognition. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 260–270. External Links: Document, Link Cited by: §1, §2.
  • X. Ma and E. Hovy (2016) End-to-end sequence labeling via bi-directional LSTM-CNNs-CRF. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Berlin, Germany, pp. 1064–1074. External Links: Link, Document Cited by: §2.
  • M. E. Peters, M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee, and L. S. Zettlemoyer (2018) Deep contextualized word representations. In Proceedings of the 2018 Annual Conference of the North American Chapter of the Association for Computational Linguistics, Cited by: §1, §2.
  • S. Pradhan, A. Moschitti, N. Xue, O. Uryupina, and Y. Zhang (2012) CoNLL-2012 shared task: modeling multilingual unrestricted coreference in OntoNotes. In Proceedings of the Sixteenth Conference on Computational Natural Language Learning (CoNLL 2012), Jeju, Korea. Cited by: §4.
  • D. Shen, J. Zhang, G. Zhou, J. Su, and C. Tan (2003) Effective adaptation of a Hidden Markov Model-based Named Entity Recognizer for the biomedical domain. In Proceedings of the ACL 2003 Workshop on Natural Language Processing in Biomedicine, Cited by: §2.
  • M. G. Sohrab and M. Miwa (2018) Deep exhaustive model for nested named entity recognition. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, pp. 2843–2849. External Links: Link, Document Cited by: §1.
  • J. Straková, M. Straka, and J. Hajic (2019) Neural architectures for nested NER through linearization. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, pp. 5326–5331. External Links: Link, Document Cited by: §1.
  • E. F. Tjong Kim Sang and F. De Meulder (2003) Introduction to the CoNLL-2003 shared task: language-independent named entity recognition. In Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003, pp. 142–147. External Links: Link Cited by: §4.
  • E. F. Tjong Kim Sang (2002) Introduction to the CoNLL-2002 shared task: language-independent named entity recognition. In COLING-02: The 6th Conference on Natural Language Learning 2002 (CoNLL-2002), External Links: Link Cited by: §4.