Nested Named Entity Recognition via Second-best Sequence Learning and Decoding

09/05/2019 · Takashi Shibuya et al. · Carnegie Mellon University / Sony

When an entity name contains other names within it, the identification of all combinations of names can become difficult and expensive. We propose a new method to recognize not only outermost named entities but also inner nested ones. We design an objective function for training a neural model that treats the tag sequence for nested entities as the second-best path within the span of their parent entity. In addition, we provide a decoding method for inference that extracts entities iteratively, from outermost ones to inner ones, in an outside-to-inside way. Our method has no additional hyperparameters beyond those of the conditional random field based model widely used for flat named entity recognition tasks. Experiments demonstrate that our method outperforms existing methods capable of handling nested entities, achieving F1-scores of 82.81 and 82.70.

1 Introduction

Named entity recognition (NER) is the task of identifying text spans associated with proper names and classifying them according to their semantic class such as person or organization. NER, or in general the task of recognizing entity mentions, is one of the first stages in deep language understanding and its importance has been well recognized in the NLP community Nadeau and Sekine (2007).

One popular approach to the task of NER is to regard it as a sequence labeling problem. In this case, it is implicitly assumed that mentions are not nested in texts. However, names often contain entities nested within themselves, as illustrated in Fig. 1, which contains 3 mentions of the same type (PROTEIN) in the span “… in Ca2+ -dependent PKC isoforms in …”, taken from the GENIA dataset Kim et al. (2003). Name nesting is common, especially in technical domains Alex et al. (2007); Byrne (2007); Wang (2009). The assumption of no nesting leads to loss of potentially important information and may negatively impact subsequent downstream tasks. For instance, a downstream entity linking system that relies on NER may fail to link the correct entity if the entity mention is nested.

Figure 1: Example of nested entities.

Various approaches to recognizing nested entities have been proposed. Many of them rely on producing and rating all possible (sub)spans, which can be computationally expensive. Wang and Lu (2018) provided a hypergraph-based approach to consider all possible spans. Sohrab and Miwa (2018) proposed a neural exhaustive model that enumerates and classifies all possible spans. These methods, however, achieve high performance at the cost of time complexity. To reduce the running time, they set a threshold to discard longer entity mentions. If the hyperparameter is set low, running time is reduced but longer mentions are missed. In contrast, Muis and Lu (2017) proposed a sequence labeling approach that assigns tags to gaps between words, which efficiently handles sequences using Viterbi decoding. However, this approach suffers from structural ambiguity issues during inference as explained by Wang and Lu (2018). Katiyar and Cardie (2018) proposed another hypergraph-based approach that learns the structure in a greedy manner. However, their method uses an additional hyperparameter as the threshold for selecting multiple mention candidates. This hyperparameter affects the trade-off between recall and precision.

In this paper, we propose new learning and decoding methods to extract nested entities without any additional hyperparameters. We summarize our contributions as:

  • We describe a decoding method that iteratively recognizes entities from outermost ones to inner ones without structural ambiguity. It recursively searches a span of each extracted entity for inner nested entities using the Viterbi algorithm. This algorithm does not require hyperparameters for the maximal length or number of mentions considered.

  • We also provide a novel learning method that ensures the aforementioned decoding. Models are optimized based on an objective function designed according to the decoding procedure.

  • Empirically, we demonstrate that our method outperforms the current state-of-the-art methods in F1-score on three standard datasets: ACE-2004 (https://catalog.ldc.upenn.edu/LDC2005T09), ACE-2005 (https://catalog.ldc.upenn.edu/LDC2006T06), and GENIA.

Figure 2: Overview of our second-best path decoding algorithm to iteratively find nested entities.

2 Related Work

The success of neural networks has boosted the performance of NER Strubell et al. (2017); Akbik et al. (2018). However, few attempts have explored nested entity recognition.

Alex et al. (2007) proposed several ways to combine multiple CRFs for such tasks. They separately built inside-out and outside-in layered CRFs that used the current guesses as the input for the next layer. They also cascaded separate CRFs of each entity type by using the output from the previous CRF as the input features of the current CRF, which yielded the best performance in their work. However, their method could not handle nested entities of the same entity type. In contrast, Ju et al. (2018) dynamically stacked multiple layers that recognize entities sequentially from innermost ones to outermost ones. Their method can deal with nested entities of the same entity type.

Finkel and Manning (2009) proposed a CRF-based constituency parser for this task such that each named entity is a node in the parse tree. Their model achieved state-of-the-art performance on the GENIA dataset at the time. However, its time complexity is cubic in the length of a given sentence, making it not scalable to large datasets involving long sentences. Later on, Wang et al. (2018) proposed a scalable transition-based approach that constructs a constituency forest (a collection of constituency trees). Its time complexity is linear in the sentence length.

Lu and Roth (2015) introduced a mention hypergraph representation for capturing nested entities as well as crossing entities (two entities overlap but neither is contained in the other). One issue in their approach is the spurious structures of the representation. To address the spurious structures issue, Muis and Lu (2017) incorporated mention separators to yield better performance on nested entities. On the other hand, it still suffers from the structural ambiguity issue. Wang and Lu (2018) proposed a hypergraph representation free of structural ambiguity, and their method achieved state-of-the-art performance on the ACE-2004 and ACE-2005 datasets. However, they introduced a hyperparameter, the maximal length of an entity, to reduce the time complexity. Setting the hyperparameter to a small number results in speeding up but ignoring longer entity segments.

Katiyar and Cardie (2018) proposed another hypergraph-based approach that learns the structure using an LSTM network in a greedy manner. However, their method has a hyperparameter that sets a threshold for selecting multiple candidate mentions. It must be carefully tuned for adjusting the trade-off between recall and precision.

Sohrab and Miwa (2018) proposed a neural exhaustive model that enumerates all possible spans as potential entity mentions and classifies them. Their model achieved state-of-the-art performance on the GENIA dataset in terms of F1-score. However, they also use the maximal-length hyperparameter to reduce time complexity.

We show that our model outperforms all these methods on three datasets: ACE-2004, ACE-2005, and GENIA. In addition, our model has no additional hyperparameters beyond those of the conventional CRF-based model commonly used for flat NER. Also, our decoding algorithm is fast in practice, as shown in Subsection 5.4.

3 Method

We first explain our usage of CRF, which is the base of our decoding and training methods. Then, we introduce our decoding and training methods. Our decoding and training methods focus on the output layer of neural architectures and therefore can be combined with any neural model.

3.1 Usage of CRF

We first explain two key points about our usage of CRFs. The first key point is that we prepare a separate CRF for each named entity type. This enables our method to handle the situation where the same mention span is assigned multiple entity types. The second one is that each element of the transition matrix of each CRF has a fixed value according to whether it corresponds to a legal transition (e.g., B-XXX to I-XXX, where XXX is the name of an entity type) or an illegal one (e.g., O to I-XXX): legal transitions are fixed to 0, and illegal ones to −∞. This is helpful for keeping the scores of tag sequences that include outer entities higher than those of tag sequences that include inner entities.
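The fixed transition matrix described above can be sketched in a few lines. The tag names and the helper function below are illustrative, not the released implementation; it assumes legal transitions are fixed to 0 and illegal ones to −∞.

```python
import math

def make_transition_matrix(tags):
    """Fixed (untrained) transition scores for one entity type's IOBES
    tagger: 0 for legal transitions, -inf for illegal ones.
    A sketch of the restriction described above, not the authors' code."""
    def legal(prev, cur):
        # I-XXX / E-XXX may only follow B-XXX or I-XXX.
        if cur in ("I", "E"):
            return prev in ("B", "I")
        # O, B-XXX, S-XXX may follow any tag that does not leave an
        # entity unfinished (i.e., anything but B or I).
        return prev not in ("B", "I")
    return {(p, c): (0.0 if legal(p, c) else -math.inf)
            for p in tags for c in tags}

A = make_transition_matrix(["O", "B", "I", "E", "S"])
```

A separate matrix of this shape would be built per entity type; since every legal entry shares the same value, the matrix never has to be learned.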

Formally, we use h = (h_1, …, h_n) to represent a sequence output from the last hidden layer of a neural model, where h_t is the vector for the t-th word, and n is the number of tokens. y^k = (y^k_1, …, y^k_n) represents a sequence of IOBES tags of entity type k for h. Here, we define the score function to be

score(h, y^k) = Σ_{t=1}^{n} ( (W^k h_t + b^k)_{y^k_t} + A_{y^k_{t−1}, y^k_t} ),   (1)

where W^k and b^k denote the weight matrix and the bias vector corresponding to entity type k, respectively. A stands for the transition matrix from the previous token to the current token, and A_{i,j} is the transition score from tag i to tag j.
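As a concrete reading of Eq. 1, the score of a tag sequence is the sum of per-position emission scores and transition scores. The sketch below assumes the emission scores P (the rows W^k h_t + b^k) have been precomputed as dense lists; the names are illustrative, not tied to any particular neural encoder.

```python
def score(P, A, y):
    """Score of tag sequence y (Eq. 1): emission scores plus transition
    scores. P[t][j] is the emission score of tag j at position t, and
    A[i][j] is the (fixed) transition score from tag i to tag j."""
    return (sum(P[t][y[t]] for t in range(len(y)))
            + sum(A[y[t - 1]][y[t]] for t in range(1, len(y))))
```

Because all legal transition entries are equal, the transition term contributes the same constant to every legal path of a given length, which is what the decoding argument in Subsection 3.2 relies on.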

[Algorithm: pseudo-code of our decoding method]

3.2 Decoding

We employ three strategies for decoding. First, we consider each entity type separately, using multiple CRFs in decoding, which makes it possible to handle the situation where the same mention span is assigned multiple entity types. Second, our decoder searches for nested entities in an outside-to-inside way (our usage of inside/outside is unrelated to the inside-outside algorithm in dynamic programming), which realizes efficient processing by eliminating non-entity spans at an early stage. More specifically, our method recursively narrows down the spans to Viterbi-decode. The spans to Viterbi-decode are dynamically decided according to the preceding Viterbi-decoding result: only the spans that have just been recognized as entity mentions are Viterbi-decoded again. Third, we use the same scores of Eq. 1 to extract outermost entities and even inner entities without re-encoding, which makes inference more efficient and faster. These three strategies are deployed and completed only in the output layer of the neural architecture.

We describe the pseudo-code of our decoding method above and depict an overview of the method with an example in Fig. 2. We use the term level to mean the depth of entity nesting. [S] and [E] in Fig. 2 stand for the START and END tags, respectively. We always attach these tags to both ends of every sequence of IOBES tags in Viterbi decoding.

We explain the decoding procedure and its mechanism in detail below. We consider each entity type separately and iterate the same decoding process over the distinct entity types, as described in the pseudo-code above. In the decoding process for each entity type, we first calculate the CRF scores over the entire sentence. Next, we decode a sequence with the standard 1-best Viterbi decoding, as with the conventional linear-chain CRF. Regarding the example of Fig. 2, "Ca2+ -dependent PKC isoforms" is extracted at the 1st level.

Then, we start our recursive decoding to extract nested entities within previously extracted entity spans by finding the 2nd best path. In Fig. 2, the span "Ca2+ -dependent PKC isoforms" is processed at the 2nd level. Here, if we searched for the best path within each span, the same tag sequence would be obtained, even though the processed span is different. This is because we continue using the same scores and because all the values of the transition matrix corresponding to legal transitions are equal to 0. Regarding the example of Fig. 2, the score of the transition from [S] to B-P at the 2nd level is equal to the score of the transition from O to B-P at the 1st level. The same holds for the transition from E-P to [E] at the 2nd level and the one from E-P to O at the 1st level. The best path between the [S] and [E] tags is therefore identical to the best path between the two O tags under our restriction on the transition matrix of the CRF. Hence, we search for the 2nd best path within the span by utilizing the k-best Viterbi A* algorithm Seshadri and Sundberg (1994); Huang et al. (2012). (Without our restriction on the transition matrix, we would have to track both the best path and the 2nd best path.) Note that our situation is different from normal situations where k-best decoding is needed: we already know the best path within the span and want to find only the 2nd best path. Thus, we can extract nested entities by finding the 2nd best path within each extracted entity. Regarding the example of Fig. 2, "PKC isoforms" is extracted from the span "Ca2+ -dependent PKC isoforms" at the 2nd level.
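For intuition, the best and 2nd-best paths of a short span can be found by brute force. The sketch below enumerates every legal IOBES sequence for a hypothetical 3-token span with made-up emission scores (legal transitions score 0, so a path's score reduces to its emission sum); the k-best Viterbi A* search computes the same 2nd-best path without enumeration.

```python
import itertools

TAGS = ["O", "B", "I", "E", "S"]

def legal(prev, cur):
    # Fixed IOBES restriction: I/E may only follow B/I.
    if cur in ("I", "E"):
        return prev in ("B", "I")
    return prev not in ("B", "I")

def legal_seq(seq):
    # [S] and [E] behave like O at the span boundaries: a span may not
    # start with I/E nor end with B/I.
    return (seq[0] not in ("I", "E")
            and seq[-1] not in ("B", "I")
            and all(legal(p, c) for p, c in zip(seq, seq[1:])))

def two_best(emissions):
    """Brute-force the best and 2nd-best legal tag sequences.
    Viable only for tiny spans; it illustrates what the k-best Viterbi A*
    search computes efficiently."""
    n = len(emissions)
    scored = sorted(
        ((sum(emissions[t][y] for t, y in enumerate(seq)), seq)
         for seq in itertools.product(TAGS, repeat=n) if legal_seq(seq)),
        reverse=True)
    return scored[0], scored[1]

# Hypothetical emission scores for a 3-token span.
emissions = [
    {"O": 1.0, "B": 3.0, "I": -1.0, "E": -1.0, "S": 0.0},
    {"O": 0.0, "B": 0.0, "I": 3.0, "E": -1.0, "S": 1.5},
    {"O": 0.0, "B": -1.0, "I": -1.0, "E": 3.0, "S": 1.0},
]
best, second = two_best(emissions)
```

With these made-up scores, the best path tags the whole span as one entity (B I E), while the 2nd-best path (O B E) reveals a nested entity over the last two tokens, mirroring the "PKC isoforms" example.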

We continue this recursive decoding at deeper levels. In Fig. 2, the span "PKC isoforms" is processed at the 3rd level. At the 3rd or deeper levels, the tag sequence of the grandparent level is no longer a candidate for either the best path or the 2nd best path, because the start or end position of the current span is in the middle of the entity mention at the grandparent level. As for the example shown in Fig. 2, "PKC" is tagged I-P at the 1st level, and the transition from [S] to I-P is illegal. The scores of paths that include illegal transitions cannot be larger than those of paths that consist of only legal transitions, because the elements of the transition matrix corresponding to illegal transitions are set to −∞. That is why, at all levels below the 1st level, we only need to find the 2nd best path.

This recursive processing is stopped when no entities are predicted or when only single-token entities are detected within a span. In Fig. 2, the span “PKC” is not processed any more because it is a single-token entity. The aforementioned processing is executed on all entity types, and all detected entities are returned as an output result.

3.3 Training

To extract entities from outside to inside successfully, a model has to be trained in a way that the scores for the paths including outer entities will be higher than those for the paths including inner entities. We propose a new objective function to achieve this requirement.

Input: P (CRF scores), A (transition matrix);
s ← the start position;
e ← the end position;
foreach tag j do
        α[s][j] ← P[s][j];
for t ← s + 1; t ≤ e; t ← t + 1 do
        foreach tag j do
               foreach tag i do
                      tmp[i] ← α[t − 1][i] + A[i][j];
               α[t][j] ← logsumexp_i(tmp[i]) + P[t][j];
foreach tag j do
        β[j] ← α[e][j];
return logsumexp_j(β[j]);
Algorithm 1 LogSumExp of the scores of all possible paths

We maximize the log-likelihood of the correct tag sequences as with the conventional CRF-based model. Considering that our model has a separate CRF for each entity type, the log-likelihood for one training example is as follows:

l(θ) = Σ_k log p(Y^k | h; θ),   (2)

where θ is the set of parameters of the neural model, and Y^k denotes the collection of the gold IOBES tags for all levels regarding entity type k.

Let s^k_{i,j} and e^k_{i,j} denote the start and end positions of the j-th span at the i-th level for entity type k. With regard to the 1st level, s^k_{1,1} = 1 and e^k_{1,1} = n because we consider the whole span of a sentence. The spans considered at each deeper level (i ≥ 2) are determined according to the spans of multi-token entities at the immediate parent level. As for the example of Fig. 2, only the span of "Ca2+ -dependent PKC isoforms" is considered at the 2nd level. Here, the log-likelihood for each entity type can be expressed as follows:

log p(Y^k | h; θ) = l_best(y^k_{1,1}) + Σ_{i≥2} Σ_j l_2nd(y^k_{i,j}),   (3)

where l_best and l_2nd are the log-likelihoods of the (1st) best and 2nd best paths for each span, respectively, and y^k_{i,j} denotes the correct IOBES tag sequence for the j-th span at the i-th level of entity type k.

Figure 3: Divided search spaces.

Best path. l_best can be calculated in the same manner as in the conventional linear-chain CRF:

l_best(y) = score(h_{s:e}, y) − log Σ_{ŷ ∈ Y_{s:e}} exp( score(h_{s:e}, ŷ) ),   (4)

where Y_{s:e} denotes the set of all possible tag sequences from position s to position e of the entity type. It is well known that the second term of Eq. 4 can be efficiently calculated by the forward algorithm shown in Algorithm 1.

Figure 4: Lattice and best path.

2nd best path. l_2nd given the best path can be calculated by excluding the best path from all possible paths. This concept is also adopted by ListNet Cao et al. (2007), which is used for ranking tasks such as document retrieval or recommendation. l_2nd can be expressed by the following equation:

l_2nd(y) = score(h_{s:e}, y) − log Σ_{ŷ ∈ Y_{s:e} \ {y_best}} exp( score(h_{s:e}, ŷ) ),   (5)

where Y_{s:e} \ {y_best} denotes the set of all possible tag sequences except the best path within the span from position s to position e of the entity type.

However, to the best of our knowledge, a way of efficiently computing the second term of Eq. 5 has not been proposed in the literature. Simply subtracting the exponential score of the best path from the summation of the exponential scores of all possible paths causes underflow, overflow, or loss of significant digits. We introduce a way of accurately computing it with the same time complexity as Eq. 4. For explanation, we use the simplified example of the lattice depicted in Fig. 4, in which the span length is 4 and the number of states is 3. The special nodes for the start and end states are attached to both ends of the span. There are 3^4 = 81 paths in this lattice. We assume that the path that consists of the top nodes of all time steps is the best path, as shown in Fig. 4. No generality is lost by making this assumption. To calculate the second term of Eq. 5, we have to consider the exponential scores of all the possible paths except the best path, i.e., 80 paths.

Figure 5: Merge of search spaces.

We first give a way of thinking about the calculation, which is not our algorithm itself but is helpful for understanding it. In the example, we can group these 80 paths according to the first time step at which the best path is not taken. In this way, we have 4 spaces in total, as illustrated in Fig. 3. In Space 1, the top node of time step 1 is excluded from consideration. 2 × 3^3 = 54 paths are taken into account here. Since this space covers all paths that do not go through the top node of time step 1, we only have to consider the paths that go through this node in the other spaces. In Space 2, this node is always passed through, and instead the top node of time step 2 is excluded. 2 × 3^2 = 18 paths are considered in this space. Similarly, 2 × 3 = 6 paths and 2 paths are taken into consideration in Space 3 and Space 4, respectively. Thus, we can consider all the possible paths except the best path: 54 + 18 + 6 + 2 = 80 paths. However, this is not our algorithm itself, as we mentioned.
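The grouping into spaces can be checked mechanically: enumerating the 4-step, 3-state lattice and bucketing each non-best path by the first step at which it leaves the best path (assumed here to be state 0 at every step) reproduces the four space sizes. A small sanity-check sketch:

```python
import itertools

# Group the 3**4 - 1 = 80 non-best paths of a 4-step, 3-state lattice by
# the first step at which they deviate from the assumed best path
# (state 0 everywhere), reproducing Spaces 1-4 described above.
best = (0, 0, 0, 0)
counts = {}
for path in itertools.product(range(3), repeat=4):
    if path == best:
        continue
    first_dev = next(t for t in range(4) if path[t] != 0)
    counts[first_dev] = counts.get(first_dev, 0) + 1
# counts == {0: 54, 1: 18, 2: 6, 3: 2}
```

The buckets are disjoint and exhaustive, which is exactly why summing the four spaces covers every path except the best one.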

We introduce two tricks for making the calculation more efficient. We explain them with Fig. 5, which focuses on Spaces 2 and 3. The first trick is that the two separated spaces can be merged at the time step from which the remaining paths in both spaces are identical. When we reach that time step in the forward iteration in each of the two spaces, we can merge them using the calculation results of the preceding time step, as shown with the red edges in Fig. 5. The second trick is that the blue nodes in Fig. 5 can be copied from Space 2 to Space 3, since the paths considered up to that time step are also the same. These two tricks can be applied to the other pairs of adjacent spaces, which removes the need to separately calculate the summation of the exponential scores for each space. Therefore, the second term of Eq. 5 can be calculated as shown in Algorithm 2.

Input: P (CRF scores), A (transition matrix);
s ← the start position;
e ← the end position;
y* ← the best path;
b ← P[s][y*_s];  # score of the best-path prefix
foreach tag j do
        if j = y*_s then d[s][j] ← −∞ else d[s][j] ← P[s][j];
for t ← s + 1; t ≤ e; t ← t + 1 do
        foreach tag j do
               foreach tag i do
                      tmp[i] ← d[t − 1][i] + A[i][j];
               if j ≠ y*_t then
                      # prefixes that follow the best path until t − 1 and deviate at t
                      tmp[y*_{t−1}] ← logsumexp(tmp[y*_{t−1}], b + A[y*_{t−1}][j]);
               d[t][j] ← logsumexp_i(tmp[i]) + P[t][j];
        b ← b + A[y*_{t−1}][y*_t] + P[t][y*_t];
foreach tag j do
        β[j] ← d[e][j];
return logsumexp_j(β[j]);
Algorithm 2 LogSumExp of the scores of all possible paths except the best path

Thus, we can train a model using the objective function of Eqs. 2, 3, 4, and 5.

                      ACE-2005                               GENIA
                Train      Dev      Test          Train      Dev      Test
# documents       370       43        51              -        -         -
# sentences    (7,214)    (954)   (1,054)         15,022    1,669     1,855
# mentions     24,818    3,234     3,038          47,027    4,469     5,600
  - 1st level  21,959 (88%)  2,900 (90%)  2,683 (88%)   44,611 (95%)  4,239 (95%)  5,273 (94%)
  - 2nd level   2,634 (11%)    316 (10%)    323 (11%)    2,393 (5%)     230 (5%)     327 (6%)
  - 3rd level     214 (1%)      18 (1%)      30 (1%)        23 (0%)       0 (0%)       0 (0%)
  - 4th level       9 (0%)       0 (0%)       2 (0%)         0 (0%)       0 (0%)       0 (0%)
# labels per token  1.06     1.05     1.05            1.05     1.05      1.05
Table 1: Statistics of the datasets used in the experiments. Note that in ACE-2005, sentences are not originally split. We report the numbers of sentences based on the preprocessing with the Stanford CoreNLP Manning et al. (2014).

3.4 Characteristics

We explain some theoretical characteristics of our method here. Regarding the time complexity of the decoder, the worst case for our method is when the decoder narrows down the spans one token at a time, from n tokens (a whole sentence) to one token. The time complexity of the worst case is therefore O(n^2) for each entity type, O(Kn^2) in total, where K denotes the number of entity types. However, this rarely happens. The average processing time in the case where our decoding method narrows down spans successfully according to gold labels is O(āKn), where ā is the average number of gold IOBES tags of each entity type assigned to a word. The average numbers calculated from the gold labels are close to 1 for all three datasets (e.g., 1.06 for ACE-2005 and 1.05 for GENIA, as shown in Table 1). For these datasets, the expected processing time is thus nearly equal to O(Kn).

Further, our method focuses on the output layer of neural architectures; therefore our method can be combined with any neural model. While some existing methods have hyperparameters beyond those of the conventional CRF-based model used for flat NER tasks, our method does not. Wang et al. (2018)’s method and Sohrab and Miwa (2018)’s method use a hyperparameter for the maximal length of considered entities. Katiyar and Cardie (2018)’s method uses a threshold, which affects the number of detected entities. These hyperparameters must be tuned depending on datasets. Our method is easy to use from this viewpoint.

We verify the empirical performances of our methods in the successive sections.

4 Experimental Settings

4.1 Datasets

We perform nested entity extraction experiments on ACE-2005 Doddington et al. (2004) and GENIA Kim et al. (2003). For ACE-2005, we use the same splits of documents as Lu and Roth (2015), published on their website (http://www.statnlp.org/research/ie). For GENIA, we use GENIAcorpus3.02p (http://www.geniaproject.org/genia-corpus/pos-annotation), in which sentences are already tokenized Tateisi and Tsujii (2004). Following previous work Finkel and Manning (2009); Lu and Roth (2015), we first split off the last 10% of sentences as the test set. Next, we use the first 81% and the subsequent 9% for the training and development sets, respectively. We make the same modifications as described by Finkel and Manning (2009) by collapsing all DNA, RNA, and protein subtypes into DNA, RNA, and protein, keeping cell line and cell type, and removing other entity types, resulting in 5 entity types. The statistics of each dataset are shown in Table 1.

Hyperparameter Value
word dropout rate
character embedding dimension
CNN window size
CNN filter number
batch size
LSTM hidden size
LSTM dropout rate
gradient clipping
Table 2: Hyperparameters in our experiments.

4.2 Model and Training

In this study, we adopt as our baseline a BiLSTM-CRF model, which is widely used for NER tasks Lample et al. (2016); Chiu and Nichols (2016); Ma and Hovy (2016), and apply our usage of CRFs to it. We prepare two types of models for our experiments. The first is the model combined with conventional word embeddings (https://github.com/yahshibu/nested-ner-2019), which is prepared for fairly comparing our method with existing methods. For ACE-2005, we initialize word embeddings with the pretrained GloVe embeddings Pennington et al. (2014) of dimension 100. For GENIA, we initialize each word with the pretrained embeddings trained on MEDLINE abstracts Chiu et al. (2016). The initialized word embeddings are fixed during training. Words that do not appear in the embedding vocabulary but appear in the training set are initialized with zero vectors and tuned during training. We also adopt a CNN-based character-level representation Chiu and Nichols (2016); Ma and Hovy (2016). The vectors of the word embeddings and the character-level representation are concatenated and then input into the BiLSTM layer. The second model is combined with the pretrained BERT model Devlin et al. (2018) (https://github.com/yahshibu/nested-ner-2019-bert). We use the cased version of the BERT large model as a contextual word embedding generator without fine-tuning and stack the BiLSTM layers on top of the BERT model. Both models have 2 BiLSTM hidden layers, and the dimensionality of each hidden unit is 200 in all our experiments. Table 2 lists the hyperparameters used for our experimental evaluations. We adopt AdaBound Luo et al. (2019) as the optimizer. Early stopping is used based on the performance on the development set. We repeat each experiment 5 times with different random seeds and report the average and standard deviation of F1-scores on the test set as the final performance.

ACE-2005 GENIA
Method Precision () Recall () F1 () Precision () Recall () F1 ()
Finkel and Manning (2009) - - -
Muis and Lu (2017)
Katiyar and Cardie (2018)
Ju et al. (2018) (note that in ACE-2005, Ju et al. (2018) conducted their experiments with a different split from that of Lu and Roth (2015), which we follow)
Wang and Lu (2018)
Wang et al. (2018) (Wang et al. (2018) did not report precision and recall scores; Wang and Lu (2018) reported these scores for the model of Wang et al. (2018))
Sohrab and Miwa (2018) - - -
This work
This work [BERT]
Table 3: Main results. The last two lines show our proposed method.

5 Experimental Results

ACE-2005 GENIA
Precision () Recall () F1 () Precision () Recall () F1 ()
This work
    L
    L&D
Table 4: Results when ablating away the learning (L) and decoding (D) components of our proposed method.

5.1 Comparison with Existing Methods

Table 3 presents comparisons of our model with existing methods. Note that some existing methods use embeddings of POS tags as an additional input feature whereas our method does not. Our method outperforms the existing methods in terms of F1-score on both datasets when combined with conventional word embeddings. In particular, our method brings much higher recall values than the other methods on both the ACE-2005 and GENIA datasets. These results demonstrate that our training and decoding algorithms are quite effective for extracting nested entities. Moreover, when we use BERT as contextual word embeddings, we achieve a further improved F1-score on ACE-2005. This result shows that BERT is very powerful for this task, too. On the other hand, BERT does not perform well on GENIA. We assume that this is because the domain of GENIA is quite different from that of the corpus used for training the BERT model. Regardless, our method outperforms existing methods.

5.2 Ablation Study

We conduct an ablation study to verify the effectiveness of our learning and decoding methods. We first replace our objective function for training with the standard objective function of the linear-chain CRF. Methods for decoding k-best paths have been well studied because such algorithms are required in many domains Soong and Huang (1990); Kaji et al. (2010); Huang et al. (2012). However, we hypothesize that our learning method, as well as our decoding method, helps to improve performance. That is why we first remove only our learning method. Then, we also replace our decoding algorithm with the standard decoding algorithm of the linear-chain CRF, which is equivalent to preparing a conventional CRF for each entity type separately.

The results are shown in Table 4. They demonstrate that introducing only our decoding algorithm surely brings high recall but hurts precision, which suggests that our learning method is necessary for achieving high precision. Additionally, removing the decoding algorithm results in lower recall. This is natural because the ablated model does not attempt to find nested entities after extracting the outermost ones. Thus, both our learning and decoding algorithms contribute substantially to good performance across these datasets.

ACE-2005 GENIA
Level Recall (%) Num. Recall (%) Num.
1st 2,683 5,273
2nd 323 327
3rd 30 - 0
4th 2 - 0
Table 5: Recall scores for gold annotations of each level.
ACE-2005 GENIA
Level Precision (%) Num. Precision (%) Num.
1st 2,582 5,079
2nd 359 397
3rd 52 6
4th 9 - 0
Table 6: Precision scores for predictions of each level of one trial.

5.3 Analysis of Behavior

To further understand how our method handles nested entities, we investigate the performance for entities at each level. Table 5 shows the recall scores for gold entities of each level when using conventional word embeddings. Among all levels, our model performs best at the 1st level, which consists of only gold outermost entities. The deeper the level, the lower the recall score. On the other hand, Table 6 shows the precision scores for predicted entities at each level from one trial on each dataset. Because the number of levels in the predictions varies between trials, taking the macro average of precision scores over multiple trials is not representative. Therefore, we show only the precision scores from one trial in Table 6. The precision score for the 4th level on ACE-2005 is as high as or higher than those of the other levels. Precision scores are less dependent on the level. This tendency also appears in the other trials.

5.4 Running Time

We examine the decoding speed in terms of the number of words processed per second. We compare our model with the models of Wang and Lu (2018) and Wang et al. (2018) using their implementations. The platform is the same (PyTorch), and the machine on which we run them is also the same (CPU: Intel i7, 2.7 GHz). Results on ACE-2005 are listed in Table 7. Our model turns out to be faster than theirs, demonstrating its scalability.

Method # tokens per second
Wang and Lu (2018) 459
Wang et al. (2018) 2,059
This work 5,594
Table 7: Decoding speed on ACE-2005.

5.5 Comparison on ACE-2004

Method P () R () F1 ()
Muis and Lu (2017)
Katiyar and Cardie (2018)
Wang and Lu (2018)
Wang et al. (2018) (Wang et al. (2018) did not report precision and recall scores; Wang and Lu (2018) reported these scores for the model of Wang et al. (2018))
This work
This work [BERT]
Table 8: Comparison on ACE-2004. The last two lines show our proposed method.

We compare our method with existing methods also on the ACE-2004 dataset. We use the same splits as Lu and Roth (2015). The setups are the same as those of our experiment on ACE-2005. Table 8 shows the results. As shown, our method significantly outperforms existing methods. Note that most of them use embeddings of POS tags as an additional input feature whereas our method does not.

5.6 Error Analysis

We manually scan the test set predictions on ACE-2005. We find out that many of the errors can be classified into two types.

The first type is partial prediction error. Consider the following sentence: "Let me set aside the hypocrisy of a man who became president because of a lawsuit trying to eliminate everybody else's lawsuits, but instead focus on his own experience". Our model predicts "a man who became president" or "a man who became president because of a lawsuit trying to eliminate everybody else's lawsuits" as PER (Person), while the annotation marks "a man who became president because of a lawsuit". It is difficult for our model to extract the proper spans of clauses or independent sentences that contain numerous modifiers.

The second type is errors involving pronominal mentions, as reported by Katiyar and Cardie (2018). Consider the following example: “They roar, they screech.” These “They”s refer to “tanks” in another sentence of the same document and are therefore annotated as VEH (Vehicle). Our model fails to detect these pronominal mentions or wrongly labels them as PER. Document-level context would need to be taken into account to solve this problem.

6 Conclusion

We propose learning and decoding methods for extracting nested entities. Our decoding method iteratively recognizes entities from outermost ones to inner ones in an outside-to-inside way, recursively searching the span of each extracted entity for nested entities with second-best sequence decoding. We also design an objective function for training that is consistent with this decoding. Our method has no hyperparameters beyond those of conventional CRF-based models. It achieves F1-scores of 82.81% and 82.70% on the ACE-2004 and ACE-2005 datasets, respectively, and also performs well on GENIA. We also demonstrate that our decoding method is faster than existing methods.
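The core operation of this decoding, finding the best and second-best tag sequences over a span, can be sketched as a k-best Viterbi search (here k = 2). The sketch below is a minimal illustration with hypothetical toy emission and transition scores; the span-level recursion and tag scheme of our full method are omitted.

```python
import numpy as np

def viterbi_k_best(emissions, transitions, k=2):
    """Return the k highest-scoring tag sequences (k-best Viterbi).

    emissions: (T, K) array of per-token tag scores.
    transitions: (K, K) array of tag-to-tag transition scores.
    Each dynamic-programming cell keeps its top-k derivations,
    each with a backpointer (previous tag, rank in that cell).
    """
    T, K = emissions.shape
    # scores[t][tag] = top-k list of (score, backpointer)
    scores = [[[] for _ in range(K)] for _ in range(T)]
    for tag in range(K):
        scores[0][tag] = [(emissions[0, tag], None)]
    for t in range(1, T):
        for tag in range(K):
            cands = []
            for prev in range(K):
                for rank, (s, _) in enumerate(scores[t - 1][prev]):
                    cands.append((s + transitions[prev, tag]
                                  + emissions[t, tag], (prev, rank)))
            cands.sort(key=lambda x: -x[0])
            scores[t][tag] = cands[:k]
    # Collect final-step candidates across all tags, then backtrack.
    finals = []
    for tag in range(K):
        for rank, (s, _) in enumerate(scores[T - 1][tag]):
            finals.append((s, tag, rank))
    finals.sort(key=lambda x: -x[0])
    paths = []
    for s, tag, rank in finals[:k]:
        path, t, cur_tag, cur_rank = [tag], T - 1, tag, rank
        while t > 0:
            cur_tag, cur_rank = scores[t][cur_tag][cur_rank][1]
            path.append(cur_tag)
            t -= 1
        paths.append((s, path[::-1]))
    return paths
```

In our method, this search is applied first to the whole sentence and then, for each extracted entity, to the tokens inside its span, so that inner entities emerge as second-best paths within their parent's span.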

For future work, one interesting direction is joint modeling of NER with entity linking or coreference resolution. Durrett and Klein (2014), Luo et al. (2015), and Nguyen et al. (2016) demonstrated that leveraging the mutual dependencies among the NER, linking, and coreference tasks can boost the performance of each. We would like to address this direction while taking nested entities into account.

References

  • A. Akbik, D. Blythe, and R. Vollgraf (2018) Contextual string embeddings for sequence labeling. In Proceedings of the 27th International Conference on Computational Linguistics, Santa Fe, New Mexico, USA, pp. 1638–1649. External Links: Link Cited by: §2.
  • B. Alex, B. Haddow, and C. Grover (2007) Recognising nested named entities in biomedical text. In Biological, translational, and clinical language processing, Prague, Czech Republic, pp. 65–72. External Links: Link Cited by: §1, §2.
  • K. Byrne (2007) Nested named entity recognition in historical archive text. In International Conference on Semantic Computing (ICSC 2007), pp. 589–596. Cited by: §1.
  • Z. Cao, T. Qin, T. Liu, M. Tsai, and H. Li (2007) Learning to rank: from pairwise approach to listwise approach. In Proceedings of the 24th International Conference on Machine Learning, pp. 129–136. Cited by: §3.3.
  • B. Chiu, G. Crichton, A. Korhonen, and S. Pyysalo (2016) How to train good word embeddings for biomedical NLP. In Proceedings of the 15th Workshop on Biomedical Natural Language Processing, Berlin, Germany, pp. 166–174. External Links: Link, Document Cited by: §4.2.
  • J. P.C. Chiu and E. Nichols (2016) Named entity recognition with bidirectional LSTM-CNNs. Transactions of the Association for Computational Linguistics 4, pp. 357–370. External Links: Link, Document Cited by: §4.2.
  • J. Devlin, M. Chang, K. Lee, and K. Toutanova (2018) BERT: Pre-training of deep bidirectional transformers for language understanding. Computing Research Repository arXiv:1810.04805. Note: version 1 External Links: Link Cited by: §4.2.
  • G. Doddington, A. Mitchell, M. Przybocki, L. Ramshaw, S. Strassel, and R. Weischedel (2004) The automatic content extraction (ACE) program – tasks, data, and evaluation. In Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC’04), Lisbon, Portugal. External Links: Link Cited by: §4.1.
  • G. Durrett and D. Klein (2014) A joint model for entity analysis: coreference, typing, and linking. Transactions of the Association for Computational Linguistics 2, pp. 477–490. External Links: Link, Document Cited by: §6.
  • J. R. Finkel and C. D. Manning (2009) Nested named entity recognition. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, Singapore, pp. 141–150. External Links: Link Cited by: §2, §4.1, Table 3.
  • Z. Huang, Y. Chang, B. Long, J. Crespo, A. Dong, S. Keerthi, and S. Wu (2012) Iterative Viterbi A* algorithm for k-best sequential decoding. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Jeju Island, Korea, pp. 611–619. External Links: Link Cited by: §3.2, §5.2.
  • M. Ju, M. Miwa, and S. Ananiadou (2018) A neural layered model for nested named entity recognition. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), New Orleans, Louisiana, pp. 1446–1459. External Links: Link, Document Cited by: §2, Table 3, footnote 9.
  • N. Kaji, Y. Fujiwara, N. Yoshinaga, and M. Kitsuregawa (2010) Efficient staggered decoding for sequence labeling. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, Uppsala, Sweden, pp. 485–494. External Links: Link Cited by: §5.2.
  • A. Katiyar and C. Cardie (2018) Nested named entity recognition revisited. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), New Orleans, Louisiana, pp. 861–871. External Links: Link, Document Cited by: §1, §2, §3.4, Table 3, §5.6, Table 8.
  • J.-D. Kim, T. Ohta, Y. Tateisi, and J. Tsujii (2003) GENIA corpus—a semantically annotated corpus for bio-textmining. Bioinformatics 19 (suppl_1), pp. i180–i182. External Links: Document Cited by: §1, §4.1.
  • G. Lample, M. Ballesteros, S. Subramanian, K. Kawakami, and C. Dyer (2016) Neural architectures for named entity recognition. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, San Diego, California, pp. 260–270. External Links: Link, Document Cited by: §4.2.
  • W. Lu and D. Roth (2015) Joint mention extraction and classification with mention hypergraphs. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, Lisbon, Portugal, pp. 857–867. External Links: Link, Document Cited by: §2, §4.1, §5.5, footnote 9.
  • G. Luo, X. Huang, C. Lin, and Z. Nie (2015) Joint entity recognition and disambiguation. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, Lisbon, Portugal, pp. 879–888. External Links: Link, Document Cited by: §6.
  • L. Luo, Y. Xiong, Y. Liu, and X. Sun (2019) Adaptive gradient methods with dynamic bound of learning rate. Computing Research Repository arXiv:1902.09843. Note: version 1 External Links: Link Cited by: §4.2.
  • X. Ma and E. Hovy (2016) End-to-end sequence labeling via bi-directional LSTM-CNNs-CRF. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Berlin, Germany, pp. 1064–1074. External Links: Link, Document Cited by: §4.2.
  • C. Manning, M. Surdeanu, J. Bauer, J. Finkel, S. Bethard, and D. McClosky (2014) The Stanford CoreNLP natural language processing toolkit. In Proceedings of 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations, Baltimore, Maryland, pp. 55–60. External Links: Link, Document Cited by: Table 1.
  • A. O. Muis and W. Lu (2017) Labeling gaps between words: recognizing overlapping mentions with mention separators. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, Copenhagen, Denmark, pp. 2608–2618. External Links: Link, Document Cited by: §1, §2, Table 3, Table 8.
  • D. Nadeau and S. Sekine (2007) A survey of named entity recognition and classification. Lingvisticæ Investigationes 30 (1), pp. 3–26. External Links: Document Cited by: §1.
  • D. B. Nguyen, M. Theobald, and G. Weikum (2016) J-NERD: joint named entity recognition and disambiguation with rich linguistic features. Transactions of the Association for Computational Linguistics 4, pp. 215–229. External Links: Link, Document Cited by: §6.
  • J. Pennington, R. Socher, and C. Manning (2014) GloVe: global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Doha, Qatar, pp. 1532–1543. External Links: Link, Document Cited by: §4.2.
  • N. Seshadri and C. W. Sundberg (1994) List Viterbi decoding algorithms with applications. IEEE Transactions on Communications 42 (234), pp. 313–323. External Links: Document Cited by: §3.2.
  • M. G. Sohrab and M. Miwa (2018) Deep exhaustive model for nested named entity recognition. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, pp. 2843–2849. External Links: Link Cited by: §1, §2, §3.4, Table 3.
  • F. K. Soong and E. Huang (1990) A tree-trellis based fast search for finding the N best sentence hypotheses in continuous speech recognition. In Speech and Natural Language: Proceedings of a Workshop Held at Hidden Valley, Pennsylvania, June 24–27, 1990. External Links: Link Cited by: §5.2.
  • E. Strubell, P. Verga, D. Belanger, and A. McCallum (2017) Fast and accurate entity recognition with iterated dilated convolutions. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, Copenhagen, Denmark, pp. 2670–2680. External Links: Link, Document Cited by: §2.
  • Y. Tateisi and J. Tsujii (2004) Part-of-speech annotation of biology research abstracts. In Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC’04), Lisbon, Portugal. External Links: Link Cited by: §4.1.
  • B. Wang, W. Lu, Y. Wang, and H. Jin (2018) A neural transition-based model for nested mention recognition. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, pp. 1011–1017. External Links: Link Cited by: §2, §3.4, Table 3, §5.4, Table 7, Table 8, footnote 10, footnote 11.
  • B. Wang and W. Lu (2018) Neural segmental hypergraphs for overlapping mention recognition. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, pp. 204–214. External Links: Link Cited by: §1, §2, Table 3, §5.4, Table 7, Table 8, footnote 10, footnote 11.
  • Y. Wang (2009) Annotating and recognising named entities in clinical notes. In Proceedings of the ACL-IJCNLP 2009 Student Research Workshop, Suntec, Singapore, pp. 18–26. External Links: Link Cited by: §1.