Named Entity Recognition (NER) is the task of detecting mentions of real-world entities in text and classifying them into predefined types. NER benefits many natural language processing applications (e.g., information retrieval Berger and Lafferty (2017), relation extraction Yu et al. (2019), and question answering Khalid et al. (2008)).
NER methods have been extensively investigated, and researchers have proposed many effective ones. Most prior approaches Huang et al. (2015); Chiu and Nichols (2016); Gridach (2017); Zhang and Yang (2018); Gui et al. (2019); Xue et al. (2020) cast this task as a sequence labeling problem where each token is assigned a label that represents its entity type. Their underlying assumption is that an entity mention is a short, continuous span of text Muis and Lu (2016) and that mentions do not overlap with each other. While this assumption is valid in most cases, it does not always hold, especially in clinical corpora Pradhan et al. (2015). For example, Figure 1 shows two discontinuous entity mentions with overlapping segments. Thus, there is a need to move beyond continuous entities and devise methods to extract discontinuous ones.
Towards this goal, current state-of-the-art (SOTA) models can be categorized into two classes: combination-based and transition-based. Combination-based models first detect all the overlapping spans and then learn to combine these segments with a separate classifier Wang and Lu (2019); transition-based models incrementally label the discontinuous spans through a sequence of shift-reduce actions Dai et al. (2020a). Although these methods have achieved reasonable performance, they share the same problem: exposure bias Zhang et al. (2019). Specifically, combination-based methods use the gold segments to guide the classifier during training, while at inference the input segments are given by a trained model, leading to a gap between training and inference Wang and Lu (2019). For transition-based models, at training time the current action relies on the gold previous actions, while at test time the entire action sequence is generated by the model Wang et al. (2017). As a result, a skewed prediction will further skew the predictions of the follow-up actions, and the accumulated discrepancy may hurt performance.
To overcome the limitations of such prior work, we propose Mac, a Maximal clique discovery based discontinuous NER model. The core insight behind Mac is that all (potentially discontinuous) entity mentions in a sentence naturally form a segment graph if we interpret their continuous segments as nodes and connect segments of the same entity to each other with edges. The discontinuous NER task is then equivalent to finding the maximal cliques of this graph, which is a well-studied problem in graph theory. The remaining question is how to construct such a segment graph. In Mac, we decompose it into two uncoupled subtasks, segment extraction (SE) and edge prediction (EP). Given an $n$-token sentence, two tag tables are formed for SE and EP respectively, where each entry captures the interaction between two individual tokens. SE is then regarded as a labeling problem where tags are assigned to distinguish the boundary tokens of each segment, which helps to identify overlapping segments. EP is cast as the problem of aligning the boundary tokens of segments contained in the same entity. Overall, the tag tables of SE and EP are generated independently and consumed together by a maximal clique search algorithm to recover the desired entities, so the model is immune to the exposure bias problem.
We conducted experiments on three standard discontinuous NER benchmarks. The experiments show that Mac can effectively recognize discontinuous entity mentions without sacrificing accuracy on continuous mentions. This leads to a new state-of-the-art (SOTA) on this task, with substantial gains of up to 3.5 absolute percentage points over the previous best reported result. Lastly, runtime experiments in a GPU environment show that Mac is about five times faster than the SOTA model.
2 Related Work
Our work is inspired by three lines of research: discontinuous NER, joint extraction, and maximal clique discovery.
Discontinuous NER requires identifying all entity mentions that have discontinuous structures. To this end, several researchers introduced new position indicators into the traditional BIO tagging scheme so that sequence labeling models can be employed Tang et al. (2013); Metke-Jimenez and Karimi (2016); Dai et al. (2017); Tang et al. (2018). However, these models suffer from the label ambiguity problem due to the limited flexibility of the extended tag set. As an improvement, Muis and Lu (2016) used hyper-graphs to represent entity spans and their combinations, but did not completely resolve the ambiguity issue Dai et al. (2020a). Wang and Lu (2019) presented a pipeline framework which first detects all the candidate spans of entities and then merges them into entities. By decomposing the task into two interdependent steps, this approach avoids the ambiguity issue but is susceptible to exposure bias. Recently, Dai et al. (2020a) constructed a transition action sequence for recognizing discontinuous and overlapping structures. At training time, the model predicts each action conditioned on the ground-truth previous actions, while at inference it has to select the current action based on the results of its own previous steps, leading to exposure bias. In this paper, we propose, for the first time, a one-stage method for discontinuous NER that does not suffer from the ambiguity issue and keeps training and inference consistent.
Joint extraction aims to detect entity pairs along with their relations using a single model Yu et al. (2020). Discontinuous NER is related to joint extraction in that discontinuous entities can be viewed as relation links between segments Wang and Lu (2019). Our model is motivated by TPLinker Wang et al. (2020), which formulates joint extraction as a token pair linking problem by aligning the boundary tokens of entity pairs. The main differences between our model and TPLinker are two-fold: (1) we propose a tailored tagging scheme for recognizing discontinuous segments; (2) the maximal clique discovery algorithm is introduced into our model to accurately merge the discontinuous segments.
Maximal clique discovery aims to find the cliques of a given graph that cannot be extended by any additional vertex; it is closely related to the problem of finding a clique of maximum size Dutta and Lauri (2019). Here, a clique is a subset of the vertices all of which are pairwise adjacent. Maximal clique discovery finds extensive application across diverse domains Stix (2004); Boginski et al. (2005); Imbiriba et al. (2017). In this paper, we reformulate discontinuous NER as the task of maximal clique discovery by constructing a segment graph and leveraging the classic B-K backtracking algorithm Bron and Kerbosch (1973) to find all the maximal cliques as the entities.
In graph theory, a clique is a vertex subset of an undirected graph in which every two vertices are adjacent, while a maximal clique is one that cannot be extended by including one more adjacent vertex. That is, all vertices in a maximal clique are closely related to each other, and no other vertex can be added, which mirrors the relations between the segments of a discontinuous entity. Based on this insight, we claim that discontinuous NER can be equivalently interpreted as discovering maximal cliques in a segment graph, where nodes represent segments that either form entities on their own or appear as parts of a discontinuous entity, and edges connect segments that belong to the same entity mention.
Since the maximal clique search process is usually non-parametric Bron and Kerbosch (1973), discontinuous NER is in effect decomposed into two subtasks, segment extraction and edge prediction, which respectively create the nodes and edges of the segment graph. Their prediction results can be generated independently with our proposed grid tagging scheme and are consumed together to construct a segment graph, so that the maximal clique discovery algorithm can be applied to recover the desired entities. The overall extraction process is depicted in Figure 2. Next, we first introduce our grid tagging scheme and its decoding workflow. Then we detail Mac, a Maximal clique discovery based discontinuous NER model built on this tagging scheme.
3.1 Grid Tagging Scheme
Inspired by Wang et al. (2020), we implement single-stage segment extraction and edge prediction based on a novel grid tagging scheme. Given an $n$-token sentence, our scheme constructs an $n \times n$ tag table by enumerating all possible token pairs and giving each token pair the tag(s) based on their relation(s). Note that one token pair may have multiple tags according to the pre-defined tag set.
3.1.1 Segment Extraction
As demonstrated in Figure 1, entity mentions can overlap with each other. To make our model capable of extracting such overlapping segments, we construct a two-dimensional tag table. Figure 3 provides an example. A pair of tokens $(t_i, t_j)$ is assigned a set of labels if the segment from $t_i$ to $t_j$ belongs to the corresponding categories. Since $j \geq i$, we discard the lower triangle region of the tag table, so $\frac{n^2 + n}{2}$ grids are actually generated for an $n$-token sentence. In practice, the BIS tagging scheme is adopted to represent whether a segment is a continuous entity mention (X-S) or is located at the beginning (X-B) or inside (X-I) of a discontinuous entity of type X. For example, (upper, body) is assigned the tag POB-S since “upper body” is a continuous entity of type Part of Body (POB). The tag of (Sever, joint) is ADE-B, as “Sever joint” is the beginning segment of the discontinuous mention “Sever joint pain” of type Adverse Drug Event (ADE). Meanwhile, “joint” is also recognized as an entity since there is a POB-S tag in the cell (joint, joint), so the overlapping segment extraction problem is solved.
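The segment tag table can be illustrated with a minimal sketch. The function name `build_segment_table`, the sparse dictionary layout, and the tokenization of the example sentence (punctuation as separate tokens) are assumptions made for illustration only; they are not taken from the authors' implementation.

```python
def build_segment_table(n_tokens, segments):
    """Return {(start, end): set_of_tags} for the upper triangle of the grid.

    `segments` holds (start, end, tag) triples with inclusive token indices.
    """
    table = {}
    for start, end, tag in segments:
        if not 0 <= start <= end < n_tokens:
            raise ValueError(f"invalid span ({start}, {end})")
        # A cell may carry several tags, e.g. for different entity types.
        table.setdefault((start, end), set()).add(tag)
    return table


# Assumed tokenization of the running example (punctuation as separate tokens).
tokens = ["Sever", "joint", ",", "shoulder", "and", "upper", "body", "pain", "."]
segment_table = build_segment_table(
    len(tokens),
    [
        (0, 1, "ADE-B"),  # "Sever joint": beginning segment of "Sever joint pain"
        (1, 1, "POB-S"),  # "joint": continuous Part-of-Body mention
        (5, 6, "POB-S"),  # "upper body": continuous Part-of-Body mention
    ],
)

# Number of cells kept after discarding the lower triangle: n(n+1)/2.
n = len(tokens)
num_cells = n * (n + 1) // 2
```

Because overlapping segments occupy different cells, “Sever joint” at (0, 1) and “joint” at (1, 1) can both be tagged without conflict.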
3.1.2 Edge Prediction
Edge prediction constructs the links between segments of the same entity mention by aligning their boundary tokens. The tagging scheme is defined as follows: (1) head to head (X-H2H) marks a cell $(t_i, t_j)$ where $t_i$ and $t_j$ are respectively the beginning tokens of two segments that constitute the same entity of type X; (2) tail to tail (X-T2T) is similar to X-H2H, but focuses on the ending tokens. As shown in Figure 4, “Sever” has the ADE-H2H and ADE-T2T relations to “shoulder” and “pain”, because the type of the discontinuous entity mention “Sever shoulder pain” is Adverse Drug Event (ADE). The same logic applies to the other tags in the matrix.
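A small sketch of how the edge tags could be derived from gold mentions is given below. The helper `build_edge_table` and the choice to tag only ordered pairs (earlier segment first) are assumptions for illustration; the token indices follow a hypothetical tokenization in which “Sever”, “shoulder”, and “pain” sit at positions 0, 3, and 7.

```python
from itertools import combinations


def build_edge_table(entities):
    """Return {(i, j): set_of_tags} with X-H2H and X-T2T tags.

    `entities` is a list of (entity_type, segments) pairs, where each segment
    is an inclusive (start, end) token span.
    """
    table = {}
    for etype, segments in entities:
        # Every pair of segments inside one mention is linked, so the
        # segments of a mention form a clique in the segment graph.
        for (s1, e1), (s2, e2) in combinations(sorted(segments), 2):
            table.setdefault((s1, s2), set()).add(f"{etype}-H2H")
            table.setdefault((e1, e2), set()).add(f"{etype}-T2T")
    return table


# "Sever shoulder pain": three single-token segments of one ADE mention.
edge_table = build_edge_table([("ADE", [(0, 0), (3, 3), (7, 7)])])
```

For single-token segments the head and tail coincide, which is why the cell for (Sever, shoulder) carries both the H2H and the T2T tag.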
3.2 Decoding Workflow
Formally, the decoding procedure is summarized in Algorithm 1. The segment tagging table $S$ and edge tagging table $E$ of a sentence serve as the inputs. First, we extract all the typed segments by decoding $S$. Then we construct a segment graph $G$, in which segments that belong to the same entity (decoded from $E$) have edges with each other. Figure 2 gives an example. Next, we find all the maximal cliques of $G$, each of which corresponds to one entity mention: we yield a continuous entity mention from a single-vertex clique directly, and concatenate the segments in each multiple-vertex clique following their original sequential order in the sentence to recover discontinuous entity mentions. We choose the classic B-K backtracking algorithm Bron and Kerbosch (1973) for finding the maximal cliques in $G$, which takes $O(3^{n/3})$ time in the worst case, where $n$ is the number of nodes.
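The decoding workflow can be sketched as follows. This is an illustrative reading, not Algorithm 1 itself: the rule that two segments are linked when an H2H tag sits on their start tokens and a T2T tag on their end tokens, the simplification of one entity type per segment, and all names (`bron_kerbosch`, `decode`, the toy tables) are assumptions. The clique search is the basic Bron-Kerbosch recursion without pivoting.

```python
def bron_kerbosch(r, p, x, adj, out):
    """Basic Bron-Kerbosch recursion: collect every maximal clique in `out`."""
    if not p and not x:
        out.append(r)
        return
    for v in sorted(p):
        bron_kerbosch(r | {v}, p & adj[v], x & adj[v], adj, out)
        p = p - {v}
        x = x | {v}


def decode(tokens, segment_types, edge_table):
    """Recover (type, text) mentions from decoded segments and edge tags.

    segment_types: {(start, end): entity_type}, inclusive token indices.
    edge_table:    {(i, j): set_of_tags} holding X-H2H / X-T2T tags.
    """
    nodes = sorted(segment_types)
    adj = {v: set() for v in nodes}
    for idx, a in enumerate(nodes):
        for b in nodes[idx + 1:]:
            etype = segment_types[a]
            # Assumed linking rule: same type, heads aligned, tails aligned.
            linked = (
                segment_types[b] == etype
                and f"{etype}-H2H" in edge_table.get((a[0], b[0]), set())
                and f"{etype}-T2T" in edge_table.get((a[1], b[1]), set())
            )
            if linked:
                adj[a].add(b)
                adj[b].add(a)
    cliques = []
    if nodes:
        bron_kerbosch(frozenset(), frozenset(nodes), frozenset(), adj, cliques)
    mentions = []
    for clique in cliques:
        segs = sorted(clique)  # restore the original left-to-right order
        text = " ".join(" ".join(tokens[s:e + 1]) for s, e in segs)
        mentions.append((segment_types[segs[0]], text))
    return sorted(mentions)


tokens = ["Sever", "joint", ",", "shoulder", "and", "upper", "body", "pain", "."]
segment_types = {(0, 0): "ADE", (0, 1): "ADE", (3, 3): "ADE",
                 (7, 7): "ADE", (5, 6): "POB"}
edge_table = {
    (0, 3): {"ADE-H2H", "ADE-T2T"},
    (0, 7): {"ADE-H2H", "ADE-T2T"},
    (3, 7): {"ADE-H2H", "ADE-T2T"},
    (1, 7): {"ADE-T2T"},
}
mentions = decode(tokens, segment_types, edge_table)
```

In this toy input the single-vertex clique yields the continuous mention “upper body”, while the two multi-vertex cliques yield “Sever joint pain” and “Sever shoulder pain”.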
3.3 Model Structure
With the grid tagging scheme, we propose an end-to-end neural architecture named Mac. Figure 5 shows the overall structure.
3.3.1 Token Representation
Given an $n$-token sentence $[t_1, \dots, t_n]$, we first map each token $t_i$ into a low-dimensional contextual vector $h_i$ with a basic encoder. Then we generate two representations, $h_i^{s}$ and $h_i^{e}$, as the task-specific features for the segment extractor and the edge predictor, respectively:
$$h_i^{s} = W^{s} h_i + b^{s}, \qquad h_i^{e} = W^{e} h_i + b^{e},$$
where $W^{*}$ is a parameter matrix and $b^{*}$ is a bias vector to be learned during training.
3.3.2 Segment Extractor
The probability that a pair of tokens $(t_i, t_j)$ are the boundary tokens of a segment can be represented as:
$$p(t_i, t_j) = p(b = t_i)\, p(e = t_j \mid b = t_i),$$
where $b$ and $e$ denote the beginning token and ending token. In our tagging scheme (Figure 3), we have a fixed beginning token $t_i$ at the $i$-th row, and take the given beginning token as the condition to label the corresponding ending token, so $p(b = t_i)$ in the $i$-th row is always 1. Hence, all we need to do is to calculate $p(e = t_j \mid b = t_i)$.
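Inspired by conLN, we use the Conditional Layer Normalization (CLN) mechanism to model the conditional probability. That is, a conditional vector is introduced as extra contextual information to generate the gain parameter and bias of layer normalization Ba et al. (2016) as follows:
$$\mathrm{CLN}(c, x) = \gamma_c \odot \frac{x - \mu}{\sigma} + \lambda_c,$$
$$\mu = \frac{1}{d} \sum_{k=1}^{d} x_k, \qquad \sigma = \sqrt{\frac{1}{d} \sum_{k=1}^{d} (x_k - \mu)^2},$$
where $c$ and $x$ are the conditional vector and input vector respectively, $\gamma_c$ and $\lambda_c$ are the gain and bias generated from $c$, $x_k$ denotes the $k$-th element of $x$, $d$ is the dimension of $x$, and $\mu$ and $\sigma$ are the mean and standard deviation taken across the elements of $x$, respectively. $x$ is first normalized by fixing the mean and variance and then scaled and shifted by $\gamma_c$ and $\lambda_c$ respectively. Based on the CLN mechanism, the representation of token pair $(t_i, t_j)$ being a segment boundary can be defined as:
$$h_{i,j}^{sb} = \mathrm{CLN}(h_i^{s}, h_j^{s}),$$
where $h_i^{s}$ and $h_j^{s}$ are the segment-extraction features of $t_i$ and $t_j$. In this way, for different $t_i$, different LN parameters are generated, which effectively adapts $h_j^{s}$ to be more $t_i$-specific.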
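The CLN mechanism can be made concrete with a small pure-Python sketch. It assumes that the gain and bias are affine functions of the conditional vector (the weight and offset names `w_gain`, `b_gain`, `w_bias`, `b_bias` are hypothetical) and adds a small `eps` for numerical safety; neither detail is stated in the text above.

```python
import math


def matvec(matrix, vector):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def cln(c, x, w_gain, b_gain, w_bias, b_bias, eps=1e-12):
    """Layer-normalize x, then scale and shift with parameters generated from c."""
    # Assumption: gain and bias are affine functions of the conditional vector.
    gain = [g + b for g, b in zip(matvec(w_gain, c), b_gain)]
    bias = [g + b for g, b in zip(matvec(w_bias, c), b_bias)]
    mu = sum(x) / len(x)
    sigma = math.sqrt(sum((v - mu) ** 2 for v in x) / len(x))
    return [g * (v - mu) / (sigma + eps) + b for g, v, b in zip(gain, x, bias)]


zeros = [[0.0] * 3 for _ in range(3)]
identity = [[1.0 if i == j else 0.0 for j in range(3)] for i in range(3)]
x = [1.0, 2.0, 3.0]

# With zero weights, unit gain offset and zero bias, CLN reduces to plain layer norm.
plain = cln([5.0, 5.0, 5.0], x, zeros, [1.0] * 3, zeros, [0.0] * 3)
# With non-zero weights the same input is rescaled differently per condition.
cond_a = cln([1.0, 1.0, 1.0], x, identity, [0.0] * 3, zeros, [0.0] * 3)
cond_b = cln([2.0, 2.0, 2.0], x, identity, [0.0] * 3, zeros, [0.0] * 3)
```

The last two calls show the intended effect: the same ending-token vector is transformed differently depending on which beginning token conditions it.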
Furthermore, besides the features of boundary tokens, we also consider inner tokens and segment length to learn a better segment representation. Specifically, we deploy an LSTM network Hochreiter and Schmidhuber (1997) to compute the hidden states of inner tokens, and use a lookup table to embed the segment length. Since the ending token is always behind the beginning one, in each row $i$, only the tokens behind $t_i$ are fed into the LSTM. We take the hidden state output at each time step $t_j$ as the inner token representation of the segment from $t_i$ to $t_j$. The representation of a segment from $t_i$ to $t_j$ is then built from its boundary representation, its inner token representation, and its length embedding.
3.3.3 Edge Predictor
Edge prediction is similar to segment extraction since both need to learn the representation of each token pair. The key differences are summarized in the following two aspects: (1) the distance between segments is usually not informative, so the length embedding is of no value in edge prediction; (2) encoding the tokens between segments may carry noisy semantics for correlation tagging and aggravate the burden of training, so no inner token representation is required. Under such considerations, we represent each token pair for edge prediction as:
$$h_{i,j}^{e} = \mathrm{CLN}(h_i^{e}, h_j^{e}),$$
where $h_i^{e}$ and $h_j^{e}$ are the edge-prediction features of $t_i$ and $t_j$.
3.4 Training and Inference
In practice, our grid tagging scheme aims to assign the most relevant labels to each token pair, so it can be seen as a multi-label classification problem. Given the token pair representations ($h_{i,j}^{s}$ and $h_{i,j}^{e}$), we build the multi-label classifier with a fully connected network. Mathematically, the predicted probability of each tag for $(t_i, t_j)$ can be estimated via:
$$p_{i,j}^{*} = \mathrm{sigmoid}(W_{o}^{*} h_{i,j}^{*} + b_{o}^{*}),$$
where $* \in \{s, e\}$ is the subtask indicator, denoting segment extraction and edge prediction respectively, $W_{o}^{*}$ and $b_{o}^{*}$ are the parameters of the fully connected layer, and each dimension of $p_{i,j}^{*}$ denotes the probability of a tag between $t_i$ and $t_j$. The sigmoid function transforms the projected value into a probability, so the cross-entropy loss, which has been shown to be suitable for multi-label classification, can be used as the loss function:
$$\mathcal{L}^{*} = -\sum_{(i,j)} \sum_{k=1}^{K} \left( y_{i,j}^{*}[k] \log p_{i,j}^{*}[k] + (1 - y_{i,j}^{*}[k]) \log (1 - p_{i,j}^{*}[k]) \right),$$
where $K$ is the number of pre-defined tags in $*$, $p_{i,j}^{*}[k]$ is the predicted probability of $(t_i, t_j)$ along the $k$-th tag, $y_{i,j}^{*}[k]$ is the corresponding ground truth, and the outer sum runs over the token pairs in the tag table of subtask $*$. The losses from segment extraction and edge prediction are then aggregated to form the training objective.
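The multi-label objective can be sketched in a few lines of pure Python. The function names, the toy logits, and in particular the unweighted sum used to aggregate the two subtask losses are assumptions for illustration; the text only states that the losses are aggregated.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def multilabel_bce(logits, labels, eps=1e-12):
    """Sum binary cross-entropy over token pairs (rows) and tags (columns)."""
    total = 0.0
    for row_logits, row_labels in zip(logits, labels):
        for z, y in zip(row_logits, row_labels):
            p = min(max(sigmoid(z), eps), 1.0 - eps)  # avoid log(0)
            total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total


# One row per token pair, one column per tag (toy values).
se_logits, se_labels = [[0.0, 0.0]], [[1, 0]]
ep_logits, ep_labels = [[2.0], [-2.0]], [[1], [0]]

loss_se = multilabel_bce(se_logits, se_labels)
loss_ep = multilabel_bce(ep_logits, ep_labels)
# Assumption: the two subtask losses are combined by an unweighted sum.
total_loss = loss_se + loss_ep
```

Each tag is scored independently with its own sigmoid, which is what allows one token pair to carry several tags at once.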
At inference, the probability vector needs thresholding to be converted to tags. We enumerate several candidate values and pick the one that maximizes the evaluation metrics on the validation (dev) set as the threshold.
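Threshold selection on the dev set can be sketched as below. The tag-level F1 used as the selection metric, the candidate grid, and the toy probabilities are illustrative assumptions; the text does not specify which values are enumerated.

```python
def f1_score(pred, gold):
    """F1 of binary tag decisions against binary gold labels."""
    tp = sum(1 for p, g in zip(pred, gold) if p and g)
    fp = sum(1 for p, g in zip(pred, gold) if p and not g)
    fn = sum(1 for p, g in zip(pred, gold) if not p and g)
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def pick_threshold(dev_probs, dev_gold, candidates):
    """Return the candidate threshold with the best dev-set F1."""
    def score(threshold):
        return f1_score([p >= threshold for p in dev_probs], dev_gold)
    return max(candidates, key=score)


# Toy dev-set tag probabilities and gold labels.
dev_probs = [0.9, 0.6, 0.4, 0.2]
dev_gold = [1, 1, 1, 0]
best = pick_threshold(dev_probs, dev_gold, [0.3, 0.5, 0.7])
```

On this toy data the lowest candidate wins because it recovers all three positive tags without admitting the negative one.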
In this section, after introducing the datasets and baseline models, we present our experimental results and detailed analysis.
Following previous work Dai et al. (2020a), we conduct experiments on three benchmark datasets from the biomedical domain: (1) CADEC Karimi et al. (2015) is sourced from AskaPatient, an online forum where patients can discuss their experiences with medications. We use the dataset pre-processed by Dai et al. (2020a), which selected the Adverse Drug Event (ADE) annotations from the original dataset because only the ADEs involve discontinuous annotations. (2) ShARe 13 Pradhan et al. (2013) and (3) ShARe 14 Mowery et al. (2014) focus on the identification of disorder mentions in clinical notes, including discharge summaries, electrocardiograms, echocardiograms, and radiology reports. Around 10% of the mentions in these three datasets are discontinuous. The descriptive statistics of the datasets are reported in Table 1.
4.2 Implementation Details
We implement our model upon the in-field BERT base model: Yelp Bert Dai et al. (2020b) for CADEC, and Clinical BERT Alsentzer et al. (2019) for ShARe 13 and 14. The network parameters are optimized by Adam Kingma and Ba (2014) with a learning rate of 1e-5. The batch size is fixed to 12. The threshold for converting probabilities to tags is set to 0.5. All the hyper-parameters are tuned on the dev set. We run our experiments on an NVIDIA Tesla V100 GPU for at most 300 epochs, and choose the model with the best performance on the dev set to output results on the test set. We report the test score of the run with the median dev score among 5 randomly initialized runs.
Table 1: Descriptive statistics of the CADEC, ShARe 13, and ShARe 14 datasets.
Table 2: Main results of all models on CADEC, ShARe 13, and ShARe 14.
4.3 Comparison Models
For comparison, we employ the following models as baselines: (1) BIOE Metke-Jimenez and Karimi (2016) expands the BIO tagging scheme with additional tags to represent discontinuous entities; (2) Graph Muis and Lu (2016) uses hyper-graphs to organize entity spans and their combinations; (3) Comb Wang and Lu (2019) first detects entity spans, then deploys a classifier to merge them. For a fair comparison, we re-implement Comb on the in-field BERT backbone; (4) Trans Dai et al. (2020a) is the current best discontinuous NER method, which generates a sequence of actions with the aid of buffer and stack structures to detect entities. Note that the original Trans model is based on ELMo. For a fair comparison with our model, we also implement in-field BERT-based Trans models.
|Model||CADEC P||CADEC R||CADEC F1||ShARe 13 P||ShARe 13 R||ShARe 13 F1||ShARe 14 P||ShARe 14 R||ShARe 14 F1|
|BIOE||68.3 / 5.8||52.0 / 1.0||57.3 / 1.8||51.8 / 39.7||39.5 / 12.3||44.8 / 18.8||37.5 / 8.8||38.4 / 4.5||37.9 / 6.0|
|Graph||69.5 / 60.8||43.2 / 14.8||53.3 / 23.9||82.3 / 78.4||47.4 / 36.6||60.2 / 50.0||60.0 / 42.7||52.8 / 39.5||56.2 / 41.1|
|Comb||63.9 / 44.0||57.8 / 23.4||60.7 / 30.6||59.7 / 65.5||49.8 / 29.6||54.3 / 40.8||52.9 / 51.2||52.8 / 35.0||52.9 / 41.6|
|Trans||66.5 / 41.2||64.3 / 35.1||65.4 / 37.9||70.5 / 78.5||56.8 / 39.4||62.9 / 52.5||61.9 / 56.1||64.5 / 43.8||63.1 / 49.2|
|Trans||69.1 / 39.5||64.4 / 34.0||66.7 / 36.6||68.2 / 65.9||55.4 / 39.0||61.1 / 49.0||55.5 / 52.0||55.6 / 37.8||55.6 / 43.8|
|Mac||74.7 / 52.9||65.5 / 38.3||69.8 / 44.4||77.9 / 66.1||60.5 / 48.4||68.1 / 55.9||69.3 / 51.0||70.2 / 57.6||69.7 / 54.1|
4.4 Main results
Table 2 reports the results of our model against the baseline methods. We have the following observations. (1) Our method, Mac, significantly outperforms all other methods and achieves the SOTA F1 score on all three datasets. (2) The BERT-based Trans model achieves poorer results than its ELMo-based counterpart, which is in line with the claim in the original paper. (3) Over the SOTA method Trans, Mac achieves a substantial average improvement of 2.6% in F1 score across the three datasets. Moreover, the Wilcoxon test shows that a statistically significant difference exists between our model and Trans. We attribute this to the fact that Trans is inherently a multi-stage method, as it introduces several dependent actions, and thus suffers from the exposure bias problem. Mac, in contrast, decomposes the discontinuous NER task into two independent subtasks and learns them together with a joint model, keeping training and inference consistent. (4) Comb can be approximately seen as the pipeline version of our method; the performance gap between them again confirms the effectiveness of our one-stage learning framework.
As shown in Table 1, only around 10% of the mentions in all three datasets are discontinuous, far fewer than the continuous entity mentions. To evaluate the effectiveness of our proposed model on recognizing discontinuous mentions, following Muis and Lu (2016), we report the results on sentences that include at least one discontinuous mention. We also report the evaluation results when only discontinuous mentions are considered. The scores in these two settings are separated by a slash in Table 3. Comparing Tables 2 and 3, we can see that the BIOE model performs better than Graph when tested on the full dataset but far worse on discontinuous mentions. Consistently, our model again outperforms the baseline models in terms of F1 score. Even though some models outperform Mac on precision or recall, they greatly sacrifice the other score, which results in a lower F1 score than Mac.
4.5 Model Ablation Study
|Model||F1||Dis F1||Dis F1|
|– Tag B and S||78.2||55.8||46.1|
|– Segment length embedding||78.1||55.7||46.2|
|– CLN mechanism||76.8||52.7||44.4|
|– Segment inner representation||72.9||55.6||46.3|
To verify the effectiveness of each component, we ablate one component at a time to understand its impact on the performance. Concretely, we investigated the tagging scheme of segments, the segment length embedding, the CLN mechanism (by replacing it with the vector concatenation), and the segment inner token representation.
From the ablations shown in Table 4, we find that: (1) When we treat the B, I, and S tags in segment extraction as one class, the score drops slightly by 0.5%, which indicates that segments in different positions of entities may have different semantic features, so distinguishing them reduces confusion during recognition; (2) When we remove the segment length embedding (Formula 9), the overall F1 score drops by 0.6%, showing that it is necessary to make the segment extractor aware of the token pair distance so that it can filter out impossible segments through an implicit distance constraint; (3) Compared with concatenation, using CLN (Formula 7 and 11) to fuse the features of two tokens is a better choice, bringing a 1.9% improvement; (4) Removing the segment inner features (Formula 8) results in a remarkable drop in the overall F1 score but little drop in the scores on discontinuous mentions, which suggests that the information of inner tokens is essential for recognizing continuous entity mentions.
4.6 Performance Analysis
Footnote 2: Due to limited space, we analyze the performance with regard to different interval and span lengths in the Appendix.
4.6.1 Performance on Overlapping Mentions
Table 5: Statistics of the overlapping patterns in CADEC, ShARe 13, and ShARe 14.
As discussed in the introduction, overlap is very common in discontinuous entity mentions. To evaluate the capability of our model in extracting overlapping structures, as suggested in Dai et al. (2020a), we divide the test set into four categories: (1) no overlap; (2) left overlap; (3) right overlap; and (4) multiple overlap. Figure 7 gives examples for each overlapping pattern. As illustrated in Figure 6, Mac outperforms Trans on all the overlapping patterns. Trans gets zero scores on some patterns. This might result from insufficient training, since these overlapping patterns have relatively few samples in the training sets (see Table 5), while the sequential action structure of the transition-based model is rather data hungry. By contrast, Mac is more resilient to overlapping patterns. We attribute the performance gain to two design choices: (1) the grid tagging scheme is effective at accurately identifying overlapping segments and assembling them into a segment graph; (2) based on the graph, the maximal clique discovery algorithm can effectively recover all the candidate overlapping entity mentions.
4.6.2 Analysis on Running Speed
|Model||CADEC||ShARe 13||ShARe 14|
|Trans||29.1 Sen/s||33.4 Sen/s||33.9 Sen/s|
|Trans||36.3 Sen/s||40.6 Sen/s||40.3 Sen/s|
|Mac||193.3 Sen/s||200.2 Sen/s||198.1 Sen/s|
Table 6 shows the comparison of computational efficiency between the SOTA model Trans, in its ELMo-based and BERT-based variants, and our proposed Mac. All of these models are implemented in PyTorch and run on a single Tesla V100 GPU. As we can see, the prediction speed of Mac is around 5 times faster than Trans. Since the transition-based model employs a stack to store partially processed spans and a buffer to store unprocessed tokens Dai et al. (2020a), it is difficult to utilize GPU parallel computing to speed up the extraction process. In the official implementation, Trans is restricted to processing one token at a time, which makes it seriously inefficient and difficult to deploy in real-world settings. By contrast, Mac is capable of handling data in batch mode because it is, in essence, a single-stage sequence labeling model.
In this paper, we reformulate discontinuous NER as the task of discovering maximal cliques in a segment graph, and propose a novel Mac architecture. It decomposes the construction of the segment graph into two independent 2-D grid tagging problems and solves them jointly in one stage, addressing the exposure bias issue of previous studies. Extensive experiments on three benchmark datasets show that Mac beats the previous SOTA method by as much as 3.5 points in F1, while being 5 times faster. Further analysis demonstrates the ability of our model to recognize discontinuous and overlapping entity mentions. In the future, we would like to explore similar formulations in other information extraction tasks, such as event extraction and nested NER.
This work was supported by the National Key R&D Program of China (Grant No. 2017YFB0802804), the Guangdong Province Key Area R&D Program of China (Grant No. 2019B010137004), and the Key Program of National Natural Science Foundation of China (Grant No. U1766215).
- Alsentzer et al. (2019). Publicly available clinical BERT embeddings.
- Ba et al. (2016). Layer normalization. arXiv preprint arXiv:1607.06450.
- Berger and Lafferty (2017). Information retrieval as statistical translation. In Proceedings of ACM SIGIR.
- Boginski et al. (2005). Statistical analysis of financial networks. Computational Statistics & Data Analysis.
- Bron and Kerbosch (1973). Algorithm 457: finding all cliques of an undirected graph. Communications of the ACM.
- Chiu and Nichols (2016). Named entity recognition with bidirectional LSTM-CNNs. Transactions of the Association for Computational Linguistics.
- Dai et al. (2020a). An effective transition-based model for discontinuous NER. In Proceedings of ACL.
- Dai et al. (2020b). Cost-effective selection of pretraining data: a case study of pretraining BERT on social media. In Proceedings of EMNLP: Findings.
- Dai et al. (2017). Medication and adverse event extraction from noisy text. In Proceedings of the Australasian Language Technology Association Workshop.
- Dutta and Lauri (2019). Finding a maximum clique in dense graphs via $\chi^2$ statistics. In Proceedings of CIKM.
- Gridach (2017). Character-level neural network for biomedical named entity recognition. Journal of Biomedical Informatics.
- Gui et al. (2019). CNN-based Chinese NER with lexicon rethinking. In Proceedings of IJCAI.
- Hochreiter and Schmidhuber (1997). Long short-term memory. Neural Computation.
- Huang et al. (2015). Bidirectional LSTM-CRF models for sequence tagging. arXiv preprint arXiv:1508.01991.
- Imbiriba et al. (2017). Band selection for nonlinear unmixing of hyperspectral images as a maximal clique problem. IEEE Transactions on Image Processing.
- Karimi et al. (2015). Cadec: a corpus of adverse drug event annotations. Journal of Biomedical Informatics.
- Khalid et al. (2008). The impact of named entity normalization on information retrieval for question answering. In Proceedings of ECIR.
- Kingma and Ba (2014). Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980.
- Metke-Jimenez and Karimi (2016). Concept identification and normalisation for adverse drug event discovery in medical forums. In Proceedings of ISWC.
- Mowery et al. (2014). Task 2: ShARe/CLEF eHealth evaluation lab 2014. In Proceedings of CLEF.
- Muis and Lu (2016). Learning to recognize discontiguous entities. In Proceedings of EMNLP.
- Pradhan et al. (2013). Task 1: ShARe/CLEF eHealth evaluation lab 2013. In Proceedings of CLEF.
- Pradhan et al. (2015). Evaluating the state of the art in disorder recognition and normalization of the clinical narrative. Journal of the American Medical Informatics Association.
- Stix (2004). Finding all maximal cliques in dynamic graphs. Computational Optimization and Applications.
- Tang et al. (2013). Recognizing clinical entities in hospital discharge summaries using structural support vector machines with word representation features. In BMC Medical Informatics and Decision Making.
- Tang et al. (2018). Recognizing continuous and discontinuous adverse drug reaction mentions from social media using LSTM-CRF. Wireless Communications and Mobile Computing.
- Wang and Lu (2019). Combining spans into entities: a neural two-stage approach for recognizing discontiguous entities. In Proceedings of EMNLP.
- Wang et al. (2017). Transition-based disfluency detection using LSTMs. In Proceedings of EMNLP.
- Wang et al. (2020). TPLinker: single-stage joint extraction of entities and relations through token pair linking. In Proceedings of COLING.
- Xue et al. (2020). Coarse-to-fine pre-training for named entity recognition. In Proceedings of EMNLP.
- Yu et al. (2019). Beyond word attention: using segment attention in neural relation extraction. In Proceedings of IJCAI.
- Yu et al. (2020). Joint extraction of entities and relations based on a novel decomposition strategy. In Proceedings of ECAI.
- Zhang et al. (2019). Bridging the gap between training and inference for neural machine translation. In Proceedings of ACL.
- Zhang and Yang (2018). Chinese NER using lattice LSTM. In Proceedings of ACL.
Appendix A Performance Analysis on Different Interval and Span Lengths
Table 7: Distribution of interval lengths in CADEC, ShARe 13, and ShARe 14.
Table 8: Distribution of span lengths in CADEC, ShARe 13, and ShARe 14.
Intervals between segments usually make the total length of a discontinuous mention longer than that of a continuous one. Taking the involved segments into account, the whole span is even longer. That is, different words of a discontinuous mention may be distant from each other, which makes discontinuous NER harder than the conventional NER task and requires models to capture the semantic dependency between distant segments. To further evaluate the robustness of Mac in different settings, we analyse the results on the test sets for different interval and span lengths. The interval length refers to the number of words between discontinuous segments. The span length refers to the number of words of the whole span. For example, for the entity mention “Sever shoulder pain” in “Sever joint, shoulder and upper body pain.”, the interval length is 5, and the span length is 8.
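The two length measures can be computed directly from the segment offsets, as in the sketch below. The helper name and the token indices (inclusive spans over a hypothetical tokenization with punctuation as separate tokens) are assumptions; segments of one mention are assumed not to overlap each other.

```python
def interval_and_span_length(segments):
    """Return (interval_length, span_length) for one mention.

    `segments` is a list of inclusive (start, end) token spans that do not
    overlap each other.
    """
    ordered = sorted(segments)
    span_length = ordered[-1][1] - ordered[0][0] + 1
    covered = sum(end - start + 1 for start, end in ordered)
    # Words inside the span that belong to no segment form the intervals.
    return span_length - covered, span_length


# "Sever shoulder pain" in "Sever joint , shoulder and upper body pain ."
lengths = interval_and_span_length([(0, 0), (3, 3), (7, 7)])
# A continuous mention has no interval.
continuous = interval_and_span_length([(5, 6)])
```

For the running example this reproduces the interval length of 5 and span length of 8 quoted above.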
For convenience of analysis, we report the distribution of all datasets over interval and span lengths in Tables 7 and 8, respectively. Figure 8 shows the F1 scores of Trans and Mac on different interval and span lengths. As we can see, Mac outperforms Trans on most interval and span lengths. Even though Mac is outperformed in some cases, the number of samples in those cases is too small to disprove the superiority of Mac. For example, on CADEC, Trans outperforms Mac when the span length is 8, but there are only 10 such samples in the test set.
We observe an interesting phenomenon: both Mac and the transition-based model Trans perform poorly when the interval length is 1 and the span length is 3, even though the corresponding training samples are sufficient (see length = 1 in Table 7 and length = 3 in Table 8; for discontinuous mentions, when the span length is 3, the interval length can only be 1). For example, ShARe 14 has over 200 training samples whose interval length is 1, but both models perform much worse on them than when the interval length is 3, which has fewer training samples. There are three possible reasons: (1) even though the training samples are sufficient, their features and context differ greatly from those in the test sets; (2) the validation set is too small to choose a good model state for the samples with interval length equal to 1; (3) discontinuous mentions with interval length equal to 1 are harder cases than the others, since having only one word separating the segments makes these discontinuous mentions very similar to continuous ones, which confuses the model into treating them as continuous mentions. We leave this problem to future work.
Appendix B Evaluation Metric
An entity mention is considered correct if and only if the type and the offset of all segments match the ground truth. For example, for an ADE mention “Sever shoulder pain” in “Sever joint, shoulder and upper body pain.”, the type ADE and 6 indices (0, 1, 3, 4, 8, 9) need to be matched.
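The exact-match criterion translates into a short scoring routine. In this sketch a mention is represented as a hypothetical pair of entity type and segment offsets; the offsets below are illustrative token indices and the function name is an assumption.

```python
def evaluate(gold, pred):
    """Exact-match precision, recall and F1 over sets of mentions.

    A mention is (entity_type, ((start, end), ...)); it counts as correct only
    if the type and the offsets of all segments match a gold mention.
    """
    gold, pred = set(gold), set(pred)
    tp = len(gold & pred)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
    return precision, recall, f1


gold = [("ADE", ((0, 0), (3, 3), (7, 7))), ("ADE", ((0, 1), (7, 7)))]
pred = [
    ("ADE", ((0, 0), (3, 3), (7, 7))),  # all segments match: correct
    ("ADE", ((0, 1), (6, 7))),          # one segment boundary is off: wrong
]
scores = evaluate(gold, pred)
```

A mention with even one mismatched segment boundary receives no partial credit, which is what makes discontinuous mentions comparatively hard to score well on.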
Appendix C Downloadable Data
The datasets used in this paper (CADEC, ShARe 13, ShARe 14) are available at the listed links: