A Discourse-Level Named Entity Recognition and Relation Extraction Dataset for Chinese Literature Text
Named Entity Recognition and Relation Extraction for Chinese literature text are regarded as highly difficult problems, partly because of the lack of tagged corpora. In this paper, we build a discourse-level dataset from hundreds of Chinese literature articles to improve this task. To build a high-quality dataset, we propose two tagging methods to solve the problem of data inconsistency: a heuristic tagging method and a machine auxiliary tagging method. Based on this corpus, we also introduce several widely used models to conduct experiments. Experimental results not only show the usefulness of the proposed dataset, but also provide baselines for further research. The dataset is available at https://github.com/lancopku/Chinese-Literature-NER-RE-Dataset.
Recent research on Named Entity Recognition (NER) [Lin and Wu2009, Collobert et al.2011, Huang et al.2015] and Relation Extraction (RE) [Kambhatla2004, Zeng et al.2014, Nguyen and Grishman2015] has focused on news articles and achieved promising performance. However, for a complex but important domain, Chinese literature, the task becomes more difficult due to the lack of datasets. Thus, in this paper, we build a NER and RE dataset from hundreds of Chinese literature articles. Unlike previous sentence-level datasets, where sentences are independent of each other, we build a discourse-level dataset in which sentences from the same passage provide additional context information.
However, tagging entities and relations in Chinese literature text is more difficult than in traditional datasets, which have simple entity classes and explicit relationships. Various rhetorical devices pose great challenges for building a high-consistency dataset. A simple example of personification is shown in Figure 1: “Hamlett” is a person name but refers to a rabbit, so some annotators label it with a “Person” tag while others label it with a “Thing” tag. Thus, the major difficulty lies in how to handle the many ambiguous cases so as to ensure data consistency.
In this paper, we propose two methods to solve this problem. On the one hand, we define several generic disambiguation rules to deal with the most common cases. On the other hand, since these heuristic rules are too generic to handle all ambiguous cases, we also introduce a machine auxiliary tagging method, which uses the annotation standards learned from a subset of the corpus to predict labels on the remaining data. Annotators only need to review the cases where the predicted labels differ from the gold labels, which significantly reduces their effort.
In total, we manually annotate 726 articles, 29,096 sentences, and over 100,000 characters, a task accomplished in 300 person-hours spread across five people and three months.
Based on this corpus, we also introduce some widely used models to conduct experiments. Experimental results not only show the usefulness of the proposed dataset, but also provide baselines for further research.
Our contributions are listed as follows:
We provide a new dataset for joint learning of Named Entity Recognition and Relation Extraction for Chinese literature text.
Unlike previous sentence-level datasets, the proposed dataset is based on the discourse level which provides additional context information.
Based on this corpus, we introduce some widely used models to conduct experiments which can be used as baselines for further works.
Our work is related to recent works on Named Entity Recognition and Relation Extraction, which are briefly introduced as follows.
Named Entity Recognition has a long history in the field of natural language processing. One standard approach to NER is to regard the problem as a sequence labelling problem, where each word is assigned a tag indicating whether it belongs to part of a named entity or appears outside of all entities. Previous approaches used sequence labelling models, such as hidden Markov models (HMMs) [Zhou and Su2002], maximum entropy Markov models (MEMMs) [McCallum et al.2000], structured perceptrons [Sun et al.2009, Sun2015], as well as conditional random fields [Sun et al.2014, Sun2014]. While most research efforts exploited standard word-level features [Ratinov and Roth2009], more sophisticated features can also be used. Ling and Weld [2012] showed that using syntactic-level features from dependency structures in a CRF-based model can lead to improved NER performance. Such dependency structures were also used in the work by Liu, Huang, and Zhu [2010].
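The sequence-labelling formulation above is commonly realized with the BIO scheme, where each token receives a B- (begin), I- (inside), or O (outside) tag. A minimal sketch of decoding entity spans from such tags (the tag names and the example sentence are illustrative, not taken from the dataset):

```python
def bio_decode(tokens, tags):
    """Recover (surface string, type) entity spans from a BIO-tagged sequence."""
    entities, start, etype = [], None, None
    for i, tag in enumerate(tags):
        if tag.startswith("B-"):  # beginning of a new entity
            if start is not None:
                entities.append(("".join(tokens[start:i]), etype))
            start, etype = i, tag[2:]
        elif tag.startswith("I-") and start is not None and tag[2:] == etype:
            continue  # still inside the current entity
        else:  # "O" or an inconsistent tag ends the open span
            if start is not None:
                entities.append(("".join(tokens[start:i]), etype))
            start, etype = None, None
    if start is not None:
        entities.append(("".join(tokens[start:]), etype))
    return entities

# Character-level tagging, as is common for Chinese text: one tag per character.
chars = list("哈姆雷特是一只兔子")
tags = ["B-Person", "I-Person", "I-Person", "I-Person",
        "O", "O", "O", "B-Thing", "I-Thing"]
print(bio_decode(chars, tags))  # [('哈姆雷特', 'Person'), ('兔子', 'Thing')]
```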
More recently, neural networks have achieved promising results [Collobert et al.2011, Huang et al.2015, He and Sun2016, He and Sun2017]. Collobert et al. [2011] used a CNN over a sequence of word embeddings with a CRF layer on top. Huang et al. [2015] presented a CRF-LSTM model that additionally uses hand-crafted spelling features.
Relation Extraction plays an important role in NLP. Traditional methods [Kambhatla2004, Hendrickx et al.2010] are usually feature-based models and their performance strongly depends on the quality of the extracted features. In kernel based methods [Bunescu and Mooney2005, Wang2008, Plank and Moschitti2013], similarity between two data samples is measured without explicit feature representation.
Recently, deep neural networks have been widely used in relation classification. Zeng et al. [2014] exploit a convolutional deep neural network to extract lexical and sentence-level features. Zhang et al. [2015] used bidirectional long short-term memory networks to model the sentence with sequential information. Miwa et al. [2016] present an end-to-end neural model to extract entities and the relations between them; both word sequence and dependency tree information are captured by stacking tree-structured LSTM-RNNs on sequential LSTM-RNNs. Cai et al. [2016] propose BRCNN to model the shortest dependency path (SDP), which picks up bidirectional information with a combination of LSTM and CNN.
Table 1. Entity tags, with examples and their proportion in the corpus.

| Tag | Description | Example | Proportion (%) |
| --- | --- | --- | --- |
| Person | Person name | 李秋 (Qiu Li) | 32.71 |
| Location | Country, city, or other location name | 巴黎 (Paris) | 17.18 |
| Time | Time-related words | 一天 (one day) | 7.36 |
| Metric | Measurement-related words | 一升 (1 L) | 3.64 |
| Organization | Organization name | 信息学报 (Journal of Information Processing) | 2.03 |
| Abstract | Abstract object | 山西日报 (Shanxi Daily) | 1.43 |
Table 2. Relation tags, with examples and their proportion in the corpus.

| Tag | Description | Example | Proportion (%) |
| --- | --- | --- | --- |
| Located | Be located in | 幽兰 (orchid) - 山谷 (valley) | 37.43 |
| Part-Whole | Be a part of | 花 (flower) - 仙人掌 (cactus) | 23.76 |
| Family | Family relationship | 母亲 (mother) - 奶奶 (grandmother) | 10.25 |
| General-Special | Generalization relationship | 鱼 (fish) - 鲫鱼 (carp) | 6.99 |
| Social | Be socially related | 母亲 (mother) - 邻里 (neighbour) | 6.02 |
| Ownership | Possession relationship | 村民 (villager) - 旧屋 (old house) | 5.10 |
| Use | Do something with | 爷爷 (grandfather) - 毛笔 (brush) | 4.76 |
| Create | Bring about something | 男人 (man) - 陶器 (pottery) | 2.93 |
| Near | A short distance away | 山 (hill) - 县城 (town) | 2.76 |
We first obtain over 1,000 Chinese literature articles from the web and then filter them, extracting 726 articles; articles that are too short or too noisy are excluded. Due to the difficulty of tagging Chinese literature text, we divide the annotation process into three steps. The detailed tagging process is shown in Figure 2.
Step 1: First Tagging Process. We first attempt to annotate the raw articles based on the defined entity and relation tags. During this tagging process, we found a problem of data inconsistency. To solve it, we designed the following two steps.
Step 2: Heuristic Tagging Based on Generic Disambiguation Rules. We design several generic disambiguation rules to ensure the consistency of the annotation guidelines. For example, one rule removes all adjective words and tags only the “entity header” when tagging entities (e.g., “a girl in red cloth” becomes “girl”). In this stage, we re-annotate all articles and correct all inconsistent entities and relations based on the heuristic rules.
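As a toy illustration of the “entity header” rule (the POS labels and the example mention are illustrative assumptions; in the paper the rule was applied manually by annotators):

```python
def entity_header(tokens_with_pos):
    """Return the head noun of a mention: drop everything from the first
    post-modifying phrase (here marked ADP) onward, then keep the last
    remaining noun. POS labels are illustrative, not from the dataset."""
    head_span = []
    for tok, pos in tokens_with_pos:
        if pos == "ADP":  # a prepositional post-modifier starts here
            break
        head_span.append((tok, pos))
    nouns = [tok for tok, pos in head_span if pos == "NOUN"]
    return nouns[-1] if nouns else head_span[-1][0]

# "a girl in red cloth" -> only the head noun "girl" is tagged.
mention = [("a", "DET"), ("girl", "NOUN"), ("in", "ADP"),
           ("red", "ADJ"), ("cloth", "NOUN")]
print(entity_header(mention))  # girl
```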
Step 3: Machine Auxiliary Tagging. Even though the heuristic tagging process significantly improves dataset quality, it is too hard to handle all inconsistent cases with a limited set of heuristic rules. Thus, we introduce a machine auxiliary tagging method. The core idea is to train a model that learns the annotation guidelines from a subset of the corpus and produces predicted tags on the remaining data. The predicted tags are compared with the gold tags to discover inconsistent entities and relations, which largely reduces annotators’ efforts. Specifically, we divide the corpus into 10 parts and make predictions on each part using a model trained on the other nine parts. The model used in this paper is a CRF with a simple bigram feature template.
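This cross-validation-style consistency check can be sketched as follows; `train_crf` and `predict` are placeholders standing in for the paper's CRF with a bigram feature template, not a real implementation:

```python
def find_inconsistencies(folds, train_crf, predict):
    """For each of the 10 folds, train on the other folds, predict the
    held-out fold, and flag every position where the model's prediction
    disagrees with the gold tag. Only these positions are re-checked by
    annotators, which is where the effort saving comes from."""
    suspects = []
    for i, held_out in enumerate(folds):
        train_data = [ex for j, f in enumerate(folds) if j != i for ex in f]
        model = train_crf(train_data)  # placeholder trainer
        for sent_id, (tokens, gold) in enumerate(held_out):
            pred = predict(model, tokens)  # placeholder tagger
            for k, (g, p) in enumerate(zip(gold, pred)):
                if g != p:  # possible annotation inconsistency
                    suspects.append((i, sent_id, k, g, p))
    return suspects
```

In practice the flagged positions mix genuine model errors with true annotation mistakes, so the list is a review queue rather than an automatic correction.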
After all annotation steps, we also check all entities and relations to ensure the correctness of the dataset.
We now describe the tagging set and annotation format in detail.
We define 7 entity tags and 9 relation tags based on several available NER and RE datasets but with some additional categories specific to Chinese literature text. Details of the tags are shown in Table 1 and 2.
We add three new entity tags specific to understanding literature text: “Thing”, “Time” and “Metric”. “Thing” captures the objects that articles mainly describe, such as “flower” and “tree”. “Time” captures the time-line of a story, such as “one day” and “one month”. “Metric” captures measurement-related words, such as “1L” and “1mm”.
As for relation tags, we set 9 different classes for better understanding the connection between entities, including “Located”, “Near”, “Part-Whole”, “Family”, “Social”, “Create”, “Use”, “Ownership”, “General-Special”. For building the relations between people in literature articles, we use the “Social” tag, which is not quite common in other corpora.
Each entity is identified by a “T” tag, which takes several attributes:
Id: a unique number identifying the entity within the document. It starts at 0, and is incremented every time a new entity is identified within the same document.
Type: one of the entity tags.
Begin Index: the begin index of an entity. It starts at 0, and is incremented every character.
End Index: the end index of an entity. It starts at 0, and is incremented every character.
Value: words being referred to an identifiable object.
Each relation is identified by “R” tag, which can take several attributes:
Id: a unique number identifying the relation within the document. It starts at 0, and is incremented every time a new relation is identified within the same document.
Arg1 and Arg2: two entities associated with a relation.
Type: one of the relation tags.
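Putting the attribute descriptions above together, an entity or relation record might be parsed as below. The brat-style tab-separated layout is an assumption for illustration; the exact serialization of the released files may differ:

```python
def parse_line(line):
    """Parse one standoff annotation record: a 'T' (entity) or 'R' (relation) line."""
    ident, body = line.split("\t", 1)
    if ident.startswith("T"):  # entity: type, begin index, end index, value
        meta, value = body.split("\t")
        etype, begin, end = meta.split()
        return {"id": ident, "type": etype,
                "begin": int(begin), "end": int(end), "value": value}
    if ident.startswith("R"):  # relation: type plus its two entity arguments
        rtype, arg1, arg2 = body.split()
        return {"id": ident, "type": rtype,
                "arg1": arg1.split(":")[1], "arg2": arg2.split(":")[1]}
    raise ValueError("unknown record: " + line)

print(parse_line("T0\tPerson 0 2\t李秋"))
print(parse_line("R0\tLocated Arg1:T1 Arg2:T2"))
```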
Comparison of relation extraction methods on the proposed corpus (F1-scores).

| Model | Features | F1 |
| --- | --- | --- |
| SVM [Hendrickx et al.2010] | word embeddings, POS, dependency parse, Google n-gram, NER, WordNet, HowNet | 48.9 |
| RNN [Socher et al.2011] | + POS, NER, WordNet | 49.1 |
| CNN [Zeng et al.2014] | + word position embeddings, NER, WordNet | 52.4 |
| CR-CNN [Santos et al.2015] | + word position embeddings | 54.1 |
| SDP-LSTM [Xu et al.2015] | + POS, NER, WordNet | 55.3 |
| DepNN [Liu et al.2015] | word embeddings, WordNet | 55.2 |
| BRCNN [Cai et al.2016] | + POS, NER, WordNet | 55.6 |
We introduce several baselines to conduct experiments. In this section, we describe the experimental settings, the baselines, and the experimental results in detail.
Experiments are performed on a commodity 64-bit Dell Precision T5810 workstation with one 3.0 GHz 16-core CPU and 64 GB RAM. The performance of the NER and RE models is evaluated by F1-score. For training, we use mini-batch stochastic gradient descent to minimize the negative log-likelihood, with shuffled mini-batches of size 32.
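F1 here is the standard entity-level measure: a predicted entity counts as correct only if both its span and its type match the gold annotation exactly. A minimal sketch:

```python
def span_f1(gold_spans, pred_spans):
    """Entity-level precision, recall, and F1 over (begin, end, type) spans."""
    gold, pred = set(gold_spans), set(pred_spans)
    tp = len(gold & pred)  # exact span-and-type matches only
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# One of two predictions has the right span but the wrong type, so it
# counts as an error under the exact-match criterion.
gold = [(0, 2, "Person"), (5, 7, "Thing")]
pred = [(0, 2, "Person"), (5, 7, "Location")]
print(span_f1(gold, pred))  # (0.5, 0.5, 0.5)
```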
LSTM. We consider a bi-directional LSTM as one of the baseline models. Both the character embedding dimension and the hidden dimension are set to 100.
The results are shown in Table 3. It can be clearly seen that the CRF achieves better performance than the Bi-LSTM on all tags, which can probably be attributed to its feature template.
All models perform better on the “Person”, “Thing” and “Time” tags than on the “Location”, “Organization” and “Metric” tags, which shows that “Person”, “Thing” and “Time” are the most easily identifiable entities. Data sparsity makes the “Location”, “Organization” and “Metric” tags hard to capture.
The higher accuracies show that the entities predicted by the models are probably correctly tagged, which reflects the consistency between the training and test sets. The lower recalls indicate that there are still many unknown entities in the test set. How to handle these unknown entities is an urgent problem for further research.
The table above compares several state-of-the-art methods on the proposed corpus. The first entry presents the highest performance achieved by traditional feature-based methods: Hendrickx et al. [2010] feed a variety of hand-crafted features to an SVM classifier and achieve an F1-score of 48.9.
Recent performance improvements on the task of relation classification are mostly achieved with the help of neural networks. Socher et al. [2011] build a recursive neural network on the constituency tree and achieve performance comparable to Hendrickx et al. [2010]. Xu et al. [2015] introduce a type of gated recurrent neural network which raises the F1-score to 55.3. By diminishing the impact of the other classes, Santos et al. [2015] achieve an F1-score of 54.1. Along the line of CNNs, Liu et al. [2015] achieve an F1-score of 55.2.
We build a discourse-level Named Entity Recognition and Relation Extraction dataset for Chinese literature text. To solve the problem of data inconsistency in the tagging process, we propose two methods: a heuristic tagging method and a machine auxiliary tagging method. Based on this corpus, we introduce several widely used models to conduct experiments, which provide baselines for further research.
Cai, R., Zhang, X., and Wang, H. (2016). Bidirectional recurrent convolutional neural network for relation classification. In Proceedings of ACL (Volume 1).
He, H. and Sun, X. (2017). A unified model for cross-domain and semi-supervised named entity recognition in Chinese social media. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, pages 3216–3222.
Hendrickx, I., Kim, S. N., Kozareva, Z., Nakov, P., Ó Séaghdha, D., Padó, S., Pennacchiotti, M., Romano, L., and Szpakowicz, S. (2010). SemEval-2010 task 8: Multi-way classification of semantic relations between pairs of nominals. In Proceedings of the 5th International Workshop on Semantic Evaluation, SemEval ’10, pages 33–38, Stroudsburg, PA, USA. Association for Computational Linguistics.
Socher, R., Pennington, J., Huang, E. H., Ng, A. Y., and Manning, C. D. (2011). Semi-supervised recursive autoencoders for predicting sentiment distributions. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 151–161. Association for Computational Linguistics.