Entity-Relation Extraction as Multi-Turn Question Answering

by Xiaoya Li, et al.

In this paper, we propose a new paradigm for the task of entity-relation extraction. We cast the task as a multi-turn question answering problem, i.e., the extraction of entities and relations is transformed to the task of identifying answer spans from the context. This multi-turn QA formalization comes with several key advantages: firstly, the question query encodes important information for the entity/relation class we want to identify; secondly, QA provides a natural way of jointly modeling entity and relation; and thirdly, it allows us to exploit the well-developed machine reading comprehension (MRC) models. Experiments on the ACE and the CoNLL04 corpora demonstrate that the proposed paradigm significantly outperforms previous best models. We are able to obtain state-of-the-art results on all of the ACE04, ACE05 and CoNLL04 datasets, increasing the SOTA results on the three datasets to 49.6 (+1.2), 60.3 (+0.7) and 69.2 (+1.4), respectively. Additionally, we construct a new dataset, RESUME, which requires multi-step reasoning to construct entity dependencies, as opposed to the single-step dependency extraction in the triplet extraction of previous datasets. The proposed multi-turn QA model also achieves the best performance on the RESUME dataset.




1 Introduction

Identifying entities and their relations is the prerequisite of extracting structured knowledge from unstructured raw texts, which has received growing interest in recent years. Given a chunk of natural language text, the goal of entity-relation extraction is to transform it into a structural knowledge base. For example, given the following text:

In 2002, Musk founded SpaceX, an aerospace manufacturer and space transport services company, of which he is CEO and lead designer. He helped fund Tesla, Inc., an electric vehicle and solar panel manufacturer, in 2003, and became its CEO and product architect. In 2006, he inspired the creation of SolarCity, a solar energy services company, and operates as its chairman. In 2016, he co-founded Neuralink, a neurotechnology company focused on developing brain–computer interfaces, and is its CEO. In 2016, Musk founded The Boring Company, an infrastructure and tunnel-construction company.

We need to extract four different types of entities, i.e., Person, Company, Time and Position, and three types of relations, found, founding-Time and serving-role. The text is to be transformed into a structural dataset shown in Table 1.

Person  Corp                Time  Position
Musk    SpaceX              2002  CEO
Musk    Tesla               2003  CEO & product architect
Musk    SolarCity           2006  chairman
Musk    Neuralink           2016  CEO
Musk    The Boring Company  2016  -
Table 1: An illustration of an extracted structural table.

Most existing models approach this task by extracting a list of triples from the text, i.e., rel(e1, e2), which denotes that relation rel holds between entity e1 and entity e2. Previous models fall into two major categories: the pipelined approach, which first uses tagging models to identify entities, and then uses relation extraction models to identify the relation between each entity pair; and the joint approach, which combines the entity model and the relation model through different strategies, such as constraints or parameter sharing.

There are several key issues with current approaches, both in terms of the task formalization and the algorithm. At the formalization level, the rel(e1, e2) triplet structure is not enough to fully express the data structure behind the text. Take the Musk case as an example: there is a hierarchical dependency between the tags. The extraction of Time depends on Position, since a Person can hold multiple Positions in a Company during different Time periods. The extraction of Position also depends on Company, since a Person can work for multiple companies. At the algorithm level, for most existing relation extraction models Miwa and Bansal (2016); Wang et al. (2016a); Ye et al. (2016), the input to the model is a raw sentence with two marked mentions, and the output is whether a relation holds between the two mentions. As pointed out in Wang et al. (2016a) and Zeng et al. (2018), it is hard for neural models to capture all the lexical, semantic and syntactic cues in this formalization, especially when (1) entities are far apart; (2) one entity is involved in multiple triplets; or (3) relation spans overlap (e.g., in the text A B C D, (A, C) is a pair and (B, D) is a pair).

In this paper, we propose a new paradigm to handle the task of entity-relation extraction. We formalize the task as a multi-turn question answering task: each entity type and relation type is characterized by a question answering template, and entities and relations are extracted by answering template questions. Answers are text spans, extracted using the now-standard machine reading comprehension (MRC) framework: predicting answer spans given context Seo et al. (2016); Wang and Jiang (2016); Xiong et al. (2017); Wang et al. (2016b). To extract structural data like Table 1, the model needs to answer the following questions sequentially:

  • Q: who is mentioned in the text? A: Musk;

  • Q: which company/companies did Musk work for? A: SpaceX, Tesla, SolarCity, Neuralink and The Boring Company;

  • Q: when did Musk join SpaceX? A: 2002;

  • Q: what was Musk’s position in SpaceX? A: CEO.

Treating the entity-relation extraction task as a multi-turn QA task has the following key advantages: (1) the multi-turn QA setting provides an elegant way to capture the hierarchical dependency of tags. As the multi-turn QA proceeds, we progressively obtain the entities we need for the next turn. This is closely akin to the multi-turn slot-filling dialogue system Williams and Young (2005); Lemon et al. (2006); (2) the question query encodes important prior information for the relation class we want to identify. This informativeness can potentially solve the issues that existing relation extraction models fail to solve, such as distantly separated entity pairs and relation span overlap; (3) the QA framework provides a natural way to simultaneously extract entities and relations: most MRC models support outputting a special None token, indicating that there is no answer to the question. Through this, the original two tasks, entity extraction and relation extraction, can be merged into a single QA task: a relation holds if the returned answer to the question corresponding to that relation is not None, and this returned answer is the entity that we wish to extract.

In this paper, we show that the proposed paradigm, which transforms the entity-relation extraction task to a multi-turn QA task, introduces a significant performance boost over existing systems. It achieves state-of-the-art (SOTA) performance on the ACE and the CoNLL04 datasets. The tasks on these datasets are formalized as triplet extraction problems, in which two turns of QA suffice. We thus build a more complicated and more difficult dataset called RESUME, which requires extracting biographical information of individuals from raw texts. The construction of a structural knowledge base from RESUME requires four or five turns of QA. We also show that this multi-turn QA setting can easily integrate reinforcement learning (just as in multi-turn dialog systems) to gain an additional performance boost.

The rest of this paper is organized as follows: Section 2 details related work. We describe the dataset and setting in Section 3, the proposed model in Section 4, and experimental results in Section 5. We conclude this paper in Section 6.

2 Related Work

2.1 Extracting Entities and Relations

Many earlier entity-relation extraction systems are pipelined Zelenko et al. (2003); Miwa et al. (2009); Chan and Roth (2011); Lin et al. (2016): an entity extraction model first identifies entities of interest and a relation extraction model then constructs relations between the extracted entities. Although pipelined systems have the flexibility of integrating different data sources and learning algorithms, they suffer significantly from error propagation.

To tackle this issue, joint learning models have been proposed. Earlier joint learning approaches connect the two models through various dependencies, including constraints solved by integer linear programming Yang and Cardie (2013); Roth and Yih (2007), card-pyramid parsing Kate and Mooney (2010), and global probabilistic graphical models Yu and Lam (2010); Singh et al. (2013). In later studies, Li and Ji (2014) extract entity mentions and relations using a structured perceptron with efficient beam search, which is significantly more efficient and less time-consuming than constraint-based approaches. Miwa and Sasaki (2014), Gupta et al. (2016) and Zhang et al. (2017) proposed the table-filling approach, which provides an opportunity to incorporate more sophisticated features and algorithms into the model, such as search orders in decoding and global features. Neural network models have been widely used in the literature as well. Miwa and Bansal (2016) introduced an end-to-end approach that extracts entities and their relations using neural network models with shared parameters, i.e., extracting entities using a neural tagging model and extracting relations using a neural multi-class classification model based on tree LSTMs Tai et al. (2015). Wang et al. (2016a) extract relations using multi-level attention CNNs. Zeng et al. (2018) proposed a new framework that uses sequence-to-sequence models to generate entity-relation triples, naturally combining entity detection and relation detection.

Another way to bind the entity and the relation extraction models is to use reinforcement learning or Minimum Risk Training, in which the training signals are given based on the joint decision of the two models. Sun et al. (2018) optimized a global loss function to jointly train the two models under the framework of Minimum Risk Training. Takanobu et al. (2018) used hierarchical reinforcement learning to extract entities and relations in a hierarchical manner.

2.2 Machine Reading Comprehension

Mainstream MRC models Seo et al. (2016); Wang and Jiang (2016); Xiong et al. (2017); Wang et al. (2016b) extract text spans in passages given queries. Text span extraction can be simplified to two multi-class classification tasks, i.e., predicting the starting and the ending positions of the answer. A similar strategy can be extended to multi-passage MRC Joshi et al. (2017); Dunn et al. (2017), where the answer needs to be selected from multiple passages. Multi-passage MRC tasks can be easily simplified to single-passage MRC tasks by concatenating passages Shen et al. (2017); Wang et al. (2017b). Wang et al. (2017a) first rank the passages and then run single-passage MRC on the selected passage. Tan et al. (2017) train the passage ranking model jointly with the reading comprehension model. Pretraining methods like BERT Devlin et al. (2018) or ELMo Peters et al. (2018) have proved to be extremely helpful in MRC tasks.

There has been a tendency of casting non-QA NLP tasks as QA tasks McCann et al. (2018). Our work is highly inspired by Levy et al. (2017). Levy et al. (2017) and McCann et al. (2018) focus on identifying the relation between two pre-defined entities, and the authors formalize the task of relation extraction as a single-turn QA task. In the current paper we study a more complicated scenario, where hierarchical tag dependency needs to be modeled and a single-turn QA approach no longer suffices. We show that our multi-turn QA method is able to solve this challenge and obtain new state-of-the-art results.

3 Datasets and Tasks

3.1 ACE04, ACE05 and CoNLL04

We use ACE04, ACE05 and CoNLL04 Roth and Yih (2004), the widely used entity-relation extraction benchmarks, for evaluation. ACE04 defines 7 entity types, including Person (per), Organization (org), Geographical Entities (gpe), Location (loc), Facility (fac), Weapon (wea) and Vehicle (veh). For each pair of entities, it defines 7 relation categories, including Physical (phys), Person-Social (per-soc), Employment-Organization (emp-org), Agent-Artifact (art), PER/ORG Affiliation (other-aff), GPE Affiliation (gpe-aff) and Discourse (disc). ACE05 was built upon ACE04. It kept the per-soc, art and gpe-aff categories from ACE04 but split phys into phys and a new relation category part-whole. It also deleted disc and merged emp-org and other-aff into a new category emp-org. As for CoNLL04, it defines four entity types (loc, org, per and others) and five relation categories (located_in, work_for, orgBased_in, live_in and kill).

For ACE04 and ACE05, we followed the training/dev/test split in Li and Ji (2014) and Miwa and Bansal (2016) (https://github.com/tticoin/LSTM-ER/). For the CoNLL04 dataset, we followed Miwa and Sasaki (2014).

3.2 RESUME: A newly constructed dataset

The ACE and the CoNLL04 datasets are intended for triplet extraction, and two turns of QA are sufficient to extract the triplet (one turn for head-entities and another for joint extraction of tail-entities and relations). These datasets do not involve hierarchical entity relations as in our previous Musk example, which are prevalent in real-life applications.

Therefore, we construct a new dataset called RESUME. We extracted 841 paragraphs from chapters describing management teams in IPO prospectuses. Each paragraph describes some work history of an executive. We wish to extract the structural data from the resume. The dataset is in Chinese. The following shows an example:


Mr. Zheng Qiang, a supervisor of the Company. He was born in 1973. His nationality is Chinese with no permanent residency abroad. He graduated from Nanjing University with a major in economic management in 1995. From 1995 to 1998, he worked for Jiangsu Changzhou Road Transportation Co., Ltd. as an organizer of accounting. From 1998 to 2000, he worked as a project manager in Yuexiu Certified Public Accountants. In 2010, he worked in the Guangdong branch of Guofu Haohua Certified Public Accountants Co., Ltd., and served as a project manager, department manager, partner and deputy chief accountant. From 2010 to 2011, he worked for Guangdong Zhongke Investment Venture Capital Management Co., Ltd. as a deputy general manager; since 2011, he has served as the director and general manager of Guangdong Zhongguang Investment Management Co., Ltd.; since 2016, he has served as director and general manager of Zhanjiang Zhongguang Venture Capital Co., Ltd.; since March 2016, he has served as the supervisor of the Company.

We identify four types of entities: Person (the name of the executive), Company (the company that the executive works/worked for), Position (the position that he/she holds/held) and Time (the time period that the executive occupies/occupied that position). It is worth noting that one person can work for different companies during different periods of time and that one person can hold different positions in different periods of time for the same company.

We recruited crowdworkers to fill the slots in Table 1. Each passage was labeled by two different crowdworkers. If the labels from the two annotators disagreed, one or more additional annotators were asked to label the passage and a majority vote was taken as the final decision. Since the wording of the text is usually very explicit and formal, the inter-annotator agreement is very high, achieving a value of 93.5% for all slots. Some statistics of the dataset are shown in Table 2. We randomly split the dataset into training (80%), validation (10%) and test (10%) sets.

          Total #  Average # per passage
Person    961      1.09
Company   1988     2.13
Position  2687     1.33
Time      1275     1.01
Table 2: Statistics for the RESUME dataset.

4 Model

1: Input: sentence s, EntityQuesTemplates, ChainOfRelTemplates
2: Output: a list of lists (table) M = []
3: HeadEntList = []
4: for entity_question in EntityQuesTemplates do
5:      e1 = Extract_Answer(entity_question, s)
6:      if e1 ≠ None do
7:         HeadEntList = HeadEntList + {e1}
8:      endif
9: end for
10: for head_entity in HeadEntList do
11:      ent_list = [head_entity]
12:      for [rel, rel_temp] in ChainOfRelTemplates do
13:           for (rel, rel_temp) in List of [rel, rel_temp] do
14:                q = GenQues(rel_temp, rel, ent_list)
15:                e = Extract_Answer(q, s)
16:                if e ≠ None
17:                    ent_list = ent_list + e
18:                endif
19:           end for
20:      end for
21:      if len(ent_list) = len([rel, rel_temp])
22:          M = M + ent_list
23:      endif
24: end for
Algorithm 1 Transforming the entity-relation extraction task to a multi-turn QA task.

4.1 System Overview

Relation Type  head-e  tail-e  Natural Language Question                                       Template Question
gen-aff        FAC     GPE     find a geo-political entity that connects to XXX                XXX; has affiliation; geo-political entity
part-whole     FAC     FAC     find a facility that geographically relates to XXX              XXX; part whole; facility
part-whole     FAC     GPE     find a geo-political entity that geographically relates to XXX  XXX; part whole; geo-political entity
part-whole     FAC     VEH     find a vehicle that belongs to XXX                              XXX; part whole; vehicle
phys           FAC     FAC     find a facility near XXX?                                       XXX; physical; facility
art            GPE     FAC     find a facility which is made by XXX                            XXX; agent artifact; facility
art            GPE     VEH     find a vehicle which is owned or used by XXX                    XXX; agent artifact; vehicle
art            GPE     WEA     find a weapon which is owned or used by XXX                     XXX; agent artifact; weapon
org-aff        GPE     ORG     find an organization which is invested by XXX                   XXX; organization affiliation; organization
part-whole     GPE     GPE     find a geo political entity which is controlled by XXX          XXX; part whole; geo-political entity
part-whole     GPE     LOC     find a location geographically related to XXX                   XXX; part whole; location
Table 3: Some of the question templates for different relation types in ACE.
Q1  Person:    who is mentioned in the text? A: e1
Q2  Company:   which companies did e1 work for? A: e2
Q3  Position:  what was e1’s position in e2? A: e3
Q4  Time:      during which period did e1 work for e2 as e3? A: e4
Table 4: Question templates for the RESUME dataset.

The overview of the algorithm is shown in Algorithm 1. The algorithm contains two stages:

(1) The head-entity extraction stage (line 4-9): each episode of multi-turn QA is triggered by an entity. To extract this starting entity, we transform each entity type to a question using EntityQuesTemplates (line 4) and the entity is extracted by answering the question (line 5). If the system outputs the special None token, it means that the sentence s does not contain any entity of that type.

(2) The relation and the tail-entity extraction stage (line 10-24): ChainOfRelTemplates defines a chain of relations, the order of which we need to follow to run multi-turn QA. The reason is that the extraction of some entities depends on the extraction of others. For example, in the RESUME dataset, the position held by an executive relies on the company he works for. Also the extraction of the Time entity relies on the extraction of both the Company and the Position. The extraction order is manually pre-defined. ChainOfRelTemplates also defines the template for each relation. Each template contains some slots to be filled. To generate a question (line 14), we insert previously extracted entity/entities to the slot/slots in a template. The relation rel and tail-entity will be jointly extracted by answering the generated question (line 15). A returned none token indicates that there is no answer in the given sentence.

It is worth noting that entities extracted from the head-entity extraction stage may not all be head entities. In the subsequent relation and tail-entity extraction stage, extracted entities from the first stage are initially assumed to be head entities, and are fed to the templates to generate questions. If an entity extracted from the first stage is indeed a head-entity of a relation, then the QA model will extract the tail-entity by answering the corresponding question. Otherwise, the answer will be None and thus ignored.

For the ACE04, ACE05 and CoNLL04 datasets, only two QA turns are needed. ChainOfRelTemplates thus only contains chains of length 1. For RESUME, we need to extract 4 entities, so ChainOfRelTemplates contains chains of length 3.
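The two stages can be sketched in Python as follows. This is a simplified illustration rather than the authors' implementation: it assumes one answer per turn, stops a chain as soon as a turn returns no answer, and takes the answer extractor as a plain callable. The names `multi_turn_extract`, `toy_extract` and the `{0}`-style template slots are hypothetical.

```python
from typing import Callable, List, Optional, Tuple

# Hypothetical interface: (question, sentence) -> answer span, or None for "no answer".
AnswerFn = Callable[[str, str], Optional[str]]


def multi_turn_extract(sentence: str,
                       entity_questions: List[str],
                       chain_of_rel_templates: List[Tuple[str, str]],
                       extract_answer: AnswerFn) -> List[List[str]]:
    """Return one table row per head entity whose whole chain was answered."""
    table: List[List[str]] = []
    # Stage 1: head-entity extraction, one question per entity type.
    head_entities = []
    for question in entity_questions:
        answer = extract_answer(question, sentence)
        if answer is not None:
            head_entities.append(answer)
    # Stage 2: follow the manually ordered relation chain for each head entity.
    for head in head_entities:
        ent_list = [head]
        for _rel, template in chain_of_rel_templates:
            # Fill template slots with the entities extracted so far.
            question = template.format(*ent_list)
            answer = extract_answer(question, sentence)
            if answer is None:
                break  # simplification: later turns depend on this slot
            ent_list.append(answer)
        if len(ent_list) == len(chain_of_rel_templates) + 1:
            table.append(ent_list)
    return table


# Toy stand-in for the MRC model: a lookup table of canned answers.
toy_answers = {
    "who is mentioned in the text?": "Musk",
    "which company did Musk work for?": "SpaceX",
    "what was Musk's position in SpaceX?": "CEO",
    "when did Musk join SpaceX?": "2002",
}


def toy_extract(question: str, sentence: str) -> Optional[str]:
    return toy_answers.get(question)


toy_chain = [
    ("company", "which company did {0} work for?"),
    ("position", "what was {0}'s position in {1}?"),
    ("time", "when did {0} join {1}?"),
]
rows = multi_turn_extract("In 2002, Musk founded SpaceX ...",
                          ["who is mentioned in the text?"],
                          toy_chain, toy_extract)
```

With the canned answers above, `rows` holds a single row in the order Person, Company, Position, Time. A real system would replace `toy_extract` with the MRC model and allow several answers per turn.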

4.2 Generating Questions using Templates

Each entity type is associated with a type-specific question generated by the templates. There are two ways to generate questions based on templates: natural language questions or pseudo-questions. A pseudo-question is not necessarily grammatical. For example, the natural language question for the Facility type could be Which facility is mentioned in the text, and the pseudo-question could just be entity: facility.

At the relation and tail-entity joint extraction stage, a question is generated by combining a relation-specific template with the extracted head-entity. The question could be either a natural language question or a pseudo-question. Examples are shown in Table 3 and Table 4.
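As a small illustration of template filling, the sketch below stores two rows of Table 3 and substitutes the extracted head-entity for the XXX slot. The dictionary layout and the function name `gen_question` are assumptions made for this example.

```python
# (relation, head type, tail type) -> (natural-language template, pseudo-question template)
# The two entries are copied from Table 3; the data layout itself is hypothetical.
TEMPLATES = {
    ("art", "GPE", "VEH"): ("find a vehicle which is owned or used by XXX",
                            "XXX; agent artifact; vehicle"),
    ("part-whole", "FAC", "VEH"): ("find a vehicle that belongs to XXX",
                                   "XXX; part whole; vehicle"),
}


def gen_question(relation: str, head_type: str, tail_type: str,
                 head_entity: str, natural: bool = True) -> str:
    """Fill the XXX slot of the chosen template with the head entity."""
    natural_template, pseudo_template = TEMPLATES[(relation, head_type, tail_type)]
    template = natural_template if natural else pseudo_template
    return template.replace("XXX", head_entity)


natural_q = gen_question("art", "GPE", "VEH", "iraq")
pseudo_q = gen_question("art", "GPE", "VEH", "iraq", natural=False)
```

Here `natural_q` is a grammatical request, while `pseudo_q` is only a semicolon-separated list of cues.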

4.3 Extracting Answer Spans via MRC

Various MRC models have been proposed, such as BiDAF Seo et al. (2016) and QANet Yu et al. (2018). In the standard MRC setting, given a question $Q = \{q_1, q_2, \ldots, q_{N_q}\}$, where $N_q$ denotes the number of words in $Q$, and a context $C = \{c_1, c_2, \ldots, c_{N_c}\}$, where $N_c$ denotes the number of words in $C$, we need to predict the answer span. For the QA framework, we use BERT Devlin et al. (2018) as a backbone. BERT performs bidirectional language model pretraining on large-scale datasets using transformers Vaswani et al. (2017) and achieves SOTA results on MRC datasets like SQuAD Rajpurkar et al. (2016). To align with the BERT framework, the question and the context are combined by concatenating the list [CLS, Q, SEP, C, SEP], where CLS and SEP are special tokens, $Q$ is the tokenized question and $C$ is the context. The representation of each context token is obtained using multi-layer transformers.
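The concatenation step can be sketched without any model code. The example below uses pre-split tokens in place of BERT's WordPiece tokenizer, which is a simplification; the segment ids follow the usual BERT convention for sentence pairs, and the name `build_input` is hypothetical.

```python
from typing import List, Tuple


def build_input(question_tokens: List[str],
                context_tokens: List[str]) -> Tuple[List[str], List[int], int]:
    """Build [CLS] Q [SEP] C [SEP], segment ids, and the index of the first context token."""
    tokens = ["[CLS]"] + question_tokens + ["[SEP]"] + context_tokens + ["[SEP]"]
    # Segment 0 covers [CLS], the question and the first [SEP];
    # segment 1 covers the context and the final [SEP].
    segment_ids = [0] * (len(question_tokens) + 2) + [1] * (len(context_tokens) + 1)
    context_start = len(question_tokens) + 2
    return tokens, segment_ids, context_start


example_tokens, example_segments, example_start = build_input(
    ["who", "is", "mentioned", "?"], ["Musk", "founded", "SpaceX"])
```

Only the positions from `context_start` up to the final separator are candidates for answer labels, since the answer must be a span of the context.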

Traditional MRC models Wang and Jiang (2016); Xiong et al. (2017) predict the starting and ending indices by applying two softmax layers to the context tokens. This softmax-based span extraction strategy only fits single-answer extraction tasks, but not our task, since one sentence/passage in our setting might contain multiple answers. To tackle this issue, we formalize the task as a query-based tagging problem Lafferty et al. (2001); Huang et al. (2015); Ma and Hovy (2016). Specifically, we predict a BMEO (beginning, inside, ending and outside) label for each token in the context given the query. The representation of each word is fed to a softmax layer to output a BMEO label. One can think of this as transforming two $N$-class classification tasks of predicting the starting and the ending indices (where $N$ denotes the length of the sentence) into $N$ 5-class classification tasks. (For some of the relations that we are interested in, the corresponding questions have single answers. We tried the strategy of predicting the starting and the ending index and found the results no different from the ones in the multi-answer QA-based tagging setting.)
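To show why per-token tagging permits several answers per question, here is a minimal decoder that turns a predicted tag sequence into spans. The B, M, E and O tags come from the text; the single-token tag S is an assumption added for this sketch (the text mentions five classes but does not name the fifth), and the handling of ill-formed sequences is likewise a choice made here.

```python
from typing import List, Tuple


def decode_spans(tags: List[str]) -> List[Tuple[int, int]]:
    """Return inclusive (start, end) token spans from a B/M/E/O/S tag sequence."""
    spans: List[Tuple[int, int]] = []
    start = None  # index of the currently open B tag, if any
    for i, tag in enumerate(tags):
        if tag == "S":        # assumed single-token answer tag
            spans.append((i, i))
            start = None
        elif tag == "B":      # open a new span
            start = i
        elif tag == "M":      # stay inside the open span, if there is one
            pass
        elif tag == "E":      # close the open span; ignore a dangling E
            if start is not None:
                spans.append((start, i))
            start = None
        else:                 # "O" discards any unfinished span
            start = None
    return spans


example_tags = ["S", "O", "O", "B", "E", "O", "B", "M", "E"]
example_spans = decode_spans(example_tags)
```

For `example_tags` the decoder returns three spans, something a single start/end softmax pair cannot express.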

Training and Test

At training time, we jointly train the objectives for the two stages:

$$\mathcal{L} = (1 - \lambda)\,\mathcal{L}(\text{head-entity}) + \lambda\,\mathcal{L}(\text{tail-entity, rel})$$

$\lambda \in [0, 1]$ is the parameter controlling the trade-off between the two objectives. Its value is tuned on the validation set. Both models are initialized using the standard BERT model and they share parameters during training. At test time, head-entities and tail-entities are extracted separately based on the two objectives.
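A minimal sketch of the trade-off, assuming the two objectives are combined by linear interpolation so that a weight of 0 keeps only the head-entity loss (consistent with the ablation in Section 6.2). The function name `joint_loss` and the scalar inputs are illustrative; in practice both terms would be losses computed on a batch.

```python
def joint_loss(head_entity_loss: float, tail_entity_loss: float, lam: float) -> float:
    """Interpolate the two stage losses; lam = 0 keeps only the head-entity term."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    return (1.0 - lam) * head_entity_loss + lam * tail_entity_loss


only_head = joint_loss(2.0, 4.0, 0.0)
only_tail = joint_loss(2.0, 4.0, 1.0)
mixed = joint_loss(2.0, 4.0, 0.7)
```

Since the two models share parameters, any intermediate weight lets gradients from both stages update the same encoder.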

4.4 Reinforcement Learning

Note that in our setting, the extracted answer from one turn not only affects its own accuracy, but also determines how a question will be constructed for the downstream turns, which in turn affects later accuracies. We use reinforcement learning to tackle this problem, since it has proved successful in multi-turn dialogue generation Mrkšić et al. (2015); Li et al. (2016); Wen et al. (2016), a task that faces the same challenge as ours.

Action and Policy

In an RL setting, we need to define the action and the policy. In the multi-turn QA setting, the action is selecting a text span in each turn. The policy defines the probability of selecting a certain span given the question and the context. As the algorithm relies on the BMEO tagging output, the probability of selecting a certain span $\{w_1, w_2, \ldots, w_n\}$ is the joint probability of $w_1$ being assigned to B (beginning), $w_2, \ldots, w_{n-1}$ being assigned to M (inside) and $w_n$ being assigned to E (end), written as follows:

$$p(y(w_1, \ldots, w_n) = \text{answer} \mid \text{question}, s) = p(w_1 = \text{B}) \times p(w_n = \text{E}) \prod_{i \in [2, n-1]} p(w_i = \text{M})$$
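The span probability under the tagging output can be computed as below. The sketch works in log space for numerical stability and assumes each token comes with a dictionary of tag probabilities; the name `span_log_prob` is hypothetical, and single-token spans are rejected because the beginning/inside/end factorization described here needs at least two tokens.

```python
import math
from typing import Dict, List


def span_log_prob(tag_probs: List[Dict[str, float]], start: int, end: int) -> float:
    """Log-probability that tokens start..end (inclusive) are tagged B, M, ..., M, E."""
    if end <= start:
        raise ValueError("the B...M...E factorization needs at least two tokens")
    log_p = math.log(tag_probs[start]["B"]) + math.log(tag_probs[end]["E"])
    for i in range(start + 1, end):
        log_p += math.log(tag_probs[i]["M"])
    return log_p


toy_tag_probs = [
    {"B": 0.5, "M": 0.2, "E": 0.2, "O": 0.1},
    {"B": 0.1, "M": 0.5, "E": 0.2, "O": 0.2},
    {"B": 0.1, "M": 0.2, "E": 0.5, "O": 0.2},
]
toy_span_prob = math.exp(span_log_prob(toy_tag_probs, 0, 2))  # 0.5 * 0.5 * 0.5
```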



For a given sentence $s$, we use the number of correctly retrieved triples as rewards. We use the REINFORCE algorithm Williams (1992), a kind of policy gradient method, to find the optimal policy, which maximizes the expected reward $E_{\pi}[R(w)]$. The expectation is approximated by sampling from the policy $\pi$ and the gradient is computed using the likelihood ratio:

$$\nabla E(\theta) \approx [R(w) - b]\,\nabla \log \pi(y(w) \mid \text{question}, s)$$

where $b$ denotes a baseline value. For each turn in the multi-turn QA setting, getting an answer correct leads to a reward of +1. The final reward is the cumulative reward of all turns. The baseline value is set to the average of all previous rewards. We do not initialize policy networks from scratch, but use the pre-trained head-entity and tail-entity extraction model described in the previous section. We also use the experience replay strategy Mnih et al. (2015): for each batch, half of the examples are simulated and the other half are randomly selected from previously generated examples.
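The likelihood-ratio update with an average-reward baseline can be illustrated on a toy problem. This sketch is not the paper's training code: the policy is a softmax over three candidate spans with one logit each, the reward is 1 when a designated span is picked, and experience replay and BERT initialization are left out. All names are hypothetical.

```python
import math
import random
from typing import List


def softmax(logits: List[float]) -> List[float]:
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce_update(logits: List[float], action: int, reward: float,
                     baseline: float, lr: float) -> List[float]:
    """One REINFORCE step: logits += lr * (reward - baseline) * grad log pi(action)."""
    probs = softmax(logits)
    advantage = reward - baseline
    # For a softmax policy, d log pi(action) / d logit_k = 1[k == action] - probs[k].
    return [x + lr * advantage * ((1.0 if k == action else 0.0) - probs[k])
            for k, x in enumerate(logits)]


def train(num_steps: int = 200, lr: float = 0.5, seed: int = 0) -> List[float]:
    """Sample actions, reward the span with index 2, and return the final policy."""
    rng = random.Random(seed)
    logits = [0.0, 0.0, 0.0]
    rewards: List[float] = []
    for _ in range(num_steps):
        action = rng.choices(range(len(logits)), weights=softmax(logits))[0]
        reward = 1.0 if action == 2 else 0.0
        # Baseline: average of all previous rewards (0 before any reward is seen).
        baseline = sum(rewards) / len(rewards) if rewards else 0.0
        logits = reinforce_update(logits, action, reward, baseline, lr)
        rewards.append(reward)
    return softmax(logits)


final_policy = train()
```

Subtracting the baseline leaves the expected gradient unchanged but reduces its variance, which is the usual motivation for using it.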

For the RESUME dataset, we use the strategy of curriculum learning Bengio et al. (2009), i.e., we gradually increase the number of turns from 2 to 4 at training.

          multi-turn QA      multi-turn QA+RL   tagging+dependency  tagging+relation
          p     r     f      p     r     f      p     r     f       p     r     f
Person    98.1  99.0  98.6   98.1  99.0  98.6   97.0  97.2  97.1    97.0  97.2  97.1
Company   82.3  87.6  84.9   83.3  87.8  85.5   81.4  87.3  84.2    81.0  86.2  83.5
Position  97.1  98.5  97.8   97.3  98.9  98.1   96.3  98.0  97.0    94.4  97.8  96.0
Time      96.6  98.8  97.7   97.0  98.9  97.9   95.2  96.3  95.7    94.0  95.9  94.9
all       91.0  93.2  92.1   91.6  93.5  92.5   90.0  91.7  90.8    88.2  91.5  89.8
Table 5: Results for different models on the RESUME dataset.
Models                      Entity P  Entity R  Entity F  Relation P  Relation R  Relation F
Li and Ji (2014)            83.5      76.2      79.7      60.8        36.1        49.3
Miwa and Bansal (2016)      80.8      82.9      81.8      48.7        48.1        48.4
Katiyar and Cardie (2017)   81.2      78.1      79.6      46.4        45.3        45.7
Bekoulis et al. (2018)      -         -         81.6      -           -           47.5
Multi-turn QA               84.4      82.9      83.6      50.1        48.7        49.4 (+1.0)
Table 6: Results of different models on the ACE04 test set. Results for pipelined methods are omitted since they consistently underperform joint models (see Li and Ji (2014) for details).
Models                      Entity P  Entity R  Entity F  Relation P  Relation R  Relation F
Li and Ji (2014)            85.2      76.9      80.8      65.4        39.8        49.5
Miwa and Bansal (2016)      82.9      83.9      83.4      57.2        54.0        55.6
Katiyar and Cardie (2017)   84.0      81.3      82.6      55.5        51.8        53.6
Zhang et al. (2017)         -         -         83.5      -           -           57.5
Sun et al. (2018)           83.9      83.2      83.6      64.9        55.1        59.6
Multi-turn QA               84.7      84.9      84.8      64.8        56.2        60.2 (+0.6)
Table 7: Results of different models on the ACE05 test set. Results for pipelined methods are omitted since they consistently underperform joint models (see Li and Ji (2014) for details).
Models                      Entity P  Entity R  Entity F1  Relation P  Relation R  Relation F
Miwa and Sasaki (2014)      -         -         80.7       -           -           61.0
Zhang et al. (2017)         -         -         85.6       -           -           67.8
Bekoulis et al. (2018)      -         -         83.6       -           -           62.0
Multi-turn QA               89.0      86.6      87.8       69.2        68.2        68.9 (+2.1)
Table 8: Comparison of the proposed method with the previous models on the CoNLL04 dataset. Precision and recall values of baseline models were not reported in the previous papers.

5 Experimental Results

5.1 Results on RESUME

Answers are extracted in the order of Person (first turn), Company (second turn), Position (third turn) and Time (fourth turn), and the extraction of each answer depends on those prior to it.

For baselines, we first implement a joint model in which entity extraction and relation extraction are trained together (denoted by tagging+relation). As in Zheng et al. (2017), entities are extracted using BERT tagging models, and relations are extracted by applying a CNN to representations output by BERT transformers.

Existing baselines which involve entity and relation identification stages (either pipelined or joint) are well suited for triplet extraction, but not really tailored to our setting, because in the third and fourth turns we need more information to decide the relation than just the two entities. For instance, to extract Position, we need both Person and Company, and to extract Time, we need Person, Company and Position. This is akin to a dependency parsing task, but at the tag level rather than the word level Dozat and Manning (2016); Chen and Manning (2014). We thus propose the following baseline, which modifies the previous entity+relation strategy to entity+dependency, denoted by tagging+dependency. We use the BERT tagging model to assign tagging labels to each word, and modify the current SOTA dependency parsing model Biaffine Dozat and Manning (2016) to construct dependencies between tags. The Biaffine dependency model and the entity-extraction model are jointly trained.

Results are presented in Table 5. As can be seen, the tagging+dependency model outperforms the tagging+relation model. The proposed multi-turn QA model performs the best, with RL providing an additional performance boost. In particular, for Person extraction, which only requires single-turn QA, the multi-turn QA+RL model performs the same as the multi-turn QA model. The same holds for tagging+relation and tagging+dependency.

5.2 Results on ACE04, ACE05 and CoNLL04

For ACE04, ACE05 and CoNLL04, only two turns of QA are required. For evaluation, we report micro-F1 scores, precision and recall on entities and relations (Tables 6, 7 and 8) as in Li and Ji (2014), Miwa and Bansal (2016), Katiyar and Cardie (2017) and Zhang et al. (2017). For ACE04, the proposed multi-turn QA model outperforms the previous SOTA by +1.8% for entity extraction and +1.0% for relation extraction. For ACE05, the proposed multi-turn QA model outperforms the previous SOTA by +1.2% for entity extraction and +0.6% for relation extraction. For CoNLL04, the proposed multi-turn QA model leads to a +2.1% gain in relation F1.
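For reference, micro-averaged precision, recall and F1 pool the counts over all sentences before dividing. The sketch below assumes gold and predicted items are compared by exact match on tuples; the matching criterion and the name `micro_prf` are assumptions of this example rather than the evaluation script used in the paper.

```python
from typing import List, Set, Tuple

Item = Tuple[str, ...]


def micro_prf(gold: List[Set[Item]], pred: List[Set[Item]]) -> Tuple[float, float, float]:
    """Micro precision, recall and F1 over per-sentence sets of gold and predicted items."""
    true_pos = sum(len(g & p) for g, p in zip(gold, pred))
    n_pred = sum(len(p) for p in pred)
    n_gold = sum(len(g) for g in gold)
    precision = true_pos / n_pred if n_pred else 0.0
    recall = true_pos / n_gold if n_gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if precision + recall else 0.0
    return precision, recall, f1


toy_gold = [{("Musk", "found", "SpaceX")},
            {("a", "work_for", "b"), ("c", "work_for", "d")}]
toy_pred = [{("Musk", "found", "SpaceX")},
            {("a", "work_for", "b"), ("x", "work_for", "y")}]
toy_p, toy_r, toy_f1 = micro_prf(toy_gold, toy_pred)  # 2 of 3 predictions are correct
```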

6 Ablation Studies

6.1 Effect of Question Generation Strategy

In this subsection, we compare the effects of natural language questions and pseudo-questions. Results are shown in Table 9.

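RESUME
Model      Overall P  Overall R  Overall F
Pseudo Q   90.2       92.3       91.2
Natural Q  91.0       93.2       92.1

ACE04
Model      Entity P  Entity R  Entity F  Relation P  Relation R  Relation F
Pseudo Q   83.7      81.3      82.5      49.4        47.2        48.3
Natural Q  84.4      82.9      83.6      50.1        48.7        49.9

ACE05
Model      Entity P  Entity R  Entity F  Relation P  Relation R  Relation F
Pseudo Q   83.6      84.7      84.2      60.4        55.9        58.1
Natural Q  84.7      84.9      84.8      64.8        56.2        60.2

CoNLL04
Model      Entity P  Entity R  Entity F  Relation P  Relation R  Relation F
Pseudo Q   87.4      86.4      86.9      68.2        67.4        67.8
Natural Q  89.0      86.6      87.8      69.6        68.2        68.9
Table 9: Comparison of the effect of natural language questions with pseudo-questions.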

We can see that natural language questions lead to a strict F1 improvement across all datasets. This is because natural language questions provide more fine-grained semantic information and can help entity/relation extraction. By contrast, the pseudo-questions provide very coarse-grained, ambiguous and implicit hints of entity and relation types, which might even confuse the model.

6.2 Effect of Joint Training

In this paper, we decompose the entity-relation extraction task into two subtasks: a multi-answer task for head-entity extraction and a single-answer task for joint relation and tail-entity extraction. We jointly train two models with shared parameters. The parameter $\lambda$ controls the trade-off between the two subtasks:

$$\mathcal{L} = (1 - \lambda)\,\mathcal{L}(\text{head-entity}) + \lambda\,\mathcal{L}(\text{tail-entity, rel})$$

Results for different values of $\lambda$ on the ACE05 dataset are given as follows:

λ     Entity F1  Relation F1
0     85.0       55.1
0.1   84.8       55.4
0.2   85.2       56.2
0.3   84.8       56.4
0.4   84.6       57.9
0.5   84.8       58.3
0.6   84.6       58.9
0.7   84.8       60.2
0.8   83.9       58.7
0.9   82.7       58.3
1.0   81.9       57.8

When $\lambda$ is set to 0, the system is essentially only trained on the head-entity prediction task. It is interesting to see that $\lambda = 0$ does not lead to the best entity-extraction performance. This demonstrates that the second-stage relation extraction actually helps the first-stage entity extraction, which again confirms the necessity of considering these two subtasks together. For the relation extraction task, the best performance is obtained when $\lambda$ is set to 0.7.

6.3 Case Study

Table 10 compares outputs from the proposed multi-turn QA model with those of the previous SOTA MRT model Sun et al. (2018). In the first example, MRT is not able to identify the relation between john scottsdale and iraq because the two entities are too far apart, but our proposed QA model is able to handle this issue. In the second example, the sentence contains two pairs of the same relation. The MRT model has a hard time handling this situation: it is not able to locate the ship entity and the associated relation, whereas the multi-turn QA model handles this case.

example1  [john scottsdale] PER: PHYS-1 is on the front lines in [iraq] GPE: PHYS-1 .
MRT       [john scottsdale] PER is on the front lines in [iraq] GPE .
Multi-QA  [john scottsdale] PER: PHYS-1 is on the front lines in [iraq] GPE: PHYS-1 .

example2  The [men] PER: ART-1 held on the sinking [vessel] VEH: ART-1 until the [passenger] PER: ART-2 [ship] VEH: ART-2 was able to reach them.
MRT       The [men] PER: ART-1 held on the sinking [vessel] VEH: ART-1 until the [passenger] PER ship was able to reach them.
Multi-QA  The [men] PER: ART-1 held on the sinking [vessel] VEH: ART-1 until the [passenger] PER: ART-2 [ship] VEH: ART-2 was able to reach them.
Table 10: Comparing the multi-turn QA model with MRT Sun et al. (2018).

7 Conclusion

In this paper, we propose a multi-turn question answering paradigm for the task of entity-relation extraction. We achieve new state-of-the-art results on 3 benchmark datasets. We also construct a new entity-relation extraction dataset that requires hierarchical relation reasoning and the proposed model achieves the best performance.