Relation Mention Extraction from Noisy Data with Hierarchical Reinforcement Learning

by Jun Feng, et al.
Zhejiang University
Tsinghua University

In this paper we address the task of relation mention extraction from noisy data: extracting representative phrases for a particular relation from noisy sentences that are collected via distant supervision. Despite its significance and value in many downstream applications, this task is less studied on noisy data. The major challenges lie in 1) the lack of annotation on mention phrases, and more severely, 2) handling noisy sentences which do not express a relation at all. To address the two challenges, we formulate the task as a semi-Markov decision process and propose a novel hierarchical reinforcement learning model. Our model consists of a top-level sentence selector to remove noisy sentences, a low-level mention extractor to extract relation mentions, and a reward estimator to provide signals to guide data denoising and mention extraction without explicit annotations. Experimental results show that our model is effective at extracting relation mentions from noisy data.




The increasing demand for structured knowledge has significantly advanced the research of named entity recognition and relation extraction. Extensive prior research has studied extracting entities [Borthwick et al.1998, Chiu and Nichols2016, Xu, Jiang, and Watcharawittayakul2017] and relations [Bunescu and Mooney2005, Mintz et al.2009, Zeng et al.2014, Zheng et al.2017] from plain text. Figure 1 illustrates an example of relation extraction, where the relation “place_of_birth” between two entities “Barack_Obama” and “Hawaii” is detected since the expression “was born in” directly suggests the relation “place_of_birth”. Such representative expressions are referred to as relation mentions.

Figure 1: Illustration of relation mention extraction from noisy sentences. Words in red are relation mentions.

Relation mentions can be valuable resources in many downstream tasks and benefit many applications such as relation extraction, question answering, and language inference. Moreover, they offer good interpretability by revealing the textual evidence for a detected relation, and they allow us to study the language variety in relation mentions: there are various phrases and ways to express the same relation. For instance, for the “place_of_birth” relation, there are many expressions such as “the birth place”, “was born in”, “hails from”, and so on.

Relation mention extraction in this paper is defined as follows: given a relation and a set of sentences, each containing an entity pair and associated with a noisy relation label, the task is to extract a set of representative phrases for the relation (e.g., “place_of_birth”), such as “the birth place”, “hails from”, and “was born in”. The relation label is automatically generated under the distant supervision assumption; noisy means that some sentences may not mention the automatically-labeled relation at all. We term the task relation mention extraction.

Many existing studies only focus on sentence-level relation classification that predicts whether a sentence mentions a relation [Riedel, Yao, and McCallum2010, Hoffmann et al.2011, Li and Ji2014, Miwa and Bansal2016, Ren et al.2017, Zheng et al.2017]. However, they are not concerned with the words or phrases that describe a relation. Our problem also differs from Open IE [Banko et al.2007, Fader, Soderland, and Etzioni2011, Angeli, Premkumar, and Manning2015], in that such systems do not need to normalize different expressions (e.g., “the birth place” and “was born in”) to the same canonical relation (e.g., “place_of_birth”), as shown in Figure 1. Some works deal with the noisy labeling issue on relation labels [Takamatsu, Sato, and Nakagawa2012, Zeng et al.2015, Feng et al.2018], but they do not involve relation mention extraction.

There are two major challenges for relation mention extraction. First, the sentences for a relation are constructed by distant supervision [Mintz et al.2009, Zeng et al.2015] and are hence noisy: a sentence may not describe the relation at all. Extraction from noisy sentences will lead to undesired, incorrect relation mentions. Second, it is too costly to conduct mention annotation to specify which words or phrases mention a relation in a sentence, particularly in the setting of large-scale relation mention extraction. Instead, there is only a very weak signal available, indicating that a sentence (noisy itself) might describe a relation.

To address these challenges, we devise a hierarchical reinforcement learning [Sutton, Precup, and Singh1999] model for relation mention extraction from noisy sentences. The model consists of three components: a top-level sentence selector for selecting correctly labeled sentences that express a particular relation, a low-level mention extractor for identifying mention words in a selected sentence, and a reward estimator for providing signals to guide sentence denoising and mention extraction without explicit annotations. The intuition behind this model is as follows: if a high-quality sentence is selected, it will facilitate relation mention extraction, and in return, the extraction performance will indicate the quality of sentence selection.

Our model works as follows: at the top level, the agent decides whether a sentence should be selected or not from a sentence bag (a sentence bag contains sentences labeled with the same relation); once the agent selects a sentence, it enters a low-level RL process for mention extraction. When the low-level process completes its task, the agent returns to the top-level process and continues to tackle the next sentence in the bag. Since we have no explicit annotations on either sentences (whether a sentence truly describes a relation) or words (which words are a relation mention), the problem can be naturally formulated as a sequential decision problem, and the policy learning in the high-level and low-level processes is guided by delayed rewards (the likelihood of relation classification), which are a weak, indirect supervision signal for policy learning.

Our contributions are as follows:

  • We study the task of relation mention extraction in new settings: from noisy sentences and with only weak supervision, that is, there are no explicit annotations on sentences or mention words.

  • We propose a novel hierarchical reinforcement learning model which consists of a top-level sentence selector for removing noisy sentences, a low-level extractor for extracting relation mentions, and a reward estimator for offering supervision signals to guide data denoising and mention extraction.

Related Work

We deal with relation mention extraction in this paper. As closely related tasks, named entity recognition (NER) and relation extraction (RE) have attracted considerable research efforts recently. NER locates entity mentions in plain text [Borthwick et al.1998, Chiu and Nichols2016, Xu, Jiang, and Watcharawittayakul2017, Katiyar and Cardie2017]. As entity mentions are less diverse and it is easier to access high-quality labels for NER, this task is usually formulated as a fully supervised problem (e.g., sequence labeling). The goal of RE is to extract semantic relations between two given entities. Many researchers have explored models based on handcrafted features [Mooney and Bunescu2005, Zhou et al.2005] or deep neural networks [Socher et al.2012, Zeng et al.2014, dos Santos, Xiang, and Zhou2015, Lin, Liu, and Sun2017].

The most relevant to our work is Open IE [Banko et al.2007, Wu and Weld2010, Hoffmann et al.2011, Angeli, Premkumar, and Manning2015], which extracts triples that contain two entities and a relation mention. However, there is no need to normalize different expressions to a canonical relation in Open IE systems.

There is a large body of work on sentence-level relation classification, which predicts whether a sentence describes a relation but without specifying a token span as the mention [Riedel, Yao, and McCallum2010, Hoffmann et al.2011, Li and Ji2014, Miwa and Bansal2016, Ren et al.2017, Zheng et al.2017]. [Wang et al.2016] and [Huang and others2016] adopted attention mechanisms to highlight some words in a sentence as the clues of a relation. However, such methods can only detect separate words and do not consider the dependency between words.

There are also some works [Feng et al.2018, Zeng et al.2018] using reinforcement learning for relation extraction from noisy data. However, they focus on relation classification rather than mention extraction. Our work is inspired by [Feng et al.2018], where an instance selector was used to remove noisy sentences. However, the supervision signal for sentence selection there is sparse, as there is only a delayed reward available after all selections in a bag are completed. By contrast, our model is more straightforward: the top-level sentence selector receives an intermediate reward from the low-level mention extractor after each selection and thus obtains direct feedback to guide policy learning.

Figure 2: The hierarchical decision making process to extract mentions for relation “place_of_birth”. Blue circles denote selected sentences (white for unselected sentences), and green squares indicate mention words (white squares mean non-mention words). Words in red are mention words.


Problem Definition

We formulate the task of relation mention extraction from noisy data as follows: given a relation and a sequence of (sentence, relation) pairs, the goal is to extract a set of representative phrases for the relation. Each sentence is associated with two entities and a noisy relation label produced by distant supervision [Mintz et al.2009]. In other words, a sentence may not express its labeled relation at all.

The challenges for relation mention extraction come from: 1) there are noisy relation labels, and 2) there is no word-level mention annotation.


As illustrated in Figure 2, the process of relation mention extraction works as follows: the agent first decides whether a sentence expresses a given relation; if the agent predicts so, it scans the words in the sentence one by one to identify the mention words; otherwise, the agent directly skips the current sentence. The agent continues to tackle the next sentence until all the sentences for the same entity pairs are handled. The above process can be naturally formulated as a semi-Markov decision process. We thus address the task in the framework of hierarchical reinforcement learning [Sutton, Precup, and Singh1999, Dietterich2000]. The hierarchical reinforcement learning process has two tasks: a top-level RL task which takes an option for data denoising, deciding whether a sentence should be selected; and a low-level RL task that makes primitive actions for mention extraction, deciding which words are part of a relation mention.

As shown in Figure 3, our model consists of three components: a top-level sentence selector, a low-level mention extractor, and a reward estimator. The sentence selector scans the sentences in a bag and takes options (top-level actions) to determine whether a sentence describes a relation. The mention extractor performs a sequential scan on a selected sentence and takes actions on whether a particular word in the sentence is part of a relation mention. As there is no explicit supervision for either the selector or the extractor, we pretrain a relation classifier as the reward estimator to guide the policy learning in the two modules.

Figure 3: The hierarchical reinforcement learning model.

Reward Estimator

We adopt a CNN classifier to offer supervision signals to help estimate the rewards for the sentence selector and the mention extractor. The supervision signal is measured by the likelihood of relation classification for a given sentence. Following [Feng et al.2018], the CNN network has an input layer, a convolution layer, a max pooling layer, and a non-linear layer from which the representation is used for relation classification.

CNN Structure. The CNN structure can be briefly described as follows: the input vectors x are passed through a convolution layer and then a max pooling layer, whose output serves as the sentence feature vector. The convolution operation is performed on 3 consecutive words, and the number of feature maps is set to the same value as in [Lin et al.2016]; together with the input dimension, these two settings determine the shapes of the convolution weight and bias parameters.

Then, the relation classifier estimates the likelihood of each relation for the sentence (Eq. 2) by feeding this representation into a fully-connected layer, whose weight and bias are learned parameters and whose output size is the total number of relations.

This probability is used to estimate the rewards to the sentence selector and the mention extractor, see Eq. 5 and Eq. 7.

Loss function. Given a training set of sentences with their relation labels, cross-entropy is used as the loss function to train the CNN classifier (Eq. 3).


Top-Level Sentence Selector

The top-level sentence selector aims to select sentences that truly mention the given relation. A selected sentence will then be passed to the low-level mention extractor for further mention extraction. As we do not have explicit supervision for the sentence selector, we measure the utility of the selected sentences as a whole using a final reward. Thus, this RL process terminates when all the sentences are scanned. In what follows, we introduce the state, option, and reward at each step, where one step corresponds to one sentence.

State. The state consists of the information about the current sentence, the already selected sentences, the relation label, and the extracted relation mentions from the previously selected sentences:
1) The vector representation of the current sentence, which is obtained from the non-linear layer of the CNN classifier for relation classification;
2) The average of the sentence representations of the chosen sentences;
3) The one-hot representation of a given relation;
4) The representation of the extracted relation mentions, which is the average of the word vectors of all the mention words.
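The four state components listed above can be assembled by simple concatenation. The following is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the use of plain Python lists, and the zero-vector fallback when nothing has been selected yet are all hypothetical choices made for illustration.

```python
from typing import List, Sequence


def mean_vector(vectors: Sequence[Sequence[float]], dim: int) -> List[float]:
    """Element-wise average; falls back to zeros when the list is empty (assumption)."""
    if not vectors:
        return [0.0] * dim
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def one_hot(index: int, size: int) -> List[float]:
    """One-hot encoding of a relation id."""
    vec = [0.0] * size
    vec[index] = 1.0
    return vec


def selector_state(
    current_sentence: Sequence[float],
    chosen_sentences: Sequence[Sequence[float]],
    relation_id: int,
    num_relations: int,
    mention_word_vectors: Sequence[Sequence[float]],
    word_dim: int,
) -> List[float]:
    """Concatenate the four components of the top-level state."""
    sent_dim = len(current_sentence)
    return (
        list(current_sentence)                          # 1) current sentence
        + mean_vector(chosen_sentences, sent_dim)       # 2) average of chosen sentences
        + one_hot(relation_id, num_relations)           # 3) one-hot relation
        + mean_vector(mention_word_vectors, word_dim)   # 4) average of mention words
    )
```

In practice the sentence vectors would come from the non-linear layer of the CNN classifier and the word vectors from the embedding table; here they are just lists of floats.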

Option. The option is binary, where 1 means the current sentence is selected and 0 means it is skipped. We sample the value of the option from the policy function (Eq. 4), which is a sigmoid function of the state with its own learnable parameters.
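Sampling a binary option from a sigmoid policy can be sketched as follows. This is an illustrative sketch only: the linear scoring of the state and the names `select_probability` and `sample_option` are assumptions, since the text only states that the policy is a sigmoid function with parameters.

```python
import math
import random
from typing import Sequence


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def select_probability(state: Sequence[float], weights: Sequence[float], bias: float) -> float:
    """Probability of choosing option 1 (select), assuming a linear score of the state."""
    score = sum(w * s for w, s in zip(weights, state)) + bias
    return sigmoid(score)


def sample_option(state: Sequence[float], weights: Sequence[float], bias: float,
                  rng: random.Random) -> int:
    """Draw the option from a Bernoulli distribution defined by the policy."""
    p = select_probability(state, weights, bias)
    return 1 if rng.random() < p else 0
```

Passing an explicit `random.Random` instance keeps sampled trajectories reproducible, which is convenient when debugging policy-gradient training.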


Reward. At each step, if the sentence is selected, the sentence selector receives an intermediate reward, namely the delayed reward obtained by the low-level mention extractor on that sentence, as defined by Eq. 7; otherwise, the intermediate reward is set to a fixed default value.

In addition to the intermediate rewards, a final reward is computed to measure the utility of all the chosen sentences, when the top-level selector completes its scan on all the sentences for a given relation:


The final reward (Eq. 5) is computed on the set of selected sentences for the given relation, using the classification likelihood provided by the reward estimator, see Eq. 2.

Low-Level Mention Extractor

Once the top-level sentence selector chooses a sentence, the low-level mention extractor scans the words in that sentence sequentially to identify relation mention words for the given relation. At each step, the mention extractor decides whether the current word is part of the relation mention. This low-level RL process terminates after the last word is scanned.

State. The state encodes the information about the current word, the already chosen words in the sentence, and the relation:
1) The vector representation of the current word;
2) The representation of the chosen mention words, which is the average of the word embeddings of all the chosen words;
3) The one-hot representation of the relation.

Action. The action is binary, where 1 means the current word is selected as a mention word. We sample the action from the policy function (Eq. 6), which is a sigmoid function of the state with its own learnable parameters.

Reward. As there is no annotation on which words are related to a relation mention, we design a delayed reward to measure the adequacy of the extracted mention words once all the words in the sentence are scanned. The delayed reward consists of three terms: the word discriminability, the continuity of the relation mention, and the distance to the two entities.

Formally, suppose a mention, given as a set of word indices, is extracted from the sentence. We also record the indices of the two entities in the sentence.

The delayed reward (Eq. 7) combines three terms:

1) The first term is the word discriminability, which measures how well the extracted words can distinguish the relation. It compares the classification likelihood of the original sentence, defined by Eq. 2 in the reward estimator, with that of the sentence from which the extracted mention words have been removed.
2) The second term is the continuity reward, which encourages the extraction of a consecutive token span to a certain extent.
3) The third term is the distance reward, which encourages mention words to be close to the two entities.

The three rewards are soft constraints for mention extraction. For instance, the continuity reward encourages the extraction of consecutive words, but the model may also extract non-consecutive words as a mention. The weights of the three terms are hyper-parameters that balance the three factors.
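To make the three soft constraints concrete, the sketch below shows one possible instantiation. Every functional form here is an assumption for illustration (log-likelihood difference for discriminability, fraction of adjacent index pairs for continuity, normalized distance to the nearest entity for closeness), as are the default weights and the zero reward for an empty mention; the source only names the three terms.

```python
import math
from typing import Sequence, Tuple


def delayed_reward(
    p_full: float,                 # classifier likelihood of the relation on the full sentence
    p_without_mention: float,      # likelihood after removing the extracted words
    mention_idx: Sequence[int],    # indices of the extracted mention words
    entity_idx: Tuple[int, int],   # indices of the two entities
    sentence_len: int,
    weights: Tuple[float, float, float] = (1.0, 0.5, 0.5),  # hypothetical balance weights
) -> float:
    if not mention_idx:
        return 0.0  # assumption: no reward when nothing is extracted
    idx = sorted(mention_idx)

    # 1) Discriminability: how much the likelihood drops without the mention words.
    discriminability = math.log(p_full) - math.log(p_without_mention)

    # 2) Continuity: share of neighbouring extracted words that are adjacent in the sentence.
    if len(idx) == 1:
        continuity = 1.0
    else:
        adjacent = sum(1 for a, b in zip(idx, idx[1:]) if b - a == 1)
        continuity = adjacent / (len(idx) - 1)

    # 3) Closeness: mention words near either entity score higher.
    avg_dist = sum(min(abs(i - e) for e in entity_idx) for i in idx) / len(idx)
    closeness = 1.0 - avg_dist / sentence_len

    w1, w2, w3 = weights
    return w1 * discriminability + w2 * continuity + w3 * closeness
```

Because the terms are summed rather than enforced, a non-consecutive extraction is still allowed; it simply earns a smaller continuity term.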

Training Objective and Optimization

For the sentence selector, we aim to maximize the expected future cumulative rewards.

The future cumulative reward is computed from a given state onward. To compute it, we sample some trajectories according to the current policy; for each trajectory, whose length is the number of sentences in the top-level process, the intermediate rewards and the final reward are accumulated along the trajectory. Note that the rewards received by the low-level mention extractor are passed to the selector, which provides feedback indicating how good the sentence selection is.
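The future cumulative reward along one sampled trajectory can be computed with a single backward pass. This is a sketch under assumptions: the final reward is credited at the last step, and a standard discount factor `gamma` is applied per step; the exact convention used in the paper is not recoverable from the text.

```python
from typing import List, Sequence


def cumulative_returns(intermediate_rewards: Sequence[float], final_reward: float,
                       gamma: float) -> List[float]:
    """Discounted future return at every step of one trajectory."""
    rewards = list(intermediate_rewards)
    if rewards:
        rewards[-1] += final_reward  # assumption: final reward arrives at the last step
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns
```

For the low-level extractor, which has no intermediate rewards, the same function applies with an all-zero `intermediate_rewards` list and the delayed reward passed as `final_reward`.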

Similarly, the mention extractor maximizes the expected cumulative rewards.

Here the cumulative reward is determined solely by the delayed final reward, since the mention extractor has no intermediate rewards.

According to the policy gradient theorem [Sutton et al.1999] and the REINFORCE algorithm [Williams1992], we compute the gradients of both the top-level sentence selector policy and the low-level mention extractor policy, and use them to update the respective parameters.

Input: Training data, in which each relation has a sentence bag.
foreach relation and its sentence bag do
       foreach sentence in the bag do
             Sample an option for the selector, see Eq. 4;
             if the sentence is selected then
                   Sample actions for the extractor on the words of the sentence, see Eq. 6;
                   Obtain the final reward from the extractor;
                   Update the parameters of the extractor;
                   Compute the intermediate reward of the selector, see Eq. 7
             end if
       end foreach
       Obtain the final reward of the selector, see Eq. 5;
       Update the parameters of the selector;
end foreach
ALGORITHM 1 Training Process of Hierarchical Reinforcement Learning

For model learning, we first use all the sentences to pretrain a CNN classifier as the reward estimator and pretrain the low-level mention extractor according to Eq. 3 and Eq. 9 respectively. After that, with the reward provided by the CNN classifier (parameters fixed), we are able to train the hierarchical RL model. See the details of our learning procedure in Algorithm 1.
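The parameter updates in the learning procedure follow the REINFORCE rule. The sketch below shows one such update for a Bernoulli policy with a sigmoid over a linear score; the linear parameterization and the name `reinforce_step` are assumptions for illustration. It relies on the standard identity that the gradient of log pi(a | s) with respect to the score is (a - p), where p is the probability of action 1.

```python
import math
from typing import List, Sequence, Tuple


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def reinforce_step(
    weights: Sequence[float],
    bias: float,
    trajectory: Sequence[Tuple[Sequence[float], int]],  # (state, action) pairs
    returns: Sequence[float],                           # future return for each step
    lr: float,
) -> Tuple[List[float], float]:
    """One gradient-ascent step on the expected return of a sigmoid policy."""
    grad_w = [0.0] * len(weights)
    grad_b = 0.0
    for (state, action), ret in zip(trajectory, returns):
        p = sigmoid(sum(w * s for w, s in zip(weights, state)) + bias)
        coeff = (action - p) * ret  # d log pi / d score, scaled by the return
        for i, s in enumerate(state):
            grad_w[i] += coeff * s
        grad_b += coeff
    new_weights = [w + lr * g for w, g in zip(weights, grad_w)]
    return new_weights, bias + lr * grad_b
```

The same step can serve both levels: the extractor is updated after each selected sentence, and the selector after the whole bag has been scanned.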

Relation Mention Ranking

Note that our goal is to extract a set of representative phrases for a relation. Since our model extracts a mention from each selected sentence, we need to rank the extracted mentions at the corpus level to construct high-quality mention resources. Formally, an extracted mention for a relation is ranked by a score (Eq. 12) similar to that of [Angeli, Premkumar, and Manning2015].

The score is computed from three counts: the number of times the mention is extracted for the relation, the number of sentences labeled with the relation, and the number of times the mention is extracted from all the selected sentences. Finally, we select the top-ranked mentions for each relation to construct the mention resource.
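A corpus-level ranking built from these three counts might look like the sketch below. The specific score, a product of (count for this relation / sentences labeled with the relation) and (count for this relation / count over all selected sentences), is a hypothetical stand-in and not the paper's Eq. 12; the function name and input layout are likewise assumptions.

```python
from collections import Counter
from typing import Dict, List, Sequence, Tuple


def rank_mentions(
    extractions: Sequence[Tuple[str, str]],   # (relation, mention), one per selected sentence
    sentences_per_relation: Dict[str, int],   # number of sentences labeled with each relation
    relation: str,
    top_k: int,
) -> List[str]:
    count_mr = Counter(extractions)                  # times a mention is extracted for a relation
    count_m = Counter(m for _, m in extractions)     # times a mention is extracted overall
    scores = {}
    for (r, m), c in count_mr.items():
        if r != relation:
            continue
        # Hypothetical score: frequent for this relation AND specific to it.
        scores[m] = (c / sentences_per_relation[relation]) * (c / count_m[m])
    # Sort by descending score; break ties alphabetically for determinism.
    return sorted(scores, key=lambda m: (-scores[m], m))[:top_k]
```

The second factor penalizes generic words such as "in" that are extracted for many relations, which is the behaviour a representativeness score needs.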


Experimental Setup

Data Preparation

We evaluated our model on a clean dataset and a noisy dataset, respectively.

Clean dataset. The clean dataset is adopted from SemEval-2010 [Hendrickx et al.2009], which contains 10,717 sentences and 9 distinct relations. We took 8,000 sentences for training and the remainder for testing.

Noisy dataset. To validate the performance of mention extraction from noisy data, we adopted a widely used dataset from [Riedel, Yao, and McCallum2010]. This dataset contains 522,611 sentences, 281,270 entity pairs, and 18,252 relational facts in the training set; and 172,448 sentences, 96,678 entity pairs and 1,950 relational facts in the test set. There are 39,528 unique entities and 53 unique relations. This dataset consists of noisy sentences which may not describe a fact at all.

Method        Clean Data    Noisy Data
StanfordIE    0.30          0.11
ATT           0.27          0.02
N-gram        0.38          0.24
Single RL     0.71          0.35
HRL           0.71          0.52
Table 1: Sentence-level extraction accuracy for relation mention. Note that HRL is the same as Single RL on the clean data. Note that StanfordIE is an unsupervised method.

Example-I: the Entity-Origin relation between name and address.
  Sentence: The headquarters of the operation were at Berlin and the code [name] for the program was derived from that [address].
  Output: ATT: derived; StanfordIE: N/A; HRL: derived from
Example-II: the Product-Producer relation between philosopher and writings.
  Sentence: Andronicus wrote a work, the fifth book of which contained a complete list of the [philosopher] ’s [writings].
  Output: ATT: wrote; StanfordIE: of; HRL: ’s
Table 2: Examples for the extracted mentions by ATT, StanfordIE, and our model. N/A means StanfordIE did not extract any word.


OpenIE [Angeli, Premkumar, and Manning2015, Mausam et al.2012]. OpenIE systems are the most relevant to our work; they extract a triple that contains two entity mentions and a relation mention. As mentioned above, OpenIE systems do not normalize different expressions to a canonical relation. Thus, we mapped the extracted mentions to a relation following the algorithm described in [Angeli, Premkumar, and Manning2015], which is trained on our training data. In our experiment, we use Stanford OpenIE [Angeli, Premkumar, and Manning2015] as the baseline.

ATT [Huang and others2016]. ATT adopts a word-level attention over the words in a sentence and assigns each word an attention weight. We selected the word with the largest weight as the relation mention.

Single RL. This model only adopts the low-level mention extractor and ignores the top-level sentence selector. We compared this model with our HRL model on the noisy dataset. On the clean dataset, HRL is unnecessary since there are no noisy sentences.

N-gram. To show the necessity of adopting reinforcement learning, we devised a baseline named N-gram, which searches over the n-grams in a sentence and chooses as the mention the one that yields the maximal reward. The reward is the same as the final reward of the low-level mention extractor (see Eq. 7).
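The exhaustive search performed by the N-gram baseline can be sketched as follows. The maximum n-gram length `max_n` and the `reward_fn` callback signature are assumptions, since the range of n is not recoverable from the text; in the paper the reward would be the delayed reward of the mention extractor.

```python
from typing import Callable, Optional, Sequence, Tuple


def best_ngram(
    words: Sequence[str],
    reward_fn: Callable[[Sequence[str], int, int], float],  # reward of span [start, end)
    max_n: int,
) -> Tuple[Optional[Tuple[int, int]], float]:
    """Return the contiguous span with the highest reward among all n-grams up to max_n."""
    best_span: Optional[Tuple[int, int]] = None
    best_reward = float("-inf")
    for n in range(1, max_n + 1):
        for start in range(len(words) - n + 1):
            reward = reward_fn(words, start, start + n)
            if reward > best_reward:  # strict comparison keeps the first best span
                best_span, best_reward = (start, start + n), reward
    return best_span, best_reward
```

Unlike the RL extractor, this search can only return contiguous spans and evaluates the reward once per candidate, which makes it a natural point of comparison for the learned policy.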

Parameter Settings

The parameters of our model differ between the clean and noisy datasets. On each dataset, we set the reward hyper-parameters, the learning rate, and the number of training episodes separately. On the noisy dataset, the pretraining of the mention extractor and the training of HRL additionally use their own learning rates and episode numbers. The reward discount factor is the same on both datasets.

For the CNN classifier in the reward estimator, we fix the word embedding dimension, the position embedding dimension, the learning rate, the batch size, and the number of training episodes. The window size of the convolution layer is 3. We employed a dropout strategy during training.

Quality of Extracted Relation Mentions

We evaluated the quality of extracted relation mentions with two metrics. At the sentence level, accuracy is assessed by manually checking whether the phrase extracted from a sentence is indeed representative for the given relation. At the mention level, Precision@K is assessed by ranking the extracted mentions according to their representativeness (see Eq. 12).

Sentence-level Evaluation

We sampled sentences from the clean and noisy datasets respectively, and manually annotated the relation mention for each sentence. As different baseline models extract relation mentions at different granularities, we annotated multiple relation mentions for each sentence for fair comparison, and we ensured that all the annotations are representative for a given relation. For instance, for sentence “Muscle fatigue is the number one cause of arm muscle pain.” with relation label “Cause-Effect”, mention annotations are “is the number one cause of”, “is the cause of”, “the cause of” and “cause”.

Then, we compared the extracted mentions with those manual annotations for each sentence to evaluate the extraction performance. Thus, this is sentence-level evaluation. The results shown in Table 1 reveal the following observations:

First, our proposed models (Single RL and HRL) outperform the baselines on both clean and noisy data. Compared to our model, ATT has two drawbacks: the word with the largest attention weight may not be a mention word; and it cannot identify a consecutive token span as mention. As for StanfordIE, it failed to extract fact triples and did not extract any relation mention in many cases.

Second, HRL outperforms the baselines substantially on the noisy data, demonstrating the effectiveness of data denoising by the sentence selector. By contrast, StanfordIE, ATT, N-gram and Single RL all suffer remarkably from the noisy data due to their inability to exclude noisy sentences.

We also note that ATT drops much more than other baselines on noisy data. Our investigation into the results shows that ATT is sensitive to the sentence length. The longer the sentence is, the harder it is for ATT to locate the correct relation mention words. The average sentence length in the noisy data is much longer than that in the clean data.

Third, Single RL outperforms N-gram on both clean and noisy data. The results show that our RL strategy is reasonable and effective.

We further present some exemplar mentions extracted by the models in Table 2. Interestingly, our model can not only identify typical phrases like “derived from”, but also discover less typical representative words such as “’s”. StanfordIE sometimes failed to extract any word or extracted undesirable results. As for ATT, it is prone to produce wrong attention.

Method        P@1     P@2     P@5     P@10
StanfordIE    0.88    0.82    0.72    0.61
ATT           0.67    0.74    0.61    0.44
N-gram        0.83    0.75    0.67    0.56
Single RL     0.94    0.94    0.84    0.66
Table 3: Average Precision@K of the extracted mentions from the clean data (mention-level).

Method        P@1     P@2     P@5     P@10
StanfordIE    0.38    0.50    0.40    0.33
ATT           0.15    0.15    0.20    0.17
N-gram        0.38    0.38    0.42    0.37
Single RL     0.46    0.38    0.40    0.28
HRL           0.77    0.77    0.74    0.71
Table 4: Average Precision@K of the relation mentions extracted from the noisy data (mention-level).

Relation            Exemplar phrases
Cause-Effect        triggers, caused by, lead to, generated by, instigates
Product-Producer    hand-made by, co-founded by, makes, created by
Founder             founder of, chief executive of, managing director at, chairman of
Children            son of, daughter, father, son of minister
Table 5: Exemplar mention phrases for some sampled relations.

Mention-level Evaluation

We conducted mention-level evaluation to assess the quality of the extracted mentions at the corpus level. For each relation, we chose top 10 representative mentions which are ranked by Eq. 12. We adopted Precision@K as the performance metric.
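For reference, Precision@K over a ranked mention list and its average across relations can be computed as below. The function names and the input layout (a ranked list plus a set of mentions judged correct for each relation) are assumptions made for this sketch.

```python
from typing import Sequence, Set, Tuple


def precision_at_k(ranked: Sequence[str], correct: Set[str], k: int) -> float:
    """Fraction of the top-k ranked mentions that are judged correct."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for mention in ranked[:k] if mention in correct)
    return hits / k


def mean_precision_at_k(per_relation: Sequence[Tuple[Sequence[str], Set[str]]], k: int) -> float:
    """Average Precision@K over relations; each item is (ranked mentions, correct mentions)."""
    scores = [precision_at_k(ranked, correct, k) for ranked, correct in per_relation]
    return sum(scores) / len(scores)
```

Dividing by `k` rather than by the number of returned mentions penalizes a method that produces fewer than `k` mentions for a relation, which is one common convention.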

The results on the clean data and noisy data are presented in Table 3 and Table 4, respectively. On the clean data, the top 5 mentions extracted by our model achieve a precision of more than 0.8, significantly higher than those obtained by StanfordIE and ATT (Table 3). As for the noisy data, P@10 drops remarkably for all the methods, but HRL performs much better than the baselines. Moreover, HRL outperforms Single RL remarkably. All the evidence supports that the sentence selector effectively excludes the noisy sentences (Table 4).

A concrete example of the ranked mentions is presented in Table 7. It shows that the extracted phrases are representative and meaningful. We also show the top 10 relation mentions for some relations in a supplementary file.

Utility of Extracted Relation Mentions

We evaluated whether the extracted mentions can facilitate downstream applications such as relation classification, on both clean and noisy data.

We first evaluated on the clean data how extracted mentions can benefit relation classification as an additional feature. More specifically, we constructed a binary vector in which each dimension corresponds to one relation and represents whether at least one extracted mention of that relation occurs in a given sentence. The dimension of the binary vector equals the number of relations. For each sentence, if it contains a mention for a relation, the corresponding dimension is set to 1, otherwise 0.
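The binary mention feature can be built with a simple lookup. In this sketch, the whole-word, case-insensitive matching on whitespace tokens and the names `mention_feature_vector` and `mentions_by_relation` are assumptions; the text does not specify how occurrence of a mention is tested.

```python
from typing import Dict, List, Sequence


def mention_feature_vector(
    sentence: str,
    mentions_by_relation: Dict[str, Sequence[str]],  # extracted mentions per relation
    relations: Sequence[str],                        # fixed relation order
) -> List[int]:
    """One dimension per relation: 1 if any of its mentions occurs in the sentence."""
    # Pad with spaces so that only whole-token matches count (assumption).
    text = " " + " ".join(sentence.lower().split()) + " "
    features = []
    for relation in relations:
        mentions = mentions_by_relation.get(relation, [])
        found = any(f" {m.lower()} " in text for m in mentions)
        features.append(1 if found else 0)
    return features
```

The resulting vector can be fed directly to a logistic regression classifier or concatenated with other sentence features.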

Mention features    CNN      Regression
StanfordIE          81.74    27.52
Ollie               81.28    18.32
ATT                 81.48    20.61
N-gram              81.57    36.97
Single RL           82.13    39.32
Table 6: Macro F1 of relation classification on the clean data.

To fully check the effectiveness of the extracted relation mentions, we used the binary vector in two ways. The first way is that we directly used the binary vector for relation classification, with a logistic regression classifier. The second way is to use the binary vector along with a CNN classifier. We concatenated the binary vector with the output of the pooling layer of a CNN structure, and fed the concatenated vector into a fully-connected layer for relation classification.

We compared different mention features generated by Open IE, ATT, RL, and HRL respectively. The results on the clean data are shown in Table 6. They demonstrate that the relation mentions from our model yield better performance than those from the baseline models. In the CNN classifier, mention features are used only as additional features, which may explain why only a slight improvement is observed.

Due to the page limit, we provided a supplementary file to show the experiment results on noisy data. We believe that such extracted mentions would be beneficial for question answering, language inference, and more, which will be validated in future work.


Our model is advantageous in extracting relation mentions that can be expressed explicitly by words. However, some relations are expressed implicitly, and sometimes semantic reasoning is needed to derive a relation.

The first example demonstrates an implicit relation mention: the sentence “I spent a year working for a [software] [company] to pay off my college loans.” is labeled with a Product-Producer relation, which requires the knowledge that a software company sells software (whereas the Apple company does not produce apples).

The second example shows that relation mention detection sometimes requires semantic reasoning: the sentence “You ’ll get an instant overview of [Tallahassee], which was chosen as [Florida] ’s capital for only one reason ” is marked with the Contains relation, which needs to be inferred from the capital relationship. The third example, “[Nicola Sturgeon], the newly elected first minister of [Scotland], expressed concern that ”, labeled with the Nationality relation, also requires semantic reasoning from minister to derive the desired relation.

Our model has limitations in these cases, and we leave them to future work.


In this paper, we present a hierarchical reinforcement learning model for extracting relation mentions from noisy data. The model consists of a sentence selector to exclude noisy sentences, a mention extractor to identify mention words in a selected sentence, and a reward estimator to guide the policy learning of the selector and the extractor. The model learns from large-scale noisy data without explicit annotations at either the sentence level (whether a sentence truly describes a relation) or the word level (which words form a relation mention). Experiments show that our model outperforms the state-of-the-art baselines.


  • [Angeli, Premkumar, and Manning2015] Angeli, G.; Premkumar, M. J. J.; and Manning, C. D. 2015. Leveraging linguistic structure for open domain information extraction. In ACL, volume 1, 344–354.
  • [Banko et al.2007] Banko, M.; Cafarella, M. J.; Soderland, S.; Broadhead, M.; and Etzioni, O. 2007. Open information extraction from the web. In IJCAI, volume 7, 2670–2676.
  • [Borthwick et al.1998] Borthwick, A.; Sterling, J.; Agichtein, E.; and Grishman, R. 1998. Exploiting diverse knowledge sources via maximum entropy in named entity recognition. In Sixth Workshop on Very Large Corpora.
  • [Bunescu and Mooney2005] Bunescu, R. C., and Mooney, R. J. 2005. A shortest path dependency kernel for relation extraction. In EMNLP, 724–731.
  • [Chiu and Nichols2016] Chiu, J. P., and Nichols, E. 2016. Named entity recognition with bidirectional lstm-cnns. TACL 4:357–370.
  • [Dietterich2000] Dietterich, T. G. 2000. Hierarchical reinforcement learning with the maxq value function decomposition. J. Artif. Intell. Res.(JAIR) 13(1):227–303.
  • [dos Santos, Xiang, and Zhou2015] dos Santos, C. N.; Xiang, B.; and Zhou, B. 2015. Classifying relations by ranking with convolutional neural networks. In ACL, 626–634.
  • [Fader, Soderland, and Etzioni2011] Fader, A.; Soderland, S.; and Etzioni, O. 2011. Identifying relations for open information extraction. In EMNLP, 1535–1545.
  • [Feng et al.2018] Feng, J.; Huang, M.; Zhao, L.; Yang, Y.; and Zhu, X. 2018. Reinforcement learning for relation classification from noisy data. AAAI.
  • [Hendrickx et al.2009] Hendrickx, I.; Kim, S. N.; Kozareva, Z.; Nakov, P.; Ó Séaghdha, D.; Padó, S.; Pennacchiotti, M.; Romano, L.; and Szpakowicz, S. 2009. Semeval-2010 task 8: Multi-way classification of semantic relations between pairs of nominals. In Proceedings of the Workshop on Semantic Evaluations: Recent Achievements and Future Directions, 94–99.
  • [Hoffmann et al.2011] Hoffmann, R.; Zhang, C.; Ling, X.; Zettlemoyer, L.; and Weld, D. S. 2011. Knowledge-based weak supervision for information extraction of overlapping relations. In ACL, 541–550.
  • [Huang and others2016] Huang, X., et al. 2016. Attention-based convolutional neural network for semantic relation extraction. In COLING, 2526–2536.
  • [Katiyar and Cardie2017] Katiyar, A., and Cardie, C. 2017. Going out on a limb: Joint extraction of entity mentions and relations without dependency trees. In ACL, volume 1, 917–928.
  • [Li and Ji2014] Li, Q., and Ji, H. 2014. Incremental joint extraction of entity mentions and relations. In ACL, volume 1, 402–412.
  • [Lin et al.2016] Lin, Y.; Shen, S.; Liu, Z.; Luan, H.; and Sun, M. 2016. Neural relation extraction with selective attention over instances. In ACL, volume 1, 2124–2133.
  • [Lin, Liu, and Sun2017] Lin, Y.; Liu, Z.; and Sun, M. 2017. Neural relation extraction with multi-lingual attention. In ACL, volume 1, 34–43.
  • [Mausam et al.2012] Mausam; Schmitz, M.; Bart, R.; Soderland, S.; and Etzioni, O. 2012. Open language learning for information extraction. In EMNLP, 523–534.
  • [Mintz et al.2009] Mintz, M.; Bills, S.; Snow, R.; and Jurafsky, D. 2009. Distant supervision for relation extraction without labeled data. In ACL-IJCNLP, 1003–1011.
  • [Miwa and Bansal2016] Miwa, M., and Bansal, M. 2016. End-to-end relation extraction using lstms on sequences and tree structures. In ACL, 1105–1116.
  • [Mooney and Bunescu2005] Mooney, R. J., and Bunescu, R. C. 2005. Subsequence kernels for relation extraction. In NIPS, 171–178.
  • [Ren et al.2017] Ren, X.; Wu, Z.; He, W.; Qu, M.; Voss, C. R.; Ji, H.; Abdelzaher, T. F.; and Han, J. 2017. Cotype: Joint extraction of typed entities and relations with knowledge bases. In WWW, 1015–1024.
  • [Riedel, Yao, and McCallum2010] Riedel, S.; Yao, L.; and McCallum, A. 2010. Modeling relations and their mentions without labeled text. In ECML-PKDD, 148–163. Springer.
  • [Socher et al.2012] Socher, R.; Huval, B.; Manning, C. D.; and Ng, A. Y. 2012. Semantic compositionality through recursive matrix-vector spaces. In EMNLP-CoNLL, 1201–1211.
  • [Sutton et al.1999] Sutton, R. S.; McAllester, D.; Singh, S.; and Mansour, Y. 1999. Policy gradient methods for reinforcement learning with function approximation. In NIPS.
  • [Sutton, Precup, and Singh1999] Sutton, R. S.; Precup, D.; and Singh, S. 1999. Between mdps and semi-mdps: A framework for temporal abstraction in reinforcement learning. Artificial intelligence 112(1-2):181–211.
  • [Takamatsu, Sato, and Nakagawa2012] Takamatsu, S.; Sato, I.; and Nakagawa, H. 2012. Reducing wrong labels in distant supervision for relation extraction. In ACL, 721–729. ACL.
  • [Wang et al.2016] Wang, L.; Cao, Z.; de Melo, G.; and Liu, Z. 2016. Relation classification via multi-level attention cnns. In ACL, volume 1, 1298–1307.
  • [Williams1992] Williams, R. J. 1992. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning 8(3-4):229–256.
  • [Wu and Weld2010] Wu, F., and Weld, D. S. 2010. Open information extraction using wikipedia. In ACL, 118–127.
  • [Xu, Jiang, and Watcharawittayakul2017] Xu, M.; Jiang, H.; and Watcharawittayakul, S. 2017. A local detection approach for named entity recognition and mention detection. In ACL, volume 1, 1237–1247.
  • [Zeng et al.2014] Zeng, D.; Liu, K.; Lai, S.; Zhou, G.; and Zhao, J. 2014. Relation classification via convolutional deep neural network. In COLING, 2335–2344.
  • [Zeng et al.2015] Zeng, D.; Liu, K.; Chen, Y.; and Zhao, J. 2015. Distant supervision for relation extraction via piecewise convolutional neural networks. In EMNLP, 1753–1762.
  • [Zeng et al.2018] Zeng, X.; He, S.; Liu, K.; and Zhao, J. 2018. Large scaled relation extraction with reinforcement learning. AAAI.
  • [Zheng et al.2017] Zheng, S.; Wang, F.; Bao, H.; Hao, Y.; Zhou, P.; and Xu, B. 2017. Joint extraction of entities and relations based on a novel tagging scheme. In ACL, volume 1, 1227–1236.
  • [Zhou et al.2005] Zhou, G.; Su, J.; Zhang, J.; and Zhang, M. 2005. Exploring various knowledge in relation extraction. In ACL, 427–434.

Appendix A Supplementary Materials

Relation              Top 10 relation mentions
Cause-Effect          by, from, after, caused by, generated by, due, following, comes, through, produced by
Person-Company        chief executive of, the, at, general, chairman of, president of, UNK chief executive of, secretary general, vice president, founder of
Component-Whole       comprises, contains, has, includes, with, comprised, composed, a, consists of
Product-Producer      by, produced by, found by, created by, from, in, ’s, secreted by, built by, from
Entity-Destination    into, to, sent to, in, placed into, migrated into, inside, placed, injected into, vested into
Table 7: Top 10 relation mentions for some relations.

Relation Mention Rankings

Table 7 presents the top 10 relation mentions for some relations. Most of the extracted phrases are representative and meaningful, although some of them, such as “by”, “at”, and “with”, carry little semantic meaning on their own. To interpret these cases, we need to place them back into the sentence context. For the 8th relation mention “UNK chief executive of” of the relation “Person-Company”, “UNK” indicates the company name.

Utility of Extracted Relation Mentions

We evaluated whether the extracted mentions can facilitate downstream applications such as relation classification, on both clean and noisy data. The results on clean data are shown in the main paper; the results on noisy data are shown in this supplementary file.

Experiment on Noisy Data

As in the experiments on the clean data, we generated a binary vector for each sentence in the noisy data, concatenated it with the output of the pooling layer of a CNN, and fed the resulting vector into a fully-connected layer for relation classification.

As there is no manual annotation on the noisy data, we evaluated the results under the held-out evaluation configuration, which provides an approximate measure of relation extraction performance without expensive human labor.
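Held-out evaluation is commonly realized by comparing predicted relation facts against knowledge-base facts withheld from training and reporting precision and recall over the ranked predictions. The sketch below illustrates that common setup only; the exact protocol used here is not specified in this file, and the function name, the triple layout, and the example facts are assumptions.

```python
from typing import List, Set, Tuple

# (head entity, relation, tail entity); the layout is an assumption for this sketch.
Triple = Tuple[str, str, str]


def held_out_pr_curve(
    predictions: List[Tuple[Triple, float]],
    kb_facts: Set[Triple],
) -> List[Tuple[float, float]]:
    """Return (precision, recall) after each prediction, ranked by confidence.

    A prediction counts as correct if it appears in the held-out
    knowledge-base facts; no manual labels are involved, so the measure
    is only approximate (true facts missing from the KB count as errors).
    """
    ranked = sorted(predictions, key=lambda item: item[1], reverse=True)
    curve = []
    correct = 0
    for rank, (triple, _score) in enumerate(ranked, start=1):
        if triple in kb_facts:
            correct += 1
        precision = correct / rank
        recall = correct / len(kb_facts) if kb_facts else 0.0
        curve.append((precision, recall))
    return curve


# Hypothetical held-out facts and model predictions with confidence scores.
kb_facts = {("e1", "Contains", "e2"), ("e3", "Contains", "e4")}
predictions = [
    (("e1", "Contains", "e2"), 0.9),
    (("e5", "Contains", "e6"), 0.8),
    (("e3", "Contains", "e4"), 0.4),
]
curve = held_out_pr_curve(predictions, kb_facts)
```

Plotting the returned pairs gives a precision-recall curve, which is the usual way to compare models under this configuration.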

We compared the mention features generated by HRL and the baseline models. We divided the baseline models into two groups. The first group consists of existing models, including StanfordIE and ATT. The second group consists of simplified versions of HRL, including Single RL and N-gram.

Figure 4 and Figure 5 show the results on noisy data. Figure 4 shows that our HRL model outperforms the existing mention extraction models. Figure 5 shows that HRL outperforms Single RL, and that Single RL outperforms N-gram. This demonstrates the necessity of removing noisy sentences for relation mention extraction.

Figure 4: Comparison between HRL and the baselines.
Figure 5: Comparison between HRL and its simplified models