Towards Time-Aware Distant Supervision for Relation Extraction

03/08/2019 · by Tianwen Jiang, et al. · Harbin Institute of Technology; Microsoft

Distant supervision for relation extraction suffers heavily from the wrong labeling problem. To alleviate this issue in timestamped news data, we take a new factor, time, into consideration and propose a novel time-aware distant supervision framework (Time-DS). Time-DS is composed of a time-series instance-popularity measure and two strategies. Instance-popularity encodes the strong relevance between time and true relation mentions, and therefore serves as an effective clue for reducing the noise generated by distant supervision labeling. The two strategies, hard filtering and curriculum learning, are both ways to exploit instance-popularity for better relation extraction within Time-DS. Curriculum learning is the more sophisticated and flexible of the two: it exploits instance-popularity to eliminate the bad effects of noise and thus yields better relation extraction performance. Experiments on our collected multi-source news corpus show that Time-DS achieves significant improvements for relation extraction.


1. Introduction

Distant supervision (DS) has become a popular paradigm for relation extraction in recent years (Mintz et al., 2009; Zeng et al., 2015; Zheng et al., 2017). Distant supervision can greatly expand the set of annotated training instances by aligning relation instances in knowledge bases (KBs) to sentences in text.

However, distant supervision suffers heavily from the wrong labeling problem, because the aligned sentences do not necessarily express the same relations as the ones in the KB (Mintz et al., 2009; Riedel et al., 2010). The wrong labeling problem introduces many false positive training instances that hurt the performance of the models. Many efforts have been made to alleviate the bad effects of the noise produced by DS. Some studies (Riedel et al., 2010; Surdeanu et al., 2012; Ritter et al., 2013; Min et al., 2013) applied multi-instance learning, relaxing the distant supervision assumption to the at-least-one assumption: if two entities participate in a relation in a KB, at least one sentence that mentions the entity pair expresses the relation. More recently, some neural network studies (Zeng et al., 2015; Lin et al., 2016; Feng et al., 2017) learned from multiple instances attentively, without explicitly characterizing the inherent noise. However, all these studies attempt to make models noise-tolerant rather than reducing the noise at its source, i.e., in the process of distant supervision. Therefore, they still suffer from the effects of noise to some degree.

Figure 1. An illustration of how the factor time affects the distribution of the relation Partnership(Microsoft, Facebook) in a news corpus from 2016. For each day, 20 sentences aligned to the relation instance by distant supervision are sampled randomly, and each bar represents the ratio of sentences that really express Partnership(Microsoft, Facebook) on that day, i.e., the ratio of reliable sentences. The ratio is nearly zero before May 26; a sudden increase occurs on May 26 and is sustained for about two days, followed by a gradual decrease after May 27.

Taking time into consideration can effectively alleviate the noise in DS for relation extraction. We find that relation instances in news data are usually time-sensitive and thus not uniformly distributed over time, e.g., Partnership(Microsoft, Facebook) in Figure 1. The mentions of Partnership(Microsoft, Facebook) tend to be concentrated in a certain period of time, i.e., from May 26 to 27 in 2016. Therefore, an intuition is that the automatically annotated data produced within such a period contains far fewer noisy labels, while mentions on other days are more likely to be false positives.

To model the above intuition, i.e., introducing time as a new factor to enhance DS for automatic dataset annotation, we propose a novel time-aware distant supervision framework (Time-DS). Time-DS can effectively reduce the impact of noise by making use of time. Time-DS uses a time series, instance-popularity, for each relation instance to indicate how many news articles mention the relation each day. Instance-popularity is proposed to encode the strong relevance between time and true relation mentions. For better use of the time-series instance-popularity, Time-DS considers two strategies.

The first strategy takes instance-popularity as a hard filter to eliminate noise in the aligning process of DS: a hard threshold is set to filter out noisy data, i.e., unreliably aligned sentences. This simple strategy has apparent drawbacks: it (1) heavily relies on the instance-popularity threshold, and (2) is unable to utilize the noisy samples during training to make DS-based models more robust. We therefore propose a second strategy that conducts curriculum learning (Bengio et al., 2009) on the weighted training instances, i.e., instances attached with instance-popularity, which is a slightly more sophisticated but flexible way to exploit instance-popularity. The main idea of curriculum learning is simple: start with the easiest aspect of a task, and level up the difficulty gradually. In this study, we begin with the high-quality annotated sentences and gradually add low-quality sentences into the training set according to our proposed instance-popularity. Curriculum learning for Time-DS can make full use of every weighted training instance for better relation extraction performance, while obtaining a robust model that can put up with noise.

We conduct relation extraction experiments on a multi-source news corpus with timestamps, where it is more natural to utilize rich temporal statistics than in independent single documents. Meanwhile, the multi-source news corpus is valuable for training a DS model for relation extraction in two other respects. First, it contains diverse expressions of the same relations, an advantage over single-source news corpora. Second, a large number of relation mentions can be obtained from only a few relation instance seeds. Both aspects benefit the training of a powerful and robust DS model. The experimental results show the superiority of our proposed Time-DS for relation extraction. We highlight our contributions as follows:

  • To alleviate the noise issue of distant supervision in time-sensitive domains such as timestamped news data, we take time into consideration.

  • To use time in a sophisticated and flexible way, we apply curriculum learning in terms of a time-series instance-popularity, which proves effective for noise elimination.

  • We collect a multi-source news corpus with timestamps. Such a multi-source corpus makes it more natural to utilize rich temporal statistics than independent single documents.

2. Problem Statement

In this section, we first introduce some concepts used in this paper, then formally define Time-DS for relation extraction.

Definition 1. (Relation Instance) If a relation r holds between two entities e1 and e2, we take r(e1, e2) as a relation instance, such as Partnership(Microsoft, Facebook).

Definition 2. (Relation Mention) For the relation instance r(e1, e2), we define a relation mention as a triple (e1, e2, s), consisting of an entity pair (e1, e2) and a sentence s. The sentence s contains the two entities e1 and e2, and expresses the relation between them.

Definition 3. (Supervision Knowledge) In the context of distant supervision, supervision knowledge is the supervision signal used for automatic dataset annotation, which liberates manual labeling effort. Relation instances in knowledge bases are usually taken as such supervision knowledge. However, knowledge bases may not be available in some domains for DS. Alternatively, in this paper we propose to extract supervision knowledge from news data via high-quality rules.

Problem. (Time-DS for Relation Extraction) Given an unlabeled corpus with timestamps, such as news data, Time-DS is supposed to make the most of time to automatically produce a high-quality annotated dataset, or to eliminate the bad effects of the noise produced by basic DS in relation extraction model training.

Time-DS must deal with three questions:

Question 1. How can supervision knowledge be obtained when a knowledge base is unavailable in the target domain?

Question 2. How can the bad effects of the noise in the relation mentions obtained through DS alignment be eliminated?

Question 3. Can we make use of this noise in a reasonable way instead of simply discarding it?

3. Time-DS

Figure 2. The time-aware distant supervision framework for relation extraction. (1) Extracting supervision knowledge based on high-quality rules. (2) Computing the instance-popularity distribution for each relation instance of the supervision knowledge. (3) Aligning the relation instances to sentences, with instance-popularity attached. (4) Applying the two strategies: taking instance-popularity as a hard filter, or curriculum learning on the weighted training instances, i.e., instances with instance-popularity.

We introduce the details of the time-aware distant supervision (Time-DS) framework in this section, following the four steps of Figure 2. First, a rule-based method is utilized to extract supervision knowledge for the case when a KB is unavailable (Section 3.1, answering Question 1). Second, the instance-popularity distribution of each relation instance is approximated based on its rule-matched mentions in the news data (Sections 3.2–3.3). Then, the supervision knowledge (i.e., relation instances) is aligned to sentences in the raw corpus with instance-popularity attached, generating an automatically annotated dataset for relation extraction model training (Section 3.4). Finally, Time-DS considers two strategies for better use of the time-series instance-popularity: a hard filter to answer Question 2 and curriculum learning to answer Question 3 (Section 3.4).

3.1. Extracting Supervision Knowledge

A KB is the typical supervision knowledge in previous DS-based studies. However, KBs are usually unavailable in some specific domains; this issue raises Question 1. Even when a KB is available, it is impossible to obtain the instance-popularity distribution from the relation instances in the KB alone: additional information, such as relation mentions and timestamps, is also needed. Therefore, it is valuable to first obtain a few relation instances from the raw corpus as supervision knowledge, along with their timestamped relation mentions.

First, we apply a few pre-designed high-quality rules to extract relation instances as candidate supervision knowledge, while reserving the timestamped relation mentions. Each rule follows the template (Pattern, Constraint), where Pattern is a regular expression containing a selected connector and Constraint is a lexical constraint on the entities to which the pattern can be applied. For example, given the connector “has formed a partnership with”, we use the pattern “[entity1] has formed a partnership with [entity2]” to extract the Partnership relationship between organizations, with the constraint that both [entity1] and [entity2] must be organizations. As a consequence, this pattern matches the sentence “Microsoft has formed a partnership with Facebook”, but does not match the sentence “Kevin has formed a partnership with Jack to finish the project”.
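To make the rule template concrete, here is a minimal sketch in Python; the `Rule` container, the `match_rule` helper, and the CoreNLP-style NE type labels are our illustrative assumptions, not the authors' released implementation.

```python
import re
from dataclasses import dataclass

# Hypothetical container for the (Pattern, Constraint) rule template described
# above; the class and function names are illustrative, not the paper's code.
@dataclass
class Rule:
    relation: str
    connector: str        # e.g., "has formed a partnership with"
    entity_types: tuple   # lexical constraint on [entity1], [entity2]

def match_rule(rule: Rule, sentence: str, ner: dict):
    """Return (entity1, entity2) if the sentence matches the rule's pattern
    and both entities satisfy the type constraint; otherwise None.
    `ner` maps entity strings to NE types (e.g., from Stanford CoreNLP)."""
    pattern = re.compile(r"^(.+?)\s+" + re.escape(rule.connector) + r"\s+(.+?)\.?$")
    m = pattern.match(sentence)
    if m is None:
        return None
    e1, e2 = m.group(1).strip(), m.group(2).strip()
    if (ner.get(e1), ner.get(e2)) == rule.entity_types:
        return e1, e2   # constraint satisfied: keep as a candidate mention
    return None

rule = Rule("Partnership", "has formed a partnership with", ("ORGANIZATION", "ORGANIZATION"))
ner = {"Microsoft": "ORGANIZATION", "Facebook": "ORGANIZATION",
       "Kevin": "PERSON", "Jack": "PERSON"}
print(match_rule(rule, "Microsoft has formed a partnership with Facebook.", ner))  # ('Microsoft', 'Facebook')
print(match_rule(rule, "Kevin has formed a partnership with Jack.", ner))          # None
```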

Second, we calculate the confidence of each extracted relation instance and set a reasonable threshold as a filter to obtain the final supervision knowledge (i.e., a set of relation instances with high confidence), as the first step shown in Figure 2. We believe that the confidence of a relation instance r, denoted as c(r), depends on the number of its relation mentions and the number of distinct rules matched in these mentions: more mentions and more matched rules indicate a more reliable relation instance. Following this assumption, and treating mentions and matched rules equally, we define c(r) as follows:

c(r) = \frac{1}{2}\left(\frac{|P_r|}{\max_{r'}|P_{r'}|} + \frac{|S_r|}{\max_{r'}|S_{r'}|}\right)    (1)

where P_r represents the set of matched rules for r, and S_r represents the set of matched sentences for r. \max_{r'}|P_{r'}| and \max_{r'}|S_{r'}| represent the maximum numbers of matched patterns and sentences over all relation instances.
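A short sketch of Equation 1 as reconstructed above, assuming the sets P_r and S_r are materialized as Python collections; the function names and the 0.5 confidence threshold are our illustrative choices, not values from the paper.

```python
def confidence(matched_rules, matched_sents, max_rules, max_sents):
    """Eq. (1): the average of the normalized number of matched rules |P_r|
    and matched sentences |S_r| for a relation instance r.
    `max_rules`/`max_sents` are the maxima over all candidate instances."""
    return 0.5 * (len(matched_rules) / max_rules + len(matched_sents) / max_sents)

def filter_supervision(candidates, threshold=0.5):
    """Keep only instances whose confidence clears the chosen threshold.
    `candidates` maps each instance to its (matched_rules, matched_sents) pair."""
    max_rules = max(len(p) for p, _ in candidates.values())
    max_sents = max(len(s) for _, s in candidates.values())
    return {r: (p, s) for r, (p, s) in candidates.items()
            if confidence(p, s, max_rules, max_sents) >= threshold}
```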

3.2. Definition of Instance-Popularity

The assumption of distant supervision is that any sentence containing a pair of entities which participate in a relation instance is likely to express that relation in some way. However, this assumption is not always true: a large portion of the sentences containing an entity pair are noise. Figure 1 indicates that the mentions within a certain period of time contain far fewer noisy labels, while those on other days are more likely to be false positives. In other words, whether an aligned sentence expresses the corresponding relation has a strong relevance with time.

Following this intuition, we introduce a time series, instance-popularity, for each relation instance to indicate how many news articles express the relation in each time period. Given a sentence at some time point that contains the two entities of a relation instance, we assume that the certainty of the sentence expressing the relation is proportional to the relation instance's instance-popularity at that time point. Instance-popularity thus prepares the ground for answering Question 2.

Formally, the instance-popularity (denoted as InsPo) of a given relation instance r at time t is defined as:

InsPo(r, t) = \frac{|S_{r,t}|}{|S_r|}    (2)

where InsPo(r, t) and S_{r,t} represent the instance-popularity of r at t and the set of sentences expressing r within the t-centric time-window, respectively, and |S_{r,t}| denotes the size of S_{r,t}. S_r is the whole set of sentences expressing r over time, used for normalization:

|S_r| = \sum_{i=1}^{T} |S_{r,t_i}|    (3)

where |S_r| is the size of S_r, w is the length of the time-window that defines each S_{r,t_i}, and T is the number of time points we consider.

3.3. Approximation of Instance-Popularity

The actual whole set S_r of sentences expressing a relation instance r is usually unavailable in practice, so we cannot compute instance-popularity directly according to Equation 2. In this section, we provide a method to approximate it.

For each relation instance r in the supervision knowledge (obtained in Section 3.1), the set of all its rule-matched sentences is denoted as D_r, which is a subset of S_r. Further, we can obtain the sentence set in any t-centric time-window from D_r, denoted as D_{r,t}. Our assumption is that people select relation patterns under some distribution when writing news, and we use p(q_j | t) to denote the probability of the relation pattern q_j being selected to express the given relation instance at time point t. Then we can calculate |D_{r,t}| and |D_r| under the probability distribution of the relation patterns in the pre-designed rules:

|D_{r,t}| = \sum_{j=1}^{m} p(q_j \mid t)\,|S_{r,t}|    (4)

|D_r| = \sum_{i=1}^{T}\sum_{j=1}^{m} p(q_j \mid t_i)\,|S_{r,t_i}|    (5)

where q_j is the j-th relation pattern, and m and T are the numbers of relation patterns and time points, respectively.

Assumption. In the multi-source news corpus, a given pattern expressing a relation instance is selected with the same probability at any time; that is, the probability of a pattern being selected is independent of time. Therefore we have p(q_j | t) = p(q_j).

According to Equations 2 to 5 and the Assumption, we obtain the following equations:

|D_{r,t}| = |S_{r,t}| \sum_{j=1}^{m} p(q_j)    (6)

|D_r| = |S_r| \sum_{j=1}^{m} p(q_j)    (7)

Based on Equations 6–7, we can approximate the instance-popularity of r at time point t, as the second step in Figure 2:

InsPo(r, t) = \frac{|S_{r,t}|}{|S_r|} = \frac{|D_{r,t}|}{|D_r|}    (8)
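A minimal sketch of this approximation in Python, assuming each rule-matched mention carries a date and a centered window of 3 days (the setting used later in the experiments). Normalizing by the total mention count corresponds to Equation 7 when the windows tile the timeline; the function name is ours.

```python
from datetime import date, timedelta

def inst_popularity(mention_dates, t, window=3):
    """Approximate InsPo(r, t) via Eq. (8): the share of r's rule-matched
    mentions that fall in the t-centric time-window, |D_{r,t}| / |D_r|.
    `mention_dates` holds the timestamps of r's rule-matched sentences;
    `window` is the window length in days."""
    if not mention_dates:
        return 0.0
    half = timedelta(days=window // 2)
    in_window = sum(1 for d in mention_dates if abs(d - t) <= half)
    return in_window / len(mention_dates)

# e.g., mentions of Partnership(Microsoft, Facebook) clustered around May 26
dates = [date(2016, 5, 25), date(2016, 5, 26), date(2016, 5, 26),
         date(2016, 5, 27), date(2016, 3, 2)]
print(inst_popularity(dates, date(2016, 5, 26)))  # 0.8
```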

3.4. Two Strategies to Exploit Instance-Popularity

Given the unlabeled corpus with timestamps, we align the supervision knowledge, i.e., the relation instances, to the corpus, as the third step shown in Figure 2. A large number of relation mentions can be obtained, along with their approximate instance-popularity. In other words, we acquire a large-scale annotated dataset attached with instance-popularity, on which relation extraction models can be trained. In the training process, for better use of the time-series instance-popularity, Time-DS considers two strategies, as the last step of Figure 2: hard filter and curriculum learning.

Hard Filter. We have obtained a dataset attached with sentence-level instance-popularity, and instance-popularity quantifies the reliability of the annotated sentences. The hard filter sets a hard threshold on instance-popularity to filter the dataset into a higher-quality sub-dataset, discarding the noise. This is the simplest way to utilize Time-DS, as sketched below.
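As a sketch, the hard filter reduces to a one-line selection over the InsPo-weighted mentions; the (sentence, instance, inspo) triple layout is our assumption.

```python
def hard_filter(mentions, threshold):
    """Keep only DS-annotated mentions whose attached InsPo clears the threshold.
    `mentions` is a list of (sentence, relation_instance, inspo) triples."""
    return [m for m in mentions if m[2] >= threshold]
```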

However, despite its effectiveness, the hard filter (1) heavily relies on the instance-popularity threshold, and (2) refuses to use any of the noise to make DS-based models robust. We hope the noise can be used reasonably to obtain more thorough training, rather than being discarded outright, as asked in Question 3. A natural idea is to guide the relation extraction model to adapt to the noisy training sets gradually, i.e., learning something simple first and then attempting to deal with noise. Fortunately, a technique called curriculum learning fits our problem.

Curriculum Learning. The main idea of curriculum learning (Bengio et al., 2009) is simple: start with the easiest aspect of a task, and level up the difficulty gradually. In our study, we begin with the high-quality annotated sentences and gradually add low-quality sentences into the training set according to instance-popularity. In particular, all the annotated sentences from distant supervision are ranked by instance-popularity from high to low. We then divide the ranking list into several groups by assigning different instance-popularity thresholds, i.e., {Rank_1, Rank_2, Rank_3, …}. Different training sets can thus be created by gradually combining groups of annotated sentences in ranking order, i.e., Rank_1, Rank_1 ∪ Rank_2, Rank_1 ∪ Rank_2 ∪ Rank_3, etc.

Then, following the curriculum learning strategy: (1) the model is first trained on the highest-quality training set, Rank_1. After this training is complete, (2) the second-highest-quality group is merged with the previous training set to generate a new training set, Rank_1 ∪ Rank_2, on which the model is trained again. (3) The above process is repeated, gradually adding the lower-quality groups and training the model on each new training set, until all annotated sentences from DS are taken into consideration. Note that the training instances, i.e., sentences, are shuffled within each training round.
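The whole schedule can be sketched in a few lines of Python; the threshold sequence mirrors the seven rounds used later in the experiments, and `model.train_on` is a placeholder for whatever update routine the target model exposes.

```python
import random

def curriculum_train(model, mentions, thresholds=(0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.0)):
    """Sketch of the curriculum schedule: round k trains on all mentions with
    InsPo >= thresholds[k], so each round's set extends the previous one
    (Rank_1, Rank_1 + Rank_2, ...)."""
    for th in thresholds:
        round_set = [m for m in mentions if m[2] >= th]
        random.shuffle(round_set)    # shuffle within each training round
        model.train_on(round_set)    # continue from the previous round's weights
    return model
```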

4. Experiments

In this part, we conduct relation extraction/classification experiments on five time-sensitive relations from a multi-source news corpus.

Relation      #Relation Instances   Training Set #Sentences                                                          Test Set
              Training   Test       Original   InsPo≥0.1  InsPo≥0.2  InsPo≥0.3  InsPo≥0.4  InsPo≥0.5  InsPo≥0.6      #Sentences
Acquisition   30         8          39,365     9,905      8,361      6,963      6,388      5,825      4,664          694
Investing     46         11         2,741      1,227      239        146        142        141        138            48
JobChange     71         12         188,945    35,154     25,177     21,622     18,655     16,452     13,507         905
Lawsuit       15         3          16,503     2,147      1,588      1,334      944        854        697            313
Partnership   12         5          1,408      794        794        794        794        794        259            503
Total         174        39         248,872    49,224     36,156     30,859     26,923     24,066     19,265         2,463/376
Table 1. Statistics of the supervision knowledge and the annotated dataset. The statistics of the supervision knowledge are reported in the “#Relation Instances” columns. “Original” is the whole original training set, i.e., the training set produced by basic DS. The other filtered training sets (InsPo≥0.1 to InsPo≥0.6) are produced by the hard filter with different instance-popularity thresholds.

4.1. Data Preparation

We collect about 42 million news articles from 50,428 different online news websites, spanning the 8 months from Jan. 2016 to Aug. 2016. From each article, the title, first paragraph, and timestamp are retained to construct a multi-source news corpus, which contains nearly 320 million sentences in total. (The multi-source news corpus will be made openly available.) The Stanford CoreNLP toolkit (Manning et al., 2014) is applied to recognize named entities in the corpus. Organization management is an interesting and informative domain in news data, so we focus on five typical time-sensitive relations in this domain, namely Acquisition, Investing, JobChange, Lawsuit, and Partnership. It is worth mentioning that Time-DS can easily be transferred to any other time-sensitive relations, such as MarriedTo or VisitIn, by designing just a few high-quality rules for those relations.

Acquisition. An organization buys another organization (directed). Example: Verizon announced it had completed the $ 4.4 billion acquisition of AOL.

Investing. An organization puts money into another organization (directed). Example: Vontobel Asset Management Inc. boosted its position in shares of Mastercard Inc.

JobChange. A person leaves or joins an organization (directed). Example: Papiss Cisse has left Newcastle United.

Lawsuit. An organization sues another organization (directed). Example: Samsung Elec sues Huawei for patent infringement.

Partnership. An organization forms a partnership with another organization (undirected). Example: Konami has announced a partnership with FC Barcelona for PES 2017.

Test Set. We follow the previous study (Mintz et al., 2009) in holding out part of the relation instances in the supervision knowledge and aligning them to the corpus to obtain the test set. However, such a test set also suffers from the wrong labeling problem, leading to a rough measure of performance. Hence we refine the test set in three steps. (1) First, we filter the test instances with a suitable instance-popularity threshold (in our experiments, 0.2 for the relation Investing and 0.7 for the other four relations) to obtain candidate positive samples, and also reserve some of the filtered-out instances as candidate negative samples. (2) Then three experts in the relevant domain proofread the candidate test set independently, judging and correcting the existing tags. (3) Finally, the remaining disagreements are resolved; if no consensus can be reached, the samples are removed. In the end, 2,463 positive and 376 negative samples form the final test set (Table 1), in which 186 samples have been corrected.

Validation Set. The above test set is randomly partitioned into 10 equal-size subsets. Of the 10 subsets, a single subset is retained as the validation set for model selection, and the remaining 9 subsets are used as testing data. This process is repeated 10 times, with each of the 10 subsets used exactly once as the validation set. The trained model with the best average performance on the validation sets is selected as the final model for evaluation, as sketched below.
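A sketch of this model-selection protocol using scikit-learn's KFold; `candidates` (the trained models) and `evaluate` (the scoring function) are placeholders we introduce for illustration.

```python
import numpy as np
from sklearn.model_selection import KFold

def select_model(candidates, samples, evaluate):
    """Each of the 10 folds serves once as the validation set; the candidate
    model with the best average validation score is kept.
    `evaluate(model, subset)` is a placeholder scoring function."""
    kf = KFold(n_splits=10, shuffle=True, random_state=0)
    def avg_score(model):
        # kf.split yields (rest, fold); the 1/10 fold is used for validation
        return np.mean([evaluate(model, samples[fold]) for _, fold in kf.split(samples)])
    return max(candidates, key=avg_score)
```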

4.2. Target Models

In this part, we describe the two models that are fed into our Time-DS and the basic DS framework for the end-to-end relation extraction task and the relation classification task. Relation extraction extracts relation mentions from the given sentences and categorizes them into a pre-defined set of relation types. If the relation mentions are given, the task reduces to a classification problem, called relation classification. In this paper, both the relation extraction and the classification we study are at the sentence level.

End-to-End Relation Extraction. We feed the model proposed by Zheng et al. (2017), i.e., LSTM-LSTM-Bias, into our Time-DS framework. LSTM-LSTM-Bias designs a novel tagging scheme that converts the task into a sequence tagging problem; the model can therefore extract entities and relations jointly without other redundant information, and achieves the best results on the public dataset. Since LSTM-LSTM-Bias is a sequence tagging model, its training only needs word-level positive and negative tags. Thus we can train the model directly on the automatically annotated datasets generated by Time-DS.

Relation Classification. We apply a Bi-LSTM and attention based neural model proposed by Zhou et al. (2016), a typical paradigm for relation classification called Att-BLSTM. Training Att-BLSTM needs negative samples, which are unavailable from the datasets generated in the manner of Time-DS. To obtain negative samples, we replace the tail entity of each relation instance with another entity from the same sentence. For instance, given the sentence “A has formed a partnership with B, which is located in C” and its relation instance Partnership(A, B), we replace the entity B with the entity C; the new relation instance Partnership(A, C) and the original sentence form a negative sample.
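This construction can be sketched as follows, with `entities_in` standing in for an NER lookup (e.g., CoreNLP); the tuple layout is our assumption.

```python
def make_negatives(mentions, entities_in):
    """Negative-sample construction for Att-BLSTM training: replace the tail
    entity of each relation instance with another entity from the same
    sentence, keeping the sentence unchanged."""
    negatives = []
    for head, tail, sentence, relation in mentions:
        for other in entities_in(sentence):
            if other not in (head, tail):
                # e.g., Partnership(A, B) + "...B, which is located in C"
                # yields the negative instance Partnership(A, C)
                negatives.append((head, other, sentence, relation))
    return negatives
```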

Metrics. Similar to previous work, we report aggregate precision/recall curves for the end-to-end relation extraction model, and macro-F1 for the relation classification model.

4.3. Annotated Datasets Generation with Instance-Popularity

For Question 1, i.e., how to obtain supervision knowledge when a knowledge base is unavailable, we apply only a few manually designed high-quality rules (see Section 3.1); only 5 to 8 different rules per relation suffice in our experiments. Meanwhile, the rule-matched sentences are reserved for instance-popularity approximation. We then use the confidence (Equation 1) to filter the rule-extracted relation instances and form the final supervision knowledge, which is divided into two subsets, for training and for test, following the held-out method. The statistics are reported in the “#Relation Instances” columns of Table 1. The number of obtained relation instances is much smaller (only about 210) than in an existing KB (over 3.2 million relation instances were obtained from Freebase in (Riedel et al., 2010)). However, with the help of the multi-source news corpus, we can still obtain a huge number of expressive annotated sentences.

Aligning the relation instances to the corpus, we obtain 248,872 training instances, i.e., relation mentions, annotated with instance-popularity, called the Original Set. The instance-popularity distribution of each relation instance is directly approximated according to Equation 8, based on the reserved rule-matched sentences (the size of the time-window is set to 3 days in our experiments). For the hard filter strategy, we filter the Original Set into another six subsets according to different instance-popularity thresholds. The highest threshold is set to 0.6, because the number of Investing mentions is nearly zero when the threshold is higher than 0.6. The overall statistics of the annotated dataset are shown in Table 1.

4.4. Effectiveness of Hard Filter

To answer Question 2, i.e., how to eliminate the bad effects of the noise produced in basic DS, we adopt an instance-popularity-based hard filter strategy. Here we examine the performance of the hard filter strategy on the relation extraction and classification tasks.

Figure 3 shows the aggregate precision/recall curves of the relation extraction model trained on different datasets. We find that: (1) except for the 0.1 threshold, models trained on the hard-filtered datasets outperform the one trained in the manner of basic DS, i.e., on the Original Set; (2) the precision/recall performance clearly increases with the instance-popularity threshold when the threshold is lower than 0.5.

We also investigate more fine-grained precision/recall curves for the five relations separately (see Figure 4; some precision/recall curves backtrack, such as Investing, because LSTM-LSTM-Bias is trained and makes predictions at the tag level instead of the relation level (Zheng et al., 2017)). We find that: (1) a similar general tendency is observed, i.e., models trained on hard-filtered datasets outperform the one trained in the manner of basic DS once the filter threshold is above some value; (2) the effect of the hard filter is not exactly the same across relation types: the performance for Partnership keeps increasing as the hard filter becomes stricter, while the performance on JobChange already peaks when the threshold reaches 0.1.

Figure 3. Aggregate precision/recall curves of relation extraction model trained on different datasets. The Original Set curve represents the performance of training on the annotated dataset generated by the basic DS. The other six curves represent the performance of training on the annotated datasets generated by different hard filters, denoted as “HF”.
Figure 4. More fine-grained precision/recall curves on the five target relations respectively.

Table 2 presents the macro-average F1 of the relation classification model. We can see that: (1) the Time-DS framework on different fractions of the training set outperforms the basic DS framework on the whole training set for relation classification; (2) the macro-F1 value tends to increase with the instance-popularity threshold; (3) even with a much smaller training set, almost one-thirteenth of the Original Set (see Table 1), the Time-DS framework outperforms the basic DS framework, achieving a significant improvement for relation classification. Overall, these three observations make clear that the hard filter with instance-popularity is a very effective strategy for segmenting the Original Set and eliminating the bad effects of the noise generated by basic DS.

Training Set                       macro-P(%)   macro-R(%)   macro-F1(%)
The basic DS
  Original Set
Time-DS with Hard Filter
  InsPo ≥ 0.1
  InsPo ≥ 0.2
  InsPo ≥ 0.3
  InsPo ≥ 0.4
  InsPo ≥ 0.5
  InsPo ≥ 0.6
Table 2. Macro-average precision, recall, and F1 of the relation classification model, trained on the Original Set and on the six filtered training sets respectively.
Figure 5. Aggregate precision/recall curves of the relation extraction model trained on six training sets, comparing traditional training under the hard filter with curriculum learning, where “CL” denotes curriculum learning.
Training Set                             macro-P(%)   macro-R(%)   macro-F1(%)
The basic DS
  Original Set
Time-DS with Curriculum Learning
  InsPo ∈ [0.0, 1.0] (7th round)
  InsPo ∈ [0.1, 1.0] (6th round)
  InsPo ∈ [0.2, 1.0] (5th round)
  InsPo ∈ [0.3, 1.0] (4th round)
  InsPo ∈ [0.4, 1.0] (3rd round)
  InsPo ∈ [0.5, 1.0] (2nd round)
  InsPo ∈ [0.6, 1.0] (1st round)
Table 3. Macro-average precision, recall, and F1 of the relation classification model trained on the Original Set with curriculum learning.

4.5. Effectiveness of Curriculum Learning

The hard filter strategy heavily relies on threshold settings to remove the effects of noisy samples. However, noisy samples still play an important role in improving the robustness of models. This naturally brings up Question 3, i.e., can we make use of this noise in a reasonable way to improve robustness rather than simply discarding it? The curriculum learning strategy is used in Time-DS to answer Question 3.

To implement Time-DS with curriculum learning, we derive from the Original Set seven nested subsets with different instance-popularity ranges, i.e., [0.6, 1.0], [0.5, 1.0], [0.4, 1.0], [0.3, 1.0], [0.2, 1.0], [0.1, 1.0], [0.0, 1.0], and apply the curriculum learning strategy. In the 1st round, we use the subset [0.6, 1.0] to train the relation classification model. In the 2nd round, we use the subset [0.5, 1.0] to continue training the model obtained from the 1st round. In the 3rd round, we use the subset [0.4, 1.0] to continue training the model obtained from the 2nd round. The later rounds follow the same rule, until the 7th round, when all instances of the Original Set participate in the training process.

Figure 5 presents the performance on relation extraction (some precision/recall curves start from a non-zero point because the softmax may output 1 in some dimension due to numerical optimization). We find that: (1) curriculum-learning-based Time-DS significantly outperforms basic DS in every training round; (2) each training round of curriculum learning outperforms traditional training under the hard filter strategy. Table 3 presents the performance on relation classification. (1) Comparing the different rounds of Time-DS with curriculum learning against basic DS, it is clear that Time-DS with curriculum learning outperforms basic DS in every round of training, achieving the best performance at round 4. (2) From round 1 to round 4, as noisy samples are gradually added, the performance tends to increase. However, the performance decreases when too many noisy samples are added to the training set after round 4.

4.6. Deep Analysis of Instance-Popularity

In this section, we provide a deeper analysis of how instance-popularity encodes the strong relevance between timestamps and true relation mentions in news data.

Time-sensitive Relations. Many relations in news are very sensitive to time, such as those in Table 5. Around the time a relation is established, its relation mentions contain far fewer noisy labels, which greatly benefits the alignment process of distant supervision. We therefore check the consistency between the peaking time of instance-popularity and the establishment time of each relation instance. In particular, we sample several relation instances randomly and acquire their establishment times from Wikipedia or news reports. As Table 5 shows, the time when instance-popularity reaches its peak is usually consistent with the establishment time. It is therefore reasonable to use instance-popularity as a measure for finding relation mentions with less noise as training data.

Training Set   Orig. Set   InsPo≥0.1   InsPo≥0.2   InsPo≥0.3   InsPo≥0.4   InsPo≥0.5   InsPo≥0.6
Noise Ratio    0.66        0.37        0.33        0.23        0.37        0.21        0.18
Set Scale      248,872     49,224      36,156      30,859      26,923      24,066      19,265
Table 4. Noise ratio and set scale of the different training sets.
Relation Instance                                    InsPo Peaking Time   Establishment Time
Acquisition (Pfizer, Anacor Pharmaceuticals)         May 16-18, 2016      May 16, 2016
Acquisition (NBC, DreamWorks Animation)              Apr. 25-30, 2016     Apr. 28, 2016
Investing (Private Trust, Honeywell International)   Feb. 25-27, 2016     Feb. 25, 2016
Investing (Jennison Associates, Boeing)              Jan. 13-15, 2016     Jan. 13, 2016
JobChange (Louis Van Gaal, Manchester United)        May 22-24, 2016      May 23, 2016
JobChange (Derek Fisher, Knicks)                     Feb. 7-9, 2016       Feb. 7, 2016
Lawsuit (Huawei, Samsung Electronics)                May 22-27, 2016      May 25, 2016
Lawsuit (Wal-Mart Stores Inc., Visa Inc)             May 10-12, 2016      May 10, 2016
Partnership (Microsoft, Facebook)                    May 25-27, 2016      May 26, 2016
Partnership (Google, Fiat)                           May 4-6, 2016        May 3, 2016
Table 5. Instance-popularity cases of some relation instances and their establishment times. The peaking time is displayed as a merged span of continuous days.

The Ability to Eliminate Noise. We investigate the ability of instance-popularity, used as a hard filter, to eliminate the noise generated by DS alignment. Although instance-popularity has been proved useful indirectly as a component of Time-DS, a direct evaluation is much more straightforward. Specifically, we randomly sample 100 training instances from each dataset to check the ratio of noise in each training set; Table 4 presents the results. Note that our models achieving the best performance for relation classification and extraction are trained on the most heavily filtered subsets. It is clear that the training sets with lower noise ratios tend to be those on which our model achieves better performance. This correlation is not strict, because the scale of the training set also affects performance.

Error Case Study. In this study, we observe two types of error cases. First, some aligned sentences with low instance-popularity are actually true relation mentions; we call these low-instance-popularity but positive cases (LP). Second, some aligned sentences whose timestamps coincide with the peak time of instance-popularity are actually false relation mentions; we call these high-instance-popularity but negative cases (HN). We present some LP and HN cases as follows.

LP1. “Jose Mourinho is reportedly set to be confirmed as Manchester United’s new manager in the coming days.”, InsPo: 0.0, relation: JobChange.

LP2. “Activision Blizzard’s acquisition of Major League Gaming appears to be bearing fruit.”, InsPo: 0.053, relation: Acquisition.

LP3. “… between two giants in the technology world, as we’ve seen repeatedly with the Apple v. Samsung litigation.”, InsPo: 0.097, relation: Lawsuit.

HN1. “LinkedIn will give Microsoft an even greater foothold in the space …”, InsPo: 0.67, relation: Acquisition.

HN2. “Google and Fiat Chrysler engineers will fit Google’s autonomous driving technology into the Pacifica minivan.”, InsPo: 0.5, relation: Partnership.

HN3. “Warren Buffett, fondly known as the Oracle of Omaha … behind brainchild Berkshire Hathaway Inc. just upped his stake in Apple Inc. by a significant chunk.”, InsPo: 0.80, relation: Investing.

The above LP cases arise in different situations. (1) Some relation instances are reported in the news before the official establishment of the relation, so the timestamps of these mentions are earlier than the peak time of instance-popularity, e.g., LP1. (2) Some relation instances are still reported in the news long after the official establishment of the relation, so the timestamps of these mentions are much later than the peak time of instance-popularity, e.g., LP2. (3) Some relation instances are reported by the news media over a long period. In this case, instance-popularity is spread over a long time span, and usually fails to pinpoint the establishment of the relation within a short time interval, e.g., the 6-year lawsuits between Apple Inc. and Samsung Electronics.

The above HN cases also arise in different situations. (1) Some aligned mentions actually talk about other aspects of the mentioned entities; for example, HN1 discusses the influence of the acquisition rather than expressing the acquisition itself. (2) Some cases provide incomplete information, making it hard to confirm the existence of the relation, e.g., HN2 and HN3.

5. Related Work

5.1. Improvements for Distant Supervision

To alleviate the effects of noise in the automatically annotated datasets of DS, some studies captured certain types of noise and adopted multi-instance learning (Riedel et al., 2010; Surdeanu et al., 2012; Ritter et al., 2013; Min et al., 2013). Some neural network methods learned from multiple instances attentively, without explicitly characterizing the inherent noise (Zeng et al., 2015; Lin et al., 2016; Feng et al., 2017). These approaches focus on enhancing the noise tolerance of models instead of reducing the noise at its source, and hence still suffer from the effects of noise in some ways. Other work considered utilizing many kinds of knowledge besides KBs (Han and Sun, 2016; Liu et al., 2017) to enrich the supervision knowledge. However, such studies suffer from the conflicts brought by multiple supervision sources (Ratner et al., 2016), and it is hard for them to benefit existing relation extraction/classification models.

5.2. Relation Extraction/Classification

Relation classification aims to classify a given relation mention into a pre-defined relation type. Deep neural networks have shown promising results, with representative progress made by Zeng et al. (2014). To encode both past and future context information, Zhang and Wang (2015) employed a bidirectional Recurrent Neural Network (Bi-RNN). To address the long-distance problem, approaches based on Long Short-Term Memory networks (LSTM) have been proposed (Zhang et al., 2015; Xu et al., 2015). Recently, Zhou et al. (2016) combined the attention model and a bidirectional LSTM, achieving a significant improvement for relation classification.

Relation extraction can be regarded as a pipeline of two separate tasks, i.e., named entity recognition and relation extraction. However, some studies consider extracting entities and relations in a single model. Most of these methods are feature-based (Ren et al., 2017; Yang and Cardie, 2013; Miwa and Sasaki, 2014; Li and Ji, 2014). Recently, Miwa and Bansal (2016) used an LSTM-based model to reduce such manual features. Zheng et al. (2017) converted relation extraction into a sequence tagging problem and proposed an LSTM-based encoder-decoder model to extract entities and relations jointly without other redundant information, leading to the best results on the public dataset.

5.3. Curriculum Learning

The main idea of curriculum learning (Bengio et al., 2009) is to start with the easiest aspect of a task and level up the difficulty gradually. Curriculum learning has mainly been applied to various Computer Vision (CV) problems, such as tracking (Supancic III and Ramanan, 2013), face detection (Lin et al., 2018), object detection (Chen and Gupta, 2015), and video detection (Jiang et al., 2014). Luo et al. (2017) applied curriculum learning to the task of relation classification. However, they used curriculum learning to address the cold start of their model training, on a special dataset with explicit prior knowledge of data quality, which differs from our work.

6. Conclusion

In this paper, to alleviate the noise issue in distant supervision (DS), we take a new factor, time, into consideration and propose a novel time-aware distant supervision framework (Time-DS). To make the most of time, we consider two strategies, i.e., hard filter and curriculum learning. Time-DS benefits from these two strategies to guide the training process and thus achieves better models for relation extraction/classification. The experimental results show the effectiveness of the time-series instance-popularity and significant improvements on relation extraction/classification when models are fed into Time-DS.

References

  • Bengio et al. (2009) Yoshua Bengio, Jérôme Louradour, Ronan Collobert, and Jason Weston. 2009. Curriculum learning. In Proceedings of the 26th Annual International Conference on Machine Learning. ACM, 41–48.
  • Chen and Gupta (2015) Xinlei Chen and Abhinav Gupta. 2015. Webly supervised learning of convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision. 1431–1439.
  • Feng et al. (2017) Xiaocheng Feng, Jiang Guo, Bing Qin, Ting Liu, and Yongjie Liu. 2017. Effective deep memory networks for distant supervised relation extraction. In Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, IJCAI. 19–25.
  • Han and Sun (2016) Xianpei Han and Le Sun. 2016. Global distant supervision for relation extraction. In Thirtieth AAAI Conference on Artificial Intelligence.
  • Jiang et al. (2014) Lu Jiang, Deyu Meng, Shoou-I Yu, Zhenzhong Lan, Shiguang Shan, and Alexander Hauptmann. 2014. Self-paced learning with diversity. In Advances in Neural Information Processing Systems. 2078–2086.
  • Li and Ji (2014) Qi Li and Heng Ji. 2014. Incremental joint extraction of entity mentions and relations. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Vol. 1. 402–412.
  • Lin et al. (2018) Liang Lin, Keze Wang, Deyu Meng, Wangmeng Zuo, and Lei Zhang. 2018. Active self-paced learning for cost-effective and progressive face identification. IEEE transactions on pattern analysis and machine intelligence 40, 1 (2018), 7–19.
  • Lin et al. (2016) Yankai Lin, Shiqi Shen, Zhiyuan Liu, Huanbo Luan, and Maosong Sun. 2016. Neural relation extraction with selective attention over instances. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Vol. 1. 2124–2133.
  • Liu et al. (2017) Liyuan Liu, Xiang Ren, Qi Zhu, Shi Zhi, Huan Gui, Heng Ji, and Jiawei Han. 2017. Heterogeneous Supervision for Relation Extraction: A Representation Learning Approach. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing. 46–56.
  • Luo et al. (2017) Bingfeng Luo, Yansong Feng, Zheng Wang, Zhanxing Zhu, Songfang Huang, Rui Yan, and Dongyan Zhao. 2017. Learning with Noise: Enhance Distantly Supervised Relation Extraction with Dynamic Transition Matrix. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Vol. 1. 430–439.
  • Manning et al. (2014) Christopher Manning, Mihai Surdeanu, John Bauer, Jenny Finkel, Steven Bethard, and David McClosky. 2014. The Stanford CoreNLP natural language processing toolkit. In Proceedings of 52nd annual meeting of the association for computational linguistics: system demonstrations. 55–60.
  • Min et al. (2013) Bonan Min, Ralph Grishman, Li Wan, Chang Wang, and David Gondek. 2013. Distant supervision for relation extraction with an incomplete knowledge base. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 777–782.
  • Mintz et al. (2009) Mike Mintz, Steven Bills, Rion Snow, and Dan Jurafsky. 2009. Distant supervision for relation extraction without labeled data. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP: Volume 2-Volume 2. Association for Computational Linguistics, 1003–1011.
  • Miwa and Bansal (2016) Makoto Miwa and Mohit Bansal. 2016. End-to-end relation extraction using lstms on sequences and tree structures. arXiv preprint arXiv:1601.00770 (2016).
  • Miwa and Sasaki (2014) Makoto Miwa and Yutaka Sasaki. 2014. Modeling joint entity and relation extraction with table representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). 1858–1869.
  • Ratner et al. (2016) Alexander J Ratner, Christopher M De Sa, Sen Wu, Daniel Selsam, and Christopher Ré. 2016. Data programming: Creating large training sets, quickly. In Advances in Neural Information Processing Systems. 3567–3575.
  • Ren et al. (2017) Xiang Ren, Zeqiu Wu, Wenqi He, Meng Qu, Clare R Voss, Heng Ji, Tarek F Abdelzaher, and Jiawei Han. 2017. CoType: Joint extraction of typed entities and relations with knowledge bases. In Proceedings of the 26th International Conference on World Wide Web. International World Wide Web Conferences Steering Committee, 1015–1024.
  • Riedel et al. (2010) Sebastian Riedel, Limin Yao, and Andrew McCallum. 2010. Modeling relations and their mentions without labeled text. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases. Springer, 148–163.
  • Ritter et al. (2013) Alan Ritter, Luke Zettlemoyer, Oren Etzioni, et al. 2013. Modeling missing data in distant supervision for information extraction. Transactions of the Association for Computational Linguistics 1 (2013), 367–378.
  • Supancic III and Ramanan (2013) James Steven Supancic III and Deva Ramanan. 2013. Self-paced learning for long-term tracking. In Computer Vision and Pattern Recognition (CVPR), 2013 IEEE Conference on. IEEE, 2379–2386.
  • Surdeanu et al. (2012) Mihai Surdeanu, Julie Tibshirani, Ramesh Nallapati, and Christopher D Manning. 2012. Multi-instance multi-label learning for relation extraction. In Proceedings of the 2012 joint conference on empirical methods in natural language processing and computational natural language learning. Association for Computational Linguistics, 455–465.
  • Xu et al. (2015) Yan Xu, Lili Mou, Ge Li, Yunchuan Chen, Hao Peng, and Zhi Jin. 2015. Classifying relations via long short term memory networks along shortest dependency paths. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing. 1785–1794.
  • Yang and Cardie (2013) Bishan Yang and Claire Cardie. 2013. Joint inference for fine-grained opinion extraction. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Vol. 1. 1640–1649.
  • Zeng et al. (2015) Daojian Zeng, Kang Liu, Yubo Chen, and Jun Zhao. 2015. Distant supervision for relation extraction via piecewise convolutional neural networks. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing. 1753–1762.
  • Zeng et al. (2014) Daojian Zeng, Kang Liu, Siwei Lai, Guangyou Zhou, and Jun Zhao. 2014. Relation classification via convolutional deep neural network. In Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers. 2335–2344.
  • Zhang and Wang (2015) Dongxu Zhang and Dong Wang. 2015. Relation classification via recurrent neural network. arXiv preprint arXiv:1508.01006 (2015).
  • Zhang et al. (2015) Shu Zhang, Dequan Zheng, Xinchen Hu, and Ming Yang. 2015. Bidirectional long short-term memory networks for relation classification. In Proceedings of the 29th Pacific Asia Conference on Language, Information and Computation. 73–78.
  • Zheng et al. (2017) Suncong Zheng, Feng Wang, Hongyun Bao, Yuexing Hao, Peng Zhou, and Bo Xu. 2017. Joint Extraction of Entities and Relations Based on a Novel Tagging Scheme. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Vol. 1. 1227–1236.
  • Zhou et al. (2016) Peng Zhou, Wei Shi, Jun Tian, Zhenyu Qi, Bingchen Li, Hongwei Hao, and Bo Xu. 2016. Attention-based bidirectional long short-term memory networks for relation classification. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), Vol. 2. 207–212.