Deep Ranking Based Cost-sensitive Multi-label Learning for Distant Supervision Relation Extraction

07/25/2019 · by Hai Ye, et al.

Knowledge bases provide a potential way to improve the intelligence of information retrieval (IR) systems, because a knowledge base contains numerous relations between entities that can help an IR system infer from one entity to another. Relation extraction is one of the fundamental techniques for constructing a knowledge base. Distant supervision is a semi-supervised learning method for relation extraction which learns from labeled and unlabeled data. However, this approach suffers from the problem of relation overlapping, in which one entity tuple may have multiple relation facts. We believe that relation types can have latent connections, which we call class ties, and that these can be exploited to enhance relation extraction. However, this property between relation classes has not been fully explored before. In this paper, to exploit class ties between relations to improve relation extraction, we propose a general ranking based multi-label learning framework combined with convolutional neural networks, in which ranking based loss functions with a regularization technique are introduced to learn the latent connections between relations. Furthermore, to deal with the problem of class imbalance in distant supervision relation extraction, we further adopt cost-sensitive learning to rescale the costs from the positive and negative labels. Extensive experiments on a widely used dataset show the effectiveness of our model in exploiting class ties and relieving the class imbalance problem.


1 Introduction

Relation extraction (RE) aims to classify the relations (also called relation facts) between two given named entities based on natural-language text. Fig. 1 shows two sentences with the same entity tuple but two different relation facts. RE should accurately extract the corresponding relation facts (place_of_birth, place_lived) for the entity tuple (Patsy Ramsey, Atlanta) based on the contexts of the sentences. Supervised learning methods require a large amount of labeled data to work well. With the rapid growth of the number of relation types, traditional methods cannot keep pace because of the limited labeled data. To narrow the gap caused by data sparsity, 2009 proposes distant supervision (DS) for relation extraction, which automatically generates training data by aligning a knowledge base of facts (i.e., Freebase freebase ) to texts. For a fact (i.e., an entity tuple with a relation type) from the knowledge base, the sentences containing the entity tuple of the fact are regarded as its training data.

Class ties are the connections (relatedness) between relation types in relation extraction. In general, we divide class ties into two categories: weak class ties and strong class ties. Weak class ties mainly involve the co-occurrence of relations, such as place_of_birth and place_lived, or CEO_of and founder_of. In contrast, strong class ties mean that relations have latent logical entailments. Take the two relations capital_of and city_of for example: if one entity tuple has the relation capital_of, it must also express the relation fact city_of, because the two relations have the entailment capital_of ⇒ city_of. Obviously, the reverse entailment does not hold. Further take the following sentence,

Jonbenet told me that her mother never left since she was born.

for example. This sentence expresses two relation facts, place_of_birth and place_lived. The word “born” is a strong cue for extracting place_of_birth, so it may not be easy to predict the relation place_lived; however, extracting place_of_birth will provide evidence for predicting place_lived once the weak ties between the two relations are incorporated.

Figure 1: Training instances generated by Freebase. The entity tuple is (Patsy Ramsey, Atlanta) and its two relation facts are place_of_birth and place_lived.

Exploiting class ties is necessary for DS based relation extraction. In the DS scenario, one entity tuple can have multiple relation facts, a challenge called relation overlapping 2011 ; 2012 , as shown in Fig. 1. However, the relations of one entity tuple can have the class ties mentioned above, which can be leveraged to enhance relation extraction, because class ties narrow down the potential search space and reduce the uncertainty between relations when predicting unknown relations: for example, if one pair of entities has the CEO_of relation, it will also have the founder_of relation with high probability.

To exploit class ties between relations, we propose to make joint extraction by considering pairwise connections between positive and negative labels, inspired by furnkranz2008multilabel ; zhang2006multilabel . Taking the example of one entity tuple with two different relation types shown in Fig. 1, by extracting the two relations jointly we can maintain their class ties (co-occurrence), and these class ties can be learned by the model and then leveraged to extract instances with unknown relations. We introduce a ranking based multi-label learning framework to make joint extraction, which learns to rank the prediction scores of positive relations higher than those of negative ones, and we design ranking based loss functions for multi-label learning. Furthermore, inspired by zhouMIML ; MIML2005 , we add a regularization term to the loss functions to better learn the relatedness between relation facts; we only regularize the positive relation types, ignoring the relation NR (which indicates that no relation is expressed), based on the assumption that the connections between relations only exist among positive relations but not with NR (see Sec. 3.4).

Figure 2: The main architecture of our model. The features of sentences are encoded by the CNN model, the sentence embeddings are then aggregated, and finally the bag representation is used to make joint extraction.

Besides, class imbalance is another severe problem which cannot be ignored in distant supervision relation extraction. We find that around 72% of the training data express the NR relation type, and the proportion is even higher (above 96%) in the test set (see Table 1), so samples with the NR type account for a much higher proportion compared to the positive samples (those not categorized as NR). This problem severely affects model training, making the model prone to classifying samples as having the NR relation type japkowicz2002class . To overcome this problem, based on the ranking loss functions, we further adopt cost-sensitive learning to rescale the costs from the positive and negative labels, by increasing the losses for positive labels and penalizing the losses from the NR type (detailed in Sec. 3.5).

Furthermore, combining information across sentences is appropriate for joint extraction, since it provides additional information from other sentences for extracting each relation (hao ; lin2016 ). In Fig. 1, sentence #1 is the evidence for place_of_birth, but it also expresses the meaning of “living in someplace”, so it can be aggregated with sentence #2 to extract place_lived. Meanwhile, the word “hometown” in sentence #2 can provide evidence for place_of_birth and should be combined with sentence #1 to extract place_of_birth.

In this work, we propose a unified model that integrates ranking based cost-sensitive multi-label learning with a convolutional neural network (CNN) to exploit class ties between relations and further relieve the class imbalance problem. Inspired by the effectiveness of deep learning for modeling sentence features deeplearning2015 , we use a CNN to encode sentences. Similar to lin2016 ; rank2015 , we use class embeddings to represent relation classes. The whole model architecture is presented in Fig. 2. We first use the CNN to embed sentences; then we introduce two variant methods to combine the embedded sentences into one bag representation vector, aiming to aggregate information across sentences; after that we measure the similarity between the bag representation and each relation class in real-valued space. Finally, we use the ranking loss functions to learn to make joint extraction over multiple relation types.

Our experimental results on the dataset of 2010 show that: (1) our model is much more effective than the baselines; (2) leveraging class ties enhances relation extraction, and our model is effective at learning class ties through joint extraction; (3) a much better model can be trained after relieving the class imbalance caused by NR.

Our contributions in this paper can be summarized as follows:

  • We propose to leverage class ties to enhance relation extraction. Combined with a CNN, an effective deep ranking based multi-label learning model with a regularization technique is introduced to exploit class ties.

  • We adopt cost-sensitive learning to relieve the class imbalance problem, and experimental results show the effectiveness of our method.

2 Related Work

2.1 Relation Extraction

Previous methods for relation extraction can mainly be categorized as supervised and distantly supervised. Supervised methods need a large amount of labeled data to work well and cannot keep up with the rapid growth of relation types. To overcome the data sparsity problem of supervised methods, distant supervision relation extraction was proposed by 2009 . However, DS based relation extraction suffers from two problems: the wrong labelling problem and the overlapping problem. The former means that sentences containing certain entities may not actually express the indicated relation type of the entities, or may not express any relation at all; the latter means that one entity tuple may have multiple relation types. To solve the wrong labelling problem, 2010 introduces multi-instance learning for relation extraction, in which the mentions of one entity tuple are merged into one bag and the model extracts relations over mention bags; however, this method cannot deal with the relation overlapping problem. Afterwards, 2011 and 2012 introduce the framework of multi-instance multi-label learning to jointly overcome the two problems and improve the performance significantly. Though they also propose to make joint extraction of relations, they only use information from a single sentence, losing information from other sentences. Global tries to use a Markov logic model to capture consistency between relation labels; in contrast, our model leverages deep ranking to learn class ties automatically.

In recent years, deep learning has achieved remarkable success in computer vision and natural language processing deeplearning2015 . Deep learning has been applied to automatically learn the features of sentences (zeng2014 ; Yu2014 ; rank2015 ; lin2016 ; DBLP:conf/pakdd/YeYLC17 ; DBLP:conf/emnlp/YeW18 ; DBLP:journals/corr/abs-1802-08504 ; DBLP:conf/coling/JiangYLCM18 ). In supervised relation extraction, zeng2014 applies convolutional neural networks to model sentences and introduces position features for RE, obtaining significant gains in RE performance. Afterwards, Yu2014 ; rank2015 ; lin2016 introduce more advanced deep learning models for RE. In distant supervision relation extraction, zeng2015 proposes a piecewise convolutional neural network with multi-instance learning, which improves precision and recall significantly. Afterwards, lin2016 introduces the attention mechanism (attention1 ; attention2 ) to merge sentence features, aiming to construct better bag representations. lin2017 further proposes a multi-lingual neural relation extraction framework that considers the information consistency and complementarity among cross-lingual texts. However, these two deep learning based models only make separated extraction and thus cannot model class ties between relations. Recently, zeng2016incorporating proposes to incorporate relation paths for distant supervision relation extraction, and ji2017distant introduces the descriptions of entities to enhance distant supervision relation extraction. chen2018encoding proposes a joint inference approach that encodes implicit relation requirements for relation extraction. Joint learning has also been applied to study two related tasks together DBLP:journals/corr/abs-1906-00575 . Besides, many recent works have been proposed to solve the wrong labelling problem. bingfeng2017 proposes to model the noise caused by the wrong labelling problem and shows that a dynamic transition matrix can effectively characterize the noise. qin2018dsgan ; han2018denoising propose to use adversarial learning goodfellow2014generative to solve the wrong labelling problem. Instead, DBLP:conf/aaai/FengHZYZ18 ; DBLP:conf/acl/WangXQ18a adopt reinforcement learning to learn to select high-quality data for training. DBLP:conf/emnlp/LiuWCS17 dynamically corrects wrongly labeled data during training by exploiting semantic information from labeled entity pairs. DBLP:conf/emnlp/LiuZZJ18 transfers prior knowledge learned from a relevant entity classification task to make the model robust to noisy data.

2.2 Deep Learning to Rank

Learning to rank (LTR) is an important technique in information retrieval (IR) liu2009 . The methods to train an LTR model include pointwise, pairwise and listwise approaches; we apply pairwise LTR in this paper. Deep learning to rank has been widely used in many problems to serve as a classification model. In image retrieval, zhao2015deep applies deep semantic ranking to multi-label image retrieval. In text matching, severyn2015learning adopts learning to rank combined with a deep CNN for matching short text pairs. In traditional supervised relation extraction, rank2015 designs a pairwise loss function based on a CNN for single-label relation extraction. Building on the advantages of deep learning to rank, we combine pairwise learning to rank liu2009 with a CNN in our model, aiming to jointly extract multiple relations.

2.3 Cost-sensitive Learning

Cost-sensitive learning is one of the techniques for the class imbalance problem; it assigns higher misclassification costs to classes with small proportions. For example, shen2015deepcontour proposes a regularized softmax to deal with imbalanced edge label classification, and khan2015cost adopts cost-sensitive learning to learn deep feature representations from imbalanced data. Another approach to relieving the class imbalance problem is re-sampling huang2016learning ; imbalance , including over-sampling and under-sampling, which aims to balance the distributions of data across labels.

This paper is an extension of yehaiacl2017 . Compared to the original work in yehaiacl2017 , this paper makes several improvements:

Methods: (a) We fully consider the class imbalance problem and propose a novel ranking based cost-sensitive loss function combined with multi-label learning. (b) To better learn class ties between relations, we further introduce a regularization term into the ranking loss functions.

Experiments: (a) We conduct additional experiments to analyze the effectiveness of our novel cost-sensitive ranking loss functions. (b) We also conduct evaluation experiments on the effectiveness of the regularization.

Content: (a) We rewrite the description of our methods from the viewpoint of multi-label learning and cost-sensitive learning to provide better theoretical justification.

3 Methodology

We introduce our methods in this section. First, we describe the widely used CNN architecture for sentence encoding. Then we discuss the ranking based multi-label learning framework with the regularization technique. After that, we introduce the proposed cost-sensitive learning to overcome the effects of NR on model training.

3.1 Notation

We define the set of relation classes as $\mathcal{R}$, the entity tuples as $\mathcal{T}$, and the mentions (a sentence containing a certain entity tuple is called a mention of that tuple) as $\mathcal{M}$. The dataset is constructed as follows: for an entity tuple $t$ and its relation class set $P \subseteq \mathcal{R}$, we collect all the mentions $X = \{x_1, x_2, \dots, x_n\}$ that contain $t$; the dataset we use is $\mathcal{D} = \{(t_i, P_i, X_i)\}$. Given a data point $(t, P, X)$, the sentence embeddings of $X$ encoded by the CNN are denoted $\{\mathbf{s}_1, \mathbf{s}_2, \dots, \mathbf{s}_n\}$, and we use class embeddings $\mathbf{r}$ to represent the relation classes, which are learned during model training.
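To make the data construction concrete, the following is a minimal Python sketch (not from the paper) that groups labeled sentences into multi-label bags keyed by entity tuple; the record fields and example sentences are illustrative.

from collections import defaultdict

# each record: (head entity, tail entity, relation label, sentence)
records = [
    ("Patsy Ramsey", "Atlanta", "place_of_birth", "sentence #1 ..."),
    ("Patsy Ramsey", "Atlanta", "place_lived",    "sentence #2 ..."),
]

def build_bags(records):
    """Group all mentions of one entity tuple into one bag and take the
    union of its relation labels (the multi-instance multi-label setting)."""
    bags = defaultdict(lambda: {"labels": set(), "mentions": []})
    for head, tail, relation, sentence in records:
        bags[(head, tail)]["labels"].add(relation)
        bags[(head, tail)]["mentions"].append(sentence)
    return dict(bags)

bags = build_bags(records)
print(bags[("Patsy Ramsey", "Atlanta")]["labels"])   # {'place_of_birth', 'place_lived'}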

3.2 CNN for sentence embedding

We adopt the effective piecewise CNN (PCNN) architecture from zeng2015 ; lin2016 to encode sentences, and we briefly introduce PCNN in this section. More details of PCNN can be found in the previous work.

3.2.1 Word Representations

Word Embedding   Given a word embedding matrix $\mathbf{V} \in \mathbb{R}^{d_w \times |V|}$, where $|V|$ is the size of the word dictionary and $d_w$ is the dimension of the word embeddings, the words of a mention are represented by real-valued vectors looked up from $\mathbf{V}$.

Position Embedding   The position embedding of a word encodes the distances from the word to the entities in a mention. We add position information to the word representations by appending the position embeddings to the word embedding of every word. Given a position embedding matrix $\mathbf{PE} \in \mathbb{R}^{d_p \times |D|}$, where $|D|$ is the number of distinct distances and $d_p$ is the dimension of the position embeddings, the dimension of each word representation becomes $d = d_w + 2 d_p$.
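As an illustration of the word representations described above, here is a small NumPy sketch that concatenates a word embedding with two relative-position embeddings; the embedding sizes, the clipping range and all array contents are illustrative assumptions, not values from the paper.

import numpy as np

rng = np.random.default_rng(0)
d_w, d_p = 50, 5                 # word / position embedding sizes (illustrative)
vocab_size, max_dist = 1000, 60  # dictionary size and maximum clipped distance

V  = rng.normal(size=(vocab_size, d_w))          # word embedding matrix
PE = rng.normal(size=(2 * max_dist + 1, d_p))    # position embedding matrix

def embed_mention(word_ids, e1_pos, e2_pos):
    """Build word representations: word embedding plus the embeddings of the
    relative distances to the two entities (d = d_w + 2 * d_p)."""
    reps = []
    for j, w in enumerate(word_ids):
        p1 = int(np.clip(j - e1_pos, -max_dist, max_dist)) + max_dist
        p2 = int(np.clip(j - e2_pos, -max_dist, max_dist)) + max_dist
        reps.append(np.concatenate([V[w], PE[p1], PE[p2]]))
    return np.stack(reps)                        # shape: (sentence length, d)

q = embed_mention([3, 17, 42, 8, 99], e1_pos=1, e2_pos=4)
print(q.shape)                                   # (5, 60)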

3.2.2 Convolution, Piecewise max-pooling

After transforming the words of a mention $x$ into real-valued vectors, we get the sentence matrix $\mathbf{q} = [\mathbf{q}_1, \mathbf{q}_2, \dots, \mathbf{q}_L]$, where $\mathbf{q}_j \in \mathbb{R}^{d}$ and $L$ is the sentence length. The set of kernels is $\{\mathbf{w}_1, \mathbf{w}_2, \dots, \mathbf{w}_K\}$, where $K$ is the number of kernels. Define the window size as $w$; given one kernel $\mathbf{w}_i \in \mathbb{R}^{w \times d}$, the convolution operation is defined as follows:

$c_{ij} = \mathbf{w}_i \cdot \mathbf{q}_{j-w+1:j} + b_i, \quad 1 \le j \le L + w - 1,$   (1)

where $\mathbf{c}_i = [c_{i1}, \dots, c_{i,L+w-1}]$ is the vector obtained by sliding the convolution of $\mathbf{w}_i$ along $\mathbf{q}$, and $b_i$ is the corresponding entry of the bias vector $\mathbf{b}$. For those vectors $\mathbf{q}_j$ whose indexes are out of the range $[1, L]$, we replace them with zero vectors.

When pooling, piecewise max-pooling divides the sentence (and accordingly each $\mathbf{c}_i$) into three parts: $[1, p_1]$, $[p_1, p_2]$ and $[p_2, L]$ ($p_1$ and $p_2$ are the positions of the two entities, $1$ is the beginning of the sentence and $L$ is the end of the sentence). The piecewise max-pooling is defined as follows:

$p_{ik} = \max_{j \in \text{part}_k} c_{ij}, \quad k = 1, 2, 3,$   (2)

where $\mathbf{p}_i = [p_{i1}, p_{i2}, p_{i3}]$ is the result of the mention processed by kernel $\mathbf{w}_i$. Given the set of $K$ kernels and following the above steps, the mention can be embedded into $\mathbf{p} = [\mathbf{p}_1, \mathbf{p}_2, \dots, \mathbf{p}_K]$, where $\mathbf{p} \in \mathbb{R}^{3K}$.
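To make Eqs. 1 and 2 concrete, the following NumPy sketch performs the convolution and piecewise max-pooling for one mention; the padding scheme, the way entity positions split the convolution output, and all sizes are illustrative assumptions of this sketch rather than the paper's exact implementation.

import numpy as np

def pcnn_encode(q, e1_pos, e2_pos, kernels, bias, window=3):
    """Piecewise CNN sentence encoder (illustrative sketch of Eqs. 1-3).

    q       : (L, d) word representations of one mention
    kernels : (K, window, d) convolution kernels
    bias    : (K,) bias terms
    Returns a (3*K,) sentence embedding (before dropout).
    """
    L, d = q.shape
    K = kernels.shape[0]
    # zero vectors for window positions that fall outside the sentence
    padded = np.vstack([np.zeros((window - 1, d)), q, np.zeros((window - 1, d))])
    conv = np.zeros((K, L + window - 1))
    for j in range(L + window - 1):                  # Eq. 1: slide each kernel
        win = padded[j:j + window]
        conv[:, j] = np.tensordot(kernels, win, axes=([1, 2], [0, 1])) + bias
    # Eq. 2: max-pool over the three segments split by the entity positions
    cuts = sorted([e1_pos + 1, e2_pos + 1])
    parts = np.split(conv, cuts, axis=1)
    pooled = np.concatenate([part.max(axis=1) for part in parts])
    return np.tanh(pooled)                           # non-linear layer (Eq. 3)

rng = np.random.default_rng(0)
q = rng.normal(size=(7, 60))                         # 7 words, d = 60
s = pcnn_encode(q, e1_pos=1, e2_pos=5,
                kernels=rng.normal(size=(8, 3, 60)), bias=np.zeros(8))
print(s.shape)                                       # (24,) = 3 * K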

3.2.3 Non-Linear Layer, Regularization

To learn high-level features of mentions, we apply a non-linear layer after the pooling layer. After that, a dropout layer is applied to prevent over-fitting. We define the final fixed-length sentence representation as $\mathbf{s}$ ($\mathbf{s} \in \mathbb{R}^{d_s}$, $d_s = 3K$):

$\mathbf{s} = g(\mathbf{p}) \circ \mathbf{h},$   (3)

where $g$ is a non-linear function (following zeng2015 , we use $\tanh$ in this paper) and $\mathbf{h}$ is a Bernoulli random vector whose entries are $1$ with probability $p$.

3.3 Combine Information across Sentences

We propose two options for combining the sentence embeddings of a bag, so as to provide enough information for multi-label learning.

AVE   The first option is the averaging method. It regards all sentences equally and directly averages the sentence embeddings over all dimensions. The AVE function is defined as follows:

$\mathbf{b} = \frac{1}{n} \sum_{i=1}^{n} \mathbf{s}_i,$   (4)

where $n$ is the number of sentences and $\mathbf{b}$ is the bag representation combining all sentence embeddings. Because it weights the importance of all sentences equally, this method may introduce noise from two sources: (1) wrongly labelled data; (2) mentions that are irrelevant to a given relation class, since all sentences containing the same entity tuple are combined together to construct the bag representation.

ATT   The second option is the sentence-level attention algorithm used by  lin2016 to measure the importance of sentences, aiming to relieve the wrong labelling problem. For every sentence, ATT calculates a weight by comparing the sentence with one relation. We first calculate the similarity between a sentence embedding $\mathbf{s}_i$ and a relation class $r$ as follows:

$e_i = \mathbf{s}_i^{\top} \mathbf{r} + a,$   (5)

where $e_i$ is the similarity between sentence embedding $\mathbf{s}_i$ and relation class $r$, and $a$ is a bias factor; in this paper, $a$ is set to a fixed value. Then we apply $\mathrm{softmax}$ to rescale $(e_1, \dots, e_n)$ to attention weights $(\alpha_1, \dots, \alpha_n)$. The weight $\alpha_i$ for $\mathbf{s}_i$ is obtained as follows:

$\alpha_i = \frac{\exp(e_i)}{\sum_{k=1}^{n} \exp(e_k)},$   (6)

so the function to merge the sentences with ATT is as follows:

$\mathbf{b}_r = \sum_{i=1}^{n} \alpha_i \mathbf{s}_i.$   (7)
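A minimal NumPy sketch of the two aggregation options (Eq. 4 for AVE, Eqs. 5-7 for ATT); the dot-product similarity form and the zero bias default are assumptions of this sketch.

import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def bag_ave(S):
    """AVE (Eq. 4): average of all sentence embeddings in the bag."""
    return S.mean(axis=0)

def bag_att(S, r, a=0.0):
    """ATT (Eqs. 5-7): attention-weighted bag representation w.r.t. the
    class embedding r of one relation."""
    e = S @ r + a            # Eq. 5: similarity of each sentence to relation r
    alpha = softmax(e)       # Eq. 6: rescale similarities to attention weights
    return alpha @ S         # Eq. 7: weighted sum of sentence embeddings

rng = np.random.default_rng(0)
S = rng.normal(size=(4, 24))     # a bag of 4 sentence embeddings
r = rng.normal(size=24)          # class embedding of one relation
print(bag_ave(S).shape, bag_att(S, r).shape)   # (24,) (24,)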

3.4 Learning Class Ties via Ranking based Multi-label Learning with Regularization

First, we present the score function that measures the similarity between a bag representation $\mathbf{b}$ and a relation $r$.

Score Function   We use the dot product to produce the score for $\mathbf{b}$ to be predicted as relation $r$. The score function is as follows:

$f(\mathbf{b}, r) = \mathbf{r}^{\top} \mathbf{b}.$   (8)

There are other options for the score function. In MultiATT , a margin based loss function is proposed that measures the similarity between $\mathbf{b}$ and $\mathbf{r}$ by their distance. Because the score function is not a central issue in our model, we adopt the dot product, also used by rank2015 and lin2016 , as our score function.

Now we start to introduce the ranking loss functions.

Pairwise ranking aims to learn a score function that ranks positive classes higher than negative ones. This goal can be summarized as follows:

$f(\mathbf{b}, r^+) \ge f(\mathbf{b}, r^-) + m, \quad \forall\, r^+ \in P,\ r^- \in \mathcal{R} \setminus P,$   (9)

where $m$ is a margin factor which controls the minimum margin between the positive scores and the negative scores. Inspired by rank2015 , given a positive class $r^+$ and a negative class $r^-$, we adopt the following function to learn the score function:

$\mathcal{L}(\mathbf{b}, r^+, r^-) = \log\big(1 + \exp(\gamma (m^+ - f(\mathbf{b}, r^+)))\big) + \log\big(1 + \exp(\gamma (m^- + f(\mathbf{b}, r^-)))\big),$   (10)

where $\gamma$ is the rescale factor, $m^+$ is the positive margin and $m^-$ is the negative margin. This loss function is designed to rank positive classes higher than negative ones, controlled by the margins. In effect, $f(\mathbf{b}, r^+)$ is pushed to be higher than $m^+$ and $f(\mathbf{b}, r^-)$ to be lower than $-m^-$. In our work, the values of $\gamma$, $m^+$ and $m^-$ are adopted from  rank2015 . To simplify the loss functions given in the following, we use $L^+(\mathbf{b}, r^+)$ to denote the first term of Eq. 10 and $L^-(\mathbf{b}, r^-)$ to denote the second term.
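Below is a minimal NumPy sketch of the pairwise ranking loss in Eq. 10, together with the dot-product score of Eq. 8; the hyper-parameter values used here are illustrative placeholders rather than the settings of the paper.

import numpy as np

GAMMA, M_POS, M_NEG = 2.0, 2.5, 0.5    # illustrative rescale factor and margins

def score(b, r):
    """Eq. 8: dot-product score between bag representation b and class embedding r."""
    return float(b @ r)

def L_pos(b, r_pos):
    """First term of Eq. 10: pushes the positive score above m+."""
    return float(np.log1p(np.exp(GAMMA * (M_POS - score(b, r_pos)))))

def L_neg(b, r_neg):
    """Second term of Eq. 10: pushes the negative score below -m-."""
    return float(np.log1p(np.exp(GAMMA * (M_NEG + score(b, r_neg)))))

rng = np.random.default_rng(0)
b = rng.normal(size=24)                                # bag representation
r_pos, r_neg = rng.normal(size=24), rng.normal(size=24)
print(L_pos(b, r_pos) + L_neg(b, r_neg))               # pairwise ranking loss (Eq. 10)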

To model the class ties (co-occurrence) of the labels, we assume that the positive labels share class ties and are connected with each other. Based on this assumption, we have two mechanisms for learning class ties: making joint extraction of relations, and explicitly modeling the connections by regularizing the learning of positive labels. In the following, we first introduce the loss functions for multi-label learning extended from Eq. 10; then we discuss the regularization term.

To learn class ties between relations, we first extend Eq. 10 to multi-label learning. The proposed ranking based loss functions are as follows:

$\mathcal{L}_{AVE}$ with AVE (Variant-1)   We define the margin-based loss function with the AVE option for aggregating sentences as follows:

$\mathcal{L}_{AVE} = \sum_{r \in P} L^+(\mathbf{b}, r) + |P| \cdot L^-(\mathbf{b}, c^-).$   (11)

Similar to Weston2011 and rank2015 , we update one negative class at every training round, but to balance the loss between positive classes and negative ones, we multiply $|P|$ (the number of positive labels) before the right term of Eq. 11 to expand the negative loss. We apply mini-batch stochastic gradient descent (SGD) to minimize the loss function. The negative class $c^-$ is chosen as the one with the highest score among all negative classes  rank2015 , i.e.:

$c^- = \operatorname{arg\,max}_{c \in \mathcal{R} \setminus P} f(\mathbf{b}, c).$   (12)
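A sketch of the joint multi-label loss of Variant-1 (Eqs. 11-12), reusing the `L_pos`, `L_neg` and `score` helpers from the previous sketch; the class-embedding matrix R and the label indices are illustrative.

def loss_variant1(b, R, positives):
    """Eq. 11: sum of the positive ranking losses plus |P| times the loss of
    the hardest negative class selected by Eq. 12.
    R is the (num_classes, dim) class-embedding matrix; `positives` holds the
    indices of the positive labels of this bag."""
    negatives = [c for c in range(len(R)) if c not in positives]
    c_neg = max(negatives, key=lambda c: score(b, R[c]))     # Eq. 12
    pos_loss = sum(L_pos(b, R[r]) for r in positives)
    return pos_loss + len(positives) * L_neg(b, R[c_neg])

# example: a bag representation b (e.g., from AVE) and a random class-embedding matrix
R = np.random.default_rng(1).normal(size=(53, 24))           # 53 relation classes
print(loss_variant1(b, R, positives={3, 7}))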

$\mathcal{L}_{ATT}$ with ATT (Variant-2)   Now we define the loss function with the ATT option for combining sentences as follows:

$\mathcal{L}_{ATT} = \sum_{r \in P} \big[ L^+(\mathbf{b}_r, r) + L^-(\mathbf{b}_{c^-}, c^-) \big],$   (13)

where $\mathbf{b}_r$ is the attention-weighted bag representation whose attention weights are obtained by comparing the sentence embeddings with relation class $r$, and $c^-$ is chosen by the following function:

$c^- = \operatorname{arg\,max}_{c \in \mathcal{R} \setminus P} f(\mathbf{b}_c, c),$   (14)

which means we update one negative class in every training round. We keep the values of $\gamma$, $m^+$ and $m^-$ the same as in Eq. 11. In Eq. 13, for every $r \in P$ we sample $c^-$ according to Eq. 14, so different from Eq. 11, we do not expand the negative loss by multiplying $|P|$.

According to this loss function, we can see that for each class $r$, the model captures the most relevant information from the sentences to merge $\mathbf{b}_r$, and then ranks $f(\mathbf{b}_r, r)$ higher than all negative scores $f(\mathbf{b}_c, c)$ ($c \in \mathcal{R} \setminus P$). We use the same update algorithm to minimize this loss.

Based on the assumption that all positive labels share the same class ties, making joint extraction of the relations can capture the co-occurrence of the labels: if the relations of the same entity pair usually appear together, then extracting them jointly can learn the statistical property of their co-occurrence.

Regularization   To learn the class ties between relations, we have proposed the ranking based loss functions above. Inspired by zhouMIML ; MIML2005 , we further capture the relation connections by adding an extra regularization term to the loss functions. We only consider the relatedness between positive labels, ignoring NR. The relatedness is measured through the mean (center) of the positive class embeddings:

$\bar{\mathbf{r}} = \frac{1}{|P|} \sum_{r \in P} \mathbf{r},$   (15)

where $\mathbf{r}$ is the class embedding of relation $r$ and $\bar{\mathbf{r}}$ is the center of the positive labels. We hope the positive labels stay close to this center, which is measured by:

$\Omega(P) = \sum_{r \in P} \| \mathbf{r} - \bar{\mathbf{r}} \|_2^2.$   (16)

Following zhouMIML , to model the class ties we minimize the regularized loss function:

$\min_{\Theta}\ \lambda\, \mathcal{L} + \mu\, \Omega(P),$   (17)

where $\mathcal{L}$ denotes one of the ranking losses above, and $\lambda$ and $\mu$ are hyper-parameters. Eq. 17 is designed based on the consideration that labels between which class ties exist should be clustered together and stay close to the center of these labels. According to Eq. 15, Eq. 16 can be re-written as:

$\Omega(P) = \sum_{r \in P} \Big\| \mathbf{r} - \frac{1}{|P|} \sum_{r' \in P} \mathbf{r}' \Big\|_2^2.$   (18)

By merging Eq. 18 into Eq. 17, we obtain our final regularized objective:

$\min_{\Theta}\ \lambda\, \mathcal{L} + \mu \sum_{r \in P} \Big\| \mathbf{r} - \frac{1}{|P|} \sum_{r' \in P} \mathbf{r}' \Big\|_2^2.$   (19)

In this paper, $\lambda$ and $\mu$ are set to fixed values.
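A short NumPy sketch of the class-tie regularizer as reconstructed in Eqs. 15-18; treating it as an additive penalty weighted by μ is the assumption stated above.

import numpy as np

def class_tie_regularizer(R, positives):
    """Eqs. 15-18: squared distances of the positive class embeddings to their
    center; NR and the other negative classes are excluded."""
    P = R[list(positives)]              # embeddings of the positive labels only
    center = P.mean(axis=0)             # Eq. 15: center of the positive labels
    return float(((P - center) ** 2).sum())   # Eq. 16 / Eq. 18

# usage: total objective = lam * ranking_loss + mu * class_tie_regularizer(R, positives)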

Proportion of NR (%)   Training   Test
Riedel                 72.52      96.26
Table 1: The proportions of NR samples from Riedel’s dataset.

3.5 Ranking based Cost-sensitive Multi-label Learning

In relation extraction, the dataset always contains negative samples which do not express any relation type and are classified as the NR type (no relation). Table 1 presents the proportion of NR samples in the dataset from 2010 , which shows that most of the data is about NR. Data imbalance severely affects model training and makes the model sensitive only to classes with a high proportion imbalance , causing positive samples to be classified as NR. To relieve this problem, we adopt cost-sensitive learning to construct the loss function. Based on $\mathcal{L}_{ATT}$ (Eq. 13), the cost-sensitive loss function, which is Variant-3, is as follows:

$\mathcal{L}_{cost} = \sum_{r \in P} \Big[ L^+(\mathbf{b}_r, r) + \eta \sum_{r' \in P,\, r' \neq r} L^+(\mathbf{b}_{r'}, r') + \big(\beta\, \mathbb{1}[c^- = \mathrm{NR}] + \mathbb{1}[c^- \neq \mathrm{NR}]\big)\, L^-(\mathbf{b}_{c^-}, c^-) \Big],$   (20)

where $\beta$ and $\eta$ are hyper-parameters and $\mathbb{1}[\cdot]$ is an indicator function. Similar to Eq. 14, we select $c^-$ as follows:

$c^- = \operatorname{arg\,max}_{c \in \mathcal{R} \setminus P} f(\mathbf{b}_c, c).$   (21)

Because NR accounts for a high proportion of the training set, without control the model will receive large costs from NR. In order to relieve the effects from NR, we penalize the losses from NR. Specifically, we have two strategies to do so, controlled by the two hyper-parameters $\beta$ (for the NR cost) and $\eta$ (for the costs of the remaining positive labels). If $r$ is a positive label, to balance the costs between the positive labels and the NR label, we further add the costs from the remaining positive relations, and at the same time the extra cost from NR is rescaled by $\beta$. If $\eta$ is small enough, this loss function becomes similar to the loss in Eq. 13. We set $\beta$ and $\eta$ to the values that achieve the best results in our experiments. How $\beta$ and $\eta$ affect the model performance is discussed in Sec. 4.5 and Sec. 4.6. We also add the regularization term to $\mathcal{L}_{cost}$ to better capture the class ties between relations.

We give the pseudocode for computing $\mathcal{L}_{cost}$ in Algorithm 1.

input : sentence embeddings $\{\mathbf{s}_1, \dots, \mathbf{s}_n\}$ of one bag, positive label set $P$, and class embeddings;
output : loss $\mathcal{L}_{cost}$;
1 $\mathcal{L}_{cost} \leftarrow 0$;
2 for $r \in P$ do
3       Merge representation $\mathbf{b}_r$ by Eq. 5, 6, 7;
4       $\ell \leftarrow L^+(\mathbf{b}_r, r)$;
5       select the negative class $c^-$ by Eq. 21 and merge $\mathbf{b}_{c^-}$;
6       $\ell \leftarrow \ell + \big(\beta\,\mathbb{1}[c^- = \mathrm{NR}] + \mathbb{1}[c^- \neq \mathrm{NR}]\big)\, L^-(\mathbf{b}_{c^-}, c^-)$;
7       for $r' \in P \setminus \{r\}$ do
8             $\ell \leftarrow \ell + \eta\, L^+(\mathbf{b}_{r'}, r')$;
9            
10      $\mathcal{L}_{cost} \leftarrow \mathcal{L}_{cost} + \ell$;
11      
12 return $\mathcal{L}_{cost}$;
Algorithm 1 Ranking based Cost-sensitive Multi-label Learning
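For concreteness, the following NumPy sketch implements the cost-sensitive loss as reconstructed in Eq. 20 and Algorithm 1; it reuses `bag_att`, `L_pos`, `L_neg` and `score` from the earlier sketches, and the hyper-parameter values for β and η are illustrative.

def loss_variant3(S, R, positives, nr_index, beta=0.5, eta=0.1):
    """Ranking based cost-sensitive multi-label loss (reconstruction of Eq. 20).

    S         : (n, dim) sentence embeddings of one bag
    R         : (num_classes, dim) class-embedding matrix
    positives : indices of the positive labels of this bag
    nr_index  : index of the NR class
    """
    positives = list(positives)
    bags = {c: bag_att(S, R[c]) for c in range(len(R))}          # b_c for every class
    negatives = [c for c in range(len(R)) if c not in positives]
    c_neg = max(negatives, key=lambda c: score(bags[c], R[c]))   # Eq. 21
    nr_weight = beta if c_neg == nr_index else 1.0               # penalize the NR cost

    total = 0.0
    for r in positives:
        cost = L_pos(bags[r], R[r])                          # positive cost for r
        cost += nr_weight * L_neg(bags[c_neg], R[c_neg])     # (down-weighted) negative cost
        for r2 in positives:                                 # costs of the remaining positives
            if r2 != r:
                cost += eta * L_pos(bags[r2], R[r2])
        total += cost
    return total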

4 Experiments

In this section, we conduct two sets of experiments: the first compares our method with the baselines, and the second evaluates the components of our model. Unless otherwise stated, we adhere to the methods and settings described above when conducting the following experiments.

4.1 Dataset and Evaluation Criteria

Dataset.   We conduct our experiments on a widely used dataset developed by 2010 and used by 2011 ; 2012 ; zeng2015 ; lin2016 . The dataset aligns Freebase relation facts with the New York Times corpus, in which training mentions are drawn from the 2005-2006 corpus and test mentions from 2007. The training set contains 522,611 sentences, 281,270 entity pairs and 18,252 relation facts. The test set contains 172,448 sentences, 96,678 entity pairs and 1,950 relation facts. In all, there are 53 relation labels including the NR relation. Following 2009 , we adopt the held-out evaluation framework in all experiments. We use the whole training dataset to train our model and then test the trained model on the test dataset, comparing the predicted relations to the gold relations.

Evaluation Criteria.   To evaluate model performance, we draw precision/recall (P/R) curves and report precision@N (P@N). For the P/R curve, the larger the area under the curve, the better the model performance.
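To illustrate the held-out evaluation, here is a small NumPy sketch that computes precision/recall points and P@N from scored predictions; the synthetic scores and labels are only for demonstration.

import numpy as np

def pr_curve_and_p_at_n(scores, labels, top_ns=(100, 200, 300, 400, 500)):
    """Sort predicted relation facts by score, then compute the precision/recall
    points (for the P/R curve) and precision@N. `labels` marks whether each
    predicted fact matches the held-out knowledge base."""
    order = np.argsort(-np.asarray(scores))
    hits = np.asarray(labels, dtype=float)[order]
    cum_hits = np.cumsum(hits)
    precision = cum_hits / np.arange(1, len(hits) + 1)
    recall = cum_hits / max(hits.sum(), 1.0)
    p_at_n = {n: precision[n - 1] for n in top_ns if n <= len(hits)}
    return precision, recall, p_at_n

rng = np.random.default_rng(0)
scores = rng.random(1000)
labels = rng.integers(0, 2, size=1000)
_, _, p_at_n = pr_curve_and_p_at_n(scores, labels)
print(p_at_n)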

4.2 Experimental Settings

Word Embeddings.   We adopt the pre-trained word embeddings from lin2016 . Similar to lin2016 , we keep the words that appear more than a minimum number of times to construct the word dictionary and use “UNK” to represent the remaining ones.

Hyper-parameter Settings.   Three-fold validation on the training dataset is adopted to tune the parameters, following 2012 . The word embedding size, batch size, learning rate and convolution window size are selected from candidate sets on the validation data. We keep the other hyper-parameters the same as zeng2015 , namely the number of kernels, the position embedding size and the dropout rate. Table 2 shows the detailed parameter settings.

Parameter Name        Symbol   Value
Window size
Sentence emb. dim.
Word emb. dim.
Position emb. dim.
Batch size
Learning rate
Dropout prob.
Table 2: Hyper-parameter settings.

4.3 Comparisons with Baselines

Baseline.   We compare our model with the following baselines:

Mintz 2009 is the original model that incorporates distant supervision into relation extraction.

MultiR 2011 is a multi-instance learning based graphical model which aims to address the relation overlapping problem.

MIML 2012 is a multi-instance multi-label framework which jointly considers the wrong labelling problem and the overlapping problem.

PCNN+ATT lin2016 is the previous state-of-the-art model on the dataset of 2010 , which applies sentence-level attention to relieve the wrong labelling problem in DS based relation extraction. This model applies a piecewise convolutional neural network zeng2015 to model sentences.

Besides comparing with the above methods, we also compare our variant models, denoted Rank+AVE (using the loss function $\mathcal{L}_{AVE}$ of Variant-1), Rank+ATT (using $\mathcal{L}_{ATT}$ of Variant-2) and Rank+Cost (using $\mathcal{L}_{cost}$ of Variant-3).

Figure 3: Performance comparison of our model and the baselines. “Rank+Cost” uses the loss function of Variant-3, “Rank+ATT” uses that of Variant-2, and “Rank+AVE” uses that of Variant-1.

Results and Discussion.   We compare our three variants of loss functions with the baselines and the results are shown in Fig. 3. From the results we can see that:

  • Rank+AVE (Variant-1) lags behind PCNN+ATT; the reason may be that Rank+AVE does not use the attention mechanism to aggregate the information among the sentences, which brings much noise into the encoded sentence contexts;

  • After adopting the attention mechanism, Rank+ATT achieves much better performance compared to Rank+AVE, and is even better than PCNN+ATT;

  • Comparing PCNN+ATT and Rank+ATT, we can see that Rank+ATT is superior to PCNN+ATT, which comes from our strategy of modeling class ties in relation extraction;

  • Our variant Rank+Cost achieves the best performance among all methods; compared with Rank+ATT, this shows that our cost-sensitive learning method effectively relieves the negative effects from NR.

4.4 Impact of Class Ties

In this section, we conduct experiments to reveal the effectiveness of our model in learning class ties with the three variant loss functions mentioned above, and the impact of class ties on relation extraction. As mentioned above, we adopt two techniques to model class ties: multi-label learning with ranking based loss functions, and a regularization term to better model class ties. In the following, we conduct experiments to reveal these two aspects of modeling class ties. We adopt P/R curves and precision@N to show the model performance.

Figure 4: Results for the impact of the ranking based loss functions, with the methods Rank+AVE, Rank+ATT and Rank+Cost.

Ranking based Loss Function. The effectiveness of the ranking loss functions in learning class ties lies in the joint extraction of relations through multi-label learning, so to reveal the impact of the ranking loss functions on learning class ties, we compare joint extraction with separated extraction. The regularization term is added to all variant models. To conduct the separated extraction experiment, we split the labels of each entity tuple into single labels; for each relation label we select the sentences expressing that relation to construct the bag, and then we use the re-constructed dataset to train our model with our three variant loss functions.

P@N(%)        100    200    300    400    500    Ave.
R.+AVE+J.
R.+AVE+S.     80.2   74.9   72.2   67.8   64.0   71.8
R.+ATT+J.     86.8   80.6   78.4   75.2   71.1   78.4
R.+ATT+S.     82.7   75.3
R.+ExATT+J.   86.8   83.2   81.1   76.7   73.5   80.3
R.+ExATT+S.   76.3
Table 3: Precisions (%) for the top 100, 200, 300, 400, 500 predictions and their average, showing the impact of joint extraction and class ties.

Experimental results are shown in Fig. 4 and Table 3. From the results we can see that: (1) for Rank+ATT and Rank+Cost, joint extraction exhibits better performance than separated extraction, which demonstrates that class ties improve relation extraction and that the two methods are effective at learning class ties; (2) for Rank+AVE, surprisingly, joint extraction does not keep up with separated extraction. The second phenomenon may come from the strategy of the AVE method for aggregating sentences. For joint extraction, we combine all the sentences containing the same entity tuple; however, not all of these sentences express the same relation, since one part of the sentences may express one relation type and the rest another. Simply averaging the sentence representations hinders the model from learning the latent mapping from the sentences to the corresponding relation type, because the averaging operation introduces redundant information from unrelated sentences.

Figure 5: Results for the impact of regularization on modeling class ties.
P@N(%)            100    200    300    400    500    Ave.
R.+AVE+no-regu.   66.5   64.0
R.+AVE+regu.      79.1   73.8   70.4   70.5
R.+ATT+no-regu.
R.+ATT+regu.      86.8   80.6   78.4   75.2   71.1   78.4
R.+Cost+no-regu.
R.+Cost+regu.     86.8   83.2   81.1   76.7   73.5   80.3
Table 4: Precisions (%) for the top 100, 200, 300, 400, 500 predictions and their average, for the impact of regularization on modeling class ties.

Regularization. To see the impact of the regularization technique on modeling class ties, we compare the methods using regularization with those not using it. All variant models are in the joint extraction setting. The results are shown in Fig. 5 and Table 4. From the results, we can see that after regularizing the learning of relations, the model performance is further improved for Rank+Cost and Rank+ATT, which demonstrates the effectiveness of regularization for modeling class ties. We do not see much effect of regularization for Rank+AVE; the noise brought by averaging sentence embeddings may hinder the positive effects of regularization.

Figure 6: Results for the impact of cost-sensitive learning. “ATT” is the loss function of Variant-2; “ATT+NR” only considers the cost of NR controlled by β, ignoring the cost controlled by η, based on Variant-2; “ATT+P” considers the cost controlled by η based on Variant-2 while ignoring the NR cost; “ATT+NR+P” is the loss function of Variant-3 and jointly considers the two kinds of costs mentioned above.
P@N(%) 100 200 300 400 500 Ave.
ATT
ATT+NR
ATT+P
ATT+NR+P
Table 5: Precisions (%) for the top 100, 200, 300, 400, 500 predictions and their average, for the impact of cost-sensitive learning.

4.5 Impact of Cost-sensitive Learning

In this section, we conduct experiments to reveal the effectiveness of cost-sensitive learning in relieving the impact of NR on model training and model performance. For the loss function of Variant-3, we have two parts for cost-sensitive learning: the first is the cost from the remaining positive labels controlled by η, and the second is the NR cost penalized by β. Based on the loss function of Variant-3, we respectively remove the cost controlled by η and the NR cost controlled by β to see the impact of cost-sensitive learning. We adopt P/R curves and precision@N to show the model performance.

The results are shown in Fig. 6 and Table 5. From the results, we can see that considering the cost controlled by η slightly improves the performance in the low-recall range, while considering the NR cost controlled by β boosts the performance significantly. Considering both kinds of costs achieves the best performance. These results show that relieving the impact of NR is really important for improving extraction performance.

Figure 7: Effect of β on model performance, based on the loss function of Variant-3.

4.6 Impact of NR

From the discussion above, we know that NR can have a significant impact on model performance, so in this section we conduct more experiments to reveal the impact of the NR cost, controlled by β, on model performance.

Effect of Penalty.  We conduct experiments on the choice of β. Based on the loss function of Variant-3, we select β from a range of values to see how much effect NR has on the performance. We again adopt P/R curves and precision@N to show the model performance. Models are set with joint extraction and regularization. The results are shown in Fig. 7 and Table 6. From the results we find that as β becomes larger, the model performance decreases because NR has more negative impact on model performance; therefore, to achieve better model performance, the value of β should be set smaller.

P@N(%)   100   200   300   400   500   Ave.
β = …
β = …
β = …
β = …
Table 6: Precisions (%) for the top 100, 200, 300, 400, 500 predictions and their average, for the impact of the penalty β.
Figure 8: Impact of NR on model convergence. “+NR” means the NR impact is not relieved (a large value of β); “-NR” is the opposite (a small value of β). ExATT is based on the loss function of Variant-3.

Effect of NR on Model Convergence.  We further evaluate the impact of NR on the convergence behavior of our model during training. With the three variant loss functions, in each iteration we record the maximal value of the F-measure to represent the model performance at the current epoch. Models are set with joint extraction but without regularization. Model parameters are trained for a fixed number of epochs, and the convergence curves are shown in Fig. 8. From the results, we find that “+NR” converges more quickly than “-NR” and arrives at its final score earlier. In general, “-NR” converges more smoothly and achieves better performance than “+NR” in the end.

5 Conclusion and Future Works

In this work, we propose a ranking based cost-sensitive multi-label learning method for distant supervision relation extraction, aiming to leverage class ties to enhance relation extraction and to relieve the class imbalance problem. To exploit class ties between relations, we propose a general ranking based multi-label learning framework combined with convolutional neural networks, in which ranking based loss functions with a regularization technique are introduced to learn the latent connections between relations. Furthermore, to deal with the class imbalance problem in distant supervision relation extraction, we adopt cost-sensitive learning to rescale the costs from the positive and negative labels. In the experimental study, we analyze the effectiveness of our novel cost-sensitive ranking loss functions and also evaluate the effectiveness of the regularization.

In the future, we will focus on the following aspects: (1) Our method in this paper considers pairwise interactions between labels, so to better exploit class ties, we will extend our method to exploit all other labels’ influences on each relation, moving from second-order to high-order label correlations zhang2014review ; (2) We will regard the task of distant supervision relation extraction as a multi-instance learning-to-rank problem, design algorithms from the learning-to-rank perspective and combine other advanced techniques from the information retrieval field; (3) What effect do entity pairs have on relation extraction performance? Can we use a general entity pair replacement to represent all entity pairs? Answering these two questions may help the transfer learning of RE systems.

Acknowledgment

This work was supported by the National High-tech Research and Development Program (863 Program) (No. 2014AA015105) and National Natural Science Foundation of China (No. 61602490).

References

  • (1) M. Mintz, S. Bills, R. Snow, D. Jurafsky, Distant supervision for relation extraction without labeled data, in: Proceedings of ACL-IJCNLP, 2009.
  • (2) K. D. Bollacker, C. Evans, P. Paritosh, T. Sturge, J. Taylor, Freebase: a collaboratively created graph database for structuring human knowledge, in: Proceedings of KDD, 2008, pp. 1247–1250.
  • (3) R. Hoffmann, C. Zhang, X. Ling, L. Zettlemoyer, D. S. Weld, Knowledge-based weak supervision for information extraction of overlapping relations, in: Proceedings of ACL-HLT, 2011.
  • (4) M. Surdeanu, J. Tibshirani, R. Nallapati, C. D. Manning, Multi-instance multi-label learning for relation extraction, in: Proceedings of EMNLP, 2012.
  • (5) J. Fürnkranz, E. Hüllermeier, E. L. Mencía, K. Brinker, Multilabel classification via calibrated label ranking, Machine Learning 73 (2) (2008) 133–153.
  • (6) M.-L. Zhang, Z.-H. Zhou, Multilabel neural networks with applications to functional genomics and text categorization, IEEE transactions on Knowledge and Data Engineering 18 (10) (2006) 1338–1351.
  • (7) Z.-H. Zhou, M.-L. Zhang, S.-J. Huang, Y.-F. Li, Multi-instance multi-label learning, Artificial Intelligence 176 (1) (2012) 2291–2320.
  • (8) T. Evgeniou, C. A. Micchelli, M. Pontil, Learning multiple tasks with kernel methods, Journal of Machine Learning Research 6 (Apr) (2005) 615–637.
  • (9) N. Japkowicz, S. Stephen, The class imbalance problem: A systematic study, Intelligent data analysis 6 (5) (2002) 429–449.
  • (10) H. Zheng, Z. Li, S. Wang, Z. Yan, J. Zhou, Aggregating inter-sentence information to enhance relation extraction, in: Thirtieth AAAI Conference on Artificial Intelligence, 2016.
  • (11) Y. Lin, S. Shen, Z. Liu, H. Luan, M. Sun, Neural relation extraction with selective attention over instances, in: Proceedings of ACL, 2016.
  • (12) Y. LeCun, Y. Bengio, G. Hinton, Deep learning, Nature 521 (7553) (2015) 436–444.
  • (13) C. N. d. Santos, B. Xiang, B. Zhou, Classifying relations by ranking with convolutional neural networks, in: Proceedings of ACL, 2015.
  • (14) S. Riedel, L. Yao, A. McCallum, Modeling relations and their mentions without labeled text, in: Proceedings of ECML-PKDD, Springer, 2010, pp. 148–163.
  • (15) X. Han, L. Sun, Global distant supervision for relation extraction, in: Proceedings of AAAI, 2016.
  • (16) D. Zeng, K. Liu, S. Lai, G. Zhou, J. Zhao, et al., Relation classification via convolutional deep neural network, in: Proceedings of COLING, 2014.
  • (17) M. Yu, M. R. Gormley, M. Dredze, Factor-based compositional embedding models, in: NIPS Workshop on Learning Semantics, 2014.
  • (18) H. Ye, Z. Yan, Z. Luo, W. Chao, Dependency-tree based convolutional neural networks for aspect term extraction, in: Advances in Knowledge Discovery and Data Mining - 21st Pacific-Asia Conference, PAKDD 2017, Jeju, South Korea, May 23-26, 2017, Proceedings, Part II, 2017.
  • (19) H. Ye, L. Wang, Semi-supervised learning for neural keyphrase generation, in: Proceedings of Empirical Methods in Natural Language Processing, 2018.
  • (20) H. Ye, X. Jiang, Z. Luo, W. Chao, Interpretable charge predictions for criminal cases: Learning to generate court views from fact descriptions, CoRR abs/1802.08504.
  • (21) X. Jiang, H. Ye, Z. Luo, W. Chao, W. Ma, Interpretable rationale augmented charge prediction system, in: The 27th International Conference on Computational Linguistics: System Demonstrations, 2018.
  • (22) D. Zeng, K. Liu, Y. Chen, J. Zhao, Distant supervision for relation extraction via piecewise convolutional neural networks, in: Proceedings of EMNLP, 2015.
  • (23) T. Luong, H. Pham, C. D. Manning, Effective approaches to attention-based neural machine translation, in: Proceedings of EMNLP, 2015.
  • (24) D. Bahdanau, K. Cho, Y. Bengio, Neural machine translation by jointly learning to align and translate, arXiv preprint arXiv:1409.0473.
  • (25) Y. Lin, Z. Liu, M. Sun, Neural relation extraction with multi-lingual attention, in: Proceedings of Association for Computational Linguistics, 2017.
  • (26) W. Zeng, Y. Lin, Z. Liu, M. Sun, Incorporating relation paths in neural relation extraction, arXiv preprint arXiv:1609.07479.
  • (27) G. Ji, K. Liu, S. He, J. Zhao, Distant supervision for relation extraction with sentence-level attention and entity descriptions, in: AAAI, 2017, pp. 3060–3066.
  • (28) L. Chen, Y. Feng, S. Huang, B. Luo, D. Zhao, Encoding implicit relation requirements for relation extraction: A joint inference approach, Artificial Intelligence 265 (2018) 45–66.
  • (29) H. Ye, W. Li, L. Wang, Jointly learning semantic parser and natural language generator via dual information maximization, CoRR abs/1906.00575.
  • (30) B. Luo, Y. Feng, Z. Wang, Z. Zhu, S. Huang, R. Yan, D. Zhao, Learning with noise: Enhance distantly supervised relation extraction with dynamic transition matrix, in: Proceedings of Association for Computational Linguistics, 2017.
  • (31) P. Qin, W. Xu, W. Y. Wang, Dsgan: Generative adversarial training for distant supervision relation extraction, arXiv preprint arXiv:1805.09929.
  • (32) X. Han, Z. Liu, M. Sun, Denoising distant supervision for relation extraction via instance-level adversarial training, arXiv preprint arXiv:1805.10959.
  • (33) I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, Y. Bengio, Generative adversarial nets, in: Advances in neural information processing systems, 2014.
  • (34) J. Feng, M. Huang, L. Zhao, Y. Yang, X. Zhu, Reinforcement learning for relation classification from noisy data, in: Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, (AAAI-18), the 30th innovative Applications of Artificial Intelligence (IAAI-18), and the 8th AAAI Symposium on Educational Advances in Artificial Intelligence (EAAI-18), New Orleans, Louisiana, USA, February 2-7, 2018, 2018.
  • (35) P. Qin, W. Xu, W. Y. Wang, Robust distant supervision relation extraction via deep reinforcement learning, in: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, ACL 2018, Melbourne, Australia, July 15-20, 2018, Volume 1: Long Papers, 2018.
  • (36) T. Liu, K. Wang, B. Chang, Z. Sui, A soft-label method for noise-tolerant distantly supervised relation extraction, in: Proceedings of Empirical Methods in Natural Language Processing, 2017.
  • (37) T. Liu, X. Zhang, W. Zhou, W. Jia, Neural relation extraction via inner-sentence noise reduction and transfer learning, in: Proceedings of Empirical Methods in Natural Language Processing, 2018.
  • (38) T.-Y. Liu, Learning to rank for information retrieval, Foundations and Trends in Information Retrieval 3 (3) (2009) 225–331.
  • (39) F. Zhao, Y. Huang, L. Wang, T. Tan, Deep semantic ranking based hashing for multi-label image retrieval, in: Proceedings of CVPR, 2015.
  • (40) A. Severyn, A. Moschitti, Learning to rank short text pairs with convolutional deep neural networks, in: Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval, ACM, 2015, pp. 373–382.
  • (41) W. Shen, X. Wang, Y. Wang, X. Bai, Z. Zhang, Deepcontour: A deep convolutional feature learned by positive-sharing loss for contour detection, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015.
  • (42) S. H. Khan, M. Bennamoun, F. Sohel, R. Togneri, Cost sensitive learning of deep feature representations from imbalanced data, arXiv preprint arXiv:1508.03422.
  • (43) C. Huang, Y. Li, C. Change Loy, X. Tang, Learning deep representation for imbalanced classification, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016.
  • (44) H. He, E. A. Garcia, Learning from imbalanced data, IEEE Trans. Knowl. Data Eng. 21 (9) (2009) 1263–1284.
  • (45) H. Ye, W. Chao, Z. Luo, Z. Li, Jointly extracting relations with class ties via effective deep ranking, in: Proceedings of Association for Computational Linguistics, 2017.
  • (46) L. Wang, Z. Cao, G. de Melo, Z. Liu, Relation classification via multi-level attention cnns, in: Proceedings of ACL, Volume 1: Long Papers, 2016.
  • (47) J. Weston, S. Bengio, N. Usunier, WSABIE: scaling up to large vocabulary image annotation, in: Proceedings of IJCAI, 2011.
  • (48) M.-L. Zhang, Z.-H. Zhou, A review on multi-label learning algorithms, IEEE transactions on knowledge and data engineering 26 (8) (2014) 1819–1837.