Relation Extraction (RE) aims to extract the relations of entities from sentences, which can automate the construction of Knowledge Bases (KBs) and benefit downstream applications such as question answering and web search. Due to the difficulty of collecting a large amount of sentence-level annotations, most recent RE methods are based on the Distant Supervision (DS) framework, which can automatically annotate adequate amounts of data. Under the DS framework, RE can be cast as the problem of classifying a bag of sentences that contain the same query entity pair into predefined relation classes.
This paper studies the DSRE problem by re-examining the relation definitions in the existing methods. On one hand, the class definitions may not be fine-grained enough, since the style in which a sentence expresses an entity relation may vary across query entity pairs. For example, the two sentences in Figure 1 both express the same relation. However, the keywords that convey this relation are quite different due to the difference in query entity types, i.e., (Person, Organization) vs. (Person, News agency). It seems that we need to further consider the style shift of each relation class with respect to the entity types. On the other hand, the class definitions may be too fine-grained, since many classes are semantically related and their samples may be similar in the feature space. For example, the relation classes "/business/person/company" and "/business/company/founders" both express that a person is a member of a company. We expect that, based on the class mutual relationship, the network learned from one class can be adapted to enhance other classes. That especially benefits the long-tail problem.
To unify those two arguments, we propose to use a dynamic neural network which contains parameters (i.e., attention and classifier) that can be dynamically determined by the query entity types and class mutual relationship. By doing so, we can make our prediction model adaptive to the query entity types which can naturally deal with the style shift problem. Also, the class mutual relationship is incorporated for determining the network parameters. Therefore, the learning process of class-dependent parameters will take into account the semantic similarity between relation classes. This mechanism can be particularly helpful for the long-tail problem.
Specifically, to realize this dynamic characteristic and generate the parameters of our model, we develop a dynamic parameter generation module. This module generates network parameters in two steps: first, the entity types and relation lexical definitions are utilized to generate dynamic class representations; then, the mutual relationship of classes is further considered for transferring semantically similar information between the dynamic class representations, and the final dynamic parameters for the attention and classifier are output. Note that the mutual relationship of classes is characterized by a predefined semantic similarity between classes (in this paper, we design several human-specified rules to define this similarity) and can be represented as a graph defined through an affinity matrix. The generator utilizes it in a Graph Convolutional Network (GCN).
In our design, the adaptation of parameters from the entity type information accounts for the style shift problem. Meanwhile, the adaptation from the relation lexical definitions and the affinity matrix enhances the network for addressing the long-tail problem. The dynamic parameter generator unifies these two arguments, which are complementary to each other.
We conduct experiments on a widely used large-scale DSRE benchmark dataset, and the experimental results demonstrate the superior performance of the proposed method. It is validated that the dynamic network design is beneficial for handling both the style shift and long-tail problems in DSRE. In summary, the main contributions of this work are as follows:
To the best of our knowledge, we are the first to utilize the class relationship with entity types, as well as the mutual relationship of classes, to improve the performance of DSRE.
We propose a novel dynamic parameter generator to build a dynamic neural network whose parameters are determined by the query entity types, relation lexical definitions, and the mutual relationship of classes.
Our experiments on a widely used benchmark show that our method achieves new state-of-the-art performance.
2 Related Works
2.1 Hand-crafted Feature Based Methods
In the early years, most DSRE methods were based on hand-crafted features [13, 14, 6], e.g., POS tags, named entity tags, and dependency paths. Early work assumes that all sentences containing the same entity pair express the same relation. However, this assumption does not always hold. To relax it, later work assumes that if two entities hold a relation, at least one sentence mentioning these entities expresses that relation, and employs the multi-instance learning (MIL) paradigm to support this assumption. Later, since different relational triplets may overlap in a sentence, [6, 16] apply the multi-instance multi-label paradigm to handle this problem. However, hand-crafted features are not sufficiently robust, which leads to the error propagation problem.
2.2 Deep-feature Based Methods
Recently, researchers have turned to deep learning for DSRE due to its promising performance and generalization ability in various NLP applications. Many methods [23, 10, 7] follow the MIL paradigm, aiming to denoise the data generated by DS. One line of work selects the single sentence in a bag that best expresses the relation between the entity pair. However, this omits information in the other sentences, which can also be useful for expressing the relation. To solve this problem, the attention mechanism and its variants [2, 5, 22] are introduced to capture the useful information in the other sentences. Other learning strategies, such as adversarial training, capsule networks, and reinforcement learning [3, 17], are also applied to DSRE to further improve its performance.
2.3 Methods Incorporating External Information
Recently, other useful external information has been identified as beneficial for DSRE, e.g., KB information. Some works utilize entity descriptions, which provide rich background information about entities and help recognize relations in DSRE. Others use a set of side information, e.g., entity types and relation aliases, to boost DSRE performance; leverage corpus-based and KG-based information with logic rules at the entity type level; or propose a coarse-to-fine grained attention scheme based on hierarchical relation structures in the KB, later extended to a knowledge-aware attention scheme using Knowledge Graph Embedding (KGE). Besides, distant supervision data has been combined with additional directly-supervised data to train a model for identifying valid sentences.
However, all the above works ignore the style shift problem, whereas DNNRE uses the entity type information to address it and further improve DSRE performance. Besides, instead of the hierarchical relation structure defined in the KB [5, 24], we handle the long-tail problem with the graph defined by the affinity matrix (i.e., the human-specified class relationship), since semantically similar information can be directly transferred between two relation nodes. Note that there are also previous works using entity types in their models. However, they are quite different from ours: we utilize entity types to dynamically generate the parameters of our model to address the style shift problem, whereas previous works just use entity types as input features.
The primary idea of the proposed method is to build a network with DYNAMIC weights; that is, parts of the network parameters will be dynamically generated from the combination of the entity types, relation lexical definitions, and class relationship matrix. This is in contrast to traditional methods, which use STATIC models whose parameters are fixed during testing. Formally, the class-dependent parameters of the proposed network are dynamically generated by the following function:
θ = G(t, D, A),
where t (i.e., the head and tail entity types) denotes the query entity types, D is the lexical definitions of the candidate relation classes, and A is the predefined relationship between relation classes. The function G is called the dynamic parameter generator, which transforms t, D, and A into the network parameters θ.
Since t is a variable of the query entity types, the generated network parameters are adapted online at the test stage, which offers a solution for compensating the style shift. D and A are the other two factors for determining θ. Introducing them into the parameter generator enables the network to leverage prior knowledge about the class mutual relationship, which is particularly helpful for handling the long-tail problem.
3.1 Overall Architecture
The overall architecture of DNNRE is illustrated in Figure 2. In the top right, the Sentence Encoder encodes a bag of sentences into sentence representations. Meanwhile, the External Information Acquisition of t, D, and A in the bottom left is executed: the Type Mapping is queried with the entity pair to obtain the entity types, and the Human-specified Class Relationship offers prior knowledge to obtain the affinity matrix. This information is then utilized by the Dynamic Parameter Generator (bottom) to generate the dynamic parameters. Finally, the dynamic parameters build the Dynamic Neural Network (i.e., the dynamic attention and classifier) in the top right. The Dynamic Attention aggregates the sentence representations into a bag representation, which is fed into the Dynamic Classifier to predict the corresponding relation class.
The remainder of this section is organized as follows:
First, the Sentence Encoder is briefly introduced in subsection 3.2.
Then, the External Information Acquisition (i.e., of t, D, and A) is described in detail in subsection 3.3.
The Dynamic Parameter Generator is elaborated in subsection 3.4.
Finally, the Dynamic Neural Network is introduced in subsection 3.5.
3.2 Sentence Encoder
Each sentence in a bag is encoded into a fixed-length vector by a sentence encoder. In this work, we use PCNN to fulfill this task.
Specifically, we represent each word in a sentence by a word embedding and a position embedding. The word embedding represents each word token by a pre-trained embedding vector, trained on the NYT corpus with the word2vec tool (https://code.google.com/p/word2vec/). Two fixed-dimension vectors are used as position embeddings to represent the relative positions between the word and the two entities of the query pair. The position embeddings are concatenated to the word embedding to form the word representation.
The word representations (w_1, ..., w_L) are fed into the encoding layer, where convolution kernels slide over the input to capture features in each n-gram:

c_{k,i} = W_k · w_{i:i+n-1} + b_k,

where w_{i:j} denotes the concatenation of the word representations from index i to j. Afterwards, we obtain the feature maps c_1, ..., c_K, one per convolution kernel.
After this convolution operation, piecewise max-pooling is adopted to aggregate word-level information. Suppose each feature map c_k is split into three segments (c_k^(1), c_k^(2), c_k^(3)) by the two entity positions; the pooling is then

p_k^(m) = max(c_k^(m)),  m = 1, 2, 3.

We thus obtain p ∈ R^{3K}, which is flattened into a vector and translated into the sentence embedding by a non-linear layer.
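The encoding steps above can be sketched as follows. This is a minimal illustration of the convolution plus piecewise max-pooling idea, not the paper's implementation; all names and dimensions are our own assumptions.

```python
import numpy as np

def pcnn_encode(word_reprs, e1_pos, e2_pos, kernels, n=3):
    """Hypothetical sketch of the PCNN encoder: 1-D convolution over
    n-grams followed by piecewise max-pooling at the entity positions.

    word_reprs : (L, d) word + position embeddings of one sentence.
    e1_pos, e2_pos : indices of the two entities (e1_pos < e2_pos assumed).
    kernels : list of flattened (n * d,) convolution weight vectors.
    Returns a sentence embedding of length 3 * len(kernels).
    """
    L, _ = word_reprs.shape
    feats = []
    for w in kernels:
        # convolution: score each n-gram window of the sentence
        c = np.array([w @ word_reprs[i:i + n].reshape(-1)
                      for i in range(L - n + 1)])
        # split the feature map into three segments by the entity positions
        segments = (c[:e1_pos + 1], c[e1_pos + 1:e2_pos + 1], c[e2_pos + 1:])
        # piecewise max-pooling: keep one maximum per segment
        feats.extend(s.max() if s.size else 0.0 for s in segments)
    # non-linear layer producing the final sentence embedding (simplified)
    return np.tanh(np.array(feats))
```

The piecewise split lets the pooled features distinguish the context before, between, and after the two entities, which a single global max-pool would collapse.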
3.3 External Information Acquisition
In our design, the generated parameters of the attention and classifier are dynamically determined by t, D, and A. Their representations are obtained as follows:
t: Entity Type Information. The type information of entities has proved useful for the DSRE task [12, 18] as additional input features. Unlike those existing works, we use the entity types to dynamically determine the network parameters. The entity types are extracted from the KB and further mapped to a predefined set of fine-grained types. We create an embedding vector for each entity type. Note that in practice one entity may correspond to multiple entity types; in such a case, we use the average of the corresponding entity type embedding vectors to represent it.
D: Relation Lexical Definition. Each relation is defined lexically in a 3-level hierarchical structure, so we split each relation into a 3-element tuple to better capture the similarity between relations. We represent each element of the i-th relation tuple as an embedding vector, i.e., d_i = (d_i^(1), d_i^(2), d_i^(3)). For example, given two relations:
“/location/china_province/capital” is split to (“/location”, “/china_province”, “/capital”),
“/location/fr_region/capital” is split to (“/location”, “/fr_region”, “/capital”).
Their similarity can be identified from the fact that both express the capital of a country region. What is more, since different relations may share the same elements, e.g., "/location" and "/capital" in the above relations, those embedding vectors can be trained from instances of different relation classes. Compared with the scheme of representing each relation by a single embedding vector, the embeddings in our design can be trained across classes and generalize better to long-tail relations.
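The element-sharing idea can be sketched as follows. Function names and dimensions are illustrative assumptions, not the paper's code; the point is only that relations sharing a path element reuse the same embedding vector.

```python
import numpy as np

def relation_tuple_embeddings(relations, dim=8, seed=0):
    """Illustrative sketch: split each 3-level relation path into a
    3-element tuple and embed each element.  Elements shared across
    relations reuse one embedding vector, so long-tail classes are
    trained jointly with related head classes."""
    rng = np.random.default_rng(seed)
    table = {}                                # element -> shared embedding

    def emb(element):
        if element not in table:
            table[element] = rng.normal(size=dim)
        return table[element]

    tuples = {}
    for rel in relations:
        parts = rel.strip("/").split("/")     # e.g. ["location", ..., "capital"]
        tuples[rel] = [emb(p) for p in parts]
    return tuples, table
```

For the two capital relations above, the first and third tuple elements map to the same vectors, so gradient updates from either class refine both.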
A: Relation Affinity Matrix. Besides inferring inter-class relationships from the lexical definitions of classes, it is also possible to obtain the class relationship directly from prior knowledge. To incorporate such prior knowledge, we use the following rules to define an affinity matrix for the relation classes (examples of the affinity matrix construction are shown in the Appendix):
Let A denote the affinity matrix, and let i, j denote two of the relation classes.
If i is a special case of j, or i is a concept at a lower level than j, then A_{ij} is set to a high affinity value.
If i and j share some similar properties in terms of space or time, then A_{ij} is set to a positive affinity value.
A_{ii} = 1 for every class i, and A_{ij} = 0 for all other pairs.
Note that the affinity matrix defines a directed graph, i.e., A_{ij} may not equal A_{ji}.
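A rule-based construction of this kind might look as follows. The edge weights `w_hier` and `w_sim` are our own illustrative assumptions; the paper specifies the rules but we do not reproduce its exact values here.

```python
import numpy as np

def build_affinity(classes, special_case_of, similar_pairs,
                   w_hier=1.0, w_sim=0.5):
    """Sketch of a rule-based affinity matrix A (a directed graph).
    special_case_of : (i, j) pairs where class i is a special case or
                      lower-level concept of class j.
    similar_pairs   : (i, j) pairs sharing spatial/temporal properties."""
    idx = {c: k for k, c in enumerate(classes)}
    A = np.eye(len(classes))                   # self-affinity per class
    for i, j in special_case_of:
        A[idx[i], idx[j]] = w_hier             # directed: A[i,j] may != A[j,i]
    for i, j in similar_pairs:
        A[idx[i], idx[j]] = A[idx[j], idx[i]] = w_sim
    return A
```

The hierarchy rule produces asymmetric entries, which is why A is a directed graph rather than a symmetric similarity matrix.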
3.4 Dynamic Parameter Generator
In the following, we elaborate the implementation of the dynamic parameter generator G. In our design, it consists of two parts and achieves parameter generation in two stages.
Stage 1: Generate dynamic class representation
The first stage converts t and D into a set of h-dimensional embedding vectors, with each vector corresponding to one relation class. Because t changes dynamically with each query entity pair, the resulting representation is not constant after training.
The conversion in this step is achieved by fully-connected layers, and the resulting representation consists of three terms:

r_i = s_i + F_t(t) + F_d(d_i),

where s_i is a static parameter for the i-th class that encodes class-specific information. The second term, F_t(t), is a dynamic component generated from the information of the head and tail entity types; the third term, F_d(d_i), is a dynamic component generated from the information of the relation lexical definition.
Since an entity may belong to multiple entity types, we represent the head and tail entities as sets of entity type embeddings, T_h and T_t. The mapping function is then realized by

F_t(t) = MLP_t([avg(T_h); avg(T_t)]),

where avg denotes the averaging operation on the entity type embeddings and MLP_t denotes the fully-connected layers. In our design, we use a two-layer fully-connected module.
We also use a two-layer fully-connected module, MLP_d, for generating the mapping from d_i, which consists of the three element embeddings (d_i^(1), d_i^(2), d_i^(3)). The mapping function is defined as

F_d(d_i) = MLP_d([d_i^(1); d_i^(2); d_i^(3)]).
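Stage 1 can be sketched as below. The symbol names (s, F_t, F_d) follow our notation for the three terms above; the tanh activations and all dimensions are assumptions for illustration.

```python
import numpy as np

def mlp2(x, W1, b1, W2, b2):
    """Two-layer fully-connected module (tanh activations are our choice)."""
    return np.tanh(np.tanh(x @ W1 + b1) @ W2 + b2)

def dynamic_class_repr(s, head_type_embs, tail_type_embs, rel_tuples, Pt, Pd):
    """Sketch of stage 1: r_i = s_i + F_t(t) + F_d(d_i).
    s                   : (C, h) static per-class parameters.
    head/tail_type_embs : lists of type embeddings, averaged per entity.
    rel_tuples          : (C, 3, e) element embeddings of each relation tuple.
    Pt, Pd              : (W1, b1, W2, b2) weights of the two MLPs."""
    t = np.concatenate([np.mean(head_type_embs, axis=0),
                        np.mean(tail_type_embs, axis=0)])
    f_t = mlp2(t, *Pt)                            # shared across all classes
    f_d = np.stack([mlp2(d.reshape(-1), *Pd) for d in rel_tuples])
    return s + f_t + f_d                          # (C, h)
```

Note that F_t(t) is computed once per bag and broadcast over all classes, while s_i and F_d(d_i) differ per class.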
Stage 2: Generate parameters with the GCN
The key idea of the GCN is to enable information on each node of a graph to flow to the other nodes. Formally, given an affinity matrix A and feature representations H^(l) as the input of a graph convolution layer, the output of the layer is H^(l+1). The convolution operation on the graph is defined as follows:

H^(l+1) = σ(Â H^(l) W^(l)),

where σ is an activation function, Â denotes a Laplace normalization of A, and W^(l) is a translation matrix whose parameters are learned at training time.
In our application, we use the results of stage 1 as the input of the GCN: H^(0) = [r_1; ...; r_C], where C denotes the number of relation classes.
The last-layer output of the GCN, H^(L), is the final output of the dynamic parameter generator and is utilized as the network parameters for the attention and the classifier. In our work, we use two dynamic parameter generators: one for the attention and one for the classifier. We denote their outputs as θ_att ∈ R^{C×h} and θ_cls ∈ R^{C×h}, where h is the dimension of the parameters for each class.
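Stage 2 can be sketched as follows. The symmetric normalization and the ReLU between layers are common GCN choices that we assume here; the paper does not spell out these details.

```python
import numpy as np

def normalize_adj(A):
    """One common Laplace normalization, D^{-1/2} A D^{-1/2} (an assumption)."""
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(A.sum(axis=1), 1e-12))
    return A * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def gcn(H, A, weights):
    """Sketch of stage 2: stacked graph convolutions
    H^(l+1) = relu(Ahat @ H^(l) @ W^(l)) over the class graph, with no
    activation on the last layer so the output can serve directly as
    attention/classifier parameters."""
    Ahat = normalize_adj(A)
    for l, W in enumerate(weights):
        H = Ahat @ H @ W
        if l < len(weights) - 1:
            H = np.maximum(H, 0.0)       # ReLU between layers
    return H
```

Each graph convolution mixes the dynamic class representations of affine classes, which is how information flows from head classes to long-tail ones.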
3.5 Dynamic Neural Network
After the sentences are encoded into vector representations, the next operation is to aggregate them into a bag representation via the attention mechanism. Finally, the bag representation is fed into a classifier. The attention and the classifier both measure the similarity between features and relations, at the sentence and bag level, respectively. In that sense, the dynamic parameter generator can enhance both of them; the resulting modules are introduced as the dynamic attention and the dynamic classifier in the following parts.
Given the m sentences in a bag, with corresponding features x_1, ..., x_m extracted by PCNN, it is common practice to use the attention mechanism to generate weights that selectively attend to the most relevant sentences. The sentence features are then aggregated into a fixed-length vector representation for the bag.
In our work, the attention parameters are generated by the dynamic parameter generator, and the attention weights are calculated as follows:

α_{k,i} = softmax_i(θ_att[k] · x_i),  b_k = Σ_i α_{k,i} x_i,

where θ_att[k] denotes the dynamic attention parameters for the k-th class (i.e., taking out the k-th row of θ_att) and x_i is the i-th sentence feature. Note that we run the dynamic attention C times to obtain C aggregation results, i.e., b_1, ..., b_C.
Each aggregation result is classified by its corresponding classifier. In other words, the decision value for the k-th class is

o_k = θ_cls[k] · b_k + c_k,

where θ_cls[k] denotes the dynamic classifier parameters for the k-th class and c_k is a bias term. Note that at the test stage we do not know the ground-truth relation category, so we run the dynamic attention and the dynamic classifier C times, with a hypothesized class k each time. Each run produces a posterior probability for the k-th class, and these results are used for prediction and evaluation. The same operation has also been used in previous work.
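The per-class inference loop can be sketched as follows, using our notation θ_att/θ_cls for the generator outputs; normalizing the C decision values with a softmax is our simplifying assumption.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def predict_bag(X, theta_att, theta_cls, bias):
    """Sketch of dynamic attention + classifier at test time.  For each
    hypothesised class k, row k of the attention aggregates the bag, and
    row k of the classifier scores the aggregated representation.
    X : (m, h) sentence features of one bag.
    theta_att, theta_cls : (C, h) dynamic parameters, one row per class."""
    C = theta_att.shape[0]
    scores = np.empty(C)
    for k in range(C):
        alpha = softmax(X @ theta_att[k])    # attention weights over sentences
        b_k = alpha @ X                      # bag representation under class k
        scores[k] = theta_cls[k] @ b_k + bias[k]
    return softmax(scores)                   # posterior over relation classes
```

Each class thus gets its own view of the bag before being scored, rather than all classes sharing one aggregated representation.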
4 Experimental Results
In this section, we first describe the dataset and evaluation criteria. Second, we present our hyper-parameter choices. Then, we compare our results against existing DSRE methods. Finally, we conduct an ablation study and a case study to demonstrate the effectiveness of DNNRE.
4.1 Dataset
We evaluate our method on the widely used NYT dataset, which is generated by aligning Freebase relation facts with the New York Times corpus. The entities in the sentences are recognized by the Stanford Named Entity Tagger and matched to the corresponding Freebase entities. The NYT dataset has been widely used as a benchmark in the existing literature [6, 16, 10].
4.2 Evaluation Criteria
Following existing works [13, 10], we use a held-out evaluation to assess the models: the predicted relation classes are compared with the ground truth. The Precision-Recall (PR) curves and the top-N precision (P@N) are reported for analysis. Moreover, to further evaluate our method on long-tail relations, we follow [5, 24] and apply the Hits@K metric. In addition, the ablation study uses AUC for quantitative analysis.
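For reference, the P@N metric used throughout the held-out evaluation can be computed as below; the input format is our assumption.

```python
def precision_at_n(predictions, n):
    """Sketch of top-N precision (P@N): precision over the n predictions
    the model is most confident about.
    predictions : list of (confidence, is_correct) pairs, one per
                  predicted relation fact."""
    top = sorted(predictions, key=lambda p: p[0], reverse=True)[:n]
    return sum(correct for _, correct in top) / n
```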
4.3 Hyper-parameter Settings
We use the same hyper-parameter settings as PCNN. The dimensions of the entity type and relation tuple element embeddings are both set to 50, and the number of GCN layers is set to 2. The cross-entropy loss function is applied to train our model, with the Adadelta optimizer and its default parameters. Moreover, the dropout strategy is used at the classification layer, and L2-regularization is also applied to prevent over-fitting.
4.4 Overall Evaluation Results
To evaluate the performance of DNNRE, we compare it against several existing hand-crafted feature based and deep-feature based methods, which are as follows:
Mintz is a traditional feature-based DSRE model.
MultiR  is a graphical model for multi-instance learning.
MIML applies the multi-instance multi-label paradigm to DSRE.
PCNN+ATT  uses attention to aggregate sentence embeddings to a bag-level embedding.
RESIDE  utilizes side information to boost the DSRE performance.
PCNNs+WN  applies linear attenuation simulation and non-IID relevance embedding.
From the PR curves in Figure 3, it can be observed that DNNRE achieves superior performance compared with the state of the art. The precision of DNNRE is higher than that of the other methods at almost all recall values; in particular, when the recall is below 0.10 or ranges from 0.15 to 0.35, there is an obvious margin between DNNRE and the other methods. Cross-referencing the P@N results in Table 1, it is clear that our method achieves a significant improvement over the compared methods. Notably, our method attains a 6.4% average improvement over RESIDE, a recent method that also uses side information to assist prediction.
The performance of DNNRE indicates that the dynamic network design can take advantage of the class relationship with entity types and of the prior class relationship: it dynamically adapts its parameters to represent the relations more accurately. A case study evaluating the effectiveness of DNNRE for the style shift problem caused by keyword variation is reported in subsection 4.7.
4.5 Evaluation for Long-tail Relations
We also evaluate the performance of DNNRE on long-tail relations by following the protocol of [5, 24]: (1) a subset of the test dataset is selected in which every relation has fewer than 100/200 training instances; (2) Hits@K is used as the evaluation metric, which measures the likelihood that the true relation falls within the first K candidate relations recommended by the model.
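The Hits@K metric described above can be computed as follows; the input format is our assumption.

```python
def hits_at_k(ranked_candidates, gold_relations, k):
    """Sketch of Hits@K: the fraction of test facts whose true relation
    appears among the model's top-k ranked candidate relations."""
    hits = sum(1 for cands, gold in zip(ranked_candidates, gold_relations)
               if gold in cands[:k])
    return hits / len(gold_relations)
```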
In Table 2, we observe that our method outperforms PCNN+ATT, PCNN+HATT, and PCNN+KATT on all the Hits@K metrics, in most cases with at least a 10% absolute improvement. This demonstrates that incorporating the relation lexical definitions and the prior mutual-class relationship into the dynamic neural network can substantially boost the performance on long-tail relation classes.
Table 3 (top): Model | DNNRE | w/o dynatt | w/o dyncla | PCNN
Table 3 (bottom): Model | DNNRE | w/o lexi | w/o affi | w/o type
4.6 Ablation Study
In this subsection, we conduct ablation studies to validate the effect of each component of DNNRE. Note that since some bags in the testing set are noisy, we use the AUC restricted to the low-recall region to focus on the high-confidence bags.
In Table 3 (top), w/o dynatt denotes a variant that removes the dynamic attention and uses only vanilla static attention parameters; w/o dyncla denotes a variant that removes the dynamic classifier and uses only vanilla static classifier parameters. When both dynamic parts are removed, DNNRE degrades to PCNN+ATT, which achieves an AUC of less than 0.25, around 6 points lower than DNNRE. Merely using the dynamic attention (w/o dyncla) or the dynamic classifier (w/o dynatt) boosts the performance of PCNN+ATT to around 0.277 and 0.290, respectively. This demonstrates that each dynamic module contributes to the superior performance of DNNRE; that is, the dynamic designs of both the attention and the classifier are beneficial for relation recognition.
We also remove each input of the Dynamic Parameter Generator for analysis in Table 3 (bottom). w/o lexi is DNNRE without the relation lexical definitions; w/o affi is a variant that removes the affinity matrix, with the dynamic parts obtained directly from stage 1 of the generator; w/o type is DNNRE without the entity type information. The results clearly show that removing the entity types, relation lexical definitions, or affinity matrix leads to a performance drop; in particular, removing the entity types causes a significant drop. These results validate that the dynamic characteristics are key to DNNRE.
4.7 Case Study
Figure 4 uses two examples to show how DNNRE addresses the style shift problem. Two models are compared: DNNRE and its variant without type information (as in the ablation study). The first example expresses the relation through the keyword phrase "the president of", and both models detect the correct relation with high confidence. In the second example, however, the entity type changes to "news agency", and the sentence expresses the same relation through a different set of keywords, i.e., "anchor" and "journalist". DNNRE is able to adjust its model parameters according to the entity type information and make the correct prediction: its confidence score for this example is 0.85, whereas DNNRE w/o type obtains only 0.16.
5 Conclusion
In this work, we propose a novel dynamic parameter generator that builds a Dynamic Neural Network for Relation Extraction (DNNRE). The parameters of DNNRE are determined by the query entity types, the relation lexical definitions, and the mutual-class relationship. The model can adjust its network parameters to address the potential style shift caused by keyword variation under different entity types, and is also capable of taking advantage of prior knowledge about the semantic similarities between classes. Through extensive experiments, we demonstrate that the proposed method and its various components are effective for improving relation extraction accuracy.
References
- (2019) Combining distant and direct supervision for neural relation extraction. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics. Cited by: §2.3.
- Multi-level structured self-attentions for distantly supervised relation extraction. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pp. 2216–2225. Cited by: §2.2.
- Reinforcement learning for relation classification from noisy data. In Proceedings of the AAAI Conference on Artificial Intelligence, pp. 5779–5786. Cited by: §2.2.
-  (2005) Incorporating non-local information into information extraction systems by gibbs sampling. In Proceedings of the Annual Meeting on Association for Computational Linguistics, pp. 363–370. Cited by: §4.1.
-  (2018) Hierarchical relation extraction with coarse-to-fine grained attention. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pp. 2236–2245. Cited by: §2.2, §2.3, §2.3, §4.2, §4.5, §4.5.
-  (2011) Knowledge-based weak supervision for information extraction of overlapping relations. In Proceedings of the Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, Vol. 1, pp. 541–550. Cited by: §2.1, 2nd item, §4.1.
-  (2017) Distant supervision for relation extraction with sentence-level attention and entity descriptions. In Proceedings of the AAAI Conference on Artificial Intelligence, pp. 3060–3066. Cited by: §2.2, §2.3.
-  (2017) Semi-supervised classification with graph convolutional networks. In International Conference on Learning Representations, Cited by: §3.4.
-  (2018) Cooperative denoising for distantly supervised relation extraction. In Proceedings of the International Conference on Computational Linguistics, pp. 426–436. Cited by: §2.3.
-  (2016) Neural relation extraction with selective attention over instances. In Proceedings of the Annual Meeting of the Association for Computational Linguistics, Vol. 1, pp. 2124–2133. Cited by: §2.2, §3.2, §3.5, 5th item, §4.1, §4.2, §4.5.
-  (2012) Fine-grained entity recognition. In Proceedings of the AAAI Conference on Artificial Intelligence, pp. 94–100. Cited by: §3.3.
-  (2014) Exploring fine-grained entity type constraints for distantly supervised relation extraction. In Proceedings of the International Conference on Computational Linguistics: Technical Papers, pp. 2107–2116. Cited by: §3.3.
-  (2009) Distant supervision for relation extraction without labeled data. In Proceedings of the Joint Conference of the Annual Meeting of the ACL and the International Joint Conference on Natural Language Processing of the AFNLP, pp. 1003–1011. Cited by: §1, §2.1, 1st item, §4.2.
- Modeling relations and their mentions without labeled text. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp. 148–163. Cited by: §2.1.
-  (2015) Viske: visual knowledge extraction and question answering by visual verification of relation phrases. In , pp. 1456–1464. Cited by: §1.
-  (2012) Multi-instance multi-label learning for relation extraction. In Proceedings of Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pp. 455–465. Cited by: §2.1, 3rd item, §4.1.
-  (2019) A hierarchical framework for relation extraction with reinforcement learning. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 7072–7079. Cited by: §2.2.
-  (2018) RESIDE: improving distantly-supervised neural relation extraction using side information. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pp. 1257–1266. Cited by: §2.3, §2.3, §3.3, 6th item.
-  (2017) Adversarial training for relation extraction. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pp. 1778–1783. Cited by: §2.2.
-  (2009) Unsupervised relation extraction by mining wikipedia texts using information from the web. In Proceedings of the Joint Conference of the Annual Meeting of the ACL and the International Joint Conference on Natural Language Processing of the AFNLP, Vol. 2, pp. 1021–1029. Cited by: §1.
-  (2019) Distant supervision for relation extraction with linear attenuation simulation and non–iid relevance embedding. Proceedings of the AAAI Conference on Artificial Intelligence. Cited by: 7th item.
-  (2019) Cross-relation cross-bag attention for distantly-supervised relation extraction. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 419–426. Cited by: §2.2.
-  (2015) Distant supervision for relation extraction via piecewise convolutional neural networks. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pp. 1753–1762. Cited by: §2.2, §3.2, 4th item, §4.3.
-  (2019) Long-tail relation extraction via knowledge graph embeddings and graph convolution networks. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Cited by: §2.3, §2.3, §4.2, §4.5, §4.5.
-  (2019) Multi-labeled relation extraction with attentive capsule network. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 7484–7491. Cited by: §2.2.