In recent years, graph neural networks (GNNs) have been applied to various fields of machine learning, including node classification (Kipf and Welling, 2016), relation classification (Schlichtkrull et al., 2017), molecular property prediction (Gilmer et al., 2017), and few-shot learning (Garcia and Bruna, 2018), and have achieved promising results on these tasks. These works have demonstrated GNNs’ strong ability to perform relational reasoning on graphs.
Relational reasoning aims to abstractly reason about entities/objects and their relations, which is an important part of human intelligence. Besides graphs, relational reasoning is also of great importance in many natural language processing tasks such as question answering, relation extraction, and summarization. Consider the example shown in Fig. 1: existing relation extraction models could easily extract the facts that Luc Besson directed the film Léon: The Professional and that the film is in English, but fail to infer the relationship between Luc Besson and English without multi-hop relational reasoning. By considering the reasoning patterns, one can discover that Luc Besson could speak English through the following chain: Luc Besson directed Léon: The Professional, and this film is in English, which indicates that Luc Besson could speak English. However, most existing GNNs can only perform multi-hop relational reasoning on pre-defined graphs and cannot be directly applied to natural language relational reasoning. Enabling multi-hop relational reasoning in natural language remains an open problem.
To address this issue, we propose graph neural networks with generated parameters (GP-GNNs), which adapt graph neural networks to the natural language relational reasoning task. GP-GNNs first construct a fully-connected graph with the entities in the sequence of text. After that, they employ three modules to perform relational reasoning: (1) an encoding module which enables edges to encode rich information from natural language, (2) a propagation module which propagates relational information among various nodes, and (3) a classification module which makes predictions with node representations. Compared to traditional GNNs, GP-GNNs can learn edges’ parameters from natural language, which extends GNNs from performing inference only on non-relational graphs or graphs with a limited number of edge types to unstructured inputs such as text.
In the experiments, we apply GP-GNNs to a classic natural language relational reasoning task: relation extraction from text. We carry out experiments on a Wikipedia corpus aligned with the Wikidata knowledge base (Vrandečić and Krötzsch, 2014) and build a human annotated test set as well as two distantly labeled test sets with different levels of denseness. Experimental results show that our model outperforms other models on the relation extraction task by considering multi-hop relational reasoning. We also perform a qualitative analysis which shows that our model could discover more relations by reasoning more robustly than the baseline models.
Our main contributions are two-fold:
(1) We extend a novel graph neural network model with generated parameters to enable relational message-passing with rich text information, which could be applied to relational reasoning on unstructured inputs such as natural language.
(2) We verify our GP-GNNs on the task of relation extraction from text, which demonstrates their ability in multi-hop relational reasoning as compared to models which extract relationships separately. Moreover, we also present three datasets, which could help future researchers compare their models in different settings.
2 Related Work
2.1 Graph Neural Networks (GNNs)
replace the Almeida-Pineda algorithm with the more generic backpropagation and demonstrate its effectiveness empirically. Gilmer et al. (2017) propose to apply GNNs to molecular property prediction tasks. Garcia and Bruna (2018) show how to use GNNs to learn classifiers on image datasets in a few-shot manner. Gilmer et al. (2017) study the effectiveness of message-passing in quantum chemistry. Dhingra et al. (2017) apply message-passing on a graph constructed by coreference links to answer relational questions. There are relatively few papers discussing how to adapt GNNs to natural language tasks. For example, Marcheggiani and Titov (2017) propose to apply GNNs to semantic role labeling and Schlichtkrull et al. (2017) apply GNNs to knowledge base completion tasks. Zhang et al. (2018) apply GNNs to relation extraction by encoding dependency trees, and De Cao et al. (2018) apply GNNs to multi-hop question answering by encoding co-occurrence and co-reference relationships. Although these works also apply GNNs to natural language processing tasks, they still perform message-passing on predefined graphs. Johnson (2017) introduces a novel neural architecture to generate a graph based on the textual input and dynamically update the relationship during the learning process. In sharp contrast, this paper focuses on extracting relations from real-world relation datasets.
2.2 Relational Reasoning
Relational reasoning has been explored in various fields. For example, Santoro et al. (2017) propose a simple neural network to reason about the relationships between objects in a picture, Xu et al. (2017) build a scene graph from an image, and Kipf et al. (2018) model the interactions of physical objects.
In this paper, we focus on relational reasoning in the natural language domain. Existing works (Zeng et al., 2014, 2015; Lin et al., 2016) have demonstrated that neural networks are capable of capturing the pair-wise relationship between entities in certain situations. For example, Zeng et al. (2014) is one of the earliest works that applies a simple CNN to this task, and Zeng et al. (2015) further extend it with piece-wise max-pooling. Nguyen and Grishman (2015) propose a multi-window version of CNN for relation extraction. Lin et al. (2016) study an attention mechanism for relation extraction tasks. Peng et al. (2017) predict n-ary relations of entities in different sentences with Graph LSTMs. Le and Titov (2018) treat relations as latent variables, which makes it possible to induce the relations without any supervision signals. Zeng et al. (2017) show that the relation path has an important role in relation extraction. Miwa and Bansal (2016) show the effectiveness of LSTMs (Hochreiter and Schmidhuber, 1997) in relation extraction. Christopoulou et al. (2018) propose a walk-based model for relation extraction. The most related work is Sorokin and Gurevych (2017), where the proposed model incorporates contextual relations with an attention mechanism when predicting the relation of a target entity pair. The drawback of existing approaches is that they cannot make full use of the multi-hop inference patterns among multiple entity pairs and their relations within the sentence.
3 Graph Neural Network with Generated Parameters (GP-GNNs)
We first define the task of natural language relational reasoning. Given a sequence of text with entities, it aims to reason on both the text and entities and make a prediction of the labels of the entities or entity pairs.
In this section, we will introduce the general framework of GP-GNNs. GP-GNNs first build a fully-connected graph $\mathcal{G} = (\mathcal{V}, \mathcal{E})$, where $\mathcal{V}$ is the set of entities, and each edge $(v_i, v_j) \in \mathcal{E}$ corresponds to a sequence $s = x_0^{i,j}, x_1^{i,j}, \dots, x_{l-1}^{i,j}$ extracted from the text. After that, GP-GNNs employ three modules, namely (1) an encoding module, (2) a propagation module and (3) a classification module, to perform relational reasoning, as shown in Fig. 2.
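The graph construction step can be illustrated with a minimal sketch. The function name, the `(start, end)` span format and the toy sentence are hypothetical, and attaching the whole sentence to every edge is a simplifying assumption rather than the paper's exact extraction rule.

```python
from itertools import permutations

def build_fully_connected_graph(tokens, entity_spans):
    """Return one directed edge per ordered pair of distinct entities.

    entity_spans: hypothetical list of half-open (start, end) token offsets.
    Each edge (i, j) maps to the token sequence it will later be encoded
    from; as a simplifying assumption that is the whole sentence.
    """
    nodes = range(len(entity_spans))
    return {(i, j): list(tokens) for i, j in permutations(nodes, 2)}

# Toy sentence with three entities: "Luc Besson", "Léon", "English".
tokens = ["Luc", "Besson", "directed", "Léon", ",", "a", "film", "in", "English"]
entity_spans = [(0, 2), (3, 4), (8, 9)]

graph = build_fully_connected_graph(tokens, entity_spans)
print(sorted(graph))  # [(0, 1), (0, 2), (1, 0), (1, 2), (2, 0), (2, 1)]
```

With three entities the graph has six directed edges and no self-loops, so every entity can exchange messages with every other one.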
3.1 Encoding Module
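The encoding module converts sequences into transition matrices corresponding to edges, i.e. the parameters of the propagation module, by

$$\mathcal{A}_{i,j}^{(n)} = f\left(E(x_0^{i,j}), E(x_1^{i,j}), \dots, E(x_{l-1}^{i,j}); \theta_e^n\right), \tag{1}$$

where $x_0^{i,j}, \dots, x_{l-1}^{i,j}$ is the token sequence of edge $(i, j)$, $f(\cdot)$ could be any model that could encode sequential data, such as LSTMs, GRUs, or CNNs, $E(\cdot)$ indicates an embedding function, and $\theta_e^n$ denotes the parameters of the encoding module of the $n$-th layer.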
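As a rough illustration of how an edge's text can be turned into parameters, the sketch below maps a token-id sequence to a square transition matrix. It is not the paper's encoder: mean pooling followed by a single linear layer stands in for the sequence model, and all names and sizes (`encode_edge`, `d_r`, the toy vocabulary) are assumptions made for the example.

```python
import numpy as np

def encode_edge(token_ids, embedding, weight, bias, d_r):
    """Map a token-id sequence to a d_r x d_r transition matrix.

    Assumption: the sequence encoder is replaced by mean pooling plus one
    linear layer; any sequence model (LSTM, GRU, CNN) could be used instead.
    """
    pooled = embedding[token_ids].mean(axis=0)   # (d_emb,)
    flat = weight @ pooled + bias                # (d_r * d_r,)
    return flat.reshape(d_r, d_r)                # vector reshaped as a matrix

rng = np.random.default_rng(0)
vocab_size, d_emb, d_r = 20, 6, 4
embedding = rng.normal(size=(vocab_size, d_emb))
weight = rng.normal(size=(d_r * d_r, d_emb))
bias = np.zeros(d_r * d_r)

A_ij = encode_edge(np.array([3, 7, 1, 12]), embedding, weight, bias, d_r)
print(A_ij.shape)  # (4, 4)
```

The key point is that the matrix is a function of the text, so different sequences yield different edge parameters.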
3.2 Propagation Module
The propagation module learns representations for nodes layer by layer. The initial embeddings of nodes, i.e. the representations of layer $0$, are task-related, which could be embeddings that encode features of nodes or just one-hot embeddings. Given representations of layer $n$, the representations of layer $n+1$ are calculated by

$$\mathbf{h}_i^{(n+1)} = \sum_{v_j \in \mathcal{N}(v_i)} \sigma\left(\mathcal{A}_{i,j}^{(n)} \mathbf{h}_j^{(n)}\right), \tag{2}$$

where $\mathcal{N}(v_i)$ denotes the neighbours of node $v_i$ in graph $\mathcal{G}$ and $\sigma(\cdot)$ denotes a non-linear activation function.
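One propagation layer can be sketched in NumPy as follows. The sketch assumes the update sums, over the neighbours of a node, the activated product of the edge's transition matrix with the neighbour's representation; the function name, the tensor layout and the choice of tanh are illustrative assumptions.

```python
import numpy as np

def propagate(h, A, activation=np.tanh):
    """One propagation layer on a fully-connected graph without self-loops.

    h: (num_nodes, d) node representations at layer n.
    A: (num_nodes, num_nodes, d, d); A[i, j] is the transition matrix of
       edge (i, j).
    Returns the layer n + 1 representations, where node i sums
    activation(A[i, j] @ h[j]) over all j != i.
    """
    messages = activation(np.einsum("ijab,jb->ija", A, h))  # (N, N, d)
    mask = 1.0 - np.eye(h.shape[0])                         # drop j == i
    return (messages * mask[:, :, None]).sum(axis=1)

rng = np.random.default_rng(0)
num_nodes, d = 3, 4
h0 = rng.normal(size=(num_nodes, d))
A = rng.normal(size=(num_nodes, num_nodes, d, d))

h1 = propagate(h0, A)
print(h1.shape)  # (3, 4)
```

Applying `propagate` repeatedly, with a fresh set of transition matrices per layer, lets information travel along multi-hop paths.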
3.3 Classification Module
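Generally, the classification module takes node representations as inputs and outputs predictions. Therefore, the loss of GP-GNNs could be calculated as

$$\mathcal{L} = g\left(\mathbf{h}_{0:|\mathcal{V}|-1}^{(0)}, \mathbf{h}_{0:|\mathcal{V}|-1}^{(1)}, \dots, \mathbf{h}_{0:|\mathcal{V}|-1}^{(K)}, Y; \theta_c\right), \tag{3}$$

where $g(\cdot)$ is the loss computed by the classification module, $\theta_c$ denotes the parameters of the classification module, $K$ is the number of layers in the propagation module and $Y$ denotes the ground truth label. The parameters in GP-GNNs are trained by gradient descent methods.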
4 Relation Extraction with GP-GNNs
Relation extraction from text is a classic natural language relational reasoning task. Given a sentence $s = (x_0, x_1, \dots, x_{l-1})$, a set of relations $\mathcal{R}$ and a set of entities in this sentence $\mathcal{V}_s = \{v_1, v_2, \dots, v_{|\mathcal{V}_s|}\}$, where each $v_i$ consists of one or a sequence of tokens, relation extraction from text is to identify the pairwise relationship $r_{v_i,v_j} \in \mathcal{R}$ between each entity pair $(v_i, v_j)$.
In this section, we will introduce how to apply GP-GNNs to relation extraction.
4.1 Encoding Module
To encode the context of entity pairs (or edges in the graph), we first concatenate the position embeddings with word embeddings in the sentence:

$$E(x_t^{i,j}) = [\mathbf{x}_t; \mathbf{p}_t^{i,j}], \tag{4}$$

where $\mathbf{x}_t$ denotes the word embedding of word $x_t$ and $\mathbf{p}_t^{i,j}$ denotes the position embedding of word position $t$ relative to the entity pair’s position $i, j$ (details of these two embeddings are introduced in the next two paragraphs). After that, we feed the representations of entity pairs into an encoder $f(\cdot)$ which contains a bi-directional LSTM and a multi-layer perceptron:

$$\mathcal{A}_{i,j}^{(n)} = \left[\mathrm{MLP}_n\left(\mathrm{BiLSTM}_n\left(E(x_0^{i,j}), E(x_1^{i,j}), \dots, E(x_{l-1}^{i,j})\right)\right)\right], \tag{5}$$

where $n$ denotes the index of the layer (adding an index to neural models means their parameters are different among layers), $[\cdot]$ means reshaping a vector as a matrix, $\mathrm{BiLSTM}$ encodes a sequence by concatenating the tail hidden states of the forward LSTM and the head hidden states of the backward LSTM, and $\mathrm{MLP}$ denotes a multi-layer perceptron with non-linear activation $\sigma$.
We first map each token $x_t$ of sentence $\{x_0, x_1, \dots, x_{l-1}\}$ to a $d_w$-dimensional embedding vector $\mathbf{x}_t$ using a word embedding matrix $W_e \in \mathbb{R}^{|V| \times d_w}$, where $|V|$ is the size of the vocabulary. Throughout this paper, we stick to 50-dimensional GloVe embeddings pre-trained on a 6-billion-token corpus (Pennington et al., 2014).
In this work, we consider a simple entity marking scheme: we mark each token in the sentence as either belonging to the first entity $v_i$, the second entity $v_j$ or to neither of those. (As pointed out by Sorokin and Gurevych (2017), other position markers lead to no improvement in performance.) Each position marker is also mapped to a $d_p$-dimensional vector by a position embedding matrix $P \in \mathbb{R}^{3 \times d_p}$. We use the notation $\mathbf{p}_t^{i,j}$ to represent the position embedding for $x_t$ corresponding to entity pair $(v_i, v_j)$.
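The marking scheme and the concatenation of word and position embeddings can be sketched as below. The marker ids (0 for neither entity, 1 for the first, 2 for the second), the span format, the 5-dimensional position embeddings and the random matrices standing in for pre-trained vectors are all assumptions made for illustration.

```python
import numpy as np

def position_markers(num_tokens, first_span, second_span):
    """Label each token 1 (first entity), 2 (second entity) or 0 (neither).

    Spans are hypothetical half-open (start, end) token offsets.
    """
    markers = np.zeros(num_tokens, dtype=int)
    markers[slice(*first_span)] = 1
    markers[slice(*second_span)] = 2
    return markers

def embed_tokens(token_ids, markers, word_emb, pos_emb):
    """Concatenate the word and position embedding of every token."""
    return np.concatenate([word_emb[token_ids], pos_emb[markers]], axis=1)

rng = np.random.default_rng(0)
word_emb = rng.normal(size=(100, 50))  # toy vocabulary, 50-d word vectors
pos_emb = rng.normal(size=(3, 5))      # three markers, assumed 5-d vectors

token_ids = np.arange(9)
markers = position_markers(9, first_span=(0, 2), second_span=(8, 9))
print(markers.tolist())  # [1, 1, 0, 0, 0, 0, 0, 0, 2]
print(embed_tokens(token_ids, markers, word_emb, pos_emb).shape)  # (9, 55)
```

Because the markers depend on the entity pair, the same sentence produces a different input sequence, and hence a different transition matrix, for each edge.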
4.2 Propagation Module
Next, we use Eq. (2) to propagate information among nodes where the initial embeddings of nodes and number of layers are further specified as follows.
The Initial Embeddings of Nodes
Suppose we are focusing on extracting the relationship between entity $v_i$ and entity $v_j$. Their initial embeddings are annotated as $\mathbf{h}_{v_i}^{(0)} = a_{\text{subject}}$ and $\mathbf{h}_{v_j}^{(0)} = a_{\text{object}}$, while the initial embeddings of other entities are set to all zeros. We set special values for the head and tail entities’ initial embeddings as a kind of “flag” message which we expect to be passed through propagation. The annotators $a_{\text{subject}}$ and $a_{\text{object}}$ could also carry prior knowledge about the subject entity and the object entity. In our experiments, we generalize the idea of Gated Graph Neural Networks (Li et al., 2016) by setting $a_{\text{subject}} = [\mathbf{1}; \mathbf{0}]^{\top}$ and $a_{\text{object}} = [\mathbf{0}; \mathbf{1}]^{\top}$. (The dimensions of $\mathbf{1}$ and $\mathbf{0}$ are the same. Hence, the embedding size $d_r$ should be a positive even integer. The embeddings of the subject and object could also carry type information by changing the annotators. We leave this extension for future work.)
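A minimal sketch of these "flag" embeddings follows. It assumes the subject annotator is a block of ones followed by a block of zeros and the object annotator is the reverse, each block being half of the embedding size; the function name and argument names are hypothetical.

```python
import numpy as np

def initial_embeddings(num_entities, subject, obj, d_r):
    """Layer-0 node embeddings for one target pair (subject, obj).

    Assumed annotators: subject = [1; 0], object = [0; 1], where each
    block has d_r / 2 entries. All other entities start at zero.
    """
    if d_r <= 0 or d_r % 2:
        raise ValueError("d_r must be a positive even integer")
    half = d_r // 2
    h0 = np.zeros((num_entities, d_r))
    h0[subject, :half] = 1.0  # subject flag
    h0[obj, half:] = 1.0      # object flag
    return h0

h0 = initial_embeddings(num_entities=3, subject=0, obj=2, d_r=4)
print(h0.tolist())
# [[1.0, 1.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 1.0, 1.0]]
```

Only the two target entities emit non-zero messages in the first layer, which is what lets later layers trace how their flags reach each other through intermediate entities.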
Number of Layers
In general graphs, the number of layers $K$ is chosen to be of the order of the graph diameter so that all nodes obtain information from the entire graph. In our context, however, since the graph is densely connected, the depth is interpreted simply as giving the model more expressive power. We treat $K$ as a hyper-parameter, the effectiveness of which will be discussed in detail (Sect. 5.4).
4.3 Classification Module
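The output module takes the embeddings of the target entity pair $(v_i, v_j)$ as input, which are first converted by:

$$\mathbf{r}_{v_i,v_j} = \left[[\mathbf{h}_{v_i}^{(1)} \odot \mathbf{h}_{v_j}^{(1)}]^{\top}; [\mathbf{h}_{v_i}^{(2)} \odot \mathbf{h}_{v_j}^{(2)}]^{\top}; \dots; [\mathbf{h}_{v_i}^{(K)} \odot \mathbf{h}_{v_j}^{(K)}]^{\top}\right], \tag{6}$$

where $\odot$ represents element-wise multiplication. This could be used for classification:

$$\mathbb{P}(r_{v_i,v_j} \mid i, j, s) = \mathrm{softmax}\left(\mathrm{MLP}(\mathbf{r}_{v_i,v_j})\right), \tag{7}$$

where $r_{v_i,v_j} \in \mathcal{R}$, and $\mathrm{MLP}$ denotes a multi-layer perceptron module.
We use cross entropy here as the classification loss

$$\mathcal{L} = -\sum_{s \in S} \sum_{i \neq j} \log \mathbb{P}(r_{v_i,v_j} \mid i, j, s), \tag{8}$$

where $r_{v_i,v_j}$ denotes the relation label for entity pair $(v_i, v_j)$ and $S$ denotes the whole corpus.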
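The classification step can be sketched as follows. The sketch assumes the pair representation concatenates the element-wise product of the two target entities' embeddings over the propagated layers; a single linear layer stands in for the multi-layer perceptron, and all names and sizes are hypothetical.

```python
import numpy as np

def pair_representation(layer_states, i, j):
    """Concatenate h_i * h_j (element-wise) over the propagated layers."""
    return np.concatenate([h[i] * h[j] for h in layer_states])

def softmax(z):
    e = np.exp(z - z.max())  # shift for numerical stability
    return e / e.sum()

def relation_probs(layer_states, i, j, W, b):
    """Relation distribution; one linear layer stands in for the MLP."""
    return softmax(W @ pair_representation(layer_states, i, j) + b)

def cross_entropy(probs, gold):
    """Negative log-probability of the gold relation index."""
    return -np.log(probs[gold])

rng = np.random.default_rng(0)
K, num_nodes, d_r, num_relations = 2, 3, 4, 5
layer_states = [rng.normal(size=(num_nodes, d_r)) for _ in range(K)]
W = rng.normal(size=(num_relations, K * d_r))
b = np.zeros(num_relations)

probs = relation_probs(layer_states, 0, 2, W, b)
loss = cross_entropy(probs, gold=1)
print(probs.shape)  # (5,)
```

Summing `cross_entropy` over all ordered entity pairs of all sentences gives the corpus-level training objective.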
In practice, we stack the embeddings for all target entity pairs together to infer the underlying relationship between each pair of entities. We use PyTorch (Paszke et al., 2017) to implement our models. To make the implementation more efficient, we replace loop-based, scalar-oriented code with matrix and vector operations.
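The stacking idea can be sketched with a batched propagation step. NumPy is used here instead of PyTorch to keep the example small; the tensor layout, the function name and the flag-style initial embeddings are assumptions, and the transition matrices are shared across target pairs because they depend only on the edges.

```python
import numpy as np

def propagate_all_pairs(h, A, activation=np.tanh):
    """Propagate every target pair's node states in one tensor operation.

    h: (num_pairs, num_nodes, d), one set of node states per target pair.
    A: (num_nodes, num_nodes, d, d), shared by all target pairs.
    """
    messages = activation(np.einsum("ijab,pjb->pija", A, h))  # (P, N, N, d)
    mask = 1.0 - np.eye(A.shape[0])                           # drop j == i
    return (messages * mask[None, :, :, None]).sum(axis=2)

num_nodes, d = 3, 4
pairs = [(i, j) for i in range(num_nodes) for j in range(num_nodes) if i != j]

# One stack of initial embeddings per target pair (assumed flag scheme).
h0 = np.zeros((len(pairs), num_nodes, d))
for p, (i, j) in enumerate(pairs):
    h0[p, i, : d // 2] = 1.0
    h0[p, j, d // 2 :] = 1.0

A = np.random.default_rng(0).normal(size=(num_nodes, num_nodes, d, d))
h1 = propagate_all_pairs(h0, A)
print(h1.shape)  # (6, 3, 4)
```

A single `einsum` replaces the nested loops over target pairs, nodes and neighbours, which is the kind of vectorization the paragraph above refers to.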
5 Experiments
Our experiments mainly aim to (1) show that our best models could improve the performance of relation extraction under a variety of settings; (2) illustrate how the number of layers affects the performance of our model; and (3) perform a qualitative investigation to highlight the difference between our models and baseline models. In both part (1) and part (2), we run three groups of experiments: (i) we first show that our models could improve instance-level relation extraction on a human annotated test set; (ii) we then show that our models could also enhance the performance of bag-level relation extraction on a distantly labeled test set (bag-level relation extraction is a widely accepted scheme for relation extraction with distant supervision, in which the relation of an entity pair is predicted by aggregating a bag of instances); and (iii) we also split out a subset of the distantly labeled test set in which the number of entities and edges is large.
5.1 Experiment Settings
Distantly labeled set
Sorokin and Gurevych (2017) have proposed a dataset built from Wikipedia corpora. There is a small difference between our task and theirs: our task is to extract the relationship between every pair of entities in the sentence, whereas their task is to extract the relationship between the given entity pair and the context entity pairs. Therefore, we need to modify their dataset: (1) we add reversed edges if they are missing from a given triple, e.g. if the triple (Earth, part of, Solar System) exists in the sentence, we add a reversed label, (Solar System, has a member, Earth), to it; (2) for all of the entity pairs with no relations, we add “NA” labels to them. (We also resolve entities at the same position and remove self-loops from the previous dataset. Furthermore, we limit the number of entities in one sentence to 9, resulting in only 0.0007 data loss.) We use the same training set for all of the experiments.
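The two dataset modifications can be sketched as below. The inverse-relation table is hypothetical and lists only the pair mentioned in the text; the function name and the triple format are likewise assumptions.

```python
from itertools import permutations

# Hypothetical inverse-relation table; only the example from the text.
INVERSE = {"part of": "has a member"}

def complete_labels(entities, triples, inverse=INVERSE):
    """Add missing reversed edges, then label every remaining pair "NA".

    triples: list of (head, relation, tail) tuples for one sentence.
    Returns a dict mapping each ordered entity pair to a relation label.
    """
    labels = {(h, t): r for h, r, t in triples}
    for h, r, t in triples:
        if r in inverse and (t, h) not in labels:
            labels[(t, h)] = inverse[r]
    for h, t in permutations(entities, 2):
        labels.setdefault((h, t), "NA")
    return labels

entities = ["Earth", "Solar System", "Sun"]
labels = complete_labels(entities, [("Earth", "part of", "Solar System")])
print(labels[("Solar System", "Earth")])  # has a member
print(labels[("Sun", "Earth")])           # NA
```

After this step every ordered pair of distinct entities in a sentence carries exactly one label, which matches the task of classifying all pairs at once.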
Human annotated test set
Based on the test set provided by Sorokin and Gurevych (2017), 5 annotators (all well-educated university students) are asked to label the dataset. They are asked to decide whether or not the distant supervision is right for every pair of entities. Only the instances accepted by all 5 annotators are incorporated into the human annotated test set. There are 350 sentences and 1,230 triples in this test set.
Dense distantly labeled test set
We further split a dense test set from the distantly labeled test set. Our criteria are: (1) the number of entities should be strictly larger than 2; and (2) there must be at least one cycle (with at least three entities) in the ground-truth label of the sentence, where every edge in the cycle has a non-“NA” label. This test set could be used to test our methods’ performance on sentences with complex interactions between entities. There are 1,350 sentences and more than 17,915 triples and 7,906 relational facts in this test set.
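The two criteria can be checked with a small sketch. It assumes entities are indexed by integers and that the labeled edges are read as undirected, so that a relation and its reversed edge do not count as a loop on their own; the union-find approach and the function names are illustrative choices, not the authors' code.

```python
def has_cycle(num_entities, labels):
    """True if the non-"NA" edges, read as undirected, contain a cycle.

    In a simple undirected graph every cycle visits at least three nodes.
    labels: dict mapping ordered pairs (i, j) to a relation label.
    """
    parent = list(range(num_entities))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    undirected = {frozenset(pair) for pair, rel in labels.items()
                  if rel != "NA" and pair[0] != pair[1]}
    for edge in undirected:
        a, b = (find(v) for v in edge)
        if a == b:          # endpoints already connected: edge closes a cycle
            return True
        parent[a] = b
    return False

def is_dense(num_entities, labels):
    """Apply both criteria: more than two entities and at least one cycle."""
    return num_entities > 2 and has_cycle(num_entities, labels)

triangle = {(0, 1): "r1", (1, 0): "r2", (1, 2): "r3", (2, 0): "r4"}
chain = {(0, 1): "r1", (1, 0): "r2", (1, 2): "r3", (0, 2): "NA"}
print(is_dense(3, triangle))  # True
print(is_dense(3, chain))     # False
```

Sentences that pass this filter are exactly those where a relation can, in principle, be supported by a path through a third entity.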
5.1.2 Models for Comparison
We select the following models for comparison, the first four of which are our baseline models.
Context-Aware RE, proposed by Sorokin and Gurevych (2017). This model utilizes an attention mechanism to encode the context relations for predicting target relations. It was the state-of-the-art model on the Wikipedia dataset. We implemented this baseline ourselves based on the authors’ public repository (https://github.com/UKPLab/emnlp2017-relation-extraction).
Multi-Window CNN. Zeng et al. (2014) utilize convolutional neural networks to classify relations. Different from the original version of CNN proposed in Zeng et al. (2014), our implementation follows Nguyen and Grishman (2015) and concatenates features extracted by three different window sizes: 3, 5, and 7.
PCNN, proposed by Zeng et al. (2015). This model divides the whole sentence into three pieces and applies piecewise max-pooling after the convolution layer. For CNN and PCNN, the entity markers are the same as originally proposed in (Zeng et al., 2014, 2015).
LSTM or GP-GNN with $K = 1$ layer. Bi-directional LSTM (Schuster and Paliwal, 1997) could be seen as a 1-layer variant of our model.
GP-GNN with $K = 2$ or $K = 3$ layers. These models are capable of performing 2-hop reasoning and 3-hop reasoning, respectively.
We select the best parameters on the validation set. We choose the non-linear activation function between relu and tanh, and select the node embedding size $d_r$ from a set of candidate values. (We set all $d_r$ to be the same, as we did not see improvements from using different $d_r$.) We have also tried two forms of adjacency matrices: tied-weights (set $\mathcal{A}^{(n)} = \mathcal{A}^{(n+1)}$) and untied-weights. Table 1 shows our best hyper-parameter settings, which are used in all of our experiments.
Table 1: Best hyper-parameter settings.
|Hyper-parameter|Value|
|---|---|
|hidden state size|256|
|embedding size for #layers = 1|8|
|embedding size for #layers = 2 and 3|12|
5.2 Evaluation Details
So far, we have only talked about the way to implement sentence-level relation extraction. To evaluate our models and baseline models at the bag level, we utilize a bag of sentences with a given entity pair to score the relations between them. Zeng et al. (2015) formalize bag-level relation extraction as multi-instance learning. Here, we follow their idea and define the score function of entity pair $(v_i, v_j)$ and its corresponding relation $r$ as a max-one setting:

$$E(r \mid v_i, v_j, S) = \max_{s \in S} \mathbb{P}(r_{v_i,v_j} \mid i, j, s), \tag{9}$$

where $S$ here denotes the bag of sentences that contain the entity pair.
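The max-one scoring can be sketched in a few lines. The function name, the dictionary format for per-sentence predictions and the toy probabilities are hypothetical.

```python
def bag_scores(sentence_probs):
    """Score each relation by its maximum probability over the bag.

    sentence_probs: list of dicts, one per sentence in the bag, each
    mapping a relation name to the sentence-level probability.
    """
    relations = sentence_probs[0].keys()
    return {r: max(p[r] for p in sentence_probs) for r in relations}

# Toy bag of two sentences mentioning the same entity pair.
bag = [
    {"NA": 0.7, "located in": 0.2, "part of": 0.1},
    {"NA": 0.1, "located in": 0.8, "part of": 0.1},
]
scores = bag_scores(bag)
print(max(scores, key=scores.get))  # located in
```

Under this setting a single confident sentence is enough to assign a relation to the entity pair, regardless of how the other sentences in the bag are scored.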
[Table 2: Results on the human annotated test set.]
[Table 3: Results on the distantly labeled test set and the dense distantly labeled test set.]
5.3 Effectiveness of Reasoning Mechanism
From Table 2 and Table 3, we can see that our best models outperform all the baseline models significantly on all three test sets. These results indicate that our model could successfully conduct reasoning on the fully-connected graph with parameters generated from natural language. They also indicate that our model not only performs well on sentence-level relation extraction but also improves bag-level relation extraction. Note that Context-Aware RE also incorporates context information to predict the relation of the target entity pair. However, we argue that Context-Aware RE only models the co-occurrence of various relations, ignoring whether the context relation participates in the reasoning process of relation extraction of the target entity pair. Context-Aware RE may introduce more noise, for it may mistakenly increase the probability of a relation whose topic is similar to that of the context relations. We will give samples to illustrate this issue in Sect. 5.5. Another interesting observation is that our #layers=1 version outperforms CNN and PCNN on these three datasets. One probable reason is that sentences from the Wikipedia corpus are always complex, which may be hard for CNN and PCNN to model. Similar conclusions are also reached by Zhang and Wang (2015).
5.4 The Effectiveness of the Number of Layers
The number of layers represents the reasoning ability of our models. A $K$-layer version has the ability to infer $K$-hop relations. To demonstrate the effects of the number of layers, we also compare our models with different numbers of layers. From Table 2 and Table 3, we could see that on all three datasets, the 3-layer version achieves the best results. We could also see from Fig. 3 that as the number of layers grows, the curves reach higher and higher precision, indicating that considering more hops in reasoning leads to better performance. However, the improvement from the third layer is much smaller on the overall distantly supervised test set than on the dense subset. This observation reveals that the reasoning mechanism helps identify relations especially in sentences with more entities. We could also see that on the human annotated test set, the 3-layer version has a greater improvement over the 2-layer version than the 2-layer version has over the 1-layer version. This is probably because bag-level relation extraction is much easier. In real applications, different variants could be selected for different kinds of sentences, or the predictions from different models could be ensembled. We leave these explorations for future work.
5.5 Qualitative Results: Case Study
Tab. 4 shows qualitative results that compare our GP-GNN model and the baseline models. The results show that GP-GNN has the ability to infer the relationship between two entities with reasoning. In the first case, GP-GNN implicitly learns a logic rule to derive (Oozham, language spoken, Malayalam), and in the second case our model implicitly learns another logic rule to find the fact (BankUnited Center, located in, English). Note that (BankUnited Center, located in, English) is not even in Wikidata, but our model could identify this fact through reasoning. We also find that Context-Aware RE tends to predict relations with similar topics. For example, in the third case, share border with and located in are both relations about territory issues. Consequently, Context-Aware RE makes a mistake by predicting (Kentucky, share border with, Ohio). As we have discussed before, this is due to its mechanism of modeling the co-occurrence of multiple relations. However, in our model, since Ohio and Johnson County have no relationship, this wrong relation is not predicted.
6 Conclusion and Future Work
We addressed the problem of utilizing GNNs to perform relational reasoning with natural language. Our proposed models, GP-GNNs, solve the relational message-passing task by encoding natural language as parameters and performing propagation from layer to layer. Our model can also be considered a more generic framework for the graph generation problem with unstructured inputs other than text, e.g. images, videos, and audio. In this work, we demonstrate its effectiveness in predicting the relationship between entities in natural language at both the sentence level and the bag level, and show that by considering more hops in reasoning the performance of relation extraction could be significantly improved.
- Almeida (1987) Luis B Almeida. 1987. A learning rule for asynchronous perceptrons with feedback in a combinatorial environment. In Proceedings of the 1st International Conference on Neural Networks, pages 609–618. IEEE.
- Christopoulou et al. (2018) Fenia Christopoulou, Makoto Miwa, and Sophia Ananiadou. 2018. A walk-based model on entity graphs for relation extraction. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), volume 2, pages 81–88.
- De Cao et al. (2018) Nicola De Cao, Wilker Aziz, and Ivan Titov. 2018. Question answering by reasoning across documents with graph convolutional networks. arXiv preprint arXiv:1808.09920.
- Dhingra et al. (2017) Bhuwan Dhingra, Zhilin Yang, William W Cohen, and Ruslan Salakhutdinov. 2017. Linguistic knowledge as memory for recurrent neural networks. arXiv preprint arXiv:1703.02620.
- Garcia and Bruna (2018) Victor Garcia and Joan Bruna. 2018. Few-shot learning with graph neural networks. In Proceedings of ICLR.
- Gilmer et al. (2017) Justin Gilmer, Samuel S. Schoenholz, Patrick F. Riley, Oriol Vinyals, and George E. Dahl. 2017. Neural message passing for quantum chemistry. In Proceedings of ICML.
- Hochreiter and Schmidhuber (1997) Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural computation, pages 1735–1780.
- Johnson (2017) Daniel D Johnson. 2017. Learning graphical state transitions. In Proceedings of ICLR.
- Kipf et al. (2018) Thomas Kipf, Ethan Fetaya, Kuan-Chieh Wang, Max Welling, and Richard Zemel. 2018. Neural relational inference for interacting systems. In ICML.
- Kipf and Welling (2016) Thomas N Kipf and Max Welling. 2016. Semi-supervised classification with graph convolutional networks. Proceedings of ICLR.
- Le and Titov (2018) Phong Le and Ivan Titov. 2018. Improving entity linking by modeling latent relations between mentions.
- Li et al. (2016) Yujia Li, Daniel Tarlow, Marc Brockschmidt, and Richard Zemel. 2016. Gated graph sequence neural networks. Proceedings of ICLR.
- Lin et al. (2016) Yankai Lin, Shiqi Shen, Zhiyuan Liu, Huanbo Luan, and Maosong Sun. 2016. Neural relation extraction with selective attention over instances. In Proceedings of ACL, pages 2124–2133.
- Marcheggiani and Titov (2017) Diego Marcheggiani and Ivan Titov. 2017. Encoding sentences with graph convolutional networks for semantic role labeling. In Proceedings of EMNLP.
- Miwa and Bansal (2016) Makoto Miwa and Mohit Bansal. 2016. End-to-end relation extraction using LSTMs on sequences and tree structures. In Proceedings of ACL, pages 1105–1116.
- Nguyen and Grishman (2015) Thien Huu Nguyen and Ralph Grishman. 2015. Relation extraction: Perspective from convolutional neural networks. In Proceedings of the 1st Workshop on Vector Space Modeling for Natural Language Processing, pages 39–48.
- Paszke et al. (2017) Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. 2017. Automatic differentiation in PyTorch.
- Peng et al. (2017) Nanyun Peng, Hoifung Poon, Chris Quirk, Kristina Toutanova, and Wen-tau Yih. 2017. Cross-sentence n-ary relation extraction with graph LSTMs. TACL, pages 101–115.
- Pennington et al. (2014) Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of EMNLP, pages 1532–1543.
- Santoro et al. (2017) Adam Santoro, David Raposo, David G Barrett, Mateusz Malinowski, Razvan Pascanu, Peter Battaglia, and Tim Lillicrap. 2017. A simple neural network module for relational reasoning. In NIPS, pages 4967–4976.
- Scarselli et al. (2009) Franco Scarselli, Marco Gori, Ah Chung Tsoi, Markus Hagenbuchner, and Gabriele Monfardini. 2009. The graph neural network model. IEEE Transactions on Neural Networks, pages 61–80.
- Schlichtkrull et al. (2017) Michael Schlichtkrull, Thomas N Kipf, Peter Bloem, Rianne van den Berg, Ivan Titov, and Max Welling. 2017. Modeling relational data with graph convolutional networks. arXiv preprint arXiv:1703.06103.
- Schuster and Paliwal (1997) Mike Schuster and Kuldip K Paliwal. 1997. Bidirectional recurrent neural networks. IEEE Transactions on Signal Processing, pages 2673–2681.
- Sorokin and Gurevych (2017) Daniil Sorokin and Iryna Gurevych. 2017. Context-aware representations for knowledge base relation extraction. In Proceedings of EMNLP, pages 1784–1789.
- Vrandečić and Krötzsch (2014) Denny Vrandečić and Markus Krötzsch. 2014. Wikidata: a free collaborative knowledgebase. Communications of the ACM.
- Xu et al. (2017) Danfei Xu, Yuke Zhu, Christopher B Choy, and Li Fei-Fei. 2017. Scene graph generation by iterative message passing. In
- Zeng et al. (2015) Daojian Zeng, Kang Liu, Yubo Chen, and Jun Zhao. 2015. Distant supervision for relation extraction via piecewise convolutional neural networks. In Proceedings of EMNLP, pages 1753–1762.
- Zeng et al. (2014) Daojian Zeng, Kang Liu, Siwei Lai, Guangyou Zhou, and Jun Zhao. 2014. Relation classification via convolutional deep neural network. In Proceedings of COLING, pages 2335–2344.
- Zeng et al. (2017) Wenyuan Zeng, Yankai Lin, Zhiyuan Liu, and Maosong Sun. 2017. Incorporating relation paths in neural relation extraction. In Proceedings of EMNLP.
- Zhang and Wang (2015) Dongxu Zhang and Dong Wang. 2015. Relation classification via recurrent neural network. arXiv preprint arXiv:1508.01006.
- Zhang et al. (2018) Yuhao Zhang, Peng Qi, and Christopher D. Manning. 2018. Graph convolution over pruned dependency trees improves relation extraction. In Empirical Methods in Natural Language Processing (EMNLP).