The idea of distant supervision (mintz2009distant) eliminates the need for manual annotation when obtaining training data for relation extraction. So far, this idea has been used mostly to create sentence-level datasets. However, the assumption of distant supervision, that the two entities of a tuple must appear in the same sentence, is overly strict. For many relations, we may not find an adequate number of evidence sentences because the two entities do not appear in the same sentence. Relation extraction models built on such data can cover only a small number of relations, and most relations in knowledge bases (KBs) remain out of their reach.
To address this issue, we propose a multi-hop relation extraction task where the subject and object entities of a tuple can appear in two different documents, and these two documents are connected via some common entities. We can create a chain of entities from the subject entity to the object entity of a tuple via the common entities across multiple documents. Each link in this chain represents a relation between the entities located at the endpoints of the link. We can determine the relation between the subject and object entities of a tuple by following this chain of relations. This approach can give training instances for more relations than sentence-level distant supervision. Following the proposed multi-hop approach, we create a two-hop relation extraction dataset for the task. Each instance of this dataset has two documents, where the first document contains the subject entity and the second document contains the object entity of a tuple. These two documents are connected via at least one common entity. This idea can be extended to create an N-hop dataset.
We also propose a hierarchical entity graph convolutional network (HEGCN) model for the task. Our proposed model has two levels of graph convolutional networks (GCNs). The first-level GCN of the hierarchy is applied to the entity mention level graph of every document to capture the relations among the entity mentions within a document. The second-level GCN of the hierarchy is applied on a unified entity-level graph, which is built using all the unique entities present in the document chain. This entity-level graph can be built on the document chain of any length and it can capture the relations among the entities across the multiple documents in the chain. Our proposed HEGCN model improves the performance on our two-hop dataset. To summarize, the following are the contributions of this paper:
(1) We propose a multi-hop relation extraction task and create a two-hop dataset. This dataset has more relations than other popular distantly supervised sentence-level or document-level relation extraction datasets.
(2) We propose a novel hierarchical entity graph convolutional network (HEGCN) for multi-hop relation extraction. Our proposed model improves the F1 score by 1.1% on our two-hop dataset, compared to strong neural baselines. (Footnote 1: The source code and data for this paper are available at https://github.com/nusnlp/MHRE.git.)
2 Task Formalization
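Multi-hop relation extraction can be defined as follows. Consider two entities, a subject entity $e_s$ and an object entity $e_o$, and a chain of documents $D = \{d_s \rightarrow d_1 \rightarrow \cdots \rightarrow d_n \rightarrow d_o\}$ where $e_s \in d_s$ and $e_o \in d_o$. There exists a chain of entities $e_s \rightarrow c_1 \rightarrow c_2 \rightarrow \cdots \rightarrow c_{n+1} \rightarrow e_o$ where $c_1 \in d_s \cap d_1$, $\ldots$, $c_{n+1} \in d_n \cap d_o$. The task is to find the relation $r$ between $e_s$ and $e_o$ from a pre-defined set of relations $R \cup \{\text{None}\}$, where $R$ is the set of relations and None indicates that none of the relations in $R$ holds between $e_s$ and $e_o$. A simpler version of this task is two-hop relation extraction where $d_s$ and $d_o$ are directly connected by at least one common entity. In this paper, we focus on two-hop relation extraction.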
3 Related Work
3.1 Relation Extraction Datasets
Distantly supervised datasets are very popular for relation extraction Nayak2021DeepNA. riedel2010modeling (NYT10) and hoffmann2011knowledge (NYT11) mapped Freebase tuples to New York Times (NYT) articles to obtain such datasets. The NYT10 and NYT11 datasets have been used extensively by researchers for relation extraction. TACRED zhang2017position is another dataset created from the TAC KBP evaluations. FewRel 2.0 gao2019fewrel is a few-shot relation extraction dataset. All these datasets are created at the sentence level. DocRED yao2019DocRED is a document-level relation extraction dataset created using Wikipedia articles and Wikidata items. To the best of our knowledge, there does not exist any relation extraction dataset which involves multiple documents.
3.2 Relation Extraction Models
Neural models have performed well on distantly supervised datasets for relation extraction. zeng2014relation; zeng2015distant used convolutional networks with max-pooling on word embeddings for this task, whereas huang2016attention; jat2018attention; nayak2019effective used word-level attention models for single-instance sentence-level relation extraction. lin2016neural; vashishth2018reside; ye2019distant used neural networks in a multi-instance setting to find a relation from a bag of independent sentences. Recently, graph convolutional network (GCN) Kipf2017SemiSupervisedCW models have become popular for many NLP tasks. These models work on non-linear graph structures. zhang2018graph; vashishth2018reside; guo2019aggcn; Zeng2020DoubleGB used graph convolutional networks for relation extraction. They consider each token in a sentence as a node in the graph and use a syntactic dependency tree to create a graph structure among the nodes. Recently, neural joint extraction approaches takanobu2019hrlre; nayak2019ptrnetdecoding were proposed for this task.
3.3 Multi-hop QA versus Multi-hop RE
welbl2018constructing proposed a multi-hop QA dataset (WikiHop) where the answer can only be found using more than one document. Several neural models song2018exploring; cao2019bag; de2019question; kundu2019exploiting have been proposed to solve this task. We have created a two-hop relation extraction dataset (THRED) from this WikiHop dataset. The major difference between these two datasets is that THRED contains many None relations, whereas in the WikiHop dataset, every instance has a correct answer. Extracting the None relation is challenging, since None applies only when none of the relations in the pre-defined relation set holds. As the number of relations in this set increases, it becomes more difficult to predict the relations. As such, we believe the multi-hop RE task is more challenging than the multi-hop QA task.
4 Dataset Construction
We create a two-hop relation extraction dataset from a multi-hop question-answering (QA) dataset, WikiHop welbl2018constructing. welbl2018constructing defined the multi-hop QA task as follows: given a set of supporting documents and a set of candidate answers that are mentioned in these documents, the goal is to find the correct answer to a question by drawing on the supporting documents. They used Wikipedia articles and Wikidata vrandevcic2014wikidata tuples to create this dataset. Each positive tuple in Wikidata has two entities, a subject entity $e_s$ and an object entity $e_o$, and a positive relation $r_p$ between the subject and object entities. The questions are created by combining the subject entity $e_s$ and the relation $r_p$, and the object entity $e_o$ is the correct answer to the question. The other candidate answers are carefully chosen from Wikidata entities so that they have a similar type to the correct answer. The supporting documents are chosen in such a way that at least two documents are needed to find the correct answer. This means the subject entity $e_s$ and the object entity $e_o$ do not appear in the same document.
They used a bipartite graph partition technique to create the dataset. In this bipartite graph, vertices on one side correspond to Wikidata entities, and vertices on the other side correspond to Wikipedia articles. An edge is created between an entity vertex and a document vertex if the document contains the entity. Traversing the graph from the vertex of the subject entity $e_s$ visits many document vertices and entity vertices. These constitute the supporting document set and the candidate answer set. If the candidate answer set does not contain the object entity $e_o$, which is the correct answer, the instance is discarded. They also limited the length of the traversal to three documents. welbl2018constructing only released the supporting documents, questions, and candidate answers for their dataset. They did not release the connecting entities.
We convert this WikiHop dataset into a two-hop relation extraction dataset. The subject entities and the candidate entities can be easily found in the documents using string matching. We use the named entity recognizer from spaCy (footnote 2: https://spacy.io/) to find the other entities in the documents, and these entities can link the documents. We find that most of the WikiHop question-answer instances are two-hop instances. That is, for most instances of the WikiHop dataset, there is at least one document pair in the supporting document set where the first document contains the subject entity, the second document contains the correct answer, and the two documents are directly connected via some third entity. To simplify the multi-hop relation extraction task, we fix the hop count at 2.
For every instance of the WikiHop dataset, we can easily find the subject entity $e_s$ and the positive relation $r_p$ from the question. The correct answer is the object entity $e_o$ of a positive tuple, so $(e_s, e_o, r_p)$ is the positive tuple for relation extraction. For any other candidate answer $e_c$, the entity pair $(e_s, e_c)$ is considered a None tuple if there exists no relation for any of the four pairs $(e_s, e_c)$, $(e_c, e_s)$, $(e_o, e_c)$, and $(e_c, e_o)$ in Wikidata. We check the no-relation condition for these four entity pairs involving $e_s$, $e_o$, and $e_c$ to reduce the distant supervision noise in the dataset for None tuples. We create a None candidate set $C_{none}$ from all such $e_c$. We first find all possible pairs of documents $(d_s, d_o)$ from the supporting document set such that the first document contains the subject entity $e_s$ and the second document contains either the entity $e_o$ or one of the entities from $C_{none}$. We discard those pairs of documents that do not contain any common entity. The document pairs where the second document contains the entity $e_o$ are considered document chains for the positive tuple $(e_s, e_o, r_p)$. All other document pairs, where the second document contains an entity $e_c$ from the set $C_{none}$, are considered document chains for the None tuple $(e_s, e_c, \text{None})$.
In this way, using distant supervision, we can create a dataset for two-hop relation extraction. Each instance of this dataset has a chain of documents $d_s \rightarrow d_o$ of length 2 that is the textual source of a tuple $(e_s, e_o, r)$. The document $d_s$ contains the subject entity $e_s$ and the document $d_o$ contains the object entity $e_o$. The two documents are connected by at least one common entity $c$. There exists at least one entity chain $e_s \rightarrow c \rightarrow e_o$ in the document chain. The goal is to find the relation $r$ between $e_s$ and $e_o$ from the set $R \cup \{\text{None}\}$, where $R$ is the pre-defined set of relations. We refer to this two-hop dataset as THRED (two-hop relation extraction dataset) in the remaining sections of this paper. We manually checked 100 randomly selected positive samples and 100 randomly selected negative samples, and found that 76% of the selected positive samples and 82% of the selected negative samples are accurate.
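The chain construction described above can be sketched in a few lines of Python. This is a minimal illustration, not the released code: the function names `is_none_candidate` and `build_chains`, the toy document ids, and the representation of a document as a set of entity strings are all assumptions made for the example, as is the choice to ignore the subject and target entities themselves when looking for a connecting entity.

```python
def is_none_candidate(subject, obj, candidate, kb_pairs):
    """True if the knowledge base has no relation for any of the four
    ordered pairs that link `candidate` with the subject or the object."""
    pairs = [(subject, candidate), (candidate, subject),
             (obj, candidate), (candidate, obj)]
    return not any(pair in kb_pairs for pair in pairs)


def build_chains(docs, subject, obj, none_candidates):
    """docs maps a document id to the set of entity strings found in it
    (hypothetical format). Returns (first_doc, second_doc, target, label)."""
    targets = [(obj, "positive")] + [(c, "None") for c in sorted(none_candidates)]
    chains = []
    for first in sorted(docs):
        if subject not in docs[first]:
            continue
        for second in sorted(docs):
            if second == first:
                continue
            for target, label in targets:
                if target not in docs[second]:
                    continue
                # Assumption: the connecting entity is neither end of the tuple.
                common = (docs[first] & docs[second]) - {subject, target}
                if common:  # pairs without a common entity are discarded
                    chains.append((first, second, target, label))
    return chains


# Toy data, invented for illustration only.
docs = {
    "d1": {"subj", "bridge"},
    "d2": {"bridge", "obj"},
    "d3": {"bridge", "cand"},
    "d4": {"obj", "other"},
}
kb_pairs = {("subj", "obj"), ("obj", "far")}
none_candidates = {c for c in ("cand", "far")
                   if is_none_candidate("subj", "obj", c, kb_pairs)}
chains = build_chains(docs, "subj", "obj", none_candidates)
```

In this toy setting, `far` is rejected as a None candidate because the knowledge base links it to the object, and the pair (`d1`, `d4`) is dropped because the two documents share no entity.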
| Question | located_in_administrative_entity Zoo Lake |

| #Positive entity pairs | 21,490 | 618 |
4.1 Dataset Statistics
The training, validation, and test data of the WikiHop dataset are created using distant supervision, but the validation and test data are manually verified. The WikiHop test data is blind and not released, so we use their validation data to create the test data for our task and use their training data for our training and validation purposes. We include the statistics of our two-hop relation extraction dataset in Table 2 and the statistics on the number of common entities present in the two documents of a chain in Table 3. We split the training data randomly, with 90% for training and 10% for validation. From Table 2, we see that the dataset contains many more None tuples than positive tuples. We therefore randomly select None tuples so that the number of None tuples is the same as the number of positive tuples for training and validation. For evaluation, we consider the entire test dataset. From Table 4, we see that our THRED dataset contains more relations than other distantly supervised relation extraction datasets such as the New York Times datasets riedel2010modeling; hoffmann2011knowledge or DocRED yao2019DocRED.
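The down-sampling of None tuples and the 90/10 split can be illustrated with the standard library alone. The sketch below is one possible reading of the procedure: it balances first and splits afterwards, which the text does not specify, and the function name `balance_and_split`, the seed, and the toy instances are assumptions.

```python
import random


def balance_and_split(positives, nones, valid_fraction=0.1, seed=13):
    """Sample as many None instances as there are positive instances,
    then split the balanced data into training and validation parts."""
    rng = random.Random(seed)
    kept_nones = rng.sample(nones, min(len(nones), len(positives)))
    data = list(positives) + kept_nones
    rng.shuffle(data)
    n_valid = int(round(len(data) * valid_fraction))
    return data[n_valid:], data[:n_valid]


# Toy instances, invented for illustration only.
positives = [("pos", i) for i in range(100)]
nones = [("none", i) for i in range(500)]
train, valid = balance_and_split(positives, nones)
```

The test data is left untouched in this scheme, matching the statement that evaluation uses the entire test set.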
5 Proposed HEGCN Model
We propose a hierarchical entity graph convolutional network (HEGCN) for multi-hop relation extraction. We encode the documents in a document chain using a bi-directional long short-term memory (BiLSTM) layer hochreiter1997long. On top of the BiLSTM layer, we use two graph convolutional networks (GCN), one after another in a hierarchy. In the first level of the GCN hierarchy, we construct a separate entity mention graph on each document of the chain using all the entities mentioned in that document. Each mention of an entity in a document is considered as a separate node in the graph. We use a graph convolutional network (GCN) to represent the entity mention graph of each document to capture the relations among the entity mentions in the document. We then construct a unified entity-level graph across all the documents in the chain. Each node of this entity-level graph represents a unique entity in the document chain. Each common entity between two documents in the chain is represented by a single node in the graph. We use a GCN to represent this entity-level graph to capture the relations among the entities across the documents. We concatenate the representations of the nodes of the subject entity and object entity and pass the result to a feed-forward layer with softmax for relation classification.
5.1 Documents Encoding Layer
We use two types of embedding vectors: (1) a word embedding vector $w$ and (2) an entity token indicator embedding vector $z$, which indicates whether a word belongs to the subject entity, the object entity, or a common entity. The subject entity and the object entity are each assigned their own embedding index. The common entities in the document chain are assigned embedding indices in increasing order, starting from a fixed index. The same entity present in two documents of the chain gets the same embedding index. One embedding index is reserved for padding and another is used for all other tokens in the documents. A document is represented as a sequence of vectors $\{x_1, x_2, \ldots, x_n\}$ where $x_t = w_t \| z_t$. Here $\|$ represents the concatenation of vectors and $n$ is the document length. We concatenate all documents in a chain sequentially, using a document separator token. These token vectors are passed to a BiLSTM layer to capture the interaction among the documents in a chain. $\overrightarrow{h_t}$ and $\overleftarrow{h_t}$ are the outputs at the $t$th step of the forward LSTM and backward LSTM, respectively. We concatenate them to obtain the $t$th BiLSTM output $h_t = \overrightarrow{h_t} \| \overleftarrow{h_t}$.
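As a rough illustration of the entity token indicator scheme, the sketch below assigns one indicator id per token across a document chain. The concrete index values (`PAD`, `OTHER`, `SUBJECT`, `OBJECT`, `FIRST_COMMON`) are placeholders chosen for the example and should not be read as the values used in the paper; the input format and the function name `indicator_ids` are likewise assumptions.

```python
# Placeholder index values, chosen only for this example.
PAD, OTHER, SUBJECT, OBJECT, FIRST_COMMON = 0, 1, 2, 3, 4


def indicator_ids(chain, subject, obj, common_entities):
    """chain: list of documents, each a list of (token, entity_or_None).
    A common entity keeps the same id in every document of the chain."""
    common_index = {}
    chain_ids = []
    for document in chain:
        doc_ids = []
        for _token, entity in document:
            if entity == subject:
                doc_ids.append(SUBJECT)
            elif entity == obj:
                doc_ids.append(OBJECT)
            elif entity in common_entities:
                if entity not in common_index:
                    common_index[entity] = FIRST_COMMON + len(common_index)
                doc_ids.append(common_index[entity])
            else:
                doc_ids.append(OTHER)
        chain_ids.append(doc_ids)
    return chain_ids


# Toy two-document chain, invented for illustration only.
chain = [
    [("Alpha", "alpha"), ("visited", None), ("Beta", "beta"),
     ("and", None), ("Gamma", "gamma")],
    [("Gamma", "gamma"), ("borders", None), ("Beta", "beta"),
     ("near", None), ("Omega", "omega")],
]
ids = indicator_ids(chain, "alpha", "omega", {"beta", "gamma"})
```

`PAD` is not produced by this function; it is reserved for padding when sequences are batched.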
5.2 Hierarchical Entity Graph Convolutional Layers
Kipf2017SemiSupervisedCW proposed graph convolutional networks (GCN) which work on graph structures. Here, we describe the GCN which is used in our model. We represent a graph $G$ with $V$ nodes using an adjacency matrix $A$ of size $V \times V$. If there is an edge between node $i$ and node $j$, then $A_{ij} = A_{ji} = 1$. We also add self loops, $A_{ii} = 1$, in the graph $G$. We normalize the adjacency matrix by using the symmetric normalization proposed by Kipf2017SemiSupervisedCW. A diagonal node degree matrix $D$ of size $V \times V$ is used in the normalization of $A$. $D_{ii}$ is the number of edges that are connected to the node $i$ in $G$ and $\hat{A} = D^{-1/2} A D^{-1/2}$ is the corresponding normalized adjacency matrix of $A$. Each node of the graph receives the hidden representation of its neighboring nodes from the $(l-1)$th layer and uses the following operation to update its own hidden representation.
$$g_i^{l} = \sigma\Big(\sum_{j=1}^{V} \hat{A}_{ij} W^{l} g_j^{l-1}\Big)$$
$W^l$ is the trainable weight matrix of the $l$th layer of the GCN, $g_i^l$ is the representation of the $i$th node of the graph at the $l$th layer, and $\sigma$ is a non-linear activation function such as ReLU. If $g_i^l$ has the dimension of $d_g$, then the dimension of the weight matrix $W^l$ is $d_g \times d_g$. $g_i^0$ is the initial input to the GCN.
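A single GCN layer with symmetric normalization can be written in plain Python as follows. This is a didactic sketch rather than the model's implementation: it assumes ReLU as the activation, stores node vectors as rows (so the update is computed as $\hat{A}\,G\,W$), and uses the illustrative names `normalize_adjacency` and `gcn_layer`.

```python
import math


def normalize_adjacency(adj):
    """Add self loops, then apply D^{-1/2} A D^{-1/2}."""
    n = len(adj)
    a = [[1.0 if (i == j or adj[i][j]) else 0.0 for j in range(n)]
         for i in range(n)]
    degree = [sum(row) for row in a]  # degree counts the self loop
    return [[a[i][j] / math.sqrt(degree[i] * degree[j]) for j in range(n)]
            for i in range(n)]


def gcn_layer(a_hat, g_prev, weight):
    """One layer: ReLU(a_hat @ g_prev @ weight), with node vectors as rows."""
    n = len(g_prev)
    d_out = len(weight[0])
    # Project every node vector with the layer's weight matrix.
    projected = [[sum(g[k] * weight[k][c] for k in range(len(g)))
                  for c in range(d_out)] for g in g_prev]
    # Aggregate the projected vectors of the neighbors (and the node itself).
    return [[max(0.0, sum(a_hat[i][j] * projected[j][c] for j in range(n)))
             for c in range(d_out)] for i in range(n)]


# Two nodes joined by one edge, identity weights, one-hot inputs.
a_hat = normalize_adjacency([[0, 1], [1, 0]])
g1 = gcn_layer(a_hat, [[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
```

With self loops, each of the two nodes has degree 2, so every entry of the normalized adjacency is 0.5 and each node ends up with the average of the two input vectors.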
5.2.1 Entity Mention Graph Layer
We construct an entity mention graph (EMG) for each document in the chain on top of the document encoding layer. An entity string may appear at multiple locations in a document and each appearance is considered as an entity mention. We add a node in the graph for each entity mention. We connect two entity mention nodes if they appear in the same sentence (EMG type 1 edge). We assume that since they appear in the same sentence, there may exist some relation between them. We also connect two entity mention nodes if the strings of the two entity mentions are identical (EMG type 2 edge). Let $m_1, m_2, \ldots, m_p$ be the sequence of entity mention nodes listed in the order of their appearance in a document. We connect nodes $m_i$ and $m_{i+1}$ ($1 \le i < p$) with an edge (EMG type 3 edge). EMG type 3 edges create a linear chain of the entity mentions and ensure that the graph is connected. We use a graph convolutional network on this graph topology to capture the relations among the entity mentions in a document.
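The three EMG edge types can be made concrete with a small adjacency-matrix builder. The input format, a list of (entity string, sentence index) pairs in order of appearance, and the name `build_mention_graph` are assumptions made for this sketch.

```python
def build_mention_graph(mentions):
    """mentions: list of (entity_string, sentence_index) in document order.
    Returns a symmetric 0/1 adjacency matrix without self loops."""
    n = len(mentions)
    adj = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            same_sentence = mentions[i][1] == mentions[j][1]  # EMG type 1
            same_string = mentions[i][0] == mentions[j][0]    # EMG type 2
            consecutive = j == i + 1                          # EMG type 3
            if same_sentence or same_string or consecutive:
                adj[i][j] = adj[j][i] = 1
    return adj


# Four mentions over three sentences; "A" is mentioned twice.
mentions = [("A", 0), ("B", 0), ("C", 1), ("A", 2)]
adj = build_mention_graph(mentions)
```

Here the two mentions of `A` are linked by a type 2 edge, and the type 3 edges alone already connect all four nodes in a chain.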
We obtain the initial representations of the entity mention nodes from the hidden representations of the document encoding layer. We concatenate the hidden vector of the first token of an entity mention, the hidden vector of its last token, and a context vector to obtain the entity mention node representation. The context vector is obtained using an attention mechanism on the tokens of the sentence in which the entity mention appears.
$h_b$ and $h_e$ are the hidden vectors from the document encoding layer of the first and last token of an entity mention. $W_a$ is a trainable weight matrix used to compute the attention scores, $h_i$ is the hidden vector of the $i$th token of the sentence in which the entity mention is located, and $\alpha_i$ is the normalized attention score for the $i$th token with respect to the entity mention. $T$ is the length of the sentence in which the entity mention is located, and $c = \sum_{i=1}^{T} \alpha_i h_i$ is the context vector. The entity mention node vector $v_k = h_b \| h_e \| c$ of the $k$th node in the graph is passed to the GCN as $g_k^0$. The parameters of this GCN are shared across the documents in a chain. This layer of the model is referred to as entity mention-level graph convolutional network or EMGCN.
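The construction of an entity mention node vector can be sketched as below. The exact attention scoring function is not recoverable from the text, so this example assumes a bilinear score between each token vector and the concatenated first and last token vectors, followed by a softmax; the function names and the toy inputs are also assumptions.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def mention_node_vector(sentence_hidden, first, last, w_a):
    """Concatenate first-token, last-token, and attention context vectors.
    sentence_hidden: one hidden vector (list) per token of the sentence.
    w_a: assumed bilinear attention weights of shape d x 2d."""
    h_b, h_e = sentence_hidden[first], sentence_hidden[last]
    query = h_b + h_e  # list concatenation
    projected = [sum(w * q for w, q in zip(row, query)) for row in w_a]
    scores = [sum(h * p for h, p in zip(h_i, projected))
              for h_i in sentence_hidden]
    alpha = softmax(scores)
    dim = len(h_b)
    context = [sum(a * h_i[k] for a, h_i in zip(alpha, sentence_hidden))
               for k in range(dim)]
    return h_b + h_e + context


# Two-token sentence with zero attention weights, so attention is uniform.
sentence_hidden = [[1.0, 0.0], [0.0, 1.0]]
w_a = [[0.0] * 4, [0.0] * 4]
node = mention_node_vector(sentence_hidden, 0, 1, w_a)
```

The resulting node vector has three times the dimension of a single hidden vector, which is the size the first GCN layer receives as input in this sketch.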
5.2.2 Entity Graph Layer
We construct a unified entity graph (EG) on top of the entity mention graphs. First, we construct an entity graph for each document, where each unique entity string is represented as an entity node in the graph. We add an edge between two entity nodes if the strings of the two entities appear together in at least one sentence in the document (EG type 1 edge). We also form a sequence of entity nodes based on the order of appearance of the entities in a document, where only the first occurrence of multiple occurrences of an entity is kept in the sequence. We connect two consecutive entity nodes in the sequence with an edge (EG type 2 edge). This ensures that the entire entity graph remains connected.
We construct one entity graph for each document in the document chain. We unify the entity graphs of multiple documents by merging the nodes of common entities between them. The unified entity graph contains all the nodes from the multiple entity graphs, but the common entity nodes which appear in two entity graphs are merged into one node in the unified graph. There is an edge between two entity nodes in the unified entity graph if there exists an edge between them in any of the entity graphs of the documents.
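The per-document entity graphs and their unification can be sketched with sets of undirected edges. Because nodes are keyed by entity string in this sketch, merging common entities happens automatically when the edge sets are united; the input format and the names `entity_graph_edges` and `unify` are assumptions.

```python
def entity_graph_edges(sentences):
    """sentences: list of lists of entity strings, in order of appearance,
    for one document. Returns a set of undirected edges (frozensets)."""
    edges = set()
    order = []  # first occurrences of entities, in document order
    for sentence in sentences:
        for entity in sentence:
            if entity not in order:
                order.append(entity)
        for a in sentence:
            for b in sentence:
                if a != b:
                    edges.add(frozenset((a, b)))  # EG type 1
    for a, b in zip(order, order[1:]):
        edges.add(frozenset((a, b)))              # EG type 2
    return edges


def unify(edge_sets):
    """An edge exists in the unified graph if it exists in any document."""
    unified = set()
    for edges in edge_sets:
        unified |= edges
    return unified


# Toy chain: "C" is the common entity of the two documents.
doc1 = entity_graph_edges([["S", "C"], ["X"]])
doc2 = entity_graph_edges([["C"], ["O"]])
unified = unify([doc1, doc2])
nodes = set().union(*unified)
```

In the toy chain the subject `S` and object `O` are not adjacent, but both are connected to the merged node `C`, which is what lets the second GCN pass information across documents.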
We obtain the initial representations of the entity nodes from the GCN outputs of the entity mention graphs. For the common entities between two documents, we average the GCN outputs of the entity mention nodes that have an identical string as the entity from the entity mention graphs of the two documents. For other entity nodes that appear only in one document, we average the GCN outputs of the entity mention nodes that have an identical string as the entity from the entity mention graph of that document. Each entity vector is passed to another graph convolutional network as $g_i^0$, which represents the initial representation of the $i$th entity node in the unified entity graph. We use a graph convolutional network on this graph topology to capture the relations among the entities across the documents in the document chain. This layer of the model is referred to as entity-level graph convolutional network or EGCN.
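The averaging step that turns mention-level outputs into entity-level inputs is straightforward to sketch. The function name `entity_node_inputs` and the pooled input format, in which mentions from all documents of the chain are passed in one list, are assumptions for this example.

```python
def entity_node_inputs(mentions):
    """mentions: list of (entity_string, vector) pairs pooled over all
    documents in the chain. Returns one averaged vector per entity."""
    grouped = {}
    for entity, vector in mentions:
        grouped.setdefault(entity, []).append(vector)
    return {entity: [sum(column) / len(vectors) for column in zip(*vectors)]
            for entity, vectors in grouped.items()}


# "C" has one mention in each of two documents, "S" has a single mention.
inputs = entity_node_inputs([
    ("C", [1.0, 3.0]),
    ("S", [2.0, 2.0]),
    ("C", [3.0, 5.0]),
])
```

Pooling over the whole chain covers both cases in the text: a common entity is averaged over its mentions in both documents, and any other entity over its mentions in its single document.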
5.3 Relation Classifier
We concatenate the EGCN outputs of the nodes corresponding to the subject entity $e_s$ and object entity $e_o$, and pass the concatenated vector to a feed-forward network (FFN) with softmax to predict the normalized probabilities for the relation labels.
$$r = \mathrm{softmax}\big(W_r (g_{e_s} \| g_{e_o}) + b_r\big)$$
$W_r$ is the weight matrix, $b_r$ is the bias vector of the FFN, and $r$ is the vector of normalized probabilities of relation labels.
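The classifier reduces to a linear layer over the concatenated subject and object node vectors, followed by a softmax. The sketch below uses plain lists and illustrative names (`classify`, `weight`, `bias`); the number of labels and the vector sizes are arbitrary toy values.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(g_subject, g_object, weight, bias):
    """weight has one row per relation label, each of length 2 * d."""
    features = g_subject + g_object  # concatenation of the two node vectors
    logits = [sum(w * f for w, f in zip(row, features)) + b
              for row, b in zip(weight, bias)]
    return softmax(logits)


# Three labels, two-dimensional node vectors, all-zero parameters.
probs = classify([0.2, 0.4], [0.1, 0.3],
                 [[0.0] * 4 for _ in range(3)], [0.0, 0.0, 0.0])
# A large bias on the last label should dominate the distribution.
peaked = classify([0.2, 0.4], [0.1, 0.3],
                  [[0.0] * 4 for _ in range(3)], [0.0, 0.0, 10.0])
```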
6 Experiments
6.1 Baselines
We implement four neural baseline models for comparison with our proposed HEGCN model. Similar to our proposed model, we represent the tokens in the documents using pre-trained word embedding vectors and entity token indicator vectors. We use a document separator token when concatenating the vectors of two documents in a chain.
(1) CNN: We apply the convolution operation on the sequence of token vectors with different kernel sizes. A max-pooling operation is applied to choose the features from the outputs of the convolution operation. This feature vector is passed to a feed-forward layer with softmax to classify the relation.
(2) BiLSTM: The token vectors of the document chain are passed to a BiLSTM layer to encode its meaning. We obtain the entity mention vectors of the subject entity and the object entity by concatenating the hidden vectors of their first and last tokens. We average the entity mention vectors of the corresponding entity to obtain the representation of the subject entity and the object entity. These two vectors are concatenated and passed to a feed-forward layer with softmax to find the relation between them.
(3) BiLSTM_CNN: This is a combination of the BiLSTM and CNN model described above. The token vectors of the documents are passed to a BiLSTM layer and then we use the convolution operation with max-pooling with different convolutional kernel sizes on the hidden vectors of the BiLSTM layer. The feature vector obtained from the max-pooling operation is passed to a feed-forward layer with softmax to classify the relation.
(4) LinkPath: This model uses the explicit paths kundu2019exploiting from the subject entity to the object entity via the common entities to find the relation. As we consider only two-hop relations, each path from $e_s$ to $e_o$ will be of the form $e_s \rightarrow c \rightarrow e_o$, where $c$ is a common entity. Since there can be multiple common entities between two documents and these common entities as well as the subject and object entities can appear multiple times in the two documents, there exist multiple paths from $e_s$ to $e_o$. Each path is formed with four entity mentions: (i) the entity mentions of the subject entity and the common entity in the first document, and (ii) the entity mentions of the common entity and the object entity in the second document. We concatenate the BiLSTM hidden vectors of the start and end token of an entity mention to obtain its representation. Each path is constructed by concatenating all the four entity mentions of the path. This can be extended from two-hop to multi-hop relations by using a recurrent neural network that takes the path entity mentions as input, and outputs the hidden representation of the path. We average the vector representations of all the paths and pass it to a feed-forward layer with softmax to find the relation.
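The path enumeration in the LinkPath baseline can be illustrated with `itertools.product`. In this sketch each document is a dictionary from entity string to a list of mention vectors; this format and the names `enumerate_paths` and `average_paths` are assumptions made for the example.

```python
from itertools import product


def enumerate_paths(doc1, doc2, subject, obj):
    """doc1 and doc2 map entity strings to lists of mention vectors.
    Each path concatenates four mentions: subject and common entity in
    the first document, common entity and object in the second."""
    common = (set(doc1) & set(doc2)) - {subject, obj}
    paths = []
    for entity in sorted(common):
        for s, c1, c2, o in product(doc1[subject], doc1[entity],
                                    doc2[entity], doc2[obj]):
            paths.append(s + c1 + c2 + o)
    return paths


def average_paths(paths):
    """Element-wise mean of all path vectors."""
    return [sum(column) / len(paths) for column in zip(*paths)]


# The subject has two mentions, so two paths go through the common entity.
doc1 = {"s": [[1.0], [2.0]], "c": [[3.0]]}
doc2 = {"c": [[4.0]], "o": [[5.0]]}
paths = enumerate_paths(doc1, doc2, "s", "o")
path_vector = average_paths(paths)
```

The number of paths grows with the product of the mention counts, which is why the baseline averages the path vectors before classification.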
6.2 Parameter Settings
We use GloVe (pennington2014glove) word embeddings of dimension 300 and update the embeddings during training. We set the dimension to be 20 for the entity token indicator embedding vectors. The hidden vector dimension of the forward and backward LSTM is set at . The dimension of BiLSTM output is . We use different convolution filters with kernel width of , , and for feature extraction. We use one convolutional layer in both entity mention-level GCN and entity-level GCN in our final model. Dropout layers (Srivastava2014DropoutAS) are used in our network with a dropout rate of to avoid overfitting. We train our models with a mini-batch size of and use negative log-likelihood as our objective function. We optimize the network parameters using the Adagrad optimizer (duchi2011adaptive). For evaluation, we use precision, recall, and F1 score. We do not include the None relation in the evaluation. A confidence threshold that achieves the highest F1 score on the validation dataset is used to decide if the relation of a test instance belongs to the set of relations or None.
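The evaluation protocol, with None excluded from scoring and a confidence threshold tuned on validation data, can be sketched as follows. How the confidence is derived from the model output is not stated in the text; this example assumes it is the probability of the best non-None relation, and all function names and toy values are illustrative.

```python
def precision_recall_f1(gold, predicted):
    """Scores computed over non-None relations only."""
    correct = sum(1 for g, p in zip(gold, predicted)
                  if p != "None" and p == g)
    n_predicted = sum(1 for p in predicted if p != "None")
    n_gold = sum(1 for g in gold if g != "None")
    precision = correct / n_predicted if n_predicted else 0.0
    recall = correct / n_gold if n_gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def apply_threshold(scored, threshold):
    """scored: list of (best_relation, confidence). Low-confidence
    predictions are mapped to None."""
    return [rel if conf >= threshold else "None" for rel, conf in scored]


def pick_threshold(gold, scored, candidates):
    """Return the candidate threshold with the highest validation F1."""
    return max(candidates,
               key=lambda t: precision_recall_f1(
                   gold, apply_threshold(scored, t))[2])


# Toy validation data, invented for illustration only.
gold = ["r1", "None", "r2", "None"]
scored = [("r1", 0.9), ("r2", 0.4), ("r2", 0.8), ("r1", 0.6)]
best = pick_threshold(gold, scored, [0.3, 0.5, 0.7])
```

On the toy data, raising the threshold removes the two spurious predictions on None instances without losing either correct prediction, so the highest candidate wins.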
6.3 Experimental Results
We include the median of five runs of the models on the THRED dataset in Table 5. We see that adding a BiLSTM in the document encoding layer improves the performance by close to 5% in F1 score. The BiLSTM, BiLSTM_CNN, and LinkPath models achieve similar F1 scores. When we add our proposed hierarchical entity graph convolutional layer on top of the BiLSTM layer, we get another 1.1% F1 score improvement over the next best BiLSTM model. We perform a statistical significance test using bootstrap resampling to compare each baseline and our HEGCN model, and have ascertained that the higher F1 score achieved by our model is statistically significant ().
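A paired bootstrap resampling test of the kind mentioned above can be sketched with the standard library. The exact test procedure is not described in the text, so the version below is one common variant and should be read as an assumption: it resamples test instances with replacement and reports the fraction of resamples in which system A fails to beat system B on F1.

```python
import random


def micro_f1(gold, predicted):
    """F1 over non-None relations."""
    correct = sum(1 for g, p in zip(gold, predicted)
                  if p != "None" and p == g)
    if correct == 0:
        return 0.0
    n_predicted = sum(1 for p in predicted if p != "None")
    n_gold = sum(1 for g in gold if g != "None")
    precision, recall = correct / n_predicted, correct / n_gold
    return 2 * precision * recall / (precision + recall)


def paired_bootstrap(gold, pred_a, pred_b, n_resamples=1000, seed=7):
    """Estimated p-value for the claim that system A beats system B."""
    rng = random.Random(seed)
    n = len(gold)
    not_better = 0
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        sample_gold = [gold[i] for i in idx]
        f1_a = micro_f1(sample_gold, [pred_a[i] for i in idx])
        f1_b = micro_f1(sample_gold, [pred_b[i] for i in idx])
        if f1_a <= f1_b:
            not_better += 1
    return not_better / n_resamples


# Extreme toy case: system A is always right, system B never predicts.
gold = ["r1"] * 20
always_right = ["r1"] * 20
always_none = ["None"] * 20
p_value = paired_bootstrap(gold, always_right, always_none)
reverse = paired_bootstrap(gold, always_none, always_right)
```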
6.4 Ablation Studies
We include the performance of our HEGCN model with different numbers of convolutional layers in the entity mention-level GCN (EMGCN) and entity-level GCN (EGCN) in Table 6. When we increase the number of layers in either GCN, the performance of the model drops. We finally use only one convolutional layer in both EMGCN and EGCN.
In Table 7, we include the ablation study of the different types of edges in EMGCN and EGCN. Removing any type of edges reduces the F1 score.
| Edge type removed | Precision | Recall | F1 |
|---|---|---|---|
| – EMG type 1 | 0.679 | 0.689 | 0.684 |
| – EMG type 2 | 0.698 | 0.662 | 0.680 |
| – EMG type 3 | 0.666 | 0.693 | 0.679 |
| – EG type 1 | 0.704 | 0.659 | 0.681 |
| – EG type 2 | 0.674 | 0.691 | 0.683 |
7 Conclusion
In this paper, we show how the idea of distant supervision can be extended from sentence-level extraction to multi-hop extraction to cover more relations. We propose a general approach to create multi-hop relation extraction datasets. Following this approach, we create a two-hop relation extraction dataset that covers a higher number of relations from knowledge bases than other distantly supervised relation extraction datasets. We also propose a hierarchical entity graph convolutional network for this task. The two levels of GCN in our model help to capture the relation cues within documents and across documents. Our proposed model improves the F1 score by 1.1% on our two-hop dataset, compared to a strong neural baseline, and it can be readily extended to N-hop datasets.