Sentence-level relation extraction (RE) aims at identifying the relationship between two entities mentioned in a sentence. RE is crucial to the structural perception of human language, and also benefits many NLP applications such as automated knowledge base construction distiawan2019neural, event understanding wang2020joint, discourse understanding yu2020dialogue, and question answering zhao2020condition. The modern tools of choice for RE are large-scale pretrained language models (PLMs), which encode individual sentences to obtain sentence-level representations liu2019roberta; joshi-etal-2020-spanbert; yamada-etal-2020-luke.
Existing work considers entity types and textual context as essential properties for RE peng2020learning; peters2019knowledge; zhou2021improved. Nonetheless, most existing RE models only capture these properties locally within individual instances, while not globally modeling them from the whole dataset. Given the insufficient features of a single sentence, it is beneficial to model these properties from the whole dataset and use them to enrich the semantics of individual instances.
To overcome the aforementioned limitation, we propose to mine the entity and contextual information beyond individual instances so as to further improve the relation representations. Particularly, we first construct a heterogeneous graph to connect the instances sharing common properties for RE. This graph includes the sentences and property caches. Each cache represents a property of entity types or contextual topics. We connect every sentence to the corresponding property caches (see Figure 1), and perform message passing over edges based on a graph neural network (GNN). In this way, the property caches aggregate the features from connected sentences, which will act as a complement to the sentence-level features and provide prior knowledge when identifying relations.
The constructed graph connecting sentences has the same scale as the whole dataset, which leads to high computational complexity of the GNN. To address this issue, our idea is to view the message passing of GNNs as data loading in computer systems, adapting classical caching techniques to efficiently mine the property information from all sentences. We encapsulate this computational idea in a new GNN module, called GraphCache (Graph Neural Network as Caching), that uses an online updating strategy to refresh the property caches’ representations. In addition, we design an attention-based global-local fusion module to augment the sentence-level representations using the property caches with adaptive weights.
GraphCache can be incorporated into popular RE models to improve their effectiveness without increasing their time complexity, as analyzed in theory (Section 3.2). As far as we know, ours is the first work to propagate the features across instances to enrich the semantics for sentence-level RE. We evaluate GraphCache on three public RE benchmarks including TACRED zhang2017tacred, SemEval-2010 Task 8 hendrickx2019semeval, and TACREV alt-etal-2020-tacred. Empirical results show that GraphCache consistently improves the effectiveness of popular RE models by a significant margin and propagates features between all sentences in an efficient manner.
2 Related Work
Sentence-Level Relation Extraction. Early research efforts zeng-etal-2014-relation; wang-etal-2016-relation; zhang2017tacred train RE models from scratch based on lexicon-level features. Recent work has shifted to fine-tuning pretrained language models (PLMs; devlin-etal-2019-bert; liu2019roberta), resulting in better performance. For example, BERT-MTB baldini-soares-etal-2019-matching continually fine-tunes the PLM with a matching-the-blanks objective that decides whether two sentences share the same entity. SpanBERT joshi-etal-2020-spanbert pretrains a masked language model on random contiguous spans to learn span boundaries and predict the entire masked span. LUKE yamada-etal-2020-luke extends the PLM’s vocabulary with entities from Wikipedia and proposes an entity-aware self-attention mechanism. K-Adapter wang2020k fixes the parameters of the PLM and uses feature adapters to infuse factual and linguistic knowledge. Despite their effectiveness, most existing work on sentence-level RE exploits the entity information and context within only an individual instance, while we propose to globally capture the semantic information from the whole dataset to augment the relation representations. Our model can be flexibly plugged into existing RE models and improve their effectiveness without increasing the time complexity.
Graph Neural Networks for Natural Language Processing. Due to the large body of work on applying GNNs to NLP, we refer readers to a recent survey wu2021graph for a general review. GNNs have been explored in several NLP tasks such as semantic role labeling marcheggiani2017encoding, machine translation bastings2017graph, and text classification henaff2015deep; defferrard2016convolutional; kipf2016semi; peng2018large; yao2019graph. GNNs have also been widely adopted in various variants of relation extraction: on the sentence level zhang2018graph; zhu2019graph; guo-etal-2019-attention, the document level sahu2019inter; christopoulou2019connecting; nan-etal-2020-reasoning; zeng2020double, and the dialogue level xue2021gdpnet. However, for sentence-level relation extraction, most existing work zhang2018graph; guo2019attention; wu2019simplifying uses graph neural networks to encode the relation representations of individual instances rather than performing message passing between instances. In contrast, we build a heterogeneous graph to connect the instances that share properties for RE, and design a caching updater to efficiently perform message passing between instances.
Task Definition. Sentence-level relation extraction (RE) aims to identify the relation between a pair of entities in a sentence. In this task, each instance is composed of a sentence, the subject and object entities, and entity types. For example, in the sentence ‘Mary gave birth to Jerry at the age of 21.’ (we use underline and wavy line to denote subject and object respectively by default), ‘Mary’ and ‘Jerry’ are the entities, the entity types are both person, and the ground-truth relation between ‘Jerry’ and ‘Mary’ is parent.
We propose GraphCache (Graph Neural Networks as Caching) as a message passing methodology to model the dataset-level property representations and use them to enrich every instance’s semantics. GraphCache creates a graph representation where sentences with shared property information are connected with property caches. GraphCache first models the global semantic information by aggregating the features from the whole dataset, and then fuses the global and local features to augment the relational representations for every sentence.
We liken the message passing in GNNs to caching in computer systems. Caching is about loading data from high-volume disks to low-volume caches, so as to accelerate data loading. Analogously, when GNNs perform the message passing between sentences through a smaller number of bridge nodes, we can think of the massive number of sentences in the dataset as the disk data, and the properties, which aggregate the features from sentences, as caches. GraphCache can be flexibly plugged into existing RE models. As far as we know, ours is the first work to propagate the features between instances to enrich the semantics for RE. GraphCache takes an existing RE model as the backbone, e.g., BERT, and takes the sentence-level representations given by the backbone as the inputs of message passing.
A GraphCache module consists of three key components: (i) A graph construction technique builds a few property caches. Each cache represents a property for RE: entity type or contextual topic. We connect each sentence to its corresponding properties, so that every property aggregates the features from its neighbor sentences. (ii) Caching message passing aggregates the sentence-level representations to model the properties’ representations in an online manner. (iii) Global-local fusion fuses the global property representations and local sentence-level ones to augment the relation representations. Next, we will discuss the three main components in more detail.
3.1 Graph Construction for Sentence-level Relation Extraction
We build a large heterogeneous graph to connect the sentences that share properties, namely entity types and textual context, which are essential for RE peng2020learning; peters2019knowledge; zhou2021improved. The heterogeneous graph is defined as $\mathcal{G} = (\mathcal{V}, \mathcal{E})$, where $\mathcal{V}$ is the set of nodes and $\mathcal{E}$ is the set of edges. $\mathcal{V} = \mathcal{S} \cup \mathcal{P}$, where $\mathcal{S}$ is the set of sentences and $\mathcal{P} = \mathcal{T} \cup \mathcal{Y}$ is the set of property caches. Here $\mathcal{T}$ is the set of latent topics zeng2018topic mined from the text corpus using LDA blei2003latent, which has been found effective in modeling useful contextual patterns jelodar2019latent. Each topic is represented by a probability distribution over words, and we assign each sentence to the top $K$ topics with the largest probabilities. $\mathcal{Y}$ is the set of entity types, where every cache represents the types of an entity pair. The entity types are also crucial for predicting relations peng2020learning; zhou2021improved. An edge $(s, p) \in \mathcal{E}$ exists if the sentence $s \in \mathcal{S}$ has the property $p \in \mathcal{P}$.
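The graph construction described above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: it assumes the per-sentence topic distributions have already been produced by LDA, and the names `build_property_graph`, `topic_probs`, and `type_pairs` are hypothetical.

```python
from collections import defaultdict


def build_property_graph(topic_probs, type_pairs, top_k=2):
    """Connect each sentence to its top-k topic caches and its entity-type cache.

    topic_probs: per-sentence topic distributions (assumed to come from LDA).
    type_pairs:  (subject_type, object_type) per sentence, or None when the
                 dataset has no type annotations.
    Returns (edges, neighbors): sentence -> properties, property -> sentences.
    """
    edges = {}
    neighbors = defaultdict(list)
    for s, probs in enumerate(topic_probs):
        # Rank topic ids by probability and keep the k most probable ones.
        ranked = sorted(range(len(probs)), key=lambda k: probs[k], reverse=True)
        props = [("topic", k) for k in ranked[:top_k]]
        if type_pairs is not None:
            # One cache per entity-type pair.
            props.append(("type", type_pairs[s]))
        edges[s] = props
        for p in props:
            neighbors[p].append(s)
    return edges, dict(neighbors)


# Toy input: three sentences, three topics (illustrative values only).
topic_probs = [[0.7, 0.2, 0.1], [0.1, 0.3, 0.6], [0.5, 0.4, 0.1]]
type_pairs = [("person", "date"), ("person", "person"), ("person", "date")]
edges, neighbors = build_property_graph(topic_probs, type_pairs, top_k=2)
```

Because every sentence touches only a handful of caches, the number of property nodes stays small even when the number of sentences is large, which is what makes the caches usable as bridge nodes.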
We implement a GNN on this graph. Specifically, to incorporate the global property information into relation extraction, the property caches aggregate the features from the connected neighboring sentences. This step enables property caches to globally model the properties from the whole dataset. We then use the global property representations from the caches to enrich every sentence’s semantics. In this way, the property caches act as prior knowledge when identifying relations and provide each sentence with more representative features.
3.2 Caching Message Passing
We take an existing RE model as the backbone, e.g., BERT devlin-etal-2019-bert, which produces the sentence-level representation $\mathbf{h}_s$ of each sentence $s$. Next, we deploy a two-layer GNN on our heterogeneous graph for message passing across sentences. Specifically, the first GNN layer aggregates the sentence-level representations to property caches at the $t$-th training step:
$$\mathbf{h}_p^{(t)} = f\left(\mathrm{AGG}\left(\left\{\mathbf{h}_s^{(t)} : s \in \mathcal{N}(p)\right\}\right)\right),$$
where $p$ is a property, $s \in \mathcal{N}(p)$ is a sentence having property $p$, $\mathrm{AGG}$ is the mean aggregator hamilton2017inductive, and $f$ is the feed-forward network. $f$ can be a linear layer as in SGC wu2019simplifying, a linear layer followed by a nonlinear activation function as in GraphSAGE hamilton2017inductive, or a multi-layer perceptron as in GIN xu2018powerful, etc. We follow SGC wu2019simplifying to implement $f$ by default. For each property $p$, this layer aggregates the sentence-level representations from $\mathcal{N}(p)$ to obtain a global property embedding $\mathbf{h}_p^{(t)}$. In this way, the generalized context of each property is captured from the whole dataset, which is further used to enhance the relation representations for each sentence in the second GNN layer. We describe the details of the second GNN layer in Section 3.3.
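As a rough illustration of this first layer, the sketch below applies a mean aggregator followed by a bias-free linear map, in the spirit of the SGC-style default mentioned above. It is an assumption-laden toy, not the paper's code: `mean_aggregate`, `linear`, and `first_layer` are hypothetical helper names, and plain Python lists stand in for tensors.

```python
def mean_aggregate(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def linear(weight, x):
    """Bias-free linear layer; `weight` is a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) for row in weight]


def first_layer(sentence_reprs, neighbors, weight):
    """Uncached first layer: visits every sentence-property edge once."""
    return {
        p: linear(weight, mean_aggregate([sentence_reprs[s] for s in sents]))
        for p, sents in neighbors.items()
    }


# Toy sentence representations and property neighborhoods (illustrative only).
sentence_reprs = {0: [1.0, 0.0], 1: [3.0, 2.0], 2: [2.0, 4.0]}
neighbors = {"topic_0": [0, 1], "type_person_date": [0, 1, 2]}
identity = [[1.0, 0.0], [0.0, 1.0]]
caches = first_layer(sentence_reprs, neighbors, identity)
```

Note that this version recomputes every cache from all of its neighbors at each call, which is exactly the cost that the caching updater is meant to avoid.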
Recall our heterogeneous graph for RE defined in Section 3.1. At each training step, classical GNNs perform message passing across all edges between the sentences and properties. In this case, the time complexity of the first GNN layer at each training step is $O(|\mathcal{E}|)$, where $|\mathcal{E}|$ is the number of edges. Note that $|\mathcal{E}|$ is larger than $|\mathcal{S}|$, which is the number of sentences in the dataset. This leads to poor scalability of the GNN, since $|\mathcal{S}|$ is large in practice.
To address this efficiency issue, we propose Caching GNN for RE in Algorithm 1. Our GraphCache implements a memory dictionary to store the sentence-level representations from the backbone. To keep consistency with the updating parameters during training, we deploy a caching updater to refresh the properties’ representations at each training step:
where $\mathcal{B}_t$ denotes the batch at the $t$-th training step. By doing so, GraphCache greatly reduces the time complexity at each training step from $O(|\mathcal{E}|)$, linear in the number of edges, to $O(|\mathcal{B}_t|)$, linear in the batch size, by using the memory dictionary to obtain the property caches’ representations.
Our caching updater is much more efficient than the classical message passing of GNNs, since $|\mathcal{B}_t| \ll |\mathcal{E}|$ generally holds in practice. When we aggregate the sentence-level representations from the memory dictionary, we provide the following proposition to show that our cache updater is as effective as the first GNN layer in Section 3.2.
When , if , we have:
Besides, because for holds as initialized in Algorithm 1, we have for . ∎
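To make the idea of an online cache update concrete, here is one way such an updater could be realized. This is a hedged sketch under our own assumptions, not a reproduction of Algorithm 1: it keeps a running sum per property and, for each sentence in the batch, swaps the stale stored vector for the fresh one. The class name `PropertyCache` and its methods are hypothetical.

```python
class PropertyCache:
    """Running-mean caches refreshed only by sentences in the current batch."""

    def __init__(self, edges, dim):
        self.edges = edges  # sentence id -> list of property ids
        # Memory dictionary: last stored vector of every sentence (zeros at start).
        self.memory = {s: [0.0] * dim for s in edges}
        self.sums = {}
        self.counts = {}
        for props in edges.values():
            for p in props:
                self.sums.setdefault(p, [0.0] * dim)
                self.counts[p] = self.counts.get(p, 0) + 1

    def update(self, batch):
        """batch: sentence id -> fresh representation from the backbone."""
        for s, new in batch.items():
            old = self.memory[s]
            for p in self.edges[s]:
                # Replace the stale contribution of s with the fresh one.
                self.sums[p] = [a + n - o for a, n, o in zip(self.sums[p], new, old)]
            self.memory[s] = list(new)

    def read(self, p):
        """Mean of the stored vectors of all sentences connected to p."""
        return [a / self.counts[p] for a in self.sums[p]]


edges = {0: ["t0"], 1: ["t0", "t1"], 2: ["t1"]}
cache = PropertyCache(edges, dim=2)
cache.update({0: [2.0, 4.0], 1: [4.0, 0.0]})   # first batch
cache.update({2: [0.0, 2.0], 1: [2.0, 2.0]})   # second batch revisits sentence 1
```

Each call to `update` touches only the edges of the sentences in the batch, yet `read` returns the same value as a full mean over the memory dictionary. With a linear feed-forward layer, the map can be applied to the cached mean afterwards.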
3.3 Global-Local Fusion
In the second GNN layer, we propagate the properties’ representations from the property cache to their neighboring sentences in the batch. Since a sentence may have more than one latent topic , we utilize the attention mechanism to enable the target sentence to attend to different topics with adaptive weights.
where we follow vaswani2017attention to implement . The output is the topic embedding fused for sentence . In this way, a sentence can be trained to attend to more relevant topics with higher weights.
Next, we have the entity type embedding of sentence as , where is the entity type node connected to sentence . and are the global representations of the properties related to sentence , while is the local representation of sentence . We fuse the global and local representations to enrich the semantics of sentence through a sentence-wise head:
where denotes concatenation. GraphCache makes sentence-wise relation predictions using a sentence-wise head, implemented as a multi-layer perceptron (MLP), analogous to PointNet qi2017pointnet. Since GraphCache predicts a relation label for each sentence, it can be trained by standard task-specific classification losses, e.g., cross-entropy mannor2005cross. During inference, we take after convergence as the output for RE.
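The fusion step can be illustrated with a small sketch. It assumes plain scaled dot-product attention without learned query, key, or value projections, which is a simplification of the attention mechanism cited above, and it stops at the concatenated vector that a sentence-wise head would consume. The function names `attend` and `fuse` are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(x - m) for x in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys):
    """Scaled dot-product attention of one sentence vector over its topic caches."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    # Weighted sum of the topic cache vectors.
    fused = [sum(w * key[i] for w, key in zip(weights, keys))
             for i in range(len(query))]
    return fused, weights


def fuse(local, topic_caches, type_cache):
    """Concatenate the local vector with the attended topic and type vectors."""
    topic_vec, weights = attend(local, topic_caches)
    return local + topic_vec + type_cache, weights


local = [1.0, 0.0]
fused, weights = fuse(local, [[1.0, 0.0], [0.0, 1.0]], [0.5, 0.5])
```

In this toy case the first topic cache is more similar to the sentence vector, so it receives the larger attention weight. For a dataset without entity type annotations, the type vector would simply be left out of the concatenation.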
In this section, we evaluate the effectiveness of our GraphCache method when incorporated into various RE models. We compare our methods against a variety of strong baselines on the task of sentence-level RE. We closely follow the experimental setting of the previous work zhang2017tacred; zhou2021improved; zhang2018graph to ensure a fair comparison, as detailed below.
4.1 Experimental Settings
Datasets. We use the standard sentence-level RE datasets: TACRED zhang2017tacred, SemEval-2010 Task 8 hendrickx2019semeval, and TACREV alt2020tacred for evaluation. TACRED contains over 106k mention pairs drawn from the yearly TAC KBP challenge. SemEval does not provide entity type annotations, so we construct only the topic caches for message passing on it. alt2020tacred relabeled the development and test sets of TACRED to build TACREV. The statistics of these datasets are shown in Table 1. We follow zhang2017tacred to use micro-averaged F1 as the evaluation metric.
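For reference, micro-averaged F1 for RE can be computed as in the sketch below. It assumes the common convention of treating `no_relation` as the negative class that is excluded from the counts; the function name `micro_f1` and the toy labels are ours.

```python
def micro_f1(gold, pred, negative="no_relation"):
    """Micro-averaged F1 over all positive relation labels."""
    # A prediction is correct only if it is a positive label that matches gold.
    correct = sum(1 for g, p in zip(gold, pred) if p != negative and p == g)
    guessed = sum(1 for p in pred if p != negative)
    actual = sum(1 for g in gold if g != negative)
    precision = correct / guessed if guessed else 0.0
    recall = correct / actual if actual else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Toy labels for illustration only.
gold = ["parent", "no_relation", "date_of_death", "founded"]
pred = ["parent", "founded", "date_of_death", "no_relation"]
score = micro_f1(gold, pred)
```

Here two of the three predicted relations are correct and two of the three gold relations are recovered, so precision, recall, and F1 all equal 2/3.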
Compared Methods. We compare GraphCache with the following state-of-the-art RE models: (1) PA-LSTM zhang2017tacred extends the bi-directional LSTM by incorporating positional information into the attention mechanism. (2) GCN zhang2018graph uses a graph convolutional network to gather relevant contextual information along syntactic dependency paths. (3) C-GCN zhang2018graph combines GCN and LSTM, leading to better performance than either method alone. (4) C-SGC wu2019simplifying simplifies GCN by removing the nonlinear layers and achieves higher effectiveness. (5) SpanBERT joshi-etal-2020-spanbert extends BERT by introducing a new pretraining objective of contiguous span prediction. (6) RECENT lyu2021relation restricts the candidate relations based on the entity types. (7) LUKE yamada-etal-2020-luke pretrains the language model on both large text corpora and knowledge graphs and further proposes an entity-aware self-attention mechanism. (8) IRE zhou2021improved proposes an improved entity representation technique in data preprocessing, which enables RoBERTa to achieve state-of-the-art performance on RE.
Model Configuration. For the hyper-parameters of the considered baseline methods, e.g., the batch size, the number of hidden units, the optimizer, and the learning rate, we use the values from the original papers. For LDA used in GraphCache, we set the number of topics to 50, and the number of top relevant topics for every sentence to 2. For all experiments, we report the median F1 score of five training runs with different random seeds.
4.2 Overall Performance
We incorporate the GraphCache framework with LUKE and IRE, and report the results in Table 2. Our GraphCache method improves LUKE by 2.9% on TACRED, 1.5% on SemEval, and 1.1% on TACREV in F1 score. For IRE, GraphCache leads to improvements of 1.2% on TACRED, 0.8% on SemEval, and 1.2% on Re-TACRED. As a result, our GraphCache achieves substantial improvements for LUKE and IRE and enables them to outperform the baseline methods.
Note that LUKE and IRE are both based on large pre-trained models, which have sufficiently large learning capacity to encode the individual instances. In this case, our GraphCache still improves their effectiveness by a large margin, which validates the benefits of modeling the properties: entity types and contextual topics, globally from the whole dataset. This is due to the use of the global property representations that enrich the semantics of each instance, which effectively act as prior knowledge that helps identify the relations and complements the sentence-level features.
4.3 Efficiency and Effectiveness of GraphCache
As analyzed in Section 3.2, GraphCache enhances the backbone RE models without increasing their time complexity. In the experiments, we analyze the efficiency and effectiveness of GraphCache on the TACRED dataset, following the experimental setting of RE in Section 4.2.
The methods we evaluate include IRE, IRE implemented with classical GNN for message passing, and IRE with our GraphCache. Table 3 reports the performance, where ‘Time’ is the training time until convergence using a Linux Server with an Intel(R) Xeon(R) E5-1650 v4 @ 3.60GHz CPU and a GeForce GTX 2080 GPU.
We notice that, compared with the classical message passing of GNN, our GraphCache method significantly reduces the time complexity per training step. As a result, our GraphCache method takes significantly less training time than the classical GNN method, and exhibits similar efficiency to the original IRE without message passing between sentences. The running time and F1 of IRE with GNN are unavailable due to an out-of-memory error. This agrees with the theoretical analysis in Section 3.2. and denote the data and batch sizes respectively. IRE’s time complexity is , which is the same as the original RoBERTa, while the time complexity of RoBERTa with GNN is , being significantly higher than our GraphCache. In practice, is generally large, and , e.g., and holds for TACRED and state-of-the-art models.
|LUKE + GraphCache (ours)||78.9||85.6|
|IRE + GraphCache (ours)||80.1||88.2|
In terms of effectiveness, our GraphCache leads to substantial improvements for RoBERTa. Our GraphCache enriches the input features for RE on every sentence by utilizing the dataset-level information beyond the individual sentences. GraphCache implements the attention module to incorporate the global property features from different topic caches with adaptive weights, which capture the most relevant information for the target relation. The improvements in effectiveness are rooted in the message passing mechanism between sentences, which mines the property information beyond individual instances and complements the sentence-level semantics. Our GraphCache method resolves the efficiency issues of message passing through the caching mechanism, which updates the properties’ representations in an online manner.
|+ Entity Types||73.4||+0.7||+0.7|
|+ Contextual Topics||74.8||+1.4||+2.1|
4.4 Analysis on Unseen Entities
Some previous work zhang2018graph; joshi-etal-2020-spanbert suggests that RE models may not generalize well to unseen entities. To evaluate whether the RE models can generalize to unseen entities, existing work designs a filtered evaluation setting zhou2021improved. This setting removes all testing instances containing entities from the training set of TACRED and TACREV, which results in filtered test sets of 4,599 instances on TACRED and TACREV. These filtered test sets only contain instances with unseen entities during training.
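The filtered evaluation setting can be sketched as follows. This is an illustrative approximation only: it assumes entities are matched by their surface strings and that each instance is a dictionary with hypothetical `subj` and `obj` keys, which may differ from how the cited work implements the filter.

```python
def filter_unseen(train, test):
    """Keep only test instances whose subject and object never occur in training."""
    seen = set()
    for inst in train:
        seen.add(inst["subj"])
        seen.add(inst["obj"])
    # Drop a test instance as soon as either of its entities was seen in training.
    return [t for t in test if t["subj"] not in seen and t["obj"] not in seen]


# Toy instances for illustration only.
train = [{"subj": "Mary", "obj": "Jerry"}]
test = [
    {"subj": "Mary", "obj": "Tom"},   # removed: 'Mary' appears in training
    {"subj": "Ann", "obj": "Bob"},    # kept: both entities are unseen
]
filtered = filter_unseen(train, test)
```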
We present the experimental results on the filtered test sets in Table 4. Our GraphCache still achieves consistently substantial improvements for LUKE and IRE on the TACRED and TACREV datasets. Specifically, our GraphCache improves the F1 scores of LUKE by 3.1% on TACRED, 3.3% on TACREV, and improves IRE by 1.8% on TACRED, 1.5% on TACREV. Taking a closer look, we observe that the improvements given by GraphCache on the filtered test sets are generally larger than those on the original test sets. The reason is that our GraphCache mines global information from the whole dataset and uses it as the prior knowledge for RE, which is not influenced by the entity names in individual sentences. When the entity names are new to the RE models, the semantic information is relatively scarce and our mined global information plays a more important role to augment the sentence-level representations.
|Input sentence||Method||Prediction||Entity type||Topic keyword|
|Founded in 1947 by two brothers, Eugene and Quentin Fabris, New Fabris started out making sewing machine parts in the 1990s.||LUKE||founded ✗||subject: Person object: Date||[brother, found, sister, parent, establish, machine, business, organize, instrument, make]|
|+ GraphCache||no_relation ✓|
|According to the suspect, Gonzalez was strangled and buried the day after the video was made, Rosas said.||LUKE||no_relation ✗||subject: Person object: Date||[strangle, die, after, when, injury, day, hospital, police, murder, later]|
|+ GraphCache||date_of_death ✓|
|He was forced to close his bar and now works occasionally at the University of Foreigners, which Knox and Kercher attended.||LUKE||no_relation ✗||subject: Person object: Organization||[university, student, attend, opening, work, school, job, professor, exchange, education]|
|+ GraphCache||schools_attended ✓|
|Margaret Garritsen graduated from the University of Michigan as an American Association of University scholar.||LUKE||schools_attended ✗||subject: Organization object: Organization||[graduate, government, association, degree, university, technology, science, scholar, receive, research]|
|+ GraphCache||no_relation ✓|
4.5 Ablation Study
We investigate the contributions of properties that we consider for constructing the heterogeneous graph. We apply different kinds of properties sequentially with our GraphCache on the LUKE model. The results are presented in Table 5. Our entity type nodes improve the effectiveness of LUKE by modeling the entity information globally on the dataset level to enrich the semantics of every sentence. This finding is consistent with peng2020learning, suggesting that the entity information can provide richer information to improve RE. Furthermore, the contextual topics lead to more significant improvements than the entity types, since the contextual information is fundamental for identifying the relations.
Finally, we analyze the sensitivity of GraphCache to two hyper-parameters: the number of topics and the number of relevant topics assigned to an instance. The result is visualized in Figure 3, where we vary both hyper-parameters over a range of values. The performance of IRE with GraphCache is relatively stable when the parameters are within certain ranges. However, extremely small numbers of topics and large numbers of assigned topics result in poor performance. Too few topics cannot effectively model the complex contextual topics in the large text corpus, while assigning too many topics to an instance induces irrelevant or noisy features. Moreover, a single poorly set hyper-parameter alone does not lead to significant performance degradation, which demonstrates that our GraphCache framework is able to effectively mine the beneficial properties at the dataset level and use them to enhance the relation representations for RE.
4.6 Case Study
We conduct a case study to investigate the effects of our GraphCache. Table 6 gives a qualitative comparison between LUKE and LUKE with our GraphCache on the relation extraction dataset TACRED. The result shows that the global property information that we mine from the whole dataset can guide the RE systems to make correct predictions. For example, in the first row, we model the global entity type information of the subject as person and the object as date from the whole dataset. This type information acts as prior knowledge that prevents the model from making the wrong relation prediction of ‘founded’ between the entities ‘Quentin Fabris’ and ‘1947’ (date). Similarly, in the final row, our GraphCache filters out the incorrect relation ‘schools_attended’, since we model the entity type information from the whole dataset and thus make the model aware that this relation cannot hold when the subject type is ‘organization’.
In addition, in the second row, the sentence ‘According to the suspect, Gonzalez was strangled and buried the day after the video was made, Rosas said.’ attends to the topic with keywords ‘[strangle, die, after, when, injury, day, hospital, police, murder, later]’ in our heterogeneous graph, which enriches the semantics of the sentence with context related to death and time. This helps the model make the correct relation prediction ‘date_of_death’.
In this paper, we study efficient message passing to enhance relation extraction models. We propose a novel method named GraphCache, which provides efficient message passing between instances in the whole dataset. GraphCache is a model-agnostic technique that can be incorporated into popular relation extraction models to enhance their effectiveness without increasing their time complexity. In our work, we present a simple yet effective implementation of GraphCache, which models two universal and essential properties for relation extraction: entity information and textual context. Our experimental results show that GraphCache, with our heterogeneous graph, yields significant gains for sentence-level relation extraction in an efficient manner.
The authors would like to thank the anonymous reviewers for their discussion and feedback.
Muhao Chen and Wenxuan Zhou are supported by the National Science Foundation of the United States Grant IIS 2105329, and by the DARPA MCS program under Contract No. N660011924033 with the United States Office of Naval Research. Except for Muhao Chen and Wenxuan Zhou, this paper is supported by NUS ODPRT Grant R252-000-A81-133 and Singapore Ministry of Education Academic Research Fund Tier 3 under MOE’s official grant number MOE2017-T3-1-007.