Improving Relation Extraction by Leveraging Knowledge Graph Link Prediction

by   George Stoica, et al.
Carnegie Mellon University

Relation extraction (RE) aims to predict a relation between a subject and an object in a sentence, while knowledge graph link prediction (KGLP) aims to predict a set of objects, O, given a subject and a relation from a knowledge graph. These two problems are closely related as their respective objectives are intertwined: given a sentence containing a subject and an object o, a RE model predicts a relation that can then be used by a KGLP model together with the subject, to predict a set of objects O. Thus, we expect object o to be in set O. In this paper, we leverage this insight by proposing a multi-task learning approach that improves the performance of RE models by jointly training on RE and KGLP tasks. We illustrate the generality of our approach by applying it on several existing RE models and empirically demonstrate how it helps them achieve consistent performance gains.




1 Introduction

Many real-world applications, ranging from search engines to conversational agents, rely on the ability to uncover new relationships from existing knowledge. Relation extraction (RE) and knowledge graph (KG) link prediction (KGLP) are two closely related tasks that center around inferring new information from existing facts. RE is the task of uncovering the relationship between two entities (termed the subject and object, respectively) in a sentence. Similarly, KGLP involves inferring the set of correct answers (i.e., objects) to KG questions consisting of an entity (subject) and a relation. These questions are given in triple form: (SUBJECT, RELATION, ?). To illustrate their relationship, consider the sentence “John and Jane are married”, whose subject is “John” and whose object is “Jane”. Given this information, RE models infer the relationship between “John” and “Jane” (e.g., “Spouse”). Similarly, KGLP models infer the answers (objects) to the question (John, Spouse, ?). Based on the sentence, the answers must include “Jane”. Thus, RE models predict the relation between a subject and an object, while KGLP models infer the object from the subject and relation.

Several methods have been proposed to boost the performance of RE models by incorporating information from KGLP. However, these approaches typically require KGLP pre-training (Weston et al., 2013; Wang et al., 2018), exhibit constrained parameter sharing (Weston et al., 2013; Wang et al., 2018), or predominantly attend over both problems through custom attention mechanisms (Beltagy et al., 2018; Han et al., 2018; Zhang et al., 2019). Moreover, these frameworks only support a limited class of KGLP models that can be reframed as inferring relations from subjects and objects. This constraint excludes recent KGLP methods that perform significantly better but cannot be reformulated to satisfy the restriction. An ideal framework should support arbitrary RE and KGLP methods, including the significantly more expressive and stronger-performing recent KGLP approaches. Additionally, such a framework should enable RE models to benefit from KGLP models with minimal changes to the underlying RE and KGLP methods.

Figure 1: Overview of JRRELP. JRRELP is comprised of three loss terms: the RE loss, the KGLP loss, and the coupling loss. The RE loss is illustrated in the top-left quadrant, the KGLP loss is described by the top-right quadrant, and the bottom half shows the coupling loss.

We propose a general framework that ties the RE and KGLP tasks cohesively into a single learning problem. Our architecture, termed JRRELP (Jointly Reasoning over Relation Extraction and Link Prediction), has the following desirable properties:

  • Generality: Our method can be applied to arbitrary RE and KGLP models to boost RE performance. The only assumption JRRELP makes is that both models are trained by minimizing a loss function (which is common across all successful RE and KGLP methods).

  • Effective Information-Sharing: JRRELP introduces a cyclical relationship between model parameters, enabling better information transfer between the learning tasks. Moreover, all parameters are shared across both the RE and KGLP tasks.

  • Performance: JRRELP boosts the performance of all baseline methods used in our evaluation. Additionally, JRRELP-enhanced baselines even match or improve upon the performance of more expressive RE models. For example, we are able to train C-GCN (Zhang et al., 2018) to match TRE (Alt et al., 2019), even though the latter was proposed as a stronger and significantly more expressive alternative.

  • Efficiency: JRRELP does not require any task-specific pre-training. It introduces a minimal overhead over the baseline methods (at most slower per batch).

An overview of JRRELP is shown in Figure 1, and is explained in detail in Section 3. Next, we present our proposed method and defer positioning with respect to related work until Section 5.

2 Background

Before presenting our method, we introduce the terminology used throughout this paper and describe the relevant learning tasks. We consider a dataset that contains a collection of sentences. Each sentence is a sequence of tokens (i.e., words), and each token is represented by a one-hot encoding. Each sentence contains a subject, defined as a contiguous span over the sentence, and an object, defined in the same way. Subjects and objects are summarized by their types, referred to as the subject type and the object type, respectively. If not already given, these can be extracted by widely used parsing frameworks such as Stanford CoreNLP (Manning et al., 2014). For example, consider the sentence “John Doe lives in Miami”, where the subject is “John Doe” and the object is “Miami”. In this case, the subject may be tagged as having type PERSON and the object as having type CITY. Several methods (e.g., Zhang et al., 2017, 2018; Guo et al., 2019) employ type-substitution during data preprocessing: substituting subjects and objects in sentences with their corresponding types. For instance, with type-substitution our example sentence becomes “SUBJECT-PERSON SUBJECT-PERSON lives in OBJECT-CITY.” For ease of explanation, we assume that sentences are preprocessed using type-substitution for the remainder of this paper. Each sentence may contain additional structural features such as part-of-speech (POS) tags, named-entity-recognition (NER) tags, and a dependency parse. Analogous to extracting entity types, these can be generated by parsing frameworks. We refer to all such sentence features collectively as the attribute set of the sentence. Finally, each sentence is labeled with a relation between its subject and object. This may either describe their lack of connection (via a special NoRelation token) or an existing one. For instance, the relation between John Doe and Miami in our example sentence would be LivesIn. In summary, the dataset is a set of tuples, one per sentence, each consisting of the sentence, its subject, its object, its attribute set, and its relation.
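The type-substitution step described above can be sketched in a few lines of Python. This is a minimal illustration rather than the preprocessing code of any of the cited systems: the function name `type_substitute` and the half-open `(start, end)` span convention are assumptions made for this example.

```python
from typing import List, Tuple


def type_substitute(tokens: List[str],
                    subj_span: Tuple[int, int], subj_type: str,
                    obj_span: Tuple[int, int], obj_type: str) -> List[str]:
    """Replace every token inside the subject/object spans with a type placeholder.

    Spans are (start, end) token indices with `end` exclusive (an assumption
    of this sketch).
    """
    out = list(tokens)
    for i in range(*subj_span):
        out[i] = f"SUBJECT-{subj_type}"
    for i in range(*obj_span):
        out[i] = f"OBJECT-{obj_type}"
    return out


tokens = "John Doe lives in Miami".split()
masked = type_substitute(tokens, (0, 2), "PERSON", (4, 5), "CITY")
print(" ".join(masked))  # SUBJECT-PERSON SUBJECT-PERSON lives in OBJECT-CITY
```

Each token of a multi-token span is replaced individually, which is why the two-token subject yields two SUBJECT-PERSON placeholders.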

2.1 Relation Extraction

Relation extraction (RE) uses the sentence, its subject, its object, and its attribute set to infer the relation between the subject and the object. Note that, due to our type-substitution assumption, this is analogous to predicting the relation between the subject type and the object type. Many successful models that have been proposed to tackle this task learn vector embeddings for each component. Specifically, such models maintain three embedding matrices: one for the sentence token vocabulary, one for the relations, and one for the attributes. The number of rows in each matrix is given by the vocabulary size, the number of unique relations, and the number of unique attributes, respectively, all computed over the whole training dataset, and each matrix has its own embedding size. All three embedding matrices are learnable model parameters. Given a sentence, a subject, an object, and the attributes, their embedding representations are obtained from the vocabulary and attribute embedding matrices, and the embedded relation is obtained from the relation embedding matrix. Given these embeddings, most successful RE models (e.g., Zhang et al., 2017, 2018; Guo et al., 2019; Alt et al., 2019; Soares et al., 2019) can be formulated as instances of the following abstract model: a prediction model maps the embedded sentence, subject, object, and attributes to an inferred relation representation, which is then compared against the relation embeddings to produce a distribution over relations. To demonstrate how multiple RE methods fit under this formulation, we briefly describe the three baseline models used in our experiments.
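One compact way to write this abstract RE model is sketched below. The symbol names are chosen here for illustration and are assumptions: $\mathbf{x}$, $\mathbf{s}$, $\mathbf{o}$, and $\mathbf{A}$ stand for the embedded sentence, subject, object, and attributes, $f$ is the prediction model, $\hat{\mathbf{r}}$ is the inferred relation representation, and $\mathbf{R}$ is the relation embedding matrix. The softmax scoring is likewise an assumption, chosen to match the softmax cross-entropy loss used to train the RE model.

```latex
\hat{\mathbf{r}} = f(\mathbf{x}, \mathbf{s}, \mathbf{o}, \mathbf{A}),
\qquad
p(r \mid \mathbf{x}, \mathbf{s}, \mathbf{o}, \mathbf{A})
  = \mathrm{Softmax}\!\left(\mathbf{R}\,\hat{\mathbf{r}}\right)
```

Under this reading, different RE methods differ only in the choice of $f$.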

PA-LSTM. This model was proposed by Zhang et al. (2017), and centers around formulating the prediction model as the combination of a unidirectional long short-term memory (LSTM) network and a custom position-aware attention mechanism. The sentence attributes it uses are POS and NER tags, as well as SO and OO tags representing the positional offset of each token from the subject and the object, respectively. The method first applies the LSTM over the concatenated sentence, POS tag, and NER tag embeddings. A relation representation is then predicted by attending over the LSTM outputs with the position-aware attention mechanism, using the SO and OO tag embeddings.

C-GCN. This model was proposed by Zhang et al. (2018), and formulates the prediction model as a graph-convolution network (GCN) over sentence dependency parse trees. It uses the same sentence attributes as PA-LSTM, and additionally the sentence dependency parse. Similar to PA-LSTM, the method first encodes a concatenation of the sentence, POS tag, and NER tag embeddings, in this case using a bi-directional LSTM network. The model then infers relations from these encodings by reasoning over the graph implied by a pruned version of the provided dependency parse tree. In particular, C-GCN computes the lowest common ancestor (LCA) of the subject and the object, and uses the SO and OO tags to prune the tree around the LCA. Afterwards, C-GCN processes the sentence encodings using a GCN defined over the pruned dependency parse tree. The resulting representations are finally processed by a multi-layer perceptron to predict relations.

SpanBERT. This model was proposed by Joshi et al. (2019), and is a strong-performing BERT-based (Devlin et al., 2018) relation extraction method. SpanBERT extends BERT by pre-training at the span level: the model randomly masks contiguous text spans instead of individual tokens, and adds a span-boundary objective that infers masked spans from surrounding data. In contrast to PA-LSTM and C-GCN, SpanBERT only takes into account the type-substituted sentence in its input to predict relations. The prediction model is formulated as its complete architecture, with the sentence attributes masked out. We chose this model because it is a strong-performing BERT-based RE model and it is also open-sourced, allowing us to easily integrate it into our experimental evaluation pipeline.

Note that PA-LSTM, C-GCN, and SpanBERT are just three of many approaches supported by our abstract RE model formulation. For instance, other transformer-based methods (Alt et al., 2019; Soares et al., 2019; Peters et al., 2019) can also be represented by using a different definition for the prediction model.

2.2 Knowledge Graph Link Prediction

The objective in knowledge graph link prediction (KGLP) is to infer a set of objects given a question in the form of a subject-relation-object triple that is missing the object. Typically, the subject and the object are nodes in a knowledge graph (KG), while the relation represents a graph edge. Although the dataset does not necessarily provide an explicit KG to reason over, it is possible to generate one by assigning unique identifiers to all subjects, relations, and objects. For instance, these may be the subject type and the object type for subjects and objects, respectively, and the relation itself. Although we assume that these identifiers are used (as they are available in our training data), we emphasize that our method is not limited to datasets with these characteristics. Instead, our framework supports any dataset that specifies a mapping to a pre-existing KG, or for which it is possible to define other unique identifiers. This is a very weak constraint. Therefore, given a sentence with a subject, an object, and a relation, we can use the subject and object types to form a KG whose edges are the relations and whose nodes are the subject and object types. For ease of notation, we assume that each identifier is represented by a one-hot encoding.
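As a rough illustration of how such a KG can be derived from a relation extraction dataset, the sketch below groups (subject type, relation, object type) triples by their (subject type, relation) question. The function name `build_kg` and the COUNTRY type are hypothetical and used only for this example.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple

# (subject_type, relation, object_type), one triple per training sentence
Triple = Tuple[str, str, str]


def build_kg(triples: Iterable[Triple]) -> Dict[Tuple[str, str], Set[str]]:
    """Map every (subject_type, relation) question to its set of valid object types."""
    answers = defaultdict(set)
    for subject_type, relation, object_type in triples:
        answers[(subject_type, relation)].add(object_type)
    return dict(answers)


train_triples = [
    ("PERSON", "LivesIn", "CITY"),
    ("PERSON", "LivesIn", "COUNTRY"),  # hypothetical second answer
    ("PERSON", "Spouse", "PERSON"),
    ("PERSON", "LivesIn", "CITY"),     # duplicates collapse into the set
]
kg = build_kg(train_triples)
print(sorted(kg[("PERSON", "LivesIn")]))  # ['CITY', 'COUNTRY']
```

Each key of `kg` plays the role of a KGLP question, and its value is the set of objects the model is expected to recover.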

Due to the type-substitution preprocessing step described in Section 2, all types are included in the sentence token vocabulary. Thus, we obtain the subject and object embeddings from the vocabulary embedding matrix, and the relation embedding from the relation embedding matrix. Multiple existing KGLP methods can be characterized in terms of the following abstract model: a merge function combines the subject and relation embeddings into a single merged representation, which is then compared against the embeddings of all available objects to produce a distribution over objects. Note that the set of available object embeddings contains only valid (in the type-checking sense) object embeddings. Previous work (Stoica et al., 2020) shows that multiple KGLP methods fit under this formulation. While certain early KGLP methods (Bordes et al., 2013; Yang et al., 2015; Lin et al., 2015; Ji et al., 2015; Trouillon et al., 2016) do not fit under this formulation, we note that they may be accommodated by a simple reconfiguration of the formulation to use their respective scoring terms. We now provide the definition of ConvE (Dettmers et al., 2018) under this formulation, because we use ConvE as our KGLP model in our experiments. While we acknowledge that ConvE is not the current state-of-the-art (SoTA) KGLP approach, it performs very well while using only a fraction of the parameters that current SoTA methods (Stoica et al., 2020; Wang et al., 2020) require, thus making it more efficient. Moreover, ConvE is an example of a KGLP method that cannot be restructured to infer the relation from the subject and object, making it infeasible to use with any of the previous joint RE and KGLP frameworks (e.g., Wang et al., 2018; Weston et al., 2013). Note that our results can only be further enhanced by using a stronger KGLP approach, and thus this choice should not affect our conclusions.

ConvE. ConvE is defined by using the following merge function in our abstract model formulation: the subject and relation embeddings are first concatenated and the resulting vector is reshaped into a square matrix, and a 2D convolution operation (“Conv2D”) is then applied to that matrix.
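The concatenate, reshape, and convolve steps of this merge function can be sketched in plain Python as follows. This covers only the operations named above, with a single filter and no learned parameters; any further layers of the full ConvE model are omitted, and the helper names are assumptions of this sketch.

```python
from typing import List

Matrix = List[List[float]]


def reshape_square(vec: List[float]) -> Matrix:
    """Reshape a flat vector into a square matrix (length must be a perfect square)."""
    side = int(round(len(vec) ** 0.5))
    if side * side != len(vec):
        raise ValueError("vector length must be a perfect square")
    return [vec[i * side:(i + 1) * side] for i in range(side)]


def conv2d_valid(image: Matrix, kernel: Matrix) -> Matrix:
    """Single-filter 2D convolution without padding (cross-correlation, as in deep learning)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h, out_w = len(image) - kh + 1, len(image[0]) - kw + 1
    return [[sum(image[i + a][j + b] * kernel[a][b]
                 for a in range(kh) for b in range(kw))
             for j in range(out_w)]
            for i in range(out_h)]


def merge(subject_emb: List[float], relation_emb: List[float], kernel: Matrix) -> Matrix:
    """Concatenate the two embeddings, reshape to a square matrix, then convolve."""
    return conv2d_valid(reshape_square(subject_emb + relation_emb), kernel)


subject_emb = [1.0] * 8   # toy 8-dimensional embeddings
relation_emb = [2.0] * 8
kernel = [[1.0, 0.0], [0.0, 1.0]]
features = merge(subject_emb, relation_emb, kernel)
print(features)  # [[2.0, 2.0, 2.0], [3.0, 3.0, 3.0], [4.0, 4.0, 4.0]]
```

The middle row of the output mixes subject and relation entries, which is the kind of interaction between the two embeddings that the reshaping is meant to expose to the convolution.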

3 Proposed Method

As mentioned in Section 1, the RE and KGLP tasks are tightly coupled. Given a sentence (e.g., “Miami is in Florida”) that contains a subject (e.g., Miami) and an object (e.g., Florida), the goal of RE is to predict the relation (e.g., locatedIn) between the subject and the object that the sentence describes. Similarly, the goal of KGLP is to infer a set of objects from the subject and the relation, such that the inferred objects correspond to correct subject-relation-object triples, and where the object of the sentence belongs to the inferred set (this is known because the sentence describes this relationship). Based on this observation, we propose JRRELP, a multi-task learning framework that explicitly accounts for this relationship between RE and KGLP. JRRELP jointly trains an RE model, defined using our abstract formulation from Section 2.1, and a KGLP model, defined using our abstract formulation from Section 2.2, using four key ideas:

  1. Parameter Sharing: The RE and KGLP models share all of the embedding parameters, namely the vocabulary, relation, and attribute embedding matrices from Sections 2.1 and 2.2. Moreover, all parameters are shared between the RE and KGLP methods.

  2. Joint Training: The two models are trained jointly by optimizing a single objective function. This function contains terms that correspond to the RE objective function, the KGLP objective function, as well as a prediction coupling loss term.

  3. Cyclical Coupling: Our joint loss terms establish a cyclical relationship between the embedding parameters that tightly couples the RE and KGLP tasks. This is because the RE model uses the token embeddings (which include the subject and object type embeddings) to predict relation representations, which are then compared to the relation embeddings to produce a distribution over relations. Reciprocally, the KGLP model uses the relation embeddings to generate object embeddings, which are compared to the token embeddings to produce distributions over objects.

  4. Unmodified Evaluation: JRRELP does not introduce any additional terms when evaluating the RE model. Thus, rather than enhancing the RE model by increasing its capacity, JRRELP does this by altering its training trajectory.

We now provide details on how each term of the joint training objective function is defined.

RE Loss. The first term corresponds to the standard loss function used to train the RE model. Using the formulation introduced in Section 2.1, this loss is the softmax cross-entropy (“SCE”) between the ground-truth relation and the distribution over relations predicted by the RE model, where the predicted relation representation is computed by the specific prediction function used by our RE model. Although this loss term assumes that a single relation exists between a subject and an object in a sentence, it is consistent with the loss term used by our baselines and is also appropriate for the widely used benchmark datasets described in Section 4. Additionally, we note that this does not restrict the applicability of JRRELP to single-relation extraction problems. For instance, “SCE” can be replaced with binary cross-entropy (BCE) when multiple relations may apply.
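A minimal numeric sketch of this loss is shown below, assuming that the predicted relation representation is scored against each row of the relation embedding matrix with a dot product before the softmax. The names `re_loss`, `relation_embs`, and `predicted_relation` are illustrative only.

```python
import math
from typing import List

Vector = List[float]


def dot(u: Vector, v: Vector) -> float:
    return sum(a * b for a, b in zip(u, v))


def softmax(scores: Vector) -> Vector:
    """Numerically stable softmax (shifts by the max score before exponentiating)."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def re_loss(predicted_relation: Vector, relation_embs: List[Vector], gold: int) -> float:
    """Softmax cross-entropy between the predicted distribution and the gold relation index."""
    scores = [dot(row, predicted_relation) for row in relation_embs]
    return -math.log(softmax(scores)[gold])


relation_embs = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]  # one row per relation
predicted_relation = [2.0, 0.0]                       # output of the RE model
loss_correct = re_loss(predicted_relation, relation_embs, gold=0)
loss_wrong = re_loss(predicted_relation, relation_embs, gold=1)
print(loss_correct < loss_wrong)  # True
```

The loss is smallest when the predicted representation aligns with the embedding of the gold relation, which is what ties the RE model to the shared relation embeddings.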

KGLP Loss. The second term corresponds to a popular loss function that is often used to train KGLP models. Using the formulation introduced in Section 2.2, it compares the distribution over objects predicted by the KGLP model, computed with the specific merge function used by our KGLP model, against the set of correct objects for the given subject and relation. Note that this set of objects can be constructed automatically from all of the training data, conditioned on the subject and the relation, as described in Section 2.2. We also acknowledge that certain KGLP methods (Bordes et al., 2013; Yang et al., 2015; Lin et al., 2015; Ji et al., 2015; Trouillon et al., 2016) cannot be represented by this loss term. However, this does not detract from the generality of the proposed framework, because they can be accommodated by changing this term to their respective objective functions.

Coupling Loss. The third term penalizes inconsistencies between the predictions of the RE and KGLP models. It has the same form as the KGLP loss term, with one key difference: the relation embeddings that the KGLP loss term takes from the relation embedding matrix are replaced by the relation embeddings predicted by the RE model. This term aligns the RE and KGLP methods by making the first compatible with the second, and enhances the overall performance of our framework.
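The relationship between the KGLP loss and the coupling loss can be illustrated with the sketch below. Two assumptions are made for this example: the object loss is taken to be a sigmoid binary cross-entropy averaged over candidate objects, and `merge` is a hypothetical element-wise sum standing in for the actual KGLP merge function.

```python
import math
from typing import List

Vector = List[float]


def dot(u: Vector, v: Vector) -> float:
    return sum(a * b for a, b in zip(u, v))


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def merge(subject_emb: Vector, relation_emb: Vector) -> Vector:
    # Hypothetical stand-in for the KGLP merge function (element-wise sum).
    return [a + b for a, b in zip(subject_emb, relation_emb)]


def object_loss(subject_emb: Vector, relation_emb: Vector,
                object_embs: List[Vector], targets: List[int]) -> float:
    """Binary cross-entropy over candidate objects (assumed form of the object loss)."""
    merged = merge(subject_emb, relation_emb)
    total = 0.0
    for emb, t in zip(object_embs, targets):
        p = sigmoid(dot(emb, merged))
        total -= t * math.log(p) + (1 - t) * math.log(1.0 - p)
    return total / len(object_embs)


subject = [0.5, -0.5]
gold_relation = [1.0, 0.0]        # looked up in the relation embedding matrix
predicted_relation = [0.8, 0.1]   # produced by the RE model
objects = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
targets = [1, 0, 0]               # only the first object is correct

# Same loss, different relation input: embedding lookup vs. RE prediction.
kglp_loss = object_loss(subject, gold_relation, objects, targets)
coupling_loss = object_loss(subject, predicted_relation, objects, targets)
print(kglp_loss, coupling_loss)
```

The only difference between the two calls is the relation vector that is passed in, which mirrors the substitution described above.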

3.1 JRRELP Objective Function

The JRRELP objective function is formed by combining the above three terms: the RE loss is added to a weighted KGLP loss and a weighted coupling loss, where the two weights are model hyperparameters that need to be tuned. We note that, while in principle the two weights can vary independently, in our experiments we set both to the same value for simplicity and cheaper hyperparameter tuning. Furthermore, we observed no negative impact on performance from doing so.
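Written out, a combined objective of this kind takes the following form. The symbol names are assumptions made for illustration, and leaving the RE term unweighted is inferred from the text mentioning only two weighting hyperparameters.

```latex
\mathcal{L}_{\text{JRRELP}}
  = \mathcal{L}_{\text{RE}}
  + \lambda_{\text{KGLP}}\,\mathcal{L}_{\text{KGLP}}
  + \lambda_{\text{C}}\,\mathcal{L}_{\text{C}}
```

Setting $\lambda_{\text{KGLP}} = \lambda_{\text{C}}$ corresponds to the shared-value configuration used in the experiments, and setting either weight to zero corresponds to removing that term.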

Most importantly, due to the JRRELP parameter sharing and the use of this loss function, our framework introduces a cyclical relationship between the RE and KGLP models that couples them together very tightly. Specifically, the RE model predicts relation embeddings from the token embeddings and compares them to the relation embedding matrix to produce distributions over relations. The KGLP model, on the other hand, predicts object embeddings from the relation embeddings and compares them to the vocabulary embedding matrix to produce distributions over objects. It is mainly this cyclical relationship, along with the coupling loss term, that lets the RE and KGLP models benefit from each other and enhances the performance and robustness of RE methods. An overview of JRRELP is shown in Figure 1.

Note that, even though JRRELP minimizes the joint three-term objective function from Section 3.1, at test time we only use the RE model to predict relations between subjects and objects. Thus, JRRELP can be thought of as a framework that alters the learning trajectory of an RE model, rather than increasing its capacity through additional model parameters.

4 Experiments

We empirically evaluate the performance of JRRELP over three existing relation extraction baselines on two widely used supervised benchmark datasets. Our primary objective is to measure the importance of a joint RE and KGLP objective in environments where learning over both tasks is restricted to the data available in a relation extraction dataset. This serves to simulate how effective JRRELP may be in real-world applications where a pre-existing KG is not available for a given RE task. Additionally, we perform an ablation study to examine the impact each part of JRRELP has on its overall performance.


Datasets. We use the TACRED (Zhang et al., 2017) and SemEval 2010 Task 8 (Hendrickx et al., 2010) datasets for our experiments, which are commonly used in prior literature (e.g., Zhang et al., 2017, 2018; Guo et al., 2019; Soares et al., 2019). Table 1 shows their summary statistics. As mentioned in Section 2, for both datasets we utilize the following sentence attributes: NER tags, POS tags, subject/object offsets, and dependency tree structure. For the KGLP task in JRRELP, we construct the KG by automatically generating a (subject type, relation, object type) triple for each training sentence. We then ask questions of the form (subject type, relation, ?), where the answer belongs to a set of applicable objects.

Setup. We perform our experiments on TACRED consistent with prior literature (Zhang et al., 2017, 2018; Guo et al., 2019). We use the same type-substitution policy, where we replace each subject and object in a sentence with their corresponding NER types. Additionally, we evaluate our models using their micro-averaged F1 scores. Finally, we report the test metrics of the model with the best validation F1 score over five independent runs. While SemEval 2010 Task 8 is traditionally evaluated without type-substitution, Zhang et al. (2018) point out that this causes models to overfit to specific entities, and does not test their ability to generalize to unseen data. They address this by masking these entities using their types. Therefore, to examine JRRELP’s generalization capabilities, we perform the same type-substitution procedure, and evaluate on the transformed dataset (denoted as SemEval-MM). Consistent with prior work (Zhang et al., 2017, 2018; Guo et al., 2019; Alt et al., 2019; Soares et al., 2019), we report the macro-averaged F1 scores. Because SemEval(-MM) does not contain a validation set, we subsample examples from the training set to use as a validation set.
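The difference between the two averaging schemes can be illustrated with a small sketch. It assumes that the negative ("no relation") label is excluded from the counts, and the function names and toy labels are illustrative rather than taken from either benchmark's scorer.

```python
from typing import Dict, List


def label_counts(gold: List[str], pred: List[str], negative: str) -> Dict[str, List[int]]:
    """Per-label [true positives, false positives, false negatives], ignoring the negative label."""
    counts: Dict[str, List[int]] = {}
    for g, p in zip(gold, pred):
        if p != negative:
            c = counts.setdefault(p, [0, 0, 0])
            c[0 if p == g else 1] += 1
        if g != negative and p != g:
            counts.setdefault(g, [0, 0, 0])[2] += 1
    return counts


def f1(tp: int, fp: int, fn: int) -> float:
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def micro_f1(gold: List[str], pred: List[str], negative: str) -> float:
    """Pool the counts of all labels, then compute a single F1."""
    counts = label_counts(gold, pred, negative).values()
    return f1(sum(c[0] for c in counts), sum(c[1] for c in counts), sum(c[2] for c in counts))


def macro_f1(gold: List[str], pred: List[str], negative: str) -> float:
    """Compute F1 per label, then average with equal weight per label."""
    counts = label_counts(gold, pred, negative)
    return sum(f1(*c) for c in counts.values()) / len(counts) if counts else 0.0


gold = ["A", "A", "A", "A", "B", "NA"]
pred = ["A", "A", "A", "A", "NA", "B"]
print(micro_f1(gold, pred, "NA"))  # 0.8
print(macro_f1(gold, pred, "NA"))  # 0.5
```

In this toy case the frequent label A is predicted perfectly and the rare label B is always wrong, so the micro average stays high while the macro average drops to one half.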

Dataset      # Train   # Validation   # Test   # Relations   Avg. Tokens   % Negatives
TACRED       68,124    22,631         15,509   42            36.4          79.5%
SemEval-MM   8,000     -              2,717    19            19.1          17.4%

Table 1: Dataset statistics. Here, # Train, # Validation, and # Test denote the number of questions used for training, validation, and testing. # Relations describes the number of distinct relations in each dataset, Avg. Tokens refers to the average number of tokens in each dataset sentence, and % Negatives indicates the percentage of data where there is "no relation" between subjects and objects.

Models. We illustrate the generality of JRRELP by evaluating it on baselines from both classes of RE approaches (see Section 5 for their definitions): two sequence-based models (PA-LSTM and SpanBERT) and a graph-based model (C-GCN). We join all three baselines with the KGLP method ConvE. We distinguish between our baselines and their JRRELP variants by boxing their model names (e.g., [PA-LSTM] is the JRRELP-extended version of PA-LSTM). All models can be found in our repository.


                        TACRED                          SemEval-MM
Model                   Precision   Recall   F1        Precision   Recall   F1
C-AGGCN                 73.1        64.2     68.2      –           –        –
TRE                     70.1        65.0     67.4      –           –        –
Peters et al. (2019)    –           –        71.5      –           –        –
PA-LSTM                 65.7        64.5     65.1      75.2        78.0     76.6
[PA-LSTM]               67.8        65.0     66.4      74.8        80.6     77.6
C-GCN                   69.9        63.3     66.4      76.5        79.5     78.0
[C-GCN]                 74.1        61.9     67.4      76.9        80.3     78.5
SpanBERT                69.2*       71.2*    70.2*     81.2        86.1     83.6
[SpanBERT]              74.0*       67.3*    70.8*     82.7        85.2     83.9

Bracketed model names denote the JRRELP-extended variants.

Table 2: Results reported by our own experiments are marked by . The remainder are taken from Alt et al. (2019) and Peters et al. (2019). All numbers are expressed as percentages. denotes experiments performed using additional data other than provided by the respective models. “–” denotes missing results from the respective publications. “SemEval-MM” denotes the Masked-Mention version of the SemEval dataset.

We report our overall performance results in Table 2. On TACRED, we observe that JRRELP consistently outperforms its baseline variants on the F1 and precision metrics. In particular, JRRELP improves the F1 of every baseline by at least 0.6 points, and yields improvements of up to 4.8 points in precision. Furthermore, JRRELP bridges the performance gap between several methods, without altering their model capacities. Notably, the JRRELP-extended PA-LSTM matches the reported C-GCN performance, and the JRRELP-extended C-GCN matches TRE (Alt et al., 2019), a significantly more expressive transformer-based approach. These results suggest that the true performance ceiling of existing relation extraction approaches may be significantly higher than their reported results, and that JRRELP serves as a conduit towards achieving these performances. Results on SemEval-MM indicate a similar pattern to TACRED: JRRELP improves F1 across all baselines. This illustrates the effectiveness of JRRELP’s framework in environments with little data.

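Model      Variant                      TACRED F1   SemEval-MM F1
PA-LSTM    Baseline                     65.1        76.6
PA-LSTM    JRRELP                       66.4        77.6
PA-LSTM    JRRELP w/o coupling loss     65.6        76.8
PA-LSTM    JRRELP w/o KGLP loss         66.3        77.3
C-GCN      Baseline                     66.4        78.0
C-GCN      JRRELP                       67.4        78.5
C-GCN      JRRELP w/o coupling loss     66.8        78.1
C-GCN      JRRELP w/o KGLP loss         67.0        78.4

Table 3: F1 results from our ablation study on TACRED and SemEval-MM. For each model, the last two rows report JRRELP trained without the coupling loss and without the KGLP loss, respectively.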

Ablation Experiments. To examine the effects of JRRELP’s KGLP loss and coupling loss over the traditional relation extraction objective (the RE loss), we perform an ablation study with each term removed on methods from both RE approach classes: sequence-based (PA-LSTM) and graph-based (C-GCN). Table 3 shows the F1 results. Metrics for each dataset are reported in the same manner as previous results. All ablation performances illustrate the importance of both terms as part of JRRELP’s framework, as their respective models are worse than the full JRRELP architecture: they exhibit performance drops of up to 0.8 F1 (Table 3). Moreover, we observe the largest performance drop from the removal of the coupling loss, which removes JRRELP’s consistency constraint between RE and KGLP models. This highlights the importance of establishing this relationship while training to achieve strong performance.

5 Related Work

There are three areas of research that are related to the method we propose in this paper. In this section, we discuss related work in each area and position JRRELP appropriately.

Relation Extraction. Existing RE approaches can be classified into two categories: sequence-based and graph-based methods. Given a sentence in the form of a sequence of tokens, sequence-based models infer relations by applying recurrent neural networks (Zhou et al., 2016; Zhang et al., 2017), convolutional neural networks (Zeng et al., 2014; Nguyen and Grishman, 2015; Wang et al., 2016), or transformers (Alt et al., 2019; Soares et al., 2019; Joshi et al., 2019; Peters et al., 2019). In addition to the sentence, graph-based methods use the structural characteristics of the sentence dependency tree to achieve strong performance. Peng et al. (2017) apply an n-ary Tree-LSTM (Tai et al., 2015) over a split dependency tree, while Zhang et al. (2018) and Guo et al. (2019) employ a graph-convolution network (GCN) over the dependency tree.

Knowledge Graph Link Prediction. Existing KGLP approaches broadly fall under two model classes: single-hop and multi-hop. Given a subject and a relation, single-hop models infer a set of objects by mapping the subject and relation to unique learnable finite-dimensional vectors (embeddings) and jointly transforming them to produce an object set. These approaches can be translational over the embeddings (Bordes et al., 2013), multiplicative (Yang et al., 2015; Trouillon et al., 2016), or a combination of the two (Dettmers et al., 2018; Lin et al., 2015; Ji et al., 2015; Balazevic et al., 2019; Stoica et al., 2020; Wang et al., 2020). On the other hand, multi-hop approaches determine object sets by finding paths in the KG connecting subjects to objects, and primarily consist of path-ranking methods (Lao et al., 2011; Gardner et al., 2013; Neelakantan et al., 2015; Guu et al., 2015; Toutanova et al., 2016; Das et al., 2018; Lin et al., 2018).

Joint Frameworks. Several approaches (Weston et al., 2013; Han et al., 2018; Wang et al., 2018; Zhang et al., 2019; Beltagy et al., 2018) have explored using the additional supervision provided by a KG to improve relation extraction performance. Of these, we believe Weston et al. (2013), Han et al. (2018), and Wang et al. (2018) are most similar to our work. Weston et al. (2013) propose a framework that uses a KGLP model, TransE (Bordes et al., 2013), as an additional re-ranking term when evaluating an RE model. While employing TransE as a re-ranker improves performance, their framework trains TransE and the respective RE approach separately, without parameter sharing. This only allows very restricted information sharing during evaluation. Han et al. (2018) propose a dual-attention framework for jointly learning KGLP and RE tasks that computes a weight distribution over training data and shares parameters between tasks. However, like Weston et al. (2013), Han et al. (2018) limit KGLP model selection to those methods that can be reformulated as inferring relations from subjects and objects. This excludes a large number of recent methods (Dettmers et al., 2018; Balazevic et al., 2019; Das et al., 2018; Lin et al., 2018; Stoica et al., 2020; Wang et al., 2020) that cannot be reframed in this way. Wang et al. (2018) also present a joint framework, LFDS, for training relation extraction approaches via KGLP objectives. In particular, the architecture introduces an objective similar to one of JRRELP’s loss terms, but can only support the same class of KGLP methods as Weston et al. (2013) and Han et al. (2018). Moreover, LFDS requires KGLP pre-training, and does not share core parameters such as relation representations between RE and KGLP methods. This can create domain shift between the two respective models and impact performance.

JRRELP improves upon previous literature by providing a single joint objective that simultaneously addresses all of the aforementioned limitations. First, JRRELP proposes an abstract framework that supports many RE and KGLP methods through three standard loss terms. Second, JRRELP shares all its parameters between KGLP and RE tasks, and establishes a novel cyclical learning structure over core parameters. Third, RE and KGLP tasks are jointly trained without any problem-specific pre-training, enabling the tasks to benefit from each other simultaneously during training. Fourth, JRRELP’s structure facilitates support for RE and KGLP methods with minimal implementation changes: it only requires substituting them for the RE prediction model and the KGLP merge function, respectively.

6 Conclusion

We propose JRRELP, a novel framework that improves upon existing relation extraction approaches by leveraging insights from the complementary problem of knowledge graph link prediction. JRRELP bridges these two tasks through an abstract multi-task learning framework that jointly learns RE and KGLP problems by unconstrained parameter sharing. We demonstrate this generality by extending three diverse relation extraction methods and improving their performance.


  • C. Alt, M. Hübner, and L. Hennig (2019) Improving relation extraction by pre-trained language representations. CoRR abs/1906.03088. External Links: Link, 1906.03088 Cited by: 3rd item, §2.1, §2.1, Table 2, §4, §4, §5.
  • I. Balazevic, C. Allen, and T. Hospedales (2019) TuckER: tensor factorization for knowledge graph completion. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, pp. 5185–5194. External Links: Link, Document Cited by: §5, §5.
  • I. Beltagy, K. Lo, and W. Ammar (2018) Improving distant supervision with maxpooled attention and sentence-level supervision. CoRR abs/1810.12956. External Links: Link, 1810.12956 Cited by: §1, §5.
  • A. Bordes, N. Usunier, A. Garcia-Duran, J. Weston, and O. Yakhnenko (2013) Translating embeddings for modeling multi-relational data. In Advances in Neural Information Processing Systems, pp. 2787–2795. Cited by: §2.2, §3, §5, §5.
  • R. Das, S. Dhuliawala, M. Zaheer, L. Vilnis, I. Durugkar, A. Krishnamurthy, A. Smola, and A. McCallum (2018) Go for a walk and arrive at the answer: reasoning over paths in knowledge bases using reinforcement learning. In International Conference on Learning Representations (ICLR), Cited by: §5, §5.
  • T. Dettmers, P. Minervini, P. Stenetorp, and S. Riedel (2018) Convolutional 2d knowledge graph embeddings. In Proceedings of the 32nd AAAI Conference on Artificial Intelligence, pp. 1811–1818. External Links: Link Cited by: §2.2, §5, §5.
  • J. Devlin, M. Chang, K. Lee, and K. Toutanova (2018) BERT: pre-training of deep bidirectional transformers for language understanding. CoRR abs/1810.04805. External Links: Link, 1810.04805 Cited by: §2.1.
  • M. Gardner, P. P. Talukdar, B. Kisiel, and T. Mitchell (2013) Improving learning and inference in a large knowledge-base using latent syntactic cues. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pp. 833–838. Cited by: §5.
  • Z. Guo, Y. Zhang, and W. Lu (2019) Attention guided graph convolutional networks for relation extraction. CoRR abs/1906.07510. External Links: Link, 1906.07510 Cited by: §2.1, §2, §4, §4, §5.
  • K. Guu, J. Miller, and P. Liang (2015) Traversing knowledge graphs in vector space. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pp. 318–327. Cited by: §5.
  • X. Han, Z. Liu, and M. Sun (2018) Neural knowledge acquisition via mutual attention between knowledge graph and text. In AAAI, Cited by: §1, §5.
  • I. Hendrickx, S. N. Kim, Z. Kozareva, P. Nakov, D. Ó Séaghdha, S. Padó, M. Pennacchiotti, L. Romano, and S. Szpakowicz (2010) SemEval-2010 task 8: multi-way classification of semantic relations between pairs of nominals. In Proceedings of the 5th International Workshop on Semantic Evaluation, Uppsala, Sweden, pp. 33–38. External Links: Link Cited by: §4.
  • S. Hochreiter and J. Schmidhuber (1997) Long short-term memory. Neural computation 9 (8), pp. 1735–1780.
  • G. Ji, S. He, L. Xu, K. Liu, and J. Zhao (2015) Knowledge graph embedding via dynamic mapping matrix. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pp. 687–696. Cited by: §2.2, §3, §5.
  • M. Joshi, D. Chen, Y. Liu, D. S. Weld, L. Zettlemoyer, and O. Levy (2019) SpanBERT: improving pre-training by representing and predicting spans. CoRR abs/1907.10529. External Links: Link, 1907.10529 Cited by: §2.1, §5.
  • N. Lao, T. Mitchell, and W. W. Cohen (2011) Random walk inference and learning in a large scale knowledge base. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pp. 529–539. Cited by: §5.
  • X. V. Lin, R. Socher, and C. Xiong (2018) Multi-hop knowledge graph reasoning with reward shaping. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 3243–3253. Cited by: §5, §5.
  • Y. Lin, Z. Liu, M. Sun, Y. Liu, and X. Zhu (2015) Learning entity and relation embeddings for knowledge graph completion. In Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, AAAI’15, pp. 2181–2187. External Links: ISBN 0-262-51129-0, Link Cited by: §2.2, §3, §5.
  • C. D. Manning, M. Surdeanu, J. Bauer, J. Finkel, S. J. Bethard, and D. McClosky (2014) The Stanford CoreNLP natural language processing toolkit. In Association for Computational Linguistics (ACL) System Demonstrations, pp. 55–60. External Links: Link Cited by: §2.
  • A. Neelakantan, B. Roth, and A. McCallum (2015) Compositional vector space models for knowledge base completion. In ACL, Cited by: §5.
  • T. H. Nguyen and R. Grishman (2015) Relation extraction: perspective from convolutional neural networks. In Proceedings of the 1st Workshop on Vector Space Modeling for Natural Language Processing, Denver, Colorado, pp. 39–48. External Links: Link, Document Cited by: §5.
  • N. Peng, H. Poon, C. Quirk, K. Toutanova, and W. Yih (2017) Cross-sentence n-ary relation extraction with graph lstms. External Links: 1708.03743 Cited by: §5.
  • M. E. Peters, M. Neumann, R. L. Logan IV, R. Schwartz, V. Joshi, S. Singh, and N. A. Smith (2019) Knowledge enhanced contextual word representations. External Links: 1909.04164 Cited by: §2.1, Table 2, §5.
  • L. B. Soares, N. FitzGerald, J. Ling, and T. Kwiatkowski (2019) Matching the blanks: distributional similarity for relation learning. CoRR abs/1906.03158. External Links: Link, 1906.03158 Cited by: §2.1, §2.1, §4, §4, §5.
  • G. Stoica*, O. Stretcu*, E. A. Platanios*, B. Póczos, and T. M. Mitchell (2020) Contextual Parameter Generation for Knowledge Graph Link Prediction. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), Cited by: §2.2, §5, §5.
  • K. S. Tai, R. Socher, and C. D. Manning (2015) Improved semantic representations from tree-structured long short-term memory networks. CoRR abs/1503.00075. External Links: Link, 1503.00075 Cited by: §5.
  • K. Toutanova, V. Lin, W. Yih, H. Poon, and C. Quirk (2016) Compositional learning of embeddings for relation paths in knowledge base and text. In ACL, Cited by: §5.
  • T. Trouillon, J. Welbl, S. Riedel, É. Gaussier, and G. Bouchard (2016) Complex embeddings for simple link prediction. In International Conference on Machine Learning (ICML), Vol. 48, pp. 2071–2080. Cited by: §2.2, §3, §5.
  • G. Wang, W. Zhang, R. Wang, Y. Zhou, X. Chen, W. Zhang, H. Zhu, and H. Chen (2018) Label-free distant supervision for relation extraction via knowledge graph embedding. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, pp. 2246–2255. External Links: Link, Document Cited by: §1, §2.2, §5.
  • L. Wang, Z. Cao, G. de Melo, and Z. Liu (2016) Relation classification via multi-level attention CNNs. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Berlin, Germany, pp. 1298–1307. External Links: Link, Document Cited by: §5.
  • R. Wang, B. Li, S. Hu, W. Du, and M. Zhang (2020) Knowledge graph embedding via graph attenuated attention networks. IEEE Access 8, pp. 5212–5224. Cited by: §2.2, §5, §5.
  • J. Weston, A. Bordes, O. Yakhnenko, and N. Usunier (2013) Connecting language and knowledge bases with embedding models for relation extraction. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, Seattle, Washington, USA, pp. 1366–1371. External Links: Link Cited by: §1, §2.2, §5.
  • B. Yang, W. Yih, X. He, J. Gao, and L. Deng (2015) Embedding entities and relations for learning and inference in knowledge bases. In International Conference on Learning Representations (ICLR), Cited by: §2.2, §3, §5.
  • D. Zeng, K. Liu, S. Lai, G. Zhou, and J. Zhao (2014) Relation classification via convolutional deep neural network. In Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers, Dublin, Ireland, pp. 2335–2344. External Links: Link Cited by: §5.
  • N. Zhang, S. Deng, Z. Sun, G. Wang, X. Chen, W. Zhang, and H. Chen (2019) Long-tail relation extraction via knowledge graph embeddings and graph convolution networks. CoRR abs/1903.01306. External Links: Link, 1903.01306 Cited by: §1, §5.
  • Y. Zhang, P. Qi, and C. D. Manning (2018) Graph convolution over pruned dependency trees improves relation extraction. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, pp. 2205–2215. External Links: Link, Document Cited by: 3rd item, §2.1, §2.1, §2, §4, §4, §5.
  • Y. Zhang, V. Zhong, D. Chen, G. Angeli, and C. D. Manning (2017) Position-aware attention and supervised data improve slot filling. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, Copenhagen, Denmark, pp. 35–45. External Links: Link, Document Cited by: §2.1, §2.1, §2, §4, §4, §5.
  • P. Zhou, W. Shi, J. Tian, Z. Qi, B. Li, H. Hao, and B. Xu (2016) Attention-based bidirectional long short-term memory networks for relation classification. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), Berlin, Germany, pp. 207–212. External Links: Link, Document Cited by: §5.