Downstream Model Design of Pre-trained Language Model for Relation Extraction Task
Supervised relation extraction methods based on deep neural networks play an important role in recent information extraction research. However, their performance still falls short of a good level because of complicated relations. Meanwhile, recently proposed pre-trained language models (PLMs) have achieved great success in multiple natural language processing tasks through fine-tuning when combined with a downstream task model. However, the original standard tasks of PLMs do not include relation extraction. We believe that PLMs can also be used to solve the relation extraction problem, but it is necessary to establish a specially designed downstream task model, or even a loss function, for dealing with complicated relations. In this paper, a new network architecture with a special loss function is designed to serve as a downstream model of PLMs for supervised relation extraction. Experiments show that our method significantly outperforms the current best baseline models on multiple public relation extraction datasets.
As a subtask of information extraction, the predefined relation extraction task plays an important role in building structured knowledge. In recent years, with the rapid development of deep learning, many relation extraction methods based on deep neural network structures have been proposed. At present, these popular neural network methods [15, 29, 19, 14, 9, 6, 2, 25] can be mainly summarized by the following steps:
1. Obtain embeddings of the target text from an encoder. The embeddings are usually dense vectors in a low-dimensional space, for example GloVe vectors or vectors from a pre-trained language model like BERT.
2. Process these embeddings and integrate their information according to a certain network structure, such as CNN, Attention Mechanism [14, 6] or GCN, to obtain the representation of the target relation.
3. Perform training on labeled dataset based on a certain classifier using the encoded information as input, such as the Softmax classifier.
However, the current methods do not perform very well on public datasets because of complicated relations, such as long-distance relations, single sentences with multiple relations, and overlapped relations on entity pairs.
Recently, the emergence of pre-trained language models has provided new ideas for solving relation extraction problems. Pre-trained language models [1, 21, 28, 22] are very large neural network models based on the deep Transformer structure. Their initial parameters are learned through large-scale self-supervised training, and they are then combined with downstream models to solve specific tasks by fine-tuning. Experiments show that the downstream performance of such models is usually far superior to that of conventional models. Unfortunately, relation extraction is not among the downstream tasks directly supported by the original pre-trained language models.
Our current work attempts to leverage the power of the PLMs to establish an effective downstream model that is competent for the relation extraction tasks. In particular, we implement three important improvements in the main steps, as described above:
1. Use a pre-trained language model (this article uses BERT) instead of the traditional encoder, and obtain a variety of token embeddings from the model. We extract the embeddings from two different layers to represent the head and tail entities separately, because this may help to learn reversible relations. We also add context information to the entity representation to deal with long-distance relations.
2. Next, we calculate a parameterized asymmetric kernel inner product matrix between the head and tail embeddings of all tokens in a sequence. Since the kernels differ from relation to relation, we believe such a product helps to distinguish multiple relations between the same entity pair. The matrix can thus be treated as a tendency score indicating where a certain relation exists.
3. Use the Sigmoid classifier instead of the Softmax classifier, and use the average probability of the token pairs corresponding to each entity pair as the final probability that the entity pair has a certain relation. Thus, for each relation type and each entity pair in the input text, we can independently calculate the probability that the relation exists. This mechanism allows the model to predict multiple relations, even ones that overlap on the same entities.
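To make the motivation for the third step concrete, the following minimal sketch contrasts the two classifiers on a single entity pair. The score values and the 0.5 cut-off are illustrative assumptions, not values taken from our experiments.

```python
import math

def softmax(scores):
    """Normalize scores jointly: relation types compete and sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(scores):
    """Score each relation type independently in (0, 1)."""
    return [1.0 / (1.0 + math.exp(-s)) for s in scores]

# Hypothetical tendency scores of one entity pair for three relation types.
scores = [3.0, 2.5, -4.0]

softmax_probs = softmax(scores)
sigmoid_probs = sigmoid(scores)

# With an illustrative 0.5 cut-off, Softmax can keep at most one relation,
# while Sigmoid can keep several relations for the same entity pair.
softmax_hits = [i for i, p in enumerate(softmax_probs) if p > 0.5]
sigmoid_hits = [i for i, p in enumerate(sigmoid_probs) if p > 0.5]
```

Here the first two relation types both receive high scores. Only the independent Sigmoid outputs let both survive the cut-off, which is the behavior needed for overlapped relations.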
We also notice that it is easy to integrate the Named Entity Recognition (NER) task into our model to handle joint extraction. For instance, adding a Bi-LSTM and a CRF after the BERT encoding layer yields an efficient NER component. By simply adding the NER loss to the original loss, we can build a joint extraction model that avoids pipeline errors. However, this paper focuses on relation extraction and does not pursue this extension.
Our experiments mainly verify two conclusions.
1. A pre-trained language model can perform well on the relation extraction task with our redesigned downstream model. We evaluate the method on three public datasets: SemEval 2010 Task 8, NYT, and WebNLG. The experimental results show that our model achieves a new state of the art.
2. A test for our performance in different overlapped relation situations and multiple relation situations shows that our model is robust when faced with complex relations.
In recent years, numerous models have been proposed to process the supervised relation extraction task with deep neural networks. The general paradigm is to train a deep neural network on texts labeled with entity information and the relation-type label between the entities. The model can then predict the possible relation-type label for the specified entities in a new text. We introduce three kinds of common neural network structures here.
CNN based methods apply a Convolutional Neural Network to capture text information and reconstruct embeddings. Besides the simple CNN, a series of improved models have been proposed [15, 29, 19, 14, 9]. However, the information fused by a CNN is often local, which limits the model's ability to handle long-distance relations. These methods are therefore limited to a large extent and have not reached a good level of application. Since more efficient methods have been proposed recently, we do not compare our approach with this early work in this article.
GNN based methods use a Graph Neural Network, mainly the Graph Convolutional Network, to deal with relations between entities. A GNN is a neural network that can capture the topological characteristics of graph-type data, so it is necessary to define a prior graph structure. The GNN-based relation extraction methods [31, 4, 2] usually use the dependency tree of the text as the input prior graph structure, thereby obtaining a richer information representation than CNN. However, these methods rely on a dependency parser, so pipeline errors exist. In addition, a grammar-level dependency tree is still a shallow representation that fails to efficiently express relations between words.
Some recent approaches consider relation extraction as a downstream task for PLMs. These methods [32, 25, 10, 27] have achieved some success, but we believe that they have not yet fully utilized the language model. The main reason is the lack of a valid representation of the relation: those methods tend to express the relation as a one-dimensional vector. We believe that since the relation is determined by the correlations between the vectors of the head and tail entities, it should naturally be represented as a matrix rather than a one-dimensional vector. In this way, more information, such as the order of the entities and their positions in the text, is used when predicting their relations.
We introduce the method from two aspects: network structure and loss function. Figure 1 shows the overall architecture of this method.
From the perspective of the network structure, our model has two major parts. The first part is an encoder that utilizes a pre-trained language model such as BERT. We obtain three embeddings for a given input text from the BERT model: the embedding $E_w$ for each token in the text, the embedding $E_p$ obtained by passing $E_w$ through a self-attention Transformer, and the embedding $E_{CLS}$ of the entire text (that is, the CLS embedding provided by BERT). The second part is the relation computing layer. In this layer, we assume that $E_p$ represents the tail entity encoded with some available predicate information, while $E_w$ combined with $E_{CLS}$ represents the head entity with context information. By performing a correlation calculation on those embeddings, the tendency score matrix $S_i$ of the $i$-th relation over all entity pairs can be obtained.
From the perspective of the loss function, we first use the Sigmoid activation function to compute the probabilities $P_i$ of the $i$-th relation from $S_i$. We use the locations of entities to construct a mask matrix $M$ and use it to preserve the information in $P_i$ that represents existing entity pairs in a sentence (see details in Section 3.3 and Figure 1). Based on the labels $Y_i$ indicating whether each entity pair is an instance of the $i$-th relation type or not, we use the average values in each area of $P_i$ to compute a Binary Cross Entropy (BCE) loss for this specific relation. Finally, the total loss sums the values from all relations. This formulation allows the model to predict multiple relations in a single sentence or even multiple relations for a single entity pair. Details and formulas are described in the subsections below.
The current pre-trained language models are basically based on the Transformer structure. They have a common feature: each layer of the Transformer can output a hidden vector corresponding to the input text $T$, where $T$ is an array of tokens with length $n$. Taking BERT as an example, some simple analyses show that the hidden vector of each layer can be used as word embeddings of $T$, with modest differences in precision. Generally speaking, the deepest hidden vector representation of the Transformer network tends to work best for a downstream fine-tuning task thanks to the information integration performed by a deeper network. However, here we select the penultimate layer output vector as the initial embedding $E_w$ ($E_w \in \mathbb{R}^{n \times d}$, where $d$ is the number of hidden dimensions) for the text representation with entity information.
To get the predicate-aware embedding $E_p$, we use the last Transformer layer of BERT, which is actually a multi-head self-attention with a fully connected FFN, to process our initial embeddings:
$$E_p = \mathrm{Transformer}(E_w)$$
where $E_p \in \mathbb{R}^{n \times d}$ is the last output vector of BERT.
Such an operation is applied so that the embedding of every token in $E_p$ will, in addition to $E_w$, fuse some information from tokens in other positions. In this way, although dataset annotations usually do not carry explicit predicate information, the Transformer structure of the BERT model allows $E_p$ to selectively blend contextual information that is helpful for the final task. We expect that after sufficient fine-tuning, words with higher attention association scores correspond, to some extent, to a predicate of a certain relation. $E_w$ and $E_p$ can be used respectively as basic entity representations and as entity representations that incorporate predicate information. In order to better capture the overall context information, BERT's CLS embedding ($E_{CLS}$) is also added to each token's embedding to improve the basic entity representation:
$$E_h = E_w + E_{CLS}$$
Note that $E_{CLS}$ is broadcast to all tokens.
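The broadcast addition of the CLS embedding can be sketched as follows. The names `E_w`, `E_cls`, `E_h` and the helper `add_cls` are chosen here for illustration, the numbers are toy values, and plain Python lists stand in for the tensors a real implementation would use.

```python
def add_cls(token_embeddings, cls_embedding):
    """Add the sentence-level vector to every token vector (broadcast over tokens)."""
    return [
        [w + c for w, c in zip(token, cls_embedding)]
        for token in token_embeddings
    ]

# Toy values: n = 3 tokens, d = 2 hidden dimensions.
E_w = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]  # token embeddings (penultimate layer)
E_cls = [0.5, -0.5]                          # CLS embedding of the whole text

# Head-entity representation enriched with sentence-level context.
E_h = add_cls(E_w, E_cls)
```

Every row of `E_h` receives the same shift, so the relative differences between tokens are preserved while each token also carries a summary of the whole sentence.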
We apply an asymmetric kernel inner product method to calculate the similarity between the head representation $E_h$ and the tail representation $E_p$:
$$S_i = (E_h W_h^i)(E_p W_t^i)^\top$$
Here $W_h^i$ and $W_t^i$ are the transformation matrices of the head-entity and tail-entity embeddings for the $i$-th relation, respectively. They are parameters learned during the training process for each relation.
If there are $n$ tokens in one input text, $S_i$ is a square matrix with $n$ rows and $n$ columns. Thus it can be treated as unnormalized probability scores for the $i$-th relation between all the tokens. That is to say, $S_i^{mn}$, the element at position $(m, n)$, represents the possibility that the $i$-th relation exists between the tokens at these two locations. Finally, we use the Sigmoid function to normalize $S_i$ to the range $(0, 1)$:
$$P_i = \frac{1}{1 + e^{-S_i}}$$
where $P_i$ is the normalized probability matrix of the $i$-th relation.
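A minimal sketch of this relation computing layer for one relation type is given below. It assumes the simplest form consistent with the description, namely a product of linearly transformed head and tail embeddings without bias or scaling terms. All names and numbers are illustrative.

```python
import math

def matmul(a, b):
    """Plain matrix product of two list-of-lists matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def relation_probabilities(E_h, E_p, W_h, W_t):
    """Return the n x n matrix sigmoid((E_h W_h)(E_p W_t)^T) for one relation."""
    heads = matmul(E_h, W_h)   # head-entity view of every token
    tails = matmul(E_p, W_t)   # tail-entity view of every token
    scores = matmul(heads, transpose(tails))
    return [[1.0 / (1.0 + math.exp(-s)) for s in row] for row in scores]

# Toy inputs: n = 2 tokens, d = 2 hidden dimensions.
E_h = [[1.0, 0.0], [0.0, 1.0]]
E_p = [[1.0, 0.0], [0.0, 1.0]]
W_h = [[1.0, 0.0], [0.0, 1.0]]    # hypothetical learned head transformation
W_t = [[0.0, 2.0], [-2.0, 0.0]]   # hypothetical learned tail transformation

P = relation_probabilities(E_h, E_p, W_h, W_t)
```

Because the head and tail transformations differ, `P[1][0]` and `P[0][1]` are not equal in this toy case. This asymmetry is what lets the score depend on which token acts as head and which as tail.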
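A problem of the token-level probability matrix $P_i$ is that it describes relations between tokens, not entities. Therefore, we use an entity-mask matrix to fix this problem. For each entity pair, the location information of the entities is known. Suppose that all entities from the input text $T$ constitute a set of entity pairs in the form:
$$\mathcal{S} = \{(x, y)\}$$
Suppose $(b_x, e_x)$ are the beginning and end position indexes of an entity $x$ in the token array. Therefore, we construct a mask matrix $M$ to satisfy
$$M^{mn} = \begin{cases} 1, & \exists\, (x, y) \in \mathcal{S}:\ b_x \le m \le e_x \ \text{and}\ b_y \le n \le e_y \\ 0, & \text{otherwise} \end{cases}$$
where $(m, n)$ is the subscript of the matrix element. Similarly, we can construct a label matrix $Y_i$ for the $i$-th relation:
$$Y_i^{mn} = \begin{cases} 1, & \exists\, (x, y) \in \mathcal{Y}_i:\ b_x \le m \le e_x \ \text{and}\ b_y \le n \le e_y \\ 0, & \text{otherwise} \end{cases}$$
where $\mathcal{Y}_i$ is the labeled $i$-th relation set of entity pairs from the input text $T$. We use this mask matrix to reserve the predicted probabilities of every entity pair from $P_i$, and then use the average Binary Cross Entropy to calculate the target loss of relation $i$:
$$L_i = \mathrm{BCE}_{avg}(P_i \circ M,\ Y_i)$$
where $\circ$ is the Hadamard product and $\mathrm{BCE}_{avg}$ denotes the Binary Cross Entropy averaged over matrix elements.
Thus the final loss of relation prediction is
$$L = \sum_i L_i$$
where $i$ is the index of each relation. While predicting, we use the average value of the elements in $P_i$ whose locations correspond to a certain entity pair $(x, y)$ as the probability of the possible triplet consisting of the $i$-th relation and the entity pair $(x, y)$.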
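The masking, the loss for one relation, and the prediction rule can be sketched as follows. Entities are given as inclusive token spans. The choice to average the cross entropy over the masked token pairs is an assumption of this sketch, and all names and numbers are illustrative.

```python
import math

def span_cells(head_span, tail_span):
    """All (row, col) token positions covered by a head span and a tail span."""
    (hb, he), (tb, te) = head_span, tail_span
    return [(m, k) for m in range(hb, he + 1) for k in range(tb, te + 1)]

def pair_probability(P, head_span, tail_span):
    """Average token-pair probability inside the block of one entity pair."""
    cells = span_cells(head_span, tail_span)
    return sum(P[m][k] for m, k in cells) / len(cells)

def relation_loss(P, entity_pairs, positive_pairs):
    """Mean binary cross entropy over token pairs that belong to some entity pair.

    Assumption: the average runs over the masked cells only.
    """
    total, count = 0.0, 0
    for pair in entity_pairs:
        label = 1.0 if pair in positive_pairs else 0.0
        for m, k in span_cells(*pair):
            p = min(max(P[m][k], 1e-12), 1.0 - 1e-12)  # guard against log(0)
            total -= label * math.log(p) + (1.0 - label) * math.log(1.0 - p)
            count += 1
    return total / count

# Toy probability matrix for one relation over n = 3 tokens.
P = [[0.1, 0.1, 0.9],
     [0.1, 0.1, 0.7],
     [0.2, 0.4, 0.1]]

A = (0, 1)  # entity A covers tokens 0..1
B = (2, 2)  # entity B covers token 2
entity_pairs = [(A, B), (B, A)]
positive_pairs = {(A, B)}  # only (head=A, tail=B) is labeled with this relation

loss = relation_loss(P, entity_pairs, positive_pairs)
prob_ab = pair_probability(P, A, B)
prob_ba = pair_probability(P, B, A)
```

In this toy case the block for (A, B) averages to 0.8 and the block for (B, A) to 0.3, so the two directions of the same entity pair receive different probabilities.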
Table 1: Hyper-parameters.
|Maximum Training Epochs|
|Maximum Sequence Length||512||100||512|
This section describes the experimental process and best results while testing our methods on multiple public datasets. We performed overall comparison experiments with the baseline methods and completed more fine-grained analysis and comparison in different types of complex relations. Codes and more details can be found in Supplementary Materials.
We use an Nvidia Tesla V100 32GB for training. The BERT model we use is [BERT-Base, Uncased]. Hyper-parameters are shown in Table 1. The optimizer is Adam. Based on our problem formulation as described in Section 3, our model fits a binary classifier to predict whether a triplet exists or not. Therefore, it gives a probability for each possible triplet. We still need a threshold to divide the positive and negative classes, and we set it as for balance. More details are shown in Supplementary Materials. Our code is developed on OpenNRE.
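The thresholding step can be sketched as follows. The helper name `decode_triplets`, the probabilities, and the threshold value of 0.5 used in the call are all assumptions made for illustration.

```python
def decode_triplets(pair_probs, threshold):
    """Keep every (relation, head, tail) whose probability exceeds the threshold."""
    return sorted(
        triplet for triplet, prob in pair_probs.items() if prob > threshold
    )

# Hypothetical probabilities for candidate triplets of one sentence.
pair_probs = {
    ("place of death", "Ernst Haefliger", "Davos"): 0.91,
    ("place of birth", "Ernst Haefliger", "Davos"): 0.22,
    ("place of death", "Davos", "Ernst Haefliger"): 0.03,
}

# The threshold here is an illustrative value, not a reported setting.
triplets = decode_triplets(pair_probs, threshold=0.5)
```

Each candidate is decided independently, so any number of triplets (including none) can be returned for a sentence.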
As described above, the Pre-trained Language Model (PLM) is so powerful that comparing our method with older methods may be unfair. Therefore, we chose recent work (C-AGGCN, GraphRel2p, -MTB, HBT) published after the appearance of PLMs, especially BERT, as our baselines. Such a selection helps to measure whether we have better exploited the potential of PLMs in relation extraction tasks, rather than merely relying on their power. As usual, we used the micro-F1 score as the evaluation criterion.
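For reference, micro-averaged precision, recall and F1 over predicted triplets can be computed as in the sketch below. It assumes exact matching of (relation, head, tail) triplets and pools the counts over all sentences. The two toy sentences are for illustration only.

```python
def micro_prf(gold, predicted):
    """Micro-averaged precision, recall and F1 over per-sentence triplet sets."""
    true_pos = sum(len(g & p) for g, p in zip(gold, predicted))
    n_pred = sum(len(p) for p in predicted)
    n_gold = sum(len(g) for g in gold)
    precision = true_pos / n_pred if n_pred else 0.0
    recall = true_pos / n_gold if n_gold else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)

# Two toy sentences: one gold triplet is missed in the first sentence.
gold = [
    {("place of birth", "Ernst Haefliger", "Davos"),
     ("place of death", "Ernst Haefliger", "Davos")},
    {("place of birth", "Muhammad Ali", "Louisville")},
]
predicted = [
    {("place of death", "Ernst Haefliger", "Davos")},
    {("place of birth", "Muhammad Ali", "Louisville")},
]

precision, recall, f1 = micro_prf(gold, predicted)
```

Pooling the counts before dividing means that sentences with many triplets weigh more than sentences with a single triplet, which is the usual meaning of micro averaging.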
We performed our experiments on three commonly used public datasets (SemEval 2010 Task 8, NYT, WebNLG) and compared the performance of our method to the baseline methods mentioned above. We followed the splits and special processing of the datasets used by the previous baseline models. More details are in Supplementary Materials.
We found that the complexity of the samples in these datasets varied widely. Similar to the approach of Copy_re, we measured the complexity of the relations in the three datasets along two dimensions: the number of samples with overlapping relations and the number of samples with multiple relations.
For overlapping relations, we followed the method proposed in Copy_re to divide samples into three types: "Normal", where no relations in the sample overlap; "EPO", where at least two relations overlap on the same entity pair; and "SEO", where at least two relations share a single entity. These three types reflect the complexity of relations. For multiple relations, we also divide samples into three types: "Single", where only one relation appears in the sample; "Double", where two relations appear; and "Multiple", where no fewer than three appear. These three types reflect the complexity of samples.
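The two groupings can be sketched as follows. This sketch assumes that entity pairs are ordered, that a sample may carry both the EPO and SEO labels, and that "SEO" means two different entity pairs sharing at least one entity. The function names are hypothetical.

```python
def overlap_types(triplets):
    """Return the overlap labels of a sample given (relation, head, tail) triplets."""
    unique = set(triplets)
    pairs = [(head, tail) for _, head, tail in unique]
    types = set()
    # EPO: the same entity pair carries more than one relation.
    if len(set(pairs)) < len(pairs):
        types.add("EPO")
    # SEO: two different entity pairs share at least one entity.
    distinct = sorted(set(pairs))
    for i, (h1, t1) in enumerate(distinct):
        for h2, t2 in distinct[i + 1:]:
            if {h1, t1} & {h2, t2}:
                types.add("SEO")
    return types or {"Normal"}

def relation_count_type(triplets):
    """Return 'Single', 'Double' or 'Multiple' by the number of distinct triplets."""
    n = len(set(triplets))
    return "Single" if n == 1 else "Double" if n == 2 else "Multiple"

epo_sample = [("place of birth", "Ernst Haefliger", "Davos"),
              ("place of death", "Ernst Haefliger", "Davos")]
seo_sample = [("is country of", "Romania", "Alba Iulia"),
              ("is capital of", "Bucharest", "Romania")]
normal_sample = [("place of birth", "Muhammad Ali", "Louisville")]
```

The two dimensions are independent: both `epo_sample` and `seo_sample` are "Double" samples, but they fall into different overlap types.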
Table 2 shows the complexity analysis of each dataset.
Table 3 shows the performance of our method on all test data sets and the comparisons with the corresponding baseline methods. Given different test settings, we find that our model generally outperformed the baseline models, with a margin over the optimal baseline ranging from to .
Table 5: Examples from the NYT and WebNLG datasets.
|Sample||NYT (complex)||NYT (simple)||WebNLG (complex)||WebNLG (simple)|
|Text||Ernst Haefliger…died on Saturday in Davos, Switzerland, where he maintained a second home.||Georgia Powers…said Louisville was finally ready to welcome Muhammad Ali home.||The 1 Decembrie 1918 University is located in Alba Iulia, Romania. The capital of the country is Bucharest…||The Germans of Romania are one of the ethnic groups in Romania…the 1 Decembrie 1918 University is located in the city of Alba Iulia.|
|Labels||(place of birth:Ernst Haefliger, Davos), (place of death:Ernst Haefliger, Davos)||(place of birth:Muhammad Ali, Louisville)||(is country of:Romania, Alba Iulia), (is capital of:Bucharest, Romania)||(is country of:Romania, Alba Iulia)|
|Predictions||(place of death:Ernst Haefliger, Davos)||(place of birth:Muhammad Ali, Louisville)||(is country of:Romania, Alba Iulia), (is capital of:Bucharest, Romania)||(ethnic groups in:Romania, Alba Iulia)|
To explain the difference in performance margin, we design a detailed experiment to evaluate how our model performs on each relation type, as shown in Table 4. In addition, we compared our model on each relation type with 3 baseline methods (among them HBT) in Figure 2 and Figure 3. We only compared F1 scores on NYT and WebNLG, since nearly no complex relations exist in SemEval. Detailed analyses of each dataset are given below.
SemEval. SemEval only has around samples with two relations ("Others" class excluded), with only of them in the test dataset. Thus, in the Double-relation type, our model's performance drops by about because of large variance.
NYT. On NYT, our method shows large fluctuations. The reason is that NYT is the only dataset constructed by distant supervision, so its data quality is low. Distant supervision, which automatically labels data from knowledge graphs, introduces more errors in complex relations than in simple relations. Therefore, although the number of complex relations appears sufficiently large, the performance of our model on complex relations still lags behind that on simple relations (around lower in F1 score, around lower in Recall). Nonetheless, the comparison shows that our scores on simple relations are higher than those of both baselines (around higher than HBT and higher than GraphRel). Even for complex relations, we are still significantly better (around ) than GraphRel, but slightly lower (around ) than HBT.
WebNLG. Our performance (Precision, Recall and Micro-F1 score) remains stable regardless of how complicated the type is, except for EPO. This is because there are only around samples in EPO, so the model's performance on this type has high variance. Interestingly, the Micro-F1 scores on complex relations are even higher than on simple relations, by around , which is further discussed below. The comparison shows that our model is far better than the baselines (around higher than HBT and higher than GraphRel).
Given the results from all the datasets, our method shows consistent high performance on simple relation extraction tasks. Furthermore, it generally demonstrates stable performance when faced with more challenging settings including overlapped relation and multiple relation extraction.
Another interesting phenomenon is that, on WebNLG, our model does better (around higher on F1 score, higher on Recall) on complicated relations than on simple relations. Our guess is that since our model can predict multiple relations at the same time, it may combine semantic correlations between multiple relations to find more annotated relations by preventing some semantic drift. On NYT, in contrast, semantic correlations between multiple relations generated by distant supervision are very likely to be spurious, and our model tends to neglect those meaningless correlations. Therefore, it filters out potentially mislabeled relations and yields a lower Recall.
To support the above reasoning, Table 5 shows some real examples from the WebNLG and NYT datasets. On WebNLG, the examples demonstrate the beneficial effect of properly labeled complex relations on our model. In the simple sample, our model mistakenly considered "Romania" to be an ethnic group in "Alba Iulia", while in the complex sample from WebNLG, the model correctly predicted that "Romania" is the country of "Alba Iulia" by successfully identifying "Bucharest" as the capital of "Romania". In comparison, for the simple sample from NYT, the model predicted "place of birth" correctly, but it failed to predict this relation in the complex sample, since the relation is not real there and has no semantic correlation with the correct relation "place of death".
This paper introduces a downstream network architecture for pre-trained language models to process supervised relation extraction. The network calculates the relation score matrix of all entity pairs on all relation types by extracting different head and tail entity embeddings from the pre-trained language model. Experiments show that it achieves significant improvements across multiple public datasets when compared to current best practices. Moreover, further experiments demonstrate the ability of this method to deal with complex relations. We also believe this network does not conflict with many other methods, so it can be combined with them (e.g., with other special PLMs like ERNIE or -MTB) to perform better.
In addition, we believe that the current architecture has the potential to be improved for dealing with many other relation problems, including long-tail relation extraction, open relation extraction, and joint extraction.
Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv preprint arXiv:1910.10683.
Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp. 148–163.