UniRE: A Unified Label Space for Entity Relation Extraction

07/09/2021 ∙ by Yijun Wang, et al. ∙ 0

Many joint entity relation extraction models set up two separate label spaces for the two sub-tasks (i.e., entity detection and relation classification). We argue that this setting may hinder the information interaction between entities and relations. In this work, we propose to eliminate the different treatment of the two sub-tasks' label spaces. The input of our model is a table containing all word pairs from a sentence. Entities and relations are represented by squares and rectangles in the table. We apply a unified classifier to predict each cell's label, which unifies the learning of the two sub-tasks. For testing, an effective (yet fast) approximate decoder is proposed for finding squares and rectangles in tables. Experiments on three benchmarks (ACE04, ACE05, SciERC) show that, using only half the number of parameters, our model achieves accuracy competitive with the best extractor, and is faster.




Code Repositories


Source code for "UniRE: A Unified Label Space for Entity Relation Extraction.", ACL2021.


1 Introduction

Figure 1: Example of a table for joint entity relation extraction. Each cell corresponds to a word pair. Entities are squares on the diagonal, relations are rectangles off the diagonal. Note that PER-SOC is an undirected (symmetrical) relation type, while PHYS and ORG-AFF are directed (asymmetrical) relation types. The table exactly expresses overlapped relations, e.g., the person entity "David Perkins" participates in two relations, ("David Perkins", "wife", PER-SOC) and ("David Perkins", "California", PHYS). For every cell, the same biaffine model predicts its label. The joint decoder is set to find the best squares and rectangles.

Extracting structured information from plain text is a long-standing research topic in NLP. Typically, it aims to recognize specific entities and relations for profiling the semantics of sentences. An example is shown in Figure 1, where a person entity "David Perkins" and a geography entity "California" have a physical location relation PHYS.

Methods for detecting entities and relations can be categorized into pipeline models or joint models. In the pipeline setting, entity models and relation models are independent with disentangled feature spaces and output label spaces. In the joint setting, on the other hand, some parameter sharing of feature spaces (Miwa and Bansal, 2016; Katiyar and Cardie, 2017) or decoding interactions (Yang and Cardie, 2013; Sun et al., 2019) are imposed to explore the common structure of the two tasks. It was believed that joint models could be better since they can alleviate error propagations among sub-models, have more compact parameter sets, and uniformly encode prior knowledge (e.g., constraints) on both tasks.

However, Zhong and Chen (2020) recently showed that, with the help of modern pre-training tools (e.g., BERT), separating the entity and relation models (with independent encoders and pipeline decoding) could surpass existing joint models. They argue that, since the output label spaces of entity and relation models are different, separate encoders (compared with shared encoders) could better capture distinct contextual information, avoid potential conflicts among them, and help decoders make more accurate predictions. That is, separate label spaces deserve separate encoders.

In this paper, we pursue a better joint model for entity relation extraction. After revisiting existing methods, we find that although entity models and relation models share encoders, their label spaces are usually still separate (even in models with joint decoders). Therefore, parallel to Zhong and Chen (2020), we ask whether joint encoders (decoders) deserve joint label spaces.

The challenge of developing a unified entity-relation label space is that the two sub-tasks are usually formulated as different learning problems (e.g., entity detection as sequence labeling, relation classification as multi-class classification), and their labels are placed on different things (e.g., words vs. word pairs). One prior attempt (Zheng et al., 2017) handles both sub-tasks with one sequence labeling model. A compound label set was devised to encode both entities and relations. However, the model's expressiveness is sacrificed: it can detect neither overlapping relations (i.e., entities participating in multiple relations) nor isolated entities (i.e., entities not appearing in any relation).

Our key idea for defining a new unified label space is that, if we view the solution of Zheng et al. (2017) as performing relation classification during entity labeling, we could also consider the reverse direction by seeing entity detection as a special case of relation classification. Our new input space is a two-dimensional table with each entry corresponding to a word pair in the sentence (Figure 1). The joint model assigns labels to each cell from a unified label space (the union of the entity type set and the relation type set). Graphically, entities are squares on the diagonal, and relations are rectangles off the diagonal. This formulation retains full model expressiveness regarding existing entity-relation extraction scenarios (e.g., overlapped relations, directed relations, undirected relations). It is also different from the current table filling settings for entity relation extraction (Miwa and Sasaki, 2014; Gupta et al., 2016; Zhang et al., 2017; Wang and Lu, 2020), which still have separate label spaces for entities and relations, and treat on/off-diagonal entries differently.

Based on the tabular formulation, our joint entity relation extractor performs two actions, filling and decoding. First, filling the table means predicting each word pair's label, which is similar to the arc prediction task in dependency parsing. We adopt the biaffine attention mechanism (Dozat and Manning, 2016) to learn interactions between word pairs. We also impose two structural constraints on the table through structural regularizations. Next, given the table filled with label logits, we devise an approximate joint decoding algorithm to output the final extracted entities and relations. Basically, it efficiently finds split points in the table to identify squares and rectangles (which also differs from existing table filling models, which still apply certain sequential decoding and fill tables incrementally).

Experimental results on three benchmarks (ACE04, ACE05, SciERC) show that the proposed joint method achieves competitive performance compared with the current state-of-the-art extractor (Zhong and Chen, 2020): it is better on ACE04 and SciERC, and competitive on ACE05 (footnote 1: source code and models are available at https://github.com/Receiling/UniRE). Meanwhile, our new joint model is fast at decoding (more than 10× faster than the exact pipeline implementation according to Table 5, and comparable to an approximate pipeline, which attains lower performance). It also has a more compact parameter set: the shared encoder uses only half the number of parameters compared with the separate encoders of Zhong and Chen (2020).

2 Task Definition

Given an input sentence $s = x_1, \dots, x_{|s|}$ ($x_i$ is a word), this task is to extract a set of entities $\mathcal{E}$ and a set of relations $\mathcal{R}$. An entity $e \in \mathcal{E}$ is a span ($e.\mathrm{span}$) with a pre-defined type $e.\mathrm{type} \in \mathcal{Y}_e$ (e.g., PER, GPE). The span is a continuous sequence of words. A relation $r \in \mathcal{R}$ is a triplet $(e_1, e_2, l)$, where $e_1, e_2$ are two entities and $l \in \mathcal{Y}_r$ is a pre-defined relation type describing the semantic relation between the two entities (e.g., the PHYS relation between PER and GPE mentioned before). Here $\mathcal{Y}_e$ and $\mathcal{Y}_r$ denote the sets of possible entity types and relation types respectively.

We formulate joint entity relation extraction as a table filling task (multi-class classification of each word pair in sentence $s$), as shown in Figure 1. For the sentence $s$, we maintain a table $T$ of size $|s| \times |s|$. For each cell $(i, j)$ in table $T$, we assign a label $y_{i,j} \in \mathcal{Y}$, where $\mathcal{Y} = \mathcal{Y}_e \cup \mathcal{Y}_r \cup \{\bot\}$ ($\bot$ denotes no relation). For each entity $e$, the corresponding cells $y_{i,j}$ ($x_i \in e.\mathrm{span}$, $x_j \in e.\mathrm{span}$) should be filled with $e.\mathrm{type}$. For each relation $r = (e_1, e_2, l)$, the corresponding cells $y_{i,j}$ ($x_i \in e_1.\mathrm{span}$, $x_j \in e_2.\mathrm{span}$) should be filled with $l$ (footnote 2: assuming no overlapping entities in one sentence), while all other cells should be filled with $\bot$. In the test phase, decoding entities and relations becomes a rectangle finding problem. Note that solving this problem is not trivial, and we propose a simple but effective joint decoding algorithm to tackle this challenge.
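The table construction described above can be sketched in a few lines of Python. This is only an illustration, not the authors' code: the function name `build_table`, the `NONE` placeholder for the "no relation" label, the inclusive `(start, end, type)` span encoding, and the five-word toy sentence are all assumptions made for the example. Only the entity pair ("David Perkins", "California") and its PHYS relation come from Figure 1.

```python
NONE = "NONE"  # hypothetical stand-in for the "no relation" label


def build_table(n_words, entities, relations, symmetric_labels=frozenset()):
    """Fill an n_words x n_words label table.

    entities:  list of (start, end, type), word indices inclusive.
    relations: list of (head_entity_index, tail_entity_index, label).
    symmetric_labels: labels whose rectangle is mirrored across the diagonal.
    """
    table = [[NONE] * n_words for _ in range(n_words)]
    # Each entity becomes a square on the diagonal.
    for start, end, etype in entities:
        for i in range(start, end + 1):
            for j in range(start, end + 1):
                table[i][j] = etype
    # Each relation becomes a rectangle off the diagonal.
    for head, tail, label in relations:
        h_start, h_end, _ = entities[head]
        t_start, t_end, _ = entities[tail]
        for i in range(h_start, h_end + 1):
            for j in range(t_start, t_end + 1):
                table[i][j] = label
                if label in symmetric_labels:
                    table[j][i] = label
    return table


# Toy sentence (invented for illustration).
words = ["David", "Perkins", "lives", "in", "California"]
entities = [(0, 1, "PER"), (4, 4, "GPE")]
relations = [(0, 1, "PHYS")]  # directed: PER -> GPE
table = build_table(len(words), entities, relations, symmetric_labels={"PER-SOC"})
```

Because PHYS is directed, only the rectangle with rows "David Perkins" and column "California" is filled; a symmetric label such as PER-SOC would also fill the mirrored rectangle.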

Figure 2: Overview of our model architecture. One main objective ($\mathcal{L}_{entry}$) and two additional objectives ($\mathcal{L}_{sym}$, $\mathcal{L}_{imp}$) are imposed on the probability tensor $\mathcal{P}$ and optimized jointly.

3 Approach

In this section, we first introduce our biaffine model for the table filling task based on pre-trained language models (Section 3.1). Then we detail the main objective function of the table filling task (Section 3.2) and some constraints which are imposed on the table in the training stage (Section 3.3). Finally, we present the joint decoding algorithm to extract entities and relations (Section 3.4). Figure 2 shows an overview of our model architecture (footnote 3: we only show three labels of $\mathcal{Y}$ in Figure 2 for simplicity and clarity).

3.1 Biaffine Model

Given an input sentence $s$, to obtain the contextual representation $h_i$ for each word, we use a pre-trained language model (PLM) as our sentence encoder (e.g., BERT). The output of the encoder is

$$\{h_1, \dots, h_{|s|}\} = \mathrm{PLM}(\{x_1, \dots, x_{|s|}\}),$$

where $x_i$ is the input representation of each word. Taking BERT as an example, $x_i$ sums the corresponding token, segment and position embeddings. To capture long-range dependencies, we also employ cross-sentence context following Zhong and Chen (2020), which extends the sentence to a fixed window size $W$ ($W = 200$ in our default settings).

To better encode direction information of words in table $T$, we use the deep biaffine attention mechanism (Dozat and Manning, 2016), which achieves impressive results in the dependency parsing task. Specifically, we employ two dimension-reducing MLPs (multi-layer perceptrons), i.e., a head MLP and a tail MLP, on each $h_i$ as

$$h_i^{head} = \mathrm{MLP}_{head}(h_i), \qquad h_i^{tail} = \mathrm{MLP}_{tail}(h_i),$$

where $h_i^{head}$ and $h_i^{tail}$ are projection representations, allowing the model to identify the head or tail role of each word. Next, we calculate the scoring vector $g_{i,j}$ of each word pair with the biaffine model,

$$g_{i,j} = \mathrm{Biaff}(h_i^{head}, h_j^{tail}),$$
$$\mathrm{Biaff}(h_1, h_2) = h_1^{\top} U_1 h_2 + U_2 (h_1 \oplus h_2) + b,$$

where $U_1$ and $U_2$ are weight parameters, $b$ is the bias, and $\oplus$ denotes concatenation.
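As a rough illustration of the biaffine scorer, the NumPy sketch below computes one scoring vector per word pair. It is not the released implementation. The parameter shapes ($U_1$ as `(L, d, d)`, $U_2$ as `(L, 2d)`, $b$ as `(L,)`, with `L` the number of labels), the single-layer bias-free MLPs, and the tanh approximation of GELU are assumptions made to keep the example short.

```python
import numpy as np


def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3)))


def biaffine_scores(H, W_head, W_tail, U1, U2, b):
    """Score every word pair.

    H:              (n, d_enc) encoder outputs h_1..h_n
    W_head, W_tail: (d_enc, d) weights of assumed single-layer MLPs
    U1: (L, d, d), U2: (L, 2d), b: (L,)
    Returns g with shape (n, n, L); g[i, j] is the scoring vector of pair (i, j).
    """
    n = H.shape[0]
    head = gelu(H @ W_head)  # (n, d) head-role representations
    tail = gelu(H @ W_tail)  # (n, d) tail-role representations
    # Bilinear term: head_i^T U1[l] tail_j for every label l.
    bilinear = np.einsum("id,ldk,jk->ijl", head, U1, tail)
    # Linear term on the concatenation of head_i and tail_j.
    pair = np.concatenate(
        [np.repeat(head[:, None, :], n, axis=1), np.repeat(tail[None, :, :], n, axis=0)],
        axis=-1,
    )  # (n, n, 2d)
    linear = pair @ U2.T  # (n, n, L)
    return bilinear + linear + b
```

Using separate head and tail projections is what makes the score direction-aware: `g[i, j]` and `g[j, i]` are generally different, which the table needs in order to represent directed relations.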

3.2 Table Filling

After obtaining the scoring vector $g_{i,j}$, we feed $g_{i,j}$ into the softmax function to predict the corresponding label, yielding a categorical probability distribution over the label space $\mathcal{Y}$ as

$$P(\mathbf{y}_{i,j} \mid s) = \mathrm{Softmax}(\mathrm{dropout}(g_{i,j})).$$

In our experiments, we observe that applying dropout to $g_{i,j}$, similar to de-noising auto-encoding, can further improve the performance (footnote 4: we set the dropout rate $p = 0.2$ by default). We refer to this trick as logit dropout. The training objective is to minimize

$$\mathcal{L}_{entry} = -\frac{1}{|s|^2} \sum_{i=1}^{|s|} \sum_{j=1}^{|s|} \log P(\mathbf{y}_{i,j} = y_{i,j} \mid s),$$

where the gold label $y_{i,j}$ can be read from annotations, as shown in Figure 1.
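The softmax, logit dropout and entry loss can be sketched as follows. This is a minimal NumPy illustration under stated assumptions: dropout is implemented as standard inverted dropout (zeroing logits and rescaling by `1 / (1 - p)`), which the text does not spell out, and the function names are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    z = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def logit_dropout(g, p, rng):
    """Inverted dropout on the logits (assumed variant)."""
    mask = rng.random(g.shape) >= p
    return g * mask / (1.0 - p)


def entry_loss(g, gold, p=0.2, rng=None, train=True):
    """Mean negative log-likelihood over all n*n cells.

    g:    (n, n, L) scoring vectors
    gold: (n, n) integer gold label ids
    """
    if train:
        if rng is None:
            rng = np.random.default_rng(0)
        g = logit_dropout(g, p, rng)
    P = softmax(g)
    n = gold.shape[0]
    picked = np.take_along_axis(P, gold[..., None], axis=-1)[..., 0]
    return -np.log(picked).sum() / (n * n)
```

With all-zero logits the distribution is uniform, so the loss equals the log of the number of labels, which gives a quick sanity check for an implementation.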

3.3 Constraints

In fact, the objective $\mathcal{L}_{entry}$ in Section 3.2 is based on the assumption that each label is independent. This assumption simplifies the training procedure, but ignores some structural constraints. For example, entities and relations correspond to squares and rectangles in the table, and $\mathcal{L}_{entry}$ does not encode this constraint explicitly. To enhance our model, we propose two intuitive constraints, symmetry and implication, which are detailed in this section. Here we introduce a new notation $\mathcal{P} \in \mathbb{R}^{|s| \times |s| \times |\mathcal{Y}|}$, denoting the stack of $P(\mathbf{y}_{i,j} \mid s)$ for all word pairs in sentence $s$ (footnote 5: $\mathcal{P}$ is computed without the logit dropout mentioned in Section 3.2, to preserve learned structure).


We have several observations from the table at the tag level. Firstly, the squares corresponding to entities must be symmetrical about the diagonal. Secondly, for symmetrical relations, the relation triples $(e_1, e_2, l)$ and $(e_2, e_1, l)$ are equivalent, thus the rectangles corresponding to the two counterpart relation triples are also symmetrical about the diagonal. As shown in Figure 1, the rectangles corresponding to ("his", "wife", PER-SOC) and ("wife", "his", PER-SOC) are symmetrical about the diagonal. We divide the set of labels $\mathcal{Y}$ into a symmetrical label set $\mathcal{Y}_{sym}$ and an asymmetrical label set $\mathcal{Y}_{asym}$. The matrix $\mathcal{P}_{:,:,t}$ should be symmetrical about the diagonal for each label $t \in \mathcal{Y}_{sym}$. We formulate this tag-level constraint as the symmetrical loss,

$$\mathcal{L}_{sym} = \frac{1}{|s|^2} \sum_{i=1}^{|s|} \sum_{j=1}^{|s|} \sum_{t \in \mathcal{Y}_{sym}} \left| \mathcal{P}_{i,j,t} - \mathcal{P}_{j,i,t} \right|.$$

We list all $\mathcal{Y}_{sym}$ in Table 1 for our adopted datasets.

Ent Rel
Table 1: Symmetrical label set $\mathcal{Y}_{sym}$ for the datasets used.
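The symmetrical loss is a direct sum over the symmetric slices of the probability tensor. A minimal NumPy sketch, assuming the symmetric labels are given as a list of integer ids (`sym_ids` is a hypothetical name):

```python
import numpy as np


def symmetry_loss(P, sym_ids):
    """L_sym: mean absolute difference between P[i, j, t] and P[j, i, t].

    P:       (n, n, L) probability tensor
    sym_ids: integer ids of the symmetric labels
    """
    n = P.shape[0]
    P_sym = P[:, :, sym_ids]               # keep only symmetric-label slices
    diff = P_sym - P_sym.transpose(1, 0, 2)  # P[i, j, t] - P[j, i, t]
    return np.abs(diff).sum() / (n * n)
```

Each off-diagonal pair is counted twice (once as `(i, j)` and once as `(j, i)`), which matches the double sum over all `i` and `j` in the formula.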


A key intuition is that if a relation exists, then its two argument entities must also exist. In other words, it is impossible for a relation to exist without two corresponding entities. From the perspective of probability, it implies that the probability of a relation is not greater than the probability of each argument entity. Since we model entity and relation labels in a unified probability space, this idea can be easily used in our model as the implication constraint. We impose this constraint on $\mathcal{P}$: for each word on the diagonal, its maximum probability over the entity type space $\mathcal{Y}_e$ must not be lower than the maximum probability for other words in the same row or column over the relation type space $\mathcal{Y}_r$. We formulate this table-level constraint as the implication loss,

$$\mathcal{L}_{imp} = \frac{1}{|s|} \sum_{i=1}^{|s|} \Big[ \max_{l \in \mathcal{Y}_r} \{\mathcal{P}_{i,:,l}, \mathcal{P}_{:,i,l}\} - \max_{t \in \mathcal{Y}_e} \{\mathcal{P}_{i,i,t}\} \Big]_*,$$

where $[u]_* = \max(u, 0)$ is the hinge loss. It is worth noting that we do not add a margin in this loss function. Since the value of each item is a probability and might be relatively small, it is meaningless to set a large margin.
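The implication loss can be written compactly with array reductions. The sketch below is an illustration rather than the authors' code; `ent_ids` and `rel_ids` are hypothetical lists holding the integer ids of entity and relation labels.

```python
import numpy as np


def implication_loss(P, ent_ids, rel_ids):
    """L_imp: penalize relation scores that exceed the entity score on the diagonal.

    P: (n, n, L) probability tensor.
    """
    n = P.shape[0]
    rel = P[:, :, rel_ids]
    row_max = rel.max(axis=(1, 2))  # max over j, l of P[i, j, l]
    col_max = rel.max(axis=(0, 2))  # max over j, l of P[j, i, l]
    rel_max = np.maximum(row_max, col_max)
    diag = P[np.arange(n), np.arange(n)]      # (n, L) diagonal cells
    ent_max = diag[:, ent_ids].max(axis=1)    # best entity score per word
    # Hinge without margin: only violations contribute.
    return np.maximum(rel_max - ent_max, 0.0).sum() / n
```

A word whose row or column contains a confident relation score but whose diagonal cell has a weak entity score contributes a positive term; words that satisfy the constraint contribute zero.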

Finally, we jointly optimize the three objectives in the training stage as $\mathcal{L}_{entry} + \mathcal{L}_{sym} + \mathcal{L}_{imp}$ (footnote 6: we directly sum the three losses to avoid introducing more hyper-parameters).

3.4 Decoding

In the testing stage, given the probability tensor $\mathcal{P} \in \mathbb{R}^{|s| \times |s| \times |\mathcal{Y}|}$ of the sentence $s$ (footnote 7: for each symmetrical label $t \in \mathcal{Y}_{sym}$, the tensor is symmetrized so that $\mathcal{P}_{i,j,t} = \mathcal{P}_{j,i,t}$), how to decode all rectangles (including squares) corresponding to entities or relations remains a non-trivial problem. Since brute-force enumeration of all rectangles is intractable, a new joint decoding algorithm is needed. We expect our decoder to have:

  • Simple implementation and fast decoding. We permit slight decoding accuracy drops for scalability.

  • Strong interactions between entities and relations. When decoding entities, it should take the relation information into account, and vice versa.

Inspired by the procedures of Sun et al. (2019), we propose a three-step decoding algorithm: decode spans first (entity spans or spans between entities), then decode the entity type of each span, and finally decode the relation type of each entity pair (Figure 3). We consider each cell's probability scores on all labels (including entity labels and relation labels) and predict spans according to a threshold. Then, we predict entities and relations with the highest score. Our heuristic decoding algorithm can be very efficient. Next we detail the entire decoding process, and give a formal description in Appendix A.

Figure 3: Overview of our joint decoding algorithm. It consists of three steps: span decoding, entity type decoding, and relation type decoding.

Span Decoding

One crucial observation about a ground-truth table is that, for an arbitrary entity, its corresponding rows (or columns) are exactly the same in the table (e.g., row 1 and row 2 of Figure 1 are identical). This holds not only for the diagonal entries (entities are squares), but also for the off-diagonal entries (if the entity participates in a relation with another entity, all its rows (columns) carry that relation label in the same way). In other words, if two adjacent rows/columns differ, there must be an entity boundary between them (i.e., one belongs to the entity and the other does not). Therefore, if our biaffine model is reasonably trained, we can use this property on a model-predicted table to find the split positions of entity boundaries. As expected, experiments (Figure 4) verify our assumption. We adapt this idea to the 3-dimensional probability tensor $\mathcal{P}$.

Specifically, we flatten $\mathcal{P} \in \mathbb{R}^{|s| \times |s| \times |\mathcal{Y}|}$ into a matrix of shape $|s| \times (|s| \cdot |\mathcal{Y}|)$ from the row perspective, and then calculate the Euclidean distances ($\ell_2$ distances) of adjacent rows. Similarly, we calculate the Euclidean distances of adjacent columns from the corresponding matrix built from the column perspective, and then average the two distances as the final distance. If the distance is larger than the threshold $\alpha$ ($\alpha = 1.4$ in our default settings), this position is a split position. In this way, all spans are decoded with a single pass over the $|s| - 1$ pairs of adjacent rows and columns.
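The span decoding step can be sketched in NumPy as below. This is an illustrative reading of the description, not the released decoder: the function names, the convention that a split position `k` means "a boundary between word `k-1` and word `k`", and the one-hot toy tensor are assumptions. The default threshold 1.4 is the value discussed in Section 4.4.

```python
import numpy as np


def split_positions(P, alpha=1.4):
    """Return indices k such that a boundary lies between word k-1 and word k."""
    n = P.shape[0]
    rows = P.reshape(n, -1)                     # row view:    P[i, :, :]
    cols = P.transpose(1, 0, 2).reshape(n, -1)  # column view: P[:, i, :]
    d_row = np.linalg.norm(rows[1:] - rows[:-1], axis=1)
    d_col = np.linalg.norm(cols[1:] - cols[:-1], axis=1)
    dist = (d_row + d_col) / 2.0                # average of the two distances
    return [k + 1 for k in range(n - 1) if dist[k] > alpha]


def spans_from_splits(n, splits):
    """Turn split positions into inclusive (start, end) spans covering 0..n-1."""
    bounds = [0] + list(splits) + [n]
    return [(bounds[k], bounds[k + 1] - 1) for k in range(len(bounds) - 1)]


# Toy one-hot tensor: label ids 0 = none, 1 = PER, 2 = GPE, 3 = PHYS (assumed ids).
labels = np.zeros((5, 5), dtype=int)
labels[0:2, 0:2] = 1   # entity square for words 0-1
labels[4, 4] = 2       # entity square for word 4
labels[0:2, 4] = 3     # directed relation rectangle
P_demo = np.eye(4)[labels]  # shape (5, 5, 4)

splits = split_positions(P_demo)
spans = spans_from_splits(5, splits)
```

On this toy tensor the rows of words 0 and 1 are identical, so no split is placed between them, while the transitions into and out of the unlabeled middle words exceed the threshold.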

Entity Type Decoding

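Given a span $(i, j)$ produced by span decoding (footnote 8: $i$ and $j$ denote the start and end indices of the span), we decode the entity type $\hat{t}$ according to the corresponding square symmetric about the diagonal: $\hat{t} = \arg\max_{t \in \mathcal{Y}_e \cup \{\bot\}} \mathrm{Avg}(\mathcal{P}_{i:j,\, i:j,\, t})$. If $\hat{t} \in \mathcal{Y}_e$, we decode an entity. If $\hat{t} = \bot$, the span is not an entity.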

Relation Type Decoding

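After entity type decoding, given an entity $e_1$ with the span $(i, j)$ and another entity $e_2$ with the span $(m, n)$, we decode the relation type $\hat{l}$ between $e_1$ and $e_2$ according to the corresponding rectangle. Formally, $\hat{l} = \arg\max_{l \in \mathcal{Y}_r \cup \{\bot\}} \mathrm{Avg}(\mathcal{P}_{i:j,\, m:n,\, l})$. If $\hat{l} \in \mathcal{Y}_r$, we decode a relation $(e_1, e_2, \hat{l})$. If $\hat{l} = \bot$, $e_1$ and $e_2$ have no relation.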
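Entity and relation type decoding can be sketched together, since both pick the best label for a block of the probability tensor. The code below is an assumption-laden illustration: scores are aggregated by averaging over the block, spans are inclusive `(start, end)` pairs, `none_id` is the id of the "no relation" label, and the toy tensor and label ids are invented.

```python
import numpy as np


def best_label(block, candidate_ids, none_id=0):
    """Pick the candidate (or none) with the highest block-averaged score."""
    cand = list(candidate_ids) + [none_id]
    scores = block.reshape(-1, block.shape[-1]).mean(axis=0)  # average over cells
    return cand[int(np.argmax(scores[cand]))]


def decode_types(P, spans, ent_ids, rel_ids, none_id=0):
    """Decode entities from diagonal squares, then relations from rectangles."""
    entities = []
    for i, j in spans:
        t = best_label(P[i:j + 1, i:j + 1], ent_ids, none_id)
        if t != none_id:
            entities.append((i, j, t))
    relations = []
    for a, e1 in enumerate(entities):
        for b, e2 in enumerate(entities):
            if a == b:
                continue
            l = best_label(P[e1[0]:e1[1] + 1, e2[0]:e2[1] + 1], rel_ids, none_id)
            if l != none_id:
                relations.append((e1, e2, l))
    return entities, relations


# Toy one-hot tensor: label ids 0 = none, 1 = PER, 2 = GPE, 3 = PHYS (assumed ids).
labels = np.zeros((5, 5), dtype=int)
labels[0:2, 0:2] = 1
labels[4, 4] = 2
labels[0:2, 4] = 3
P_demo = np.eye(4)[labels]

entities, relations = decode_types(
    P_demo, spans=[(0, 1), (2, 3), (4, 4)], ent_ids=[1, 2], rel_ids=[3]
)
```

Because every ordered entity pair is scored separately, a directed relation is recovered only in the direction whose rectangle carries the label.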

4 Experiments

Dataset #sents #ents(#types) #rels(#types)
ACE04 8,683 22,519(7) 4,417(6)
ACE05 14,525 38,287(7) 7,691(6)
SciERC 2,687 8,094(6) 5,463(7)
Table 2: The statistics of the adopted datasets.
Dataset  Model                       Encoder   Entity              Relation
                                               P     R     F1      P     R     F1
ACE04    Li and Ji (2014)            -         83.5  76.2  79.7    60.8  36.1  45.3
         Miwa and Bansal (2016)      LSTM      80.8  82.9  81.8    48.7  48.1  48.4
         Katiyar and Cardie (2017)   LSTM      81.2  78.1  79.6    46.4  45.3  45.7
         Li et al. (2019)            BERT      84.4  82.9  83.6    50.1  48.7  49.4
         Wang and Lu (2020)          ALBERT    -     -     88.6    -     -     59.6
         Zhong and Chen (2020)       BERT      -     -     89.2    -     -     60.1
         Zhong and Chen (2020)       ALBERT    -     -     90.3    -     -     62.2
         UniRE                       BERT      87.4  88.0  87.7    62.1  58.0  60.0
         UniRE                       ALBERT    88.9  90.0  89.5    67.3  59.3  63.0
ACE05    Li and Ji (2014)            -         85.2  76.9  80.8    65.4  39.8  49.5
         Miwa and Bansal (2016)      LSTM      82.9  83.9  83.4    57.2  54.0  55.6
         Katiyar and Cardie (2017)   LSTM      84.0  81.3  82.6    55.5  51.8  53.6
         Sun et al. (2019)           LSTM      86.1  82.4  84.2    68.1  52.3  59.1
         Li et al. (2019)            BERT      84.7  84.9  84.8    64.8  56.2  60.2
         Wang et al. (2020)          BERT      -     -     87.2    -     -     63.2
         Wang and Lu (2020)          ALBERT    -     -     89.5    -     -     64.3
         Zhong and Chen (2020)       BERT      -     -     90.2    -     -     64.6
         Zhong and Chen (2020)       ALBERT    -     -     90.9    -     -     67.8
         UniRE                       BERT      88.8  88.9  88.8    67.1  61.8  64.3
         UniRE                       ALBERT    89.9  90.5  90.2    72.3  60.7  66.0
SciERC   Wang et al. (2020)          SciBERT   -     -     68.0    -     -     34.6
         Zhong and Chen (2020)       SciBERT   -     -     68.2    -     -     36.7
         UniRE                       SciBERT   65.8  71.1  68.4    37.3  36.6  36.9
Table 3: Overall evaluation (P: precision, R: recall). A marker symbol (not preserved in this copy) denotes models that leverage cross-sentence context information.


We conduct experiments on three entity relation extraction benchmarks: ACE04 (Doddington et al., 2004) (footnote 9: https://catalog.ldc.upenn.edu/LDC2005T09), ACE05 (Walker et al., 2006) (footnote 10: https://catalog.ldc.upenn.edu/LDC2006T06), and SciERC (Luan et al., 2018) (footnote 11: http://nlp.cs.washington.edu/sciIE/). Table 2 shows the dataset statistics. Besides, we provide detailed dataset specifications in Appendix B.


Following suggestions in Taillé et al. (2020), we evaluate Precision (P), Recall (R), and F1 scores with micro-averaging and adopt the Strict Evaluation criterion. Specifically, a predicted entity is correct if its type and boundaries are correct, and a predicted relation is correct if its relation type is correct, as well as the boundaries and types of two argument entities are correct.
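The strict, micro-averaged evaluation can be illustrated with a small Python function. This is a sketch, not an official scorer: items are represented as hashable tuples, with an entity as `(start, end, type)` and a relation as `(entity_1, entity_2, label)`, so that a relation only matches when both argument entities match in boundaries and type. The representation and the example values are assumptions.

```python
def micro_prf(gold_sets, pred_sets):
    """Micro-averaged precision, recall and F1 over per-sentence sets of items."""
    tp = sum(len(g & p) for g, p in zip(gold_sets, pred_sets))
    n_pred = sum(len(p) for p in pred_sets)
    n_gold = sum(len(g) for g in gold_sets)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_gold if n_gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# One sentence: the second predicted entity has the right span but the wrong type.
gold_ents = [{(0, 1, "PER"), (4, 4, "GPE")}]
pred_ents = [{(0, 1, "PER"), (4, 4, "LOC")}]
ent_scores = micro_prf(gold_ents, pred_ents)

# Strict relation matching: the wrong argument type makes the relation wrong too.
gold_rels = [{((0, 1, "PER"), (4, 4, "GPE"), "PHYS")}]
pred_rels = [{((0, 1, "PER"), (4, 4, "LOC"), "PHYS")}]
rel_scores = micro_prf(gold_rels, pred_rels)
```

Embedding the full entity tuples inside each relation tuple is what makes the comparison strict: an entity type error propagates to every relation that uses that entity.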

Implementation Details

We tune all hyper-parameters based on the averaged entity F1 and relation F1 on the ACE05 development set, then keep the same settings on ACE04 and SciERC. For fair comparison with previous works, we use three pre-trained language models: bert-base-uncased (Devlin et al., 2019), albert-xxlarge-v1 (Lan et al., 2019) and scibert-scivocab-uncased (Beltagy et al., 2019) as the sentence encoder and fine-tune them in the training stage (footnote 12: the first two are for ACE04 and ACE05, and the last one is for SciERC).

For the MLP layer, we use a fixed hidden size $d$ and GELU as the activation function. We use the AdamW optimizer (Loshchilov and Hutter, 2017) with $\beta_1 = 0.9$ and $\beta_2 = 0.9$, and observe a phenomenon similar to Dozat and Manning (2016) in that changing $\beta_2$ from 0.9 to 0.999 causes a significant drop in final performance. The batch size is 32, and the learning rate is 5e-5 with weight decay 1e-5. We apply a linear warm-up learning rate scheduler with a warm-up ratio of 0.2. We train our model for a maximum of 200 epochs (300 epochs for SciERC) and employ an early stopping strategy. We perform all experiments on an Intel(R) Xeon(R) W-3175X CPU and an NVIDIA Quadro RTX 8000 GPU.
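As a small illustration of the warm-up ratio, the sketch below ramps the learning rate linearly over the first 20% of training steps. The text does not say what happens after warm-up, so holding the rate constant afterwards is purely an assumption of this example, as is the function name `warmup_lr`.

```python
def warmup_lr(step, total_steps, base_lr=5e-5, warmup_ratio=0.2):
    """Learning rate at a 0-indexed step.

    Linear ramp up to base_lr during the warm-up phase; constant afterwards
    (the post-warm-up behavior is an assumption, not taken from the paper).
    """
    warmup_steps = max(1, int(total_steps * warmup_ratio))
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    return base_lr


# Rates for the first few steps of a 100-step run (20 warm-up steps).
schedule = [warmup_lr(s, 100) for s in range(100)]
```

A schedule that decays after warm-up would only change the final `return` line.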

4.1 Performance Comparison

Table 3 summarizes previous works and our UniRE on three datasets (footnote 13: since Luan et al. (2019) and Wadden et al. (2019) neglect the argument entity type in relation evaluation and underperform our baseline (Zhang et al., 2020), we do not compare their results here). In general, UniRE achieves the best relation performance on ACE04 and SciERC and a comparable result on ACE05. Compared with the previous best joint model (Wang and Lu, 2020), our model advances both entity and relation performance, i.e., an absolute F1 gain of +0.9 and +0.7 for entities and +3.4 and +1.7 for relations, on ACE04 and ACE05 respectively. Compared with the best pipeline model (Zhong and Chen, 2020) (current SOTA), our model achieves superior performance on ACE04 and SciERC and comparable performance on ACE05. SciERC is much smaller than ACE04/ACE05, so entity performance on SciERC drops sharply. Since Zhong and Chen (2020) is a pipeline method, its relation performance is severely influenced by the poor entity performance. Our model is less influenced in this case and achieves better performance. Besides, our model can achieve better relation performance even with worse entity results on ACE04. Our base model (BERT) already achieves competitive relation performance, which even exceeds prior models based on BERT (Li et al., 2019) and ALBERT (Wang and Lu, 2020). These results suggest that the proposed unified label space is effective for exploring the interaction between entities and relations. Note that all subsequent experiment results on ACE04 and ACE05 are based on BERT for efficiency.

Settings                      ACE05         SciERC
                              Ent    Rel    Ent    Rel
Default                       88.8   64.3   68.4   36.9
w/o symmetry loss             88.9   64.0   67.3   35.5
w/o implication loss          89.0   63.3   68.0   37.1
w/o logit dropout             88.8   61.8   66.9   34.7
w/o cross-sentence context    87.9   62.7   65.3   32.1
hard decoding                 74.0   34.6   46.1   17.8
Table 4: Results (F1 score) with different settings on ACE05 and SciERC test sets. Note that we use BERT on ACE05.

4.2 Ablation Study

In this section, we analyze the effects of components in UniRE with different settings (Table 4). In particular, we implement a naive decoding algorithm for comparison, namely "hard decoding", which takes the "intermediate table" as input. The "intermediate table" is the hard form of the probability tensor $\mathcal{P}$ output by the biaffine model, i.e., it chooses the class with the highest probability as the label of each cell. To find entity squares on the diagonal, it first tries to judge whether the largest square ($|s| \times |s|$) is an entity. The criterion is simply counting the number of different entity labels appearing in the square and choosing the most frequent one. If the most frequent label is $\bot$, we shrink the size of the square by 1 and do the same work on the two $(|s| - 1) \times (|s| - 1)$ squares, and so on. To avoid entity overlapping, an entity is discarded if it overlaps with already identified entities. To find relations, each entity pair is labeled by the most frequent relation label in the corresponding rectangle.
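The two building blocks of this baseline, the hard "intermediate table" and the majority vote over a block, can be sketched as follows. This is only a partial illustration under assumptions: the recursive shrinking of squares and the overlap filtering are not shown, the function names are hypothetical, and ties are broken in favor of the smallest label id.

```python
import numpy as np


def intermediate_table(P):
    """Hard form of the probability tensor: the arg-max label id of each cell."""
    return P.argmax(axis=-1)


def most_frequent_label(block, n_labels):
    """Majority vote over the label ids in a square or rectangle."""
    counts = np.bincount(block.ravel(), minlength=n_labels)
    return int(counts.argmax())


# Toy tensor with 3 labels; the single low-confidence cell [1, 0] flips to label 0.
P_demo = np.zeros((2, 2, 3))
P_demo[..., 1] = 0.6
P_demo[..., 0] = 0.4
P_demo[1, 0] = [0.55, 0.45, 0.0]
hard = intermediate_table(P_demo)
vote = most_frequent_label(hard, n_labels=3)
```

Unlike the averaged-score decoding of Section 3.4, the vote discards how confident each cell was, which is one plausible reason the hard variant is less robust.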

From the ablation study, we get the following observations.

Model                  Parameters  W     ACE05                       SciERC
                                         Rel (F1)  Speed (sent/s)    Rel (F1)  Speed (sent/s)
Z&C (2020)             219M        100   64.6      14.7              36.7      19.9
Z&C (2020), approx.    219M        100   -         237.6             -         194.7
UniRE                  110M        100   63.6      340.6             34.0      314.8
UniRE                  110M        200   64.3      194.2             36.9      200.1
hard decoding          110M        200   34.6      139.1             17.8      113.0
Table 5: Comparison of accuracy and efficiency on ACE05 and SciERC test sets with different context window sizes $W$. "approx." denotes the approximation version of Z&C (2020), with a faster speed and a worse performance.

  • When one of the additional losses is removed, the performance generally declines, to varying degrees (rows 2-3 of Table 4). Specifically, the symmetrical loss has a clear impact on SciERC (a decrease of 1.1 points for entity and 1.4 points for relation performance), while removing the implication loss harms the relation performance on ACE05 (1.0 point). This indicates that the structural information incorporated by both losses is useful for this task.

  • Compared with "Default", the performance of "w/o logit dropout" and "w/o cross-sentence context" drops more sharply (rows 4-5 of Table 4). Logit dropout prevents the model from overfitting, and cross-sentence context provides more contextual information for this task, especially for small datasets like SciERC.

  • "Hard decoding" has the worst performance (its relation performance is almost half that of "Default") (row 6 of Table 4). The major reason is that "hard decoding" decodes entities and relations separately. This shows that jointly considering entities and relations, as the proposed decoding algorithm does, is important for decoding.

4.3 Inference Speed

Following Zhong and Chen (2020), we evaluate the inference speed of our model (Table 5) on ACE05 and SciERC with the same batch size and pre-trained encoders (BERT for ACE05 and SciBERT for SciERC). Compared with the pipeline method (Zhong and Chen, 2020), we obtain a more than $10\times$ speedup and achieve a comparable or even better relation performance with $W = 200$. As for their approximate version, our inference speed is still competitive but with better performance. If the context window size is set the same as in Zhong and Chen (2020) ($W = 100$), we can further accelerate model inference with slight performance drops. Besides, "hard decoding" is much slower than UniRE, which demonstrates the efficiency of the proposed decoding algorithm.
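For reference, the speed ratios implied by Table 5 can be worked out directly from the sentences-per-second columns, comparing UniRE against the exact pipeline of Zhong and Chen (2020). The rounding below is ours.

```latex
\begin{align*}
\text{ACE05},\ W = 200:  &\quad 194.2 / 14.7 \approx 13.2 \\
\text{SciERC},\ W = 200: &\quad 200.1 / 19.9 \approx 10.1 \\
\text{ACE05},\ W = 100:  &\quad 340.6 / 14.7 \approx 23.2 \\
\text{SciERC},\ W = 100: &\quad 314.8 / 19.9 \approx 15.8
\end{align*}
```

Both default-window ratios exceed 10, and the smaller window roughly adds a further 1.6 to 1.8 times on top of that at the cost of the relation F1 drops shown in the table.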

4.4 Impact of Different Threshold

In Figure 4, the distance between adjacent rows not at an entity boundary ("Non-Ent-Bound") mainly concentrates at 0, while that at an entity boundary ("Ent-Bound") is usually greater than 1. This phenomenon supports the correctness of our span decoding method. Then we evaluate the performance with regard to the threshold $\alpha$ in Figure 5 (footnote 14: we use an additional metric to evaluate span performance, "Span F1", which is the micro-F1 of predicted split positions). Both span and entity performance sharply decrease when $\alpha$ increases from 1.4 to 1.5, while the relation performance starts to decline only slowly. The major reason is that relations are so sparse that many entities do not participate in any relation, so the threshold of relation is much higher than that of entity. Moreover, we observe a similar phenomenon on ACE04 and SciERC, and $\alpha = 1.4$ is a general best setting on the three datasets. It shows the stability and generalization of our model.

Figure 4: Distributions of adjacent rows' distances for two categories with respect to the threshold $\alpha$ on ACE05 dev set.
Figure 5: Performances with respect to the threshold $\alpha$ on ACE05 dev set.

4.5 Context Window and Logit Dropout Rate

In Table 4, both cross-sentence context and logit dropout can improve the entity and relation performance. Table 6 shows the effect of different context window sizes $W$ and logit dropout rates $p$. The entity and relation performances are significantly improved from $W = 100$ to $W = 200$, and drop sharply from $W = 200$ to $W = 300$. Similarly, we achieve the best entity and relation performances when $p = 0.2$. So we use $W = 200$ and $p = 0.2$ in our final model.

Parameter  Value   ACE05         SciERC
                   Ent    Rel    Ent    Rel
W          100     87.4   62.4   69.0   36.7
W          200     87.9   62.1   70.6   38.3
W          300     87.2   60.8   69.4   35.4
p          0.1     87.4   61.8   71.1   37.8
p          0.2     87.9   62.1   70.6   38.3
p          0.3     87.2   62.1   67.8   33.5
p          0.4     87.4   62.0   70.6   35.8
Table 6: Results (F1 scores) with respect to the context window size $W$ and the logit dropout rate $p$ on ACE05 and SciERC dev sets.

4.6 Error Analysis

We further analyze the remaining errors for relation extraction and present the distribution of five errors: span splitting error (SSE), entity not found (ENF), entity type error (ETE), relation not found (RNF), and relation type error (RTE) in Figure 6. The proportion of "SSE" is relatively small, which supports the effectiveness of our span decoding method. Moreover, the proportion of "not found" errors is significantly larger than that of "type" errors for both entities and relations. The primary reason is that table filling suffers from the class imbalance issue, i.e., the number of $\bot$ cells is much larger than that of other classes. We leave this imbalanced classification problem for future work.

Finally, we give some concrete examples in Figure 7 to illustrate the robustness of our decoding algorithm. There are some errors in the biaffine model's prediction, such as cells in the upper left corner (first example) and upper right corner (second example) of the intermediate table. However, these errors are corrected after decoding, which demonstrates that our decoding algorithm not only recovers all entities and relations but also corrects errors by leveraging table structure and neighboring cells' information.

Figure 6: Distribution of five relation extraction errors on ACE05 and SciERC test data.

5 Related Work

Entity relation extraction has been extensively studied over the decades. Existing methods can be roughly divided into two categories according to the adopted label space.

Separate Label Spaces

This category studies the task as two separate sub-tasks, entity recognition and relation classification, which are defined in two separate label spaces. One early paradigm is the pipeline method (Zelenko et al., 2003; Miwa et al., 2009), which uses two independent models for the two sub-tasks respectively. Joint methods then handle this task with an end-to-end model to explore more interaction between entities and relations. The most basic joint paradigm, parameter sharing (Miwa and Bansal, 2016; Katiyar and Cardie, 2017), adopts two independent decoders based on a shared encoder. Recent span-based models (Luan et al., 2019; Wadden et al., 2019) also use this paradigm. To enhance the connection between the two decoders, many joint decoding algorithms have been proposed, such as an ILP-based joint decoder (Yang and Cardie, 2013), joint MRT (Sun et al., 2018), and GCN-based joint inference (Sun et al., 2019). The table filling method (Miwa and Sasaki, 2014; Gupta et al., 2016; Zhang et al., 2017; Wang et al., 2020) is in fact a special case of parameter sharing in a table structure. These joint models all focus on various joint algorithms but ignore the fact that they are essentially based on separate label spaces.

Unified Label Space

This family of methods aims to unify the two sub-tasks and tackle the task in a unified label space. Entity relation extraction has been converted into a tagging problem (Zheng et al., 2017), a transition-based parsing problem (Wang et al., 2018), and a generation problem with a Seq2Seq framework (Zeng et al., 2018; Nayak and Ng, 2020). We follow this trend and propose a new unified label space. We introduce a 2D table to tackle the overlapping relation problem in Zheng et al. (2017). Also, our model is more versatile, as it does not rely on complex expertise like Wang et al. (2018), which requires external expert knowledge to design a complex transition system.

Figure 7: Examples showing the robustness of our decoding algorithm. ‘‘Gold Table’’ presents the gold label. ‘‘Intermediate Table’’ presents the biaffine model’s prediction (choosing the label with the highest probability for each cell). ‘‘Decoded Table’’ presents the final results after decoding.

6 Conclusion

In this work, we extract entities and relations in a unified label space to better mine the interaction between both sub-tasks. We propose a novel table that presents entities and relations as squares and rectangles. Then this task can be performed in two simple steps: filling the table with our biaffine model and decoding entities and relations with our joint decoding algorithm. Experiments on three benchmarks show the proposed method achieves not only state-of-the-art performance but also promising efficiency.


The authors wish to thank the reviewers for their helpful comments and suggestions. This work was (partially) supported by National Key Research and Development Program of China (2018AAA0100704), NSFC (61972250, 62076097), STCSM (18ZR1411500), Shanghai Municipal Science and Technology Major Project (2021SHZDZX0102), and the Fundamental Research Funds for the Central Universities.


  • I. Beltagy, K. Lo, and A. Cohan (2019) SciBERT: a pretrained language model for scientific text. arXiv preprint arXiv:1903.10676.
  • J. Devlin, M. Chang, K. Lee, and K. Toutanova (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, Minnesota, pp. 4171–4186.
  • G. Doddington, A. Mitchell, M. Przybocki, L. Ramshaw, S. Strassel, and R. Weischedel (2004) The automatic content extraction (ACE) program – tasks, data, and evaluation. In Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC’04), Lisbon, Portugal.
  • T. Dozat and C. D. Manning (2016) Deep biaffine attention for neural dependency parsing. arXiv preprint arXiv:1611.01734.
  • P. Gupta, H. Schütze, and B. Andrassy (2016) Table filling multi-task recurrent neural network for joint entity and relation extraction. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, Osaka, Japan, pp. 2537–2547.
  • A. Katiyar and C. Cardie (2017) Going out on a limb: joint extraction of entity mentions and relations without dependency trees. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Vancouver, Canada, pp. 917–928.
  • Z. Lan, M. Chen, S. Goodman, K. Gimpel, P. Sharma, and R. Soricut (2019) ALBERT: a lite BERT for self-supervised learning of language representations. arXiv preprint arXiv:1909.11942.
  • Q. Li and H. Ji (2014) Incremental joint extraction of entity mentions and relations. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Baltimore, Maryland, pp. 402–412.
  • X. Li, F. Yin, Z. Sun, X. Li, A. Yuan, D. Chai, M. Zhou, and J. Li (2019) Entity-relation extraction as multi-turn question answering. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, pp. 1340–1350.
  • I. Loshchilov and F. Hutter (2017) Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101.
  • Y. Luan, L. He, M. Ostendorf, and H. Hajishirzi (2018) Multi-task identification of entities, relations, and coreference for scientific knowledge graph construction. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, pp. 3219–3232.
  • Y. Luan, D. Wadden, L. He, A. Shah, M. Ostendorf, and H. Hajishirzi (2019) A general framework for information extraction using dynamic span graphs. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, Minnesota, pp. 3036–3046.
  • Y. Luan, D. Wadden, L. He, A. Shah, M. Ostendorf, and H. Hajishirzi (2019) A general framework for information extraction using dynamic span graphs. arXiv preprint arXiv:1904.03296.
  • M. Miwa and M. Bansal (2016) End-to-end relation extraction using LSTMs on sequences and tree structures. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Berlin, Germany, pp. 1105–1116.
  • M. Miwa, R. Sætre, Y. Miyao, and J. Tsujii (2009) A rich feature vector for protein-protein interaction extraction from multiple corpora. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, Singapore, pp. 121–130.
  • M. Miwa and Y. Sasaki (2014) Modeling joint entity and relation extraction with table representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1858–1869.
  • T. Nayak and H. T. Ng (2020) Effective modeling of encoder-decoder architecture for joint entity and relation extraction. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 34, pp. 8528–8535.
  • C. Sun, Y. Gong, Y. Wu, M. Gong, D. Jiang, M. Lan, S. Sun, and N. Duan (2019) Joint type inference on entities and relations via graph convolutional networks. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, pp. 1361–1370.
  • C. Sun, Y. Wu, M. Lan, S. Sun, W. Wang, K. Lee, and K. Wu (2018) Extracting entities and relations with joint minimum risk training. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, pp. 2256–2265.
  • B. Taillé, V. Guigue, G. Scoutheeten, and P. Gallinari (2020) Let’s stop incorrect comparisons in end-to-end relation extraction! In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Online, pp. 3689–3701.
  • D. Wadden, U. Wennberg, Y. Luan, and H. Hajishirzi (2019) Entity, relation, and event extraction with contextualized span representations. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, pp. 5784–5789.
  • C. Walker, S. Strassel, J. Medero, and K. Maeda (2006) ACE 2005 multilingual training corpus. Linguistic Data Consortium, Philadelphia 57, pp. 45.
  • J. Wang and W. Lu (2020) Two are better than one: joint entity and relation extraction with table-sequence encoders. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Online, pp. 1706–1721.
  • S. Wang, Y. Zhang, W. Che, and T. Liu (2018) Joint extraction of entities and relations based on a novel graph scheme. In IJCAI, pp. 4461–4467.
  • Y. Wang, C. Sun, Y. Wu, J. Yan, P. Gao, and G. Xie (2020) Pre-training entity relation encoder with intra-span and inter-span information. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Online, pp. 1692–1705.
  • B. Yang and C. Cardie (2013) Joint inference for fine-grained opinion extraction. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 1640–1649.
  • D. Zelenko, C. Aone, and A. Richardella (2003) Kernel methods for relation extraction. Journal of Machine Learning Research 3 (Feb), pp. 1083–1106.
  • X. Zeng, D. Zeng, S. He, K. Liu, and J. Zhao (2018) Extracting relational facts by an end-to-end neural model with copy mechanism. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 506–514.
  • H. Zhang, Q. Liu, A. X. Fan, H. Ji, D. Zeng, F. Cheng, D. Kawahara, and S. Kurohashi (2020) Minimize exposure bias of seq2seq models in joint entity and relation extraction. arXiv preprint arXiv:2009.07503.
  • M. Zhang, Y. Zhang, and G. Fu (2017) End-to-end neural relation extraction with global optimization. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, Copenhagen, Denmark, pp. 1730–1740.
  • S. Zheng, F. Wang, H. Bao, Y. Hao, P. Zhou, and B. Xu (2017) Joint extraction of entities and relations based on a novel tagging scheme. arXiv preprint arXiv:1706.05075.
  • Z. Zhong and D. Chen (2020) A frustratingly easy approach for joint entity and relation extraction. arXiv preprint arXiv:2010.12812.

Appendix A Decoding Algorithm

A formal description is shown in Algorithm 1.

Algorithm 1 Decoding Algorithm
Input: Probability tensor of the sentence
Output: A set of entities and a set of relations
Procedure: a while loop with one conditional step, followed by two for loops, each with one conditional step.
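
To make the input and output of the decoder concrete, the following Python sketch shows one way such a procedure could be organized. It is an illustration under stated assumptions, not the authors' implementation: it assumes that span boundaries are placed where adjacent rows and columns of the probability tensor differ by more than a threshold `alpha`, that an entity label is chosen by averaging probabilities over a diagonal square, and that a relation label is chosen by averaging over an off-diagonal rectangle. The function names, the label indices, the choice of index 0 for the "no label" class, and the value of `alpha` in the example are all hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Span = Tuple[int, int]  # half-open word span [start, end)
NONE = 0                # assumed index of the "no label" class


def adjacent_distance(P, k: int, by_row: bool) -> float:
    """Euclidean distance between line k-1 and line k of the table, where a
    line is a row (by_row=True) or a column, flattened over all labels."""
    total = 0.0
    for j in range(len(P)):
        a = P[k - 1][j] if by_row else P[j][k - 1]
        b = P[k][j] if by_row else P[j][k]
        total += sum((x - y) ** 2 for x, y in zip(a, b))
    return math.sqrt(total)


def decode_spans(P, alpha: float) -> List[Span]:
    """Split the sentence wherever the mean of the row and column distances
    between adjacent positions exceeds alpha (an assumed criterion)."""
    n = len(P)
    splits = [0]
    for k in range(1, n):
        d = 0.5 * (adjacent_distance(P, k, True) + adjacent_distance(P, k, False))
        if d > alpha:
            splits.append(k)
    splits.append(n)
    return list(zip(splits[:-1], splits[1:]))


def best_label(P, rows: Span, cols: Span, candidates: Sequence[int]) -> int:
    """Label (among candidates and NONE) with the highest average
    probability over the block rows x cols."""
    (r0, r1), (c0, c1) = rows, cols
    area = (r1 - r0) * (c1 - c0)

    def avg(label: int) -> float:
        return sum(P[i][j][label] for i in range(r0, r1) for j in range(c0, c1)) / area

    return max([NONE] + list(candidates), key=avg)


def decode(P, entity_labels: Sequence[int], relation_labels: Sequence[int], alpha: float):
    """Return (entities, relations) decoded from probability tensor P[i][j][label]."""
    entities = []
    for span in decode_spans(P, alpha):
        label = best_label(P, span, span, entity_labels)  # square on the diagonal
        if label != NONE:
            entities.append((span, label))
    relations = []
    for head, _ in entities:
        for tail, _ in entities:
            if head == tail:
                continue
            label = best_label(P, head, tail, relation_labels)  # off-diagonal rectangle
            if label != NONE:
                relations.append((head, tail, label))
    return entities, relations


def one_hot(label: int, num_labels: int = 4) -> List[float]:
    return [1.0 if k == label else 0.0 for k in range(num_labels)]


# Toy 4-word table with hypothetical labels: 1 = PER, 2 = GPE, 3 = PHYS.
n = 4
P = [[one_hot(NONE) for _ in range(n)] for _ in range(n)]
for i in (0, 1):
    for j in (0, 1):
        P[i][j] = one_hot(1)  # PER square over words 0-1
    P[i][3] = one_hot(3)      # PHYS rectangle from words 0-1 to word 3
P[3][3] = one_hot(2)          # GPE square over word 3

entities, relations = decode(P, entity_labels=[1, 2], relation_labels=[3], alpha=1.0)
print(entities)   # [((0, 2), 1), ((3, 4), 2)]
print(relations)  # [((0, 2), (3, 4), 3)]
```

In this toy table the predictions are one-hot, so the block averages are exactly 0 or 1. With real model outputs the averages act as a soft vote over the cells of each square or rectangle, which is what lets a block tolerate a few mislabeled cells.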

Appendix B Datasets

The ACE04 and ACE05 corpora are collected from various domains, such as newswire and online forums. Both corpora annotate 7 entity types and 6 relation types. We use the same data splits and pre-processing as Li and Ji (2014) and Miwa and Bansal (2016), i.e., 5-fold cross-validation for ACE04, and 351 for training, 80 for validation, and 80 for testing for ACE05 (we use the pre-processing scripts provided by Wang and Lu (2020) at https://github.com/LorrinWWW/two-are-better-than-one/tree/master/datasets). In addition, we randomly sample 10% of the training set as the development set for ACE04.

The SciERC corpus collects 500 scientific abstracts taken from AI conference and workshop proceedings. This dataset annotates 6 entity types and 7 relation types. We adopt the same data split protocol as Luan et al. (2019): 350 for training, 50 for validation, and 100 for testing. Detailed dataset specifications are shown in Table 2.

Moreover, we correct the annotations of undirected relations in all three datasets by regarding each undirected relation as two directed relation instances. For example, for the undirected relation PER-SOC, only one relation triplet (‘‘his’’, ‘‘wife’’, PER-SOC) is annotated in the original dataset; we add another relation triplet (‘‘wife’’, ‘‘his’’, PER-SOC) to our corrected datasets for symmetry. Each undirected relation therefore corresponds to two rectangles that are symmetrical about the diagonal.
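
The correction of undirected relations can be sketched in a few lines of Python. This is a minimal illustration, not the released pre-processing code: the token list, the span encoding (half-open word indices), the entity type names, the `NONE` filler label, and the helper names `symmetrize` and `fill_table` are all assumptions made for the example. It adds the reversed triplet for every undirected relation and then fills a word-pair table in which entities are squares on the diagonal and relations are rectangles off the diagonal.

```python
from typing import List, Set, Tuple

Span = Tuple[int, int]            # half-open word span [start, end)
Entity = Tuple[Span, str]         # (span, entity type)
Triplet = Tuple[Span, Span, str]  # (head span, tail span, relation type)

UNDIRECTED: Set[str] = {"PER-SOC"}  # assumed set of undirected relation types


def symmetrize(relations: List[Triplet], undirected: Set[str] = UNDIRECTED) -> List[Triplet]:
    """Add the reversed triplet for every undirected relation that lacks one."""
    out = list(relations)
    seen = set(relations)
    for head, tail, label in relations:
        reverse = (tail, head, label)
        if label in undirected and reverse not in seen:
            out.append(reverse)
            seen.add(reverse)
    return out


def fill_table(n: int, entities: List[Entity], relations: List[Triplet],
               none: str = "NONE") -> List[List[str]]:
    """Build an n x n table of labels, one cell per word pair."""
    table = [[none] * n for _ in range(n)]
    for (start, end), label in entities:
        for i in range(start, end):
            for j in range(start, end):
                table[i][j] = label  # square on the diagonal
    for (hs, he), (ts, te), label in relations:
        for i in range(hs, he):
            for j in range(ts, te):
                table[i][j] = label  # rectangle off the diagonal
    return table


# Hypothetical toy sentence; entity types are illustrative.
tokens = ["David", "Perkins", "and", "his", "wife", "in", "California"]
toy_entities: List[Entity] = [((0, 2), "PER"), ((3, 4), "PER"),
                              ((4, 5), "PER"), ((6, 7), "GPE")]
toy_relations: List[Triplet] = [((3, 4), (4, 5), "PER-SOC"),  # undirected
                                ((0, 2), (6, 7), "PHYS")]     # directed

corrected = symmetrize(toy_relations)
table = fill_table(len(tokens), toy_entities, corrected)
print(corrected)
print(table[3][4], table[4][3])  # PER-SOC PER-SOC
print(table[0][6], table[6][0])  # PHYS NONE
```

Only the undirected PER-SOC triplet gains a mirrored copy, so its two cells are symmetrical about the diagonal, while the directed PHYS relation keeps a single rectangle.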