UniRE
Source code for "UniRE: A Unified Label Space for Entity Relation Extraction.", ACL2021.
Many joint entity relation extraction models set up two separate label spaces for the two subtasks (i.e., entity detection and relation classification). We argue that this setting may hinder the information interaction between entities and relations. In this work, we propose to eliminate the different treatment of the two subtasks' label spaces. The input of our model is a table containing all word pairs from a sentence. Entities and relations are represented by squares and rectangles in the table. We apply a unified classifier to predict each cell's label, which unifies the learning of the two subtasks. For testing, an effective (yet fast) approximate decoder is proposed for finding squares and rectangles in tables. Experiments on three benchmarks (ACE04, ACE05, SciERC) show that, using only half the number of parameters, our model achieves accuracy competitive with the best extractor, and is faster.
Extracting structured information from plain text is a long-standing research topic in NLP. Typically, it aims to recognize specific entities and relations for profiling the semantics of sentences. An example is shown in Figure 1, where a person entity "David Perkins" and a geography entity "California" have a physical location relation PHYS.
Methods for detecting entities and relations can be categorized into pipeline models and joint models. In the pipeline setting, entity models and relation models are independent, with disentangled feature spaces and output label spaces. In the joint setting, on the other hand, some parameter sharing of feature spaces (Miwa and Bansal, 2016; Katiyar and Cardie, 2017) or decoding interactions (Yang and Cardie, 2013; Sun et al., 2019) are imposed to explore the common structure of the two tasks. It was believed that joint models could be better since they can alleviate error propagation among sub-models, have more compact parameter sets, and uniformly encode prior knowledge (e.g., constraints) on both tasks.
However, Zhong and Chen (2020) recently showed that, with the help of modern pre-training tools (e.g., BERT), separating the entity and relation models (with independent encoders and pipeline decoding) could surpass existing joint models. They argue that, since the output label spaces of entity and relation models are different, separate encoders, compared with shared encoders, could better capture distinct contextual information, avoid potential conflicts among them, and help decoders make more accurate predictions. That is, separate label spaces deserve separate encoders.
In this paper, we pursue a better joint model for entity relation extraction. After revisiting existing methods, we find that although entity models and relation models share encoders, their label spaces are usually still separate (even in models with joint decoders). Therefore, in parallel to Zhong and Chen (2020), we ask whether joint encoders (decoders) deserve joint label spaces.
The challenge of developing a unified entity-relation label space is that the two subtasks are usually formulated as different learning problems (e.g., entity detection as sequence labeling, relation classification as multi-class classification), and their labels are placed on different things (e.g., words vs. word pairs). One prior attempt (Zheng et al., 2017) handles both subtasks with one sequence labeling model. A compound label set was devised to encode both entities and relations. However, the model's expressiveness is sacrificed: it can detect neither overlapping relations (i.e., entities participating in multiple relations) nor isolated entities (i.e., entities not appearing in any relation).
Our key idea in defining a new unified label space is that, if we view the solution of Zheng et al. (2017) as performing relation classification during entity labeling, we could also consider the reverse direction by seeing entity detection as a special case of relation classification. Our new input space is a two-dimensional table with each entry corresponding to a word pair in the sentence (Figure 1). The joint model assigns a label to each cell from a unified label space (the union of the entity type set and the relation type set). Graphically, entities are squares on the diagonal, and relations are rectangles off the diagonal. This formulation retains full model expressiveness regarding existing entity-relation extraction scenarios (e.g., overlapped relations, directed relations, undirected relations). It is also different from the current table filling settings for entity relation extraction (Miwa and Sasaki, 2014; Gupta et al., 2016; Zhang et al., 2017; Wang and Lu, 2020), which still have separate label spaces for entities and relations, and treat on-diagonal and off-diagonal entries differently.
Based on the tabular formulation, our joint entity relation extractor performs two actions, filling and decoding. First, filling the table means predicting each word pair's label, which is similar to the arc prediction task in dependency parsing. We adopt the biaffine attention mechanism (Dozat and Manning, 2016) to learn interactions between word pairs. We also impose two structural constraints on the table through structural regularization. Next, given the table filled with label logits, we devise an approximate joint decoding algorithm to output the final extracted entities and relations. Basically, it efficiently finds split points in the table to identify squares and rectangles (which also differs from existing table filling models, which still apply certain sequential decoding and fill tables incrementally).
Experimental results on three benchmarks (ACE04, ACE05, SciERC) show that the proposed joint method achieves competitive performance compared with the current state-of-the-art extractor (Zhong and Chen, 2020): it is better on ACE04 and SciERC, and competitive on ACE05. (Source code and models are available at https://github.com/Receiling/UniRE.) Meanwhile, our new joint model is fast at decoding (roughly 10× faster than the exact pipeline implementation according to the speeds reported in Table 5, and comparable to an approximate pipeline, which attains lower performance). It also has a more compact parameter set: the shared encoder uses only half the number of parameters compared with the separate encoders of Zhong and Chen (2020).
Given an input sentence $s = x_1, x_2, \dots, x_{|s|}$ ($x_i$ is a word), this task is to extract a set of entities $\mathcal{E}$ and a set of relations $\mathcal{R}$. An entity $e$ is a span ($e.\text{span}$) with a predefined type $e.\text{type} \in \mathcal{Y}_e$ (e.g., PER, GPE). The span is a continuous sequence of words. A relation $r$ is a triplet $(e_1, e_2, l)$, where $e_1, e_2$ are two entities and $l \in \mathcal{Y}_r$ is a predefined relation type describing the semantic relation between the two entities (e.g., the PHYS relation between PER and GPE mentioned before). Here $\mathcal{Y}_e$ and $\mathcal{Y}_r$ denote the sets of possible entity types and relation types, respectively.
We formulate joint entity relation extraction as a table filling task (multi-class classification of each word pair in sentence $s$), as shown in Figure 1. For the sentence $s$, we maintain a table $T$ of size $|s| \times |s|$. For each cell $(i, j)$ in table $T$, we assign a label $y_{i,j} \in \mathcal{Y}$, where $\mathcal{Y} = \mathcal{Y}_e \cup \mathcal{Y}_r \cup \{\bot\}$ ($\bot$ denotes no relation). For each entity $e$, the cells $y_{i,j}$ with $x_i \in e.\text{span}$ and $x_j \in e.\text{span}$ are filled with $e.\text{type}$. For each relation $r = (e_1, e_2, l)$, the cells $y_{i,j}$ with $x_i \in e_1.\text{span}$ and $x_j \in e_2.\text{span}$ are filled with $l$ (assuming no overlapping entities in one sentence). All other cells are filled with $\bot$. In the test phase, decoding entities and relations becomes a rectangle finding problem. Note that solving this problem is not trivial, and we propose a simple but effective joint decoding algorithm to tackle this challenge.
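The table formulation above can be illustrated with a minimal sketch. The function name `fill_table`, the `NONE` placeholder for the "no relation" label, and the toy sentence are hypothetical choices made for illustration; they are not taken from the released code.

```python
from typing import Dict, List, Tuple

Span = Tuple[int, int]  # inclusive (start, end) word indices
NONE = "NONE"           # stands in for the "no relation" label


def fill_table(n_words: int,
               entities: Dict[Span, str],
               relations: List[Tuple[Span, Span, str]]) -> List[List[str]]:
    """Build the gold label table for one sentence (no overlapping entities)."""
    table = [[NONE] * n_words for _ in range(n_words)]
    # Each entity fills a square on the diagonal with its entity type.
    for (start, end), ent_type in entities.items():
        for i in range(start, end + 1):
            for j in range(start, end + 1):
                table[i][j] = ent_type
    # Each relation fills the rectangle (head rows x tail columns).
    for (h_start, h_end), (t_start, t_end), rel_type in relations:
        for i in range(h_start, h_end + 1):
            for j in range(t_start, t_end + 1):
                table[i][j] = rel_type
    return table


# Toy sentence (hypothetical): "David Perkins lives in California"
entities = {(0, 1): "PER", (4, 4): "GPE"}
relations = [((0, 1), (4, 4), "PHYS")]
table = fill_table(5, entities, relations)
for row in table:
    print(row)
```

In this toy table the PER entity occupies the 2×2 square in the top-left corner, and the directed PHYS relation occupies the 2×1 rectangle in the last column; the mirrored rectangle stays empty because PHYS is treated as directed here.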
In this section, we first introduce our biaffine model for the table filling task based on pre-trained language models (Section 3.1). Then we detail the main objective function of the table filling task (Section 3.2) and some constraints which are imposed on the table in the training stage (Section 3.3). Finally, we present the joint decoding algorithm to extract entities and relations (Section 3.4). Figure 2 shows an overview of our model architecture. (We only show three labels of $\mathcal{Y}$ in Figure 2 for simplicity and clarity.)

Given an input sentence $s$, to obtain the contextual representation $h_i$ for each word, we use a pre-trained language model (PLM) as our sentence encoder (e.g., BERT). The output of the encoder is
$$\{h_1, \dots, h_{|s|}\} = \text{PLM}(\{x_1, \dots, x_{|s|}\}),$$
where each $x_i$ on the right-hand side is the input representation of the $i$-th word. Taking BERT as an example, $x_i$ sums the corresponding token, segment and position embeddings. To capture long-range dependencies, we also employ cross-sentence context following Zhong and Chen (2020), which extends the sentence to a fixed window size $W$ ($W = 200$ in our default settings).

To better encode direction information of words in table $T$, we use the deep biaffine attention mechanism (Dozat and Manning, 2016), which achieves impressive results in the dependency parsing task. Specifically, we employ two dimension-reducing MLPs (multi-layer perceptrons), i.e., a head MLP and a tail MLP, on each $h_i$:
$$h_i^{head} = \text{MLP}_{head}(h_i), \quad h_i^{tail} = \text{MLP}_{tail}(h_i),$$
where $h_i^{head}$ and $h_i^{tail}$ are projection representations, allowing the model to identify the head or tail role of each word. Next, we calculate the scoring vector $g_{i,j}$ of each word pair with the biaffine model,
$$g_{i,j} = \text{Biaff}(h_i^{head}, h_j^{tail}).$$

After obtaining the scoring vector $g_{i,j}$, we feed it into the softmax function to predict the corresponding label, yielding a categorical probability distribution over the label space $\mathcal{Y}$:
$$P(\mathbf{y}_{i,j} \mid s) = \text{Softmax}(\text{dropout}(g_{i,j})).$$
In our experiments, we observe that applying dropout to $g_{i,j}$, similar to denoising auto-encoding, can further improve the performance. We refer to this trick as logit dropout (the dropout rate is $p = 0.2$ by default). The training objective is to minimize
$$\mathcal{L}_{entry} = -\frac{1}{|s|^2} \sum_{i=1}^{|s|} \sum_{j=1}^{|s|} \log P(\mathbf{y}_{i,j} = y_{i,j} \mid s),$$
where the gold label $y_{i,j}$ can be read from annotations, as shown in Figure 1.

In fact, the objective $\mathcal{L}_{entry}$ is based on the assumption that each label is independent. This assumption simplifies the training procedure, but ignores some structural constraints. For example, entities and relations correspond to squares and rectangles in the table, and $\mathcal{L}_{entry}$ does not encode this constraint explicitly. To enhance our model, we propose two intuitive constraints, symmetry and implication, which are detailed in this section. Here we introduce a new notation $\mathcal{P} \in \mathbb{R}^{|s| \times |s| \times |\mathcal{Y}|}$, denoting the stack of $P(\mathbf{y}_{i,j} \mid s)$ for all word pairs in sentence $s$. ($\mathcal{P}$ is computed without the logit dropout mentioned in Section 3.2, to preserve the learned structure.)
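The scoring and entry-level loss described above can be sketched in plain Python. This is a sketch under stated assumptions: the text only names the scorer as Biaff, so the form used here (a bilinear term plus a linear term over the concatenated head and tail vectors plus a bias, as is common for biaffine classifiers) is an assumption, the parameter names `U`, `W`, and `b` are hypothetical, and logit dropout is omitted for brevity.

```python
import math
from typing import List

Vector = List[float]


def softmax(logits: Vector) -> Vector:
    """Numerically stable softmax over one scoring vector."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    z = sum(exps)
    return [e / z for e in exps]


def biaffine(head: Vector, tail: Vector,
             U: List[List[Vector]], W: List[Vector], b: Vector) -> Vector:
    """Assumed form: score[k] = head^T U[k] tail + W[k] . [head; tail] + b[k]."""
    concat = head + tail
    scores = []
    for k in range(len(b)):
        bilinear = sum(head[p] * U[k][p][q] * tail[q]
                       for p in range(len(head)) for q in range(len(tail)))
        linear = sum(w * x for w, x in zip(W[k], concat))
        scores.append(bilinear + linear + b[k])
    return scores


def entry_loss(probs: List[List[Vector]], gold: List[List[int]]) -> float:
    """Mean negative log-likelihood of the gold label over all table cells."""
    n = len(probs)
    total = sum(-math.log(probs[i][j][gold[i][j]])
                for i in range(n) for j in range(n))
    return total / (n * n)


# Toy setup: 2 words, projection size 2, 3 labels, all-zero parameters.
heads = [[1.0, 0.0], [0.0, 1.0]]
tails = [[0.5, 0.5], [1.0, -1.0]]
n_labels, d = 3, 2
U = [[[0.0] * d for _ in range(d)] for _ in range(n_labels)]
W = [[0.0] * (2 * d) for _ in range(n_labels)]
b = [0.0] * n_labels

probs = [[softmax(biaffine(heads[i], tails[j], U, W, b)) for j in range(2)]
         for i in range(2)]
gold = [[0, 1], [1, 2]]
loss = entry_loss(probs, gold)
print(loss)
```

With all-zero parameters every cell gets a uniform distribution over the three labels, so the loss equals the log of the number of labels, which is a convenient sanity check for an untrained scorer.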
We have several observations from the table at the tag level. Firstly, the squares corresponding to entities must be symmetrical about the diagonal. Secondly, for symmetrical relations, the relation triples $(e_1, e_2, l)$ and $(e_2, e_1, l)$ are equivalent, thus the rectangles corresponding to the two counterpart relation triples are also symmetrical about the diagonal. As shown in Figure 1, the rectangles corresponding to ("his", "wife", PER-SOC) and ("wife", "his", PER-SOC) are symmetrical about the diagonal. We divide the set of labels $\mathcal{Y}$ into a symmetrical label set $\mathcal{Y}_{sym}$ and an asymmetrical label set $\mathcal{Y}_{asym}$. The matrix $\mathcal{P}_{:,:,t}$ should be symmetrical about the diagonal for each label $t \in \mathcal{Y}_{sym}$. We formulate this tag-level constraint as the symmetrical loss,
$$\mathcal{L}_{sym} = \frac{1}{|s|^2} \sum_{i=1}^{|s|} \sum_{j=1}^{|s|} \sum_{t \in \mathcal{Y}_{sym}} \left| \mathcal{P}_{i,j,t} - \mathcal{P}_{j,i,t} \right|.$$
We list all $\mathcal{Y}_{sym}$ in Table 1 for our adopted datasets.
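Table 1: Symmetrical label set $\mathcal{Y}_{sym}$ for each dataset.
Dataset  Ent  Rel
ACE04/ACE05  all 7 entity types  PER-SOC
SciERC  all 6 entity types  COMPARE, CONJUNCTION
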
A key intuition is that if a relation exists, then its two argument entities must also exist. In other words, it is impossible for a relation to exist without two corresponding entities. From the perspective of probability, this implies that the probability of a relation is not greater than the probability of each argument entity. Since we model entity and relation labels in a unified probability space, this idea can be easily used in our model as the implication constraint. We impose this constraint on $\mathcal{P}$: for each word on the diagonal, its maximum probability over the entity type space $\mathcal{Y}_e$ must not be lower than the maximum probability for other words in the same row or column over the relation type space $\mathcal{Y}_r$. We formulate this table-level constraint as the implication loss,
$$\mathcal{L}_{imp} = \frac{1}{|s|} \sum_{i=1}^{|s|} \Big[ \max_{l \in \mathcal{Y}_r} \{\mathcal{P}_{i,:,l}, \mathcal{P}_{:,i,l}\} - \max_{t \in \mathcal{Y}_e} \{\mathcal{P}_{i,i,t}\} \Big]_*,$$
where $[u]_* = \max(u, 0)$ is the hinge loss. It is worth noting that we do not add a margin in this loss function. Since the value of each item is a probability and might be relatively small, it is meaningless to set a large margin.
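The two structural constraints can be checked numerically with a small sketch. The nested-list tensor layout `P[i][j][t]`, the function names, and the label indices below are hypothetical conventions chosen for illustration; the formulas follow the symmetrical and implication losses as described in the text.

```python
from typing import List

Tensor = List[List[List[float]]]  # P[i][j][t]: probability of label t at cell (i, j)


def symmetry_loss(P: Tensor, sym_labels: List[int]) -> float:
    """Mean absolute difference between mirrored cells for symmetrical labels."""
    n = len(P)
    total = sum(abs(P[i][j][t] - P[j][i][t])
                for i in range(n) for j in range(n) for t in sym_labels)
    return total / (n * n)


def implication_loss(P: Tensor, ent_labels: List[int], rel_labels: List[int]) -> float:
    """Hinge penalty (no margin) when a word's best relation score in its row
    or column exceeds its best entity score on the diagonal."""
    n = len(P)
    total = 0.0
    for i in range(n):
        best_rel = max(max(P[i][j][l], P[j][i][l])
                       for j in range(n) for l in rel_labels)
        best_ent = max(P[i][i][t] for t in ent_labels)
        total += max(best_rel - best_ent, 0.0)
    return total / n


# Toy 2-word table with labels: 0 = an entity type, 1 = a relation type, 2 = no relation.
P = [
    [[0.3, 0.1, 0.6], [0.1, 0.7, 0.2]],
    [[0.1, 0.2, 0.7], [0.9, 0.05, 0.05]],
]
sym = symmetry_loss(P, sym_labels=[0])
imp = implication_loss(P, ent_labels=[0], rel_labels=[1])
print(sym, imp)
```

In the toy table, word 0 has a relation score of 0.7 in its row but an entity score of only 0.3 on the diagonal, so it contributes 0.4 to the implication penalty, while word 1 (entity score 0.9) contributes nothing.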
Finally, we jointly optimize the three objectives in the training stage as $\mathcal{L}_{entry} + \mathcal{L}_{sym} + \mathcal{L}_{imp}$. (We directly sum the three losses to avoid introducing more hyper-parameters.)
In the testing stage, given the probability tensor $\mathcal{P}$ of the sentence $s$ (with the entries of each symmetrical label $t \in \mathcal{Y}_{sym}$ reset before decoding), how to decode all rectangles (including squares) corresponding to entities or relations remains a non-trivial problem. Since brute-force enumeration of all rectangles is intractable, a new joint decoding algorithm is needed. We expect our decoder to have:
Simple implementation and fast decoding. We permit slight decoding accuracy drops for scalability.
Strong interactions between entities and relations. When decoding entities, it should take the relation information into account, and vice versa.
Inspired by the procedures of Sun et al. (2019), we propose a three-step decoding algorithm: decode spans first (entity spans or spans between entities), then decode the entity type of each span, and finally decode the relation type of each entity pair (Figure 3). We consider each cell's probability scores on all labels (including entity labels and relation labels) and predict spans according to a threshold. Then, we predict entities and relations with the highest score. Our heuristic decoding algorithm can be very efficient. Next, we detail the entire decoding process, and give a formal description in Appendix A.

One crucial observation about a ground-truth table is that, for an arbitrary entity, its corresponding rows (or columns) are exactly the same in the table (e.g., row 1 and row 2 of Figure 1 are identical), not only for the diagonal entries (entities are squares), but also for the off-diagonal entries (if it participates in a relation with another entity, all its rows (columns) will show that relation label in the same way). In other words, if two adjacent rows/columns are different, there must be an entity boundary (i.e., one belongs to the entity and the other does not). Therefore, if our biaffine model is reasonably trained, given a model-predicted table, we can use this property to find the split positions of entity boundaries. As expected, experiments (Figure 4) verify our assumption. We adapt this idea to the 3-dimensional probability tensor $\mathcal{P}$.
Specifically, we flatten $\mathcal{P}$ as a matrix from the row perspective (each row collects all $|s| \cdot |\mathcal{Y}|$ scores of one table row), and then calculate the Euclidean distances ($\ell_2$ distances) of adjacent rows. Similarly, we calculate the Euclidean distances of adjacent columns according to a matrix from the column perspective, and then average the two distances as the final distance. If the distance is larger than the threshold $\alpha$ (a fixed value in our default settings), this position is a split position. In this way, we can decode all the spans with a single pass over adjacent rows and columns.
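The span decoding step can be sketched as follows. This is a minimal illustration, not the released implementation: the function names, the nested-list layout `P[i][j][t]`, and the threshold value `alpha=1.0` used in the toy call are assumptions, and spans are taken to be the consecutive segments between split positions.

```python
import math
from typing import List, Tuple

Tensor = List[List[List[float]]]  # P[i][j][t]


def adjacent_distances(P: Tensor) -> List[float]:
    """Average of the Euclidean distances between adjacent rows and adjacent columns."""
    n, n_labels = len(P), len(P[0][0])
    dists = []
    for k in range(n - 1):
        row_sq = sum((P[k][j][t] - P[k + 1][j][t]) ** 2
                     for j in range(n) for t in range(n_labels))
        col_sq = sum((P[i][k][t] - P[i][k + 1][t]) ** 2
                     for i in range(n) for t in range(n_labels))
        dists.append((math.sqrt(row_sq) + math.sqrt(col_sq)) / 2)
    return dists


def decode_spans(P: Tensor, alpha: float) -> List[Tuple[int, int]]:
    """Split wherever the adjacent distance exceeds alpha; return inclusive spans."""
    n = len(P)
    splits = [k + 1 for k, d in enumerate(adjacent_distances(P)) if d > alpha]
    bounds = [0] + splits + [n]
    return [(bounds[k], bounds[k + 1] - 1) for k in range(len(bounds) - 1)]


def one_hot(label: int, n_labels: int = 3) -> List[float]:
    return [1.0 if t == label else 0.0 for t in range(n_labels)]


# Toy 4-word table: words 0-1 form a PER entity, word 3 is a PER entity,
# and there is a directed PHYS relation from the first entity to the second.
PER, PHYS, NONE = 0, 1, 2
labels = [
    [PER, PER, NONE, PHYS],
    [PER, PER, NONE, PHYS],
    [NONE, NONE, NONE, NONE],
    [NONE, NONE, NONE, PER],
]
P = [[one_hot(c) for c in row] for row in labels]
spans = decode_spans(P, alpha=1.0)  # alpha here is an illustrative value
print(spans)
```

Rows 0 and 1 (and columns 0 and 1) are identical, so no split is placed between them, which is exactly the property of ground-truth tables that the decoder relies on.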
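Given a span $(i, j)$ from span decoding ($i$ and $j$ denote the start and end indices of the span), we decode the entity type $\hat{t}$ according to the corresponding square symmetric about the diagonal, choosing the highest-scoring label among $\mathcal{Y}_e \cup \{\bot\}$. If $\hat{t} \in \mathcal{Y}_e$, we decode an entity. If $\hat{t} = \bot$, the span is not an entity.
After entity type decoding, given an entity $e_1$ with the span $(i, j)$ and another entity $e_2$ with the span $(m, n)$, we decode the relation type $\hat{l}$ between $e_1$ and $e_2$ according to the corresponding rectangle, choosing the highest-scoring label among $\mathcal{Y}_r \cup \{\bot\}$. If $\hat{l} \in \mathcal{Y}_r$, we decode a relation $(e_1, e_2, \hat{l})$. If $\hat{l} = \bot$, $e_1$ and $e_2$ have no relation.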
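The entity type and relation type decoding steps can be sketched as below. The aggregation is an assumption: here each square or rectangle is scored by averaging the probabilities of its cells before taking the highest-scoring label. The function names and the toy table are hypothetical.

```python
from typing import Dict, List, Tuple

Tensor = List[List[List[float]]]  # P[i][j][t]
Span = Tuple[int, int]            # inclusive (start, end)


def mean_scores(P: Tensor, rows: Span, cols: Span) -> List[float]:
    """Average label distribution over the block rows x cols (assumed aggregation)."""
    cells = [(i, j) for i in range(rows[0], rows[1] + 1)
             for j in range(cols[0], cols[1] + 1)]
    n_labels = len(P[0][0])
    return [sum(P[i][j][t] for i, j in cells) / len(cells) for t in range(n_labels)]


def decode_types(P: Tensor, spans: List[Span], ent_labels: List[int],
                 rel_labels: List[int], none_label: int):
    """Label each span's diagonal square, then each ordered entity pair's rectangle."""
    entities: Dict[Span, int] = {}
    for span in spans:
        scores = mean_scores(P, span, span)
        best = max(ent_labels + [none_label], key=lambda t: scores[t])
        if best != none_label:
            entities[span] = best
    relations = []
    for head in entities:
        for tail in entities:
            if head == tail:
                continue
            scores = mean_scores(P, head, tail)
            best = max(rel_labels + [none_label], key=lambda t: scores[t])
            if best != none_label:
                relations.append((head, tail, best))
    return entities, relations


def one_hot(label: int, n_labels: int = 3) -> List[float]:
    return [1.0 if t == label else 0.0 for t in range(n_labels)]


PER, PHYS, NONE = 0, 1, 2
labels = [
    [PER, PER, NONE, PHYS],
    [PER, PER, NONE, PHYS],
    [NONE, NONE, NONE, NONE],
    [NONE, NONE, NONE, PER],
]
P = [[one_hot(c) for c in row] for row in labels]
spans = [(0, 1), (2, 2), (3, 3)]  # e.g. the output of a span decoding step
entities, relations = decode_types(P, spans, [PER], [PHYS], NONE)
print(entities, relations)
```

Because relation rectangles are read for ordered pairs (head rows, tail columns), a directed relation is recovered in one direction only, while a symmetrical relation would appear in both.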
Dataset  #sents  #ents(#types)  #rels(#types) 
ACE04  8,683  22,519(7)  4,417(6) 
ACE05  14,525  38,287(7)  7,691(6) 
SciERC  2,687  8,094(6)  5,463(7) 
Table 3: Results of previous works and UniRE on the three datasets ("-" marks a value that is not reported).
Dataset  Model  Encoder  Ent P  Ent R  Ent F1  Rel P  Rel R  Rel F1
ACE04  Li and Ji (2014)  -  83.5  76.2  79.7  60.8  36.1  45.3
ACE04  Miwa and Bansal (2016)  LSTM  80.8  82.9  81.8  48.7  48.1  48.4
ACE04  Katiyar and Cardie (2017)  LSTM  81.2  78.1  79.6  46.4  45.3  45.7
ACE04  Li et al. (2019)  BERT  84.4  82.9  83.6  50.1  48.7  49.4
ACE04  Wang and Lu (2020)  ALBERT  -  -  88.6  -  -  59.6
ACE04  Zhong and Chen (2020)⋄  BERT  -  -  89.2  -  -  60.1
ACE04  Zhong and Chen (2020)⋄  ALBERT  -  -  90.3  -  -  62.2
ACE04  UniRE⋄  BERT  87.4  88.0  87.7  62.1  58.0  60.0
ACE04  UniRE⋄  ALBERT  88.9  90.0  89.5  67.3  59.3  63.0
ACE05  Li and Ji (2014)  -  85.2  76.9  80.8  65.4  39.8  49.5
ACE05  Miwa and Bansal (2016)  LSTM  82.9  83.9  83.4  57.2  54.0  55.6
ACE05  Katiyar and Cardie (2017)  LSTM  84.0  81.3  82.6  55.5  51.8  53.6
ACE05  Sun et al. (2019)  LSTM  86.1  82.4  84.2  68.1  52.3  59.1
ACE05  Li et al. (2019)  BERT  84.7  84.9  84.8  64.8  56.2  60.2
ACE05  Wang et al. (2020)  BERT  -  -  87.2  -  -  63.2
ACE05  Wang and Lu (2020)  ALBERT  -  -  89.5  -  -  64.3
ACE05  Zhong and Chen (2020)⋄  BERT  -  -  90.2  -  -  64.6
ACE05  Zhong and Chen (2020)⋄  ALBERT  -  -  90.9  -  -  67.8
ACE05  UniRE⋄  BERT  88.8  88.9  88.8  67.1  61.8  64.3
ACE05  UniRE⋄  ALBERT  89.9  90.5  90.2  72.3  60.7  66.0
SciERC  Wang et al. (2020)  SciBERT  -  -  68.0  -  -  34.6
SciERC  Zhong and Chen (2020)⋄  SciBERT  -  -  68.2  -  -  36.7
SciERC  UniRE⋄  SciBERT  65.8  71.1  68.4  37.3  36.6  36.9
We conduct experiments on three entity relation extraction benchmarks: ACE04 (Doddington et al., 2004; https://catalog.ldc.upenn.edu/LDC2005T09), ACE05 (Walker et al., 2006; https://catalog.ldc.upenn.edu/LDC2006T06), and SciERC (Luan et al., 2018; http://nlp.cs.washington.edu/sciIE/). Table 2 shows the dataset statistics. Besides, we provide detailed dataset specifications in Appendix B.
Following the suggestions in Taillé et al. (2020), we evaluate Precision (P), Recall (R), and F1 scores with micro-averaging and adopt the Strict Evaluation criterion. Specifically, a predicted entity is correct if its type and boundaries are correct, and a predicted relation is correct if its relation type is correct and the boundaries and types of its two argument entities are also correct.
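The strict, micro-averaged criterion above can be expressed as a set comparison. This is a sketch under assumptions: the tuple layouts for entities and relations and the function name `micro_prf` are hypothetical, and items from all sentences are pooled into one set by including a sentence id.

```python
from typing import Set, Tuple


def micro_prf(gold: Set[tuple], pred: Set[tuple]) -> Tuple[float, float, float]:
    """Micro-averaged precision, recall, and F1 over pooled gold and predicted items."""
    tp = len(gold & pred)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Strict entity match: (sentence id, start, end, type) must all agree.
gold_ents = {(0, 0, 1, "PER"), (0, 4, 4, "GPE")}
pred_ents = {(0, 0, 1, "PER"), (0, 4, 4, "LOC")}  # second entity has a wrong type

# Strict relation match: relation type plus boundaries and types of both arguments.
gold_rels = {(0, (0, 1, "PER"), (4, 4, "GPE"), "PHYS")}
pred_rels = {(0, (0, 1, "PER"), (4, 4, "LOC"), "PHYS")}  # wrong argument type

ent_scores = micro_prf(gold_ents, pred_ents)
rel_scores = micro_prf(gold_rels, pred_rels)
print(ent_scores, rel_scores)
```

The relation in the toy prediction has the right relation type and boundaries but a wrong argument entity type, so under the strict criterion it counts as an error.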
We tune all hyper-parameters based on the averaged entity F1 and relation F1 on the ACE05 development set, then keep the same settings on ACE04 and SciERC. For fair comparison with previous works, we use three pre-trained language models: bert-base-uncased (Devlin et al., 2019), albert-xxlarge-v1 (Lan et al., 2019) and scibert-scivocab-uncased (Beltagy et al., 2019) as the sentence encoder and fine-tune them in the training stage. (The first two are for ACE04 and ACE05, and the last one is for SciERC.)
For the MLP layers, we use a fixed hidden size $d$ and GELU as the activation function. We use the AdamW optimizer (Loshchilov and Hutter, 2017), and observe a phenomenon similar to Dozat and Manning (2016) in that changing $\beta_2$ from 0.9 to 0.999 causes a significant drop in final performance. The batch size is 32, and the learning rate is 5e-5 with weight decay 1e-5. We apply a linear warmup learning rate scheduler with a warmup ratio of 0.2. We train our model for a maximum of 200 epochs (300 epochs for SciERC) and employ an early stopping strategy. We perform all experiments on an Intel(R) Xeon(R) W-3175X CPU and an NVIDIA Quadro RTX 8000 GPU.
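The warmup schedule can be illustrated with a small sketch. Only the base learning rate (5e-5) and the warmup ratio (0.2) come from the text; the shape of the schedule after warmup is not stated, so the linear decay to zero used here is an assumption, and the function name is hypothetical.

```python
def linear_warmup_lr(step: int, total_steps: int,
                     base_lr: float = 5e-5, warmup_ratio: float = 0.2) -> float:
    """Learning rate at a given optimizer step."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Linear ramp from 0 to base_lr during the warmup phase.
        return base_lr * step / warmup_steps
    # Assumption: linear decay from base_lr to 0 over the remaining steps.
    remaining = max(total_steps - step, 0)
    return base_lr * remaining / max(total_steps - warmup_steps, 1)


total = 1000
for step in (0, 100, 200, 600, 1000):
    print(step, linear_warmup_lr(step, total))
```

With 1000 total steps the rate peaks at step 200 (20% of training) and then falls back toward zero under the assumed decay.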
Table 3 summarizes previous works and our UniRE on three datasets. (Since Luan et al. (2019) and Wadden et al. (2019) neglect the argument entity type in relation evaluation and underperform our baseline (Zhang et al., 2020), we do not compare with their results here.) In general, UniRE achieves the best relation performance on ACE04 and SciERC and a comparable result on ACE05. Compared with the previous best joint model (Wang and Lu, 2020), our model significantly advances both entity and relation performance, i.e., an absolute F1 gain of +0.9 and +0.7 for entities as well as +3.4 and +1.7 for relations, on ACE04 and ACE05 respectively. Compared with the best pipeline model (Zhong and Chen, 2020) (current SOTA), our model achieves superior performance on ACE04 and SciERC and comparable performance on ACE05. Compared with ACE04/ACE05, SciERC is much smaller, so entity performance on SciERC drops sharply. Since Zhong and Chen (2020) is a pipeline method, its relation performance is severely influenced by the poor entity performance. Our model is less influenced in this case and achieves better performance. Besides, our model can achieve better relation performance even with worse entity results on ACE04. In fact, our base model (BERT) already achieves competitive relation performance, which even exceeds prior models based on BERT (Li et al., 2019) and ALBERT (Wang and Lu, 2020). These results confirm that the proposed unified label space is effective for exploring the interaction between entities and relations. Note that all subsequent experimental results on ACE04 and ACE05 are based on BERT for efficiency.
Table 4: Ablation results (F1) on ACE05 and SciERC.
Settings  ACE05 Ent  ACE05 Rel  SciERC Ent  SciERC Rel
Default  88.8  64.3  68.4  36.9
w/o symmetry loss  88.9  64.0  67.3  35.5
w/o implication loss  89.0  63.3  68.0  37.1
w/o logit dropout  88.8  61.8  66.9  34.7
w/o cross-sentence context  87.9  62.7  65.3  32.1
hard decoding  74.0  34.6  46.1  17.8
In this section, we analyze the effects of components in UniRE with different settings (Table 4). In particular, we implement a naive decoding algorithm for comparison, namely "hard decoding", which takes the "intermediate table" as input. The "intermediate table" is the hard form of the probability tensor $\mathcal{P}$ output by the biaffine model, i.e., choosing the class with the highest probability as the label of each cell. To find entity squares on the diagonal, it first tries to judge whether the largest square ($|s| \times |s|$) is an entity. The criterion is simply counting the number of different entity labels appearing in the square and choosing the most frequent one. If the most frequent label is $\bot$, we shrink the size of the square by 1 and do the same work on the two $(|s| - 1) \times (|s| - 1)$ squares, and so on. To avoid entity overlapping, an entity is discarded if it overlaps with already identified entities. To find relations, each entity pair is labeled by the most frequent relation label in the corresponding rectangle.
From the ablation study, we get the following observations.
Table 5: Relation F1 and inference speed on ACE05 and SciERC ("-" marks a value that is not reported).
Model  Parameters  W  ACE05 Rel (F1)  ACE05 Speed (sent/s)  SciERC Rel (F1)  SciERC Speed (sent/s)
Z&C(2020)  219M  100  64.6  14.7  36.7  19.9
Z&C(2020)  219M  100  -  237.6  -  194.7
UniRE  110M  100  63.6  340.6  34.0  314.8
UniRE  110M  200  64.3  194.2  36.9  200.1
hard decoding  110M  200  34.6  139.1  17.8  113.0
When one of the additional losses is removed, the performance declines to varying degrees (lines 2-3 of Table 4). Specifically, the symmetrical loss has a significant impact on SciERC (a decrease of 1.1 points and 1.4 points in entity and relation performance), while removing the implication loss clearly harms the relation performance on ACE05 (1.0 point). This demonstrates that the structural information incorporated by both losses is useful for this task.
Compared with "Default", the performance of "w/o logit dropout" and "w/o cross-sentence context" drops more sharply (lines 4-5 of Table 4). Logit dropout prevents the model from overfitting, and cross-sentence context provides more contextual information for this task, especially for small datasets like SciERC.
"Hard decoding" has the worst performance (its relation performance is almost half that of "Default") (line 6 of Table 4). The major reason is that "hard decoding" decodes entities and relations separately. This shows that the proposed decoding algorithm jointly considers entities and relations, which is important for decoding.
Following Zhong and Chen (2020), we evaluate the inference speed of our model (Table 5) on ACE05 and SciERC with the same batch size and pre-trained encoders (BERT for ACE05 and SciBERT for SciERC). Compared with the pipeline method (Zhong and Chen, 2020), we obtain a more than 10× speedup (194.2 vs. 14.7 sent/s on ACE05 and 200.1 vs. 19.9 sent/s on SciERC) and achieve comparable or even better relation performance with $W = 200$. Compared with their approximate version, our inference speed is still competitive, with better performance. If the context window size is set the same as in Zhong and Chen (2020) ($W = 100$), we can further accelerate model inference with slight performance drops. Besides, "hard decoding" is much slower than UniRE, which demonstrates the efficiency of the proposed decoding algorithm.
In Figure 4, the distance between adjacent rows not at an entity boundary ("Non-Ent-Bound") mainly concentrates at 0, while that at an entity boundary ("Ent-Bound") is usually greater than 1. This phenomenon verifies the correctness of our span decoding method. We then evaluate the performance with regard to the threshold $\alpha$ in Figure 5. (We use an additional metric to evaluate span performance, "Span F1", which is the micro-F1 of predicted split positions.) Both span and entity performance decrease sharply when $\alpha$ increases from 1.4 to 1.5, while the relation performance starts to decline only slowly. The major reason is that relations are so sparse that many entities do not participate in any relation, so the threshold at which relation performance degrades is much higher than that for entities. Moreover, we observe a similar phenomenon on ACE04 and SciERC, and the same threshold is a generally best setting on the three datasets. This shows the stability and generalization of our model.
In Table 4, both cross-sentence context and logit dropout improve the entity and relation performance. Table 6 shows the effect of different context window sizes $W$ and logit dropout rates $p$. The entity and relation performance is significantly improved from $W = 100$ to $W = 200$, and drops sharply from $W = 200$ to $W = 300$. Similarly, we achieve the best entity and relation performance when $p = 0.2$. So we use $W = 200$ and $p = 0.2$ in our final model.
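Table 6: Effect of the context window size and the logit dropout rate (F1).
Parameter  Value  ACE05 Ent  ACE05 Rel  SciERC Ent  SciERC Rel
window size  100  87.4  62.4  69.0  36.7
window size  200  87.9  62.1  70.6  38.3
window size  300  87.2  60.8  69.4  35.4
logit dropout rate  0.1  87.4  61.8  71.1  37.8
logit dropout rate  0.2  87.9  62.1  70.6  38.3
logit dropout rate  0.3  87.2  62.1  67.8  33.5
logit dropout rate  0.4  87.4  62.0  70.6  35.8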
We further analyze the remaining errors for relation extraction and present the distribution of five error types in Figure 6: span splitting error (SSE), entity not found (ENF), entity type error (ETE), relation not found (RNF), and relation type error (RTE). The proportion of "SSE" is relatively small, which supports the effectiveness of our span decoding method. Moreover, the proportion of "not found" errors is significantly larger than that of "type" errors for both entities and relations. The primary reason is that table filling suffers from class imbalance, i.e., the number of $\bot$ cells is much larger than that of other classes. We leave this imbalanced classification problem to future work.

Finally, we give some concrete examples in Figure 7 to verify the robustness of our decoding algorithm. There are some errors in the biaffine model's prediction, such as cells in the upper left corner (first example) and upper right corner (second example) of the intermediate table. However, these errors are corrected after decoding, which demonstrates that our decoding algorithm not only recovers all entities and relations but also corrects errors by leveraging table structure and neighboring cells' information.
Entity relation extraction has been extensively studied over the decades. Existing methods can be roughly divided into two categories according to the adopted label space.
This category studies the task as two separate subtasks: entity recognition and relation classification, which are defined in two separate label spaces. One early paradigm is the pipeline method (Zelenko et al., 2003; Miwa et al., 2009), which uses two independent models for the two subtasks. Joint methods then handle this task with an end-to-end model to explore more interaction between entities and relations. The most basic joint paradigm, parameter sharing (Miwa and Bansal, 2016; Katiyar and Cardie, 2017), adopts two independent decoders based on a shared encoder. Recent span-based models (Luan et al., 2019; Wadden et al., 2019) also use this paradigm. To enhance the connection between the two decoders, many joint decoding algorithms have been proposed, such as an ILP-based joint decoder (Yang and Cardie, 2013), joint MRT (Sun et al., 2018), and GCN-based joint inference (Sun et al., 2019). In fact, the table filling method (Miwa and Sasaki, 2014; Gupta et al., 2016; Zhang et al., 2017; Wang et al., 2020) is a special case of parameter sharing in a table structure. These joint models all focus on various joint algorithms but ignore the fact that they are essentially based on separate label spaces.
This family of methods aims to unify the two subtasks and tackle the task in a unified label space. Entity relation extraction has been converted into a tagging problem (Zheng et al., 2017), a transition-based parsing problem (Wang et al., 2018), and a generation problem with a Seq2Seq framework (Zeng et al., 2018; Nayak and Ng, 2020). We follow this trend and propose a new unified label space. We introduce a 2D table to tackle the overlapping relation problem in Zheng et al. (2017). Also, our model is more versatile, as it does not rely on complex expertise as Wang et al. (2018) does, which requires external expert knowledge to design a complex transition system.
In this work, we extract entities and relations in a unified label space to better mine the interaction between the two subtasks. We propose a novel table that presents entities and relations as squares and rectangles. The task can then be performed in two simple steps: filling the table with our biaffine model and decoding entities and relations with our joint decoding algorithm. Experiments on three benchmarks show that the proposed method achieves not only state-of-the-art performance but also promising efficiency.
The authors wish to thank the reviewers for their helpful comments and suggestions. This work was (partially) supported by National Key Research and Development Program of China (2018AAA0100704), NSFC (61972250, 62076097), STCSM (18ZR1411500), Shanghai Municipal Science and Technology Major Project (2021SHZDZX0102), and the Fundamental Research Funds for the Central Universities.
Table filling multi-task recurrent neural network for joint entity and relation extraction. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, Osaka, Japan, pp. 2537–2547. Cited by: §1, §5.
ALBERT: a lite BERT for self-supervised learning of language representations. arXiv preprint arXiv:1909.11942. Cited by: §4.
Multi-task identification of entities, relations, and coreference for scientific knowledge graph construction. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, pp. 3219–3232. Cited by: §4.
Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 34, pp. 8528–8535. Cited by: §5.
Journal of Machine Learning Research 3 (Feb), pp. 1083–1106. Cited by: §5.

A formal description is shown in Algorithm 1.
The ACE04 and ACE05 corpora are collected from various domains, such as newswire and online forums. Both corpora annotate 7 entity types and 6 relation types. We use the same data splits and pre-processing as Li and Ji (2014) and Miwa and Bansal (2016), i.e., 5-fold cross-validation for ACE04, and 351 training, 80 validation, and 80 test documents for ACE05. (We use the pre-processing scripts provided by Wang and Lu (2020) at https://github.com/LorrinWWW/twoarebetterthanone/tree/master/datasets.) Besides, we randomly sample 10% of the training set as the development set for ACE04.
The SciERC corpus collects 500 scientific abstracts taken from AI conference/workshop proceedings. This dataset annotates 6 entity types and 7 relation types. We adopt the same data split protocol as in Luan et al. (2019) (350 training, 50 validation, and 100 test abstracts). Detailed dataset specifications are shown in Table 2.
Moreover, we correct the annotations of undirected relations for the three datasets, regarding each undirected relation as two directed relation instances. For example, for the undirected relation PER-SOC, only one relation triplet ("his", "wife", PER-SOC) is annotated in the original dataset, and we add another relation triplet ("wife", "his", PER-SOC) in our corrected datasets for symmetry. In this case, each undirected relation corresponds to two rectangles, which are symmetrical about the diagonal.