Span-based Joint Entity and Relation Extraction with Transformer Pre-training

09/17/2019 ∙ by Markus Eberts, et al. ∙ Hochschule RheinMain

We introduce SpERT, an attention model for span-based joint entity and relation extraction. Our approach employs the pre-trained Transformer network BERT as its core. We use BERT embeddings as shared inputs for a light-weight reasoning, which features entity recognition and filtering, as well as relation classification with a localized, marker-free context representation. The model is trained on strong within-sentence negative samples, which are efficiently extracted in a single BERT pass. These aspects facilitate a search over all spans in the sentence. In ablation studies, we demonstrate the benefits of pre-training, strong negative sampling and localized context. Our model outperforms prior work by up to 5% (F1 score) in joint entity and relation extraction.





Transformer networks such as BERT [devlin:2018:bert], GPT [radford:2018:lm_transformer], Transformer-XL [dai:2019:transformer_xl], RoBERTa [liu:2019:roberta] or MASS [song:2019:mass] have recently attracted strong attention in the NLP research community. These models use multi-head self-attention as a key mechanism to capture interactions between tokens [bahdanau:2014:machine_translation_joint, vaswani:2017:transformer]. This way, context-sensitive embeddings can be obtained that disambiguate homonyms and express semantic and syntactic patterns. Transformer networks are commonly pre-trained on large document collections using language modeling objectives. The resulting models can then be transferred to target tasks with relatively small supervised training data, resulting in state-of-the-art performance in many NLP tasks such as question answering [yang:2019:qa] or contextual emotion detection [chatterjee:2019:semeval_emotion].

This work investigates the use of Transformer networks for relation extraction: Given a pre-defined set of target relations and a sentence such as “Leonardo DiCaprio starred in Christopher Nolan’s thriller Inception”, our goal is to extract triplets such as (“Leonardo DiCaprio”, “plays_role_in”, “Inception”) or (“Inception”, “director”, “Christopher Nolan”). The task comprises two subproblems, namely the identification of entities (entity recognition) and of relations between them (relation classification). While common methods tackle the two problems separately [yadav:2018:survey, zhang:2015:rel_pos, zeng:2014:rel_cnn], more recent work uses joint models for both steps [bekoulis:2018:multi_head, luan:2019:span_graphs]. The latter approach seems promising: on the one hand, knowledge about entities (such as the fact that “Leonardo DiCaprio” is a person) is of interest when choosing a relation, while on the other hand knowledge of the relation (“director”) can be useful when identifying entities.

We present a model for joint entity and relation extraction that utilizes the Transformer network BERT as its core. A span-based approach is followed: Any token subsequence (or span) constitutes a potential entity, and a relation can hold between any pair of spans. Our model performs a full search over all these hypotheses. Unlike previous work based on BIO/BILOU labels [bekoulis:2018:multi_head, li:2019:joint_bert, nguyen:2019:biaffine_attention], a span-based approach can identify overlapping entities such as “codeine” within “codeine intoxication”.

A key challenge lies in the fact that Transformer models like BERT are computationally expensive, such that only a single forward pass should be conducted per input sentence. Therefore, our model performs a light-weight reasoning on the resulting BERT embeddings, which includes classifiers for entities and relations. Spans are filtered using the entity classifier, and local context is represented without using particular markers. This facilitates a search over all spans. We coin our model “Span-based Entity and Relation Transformer” (SpERT). The code for reproducing our results is available online.

Overall, our contributions are:

  • To the best of our knowledge, SpERT is the first span-based approach for joint entity and relation extraction that uses a pre-trained Transformer type model. In quantitative experiments on the CoNLL04, SciERC and ADE datasets, our model consistently outperforms prior work by up to 5% (relation extraction F1 score).

  • We investigate several aspects of our model which make a full search of the span space feasible. Particularly, we show that (1) negative samples from the same sentence yield a training that is both efficient and effective, and a sufficient number of strong negative samples appears to be vital. (2) A localized context representation is beneficial, especially for longer sentences.

  • Finally, we also study the effects of pre-training and show that fine-tuning a pre-trained model yields a strong performance increase over training from scratch.

Related Work

Traditionally, relation extraction is tackled by using separate models for entity detection and relation classification, with neural networks constituting the state of the art. Various approaches for relation classification have been investigated, such as RNNs [zhang:2015:rel_pos], recursive neural networks [socher:2012:mv_rnn] or CNNs [zeng:2014:rel_cnn]. Also, Transformer models have been used for relation classification [verga:2018:transformer_encoder_bio_rc, wang:2019:bert_one_pass_rc]: The input text is fed once through a Transformer model and the resulting embeddings are classified. Note, however, that pre-labeled entities are assumed to be given. In contrast to this, our approach does not rely on labeled entities and jointly detects entities and relations.

Joint Entity and Relation Extraction

More recently, models for the joint detection of entities and relations have been studied. Most approaches detect entities by sequence-to-sequence learning: Each token is tagged according to the well-known BIO/BILOU scheme, i.e. it is marked as the beginning(B), inside(I) or last(L) token of an entity. If an entity only consists of a single token, it is assigned the unit(U) tag, while tokens that are not part of an entity are labeled with outside(O).
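The tagging scheme described above can be made concrete with a minimal sketch. The function name `bilou_tags`, the `(start, end, type)` span format with exclusive end index, and the type labels `Per` and `Film` are assumptions chosen for illustration; they do not come from any of the cited systems.

```python
def bilou_tags(num_tokens, entities):
    """Assign one BILOU tag per token.

    `entities` is a list of (start, end, type) triples with `end` exclusive.
    The spans are assumed not to overlap, since each token receives one tag.
    """
    tags = ["O"] * num_tokens
    for start, end, etype in entities:
        if end - start == 1:
            tags[start] = f"U-{etype}"          # single-token entity
        else:
            tags[start] = f"B-{etype}"          # first token
            for i in range(start + 1, end - 1):
                tags[i] = f"I-{etype}"          # inner tokens
            tags[end - 1] = f"L-{etype}"        # last token
    return tags


tokens = ["Leonardo", "DiCaprio", "starred", "in", "Inception"]
tags = bilou_tags(len(tokens), [(0, 2, "Per"), (4, 5, "Film")])
print(list(zip(tokens, tags)))
```

Because every token carries exactly one tag, this encoding has no way to mark a token as belonging to two entities at once.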

[miwa:2014:table] tackle joint entity and relation extraction as a table-filling problem, where each cell of the table corresponds to a word pair of the sentence. The diagonal of the table is filled with the BILOU tag of the token itself and the off-diagonal cells with the relations between the respective token pair. Relations are predicted by mapping the entities’ last words. The table is filled with relation types by minimizing a scoring function based on several features such as POS tags and entity labels. A beam search is employed to find an optimal table-filling solution. [gupta:2016:table_filling] also formulate joint entity and relation extraction as a table-filling problem. Unlike Miwa et al., they employ a bidirectional recurrent neural network to label each word pair.

[miwa:2016:stacked_rnn] use a stacked model for joint entity and relation extraction. First, a bidirectional sequential LSTM tags the entities according to the BILOU scheme. Second, a bidirectional tree-structured RNN operates on the dependency parse tree between an entity pair to predict the relation type. [zhou:2017:joint_hybrid] utilize a combination of a bidirectional LSTM and a CNN to extract a high level feature representation of the input sentence. A sigmoid layer is used to predict all relations expressed in a sentence. For each relation, a sequential LSTM then labels the entities according to the BILOU scheme. Since named entity extraction is only performed for the most likely relations, the approach predicts a lower number of labels compared to the table-filling approaches. [zheng:2017:joint_novel_tagging] first encode input tokens with a bidirectional LSTM. Another LSTM then operates on each encoded word representation and outputs the entity boundaries (akin to the BILOU scheme) alongside their relation type and their position in the relation (head/tail). Entities tagged with the same relation type are then combined to obtain the final relation triples. Conditions where one entity is related to multiple other entities are not considered. [bekoulis:2018:multi_head] also employ a bidirectional LSTM to encode each word of the sentence. They use character embeddings alongside Word2Vec embeddings as input representations. Entity boundaries and tags are extracted with a Conditional Random Field (CRF) to infer the most likely BIO sequence. A sigmoid layer then outputs the probability that a specific relation is expressed between two words that belong to an entity with regard to the BIO scheme. In contrast to [zheng:2017:joint_novel_tagging], Bekoulis et al. also detect cases in which a single entity is related to multiple others.

While the above approaches heavily rely on LSTMs, our approach is attention-based. The attention mechanism has also been used in joint models: [nguyen:2019:biaffine_attention] use a BiLSTM-CRF-based model for entity recognition. Token representations are shared with the relation classification task, and embeddings for BILOU entity labels are learned. In relation classification, entities interact via a bi-affine attention layer. [chi:2019:hierarch_attention] use similar BiLSTM representations. They detect entities with BIO tags and train with an auxiliary language modeling objective. Relation classifiers attend into the BiLSTM encodings. Note, however, that neither of the two works utilizes Transformer-type networks.

More similar to our work is the recent approach by [li:2019:joint_bert], who also apply BERT as their core model and use a question answering setting, where entity- and relation-specific questions guide the model to head and tail entities. The model requires manually defined (pseudo-)question templates per relation, such as “find a weapon which is owned by [?]”. Entities are detected by a relation-wise labeling with BILOU-type tags, based on BERT embeddings. In contrast to this approach, our span-based model requires no explicit formulation of questions but performs a full search over all span/relation hypotheses.

Figure 1: Our approach towards joint entity and relation extraction. SpERT first passes a token sequence through BERT. Then, (a) all spans within the sentence are classified into entity types, as illustrated for three sample spans (red). (b) Spans classified as non-entities (here, ) are filtered. (c) All pairs of remaining entities (here, ) are combined with their context (the span between the entities, yellow) and classified into relations.

Span-based Approaches

As BIO/BILOU-based models only assign a single tag to each token, a token cannot be part of multiple entities at the same time, such that situations with overlapping (often nested) entities cannot be covered. Think of the sentence “Ford’s Chicago plant employs 4,000 workers”, where both “Chicago” and “Chicago plant” are entities. Here, span-based approaches – which perform an exhaustive search over all spans – offer the fundamental benefit of covering overlapping entities.
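To illustrate the limitation, the following sketch counts how many entity spans cover each token of the example sentence. The tokenization and the helper name `tags_per_token` are assumptions made for this illustration only.

```python
def tags_per_token(num_tokens, entities):
    """Count how many entity spans cover each token (end index exclusive)."""
    counts = [0] * num_tokens
    for start, end in entities:
        for i in range(start, end):
            counts[i] += 1
    return counts


# Assumed tokenization of the example sentence.
tokens = ["Ford", "'s", "Chicago", "plant", "employs", "4,000", "workers"]
entities = [(2, 3), (2, 4)]  # "Chicago" and "Chicago plant"

counts = tags_per_token(len(tokens), entities)
# Tokens covered by more than one entity would need more than one tag.
conflicts = [tokens[i] for i, c in enumerate(counts) if c > 1]
print(conflicts)
```

Any token listed in `conflicts` cannot be represented by a one-tag-per-token scheme, whereas a span-based model simply treats both spans as separate candidates.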

While earlier span-based models address coreference resolution [lee:2017:span_coreference, lee:2018:span_coreference], more recently span-based approaches towards joint entity and relation extraction have been proposed: [luan:2018:scierc] introduce the densely annotated SciERC Dataset and use a span-based model for joint coreference resolution, entity and relation extraction. The model derives span representations with a BiLSTM over concatenated ELMo [peters:2018:elmo], word and character embeddings. These representations are shared across the three subtasks, for which different classifiers derive scores to determine which spans participate in entity classes, relations and coreferences. A beam search is conducted over the hypothesis space. A recent follow-up work [luan:2019:span_graphs] uses the same span representation but adds a graph propagation step to capture the interaction of spans. A dynamic span graph is constructed, in which embeddings are propagated using a learned gated mechanism. Using this refinement of span representations, further improvements are demonstrated.

In contrast to our work, none of the above span-based approaches utilizes pre-trained Transformer networks. We demonstrate that such pre-training is highly beneficial. Other key features of our approach include an explicit representation of localized context and strong negative sampling.


Our model uses a pre-trained BERT [devlin:2018:bert] model as its core, as illustrated in Figure 1: An input sentence is tokenized, obtaining a sequence of $n$ BPE tokens. These are passed through BERT, obtaining an embedding sequence $(e_1, e_2, \dots, e_n, c)$ of length $n+1$ (the last token $c$ represents a special classifier token capturing the overall sentence context). Unlike classical relation classification, our approach detects entities among all token subsequences (or spans). For example, the token sequence (we,will,rock,you) maps to the spans (we), (we,will), (will,rock,you), etc. We classify each span into entity types (a), filter non-entities (b), and finally classify all pairs of remaining entities into relations (c).
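The span search space can be enumerated directly. The sketch below is illustrative only: the function name `enumerate_spans` and the optional `max_width` parameter are assumptions, not identifiers from the SpERT code.

```python
def enumerate_spans(tokens, max_width=None):
    """Return all contiguous token subsequences as (start, end) pairs, end exclusive."""
    n = len(tokens)
    spans = []
    for start in range(n):
        # Optionally cap the span width to bound the number of candidates.
        last = n if max_width is None else min(n, start + max_width)
        for end in range(start + 1, last + 1):
            spans.append((start, end))
    return spans


tokens = ["we", "will", "rock", "you"]
spans = enumerate_spans(tokens)
print([tuple(tokens[s:e]) for s, e in spans])
```

Without a width cap, a sentence of n tokens has n(n+1)/2 spans; with a fixed cap the count grows only linearly in n.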

(a) Span Classification

Our span classifier takes an arbitrary candidate span as input. Let $s := (e_i, e_{i+1}, \dots, e_{i+k})$ denote such a span of BERT embeddings. Also, we assume $\mathcal{E}$ to be a pre-defined set of entity categories such as person or organization. The span classifier maps the span $s$ to a class out of $\mathcal{E} \cup \{none\}$, where $none$ represents spans that do not constitute entities.

The span classifier is displayed in detail in the dashed box in Figure 1 (see Step (a)). Its input consists of three parts:

  • The span’s BERT embeddings (red) are combined using a fusion, $f(e_i, e_{i+1}, \dots, e_{i+k})$. Regarding the fusion function $f$, we found max-pooling to work best, but will investigate other options in the experiments.

  • Given the span width $k+1$ (here, ), we look up a width embedding $w_{k+1}$ (blue) from a dedicated embedding matrix, which contains a fixed-size embedding for each span width $1, 2, \dots$ [lee:2017:span_coreference]. These embeddings are learned by backpropagation, and allow the model to incorporate a prior over the span width (note that spans which are too long are unlikely to represent entities).

This yields the following span representation (where $\circ$ denotes concatenation):

$$e(s) := f(e_i, e_{i+1}, \dots, e_{i+k}) \circ w_{k+1} \qquad (1)$$

Finally, we add the classifier token $c$ (Figure 1, green), which represents the overall sentence (or context). Context forms an important source of disambiguation, as keywords (like spouse or says) are strong indicators for entity classes (like person). The final input to the span classifier is:

$$x^s := e(s) \circ c \qquad (2)$$

This input is fed to a softmax classifier:

$$\hat{y}^s = \mathrm{softmax}(W^s \cdot x^s + b^s) \qquad (3)$$

which yields a posterior for each entity class (incl. none).
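The span classifier described above (max-pooled span embeddings, a width embedding, the classifier token, then a softmax layer) can be sketched in plain Python. This is a toy illustration under stated assumptions: random numbers stand in for BERT outputs and learned parameters, and all names (`classify_span`, `width_table`, the dimensions) are hypothetical.

```python
import math
import random


def max_pool(vectors):
    """Element-wise maximum over a list of equal-length vectors."""
    return [max(column) for column in zip(*vectors)]


def softmax(scores):
    """Numerically stable softmax."""
    peak = max(scores)
    exps = [math.exp(v - peak) for v in scores]
    total = sum(exps)
    return [v / total for v in exps]


def linear(weights, bias, x):
    """Compute W x + b for a weight matrix given as a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def classify_span(token_embs, cls_emb, start, end, width_table, weights, bias):
    """Posterior over entity classes for the span [start, end)."""
    pooled = max_pool(token_embs[start:end])        # fusion of the span's embeddings
    span_repr = pooled + width_table[end - start]   # concatenate the width embedding
    x = span_repr + cls_emb                         # concatenate the classifier token
    return softmax(linear(weights, bias, x))


# Toy setup: random values replace BERT embeddings and trained parameters.
rng = random.Random(0)
dim, width_dim, num_classes = 4, 2, 3  # e.g. none, person, organization
token_embs = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(5)]
cls_emb = [rng.uniform(-1, 1) for _ in range(dim)]
width_table = {w: [rng.uniform(-1, 1) for _ in range(width_dim)] for w in range(1, 6)}
weights = [[rng.uniform(-1, 1) for _ in range(2 * dim + width_dim)] for _ in range(num_classes)]
bias = [0.0] * num_classes

posterior = classify_span(token_embs, cls_emb, 1, 4, width_table, weights, bias)
print(posterior)
```

In a real model the embeddings would come from a single BERT forward pass and the weights would be trained; only the shape of the computation is meant to carry over.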

(b) Span Filtering

By looking at the highest-scored class, the span classifier’s output (Equation 3) tells us which class each span belongs to. We use a simple approach and filter all spans assigned to the $none$ class, leaving a set of spans $\mathcal{S}$ which supposedly constitute entities. Note that – unlike prior work [miwa:2014:table, luan:2018:scierc] – we do not perform a beam search over the entity/relation hypotheses. We also pre-filter spans longer than 10 tokens, limiting the cost of span classification to $O(n)$ for a sentence of $n$ tokens.
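A minimal sketch of this filtering step follows. The data layout (a dictionary from span boundaries to class probabilities) and the names `filter_spans` and `NONE` are assumptions for illustration; the probabilities are made up.

```python
NONE = "none"


def filter_spans(span_posteriors, classes, max_width=10):
    """Keep spans whose highest-scoring class is not `none`.

    `span_posteriors` maps (start, end) pairs (end exclusive) to a list of
    class probabilities aligned with `classes`.
    """
    kept = {}
    for (start, end), probs in span_posteriors.items():
        if end - start > max_width:
            continue  # pre-filter overly long spans
        best = max(range(len(classes)), key=lambda i: probs[i])
        if classes[best] != NONE:
            kept[(start, end)] = classes[best]
    return kept


classes = [NONE, "person", "organization"]
posteriors = {
    (0, 2): [0.1, 0.8, 0.1],   # predicted person
    (0, 1): [0.7, 0.2, 0.1],   # predicted none, dropped
    (4, 5): [0.2, 0.1, 0.7],   # predicted organization
    (0, 12): [0.1, 0.1, 0.8],  # wider than 10 tokens, dropped
}
entities = filter_spans(posteriors, classes)
print(entities)
```

With the width capped at 10, each start position contributes at most 10 candidate spans, which is why the classification cost stays linear in the sentence length.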

(c) Relation Classification

Let $\mathcal{R}$ be a set of pre-defined relation classes. The relation classifier processes each candidate pair $(s_1, s_2)$ of entities drawn from $\mathcal{S} \times \mathcal{S}$, where $\mathcal{S}$ is the set of spans retained by the filter, and estimates if any relation from $\mathcal{R}$ holds. The input to the classifier consists of two parts:

  1. To represent the two entity candidates $s_1, s_2$, we use the fused BERT/width embeddings $e(s_1), e(s_2)$ (Eq. 1).

  2. Words from the context such as spouse or president are important indicators of the expressed relation. One possible context representation would be the classifier token $c$. However, we found $c$ to be unsuitable for long sentences expressing a multitude of relations. Instead, we use a more localized context drawn from the direct surrounding of the entities: Given the span ranging from the end of the first entity to the beginning of the second entity (Figure 1, yellow), we combine its BERT embeddings by max-pooling, obtaining a context representation $c(s_1, s_2)$.

Just like for the span classifier, the input to the relation classifier is obtained by concatenating the above features. Note that – since relations are asymmetric in general – we need to classify both $(s_1, s_2)$ and $(s_2, s_1)$, i.e. the input becomes

$$x^r_1 := e(s_1) \circ c(s_1, s_2) \circ e(s_2)$$

$$x^r_2 := e(s_2) \circ c(s_1, s_2) \circ e(s_1)$$

Both $x^r_1$ and $x^r_2$ are passed through a single-layer classifier:

$$\hat{y}^r_{1/2} := \sigma(W^r \cdot x^r_{1/2} + b^r) \qquad (4)$$

where $\sigma$ denotes a sigmoid of size $\#\mathcal{R}$. Any high response in the sigmoid layer indicates that the corresponding relation holds between $s_1$ and $s_2$. Given a confidence threshold $\alpha$, any relation with a score $\geq \alpha$ is considered activated. If none is activated, the sentence is assumed to express no known relation between the two entities.
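The relation classification step can be sketched as follows. This is a simplified illustration, not the authors' implementation: the entity representations omit the width embedding for brevity, the weights are hand-picked rather than learned, the threshold value is arbitrary, and returning a zero vector for adjacent entities is an assumption of this sketch.

```python
import math


def max_pool(vectors):
    """Element-wise maximum over a list of equal-length vectors."""
    return [max(column) for column in zip(*vectors)]


def localized_context(token_embs, span1, span2):
    """Max-pool the embeddings between two spans (end indices exclusive).

    Falls back to a zero vector when no token lies between the spans
    (an assumption of this sketch).
    """
    first, second = sorted([span1, span2])
    between = token_embs[first[1]:second[0]]
    if not between:
        return [0.0] * len(token_embs[0])
    return max_pool(between)


def relation_scores(e1, context, e2, weights, bias):
    """One sigmoid score per relation type for the ordered pair (s1, s2)."""
    x = e1 + context + e2  # concatenation of head, context and tail
    logits = [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]
    return [1.0 / (1.0 + math.exp(-z)) for z in logits]


def activated(scores, relation_types, alpha):
    """Relation types whose score reaches the confidence threshold."""
    return [r for r, s in zip(relation_types, scores) if s >= alpha]


# Toy 2-dimensional embeddings for a 4-token sentence.
token_embs = [[1.0, 0.0], [0.0, 2.0], [3.0, -1.0], [0.0, 0.0]]
s1, s2 = (0, 1), (3, 4)
e1, e2 = max_pool(token_embs[0:1]), max_pool(token_embs[3:4])
context = localized_context(token_embs, s1, s2)

relation_types = ["Work-For", "Live-In"]
weights = [[1.0, 0.0, 0.0, 0.0, 0.0, 0.0],   # hand-picked, not learned
           [0.0, 0.0, -1.0, 0.0, 0.0, 0.0]]
bias = [0.0, 0.0]
alpha = 0.6  # hypothetical threshold

forward = activated(relation_scores(e1, context, e2, weights, bias), relation_types, alpha)
backward = activated(relation_scores(e2, context, e1, weights, bias), relation_types, alpha)
print(forward, backward)
```

Scoring both orderings with the same context vector is what allows an asymmetric relation to fire in one direction only.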


In training, we adapt the width embeddings $w$ (blue) as well as the span/relation classifiers’ parameters ($W^s, b^s, W^r, b^r$) and fine-tune BERT in the process. Our training is supervised: Given sentences with annotated entities (including their entity types) and relations, we define a joint loss function for entity classification and relation classification:

$$\mathcal{L} = \mathcal{L}^s + \mathcal{L}^r$$

where $\mathcal{L}^s$ denotes the cross-entropy over the entity classes and $\mathcal{L}^r$ denotes the binary cross-entropy over relation classes. Both losses are averaged over each batch’s samples. No class weights are applied. A training batch consists of sentences, from which we draw samples for both classifiers:

  • For the span classifier, we draw any labeled entity, plus a fixed number $N_e$ of random non-entity spans as negative samples.

  • For the relation classifier, we only use the spans classified as entities, $\mathcal{S}$. We draw any labeled relation as a positive sample, and $N_r$ negative samples from $\mathcal{S} \times \mathcal{S}$ that are not labeled with any relation. We found such strong negative samples – in contrast to sampling random span pairs – to be of vital importance.

We draw samples only from the sentences within the batch, which speeds up the training process substantially: Instead of generating samples scattered over multiple sentences – which would require us to feed all those sentences through the deep and computationally expensive BERT model – we run each sentence only once through BERT (single-pass). This way, multiple positive/negative samples pass a single shallow linear layer for the entity and relation classifier respectively.
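The within-sentence sampling of negatives can be sketched as below. The function names, the span format and the toy annotations are assumptions for illustration; in particular, the entity spans passed to `sample_negative_pairs` stand in for whatever spans the entity classifier retained.

```python
import random


def sample_negative_spans(num_tokens, gold_spans, num_negatives, max_width, rng):
    """Draw random spans of one sentence that are not labeled entities."""
    candidates = [
        (start, end)
        for start in range(num_tokens)
        for end in range(start + 1, min(num_tokens, start + max_width) + 1)
        if (start, end) not in gold_spans
    ]
    return rng.sample(candidates, min(num_negatives, len(candidates)))


def sample_negative_pairs(entity_spans, gold_pairs, num_negatives, rng):
    """Draw ordered pairs of entity spans that carry no labeled relation."""
    candidates = [
        (a, b)
        for a in entity_spans
        for b in entity_spans
        if a != b and (a, b) not in gold_pairs
    ]
    return rng.sample(candidates, min(num_negatives, len(candidates)))


rng = random.Random(13)
gold_spans = {(0, 2), (3, 4), (5, 6)}   # toy annotations for a 6-token sentence
gold_pairs = {((0, 2), (5, 6))}         # one labeled (head, tail) relation

neg_spans = sample_negative_spans(6, gold_spans, num_negatives=4, max_width=3, rng=rng)
neg_pairs = sample_negative_pairs(sorted(gold_spans), gold_pairs, num_negatives=3, rng=rng)
print(neg_spans, neg_pairs)
```

Because all candidates come from the same sentence, every sample reuses the embeddings of a single encoder pass. The pair negatives are "strong" in the sense that both members are plausible entities, only the relation is missing.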


We compare SpERT with other joint entity/relation extraction models and investigate the influence of several hyperparameters. The evaluation is conducted on three publicly available datasets:

  • CoNLL04: The CoNLL04 dataset [roth:2004:conll04] contains sentences with annotated named entities and relations extracted from news articles. It includes four entity (Location, Organization, People, Other) and five relation types (Work-For, Kill, Organization-Based-In, Live-In, Located-In). We employ the training (1,153 sentences) and test set (288 sentences) split by [gupta:2016:table_filling]. For hyperparameter tuning, 20% of the training set is used as a held-out development part.

  • SciERC: SciERC [luan:2018:scierc] is derived from 500 abstracts of AI papers. The dataset includes six scientific entity (Task, Method, Metric, Material, Other-Scientific-Term, Generic) and seven relation types (Compare, Conjunction, Evaluate-for, Used-for, Feature-of, Part-of, Hyponym-of) in a total of sentences. We use the same train ( sentences), validation ( sentences) and test () split as [luan:2018:scierc].

  • ADE: The ADE dataset [gurulingappa:2012:ade] consists of sentences and relations extracted from medical reports that describe the adverse effects arising from drug use. It contains a single relation type Adverse-Effect and the two entity types Adverse-Effect and Drug. As in previous work, we conduct a 10-fold cross validation.

Dataset, System, Entity (Precision, Recall, F1), Relation (Precision, Recall, F1)
CoNLL04 [miwa:2014:table]
[zhang:2017:rel_glob] - - - -
[nguyen:2019:biaffine_attention] - - - -
[chi:2019:hierarch_attention] - - - -
SciERC [luan:2018:scierc]
[luan:2019:span_graphs] - - - -
ADE [li:2016:joint_ade_bio]
SpERT (without overlap)
SpERT (with overlap)
Table 1: Test set results on CoNLL04, SciERC and ADE. Our model SpERT outperforms the state-of-the-art in both entity and relation extraction by up to 5%. (Metrics: micro-average, macro-average, or not stated.)

We evaluate SpERT on both entity recognition and relation extraction. An entity is considered correct if its predicted span and entity label match the ground truth. A relation is considered correct if its relation type as well as the two related entities are both correct (in span and type). Only for SciERC, entity type correctness is not considered when evaluating relation extraction [luan:2018:scierc]. Following previous work, we measure the precision, recall and F1 score for each entity and relation type, and report the macro-averaged values for the ADE dataset and the micro-averaged ones for SciERC. For ADE, the F1 score is averaged over the folds. On CoNLL04, F1 scores were reported both as micro and macro averages in prior work, which is why we report both metrics.
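The difference between micro- and macro-averaged F1 under exact matching can be sketched as follows. The function names and the toy annotations are assumptions for illustration; a real evaluation would also key each item by its sentence.

```python
from collections import Counter


def prf(tp, fp, fn):
    """Precision, recall and F1 from raw counts (0.0 when undefined)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def evaluate(gold, predicted):
    """Return (micro F1, macro F1) for sets of (start, end, type) triples.

    A prediction counts as correct only if span and type both match.
    """
    tp, fp, fn = Counter(), Counter(), Counter()
    for item in predicted:
        if item in gold:
            tp[item[-1]] += 1
        else:
            fp[item[-1]] += 1
    for item in gold - predicted:
        fn[item[-1]] += 1
    types = sorted(set(tp) | set(fp) | set(fn))
    per_type_f1 = [prf(tp[t], fp[t], fn[t])[2] for t in types]
    micro_f1 = prf(sum(tp.values()), sum(fp.values()), sum(fn.values()))[2]
    macro_f1 = sum(per_type_f1) / len(per_type_f1) if per_type_f1 else 0.0
    return micro_f1, macro_f1


gold = {(0, 2, "People"), (5, 6, "Location"), (8, 9, "Location"), (10, 12, "Organization")}
pred = {(0, 2, "People"), (5, 6, "Location"), (8, 9, "Organization"), (10, 12, "Organization")}
micro_f1, macro_f1 = evaluate(gold, pred)
print(round(micro_f1, 4), round(macro_f1, 4))
```

Micro-averaging pools the counts over all types before computing F1, while macro-averaging computes F1 per type and then takes the unweighted mean, so rare types weigh more heavily in the macro score.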

For all of our experiments we use the cased BERT model (12 layers, 768-dimensional embeddings, 12 heads per layer, 110M parameters in total) as a sentence encoder, pre-trained on English text [devlin:2018:bert]. We initialize our classifiers’ weights with normally distributed random numbers (). We use the Adam Optimizer with a linear warmup and linear decay learning rate schedule and a peak learning rate of , a dropout before the entity and relation classifier with a rate of (both according to [devlin:2018:bert]), a batch size of , and width embeddings of dimensions. No further optimizations were conducted on those parameters. We choose the number of epochs (), the relation filtering threshold (), as well as the number of negative entity and relation samples per sentence () based on the CoNLL04 development set. We do not specifically tune our model for the other two datasets but use the same hyperparameters instead.
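A linear warmup followed by linear decay can be written as a small function of the training step. This is a generic sketch of that schedule shape; the function name and all numeric values below are placeholders, not the settings used in the experiments.

```python
def learning_rate(step, total_steps, warmup_steps, peak_lr):
    """Linear warmup from 0 to `peak_lr`, then linear decay back to 0."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / max(1, total_steps - warmup_steps)


# Placeholder values chosen only to make the shape easy to read.
schedule = [learning_rate(s, total_steps=100, warmup_steps=10, peak_lr=1.0)
            for s in (0, 5, 10, 55, 100)]
print(schedule)
```

The rate rises linearly during the warmup steps, reaches its peak at the end of warmup, and then falls linearly to zero at the final step.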

Comparison with State of the Art

Table 1 shows the test set evaluation results for the three datasets. We report the average over 5 runs for each dataset. SpERT outperforms the state-of-the-art for both entity and relation extraction. While NER performance increased by % (CoNLL04), % (SciERC) and % (ADE) F1 respectively, we observe even stronger performance increases in relation extraction: Compared to [li:2019:joint_bert], who also rely on BERT as a sentence encoder but use a BILOU approach for entity extraction, our model improves the state-of-the-art on the CoNLL04 dataset by % (micro) F1. On the challenging and domain-specific SciERC dataset, the SpERT model outperforms other models by about %. This improvement might be due to the superior BERT embeddings or the localized context.

On the ADE dataset, SpERT achieves an improvement by about % (SpERT (without overlap) in Table 1) F1 compared to other models. Note that ADE also contains instances of relations with overlapping entities, which can be discovered by span-based approaches like SpERT (in contrast to BILOU-based models). These have been filtered in prior work [bekoulis:2018:multi_head, li:2017:joint_bio]. As a reference for future work on overlapping entity recognition, we also present results on the full dataset (including the overlapping entities). When including this additional challenge, our model performs only marginally worse (%) compared to not considering overlapping entities. Out of the 120 relations with overlapping entities, 65 were detected correctly (%).

Candidate Selection and Negative Sampling

Figure 2: The accuracy of entity and relation classification (F1 on CoNLL04 and SciERC development set) increases significantly with the number of negative samples.

We also study the effect of the number and sampling of negative training examples. Figure 2 shows the F1 score (relations and entities) for the CoNLL04 and SciERC development sets, plotted against the number of negative samples per sentence. We see that a sufficient number of negative samples is essential: When using only a single negative entity and relation sample ($N_e = N_r = 1$) per sentence, relation F1 is about % (CoNLL04) and % (SciERC). With a high number of negative samples, the performance stagnates for both datasets. However, we found our results to be more stable when using a sufficiently high $N_e$ and $N_r$ (we chose in all other experiments).

For relation classification, we also assess the effect of using weak instead of strong negative relation samples: Instead of using the entity classifier as a filter for entity candidates and drawing strong negative training samples from $\mathcal{S} \times \mathcal{S}$ (pairs of filtered entity spans), we omit span filtering and sample random training span pairs not matching any ground truth relation. With these weak samples, our model retains a high recall (%) on the CoNLL04 development set, but the precision decreases drastically to about %. We observed that the model tends to predict subspans of entities to be in relation when using weak samples: For example, in the sentence “[John Wilkes Booth], who assassinated [President Lincoln], was an actor”, the pairs or are chosen. Additionally, pairs where one entity is correct and the other one incorrect are also favored by the model. Span filtering is thus not only beneficial in terms of training and evaluation speed, but is also vital for accurate localization in SpERT.

Localized Context

Despite advances in detecting long distance relations using LSTMs or the attention mechanism, the noise induced with increasing context remains a challenge. By using a localized context, i.e. the context between entity candidates, the relation classifier can focus on the sentence’s section that is often most discriminative for the relation type. To assess this effect, we compare localized context with two other context representations that use the whole sentence:

  • Full Context: Instead of performing a max pooling over the context between entity candidates, a max pooling over all tokens in the sentence is conducted.

  • Clf Token: Just like in the entity classifier (Figure 1, green), we use a special classifier token as context, which is able to attend to the whole sentence.
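The three context options compared here differ only in which embeddings feed the context vector. The sketch below contrasts them on toy values; the function name, the mode strings and the zero-vector fallback for adjacent entities are assumptions of this illustration.

```python
def max_pool(vectors):
    """Element-wise maximum over a list of equal-length vectors."""
    return [max(column) for column in zip(*vectors)]


def context_representation(token_embs, cls_emb, span1, span2, mode):
    """Context vector for an entity pair under one of three strategies."""
    first, second = sorted([span1, span2])
    if mode == "localized":
        between = token_embs[first[1]:second[0]]  # tokens between the entities
        return max_pool(between) if between else [0.0] * len(cls_emb)
    if mode == "full":
        return max_pool(token_embs)               # all tokens of the sentence
    if mode == "cls":
        return list(cls_emb)                      # classifier token only
    raise ValueError(f"unknown mode: {mode}")


# Toy embeddings; the last token lies outside the entity pair.
token_embs = [[1.0, 0.0], [0.0, 2.0], [3.0, -1.0], [0.0, 0.0], [9.0, 9.0]]
cls_emb = [0.5, 0.5]
s1, s2 = (0, 1), (3, 4)

localized = context_representation(token_embs, cls_emb, s1, s2, "localized")
full = context_representation(token_embs, cls_emb, s1, s2, "full")
cls = context_representation(token_embs, cls_emb, s1, s2, "cls")
print(localized, full, cls)
```

In this toy case the full-sentence pooling is dominated by a token outside the entity pair, which illustrates how distant tokens can inject noise into the relation decision.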

We evaluate the three options on the CoNLL04 development set (Figure 3): When employing SpERT with a localized context, the model reaches an F1 score of %, which significantly outperforms a max pooling over the whole sentence (%) and using the classifier token (%).

Figure 3 also displays results with respect to the sentence length: We split the CoNLL04 development set into four different parts, namely sentences with and tokens. Obviously, localized context leads to comparable or better results for all sentence lengths, particularly for very long sentences: Here, it reaches an F1 score of %, while the performance drastically decreases to % when using the other options. This shows that guiding the model towards relevant sections of the input sentence is vital. An interesting direction for future work is to learn the relevant context with respect to the entity candidates, and to incorporate precomputed syntactical information into SpERT.

Figure 3: Macro F1 scores of relation classification on the CoNLL04 development set when using different context representations. Localized context (red) performs best overall (left), particularly on long sentences with tokens (right).

Pre-training and Entity Representation

Next, we assess the effect of BERT’s language modeling pre-training. It seems intuitive that pre-training on large-scale datasets helps the model to learn semantic and syntactic relations that are hard to capture on a limited-scale target dataset. Therefore, we test three variants of pre-training:

  1. Full: We use the fully pre-trained BERT model (LM Pre-trained, our default setting).

  2. –Layers: We retain pre-trained token embeddings but train the layers from scratch (using the default initialization [devlin:2018:bert]).

  3. –Layers,Embeddings: We train layers and token embeddings from scratch (again, using the default initialization).

Pre-training Entity F1 Relation F1
– Layers
– Layers,Embeddings
Table 2: Effect of BERT pre-training on entity and relation extraction (CoNLL04 development set). A fully pre-trained BERT model significantly outperforms two BERTs in which the self-attention layers (–Layers) or the layers and the BPE input token embeddings (–Layers,Embeddings) are trained from scratch.

As Table 2 shows, training the BERT layers from scratch results in a performance drop of about % and % (macro) F1 for entity and relation extraction respectively. Further, training the token embeddings from scratch results in an even stronger drop in F1. These results suggest that pre-training a large network like BERT is challenging on the fairly small joint entity and relation extraction datasets. Therefore, language modeling pre-training is vital for generalization and to obtain a competitive performance.

Pooling Entity F1 Relation F1
Table 3: Investigation of different entity span representations (summing and averaging of entity’s tokens).

Finally, we investigate different options for the entity span representation other than conducting a max pooling over the entity’s tokens, namely a sum and average pooling (note that a size embedding and a context representation are again concatenated to obtain the final entity representation, Equation 1). Table 3 shows the CoNLL04 (macro) F1 with respect to the different entity representations: We found the averaging of the entity tokens to be unsuitable for both entity (%) and relation extraction (%). Sum pooling improves the performance to %. Max pooling, however, outperforms this by another increase of % and % respectively.
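The three pooling options differ only in how the token embeddings of a span are reduced to one vector. A minimal sketch, with the function name `pool` and the toy values being assumptions for illustration:

```python
def pool(vectors, mode):
    """Reduce a list of equal-length vectors to one vector."""
    columns = list(zip(*vectors))
    if mode == "max":
        return [max(c) for c in columns]
    if mode == "sum":
        return [sum(c) for c in columns]
    if mode == "average":
        return [sum(c) / len(c) for c in columns]
    raise ValueError(f"unknown pooling mode: {mode}")


# Toy embeddings of a three-token entity span.
span_embs = [[1.0, -2.0], [3.0, 4.0], [2.0, 1.0]]
pooled = {mode: pool(span_embs, mode) for mode in ("max", "sum", "average")}
print(pooled)
```

Sum pooling scales with the span width whereas max and average pooling do not, which is one way the options can behave differently even though they see the same tokens.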


We have presented SpERT, a span-based model for joint entity and relation extraction that relies on the pre-trained Transformer network BERT as its core. We show that with strong negative sampling, span filtering, and a localized context representation, a search over all spans in an input sentence becomes feasible. Our results suggest that span-based approaches perform competitively with BILOU-based models, and may be the more promising approach for future research due to their ability to identify overlapping entities.

In the future, we plan to investigate more elaborate forms of context for relation classifiers. Currently, our model simply employs the span between the two entities, which proved superior to global context. Employing additional syntactic features or learned context – while maintaining an efficient exhaustive search – appears to be a promising challenge.