
A Masked Image Reconstruction Network for Document-level Relation Extraction

Document-level relation extraction aims to extract relations among entities within a document. Compared with its sentence-level counterpart, document-level relation extraction requires inference over multiple sentences to extract complex relational triples. Previous research typically completes reasoning through information propagation on mention-level or entity-level document graphs, ignoring the correlations between relationships. In this paper, we propose a novel document-level relation extraction model based on a Masked Image Reconstruction network (DRE-MIR), which models inference as a masked image reconstruction problem to capture the correlations between relationships. Specifically, we first leverage an encoder module to obtain entity features and construct the entity-pair matrix from them. We then treat the entity-pair matrix as an image, randomly mask it, and restore it through an inference module, forcing the model to exploit the correlations between relationships. We evaluate our model on three public document-level relation extraction datasets, i.e. DocRED, CDR, and GDA. Experimental results demonstrate that our model achieves state-of-the-art performance on all three datasets and is highly robust to noise introduced during the inference process.


1 Introduction

Relation extraction (RE) aims to identify the semantic relations between entities in raw text, which is of great importance to many real-world applications [16, 37, 36]. Previous research focused on sentence-level RE, which predicts the relationship between entities in a single sentence [32, 38, 1]. In practice, however, many relationships are expressed across multiple sentences [30, 23]. Therefore, many recent works have made efforts to extend sentence-level RE to document-level RE [30, 34, 44, 28, 35].

Figure 1: An example from the DocRED dataset, illustrating how correlations between relations (triples) can be used to infer complex inter-sentence relations. (a) is a document, in which different colors represent different entities. (b) lists some intra-sentence relations, which can be easily identified. (c) shows an inter-sentence relation that requires reasoning techniques to be identified. The arrows between (b) and (c) indicate the correlation among relations.
Figure 2: The overall architecture of our DRE-MIR model. First, the encoder encodes the input document to obtain the entity embeddings, from which we obtain the entity-pair matrix through a linear layer. Second, we treat the entity-pair matrix as an image, randomly mask it, and restore it through an inference module. Through the Masked Image Reconstruction (MIR) task, our inference module learns how to use the correlations between relationships to infer the masked relationships. The MIR task contains two paths, i.e. the original path and the mask path. Finally, we utilize a classifier to predict the relationship of each entity pair. $\mathcal{L}_{rec}$ and $\mathcal{L}_{cls}$ denote the reconstruction loss and the classification loss, respectively.

Compared with sentence-level RE, where a sentence contains only one entity pair to be classified, document-level RE requires the model to classify the relations of multiple entity pairs simultaneously, and the entities involved in a relation may appear in different sentences. Document-level RE therefore poses an additional challenge: relation inference. As shown in Figure 1, it is easy to identify the intra-sentence relations in Figure 1b, such as (Altomonte, date of birth, 24 February 1694), (Altomonte, father, Martino Altomonte), and (Altomonte, country of citizenship, Austrian), because the two related entities appear in the same sentence. However, it is more challenging to predict the inter-sentence relation between Martino Altomonte and Austrian, because the document does not express this relationship explicitly; relations of this type can only be identified through reasoning techniques. According to the statistics of DocRED [30], a well-known document-level RE dataset, most relation instances (61.1%) require reasoning to be identified, which indicates that reasoning is essential for document-level RE.

To extract such complex inter-sentence relations, most current approaches construct a document-level graph based on heuristics, structured attention, or dependency structures [34, 15, 4, 24], and then perform inference with a graph convolutional network (GCN) [7, 11] on the document-level graph. Methods of this type complete reasoning through information transfer between mentions or entities. Meanwhile, since the Transformer architecture can implicitly model long-distance dependencies and can be regarded as a token-level fully connected graph, some studies [21, 44] infer implicitly through the pre-trained model rather than via document-level graphs.

However, these methods ignore the correlations between relationships. As shown in Figure 1, we can easily infer the inter-sentence relation (Martino Altomonte, country of citizenship, Austrian) through such correlations. Specifically, the model first needs to capture the correlation among (Altomonte, father, Martino Altomonte), (Altomonte, country of citizenship, Austrian), and (Martino Altomonte, country of citizenship, Austrian), and then use reasoning techniques to identify this complex inter-sentence relation, as shown in Figure 1c.

To capture the interdependencies among multiple relationships, DocuNet [35] formulates document-level RE as a semantic segmentation problem and uses a U-shaped segmentation module over an image-style feature map to capture global interdependencies among triples. DocuNet achieved the previous state-of-the-art performance, which shows that the correlation between relationships is essential for document-level RE. However, capturing correlations between relations through convolutional neural networks is unintuitive and inefficient due to the intrinsic differences between entity-pair matrices and images.

In this paper, we follow DocuNet and model document-level RE as a table filling problem. We first construct an entity-pair matrix, where each point represents the relevant features of an entity pair. The document-level RE model then labels each point of the entity-pair matrix with the corresponding relation classes. Meanwhile, we also treat the entity-pair matrix as an image. To capture the interdependencies among relations more effectively, we propose a novel Document-level Relation Extraction model based on a Masked Image Reconstruction network (DRE-MIR), which formulates the inference problem in document-level RE as a masked image reconstruction problem. As shown in Figure 2, we first randomly mask the entity-pair matrix and then reconstruct it through the inference module. Through this Masked Image Reconstruction (MIR) task, our model learns how to infer masked points with the help of the correlations between relations. Moreover, to reconstruct the masked points of the entity-pair matrix more efficiently and intuitively, we propose an Inference Multi-head Self-Attention (I-MSA) module, which greatly improves the inference ability of the model. As shown in Figure 3, the I-MSA contains four types of heads, each corresponding to an inference mode that reasons about a target pair $(A, B)$ through an intermediate entity $C$: $(A,C)+(C,B)$, $(A,C)+(B,C)$, $(C,A)+(C,B)$, and $(C,A)+(B,C)$.

Our contributions can be summarized as follows:

  • To the best of our knowledge, our method is the first approach that treats the inference problem in document-level RE as an image reconstruction problem.

  • We introduce the I-MSA to improve the model’s ability to reconstruct the masked entity-pair matrix.

  • Experimental results on three public document-level RE datasets show that our DRE-MIR model achieves state-of-the-art performance.

2 Method

In this section, we introduce our DRE-MIR model in detail. As shown in Figure 2, DRE-MIR consists of three main modules: an encoder module, an inference module, and a classifier module. We first describe the encoder module in Section 2.1, then introduce the core inference module in Section 2.2, and finally describe our classifier module and loss functions in Section 2.3.

2.1 Encoder Module

[40] and [43] verified that marking entities in the input sentence with their entity types can effectively improve the performance of sentence-level RE models. In document-level RE, however, each entity has multiple mentions, and it is important to gather all the mention information for each entity. Therefore, we use both the entity type and the entity id to mark the mentions in the document, which not only incorporates the entity type information earlier but also helps to aggregate the mention information. Specifically, given a document $D = [x_1, x_2, \ldots, x_l]$ containing $l$ words, we first mark each mention by inserting special symbols $\langle t_i, id_i \rangle$ and $\langle /t_i, id_i \rangle$ at its start and end positions, where $t_i$ and $id_i$ respectively denote the entity type and entity id of the mention. Then we feed the adapted document to a pre-trained language model to obtain the context embedding of each word in the document:

$$H = [\boldsymbol{h}_1, \boldsymbol{h}_2, \ldots, \boldsymbol{h}_l] = \mathrm{Encoder}([x_1, x_2, \ldots, x_l]) \quad (1)$$

Finally, we utilize the average of the embeddings of the start symbol $\langle t_i, id_i \rangle$ and the end symbol $\langle /t_i, id_i \rangle$ to represent the mention.
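
To make the marking scheme concrete, the following is a minimal Python sketch of inserting type-and-id mention markers before encoding. The marker format (`<type:id>` … `</type:id>`) and the helper name are illustrative assumptions; the paper specifies only that entity type and entity id are inserted at mention boundaries.

```python
def insert_mention_markers(tokens, mentions):
    """Insert <type:id> ... </type:id> markers around each mention.

    tokens:   list of words in the document
    mentions: list of (start, end, entity_type, entity_id) spans, end exclusive
    The exact marker format is an assumption; the paper only states that
    entity type and entity id are used to mark mention boundaries.
    """
    marked = list(tokens)
    # Process from the rightmost mention so earlier offsets stay valid.
    for start, end, etype, eid in sorted(mentions, key=lambda m: -m[0]):
        marked.insert(end, f"</{etype}:{eid}>")
        marked.insert(start, f"<{etype}:{eid}>")
    return marked
```

In practice, the marker strings would also be registered as special tokens of the pre-trained tokenizer so that each marker receives its own learnable embedding.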

For an entity $e_i$ with mentions $\{m^i_1, \ldots, m^i_{N_i}\}$, we follow [44] and [35] and leverage logsumexp pooling [10], a smooth version of max pooling, to obtain the embedding $\boldsymbol{e}_i$ of entity $e_i$:

$$\boldsymbol{e}_i = \log \sum_{j=1}^{N_i} \exp\left(\boldsymbol{m}^i_j\right) \quad (2)$$

where $\boldsymbol{m}^i_j$ is the embedding of mention $m^i_j$.
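
Logsumexp pooling maps directly onto a PyTorch primitive; a minimal sketch follows (the tensor layout is an assumption):

```python
import torch

def entity_embedding(mention_embs: torch.Tensor) -> torch.Tensor:
    """Logsumexp pooling of Eq. (2): a smooth max over mention embeddings.

    mention_embs: (num_mentions, hidden), each row the average of a
    mention's start/end marker embeddings. Returns a (hidden,) vector.
    """
    return torch.logsumexp(mention_embs, dim=0)
```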

In addition, we calculate an entity-pair-aware context representation $\boldsymbol{c}_{s,o}$ for each entity pair $(e_s, e_o)$, which represents the contextual information in the document that entity $e_s$ and entity $e_o$ jointly attend to. The $\boldsymbol{c}_{s,o}$ is formulated as:

$$\boldsymbol{c}_{s,o} = H^{\top} \frac{A_s \circ A_o}{\boldsymbol{1}^{\top} (A_s \circ A_o)} \quad (3)$$

where $A_s$ ($A_o$) refers to the attention scores with which entity $e_s$ ($e_o$) attends to each word in the document, $H$ is the document embedding, and $\circ$ refers to element-wise multiplication.
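
A minimal sketch of this entity-pair-aware context vector, assuming $A_s$ and $A_o$ are the encoder's token-level attention scores for the two entities (e.g., averaged over heads and mentions, as in the localized context pooling of [44]):

```python
import torch

def pair_context(H: torch.Tensor, a_s: torch.Tensor, a_o: torch.Tensor) -> torch.Tensor:
    """Entity-pair-aware context c_{s,o} of Eq. (3).

    H:   (seq_len, hidden) token embeddings of the document
    a_s: (seq_len,) attention of entity e_s over all tokens
    a_o: (seq_len,) attention of entity e_o over all tokens
    """
    q = a_s * a_o                 # element-wise product: tokens both entities attend to
    q = q / (q.sum() + 1e-12)     # renormalize to a distribution
    return H.t() @ q              # (hidden,) weighted average of token embeddings
```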

Finally, we construct the entity-pair matrix $M \in \mathbb{R}^{N \times N \times d}$ as follows:

$$M_{s,o} = \mathrm{FFN}\left(\left[W_s \boldsymbol{e}_s \,;\; W_o \boldsymbol{e}_o \,;\; \boldsymbol{c}_{s,o} \,;\; \boldsymbol{d}_{cls}\right]\right) \quad (4)$$

where $N$ represents the number of entities, $\mathrm{FFN}$ refers to a feed-forward neural network, $W_s$ and $W_o$ are learnable weight matrices, and $\boldsymbol{d}_{cls}$ is the [CLS] token embedding, which is used to represent the information of the entire document.
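
A sketch of Eq. (4) under the reconstruction above; the concatenation order and FFN shape are assumptions rather than the authors' released code:

```python
import torch
import torch.nn as nn

class EntityPairMatrix(nn.Module):
    """Build the N x N x d entity-pair matrix of Eq. (4).

    The exact combination below (projected head/tail embeddings, the pair
    context c_{s,o}, and the [CLS] document embedding fed to an FFN) is a
    plausible reconstruction, not verified implementation details.
    """
    def __init__(self, hidden: int, d: int):
        super().__init__()
        self.w_s = nn.Linear(hidden, hidden)
        self.w_o = nn.Linear(hidden, hidden)
        self.ffn = nn.Sequential(nn.Linear(4 * hidden, d), nn.ReLU(), nn.Linear(d, d))

    def forward(self, ents, ctx, cls):
        # ents: (N, hidden); ctx: (N, N, hidden); cls: (hidden,)
        n = ents.size(0)
        h = self.w_s(ents).unsqueeze(1).expand(n, n, -1)    # head entity features per row
        t = self.w_o(ents).unsqueeze(0).expand(n, n, -1)    # tail entity features per column
        g = cls.expand(n, n, -1)                            # document-level [CLS] feature
        return self.ffn(torch.cat([h, t, ctx, g], dim=-1))  # (N, N, d)
```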

2.2 Inference Module

After obtaining the entity-pair matrix, we treat it as an image. We obtain a masked image by randomly masking pixels of the original image, and we reconstruct the masked image through the inference module, as shown in Figure 2. Through this Masked Image Reconstruction (MIR) task, our inference module learns how to infer the masked pixels from the unmasked pixels via the correlations between relationships.

Our inference module is a variant of the Transformer encoder that replaces multi-head self-attention (MSA) with Inference Multi-head Self-Attention (I-MSA), as shown in Figure 3. The I-MSA contains four types of heads, each corresponding to one inference mode: $(A,C)+(C,B)$, $(A,C)+(B,C)$, $(C,A)+(C,B)$, and $(C,A)+(B,C)$. For example, head 1 in Figure 3 corresponds to the $(A,C)+(C,B)$ inference mode: for the target pair $(A, B)$, we first concatenate the corresponding pixels of the $A$-th row and the $B$-th column of the image, $\{[M_{A,C} \,;\; M_{C,B}]\}_{C=1}^{N}$, and perform dimensionality reduction through a linear layer to obtain $G_{A,B} = \{W_g [M_{A,C} \,;\; M_{C,B}]\}_{C=1}^{N}$. Then, $M_{A,B}$ performs an attention operation over $G_{A,B}$. The whole process can be formulated as follows:

$$\mathrm{head}_1(A,B) = \mathrm{softmax}\!\left(\frac{(W_q M_{A,B})\,(W_k G_{A,B})^{\top}}{\sqrt{d}}\right) W_v G_{A,B}$$

where $[\cdot \,;\, \cdot]$ represents the concatenation operation, $\{\cdot\}$ refers to a set, and $W_g$, $W_q$, $W_k$, $W_v$ are learnable weight matrices.
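
A minimal sketch of a single I-MSA head for the $(A,C)+(C,B)$ mode; the dimensions and the single-head attention formulation are assumptions:

```python
import torch
import torch.nn as nn

class InferenceHead(nn.Module):
    """One I-MSA head for the (A,C)+(C,B) inference mode.

    For a target pair (A, B), the candidates are the concatenations
    [M[A,C]; M[C,B]] over every intermediate entity C, reduced by a linear
    layer; M[A,B] then attends over them. A sketch, not the released code.
    """
    def __init__(self, d: int):
        super().__init__()
        self.reduce = nn.Linear(2 * d, d)
        self.q = nn.Linear(d, d)
        self.k = nn.Linear(d, d)
        self.v = nn.Linear(d, d)

    def forward(self, M: torch.Tensor) -> torch.Tensor:
        n, _, d = M.shape
        # G[A,B,C] = W_g [M[A,C]; M[C,B]] for every intermediate entity C
        row = M.unsqueeze(1).expand(n, n, n, d)                  # M[A,C] indexed by (A,B,C)
        col = M.transpose(0, 1).unsqueeze(0).expand(n, n, n, d)  # M[C,B] indexed by (A,B,C)
        G = self.reduce(torch.cat([row, col], dim=-1))           # (n, n, n, d)
        q = self.q(M).unsqueeze(2)                               # (n, n, 1, d) query per pair
        att = torch.softmax((q * self.k(G)).sum(-1) / d ** 0.5, dim=-1)  # (n, n, n)
        return (att.unsqueeze(-1) * self.v(G)).sum(2)            # (n, n, d)
```

Note that materializing all candidate triples costs O(N^3) memory, so a practical implementation would likely chunk over the intermediate-entity dimension.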

Inspired by [2] and [42], we reconstruct the distribution of the pixels of the masked image over the label space instead of reconstructing the raw pixels. The reason is that labels are more information-dense than pixels and are closer to our target task, relation classification. In addition, we reconstruct every pixel of the masked image, both masked and unmasked, which is similar to [8]. In this way, the convergence of the model is accelerated and better performance is obtained. Specifically, the original image $M$ and the masked image $\tilde{M}$ are first sequentially fed into the inference module and the classifier module to obtain the probability distributions $P^{o}$ and $P^{m}$. Then, we reconstruct the masked image by minimizing the bidirectional KL-divergence between the two distributions of corresponding pixels in the original image and the masked image. Our reconstruction loss function is formulated as follows:

$$\mathcal{L}_{rec} = \frac{1}{N^2} \sum_{s=1}^{N} \sum_{o=1}^{N} \frac{1}{2}\left( D_{KL}\!\left(P^{o}_{s,o} \,\big\|\, P^{m}_{s,o}\right) + D_{KL}\!\left(P^{m}_{s,o} \,\big\|\, P^{o}_{s,o}\right) \right) \quad (5)$$
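
A sketch of the bidirectional KL objective of Eq. (5), assuming the classifier outputs per-pixel logits over the label set:

```python
import torch.nn.functional as F

def reconstruction_loss(logits_orig, logits_mask):
    """Bidirectional KL between the per-pixel label distributions of the
    original and masked entity-pair matrices (a sketch of Eq. (5)).

    logits_*: (N, N, num_classes) classifier outputs for each pixel.
    """
    num_classes = logits_orig.size(-1)
    p = F.log_softmax(logits_orig.reshape(-1, num_classes), dim=-1)  # original path
    q = F.log_softmax(logits_mask.reshape(-1, num_classes), dim=-1)  # mask path
    # F.kl_div(input, target, log_target=True) computes KL(target || input).
    kl_pq = F.kl_div(q, p, log_target=True, reduction="batchmean")   # KL(P^o || P^m)
    kl_qp = F.kl_div(p, q, log_target=True, reduction="batchmean")   # KL(P^m || P^o)
    return 0.5 * (kl_pq + kl_qp)
```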
Figure 3: (a) The architecture of the Inference Multi-head Self-Attention (I-MSA), a variant of multi-head self-attention (MSA). The I-MSA has four types of heads, and each head corresponds to one inference mode. (b) The inference module, a variant of the Transformer encoder in which MSA is replaced with I-MSA.

2.3 Classifier Module

Our classifier module is a single linear layer. The original image and the masked image are respectively fed into the inference module to obtain the corresponding corrected images $M^{o}$ and $M^{m}$. Then, the relation probabilities of each entity pair are obtained through a linear layer:

$$P(r \mid e_s, e_o) = \mathrm{sigmoid}\left(W_c M^{\prime}_{s,o} + b_c\right) \quad (6)$$

where $M^{\prime}_{s,o}$ denotes the corrected representation of the entity pair $(e_s, e_o)$ from either path, and $W_c$ and $b_c$ are model parameters.

To alleviate the problem of the unbalanced relation distribution, we use the adaptive-thresholding loss [44] as our classification loss function $\mathcal{L}_{cls}$, which learns an adaptive threshold for each sample. Specifically, a threshold class $TH$ is introduced to separate positive classes from negative classes: positive classes should have higher probabilities than $TH$, and negative classes should have lower probabilities than $TH$. The adaptive-thresholding loss is formulated as follows:

$$\mathcal{L}_1 = -\sum_{r \in \mathcal{P}} \log\!\left( \frac{\exp(\mathrm{logit}_r)}{\sum_{r' \in \mathcal{P} \cup \{TH\}} \exp(\mathrm{logit}_{r'})} \right), \quad \mathcal{L}_2 = -\log\!\left( \frac{\exp(\mathrm{logit}_{TH})}{\sum_{r' \in \mathcal{N} \cup \{TH\}} \exp(\mathrm{logit}_{r'})} \right), \quad \mathcal{L}_{cls} = \mathcal{L}_1 + \mathcal{L}_2 \quad (7)$$

where $\mathcal{P}$ and $\mathcal{N}$ are the sets of positive and negative classes, respectively.
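
The adaptive-thresholding loss follows ATLOP [44]; a condensed sketch of Eq. (7), assuming class 0 is reserved for the threshold class $TH$:

```python
import torch

def adaptive_thresholding_loss(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """ATLOP-style adaptive-thresholding loss [44] (condensed sketch).

    logits: (num_pairs, num_classes), class 0 reserved as the TH class.
    labels: (num_pairs, num_classes) multi-hot float labels; column 0 is 0.
    """
    th_mask = torch.zeros_like(labels)
    th_mask[:, 0] = 1.0
    pos = labels + th_mask        # positive classes together with TH
    neg = 1.0 - labels            # negative classes together with TH

    # L1: each positive class competes against {positives, TH}.
    logit1 = logits - (pos == 0).float() * 1e30
    loss1 = -(torch.log_softmax(logit1, dim=-1) * labels).sum(1)

    # L2: TH competes against {negatives, TH}.
    logit2 = logits - (neg == 0).float() * 1e30
    loss2 = -(torch.log_softmax(logit2, dim=-1) * th_mask).sum(1)

    return (loss1 + loss2).mean()
```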

The training objective is to minimize the total loss $\mathcal{L}$, which is defined as follows:

$$\mathcal{L} = \lambda_1 \mathcal{L}_{rec} + \lambda_2 \mathcal{L}_{cls}$$

where $\lambda_1$ and $\lambda_2$ are hyperparameters, and we simply set both to 1.

3 Experiments

3.1 Datasets

We conduct experiments on three document-level RE datasets to evaluate our DRE-MIR model. The statistics of the datasets can be found in Appendix A.

  • DocRED [30]: DocRED is a large-scale human-annotated dataset for document-level RE, which is constructed from Wikipedia and Wikidata. DocRED contains 96 types of relations, 132,275 entities, and 56,354 relational triples in total. In DocRED, more than 40.7% of relational facts can only be extracted from multiple sentences, and 61.1% of relational triples require various reasoning skills. We follow the standard split of the dataset: 3,053 documents for training, 1,000 for development, and 1,000 for testing.

  • CDR [13]: The Chemical-Disease Reactions dataset (CDR) consists of 1,500 PubMed abstracts, equally divided into three sets for training, development, and testing. CDR aims to predict the binary interactions between Chemical and Disease concepts.

  • GDA [27]: The Gene-Disease Associations dataset (GDA) is a large-scale biomedical dataset constructed from MEDLINE abstracts via distant supervision. GDA contains 29,192 documents as the training set and 1,000 as the test set. GDA is also a binary relation classification task that identifies interactions between Gene and Disease concepts. We follow [4] and divide the training set into two parts: 23,353 documents for training and 5,839 for validation.

Model | Dev Ign F1 | Dev F1 | Dev Intra-F1 | Dev Inter-F1 | Test Ign F1 | Test F1
GEDA-BERT [12] | 54.52 | 56.16 | - | - | 53.71 | 55.74
LSR-BERT [15] | 52.43 | 59.00 | 65.26 | 52.05 | 56.97 | 59.05
GLRE-BERT [24] | - | - | - | - | 55.40 | 57.40
HeterGSAN-BERT [29] | 58.13 | 60.18 | - | - | 57.12 | 59.45
GAIN-BERT [34] | 59.14 | 61.22 | 67.10 | 53.90 | 59.00 | 61.24
BERT [25] | - | 54.16 | 61.61 | 47.15 | - | 53.20
BERT-Two-Step [25] | - | 54.42 | 61.80 | 47.28 | - | 53.92
HIN-BERT [21] | 54.29 | 56.31 | - | - | 53.70 | 55.60
CorefBERT [31] | 55.32 | 57.51 | - | - | 54.54 | 56.96
ATLOP-BERT [44] | 59.22 | 61.09 | - | - | 59.31 | 61.30
SIRE-BERT [33] | 59.82 | 61.60 | 68.07 | 54.01 | 60.18 | 62.05
DocuNet-BERT [35] | 59.86 | 61.83 | - | - | 59.93 | 61.86
DRE-MIR-BERT | 60.97±0.10 | 62.96±0.17 | 68.13±0.14 | 57.29±0.43 | 61.03 | 63.15
Table 1: Results (%) on the development and test sets of DocRED. The scores of all baseline models come from ATLOP [44] and SIRE [33]. The results on the test set are obtained by submitting predictions to the official CodaLab evaluation.
Model | CDR | GDA
BRAN [23] | 62.1 | -
EoG [4] | 63.6 | 81.5
LSR [15] | 64.8 | 82.2
DHG [39] | 65.9 | 83.1
GLRE [24] | 68.5 | -
SciBERT [3] | 65.1 | 82.5
ATLOP-SciBERT [44] | 69.4 | 83.9
DocuNet-SciBERT [35] | 76.3 | 85.3
DRE-MIR-SciBERT | 76.6 | 86.4
Table 2: Results (%, F1) on the biomedical datasets CDR and GDA.

3.2 Experimental Settings

Our model was implemented with PyTorch and Hugging Face's Transformers [26]. We used cased BERT-base [5] as the encoder on DocRED and SciBERT-base [3] on CDR and GDA. Our model is optimized with AdamW [14] using a linear warmup [6] for the first 6% of steps followed by a linear decay to 0. By default, we randomly mask 20% of the points in the entity-pair matrix and set the number of layers in the inference module to 3. All hyper-parameters are tuned on the development set; some of them are listed in Appendix B.
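
A minimal sketch of the default 20% masking step; representing masked points with a learnable mask embedding is an assumption, since the paper does not state how masked pixels are encoded:

```python
import torch
import torch.nn as nn

def mask_entity_pair_matrix(M: torch.Tensor, mask_emb: nn.Parameter, rate: float = 0.2):
    """Randomly mask points of the entity-pair matrix for the MIR task.

    M:        (N, N, d) entity-pair matrix treated as an image
    mask_emb: (d,) learnable embedding substituted for masked points
              (the learnable mask token is an assumption)
    Returns the masked matrix and the boolean mask that was applied.
    """
    keep = torch.rand(M.shape[:2], device=M.device) >= rate       # True = keep pixel
    M_masked = torch.where(keep.unsqueeze(-1), M, mask_emb.expand_as(M))
    return M_masked, ~keep
```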

3.3 Results on the DocRED Dataset

On the DocRED Dataset, we choose the following two types of models as the baseline:

  • Graph-based Models: This type of method uses graph convolutional networks (GCN) [11, 22, 20] to complete inference on document-level graphs, including GEDA [12], LSR [15], GLRE [24], GAIN [34], and HeterGSAN [29].

  • Transformer-based Models: These models directly use pre-trained language models for document-level RE without using graph structures, including BERT [25], BERT-Two-Step [25], HIN-BERT [21], CorefBERT [31], and ATLOP-BERT [44].

In addition, we also include DocuNet [35] and SIRE [33] in the comparison. DocuNet formulates document-level RE as a semantic segmentation problem and achieved the previous state-of-the-art results, while SIRE represents intra- and inter-sentential relations in different ways and designs a new form of logical reasoning.

We follow [30] and use F1 and Ign F1 as evaluation metrics, where Ign F1 denotes the F1 score excluding the relational facts that are shared by the training and dev/test sets. Compared with all baseline models, our DRE-MIR model outperforms the latest state-of-the-art model by 1.14/1.11 F1/Ign F1 on the dev set and 1.29/1.10 F1/Ign F1 on the test set, which demonstrates its excellent overall performance. Moreover, compared with the best graph-based model, DRE-MIR outperforms GAIN by 1.74/1.83 F1/Ign F1 on the dev set and 2.11/2.03 F1/Ign F1 on the test set. This shows that our model has better reasoning ability than previous graph-based models.

Following [15, 29], we report Intra-F1 / Inter-F1 scores in Table 1, which consider only intra- or inter-sentence relations, respectively. Compared with Intra-F1, Inter-F1 better reflects the reasoning ability of a model. We observe that our DRE-MIR model improves the Inter-F1 score by 3.28 over the SIRE model, which demonstrates that our MIR task and inference module can greatly improve the inference ability of the model. Moreover, the improvement on Inter-F1 is larger than that on Intra-F1, which shows that the performance gain of DRE-MIR mainly comes from inter-sentence relations.

Model | Ign F1 | F1
DRE-MIR | 60.97 | 62.96
Only Mask path | 60.13 | 62.20
Only reconstruct masked points | 59.50 | 61.53
w/o MIR | 59.31 | 61.35
w/o Inference Module | 58.56 | 60.46
w/o I-MSA | 54.67 | 56.22
Table 3: Ablation study of the DRE-MIR model on the development set of DocRED. w/o Inference Module and w/o MIR remove the inference module and the MIR task from our model, respectively; w/o I-MSA replaces the Inference Multi-head Self-Attention (I-MSA) with standard multi-head self-attention (MSA); w/o Inference Module is our base model.

3.4 Results on the Biomedical Datasets

On the two biomedical datasets, CDR and GDA, we compared our model with a large number of baseline models and recent state-of-the-art models including BRAN [23], EoG [4], LSR [15], DHG [39], GLRE [24], SciBERT [3], ATLOP [44], and DocuNet [35].

Experimental results on the two biomedical datasets are shown in Table 2. Our DRE-MIR model achieves 76.6 F1 on the CDR dataset, slightly outperforming the DocuNet model by 0.3 F1. There are three possible reasons: (1) CDR contains only two types of relations, so the correlation between relations is weak. (2) The CDR dataset contains very few annotated samples, making it difficult for our model to learn the underlying correlations. (3) The samples in the CDR dataset contain few entities, which leads to a small entity-pair matrix and weakens the effectiveness of our MIR task. Although the GDA dataset also has problems (1) and (3), it is a large-scale corpus containing a large number of samples. Our model therefore achieves an 86.4 F1 score on the GDA dataset, an improvement of 1.1 F1 over the DocuNet model. Since the MIR task resembles a pre-training task in machine vision, more data is required to obtain better performance.

Layers | Ign F1 | F1
1 | 59.83 | 61.68
2 | 60.62 | 62.75
3 | 60.97 | 62.96
Table 4: Performance of DRE-MIR with different numbers of layers in the inference module on the development set of DocRED.

3.5 Ablation Study

We conducted an ablation experiment to validate the effectiveness of different components of our DRE-MIR model on the development set of the DocRED dataset. The results are listed in Table 3.

The w/o MIR variant removes the MIR task from our model and keeps only the original path of DRE-MIR. It achieves an F1 score of 61.35, outperforming w/o Inference Module, our base model, by 0.89 F1. This shows that our inference module has some inference ability even without the MIR task. However, removing the MIR task costs the full DRE-MIR model 1.61 F1, which proves that the MIR task substantially improves the inference ability of our inference module.

The Only Mask path variant removes the original path from our model and keeps only the mask path; its image reconstruction method is similar to [2]. It leads to a drop of 0.76 F1, which proves that the original image plays a guiding role in the reconstruction of the masked image and further improves the performance of the model.

As can be seen from w/o I-MSA, replacing the I-MSA with standard MSA results in a huge performance drop of 6.74 F1. This shows that our I-MSA can greatly improve the inference ability of the Transformer. We also report a variant that reconstructs only the masked pixels, i.e. Only reconstruct masked points, and observe a performance drop of 1.43 F1. A possible reason is that the masked pixels may affect the unmasked pixels through the inference module, and reconstructing all pixels effectively alleviates this negative impact.

Overall, our model improves over the base model by 2.50 F1, which fully demonstrates that our inference module and MIR task can effectively improve the inference ability of the model.

Figure 4: Results with different masking rates used during training on the development set of DocRED.

3.6 Analysis & Discussion

In this section, we further discuss and analyze our DRE-MIR model from four aspects: (1) the number of layers in the inference module, (2) the masking rate during training, (3) the inference performance, and (4) the ability to restore the masked entity-pair matrix.

Table 4 shows the performance of the DRE-MIR model with different numbers of layers in the inference module. We observe that increasing the number of layers from 1 to 2 improves performance by 1.07 F1, possibly because more layers improve the multi-hop reasoning ability of the model. However, increasing the number of layers from 2 to 3 yields only a slight improvement of 0.21 F1. A two-layer inference module is therefore sufficient for general cases.

Figure 4 shows that our model obtains the best performance when trained with a masking rate of 20%. However, our model still achieves a decent performance of 61.44/59.43 F1/Ign F1 when the masking rate is set to 50%, which shows that our inference module has a strong ability to restore the masked entity-pair matrix. This also implies that using a larger masking rate to increase the training difficulty could achieve better performance on large-scale corpora, similar to the conclusions drawn from pre-training tasks in machine vision such as MAE [8].

To evaluate the inference ability of the models, we follow [29, 33] and report Infer-F1 scores in Table 5, which consider only the relations involved in the relational reasoning process. We observe that our DRE-MIR model improves Infer-F1 by 2.71 compared with the GAIN model. Removing the inference module from our model results in a performance drop of 4.10 Infer-F1, which demonstrates that our inference module and the MIR task improve the inference ability of the model.

To evaluate the model’s ability of restoring the masked entity-pair matrix, we also randomly mask the entity-pair matrix during validating. We show the experimental results in Figure 5. Since we train our model with a masking rate of 20%, the performance drop of the model is very slight when the masking rate is less than 20%. Our model has only a slight performance drop of 1.84/1.88 /Ign with 50% masking rate, which shows that our model has excellent robustness. Even if the masking rate is increased to 80%, our model still achieves a score of 55.00/52.69 /Ign, which is better than the BERT-[25] model. This shows that our model has strong restoring ability.

Model | Infer-F1 | P | R
GAIN-GloVe | 40.82 | 32.76 | 54.14
SIRE-GloVe | 42.72 | 34.83 | 55.22
BERT-RE | 39.62 | 34.12 | 47.23
RoBERTa-RE | 41.78 | 37.97 | 46.45
GAIN-BERT | 46.89 | 38.71 | 59.45
DRE-MIR-BERT | 49.60 | 42.72 | 59.13
w/o Inference Module | 45.50 | 38.03 | 56.63
Table 5: Infer-F1 results (%) of the DRE-MIR model on the development set of DocRED. P: Precision, R: Recall.

4 Related Work

Since many relational facts in real applications can only be recognized across sentences, much recent work has gradually shifted its attention to document-level RE. Because graph neural networks (GNNs) can effectively model long-distance dependencies and support logical reasoning, many methods based on document graphs have been widely used for document-level RE. Specifically, they first construct a graph structure from the document and then apply a GCN [11, 9] to the graph to complete logical reasoning. The graph-based approach was first introduced by [17] and has recently been extended by many works [4, 12, 39, 41, 24, 15, 34, 27]. [12] proposed the Graph Enhanced Dual Attention network (GEDA) and used it to characterize the complex interaction between sentences and potential relation instances. [34] proposed the Graph Aggregation-and-Inference Network (GAIN), which first constructs a heterogeneous mention-level graph (hMG) to model the complex interactions among mentions across the document, then constructs an entity-level graph (EG), and finally uses a path reasoning mechanism to infer relations between entities on the EG. [15] proposed the LSR model, which constructs a latent document-level graph and completes logical reasoning on the graph.

In addition, because pre-trained language models based on the Transformer architecture can implicitly model long-distance dependencies and complete logical reasoning, some studies [21, 44, 25] directly apply pre-trained models without introducing document graphs. [44] proposed ATLOP, which consists of two parts, adaptive thresholding and localized context pooling, to solve the multi-label and multi-entity problems. SIRE [33] represents intra- and inter-sentential relations in different ways and designs a new and straightforward form of logical reasoning. Recently, the state-of-the-art model DocuNet [35] formulates document-level RE as a semantic segmentation task and captures global information among relational triples through a U-shaped segmentation module [19].

Figure 5: Results with different masking rates used during validation on the development set of DocRED.

Furthermore, our work is inspired by recent pre-training research in the field of machine vision, such as BEiT [2], iBOT [42], and MAE [8]. BEiT follows BERT [5] and proposes a masked image modeling (MIM) task together with a tokenizer to pre-train vision Transformers; the tokenizer "tokenizes" the image into discrete visual tokens, which are obtained from the latent codes of a discrete VAE [18]. iBOT performs the MIM task with an online tokenizer and formulates MIM as a knowledge distillation (KD) problem. MAE develops an asymmetric encoder-decoder architecture, with an encoder that operates only on the visible subset of patches (without mask tokens) and a lightweight decoder that reconstructs the original image from the latent representation and mask tokens.

5 Conclusion and Future Work

In this work, we first formulate the inference problem in document-level RE as a Masked Image Reconstruction (MIR) problem. We then propose an Inference Multi-head Self-Attention (I-MSA) module to restore masked images more efficiently. The MIR task and the I-MSA module greatly improve the inference ability of our model. Experiments on three public document-level RE datasets demonstrate that our DRE-MIR model achieves better results than existing state-of-the-art models. In the future, we will try to apply our model to other inter-sentence or document-level tasks, such as cross-sentence collective event detection.

References

  • [1] L. Baldini Soares, N. FitzGerald, J. Ling, and T. Kwiatkowski (2019) Matching the blanks: distributional similarity for relation learning. In Proc. of ACL, Cited by: §1.
  • [2] H. Bao, L. Dong, and F. Wei (2021) BEiT: bert pre-training of image transformers. arXiv preprint arXiv:2106.08254. Cited by: §2.2, §3.5, §4.
  • [3] I. Beltagy, K. Lo, and A. Cohan (2019) Scibert: a pretrained language model for scientific text. arXiv preprint arXiv:1903.10676. Cited by: §3.2, §3.4, Table 2.
  • [4] F. Christopoulou, M. Miwa, and S. Ananiadou (2019) Connecting the dots: document-level neural relation extraction with edge-oriented graphs. arXiv preprint arXiv:1909.00228. Cited by: §1, 3rd item, §3.4, Table 2, §4.
  • [5] J. Devlin, M. Chang, K. Lee, and K. Toutanova (2018) Bert: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805. Cited by: §3.2, §4.
  • [6] P. Goyal, P. Dollár, R. Girshick, P. Noordhuis, L. Wesolowski, A. Kyrola, A. Tulloch, Y. Jia, and K. He (2017) Accurate, large minibatch SGD: training ImageNet in 1 hour. arXiv preprint arXiv:1706.02677. Cited by: §3.2.
  • [7] Z. Guo, Y. Zhang, and W. Lu (2019) Attention guided graph convolutional networks for relation extraction. arXiv preprint arXiv:1906.07510. Cited by: §1.
  • [8] K. He, X. Chen, S. Xie, Y. Li, P. Dollár, and R. Girshick (2021) Masked autoencoders are scalable vision learners. arXiv preprint arXiv:2111.06377. Cited by: §2.2, §3.6, §4.
  • [9] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger (2017) Densely connected convolutional networks. In Proc. of CVPR, Cited by: §4.
  • [10] R. Jia, C. Wong, and H. Poon (2019) Document-level n-ary relation extraction with multiscale representation learning. arXiv preprint arXiv:1904.02347. Cited by: §2.1.
  • [11] T. N. Kipf and M. Welling (2016) Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907. Cited by: §1, 1st item, §4.
  • [12] B. Li, W. Ye, Z. Sheng, R. Xie, X. Xi, and S. Zhang (2020) Graph enhanced dual attention network for document-level relation extraction. In Proc. of COLING, Cited by: 1st item, Table 1, §4.
  • [13] J. Li, Y. Sun, R. J. Johnson, D. Sciaky, C. Wei, R. Leaman, A. P. Davis, C. J. Mattingly, T. C. Wiegers, and Z. Lu (2016) BioCreative v cdr task corpus: a resource for chemical disease relation extraction. Database. Cited by: 2nd item.
  • [14] I. Loshchilov and F. Hutter (2017) Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101. Cited by: §3.2.
  • [15] G. Nan, Z. Guo, I. Sekulic, and W. Lu (2020) Reasoning with latent structure refinement for document-level relation extraction. In Proc. of ACL, Cited by: §1, 1st item, §3.3, §3.4, Table 1, Table 2, §4.
  • [16] L. Qiu, Y. Xiao, Y. Qu, H. Zhou, L. Li, W. Zhang, and Y. Yu (2019) Dynamically fused graph network for multi-hop reasoning. In Proc. of ACL, Cited by: §1.
  • [17] C. Quirk and H. Poon (2016) Distant supervision for relation extraction beyond the sentence boundary. arXiv preprint arXiv:1609.04873. Cited by: §4.
  • [18] A. Ramesh, M. Pavlov, G. Goh, S. Gray, C. Voss, A. Radford, M. Chen, and I. Sutskever (2021) Zero-shot text-to-image generation. arXiv preprint arXiv:2102.12092. Cited by: §4.
  • [19] O. Ronneberger, P. Fischer, and T. Brox (2015) U-net: convolutional networks for biomedical image segmentation. In International Conference on Medical image computing and computer-assisted intervention, Cited by: §4.
  • [20] M. Schlichtkrull, T. N. Kipf, P. Bloem, R. Van Den Berg, I. Titov, and M. Welling (2018) Modeling relational data with graph convolutional networks. In European semantic web conference, Cited by: 1st item.
  • [21] H. Tang, Y. Cao, Z. Zhang, J. Cao, F. Fang, S. Wang, and P. Yin (2020) Hin: hierarchical inference network for document-level relation extraction. Advances in Knowledge Discovery and Data Mining. Cited by: §1, 2nd item, Table 1, §4.
  • [22] P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Lio, and Y. Bengio (2017) Graph attention networks. arXiv preprint arXiv:1710.10903. Cited by: 1st item.
  • [23] P. Verga, E. Strubell, and A. McCallum (2018) Simultaneously self-attending to all mentions for full-abstract biological relation extraction. In Proc. of ACL, Cited by: §1, §3.4, Table 2.
  • [24] D. Wang, W. Hu, E. Cao, and W. Sun (2020) Global-to-local neural networks for document-level relation extraction. arXiv preprint arXiv:2009.10359. Cited by: §1, 1st item, §3.4, Table 1, Table 2, §4.
  • [25] H. Wang, C. Focke, R. Sylvester, N. Mishra, and W. Wang (2019) Fine-tune bert for docred with two-step process. arXiv preprint arXiv:1909.11898. Cited by: 2nd item, §3.6, Table 1, §4.
  • [26] T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf, M. Funtowicz, et al. (2019) HuggingFace's Transformers: state-of-the-art natural language processing. arXiv preprint arXiv:1910.03771. Cited by: §3.2.
  • [27] Y. Wu, R. Luo, H. C. Leung, H. Ting, and T. Lam (2019) RENET: a deep learning approach for extracting gene-disease associations from literature. In Proc. of RECOMB, Cited by: 3rd item, §4.
  • [28] B. Xu, Q. Wang, Y. Lyu, Y. Zhu, and Z. Mao (2021) Entity structure within and throughout: modeling mention dependencies for document-level relation extraction. arXiv preprint arXiv:2102.10249. Cited by: §1.
  • [29] W. Xu, K. Chen, and T. Zhao (2021) Document-level relation extraction with reconstruction. In Proc. of AAAI, Cited by: 1st item, 2nd item, §3.3, §3.6, Table 1.
  • [30] Y. Yao, D. Ye, P. Li, X. Han, Y. Lin, Z. Liu, Z. Liu, L. Huang, J. Zhou, and M. Sun (2019) DocRED: a large-scale document-level relation extraction dataset. In Proc. of ACL, Cited by: §1, §1, 1st item, §3.3.
  • [31] D. Ye, Y. Lin, J. Du, Z. Liu, P. Li, M. Sun, and Z. Liu (2020) Coreferential reasoning learning for language representation. arXiv preprint arXiv:2004.06870. Cited by: 2nd item, Table 1.
  • [32] D. Zeng, K. Liu, Y. Chen, and J. Zhao (2015) Distant supervision for relation extraction via piecewise convolutional neural networks. In Proc. of EMNLP, Cited by: §1.
  • [33] S. Zeng, Y. Wu, and B. Chang (2021) Sire: separate intra-and inter-sentential reasoning for document-level relation extraction. arXiv preprint arXiv:2106.01709. Cited by: §3.3, §3.6, Table 1, §4.
  • [34] S. Zeng, R. Xu, B. Chang, and L. Li (2020) Double graph based reasoning for document-level relation extraction. In Proc. of EMNLP, Cited by: §1, §1, 1st item, Table 1, §4.
  • [35] N. Zhang, X. Chen, X. Xie, S. Deng, C. Tan, M. Chen, F. Huang, L. Si, and H. Chen (2021) Document-level relation extraction as semantic segmentation. arXiv preprint arXiv:2106.03618. Cited by: §1, §1, §2.1, §3.3, §3.4, Table 1, Table 2, §4.
  • [36] N. Zhang, Q. Jia, S. Deng, X. Chen, H. Ye, H. Chen, H. Tou, G. Huang, Z. Wang, N. Hua, et al. (2021) AliCG: fine-grained and evolvable conceptual graph construction for semantic search at alibaba. arXiv preprint arXiv:2106.01686. Cited by: §1.
  • [37] S. Zhang, D. Yao, Z. Zhao, T. Chua, and F. Wu (2021) Causerec: counterfactual user sequence synthesis for sequential recommendation. In Proc. of SIGIR, Cited by: §1.
  • [38] Y. Zhang, P. Qi, and C. D. Manning (2018) Graph convolution over pruned dependency trees improves relation extraction. In Proc. of EMNLP, Cited by: §1.
  • [39] Z. Zhang, B. Yu, X. Shu, T. Liu, H. Tang, W. Yubin, and L. Guo (2020) Document-level relation extraction with dual-tier heterogeneous graph. In Proc. of COLING, Cited by: §3.4, Table 2, §4.
  • [40] Z. Zhong and D. Chen (2020) A frustratingly easy approach for entity and relation extraction. arXiv preprint arXiv:2010.12812. Cited by: §2.1.
  • [41] H. Zhou, Y. Xu, W. Yao, Z. Liu, C. Lang, and H. Jiang (2020) Global context-enhanced graph convolutional networks for document-level relation extraction. In Proc. of COLING, Cited by: §4.
  • [42] J. Zhou, C. Wei, H. Wang, W. Shen, C. Xie, A. Yuille, and T. Kong (2021) IBOT: image bert pre-training with online tokenizer. arXiv preprint arXiv:2111.07832. Cited by: §2.2, §4.
  • [43] W. Zhou and M. Chen (2021) An improved baseline for sentence-level relation extraction. arXiv preprint arXiv:2102.01373. Cited by: §2.1.
  • [44] W. Zhou, K. Huang, T. Ma, and J. Huang (2021) Document-level relation extraction with adaptive thresholding and localized context pooling. In Proc. of AAAI, Cited by: §1, §1, §2.1, §2.3, 2nd item, §3.4, Table 1, Table 2, §4.

Appendix A Datasets

Table 6 details the statistics of the three document-level relation extraction datasets, DocRED, CDR, and GDA. These statistics further demonstrate the complexity of entity structure in document-level relation extraction tasks.

Appendix B Hyper-parameters Setting

Table 7 details our hyper-parameter settings. All of our hyper-parameters were tuned on the development set.