Relation extraction (RE) aims to identify the semantic relations between entities in raw text, which is of great importance to many real-world applications [16, 37, 36]. Previous research focused on sentence-level RE, which predicts the relationship between entities within a single sentence [32, 38, 1]. However, many relationships are expressed across multiple sentences in real-world text [30, 23]. Therefore, many recent works have made efforts to extend sentence-level RE to document-level RE [30, 34, 44, 28, 35].
Compared with sentence-level RE, where a sentence contains only one entity pair to be classified, document-level RE requires a model to classify the relations of multiple entity pairs simultaneously, and the entities involved in a relation may appear in different sentences. Besides, document-level RE poses another great challenge: relation inference. As shown in Figure 1, it is easy to identify the intra-sentence relations shown in Figure 1b, such as (Altomonte, date of birth, 24 February 1694), (Altomonte, father, Martino Altomonte), and (Altomonte, country of citizenship, Austrian), because the two related entities appear in the same sentence. However, it is more challenging for a model to predict the inter-sentential relation between Martino Altomonte and Austrian, because the document does not explicitly express the relationship between them. This type of inter-sentential relation can only be identified through reasoning techniques. According to the statistics of DocRED, a well-known document-level RE dataset, most relation instances (61.1%) require reasoning to be identified, which indicates that reasoning is essential for document-level RE.
To extract such complex inter-sentence relations, most current approaches construct a document-level graph based on heuristics, structured attention, or dependency structures [34, 15, 4, 24], and then perform inference with a graph convolutional network (GCN) [7, 11] on the document-level graph. It should be noted that methods of this type complete reasoning through information transfer between mentions or entities. Meanwhile, considering that the Transformer architecture can implicitly model long-distance dependencies and can be regarded as a token-level fully connected graph, some studies [21, 44] infer implicitly through the pre-trained model rather than via document-level graphs.
However, these methods ignore the correlations between relationships. As shown in Figure 1, we can easily infer the inter-sentence relation (Martino Altomonte, country of citizenship, Austrian) through the correlations between relationships. Specifically, the model needs to first capture the correlation among (Altomonte, father, Martino Altomonte), (Altomonte, country of citizenship, Austrian), and (Martino Altomonte, country of citizenship, Austrian), and then use reasoning techniques to identify this complex inter-sentential relation, as shown in Figure 1c.
To capture the interdependencies among multiple relationships, DocuNet formulates document-level RE as a semantic segmentation problem and uses a U-shaped segmentation module over an image-style feature map to capture global interdependencies among triples. The DocuNet model achieved the latest state-of-the-art performance, which shows that the correlations between relationships are essential for document-level RE. However, capturing correlations between relations through convolutional neural networks is unintuitive and inefficient due to the intrinsic distinction between entity-pair matrices and images.
In this paper, we follow DocuNet and model document-level RE as a table filling problem. We first construct an entity-pair matrix, where each point represents the relevant features of an entity pair. Then, the document-level RE model labels each point of the entity-pair matrix with the corresponding relation class. Meanwhile, we also treat the entity-pair matrix as an image. To more effectively capture the interdependencies among relations, we propose a novel Document-level Relation Extraction model based on a Masked Image Reconstruction network (DRE-MIR), which formulates the inference problem in document-level RE as a masked image reconstruction problem. As shown in Figure 2, we first randomly mask the entity-pair matrix and then reconstruct it through the inference model. Through this Masked Image Reconstruction (MIR) task, our model learns how to infer masked points with the help of the correlations between relations. Moreover, to reconstruct the masked points in the entity-pair matrix more efficiently and intuitively, we propose an Inference Multi-head Self-Attention (I-MSA) module, which greatly improves the inference ability of the model. As shown in Figure 3, the I-MSA contains four heads, and each head corresponds to an inference mode: $(A,C)+(C,B)\Rightarrow(A,B)$, $(A,C)+(B,C)\Rightarrow(A,B)$, $(C,A)+(C,B)\Rightarrow(A,B)$, and $(C,A)+(B,C)\Rightarrow(A,B)$, where $C$ denotes a bridge entity.
Our contributions can be summarized as follows:
To the best of our knowledge, our method is the first approach that treats the inference problem in document-level RE as an image reconstruction problem.
We introduce the I-MSA to improve the model’s ability to reconstruct the masked entity-pair matrix.
Experimental results on three public document-level RE datasets show that our DRE-MIR model can achieve state-of-the-art performance.
In this section, we introduce our DRE-MIR model in detail. As shown in Figure 2, DRE-MIR mainly consists of three modules: an encoder module, an inference module, and a classifier module. We first describe the encoder module in Section 2.1, then introduce the core module, i.e., the inference module, in Section 2.2, and finally describe our classifier module and loss function in Section 2.3.
2.1 Encoder Module
Prior work verified that marking entities in the input sentence by entity type can effectively improve the performance of sentence-level RE models. However, in document-level RE, each entity has multiple mentions, and it is important to gather all the mention information for each entity. Therefore, we use the entity type and entity id to mark the mentions in the document, which not only incorporates the entity type information earlier but also helps to aggregate the mention information. Specifically, given a document $D$ containing $N$ words, we first mark the mentions in the document by inserting special symbols at the start and end positions of each mention, where the symbols encode the entity type and entity id of the mention. Then we feed the adapted document into the pre-trained language model to obtain the context embedding of each word in the document:

$$H = [h_1, \dots, h_N] = \mathrm{PLM}([w_1, \dots, w_N])$$
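As an illustration of this marking scheme, the sketch below wraps each mention with type- and id-specific symbols. The exact marker format (here `<TYPE-ID>` / `</TYPE-ID>`) and the function name are assumptions for illustration, not the paper's actual tokens:

```python
def insert_markers(tokens, mentions):
    """Wrap each mention with entity-type/entity-id markers.
    tokens: list of words; mentions: list of (start, end, type, eid)
    tuples with `end` exclusive. Marker format is illustrative."""
    events = {}
    for s, e, t, i in mentions:
        events.setdefault(s, []).append(f"<{t}-{i}>")      # opening marker
        events.setdefault(e, []).insert(0, f"</{t}-{i}>")  # closing marker first
    out = []
    for pos in range(len(tokens) + 1):
        out.extend(events.get(pos, []))  # emit markers due at this position
        if pos < len(tokens):
            out.append(tokens[pos])
    return out
```

Mentions of the same entity receive the same id marker, which should help the encoder aggregate mention information across the document.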
Finally, we utilize the average of the embeddings of the start and end symbols to represent the mention.
We use logsumexp pooling, a smooth version of max pooling, to obtain the embedding $\mathbf{e}_i$ of entity $e_i$ from its mention embeddings:

$$\mathbf{e}_i = \log \sum_{j=1}^{N_{e_i}} \exp\left(\mathbf{h}_{m_j}\right)$$

where $N_{e_i}$ is the number of mentions of entity $e_i$ and $\mathbf{h}_{m_j}$ is the embedding of its $j$-th mention.
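Logsumexp pooling can be sketched as follows (a numerically stabilized version; the function name is ours):

```python
import numpy as np

def logsumexp_pool(mention_embs: np.ndarray) -> np.ndarray:
    """Aggregate mention embeddings (n_mentions, d) into one entity
    embedding via logsumexp, a smooth approximation of max pooling."""
    m = mention_embs.max(axis=0)  # subtract the max for numerical stability
    return m + np.log(np.exp(mention_embs - m).sum(axis=0))
```

For an entity with a single mention, the pooled embedding reduces to that mention's embedding; otherwise it is a soft upper envelope of all mention embeddings.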
In addition, we calculate an entity-pair-aware context representation $c_{s,o}$ for each entity pair $(e_s, e_o)$, which represents the contextual information in the document that entity $e_s$ and entity $e_o$ jointly attend to. The $c_{s,o}$ is formulated as:

$$c_{s,o} = H^{\top} \frac{a_s \circ a_o}{\mathbf{1}^{\top}\left(a_s \circ a_o\right)}$$

where $a_s$ ($a_o$) refers to the attention score with which entity $e_s$ ($e_o$) attends to each word in the document, $H$ is the document embedding, and $\circ$ refers to element-wise multiplication.
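A minimal sketch of this entity-pair-aware context, assuming `a_s` and `a_o` are the two entities' attention distributions over document tokens (in the style of ATLOP's localized context pooling; the function name is ours):

```python
import numpy as np

def entity_pair_context(H: np.ndarray, a_s: np.ndarray, a_o: np.ndarray) -> np.ndarray:
    """Entity-pair-aware context c_{s,o}: weight each token embedding by
    the (renormalized) product of the two entities' attention scores.
    H: (n_tokens, d) token embeddings; a_s, a_o: (n_tokens,) attention."""
    q = a_s * a_o                    # element-wise multiplication
    q = q / max(q.sum(), 1e-12)      # renormalize to a distribution
    return H.T @ q                   # (d,) weighted sum of token embeddings
```

Tokens that both entities attend to strongly dominate the pooled context, while tokens attended by only one entity are suppressed.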
Finally, we construct an entity-pair matrix $M \in \mathbb{R}^{N_e \times N_e \times d}$ as follows:

$$M_{s,o} = \mathrm{FFN}\left(\left[\tanh\left(W_1 \mathbf{e}_s\right);\ \tanh\left(W_2 \mathbf{e}_o\right);\ c_{s,o};\ h_{\mathrm{[CLS]}}\right]\right)$$

where $N_e$ represents the number of entities, $\mathrm{FFN}$ refers to a feed-forward neural network, $W_1$ and $W_2$ are learnable weight matrices, $h_{\mathrm{[CLS]}}$ is the [CLS] token embedding, which is used to represent the information of the entire document, and $[\cdot\,;\cdot]$ denotes concatenation.
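The construction can be sketched as below; the exact fusion of the projected entity embeddings, the pair context, and the [CLS] embedding is an assumption of this sketch, not the paper's verified equation:

```python
import numpy as np

def build_pair_matrix(E, C, h_cls, W1, W2, W_ffn):
    """Build an (N_e, N_e, d_out) entity-pair matrix. Each cell fuses the
    projected head/tail entity embeddings, the pair-aware context, and
    the [CLS] embedding through a one-layer feed-forward network.
    E: (N_e, d) entity embeddings; C: (N_e, N_e, d) pair contexts;
    h_cls: (d,); W1, W2: (d, d); W_ffn: (4*d, d_out)."""
    n, d = E.shape
    M = np.empty((n, n, W_ffn.shape[1]))
    for s in range(n):
        for o in range(n):
            feat = np.concatenate([np.tanh(E[s] @ W1),
                                   np.tanh(E[o] @ W2),
                                   C[s, o], h_cls])
            M[s, o] = np.tanh(feat @ W_ffn)  # one FFN layer with tanh
    return M
```

The double loop is written for clarity; a real implementation would vectorize it over all entity pairs.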
2.2 Inference Module
After obtaining the entity-pair matrix, we treat it as an image. We obtain a masked image by randomly masking pixels of the original image, and we reconstruct the masked image through an inference module, as shown in Figure 2. Through this Masked Image Reconstruction (MIR) task, our inference module learns how to infer the masked pixels from the unmasked pixels via the correlations between relationships.
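The masking step can be sketched as follows; the zero-vector stand-in for the (presumably learnable) mask token and the function name are illustrative assumptions:

```python
import numpy as np

def mask_pair_matrix(M, mask_rate=0.2, mask_token=None, seed=0):
    """Randomly mask a fraction of cells in the (N_e, N_e, d) entity-pair
    matrix, replacing each masked cell with a shared mask vector
    (a zero vector here stands in for a learnable token)."""
    rng = np.random.default_rng(seed)
    n = M.shape[0]
    if mask_token is None:
        mask_token = np.zeros(M.shape[-1])
    mask = rng.random((n, n)) < mask_rate  # boolean mask over cells
    M_masked = M.copy()
    M_masked[mask] = mask_token
    return M_masked, mask
```

The boolean `mask` is returned so the training loop can tell masked cells apart from unmasked ones if needed.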
Our inference module is a variant of the Transformer encoder that replaces Multi-head Self-Attention (MSA) with Inference Multi-head Self-Attention (I-MSA), as shown in Figure 3. The I-MSA contains four heads, and each head corresponds to an inference mode: $(A,C)+(C,B)\Rightarrow(A,B)$, $(A,C)+(B,C)\Rightarrow(A,B)$, $(C,A)+(C,B)\Rightarrow(A,B)$, and $(C,A)+(B,C)\Rightarrow(A,B)$, where $C$ ranges over bridge entities. For example, head 1 in Figure 3 corresponds to the $(A,C)+(C,B)\Rightarrow(A,B)$ inference mode: we first concatenate the corresponding pixels in the $A$-th row and $B$-th column of the image, $X_{A,B} = \{[M_{A,C}; M_{C,B}]\}_{C=1}^{N_e}$, and perform dimensionality reduction through a linear layer, $\tilde{X}_{A,B} = X_{A,B} W_d$. Then, $M_{A,B}$ performs an attention operation on $\tilde{X}_{A,B}$. The whole process can be formulated as follows:

$$\mathrm{head}_1\left(M_{A,B}\right) = \mathrm{softmax}\!\left(\frac{\left(M_{A,B} W^{Q}\right)\left(\tilde{X}_{A,B} W^{K}\right)^{\top}}{\sqrt{d_k}}\right) \tilde{X}_{A,B} W^{V}$$

where $[\cdot\,;\cdot]$ represents the concatenation operation, $\{\cdot\}$ refers to a set, and $W_d$, $W^{Q}$, $W^{K}$, $W^{V}$ are learnable weight matrices.
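One I-MSA head can be sketched as below: for a query cell $(A, B)$, candidate inference paths are formed by pairing row-$A$ pixels with column-$B$ pixels over every bridge entity $C$, and the cell attends over the reduced paths (all function names and shapes are illustrative):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def imsa_head(M, A, B, W_red, Wq, Wk, Wv):
    """One I-MSA head (sketch). M: (n, n, d) entity-pair matrix.
    Builds paths [M[A,C]; M[C,B]] for every bridge C, reduces each
    concatenated pair with W_red, then lets M[A,B] attend over them."""
    n, _, d = M.shape
    paths = np.stack([np.concatenate([M[A, c], M[c, B]]) for c in range(n)])
    X = paths @ W_red                          # (n, d) reduced inference paths
    q = M[A, B] @ Wq                           # query from the target cell
    K, V = X @ Wk, X @ Wv
    attn = softmax(K @ q / np.sqrt(q.shape[0]))  # scaled dot-product attention
    return attn @ V
```

The other three heads would differ only in whether row/column pixels are taken as $(A,C)$ or $(C,A)$ and $(C,B)$ or $(B,C)$.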
Inspired by recent masked image modeling methods, we reconstruct the distribution of the pixels of the masked image over the labels instead of reconstructing the raw pixels. The reason is that labels are more information-dense than pixels and closer to our target task, relation classification. In addition, we reconstruct every pixel of the masked image, both masked and unmasked. In this way, the convergence of the model can be accelerated and better performance can be obtained. Specifically, the original image and the masked image are first sequentially input to the inference module and the classifier module to obtain the probability distributions $P_{s,o}$ and $P^{m}_{s,o}$. Then, we reconstruct the masked image by minimizing the bidirectional KL-divergence between the two distributions of corresponding pixels in the original image and the masked image. Finally, our reconstruction loss function is formulated as follows:

$$\mathcal{L}_{\mathrm{MIR}} = \frac{1}{2} \sum_{s \neq o} \left( D_{\mathrm{KL}}\!\left(P_{s,o} \,\|\, P^{m}_{s,o}\right) + D_{\mathrm{KL}}\!\left(P^{m}_{s,o} \,\|\, P_{s,o}\right) \right)$$
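The bidirectional KL term for a single pixel can be sketched as:

```python
import numpy as np

def bi_kl(p: np.ndarray, q: np.ndarray, eps: float = 1e-12) -> float:
    """Symmetric (bidirectional) KL divergence between two categorical
    distributions, averaged over the two directions."""
    p, q = p + eps, q + eps  # avoid log(0)
    kl_pq = float((p * np.log(p / q)).sum())
    kl_qp = float((q * np.log(q / p)).sum())
    return 0.5 * (kl_pq + kl_qp)
```

Summing this term over all entity-pair cells gives the reconstruction loss; it vanishes exactly when the masked-path and original-path predictions agree.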
2.3 Classifier Module
Our classifier module is a single linear layer. The original image and the masked image are respectively input to the inference module to obtain the corresponding corrected images, $M'$ and $M'^{m}$. Then, the relation probability of each entity pair is obtained through the linear layer:

$$P_{s,o} = W_c M'_{s,o} + b_c$$

where $W_c$ and $b_c$ are model parameters.
To alleviate the problem of unbalanced relation distribution, we use the adaptive-thresholding loss as our classification loss function $\mathcal{L}_{cls}$, which learns an adaptive threshold for each sample. Specifically, a threshold class $\mathrm{TH}$ is introduced to separate positive classes from negative classes: positive classes should have higher probabilities than $\mathrm{TH}$, and negative classes should have lower probabilities than $\mathrm{TH}$. The adaptive-thresholding loss is formulated as follows:

$$\mathcal{L}_{cls} = -\sum_{r \in \mathcal{P}_T} \log \frac{\exp\left(\mathrm{logit}_r\right)}{\sum_{r' \in \mathcal{P}_T \cup \{\mathrm{TH}\}} \exp\left(\mathrm{logit}_{r'}\right)} - \log \frac{\exp\left(\mathrm{logit}_{\mathrm{TH}}\right)}{\sum_{r' \in \mathcal{N}_T \cup \{\mathrm{TH}\}} \exp\left(\mathrm{logit}_{r'}\right)}$$

where $\mathcal{P}_T$ and $\mathcal{N}_T$ are the positive and negative class sets, respectively.
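A sketch of the adaptive-thresholding loss for a single entity pair, with class index 0 playing the role of the threshold class TH (following the ATLOP formulation; indexing details are assumptions of this sketch):

```python
import numpy as np

def adaptive_threshold_loss(logits: np.ndarray, positive: set) -> float:
    """Adaptive-thresholding loss for one entity pair. logits[0] is the
    threshold class TH; `positive` holds the indices of gold relations.
    Positives are pushed above TH; TH is pushed above all negatives."""
    pos = [i for i in range(1, len(logits)) if i in positive]
    neg = [i for i in range(1, len(logits)) if i not in positive]

    def log_softmax_pick(idx, pick):
        z = logits[idx]
        z = z - z.max()  # stabilized log-softmax over the chosen subset
        return float(z[idx.index(pick)] - np.log(np.exp(z).sum()))

    l1 = -sum(log_softmax_pick(pos + [0], p) for p in pos)  # positives vs TH
    l2 = -log_softmax_pick(neg + [0], 0)                    # TH vs negatives
    return l1 + l2
```

The loss decreases as gold-relation logits rise above the threshold logit and negative logits fall below it.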
The training objective is to minimize the loss function $\mathcal{L}$, which is defined as follows:

$$\mathcal{L} = \lambda_1 \mathcal{L}_{cls} + \lambda_2 \mathcal{L}_{\mathrm{MIR}}$$

where $\lambda_1$ and $\lambda_2$ are hyperparameters, and we simply set them to 1.
We conduct experiments on three document-level RE datasets to evaluate our DRE-MIR model. The statistics of the datasets could be found in Appendix A.
DocRED: DocRED is a large-scale human-annotated dataset for document-level RE, which is constructed from Wikipedia and Wikidata. DocRED contains 96 types of relations, 132,275 entities, and 56,354 relational triples in total. In DocRED, more than 40.7% of relational facts can only be extracted from multiple sentences, and 61.1% of relational triples require various reasoning skills. We follow the standard split of the dataset: 3,053 documents for training, 1,000 for development, and 1,000 for testing.
CDR: The Chemical-Disease Relations dataset (CDR) consists of 1,500 PubMed abstracts, which are equally divided into three sets for training, development, and testing. CDR aims to predict the binary interactions between Chemical and Disease concepts.
GDA: The Gene-Disease Associations dataset (GDA) is a large-scale biomedical dataset constructed from MEDLINE abstracts via distant supervision. GDA contains 29,192 documents as the training set and 1,000 as the test set. GDA is also a binary relation classification task that identifies interactions between Gene and Disease concepts. We follow previous work and divide the training set into two parts: 23,353 documents for training and 5,839 for validation.
3.2 Experimental Settings
Our model was implemented based on PyTorch and Huggingface's Transformers. We used cased BERT-base as the encoder on DocRED and SciBERT-base on CDR and GDA. Our model is optimized with AdamW with a linear warmup for the first 6% of steps, followed by a linear decay to 0. By default, we randomly mask 20% of the points in the entity-pair matrix and set the number of layers in the inference module to 3. All hyper-parameters are tuned on the development set, some of which are listed in Appendix B.
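The warmup-then-linear-decay schedule can be sketched as follows (the function name and exact rounding are ours):

```python
def lr_at_step(step, total_steps, base_lr, warmup_frac=0.06):
    """Linear warmup for the first `warmup_frac` of steps, then linear
    decay of the learning rate to 0, as used with AdamW."""
    warmup = max(1, int(total_steps * warmup_frac))
    if step < warmup:
        return base_lr * (step + 1) / warmup            # ramp up
    return base_lr * max(0.0, (total_steps - step) / (total_steps - warmup))
```

This matches the shape provided by schedulers such as Huggingface's `get_linear_schedule_with_warmup`.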
3.3 Results on the DocRED Dataset
On the DocRED dataset, we choose the following two types of models as baselines:
In addition, we also include DocuNet and SIRE in the comparison. DocuNet formulates document-level RE as a semantic segmentation problem and achieved the previous state-of-the-art results, while SIRE represents intra- and inter-sentential relations in different ways and designs a new form of logical reasoning.
We follow previous work and use F1 and Ign F1 as evaluation metrics, where Ign F1 denotes the F1 score excluding the relational facts that are shared by the training and dev/test sets. Compared with all baseline models, our DRE-MIR model outperforms the latest state-of-the-art models by 1.14/1.11 F1/Ign F1 on the dev set and 1.29/1.1 F1/Ign F1 on the test set. This demonstrates that our model has excellent overall performance. Besides, compared with the graph-based state-of-the-art model, the DRE-MIR model outperforms the GAIN model by 1.74/1.83 F1/Ign F1 on the dev set and 2.11/2.03 F1/Ign F1 on the test set. This shows that our model has better reasoning ability than previous graph-based models.
Following [15, 29], we report Intra-F1 / Inter-F1 scores in Table 1, which consider only intra- or inter-sentence relations, respectively. Compared with Intra-F1, Inter-F1 better reflects the reasoning ability of a model. We observe that our DRE-MIR model improves the Inter-F1 score by 3.28 compared with the SIRE model. The improvement on Inter-F1 demonstrates that our MIR task and inference module can greatly improve the inference ability of the model. Moreover, the improvement on Inter-F1 is greater than that on Intra-F1, which shows that the performance improvement of DRE-MIR is mainly contributed by inter-sentence relations.
| Model | Ign F1 | F1 |
| --- | --- | --- |
| Only Mask path | 60.13 | 62.20 |
| Only reconstruct masked points | 59.50 | 61.53 |
| w/o Inference Module | 58.56 | 60.46 |
3.4 Results on the Biomedical Datasets
On the two biomedical datasets, CDR and GDA, we compared our model with a large number of baseline models and recent state-of-the-art models including BRAN , EoG , LSR , DHG , GLRE , SciBERT , ATLOP , and DocuNet .
Experimental results on the two biomedical datasets are shown in Table 2. Our DRE-MIR model achieves 76.9 F1 on the CDR dataset, which slightly outperforms the DocuNet model by 0.3 F1. There are three possible reasons: (1) CDR contains only two types of relations, so the correlation between relations is weak. (2) The CDR dataset contains very few annotated samples, making it difficult for our model to learn the underlying correlations. (3) The samples in the CDR dataset contain few entities, which leads to a small entity-pair matrix and weakens the effectiveness of our MIR task. Although the GDA dataset also has problems (1) and (3), it is a large-scale corpus and contains a large number of samples. Therefore, our model achieves an 86.4 F1 score on the GDA dataset, an improvement of 1.1 F1 over the DocuNet model. Since the MIR task resembles a pre-training task from computer vision, more data is required to obtain better performance.
3.5 Ablation Study
We conducted an ablation experiment to validate the effectiveness of different components of our DRE-MIR model on the development set of the DocRED dataset. The results are listed in Table 3.
The w/o MIR setting removes the MIR task from our model and only keeps the original path in the DRE-MIR model. The w/o MIR setting achieves an F1 score of 61.35, which outperforms w/o Inference Model, our base model, by 0.99 F1. This shows that our inference module has a certain inference ability even without the MIR task. However, the DRE-MIR model without the MIR task (w/o MIR) suffers a performance drop of 1.61 F1, which proves that the MIR task can substantially improve the inference ability of our inference module.
The Only Mask path setting removes the original path from our model and only keeps the Mask path. It is a variant of the DRE-MIR model whose image reconstruction method is similar to standard masked image modeling. The Only Mask path setting leads to a drop of 0.76 F1 in performance, which proves that the original image plays a guiding role in the reconstruction of the masked image and further improves the performance of the model.
As can be seen from w/o I-MSA, replacing the I-MSA with the standard MSA results in a huge performance drop of 6.74 F1. This shows that our I-MSA can greatly improve the inference ability of the Transformer. We also introduce an experiment where only the masked pixels are reconstructed, i.e., Only reconstruct masked points, and observe a performance drop of 1.43 F1. The possible reason is that the masked pixels may affect the unmasked pixels through our inference module, and reconstructing all pixels effectively alleviates this negative impact.
Overall, our model improves over our base model by 2.5 F1, which fully demonstrates that our inference module and MIR task can effectively improve the inference ability of the model.
3.6 Analysis & Discussion
In this section, we further discuss and analyze our DRE-MIR model from four aspects: (1) the number of layers in the inference module, (2) the masking rate during training, (3) the inference performance, and (4) the ability to restore the masked entity-pair matrix.
Table 4 shows the performance of the DRE-MIR model with different numbers of layers in the inference module. We observe that increasing the number of layers from 1 to 2 improves the model performance by 1.07 F1. The possible reason is that increasing the number of layers improves the multi-hop reasoning ability of the model. However, the performance improves only slightly, by 0.21 F1, when the number of layers is increased from 2 to 3. Therefore, a two-layer inference module is sufficient for general cases.
Figure 4 shows that our model obtains the best performance when trained with a masking rate of 20%. However, our model still achieves a decent performance of 61.44/59.43 F1/Ign F1 when the masking rate is set to 50%, which shows that our inference module has a strong ability to restore the masked entity-pair matrix. This also implies that using a larger masking rate to increase the training difficulty could achieve better performance on large-scale corpora, which is similar to the conclusions drawn from pre-training tasks in machine vision, such as MAE.
To evaluate the inference ability of the models, we follow [29, 33] and report Infer-F1 scores in Table 5, which only consider relations engaged in the relational reasoning process. We observe that our DRE-MIR model improves Infer-F1 by 2.71 compared with the GAIN model. Removing the inference module from our model results in a performance drop of 4.10 Infer-F1, which demonstrates that our inference module and the MIR task improve the inference ability of the model.
To evaluate the model's ability to restore the masked entity-pair matrix, we also randomly mask the entity-pair matrix during validation. We show the experimental results in Figure 5. Since we train our model with a masking rate of 20%, the performance drop is very slight when the masking rate is less than 20%. Our model has only a slight performance drop of 1.84/1.88 F1/Ign F1 at a 50% masking rate, which shows that our model has excellent robustness. Even if the masking rate is increased to 80%, our model still achieves 55.00/52.69 F1/Ign F1, which is better than the BERT baseline. This shows that our model has a strong restoring ability.
4 Related Work
Since many relational facts in real applications can only be recognized across sentences, much recent work has gradually shifted attention to document-level RE. Because graph neural networks (GNNs) can effectively model long-distance dependencies and perform logical reasoning, many methods based on document graphs are widely used for document-level RE. Specifically, they first construct a graph structure from the document and then apply a GCN [11, 9] to the graph to complete logical reasoning. The graph-based approach was first introduced for cross-sentence RE and has recently been extended by many works [4, 12, 39, 41, 24, 15, 34, 27]. The Graph Enhanced Dual Attention network (GEDA) characterizes the complex interaction between sentences and potential relation instances. The Graph Aggregation-and-Inference Network (GAIN) first constructs a heterogeneous mention-level graph (hMG) to model complex interactions among different mentions across the document, then constructs an entity-level graph (EG), and finally uses a path reasoning mechanism to infer relations between entities on the EG. The LSR model constructs a latent document-level graph and completes logical reasoning on the graph.
In addition, since pre-trained language models based on the Transformer architecture can implicitly model long-distance dependencies and perform logical reasoning, some studies [21, 44, 25] directly apply pre-trained models without introducing document graphs. The ATLOP model consists of two parts, adaptive thresholding and localized context pooling, to solve the multi-label and multi-entity problems. SIRE represents intra- and inter-sentential relations in different ways and designs a new and straightforward form of logical reasoning. Recently, the state-of-the-art model, DocuNet, formulates document-level RE as a semantic segmentation task and captures global information among relational triples through a U-shaped segmentation module.
Furthermore, our work is inspired by recent pre-training research in the field of machine vision, such as BEiT, iBOT, and MAE. BEiT followed BERT and proposed a masked image modeling (MIM) task and a tokenizer to pre-train vision Transformers. The tokenizer "tokenizes" the image into discrete visual tokens, which are obtained from the latent codes of a discrete VAE. iBOT performs the MIM task with an online tokenizer and formulates MIM as a knowledge distillation (KD) problem. MAE develops an asymmetric encoder-decoder architecture, with an encoder that operates only on the visible subset of patches (without mask tokens), along with a lightweight decoder that reconstructs the original image from the latent representation and mask tokens.
5 Conclusion and Future Work
In this work, we first formulate the inference problem in document-level RE as a Masked Image Reconstruction (MIR) problem. Then, we propose an Inference Multi-head Self-Attention (I-MSA) module to restore masked images more efficiently. The MIR task and the I-MSA module greatly improve the inference ability of our model. Experiments on three public document-level RE datasets demonstrate that our DRE-MIR model achieves better results than the existing state-of-the-art models. In the future, we will try to apply our model to other inter-sentence or document-level tasks, such as cross-sentence collective event detection.
-  (2019) Matching the blanks: distributional similarity for relation learning. In Proc. of ACL, Cited by: §1.
-  (2021) BEiT: bert pre-training of image transformers. arXiv preprint arXiv:2106.08254. Cited by: §2.2, §3.5, §4.
-  (2019) Scibert: a pretrained language model for scientific text. arXiv preprint arXiv:1903.10676. Cited by: §3.2, §3.4, Table 2.
-  (2019) Connecting the dots: document-level neural relation extraction with edge-oriented graphs. arXiv preprint arXiv:1909.00228. Cited by: §1, 3rd item, §3.4, Table 2, §4.
-  (2018) Bert: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805. Cited by: §3.2, §4.
-  (2017) Accurate, large minibatch sgd: training imagenet in 1 hour. arXiv preprint arXiv:1706.02677. Cited by: §3.2.
-  (2019) Attention guided graph convolutional networks for relation extraction. arXiv preprint arXiv:1906.07510. Cited by: §1.
-  (2021) Masked autoencoders are scalable vision learners. arXiv preprint arXiv:2111.06377. Cited by: §2.2, §3.6, §4.
-  (2017) Densely connected convolutional networks. In Proc. of CVPR, Cited by: §4.
-  (2019) Document-level n-ary relation extraction with multiscale representation learning. arXiv preprint arXiv:1904.02347. Cited by: §2.1.
-  (2016) Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907. Cited by: §1, 1st item, §4.
-  (2020) Graph enhanced dual attention network for document-level relation extraction. In Proc. of COLING, Cited by: 1st item, Table 1, §4.
-  (2016) BioCreative v cdr task corpus: a resource for chemical disease relation extraction. Database. Cited by: 2nd item.
-  (2017) Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101. Cited by: §3.2.
-  (2020) Reasoning with latent structure refinement for document-level relation extraction. In Proc. of ACL, Cited by: §1, 1st item, §3.3, §3.4, Table 1, Table 2, §4.
-  (2019) Dynamically fused graph network for multi-hop reasoning. In Proc. of ACL, Cited by: §1.
-  (2016) Distant supervision for relation extraction beyond the sentence boundary. arXiv preprint arXiv:1609.04873. Cited by: §4.
-  (2021) Zero-shot text-to-image generation. arXiv preprint arXiv:2102.12092. Cited by: §4.
-  (2015) U-net: convolutional networks for biomedical image segmentation. In International Conference on Medical image computing and computer-assisted intervention, Cited by: §4.
-  (2018) Modeling relational data with graph convolutional networks. In European semantic web conference, Cited by: 1st item.
-  (2020) Hin: hierarchical inference network for document-level relation extraction. Advances in Knowledge Discovery and Data Mining. Cited by: §1, 2nd item, Table 1, §4.
-  (2017) Graph attention networks. arXiv preprint arXiv:1710.10903. Cited by: 1st item.
-  (2018) Simultaneously self-attending to all mentions for full-abstract biological relation extraction. In Proc. of ACL, Cited by: §1, §3.4, Table 2.
-  (2020) Global-to-local neural networks for document-level relation extraction. arXiv preprint arXiv:2009.10359. Cited by: §1, 1st item, §3.4, Table 1, Table 2, §4.
-  (2019) Fine-tune bert for docred with two-step process. arXiv preprint arXiv:1909.11898. Cited by: 2nd item, §3.6, Table 1, §4.
Huggingface’s transformers: state-of-the-art natural language processing. arXiv preprint arXiv:1910.03771. Cited by: §3.2.
-  (2019) Renet: a deep learning approach for extracting gene-disease associations from literature. In Proc. of RECOMB, Cited by: 3rd item, §4.
-  (2021) Entity structure within and throughout: modeling mention dependencies for document-level relation extraction. arXiv preprint arXiv:2102.10249. Cited by: §1.
-  (2021) Document-level relation extraction with reconstruction. In Proc. of AAAI, Cited by: 1st item, 2nd item, §3.3, §3.6, Table 1.
-  (2019) DocRED: a large-scale document-level relation extraction dataset. In Proc. of ACL, Cited by: §1, §1, 1st item, §3.3.
-  (2020) Coreferential reasoning learning for language representation. arXiv preprint arXiv:2004.06870. Cited by: 2nd item, Table 1.
-  (2015) Distant supervision for relation extraction via piecewise convolutional neural networks. In Proc. of EMNLP, Cited by: §1.
-  (2021) Sire: separate intra-and inter-sentential reasoning for document-level relation extraction. arXiv preprint arXiv:2106.01709. Cited by: §3.3, §3.6, Table 1, §4.
-  (2020) Double graph based reasoning for document-level relation extraction. In Proc. of EMNLP, Cited by: §1, §1, 1st item, Table 1, §4.
-  (2021) Document-level relation extraction as semantic segmentation. arXiv preprint arXiv:2106.03618. Cited by: §1, §1, §2.1, §3.3, §3.4, Table 1, Table 2, §4.
-  (2021) AliCG: fine-grained and evolvable conceptual graph construction for semantic search at alibaba. arXiv preprint arXiv:2106.01686. Cited by: §1.
-  (2021) Causerec: counterfactual user sequence synthesis for sequential recommendation. In Proc. of SIGIR, Cited by: §1.
-  (2018) Graph convolution over pruned dependency trees improves relation extraction. In Proc. of EMNLP, Cited by: §1.
-  (2020) Document-level relation extraction with dual-tier heterogeneous graph. In Proc. of COLING, Cited by: §3.4, Table 2, §4.
-  (2020) A frustratingly easy approach for entity and relation extraction. arXiv preprint arXiv:2010.12812. Cited by: §2.1.
-  (2020) Global context-enhanced graph convolutional networks for document-level relation extraction. In Proc. of COLING, Cited by: §4.
-  (2021) IBOT: image bert pre-training with online tokenizer. arXiv preprint arXiv:2111.07832. Cited by: §2.2, §4.
-  (2021) An improved baseline for sentence-level relation extraction. arXiv preprint arXiv:2102.01373. Cited by: §2.1.
-  (2021) Document-level relation extraction with adaptive thresholding and localized context pooling. In Proc. of AAAI, Cited by: §1, §1, §2.1, §2.3, 2nd item, §3.4, Table 1, Table 2, §4.
Appendix A Datasets
Table 5 details the statistics of the three document-level relation extraction datasets: DocRED, CDR, and GDA. These statistics further demonstrate the complexity of entity structure in document-level relation extraction tasks.
Appendix B Hyper-parameters Setting
Table 6 details our hyper-parameters setting. All of our hyperparameters were tuned on the development set.