Relation Extraction (RE) is the task of predicting relations between named entities in plain text. It is an important task for many downstream applications such as knowledge graph construction and question answering. Most existing approaches [3, 4, 5, 6] focus on sentence-level RE, which discovers relational facts from a single sentence. However, as shown in Figure 1, in real-world scenarios many relational facts span different sentences in a document. The task of identifying such relations is called document-level RE. To accelerate the development of document-level RE, the first and so far only large-scale document-level relation extraction dataset, DocRED (https://github.com/thunlp/DocRED), was published, constructed from Wikipedia.
The current baseline for document-level relation extraction calculates attention scores between entity pairs and sentences to aggregate information across the whole document. Other efforts [9, 10] link the dependency trees of adjacent sentences to capture interactions among inter-sentence entities. However, we observe that the relations between inter-sentence entities can be inferred directly from the intermediate relations between their coreferent mentions. Taking Fig 1 as an example, to infer the “country” relation between “Riddarhuset” and “Sweden”, we first need to discover the relation “in” between “Riddarhuset” and “Stockholm” in sentence (IV) and the relation “capital” between “Sweden” and “Stockholm” in sentence (I), and finally make a decision through these intermediate relations. Unlike prior methods, which explore such relations indirectly through dependency trees or aggregated sentence representations, we construct document-level graphs that connect inter-sentence mentions directly as cross-sentence dependencies and perform relational reasoning on the combination of mention node representations. By transferring useful information step by step through connected mentions, compound relations are spread to entity pair representations incrementally. We argue that the document-level graph enables the model to capture long-distance relational facts transcending adjacency limits and alleviates the noise of unrelated text.
We also notice that the relational facts annotated in a document-level dataset can hardly be complete, due to the large number of potential entity pairs. For example, in DocRED every document has an average of 26.2 entities, requiring on average 660.24 entity pairs to be annotated. To reduce this laborious work, human annotators were provided with recommendations from RE models and from distant supervision based on entity linking. As a result, many NA instances in DocRED do in fact hold relations, but these were not recommended by the RE models or by entity linking. Fig 1 shows an example where (Kungliga Hovkapellet, country, Sweden) is not included in the original annotations. Forcing the prediction scores of such mislabeled NA instances to 0 through the traditional BCE loss may therefore hurt the model's ability to generalize relational representations. Our fully-connected, mention-centered document-level graph exacerbates the situation, as it includes countless direct connections between mislabeled NA entity pairs and their subordinate non-NA mention pairs. To mitigate the problem, we design a new training objective which changes the classification task into a ranking task, allowing the model to strike a balance between fitting the distribution of the original annotations and preserving genuine relational representations.
Our contributions can be summarised as follows:
We propose a novel mention-centered model for document-level RE that uses fully-connected mention pairs to capture cross-sentence dependencies. By exchanging information between mention pairs iteratively, entity pairs become capable of discovering more accurate inter-sentence relations. Our proposed model is independent of syntactic dependency tools and achieves state-of-the-art performance on DocRED. We demonstrate that the connections between mentions are the core component of inter-sentence relation extraction.
We present a detailed analysis of the incomplete annotation problem in DocRED, which interferes with the generalization of NA instances. To relieve the negative impact of the aggressive linking strategy on this problem, an improved version of ranking loss is proposed. A qualitative comparison between the ranking loss and the BCE loss further reveals the significance of the proposed training objective for our model.
The overall architecture of the proposed model is illustrated in Fig 2. It consists of four layers: an Encoding Layer, a Graph Construction Layer, an Updating Layer and a Classification Layer.
2.1 Encoding Layer
Let $D = \{w_1, w_2, \dots, w_n\}$ denote the input document. We use BERT to encode the document, and a linear layer compresses the BERT embeddings into a low-dimensional space for subsequent processing. The entity type embeddings and coreference embeddings used in the DocRED baseline are concatenated afterwards:

$$\mathbf{h}_i = [\mathbf{W}_l \mathbf{b}_i \,;\, \mathbf{E}_t(t_i) \,;\, \mathbf{E}_c(c_i)]$$

where $\mathbf{b}_i$ is the BERT output for word $w_i$, $\mathbf{W}_l$ is the trainable weight of the linear map, $\mathbf{E}_t$ and $\mathbf{E}_c$ are the embedding matrices for entity types and coreference types respectively, and $[\cdot\,;\,\cdot]$ denotes concatenation.
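To make the encoding step concrete, the following pure-Python sketch compresses a toy BERT vector with a linear map and concatenates type embeddings. All names, dimensions and values here are our own toy assumptions, not the paper's:

```python
def encode_token(bert_vec, W, ent_type_emb, coref_emb, ent_type, coref_id):
    """Compress a BERT vector with a linear map, then concatenate the
    entity-type and coreference embeddings (simple lookup tables)."""
    compressed = [sum(W[r][c] * bert_vec[c] for c in range(len(bert_vec)))
                  for r in range(len(W))]
    return compressed + ent_type_emb[ent_type] + coref_emb[coref_id]

# Toy sizes: BERT dim 4 -> compressed dim 2; type/coreference embeddings dim 1.
W = [[0.5, 0.0, 0.5, 0.0],
     [0.0, 0.5, 0.0, 0.5]]
ent_type_emb = {"ORG": [1.0], "NA": [0.0]}
coref_emb = {0: [0.0], 1: [1.0]}
h = encode_token([2.0, 4.0, 0.0, 0.0], W, ent_type_emb, coref_emb, "ORG", 1)
# -> [1.0, 2.0, 1.0, 1.0]
```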
2.2 Graph Construction Layer
2.2.1 Node Construction
We form three types of nodes in the graph: mention nodes, entity nodes, and sentence nodes. A mention node spanning from the $s$-th word to the $t$-th word is represented as the average of the hidden states $\mathbf{h}_s$ to $\mathbf{h}_t$. The representation of an entity node is computed as the average of the representations of the mentions associated with the entity. Finally, a sentence node is represented as the average of the word representations in the sentence. In order to distinguish the different node types in the graph, we concatenate a node type embedding to each node representation to obtain the final node representation.
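A minimal pure-Python sketch of this span-averaging node construction, using toy two-dimensional hidden states (the function names are ours, not the paper's):

```python
def average(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def mention_node(hidden, s, t):
    """Mention spanning words s..t (inclusive): average of their hidden states."""
    return average(hidden[s:t + 1])

def entity_node(mention_reps):
    """Entity: average of its coreferent mention representations."""
    return average(mention_reps)

def sentence_node(hidden, word_ids):
    """Sentence: average of the word representations it contains."""
    return average([hidden[i] for i in word_ids])

# Toy document with 2-dimensional hidden states.
hidden = [[1.0, 0.0], [3.0, 2.0], [0.0, 4.0], [2.0, 2.0]]
m1 = mention_node(hidden, 0, 1)   # words 0-1 -> [2.0, 1.0]
m2 = mention_node(hidden, 2, 3)   # words 2-3 -> [1.0, 3.0]
e = entity_node([m1, m2])         # entity with two mentions -> [1.5, 2.0]
```

In the full model, a node type embedding would then be concatenated to each of these averaged vectors.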
2.2.2 Edge Construction
Accumulating compositional entity relations through representations of relevant mention pairs is the fundamental idea of our model. For this reason, we construct document-level graphs using the following 4 types of edges.
Mention-Mention (MM): To detect implicit relations between mention pairs, we create mention-mention edges weighted according to the mentions' relative distance. Unlike previous methods [13, 14], which only connect mentions within a sentence or coreferent mentions of the same entity, our model creates mention-mention edges with an aggressive strategy by connecting every mention pair in a document, since, as the earlier illustration of (Riddarhuset, country, Sweden) in Fig 1 shows, intermediate mentions can reside in different sentences or entities. In this way, document-level dependencies are established through chains of mentions scattered across different sentences, rather than through sentence connections or the roots of parse trees of neighboring sentences. We generate the edge weight between two mentions starting from the $s_i$-th and $s_j$-th word as:

$$e_{ij} = \sigma(\mathbf{w}_d^{\top} \mathbf{E}_d(|s_i - s_j|) + b_d)$$

where $\mathbf{w}_d$ and $b_d$ are trainable model parameters, $\sigma$ is the sigmoid activation function, $|s_i - s_j|$ is the relative distance of the mentions, and $\mathbf{E}_d$ is the distance embedding matrix. In this way, $e_{ij}$ is assigned a real value between 0 and 1 according to the mentions' relative distance.
Mention-Entity (ME): To enable entities to collect the information gathered by their subordinate mentions, we add an ME edge between every mention and the entity it refers to.
Mention-Sentence (MS): If a mention belongs to a sentence, we intuitively assume that the sentence's representation encodes the mention's relational information, so we add an MS edge between each mention and the sentence containing it.
Entity-Sentence (ES): In our experiments we find that connecting entity nodes with the sentence nodes their mentions reside in improves performance marginally, so we also add these ES edges.
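As a rough sketch of the distance-based mention-mention edge weight above, the following pure-Python fragment buckets the distance between two mention start positions and squashes a per-bucket score through a sigmoid. The bucket boundaries and scores are hand-set toy assumptions (in the model they are trained), chosen so that nearby mentions get weights close to 1 and distant ones close to 0:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Hypothetical distance buckets and per-bucket scores (hand-set, not learned).
DIST_BUCKETS = [1, 2, 4, 8, 16, 32, 64]            # bucket upper bounds
BUCKET_SCORES = [3.0, 2.0, 1.0, 0.0, -1.0, -2.0, -3.0]

def bucket(distance):
    """Index of the first bucket whose upper bound covers the distance."""
    for i, upper in enumerate(DIST_BUCKETS):
        if distance <= upper:
            return i
    return len(DIST_BUCKETS) - 1                   # overflow bucket

def mm_edge_weight(s_i, s_j):
    """Edge weight in (0, 1) for mentions starting at word positions s_i, s_j."""
    d = abs(s_i - s_j)
    return sigmoid(BUCKET_SCORES[bucket(d)])

near = mm_edge_weight(3, 4)    # adjacent mentions -> weight close to 1
far = mm_edge_weight(0, 100)   # distant mentions -> weight close to 0
```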
2.3 Updating Layer
To update the representations of entity pairs, we apply GCN operations on the constructed document-level graph. Vanilla GCN is designed for node classification and weighs the importance of the original node and its neighbouring nodes equally. However, our interest is accumulating supplementary information into mention/entity nodes without losing their local expressive power. Besides, since the edge weights in our model lie between 0 and 1, node representations shrink as the layers deepen, especially when a node connects to faraway mentions. To handle this problem, we integrate residual connections into the original GCN operation:

$$\mathbf{h}_i^{(l+1)} = \rho\Big(\sum_{j \in \mathcal{N}(i)} \frac{e_{ij}}{c_i}\, \mathbf{W}^{(l)} \mathbf{h}_j^{(l)} + \mathbf{b}^{(l)}\Big) + \mathbf{h}_i^{(l)}$$

where $\mathbf{h}_i^{(l)}$ is the embedding of node $i$ at the $l$-th layer, $e_{ij}$ is the edge weight between nodes $i$ and $j$, $\mathbf{b}^{(l)}$ is a bias term, $\mathbf{W}^{(l)}$ is a weight matrix, $\rho$ is an activation function, and $c_i$ is a normalization constant.
As pointed out in prior work, the vanilla GCN operation makes the features of connected nodes similar. On the one hand, we adopt this characteristic to synthesize higher-order information in the document-level graph; on the other hand, by adding residual connections, we maintain the original node representations, which are rich in context information.
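A minimal pure-Python sketch of one GCN layer with a residual connection, on a toy unweighted graph (uniform edge weights and a ReLU activation are our simplifying assumptions):

```python
def gcn_layer_with_residual(node_feats, neighbors, W, b):
    """One GCN layer: aggregate neighbor features through a shared weight
    matrix, normalize by degree, apply ReLU, then add the residual input."""
    def matvec(M, v):
        return [sum(M[r][c] * v[c] for c in range(len(v))) for r in range(len(M))]

    out = []
    for i, h in enumerate(node_feats):
        agg = [0.0] * len(b)
        for j in neighbors.get(i, []):
            agg = [a + x for a, x in zip(agg, matvec(W, node_feats[j]))]
        c = max(len(neighbors.get(i, [])), 1)                  # normalization constant
        agg = [max(0.0, a / c + bb) for a, bb in zip(agg, b)]  # ReLU activation
        out.append([a + hh for a, hh in zip(agg, h)])          # residual connection
    return out

# Toy graph: 3 nodes, identity weight matrix, node 0 connected to nodes 1 and 2.
W = [[1.0, 0.0], [0.0, 1.0]]
b = [0.0, 0.0]
feats = [[1.0, 1.0], [2.0, 0.0], [0.0, 2.0]]
neighbors = {0: [1, 2], 1: [0], 2: [0]}
updated = gcn_layer_with_residual(feats, neighbors, W, b)
# Node 0 aggregates [2,0]+[0,2]=[2,2], normalizes to [1,1], adds residual -> [2,2]
```

Without the final residual addition, repeated application of the layer with edge weights below 1 would shrink node representations, which is exactly the problem the residual connection addresses.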
2.4 Classification Layer
2.5 Training Objective
We divide the triplets produced by our model into three types: NA triplets, positive non-NA triplets (the final outputs) and negative non-NA triplets. In the DocRED task setting, the output scores of non-NA triplets are arranged in descending order; triplets whose scores are greater than a threshold are selected for the testing procedure, while the rest of the candidates are discarded as the negative non-NA class. Evidently, as long as the model ranks annotated triplets above NA and negative non-NA triplets, the results will be credible. We therefore propose an improved version of ranking loss to simulate this procedure.
For a training batch, let $P$ denote the set of positive non-NA triplets and $N$ the set of negative non-NA triplets, and let $s_p$ and $s_n$ be the respective scores for triplets $p \in P$ and $n \in N$ generated by the network with parameter set $\theta$.

The new loss function is as follows:

$$\mathcal{L} = \sum_{p \in P} \max(0,\, m^{+} - s_p) + \sum_{n \in N} \max(0,\, s_n - m^{-})$$

where $m^{+}$ and $m^{-}$ are margins which help measure the errors between predicted scores and labels. Minimizing this loss restricts the scores of negative non-NA triplets to be smaller than those of positive non-NA triplets. The scores of NA triplets are ignored because they will not be tested; in this way, we avoid making hard decisions for mislabeled NA triplets.
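A pure-Python sketch of a hinge-style reading of this objective; the margin values follow Section 3.3, but the exact functional form used in the paper may differ:

```python
def ranking_loss(pos_scores, neg_scores, m_pos=-1.0, m_neg=-2.0):
    """Hinge-style ranking loss: push positive non-NA scores above m_pos and
    negative non-NA scores below m_neg; NA triplets contribute nothing."""
    loss = sum(max(0.0, m_pos - s) for s in pos_scores)   # positives scored too low
    loss += sum(max(0.0, s - m_neg) for s in neg_scores)  # negatives scored too high
    return loss

zero = ranking_loss([0.5, 2.0], [-3.0, -5.0])  # correctly ranked -> 0.0
penalty = ranking_loss([-2.0], [-3.0])         # a positive below the margin -> 1.0
```

Because only margin violations are penalized, a mislabeled NA instance that happens to score high is never explicitly pushed to zero, unlike with BCE.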
In this section, we will introduce the DocRED dataset and model settings of our experiments.
3.1 DocRED Dataset
We use the DocRED dataset to evaluate the proposed method. DocRED contains 3,053/1,000/1,000 documents for training, development and test respectively, with 132,375 entities and 96 relation types. Manual analysis shows that about 40.7% of relational facts can only be extracted from multiple sentences and 61.1% of relational instances require a variety of reasoning.
3.2 Baselines
EoG/GCNN: EoG connects sentence nodes in the document as inter-sentence dependencies and aggregates information through an attention mechanism. GCNN constructs inter-sentence interactions by linking the roots of adjacent sentences' parse trees and coreferent mentions within an entity, and then updates information through a GCN.
HIN: HIN uses a hierarchical inference method to aggregate inference information from the entity, sentence and document levels respectively, through an attention mechanism and a translation constraint.
LSR: LSR treats the document-level graph structure as a latent variable and induces it through structured attention of shortest dependency path.
3.3 Model Settings
We use the “BERT-Base, Uncased” version as the BERT encoder in our experiments. The BERT layer and the GCN layers use separately tuned learning rates. The embedding size of the BERT model is 768, and the number of GCN layers is set to 2. We set the two margins of the ranking loss to -1 and -2 respectively. Other settings follow the DocRED baseline. “+wiki” means we use relation data from Wikidata (https://www.wikidata.org) to facilitate the learning of the ranking loss.
| | Ignore F1 | F1 | Ignore F1 | F1 |
| - ranking loss | 55.32 | 57.00 | 55.20 | 57.50 |
4.1 Model Performance
We use the same evaluation metrics as the DocRED benchmark, and evaluations on the test set are obtained from CodaLab.
Table 1 shows the results of different models under the supervised setting. From the table, we make the following observations: (1) MCN brings a substantial improvement, which indicates that the information flow carried by the proposed document-level graph can strengthen the dependencies between entities. (2) The BERT encoder and the ranking loss contribute 2% and 0.76% improvement respectively. (3) By linking mentions with Wikidata, our model benefits from its accurate relation annotations. Overall, our BERT-MCN model outperforms sequence-based, attention-based and parse tree-based models.
4.2 Model Analysis
In this subsection, we demonstrate the effectiveness of each component using the development set of DocRED.
4.2.1 Aggressive Mention Linking Matters
Compared with GCNN, which limits the exchange of information to neighbouring sentences or inner-entity mentions, our model is able to exchange information between heterogeneous mention pairs regardless of distance. To investigate the usefulness of the aggressive mention linking strategy, we restrict the MM edges of 2.2.2 to only link mentions belonging to neighboring sentences (including the same sentence) or to the same entity. As shown in Table 2, BERT-ADJ, the GCNN-like implementation of BERT-MCN, achieves a 55.30% F1 score on the development set, which indicates that inference over cross-sentence mention pairs is the key to document-level relation extraction. The corresponding BiLSTM-ADJ implementation achieves a 52.00% F1 score on the development set.
Additionally, we remove each type of edge in the constructed graph one by one to examine its effectiveness. As shown in Table 2, removing the MM and ME edges significantly degrades the model's performance, while the MS and ES edges do not significantly affect it. This result provides further evidence that the mention-related edges are the critical points of document-level relation extraction.
4.2.2 Ranking Loss and NA Class
We present the PR curves of the BiLSTM, BiLSTM-MCN (w/o ranking loss), BERT, BERT-MCN (w/o ranking loss) and BERT-MCN models in Fig 3. As shown in the figure, a peculiar sharp decline occurs for the BERT-MCN (w/o ranking loss) model in the low-recall area, which sharply hurts performance but does not happen for the other models. To find out what causes this sharp decline, we analyse the incorrect samples from the top 10% recall area. Surprisingly, we discover that most samples are in fact correct, or at least partially alluded to by the text, but are not included in the annotations. During the annotation process of DocRED, because of the large number of potential entity pairs, triplet candidates were first generated from RE models and from distant supervision based on entity linking, and human annotators were then asked to label these candidates. This process inevitably misses instances that traditional models are not good at, distorting the distribution of DocRED. By replacing the BCE loss with our proposed ranking loss, the PR curve of BERT-MCN becomes much smoother, since the model circumvents predicting the polluted NA instances in the training set. This problem also indicates that the performance of our proposed model may be underestimated. We further relax the evaluation by accepting entity pairs whose highest-scoring non-NA relation is labelled negative but is listed in Wikidata. As Table 1 shows, our model gains an additional 1.2% improvement.
4.2.3 Intra- and Inter-sentence Performance
It is obvious that more supporting evidence indicates that the model must consider more information from other sentences. According to the number of supporting evidence sentences of the gold relations in the dev set, we divide them into two groups (three or fewer supporting sentences, and four or more) and then analyse the recall on these relational facts in Fig 4 (we omit BERT-LSR for lack of source code). Apparently, MCN greatly boosts the accuracy, especially when the supporting evidence is large, which illustrates MCN's ability to synthesize information across multiple sentences.
4.3 Case Study
Figure 5 presents some relational facts predicted by our BERT-MCN model and two baselines.
“Samsung” and “South Korea” have the relation “country”, which can be identified from the word “from” in sentence (I). The GCNN model (as well as the BiLSTM model) fails to recognize this pattern, possibly because “country” is more often associated with the preposition “in”. However, models with a BERT layer successfully predict this relation. Additionally, the BERT model successfully predicts 2 of the 3 “manufacturer” relational facts connected with “Samsung Electronics”. After consulting Wikipedia, we find that the text “produced by Samsung Electronics” appears in the pages of both correct mentions, while the incorrect “Samsung Galaxy S9+” does not appear in the original text. We argue that BERT may encode commonsense knowledge during pre-training, while the BiLSTM encoder relies more on pattern recognition.
The BERT-MCN model successfully predicts every “manufacturer” relational fact connected with “Samsung”, while the other models do not. To identify this relation, the model first needs to identify from sentence (I) that “Samsung Electronics” produces the “Samsung Galaxy S” and that “Samsung Electronics” belongs to “Samsung”, and then identify from sentence (III) that products such as the “Samsung Galaxy S9” belong to the “Galaxy S” series. We argue that our MCN structure collects compositional information from the various intermediate mention pairs, so that it can discover complicated higher-order inter-sentence relationships.
5 Related Work
Prior work proposes to classify relations between entity-level pairs rather than mention-level pairs. Other work constructs a similar document-level graph, but limits mention interactions to within one sentence and builds relation-dependent edge representations from attention scores between entity nodes. Another approach leverages a translation constraint to capture relation representations between entities and aggregates information across the document through attention scores between entities and sentences.
Most recent GCN methods used in RE utilize the dependency tree to construct the graph. One line of work uses a GCN layer to encode the dependency path for relation extraction, achieving state-of-the-art performance on the TACRED dataset; this was extended by introducing an attention mechanism to extract a more precise dependency graph. Another method first performs GCN operations on the dependency tree and then performs a second-phase prediction over a relation-weighted graph to model the interaction between named entities and relations. Other work constructs a document-level graph by linking the adjacent roots of parse trees and coreferent mentions, or dynamically learns a document-level graph through structured attention over shortest-dependency-path nodes and the Matrix-Tree Theorem.
6 Conclusion and Future Work
In this paper, we establish cross-sentence dependencies through document-level fully-connected mention pairs. Moreover, since our model is sensitive to the widespread mislabeled NA instances in the document-level relation extraction dataset, we propose an improved version of ranking loss to generalize relational representations for mislabeled NA instances. Experimental results show that our model achieves comparable results with previous methods which rely on parse trees or attention mechanisms. Our future work aims to design more subtle ways to exchange information between node representations of document-level graphs.
-  Yi, L., Luheng, H., Mari, O., Hannaneh, H.: Multi-task identification of entities, relations, and coreference for scientific knowledge graph construction. In: Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 3219–3232, Brussels, Belgium. Association for Computational Linguistics(2018)
-  Mo, Y., Wenpeng, Y., Kazi, S.H., dos Santos, C., Bing, X., Bowen, Z.: Improved neural relation detection for knowledge base question answering. In: Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 571–581, Vancouver, Canada. Association for Computational Linguistics(2017)
-  Daojian, Z., Kang, L., Siwei, L., Guangyou, Z., Jun, Z.: Relation classification via convolutional deep neural network. In: Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers, pages 2335–2344, Dublin, Ireland. Dublin City University and Association for Computational Linguistics(2014)
-  Peng, Z., Wei, S., Jun, T., Zhenyu, Q., Bingchen, L., Hongwei, H., Bo, X.: Attention-based bidirectional long short-term memory networks for relation classification. In: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 207–212, Berlin, Germany. Association for Computational Linguistics(2016)
-  Yankai, L., Shiqi, S., Zhiyuan, L., Huanbo, L., Maosong, S.: Neural relation extraction with selective attention over instances. In: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2124–2133, Berlin, Germany. Association for Computational Linguistics(2016)
-  Van-Thuy, P., Joan, S., Masashi, S., Yuji, M.: Ranking-based automatic seed selection and noise reduction for weakly supervised relation extraction. In: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 89–95, Melbourne, Australia. Association for Computational Linguistics(2018)
-  Yuan, Y., Deming, Y., Peng, L., Xu, H., Yankai, L., Zhenghao, L., Zhiyuan, L., Lixin, H., Jie, Z., Maosong, S.: DocRED: A large-scale document-level relation extraction dataset. In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 764–777, Florence, Italy. Association for Computational Linguistics(2019)
-  Hengzhu, T., Yanan, C., Zhenyu, Z., Jiangxia, C., Fang, F., Shi, W., Pengfei, Y.: Hin: Hierarchical inference network for document-level relation extraction. In: Pacific-Asia Conference on Knowledge Discovery and Data Mining, pages 197–209. Springer(2020)
-  Nanyun, P., Hoifung, P., Chris, Q., Kristina, T., Wen-tau, Y.: Cross-sentence n-ary relation extraction with graph LSTMs. Transactions of the Association for Computational Linguistics, 5:101–115(2017)
-  Wei, Z., Hongfei, L., Zhiheng, L., Xiaoxia, L., Zhengguang, L., Bo, X., Yijia, Z., Zhihao, Y., Jian, W.: An effective neural model extracting document level chemical-induced disease relations from biomedical literature. Journal of biomedical informatics, 83:1–9(2018)
-  Guoshun, N., Zhijiang, G., Ivan, S., Wei, L.: Reasoning with latent structure refinement for document-level relation extraction. In: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 1546–1557, Online. Association for Computational Linguistics(2020)
-  Jacob, D., Ming-Wei, C., Kenton, L., Kristina, T.: BERT: Pre-training of deep bidirectional transformers for language understanding. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics(2019)
-  Fenia, C., Makoto, M., Sophia, A.: Connecting the dots: Document-level neural relation extraction with edge-oriented graphs. In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 4927–4938, Hong Kong, China. Association for Computational Linguistics(2019)
-  Sunil, K.S., Fenia, C., Makoto, M., Sophia, A.: Inter-sentence relation extraction with document-level graph convolutional neural network. In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4309–4316, Florence, Italy. Association for Computational Linguistics(2019)
-  Cicero, d.S., Bing, X., Bowen, Z.: Classifying relations by ranking with convolutional neural networks. In: Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 626–634, Beijing, China. Association for Computational Linguistics(2015)
-  Daniil, S., Iryna, G.: Context-aware representations for knowledge base relation extraction. In: Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 1784–1789, Copenhagen, Denmark. Association for Computational Linguistics(2017)
-  Hong, W., Christfried, F., Rob, S., Nilesh, M., William, W.: Fine-tune bert for docred with two-step process. arXiv preprint arXiv:1909.11898(2019)
-  Javid, E., Dejing, D.: Chain based RNN for relation classification. In: Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1244–1249, Denver, Colorado. Association for Computational Linguistics(2015)
-  Daojian, Z., Kang, L., Yubo, C., Jun, Z.: Distant supervision for relation extraction via piecewise convolutional neural networks. In: Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1753–1762, Lisbon, Portugal. Association for Computational Linguistics(2015)
-  Jun, F., Minlie, H., Li, Z., Yang, Y., Xiaoyan, Z.: Reinforcement learning for relation classification from noisy data. In: Thirty-Second AAAI Conference on Artificial Intelligence(2018)
-  Xiangrong, Z., Shizhu, H., Kang, L., Jun, Z.: Large scaled relation extraction with reinforcement learning. In Thirty-Second AAAI Conference on Artificial Intelligence(2018)
-  Pengda, Q., Weiran, X., William, Y.W.: DSGAN: Generative adversarial training for distant supervision relation extraction. In: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 496–505, Melbourne, Australia. Association for Computational Linguistics(2018)
-  Xinsong, Z., Pengshuai, L., Weijia, J., Hai, Z.: Multi-labeled relation extraction with attentive capsule network. In: Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 7484–7491(2019)
-  Sebastian, R., Limin, Y., Andrew, M.: Modeling relations and their mentions without labeled text. In: Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 148–163. Springer(2010)
-  Chris, Q., Hoifung, P.: Distant supervision for relation extraction beyond the sentence boundary. In: Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, pages 1171–1182, Valencia, Spain. Association for Computational Linguistics(2017)
-  Pankaj, G., Subburam, R., Hinrich, S., Thomas, R.: Neural relation extraction within and across sentence boundaries. In: Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 6513–6520(2019)
-  Robin, J., Cliff, W., Hoifung, P.: Document-level n-ary relation extraction with multi-scale representation learning. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 3693–3704, Minneapolis, Minnesota. Association for Computational Linguistics(2019)
-  Yuhao, Z., Peng, Q., Christopher, D.M.: Graph convolution over pruned dependency trees improves relation extraction. In: Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2205–2215, Brussels, Belgium. Association for Computational Linguistics(2018)
-  Zhijiang, G., Yan, Z., Wei, L.: Attention guided graph convolutional networks for relation extraction. In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 241–251, Florence, Italy. Association for Computational Linguistics(2019)
-  Tsu-Jui, F., Peng-Hsuan, L., Wei-Yun, M.: GraphRel: Modeling text as relational graphs for joint entity and relation extraction. In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 1409–1418, Florence, Italy. Association for Computational Linguistics(2019)