1 Background and Significance
Disease knowledge graphs are a way to connect, organize, and access disparate data and information resources about diseases with numerous benefits for artificial intelligence (AI). Systematized knowledge can be injected into AI tools to imitate human experts’ reasoning so that AI-driven hypotheses can be easily found, accessed, and validated. For example, disease knowledge graphs (KGs) power AI applications, such as intelligent diagnosis and electronic health record (EHR) retrieval. However, creating high-quality knowledge graphs requires extracting relationships between diseases from disparate information sources, such as free text in the EHRs and semantic knowledge representation from literature.
Traditionally, KGs were constructed via manual efforts, requiring humans to input every fact [bollacker2008freebase]. In contrast, rule-based [rindflesch2011semantic] and semi-automated [carlson2010toward, dong2014knowledge] construction, while scalable, can suffer from poor accuracy and recall rate. Recent methods create KGs by extracting relationships from literature and building considerably more accurate KGs that comprehensively cover a domain of interest. These methods leverage pre-trained language models
Nevertheless, language-based and graph-based methods both have advantages. For example, language-based methods can reason over large datasets created using distant supervision, and graph-based methods can operate on noisy and incomplete knowledge graphs, providing robust predictions. An emerging strategy to advance relation extraction thus leverages multiple data types simultaneously [liu2020k, sun2019ernie, zhang2019ernie, he2019integrating, koncel2019text, sun2020colake, hu2019improving, xu2019connecting, zhang2019long, wang2020model, wang2014knowledge, han2016joint, ji2020joint, dai2019distantly, stoica2020improving] with multimodal learning, outperforming rule-based [rindflesch2011semantic] and semi-automated [carlson2010toward, dong2014knowledge] methods. However, existing approaches are limited in two ways, which we outline next.
First, KGs provide only positive samples (i.e., examples of true disease-disease relationships) while existing methods also require negative samples that, ideally, are disease pairs that resemble positive samples but are not valid relationships. Methods for positive-unlabeled [he2020improving] and contrastive [le2020contrastive, su2021improving] learning can be trained in such scenarios using random disease pairs from the dataset as negative proxy samples and ensuring a low false-positive rate. However, these methods may not generalize well in real-world applications because random negative samples are not necessarily realistic and fail to represent the boundary cases. To improve the quality of negative sampling in disease relation extraction, we introduce an EHR-based negative sampling strategy. With the strategy, our approach generates negative samples using disease pairs that rarely appear together in EHRs, thus having realistic negative samples to enable broad generalization of the approach.
Second, it is not uncommon for certain diseases to appear either in graph or language datasets but not in both data types. For example, in Lin et al. [lin2020highthroughput], over 60% disease pairs in the knowledge graph had no corresponding text information; there were also cases with text but no graph information. It is thus essential that multimodal approaches can make predictions when only one data type is available. Unfortunately, existing multimodal approaches with such capability [wang2020multimodal, suo2019metric, zhou2019latent, yang2018semi] do not focus on language and graphs. Further, missing data imputation techniques use adversarial learning to impute missing values , but the imputed values can introduce unwanted data bias. To address this issue, we propose a multimodal approach with a decoupled model structure where language-based and graph-based modules interact only through shared parameters and a cross-modal loss function. This method ensures our model can take advantage of both language and graph inputs and extract disease relations based on either single or multimodal inputs.
Present work. We introduce REMAP (Relation Extraction with Multimodal Alignment Penalty)222Python implementation of REMAP is available on Github at https://github.com/Lukeming-tsinghua/REMOD. Our dataset of domain-expert annotations is at https://doi.org/10.6084/m9.figshare.17776865. , a multimodal approach for extracting and classifying disease-disease relations (Figure 1). REMAP is a flexible multimodal algorithm that jointly learns over text and graphs with a unique capability to make predictions even when a disease concept exists in only one data type. To this end, REMAP specifies graph-based and text-based deep transformation functions that embed each data type separately and optimize unimodal embedding spaces such that they capture the topology of a disease KG or the text semantics of disease concepts. Finally, to achieve data fusion, REMAP aligns unimodal embedding spaces through a novel alignment penalty loss using shared disease concepts as anchors. This way, REMAP can effectively model data type-specific distribution and diverse representations while also aligning embeddings of distinct data types. Further, REMAP can be jointly trained on both graph and text data types but evaluated and implemented on either of the two modalities alone. In summary, the main contributions of this study are:
-
[leftmargin=*]
-
We develop REMAP, a flexible multimodal approach for extracting and classifying disease-disease relations. REMAP fuses knowledge graph embeddings with deep language models and can flexibly accommodate missing data types, which is a necessary capability to facilitate REMAP’s validation and transition into biomedical implementation.
-
We rigorously evaluate REMAP for extraction and classification of disease-disease relations. To this end, we create a training dataset using distant supervision and a high-quality test dataset of gold-standard annotations provided by three domain experts, all medical doctors. Our evaluations show that REMAP achieves 88.6% micro-accuracy and 81.8% micro-F1 score on the human annotated dataset, outperforming text-based methods by 10 and 17.2 percentage points, respectively. Further, REMAP achieves the best performance, 89.8% micro-accuracy and 84.1% micro-F1 score, surpassing graph-based methods by 8.4 and 10.4 percentage points, respectively.
2 Methods
We next detail the REMAP approach and illustrate it in detail for the task of disease relation extraction and classification (Figure 1). We first describe the notation, proceed with an overview of language and knowledge graph models, and outline the multimodal learning strategy to inject knowledge into extraction tasks.
2.1 Preliminaries
Notation. The input to REMAP is a combined dataset of text and graph information. This dataset consists of text information given as bags of sentences and graph information given as triplets encoding the relationship between and as . For example, , and would indicate the process to differentiate between hypogonadism versus Goldberg-Maxwell syndrome with similar symptoms that could possibly account for illness in a patient. We assume that bags of sentences in overlap with the triplets from existing KG such that each sentence of the th sentence bag contain . The remaining triplets in cannot be mapped to sentences in and sentences contain entity pairs that do not belong to existing KG. We represent the -th sentence bag as , where is the number of sentences in bag , is the tokenized sequence of -th sentence in . Here, the tokenized sequence is the combination of sentence containing mentions of subject and object entities, entity markers, title of the document in which the the sentence appears, and the article structure. Marker tokens are added to the head and tail position of each entity mentions to denote entity type information. Last, and are start indices of entity markers for subject and object entities, respectively.
Heterogeneous graph attention network. Heterogeneous graph attention network (HAN) [wang2019heterogeneous] is a graph neural network for embedding KGs by leveraging meta paths. A meta path is a sequence of node types and relation types [sun2011pathsim]. For example, in a disease KG, “Disease” “May Cause” “Disease” “Differential Diagnosis” “Disease” is a meta path. Node is connected to node via a meta path if and are the head and tail nodes, respectively, of this meta path. Each node has an initial node embedding and belongs to a type , e.g., . Graph attention network specified a parameterized deep transformation function that maps nodes to condensed data summaries, i.e., embeddings, in a node-type specific manner as: .
We denote all nodes adjacent to via a meta-path as and node-level attention mechanism provides information on how strongly attends to ’s each adjacent node when generating the embedding for . In particular, the importance of for in meta path is defined as:
(1) |
where is the sigmoid activation, indicates concatenation, and is a trainable vector. To promote stable attention, HAN uses multiple, i.e., , heads and concatenates vectors after node level attention to produce the final node embedding for node :
(2) |
Given user-defined meta paths , HAN uses the above specified node-level attention to produce node embeddings . Finally, HAN uses semantic-level attention to combine meta path-specific node embeddings as:
(3) |
(4) |
where represents the importance of meta path towards final node embeddings , and , , and are trainable parameters. The final outputs are node embeddings , representing compact vector summaries of knowledge associated with each node in the KG.
Translation- and tensor-based embeddings. decode an optimized set of embeddings into the probability estimate of relationship denotes the sigmoid function,
2.2 Language and knowledge graph encoders
Embedding disease-associated sentences. We start by tokenizing entities in sentences and proceed with an overview of the text encoder. Entity tokens identify the position and type of entities in a sentence [MTB, zhong2020frustratingly]. Specifically, tokens ¡S-type¿ and ¡S-type/¿, and ¡O-type¿ and ¡O-type/¿ are used to denote the start and end of subject (S) and object (O) entities, respectively. The entity marker tokens are type-related, meaning that entities of different types (e.g., disease concepts, medications) get different tokens. This procedure produces bags of tokenized sentences that we encode into entity embeddings using a text encoder. We use SciBERT encoder [scibert] with the SciVocab wordpiece vocabulary, which is a BERT language model optimized for scientific text with improved efficiency in biomedical domains than BioBERT or BERT model alone [zhong2020frustratingly]. Tokenized sequences in a sentence bag are fed into the language model to produce a set of sequence outputs :
(5) |
where , is the number of sentences in , . We then aggregate representations of the subject entity across all sentences in bag as: and use self-attention to obtain the final text-based embedding for subject entity as:
(6) |
where is the embedding of and is a trainable vector. Embeddings of object entities (i.e., for object entity ) are generated analogously by the text encoder. Self-attention is needed because some sentences in each bag may not capture any relationship between entities, and attention allows the model to ignore such uninformative sentences when generating embeddings.
Embedding a disease knowledge graph. We use a heterogeneous graph attention encoder [wang2019heterogeneous] to map nodes in the disease KG into embeddings. The encoder generates embeddings for every subject entity and object entity that appear as nodes in the KG as:
(7) |
where transformation is given in Eqs. (1)-(4) and denotes initial embeddings.
Scoring disease-disease relationships. Taking text-based embeddings, , and graph-based embeddings, , for every subject and object occurring in either language or graph dataset, we use a scoring function to estimate disease-disease relationships formulated as candidate triplets :
(8) | |||
(9) |
where SF is the scoring function and denotes the number of relation types. We consider three scoring functions, including the linear scoring function: , the TransE scoring function: , and the TuckER scoring function: . In the TuckER function, we use a separate kernel and for text and graph data types. In the TransE function, is the embedding of relation shared between modalities (Section 2.1).
2.3 Co-training language and graph encoders
We proceed describing the procedure for co-training text and graph encoders. From last section, we obtain relationship estimates based on evidence provided by text information, , and graph information, , for every triplet . We use the binary cross entropy to optimize those estimates in each data type as:
(10) |
(11) |
and can combine text-based loss and graph-based loss as: . This loss function is motivated by the principle of knowledge distillation [lan2018knowledge] to enhance multimodal interaction and improve classification performance. Using probabilities from the text encoder, we normalize them across relation types to obtain a text-based distribution as:
(12) |
We calculate a graph-based distribution in the same manner using softmax normalization.
Specifically, we develop two REMAP variants, REMAP-M and REMAP-B, based on how text-based and graph-based losses are combined into a multimodal objective. In REMAP-M, text-based and graph-based losses are aligned by shrinking the distance between and using Kullback-Leibler (KL) divergence:
(13) |
as a measure of misalignment as follows:
(14) |
In contrast, REMAP-B considers the best logit on two data types, inspired by ensemble distillation
(15) |
which is normalized across relation types using a softmax function:
(16) |
REMAP-B uses cross entropy to quantify the discrepancy between predicted scores and ground truth labels:
(17) |
and co-trains graph and text encoders by penalizing misaligned encoders:
(18) |
The outline of complete REMAP-B is shown in Algorithm 1.
3 Data and Experimental Setup
We proceed with the description of datasets (Section 3.1), followed by the training and implementation details of REMAP approach (Section 3.2) and the outline of experimental setup (Section 3.3).
3.1 Datasets
Datasets used in this study are multimodal and originate from diverse sources that we integrated and harmonized as outlined below. In particular, we compiled a large disease-disease knowledge graph together with text descriptions retrieved from medical data repositories using distant supervision. Further, we processed a large electronic health record dataset and curated a high quality human annotated dataset.
Disease knowledge graph. We construct a disease-disease knowledge graph from Diseases [pletscher2015diseases] and MedScape [frishauf2005medscape] repositories that unify evidence on disease-disease associations from automatic text mining, manually curated literature, cancer mutation data, and genome-wide association studies. This knowledge graph contains 9,182 disease concepts, which are assigned concept unique identities (CUI) in Unified Medical Language System (UMLS), and three types of relationships between disease concepts, which are ‘may cause’ (MC), ‘may be caused by’ (MBC), and ‘differential diagnosis’ (DDx). Other relations in the knowledge graph are denoted as ‘not available’ relations (NA). The MBC relation type is the reverse relation of the MC relation type while DDx is a symmetric relation between disease concepts. Further dataset statistics are in [lin2020highthroughput] and Table 1.
Language dataset. We use a text corpus taken from Lin et al. [lin2020highthroughput] that includes sentences from Wikipedia, Medscape eMedicine, PubMed abstracts from The PubMed Baseline Repository provided by the National Library of Medicine, and four textbooks. This dataset comprises 21,315,153 medical articles, and altogether 1,466,065 sentences from these articles can be aligned to disease-disease edges in the disease knowledge graph. Details are shown in Table 1.
Electronic health record dataset. We use two types of information from electronic health records (EHRs), both taken from Beam et al. [beam2019clinical]. In particular, using 20 million medical notes, Beam et al. [beam2019clinical] created a dataset of disease concepts that appear together in the same note and used the SVD to decompose the resulting co-occurrence matrix and produce embedding vectors for disease concepts. We use information on co-occurring disease concepts to guide sampling of negative node pairs when training the disease knowledge graph. Further, we use embedding vectors from [beam2019clinical] to initialize embeddings in REMAP’s graph neural network. For out-of-dictionary disease concepts, we use the average of all concept embedding as their initial embeddings.
Human annotated dataset. We randomly selected 500 disease pairs to create an annotated dataset. For that, we recruited three senior physicians from Peking Union Medical College Hospital who manually labeled disease pairs. Annotation experts disagreed on labels for 14 disease pairs (i.e., 2.8% of the total number of disease pairs) that were resolved through consensus voting. The human annotated dataset is used for model comparison and performance evaluation.
Dataset | Total | NA | DDx | MC | MBC | Entities | ||
---|---|---|---|---|---|---|---|---|
Unaligned | - | - | 96,913 | 30,546 | 20,657 | 23,411 | 22,298 | 9,182 |
Aligned | Train | Triplet | 31,037 | 15,670 | 7,262 | 4,358 | 3,747 | 7,794 |
Language | 1,244,874 | 799,194 | 208,921 | 123,735 | 113,024 | |||
Validation | Triplet | 7,754 | 3,918 | 1,821 | 1,065 | 950 | 4,433 | |
Language | 206,179 | 68,934 | 60,165 | 43,706 | 33,474 | |||
Annotated | Triplet | 499 | 8 | 210 | 159 | 122 | 733 | |
Language | 15,012 | 96 | 4,980 | 6,699 | 3,237 |
3.2 Training REMAP models
Next we outline training details of REMAP models, including negative sampling, pre-training strategy for language and graph models, and the multimodal learning approach.
Negative sampling. We construct negative samples by sampling disease pairs whose co-occurrence in the EHR co-occurrence matrix [beam2019clinical] is lower than a pre-defined threshold. In particular, we expect that two unrelated diseases rarely appear together in EHRs, meaning that the corresponding values in the co-occurrence matrix are low. Thus, such disease pairs represent suitable negative samples to train models for classifying disease-disease relations.
Pre-training a language model.
The text-based model comprises the text encoder and the TuckER module for disease-disease relation prediction. We denote the relation embeddings as , and the loss function as (Section 2.3). In particular, we use SciBERT tokenizer and SciBERT-SciVocab-uncased model [wolf2019huggingface] . The entity markers are added to the SciVocab vocabulary, and their embeddings are initialized with uniform distribution. We set the maximum number of sentences in a bag to
Pre-training a graph neural network model. The graph-based model comprises the heterogeneous attention network encoder and the TuckER module for disease-disease relation prediction. The relation embeddings produced by the TuckER module are denoted as . In the pre-training phase, the model is optimized for the loss function is (Section 2.3). The initial embeddings for nodes in the disease knowledge graph are concept unique identifier (CUI) representations derived from the SVD decomposition of the EHR co-occurrence matrix [beam2019clinical]. Further details on hyper-parameters are in Table LABEL:tab:param.
Joint learning on the multimodal dataset. After data type-specific pre-training is completed, the text and graph models are fused in cross-modal learning. To this end, the shared relation vector is initialized as: . We consider two REMAP variants, namely REMAP-M and REMAP-B, that are optimized for different loss functions (Section 2.3). Details on hyper-parameter selection are in Table LABEL:tab:param.
3.3 Experimental setup
Next we overview performance metrics, baseline methods, and variants of REMAP approach.
Baseline methods and performance metrics. We consider 11 baseline methods for disease relation extraction, including 5 knowledge graph-based methods and 6 text-based methods. Detailed descriptions of all methods are in Appendix LABEL:sec:further-baselines. We evaluate predicted disease-disease relations by calculating the accuracy of predicted relations between disease concepts, which is an established approach for benchmarking relation extraction methods [hu2020open]. Specifically, given a triplet and a predicted score , relation is predicted to exist between and if the predicted score , which corresponds to a binary classification task for each relation type. The is a relation type-specific value determined such that binary classification performance achieves maximal F1-score. We report classification accuracy, precision, recall, and F1-score for all experiments in this study.
Variants of REMAP approach. We carry out an ablation study to examine the utility of key REMAP components. We consider the following three components and examine REMAP’s performance with and without each.
-
[leftmargin=*]
-
REMAP-B without joint learning In text-only ablations, we use SciBERT to obtain concept embeddings and , and combine them to produce a relation embedding , where is related to the concept embeddings based on text information. Finally, we consider three scoring functions to classify disease-disease relations, and we denote the models as SciBERT (linear), SciBERT (TransE), and SciBERT (TuckER). Similarly, in graph-only ablations, we first use a heterogeneous attention network to obtain graph embeddings and that are combined into prediction by different scoring functions, including HAN (linear), HAN (TransE), and HAN (TuckER).
-
REMAP-B without EHR embeddings REMAP-B uses EHR embeddings [beam2019clinical] as initial node embedding . To examine the utility of EHR embeddings, we design an ablation study that initializes node embeddings using the popular Xavier initialization [glorot2010understanding] instead of EHR embeddings. Other parts of the model are the same as in REMAP-B.
-
REMAP-B without unaligned triplets Unaligned triplets denote triplets in the disease knowledge graph that do not have the corresponding sentences in the language dataset. To demonstrate how these unaligned triplets influence model performance, we design an ablation study in which we train a REMAP-B model on the reduced disease knowledge graph with unaligned triplets excluded.
4 Results
REMAP is a multimodal approach for joint learning on graph and text datasets. We evaluate REMAP’s prowess for disease relation extraction when the REMAP model is tasked to identify candidate disease relations in either text (Section 4.1) or graph-structured (Section 4.2) data. We present ablation (Appendix LABEL:sec:ablation) and case (Section 5) studies.
4.1 Extracting disease relations from text
We start with results on the human annotated dataset where each method, while it can be trained on a multimodal text-graph dataset, is tasked to extract disease relations from text alone, meaning that the test set consists of only annotated sentences. Table 2 shows performance on the annotated set for text-based methods. Neural network models outperform methods, such as Naive Bayes and SVM. Further, BiGRU+Attention is the best performing baseline method, achieving an accuracy of 78.6 and an F1-score of 64.6. We also find that REMAP-B and REMAP-M achieve the best performance across all settings, outperforming baselines significantly. In particular, REMAP models surpass the strongest baseline by 10.0 absolute percentage points (accuracy) and by 7.2 absolute percentage points (F1-score). These results show that multimodal learning can considerably advance disease relation extraction when only one data type is available at test time.
Modality | Model | Accuracy | F1-score | |||||||
---|---|---|---|---|---|---|---|---|---|---|
micro | DDx | MC | MBC | micro | DDx | MC | MBC | |||
Text | Baselines | SVM | 59.5 | 59.0 | 68.2 | 51.2 | 31.5 | 40.9 | 25.4 | 19.0 |
NB | 74.3 | 71.4 | 69.0 | 82.4 | 47.1 | 60.2 | 29.9 | 43.2 | ||
RF | 72.0 | 68.6 | 70.8 | 76.4 | 37.8 | 53.1 | 25.5 | 19.2 | ||
TextCNN | 76.7 | 75.4 | 73.0 | 81.8 | 60.9 | 67.5 | 59.9 | 48.6 | ||
BiGRU | 77.4 | 73.0 | 77.2 | 82.0 | 62.0 | 67.9 | 54.0 | 59.8 | ||
BiGRU+attention | 78.6 | 75.0 | 78.2 | 82.6 | 64.6 | 67.7 | 63.5 | 60.6 | ||
Ours | REMAP | 88.2 | 83.6 | 89.0 | 92.0 | 80.9 | 80.7 | 80.0 | 82.6 | |
REMAP-M | 88.6 | 84.2 | 89.0 | 92.8 | 81.5 | 81.6 | 79.6 | 83.8 | ||
REMAP-B | 88.6 | 84.4 | 89.2 | 92.4 | 81.8 | 81.9 | 80.3 | 83.3 | ||
Graph | Baselines | TransE_l2 | 75.1 | 70.7 | 72.7 | 81.8 | 63.2 | 68.0 | 57.0 | 62.2 |
DistMult | 69.8 | 77.5 | 61.3 | 70.5 | 56.1 | 71.0 | 43.4 | 51.5 | ||
ComplEx | 79.0 | 75.2 | 77.8 | 84.2 | 65.0 | 69.3 | 56.5 | 66.9 | ||
RGCN | 71.8 | 78.6 | 62.5 | 74.3 | 62.2 | 75.1 | 50.8 | 58.6 | ||
TuckER | 81.5 | 77.6 | 82.3 | 84.7 | 73.7 | 76.2 | 71.7 | 71.9 | ||
Ours | REMAP | 89.6 | 86.4 | 89.6 | 92.8 | 83.5 | 84.3 | 81.6 | 84.2 | |
REMAP-M | 89.3 | 87.0 | 88.4 | 92.6 | 83.3 | 85.7 | 78.8 | 83.8 | ||
REMAP-B | 89.8 | 87.3 | 89.9 | 92.2 | 84.1 | 85.8 | 82.4 | 82.7 |
4.2 Identifying disease relations in a graph-structured dataset
We proceed with results of disease relation prediction in a setting where each method is tasked to classify disease relations based on the disease knowledge graph alone. This setting evaluates the flexibility of REMAP as REMAP can answer either graph-based or text-based disease queries. In particular, Table 2 shows performance results attained on the human annotated dataset with query disease pairs given only as disease concept nodes in the knowledge graph. We find that graph-based baselines perform better than text-based baselines. Further, TuckER, the best graph-based method, outperforms BiGRU+Attention, a top-performing text-based baseline, by 2.9 (accuracy) and 8.9 absolute percentage points (F1-score). These results suggest that the disease knowledge graph contains more valuable predictive information than text when classifying disease-disease relations. Last, we find that REMAP-B is a top performer among REMAP variants, achieving an accuracy of 89.8 and an F1-score of 84.1. Further results are in Appendices LABEL:sec:ablation-LABEL:sec:further-analysis.
5 Discussion
Table LABEL:tab:ablation shows that joint learning can considerably improve model performance in both text and graph modalities. To further analyze joint learning in REMAP-B, we analyze the characteristics of correctly and incorrectly classified disease-disease relationships. Comparing full REMAP-B with REMAP-B w/o joint learning (Table LABEL:tab:ablation), we found that 28 predicted disease-disease relationships turned from incorrect to correct predictions. Conversely, we found that 14 predicted disease-disease relationships turned correct to incorrect predictions on the text modality. Besides, 19 predictions turned from incorrect to correct and 11 inversely on the graph modality. It can be seen that improved classification accuracy on the text modality is why REMAP-B achieved better performance on the annotated set.
We proceed with a case study examining the relationship between hypogonadism and Goldberg-Maxwell syndrome. Hypogonadism and Goldberg-Maxwell syndrome are separate and distinct diseases [nguyen2003pet, grymowicz2021complete]. However, many physicians fail correctly to differentiate in their diagnosis of these two diseases, especially in light of Goldberg-Maxwell syndrome being a rare disease. Differential diagnosis is a process wherein a doctor differentiates between two or more conditions that could be behind a person’s symptoms.
Figure 3 illustrates how the prediction of the “differential diagnosis” (DDx) relationship between hypogonadism and Goldberg-Maxwell syndrome changed from incorrect to correct after joint learning with the text modality. Because REMAP-B correctly recognizes diseases from text and unifies them with diseases in the knowledge graph, it can be seen that both diseases have rich local network neighborhoods in the graph. For example, hypogonadism and Goldberg-Maxwell syndrome have many outgoing edges of type differential diagnosis. Further, meta-paths connect these diseases in the graph, such as as second-order “differential diagnosis differential diagnosis” meta-path, and their outgoing degrees in the graph are relatively high. In the case of joint learning, the text-based model can extract part of the disease representation from the graph modality to update its internal representations and thus improve text-based classification of relations.
6 Conclusion
We develop REMAP, a multi-modal learning approach for disease relation extraction that integrates knowledge graphs and language models. Results on a dataset annotated by experts show that REMAP considerably outperforms methods for learning on text or knowledge graphs alone. Further, REMAP can extract and classify disease relationships in the most challenging settings where text or graph information is absent. Finally, we provide a new data resource of extracted relationships between diseases that can serve as a benchmarking dataset for systematic evaluation and comparison of disease relation extraction algorithms.
Funding and acknowledgments
We gratefully acknowledge the support of the Harvard Translational Data Science Center for a Learning Healthcare System (CELeHS). M.Z. is supported, in part, by NSF under nos. IIS-2030459 and IIS-2033384, US Air Force Contract No. FA8702-15-D-0001, Harvard Data Science Initiative, Amazon Research Award, Bayer Early Excellence in Science Award, AstraZeneca Research, and Roche Alliance with Distinguished Scientists Award. Any opinions, findings, conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the funders.
Author contributions
Y.L and K.L. are co-first authors and have developed the method and produced all analyses presented in the paper. S.Y. contributed experimental data. T.C. and M.Z. conceived and designed the study. The project was guided by M.Z., including designing methodology and outlining experiments. Y.L., K.L., S.Y., T.C., and M.Z. wrote the manuscript. All authors discussed the results and reviewed the paper.
Data and code availability
Python implementation of REMAP is available on Github at https://github.com/Lukeming-tsinghua/REMOD. The human annotated dataset used for evaluation of REMAP is available at https://doi.org/10.6084/m9.figshare.17776865.
Conflicts of interest
None declared.


