
Multimodal Learning on Graphs for Disease Relation Extraction

by   Yucong Lin, et al.

Objective: Disease knowledge graphs are a way to connect, organize, and access disparate information about diseases with numerous benefits for artificial intelligence (AI). To create knowledge graphs, it is necessary to extract knowledge from multimodal datasets in the form of relationships between disease concepts and normalize both concepts and relationship types. Methods: We introduce REMAP, a multimodal approach for disease relation extraction and classification. The REMAP machine learning approach jointly embeds a partial, incomplete knowledge graph and a medical language dataset into a compact latent vector space, followed by aligning the multimodal embeddings for optimal disease relation extraction. Results: We apply the REMAP approach to a disease knowledge graph with 96,913 relations and a text dataset of 1.24 million sentences. On a dataset annotated by human experts, REMAP improves text-based disease relation extraction by 10.0% (accuracy) and 17.2% (F1-score) by fusing disease knowledge graphs with text information. Further, REMAP leverages text information to recommend new relationships in the knowledge graph, outperforming graph-based methods by 8.4% (accuracy) and 10.4% (F1-score). Discussion: Systematized knowledge is becoming the backbone of AI, creating opportunities to inject semantics into AI and fully integrate it into machine learning algorithms. While prior semantic knowledge can assist in extracting disease relationships from text, existing methods cannot fully leverage multimodal datasets. Conclusion: REMAP is a multimodal approach for extracting and classifying disease relationships by fusing structured knowledge and text information. REMAP provides a flexible neural architecture to easily find, access, and validate AI-driven relationships between disease concepts.





1 Background and Significance

Disease knowledge graphs are a way to connect, organize, and access disparate data and information resources about diseases with numerous benefits for artificial intelligence (AI). Systematized knowledge can be injected into AI tools to imitate human experts’ reasoning so that AI-driven hypotheses can be easily found, accessed, and validated. For example, disease knowledge graphs (KGs) power AI applications, such as intelligent diagnosis and electronic health record (EHR) retrieval. However, creating high-quality knowledge graphs requires extracting relationships between diseases from disparate information sources, such as free text in the EHRs and semantic knowledge representation from literature.

Traditionally, KGs were constructed via manual efforts, requiring humans to input every fact [bollacker2008freebase]. In contrast, rule-based [rindflesch2011semantic] and semi-automated [carlson2010toward, dong2014knowledge] construction, while scalable, can suffer from poor accuracy and recall. Recent methods create KGs by extracting relationships from literature, building considerably more accurate KGs that comprehensively cover a domain of interest. These methods leverage pre-trained language models [devlin2018bert, scibert] and have advanced biomedical knowledge graphs [MTB, zhong2020frustratingly, lin2020highthroughput, chen2022biomedical, li2017noise]. Another approach for extracting relations uses knowledge graph embeddings (KGE), which directly predict new relations in partial, incomplete knowledge graphs. KGE methods learn how to represent every entity (i.e., node) and relation (i.e., edge) in a graph as a distinct point in a low-dimensional vector space (i.e., embedding) so that performing algebraic operations in this learned space reflects the topology of the graph [wang2017knowledge]. Embeddings produced by KGE methods can be remarkably powerful for downstream AI applications [li2021representation, zitnik2016collective, shi2017proje, lin2015learning, wang2019kgat, sun2018recurrent]. Widely used KGE methods include translation models [bordes2013translating, wang2014knowledge, lin2015learning, ji2015knowledge], bilinear models [nickel2011three, yang2014embedding, trouillon2017knowledge, balavzevic2019tucker], and graph neural networks (GNNs) [schlichtkrull2018modeling, busbridge2019relational, wang2019heterogeneous, zitnik2018modeling]. These methods leverage embeddings to predict new relations, thereby completing and growing an existing KG. However, extracting relations from a single data type may suffer from bias, noise, and inherent incompleteness. For example, in language-based methods, the training dataset is collected using distant supervision [mintz2009distant], which creates noisy sentences that can mislead relation extraction. Further, graph-based methods can suffer from the out-of-dictionary problem, limiting the ability to predict relations involving entities previously not in the KG [bordes2013translating, balazevic2019tucker].

Nevertheless, language-based and graph-based methods both have advantages. For example, language-based methods can reason over large datasets created using distant supervision, and graph-based methods can operate on noisy and incomplete knowledge graphs, providing robust predictions. An emerging strategy to advance relation extraction thus leverages multiple data types simultaneously [liu2020k, sun2019ernie, zhang2019ernie, he2019integrating, koncel2019text, sun2020colake, hu2019improving, xu2019connecting, zhang2019long, wang2020model, wang2014knowledge, han2016joint, ji2020joint, dai2019distantly, stoica2020improving] with multimodal learning, outperforming rule-based [rindflesch2011semantic] and semi-automated [carlson2010toward, dong2014knowledge] methods. However, existing approaches are limited in two ways, which we outline next.

First, KGs provide only positive samples (i.e., examples of true disease-disease relationships), while existing methods also require negative samples that, ideally, are disease pairs that resemble positive samples but are not valid relationships. Methods for positive-unlabeled [he2020improving] and contrastive [le2020contrastive, su2021improving] learning can be trained in such scenarios using random disease pairs from the dataset as proxy negative samples, ensuring a low false-positive rate. However, these methods may not generalize well in real-world applications because random negative samples are not necessarily realistic and fail to represent the boundary cases. To improve the quality of negative sampling in disease relation extraction, we introduce an EHR-based negative sampling strategy. With this strategy, our approach generates negative samples from disease pairs that rarely appear together in EHRs, yielding realistic negative samples that enable broad generalization of the approach.

Second, it is not uncommon for certain diseases to appear in either graph or language datasets, but not in both. For example, in Lin et al. [lin2020highthroughput], over 60% of disease pairs in the knowledge graph had no corresponding text information; there were also cases with text but no graph information. It is thus essential that multimodal approaches can make predictions when only one data type is available. Unfortunately, existing multimodal approaches with such capability [wang2020multimodal, suo2019metric, zhou2019latent, yang2018semi] do not focus on language and graphs. Further, missing data imputation techniques use adversarial learning to impute missing values [cai2018deep, jaques2017multimodal], but the imputed values can introduce unwanted data bias. To address this issue, we propose a multimodal approach with a decoupled model structure in which language-based and graph-based modules interact only through shared parameters and a cross-modal loss function. This design ensures our model can take advantage of both language and graph inputs and extract disease relations from either single-modality or multimodal inputs.

Present work. We introduce REMAP (Relation Extraction with Multimodal Alignment Penalty), a multimodal approach for extracting and classifying disease-disease relations (Figure 1). (A Python implementation of REMAP and our dataset of domain-expert annotations are publicly available.) REMAP is a flexible multimodal algorithm that jointly learns over text and graphs with a unique capability to make predictions even when a disease concept exists in only one data type. To this end, REMAP specifies graph-based and text-based deep transformation functions that embed each data type separately and optimize unimodal embedding spaces such that they capture the topology of a disease KG or the text semantics of disease concepts. Finally, to achieve data fusion, REMAP aligns the unimodal embedding spaces through a novel alignment penalty loss using shared disease concepts as anchors. This way, REMAP can effectively model data type-specific distributions and diverse representations while also aligning embeddings of distinct data types. Further, REMAP can be jointly trained on both graph and text data types but evaluated and implemented on either of the two modalities alone. In summary, the main contributions of this study are:


  • We develop REMAP, a flexible multimodal approach for extracting and classifying disease-disease relations. REMAP fuses knowledge graph embeddings with deep language models and can flexibly accommodate missing data types, which is a necessary capability to facilitate REMAP’s validation and transition into biomedical implementation.

  • We rigorously evaluate REMAP for extraction and classification of disease-disease relations. To this end, we create a training dataset using distant supervision and a high-quality test dataset of gold-standard annotations provided by three domain experts, all medical doctors. Our evaluations show that, when extracting relations from text, REMAP achieves 88.6% micro-accuracy and 81.8% micro-F1 score on the human annotated dataset, outperforming text-based methods by 10.0 and 17.2 percentage points, respectively. Further, when classifying relations in the knowledge graph, REMAP achieves the best performance, 89.8% micro-accuracy and 84.1% micro-F1 score, surpassing graph-based methods by 8.4 and 10.4 percentage points, respectively.

2 Methods

We next describe the REMAP approach and illustrate it on the task of disease relation extraction and classification (Figure 1). We first introduce the notation, proceed with an overview of the language and knowledge graph models, and outline the multimodal learning strategy that injects knowledge into extraction tasks.

2.1 Preliminaries

Notation. The input to REMAP is a combined dataset of text and graph information. This dataset consists of text information given as bags of sentences and graph information given as triplets $(e_s, r, e_o)$ encoding the relationship $r$ between a subject entity $e_s$ and an object entity $e_o$. For example, the triplet (hypogonadism, DDx, Goldberg-Maxwell syndrome) would indicate the process of differentiating between hypogonadism and Goldberg-Maxwell syndrome, two diseases with similar symptoms that could possibly account for illness in a patient. We assume that the bags of sentences overlap with the triplets from the existing KG such that each sentence in the $i$-th sentence bag contains mentions of $e_s$ and $e_o$. The remaining triplets cannot be mapped to sentences, and the remaining sentences contain entity pairs that do not belong to the existing KG. We represent the $i$-th sentence bag as $B_i = \{x_{i1}, \dots, x_{in_i}\}$, where $n_i$ is the number of sentences in bag $B_i$ and $x_{ij}$ is the tokenized sequence of the $j$-th sentence in $B_i$. Here, a tokenized sequence is the combination of the sentence containing mentions of the subject and object entities, entity markers, the title of the document in which the sentence appears, and the article structure. Marker tokens are added at the head and tail position of each entity mention to denote entity type information. Last, $p_s$ and $p_o$ are the start indices of the entity markers for the subject and object entities, respectively.
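To make the marker scheme concrete, here is a minimal Python sketch of inserting type-specific entity marker tokens around the subject and object mentions of a sentence. The function name and marker spellings are illustrative assumptions, not REMAP's exact implementation:

```python
def mark_entities(tokens, subj_span, obj_span, subj_type="disease", obj_type="disease"):
    """Insert <S-type>/<S-type/> and <O-type>/<O-type/> markers around entity spans.

    subj_span/obj_span are (start, end) token indices, end exclusive.
    Returns the marked token list and the start indices p_s, p_o of both markers.
    """
    # Insert from the rightmost position first so earlier indices stay valid.
    inserts = sorted([
        (subj_span[0], f"<S-{subj_type}>"), (subj_span[1], f"<S-{subj_type}/>"),
        (obj_span[0], f"<O-{obj_type}>"), (obj_span[1], f"<O-{obj_type}/>"),
    ], key=lambda x: x[0], reverse=True)
    marked = list(tokens)
    for pos, marker in inserts:
        marked.insert(pos, marker)
    p_s = marked.index(f"<S-{subj_type}>")
    p_o = marked.index(f"<O-{obj_type}>")
    return marked, p_s, p_o

tokens = "hypogonadism may resemble Goldberg-Maxwell syndrome".split()
marked, p_s, p_o = mark_entities(tokens, (0, 1), (3, 5))
```

In a full pipeline the marked tokens would then be wordpiece-tokenized and prefixed with the document title, as described above.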

Heterogeneous graph attention network. The heterogeneous graph attention network (HAN) [wang2019heterogeneous] is a graph neural network for embedding KGs by leveraging meta paths. A meta path is a sequence of node types and relation types [sun2011pathsim]. For example, in a disease KG, "Disease" → "May Cause" → "Disease" → "Differential Diagnosis" → "Disease" is a meta path. Node $u$ is connected to node $v$ via a meta path if $u$ and $v$ are the head and tail nodes, respectively, of this meta path. Each node $v$ has an initial node embedding $\mathbf{h}_v$ and belongs to a type $\phi(v)$, e.g., $\phi(v) =$ "Disease". The graph attention network specifies a parameterized deep transformation function that maps nodes to condensed data summaries, i.e., embeddings, in a node-type-specific manner as: $\mathbf{h}_v' = \mathbf{M}_{\phi(v)} \mathbf{h}_v$.

We denote all nodes adjacent to $v$ via a meta path $\Phi$ as $N_v^{\Phi}$, and a node-level attention mechanism provides information on how strongly $v$ attends to each of its adjacent nodes when generating the embedding for $v$. In particular, the importance of $u \in N_v^{\Phi}$ for $v$ in meta path $\Phi$ is defined as:

$\alpha_{vu}^{\Phi} = \frac{\exp\big(\sigma(\mathbf{a}_{\Phi}^{\top} [\mathbf{h}_v' \,\|\, \mathbf{h}_u'])\big)}{\sum_{k \in N_v^{\Phi}} \exp\big(\sigma(\mathbf{a}_{\Phi}^{\top} [\mathbf{h}_v' \,\|\, \mathbf{h}_k'])\big)}, \quad (1)$

where $\sigma$ is the sigmoid activation, $\|$ indicates concatenation, and $\mathbf{a}_{\Phi}$ is a trainable vector. To promote stable attention, HAN uses multiple, i.e., $K$, heads and concatenates vectors after node-level attention to produce the final node embedding for node $v$:

$\mathbf{z}_v^{\Phi} = \big\Vert_{k=1}^{K}\; \sigma\Big(\sum_{u \in N_v^{\Phi}} \alpha_{vu}^{\Phi,k}\, \mathbf{h}_u'\Big). \quad (2)$

Given user-defined meta paths $\{\Phi_1, \dots, \Phi_P\}$, HAN uses the above node-level attention to produce node embeddings $\{\mathbf{z}_v^{\Phi_1}, \dots, \mathbf{z}_v^{\Phi_P}\}$. Finally, HAN uses semantic-level attention to combine the meta path-specific node embeddings as:

$\beta_{\Phi_p} = \operatorname{softmax}_p\Big(\frac{1}{|\mathcal{V}|} \sum_{v \in \mathcal{V}} \mathbf{q}^{\top} \tanh(\mathbf{W} \mathbf{z}_v^{\Phi_p} + \mathbf{b})\Big), \quad (3)$

$\mathbf{z}_v = \sum_{p=1}^{P} \beta_{\Phi_p}\, \mathbf{z}_v^{\Phi_p}, \quad (4)$

where $\beta_{\Phi_p}$ represents the importance of meta path $\Phi_p$ towards the final node embedding $\mathbf{z}_v$, and $\mathbf{q}$, $\mathbf{W}$, and $\mathbf{b}$ are trainable parameters. The final outputs are node embeddings $\mathbf{z}_v$, representing compact vector summaries of knowledge associated with each node in the KG.
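The semantic-level attention step can be sketched in a few lines of NumPy. Shapes, variable names, and the stabilized softmax are our illustrative assumptions rather than the HAN reference implementation:

```python
import numpy as np

def semantic_attention(z, q, W, b):
    """Combine meta path-specific node embeddings into one embedding per node.

    z: array of shape (P, N, d), P meta paths, N nodes, d dimensions.
    q: (d,) query vector; W: (d, d) weight matrix; b: (d,) bias.
    Returns (beta, fused): meta-path weights (P,) and fused embeddings (N, d).
    """
    # Importance of each meta path: average over nodes of q^T tanh(W z + b)
    w = np.einsum('k,pnk->pn', q, np.tanh(z @ W.T + b)).mean(axis=1)  # (P,)
    beta = np.exp(w - w.max())
    beta /= beta.sum()                                                 # softmax over meta paths
    fused = np.einsum('p,pnd->nd', beta, z)                            # weighted sum per node
    return beta, fused

rng = np.random.default_rng(0)
P, N, d = 3, 5, 4
beta, fused = semantic_attention(rng.normal(size=(P, N, d)),
                                 rng.normal(size=d),
                                 rng.normal(size=(d, d)),
                                 rng.normal(size=d))
```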

Translation- and tensor-based embeddings. TransE [bordes2013translating] and TuckER [balavzevic2019tucker] decode an optimized set of embeddings into the probability estimate of relationship $r$ existing between entities $e_s$ and $e_o$. This is achieved by a scoring function (SF) that either translates the embeddings in TransE as: $\mathrm{SF}(e_s, r, e_o) = \sigma(-\|\mathbf{z}_s + \mathbf{r} - \mathbf{z}_o\|)$, or factorizes the embeddings in TuckER as: $\mathrm{SF}(e_s, r, e_o) = \sigma(\mathcal{W} \times_1 \mathbf{z}_s \times_2 \mathbf{r} \times_3 \mathbf{z}_o)$, where $\mathbf{z}_s$, $\mathbf{z}_o$, and $\mathbf{r}$ represent entity and relation embeddings, $\sigma$ denotes the sigmoid function, $\mathcal{W}$ is a trainable tensor, and $\times_n$ indicates tensor multiplication along dimension $n$.
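A minimal NumPy sketch of the two scoring functions, assuming randomly initialized embeddings (dimensions and variable names are illustrative):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def transe_score(z_s, r, z_o):
    """TransE: relations translate subject toward object; score is high when z_s + r ≈ z_o."""
    return sigmoid(-np.linalg.norm(z_s + r - z_o))

def tucker_score(z_s, r, z_o, W):
    """TuckER: core tensor W of shape (d_e, d_r, d_e) contracted with entity/relation embeddings."""
    return sigmoid(np.einsum('i,ijk,j,k->', z_s, W, r, z_o))

d_e, d_r = 4, 3
rng = np.random.default_rng(1)
z_s, z_o = rng.normal(size=d_e), rng.normal(size=d_e)
r_graph = rng.normal(size=d_r)
W = rng.normal(size=(d_e, d_r, d_e))
s1 = transe_score(z_s, rng.normal(size=d_e), z_o)
s2 = tucker_score(z_s, r_graph, z_o, W)
perfect = transe_score(z_s, z_o - z_s, z_o)  # exact translation gives score 0.5
```

Note that a perfect TransE translation yields a distance of zero, so the sigmoid of the negated distance saturates at 0.5; in practice the raw (un-squashed) score is often used for ranking.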

2.2 Language and knowledge graph encoders

Embedding disease-associated sentences. We start by tokenizing entities in sentences and proceed with an overview of the text encoder. Entity tokens identify the position and type of entities in a sentence [MTB, zhong2020frustratingly]. Specifically, tokens <S-type> and <S-type/>, and <O-type> and <O-type/>, are used to denote the start and end of subject (S) and object (O) entities, respectively. The entity marker tokens are type-related, meaning that entities of different types (e.g., disease concepts, medications) get different tokens. This procedure produces bags of tokenized sentences that we encode into entity embeddings using a text encoder. We use the SciBERT encoder [scibert] with the SciVocab wordpiece vocabulary, a BERT language model optimized for scientific text with better performance in biomedical domains than BioBERT or BERT alone [zhong2020frustratingly]. Tokenized sequences $x_{ij}$ in a sentence bag $B_i$ are fed into the language model to produce a set of sequence outputs $H_{ij}$:

$H_{ij} = \mathrm{SciBERT}(x_{ij}), \quad j = 1, \dots, n_i,$

where $n_i$ is the number of sentences in $B_i$. We then aggregate representations of the subject entity across all sentences in bag $B_i$ as $\mathbf{h}_{s,ij} = H_{ij}[p_s]$, i.e., the output at the position of the subject entity marker, and use self-attention to obtain the final text-based embedding for subject entity $e_s$ as:

$\gamma_{ij} = \frac{\exp(\mathbf{c}^{\top} \mathbf{h}_{s,ij})}{\sum_{j'} \exp(\mathbf{c}^{\top} \mathbf{h}_{s,ij'})}, \qquad \mathbf{z}_s^{text} = \sum_{j=1}^{n_i} \gamma_{ij}\, \mathbf{h}_{s,ij},$

where $\mathbf{z}_s^{text}$ is the embedding of $e_s$ and $\mathbf{c}$ is a trainable vector. Embeddings of object entities (i.e., $\mathbf{z}_o^{text}$ for object entity $e_o$) are generated analogously by the text encoder. Self-attention is needed because some sentences in each bag may not capture any relationship between entities, and attention allows the model to ignore such uninformative sentences when generating embeddings.
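The bag-level self-attention above can be sketched as follows; the query vector and array shapes are illustrative assumptions:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def bag_attention(h, c):
    """Attention-pool one entity's representations across a sentence bag.

    h: (n, d) marker-position outputs of one entity across the n sentences in a bag;
    c: (d,) trainable query vector. Uninformative sentences receive low weight.
    Returns the attention weights and the pooled (d,) entity embedding.
    """
    alpha = softmax(h @ c)       # (n,) attention over sentences in the bag
    return alpha, alpha @ h      # convex combination of sentence-level vectors

rng = np.random.default_rng(2)
h = rng.normal(size=(6, 8))      # a bag of 6 sentences, hidden size 8
alpha, z = bag_attention(h, rng.normal(size=8))
```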

Embedding a disease knowledge graph. We use a heterogeneous graph attention encoder [wang2019heterogeneous] to map nodes in the disease KG into embeddings. The encoder generates embeddings for every subject entity $e_s$ and object entity $e_o$ that appear as nodes in the KG as:

$\mathbf{z}_s^{graph} = \mathrm{HAN}(\mathbf{h}_s), \qquad \mathbf{z}_o^{graph} = \mathrm{HAN}(\mathbf{h}_o),$

where the transformation $\mathrm{HAN}$ is given in Eqs. (1)-(4) and $\mathbf{h}$ denotes initial embeddings.

Scoring disease-disease relationships. Taking text-based embeddings, $\mathbf{z}_s^{text}$ and $\mathbf{z}_o^{text}$, and graph-based embeddings, $\mathbf{z}_s^{graph}$ and $\mathbf{z}_o^{graph}$, for every subject $e_s$ and object $e_o$ occurring in either the language or the graph dataset, we use a scoring function to estimate disease-disease relationships formulated as candidate triplets $(e_s, r, e_o)$:

$p_r = \mathrm{SF}(\mathbf{z}_s, \mathbf{r}, \mathbf{z}_o), \qquad r = 1, \dots, R,$

where SF is the scoring function and $R$ denotes the number of relation types. We consider three scoring functions: the linear scoring function, $\mathrm{SF} = \sigma(\mathbf{W} [\mathbf{z}_s \,\|\, \mathbf{z}_o] + \mathbf{b})$; the TransE scoring function, $\mathrm{SF} = \sigma(-\|\mathbf{z}_s + \mathbf{r} - \mathbf{z}_o\|)$; and the TuckER scoring function, $\mathrm{SF} = \sigma(\mathcal{W} \times_1 \mathbf{z}_s \times_2 \mathbf{r} \times_3 \mathbf{z}_o)$. In the TuckER function, we use separate core tensors $\mathcal{W}^{text}$ and $\mathcal{W}^{graph}$ for the text and graph data types. In the TransE function, $\mathbf{r}$ is the embedding of relation $r$ shared between modalities (Section 2.1).

2.3 Co-training language and graph encoders

We proceed by describing the procedure for co-training the text and graph encoders. From the last section, we obtain relationship estimates based on evidence provided by text information, $p_r^{text}$, and graph information, $p_r^{graph}$, for every triplet $(e_s, r, e_o)$. We use the binary cross entropy to optimize these estimates in each data type as:

$\mathcal{L}_{text} = -\sum_{(e_s, r, e_o)} \big[ y_r \log p_r^{text} + (1 - y_r) \log(1 - p_r^{text}) \big],$

where $y_r \in \{0, 1\}$ indicates whether relation $r$ holds; $\mathcal{L}_{graph}$ is defined analogously, and we can combine the text-based loss and the graph-based loss as: $\mathcal{L} = \mathcal{L}_{text} + \mathcal{L}_{graph}$. This loss function is motivated by the principle of knowledge distillation [lan2018knowledge] to enhance multimodal interaction and improve classification performance. Using probabilities from the text encoder, we normalize them across relation types to obtain a text-based distribution as:

$q_r^{text} = \frac{\exp(p_r^{text})}{\sum_{r'} \exp(p_{r'}^{text})}.$

We calculate a graph-based distribution $q_r^{graph}$ in the same manner using softmax normalization.

Specifically, we develop two REMAP variants, REMAP-M and REMAP-B, based on how text-based and graph-based losses are combined into a multimodal objective. In REMAP-M, text-based and graph-based predictions are aligned by shrinking the distance between $q^{text}$ and $q^{graph}$ using the Kullback-Leibler (KL) divergence:

$D_{\mathrm{KL}}(q^{text} \,\|\, q^{graph}) = \sum_r q_r^{text} \log \frac{q_r^{text}}{q_r^{graph}},$

as a measure of misalignment as follows:

$\mathcal{L}_{\mathrm{REMAP\text{-}M}} = \mathcal{L}_{text} + \mathcal{L}_{graph} + \lambda \big( D_{\mathrm{KL}}(q^{text} \,\|\, q^{graph}) + D_{\mathrm{KL}}(q^{graph} \,\|\, q^{text}) \big),$

where $\lambda$ is the regularization strength.
In contrast, REMAP-B considers the best logit on the two data types, inspired by ensemble distillation [guo2020online]. Specifically, for each relation type $r$, the highest predicted score across data types is used as the best score $s_r$:

$s_r = \max(p_r^{text}, p_r^{graph}),$

which is normalized across relation types using a softmax function:

$q_r^{best} = \frac{\exp(s_r)}{\sum_{r'} \exp(s_{r'})}.$

REMAP-B uses cross entropy to quantify the discrepancy between predicted scores and ground-truth labels:

$\mathcal{L}_{best} = -\sum_r y_r \log q_r^{best},$

and co-trains the graph and text encoders by penalizing misaligned encoders:

$\mathcal{L}_{\mathrm{REMAP\text{-}B}} = \mathcal{L}_{text} + \mathcal{L}_{graph} + \lambda\, \mathcal{L}_{best}.$
The outline of the complete REMAP-B is shown in Algorithm 1.

Input: Bag of sentences, knowledge graph, initial node embeddings, scoring function SF, regularization strength $\lambda$, relation type vector $\mathbf{r}^{text}$ pre-trained on text, relation type vector $\mathbf{r}^{graph}$ pre-trained on graph
Output: Model parameters of the text-based and graph-based encoders
1:  Initialize model parameters
2:  Initialize the relation representation as $\mathbf{r} = \frac{1}{2}(\mathbf{r}^{text} + \mathbf{r}^{graph})$
3:  for epoch = 1 : n_epochs do
4:      for i = 1 : n_bags do
5:          Encode text and calculate text logits (Section 2.2)
6:          Encode graph and calculate graph logits (Section 2.2)
7:          Find the best logit and calculate the alignment penalty loss (Section 2.3)
Algorithm 1: REMAP, multimodal learning on graphs for disease relation extraction and classification. Shown is the outline of REMAP-B (Section 2.3).

3 Data and Experimental Setup

We proceed with the description of the datasets (Section 3.1), followed by the training and implementation details of the REMAP approach (Section 3.2) and the outline of the experimental setup (Section 3.3).

3.1 Datasets

Datasets used in this study are multimodal and originate from diverse sources that we integrated and harmonized as outlined below. In particular, we compiled a large disease-disease knowledge graph together with text descriptions retrieved from medical data repositories using distant supervision. Further, we processed a large electronic health record dataset and curated a high quality human annotated dataset.

Disease knowledge graph. We construct a disease-disease knowledge graph from the Diseases [pletscher2015diseases] and Medscape [frishauf2005medscape] repositories, which unify evidence on disease-disease associations from automatic text mining, manually curated literature, cancer mutation data, and genome-wide association studies. This knowledge graph contains 9,182 disease concepts, which are assigned concept unique identifiers (CUIs) in the Unified Medical Language System (UMLS), and three types of relationships between disease concepts: 'may cause' (MC), 'may be caused by' (MBC), and 'differential diagnosis' (DDx). Other relations in the knowledge graph are denoted as 'not available' (NA). The MBC relation type is the reverse of the MC relation type, while DDx is a symmetric relation between disease concepts. Further dataset statistics are in [lin2020highthroughput] and Table 1.

Language dataset. We use a text corpus taken from Lin et al. [lin2020highthroughput] that includes sentences from Wikipedia, Medscape eMedicine, PubMed abstracts from The PubMed Baseline Repository provided by the National Library of Medicine, and four textbooks. This dataset comprises 21,315,153 medical articles, and altogether 1,466,065 sentences from these articles can be aligned to disease-disease edges in the disease knowledge graph. Details are shown in Table 1.

Electronic health record dataset. We use two types of information from electronic health records (EHRs), both taken from Beam et al. [beam2019clinical]. In particular, using 20 million medical notes, Beam et al. [beam2019clinical] created a dataset of disease concepts that appear together in the same note and used singular value decomposition (SVD) to decompose the resulting co-occurrence matrix and produce embedding vectors for disease concepts. We use information on co-occurring disease concepts to guide the sampling of negative node pairs when training on the disease knowledge graph. Further, we use embedding vectors from [beam2019clinical] to initialize embeddings in REMAP's graph neural network. For out-of-dictionary disease concepts, we use the average of all concept embeddings as their initial embeddings.

Human annotated dataset. We randomly selected 500 disease pairs to create an annotated dataset. For that, we recruited three senior physicians from Peking Union Medical College Hospital who manually labeled disease pairs. Annotation experts disagreed on labels for 14 disease pairs (i.e., 2.8% of the total number of disease pairs) that were resolved through consensus voting. The human annotated dataset is used for model comparison and performance evaluation.

Dataset    Split       Modality   Total      NA       DDx      MC       MBC      Entities
Unaligned  -           Triplet    96,913     30,546   20,657   23,411   22,298   9,182
Aligned    Train       Triplet    31,037     15,670   7,262    4,358    3,747    7,794
                       Language   1,244,874  799,194  208,921  123,735  113,024  -
           Validation  Triplet    7,754      3,918    1,821    1,065    950      4,433
                       Language   206,179    68,934   60,165   43,706   33,474   -
           Annotated   Triplet    499        8        210      159      122      733
                       Language   15,012     96       4,980    6,699    3,237    -
Table 1: Overview of the disease knowledge graph and the language dataset. Shown are statistics for the following relation types: Not Available (NA), Differential Diagnosis (DDx), May Cause (MC), and May Be Caused by (MBC). Total denotes the total number of all triplets (see Figure 2).

3.2 Training REMAP models

Next we outline training details of REMAP models, including negative sampling, pre-training strategy for language and graph models, and the multimodal learning approach.

Negative sampling. We construct negative samples by sampling disease pairs whose co-occurrence in the EHR co-occurrence matrix [beam2019clinical] is lower than a pre-defined threshold. In particular, we expect that two unrelated diseases rarely appear together in EHRs, meaning that the corresponding values in the co-occurrence matrix are low. Thus, such disease pairs represent suitable negative samples to train models for classifying disease-disease relations.
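A hedged sketch of this EHR-based negative sampling: disease pairs whose (symmetrized) co-occurrence count falls below a threshold are treated as candidate negatives. The toy co-occurrence dictionary, threshold, and sample count are illustrative assumptions:

```python
import random

def sample_negatives(cooc, threshold, k, rng):
    """Sample k disease pairs whose EHR co-occurrence count is below threshold."""
    def co(a, b):
        # Symmetric lookup: co-occurrence may be stored under either disease.
        return max(cooc.get(a, {}).get(b, 0), cooc.get(b, {}).get(a, 0))
    diseases = list(cooc)
    candidates = [(a, b) for a in diseases for b in diseases
                  if a < b and co(a, b) < threshold]
    return rng.sample(candidates, min(k, len(candidates)))

cooc = {
    "hypogonadism": {"Goldberg-Maxwell syndrome": 40},  # frequently co-mentioned
    "Goldberg-Maxwell syndrome": {},
    "ankle sprain": {"hypogonadism": 1},                # rarely co-mentioned
}
rng = random.Random(0)
negatives = sample_negatives(cooc, threshold=5, k=2, rng=rng)
```

The frequently co-occurring pair is excluded, so sampled negatives are unlikely to be true relationships while still involving real clinical concepts.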

Pre-training a language model. The text-based model comprises the text encoder and the TuckER module for disease-disease relation prediction. We denote the relation embeddings as $\mathbf{r}^{text}$ and the loss function as $\mathcal{L}_{text}$ (Section 2.3). In particular, we use the SciBERT tokenizer and the SciBERT-SciVocab-uncased model [wolf2019huggingface]. The entity markers are added to the SciVocab vocabulary, and their embeddings are initialized from a uniform distribution. We cap the number of sentences in a bag; if the bag size exceeds this cap, sentences are selected uniformly at random for training. Further details on hyper-parameters are in Table LABEL:tab:param.

Pre-training a graph neural network model. The graph-based model comprises the heterogeneous attention network encoder and the TuckER module for disease-disease relation prediction. The relation embeddings produced by the TuckER module are denoted as $\mathbf{r}^{graph}$. In the pre-training phase, the model is optimized for the graph-based loss function $\mathcal{L}_{graph}$ (Section 2.3). The initial embeddings for nodes in the disease knowledge graph are concept unique identifier (CUI) representations derived from the SVD decomposition of the EHR co-occurrence matrix [beam2019clinical]. Further details on hyper-parameters are in Table LABEL:tab:param.

Joint learning on the multimodal dataset. After data type-specific pre-training is completed, the text and graph models are fused in cross-modal learning. To this end, the shared relation vector is initialized as the average of the pre-trained relation embeddings, $\mathbf{r} = \frac{1}{2}(\mathbf{r}^{text} + \mathbf{r}^{graph})$. We consider two REMAP variants, namely REMAP-M and REMAP-B, that are optimized for different loss functions (Section 2.3). Details on hyper-parameter selection are in Table LABEL:tab:param.

3.3 Experimental setup

Next, we overview performance metrics, baseline methods, and variants of the REMAP approach.

Baseline methods and performance metrics. We consider 11 baseline methods for disease relation extraction, including 5 knowledge graph-based methods and 6 text-based methods. Detailed descriptions of all methods are in Appendix LABEL:sec:further-baselines. We evaluate predicted disease-disease relations by calculating the accuracy of predicted relations between disease concepts, which is an established approach for benchmarking relation extraction methods [hu2020open]. Specifically, given a triplet $(e_s, r, e_o)$ and a predicted score $p_r$, relation $r$ is predicted to exist between $e_s$ and $e_o$ if $p_r$ exceeds a threshold $t_r$, which corresponds to a binary classification task for each relation type. The threshold $t_r$ is a relation type-specific value determined such that binary classification performance achieves the maximal F1-score. We report classification accuracy, precision, recall, and F1-score for all experiments in this study.
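Selecting the relation type-specific threshold that maximizes F1 on held-out scores can be sketched as follows (variable names and the toy data are illustrative):

```python
def f1_at_threshold(scores, labels, t):
    """F1 of the binary rule: predict the relation when score >= t."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= t and y)
    fp = sum(1 for s, y in zip(scores, labels) if s >= t and not y)
    fn = sum(1 for s, y in zip(scores, labels) if s < t and y)
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

def best_threshold(scores, labels):
    """Pick the threshold (from observed scores) that maximizes F1."""
    return max(set(scores), key=lambda t: f1_at_threshold(scores, labels, t))

scores = [0.9, 0.8, 0.6, 0.4, 0.2]   # predicted scores for one relation type
labels = [1, 1, 1, 0, 0]             # validation labels
t = best_threshold(scores, labels)
```

In practice this search is run once per relation type on the validation split, and the resulting thresholds are then frozen for test-time evaluation.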

Variants of REMAP approach. We carry out an ablation study to examine the utility of key REMAP components. We consider the following three components and examine REMAP’s performance with and without each.


  • REMAP-B without joint learning. In text-only ablations, we use SciBERT to obtain concept embeddings $\mathbf{z}_s^{text}$ and $\mathbf{z}_o^{text}$ and combine them with a text-based relation embedding to score candidate relations. We consider three scoring functions to classify disease-disease relations and denote the models as SciBERT (linear), SciBERT (TransE), and SciBERT (TuckER). Similarly, in graph-only ablations, we first use a heterogeneous attention network to obtain graph embeddings $\mathbf{z}_s^{graph}$ and $\mathbf{z}_o^{graph}$ that are combined into predictions by the same scoring functions, yielding HAN (linear), HAN (TransE), and HAN (TuckER).

  • REMAP-B without EHR embeddings. REMAP-B uses EHR embeddings [beam2019clinical] as initial node embeddings. To examine the utility of EHR embeddings, we design an ablation study that initializes node embeddings using the popular Xavier initialization [glorot2010understanding] instead of EHR embeddings. Other parts of the model are the same as in REMAP-B.

  • REMAP-B without unaligned triplets. Unaligned triplets denote triplets in the disease knowledge graph that do not have corresponding sentences in the language dataset. To demonstrate how these unaligned triplets influence model performance, we design an ablation study in which we train a REMAP-B model on a reduced disease knowledge graph with unaligned triplets excluded.

4 Results

REMAP is a multimodal approach for joint learning on graph and text datasets. We evaluate REMAP's performance on disease relation extraction when the model is tasked to identify candidate disease relations in either text (Section 4.1) or graph-structured (Section 4.2) data. We present ablation (Appendix LABEL:sec:ablation) and case (Section 5) studies.

4.1 Extracting disease relations from text

We start with results on the human annotated dataset where each method, while it can be trained on a multimodal text-graph dataset, is tasked to extract disease relations from text alone, meaning that the test set consists of only annotated sentences. Table 2 shows performance on the annotated set for text-based methods. Neural network models outperform traditional machine learning methods such as Naive Bayes and SVM. Further, BiGRU+Attention is the best-performing baseline method, achieving an accuracy of 78.6% and an F1-score of 64.6%. We also find that REMAP-B and REMAP-M achieve the best performance across all settings, significantly outperforming baselines. In particular, REMAP models surpass the strongest baseline by 10.0 absolute percentage points in accuracy and 17.2 absolute percentage points in F1-score. These results show that multimodal learning can considerably advance disease relation extraction even when only one data type is available at test time.

Modality  Model              Accuracy                      F1-score
                             micro  DDx   MC    MBC        micro  DDx   MC    MBC
Text      Baselines
          SVM                59.5   59.0  68.2  51.2       31.5   40.9  25.4  19.0
          NB                 74.3   71.4  69.0  82.4       47.1   60.2  29.9  43.2
          RF                 72.0   68.6  70.8  76.4       37.8   53.1  25.5  19.2
          TextCNN            76.7   75.4  73.0  81.8       60.9   67.5  59.9  48.6
          BiGRU              77.4   73.0  77.2  82.0       62.0   67.9  54.0  59.8
          BiGRU+attention    78.6   75.0  78.2  82.6       64.6   67.7  63.5  60.6
          Ours
          REMAP              88.2   83.6  89.0  92.0       80.9   80.7  80.0  82.6
          REMAP-M            88.6   84.2  89.0  92.8       81.5   81.6  79.6  83.8
          REMAP-B            88.6   84.4  89.2  92.4       81.8   81.9  80.3  83.3
Graph     Baselines
          TransE_l2          75.1   70.7  72.7  81.8       63.2   68.0  57.0  62.2
          DistMult           69.8   77.5  61.3  70.5       56.1   71.0  43.4  51.5
          ComplEx            79.0   75.2  77.8  84.2       65.0   69.3  56.5  66.9
          RGCN               71.8   78.6  62.5  74.3       62.2   75.1  50.8  58.6
          TuckER             81.5   77.6  82.3  84.7       73.7   76.2  71.7  71.9
          Ours
          REMAP              89.6   86.4  89.6  92.8       83.5   84.3  81.6  84.2
          REMAP-M            89.3   87.0  88.4  92.6       83.3   85.7  78.8  83.8
          REMAP-B            89.8   87.3  89.9  92.2       84.1   85.8  82.4  82.7
Table 2: Results of disease relation extraction on the human annotated set. DDx: differential diagnosis, MC: may cause, MBC: may be caused by. The “micro” columns denote micro average accuracy or F1-score for DDx, MC, and MBC relation types. Further results are in Appendix LABEL:sec:further-performance.

4.2 Identifying disease relations in a graph-structured dataset

We proceed with results of disease relation prediction in a setting where each method is tasked to classify disease relations based on the disease knowledge graph alone. This setting evaluates the flexibility of REMAP as REMAP can answer either graph-based or text-based disease queries. In particular, Table 2 shows performance results attained on the human annotated dataset with query disease pairs given only as disease concept nodes in the knowledge graph. We find that graph-based baselines perform better than text-based baselines. Further, TuckER, the best graph-based method, outperforms BiGRU+Attention, a top-performing text-based baseline, by 2.9 (accuracy) and 8.9 absolute percentage points (F1-score). These results suggest that the disease knowledge graph contains more valuable predictive information than text when classifying disease-disease relations. Last, we find that REMAP-B is a top performer among REMAP variants, achieving an accuracy of 89.8 and an F1-score of 84.1. Further results are in Appendices LABEL:sec:ablation-LABEL:sec:further-analysis.

5 Discussion

Table LABEL:tab:ablation shows that joint learning considerably improves model performance in both the text and graph modalities. To further analyze joint learning in REMAP-B, we examine the characteristics of correctly and incorrectly classified disease-disease relationships. Comparing full REMAP-B with REMAP-B without joint learning (Table LABEL:tab:ablation), we found that on the text modality, 28 predicted disease-disease relationships turned from incorrect to correct, whereas 14 turned from correct to incorrect. On the graph modality, 19 predictions turned from incorrect to correct and 11 from correct to incorrect. This net gain in classification accuracy on the text modality explains why REMAP-B achieves better performance on the annotated set.
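The flip analysis above amounts to comparing two models' predictions against gold labels. A small sketch of such a comparison (a hypothetical helper, not part of REMAP):

```python
from typing import Sequence, Tuple

def count_flips(gold: Sequence, preds_a: Sequence, preds_b: Sequence) -> Tuple[int, int]:
    """Count examples where model B fixes (incorrect -> correct) or breaks
    (correct -> incorrect) model A's prediction relative to the gold label."""
    fixed = broken = 0
    for g, a, b in zip(gold, preds_a, preds_b):
        if a != g and b == g:
            fixed += 1
        elif a == g and b != g:
            broken += 1
    return fixed, broken
```

Applied to the ablation, `preds_a` would come from REMAP-B without joint learning and `preds_b` from full REMAP-B.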

We proceed with a case study examining the relationship between hypogonadism and Goldberg-Maxwell syndrome, which are separate and distinct diseases [nguyen2003pet, grymowicz2021complete]. However, many physicians fail to correctly differentiate between these two diseases in their diagnoses, especially because Goldberg-Maxwell syndrome is a rare disease. Differential diagnosis is the process by which a doctor differentiates between two or more conditions that could be behind a person’s symptoms.

Figure 3 illustrates how the prediction of the “differential diagnosis” (DDx) relationship between hypogonadism and Goldberg-Maxwell syndrome changed from incorrect to correct after joint learning with the text modality. Because REMAP-B correctly recognizes diseases in text and unifies them with diseases in the knowledge graph, it can exploit the rich local network neighborhoods that both diseases have in the graph. For example, hypogonadism and Goldberg-Maxwell syndrome have many outgoing edges of type differential diagnosis, their outgoing degrees in the graph are relatively high, and meta-paths connect the two diseases, such as the second-order “differential diagnosis-differential diagnosis” meta-path. With joint learning, the text-based model can extract part of the disease representation from the graph modality to update its internal representations and thus improve text-based classification of relations.
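Second-order DDx-DDx meta-paths of the kind described above can be enumerated directly from a triplet list, as in this toy sketch (the graph encoder learns from such structure rather than enumerating paths explicitly; the intermediate disease below is a made-up illustration):

```python
from collections import defaultdict
from typing import Iterable, List, Tuple

def second_order_paths(triplets: Iterable[Tuple[str, str, str]],
                       src: str, dst: str, rel: str = "DDx") -> List[str]:
    """Return intermediate nodes m with edges (src, rel, m) and (m, rel, dst)."""
    out = defaultdict(set)  # node -> set of rel-type neighbors
    for s, r, o in triplets:
        if r == rel:
            out[s].add(o)
    return sorted(m for m in out[src] if dst in out[m])
```

Each returned node witnesses one DDx-DDx meta-path between the source and destination diseases.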

6 Conclusion

We develop REMAP, a multimodal learning approach for disease relation extraction that integrates knowledge graphs and language models. Results on a dataset annotated by experts show that REMAP considerably outperforms methods for learning on text or knowledge graphs alone. Further, REMAP can extract and classify disease relationships in the most challenging settings, where either text or graph information is absent. Finally, we provide a new data resource of extracted relationships between diseases that can serve as a benchmarking dataset for systematic evaluation and comparison of disease relation extraction algorithms.

Funding and acknowledgments

We gratefully acknowledge the support of the Harvard Translational Data Science Center for a Learning Healthcare System (CELeHS). M.Z. is supported, in part, by NSF under nos. IIS-2030459 and IIS-2033384, US Air Force Contract No. FA8702-15-D-0001, Harvard Data Science Initiative, Amazon Research Award, Bayer Early Excellence in Science Award, AstraZeneca Research, and Roche Alliance with Distinguished Scientists Award. Any opinions, findings, conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the funders.

Author contributions

Y.L and K.L. are co-first authors and have developed the method and produced all analyses presented in the paper. S.Y. contributed experimental data. T.C. and M.Z. conceived and designed the study. The project was guided by M.Z., including designing methodology and outlining experiments. Y.L., K.L., S.Y., T.C., and M.Z. wrote the manuscript. All authors discussed the results and reviewed the paper.

Data and code availability

A Python implementation of REMAP is available on GitHub. The human-annotated dataset used for the evaluation of REMAP is also publicly available.

Conflicts of interest

None declared.

Figure 1: Overview of REMAP architecture. REMAP introduces a novel co-training learning strategy that continually updates a multimodal language-graph model for disease relation extraction and classification. Language and graph encoders specify deep transformation functions that embed disease concepts (i.e., subject and object entities) from the language data and disease knowledge graph into compact embeddings, producing condensed summaries of language semantics and biomedical knowledge for every disease. Embeddings output by the encoders are then combined in a disease relation type-specific manner (e.g., for the “differential diagnosis” and “may cause” relation types) and passed to a scoring function that calculates a probability representing how likely two diseases are to be related and what kind of relationship exists between them.
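The relation type-specific scoring described in this caption can be sketched with a DistMult-style bilinear form turned into a probability; this is an illustrative choice, and REMAP’s exact scoring function may differ:

```python
import math
from typing import Sequence

def relation_score(subj: Sequence[float], obj: Sequence[float],
                   rel: Sequence[float]) -> float:
    """DistMult-style score: subject and object embeddings are combined through
    a relation-type-specific diagonal weight vector, then squashed by a sigmoid
    into a probability that the relation holds."""
    logit = sum(s * r * o for s, r, o in zip(subj, rel, obj))
    return 1.0 / (1.0 + math.exp(-logit))
```

A separate weight vector per relation type (DDx, MC, MBC) lets the same entity embeddings yield different probabilities for different relationships.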
Figure 2: Creating a dataset split for performance evaluation and benchmarking. Triplets in the disease knowledge graph are divided into aligned triplets and unaligned triplets. Aligned triplets are triplets that have corresponding sentences in the language dataset. Unaligned triplets have no corresponding sentences in the language dataset.
Figure 3: Illustration of the (hypogonadism, DDx, Goldberg-Maxwell syndrome) triplet in REMAP. This triplet represents the relationship between two symptomatically similar diseases, hypogonadism and Goldberg-Maxwell syndrome. The subject entity (hypogonadism, C0020619) is shown in orange and the object entity (Goldberg-Maxwell syndrome, C0936016) is shown in blue. (a) The subject and object entities are identified with unique UMLS concept identifiers. The relation between them is differential diagnosis. There is a DDx-DDx meta-path between these entities, and that information is leveraged by our graph encoder to predict the relationship between hypogonadism and Goldberg-Maxwell syndrome. The bar plots indicate the distribution of relation types going out of the subject or object entities in the knowledge graph. (b) We show two sentences representing the triplet. We use the <sep> token to separate the sentences and the title of the article from which we mine the sentence. If the article is from PubMed, we use the special token <empty_title> as a placeholder. We add two special tokens before and after terms in each sentence to identify the positions of entities. For example, we add <entity-t047-1> before hypogonadism to mark the beginning of the subject entity; t047 serves as an identifier of the semantic type for hypogonadism.
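The input format described in panel (b) can be sketched as follows; the helper name is our own, and the exact marker placement is illustrative, following the token conventions in the caption:

```python
def format_input(sentence: str, title: str, mention: str,
                 sem_type: str, idx: int = 1) -> str:
    """Wrap an entity mention with its semantic-type marker token and append
    the article title after a <sep> token; use <empty_title> when no title."""
    tag = f"<entity-{sem_type}-{idx}>"
    marked = sentence.replace(mention, f"{tag} {mention} {tag}", 1)
    return f"{marked} <sep> {title if title else '<empty_title>'}"
```

The marker token encodes both the UMLS semantic type (e.g., t047) and the entity index, so the language encoder can locate subject and object mentions in the sentence.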