Modeling Drug-Disease Relations with Linguistic and Knowledge Graph Constraints

by Bruno Godefroy, et al.

FDA drug labels are rich sources of information about drugs and drug-disease relations, but their complexity makes them challenging texts to analyze in isolation. To overcome this, we situate these labels in two health knowledge graphs: one built from precise structured information about drugs and diseases, and another built entirely from a database of clinical narrative texts using simple heuristic methods. We show that Probabilistic Soft Logic models defined over these graphs are superior to text-only and relation-only variants, and that the clinical narratives graph delivers exceptional results with little manual effort. Finally, we release a new dataset of drug labels with annotations for five distinct drug-disease relations.



1 Introduction

The FDA Online Label Repository is a publicly available database of texts that provide detailed information about pharmaceutical drugs, including active ingredients, approved usage, warnings, and contraindications. Although the labels have predictable subsections and use highly regulated language, they are nuanced and presuppose deep medical knowledge, since they are intended for use by healthcare professionals (Shrank and Avorn, 2007). This makes them a formidable challenge for information extraction systems.

Figure 1: Illustrative knowledge graph for the prediction of the relation between a drug (Medrol) and a disease (dermatitis). In favor of a Treats relation: Medrol has the same pharmacologic class as Baycadron, which treats dermatitis; dermatitis has the same morphology as neuritis, which is treated by Medrol. Against a Treats relation: dermatitis has the same site as hypersensitivity, which is not treated by Medrol. Our Probabilistic Soft Logic approach learns to combine these factors.

At the same time, existing medical ontologies contain diverse structured information about drugs and diseases. The drug labels can be situated in a larger health knowledge graph that brings together these sources of information, which can then be used to understand the labels. Figure 1 illustrates the guiding intuition; if our goal is to determine whether to add the dashed Treats edge, we should take advantage of the label for Medrol as well as the drug–drug, drug–disease, and disease–disease relations that are observed. Hristovski et al. (2006) describe a similar intuition with what they call “discovery patterns”.

In this paper, we show that Probabilistic Soft Logic (PSL; Bröcheler et al. 2010; Bach et al. 2013) is an effective tool for modeling drug labels situated in larger health knowledge graphs. A PSL model is a set of logical statements interpreted as defeasible relational constraints, and the relative weights of these constraints are learned from data. In PSL, we can directly state intuitions like ‘if some drugs share a substance, then they might treat the same diseases’ and ‘if two diseases share a classification, they might be treated by the same drugs’. Fakhraei et al. (2013) show the value of these ideas for drug–disease relationships. We extend their basic idea by adapting the method of West et al. (2014): our PSL models combine graph-based constraints with constraints derived from a separate sequence-labeling model applied to the drug label texts.
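To make the idea of a weighted, defeasible rule concrete, the sketch below scores one grounding of a rule of the form "if two drugs share a substance and one treats a disease, the other might too". It assumes the Lukasiewicz relaxation that PSL uses for soft truth values in [0, 1]. The predicate names, the truth values, and the weight are illustrative assumptions, not values from the paper.

```python
def lukasiewicz_and(*truths: float) -> float:
    """Lukasiewicz conjunction of soft truth values in [0, 1]."""
    return max(0.0, sum(truths) - (len(truths) - 1))


def distance_to_satisfaction(body: float, head: float) -> float:
    """How far the rule body -> head is from being satisfied (0 means satisfied)."""
    return max(0.0, body - head)


# Hypothetical truth values for one grounding of the rule
#   SharesSubstance(d1, d2) & Treats(d1, s) -> Treats(d2, s)
shares_substance = 1.0  # observed graph edge
treats_d1_s = 0.9       # observed or inferred
treats_d2_s = 0.3       # the value being inferred

body = lukasiewicz_and(shares_substance, treats_d1_s)
penalty = distance_to_satisfaction(body, treats_d2_s)
weight = 2.0  # illustrative only; the paper learns rule weights from data
print("weighted penalty:", round(weight * penalty, 2))
```

Inference then amounts to choosing the unobserved truth values that minimize the sum of such weighted penalties over all groundings, which is why a rule can be overridden when other rules pull more strongly in the opposite direction.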

The most expensive parts of this model are the structured ontologies used to build the graph. Each ontology is a large, highly specialized project requiring extensive human expertise. This motivates us to seek out alternatives. To this end, we also evaluate on a graph that is derived entirely from clinical texts. Using only key-phrase matching, we instantiate a large, noisy graph with the same core structure as in figure 1, and we show that the PSL model is robust to such noisy inputs. By combining this graph with our structured one, we achieve larger gains, but the clinical narratives graph is effective on its own, and thus emerges as a viable option for domains lacking pre-built ontologies.

Our evaluations are centered around a new dataset of annotated drug labels. In this dataset, spans of text identifying diseases are annotated with labels summarizing the drug–disease relationship expressed in the label. We show that our full PSL model yields superior predictions on this dataset, as compared with ablations using only the texts and only the graph edges derived from structured and textual sources.

2 Medical Ontologies Knowledge Graph

Our health knowledge graphs are focused on drugs and diseases related to obesity. We first introduce the graph we built from structured sources (802 nodes, 883 edges).

2.1 FDA Online Label Repository

From the FDA Online Label Repository we extracted the drug labels matching at least one of the following keywords: “obesity”, “overweight”, “asthma”, “coronary heart disease”, “hypercholesterolemia”, “gallstones”, “gastroesophageal reflux”, “gout”, “hypertriglyceridemia”, “sleep apnea”, “peripheral vascular disease”, “chronic venous insufficiencies”. This is the set of diseases in the i2b2 Obesity Challenge (Uzuner, 2009), with some omissions for the sake of precision.

The FDA individuates drugs in a very fine-grained way, resulting in many duplicate or near-duplicate labels (there are usually multiple brands and versions of a drug). To ensure no duplicates in the intuitive sense, we hand-filtered to a set of 106 drug labels. These labels mention 198 distinct diseases, resulting in 1,110 drug–disease pairs.

2.2 Drug–Disease Annotations

Relation          Count   Agreement
Prevents            154        63.0
Treats            4,425        67.3
Treats Outcomes   2,268        67.1
Not Established     241        35.1
Not Recommended     262        49.5
Other             1,262        35.9

Table 1: Drug–disease relations collected using crowdsourcing. The “Agreement” column gives the average agreement between workers and the labels inferred. Higher agreement correlates strongly with the degree to which the information is explicit in the label text.

Disease mentions in drug labels have a variety of senses. Guided by our own analysis of the relevant sentences, we settled on the following set of relational descriptions, with input from clinical experts acting as consultants to us:

  • The drug prevents the disease (our label is Prevents).

  • The drug treats the disease (Treats).

  • The drug treats outcomes of the disease (Treats Outcomes).

  • The safety/effectiveness of the drug has not been established for the disease (Not Established).

  • The drug is not recommended for the disease (Not Recommended).

We emphasize the clinical importance of distinguishing between treating a disease and treating its outcomes. Cancer, for example, is not treated with Depo-Medrol, but hypercalcemia associated with cancer is. Similarly, Not Recommended identifies a contraindication, a specific kind of guidance, whereas the superficially similar Not Established has a more open meaning: the relation could be Treats, but this has not been tested or a clinical trial failed to demonstrate its efficacy.

We crowdsourced the task of assigning these labels to disease mention spans. Workers saw the entire label text, with our target diseases highlighted, and were asked to select the best statement among those provided above. We launched our task on Figure Eight, asking for 5 judgments for each drug–disease pair. To infer a label for each example from these responses, we applied Expectation Maximization (EM), which estimates the reliability of each worker and weights their contributions accordingly (Dawid and Skene, 1979).

Table 1 reports average agreement between workers and the inferred label. The Prevents, Treats, and Treats Outcomes relations are usually stated directly in drug labels, leading to high agreement. In contrast, Not Established, Not Recommended, and Other are more subtle and diverse, leading to lower agreement. The label of Normosol-R, for example, states: “The solution is not intended to supplant transfusion of whole blood or packed red cells in the presence of uncontrolled hemorrhage”. What is the relation between Normosol-R and hemorrhage? In this case, any of Not Established, Not Recommended, and Other seems acceptable.
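The label-inference step can be sketched as a small Dawid and Skene style EM loop. This is a minimal illustration under stated assumptions, not the authors' implementation: the `(item, worker, label)` input format, the smoothing constant, the fixed iteration count, and the toy judgments are all hypothetical.

```python
from collections import defaultdict


def dawid_skene(judgments, labels, n_iter=20, smoothing=0.01):
    """Infer one label per item from (item, worker, label) judgments with EM.

    Returns (inferred, posteriors): the most probable label per item and the
    posterior distribution over labels per item.
    """
    items = sorted({item for item, _, _ in judgments})
    workers = sorted({worker for _, worker, _ in judgments})
    by_item = defaultdict(list)
    for item, worker, label in judgments:
        by_item[item].append((worker, label))

    # Initialise posteriors with per-item vote proportions (soft majority vote).
    post = {}
    for item in items:
        votes = by_item[item]
        post[item] = {k: sum(1 for _, l in votes if l == k) / len(votes) for k in labels}

    for _ in range(n_iter):
        # M-step: class priors and one smoothed confusion matrix per worker.
        prior = {k: sum(post[i][k] for i in items) / len(items) for k in labels}
        conf = {w: {k: {l: smoothing for l in labels} for k in labels} for w in workers}
        for item in items:
            for worker, label in by_item[item]:
                for k in labels:
                    conf[worker][k][label] += post[item][k]
        for w in workers:
            for k in labels:
                total = sum(conf[w][k].values())
                for l in labels:
                    conf[w][k][l] /= total

        # E-step: posterior over the true label of each item.
        for item in items:
            scores = {}
            for k in labels:
                score = prior[k]
                for worker, label in by_item[item]:
                    score *= conf[worker][k][label]
                scores[k] = score
            norm = sum(scores.values())
            post[item] = {k: scores[k] / norm for k in labels}

    inferred = {i: max(labels, key=lambda k: post[i][k]) for i in items}
    return inferred, post


# Toy judgments (hypothetical): worker w3 disagrees with the others on a and b.
toy = [
    ("a", "w1", "Treats"), ("a", "w2", "Treats"), ("a", "w3", "Other"),
    ("b", "w1", "Other"), ("b", "w2", "Other"), ("b", "w3", "Treats"),
    ("c", "w1", "Treats"), ("c", "w2", "Treats"), ("c", "w3", "Treats"),
    ("d", "w1", "Other"), ("d", "w2", "Other"), ("d", "w3", "Other"),
]
inferred, posteriors = dawid_skene(toy, ["Treats", "Other"])
print(inferred)
```

The per-worker confusion matrices are what let the procedure discount a worker whose answers are frequently out of line with the emerging consensus, rather than counting every vote equally.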

We observe a correlation between the number of distinct disease mentions in a drug label and its ratio of Treats labels against the other labels. Some drugs are more “general purpose” than others, in that they are involved in many treatment relations. For example, the label for Prednisone contains mentions of 50 distinct diseases it can treat. To prevent our system from predicting treatment relations primarily using the out-degree of drug nodes (instead of the domain knowledge provided by the graph), we manually removed 19 of these high-degree drugs. After removing these drugs, our dataset contained 431 relations.

2.3 Structured Medical Ontologies

Our graph contains structured information about drugs and diseases from a variety of sources. Our observations for drugs are route (OpenFDA), pharmacologic class (OpenFDA), substances (OpenFDA), and dosage form (RxNorm). Our observations for diseases are all from SNOMED CT: finding sites, associated morphologies, and courses.

In general, these ontologies are quite sparse; a missing edge is therefore not necessarily evidence for a missing relationship. Broadly speaking, this is why text analysis can be so meaningful in this context – structured resources always fall behind because of the challenges of manual creation. This sparsity also motivates the approximate, text-based graph we introduce next.

3 Clinical Narratives Knowledge Graph

The Medical Ontologies Knowledge Graph is precise, but the underlying resources are expensive to build, which creates sparsity and can be an obstacle to adapting the ideas to new domains. To try to address this, we built a comparable graph using only de-identified clinical narratives – reports (often transcribed voice recordings) of clinicians’ interactions with patients.

Using lexicon-based matching methods, we extracted drug and disease mentions from these texts. In the resulting graph, each text is a node, with edges to the drug and disease nodes corresponding to these extracted mentions. The resulting graph has 319,598 nodes and 421,502 edges.
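A minimal sketch of this construction follows, assuming plain whole-phrase matching against two lexicons. The lexicons, the narrative texts, and the node encoding are hypothetical. The edge name `is_mentioned_in` follows the relation named in the caption of table 2.

```python
import re


def build_narrative_graph(narratives, drug_lexicon, disease_lexicon):
    """Return (nodes, edges) linking each narrative to the terms it mentions.

    narratives: dict mapping a narrative id to its text.
    drug_lexicon, disease_lexicon: iterables of lower-case key phrases.
    """
    nodes, edges = set(), set()
    for doc_id, text in narratives.items():
        doc_node = ("narrative", doc_id)
        nodes.add(doc_node)
        lowered = text.lower()
        for kind, lexicon in (("drug", drug_lexicon), ("disease", disease_lexicon)):
            for phrase in lexicon:
                # Whole-phrase match so a short term does not fire inside a longer word.
                if re.search(r"\b" + re.escape(phrase) + r"\b", lowered):
                    term_node = (kind, phrase)
                    nodes.add(term_node)
                    edges.add((term_node, "is_mentioned_in", doc_node))
    return nodes, edges


# Toy inputs (hypothetical, not taken from the clinical narratives database).
toy_narratives = {
    "n1": "Patient with asthma and gout, started on prednisone.",
    "n2": "History of gout. No current medications.",
}
nodes, edges = build_narrative_graph(toy_narratives, ["prednisone"], ["asthma", "gout"])
print(len(nodes), "nodes,", len(edges), "edges")
```

Matching of this kind ignores negation and context ("no history of gout" still produces an edge), which is one source of the noise that the PSL weights are expected to absorb.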

This is a simpler graph structure than we are able to obtain from structured resources (cf. section 2.3), but it supports our most important logical connections. For example, we can say that if diseases $d_1$ and $d_2$ are mentioned in the same narrative and $d_1$ is treated by a specific drug, then $d_2$ might also be treated by that drug. These hypotheses are often not supported; for example, a patient with a number of unrelated medical conditions might lead to many false instances of this claim. Nonetheless, we expect that, in aggregate, these connections will prove informative.

4 Models

Table 2: PSL rules. The diagrams show inferred edges (black) based on observed nodes and edges (gray), drug nodes (green), and disease nodes (red). Separate variables range over drugs and over diseases. The first two rules serve as priors on the Treats relation. The disease relation R can be has_associated_morphology, has_course, or has_finding_site. The corresponding drug relation can be has_route, has_substance, has_doseform, or has_pharmclass. The remaining variable ranges over the semantically appropriate entities given the relation in question. For the Clinical Narratives Graph, both the disease relation R and the drug relation have only the value is_mentioned_in, and the remaining variable ranges over clinical narratives.

Our core task is to identify drug–disease Treats relations based on the drug’s label text. We consider text-only, graph-only, and combined approaches to this problem.

4.1 Text-Only Model

Our text-only model is a separately optimized conditional random fields (CRF; Lafferty et al. 2001) model trained on 2,000 annotated sentences sampled from the full dataset of FDA drug labels. When this CRF is used in isolation, we say that a drug treats a disease just in case our trained CRF identifies at least one Treats span describing that disease in the label for that drug. To incorporate this CRF into our PSL models, we simply add two logical statements: one to influence the prediction confidence positively, another negatively (table 2, rules 1a,b). This encodes the goal of agreeing with this model’s predictions.
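The decision rule for the text-only model can be sketched as follows. The triple format for predicted spans is an assumption made for illustration; the example pairs follow the Depo-Medrol discussion in section 2.2.

```python
def crf_treats_pairs(predicted_spans):
    """Collect (drug, disease) pairs with at least one predicted Treats span.

    predicted_spans: iterable of (drug, disease, relation) triples, one per
    disease mention span labelled by a sequence model (hypothetical format).
    """
    return {(drug, disease) for drug, disease, relation in predicted_spans
            if relation == "Treats"}


def text_only_prediction(drug, disease, treats_pairs):
    """Return 1.0 if the text-only model asserts Treats(drug, disease), else 0.0."""
    return 1.0 if (drug, disease) in treats_pairs else 0.0


# Illustrative spans: hypercalcemia is treated, cancer only has its outcomes treated.
spans = [
    ("Depo-Medrol", "hypercalcemia", "Treats"),
    ("Depo-Medrol", "cancer", "Treats Outcomes"),
    ("Depo-Medrol", "hypercalcemia", "Other"),
]
pairs = crf_treats_pairs(spans)
print(text_only_prediction("Depo-Medrol", "hypercalcemia", pairs))
print(text_only_prediction("Depo-Medrol", "cancer", pairs))
```

In the combined model these 0/1 values would serve as observed evidence that the two CRF rules pull the Treats prediction toward, rather than as the final answer.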

4.2 PSL Graph Rules

Our PSL rules fall into a few major classes. The guiding intuitions are that (i) relations connecting the same drug–disease pairs should have the same confidence, (ii) drugs that have nodes in common should treat the same diseases, and (iii) diseases that have nodes in common should be treated by the same drugs. Table 2 schematizes the full set of rules that we define over both our graphs.
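Intuition (ii) can be illustrated by enumerating the groundings of one rule schema over a toy graph. The notation `R(d1, x) & R(d2, x) & Treats(d1, s) -> Treats(d2, s)`, the data structures, and the class node name are assumptions for this sketch; the PSL package performs grounding itself, and its counting may differ. The drugs and diseases follow figure 1.

```python
from collections import defaultdict
from itertools import permutations


def ground_shared_node_rule(drug_edges, treats):
    """Enumerate groundings of R(d1, x) & R(d2, x) & Treats(d1, s) -> Treats(d2, s).

    drug_edges: set of (drug, relation, node) triples from the graph.
    treats: dict mapping (drug, disease) to a truth value in [0, 1].
    """
    drugs_by_node = defaultdict(set)
    for drug, relation, node in drug_edges:
        drugs_by_node[(relation, node)].add(drug)

    groundings = []
    for relation, node in sorted(drugs_by_node):
        drugs = sorted(drugs_by_node[(relation, node)])
        for d1, d2 in permutations(drugs, 2):
            for (drug, disease), value in sorted(treats.items()):
                # Only groundings whose body can be non-zero are kept.
                if drug == d1 and value > 0:
                    groundings.append((relation, node, d1, d2, disease))
    return groundings


# Toy graph: both drugs share a pharmacologic class node (name is illustrative).
drug_edges = {
    ("Medrol", "has_pharmclass", "corticosteroid"),
    ("Baycadron", "has_pharmclass", "corticosteroid"),
}
treats = {("Baycadron", "dermatitis"): 1.0, ("Medrol", "neuritis"): 1.0}
for grounding in ground_shared_node_rule(drug_edges, treats):
    print(grounding)
```

Each grounding contributes one weighted penalty term during inference, so a rule with many groundings can dominate unless its weight is scaled accordingly.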

4.3 Optimization

We use the Probabilistic Soft Logic package. Details about our optimization choices are given in appendix A.2. The rules for our full model (all instances of the schemas in table 2), along with their learned weights and groundings (supporting graph configurations), are given in appendix A.3.

5 Experiments

Figure 2 reports our main results (see appendix A.1 for additional experimental details). Both the CRF and the graph rules make positive contributions, and any combination of these two sources of information is superior to its text-only or graph-only ablations. Furthermore, our Clinical Narratives Graph proves extremely powerful despite its simplistic construction. Numerically, our best model for essentially any amount of evidence is one that includes rules that reason about both graphs in addition to the CRF predictions, though using the Narratives Graph alone is highly competitive with this larger one. These findings strongly support our core hypothesis that information extraction from drug labels can benefit from both text analysis and graph-derived information.

Figure 2: Area under the precision-recall curve (AUC) as a function of evidence ratio (the proportion of drug–disease edges that are observed). We favor this measure over the area under the receiver operating characteristic (ROC) curve because the dataset is highly imbalanced: only 24.8% of relations are positive for the Treats label. Results are averaged over 100 runs for each model.
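For readers who want to reproduce the metric, the sketch below computes the area under the precision-recall curve with the step-wise (average precision) sum. The source does not say which interpolation was used, so this choice is an assumption, and tied scores are not treated specially.

```python
def pr_auc(scores, labels):
    """Area under the precision-recall curve via the average-precision sum.

    scores: predicted confidences; labels: 1 for a positive Treats edge, else 0.
    """
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    n_pos = sum(labels)
    if n_pos == 0:
        raise ValueError("need at least one positive example")
    true_pos, area = 0, 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label:
            true_pos += 1
            # Recall rises by 1 / n_pos here; weight that step by precision at this rank.
            area += (true_pos / rank) / n_pos
    return area


# Toy ranking: positives at ranks 1 and 3 out of 4.
print(pr_auc([0.9, 0.8, 0.7, 0.6], [1, 0, 1, 0]))
```

Unlike ROC-based scores, this quantity is sensitive to the share of positives, which makes it a stricter measure when only about a quarter of the edges are positive.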

6 Conclusion

We presented evidence that combining text analysis with PSL rules defined over a knowledge graph can improve predictions about drug–disease treatment relations, and we showed that this can be effective even where the graph is derived heuristically from clinical texts, which means that the techniques can be applied even in domains that lack rich, broad coverage ontologies.


References

  • Bach et al. (2013) Stephen Bach, Bert Huang, Ben London, and Lise Getoor. 2013. Hinge-loss Markov random fields: Convex inference for structured prediction. arXiv:1309.6813.
  • Boyd et al. (2011) Stephen Boyd, Neal Parikh, Eric Chu, Borja Peleato, Jonathan Eckstein, et al. 2011. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning, 3(1):1–122.
  • Bröcheler et al. (2010) Matthias Bröcheler, Lilyana Mihalkova, and Lise Getoor. 2010. Probabilistic similarity logic. In Proceedings of the Twenty-Sixth Conference on Uncertainty in Artificial Intelligence, UAI’10, pages 73–82, Arlington, Virginia, United States. AUAI Press.
  • Dawid and Skene (1979) Alexander Philip Dawid and Allan M Skene. 1979. Maximum likelihood estimation of observer error-rates using the EM algorithm. Applied Statistics, pages 20–28.
  • Fakhraei et al. (2013) Shobeir Fakhraei, Louiqa Raschid, and Lise Getoor. 2013. Drug-target interaction prediction for drug repurposing with probabilistic similarity logic. In Proceedings of the 12th International Workshop on Data Mining in Bioinformatics, pages 10–17. ACM.
  • Hristovski et al. (2006) Dimitar Hristovski, Carol Friedman, Thomas C Rindflesch, and Borut Peterlin. 2006. Exploiting semantic relations for literature-based discovery. In AMIA annual symposium proceedings, volume 2006, page 349. American Medical Informatics Association.
  • Lafferty et al. (2001) John Lafferty, Andrew McCallum, and Fernando Pereira. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of ICML-01, pages 282–289.
  • Shrank and Avorn (2007) William H Shrank and Jerry Avorn. 2007. Educating patients about their medications: the potential and limitations of written drug information. Health Affairs, 26(3):731–740.
  • Uzuner (2009) Özlem Uzuner. 2009. Recognizing obesity and comorbidities in sparse data. Journal of the American Medical Informatics Association, 16(4):561–570.
  • West et al. (2014) Robert West, Hristo S. Paskov, Jure Leskovec, and Christopher Potts. 2014. Exploiting social network structure for person-to-person sentiment analysis. Transactions of the Association for Computational Linguistics, 2(2):297–310.

Appendix A Supplemental Material

A.1 Experimental Details

For each run, we sample two distinct sets of diseases. Then, with each set, we create a subgraph by considering all the drugs adjacent to these diseases, and the edges between these drugs and diseases. The two subgraphs have distinct disease nodes, distinct edges, and some drug nodes in common. One of these subgraphs is used for training, the other one for evaluation. In each subgraph, we sample 25% of the edges to serve as prediction targets. Depending on the evidence ratio, some of the remaining 75% of the edges are provided as observations to the model. For an evidence ratio of 0.75, for example, 75% of the edges in each subgraph are observed and the other 25% are predicted by the models.
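The sampling procedure can be sketched as below. The number of diseases per subgraph is not stated in the source, so `n_diseases_per_split` is a hypothetical parameter, as are the function names and the representation of edges as `(drug, disease)` pairs.

```python
import random


def split_edges(edges, evidence_ratio, rng, target_fraction=0.25):
    """Split one subgraph's drug-disease edges into (observed, targets).

    target_fraction of the edges are held out for prediction; evidence_ratio is
    the share of all edges handed to the model as observations.
    """
    if not 0.0 <= evidence_ratio <= 1.0 - target_fraction:
        raise ValueError("evidence_ratio must leave the target edges unobserved")
    shuffled = list(edges)
    rng.shuffle(shuffled)
    n_targets = round(target_fraction * len(shuffled))
    n_observed = round(evidence_ratio * len(shuffled))
    targets = shuffled[:n_targets]
    observed = shuffled[n_targets:n_targets + n_observed]
    return observed, targets


def sample_run(edges, n_diseases_per_split, evidence_ratio, seed=0):
    """Build train and evaluation subgraphs over disjoint disease sets."""
    rng = random.Random(seed)
    diseases = sorted({disease for _, disease in edges})
    chosen = rng.sample(diseases, 2 * n_diseases_per_split)
    splits = []
    for group in (chosen[:n_diseases_per_split], chosen[n_diseases_per_split:]):
        group = set(group)
        # Keeping every edge into the chosen diseases also keeps all adjacent drugs.
        sub_edges = [(drug, disease) for drug, disease in edges if disease in group]
        splits.append(split_edges(sub_edges, evidence_ratio, rng))
    return splits  # [(train_observed, train_targets), (eval_observed, eval_targets)]


# Toy graph: 5 drugs fully connected to 8 diseases (hypothetical data).
toy_edges = [(f"drug{i}", f"disease{j}") for i in range(5) for j in range(8)]
(train_obs, train_tgt), (eval_obs, eval_tgt) = sample_run(toy_edges, 4, 0.75, seed=1)
print(len(train_obs), len(train_tgt), len(eval_obs), len(eval_tgt))
```

Because the split is made over diseases rather than edges, drugs can appear in both subgraphs while no disease does, which matches the description of shared drug nodes and distinct disease nodes.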

A.2 Optimization Details

In training, initial weight values are very important. Whatever the learning rate and number of steps, it is easy to get stuck in local optima. We computed the initial weights using the number of groundings for each rule, such that (i) each source of information (CRF, ontologies, narratives) has the same contribution, and (ii) for a given source of information, each rule has the same contribution.
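One way to read conditions (i) and (ii) is that a rule's total contribution is its weight multiplied by its number of groundings. The sketch below implements that reading; the exact formula is not given in the source, so it is an assumption, as is the decision to skip rules with zero groundings. The CRF counts are taken from table A1; the second source and its counts are hypothetical.

```python
def initial_weights(groundings_by_source):
    """Give each source, and each rule within a source, an equal total contribution.

    groundings_by_source: {source: {rule: number_of_groundings}}.
    Assumes a rule's total contribution is weight * groundings.
    """
    n_sources = len(groundings_by_source)
    weights = {}
    for source, rules in groundings_by_source.items():
        # Rules without groundings cannot contribute, so they get no weight here.
        active = {rule: n for rule, n in rules.items() if n > 0}
        for rule, n in active.items():
            share = 1.0 / (n_sources * len(active))
            weights[(source, rule)] = share / n
    return weights


counts = {
    "crf": {"Rule 1a": 107, "Rule 1b": 324},                    # from table A1
    "example_graph": {"rule x": 10, "rule y": 0, "rule z": 40},  # hypothetical
}
for key, value in sorted(initial_weights(counts).items()):
    print(key, round(value, 6))
```

Under this scheme a rule with many groundings starts with a proportionally smaller weight, so no single rule dominates before learning begins.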

The weights were learned by optimizing the pseudo-log-likelihood of the data using the voted perceptron algorithm, as implemented by the PSL package. We preferred the pseudo-log-likelihood to the log-likelihood for scalability reasons.

Each model was trained over 10 iterations, with a training step of 1.

For inference, we used the Alternating Direction Method of Multipliers (ADMM; Boyd et al. 2011), as implemented by the PSL package. Consensus and local variables were initialized to a fixed value (0.25, close to the true positive Treats edge ratio), instead of randomly, to speed up convergence. The absolute and relative error components of stopping criteria were set to , with a maximum of 25,000 iterations.

A.3 Learned Weights and Groundings

Rule               Relative learned weight   Groundings
Prior (positive)   1.10                      431
Prior (negative)   0.87                      431
Rule 1a            1.02                      107
Rule 1b            0.98                      324
Rule 2a with       1.05                      3
Rule 2b with       1.11                      30
Rule 2a with                                 0
Rule 2b with       0.97                      14
Rule 2a with       1.13                      37
Rule 2b with       0.88                      45
Rule 3a with       3.09                      347
Rule 3b with       0.00                      1,095
Rule 3a with       1.36                      65
Rule 3b with       1.00                      140
Rule 3a with       1.99                      164
Rule 3b with       0.77                      384
Rule 3a with       1.00                      52
Rule 3b with       1.00                      101
Rule 2a with       22.12                     29,803
Rule 2b with       0.0                       115,251
Rule 3a with       7.42                      2,568
Rule 3b with       0.90                      14,062

Table A1: Learned weights and groundings in our full model (evidence ratio = 0.75). Rule numbers refer to the schemas in table 2. Weights are relative to their initial value (ratio of the learned weight over its respective initial weight). Weights assigned to “positive” rules are often larger than the weight of their respective negative rule, which suggests that positive groundings are usually more informative than negative ones. For example, drugs that have the same route are more likely to have the same positive Treats edges than to have the same negative Treats edges.