1 Introduction
The FDA Online Label Repository is a publicly available database of texts that provide detailed information about pharmaceutical drugs, including active ingredients, approved usage, warnings, and contraindications. Although the labels have predictable subsections and use highly regulated language, they are nuanced and presuppose deep medical knowledge, since they are intended for use by healthcare professionals (Shrank and Avorn, 2007). This makes them a formidable challenge for information extraction systems.
At the same time, existing medical ontologies contain diverse structured information about drugs and diseases. The drug labels can be situated in a larger health knowledge graph that brings together these sources of information, which can then be used to understand the labels. Figure 1 illustrates the guiding intuition; if our goal is to determine whether to add the dashed Treats edge, we should take advantage of the label for Medrol as well as the drug–drug, drug–disease, and disease–disease relations that are observed. Hristovski et al. (2006) describe a similar intuition with what they call “discovery patterns”.
In this paper, we show that Probabilistic Soft Logic (PSL; Bröcheler et al. 2010; Bach et al. 2013) is an effective tool for modeling drug labels situated in larger health knowledge graphs. A PSL model is a set of logical statements interpreted as defeasible relational constraints, and the relative weights of these constraints are learned from data. In PSL, we can directly state intuitions like ‘if some drugs share a substance, then they might treat the same diseases’ and ‘if two diseases share a classification, they might be treated by the same drugs’. Fakhraei et al. (2013) show the value of these ideas for drug–disease relationships. We extend their basic idea by adapting the method of West et al. (2014): our PSL models combine graph-based constraints with constraints derived from a separate sequence-labeling model applied to the drug label texts.
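To make the notion of a weighted, defeasible constraint concrete, the following minimal sketch computes the penalty a single rule grounding incurs under the Łukasiewicz relaxation that PSL uses for soft truth values in [0, 1]. The function names, the weight of 1.5, and the truth values are illustrative assumptions, not values from our models.

```python
def lukasiewicz_and(*truths):
    """Lukasiewicz conjunction of soft truth values in [0, 1]."""
    return max(0.0, sum(truths) - (len(truths) - 1))


def distance_to_satisfaction(body, head):
    """How far the soft implication (body -> head) is from being satisfied."""
    return max(0.0, lukasiewicz_and(*body) - head)


def weighted_penalty(weight, body, head, squared=False):
    """Hinge-style penalty contributed by one grounding of a weighted rule."""
    d = distance_to_satisfaction(body, head)
    return weight * (d ** 2 if squared else d)


# Hypothetical grounding of 'drugs sharing a substance treat the same diseases':
# the drugs share a substance (1.0), the first drug treats the disease (0.9),
# and the current belief that the second drug treats it is 0.4.
penalty = weighted_penalty(1.5, body=[1.0, 0.9], head=0.4)
```

A rule whose head is at least as true as its body incurs no penalty, which is what makes the constraint defeasible rather than hard: inference trades off the penalties of all groundings against one another.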
The most expensive parts of this model are the structured ontologies used to build the graph. Each ontology is a large, highly specialized project requiring extensive human expertise. This motivates us to seek out alternatives. To this end, we also evaluate on a graph that is derived entirely from clinical texts. Using only keyphrase matching, we instantiate a large, noisy graph with the same core structure as in figure 1, and we show that the PSL model is robust to such noisy inputs. By combining this graph with our structured one, we achieve larger gains, but the clinical narratives graph is effective on its own, and thus emerges as a viable option for domains lacking prebuilt ontologies.
Our evaluations are centered around a new dataset of annotated drug labels. In this dataset, spans of text identifying diseases are annotated with labels summarizing the drug–disease relationship expressed in the label (https://github.com/roamanalytics/roamresearch/tree/master/Papers/PSLdrugsdiseases). We show that our full PSL model yields superior predictions on this dataset, as compared with ablations using only the texts and only the graph edges derived from structured and textual sources.
2 Medical Ontologies Knowledge Graph
Our health knowledge graphs are focused on drugs and diseases related to obesity. We first introduce the graph we built from structured sources (802 nodes, 883 edges).
2.1 FDA Online Label Repository
From the FDA Online Label Repository (https://labels.fda.gov), we extracted the drug labels matching at least one of the following keywords: “obesity”, “overweight”, “asthma”, “coronary heart disease”, “hypercholesterolemia”, “gallstones”, “gastroesophageal reflux”, “gout”, “hypertriglyceridemia”, “sleep apnea”, “peripheral vascular disease”, “chronic venous insufficiencies”. This is the set of diseases in the i2b2 Obesity Challenge (Uzuner, 2009), with some omissions for the sake of precision.
The FDA individuates drugs in a very fine-grained way, resulting in many duplicate or near-duplicate labels (there are usually multiple brands and versions of a drug). To ensure no duplicates in the intuitive sense, we hand-filtered to a set of 106 drug labels. These labels mention 198 distinct diseases, resulting in 1,110 drug–disease pairs.
2.2 Drug–Disease Annotations
Table 1: Average agreement between workers and the inferred label, by relation.
Relation  Count  Agreement
Prevents  154  63.0
Treats  4,425  67.3
Treats Outcomes  2,268  67.1
Not Established  241  35.1
Not Recommended  262  49.5
Other  1,262  35.9
Disease mentions in drug labels have a variety of senses. Guided by our own analysis of the relevant sentences, we settled on the following set of relational descriptions, with input from clinical experts acting as consultants to us:

The drug prevents the disease (our label is Prevents).

The drug treats the disease (Treats).

The drug treats outcomes of the disease (Treats Outcomes).

The safety/effectiveness of the drug has not been established for the disease (Not Established).

The drug is not recommended for the disease (Not Recommended).
We emphasize the clinical importance of distinguishing between treating a disease and treating its outcomes. Cancer, for example, is not treated with Depo-Medrol, but hypercalcemia associated with cancer is. Similarly, Not Recommended identifies a contraindication, a specific kind of guidance, whereas the superficially similar Not Established has a more open meaning: the relation could be Treats, but this has not been tested or a clinical trial failed to demonstrate its efficacy.
We crowdsourced the task of assigning these labels to disease mention spans. Workers saw the entire label text, with our target diseases highlighted, and were asked to select the best statement among those provided above. We launched our task on Figure Eight, asking for 5 judgments for each drug–disease pair. To infer a label for each example from these responses, we applied Expectation Maximization (EM), which estimates the reliability of each worker and weights their contributions accordingly (Dawid and Skene, 1979). Table 1 reports average agreement between workers and the inferred label. The Prevents, Treats, and Treats Outcomes relations are usually stated directly in drug labels, leading to high agreement. In contrast, Not Established, Not Recommended, and Other are more subtle and diverse, leading to lower agreement. The label of Normosol-R, for example, states: “The solution is not intended to supplant transfusion of whole blood or packed red cells in the presence of uncontrolled hemorrhage”. What is the relation between Normosol-R and hemorrhage? In this case, any of Not Established, Not Recommended, and Other seems acceptable.
We observe a correlation between the number of distinct disease mentions in a drug label and its ratio of Treats labels against the other labels. Some drugs are more “general purpose” than others, in that they are involved in many treatment relations. For example, the label for Prednisone contains mentions of 50 distinct diseases it can treat. To prevent our system from predicting treatment relations primarily using the out-degree of drug nodes (instead of the domain knowledge provided by the graph), we manually removed 19 of these high-degree drugs. After removing these drugs, our dataset contained 431 relations.
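The EM-based label inference described above can be sketched as follows. This is a compact, illustrative version of the Dawid and Skene (1979) estimator, not the code used to build the dataset; the function name, the smoothing constant, the number of iterations, and the toy judgments are all assumptions.

```python
from collections import defaultdict


def dawid_skene(judgments, labels, n_iter=50, smoothing=0.01):
    """judgments: (item, worker, label) triples. Returns {item: {label: posterior}}."""
    items = sorted({i for i, _, _ in judgments})
    workers = sorted({w for _, w, _ in judgments})
    by_item = defaultdict(list)
    for i, w, l in judgments:
        by_item[i].append((w, l))

    # Initialise each item's posterior with its vote proportions.
    post = {
        i: {c: sum(1 for _, l in by_item[i] if l == c) / len(by_item[i]) for c in labels}
        for i in items
    }
    for _ in range(n_iter):
        # M-step: class priors and smoothed per-worker confusion matrices.
        prior = {c: sum(post[i][c] for i in items) / len(items) for c in labels}
        conf = {w: {c: {l: smoothing for l in labels} for c in labels} for w in workers}
        for i in items:
            for w, l in by_item[i]:
                for c in labels:
                    conf[w][c][l] += post[i][c]
        for w in workers:
            for c in labels:
                total = sum(conf[w][c].values())
                for l in labels:
                    conf[w][c][l] /= total
        # E-step: posterior over the true label of each item.
        for i in items:
            scores = {}
            for c in labels:
                score = prior[c]
                for w, l in by_item[i]:
                    score *= conf[w][c][l]
                scores[c] = score
            z = sum(scores.values())
            post[i] = {c: scores[c] / z for c in labels}
    return post


# Toy data: workers "a" and "b" are careful, worker "c" always answers "Treats".
toy = [(i, w, "Treats") for i in (1, 2) for w in "abc"]
toy += [(i, w, "Other") for i in (3, 4) for w in "ab"]
toy += [(i, "c", "Treats") for i in (3, 4)]
posteriors = dawid_skene(toy, labels=["Treats", "Other"])
inferred = {i: max(p, key=p.get) for i, p in posteriors.items()}
```

Because each worker gets their own confusion matrix, a worker who answers indiscriminately contributes little evidence, which is the sense in which contributions are weighted by reliability.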
2.3 Structured Medical Ontologies
Our graph contains structured information about drugs and diseases from a variety of sources. Our observations for drugs are route (OpenFDA; https://open.fda.gov), pharmacologic class (OpenFDA), substances (OpenFDA), and dosage form (RxNorm; https://www.nlm.nih.gov/research/umls/rxnorm/). Our observations for diseases are all from SNOMED CT (https://www.snomed.org): finding sites, associated morphologies, and courses.
In general, these ontologies are quite sparse; a missing edge is therefore not necessarily evidence for a missing relationship. Broadly speaking, this is why text analysis can be so meaningful in this context – structured resources always fall behind because of the challenges of manual creation. This sparsity also motivates the approximate, text-based graph we introduce next.
3 Clinical Narratives Knowledge Graph
The Medical Ontologies Knowledge Graph is precise, but the underlying resources are expensive to build, which creates sparsity and can be an obstacle to adapting the ideas to new domains. To try to address this, we built a comparable graph using only de-identified clinical narratives – reports (often transcribed voice recordings) of clinicians’ interactions with patients.
Using lexicon-based matching methods, we extracted drug and disease mentions from these texts. In the resulting graph, each text is a node, with edges to the drug and disease nodes corresponding to these extracted mentions. The resulting graph has 319,598 nodes and 421,502 edges.
This is a simpler graph structure than we are able to obtain from structured resources (cf. section 2.3), but it supports our most important logical connections. For example, we can say that if two diseases are mentioned in the same narrative and one of them is treated by a specific drug, then the other might also be treated by that drug. These hypotheses are often not supported; for example, a patient with a number of unrelated medical conditions might lead to many false instances of this claim. Nonetheless, we expect that, in aggregate, these connections will prove informative.
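The construction above can be sketched in a few lines. The matching strategy here (case-insensitive whole-word search), the function names, and the toy narratives and lexicons are assumptions for illustration; the text specifies only that matching is lexicon-based.

```python
import re
from collections import defaultdict


def build_narrative_graph(narratives, drug_lexicon, disease_lexicon):
    """Return (narrative_id, kind, term) edges for each lexicon term found in a text."""
    edges = set()
    for doc_id, text in narratives.items():
        lowered = text.lower()
        for kind, lexicon in (("drug", drug_lexicon), ("disease", disease_lexicon)):
            for term in lexicon:
                # Whole-word match, so "gout" does not fire inside a longer word.
                if re.search(r"\b" + re.escape(term.lower()) + r"\b", lowered):
                    edges.add((doc_id, kind, term))
    return edges


def co_mentioned_diseases(edges):
    """Pairs of distinct diseases that share at least one narrative node."""
    by_doc = defaultdict(set)
    for doc_id, kind, term in edges:
        if kind == "disease":
            by_doc[doc_id].add(term)
    return {(a, b) for terms in by_doc.values() for a in terms for b in terms if a < b}


# Hypothetical toy inputs.
narratives = {
    "n1": "Patient with asthma and gout, started on prednisone.",
    "n2": "History of gout.",
}
edges = build_narrative_graph(narratives, ["prednisone"], ["asthma", "gout", "obesity"])
pairs = co_mentioned_diseases(edges)
```

The co-mention pairs are exactly the configurations that the disease–disease rule can ground on; no attempt is made here to filter out the spurious pairs discussed above.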
4 Models
Our core task is to identify drug–disease Treats relations based on the drug’s label text. We consider text-only, graph-only, and combined approaches to this problem.
4.1 Text-Only Model
Our text-only model is a separately optimized conditional random field (CRF; Lafferty et al. 2001) trained on 2,000 annotated sentences sampled from the full dataset of FDA drug labels. When this CRF is used in isolation, we say that a drug treats a disease just in case our trained CRF identifies at least one Treats span describing that disease in the drug’s label. To incorporate this CRF into our PSL models, we simply add two logical statements: one to influence the prediction confidence positively, another negatively (table 2, rules 1a,b). This encodes the goal of agreeing with this model’s predictions.
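The decision rule for the CRF in isolation amounts to an existential check over predicted spans. The sketch below assumes the CRF output has already been reduced to (drug, disease, relation) triples, one per predicted span; that representation and the function names are hypothetical.

```python
def crf_treats_pairs(predicted_spans):
    """Drug-disease pairs with at least one predicted Treats span."""
    return {(drug, disease) for drug, disease, relation in predicted_spans
            if relation == "Treats"}


def crf_observations(candidate_pairs, predicted_spans):
    """Map each candidate pair to 1.0 (some Treats span) or 0.0 (none)."""
    positive = crf_treats_pairs(predicted_spans)
    return {pair: 1.0 if pair in positive else 0.0 for pair in candidate_pairs}


# Hypothetical span-level predictions for one label.
spans = [
    ("Medrol", "asthma", "Treats"),
    ("Medrol", "asthma", "Other"),
    ("Medrol", "gout", "Not Recommended"),
]
observations = crf_observations([("Medrol", "asthma"), ("Medrol", "gout")], spans)
```

Values of this kind are what the two CRF rules would consume as observations inside the PSL model, one rule pushing confidence up when the value is 1.0 and the other pushing it down when it is 0.0.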
4.2 PSL Graph Rules
Our PSL rules fall into a few major classes. The guiding intuitions are that (i) relations connecting the same drug–disease pairs should have the same confidence, (ii) drugs that have nodes in common should treat the same diseases, and (iii) diseases that have nodes in common should be treated by the same drugs. Table 2 schematizes the full set of rules that we define over both our graphs.
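As an illustration of intuition (ii), the sketch below enumerates the groundings of a rule of the assumed form "if two drugs share a node and the first treats a disease, then the second treats it too". The exact rule schemas are those of table 2; the form used here, the function name, and the toy graph are assumptions.

```python
from itertools import permutations


def ground_shared_node_rule(drug_nodes, treats):
    """Enumerate groundings (d1, d2, shared_node, disease) of the assumed rule
    Shares(d1, d2, node) & Treats(d1, disease) -> Treats(d2, disease)."""
    groundings = []
    for d1, d2 in permutations(sorted(drug_nodes), 2):
        shared = drug_nodes[d1] & drug_nodes[d2]
        if not shared:
            continue
        for drug, disease in sorted(treats):
            if drug == d1:
                for node in sorted(shared):
                    groundings.append((d1, d2, node, disease))
    return groundings


# Hypothetical toy graph: drugs A and B share a substance, C shares nothing.
drug_nodes = {"A": {"sub1"}, "B": {"sub1", "sub2"}, "C": {"sub3"}}
treats = {("A", "asthma")}
groundings = ground_shared_node_rule(drug_nodes, treats)
```

Counting such groundings per rule gives the kind of figure reported in the Groundings column of appendix A.3; the disease-side rules of intuition (iii) can be grounded symmetrically.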
4.3 Optimization
We use the Probabilistic Soft Logic package (http://psl.linqs.org). Details about our optimization choices are given in appendix A.2. The rules for our full model (all instances of the schemas in table 2), along with their learned weights and groundings (supporting graph configurations), are given in appendix A.3.
5 Experiments
Figure 2 reports our main results (see appendix A.1 for additional experimental details). Both the CRF and the graph rules make positive contributions, and any combination of these two sources of information is superior to its text-only or graph-only ablations. Furthermore, our Clinical Narratives Graph proves extremely powerful despite its simplistic construction. Numerically, our best model for essentially any amount of evidence is one that includes rules that reason about both graphs in addition to the CRF predictions, though using the Narratives Graph alone is highly competitive with this larger one. These findings strongly support our core hypothesis that information extraction from drug labels can benefit from both text analysis and graph-derived information.
6 Conclusion
We presented evidence that combining text analysis with PSL rules defined over a knowledge graph can improve predictions about drug–disease treatment relations. We also showed that this can be effective even where the graph is derived heuristically from clinical texts, which means that the techniques can be applied even in domains that lack rich, broad-coverage ontologies.
References
Bach et al. (2013) Stephen Bach, Bert Huang, Ben London, and Lise Getoor. 2013. Hinge-loss Markov random fields: Convex inference for structured prediction. arXiv:1309.6813.
Boyd et al. (2011) Stephen Boyd, Neal Parikh, Eric Chu, Borja Peleato, Jonathan Eckstein, et al. 2011. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning, 3(1):1–122.
Bröcheler et al. (2010) Matthias Bröcheler, Lilyana Mihalkova, and Lise Getoor. 2010. Probabilistic similarity logic. In Proceedings of the Twenty-Sixth Conference on Uncertainty in Artificial Intelligence, UAI’10, pages 73–82, Arlington, Virginia, United States. AUAI Press.
Dawid and Skene (1979) Alexander Philip Dawid and Allan M Skene. 1979. Maximum likelihood estimation of observer error-rates using the EM algorithm. Applied Statistics, pages 20–28.
Fakhraei et al. (2013) Shobeir Fakhraei, Louiqa Raschid, and Lise Getoor. 2013. Drug-target interaction prediction for drug repurposing with probabilistic similarity logic. In Proceedings of the 12th International Workshop on Data Mining in Bioinformatics, pages 10–17. ACM.
Hristovski et al. (2006) Dimitar Hristovski, Carol Friedman, Thomas C Rindflesch, and Borut Peterlin. 2006. Exploiting semantic relations for literature-based discovery. In AMIA Annual Symposium Proceedings, volume 2006, page 349. American Medical Informatics Association.
Lafferty et al. (2001) John Lafferty, Andrew McCallum, and Fernando Pereira. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of ICML-01, pages 282–289.
Shrank and Avorn (2007) William H Shrank and Jerry Avorn. 2007. Educating patients about their medications: the potential and limitations of written drug information. Health Affairs, 26(3):731–740.
Uzuner (2009) Özlem Uzuner. 2009. Recognizing obesity and comorbidities in sparse data. Journal of the American Medical Informatics Association, 16(4):561–570.
West et al. (2014) Robert West, Hristo S. Paskov, Jure Leskovec, and Christopher Potts. 2014. Exploiting social network structure for person-to-person sentiment analysis. Transactions of the Association for Computational Linguistics, 2(2):297–310.
Appendix A Supplemental Material
A.1 Experimental Details
For each run, we sample two distinct sets of diseases. Then, with each set, we create a subgraph by considering all the drugs adjacent to these diseases, and the edges between these drugs and diseases. The two subgraphs have distinct disease nodes, distinct edges, and some drug nodes in common. One of these subgraphs is used for training, the other one for evaluation. In each subgraph, we sample 25% of the edges as prediction targets. Depending on the evidence ratio, some or all of the remaining 75% of the edges are provided as observations to the model. For example, with an evidence ratio of 0.75, 75% of the edges in each subgraph are observed and the other 25% are predicted by the models.
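The sampling procedure can be sketched as follows. The function names, the fixed seeds, and the size of the disease sets are assumptions made for illustration; the evidence ratio is interpreted, as in the example above, as a fraction of all edges in the subgraph.

```python
import random


def sample_disease_sets(diseases, size, seed=0):
    """Sample two disjoint disease sets (training and evaluation)."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(diseases), 2 * size)
    return set(chosen[:size]), set(chosen[size:])


def induced_edges(all_edges, disease_set):
    """Drug-disease edges whose disease endpoint lies in the given set."""
    return [(drug, disease) for drug, disease in all_edges if disease in disease_set]


def split_edges(edges, evidence_ratio, target_fraction=0.25, seed=0):
    """Hold out target_fraction of the edges for prediction; observe up to
    evidence_ratio (as a fraction of all edges) of the rest."""
    rng = random.Random(seed)
    shuffled = sorted(edges)
    rng.shuffle(shuffled)
    n_targets = round(target_fraction * len(shuffled))
    targets, remaining = shuffled[:n_targets], shuffled[n_targets:]
    n_observed = min(len(remaining), round(evidence_ratio * len(shuffled)))
    return targets, remaining[:n_observed]


# Hypothetical toy graph with 10 diseases and 100 drug-disease edges.
all_edges = [(f"drug{i}", f"disease{i % 10}") for i in range(100)]
train_set, eval_set = sample_disease_sets({d for _, d in all_edges}, size=4)
targets, observed = split_edges(all_edges, evidence_ratio=0.75)
```

Because the two disease sets are disjoint, the training and evaluation subgraphs share no edges, even though a drug may appear in both.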
A.2 Optimization Details
In training, initial weight values are very important. Whatever the learning rate and number of steps, it is easy to get stuck in local optima. We computed the initial weights using the number of groundings for each rule, such that (i) each source of information (CRF, ontologies, narratives) has the same contribution, and (ii) for a given source of information, each rule has the same contribution.
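One way to satisfy conditions (i) and (ii) is to make each rule's initial weight inversely proportional to its number of groundings, so that weight times groundings is constant within a source and sums to the same total for every source. The sketch below implements this reading; it is an assumed formula rather than the exact computation we used, and the grouping of rules into sources in the example is illustrative (the grounding counts are taken from appendix A.3).

```python
def initial_weights(groundings_by_source):
    """groundings_by_source: {source: {rule: number_of_groundings}}.
    Rules with no groundings are skipped, since they cannot contribute."""
    active = {
        source: {rule: g for rule, g in rules.items() if g > 0}
        for source, rules in groundings_by_source.items()
    }
    active = {source: rules for source, rules in active.items() if rules}
    weights = {}
    for source, rules in active.items():
        for rule, g in rules.items():
            # weight * g is the same for every rule in a source, and each
            # source's rules sum to a total contribution of 1 / len(active).
            weights[(source, rule)] = 1.0 / (len(active) * len(rules) * g)
    return weights


example = {
    "crf": {"1a": 107, "1b": 324},
    "ontologies": {"2a": 3, "2b": 30},
}
weights = initial_weights(example)
```

Under this scheme a rule with few groundings starts with a large weight, which keeps heavily grounded rules from dominating the objective before any learning has taken place.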
The weights were learned by optimizing the pseudo-log-likelihood of the data using the voted perceptron algorithm, as implemented by the PSL package. We preferred the pseudo-log-likelihood to the log-likelihood for scalability reasons.
Each model was trained over 10 iterations, with a training step of 1.
For inference, we used the Alternating Direction Method of Multipliers (ADMM; Boyd et al. 2011), as implemented by the PSL package. Consensus and local variables were initialized to a fixed value (0.25, close to the true positive Treats edge ratio), instead of randomly, to speed up convergence. The absolute and relative error components of stopping criteria were set to , with a maximum of 25,000 iterations.
A.3 Learned Weights and Groundings
Rule  Relative learned weight  Groundings 

Prior (positive)  1.10  431 
Prior (negative)  0.87  431 
Rule 1a  1.02  107 
Rule 1b  0.98  324 
Rule 2a with  1.05  3 
Rule 2b with  1.11  30 
Rule 2a with  0  
Rule 2b with  0.97  14 
Rule 2a with  1.13  37 
Rule 2b with  0.88  45 
Rule 3a with  3.09  347 
Rule 3b with  0.00  1,095 
Rule 3a with  1.36  65 
Rule 3b with  1.00  140 
Rule 3a with  1.99  164 
Rule 3b with  0.77  384 
Rule 3a with  1.00  52 
Rule 3b with  1.00  101 
Rule 2a with  22.12  29,803 
Rule 2b with  0.0  115,251 
Rule 3a with  7.42  2,568 
Rule 3b with  0.90  14,062 