Each year around the world, nearly 10 million people die from cancer (Cancer Research UK), and the cost of cancer exceeds USD 1 trillion (UICC). Finding new therapeutic uses for inexpensive generic drugs ("drug repurposing") can rapidly create affordable new treatments. Hundreds of non-cancer generic drugs have shown promise for treating cancer, but it is unclear which are the most worthwhile repurposing opportunities to pursue.
Scientific publications such as preclinical laboratory studies and small clinical trials contain evidence on generic drugs being tested for cancer use. The Repurposing Drugs in Oncology (ReDO) project, through manually inspecting research articles indexed by PubMed, found anti-cancer evidence for more than 200 non-cancer generic drugs Pantziarka et al. (2017); Bouche et al. (2017); Verbaanderd et al. (2017). However, manual review to identify and analyze potential evidence is time-consuming and does not scale. As PubMed indexes millions of articles and the collection is continuously updated, it is imperative to devise (semi)automated techniques to synthesize the existing evidence. Machine learning (ML)-powered evidence synthesis would provide a comprehensive and real-time view of drug repurposing data and enable actionable insights. To this end, we must achieve task automation, algorithmic accuracy, and technical scalability in three key areas: evidence identification, extraction, and synthesis.
The work presented in this paper is part of an ambitious initiative to synthesize the plethora of scientific and real-world data on non-cancer generic drugs to identify the most promising therapies to repurpose for cancer. This type of endeavor requires close collaboration between experts in different disciplines, such as cancer research (to provide guidance, annotate datasets, and verify results), machine learning (to devise machine learning tasks, select datasets to be annotated, devise models, and evaluate performance), and software engineering (to incorporate models in end-to-end online applications). Furthermore, implementing repurposed therapies as the standard of care in medical practice requires definitive clinical trials, new incentives and business models to fund them, and engagement by various stakeholders such as patients, doctors, payers, and policymakers. In this paper, we focus on the key aspect of identifying and extracting relevant evidence from PubMed articles.
Methods for synthesizing drug repurposing evidence can be divided into three major categories: network-based methods, natural language processing (NLP) approaches, and semantic techniques Xue et al. (2018). Network-based approaches aim to infer relationships between biological entities (e.g., drug–disease or drug–target relationships), building on the observation that biological entities (diseases, drugs, proteins, etc.) in the same module of a biological network share similar characteristics Martínez et al. (2015). NLP approaches aim to both identify biological entities and mine new knowledge from the scientific literature Li et al. (2009). Semantic approaches first require building a semantic network, which can then be used with various techniques to mine relationships between entities Palma et al. (2014). We focus on NLP approaches. Our primary contributions, described in detail in the remainder of this paper, are as follows:
- Formulating the pipeline of NLP tasks required to identify relevant evidence of generic drug repurposing for cancer from PubMed articles.
- Precisely specifying the NLP tasks in terms of input and output (not an easy endeavor).
- Creating domain-specific datasets that support the task definition.
- Designing and evaluating initial models for each of the domain-specific tasks.
2 NLP Pipeline and Dataset for Drug-Cancer Evidence Extraction
PubMed, provided by the National Center for Biotechnology Information (NCBI), is a comprehensive source of biomedical studies, comprising more than 30 million biomedical abstracts and citations from sources such as MEDLINE, life science journals, and online books. Given a list of generic drugs, the goal of our work is to automatically select, from the large PubMed collection, abstracts that measure cancer-relevant phenotypic outcomes of interventions with generic drugs. (A phenotype comprises the observable physical properties of an organism, including its appearance, development, and behavior.) We focus on phenotypic outcomes (such as proliferation/death of cells grown in culture, or tumor progression/overall survival rates for clinical trials) since they are a more direct measure of outcomes that matter to cancer patients and represent stronger therapeutic evidence (as opposed to, for example, the effects of drugs on protein levels).
We propose an evidence discovery pipeline, shown in Figure 1. First, we query PubMed using a strategy inspired by the Cochrane highly sensitive search (CHSS) strategy Dickersin et al. (1994) to narrow the collection of articles we analyze. Querying PubMed, even with a sophisticated search string, may not yield only relevant articles, so a (shallow) filtering stage rejects the easy irrelevant cases. In the resulting abstracts, cancer types are identified using a named entity recognition (NER) model. Given the abstract and the drug-cancer pairs it mentions, we then classify the therapeutic association and categorize the type of study. We refer to this collection of information (i.e., drug, cancer, therapeutic association, study type) as the evidence discussed in the PubMed abstract.
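As an illustration, the querying step can be sketched against NCBI's E-utilities API. The search term below is a deliberately simplified placeholder, not the CHSS-inspired string we actually use, and the helper names are our own:

```python
import urllib.parse

EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"

def build_search_term(drug: str) -> str:
    """Simplified stand-in for the CHSS-inspired search strategy:
    the drug name must appear in the title/abstract alongside cancer terms."""
    return (f'"{drug}"[Title/Abstract] AND '
            f'(cancer[Title/Abstract] OR neoplasms[MeSH Terms])')

def build_esearch_url(term: str, retmax: int = 100) -> str:
    """URL for the E-utilities esearch endpoint; fetching it (e.g., with
    urllib.request) returns the matching PubMed IDs as JSON."""
    return (f"{EUTILS}/esearch.fcgi?db=pubmed&retmode=json"
            f"&retmax={retmax}&term={urllib.parse.quote(term)}")
```

The resulting PMIDs would then flow into the shallow filtering stage described above.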
The therapeutic association schema contains the following classes:

1. Irrelevant: (A) the drug has no relation to the cancer (cases where either the drug or the cancer is not the focus of the study); (B) the abstract does not discuss a phenotypic outcome.
2. Relevant: (A) Effective: the drug was shown to be effective for treating the cancer; (B) Detrimental: the drug has a detrimental effect on the cancer; (C) No effect: the drug has no effect on the cancer; (D) Inconclusive: the results of the study are inconclusive.
The study types we consider are defined as follows: preclinical studies (in vitro, in vivo), observational studies (including case reports), and clinical trials.
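For concreteness, the evidence record and the two label schemas above can be captured as a small set of Python types. Only the class labels come from the schema; the type and field names are our own:

```python
from dataclasses import dataclass
from enum import Enum

class TherapeuticAssociation(Enum):
    # Irrelevant classes
    NO_DRUG_CANCER_RELATION = "irrelevant: drug has no relation to cancer"
    NO_PHENOTYPIC_OUTCOME = "irrelevant: no phenotypic outcome discussed"
    # Relevant classes
    EFFECTIVE = "effective"
    DETRIMENTAL = "detrimental"
    NO_EFFECT = "no effect"
    INCONCLUSIVE = "inconclusive"

class StudyType(Enum):
    PRECLINICAL_IN_VITRO = "preclinical (in vitro)"
    PRECLINICAL_IN_VIVO = "preclinical (in vivo)"
    OBSERVATIONAL = "observational (incl. case reports)"
    CLINICAL_TRIAL = "clinical trial"

@dataclass
class Evidence:
    """One unit of evidence extracted from a PubMed abstract."""
    pmid: str
    drug: str
    cancer_type: str
    association: TherapeuticAssociation
    study_type: StudyType
```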
Identifying such evidence from scientific abstracts is not trivial. Articles that discuss cancer interventions use domain-specific jargon, which makes the text hard to comprehend both for non-expert humans and for machines not trained on domain-specific data. Hence, strong collaboration between domain experts and data scientists is required to define the machine learning tasks, collect and annotate the appropriate information, and design and evaluate machine learning models that address those tasks. Due to space limitations, we cannot elaborate in this paper on all the difficulties of manually annotating datasets for these tasks; we refer to them where relevant during the presentation of our work. An example of an annotated abstract is shown in Figure 2.
Our team of machine learning and biomedical scientists worked closely together to fine-tune the querying and filtering strategy and to annotate cancer types, along with the therapeutic associations and study types. In the interest of space, we present the results of the dataset creation in Table 1 without the details of the iterative process of producing it.
3 Models for Cancer Type, Therapeutic Association, and Study Type
We briefly discuss the models and their performance for cancer entity extraction, therapeutic association classification, and study type classification, which form the key components of the proposed evidence discovery pipeline.
3.1 Cancer Type Extraction
For cancer entity identification, we use two main named entity recognition (NER) methods. We train sequential token-level IOB (inside, outside, beginning) tag prediction models using the BioNLP13CG dataset Pyysalo et al. (2015); tokens that are not of interest are treated as 'O'. We use the well-known conditional random field (CRF) Song et al. (2018) and convolutional neural network (CNN) based spaCy models Honnibal and Montani (2017) for entity extraction. We evaluate performance on the 1085 abstracts using exact-match recall (the ratio between the number of unique cancer entities predicted correctly by the model and the total number of unique cancer entities) and a token-level overlap score Moreau et al. (2008), where the predicted entity with the highest overlap is used to compute the score. The CRF-based model obtains a recall of 54.2% and an overlap score of 66.4%, while the spaCy-based model performs better, with 67.7% recall and a 77.6% overlap score.
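The two evaluation measures can be sketched as follows. This is a minimal interpretation of exact-match recall and the best-overlap score over unique entities, not our exact implementation:

```python
def token_overlap(pred: str, gold: str) -> float:
    """Fraction of the gold entity's tokens covered by the predicted span."""
    p, g = set(pred.lower().split()), set(gold.lower().split())
    return len(p & g) / len(g) if g else 0.0

def entity_recall(predictions, gold_entities):
    """Exact-match recall and average best-overlap score for one abstract.

    Uniqueness of entities is enforced via case-insensitive sets; for the
    overlap score, each gold entity is matched against the predicted
    entity with the highest token overlap.
    """
    preds = {p.lower() for p in predictions}
    golds = {g.lower() for g in gold_entities}
    if not golds:
        return 1.0, 1.0
    exact = sum(1 for g in golds if g in preds) / len(golds)
    overlap = sum(max((token_overlap(p, g) for p in preds), default=0.0)
                  for g in golds) / len(golds)
    return exact, overlap
```

For example, predicting only "breast cancer" when the gold entities are "breast cancer" and "lung carcinoma" yields 0.5 on both measures.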
3.2 Therapeutic Association Classification
Given a drug-cancer pair and the corresponding abstract text, we build three different models for therapeutic association classification:

1. Logistic regression, with feature vectors that concatenate term-frequency bag-of-words representations of the abstract, drug, and cancer type Lehman et al. (2019).
2. Deep Averaging Networks (DAN): similar to logistic regression, except that the tokens of the abstract, drug, and cancer type are initialized with word vectors trained with a skip-gram objective over a large set of PubMed abstracts Pyysalo et al. (2013). The abstract text is passed through deep averaging networks Iyyer et al. (2015), where the word vectors are re-trained with the training data and classification objective; the representations of the abstract, drug, and cancer type are concatenated and passed through a final logistic layer.
3. SciBERT Devlin et al. (2019); Beltagy et al. (2019): the drug and cancer type entities are encapsulated with special characters and concatenated with the input abstract text, and the task is framed as a multi-class classification problem. The sequence representation is obtained from SciBERT's encoding of the [CLS] token (a special token inserted at the beginning of every input sequence) in the last hidden layer; this encoding captures the entire sequence representation and is used for multi-class classification with a logistic layer.
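A minimal sketch of the entity-marking step used to build the SciBERT input. The "@" marker character and the "[SEP]" separator are illustrative assumptions (the paper does not specify them), and the tokenizer adds [CLS] itself:

```python
def mark_entities(abstract: str, drug: str, cancer: str,
                  marker: str = "@") -> str:
    """Encapsulate the drug and cancer mentions with a special character
    and concatenate them with the abstract text, so the encoder can
    attend to the entity pair being classified."""
    text = abstract
    for entity in (drug, cancer):
        text = text.replace(entity, f"{marker}{entity}{marker}")
    return f"{marker}{drug}{marker} {marker}{cancer}{marker} [SEP] {text}"
```

The resulting string would be tokenized and fed to SciBERT, whose [CLS] encoding then drives the logistic classification layer.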
We perform 5-fold cross-validation, split at the document level, and evaluate performance for drug-cancer type pairs given the abstract text. Note that we use gold-standard cancer type annotations for this analysis. We evaluate two settings to understand the complexity of the task: (1) irrelevant vs. relevant binary classification (Table 2) and (2) all six classes (Table 3, left side). Performance is measured using F-score. SciBERT performs best in most cases.
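Splitting at the document level matters because one abstract can yield several drug-cancer pair examples, and putting them in different folds would leak information. A sketch of such a split (helper names are ours):

```python
import random

def document_level_folds(pair_examples, k=5, seed=13):
    """Assign (pmid, drug, cancer) examples to k folds by document, so
    all pairs from the same abstract land in the same fold."""
    pmids = sorted({pmid for pmid, _, _ in pair_examples})
    rng = random.Random(seed)
    rng.shuffle(pmids)
    fold_of = {pmid: i % k for i, pmid in enumerate(pmids)}
    folds = [[] for _ in range(k)]
    for example in pair_examples:
        folds[fold_of[example[0]]].append(example)
    return folds
```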
3.3 Study Type Classification
We train logistic regression models with different choices of features Marshall et al. (2018): bag-of-words (BoW), publication type (PT), MeSH terms, and all of these combined. Results for logistic regression with the different feature choices are given in Table 3 (right side). Performance is measured using F-score. Using all the features together provides the best performance.
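The combined setting can be sketched as a union of prefixed feature spaces feeding one linear model; this is a hedged illustration (function and prefix names are ours), not the exact feature engineering:

```python
def study_type_features(abstract_tokens, publication_types, mesh_terms):
    """Build one sparse feature dict (name -> value) combining
    bag-of-words counts, publication-type flags, and MeSH-term flags.
    Prefixes keep the three feature spaces disjoint."""
    feats = {}
    for tok in abstract_tokens:
        key = f"bow={tok.lower()}"
        feats[key] = feats.get(key, 0) + 1
    for pt in publication_types:
        feats[f"pt={pt}"] = 1
    for mh in mesh_terms:
        feats[f"mesh={mh}"] = 1
    return feats
```

A dict like this can be vectorized (e.g., with a feature hasher or a fitted vocabulary) before training the logistic regression.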
4 Conclusion and Future Work
We proposed an end-to-end evidence discovery pipeline that fetches potential candidate abstracts from PubMed for further evaluation with the goal of identifying non-cancer generic drug activity against different cancer types. We discuss the components in the pipeline, and use NLP approaches along with a number of well-thought-out heuristics to provide solutions for each component. In addition to improving the performance of individual components, our future research involves generating a database of evidence that can be used to prioritize the most promising drug-cancer combinations.
References

- Beltagy et al. (2019) SciBERT: pretrained contextualized embeddings for scientific text. arXiv:1903.10676.
- Bouche et al. (2017) Beyond aspirin and metformin: the untapped potential of drug repurposing in oncology. Eur. J. Cancer 172, pp. S121–S122.
- Cancer Research UK. Worldwide cancer statistics. https://www.cancerresearchuk.org/health-professional/cancer-statistics/worldwide-cancer [Online; accessed 09-09-2019].
- Devlin et al. (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In NAACL.
- Dickersin et al. (1994) Identifying relevant studies for systematic reviews. BMJ 309, pp. 1286.
- Honnibal and Montani (2017) spaCy 2: natural language understanding with Bloom embeddings, convolutional neural networks and incremental parsing.
- Iyyer et al. (2015) Deep unordered composition rivals syntactic methods for text classification. In ACL-IJCNLP.
- Lehman et al. (2019) Inferring which medical treatments work from reports of clinical trials. In NAACL-HLT.
- Li et al. (2009) Building disease-specific drug-protein connectivity maps from molecular interaction networks and PubMed abstracts. PLoS Comput. Biol. 5 (7), pp. e1000450.
- Marshall et al. (2018) Machine learning for identifying randomized controlled trials: an evaluation and practitioner's guide. Res. Synth. Methods 9, pp. 602–614.
- Martínez et al. (2015) DrugNet: network-based drug–disease prioritization by integrating heterogeneous data. Artif. Intell. Med. 63 (1), pp. 41–49.
- Moreau et al. (2008) Robust similarity measures for named entities matching. In COLING.
- Palma et al. (2014) Drug-target interaction prediction using semantic similarity and edge partitioning. In ISWC.
- Pantziarka et al. (2017) Repurposing non-cancer drugs in oncology — how many drugs are out there? bioRxiv:197434.
- Pyysalo et al. (2013) Distributional semantics resources for biomedical text processing. In LBM.
- Pyysalo et al. (2015) Overview of the cancer genetics and pathway curation tasks of BioNLP shared task 2013. BMC Bioinformatics 16 (Suppl. 10), pp. S2.
- Song et al. (2018) Comparison of named entity recognition methodologies in biomedical documents. BioMed. Eng. OnLine 17 (Suppl. 2), pp. 158.
- Union for International Cancer Control (UICC). The economics of cancer prevention and control. https://issuu.com/uicc.org/docs/wcls2014_economics_of_cancer_final?e=10430107/10454633 [Online; accessed 09-09-2019].
- Verbaanderd et al. (2017) Repurposing drugs in oncology: next steps. Trends Cancer 3 (8), pp. 543–546.
- Xue et al. (2018) Review of drug repositioning approaches and resources. Int. J. Biol. Sci. 14 (10), pp. 1232–1244.