Attention-Gated Graph Convolution for Extracting Drugs and Their Interactions from Drug Labels

10/28/2019, by Tung Tran, et al.

Preventable adverse events as a result of medical errors present a growing concern in the healthcare system. As drug-drug interactions (DDIs) may lead to preventable adverse events, being able to extract DDIs from drug labels into a machine-processable form is an important step toward effective dissemination of drug safety information. In this study, we tackle the problem of jointly extracting drugs and their interactions, including interaction outcome, from drug labels. Our deep learning approach entails composing various intermediate representations, including sequence- and graph-based context, where the latter is derived using graph convolutions (GC) with a novel attention-based gating mechanism (holistically called GCA). These representations are then composed in meaningful ways to handle all subtasks jointly. To overcome scarcity in training data, we additionally propose transfer learning by pre-training on related DDI data. Our model is trained and evaluated on the 2018 TAC DDI corpus. Our GCA model in conjunction with transfer learning performs at 39.20 F1 in entity recognition (ER) and 26.09 F1 in relation extraction (RE) on the first official test set and at 45.30 F1 in ER on the second official test set, corresponding to an improvement over our prior best results by up to 6 absolute F1 points. After controlling for available training data, our model exhibits state-of-the-art performance, improving over the next comparable best outcome by roughly three F1 points in ER and 1.5 F1 points in RE evaluation across the two official test sets.




1. Introduction

Preventable adverse events (AEs) are negative consequences of medical care resulting in injury or illness in a way that is generally considered avoidable. According to a report (Levinson, 2010) by the Department of Health and Human Services, based on an analysis of hospital visits by over a million Medicare beneficiaries, about one in seven hospital visits was associated with an AE, with 44% considered clearly or likely preventable. Overall, AEs were responsible for an estimated US $324 million in Medicare spending for the studied month of October 2008. Preventable AEs thus represent a growing concern in the modern healthcare system, as they account for a significant fraction of hospital admissions and play a significant role in increased healthcare costs. Alarmingly, preventable AEs have been cited as the eighth leading cause of death in the U.S., with an estimated fatality rate of between 44,000 and 98,000 each year (Kohn et al., 2000). As drug-drug interactions (DDIs) may lead to a variety of preventable AEs, being able to extract DDIs from prescription drug labels is an important effort toward effective dissemination of drug safety information. This includes extracting information such as adverse drug reactions and drug-drug interactions as indicated by drug labels. The U.S. Food and Drug Administration (FDA), for example, has recently begun to transform Structured Product Labeling (SPL) documents into a computer-readable format, encoded in national standard terminologies, that will be made available to the medical community and the public (Demner-Fushman et al., 2018). The initiative to develop a database of structured drug safety information that can be indexed, searched, and sorted is an important milestone toward a fully automated health information exchange system.

To aid in this effort, we propose a supervised deep learning model able to tackle the problem of drug-drug interaction extraction in an end-to-end fashion. While most prior efforts assume all drug entities are known ahead of time (more in Section 2), such that the drug-drug interaction extraction task reduces to a simpler binary relation classification task over known drug pairs, we propose a system able to identify drug mentions in addition to their interactions. Concretely, the system takes as input the textual content of the label (indicating dosage and drug safety precautions) of a target drug and, as output, identifies mentions of other drugs that interact with the target drug. Thus only one of the two interacting drugs is known beforehand (i.e., the “label drug”), while the other (i.e., the “precipitating drug”, or simply precipitant) is an unknown that our model is expected to extract. Along with identifying precipitants, we also determine the type of interaction associated with each precipitant; that is, whether the interaction is designated as being pharmacodynamic (PD) or pharmacokinetic (PK). In pharmacology, PD interactions are associated with a consequence on the organism, while PK interactions are associated with changes in how one or both of the interacting drugs is absorbed, transported, distributed, metabolized, and excreted when used jointly. Beyond identifying the interaction type, it is also important to identify the outcome or consequence of an interaction. As defined, a PK consequence can be captured using a small fixed vocabulary, while identifying PD effects is a more involved process. The latter involves additionally identifying spans of text corresponding to mentions of PD effects and linking each identified PD precipitant to one or more PD effects. We provide a more formal description of the task in Section 3.1.
Figure 1 features a simple example of a PD interaction that is extracted from the drug label for Adenocard, where the precipitant is digitalis and the effect is “ventricular fibrillation.”

Figure 1. An example illustrating the end-to-end DDI extraction task. We first (1) identify mentions including precipitants; for each precipitant, we (2) determine the type of interaction and, based on interaction type, (3) determine the interaction outcome. In the case of PD interactions, the outcome corresponds to one of the previously identified effect spans.

To address this end-to-end variant of DDI extraction, we propose a multi-task joint-learning architecture wherein various intermediate hidden representations, including sequence-based and graph-based contextual representations based on bidirectional Long Short-Term Memory (BiLSTM) networks and graph convolution (GC) networks respectively, are composed and are then combined in meaningful ways to produce predictions for each subtask. GCs over dependency parse trees are useful for capturing long-distance syntactic dependencies. We innovate on conventional GCs with a sigmoid gating mechanism derived via additive attention, referred to as Graph Convolution with Attention-Gating (GCA), which determines whether or not (and to what extent) information propagates between source and target nodes corresponding to edges in the dependency tree. The attention component controls information flow by producing a sigmoid gate (corresponding to a value in $[0, 1]$) for each edge based on an attention-like mechanism that measures relevance between node pairs. Intuitively, some dependency edges are more relevant than others; for example, negations or adjectives linked to important nouns via dependency edges may have a large influence on the overall meaning of a sentence, while articles, such as “the”, “a”, and “an”, have little or no influence comparatively. A standard GC would compose all source nodes with equal weighting, while the GCA would be more selective by possibly assigning a higher sigmoid value to negations/adjectives and a lower sigmoid value to articles.

                                                     DDI2013*  NLM180   TR22  Test Set 1  Test Set 2
Number of Drug Labels                                     715      66     22          57          66
Total number of sentences                                6489    8167    603        8195        4256
Number of sentences per Drug Label (Average)                9     124     27         144          64
Number of words per sentence (Average)                     21      23     24          22          23
Proportion of sentences with annotations                  70%     16%    51%         23%         23%
Number of mentions per annotated sentence (Average)       2.3     2.7    3.8         3.7         3.6
Proportion of mentions that are Precipitant              100%     60%    53%         56%         55%
Proportion of mentions that are Trigger                     -     20%    15%         30%         33%
Proportion of mentions that are SpecificInteraction         -     23%    25%         14%         12%
Proportion of interactions that are Pharmacodynamic       14%     49%    49%         33%         28%
Proportion of interactions that are Pharmacokinetic        9%     29%    21%         28%         47%
Proportion of interactions that are Unspecified           77%     22%    30%         39%         25%

* Statistics for DDI2013 were computed on mapped examples (based on our own annotation mapping scheme) and not based on original data.

Table 1. Characteristics of various datasets

We train and evaluate our model on the Text Analysis Conference (TAC) 2018 dataset for drug-drug interaction extraction from drug labels (Demner-Fushman et al., 2018). The training data contains 22 drug labels, referred to as TR22, with gold-standard annotations. As training data is scarce, we additionally propose a transfer learning step whereby the model is first trained on a re-annotated version, by TAC organizers, of the NLM-DDI CD corpus and the SemEval-2013 Task 9 dataset (Herrero-Zazo et al., 2013); we refer to these as NLM180 and DDI2013 respectively. Two official test sets of 57 and 66 drug labels, referred to as Test Set 1 and 2 respectively, with gold-standard annotations are used strictly for evaluation. Table 1 contains more information about these datasets and their characteristics. In this study, we show that the GCA improves over the standard GC and that our GCA-based model with transfer learning by pretraining on external data improves over our best model (Tran et al., 2018) from a prior study (published as part of the non-refereed Text Analysis Conference, of which this study is an extension), which is based solely on BiLSTMs, by 4 absolute F1 points in overall performance. Furthermore, we show that our GCA-based model complements our prior BiLSTM model; that is, by combining the two via ensembling, we improve over the prior best by 6 absolute F1 points in overall performance. Among comparable methods, our GCA-based method exhibits state-of-the-art performance on all metrics after controlling for available training data. Our code is available for review and will be made publicly available on GitHub.

2. Related Work

Prior studies on DDI extraction have focused primarily on binary relation extraction where drug entities are known during test time and the learning objective is reduced to a simpler relation classification (RC) task. In RC, pairs of known drug entities occurring in the same sentence are assigned a label, from a fixed set of labels, indicating relation type (including the none or null relation). Typically, no preliminary drug entity recognition or additional consequence prediction step is required. In this section, we cover prior relation extraction methods for DDI as well as participants of the initial TAC DDI challenge.

2.1. Relation Extraction for DDI

State-of-the-art methods for DDI extraction typically involve some variant of convolutional neural networks (CNNs) or recurrent neural networks (RNNs), or a hybrid of the two. Many studies utilize the dependency parse structure of an input sentence to capture long-distance dependencies, which has previously been shown to improve performance in general relation extraction tasks (Zhang et al., 2018) and those in the biomedical domain (Luo et al., 2016; Liu et al., 2016a). Liu et al. (2016b) first proposed the use of standard CNNs for DDI extraction. Their approach involved convolving over an input sentence, with drug entities bound to generic tokens, in conjunction with so-called position vectors. Position vectors are used to indicate the offset between a word and each drug of the pair and provide additional spatial features. Improvements were attained, in a follow-up study, by instead convolving over the shortest dependency path between the candidate drug pair (Liu et al., 2016a). Zhao et al. (2016) introduced an enhanced version of the CNN-based method by deploying word embeddings that were pretrained on syntactic parses, part-of-speech embeddings, and traditional handcrafted features. Suárez-Paniagua et al. (2017) instead focused on fine-tuning various hyperparameter settings, including word and position vector dimensions and convolution filter sizes, for improved performance.

Kavuluru et al. (2017) introduced the first neural architecture for DDI extraction based on hierarchical RNNs, wherein hidden intermediate representations are composed in a sequential fashion with cyclic connections, with character- and word-level input. Sahu and Anand (2018) experimented with various ways of composing the output of a bidirectional LSTM network, including max-pooling and attention pooling.

Lim et al. (2018) proposed a recursive neural network architecture using recurrent units called TreeLSTMs to produce meaningful intermediate representations that are composed based on the structure of the dependency parse tree of a sentence. Asada et al. (2018) demonstrated that combining representations of a CNN over the input text and graph convolutions over the molecular structure of the target drug pair (as informed by an external drug database) can result in improved DDI extraction performance. More recently, Sun et al. (2019) proposed a hybrid RNN/CNN method by convolving over the contextual representations produced by a preceding recurrent neural network.

2.2. TAC 2018 DDI Track

TAC is a series of workshops aimed at encouraging research in natural language processing (NLP) by providing large test collections along with a standard evaluation procedure. The “DDI Extraction from Drug Labels” track (Demner-Fushman et al., 2018) was established with the goal of transforming the contents of drug labels into a machine-processable format with linkage to standard terminologies. Tang et al. (2018) placed first in the challenge using an encoder/decoder architecture to jointly identify precipitants and their interaction types, and a rule-based system to determine interaction outcome. In addition to the provided training data, they downloaded and manually annotated a collection of 1148 sentences to be used as external training data.

Tran et al. (2018) placed second in the challenge using a BiLSTM for joint entity recognition and interaction type prediction, followed by a CNN with two separate dense output layers (one for PK and one for PD) for outcome prediction. Dandala et al. (2018) placed third in the challenge using a BiLSTM (with CRFs) with part-of-speech and dependency features as input for entity recognition. Next, an Attention-LSTM model was used to detect relations between recognized entities. The embeddings were pretrained on a corpus of FDA-released drug labels and used to initialize the model. NLM180 was used for training, with TR22 serving as the development set. Other participants proposed systems involving similar approaches, including BiLSTMs and CNNs as well as traditional linear and rule-based methods.

3. Materials and Methods

We begin by formally describing the end-to-end task in Section 3.1. Next, we describe our approach to framing and modeling the problem (Section 3.2), our notation and neural building blocks (Section 3.3), the proposed network architecture (Section 3.4), the data used for transfer learning (Section 3.5), and our model-ensembling approach (Section 3.6). Finally, in Section 3.7, we describe the method for model evaluation.

3.1. Task Description

Herein, we describe the end-to-end task of automatically detecting drugs and their interactions, including the outcome of identified interactions, as conveyed in drug labels. We first define a drug label as a collection of sections (e.g., DOSAGE & ADMINISTRATION, CONTRAINDICATIONS, and WARNINGS), where each section contains one or more sentences. The overall task, in essence, involves fundamental language processing techniques including named entity recognition (NER) and relation extraction (RE). The first subtask of NER is focused on identifying mentions in the text corresponding to precipitants, interaction triggers, and interaction effects. Precipitating drugs (or simply precipitants) are defined as substances, drugs, or drug classes involved in an interaction. The second subtask of RE is focused on identifying sentence-level interactions; specifically, the goal is to identify the interacting precipitant, the type of the interaction, and the outcome of the interaction. The interaction outcome depends on the interaction type as follows. Pharmacodynamic (PD) interactions are associated with a specified effect corresponding to a span within the text that describes the outcome of the interaction. Figure 1 features a simple example of a PD interaction that is extracted from the drug label for Adenocard, where the precipitant is digitalis and the effect is “ventricular fibrillation.” Naturally, it is possible for a precipitant to be involved in multiple PD interactions. Pharmacokinetic (PK) interactions, on the other hand, are associated with a label from a fixed vocabulary of National Cancer Institute (NCI) Thesaurus codes indicating various levels of increase/decrease in functional measurements. For example, consider the sentence: “There is evidence that treatment with phenytoin leads to decrease intestinal absorption of furosemide, and consequently to lower peak serum furosemide concentrations.” Here, phenytoin is involved in a PK interaction with the label drug, furosemide, and the type of PK interaction is indicated by the NCI Thesaurus code C54615, which describes a decrease in the maximum serum concentration (Cmax) of the label drug. Lastly, unspecified (UN) interactions are interactions with an outcome that is not explicitly stated in the text, typically indicated through cautionary remarks.

3.2. Joint Modeling Approach

The/O use/O of/O LABELDRUG/O in/O patients/O receiving/O digitalis/U-DYN
may/O be/O rarely/O associated/O with/O ventricular/B-EFF fibrillation/L-EFF ./O
Table 2. Example of the sequence labeling scheme for the sentence in Figure 1, where LABELDRUG is substituted for Adenocard.

Since only precipitants are annotated in the ground truth, we model the tasks of precipitant recognition and interaction type prediction jointly. We accomplish this by reducing the problem to a sequence tagging problem via a novel NER tagging scheme. That is, for each precipitant drug, we additionally encode the associated interaction type. Hence, there are three possible precipitant tags: DYN, KIN, and UN for precipitants with pharmacodynamic, pharmacokinetic, and unspecified interactions respectively. Two more tags, TRI and EFF, are added to further identify mentions of triggers and effects concurrently. To properly identify boundaries, we employ the BILOU encoding scheme (Ratinov and Roth, 2009). In the BILOU scheme, B, I, and L tags are used to indicate the beginning, inside, and last token of a multi-token entity respectively. The U tag is used for unit-length entities, while the O tag indicates that the token is outside of an entity span. As a preprocessing step, we identify the label drug in the sentence, if it is mentioned, and bind it to a generic entity token (e.g., “LABELDRUG”). We also account for indirect mentions of the label drug, such as the generic version of a brand-name drug, or cases where the label drug is referred to by its drug class. To that end, we built a lexicon of drug names mapped to aliases using NLM’s Medical Subject Headings (MeSH) tree as a reference. Table 2 shows how the tagging scheme is applied to a simple example.
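The tagging scheme above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation; the token indices and tag choices for the Figure 1 sentence are our own assumptions (digitalis as a unit-length PD precipitant, “ventricular fibrillation” as a two-token effect span).

```python
# Sketch of the BILOU + interaction-type tagging scheme described above.
# Entity spans are given as (start, end_inclusive, type) token indices.

def bilou_tags(n_tokens, entities):
    """Return a BILOU tag sequence for the given entity spans."""
    tags = ["O"] * n_tokens
    for start, end, etype in entities:
        if start == end:
            tags[start] = f"U-{etype}"            # unit-length entity
        else:
            tags[start] = f"B-{etype}"            # beginning of span
            for i in range(start + 1, end):
                tags[i] = f"I-{etype}"            # inside of span
            tags[end] = f"L-{etype}"              # last token of span
    return tags

sentence = ("The use of LABELDRUG in patients receiving digitalis "
            "may be rarely associated with ventricular fibrillation .").split()
# Assumed spans: digitalis -> U-DYN; "ventricular fibrillation" -> B-EFF, L-EFF.
entities = [(7, 7, "DYN"), (13, 14, "EFF")]
tags = bilou_tags(len(sentence), entities)
```

Pairing each tag with its token reproduces the layout of Table 2.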

Once we have identified the precipitant (as well as triggers/effects) and the interaction type for each precipitant, we subsequently predict the outcome or consequence of the interaction (if any). To that end, we consider all entity spans annotated with KIN tags and assign them a label from a static vocabulary of 20 NCI concept codes corresponding to PK consequence (i.e., multi-class classification). Likewise, we consider all entity spans annotated with DYN tags and link them to mention spans annotated with EFF tags; we accomplish this via binary classification of all pairwise combinations. For entity spans annotated with UN tags, no additional outcome prediction is needed.

3.3. Notations and Neural Building Blocks

In this section, we describe notation used in the remainder of this study. In addition, we provide a generic definition of the canonical convolutional neural network (CNN) and long short-term memory network (LSTM) that are later used as building blocks in model construction. For ease of notation, we assume a fixed sentence length $n$ and word length $m$; in practice, we set $n$ and $m$ to be the maximum sentence/word length and zero-pad shorter sentences/words. Moreover, we use square brackets with matrices to indicate a row indexing operation; for example, $A[i]$ denotes the vector corresponding to the $i$-th row of matrix $A$.

Henceforth, the abstract function $\mathrm{CNN}_{w}^{\ell}$ is used to represent the CNN that convolves on a window of size $w$ and maps an $n \times d$ matrix to a vector representation of length $\ell$. This is an abstraction of the canonical CNN for NLP first proposed by Kim (2014) and is defined as follows. First, we denote the convolution operation $\ast$ as the sum of the element-wise products of two matrices. That is, for two matrices $A$ and $B$ of the same dimensions, $A \ast B = \sum_{i,j} A_{i,j} B_{i,j}$. Suppose the input is a sequence of vector representations $x_1, \ldots, x_n \in \mathbb{R}^{d}$; the output representation $c \in \mathbb{R}^{\ell}$ is defined such that

$$c_k = \max_{1 \le i \le n - w + 1} \, g(x_i, \ldots, x_{i+w-1})_k, \qquad k = 1, \ldots, \ell,$$

given a convolution function $g$ that convolves over a contiguous window of size $w$, defined as

$$g(x_i, \ldots, x_{i+w-1})_k = \mathrm{ReLU}\big( F^{k} \ast [x_i; \ldots; x_{i+w-1}] + b_k \big),$$

where $x_i, \ldots, x_{i+w-1}$ are input vectors, $F^{k} \in \mathbb{R}^{w \times d}$ and $b_k \in \mathbb{R}$ for $k = 1, \ldots, \ell$ are network parameters (corresponding to a set of convolutional filters), and $\mathrm{ReLU}$ is the linear rectifier activation function. Here, $\ell$ is a hyperparameter that determines the number of convolutional filters and thus the size of the final feature vector.
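The CNN abstraction above (convolve, ReLU, then max-pool over positions) can be sketched in NumPy. This is a didactic stand-in, not the trained model; filter values are random placeholders.

```python
import numpy as np

def cnn(X, filters, biases):
    """X: (n, d) input matrix; filters: (l, w, d); biases: (l,).
    Each feature-map entry is the sum of element-wise products over a
    w x d window (the convolution defined above), passed through ReLU;
    max-pooling over positions yields a length-l feature vector."""
    n, _ = X.shape
    l, w, _ = filters.shape
    maps = np.empty((n - w + 1, l))
    for i in range(n - w + 1):
        window = X[i:i + w]   # (w, d) slice of the input
        maps[i] = np.maximum(0.0, np.tensordot(filters, window, axes=([1, 2], [0, 1])) + biases)
    return maps.max(axis=0)   # max-pool over window positions

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 8))                            # n = 10 words, d = 8
c = cnn(X, rng.normal(size=(16, 3, 8)), np.zeros(16))   # l = 16 filters, w = 3
```

The resulting vector `c` has one entry per filter, matching the abstract function's output length.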

Likewise, we represent the BiLSTM network as an abstract function $\mathrm{BiLSTM}^{u}$ that maps a sequence of input vectors (e.g., word embeddings) of size $d$ (as an $n \times d$ matrix) to a corresponding sequence of output context vectors (as an $n \times 2u$ matrix). Let $\overrightarrow{\mathrm{LSTM}}$ and $\overleftarrow{\mathrm{LSTM}}$ represent an LSTM composition in the forward and backward direction respectively. Suppose the input is a sequence of vector representations $x_1, \ldots, x_n \in \mathbb{R}^{d}$; the output of a standard bidirectional LSTM network (BiLSTM) is a matrix $H \in \mathbb{R}^{n \times 2u}$ such that

$$H[i] = \overrightarrow{\mathrm{LSTM}}(x_1, \ldots, x_i) \oplus \overleftarrow{\mathrm{LSTM}}(x_i, \ldots, x_n),$$

where $\oplus$ is the vector concatenation operator and $H[i]$ represents the context centered at the $i$-th word. Here, $u$ is a hyperparameter that determines the size of the context embeddings.
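The BiLSTM abstraction can be sketched in NumPy as two LSTM passes whose per-position outputs are concatenated. This is a didactic sketch under our own assumptions about gate layout; weights are random stand-ins, not trained parameters.

```python
import numpy as np

def lstm_pass(X, Wx, Wh, b, reverse=False):
    """One LSTM direction. X: (n, d); Wx: (4u, d); Wh: (4u, u); b: (4u,).
    Returns the (n, u) matrix of hidden states."""
    n, _ = X.shape
    u = Wh.shape[1]
    h, c = np.zeros(u), np.zeros(u)
    H = np.zeros((n, u))
    order = range(n - 1, -1, -1) if reverse else range(n)
    for t in order:
        z = Wx @ X[t] + Wh @ h + b
        i = 1.0 / (1.0 + np.exp(-z[:u]))              # input gate
        f = 1.0 / (1.0 + np.exp(-z[u:2 * u]))         # forget gate
        o = 1.0 / (1.0 + np.exp(-z[2 * u:3 * u]))     # output gate
        g = np.tanh(z[3 * u:])                        # candidate cell state
        c = f * c + i * g
        h = o * np.tanh(c)
        H[t] = h
    return H

def bilstm(X, fwd, bwd):
    """Concatenate forward and backward context vectors per position: (n, 2u)."""
    return np.concatenate([lstm_pass(X, *fwd), lstm_pass(X, *bwd, reverse=True)], axis=1)

rng = np.random.default_rng(0)
make = lambda d, u: (0.1 * rng.normal(size=(4 * u, d)),
                     0.1 * rng.normal(size=(4 * u, u)), np.zeros(4 * u))
H = bilstm(rng.normal(size=(6, 4)), make(4, 3), make(4, 3))   # n=6, d=4, u=3
```

Row `H[i]` concatenates the forward state (summarizing words 1..i) with the backward state (summarizing words i..n), matching the definition above.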

3.4. Neural Network Architecture and Training Details

Figure 2. Overview of the neural network architecture for a simplified example from the drug label Adenocard. Here, the ground truth indicates that digitalis is a pharmacodynamic precipitant associated with the effect “ventricular fibrillation.” The PK predictive component is omitted given there are no precipitants involved in a PK interaction.

We begin by describing how the three types of intermediate representations are composed. The construction of word, context, and graph-based representations are described in Sections 3.4.1, 3.4.2, and 3.4.3 respectively. Next, we describe the predictive components of the network that share and utilize the intermediate representations. In Section 3.4.4, we describe the sequence-labeling component of the network used to extract drugs and their interactions. In Section 3.4.5, we describe the component for predicting interaction outcome. An overview of the architecture is shown in Figure 2. Lastly, we describe the model configuration and training process in Section 3.4.6.

3.4.1. Word-level Representation

Suppose the input is a sentence of length $n$ represented by a sequence of word indices $w_1, \ldots, w_n$ into the vocabulary $V$. Each word is mapped to a word embedding vector via the embedding matrix $E^{\mathrm{word}} \in \mathbb{R}^{|V| \times d}$, where $d$ is a hyperparameter that determines the size of word embeddings. In addition to word embeddings, we employ character-CNN based representations, as commonly observed in recent neural NER models (Chiu and Nichols, 2016). Character-based models capture morphological features and help generalize to out-of-vocabulary words. For the proposed model, such representations are composed by convolving over character embeddings of size $d_c$ using a window of size 3, producing $\ell_c$ feature maps; the feature maps are then max-pooled to produce $\ell_c$-length feature representations. Correspondingly, we denote $E^{\mathrm{char}} \in \mathbb{R}^{|V_c| \times d_c}$ as the embedding matrix given the character vocabulary $V_c$; the character-level embedding matrix $C_i \in \mathbb{R}^{m \times d_c}$ for the word at position $i$ has rows

$$C_i[j] = E^{\mathrm{char}}[c^{i}_{j}],$$

where $c^{i}_{j}$ for $j = 1, \ldots, m$ represents the $j$-th character index of the $i$-th word. The word-level representation is a concatenation of character-based word embeddings and pretrained word embeddings along the feature dimension; formally,

$$W[i] = \mathrm{CNN}_{3}^{\ell_c}(C_i) \oplus E^{\mathrm{word}}[w_i].$$

3.4.2. Context-based Representation

We compose the context-based representation by simply processing the word-level representation with a BiLSTM layer, as is common practice; concretely, $C = \mathrm{BiLSTM}^{u}(W)$, where $u$ is a hyperparameter that determines the size of the context embeddings.

3.4.3. Graph-based Representation

In addition to the sequential nature of LSTMs, we propose an alternative and complementary graph-based approach for representing context using graph convolution (GC) networks. Typically composed on dependency parse trees, graph-based representations are useful for relation extraction as they capture long-distance relationships among words of a sentence as informed by the sentence’s syntactic dependency structure. While graph convolutions are typically applied repeatedly, our initial cross-validation results indicate that single-layered GCs are sufficient and deep GCs typically resulted in performance degradation; moreover, Zhang et al. (2018) report good performance with similarly shallow GC layers. Hence the following formulation describes a single-layered GC network, with an additional attention-based sigmoid gating mechanism, which we holistically refer to as a Graph Convolution with Attention-Gating (GCA) network. Initially motivated in Section 1, the GCA improves on conventional GCs with a sigmoid-gating mechanism derived via an alignment score function associated with additive attention (Bahdanau et al., 2015). The sigmoid “gate” determines whether or not (and to what extent) information propagates between nodes based on a learned alignment function that conceives a “relevance” score between a source and a target node (more later).

As a pre-processing step, we use a dependency parsing tool to generate the projective dependency tree for the input sentence. We represent the dependency tree as an adjacency matrix $A \in \{0, 1\}^{n \times n}$, where $A_{i,j} = 1$ if there is a dependency relation between the words at positions $i$ and $j$. This matrix controls the flow of information between pairs of words corresponding to connected nodes in the dependency tree (ignoring dependency type); however, it is also important for the existing information of each node to carry over on each application of the GC. Hence, as with prior work (Zhang et al., 2018), we use the modified version $\tilde{A} = A + I$, where $I$ is the identity matrix, to allow for self-loops in the GC network. The graph-based representation matrix $G \in \mathbb{R}^{n \times g}$ is composed such that

$$G[i] = \tanh\Bigg( \sum_{j=1}^{n} \tilde{A}_{i,j} \big( W^{g}\, C[j] + b^{g} \big) \Bigg),$$

where $W^{g}$ and $b^{g}$ are network parameters, $\tanh$ is the hyperbolic tangent activation function, and $g$ is a hyperparameter that determines the hidden GC layer size. Thus, the representations propagated from source nodes $j$ to target node $i$, based on the summation of intermediate representations, are unweighted and share equal importance.
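A single-layer graph convolution of this form is only a few lines of NumPy. The sketch below is illustrative (random weights, a toy chain-shaped dependency tree), not the paper's implementation.

```python
import numpy as np

def graph_conv(C, A, W, b):
    """C: (n, d) context vectors; A: (n, n) 0/1 adjacency; W: (d, g); b: (g,).
    Adds self-loops, then sums linearly transformed neighbor representations
    with equal weight and applies tanh."""
    A_tilde = A + np.eye(A.shape[0])       # self-loops: each node keeps its own info
    return np.tanh(A_tilde @ (C @ W + b))  # unweighted sum over connected nodes

# Toy 4-word "sentence" with a chain-shaped dependency tree 0-1-2-3.
A = np.zeros((4, 4))
for i, j in [(0, 1), (1, 2), (2, 3)]:
    A[i, j] = A[j, i] = 1
rng = np.random.default_rng(0)
G = graph_conv(rng.normal(size=(4, 5)), A, rng.normal(size=(5, 2)), np.zeros(2))
```

Note that `A_tilde @ (C @ W + b)` computes exactly the inner sum over neighbors (plus the self-loop) for every target node at once.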

As stated previously, we propose to extend the standard GC by adding an attention-based sigmoid gating mechanism to control the flow of information via the gating matrix $S \in [0, 1]^{n \times n}$. We define $S$ such that

$$S_{i,j} = \sigma\big( v^{\top} z_{i,j} \big),$$

where $v$ is a network parameter and $z_{i,j}$ is the hidden attention layer composed as a function of the context representation at source node $j$ and target node $i$; concretely,

$$z_{i,j} = \tanh\big( W^{a}\, C[i] + U^{a}\, C[j] \big),$$

where $W^{a}$ and $U^{a}$ are network parameters and $a$ is a hyperparameter that determines the hidden attention layer size. Intuitively, the network learns the relevance of node $j$ to node $i$ via the attention mechanism and outputs a value between 0 and 1 at gate $S_{i,j}$. Gate $S_{i,j}$ controls the flow of information from node $j$ to node $i$, where 0 indicates no information is passed and 1 indicates that all information is passed. To integrate the gating mechanism, we simply redefine

$$G[i] = \tanh\Bigg( \sum_{j=1}^{n} \tilde{A}_{i,j}\, S_{i,j} \big( W^{g}\, C[j] + b^{g} \big) \Bigg).$$

In the next two sections, we show how the intermediate representations are used for end-task prediction.

3.4.4. Sequence Labeling

The sequence labeling (SL) task for detecting precipitant drugs and their interaction types is handled by a bidirectional LSTM trained on a combination of two types of losses: conditional random field (CRF) and softmax cross-entropy (SCE). Using CRFs results in choosing a globally optimal assignment of tags to the sequence, whereas a standard softmax at the output of each step may result in less globally consistent assignments (e.g., an L tag following an O tag) but better local or partial assignments. We begin by introducing a bidirectional LSTM layer that processes the various intermediate representations. The new representation $R \in \mathbb{R}^{n \times 2s}$ is defined such that

$$R = \mathrm{BiLSTM}^{s}\big( [\, W \oplus C \oplus G \,] \big),$$

where $\oplus$ here concatenates the three representations along the feature dimension and $s$ is a hyperparameter that determines the hidden layer size. While $C$ is based on $W$, and $G$ is based on $C$, we observed that combining these intermediate representations (manifesting at varying depths in the architecture) resulted in improved sequence-labeling performance according to preliminary experiments and prior results from Tran et al. (2018). As with residual networks (He et al., 2016), they additionally provide a kind of shortcut or “skip-connection” over intermediate layers.

Given the set of possible tags $T$, we compose an $n \times |T|$ score matrix $P$ (where $P_{i,t}$ represents the score of tag $t$ at position $i$) such that $P[i] = W^{s}\, R[i] + b^{s}$, where $W^{s}$ and $b^{s}$ are network parameters. Given an example $x$ and the ground truth tag assignment as a matrix $Y \in \{0, 1\}^{n \times |T|}$ whose rows are one-hot vectors over all possible tags, the SCE loss is

$$\mathcal{L}_{\mathrm{SCE}}(x, Y; \theta) = - \sum_{i=1}^{n} \sum_{t \in T} Y_{i,t} \, \log \mathrm{softmax}\big( P[i] \big)_{t},$$

where $Y_{i,t}$ indicates whether the tag $t$ is assigned at position $i$ and $\theta$ is the set of all network parameters. Next, we define the CRF loss as commonly used with LSTM-based models for entity recognition. We learn a transition score matrix $M \in \mathbb{R}^{|T| \times |T|}$, inferred from the training data, such that $M_{s,t}$ is the transition score from tag $s$ to tag $t$. Given an example as a sequence of word indices $x = (x_1, \ldots, x_n)$ and a candidate tag sequence as a sequence of tag indices $y = (y_1, \ldots, y_n)$, the tag assignment score (t-score) is defined as

$$\mathrm{score}(x, y; \theta) = \sum_{i=1}^{n} \big( M_{y_{i-1}, y_i} + P_{i, y_i} \big),$$

where $y_0$ is a special start tag. Intuitively, this score summarizes the likelihood of observing a transition from tag $y_{i-1}$ to tag $y_i$ in addition to the likelihood of emitting tag $y_i$ given the semantic context at position $i$. Thus $P$ is treated as a matrix of emission scores for the CRF. For an example with input $x$ and ground truth tag assignment $y$, the loss is computed as the negative log-likelihood of the tag assignment as informed by the normalized tag assignment score, or

$$\mathcal{L}_{\mathrm{CRF}}(x, y; \theta) = - \log \frac{\exp\big( \mathrm{score}(x, y; \theta) \big)}{\sum_{\tilde{y} \in \mathcal{Y}(x)} \exp\big( \mathrm{score}(x, \tilde{y}; \theta) \big)},$$

where $\mathcal{Y}(x)$ is the set of all possible tag assignments. The final per-example loss for sequence labeling is simply a summation of the two losses: $\mathcal{L}_{\mathrm{SL}} = \mathcal{L}_{\mathrm{SCE}} + \mathcal{L}_{\mathrm{CRF}}$. During testing, we use the Viterbi algorithm (Viterbi, 1967), a dynamic programming approach, to decode and identify the globally optimal tag assignment.
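Viterbi decoding over per-position emission scores and a tag-transition matrix can be sketched as follows. This is a generic textbook implementation with toy scores, not the paper's code.

```python
import numpy as np

def viterbi(emissions, transitions):
    """emissions: (n, T) per-position tag scores; transitions: (T, T) where
    transitions[s, t] scores moving from tag s to tag t. Returns the
    globally highest-scoring tag sequence via dynamic programming."""
    n, T = emissions.shape
    score = emissions[0].copy()               # best score ending in each tag
    back = np.zeros((n, T), dtype=int)        # backpointers for path recovery
    for i in range(1, n):
        # total[s, t]: best score of a path ending with transition s -> t at i
        total = score[:, None] + transitions + emissions[i][None, :]
        back[i] = total.argmax(axis=0)
        score = total.max(axis=0)
    path = [int(score.argmax())]              # best final tag
    for i in range(n - 1, 0, -1):             # follow backpointers
        path.append(int(back[i][path[-1]]))
    return path[::-1]

# Two tags; emissions favor tag 0 at the first two positions, tag 1 at the last.
emissions = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
path = viterbi(emissions, np.zeros((2, 2)))   # -> [0, 0, 1]
```

With a real transition matrix, invalid moves (e.g., an L tag following an O tag) receive low scores and are avoided by the decoder.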

3.4.5. Consequence Prediction

Once precipitants (and corresponding interaction types) have been identified, we perform so-called consequence prediction (CP) for all precipitant drugs identified as participating in PD and PK interactions. The classification task of CP takes as input the target sentence and two candidate entities that are referred to as the subject and object entities. Here the subject is always a precipitating drug; on the other hand, the object designation depends on the type of interaction (more later). First, we define the representation matrix for CP as $R^{\mathrm{CP}}$, where

$$R^{\mathrm{CP}}[i] = W[i] \oplus C[i] \oplus G[i].$$

We process the matrix via convolutions of window sizes 3, 4, and 5 and concatenate the results to produce the final feature vector $h$. In addition to CNN features, we map entities to their graph-based context features and append them to $h$, which has previously been shown to work well in a similar architecture (Li et al., 2017). Concretely, the final feature vector is

$$h = \mathrm{CNN}_{3}^{\ell}(R^{\mathrm{CP}}) \oplus \mathrm{CNN}_{4}^{\ell}(R^{\mathrm{CP}}) \oplus \mathrm{CNN}_{5}^{\ell}(R^{\mathrm{CP}}) \oplus G[p] \oplus G[q],$$

where $\ell$, as a hyperparameter, is the number of CNN filters per convolution, and $p$ and $q$ are the position indices of the last word (typically the “head” word) of the subject and object entities respectively.

The actual entities designated as the subject/object pair depend on the interaction type; for PD interactions, the subject is the precipitant drug and the object is some candidate effect mention. For PK interactions, however, the subject is the precipitant drug and the object is chosen to be the closest (based on character offset) mention of the label drug with respect to the target precipitant drug. We found this appropriate based on manual review of the data, as the NCI code being assigned depends highly on whether the increase/decrease in functional measurements is with respect to the label drug or the precipitant drug. In case the label drug is not mentioned, a generic "null" vector is used to represent the object.
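The closest-mention heuristic for PK interactions can be sketched as follows; this is a hypothetical helper (not the paper's code) in which mentions are represented by their character offsets:

```python
def closest_mention(precipitant_span, label_drug_spans):
    """Pick the label-drug mention closest to the precipitant by character offset.

    Spans are (start, end) character offsets. Returns None when the label
    drug is never mentioned, in which case a "null" vector is used instead.
    """
    if not label_drug_spans:
        return None

    def distance(span):
        # character gap between the two spans (0 if they overlap)
        (s1, e1), (s2, e2) = precipitant_span, span
        return max(s2 - e1, s1 - e2, 0)

    return min(label_drug_spans, key=distance)
```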

When performing sequence labeling, we pass in the entire dependency tree encoded as an adjacency matrix. However, when performing consequence prediction and both entities are non-null, we pass in a pruned version of the tree that is tailored to the entity pair. We apply the same pruning strategy proposed by Zhang et al. (2018), wherein for a pair of subject and object entities we keep only nodes either along or within one hop of the shortest dependency path between them. This prevents distant and irrelevant portions of the dependency tree from influencing the model while retaining important modifying and negating terms. The pruned adjacency matrix is thus a function of the entity pair.
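The pruning strategy of Zhang et al. (2018) can be sketched with a breadth-first search; we assume the tree is given as undirected adjacency lists (an illustrative representation, not the paper's data structures):

```python
from collections import deque

def prune_tree(adj, subj, obj):
    """Keep nodes on the shortest subj-obj dependency path, plus one hop."""
    def bfs(start):
        # distances from the start node to every reachable node
        dist = {start: 0}
        q = deque([start])
        while q:
            u = q.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    q.append(v)
        return dist

    d_subj, d_obj = bfs(subj), bfs(obj)
    path_len = d_subj[obj]
    # a node lies on a shortest path iff its two distances sum to path_len
    inf = float("inf")
    on_path = {u for u in adj
               if d_subj.get(u, inf) + d_obj.get(u, inf) == path_len}
    # retain nodes within one hop of the path
    keep = set(on_path)
    for u in on_path:
        keep.update(adj[u])
    return keep
```

Keeping the one-hop neighborhood is what preserves modifiers and negation words that hang directly off the path.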

To determine whether there is a PD interaction between a pair of entities, we employ a standard binary classification output layer. Concretely, given the final feature vector $v$ for an example sentence, the probability of a PD interaction between the entity pair is

$$p = \sigma(w^{\top} v + b),$$

where $w$ and $b$ are network parameters and $\sigma$ is the sigmoid function. The associated binary cross-entropy loss is

$$\mathcal{L}_{\mathrm{PD}} = -\,y \log p - (1 - y)\log(1 - p),$$

where $y \in \{0, 1\}$ indicates the ground truth. For PK interactions, we instead use a softmax function to produce a probability distribution $\hat{p}$ over the 20 labels corresponding to NCI Thesaurus codes. Concretely, the predicted probability of label $k$ is $\hat{p}_k = \operatorname{softmax}(W v + c)_k$, where $W$ and $c$ are network parameters. Given a one-hot vector $y$ indicating the ground truth, the associated softmax cross-entropy loss is

$$\mathcal{L}_{\mathrm{PK}} = -\sum_{k} y_k \log \hat{p}_k.$$

The loss for a batch of examples is simply the sum of its constituent example-based losses.
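The two consequence-prediction losses can be written out numerically as follows; this is a plain-Python sketch with illustrative names, not the deep learning framework code used for training:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def pd_loss(logit, y):
    """Binary cross-entropy for PD interaction prediction (y is 0 or 1)."""
    p = sigmoid(logit)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def pk_loss(logits, gold):
    """Softmax cross-entropy over the candidate NCI Thesaurus codes."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    return -math.log(exps[gold] / sum(exps))
```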

3.4.6. Neural Network Configuration and Training Details

Setting                          Value
Learning Rate                    0.001
Dropout Rate                     0.5
Character Embedding Size         25
Character Representation Size    50
Word Embedding Size              200
Context Embedding Size           100
GC Hidden Size                   100
GC Attention Size                25
Sequence LSTM Hidden Size        200
Outcome CNN Filter Count         50
Table 3. Model configuration obtained through random search over 11-fold cross-validation of TR22 (training data).

On each training iteration, we randomly sample 10 sentences from the training data. These are re-composed into three sets of task-specific examples corresponding to the tasks of sequence labeling, PD prediction, and PK prediction respectively. Unlike our prior work, in which the sub-tasks were trained in an interleaved fashion, we train on all objectives jointly. Here, we dynamically switch among the four training objective losses based on whether there are available training examples (in the batch and for the current iteration) for each task. The final training loss is the sum of the objective losses for which examples are available.
We train the network for a maximum of 10,000 iterations, check-pointing and evaluating every 100 iterations on a validation set of sentences from four held-out drug labels. Only the checkpoint that performed best on the validation set is kept for test-time evaluation. The choice of hyperparameters is shown in Table 3; discrete numbered parameters corresponding to embedding or hidden sizes were chosen from {10, 25, 50, 100, 200, 400} based on random search and optimized by assessing 11-fold cross-validation performance on TR22. The learning and dropout rates are set to typical default values. We used Word2Vec embeddings pretrained on the corpus of PubMed abstracts (Pyysalo et al., 2013). All other variables are initialized using values drawn from a normal distribution and further tuned during training. Words were tokenized on both spaces and punctuation marks; punctuation tokens were kept, as is common practice for NER-type systems. For dependency parsing, we use SyntaxNet, which implements the transition-based neural model by Andor et al. (2016). We trained the aforementioned parser, using default settings, on the GENIA corpus (Kim et al., 2003) and use it to obtain projective dependency parses for each example.
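The tokenization rule above, splitting on whitespace while keeping punctuation marks as standalone tokens, can be approximated with a single regular expression; this is an illustrative sketch rather than the exact preprocessing code:

```python
import re

def tokenize(text):
    """Split on whitespace; keep each punctuation mark as its own token."""
    # runs of word characters stay together; any other non-space
    # character becomes a single-character token
    return re.findall(r"\w+|[^\w\s]", text)
```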

3.5. Transfer Learning with Network Pre-Training

An obstacle in solving this flavor of DDI extraction as a machine learning problem is the high potential for overfitting given the sparse nature of the output space, which is further intensified by the scarce availability of high-quality training data. As quality training data is expensive and requires domain expertise, we propose a transfer learning approach in which the model is pre-trained on external data as follows. First, we pre-train on the DDI2013 dataset, which contains strictly binary relation DDI annotations and no interaction consequence annotations. Hence, DDI2013 is only used to train the sequence labeling objective. Next, we pre-train on a re-annotated version of NLM180, a collection of 66 drug labels re-annotated to conform to TR22 formatting. While the original set of NLM180 contained 180 unique drugs with incomplete annotations, the re-annotated version contains a smaller subset of 66 drug labels whose annotations are more comprehensive and include interaction outcome information. Finally, we fine-tune for the target task by training on the official TR22 dataset. Translating DDI2013 to the TAC 2018 format is an imperfect process given structural (breadth and depth of annotations) and semantic (guidelines in addition to annotator experience and vision) differences. For example, differences in how entity boundaries are annotated, such as whether or not modifier terms should be kept as part of a named entity, may have a large impact on model performance. Hence, we expect the translated version of DDI2013 to be very noisy as training examples for the target task. We describe the translation processes for NLM180 and DDI2013 in Sections 3.5.1 and 3.5.2. Summary statistics of all relevant datasets are presented in Table 1.

3.5.1. NLM180 Mapping Scheme

In NLM180, there is no distinction between triggers and effects; moreover, PK effects are limited to coarse-grained (binary) labels corresponding to an increase or decrease in functional measurements. Hence, a direct mapping from NLM180 to the TR22 annotation scheme is impossible. As a compromise, NLM180 "triggers" were mapped to TR22 triggers in the case of unspecified and PK interactions. For PD interactions, we instead mapped NLM180 "triggers" to TR22 effects, which we believe to be appropriate based on our manual analysis of the data. Since we do not have both a trigger and an effect for every PD interaction, we opted to ignore trigger mentions altogether in the case of PD interactions to avoid introducing mixed signals. While trigger recognition has no bearing on relation extraction performance, this policy has the effect of reducing the recall upper bound on NER by about 25% based on early cross-validation results. To overcome the lack of fine-grained annotations for PK outcomes in NLM180, we deploy the well-known bootstrapping approach (Jones et al., 1999) to incrementally annotate NLM180 PK outcomes using TR22 annotations as a starting point. To mitigate the problem of semantic drift, we re-annotated by hand any iterative predictions that were not consistent with the original NLM180 coarse annotations (i.e., active learning (Settles, 2012)).

3.5.2. DDI2013 Mapping Scheme

The DDI2013 dataset contains annotations that are incomplete with respect to the target task; specifically, annotations are limited to typed binary relations between any two drugs mentioned in the sentence (and not necessarily between a mentioned drug and the label drug), without outcome or consequence annotations. In DDI2013, there are four types of interactions: mechanism, effect, advice, and int. The mechanism type indicates that a PK mechanism is being discussed; effect indicates that the consequence of a PD interaction is being discussed; advice indicates suggestions regarding the handling of the drugs; and int is an interaction without any specific additional information. We translate the annotations by first applying a filtering step so that they conform to the target task; namely, we keep only interactions involving the label drug. The non-label drug entity is then annotated as a precipitant with an interaction tag based on the following mapping scheme. Entities involved in a mechanism relation with the label drug are treated as KIN precipitants; likewise, entities in effect and advice relations are treated as DYN precipitants, and entities in int relations are treated as UNK precipitants. As there is no consequence annotation, the mapped examples are used to train the sequence labeling objective but not the other objectives.
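The mapping can be sketched as a simple filter-and-relabel pass; the data structures below are illustrative stand-ins, not the DDI2013 XML schema:

```python
# DDI2013 relation type -> TR22 precipitant interaction tag
TYPE_MAP = {"mechanism": "KIN", "effect": "DYN", "advice": "DYN", "int": "UNK"}

def translate_relations(relations, label_drug):
    """Keep only interactions involving the label drug and re-tag the other entity.

    relations -- iterable of (drug_a, drug_b, ddi2013_type) triples
    Returns (precipitant, tag) pairs for the sequence labeling objective.
    """
    mapped = []
    for a, b, rel_type in relations:
        if label_drug not in (a, b):
            continue  # filter: the interaction must involve the label drug
        precipitant = b if a == label_drug else a
        mapped.append((precipitant, TYPE_MAP[rel_type]))
    return mapped
```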

3.6. Voting-based Ensembling

Our prior effort (Tran et al., 2018) showed that model ensembling resulted in optimal performance for this task. Hence, model ensembling remains a key component of the proposed model. We ensemble over ten models, each trained with randomly initialized weights and a random development split. Intuitively, the models collectively "vote" on which predicted annotations are kept and which are discarded. A unique annotation (entity or relation) receives one vote for each time it appears in one of the ten models' prediction sets. In terms of implementation, unique annotations are incrementally added (to the final prediction set) in order of descending vote count; subsequent annotations that conflict (i.e., overlap based on character offsets) with existing annotations are discarded. Hence, we loosely refer to this approach as "voting-based" ensembling.
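The voting procedure can be sketched as follows; annotations are represented as hypothetical (span, type) tuples, and conflicts are judged by character-offset overlap as described above:

```python
from collections import Counter

def spans_overlap(a, b):
    """Two (start, end) character spans overlap if neither ends before the other starts."""
    return a[0] < b[1] and b[0] < a[1]

def vote_ensemble(prediction_sets):
    """Merge per-model prediction sets by descending vote count.

    Each prediction is ((start, end), type); overlapping annotations
    with fewer votes are discarded.
    """
    votes = Counter()
    for preds in prediction_sets:
        votes.update(set(preds))  # one vote per model per unique annotation
    final = []
    for ann, _ in votes.most_common():
        if all(not spans_overlap(ann[0], kept[0]) for kept in final):
            final.append(ann)
    return final
```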

3.7. Model Evaluation

We used the official evaluation metrics for NER and relation extraction based on the standard precision, recall, and F1 micro-averaged over exactly matched entity/relation annotations. We use the strictest matching criteria, corresponding to the official "primary" metric, as opposed to the "relaxed" metric that ignores mention and interaction type. Concretely, the matching criteria for entity recognition consider entity bounds as well as the type of the entity. The matching criteria for relation extraction comprehensively consider precipitant drugs and, for each, the corresponding interaction type and interaction outcome. As relation extraction evaluation takes into account the bounds of constituent entity predictions, relation extraction performance is heavily reliant on entity recognition performance. On the other hand, we note that while NER evaluation considers trigger mentions, triggers are ignored when evaluating relation extraction performance. Two test sets of 57 and 66 drug labels, referred to as Test Set 1 and Test Set 2 respectively, with gold standard annotations are used for evaluation.

Next, we discuss the differences between these test sets. As shown in Table 1, Test Set 1 closely resembles TR22 with respect to the sections that are annotated. However, Test Set 1 is more sparse in the sense that there are more sentences per drug label (144 vs. 27), with a smaller proportion of those sentences having gold annotations (23% vs. 51%). Test Set 2 is unique in that it contains annotations from only two sections, namely DRUG INTERACTIONS and CLINICAL PHARMACOLOGY, the latter of which is not represented in TR22 (nor Test Set 1). Lastly, TR22, Test Set 1, and Test Set 2 all vary with respect to the distribution of interaction types, with TR22, Test Set 1, and Test Set 2 containing a higher proportion of PD, UN, and PK interactions respectively. Overall model performance is assessed using a single metric defined as the average of entity recognition and relation extraction performance across both test sets.

4. Results and Discussion

Test 1 / Entity Test 1 / Relation Test 2 / Entity Test 2 / Relation Overall
Method Training Data P (%) R (%) F (%) P (%) R (%) F (%) P (%) R (%) F (%) P (%) R (%) F (%) P (%) R (%) F (%)
BL TR22 23.82 42.04 30.39 14.74 18.38 16.35 26.15 39.69 31.51 12.48 15.43 13.79 19.30±0.12 28.88±0.12 23.01±0.06
GCA TR22 32.87 32.35 32.59 22.70 13.95 17.27 38.82 31.31 34.65 19.26 11.63 14.49 28.41±0.09 22.31±0.16 24.75±0.11
BL TR22 + NLM180 27.05 39.87 32.22 19.94 22.20 21.00 32.49 41.92 36.60 21.82 23.93 22.82 25.32±0.09 31.98±0.11 28.16±0.06
GCA TR22 + NLM180 38.30 31.20 34.38 27.97 15.14 19.63 44.13 31.18 36.53 31.79 15.76 21.06 35.55±0.18 23.32±0.20 27.90±0.17
BL TR22 + NLM180 + DDI2013 29.27 41.93 34.47 22.93 25.42 24.11 38.73 43.79 41.10 27.11 27.32 27.21 29.51±0.10 34.61±0.10 31.72±0.06
GCA TR22 + NLM180 + DDI2013 41.58 38.24 39.83 31.84 20.49 24.93 47.54 36.12 41.04 32.07 17.81 22.90 38.26±0.16 28.17±0.12 32.18±0.12
GC TR22 + NLM180 + DDI2013 38.85 36.30 37.52 29.82 18.59 22.88 43.74 34.88 38.80 31.14 16.40 21.48 35.89±0.20 26.54±0.20 30.17±0.19
GCA + BL TR22 + NLM180 + DDI2013 35.22 44.23 39.20 27.58 24.77 26.09 45.50 45.10 45.30 31.69 24.89 27.87 35.00±0.15 34.75±0.13 34.61±0.10

Table 4. Main results based on a 95% confidence interval around mean precision, recall, and F1 from evaluating N=100 ensembles for each model. BL is our original challenge submission using a BiLSTM-based approach, trained on only TR22 and a non-official re-annotation of NLM180. For reference, we include an evaluation of the standard GC without attention-gating. Our current best is a combination of GCA and BL by ensembling.

Test 1 / Entity Test 1 / Relation Test 2 / Entity Test 2 / Relation
Method Training Data P (%) R (%) F (%) P (%) R (%) F (%) P (%) R (%) F (%) P (%) R (%) F (%)
Dandala et al. (2018) TR22 + NLM180 41.94 23.19 29.87 25.24 16.10 19.66 44.61 29.31 35.38 22.99 16.83 19.43
Tran et al. (2018) TR22 + NLM180 29.50 37.45 33.00 22.08 21.13 21.59 36.68 40.02 38.28 22.53 21.13 23.55
BL + GCA (Ours) TR22 + NLM180 32.89 41.06 36.51 24.66 21.35 22.87 40.57 42.44 41.47 28.15 22.42 24.95
Table 5. Comparison of our method with comparable (based on training data) methods of teams in the top 5.

In order to assess model performance with confidence intervals and draw conclusions based on statistical significance, we use a technique called bootstrap ensembling proposed by Kavuluru et al. (2017). That is, for each neural network (NN), we train a pool of 30 models, each with a different set of randomly initialized weights and a different training-development split. Performance of the NN is evaluated by computing the 95% confidence interval around the mean F1 of N=100 ensembles, where each ensemble is assembled from a set of ten models randomly sampled from the pool. This approach allows us to better assess average performance, which is a nontrivial task given the high-variance nature of models learned with limited training data. Our method for model ensembling (by "voting") is described in Section 3.6.
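The bootstrap-ensembling evaluation can be sketched as follows; the pool contents and the ensemble-scoring callable are hypothetical stand-ins for actually ensembling and evaluating trained models:

```python
import random
import statistics

def bootstrap_ci(pool, score_ensemble, n_ensembles=100, ensemble_size=10, z=1.96):
    """95% confidence interval around the mean ensemble F1.

    pool           -- list of trained models (here: any objects)
    score_ensemble -- callable mapping a list of models to an F1 score
    """
    scores = []
    for _ in range(n_ensembles):
        ensemble = random.sample(pool, ensemble_size)  # sample without replacement
        scores.append(score_ensemble(ensemble))
    mean = statistics.mean(scores)
    half_width = z * statistics.stdev(scores) / len(scores) ** 0.5
    return mean - half_width, mean + half_width
```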
We present the main results of this study in Table 4, where we compare our prior efforts using strictly BiLSTMs (BL) and our current best results with graph convolutions (GCA). BL with TR22 and NLM180 as training data corresponds to our prior best at 28.16% overall F1, while GCA with TR22, NLM180, and DDI2013 as training data represents our current best at 32.18% overall F1 based on graph convolutions. Here, we observe a 4 point gain in overall F1 (statistically significant at the 95% confidence level based on non-overlapping confidence intervals), with most gains owing to a substantial improvement in entity recognition performance. We note that GCA is more precision focused while BL is more recall focused; moreover, GCA tends to exhibit better performance on Test Set 1, while BL tends to exhibit better performance on Test Set 2. This hints that the two architectures are highly complementary and may work well in combination. Indeed, when combined via ensembling, we observe a major performance gain across almost all measures. Here, for each ensemble, we sample five models from each pool of models (GCA and BL) for a total of ten models to ensure that results remain comparable. The resulting hybrid model exhibits the best performance overall at 34.61% overall F1, improving over the current best by more than two points and over the prior best by more than six points. These differences are statistically significant at the 95% confidence level. Next, we highlight that a main benefit of the GCA model is that it operates well with very small amounts of training data, as evidenced by the almost 2 absolute point improvement over the BiLSTM model when trained solely on TR22. These gains tend to be less notable when we involve examples from NLM180 and DDI2013. Lastly, we note that GCA (graph convolution with attention-gating) performs better than the standard GC (graph convolution without attention-gating) by two absolute points in overall F1, with improvements that are consistent across all metrics.
We present a comparison of our results with other works in Table 5. We omit results by Tang et al. (2018) as they are not directly comparable to ours given the stark difference in available training data. When training on strictly TR22 and NLM180 (thus being comparable to most prior work), our model exhibits state-of-the-art performance across all metrics on both test sets.

We present Figures 3 and 4 to illustrate error cases discussed later in Section 5. In addition to actual and predicted annotations, these figures include a visualization of sigmoid gate activity for edges in the dependency tree. The visualization serves two purposes: first, it confirms the intuition behind this particular design and, second, it provides a means of interpreting model decisions. That is, we can observe the importance of each edge in the dependency tree as deemed by the network for a particular example. In Figure 3, for example, we can observe that for the target word "digoxin" (which is a precipitant), the words "use", "concomitantly", and "with" show very high activity. Likewise, signal flow from "hemodynamic" to "effects" is strong, and vice versa. Less important words such as articles appear to receive less incoming activity overall, even through self-loops.

Figure 3. An example sentence from the drug label for Savella along with the resulting prediction and ground truth labels. Red arrows indicate interaction outcome.
Figure 4. An example sentence from the drug label for Aubagio along with the resulting prediction and ground truth labels. Red arrows indicate interaction outcome, where C54357 is a PK label corresponding to the NCI Thesaurus code for “Increased Concomitant Drug Level.”

5. Error Analysis

In this section, we perform error analysis to identify challenging cases that typically result in erroneous predictions by the model. One major source of difficulty is boundary detection in cases of multi-word entities. Errors of this type are especially prominent for effect mentions, which may manifest as potentially long noun phrases. Phrases with conjunctions or punctuation marks (or a combination thereof) may also present an obstacle for the model; for example, an effect expressed as "serious and/or life threatening reaction" may instead be predicted as simply "life threatening reaction." Figure 3 shows a general case of this error where the model recognizes "potentiation of adverse hemodynamic effects" as the effect while the ground truth identifies the effect as simply "adverse hemodynamic effects." This leads to both a false positive and a false negative for both the NER and the RE evaluation. We note that, given the potentially limitless ways an effect may be expressed, any disagreement among annotators (for cases beyond those addressed in annotator guidelines) during the initial annotation process will lead to inconsistent ground truth data and thus negatively affect downstream model performance. As an example, consider the following two sentences that appear in TR22: "Co-administration of SAMSCA with potent CYP3A inducers .." and "For patients chronically taking potent inducers of CYP3A, .." Here, one sentence is annotated such that potent is included as part of the precipitant expression, while the other is annotated such that this modifier is excluded.

Mixed signals and noisy labels in general tend to be an issue, especially when there is limited training data, as deep learning models are prone to overfitting. When evaluating purely on effect mentions, we obtain a micro-F1 score of 66% (54% precision, 87% recall). However, the micro-F1 is 87% when ignoring the starting boundary offset and 86% when ignoring the ending boundary offset during evaluation, corresponding to roughly 20 absolute points of micro-F1 gain. When applying the same looser evaluation criteria to triggers and precipitants, the gains are comparatively marginal. Thus there is immense potential for improving entity recognition of effect mentions if we can better handle boundary detection, possibly via rule-based methods or post-processing adjustments, with the added benefit of improving consequence prediction performance for PD interactions.
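The relaxed boundary evaluation can be sketched as follows; entities are hypothetical (start, end, type) triples, and matching optionally ignores one boundary (a simplification that treats predictions as sets):

```python
def f1(gold, predicted, ignore_start=False, ignore_end=False):
    """Micro-F1 where a prediction matches a gold entity on type and
    on whichever boundaries are not ignored."""
    def key(entity):
        start, end, etype = entity
        return (None if ignore_start else start,
                None if ignore_end else end,
                etype)
    gold_keys = {key(e) for e in gold}
    pred_keys = {key(e) for e in predicted}
    tp = len(gold_keys & pred_keys)
    if not tp:
        return 0.0
    precision = tp / len(pred_keys)
    recall = tp / len(gold_keys)
    return 2 * precision * recall / (precision + recall)
```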

Precipitants interacting with the label drug being mentioned multiple times may also cause issues for the model. As an example, consider the sentence presented in Figure 3. Our model identifies both mentions of the precipitant "Digoxin" as being involved in an interaction with the drug Savella; however, the ground truth more specifically recognizes the second mention as the sole precipitant. This results in an additional false positive with respect to both NER and RE evaluation. Lastly, there are cases where the model mistakes a mention subtly referring to the label drug for a precipitant. This is a common occurrence in cases where the label drug is not referred to by name, but by a class of drugs. Typically, identifying a mention as a reference to the label drug beforehand will disqualify it from being predicted as a precipitant. While we do use a lexicon of drug names mapped to drug synonyms and drug classes to identify these indirect mentions, it is not exhaustive for all drugs. For example, within the label of the drug Lexapro, consider the sentence "Altered anticoagulant effects, including increased bleeding, have been reported when SSRIs and SNRIs are coadministered with warfarin." Here, the model recognized SSRI and SNRI as precipitants. This is incorrect, however, as Lexapro is an SSRI and these mentions are more than likely referring to Lexapro. Without this information, the model likely assumes that this is an implicit case where the label drug is not mentioned and therefore assumes all drug mentions are precipitants. Hence, curating a more exhaustive lexicon for indirect mentions of the label drug would improve overall performance.



           Predicted
           PD    PK    UN
Actual PD  788   37    68
       PK  57    353   147
       UN  170   10    599
Table 6. Confusion matrix for interaction type (rows: actual; columns: predicted).

Lastly, we describe a source of difficulty stemming from incorrectly classified interaction types. Figure 4 presents an example sentence where our model mistakes a PK interaction for a PD interaction and a trigger mention for an effect mention. As PD and PK interactions tend to frequently co-occur with effect and trigger mentions respectively, predicted annotations tend to be polarized toward one pair (PD with effect) or the other (PK with trigger). Hence, differentiating between types of interactions for each recognized precipitant is another interesting class of error. Among all correctly recognized precipitants (based purely on boundary detection), we analyzed cases where one type of interaction, among PD, PK, and Unspecified (UN), is mistaken for another via the confusion matrix in Table 6. Clearly, many errors are due to cases where (1) we mistake unspecified precipitants for PD precipitants and (2) we mistake PK precipitants for unspecified precipitants. We conjecture that making precise implicit connections (not only whether there is evidence in the form of trigger words or phrases, but whether the evidence concerns the particular precipitant) is highly nontrivial. This aspect may likely be improved by the inclusion of more high-quality training data. Confusion between trigger and effect mentions is less concerning; among more than 1,000 cases, there are six cases where we mistake an effect for a trigger and 20 cases where we mistake a trigger for an effect.

6. Conclusion

In this study, we proposed an end-to-end method for extracting drugs and their interactions from drug labels, including interaction outcomes in the case of PK and PD interactions. The method involves composing various intermediate representations, including linear and graph-based context, where the latter is produced using a novel attention-gated version of the graph convolution over dependency parse trees. This so-called graph convolution with attention-gating (GCA), along with transfer learning via serial pre-training on other annotated DDI datasets including DDI2013, resulted in an improvement over our original TAC challenge entry by up to 6 absolute F1 points overall. Among comparable studies (based on training data composition), our method exhibits state-of-the-art performance across all metrics and test sets. Future work will focus on curating more quality training data and leveraging semi-supervised methods to overcome the scarcity of training data.

Acknowledgements and Funding

This research was partially conducted during TT’s participation in the Lister Hill National Center for Biomedical Communications (LHNCBC) Research Program in Medical Informatics for Graduate students at the U.S. National Library of Medicine, National Institutes of Health. HK is supported by the intramural research program at the U.S. National Library of Medicine, National Institutes of Health. RK and TT are also supported by the U.S. National Library of Medicine through grant R21LM012274.


  • Andor et al. (2016) Daniel Andor, Chris Alberti, David Weiss, Aliaksei Severyn, Alessandro Presta, Kuzman Ganchev, Slav Petrov, and Michael Collins. 2016. Globally normalized transition-based neural networks. arXiv preprint arXiv:1603.06042 (2016).
  • Asada et al. (2018) Masaki Asada, Makoto Miwa, and Yutaka Sasaki. 2018. Enhancing Drug-Drug Interaction Extraction from Texts by Molecular Structure Information. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers). 680–685.
  • Bahdanau et al. (2015) Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In Proceedings of the 3rd International Conference on Learning Representations (ICLR).
  • Chiu and Nichols (2016) Jason PC Chiu and Eric Nichols. 2016. Named Entity Recognition with Bidirectional LSTM-CNNs. Transactions of the Association for Computational Linguistics 4 (2016), 357–370.
  • Dandala et al. (2018) Bharath Dandala, Diwakar Mahajan, and Ananya Poddar. 2018. IBM Research System at TAC 2018: Deep Learning architectures for Drug-Drug Interaction extraction from Structured Product Labels. In Proceedings of the 2018 Text Analysis Conference (TAC 2018).
  • Demner-Fushman et al. (2018) Dina Demner-Fushman, Kin Wah Fung, Phong Do, Richard D. Boyce, and Travis Goodwin. 2018. Overview of the TAC 2018 Drug-Drug Interaction Extraction from Drug Labels Track. In Proceedings of the 2018 Text Analysis Conference (TAC 2018).
  • He et al. (2016) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 770–778.
  • Herrero-Zazo et al. (2013) María Herrero-Zazo, Isabel Segura-Bedmar, Paloma Martínez, and Thierry Declerck. 2013. The DDI corpus: An annotated corpus with pharmacological substances and drug–drug interactions. Journal of biomedical informatics 46, 5 (2013), 914–920.
  • Jones et al. (1999) Rosie Jones, Andrew McCallum, Kamal Nigam, and Ellen Riloff. 1999. Bootstrapping for text learning tasks. In IJCAI-99 Workshop on Text Mining: Foundations, Techniques and Applications, Vol. 1.
  • Kavuluru et al. (2017) Ramakanth Kavuluru, Anthony Rios, and Tung Tran. 2017. Extracting Drug-Drug Interactions with Word and Character-Level Recurrent Neural Networks. In Fifth IEEE International Conference on Healthcare Informatics (ICHI). IEEE, 5–12.
  • Kim et al. (2003) J-D Kim, Tomoko Ohta, Yuka Tateisi, and Jun’ichi Tsujii. 2003. GENIA corpus—a semantically annotated corpus for bio-textmining. Bioinformatics 19, suppl_1 (2003), i180–i182.
  • Kim (2014) Yoon Kim. 2014. Convolutional Neural Networks for Sentence Classification. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics, Doha, Qatar, 1746–1751.
  • Kohn et al. (2000) Linda T Kohn, Janet M Corrigan, and Molla S Donaldson. 2000. To err is human: building a safer health system. Vol. 6. National Academies Press.
  • Levinson (2010) Daniel R Levinson. 2010. Adverse events in hospitals: national incidence among Medicare beneficiaries. Department of Health and Human Services Office of the Inspector General (2010).
  • Li et al. (2017) Fei Li, Meishan Zhang, Guohong Fu, and Donghong Ji. 2017. A neural joint model for entity and relation extraction from biomedical text. BMC bioinformatics 18, 1 (2017), 198.
  • Lim et al. (2018) Sangrak Lim, Kyubum Lee, and Jaewoo Kang. 2018. Drug drug interaction extraction from the literature using a recursive neural network. PloS one 13, 1 (2018), e0190926.
  • Liu et al. (2016a) Shengyu Liu, Kai Chen, Qingcai Chen, and Buzhou Tang. 2016a. Dependency-based convolutional neural network for drug-drug interaction extraction. In 2016 IEEE International Conference on Bioinformatics and Biomedicine (BIBM). IEEE, 1074–1080.
  • Liu et al. (2016b) Shengyu Liu, Buzhou Tang, Qingcai Chen, and Xiaolong Wang. 2016b. Drug-drug interaction extraction via convolutional neural networks. Computational and mathematical methods in medicine 2016 (2016).
  • Luo et al. (2016) Yuan Luo, Özlem Uzuner, and Peter Szolovits. 2016. Bridging semantics and syntax with graph algorithms—state-of-the-art of extracting biomedical relations. Briefings in bioinformatics 18, 1 (2016), 160–178.
  • Pyysalo et al. (2013) Sampo Pyysalo, Filip Ginter, Hans Moen, Tapio Salakoski, and Sophia Ananiadou. 2013. Distributional semantics resources for biomedical text processing. In Proceedings of 5th International Symposium on Languages in Biology and Medicine. 39–44.
  • Ratinov and Roth (2009) Lev Ratinov and Dan Roth. 2009. Design challenges and misconceptions in named entity recognition. In Proceedings of the Thirteenth Conference on Computational Natural Language Learning. Association for Computational Linguistics, 147–155.
  • Sahu and Anand (2018) Sunil Kumar Sahu and Ashish Anand. 2018. Drug-drug interaction extraction from biomedical texts using long short-term memory network. Journal of biomedical informatics 86 (2018), 15–24.
  • Settles (2012) Burr Settles. 2012. Active learning. Synthesis Lectures on Artificial Intelligence and Machine Learning 6, 1 (2012), 1–114.
  • Suárez-Paniagua et al. (2017) Víctor Suárez-Paniagua, Isabel Segura-Bedmar, and Paloma Martínez. 2017. Exploring convolutional neural networks for drug–drug interaction extraction. Database 2017 (2017).
  • Sun et al. (2019) Xia Sun, Ke Dong, Long Ma, Richard Sutcliffe, Feijuan He, Sushing Chen, and Jun Feng. 2019. Drug-Drug Interaction Extraction via Recurrent Hybrid Convolutional Neural Networks with an Improved Focal Loss. Entropy 21, 1 (2019), 37.
  • Tang et al. (2018) Siliang Tang, Qi Zhang, Tianpeng Zheng, Mengdi Zhou, Zhan Chen, Lixing Shen, Xiang Ren, Yueting Zhuang, Shiliang Pu, and Fei Wu. 2018. Two Step Joint Model for Drug Drug Interaction Extraction. In Proceedings of the 2018 Text Analysis Conference (TAC 2018).
  • Tran et al. (2018) Tung Tran, Ramakanth Kavuluru, and Halil Kilicoglu. 2018. A Multi-Task Learning Framework for Extracting Drugs and Their Interactions from Drug Labels. In Proceedings of the 2018 Text Analysis Conference (TAC 2018).
  • Viterbi (1967) Andrew Viterbi. 1967. Error bounds for convolutional codes and an asymptotically optimum decoding algorithm. IEEE transactions on Information Theory 13, 2 (1967), 260–269.
  • Zhang et al. (2018) Yuhao Zhang, Peng Qi, and Christopher D Manning. 2018. Graph Convolution over Pruned Dependency Trees Improves Relation Extraction. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing.
  • Zhao et al. (2016) Zhehuan Zhao, Zhihao Yang, Ling Luo, Hongfei Lin, and Jian Wang. 2016. Drug drug interaction extraction from biomedical literature using syntax convolutional neural network. Bioinformatics 32, 22 (2016), 3444–3453.