Discovering knowledge from biomedical literature is an important task for both research organizations and industrial companies. PubMed (https://pubmed.ncbi.nlm.nih.gov), one of the most widely used search engines for biomedical literature, has indexed tens of millions of articles, and millions of new papers come out every year Landhuis (2016). It is impossible to manually check all the papers to obtain useful knowledge, which results in an urgent demand to automatically discover knowledge from the literature.
The interaction between drugs and targets in the human body plays a crucial role in biomedical science and applications Sachdev and Gupta (2019); Wu et al. (2018); Yamanishi et al. (2008), e.g., drug discovery, drug repurposing, and precision medicine. In biomedical literature, a drug refers to any type of medication, ranging from small molecules like Aspirin and Penicillin to large molecules like Hepatitis B Vaccine. A target can be a protein, enzyme or nucleic acid in the body that binds the drug. Drugs interact with targets in different ways. For example, Aspirin (drug) can inhibit (interaction) COX-1 (target), and Streptokinase (drug) can activate (interaction) Plasminogen (target). For simplicity, we refer to the triplet of Drug, Target and their Interaction as a “DTI triplet”.
Discovering DTI triplets from biomedical papers is challenging. First, many terms and aliases (e.g., abbreviations, synonyms) exist in an article, but only a small set of them contributes to DTI triplets, which makes this task harder than conventional relation extraction from general text. As shown in Figure 1, given the title and abstract of a paper, we want to discover the triplet ⟨Clotrimazole, Ergosterol, Inhibitor⟩. There are many terms like “fungal cytoplasmic membrane”, “azoles” and “C. albicans” that are not related to the triplet we want to discover and that increase the difficulty of the task. Second, much DTI knowledge is expressed across multiple sentences Verga et al. (2018), which is more difficult to handle than conventional relation extraction built upon one or two sentences Yao et al. (2019). As shown in Figure 1, no single sentence contains a complete DTI triplet.
Existing methods of discovering biological knowledge are mainly extractive approaches that apply named entity recognition and entity relation extraction in sequence. Learning models for such a pipeline requires detailed annotations, including all mentions of biological entities, the relations between every two entity mentions, and so on. However, it is difficult and costly to obtain sufficient annotations. On the one hand, since the literature is usually long, the labeling workload is greatly increased compared to common knowledge extraction, whose input is relatively short (e.g., a few sentences). On the other hand, due to the diverse term-entity aliases and specialized expressions, identifying DTI triplets from papers requires expert knowledge of biomedical domains, as shown in Figure 1.
To relieve this task of the annotation difficulties, in this paper we explore the first end-to-end solution. We regard the DTI triplet as a sequence following the order of drug, interaction and target, and use a Transformer-based model for generation. The document is fed into the encoder, and the DTI triplet is directly generated by the decoder. Therefore, learning such a model does not require annotating every entity mention and the relations among them. PubMedBERT Gu et al. (2020), a model pre-trained on millions of abstracts from biomedical literature, is applied to the encoder and decoder of our model. Further, we use semi-supervised annotations to improve performance, where the end-to-end model trained on the limited labeled data is used to select/filter unlabeled literature and label it.
Experimental results demonstrate that (1) generative models perform better than extractive ones and are more promising for this task; (2) leveraging unlabeled data can further boost the performance of generative models; (3) the performance of all the models is far from industrial demands, even boosted by semi-supervised learning, which suggests that DTI discovery is a challenging task and calls for more research efforts from the machine learning and natural language processing community.
Finally, we provide a new dataset, KD-DTI, for discovering DTI from documents. KD-DTI is built upon DrugBank Wishart et al. (2017) and Therapeutic Target Database (briefly, TTD) Wang et al. (2020). As far as we know, KD-DTI is the largest dataset for discovering DTI triplet from literature. The dataset contains training samples, validation samples, and test samples. We will release the data to the community.
Our contributions are summarized as follows:
(1) We explore the first end-to-end generative solution for extracting DTI triplets from biomedical literature, which shows the possibility of discovering DTI knowledge with much less annotation effort (§2). The method can be further enhanced by the semi-supervised learning we propose (§2.2).
(2) We create KD-DTI (§3.1), the largest dataset for discovering Drug-Target-Interaction triplets from literature. It contains K labeled data and K words in total. We expect that such a dataset will advance the research of knowledge discovery from biomedical literature.
2 Our Method
In this section, we first introduce our end-to-end generative solution for DTI triplet discovery (§2.1), and then describe how we utilize unlabeled data to further ease the difficulty of annotation (§2.2).
2.1 End-to-end Triplet Generation
To avoid labeling intermediate annotations (i.e., labels for entity mentions and relations between each pair of entities) and sequentially applying multiple models as extractive methods do, we explore generative methods for this task. Specifically, we use a Transformer model Vaswani et al. (2017). The encoder of the Transformer encodes the document (i.e., title and abstract), and the decoder generates the DTI triplets. The output of the decoder has the following format:
<d> drug$_1$ <i> interaction$_1$ <t> target$_1$ <d> drug$_2$ <i> interaction$_2$ <t> target$_2$ $\cdots$,
where the drug, interaction and target of each triplet are separated by the special tokens <d>, <i> and <t>, and all triplets are concatenated into one longer sequence.
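The target format above can be illustrated with a small round-trip sketch. The function names `linearize` and `parse` are hypothetical helpers introduced here for illustration, and the rule of silently skipping malformed chunks is an assumption, not something the paper specifies.

```python
from typing import List, Tuple

# A triplet in generation order: (drug, interaction, target).
Triplet = Tuple[str, str, str]


def linearize(triplets: List[Triplet]) -> str:
    """Concatenate triplets into one decoder target sequence."""
    parts = []
    for drug, interaction, target in triplets:
        parts.append(f"<d> {drug} <i> {interaction} <t> {target}")
    return " ".join(parts)


def parse(sequence: str) -> List[Triplet]:
    """Recover triplets from a generated sequence, skipping malformed chunks."""
    triplets = []
    # Every triplet starts with <d>; text before the first <d> is ignored.
    for chunk in sequence.split("<d>")[1:]:
        if "<i>" not in chunk:
            continue
        drug, rest = chunk.split("<i>", 1)
        if "<t>" not in rest:
            continue
        interaction, target = rest.split("<t>", 1)
        fields = (drug.strip(), interaction.strip(), target.strip())
        if all(fields):  # drop triplets with an empty element
            triplets.append(fields)
    return triplets


gold = [("Aspirin", "inhibitor", "COX-1"),
        ("Streptokinase", "activator", "Plasminogen")]
print(linearize(gold))
print(parse(linearize(gold)) == gold)
```

Parsing defensively matters in practice because a generative decoder is free to stop mid-triplet; an incomplete tail is simply dropped here.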
Recently, pre-trained language models, such as BERT Devlin et al. (2019) for the general domain and PubMedBERT Gu et al. (2020) for the biomedical domain, have achieved great success in NLP. However, as suggested by previous work Dong et al. (2019); Zhu et al. (2020), directly using BERT to initialize the parameters of generation models is not the best choice (Footnote 2: Using generative pre-trained models, e.g., GPT2 Radford et al. (2019), may avoid such effects, and we leave it as future work). Therefore, we follow Zhu et al. (2020) to indirectly incorporate the pre-trained models.
For the encoder, we first use BERT to extract token representations $H_B = (h_1, \dots, h_n)$, where $h_i$ is the BERT embedding of the $i$-th token. After that, given an $L$-layer Transformer encoder, we obtain the final encoder output by incorporating additional attention over the features extracted by BERT. Mathematically, the output of the $l$-th encoder layer, $H_E^l$, is computed from $H_E^{l-1}$ and $H_B$ (Eqn. (1)), where $H_E^0$ is the embedding of all input tokens (Footnote 3: $H_E^0$ in Eqn. (1) and $S^0$ in Eqn. (2) are not the BERT embedding $H_B$, but the input embeddings of the Transformer encoder/decoder, which are randomly initialized and learned during training). $\mathrm{LN}$ is the layer normalization. $\mathrm{attn}$ is the multi-head attention function operating on vector packages of queries and values (also used as keys). $\mathrm{FFN}$ is a position-wise feed-forward network. We denote $H_E^L$ as the final representations of the input tokens.
On the decoder side, we use attention over the BERT features in a similar way to the encoder. The hidden state of the $l$-th decoding layer is computed from the previous layer, $H_B$ and $H_E^L$ (Eqn. (2)), where $S^l$ is the output of the $l$-th decoding layer, which is a package of hidden state representations. $S^0$ is the non-BERT embedding of all decoded tokens plus a <SOS> token (input of the first decoding step).
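As a sketch of the two layer updates, the following writes them out under the assumption that they follow the BERT-fused formulation of Zhu et al. (2020), which this section cites, with residual connections and layer normalization placed as in a standard post-norm Transformer. The symbols $H_B$ (BERT features), $H_E^l$ (encoder layer output), $S^l$ (decoder layer output), the subscripts on $\mathrm{attn}$ and the averaging factor $\tfrac{1}{2}$ are assumptions taken from that formulation, not confirmed by the text here.

```latex
\begin{align}
\tilde{H}_E^{l} &= \mathrm{LN}\Big(H_E^{l-1} + \tfrac{1}{2}\big(\mathrm{attn}_S(H_E^{l-1}, H_E^{l-1}) + \mathrm{attn}_B(H_E^{l-1}, H_B)\big)\Big), \nonumber\\
H_E^{l} &= \mathrm{LN}\big(\tilde{H}_E^{l} + \mathrm{FFN}(\tilde{H}_E^{l})\big), \tag{1}\\[4pt]
\hat{S}^{l} &= \mathrm{LN}\big(S^{l-1} + \mathrm{attn}_S(S^{l-1}, S^{l-1})\big), \nonumber\\
\tilde{S}^{l} &= \mathrm{LN}\Big(\hat{S}^{l} + \tfrac{1}{2}\big(\mathrm{attn}_B(\hat{S}^{l}, H_B) + \mathrm{attn}_E(\hat{S}^{l}, H_E^{L})\big)\Big), \nonumber\\
S^{l} &= \mathrm{LN}\big(\tilde{S}^{l} + \mathrm{FFN}(\tilde{S}^{l})\big). \tag{2}
\end{align}
```

Here each $\mathrm{attn}(Q, V)$ takes queries $Q$ and values $V$ (also used as keys), $\mathrm{attn}_S$ is self-attention, $\mathrm{attn}_B$ attends over the BERT features, and $\mathrm{attn}_E$ attends over the final encoder output.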
2.2 Semi-supervised Learning
The end-to-end solution proposed in the previous section reduces the effort of detailed annotation, like annotating where an entity starts/ends in a paper (even if it does not contribute to the final DTI), or the evidence for how a relation is obtained. However, it is still costly to obtain a large number of (document, DTI triplet) pairs, since such labeling requires a solid biomedical background. To remedy this, we propose a semi-supervised method based on our end-to-end model. The method consists of two steps: we first filter the data based on rules, and then use the model from the previous section to refine the labels. We also discuss the relation to distant supervision.
(Step-1) Rule-based filtration: We download all available titles and abstracts from PubMed. For each document (i.e., title and abstract), we use ScispaCy Neumann et al. (2019), an open-source NER tool, to find all possible drug and target entities, and use FuzzyMatch to search for all mentioned interactions in the document (Footnote 4: We collect commonly used interactions from DrugBank in advance, like inhibition, agonist, antagonist, etc.). Assume we extract $m$ drugs, $n$ targets and $k$ interactions from document $x$. By enumerating all combinations, document $x$ corresponds to $m \times n \times k$ ⟨drug, target, interaction⟩ triplets. An underlying assumption is that DTI triplets that appear more frequently across all documents are more likely to be true. We then count the number of occurrences of each DTI triplet and delete the triplets whose count falls below a minimum number of occurrences (Footnote 5: According to our preliminary exploration, if we randomly select a drug, a target and an interaction from our dataset, most of those DTI triplets occur fewer than 4 times in all the PubMed papers). We keep the documents that have at least one DTI triplet after the deletion, and refer to the resulting collection as the rule-filtered dataset.
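The rule-based filtration in Step-1 can be sketched as follows. This is a minimal illustration that assumes the entity lists have already been produced by an NER tool and a string matcher; the function names, the `min_count` parameter, and the choice to count each triplet at most once per document are all assumptions made for the example.

```python
from collections import Counter
from itertools import product


def candidate_triplets(drugs, targets, interactions):
    """All (drug, target, interaction) combinations for one document."""
    return set(product(set(drugs), set(targets), set(interactions)))


def rule_filter(docs, min_count):
    """Keep documents that retain at least one sufficiently frequent triplet.

    docs maps a document id to a tuple (drugs, targets, interactions).
    Assumption: a triplet's frequency is the number of documents it occurs in.
    """
    per_doc = {doc_id: candidate_triplets(*ents) for doc_id, ents in docs.items()}
    counts = Counter(t for trips in per_doc.values() for t in trips)
    kept = {}
    for doc_id, trips in per_doc.items():
        frequent = {t for t in trips if counts[t] >= min_count}
        if frequent:  # drop documents left with no triplet
            kept[doc_id] = frequent
    return kept


docs = {
    "a": (["aspirin"], ["COX-1"], ["inhibitor"]),
    "b": (["aspirin", "ibuprofen"], ["COX-1"], ["inhibitor"]),
    "c": (["metformin"], ["AMPK"], ["activator"]),
}
print(rule_filter(docs, min_count=2))
```

In this toy input, the aspirin triplet appears in two documents and survives, while the ibuprofen and metformin triplets appear once and are deleted, which removes document "c" entirely.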
(Step-2) Model-based labeling: We propose a method based on knowledge distillation Hinton et al. (2015) to refine the DTI triplets in the rule-filtered dataset from Step-1. First, we use a generative model trained on the supervised data to generate DTI triplets for each unlabeled document. If the Transformer model does not generate any triplet for a document, we remove that document. Each remaining document is associated with at least one generated triplet and at least one pseudo triplet from Step-1. If at least two elements (e.g., drug-target, drug-interaction, or target-interaction) of a pseudo triplet are the same as those of a generated triplet, we keep this document (and the matched pseudo triplet); otherwise, we delete it. The documents that survive this filtration form the model-refined dataset. Note that we use the matched pseudo triplets in this dataset for the following experiments; the generated triplets are only used for filtration, not for model training.
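The matching rule of Step-2 can be sketched as below. The helper names are hypothetical, triplets are assumed to be (drug, target, interaction) tuples compared by exact string equality, and a document would be kept only if the returned list is non-empty.

```python
def shared_elements(a, b):
    """Number of positions (drug, target, interaction) on which two triplets agree."""
    return sum(1 for x, y in zip(a, b) if x == y)


def matched_pseudo_triplets(pseudo, generated, min_shared=2):
    """Pseudo triplets that agree with some generated triplet on >= min_shared elements."""
    return [
        p for p in pseudo
        if any(shared_elements(p, g) >= min_shared for g in generated)
    ]


pseudo = [("aspirin", "COX-1", "inhibitor"), ("aspirin", "AMPK", "activator")]
generated = [("aspirin", "COX-1", "antagonist")]
print(matched_pseudo_triplets(pseudo, generated))
```

Note that only pseudo triplets are returned: the generated triplets act purely as a filter, mirroring the statement that they are not used for model training.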
Another way to use unlabeled data is distant supervision Mintz et al. (2009) (DS). Given any DTI triplet ⟨drug, target, interaction⟩ from the labeled dataset, we use FuzzyMatch to search the unlabeled documents. If we find mentions of both the drug and the target in a document, we assign the triplet to that document as a pseudo label. We refer to the obtained dataset as the DS dataset. We empirically found that our proposed semi-supervised method is better than distant supervision, since DS does not take interactions into consideration.
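A minimal sketch of the distant supervision baseline follows. It swaps FuzzyMatch for a plain case-insensitive substring test, which is an assumption made to keep the example dependency-free; the function name and the (drug, target, interaction) tuple layout are likewise illustrative.

```python
def distant_labels(document, known_triplets):
    """Assign every known triplet whose drug AND target are both mentioned.

    The interaction is deliberately not checked, which is the weakness of
    plain distant supervision discussed above.
    """
    text = document.lower()
    return [
        triplet for triplet in known_triplets
        if triplet[0].lower() in text and triplet[1].lower() in text
    ]


known = [("Aspirin", "COX-1", "inhibitor"),
         ("Streptokinase", "Plasminogen", "activator")]
doc = "Aspirin irreversibly acetylates COX-1 in platelets."
print(distant_labels(doc, known))
```

The example document never mentions the word "inhibitor", yet the triplet is still assigned, showing how this labeling can attach an interaction the text does not support.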
3 The Corpus
In this section, we first introduce the acquisition of the proposed dataset KD-DTI (§3.1), and then introduce its statistics and characteristics (§3.2). The dataset is available in the supplementary material and will be released later.
3.1 Dataset Creation
Table 1: Statistics of KD-DTI and related datasets.
| Dataset | # Documents | # Relations | # Sentences | # Words | Knowledge |
|---|---|---|---|---|---|
| CPI-DS Döring et al. (2020) | N/A | 1 | 2,613 | 486k | Chemical Proteins Relation |
| BC5CDR Li et al. (2016) | 1,500 | 1 | 11,089 | 282k | Chemical Disease Relation |
| ChemProt Antunes and Matos (2019) | 2,432 | 5 | 24,923 | 650k | Chemical Proteins Relation |
| KD-DTI | 14,256 | 66 | 139,810 | 3,671k | Drug Target Interaction |
| KD-DTI (semi) | 139,408 | 66 | 1,556,614 | 39,997k | Drug Target Interaction |
Data collection The DTI triplets in our dataset come from two widely used databases, DrugBank Wishart et al. (2017) and Therapeutic Target Database (TTD) Wang et al. (2020). (1) DrugBank (https://go.drugbank.com/) is a pharmaceutical knowledge base that consists of proprietary authored content describing clinical-level information about drugs. DrugBank covers drugs, targets, types of interactions and DTI triplets. (2) TTD (http://db.idrblab.net/ttd/) is a comprehensive collection of various types of drugs, which includes drugs, targets, interactions and DTI triplets. Given a DTI triplet, if the reference papers are provided and the abstracts of those papers are openly accessible, we record the triplet and the paper. As the first step, we only use the titles and abstracts of the reference papers. We refer to the result of this step as the collected dataset.
Data filtration As a starting point of structured DTI knowledge discovery, we are only interested in documents that contain enough information to discover a DTI triplet. However, in the dataset collected above, some papers only describe some drugs and targets in general terms, and the DTI triplets do not explicitly appear. Therefore, we heuristically filter out the samples from which the associated DTI triplets cannot be obtained. The basic idea is to require that the drug, target, and interaction of a triplet are all included in the paper (i.e., a name or alias appears). After filtration, we split the remaining documents into a test set, a validation set, and a training set. The detailed filtration process is described in Appendix B.
Human verification We then manually check all the samples in the test set. We employ eleven annotators with a Ph.D. background. Each (document, DTI triplets) pair is independently checked by two annotators. If their evaluation results differ, another two annotators are involved for discussion. We remove the difficult cases on which no consensus is reached after the discussion among four annotators. The final test set consists of (document, DTI triplets) pairs from DrugBank and pairs from TTD. We explicitly keep them as two separate parts so that one can check the performance on different data sources.
3.2 Comparisons with Previous Datasets
Table 1 shows the statistics of our dataset as well as some related datasets. We have the following observations:
(1) In terms of data size (including numbers of documents, sentences and words), our dataset is much larger than previous datasets.
(2) Although ChemProt and CPI-DS also focus on tasks in the biomedical domain, they do not directly serve for drug-target interaction discovery from literature. ChemProt mainly focuses on relation extraction and assumes that entities are given in advance, while our KD-DTI is to discover DTI triplets (instead of relation only) from documents. CPI-DS is to extract relation from single sentences, and thus is much easier than our task that tackles long documents.
(3) KD-DTI includes a rich variety of relation types that are not covered by previous datasets. ChemProt covers five relations, and CPI-DS and BC5CDR contain one relation only. In comparison, there are 66 relations in our dataset (Table 1).
(4) KD-DTI is collected from more than one data source, i.e., DrugBank and TTD, which could be used to evaluate the generalization or transfer abilities of machine learning algorithms/models.
4 Experiments
In this section, we introduce several extractive baselines in §4.1, describe the settings of experiments in §4.2 and show the results of our end-to-end method and the results of using semi-supervised learning in §4.3 and §4.4.
4.1 Extractive Baselines
We compare with two extractive baselines: Cascade Relation extraction (CasRel) Wei et al. (2020), which is the state-of-the-art extractive method, and a pure NER method, which regards the relation as a special entity.
CasRel: CasRel is a cascade tagging method that can jointly perform NER and RE. CasRel leverages BERT to extract representations for input sequences. To find DTI triplets, CasRel first tags all possible drugs (i.e., subjects) in the input. After that, CasRel searches for interactions (i.e., relations) and targets (i.e., objects) for the discovered drugs. For this purpose, we train a classifier for each relation, whose input is the discovered drug and whose output is the position of the target, i.e., classifications of whether each token is the start or end token of the target phrase. The classifier is allowed to output null, indicating no target for this relation.
To use CasRel, we obtain the named entity annotations of drugs and targets by searching the document with FuzzyMatch. Although CasRel achieved great success in standard relation extraction tasks like NYT Riedel et al. (2010) and WebNLG Gardent et al. (2017), in our setting, the annotations for all entities are automatically obtained without manually checking, which limits the performance of CasRel.
Pure NER method: In biomedical literature, interactions often explicitly appear in documents in specific forms (e.g., noun, verb, past/present participle). Therefore, it is natural to regard the interaction as a special entity and use an NER model to find the DTI triplets. For this purpose, after obtaining the mentions of drugs, targets and interactions using FuzzyMatch, we train a BERT-based NER model in which interactions are also entity types. During training, the model learns to predict the probability that a token belongs to an entity span of a drug, a target or an interaction. At inference time, the trained model tags the token spans of drugs, targets and interactions. For simplicity, we choose the drug, target and interaction with the maximum probability to constitute the DTI triplet. With this method, we can predict at most one DTI triplet for each document.
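The final selection step of the pure NER baseline can be sketched as follows. The input format, a list of (text, entity type, probability) spans, and the function name are assumptions for illustration; how span probabilities are derived from token-level predictions is not specified in the text and is left out.

```python
def best_triplet(spans):
    """Pick the most probable drug, target and interaction span.

    spans: iterable of (text, entity_type, probability), with entity_type
    in {"drug", "target", "interaction"}.
    Returns (drug, target, interaction), or None if any type is missing.
    """
    best = {}
    for text, entity_type, prob in spans:
        if entity_type not in best or prob > best[entity_type][1]:
            best[entity_type] = (text, prob)
    if not all(t in best for t in ("drug", "target", "interaction")):
        return None  # cannot form a complete triplet
    return (best["drug"][0], best["target"][0], best["interaction"][0])


spans = [
    ("aspirin", "drug", 0.9),
    ("ibuprofen", "drug", 0.6),
    ("COX-1", "target", 0.8),
    ("inhibits", "interaction", 0.7),
]
print(best_triplet(spans))
```

Because only one span per type is selected, the function yields at most one triplet per document, which is the stated limitation of this baseline.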
|DrugBank||Triplet Level||Ontology Level||Entity Level (Acc.)|
|Transformer + BERT||30.32||31.46||29.26||29.50||31.64||27.64||53.00||55.27||79.47|
|Transformer + PubMedBERT||34.82||35.88||33.82||32.87||34.73||31.22||55.73||58.91||82.11|
|Transformer + BERT-attn||34.60||35.50||33.74||33.26||35.50||33.74||56.80||55.12||79.82|
|Transformer + PubMedBERT-attn||36.97||37.82||36.16||34.32||36.64||32.28||57.33||58.69||82.59|
|TTD||Triplet Level||Ontology Level||Entity Level (Acc.)|
|Transformer + BERT||7.63||7.87||7.41||7.36||8.44||6.53||14.44||51.98||87.72|
|Transformer + PubMedBERT||7.81||8.28||7.41||7.11||7.83||6.52||12.83||58.47||86.93|
|Transformer + BERT-attn||8.34||8.42||8.27||7.59||8.14||7.10||15.46||53.37||87.03|
|Transformer + PubMedBERT-attn||8.88||9.21||8.57||7.87||8.83||7.10||14.60||61.97||89.50|
4.2 Settings
For CasRel, we mainly follow the hyperparameters suggested by Wei et al. (2020). One modification to CasRel is that, since our input text can exceed the maximum input length of BERT (Footnote 8: After BPE, some abstracts are longer than this limit), we cut the document into several pieces of bounded length. We use BERT to encode each piece and concatenate all the representations for further processing. We use one relation classifier per relation type, where each classifier is a single-layer feed-forward network with ReLU activation that takes the BERT embedding as input. The drug and target identifier is a single-layer feed-forward network.
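The document-splitting step used for CasRel can be sketched as below. The function name and the `piece_len` parameter are hypothetical, and the example uses non-overlapping pieces, which is an assumption; the text does not say whether pieces overlap.

```python
def split_into_pieces(tokens, piece_len):
    """Cut a token sequence into consecutive pieces of at most piece_len tokens."""
    if piece_len <= 0:
        raise ValueError("piece_len must be positive")
    return [tokens[i:i + piece_len] for i in range(0, len(tokens), piece_len)]


tokens = [f"tok{i}" for i in range(10)]
pieces = split_into_pieces(tokens, piece_len=4)
print([len(p) for p in pieces])

# Concatenating the pieces restores the original order, so per-piece encoder
# outputs can be concatenated back into one document-level representation.
restored = [tok for piece in pieces for tok in piece]
print(restored == tokens)
```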
For the pure NER model, an entity classifier of the single-layer feed-forward network is applied after the BERT module. We jointly finetune the BERT and the classification heads. We use Adam optimizer Kingma and Ba (2014) with learning rate . The minibatch size is sequences. The models are trained for epochs with early stopping, where the training stops if the validation performance does not increase for epochs.
For generative models, after tokenization, we apply BPE Sennrich et al. (2015) to both the source sequences and target sequences to reduce vocabularies. We set the number of layers as , and the embedding dimension as . We use Adam optimizer with the inverse_sqrt scheduler. The learning rate is and warm-up steps are . The dropout and attention dropout of Transformer are set as and . The label smoothing is set as . The batch size is tokens per GPU. For the Transformer with pre-trained models, we explore two methods as introduced in §2.1. We try the conventional BERT model and PubMedBERT model, in which PubMedBERT is trained using abstracts of all PubMed papers. All models are trained on a single V100 GPU.
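For reference, the inverse square-root schedule named above is commonly implemented as a linear warm-up followed by decay proportional to the inverse square root of the step. The sketch below follows that common form; the function name is hypothetical, and the peak learning rate and warm-up length in the example are placeholder values, not the paper's settings.

```python
import math


def inverse_sqrt_lr(step, peak_lr, warmup_steps, init_lr=0.0):
    """Learning rate at a given update step (warmup_steps must be positive)."""
    if step < warmup_steps:
        # Linear warm-up from init_lr to peak_lr.
        return init_lr + (peak_lr - init_lr) * step / warmup_steps
    # Decay proportional to 1 / sqrt(step), equal to peak_lr at step == warmup_steps.
    return peak_lr * math.sqrt(warmup_steps / step)


# Placeholder values chosen only for illustration.
for step in (2000, 4000, 16000):
    print(step, inverse_sqrt_lr(step, peak_lr=5e-4, warmup_steps=4000))
```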
Evaluation metrics We define a set of metrics to evaluate the performance of a model for DTI discovery, covering different granularity: (1) triplet-level metrics, which are used in previous knowledge discovery work Yao et al. (2019); Maimon and Rokach (2005) and assess the correctness of a discovered triplet as a whole; (2) ontology-level metrics, which evaluate the correctness and completeness of all discovered DTI triplets from a paper corpus as a whole; and (3) entity-level metrics, which evaluate the accuracy of discovered drugs, targets, and interactions respectively. We describe the detailed definition of metrics in Appendix C.
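To make the triplet-level metric concrete, here is a sketch of one common definition: micro-averaged precision, recall and F1 over exactly matching triplets. The precise definitions used in the paper are in its Appendix C, so this form, the function name and the exact-match criterion are all assumptions.

```python
def triplet_prf(predicted, gold):
    """Micro-averaged precision, recall and F1 over exactly matching triplets.

    predicted, gold: lists with one set of triplets per document, aligned by index.
    """
    true_pos = sum(len(p & g) for p, g in zip(predicted, gold))
    n_pred = sum(len(p) for p in predicted)
    n_gold = sum(len(g) for g in gold)
    precision = true_pos / n_pred if n_pred else 0.0
    recall = true_pos / n_gold if n_gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


A = ("aspirin", "inhibitor", "COX-1")
B = ("streptokinase", "activator", "plasminogen")
C = ("metformin", "activator", "AMPK")
D = ("metformin", "inhibitor", "AMPK")
print(triplet_prf([{A}, {B, C}], [{A}, {B, D}]))
```

Under exact matching, the prediction C gets no credit even though it shares the drug and target with the gold triplet D, which is why the entity-level metrics are reported separately.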
4.3 Results of the End-to-End Method
The test results on DrugBank and TTD are reported in Table 2. Due to space limits, we leave the standard deviations of the results to Appendix E and the case study to Appendix F. We have the following observations:
(1) Generative methods obtain better results than the extractive method (i.e., CasRel) on KD-DTI in terms of the triplet-level and ontology-level metrics. One reason is that our task lacks manual annotation of intermediate labels, such as the BIO representations of all entities and the relations between any two entities. We obtain such intermediate labels with FuzzyMatch; they are usually of poor quality and therefore impair the performance of extractive methods. For the DTI triplet discovery task, such intermediate labels are often hard to obtain, and we should keep exploring how to improve performance without them.
(2) For extractive methods, the pure NER method outperforms CasRel on triplet-level metric and entity-level metric. Specifically, for entity-level drug accuracy, the pure NER method even achieves the best result. This shows when intermediate labels are lacking and the relations among entities are comprehensive, simplifying this problem (like extracting only one triplet for a document) is another choice.
(3) Using pre-trained models is helpful for our task. Taking DrugBank as an example, initializing the encoder with conventional BERT improves the triplet-level F1 over the plain Transformer to 30.32. With PubMedBERT, a model pre-trained on all abstracts of PubMed, we achieve an even higher F1 score of 34.82. This demonstrates the effectiveness of pre-training, especially in-domain pre-training.
(4) The manner of using pre-trained models also matters. Compared with directly initializing the encoder with a pre-trained model, we find that indirectly incorporating the pre-trained model can further boost the performance: on DrugBank, BERT-attn and PubMedBERT-attn improve the triplet-level F1 by more than 4 and 2 points over BERT and PubMedBERT respectively (34.60 vs. 30.32, and 36.97 vs. 34.82 in Table 2).
(5) The scores on TTD are lower than those on DrugBank because TTD is a harder dataset. To verify this, we calculate the minimal distance between drugs and targets: given a document $x$ and a DTI triplet $y$, let $P_d$ and $P_t$ denote the sets of positions of the drug and the target of $y$ obtained by FuzzyMatch in $x$. The distance is defined as $\min_{i \in P_d,\, j \in P_t} |i - j|$. The average minimal distance over all test samples is larger for TTD than for DrugBank, which shows that identifying the DTI triplet from TTD requires understanding a longer document.
(6) While using pre-trained models achieves the best results, we observe that it suffers from overfitting: the F1 score of PubMedBERT on the training set is much higher than its scores on the validation set and on the test sets (averaged over the two test sets). We find that simply using larger dropout or label smoothing does not help, which suggests that better regularization techniques are needed for this task. More details are in Appendix D.
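The minimal drug-target distance used in observation (5) is straightforward to compute once mention positions are available. In the sketch below, positions are assumed to be token offsets, and the function name and the `None` return for a missing mention are illustrative choices.

```python
def minimal_distance(drug_positions, target_positions):
    """Smallest absolute offset between any drug mention and any target mention.

    Returns None when either entity has no mention in the document.
    """
    if not drug_positions or not target_positions:
        return None
    return min(abs(i - j) for i in drug_positions for j in target_positions)


# Drug mentioned at tokens 3 and 40, target at tokens 10 and 120.
print(minimal_distance([3, 40], [10, 120]))  # 7
```

Averaging this value over all test samples of a data source gives the statistic compared between DrugBank and TTD.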
Effect of generation order.
Our generative methods learn to generate drug-target-interaction triplets sequentially. An advantage of this is that we can leverage the dependency among the triplets to improve the generation quality. A question arises: does the order of the elements in a DTI triplet matter? To find out, we enumerate all six orders of the triplet on the standard Transformer model and the PubMedBERT-attn model. The results are in Table 3. Generally, the order (drug, interaction, target) performs better, indicating that the order of the triplet should be consistent with natural language order (i.e., subject-verb-object).
|Transformer + PubMedBERT-attn||36.97||36.13||32.48||33.75||34.89||36.54|
|Transformer + PubMedBERT-attn||8.88||8.52||7.83||7.28||8.42||7.12|
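The six element orders compared in this experiment can be enumerated mechanically. The sketch below builds the decoder target for any order; the dictionary-based triplet representation and the helper name are assumptions made for illustration.

```python
from itertools import permutations

# Special token that introduces each element, as in the output format of Section 2.1.
SPECIAL = {"drug": "<d>", "interaction": "<i>", "target": "<t>"}


def linearize_in_order(triplet, order):
    """Serialize one triplet (a dict keyed by element name) in the given order."""
    return " ".join(f"{SPECIAL[name]} {triplet[name]}" for name in order)


orders = list(permutations(("drug", "interaction", "target")))
example = {"drug": "Aspirin", "interaction": "inhibitor", "target": "COX-1"}
for order in orders:
    print(order, "->", linearize_in_order(example, order))
```

Training one model per order on targets serialized this way, with everything else fixed, isolates the effect of generation order.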
4.4 Results of the Semi-supervised Method
As generative models perform better than extractive ones, we focus on generative ones in this sub-section and conduct experiments with the Transformer model and the Transformer + PubMedBERT-attn model. We merge the KD-DTI corpus with the dataset built by our semi-supervised method and with the dataset built by distant supervision, respectively, to get two enlarged datasets, and then train models on them. Instead of training from scratch, we find that initializing the parameters from a model trained on the parallel corpus KD-DTI is better.
The results are shown in Table 4. We have the following observations:
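Table 4: Results of the semi-supervised method.
| Dataset | Model | Triplet Level: No Enhance | Triplet Level: + DS | Triplet Level: + Ours | Ontology Level: No Enhance | Ontology Level: + DS | Ontology Level: + Ours |
|---|---|---|---|---|---|---|---|
| DrugBank | Transformer + PubMedBERT-attn | 36.97 | 35.11 | 39.78 | 34.32 | 35.15 | 38.87 |
| TTD | Transformer + PubMedBERT-attn | 8.88 | 10.83 | 12.26 | 7.87 | 8.01 | 9.80 |

Table 5: Ablation of the semi-supervised method (differences from the full method in parentheses).
| Setting | DrugBank | TTD | Average |
|---|---|---|---|
| w/o rule-filter | 33.45 (-6.33) | 6.82 (-5.44) | 20.14 (-5.88) |
| w/o model-label | 33.55 (-6.32) | 11.09 (-1.17) | 22.32 (-3.70) |
| DS w/ interaction | 34.04 (-5.74) | 12.16 (-0.10) | 23.10 (-2.92) |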
(1) Enhanced with the data from our semi-supervised method, we achieve an improvement of more than 2 points on DrugBank for both Transformer and Transformer + PubMedBERT-attn; on TTD, significant improvements are also observed.
(2) Enhanced with the distantly supervised data, the generation performance is also generally improved, but not as much as with our method, which shows that the quality of this synthetic data is not as good as that obtained with knowledge distillation. This is consistent with the finding in Yao et al. (2019). Our conjecture is that the documents are rich in entities and noise, and simply using distant supervision without a scoring mechanism cannot lead to significant improvement, especially when the model is equipped with pre-trained knowledge.
From observations (1) and (2), we can also conclude that pre-training and assigning pseudo labels to the unlabeled data are two orthogonal ways, both of which deserve more attention in the future.
(3) We also directly combine with the parallel KD-DTI dataset (which is up sampled by five times) and get the largest training dataset in our experiments. However, while training Transformer (without BERT) with this large dataset, the triplet-level F1 scores on DrugBank and TTD are and respectively, which are much worse than training on KD-DTI only. This demonstrates the necessity of quality control in data enhancement.
(4) Even if data enhancement can boost DTI discovery, the overall accuracy is still not very high. That is, DTI discovery is a challenging task. We need to design better models, algorithms, and/or semi-supervised learning methods to meet the expectation of real-world applications.
To better understand the proposed semi-supervised method, we remove the rule-based filtering step and the model-based labeling step respectively. The results are shown in Table 5. When either of the proposed filtering steps is removed, the performance drops significantly on both DrugBank and TTD. These ablations show the effectiveness of the two proposed filtering mechanisms and demonstrate the necessity of reducing noise in semi-supervised data. We also compare with a variant of distant supervision in which the drug, target and interaction must all appear in the input document (denoted as “DS w/ interaction” in Table 5). This variant of DS does not bring significant improvement compared to the standard DS.
5 Related Work
Early research efforts on knowledge discovery are mainly extractive methods Li and Ji (2014); Miwa and Bansal (2016); Zhong and Chen (2020) and focus on discovering knowledge within a few sentences Zeng et al. (2014); Miwa and Bansal (2016); Alt et al. (2019); Zhang et al. (2019). For document level knowledge discovery, existing solutions are mainly graph-based methods that connect entities across sentences to handle knowledge expressed by multiple sentences Quirk and Poon (2017); Peng et al. (2017); Verga et al. (2018); Christopoulou et al. (2019); Nan et al. (2020). These methods often require detailed annotations, such as entity mentions and relations between mentions pairs, which are hard to obtain for DTI knowledge discovery. Recently, generative methods for knowledge discovery are explored Zeng et al. (2020); Ye et al. (2021), which directly generate knowledge triplets from input sentences and only require end-to-end annotations. However, these works focus on sentence-level extraction. End-to-end generative extraction for document-level text like literature is still less investigated.
For discovering knowledge triplets from literature, early works attempt to extract simple relations between bio-entities. Li et al. (2016) extract relations between chemical substances and diseases, and Wu et al. (2019) focus on relations between genes and diseases. The genes, diseases, and chemical substances in those works are easier to recognize, and the extracted relationships are relatively simple (only one relation type). In contrast, DTI covers much more diverse entity terms and more relations. Antunes and Matos (2019) and Döring et al. (2020) discover chemical-protein relations at the document level and the sentence level respectively, which are related to DTI knowledge. However, both of them mainly focus on relation extraction, where the entities are given in advance. By contrast, we aim to jointly discover the DTI triplets from the document. In addition, our dataset has more target and relation types and is much larger in volume.
6 Conclusions and Future Directions
In this work, we explore the first end-to-end generative solution for extracting drug, target, interaction triplets from biomedical literature, which is one of the most important knowledge discovery tasks in the biomedical domain. Besides, we created KD-DTI, the largest dataset for this task, which is expected to boost and advance future research.
There are multiple directions to explore for this task:
(1) Accuracy improvement: We have shown that the performance of several state-of-the-art models is still far from industry demand. Therefore, how to improve accuracy for the task is an important research problem. As shown in this paper, designing better generative models and combining them properly with pre-trained models are promising directions. How to effectively leverage unlabeled data (beyond pre-training) is also worthy of exploration.
(2) Dataset improvement: We have created the largest dataset for the DTI discovery task. The dataset can be improved in terms of scale and quality. Furthermore, there are many other knowledge discovery tasks in the biomedical domain, which also need public datasets for algorithm evaluation.
- Improving relation extraction by pre-trained language representations. In AKBC, Cited by: §5.
Extraction of chemical–protein interactions from the literature using neural networks and narrow instance representation. Database 2019 (baz095). External Links: Cited by: Table 1, §5.
- Connecting the dots: document-level neural relation extraction with edge-oriented graphs. In EMNLP-IJCNLP, Cited by: §5.
- Bert: pre-training of deep bidirectional transformers for language understanding. NAAcl. Cited by: §2.1.
- Unified language model pre-training for natural language understanding and generation. In NeurIPS, H. M. Wallach, H. Larochelle, A. Beygelzimer, F. d’Alché-Buc, E. B. Fox, and R. Garnett (Eds.), pp. 13042–13054. Cited by: §2.1.
- Automated recognition of functional compound-protein relationships in literature. PLoS ONE 15 (3), pp. e0220925. Cited by: Table 1, §5.
- Creating training corpora for NLG micro-planning. In ACL, Cited by: §4.1.
- Domain-specific language model pretraining for biomedical natural language processing. Cited by: §1, §2.1.
- Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531. Cited by: §2.2.
- Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §4.2.
- Scientific literature: information overload. Nature 535 (7612), pp. 457–458. Cited by: §1.
- BioCreative V CDR task corpus: a resource for chemical disease relation extraction. Database 2016. Cited by: Table 1, §5.
- Incremental joint extraction of entity mentions and relations. In ACL, pp. 402–412. Cited by: §5.
- Data mining and knowledge discovery handbook. Cited by: §4.2.
- Distant supervision for relation extraction without labeled data. In ACL, pp. 1003–1011. Cited by: §2.2.
- End-to-end relation extraction using LSTMs on sequences and tree structures. In ACL. Cited by: §5.
- Reasoning with latent structure refinement for document-level relation extraction. In ACL, Cited by: §5.
- ScispaCy: Fast and Robust Models for Biomedical Natural Language Processing. In Proceedings of the 18th BioNLP Workshop and Shared Task, Florence, Italy, pp. 319–327. Cited by: §2.2.
- Cross-sentence n-ary relation extraction with graph LSTMs. Transactions of the Association for Computational Linguistics 5, pp. 101–115. Cited by: §5.
- Distant supervision for relation extraction beyond the sentence boundary. In EACL, Cited by: §5.
- Language models are unsupervised multitask learners. OpenAI Blog 1 (8), pp. 9. Cited by: footnote 2.
- Modeling relations and their mentions without labeled text. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp. 148–163. Cited by: §4.1.
- A comprehensive review of feature based methods for drug target interaction prediction. Journal of Biomedical Informatics 93, pp. 103159. Cited by: §1.
- Neural machine translation of rare words with subword units. arXiv preprint arXiv:1508.07909. Cited by: §4.2.
- Attention is all you need. In NIPS, pp. 5998–6008. Cited by: §2.1.
- Simultaneously self-attending to all mentions for full-abstract biological relation extraction. In NAACL-HLT, Cited by: §1, §5.
- Therapeutic target database 2020: enriched resource for facilitating research and early development of targeted therapeutics. Nucleic acids research 48 (D1), pp. D1031–D1041. Cited by: §1, §3.1.
- A novel cascade binary tagging framework for relational triple extraction. In ACL, pp. 1476–1488. Cited by: §4.1, §4.2.
- DrugBank 5.0: a major update to the DrugBank database for 2018. Nucleic Acids Research 46 (D1), pp. D1074–D1082. Cited by: §1, §3.1.
- Renet: a deep learning approach for extracting gene-disease associations from literature. In International Conference on Research in Computational Molecular Biology, pp. 272–284. Cited by: §5.
- Network-based methods for prediction of drug-target interactions. Frontiers in pharmacology 9, pp. 1134. Cited by: §1.
- Prediction of drug–target interaction networks from the integration of chemical and genomic spaces. Bioinformatics 24 (13), pp. i232–i240. Cited by: §1.
- DocRED: a large-scale document-level relation extraction dataset. ACL. Cited by: Appendix C, §1, §4.2, §4.4.
- Contrastive triple extraction with generative transformer. In AAAI, Cited by: §5.
- Relation classification via convolutional deep neural network. In COLING, Cited by: §5.
- CopyMTL: copy mechanism for joint extraction of entities and relations with multi-task learning. In AAAI, Cited by: §5.
- A review on multi-label learning algorithms. IEEE transactions on knowledge and data engineering 26 (8), pp. 1819–1837. Cited by: Appendix C.
- ERNIE: enhanced language representation with informative entities. In ACL, Cited by: §5.
- A frustratingly easy approach for joint entity and relation extraction. arXiv preprint arXiv:2010.12812. Cited by: §5.
- Incorporating BERT into neural machine translation. In ICLR, Cited by: §2.1.
Appendix A Dataset Introduction
(1) Our dataset, KD-DTI, aims to speed up research on discovering drugs, targets, and their interactions from the literature, which is an important topic. The dataset is built upon DrugBank and TTD. As confirmed with the owners of DrugBank, anyone using our dataset should register a DrugBank account in order to extract the drug and target names related to DrugBank. For TTD, we confirmed with the owner that no license is required.
The dataset is in JSON format, which can be directly loaded. The data structure of our dataset is shown as follows:
(2) Researchers in machine learning, natural language processing, biology, and medicine may benefit from our dataset.
(3) Currently, the dataset is only visible to reviewers through the private URL. After the review process, we will release our dataset through GitHub or a publicly available website. Our dataset will be maintained for a long time.
(4) We have confirmed with the owners of DrugBank and TTD for re-distribution.
(5) The license of the dataset is the Computational Use of Data Agreement (C-UDA) License.
(6) We will update our dataset regularly according to the feedback of users.
Appendix B Detailed Process for Data Filtration
For ease of reference, we denote the dataset obtained without filtration as $\mathcal{D} = \{(x_i, \{t_{i,j}\}_{j=1}^{n_i})\}_i$, where (1) $x_i$ is the $i$-th document (i.e., title and abstract); (2) $t_{i,j}$ is the $j$-th triplet of $x_i$, with its three elements representing the drug, target, and interaction respectively; (3) $n_i$ is the number of triplets associated with $x_i$.
As a starting point for structured DTI knowledge discovery, we are only interested in documents that contain enough information to discover a DTI triplet. However, in the unfiltered dataset, some papers only describe drugs and targets in general terms, and the DTI triplets do not explicitly appear. Therefore, we heuristically filter out the samples from which the associated DTI triplets cannot be obtained. The basic idea is to require that the drug, target, and interaction of a triplet are all included in the paper. We describe the details of the filtration process below.
Given a query $q$ and a document $x$, we first use FuzzyMatch (an open-source tool that leverages Levenshtein distance to fetch words similar to the query, allowing simple variants such as “+s” and “+ed”; https://github.com/taleinat/fuzzysearch) to retrieve all words in $x$ that are similar to $q$ or its synonyms, and denote them as $R = \{r_1, r_2, \ldots\}$, where each $r_k$ is a retrieved phrase. Here the query can be a drug, a target, or an interaction, and both $q$ and $r_k$ can be a single word or a multi-word phrase. Note that we obtain synonyms of a drug or target from the DrugBank and TTD databases, where entities are attached with synonyms. Based on the retrieval results, we categorize $x$ as one of the following patterns for $q$:
Reliable pattern, where the query and the fetched words are almost the same;
Positive pattern, where the query and the fetched words share lots of parts in common;
Negative pattern, where the query is not related to the document.
The detailed patterns are summarized in Table 6.
|Input: A query and the retrieval results .|
|P1||s.t. and are exactly the same (except for parentheses, cases, and punctuation marks);|
|P2||s.t. and have characters or words in common;|
|P3||There are at least elements in matching variant formats of (e.g., +s, +ed, +ing);|
|P4||There are at least elements in s.t. each and has characters in common, or at most different characters;|
|P6||Each in has characters, or is a meaningless word like “other”, “unknown”, etc.|
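The three pattern categories can be illustrated with a minimal Python sketch. It is not the filtering code used to build the dataset: it uses the standard library's `difflib.SequenceMatcher` as a stand-in for the Levenshtein-based retrieval, and the function names and the threshold `POSITIVE_MIN_RATIO` are hypothetical rather than values from Table 6.

```python
import string
from difflib import SequenceMatcher

# Hypothetical threshold; the thresholds of Table 6 are not reproduced here.
POSITIVE_MIN_RATIO = 0.8

_STRIP_PUNCT = str.maketrans("", "", string.punctuation)


def normalize(text: str) -> str:
    """Lower-case, drop punctuation (including parentheses), collapse spaces."""
    return " ".join(text.lower().translate(_STRIP_PUNCT).split())


def categorize(query: str, retrieved: list[str]) -> str:
    """Assign a document to a pattern based on the phrases retrieved from it."""
    q = normalize(query)
    phrases = [normalize(r) for r in retrieved]
    # Reliable: some retrieved phrase equals the query after normalization.
    if any(p == q for p in phrases):
        return "reliable"
    # Positive: some retrieved phrase shares most of its characters with the query.
    if any(SequenceMatcher(None, q, p).ratio() >= POSITIVE_MIN_RATIO for p in phrases):
        return "positive"
    # Negative: nothing in the document resembles the query.
    return "negative"
```

For example, `categorize("Aspirin", ["(aspirin)."])` falls into the reliable pattern because only case and punctuation differ, while a simple variant such as "inhibitors" for the query "inhibitor" falls into the positive pattern.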
Denote the matching score between a query and a document as : If is a reliable pattern of , we set ; if is a positive pattern, we set ; otherwise, . We tried several score settings and chose the one that yielded the best data quality under manual inspection. Given any document and its -th DTI triplet , the matching score between the triplet and document is calculated as follows:
We sort all samples by matching score in descending order, filter out the low-confidence samples whose score is less than zero, and only keep the top high-confidence document-triplet pairs. We pick documents as the initial test set, the as the validation set, and the remaining documents () as the training set.
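The scoring and filtering step can be sketched as follows. The per-pattern scores in `PATTERN_SCORE`, the aggregation by summation in `triplet_score`, and the function names are all assumptions made for illustration; they are not the settings chosen for KD-DTI.

```python
# Hypothetical per-pattern scores (assumption, not the paper's values).
PATTERN_SCORE = {"reliable": 1.0, "positive": 0.5, "negative": -2.0}


def triplet_score(patterns):
    """Assumed aggregation: sum the scores of the drug, target and interaction."""
    return sum(PATTERN_SCORE[p] for p in patterns)


def select_high_confidence(samples, top_k):
    """Keep the top_k document-triplet pairs with a non-negative matching score.

    samples: list of (pair_id, (drug_pattern, target_pattern, interaction_pattern)).
    Returns the kept pair ids, sorted by score in descending order.
    """
    scored = [(triplet_score(patterns), pair_id) for pair_id, patterns in samples]
    # Drop low-confidence pairs whose score is below zero.
    kept = [item for item in scored if item[0] >= 0]
    # Stable sort: ties keep their original relative order.
    kept.sort(key=lambda item: item[0], reverse=True)
    return [pair_id for _, pair_id in kept[:top_k]]
```

Under these assumed scores, a single negative element is enough to push a triplet below zero, which mirrors the requirement that the drug, target, and interaction all appear in the paper.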
Appendix C Definition of Evaluation Metrics
In this section, we present the detailed definitions of the evaluation metrics for DTI discovery, which cover three levels of granularity: (1) triplet-level metrics, (2) ontology-level metrics, and (3) entity-level metrics.
Let $N$ denote the number of documents in the test set. For the $i$-th test document, the set of its associated DTI triplets is denoted as $Y_i$. Let $\hat{Y}_i$ denote the output of a model for the $i$-th sample, which is another set of DTI triplets.
Triplet-level metrics Following previous work on knowledge extraction Yao et al. (2019), we evaluate whether, given a document, the model correctly discovers the corresponding DTI triplets. Since a single paper may contain multiple DTI triplets, we define precision (P), recall (R) and F1 score Zhang and Zhou (2013) as follows:
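As one concrete reading of the triplet-level metrics, the Python sketch below assumes example-based multi-label definitions in the style of Zhang and Zhou (2013): precision and recall are computed per document and averaged over documents, and F1 is the harmonic mean of the two averages. The averaging scheme, the handling of empty sets, and the function name are assumptions, not definitions confirmed by the text.

```python
def triplet_level_prf(gold, pred):
    """Example-based precision, recall and F1 over documents.

    gold, pred: lists of sets of (drug, target, interaction) tuples,
    one set per document, aligned by index.
    """
    n = len(gold)
    precision = recall = 0.0
    for y, y_hat in zip(gold, pred):
        hits = len(y & y_hat)
        # Assumption: a document with no predictions contributes zero precision.
        precision += hits / len(y_hat) if y_hat else 0.0
        recall += hits / len(y) if y else 0.0
    precision /= n
    recall /= n
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1
```

A triplet only counts as a hit when the drug, target, and interaction all match, so a prediction with the right drug and target but the wrong interaction earns no credit here.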
Ontology-level metrics For industrial applications, given a corpus of documents, one important objective is to find all possible knowledge in the corpus. To evaluate knowledge coverage at the corpus level, we define ontology-level metrics (i.e., corpus-level metrics) that evaluate how many triplets of the entire corpus are correctly extracted.
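Define $Y$ as the union of the ground-truth triplet sets over all test documents and $\hat{Y}$ as the union of the model outputs over all test documents. The ontology-level precision (P), recall (R) and F1 are: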
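A corpus-level reading of these metrics can be sketched in Python as follows, assuming that the two sets being compared are the unions of the ground-truth triplets and of the predicted triplets over all documents. The function name and the zero-division conventions are illustrative assumptions.

```python
def ontology_level_prf(gold, pred):
    """Corpus-level precision, recall and F1 over pooled triplet sets.

    gold, pred: lists of sets of (drug, target, interaction) tuples,
    one set per document.
    """
    # Pool the triplets of all documents; duplicates across documents collapse.
    all_gold = set().union(*gold)
    all_pred = set().union(*pred)
    hits = len(all_gold & all_pred)
    precision = hits / len(all_pred) if all_pred else 0.0
    recall = hits / len(all_gold) if all_gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1
```

Because the sets are pooled, a triplet extracted from a different document than the one it was annotated in still counts as covered, which matches the goal of measuring knowledge coverage over the whole corpus.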
Entity-level metrics As mentioned before, a biomedical paper often contains many entities, but many of them are not related to the DTI triplets we want to discover. It is important to extract the right drugs, targets, and interactions from the literature. Therefore, we assess the accuracy for drugs ($A_{\text{drug}}$), targets ($A_{\text{target}}$), and interactions ($A_{\text{inter}}$) respectively.
Let $D_i$ and $\hat{D}_i$ denote the sets of all drugs in the ground-truth triplets and in the model outputs for the $i$-th sample, and similarly $T_i$, $\hat{T}_i$ for targets and $I_i$, $\hat{I}_i$ for interactions. We define drug accuracy, target accuracy, and interaction accuracy as below:
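The entity-level accuracies can be illustrated with the sketch below. It assumes that accuracy is the fraction of ground-truth entities of a given type that appear in the model output for the same document, averaged over documents; this normalization and the function name are assumptions and may differ from the exact definition intended here.

```python
DRUG, TARGET, INTERACTION = 0, 1, 2  # positions inside a triplet tuple


def entity_accuracy(gold, pred, position):
    """Average per-document fraction of ground-truth entities that are recovered.

    gold, pred: lists of sets of (drug, target, interaction) tuples.
    position: DRUG, TARGET or INTERACTION.
    """
    total = 0.0
    for y, y_hat in zip(gold, pred):
        gold_items = {triplet[position] for triplet in y}
        pred_items = {triplet[position] for triplet in y_hat}
        # Assumption: a document without ground-truth entities contributes zero.
        total += len(gold_items & pred_items) / len(gold_items) if gold_items else 0.0
    return total / len(gold)
```

Unlike the triplet-level metrics, this gives partial credit: a prediction with the correct drug and target but the wrong interaction still raises the drug and target accuracies.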
Appendix D Study on Regularization Techniques
We explore various combinations of dropout and label smoothing based on Transformer + PubMedBERT-attn. The F1 scores on the validation set are reported in Table 7. The best result is achieved when dropout and label smoothing are set as and respectively. However, the training F1 score is , which is significantly higher than the validation F1 score. We kept increasing dropout and label smoothing and found that the validation performance could not be further improved. This shows that basic techniques for improving generalization (i.e., dropout and label smoothing) bring only limited improvement, and that more effective regularization techniques are needed for the DTI triplet extraction task.
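For readers unfamiliar with label smoothing, the following Python sketch shows one common formulation for a single output token, in which the one-hot target is mixed with a uniform distribution over the vocabulary. It is a generic illustration, not the training code behind Table 7; the function name and the uniform-over-all-classes variant are assumptions.

```python
import math


def label_smoothed_nll(log_probs, target, epsilon):
    """Label-smoothed negative log-likelihood for one token.

    log_probs: log-probabilities over the vocabulary (list of floats).
    target: index of the ground-truth token.
    epsilon: smoothing weight in [0, 1]; 0 recovers the standard NLL.
    """
    vocab_size = len(log_probs)
    nll = -log_probs[target]
    # Cross-entropy against a uniform distribution over the whole vocabulary.
    uniform = -sum(log_probs) / vocab_size
    return (1.0 - epsilon) * nll + epsilon * uniform
```

With `epsilon = 0` the loss reduces to the usual negative log-likelihood; larger values penalize over-confident predictions, which is why label smoothing is often paired with dropout as a basic regularizer.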
Appendix E Main Results with Standard Deviation
Figure 2 presents the standard deviation of results on DrugBank and TTD. On DrugBank, the standard deviation of each model is around . On TTD, the standard deviation scores are smaller and usually less than .
Appendix F Case Study
We perform a case study to investigate whether a model trained on the KD-DTI dataset is able to discover unseen Drug-Target-Interaction triplets and handle unseen papers. To this end, we train a generative model on KD-DTI and make predictions on unseen samples from another dataset, ChemProt, which contains human annotations of chemical-protein relations (a kind of drug-target interaction). We use Transformer + PubMedBERT-attn for the case study.
As shown in Table 8, in the first case, the entire triplet is correctly extracted, even though both the drug and the whole triplet are unseen in KD-DTI. In the second case, we successfully extract one of the two annotated triplets. We attribute the missed second triplet to the irregular format of the drug “C” and the presence of distracting items, such as “P” and “M”.
|Title: Assessment of the abuse liability of ABT-288, a novel histamine H3 receptor antagonist.|
|Abstract: RATIONALE: Histamine H3 receptor antagonists, such as ABT-288, have been shown to possess cognitive-enhancing and wakefulness-promoting effects. On the surface, this might suggest that H3 antagonists possess psychomotor stimulant-like effects and, as such, may have the potential for abuse. OBJECTIVES: The aim of the present study was to further characterize whether ABT-288 possesses stimulant-like properties and whether its pharmacology gives rise to abuse liability. METHODS: The locomotor-stimulant effects of ABT-288 were measured in mice and rats, and potential development of sensitization was addressed. Drug discrimination was used to assess amphetamine-like stimulus properties, and drug self-administration was used to evaluate reinforcing effects of ABT-288. The potential development of physical dependence was also studied. RESULTS: ABT-288 lacked locomotor-stimulant effects in both rats and mice. Repeated administration of ABT-288 did not result in cross-sensitization to the stimulant effects of d-amphetamine in mice, suggesting that there is little overlap in circuitries upon which the two drugs interact for motor activity. ABT-288 did not produce amphetamine-like discriminative stimulus effects in drug discrimination studies nor was it self-administered by rats trained to self-administer cocaine. There were no signs of physical dependence upon termination of repeated administration of ABT-288 for 30 days. CONCLUSIONS: The sum of these preclinical data, the first of their kind applied to H3 antagonists, indicates that ABT-288 is unlikely to possess a high potential for abuse in the human population and suggests that H3 antagonists, as a class, are similar in this regard.|
|Prediction: (Drug: “ABT-288”, Target: “Histamine H3 receptor”, Interaction: “antagonist”)|
|Annotation: (Drug: “ABT-288”, Target: “Histamine H3 receptor”, Interaction: “antagonist”)|
|Title: Mechanisms of Glucose Lowering of Dipeptidyl Peptidase-4 Inhibitor Sitagliptin When Used Alone or With Metformin in Type 2 Diabetes: A double-tracer study.|
|Abstract: OBJECTIVE To assess glucose-lowering mechanisms of sitagliptin (S), metformin (M), and the two combined (M+S). RESEARCH DESIGN AND METHODS We randomized 16 patients with type 2 diabetes mellitus (T2DM) to four 6-week treatments with placebo (P), M, S, and M+S. After each period, subjects received a 6-h meal tolerance test (MTT) with [(14)C]glucose to calculate glucose kinetics. Fasting plasma glucose (FPG), fasting plasma insulin, C-peptide (insulin secretory rate [ISR]), fasting plasma glucagon, and bioactive glucagon-like peptide (GLP-1) and gastrointestinal insulinotropic peptide (GIP) was measured. RESULTS FPG decreased from P, 160 ± 4 to M, 150 ± 4; S, 154 ± 4; and M+S, 125 ± 3 mg/dL. Mean post-MTT PG decreased from P, 207 ± 5 to M, 191 ± 4; S, 195 ± 4; and M+S, 161 ± 3 mg/dL (P < 0.01]. The increase in mean post-MTT plasma insulin and in ISR was similar in P, M, and S and slightly greater in M+S. Fasting plasma glucagon was equal (65-75 pg/mL) with all treatments, but there was a significant drop during the initial 120 min with S 24% and M+S 34% (both P < 0.05) vs. P 17% and M 16%. Fasting and mean post-MTT plasma bioactive GLP-1 were higher (P < 0.01) after S and M+S vs. M and P. Basal endogenous glucose production (EGP) fell from P 2.0 ± 0.1 to S 1.8 ± 0.1 mg/kg min, M 1.8 ± 0.2 mg/kg min [both P < 0.05 vs. P), and M+S 1.5 ± 0.1 mg/kg min (P < 0.01 vs. P). Although the EGP slope of decline was faster in M and M+S vs. S, all had comparable greater post-MTT EGP inhibition vs. P (P < 0.05). CONCLUSIONS M+S combined produce additive effects to 1) reduce FPG and postmeal PG, 2) augment GLP-1 secretion and β-cell function, 3) decrease plasma glucagon, and 4) inhibit fasting and postmeal EGP compared with M or S monotherapy.|
|Prediction: (Drug: “Sitagliptin”, Target: “Dipeptidyl Peptidase 4”, Interaction: “inhibitor”)|
|Annotation: (Drug: “Sitagliptin”, Target: “Dipeptidyl Peptidase 4”, Interaction: “inhibitor”), (Drug: “C”, Target: “C-peptide”, Interaction: “part of”)|
Appendix G Broader Impact
We propose a new dataset for biomedical knowledge discovery. We believe that this dataset can speed up research in bioNLP and machine learning. As for negative impact, if automatic knowledge discovery succeeds, it might cause some unemployment among the related researchers and engineers.
Appendix H Distribution of Interactions
In Figure 3, we present detailed statistics of the interactions included in the corpus.