Self-Attention Based Molecule Representation for Predicting Drug-Target Interaction

08/15/2019 ∙ by Bonggun Shin, et al. ∙ Emory University ∙ Dankook University

Predicting drug-target interactions (DTI) is an essential part of the drug discovery process, which is expensive in terms of both time and cost. Reducing the cost of DTI prediction could therefore lead to reduced healthcare costs for patients. In addition, a precisely learned molecule representation in a DTI model could contribute to developing personalized medicine, which would help many patient cohorts. In this paper, we propose a new molecule representation based on the self-attention mechanism, and a new DTI model using our molecule representation. The experiments show that our DTI model outperforms the state of the art by up to 4.9% points in terms of area under the precision-recall curve. Moreover, a study using the DrugBank database shows that our model effectively lists all known drugs targeting a specific cancer biomarker in the top-30 candidate list.




1 Introduction

Many diseases are caused by abnormal protein levels; therefore, a drug is designed to target particular proteins. However, a drug may not work well for a substantial portion of patients, because an individual's response to a drug varies depending on genetic inheritance (wang2008pharmacogenomics). Unfortunately, pharmaceutical companies focus only on a majority cohort of patients, as drug discovery is an expensive process. Reducing the cost of the drug discovery process will not only make drugs cheaper, resulting in reduced healthcare costs for patients, but can also allow companies to develop personalized drugs based on genetics.

Among the many parts of the drug discovery process, predicting drug-target interactions (DTI) is an essential one. Identifying DTIs is difficult and costly, as experimental assays are both time-consuming and expensive. Furthermore, fewer than 10% of the proposed DTIs are accepted as new drugs (he2017simboost). Therefore, in silico (performed on a computer) DTI predictions are in high demand, since they can expedite the drug development process by systematically and promptly suggesting a new set of candidate molecules, which can save time and reduce the cost of the whole process by up to 43% (dimasi2016innovation).

In response to this demand, three types of in silico DTI prediction methods have been proposed in the literature: molecular docking, similarity-based, and deep learning-based models. Molecular docking (trott2010autodock; luo2016molecular) is a simulation-based method using the 3D structural features of molecules and proteins. Although it can provide an intuitive visual interpretation, the required 3D structures are difficult to obtain and the method cannot scale to large datasets. To mitigate these problems, two similarity-based methods, KronRLS (pahikkala2014toward) and SimBoost (he2017simboost), have been proposed using efficient machine learning methods. However, using a similarity matrix has two downsides. First, the feature representation is limited to the similarity space, thereby ignoring the rich information embedded in the molecule sequence. For example, if a brand new molecule is tested, the model will represent it using relatively unrelated (dissimilar) molecules, which would make the prediction inaccurate. Second, it necessitates the calculation of the similarity matrix, which can limit the maximum number of molecules in the training process. To overcome these limitations, a deep learning-based DTI model, DeepDTA (ozturk2018deepdta), was proposed. It is an end-to-end convolutional neural network (CNN)-based model that eliminates the need for feature engineering. The model automatically finds useful features from raw molecule and protein sequences. Its success has been demonstrated on two publicly available DTI benchmarks. Although this work illustrated the potential of a deep learning-based model, there are several areas for improvement:

  • CNNs cannot model potential relationships among distant atoms in a raw molecule sequence. For example, with three CNN layers, each with a filter size of 12, the model can only capture associations between atoms up to 35 positions apart in a sequence. We posit that the recently proposed self-attention mechanism can capture relationships among any atoms in a sequence, and thereby provide a better molecule representation.

  • The one-hot encoding used to represent each molecule fails to take advantage of existing chemical structure knowledge. An abundance of chemical compounds is available in the PubChem database (10.1093/nar/gky1033), from which we can extract useful chemical structures for pre-training the molecule representation network.

  • Fine-tuning is a type of transfer learning where weights trained in one network are transferred to another so that they can be adjusted to a new dataset. Thus, we can transfer the weights learned from the PubChem database to our DTI model. This will help our model use the learned knowledge of chemical structure while tailoring it to predicting drug-target interactions.

With these observations, we propose a new deep DTI model, Molecule Transformer DTI (MT-DTI), based on a new molecule representation. We use a self-attention mechanism to learn the high-dimensional structure of a molecule from a given raw sequence. Our self-attention mechanism, Molecule Transformer (MT), is pre-trained on publicly available chemical compounds (PubChem database) to learn the complex structure of a molecule. This pre-training is important because most datasets available for DTI training have only about 2,000 molecules, while the data for pre-training (PubChem database) contains 97 million molecules. Although it contains only molecules and no interaction data, our MT is able to learn chemical structure from it, which is effectively utilized when transferred to MT-DTI (our model). Therefore, we transfer this trained molecule representation to our DTI model so that it can be fine-tuned with a DTI dataset. The proposed DTI model is evaluated on two well-known benchmark DTI datasets, Kiba (tang2014making) and Davis (davis2011comprehensive), and outperforms the current state of the art (SOTA) model by 4.9% points for Kiba and 1.6% points for Davis in terms of area under the precision-recall curve. Additionally, we demonstrate the usefulness of our trained model using a known drug list targeting a specific protein. The trained model ranks all FDA-approved drugs highly in the drug candidate list. The demonstrated effectiveness of the proposed model can help reduce the cost of drug discovery. Furthermore, precise molecule representation can enable drugs to be designed for specific genotypes and potentially enable personalized medicine.

Technical Significance

We propose a novel molecule representation, adapting the self-attention mechanism that was recently proposed in the natural language processing (NLP) literature. This is inspired by the idea that understanding a molecule sequence for a chemist is analogous to understanding a language for a person. We introduce a new way to train the molecule representation model to fit the DTI problem using an existing corpus to achieve a more robust representation. With this (pre)trained molecule representation, we fine-tune the proposed DTI model and achieve new SOTA performance on two public DTI benchmarks.

(Footnote 1) The demo is publicly available at

Clinical Relevance

With our new model, we can potentially lower medication costs for patients, which can help make drugs more affordable and help patients be more adherent. In addition, this can serve as the stepping stone for designing personalized medication. Through the proper representation of molecules and proteins, we can better understand the properties of patients that make a drug helpful or not (quinn2017molecular).

2 Methods

Figure 1: The Proposed DTI Model Architecture. Inputs are molecule (SMILES) and protein (FASTA) and the regression output is the affinity score between these two inputs.

We introduce a new drug-target interaction (DTI) model and a new molecule representation in this section. The basic motivation of the proposed model is that the structure of molecule sequences has been shown to be very similar to the structure of natural language sentences, in that the contextual and structural information of atoms is important for understanding the characteristics of a molecule (jastrzkebski2016learning). Specifically, each atom interacts not only with neighboring atoms but also with distant ones in a simplified molecular-input line-entry system (SMILES) sequence, a notation that encodes the molecular structure of chemicals. However, the current SOTA method using CNNs cannot relate distant atoms when representing a molecule. We overcome this using the self-attention mechanism. We first describe the proposed MT-DTI model architecture (Figure 1) with its input and output representation. We then elaborate on each of the three main building blocks of our MT-DTI model: the character-embedded Transformer layers (Molecule Transformers, Figure 2, Section 2.2), the character-embedded Protein CNN layers (Protein CNNs, Figure 3, Section 2.3), and the dense layers that model interactions between a drug and a protein (Interaction Denses, Figure 4, Section 2.4). Then, we explain the process for pre-training the Molecule Transformers (MT) (Section 2.2.4).

2.1 Model Architecture

The MT-DTI model takes two inputs: a molecule represented by a SMILES (weininger1988smiles) sequence and a protein represented by a FASTA (lipman1985rapid) sequence. A molecule represented as a SMILES sequence is composed of characters representing atoms or structure indicators. Mathematically, a molecule is represented as $I^M = (m_1, m_2, \dots, m_{N^M})$, where $m_i$ could be either an atom or a structure indicator, and $N^M$ is the sequence length, which varies depending on the molecule. This molecule sequence is fed into the Molecule Transformers (Section 2.2) to produce a molecule encoding, $O^M \in \mathbb{R}^{K^M}$. The other type of input, a protein with a FASTA sequence, also consists of characters, each denoting an amino acid. A formal protein representation is $I^P = (p_1, p_2, \dots, p_{N^P})$, where $p_i$ is one of the amino acids, and $N^P$ is the sequence length, which varies depending on the protein. This protein sequence is the input of the Protein CNNs (Section 2.3), which generate a protein encoding, $O^P \in \mathbb{R}^{K^P}$. Note that the encoding vector dimensions $K^M$ and $K^P$ are model parameters. Both encodings, $O^M$ and $O^P$, are together fed into the multi-layered feed-forward network, Interaction Denses (Section 2.4), followed by the last regression layer, which predicts the binding affinity scores.

Figure 2: Molecule Trans.
Figure 3: Protein CNNs.
Figure 4: Interaction Denses.

2.2 Molecule Transformers

Molecule Transformers (Figure 2) are multi-layered bidirectional Transformer encoders based on the original Transformer model (vaswani2017attention). The Transformer can model a sequence by itself without using a recurrent neural network (RNN) or CNN. Unlike these previous sequence processing layers (RNN or CNN), the Transformer can effectively encode the relationships among distant tokens (atoms) in a sequence. This powerful context modeling enables many Transformer-based NLP models to outperform previous methods on many benchmarks (vaswani2017attention; devlin2018bert). Molecule Transformers are a modification of the existing Transformer, BERT (devlin2018bert), that better represents a molecule by changing the cost function. Before plugging it into the proposed model (Figure 1), we pre-train it using a modified version of the masked language model task introduced in the BERT model (devlin2018bert). Each Transformer block consists of a self-attention layer and a feed-forward layer, and it takes embedding vectors as input. Therefore, the first Transformer block needs to convert an input sequence into vectors using the input embedding.

2.2.1 Input Embedding

The input to the Molecule Transformers is the sum of the token embeddings and the position embeddings. The token embeddings are similar to word embeddings (mikolov2013distributed), in that each token, $m_i$, is represented by a molecule token embedding (MTE) vector, $e_i$. These vectors are stored in a trainable weight matrix $W^{MTE} \in \mathbb{R}^{V^M \times D^M}$, where $V^M$ is the size of the SMILES vocabulary and $D^M$ is the molecule embedding size. An MTE vector by itself is not sufficient to represent a molecule sequence with a self-attention network, because self-attention does not consider the sequence order when calculating the attention weights, unlike other attention mechanisms. Therefore, we add a trainable positional embedding (PE), $W^{PE} \in \mathbb{R}^{L^M \times D^M}$ (please refer to (devlin2018bert) for more details), to the MTE vectors to form the final input representation, $x_i = e_i + \mathrm{PE}_i$, where $L^M$ is the maximum length of a molecule sequence, which is set to 100 in this study. This process is illustrated in Figure 5.

Figure 5: An example of molecule token embedding (MTE) and positional embedding (PE) to make the model input for a given molecule sequence of methyl isocyanate (CN=C=O).

We add five special tokens to the SMILES vocabulary to make a raw molecule sequence compatible with our model. [PAD] is for dummy padding to ensure the sequence has a fixed length. [REP] is a representation token that is used when fine-tuning the Transformer in the proposed MT-DTI model. [BEGIN] and [END] indicate the beginning and end of the sequence. These tokens are useful for the model when dealing with a sequence longer than the maximum length (100 in this study): when such a sequence is truncated on both sides, the absence of the [BEGIN]/[END] tokens serves as an effective indicator of truncation. The fifth token, [MASK], is used only for pre-training (Section 2.2.4). Methyl isocyanate (CN=C=O), for example, can be represented with 9 tokens: the six characters of its SMILES string (C, N, =, C, =, O) together with [REP], [BEGIN], and [END]. Each token is transformed into a corresponding vector by referencing MTE and PE.
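As a rough illustration of the tokenization described in this section, the following Python sketch builds a fixed-length token sequence from a SMILES string. The function name `tokenize_smiles`, the placement of [REP] at the front of the sequence, and the choice to reserve one slot for [REP] while taking a centered window of the remaining tokens are assumptions made for this example; the text does not spell out these details.

```python
MAX_LEN = 100  # maximum molecule sequence length reported in the paper
PAD, REP, BEGIN, END = "[PAD]", "[REP]", "[BEGIN]", "[END]"


def tokenize_smiles(smiles, max_len=MAX_LEN):
    """Character-level SMILES tokenization with special tokens (illustrative)."""
    # [BEGIN] and [END] wrap the raw characters.
    body = [BEGIN] + list(smiles) + [END]
    budget = max_len - 1  # assumption: one slot is reserved for [REP]
    if len(body) > budget:
        # Keep a centered window. Tokens cut from the head and tail are lost,
        # so a missing [BEGIN] or [END] signals that truncation happened.
        start = (len(body) - budget) // 2
        body = body[start:start + budget]
    tokens = [REP] + body  # assumption: [REP] comes first, like BERT's [CLS]
    # Pad on the right so every sequence has exactly max_len tokens.
    return tokens + [PAD] * (max_len - len(tokens))


tokens = tokenize_smiles("CN=C=O")
print(tokens[:9])
```

For methyl isocyanate this yields the nine non-padding tokens mentioned above, followed by padding up to the fixed length.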

2.2.2 Self-Attention Layer

These transformed input vectors, $X$, are now compatible with the input of a self-attention layer. Each self-attention layer is controlled by a query vector ($Q$), key vector ($K$), and value vector ($V$), where $Q = XW^Q$, $K = XW^K$, and $V = XW^V$, all of which are different projections of the input ($X$) using trainable weights, $W^Q$, $W^K$, and $W^V$, shown correspondingly in Figure 6. Then, the attention weights are computed as:

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^\top}{\sqrt{d_k}}\right)V$$

where $d_k$ is the dimension of the key (one of the $d$'s in Figure 6). Thus, the learned relationship between the atoms can span the entire sequence via the self-attention weights.
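To make the attention computation concrete, here is a minimal pure-Python sketch of single-head scaled dot-product self-attention in the standard form of (vaswani2017attention). The helper names (`softmax`, `matmul`, `self_attention`) and the toy input are illustrative assumptions, not the authors' implementation, which would use a tensor library.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def self_attention(x, w_q, w_k, w_v):
    """Return (output, attention weights) for one attention head."""
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    d_k = len(k[0])  # dimension of the key vectors
    k_t = [list(col) for col in zip(*k)]  # transpose of k
    scores = matmul(q, k_t)  # one score per (query token, key token) pair
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, v), weights


# Toy input: 3 tokens with 2-dimensional embeddings, identity projections.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
output, weights = self_attention(x, identity, identity, identity)
print(weights)
```

Every row of `weights` sums to one and covers all tokens in the sequence, which is why a token can attend to any other token regardless of distance.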

Figure 6: Query, Key, and Value.
Figure 7: Self-Attention Heads.
Figure 8: Projected Output.

2.2.3 Feed-Forward Layer

Similar to multiple filters in convolutional networks, a Transformer can have multiple sets of attention weights, called multi-head attention. If a model has $H$-head attention, then it will have $Z_h = \text{Attention}(Q_h, K_h, V_h)$, where $h = 1, \dots, H$. These $H$ attention matrices, $Z_1, \dots, Z_H$, are then concatenated (shown on the left of Figure 8) and projected using $W^O$ (shown in the middle of Figure 8) to form the final output of a Transformer, $Z$ (shown on the right of Figure 8).

2.2.4 Pre-training

We adopt one of the pre-training tasks of BERT (devlin2018bert), the Masked Language Model. Since the structure of molecule sequences has been shown to be very similar to the structure of natural language sentences (jastrzkebski2016learning) and there are abundant training examples, we hypothesize that predicting masked tokens is an effective way of learning a chemical structure. We adopt a special token, [MASK], for this task. It replaces a small portion of tokens so that the task of the pre-training model is to predict the original tokens. We choose 15% of the SMILES tokens at random for each molecule sequence, and replace each chosen token with the special token [MASK] with a probability of 0.8. For the other 20% of the time, we either replace the chosen token with a random SMILES token or preserve the chosen token, with equal probability (since the [MASK] token does not exist when testing, we need to occasionally feed irrelevant tokens when training). The target label of the task is the chosen token together with its index. For example, one possible prediction task for methyl isocyanate (CN=C=O) is to replace the N at the second position with [MASK] and have the model predict N at that position.
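The masking procedure can be sketched in Python as follows. This is an illustrative approximation under stated assumptions: each token is selected independently with probability 0.15 (rather than sampling exactly 15% of positions), special tokens are never selected, and the names `mask_tokens` and `SPECIAL` are hypothetical.

```python
import random

MASK = "[MASK]"
SPECIAL = {"[PAD]", "[REP]", "[BEGIN]", "[END]"}  # assumption: never masked


def mask_tokens(tokens, vocab, rng, select_prob=0.15):
    """Return (model inputs, labels); labels are None for unselected positions."""
    inputs = list(tokens)
    labels = [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if tok in SPECIAL or rng.random() >= select_prob:
            continue  # position is not part of the prediction task
        labels[i] = tok  # target: the original token at this index
        roll = rng.random()
        if roll < 0.8:
            inputs[i] = MASK  # 80%: replace with [MASK]
        elif roll < 0.9:
            inputs[i] = rng.choice(vocab)  # 10%: replace with a random SMILES token
        # remaining 10%: keep the original token unchanged
    return inputs, labels


rng = random.Random(0)
smiles_vocab = ["C", "N", "O", "=", "(", ")"]  # toy vocabulary for the example
inputs, labels = mask_tokens(["[REP]", "[BEGIN]"] + list("CN=C=O") + ["[END]"], smiles_vocab, rng)
print(inputs)
print(labels)
```

Passing an explicit `random.Random` instance keeps the sketch reproducible; a training pipeline would typically draw fresh masks for every epoch.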

2.2.5 Fine-tuning

The weights of the pre-trained Transformers (Section 2.2.4) are used to initialize the Molecule Transformers in the proposed MT-DTI model (Figure 1). The output of the Transformers is a set of vectors whose size is equal to the number of tokens. To obtain a molecule representation as a fixed-length vector, we utilize the vector of the special token [REP] in the final layer. This vector conveys the comprehensive bidirectional encoding information for a given molecule sequence, denoted as $O^M$.

2.3 Protein CNNs

Another type of input to the proposed MT-DTI model is a protein sequence. We modified the protein feature extraction module introduced by (ozturk2018deepdta) by adding an embedding layer for the input (adding an embedding layer slightly improves the accuracy of the DTI model). It consists of multi-layer CNNs with an embedding layer to make a sparse protein sequence continuous, and a pooling layer to represent a protein as a fixed-size vector. For a given protein sequence, $I^P = (p_1, p_2, \dots, p_{N^P})$, each protein token, $p_i$, is converted to a protein embedding vector by referencing trainable weights, $W^{PTE} \in \mathbb{R}^{V^P \times D^P}$, where $V^P$ is the size of the FASTA vocabulary and $D^P$ is the protein embedding size. Let $X^P \in \mathbb{R}^{L^P \times D^P}$ be a matrix representing the input protein, where $L^P$ is the maximum length of a protein sequence, which is set to 1000 in this study. This protein matrix is fed into the first convolutional layer and convolved by the weights $c_1 \in \mathbb{R}^{s_1 \times D^P}$, where $s_1$ is the length of the filter. This operation is repeated $f_1$ times with the same filter length. Then this first convolution layer produces an output $PC^1 \in \mathbb{R}^{(L^P - s_1 + 1) \times f_1}$, where elements in $PC^1$ convey the $s_1$-gram features across the sequence. Multiple convolutional layers can be stacked on top of the previous output of the convolutional layer. After $J$ convolution layers, the final output, $PC^J$, is fed into the max pooling layer. This max pooling layer selects the most salient features from the vectors produced by the filters from the last layer. Then, the output of this max pooling layer is a vector $O^P \in \mathbb{R}^{f_J}$.
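The convolution and max pooling steps can be illustrated with a small pure-Python sketch. It shows a single convolution layer without bias or activation, followed by max pooling over the sequence; the function names `conv1d_valid` and `protein_features` and the toy numbers are hypothetical, and the actual model stacks several such layers on top of a learned embedding.

```python
def conv1d_valid(x, filt):
    """Slide one filter over the sequence.

    x is a list of L embedding rows (each of size D); filt is a list of
    s rows (each of size D). Returns L - s + 1 values, one per window.
    """
    s, d = len(filt), len(filt[0])
    out = []
    for start in range(len(x) - s + 1):
        acc = sum(x[start + j][c] * filt[j][c] for j in range(s) for c in range(d))
        out.append(acc)
    return out


def protein_features(x, filters):
    """Apply every filter, then keep the maximum response of each one."""
    return [max(conv1d_valid(x, filt)) for filt in filters]


# Toy example: sequence length 4, embedding size 1, two filters of length 2.
embedded = [[1.0], [2.0], [3.0], [4.0]]
filters = [[[1.0], [1.0]], [[1.0], [-1.0]]]
print(protein_features(embedded, filters))
```

The output has one entry per filter, so its length is fixed by the number of filters and does not depend on the protein length.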


2.4 Interaction Denses

A molecule representation ($O^M$, Section 2.2.5) and a protein representation ($O^P$, Section 2.3) are concatenated to create the input of Interaction Denses, $[O^M; O^P]$. Interaction Denses approximates the affinity score through a multi-layered feed-forward network with dropout regularization. The final layer is a regression layer associated with the regression task for the proposed MT-DTI model. The weights of the network are then optimized according to the mean squared error between the network output ($\hat{y}$) and the actual affinity values ($y$).

3 Experiments

3.1 Datasets

3.1.1 Drug-Target Interaction

Dataset   # of Compounds   # of Proteins   # of Interactions   TRN      DEV      TST
DAVIS     68               442             30,056              20,037   5,009    5,010
KIBA      2,111            229             118,254             78,836   19,709   19,709
Table 1: Statistics of the Davis and Kiba datasets. TRN/DEV/TST: training, development, and test sets.

The proposed MT-DTI model is evaluated on two benchmarks, Kiba (tang2014making) and Davis (davis2011comprehensive), because they have been used for evaluation in previous drug-target interaction studies (pahikkala2014toward; he2017simboost; ozturk2018deepdta). Davis is a dataset comprising large-scale biochemical selectivity assays for clinically relevant kinase inhibitors with their respective dissociation constant ($K_d$) values. The original $K_d$ values are transformed into log space, $pK_d$, for numerical stability, as suggested by (he2017simboost), as follows:

$$pK_d = -\log_{10}\left(\frac{K_d}{10^9}\right)$$

While Davis measures bioactivity from a single type of score, $K_d$, Kiba combines heterogeneous scores, $K_i$, $K_d$, and $IC_{50}$, by optimizing consistency among them. SimBoost (he2017simboost) filtered out proteins and compounds with fewer than 10 interactions for computational efficiency, and we follow this procedure for a fair comparison. The numbers of compounds, proteins, and interactions in the two datasets are summarized in Table 1. To facilitate comparison and reproducibility, we used the same 5-fold cross-validation sets with a held-out test set, which are publicly available.
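As a small worked example of the log transform used for Davis in the baseline literature, $pK_d = -\log_{10}(K_d / 10^9)$ with $K_d$ given in nM, the sketch below converts a few values. The function name `kd_to_pkd` and the sample inputs are assumptions for illustration.

```python
import math


def kd_to_pkd(kd_nm):
    """Convert a dissociation constant in nM to pKd = -log10(Kd / 1e9)."""
    if kd_nm <= 0:
        raise ValueError("Kd must be positive")
    return -math.log10(kd_nm / 1e9)


for kd in (10000.0, 100.0, 1.0):
    print(kd, kd_to_pkd(kd))
```

A weaker binder with a $K_d$ of 10,000 nM maps to a $pK_d$ of about 5, while tighter binders with smaller $K_d$ values map to larger $pK_d$ values.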

3.1.2 Pre-training Dataset

We downloaded the chemical compound information from the PubChem database (10.1093/nar/gky1033). Only canonical SMILES information was used to maintain consistency of representation. A total of 97,092,853 molecules are available in the canonical SMILES format.

3.1.3 DrugBank Database

The DrugBank database is a bioinformatics and cheminformatics resource that provides known drug-target interaction pairs. To assess the effectiveness of the drug candidates generated by our model, we designed a case study (Section 4.1) using this database. We extracted 1,794 drugs from the database, excluding any compounds that were used when training our model. These selected compounds were fed, along with a specific protein, to the model trained on the Kiba dataset to generate the corresponding Kiba scores. The scores were used to find the best candidate drugs targeting that protein.

3.2 Training Details

Molecule Transformer is first trained with the collected compounds from the PubChem database (Section 3.1.2), and then the trained Transformer is plugged into the MT-DTI model for fine-tuning.

3.2.1 Pre-training

We use 97 million molecules for pre-training. Before feeding them to the Molecule Transformer, we tokenize each molecule at the character level. If the length of a molecule is more than 100, we truncate its head and tail together to have a fixed size of 100. We choose the middle part of the longer sequence so that the model can easily distinguish truncated sequences by simply checking for the presence of the [BEGIN] and [END] tokens. The network structure of the Molecule Transformer is as follows. The number of layers is 8, the number of heads is 8, the hidden vector size is 128, the intermediate vector size is 512, the dropout rate is 0.1, and the activation is GELU (hendrycks2016bridging). These parameters are picked from preliminary experiments and the hyperparameters used in the NLP model, BERT (devlin2018bert). We hypothesized that learning a chemical structure might be a roughly 2-4 times easier task than learning a language model, because the size of the SMILES vocabulary is much smaller than that of natural languages (70 vs. 30K). Although the SMILES vocabulary is about 400 times smaller, the number of tokens in the PubChem molecule dataset is about 2.4 times larger than what BERT used for pre-training (8B vs. 3.3B). This indicates that the molecules might have more complexity than expected when only considering the size of the vocabulary. Therefore, we used parameters that were 2-4 times smaller than BERT's. We note that there may be other parameter sets that can yield even better performance. We use a batch size of 512 and a maximum token size of 100, which enables the model to process about 50K tokens in one batch. Considering that the average length of a compound sequence is around 80, there are approximately 8 billion tokens in the training corpus. We pre-train the Molecule Transformer for 6.4M steps, which is equivalent to 40 epochs (8B/50K*40=6.4M). With an 8-core TPU machine, the pre-training took about 58 hours. The final accuracy of the Masked LM task was about 0.9727, which is comparable to the 0.9855 achieved by BERT on natural language.
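The step count quoted above follows from simple arithmetic, reproduced in the sketch below. The helper name `pretraining_steps` is hypothetical; the rounded figures (8B tokens, 50K tokens per batch) are the ones used in the text, while the unrounded products are shown for comparison.

```python
def pretraining_steps(total_tokens, tokens_per_batch, epochs):
    """Number of optimizer steps needed to see the corpus `epochs` times."""
    return total_tokens // tokens_per_batch * epochs


# Unrounded quantities behind the approximations in the text.
corpus_tokens = 97_092_853 * 80   # molecules * average sequence length
batch_tokens = 512 * 100          # batch size * maximum token size
print(corpus_tokens, batch_tokens)

# Rounded figures used in the text: 8B tokens, 50K tokens per batch, 40 epochs.
print(pretraining_steps(8_000_000_000, 50_000, 40))
```

The unrounded values (about 7.8 billion tokens and 51,200 tokens per batch) are close to the rounded ones, so the 6.4M figure is an estimate rather than an exact count.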

3.2.2 Fine-tuning

The specifications of the Molecule Transformer in the MT-DTI model are the same as those used for pre-training (Section 3.2.1). The Protein CNNs (Section 2.3) consist of one embedding layer, three CNN layers, and one max pooling layer. The embedding layer uses 128-dimensional vectors. For the CNN blocks, we denote the filter size as $s$ and the number of filters as $f$. The final model parameter settings of CNNs are and . The max pooling layer selects the best token representations from the last CNN layer, which gives a feature length of 96. Interaction Denses (Section 2.4) is composed of three feed-forward layers and one regression layer. The layer sizes, when training on Kiba, are 1024, 1024, and 512, in order from the feature input to the regression layer, and the learning rate is 0.0001. We reduced the network complexity when training on Davis due to the small number of training samples: we use two feed-forward layers of sizes 1024 and 512, and the learning rate is adjusted to 0.001. The entire network uses the same dropout rate of 0.1. All the hyperparameters are tuned based on the lowest mean squared error on the development sets for each fold, and the final score is evaluated on the held-out test set with the model at 1000 epochs.

3.3 Evaluation Metrics

We use four metrics to evaluate the proposed model: mean squared error (MSE), concordance index (CI) (gonen2005concordance), $r_m^2$, and area under the precision-recall curve (AUPR). MSE is a typical loss used in the optimizer. CI is the probability that the predicted scores of two randomly chosen drug-target pairs, $b_i$ and $b_j$, are in the correct order:

$$CI = \frac{1}{Z} \sum_{\delta_i > \delta_j} h(b_i - b_j)$$

where $\delta_i$ and $\delta_j$ are the true affinities of the two pairs, $Z$ is a normalization constant (the number of data pairs), and $h(x)$ is a step function (ozturk2018deepdta):

$$h(x) = \begin{cases} 1 & \text{if } x > 0 \\ 0.5 & \text{if } x = 0 \\ 0 & \text{if } x < 0 \end{cases}$$

The $r_m^2$ (pratim2009two; roy2013some) index is a metric for quantitative structure-activity relationship (QSAR) models. Mathematically,

$$r_m^2 = r^2 \left(1 - \sqrt{r^2 - r_0^2}\right)$$

where $r^2$ and $r_0^2$ are the squared correlation coefficients with and without intercept, respectively. An acceptable model should produce an $r_m^2$ value greater than 0.5. Since AUPR is a metric for binary classification, we transform the regression scores to binary labels using known threshold values (he2017simboost; tang2014making). For Davis, pairs with $pK_d \geq 7$ are marked as binding (1), others as no binding (0), and for Kiba, pairs with a Kiba score $\geq 12.1$ are marked as binding (1), others as no binding (0).
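The concordance index can be computed directly from its definition, as in the sketch below. It assumes the common formulation in which every ordered pair with a strictly larger true affinity is counted, correctly ordered predictions score 1, and tied predictions score 0.5; the function name `concordance_index` is illustrative, and the quadratic loop is meant for clarity rather than speed.

```python
def concordance_index(y_true, y_pred):
    """Fraction of comparable pairs whose predictions are in the correct order."""
    concordant = 0.0
    comparable = 0  # normalization constant: pairs with different true affinities
    for i in range(len(y_true)):
        for j in range(len(y_true)):
            if y_true[i] > y_true[j]:
                comparable += 1
                diff = y_pred[i] - y_pred[j]
                if diff > 0:
                    concordant += 1.0   # correctly ordered
                elif diff == 0:
                    concordant += 0.5   # tie in predictions
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


print(concordance_index([5.0, 6.0, 7.0], [0.1, 0.4, 0.3]))
```

A perfect ranking gives 1, a fully reversed ranking gives 0, and a constant predictor gives 0.5.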

3.4 Baselines

For the baseline methods, two similarity-based models and one deep learning-based model, the current SOTA, are tested. One of the similarity-based models is KronRLS (pahikkala2014toward), whose goal is to minimize a typical squared error loss function with a special regularization term. The regularization term is given as a norm of the prediction model, which is associated with a symmetric similarity measure. The other similarity-based model is SimBoost (he2017simboost), which is based on a gradient boosting machine. SimBoost utilizes many kinds of engineered features, such as network metrics, neighbor statistics, PageRank (page1999pagerank) scores, and latent vectors from matrix factorization. The last one is a deep learning model called DeepDTA (ozturk2018deepdta), which is the SOTA method for predicting drug-target interactions. It is an end-to-end model that takes a pair of sequences, (molecule, protein), and directly predicts affinity scores. Features are automatically captured through backpropagation in the multi-layered convolutional neural networks.

3.5 Results

Dataset   Method            CI (std)        MSE     r_m^2 (std)     AUPR (std)
Kiba      KronRLS           0.782 (0.001)   0.411   0.342 (0.001)   0.635 (0.004)
          SimBoost          0.836 (0.001)   0.222   0.629 (0.007)   0.760 (0.003)
          DeepDTA           0.863 (0.002)   0.194   0.673 (0.009)   0.788 (0.004)
          MT-DTI (w/o FT)   0.844 (0.001)   0.220   0.584 (0.002)   0.789 (0.004)
          MT-DTI            0.882 (0.001)   0.152   0.738 (0.006)   0.837 (0.003)
Davis     KronRLS           0.871 (0.001)   0.379   0.407 (0.005)   0.661 (0.010)
          SimBoost          0.872 (0.002)   0.282   0.644 (0.006)   0.709 (0.008)
          DeepDTA           0.878 (0.004)   0.261   0.630 (0.017)   0.714 (0.010)
          MT-DTI (w/o FT)   0.875 (0.001)   0.268   0.633 (0.013)   0.700 (0.011)
          MT-DTI            0.887 (0.003)   0.245   0.665 (0.014)   0.730 (0.014)
Table 2: Test set results of the proposed MT-DTI model, the MT-DTI model without fine-tuning (denoted as MT-DTI (w/o FT)), and other existing approaches.

The comparisons of our proposed MT-DTI model to the previous approaches are shown in Table 2. Reported scores are measured on the held-out test set using five models trained with the five different training sets. The best model parameters are selected based on the development set scores. MT-DTI outperforms all the other methods on all four metrics. The performance improvement is more noticeable when more training data are available: the improvements on Kiba are 0.019, 0.042, 0.065, and 0.049, compared with improvements on Davis of 0.009, 0.016, 0.035, and 0.016, for CI, MSE, $r_m^2$, and AUPR, respectively. Furthermore, our model tends to be more stable with a larger training set, with the lowest standard deviation for CI and AUPR. Another interesting point is that our method without fine-tuning (the first MT-DTI row for each dataset in Table 2) produced competitive results. It outperforms the similarity-based methods and performs better than DeepDTA on some metrics. This suggests that the molecule representation obtained through pre-training learns some useful chemical structure that can be exploited by the Interaction Denses.
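The improvements can be recomputed from Table 2, assuming they are measured against DeepDTA, the strongest baseline. The sketch below copies the (CI, MSE, $r_m^2$, AUPR) values of DeepDTA and the fine-tuned MT-DTI from the table; the names `RESULTS` and `improvements` are hypothetical.

```python
# (CI, MSE, r_m^2, AUPR) copied from Table 2.
RESULTS = {
    "Kiba": {
        "DeepDTA": (0.863, 0.194, 0.673, 0.788),
        "MT-DTI": (0.882, 0.152, 0.738, 0.837),
    },
    "Davis": {
        "DeepDTA": (0.878, 0.261, 0.630, 0.714),
        "MT-DTI": (0.887, 0.245, 0.665, 0.730),
    },
}
LOWER_IS_BETTER = (False, True, False, False)  # only MSE should decrease


def improvements(dataset):
    """Absolute improvement of MT-DTI over DeepDTA for each metric."""
    base = RESULTS[dataset]["DeepDTA"]
    ours = RESULTS[dataset]["MT-DTI"]
    return tuple(
        round((b - o) if lower else (o - b), 3)
        for b, o, lower in zip(base, ours, LOWER_IS_BETTER)
    )


for name in RESULTS:
    print(name, improvements(name))
```

The AUPR gains of 0.049 on Kiba and 0.016 on Davis correspond to the 4.9% and 1.6% point improvements quoted in the introduction.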

4 Case Study

We performed a case study using actual FDA-approved drugs targeting a specific protein, epidermal growth factor receptor (EGFR). This protein was chosen because EGFR is one of the best-known genes related to many cancer types. We calculated the interaction scores between EGFR and the 1,794 selected molecules from the DrugBank database (see Section 3.1.3 for details). These scores are sorted in descending order and summarized in Table 3.

EGFR is a transmembrane protein that is activated by binding of ligands such as epidermal growth factor (EGF) and transforming growth factor alpha (TGFa) (herbst2004review). Mutations in the coding regions of the EGFR gene are associated with many cancers, including lung adenocarcinoma (sigismund2018emerging). Several tyrosine kinase inhibitors (TKIs) have been developed for the EGFR protein, including gefitinib, erlotinib, and afatinib. More recently, Osimertinib was developed as a third generation TKI targeting the T790M mutation in the exon of the EGFR gene (soria2018osimertinib). Since the direct binding of these drugs to EGFR protein is well known, we tested whether our proposed model can identify known drugs for the EGFR protein.

4.1 Biological Insights

The result indicated that our model successfully identified known EGFR targeted drugs as well as novel chemical compounds that were not reported for association with the EGFR protein. For example, the first and second generation TKIs, such as Erlotinib and Gefitinib, and Afatinib, respectively, were predicted to exhibit high affinity to the EGFR protein (Table 3). Lapatinib (medina2008lapatinib), which inhibits the tyrosine kinase activity associated with two oncogenes, EGFR and HER2/neu (human EGFR type 2), was predicted to exhibit the highest affinity. Osimertinib was also identified. Interestingly, chemical compounds targeting opioid receptors (naltrexone hydrochloride, nalbuphine hydrochloride, and oxycodone hydrochloride trihydrate) for pain relief, antihistamines (methdilazine hydrochloride and astemizole), antipsychotic medication for schizophrenia (Prolixin Enanthate and Asenapine), and corticosteroids for skin problems (Triamcinolone acetonide sodium phosphate, Oxymetazoline hydrochloride, Desonide) were predicted to be associated with EGFR. Among these chemical compounds, Astemizole was suggested as a promising compound when treated with known drugs for lung cancer patients (ellegaard2016repurposing; de2017combination). Therefore, further investigations of these chemicals may provide a new therapeutic strategy for lung cancer patients.

Ranking Compound ID Compound Name KIBA Score
1 208908 Lapatinib 14.002403
2 11557040 Lapatinib Ditosylate 13.811217
3 10184653 Afatinib 13.404812
4 16147 Triamcinolone Acetonide Sodium Phosphate 13.147043
5 5485201 Naltrexone Hydrochloride 13.114577
6 123631 Gefitinib 13.111686
7 60699 Topotecan Hydrochloride 13.108758
8 5360515 Naltrexone 13.065864
9 441351 Rocuronium Bromide 13.032806
10 6918543 Almitrine Mesylate 13.016999
11 176870 Erlotinib 12.885199
12 23422 Tubocurarine Chloride Pentahydrate 12.87076
13 6000 Tubocurarine 12.809549
14 11954379 Erlotinib Variant 12.782704
15 11954378 Erlotinib Hydrochloride 12.768639
16 3389 Prolixin Enanthate 12.737285
17 23724988 Oxycodone Hydrochloride Trihydrate 12.709352
18 14676 Methdilazine Hydrochloride 12.662965
19 5281065 Ibutilide Fumarate 12.650397
20 9869929 Avanafil 12.635439
21 60700 Topotecan 12.618897
22 5360733 Nalbuphine Hydrochloride 12.610958
23 5282487 Paroxetine Hydrochloride Hemihydrate 12.608804
24 66259 Oxymetazoline Hydrochloride 12.557486
25 5311066 Desonide 12.538858
26 2247 Astemizole 12.536284
27 11954293 Asenapine 12.534941
28 11304743 Riociguat 12.527533
29 82153 Flunisolide 12.527164
30 71496458 Osimertinib 12.507524
Table 3: Compound ranking based on the predicted Kiba scores when the target is the EGFR protein. All compounds are from the DrugBank database, excluding any compounds in the Kiba dataset. Known EGFR-targeting drugs in the list include Lapatinib, Afatinib, Gefitinib, Erlotinib, and Osimertinib.

5 Related Work

Predicting drug-target interactions has traditionally been treated as a binary classification problem (yamanishi2008prediction; bleakley2009supervised; van2011gaussian; cao2012large; gonen2012predicting; cobanoglu2013predicting; cao2014computational; ozturk2016comparative). The most recent approach tackling this binary classification problem is an interpretable deep learning-based model (gao2018interpretable). Although these methods show promising results on binary datasets, they simplify protein-ligand interactions by thresholding affinity values. In order to model these complex interactions, several methods have been proposed, which can be categorized into three kinds. The first category is molecular docking (trott2010autodock; luo2016molecular), which is a simulation-based method. These methods are not scalable, due to heavy preprocessing. To overcome this downside, the second category, similarity-based methods, was proposed. These are KronRLS (pahikkala2014toward) and SimBoost (he2017simboost), which are based on the calculation of a similarity matrix of the inputs. With the advent of deep learning, two deep learning-based methods have been proposed (gao2018interpretable; ozturk2018deepdta). Like these models, our model is also based on deep learning, but it has a better molecule representation and improves its performance through a transfer learning technique.

Deep learning-based transfer learning, that is, pre-training followed by fine-tuning, has been applied to various tasks such as computer vision (rothe2015dex; ghifary2016deep), NLP (howard2018universal), speech recognition (jaitly2012application; lu2013speech), and health-care applications (shin2017classification). The idea is to use appropriate pre-trained weights to improve results on the corresponding tasks, an effect that can also be seen in our experimental results.

6 Discussion

This paper proposes a new molecule representation using the self-attention mechanism, which is pre-trained using a large, publicly available collection of compounds. The trained parameters are transferred to our DTI model (MT-DTI) so that it can be fine-tuned using two DTI benchmark datasets. Experimental results show that our model outperforms all other existing methods with respect to four evaluation metrics. Moreover, the case study of finding drug candidates targeting a cancer protein (EGFR) shows that our method successfully lists all of the existing EGFR drugs among the top-30 promising candidates. This suggests our DTI model could potentially yield low-cost drugs and provide personalized medicines. Our model could be further improved if the proposed attention mechanism were also applied to represent proteins. However, we did not explore this direction for two reasons. One reason is that a protein sequence is on average ten times longer than a molecule sequence, which would require a considerable amount of computation time. The other reason is the need for a protein dataset that contains sufficient information to pre-train the model. Unfortunately, such a dataset is not readily available.

This work was supported by the National Science Foundation award IIS-#1838200, AWS Cloud Credits for Research program, and Google Cloud Platform research credits.