Attention-based Multi-Input Deep Learning Architecture for Biological Activity Prediction: An Application in EGFR Inhibitors

06/12/2019, by Huy Ngoc Pham et al.

Machine learning and deep learning have gained popularity and achieved immense success in drug discovery in recent decades. Historically, machine learning and deep learning models were trained on either structural data or chemical properties by separate models. In this study, we propose an architecture that trains on both types of data simultaneously in order to improve the overall performance. Given molecular structures in the form of SMILES notation and their labels, we generated the SMILES-based feature matrix and molecular descriptors. These data were trained on a deep learning model which was also integrated with the Attention mechanism to facilitate training and interpretation. Experiments showed that our model could raise performance: with a maximum MCC of 0.56 and an AUC of 91% on the EGFR inhibitors dataset, our architecture outperformed the reference model. We also successfully integrated the Attention mechanism into our model, which helped interpret the contribution of chemical structures to bioactivity.




I Introduction

Machine learning has been applied widely in drug discovery, especially in virtual screening for hit identification. The most popular techniques are Support Vector Machine (SVM), Decision Tree (DT), k-Nearest Neighbor (k-NN), the Naive Bayesian method (NB), and Artificial Neural Networks (ANN) [1]. Among these methods, an ANN need not assume any particular relationship between activity and molecular descriptors, and ANNs usually outperform in traditional quantitative structure–activity relationship (QSAR) problems because they can model both nonlinear and linear relationships. As a result, the ANN rose to become a robust tool for drug discovery and development [2]. However, ANNs are usually prone to overfitting, and it is difficult to design an optimal model. Additionally, ANNs require huge computational resources and their results are usually hard to interpret. These weaknesses could be a reason for the limited use of neural networks compared to Decision Tree or Naive Bayesian algorithms [2, 1].

The above algorithms can be applied to various types of chemical features, covering both structural and chemical properties. Structural information can be represented as a fingerprint vector using specific algorithms (e.g., Extended-Connectivity Fingerprints [3], Chemical Hashed Fingerprints [4]), while chemical information can be described by various molecular descriptors (e.g., logP, dipole moment). The idea of combining several types of features to improve overall performance has also been explored in a number of studies [5, 6]. In these models, each set of chemical features was trained by a specific algorithm (SVM, DT, k-NN, NB, ANN, or any other) to generate one particular output. These outputs were then fed into a second model, usually another multi-layer perceptron, before giving the final result. The disadvantage of this approach is that each feature set must be trained separately with completely different algorithms; as a result, the automation of the training procedure is reduced.

Regarding the interpretation of neural network models, there is growing interest in making them more explainable and interpretable. A noteworthy approach is the use of Class Activation Maps (CAM) by B. Zhou et al. [7] for image classification. By using the connection between the activation layer and class-specific weights, the CAM method can visualize the regions of the input which contributed to the classification. Besides that, Attention mechanisms are another method to improve the accuracy and interpretability of sequence-to-sequence models. With the encoder-decoder architecture, the attention approach not only improves performance but also reveals the alignment between input and output [8].

To address both automation and interpretation in predicting biological activity, we combined different types of chemical features in one deep learning architecture and also integrated an Attention mechanism. As a result, our model can train on several datasets concurrently and explain the interaction between the features and the outcomes.

II Background

II-A Overview of Neural Networks

II-A1 Artificial neural network

An artificial neural network is a computing architecture which enables computers to learn from historical data. Nowadays, it is one of the main tools used in machine learning. As the name suggests, artificial neural networks are inspired by how biological neurons work; in fact, however, an artificial neural network is a composition of many differentiable functions chained together. Mathematically, a neural network is a non-linear mapping which assumes the output variable y is a non-linear function of its input variables x:

y = f(x; θ) + ε

where θ is the parameters of the neural network and ε is the model's irreducible error.

A very simple neural network which contains only an input and an output layer is described as follows:

ŷ = act(∑_j w_j x_j + w_0)

where ŷ is an approximation of y.

As shown in Fig. 1, each input variable x_j is represented as a node in the input layer and connects to the output node through a connection with weight w_j. Note that a node with value 1 is attached to the input layer with the corresponding connection weight w_0 to represent the offset term. The function act(·) is called the activation function or squash function, which introduces a non-linear relationship between input and output. There are several activation functions such as sigmoid, tanh, and ReLU. ReLU is the most widely used activation function as it helps to overcome the vanishing gradient problem [9]. The model in Fig. 1 is referred to as a generalized linear model.

Fig. 1: Example of an ordinary neuron
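As a concrete illustration, the single-neuron model above can be sketched in a few lines of Python. This is only a sketch: the input values, weights, offset, and the choice of a sigmoid activation are arbitrary demonstration choices, not parameters from the paper.

```python
import math

def sigmoid(z):
    """Squash a real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def neuron(x, w, w0, act=sigmoid):
    """Generalized linear model: weighted sum of the inputs plus the
    offset term w0, passed through an activation function."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + w0
    return act(z)

# Two inputs with arbitrary demonstration weights
y_hat = neuron([1.0, 2.0], w=[0.5, -0.25], w0=0.1)
```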

The generalized linear model is simple and thus may not be able to describe complex relationships between inputs and outputs. Therefore, we can extend this architecture by building multiple generalized linear models in the form of layers (also called fully connected layers or hidden layers) and stacking those layers together to build a neural network. Fig. 2 illustrates a 2-layer neural network; we also add a bias neuron to the 2nd layer as we did for the generalized linear model above. Let h⁽¹⁾ and ŷ represent the hidden layer and the output layer respectively; the neural network in Fig. 2 can be described mathematically as follows:

h⁽¹⁾ = act(W⁽¹⁾x + b⁽¹⁾),  ŷ = act(W⁽²⁾h⁽¹⁾ + b⁽²⁾)

Note that a neural network can have as many layers as we want by stacking more layers. Neural networks containing more than one layer are usually called deep neural networks.

Fig. 2: Example of a 2-layer neural network

II-A2 Convolutional neural network

Convolutional neural networks (CNNs) [10] are an extraordinary member of the artificial neural network family; they are designed to operate on data with a grid-like topology. CNNs are the state-of-the-art architectures in computer-vision-related tasks. CNNs have also been applied to biological tasks and achieved remarkable results [11, 12].

Fig. 3: Example of a CNN for the Image Classification task on CIFAR10 dataset.

Basically, a CNN block is a combination of convolution layers followed by non-linear activation and pooling layers.

  • Convolutional layer (CONV): A convolutional layer is composed of a set of kernels to extract local features from the previous input. Each kernel is represented as a 3D tensor F_k ∈ ℝ^(w×w×D), where w is the size of the kernel (typically 3 or 5) and k = 1, …, K indexes the K kernels. Since D is equal to the input's third dimension, it is frequently omitted when referring to the kernel shape. For an input X, each kernel F_k convolves with X to attain a single feature map M^k where

    M^k_(i,j) = ⟨X̃_(i,j), F_k⟩

    where X̃_(i,j) is a small block (known as the receptive field) of X around location (i, j); the stride s is the interval between the receptive fields of neighboring units.

  • Pooling layer (POOL): A pooling layer creates a summary of the features learnt by the CONV layers by aggregating the information of nearby features into a single one. The most common design of the pooling operation is max-pooling. For example, a 2×2 max-pooling filter operating on a particular feature map M of size w×w will compute the maximum of each 2×2 block of M, resulting in a new feature map of size w/2 × w/2. Since a CNN typically contains multiple stacked CONV layers, pooling is used to reduce the data dimension, which makes the model less computationally expensive. Pooling can also make the model invariant to small positional and translational changes.

A typical CNN architecture is generally made up of a series of CNN blocks followed by one or more fully connected layers at the end. Fig. 3 illustrates a simple CNN architecture for an image classification problem.
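The CONV and POOL operations above can be made concrete with a minimal pure-Python sketch: a toy single-channel convolution with stride 1 and a non-overlapping 2×2 max-pool. The input values and the 1×2 kernel are arbitrary demonstration data.

```python
def conv2d(x, k):
    """Valid 2D convolution (cross-correlation) of matrix x with kernel k, stride 1."""
    n, m = len(x), len(x[0])
    kn, km = len(k), len(k[0])
    return [[sum(x[i + a][j + b] * k[a][b] for a in range(kn) for b in range(km))
             for j in range(m - km + 1)] for i in range(n - kn + 1)]

def maxpool2d(x, p=2):
    """Non-overlapping p x p max-pooling: keep the maximum of each block."""
    return [[max(x[i + a][j + b] for a in range(p) for b in range(p))
             for j in range(0, len(x[0]) - p + 1, p)]
            for i in range(0, len(x) - p + 1, p)]

img = [[1, 2, 0, 1],
       [3, 1, 1, 0],
       [0, 2, 2, 1],
       [1, 0, 1, 3]]
edge = [[1, -1]]               # toy 1x2 kernel detecting horizontal change
fmap = conv2d(img, edge)       # 4x3 feature map
pooled = maxpool2d(img, 2)     # 2x2 summary of the input
```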

II-A3 Training Neural Networks

The goal of learning is to minimize the loss function with respect to the network parameters θ. To do that, we need to find an estimate θ̂ for the parameters by solving an optimization problem of the form

θ̂ = argmin_θ (1/n) ∑_(i=1)^n L(f(x_i; θ), y_i)

where n is the number of instances in the training dataset and L is the loss function which measures the discrepancy between the model output and the ground truth. Because the optimization problem does not have a closed-form solution, the method of gradient descent is used. Firstly, the parameters are randomly initialized; then, at every iteration, the parameters are updated as follows

θ ← θ − α ∇_θ L

This process continues until some criterion is satisfied. Here, α is a constant called the learning rate or learning step, which is the amount by which the weights are updated during training. As can be seen above, the loss function is computed over all the examples, which is computationally expensive. In practice, we use a modified version of gradient descent called stochastic gradient descent; that is, we do not use the whole dataset for the gradient computation but a subset of the data called a mini-batch. Typically, a mini-batch contains from dozens to hundreds of samples depending on system memory. Since the neural network is a composition of multiple layers, the gradient with respect to all the parameters can be methodically computed using the chain rule of differentiation, also known as the back-propagation algorithm.
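The update rule above can be illustrated with mini-batch gradient descent on a one-parameter least-squares problem. This is a toy example: the data, learning rate, epoch count, and batch size are arbitrary, and the "network" is a single weight w.

```python
import random

def sgd_fit(data, lr=0.05, epochs=100, batch=4):
    """Fit y = w * x by minimizing mean squared error with mini-batch SGD."""
    w = 0.0
    rng = random.Random(0)
    for _ in range(epochs):
        rng.shuffle(data)                       # new mini-batch split each epoch
        for start in range(0, len(data), batch):
            mb = data[start:start + batch]
            # gradient of (1/m) * sum((w*x - y)^2) with respect to w
            grad = sum(2.0 * (w * x - y) * x for x, y in mb) / len(mb)
            w -= lr * grad                      # the gradient descent update
    return w

# Noise-free data generated from y = 2x; SGD should recover w close to 2
data = [(x, 2.0 * x) for x in (-3.0, -2.0, -1.0, -0.5, 0.5, 1.0, 2.0, 3.0)]
w_hat = sgd_fit(data)
```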

II-A4 Regularization

One of the major issues in training neural networks is overfitting. Overfitting happens when a network performs very well on the data it has been trained on but poorly on the test set it has never seen before. This phenomenon is due to the large number of parameters in the network. Regularization is able to regulate a network's activity to ensure the model actually learns the underlying mapping function rather than just memorizing the input and output. Two advanced regularizers are widely used in deep neural networks.

  • Dropout: During training, some weights in the network at a particular layer can become co-adapted, which may lead to overfitting. Dropout tackles this issue by randomly skipping some units (explicitly setting them to zero) with a probability p (usually 0.2 or 0.5). During inference, dropout is disabled and the weights are scaled by the retention probability [13].

  • Batch normalization: Recall that in regression analysis, one often standardizes the design matrix so that the features have zero mean and unit variance. This operation, called normalization, speeds up convergence and makes initialization easier. Batch normalization extends this procedure from the input layer to all of the hidden layers. During training, let x_1, …, x_m be the values of an activation across a mini-batch B; the batch norm layer calculates a normalized version x̂_i via:

    x̂_i = (x_i − μ_B) / √(σ²_B + ε)

    where μ_B and σ²_B are the mini-batch mean and variance respectively, and ε is a small constant added for numerical stability. To make it more versatile, a batch norm layer usually has two additional learnable parameters γ and β, which stand for the scale and shift factors, such that:

    y_i = γ x̂_i + β

    During inference, the mini-batch mean and variance are replaced by the population mean and variance, which are estimated during training [14].
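The two batch-norm equations above can be checked numerically with a small pure-Python sketch. It handles scalar activations only; γ and β default to the identity transform, and the sample mini-batch is arbitrary.

```python
import math

def batch_norm(xs, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize a mini-batch of scalar activations, then scale by gamma
    and shift by beta (the two learnable parameters)."""
    m = len(xs)
    mu = sum(xs) / m                                  # mini-batch mean
    var = sum((x - mu) ** 2 for x in xs) / m          # mini-batch variance
    return [gamma * (x - mu) / math.sqrt(var + eps) + beta for x in xs]

out = batch_norm([1.0, 2.0, 3.0, 4.0])  # roughly zero mean, unit variance
```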

II-A5 Attention mechanism

Neural networks can be considered "black box" algorithms since we do not know what happens inside them. The Attention mechanism enables us to visualize and interpret the activity of neural networks by allowing the network to look back at what it has passed through. This mechanism is motivated by how we humans pay visual attention to certain regions of an image or to important words in a sentence. In a neural network, we can simulate this behavior by using attention weights to express the importance of an element, such as a pixel in an image or a word in a sentence.

Attention mechanisms are applied widely and have now become standard in many tasks such as neural machine translation [15] and image captioning [16].
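A minimal sketch of the idea, assuming the common softmax formulation of attention: relevance scores are turned into normalized weights, which then form a weighted sum of the value vectors. The scores and values below are arbitrary demonstration data, not the mechanism used later in this paper.

```python
import math

def attention(scores, values):
    """Softmax the relevance scores into attention weights, then return
    the weights and the weighted sum (context) of the value vectors."""
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]               # sums to 1
    dim = len(values[0])
    context = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return weights, context

# Element 0 has the highest score, so it dominates the context vector
weights, ctx = attention([2.0, 0.0, 0.0],
                         [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])
```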

II-B Overview of EGFR

Epidermal Growth Factor Receptor (EGFR) is a member of the ErbB receptor family, which consists of 4 types: EGFR (HER1), HER2/neu, HER3 and HER4. They are located in the cell membrane with an intrinsic tyrosine kinase. The binding of ligands (TGF-α, amphiregulin, and others) to EGFR triggers the signal amplification and diversification which lead to cell proliferation, apoptosis, tumour cell mobility and angiogenesis. In some types of cancer (such as lung cancer), overexpression and constitutive activation cause dysregulation of the EGFR pathway, which activates the tumor process [17, 18].

For two decades, there has been a great deal of effort in studying this target to discover novel medicines. 3D-QSAR has been studied widely to analyze the molecular field of ligands, which reveals the relationship between various substituents on molecules and biological activity [19, 20, 21, 22, 23, 24]. Other methods have also been useful. R. Bathini et al. employed molecular docking and molecular mechanics with generalized Born surface area (MM/GBSA) to calculate the binding affinities of protein-ligand complexes [21]. G. Verma et al. conducted pharmacophore modeling in addition to 3D-QSAR to generate a new model which was used for screening novel inhibitors [24].

Regarding the application of machine learning techniques in EGFR inhibitor discovery, H. Singh et al. [25] used the Random Forest algorithm to classify EGFR inhibitors and non-inhibitors. In their study, the authors collected a set of diverse chemicals and their activities on EGFR. A model with high accuracy was trained and validated by 5-fold cross-validation (0.49 MCC and 0.89 AUC).

II-C Overview of Feature Sets

II-C1 SMILES Feature matrix

SMILES (Simplified Molecular Input Line Entry System) is a way to represent molecules in in silico studies. This method uses a strict and detailed set of rules to translate the molecular structure into a chemical notation which is both user-friendly and machine-friendly. In particular, the SMILES notation of a molecule is a string of characters specifying atoms, bonds, branches, cyclic structures, disconnected structures, and aromaticity [26].

Fig. 4: The steps to generate the feature matrix from chemical structure using SMILES notation

Based on this representation, M. Hirohara et al. developed a SMILES-based feature matrix to train a convolutional neural network model for predicting toxicity. Their model outperformed conventional approaches and performed comparably to the winner of the Tox21 challenge [27].

II-C2 Molecular descriptors

Molecular descriptors are terms that characterize a specific aspect of a molecule; they include substituent constants and whole-molecule descriptors [28]. The calculation of the former type derives from the differences caused by substituting functional groups onto the main core of a compound. Based on this approach, the latter are an expansion of the substituent constants. However, some whole-molecule descriptors are developed from totally new methods or based on physical experiments [29].

III Method

III-A Dataset

The dataset used in this study was collected by H. Singh et al. [25]. It contains 3492 compounds, each classified as an inhibitor or non-inhibitor of EGFR. A substance was assigned as an inhibitor if its IC50 was less than 10 nM. The dataset is imbalanced, with non-inhibitors outnumbering inhibitors. The information for each chemical includes ID, SMILES representation, and class (1 for inhibitor and 0 for non-inhibitor).

III-A1 SMILES Feature matrix generation

Based on the collected dataset, the chemical structure data in the form of SMILES notation were preprocessed and converted to canonical form, which is unique for each molecule, using the package rdkit [30]. In this study, the SMILES feature matrix generation method developed by M. Hirohara et al. [27] was used to encode the chemical notation. The maximum length of each input was 150, so input strings shorter than 150 were padded with zeros at the tail. In their method, for each character in the SMILES string, a 42-dimensional vector was computed. The first 21 features represent atom data and the last 21 features contain SMILES syntactic information.
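The shape of the resulting input can be illustrated with a simplified encoder. Note this is a hypothetical stand-in: a plain one-hot vocabulary replaces Hirohara et al.'s 21 atom features and 21 syntax features, and the character set below is illustrative, not the paper's.

```python
# Illustrative character set; NOT the feature scheme of Hirohara et al.
VOCAB = list("CNOScnos()=#123456[]@+-")
MAX_LEN = 150

def smiles_to_matrix(smiles):
    """Encode each SMILES character as a one-hot row, then pad the
    matrix with zero rows up to MAX_LEN rows."""
    matrix = []
    for ch in smiles[:MAX_LEN]:
        row = [0] * len(VOCAB)
        if ch in VOCAB:
            row[VOCAB.index(ch)] = 1
        matrix.append(row)
    matrix += [[0] * len(VOCAB)] * (MAX_LEN - len(matrix))
    return matrix

m = smiles_to_matrix("CC(=O)O")  # acetic acid: 7 characters, 143 zero-padding rows
```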


III-A2 Descriptor calculation

We used the package mordred, built by H. Moriwaki, Y. Tian, N. Kawashita et al. [31], to generate the molecular descriptor data. Because SMILES notation does not provide an exact 3D conformation, only the 2D descriptors were calculated, with a total of 1613 features. The generated data were preprocessed by removing meaningless features and variables that are the same across the whole dataset. A standard scaler was also used to normalize the molecular descriptor dataset. We used the packages numpy [32], pandas [33] and scikit-learn [34] for this process.
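This preprocessing step can be sketched without scikit-learn as follows. It is a toy version under stated assumptions: it drops zero-variance columns and applies standard scaling (zero mean, unit variance); the sample matrix is arbitrary.

```python
import math

def preprocess(rows):
    """Drop constant (zero-variance) columns, then standardize the rest
    to zero mean and unit variance, column by column."""
    cols = list(zip(*rows))
    kept = [c for c in cols if max(c) != min(c)]      # remove constant features
    scaled = []
    for c in kept:
        mu = sum(c) / len(c)
        sd = math.sqrt(sum((v - mu) ** 2 for v in c) / len(c))
        scaled.append([(v - mu) / sd for v in c])
    return [list(r) for r in zip(*scaled)]            # back to row-major

X = preprocess([[1.0, 5.0, 0.0],
                [2.0, 5.0, 2.0],
                [3.0, 5.0, 4.0]])   # middle column is constant and is dropped
```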

III-B Model architecture

Fig. 5: The Architecture of Attention based Multi-Input deep learning model

III-B1 Convolutional neural network (CNN) branch

The SMILES feature matrix flowed through 2 CNN blocks, each consisting of a 2D convolution layer, a batch normalization layer, a dropout layer and a max pooling layer, before being flattened and fully connected via 3 hidden linear layers. The detailed hyper-parameters of each layer are presented in Table I.

Layer    Hyper-parameter          Value
conv2d   No. of input channels    1
         No. of output channels   6
         Kernel size              (3, 3)
conv2d   No. of input channels    6
         No. of output channels   16
         Kernel size              (3, 3)
Dropout  Dropout rate             tuned (see Table III)
TABLE I: Hyper-parameters in the CNN branch

III-B2 Molecular Descriptors (MD) branch

The MD branch was used to train on the molecular descriptor data. There were 3 blocks of fully connected layers, each consisting of a fully connected layer, a batch normalization layer, and a dropout layer. The activation function used in both branches was the Rectified Linear Unit (ReLU).

III-B3 Concatenation

The two output vectors obtained from the CNN and MD branches are concatenated, fed through a fully connected layer, and squashed by the ReLU activation function. ReLU is chosen here not only for its robustness but also because it allows the coefficients to be arbitrarily large and positive, which enhances the attention mechanism in the following section.

Layer Hyper-parameter Value
Linear layer No. of neurons 512
Linear layer No. of neurons 128
Linear layer No. of neurons 64
TABLE II: Hyper-parameter in MD branch

III-B4 Attention mechanism

The idea of using the attention mechanism came from the fact that each atom in a chemical contributes differently to the drug's effect. In other words, we put attention, or weight, on the atoms in the chemical, which are represented by rows in the SMILES feature matrix. The larger the weight of an atom, the more that atom contributes to the drug's activity. By doing this, we can extract each component's weight for interpreting the results and for analysis.

Let a denote the vector generated from the concatenation step. The vector a is then used as the coefficients of a linear combination of the rows of the SMILES feature matrix. The vector obtained from this linear combination is concatenated with the attention weight vector and then fed through a linear layer and squashed by a sigmoid activation function to make the prediction, as in Fig. 5.
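A hedged sketch of this step, with hypothetical dimensions and weights: a 2-row, 2-column feature matrix stands in for the 150×42 SMILES matrix, and w_out, b_out are illustrative final-layer parameters, not values learned by the model.

```python
import math

def attend_and_predict(att, feat_matrix, w_out, b_out):
    """Linear combination of feature-matrix rows weighted by the attention
    vector, concatenated with the attention weights themselves, then a
    final linear layer squashed by a sigmoid."""
    dim = len(feat_matrix[0])
    context = [sum(a * row[i] for a, row in zip(att, feat_matrix))
               for i in range(dim)]
    combined = context + att                        # concatenation step
    z = sum(w * v for w, v in zip(w_out, combined)) + b_out
    return 1.0 / (1.0 + math.exp(-z))               # sigmoid prediction

p = attend_and_predict([0.9, 0.1],                  # attention over 2 "atoms"
                       [[1.0, 0.0], [0.0, 1.0]],    # toy 2x2 feature matrix
                       w_out=[1.0, 1.0, 0.0, 0.0], b_out=0.0)
```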

III-C Hyper-parameter tuning

In this study, the PyTorch platform [35] was used to implement our model, and 5-fold cross-validation was conducted to evaluate performance. In this method, the dataset was split into 5 parts. For each fold, the model was trained on 4 parts and tested on the remaining part. The choice of training set was permuted through all parts of the dataset, so the model was trained 5 times, and the average of the performance metrics was used for evaluation. The end point of the training step was determined by the early-stopping technique [36, 37]: for each fold, the model stopped training if the loss value increased continuously for 30 epochs.
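The 5-fold splitting procedure can be sketched as follows (index bookkeeping only; the seed and fold assignment are illustrative, not the exact split used in the study):

```python
import random

def kfold_indices(n, k=5, seed=0):
    """Shuffle the indices once, then yield (train, test) index lists
    for each of the k folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]    # k roughly equal parts
    for i in range(k):
        test = folds[i]
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, test

splits = list(kfold_indices(10, k=5))        # 5 folds over 10 samples
```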

The second column of Table III presents the hyper-parameters and the values considered in the tuning step. A grid search was conducted to determine the combination of hyper-parameters with the best performance. For the batch size, however, several suggested values were tested and the chosen one was the value which utilized the system most efficiently. Additionally, the classification threshold was determined by analysing the ROC plot and the Precision-Recall curve: the optimal threshold was the point nearest to the top-left of the ROC plot that also balanced Precision and Recall in the latter plot.

Hyper-parameter Value Optimal value
Batch size 32; 64; 128; 512 128
Dropout rate 0; 0.2; 0.5 0.5
Optimizer SGD; ADAM ADAM
Learning rate 1e-4; 1e-5; 1e-6 1e-5
Threshold 0.2; 0.5; 0.8 0.2
TABLE III: Hyper-parameter tuning

III-D Performance Evaluation

In order to assess the performance of each model, several metrics were calculated during the training and validation steps (Table IV).

Metrics Formulas
SENS    TP / (TP + FN)
SPEC    TN / (TN + FP)
ACC     (TP + TN) / (TP + TN + FP + FN)
MCC     (TP·TN − FP·FN) / √((TP+FP)(TP+FN)(TN+FP)(TN+FN))
AUC     the area under the ROC curve
  • TP: True Positive, FN: False Negative, TN: True Negative, FP: False Positive, MCC: The Matthews correlation coefficient.

TABLE IV: Performance Metrics

The ROC analysis and AUC are usually considered the most popular metrics for imbalanced datasets because they are not biased against the minority label [38, 39]. However, these metrics show the overall performance over the whole range of thresholds; in other words, ROC and AUC do not represent one particular classifier. In our study, MCC was the preferred criterion for evaluating model performance. This is because MCC considers all classes in the confusion matrix, whereas other metrics (e.g., accuracy or F1-score) do not fully use all four cells of the confusion matrix [40]. The remaining metrics were still useful for benchmarking.
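The MCC used here can be computed directly from the four confusion-matrix cells; a small sketch (the counts below are arbitrary demonstration values):

```python
import math

def mcc(tp, fp, tn, fn):
    """Matthews correlation coefficient: +1 perfect, 0 random, -1 inverted.
    Returns 0 when any marginal sum is zero (degenerate denominator)."""
    num = tp * tn - fp * fn
    den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return num / den if den else 0.0

score = mcc(tp=90, fp=10, tn=85, fn=15)  # a reasonably good classifier
```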

IV Results

IV-A Hyper-parameter optimization

Despite the imbalanced classes, the loss function of the model (binary cross-entropy) still converged on both the training set and the validation set at the end of each fold when cross-validating the CNN + MD + ATT model. The training batch size of 128 gave the best utilization of the GPU (Tesla K80). When comparing the two optimizers, we observed that Adaptive Moment Estimation (ADAM) showed better results than Stochastic Gradient Descent (SGD) in both running time and model performance.

The optimal combination of hyper-parameters is listed in the third column of Table III. These values were used to evaluate the performance of the three considered models.

IV-B Performance

Our architecture was trained on the EGFR dataset and evaluated by the cross-validation method described above. The final results are presented in Table V. When using structural information in the CNN branch only, the performance was slightly better than that of the H. Singh et al. model on all metrics except AUC. However, the CNN + MD and CNN + MD + ATT models outperformed the reference model. Comparing the two most important indicators, it can be seen that there was an improvement in both MCC and AUC. Additionally, when training with more branches (CNN + MD and CNN + MD + ATT), the running time was also reduced significantly, to around half.

Metrics  H. Singh et al.  CNN      CNN + MD  CNN + MD + ATT
SENS 69.89% 73.12% 73.52% 75.30%
SPEC 86.03% 86.03% 89.22% 88.75%
ACC 83.66% 84.17% 86.94% 86.80%
MCC 0.49 0.50 0.55 0.56
AUC 89.00% 87.01% 90.25% 90.76%
RT N.A 42 min 17 min 22 min
  • SENS: Sensitivity, SPEC: Specificity, ACC: Accuracy, MCC: The Matthews correlation coefficient, AUC: The Area under the ROC curve, RT: Running time.

  • CNN: using CNN branch only, CNN + MD: using both CNN and MD branch, CNN + MD + ATT: using both CNN and MD branch with Attention mechanism.

TABLE V: Performance Comparison

IV-C Attention mechanism

The Attention mechanism was successfully implemented in our architecture. From the attention weight vector, the weights corresponding to each atom in a molecule were extracted and used to indicate its contribution to the compound's activity. Using the visualization tools in the package rdkit, these attention weights can display the distribution of contributions over the molecular structure. Fig. 6 illustrates some examples from the model.

Fig. 6: Chemical structure interpretation using Attention weight

V Discussion

There are two major advantages of our architecture. The first strong point is the combination of both structural information and chemical attributes in a single learning model; this advancement made a significant improvement in both performance and automation. Another worthy innovation is the integration of the attention mechanism, which facilitates the interpretation of the model. In fact, the attention weights generated by the model help explain the contribution of each atom to the overall biological activity.

Compared to other efforts to make deep learning models more interpretable, our architecture has a computational advantage because it is easier to generate the SMILES feature matrix than with other algorithms. For example, Sanjoy Dey et al. [41] used the ECFP fingerprint algorithm to transform the chemical structure into a feature matrix. This method does not treat the molecule as a whole structure but calculates on each fragment of the chemical with a particular radius. Additionally, several calculations are needed to generate the features, including tuning the hyper-parameters of the algorithm (e.g., the radius of calculation).

Regarding our implementation of the attention mechanism, we observed that each atom in a substance was treated separately; as a result, the connections between atoms were not highlighted in our model, and the contribution of functional groups which contain many atoms (e.g., carbonyl, carboxylic) was not clearly illustrated. We propose a solution for this limitation: adding another branch to the architecture which embeds substructure patterns (e.g., Extended-Connectivity Fingerprints [3], Chemical Hashed Fingerprints [4]).

When considering the running time of the different models, it is clear that the longest running time was that of the model with only the CNN branch (42 min), while the more complicated models with more data, CNN + MD and CNN + MD + ATT, took only around half the time, 17 min and 22 min respectively. This could be because the SMILES feature matrix in the CNN model is sparse, so the model must train longer to reach convergence of the loss function. In the CNN + MD and CNN + MD + ATT models, the different input branches may complement each other; we suppose that information flows between the two branches, which facilitates the training stage and improves performance. Other studies which also used several types of data [5, 6] trained each model separately and did not use this information connection. This phenomenon might represent an advantage of our architecture.

In conclusion, the combination of different sources of features is definitely useful for bioactivity prediction, especially when using a deep learning model. The attention-based multi-input architecture we proposed achieved superior scores compared to the reference model. Additionally, the attention mechanism helps interpret the interaction between each element of the chemical structure and its activity.


  • [1] A. Lavecchia, “Machine-learning approaches in drug discovery: methods and applications,” Drug Discovery Today, vol. 20, pp. 318–331, mar 2015.
  • [2] D. A. Winkler, “Neural networks as robust tools in drug lead discovery and development,” Applied Biochemistry and Biotechnology - Part B Molecular Biotechnology, vol. 27, no. 2, pp. 139–167, 2004.
  • [3] D. Rogers and M. Hahn, “Extended-Connectivity Fingerprints,” Journal of Chemical Information and Modeling, vol. 50, pp. 742–754, may 2010.
  • [4] B. Al-Lazikani, “Chemical Hashed Fingerprint,” in Dictionary of Bioinformatics and Computational Biology, Chichester, UK: John Wiley & Sons, Ltd, oct 2004.
  • [5] C. P. Koch, A. M. Perna, M. Pillong, N. K. Todoroff, P. Wrede, G. Folkers, J. A. Hiss, and G. Schneider, “Scrutinizing MHC-I Binding Peptides and Their Limits of Variation,” PLoS Computational Biology, vol. 9, p. e1003088, jun 2013.
  • [6] C. P. Koch, A. M. Perna, S. Weissmüller, S. Bauer, M. Pillong, R. B. Baleeiro, M. Reutlinger, G. Folkers, P. Walden, P. Wrede, J. A. Hiss, Z. Waibler, and G. Schneider, “Exhaustive Proteome Mining for Functional MHC-I Ligands,” ACS Chemical Biology, vol. 8, pp. 1876–1881, sep 2013.
  • [7] B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba, “Learning Deep Features for Discriminative Localization,” pp. 1–9, 2018.

  • [8] D. Bahdanau, K. Cho, and Y. Bengio, “Neural Machine Translation by Jointly Learning to Align and Translate,” sep 2014.
  • [9] S. Hochreiter, “The vanishing gradient problem during learning recurrent neural nets and problem solutions,” Int. J. Uncertain. Fuzziness Knowl.-Based Syst., vol. 6, pp. 107–116, Apr. 1998.
  • [10] Y. LeCun and Y. Bengio, “The Handbook of Brain Theory and Neural Networks,” ch. Convolutio, pp. 255–258, Cambridge, MA, USA: MIT Press, 1998.
  • [11] E. Gawehn, J. A. Hiss, and G. Schneider, “Deep Learning in Drug Discovery,” Molecular Informatics, vol. 35, no. 1, pp. 3–14, 2016.
  • [12] H. Chen, O. Engkvist, Y. Wang, M. Olivecrona, and T. Blaschke, “The rise of deep learning in drug discovery,” Drug Discovery Today, vol. 23, no. 6, pp. 1241–1250, 2018.
  • [13] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: A simple way to prevent neural networks from overfitting,” Journal of Machine Learning Research, vol. 15, pp. 1929–1958, 2014.
  • [14] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” CoRR, vol. abs/1502.03167, 2015.
  • [15] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, “Attention is all you need,” CoRR, vol. abs/1706.03762, 2017.
  • [16] P. Anderson, X. He, C. Buehler, D. Teney, M. Johnson, S. Gould, and L. Zhang, “Bottom-up and top-down attention for image captioning and visual question answering,” in CVPR, 2018.
  • [17] G. V. Scagliotti, G. Selvaggi, S. Novello, and F. R. Hirsch, “The biology of epidermal growth factor receptor in lung cancer,” in Clinical Cancer Research, vol. 10, 2004.
  • [18] G. Lurje and H. J. Lenz, “EGFR signaling and drug discovery,” Oncology, vol. 77, no. 6, pp. 400–410, 2010.
  • [19] H. Assefa, S. Kamath, and J. K. Buolamwini, “3D-QSAR and docking studies on 4-anilinoquinazoline and 4-anilinoquinoline epidermal growth factor receptor (EGFR) tyrosine kinase inhibitors,” Journal of Computer-Aided Molecular Design, vol. 17, pp. 475–493, aug 2003.
  • [20] S. Kamath and J. K. Buolamwini, “Receptor-Guided Alignment-Based Comparative 3D-QSAR Studies of Benzylidene Malonitrile Tyrphostins as EGFR and HER-2 Kinase Inhibitors,” Journal of Medicinal Chemistry, vol. 46, pp. 4657–4668, oct 2003.
  • [21] R. Bathini, S. K. Sivan, S. Fatima, and V. Manga, “Molecular docking, MM/GBSA and 3D-QSAR studies on EGFR inhibitors,” Journal of Chemical Sciences, vol. 128, no. 7, pp. 1163–1173, 2016.
  • [22] M. Zhao, L. Wang, L. Zheng, M. Zhang, C. Qiu, Y. Zhang, D. Du, and B. Niu, “2D-QSAR and 3D-QSAR Analyses for EGFR Inhibitors,” BioMed Research International, vol. 2017, pp. 1–11, 2017.
  • [23] R. Ruslin, R. Amelia, Y. Yamin, S. Megantara, C. Wu, and M. Arba, “3D-QSAR, molecular docking, and dynamics simulation of quinazoline-phosphoramidate mustard conjugates as EGFR inhibitor,” Journal of Applied Pharmaceutical Science, vol. 9, no. 1, pp. 89–97, 2019.
  • [24] G. Verma, M. F. Khan, W. Akhtar, M. M. Alam, M. Akhter, O. Alam, S. M. Hasan, and M. Shaquiquzzaman, “Pharmacophore modeling, 3D-QSAR, docking and ADME prediction of quinazoline based EGFR inhibitors,” Arabian Journal of Chemistry, 2016.
  • [25] H. Singh, S. Singh, D. Singla, S. M. Agarwal, and G. P. Raghava, “QSAR based model for discriminating EGFR inhibitors and non-inhibitors using Random forest,” Biology Direct, vol. 10, no. 1, pp. 1–12, 2015.
  • [26] D. Weininger, “SMILES, a chemical language and information system. 1. Introduction to methodology and encoding rules,” Journal of Chemical Information and Modeling, vol. 28, pp. 31–36, feb 1988.
  • [27] M. Hirohara, Y. Saito, Y. Koda, K. Sato, and Y. Sakakibara, “Convolutional neural network based on SMILES representation of compounds for detecting chemical motif,” BMC Bioinformatics, vol. 19, p. 526, dec 2018.
  • [28] I. U. Of and A. Chemistry, “Glossary of terms used in dysmorphology,” Oxford Desk Reference - Clinical Genetics, vol. 69, no. 5, pp. 1137–1152, 2011.
  • [29] K. Roy, S. Kar, and R. N. Das, “Chemical Information and Descriptors,” in Understanding the Basics of QSAR for Applications in Pharmaceutical Sciences and Risk Assessment, pp. 47–80, Elsevier, 2015.
  • [30] G. Landrum, “RDKit: Open-source cheminformatics.”
  • [31] H. Moriwaki, Y. S. Tian, N. Kawashita, and T. Takagi, “Mordred: A molecular descriptor calculator,” Journal of Cheminformatics, vol. 10, no. 1, pp. 1–14, 2018.
  • [32] T. E. Oliphant, Guide to NumPy. USA: CreateSpace Independent Publishing Platform, 2nd ed., 2015.
  • [33] S. van der Walt, S. C. Colbert, and G. Varoquaux, “The NumPy Array: A Structure for Efficient Numerical Computation,” Computing in Science & Engineering, vol. 13, pp. 22–30, mar 2011.
  • [34] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay, “Scikit-learn: Machine learning in Python,” Journal of Machine Learning Research, vol. 12, pp. 2825–2830, 2011.
  • [35] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer, “Automatic differentiation in pytorch,” 2017.
  • [36] W. Finnoff, F. Hergert, and H. G. Zimmermann, “Improving model selection by nonconvergent methods,” Neural Networks, vol. 6, pp. 771–783, jan 1993.
  • [37] L. Prechelt, “Automatic early stopping using cross validation: quantifying the criteria,” Neural Networks, vol. 11, pp. 761–767, jun 1998.
  • [38] S. Kotsiantis, D. Kanellopoulos, and P. Pintelas, “Handling imbalanced datasets : A review,” International Transactions on Computer Science and Engineering, vol. 30, no. 1, pp. 25–36, 2006.
  • [39] X. Guo, Y. Yin, C. Dong, G. Yang, and G. Zhou, “On the Class Imbalance Problem,” pp. 192–201, 2009.
  • [40] D. Chicco, “Ten quick tips for machine learning in computational biology,” BioData Mining, vol. 10, no. 1, pp. 1–17, 2017.
  • [41] S. Dey, H. Luo, A. Fokoue, J. Hu, and P. Zhang, “Predicting adverse drug reactions through interpretable deep learning framework,” BMC Bioinformatics, vol. 19, no. Suppl 21, pp. 1–13, 2018.